Function Recovery based on Program Slicing

F. Lanubile and G. Visaggio
Dipartimento di Informatica, University of Bari, Italy
fax: +39-80-243196, email: [email protected]

Abstract¹

To make an existing program easier to understand and modify, we propose an expectation-driven model for reverse engineering which identifies and extracts two main kinds of components: environment-dependent operations and domain-dependent functionalities. A reference information model gives expectations of components and their interface data. We apply program slicing, first proposed by Weiser, to this function recovery problem by specializing the notion of program slice into a direct slice and a transform slice. The direct slice is an executable subset of the original program containing all the statements which directly contribute either to writing on an external sink or to reading from an external source. The transform slice is also an executable subset, including all the instructions which directly or indirectly contribute to transforming an external input into an external output.

1: Introduction

This paper proposes a method to locate two basic types of components in a business application system. The first type, environment-dependent components, depends on the technological environment which hosts a system and usually consists of basic operations on the database, report production, display of interface maps, or user-machine dialogue. The second type, domain-dependent components, characterizes a class of problems in the same application domain and typically consists of computational formulas or business rules. The ability to discern these two classes of components correctly is advantageous when dealing with adaptive maintenance or platform migration, which concern only those parts of a program affected by technological progress.

¹ This work has been partially supported by the Finalized Project on "Information Systems and Parallel Computation" of the Italian National Research Council (CNR) under grant no. 91.00930PF69.

Moreover, the flexibility of a system is increased if those components which are influenced by the characteristics of the application domain can be easily isolated. We use a modified version of program slicing [11] to segment an unknown program into meaningful functionalities. Environment-dependent and domain-dependent modules are characterized differently as regards their data flow boundaries, and the slicing techniques differ accordingly. The techniques are viewed as part of a process model for the reverse engineering of data-oriented applications. In [1] a method is shown to recover the conceptual data model and to trace entity attributes to data fields in the source code. Detailed knowledge of the external inputs and outputs of a program not only increases the application domain knowledge but also gives useful clues for investigating the procedural part of the code. Our slicing techniques exploit the knowledge of the input/output data coming from the reverse engineering of data. This paper, starting from the lessons learned from experimenting with component extraction ([2], [3]), overcomes some previous limitations and specifies the segmentation techniques in a formal way.

2: A reverse engineering process for data-oriented applications

A business system is typically data-oriented because most of its tasks are related to manipulating, retrieving and reporting data from a database. In business systems, data contain a large part of the knowledge needed to understand both the application domain model (at the conceptual level) and the software model (at the implementation level). Our process model of reverse engineering uses the results of the data recovery phase to speed up the understanding of the procedural code.

2.1: Data recovery phase

A reference information model is selected so that it can be applied to all information systems belonging to the same class of problems. Sources of information come from published literature, norms and laws in force, or from other models of information systems of the same application domain. The reference information model is expressed in terms of entities, hierarchies of entities and meaningful relationships between entities, by using entity-relationship diagrams. By analyzing the declarations of files, reports and I/O maps, we distinguish among:
• conceptual data: data which can be associated with an entity or relationship of the reference information model;
• control data: flags used to control the program logic;
• structure data: fields used to build data structures dependent on the programming environment.

The definition of a reference model provides the reverse engineer with a template to classify, as part of an entity or relationship, all the conceptual data encountered in the declaration part of the code. Data which can be derived from other primitive conceptual data are recognized as derived data, and the computation formula or the business rule is recorded in the data dictionary. Recognizing derived data is important because they generate expectations about the existence of transform functions in the source code. On the other hand, the modeling of data in terms of entities and relationships suggests expectations of components containing basic operations such as creation, reading, updating, and deletion.

2.2: Function recovery phase

The next phase is to separate the program into distinct components, each of which performs a single function. We recover three basic categories of functional components: source modules, sink modules and transform modules. A source module, shown in figure 1a, obtains information from external sources of data such as files on auxiliary storage, keyboard or communication lines. Typical names are "Read_Next_Record", "Find_Element", "Get_Line", "Obtain_Transaction_Details", or "Receive_Packet". Analogously a sink module, shown in figure 1b, puts data on external targets such as files, video displays, printers or communication lines. Usual names are "Add_Record", "Update_Record", "Display_Map", "Print_Report", "Transmit_Packet". A transform module, shown in figure 1c, is a component whose main purpose is to transform data into some other form. At activation time the input data are ready to be processed, and when the execution finishes the output data are returned. Computational modules, such as "Compute_Square_Root" or "Compute_Instalment_Amount", fall into this category, as do modules which embody business rules in the form of decision trees, such as "Validate_Check" or "Verify_Loan_Request".

[Figure 1 diagram: three panels, (a) SOURCE, (b) SINK and (c) TRANSFORM, with data flows labelled EXTINP, EXTOUT, x and y.]

Figure 1. The three kinds of recovered components

Source and sink modules are two examples of environment-dependent modules, i.e. components which could be reused in other application domains which share the same technological environment. On the other hand, transform modules are typically domain-dependent, because functional requirements are usually modelled in terms of inputs, outputs, and a description of the mapping rule. When deriving initial structural designs from the requirements specification, using the transform analysis of Structured Design [14], transform modules correspond to the processes in the data flow diagrams.

The data recovery phase provides knowledge about the external data. So what we know about a functionality is its external input (EXTINP), its external output (EXTOUT), or both. The extraction criteria for locating functionalities in the source code differ depending on whether the component to be recovered is a source/sink module or a transform module. In the first case the reverse engineer wants to capture all those statements which directly prepare the data and execute the input (source module) or output (sink module) statements. For a transform module, instead, the aim is to take only those statements which yield the output data, both directly and indirectly, starting from the given input data. These statements do not include the input/output statements necessary to acquire the external data.

When the component has been located, it should be extracted and packaged as a separately compilable subroutine ready for test and use. The new component will be provided with the minimal amount of code necessary for its execution: headings, parameter list, data declarations, comments, etc. The original program should be replaced by the complement of the extracted component, which consists of the original program minus the extracted component itself. The complement acts as a main module which calls the packaged component with a list of parameters. Some statements could be duplicated if both the component and its complement depend on the same computation. The complement should be further decomposed, and the decomposition may continue until all expected functions are recovered. An extracted component should also be decomposed if it is too complex to be managed as a chunk. In this case, the decomposition starts from the component and proceeds until all the expected subfunctions are obtained.

3: Extraction criteria

In order to extract our components we need a decomposition method which uses both data flow and control flow analysis. Program slicing, introduced by Weiser in [9], is based on the observation that we are often interested in only a portion of the program behavior, as in debugging and modification tasks. In [10] an experiment was conducted whose results confirm the hypothesis that programmers implicitly use slices when debugging unknown programs. Program understanding is difficult because slices are often scattered through the entire program, which makes it hard to locate bugs. To automate this capacity of projection, the behavior of interest is expressed in the form of a slicing criterion, that is, a statement paired with a set of variables. Program slicing finds those portions of source code which might directly or indirectly affect the values of the variables at the given instruction. Initially proposed for program debugging and parallel processing [11], program slicing has later been applied to other activities such as module integration [5], cohesion evaluation [7], testing [4] and modification [4].

Our goal is to apply program slicing to reverse engineering, but some modification is needed to include only those statements which characterize source, sink and transform modules. In fact, although the direct application of Weiser's slicing excludes irrelevant functionalities, the chunks resulting from large programs are often too big to help program understanding. In the next subsections we present some basic definitions, the original Weiser's algorithm and our slicing algorithms.

3.1: Basic definitions

Definition 1: A digraph G is a tuple (N, E), where N is a set of nodes and $E \subseteq N \times N$ is a set of edges. Given an edge (m, n), m is said to be an immediate predecessor of n, or $m \in IMP(n)$, while n is said to be an immediate successor of m, or $n \in IMS(m)$. A path p from m to n of length k is an ordered set of nodes $(n_0, \ldots, n_k)$ such that $n_0 = m$, $n_k = n$, and $\forall i,\ 0 \le i \le k-1 : (n_i, n_{i+1}) \in E$.

Definition 2: A flowgraph FG is a triple $(N, E, n_0)$, where (N, E) is a digraph and $n_0 \in N$ is such that $\forall n \in N$ there is a path $(n_0, \ldots, n)$. $n_0$ is called the initial node. Given $m, n \in N$, m dominates n, or $m \in DOM(n)$, if m is on every path $(n_0, \ldots, n)$. Given $m, n \in N$, m is the nearest dominator of n, or $m = NDOM(n)$, if $m \in DOM(n)$ and $\forall d \in DOM(n) : d \in DOM(m)$.
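As an illustration of Definition 2, the following Python sketch computes DOM by simple fixpoint iteration and then picks the nearest dominator. It is only a sketch under stated assumptions: the function names, the edge-list representation and the small diamond-shaped graph are hypothetical, and the nearest dominator is searched among the dominators of n other than n itself, which is one reading of the definition.

```python
def dominators(nodes, edges, n0):
    """Return DOM as a dict: DOM[n] is the set of nodes on every path n0..n."""
    preds = {n: set() for n in nodes}
    for m, n in edges:
        preds[n].add(m)
    dom = {n: set(nodes) for n in nodes}   # start from "everything dominates"
    dom[n0] = {n0}
    changed = True
    while changed:                         # shrink the sets until stable
        changed = False
        for n in nodes:
            if n == n0:
                continue
            new = set(nodes)
            for p in preds[n]:
                new &= dom[p]              # common dominators of all predecessors
            new |= {n}                     # every node dominates itself
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def nearest_dominator(dom, n):
    """Assumption: NDOM(n) is chosen among the dominators of n other than n."""
    strict = dom[n] - {n}
    for m in strict:
        if strict <= dom[m]:               # every other dominator also dominates m
            return m
    return None                            # the initial node has no nearest dominator


# Hypothetical flowgraph: 1 branches to 2 and 3, which join at 4, followed by 5.
NODES = [1, 2, 3, 4, 5]
EDGES = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
DOM = dominators(NODES, EDGES, 1)
print(DOM[4], nearest_dominator(DOM, 5))   # {1, 4} 4
```

The same routine applied to the reversed edges, starting from the final node, gives the inverse dominators used in the definitions that follow.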

Definition 3: A hammock graph HG is a quadruple $(N, E, n_0, n_e)$, where $(N, E, n_0)$ and $(N, E^{-1}, n_e)$ are both flowgraphs. $n_e$ is called the final node. Given $m, n \in N$, m inverse dominates n, or $m \in IDOM(n)$, if m is on every path $(n, \ldots, n_e)$. Given $m, n \in N$, m is the nearest inverse dominator of n, or $m = NIDOM(n)$, if $m \in IDOM(n)$ and $\forall d \in IDOM(n) : d \in IDOM(m)$.

Definition 4: A one entry/one exit program P can be modeled as a hammock graph, whose nodes are the program statements, whose edges are given by the control flow, whose initial node is the entry point, and whose final node is the exit point. Given $n \in N$, the statements conditioned by n, or INFL(n), are the set of nodes which are on a path $(n, \ldots, NIDOM(n))$, excluding the endpoints n and NIDOM(n). When IMS(n) contains only one element or is empty, then $INFL(n) = \emptyset$.

Definition 5: Let V be the set of variables in a program P. $REF(n) \subseteq V$ are the variables used at instruction (node) n without altering their previous value. $DEF(n) \subseteq V$ are the variables modified at instruction (node) n, i.e. whose value is changed as an effect of the instruction execution.
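Definitions 3 and 4 can be exercised on a small hammock graph. The Python sketch below is an illustration, not part of the original method description: the helper names and the six-node graph (a loop test at node 2 enclosing an IF at node 3) are hypothetical, and the nearest inverse dominator is again searched among the inverse dominators of n other than n itself.

```python
def inverse_dominators(nodes, edges, ne):
    """Return (IMS, IDOM): IDOM[n] is the set of nodes on every path n..ne."""
    succs = {n: set() for n in nodes}
    for m, n in edges:
        succs[m].add(n)
    idom = {n: set(nodes) for n in nodes}
    idom[ne] = {ne}
    changed = True
    while changed:                          # same iteration as DOM, on E^-1
        changed = False
        for n in nodes:
            if n == ne:
                continue
            new = set(nodes)
            for s in succs[n]:
                new &= idom[s]
            new |= {n}
            if new != idom[n]:
                idom[n] = new
                changed = True
    return succs, idom


def nearest_inverse_dominator(idom, n):
    strict = idom[n] - {n}                  # assumption: n itself is excluded
    for m in strict:
        if strict <= idom[m]:
            return m
    return None


def infl(nodes, edges, ne, n):
    """INFL(n): nodes on a path from n to NIDOM(n), endpoints excluded."""
    succs, idom = inverse_dominators(nodes, edges, ne)
    if len(succs[n]) <= 1:                  # not a conditional statement
        return set()
    stop = nearest_inverse_dominator(idom, n)
    seen, stack = set(), list(succs[n])
    while stack:                            # walk forward until NIDOM(n)
        x = stack.pop()
        if x in seen or x == stop or x == n:
            continue
        seen.add(x)
        stack.extend(succs[x])
    return seen


# Hypothetical program: 2 is a loop test, 3 an IF guarding 4, 5 closes the loop.
NODES = [1, 2, 3, 4, 5, 6]
EDGES = [(1, 2), (2, 3), (2, 6), (3, 4), (3, 5), (4, 5), (5, 2)]
print(infl(NODES, EDGES, 6, 2))   # {3, 4, 5}
print(infl(NODES, EDGES, 6, 3))   # {4}
```

In this toy graph the loop test conditions the whole loop body, while the inner IF conditions only the guarded statement.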

3.2: Weiser's slice

Given a slicing criterion $C = \langle i, V \rangle$, where i is a statement and V is a subset of the program variables, a slice $S_C$ is an executable subset of P containing all the statements which contribute to the values of V just before statement i is executed. The construction of $S_C$ is defined recursively on the sets of variables and statements which have either direct or indirect influence on V. Starting from zero, the superscripts represent the level of recursion. Some definitions differ in style from those in [11], because we use the more compact notation that already appeared in [6].

Definition 6:

$$R_C^0(n) = \{ v \in V \mid n = i \} \cup \{ REF(n) \mid DEF(n) \cap R_C^0(IMS(n)) \neq \emptyset \} \cup \{ R_C^0(IMS(n)) - DEF(n) \}$$

$$S_C^0 = \{ n \mid DEF(n) \cap R_C^0(IMS(n)) \neq \emptyset \}$$

$R_C^0(n)$ represents the relevant variables with potential effects on V when program execution is at instruction n. The search starts from instruction i. The first subset expresses the base case. The second says that variables which are used to assign values to other variables, already marked as relevant, become relevant. The third excludes a relevant variable when it has been modified. $S_C^0$ includes those statements whose execution can directly influence the values of relevant variables for each instruction of the program.

Definition 7:

$$B_C^0 = \{ b \mid INFL(b) \cap S_C^0 \neq \emptyset \}$$

$B_C^0$ includes all those conditional statements which control the execution of the statements in $S_C^0$. Given a conditional statement b, a branch statement criterion can be issued: $BC(b) = \langle b, REF(b) \rangle$.

Definition 8:

$$R_C^{i+1}(n) = R_C^i(n) \cup \bigcup_{b \in B_C^i} R_{BC(b)}^0(n)$$

$$S_C^{i+1} = \{ n \mid DEF(n) \cap R_C^{i+1}(IMS(n)) \neq \emptyset \} \cup B_C^i$$

$$B_C^{i+1} = \{ b \mid INFL(b) \cap S_C^{i+1} \neq \emptyset \}$$

At each step, the first equation includes the new variables which influence the conditional statements, and the last two iterate the results until no new statements can be included. In other words, the iteration stops when $\forall n : R_C^{i+1}(n) = R_C^i(n)$ and $S_C^{i+1} = S_C^i = S_C$.
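Definitions 6 to 8 can be read as a pair of nested fixpoint computations. The Python sketch below is a minimal illustration under stated assumptions: the function names, the dictionary-based program representation and the ten-statement toy program are hypothetical, INFL is supplied by hand, and the branch criterion is taken to be the pair of b and REF(b) as in Weiser's formulation.

```python
def after_of(R, succs, n):
    """R(IMS(n)): union of the relevant sets of the immediate successors of n."""
    return set().union(*(R[s] for s in succs[n]))


def relevant0(nodes, succs, REF, DEF, i, V):
    """Least fixpoint of Definition 6 for the criterion <i, V>."""
    R = {n: set() for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            after = after_of(R, succs, n)
            new = after - DEF[n]            # third subset: killed variables drop out
            if DEF[n] & after:
                new |= REF[n]               # second subset: uses become relevant
            if n == i:
                new |= V                    # first subset: base case
            if new != R[n]:
                R[n], changed = new, True
    return R


def weiser_slice(nodes, succs, REF, DEF, INFL, i, V):
    """Iterate Definitions 7 and 8 until R and S no longer change."""
    def statements(R):
        return {n for n in nodes if DEF[n] & after_of(R, succs, n)}

    def branches(S):
        return {b for b in nodes if INFL[b] & S}

    R = relevant0(nodes, succs, REF, DEF, i, V)
    S = statements(R)
    B = branches(S)
    while True:
        new_R = {n: set(R[n]) for n in nodes}
        for b in B:                         # branch criterion BC(b) = <b, REF(b)>
            R_b = relevant0(nodes, succs, REF, DEF, b, REF[b])
            for n in nodes:
                new_R[n] |= R_b[n]
        new_S = statements(new_R) | B
        if new_R == R and new_S == S:
            return S
        R, S, B = new_R, new_S, branches(new_S)


# Hypothetical program: 1 read n; 2 i=1; 3 s=0; 4 p=1; 5 while i<=n;
# 6 s=s+i; 7 p=p*i; 8 i=i+1; 9 write s; 10 write p
nodes = list(range(1, 11))
succs = {1: {2}, 2: {3}, 3: {4}, 4: {5}, 5: {6, 9},
         6: {7}, 7: {8}, 8: {5}, 9: {10}, 10: set()}
REF = {1: set(), 2: set(), 3: set(), 4: set(), 5: {"i", "n"},
       6: {"s", "i"}, 7: {"p", "i"}, 8: {"i"}, 9: {"s"}, 10: {"p"}}
DEF = {1: {"n"}, 2: {"i"}, 3: {"s"}, 4: {"p"}, 5: set(),
       6: {"s"}, 7: {"p"}, 8: {"i"}, 9: set(), 10: set()}
INFL = {n: set() for n in nodes}
INFL[5] = {6, 7, 8}

slice_p = weiser_slice(nodes, succs, REF, DEF, INFL, 10, {"p"})
print(sorted(slice_p))   # [1, 2, 4, 5, 7, 8]
```

At level zero only the assignments to p and i are found; the loop test and the statement reading n enter the slice at the next level, through the branch criterion issued for statement 5.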

3.3: Direct slice

Given a slicing criterion $C = \langle i, V \rangle$, where i is a statement and V is a subset of the program variables, a direct slice $DiS_C$ is an executable subset of P containing all the statements which directly contribute to the values of V just before statement i is executed. $DiS_C$ is built in two steps. The first step defines the relevant variables and the instructions which directly change the values of V before i. The second step finds all the conditional instructions which directly control the instructions found in the first step.

Definition 9:

$$DiR_C^0(n) = \{ v \in V - DEF(n) \mid n = i \} \cup \{ DiR_C^0(IMS(n)) - DEF(n) \}$$

$$DiS_C^0 = \{ n \mid n = i \ \text{or}\ DEF(n) \cap DiR_C^0(IMS(n)) \neq \emptyset \}$$

$DiR_C^0(n)$ represents the live variables in n with respect to their use in i. The search starts from instruction i. Compared to definition 6, the first subset excludes the variables defined in i, and the subset which includes variables in the reference-definition chain is absent. The set of instructions which directly modify V at instruction i also contains the statement i of the slicing criterion.

Definition 10:

$$DiB_C^0 = \{ b \mid INFL(b) \cap DiS_C^0 \neq \emptyset \ \text{and}\ i \notin INFL(b) \}$$

$DiB_C^0$ restricts definition 7 because it includes the conditional statements controlling the statements in $DiS_C^0$ only if they do not condition the statement i of the slicing criterion.

Definition 11:

$$DiS_C = DiS_C^0 \cup DiB_C^0$$

The direct slice is a hammock graph, like the initial program, so it can be packaged as a separate one entry/one exit module. A complement of the direct slice may be derived, working as a caller which activates the module to get data (source module) or put data (sink module). In the former case the slicing criterion pairs the input statement with the external input data (EXTINP), while in the latter it pairs the output statement with the external output data (EXTOUT).

3.4: Transform Slice

Given a slicing criterion $C = \langle i, V_{inp}, V_{out} \rangle$, where i is a statement and $V_{inp}$ and $V_{out}$ are subsets of the program variables, a transform slice $TrS_C$ is an executable subset of P containing all the statements which either directly or indirectly contribute to the values of $V_{out}$ starting from the values of $V_{inp}$, just before statement i is executed. Like Weiser's slice, the transform slice is built recursively. Starting from zero, the superscripts represent the level of recursion.

Definition 12:

$$TrR_C^0(n) = \{ v \in V_{out} \mid n = i \} \cup \{ REF(n) - V_{inp} \mid DEF(n) \cap TrR_C^0(IMS(n)) \neq \emptyset \} \cup \{ TrR_C^0(IMS(n)) - DEF(n) \}$$

$$TrS_C^0 = \{ n \mid DEF(n) \cap TrR_C^0(IMS(n)) \neq \emptyset \}$$

$TrR_C^0(n)$ represents the relevant variables which start from $V_{inp}$ with potential effects on $V_{out}$ when program execution is at instruction n. The search starts from instruction i. The first and third subsets are substantially equal to the corresponding ones in definition 6, while the second subset differs because only variables not in $V_{inp}$ are included when they are referenced for modifying the values of other relevant variables. $TrS_C^0$ includes those statements whose execution can directly influence the values of relevant variables for each instruction of the program.

Definition 13:

$$TrB_C^0 = \{ b \mid INFL(b) \cap TrS_C^0 \neq \emptyset \ \text{and}\ i \notin INFL(b) \}$$

Like definition 10, $TrB_C^0$ is a restriction of definition 7. It includes the conditional statements controlling the statements in $TrS_C^0$ only if they do not condition the statement i of the slicing criterion. Given a conditional statement b, a branch statement criterion TrBC(b) can be issued.

Definition 14:

$$TrR_C^{i+1}(n) = TrR_C^i(n) \cup \bigcup_{b \in TrB_C^i} TrR_{TrBC(b)}^0(n)$$

$$TrS_C^{i+1} = \{ n \mid DEF(n) \cap TrR_C^{i+1}(IMS(n)) \neq \emptyset \} \cup TrB_C^i$$

$$TrB_C^{i+1} = \{ b \mid INFL(b) \cap TrS_C^{i+1} \neq \emptyset \}$$

This definition is equal to definition 8. Recursion stops when $\forall n : TrR_C^{i+1}(n) = TrR_C^i(n)$ and $TrS_C^{i+1} = TrS_C^i = TrS_C$.

The transform slice is also a hammock graph which can be packaged as a distinct module. A complement of the transform slice may be derived, working as a caller which activates the transform module. The slicing criterion must name the external input (EXTINP) as $V_{inp}$ and the external output (EXTOUT) as $V_{out}$. Input statements which get EXTINP and output statements which put EXTOUT will not be included because, in our definition, the transform slice is input-restricted as regards the variables in the slicing criterion and generally output-restricted because all the output statements are removed.

3.5: Example program

In order to clarify the above definitions, an example is discussed in the following. The example is a small Cobol program, taken from [12], which prints a tabular report from an external file after performing some arithmetic calculations. Figure 2 shows the DATA Division. The input is the sequential file PATRON-FILE, whose record PATRON-RECORD contains the patron name, target contribution, actual contribution and the contribution date. The output is the printed report DEFICIENCY-LIST with two types of records: DEFICIENCY-LINE, which contains those patrons with a contribution below the target, and TOTAL-LINE, which summarizes the results.

Figure 3 shows the PROCEDURE Division, made up of five paragraphs. The paragraph 000-PRINT-DEFICIENCY-LIST is the main routine which calls the initialization paragraph 100-INITIALIZE-VARIABLE-FIELDS, repeatedly executes 200-PROCESS-PATRON-RECORD until end of file is true, and finally calls 700-PRINT-TOTAL-LINE. If the actual contribution is less than the target contribution, the paragraph 200-PROCESS-PATRON-RECORD activates 210-PROCESS-DEFICIENT-PATRON. There are two equal input statements (90-91 and 107-108), corresponding to the reading of the next record, and two distinct output statements (128-129 and 138-139): the former prints the detail line and the latter the total line.

    DATA DIVISION.
    FILE SECTION.
    FD  PATRON-FILE
            RECORD CONTAINS 74 CHARACTERS
            LABEL RECORDS ARE STANDARD.
    01  PATRON-RECORD.
        05  PR-NAME                 PIC X(18).
        05  FILLER                  PIC X(42).
        05  PR-TRGT-CON             PIC 9(4).
        05  PR-ACTL-CON             PIC 9(4).
        05  PR-CON-DATE.
            10  PR-CON-MONTH        PIC X(2).
            10  PR-CON-DAY          PIC X(2).
            10  PR-CON-YEAR         PIC X(2).
    FD  DEFICIENCY-LIST
            RECORD CONTAINS 132 CHARACTERS
            LABEL RECORDS ARE OMITTED.
    01  DEFICIENCY-LINE.
        05  FILLER                  PIC X(1).
        05  DL-NAME                 PIC X(18).
        05  FILLER                  PIC X(2).
        05  DL-CON-MONTH            PIC X(2).
        05  DL-SLASH-1              PIC X(1).
        05  DL-CON-DAY              PIC X(2).
        05  DL-SLASH-2              PIC X(1).
        05  DL-CON-YEAR             PIC X(2).
        05  FILLER                  PIC X(4).
        05  DL-TRGT-CON             PIC 9(4).
        05  FILLER                  PIC X(3).
        05  DL-ACTL-CON             PIC 9(4).
        05  FILLER                  PIC X(3).
        05  DL-AMT-DEF              PIC 9(4).
        05  FILLER                  PIC X(3).
        05  DL-DEF-PERCENT          PIC 99.9.
        05  DL-PERCENT-SIGN         PIC X(1).
        05  FILLER                  PIC X(73).
    01  TOTAL-LINE.
        05  FILLER                  PIC X(4).
        05  TL-DEF-PATRONS          PIC 9(3).
        05  FILLER                  PIC X(38).
        05  TL-TOTAL-AMT-DEF        PIC 9(6).
        05  FILLER                  PIC X(81).
    WORKING-STORAGE SECTION.
    01  WS-SWITCHES.
        05  WS-EOF-SWITCH           PIC X(3).
    01  WS-ARITHMETIC-WORK-AREAS.
        05  WS-AMT-DEFICIENT        PIC 9(4).
        05  WS-TOTAL-AMT-DEF        PIC 9(6).
        05  WS-DEFICIENCY-FRACTION  PIC V999.
        05  WS-DEF-PERCENT          PIC 99V9.
        05  WS-DEF-PATRONS          PIC 9(3).

Figure 2. DATA Division of the example program (listing lines 000024-000083)

    000084 PROCEDURE DIVISION.
    000085
    000086 000-PRINT-DEFICIENCY-LIST.
    000087     OPEN INPUT PATRON-FILE
    000088         OUTPUT DEFICIENCY-LIST.
    000089     PERFORM 100-INITIALIZE-VARIABLE-FIELDS.
    000090     READ PATRON-FILE
    000091         AT END MOVE "YES" TO WS-EOF-SWITCH.
    000092     PERFORM 200-PROCESS-PATRON-RECORD
    000093         UNTIL WS-EOF-SWITCH IS EQUAL TO "YES".
    000094     PERFORM 700-PRINT-TOTAL-LINE.
    000095     CLOSE PATRON-FILE
    000096         DEFICIENCY-LIST.
    000097     STOP RUN.
    000098
    000099 100-INITIALIZE-VARIABLE-FIELDS.
    000100     MOVE "NO" TO WS-EOF-SWITCH.
    000101     MOVE ZERO TO WS-TOTAL-AMT-DEF,
    000102         WS-DEF-PATRONS.
    000103
    000104 200-PROCESS-PATRON-RECORD.
    000105     IF PR-ACTL-CON < PR-TRGT-CON
    000106         PERFORM 210-PROCESS-DEFICIENT-PATRON.
    000107     READ PATRON-FILE
    000108         AT END MOVE "YES" TO WS-EOF-SWITCH.
    000109
    000110 210-PROCESS-DEFICIENT-PATRON.
    000111     MOVE SPACES TO DEFICIENCY-LINE.
    000112     MOVE PR-NAME TO DL-NAME.
    000113     MOVE PR-TRGT-CON TO DL-TRGT-CON.
    000114     MOVE PR-ACTL-CON TO DL-ACTL-CON.
    000115     MOVE PR-CON-MONTH TO DL-CON-MONTH.
    000116     MOVE PR-CON-DAY TO DL-CON-DAY.
    000117     MOVE PR-CON-YEAR TO DL-CON-YEAR.
    000118     MOVE "/" TO DL-SLASH-1, DL-SLASH-2.
    000119     SUBTRACT PR-ACTL-CON FROM PR-TRGT-CON
    000120         GIVING WS-AMT-DEF.
    000121     MOVE WS-AMT-DEF TO DL-AMT-DEF.
    000122     DIVIDE PR-ACTL-CON BY PR-TRGT-CON
    000123         GIVING WS-DEFICIENCY-FRACTION.
    000124     MULTIPLY WS-DEFICIENCY-FRACTION BY 100
    000125         GIVING WS-DEF-PERCENT.
    000126     MOVE WS-DEF-PERCENT TO DL-DEF-PERCENT.
    000127     MOVE "%" TO DL-PERCENT-SIGN.
    000128     WRITE DEFICIENCY-LINE
    000129         AFTER ADVANCING 2 LINES.
    000130     ADD WS-AMT-DEF TO WS-TOTAL-AMT-DEF.
    000131     ADD 1 TO WS-DEF-PATRONS.
    000132
    000133 700-PRINT-TOTAL-LINE.
    000134     MOVE SPACES TO TOTAL-LINE.
    000135-137 MOVE WS-DEF-PATRONS TO TL-DEF-PATRONS.
               MOVE WS-TOTAL-AMT-DEF TO TL-TOTAL-AMT-DEF.
    000138     WRITE TOTAL-LINE
    000139         AFTER ADVANCING 3 LINES.

Figure 3. PROCEDURE Division of the example program

Figure 4 shows the direct slice for the recovered source module "Read_Next_Patron", obtained with a slicing criterion whose statement is the input statement 90-91.

$$DiR_C^0(\text{90-91}) = \emptyset$$

$$DiS_C^0 = \{ \text{90-91} \}$$

$$DiB_C^0 = \emptyset$$

$$DiS_C = \{ \text{90-91} \}$$

The slice captures only a single statement, because the file organization is sequential. When an indexed file is directly accessed, instead, the direct slice also takes the setting of the record key.

    ******************** TOP OF DATA ***********************
    - - - - - - - - - - - - - - - - - - 89 LINES NOT DISPLAYED
    000090     READ PATRON-FILE
    000091         AT END MOVE "YES" TO WS-EOF-SWITCH
    - - - - - - - - - - - - - - - - - - 48 LINES NOT DISPLAYED
    ****************** BOTTOM OF DATA *******************

Figure 4. Direct slice of "Read_Next_Patron"

Figure 5 shows the direct slice for the recovered sink module "Put_Deficient_Patron", obtained with a slicing criterion whose statement is the output statement 128-129 and whose variable is DEFICIENCY-LINE.

$$DiR_C^0(\text{128-129}) = \{ \text{DEFICIENCY-LINE} \}$$

$$DiR_C^0(127) = \{ \text{DEFICIENCY-LINE} - \text{DL-PERCENT-SIGN} \}$$

$$DiR_C^0(126) = \{ \text{DEFICIENCY-LINE} - (\text{DL-PERCENT-SIGN} \cup \text{DL-DEF-PERCENT}) \}$$

$$\ldots$$

$$DiR_C^0(111) = \emptyset$$

$$DiS_C^0 = \{ 111, 112, 113, 114, 115, 116, 117, 118, 121, 126, 127, \text{128-129} \}$$

$$INFL(93) \cap DiS_C^0 \neq \emptyset, \qquad \text{128-129} \in INFL(93)$$

$$INFL(105) \cap DiS_C^0 \neq \emptyset, \qquad \text{128-129} \in INFL(105)$$

$$DiB_C^0 = \emptyset$$

$$DiS_C = \{ 111, 112, 113, 114, 115, 116, 117, 118, 121, 126, 127, \text{128-129} \}$$

Statements 93 and 105 control statements of $DiS_C^0$, but they also condition the statement 128-129 of the slicing criterion, so by definition 10 neither of them enters $DiB_C^0$. The slice includes the output statement and the MOVEs to the subfields of DEFICIENCY-LINE. Statement 111 is included because it defines the filler fields of DEFICIENCY-LINE. The direct slice does not include the arithmetic calculations which compute derived data.

Figure 6 shows the transform slice for the recovered transform module "Compute_Deficiency_Percent", obtained with a slicing criterion whose statement is 128-129 and whose output variable is DL-DEF-PERCENT.

$$TrR_C^0(\text{128-129}) = \{ \text{DL-DEF-PERCENT} \}$$

$$TrR_C^0(127) = \{ \text{DL-DEF-PERCENT} \}$$

$$TrR_C^0(126) = \{ \text{WS-DEF-PERCENT} \}$$

$$TrR_C^0(\text{124-125}) = \{ \text{WS-DEFICIENCY-FRACTION} \}$$

$$TrR_C^0(\text{122-123}) = \emptyset$$

$$\ldots$$

$$TrS_C^0 = \{ \text{122-123}, \text{124-125}, 126 \}$$

$$INFL(93) \cap TrS_C^0 \neq \emptyset, \qquad \text{128-129} \in INFL(93)$$

$$INFL(105) \cap TrS_C^0 \neq \emptyset, \qquad \text{128-129} \in INFL(105)$$

$$TrB_C^0 = \emptyset$$

$$TrS_C^1 = TrS_C^0 = TrS_C = \{ \text{122-123}, \text{124-125}, 126 \}$$

The input variables come from PATRON-FILE while the output variable is printed on the detail line of DEFICIENCY-LIST. The resulting slice includes only the statements necessary for the calculation plus the MOVE of the result from the working-storage variable to the file variable. This last instruction is shared with the direct slice of the "Put_Deficient_Patron" module.

    ******************** TOP OF DATA ***********************
    - - - - - - - - - - - - - - - - - - 110 LINES NOT DISPLAYED
    000111     MOVE SPACES TO DEFICIENCY-LINE.
    000112     MOVE PR-NAME TO DL-NAME.
    000113     MOVE PR-TRGT-CON TO DL-TRGT-CON.
    000114     MOVE PR-ACTL-CON TO DL-ACTL-CON.
    000115     MOVE PR-CON-MONTH TO DL-CON-MONTH.
    000116     MOVE PR-CON-DAY TO DL-CON-DAY.
    000117     MOVE PR-CON-YEAR TO DL-CON-YEAR.
    000118     MOVE "/" TO DL-SLASH-1, DL-SLASH-2.
    - - - - - - - - - - - - - - - - - - - 2 LINES NOT DISPLAYED
    000121     MOVE WS-AMT-DEF TO DL-AMT-DEF.
    - - - - - - - - - - - - - - - - - - - 4 LINES NOT DISPLAYED
    000126     MOVE WS-DEF-PERCENT TO DL-DEF-PERCENT.
    000127     MOVE "%" TO DL-PERCENT-SIGN.
    000128     WRITE DEFICIENCY-LINE
    000129         AFTER ADVANCING 2 LINES.
    - - - - - - - - - - - - - - - - - - - 10 LINES NOT DISPLAYED
    ****************** BOTTOM OF DATA *******************

Figure 5. Direct slice of "Put_Deficient_Patron"
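The derivation behind Figure 5 can be replayed mechanically. The Python sketch below applies definitions 9 to 11 to a hand-made model of the processing loop of the example program. Several things are assumptions made for illustration only: the function name, the DEF sets assigned to each statement, the single "FILLER" name standing for all unnamed fields, the "PATRON-RECORD" name standing for the record read from PATRON-FILE, the hypothetical "end" node that collapses everything after the loop, and the INFL sets written by hand for statements 93 and 105.

```python
# Subfields of DEFICIENCY-LINE; "FILLER" stands for all the unnamed fillers.
LINE = {"FILLER", "DL-NAME", "DL-CON-MONTH", "DL-SLASH-1", "DL-CON-DAY",
        "DL-SLASH-2", "DL-CON-YEAR", "DL-TRGT-CON", "DL-ACTL-CON",
        "DL-AMT-DEF", "DL-DEF-PERCENT", "DL-PERCENT-SIGN"}

# (statement label, DEF set) in control-flow order; a modelling assumption.
PROGRAM = [
    ("93", set()), ("105", set()), ("111", set(LINE)),
    ("112", {"DL-NAME"}), ("113", {"DL-TRGT-CON"}), ("114", {"DL-ACTL-CON"}),
    ("115", {"DL-CON-MONTH"}), ("116", {"DL-CON-DAY"}),
    ("117", {"DL-CON-YEAR"}), ("118", {"DL-SLASH-1", "DL-SLASH-2"}),
    ("119-120", {"WS-AMT-DEF"}), ("121", {"DL-AMT-DEF"}),
    ("122-123", {"WS-DEFICIENCY-FRACTION"}), ("124-125", {"WS-DEF-PERCENT"}),
    ("126", {"DL-DEF-PERCENT"}), ("127", {"DL-PERCENT-SIGN"}),
    ("128-129", set()), ("130", {"WS-TOTAL-AMT-DEF"}),
    ("131", {"WS-DEF-PATRONS"}),
    ("107-108", {"PATRON-RECORD", "WS-EOF-SWITCH"}), ("end", set()),
]
NODES = [label for label, _ in PROGRAM]
DEF = dict(PROGRAM)

# IMS: sequential flow, then the two branches and the loop back edge.
IMS = {a: {b} for a, b in zip(NODES, NODES[1:])}
IMS["93"] = {"105", "end"}           # PERFORM ... UNTIL: loop body or exit
IMS["105"] = {"111", "107-108"}      # IF: paragraph 210 or skip it
IMS["107-108"] = {"93"}              # back to the loop test
IMS["end"] = set()

BODY_210 = set(NODES[2:19])          # statements 111 to 131
INFL = {"105": BODY_210, "93": BODY_210 | {"105", "107-108"}}


def direct_slice(i, V):
    """Definitions 9-11: return (DiR_C^0, DiS_C) for the criterion <i, V>."""
    def after_of(R, n):              # DiR(IMS(n))
        return set().union(*(R[s] for s in IMS[n]))

    R = {n: set() for n in NODES}
    changed = True
    while changed:                   # least fixpoint of definition 9
        changed = False
        for n in NODES:
            new = after_of(R, n) - DEF[n]
            if n == i:
                new |= V - DEF[n]
            if new != R[n]:
                R[n], changed = new, True
    S0 = {n for n in NODES if n == i or DEF[n] & after_of(R, n)}
    B0 = {b for b in INFL if INFL[b] & S0 and i not in INFL[b]}
    return R, S0 | B0


R, sink_slice = direct_slice("128-129", LINE)
print(sorted(sink_slice))
# The second READ plays the role of the source statement in this fragment.
_, source_slice = direct_slice("107-108", {"PATRON-RECORD"})
print(source_slice)                  # {'107-108'}
```

Under these assumptions the first call yields the statements listed in Figure 5, and the second call isolates the READ statement alone, which mirrors the single-statement slice of Figure 4 (statement 90-91 itself lies outside the modelled fragment).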

    ******************** TOP OF DATA ***********************
    - - - - - - - - - - - - - - - - - - 121 LINES NOT DISPLAYED
    000122     DIVIDE PR-ACTL-CON BY PR-TRGT-CON
    000123         GIVING WS-DEFICIENCY-FRACTION.
    000124     MULTIPLY WS-DEFICIENCY-FRACTION BY 100
    000125         GIVING WS-DEF-PERCENT.
    000126     MOVE WS-DEF-PERCENT TO DL-DEF-PERCENT.
    - - - - - - - - - - - - - - - - - - - 13 LINES NOT DISPLAYED
    ****************** BOTTOM OF DATA *******************

Figure 6. Transform slice of "Compute_Deficiency_Percent"
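The level-zero part of the transform slice (definitions 12 and 13) can be checked in the same way. The Python sketch below is an illustration under assumptions: the function name, the REF and DEF sets, the "end" node and the hand-written INFL sets are a hypothetical model of the processing loop of Figure 3, and the input variables are assumed to be PR-ACTL-CON and PR-TRGT-CON. The recursion of definition 14 is not implemented; for this criterion the set of controlling branches is empty, so level zero is already the final result.

```python
LINE = {"FILLER", "DL-NAME", "DL-CON-MONTH", "DL-SLASH-1", "DL-CON-DAY",
        "DL-SLASH-2", "DL-CON-YEAR", "DL-TRGT-CON", "DL-ACTL-CON",
        "DL-AMT-DEF", "DL-DEF-PERCENT", "DL-PERCENT-SIGN"}
RECORD = {"PR-NAME", "PR-TRGT-CON", "PR-ACTL-CON",
          "PR-CON-MONTH", "PR-CON-DAY", "PR-CON-YEAR"}
IN = {"PR-ACTL-CON", "PR-TRGT-CON"}      # assumed V_inp

# (label, REF set, DEF set) in control-flow order; a modelling assumption.
PROGRAM = [
    ("93", {"WS-EOF-SWITCH"}, set()),
    ("105", set(IN), set()),
    ("111", set(), set(LINE)),
    ("112", {"PR-NAME"}, {"DL-NAME"}),
    ("113", {"PR-TRGT-CON"}, {"DL-TRGT-CON"}),
    ("114", {"PR-ACTL-CON"}, {"DL-ACTL-CON"}),
    ("115", {"PR-CON-MONTH"}, {"DL-CON-MONTH"}),
    ("116", {"PR-CON-DAY"}, {"DL-CON-DAY"}),
    ("117", {"PR-CON-YEAR"}, {"DL-CON-YEAR"}),
    ("118", set(), {"DL-SLASH-1", "DL-SLASH-2"}),
    ("119-120", set(IN), {"WS-AMT-DEF"}),
    ("121", {"WS-AMT-DEF"}, {"DL-AMT-DEF"}),
    ("122-123", set(IN), {"WS-DEFICIENCY-FRACTION"}),
    ("124-125", {"WS-DEFICIENCY-FRACTION"}, {"WS-DEF-PERCENT"}),
    ("126", {"WS-DEF-PERCENT"}, {"DL-DEF-PERCENT"}),
    ("127", set(), {"DL-PERCENT-SIGN"}),
    ("128-129", set(LINE), set()),
    ("130", {"WS-AMT-DEF", "WS-TOTAL-AMT-DEF"}, {"WS-TOTAL-AMT-DEF"}),
    ("131", {"WS-DEF-PATRONS"}, {"WS-DEF-PATRONS"}),
    ("107-108", set(), RECORD | {"WS-EOF-SWITCH"}),
    ("end", set(), set()),               # hypothetical: everything after the loop
]
NODES = [label for label, _, _ in PROGRAM]
REF = {label: ref for label, ref, _ in PROGRAM}
DEF = {label: dfn for label, _, dfn in PROGRAM}

IMS = {a: {b} for a, b in zip(NODES, NODES[1:])}
IMS["93"] = {"105", "end"}
IMS["105"] = {"111", "107-108"}
IMS["107-108"] = {"93"}
IMS["end"] = set()

BODY_210 = set(NODES[2:19])              # statements 111 to 131
INFL = {"105": BODY_210, "93": BODY_210 | {"105", "107-108"}}


def transform_slice0(i, v_inp, v_out):
    """Definitions 12-13: return (TrR_C^0, TrS_C^0, TrB_C^0)."""
    def after_of(R, n):                  # TrR(IMS(n))
        return set().union(*(R[s] for s in IMS[n]))

    R = {n: set() for n in NODES}
    changed = True
    while changed:                       # least fixpoint of definition 12
        changed = False
        for n in NODES:
            after = after_of(R, n)
            new = after - DEF[n]
            if DEF[n] & after:
                new |= REF[n] - v_inp    # input variables stop the search
            if n == i:
                new |= v_out
            if new != R[n]:
                R[n], changed = new, True
    S0 = {n for n in NODES if DEF[n] & after_of(R, n)}
    B0 = {b for b in INFL if INFL[b] & S0 and i not in INFL[b]}
    return R, S0, B0


R, S0, B0 = transform_slice0("128-129", IN, {"DL-DEF-PERCENT"})
print(sorted(S0), B0)                    # ['122-123', '124-125', '126'] set()
# With an empty V_inp, definition 12 coincides with definition 6.
_, W0, _ = transform_slice0("128-129", set(), {"DL-DEF-PERCENT"})
print(sorted(W0))                        # ['107-108', '122-123', '124-125', '126']
```

With an empty set of input variables the search is no longer cut at PR-ACTL-CON and PR-TRGT-CON, so the READ statement that defines them enters the level-zero set. This is the first step toward the larger Weiser's slice discussed in the related work.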

4: Related work

Among the methods based on static analysis, the REDO project [8] also uses program slicing when reverse engineering Cobol programs. Program slicing is used there as a technique for a first segmentation of the program into smaller chunks. Because the slices resulting from large programs are often not small enough to be manageable, they must be simplified by means of other techniques such as functional abstraction. Our definitions of direct and transform slice allow a reverse engineer to isolate the relevant portions of code with higher precision than Weiser's slicing. The increase in precision is due to the knowledge acquired in the data recovery phase, which makes it possible to form expectations about the input and output of functions. In order to stress the difference from Weiser's slice, Figure 7 shows the portion of code that Weiser's slicing selects for the same computation. Weiser's slice also takes instructions which are not essential to the computation of the deficiency percentage.

    ******************** TOP OF DATA ***********************
    - - - - - - - - - - - - - - - - - - - 85 LINES NOT DISPLAYED
    000086 000-PRINT-DEFICIENCY-LIST.
    000087     OPEN INPUT PATRON-FILE
    000088         OUTPUT DEFICIENCY-LIST.
    000089     PERFORM 100-INITIALIZE-VARIABLE-FIELDS.
    000090     READ PATRON-FILE
    000091         AT END MOVE "YES" TO WS-EOF-SWITCH.
    000092     PERFORM 200-PROCESS-PATRON-RECORD
    000093         UNTIL WS-EOF-SWITCH IS EQUAL TO "YES".
    - - - - - - - - - - - - - - - - - - - 3 LINES NOT DISPLAYED
    000097     STOP RUN.
    000098
    000099 100-INITIALIZE-VARIABLE-FIELDS.
    000100     MOVE "NO" TO WS-EOF-SWITCH.
    - - - - - - - - - - - - - - - - - - - 3 LINES NOT DISPLAYED
    000104 200-PROCESS-PATRON-RECORD.
    000105     IF PR-ACTL-CON IS LESS THAN PR-TRGT-CON
    000106         PERFORM 210-PROCESS-DEFICIENT-PATRON.
    000107     READ PATRON-FILE
    000108         AT END MOVE "YES" TO WS-EOF-SWITCH.
    000109
    000110 210-PROCESS-DEFICIENT-PATRON.
    - - - - - - - - - - - - - - - - - - - 11 LINES NOT DISPLAYED
    000122     DIVIDE PR-ACTL-CON BY PR-TRGT-CON
    000123         GIVING WS-DEFICIENCY-FRACTION.
    000124     MULTIPLY WS-DEFICIENCY-FRACTION BY 100
    000125         GIVING WS-DEF-PERCENT.
    000126     MOVE WS-DEF-PERCENT TO DL-DEF-PERCENT.
    - - - - - - - - - - - - - - - - - - - 13 LINES NOT DISPLAYED
    ****************** BOTTOM OF DATA *******************

Figure 7. Weiser's slice corresponding to "Compute_Deficiency_Percent"

In [2] a criterion for extracting transform modules was proposed, which takes into account all the statements of the output slice S(EXTOUT) which are not in the input slice S(EXTINP). The input and output slices considered there are independent of the instruction number, as defined in [4]. The accuracy of this technique is lower than that of the transform slice algorithm presented in this paper. In fact, variable overloading and conditional instructions may cause the addition of extraneous statements or the deletion of instructions which are necessary to the transform module. These problems were partially solved by a preliminary segmentation of the code to isolate the transactions, but this implies a loss of efficiency of the process.

5: Conclusions and future directions

To date, the extraction of slices has been performed by using a commercial tool, VIA/Renaissance, running on an MVS-TSO/ISPF platform. The tool analyzes Cobol programs, making it possible to navigate along the control and data flow paths. It also supports five different slicing criteria, including Weiser's slicing. The slicing algorithms for function recovery have been verified on small example programs, and on two large Cobol programs (22 and 34 KLOC) in the banking domain.

Lessons learned from the experimentation concern both the proposed algorithms and the efficacy of the tool. In fact, the analysis of the results obtained with the difference of slices [2] led to the formulation of the current algorithm for transform slicing. As regards the tool, although the availability of a commercial tool makes it possible to analyze large Cobol programs, the process of function recovery is still only partially automated because our slicing algorithms are not fully supported. Moreover the tool shows some drawbacks when program slicing is applied to files with multiple record types or when the instruction addressed in the slicing criterion does not contain the variable of the slicing criterion itself.

Although our slicing techniques work with higher precision than Weiser's slicing, the extracted slices could still be too complex to comprehend. In such a case the extraction process should be iterated to locate those subfunctions which refine the initial slice. Conversely, when the extracted slice is empty or trivial, we can suppose that the expected function has been implemented in another program. In any case, reaching a conclusion requires the cognitive capabilities of the reverse engineer. This bears out the thesis, explained in [1], that reverse engineering is a human-intensive activity.

The following points require improvement and further investigation.
• Experimentation on large programs: the performance of the function recovery process should be empirically assessed using a significant sample of expected functions with different programs and different subjects.
• Development of a slicing tool: to make extensive experimentation possible, a tool is needed which supports our extraction criteria and derives program complements. We are going to develop a prototype slicing tool which satisfies our requirements, by using the program knowledge which is stored in the Renaissance repository.
• Reuse of recovered functions: sliced components might be documented and classified to be included in a reuse library of the application domain or a reuse library of the programming environment.
• Comparison with dynamic analyzers: dynamic methods for design recovery, like that in [13] based on test cases, are not necessarily an alternative to static analysis methods. The limitations of both kinds of methods could be brought out, so as to use static and dynamic methods in a complementary way.

Acknowledgements

Thanks to Filippo Cutillo, Gregorio Fatone, and the anonymous CSM referees, who made helpful comments on a draft of this paper.

References

[1] F.Abbattista, F.Lanubile, and G.Visaggio, "Recovering conceptual data models is human-intensive", to appear in Fifth International Conference on Software Engineering and Knowledge Engineering, San Francisco, California, 1993.

[2] F.Cutillo, F.Lanubile, and G.Visaggio, "Extracting application domain functions from old code: a real experience", to appear in 2nd Workshop on Program Comprehension, Capri, Italy, 1993.

[3] F.Cutillo, P.Fiore, and G.Visaggio, "Identification and extraction of domain independent components in large programs", to appear in Working Conference on Reverse Engineering, Baltimore, 1993.

[4] K.B.Gallagher and J.R.Lyle, "Using program slicing in software maintenance", IEEE Transactions on Software Engineering, vol.17, no.8, August 1991.

[5] S.Horwitz, T.Reps, and D.Binkley, "Interprocedural slicing using dependence graphs", in Proceedings of the SIGPLAN'88 Conference on Programming Language Design and Implementation, 1988, pp.35-46.

[6] J.Jiang, X.Zhou, and D.J.Robson, "Program slicing for C - the problems in implementation", Proceedings of IEEE Conference on Software Maintenance, Sorrento, Italy, IEEE Computer Society Press, 1991.

[7] L.Ott and J.Thuss, "The relationship between slices and module cohesion", in Proceedings of the 11th International Conference on Software Engineering, 1989, pp.198-204.

[8] H.van Zuylen, editor, The REDO Compendium of Reverse Engineering for Software Maintenance, John Wiley and Sons Ltd., UK, 1992.

[9] M.Weiser, "Program slicing", in Proceedings of the Fifth International Conference on Software Engineering, 1981, pp.439-449.

[10] M.Weiser, "Programmers use slices when debugging", Communications of the ACM, vol.25, no.7, July 1982, pp.446-452.

[11] M.Weiser, "Program slicing", IEEE Transactions on Software Engineering, vol.SE-10, no.4, July 1984, pp.352-357.

[12] T.Welburn and W.Price, Structured Cobol 74/85: Fundamentals and Style, 3rd edition, McGraw-Hill, 1990.

[13] N.Wilde, J.A.Gomez, T.Gust, and D.Strasburg, "Locating user functionality in old code", Proceedings of IEEE Conference on Software Maintenance, Orlando, Florida, IEEE Computer Society Press, 1992.

[14] E.Yourdon and L.L.Constantine, Structured Design: Fundamentals of a Discipline of Computer Program and Systems Design, NY: Yourdon Press, 1979.