International Journal of Hybrid Information Technology Vol.2, No.3, July, 2009

Regression Test Suite Prioritization using Genetic Algorithms

R.Krishnamoorthi1 and S.A.Sahaaya Arul Mary2
Department of Computer Science and Engineering, Bharathidasan Institute of Technology, Anna University, Trichy-24, India
[email protected], [email protected]

Abstract

Regression testing is an expensive but important process in software testing. Unfortunately, there may be insufficient resources to re-execute all test cases during regression testing. In this situation, test case prioritization techniques aim to improve the effectiveness of regression testing by ordering the test cases so that the most beneficial are executed first. In this paper we propose a new test case prioritization technique using a Genetic Algorithm (GA). The proposed technique prioritizes subsequences of the original test suite so that the new suite, which is run within a time-constrained execution environment, will have a superior rate of fault detection when compared to randomly prioritized test suites. The experiment analyzes the genetic algorithm with regard to effectiveness and time overhead, using a structurally based criterion to prioritize test cases. The Average Percentage of Faults Detected (APFD) metric is used to determine the effectiveness of the new test case orderings.

Key words: Regression Testing, Test case prioritization, Genetic Algorithms

1. Introduction

Test case prioritization techniques organize the test cases in a test suite so as to increase the effectiveness of testing. One performance goal, the fault detection rate, is a measure of how quickly faults are detected during the testing process. An improved rate of fault detection can provide faster feedback regarding the quality of the system under test, but complete testing is frequently too expensive. This is often the case with regression testing, the process of validating modified software to detect whether new errors have been introduced into previously tested code and to provide confidence that modifications are correct. By increasing the overall rate of fault detection, a greater number of errors can be found more rapidly in the code.

As frequent rebuilding and regression testing gain popularity, the need for a time-constraint-aware prioritization technique grows. New software development processes such as extreme programming also promote a short development and testing cycle and frequent execution of fast test cases [1]. Therefore, there is a clear need for a prioritization technique that can be more effective when a test suite's allowed execution time is known, particularly when that execution time is short. This paper shows that if the maximum time allotted for execution of the test cases is known in advance, a more effective prioritization can be produced.

The time-constrained test case prioritization problem can be reduced to the NP-complete zero/one knapsack problem [2, 3, 4], which can often be efficiently approximated with a genetic algorithm (GA) heuristic search technique. Just as genetic algorithms have been effectively used in other software engineering and programming language problems such as test generation [5], program transformation [6], and software


maintenance resource allocation [7], this paper demonstrates that they are also effective in creating time-constrained test prioritizations. We present a technique that prioritizes regression test suites so that the new ordering (i) will always run within a given time limit and (ii) will have the highest possible potential for defect detection based on derived coverage information. In summary, the important contributions of this paper are as follows:
1. A GA-based technique to prioritize a regression test suite that will be run within a time-constrained execution environment.
2. An empirical evaluation of the effectiveness of the resulting prioritizations in relation to GA-produced prioritizations using different parameters.

2. Related work

In this section we provide an overview of related work on test case prioritization. In recent years several researchers have addressed the test case prioritization problem and presented techniques for solving it. The test case prioritization techniques reported in [8, 9] order test cases such that the test cases with the highest priority, according to some criterion, are executed first. Test case prioritization can address a wide variety of objectives [10]. For example, concerning coverage alone, testers might wish to schedule test cases in order to achieve code coverage at the fastest rate possible in the initial phase of regression testing, to reach 100% coverage soonest, or to ensure that the maximum possible coverage is achieved by some predetermined cut-off point. In the Microsoft Developer Network (MSDN) library, the achievement of adequate coverage without wasting time is a primary consideration when conducting regression tests [11]. Furthermore, several testing standards require branch-adequate coverage, making the speedy achievement of coverage an important aspect of the regression testing process.

In the literature, many techniques for regression test case prioritization have been described. Most of these techniques are code-based, relying on information relating test cases to coverage of code elements. In [12], Rothermel et al. investigated several prioritization techniques, such as total statement (or branch) coverage prioritization and additional statement (or branch) coverage prioritization, that can improve the rate of fault detection. In [9], Wong et al. prioritized test cases according to the criterion of 'increasing cost per additional coverage'. Greedy algorithms are also used and are implemented in a tool named ATAC [22]. In [14], Srivastava and Thiagarajan studied a prioritization technique that is based on the changes that have been made to the program and focused on the objective function of "impacted block coverage". Other non-coverage-based techniques in the literature include fault-exposing-potential (FEP) prioritization [10], history-based test prioritization [15], and the incorporation of varying test costs and fault severities into test case prioritization [16, 12].

In [17], Zheng Li, Mark Harman, and Robert M. Hierons studied five search techniques: two meta-heuristic search techniques (Hill Climbing and Genetic Algorithms) and three greedy algorithms (Basic Greedy, Additional Greedy and 2-Optimal Greedy). They showed that Genetic Algorithms performed well in test case prioritization. Hyunsook Do, Gregg Rothermel and Alex Kinneer [18] designed and performed a controlled experiment examining whether test case prioritization can be effective on Java programs tested under JUnit, and compared the results to those achieved in earlier studies. Their analyses show that test case prioritization can significantly improve the rate of fault detection of JUnit test suites. Saff and Ernst [19, 20, 21] considered test case prioritization for Java in the context of continuous testing, which uses spare CPU resources to continuously run regression tests in the background as the programmer codes. They combined the concepts of test frequency and test case prioritization, and reported that continuous prioritized testing can reduce wasted development time. Test case prioritization has also been done based on relevant slices. Most recently, Dennis Jeffrey and Neelam Gupta [29] proposed a prioritization technique based on the


coverage requirements present in the relevant slices of the outputs of test cases. However, these prioritization techniques are based on different sources of information, such as the history of recent or frequent errors, test cost, and code coverage information, and have not considered test suite time. Hence our proposed prioritization technique considers test suite time along with coverage information.

3. Challenges in Time based Prioritization

Test prioritization schemes typically create a single reordering of the test suite that can be executed after many subsequent changes to the program under test [23, 3]. Reordering a test suite can make it more effective at finding faults if testing must be terminated early [3]. In this section we present the challenges in time based prioritization. For example, suppose that regression test suite T contains six test cases with the initial ordering {T1, T2, T3, T4, T5, T6} as described in Figure 1(a). For the purposes of motivation, this example assumes a priori knowledge of the faults detected by T in the program P. As shown in Figure 1(b), test case T1 can find seven faults, {f1, f2, f4, f5, f6, f7, f8}, in nine minutes; T2 finds one fault, f1, in one minute; and T3 isolates two faults, {f1, f5}, in three minutes. Test cases T4, T5, and T6 each find three faults in four minutes: {f2, f3, f7}, {f4, f6, f8}, and {f2, f4, f6}, respectively.

Test case   f1   f2   f3   f4   f5   f6   f7   f8
T1          x    x         x    x    x    x    x
T2          x
T3          x                   x
T4               x    x                   x
T5                         x         x         x
T6               x         x         x
(a)

Test case   No. of faults   Execution time (mins)   Avg faults/min
T1          7               9                       0.78
T2          1               1                       1.00
T3          2               3                       0.67
T4          3               4                       0.75
T5          3               4                       0.75
T6          3               4                       0.75
(b)

Figure 1. Sample test cases, faults identified, and execution times

Suppose that the time budget for regression testing is twelve minutes. Because we want to find as many faults as possible early on, it would seem intuitive to order the test cases by considering only the number of faults that they can detect. Without a time budget, the test case order T1, T4, T5, T6, T3, T2 would execute. Of these, only test case T1 would have time to run under a twelve-minute time constraint, and it would find a total of seven faults, as noted in Figure 2. Since time is a principal concern, it may also seem intuitive to order the test cases with regard to their execution time. Consider the test case orders TC1 = {T1}, TC2 = {T2, T3, T4, T5}, TC3 = {T2, T1} and TC4 = {T5, T4, T3} for execution. In the time-constrained environment, a time-based prioritization would execute the test case order TC2 and find eight defects, as described in Figure 2. Another option would be to consider the time budget


and fault information together. To do this, we could order the test cases according to the average number of faults that they can detect per minute. Under the time constraint, the test case order TC3 would be executed and find a total of seven faults. If the time budget and the fault information are both considered intelligently, that is, in a way that accounts for overlapping fault detection, the test cases could be better prioritized and thus increase the overall number of faults found in the desired time period. In this example, the test cases would be intelligently reordered so that the test case order TC4 would run, revealing eight errors in less time than TC2. It is also clear that TC4 can reveal more defects than TC1 and TC3 in the specified testing time. Finally, it is important to note that the first two test cases of TC2, T2 and T3, find a total of two faults in four minutes, whereas the first test case in TC4, T5, detects three defects in the same time period. Therefore, the "intelligent" prioritization, TC4, is favored over TC2 because it is able to detect more faults earlier in the execution of the tests.

Time limit: 12 minutes

Prioritization      Fault   Time             APFD     Intelligent
Test case order     TC1     TC2              TC3      TC4
Tests executed      T1      T2, T3, T4, T5   T2, T1   T5, T4, T3
Total faults        7       8                7        8
Total time (mins)   9       12               10       11

Figure 2. Comparison of Prioritizations
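The totals in Figure 2 can be rechecked with a short script. The following is only an illustrative sketch: the fault sets and execution times are taken from Figure 1, while the function name run_within_budget and the rule of stopping at the first test that no longer fits are assumptions made for this example.

```python
# Fault sets and execution times (minutes) transcribed from Figure 1.
FAULTS = {
    "T1": {"f1", "f2", "f4", "f5", "f6", "f7", "f8"},
    "T2": {"f1"},
    "T3": {"f1", "f5"},
    "T4": {"f2", "f3", "f7"},
    "T5": {"f4", "f6", "f8"},
    "T6": {"f2", "f4", "f6"},
}
TIME = {"T1": 9, "T2": 1, "T3": 3, "T4": 4, "T5": 4, "T6": 4}


def run_within_budget(order, budget):
    """Run tests in the given order until the next test would exceed the budget.

    Returns (tests executed, number of distinct faults found, minutes used).
    The stop-at-first-overrun rule is an assumption of this sketch.
    """
    executed, found, used = [], set(), 0
    for test in order:
        if used + TIME[test] > budget:
            break
        executed.append(test)
        found |= FAULTS[test]  # overlapping faults are counted once
        used += TIME[test]
    return executed, len(found), used


if __name__ == "__main__":
    orders = {
        "Fault (full order)": ["T1", "T4", "T5", "T6", "T3", "T2"],
        "Time (TC2)": ["T2", "T3", "T4", "T5"],
        "APFD (TC3)": ["T2", "T1"],
        "Intelligent (TC4)": ["T5", "T4", "T3"],
    }
    for name, order in orders.items():
        print(name, run_within_budget(order, budget=12))
```

Because faults are collected in a set, overlapping detections such as f1 in T2 and T3 are counted only once, which is the effect the "intelligent" ordering exploits.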

4. Proposed Prioritization Technique

The proposed prioritization technique uses both testing time and potential fault detection information to intelligently reorder a test suite using a Genetic Algorithm. Our prioritization algorithm reorders the tests into a sequence that maximizes the suite's ability to isolate defects. In section 4.1 we present an overview of the proposed prioritization technique and in section 4.2 we present the genetic algorithm itself.

4.1 Overview

We use a genetic algorithm in the proposed prioritization technique. We first record the execution time of each test case. Because the time constraint could be very short, test case execution times must be exact in order to prioritize properly. The recorded time covers only the execution of the test case and excludes class loading. It does include any initialization and shutdown time required by a test, because these operations can greatly increase the execution time required by the test case. The program P and each test case in a test suite are input into the genetic algorithm, along with the following user-specified parameters:
• Test suite - T
• Collection of all permutations of elements of T - perms(2^T)
• Number of test suites to be created per iteration - s
• Time budget - tmax
• Two functions from permutations to the real numbers - time and fit
• Maximum iterations - dmax
• Percent of total test suite time - pT


• Crossover probability - pC
• Mutation probability - pM
• Addition probability - pA
• Deletion probability - pD
• Test adequacy criteria - tc
• Program coverage weight - w

4.2 Genetic Algorithm

The genetic algorithm performs test suite prioritization on T based on a given time constraint pT. We calculate pT percent of the total execution time of T and store the value in tmax, the maximum execution time for a test suite. We create a set R0 containing s random test suites TS from perms(2^T) that can each be executed within tmax. R0 is the first generation of s potential solutions. After creating a set of test suites, the coverage information presented in section 4.2.1 is used to determine the “goodness” of each TSj using the method CalcFitness(P, TS, pT, tc, w), which is explained in section 4.2.2. Fj denotes the fitness value of TSj, where Fj = fit(P, TSj, tc, w), and F = (F1, F2, ..., Fs) denotes the tuple of fitness values for the test suites TSj ∈ Rd, 0 ≤ d ≤ dmax.

Next, to choose the two best test suites in Rd to be elements of the next generation Rd+1, we use the method SelectTwoBest(Rd, F). We carry over the two best test suites in order to guarantee that Rd+1 has at least one “good” pair. It is important to carry these highly fit test suites into Rd+1 exactly as they are in Rd because they are most likely very close to exceeding tmax. Any slight change to these test suites could cause them to require too much execution time, thus invalidating them.

Through a roulette wheel selection technique, in which the probability of selection is proportional to fitness, we use the method SelectParents(Rd, F) to identify pairs of test suites (TSk, TSl) from Rd. The fitness values are normalized in relation to the rest of the test suite set by multiplying each Fj ∈ F by a fixed number, so that the sum of all fitness values equals one. We then sort the test suites in descending order of fitness and calculate the accumulated normalized fitness values. A random number r ∈ [0, 1] is next generated, and the first individual whose accumulated normalized value is greater than or equal to r is selected. This selection method is repeated until enough test suites are selected to fill the set Rd+1.
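The roulette wheel step described above (normalize, sort in descending order, accumulate, pick the first individual whose accumulated value reaches r) can be sketched as follows. This is a minimal illustration rather than the authors' SelectParents implementation: the function name roulette_select and the optional r argument are hypothetical, and the sketch assumes every fitness value is positive.

```python
import random
from itertools import accumulate


def roulette_select(suites, fitnesses, r=None):
    """Select one suite with probability proportional to its fitness.

    Assumes all fitness values are positive; how penalized suites would
    enter the wheel is outside this sketch.
    """
    total = sum(fitnesses)
    # Normalize so the fitness values sum to one, then sort descending.
    ranked = sorted(
        zip(suites, (f / total for f in fitnesses)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Accumulated normalized fitness values.
    cumulative = list(accumulate(share for _, share in ranked))
    if r is None:
        r = random.random()
    # First individual whose accumulated value is >= r.
    for (suite, _), acc in zip(ranked, cumulative):
        if acc >= r:
            return suite
    return ranked[-1][0]  # guards against floating-point round-off


def select_parents(suites, fitnesses):
    """Draw one pair of parents by spinning the wheel twice."""
    return roulette_select(suites, fitnesses), roulette_select(suites, fitnesses)
```

Passing r explicitly makes a single spin reproducible, which is convenient when checking the selection logic by hand.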
Candidate test suites with higher fitness are therefore less likely to be eliminated, but a few with lower fitness have a chance to be used in the test suite set as well. We merge the pair (TSk, TSl) using the ApplyCrossover(pC, TSk, TSl) method to create two potentially new test suites {TSq, TSr} based on pC, a user-specified crossover probability, as explained in section 4.2.3. Each test suite in the pair {TSq, TSr} may then be mutated based on pM, a user-provided mutation probability. A test case may also be added to or deleted from TSq or TSr using the AddAdditionalTests(T, pA, TSr) and DeleteATest(pD, TSr) methods. The crossover operator exchanges subsequences of the test suites, and the mutation operator changes only single elements. Test case addition and deletion are needed because no other operator allows for a change in the number of test cases in a test suite. After each of these modifications has been made to the original pair, both test suites TSq and TSr are entered into Rd+1. The same transformations are applied to all pairs selected by the SelectParents(Rd, F) method until Rd+1 contains s test suites.

In total, dmax sets of test suites are iteratively created in this fashion. When the final set Rdmax has been created, the test suite with the greatest fitness, TSmax, is determined. Because the two best test suites are carried unchanged into every new generation, this test suite is guaranteed to have the highest fitness out of all dmax sets of size s. We present the infrastructure of the proposed GA prioritization in Figure 3.

4.2.1 Test Coverage


Since it is very rare for a tester to know the location of all faults in P prior to testing, the prioritization technique must estimate how likely a test is to find defects, which factors into the function fit. The function fit yields the fitness of the test suite TSj based on its potential for fault detection and its time consumption. As it is impossible to reveal a fault without executing the faulty code, the percentage of code covered by a test suite is used to determine the suite's potential. We consider two forms of test adequacy criteria tc: (i) method coverage and (ii) block coverage [23, 25, 30]. A method is covered when it has been entered. A basic block, a sequence of instructions without any jumps or jump targets, is covered when it is entered for the first time. Because several high-level language source statements can be in the same basic block, it is sensible to keep track of basic blocks rather than individual statements at the time of execution [23, 25].

Our genetic algorithm accepts coverage information based on the code covered in an application by an entire test suite. As noted by Kessis et al., this is the form that many tools such as Clover [32], Jazz [31], and Emma [25] produce. The presented prioritization approach can reorder a test suite without requiring coverage information on a per-test basis. While the genetic algorithm handles the common case, its calculation of test suite fitness could be enhanced to use coverage information on a per-test basis, similar to [23]. Information revealing the impact of a test case Ti's test coverage on any other test case's coverage would also further improve the performance of the fitness function.

[Figure 3: flow diagram of the genetic algorithm. Inputs: program P, test suite T, number of test suites to be created per iteration s, maximum iterations dmax, percent of total test suite time pT, crossover probability pC, mutation probability pM, addition probability pA, deletion probability pD, test adequacy criteria tc, and program coverage weight w. Boxes shown: create initial population, calculate fitness, select best, selection (Tuple 1, Tuple 2), crossover, mutation, addition, deletion, add new tuples, next generation, final test tuple.]

Figure 3. Proposed GA Prioritization Infrastructure
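The flow in Figure 3 can be approximated by the sketch below. It is a hedged illustration, not the authors' implementation: the crossover, mutation, addition and deletion operators are hypothetical stand-ins (section 4.2.3 is not reproduced here), coverage is assumed to be available per test as a set of covered code elements, and over-budget suites are given a tiny selection weight so that a few can survive.

```python
import random


def suite_time(suite, times):
    return sum(times[t] for t in suite)


def fit(suite, times, covers, n_elems, t_max, w=100.0):
    """Weighted coverage plus an early-coverage term; -1 if over the budget."""
    if suite_time(suite, times) > t_max:
        return -1.0
    seen, s_actual = set(), 0.0
    for t in suite:
        seen |= covers[t]  # coverage of the prefix ending at t
        s_actual += times[t] * len(seen) / n_elems
    cc = len(seen) / n_elems
    s_max = cc * suite_time(suite, times)
    return cc * w + (s_actual / s_max if s_max else 0.0)


def random_suite(tests, times, t_max, rng):
    """Random order of a random subset, keeping only tests that still fit."""
    suite, used = [], 0
    for t in rng.sample(tests, rng.randint(1, len(tests))):
        if used + times[t] <= t_max:
            suite.append(t)
            used += times[t]
    return suite


def crossover(a, b, rng):
    """Hypothetical operator: swap tails at random cuts, skipping duplicates."""
    ca, cb = rng.randint(0, len(a)), rng.randint(0, len(b))
    return (a[:ca] + [t for t in b[cb:] if t not in a[:ca]],
            b[:cb] + [t for t in a[ca:] if t not in b[:cb]])


def mutate(suite, tests, rng):
    """Hypothetical operator: replace one test with one not yet in the suite."""
    unused = [t for t in tests if t not in suite]
    if not suite or not unused:
        return suite
    i = rng.randrange(len(suite))
    return suite[:i] + [rng.choice(unused)] + suite[i + 1:]


def add_test(suite, tests, rng):
    unused = [t for t in tests if t not in suite]
    return suite + [rng.choice(unused)] if unused else suite


def delete_test(suite, rng):
    if len(suite) < 2:
        return suite
    i = rng.randrange(len(suite))
    return suite[:i] + suite[i + 1:]


def prioritize(tests, times, covers, n_elems, p_t=0.5, s=20, d_max=30,
               p_c=0.7, p_m=0.1, p_a=0.1, p_d=0.1, w=100.0, seed=0):
    """Return the fittest ordering found after d_max generations (s >= 2)."""
    rng = random.Random(seed)
    t_max = p_t * sum(times[t] for t in tests)

    def score(ts):
        return fit(ts, times, covers, n_elems, t_max, w)

    population = [random_suite(tests, times, t_max, rng) for _ in range(s)]
    for _ in range(d_max):
        scores = [score(ts) for ts in population]
        order = sorted(range(s), key=lambda i: scores[i], reverse=True)
        # Carry the two best suites into the next generation unchanged.
        next_gen = [population[order[0]], population[order[1]]]
        # Roulette weights; over-budget suites keep a tiny chance (assumption).
        weights = [max(f, 0.0) + 1e-6 for f in scores]
        while len(next_gen) < s:
            a, b = rng.choices(population, weights=weights, k=2)
            if rng.random() < p_c:
                a, b = crossover(a, b, rng)
            for child in (a, b):
                if rng.random() < p_m:
                    child = mutate(child, tests, rng)
                if rng.random() < p_a:
                    child = add_test(child, tests, rng)
                if rng.random() < p_d:
                    child = delete_test(child, rng)
                next_gen.append(child)
        population = next_gen[:s]
    return max(population, key=score)
```

Calling prioritize with a list of test names, a dict of execution times, a dict mapping each test to the set of code elements it covers, and the total number of code elements returns the best ordering found. Since the initial suites all fit the budget and the two best are always kept, the returned suite never exceeds tmax in this sketch.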

4.2.2 Fitness Function

The CalcFitness(P, TSj, pT, tc, w) method uses fit(P, TSj, tc, w) to calculate fitness. The fitness function, represented by fit, assigns each test suite a fitness based on (i) the percentage of code covered in P by that test suite and (ii) the time at which each test covers its associated code in P. It is then appropriate to divide fit into two parts such that fit(P, TSj, tc,


w) = Fprimary(P, TSj, tc, w) + Fsecond(P, TSj, tc). The primary fitness Fprimary is calculated by measuring the code coverage cc of the entire test suite TSj. Because the overall coverage of the test suite is more important than the order in which the coverage is attained, Fprimary is weighted by multiplying the percentage of code covered by the program coverage weight, w. The value of w should be sufficiently large that when Fprimary and Fsecond ∈ [0, 1] are added together, Fprimary dominates the result. Formally, for some TSj ∈ perms(2^T),

$$F_{\text{primary}}(P, TS_j, tc, w) = cc(P, TS_j, tc) \times w \tag{1}$$

The second component Fsecond considers the incremental code coverage of the test suite, giving precedence to test suites whose earlier tests have greater coverage. Fsecond is also calculated in two parts. First, Fs-actual is computed by summing the products of the execution time time(Ti) and the code coverage cc of the sub-test suite TSj{1,i} = (T1, ..., Ti) for each test case Ti ∈ TSj. Formally, for some TSj ∈ perms(2^T),

$$F_{\text{s-actual}}(P, TS_j, tc) = \sum_{i=1}^{|TS_j|} time(T_i) \times cc(P, TS_{j\{1,i\}}, tc) \tag{2}$$

Fs-max represents the maximum value that Fs-actual could take (i.e., the value of Fs-actual if T1 covered 100% of the code covered by TSj). For a TSj ∈ perms(2^T),

$$F_{\text{s-max}}(P, TS_j, tc) = cc(P, TS_j, tc) \times \sum_{i=1}^{|TS_j|} time(T_i) \tag{3}$$

Finally, Fs-actual and Fs-max are used to calculate the secondary fitness Fsecond. Specifically, for TSj ∈ perms(2^T),

$$F_{\text{second}}(P, TS_j, tc) = \frac{F_{\text{s-actual}}(P, TS_j, tc)}{F_{\text{s-max}}(P, TS_j, tc)} \tag{4}$$

As an example of a fitness calculation, let the program coverage weight w = 100, P be a program, and tc be a test adequacy criterion. Suppose TSj = (T1, T2, T3). Also, assume we have execution times time(T1) = 5, time(T2) = 3, and time(T3) = 1, and test suite code coverage cc(P, TSj, tc) = 0.20. Then,

Fprimary(P, TSj, tc, w) = 0.2 × 100 = 20

Fsecond next gives preference to test suites that have more code covered early in execution. To calculate Fsecond, the code coverages of TSj{1,1} = (T1), TSj{1,2} = (T1, T2), and TSj{1,3} = (T1, T2, T3) must each be measured. Suppose for this example that cc(P, TSj{1,1}, tc) = 0.05, cc(P, TSj{1,2}, tc) = 0.19, and, as already known, cc(P, TSj{1,3}, tc) = cc(P, TSj, tc) = 0.20. Fsecond is calculated as follows:

Fs-actual(P, TSj, tc) = (5 × 0.05) + (3 × 0.19) + (1 × 0.20) = 1.02
Fs-max(P, TSj, tc) = 0.2 × (5 + 3 + 1) = 1.8
Fsecond(P, TSj, tc) = 1.02 / 1.8 = 0.567
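The arithmetic of this example can be reproduced with a few lines of code. The sketch below is illustrative only: the function name fitness and its argument layout (a list of execution times and a list of prefix coverages) are assumptions, and returning 0 for Fsecond when Fs-max is zero is a choice made here to avoid division by zero, not something the text specifies.

```python
def fitness(times, prefix_coverage, w=100.0):
    """Return (F_primary, F_second, fit) for one ordered test suite.

    times[i] is the execution time of the (i+1)-th test; prefix_coverage[i]
    is the coverage cc of the first i+1 tests, so prefix_coverage[-1] is the
    coverage of the whole suite.
    """
    if not times or len(times) != len(prefix_coverage):
        raise ValueError("need one prefix coverage value per test")
    cc_total = prefix_coverage[-1]
    f_primary = cc_total * w                                        # equation (1)
    f_s_actual = sum(t * c for t, c in zip(times, prefix_coverage))  # equation (2)
    f_s_max = cc_total * sum(times)                                  # equation (3)
    # Equation (4); the zero case is an assumption of this sketch.
    f_second = f_s_actual / f_s_max if f_s_max > 0 else 0.0
    return f_primary, f_second, f_primary + f_second


if __name__ == "__main__":
    # Values from the worked example: times 5, 3, 1 and coverages 0.05, 0.19, 0.20.
    print(fitness([5, 3, 1], [0.05, 0.19, 0.20], w=100))
```

With w = 100 the primary term is two orders of magnitude larger than the secondary term, so ordering effects only break ties between suites with nearly equal total coverage.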

Adding Fprimary and Fsecond gives the total fitness value Fj of TSj. Therefore, in this example,


fit(P, TSj, tc, w) = Fprimary(P, TSj, tc, w) + Fsecond(P, TSj, tc) = 20 + 0.567 = 20.567

If a test suite's execution time time(TSj) is greater than the time budget tmax, Fj is automatically set to -1 by the CalcFitness(P, TSj, pT, tc, w) method. Because such a test suite violates the execution time constraint, it cannot be a solution and thus receives the worst fitness possible. While a test suite TSj with Fj = -1 could simply be left out of the next generation Rd+1, populations containing individuals with a fitness of -1 can actually be favorable. Since the "optimal" test suite prioritization likely teeters on the edge of exceeding the designated time budget, any slight change to a TSj with Fj = -1 could create a new valid test suite. Therefore, some TSj's with Fj = -1 are maintained in the next generation. If the test suite execution time time(TSj)