Recentering and Restarting Genetic Algorithm Variations for DNA Fragment Assembly

James Hughes, Computer Science, Brock University, St. Catharines, ON, Canada, [email protected]
Sheridan Houghten, Computer Science, Brock University, St. Catharines, ON, Canada, [email protected]
Guillermo M. Mallén-Fullerton, Engineering, Universidad Iberoamericana, México D.F., México, [email protected]
Daniel Ashlock, Mathematics and Statistics, University of Guelph, Guelph, ON, Canada, [email protected]

Abstract—The Fragment Assembly Problem is a major component of the DNA sequencing process and is identified as NP-Hard. A variety of approaches to this problem have been used, including overlap-layout-consensus, de Bruijn graphs, and greedy graph-based algorithms. The overlap-layout-consensus approach is one of the more popular strategies and has been studied with a collection of heuristics and metaheuristics. In this study, heuristics and Genetic Algorithm variations are combined to exploit their respective benefits. These algorithms were able to produce results that surpassed the best results obtained by a collection of state-of-the-art metaheuristics on ten of sixteen popular benchmark data sets.

Index Terms—Bioinformatics, Evolutionary Algorithms, Genetic Algorithms, Representation, DNA Fragment Assembly, Recentering-Restarting, Island Model, Ring Species

I. INTRODUCTION

One of the most important and difficult parts of sequencing DNA is the process of fragment assembly. The fragments generated during sequencing need to be assembled into a consistent whole much larger than the individual sequences. Fragment lengths are generally between 20 and 1000 bases, while the final assembly may exceed 100,000,000 bases. Because of repetitive sequence and sequencing errors, there are many possible assemblies; the goal is to obtain a sequence as close as possible to the original [1].

The introduction in 1977 of Sanger sequencing [2] was a substantial advance in sequencing technology. Despite its benefits, Sanger sequencing has a number of drawbacks, such as low throughput, long running time, and dependence on clone libraries for sample preparation [3]. These disadvantages stimulated advancements in sequencing technologies. New high-throughput, whole-genome shotgun style methods and Next-Generation Sequencing (NGS) technologies were introduced that parallelized the process, producing millions of sequences concurrently [4][5]. These new sequencing technologies produce a high volume of much smaller sequence fragments, called short reads, which also exhibit a higher number of errors than Sanger sequencing. This massively increases the difficulty of the assembly process.

One of the most popular approaches to this Fragment Assembly Problem (FAP) is the "overlap-layout-consensus" (OLC) approach. Large eukaryotic genomes have been successfully assembled from short reads with the OLC approach [6][7][8].

Greedy, graph-based assemblers and de Bruijn graph strategies are alternatives that have been used with success [9]. This already difficult problem (proved NP-Hard in [10]) is further complicated by random coincident overlaps resulting from the small size of the DNA alphabet, sequencing errors, contamination, and naturally occurring repetitive sequences. Repetitive sequences are problematic to assemble properly from short reads because those reads may lie entirely within a repetitive region, making their location ambiguous. The large (factorial) number of possible combinations of fragment overlap that must be examined increases the problem difficulty even more. The ultimate goal is to find the best set of overlaps yielding a sequence as close as possible to the original.

The OLC approach, commonly used with Sanger data, is one of the most popular approaches [9]. It involves the production of a set of contigs (partial sequences derived from fragments) with very high overlap scores. With advancements in technology the OLC approach became less useful because of the larger number of smaller reads; the de Bruijn graph approach is better suited to assembling many small fragments. As NGS technology progressed, longer reads became available, allowing OLC to become feasible again [11]. Experienced biologists are required to verify the proposed sequences and make any required changes to produce a sequence as close as possible to the real DNA strand.

Multiple algorithms have been proposed for the OLC approach to the Fragment Assembly Problem, ranging from heuristics to metaheuristics (machine learning and computational intelligence). Although this problem has been studied extensively, the field is still wide open because of the challenge of scaling algorithms to full biological genomes. A Genetic Algorithm (GA) for the Fragment Assembly Problem was proposed and tested by Parsons [12]. This seminal work inspired other approaches that used GAs or variations of GAs successfully [13][14][15]. Ant Colony Optimization (ACO) algorithms were also applied in [16][17]. A local search method named PALS produced significant results for this problem [18]. A Beehive algorithm was also proposed for the FAP by Firoz et al. [19], which was inspired by work done in [20][21].

Fig. 1. Layout Phase and Consensus Phase Visualization. Note that the Consensus Phase is based on the 8 sequenced fragments from the Layout Phase.

Hybrid approaches combining parallelized differential evolution and particle swarm optimization were implemented by Mallén-Fullerton and Fernandez-Anaya [1]. As NGS technologies advanced and the number of small fragments increased, non-metaheuristic approaches such as the de Bruijn graph technique came into greater use [11]. These include approaches by Zerbino & Birney, who created a tool called Velvet [22]; Hernandez et al., who created Edena, a tool that assumes no errors in the fragments and eliminates redundancies using transitivity before applying heuristics [23]; and Li et al., who describe two tools for short-read de novo assembly called ABySS and SOAPdenovo [24].

Parsons also observed that the FAP resembles the Travelling Salesman Problem (TSP) but argued that it could not be easily translated for a variety of reasons. Mallén-Fullerton and Fernandez-Anaya managed to overcome the obstacles posed by Parsons, with significant results [1]. Only a few adjustments are required to alter a FAP problem instance to work with popular TSP algorithms (a sketch of one such conversion follows the section outline below).

The remainder of this paper is organized as follows. Section 2 describes the Fragment Assembly Problem along with information regarding the data sets chosen. Section 3 gives a description of the experimental design. Section 4 presents a discussion of the results obtained in this work, and lastly Section 5 covers conclusions and future work.
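To make the conversion concrete, the following is a minimal sketch of one standard way to phrase the layout problem as a TSP instance. It assumes a precomputed matrix of pairwise overlap scores and illustrates the general idea, not the exact transformation used in [1].

```python
# Hypothetical sketch: converting a FAP layout instance into a TSP
# distance matrix. `overlap[i][j]` is an assumed precomputed pairwise
# overlap score. TSP solvers minimize tour length while the FAP
# maximizes overlap, so scores are flipped; a zero-cost dummy city lets
# the optimal cycle break into an optimal layout path.

def fap_to_tsp(overlap):
    n = len(overlap)
    m = max(max(row) for row in overlap)        # largest overlap score
    dist = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(n):
            if i != j:
                dist[i][j] = m - overlap[i][j]  # high overlap -> short edge
    # Row and column n (the dummy city) stay 0, so the optimal tour can
    # "cut" itself anywhere at no cost, yielding a path over fragments.
    return dist
```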

II. DNA FRAGMENT ASSEMBLY PROBLEM

The process of assembling DNA fragments is one of the most challenging steps in DNA sequencing. Shotgun style sequencing and NGS are two popular methods used to obtain DNA fragments. Both work by making several copies of the sequence and dividing (shearing) them into a collection of many small fragments. The problem addressed in this study is assembling these fragments into contigs. One of the greatest advantages of the shotgun style and NGS approaches is their speed, high level of automation, and scalability [25].

Coverage is defined as the number of bases that exist in all of the fragments divided by the total length of the target DNA sequence [26]. Normally the coverage is around 10 [1], meaning that on average one could expect to find 10 instances of a given base for a unique position from the original sequence within the set of fragments. A coverage greater than 1 is required to ensure that multiple fragments contain the same base, in turn meaning that there is a good chance that many of the fragments overlap. Organisms with substantial repetitive sequence may require more coverage to compensate for the resulting ambiguities. These overlaps are required to make ordering of the fragments possible.

The process of assembling these fragments into a set of contigs consists of three steps in the overlap-layout-consensus approach:

1) Overlap: Compare each fragment to all other fragments generated by the reading process to calculate the largest overlap between a suffix and prefix of two fragments. This overlap must exceed some predefined threshold to eliminate non-significant and coincidental overlaps. Note that when calculating the overlap scores it is important to consider all orientations of the fragments, not just the orientation in which they are received from the sequencing phase: during sequencing it is not known which of the two DNA strands each fragment comes from, so the actual orientation of the fragments is unknown. Ultimately one wants each fragment pair's largest score. Typically this process is done with a dynamic programming algorithm applied to semiglobal alignments, such as Smith-Waterman [27] (a small sketch of this computation follows this list).

2) Layout: Determine an ordering (permutation) of the fragments that maximizes the overlap scores calculated in the overlap phase. This is the most challenging part of the OLC process. This NP-Hard problem [10] is made even more difficult by errors introduced during the sequencing process (usually 1%-5% of bases will be incorrect) [28], by redundancy, and by the fact that fragments can be in either orientation [29].

3) Consensus: Determine the most likely contig based on the results of the layout phase. This can be accomplished by applying a simple majority rule [26].

Figure 1 presents a visualization of the layout process and the generation of the consensus contig. The focus of this paper regarding the OLC approach is the most challenging part, namely the layout phase. Each fragment from the problem instance must be used once and only once in a potential layout (permutation). The major goal of this work is to maximize the overlap score of a layout; by maximizing the overlap score a tighter and better quality contig can be generated.
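As a concrete illustration of the overlap step, here is a minimal semiglobal dynamic-programming sketch for scoring the best suffix-prefix alignment of two reads. The unit match/mismatch/gap scores are assumptions, not the scoring matrices used on the benchmarks, and reverse-complement handling is omitted.

```python
# Illustrative sketch of the overlap step: semiglobal DP scoring the
# best alignment of a suffix of `a` against a prefix of `b`. The
# match/mismatch/gap scores are assumed values for illustration only.

def overlap_score(a: str, b: str, match=1, mismatch=-1, gap=-2) -> int:
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + gap   # b's aligned prefix is penalized
    # dp[i][0] stays 0: the aligned suffix of `a` may start anywhere free
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + s,   # (mis)match
                           dp[i - 1][j] + gap,     # gap in b
                           dp[i][j - 1] + gap)     # gap in a
    return max(dp[n])   # b may extend past the end of `a` for free
```

In practice each fragment would also be compared against the reverse complement of its partner, and any score below the predefined threshold would be discarded.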

A. Data Sets

The data sets selected for this work are a collection of sixteen popular benchmarks which have been used previously in a variety of works. Table I summarizes details about these data sets, and Table III contains a collection of results obtained by popular metaheuristics on the corresponding data sets [30].

TABLE I: Summary of the small benchmark data sets, along with the known information about the f-series data sets. More details about these data sets can be found in [30][31].

Benchmark      Mean Fragment Length   Number of Fragments
Sixteen Popular Instances
x60189_4       395                    39
x60189_5       286                    48
x60189_6       343                    66
x60189_7       387                    68
m15421_5       398                    127
m15421_6       350                    173
m15421_7       383                    177
j02459_7       405                    352
bx842596_4     708                    442
bx842596_7     703                    773
acin1          182                    307
acin2          1002                   451
acin3          1001                   601
acin5          1003                   751
acin7          1003                   901
acin9          1003                   1049
f-series
f25_305        25                     307
f25_400        25                     400
f25_500        27                     500
f50_315        50                     315
f50_412        50                     412
f50_498        50                     498
f100_307       100                    307
f100_415       100                    415
f100_512       100                    512
f508_354       508                    354
f635_350       635                    350
f737_355       737                    355
f1343_354      1343                   354
f1577_354      1577                   354

An additional fourteen data sets which have not been thoroughly studied in the past were also selected for analysis. Little background detail, such as coverage and original sequence length, is known about these data sets; however, including these instances allows for a more in-depth analysis of the proposed algorithms.

If an algorithm has success on relatively small benchmarks, it is beneficial to test it on much larger, scaled-up instances, such as real-world problems. In the case of the FAP one can simply take the sequences of living organisms to create such large instances. Although the small benchmarks can supply a proof of concept, large, real-world data can explore the potential of proposed strategies. Because of this, an additional instance was selected which is much larger than all of the others, based on Staphylococcus aureus COL (COL). The COL data set has long reads and contains 18,021 fragments, derived from an original 50,036 fragments by eliminating non-related and redundant fragments. A standardized set of all of the discussed data sets can be found at [31].

Fig. 2. Example traversal of the search space by the RRGA. In this example the number of transpositions used for each GA run decreases as better solutions are found, meaning the radius of the circles shrinks as the algorithm proceeds through evolution.

III. EXPERIMENTAL DESIGN

A. Algorithms

A collection of Genetic Algorithm (GA) variations was selected for use in this work: the Recentering-Restarting GA (RRGA), an Island Model GA (IM), and a GA that uses Ring Species (RS). These variations were also used in combination to better explore the search space and maximize the significance of the results.

1) Recentering-Restarting Genetic Algorithm: The RRGA has been studied in multiple works by D. Ashlock et al., on epidemic networks [32], Side Effect Machines for decoding [33], and the travelling salesman problem [34]. In all of these works the RRGA was able to produce significant results. This work uses the RRGA for its ability to avoid fixating on local optima, a common problem with many metaheuristics; this ability is gained through a sequential search of the search space for a global optimum. Figure 2 depicts this traversal through the search space.

Before the algorithm begins, a centre is chosen. This centre is a direct representation of a possible solution to the problem (in this case, a possible layout of the DNA fragments). The centre can be selected randomly or through some heuristic for the given problem. Once the centre is selected, the population can be created. Depending on which representation is used, some modifications may be required. In this work two different representations are studied: a direct representation and a transposition-based representation (see Subsection III-B). For the direct representation, a string of n transpositions is applied to the centre to generate each member of the population. For the transposition representation, strings of n transpositions are simply used as the members of the population. After the population is created, a basic GA is run. With the transposition representation an additional translation step is required for the fitness evaluation, which applies the n transpositions to the centre to generate a permutation that can be evaluated.

Once the GA completes a run, the best chromosome in the final population is compared to the current centre. If the best chromosome has a better fitness than the current centre, it becomes the new centre and the whole process repeats with the number of transpositions increased (or decreased). If the best chromosome's fitness does not improve upon the current centre's fitness, the centre is left unchanged and the process is repeated with the number of transpositions altered in the opposite direction from the case where the centre is replaced. Once these alterations are made the GA is run again, and the whole process repeats. By altering the centre and the number of transpositions used, each search differs from the previous run, allowing a very dynamic traversal of the fitness landscape.
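The control flow just described can be summarized in a short schematic. Here `run_ga`, `random_transpositions`, and `apply_transpositions` are hypothetical helpers standing in for an ordinary GA run and the perturbation operators, and the step sizes echo Table II; this is a sketch, not the authors' implementation.

```python
# Schematic of the recentering-restarting loop (direct representation).
# `run_ga`, `random_transpositions`, `apply_transpositions`, and
# `fitness` are hypothetical helpers for illustration only.

def rrga(centre, fitness, run_ga, random_transpositions,
         apply_transpositions, restarts=10, pop_size=11,
         n_trans=100, inc=10, dec=5):
    for _ in range(restarts):
        # Perturb the centre with n_trans adjacent transpositions to
        # build the initial population, then run an ordinary GA.
        pop = [apply_transpositions(centre, random_transpositions(n_trans))
               for _ in range(pop_size)]
        best = run_ga(pop)
        if fitness(best) > fitness(centre):
            centre = best                     # recenter on the improvement
            n_trans += inc                    # widen (or narrow) the search
        else:
            n_trans = max(1, n_trans - dec)   # alter step the opposite way
    return centre
```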

Fig. 3. Example of 3 subpopulations evolving separately in parallel. Migrations happen periodically between subpopulations to encourage genetic diversity.

Fig. 4. Ring Species example where d = 3 and the starting index is 1. In this example chromosome 1 can only mate with chromosomes within a distance of 3 (chromosomes 22, 23, 24, 2, 3, and 4).


2) Island Model: Another GA variation used in this work is the island model (IM). The island model was developed by Whitley and has had success with many different problems [35][36]. It is very popular due to its run times, convergence speed, and high quality of solutions. The island model is a parallel approach to evolutionary algorithms that evolves multiple subpopulations, ensuring genetic diversity by allowing each subpopulation to traverse the search space along its own path. Periodically, migrations are allowed between islands, which promotes unique offspring by combining genetically distant parents. Figure 3 depicts the subpopulations and migrations of the IM.

3) Ring Species: An example of a successful application of evolutionary algorithms that made use of RS is [37]. The motivation for ring species in evolutionary algorithms is a natural phenomenon: the differentiation of populations as they expand around geographical barriers. These populations develop slight differences, and fertility rates drop as the distance between two populations increases until some threshold distance is reached beyond which the two populations cannot produce offspring. In this work a simpler variation of ring species is used, which treats the population as a ring (the first and last members of the population are considered adjacent) and only allows mating between two chromosomes if they are within some predefined distance d of one another in the population. This idea is demonstrated in Figure 4.
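A sketch of this mating restriction, under the assumption that partners are drawn uniformly from the allowed neighbourhood:

```python
import random

# Sketch of ring-species mate selection: the population is treated as a
# ring, and a parent at index i may only mate within distance d.
# Uniform partner choice within the window is an illustrative assumption.

def ring_partner(pop_size: int, i: int, d: int) -> int:
    offset = random.randint(1, d) * random.choice((-1, 1))
    return (i + offset) % pop_size   # wrap-around: first and last adjacent

# With pop_size = 24, i = 0 (chromosome 1 in Figure 4) and d = 3, the
# possible partners are chromosomes 22, 23, 24, 2, 3, and 4.
```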


B. Representation

Direct representations are frequently used in Genetic Algorithms as they produce quality results. However, it is important to explore other ideas and representations with different characteristics while maintaining the qualities provided by a direct representation. In this work two different types of representation are used for all variations of the GAs: a direct representation and an indirect, transposition-based representation. The direct representation is simply an ordered list representing the order in which the fragments occur; the GA evolves this order to maximize the overlap scores between adjacent fragments. The indirect representation is an ordered list of transpositions that exchange neighbouring fragments of the centre. The transpositions are only applied during the fitness evaluation stage. With this representation the chromosome is the ordered list of transpositions, and the genetic operators do not directly alter the order of the fragments but rather the order in which the transpositions occur. Because this representation always exchanges neighbouring elements, it is formed from a set of local search operators; by adjusting the number of transpositions one can easily alter the degree of locality of the GA's search.
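A minimal sketch of the decoding step for the indirect representation, assuming each gene stores the position of an adjacent swap:

```python
# Sketch of decoding the transposition representation: the chromosome
# is a list of adjacent-swap positions applied to the centre only at
# fitness-evaluation time; genetic operators reorder this list, never
# the fragment layout itself.

def decode(centre, chromosome):
    layout = list(centre)                  # centre: a fragment ordering
    for pos in chromosome:                 # each gene swaps neighbours
        layout[pos], layout[pos + 1] = layout[pos + 1], layout[pos]
    return layout                          # permutation to be scored
```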

C. Experiments

Multiple sets of experiments will be performed in this work: RRGA, RRGA+IM, and RRGA+RS. All of these combinations of GA variations will also be used with the two types of representation, except for the RRGA+IM, which will not be run with the transposition representation due to complications arising from the lengths of the transpositions and the separate subpopulations. In addition, all experiments will be seeded with two different solutions: the identity permutation and a solution generated using 2-opt, a local search algorithm for the TSP [38]. Preliminary tests found that in all cases the 2-opt seed produces significantly better results than a random seed. The Lin-Kernighan algorithm [39][40] was also used as a starting seed, but it was observed that no improvements were made on any problem instance when the GAs were run. This corresponds to the observations made in [1], which suggest that the Lin-Kernighan algorithm in fact produces the optimal solution for the small problem instances. Because of these observations, only the results that used 2-opt for initial seeding are reported for the small problem instances. For the large problem instance, Lin-Kernighan was used as the initial seed, due to the much larger search space of real-world instances and the ability of these local search algorithms to find relatively good solutions. The reason 2-opt and Lin-Kernighan were chosen for this work was their success with the TSP; this, in combination with the conversion in [1] which easily transforms FAP instances into TSP instances, makes them ideal.

A memetic variation of each algorithm was studied which applied 2-opt to the best member of the population before it was compared to the centre, but surprisingly, preliminary tests showed that this method did not produce robust results and significantly slowed down evolution. Two different fitness functions were explored in this work [13] (see Equations 1 and 2), but again, preliminary tests showed that Fitness Function 1 produced better results with much faster run times. Because of this, only results from Equation 1 are reported in this work.

$$\text{Fitness} = \sum_{i=0}^{n-2} \text{Overlap}_{\text{frag}_i,\,\text{frag}_{i+1}} \quad (1)$$

$$\text{Fitness} = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} |i - j| \cdot \text{Overlap}_{\text{frag}_i,\,\text{frag}_j} \quad (2)$$
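Equation 1 translates directly into code; `overlap` is an assumed precomputed score matrix indexed by fragment id.

```python
# Direct reading of Equation 1: the sum of overlap scores between
# adjacent fragments in a layout (a permutation of fragment ids).

def fitness(layout, overlap):
    return sum(overlap[layout[i]][layout[i + 1]]
               for i in range(len(layout) - 1))
```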

D. System Parameters

The system parameters used for these experiments can be seen in Table II. These values were selected through empirical analysis. Note that the number of generations allowed for a run is 100,000,000/(population size), or 1,000,000/(population size) for the COL data set due to time constraints. This methodology ensures that each algorithm, regardless of population size, has the same number of mating events, to help maintain a fair comparison. This value was determined through preliminary tests to ensure that population convergence occurs. For the direct representation, the number of transpositions is the amount used to generate the initial population from the centre; the actual length of the chromosome is the number of fragments in the problem instance. For the transposition representation, the number of transpositions is simply the length of the chromosome.

TABLE II: System parameters used in this work. 0 was used for the number of restarts to represent a version of the algorithm not using RRGA. * means that the value was only used for the COL data set.

Setting                    Values
All Algorithms
  Pop. Size                11, 51, 101*
  Generations              10^7/pop., 10^5/pop.*
  Crossover Rate           80
  Mut. Rate                20, 50
  Num. of Mut.             5
  Num. of Restarts         0, 5, 10, 20
  Num. Trans.              100
  Num. Trans. Increase     10
  Num. Trans. Decrease     5
Ring Species
  Max Distance             2
Island Model
  Num. Islands             10
  Time Between Migrations  500, 5000
  Num. to Migrate          2

All values selected for the experiments were determined through empirical analysis. Each set of system parameters for each algorithm was executed 30 times to help ensure statistical significance. Due to time and memory constraints the COL data set was not run the full 30 times, meaning statistical significance is not assured.

IV. RESULTS AND DISCUSSION

As experiments were being performed it was observed that the results obtained were outperforming all of the best metaheuristics on many of the problem instances. Because of this it was decided to apply 2-opt to the results as a post-optimization to further improve the results for the small problem instances (sketched below). Due to the size and run times required, no post-optimization was run on the large problem instance; however, observations made when applying post-optimization to the small instances suggest that the results would be greatly improved if one were to apply it to large instances. The results presented in this paper for the relatively small instances are those obtained with the 2-opt post-optimization technique, while the results presented for COL have no post-optimization applied.

Table III contains the best results obtained by a collection of the most competitive metaheuristics that have been applied to the popular sixteen benchmark DNA FAP problem instances. Table IV contains the best and the best average results for each combination of GA techniques on each of the problem instances studied in this work. Note that Table IV is a brief summary and does not contain the complete analysis for this work, as there is too much information to include.

The first observation to be made is that at least one of the algorithms used in this work was able to outperform all of the other best metaheuristics previously used on 10 out of 16 of the previously studied benchmark data sets. In multiple cases more than one of the studied algorithms was able to produce better results.
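A hedged sketch of the 2-opt post-optimization pass: the classic segment-reversal move from the TSP [38], here driven by the Equation 1 fitness sketched earlier. Full re-evaluation per move is illustrative only; a practical implementation would compute the score delta of the affected edges.

```python
# Naive 2-opt post-optimization sketch: keep reversing a segment of the
# layout whenever the reversal raises the total overlap (Equation 1).
# Uses the `fitness` sketch above; real code would use delta updates.

def two_opt(layout, overlap):
    improved = True
    while improved:
        improved = False
        for i in range(len(layout) - 2):
            for j in range(i + 2, len(layout) + 1):
                cand = layout[:i] + layout[i:j][::-1] + layout[j:]
                if fitness(cand, overlap) > fitness(layout, overlap):
                    layout, improved = cand, True
    return layout
```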

TABLE III: Collection of results for the commonly used sixteen benchmark data sets [1]. This table assumes the results were obtained with the use of the same scoring matrices. Entries marked with a dash were not reported.

Benchmark    LKH [40]  PPSO+DE [1]  QEGA [21][20]  SA [19]  PALS [25]  SAX [25][26]  POEMS [41]
x60189_4     11478     11478        11476          11478    11478      11478         11478
x60189_5     14161     13642        14027          14027    14021      14027         -
x60189_6     18301     18301        18266          18301    18301      18301         -
x60189_7     21271     20921        21208          21271    21210      21268         21261
m15421_5     38746     38686        38578          38583    38526      38726         38610
m15421_6     48052     47669        47882          48048    48048      48048         -
m15421_7     55171     54891        55020          55048    55067      55072         55092
j02459_7     116700    114381       116222         116257   115320     115301        116542
bx842596_4   227920    224797       227252         226538   225782     223029        227233
bx842596_7   445422    429338       443600         436739   438215     417680        444634
acin1        47618     47264        47115          46955    46876      46865         -
acin2        151553    147429       144133         144705   144634     144567        -
acin3        167877    163965       156138         156630   156776     155789        -
acin5        163906    161511       144541         146607   146591     145880        -
acin7        180966    180052       155322         157984   158004     157032        -
acin9        344107    335522       322768         324559   325930     314354        -

TABLE IV: A collection of the best results and the best average results obtained by each algorithm on all data sets. Bold entries represent the best-performing algorithm's result on each data set, and italicized bold entries are results that outperformed or tied all algorithms presented in Table III. Note that there are multiple instances where more than one algorithm's results on a given data set outperformed all other algorithms presented in Table III.

Data Set     LKH [40]   Basic RRGA           Basic RRGA+RS        Basic RRGA+IM        Trans RRGA           Trans RRGA+RS
                        Max       Avg        Max       Avg        Max       Avg        Max       Avg        Max       Avg
acin1        47618      47436     47284.77   47450     47287.7    47437     47320.83   47400     47329.83   47408     47328.3
acin2        151553     151285    151088.67  151253    151086.07  151243    151078.47  151194    151101.43  151223    151099.5
acin3        167877     167035    166455.4   166882    166466.9   167214    166582.33  167202    166694.67  167144    166779.17
acin5        163906     163061    162730.13  163066    162762.03  163027    162720.77  163130    162771.03  163038    162766.37
acin7        180966     179835    179578.17  179932    179543.03  179886    179540.8   179888    179541.43  179820    179536.23
acin9        344107     342936    342522.9   342949    342474.73  342965    342486.07  342995    342548.17  342860    342531.07
bx842596_4   227920     227151    225900     227090    226011.7   227171    225101.33  226639    225343.93  227068    225410.87
bx842596_7   445422     441893    438941.1   441867    438697.57  442100    438276.1   441686    437711.73  440757    437661.3
j02459_7     116700     116198    115489.17  116110    115579.87  116336    114923.13  116029    115315.6   115943    115292.9
m15421_5     38746      38675     38404.17   38668     38397.4    38667     38331.47   38658     38410.43   38655     38410.23
m15421_6     48052      48034     47634.97   48048     47642.7    48052     47550.93   47916     47554.33   47875     47530.6
m15421_7     55171      55094     54693.57   55020     54698.2    54986     54507.53   54897     54743.37   54888     54743.37
x60189_4     11478      11478     11262.5    11478     11236.7    11478     11313.37   11478     11195.53   11478     11228.5
x60189_5     14161      14161     13963.27   14161     13966.7    14161     13959.83   14133     13950.83   14127     13968.3
x60189_6     18301      18301     18016.93   18301     18010.97   18184     17958.5    18184     18017      18184     18017
x60189_7     21271      21228     20987.3    21257     20964.43   21218     20997.8    21218     20904.2    21210     20928.67
f25_305      596        596       592.7      596       593.33     596       594.87     596       594.23     596       594.37
f25_400      777        777       773.27     777       773.3      777       775.1      777       769.27     777       769.8
f25_500      921        921       915.73     921       916.07     921       918.73     921       916.23     921       916
f50_315      1581       1581      1561.87    1579      1562.2     1580      1560.43    1576      1562.3     1579      1563.7
f50_412      1573       1571      1557.4     1572      1556.07    1569      1559.53    1568      1552.77    1570      1549.37
f50_498      1570       1568      1554.63    1568      1554.27    1570      1554.5     1566      1551.3     1568      1551.67
f100_307     2793       2784      2758.4     2779      2760.37    2781      2763.43    2778      2763.73    2779      2764.37
f100_415     2860       2848      2820.27    2848      2818.87    2842      2820.93    2844      2813.8     2842      2812.63
f100_512     2732       2717      2694.77    2718      2696.07    2713      2696.93    2713      2696.8     2711      2695.57
f508_354     18112      17887     17779.37   17888     17778.63   17872     17795.13   17853     17778.43   17868     17780.43
f635_350     22498      22155     22048.43   22164     22046.47   22145     22065.67   22142     22048.33   22101     22047.67
f737_355     25218      24744     24555.7    24703     24557.63   24706     24564.33   24704     24536.2    24725     24531.7
f1343_354    49042      47928     47632.03   47922     47632.8    47877     47641.43   47841     47638.13   47879     47636.6
f1577_354    57373      55998     55671      55958     55665.73   55942     55684.77   55894     55704.33   55943     55703.77
COL          8345841    8345841   -          8345841   -          8345841   -          8345841   -          8345841   -

There were cases where the algorithms in this work tied the results obtained by the Lin-Kernighan algorithm, which is suspected to be optimal [1]. Considering the system parameters used for each of the algorithms, there is little consistency with respect to which parameters performed best. There is somewhat more consistency regarding which of the studied algorithms performed better, although the results are still unpredictable, making it very difficult to say which algorithm performs best. However, in nearly all cases the addition of recentering and restarting the GA made a considerable positive impact on the results.


One can also observe that for the majority of the data sets all algorithms were able to obtain high quality results which were relatively close to one another. A paired t-test with a 95% confidence interval was performed for each pair of algorithms on each data set to determine if there was any statistical difference between these results. Unsurprisingly, the results of these tests were inconsistent, showing that some pairs were statistically different and others were not.

It is interesting to note that on the data sets with a relatively small number of fragments (such as the x60189 and f-series instances) the lone RRGA shows some slight consistency in producing the best results. In addition, one can see trends with the RRGA+IM producing good results on the larger instances while having consistently good average results on the smaller instances. Lastly, the transposition representation variations of the algorithms achieved the best average results on almost all of the data sets with a relatively large number of fragments. Nevertheless, it is clear that the results are inconsistent.

For the COL data set, Lin-Kernighan was used for the pre-optimization. This was due to the large size of the problem; it was assumed that 2-opt would not achieve a strong enough starting location (which was the case). Observe that no improvements were made by the metaheuristic over the results of Lin-Kernighan. This further suggests that Lin-Kernighan is able to find the optimal solution for this instance, an observation also made in [1].

As previously mentioned, 2-opt was used as a simple post-optimization technique, which statistically improved the obtained results by a large margin. In most cases, such as real-world problems, it would be ideal to use another heuristic such as Lin-Kernighan, which has been shown to produce much better results than 2-opt. This option was not used here because it would defeat the purpose of the post-optimization on these instances, as Lin-Kernighan is suspected to be able to achieve the optimal on these benchmarks. If one were using these algorithms on instances where this is not the case, such as much larger problem instances, it would be ideal to use higher quality heuristics. Although the runtime required for heuristics on larger instances becomes much larger, if the metaheuristic is able to obtain a high quality solution the post-optimization will not take as long, as the starting point will be good enough to eliminate a lot of poor quality solutions from the search space.

V. CONCLUSIONS AND FUTURE WORK

This work shows that the Recentering-Restarting Genetic Algorithm was able to produce high quality results for the DNA Fragment Assembly Problem. In addition to the RRGA, other variations, including the RRGA+Island Model and RRGA+Ring Species, along with different representations, were studied and were also able to produce high quality results. The results obtained in this work show that recentering and restarting GAs and GA variations produce significant results which substantially outperform many of the best metaheuristics for the DNA Fragment Assembly Problem on the studied benchmark problem instances. In more than one of these cases, not only did the variations outperform the best results, they also obtained the suspected optimal results.

It was found that there was little consistency regarding which of the system parameters produced the best results. This suggests that if one were to study these algorithms it would be beneficial to do a thorough parameter sweep to ensure that the algorithm is being fully explored. However, it was also observed that there was more consistency regarding which of the algorithms performed better, which suggests that spending too much time fine-tuning the system parameters could be a large waste of time, as simply exploring different algorithms might produce much better results.

Although there were some small trends regarding which algorithm produced the best results on the data sets, it is clear that there is still a lot of inconsistency. This suggests that the only strong conclusion that can be made is that there is a requirement to use multiple strategies when studying a problem such as the DNA Fragment Assembly Problem. This corresponds to observations made in [34], which demonstrate that no one algorithm dominates on all data sets for a particular problem. This trend advocates for a strategy in which multiple algorithms are applied to a problem; otherwise the study could lack meaningful results.

Although the Lin-Kernighan algorithm is able to achieve the suspected optimal for the data sets used in this work, it is highly likely that this would not be the case for all instances and for large real-world problems. In addition, as the problem instances become larger, the heuristics become infeasible due to the time required. These facts show the benefit of the metaheuristics, as there is always a tradeoff between time and accuracy. It would be ideal to use a combination of these heuristics and metaheuristics (through pre-optimization and post-optimization) to further improve the results. The results generated by Lin-Kernighan may be too rigid, as it is designed to work only with the sum of the distances in the Hamiltonian circuit. An important advantage of metaheuristics is the easy addition of fitness functions, such as those that minimize the number of contigs or maximize contig lengths.

Future work derived from this study would be to further explore the possibility of the transposition representation with the island model. The results observed in this work suggest that this combination would be highly beneficial, as the transposition representation seemed to achieve the best average results and the island model was able to obtain the best results for many of the data sets. Furthermore, it would be ideal to explore other algorithm combinations, as substantial improvements were seen when combining the IM and RS with the RRGA. It would be interesting to apply more cycles of heuristics and metaheuristics to determine if further improvements can be made; for example, 2-opt → RRGA → 2-opt → RRGA → 2-opt → etc. It would also be useful to further study the system parameters particular to the IM and RS algorithms, as they were not fully explored in this work and it is expected that such a study could produce much better results. Additional large, real-world problem instances should also be studied with enough analysis to produce statistically significant results. This is a very time consuming process, as these real-world problems are significantly larger than the benchmarks: the largest of the instances made available at [31] has over 2,000,000 fragments.

VI. ACKNOWLEDGEMENTS

This research was supported in part by the Natural Sciences and Engineering Research Council of Canada.

REFERENCES

[1] Guillermo M. Mallén-Fullerton and Guillermo Fernandez-Anaya, "DNA fragment assembly using optimization", in 2013 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2013, pp. 1570–1577.
[2] Frederick Sanger, Steven Nicklen, and Alan R. Coulson, "DNA sequencing with chain-terminating inhibitors", Proceedings of the National Academy of Sciences, vol. 74, no. 12, pp. 5463–5467, 1977.
[3] Mark Chaisson, Pavel Pevzner, and Haixu Tang, "Fragment assembly with short reads", Bioinformatics, vol. 20, no. 13, pp. 2067–2074, 2004.
[4] Neil Hall, "Advanced sequencing technologies and their wider impact in microbiology", Journal of Experimental Biology, vol. 210, no. 9, pp. 1518–1525, 2007.
[5] George M. Church, "Genomes for all", Scientific American, vol. 294, no. 1, pp. 46–54, 2006.
[6] Mark D. Adams, Susan E. Celniker, Robert A. Holt, Cheryl A. Evans, Jeannine D. Gocayne, Peter G. Amanatides, Steven E. Scherer, Peter W. Li, Roger A. Hoskins, Richard F. Galle, et al., "The genome sequence of Drosophila melanogaster", Science, vol. 287, no. 5461, pp. 2185–2195, 2000.
[7] Asif T. Chinwalla, Lisa L. Cook, Kimberly D. Delehaunty, Ginger A. Fewell, Lucinda A. Fulton, Robert S. Fulton, Tina A. Graves, LaDeana W. Hillier, Elaine R. Mardis, John D. McPherson, et al., "Initial sequencing and comparative analysis of the mouse genome", Nature, vol. 420, no. 6915, pp. 520–562, 2002.
[8] J. Craig Venter, Mark D. Adams, Eugene W. Myers, Peter W. Li, Richard J. Mural, Granger G. Sutton, Hamilton O. Smith, Mark Yandell, Cheryl A. Evans, Robert A. Holt, et al., "The sequence of the human genome", Science, vol. 291, no. 5507, pp. 1304–1351, 2001.
[9] Jason R. Miller, Sergey Koren, and Granger Sutton, "Assembly algorithms for next-generation sequencing data", Genomics, vol. 95, no. 6, pp. 315–327, 2010.
[10] Pavel Pevzner, Computational Molecular Biology: An Algorithmic Approach, vol. 1, MIT Press, Cambridge, 2000.
[11] Dent Earl, Keith Bradnam, John St John, Aaron Darling, Dawei Lin, Joseph Fass, Hung On Ken Yu, Vince Buffalo, Daniel R. Zerbino, Mark Diekhans, et al., "Assemblathon 1: A competitive assessment of de novo short read assembly methods", Genome Research, vol. 21, no. 12, pp. 2224–2241, 2011.
[12] Rebecca Parsons and Mark E. Johnson, "DNA sequence assembly and genetic algorithms: new results and puzzling insights", in ISMB, 1995, pp. 277–284.
[13] Rebecca J. Parsons, Stephanie Forrest, and Christian Burks, "Genetic algorithms, operators, and DNA fragment assembly", Machine Learning, vol. 21, no. 1-2, pp. 11–33, 1995.
[14] Satoko Kikuchi and Goutam Chakraborty, "Heuristically tuned GA to solve genome fragment assembly problem", in IEEE Congress on Evolutionary Computation, 2006 (CEC 2006). IEEE, 2006, pp. 1491–1498.
[15] Shu-Cherng Fang, Yong Wang, and Jie Zhong, "A genetic algorithm approach to solving DNA fragment assembly problem", Journal of Computational and Theoretical Nanoscience, vol. 2, no. 4, pp. 499–505, 2005.
[16] Prakit Meksangsouy and N. Chaiyaratana, "DNA fragment assembly using an ant colony system algorithm", in The 2003 Congress on Evolutionary Computation (CEC '03). IEEE, 2003, vol. 3, pp. 1756–1763.
[17] Yidan Zhao, Ping Ma, Jie Lan, Chun Liang, and Guoli Ji, "An improved ant colony algorithm for DNA sequence alignment", in International Symposium on Information Science and Engineering, 2008 (ISISE '08). IEEE, 2008, vol. 2, pp. 683–688.
[18] Enrique Alba and Gabriel Luque, "A hybrid genetic algorithm for the DNA fragment assembly problem", in Recent Advances in Evolutionary Computation for Combinatorial Optimization, pp. 101–112. Springer, 2008.
[19] Jesun Sahariar Firoz, M. Sohel Rahman, and Tanay Kumar Saha, "Bee algorithms for solving DNA fragment assembly problem with noisy and noiseless data", in Proceedings of the Fourteenth International Conference on Genetic and Evolutionary Computation. ACM, 2012, pp. 201–208.
[20] L.D. Qin, Q.Y. Jiang, Z.Y. Zou, and Y.J. Cao, "A queen-bee evolution based on genetic algorithm for economic power dispatch", in 39th International Universities Power Engineering Conference (UPEC 2004). IEEE, 2004, vol. 1, pp. 453–456.
[21] Sung Hoon Jung, "Queen-bee evolution for genetic algorithms", Electronics Letters, vol. 39, no. 6, pp. 575–576, 2003.
[22] Daniel R. Zerbino and Ewan Birney, "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs", Genome Research, vol. 18, no. 5, pp. 821–829, 2008.
[23] David Hernandez, Patrice François, Laurent Farinelli, Magne Østerås, and Jacques Schrenzel, "De novo bacterial genome sequencing: Millions of very short reads assembled on a desktop computer", Genome Research, vol. 18, no. 5, pp. 802–809, 2008.
[24] Yingrui Li, Yujie Hu, Lars Bolund, Jun Wang, et al., "State of the art de novo assembly of human genomes from massively parallel sequencing data", Human Genomics, vol. 4, no. 4, pp. 271–277, 2010.
[25] Gabriela Minetti and Enrique Alba, "Metaheuristic assemblers of DNA strands: Noiseless and noisy cases", in 2010 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2010, pp. 1–8.
[26] Gabriela Minetti, Guillermo Leguizamón, and Enrique Alba, "SAX: A new and efficient assembler for solving DNA fragment assembly problem", in 2012 Argentine Symposium on Artificial Intelligence, 2012.
[27] Temple F. Smith and Michael S. Waterman, "Identification of common molecular subsequences", Journal of Molecular Biology, vol. 147, no. 1, pp. 195–197, 1981.
[28] Eugene W. Myers, "Toward simplifying and accurately formulating fragment assembly", Journal of Computational Biology, vol. 2, no. 2, pp. 275–290, 1995.
[29] Shuba Gopal, Anne Haake, Rhys Price Jones, and Paul Tymann, Bioinformatics: A Computing Perspective, McGraw-Hill Science/Engineering/Math, 2008.
[30] Guillermo M. Mallén-Fullerton, James Alexander Hughes, Sheridan Houghten, and Guillermo Fernández-Anaya, "Benchmark datasets for the DNA fragment assembly problem", International Journal of Bio-Inspired Computation, vol. 5, no. 6, pp. 384–394, 2013.
[31] "DNA assembly problem benchmark repository", http://chac.sis.uia.mx/fragbench/, 2013, [Online; accessed 10-Sept-2013].
[32] Dan Ashlock and Colin Lee, "Characterization of extremal epidemic networks with diffusion characters", in IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB '08). IEEE, 2008, pp. 264–271.
[33] James Hughes, Joseph Alexander Brown, Sheridan Houghten, and Daniel Ashlock, "Edit metric decoding: Representation strikes back", in 2013 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2013, pp. 229–236.
[34] James Hughes, Sheridan Houghten, and Daniel Ashlock, "Recentering, reanchoring & restarting an evolutionary algorithm", in 2013 World Congress on Nature and Biologically Inspired Computing (NaBIC). IEEE, 2013, pp. 76–83.
[35] Darrell Whitley, "An executable model of a simple genetic algorithm", in FOGA. Citeseer, 1992, pp. 45–62.
[36] Darrell Whitley, Soraya Rana, and Robert B. Heckendorn, "The island model genetic algorithm: On separability, population size and convergence", Journal of Computing and Information Technology, vol. 7, pp. 33–48, 1999.
[37] D. Ashlock and A. McEachern, "Ring optimization of side effect machines", Intelligent Engineering Systems Through Artificial Neural Networks, vol. 19, pp. 165–172, 2009.
[38] G.A. Croes, "A method for solving traveling-salesman problems", Operations Research, vol. 6, no. 6, pp. 791–812, 1958.
[39] Shen Lin and Brian W. Kernighan, "An effective heuristic algorithm for the traveling-salesman problem", Operations Research, vol. 21, no. 2, pp. 498–516, 1973.
[40] Keld Helsgaun, "An effective implementation of the Lin–Kernighan traveling salesman heuristic", European Journal of Operational Research, vol. 126, no. 1, pp. 106–130, 2000.
[41] Jiří Kubalík, Petr Buryan, and Libor Wagner, "Solving the DNA fragment assembly problem efficiently using iterative optimization with evolved hypermutations", in Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation. ACM, 2010, pp. 213–214.