USING THE TRAVELING SALESMAN PROBLEM IN BIOINFORMATIC ALGORITHMS

Master's Thesis in Computer Science
Finn Rosenbech Jensen, 2006 0636

Supervisor: Christian Nørgaard Storm Pedersen

Department of Computer Science, Aarhus University, December 2010

Layout and typography in this thesis are made using LaTeX with the memoir document class. Any good idea or nice layout feature can be directly attributed to the book Introduktion til LaTeX by Lars Madsen (http://www.imf.au.dk/system/latex/bog/). The TSP tour images are made using the techniques described at http://www.cgl.uwaterloo.ca/~csk/projects/tsp/, an output-modified version of the ccvt software (http://code.google.com/p/ccvt/) and the TSP heuristic solver LKH (http://www.akira.ruc.dk/~keld/research/LKH/). All self-constructed figures are drawn using LaTeXDraw (http://latexdraw.sourceforge.net/), and all graphs and diagrams are constructed using matplotlib (http://matplotlib.sourceforge.net/).

For brevity is very good, where we are, or are not understood
Samuel Butler

Abstract

Theoretical computer science has given birth to a very intriguing and fascinating class of problems known as the Nondeterministic Polynomial-time (NP) complete problems. Among their captivating properties is their computational complexity despite often deceptively simple formulations. An example of this is the ubiquitous Traveling Salesman Problem (TSP). This problem has had an uncontested popularity in computational mathematics, leading to a plethora of techniques and programs for solving it, both exactly and heuristically. Another remarkable property of NP complete problems is their frequent occurrence in real life problems, making them more than just theoretical constructions. A practical example of this is the field of bioinformatics, where many of the challenges faced by its researchers have turned out to be NP complete problems. Yet another tantalising characteristic of the NP complete problems is their reduction property, making every problem equally difficult (or easy) to solve. In other words: to solve them all, we only have to solve one of them.

This thesis aims at utilising the above properties with the purpose of examining the effect of trying to solve bioinformatic problems using the reduction property and a state-of-the-art implementation for solving the TSP. The practical bioinformatic problem is the Shortest Superstring Problem (SSP). To assess the quality of the obtained solutions, they are compared to solutions from four approximation algorithms. To convey a full understanding of the algorithms and their approximation factors, the thesis additionally includes a self-contained survey of approximation algorithms for the SSP.

The thesis further examines the bioinformatic problems concerning Multiple Sequence Alignment (MSA) and presents the definition of a TSP based scoring function. A near-optimal MSA construction algorithm that uses this scoring, and additionally a divide-and-conquer algorithm for refining MSAs, are implemented and experimentally tested.

Based on truly convincing results, the main conclusion of the thesis is that it is definitely a promising idea to apply efficient TSP solver implementations to solve NP complete problems within bioinformatic applications. The results obtained for the implemented MSA algorithms are far more modest, although the MSA construction algorithm and the scoring function should not be dismissed without further study.

Acknowledgements

First I would like to thank my supervisor Christian Nørgaard Storm Pedersen for being utterly inspiring, positive and patient all along the long and winding road, and for giving me the idea for this thesis in the first place. I can honestly recommend choosing a supervisor who has personal experience with small children.

Second I would like to thank Gaston Gonnet and especially Keld Helsgaun for their kind and helpful responses to my email queries.

A warm thanks goes to the staff at the Department of Computer Science at Aarhus University, in particular to Torsten Nielsen, Michael Glad and Kai Birger Nielsen, for being very prompt whenever I needed help and for letting me abuse quite some 'camels' for my test runs.

Furthermore I would like to thank all the people on the office floor: Anders Hauge Nissen, Martin Højgaard Have, Sean Geggie, Mikkel Vester, Marie Mehlsen, Mette Helm, Lene Mejlby and Jesper Jakobsen; you all made a long indoor summer and fall quite enjoyable. Also many thanks to my office mates Morten Slot Kristensen and Anders Andersen for being very patient with a grumpy ol' man, and to Anders for helping me out whenever my lack of technical talent took its toll.

A hearty thanks to my "old" pair-programming partner Dennis Decker Jensen for his never failing willingness to help me out every time I have gotten myself into another fine (coding) mess.

A great many thanks to all my thesis reviewers: Anders Andersen, Jesper Jakobsen, Rasmus Lauritsen and Sarah Zakarias, who bravely accepted the harsh challenge of making my attempts at written English both readable and understandable.

Last but certainly not least, a deep and wholehearted thanks to my wife Birgit and our three children for enduring me and my absentmindedness for so long.

Finn Rosenbech Jensen


Contents

Abstract
Acknowledgements
Contents

1 Introduction
  1.1 Road Map
  1.2 Thesis Terminology and Notation

Part I  The Road Goes Ever On and On
2 The Traveling Salesman Problem
  2.1 History
  2.2 Seminal Results in TSP Computation
  2.3 Notes
3 Traveling Salesman Problem Solvers
  3.1 Characterisation of TSP Solvers
  3.2 Selection of TSP Solver
  3.3 Notes

Part II  The Shortest Superstring Problem
4 The Shortest Superstring Problem
  4.1 The Problem
  4.2 Applications of SSP
5 Preliminaries
  5.1 Strings
  5.2 Graphs and Cycle Covers
  5.3 Approximation Strategies
  5.4 Notes
6 Algorithms
  6.1 The Cycle Algorithm
  6.2 The RecurseCycle Algorithm
  6.3 The Overlap Rotation Lemma
  6.4 The Greedy Algorithm
  6.5 Reductions of Hamiltonian Path Problem Instances
  6.6 The Best Factor Algorithm
  6.7 The TSP Based Algorithm
  6.8 Summary
  6.9 Notes
7 Experiments
  7.1 Test Data
  7.2 SSP Solver Tests
  7.3 Timing Tests
  7.4 Sequencing Tests

Part III  The Circular Sum Score and Multiple Sequence Alignment
8 Sequence Alignments
  8.1 Motivation
  8.2 Pairwise Sequence Alignment
  8.3 Multiple Sequence Alignment
  8.4 Notes
9 Circular Sum Score
  9.1 Sum-of-Pairs Score
  9.2 MSAs and Evolutionary Trees
  9.3 Circular Orders, Tours and CS Score
  9.4 Notes
10 Algorithms
  10.1 Path Sum Scores
  10.2 The MSA Construction Algorithm
  10.3 The MSA Enhance Algorithm
  10.4 Notes
11 Experiments
  11.1 Reference MSA Applications
  11.2 Benchmark Database
  11.3 Experiments Set-up
  11.4 MSA Construction Tests
  11.5 MSA Enhancing Tests
  11.6 Notes

Part IV  The End of The Tour
12 Conclusion
  12.1 Future Work

List of Algorithms
List of Figures
List of Tables

Part V  Appendices
A Acronyms
B LKH output
C GenBank Nucleotid Sequences
D SSP Test Data
Bibliography

Nothing is more dangerous than an idea, if it's the only one you have
Pragmatic Programmers

Chapter 1
Introduction

One of the most intriguing notions in theoretical computer science is the existence of the class of problems known as NP complete problems. The notion characterises a class of problems for which it is "very unlikely" that an "efficient" general solution exists. The argument for the "unlikeliness" is partly that forty years of intense study have not given birth to such a solution,[1] and partly that the implications, should such a solution exist, are very far reaching. As an example, it would be possible to construct an efficient algorithm that could find the shortest mathematical proof of any theorem, thereby making the creative part of theorem proving superfluous.

The notion of an "efficient" general solution is defined in terms of polynomial[2] running-time dependency on the size of the problem instance. Thus NP complete problems can be seen as problems for which we in most cases have neither the time nor the calculating power to find the optimal solution. This essentially means that searching for exact solutions is infeasible, and that approximation algorithms and heuristics must be used instead. From a mathematical point of view this might seem unsatisfying, given that the problems have trivial mathematical solutions. From a practical point of view, however, this makes the problems very challenging and interesting, giving the opportunity and leeway to try out any conceivable idea for solving them.

[1] On August 7th 2010, Vinay Deolalikar, a recognised researcher at Hewlett Packard, published a claimed proof that P ≠ NP, implying that such solutions do not exist. This was the 7th of 8 published proof attempts in 2010 (http://www.win.tue.nl/~gwoegi/P-versus-NP.htm), and it spurred an intense effort to understand and pursue its new ideas; but in less than a week some deep and probably irremediable flaws were uncovered (http://rjlipton.wordpress.com/2010/08/12/fatal-flaws-in-deolalikars-proof).
[2] As opposed to exponential or factorial.

Another fascinating hallmark of NP complete problems is their "reduction" property. This means that each of the problems[3] can be reformulated as an instance of every other NP complete problem. The effect of this is that if an efficient solution for one of these problems were found, it could be used to find efficient solutions to all of the NP complete problems.

[3] Actually every problem belonging to the larger class of problems known as NP.

One of the most famous examples of an NP complete problem is the TSP. It is at the same time one of the most intensely investigated subjects in computational mathematics. The problem seems to cast a certain spell of fascination upon the people getting involved with it, and it has inspired studies by mathematicians, chemists, biologists, physicists, neuroscientists and computer scientists. The popularity of the TSP has had the practical effect that a lot of applications for solving it, both exactly and heuristically, have been devised.

A practical field where a lot of NP complete problems turn up is bioinformatics, or computational biology. NP complete problems occur in connection with bioinformatic challenges like Deoxyribonucleic Acid (DNA) sequencing, sequence alignments, gene identification and evolutionary tree building. Since it is in most cases infeasible to search for optimal solutions to these problems, the field of bioinformatics is very vivid in terms of devising approximation algorithms and heuristics.

This thesis can be seen in this light, as its main purpose is to examine the effect of trying to solve bioinformatics problems using the reduction property of NP hard problems and a state-of-the-art implementation for solving the TSP. This idea does run the risk of being too simplistic and general, since in effect the idea is to solve different problems in the same way. This is opposed to constructing or using specialised methods and algorithms designed to solve each specific problem, thereby taking advantage of the individual character of the problem. On the other hand, the idea could prove applicable given that more than fifty years of vast exploration and work has been put into implementing efficient TSP solvers.

The second thesis part is dedicated to the above purpose. It contains the results of heuristically solving the NP complete, bioinformatics related problem known as the SSP by reducing it to an instance of the TSP. To assess the quality of the solutions, they are compared to solutions from four approximation algorithms, one of which achieves the best known approximation factor for the SSP. To convey a full understanding of the approximation algorithms and their approximation factors, this part also contains a chapter that can be characterised as a newly devised, self-contained survey of approximation algorithms for the SSP.

As a further application of the idea, the third part of the thesis explores the use of the TSP in connection with bioinformatics problems concerning sequence alignment construction and evaluation. The part focuses on MSA and the definition of a scoring function which takes the evolutionary history of proteins into account. A near-optimal MSA construction algorithm based on this scoring function, and additionally a divide-and-conquer algorithm for refining MSAs, are implemented and experimentally tested. These algorithms originate in work done at the Swiss Federal Institute of Technology around 2000, and the thesis part covers improvements on both algorithms.

The goals and contributions of this thesis can thus be summarised as: 1) an examination of the feasibility of applying efficient TSP solver implementations to solve NP complete problems within bioinformatic applications, 2) a self-contained survey of approximation algorithms for the SSP, including descriptions of implementations and testing of solution quality, and 3) a description and an experimental testing of the improvements on the TSP based algorithms for MSA construction and MSA refinement.

1.1 Road Map

The thesis is divided into four parts:

Traveling Salesman Problem. The first part is a short (minimal?) tour through some history, definitions and theoretical results for the TSP. It also characterises TSP solvers and argues for the choice of solver to be used in the algorithm implementations. This part should provide the reader with an understanding of the TSP and the current state-of-the-art solver implementations.

Shortest Superstring Problem. The second part has its emphasis on the SSP and is by far the most theoretical and difficult part of the thesis. It starts off with a chapter introducing and defining the problem along with its applications. Next follows a chapter containing preliminaries for the theoretical treatment of the implemented algorithms. These algorithms are then the theme of the succeeding chapter, which is a full treatment of the five implemented algorithms, their approximation factors and implementations. This mix of theory and practical implementation description is intentional, in the hope that it will make the reading more varied. The part concludes with a chapter describing the experiments conducted on the algorithms and the results thereof. The part should provide the reader with an understanding of both the theoretical foundation for SSP approximation algorithms and the technical aspects of their implementation and the achieved practical results.

Multiple Sequence Alignment. The third part focuses on MSA and the evolutionary based scoring function. Both an MSA construction algorithm and an MSA improving algorithm, together with their implementations, are presented. The part finishes with a chapter describing the tests conducted on the algorithms and the results thereof. This part of the thesis should provide the reader with an understanding of sequence alignment and some of the aspects concerning constructing and estimating MSAs.

Conclusion. The fourth part contains the thesis conclusion and ideas for future work.

1.2 Thesis Terminology and Notation

This section contains a short description of the mathematical terminology and notation used in the thesis.

1.2.1 Graphs

We denote directed graphs as G(N, A) and use the names "nodes" and "arcs" for their components. Undirected graphs will be denoted G(V, E) and their components described using the names "vertices" and "edges". Arcs and edges are denoted ai and ei, or by the nodes or vertices they have as endpoints: (ni, nj) and (vi, vj). In an undirected graph the edges (vi, vj) and (vj, vi) are identical.

Remark 1
For simplicity, the definitions in the rest of this section will all be phrased in terms of undirected graphs. Each definition containing the words 'edges' and 'vertices' can be substituted by an analogous definition for directed graphs phrased using 'nodes' and 'arcs'.

A graph G'(V', E') is a subgraph of G(V, E) iff V' ⊆ V and E' ⊆ E. If V' ⊆ V and E' = { e = (vi, vj) | e ∈ E, vi, vj ∈ V' }, then (V', E') is called the subgraph induced by V'.

By a walk in a graph G(V, E) we understand a sequence of vertices v1, v2, . . . , vk, where the walk starts in v1 and continuously follows the edges (vi, vi+1) between the vertices vi, vi+1 in the sequence. A walk is called a path if each of the vertices in the sequence is distinct.[4] A walk where v1 = vk is called a circuit, and a path with v1 = vk is called a tour or a cycle.[5] A path or a tour visiting all vertices in the graph exactly once is called Hamiltonian. For a tour visiting k vertices, the short-cut notation v1, v2, . . . , vk will also be used instead of the more correct v1, v2, . . . , vk, v1. A tour T' is called a subtour of T if all vertices in T' are contained in T.

[4] We use path in the implicit meaning "simple" path.
[5] Again we use tour (cycle) in the implicit meaning simple tour (cycle).

For an undirected graph G(V, E), the "degree" of a vertex is the number of edges incident to the vertex. For a directed graph G(N, A), each node has an "in-degree" and an "out-degree" signifying the number of arcs going to and from the node respectively. The notations d_in^n and d_out^n denote the in-degree and out-degree of node n.

Example 1
In Figure 1.1 (a) the sequence (n1, n5, n4, n2) is a path, and in (b) the sequence (n1, n5, n4, n2, n1) is a tour. In subfigure (c) the sequence (n1, n5, n4, n2, n3) is a Hamiltonian path, and in subfigure (d) the sequence (n1, n5, n4, n2, n3, n1) is a Hamiltonian tour. The degrees d_in^1, d_out^2, d_out^3 and d_in^4 equal two; all other degrees equal one.

A complete graph G(V, E) is a graph where E includes all possible edges in the graph. In other words, E consists of all pairs of vertices (vi, vj), i ≠ j, in V. Finally we define the cut-set of a set S ⊆ V as

    δ(S) = { e | e = (vi, vj) ∈ E, vi ∈ S, vj ∉ S }

The cut-set of a vertex set consists of the edges having one endpoint in the set and the other endpoint outside the set. In other words, the edges that have to be "cut" in order to remove S from the graph.

Example 2
Figure 1.2 (a) shows a complete (undirected) graph with five vertices. In subfigure (b) the cut-set δ(v1) of the vertex v1 is illustrated.
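The cut-set definition translates directly into code. The following minimal Python sketch (our own illustration; function name and edge-list representation are not from the thesis) computes δ(S) for an undirected edge list:

```python
def cut_set(S, edges):
    """delta(S): the edges with exactly one endpoint inside S."""
    S = set(S)
    return [(u, v) for (u, v) in edges if (u in S) != (v in S)]

# Example 2 in miniature: the cut-set of {1} in a complete graph on 5 vertices.
K5 = [(i, j) for i in range(1, 6) for j in range(i + 1, 6)]
print(cut_set({1}, K5))  # prints the four edges incident to vertex 1
```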

Figure 1.1: paths and tours in a graph

Figure 1.2: complete graph and cut set

A bipartite graph G(V, E) = G(V', V'', E) is a graph where the vertex set V can be divided into two disjoint vertex sets V', V'' with the property that E = { (v', v'') | v' ∈ V', v'' ∈ V'' }. In other words, all edges go from V' to V''. For simplicity, the notation G(V', V'', E) is also used for bipartite graphs.

A weighted graph G(V, E, W) is a graph that associates a weight (or cost) with each of its edges. The weights are typically real numbers, that is, W : E → R. We denote the weight of an edge w(e) and the weight of a graph G by W(G) = Σ_{e ∈ E} w(e).

Finally, a multigraph is a graph which can have more than one copy of its edges.

1.2.2 Trees

A connected graph without cycles, T(V, E), is called a "tree". In a tree each vertex can have zero or more "child vertices", called its children, and at most one "parent vertex", called its parent. A vertex having no children is called a "leaf". A tree can be "rooted" or "unrooted", implying whether or not there is a single vertex having no parent, denoted the "root" of the tree. Non-leaf vertices are also called "interior" vertices.

A subtree Tv(V', E') of a tree T(V, E) is the induced subgraph where V' = {v} ∪ {vi | vi is a descendant of v}. In other words, it is the tree having v as root and including everything "below" v in T.

A special kind of tree is the "binary tree", where the number of children of each vertex is at most two. In a binary tree with n leaves there will be n − 1 interior vertices.

Example 3
Figure 1.3 (a) shows an unrooted tree with two interior vertices and five leaves. Subfigure (b) shows a rooted tree consisting of a root, two interior vertices and five leaves, and finally subfigure (c) shows a binary tree consisting of a root, three interior vertices and five leaves.

Figure 1.3: Different tree types

1.2.3 Strings

When naming strings and chars, letters from the "middle" of the alphabet (s, t, u, . . .) will be used for strings, and letters from the beginning of the alphabet (a, b, . . .) will be used for chars. The empty string will be denoted ε. For any string s, let |s| denote the length of s. For a collection of strings S = (s1, s2, . . . , sn) we denote the accumulated length of S as ||S|| = Σ_{i=1}^{n} |si|. String indexing '[ ]' will be one-based, and string slicing '[ : ]' will be start inclusive and end non-inclusive. The notation s[: i] is shorthand for s[1 : i], and s[i :] is shorthand for s[i : |s| + 1]. String concatenation will be implicit, i.e. the concatenation of strings s, t will be denoted st.

Example 4
Let s = abcde, t = cba. Then |ε| = 0, |s| = 5, s[1] = a, s[2] = b, s[5] = e and s[2 : 5] = bcd, s[1 : 3] t[2 : 4] = s[: 3] t[2 :] = abba, t^3 = ttt = cbacbacba, t^∞ = ttt. . .

1.2.4 Arrays

Since strings can be seen as char arrays, we will treat array operations analogously to strings, i.e. array indexing '[ ]' will be one-based and array slicing '[ : ]' will be start inclusive and end non-inclusive. Furthermore, for an array Arr, an element elem, and an index idx, we use the following array functions (a rough mapping to Python is sketched after this list):

• LIST() constructs an array,
• SIZE(Arr) or |Arr| denotes the number of elements in Arr,
• APPEND(Arr, elem) appends elem at the end of Arr,
• REMOVE(Arr, idx) removes the element at index idx from Arr,
• SUM(Arr) calculates the sum of the elements in Arr (requires numeric elements in Arr).
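For readers who think in code, the array operations map roughly onto Python list operations as follows. This is our own illustrative mapping; note that the thesis pseudocode is one-based while Python is zero-based:

```python
arr = list()            # LIST(): construct an empty array
arr.append(42)          # APPEND(Arr, elem): append elem at the end of Arr
arr.append(8)
n = len(arr)            # SIZE(Arr) or |Arr|: the number of elements in Arr
arr.pop(1 - 1)          # REMOVE(Arr, idx): one-based idx becomes zero-based
total = sum([1, 2, 3])  # SUM(Arr): sum of the (numeric) elements
```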


Part I

The Road Goes Ever On and On

DNA climb tour with 8190 cities

The Road goes ever on and on
Down from the door where it began.
Now far ahead the Road has gone,
And I must follow, if I can,
Pursuing it with eager feet,
Until it joins some larger way
Where many paths and errands meet.
And whither then? I cannot say.

Bilbo Baggins (J.R.R. Tolkien)

Chapter 2
The Traveling Salesman Problem

This chapter introduces the Traveling Salesman Problem and gives a short overview of its history along with some major computational results. It furthermore contains the Linear Program (LP) formulation of the TSP. As this chapter is meant as an introduction, treatment and definition of some technical terms will be postponed to later chapters.

2.1 History

The TSP is one of the most (if not the most) widely studied problems in computational mathematics. One of the reasons for this might very well be the ease of formulating and understanding the problem:

Definition 1 (The Traveling Salesman Problem)
Given a collection of cities along with the cost of travel between each pair of them, determine the cheapest route visiting each city exactly once, starting and ending in the same city.

Given this simple formulation, it might be expected that the problem could have an equally simple solution. This turns out not to be the case. Despite being easily understood, no general efficient solution to the TSP has been found. However, the problem has inspired multiple studies by mathematicians, computer scientists, chemists, physicists, psychologists and a host of non-professional researchers.

An interesting historical reference is the German handbook from 1832 by B.F. Voigt with the title "Der Handlungsreisende - wie er sein soll und was er zu thun hat, um Aufträge zu erhalten und eines glücklichen Erfolgs in seinen Geschäften gewiss zu sein - Von einem alten Commis-Voyageur".[1]

Figure 2.1: The Commis-Voyageur tour for 47 German cities (from [Applegate et al., 2007]).

The handbook contains routes through Germany and Switzerland and goes as far as to claim:

  We believe we can ensure as much that it will not be possible to plan the tours through Germany in consideration of the distances and the traveling back and forth, which deserves the traveller's special attention, with more economy. The main thing to remember is always to visit as many localities as possible without having to touch them twice.[2]

One of the tours, shown in Figure 2.1, goes through 47 German cities. It is actually of very good quality and might even be optimal given the travel conditions of that time.

The first explicit mention of the TSP as a mathematical optimisation problem is probably due to the Austrian mathematician Karl Menger, who in a mathematical colloquium in Vienna in 1930 used the formulation:

  Wir bezeichnen als Botenproblem (weil diese Frage in der Praxis von jedem Postboten, übrigens auch von vielen Reisenden zu lösen ist) die Aufgabe, für endlich viele Punkte, deren paarweise Abstände bekannt sind, den kürzesten die Punkte verbindenden Weg zu finden.[3]

Shortly after this, the TSP became popular among mathematicians at Princeton University. There does not exist any authoritative source for the origin of the problem's name, but according to Merrill Flood and A. W. Tucker it was introduced by its present day name in 1934 as part of a seminar given by Hassler Whitney at Princeton University.[4]

An additional characteristic of the TSP is that it is relatively easy to construct good solutions, but far from easy to find a provably optimal solution. This has made the problem a popular "guinea pig" for all kinds of optimisation techniques during the last half of the 20th century (see Table 2.1). An excellent example of this is the seminal paper [Dantzig et al., 1954], where the inventor of the Simplex Algorithm together with two colleagues introduced the integer LP formulation of the TSP (see Definition 2) as well as a cutting plane method[5] in order to solve a 49 cities problem instance to provable optimality.

Definition 2 (Integer LP-formulation for TSP)
Let G = (V, E, W) be a weighted graph with V = {v_1, v_2, . . . , v_n}; then the TSP in G can be formulated as

  Minimise  \sum_{e \in E} w(e) \cdot x_e  subject to

  \sum_{e \in \delta(\{v_i\})} x_e = 2    for all v_i \in V                          (degree constraints)               (2.1)
  \sum_{e \in \delta(S)} x_e \geq 2       for all S \subset V, S \neq \emptyset      (subtour elimination constraints)  (2.2)
  x_e \in \{0, 1\}                        for all e \in E                            (integer constraints)              (2.3)

Remark 2
In Definition 2 the degree constraints (2.1) ensure that all vertices have degree exactly two, i.e. a tour enters and leaves each city exactly once. The subtour elimination constraints (2.2) ensure that the solution will consist of one large cycle and not two or more sub-cycles. Finally, the integer constraints (2.3) ensure that we use each edge in a "binary" fashion, i.e. an edge is either part of the tour or it is not.

[1] The traveling salesman - how he has to be and what he has to do to acquire orders and be assured of a fortunate success in his business - by an old Commis-Voyageur.
[2] Translated from the German original by Linda Cook.
[3] We denote as messenger-problem (because this question has to be solved by any postman, and by the way also by many salesmen) the challenge of finding the shortest route between a finite number of points of which the pairwise distances are known.
[4] Although Whitney, queried some twenty years later, did not recall the problem [Dantzig, Fulkerson, and Johnson, 1954].
[5] A way of coming from the optimal fractional variable solution to the optimal integer variable solution.
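As a concrete, if naive, counterpart to Definition 1, the following Python sketch enumerates all tours of a small instance and returns the cheapest one. It is of course no substitute for the cutting-plane machinery discussed in this chapter (its running time is factorial); the function name and the weight-matrix representation are our own illustrative choices:

```python
from itertools import permutations

def brute_force_tsp(w):
    """Exhaustively find the cheapest tour in a complete weighted graph.

    w[i][j] is the travel cost between cities i and j.  Fixing city 0 as
    the starting point removes the rotational symmetry of tours.
    """
    n = len(w)
    best_tour, best_cost = None, float("inf")
    for tail in permutations(range(1, n)):
        tour = (0,) + tail
        cost = sum(w[tour[k]][tour[(k + 1) % n]] for k in range(n))
        if cost < best_cost:
            best_tour, best_cost = tour, cost
    return best_tour, best_cost
```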


Table 2.1: Some TSP history milestones (mainly from [Okano, 2009])

Year | Milestone | Contributor(s)
1954 | 49 cities instance solved to optimality using an LP and manually added cutting planes | Dantzig, Fulkerson and Johnson
1962 | 33 cities TSP contest | Procter & Gamble
1970 | Lagrangian relaxation with error about 1 % below optimal | M. Held and R. M. Karp
1972 | Proof of NP-completeness for TSP | R. M. Karp
1973 | k-opt heuristic 1 to 2 % above optimal | Lin and Kernighan
1976 | Factor 3/2-approximation algorithm for MTSP | Nicos Christofides
1983 | Simulated Annealing based heuristic | Kirkpatrick, Gelatt and Vecchi
1985 | Recurrent neural network-based heuristic | Hopfield and Tank
1990 | TSP heuristics based on k-d trees published | Bentley
1991 | TSPLIB | Reinelt
1992 | 3038 cities instance solved to optimality using cutting plane generation (Concorde) | David Applegate, Robert Bixby, Vašek Chvátal and William Cook
1996 | PTAS for Euclidean TSP with running time n^O(1/ε) | Sanjeev Arora
1998 | Improved k-opt heuristic, LKH, within 1 % above optimal | Keld Helsgaun
2004 | 24 978 cities instance solved by LKH and proven optimal by Concorde | Applegate, Bixby, Chvátal, Cook and Helsgaun
2006 | 85 900 cities instance solved and proven optimal using LKH and Concorde | Applegate, Bixby, Chvátal, Cook, Espinoza, Goycoolea and Helsgaun

In 1962, the TSP became publicly known to a great extent in the USA due to a contest by Procter & Gamble built around a problem instance of 33 cities. The $10,000 prize for the shortest solution was at that time enough to purchase a new house in many parts of the country.

Figure 2.2: The 33 city contest from 1962.

In 1970, Held and Karp developed a one-tree[6] relaxation which provides a lower bound within 1 % of the optimum. It achieves this by relaxing the degree constraints in Definition 2 using a Minimum Spanning Tree (MST) and Lagrangian multipliers.

In 1972, Karp proved the NP-completeness of the Hamiltonian Cycle Problem (HCP), from which the NP-completeness of the TSP follows almost directly [Karp, 1972].

In 1973, Lin and Kernighan proposed a variable-depth edge exchanging heuristic for refining an initial tour. The method, now known as the "Lin-Kernighan" algorithm, performs variable k-opt moves that allow intermediate tours to be longer than the original tour.

[6] A tree containing exactly one cycle.

A k-opt move can be seen as the removal of k edges from a TSP tour followed by the patching of the resulting paths into a tour using k other edges (see Figure 2.3). The algorithm produces tours about 1 to 2 % above the optimal length and has won the reputation of being difficult to implement correctly. In [Helsgaun, 1998, Section 2.3.2, page 7] a survey paper published in 1989 is mentioned, in which the authors claim that no other implementation of the algorithm at that time had shown as good efficiency as was obtained by Lin and Kernighan in [Lin and Kernighan, 1973].


Figure 2.3: Different k-opt moves
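To make the notion of a k-opt move concrete, here is a small Python sketch of the simplest case, the 2-opt move of Figure 2.3(a), together with a greedy refinement loop. This is only our own illustration of the move itself; for simplicity the tour is treated as an open path (the closing edge back to the start is ignored), and nothing here reflects the variable-depth search that makes Lin-Kernighan effective:

```python
def two_opt_move(tour, i, j):
    """Remove edges (tour[i], tour[i+1]) and (tour[j], tour[j+1]) and
    reconnect by reversing the segment between them (a 2-opt move)."""
    return tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]

def two_opt(tour, dist):
    """Apply improving 2-opt moves until none is found (a local optimum)."""
    improved = True
    while improved:
        improved = False
        for i in range(len(tour) - 3):
            for j in range(i + 2, len(tour) - 1):
                a, b = tour[i], tour[i + 1]
                c, d = tour[j], tour[j + 1]
                # The move replaces edges (a,b), (c,d) with (a,c), (b,d).
                if dist[a][c] + dist[b][d] < dist[a][b] + dist[c][d]:
                    tour = two_opt_move(tour, i, j)
                    improved = True
    return tour
```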

In 1976, Christofides published a tour construction method that achieves a 3/2-approximation[7] by using an MST and "Perfect Matching". Apart from the Euclidean TSP, this is still the tightest approximation ratio known [Christofides, 1976].

Other examples of using the TSP as a "guinea pig" are found in the article [S. Kirkpatrick and Vecchi, 1983], which introduces the random local search technique known as "Simulated Annealing", and in the article [Hopfield and Tank, 1985], which is one of the first publications discussing "Neural Network" algorithms. Both articles use the TSP as a working example.

In 1990, Bentley developed a new, highly efficient variant of the k-d tree[8] data structure, which is used for proximity checking, while he was working on heuristics for the TSP [Bentley, 1992].

[7] I.e. guaranteeing a solution no worse than 3/2 times the optimal solution.
[8] A binary search tree structure extended to k dimensions.

In 1991, Reinelt composed and published TSPLIB,[9] a library containing many of the test problems studied over the last 50 years [Reinelt, 1991].

In 1992, David Applegate, Robert Bixby, Vašek Chvátal and William Cook solved a 3038 city TSP instance to optimality using the exact TSP solver program Concorde, whose development they had started in 1990. Concorde has ever since been involved in all provably optimal tour records.

In 1996, the first Polynomial Time Approximation Scheme (PTAS) for the Euclidean TSP was devised by Arora. The PTAS finds tours with length (1 + ε) times the optimal and has a running-time of n^O(1/ε). Since it had previously been proven that neither the general nor the Metric Traveling Salesman Problem (MTSP) has a PTAS, this result was received with surprise.

In 1998, Keld Helsgaun released a highly efficient and improved extension of the Lin-Kernighan heuristic algorithm, called Lin-Kernighan-Helsgaun (LKH). Among other characteristics, it uses one-tree approximations for determining candidate edge lists[10] and 5-opt moves. LKH has later been extended, and it has participated with Concorde in solving the largest instances of the TSP to this day. Furthermore, LKH has been holding the record for the 1 904 711 city World TSP Tour[11] since 2003. It has subsequently improved the tour three times (most recently in May 2010).

2.2 Seminal Results in TSP Computation

The ability to solve TSP instances to optimality has profited both from the discovery of new computational techniques and from the advances in computer technology. Table 2.2 shows some of the milestone results for solving TSP instances.

As mentioned in Table 2.1, the TSP tour for the 24 978 cities in Sweden depicted in Figure 2.4 was found in a cooperation between LKH and Concorde. LKH found the optimal tour, and a Concorde calculation using a cluster of 96 machines and a total running-time of 84.8 CPU years proved it optimal. The initial cutting-plane procedure on the first LP began in March 2003 and the final branch-and-cut run finished in January 2004. A final check of the solution was completed in May 2004.

The largest TSP instance that has been optimally solved to this date consists of a tour through 85 900 cities in a Very-large-scale Integration (VLSI) application. The problem arose in the Bell Laboratories in the late 1980s. This achievement also consisted in a cooperation between Concorde and LKH. Again LKH found a good tour, which was then used as an upper bound input for Concorde.

[9] http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95/
[10] A list containing the preferred routes between two cities.
[11] http://www.tsp.gatech.edu/world/

Figure 2.4: The optimal TSP tour in Sweden (from http://www.tsp.gatech.edu/sweden/tours/swtour_small.htm).

Table 2.2: Milestones for solving TSP (from [Applegate et al., 2007])

Year | Contributor(s) | Size | TSPLIB name
1954 | G. Dantzig, R. Fulkerson, S. Johnson | 49 cities | dantzig42
1971 | M. Held, R. M. Karp | 57 cities | random points
1971 | M. Held, R. M. Karp | 64 cities | random points
1975 | P.M. Camerini, L. Fratta, F. Maffioli | 67 cities | random points
1975 | P. Miliotis | 80 cities |
1977 | M. Grötschel | 120 cities | gr120
1980 | H. Crowder, M. W. Padberg | 318 cities | lin318
1987 | M. W. Padberg, G. Rinaldi | 532 cities | att532
1987 | M. Grötschel, O. Holland | 666 cities | gr666
1987 | M. W. Padberg, G. Rinaldi | 1002 cities | pr1002
1987 | M. W. Padberg, G. Rinaldi | 2392 cities | pr2392
1992 | Concorde | 3038 cities | pcb3038
1993 | Concorde | 4461 cities | fnl4461
1994 | Concorde | 7397 cities | pla7397
1998 | Concorde | 13509 cities | usa13509
2001 | Concorde | 15112 cities | d15112
2004 | Concorde | 24978 cities | sw24978
2004 | Concorde with Domino-Parity | 33810 cities | pla33810
2006 | Concorde with Domino-Parity | 85900 cities | pla85900

The determination of the optimal tour, which turned out to be the one found by LKH, required a total running-time of approximately 136 CPU years. To certify the optimality, a proof-checking procedure, which ran for over 500 hours, was devised, a project which led to an article by itself [Applegate, Bixby, Chvátal, Cook, Espinoza, Goycoolea, and Helsgaun, 2009].

2.3 Notes

This chapter is based on [Applegate et al., 2007] and [Okano, 2009]. All other references are explicitly mentioned.


The Traveling Salesman always visits Knuth first
Knuth homage web page

Chapter 3
Traveling Salesman Problem Solvers

This chapter describes the different types of TSP solvers. It further introduces the TSP solver that was used in the later algorithm implementations, and the argumentation for choosing it. To avoid getting too involved in technical details, all descriptions are kept short.

3.1 Characterisation of TSP Solvers

TSP solvers come in two main classes:

Exact Solvers. There are two groups of exact solvers. One group solves relaxations of the TSP LP formulation (see Definition 2) and uses methods like Cutting Plane, Interior Point, Branch-and-Bound and Branch-and-Cut. Another, smaller group uses Dynamic Programming. For both groups the main characteristic is a guarantee of finding optimal solutions at the expense of running time and space requirements.

Non-exact Solvers. These solvers offer potentially non-optimal but typically faster solutions; in a way, the opposite trade-off of the exact solvers. Non-exact solvers can be subdivided into:

Approximation Algorithms. These algorithms come with a worst case approximation factor for the found solution. The two traditional methods for solving the TSP are a pure MST based algorithm, which achieves a factor 2 approximation, and a combined MST and Minimum Matching Problem (MMP) based algorithm due to Christofides, which achieves a factor 3/2 approximation. Both methods are restricted to the MTSP, as they depend on the triangle inequality. The PTAS for the Euclidean TSP is mainly a theoretical result due to its prohibitive running time.

Heuristic Algorithms. These algorithms only promise a feasible solution. They range from simple tour-construction methods like Nearest Neighbour, Clarke-Wright and Multiple Fragment (often called Greedy) to more complicated tour improving algorithms like Tabu Search and Lin-Kernighan; a minimal sketch of Nearest Neighbour follows below. Finally there is a group of fascinating algorithms which unfortunately tend to combine approximate solutions and large running-times. Here we find methods like Simulated Annealing, Genetic Algorithms, Ant Colony Algorithms and machine learning algorithms like Neural Networks.
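As an example of how simple a tour-construction heuristic can be, here is a Python sketch of Nearest Neighbour. This is our own illustrative code, not taken from any of the solvers discussed:

```python
def nearest_neighbour_tour(dist, start=0):
    """Build a tour by repeatedly travelling to the nearest unvisited city."""
    n = len(dist)
    tour, unvisited = [start], set(range(n)) - {start}
    while unvisited:
        here = tour[-1]
        nearest = min(unvisited, key=lambda city: dist[here][city])
        tour.append(nearest)
        unvisited.remove(nearest)
    return tour  # the tour implicitly closes by returning to `start`
```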

3.2 Selection of TSP Solver

The main problem in choosing a TSP solver for the later algorithm implementations was to decide whether an exact solver or a heuristic solver would be appropriate, since the "state of the art" programs for each class are easily found. These implementations are the Concorde exact solver and the LKH heuristic solver respectively. This evaluation is in accordance with the findings reported in [Laporte, 2010].

Concorde. This is an ANSI C implementation consisting of approximately 130 000 lines of code. Concorde is an LP based solver that uses several modern cutting plane techniques, thus approaching the optimal value from below. As with all exact solvers, its running time and space requirements suffer from the "cost of being exact". The source code is available for academic use (http://www.tsp.gatech.edu/concorde.html).

LKH. This is a considerably optimised modification of the Lin-Kernighan algorithm, implemented in ANSI C. To quote [Applegate et al., 2007, Chapter 17.2]:

  Since its introduction, the LKH variant of the local-search algorithm of Lin and Kernighan has set the standard for high-end TSP heuristics.

LKH starts by using a Held and Karp one-tree ascent method to construct a candidate edge list. A problem instance is then solved using a number of independent runs. Each run consists of a series of trials, where in each trial a tour is determined by the modified Lin-Kernighan algorithm. The trials of a run are not independent, since edges belonging to the best tour of the current run are used to prune the search.


In the standard setting, LKH refines using 5-opt moves in ten independent runs, each containing a number of trials equalling the number of cities. A typical output from LKH can be seen in Appendix B. The code is distributed for research use (http://www.akira.ruc.dk/~keld/research/LKH/).

Figure 3.1: An LKH tour of 11 455 Danish cities.

In order to gain experience with the programs, the following experiment was conducted. On a single dedicated machine[4] both programs were challenged with the dk11455.tsp problem. As the name indicates, the problem is a tour through 11 455 cities in Denmark (see Figure 3.1). The optimal solution is a tour of length almost 21 000 km. Like most of the other large solved tours, it was found by LKH and proven optimal using Concorde. With both programs in standard configurations, the results shown in Table 3.1 and Figure 3.2 were obtained.

The experiment with Concorde was stopped after 46 days. As can be seen in Figure 3.2, it looks as if Concorde would steadily approach the optimal value.

[4] Pentium 4, 3066 MHz, 512 MB RAM.


Table 3.1: LKH and Concorde results for the TSP instance dk11455.tsp (OPT: 20 996 131)

LKH:
day 1 | started
day 6 | produced 10 tours; best: 20 996 166

Concorde:
day 1 | started with the best LKH tour (20 996 166) as upper bound; initial lower bound: 20 991 500

Day | Active nodes | Temporary files | Lower bound
day 8 | 557 | 1.2 GB | 20 992 921
day 15 | 987 | 2.1 GB | 20 993 016
day 22 | 1183 | 2.5 GB | 20 993 069
day 29 | 1478 | 3.0 GB | 20 993 116
day 36 | 1803 | 3.7 GB | 20 993 152
day 43 | 2027 | 4.1 GB | 20 993 171
day 46 | 2192 | 4.4 GB | 20 993 182

A crude extrapolation[5] predicts that the optimal value would have been found in about 16 months. Although the TSP instance in this experiment was larger than the expected size of the instances in the thesis experiments, the characteristics of the two programs were clearly illustrated. In the end, LKH was chosen as the standard TSP solver for the experiments in this thesis on account of the following:

• As mentioned in [Laporte, 2010], it consistently produces optimal solutions on small and medium size instances: in less than a second for n = 100, and in less than a minute for n = 1000.[6] This covers the expected size of instances in the thesis experiments very well.
• The need for provable optimality is of minor importance when the TSP solution is used to implement heuristic algorithms.

[5] Not scientifically sound, but very tempting.
[6] On a 300 MHz G3 Power Macintosh.

3.3 Notes

Concorde is described in detail by its authors in [Applegate et al., 2007], and LKH is described in detail in the technical reports [Helsgaun, 1998, 2006] and in the papers [Helsgaun, 2000, 2009].

[Plot: tour length (offset by 20 990 000) versus days, showing the LKH tour values, the optimal tour length and the Concorde lower bound.]

Figure 3.2: Results on dk11455.tsp. Note the addition of 20 990 000 on the y-axis.


Part II

The Shortest Superstring Problem

Person and bases tour with 16384 cities

The Shortest Common Superstring Problem, an elegant but flawed abstraction
Richard M. Karp

Chapter 4
The Shortest Superstring Problem

This chapter introduces and defines the Shortest Superstring Problem. It further contains the motivation for including the problem in this thesis.

4.1 The Problem

The SSP, or Shortest Common Superstring Problem, is defined as follows:

Definition 3
Given a collection of strings S = (s1, s2, . . . , sn) over a finite alphabet Σ, with sj ∈ Σ*, j ∈ 1 . . . n, determine the shortest string containing each string in S as a (consecutive) substring. In other words, find the shortest superstring for the collection S.

Remark 3
Without loss of generality (WLOG) it can be assumed that S is substring free, i.e. no string in S is a substring of any other string in S. This assumption, which is made for the rest of this thesis part, does not influence the shortest superstring.

The SSP is, like the TSP, NP hard [Gusfield, 1997, chapter 16.17] and also MAX-SNP hard [Blum, Jiang, Li, Tromp, and Yannakakis, 1994, Section 7], which means that no PTAS exists unless P = NP. It is therefore highly improbable that a general, efficient algorithm for solving the SSP exists; thus the best we can hope for is approximation algorithms and heuristics.

4.2 Applications of SSP

Articles concerning the SSP typically mention applications in bioinformatics and data compression [Blum et al., 1994; Breslauer, Jiang, and Jiang, 1996; Ilie and Popescu, 2006; Kaplan, Lewenstein, Shafrir, and Sviridenko, 2005; Kaplan and Shafrir, 2005; Romero, Brizuela, and Tchernykh, 2004; Tarhio and Ukkonen, 1988]. The first application refers to the DNA-sequencing method known as Shotgun Sequencing; the second refers to the idea of compressing a set of strings into their shortest superstring, combined with an index pair for each string marking the beginning and ending index respectively. As we will argue, the citation of these applications seems to be based more upon "transitive" citing than upon real applications.

4.2.1 Shotgun Sequencing

When having to sequence a genome, it is not technically possible to sequence strands longer than 100 to 1000 basepairs. For this reason, longer strands must be divided into smaller pieces and subsequently re-assembled. In Shotgun Sequencing one first makes many copies (clones) of the DNA of interest. These copies are then randomly cut into smaller pieces by physical, enzymatic or chemical means. The difficulty in assembling the original sequence stems from the fact that though the smaller pieces constitute overlapping sequences of the original DNA, the succession of the overlapping sequences is lost in the cutting process. This is where the SSP comes into play, as a theoretical model for re-assembling the overlapping pieces into a superstring, which is assumed to be a good approximation of the original DNA sequence. This neat, simple idea has set off many studies of efficient approximation algorithms for the SSP, but as a model for sequence assembly it does have some major problems, especially concerning read-errors in the fragments and identical sub-sequences (repeats) in the original DNA sequence.

Example 5
At the top of Figure 4.1 a sequence with two repeats (marked R) is cloned four times, and the clones are cut into smaller pieces. After assembly (at the bottom of the figure) only one occurrence of the repeat region will be present, together with a smaller and originally non-existent region (marked R'). This is due to all the "inner" fragments of the repeats being assembled as belonging to a single occurrence of the repeat.

A rather devastating critique of using the SSP directly to solve sequencing problems is given by Richard M. Karp in [Karp, 2003]:


Figure 4.1: sequence assembly problem with repeats

  The shortest superstring problem is only superficially related to the sequencing assembly problem. Its difficulty stems from pathological examples that are unlikely to occur in practise. It does not take noisy reads into account, and admits solutions with an unreasonably large number of mutually overlapping reads.

Although Shotgun Sequencing was the most advanced technique for sequencing genomes from about 1995-2005,[1] and was used by the privately funded group solving the mapping of the human genome [Venter, Adams, Myers, Li, Mural, Sutton, Smith, Yandell, Evans, Holt, et al., 2001], newer technologies known as next-generation sequencing have surfaced. These techniques are far superior to Shotgun Sequencing, both in the amount of data produced and with respect to the time needed for producing the data; they do tend to have lower accuracy, though [Metzker, 2009].

[1] http://en.wikipedia.org/wiki/Shotgun_sequencing

4.2.2 Data compression

The idea of using the SSP as a way of compressing a set of strings originates in the article [Gallant, Maier, and Storer, 1980], which itself refers back to two technical reports by J. Storer (one of the authors) from 1977. The practical application mentioned is the need for string compression by a compiler. It has not been possible to track down references describing the application of this technique in compilers. Furthermore, string compression is hardly of any significant concern for modern day compilers, as the time versus space trade-off clearly seems to favour speed. An even more serious objection, mentioned in [Gusfield, 1997, page 445, exercise 25, 26], is that it seems like a rather exotic idea to try to solve a compression problem by using an NP complete problem, especially as there are simpler, better and more efficient algorithms available, e.g. based on Cycle Covers (CCs).

4.2.3 Why is the SSP then interesting anyway?

Having defied the traditional practical uses of the SSP, and thereby almost removed the foundation for working with the problem, we now face the challenge of arguing for its inclusion in this thesis. In prioritised order the reasons are: 1) the SSP is in itself an interesting problem; 2) historically, the SSP has been one of the first sequencing models "bridging" between computer science and biology; 3) the SSP has been the subject of a considerable amount of research leading to a number of approximation algorithms. To quote from [Gusfield, 1997, Chapter 16.1]: "This pure, exact string problem is motivated by the practical problem of shotgun sequence assembly and deserves attention if only for the elegance of the results that have been obtained."; 4) the SSP is closely related to the TSP, so it would seem unnatural not to include it at all; and finally 5) the SSP, as a crude model, is usable in bioinformatics. Besides the Shotgun Sequencing method, it is related to the class of DNA-sequencing methods known as "Sequencing by Hybridisation", and it has also been proposed as a model for genome compression of viruses [Ilie and Popescu, 2006]. With these reasons in mind, the rest of this thesis part will hopefully seem founded to the reader.

Chapter 5
Preliminaries

This chapter will introduce notation, definitions, some basic facts about strings, and approximation strategies, all of which will be needed in the later treatment of approximation factors and algorithm descriptions. For further notation usage, see Section 1.2.

5.1 Strings

Given two strings s, t where s = xy and t = yz with x, z ≠ ε, if y is the longest such string it is called the (maximal) overlap of s and t and is denoted ov(s, t). The string x is called the (minimal) prefix of s and t and is denoted pr(s, t). Similarly, z is sometimes called the suffix of s and t and is denoted suf(s, t). The length of the prefix, |pr(s, t)|, is called the distance from s to t and is denoted dist(s, t).

Let s = half, t = alfalfa then ov(s, t) = alf, ov(t, s) = , ov(t, t) = alfa, |ov(s, t)| = 3, |ov(t, s)| = 0, |ov(t, t)| = 4 pr(s, t) = h, pr(t, s) = alfalfa, pr(t, t) = alf, dist(s, t) = 1, dist(t, s) = 7, dist(t, t) = 3 note that |s| = dist(s, t) + |ov(s, t)| 2

33

5. P RELIMINARIES Definition 4 (Merge of strings) Let S = (s1 , s2 , . . . , sn ) be an ordered collection of strings hs1 , s2 , . . . , sn−1 , sn i = pr(s1 , s2 ) pr(s2 , s3 ) · · · pr(sn−1 , sn ) sn is called the merge of the strings (s1 , s2 , . . . , sn ). If S is substring free the order of the string in the merge is well-defined, as no two strings will begin or end at the same index. 2 Example 7 For two strings, s, t we have hs, ti = pr(s, t) t = pr(s, t) ov(s, t) suf (s, t) = s suf (s, t), |hs, ti| = dist(s, t) + |t| = |s| − |ov(s, t)| + |t| = |s| + |t| − |ov(s, t)|

2

Lemma 1 For an ordered collection of strings S = (s1 , s2 , . . . , sn ), the merge of the strings is the shortest superstring for the strings in that order. 2 To get warmed up and for the sake of completeness we include the proof: P ROOF ([G USFIELD , 1997]) The fact that the merge is a superstring is easiest seen by induction on the string indices beginning from the end of the merge. Base case: sn is a substring of the merge as it is explicitly included. Induction case: Assume si+1 is a substring of the merge. Now si = pr(si , si+1 )ov(si , si+1 ). As pr(si , si+1 ) is explicitly included in the merge, it remains to show that ov(si , si+1 ) follows directly after. Since S is substring free we have ov(si , si+1 ) is a prefix of si+1 , but si+1 is per induction hypothesis a substring of the merge starting at pr(si , si+1 ). The minimality follows from the definition of pr(·, ·)



Definition 5 (Cyclic String) For any string s we denote by φ(s) the cyclic string resulting from “wrapping around” the string so that s |s| precedes s[1]. The length of φ(s) equals |s|. 2 Any string t which is a substring of s∞ is said to map to the cyclic string φ(s) and correspondingly φ(s) is said to be mapping all such strings. 34


Let s = abac then all the following strings map to φ(s) t = ba u = bacaba v=

acabacabaca 2

Definition 6 (Factor and Period) A string x is a factor of another string s if s = xi y with y a (possibly empty) prefix of x and i ∈ N. The length of the factor |x| is called a period of s. 2 The shortest factor of a string s and its period are denoted f actor(s) and period(s) respectively. In case s is semi-infinite,1 we call s periodic if s = xs for x 6=  The shortest such x is called the factor of s and its length the period of s. Two strings s, t (finite or not) are called equivalent if their factors are cyclic shifts of each other, i.e. ∃ x, y : f actor(s) = xy and f actor(t) = yx. The following is a classical string algorithm result Lemma 2 (The Periodicity Lemma) Let s be a string with periods p, q. If |s| ≥ p + q then the greatest common divisor of p, q = gcd(p, q) will also be a period for s. 2 P ROOF ([G USFIELD , 1997]) Assume WLOG that q < p. Consider the number d = p − q. We will show that s[i] = s[i + d] ∀ i ∈ 1 . . . |s| − d. Case q < i This means s[i − q] exists. We now have: s[i] = s[i − q] (as q is a period for s) = s[i − q + p] = s[i + d] (as p is a period for s) Case i ≤ q This means s[i + p] exists as |s| ≥ p + q. We now have: s[i] = s[i + p] (as p is a period for s) = s[i + p − q] = s[i + d] (as q is a period for s) As |s| ≥ (p − q) + q we have shown, that applying Euclids Algorithm on p, q we will at each recursive step have numbers that are periods for s. As Euclids Algorithm ends with gcd(p, q) the claim follows.  1

an infinite string with one fixed end: s = ababab . . .
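To see the Periodicity Lemma in action, here is a small Python check (illustrative only, not from the thesis):

from math import gcd

def periods(s):
    # All p such that s[i] == s[i + p] for every valid index i.
    return [p for p in range(1, len(s) + 1)
            if all(s[i] == s[i + p] for i in range(len(s) - p))]

s = "ababab"                      # periods 2, 4 and 6
p, q = 2, 4
assert len(s) >= p + q            # the lemma's precondition
assert gcd(p, q) in periods(s)    # gcd(2, 4) = 2 is indeed a period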


5.2 Graphs and Cycle Covers

To model the overlaps of a collection of strings S = (s1, s2, . . . , sn) the following two complete graphs are very useful:

Definition 7 (Prefix (Overlap) Graph) Let G(N, A, W) be a weighted, directed graph with N = S, A = {aij} ∀ i, j ∈ 1 . . . n, and W : A → N with

w(aij) = dist(si, sj)    (prefix graph)
w(aij) = |ov(si, sj)|    (overlap graph)

2

We associate each arc in the prefix (overlap) graph with the prefix (overlap) of the two strings corresponding to the start- and end-nodes of the arc.

Example 9 The prefix (distance) and overlap graphs for the string collection (ate, half, lethal, alpha, alfalfa) are illustrated in Figure 5.1. Only arcs having values different from the length of the string (in the prefix graph) and zero (in the overlap graph) are shown. 2
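The two graphs are straightforward to materialise as weight matrices. The following Python sketch (an illustration, not the thesis code) builds both for the strings of Figure 5.1:

def ov_len(s, t):
    # |ov(s, t)|: longest proper suffix of s that is a prefix of t.
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0

strings = ["ate", "half", "lethal", "alpha", "alfalfa"]
overlap = [[ov_len(s, t) for t in strings] for s in strings]
prefix = [[len(s) - ov_len(s, t) for t in strings] for s in strings]
# e.g. ov(lethal, alpha) = 'al', so the overlap arc has weight 2 and
# the corresponding prefix arc has weight |lethal| - 2 = 4.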


[Figure 5.1: Graphs for the strings 'ate', 'half', 'lethal', 'alpha' and 'alfalfa': (a) prefix graph, (b) overlap graph (from [Blum et al., 1994])]

Note that a cycle in the prefix graph for S consisting of the nodes n1, n2, . . . , nk will be of the form

C = φ(t) = φ(pr(s1, s2) pr(s2, s3) · · · pr(sk−1, sk) pr(sk, s1))

The notation C = s1, s2, . . . , sk will be used to describe such a cycle in terms of the strings from S that C is mapping.


With t_si we will denote the string equivalent to t obtained by "breaking" the cycle at the point where string si starts. Defining sn+1 = s1 we then have

t_si = pr(si, si+1) pr(si+1, si+2) · · · pr(si−2, si−1) pr(si−1, si)

The next two lemmas prove a rather intuitive property and a more interesting property concerning cycles in the prefix graph:

Lemma 3 For any cycle C = φ(t) = s1, s2, . . . , sn in the prefix graph for S we have that every si, i ∈ 1 . . . n, and ⟨s1, s2, . . . , sn⟩ map to φ(t). 2

PROOF As the string collection is a substring free set, si is a prefix of pr(si, si+1) si+1, which again is a prefix of pr(si, si+1) pr(si+1, si+2) si+2 and so on (defining sn+1 = s1). By letting k = ⌈|si|/|t|⌉ we get that si is a prefix of t^k_si, which itself maps to φ(t)^{k+1}. The claim for ⟨s1, s2, . . . , sn⟩ can be shown by an analogous argument. 

In a way the "reverse" property also holds:

Lemma 4 Let S be a collection of strings and let S̄ ⊆ S with the property that all strings in S̄ are substrings of t∞. Then there exists a cycle C = φ(t) in the prefix graph for S such that all strings in S̄ map to C. 2

PROOF All strings in S̄ will appear in t∞ with an interval of |t| characters. This induces a circular ordering of the strings in S̄. A cycle in the prefix graph for S defined by this ordering will fulfil the criteria. 

The optimal SSP solution for S = (s1, s2, . . . , sn) will be of the form ⟨sπ1, sπ2, . . . , sπn⟩ for some permutation π : [1, n] → [1, n]. This means:

OPT(S) = ⟨sπ1, sπ2, · · · , sπn⟩
= pr(sπ1, sπ2) pr(sπ2, sπ3) · · · pr(sπn−1, sπn) sπn
= pr(sπ1, sπ2) pr(sπ2, sπ3) · · · pr(sπn−1, sπn) pr(sπn, sπ1) ov(sπn, sπ1)

Note that the above concatenation of prefixes describes a cycle in the prefix graph for S. For the length of the optimal solution we then have:

|OPT(S)| = |pr(sπ1, sπ2)| + |pr(sπ2, sπ3)| + · · · + |pr(sπn, sπ1)| + |ov(sπn, sπ1)|   (5.1)

Remark 4 Expression 5.1 conveys that the length of the optimal solution is the sum of a TSP tour in the prefix graph for S plus the overlap of the "end" and "start" strings of the tour. 2

If we let |TSP∗(Gpr)| denote the length of a shortest TSP tour in the prefix graph for S, we get the following lower bound for |OPT(S)|:

|TSP∗(Gpr)| < |TSP∗(Gpr)| + |ov(sπn, sπ1)| ≤ |OPT(S)|   (5.2)

If we let wpr(aij), wov(aij) denote the weights of the arc aij in the prefix and overlap graph respectively, we have:

wpr(aij) + wov(aij) = |si|  ⇔  wpr(aij) = |si| − wov(aij), ∀ i, j ∈ 1 . . . n   (5.3)

We now turn to the important concept of a Cycle Cover:

Definition 8 (Cycle Cover of String Collection S) Let CC(S) = {C1, C2, . . . , Ci} = {φ(t1), φ(t2), . . . , φ(ti)} where ∀ sj ∈ S, ∃ φ(tk) ∈ CC(S) : sj maps to φ(tk). Then CC(S) is called a Cycle Cover of S.

2

The size of a Cycle Cover CC(S) is denoted ||CC(S)|| and equals ∑_{k=1}^{i} |φ(tk)|. The minimal length cycle cover of a collection of strings S is denoted CC∗(S). We get the following properties for a minimal CC.

Lemma 5 For every si ∈ S = (s1, s2, . . . , sn), si maps to a unique Ci ∈ CC∗(S). For every distinct pair of cycles Ci = φ(ti), Cj = φ(tj) ∈ CC∗(S) we have that ti and tj are inequivalent. 2

PROOF As the collection is substring free, every pair of strings si, sj will have pr(si, sj) ≠ ε. If si mapped onto more than one cycle in CC∗(S) it would contribute to the length of CC∗(S) with two non-empty prefixes, which contradicts the minimal length of CC∗(S). The second claim now follows immediately, as the strings mapping to two equivalent cycles would be identical. 

We can now show a useful property of the prefix and overlap graphs.

Theorem 1 (Prefix-Overlap Graph Duality) Let S = (s1, s2, . . . , sn) be a string collection. Then the following relations hold between TSP tours and CCs in the prefix and overlap graphs for S:
• A shortest TSP tour in the prefix graph will be a maximum TSP tour in the overlap graph (and vice versa).
• A minimal CC in the prefix graph will be a maximum CC in the overlap graph (and vice versa). 2

PROOF Let TSP∗(Gpr) = nπ1, nπ2, . . . , nπn be a shortest TSP tour in the prefix graph for S. Using Expression 5.3 on the preceding page we then have:

|TSP∗(Gpr)| = |pr(sπ1, sπ2) pr(sπ2, sπ3) · · · pr(sπn−1, sπn) pr(sπn, sπ1)|
= |pr(sπ1, sπ2)| + |pr(sπ2, sπ3)| + · · · + |pr(sπn, sπ1)|
= |sπ1| − |ov(sπ1, sπ2)| + |sπ2| − |ov(sπ2, sπ3)| + · · · + |sπn| − |ov(sπn, sπ1)|
= ||S|| − (|ov(sπ1, sπ2)| + |ov(sπ2, sπ3)| + · · · + |ov(sπn, sπ1)|)
= ||S|| − |ov(sπ1, sπ2) ov(sπ2, sπ3) · · · ov(sπn−1, sπn) ov(sπn, sπ1)|   (5.4)

Now, as |TSP∗(Gpr)| is minimal and ||S|| is a constant, the string length on the right hand side of Expression 5.4 has to be maximal. But this is exactly the length of a TSP tour in the overlap graph for S. The other relation can be proved analogously. 

We can now formulate the following lower bound relation:

||CC∗(S)|| ≤ |TSP∗(Gpr)| < |OPT(S)|   (5.5)

by noting that a TSP tour in the prefix graph for S is a special case of a CC for S. An equivalent consideration is to view any CC of S as a solution to the LP-formulation of the TSP (see Definition 2 on page 13) with the "subtour elimination constraints" removed.

5.3 Approximation Strategies

From the results in the previous section it can be seen that a possible approximation strategy for the SSP would be to approximate the length of a TSP tour in the prefix graph for the string collection. Unfortunately the best known approximation factors for the minimum Asymmetric Traveling Salesman Problem (ATSP) are of the order of log(n) [Kaplan et al., 2005], which makes this strategy less interesting.

Another approach is to consider the length of the optimal solution expressed in the overlap graph for the string collection S:

|OPT(S)| = |⟨sπ1, sπ2, · · · , sπn⟩|
= |pr(sπ1, sπ2)| + |pr(sπ2, sπ3)| + · · · + |pr(sπn−1, sπn)| + |sπn|
= (|sπ1| − |ov(sπ1, sπ2)|) + (|sπ2| − |ov(sπ2, sπ3)|) + · · · + (|sπn−1| − |ov(sπn−1, sπn)|) + |sπn|
= ∑_{i=1}^{n} |sπi| − ∑_{i=1}^{n−1} |ov(sπi, sπi+1)|
= ||S|| − ∑_{i=1}^{n−1} |ov(sπi, sπi+1)|   (5.6)

The right side sum in Expression 5.6 is called the total overlap or the compression for OPT(S) and will be denoted OV∗(S).

Remark 5 Expression 5.6 conveys that the length of an optimal SSP solution equals the length of S minus the length of a maximum Hamiltonian Path in the overlap graph for S. 2

The maximum Hamiltonian Path Problem (HPP) can actually be approximated by a constant factor, but this does not directly lead to a usable approximation strategy for the SSP. The reason is that the optimal solution may have a large total overlap (of size O(||S||)), which means that sacrificing even a small fraction of the total overlap can lead to bad solutions.

Example 10 Let S = (abc, bcd, cde, def, efg, fgh, ghi, hij). Then

||S|| = 8 · 3 = 24,   OV∗(S) = ∑_{i=1}^{7} |ov(si, si+1)| = 7 · 2 = 14

and |OPT(S)| = |abcdefghij| = 24 − 14 = 10. An approximation factor for the total overlap of even 2/3 could lead to an SSP solution that is 50 % larger than optimal. 2
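As a quick sanity check of Expression 5.6, the following Python fragment (illustrative only) recomputes the numbers of Example 10:

def ov_len(s, t):
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0

S = ["abc", "bcd", "cde", "def", "efg", "fgh", "ghi", "hij"]
total = sum(len(s) for s in S)                             # ||S|| = 24
compression = sum(ov_len(a, b) for a, b in zip(S, S[1:]))  # OV*(S) = 14
assert total - compression == len("abcdefghij")            # |OPT(S)| = 10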

The main problem with this approximation strategy is that we are approximating the total overlap and not the length of the SSP solution. For this strategy to be applicable it is therefore necessary to ensure that the strings in the collection do not have large mutual overlaps. This leads to the following approximation algorithm template:


Definition 9 (CC Approximation Algorithm Template) Given a collection of strings S, transform S into another set of strings R with the following properties:
1) Strings in R do not overlap too much.
2) All strings in S are substrings of a string in R, i.e. a superstring for R induces a superstring for S.
3) |OPT(R)| is not much larger than |OPT(S)|.

Remark 6 The above template achieves the following:
• from 1) it follows that the reduction to HPP is applicable, and
• from 2) and 3) it follows that an SSP solution for R is a good SSP solution for S.
These claims will be formalised in Section 6.1.1.

2

The standard approach to fill in the template is:
• Find a minimum Cycle Cover CC∗(S) = {C1, C2, . . . , Cj} in the prefix graph for S.
• For all Ci = si1, si2, . . . , sik ∈ CC∗(S) choose a representative for the cycle, r(Ci), containing all strings in S that map to the cycle as substrings. For example let r(Ci) = ⟨si1, si2, . . . , sik⟩.
• Let R = {r(Ci)}, i ∈ 1 . . . j.

The rationale behind this approach is that strings that overlap a lot will end up in the same cycle in the CC, so substituting these strings by a single representative should not increase the length of the optimal solution too much, but it will reduce the total overlap of the instance. This approach is the basis for the implemented algorithms called Cycle (Section 6.1 on page 43) and RecurseCycle (Section 6.2 on page 53) respectively.

5.4 Notes

This chapter is based on the excellent but unfinished note by Marcin Mucha [Mucha, 2007]. All other references are explicitly mentioned.


Copy from one, it's plagiarism; copy from two, it's research.
— John Milton

CHAPTER 6

Algorithms

In order to run practical experiments for solving the SSP, five algorithms were implemented. The first two are examples of CC approximation algorithms, the third is a well known greedy algorithm, the fourth an approximation algorithm achieving the best known approximation factor of 2.5, and the last algorithm is TSP based. This chapter contains descriptions of these algorithms and their actual implementations. It additionally comprises theoretical sections containing proofs for the approximation factors. It is the intention that the chapter should be self-contained with respect to these proofs. The strategy has been to include proofs for lemmas and theorems originating in directly used references and omit proofs for results that are only referenced in these. The survey character of this chapter has had the unfortunate effect of leading to sections with an abundance of lemmas, theorems and proofs. In order to compensate for this and ease the reading, most of the proofs have been elaborated on in terms of smaller steps and/or supplementary lemmas, remarks and examples.

6.1 The Cycle Algorithm

This algorithm is a variant of the algorithm TGREEDY in [Blum et al., 1994], called 3All in [Romero et al., 2004]. It consists of the following six steps, where steps 2) to 3) correspond to the template from Definition 9 on page 41. Given a list of strings S:

1) Construct the prefix graph for S.
2) Find a minimum Cycle Cover CC∗(S) = {C1, C2, . . . , Cj} in the prefix graph for S.
3) For all Ci = si1, si2, . . . , sik in the CC choose a representative r(Ci) = ⟨si1, si2, . . . , sik⟩ by breaking the cycle at an index so that |r(Ci)| is smallest possible. Define the set R = (r(Ci)), i ∈ 1 . . . j.
4) Repeat step 1) for R.
5) Repeat step 2) for R, creating a new set of representatives T, with the restriction that the CC for R should be non-trivial, i.e. no cycle in the cover includes only a single string from R.
6) Return a random merge of T.

Though the algorithm is doing the same thing twice, the restriction in step 5) ensures that the sets R and T are different. Pseudo-code for the algorithm is shown in Algorithm II.1 on the facing page.

6.1.1 Approximation factor

In order to show the approximation factor for the Cycle algorithm, we will refrain from the direct proofs in [Blum et al., 1994; Gusfield, 1997], preferring the more general proof method by Mucha. The reason for this choice is that it leads to a refinable theorem and a "reusable" proof (Theorem 2 on page 47). The approximation factor for the Cycle algorithm follows from a formalisation of the claims in Remark 6 on page 41 concerning the constructed representatives. Beginning with the last claim, and denoting the original collection of strings S and the constructed set of representatives R, we have:

Lemma 6 |OPT(R)| ≤ |OPT(S)| + |CC∗(S)| ≤ 2 · |OPT(S)|

2

PROOF ([MUCHA, 2007]) We have R = (r(C1), r(C2), . . . , r(Cj)) where each r(Ci) is of the form ⟨si1, si2, . . . , sik⟩. For each r(Ci) let r̂(Ci) = ⟨si1, si2, . . . , sik, si1⟩ and let R̂ = (r̂(C1), r̂(C2), . . . , r̂(Cj)). Since each r(Ci) is a substring of r̂(Ci) we have |OPT(R)| ≤ |OPT(R̂)|. Each r̂(Ci) begins and ends with the same string. Let s(r̂(Ci)) denote this string and let SR̂ = (s(r̂(C1)), s(r̂(C2)), . . . , s(r̂(Cj))). Since SR̂ ⊆ S we have |OPT(SR̂)| ≤ |OPT(S)|. Noting that OPT(SR̂) can be


Algorithm II.1 Cycle
Require: Nonempty substring free list of strings, sl

procedure CYCLE(sl)
    avoidSimpleCycles ← false
    reps1 ← MAKEREPRESENTATIVES(sl, avoidSimpleCycles)
    if size(reps1) = 1 then
        return first(reps1)
    end if
    avoidSimpleCycles ← true
    reps2 ← MAKEREPRESENTATIVES(reps1, avoidSimpleCycles)
    return MERGESTRINGS(reps2)
end procedure

procedure MAKEREPRESENTATIVES(sl, avoidSimpleCycles)
    CC ← MAKEMINIMUMCYCLECOVER(sl, avoidSimpleCycles)
    reps ← empty list
    for all C = sc0, sc1, . . . , sck ∈ CC do
        i ← BESTBREAKINDEX(C)
        rep ← ⟨s(i mod k), s((i+1) mod k), . . . , s((i−1) mod k)⟩
        reps ← reps ∪ rep
    end for
    return reps
end procedure

Termination: The algorithm only iterates over finite, non-increasing data structures.
Correctness: Assuming all the called sub-procedures are correct (see Section 6.1.2 onwards), all strings in the parameter list sl will be substrings of a representative, so the algorithm yields a feasible solution of the SSP.


transformed into a solution for R̂ by replacing each s(r̂(Ci)) by its corresponding r̂(Ci), and that this transformation will increase the length by |CC∗(S)|, we get

|OPT(R)| ≤ |OPT(R̂)| ≤ |OPT(SR̂)| + |CC∗(S)| ≤ |OPT(S)| + |CC∗(S)|.



In order to show that the strings in R do not overlap too much, we will need the following lemma:

Lemma 7 Let C = φ(t) = s1, s2, . . . , sn ∈ CC∗(S) and let r(C) be the representative for C. Then period(r(C)) = |t|.

2

PROOF ([MUCHA, 2007]) As r(C) maps to C = φ(t), |t| will be a period for r(C). Assume there exists a factor f for r(C) with |f| < |t|. Then r(C) maps to φ(f), and, being substrings of r(C), so do all the strings s1, s2, . . . , sn. Lemma 4 on page 37 now gives the existence of a cycle shorter than C "including" all the strings that map to C, which contradicts that C belongs to a minimum cycle cover for S. 

We can now show:

Lemma 8 Let C1 = φ(t1), C2 = φ(t2) ∈ CC∗(S) and let r(C1), r(C2) be their representatives. We then have |ov(r(C1), r(C2))| < |t1| + |t2|.

2

PROOF ([MUCHA, 2007]) The cycles C1, C2 ∈ CC∗(S) are both minimal, and from Lemma 7 it follows that period(r(C1)) = |t1| and period(r(C2)) = |t2|. Let u = ov(r(C1), r(C2)). If |u| ≥ |t1| + |t2|, we would have that |t1| and |t2| are both periods for u. Then the Periodicity Lemma (Lemma 2 on page 35) in connection with Lemma 4 on page 37 would lead to a contradiction of the minimality of C1 and C2. 

Using Expression 5.5 on page 39 and Lemma 8 we get:

Corollary 1 Any SSP solution for the set R will have OV(R) ≤ 2 · OPT(S)

2

We can now turn to the final theorem needed to prove the approximation factor for the Cycle algorithm:

Theorem 2 Given an α-approximation for the MAX-HPP, it is possible to get a (4 − 2α) approximation for the SSP. 2

PROOF ([MUCHA, 2007]) First construct the set of representatives R and then use the HPP approximation to get an SSP solution R̄ for R achieving a total overlap of at least α · OV∗(R). We then have:

|R̄| = ||R|| − OV(R̄)
≤ ||R|| − α · OV∗(R)
= ||R|| − OV∗(R) + (1 − α) OV∗(R)
= OPT(R) + (1 − α) OV∗(R)
≤ 2 · OPT(S) + (1 − α) OV∗(R)   (from Lemma 6)
≤ 2 · OPT(S) + (1 − α) 2 · OPT(S)   (from Corollary 1)
= (4 − 2α) OPT(S). 

Theorem 3 The Cycle algorithm is a factor 3 approximation algorithm. 2

PROOF Inspecting the steps of the Cycle algorithm (Section 6.1 on page 43), the algorithm breaks the cycles optimally in step 3), continues to find a minimal non-trivial Cycle Cover CC∗(R) for R in steps 4) and 5), and finally returns a random merge of the set of representatives T in step 6). According to Theorem 1 on page 38, CC∗(R) is also a maximal CC in the overlap graph for R. Translated to the overlap graph for R, what the algorithm does in steps 4) to 6) is equivalent to constructing a solution to the HPP in the overlap graph for R in the following way:
• Split each cycle in CC∗(R) by throwing away the arc with smallest weight (minimal overlap).
• Arbitrarily mend the ensuing paths together using random arcs between the end nodes.
This gives a 1/2-approximation for the HPP, as the cycles each consist of at least two arcs, so throwing away the ones with minimal weight loses at most half of |CC∗(R)|. By Theorem 2 this gives a factor 3 approximation for the Cycle algorithm. 

Remark 7 It also follows that if the Cycle algorithm in step 3) were to break the cycles randomly and return a random concatenation of this set of representatives, no guarantee would be possible for the approximation factor. This corresponds to α = 0 (a 0-approximation of the max HPP), giving a factor 4 approximation by Theorem 2. 2

6.1.2 Implementation

The implementation of the Cycle algorithm is basically a Python implementation of Algorithm II.1 on page 45. The Cycle algorithm depends on the implementation of two subroutines: one that calculates an all-pairs suffix-prefix matrix for the strings in S and one that finds a minimal CC for S.

All pairs suffix-prefix calculation
The first attempted implementation of this simple problem was a Python implementation of the suffix-tree based algorithm described in [Gusfield, 1997, Section 7.10]. Unfortunately this task proved too complicated and thereby more error-prone than expected. In order to sanity check the suffix-tree algorithm a simple naive double-loop implementation was made. The results were in a sense both disillusioning and enlightening. First, the suffix-tree implementation kept growing more and more complex while still containing bugs. Second, the naive test implementation ran of the order of five times faster on the data where the algorithms produced identical results. As a second try, a Knuth-Morris-Pratt (KMP) based implementation described in [Tarhio and Ukkonen, 1988] was implemented. This algorithm consistently produced results identical to the naive test implementation, so it was probably free from bugs, but it was a factor of three slower than the naive test algorithm.

The moral of the story is that even though the theoretical running times for the different implementations are as shown in Table 6.1, the measured running times clearly favoured the naive implementation. Part of the explanation for this result could be that the Python method string.startswith(), used in the interior of a double for-loop, is probably implemented as a highly optimised C or assembler method. In order to confirm this hypothesis a number of tests were run with four different implementations:
• a fully naive algorithm, which uses char-by-char comparison instead of string.startswith(),
• the KMP-based algorithm,


Table 6.1: All-pairs suffix-prefix running times

Algorithm          Running time
Suffix-tree based  O(n|t| · (1 + n/|t|)) = O(||S|| + n²)
KMP based          O(n|t| · (1 + n))     = O(||S|| + n||S||)
Naive              O(n|t| · n|t|)        = O(||S||²)

where n is the number of strings and |t| the (identical) length of each string, i.e. n|t| = ||S||.

• the naive algorithm using string.startswith() implemented in Python, and finally
• a Java implementation of the naive algorithm using the Java method string.startswith().

The results of these test runs are shown in Figure 6.1; a sketch of the naive algorithm follows below.
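As a reference point, the naive startswith()-based algorithm could look as follows. This is a sketch mirroring the approach described above, not the exact benchmarked code:

def all_pairs_suffix_prefix(strings):
    # Naive all-pairs suffix-prefix computation built on str.startswith().
    n = len(strings)
    ov = [[0] * n for _ in range(n)]
    for i, s in enumerate(strings):
        for j, t in enumerate(strings):
            if i == j:
                continue
            for k in range(min(len(s), len(t)) - 1, 0, -1):
                if t.startswith(s[len(s) - k:]):
                    ov[i][j] = k        # longest suffix of s that prefixes t
                    break
    return ov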


[Figure 6.1: Running-time measures of all-pairs overlap calculations]

In the end, a Python wrapper calling the Java implementation of the naive algorithm was used by all SSP algorithms, as it turned out to be the fastest implementation.

Finding minimal CCs
The "traditional" way of finding a minimal CC is by reducing the problem to an instance of the Assignment Problem (AP) or Perfect Matching Problem. The reduction consists of solving the AP in a complete bi-partite graph. The node set is N = {S, S} and the arc weights equal the weights of either the prefix or the overlap graph, depending on whether a minimal or maximal assignment is wanted. In the general case the AP can be solved by the Hungarian Method,1 which has a running time of O(n³), where n is the number of strings. It turns out that, due to the weights being overlap values between the strings, a maximal assignment can be found by a greedy algorithm with a running time of O(log(n) · n²).

Fast maximal assignment algorithm
The fact that a greedy approach works depends on the following property.

Definition 10 (Inverse Monge Condition) Let B = (N1, N2, A, W) be a complete, weighted bipartite graph. Let u, u′ ∈ N1 and v, v′ ∈ N2, and assume WLOG w(auv) ≥ max(w(auv′), w(au′v), w(au′v′)). If the inequality

w(auv) + w(au′v′) ≥ w(auv′) + w(au′v)   (6.1)

holds for any such pairs u, u′ and v, v′ in B, then B is said to satisfy the Inverse Monge Condition. 2

Given a collection of strings S = (s1, s2, . . . , sn), let BS be the complete, weighted bipartite graph given by BS = (S, S, A, W) and w(aij) = |ov(si, sj)|, ∀ i, j ∈ 1 . . . n. We then have

Lemma 9 BS satisfies the Inverse Monge Condition.

2

PROOF ([BLUM ET AL., 1994]) Let su, su′, sv, sv′ ∈ S. Denote ov(si, sj) and |ov(si, sj)| by ovij and wij respectively, and assume wuv ≥ max(wuv′, wu′v, wu′v′). If wuv ≥ wuv′ + wu′v, then Expression 6.1 is obviously fulfilled. Assume wuv < wuv′ + wu′v, i.e. wu′v′ > 0, and let ovuv = a1 a2 . . . a_wuv. Then ovuv′ = a_(wuv−wuv′+1) . . . a_wuv and ovu′v = a1 . . . a_wu′v. However,

ovu′v′ = a_(wuv−wuv′+1) . . . a_wu′v
⇓
wu′v′ = wu′v − (wuv − wuv′ + 1) + 1 = wu′v + wuv′ − wuv. 

1 http://en.wikipedia.org/wiki/Hungarian_algorithm



For a graphical interpretation see Figure 6.2


[Figure 6.2: Illustration of Lemma 9: wuv + wu′v′ ≥ wu′v + wuv′]

Now consider the greedy algorithm MAKEMAXIMALASSIGNMENT for finding a maximum weight assignment (Algorithm II.2 on the next page). Assuming the marking of rows and columns and the checking for availability can be done in constant time, the running time for Algorithm II.2 will be O(n² + log(n²) · n² + n) = O(log(n) · n²). The returned maximum assignment can easily be translated to a maximum CC for S in the overlap graph by following arcs "from left to right" in the assignment. According to Theorem 1 on page 38 this is also a minimal CC in the prefix graph for S. The SSP algorithms needing a minimal CC all use a Python implementation of Algorithm II.2. The Cycle algorithm is dominated by this procedure and thus has a running time of O(log(n) · n²).
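The following Python sketch (illustrative, not the actual implementation; the function names are chosen here for exposition) shows both the greedy assignment and the "from left to right" translation into cycles:

def maximal_assignment(M):
    # Greedy maximum-weight assignment over an n x n overlap matrix M.
    n = len(M)
    entries = sorted(((M[r][c], r, c) for r in range(n) for c in range(n)),
                     reverse=True)                 # descending by overlap
    row_free, col_free, assign = [True] * n, [True] * n, {}
    for w, r, c in entries:
        if row_free[r] and col_free[c]:
            assign[r] = c
            row_free[r] = col_free[c] = False
    return assign                                  # a permutation of 0..n-1

def assignment_to_cycles(assign):
    # Follow the arcs i -> assign[i] "from left to right" to read off cycles.
    seen, cycles = set(), []
    for start in assign:
        if start in seen:
            continue
        cycle, i = [], start
        while i not in seen:
            seen.add(i)
            cycle.append(i)
            i = assign[i]
        cycles.append(cycle)
    return cycles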


Algorithm II.2 MakeMaximalAssignment
Require: An n × n overlap matrix M for string collection S
Ensure: A maximum weight assignment for the bi-partite graph BS

1: procedure MAKEMAXIMALASSIGNMENT(M)
2:     overlaps ← empty list of size n²
3:     for row = 1 to n do
4:         for col = 1 to n do
5:             overlaps ← overlaps ∪ (M[row][col], (row, col))
6:         end for
7:     end for
8:     SORT(overlaps)                      ▷ descending by overlap value
9:     assignment ← empty set of size n
10:    mark all rows and columns available
11:    for all (w, (r, c)) ∈ overlaps do   ▷ iterating from start to end
12:        if r and c both available then
13:            assignment ← assignment ∪ (r, c)
14:            mark row r and column c unavailable
15:        end if
16:    end for
17:    return assignment
18: end procedure

Termination: The algorithm only iterates over finite, non-increasing data structures.
Correctness: ([Blum et al., 1994]) Let A be the assignment returned by the algorithm and let Amax be a maximum assignment for BS with the property that it has the maximum number of arcs in common with A. Assume A ≠ Amax and let aij be the arc of maximal weight in the symmetric difference of A and Amax (with ties "sorted" the same way as in the algorithm). There are now two possibilities:
aij ∈ Amax \ A: This means that A must contain either an arc aij′ or ai′j of "larger" weight than aij, as the algorithm did not choose aij. Neither of these arcs can be in Amax, as it is an assignment, leading to a contradiction of the choice of aij.
aij ∈ A \ Amax: In this case there are two arcs aij′, ai′j ∈ Amax of "smaller" weight than aij. But according to Lemma 9 we could then replace the arcs aij′ and ai′j in Amax by the arcs aij and ai′j′, getting an assignment A′max with no less accumulated weight but having more arcs in common with A than Amax. This is a contradiction of the choice of Amax.


6.2 The RecurseCycle Algorithm

This is a slight variation of the Cycle algorithm. The variation consists of repeating steps 4) and 5) of the Cycle algorithm until only one representative string remains:

1) Execute Cycle algorithm steps 1) to 3).
2) Repeat Cycle algorithm steps 4) and 5) until a representative set of size one is achieved.
3) Return the single element of the final representative set.

The pseudo-code for the algorithm is shown in Algorithm II.3 on the next page.

6.2.1 Approximation factor

Theorem 4 The RecurseCycle algorithm is a factor 3 approximation algorithm.

2

PROOF As the algorithm in effect improves on the Cycle algorithm by returning a "good" merge of representatives, as opposed to a completely random one, its approximation factor cannot be worse than the factor 3 for the Cycle algorithm. 

6.2.2 Implementation

The algorithm is basically a small extension of the Cycle algorithm and is likewise implemented in Python. As a practical matter the recursive implementation had to be substituted by an iterative version, due to the missing tail-recursion primitive in the Python interpreter; the original recursive implementation ran into problems with the maximum stack height.2 The running time for the RecurseCycle algorithm is dominated by the MAKEMAXIMALASSIGNMENT() procedure (see Algorithm II.2) and is thus of size O(log(n) · n²).

6.3 The Overlap Rotation Lemma

We will now briefly divert from the algorithm descriptions in order to introduce a couple of very useful results proved by Breslauer et al. These results are at the basis of the proofs of the best known approximation factors for both the Greedy algorithm (Section 6.4 on page 57) and the Best Factor algorithm (Section 6.6 on page 64).

2 This problem could alternatively have been solved by either increasing the platform-dependent maximum stack height or by using Stackless Python.


Algorithm II.3 RecurseCycle
Require: Nonempty substring free list of strings, sl

1: procedure RECURSECYCLE(sl, avoidSimpleCycles)
2:     if size(sl) = 1 then
3:         return first(sl)
4:     end if
5:     reps ← MAKEREPRESENTATIVES(sl, avoidSimpleCycles)
6:     return RECURSECYCLE(reps, true)
7: end procedure
8:
9: begin main
10:    return RECURSECYCLE(sl, false)
11: end main

Termination: In the base case (line 3) the algorithm returns when the string list has size 1. In the inductive case (lines 5 to 6) the restriction to non-trivial cycle covers in MAKEREPRESENTATIVES(sl, avoidSimpleCycles) (line 5) ensures that size(reps) ≤ size(sl) − 1. As the size of the original string list is finite and nonzero, this guarantees termination.
Correctness: For each invocation of RECURSECYCLE(sl, avoidSimpleCycles) the following invariant holds: a (random) concatenation of the strings in the string list is a superstring for the strings in the list. Before the first invocation the invariant holds trivially. If RECURSECYCLE(sl, avoidSimpleCycles) returns from the base case, the string list contains a single string, which is a superstring for itself. If RECURSECYCLE(sl, avoidSimpleCycles) returns from the inductive case, at least two strings have been replaced by their merge. As a merge is a superstring for the strings involved, the invariant will hold.


Lemma 10 (Overlap Rotation Lemma) Let α be a periodic semi-infinite string. There exists an integer k such that, for any (finite) string s that is inequivalent to α,

|ov(s, α[k, ∞])| < period(s) + 1/2 period(α)

Furthermore such a k can be found in time O(period(α))

2

For the somewhat "technical" proof see [Breslauer et al., 1996, Section 3].

Remark 8 The Overlap Rotation Lemma shows that, given any cyclic string φ(t) and any string s inequivalent to t, it is possible to restrict the overlap between s and t by rotating t sufficiently. The Overlap Rotation Lemma actually consists of two results, but we only need the first of these. As will be shown later (Lemma 12), the implemented algorithms do not actually have to determine the integer k. 2

The following is a rephrasing of another result in [Breslauer et al., 1996, Section 5, Lemma 5.1] due to Mucha.

Lemma 11 Let S = (s1, s2, . . . , sn) be a collection of strings with minimal Cycle Cover CC∗(S). It is possible to choose representatives R̄ for the cycles in CC∗(S) such that OPT(R̄) ≤ 2 · OPT(S) and, for any two distinct cycles Ci, Cj ∈ CC∗(S),

|ov(r̄(Ci), r̄(Cj))| < |Ci| + 1/2 |Cj|.   (6.2) 2

PROOF ([MUCHA, 2007]) Let Ci = si1, si2, . . . , sik ∈ CC∗(S) and let α = pr(si1, si2) pr(si2, si3) · · · pr(sik−1, sik) pr(sik, si1). Consider the semi-infinite string α∞ and let k be the integer given by the Overlap Rotation Lemma 10. Assume WLOG that k falls in the pr(si1, si2) part of α (if not, we can renumber the indices and split α accordingly). In the approach until now we could choose the representative r(Ci) = ⟨si2, si3, . . . , sik, si1⟩.


As shown in the proof of Lemma 6 on page 44 we could actually choose the longer representative r̂(Ci) = ⟨si1, si2, si3, . . . , sik, si1⟩ and still have OPT(R̂) ≤ 2 · OPT(S). Now it follows that if we instead choose the representative

r̄(Ci) = (pr(si1, si2)[k :]) ⟨si2, si3, . . . , sik, si1⟩   (6.3)

we have that r(Ci) is a suffix of r̄(Ci), which itself is a suffix of r̂(Ci). This implies that OPT(R̄) ≤ 2 · OPT(S). Equation 6.2 on the previous page follows directly from Lemma 5 on page 38 and the Overlap Rotation Lemma on the previous page. 

These results immediately lead to Corollary 2 and an improvement of Theorem 2 on page 47.

Corollary 2 An optimal SSP solution for the set R̄ will have OV(R̄) ≤ 3/2 · OPT(S) 2

Theorem 5 Given an α-approximation for the MAX-HPP, it is possible to get a (3.5 − 1.5α) approximation for the SSP. 2

PROOF Analogous to the proof of Theorem 2 on page 47.



Corollary 3 The Cycle and RecurseCycle algorithms are factor (3.5 − 1.5 · 0.5) = 2.75 approximation algorithms. 2

As noted in [Mucha, 2007] there seems to be something non-intuitive or even contradictory about the approach used above. A better approximation for the total overlap of representatives was the road to the improvement in Theorem 5, and therefore it seemed sensible to minimise the overlaps of representatives; but our overall objective is to find a shortest superstring, and for that it surely is preferable to make representatives overlap as much as possible. Fortunately Mucha provides the following lemma.

Lemma 12 Using any set of representatives R with ||R|| ≤ ||R̄|| yields a (3.5 − 1.5α) approximation for the SSP, where α is the approximation factor for the directed MAX-HPP algorithm used. 2

PROOF ([MUCHA, 2007]) Let OV(R) and OV(R̄) be the total overlaps of the optimal solutions OPT(R) and OPT(R̄) respectively, and let |SSP_R| be the length of the approximated solution for R. We then have

|SSP_R| ≤ ||R|| − α OV(R)
= ||R|| − α (||R|| − OPT(R))   (Expression 5.6)
= (1 − α) · ||R|| + α OPT(R)
≤ (1 − α) · ||R̄|| + 2α OPT(S)   (Lemma 11)
≤ (1 − α) · 3.5 OPT(S) + 2α OPT(S)   (Corollary 2 and Lemma 11)
= (3.5 − 1.5α) · OPT(S) 

Remark 9 Lemma 12 ensures that the original representatives R defined in step 3) on page 44 can be used to obtain the approximation factor in Theorem 5, provided the cycles are broken at the string yielding the shortest |r(C)| (to ensure ||R|| ≤ ||R̄||). The rotations used by Breslauer et al. are necessary to bound the approximation factor, but not in the algorithm itself. 2

6.4 The Greedy Algorithm

The greedy algorithm is probably the first idea that comes to mind for solving the SSP: if S = (s1, s2, . . . , sn) is a collection of strings, keep merging the two strings that have the largest overlap, putting the resulting merge back into the collection, until only one string is left. The pseudo-code for this simple and in practice surprisingly good greedy algorithm is shown in Algorithm II.4 on the following page.

6.4.1 Approximation factor

In [Tarhio and Ukkonen, 1988] it is shown that this algorithm achieves at least a factor 1/2 of the maximal total overlap possible. Furthermore, it is conjectured that the algorithm is a 2-approximation algorithm. The following tight example from [Blum et al., 1994] shows that the approximation factor cannot be better than 2.

Example 11

Let S = {c(ab)^k, (ba)^k, (ab)^k c}. Then |Greedy(S)| = |c(ab)^k c(ba)^k| = 4k + 2 and |OPT(S)| = |c(ab)^{k+1} c| = 2k + 4

2


Algorithm II.4 Greedy
Require: Nonempty substring free list of strings, sl

1: procedure GREEDY(sl)
2:     if size(sl) = 1 then
3:         return first(sl)
4:     end if
5:     Find si, sj ∈ sl : i ≠ j and |ov(si, sj)| is maximal.
6:     sl ← sl \ {si, sj}
7:     sl ← sl ∪ ⟨si, sj⟩
8:     return GREEDY(sl)
9: end procedure

Termination: In the base case (line 3) the algorithm returns when the string list has size 1. In the inductive case (lines 6 to 8) two strings are removed and one merged string is added to the string list, reducing its size by one. As the size of the original string list is finite and nonzero, this guarantees termination.
Correctness: For each invocation of GREEDY(sl) the following invariant holds: a (random) concatenation of the strings in the string list is a superstring for the strings in the list. Before the first invocation the invariant holds trivially. If GREEDY(sl) returns from the base case, the string list contains a single string, which is a superstring for itself. If GREEDY(sl) returns from the inductive case, two strings have been replaced by their merge. As a merge is a superstring for the strings involved, the invariant will hold.

Even though this conjecture has been the subject of a lot of study, it has not (yet) been possible to prove a better factor than 3.5. This result, due to Kaplan and Shafrir, is in essence an application of Corollary 2 on page 56 instead of Corollary 1 on page 46 in the original factor 4 proof by Blum et al. Before we can show this, we will need quite some terminology and results from [Blum et al., 1994].

Consider an implementation of the Greedy algorithm that satisfies the condition in line 5 of Algorithm II.4 as follows: it runs down a collection of the arcs from the overlap graph sorted by decreasing weight (analogous to Algorithm II.2) and for each arc aij it decides whether to merge its node strings si, sj or not.

Let the string returned by the Greedy algorithm be denoted tgreedy and assume a renumbering of the strings in S so that tgreedy = ⟨s1, s2, . . . , sn⟩. Let e, f be arcs in the overlap graph for S. We say that e dominates f if it comes before f in the sorted collection of the arcs. This means that w(e) ≥ w(f) and, if e = aij, then either f = aij′ or f = ai′j. Greedy will reject merging the end strings of an arc f if either

1) f is dominated by an already chosen arc e, or
2) f is not dominated but the merge of its end strings would form a cycle.

If the merge is rejected because of 2) then f is called a bad back arc.3 Let f = aji be a bad back arc. Then f corresponds to the merge ⟨si, si+1, . . . , sj⟩, which also corresponds to a path in the overlap graph. As this merge was already formed when Greedy considered f, it follows that |ov(sk, sk+1)| = w(ak,k+1) ≥ w(f) = w(aj,i) = |ov(sj, si)|, ∀ k ∈ [i, j − 1]. Additionally, when f is considered, the arcs ai−1,i and aj,j+1 cannot have been considered yet. The arc f is said to span the closed interval If = [i, j]. In [Blum et al., 1994, Section 5] the lemma below is proved.

Lemma 13 Let e and f be two bad back arcs. Then the closed intervals Ie, If spanned by the arcs are either disjoint or one contains the other. 2

Minimal (with respect to containment) intervals of bad back arcs are called culprits. It follows from the definition that culprit intervals must be disjoint. Each culprit [i, j], like its bad back arc, corresponds to the merge ⟨si, si+1, . . . , sj⟩. Let

Sm = {sk | k ∈ [i, j], where [i, j] is a culprit} ⊆ S

Each culprit [i, j] also defines a cycle Cij = si, si+1, . . . , sj in the overlap graph for S. Let CCc(Sm) be the CC over Sm defined by the culprits. Now consider the result of applying Algorithm II.2 on page 52 to the overlap graph induced by Sm. It turns out that this will exactly produce the Cycle Cover CCc(Sm), which implies that CCc(Sm) is a minimal CC, and we get

|CCc(Sm)| = |CC∗(Sm)| ≤ |OPT(Sm)| ≤ |OPT(S)|.   (6.4)

3 In [Blum et al., 1994] these are called bad back edges.


Finally, denote by Oc the sum of the weights of all culprit bad back arcs in the overlap graph:

Oc = ∑_{[i,j] culprit} |ov(sj, si)|.

Blum et al. utilise the culprits to partition the arcs of the path corresponding to tgreedy into four sets. Through clever analysis they end up showing the following result [Blum et al., 1994, Section 5].

Lemma 14 |tgreedy| ≤ 2 · OPT(S) + Oc − |CCc(Sm)|.

2

Having by now put quite some cannons into position, we can start aiming at the final proof. The main hurdle is to derive a useful approximation for Oc. Let A = {r(Cir) | [i, r] is a culprit} be the representatives for Sm guaranteed to exist by Lemma 11 on page 55. We then have

Lemma 15 For each culprit [i, r], |ov(sr, si)| ≤ |r(Cir)| − |Cir|.

2

PROOF ([BLUM ET AL., 1994]) Let j ∈ [i, r] be the index such that the merge ⟨sj+1, . . . , sr, si, . . . , sj⟩ is a suffix of r(Cir) (e.g. the index i2 in Equation 6.3 on page 56). We now get

|r(Cir)| ≥ |⟨sj+1, . . . , sr, si, . . . , sj⟩|   (the merge is a suffix)
= |pr(sj+1, sj+2) · · · pr(sj−1, sj) sj|
= |pr(sj+1, sj+2) · · · pr(sj−1, sj) pr(sj, sj+1)| + |ov(sj, sj+1)|
≥ |Cir| + |ov(sr, si)| 

Corollary 4 Oc ≤ ||A|| − |CCc(Sm)| 2

PROOF ([BLUM ET AL., 1994])

Oc = ∑_{[i,j] culprit} |ov(sj, si)| ≤ ∑_{[i,j] culprit} (|r(Cij)| − |Cij|) ≤ ||A|| − |CCc(Sm)|. 

We are now able to approximate Oc:

Lemma 16 Oc ≤ OPT(S) + 3/2 |CCc(Sm)|.

2

PROOF ([KAPLAN AND SHAFRIR, 2005])

||A|| = OPT(A) + OV(A)
≤ OPT(A) + 3/2 |CCc(Sm)|   (Corollary 2)
≤ OPT(Sm) + |CCc(Sm)| + 3/2 |CCc(Sm)|   (Lemma 6)
= OPT(Sm) + 5/2 |CCc(Sm)|.

Now, using Corollary 4 on the preceding page, we have

Oc ≤ ||A|| − |CCc(Sm)| ≤ OPT(Sm) + 3/2 |CCc(Sm)| ≤ OPT(S) + 3/2 |CCc(Sm)|. 

So putting it all together we get: Theorem 6 Greedy is a factor 3.5 approximation algorithm.

2

PROOF ([KAPLAN AND SHAFRIR, 2005]) Using Lemma 16 on the facing page and Lemma 14 on the preceding page we get

|tgreedy| ≤ 2 · OPT(S) + OPT(S) + 3/2 |CCc(Sm)| − |CCc(Sm)|
= 3 · OPT(S) + 1/2 |CCc(Sm)|
≤ 3.5 · OPT(S)

where the last step uses |CCc(Sm)| ≤ OPT(S) from Expression 6.4. 

6.4.2 Implementation

The actual implementation of the Greedy algorithm is a Python implementation of the pseudo-code in Algorithm II.4 on page 58, with the check in line 5 implemented as a generator method. Since this method has to sort the n² pairwise overlap values, the running time of the Greedy algorithm is dominated by the running time of this sorting, which in this implementation4 is of size O(log(n²) · n²) = O(log(n) · n²).

4 A more efficient implementation could use Radix Sort, since the size of the overlap values is restricted.
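For illustration, a compact Python version of the greedy merging loop could look as follows. This is a sketch that recomputes overlaps in every iteration, so it is asymptotically slower than the sorting-based implementation described above:

def ov_len(s, t):
    for k in range(min(len(s), len(t)) - 1, 0, -1):
        if s.endswith(t[:k]):
            return k
    return 0

def greedy_superstring(strings):
    # Repeatedly merge the pair with the largest overlap (Algorithm II.4).
    sl = list(strings)
    while len(sl) > 1:
        k, i, j = max((ov_len(s, t), i, j)
                      for i, s in enumerate(sl)
                      for j, t in enumerate(sl) if i != j)
        merged = sl[i] + sl[j][k:]      # <s_i, s_j> = pr(s_i, s_j) s_j
        sl = [s for m, s in enumerate(sl) if m not in (i, j)] + [merged]
    return sl[0]

# The tight example from Example 11 with k = 3:
print(greedy_superstring(["c" + "ab" * 3, "ba" * 3, "ab" * 3 + "c"]))
# -> a superstring of length 4k + 2 = 14, versus |OPT(S)| = 2k + 4 = 10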

6.5 Reductions of HPP Instances

Since the last two algorithms presented in this chapter are both based on the overlap formulation (Expression 5.6 on page 40) for the optimal solution of the SSP, we will now treat the necessary reductions from the original HPP to equivalent instances of the ATSP and TSP respectively.


6.5.1 Reduction from an HPP to an ATSP

Given a collection of strings S = (s1, s2, . . . , sn), let GS(N, A, W) be the overlap graph for S and let H(GS) be the instance of the maximum HPP in the overlap graph for S. Define an instance of the maximum ATSP, A(G′S), by

G′S = G(N′, A′, W′) with
N′ = N ∪ {n0},
A′ = A ∪ {ai0, a0i}, i ∈ 1 . . . n,
W′ : A′ → N with w′(aij) = w(aij) for i, j ≠ 0 and w′(aij) = 0 otherwise

2

PROOF This graph transformation can clearly be done in polynomial time. If POPT = (nP1, nP2, . . . , nPn) is the Hamiltonian path corresponding to the optimal solution for H(GS), then the TSP tour TP = (n0, nP1, nP2, . . . , nPn) will be an optimal solution for A(G′S) with the same length as POPT. Conversely, if TOPT = (n0, nT1, nT2, . . . , nTn) is the TSP tour of the optimal solution for A(G′S), then the path PT = (nT1, nT2, . . . , nTn) will be an optimal solution for H(GS), because removing the visit to node n0 does not reduce the length of the solution. Both cases can be proved using that the existence of a better TSP tour (Hamiltonian path) leads to a contradiction of the assumed optimality of POPT and TOPT respectively. 

Remark 10 The addition of the extra n0 node is necessary in order to make the reduction work. Intuitively it could seem as if an optimal solution to the HPP would be equivalent to determining an optimal solution for the TSP and then removing the heaviest edge. The graph depicted in Figure 6.3 on the facing page illustrates that this intuition is quite wrong. An optimal solution to the TSP (depicted with fully drawn lines) has length 1002. Removing the heaviest edge from this tour results in a solution of the HPP with length 502. However, the optimal HPP solution has length 3. 2
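In matrix form the reduction amounts to bordering the overlap matrix with a zero row and column for n0. A small Python sketch (illustrative only):

def hpp_to_atsp(ov):
    # Max-HPP -> max-ATSP: add a dummy node n0 whose arcs all weigh 0.
    n = len(ov)
    ext = [[0] * (n + 1) for _ in range(n + 1)]   # row/col 0 belong to n0
    for i in range(n):
        for j in range(n):
            ext[i + 1][j + 1] = ov[i][j]
    return ext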

6.5.2 Reduction from ATSP to TSP

Let A(Gatsp) be an instance of the maximum ATSP in a graph Gatsp = (Natsp, Aatsp, Watsp). Transform Gatsp into an undirected graph Gtsp = (Vtsp, Etsp, Wtsp) by the following steps:


[Figure 6.3: The shortest Hamiltonian path is not necessarily included in the optimal TSP tour.]

1) Split each node n ∈ Natsp into a pair of nodes nin, nout, where all incoming arcs for n become incoming to nin and all outgoing arcs for n become outgoing from nout. Let Vtsp = Natsp_in ∪ Natsp_out.
2) In the graph resulting from 1), add an opposite arc with the same weight for every arc, making the resulting graph symmetric. Convert each pair of opposite arcs into an edge of the same weight, thereby converting the graph into an undirected graph. Define wtsp(eij) = watsp(aij) if aij ∈ Aatsp and wtsp(eij) = watsp(aji) if aji ∈ Aatsp.

3) Define a value K ≫ max(Watsp), where max(Watsp) is the maximum weight in Gatsp.
4) For all pairs vi, vj ∈ Vtsp such that vi = nk_in ∈ Natsp_in and vj = nk_out ∈ Natsp_out, add an edge eij between the pair with wtsp(eij) = K.
5) Make the graph complete by adding the missing edges eij with wtsp(eij) = −K.
6) Multiply all edge weights by −1.

A graphical illustration of the transformation is shown in Figure 6.4 on the next page. Steps 1) and 2) ensure that the original arcs with their original weights are present in the undirected graph. Steps 3) to 5) guarantee that any maximal TSP tour in the "node-split" graph will always travel directly between any corresponding in and out vertex pair. Finally, step 6) ensures that a maximal TSP tour is turned into an equivalent minimal TSP tour.

2

63

6. A LGORITHMS

2

1

2

1

(a)

1

1

(b)

2

2

(c)

K K

1

2

1

-K

K

-1

2 -K

-K

K

-K

K (d)

-2 K

(e)

(f)

[Figure 6.4: Transforming an ATSP instance to a TSP instance]

PROOF The whole graph transformation can clearly be done in polynomial time, i.e. an instance A(Gatsp) in the graph Gatsp can be turned into an instance T(Gtsp) in the graph Gtsp in polynomial time. If TA_OPT = (n1, n2, . . . , nn) is an optimal maximal ATSP tour in Gatsp with length Latsp, then the TSP tour Ttsp = (v_n1in, v_n1out, v_n2in, v_n2out, . . . , v_nnin, v_nnout) will be an optimal solution for the TSP in Gtsp with length −(Latsp + n · K). Conversely, if TT_OPT = (v_n1in, v_n1out, v_n2in, v_n2out, . . . , v_nnin, v_nnout) is an optimal TSP tour in Gtsp with length Ltsp, then Tatsp = (n1, n2, . . . , nn) is an optimal maximal ATSP tour in Gatsp with length −(Ltsp + n · K). As for the first reduction, both cases can be proved using that the existence of a better TSP tour (ATSP tour) leads to a contradiction of the assumed optimality of TA_OPT and TT_OPT respectively. 
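A Python sketch of the transformation on weight matrices (illustrative only; vertex 2i plays the role of n_i_in and vertex 2i+1 of n_i_out):

def atsp_to_tsp(w):
    # Node-splitting max-ATSP -> min-TSP reduction, steps 1) to 6).
    n = len(w)
    K = 1 + n * max(max(row) for row in w)      # K >> max(W_atsp)
    m = [[-K] * (2 * n) for _ in range(2 * n)]  # step 5): missing edges
    for i in range(n):
        m[2 * i][2 * i + 1] = m[2 * i + 1][2 * i] = K   # step 4)
        for j in range(n):
            if i != j:                                   # steps 1)-2)
                m[2 * i + 1][2 * j] = m[2 * j][2 * i + 1] = w[i][j]
    return [[-x for x in row] for row in m]              # step 6)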

6.6 The Best Factor Algorithm

The basis for this algorithm is the overlap formulation (Expression 5.6 on page 40) for the optimal solution of the SSP. After making the reductions described in Section 6.5.1 on page 62, the algorithm finds an approximation for the resulting ATSP instance. To this end the techniques described in [Kaplan et al., 2005] are used.

6.6.1 Overview

The algorithm starts off by finding a solution to the following relaxation of an LP problem:

Definition 11 Let S = (s1, s2, . . . , sn) be a string collection and let G(N, A, W) be its overlap graph. The following is a relaxation of the LP-formulation for Maximal CC with Two-Cycle Constraint:

Maximise ∑_{aij ∈ A} w(aij) · xij subject to
∑_i xij = 1, ∀ j   (in-degree constraints)
∑_j xij = 1, ∀ i   (out-degree constraints)
xij + xji ≤ 1, ∀ i ≠ j   (2-cycle constraints)
xij ≥ 0, ∀ i ≠ j   (non-negativity constraints)

2

6. A LGORITHMS seems to be 23 (see Section 6.6.7 on page 72), and Kaplan et al. manage to achieve this bound using some nice techniques. For this reason the main part of the algorithm will be covered in details in the following. 2

6.6.2

Preliminaries and notation

The algorithm consists of the following steps 1) Transform the HPP into an instance of the ATSP and construct the overlap matrix for this instance. 2) Find a solution to the corresponding LP in Definition 11. 3) Scale up the possibly non-integral solution by multiplying with the least common denominator D of all variables. 4) Recursively decompose the resulting multigraph until a pair of CCs fulfilling Equation 6.5 is achieved. 5) Decompose this CC pair into three path collections. 6) Patch the heaviest path collection into a tour. Step 1) follows the description in Section 6.5 on page 61 and Step 2) is done by use of an LP solver program capable of solving mixed integer linear programs. The steps 3) and 4) will be the topic of the rest of this section. A detailed description of Step 5) is given in [Kaplan et al., 2005, Section 5]. Finally Step 6) is done by arbitrarily merging the disjoint paths together until a tour is constructed. Definitions and notation Let G(N, A, W ) be the overlap graph from the LP. Let {x∗ij } for i, j ∈ 1 . . . |N | and i 6= j be an optimal (fractional) solution from Step 2) with OP Tf the maximal value found for the objective function of the LP. Denote the optimal value for the corresponding ATSP OP Tatsp . Let D be the least common denominator from Step 3). Define k · (i, j) to be the multiset containing k copies of the arc (i, j) and define the weighted multigraph D · G = (N, A, W ) where A = {(Dx∗ij ) · (i, j) | (i, j) ∈ A} and all copies of the same arc have identical weight. Let Wmax denote the maximum weight for an arc in G. Denote the number of copies (the multiplicity) of an arc a by mG (a) and denote a 2-cycle between the nodes i, j by Cij . Remark 12 The multiplier D might be exponential in the size of N , as it can be the product of |A| = (n − 1)2 factors. However, the algorithm avoids using 66

6.6. The Best Factor Algorithm more than polynomial space since the number of bits necessary to represent D is polynomial. Furthermore, multigraphs are implicitly and never explicitly represented by maintaining the multiplicities for all arcs. 2 Definition 12 (Half-Bound d-regular Multigraph) A (directed) multigraph G(N, A) is said to be d-regular if the in and outdegrees for all nodes equal d.  d A d-regular multigraph is said to be halfbound if it contains at most 2 copies of any 2-cycle. 2

6.6.3

Finding the lowest common denominator

Determining the multiplier D in Step 3) is done using a Greatest Common Divisor (GCD) implementation as a subroutine. The pseudo-code for the algorithm is shown in Algorithm II.5. Algorithm II.5 leastCommonDenominator Require: Nonempty list of fractional numbers, p fl = [ pq11 , pq22 , . . . q|fl| ] with GCD(pi , qi ) = 1 and pi ≤ qi ∀ i ∈ 1 . . . |fl| |fl| Ensure: The smallest number, lcd ∈ N : lcd · pq ∈ N ∀ pq ∈ fl. procedure LEAST C OMMON D ENOMINATOR(fl) 3: lcd ← 1 for all pq in fl do q lcd ← lcd · GCD(lcd, q) 6: end for return lcd end procedure Termination: The algorithm only iterates over finite non-increasing data structures. Correctness: Using induction over the size of fl, we get Base Case: If |fl| = 1 we have lcd1 = lcd0 = 1 follows from line 3.

q1 GCD(lcd0 , q1 )

=

q1 1

= q1 where

Induction Case: Assume lcdi−1 is correct and |fl| = i. We have lcdi = lcdi−1 ·

qi GCD(lcdi−1 , qi )

No matter whether qi divides lcdi−1 or not, it follows from the qi definition of GCD that GCD(lcd will be the smallest possible i−1 , qi ) number to multiply with lcdi−1 such that the qi ’s divides the product.

67

6. A LGORITHMS The running-time for LEAST C OMMON D ENOMINATOR () is dependent on the running-time for the used sub-procedure GCD ().5 Using an implementation of Euclids Algorithm we have for the running-time of GCD () rtGCD = O(log(max(lcd, q))) (see [Cormen, Leiserson, Rivest, and Stein, 2001, chapter 31.2]) and for LEAST C OMMON D ENOMINATOR (): rtLCD = O(|fl| · rtGCD ).

6.6.4

Finding Cycle Covers when D is a power of two

To determine the CCs in Step 4) on page 66 we follow the presentation in [Kaplan et al., 2005, Section 3] and first show the procedure for the case where the multiplier D is a power of two. Let G0 = D · G where G = G(N, A, W ) is the graph from the LP in Step 2). G0 is then D-regular, half-bound and has W (G0 ) ≥ D · OP Tatsp . Define a D-regular bi-partite undirected multigraph B(V, V 0 , E) as follows: for each node ni ∈ N define two vertices vi ∈ V, vi0 ∈ V 0 and for each arc aij = (ni , nj ) ∈ A define the edge e = (vi , vj0 ) ∈ E. Next divide B into two D B1 , B2 as fol2 -regular bi-partite multigraphs mB (e)  lows: for each edge e ∈ B with mB (e) ≥ 2 add copies of e to both 2  mB (e)  B1 and B2 and remove 2 · copies of e from B. Distribute the “left2 over” edges among B1 and B2 as follows: for each connected component in the remaining subgraph of B determine an Eulerian Cycle,6 and alternately assign the edges in these cycles to B1 and B2 . Now, let G0 and G00 be the sub-graphs of G0 corresponding to B1 and B2 respectively. Note that both G0 and G00 are D 2 -regular and half-bound. The latter is given since G0 was half-bound and D-regular, meaning that for every 2-cycle Cij in G0 at least one of the arcs (ni , nj ) and (nj , ni ) has mul0 00 tiplicity ≤ D 2 . By letting G1 be the heaviest graph of G and G , it is D 0 00 achieved that W (G1 ) ≥ 2 · OP Tatsp because W (G ) + W (G ) = W (G0 ) ≥ D · OP Tatsp . Applying the above procedure log(D) − 1 times the obtained graph Glog(D)−1 will be a 2-regular half-bound multigraph having W (Glog(D)−1 ) ≥ 2 · OP Tatsp . From this graph it is fairly simple7 to obtain a pair of CCs satisfying Equation 6.5 on page 65. The running-time for the procedure will be O(log(D) · n2 ) following from log(D) iterations each costing O(n2 ). 5

actually it is also dependent on a procedure being able to extract the denominator of a fractional number from its decimal representation. However, the dominant procedure will be the GCD. 6 a cycle containing all edges in a graph 7 e.g. by using arc coloring in a bi-partite graph.

68

6.6. The Best Factor Algorithm

6.6.5

Finding Cycle Covers when D is not a power of two

When D is not a power of two the challenge is to use a rounding procedure on the solution of the LP in order to derive a 2y -regular multigraph. When this is achieved the procedure in Section 6.6.4 can be applied. For reasons that will become apparent later assume |N | ≥ 5. Determine y ∈ N such that 2y−1 < 12 |N |2 Wmax ≤ 2y . Define D = 2y − 2 |N |. For each value x∗ij let xij denote the largest integer multiple of 1/D less than x∗ij . Define the multigraph D · G = (N, A, W ) where A = {(D xij ) · (i, j) | (i, j) ∈ A}. Since D xij ∈ N this graph is well-defined. Lemma 19 For all nodes n in D · G the following inequalities hold D − (|N | − 1) ≤ dnin , dnout ≤ D.

2

P ROOF ([K APLAN ET AL ., 2005]) For node nj it follows from the rounding of the x∗ij values that xij ≥ x∗ij −1/D and P from the in-degree constraints in Definition 11 on page 65 we get that i x∗ij = 1. From the definition of D · G it then follows that X xij (6.6) dnin =D j i

≥D

X

 x∗ij − 1/D

i



|N | − 1 D = D − (|N | − 1)



≥D 1−

using xij ≤ x∗ij the upper bound for the in-degree follows from Equation (6.6). The bounds on the out-degree can be proved analogously.  ˆ has the sum-of-arcs In the following we will say that a multigraph G property if mGˆ (aij ) + mGˆ (aji ) ≤ D ∀ ni , nj , i 6= j, i, j ∈ 1 . . . |N |. From the 2-cycle constraints it follows that D · G has the sum-of-arcs   property, that is it has at most D 2 copies of any 2-cycle Cij . Next the multigraph D · G has to be completed into a 2y -regular graph by the addition of arcs. Thereby it has to be ensured that the resulting graph will be half-bound. The process of arc-adding is divided into an arcaddition stage and an arc-augmenting stage, and ensures the half-bound property by satisfying the condition, that mG (aij ) + mG (aji ) ≤ D for every pair of nodes ni , nj . 69

Arc-addition stage: If $\exists\, n_i, n_j$, $i \ne j$, with $d^{out}_{n_i} < D$, $d^{in}_{n_j} < D$ and $m_G(a_{ij}) + m_G(a_{ji}) < D$, add an arc $a_{ij}$ to the graph $D \cdot G$. Denoting the graph at the end of the arc-addition stage $G'$, the following lemma holds.

Lemma 20 With the exception of maximally two nodes $n_i, n_j$, every node $n$ in $G'$ has $d^{in}_n = d^{out}_n = D$. If such a pair of nodes does exist, then $m_G(a_{ij}) + m_G(a_{ji}) = D$. □

PROOF ([Kaplan et al., 2005]) Assume the existence of three distinct nodes $n_i, n_j, n_k$ such that for each of these either the in-degree or the out-degree is less than $D$. Since $\sum_n d^{in}_n = \sum_n d^{out}_n$, there will be at least one node, say $n_i$, with $d^{out}_{n_i} < D$ and at least one node, say $n_j$, with $d^{in}_{n_j} < D$. Assume further $d^{in}_{n_k} < D$ (the case $d^{in}_{n_k} = D \Rightarrow d^{out}_{n_k} < D$ is symmetric). Since the arc-addition stage finished without adding further copies of $a_{ij}$ and $a_{ik}$, it follows that $m_{G'}(a_{ij}) + m_{G'}(a_{ji}) = m_{G'}(a_{ik}) + m_{G'}(a_{ki}) = D$. This again implies that $d^{in}_{n_i} + d^{out}_{n_i} = 2D$. However, since the arc-addition stage ensures that neither the in- nor the out-degree of any node will be larger than $D$, it follows that $d^{out}_{n_i} = D$, leading to a contradiction. □

By Lemma 20 at most two nodes $n_i, n_j$ exist for which the in-degree and out-degree do not equal $D$. For these nodes it follows from Lemma 19 that $D - (|N|-1) \le d^{in}_{n_i}, d^{out}_{n_i}, d^{in}_{n_j}, d^{out}_{n_j}$. The first phase of the arc-augmenting stage starts by choosing an arbitrary node $n_k \ne n_i, n_j$ and adds copies of the arcs $a_{ki}, a_{ik}, a_{kj}, a_{jk}$ until $d^{in}_{n_i} = d^{out}_{n_i} = d^{in}_{n_j} = d^{out}_{n_j} = D$. At this point any node $n_m$, $m \ne k$, has $d^{in}_{n_m} = d^{out}_{n_m} = D$. Denote the corresponding graph $G''$.

Since maximally $2(|N|-1)$ incoming and maximally $2(|N|-1)$ outgoing arcs have been added to and from $n_k$, and because $D = 2^y - 2|N|$, it still holds that $d^{in}_{n_k}, d^{out}_{n_k} < 2^y$ (Lemma 19). Furthermore, after the arc-addition stage we had $m_{G'}(a_{ki}), m_{G'}(a_{ik}) \le D$. The first phase of the arc-augmenting stage added maximally $(|N|-1)$ copies of both $a_{ki}$ and $a_{ik}$, so it now holds that $m_{G''}(a_{ki}), m_{G''}(a_{ik}) \le 2^y$. Since a similar argument holds for $a_{kj}$ and $a_{jk}$, it is still possible to augment $G''$ into a $2^y$-regular graph $G_0$ having the sum-of-arcs property. Since $n_k$ is the only node in $G''$ whose in-degree or out-degree is larger than $D$, it follows that $d^{in}_{n_k} = d^{out}_{n_k} = L$ with $D \le L \le 2^y$. The second phase of the arc-augmenting stage adds $L - D$ arbitrary cycles passing through all nodes except $n_k$, ensuring that all nodes have $d^{in}_n = d^{out}_n = L$. Then $2^y - L$ arbitrary Hamiltonian cycles are added, with the restriction that none of these Hamiltonian cycles uses any of the arcs $a_{ki}, a_{ik}, a_{kj}$ and $a_{jk}$. Note that there have to be at least five vertices in order for these Hamiltonian cycles to exist.

Calling the resulting graph $G_0$, it follows by construction that:

Lemma 21 $G_0$ is $2^y$-regular and half-bound. □

Next the procedure of Section 6.6.4 on page 68 can be applied to $G_0$. After $y - 1$ iterations the procedure will end with a 2-regular half-bound graph $G_{y-1}$. It remains to prove the following lemma.

Lemma 22 The graph $G_{y-1}$ will have $W(G_{y-1}) \ge 2\, OPT_{atsp}$. □

PROOF ([Kaplan et al., 2005]) Using the following facts:

i) $x_{ij} \ge x^*_{ij} - 1/D$,
ii) $D = 2^y - 2|N|$,
iii) $\sum_{a_{ij} \in A} 1 = (|N|-1)^2 \le |N|^2$, and
iv) $OPT_f \le |N|\, W_{max}$,

we have for the weight of $G_0$ that

$$ W(G_0) \ge \sum_{a_{ij} \in A} D \cdot w(a_{ij}) \cdot x_{ij} $$
$$ \ge D \cdot \Big( \sum_{a_{ij} \in A} w(a_{ij}) x^*_{ij} - \sum_{a_{ij} \in A} \frac{w(a_{ij})}{D} \Big) \qquad \text{(from i)} $$
$$ \ge D \cdot \Big( OPT_f - \sum_{a_{ij} \in A} \frac{w(a_{ij})}{D} \Big) \ge D \cdot \Big( OPT_f - \sum_{a_{ij} \in A} \frac{W_{max}}{D} \Big) $$
$$ \ge 2^y OPT_f - 2|N|\, OPT_f - |N|^2 W_{max} \qquad \text{(from ii) and iii)} $$
$$ \ge 2^y OPT_f - 3|N|^2 W_{max} \qquad \text{(from iv)} $$

At each of the $y - 1$ iterations the heaviest graph is retained. Thus, using $2^y \ge 12|N|^2 W_{max} \Leftrightarrow 2^{y-1} \ge 6|N|^2 W_{max}$ and $OPT_f \ge OPT_{atsp}$, the following holds for the weight of $G_{y-1}$:

$$ W(G_{y-1}) \ge \frac{W(G_0)}{2^{y-1}} \ge \frac{2^y OPT_f - 3|N|^2 W_{max}}{2^{y-1}} $$
$$ \ge 2\, OPT_f - \frac{3|N|^2 W_{max}}{6|N|^2 W_{max}} \ge 2\, OPT_{atsp} - \frac{1}{2} \ge 2\, OPT_{atsp} $$

since $W(G_{y-1})$ and $OPT_{atsp}$ are integers. □

6.6.6 Approximation factor

Theorem 7 The Best Factor algorithm is a 2.5 approximation algorithm. □

PROOF The algorithm achieves a factor $\frac{2}{3}$ for the MAX-ATSP. Through the reductions in Section 6.5.1 on page 62 this gives a factor $\frac{2}{3}$ for the MAX-HPP. From Theorem 5 on page 56 we now have that this gives a $3.5 - 1.5 \cdot \frac{2}{3} = 2.5$ approximation for the SSP. □

6.6.7 Implementation

The implementation consists of a Python implementation that utilises two external LP solver programs[8] to find the solution in Step 2) on page 66. A quite remarkable observation is that the solutions to the LP problems turned out to be integral in all but one case out of 355 280. This simplified the implementation considerably, since Steps 3) to 5) on page 66 could be substituted with the steps:

iii) No rounding is needed because of the integral solution.

iv) No pair of CCs is needed, as the returned CC is optimal.

v) No decomposition is needed; just remove the lightest arc from each cycle in the cover to obtain a heaviest path collection.

Removing the lightest arc in Step v) will reduce the accumulated weight by at most one third (as all cycles consist of at least three arcs), thus the approximation factor is ensured.

8 QSopt: http://www2.isye.gatech.edu/~wcook/qsopt/ and glpsol: http://www.gnu.org/software/glpk/
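Step v) is simple enough that a small sketch conveys it completely. The following hypothetical Python fragment (names are ours, not from the thesis) turns a cycle cover into a heaviest path collection:

    def cycles_to_paths(cycles):
        # each cycle is a list of (weight, tail, head) arcs; deleting
        # the lightest arc of every cycle loses at most one third of
        # the total weight, since every cycle has at least three arcs
        paths = []
        for cycle in cycles:
            lightest = min(range(len(cycle)), key=lambda i: cycle[i][0])
            # the arcs after the removed one, followed by those before
            # it, form a path through all nodes of the cycle
            paths.append(cycle[lightest + 1:] + cycle[:lightest])
        return paths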

Remark 13 It has not been possible to explain this overwhelming majority of integral solutions to the LP. However, using constructed examples of distance matrices, it has been shown that the feature depends neither on the distance matrices being asymmetric nor on their satisfying the Inverse-Monge-Condition. □

6.7 The TSP Based Algorithm

Like the Best Factor algorithm (Section 6.6 on page 64), this algorithm is based on the overlap formulation (Expression 5.6 on page 40). After making the reductions described in Section 6.5.1 and Section 6.5.2 on page 62, we can use LKH to find a solution and transform it back into a solution for the original HPP.

6.7.1 Implementation

The implementation consists of a Python implementation that transforms an original asymmetric overlap matrix into a corresponding symmetric distance matrix and then calls the LKH program to find a solution for the transformed TSP instance. This solution is then converted into a solution for the original SSP using the methods in Section 6.5 and Expression 5.6 on page 40. The running-time for the implementation is dominated by the calculation of the matrices and the running-time of LKH on the transformed instance. Letting $S = (s_1, s_2, \ldots, s_n)$ be the collection of strings, the running-time for the calculation of the original ATSP overlap matrix can (theoretically) be reduced to $O(\|S\| + n^2)$ [Gusfield, 1997]. Transforming this matrix into a symmetric matrix for the transformed instance has a running-time of $O((2n)^2)$. The running-time for LKH is empirically determined to be approximately $O(n^{2.2})$ [Helsgaun, 1998]. In practice, the calculation of the all-pairs prefix-suffix overlap matrix was implemented as mentioned in Section 6.1.2 on page 48.
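The asymmetric-to-symmetric transformation can be sketched as follows. This is the classic 2n-node construction, given here as a hypothetical Python sketch: the constants and names are ours, and LKH's own input format is not shown.

    import numpy as np

    def atsp_to_stsp(c, big_m=10**6, inf=10**9):
        # c is an n x n asymmetric cost matrix; city i gets a ghost
        # node i+n, and the heavily negative edge (i, i+n) forces every
        # tour to use it, so the symmetric tour i, ghost_i, j, ghost_j,
        # ... maps back to the asymmetric tour i, j, ...  The optimal
        # symmetric cost equals the asymmetric optimum minus n * big_m.
        n = len(c)
        s = np.full((2 * n, 2 * n), inf, dtype=np.int64)
        for i in range(n):
            s[i, i + n] = s[i + n, i] = -big_m   # city <-> its ghost
            for j in range(n):
                if i != j:
                    # travelling ghost_i -> city_j costs c[i][j]
                    s[i + n, j] = s[j, i + n] = c[i][j]
        return s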

6.8 Summary

To recapitulate, Table 6.2 shows the characteristics of the implemented algorithms.



Table 6.2: SSP algorithm characteristics

    Algorithm       Approximation factor    Running-time
    Cycle           2.75                    O(log(n) · n²)
    RecurseCycle    2.75                    O(log(n) · n²)
    Greedy          3.5                     O(log(n) · n²)
    Best Factor     2.5                     O(log(D) · n²)
    TSP             —                       O(n^2.2)

In the running-times, D is the multiplier from Step 3) on page 66 and n is the number of strings in the string collection S = (s1, s2, ..., sn). The running-times for the Cycle algorithm and the RecurseCycle algorithm are dominated by Algorithm II.2 on page 52. The running-time for Greedy is dominated by the sorting of the n² pairwise overlaps. The running-time for the Best Factor algorithm might be dominated by the LP solver program. Finally, the running-time for the TSP algorithm is dominated by running LKH.

6.9 Notes

The approximation factor theory in Section 6.1 and Section 6.2 is based on [Blum et al., 1994; Mucha, 2007]. The theory in Section 6.3 is based on [Breslauer et al., 1996] and the theory in Section 6.4 is based on [Blum et al., 1994; Kaplan and Shafrir, 2005]. Finally Section 6.6 is based completely on [Kaplan et al., 2005]. All other references are explicitly mentioned.


CHAPTER 7

Experiments

A little experience often upsets a lot of theory.

This chapter contains descriptions of the experiments performed on the implemented SSP algorithms, together with their results. The main and almost sole purpose of the SSP experiments was, for each of the different algorithms, to evaluate its properties concerning SSP solution quality. In particular, no special emphasis concerning running-time efficiency was put into the implementations of the algorithms. In other words: a fast and simple implementation was preferred to a complex but more running-time efficient implementation. Since it turned out to be fairly easy to collect test results concerning properties like running-time and quality as a sequencing method, tests for these were also conducted. It should be kept in mind, though, that the latter tests were made more out of curiosity than of planned intent, and as such care should be taken not to draw too definite conclusions from them. An initial section describing the test data generation is followed by sections covering the tests conducted for SSP solution quality, running-time and sequencing quality respectively. Each of these experiment sections finishes with the conclusions drawn from the test results.

7.1 Test Data

The test data was constructed according to one of the procedures described in [Romero et al., 2004]. The procedure resembles the Shotgun Sequencing method (see Section 4.2.1 on page 30); thus we will need the following definition:


Definition 13 (Coverage) As a measure of the relation between the number of clones and the number of overlapping fragments, the coverage is defined as

$$ cov = \frac{n \cdot \ell}{|G|} \qquad (7.1) $$

where $n$ is the number of fragments, $\ell$ is the average length of the fragments and $|G|$ is the length of the original DNA sequence (genome). □

As mentioned in [Gusfield, 1997, chapter 16.14], a five- to tenfold coverage generates the best results for Shotgun Sequencing. The steps for producing test data are:

1) From each of a pool of 64 fasta-format files[1] a nucleotide sequence of length N is obtained.

2) A collection of strings S of equal length $\ell$ is constructed by shifting character by character along the obtained nucleotide sequence.

3) The collection S is modified by randomly removing a number of strings r until the wanted coverage value is obtained.

4) Ensure that S is substring free (see Remark 3 on page 29).

A small sketch of this procedure is given after the footnotes below. In the experiments, coverage values of 6, 8, 10, 20, 30 and 40 were used. String length varied from minimally 6 to maximally 100, collection size from 100, 200, ..., 1000, and a few collections had size 2000 and 3000. For exact details and the calculation of the necessary nucleotide sequence length N, see Appendix D on page 161.

In [Romero et al., 2004] two further methods to generate test data were used. These methods consist of a random generation of the nucleotide sequence in step 1) and a random generation of all strings in the collection S, respectively. The choice of the above method was due to the results obtained by Romero et al.: these results showed that the largest deviations from a lower bound for the SSP solutions were obtained using test data generated from real nucleotide sequences. This might not be all that surprising, as uniformly random generated nucleotide sequences tend to have small and sparse overlaps and thus are easy to compress optimally. The 64 nucleotide sequences were obtained from GenBank[2] and comprise the sequences which it was possible to retrieve using the accession numbers given in [Romero et al., 2004].

1 See Appendix C on page 157.
2 http://www.ncbi.nlm.nih.gov/
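As announced above, steps 2) to 4) can be sketched in a few lines of Python. The function and parameter names are ours, not from the thesis; since all fragments have equal length, making the collection substring free reduces to removing duplicates.

    import random

    def make_fragments(genome, length, coverage, seed=0):
        rng = random.Random(seed)
        # step 2): slide a window of `length` along the sequence
        frags = [genome[i:i + length]
                 for i in range(len(genome) - length + 1)]
        # step 3): remove random fragments until cov = n*length/|G|
        wanted = int(coverage * len(genome) / length)
        while len(frags) > wanted:
            frags.pop(rng.randrange(len(frags)))
        # step 4): equal lengths -> substring free == duplicate free
        return list(dict.fromkeys(frags))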


Table 7.1: SSP experiments overview

                    coverage      string length        string count
    SSP solution    Figure 7.1    Figures 7.2, 7.3     Figures 7.4, 7.5
    Running-time    —             —                    Figures 7.7–7.9
    Sequencing      —             Figure 7.10a         Figure 7.10b

Figures showing the dependency of the row values upon the column variables.

Table 7.1 contains an overview of the following figures and which relationship combination each of them illustrates. All tests were run on two dedicated, identical machines.[3]

Remark 14 The following abbreviations are used in the figure legends of all figures:

• 'RCycle' for the RecurseCycle algorithm,
• 'Best' for the Best Factor algorithm, and
• 'LB' for lower bound.

Please note the fine scaling of the y-axes in Figures 7.1–7.5. □

7.2 SSP Solver Tests

Even though the SSP solutions originating from the TSP algorithm were expected to be optimal, at the very least for string counts up to 1000, it was decided to calculate a lower bound for the optimal SSP solution for each test. The lower bound was calculated by Concorde as a guaranteed (with respect to numerical precision) feasible solution to the dual of an LP relaxation of the TSP. The results for all SSP algorithms are thus depicted as their percentage deviation above the lower bound, i.e. the lower bound value is used as the 100 % mark.

7.2.1 Variable coverage

From Equation 7.1 it follows that the coverage and the length of the original sequence are inversely proportional. For the test data this means that, keeping string (fragment) count and length constant, increasing coverage is equivalent to a shorter original sequence length. In other words, the problem of solving the SSP is expected to become easier for increasing coverage, as the superstring to determine becomes smaller while the fragment count and overlaps stay constant.

3 Intel Xeon, Dual core, 3.0 GHz, 1.0 GB RAM

[Figure 7.1: SSP solution dependency on variable coverage. (a) count 200, length 50; (b) count 200, length 100. y-axis: percentage in terms of lower bound; x-axis: coverage; legend: Cycle, RCycle, Best, Greedy, Tsp, LB.]

The tests were conducted with coverage values of 6, 8, 10, 20, 30 and 40. Only the results for string count 200 and string lengths 50 and 100 are shown, as they are representative of the results for other string counts and lengths. As can be seen in Figure 7.1, the expected increase in SSP solution quality with larger coverage values is not obvious. Actually, the SSP values of all algorithms (except for the TSP algorithm) seem to fluctuate somewhat. The fluctuations can partly be explained as a visual effect due to the very fine y-axis scale.

7.2.2 Variable length

From Equation 7.1 it can be seen that the string (fragment) length and the length of the original sequence are directly proportional. For the test data this means that, keeping string count and coverage constant, increasing length is equivalent to a larger original sequence length. The problem of solving the SSP is nevertheless expected to become easier, as the possible overlaps increase with the string lengths.

[Figure 7.2: SSP solution dependency on length. Coverage 8, count 100; y-axis: percentage in terms of lower bound; x-axis: string length.]

For these tests, string lengths of 10, 20, 25, 40, 50 and 100 were used. Figure 7.2 and Figure 7.3 depict some of the results for a coverage of 8; the results for other coverage and string count values are similar.


[Figure 7.3: solution dependency on length, compare with Figure 7.10a. Coverage 8, count 500; y-axis: percentage in terms of lower bound; x-axis: string length.]

In contrast to the variable coverage results, the variable length results are very consistent. All algorithms produce values of increasing quality for longer lengths. Only a string length of 10 results in values up to 1 % above the lower bound. As always, this is not the case for the TSP algorithm, which constantly lies more or less on the lower bound line.

7.2.3 Variable count

From Equation 7.1 it follows that the string count and the length of the original sequence are directly proportional. For the test data this means that, keeping string length and coverage constant, increasing count is equivalent to a larger original sequence length. The problem of solving the SSP is expected to stay unaffected, as the possible overlaps are non-increasing and the extra amount of strings is compensated for by the increasing superstring length. For these tests, string counts of 100, 200, 300, 400, 500, 600 and 700 were used. Figure 7.4 and Figure 7.5 depict the results for a coverage of 8 for string lengths of 25 and 40; the results for other coverage and string length values are similar. The results show a slight tendency of reduced solution quality upon growing string count.

[Figure 7.4: solution dependency on count, compare with Figure 7.10b. Coverage 8, length 25; y-axis: percentage in terms of lower bound; x-axis: number of strings.]

The fluctuations in the figures are mainly due to the changes in y-value scale between Figure 7.4 and Figure 7.5, and probably also to a too conservative lower bound value. If the results from Figure 7.5 are correlated using the TSP algorithm values as lower bound, the lines look smoother (see Figure 7.6).

7.2.4 SSP solution conclusions

Some recurrent patterns are present in all of the SSP solution figures:

• the TSP algorithm values almost constantly lie on the lower bound line,
• the Greedy algorithm produces the second best values, followed by the two Cycle algorithms and the Best Factor algorithm, which produce solutions of similar quality,
• the Cycle algorithm and the RecurseCycle algorithm produce, for all practical purposes, identical values, and finally
• all algorithms produce results typically less than 1 % above the lower bound, and always less than 2 %.

[Figure 7.5: solution dependency on count. Coverage 8, length 40; y-axis: percentage in terms of lower bound; x-axis: number of strings.]

Table 7.2: minimum scores for SSP algorithms in 355 280 tests

    Algorithm       minimum scores    percentage
    Cycle           301 637           84.90 %
    RecurseCycle    301 664           84.91 %
    Best Factor     301 251           84.79 %
    Greedy          306 519           86.28 %
    TSP             355 280           100.00 %

To further assess the results, Table 7.2 lists the number of times each algorithm achieved the smallest (best) value over all test runs. The conclusions that can be drawn from these observations are:

1) The TSP algorithm is far superior to the other algorithms in view of solution quality.


[Figure 7.6: Figure 7.5 correlated using TSP algorithm values as 100 %. Coverage 8, length 40; y-axis: percentage in terms of lower bound; x-axis: number of strings.]

2) The Greedy algorithm produces better results than the CC based algorithms, even though its approximation factor is higher.

3) The Best Factor algorithm, the Cycle algorithm and the RecurseCycle algorithm produce solutions of similar quality.

4) The improvement of the RecurseCycle algorithm with respect to the Cycle algorithm is negligible.

5) All approximation algorithms produce far better solutions than their worst case approximation factor might indicate.

7.3 Timing Tests

For test data with coverage 6, the running-time of all algorithms plotted against string count is shown using normal scale in Figure 7.7. To examine the running-time complexities, the Best Factor algorithm and TSP algorithm running-times were plotted using double-log scale (see Figure 7.8a). A linear regression was made for both algorithms[4] and the result is plotted in Figure 7.8b.

4 Using the stats package from scipy.

[Figure 7.7: SSP algorithms running-time using normal scale. Coverage 6; y-axis: running-time in seconds; x-axis: string count; legend: Cycle, RCycle, Best, Greedy, Tsp.]

The values obtained by the linear regressions are shown in Table 7.3. The Greedy, Cycle and RecurseCycle algorithm running-times are shown in Figure 7.9a using double-log scale, with the running-time values divided by log(string count). A linear regression was made for these algorithms as well, and the result is plotted in Figure 7.9b; the values for these regressions are also shown in Table 7.3. The following observations can be made:

• the TSP algorithm and the Best Factor algorithm are clearly the slowest,
• the Greedy, Cycle and RecurseCycle algorithms are all comparable with respect to running-time, and
• the TSP algorithm and the Best Factor algorithm exhibit a polynomial running-time dependency; the Greedy, Cycle and RecurseCycle algorithms a log-times-polynomial running-time dependency.

Comparing these observations with Table 6.2 on page 74, it can be concluded that all algorithms, apart from the Best Factor algorithm, exhibit the expected running-times.
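The regression itself is a one-liner with the scipy stats package mentioned above. A hypothetical sketch (names and sample values are ours): the slope of a linear fit in double-log scale approximates the exponent c in a running-time of roughly n^c.

    import numpy as np
    from scipy import stats

    def estimate_exponent(counts, times_ms):
        # fit log2(time) = c * log2(count) + b; the slope c is the
        # empirical running-time exponent, r**2 its goodness of fit
        slope, intercept, r, p, stderr = stats.linregress(
            np.log2(counts), np.log2(times_ms))
        return slope, r**2

    # usage with hypothetical timings:
    # estimate_exponent([100, 200, 400, 800], [12.0, 55.0, 260.0, 1190.0])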

[Figure 7.8: Best Factor and TSP algorithm running-times. (a) algorithm running-times, (b) running-time linear regressions. Double-log scale; y-axis: log2(running-time in milliseconds); x-axis: log2(string count); coverage 6.]

[Figure 7.9: Cycle, RecurseCycle and Greedy algorithm running-times. (a) algorithm running-times, (b) running-time linear regressions. Double-log scale; y-axis: log2(running-time/log2(string count)); x-axis: log2(string count); coverage 6.]

Table 7.3: Running-time linear regression values

    Algorithm       slope    r-value    r-squared    p-value    stderr
    Cycle           1.89     1.00       1.00         0.00       0.022
    RecurseCycle    1.89     1.00       1.00         0.00       0.021
    Best Factor     2.16     1.00       1.00         0.00       0.026
    Greedy          1.67     1.00       1.00         0.00       0.027
    TSP             2.87     0.99       0.98         0.00       0.060

The r-value is the correlation coefficient, r-squared is the coefficient of determination, and the two-sided p-value is for a hypothesis test whose null hypothesis is that the slope is zero, i.e. no correlation at all.

The running-time measured for the Best Factor algorithm is probably due to its external LP solver.

7.4 Sequencing Tests

For the test data with coverage 8, the similarity to the original string was tested using the needle application from the EMBOSS suite [Rice, Longden, Bleasby, et al., 2000].[5] This is an implementation of the Needleman-Wunsch Optimal Pairwise Alignment (OPA) algorithm, but in this case it was only used in order to get a percentage value for the similarity between the two strings.

Remark 15 The alignment tests were conducted before the Best Factor algorithm was implemented; therefore no values are shown for this algorithm in the figures. □

The results for variable string length and variable string count were collected. Results for string count 600 and for string length 25 are shown, as the results for the other string counts/lengths were similar. The results are depicted in Figure 7.10. Figure 7.10a should be compared to Figure 7.3 on page 80, and Figure 7.10b should be compared with Figure 7.4 on page 81. Another measure of the similarity between the calculated superstring and the original string is illustrated in Table 7.4. From the figures it can be seen that there is a correlation between the SSP solution quality and the string similarity, indicating that the SSP is a plausible model for Shotgun Sequencing.

5 http://emboss.sourceforge.net/

[Figure 7.10: SSP algorithm alignment similarity results. (a) alignment similarity for variable string length (panel title "Align results for stringcount: 0500"), compare with Figure 7.3; (b) alignment similarity for variable string count (panel title "Align results for stringlength: 025"), compare with Figure 7.4. y-axis: percentage identity to original string; legend: RCycle, Tsp, Greedy, Cycle.]

Table 7.4: superstring lengths compared to original string length

    Algorithm       shorter    equal    longer
    Cycle           78 %       11 %     11 %
    RecurseCycle    78 %       11 %     11 %
    Best Factor     79 %       11 %     10 %
    Greedy          79 %       12 %     9 %
    TSP             86 %       14 %     0 %

From Table 7.4 it is evident that all algorithms compress the fragments too much, though. Note that due to the way the test data was constructed (removal of randomly chosen fragments in Step 3) on page 76), it might not be possible to reconstruct a superstring of the same length as the original sequence.


Part III

The Circular Sum Score and Multiple Sequence Alignment

[Part page image: DNA strands tour with 16384 cities]

CHAPTER 8

Sequence Alignments

One or two homologous sequences whisper . . . a full multiple alignment shouts out loud.
— Arthur Lesk

This chapter introduces and defines biological sequence alignments. It further describes the problems concerning MSAs in particular, along with a characterisation of the commonly used construction algorithms.

8.1 Motivation

Many algorithms in bioinformatics are concerned with comparing sequences. This is mainly due to the following:

Definition 14 (The First Fact of Biological Sequence Analysis) In biomolecular sequences (DNA, RNA or amino acid sequences), high sequence similarity usually implies significant functional or structural similarity. □

This fact can be popularly rephrased as: If it looks like a duck, it will probably walk and talk like a duck. Evolution reuses and modifies successful structures, and "duplication with modification" is the central paradigm for protein (amino acid sequence) evolution. A large majority of the human genes are very similar, if not directly identical, to the genes of, for example, the mouse.[1] What makes a mouse differ from a human may be determined more by protein regulation than by differences in the protein sequences.

1 This explains why some molecular biologists affectionately refer to mice as "those furry little people".


Though the first fact is extremely powerful, it does not signify a symmetric property, which is captured in:

Definition 15 (The Second Fact of Biological Sequence Analysis) Evolutionary and functionally related molecular sequences can differ significantly throughout much of the sequence and yet preserve the same three-dimensional structure(s), or the same two-dimensional substructure(s), or the same active sites, or the same or related dispersed residues (DNA or amino acid). □

The second fact could then be rephrased as: Even if it walks and talks like a duck, it does not have to look like a duck.

8.2 Pairwise Sequence Alignment

The similarity of two sequences is often measured using an alignment:

Definition 16 (Global pairwise alignment) A (global) alignment of two strings s1 and s2 is obtained by first inserting chosen spaces (or dashes) either into or at the ends of s1 and s2, and then placing the resulting strings one above the other so that every character or space in either string is opposite a unique character or space in the other string. □

8.2. Pairwise Sequence Alignment Definition 17 (Scoring Function for an Alignment) Let A be the set of possible alignments for two strings s1 , s2 ∈ Σ∗ over the finite alphabet Σ. A function f : A → R is called a scoring function for the alignment. 2 The value of the scoring function is typically defined as the sum of a column-wise scoring of the alignment characters using a scoring matrix M of size |Σ0 | × |Σ0 | Example 12 Given two DNA-strings AGCCT A and AT T C and an alignment of these: A=

AGCT CA A T −T C −

The score S, of the alignment using the score matrix A C G T A 5 1 3 1 C 1 5 1 3 G 3 1 5 1 T 1 3 1 5 − −1 −1 −1 −1

− −1 −1 −1 −1 0

is given by: A G C T A T − T S= 5 + 1 − 1 + 5

C A C − + 5 − 1 = 14

2

In Example 12 the scores for gaps are independent of the length of the current gap. An often used scoring strategy for gaps is the so-called “affine gap cost” which scores a gap g of length |g|, according to the formula: cost(g) = o + |g| · e, where o is the cost of starting a gap (opening cost) and e the cost of extending a gap.2 Further refinements can be setting the cost of terminal gaps to zero or letting gapcost be dependent on the distance to neighbouring gaps. Determining the OPA for two strings is done using dynamic programming3 and has running-time and space complexity of O(|s1 | · |s2 |). As mentioned in [Gusfield, 1997, chapter 12.1] the use of a recursive technique due to Hirshberg can reduce space complexity to O(min(|s1 |, |s2 |) while only increasing the running-time by a factor of two. 2

We deliberately refrain from participating in the controversy of whether the first gap should have the cost ’o’ or ’o + e’. 3 In bioinformatics the algorithms are named “Needleman-Wunsch” and “SmithWaterman” for global and local alignment respectively [Gusfield, 1997, chapter 11]

95

8. S EQUENCE A LIGNMENTS

8.3

Multiple Sequence Alignment

If comparison of two Ribonucleic Acid (RNA), DNA, or amino acid sequences (proteins) can lead to biological insights; then this is even more so in the case of comparing multiple sequences. Being able to determine “families” of proteins has very important applications in at least three ways: History of Life on Earth Sequence families are often used as the basis for building evolutionary or phylogenetic trees. The sequences are put as leaves in the tree (denoted the “taxa”) and vertices in the tree then reflects unknown common ancestors. Identifying conserved regions When comparing multiple viral or bacterial protein strains the most conserved regions are typically also the critical functional regions for the proteins. Since vira mutate at a very rapid rate, identifying these regions is of paramount importance when designing drugs. A drug attacking a conserved region will be far longer effective than a drug attacking a more unstable region. Structural and functional prediction A protein folds into so called “secondary structures”. These structures are divided into three categories: α-helices, β-sheets and random coils or loops (see Figure 8.1). Indel events seldom occur in the first two categories meaning that a possible way of identifying coils and loops are as regions with gaps in the alignment.

Figure 8.1: Secondary structure of Avian Sarcoma Virus. A retrovirus with many similarities to HIV (from http://mcl1.ncifcrf.gov/integrase/asv_ intro.html )

96

8.3. Multiple Sequence Alignment A straight-forward extension of the pairwise alignment definition leads to: Definition 18 (Global Multiple Alignment) A global multiple alignment of k > 2 strings, S = (s1 , s2 , . . . , sk ) is a natural generalisation of an alignment for two strings. Chosen spaces are inserted into (or at either end of) each of the k strings so that the resulting strings have the same length, defined to be `. Then the strings are arrayed in k rows of l columns each, so that each character and space of each string is in a unique column. 2 A more mathematical formulation of the above definition would be Definition 19 (Multiple Sequence Alignment) Let S = (s1 , s2 , . . . , sk ) be a collection of sequences with si ∈ Σ∗ ∀ i in 1 . . . k and Σ a finite alphabet. An MSA consists of a collection of sequences A = [[a1 , a2 , . . . , ak ]] with ai ∈ (Σ0 )` ∀ i in 1 . . . k where Σ0 = Σ ∪ {0 −0 } and 0 −0 ∈ / Σ. 2 Remark 17 We call k the size of the MSA and ` the length of the MSA. In connection with MSAs the literature often denotes the number of strings k and their (original) length n. We will use this notation, even though n in connection with the SSP was used to denote the number of strings. Hopefully this will not lead to confusion. 2 Constructing an MSA immediately leads to three distinct problems: 1) The sequences to be aligned have to be determined. 2) A scoring function for the MSA has to be defined. 3) The optimal MSA with respect to the scoring function has to be determined. Re 1) Algorithms for constructing an MSA usually assume that the sequences are “homologous”.4 This does not mean that the algorithms will refrain from constructing an MSA, should this not be the case, only that the resulting MSA will at best be useless and in the worst case directly misleading. Re 2) Defining a meaningful scoring function for an MSA is very difficult. This has to do with the notion of correctness of an MSA. What does it mean that an MSA is biologically correct and is it even possible to recognise a correct MSA if we see one? The challenge is to define a mathematical function capturing all biological knowledge about the structure, function and evolutionary history of the sequences. This challenge is still 4

The sequences share a common ancestor

97

8. S EQUENCE A LIGNMENTS to be met, as of today there is no consensus as to how to score an MSA [Elias, 2006; Gusfield, 1997; Notredame, 2002, 2007]. Re 3) As if the problem of defining a scoring function was not enough; apart from instances with very short or very few sequences, it is inconceivable to calculate an optimal MSA. The general MSA problem has been proven NP hard for the most used scoring functions [Elias, 2006; Wang and Jiang, 1994]. Algorithms for constructing MSAs can be divided into three groups: Exact Algorithms As a rule of thumb the optimal MSA cannot be practically calculated for more than three sequences of realistic lengths. There are algorithms that extend this limit by extending a dynamic programming approach into higher dimensions. They depend on identifying the portion of the hyperspace that will contribute to the solution, thereby minimising the space constraints. However even the best implementations seem to reach a limit around ten sequences [Notredame, 2002]. This group contains all implementations of the Needleman-Wunsch algorithm. One of the best known extensions to higher dimensions is the application msa [Lipman, Altschul, and Kececioglu, 1989].5 Progressive Alignment Algorithms These algorithms follow the simple idea of constructing the final MSA by adding one sequence (or a subalignment) at a time to a continuously growing alignment. They depend on a proximity measure in order to choose the next sequence (or subalignment) to add. Additionally, the order of succession has very high influence on the terminal result, in particular the insertion of gaps. This feature is captured in the common formulation: “once a gap, always a gap”. An illustration of this problem is seen in Example 13 on the facing page. The best known program using a progressive alignment algorithm is Clustalw [Thompson, Higgins, and Gibson, 1994]. It is also one of the first published MSA construction applications. Despite having been surpassed concerning both running-time and quality [Blackshields, Wallace, Larkin, and Higgins, 2006], Clustalw is still one of the most popular MSA construction applications. Iterative Alignment Algorithms The essence of these algorithms is the idea, that it is possible to arrive at an optimal MSA by refining a given sub-optimal MSA. The ways to refine the original MSA range from simply choosing random sequences (or subalignments) to realign to a variety of techniques including Hidden Markov Model 5 although contrary to popular belief msa does not guarantee the mathematical optimal solution [Notredame, 2002]

98

8.4. Notes (HMM), “Simulated Annealing, Genetic Algorithms, Divide-and-Conquer” and “Consistency Based” approaches. The latter algorithms tries to determine the MSA most consistent with libraries containing structural, functional, local and global alignment similarity information. Known applications of this type are Probcons[Do, Mahabhashyam, Brudno, and Batzoglou, 2005] and TCoffee[Notredame, Higgins, and Heringa, 2000] Many MSA application are a mix of the last two groups. Example 13 The four sequences: THE LAST FAT CAT, THE FAST CAT, THE VERY FAST CAT and THE FAT CAT, are progressively aligned in this order to construct an MSA: T HE LAST F AT CAT → T HE F AST CAT

T HE

LAS T

FAT

C AT

T HE F A S T

CA T

−−−

↓ T HE V ERY F AST CAT →

T HE

LAS T

FA − T

C AT

T HE

F AS T

CA − T

−−−

T HE V E R Y

FAST

C AT



T HE F AT CAT →

T HE

LAS T

FA − T

C AT

T HE

F AS T

CA − T

−−−

T HE

V ERY

FAST

C AT

T HE

−−−− F A−T

C AT

If the algorithm in the first alignment chooses to mismatch L, F and F, C instead of aligning CAT with CAT, the resulting end gap will stay in the final MSA. This can happen, if the cost of an end gap is below the cost of an interior gap, which is the case for several MSA applications. In the final MSA a far better score would be achieved by realigning the second sequence. 2

8.4 Notes

This chapter is mainly based on [Notredame, 2002] and [Gusfield, 1997]; in particular, most definitions are taken from the latter.


CHAPTER 9

Circular Sum Score

Oh he love, he love, he love / He does love his numbers / And they run, they run, they run him / In a great big circle / In a circle of infinity / But he must, he must, he must / Put a number to it
— Kate Bush

This chapter introduces the Circular Sum (CS) score, along with the considerations leading to its definition and a description of its main characteristics. The CS score forms the basis of the MSA algorithms described in a later chapter.

9.1 Sum-of-Pairs Score

As mentioned in Chapter 8 on page 93, one of the problems when dealing with MSAs is how to evaluate them. One of the first introduced scoring schemes [Bacon and Anderson, 1986; Murata, Richardson, and Sussman, 1985] can be seen as a natural extension of the scoring function for pairwise alignments.

Definition 20 (Sum-of-Pairs Score) Let $A = [[a_1, a_2, \ldots, a_k]]$ be an MSA of the sequences $s_1, s_2, \ldots, s_k$. Let $S(a_i, a_j)$ be the score for the induced alignment of the strings $s_i, s_j$ in $A$. We then define the Sum-of-Pairs (SOP) score of $A$ as

$$ SOP(A) = \sum_{i=1}^{k} \sum_{j>i} S(a_i, a_j) $$ □

Example 14 Using the scoring scheme of +1 for character matches and 0 for everything else,[1] the SOP score of the final MSA from Example 13 on page 99 is calculated as:

    a1: THE LAST FA-T CAT      S(a1, a2) = 8
    a2: THE FAST CA-T ---      S(a1, a3) = 9
    a3: THE VERY FAST CAT      S(a1, a4) = 9
    a4: THE ---- FA-T CAT

    a2: THE FAST CA-T ---      S(a2, a3) = 5
    a3: THE VERY FAST CAT      S(a2, a4) = 5
    a4: THE ---- FA-T CAT

    a3: THE VERY FAST CAT      S(a3, a4) = 9
    a4: THE ---- FA-T CAT

giving the result 26 + 10 + 9 = 45. □

1 This scoring scheme is also known as "unit metric" or "edit distance".
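The SOP score with the unit metric of Example 14 is straightforward to compute. A hypothetical Python sketch (names are ours); the assertion reproduces the result of Example 14:

    def induced_score(a, b):
        # column-wise unit-metric score of two induced (gapped) rows:
        # +1 for a character match, 0 otherwise (gap-gap scores 0)
        return sum(1 for x, y in zip(a, b) if x == y and x != '-')

    def sop_score(alignment):
        # Sum-of-Pairs score of an MSA given as equal-length rows
        k = len(alignment)
        return sum(induced_score(alignment[i], alignment[j])
                   for i in range(k) for j in range(i + 1, k))

    msa = ["THELASTFA-TCAT", "THEFASTCA-T---",
           "THEVERYFASTCAT", "THE----FA-TCAT"]
    assert sop_score(msa) == 45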

The SOP score is easy to calculate, but like most scoring functions it lacks a clear biological justification. In [Notredame, 2002] it is even stated that the SOP score is wrong from an evolutionary point of view. The following section will elaborate on this claim.

9.2 MSAs and Evolutionary Trees

Substitution matrices used in alignments are often constructed using a predefined evolutionary model in which the score of a substitution, a gap or a conservation is determined according to the biological likelihood of this event happening. This means that unlikely mutations will have negative (or very small) scores, while mutations occurring more often than would be expected by chance will have positive scores.[2] The most commonly used scoring matrices are the Point Accepted Mutation (PAM) and Block Substitution Matrix (BLOSUM) series.[3] Typically the scoring of gaps is done using the "affine" gap cost (see Section 8.2 on page 94). We can connect an MSA to an evolutionary tree in the following way:

Definition 21 (Evolutionary Tree) Let S = (s1, s2, ..., sk) be a collection of biomolecular, homologous sequences with k ≥ 2. An Evolutionary or Phylogenetic Tree for S is a rooted, binary tree, where the leafs are labelled with the k sequences (extant species) and each interior vertex represents a common (usually unknown) ancestor of its children (extinct species). □

2 If a minimisation problem is preferred for the MSA, the same values can be used after either multiplying each value by −1 or subtracting each value from the largest value.
3 http://en.wikipedia.org/wiki/Substitution_matrix

Consider an evolutionary tree of five sequences S = (s1, s2, s3, s4, s5) (see Figure 9.1), and assume this is the correct evolutionary tree for the sequences. If we look upon the construction of an MSA for the sequences as a minimisation problem, we can interpret the score of the pairwise alignment [[s1, s2]] as a measure of the probability of evolutionary events (mutations and indels) on the edges (s1, v1) and (v1, s2). Likewise, the score of the alignment [[s3, s4]] evaluates the likelihood of evolutionary events on the edges (s3, v2), (v2, v3), (v3, s4). In other words: the shorter the distance between two sequences in the tree, the less likely it is that evolutionary events took place, i.e. the more identical and closely related the sequences will be.

[Figure 9.1: Evolutionary tree with five sequences (species). Root r, inner vertices v1, v2, v3, leaves s1–s5.]

With this interpretation in mind, consider an MSA of the sequences in the same evolutionary tree using the SOP score. If a "tick" is added to an edge each time it is included in the calculation of a pairwise alignment, we get a result as shown in Figure 9.2. The effect is that some edges are considered more often than others. However, there is no biological justification to suggest that some evolutionary events are more important than others. Another way to formulate the evolutionary problem with the SOP score is that, in a sense, the SOP score considers each sequence to be an ancestor of every other sequence.

9.3 Circular Orders, Tours and CS Score

The considerations in the last section thus led to the idea of defining a scoring function that will consider each edge in an evolutionary tree equally. Or, quoting from [Gonnet et al., 2000]:

[Figure 9.2: SOP score "ticks" added for each edge (from [Gonnet et al., 2000]): each inner edge receives 6 ticks, each leaf edge 4.]

How can a tree be traversed, going from leaf to leaf, such that all edges (each representing its own episode of evolutionary divergence) are counted the same number of times?

Definition 22 (Circular Order and Circular Tour) Let T be an evolutionary tree for the sequences S = (s1, s2, ..., sk). A permutation of the sequences which, when interpreted as a traversal of T, leads to a tour visiting each leaf once and each edge exactly twice, is called a Circular Order (CO). A tour in a tree following a CO is denoted a Circular Tour (CT).[4] □

Consider our example evolutionary tree for five sequences (Figure 9.3 on the next page). If the tree is traversed in the order s1, s2, s3, s4, s5 (see subfigure 9.3a), the result will be a CT. Another CO would be s1, s2, s5, s4, s3, leading to the CT depicted in subfigure 9.3b. The last tour can also be imagined as being "equal to" the first tour in an isomorphic tree resulting from rotating the subtree Tv2 around v2. If we consider circular orders that are cyclic shifts of each other as giving the same CT, then the number of distinct CTs is given as the number of rotations around all inner vertices apart from the root. For a tree with n leafs, the number of distinct CTs is thus $2^{n-2}$. As an evolutionary tree has at least two leafs, we now have:

Lemma 23 The CT is the shortest possible tour through an evolutionary tree that visits each leaf once. □

4 Strictly speaking this is only a tour as far as the distinct leaf nodes are concerned, as each inner node is visited twice.

r

r v2

v1

v2 v1

v3 s3

s1

s4

s5

v3 s3

s1

s2

s4

s5

s2

(a) circular order 1-2-3-4-5

(b) circular order 1-2-5-4-3

Figure 9.3: Circular tours

P ROOF A tour visiting all leafs and returning to the “start” leaf will have to traverse each edge at least twice, as every edge in both ends will be incident to subtrees containing at least one leaf. A CT per definition visits each edge exactly twice. Using induction over the number of leafs, we can prove the existence of a CT in any tree. Base Case: In a tree with two leafs the only tour is a CT (see Figure 9.4). Induction Case: Assume that every evolutionary tree with n − 1 leafs has a CT. Consider a tree, Tn with n leafs. If we remove a subtree of Tn consisting of a leaf, sn and its non-root parent, pn together with the edge between them (such a subtree will exist for n ≥ 3), the resulting tree Tn−1 will have n − 1 leafs. Per induction hypothesis there exists a CT, ctn−1 in Tn . Now consider a tour tn , in Tn which is identical to ctn−1 . The tour will be passing pn twice as pn “divides” an edge in Tn−1 . If we extend the tour with a visit of sn at one of these passings, the resulting tour is a CT for T . (see Figure 9.5 on the next page) 

r s1

s2

Figure 9.4: Base case We can now finally close in on the new scoring function for MSAs, by considering the following question: Given the homologous sequences S = 105

9. C IRCULAR S UM S CORE

Tn-1

Tn p n

p n

sn

sn (a)

(b)

Tn

Tn

p

p

n

n

sn

sn

(c)

(d)

Figure 9.5: Induction case

(s1 , s2 , . . . , sk ), how can we find a CT through an evolutionary tree for S without actually knowing or having to construct the tree? If we interpret the score of the OPA [[si , sj ]] as a distance measure for the strings (leafs) in the evolutionary tree, and use Lemma 23 , the problem of finding  a circular order and a CT reduces to solving a TSP instance n with the 2 pairwise alignment scores as distance matrix and the k sequences as cities. The determination of a CO and com CT can be done very efficiently n 5 pared to the calculation of the 2 pairwise alignment scores, that often is a prerequisite for construction of an MSA. Definition 23 (Circular Sum Score) Let A = [[a1 , a2 , . . . , ak ]] be an MSA of the sequences s1 , s2 , . . . , sk . Let S(ai , aj ) be the score for the induced alignment of the the strings si , sj in A. We then define the Circular Sum score of A as CS(A) =

1 2

k X

S(aci , aci+1 ) where

i=1

ck+1 = c1 and C = (c1 , c2 , . . . , ck ) is a CO. 5

106

By now we also know which tools to use.

2

9.3. Circular Orders, Tours and CS Score The CS score is similar to the SOP score, except that not all pairwise alignments are added, only the pairwise alignments in CO are added and divided by two. The CS score can be regarded as a SOP score in CO. Example 15 Using the edit distance scoring for the four sequences from Example 14 on page 101, we get the following OPA distance matrix.

s1 s2 s3 s4

: : : :

T HE T HE T HE T HE

LAST F AT CAT F AST CAT V ERY F AST CAT F AT CAT

s1 s2 s3 s4

s1 s2 s3 s4 13 9 9 9 9 10 10 9 9 10 14 9 9 9 9 9

A CO will be C = (s1 , s2 , s3 , s4 ). The CS score of the alignment from Example 14 will be: LAST

FA − T

CAT

a2 : T HE

F AST

CA − T

−+ −−− ←

a3 : T HE

V ERY

FAST

CAT

←−−−− +

a4 : T HE

−−−− F A−T

CAT

←−−−−−− +

giving the result

9.3.1

←−−−−− +

a1 : T HE

1 2

8

5

· (8 + 5 + 9 + 9) = 15.5.

9 9 2

Some CS score properties

The CS score of an alignment of a set of sequences has a clear upper bound: Definition 24 (Maximal CS Score) Let S = (s1 , s2 , . . . , sk ) be a collection of sequences. Let OP A(si , sj ) denote the score of the OPA of the sequences si , sj The upper bound for the CS score of S is defined as CS max (S) =

1 2

k X

OP A(aci , aci+1 )

where

i=1

ck+1 = c1 and C = (c1 , c2 , . . . , ck ) is a CO.

2

Example 16 Using the matrix and the CO from Example 15 we get for the sequences S CS max (S) = =

1 2 1 2

· (OP A(s1 , s2 ) + OP A(s2 , s3 ) + OP A(s3 , s4 ) + OP A(s4 , s1 )) · (9 + 10 + 9 + 9) = 18.5

2

107

9. C IRCULAR S UM S CORE Likewise the optimal score of an alignment of a set of sequences can be defined: Definition 25 (Optimal CS Score) Let A be the set of all possible alignments of the sequences S = (s1 , s2 , . . . , sk ). The optimal score for the CS score of S is defined as CS opt (S) = max CS(A).

2

A∈A

Example 17 An optimal alignment of the string collection, S from Example 15 and the resulting CS score is a1 : T HE

LAS T

F A−T

CAT

a2 : T HE

−−−−

F AS T

CAT ← −+

a3 : T HE

V ERY

FAST

CAT ←−−−− +

a4 : T HE

−−−− F A−T

giving the result CS opt (S) =

1 2

9

←−−−−− + 10 9

CAT ←−−−−−−− +

9

· (9 + 10 + 9 + 9) = 18.5.

2

The relation between the scores is given by the following lemma: Lemma 24 Let A be the set of all possible alignments of the sequences S = (s1 , s2 , . . . , sk ). We then have: CS max (S) ≥ CS opt (S)

2

P ROOF This follows immediately as each score in the sum for CS max (S) will be larger than or equal to the equivalent induced pairwise alignment score in CS opt (S).  Although the scores in the previous two examples are equal, it is not in general case that CS opt (S) equals CS max (S). If, on the other hand, we do have an alignment, A of the sequences S = (s1 , s2 , . . . , sk ) with CS(A) = CS max (S) then we know that the MSA is optimal.

9.4

Notes

This chapter is mainly based on [Gonnet et al., 2000] and [Roth-Korostensky, 2000]. The origin of the SOP score is somewhat unclear; Gusfield gives [Carrillo and Lipman, 1988] as the first introduction but [Altschul and Lipman, 1989] gives the two references mentioned. 108

9.4. Notes The definition and treatment of evolutionary trees is intentionally kept simple in order to avoid getting too involved in the plethora of related topics like bifurcation contra multifurcating, parsimony, molecular clock hypothesis, ultrametric distances and evolutionary tree construction.

109

C HAPTER

10 Algorithms

This chapter presents the two implemented MSA algorithms. The first is an MSA construction algorithm called tspMsa and the second an MSA improving algorithm called enhanceMsa. Both algorithms are based directly on the CS score and the actual implementations both consist of improvements of the original algorithms due to Roth-Korostensky.

10.1

Path Sum Scores

For notational convenience we will start this chapter with a couple of definitions Definition 26 (Path Sum Score) Let A = [[a1 , a2 , . . . , ak ]] be an MSA of the sequences s1 , s2 , . . . , sk . Let S(ai , aj ) be the score for the induced alignment of the the strings si , sj in A. We then define the Path Sum score of A as PS (A)

=

1 2

k−1 X

S(aci , aci+1 ) with (c1 , c2 , . . . , ck ) a CO.

2

i=1

Remark 18 The PS(A) is the CS(A) where the score value for the induced strings ak , a1 is left out of the summation. 2 Analogously we have

111

10. A LGORITHMS Definition 27 (Maximal Path Sum (MPS) Score) Let S = (s1 , s2 , . . . , sk ) be a collection of sequences. Let OP A(si , sj ) denote the score of the OPA of the sequences si , sj . The MPS score is defined as

M P S(S) =

1 2

k−1 X

OP A(aci , aci+1 ) with (c1 , c2 , . . . , ck ) a CO.

2

i=1

Remark 19 The M P S(S) is the CS max (S) where the OPA score value for the induced strings ak , a1 is left out of the summation. If an MSA, A is having the property that PS(A) = M P S(S) we say the alignment has MPS score 2

10.2

The MSA Construction Algorithm

The tspMsa algorithm is a progressive alignment algorithm that uses a CT to determine the alignment order. The pseudo-code for the algorithm is shown in III.1 on the next page. Given a collection of sequences S = (s1 , s2 , . . . , sk ) tspMsa() executes the following four steps.

1) For all possible sequence pairs construct an optimal pairwise alignment and use the score of these optimal alignments to construct a distance matrix for the TSP solver. 2) Use the distance matrix to determine a CO and a CT. 3) Find the sequence pair si , si+1 having the worst OPA score in the CT and rearrange the cyclic order using cyclic shifts until the CT starts with sequence si+1 . Rename this order to s1 , s2 , . . . , sk . 4) Construct an MSA A, progressively beginning with sequence s1 , s2 ensuring that A will have MPS score. This step is the algorithm called CIRCUS in [Roth-Korostensky, 2000, chapter 6]. It progressively constructs an MSA of a set of strings S = (s1 , s2 , . . . , sk ) using a CO. (see Algorithm III.2 on page 115 and Example 18 on page 114)

112

10.2. The MSA Construction Algorithm Algorithm III.1 tspMsa Require: List of sequences, sl containing S = (s1 , s2 , . . . , sk ) Ensure: An MSA A = [[a1 , a2 , . . . , ak ]]. In case of maximisation and nonnegative induced alignment scores CS(A) ≥ k−1 k · CS max (S) procedure TSP M SA(sl) distances ← matrix of size |sl| × |sl| . Step 1) 3: for i = 1 to |sl| − 1 do for j = i + 1 to |sl| do distances[i][j] ← OP A(si , sj ) 6: distances[j][i] ← OP A(si , sj ) end for end for 9: cycleOrder ← FIND C YCLE O RDER(distances) . Step 2) worstValueIndex ← 1 . Step 3) worstValue ← distances[cycleOrder[1]][cycleOrder[2]] 12: for i = 2 to |cycleOrder| − 1 do currentValue ← distances[cycleOrder[i]][cycleOrder[i + 1]] if currentValue worse than worstValue then 15: worstValueIndex ← i worstValue ← currentValue end if 18: end for for i = 1 to worstValueIndex do APPEND (cycleOrder, cycleOrder[1]) 21: REMOVE (cycleOrder, 1) end for msa ← LIST(sl[cycleOrder[1]]) . Step 4) 24: for i = 2 to |cycleOrder| do msa ← EXTEND A LIGNMENT(msa, sl[cycleOrder[i]]) end for 27: return msa end procedure

Termination: The algorithm only iterates over final non-increasing data structures.

Correctness: Assume maximisation with S(ai, aj) ≥ 0 ∀ i, j ∈ 1...k, and assume that procedure EXTENDALIGNMENT() returns an MSA having MPS score. The rotation of the cycle order in Step 3) ensures that the score OPA(sk, s1) is minimal among all the OPA scores. This means that the maximal value for OPA(sk, s1) is (1/k) · CS_max(S), giving

    CS(A) = MPS(S) = CS_max(S) − OPA(sk, s1)
                   ≥ CS_max(S) − (1/k) · CS_max(S)
                   = ((k−1)/k) · CS_max(S)

The running-time for the algorithm results from a sequential execution of the algorithm steps. For a collection of sequences S = (s1, s2, ..., sk) with |si| = n ∀ i ∈ 1...k this gives a running-time complexity of O(k²n² + k^2.2 + k + kn²) = O(k²n²). Here the complexity of Step 2) (k^2.2) assumes use of the LKH TSP solver, and the complexity of Step 4) (kn²) assumes that the OPAs from Step 1) have to be recalculated.

Example 18 Let S = (s1, s2, ..., sk) be a collection of sequences ordered with respect to the CO in Step 3). Let S(aj, aj+1) denote the score of the induced alignment of the sequences sj, sj+1. Let OPA(sj, sj+1) be the score of the OPA of sj, sj+1 and let tj, tj+1 denote the strings in this OPA. Assume that the first i sequences have been aligned in an MSA A′ = [[a1, a2, ..., ai]] so that S(aj, aj+1) = OPA(sj, sj+1) ∀ j ∈ 1...i−1 (see Figure 10.1).

Figure 10.1: Before executing extendMsa(). (from [Roth-Korostensky, 2000])


Algorithm III.2 extendMsa
Require: An MSA A = [[a1, a2, ..., ai]] having MPS score, and a sequence si+1 to incorporate in the MSA
Ensure: An MSA A = [[a1, a2, ..., ai+1]] having MPS score.

     procedure EXTENDALIGNMENT(msa, sequence)
        ti, ti+1 ← CALCULATEOPA(si, si+1)
 3:     for all gaps g in ai do
           if g not present in ti then
              INSERTGAP(ti, g)
 6:           INSERTGAP(ti+1, g)
           end if
        end for
 9:     for all gaps g in ti do
           if g not present in ai then
              for all aj in msa do
12:              INSERTGAP(aj, g)
              end for
           end if
15:     end for
        APPEND(msa, ti+1)
        return msa
18:  end procedure

Termination: The algorithm only iterates over finite, non-increasing data structures.

Correctness: That the resulting MSA has MPS score can be seen by inspection of the gap insertion steps in the algorithm. In lines 3 through 8, only gaps (or parts of gaps) that are not already in ti are inserted into both ti and ti+1. This means that the score S(ti, ti+1) will not be affected, since opposing gaps do not contribute to the score. In other words: ti, ti+1 will still be an OPA of si, si+1, only with superfluous gaps (see Figure 10.2 (b)). Analogously, in lines 9 through 15, only gaps (or parts of gaps) that are not already in ai are inserted into every aj. This does not affect the score of the MSA, which will still have MPS score (see Figure 10.2 (c)). Since all gaps (or parts of gaps) not present in ai and ti are respectively inserted into the other sequence, ai and ti are now identical. Extending the MSA with ti+1 at this point will thus result in an MSA with MPS score (see Figure 10.2 (d)).


Now execute the following steps.

i) Construct the OPA for the sequences si, si+1, giving the sequences ti, ti+1 (see Figure 10.2 (a)).

ii) Insert all gaps which are present in ai and not already present in ti into both ti and ti+1 (see Figure 10.2 (b)).

iii) Insert all gaps present in ti which are not present in ai into all sequences in the MSA (see Figure 10.2 (c)).

iv) At this point the sequences ai and ti will be identical. Now append ti+1 to the MSA as ai+1, creating the extended alignment A″ (see Figure 10.2 (d)).
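The gap bookkeeping of steps ii) and iii) can be sketched in Python as follows, under the simplifying assumptions that alignment rows are equal-length strings with '-' for gaps and that ai and ti spell the same ungapped sequence si (merge_gaps is a hypothetical helper, not the thesis code).

    # Walk a_i and t_i in parallel: insert the gaps of a_i into
    # t_i/t_{i+1} (step ii) and the gaps of t_i into every MSA row
    # (step iii), then append the merged t_{i+1} (step iv).
    def merge_gaps(a_i, t_i, t_i1, msa):
        cols, new_t_i1 = [], []      # cols: source column in the MSA, or None
        p = q = 0
        while p < len(a_i) or q < len(t_i):
            a_gap = p < len(a_i) and a_i[p] == '-'
            t_gap = q < len(t_i) and t_i[q] == '-'
            if a_gap and not t_gap:      # step ii): gap only in a_i
                cols.append(p); new_t_i1.append('-'); p += 1
            elif t_gap and not a_gap:    # step iii): gap only in t_i
                cols.append(None); new_t_i1.append(t_i1[q]); q += 1
            else:                        # same residue (or same gap) in both
                cols.append(p); new_t_i1.append(t_i1[q]); p += 1; q += 1
        extended = [''.join(row[c] if c is not None else '-' for c in cols)
                    for row in msa]
        extended.append(''.join(new_t_i1))
        return extended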

Figure 10.2: Example 18 steps i) – iv). (from [Roth-Korostensky, 2000])


10.2.1 Result improvements

In order to improve the MSA constructed by the tspMsa algorithm, Korostensky and Gonnet mention a couple of possible post-optimisations. These post-optimisations were not implemented for this thesis, but since they are based on some interesting considerations, we will give a short sketch of the ideas. The MSA returned by the tspMsa algorithm has the property that the CS score is optimal for all sequence pairs except s1, sk. The optimisations aim at improving the score for the induced non-optimal alignment [[a1, ak]]. Changing this alignment will, however, affect the induced alignments of other pairs. Since these alignments are all optimal, the challenge is to improve the score of [[a1, ak]] by more than the change will decrease the scores of all other induced alignments. It can be argued that inserting a gap into either of the strings a1, ak is very unlikely to improve the total CS score. The argument is that the gap-opening cost is typically large compared to the values from the scoring matrix. Thus inserting a gap g into e.g. a1 will decrease the total CS score by at least two times the gap-opening cost (see Figure 10.3). Furthermore, a gap g′ of the same length will have to be inserted into all other induced strings ai, i ≠ 1 (at another position) to ensure that all induced strings have equal length. Again this causes an additional score decrease of at least two times the gap-opening cost.

Figure 10.3: Inserting gaps g and g′ affects the CS score at the four locations marked with dark clouds. (from [Roth-Korostensky, 2000])

Altogether this makes it very improbable that the CS score for the alignment can be improved by insertion of gaps into a1 or ak. If the alignment score of [[a1, ak]] is within four times the gap-opening cost of the OPA score, it will be impossible to improve the CS score this way.

Refraining from gap insertions, a more promising score-improving strategy is to shift gap blocks around in a1, ak, either with a heuristic or with an exhaustive search algorithm. In [Roth-Korostensky, 2000] a heuristic that aligns equal-sized opposing gaps in a1, ak is implemented. For this thesis a direct optimisation of the tspMsa algorithm was implemented instead. The idea is that Step 3) of the algorithm only exists in order to achieve the guarantee for the final CS score; it is not certain that the index found is the optimal index at which to break the cycle. To determine this optimal index, say i, it is necessary to find the sequences where the difference between OPA(si−1, si) and the alignment score of [[ai−1, ai]], when breaking the cycle at index i, is minimal.

10.2.2 Implementation

The implementation of tspMsa was done in Python, following the algorithm steps listed above. To determine the pairwise OPAs it uses external calls to the stretcher application from the EMBOSS suite, which is a C++ implementation of the Hirschberg algorithm. The determination of cycle orders and tours was done using external calls to LKH. The actual implementation replaces Step 3) of the original algorithm by trying out all k index possibilities for breaking the cycle. This did increase the running-time, but the effect was partly compensated for by caching the calculated OPA values from Step 1) and reusing these when progressively building the MSA.
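In sketch form, the implemented Step 3) replacement looks as follows; build_msa() stands for Step 4) run on a given order and cs_score() for the CS scoring routine (both hypothetical placeholder names).

    # Try all k rotations of the cycle order and keep the progressive
    # MSA with the best CS score.
    def best_break(cycle_order, build_msa, cs_score):
        best_msa, best_score = None, float('-inf')
        for shift in range(len(cycle_order)):
            order = cycle_order[shift:] + cycle_order[:shift]
            msa = build_msa(order)          # Step 4) on the rotated order
            score = cs_score(msa)
            if score > best_score:
                best_msa, best_score = msa, score
        return best_msa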

10.3 The MSA Enhance Algorithm

The enhanceMsa algorithm can be characterised as a kind of iterative alignment algorithm, since it tries to improve a given MSA. The algorithm, introduced in [Roth-Korostensky, 2000, chapter 8], is an application of the divide-and-conquer schema. The following definition will ease the description of enhanceMsa.

Definition 28 (MSA Block) Let A = [[a1, a2, ..., ak]] be an MSA with length ℓ. A block of A is defined by two indices s, e in 1...ℓ+1 with s ≤ e and is denoted A[[s : e]]. It consists of the sequence slices ai[s : e] ∀ i ∈ 1...k. The notation A[[s :]] is shorthand for A[[s : ℓ + 1]] and A[[: e]] is shorthand for A[[1 : e]].

Given a collection of sequences, enhanceMsa can be described as follows.

1) Calculate an MSA A, using a fast MSA construction algorithm.

2) If the length of A is less than a given minimum length, return A.

3) Determine the block of A having the best score, B, and the (possibly empty) blocks on either side, L and R.

4) Realign the best scoring block to get B′ and keep the better scoring of B and B′, denoting this block B_final.

5) Return the MSA L_final B_final R_final, where L_final and R_final are the results of repeating from Step 1) on L and R respectively.

Figure 10.4: The MSA enhance algorithm. (from [Roth-Korostensky, 2000])

In the case of enhancing MSAs of proteins, the biological reasoning for the algorithm is as follows: the algorithm looks for the best place to divide an MSA into a good block and two further blocks by determining the block having the best CS score. This block will typically be a conserved block of the proteins. Proteins exhibit conserved as well as highly divergent regions. The conserved regions may be active sites of the protein,¹ or stable secondary structures like α-helices and β-sheets (see Figure 8.1).

¹ locations responsible for the chemical reactions of the proteins


Since conserved regions have fewer gaps than divergent regions, most MSA algorithms produce more reliable alignments within these. The algorithm is depicted in Figure 10.4 and pseudo-code for it is shown in Algorithm III.3.

Algorithm III.3 enhanceMsa
Require: Sequence (or alignment) list sl of size k, a minimum block length mbl
Ensure: An MSA A = [[a1, a2, ..., ak]].

     procedure ENHANCEMSA(sl, mbl)
        A ← CONSTRUCTMSA(sl)
 3:     s, e ← FINDBESTBLOCK(A)
        A_left ← A[[1 : s]], A_best ← A[[s : e]], A_right ← A[[e :]]
        if CS(A_best) < CS(CONSTRUCTMSA(A_best)) then
 6:        A_best ← CONSTRUCTMSA(A_best)
        end if
        if LENGTH(A_left) > mbl then
 9:        if CS(A_left) < CS(CONSTRUCTMSA(A_left)) then
              A_left ← ENHANCEMSA(A_left, mbl)
           end if
12:     end if
        if LENGTH(A_right) > mbl then
           if CS(A_right) < CS(CONSTRUCTMSA(A_right)) then
15:           A_right ← ENHANCEMSA(A_right, mbl)
           end if
        end if
18:     return JOIN(A_left, A_best, A_right)
     end procedure

Termination: Assume FINDBESTBLOCK() returns a best block of length ℓ_best ≥ 1 and that CONSTRUCTMSA() does not return alignments with columns consisting of gaps only. The length of the longest possible alignment returned from CONSTRUCTMSA() in line 2 will then be ||S||. Since ℓ_best ≥ 1, the recursive calls in line 10 and line 15 will be on alignments of strictly smaller length. This ensures termination of the recursion as soon as the alignments have length mbl or below.

Correctness: Assuming CONSTRUCTMSA() returns a proper alignment, the final MSA will be a proper MSA of S, as it consists of a join of alignments of disjoint blocks of A.

The running-time for enhanceMsa is determined by the number of recursive calls and the time for each invocation of ENHANCEMSA(). The first depends on the lengths of the MSAs produced by CONSTRUCTMSA()

and the best blocks found by FINDBESTBLOCK(). If the lengths of the latter are constant, say ℓ_best, the number of recursive calls will in the worst case be of order O(||S|| / ℓ_best). The running-time for each invocation is determined by the running-times for CONSTRUCTMSA(), FINDBESTBLOCK() and the time for calculating the CS score. Denoting the first two rt(CONSTRUCTMSA) and rt(FINDBESTBLOCK) respectively, the running-time for enhanceMsa will have a worst case order of

    O( (||S|| / ℓ_best) · (rt(CONSTRUCTMSA) + rt(FINDBESTBLOCK) + kn) )
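As a compact counterpart to the pseudo-code, the following Python sketch mirrors Algorithm III.3; construct_msa, find_best_block and cs_score are placeholders for the external MSA program, Algorithm III.4 and the CS scoring routine, and indices are 0-based Python slices rather than the 1-based block notation above.

    # A sketch of Algorithm III.3; an alignment is a list of equal-length
    # strings, and recursive calls receive alignments, matching the
    # "Sequence (or alignment) list" requirement.
    def enhance_msa(sl, mbl, construct_msa, find_best_block, cs_score):
        msa = construct_msa(sl)
        s, e = find_best_block(msa)
        left  = [row[:s] for row in msa]       # A[[1 : s]]
        best  = [row[s:e] for row in msa]      # A[[s : e]]
        right = [row[e:] for row in msa]       # A[[e :]]
        realigned = construct_msa(best)
        if cs_score(best) < cs_score(realigned):
            best = realigned                   # keep the better block
        if len(left[0]) > mbl and cs_score(left) < cs_score(construct_msa(left)):
            left = enhance_msa(left, mbl, construct_msa, find_best_block, cs_score)
        if len(right[0]) > mbl and cs_score(right) < cs_score(construct_msa(right)):
            right = enhance_msa(right, mbl, construct_msa, find_best_block, cs_score)
        return [l + b + r for l, b, r in zip(left, best, right)]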

10.3.1 Finding the best block within an MSA

It remains to specify how to construct MSAs and how to find the best block in an MSA, that is, how to implement the CONSTRUCTMSA() and FINDBESTBLOCK() functions used in Algorithm III.3. The first is postponed (see Section 10.3.2), and this section will elaborate on finding the best scoring block within an MSA. In [Roth-Korostensky, 2000] this is done by deciding upon a fixed MSA block length and then using a sliding-window method to determine the best scoring block, i.e. calculating the score of the fixed block size for all possible positions along the MSA. For this thesis it was decided to use an implementation of the algorithm described in [Bentley, 2000, chapter 8]. The algorithm, called MAXSUM(), finds the maximum positive sum in a contiguous sub-array of an array containing real numbers (see Example 19). The reasons for choosing this algorithm were that i) it will find the optimal scoring block of an MSA with no constraints on the length of this block, and ii) it will do so with a running-time linear in the length of the MSA. Pseudo-code for FINDBESTBLOCK() and the sub-procedure MAXSUM() is shown in Algorithm III.4.

Example 19 Let

    index:   1    2    3    4    5    6    7    8    9   10
    Arr:    31  −41   59   26  −53   58   97  −93  −23   84

be an array of integers. The MAXSUM() algorithm will return the indices 3, 8, marking the slice Arr[3 : 8] with sum 187.

In order to use MAXSUM() on an MSA, the CS score for each column is calculated and inserted into an array Arr. The average value of Arr is then subtracted from each value in Arr. This ensures that there will be both positive and negative values in Arr, so that the result of MAXSUM() will be different from Arr itself (see Example 20).

Example 20 Using the MSA and edit distance scoring scheme from Example 13, Algorithm III.4 will give the following result.

          T H E L A S T F A − T C A T
    A =   T H E F A S T C A − T − − −
          T H E V E R Y F A S T C A T
          T H E − − − − F A − T C A T

    Arr = 4 4 4 0 2 2 2 3 4 0 4 3 3 3

    average value = (12 + 6 + 11 + 9) / 14 = 38 / 14 ≈ 2.7  ⇒

    Arr = [1.3, 1.3, 1.3, −2.7, −0.7, −0.7, −0.7, 0.3, 1.3, −2.7, 1.3, 0.3, 0.3, 0.3]

    FINDBESTBLOCK(A) = 1, 4  ⇒  A[[1 : 4]] is the maximal block


Algorithm III.4 findBestBlock
Require: List of sequences sl containing a non-empty alignment A = [[a1, a2, ..., ak]]
Ensure: Indices s, e such that CS(A[[s : e]]) is maximal for all blocks of A.

     procedure FINDBESTBLOCK(sl)
        columnScores ← GETCOLUMNSCORES(sl)
 3:     avg ← SUM(columnScores) / |columnScores|
        for i = 1 to |columnScores| do
           columnScores[i] ← columnScores[i] − avg
 6:     end for
        return MAXSUM(columnScores)
     end procedure
 9:
     procedure MAXSUM(numbers)
        best ← LIST(0, 0, 0)                       ▷ sum, start index, end index
12:     current ← LIST(0, 0, 0)
        for i = 1 to |numbers| do
           if current[1] + numbers[i] < 0 then
15:           current ← LIST(0, 0, 0)              ▷ reset if sum negative
           else
              current[1] ← current[1] + numbers[i]
18:           if current was reset then            ▷ e.g. is current[3] = 0 ?
                 current[2] ← i
                 current[3] ← i
21:           end if
              current[3] ← current[3] + 1          ▷ increase end index
           end if
24:        if best[1] < current[1] then
              best ← current[:]                    ▷ a copy, not a reference
           end if
27:     end for
        return (best[2], best[3])
     end procedure


Termination: The algorithm only iterates over finite, non-increasing data structures.

Correctness: The algorithm uses the observation that, summed from its start index, the best slice cannot have negative part-sums; if it had, it would contain a sub-slice having a better score. This observation is captured in the use of the current variable, since it contains and sums from the possible start indices of a best slice in numbers. The best variable keeps track of the best slice seen so far. For every round of the for-loop in lines 13 to 27 the following invariant holds: the values of best and current are correct.

Before the loop the invariant obviously holds. Assume the invariant holds for round i − 1. For round i we have two possibilities for current:

• Its sum plus the value of numbers[i] is negative, in which case current should be reset. This happens in line 15.

• Its sum plus the value of numbers[i] is positive or zero, in which case the sum should be increased by the value of numbers[i] and the end index should be increased by one. This happens in line 17 and line 22 respectively. Should numbers[i] be the first positive value that current "meets" after being reset, the start index and end index have to be set. This special case is treated in lines 19 and 20.

For round i we get for best:

• If current has obtained a larger sum, best should be updated. This happens in line 25.

All of the above gives that the invariant will hold after round i.
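For reference, here is a runnable Python version of Algorithm III.4, returning 0-based, end-exclusive indices (so the indices 3, 8 of Example 19 appear as (2, 7)).

    # find_best_block() expects the per-column CS scores; max_sum() is
    # the Kadane-style scan whose invariant is argued above.
    def find_best_block(column_scores):
        avg = sum(column_scores) / len(column_scores)
        return max_sum([x - avg for x in column_scores])

    def max_sum(numbers):
        """Return (start, end) of the maximum-sum slice numbers[start:end]."""
        best_sum, best = 0, (0, 0)
        cur_sum, cur_start = 0, 0
        for i, x in enumerate(numbers):
            if cur_sum + x < 0:
                cur_sum, cur_start = 0, i + 1   # reset if the sum turns negative
            else:
                cur_sum += x
                if cur_sum > best_sum:
                    best_sum, best = cur_sum, (cur_start, i + 1)
        return best

    # Example 19: prints (2, 7), i.e. the slice 59, 26, -53, 58, 97 with sum 187
    print(max_sum([31, -41, 59, 26, -53, 58, 97, -93, -23, 84]))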

10.3.2 Implementation

The enhanceMsa algorithm was implemented in Python, with external calls to the MSA construction applications Muscle and Mafft implementing the CONSTRUCTMSA() procedure.

10.4 Notes

This chapter is mainly based on [Korostensky and Gonnet, 1999], [Roth-Korostensky, 2000, chapters 6 and 8] and [Bentley, 2000, chapter 8].


CHAPTER 11
Experiments

There is something fascinating about science. One gets such wholesale returns of conjecture out of such a trifling investment of fact.
(Mark Twain)

This chapter contains descriptions of the experiments performed on the two implemented MSA algorithms, tspMsa and enhanceMsa, together with their results. In order to evaluate the alignments constructed by tspMsa, the alignments were compared with MSAs constructed by two other applications. All alignments were scored using both the CS score and the traditional SOP score. Further assessment was done using the supplied scoring methods from a protein reference database. In the experiments with enhanceMsa, the scores of the MSA before and after refinement were compared. The alignments were scored using the CS score and the supplied scoring methods from the protein reference database.

11.1 Reference MSA Applications

As mentioned in [Blackshields et al., 2006], the amount of published MSA construction methods and/or packages is quite overwhelming. According to Blackshields et al., more than 50 applications were described from 1996–2006, and in 2005 alone there were at least 20 new publications. The criteria for choosing reference MSA applications can be phrased as trying to select "the fastest of the best": a small running-time is important for the enhanceMsa algorithm, and high quality alignment methods are preferable as benchmarks for tspMsa. The applications chosen were Muscle and Mafft, as they both fulfil the above criteria [Blackshields et al., 2006; Notredame, 2007; Wallace, Blackshields, and Higgins, 2005].

Muscle is a progressive alignment program that constructs a very crude guide tree used for constructing the first MSA. The guide tree is then refined and a second MSA is constructed using the improved guide tree. For increased speed, only sequences affected by the subtree improvements are realigned.

Mafft is both a progressive and an iterative MSA program. It uses a fast Fourier transform¹ to generate a guide tree for its progressive MSA. As with Muscle, the guide tree is then refined, both by optimising a weighted SOP score and by using different pairwise alignment similarity information.

¹ http://en.wikipedia.org/wiki/Fast_Fourier_transform

11.2 Benchmark Database

In order to further estimate the biological quality of the alignments, a protein benchmark database was used. When the first MSA construction programs were developed it was difficult to benchmark them. First, lack of processing power made the performance of the programs on large sequences inferior. Second, there was not much reliable information concerning correct sequence families. The first generation of programs was thus mainly tested using a small number of "golden standards". As these standards were both limited in number and publicly available, this had the unfortunate effect that new programs could easily be tuned against the standards, achieving artificially high scores. In order to rectify this effect, a number of benchmark databases consisting of large and continually growing data-sets have been published.

For this thesis BAliBASE (Benchmark Alignments dataBASE)² has been chosen as reference database for the MSA experiments. BAliBASE, which was one of the first benchmark databases, consists of nine protein reference data-sets chosen in order to represent real MSA problems faced by biologists. The alignments are based upon protein 3D-structure super-positioning and are "manually" refined in order to ensure higher alignment quality than a purely automated method. This property can be seen as one of the main quality characteristics of BAliBASE, but it can also be considered a source of subjectivity due to the "expert" refinement [Blackshields et al., 2006]. If nothing else, it does make BAliBASE slower to maintain and expand than more automated benchmark databases.³ In this thesis only the reference sets 1–5 and 9 were used (see Table 11.1), the mundane reason being that the reference sets 6–8 were lacking files containing the sequences in fasta format.

² http://www-bio3d-igbmc.u-strasbg.fr/balibase/ version 3.0
³ like PREFAB, HOMSTRAD, SABmark and OXBench

Table 11.1: The used BAliBASE reference sets

RV1    Equidistant sequences, divided by length and variability
          RV11 – very divergent sequences (< 20 % residue identity)
          RV12 – medium to divergent sequences (20–40 % residue identity)
RV20   Families with one or more highly divergent sequences
RV30   Divergent subfamilies
RV40   Sequences with large N/C terminal extensions
RV50   Sequences with large internal insertions
RV9    Short/linear motifs, experimentally verified
          RV911 – very divergent sequences (< 20 % residue identity)
          RV912 – medium to divergent sequences (20–40 % residue identity)
          RV913 – similar to medium divergent sequences (40–80 % residue identity)
          RV921 – true positive motifs only
          RV922 – true positive motifs and errors
          RV931 – true positive motifs only
          RV932 – true positive motifs and false positive motifs
          RV941 – true positive motifs only
          RV942 – true positive motifs and true negative motifs

The used reference sets contain 610 alignments with altogether 17 427 sequences.

11.2.1 BAliBASE scoring

In order to assess the performance of MSA programs, BAliBASE includes scripts to calculate two scores. Let A = [[a1, a2, ..., ak]] with |ai| = ℓ ∀ i ∈ 1...k be a test alignment, and let Ar = [[a1, a2, ..., a_kr]] with length ℓr be the corresponding reference alignment.

Sum-of-pairs This score, which is a variant of the SOP, increases with the number of sequences correctly aligned with respect to the reference alignment. It is supposed to indicate the extent to which some, if not all, sequences in an alignment are correctly aligned. The score Sc for the c'th column is defined as

    Sc = Σ_{i=1}^{k} Σ_{j=1, j≠i}^{k} p_cij,    where p_cij = 1 if A[i][c] and A[j][c] are aligned in Ar, and 0 otherwise.

Denoting the sum-of-pairs score for A by SP, we have

    SP = ( Σ_{i=1}^{ℓ} Si ) / ( Σ_{j=1}^{ℓr} Srj )    with Srj = kr · (kr − 1) / 2

Column score This is a column-wise binary score indicating whether all sequences are aligned correctly with respect to the reference alignment. Letting TC denote the column score for A, we have

    TC = ( Σ_{i=1}^{ℓ} Ci ) / ℓ,    where Ci = 1 if the i'th column in A and Ar are identical, and 0 otherwise.
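To make the two scores concrete, here is a small Python sketch under the simplifying assumptions that alignments are lists of equal-length gapped strings, that the test and reference alignments contain the same k sequences, and that "aligned in Ar" means the two residues occupy the same reference column; the normalisation constants are taken verbatim from the formulas above.

    # A sketch of the BAliBASE SP and TC scores ('-' marks gaps).
    def residue_columns(aln):
        """Map (row index, residue index) -> column index, skipping gaps."""
        cols = {}
        for i, row in enumerate(aln):
            r = 0
            for c, ch in enumerate(row):
                if ch != '-':
                    cols[(i, r)] = c
                    r += 1
        return cols

    def balibase_scores(test, ref):
        ref_col = residue_columns(ref)
        k, length = len(test), len(test[0])
        k_r, length_r = len(ref), len(ref[0])
        pos = [0] * k                  # residues consumed per row
        sp_sum, tc_sum = 0, 0
        for c in range(length):
            in_col = []                # reference columns of this column's residues
            for i in range(k):
                if test[i][c] != '-':
                    in_col.append(ref_col[(i, pos[i])])
                    pos[i] += 1
            # S_c: ordered pairs i != j whose residues share a reference column
            sp_sum += sum(1 for a in in_col for b in in_col if a == b) - len(in_col)
            # C_i: all k residues present and all in one reference column
            if len(in_col) == k and len(set(in_col)) == 1:
                tc_sum += 1
        sp = sp_sum / (length_r * k_r * (k_r - 1) / 2)
        tc = tc_sum / length
        return sp, tc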

11.3 Experiments Set-up

For both the MSA construction and MSA enhancing experiments the following set-up was used:

• Muscle and Mafft running with default parameters,
• the BLOSUM62 scoring matrix,
• a gap-opening cost of −12 and a gap-extension cost of −2, and
• all gaps scored equally (end/internal gaps).

11.4 MSA Construction Tests

For each of the BAliBASE reference sets, the tspMsa alignment was scored by each of the scoring schemes: CS score, SOP score, BAliBASE Sum-of-Pairs score and BAliBASE Column score. These scores were then compared to the same scores for the Muscle and Mafft alignments. In order to illustrate the results with block diagrams, the number of times an algorithm achieved the best score was counted for each alignment file in each BAliBASE reference set directory. If more than one algorithm achieved the best score, every one of them was counted as "being best". The results are shown in Figures 11.1 to 11.8.
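The tallying itself amounts to the following small sketch (a hypothetical helper; maximisation assumed):

    # Every algorithm attaining the (possibly tied) best score on a
    # file counts as "being best" for that file.
    from collections import Counter

    def count_best(scores_per_file):
        """scores_per_file: iterable of dicts {algorithm: score}."""
        wins = Counter()
        for scores in scores_per_file:
            best = max(scores.values())
            wins.update(alg for alg, s in scores.items() if s == best)
        return wins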

11.4.1 CS score results

The results when scoring the alignments by the CS score (Figures 11.1 and 11.2) not surprisingly favour the tspMsa algorithm. It achieves the highest number of best scores in every reference set. The mutual directory score between Muscle and Mafft is 2 – 8 with 5 draws.

Figure 11.1: MSA construction on RV11–RV50 (alignments scored using the Cycle Sum Score)

Figure 11.2: MSA construction on RV911–RV942 (alignments scored using the Cycle Sum Score)

Figure 11.3: MSA construction on RV11–RV50 (alignments scored using the Sum-of-Pairs Score)

Figure 11.4: MSA construction on RV911–RV942 (alignments scored using the Sum-of-Pairs Score)

11.4.2 SOP score results

The results when scoring the alignments by the SOP score (Figures 11.3 and 11.4) show a clear superiority of Mafft, which achieves the highest number of best scores in every reference set. The mutual directory score between tspMsa and Muscle is 1 – 13 with 1 draw. This last result is not surprising either, as Muscle optimises using the SOP score.

Figure 11.5: MSA construction on RV11–RV50 (alignments scored using the BAliBASE Sum-of-Pairs Score)

11.4.3 BAliBASE SOP score results

Scoring the alignments by the BAliBASE SOP score (Figures 11.5 and 11.6) shows some interesting results. First, Mafft is clearly superior, being the best algorithm in 14 out of 15 BAliBASE reference set directories. Second, the mutual directory score between tspMsa and Muscle is 5 – 7 with 3 draws. Third, the tspMsa algorithm achieves the best result for the BAliBASE reference set RV11, which consists of very divergent sequences with less than 20 % residue identity.⁴

⁴ also known as the "twilight zone"

Figure 11.6: MSA construction on RV911–RV942 (alignments scored using the BAliBASE Sum-of-Pairs Score)

Figure 11.7: MSA construction on RV11–RV50 (alignments scored using the BAliBASE Column Score)

Figure 11.8: MSA construction on RV911–RV942 (alignments scored using the BAliBASE Column Score)

11.4.4 BAliBASE column score results

Finally, scoring by the BAliBASE column score (Figures 11.7 and 11.8) again shows the superiority of Mafft, "winning" in every reference set directory. Here the mutual directory score between tspMsa and Muscle is 6 – 4 with 5 draws. Interestingly enough, using the BAliBASE column score Mafft scores better than tspMsa in reference set RV11.

11.4.5 MSA construction conclusions

The following conclusions can be drawn from the above results:

1) The Mafft algorithm is clearly superior to both the tspMsa and the Muscle algorithm.

2) The tspMsa and Muscle algorithms each clearly score better than the other when scoring with the algorithm's own optimising score.

3) Using the BAliBASE SOP score, Muscle achieves better results than tspMsa. A Wilcoxon rank-sum test⁵ gives a p-value of 0.085 for the null hypothesis that the results are comparable, and a z-statistic of -1.721 in favour of Muscle.

4) Using the BAliBASE column score, tspMsa and Muscle achieve comparable results with a slight advantage for Muscle. A Wilcoxon rank-sum test gives a p-value of 0.885 for the null hypothesis that the results are comparable, and a z-statistic of -0.145 in favour of Muscle.

5) Using the BAliBASE scores, the tspMsa algorithm scores relatively well in the reference sets from the twilight zone (reference sets RV11 and RV911).

⁵ calculated using the stats package from scipy
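The rank-sum values in points 3) and 4) were obtained along the lines of the following sketch; the score lists shown are hypothetical example data, not the experiment's numbers.

    # A two-sided Wilcoxon rank-sum test via the footnoted scipy package.
    from scipy.stats import ranksums

    scores_a = [0.91, 0.85, 0.77]   # hypothetical per-file scores, algorithm A
    scores_b = [0.88, 0.86, 0.79]   # hypothetical per-file scores, algorithm B
    z_statistic, p_value = ranksums(scores_a, scores_b)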

11.5 MSA Enhancing Tests

The enhanceMsa algorithm was tested using both Muscle and Mafft as the fast external alignment construction application. To compare the effect of different block sizes, the algorithm was run both using the FINDBESTBLOCK() function (see Section 10.3.1) and with constant block sizes of 8, 14, 20 and 26. For each of the BAliBASE reference sets, the enhanceMsa algorithm was scored by each of the scoring schemes: CS score, BAliBASE Sum-of-Pairs score and BAliBASE Column score.

Figure 11.9: CS score improvement on RV911–RV942 (MSA enhance results using Muscle; variable ("best") and constant block sizes 8, 14, 20, 26)

Figure 11.10: BAliBASE SOP score improvement on RV911–RV942 (MSA enhance results using Muscle)

The original first alignment was scored at the start of the algorithm, and the final alignment was scored when the algorithm finished. For each file in a reference set directory, the improvement in score between the first and the final alignment was calculated as a percentage of the original score. These differences were then summed (by score type) and averaged by dividing by the number of files in the reference set directory. The results of the test runs, using Muscle as the external MSA construction program, for the reference sets RV911–RV942 and all three score types are shown in Figure 11.9 – Figure 11.11. The results when using Mafft, and for the data sets RV11–RV50, were similar in the sad sense of being just as unclear. Using a Wilcoxon rank-sum test to compare the variable block algorithm with the constant block algorithm for each block size gives the values in Table 11.2. Neither the figures nor the table conveys any clear message.
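In sketch form, the per-directory averaging described above is (a hypothetical helper):

    # Percentage improvement per file, summed and divided by the number
    # of files in the reference set directory.
    def average_improvement(first_scores, final_scores):
        pct = [100.0 * (final - first) / first
               for first, final in zip(first_scores, final_scores)]
        return sum(pct) / len(pct)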

Figure 11.11: BAliBASE column score improvement on RV911–RV942 (MSA enhance results using Muscle)

11.5.1 MSA enhancing conclusions

Although the test results are rather unclear, a couple of conclusions can be drawn:

• Comparing the variable block length results to the constant block length results, the variable block results are not better than any of the constant block results. On the positive side, they are not clearly worse either.

• The enhanceMsa algorithm uses the CS score to determine whether it should keep on refining the MSA. Using this scoring does in most cases also result in improvements in the BAliBASE scores.

11.6 Notes

Section 11.2 is based on [Blackshields et al., 2006] and [Perrodou, Chica, Poch, Gibson, and Thompson, 2008; Thompson, Koehl, Ripp, and Poch, 2005; Thompson, Plewniak, and Poch, 1999a,b]. Muscle is described in detail in [Edgar, 2004] and Mafft in [Katoh, Kuma, Toh, and Miyata, 2005; Katoh, Misawa, Kuma, and Miyata, 2002].


Table 11.2: Comparing variable block size to constant block sizes

                           CS score           SOP score          column score
constant block size 8
   RV11–50                 -0.1601, 0.8728    -0.1601, 0.8728    -0.1601, 0.8728
   RV911–942               -0.2208, 0.8253     0.6181, 0.5365     0.3091, 0.7573
constant block size 14
   RV11–50                  0.1601, 0.8728    -0.1601, 0.8728    -0.3203, 0.7488
   RV911–942               -0.3974, 0.6911    -0.4857, 0.6272    -0.6623, 0.5078
constant block size 20
   RV11–50                 -0.1601, 0.8728    -0.4804, 0.6310    -0.6405, 0.5218
   RV911–942               -0.3091, 0.7573    -0.0442, 0.9648    -0.2208, 0.8253
constant block size 26
   RV11–50                 -0.1601, 0.8728     0.0000, 1.0000    -0.3203, 0.7488
   RV911–942               -0.1325, 0.8946     0.4415, 0.6588     0.2208, 0.8253

Each table entry consists of the z-statistic and the two-sided p-value for the null hypothesis that the two sets of measurements are drawn from the same distribution. A negative z-statistic indicates worse variable block values; a positive z-statistic indicates better variable block values.


Part IV
The end of the tour

DNA tour with 16384 cities

CHAPTER 12
Conclusion

The Road goes ever on and on
Out from the door where it began.
Now far ahead the Road has gone,
Let others follow it who can!
Let them a journey new begin,
But I at last with weary feet
Will turn towards the lighted inn,
My evening-rest and sleep to meet.
(Bilbo Baggins, J.R.R. Tolkien)

To recapitulate, the goals set for this thesis were:

1) an examination of the feasibility of applying efficient TSP solver implementations to solve NP complete problems within bioinformatic applications,

2) a self-contained survey of approximation algorithms for the SSP, including descriptions of implementations and testing of solution quality, and

3) a description and an experimental testing of the improvements on the TSP based algorithms for MSA construction and MSA refinement.

Based on the work done and the results obtained in this thesis, the following conclusions can be drawn:

Re 1) It can definitely be concluded that it is a promising idea to apply efficient TSP solver implementations to solve NP complete problems within bioinformatic applications. The results obtained by the TSP based SSP algorithm were of a quality that could hardly be more convincing. Compared to all other algorithms it achieved the best solutions in every test (see Table 7.2). This affirms the findings in [Laporte, 2010] and [Applegate et al., 2007, Chapter 15.5, 15.6] that LKH is capable of solving TSP instances of up to at least 1000 cities to optimality in a very short time. In the conducted experiments, the TSP solver based algorithm used less than two minutes for solving a 1000 string SSP problem (see Figure 7.7). This was achieved with an implementation that in no way has

been optimised with regard to running-time. In other words: the time spent in order to get results of this quality seems very well spent.

Re 2) This thesis has in Chapter 6 presented a survey of approximation algorithms for the SSP. Furthermore, Chapter 7 has presented test results of their solution quality. One important insight resulting from the latter chapter is that the Greedy algorithm is the best choice if optimal solution quality is of less importance. This very simple algorithm obtains the best solutions among the fast algorithms (see Table 7.2 and Figure 7.9). Another relevant point is that approximation factors only reflect the performance on the most pathological instances. In none of the 355 280 test cases used for this thesis did any of the algorithms produce a solution that was more than 2 % above the lower bound.

Re 3) The results of the implemented MSA construction algorithm could not compete with the results of Muscle and especially Mafft when the BAliBASE reference sets and scores were used. This might not come as a big surprise; these programs have been elaborated on for quite some time, and surely more than was possible to put into algorithm implementation within the scope of this thesis. Nevertheless, the algorithm did not produce evidently bad results, and it seems to produce quite good results on sequences from the "twilight zone" (see Figures 11.5 – 11.8). On the other hand, the results obtained by the implemented MSA refinement algorithm can hardly be regarded as positive. The test results were very unclear, and the most interesting conclusion was that an optimisation based on the CS score also resulted in improvements in the BAliBASE scores. This does, however, give reason for optimism concerning the applicability of the CS score in bioinformatics.

12.1 Future Work

The results from this thesis point to the following interesting possibilities for future work:

Attacking other NP hard problems: The very positive results obtained by using an efficient TSP solver to solve another NP hard problem open up a large number of future applications. Actually, every problem involving any of the other known NP hard problems is a possible candidate.

Examination of integral solution majority: It came as an unexpected result that the solutions to the LP used by the Best Factor approximation algorithm almost solely consisted of integral solutions.

There might be a simple reason for this, but there could also be some interesting insights behind this observation.

Further assessment of the CS score: Optimising the CS score led to improvements of the BAliBASE scores. To evaluate its biological applicability, a study examining the effect of optimising the CS score using other MSA benchmark databases would be interesting. But first of all, it would be relevant to include evaluations from biologists concerning the MSAs constructed by optimising the CS score.

A competitive version of the tspMsa algorithm: The tspMsa algorithm did not excel in comparison to Mafft and Muscle, but its results are not unpromising either. For example, there is still room for post-optimisations using gap block shifts. Both the CS score and the tspMsa algorithm originate in work done at the Swiss Federal Institute of Technology around 2000, and when asked about the reasons for the lack of publications related to these ideas in the past ten years, Professor Gaston Gonnet kindly commented:¹

    I think the circular tour idea will always have a fundamental role in evaluating MSAs. Regarding MSA algorithms, there are several new remarkable algorithms which are probably very difficult to beat. In particular MAFFT and Prank. This makes the task of beating them more difficult and more challenging. Several years ago we improved our MSA algorithm based on CT but we did not put enough effort to make it competitive with current algorithms. I personally think it is an excellent topic to pursue.

¹ personal communication

My pen is at the bottom of a page,
Which, being finished, here the story ends;
'Tis to be wished it had been sooner done,
But stories somehow lengthen when begun.
(Byron)


List of Algorithms

II.1  Cycle . . . 45
II.2  MakeMaximalAssignment . . . 52
II.3  RecurseCycle . . . 54
II.4  Greedy . . . 58
II.5  leastCommonDenominator . . . 67
III.1 tspMsa . . . 113
III.2 extendMsa . . . 115
III.3 enhanceMsa . . . 120
III.4 findBestBlock . . . 123


List of Figures

1.1 paths and tours in a graph . . . 6
1.2 complete graph and cut set . . . 6
1.3 Different tree types . . . 7
2.1 The Commis-Voyageur tour for 47 German cities (from [Applegate et al., 2007]) . . . 12
2.2 The 33 city contest from 1962 . . . 15
2.3 Different k-opt moves . . . 16
2.4 The optimal TSP tour in Sweden (from http://www.tsp.gatech.edu/sweden/tours/swtour_small.htm) . . . 18
3.1 A LKH tour of 11 455 danish cities . . . 23
3.2 Results on dk11455.tsp. Note the addition of 20 990 000 on the y-axis . . . 25
4.1 sequence assembly problem with repeats . . . 31
5.1 Graphs for the strings: 'ate', 'half', 'lethal', 'alpha' and 'alfalfa' (from [Blum et al., 1994]) . . . 36
6.1 Running-time measures of all-pairs overlap calculations . . . 49
6.2 Illustration of Lemma 9: w_uv + w_u′v′ ≥ w_u′v + w_uv′ . . . 51
6.3 Shortest Hamiltonian path is not necessarily included in the optimal TSP tour . . . 63
6.4 Transforming an ATSP instance to a TSP instance . . . 64
7.1 SSP solution dependency on variable coverage . . . 78
7.2 SSP solution dependency on length . . . 79
7.3 solution dependency on length, compare with Figure 7.10a . . . 80
7.4 solution dependency on count, compare with Figure 7.10b . . . 81
7.5 solution dependency on count . . . 82
7.6 Figure 7.5 correlated using TSP algorithm values as 100 % . . . 83
7.7 SSP algorithms running-time using normal scale . . . 84
7.8 Best Factor and TSP algorithm running-times . . . 85
7.9 Cycle, RecurseCycle and Greedy algorithm running-times . . . 86
7.10 SSP algorithm alignment similarity results . . . 88
8.1 Secondary structure of Avian Sarcoma Virus. A retrovirus with many similarities to HIV (from http://mcl1.ncifcrf.gov/integrase/asv_intro.html) . . . 96
9.1 Evolutionary tree with five sequences (species) . . . 103
9.2 SOP score "ticks" added for each edge (from [Gonnet et al., 2000]) . . . 104
9.3 Circular tours . . . 105
9.4 Base case . . . 105
9.5 Induction case . . . 106
10.1 Before executing extendMsa() (from [Roth-Korostensky, 2000]) . . . 114
10.2 Example 18 steps i) – iv) (from [Roth-Korostensky, 2000]) . . . 116
10.3 inserting gaps g and g′ affects the CS score at the four locations marked with dark clouds (from [Roth-Korostensky, 2000]) . . . 117
10.4 The MSA enhance algorithm (from [Roth-Korostensky, 2000]) . . . 119
11.1 MSA construction on RV11–RV50 . . . 129
11.2 MSA construction on RV911–RV942 . . . 129
11.3 MSA construction on RV11–RV50 . . . 130
11.4 MSA construction on RV911–RV942 . . . 130
11.5 MSA construction on RV11–RV50 . . . 131
11.6 MSA construction on RV911–RV942 . . . 132
11.7 MSA construction on RV11–RV50 . . . 132
11.8 MSA construction on RV911–RV942 . . . 133
11.9 CS score improvement on RV911–RV942 . . . 134
11.10 BAliBASE SOP score improvement on RV911–RV942 . . . 135
11.11 BAliBASE column score improvement on RV911–RV942 . . . 136


List of Tables

2.1 Some TSP history milestones (mainly from [Okano, 2009]) . . . 14
2.2 Milestones for solving TSP (from [Applegate et al., 2007]) . . . 19
3.1 LKH and Concorde results for dk11455.tsp . . . 24
6.1 All-pairs suffix-prefix running-times . . . 49
6.2 SSP algorithm characteristics . . . 74
7.1 SSP experiments overview . . . 77
7.2 minimum scores for SSP algorithms in 355 280 tests . . . 82
7.3 Running time linear regression values . . . 87
7.4 superstring lengths compared to original string length . . . 89
11.1 The used BAliBASE reference sets . . . 127
11.2 comparing variable block size to constant block sizes . . . 137
C.1 SSP-experiments nucleotid sequences . . . 157–159
D.1 coverage 6 test data . . . 163
D.2 coverage 8 test data . . . 163–164
D.3 coverage 10 test data . . . 164–165
D.4 coverage 20 test data . . . 165
D.5 coverage 30 test data . . . 165
D.6 coverage 40 test data . . . 165

163 163 164 164 165 165 165 165

Part V
Appendices

Evolutionary tree tour with 8178 cities

APPENDIX A
Acronyms

AP      Assignment Problem
ATSP    Asymmetric Traveling Salesman Problem
BLOSUM  Block Substitution Matrix
CPU     Central Processing Unit
CC      Cycle Cover
CO      Circular Order
CS      Circular Sum
CT      Circular Tour
DNA     Deoxyribonucleic Acid
GB      Giga Byte
GCD     Greatest Common Divisor
HCP     Hamiltonian Cycle Problem
HMM     Hidden Markov Model
HPP     Hamiltonian Path Problem
KMP     Knuth-Morris-Pratt
LKH     Lin-Kernighan-Helsgaun
LP      Linear Program
MB      Mega Byte
MHz     Mega Hertz
MMP     Minimum Matching Problem
MPS     Maximal Path Sum
MSA     Multiple Sequence Alignment
MST     Minimum Spanning Tree
MTSP    Metric Traveling Salesman Problem
NP      Nondeterministic Polynomial-time
OPA     Optimal Pairwise Alignment
PAM     Point Accepted Mutation
PTAS    Polynomial Time Approximation Scheme
RAM     Random-access Memory
RNA     Ribonucleic Acid
SCSP    Shortest Common Superstring Problem
SOP     Sum-of-Pairs
SSP     Shortest Superstring Problem
TSP     Traveling Salesman Problem
VLSI    Very-large-scale Integration
WLOG    Without loss of generality


APPENDIX B
LKH Output

The following is a sample output from a run of LKH on the TSPLIB instance pr2392, using the shell command:

ingolf% ./LKH pr2392.par

File pr2392.par:

PROBLEM_FILE = pr2392.tsp
OPTIMUM = 378032
MOVE_TYPE = 5
PATCHING_C = 3
PATCHING_A = 2
RUNS = 10

Output:

PARAMETER_FILE = pr2392.par
Reading PROBLEM_FILE: "pr2392.tsp" ... done
ASCENT_CANDIDATES = 50
BACKBONE_TRIALS = 0
BACKTRACKING = NO
# CANDIDATE_FILE =
CANDIDATE_SET_TYPE = ALPHA
EXCESS = 0.00041806
EXTRA_CANDIDATES = 0
EXTRA_CANDIDATE_SET_TYPE = QUADRANT
GAIN23 = YES
GAIN_CRITERION = YES
INITIAL_PERIOD = 1196
INITIAL_STEP_SIZE = 1
INITIAL_TOUR_ALGORITHM = WALK
# INITIAL_TOUR_FILE =
INITIAL_TOUR_FRACTION = 1.000
# INPUT_TOUR_FILE =
KICK_TYPE = 0
KICKS = 1
# MAX_BREADTH =
MAX_CANDIDATES = 5
MAX_SWAPS = 2392
MAX_TRIALS = 2392
# MERGE_TOUR_FILE =
MOVE_TYPE = 5
NONSEQUENTIAL_MOVE_TYPE = 9
OPTIMUM = 378032
# OUTPUT_TOUR_FILE =
PATCHING_A = 2
PATCHING_C = 3
# PI_FILE =
# POPULATION_SIZE = 0
PRECISION = 100
PROBLEM_FILE = pr2392.tsp
RESTRICTED_SEARCH = YES
RUNS = 10
SEED = 1
STOP_AT_OPTIMUM = YES
SUBGRADIENT = YES
# SUBPROBLEM_SIZE =
# SUBPROBLEM_TOUR_FILE =
SUBSEQUENT_MOVE_TYPE = 5
SUBSEQUENT_PATCHING = YES
# TIME_LIMIT =
# TOUR_FILE =
TRACE_LEVEL = 1
Lower bound = 373488.5, Gap = 1.20%, Ascent time = 30.5 sec.
Cand.min = 2, Cand.avg = 5.0, Cand.max = 5
Preprocessing time = 33.0 sec.
* 1: Cost = 378224, Gap = 0.051%, Time = 0.8 sec.
* 2: Cost = 378032, Gap = 0.000%, Time = 1.0 sec. =
Run 1: Cost = 378032, Gap = 0.000%, Time = 1.0 sec. =
* 1: Cost = 378032, Gap = 0.000%, Time = 0.7 sec. =
Run 2: Cost = 378032, Gap = 0.000%, Time = 0.7 sec. =
* 1: Cost = 378032, Gap = 0.000%, Time = 0.7 sec. =
Run 3: Cost = 378032, Gap = 0.000%, Time = 0.7 sec. =
* 1: Cost = 378811, Gap = 0.206%, Time = 1.1 sec.
* 3: Cost = 378798, Gap = 0.203%, Time = 1.5 sec.
* 6: Cost = 378653, Gap = 0.164%, Time = 1.9 sec.
* 10: Cost = 378033, Gap = 0.000%, Time = 2.4 sec.
* 11: Cost = 378032, Gap = 0.000%, Time = 2.4 sec. =
Run 4: Cost = 378032, Gap = 0.000%, Time = 2.4 sec. =
* 1: Cost = 378032, Gap = 0.000%, Time = 1.0 sec. =
Run 5: Cost = 378032, Gap = 0.000%, Time = 1.0 sec. =
* 1: Cost = 378136, Gap = 0.028%, Time = 0.7 sec.
* 2: Cost = 378032, Gap = 0.000%, Time = 0.9 sec. =
Run 6: Cost = 378032, Gap = 0.000%, Time = 0.9 sec. =
* 1: Cost = 378370, Gap = 0.089%, Time = 0.7 sec.
* 4: Cost = 378075, Gap = 0.011%, Time = 2.0 sec.
* 6: Cost = 378053, Gap = 0.006%, Time = 2.3 sec.
* 29: Cost = 378032, Gap = 0.000%, Time = 3.6 sec. =
Run 7: Cost = 378032, Gap = 0.000%, Time = 3.6 sec. =
* 1: Cost = 378683, Gap = 0.172%, Time = 1.2 sec.
* 2: Cost = 378032, Gap = 0.000%, Time = 1.7 sec. =
Run 8: Cost = 378032, Gap = 0.000%, Time = 1.7 sec. =
* 1: Cost = 378787, Gap = 0.200%, Time = 1.2 sec.
* 3: Cost = 378634, Gap = 0.159%, Time = 1.5 sec.
* 4: Cost = 378615, Gap = 0.154%, Time = 1.7 sec.
* 8: Cost = 378032, Gap = 0.000%, Time = 2.7 sec. =
Run 9: Cost = 378032, Gap = 0.000%, Time = 2.7 sec. =
* 1: Cost = 378032, Gap = 0.000%, Time = 0.8 sec. =
Run 10: Cost = 378032, Gap = 0.000%, Time = 0.8 sec. =
Successes/Runs = 10/10
Cost.min = 378032, Cost.avg = 378032.00, Cost.max = 378032
Gap.min = 0.000%, Gap.avg = 0.000%, Gap.max = 0.000%
Trials.min = 1, Trials.avg = 5.8, Trials.max = 29
Time.min = 0.7 sec., Time.avg = 1.6 sec., Time.max = 3.6 sec.

APPENDIX C
GenBank Nucleotid Sequences

Depicted in Table C.1 are the accession numbers and fasta file descriptions for the DNA-sequences used as basis for the SSP-experiments.

Table C.1: SSP-experiments nucleotid sequences

D00723.1       HUMHYDCP Homo sapiens mRNA for hydrogen carrier protein, a component of an enzyme complex, glycine synthase (EC 2.1.2.10)
D11428.1       HUMPMP22A Homo sapiens mRNA for PMP-22 (PASII/SR13/Gas-3), complete cds
NT_004321.15   Hs1_4478 Homo sapiens chromosome 1 genomic contig
NT_004350.16   Hs1_4507 Homo sapiens chromosome 1 genomic contig
NT_004433.16   Hs1_4590 Homo sapiens chromosome 1 genomic contig
NT_004434.16   Hs1_4591 Homo sapiens chromosome 1 genomic contig
NT_004487.16   Hs1_4644 Homo sapiens chromosome 1 genomic contig
NT_004511.16   Hs1_4668 Homo sapiens chromosome 1 genomic contig
NT_004538.15   Hs1_4695 Homo sapiens chromosome 1 genomic contig
NT_004547.16   Hs1_4704 Homo sapiens chromosome 1 genomic contig
NT_004559.11   Hs1_4716 Homo sapiens chromosome 1 genomic contig
NT_004610.16   Hs1_4767 Homo sapiens chromosome 1 genomic contig
NT_004668.16   Hs1_4825 Homo sapiens chromosome 1 genomic contig
NT_004671.15   Hs1_4828 Homo sapiens chromosome 1 genomic contig
NT_004686.16   Hs1_4843 Homo sapiens chromosome 1 genomic contig
NT_004754.15   Hs1_4911 Homo sapiens chromosome 1 genomic contig
NT_004836.15   Hs1_4993 Homo sapiens chromosome 1 genomic contig
NT_004873.15   Hs1_5030 Homo sapiens chromosome 1 genomic contig
NT_019273.16   Hs1_19429 Homo sapiens chromosome 1 genomic contig
NT_021877.16   Hs1_22033 Homo sapiens chromosome 1 genomic contig
NT_021937.16   Hs1_22093 Homo sapiens chromosome 1 genomic contig
NT_021973.16   Hs1_22129 Homo sapiens chromosome 1 genomic contig
NT_022071.12   Hs1_22227 Homo sapiens chromosome 1 genomic contig
NT_026943.13   Hs1_27103 Homo sapiens chromosome 1 genomic contig
NT_028050.13   Hs1_28209 Homo sapiens chromosome 1 genomic contig
NT_029860.11   Hs1_30115 Homo sapiens chromosome 1 genomic contig
NT_030584.10   Hs1_30840 Homo sapiens chromosome 1 genomic contig
X00351.1       Human mRNA for beta-actin
X01098.1       Human mRNA for aldolase B (EC 4.1.2.13)
X02160.1       Human mRNA for insulin receptor precursor
X02874.1       Human mRNA for (2'-5') oligo A synthetase E (1,6 kb RNA)
X02994.1       Human mRNA for adenosine deaminase (adenosine aminohydrolase, EC 3.5.4.4)
X03350.1       Human mRNA for alcohol dehydrogenase beta-1-subunit (ADH1-2 allele)
X03444.1       Human mRNA for nuclear envelope protein lamin A precursor
X03445.1       Human mRNA for nuclear envelope protein lamin C precursor
X03663.1       Human mRNA for c-fms proto-oncogene
X03795.1       Human mRNA for platelet derived growth factor A-chain (PDGF-A)
X04217.1       Human mRNA for porphobilinogen deaminase (PBG-D, EC 4.3.1.8)
X04350.1       Human mRNA for liver alchol dehydrogenase (EC 1.1.1.1) gamma 1 subunit (partial) from ADH3 locus
X04412.1       Human mRNA for plasma gelsolin
X04772.1       Human mRNA for lymphocyte IgE receptor (low affinity receptor Fc epsilon R)
X04808.1       Human mRNA for non-erythropoietic porphobilinogen deaminase (hydroxymethylbilane synthase; EC 4.3.1.8)
X06374.1       Human mRNA for platelet-derived growth factor PDGF-A
X06985.1       Human mRNA for heme oxygenase
X07173.1       Human mRNA for second protein of inter-alpha-trypsin inhibitor complex
X07577.1       Human mRNA for C1 inhibitor
X07696.1       Human mRNA for cytokeratin 15
X07743.1       Human mRNA for pleckstrin (P47)
X13097.1       Human mRNA for tissue type plasminogen activator
X13403.1       Human mRNA for octamer-binding protein Oct-1
X13440.1       Human mRNA for 17-beta-hydroxysteroid dehydrogenase (17-HSD) (EC 1.1.1.62)
X14034.1       Human mRNA for phospholipase C
X14758.1       Human mRNA for adenocarcinoma-associated antigen (KSA)
X15610.1       Human mRNA for proliferating cell nucleolar antigen P40 C-terminal fragment
X51841.1       Human mRNA for integrin beta(4) subunit
X52104.1       Human mRNA for p68 protein
X53279.1       Human mRNA for placental-like alkaline phosphatase (EC 3.1.3.1)
X56088.1       Human mRNA for cholesterol 7-alpha-hydroxylase
X57398.1       Human mRNA for pM5 protein
X58377.1       Human mRNA for adipogenesis inhibitory factor
Y00093.1       HSP15095 H.sapiens mRNA for leukocyte adhesion glycoprotein p150,95
Y00264.1       Human mRNA for amyloid A4 precursor of Alzheimer's disease
Y00503.1       Human mRNA for keratin 19
Y00649.1       Human mRNA for B-lymphocyte CR2-receptor (for complement factor C3d and Epstein-Barr virus)


APPENDIX D
SSP Test Data

To generate the test data according to the method description in Section 7.1, the following calculations were made to determine the needed length of nucleotid sequence N. Let the wanted values for coverage, collection size and string length be denoted by cov, Ssize and l respectively; then

    N = (Ssize · l) / cov                                        (D.1)

The number of strings gen(N) of length l generated by the method using a nucleotid sequence of length N can be calculated as

    gen(N) = N − l + 1

and the number of strings to be randomly selected for removal, r, is given by

    r = gen(N) − Ssize

giving a reduction ratio rratio in percentage of size

    rratio = (r · 100) / gen(N)

Example 21 Let cov = 8, Ssize = 600, l = 16. Then

    N = (600 · 16) / 8 = 1200
    gen(N) = 1200 − 16 + 1 = 1185
    r = 1185 − 600 = 585
    rratio = (585 · 100) / 1185 ≈ 42.194 %

In praxis an integer value for the reduction ratio was preferred, leading to the following modification of Equation D.1:

    N′ = (⌈(Ssize · l) / cov⌉ − 1) + l,    so that gen(N′) = N′ − l + 1 = Ssize · l / cov

Example 22 Let cov = 8, Ssize = 600, l = 16. Then

    N′ = (600 · 16 / 8 − 1) + 16 = 1215
    gen(N′) = 1215 − 16 + 1 = 1200
    r = 1200 − 600 = 600
    rratio = (600 · 100) / 1200 = 50 %

Note that the effect of this is that the value for cov is then only approximated, giving cov′ = (Ssize · l) / N′; in Example 22 we get cov′ = (600 · 16) / 1215 ≈ 7.901.

The values for the generated test data are listed in the following tables.
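These bookkeeping calculations can be sketched as follows (a hypothetical helper, reproducing the Example 22 values):

    # Returns N', gen(N'), r, rratio and the approximated coverage cov'.
    import math

    def testdata_params(cov, s_size, l):
        n_prime = math.ceil(s_size * l / cov) - 1 + l
        gen = n_prime - l + 1             # number of generated strings
        r = gen - s_size                  # strings to remove at random
        r_ratio = 100 * r / gen           # integral reduction percentage
        cov_eff = s_size * l / n_prime    # approximated coverage cov'
        return n_prime, gen, r, r_ratio, cov_eff

    # Example 22: prints (1215, 1200, 600, 50.0, 7.901...)
    print(testdata_params(8, 600, 16))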


Table D.1: coverage 6 test data

    Ssize                  l    rratio (%)
    100, 200, ..., 1000    10       40
    100, 200, ..., 1000    20       70
    100, 200, ..., 1000    25       76
    100, 200, ..., 1000    40       85
    100, 200, ..., 1000    50       88
    100, 200, ..., 1000   100       94

Table D.2: coverage 8 test data

    Ssize                              l    rratio (%)
    100, 200, ..., 1000, 2000, 3000    8        0
    100, 200, ..., 1000, 2000, 3000   10       20
    100, 200, ..., 1000               16       50
    100, 200, ..., 1000               20       60
    100, 200, ..., 1000               25       68
    100, 200, ..., 1000               32       75
    100, 200, ..., 1000               40       80
    100, 200, ..., 1000               50       84
    100, 200, ..., 1000              100       92

Table D.3: coverage 10 test data

    Ssize                                    l    rratio (%)
    100, 200, ..., 1000                      10        0
    100, 200, ..., 1000                      20       50
    100, 200, ..., 1000                      25       60
    100, 200, ..., 1000                      40       75
    100, 200, ..., 1000                      50       80
    100, 200, 300, 400, 500, 600, 900       100       90

Table D.4: coverage 20 test data

    Ssize        l    rratio (%)
    100, 200     50       60
    100, 200    100       80

Table D.5: coverage 30 test data

    Ssize                             l    rratio (%)
    100, 200, 300, 400, 500, 600     50       40
    100, 200                        100       70

Table D.6: coverage 40 test data

    Ssize        l    rratio (%)
    100, 200     50       20
    100, 200    100       60

