Pattern matching algorithms

Presentation by: Kamran Mahmoudi
Under supervision of Dr. Mahdavi
Imam Khomeini International University, April 2017

Pattern matching in Bioinformatics 

Certain known nucleotide and/or amino acid sequences have properties known to biologists. Ex. ATG is a string that must be present at the beginning of every protein-coding region (gene) in a DNA sequence.



Determining whether a DNA sequence contains a specific (candidate) primer is essential for running PCR correctly.



A conserved DNA sequence is a sequence of nucleotides in DNA, which is found in the DNA of multiple species and/or multiple strains.



Some sequences are conserved precisely. However, a lot of sequences are conserved with some modifications. Finding such modified strings is an important process for mapping DNA of a new organism.

Intro.

Needle in a haystack

The string matching problem consists of finding a (usually short) string, the pattern, as a substring in a given (usually very long) string, the text [1].

1/56

Formal Definition

Let Σ be an arbitrary alphabet. The (exact) string matching problem is the following problem:

Input: Two strings t = t_1 … t_n and p = p_1 … p_m over Σ.

Output: The set of all positions in the text t where an occurrence of the pattern p as a substring starts [1].
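The definition can be turned almost literally into code. The following is a minimal brute-force sketch, assuming Python strings for t and p; the function name naive_search is illustrative and positions are reported 1-based, as in the definition.

```python
from typing import List


def naive_search(t: str, p: str) -> List[int]:
    """Return all 1-based positions in t where p starts as a substring."""
    n, m = len(t), len(p)
    positions = []
    # Try every alignment j = 0 .. n-m of the pattern against the text.
    for j in range(n - m + 1):
        if t[j:j + m] == p:
            positions.append(j + 1)  # convert the 0-based shift to a 1-based position
    return positions
```

Each alignment costs up to m comparisons, so the sketch runs in O(n·m) time in the worst case.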

2/56

Classification using preprocessing as main criteria

Classes of string searching algorithms [2]

                          | Text not preprocessed      | Text preprocessed
Patterns not preprocessed | Primitive algorithms       | Index methods
Patterns preprocessed     | Constructed search engines | Signature methods

3/56

Basic classification

Single pattern algorithms
• Naïve string search
• Knuth-Morris-Pratt algorithm
• Boyer-Moore algorithm
• Rabin-Karp string search algorithm
• Finite state automaton based search
• Bitap algorithm (shift-or, shift-and, Baeza–Yates–Gonnet)
• Two-way string-matching algorithm
• BNDM (Backward Non-Deterministic Dawg Matching)
• BOM (Backward Oracle Matching)

4/56

Basic classification

Algorithms using a finite set of patterns
• Aho–Corasick string matching algorithm (extension of Knuth-Morris-Pratt)
• Commentz-Walter algorithm (extension of Boyer-Moore)
• Set-BOM (extension of Backward Oracle Matching)
• Rabin–Karp string search algorithm

5/56

Basic classification

Algorithms using an infinite number of patterns
• Naturally, the patterns cannot be enumerated finitely in this case. They are usually represented by a regular grammar or a regular expression.
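As an illustration of a pattern set given by a regular expression, here is a small sketch using Python's standard re module. The expression A+T (one or more A followed by T) and the helper name find_matches are assumptions chosen for the example; the expression describes infinitely many strings (AT, AAT, AAAT, ...).

```python
import re
from typing import List, Tuple


def find_matches(regex: str, text: str) -> List[Tuple[int, str]]:
    """Return (0-based start, matched substring) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in re.finditer(regex, text)]


# Example call: every run of one or more A followed by a T.
matches = find_matches(r"A+T", "TAGACAATCGATGAAATGA")
```

Note that re.finditer reports non-overlapping matches only, which differs from the "all starting positions" output of the exact matching problem.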

6/56

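Naïve string search

Input: a pattern p = p_1 … p_m and a text t = t_1 … t_n
I := ∅
for j := 0 to n-m do
    i := 1
    while i ≤ m and p_i = t_{j+i} do i := i+1
    if i = m+1 then I := I ∪ {j+1}
Output: the set I of all positions where p occurs in t

7/56

Knuth–Morris–Pratt algorithm

KMP-Prefix(P)
Begin
    m ← |P|
    T[1] ← 0
    i ← 0
    for j=2 upto m step 1 do
        while i>0 and P[i+1] != P[j] do
            i ← T[i]
        Wend
        if P[i+1] = P[j] then
            i ← i+1
        end if
        T[j] ← i
    return T
end

8/56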

Knuth–Morris–Pratt algorithm

KMP-Matcher(T,P)
Begin
    n ← |T|
    m ← |P|
    Table ← KMP-Prefix(P)
    i ← 0
    for j=1 upto n step 1 do
        while i>0 and P[i+1] != T[j] do
            i ← Table[i]
        Wend
        if P[i+1] = T[j] then
            i ← i+1
        end if
        if i = m then
            output(j-m)
            i ← Table[i]
        end if
end
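A runnable sketch of the same idea in Python follows. It uses 0-based indexing instead of the 1-based indexing of the pseudocode, and the names kmp_prefix and kmp_search are illustrative; shifts are reported as 0-based start positions.

```python
from typing import List


def kmp_prefix(p: str) -> List[int]:
    """table[j] = length of the longest proper prefix of p[:j+1] that is also its suffix."""
    table = [0] * len(p)
    i = 0  # length of the currently matched prefix
    for j in range(1, len(p)):
        while i > 0 and p[i] != p[j]:
            i = table[i - 1]  # fall back to the next shorter border
        if p[i] == p[j]:
            i += 1
        table[j] = i
    return table


def kmp_search(t: str, p: str) -> List[int]:
    """Return all 0-based shifts at which p occurs in t."""
    if not p:
        return []
    table = kmp_prefix(p)
    shifts = []
    i = 0  # number of pattern characters matched so far
    for j, c in enumerate(t):
        while i > 0 and p[i] != c:
            i = table[i - 1]
        if p[i] == c:
            i += 1
        if i == len(p):
            shifts.append(j - len(p) + 1)
            i = table[i - 1]  # continue, allowing overlapping occurrences
    return shifts
```

The text pointer j never moves backwards, which is what gives the O(n + m) running time.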

9/56

The Boyer-Moore algorithm

• The Boyer-Moore algorithm searches for occurrences of P in T by performing explicit character comparisons at different alignments.
• Instead of a brute-force search of all alignments (of which there are n − m + 1, where n = |T| and m = |P|), Boyer-Moore uses information gained by preprocessing P to skip as many alignments as possible [3].

10/56

The Bad Character Rule

• The bad-character rule considers the character in T at which the comparison process failed. The next occurrence of that character to the left in P is found, and a shift that brings that occurrence in line with the mismatched occurrence in T is proposed [3].

The Good Suffix Rule

• If we match some characters, use knowledge of the matched characters to skip alignments [4].

11/56

Ex.1: the bad character rule

[4]

12/56

Preprocessing for the bad character rule

Input: a pattern p = p_1 … p_m over alphabet Σ
for all a ∈ Σ do β(a) := 0
for i := 1 to m do β(p_i) := i

Output: the function β.
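The preprocessing above can be sketched in Python together with a simplified search that uses only the bad character rule (no good suffix rule). This is an illustration under stated assumptions, not the full Boyer-Moore algorithm: β is stored as a dictionary, characters missing from it count as β(a) = 0, and the names bad_char_table and search_bad_char are made up for the example.

```python
from typing import Dict, List


def bad_char_table(p: str) -> Dict[str, int]:
    """beta[a] = 1-based position of the rightmost occurrence of a in p."""
    beta = {}
    for i, ch in enumerate(p, start=1):
        beta[ch] = i  # later occurrences overwrite earlier ones
    return beta


def search_bad_char(t: str, p: str) -> List[int]:
    """Return all 0-based shifts of p in t, shifting by the bad character rule only."""
    n, m = len(t), len(p)
    if m == 0 or m > n:
        return []
    beta = bad_char_table(p)
    shifts = []
    s = 0
    while s <= n - m:
        k = m - 1
        # Compare right to left, as Boyer-Moore does.
        while k >= 0 and p[k] == t[s + k]:
            k -= 1
        if k < 0:
            shifts.append(s)
            s += 1
        else:
            # Align the rightmost occurrence of the mismatched text character
            # with the mismatch position; always move by at least one.
            s += max(1, (k + 1) - beta.get(t[s + k], 0))
    return shifts
```

When the mismatched text character does not occur in the pattern at all, the pattern jumps completely past it, which is where most of the skipped alignments come from.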

13/56

Good suffix rule

Let t be the substring of T that matched a suffix of P. Skip alignments until
(a) t matches opposite characters in P,
(b) a prefix of P matches a suffix of t, or
(c) P moves past t,
whichever happens first.

14/56

Bad match rule & good suffix rule

( https://www.youtube.com/watch?v=4Xyhb72LCX4 )

15/56

Rabin-Karp – the idea 

Compare the hash values of the strings, rather than the strings themselves.



For efficiency, the hash value of the next position in the text is easily computed from the hash value of the current position. [5]

16/56

Example

Pattern = AAT
Text = TAACGGCATACAATCG

Character values: A=1, T=2, C=3, G=4
Prime number = 7

Calculate the new hash from oldHash:
1. X = oldHash – val(old char)
2. X = X / prime
3. newHash = X + prime^(m-1) * val(new char)

17/56

Example, Rabin-Karp algorithm

Pattern = AAT, H(AAT) = 1 + 1*7 + 2*49 = 106
Text = TAGACAATCG

H(TAG) = 2 + 1*7 + 4*49 = 205 != 106
H(AGA) = (205-2)/7 + 1*49 = 78 != 106
H(GAC) = (78-1)/7 + 3*49 = 158 != 106
H(ACA) = (158-4)/7 + 1*49 = 71 != 106
H(CAA) = (71-1)/7 + 1*49 = 59 != 106
H(AAT) = (59-3)/7 + 2*49 = 106 == 106

18/56
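The rolling update can be sketched in Python using the character values and the prime from the example. The names VAL, PRIME, full_hash, roll and rabin_karp are illustrative. As in the slides, no modulus is applied, so the hash grows with the pattern length; practical implementations usually reduce it modulo a large number.

```python
from typing import List, Tuple

VAL = {"A": 1, "T": 2, "C": 3, "G": 4}  # character values from the example
PRIME = 7


def full_hash(s: str) -> int:
    """H(c1 c2 ... cm) = val(c1) + val(c2)*7 + ... + val(cm)*7^(m-1)."""
    return sum(VAL[c] * PRIME ** k for k, c in enumerate(s))


def roll(old_hash: int, old_char: str, new_char: str, m: int) -> int:
    """Derive the hash of the next window from the hash of the current one."""
    x = old_hash - VAL[old_char]   # step 1: drop the leaving character
    x //= PRIME                    # step 2: exact division by the prime
    return x + PRIME ** (m - 1) * VAL[new_char]  # step 3: add the entering character


def rabin_karp(t: str, p: str) -> Tuple[List[int], List[int]]:
    """Return (0-based shifts of p in t, hash of every window of t)."""
    n, m = len(t), len(p)
    if m == 0 or m > n:
        return [], []
    target = full_hash(p)
    h = full_hash(t[:m])
    hashes, shifts = [h], []
    for s in range(n - m + 1):
        if s > 0:
            h = roll(h, t[s - 1], t[s + m - 1], m)
            hashes.append(h)
        # Equal hashes are confirmed by a direct comparison to rule out collisions.
        if h == target and t[s:s + m] == p:
            shifts.append(s)
    return shifts, hashes
```

Calling rabin_karp("TAGACAATCG", "AAT") returns the shifts together with the sequence of window hashes, so each step of the worked example can be checked against a direct computation with full_hash.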

Finite state automaton

We will show that, after a clever preprocessing of the pattern, one scan of the text from left to right suffices to solve the string matching problem. Furthermore, we will see that the preprocessing can also be realized efficiently: it is possible in time O(|p|·|Σ|) [1].

19/56

Informal definition of automata 

Informally speaking, a finite automaton can be described as a machine that reads a given text once from left to right. At each step, the automaton is in one of finitely many internal states, and this internal state can change after reading every single symbol of the text, depending only on the current state and the last symbol read.

20/56

Formal definition 

A finite automaton is a quintuple M = (Q, Σ, q0, δ, F), where
• Q is a finite set of states,
• Σ is an input alphabet,
• q0 ∈ Q is the initial state,
• F ⊆ Q is a set of accepting states, and
• δ : Q × Σ → Q is a transition function describing the transitions of the automaton from one state to another.

21/56
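To make the quintuple concrete, here is a minimal sketch that simulates a deterministic finite automaton in Python. The example automaton (three states over Σ = {A, T}, accepting exactly the words that end in "AT") and the names run_dfa, DELTA, Q0 and F are assumptions made for illustration.

```python
from typing import Dict, Set, Tuple


def run_dfa(delta: Dict[Tuple[int, str], int], q0: int,
            accepting: Set[int], word: str) -> bool:
    """Read word once from left to right and report whether the final state is accepting."""
    q = q0
    for symbol in word:
        q = delta[(q, symbol)]  # next state depends only on (current state, symbol)
    return q in accepting


# Hypothetical automaton: state = length of the longest prefix of "AT" just read.
DELTA = {
    (0, "A"): 1, (0, "T"): 0,
    (1, "A"): 1, (1, "T"): 2,
    (2, "A"): 1, (2, "T"): 0,
}
Q0 = 0
F = {2}
```

For instance, run_dfa(DELTA, Q0, F, "TAAT") walks through the states 0, 1, 1, 2 and ends in the accepting state.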

Why use a finite state machine?

• Complex pattern matching, e.g. regular expressions that describe infinitely many patterns: Finite State Machine (FSM), aka DFA
• Time complexity:
    • Preprocessing: O(m^3 |Σ|) (straightforward construction)
    • Matching: Θ(n)

22/56

String matching with FSM

( https://www.youtube.com/watch?v=nNb9lu5Hvio )

23/56

FSM Matching algorithm

FINITE-AUTOMATON-MATCHER(T, δ, m)
    n ← length[T]
    q ← 0
    for i ← 1 to n
        do q ← δ(q, T[i])
           if q = m then
               print "Pattern occurs with shift" i-m

24/56

Transition-function construction algorithm

m ← length[P]
for q ← 0 to m                      (for each state)
    do for each character a ∈ Σ     (|Σ|)
        do k ← min(m+1, q+2)
           repeat k ← k-1           (1 ≤ k ≤ m+1)
           until P_k ⊐ P_q a        (Σ k)
           δ(q, a) ← k
return δ
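The construction and the matcher can be sketched together in Python. This follows the straightforward construction (for each state q and character a, find the largest k such that the length-k prefix of P is a suffix of P_q a), with δ stored as a dictionary. The function names are illustrative, and the sketch assumes every text character belongs to the given alphabet.

```python
from typing import Dict, List, Tuple


def compute_transition_function(p: str, alphabet: str) -> Dict[Tuple[int, str], int]:
    """delta[(q, a)] = length of the longest prefix of p that is a suffix of p[:q] + a."""
    m = len(p)
    delta = {}
    for q in range(m + 1):          # one state per matched prefix length
        for a in alphabet:
            k = min(m, q + 1)       # the new state can be at most q + 1
            while k > 0 and not (p[:q] + a).endswith(p[:k]):
                k -= 1
            delta[(q, a)] = k
    return delta


def automaton_match(t: str, delta: Dict[Tuple[int, str], int], m: int) -> List[int]:
    """Scan t once and return every shift at which the accepting state m is reached."""
    q = 0
    shifts = []
    for i, c in enumerate(t, start=1):
        q = delta[(q, c)]
        if q == m:
            shifts.append(i - m)
    return shifts
```

A typical call builds the table once with compute_transition_function("AAT", "ACGT") and then reuses it for any number of texts, each scanned in a single pass.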

25/56

Better solution: suffix trees 

Can solve problem in O(m) time



• Conceptually related to keyword trees [7]
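The link to keyword trees can be illustrated with a naive suffix trie: a keyword tree whose keywords are all suffixes of the text. This is only a sketch under simplifying assumptions. It is uncompressed and takes quadratic time and space to build, unlike a real suffix tree, and the names build_suffix_trie and contains are made up for the example.

```python
from typing import Dict


def build_suffix_trie(text: str) -> Dict:
    """Insert every suffix of text into a trie of nested dictionaries."""
    root: Dict = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})  # follow or create the child for ch
    return root


def contains(trie: Dict, pattern: str) -> bool:
    """Walk one edge per pattern character; succeed if the walk never falls off."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True
```

Once the trie exists, a lookup touches at most one node per pattern character, independent of the length of the text.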

26/56

[Slides 27/56 to 51/56: figures only, source [8]]

Weiner’s Algorithm I

Definitions
• W_i: suffix tree for S_i = S[i..n]$
• WHead(i): longest prefix of S_i that is also a prefix of S_j for some j > i

Procedure
• Build W_{n+1} = edge (root, n+1) labelled $
• For j from n down to 1 do
    • Find WHead(j) in W_{j+1}
    • w = node labelled WHead(j) (newly created if necessary)
    • Create new leaf j and edge (w, j) labelled S[j..n] − WHead(j), i.e. the part of the suffix that follows WHead(j)

52/56

[7]

53/56

54/56 [9]

Ukkonen’s suffix tree

(https://www.youtube.com/watch?v=WbLKFzqvacg )

55/56

Suffix array 

In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used, among others, in full text indices, data compression algorithms and within the field of bioinformatics.
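A minimal sketch of the definition in Python: the array stores the starting indices of the suffixes, ordered by the suffixes they denote. The function name build_suffix_array and the sample string are illustrative, and sorting the suffixes directly is far from the fastest construction; it is used here only because it mirrors the definition.

```python
from typing import List


def build_suffix_array(s: str) -> List[int]:
    """Return the 0-based start indices of all suffixes of s in lexicographic order."""
    return sorted(range(len(s)), key=lambda i: s[i:])


# Example: the suffixes of "banana" in sorted order.
A = build_suffix_array("banana")
sorted_suffixes = ["banana"[i:] for i in A]
```

Because the suffixes are sorted, all suffixes that start with a given pattern occupy one contiguous interval of the array, which is what makes binary search applicable.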

P.S. 1

Suffix array, example

P.S. 2

Suffix array, example (continued)

P.S. 3

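Suffix array – pattern matching

# Assumes S is the text, n = len(S), A is the suffix array of S,
# and suffixAt(i) returns the suffix S[i:].
def search(P):
    # Find the start of the interval of suffixes that begin with P.
    l = 0; r = n
    while l < r:
        mid = (l + r) // 2  # integer division
        if P > suffixAt(A[mid]):
            l = mid + 1
        else:
            r = mid
    s = l
    # Find the end of the interval (exclusive).
    r = n
    while l < r:
        mid = (l + r) // 2
        if suffixAt(A[mid]).startswith(P):
            l = mid + 1
        else:
            r = mid
    return (s, r)  # A[s:r] lists the start positions of all occurrences of P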

P.S. 4

References

[1]: Hans-Joachim Böckenhauer, Dirk Bongartz, “Algorithmic Aspects of Bioinformatics”, Natural Computing Series, Springer, 2007, ISSN 1619-7127
[2]: https://en.wikipedia.org/wiki/String_searching_algorithm
[3]: https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
[4]: http://www.cs.jhu.edu/~langmea/resources/lecture_notes/boyer_moore.pdf
[5]: http://u.cs.biu.ac.il/~rosenfa5/Alg2/fingerpainting.ppt
[6]: http://web.cs.mun.ca/~wang/courses/cs6783-13f/n2-string-1.pdf
[7]: http://www.zbh.uni-hamburg.de/pubs/pdf/GieKur1997.pdf
[8]: http://bix.ucsd.edu/bioalgorithms/presentations/Ch09_CombinatorialPatternMatching.pdf
[9]: http://wwwmayr.in.tum.de/konferenzen/Jass03/presentations/pentenrieder.pdf

56/56