Pattern matching algorithms
Presentation by: Kamran Mahmoudi, under the supervision of Dr. Mahdavi
Imam Khomeini International University, April 2017
Pattern matching in Bioinformatics
Certain nucleotide and/or amino acid sequences have properties known to biologists. For example, ATG is the start codon: it typically marks the beginning of a protein-coding region (gene) in a DNA sequence.
Determining whether a DNA sequence contains a specific (candidate) primer is essential for running PCR correctly.
A conserved DNA sequence is a sequence of nucleotides that is found in the DNA of multiple species and/or multiple strains.
Some sequences are conserved exactly, but many are conserved with some modifications. Finding such modified strings is an important step in mapping the DNA of a new organism.
Intro.
Needle in a haystack
The string matching problem consists of finding a (usually short) string, the pattern, as a substring in a given (usually very long) string, the text [1].
Formal Definition
Let Σ be an arbitrary alphabet. The (exact) string matching problem is the following problem:
Input: Two strings t = t_1 … t_n and p = p_1 … p_m over Σ.
Output: The set of all positions in the text t where an occurrence of the pattern p as a substring starts [1].
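The problem statement above can be sketched directly in Python by testing every start position. This is only an illustration of the input/output contract, not an efficient method; the function name `occurrences` is an assumption made for this example, and positions are reported 1-based to match the definition.

```python
def occurrences(t, p):
    """Return the set of 1-based positions where p starts as a substring of t."""
    n, m = len(t), len(p)
    # Try every 0-based alignment j and compare the window of length m with p.
    return {j + 1 for j in range(n - m + 1) if t[j:j + m] == p}


# Overlapping occurrences are all reported.
print(sorted(occurrences("ATATAT", "ATA")))
```

Note that the output is a set of positions, so overlapping occurrences are listed individually.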
Classification using preprocessing as the main criterion
Classes of string searching algorithms [2]

                            Text not preprocessed        Text preprocessed
Patterns not preprocessed   Primitive algorithms         Index methods
Patterns preprocessed       Constructed search engines   Signature methods
Basic classification
Single-pattern algorithms
✓ Naïve string search
✓ Knuth-Morris-Pratt algorithm
✓ Boyer-Moore algorithm
✓ Rabin-Karp string search algorithm
✓ Finite state automaton based search
Bitap algorithm (shift-or, shift-and, Baeza–Yates–Gonnet)
Two-way string-matching algorithm
BNDM (Backward Non-Deterministic Dawg Matching)
BOM (Backward Oracle Matching)
Basic classification
Algorithms using a finite set of patterns
Aho–Corasick string matching algorithm (extension of Knuth-Morris-Pratt)
Commentz-Walter algorithm (extension of Boyer-Moore)
Set-BOM (extension of Backward Oracle Matching)
Rabin–Karp string search algorithm
Basic classification
Algorithms using an infinite number of patterns
Naturally, the patterns cannot be enumerated finitely in this case. They are usually represented by a regular grammar or a regular expression.
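As a small illustration of a pattern set that cannot be listed exhaustively, the sketch below uses Python's standard `re` module. The motif `AT+G` (an A, one or more T, then a G) and the sample text are assumptions chosen for this example; the expression describes infinitely many strings (ATG, ATTG, ATTTG, ...).

```python
import re

# Hypothetical motif: "A", one or more "T", then "G".
motif = re.compile(r"AT+G")

text = "GATGCATTTGAAG"
# finditer yields non-overlapping matches, scanning left to right.
hits = [(match.start(), match.group()) for match in motif.finditer(text)]
print(hits)  # 0-based start positions with the matched substrings
```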
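Naïve string search
Input: a pattern p = p_1 … p_m and a text t = t_1 … t_n
I := ∅
For j := 0 to n−m do
    i := 1
    while i ≤ m and p_i = t_{j+i} do i := i+1
    if i = m+1 then I := I ∪ {j+1}
Output: the set I of positions where an occurrence of p starts in t

Knuth–Morris–Pratt algorithm
KMP-Prefix(P)        (T is the prefix table here, called Table in KMP-Matcher)
Begin
    m ← |P|
    T[1] ← 0
    i ← 0
    for j = 2 upto m step 1 do
        while i > 0 and P[i+1] != P[j] do
            i ← T[i]
        Wend
        if P[i+1] = P[j] then i ← i+1 end if
        T[j] ← i
    return T
end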
Knuth–Morris–Pratt algorithm
KMP-Matcher(T, P)
Begin
    n ← |T|
    m ← |P|
    Table ← KMP-Prefix(P)
    i ← 0
    for j = 1 upto n step 1 do
        while i > 0 and P[i+1] != T[j] do
            i ← Table[i]
        Wend
        if P[i+1] = T[j] then i ← i+1 end if
        if i = m then
            output(j−m)
            i ← Table[i]
        end if
end
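The KMP pseudocode can be sketched in Python as follows. This is a minimal version under a few assumptions: strings are 0-indexed (so `table[i - 1]` plays the role of `Table[i]`), the function names `kmp_prefix` and `kmp_search` are chosen for this example, and reported shifts are 0-based start positions.

```python
def kmp_prefix(p):
    """table[j] = length of the longest proper prefix of p[:j+1] that is also its suffix."""
    table = [0] * len(p)
    i = 0  # length of the currently matched prefix
    for j in range(1, len(p)):
        while i > 0 and p[i] != p[j]:
            i = table[i - 1]  # fall back to the next shorter border
        if p[i] == p[j]:
            i += 1
        table[j] = i
    return table


def kmp_search(t, p):
    """Return all 0-based shifts at which p occurs in t."""
    if not p:
        return []
    table = kmp_prefix(p)
    shifts = []
    i = 0  # number of pattern characters matched so far
    for j, c in enumerate(t):
        while i > 0 and p[i] != c:
            i = table[i - 1]
        if p[i] == c:
            i += 1
        if i == len(p):
            shifts.append(j - len(p) + 1)
            i = table[i - 1]  # keep going to find overlapping occurrences
    return shifts


print(kmp_prefix("ABABAC"))
print(kmp_search("ATATAT", "ATA"))
```

The text pointer `j` never moves backwards, which is what gives the linear scan of the text.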
The Boyer-Moore algorithm
The Boyer-Moore algorithm searches for occurrences of P in T by performing explicit character comparisons at different alignments.
Instead of a brute-force search of all alignments (of which there are n − m + 1 for a text of length n and a pattern of length m), Boyer-Moore uses information gained by preprocessing P to skip as many alignments as possible [3].
The Bad Character Rule
The bad-character rule considers the character in T at which the comparison process failed. The next occurrence of that character to the left in P is found, and a shift that brings that occurrence in line with the mismatched occurrence in T is proposed [3].
The Good Suffix Rule
• If we match some characters, use knowledge of the matched characters to skip alignments [4].
Ex.1: the bad character rule
[4]
Preprocessing for the bad character rule
Input: a pattern p = p_1 … p_m over alphabet Σ
For all a ∈ Σ do β(a) := 0
For i := 1 to m do β(p_i) := i
Output: the function β, where β(a) is the rightmost position of a in p (0 if a does not occur in p).
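The preprocessing above, together with a search that uses only the bad character rule, might look like this in Python. It is a simplified sketch, not full Boyer-Moore: the good suffix rule is left out, `beta` stores the rightmost position over the whole pattern (so the proposed shift is clamped to at least 1), and the names `bad_char_table` and `search_bad_char` are assumptions for this example.

```python
def bad_char_table(p, alphabet):
    """beta[a] = rightmost 1-based position of a in p, or 0 if a is absent."""
    beta = {a: 0 for a in alphabet}
    for i, c in enumerate(p, start=1):
        beta[c] = i
    return beta


def search_bad_char(t, p, alphabet):
    """Return all 0-based shifts of p in t, skipping alignments via the bad character rule."""
    n, m = len(t), len(p)
    if m == 0:
        return []
    beta = bad_char_table(p, alphabet)
    shifts = []
    s = 0  # current alignment of p against t
    while s <= n - m:
        j = m - 1
        # Compare right to left, as Boyer-Moore does.
        while j >= 0 and p[j] == t[s + j]:
            j -= 1
        if j < 0:
            shifts.append(s)
            s += 1
        else:
            # Align the rightmost occurrence of the mismatched text character;
            # never shift by less than one position.
            s += max(1, (j + 1) - beta.get(t[s + j], 0))
    return shifts


print(bad_char_table("AAT", "ACGT"))
print(search_bad_char("TAGACAATCG", "AAT", "ACGT"))
```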
Good suffix rule
Let t be the substring of T that matched a suffix of P. Skip alignments until
(a) t matches opposite characters in P, or
(b) a prefix of P matches a suffix of t, or
(c) P moves past t,
whichever happens first.
Bad match rule & good suffix rule
( https://www.youtube.com/watch?v=4Xyhb72LCX4 )
Rabin-Karp – the idea
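Compare the hash values of the strings rather than the strings themselves. Equal hash values only indicate a candidate match, which must still be verified character by character.
For efficiency, the hash value of the next position in the text is easily computed from the hash value of the current position [5].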
Example
Pattern = AAT
Text = TAACGGCATACAATCG
Character values: A=1, T=2, C=3, G=4
Prime number = 7
Calculating the new hash from the old hash:
1. X = oldHash − val(old char)
2. X = X / prime
3. newHash = X + prime^(m−1) * val(new char)
Example, Rabin-Karp algorithm
Pattern = AAT, H(AAT) = 1 + 1*7 + 2*49 = 106
Text = TAGACAATCG
▪ H(TAG) = 2 + 1*7 + 4*49 = 205 != 106
▪ H(AGA) = (205−2)/7 + 1*49 = 78 != 106
▪ H(GAC) = (78−1)/7 + 3*49 = 158 != 106
▪ H(ACA) = (158−4)/7 + 1*49 = 71 != 106
▪ H(CAA) = (71−1)/7 + 1*49 = 59 != 106
✓ H(AAT) = (59−3)/7 + 2*49 = 106 == 106
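The rolling-hash steps of this example can be sketched in Python as below. The sketch follows the slide's scheme (character values A=1, T=2, C=3, G=4 and prime 7) and, like the slide, uses no modulus; practical implementations usually reduce the hash modulo a large number to keep it bounded. The function names are assumptions for this example.

```python
VAL = {"A": 1, "T": 2, "C": 3, "G": 4}
PRIME = 7


def initial_hash(s):
    """H(s) = sum of val(s[k]) * PRIME**k, as in the worked example."""
    return sum(VAL[c] * PRIME**k for k, c in enumerate(s))


def roll(old_hash, old_char, new_char, m):
    """Slide the window one position to the right."""
    x = old_hash - VAL[old_char]          # step 1: drop the outgoing character
    x //= PRIME                           # step 2: exact division by the prime
    return x + PRIME**(m - 1) * VAL[new_char]  # step 3: add the incoming character


def rabin_karp(text, pattern):
    """Return all 0-based shifts of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    target = initial_hash(pattern)
    h = initial_hash(text[:m])
    shifts = []
    for j in range(n - m + 1):
        # Verify on a hash hit, since different strings may share a hash.
        if h == target and text[j:j + m] == pattern:
            shifts.append(j)
        if j + m < n:
            h = roll(h, text[j], text[j + m], m)
    return shifts


print(initial_hash("AAT"), initial_hash("TAG"))
print(rabin_karp("TAGACAATCG", "AAT"))
```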
Finite state automaton
We will show that, after a clever preprocessing of the pattern, one scan of the text from left to right suffices to solve the string matching problem. Furthermore, we will see that the preprocessing can also be realized efficiently: it is possible in time O(|p|·|Σ|) [1].
Informal definition of automata
Informally speaking, a finite automaton can be described as a machine that reads a given text once from left to right. At each step, the automaton is in one of finitely many internal states, and this internal state can change after reading every single symbol of the text, depending only on the current state and the last symbol read.
Formal definition
A finite automaton is a quintuple M = (Q, Σ, q0, δ, F), where
Q is a finite set of states,
Σ is an input alphabet,
q0 ∈ Q is the initial state,
F ⊆ Q is a set of accepting states, and
δ : Q × Σ → Q is a transition function describing the transitions of the automaton from one state to another.
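The quintuple can be mirrored almost literally in Python. The sketch below is an assumption-laden illustration: the class name `DFA`, its field names, and the sample automaton (which accepts strings over {A, T} that end in "AT") are all made up for this example.

```python
from dataclasses import dataclass


@dataclass
class DFA:
    states: frozenset      # Q
    alphabet: frozenset    # Σ
    start: int             # q0
    delta: dict            # δ, stored as {(state, symbol): state}
    accepting: frozenset   # F

    def accepts(self, word):
        """Read word once from left to right and report whether the final state is accepting."""
        q = self.start
        for symbol in word:
            q = self.delta[(q, symbol)]
        return q in self.accepting


# Hypothetical automaton: state k means "the last k characters read are a prefix of AT".
ends_in_at = DFA(
    states=frozenset({0, 1, 2}),
    alphabet=frozenset("AT"),
    start=0,
    delta={(0, "A"): 1, (0, "T"): 0,
           (1, "A"): 1, (1, "T"): 2,
           (2, "A"): 1, (2, "T"): 0},
    accepting=frozenset({2}),
)

print(ends_in_at.accepts("TAAT"), ends_in_at.accepts("ATA"))
```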
Why use a finite state machine
Complex pattern matching, such as regular expressions that describe infinitely many strings: Finite State Machine (FSM), also known as a DFA
Time complexity:
Preprocessing: O(m^3 |Σ|) with the straightforward transition-function construction
Matching: Θ(n)
String matching with FSM
( https://www.youtube.com/watch?v=nNb9lu5Hvio )
FSM Matching algorithm
FINITE-AUTOMATON-MATCHER(T, δ, m)
n ← length[T]
q ← 0
for i ← 1 to n
    do q ← δ(q, T[i])
       if q = m then
           print "Pattern occurs with shift" i − m
Transition-function construction algorithm
m ← length[P]
for q ← 0 to m                          (for each state)
    do for each character a ∈ Σ         (|Σ| characters)
        do k ← min(m+1, q+2)
           repeat k ← k − 1             (1 ≤ k ≤ m+1)
           until P_k ⊐ P_q a            (P_k is a suffix of P_q a)
           δ(q, a) ← k
return δ
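The construction and the matcher can be combined into a short Python sketch. It follows the straightforward construction (the one with the cubic preprocessing bound), uses 0-indexed strings, and stores δ in a dictionary; the function names and the choice to treat characters outside the alphabet as leading back to state 0 are assumptions for this example.

```python
def build_transitions(pattern, alphabet):
    """delta[(q, a)] = length of the longest prefix of pattern that is a suffix of pattern[:q] + a."""
    m = len(pattern)
    delta = {}
    for q in range(m + 1):
        for a in alphabet:
            k = min(m, q + 1)
            # Decrease k until pattern[:k] is a suffix of pattern[:q] + a.
            while k > 0 and not (pattern[:q] + a).endswith(pattern[:k]):
                k -= 1
            delta[(q, a)] = k
    return delta


def automaton_match(text, pattern, alphabet):
    """Return all 0-based shifts of a non-empty pattern in text using one left-to-right scan."""
    m = len(pattern)
    if m == 0:
        return []
    delta = build_transitions(pattern, alphabet)
    shifts = []
    q = 0
    for i, c in enumerate(text, start=1):
        # Characters outside the alphabet cannot extend any prefix, so fall back to state 0.
        q = delta.get((q, c), 0)
        if q == m:
            shifts.append(i - m)
    return shifts


print(automaton_match("TAGACAATCG", "AAT", "ACGT"))
```

After preprocessing, the matcher does constant work per text character, which matches the Θ(n) matching time stated earlier.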
Better solution: suffix trees
• Once the suffix tree of the text has been built, a pattern of length m can be located in O(m) time
• Conceptually related to keyword trees [7]
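To illustrate why a lookup costs time proportional to the pattern length, here is a deliberately naive Python sketch. It builds an uncompressed suffix trie (nested dictionaries, quadratic in the text length), not a true suffix tree with compressed edges and linear-time construction; only the descent during lookup is representative. The function names are assumptions for this example.

```python
def build_suffix_trie(text):
    """Insert every suffix of text + '$' into a trie of nested dictionaries."""
    text += "$"  # end marker, so that no suffix is a prefix of another
    root = {}
    for start in range(len(text)):
        node = root
        for c in text[start:]:
            node = node.setdefault(c, {})
    return root


def contains(trie, pattern):
    """Follow pattern from the root: one step per pattern character."""
    node = trie
    for c in pattern:
        if c not in node:
            return False
        node = node[c]
    return True


trie = build_suffix_trie("TAGACAATCG")
print(contains(trie, "AAT"), contains(trie, "AAG"))
```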
[8]
Weiner’s Algorithm I
Definitions
W_i: suffix tree for S_i = S[i..n]$
WHead(i): longest prefix of S_i that is also a prefix of S_j for some j > i
Procedure
Build W_{n+1} = edge (root, n+1) labelled $
For j from n down to 1 do
    Find WHead(j) in W_{j+1}
    w = node labelled WHead(j) (newly created if necessary)
    Create new leaf j and edge (w, j) labelled S[j..n] − WHead(j)
[7]
[9]
Ukkonen’s suffix tree
(https://www.youtube.com/watch?v=WbLKFzqvacg )
Suffix array
In computer science, a suffix array is a sorted array of all suffixes of a string. It is a data structure used, among others, in full-text indices, data compression algorithms, and within the field of bioinformatics.
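The definition can be made concrete with a few lines of Python. This is a naive construction that sorts the suffixes directly (fine for short strings, far from the fastest known methods); the function name `suffix_array` and the sample string "banana" are assumptions for this example.

```python
def suffix_array(s):
    """Return the start indices of all suffixes of s, in lexicographic order of the suffixes."""
    return sorted(range(len(s)), key=lambda i: s[i:])


s = "banana"
sa = suffix_array(s)
for i in sa:
    print(i, s[i:])  # each start index next to its suffix, in sorted order
```

Only the start indices are stored; the suffixes themselves are read from the original string when needed.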
P.S. 1
Suffix array, example
P.S. 2
Suffix array, example (continue)
P.S. 3
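Suffix array – pattern matching
    # S is the text, n = len(S), A is the suffix array of S, and suffixAt(i) returns S[i:].
    def search(P):
        # Binary search for the first suffix that is >= P.
        l = 0; r = n
        while l < r:
            mid = (l + r) // 2
            if P > suffixAt(A[mid]):
                l = mid + 1
            else:
                r = mid
        s = l; r = n
        # Binary search for the first suffix after s that does not start with P.
        while l < r:
            mid = (l + r) // 2
            if suffixAt(A[mid]).startswith(P):
                l = mid + 1
            else:
                r = mid
        return (s, r)  # A[s:r] holds the start positions of all occurrences of P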
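A self-contained variant of the search, which can be run as is, is sketched below. It assumes a naive suffix array built by sorting and passes the text and the array as parameters instead of relying on globals; the names `build_suffix_array` and `find_occurrences` are made up for this example.

```python
def build_suffix_array(s):
    """Naive construction: sort suffix start indices by the suffixes they denote."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def find_occurrences(s, sa, p):
    """Return the sorted 0-based start positions of p in s using two binary searches."""
    n = len(sa)
    lo, hi = 0, n
    # Lower bound: first suffix that is >= p.
    while lo < hi:
        mid = (lo + hi) // 2
        if p > s[sa[mid]:]:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = n
    # Upper bound: first suffix at or after start that does not begin with p.
    while lo < hi:
        mid = (lo + hi) // 2
        if s[sa[mid]:].startswith(p):
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:hi])


text = "banana"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "ana"))
```

All suffixes that start with the pattern are adjacent in the sorted order, which is why two boundary searches are enough.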
P.S. 4
References
[1]: Hans-Joachim Böckenhauer, Dirk Bongartz, "Algorithmic Aspects of Bioinformatics", Natural Computing Series, Springer, 2007, ISSN 1619-7127
[2]: https://en.wikipedia.org/wiki/String_searching_algorithm
[3]: https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
[4]: http://www.cs.jhu.edu/~langmea/resources/lecture_notes/boyer_moore.pdf
[5]: http://u.cs.biu.ac.il/~rosenfa5/Alg2/fingerpainting.ppt
[6]: http://web.cs.mun.ca/~wang/courses/cs6783-13f/n2-string-1.pdf
[7]: http://www.zbh.uni-hamburg.de/pubs/pdf/GieKur1997.pdf
[8]: http://bix.ucsd.edu/bioalgorithms/presentations/Ch09_CombinatorialPatternMatching.pdf
[9]: http://wwwmayr.in.tum.de/konferenzen/Jass03/presentations/pentenrieder.pdf