## Algorithms and Data Structures


Algorithms and Data Structures Part 5: String Matching (Wikipedia Book 2014)

By Wikipedians

Editors: Reiner Creutzburg, Jenny Knackmuß

PDF generated using the open source mwlib toolkit. See http://code.pediapress.com/ for more information. PDF generated at: Sun, 22 Dec 2013 17:47:20 UTC

Contents

Articles

String Matching
• String (computer science)
• String searching algorithm
• Knuth–Morris–Pratt algorithm
• Boyer–Moore string search algorithm

References

• Article Sources and Contributors

String Matching

String (computer science)

In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and/or the length changed, or it may be fixed (after creation). A string is generally understood as a data type and is often implemented as an array of bytes (or words) that stores a sequence of elements, typically characters, using some character encoding. A string may also denote more general arrays or other sequence (or list) data types and structures. Depending on the programming language and the precise data type used, a variable declared to be a string may either cause storage in memory to be statically allocated for a predetermined maximum length or employ dynamic allocation to allow it to hold a variable number of elements. When a string appears literally in source code, it is known as a string literal and has a representation that denotes it as such. In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set called an alphabet.

Formal theory

Let Σ be a non-empty finite set of symbols (alternatively called characters), called the alphabet. No assumption is made about the nature of the symbols. A string (or word) over Σ is any finite sequence of symbols from Σ. For example, if Σ = {0, 1}, then 01011 is a string over Σ. The length of a string is the number of symbols in the string (the length of the sequence) and can be any non-negative integer. The empty string is the unique string over Σ of length 0, and is denoted ε or λ. The set of all strings over Σ of length n is denoted Σ^n. For example, if Σ = {0, 1}, then Σ^2 = {00, 01, 10, 11}. Note that Σ^0 = {ε} for any alphabet Σ. The set of all strings over Σ of any length is the Kleene closure of Σ and is denoted Σ*. In terms of Σ^n,

$$\Sigma^{*} = \bigcup_{n \ge 0} \Sigma^{n}$$

For example, if Σ = {0, 1}, Σ* = {ε, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, ...}. Although Σ* itself is countably infinite, all elements of Σ* have finite length. A set of strings over Σ (i.e. any subset of Σ*) is called a formal language over Σ. For example, if Σ = {0, 1}, the set of strings with an even number of zeros ({ε, 1, 00, 11, 001, 010, 100, 111, 0000, 0011, 0101, 0110, 1001, 1010, 1100, 1111, ...}) is a formal language over Σ.
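The definitions of Σ^n, Σ* and a formal language can be tried out with a short Python sketch. The function names (`strings_of_length`, `kleene_prefix`, `has_even_zeros`) are illustrative choices, not part of the formal theory, and since Σ* is infinite the sketch only enumerates its members up to a chosen length.

```python
from itertools import product

def strings_of_length(alphabet, n):
    """Return all strings of length n over the alphabet, i.e. the set Sigma^n."""
    return ["".join(symbols) for symbols in product(sorted(alphabet), repeat=n)]

def kleene_prefix(alphabet, max_len):
    """List the members of Sigma* of length at most max_len, shortest first."""
    result = []
    for n in range(max_len + 1):
        result.extend(strings_of_length(alphabet, n))
    return result

def has_even_zeros(s):
    """Membership test for the example language: an even number of '0' symbols."""
    return s.count("0") % 2 == 0

sigma = {"0", "1"}
print(strings_of_length(sigma, 2))   # ['00', '01', '10', '11']
print(kleene_prefix(sigma, 2))       # ['', '0', '1', '00', '01', '10', '11']
print([s for s in kleene_prefix(sigma, 3) if has_even_zeros(s)])
```

The empty string appears as `''`, which matches Σ^0 = {ε}.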


Concatenation and substrings

Concatenation is an important binary operation on Σ*. For any two strings s and t in Σ*, their concatenation is defined as the sequence of symbols in s followed by the sequence of symbols in t, and is denoted st. For example, if Σ = {a, b, ..., z}, s = bear, and t = hug, then st = bearhug and ts = hugbear. String concatenation is an associative, but non-commutative operation. The empty string serves as the identity element; for any string s, εs = sε = s. Therefore, the set Σ* and the concatenation operation form a monoid, the free monoid generated by Σ. In addition, the length function defines a monoid homomorphism from Σ* to the non-negative integers (that is, a function L : Σ* → ℕ ∪ {0} such that L(st) = L(s) + L(t) for all s, t in Σ*). A string s is said to be a substring or factor of t if there exist (possibly empty) strings u and v such that t = usv. The relation "is a substring of" defines a partial order on Σ*, the least element of which is the empty string.

Prefixes and suffixes A string s is said to be a prefix of t if there exists a string u such that t = su. If u is nonempty, s is said to be a proper prefix of t. Symmetrically, a string s is said to be a suffix of t if there exists a string u such that t = us. If u is nonempty, s is said to be a proper suffix of t. Suffixes and prefixes are substrings of t.

Rotations A string s = uv is said to be a rotation of t if t = vu. For example, if Σ = {0, 1} the string 0011001 is a rotation of 0100110, where u = 00110 and v = 01.
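The relations defined so far (concatenation, prefix, suffix, substring, rotation) map directly onto built-in Python string operations. The following is a minimal sketch; the helper names are hypothetical, and the rotation test relies on the observation that every rotation of t occurs inside tt.

```python
def is_prefix(s, t):
    """s is a prefix of t if t = s + u for some string u."""
    return t.startswith(s)

def is_suffix(s, t):
    """s is a suffix of t if t = u + s for some string u."""
    return t.endswith(s)

def is_substring(s, t):
    """s is a substring of t if t = u + s + v for some strings u and v."""
    return s in t

def is_rotation(s, t):
    """s = uv is a rotation of t = vu; equivalently s occurs in t + t."""
    return len(s) == len(t) and s in t + t

s, t = "bear", "hug"
print(s + t, t + s)                       # bearhug hugbear
print(len(s + t) == len(s) + len(t))      # True: length is a homomorphism
print(is_rotation("0011001", "0100110"))  # True
```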

Reversal The reverse of a string is a string with the same symbols but in reverse order. For example, if s = abc (where a, b, and c are symbols of the alphabet), then the reverse of s is cba. A string that is the reverse of itself (e.g., s = madam) is called a palindrome, which also includes the empty string and all strings of length 1.

Lexicographical ordering It is often useful to define an ordering on a set of strings. If the alphabet Σ has a total order (cf. alphabetical order) one can define a total order on Σ* called lexicographical order. For example, if Σ = {0, 1} and 0 < 1, then the lexicographical order on Σ* includes the relationships ε < 0 < 00 < 000 < ... < 0001 < 001 < 01 < 010 < 011 < 0110 < 01111 < 1 < 10 < 100 < 101 < 111 < 1111 < 11111 ...
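As an illustration, Python compares `str` values lexicographically by code point, which for the alphabet {0, 1} with 0 < 1 coincides with the order described above. The word list below is an arbitrary sample chosen for this sketch.

```python
# Arbitrary sample of strings over {0, 1}, deliberately out of order.
words = ["1", "01", "", "0110", "001", "0", "10", "000", "011"]

# sorted() uses the built-in string comparison, which is lexicographic:
# a proper prefix sorts before any of its extensions.
print(sorted(words))
# ['', '0', '000', '001', '01', '011', '0110', '1', '10']
```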

String operations A number of additional operations on strings commonly occur in the formal theory. These are given in the article on string operations.

Topology

Strings admit the following interpretation as nodes on a graph:
• Fixed-length strings can be viewed as nodes on a hypercube
• Variable-length strings (of finite length) can be viewed as nodes on the k-ary tree, where k is the number of symbols in Σ
• Infinite strings can be viewed as infinite paths on the k-ary tree.

The natural topology on the set of fixed-length strings or variable length strings is the discrete topology, but the natural topology on the set of infinite strings is the limit topology, viewing the set of infinite strings as the inverse limit of the sets of finite strings. This is the construction used for the p-adic numbers and some constructions of the Cantor set, and yields the same topology.

Isomorphisms between string representations of topologies can be found by normalizing according to the lexicographically minimal string rotation.

String datatypes A string datatype is a datatype modeled on the idea of a formal string. Strings are such an important and useful datatype that they are implemented in nearly every programming language. In some languages they are available as primitive types and in others as composite types. The syntax of most high-level programming languages allows for a string, usually quoted in some way, to represent an instance of a string datatype; such a meta-string is called a literal or string literal.

String length Although formal strings can have an arbitrary (but finite) length, the length of strings in real languages is often constrained to an artificial maximum. In general, there are two types of string datatypes: fixed-length strings, which have a fixed maximum length and which use the same amount of memory whether this maximum is reached or not, and variable-length strings, whose length is not arbitrarily fixed and which use varying amounts of memory depending on their actual size. Most strings in modern programming languages are variable-length strings. Despite the name, even variable-length strings are limited in length, although, in general, the limit depends only on the amount of memory available. The string length can be stored as a separate integer (which puts a theoretical limit on the length) or implicitly through a termination character, usually a character value with all bits zero. See also "Null-terminated" below.

Character encoding String datatypes have historically allocated one byte per character, and, although the exact character set varied by region, character encodings were similar enough that programmers could often get away with ignoring this, since characters a program treated specially (such as period and space and comma) were in the same place in all the encodings a program would encounter. These character sets were typically based on ASCII or EBCDIC. Logographic languages such as Chinese, Japanese, and Korean (known collectively as CJK) need far more than 256 characters (the limit of a one 8-bit byte per-character encoding) for reasonable representation. The normal solutions involved keeping single-byte representations for ASCII and using two-byte representations for CJK ideographs. Use of these with existing code led to problems with matching and cutting of strings, the severity of which depended on how the character encoding was designed. Some encodings such as the EUC family guarantee that a byte value in the ASCII range will represent only that ASCII character, making the encoding safe for systems that use those characters as field separators. Other encodings such as ISO-2022 and Shift-JIS do not make such guarantees, making matching on byte codes unsafe. These encodings also were not "self-synchronizing", so that locating character boundaries required backing up to the start of a string, and pasting two strings together could result in corruption of the second string (these problems were much less with EUC as any ASCII character did synchronize the encoding). Unicode has simplified the picture somewhat. Most programming languages now have a datatype for Unicode strings. Unicode's preferred byte stream format UTF-8 is designed not to have the problems described above for older multibyte encodings. 
UTF-8, UTF-16 and UTF-32 all require the programmer to know that the fixed-size code units are different from the "characters"; the main difficulty currently is incorrectly designed APIs that attempt to hide this difference.
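The gap between code units and "characters" can be made concrete with a small Python sketch. The helper `code_units` and the sample strings are assumptions made for illustration; Python's `len` on a `str` counts code points, which is yet another unit.

```python
def code_units(text, encoding, unit_size):
    """Number of fixed-size code units needed to store text in the encoding."""
    return len(text.encode(encoding)) // unit_size

samples = {
    "ASCII only": "naive",
    "precomposed U+00EF": "na\u00efve",
    "combining U+0308": "nai\u0308ve",
    "outside the BMP": "\U0001F600",
}

for label, text in samples.items():
    print(
        label,
        len(text),                          # code points
        code_units(text, "utf-8", 1),       # 8-bit code units
        code_units(text, "utf-16-le", 2),   # 16-bit code units
        code_units(text, "utf-32-le", 4),   # 32-bit code units
    )
```

The two spellings of "naïve" display identically yet differ in every count, so none of these lengths is the number of user-perceived characters.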


Implementations Some languages like C++ implement strings as templates that can be used with any datatype, but this is the exception, not the rule. Some languages, such as C++ and Ruby, normally allow the contents of a string to be changed after it has been created; these are termed mutable strings. In other languages, such as Java and Python, the value is fixed and a new string must be created if any alteration is to be made; these are termed immutable strings. Strings are typically implemented as arrays of bytes, characters, or code units, in order to allow fast access to individual units or substrings—including characters when they have a fixed length. A few languages such as Haskell implement them as linked lists instead. Some languages, such as Prolog and Erlang, avoid implementing a dedicated string datatype at all, instead adopting the convention of representing strings as lists of character codes.

Representations

Representations of strings depend heavily on the choice of character repertoire and the method of character encoding. Older string implementations were designed to work with the repertoire and encoding defined by ASCII, or more recent extensions like the ISO 8859 series. Modern implementations often use the extensive repertoire defined by Unicode along with a variety of complex encodings such as UTF-8 and UTF-16.

The term bytestring usually indicates a general-purpose string of bytes, rather than a string of only (readable) characters, a string of bits, or the like. Byte strings often imply that bytes can take any value and any data can be stored as-is, meaning that there should be no value interpreted as a termination value.

Most string implementations are very similar to variable-length arrays with the entries storing the character codes of corresponding characters. The principal difference is that, with certain encodings, a single logical character may take up more than one entry in the array. This happens for example with UTF-8, where single codes (UCS code points) can take anywhere from one to four bytes, and single characters can take an arbitrary number of codes. In these cases, the logical length of the string (number of characters) differs from the logical length of the array (number of bytes in use). UTF-32 avoids only the first part of this problem: every code point occupies exactly one 32-bit code unit, but a single character may still consist of several code points.

Null-terminated

The length of a string can be stored implicitly by using a special terminating character; often this is the null character (NUL), which has all bits zero, a convention used and perpetuated by the popular C programming language. Hence, this representation is commonly referred to as a C string. In terminated strings, the terminating code is not an allowable character in any string. Strings with a length field do not have this limitation and can also store arbitrary binary data. In C, two things are needed to handle binary data: a character pointer and the length of the data.

An example of a null-terminated string stored in a 10-byte buffer, along with its ASCII (or more modern UTF-8) representation as 8-bit hexadecimal numbers, is:

| F | R | A | N | K | NUL | k | e | f | w |
|---|---|---|---|---|---|---|---|---|---|
| 46₁₆ | 52₁₆ | 41₁₆ | 4E₁₆ | 4B₁₆ | 00₁₆ | 6B₁₆ | 65₁₆ | 66₁₆ | 77₁₆ |

The length of the string in the above example, "FRANK", is 5 characters, but it occupies 6 bytes. Characters after the terminator do not form part of the representation; they may be either part of another string or just garbage. (Strings of this form are sometimes called ASCIZ strings, after the original assembly language directive used to declare them.)
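The 10-byte buffer from the example can be reproduced in Python to show how a reader of a terminated string finds its end. The function name `c_string` is a hypothetical helper for this sketch; it is not a library call.

```python
# The 10-byte buffer from the example: "FRANK", NUL, then leftover bytes "kefw".
buffer = bytes.fromhex("46 52 41 4E 4B 00 6B 65 66 77")

def c_string(buf):
    """Return the bytes before the first NUL, as a C string reader would."""
    end = buf.find(b"\x00")
    if end == -1:
        # Without a terminator, a C routine would run past the buffer.
        raise ValueError("no terminator inside the buffer")
    return buf[:end]

text = c_string(buffer)
print(text)            # b'FRANK'
print(len(text))       # 5 characters
print(len(text) + 1)   # 6 bytes occupied, counting the terminator
```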


Length-prefixed

The length of a string can also be stored explicitly, for example by prefixing the string with the length as a byte value (a convention used in many Pascal dialects); as a consequence, some people call it a P-string. Storing the string length as a byte limits the maximum string length to 255. To avoid such limitations, improved implementations of P-strings use 16-, 32-, or 64-bit words to store the string length. When the length field covers the address space, strings are limited only by the available memory. Here is the equivalent Pascal string stored in a 10-byte buffer, along with its ASCII / UTF-8 representation:

| length | F | R | A | N | K | k | e | f | w |
|---|---|---|---|---|---|---|---|---|---|
| 05₁₆ | 46₁₆ | 52₁₆ | 41₁₆ | 4E₁₆ | 4B₁₆ | 6B₁₆ | 65₁₆ | 66₁₆ | 77₁₆ |
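A length-prefixed buffer can be read and written with a few lines of Python. This is a sketch of the one-byte-length variant only; `pascal_string` and `to_pascal` are names invented for the illustration.

```python
# The 10-byte buffer from the example: length byte 05, "FRANK", leftover "kefw".
buffer = bytes.fromhex("05 46 52 41 4E 4B 6B 65 66 77")

def pascal_string(buf):
    """Read a string whose first byte holds its length."""
    length = buf[0]
    if length > len(buf) - 1:
        raise ValueError("length byte points past the end of the buffer")
    return buf[1:1 + length]

def to_pascal(data):
    """Prefix data with a one-byte length; at most 255 bytes fit."""
    if len(data) > 255:
        raise ValueError("a one-byte length field cannot exceed 255")
    return bytes([len(data)]) + data

print(pascal_string(buffer))       # b'FRANK'
print(to_pascal(b"FR\x00NK"))      # an embedded NUL byte is stored like any other
```

Unlike the terminated form, the payload may contain any byte value, at the cost of the 255-byte ceiling mentioned above.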

Strings as records

Many languages, including object-oriented ones, implement strings as records with a structure like: class string { int length; char *text; }; This implementation is usually hidden, however, and is accessed through member functions. The "text" field points to a dynamically allocated memory area, which may be expanded if needed. See also string (C++).

Linked-list

Both character termination and length codes limit strings. For example, C character arrays that contain null (NUL) characters cannot be handled directly by C string library functions, and strings using a length code are limited to the maximum value of the length code. Both of these limitations can be overcome by clever programming, of course, but such workarounds are by definition not standard.

Rough equivalents of the C termination method have historically appeared in both hardware and software. For example, "data processing" machines like the IBM 1401 used a special word mark bit to delimit strings at the left, where the operation would start at the right. This meant that, while the IBM 1401 had a seven-bit word in "reality", almost no-one ever thought to use this as a feature, and override the assignment of the seventh bit to (for example) handle ASCII codes.

It is possible to create data structures and functions that manipulate them that do not have the problems associated with character termination and can in principle overcome length code bounds. It is also possible to optimize the string represented using techniques from run length encoding (replacing repeated characters by the character value and a length) and Hamming encoding. While these representations are common, others are possible. Using ropes makes certain string operations, such as insertions, deletions, and concatenations, more efficient.

Security concerns

The differing memory layout and storage requirements of strings can affect the security of the program accessing the string data. String representations requiring a terminating character are commonly susceptible to buffer overflow problems if the terminating character is not present, caused by a coding error or an attacker deliberately altering the data. String representations adopting a separate length field are also susceptible if the length can be manipulated. In such cases, program code accessing the string data requires bounds checking to ensure that it does not inadvertently access or change data outside of the string memory limits.

String data is frequently obtained from user input to a program. As such, it is the responsibility of the program to validate the string to ensure that it represents the expected format. Performing limited or no validation of user input can cause a program to be vulnerable to code injection attacks.

Text file strings

Strings can also be represented in computer-readable text files, for example programming language source files or configuration files. The NUL byte is normally not used as a terminator there, since it does not correspond to the ASCII text standard, and the length is usually not stored, since the file should be human-editable without introducing bugs. Two common representations are:
• Surrounded by quotation marks (ASCII 22₁₆), used by most programming languages. To be able to include quotation marks, newline characters, etc., escape sequences are often available, usually using the backslash character (ASCII 5C₁₆).
• Terminated by a newline sequence, for example in Windows INI files.

Non-text strings While character strings are very common uses of strings, a string in computer science may refer generically to any sequence of homogeneously typed data. A string of bits or bytes, for example, may be used to represent non-textual binary data retrieved from a communications medium. This data may or may not be represented by a string-specific datatype, depending on the needs of the application, the desire of the programmer, and the capabilities of the programming language being used. If the programming language's string implementation is not 8-bit clean, data corruption may ensue.

String processing algorithms

There are many algorithms for processing strings, each with various trade-offs. Some categories of algorithms include:
• String searching algorithms for finding a given substring or pattern
• String manipulation algorithms
• Sorting algorithms
• Regular expression algorithms
• Parsing a string
• Sequence mining

Advanced string algorithms often employ complex mechanisms and data structures, among them suffix trees and finite state machines.


Character string-oriented languages and utilities

Character strings are such a useful datatype that several languages have been designed in order to make string processing applications easy to write. Examples include the following languages:
• awk
• Icon
• MUMPS
• Perl
• Rexx
• Ruby
• sed
• SNOBOL
• Tcl
• TTM

Many Unix utilities perform simple string manipulations and can be used to easily program some powerful string processing algorithms. Files and finite streams may be viewed as strings. Some APIs like Multimedia Control Interface, embedded SQL or printf use strings to hold commands that will be interpreted. Recent scripting programming languages, including Perl, Python, Ruby, and Tcl employ regular expressions to facilitate text operations. Some languages such as Perl and Ruby support string interpolation, which permits arbitrary expressions to be evaluated and included in string literals.

Character string functions String functions are used to manipulate a string or change or edit the contents of a string. They also are used to query information about a string. They are usually used within the context of a computer programming language. The most basic example of a string function is the length(string) function, which returns the length of a string (not counting any terminator characters or any of the string's internal structural information) and does not modify the string. For example, length("hello world") returns 11. There are many string functions that exist in other languages with similar or exactly the same syntax or parameters. For example, in many languages, the length function is usually represented as len(string). Even though string functions are very useful to a computer programmer, a computer programmer using these functions should be mindful that a string function in one language could in another language behave differently or have a similar or completely different function name, parameters, syntax, and results.


String searching algorithm In computer science, string searching algorithms, sometimes called string matching algorithms, are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger string or text. Let Σ be an alphabet (finite set). Formally, both the pattern and searched text are vectors of elements of Σ. The Σ may be a usual human alphabet (for example, the letters A through Z in the Latin alphabet). Other applications may use binary alphabet (Σ = {0,1}) or DNA alphabet (Σ = {A,C,G,T}) in bioinformatics. In practice, how the string is encoded can affect the feasible string search algorithms. In particular if a variable width encoding is in use then it is slow (time proportional to N) to find the Nth character. This will significantly slow down many of the more advanced search algorithms. A possible solution is to search for the sequence of code units instead, but doing so may produce false matches unless the encoding is specifically designed to avoid it.
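The false-match risk of searching code units can be demonstrated with one character. The sketch below assumes Python's standard `shift_jis` codec and uses the kanji 表 (U+8868), whose Shift-JIS encoding is widely documented as ending in the byte 5C₁₆, the same value as the ASCII backslash.

```python
text = "\u8868"     # the kanji 表, a single character that is not a backslash
needle = b"\x5c"    # the byte value of the ASCII backslash

shift_jis_bytes = text.encode("shift_jis")
utf8_bytes = text.encode("utf-8")

print(needle in shift_jis_bytes)   # True: a false match on the trail byte
print(needle in utf8_bytes)        # False: UTF-8 never reuses ASCII byte values
print("\\" in text)                # False: no backslash at the character level
```

This is why byte-level search is considered safe for UTF-8 but not for encodings such as Shift-JIS.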

Basic classification The various algorithms can be classified by the number of patterns each uses.

Single pattern algorithms

Let m be the length of the pattern and let n be the length of the searchable text.

| Algorithm | Preprocessing time | Matching time¹ |
|---|---|---|
| Naïve string search algorithm | 0 (no preprocessing) | Θ((n−m+1) m) |
| Rabin–Karp string search algorithm | Θ(m) | average Θ(n+m), worst Θ((n−m+1) m) |
| Finite-state automaton based search | Θ(m \|Σ\|) | Θ(n) |
| Knuth–Morris–Pratt algorithm | Θ(m) | Θ(n) |
| Boyer–Moore string search algorithm | Θ(m + \|Σ\|) | Ω(n/m), O(nm) |
| Bitap algorithm (shift-or, shift-and, Baeza–Yates–Gonnet) | Θ(m + \|Σ\|) | O(mn) |

¹ Asymptotic times are expressed using O, Ω, and Θ notation.

The Boyer–Moore string search algorithm has been the standard benchmark for the practical string search literature.

Algorithms using a finite set of patterns
• Aho–Corasick string matching algorithm
• Commentz-Walter algorithm
• Rabin–Karp string search algorithm

Algorithms using an infinite number of patterns Naturally, the patterns can not be enumerated in this case. They are represented usually by a regular grammar or regular expression.


Other classification

Other classification approaches are possible. One of the most common uses preprocessing as the main criterion.

Classes of string searching algorithms

| | Text not preprocessed | Text preprocessed |
|---|---|---|
| Patterns not preprocessed | Elementary algorithms | Index methods |
| Patterns preprocessed | Constructed search engines | Signature methods |

Naïve string search

The simplest and least efficient way to see where one string occurs inside another is to check each place it could be, one by one, to see if it's there. So first we see if there's a copy of the needle starting at the first character of the haystack; if not, we look to see if there's a copy of the needle starting at the second character of the haystack; if not, we look starting at the third character, and so forth. In the normal case, we only have to look at one or two characters for each wrong position to see that it is a wrong position, so in the average case, this takes O(n + m) steps, where n is the length of the haystack and m is the length of the needle; but in the worst case, searching for a string like "aaaab" in a string like "aaaaaaaaab", it takes O(nm) steps.
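The procedure just described can be sketched in Python as follows. The comparison counter is an addition made for this illustration so that the typical and worst cases can be observed; it is not part of the algorithm itself.

```python
def naive_search(haystack, needle):
    """Return (index of the first match or -1, character comparisons made)."""
    n, m = len(haystack), len(needle)
    comparisons = 0
    for start in range(n - m + 1):       # every place the needle could begin
        i = 0
        while i < m:
            comparisons += 1
            if haystack[start + i] != needle[i]:
                break                    # wrong position, try the next one
            i += 1
        if i == m:
            return start, comparisons
    return -1, comparisons

print(naive_search("hello world", "world"))   # (6, 11)
print(naive_search("aaaaaaaaab", "aaaab"))    # (5, 30)
```

In the first call most wrong positions are rejected after one comparison, while in the second every position costs the full needle length.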

Finite state automaton based search

In this approach, we avoid backtracking by constructing a deterministic finite automaton (DFA) that recognizes the stored search string. These are expensive to construct (they are usually created using the powerset construction) but are very quick to use.
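A minimal Python sketch of automaton-based search is given below. It does not use the powerset construction; instead it fills the transition table by brute force from the definition of each state ("the longest prefix of the pattern that is a suffix of what has been read"), which is slow to build but easy to check. The names `build_dfa` and `dfa_search` are hypothetical, and the alphabet is assumed to contain every symbol of the pattern.

```python
def build_dfa(pattern, alphabet):
    """table[state][symbol] is the next state; state j means that the last
    j symbols read equal pattern[:j]."""
    m = len(pattern)
    table = []
    for state in range(m + 1):
        row = {}
        for symbol in alphabet:
            seen = pattern[:state] + symbol
            k = min(m, len(seen))
            # Longest prefix of the pattern that is a suffix of `seen`.
            while k > 0 and not seen.endswith(pattern[:k]):
                k -= 1
            row[symbol] = k
        table.append(row)
    return table

def dfa_search(text, pattern, alphabet):
    """Return the start index of every occurrence of a non-empty pattern."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    table = build_dfa(pattern, alphabet)
    state, matches = 0, []
    for pos, symbol in enumerate(text):
        # One table lookup per text symbol, with no backtracking.
        state = table[state].get(symbol, 0)
        if state == len(pattern):
            matches.append(pos - len(pattern) + 1)
    return matches

text = "ABC ABCDAB ABCDABCDABDE"
print(dfa_search(text, "ABCDABD", set(text)))   # [15]
print(dfa_search("aaaa", "aa", {"a"}))          # [0, 1, 2]
```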

Stubs

Knuth–Morris–Pratt computes a DFA that recognizes inputs with the string to search for as a suffix. Boyer–Moore starts searching from the end of the needle, so it can usually jump ahead a whole needle-length at each step. Baeza–Yates keeps track of whether the previous j characters were a prefix of the search string, and is therefore adaptable to fuzzy string searching. The bitap algorithm is an application of Baeza–Yates' approach.
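The idea of tracking, for every j at once, whether the previous j characters form a prefix of the search string can be sketched with the shift-and variant of bitap. This is an illustrative exact-matching version only (no fuzzy matching), and the function name is an assumption; Python integers are used as bit vectors of unbounded width.

```python
def shift_and_search(text, pattern):
    """Return the start index of every occurrence of a non-empty pattern."""
    m = len(pattern)
    if m == 0:
        raise ValueError("pattern must be non-empty")
    # masks[c] has bit j set when pattern[j] == c.
    masks = {}
    for j, symbol in enumerate(pattern):
        masks[symbol] = masks.get(symbol, 0) | (1 << j)
    accept = 1 << (m - 1)
    state = 0   # bit j set: the last j + 1 text symbols equal pattern[:j + 1]
    matches = []
    for pos, symbol in enumerate(text):
        # Extend every partial match by one symbol and start a new one.
        state = ((state << 1) | 1) & masks.get(symbol, 0)
        if state & accept:
            matches.append(pos - m + 1)
    return matches

print(shift_and_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD"))   # [15]
print(shift_and_search("aaaa", "aa"))                           # [0, 1, 2]
```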

Index methods

Faster search algorithms are based on preprocessing of the text. After building a substring index, for example a suffix tree or suffix array, the occurrences of a pattern can be found quickly. As an example, a suffix tree can be built in Θ(n) time, and all occurrences of a pattern can be found in O(m) time under the assumption that the alphabet has a constant size and all inner nodes in the suffix tree know which leaves are underneath them. The latter can be accomplished by running a DFS algorithm from the root of the suffix tree.
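To illustrate the index idea without the machinery of a suffix tree, the sketch below uses a suffix array built naively by sorting all suffixes. That construction is far slower than the linear-time methods mentioned above and is chosen only for brevity; the function names are hypothetical. Once the index exists, a pattern is located by binary search.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order.
    Naive construction: it sorts the suffixes directly."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of pattern using binary search."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + m]

    # First suffix whose m-symbol prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # First suffix whose m-symbol prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

text = "banana"
index = build_suffix_array(text)
print(index)                                  # [5, 3, 1, 0, 4, 2]
print(find_occurrences(text, index, "ana"))   # [1, 3]
```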


Other variants Some search methods, for instance trigram search, are intended to find a "closeness" score between the search string and the text rather than a "match/non-match". These are sometimes called "fuzzy" searches.
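One possible closeness score based on trigrams is sketched below. The choice of the Jaccard coefficient over sets of character trigrams is an assumption made for this illustration; practical trigram search systems differ in padding, weighting and indexing.

```python
def trigrams(s):
    """Set of all substrings of length 3."""
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_similarity(a, b):
    """Jaccard coefficient of the two trigram sets, between 0.0 and 1.0."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta and not tb:
        return 1.0 if a == b else 0.0
    return len(ta & tb) / len(ta | tb)

print(trigram_similarity("abcde", "abcde"))   # 1.0
print(trigram_similarity("abcde", "abcxx"))   # 0.2
```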

Academic conferences on text searching
• Combinatorial pattern matching (CPM), a conference on combinatorial algorithms for strings, sequences, and trees.
• String Processing and Information Retrieval (SPIRE), an annual symposium on string processing and information retrieval.
• Prague Stringology Conference (PSC), an annual conference on algorithms on strings and sequences.
• Competition on Applied Text Searching (CATS), an annual series of evaluations of text searching algorithms.


External links
• Huge (maintained) list of pattern matching links (http://www.cs.ucr.edu/~stelo/pattern.html). Last updated: 12/27/2008 20:18:38
• StringSearch: high-performance pattern matching algorithms in Java (http://johannburkard.de/software/stringsearch/). Implementations of many string matching algorithms in Java (BNDM, Boyer-Moore-Horspool, Boyer-Moore-Horspool-Raita, Shift-Or)
• Exact String Matching Algorithms (http://www-igm.univ-mlv.fr/~lecroq/string/index.html). Animation in Java, detailed description and C implementation of many algorithms.
• Boyer-Moore-Raita-Thomas (http://www.concentric.net/~Ttwang/tech/stringscan.htm)
• (PDF) Improved Single and Multiple Approximate String Matching (http://www.cs.ucr.edu/~stelo/cpm/cpm04/35_Navarro.pdf)
• Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2647288/)


Knuth–Morris–Pratt algorithm In computer science, the Knuth–Morris–Pratt string searching algorithm (or KMP algorithm) searches for occurrences of a "word" W within a main "text string" S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters. The algorithm was conceived in 1974 by Donald Knuth and Vaughan Pratt, and independently by James H. Morris. The three published it jointly in 1977.

Background

A string matching algorithm wants to find the starting index m in string S[] that matches the search word W[]. The most straightforward algorithm is to look for a character match at successive values of the index m, the position in the string being searched, i.e. S[m]. If the index m reaches the end of the string then there is no match, in which case the search is said to "fail". At each position m the algorithm first checks for equality of the first character in the searched-for word, i.e. S[m] =? W[0]. If a match is found, the algorithm tests the other characters in the searched-for word by checking successive values of the word position index, i. The algorithm retrieves the character W[i] in the searched-for word and checks for equality of the expression S[m+i] =? W[i]. If all successive characters match in W at position m then a match is found at that position in the search string.

Usually, the trial check will quickly reject the trial match. If the strings are uniformly distributed random letters, then the chance that characters match is 1 in 26. In most cases, the trial check will reject the match at the initial letter. The chance that the first two letters will match is 1 in 26^2 (1 in 676). So if the characters are random, then the expected complexity of searching string S[] of length k is on the order of k comparisons or O(k). The expected performance is very good. If S[] is 1 billion characters and W[] is 1000 characters, then the string search should complete after about one billion character comparisons.

That expected performance is not guaranteed. If the strings are not random, then checking a trial m may take many character comparisons. The worst case is if the two strings match in all but the last letter. Imagine that the string S[] consists of 1 billion characters that are all A, and that the word W[] is 999 A characters terminating in a final B character. The simple string matching algorithm will now examine 1000 characters at each trial position before rejecting the match and advancing the trial position. The simple string search example would now take about 1000 character comparisons times 1 billion positions for 1 trillion character comparisons. If the length of W[] is n, then the worst case performance is O(k⋅n).

The KMP algorithm does not have the horrendous worst case performance of the straightforward algorithm. KMP spends a little time precomputing a table (on the order of the size of W[], O(n)), and then it uses that table to do an efficient search of the string in O(k). The difference is that KMP makes use of previous match information that the straightforward algorithm does not. In the example above, when KMP sees a trial match fail on the 1000th character (i = 999) because S[m+999] ≠ W[999], it will increment m by 1, but it will know that the first 998 characters at the new position already match. KMP matched 999 A characters before discovering a mismatch at the 1000th character (position 999). Advancing the trial match position m by one throws away the first A, so KMP knows there are 998 A characters that match W[] and does not retest them; that is, KMP sets i to 998. KMP maintains its knowledge in the precomputed table and two state variables. When KMP discovers a mismatch, the table determines how much KMP will increase (variable m) and where it will resume testing (variable i).
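The table and the two state variables m and i can be sketched in Python as follows. This is one common formulation, assumed here for illustration: the table stores, for each position of W, the length of the longest proper prefix of W that is also a suffix of the part matched so far. Other presentations index the table differently (for example with a −1 sentinel in the first entry), so the numbers in `kmp_table` need not coincide with tables shown elsewhere.

```python
def kmp_table(word):
    """table[i] is the length of the longest proper prefix of word[:i + 1]
    that is also a suffix of word[:i + 1]."""
    table = [0] * len(word)
    k = 0
    for i in range(1, len(word)):
        while k > 0 and word[i] != word[k]:
            k = table[k - 1]
        if word[i] == word[k]:
            k += 1
        table[i] = k
    return table

def kmp_search(text, word):
    """Return the index of the first occurrence of word in text, or -1."""
    if not word:
        return 0
    table = kmp_table(word)
    m = 0   # start of the current trial match in text
    i = 0   # index of the character of word under consideration
    while m + i < len(text):
        if text[m + i] == word[i]:
            i += 1
            if i == len(word):
                return m
        elif i > 0:
            # Keep the table[i - 1] characters already known to match.
            m += i - table[i - 1]
            i = table[i - 1]
        else:
            m += 1
    return -1

print(kmp_table("ABCDABD"))                                # [0, 0, 0, 0, 1, 2, 0]
print(kmp_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD"))    # 15
print(kmp_search("A" * 50, "A" * 9 + "B"))                 # -1
```

In the last call each text character is examined at most twice, in contrast to the repeated rescanning of the straightforward algorithm.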


KMP algorithm

Worked example of the search algorithm

To illustrate the algorithm's details, we work through a (relatively artificial) run of the algorithm, where W = "ABCDABD" and S = "ABC ABCDAB ABCDABCDABDE". At any given time, the algorithm is in a state determined by two integers:
• m, which denotes the position within S which is the beginning of a prospective match for W
• i, the index in W denoting the character currently under consideration.

In each step we compare S[m+i] with W[i] and advance if they are equal. This is depicted, at the start of the run, like

                 1         2
    m: 01234567890123456789012
    S: ABC ABCDAB ABCDABCDABDE
    W: ABCDABD
    i: 0123456

We proceed by comparing successive characters of W to "parallel" characters of S, moving from one to the next if they match. However, in the fourth step, we find that S[3] is a space and W[3] = 'D', a mismatch. Rather than beginning to search again at S[1], we note that no 'A' occurs between positions 0 and 3 in S except at 0; hence, having checked all those characters previously, we know there is no chance of finding the beginning of a match if we check them again. Therefore we move on to the next character, setting m = 4 and i = 0.

                 1         2
    m: 01234567890123456789012
    S: ABC ABCDAB ABCDABCDABDE
    W:     ABCDABD
    i:     0123456

We quickly obtain a nearly complete match "ABCDAB" when, at W[6] (S[10]), we again have a discrepancy. However, just prior to the end of the current partial match, we passed an "AB" which could be the beginning of a new match, so we must take this into consideration. As we already know that these characters match the two characters prior to the current position, we need not check them again; we simply reset m = 8, i = 2 and continue matching the current character. Thus, not only do we omit previously matched characters of S, but also previously matched characters of W.

                 1         2
    m: 01234567890123456789012
    S: ABC ABCDAB ABCDABCDABDE
    W:         ABCDABD
    i:         0123456

This search fails immediately, however, as the pattern still does not contain a space, so as in the first trial, we return to the beginning of W and begin searching at the next character of S: m = 11, reset i = 0.

                 1         2
    m: 01234567890123456789012
    S: ABC ABCDAB ABCDABCDABDE
    W:            ABCDABD
    i:            0123456


Once again we immediately hit upon a match "ABCDAB" but the next character, 'C', does not match the final character 'D' of the word W. Reasoning as before, we set m = 15, to start at the two-character string "AB" leading up to the current position, set i = 2, and continue matching from the current position.

                 1         2
    m: 01234567890123456789012
    S: ABC ABCDAB ABCDABCDABDE
    W:                ABCDABD
    i:                0123456

This time we are able to complete the match, whose first character is S[15].

Description of pseudocode for the search algorithm

The above example contains all the elements of the algorithm. For the moment, we assume the existence of a "partial match" table T, described below, which indicates where we need to look for the start of a new match in the event that the current one ends in a mismatch. The entries of T are constructed so that if we have a match starting at S[m] that fails when comparing S[m + i] to W[i], then the next possible match will start at index m + i - T[i] in S (that is, T[i] is the amount of "backtracking" we need to do after a mismatch). This has two implications: first, T[0] = -1, which indicates that if W[0] is a mismatch, we cannot backtrack and must simply check the next character; and second, although the next possible match will begin at index m + i - T[i], as in the example above, we need not actually check any of the T[i] characters after that, so that we continue searching from W[T[i]]. The following is a sample pseudocode implementation of the KMP search algorithm.

    algorithm kmp_search:
        input:
            an array of characters, S (the text to be searched)
            an array of characters, W (the word sought)
        output:
            an integer (the zero-based position in S at which W is found)

        define variables:
            an integer, m ← 0 (the beginning of the current match in S)
            an integer, i ← 0 (the position of the current character in W)
            an array of integers, T (the table, computed elsewhere)

        while m + i < length(S) do
            if W[i] = S[m + i] then
                if i = length(W) - 1 then
                    return m
                let i ← i + 1
            else
                let m ← m + i - T[i]
                if T[i] > -1 then
                    let i ← T[i]
                else
                    let i ← 0

        (if we reach here, we have searched all of S unsuccessfully)
        return the length of S
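The search pseudocode can be transcribed into Python almost line for line. This is a minimal sketch rather than a reference implementation: the table T is passed in as an argument (it is "computed elsewhere", as in the pseudocode), and the values used in the usage lines are the ones listed for W = "ABCDABD" in the worked table example.

```python
def kmp_search(S, W, T):
    """Return the zero-based position of the first occurrence of W in S.

    T is the partial-match table for W with the T[0] = -1 convention.
    As in the pseudocode, len(S) is returned when W does not occur.
    """
    m = 0  # beginning of the current match in S
    i = 0  # position of the current character in W
    while m + i < len(S):
        if W[i] == S[m + i]:
            if i == len(W) - 1:
                return m
            i += 1
        else:
            m = m + i - T[i]               # jump to the next possible match
            i = T[i] if T[i] > -1 else 0   # skip characters known to match
    return len(S)


T = [-1, 0, 0, 0, 0, 1, 2]   # table for "ABCDABD"
print(kmp_search("ABC ABCDAB ABCDABCDABDE", "ABCDABD", T))  # 15
```

Stepping through this on the example text passes through the states m = 4, 8, 11 and 15 discussed in the worked example; the pseudocode also visits m = 3 and m = 10 briefly, where the first character fails at once.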


Efficiency of the search algorithm

Assuming the prior existence of the table T, the search portion of the Knuth–Morris–Pratt algorithm has complexity O(k), where k is the length of S and the O is big-O notation. Except for the fixed overhead incurred in entering and exiting the function, all the computations are performed in the while loop. To bound the number of iterations of this loop, observe that T is constructed so that if a match which had begun at S[m] fails while comparing S[m + i] to W[i], then the next possible match must begin at S[m + (i - T[i])]. In particular, the next possible match must occur at a higher index than m, so that T[i] < i.

This fact implies that the loop can execute at most 2k times. In each iteration, it executes one of the two branches in the loop. The first branch invariably increases i and does not change m, so that the index m + i of the currently scrutinized character of S is increased. The second branch adds i - T[i] to m, and as we have seen, this is always a positive number. Thus the location m of the beginning of the current potential match is increased. Now, the loop ends if m + i = k; therefore each branch of the loop can be reached at most k times, since they respectively increase either m + i or m, and m ≤ m + i: if m = k, then certainly m + i ≥ k, so that since it increases by unit increments at most, we must have had m + i = k at some point in the past, and therefore either way we would be done. Thus the loop executes at most 2k times, showing that the time complexity of the search algorithm is O(k).

Here is another way to think about the runtime. Say the trial match begins at position p of S and the word index is i; if W occurs as a substring of S at p, then W[0..n-1] = S[p..p+n-1], where n is the length of W. Upon success, that is, when the word and the text match at the current positions (W[i] = S[p+i]), we increase i by 1. Upon failure, that is, when they do not match (W[i] ≠ S[p+i]), the text pointer is kept still while the word pointer is rolled back by a certain amount (i = T[i], where T is the jump table), and we attempt to match W[T[i]] with the same text character. The amount of roll-back is bounded by i; that is to say, for any failure, we can only roll back as much as we have progressed up to the failure. It is then clear that the number of comparisons is at most 2k.

"Partial match" table (also known as "failure function")

The goal of the table is to allow the algorithm not to match any character of S more than once. The key observation about the nature of a linear search that allows this to happen is that in having checked some segment of the main string against an initial segment of the pattern, we know exactly at which places a new potential match which could continue to the current position could begin prior to the current position. In other words, we "pre-search" the pattern itself and compile a list of all possible fallback positions that bypass a maximum of hopeless characters while not sacrificing any potential matches in doing so.

We want to be able to look up, for each position in W, the length of the longest possible initial segment of W leading up to (but not including) that position, other than the full segment starting at W[0] that just failed to match; this is how far we have to backtrack in finding the next match. Hence T[i] is exactly the length of the longest possible proper initial segment of W which is also a suffix of the substring ending at W[i - 1]. We use the convention that the empty string has length 0. Since a mismatch at the very start of the pattern is a special case (there is no possibility of backtracking), we set T[0] = -1, as discussed above.


Worked example of the table-building algorithm

We consider the example of W = "ABCDABD" first. We will see that it follows much the same pattern as the main search, and is efficient for similar reasons. We set T[0] = -1. To find T[1], we must discover a proper suffix of "A" which is also a prefix of W. But there are no proper suffixes of "A", so we set T[1] = 0. Likewise, T[2] = 0.

Continuing to T[3], we note that there is a shortcut to checking all suffixes: let us say that we discovered a proper suffix which is a proper prefix and ending at W[2] with length 2 (the maximum possible); then its first character is a proper prefix of a proper prefix of W, hence a proper prefix itself, and it ends at W[1], which we already determined cannot occur in case T[2]. Hence at each stage, the shortcut rule is that one needs to consider checking suffixes of a given size m+1 only if a valid suffix of size m was found at the previous stage (e.g. T[x] = m). Therefore we need not even concern ourselves with substrings having length 2, and as in the previous case the sole one with length 1 fails, so T[3] = 0.

We pass to the subsequent W[4], 'A'. The same logic shows that the longest substring we need consider has length 1, and although in this case 'A' does work, recall that we are looking for segments ending before the current character; hence T[4] = 0 as well.

Considering now the next character, W[5], which is 'B', we exercise the following logic: if we were to find a subpattern beginning before the previous character W[4], yet continuing to the current one W[5], then in particular it would itself have a proper initial segment ending at W[4] yet beginning before it, which contradicts the fact that we already found that 'A' itself is the earliest occurrence of a proper segment ending at W[4]. Therefore we need not look before W[4] to find a terminal string for W[5]. Therefore T[5] = 1.

Finally, we see that the next character in the ongoing segment starting at W[4] = 'A' would be 'B', and indeed this is also W[5]. Furthermore, the same argument as above shows that we need not look before W[4] to find a segment for W[6], so that this is it, and we take T[6] = 2. Therefore we compile the following table:

    i     0  1  2  3  4  5  6
    W[i]  A  B  C  D  A  B  D
    T[i] -1  0  0  0  0  1  2

Another example, more interesting and complex, is W = "PARTICIPATE IN PARACHUTE" (an underscore marks a space in the table):

    i     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
    W[i]  P  A  R  T  I  C  I  P  A  T  E  _  I  N  _  P  A  R  A  C  H  U  T  E
    T[i] -1  0  0  0  0  0  0  0  1  2  0  0  0  0  0  0  1  2  3  0  0  0  0  0

Description of pseudocode for the table-building algorithm

The example above illustrates the general technique for assembling the table with a minimum of fuss. The principle is that of the overall search: most of the work was already done in getting to the current position, so very little needs to be done in leaving it. The only minor complication is that the logic which is correct late in the string erroneously gives non-proper substrings at the beginning. This necessitates some initialization code.

    algorithm kmp_table:
        input:
            an array of characters, W (the word to be analyzed)
            an array of integers, T (the table to be filled)
        output:
            nothing (but during operation, it populates the table)

        define variables:
            an integer, pos ← 2 (the current position we are computing in T)
            an integer, cnd ← 0 (the zero-based index in W of the next character of the current candidate substring)

        (the first few values are fixed but different from what the algorithm might suggest)
        let T[0] ← -1, T[1] ← 0

        while pos < length(W) do
            (first case: the substring continues)
            if W[pos - 1] = W[cnd] then
                let cnd ← cnd + 1, T[pos] ← cnd, pos ← pos + 1

            (second case: it doesn't, but we can fall back)
            else if cnd > 0 then
                let cnd ← T[cnd]

            (third case: we have run out of candidates. Note cnd = 0)
            else
                let T[pos] ← 0, pos ← pos + 1
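A Python rendering of the table-building pseudocode might look as follows. It is a sketch under two assumptions not in the original: the table is returned as a new list rather than filled in place, and words shorter than two characters are handled by simply skipping the fixed T[1] entry.

```python
def kmp_table(W):
    """Build the partial-match table for W using the T[0] = -1 convention."""
    T = [0] * len(W)
    if W:
        T[0] = -1          # mismatch at the first character: no backtracking
    pos = 2                # current position being computed in T
    cnd = 0                # index in W of the next character of the candidate
    while pos < len(W):
        if W[pos - 1] == W[cnd]:
            # first case: the candidate substring continues
            cnd += 1
            T[pos] = cnd
            pos += 1
        elif cnd > 0:
            # second case: it does not, but we can fall back
            cnd = T[cnd]
        else:
            # third case: no candidates left (cnd == 0)
            T[pos] = 0
            pos += 1
    return T


print(kmp_table("ABCDABD"))  # [-1, 0, 0, 0, 0, 1, 2]
```

Running it on "PARTICIPATE IN PARACHUTE" reproduces the second table above, including the run 1, 2, 3 under the second "PAR".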

Efficiency of the table-building algorithm

The complexity of the table algorithm is O(n), where n is the length of W. Since, except for some initialization, all the work is done in the while loop, it is sufficient to show that this loop executes in O(n) time, which will be done by simultaneously examining the quantities pos and pos - cnd. In the first branch, pos - cnd is preserved, as both pos and cnd are incremented simultaneously, but naturally, pos is increased. In the second branch, cnd is replaced by T[cnd], which we saw above is always strictly less than cnd, thus increasing pos - cnd. In the third branch, pos is incremented and cnd is not, so both pos and pos - cnd increase. Since pos ≥ pos - cnd, this means that at each stage either pos or a lower bound for pos increases; therefore, since the algorithm terminates once pos = n and both quantities begin at 2, it must terminate after at most 2n iterations of the loop. Therefore the complexity of the table algorithm is O(n).

Efficiency of the KMP algorithm

Since the two portions of the algorithm have, respectively, complexities of O(k) for the search (k being the length of S) and O(n) for the table (n being the length of W), the complexity of the overall algorithm is O(n + k). These complexities are the same, no matter how many repetitive patterns are in W or S.
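The linear bound on the search can be observed by counting loop iterations on a highly repetitive input. The sketch below is illustrative only: `kmp_search_counting` is a made-up name, the input is a scaled-down run of A characters, and the table is written out by hand for this particular word (for a run of A characters ending in B, T[i] = i - 1 for every i ≥ 1).

```python
def kmp_search_counting(S, W, T):
    """KMP search loop that also counts iterations of the while loop."""
    m = i = iterations = 0
    while m + i < len(S):
        iterations += 1
        if W[i] == S[m + i]:
            if i == len(W) - 1:
                return m, iterations
            i += 1
        else:
            m = m + i - T[i]
            i = T[i] if T[i] > -1 else 0
    return len(S), iterations


W = "A" * 99 + "B"
T = [-1] + list(range(99))   # hand-written table for this W
S = "A" * 2000               # W never occurs in S
result, iterations = kmp_search_counting(S, W, T)
print(result, iterations <= 2 * len(S))  # 2000 True
```

The same input costs the straightforward search on the order of len(S) times len(W) comparisons, whereas here the iteration count stays below twice the text length.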

Variants

A real-time version of KMP can be implemented using a separate failure function table for each character in the alphabet. If a mismatch occurs on character x in the text, the failure function table for character x is consulted for the index i in the pattern at which the mismatch took place. This will return the length of the longest substring ending at i matching a prefix of the pattern, with the added condition that the character after the prefix is x. With this restriction, character x in the text need not be checked again in the next phase, and so only a constant number of operations are executed between the processing of each index of the text. This satisfies the real-time computing restriction.

The Booth algorithm uses a modified version of the KMP preprocessing function to find the lexicographically minimal string rotation. The failure function is progressively calculated as the string is rotated.

References
• Knuth, Donald; Morris, James H., Jr.; Pratt, Vaughan (1977). "Fast pattern matching in strings" (http://citeseer.ist.psu.edu/context/23820/0). SIAM Journal on Computing 6 (2): 323–350. doi:10.1137/0206024. Zbl 0372.68005.
• Cormen, Thomas; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). "Section 32.4: The Knuth-Morris-Pratt algorithm". Introduction to Algorithms (Second ed.). MIT Press and McGraw-Hill. pp. 923–931. ISBN 0-262-03293-7. Zbl 1047.68161.
• Crochemore, Maxime; Rytter, Wojciech (2003). Jewels of stringology. Text algorithms. River Edge, NJ: World Scientific. pp. 20–25. ISBN 981-02-4897-0. Zbl 1078.68151.
• Szpankowski, Wojciech (2001). Average case analysis of algorithms on sequences. Wiley-Interscience Series in Discrete Mathematics and Optimization. With a foreword by Philippe Flajolet. Chichester: Wiley. pp. 15–17, 136–141. ISBN 0-471-24063-X. Zbl 0968.68205.

External links
• String Searching Applet animation (http://www.cs.pitt.edu/~kirk/cs1501/animations/String.html)
• An explanation of the algorithm (http://www.ics.uci.edu/~eppstein/161/960227.html) and sample C++ code (http://www.ics.uci.edu/~eppstein/161/kmp/) by David Eppstein
• Knuth-Morris-Pratt algorithm (http://www-igm.univ-mlv.fr/~lecroq/string/node8.html) description and C code by Christian Charras and Thierry Lecroq
• Explanation of the algorithm from scratch (http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm) by FH Flensburg
• Breaking down steps of running KMP (http://oak.cs.ucla.edu/cs144/examples/KMPSearch.html) by Chu-Cheng Hsieh
• NPTELHRD YouTube lecture video (http://www.youtube.com/watch?v=Zj_er99KMb8)


Boyer–Moore string search algorithm

In computer science, the Boyer–Moore string search algorithm is an efficient string searching algorithm that is the standard benchmark for practical string search literature. It was developed by Robert S. Boyer and J Strother Moore in 1977. The algorithm preprocesses the string being searched for (the pattern), but not the string being searched in (the text). It is thus well suited for applications in which the pattern is much shorter than the text or persists across multiple searches. The Boyer–Moore algorithm uses information gathered during the preprocessing step to skip sections of the text, resulting in a lower constant factor than many other string algorithms. In general, the algorithm runs faster as the pattern length increases.

Definitions

    A N P A N M A N -
    P A N - - - - - -
    - P A N - - - - -
    - - P A N - - - -
    - - - P A N - - -
    - - - - P A N - -
    - - - - - P A N -

Alignments of pattern PAN to text ANPANMAN, from k=3 to k=8. A match occurs at k=5.

• S[i] refers to the character at index i of string S, counting from 1.
• S[i..j] refers to the substring of string S starting at index i and ending at j, inclusive.
• A prefix of S is a substring S[1..i] for some i in range [1, n], where n is the length of S.
• A suffix of S is a substring S[i..n] for some i in range [1, n], where n is the length of S.
• The string to be searched for is called the pattern and is referred to with symbol P.
• The string being searched in is called the text and is referred to with symbol T.
• The length of P is n.
• The length of T is m.
• An alignment of P to T is an index k in T such that the last character of P is aligned with index k of T.
• A match or occurrence of P occurs at an alignment if P is equivalent to T[(k-n+1)..k].
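The definitions above can be exercised with a brute-force pass over every alignment. This is only an illustrative sketch: `alignments` is a made-up helper, and the 1-based alignment index k from the definitions is translated into Python's 0-based slices.

```python
def alignments(P, T):
    """Yield (k, is_match) for each alignment k of P to T, with k counted from 1."""
    n, m = len(P), len(T)
    for k in range(n, m + 1):
        # T[(k-n+1)..k] in 1-based inclusive notation is the slice T[k-n:k]
        yield k, T[k - n:k] == P


results = list(alignments("PAN", "ANPANMAN"))
print([k for k, _ in results])          # [3, 4, 5, 6, 7, 8]
print([k for k, ok in results if ok])   # [5]
```

There are m - n + 1 = 6 alignments for this pair, and the only match is at k = 5, as in the figure.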

Description

The Boyer-Moore algorithm searches for occurrences of P in T by performing explicit character comparisons at different alignments. Instead of a brute-force search of all alignments (of which there are m - n + 1), Boyer-Moore uses information gained by preprocessing P to skip as many alignments as possible.

The algorithm begins at alignment k = n, so the start of P is aligned with the start of T. Characters in P and T are then compared starting at index n in P and k in T, moving backward: the strings are matched from the end of P to the start of P. The comparisons continue until either the beginning of P is reached (which means there is a match) or a mismatch occurs, upon which the alignment is shifted to the right according to the maximum value permitted by a number of rules. The comparisons are performed again at the new alignment, and the process repeats until the alignment is shifted past the end of T, which means no further matches will be found.

The shift rules are implemented as constant-time table lookups, using tables generated during the preprocessing of P.


Shift Rules

The Bad Character Rule

Description

    - - - - X - - K - - -
    A N P A N M A N A M -
    - N N A A M A N - - -
    - - - N N A A M A N -

Demonstration of bad character rule with pattern NNAAMAN.

The bad-character rule considers the character in T at which the comparison process failed (assuming such a failure occurred). The next occurrence of that character to the left in P is found, and a shift which brings that occurrence in line with the mismatched occurrence in T is proposed. If the mismatched character does not occur to the left in P, a shift is proposed that moves the entirety of P past the point of mismatch.

Preprocessing

Methods vary on the exact form the table for the bad character rule should take, but a simple constant-time lookup solution is as follows: create a 2D table which is indexed first by the index of the character c in the alphabet and second by the index i in the pattern. This lookup will return the occurrence of c in P with the next-highest index j < i, or -1 if there is no such occurrence. The proposed shift will then be i - j, with O(1) lookup time and O(kn) space, assuming a finite alphabet of size k.
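The 2D lookup table described above can be sketched as follows. This is an illustration under stated assumptions, not a full Boyer-Moore implementation: indices are 0-based (unlike the 1-based definitions), the table is a dictionary keyed by character instead of an array indexed by alphabet position, and the search function applies only the bad character rule, shifting by one after a match. Both function names are made up for this example.

```python
def bad_character_table(P, alphabet):
    """table[c][i] is the largest index j < i with P[j] == c, or -1 if none."""
    table = {}
    for c in alphabet:
        row, last = [], -1
        for i, ch in enumerate(P):
            row.append(last)       # last occurrence of c strictly left of i
            if ch == c:
                last = i
        table[c] = row
    return table


def search_bad_character_only(T, P, alphabet):
    """Return all 0-based start positions of P in T, using only this rule."""
    n, m = len(P), len(T)
    table = bad_character_table(P, alphabet)
    matches, start = [], 0
    while start + n <= m:
        i = n - 1
        while i >= 0 and P[i] == T[start + i]:   # compare right to left
            i -= 1
        if i < 0:
            matches.append(start)
            start += 1
        else:
            row = table.get(T[start + i])
            j = row[i] if row is not None else -1
            start += i - j                        # proposed shift i - j >= 1
    return matches


table = bad_character_table("NNAAMAN", "AMNP")
print(table["N"][3])                                      # 1
print(search_bad_character_only("ANPANMAN", "PAN", "AMNP"))  # [2]
```

In the demonstration above, the mismatch occurs at pattern index 3 (0-based) against the text character N; the lookup returns j = 1, so the proposed shift is 3 - 1 = 2, matching the figure.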

The Good Suffix Rule

Description

    - - - - X - - K - - - - -
    M A N P A N A M A N A P -
    A N A M P N A M - - - - -
    - - - - A N A M P N A M -

Demonstration of good suffix rule with pattern ANAMPNAM.

The good suffix rule is markedly more complex in both concept and implementation than the bad character rule. It is the reason comparisons begin at the end of the pattern rather than the start, and is formally stated thus:

Suppose for a given alignment of P and T, a substring t of T matches a suffix of P, but a mismatch occurs at the next comparison to the left. Then find, if it exists, the right-most copy t' of t in P such that t' is not a suffix of P and the character to the left of t' in P differs from the character to the left of t in P. Shift P to the right so that substring t' in P aligns with substring t in T. If t' does not exist, then shift the left end of P past the left end of t in T by the least amount so that a prefix of the shifted pattern matches a suffix of t in T. If no such shift is possible, then shift P by n places to the right. If an occurrence of P is found, then shift P by the least amount so that a proper prefix of the shifted P matches a suffix of the occurrence of P in T. If no such shift is possible, then shift P by n places, that is, shift P past t.

Preprocessing

The good suffix rule requires two tables: one for use in the general case, and another for use when either the general case returns no meaningful result or a match occurs. These tables will be designated L and H respectively. Their definitions are as follows:

For each i, L[i] is the largest position less than n such that string P[i..n] matches a suffix of P[1..L[i]] and such that the character preceding that suffix is not equal to P[i-1]. L[i] is defined to be zero if there is no position satisfying the condition.

Let H[i] denote the length of the largest suffix of P[i..n] that is also a prefix of P, if one exists. If none exists, let H[i] be zero.

Both of these tables are constructible in O(n) time and use O(n) space. The alignment shift for index i in P is given by n - L[i] or n - H[i]. H should only be used if L[i] is zero or a match has been found.
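For small patterns, the two tables can be computed directly from their definitions. The sketch below is deliberately brute force and is not the O(n) construction mentioned above; it is meant only for checking examples by hand. The function name is made up, the returned lists are 1-based with index 0 unused, and a copy that starts at the very beginning of P (so that no preceding character exists) is assumed to satisfy the "preceding character differs" condition.

```python
def good_suffix_tables(P):
    """Return (L, H) as 1-based lists (index 0 unused), straight from the definitions."""
    n = len(P)
    L = [0] * (n + 1)
    H = [0] * (n + 1)
    for i in range(2, n + 1):
        suffix = P[i - 1:]                 # P[i..n] in 1-based notation
        length = len(suffix)
        # candidate end positions j < n for a copy of the suffix, largest first
        for j in range(n - 1, length - 1, -1):
            start = j - length             # 0-based start of the copy
            differs = start == 0 or P[start - 1] != P[i - 2]
            if P[start:j] == suffix and differs:
                L[i] = j
                break
    for i in range(1, n + 1):
        # longest suffix of P[i..n] that is also a prefix of P
        for length in range(n - i + 1, 0, -1):
            if P[n - length:] == P[:length]:
                H[i] = length
                break
    return L, H


L, H = good_suffix_tables("ANAMPNAM")
print(L[6], len("ANAMPNAM") - L[6])   # 4 4
```

For the pattern in the demonstration, the matched suffix is P[6..8] = "NAM"; L[6] = 4 picks out the copy ending at position 4, and the shift n - L[6] = 4 agrees with the figure.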

The Galil Rule

A simple but important optimization of Boyer-Moore was put forth by Galil in 1979. As opposed to shifting, the Galil rule deals with speeding up the actual comparisons done at each alignment by skipping sections that are known to match. Suppose that at an alignment k1, P is compared with T down to character c of T. Then if P is shifted to k2 such that its left end is between c and k1, in the next comparison phase a prefix of P must match the substring T[(k2 - n)..k1]. Thus if the comparisons get down to position k1 of T, an occurrence of P can be recorded without explicitly comparing past k1. In addition to increasing the efficiency of Boyer-Moore, the Galil rule is required for proving linear-time execution in the worst case.

Performance

The Boyer-Moore algorithm as presented in the original paper has worst-case running time of O(n+m) only if the pattern does not appear in the text. This was first proved by Knuth, Morris, and Pratt in 1977, followed by Guibas and Odlyzko in 1980 with an upper bound of 5m comparisons in the worst case. Richard Cole gave a proof with an upper bound of 3m comparisons in the worst case in 1991. When the pattern does occur in the text, running time of the original algorithm is O(nm) in the worst case. This is easy to see when both pattern and text consist solely of the same repeated character. However, inclusion of the Galil rule results in linear runtime across all cases.

Implementations

Various implementations exist in different programming languages. In C++, Boost provides the generic Boyer–Moore search implementation under the Algorithm library.

Variants

The Boyer-Moore-Horspool algorithm is a simplification of the Boyer-Moore algorithm using only the bad character rule.

The Apostolico-Giancarlo algorithm speeds up the process of checking whether a match has occurred at the given alignment by skipping explicit character comparisons. This uses information gleaned during the pre-processing of the pattern in conjunction with suffix match lengths recorded at each match attempt. Storing suffix match lengths requires an additional table equal in size to the text being searched.
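As an illustration of the Horspool simplification, the sketch below uses one common formulation, which the text above does not spell out: the shift is looked up from the text character aligned with the last pattern character, using a one-dimensional table. Indices are 0-based, the function name is made up, and returning -1 for "not found" is a choice made for this example.

```python
def horspool_search(T, P):
    """Return the 0-based start of the first occurrence of P in T, or -1."""
    n, m = len(P), len(T)
    if n == 0:
        return 0
    # distance from the rightmost occurrence of each character (excluding
    # the last pattern position) to the end of the pattern
    shift = {c: n - 1 - i for i, c in enumerate(P[:-1])}
    k = n - 1                      # text index aligned with the last character of P
    while k < m:
        i = 0
        while i < n and T[k - i] == P[n - 1 - i]:   # compare right to left
            i += 1
        if i == n:
            return k - n + 1
        k += shift.get(T[k], n)    # characters absent from P allow a full shift
    return -1


print(horspool_search("ANPANMAN", "PAN"))   # 2
```

The single table takes space proportional to the alphabet actually used in P, at the cost of giving up the good suffix rule.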


References
• Hume and Sunday (1991). "Fast String Searching". Software: Practice and Experience 21 (11): 1221–1248 (November 1991).
• http://www.boost.org/doc/libs/1_53_0/libs/algorithm/doc/html/algorithm/Searching.html#the_boost_algorithm_library.Searching.BoyerMoore

External links
• Original paper on the Boyer-Moore algorithm (http://www.cs.utexas.edu/~moore/publications/fstrpos.pdf)
• An example of the Boyer-Moore algorithm (http://www.cs.utexas.edu/users/moore/best-ideas/string-searching/fstrpos-example.html) from the homepage of J Strother Moore, co-inventor of the algorithm
• Richard Cole's 1991 paper proving runtime linearity (http://www.cs.nyu.edu/cs/faculty/cole/papers/CHPZ95.ps)


Article Sources and Contributors
