Algorithms and Data Structures

Algorithms and Data Structures Part 5: String Matching (Wikipedia Book 2014)

By Wikipedians

Editors: Reiner Creutzburg, Jenny Knackmuß

Contents

1 String Matching
   1.1 String (computer science)
      1.1.1 Formal theory
      1.1.2 String datatypes
      1.1.3 Text file strings
      1.1.4 Non-text strings
      1.1.5 String processing algorithms
      1.1.6 Character string-oriented languages and utilities
      1.1.7 Character string functions
      1.1.8 String buffers
      1.1.9 See also
      1.1.10 References
   1.2 String searching algorithm
      1.2.1 Basic classification
      1.2.2 Other classification
      1.2.3 See also
      1.2.4 Academic conferences on text searching
      1.2.5 References
      1.2.6 External links
   1.3 Knuth–Morris–Pratt algorithm
      1.3.1 Background
      1.3.2 KMP algorithm
      1.3.3 “Partial match” table (also known as “failure function”)
      1.3.4 Efficiency of the KMP algorithm
      1.3.5 Variants
      1.3.6 See also
      1.3.7 References
      1.3.8 External links
   1.4 Boyer–Moore string search algorithm
      1.4.1 Definitions
      1.4.2 Description
      1.4.3 Shift Rules
      1.4.4 The Galil Rule
      1.4.5 Performance
      1.4.6 Implementations
      1.4.7 Variants
      1.4.8 See also
      1.4.9 References
      1.4.10 External links

2 Text and image sources, contributors, and licenses
   2.1 Text
   2.2 Images
   2.3 Content license

Chapter 1

String Matching

1.1 String (computer science)

This article is about the data type. For other uses, see String (disambiguation).

In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed (after creation). A string is generally understood as a data type and is often implemented as an array of bytes (or words) that stores a sequence of elements, typically characters, using some character encoding. A string may also denote more general arrays or other sequence (or list) data types and structures. Depending on the programming language and the precise data type used, a variable declared to be a string may either cause storage in memory to be statically allocated for a predetermined maximum length or employ dynamic allocation to allow it to hold a variable number of elements. When a string appears literally in source code, it is known as a string literal or an anonymous string.[1]

[Figure: Strings are applied e.g. in bioinformatics to describe DNA strands composed of nitrogenous bases.]

In formal languages, which are used in mathematical logic and theoretical computer science, a string is a finite sequence of symbols that are chosen from a set called an alphabet.

1.1.1 Formal theory

See also: Tuple

Let Σ be a non-empty finite set of symbols (alternatively called characters), called the alphabet. No assumption is made about the nature of the symbols. A string (or word) over Σ is any finite sequence of symbols from Σ.[2] For example, if Σ = {0, 1}, then 01011 is a string over Σ.

The length of a string s is the number of symbols in s (the length of the sequence) and can be any non-negative integer; it is often denoted |s|. The empty string is the unique string over Σ of length 0, and is denoted ε or λ.[2][3]

The set of all strings over Σ of length n is denoted Σⁿ. For example, if Σ = {0, 1}, then Σ² = {00, 01, 10, 11}. Note that Σ⁰ = {ε} for any alphabet Σ.

The set of all strings over Σ of any length is the Kleene closure of Σ and is denoted Σ*. In terms of Σⁿ,

    Σ* = ⋃_{n ∈ ℕ ∪ {0}} Σⁿ

For example, if Σ = {0, 1}, then Σ* = {ε, 0, 1, 00, 01, 10, 11, 000, 001, 010, 011, ...}. Although the set Σ* itself is countably infinite, each element of Σ* is a string of finite length.

A set of strings over Σ (i.e. any subset of Σ*) is called a formal language over Σ. For example, if Σ = {0, 1}, the set of strings with an even number of zeros, {ε, 1, 00, 11, 001, 010, 100, 111, 0000, 0011, 0101, 0110, 1001, 1010, 1100, 1111, ...}, is a formal language over Σ.

Concatenation and substrings

Concatenation is an important binary operation on Σ*. For any two strings s and t in Σ*, their concatenation is defined as the sequence of symbols in s followed by the sequence of symbols in t, and is denoted st. For example, if Σ = {a, b, ..., z}, s = bear, and t = hug, then st = bearhug and ts = hugbear.

String concatenation is an associative, but non-commutative, operation. The empty string ε serves as the identity element; for any string s, εs = sε = s. Therefore, the set Σ* and the concatenation operation form a monoid, the free monoid generated by Σ. In addition, the length function defines a monoid homomorphism from Σ* to the non-negative integers (that is, a function L : Σ* ↦ ℕ ∪ {0} such that L(st) = L(s) + L(t) for all s, t ∈ Σ*).

A string s is said to be a substring or factor of t if there exist (possibly empty) strings u and v such that t = usv. The relation “is a substring of” defines a partial order on Σ*, the least element of which is the empty string.

Prefixes and suffixes

A string s is said to be a prefix of t if there exists a string u such that t = su. If u is nonempty, s is said to be a proper prefix of t. Symmetrically, a string s is said to be a suffix of t if there exists a string u such that t = us. If u is nonempty, s is said to be a proper suffix of t. Suffixes and prefixes are substrings of t. Both the relations “is a prefix of” and “is a suffix of” are prefix orders.

Rotations

A string s = uv is said to be a rotation of t if t = vu. For example, if Σ = {0, 1}, the string 0011001 is a rotation of 0100110, where u = 00110 and v = 01.


Reversal

The reverse of a string is a string with the same symbols but in reverse order. For example, if s = abc (where a, b, and c are symbols of the alphabet), then the reverse of s is cba. A string that is the reverse of itself (e.g., s = madam) is called a palindrome, which also includes the empty string and all strings of length 1.

Lexicographical ordering

It is often useful to define an ordering on a set of strings. If the alphabet Σ has a total order (cf. alphabetical order), one can define a total order on Σ* called lexicographical order. For example, if Σ = {0, 1} and 0 < 1, then the lexicographical order on Σ* includes the relationships ε < 0 < 00 < 000 < ... < 0001 < 001 < 01 < 010 < 011 < 0110 < 01111 < 1 < 10 < 100 < 101 < 111 < 1111 < 11111 ... The lexicographical order is total if the alphabetical order is, but isn't well-founded for any nontrivial alphabet, even if the alphabetical order is. See Shortlex for an alternative string ordering that preserves well-foundedness.

String operations

A number of additional operations on strings commonly occur in the formal theory. These are given in the article on string operations.

Topology

[Figure: (Hyper)cube of binary strings of length 3]

Strings admit the following interpretation as nodes on a graph:

• Fixed-length strings can be viewed as nodes on a hypercube
• Variable-length strings (of finite length) can be viewed as nodes on the k-ary tree, where k is the number of symbols in Σ
• Infinite strings (otherwise not considered here) can be viewed as infinite paths on the k-ary tree.

The natural topology on the set of fixed-length strings or variable length strings is the discrete topology, but the natural topology on the set of infinite strings is the limit topology, viewing the set of infinite strings as the inverse limit of the sets of finite strings. This is the construction used for the p-adic numbers and some constructions of the Cantor set, and yields the same topology. Isomorphisms between string representations of topologies can be found by normalizing according to the lexicographically minimal string rotation.

1.1.2 String datatypes

See also: Comparison of programming languages (string functions)

A string datatype is a datatype modeled on the idea of a formal string. Strings are such an important and useful datatype that they are implemented in nearly every programming language. In some languages they are available as primitive types and in others as composite types. The syntax of most high-level programming languages allows for a string, usually quoted in some way, to represent an instance of a string datatype; such a meta-string is called a literal or string literal.

String length

Although formal strings can have an arbitrary (but finite) length, the length of strings in real languages is often constrained to an artificial maximum. In general, there are two types of string datatypes: fixed-length strings, which have a fixed maximum length determined at compile time and which use the same amount of memory whether this maximum is needed or not, and variable-length strings, whose length is not arbitrarily fixed and which can use varying amounts of memory depending on the actual requirements at run time. Most strings in modern programming languages are variable-length strings. Of course, even variable-length strings are limited in length – theoretically by the number of bits available to a pointer, practically by the current size of memory. The string length can be stored as a separate integer (which may put an artificial limit on the length) or implicitly through a termination character, usually a character value with all bits zero. See also “Null-terminated” below.

Character encoding

String datatypes have historically allocated one byte per character, and, although the exact character set varied by region, character encodings were similar enough that programmers could often get away with ignoring this, since characters a program treated specially (such as period and space and comma) were in the same place in all the encodings a program would encounter. These character sets were typically based on ASCII or EBCDIC. Logographic languages such as Chinese, Japanese, and Korean (known collectively as CJK) need far more than 256 characters (the limit of an encoding using one 8-bit byte per character) for reasonable representation. The normal solutions involved keeping single-byte representations for ASCII and using two-byte representations for CJK ideographs. Use of these with existing code led to problems with matching and cutting of strings, the severity of which depended on how the character encoding was designed. Some encodings such as the EUC family guarantee that a byte value in the ASCII range will represent only that ASCII character, making the encoding safe for systems that use those characters as field separators. Other encodings such as ISO-2022 and Shift-JIS do not make such guarantees, making matching on byte codes unsafe. These encodings also were not “self-synchronizing”, so that locating character boundaries required backing up to the start of a string, and pasting two strings together could corrupt the second string (these problems were much less severe with EUC, since any ASCII character did synchronize the encoding).

Unicode has simplified the picture somewhat. Most programming languages now have a datatype for Unicode strings. Unicode’s preferred byte stream format UTF-8 is designed not to have the problems described above for older multibyte encodings. UTF-8, UTF-16, and UTF-32 still require the programmer to know that the fixed-size code units are different from the “characters”; the main difficulty currently is incorrectly designed APIs that attempt to hide this difference (UTF-32 does make code points fixed-sized, but these are not “characters” due to composing codes).


Implementations

Some languages like C++ implement strings as templates that can be used with any datatype, but this is the exception, not the rule. Some languages, such as C++ and Ruby, normally allow the contents of a string to be changed after it has been created; these are termed mutable strings. In other languages, such as Java and Python, the value is fixed and a new string must be created if any alteration is to be made; these are termed immutable strings.

Strings are typically implemented as arrays of bytes, characters, or code units, in order to allow fast access to individual units or substrings—including characters when they have a fixed length. A few languages such as Haskell implement them as linked lists instead. Some languages, such as Prolog and Erlang, avoid implementing a dedicated string datatype at all, instead adopting the convention of representing strings as lists of character codes.

Representations

Representations of strings depend heavily on the choice of character repertoire and the method of character encoding. Older string implementations were designed to work with the repertoire and encoding defined by ASCII, or more recent extensions like the ISO 8859 series. Modern implementations often use the extensive repertoire defined by Unicode along with a variety of complex encodings such as UTF-8 and UTF-16.

The term bytestring usually indicates a general-purpose string of bytes, rather than strings of only (readable) characters, strings of bits, or such. Byte strings often imply that bytes can take any value and any data can be stored as-is, meaning that there should be no value interpreted as a termination value.

Most string implementations are very similar to variable-length arrays with the entries storing the character codes of corresponding characters. The principal difference is that, with certain encodings, a single logical character may take up more than one entry in the array. This happens for example with UTF-8, where single codes (UCS code points) can take anywhere from one to four bytes, and single characters can take an arbitrary number of codes. In these cases, the logical length of the string (number of characters) differs from the physical length of the array (number of bytes in use). UTF-32 avoids the first part of the problem.

Null-terminated

Main article: Null-terminated string

The length of a string can be stored implicitly by using a special terminating character; often this is the null character (NUL), which has all bits zero, a convention used and perpetuated by the popular C programming language.[4] Hence, this representation is commonly referred to as a C string. This representation of an n-character string takes n + 1 space (1 for the terminator), and is thus an implicit data structure.

In terminated strings, the terminating code is not an allowable character in any string. Strings with a length field do not have this limitation and can also store arbitrary binary data. In C, two things are needed to handle binary data: a character pointer and the length of the data.

An example of a null-terminated string stored in a 10-byte buffer, along with its ASCII (or more modern UTF-8) representation as 8-bit hexadecimal numbers, is:

    F    R    A    N    K    NUL  ?    ?    ?    ?
    46   52   41   4E   4B   00   ?    ?    ?    ?

The length of the string in the above example, “FRANK”, is 5 characters, but it occupies 6 bytes. Characters after the terminator (shown here as “?”) do not form part of the representation; they may be either part of another string or just garbage. (Strings of this form are sometimes called ASCIZ strings, after the original assembly language directive used to declare them.)

Length-prefixed

The length of a string can also be stored explicitly, for example by prefixing the string with the length as a byte value (a convention used in many Pascal dialects): as a consequence, some people call it a Pascal string or P-string. Storing the string length as a byte limits the maximum string length to 255. To avoid such limitations, improved implementations of P-strings use 16-, 32-, or 64-bit words to store the string length. When the length field covers the address space, strings are limited only by the available memory.

Encoding the length n takes log(n) space (see fixed-length code), so length-prefixed strings are a succinct data structure, encoding a string of length n in log(n) + n space. However, if the length is bounded, then the length can be encoded in constant space, typically a machine word, and thus is an implicit data structure, taking n + k space, where k is the number of characters in a word (8 for 8-bit ASCII on a 64-bit machine, 1 for 32-bit UTF-32/UCS-4 on a 32-bit machine, etc.).


Here is the equivalent Pascal string stored in a 10-byte buffer, along with its ASCII / UTF-8 representation:

    length  F    R    A    N    K    ?    ?    ?    ?
    05      46   52   41   4E   4B   ?    ?    ?    ?

Strings as records

Many languages, including object-oriented ones, implement strings as records in a structure like:

    class string {
        unsigned int length;
        char *text;
    };

although this implementation is hidden and accessed through member functions. The “text” will be a dynamically allocated memory area that might be expanded as needed. See also string (C++).

Linked-list

Both character termination and length codes limit strings: for example, C character arrays that contain null (NUL) characters cannot be handled directly by C string library functions, and strings using a length code are limited to the maximum value of the length code. Both of these limitations can be overcome by clever programming, of course, but such workarounds are by definition not standard.

Rough equivalents of the C termination method have historically appeared in both hardware and software. For example, “data processing” machines like the IBM 1401 used a special word mark bit to delimit strings at the left, where the operation would start at the right. This meant that, while the IBM 1401 had a seven-bit word, almost no one ever thought to use this as a feature and override the assignment of the seventh bit to (for example) handle ASCII codes.

It is possible to create data structures, and functions that manipulate them, that do not have the problems associated with character termination and can in principle overcome length code bounds. It is also possible to optimize the string represented using techniques from run-length encoding (replacing repeated characters by the character value and a length) and Hamming encoding.

While these representations are common, others are possible. Using ropes makes certain string operations, such as insertions, deletions, and concatenations, more efficient.

Security concerns

The differing memory layout and storage requirements of strings can affect the security of the program accessing the string data. String representations requiring a terminating character are commonly susceptible to buffer overflow problems if the terminating character is not present, caused by a coding error or an attacker deliberately altering the data. String representations adopting a separate length field are also susceptible if the length can be manipulated. In such cases, program code accessing the string data requires bounds checking to ensure that it does not inadvertently access or change data outside of the string memory limits.

String data is frequently obtained from user input to a program. As such, it is the responsibility of the program to validate the string to ensure that it represents the expected format. Performing limited or no validation of user input can cause a program to be vulnerable to code injection attacks.
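The trade-offs between the two conventions above can be made concrete in C. The following is a minimal illustrative sketch, not part of the original article; the names c_string and p_string are ours, and the length-prefixed layout simply mimics a one-byte Pascal-style prefix:

    #include <stdio.h>
    #include <string.h>

    /* Null-terminated ("C string"): 5 characters occupy 6 bytes, and
       the length must be found by scanning for the NUL byte. */
    static const char c_string[10] = "FRANK";

    /* One-byte length prefix ("P-string"): the first byte holds 5, so
       the length is read in O(1), and embedded zero bytes could be
       stored in the data, which a NUL-scanning function cannot allow. */
    static const unsigned char p_string[10] = { 5, 'F', 'R', 'A', 'N', 'K' };

    int main(void) {
        printf("terminated: %zu characters\n", strlen(c_string));     /* scans: O(n) */
        printf("prefixed:   %u characters\n", (unsigned)p_string[0]); /* reads: O(1) */
        return 0;
    }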

1.1.3 Text file strings

In computer-readable text files, for example programming language source files or configuration files, strings can be represented. The NUL byte is normally not used as a terminator, since that does not correspond to the ASCII text standard, and the length is usually not stored, since the file should remain editable by hand. Two common representations are:

• Surrounded by quotation marks (ASCII 22₁₆), used by most programming languages. To be able to include quotation marks, newline characters etc., escape sequences are often available, usually using the backslash character (ASCII 5C₁₆).
• Terminated by a newline sequence, for example in Windows INI files.

See also: String literal

1.1.4 Non-text strings

While character strings are very common uses of strings, a string in computer science may refer generically to any sequence of homogeneously typed data. A string of bits or bytes, for example, may be used to represent non-textual binary data retrieved from a communications medium. This data may or may not be represented by a string-specific datatype, depending on the needs of the application, the desire of the programmer, and the capabilities of the programming language being used. If the programming language’s string implementation is not 8-bit clean, data corruption may ensue.

1.1.5 String processing algorithms

There are many algorithms for processing strings, each with various trade-offs. Some categories of algorithms include:

• String searching algorithms for finding a given substring or pattern
• String manipulation algorithms
• Sorting algorithms
• Regular expression algorithms
• Parsing a string
• Sequence mining

Advanced string algorithms often employ complex mechanisms and data structures, among them suffix trees and finite state machines. The name stringology was coined in 1984 by computer scientist Zvi Galil for the issue of algorithms and data structures used for string processing.[5]

1.1.6 Character string-oriented languages and utilities

Character strings are such a useful datatype that several languages have been designed in order to make string processing applications easy to write. Examples include the following languages:

• awk
• Icon
• MUMPS
• Perl
• Rexx
• Ruby
• sed
• SNOBOL
• Tcl
• TTM

Many Unix utilities perform simple string manipulations and can be used to easily program some powerful string processing algorithms. Files and finite streams may be viewed as strings. Some APIs like Multimedia Control Interface, embedded SQL or printf use strings to hold commands that will be interpreted.


Recent scripting programming languages, including Perl, Python, Ruby, and Tcl, employ regular expressions to facilitate text operations. Perl is particularly noted for its regular expression use,[6] and many other languages and applications implement Perl compatible regular expressions. Some languages such as Perl and Ruby support string interpolation, which permits arbitrary expressions to be evaluated and included in string literals.

1.1.7 Character string functions

See also: Comparison of programming languages (string functions)

String functions are used to manipulate a string or change or edit the contents of a string. They are also used to query information about a string. They are usually used within the context of a computer programming language.

The most basic example of a string function is the string length function, which returns the length of a string (not counting any terminator characters or any of the string’s internal structural information) and does not modify the string. This function is often named length or len. For example, length(“hello world”) would return 11.
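As a sketch of what such a function does internally for a null-terminated string, here is a hand-written C equivalent of the standard strlen; the name my_length is illustrative, not a standard API:

    #include <stddef.h>

    /* Counts the characters before the terminator; the terminator
       itself is not counted and the string is not modified. */
    size_t my_length(const char *s) {
        const char *p = s;
        while (*p != '\0')
            p++;
        return (size_t)(p - s);
    }

    /* my_length("hello world") returns 11 */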

1.1.8 String buffers

In some programming languages, a string buffer is an alternative to a string. It has the ability to be altered through adding or appending, whereas a string is normally fixed or immutable.

In Java

Theory

Java's standard way to handle text is to use its String class. Any given String in Java is an immutable object, which means its state cannot be changed. A String has an array of characters. Whenever a String must be manipulated, any changes require the creation of a new String (which, in turn, involves the creation of a new array of characters and copying of the original array). This happens even if the original String’s value or intermediate Strings used for the manipulation are not kept.

Java provides an alternate class for string manipulation, called a StringBuffer. A StringBuffer, like a String, has an array to hold characters. It, however, is mutable (its state can be altered). Its array of characters is not necessarily completely filled (as opposed to a String, whose array is always the exact required length for its contents). Thus, it has the capability to add, remove, or change its state without creating a new object (and without the creation of a new array and array copying). The exception to this is when its array is no longer of suitable length to hold its content. In this case, it is required to create a new array and copy contents.

For these reasons, Java would handle an expression like

    String newString = aString + anInt + aChar + aDouble;

like this:

    String newString = (new StringBuilder(aString)).append(anInt).append(aChar).append(aDouble).toString();

Implications

Generally, a StringBuffer is more efficient than a String in string handling. However, this is not necessarily the case, since a StringBuffer will be required to recreate its character array when it runs out of space. In theory, this could happen as many times as a new String would be required, although it is unlikely (and the programmer can provide length hints to prevent this). Either way, the effect is not noticeable on modern desktop computers.

As well, the shortcomings of arrays are inherent in a StringBuffer. In order to insert or remove characters at arbitrary positions, whole sections of arrays must be moved. The very mechanism that makes a StringBuffer attractive in an environment with low processing power buys that ability at the cost of extra memory, which is likely also at a premium in such an environment. This point, however, is trivial, considering the space required for creating many instances of Strings in order to process them. As well, the StringBuffer


can be optimized to “waste” as little memory as possible.

The StringBuilder class, introduced in J2SE 5.0, differs from StringBuffer in that it is unsynchronized. When only a single thread at a time will access the object, using a StringBuilder processes more efficiently than using a StringBuffer. StringBuffer and StringBuilder are included in the java.lang package.

In .NET

Microsoft’s .NET Framework has a StringBuilder class in its Base Class Library.

In other languages

• In C++ and Ruby, the standard string class is already mutable, with the ability to change the contents and append strings, etc., so a separate mutable string class is unnecessary.
• In Objective-C (Cocoa/OpenStep frameworks), the NSMutableString class is the mutable version of the NSString class.
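The amortized-growth idea behind StringBuffer is not specific to Java. The following C sketch (the type sbuf and its functions are illustrative, and error handling is elided) appends in place and doubles its backing array only when it runs out of room, which is exactly the "recreate the character array when out of space" behavior described above:

    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char  *data;  /* backing array, not necessarily full */
        size_t len;   /* characters currently in use */
        size_t cap;   /* allocated size of the backing array */
    } sbuf;

    void sbuf_init(sbuf *b, size_t hint) {
        b->cap = hint ? hint : 16;   /* a length hint avoids regrowing */
        b->len = 0;
        b->data = malloc(b->cap);
    }

    /* Append without creating a new object; only when the array is
       too small is it reallocated, and doubling keeps the total
       copying cost linear over any sequence of appends. */
    void sbuf_append(sbuf *b, const char *s, size_t n) {
        if (b->len + n > b->cap) {
            while (b->len + n > b->cap)
                b->cap *= 2;
            b->data = realloc(b->data, b->cap);
        }
        memcpy(b->data + b->len, s, n);
        b->len += n;
    }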

1.1.9 See also

• Formal language — a (possibly infinite) set of strings in theoretical computer science
• Connection string — passed to a driver to initiate a connection, e.g. to a database
• Rope — a data structure for efficiently manipulating long strings
• Bitstring — a string of binary digits
• Binary-safe — a property of string manipulating functions treating their input as raw data stream
• Improper input validation — a type of software security vulnerability particularly relevant for user-given strings
• Incompressible string — a string that cannot be compressed by any algorithm
• Empty string — its properties and representation in programming languages
• String metric — notions of similarity between strings
• string (C++) — overview of C++ string handling
• string.h — overview of C string handling
• Analysis of algorithms — determining time and storage needed by a particular (e.g. string manipulation) algorithm

1.1.10 References

[1] “Introduction To Java - MFC 158 G”. String literals (or constants) are called ‘anonymous strings’.
[2] Barbara H. Partee; Alice ter Meulen; Robert E. Wall (1990). Mathematical Methods in Linguistics. Kluwer.
[3] John E. Hopcroft, Jeffrey D. Ullman (1979). Introduction to Automata Theory, Languages, and Computation. Addison-Wesley. ISBN 0-201-02988-X. Here: sect. 1.1, p. 1.
[4] Bryant, Randal E.; O'Hallaron, David (2003). Computer Systems: A Programmer’s Perspective. Upper Saddle River, NJ: Pearson Education. p. 40. ISBN 0-13-034074-X.
[5] http://www.stringology.org/
[6] “Essential Perl”. Perl’s most famous strength is in string manipulation with regular expressions.


1.2 String searching algorithm

In computer science, string searching algorithms, sometimes called string matching algorithms, are an important class of string algorithms that try to find a place where one or several strings (also called patterns) are found within a larger string or text.

Let Σ be an alphabet (finite set). Formally, both the pattern and the searched text are vectors of elements of Σ. The Σ may be a usual human alphabet (for example, the letters A through Z in the Latin alphabet). Other applications may use a binary alphabet (Σ = {0,1}) or a DNA alphabet (Σ = {A,C,G,T}) in bioinformatics.

In practice, how the string is encoded can affect the feasible string search algorithms. In particular, if a variable-width encoding is in use, then it is slow (time proportional to N) to find the Nth character. This will significantly slow down many of the more advanced search algorithms. A possible solution is to search for the sequence of code units instead, but doing so may produce false matches unless the encoding is specifically designed to avoid it.
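For example, under UTF-8's standard rule that continuation bytes have the form 10xxxxxx, reaching the Nth character requires walking over every character before it, as in this C sketch (the helper name utf8_nth is illustrative):

    #include <stddef.h>

    /* Returns a pointer to the start of the n-th character (0-based)
       of a NUL-terminated UTF-8 string: O(n) work, in contrast to the
       O(1) indexing a fixed-width encoding would allow. */
    const char *utf8_nth(const char *s, size_t n) {
        while (*s != '\0' && n > 0) {
            s++;                          /* step off the lead byte  */
            while ((*s & 0xC0) == 0x80)   /* skip continuation bytes */
                s++;
            n--;
        }
        return s;
    }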

1.2.1 Basic classification

The various algorithms can be classified by the number of patterns each uses.

Single pattern algorithms

Let m be the length of the pattern and let n be the length of the searchable text.¹

¹ Asymptotic times are expressed using O, Ω, and Θ notation.

The Boyer–Moore string search algorithm has been the standard benchmark for the practical string search literature.[1]

Algorithms using a finite set of patterns

• Aho–Corasick string matching algorithm
• Commentz-Walter algorithm
• Rabin–Karp string search algorithm

Algorithms using an infinite number of patterns

Naturally, the patterns cannot be enumerated in this case. They are usually represented by a regular grammar or regular expression.

1.2.2 Other classification

Other classification approaches are possible. One of the most common uses preprocessing as the main criterion.

Naïve string search

A simple but inefficient way to see where one string occurs inside another is to check each place it could be, one by one, to see if it’s there. So first we see if there’s a copy of the needle in the first character of the haystack; if not, we look to see if there’s a copy of the needle starting at the second character of the haystack; if not, we look starting at the third character, and so forth. In the normal case, we only have to look at one or two characters for each wrong position to see that it is a wrong position, so in the average case this takes O(n + m) steps, where n is the length of the haystack and m is the length of the needle; but in the worst case, searching for a string like “aaaab” in a string like “aaaaaaaaab”, it takes O(nm) steps.
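A direct C rendering of this naive method might look as follows (a sketch, with n and m as in the paragraph above; it returns the first match position or -1):

    #include <string.h>

    /* Try every alignment of the needle against the haystack, one by one. */
    long naive_search(const char *haystack, const char *needle) {
        size_t n = strlen(haystack), m = strlen(needle);
        if (m > n)
            return -1;
        for (size_t pos = 0; pos + m <= n; pos++) {
            size_t i = 0;
            while (i < m && haystack[pos + i] == needle[i])
                i++;              /* usually rejected after 1-2 characters */
            if (i == m)
                return (long)pos; /* worst case: O(nm) total comparisons */
        }
        return -1;
    }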


[Figure: a deterministic finite automaton that recognizes the word “MOMMY”; its states mark how much of the word has been matched so far, e.g. |MOMMY, M|OMMY, MO|MMY, ..., MOMMY|.]

Finite state automaton based search

In this approach, we avoid backtracking by constructing a deterministic finite automaton (DFA) that recognizes the stored search string. These are expensive to construct—they are usually created using the powerset construction—but are


very quick to use. For example, the DFA shown to the right recognizes the word “MOMMY”. This approach is frequently generalized in practice to search for arbitrary regular expressions.

Stubs

Knuth–Morris–Pratt computes a DFA that recognizes inputs with the string to search for as a suffix. Boyer–Moore starts searching from the end of the needle, so it can usually jump ahead a whole needle-length at each step. Baeza–Yates keeps track of whether the previous j characters were a prefix of the search string, and is therefore adaptable to fuzzy string searching. The bitap algorithm is an application of Baeza–Yates’ approach.

Index methods

Faster search algorithms are based on preprocessing of the text. After building a substring index, for example a suffix tree or suffix array, the occurrences of a pattern can be found quickly. As an example, a suffix tree can be built in Θ(n) time, and all z occurrences of a pattern can be found in O(m) time under the assumption that the alphabet has a constant size and all inner nodes in the suffix tree know what leaves are underneath them. The latter can be accomplished by running a DFS algorithm from the root of the suffix tree.

Other variants

Some search methods, for instance trigram search, are intended to find a “closeness” score between the search string and the text rather than a “match/non-match”. These are sometimes called “fuzzy” searches.

1.2.3 See also

• Sequence alignment
• Pattern matching
• Compressed pattern matching
• Approximate string matching

1.2.4 Academic conferences on text searching

• Combinatorial Pattern Matching (CPM), a conference on combinatorial algorithms for strings, sequences, and trees.
• String Processing and Information Retrieval (SPIRE), an annual symposium on string processing and information retrieval.
• Prague Stringology Conference (PSC), an annual conference on algorithms on strings and sequences.
• Competition on Applied Text Searching (CATS), an annual series of evaluations of text searching algorithms.

1.2.5 References

[1] Hume; Sunday (1991). “Fast String Searching”. Software: Practice and Experience 21 (11): 1221–1248. doi:10.1002/spe.4380211105.
[2] Melichar, Borivoj; Holub, Jan; Polcar, J. Text Searching Algorithms. Volume I: Forward String Matching. Vol. 1. 2 vols., 2005. http://stringology.org/athens/TextSearchingAlgorithms/.

• R. S. Boyer and J. S. Moore, A fast string searching algorithm, Comm. ACM 20 (10): 762–772 (1977).
• Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms, Second Edition. MIT Press and McGraw-Hill, 2001. ISBN 0-262-03293-7. Chapter 32: String Matching, pp. 906–932.

1.2.6 External links

• Huge (maintained) list of pattern matching links. Last updated: 12/27/2008 20:18:38
• StringSearch – high-performance pattern matching algorithms in Java – implementations of many string-matching algorithms in Java (BNDM, Boyer-Moore-Horspool, Boyer-Moore-Horspool-Raita, Shift-Or)
• Exact String Matching Algorithms — animation in Java, detailed description and C implementation of many algorithms
• Boyer-Moore-Raita-Thomas
• (PDF) Improved Single and Multiple Approximate String Matching
• Kalign2: high-performance multiple alignment of protein and nucleotide sequences allowing external features
• C implementation of Suffix Tree based Pattern Searching

1.3 Knuth–Morris–Pratt algorithm

In computer science, the Knuth–Morris–Pratt string searching algorithm (or KMP algorithm) searches for occurrences of a “word” W within a main “text string” S by employing the observation that when a mismatch occurs, the word itself embodies sufficient information to determine where the next match could begin, thus bypassing re-examination of previously matched characters.

The algorithm was conceived in 1974 by Donald Knuth and Vaughan Pratt, and independently by James H. Morris. The three published it jointly in 1977.

1.3.1 Background

A string matching algorithm wants to find the starting index m in string S[] that matches the search word W[].

The most straightforward algorithm is to look for a character match at successive values of the index m, the position in the string being searched, i.e. S[m]. If the index m reaches the end of the string then there is no match, in which case the search is said to “fail”. At each position m the algorithm first checks for equality of the first character in the searched-for word, i.e. S[m] =? W[0]. If a match is found, the algorithm tests the other characters in the searched-for word by checking successive values of the word position index, i. The algorithm retrieves the character W[i] in the searched-for word and checks for equality of the expression S[m+i] =? W[i]. If all successive characters match in W at position m, then a match is found at that position in the search string.

Usually, the trial check will quickly reject the trial match. If the strings are uniformly distributed random letters, then the chance that characters match is 1 in 26. In most cases, the trial check will reject the match at the initial letter. The chance that the first two letters will match is 1 in 26² (1 in 676). So if the characters are random, then the expected complexity of searching string S[] of length k is on the order of k comparisons, or O(k). The expected performance is very good. If S[] is 1 billion characters and W[] is 1000 characters, then the string search should complete after about one billion character comparisons.

That expected performance is not guaranteed. If the strings are not random, then checking a trial m may take many character comparisons. The worst case is if the two strings match in all but the last letter. Imagine that the string S[] consists of 1 billion characters that are all A, and that the word W[] is 999 A characters terminating in a final B character. The simple string matching algorithm will now examine 1000 characters at each trial position before rejecting the match and advancing the trial position. The simple string search example would now take about 1000 character comparisons times 1 billion positions for 1 trillion character comparisons. If the length of W[] is n, then the worst-case performance is O(k⋅n).

The KMP algorithm does not have the horrendous worst-case performance of the straightforward algorithm. KMP spends a little time precomputing a table (on the order of the size of W[], O(n)), and then it uses that table to do an efficient search of the string in O(k).

The difference is that KMP makes use of previous match information that the straightforward algorithm does not. In the example above, when KMP sees a trial match fail on the 1000th character (i = 999) because S[m+999] ≠ W[999], it will increment m by 1, but it will know that the first 998 characters at the new position already match.


KMP matched 999 A characters before discovering a mismatch at the 1000th character (position 999). Advancing the trial match position m by one throws away the first A, so KMP knows there are 998 A characters that match W[] and does not retest them; that is, KMP sets i to 998. KMP maintains its knowledge in the precomputed table and two state variables. When KMP discovers a mismatch, the table determines how much KMP will increase (variable m) and where it will resume testing (variable i).

1.3.2 KMP algorithm

Worked example of the search algorithm

To illustrate the algorithm’s details, consider a (relatively artificial) run of the algorithm, where W = “ABCDABD” and S = “ABC ABCDAB ABCDABCDABDE”. At any given time, the algorithm is in a state determined by two integers:

• m, denoting the position within S where the prospective match for W begins,
• i, denoting the index of the currently considered character in W.

In each step the algorithm compares S[m+i] with W[i] and advances i if they are equal. This is depicted, at the start of the run, like this:

                 1         2
    m: 01234567890123456789012
    S: ABC ABCDAB ABCDABCDABDE
    W: ABCDABD
    i: 0123456

The algorithm compares successive characters of W to “parallel” characters of S, moving from one to the next by incrementing i if they match. However, in the fourth step S[3] = ' ' does not match W[3] = 'D'. Rather than beginning to search again at S[1], we note that no 'A' occurs between positions 1 and 2 in W; hence, having checked all those characters previously (and knowing they matched the corresponding characters in S), there is no chance of finding the beginning of a match. Therefore, the algorithm sets m = 3 and i = 0.

                 1         2
    m: 01234567890123456789012
    S: ABC ABCDAB ABCDABCDABDE
    W:    ABCDABD
    i:    0123456

This match fails at the initial character, so the algorithm sets m = 4 and i = 0.

                 1         2
    m: 01234567890123456789012
    S: ABC ABCDAB ABCDABCDABDE
    W:     ABCDABD
    i:     0123456

Here i increments through a nearly complete match “ABCDAB” until i = 6, giving a mismatch at W[6] and S[10]. However, just prior to the end of the current partial match, there was that substring “AB” that could be the beginning of a new match, so the algorithm must take this into consideration. As these characters match the two characters prior to the current position, those characters need not be checked again; the algorithm sets m = 8 (the start of the initial prefix) and i = 2 (signaling the first two characters match) and continues matching. Thus the algorithm not only omits previously matched characters of S (the “BCD”), but also previously matched characters of W (the prefix “AB”).

                 1         2
    m: 01234567890123456789012
    S: ABC ABCDAB ABCDABCDABDE
    W:         ABCDABD
    i:         0123456

This search fails immediately, however, as W does not contain another “A”, so as in the first trial, the algorithm returns to the beginning of W and begins searching at the mismatched character position of S: m = 10, reset i = 0.

                 1         2
    m: 01234567890123456789012
    S: ABC ABCDAB ABCDABCDABDE
    W:           ABCDABD
    i:           0123456

The match at m = 10 fails immediately, so the algorithm next tries m = 11 and i = 0.

                 1         2
    m: 01234567890123456789012
    S: ABC ABCDAB ABCDABCDABDE
    W:            ABCDABD
    i:            0123456

Once again, the algorithm matches “ABCDAB”, but the next character, 'C', does not match the final character 'D' of the word W. Reasoning as before, the algorithm sets m = 15, to start at the two-character string “AB” leading up to the current position, sets i = 2, and continues matching from the current position.

                 1         2
    m: 01234567890123456789012
    S: ABC ABCDAB ABCDABCDABDE
    W:                ABCDABD
    i:                0123456

This time the match is complete, and the first character of the match is S[15].

Description of pseudocode for the search algorithm

The above example contains all the elements of the algorithm. For the moment, we assume the existence of a “partial match” table T, described below, which indicates where we need to look for the start of a new match in the event


that the current one ends in a mismatch. The entries of T are constructed so that if we have a match starting at S[m] that fails when comparing S[m + i] to W[i], then the next possible match will start at index m + i - T[i] in S (that is, T[i] is the amount of “backtracking” we need to do after a mismatch). This has two implications: first, T[0] = −1, which indicates that if W[0] is a mismatch, we cannot backtrack and must simply check the next character; and second, although the next possible match will begin at index m + i - T[i], as in the example above, we need not actually check any of the T[i] characters after that, so that we continue searching from W[T[i]]. The following is a sample pseudocode implementation of the KMP search algorithm.

    algorithm kmp_search:
        input:
            an array of characters, S (the text to be searched)
            an array of characters, W (the word sought)
        output:
            an integer (the zero-based position in S at which W is found)

        define variables:
            an integer, m ← 0 (the beginning of the current match in S)
            an integer, i ← 0 (the position of the current character in W)
            an array of integers, T (the table, computed elsewhere)

        while m + i < length(S) do
            if W[i] = S[m + i] then
                if i = length(W) - 1 then
                    return m
                let i ← i + 1
            else
                if T[i] > -1 then
                    let m ← m + i - T[i], i ← T[i]
                else
                    let i ← 0, m ← m + 1

        (if we reach here, we have searched all of S unsuccessfully)
        return the length of S
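A direct C transcription of this pseudocode might look as follows. This is a sketch: T must already hold the partial match table for W (built as in the next section), and the pseudocode's "return the length of S" convention is kept for the not-found case.

    /* Returns the zero-based position in S at which W first occurs,
       or slen if W does not occur at all. */
    int kmp_search(const char *S, int slen, const char *W, int wlen,
                   const int *T) {
        int m = 0;  /* the beginning of the current match in S */
        int i = 0;  /* the position of the current character in W */
        while (m + i < slen) {
            if (W[i] == S[m + i]) {
                if (i == wlen - 1)
                    return m;         /* complete match found */
                i++;
            } else if (T[i] > -1) {
                m = m + i - T[i];     /* shift, keeping the matched prefix */
                i = T[i];
            } else {
                i = 0;                /* mismatch at W[0]: just advance */
                m = m + 1;
            }
        }
        return slen;                  /* searched all of S unsuccessfully */
    }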

Efficiency of the search algorithm

Assuming the prior existence of the table T, the search portion of the Knuth–Morris–Pratt algorithm has complexity O(n), where n is the length of S and the O is big-O notation. Except for the fixed overhead incurred in entering and exiting the function, all the computations are performed in the while loop. To bound the number of iterations of this loop, observe that T is constructed so that if a match which had begun at S[m] fails while comparing S[m + i] to W[i], then the next possible match must begin at S[m + (i - T[i])]. In particular, the next possible match must occur at a higher index than m, so that T[i] < i.

This fact implies that the loop can execute at most 2n times, since at each iteration it executes one of the two branches in the loop. The first branch invariably increases i and does not change m, so that the index m + i of the currently scrutinized character of S is increased. The second branch adds i - T[i] to m, and as we have seen, this is always a positive number. Thus the location m of the beginning of the current potential match is increased. At the same time, the second branch leaves m + i unchanged, for m gets i - T[i] added to it, and immediately after T[i] gets assigned as the new value of i, hence new_m + new_i = old_m + old_i - T[old_i] + T[old_i] = old_m + old_i. Now, the loop ends if m + i = n; therefore, each branch of the loop can be reached at most n times, since they respectively increase either m + i or m, and m ≤ m + i: if m = n, then certainly m + i ≥ n, so that since it increases by unit increments at most, we must have had m + i = n at some point in the past, and therefore either way we would be done. Thus the loop executes at most 2n times, showing that the time complexity of the search algorithm is O(n).

Here is another way to think about the runtime: Let us say we begin to match W and S at positions i and p. If W exists as a substring of S at p, then W[0..m] = S[p..p+m]. Upon success, that is, when the word and the text match at the current positions (W[i] = S[p+i]), we increase i by 1. Upon failure, that is, when the word and the text do not match at the current positions (W[i] ≠ S[p+i]), the text pointer is kept still, while the word pointer is rolled back a certain amount (i = T[i], where T is the jump table), and we attempt to match W[T[i]] with S[p+i]. The total amount i can be rolled back is bounded by the amount it has progressed; that is, for any failure, we can only roll back as much as we have progressed up to the failure. Then it is clear the runtime is at most 2n.

1.3.3 “Partial match” table (also known as “failure function”)

The goal of the table is to allow the algorithm not to match any character of S more than once. The key observation about the nature of a linear search that allows this to happen is that in having checked some segment of the main string against an initial segment of the pattern, we know exactly at which places a new potential match which could continue to the current position could begin prior to the current position. In other words, we “pre-search” the pattern itself and compile a list of all possible fallback positions that bypass a maximum of hopeless characters while not sacrificing any potential matches in doing so.

We want to be able to look up, for each position in W, the length of the longest possible initial segment of W leading up to (but not including) that position, other than the full segment starting at W[0] that just failed to match; this is how far we have to backtrack in finding the next match. Hence T[i] is exactly the length of the longest possible proper initial segment of W which is also a segment of the substring ending at W[i - 1]. We use the convention that the empty string has length 0. Since a mismatch at the very start of the pattern is a special case (there is no possibility of backtracking), we set T[0] = −1, as discussed below.


Worked example of the table-building algorithm

We consider the example of W = “ABCDABD” first. We will see that it follows much the same pattern as the main search, and is efficient for similar reasons. We set T[0] = −1. To find T[1], we must discover a proper suffix of “A” which is also a prefix of W. But there are no proper suffixes of “A”, so we set T[1] = 0. Likewise, T[2] = 0.

Continuing to T[3], we note that there is a shortcut to checking all suffixes: let us say that we discovered a proper suffix which is a proper prefix and ending at W[2] with length 2 (the maximum possible); then its first character is a proper prefix of W, hence a proper prefix itself, and it ends at W[1], which we already determined cannot occur in case T[2]. Hence at each stage, the shortcut rule is that one needs to consider checking suffixes of a given size m+1 only if a valid suffix of size m was found at the previous stage (e.g. T[x] = m). Therefore we need not even concern ourselves with substrings having length 2, and as in the previous case the sole one with length 1 fails, so T[3] = 0.

We pass to the subsequent W[4], 'A'. The same logic shows that the longest substring we need consider has length 1, and although in this case 'A' does work, recall that we are looking for segments ending before the current character; hence T[4] = 0 as well.

Considering now the next character, W[5], which is 'B', we exercise the following logic: if we were to find a subpattern beginning before the previous character W[4], yet continuing to the current one W[5], then in particular it would itself have a proper initial segment ending at W[4] yet beginning before it, which contradicts the fact that we already found that 'A' itself is the earliest occurrence of a proper segment ending at W[4]. Therefore we need not look before W[4] to find a terminal string for W[5]. Therefore T[5] = 1.

Finally, we see that the next character in the ongoing segment starting at W[4] = 'A' would be 'B', and indeed this is also W[5]. Furthermore, the same argument as above shows that we need not look before W[4] to find a segment for W[6], so that this is it, and we take T[6] = 2.

Therefore we compile the following table:

    i     0   1   2   3   4   5   6
    W[i]  A   B   C   D   A   B   D
    T[i]  -1  0   0   0   0   1   2

Description of pseudocode for the table-building algorithm

The example above illustrates the general technique for assembling the table with a minimum of fuss. The principle is that of the overall search: most of the work was already done in getting to the current position, so very little needs to be done in leaving it. The only minor complication is that the logic which is correct late in the string erroneously gives non-proper substrings at the beginning. This necessitates some initialization code.

    algorithm kmp_table:
        input:
            an array of characters, W (the word to be analyzed)
            an array of integers, T (the table to be filled)
        output:
            nothing (but during operation, it populates the table)

        define variables:
            an integer, pos ← 2 (the current position we are computing in T)
            an integer, cnd ← 0 (the zero-based index in W of the next character of the current candidate substring)

        (the first few values are fixed but different from what the algorithm might suggest)
        let T[0] ← −1, T[1] ← 0

        while pos < length(W) do
            (first case: the substring continues)
            if W[pos - 1] = W[cnd] then
                let cnd ← cnd + 1, T[pos] ← cnd, pos ← pos + 1
            (second case: it doesn't, but we can fall back)
            else if cnd > 0 then
                let cnd ← T[cnd]
            (third case: we have run out of candidates. Note cnd = 0)
            else
                let T[pos] ← 0, pos ← pos + 1
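The same pseudocode transcribed to C, again as a sketch; it assumes wlen ≥ 2, matching the pseudocode's fixed initialization of T[0] and T[1]:

    /* Fills T[0..wlen-1] with the partial match ("failure") table for W. */
    void kmp_table(const char *W, int wlen, int *T) {
        int pos = 2;  /* the current position being computed in T */
        int cnd = 0;  /* index in W of the next candidate-substring character */
        T[0] = -1;
        T[1] = 0;
        while (pos < wlen) {
            if (W[pos - 1] == W[cnd]) {   /* first case: substring continues */
                cnd++;
                T[pos] = cnd;
                pos++;
            } else if (cnd > 0) {         /* second case: fall back */
                cnd = T[cnd];
            } else {                      /* third case: out of candidates */
                T[pos] = 0;
                pos++;
            }
        }
    }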

Efficiency of the table-building algorithm

The complexity of the table algorithm is O(n), where n is the length of W. Since, except for some initialization, all the work is done in the while loop, it is sufficient to show that this loop executes in O(n) time, which will be done by simultaneously examining the quantities pos and pos - cnd. In the first branch, pos - cnd is preserved, as both pos and cnd are incremented simultaneously, but naturally, pos is increased. In the second branch, cnd is replaced by T[cnd], which we saw above is always strictly less than cnd, thus increasing pos - cnd. In the third branch, pos is incremented and cnd is not, so both pos and pos - cnd increase. Since pos ≥ pos - cnd, this means that at each stage either pos or a lower bound for pos increases; therefore, since the algorithm terminates once pos = n, it must terminate after at most 2n iterations of the loop, since pos - cnd begins at 1. Therefore the complexity of the table algorithm is O(n).

1.3.4 Efficiency of the KMP algorithm

Since the two portions of the algorithm have, respectively, complexities of O(k) and O(n), the complexity of the overall algorithm is O(n + k). These complexities are the same, no matter how many repetitive patterns are in W or S.

1.3.5 Variants

A real-time version of KMP can be implemented using a separate failure function table for each character in the alphabet. If a mismatch occurs on character x in the text, the failure function table for character x is consulted for the index i in the pattern at which the mismatch took place. This will return the length of the longest substring ending at i matching a prefix of the pattern, with the added condition that the character after the prefix is x. With this restriction, character x in the text need not be checked again in the next phase, and so only a constant number of operations are executed between the processing of each index of the text. This satisfies the real-time computing restriction.

The Booth algorithm uses a modified version of the KMP preprocessing function to find the lexicographically minimal string rotation. The failure function is progressively calculated as the string is rotated.

1.3.6 See also

• Boyer–Moore string search algorithm
• Rabin–Karp string search algorithm
• Aho–Corasick string matching algorithm

1.3.7 References

• Knuth, Donald; Morris, James H., Jr.; Pratt, Vaughan (1977). “Fast pattern matching in strings”. SIAM Journal on Computing 6 (2): 323–350. doi:10.1137/0206024. Zbl 0372.68005.
• Cormen, Thomas; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2001). “Section 32.4: The Knuth-Morris-Pratt algorithm”. Introduction to Algorithms (Second ed.). MIT Press and McGraw-Hill. pp. 923–931. ISBN 0-262-03293-7. Zbl 1047.68161.
• Crochemore, Maxime; Rytter, Wojciech (2003). Jewels of stringology. Text algorithms. River Edge, NJ: World Scientific. pp. 20–25. ISBN 981-02-4897-0. Zbl 1078.68151.
• Szpankowski, Wojciech (2001). Average case analysis of algorithms on sequences. Wiley-Interscience Series in Discrete Mathematics and Optimization. With a foreword by Philippe Flajolet. Chichester: Wiley. pp. 15–17, 136–141. ISBN 0-471-24063-X. Zbl 0968.68205.

1.3.8 External links

• String Searching Applet animation
• An explanation of the algorithm and sample C++ code by David Eppstein
• Knuth-Morris-Pratt algorithm description and C code by Christian Charras and Thierry Lecroq
• Explanation of the algorithm from scratch by FH Flensburg
• Breaking down steps of running KMP by Chu-Cheng Hsieh
• NPTELHRD YouTube lecture video
• Proof of correctness


1.4 Boyer–Moore string search algorithm

For the Boyer-Moore theorem prover, see Nqthm.

In computer science, the Boyer–Moore string search algorithm is an efficient string searching algorithm that is the standard benchmark in the practical string search literature.[1] It was developed by Robert S. Boyer and J Strother Moore in 1977.[2] The algorithm preprocesses the string being searched for (the pattern), but not the string being searched in (the text). It is thus well-suited for applications in which the pattern is much shorter than the text or persists across multiple searches. The Boyer-Moore algorithm uses information gathered during the preprocessing step to skip sections of the text, resulting in a lower constant factor than many other string algorithms. In general, the algorithm runs faster as the pattern length increases. Its key features are matching on the tail of the pattern rather than the head, and skipping along the text in jumps of multiple characters rather than examining every single character of the text.

1.4.1 Definitions

Alignments of pattern PAN to text ANPANMAN, from k=3 to k=8. A match occurs at k=5.

• S[i] refers to the character at index i of string S, counting from 1.
• S[i..j] refers to the substring of string S starting at index i and ending at j, inclusive.
• A prefix of S is a substring S[1..i] for some i in range [1, n], where n is the length of S.
• A suffix of S is a substring S[i..n] for some i in range [1, n], where n is the length of S.
• The string to be searched for is called the pattern and is referred to with symbol P.
• The string being searched in is called the text and is referred to with symbol T.
• The length of P is n.
• The length of T is m.
• An alignment of P to T is an index k in T such that the last character of P is aligned with index k of T.
• A match or occurrence of P occurs at an alignment if P is equivalent to T[(k-n+1)..k].
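These definitions can be checked directly. The following Python fragment is illustrative only; note that Python strings are 0-indexed, so the 1-based substring T[(k-n+1)..k] becomes the slice T[k-n:k]. It reproduces the alignments in the caption above:

T = "ANPANMAN"
P = "PAN"
n = len(P)
for k in range(n, len(T) + 1):           # try alignments k = 3 .. 8 (1-based)
    if T[k - n:k] == P:                  # occurrence of P at alignment k
        print("match at alignment k =", k)   # prints: match at alignment k = 5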

1.4.2 Description

The Boyer-Moore algorithm searches for occurrences of P in T by performing explicit character comparisons at different alignments. Instead of a brute-force search of all alignments (of which there are m - n + 1), Boyer-Moore uses information gained by preprocessing P to skip as many alignments as possible.

Prior to the introduction of this algorithm, the usual way to search within text was to examine each character of the text for the first character of the pattern. Once that was found, the subsequent characters of the text would be compared to the characters of the pattern. If no match occurred, the text would again be checked character by character in an effort to find a match. Thus almost every character in the text needed to be examined.

The key insight in this algorithm is that if the end of the pattern is compared to the text, then jumps along the text can be made rather than checking every character of the text. The reason this works is that in lining up the pattern against the text, the last character of the pattern is compared to the character in the text. If the characters do not match, there is no need to continue searching backwards along the pattern. If the character in the text does not match any of the characters in the pattern, then the next character to check in the text is located n characters farther along, where n is the length of the pattern. If the character is in the pattern, then a partial shift of the pattern along the text is done to line up along the matching character, and the process is repeated. Jumping along the text to make comparisons, rather than checking every character, decreases the number of comparisons that have to be made, which is the key to the efficiency of the algorithm.


More formally, the algorithm begins at alignment k = n, so the start of P is aligned with the start of T. Characters in P and T are then compared starting at index n in P and k in T, moving backward: the strings are matched from the end of P to the start of P. The comparisons continue until either the beginning of P is reached (which means there is a match) or a mismatch occurs, whereupon the alignment is shifted to the right by the maximum value permitted by a number of rules. The comparisons are performed again at the new alignment, and the process repeats until the alignment is shifted past the end of T, at which point no further matches are possible. The shift rules are implemented as constant-time table lookups, using tables generated during the preprocessing of P.

1.4.3 Shift Rules

A shift is calculated by applying two rules: the bad character rule and the good suffix rule. The actual shifting offset is the maximum of the shifts calculated by these rules.

The Bad Character Rule

Description

Demonstration of bad character rule with pattern NNAAMAN.

The bad character rule considers the character in T at which the comparison process failed (assuming such a failure occurred). The next occurrence of that character to the left in P is found, and a shift which brings that occurrence in line with the mismatched occurrence in T is proposed. If the mismatched character does not occur to the left in P, a shift is proposed that moves the entirety of P past the point of mismatch.

Preprocessing

Methods vary on the exact form the table for the bad character rule should take, but a simple constant-time lookup solution is as follows: create a 2D table which is indexed first by the index of the character c in the alphabet and second by the index i in the pattern. This lookup will return the occurrence of c in P with the next-highest index j < i, or −1 if there is no such occurrence. The proposed shift will then be i - j, with O(1) lookup time and O(kn) space, assuming a finite alphabet of length k.

The Good Suffix Rule

Description

Demonstration of good suffix rule with pattern ANAMPNAM.

The good suffix rule is markedly more complex in both concept and implementation than the bad character rule. It is the reason comparisons begin at the end of the pattern rather than the start, and is formally stated thus:[3]

Suppose for a given alignment of P and T, a substring t of T matches a suffix of P, but a mismatch occurs at the next comparison to the left. Then find, if it exists, the right-most copy t' of t in P such that t' is not a suffix of P and the character to the left of t' in P differs from the character to the left of t in P. Shift P to the right so that substring t' in P aligns with substring t in T. If t' does not exist, then shift the left end of P past the left end of t in T by the least amount so that a prefix of the shifted pattern matches a suffix of t in T. If no such shift is possible, then shift P by n places to the right. If an occurrence of P is found, then shift P by the least amount so that a proper prefix of the shifted P matches a suffix of the occurrence of P in T. If no such shift is possible, then shift P by n places, that is, shift P past t.

Preprocessing

The good suffix rule requires two tables: one for use in the general case, and another for use when either the general case returns no meaningful result or a match occurs. These tables will be designated L and H respectively. Their definitions are as follows:[3]

For each i, L[i] is the largest position less than n such that string P[i..n] matches a suffix of P[1..L[i]] and such that the character preceding that suffix is not equal to P[i-1]. L[i] is defined to be zero if there is no position satisfying the condition.

Let H[i] denote the length of the largest suffix of P[i..n] that is also a prefix of P, if one exists. If none exists, let H[i] be zero.


Both of these tables are constructible in O(n) time and use O(n) space. The alignment shift for index i in P is given by n - L[i] or n - H[i]. H should only be used if L[i] is zero or a match has been found.
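As a concrete rendering of the bad character preprocessing described earlier, the following Python sketch builds the 2D table for a 26-letter lowercase alphabet (the function name and the alphabet choice are ours, for illustration only):

def bad_character_preprocess(P):
    """table[c][i] is the largest index j < i with P[j] equal to character c,
    or -1 if there is none. On a mismatch of text character c against pattern
    index i, the proposed shift is i - table[c][i]. Space is O(kn), k = 26."""
    n = len(P)
    table = [[-1] * n for _ in range(26)]
    last = [-1] * 26                    # most recent index of each character seen
    for i, ch in enumerate(P):
        for c in range(26):
            table[c][i] = last[c]       # occurrences strictly to the left of i
        last[ord(ch) - ord('a')] = i
    return table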

1.4.4 The Galil Rule

A simple but important optimization of Boyer-Moore was put forth by Galil in 1979.[4] As opposed to shifting, the Galil rule deals with speeding up the actual comparisons done at each alignment by skipping sections that are known to match. Suppose that at an alignment k1, P is compared with T down to character c of T. Then if P is shifted to k2 such that its left end is between c and k1, in the next comparison phase a prefix of P must match the substring T[(k2 - n)..k1]. Thus if the comparisons get down to position k1 of T, an occurrence of P can be recorded without explicitly comparing past k1. In addition to increasing the efficiency of Boyer-Moore, the Galil rule is required for proving linear-time execution in the worst case.

1.4.5 Performance

The Boyer-Moore algorithm as presented in the original paper has worst-case running time of O(n+m) only if the pattern does not appear in the text. This was first proved by Knuth, Morris, and Pratt in 1977,[5] followed by Guibas and Odlyzko in 1980[6] with an upper bound of 5m comparisons in the worst case. Richard Cole gave a proof with an upper bound of 3m comparisons in the worst case in 1991.[7] When the pattern does occur in the text, running time of the original algorithm is O(nm) in the worst case. This is easy to see when both pattern and text consist solely of the same repeated character. However, inclusion of the Galil rule results in linear runtime across all cases.[4][7]

1.4.6 Implementations

Various implementations exist in different programming languages. In C++, Boost provides the generic Boyer–Moore search implementation under the Algorithm library. Below are a few simple implementations.

[Python implementation]

def alphabet_index(c):
    """Returns the index of the given character in the English alphabet, counting from 0."""
    return ord(c.lower()) - 97  # 'a' is ASCII character 97

def match_length(S, idx1, idx2):
    """Returns the length of the match of the substrings of S beginning at idx1 and idx2."""
    if idx1 == idx2:
        return len(S) - idx1
    match_count = 0
    while idx1 < len(S) and idx2 < len(S) and S[idx1] == S[idx2]:
        match_count += 1
        idx1 += 1
        idx2 += 1
    return match_count

def fundamental_preprocess(S):
    """
    Returns Z, the Fundamental Preprocessing of S. Z[i] is the length of the substring
    beginning at i which is also a prefix of S. This pre-processing is done in O(n) time,
    where n is the length of S.
    """
    if len(S) == 0:  # Handles case of empty string
        return []
    if len(S) == 1:  # Handles case of single-character string
        return [1]
    z = [0 for x in S]
    z[0] = len(S)
    z[1] = match_length(S, 0, 1)
    for i in range(2, 1 + z[1]):  # Optimization from exercise 1-5
        z[i] = z[1] - i + 1
    # Defines lower and upper limits of z-box
    l = 0
    r = 0
    for i in range(2 + z[1], len(S)):
        if i <= r:  # i falls within existing z-box
            k = i - l
            b = z[k]
            a = r - i + 1
            if b < a:  # b ends within existing z-box
                z[i] = b
            else:  # b ends at or after the end of the z-box; explicit match to the right of the z-box
                z[i] = a + match_length(S, a, r + 1)
                l = i
                r = i + z[i] - 1
        else:  # i does not reside within existing z-box
            z[i] = match_length(S, 0, i)
            if z[i] > 0:
                l = i
                r = i + z[i] - 1
    return z

def bad_character_table(S):
    """
    Generates R for S, which is an array indexed by the position of some character c in the
    English alphabet. At that index in R is an array of length |S|+1, specifying for each
    index i in S (plus the index after S) the next location of character c encountered when
    traversing S from right to left starting at i. This is used for a constant-time lookup
    for the bad character rule in the Boyer-Moore string search algorithm, although it has
    a much larger size than non-constant-time solutions.
    """
    if len(S) == 0:
        return [[] for a in range(26)]
    R = [[-1] for a in range(26)]
    alpha = [-1 for a in range(26)]
    for i, c in enumerate(S):
        alpha[alphabet_index(c)] = i
        for j, a in enumerate(alpha):
            R[j].append(a)
    return R

def good_suffix_table(S):
    """
    Generates L for S, an array used in the implementation of the strong good suffix rule.
    L[i] = k, the largest position in S such that S[i:] (the suffix of S starting at i)
    matches a suffix of S[:k] (a substring in S ending at k). Used in Boyer-Moore, L gives
    an amount to shift P relative to T such that no instances of P in T are skipped and a
    suffix of P[:L[i]] matches the substring of T matched by a suffix of P in the previous
    match attempt. Specifically, if the mismatch took place at position i-1 in P, the shift
    magnitude is given by the equation len(P) - L[i]. In the case that L[i] = -1, the full
    shift table is used. Since only proper suffixes matter, L[0] = -1.
    """
    L = [-1 for c in S]
    N = fundamental_preprocess(S[::-1])  # S[::-1] reverses S
    N.reverse()
    for j in range(0, len(S) - 1):
        i = len(S) - N[j]
        if i != len(S):
            L[i] = j
    return L

def full_shift_table(S):
    """
    Generates F for S, an array used in a special case of the good suffix rule in the
    Boyer-Moore string search algorithm. F[i] is the length of the longest suffix of S[i:]
    that is also a prefix of S. In the cases it is used, the shift magnitude of the pattern
    P relative to the text T is len(P) - F[i] for a mismatch occurring at i-1.
    """
    F = [0 for c in S]
    Z = fundamental_preprocess(S)
    longest = 0
    for i, zv in enumerate(reversed(Z)):
        longest = max(zv, longest) if zv == i + 1 else longest
        F[-i - 1] = longest
    return F

def string_search(P, T):
    """
    Implementation of the Boyer-Moore string search algorithm. This finds all occurrences of P
    in T, and incorporates numerous ways of pre-processing the pattern to determine the optimal
    amount to shift the string and skip comparisons. In practice it runs in O(m) (and even
    sublinear) time, where m is the length of T. This implementation performs a case-insensitive
    search on ASCII alphabetic characters, spaces not included.
    """
    if len(P) == 0 or len(T) == 0 or len(T) < len(P):
        return []
    matches = []

    # Preprocessing
    R = bad_character_table(P)
    L = good_suffix_table(P)
    F = full_shift_table(P)

    k = len(P) - 1   # Represents alignment of end of P relative to T
    previous_k = -1  # Represents alignment in previous phase (Galil's rule)
    while k < len(T):
        i = len(P) - 1  # Character to compare in P
        h = k           # Character to compare in T
        while i >= 0 and h > previous_k and P[i] == T[h]:  # Matches starting from end of P
            i -= 1
            h -= 1
        if i == -1 or h == previous_k:  # Match has been found (Galil's rule)
            matches.append(k - len(P) + 1)
            k += len(P) - F[1] if len(P) > 1 else 1
        else:  # No match, shift by max of bad character and good suffix rules
            char_shift = i - R[alphabet_index(T[h])][i]
            if i + 1 == len(P):   # Mismatch happened on first attempt
                suffix_shift = 1
            elif L[i + 1] == -1:  # Matched suffix does not appear anywhere in P
                suffix_shift = len(P) - F[i + 1]
            else:                 # Matched suffix appears in P
                suffix_shift = len(P) - L[i + 1]
            shift = max(char_shift, suffix_shift)
            previous_k = k if shift >= i + 1 else previous_k  # Galil's rule
            k += shift
    return matches

[C implementation]

#include <stdint.h>
#include <stdlib.h>

#define ALPHABET_LEN 256
#define NOT_FOUND patlen
#define max(a, b) ((a < b) ? b : a)

// delta1 table: delta1[c] contains the distance between the last
// character of pat and the rightmost occurrence of c in pat.
// If c does not occur in pat, then delta1[c] = patlen.
// If c is at string[i] and c != pat[patlen-1], we can
// safely shift i over by delta1[c], which is the minimum distance
// needed to shift pat forward to get string[i] lined up
// with some character in pat.
// This algorithm runs in alphabet_len+patlen time.
void make_delta1(int *delta1, uint8_t *pat, int32_t patlen) {
    int i;
    for (i = 0; i < ALPHABET_LEN; i++) {
        delta1[i] = NOT_FOUND;
    }
    for (i = 0; i < patlen - 1; i++) {
        delta1[pat[i]] = patlen - 1 - i;
    }
}

// true if the suffix of word starting from word[pos] is a prefix
// of word
int is_prefix(uint8_t *word, int wordlen, int pos) {
    int i;
    int suffixlen = wordlen - pos;
    // could also use the strncmp() library function here
    for (i = 0; i < suffixlen; i++) {
        if (word[i] != word[pos + i]) {
            return 0;
        }
    }
    return 1;
}

// length of the longest suffix of word ending on word[pos].
// suffix_length("dddbcabc", 8, 4) = 2
int suffix_length(uint8_t *word, int wordlen, int pos) {
    int i;
    // increment suffix length i to the first mismatch or beginning
    // of the word
    for (i = 0; (word[pos - i] == word[wordlen - 1 - i]) && (i < pos); i++);
    return i;
}

// delta2 table: given a mismatch at pat[pos], we want to align
// with the next possible full match, based on what we know about
// pat[pos+1] to pat[patlen-1].
//
// In case 1:
// pat[pos+1] to pat[patlen-1] does not occur elsewhere in pat, and
// the next plausible match starts at or after the mismatch.
// If, within the substring pat[pos+1 .. patlen-1], lies a prefix
// of pat, the next plausible match is here (if there are multiple
// prefixes in the substring, pick the longest). Otherwise, the
// next plausible match starts past the character aligned with
// pat[patlen-1].
//
// In case 2:
// pat[pos+1] to pat[patlen-1] does occur elsewhere in pat. The
// mismatch tells us that we are not looking at the end of a match.
// We may, however, be looking at the middle of a match.
//
// The first loop, which takes care of case 1, is analogous to
// the KMP table, adapted for a 'backwards' scan order with the
// additional restriction that the substrings it considers as
// potential prefixes are all suffixes. In the worst case scenario
// pat consists of the same letter repeated, so every suffix is
// a prefix. This loop alone is not sufficient, however:
// Suppose that pat is "ABYXCDBYX", and text is ".....ABYXCDEYX".
// We will match X, Y, and find B != E. There is no prefix of pat
// in the suffix "YX", so the first loop tells us to skip forward
// by 9 characters.
// Although superficially similar to the KMP table, the KMP table
// relies on information about the beginning of the partial match
// that the BM algorithm does not have.
//
// The second loop addresses case 2. Since suffix_length may not be
// unique, we want to take the minimum value, which will tell us
// how far away the closest potential match is.
void make_delta2(int *delta2, uint8_t *pat, int32_t patlen) {
    int p;
    int last_prefix_index = patlen - 1;

    // first loop
    for (p = patlen - 1; p >= 0; p--) {
        if (is_prefix(pat, patlen, p + 1)) {
            last_prefix_index = p + 1;
        }
        delta2[p] = last_prefix_index + (patlen - 1 - p);
    }

    // second loop
    for (p = 0; p < patlen - 1; p++) {
        int slen = suffix_length(pat, patlen, p);
        if (pat[p - slen] != pat[patlen - 1 - slen]) {
            delta2[patlen - 1 - slen] = patlen - 1 - p + slen;
        }
    }
}

uint8_t* boyer_moore(uint8_t *string, uint32_t stringlen, uint8_t *pat, uint32_t patlen) {
    int i;
    int delta1[ALPHABET_LEN];
    int *delta2 = (int *)malloc(patlen * sizeof(int));
    make_delta1(delta1, pat, patlen);
    make_delta2(delta2, pat, patlen);

    // The empty pattern must be considered specially
    if (patlen == 0) {
        free(delta2);
        return string;
    }

    i = patlen - 1;
    while (i < stringlen) {
        int j = patlen - 1;
        while (j >= 0 && (string[i] == pat[j])) {
            --i;
            --j;
        }
        if (j < 0) {
            free(delta2);
            return (string + i + 1);
        }
        i += max(delta1[string[i]], delta2[j]);
    }
    free(delta2);
    return NULL;
}

[Java implementation]

/**
 * Returns the index within this string of the first occurrence of the
 * specified substring. If it is not a substring, return -1.
 *
 * @param haystack The string to be scanned
 * @param needle The target string to search
 * @return The start index of the substring
 */
public static int indexOf(char[] haystack, char[] needle) {
    if (needle.length == 0) {
        return 0;
    }
    int charTable[] = makeCharTable(needle);
    int offsetTable[] = makeOffsetTable(needle);
    for (int i = needle.length - 1, j; i < haystack.length;) {
        for (j = needle.length - 1; needle[j] == haystack[i]; --i, --j) {
            if (j == 0) {
                return i;
            }
        }
        // i += needle.length - j; // For naive method
        i += Math.max(offsetTable[needle.length - 1 - j], charTable[haystack[i]]);
    }
    return -1;
}

/**
 * Makes the jump table based on the mismatched character information.
 */
private static int[] makeCharTable(char[] needle) {
    final int ALPHABET_SIZE = 256;
    int[] table = new int[ALPHABET_SIZE];
    for (int i = 0; i < table.length; ++i) {
        table[i] = needle.length;
    }
    for (int i = 0; i < needle.length - 1; ++i) {
        table[needle[i]] = needle.length - 1 - i;
    }
    return table;
}

/**
 * Makes the jump table based on the scan offset at which the mismatch occurs.
 */
private static int[] makeOffsetTable(char[] needle) {
    int[] table = new int[needle.length];
    int lastPrefixPosition = needle.length;
    for (int i = needle.length - 1; i >= 0; --i) {
        if (isPrefix(needle, i + 1)) {
            lastPrefixPosition = i + 1;
        }
        table[needle.length - 1 - i] = lastPrefixPosition - i + needle.length - 1;
    }
    for (int i = 0; i < needle.length - 1; ++i) {
        int slen = suffixLength(needle, i);
        table[slen] = needle.length - 1 - i + slen;
    }
    return table;
}

/**
 * Is needle[p:end] a prefix of needle?
 */
private static boolean isPrefix(char[] needle, int p) {
    for (int i = p, j = 0; i < needle.length; ++i, ++j) {
        if (needle[i] != needle[j]) {
            return false;
        }
    }
    return true;
}

/**
 * Returns the maximum length of the substring that ends at p and is a suffix.
 */
private static int suffixLength(char[] needle, int p) {
    int len = 0;
    for (int i = p, j = needle.length - 1; i >= 0 && needle[i] == needle[j]; --i, --j) {
        len += 1;
    }
    return len;
}

1.4.7 Variants

The Boyer–Moore–Horspool algorithm is a simplification of the Boyer–Moore algorithm using only the bad character rule.

The Apostolico–Giancarlo algorithm speeds up the process of checking whether a match has occurred at the given alignment by skipping explicit character comparisons. This uses information gleaned during the pre-processing of the pattern in conjunction with suffix match lengths recorded at each match attempt. Storing suffix match lengths requires an additional table equal in size to the text being searched.
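To make the Horspool simplification concrete, here is a minimal Python sketch (bad character rule only, keyed on the text character aligned with the last pattern position; the function name is ours, and the code is illustrative rather than optimized):

def horspool_search(T, P):
    """Returns the index of the first occurrence of P in T, or -1."""
    n, m = len(P), len(T)
    if n == 0 or m < n:
        return -1
    # Bad character shifts: distance from each pattern character (except the
    # last) to the end of the pattern; characters not in P shift by n.
    shift = {P[i]: n - 1 - i for i in range(n - 1)}
    k = n - 1                            # alignment of the end of P in T
    while k < m:
        i, h = n - 1, k
        while i >= 0 and P[i] == T[h]:   # compare right to left
            i -= 1
            h -= 1
        if i < 0:
            return k - n + 1             # match found
        k += shift.get(T[k], n)          # shift keyed on T[k]
    return -1

For example, horspool_search("ANPANMAN", "PAN") returns 2.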

1.4.8 See also

• Knuth–Morris–Pratt string search algorithm
• Boyer–Moore–Horspool string search algorithm
• Apostolico–Giancarlo string search algorithm
• Aho–Corasick multi-pattern string search algorithm
• Rabin–Karp multi-pattern string search algorithm
• Suffix trees

1.4.9 References

[1] Hume, Andrew; Sunday, Daniel (November 1991). “Fast String Searching”. Software—Practice and Experience 21 (11): 1221–1248.

[2] Boyer, Robert S.; Moore, J Strother (October 1977). “A Fast String Searching Algorithm”. Comm. ACM (New York, NY, USA: Association for Computing Machinery) 20 (10): 762–772. doi:10.1145/359842.359859. ISSN 0001-0782.

[3] Gusfield, Dan (1999) [1997]. “Chapter 2 - Exact Matching: Classical Comparison-Based Methods”. Algorithms on Strings, Trees, and Sequences (1 ed.). Cambridge University Press. pp. 19–21. ISBN 0521585198.

[4] Galil, Z. (September 1979). “On improving the worst case running time of the Boyer-Moore string matching algorithm”. Comm. ACM (New York, NY, USA: Association for Computing Machinery) 22 (9): 505–508. doi:10.1145/359146.359148. ISSN 0001-0782.

[5] Knuth, Donald; Morris, James H.; Pratt, Vaughan (1977). “Fast pattern matching in strings”. SIAM Journal on Computing 6 (2): 323–350. doi:10.1137/0206024.

[6] Guibas, Leonidas J.; Odlyzko, Andrew M. (1977). “A new proof of the linearity of the Boyer-Moore string searching algorithm”. Proceedings of the 18th Annual Symposium on Foundations of Computer Science (Washington, DC, USA: IEEE Computer Society): 189–195. doi:10.1109/SFCS.1977.3.


[7] Cole, Richard (September 1991). “Tight bounds on the complexity of the Boyer-Moore string matching algorithm”. Proceedings of the 2nd annual ACM-SIAM symposium on Discrete algorithms (Philadelphia, PA, USA: Society for Industrial and Applied Mathematics): 224–233. ISBN 0-89791-376-0.

1.4.10 External links

• Original paper on the Boyer-Moore algorithm
• An example of the Boyer-Moore algorithm from the homepage of J Strother Moore, co-inventor of the algorithm
• Richard Cole’s 1991 paper proving runtime linearity
