
An Approximate Gazetteer for GATE based on Levenshtein Distance

Bruno Woltzenlogel Paleo ([email protected])

Technological Institute of Aeronautics, Sao Jose dos Campos, Brazil

Technische Universitaet Wien, Austria

Abstract

GATE is an architecture that supports the development and use of plugins for typical Natural Language Processing tasks. However, until now there was no plugin capable of finding and annotating words that only approximately match the entries of a given word list, since GATE’s Default Gazetteer performs only exact matching. Here we describe the development of such a plugin, based on Levenshtein Edit Distance, and its integration into the GATE environment. This plugin makes GATE useful for tasks in which exact matching is not enough.

Introduction

GATE (General Architecture for Text Engineering) allows the implementation of typical Natural Language Processing tasks as pipelines, where different processing resources (Part-of-Speech taggers, Sentence Splitters, Gazetteers, ...) are consecutively applied to a given language resource (a text document or a collection of text documents). The output is a language resource enriched with annotations. We developed for GATE a particular type of processing resource called Approximate Gazetteer, which finds and annotates occurrences of words even if these occurrences contain, for example, typing mistakes, noise, variant spellings, or other variations. If the Levenshtein Edit Distance between a substring of the text and a given word is less than a user-specified threshold, then an approximate match is detected and the occurrence is annotated.

Levenshtein Distance

To avoid overlapping annotations, another modification of the algorithm is necessary. As shown below for the word “CATT” searched in the text “TCATGG”, we do not annotate all matches with distances below the threshold (in this case, 2), but only those corresponding to minimal values in the last row (highlighted in red in the original table).

            T   C   A   T   G   G
        0   0   0   0   0   0   0
    C   1   1   0   1   1   1   1
    A   2   2   1   0   1   2   2
    T   3   2   2   1   0   1   2
    T   4   3   3   2   1   1   2
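As a rough illustration of the selection rule described above, the following Python sketch computes the last row of the table for a searched word and a text, and keeps only the local minima that lie below the threshold. The function name `minimal_match_ends`, the zero-initialized first row, and the handling of a run of equal values (only its leftmost cell is kept) are assumptions made for this example; they are not taken from the plugin's source code.

```python
def minimal_match_ends(word, text, threshold):
    """Return end positions (exclusive) of matches at local minima of the last row."""
    n = len(text)
    # First row: all zeros, so that a match may start anywhere in the text.
    row = [0] * (n + 1)
    for i, wc in enumerate(word, start=1):
        new_row = [i]  # matching word[:i] against the empty text costs i
        for j in range(1, n + 1):
            if wc == text[j - 1]:
                new_row.append(row[j - 1])
            else:
                new_row.append(1 + min(row[j], new_row[j - 1], row[j - 1]))
        row = new_row

    # Scan the last row for runs of equal values that are local minima.
    ends = []
    j = 1
    while j <= n:
        k = j
        while k < n and row[k + 1] == row[j]:
            k += 1  # extend the run of equal values
        left_higher = row[j - 1] > row[j]
        right_higher = k == n or row[k + 1] > row[j]
        if row[j] < threshold and left_higher and right_higher:
            ends.append(j)  # assumption: keep only the leftmost cell of the run
        j = k + 1
    return ends


if __name__ == "__main__":
    print(minimal_match_ends("CATT", "TCATGG", 2))
```

For the table above, the last row is 4 3 3 2 1 1 2, so the sketch reports a single match ending after the fourth character of the text instead of two overlapping ones.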

The previous algorithms find only the end positions of matches in the text. In order to annotate whole occurrences, we also need to determine their start positions. This is done by storing in the table not only the distances between prefixes, but also the numbers of deletions and insertions that resulted in those distances. The start position of a match can then be computed by:

\[
\mathit{startposition} = \mathit{endposition} - (\mathit{lengthOfString} + \#\mathit{insertions} - \#\mathit{deletions})
\]
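The bookkeeping described above might look roughly like the following Python sketch, in which every table cell holds a triple (distance, insertions, deletions). The name `find_matches`, the convention that an insertion is an extra text character and a deletion is a skipped character of the searched word, and the tie-breaking between equally cheap operations are assumptions for illustration only. For brevity the sketch reports every end position below the threshold and does not filter overlapping matches.

```python
def find_matches(word, text, threshold):
    """Return (start, end, distance) triples, with end exclusive."""
    n = len(text)
    # Each cell is (distance, insertions, deletions); the first row is all zeros.
    prev = [(0, 0, 0)] * (n + 1)
    for i, wc in enumerate(word, start=1):
        cur = [(i, 0, i)]  # empty text: all i word characters are deleted
        for j in range(1, n + 1):
            if wc == text[j - 1]:
                cur.append(prev[j - 1])  # characters agree: copy the diagonal cell
            else:
                d_sub, i_sub, x_sub = prev[j - 1]
                d_del, i_del, x_del = prev[j]
                d_ins, i_ins, x_ins = cur[j - 1]
                # Tuples compare by distance first; remaining ties are
                # broken arbitrarily (an assumption of this sketch).
                cur.append(min(
                    (d_sub + 1, i_sub, x_sub),      # substitution
                    (d_del + 1, i_del, x_del + 1),  # deletion of a word character
                    (d_ins + 1, i_ins + 1, x_ins),  # insertion of a text character
                ))
        prev = cur

    matches = []
    for end in range(1, n + 1):
        dist, ins, dels = prev[end]
        if dist < threshold:
            start = end - (len(word) + ins - dels)
            matches.append((start, end, dist))
    return matches


if __name__ == "__main__":
    for start, end, dist in find_matches("CATT", "TCATGG", 2):
        print(start, end, dist, "TCATGG"[start:end])
```

With these conventions, searching for "CATT" in "TCATGG" with threshold 2 yields the spans "CAT" (one deletion) and "CATG" (one substitution), both starting at position 1.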

The Plugin in GATE

Figures 1 and 2 show a simple qualitative example of the improved recall of the Approximate Gazetteer with respect to GATE’s Default Gazetteer. The searched words were: “Kurt Gödel Society”, “São Paulo”, “Brazil”, “Programme Alban”, “Woltzenlogel”, “GATE”, “ESSLLI”, “Levenshtein”. Typical typing mistakes and spelling variations are handled successfully.

Figure 2: Annotations produced by the Approximate Gazetteer.

The Levenshtein Distance between two strings is the minimum number of character insertions, deletions and substitutions necessary to transform one string into the other. This minimum distance can be computed recursively according to the equation below, where s.c denotes the string s followed by the character c:

\[
d(s_1.c_1,\, s_2.c_2) =
\begin{cases}
d(s_1, s_2), & \text{if } c_1 = c_2 \\
\min\bigl( d(s_1, s_2.c_2) + 1,\; d(s_1.c_1, s_2) + 1,\; d(s_1, s_2) + 1 \bigr), & \text{if } c_1 \neq c_2
\end{cases}
\]
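The recursive definition can be transcribed almost literally into Python, as in the minimal sketch below. The base cases (the distance to an empty string is the length of the other string) are not stated in the equation and are added here as an assumption; the function name `levenshtein_recursive` and the memoization via `functools.lru_cache` are likewise choices made for this example.

```python
from functools import lru_cache


def levenshtein_recursive(a: str, b: str) -> int:
    """Levenshtein distance following the recursive definition on prefixes."""

    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        # d(i, j) is the distance between the prefixes a[:i] and b[:j].
        if i == 0:
            return j  # assumed base case: build b[:j] with j insertions
        if j == 0:
            return i  # assumed base case: remove all i characters of a[:i]
        if a[i - 1] == b[j - 1]:
            return d(i - 1, j - 1)  # last characters agree
        return min(
            d(i - 1, j) + 1,      # d(s1, s2.c2) + 1
            d(i, j - 1) + 1,      # d(s1.c1, s2) + 1
            d(i - 1, j - 1) + 1,  # d(s1, s2) + 1
        )

    return d(len(a), len(b))


if __name__ == "__main__":
    print(levenshtein_recursive("CATO", "GATE"))
```

Without memoization the same recursion would recompute the distance between the same pair of prefixes many times, which is the motivation for the table-based computation.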

Conclusions

The Approximate Gazetteer is a first step in making GATE more applicable to data that can contain different kinds of noise, for example: genes and proteins that can contain mutations; texts obtained via imperfect OCR; texts containing words in slightly different languages; and texts containing words mistyped on purpose (spam, internet slang). A gazetteer is often used as a pre-processing step within more complex Natural Language Processing tasks. By using the Approximate Gazetteer, these tasks could benefit from its improved recall. Since its major drawback compared to the Default Gazetteer is its speed, future work could try to combine the dynamic programming approach for the computation of Levenshtein distance with ideas borrowed from GATE’s Default Gazetteer, in order to avoid repeated computations for different words when the word list is large.

Computation by dynamic programming is also possible. The distances between increasingly long prefixes are stored in a table, starting from the top left corner, as exemplified below for the strings “CATO” and “GATE”:

            G   A   T   E
        0   1   2   3   4
    C   1   1   2   3   4
    A   2   2   1   2   3
    T   3   3   2   1   2
    O   4   4   3   2   2
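The table above can be reproduced with a few lines of Python. This is only a sketch of the standard dynamic programming scheme under the recurrence given in this document; the function name `levenshtein_table` is hypothetical, and the row and column layout (first string down the rows, second string across the columns) is chosen to match the table.

```python
def levenshtein_table(a, b):
    """Return the full table of distances between prefixes of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    for j in range(cols):
        table[0][j] = j  # empty prefix of a versus b[:j]
    for i in range(rows):
        table[i][0] = i  # a[:i] versus empty prefix of b
    for i in range(1, rows):
        for j in range(1, cols):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1]
            else:
                table[i][j] = 1 + min(
                    table[i - 1][j],      # cell above
                    table[i][j - 1],      # cell to the left
                    table[i - 1][j - 1],  # diagonal cell
                )
    return table


if __name__ == "__main__":
    for row in levenshtein_table("CATO", "GATE"):
        print(" ".join(str(value) for value in row))
```

The distance between the two full strings is the bottom-right cell, here 2 for “CATO” and “GATE”.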

The dynamic programming algorithm can be modified so that, instead of only computing the distance between two strings, it finds all occurrences of one string within the other. This is done by initializing the first row with zeros, as shown below, where we are interested in finding all occurrences of “CA” in the text “CATACA”.

            C   A   T   A   C   A
        0   0   0   0   0   0   0
    C   1   0   1   1   1   0   1
    A   2   1   0   1   1   1   0

Figure 1: Annotations produced by GATE’s Default Gazetteer.

Approximate Gazetteer’s improved recall has its cost. While Default Gazetteer stores all words to be searched in a finite state machine and then searches the text in a single pass, Approximate Gazetteer needs to search the text once for each word in the list. Approximate Gazetteer is thus slower, especially if the list contains a large number of words to be searched.

Affiliation and Support

• Kurt Gödel Society
• Alban Programme

References

• GATE’s Website: http://gate.ac.uk/
• Download of Approximate Gazetteer: http://gate.ac.uk/gate/doc/plugins.html#bwp or http://paginas.terra.com.br/arte/lua/?content=Software/BWPGazetteer.html or http://bruno-wp.blogspot.com/2007/03/software.html
• Levenshtein Distance:
  – Levenshtein, Binary codes capable of correcting deletions, insertions, and reversals, 1966.
  – Navarro, A guided tour to approximate string matching, 2001.