AN ADAPTIVE AND EFFICIENT ALGORITHM FOR DETECTING APPROXIMATELY DUPLICATE DATABASE RECORDS

Alvaro E. Monge
California State University, Long Beach, CECS Department, Long Beach, CA 90840-8302

June 9, 2000

Abstract: The integration of information is an important area of research in databases. By combining multiple information sources, a more complete and more accurate view of the world is attained, and additional knowledge gained. This is a non-trivial task, however. Often there are many sources which contain information about a certain kind of entity, and some will contain records concerning the same real-world entity. Furthermore, one source may not have the exact information that another source contains. Some of the information may differ, due to data entry errors for example, or may be missing altogether. Thus, one problem in integrating information sources is to identify possibly different designators of the same entity. Data cleansing is the process of purging databases of inaccurate or inconsistent data. The data is typically manipulated into a form which is useful for other tasks, such as data mining. This paper addresses the data cleansing problem of detecting database records that are approximate duplicates, but not exact duplicates. An efficient algorithm is presented which combines three key ideas. First, the Smith-Waterman algorithm for computing the minimum edit-distance is used as a domain-independent method to recognize pairs of approximate duplicates. Second, the union-find data structure is used to maintain the clusters of duplicate records incrementally, as pairwise duplicate relationships are discovered. Third, the algorithm uses a priority queue of cluster subsets to respond adaptively to the size and homogeneity of the clusters discovered as the database is scanned. This results in significant savings in the number of times that a pairwise record matching algorithm is applied, without impairing accuracy. Comprehensive experiments on synthetic databases and on a real-world database confirm the effectiveness of all three ideas.

Key words: merge/purge, data cleansing, approximate, duplicate, database, transitive closure, union-find, Smith-Waterman, edit distance

1. INTRODUCTION

Research in the areas of knowledge discovery and data cleansing has seen recent growth. The growth is due to a number of reasons. The most obvious one is the exponential growth of information available online. In particular, the biggest impact comes from the popularity of the Internet and the World Wide Web (WWW or web). In addition to the web, there are many traditional sources of information, like relational databases, which have also contributed to the growth of the information available online. The availability of these sources increases not only the amount of data, but also the variety of formats and levels of quality in which such data appears. These factors create a number of problems. The work in this paper concentrates on one such problem: the detection of multiple representations of a single entity.

Data cleansing is the process of cleaning up databases containing inaccurate or inconsistent data. One inconsistency is the existence of multiple, differing representations of the same real-world entity. The task is to detect such duplication and reconcile the differences into a single representation. The differences may be due to data entry errors such as typographical mistakes, to unstandardized abbreviations, or to differences in the detailed schemas of records from multiple databases, among other reasons. As the information in multiple sources is integrated, the same real-world entity is duplicated. The detection of records that are approximate duplicates, but not exact duplicates, in databases is therefore an important task. Without a solution to this problem, many data mining algorithms would be rendered useless, as they depend on the quality of the data being mined. This paper presents solutions to this problem.

Every duplicate detection method proposed to date, including ours, requires an algorithm for detecting "is a duplicate of" relationships between pairs of records. Section 2 summarizes an algorithm used to determine if two records represent the same entity.


Such record matching algorithms are used in database-level duplicate detection algorithms, which are presented in Section 3. The section starts out by defining the problem and identifying related work in this area. Typically the record matching algorithms are relatively expensive computationally, and the database-level duplicate detection algorithms use grouping methods to reduce the number of times that the record matcher must be applied. This is the major contribution of the work presented in this article and is presented in Sections 3.5 and 3.6. Section 4 provides an empirical evaluation of the duplicate detection algorithms, including a comparison with previous work. The article concludes in Section 6 with final remarks about this work.

2. ALGORITHMS TO MATCH RECORDS

Many knowledge discovery and database mining applications need to combine information from heterogeneous sources. These information sources, such as relational databases or worldwide web pages, provide information about the same real-world entities, but describe these entities differently. Resolving discrepancies in how entities are described is the problem addressed in this section. Specifically, the record matching problem is to determine whether or not two syntactically different record values describe the same semantic entity, i.e. real-world object.

Solving the record matching problem is vital in three major knowledge discovery tasks.

- First, the ability to perform record matching allows one to identify corresponding information in different information sources. This allows one to navigate from one source to another, and to combine information from the sources. In relational databases, navigating from one relation to another is called a "join." Record matching allows one to do joins on information sources that are not relations in the strict sense. A worldwide web knowledge discovery application that uses record matching to join separate Internet information sources, called WebFind, is described in [31, 33, 34].

- Second, the ability to do record matching allows one to detect duplicate records, whether in one database or in multiple related databases. Duplicate detection is the central issue in the so-called "Merge/Purge" task [18, 21, 35], which is to identify and combine multiple records, from one database or many, that concern the same entity but are distinct because of data entry errors. This task is also called "data scrubbing" or "data cleaning" or "data cleansing" [41]. The detection problem is the focus of this article and is studied in more detail in Section 3. This article does not propose solutions to the question of what is to be done once the duplicate records are detected.

- Third, doing record matching is one way to solve the database schema matching problem [3, 25, 29, 44, 30]. This problem is to infer which attributes in two different databases (i.e. which columns of which relations, for relational databases) denote the same real-world properties or objects. If several values of one attribute can be matched pairwise with values of another attribute, then one can infer inductively that the two attributes correspond. This technique is used to do schema matching for Internet information sources by the "information learning agent" (ILA) of [13], for example.

The remainder of this section discusses the related work in the area of record matching by first stating the record matching problem precisely. Finally, the section briefly summarizes the domain-independent record matching algorithms proposed in [32].

2.1. Defining the problem

The record matching problem has been recognized as important for at least 50 years. Since the 1950s, over 100 papers have studied matching for medical records under the name "record linkage." These papers are concerned with identifying medical records for the same individual in different databases, for the purpose of performing epidemiological studies [37]. Record matching has also been recognized as important in business for decades.


For example, tax agencies must do record matching to correlate different pieces of information about the same taxpayer when social security numbers are missing or incorrect. The earliest paper on duplicate detection in a business database is by [48]. The "record linkage" problem in business has been the focus of workshops sponsored by the US Census Bureau [24, 8, 11, 46, 47]. Record matching is also useful for detecting fraud and money laundering [40].

Almost all published previous work on record matching is for specific application domains, and hence gives domain-specific algorithms. For example, three papers discuss record matching for customer addresses [1], census records [42], or variant entries in a lexicon [22]. Other work on record matching is not domain-specific, but assumes that domain-specific knowledge will be supplied by a human for each application domain [45, 18].

One important area of research that is relevant to approximate record matching is approximate string matching. String matching has been one of the most studied problems in computer science [5, 26, 17, 15, 9, 12]. The main approach is based on edit distance [28]. Edit distance is the minimum number of operations on individual characters (e.g. substitutions, insertions, and deletions) needed to transform one string of symbols into another [39, 17, 27]. In the survey by [17], the authors consider two different problems, one under a definition of equivalence and a second using similarity. Their definition of equivalence allows only small differences in the two strings. For example, they allow alternate spellings of the same word, and ignore the case of letters. The similarity problem allows for more errors, such as those due to typing: transposed letters, missing letters, etc. The equivalence of strings is the same as the mathematical notion of equivalence: it always respects the reflexivity, symmetry, and transitivity properties. The similarity problem, on the other hand, is the more difficult problem, where any typing and spelling errors are allowed. The similarity relation is then not necessarily transitive, while it still respects the reflexivity and symmetry properties.

2.2. Proposed algorithm

The word record is used to mean a syntactic designator of some real-world object, such as a tuple in a relational database. The record matching problem arises whenever records that are not identical, in a bit-by-bit sense, may still refer to the same object. For example, one database may store the first name and last name of a person (e.g. "Jane Doe"), while another database may store only the initials and the last name of the person (e.g. "J. B. Doe"). In this work, we say that two records are equivalent if they are equal semantically, that is, if they both designate the same real-world entity. Semantically, this problem respects the reflexivity, symmetry, and transitivity properties. The record matching algorithms which solve this problem depend on the syntax of the records. These syntactic calculations are approximations of what we really want, semantic equivalence. In such calculations, errors are bound to occur and thus semantic equivalence will not always be calculated correctly. However, the claim is that there are few errors and that the approximation is good. The experiments in Section 4 provide evidence for this claim.

Equivalence may sometimes be a question of degree, so a function solving the record matching problem returns a value between 0.0 and 1.0, where 1.0 means certain equivalence and 0.0 means certain non-equivalence. This study assumes that these scores are ordinal, but not that they have any particular scalar meaning. Degree of match scores are not necessarily probabilities or fuzzy degrees of truth. An application will typically just compare scores to a threshold that depends on the domain and the particular record matching algorithm in use.

Record matching algorithms vary by the amount of domain-specific knowledge that they use. The pairwise record matching algorithms used in most previous work have been application-specific. For example, in [18], the authors use production rules based on domain-specific knowledge, which are first written in OPS5 [7] (a programming language for rule-based production systems used primarily in artificial intelligence) and then translated by hand into C. This section presents algorithms for pairwise record matching which are relatively domain independent. In particular, this work proposes to use a generalized edit-distance algorithm. This domain-independent algorithm is a variant of the well-known Smith-Waterman algorithm [43], which was originally developed for finding evolutionary relationships between biological protein or DNA sequences.


A record matching algorithm is domain-independent if it can be used without any modifications in a range of applications. By this definition, the Smith-Waterman algorithm is domain-independent under the assumptions that records have similar schemas and that records are made up of alphanumeric characters. The first assumption is needed because the Smith-Waterman algorithm does not address the problem of duplicate records containing fields which are transposed.† The second assumption is needed because any edit-distance algorithm assumes that records are strings over some fixed alphabet of symbols. Naturally this assumption is true for a wide range of databases, including those with numerical fields such as social security numbers that are represented in decimal notation.

2.3. The Smith-Waterman algorithm

Given two strings of characters, the Smith-Waterman algorithm [43] uses dynamic programming to find the lowest cost series of changes that converts one string into the other, i.e. the minimum "edit distance" weighted by cost between the strings. Costs for individual changes, which are mutations, insertions, or deletions, are parameters of the algorithm. Although edit-distance algorithms have been used for spelling correction and other text applications before, this work is the first to show how to use an edit-distance method effectively for general textual record matching. For matching textual records, we define the alphabet to be the lower case and upper case alphabetic characters, the ten digits, and three punctuation symbols: space, comma, and period. All other characters are removed before applying the algorithm. This particular choice of alphabet is not critical.

The Smith-Waterman algorithm has three parameters m, s, and c. Given the alphabet Σ, m is a |Σ| × |Σ| matrix of match scores for each pair of symbols in the alphabet. The matrix m has entries for exact matches, for approximate matches, as well as for non-matches of two symbols in the alphabet. In the original Smith-Waterman algorithm, this matrix models the mutations that occur in nature. In this work, the matrix tries to account for typical phonetic and typing errors that occur when a record is entered into a database. Much of the power of the Smith-Waterman algorithm is due to its ability to introduce gaps in the records. A gap is a sequence of non-matching symbols; these are seen as dashes in the example alignments of Figure 1. The Smith-Waterman algorithm has two parameters which affect the start and length of the gaps. The scalar s is the cost of starting a gap in an alignment, while c is the cost of continuing a gap. The ratios of these parameters strongly affect the behavior of the algorithm. For example, if the gap penalties are such that it is relatively inexpensive to continue a gap (c < s), then the Smith-Waterman algorithm prefers a single long gap over many short gaps. Intuitively, since the Smith-Waterman algorithm allows for gaps of unmatched characters, it should cope well with many abbreviations. It should also perform well when records have small pieces of missing information or minor syntactical differences, including typographical mistakes.

The Smith-Waterman algorithm works by computing a score matrix E. One of the strings is placed along the horizontal axis of the matrix, while the second string goes along the vertical axis. An entry E(i, j) in this matrix is the best possible matching score between the prefix 1...i of one string and the prefix 1...j of the second string. When the prefixes (or the entire strings) match exactly, then the optimal alignment can be found along the main diagonal. For approximate matches, the optimal alignment is within a small distance of the diagonal. Formally, the value of E(i, j) is

\[
E(i,j) = \max \begin{cases}
E(i-1,\,j-1) + m(a_i, b_j) & \\
E(i-1,\,j) - c & \text{if the alignment ending at } (i-1,\,j) \text{ ends in a gap} \\
E(i-1,\,j) - s & \text{if the alignment ending at } (i-1,\,j) \text{ ends in a match} \\
E(i,\,j-1) - c & \text{if the alignment ending at } (i,\,j-1) \text{ ends in a gap} \\
E(i,\,j-1) - s & \text{if the alignment ending at } (i,\,j-1) \text{ ends in a match}
\end{cases}
\]

where a_i and b_j are the i-th and j-th symbols of the two records. All experiments reported in this paper use the same Smith-Waterman algorithm with the same gap penalties and match matrix. The parameter values were determined using a small set of affiliation records.

† Technically, a variant of the Needleman-Wunsch [36] algorithm is actually used, which calculates the minimum weighted edit-distance between two entire strings. Given two strings, the better-known Smith-Waterman algorithm finds a substring in each string such that the pair of substrings has minimum weighted edit-distance.
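The recurrence above can be implemented directly with dynamic programming. The following Python sketch uses a standard affine-gap formulation (separate tables for alignments ending in a match and in a gap on either side) together with the concrete parameter values reported in the discussion following Figure 1. It is only an illustration of the method, not the implementation used in the experiments; in particular, charging the first character of a gap the start penalty s and each further character the continuation penalty c, and clamping negative scores to 0.0 before normalizing, are assumptions made here.

NEG = float("-inf")
APPROX_SETS = [set("dt"), set("gj"), set("lr"), set("mn"),
               set("bpv"), set("aeiou"), set(",.")]
ALPHABET = set("abcdefghijklmnopqrstuvwxyz"
               "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789 ,.")

def clean(record):
    """Keep only the characters in the alphabet used by the matcher."""
    return "".join(ch for ch in record if ch in ALPHABET)

def char_score(a, b):
    """Match matrix m: 5 exact (regardless of case), 3 approximate, -3 otherwise."""
    if a.lower() == b.lower():
        return 5
    if any(a.lower() in s and b.lower() in s for s in APPROX_SETS):
        return 3
    return -3

def match_score(rec1, rec2, gap_start=5, gap_cont=1):
    """Normalized alignment score in [0, 1]; 1.0 means a perfect match."""
    a, b = clean(rec1), clean(rec2)
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        return 0.0
    # M: best alignment of prefixes ending in a character-to-character match.
    # X: best alignment ending with a character of `a` against a gap.
    # Y: best alignment ending with a character of `b` against a gap.
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = -(gap_start + (i - 1) * gap_cont)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_start + (j - 1) * gap_cont)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = char_score(a[i - 1], b[j - 1])
            M[i][j] = max(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1]) + s
            X[i][j] = max(M[i-1][j] - gap_start, X[i-1][j] - gap_cont)
            Y[i][j] = max(M[i][j-1] - gap_start, Y[i][j-1] - gap_cont)
    best = max(M[n][m], X[n][m], Y[n][m])
    # Normalize by 5 times the length of the shorter record, clamped at 0.
    return max(0.0, best) / (5.0 * min(n, m))

An application compares the returned score to a domain-tuned threshold; the experiments below use 0.50 for the synthetic mailing-list databases and 0.65 for the bibliographic database of Section 5.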


department- of chemical engineering, stanford university, ca------lifornia
Dep------t. of Chem---. Eng-------., Stanford Univ-----., CA, USA.

psychology department, stanford univ-----------ersity, palo alto, calif
Dept. of Psychol-------------., Stanford Univ., CA, USA.

Fig. 1: Optimal record alignments produced by the Smith-Waterman algorithm.

The experiments showed that the values chosen were intuitively reasonable and provided good results. The match score matrix is symmetric, with all entries −3 except that an exact match scores 5 (regardless of case) and approximate matches score 3. An approximate match occurs between two characters if they are both in one of the sets {d t}, {g j}, {l r}, {m n}, {b p v}, {a e i o u}, and {, .} (comma and period). The penalties for starting and continuing a gap are 5 and 1 respectively. The informal experiments just mentioned show that the penalty to start a gap should be similar in absolute magnitude to the score of an exact match between two letters, while the penalty to continue a gap should be smaller than the score of an approximate match. If these conditions are met, the accuracy of the Smith-Waterman algorithm is nearly unaffected by the precise values of the gap penalties. The experiments varied the penalty for starting gaps by considering values smaller and greater than an exact match. Similarly, the penalty to continue a gap was varied by considering values greater than 0.0. The final score calculated by the algorithm is normalized to range between 0.0 and 1.0 by dividing by 5 times the length of the smaller of the two records being compared.

Figure 1 shows two typical optimal alignments produced by the Smith-Waterman algorithm with the choice of parameter values described. The records shown are taken from datasets used in experiments for measuring the accuracy of the Smith-Waterman algorithm and other record matching algorithms [32]. These examples show that with the chosen values for the gap penalties, the algorithm detects abbreviations by introducing gaps where appropriate. The second pair of records also shows the inability of the Smith-Waterman algorithm to match out-of-order subrecords.

The Smith-Waterman algorithm uses dynamic programming and its running time is proportional to the product of the lengths of its input strings. This quadratic time complexity is similar to that of other more basic record matching algorithms [32]. The Smith-Waterman algorithm is symmetric: the score of matching record A to B is the same as the score of matching B to A. Symmetry may be a natural requirement for some applications of record matching but not for others. For example, the name "Alvaro E. Monge" matches "A. E. Monge" while the reverse is not necessarily true.

3. ALGORITHMS TO DETECT DUPLICATE DATABASE RECORDS

This section considers the problem of detecting when records in a database are duplicates of each other, even if they are not textually identical. If multiple duplicate records concern the same real-world entity, they must be detected in order to have a consistent database. Multiple records for a single entity may exist because of typographical data entry errors, because of unstandardized abbreviations, or because of differences in the detailed schemas of records from multiple databases, among other reasons. Thus, the problem is one of consolidating the records in these databases, so that an entity is represented by a single record. This is a necessary and crucial preprocessing step in data warehousing and data mining applications, where data is collected from many different sources and inconsistencies can lead to erroneous results. Before performing data analysis operations, the data must be preprocessed and organized into a consistent form.

3.1. Related work

The duplicate detection problem is different from, but related to, the schema matching problem [3, 25, 29, 44]. That problem is to find the correspondence between the structure of records in one database and the structure of records in a different database.


The problem of actually detecting matching records still exists even when the schema matching problem has been solved. For example, consider records from different databases that include personal names. The fact that there are personal name attributes in each record is detected by schema matching. However, record-level approximate duplicate detection is still needed in order to combine different records concerning the same person. Record-level duplicate detection, or record matching, may be needed because of typographical errors or varying abbreviations in related records. Record matching may also be used as a substitute for detailed schema matching, which may be impossible for semi-structured data. For example, records often differ in the detailed format of personal names or addresses. Even if records follow a fixed high-level schema, some of their fields may not follow a fixed low-level schema, i.e. the division of fields into subfields may not be standardized. In general, we are interested in situations where several records may refer to the same real-world entity, while not being syntactically equivalent.

A set of records that refer to the same entity can be interpreted in two ways. One way is to view one of the records as correct and the other records as duplicates containing erroneous information. The task then is to cleanse the database of the duplicate records [41, 18]. Another interpretation is to consider each matching record as a partial source of information. The aim is then to merge the duplicate records, yielding one record with more complete information [21].

3.2. The standard method and its improvements

The standard method of detecting exact duplicates in a table is to sort the table and then to check if neighboring tuples are identical. Exact duplicates are guaranteed to be next to each other in the sorted order regardless of which part of a record the sort is performed on. There are a number of optimizations of this approach, and these are described in [4]. The approach can be extended to detect approximate duplicates. The idea is to do sorting to achieve preliminary clustering, and then to do pairwise comparisons of nearby records [38, 14, 16]. In this case, there are no guarantees as to where duplicates are located relative to each other in the sorted order. In a good scenario, the approximate duplicate records may not be found next to each other but will be found nearby. In the worst case, they will be found at opposite extremes of the sorted order. The result depends on the field used to sort and on the probability of error in that field. Thus, sorting is typically based on an application-specific key chosen to make duplicate records likely to appear near each other.

In [18], the authors compare nearby records by sliding a window of fixed size over the sorted database. If the window has size W, then record i is compared with records i − W + 1 through i − 1 if i ≥ W, and with records 1 through i − 1 otherwise. The number of comparisons performed is O(TW), where T is the total number of records in the database. In order to improve accuracy, the results of several passes of duplicate detection can be combined [38, 24]. Typically, combining the results of several passes over the database with small window sizes yields better accuracy for the same cost than one pass over the database with a large window size. One way to combine the results of multiple passes is by explicitly computing the transitive closure of all discovered pairwise "is a duplicate of" relationships [18]. If record R1 is a duplicate of record R2, and record R2 is a duplicate of record R3, then by transitivity R1 is a duplicate of record R3. Transitivity is true by definition if duplicate records concern the same real-world entity, but in practice there will always be errors in computing pairwise "is a duplicate of" relationships, and transitivity will propagate these errors. However, in typical databases, sets of duplicate records tend to be distributed sparsely over the space of possible records, and the propagation of errors is rare. This claim is confirmed by the experimental results of [18, 19] and by Section 4 of this paper.

Hylton uses a different, more expensive, method to do a preliminary grouping of records [21]. Each record is considered separately as a "source record" and used to query the remaining records in order to create a group of potentially matching records. Then each record in the group is compared with the source record using his pairwise matching procedure.
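The sorted-neighborhood window scan just described can be sketched in a few lines. In the sketch below, sort_key and record_match are hypothetical placeholders for an application-specific sorting key and any boolean pairwise matcher; it is an illustration of the method of [18], not their implementation.

def window_pass(records, sort_key, record_match, window=10):
    """One sorted-neighborhood pass: compare each record with the W-1 records
    that precede it in the sorted order (O(T*W) comparisons in total)."""
    ordered = sorted(records, key=sort_key)
    pairs = []                       # discovered "is a duplicate of" pairs
    for i, rec in enumerate(ordered):
        for j in range(max(0, i - window + 1), i):
            if record_match(ordered[j], rec):
                pairs.append((ordered[j], rec))
    return pairs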


Finally, similarity of entire documents is also related to this body of work. In [6], the authors provide a method for determining document similarity and use it to build a clustering of syntactically similar documents. It would be expensive to try to compare documents in their entirety. Thus, the authors calculate a sketch of each document, where the size of a sketch is on the order of hundreds of bytes. The sketch is based on the unique contiguous subsequences of words contained in the document, called shingles by the authors. The authors show that document similarity is not compromised if the sketches of documents are compared instead of the entire documents. To compute the clusters of similar documents, the authors must first calculate the number of shingles shared between documents. The more shingles two documents have in common, the more similar they are. When two documents share enough shingles to be deemed similar, they are put in the same cluster. To maintain the clusters, the authors also use the union-find data structure. However, before any cluster gets created, all the document comparisons have been performed. As we will see later, the algorithm presented in this paper uses the union-find data structure more efficiently, allowing many record comparisons to be avoided entirely. In addition, the system also queries the union-find data structure to improve accuracy by performing some additional record comparisons.
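For concreteness, a toy rendering of the shingling idea of [6] follows. It uses word k-grams and plain Jaccard resemblance between full shingle sets; the actual system of [6] compares small fixed-size sketches sampled from the shingle sets rather than the full sets, so this is only an illustration.

def shingles(text, k=4):
    """The set of contiguous word k-grams (shingles) of a document."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def resemblance(doc_a, doc_b, k=4):
    """Fraction of shingles shared by the two documents (Jaccard resemblance)."""
    sa, sb = shingles(doc_a, k), shingles(doc_b, k)
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)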

3.3. Transitivity and the duplicate detection problem

Under the assumption of transitivity, the problem of detecting duplicates in a database can be described in terms of keeping track of the connected components of an undirected graph. Let the vertices of a graph G represent the records in a database of size T. Initially, G will contain T unconnected vertices, one for each record in the database. There is an undirected edge between two vertices if and only if the records corresponding to the pair of vertices are found to match, according to the pairwise record matching algorithm. When considering whether to apply the expensive pairwise record matching algorithm to two records, we can query the graph G. If both records are in the same connected component, then it has been determined previously that they are approximate duplicates, and the comparison is not needed. If they belong to different components, then it is not known whether they match or not. If comparing the two records results in a match, their respective components should be combined to create a single new component. This is done by inserting an edge between the vertices that correspond to the records compared. At any time, the connected components of the graph G correspond to the transitive closure of the "is a duplicate of" relationships discovered so far.

Consider three records Ru, Rv, and Rw and their corresponding nodes u, v, and w. When the fact that Ru is a duplicate of record Rv is detected, an edge is inserted between the nodes u and v, thus putting both nodes in the same connected component. Similarly, when the fact that Rv is a duplicate of Rw is detected, an edge is inserted between nodes v and w. Transitivity of the "is a duplicate of" relation is equivalent to reachability in the graph. Since w is reachable from u (and vice versa), the corresponding records Ru and Rw are duplicates. This "is a duplicate of" relationship is detected automatically by maintaining the graph G, without comparing Ru and Rw.

3.4. The Union-Find data structure

There is a well-known data structure that efficiently solves the problem of incrementally maintaining the connected components of an undirected graph, called the union-find data structure [20, 10]. This data structure keeps a collection of disjoint updatable sets, where each set is identified by a representative member of the set. Each set corresponds to a connected component of the graph. The data structure has two operations:

Union(x, y) combines the sets that contain node x and node y, say Sx and Sy, into a new set that is their union Sx ∪ Sy. A representative for the union is chosen, and the new set replaces Sx and Sy in the collection of disjoint sets.

Find(x) returns the representative of the unique set containing x. If Find(x) is invoked twice without modifying the set between the requests, the answer is the same.
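A compact implementation of these two operations, in one standard variant with path compression and union by size, looks as follows; it is a sketch of the data structure, not the specific implementation used in the experiments.

class UnionFind:
    """Disjoint-set data structure with path compression and union by size."""

    def __init__(self, elements):
        self.parent = {x: x for x in elements}   # one singleton set per element
        self.size = {x: 1 for x in elements}

    def find(self, x):
        """Return the representative of the set containing x."""
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[x] != root:            # path compression
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, x, y):
        """Merge the sets containing x and y; return the new representative."""
        rx, ry = self.find(x), self.find(y)
        if rx == ry:
            return rx
        if self.size[rx] < self.size[ry]:        # union by size
            rx, ry = ry, rx
        self.parent[ry] = rx
        self.size[rx] += self.size[ry]
        return rx

Creating one singleton set per record and calling union(u, v) for each discovered duplicate pair maintains the connected components incrementally, exactly as described in the next paragraph.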


To find the connected components of a graph G, we first create |G| singleton sets, each containing a single node from G. For each edge (u, v) ∈ E(G), if Find(u) ≠ Find(v) then we perform Union(u, v). At any time, two nodes u and v are in the same connected component if and only if their sets have the same representative, that is, if and only if Find(u) = Find(v). Note that the problem of incrementally computing the connected components of a graph is harder than just finding the connected components. There are linear time algorithms for finding the connected components of a graph. However, here we require the union-find data structure because we need to find the connected components incrementally, as duplicate records are detected.

3.5. Further improvements on the standard algorithm

The previous section described a way in which to maintain the clusters of duplicate records and compute the transitive closure of "is a duplicate of" relationships incrementally. This section uses the union-find data structure to improve the standard method for detecting approximate duplicate records. As done by other algorithms, the algorithm performs multiple passes of sorting and scanning. Whereas previous algorithms sort the records in each pass according to domain-specific criteria, this work proposes to use domain-independent sorting criteria. Specifically, the algorithm uses two passes. The first pass treats each record as one long string and sorts these lexicographically, reading from left to right. The second pass does the same reading from right to left.

After sorting, the algorithm scans the database with a fixed size window. Initially, the union-find data structure (i.e. the collection of dynamic sets) contains one set per record in the database. The window slides through the records in the sorted database one record at a time (i.e. windows overlap). In the standard window method, the new record that enters the window is compared with all other records in the window. The same is done in this algorithm, with the exception that some of these comparisons are unnecessary. A comparison is not performed if the two records are already in the same cluster. This can be easily determined by querying the union-find data structure. When considering the new record Rj in the window and some record Ri already in the window, the algorithm first tests whether they are in the same cluster. This involves comparing their respective cluster representatives, that is, comparing the value of Find(Rj) and Find(Ri). If both these values are the same, then no comparison is needed because both records belong to the same cluster or connected component. Otherwise the two records are compared. When the comparison is successful, a new "is a duplicate of" relationship is established. To reflect this in the union-find data structure, the algorithm combines the clusters corresponding to Rj and Ri by making the function call Union(Rj, Ri).

Section 4 has the results of experiments comparing this improved algorithm to the standard method. We expect that the improved algorithm will perform fewer comparisons. Fewer comparisons usually translate to decreased accuracy. However, similar accuracy is expected here because the comparisons which are not performed correspond to records which are already members of a cluster, most likely due to the transitive closure of the "is a duplicate of" relationships. In fact, all experiments show that the improved algorithm is as accurate as the standard method while performing significantly fewer record comparisons.
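A sketch of this modified scan follows, assuming a UnionFind structure like the one in Section 3.4, keyed by stable record identifiers so that the clusters persist across the two sorting passes, and any boolean pairwise matcher. It is an illustration of the idea, not the author's implementation.

def window_pass_with_unionfind(ordered, uf, record_match, window=10):
    """`ordered` is a list of (record_id, record) pairs in sorted order;
    `uf` is a UnionFind keyed by record_id, shared between the two passes."""
    comparisons = 0
    for i, (rid_i, rec_i) in enumerate(ordered):
        for j in range(max(0, i - window + 1), i):
            rid_j, rec_j = ordered[j]
            if uf.find(rid_i) == uf.find(rid_j):   # already in the same cluster
                continue                           # skip the expensive comparison
            comparisons += 1
            if record_match(rec_j, rec_i):
                uf.union(rid_i, rid_j)             # record the new duplicate pair
    return comparisons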

3.6. The overall priority queue algorithm

The algorithm described in the previous section has the weakness that the window used for scanning the database records is of fixed size. If a cluster in the database has more duplicate records than the size of the window, then it is possible that some of these duplicates will not be detected because not enough comparisons are being made. Furthermore, if a cluster has very few duplicates or none at all, then it is possible that comparisons are being done which may not be needed. An algorithm is needed which responds adaptively to the size and homogeneity of the clusters discovered as the database is scanned. This section describes such a strategy, which is the high-level approach adopted in the duplicate detection algorithm proposed in this work. Before describing the algorithm, we need to analyze the fixed size window method.


The fixed size window algorithm effectively saves the last |W| − 1 records for possible comparisons with the new record that enters the window as it slides by one record. The key observation to make is that in most cases, it is unnecessary to save all these records. The evidence of this is that sorting has already placed approximate duplicate records near each other. Thus, most of the |W| − 1 records in the window already belong to the same cluster. The new record will either become a member of that cluster, if it is not already a member of it, or it will be a member of an entirely different cluster. In either case, exactly one comparison per cluster represented in the window is needed. Since in most cases all the records in the window will belong to the same cluster, only one comparison will be needed. Thus, instead of saving individual records in a window, the algorithm saves clusters. This leads to the use of a priority queue, in place of a window, to save record clusters. The rest of this section describes this strategy as it is embedded in the duplicate detection system.

First, like the algorithm described in the previous section, two passes of sorting and scanning are performed. The algorithm scans the sorted database with a priority queue of record subsets belonging to the last few clusters detected. The priority queue contains a fixed number of sets of records. In all the experiments reported below this number is 4. Each set contains one or more records from a detected cluster. For efficiency reasons, entire clusters should not always be saved since they may contain many records. On the other hand, a single record may be insufficient to represent all the variability present in a cluster. Records of a cluster will be saved in the priority queue only if they add to the variability of the cluster being represented. The set representing the cluster with the most recently detected cluster member has highest priority in the queue, and so on.

The algorithm scans through the sorted database sequentially. Suppose that record Rj is the record currently being considered. The algorithm first tests whether Rj is already known to be a member of one of the clusters represented in the priority queue. This test is done by comparing the cluster representative of Rj to the representative of each cluster present in the priority queue. If one of these comparisons is successful, then Rj is already known to be a member of the cluster represented by the set in the priority queue. We move this set to the head of the priority queue and continue with the next record, Rj+1. Whatever their result, these comparisons are computationally inexpensive because they are done just with Find operations. In the first pass, these Find comparisons are guaranteed to fail since the algorithm scans the records in the sorted database sequentially and this is the first time each record is encountered. Therefore these tests are avoided in the first pass.

Next, in the case where Rj is not a known member of an existing priority queue cluster, the algorithm uses the Smith-Waterman algorithm to compare Rj with records in the priority queue. The algorithm iterates through each set in the priority queue, starting with the highest priority set. For each set, the algorithm scans through the members Ri of the set. Rj is compared to Ri using the Smith-Waterman algorithm. If a match is found, then Rj's cluster is combined with Ri's cluster, using a Union(Ri, Rj) operation. In addition, Rj may also be included in the priority queue set that represents Ri's cluster (and now also represents the new combined cluster).
Specifically, Rj is included if its Smith-Waterman matching score is below a certain "strong match" threshold. This priority queue cluster inclusion threshold is higher than the threshold for declaring a match, but lower than 1.0. Intuitively, if Rj is very similar to Ri, it is not necessary to include it in the subset representing the cluster, but if Rj is only somewhat similar, i.e. its degree of match is below the inclusion threshold, then including Rj in the subset will help in detecting future members of the cluster. On the other hand, if the Smith-Waterman comparison between Ri and Rj yields a very low score, below a certain "bad miss" threshold, then the algorithm continues directly with the next set in the priority queue. The intuition here is that if Ri and Rj have no similarity at all, then comparisons of Rj with other members of the cluster containing Ri will likely also fail. If the comparison fails but the score is close to the matching threshold, then it is worthwhile to compare Rj with the remaining members of the cluster. The "strong match" and "bad miss" thresholds are used to counter the errors which are propagated when computing pairwise "is a duplicate of" relationships.

Finally, if Rj is compared to members of each set in the priority queue without detecting that it is a duplicate of any of these, then Rj must be a member of a cluster not currently represented in the priority queue. In this case Rj is saved as a singleton set in the priority queue, with the highest priority. If this action causes the size of the priority queue to exceed its limit, then the lowest priority set is removed from the priority queue.
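One pass of the complete scan, including the singleton-insertion and eviction step just described, can be sketched as follows. Here sw_score is a normalized Smith-Waterman matcher, uf is a union-find keyed by record identifiers, and the strong-match and bad-miss thresholds are illustrative placeholders (the paper fixes only the match threshold, e.g. 0.5 for the mailing-list data). This is a simplified rendering of the strategy, not the author's implementation.

def priority_queue_pass(ordered, uf, sw_score, first_pass,
                        queue_size=4, match=0.5, strong=0.8, bad_miss=0.25):
    queue = []                          # cluster subsets, highest priority first
    for rid, rec in ordered:            # records in the current sorted order
        if not first_pass:
            # Cheap test: is rec already in a cluster represented in the queue?
            hit = next((s for s in queue
                        if uf.find(rid) == uf.find(s[0][0])), None)
            if hit is not None:
                queue.remove(hit)
                queue.insert(0, hit)    # move its subset to the head
                continue
        matched = None
        for subset in queue:            # expensive comparisons, by priority
            for mid, mrec in subset:
                score = sw_score(rec, mrec)
                if score >= match:
                    uf.union(rid, mid)
                    if score < strong:          # only somewhat similar:
                        subset.append((rid, rec))   # keep it for variability
                    matched = subset
                    break
                if score < bad_miss:            # hopeless: skip rest of cluster
                    break
            if matched:
                break
        if matched:
            queue.remove(matched)
            queue.insert(0, matched)
        else:
            queue.insert(0, [(rid, rec)])       # new singleton cluster subset
            if len(queue) > queue_size:
                queue.pop()                     # evict the lowest-priority set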


Equational theory   Smith-Waterman score   Soc. Sec. number   Name                 Address                  City, State, Zip code
True positive       0.6851                 missing            Colette Johnen       600 113th St. apt. 5a5   missing
                                           missing            John Colette         600 113th St. ap. 585    missing
False negative      0.4189                 152014425          Bahadir T Bihsya     220 Jubin 8s3            Toledo OH 43619
                                           152014423          Bishya T ulik        318 Arpin St 1p2         Toledo OH 43619
False positive      0.3619                 274158217          Frankie Y Gittler    PO Box 3628              Gresham, OR 97080
                                           267415817          Erlan W Giudici      PO Box 2664              Walton, OR 97490
False positive      0.1620                 760652621          Arseneau N Brought   949 Corson Ave 515       Blanco NM 87412
                                           765625631          Bogner A Kuxhausen   212 Corson Road 0o3      Raton, NM 87740

Table 1: Example pairs of records and the status of the matching algorithms.

4. EXPERIMENTAL RESULTS

The first experiments reported here use databases that are mailing lists generated randomly by software designed and implemented by [19]. Each record in a mailing list contains nine fields: social security number, first name, middle initial, last name, address, apartment, city, state, and zip code. All field values are chosen randomly and independently. Personal names are chosen from a list of 63000 real names. Address fields are chosen from lists of 50 state abbreviations, 18670 city names, and 42115 zip codes. Once the database generator creates a random record, it creates a random number of duplicate records according to a fixed probability distribution. When it creates a duplicate record, the generator introduces errors (i.e. noise) into the record. Possible errors range from small typographical slips to complete name and address changes. The generator introduces typographical errors according to frequencies known from previous research on spelling correction algorithms [27]. Edit-distance algorithms are designed to detect some of the errors introduced; however, our algorithm was developed without knowledge of the particular error probabilities used by the database generator. The pairwise record matching algorithm of [18] has special rules for transpositions of entire words, complete changes in names and zip codes, and social security number omissions, while our Smith-Waterman algorithm variant does not.

Table 1 contains example pairs of records chosen as especially instructive by [19], with pairwise scores assigned by the Smith-Waterman algorithm. The first pair is correctly detected to be duplicates by the rules of [18]. The Smith-Waterman algorithm classifies it as a duplicate given any threshold below 0.68. The equational theory does not detect the second pair as duplicates. The Smith-Waterman algorithm performs correctly on this pair when the duplicate detection threshold is set at 0.41 or lower. Finally, the equational theory falsely finds the third and fourth pairs to be duplicates. The Smith-Waterman algorithm performs correctly on these pairs with a threshold of 0.37 or higher. These examples suggest that we should choose a threshold around 0.40. However, this threshold is somewhat aggressive. Small further experiments show that a more conservative threshold of 0.50 detects most real duplications while keeping the number of false positives negligible.

4.1. Measuring accuracy

The measure of accuracy used in this paper is based on the number of clusters detected that are "pure". A cluster is pure if and only if it contains only records that belong to the same true cluster of duplicates. This accuracy measure considers entire clusters, not individual records. This is intuitive, since a cluster corresponds to a real world entity, while individual records do not.

Fig. 2: Accuracy results for varying the number of duplicates per original record using a Zipf distribution. (The plot shows the number of clusters versus the average number of duplicates per original record: the number of true clusters, the pure clusters detected by PQS w/SW and by PQS w/HS, the pure and impure clusters detected by merge/purge, and the impure clusters.)

A cluster detected by a duplicate detection algorithm can be classified as follows:

1. the cluster is equal to a true cluster, or
2. the cluster is a subset of a true cluster, or
3. the cluster contains parts of two or more true clusters.

By this definition, a pure cluster falls in either of the first two cases above. Clusters that fall in the last case are referred to as "impure" clusters. A good detection algorithm will have 100% of the detected clusters pure and 0% impure.

4.2. Algorithms tested

The sections that follow provide the results from experiments performed in this study. Several algorithms are compared, where each is made up of different features. The main features that make up an algorithm are the pairwise record matching algorithm used and the structure used for storing records for possible comparisons. The three main algorithms compared are the so-called "Merge/Purge" algorithm [18, 19] and two versions of the algorithm from Section 3.6. One version, "PQS w/SW", uses the Smith-Waterman algorithm to match records, while the second version, "PQS w/HS", uses the equational theory described in [18]. For short, the priority queue strategy is abbreviated PQS. Both PQS-based algorithms use the union-find data structure described in Section 3.4. Both also use the priority queue of cluster subsets discussed in Section 3.6. The equational theory matcher returns only either 0 or 1, so unlike the Smith-Waterman algorithm, it does not estimate degrees of pairwise matching. Thus, we need to modify the strategy of keeping in the priority queue a set of representative records for a cluster. In the current implementation of the PQS w/HS algorithm, only the most recently detected member of each cluster is kept.

For both PQS-based algorithms, the figures show the number of pure and impure clusters that were detected. The figures also show the number of true clusters in the database, and the number of clusters detected by merge-purge. Unfortunately the merge-purge software does not distinguish between pure and impure clusters. The accuracy results reported for it here include both pure and impure clusters, thus slightly overstating its accuracy.
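When the true clusters are known, as they are for the synthetic databases, the pure/impure classification of Section 4.1 reduces to a simple check; a minimal sketch, where true_cluster is assumed to map each record identifier to the identifier of its true cluster:

def count_pure_impure(detected_clusters, true_cluster):
    """A detected cluster is pure if all its records come from one true cluster
    (cases 1 and 2 above) and impure otherwise (case 3)."""
    pure = impure = 0
    for cluster in detected_clusters:          # each cluster is a set of record ids
        if len({true_cluster[r] for r in cluster}) == 1:
            pure += 1
        else:
            impure += 1
    return pure, impure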

The sections that follow provide the results from experiments performed in this study. Several algorithms are compared, where each is made up of di erent features. The main features that make up an algorithm are the pairwise record matching algorithm used and the structure used for storing records for possible comparisons. The three main algorithms compared are the so-called \Merge/Purge" algorithm [18, 19] and two versions of the algorithm from Section 3.6. One version, \PQS w/SW", uses the Smith-Waterman algorithm to match records, while the second version, \PQS w/HS", uses the equational theory described in [18]. For short, priority queue strategy is abbreviated by PQS. Both PQS-based algorithms use the union- nd data structure described in Section 3.4. Both also use the priority queue of cluster subsets discussed in Section 3.6. The equational theory matcher returns only either 0 or 1, so unlike the Smith-Waterman algorithm, it does not estimate degrees of pairwise matching. Thus, we need to modify the strategy of keeping in the priority queue a set of representative records for a cluster. In the current implementation of the PQS w/HS algorithm, only the most recently detected member of each cluster is kept. For both PQS-based algorithms, the gures show the number of pure and impure clusters that were detected. The gures also show the number of true clusters in the database, and the number of clusters detected by merge-purge. Unfortunately the merge-purge software does not distinguish between pure and impure clusters. The accuracy results reported here include both pure and impure clusters thus slightly overstating its accuracy. 4.3. Varying number of duplicates per record

4.3. Varying number of duplicates per record

A good duplicate detection algorithm should be almost unaffected by changes in the number of duplicates that each record has. To study the effect of increasing this number, we varied the number of duplicates per record using a Zipf distribution. Zipf distributions give high probability to small numbers of duplicates, but still give non-trivial probability to large numbers of duplicates. A Zipf distribution has two parameters 0 ≤ θ ≤ 1 and D ≥ 1. For 1 ≤ i ≤ D, the probability of i duplicates is c·i^(θ−1), where the normalization constant is c = 1 / Σ_{i=1}^{D} i^(θ−1).

Fig. 3: Accuracy results for varying database sizes (log-log plot). (The plot shows the number of clusters versus the total number of records in the database: the number of true clusters, and the impure clusters detected by U/F + |W|=10 w/SW, U/F + PQS w/SW, and U/F + PQS w/HS.)

Having a maximum number of duplicates D is necessary because Σ_{i=1}^{∞} i^(θ−1) diverges for θ ≥ 0. Four databases were created, each with a different value of the parameter θ from the set {0.1, 0.2, 0.4, 0.8}. The maximum number of duplicates per original record was kept constant at 20. The noise level was also maintained constant. The sizes of the databases ranged from 301153 to 480239 total records. In all experiments, the Merge/Purge engine was run with a fixed window of size 10, as in most experiments performed by [18]. Our duplicate detection algorithm used a priority queue containing at most 4 sets of records. This number was chosen to make the accuracy of both algorithms approximately the same. Of course, it is easy to run our algorithm with a larger priority queue in order to obtain greater accuracy.

Figure 2 shows that our algorithm performs slightly better than the Merge/Purge engine. The number of pure clusters detected by both algorithms increases slowly as the value of θ is increased. This increase constitutes a decrease in accuracy, since we want to get as close to the number of true clusters as possible. As desired, the number of impure clusters remains very small throughout. The fact that nearly 100% of the detected clusters are pure suggests that we could relax various parameters of our algorithm in order to combine more clusters, without erroneously creating too many impure clusters.
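For reference, the truncated Zipf distribution defined at the beginning of this subsection can be sampled directly as below. This only illustrates the distribution; it is not the database generator of [19].

import random

def sample_num_duplicates(theta, D, rng=random):
    """Sample i in {1, ..., D} with probability c * i**(theta - 1),
    where c normalizes the probabilities to sum to one."""
    weights = [i ** (theta - 1) for i in range(1, D + 1)]
    c = 1.0 / sum(weights)
    r, acc = rng.random(), 0.0
    for i, w in enumerate(weights, start=1):
        acc += c * w
        if r <= acc:
            return i
    return D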

4.4. Varying the size of the database

These experiments study how the total size of the database affects the accuracy of duplicate detection. We consider databases with respectively 10000, 20000, 40000, 80000 and 120000 original records. In each case, duplicates are generated using a Zipf distribution with a high noise level, Zipf parameter θ = 0.40, and 20 maximum duplicates per original record. The largest database considered here contains over 900000 records in total.

Figures 3 and 4 show the performance of the PQS-based algorithms and the merge-purge algorithm on these databases. Again, the same algorithm parameters as before are used. The figures clearly display the benefits of the PQS-based algorithm. Figure 3 shows that the number of clusters detected by all strategies is similar, with the PQS-based strategy having slightly better accuracy. While the algorithms detect nearly the same number of clusters, they do not achieve this accuracy with similar numbers of record comparisons. As shown in Figure 4, the PQS-based algorithms perform many fewer pairwise record comparisons. For the largest database tested, the PQS-based strategy performs about 3.0 million comparisons, while the merge-purge algorithm performs about 18.8 million comparisons. This is six times as many comparisons as the PQS-based algorithm that uses the same pairwise matching method and achieves essentially the same accuracy.

Overall, these experiments show the significant improvement that the union-find data structure and the priority queue of cluster subsets strategy have over the merge-purge algorithm.

Fig. 4: Number of comparisons performed by the algorithms (log-log plot). (The plot shows the number of record comparisons versus the total number of records in the database for merge/purge, U/F + |W|=10 w/SW, U/F + PQS w/SW, and U/F + PQS w/HS.)

The best depiction of this is in comparing the merge-purge algorithm with the PQS w/HS algorithm. In both of these cases, the exact same record matching function is used. The difference is in the number of times this function is applied. The merge-purge algorithm applies the record matching function on records that fall within a fixed size window, thus making unnecessary record comparisons. The PQS w/HS and the PQS w/SW algorithms apply the record matching function more effectively through the use of the union-find data structure and the priority queue of sets. This savings in the number of comparisons performed is crucial when dealing with very large databases. Our algorithm responds adaptively to the size and the homogeneity of the clusters discovered as the database is scanned. These results do not depend on the record matching algorithm which is used. Instead, the savings are due to the maintenance of the clusters in the union-find data structure and the use of the priority queue to determine which records to compare. In addition to these benefits, the experiments also show that there is no loss in accuracy when using the Smith-Waterman algorithm over one which uses domain-specific knowledge.

5. DETECTING APPROXIMATE DUPLICATE RECORDS IN A REAL BIBLIOGRAPHIC DATABASE

This section looks at the effectiveness of the algorithm on a real database of bibliographic records describing documents in various fields of computer science published in several sources. The database is a slightly larger version of one used by [21]. He presents an algorithm for detecting bibliographic records which refer to the same work. The task here is not to purge the database of duplicate records, but instead to create clusters that contain all records about the same entity. An entity in this context is one document, called a "work", that may exist in several versions. A work may be a technical report which later appears in the form of a conference paper, and still later as a journal paper. While the bibliographic records contain many fields, the algorithm of [21] considers only the author and the title of each document, as do the methods tested here.

The bibliographic records were gathered from two major collections available over the Internet. The primary source is A Collection of Computer Science Bibliographies assembled by Alf-Christian Achilles [2]. Over 200000 records were taken from this collection, which currently contains over 600000 BibTeX records. The secondary source is a collection of computer science technical reports produced by five major universities in the CS-TR project [49, 23]. This contains approximately 6000 records. In total, the database used contains 254618 records. Since the records come from a collection of bibliographies, the database contains multiple BibTeX records for the same document. In addition, due to the different sources, the records are subject to typographical errors, errors in the accuracy of the information they provide, variation in how they abbreviate author names, and more.


Cluster size   Number of clusters   Number of records   % of all records
1              118149               118149              46.40%
2              28323                56646               22.25%
3              8403                 25209               9.90%
4              3936                 15744               6.18%
5              1875                 9375                3.68%
6              1033                 6198                2.43%
7              626                  4382                1.72%
8+             1522                 18915               7.40%
total          163867               254618              100.00%
[21]           162535               242705              100.00%

Table 2: Results of duplicate detection on a database of bibliographic records.

To apply the duplicate detection algorithm to the database, we first created simple representative records from the complete BibTeX records. Each derived record contains the author names and the document title from one BibTeX record. As in all other experiments, the PQS-based algorithm uses two passes over the database and uses the same domain-independent sorting criteria. Small experiments allowed us to determine that the best Smith-Waterman algorithm threshold for a match in this database was 0.65. The threshold is higher for this database because it has less noise than the synthetic databases used in the other experiments. Results for the PQS-based algorithm using a priority queue size of 4 are presented in Table 2. The algorithm detected a total of 163867 clusters, with an average of 1.60 records per cluster. The true number of duplicate records in this database is not known. However, based on visual inspection, the great majority of detected clusters are pure. The number of clusters detected by the PQS-based algorithm is comparable to the results of [21] on almost the same database. However, Hylton reports making 7.5 million comparisons to determine the clusters, whereas the PQS-based algorithm performs just over 1.6 million comparisons. This savings of over 75% is comparable to the savings observed on the synthetic databases.

6. CONCLUSION

The integration of information sources is an important area of research. There is much to be gained from integrating multiple information sources. However, there are many obstacles that must be overcome to obtain valuable results from this integration. This article has explored and provided solutions to some of the problems to be overcome in this area. In particular, to integrate data from multiple sources, one must first identify the information which is common to these sources. Different record matching algorithms were presented that determine the equivalence of records from these sources. Section 2.3 presents the Smith-Waterman algorithm, which should be useful for typical alphanumeric records that contain fields such as names, addresses, titles, dates, identification numbers, and so on. The Smith-Waterman algorithm was successfully applied to the problem of detecting duplicate records in databases of mailing addresses and of bibliographic records without any changes to the algorithm. Although the Smith-Waterman component does have several tunable parameters, in typical alphanumeric domains we are confident that the numerical parameters suggested in Section 2.3 can be used without change. The one parameter that should be changed for different applications is the threshold for declaring a match. This threshold is easy to set by examining a small number of pairs of records whose true matching status is known. In Section 5, we used the Smith-Waterman algorithm in detecting duplicate bibliographic records and compared it to the algorithm developed by Hylton [21]. We cannot perform experiments that use the equational theory of [18] because that equational theory only applies to mailing list records.


The duplicate detection methods described in this work improve on previous related work in three ways. The first contribution is an approximate record matching algorithm that is relatively domain-independent. However, this algorithm, an adaptation of the Smith-Waterman algorithm, does have parameters that can in principle be optimized (perhaps automatically) to provide better accuracy in specific applications. The second contribution is to show how to compute the transitive closure of "is a duplicate of" relationships incrementally, using the union-find data structure. The third contribution is a heuristic method for minimizing the number of expensive pairwise record comparisons that must be performed when comparing individual records with potential duplicates. It is important to note that the second and third contributions can be combined with any pairwise record matching algorithm. In particular, we performed experiments on two algorithms that incorporated these contributions but used different record matching algorithms. The experiments resulted in high duplicate detection accuracy while performing significantly fewer record comparisons than previous related work.

REFERENCES

[1] J. Ace, B. Marvel, and B. Richer. Matchmaker matchmaker find me the address (exact address match processing). Telephone Engineer and Management, 96(8):50, 52–53 (1992).
[2] Alf-Christian Achilles. A collection of computer science bibliographies. URL, http://liinwww.ira.uka.de/bibliography/index.html (1996).
[3] C. Batini, M. Lenzerini, and S. Navathe. A comparative analysis of methodologies for database schema integration. ACM Computing Surveys, 18(4):323–364 (1986).
[4] D. Bitton and D. J. DeWitt. Duplicate record elimination in large data files. ACM Transactions on Database Systems, 8(2):255–265 (1983).
[5] Robert S. Boyer and J. Strother Moore. A fast string-searching algorithm. Communications of the ACM, 20(10):762–772 (1977).
[6] Andrei Broder, Steve Glassman, Mark Manasse, and Geoffrey Zweig. Syntactic clustering of the web. In Proceedings of the Sixth International World Wide Web Conference, pp. 391–404, http://www.scope.gmd.de/info/www6/technical/paper205/paper205.html (1997).
[7] L. Brownston, R. Farrell, and E. Kant. Programming Expert Systems in OPS5: An Introduction to Rule-Based Programming. Addison-Wesley Publishing Company (1985).
[8] U.S. Census Bureau, editor. U.S. Census Bureau's 1997 Record Linkage Workshop, Arlington, Virginia. Statistical Research Division, U.S. Census Bureau (1997).
[9] W. I. Chang and J. Lampe. Theoretical and empirical comparisons of approximate string matching algorithms. In CPM: 3rd Symposium on Combinatorial Pattern Matching, pp. 175–184 (1992).
[10] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction to Algorithms. MIT Press (1990).
[11] Brenda G. Cox. Business survey methods. John Wiley & Sons, Inc., Wiley series in probability and mathematical statistics (1995).
[12] M.-W. Du and S. C. Chang. Approach to designing very fast approximate string matching algorithms. IEEE Transactions on Knowledge and Data Engineering, 6(4):620–633 (1994).
[13] Oren Etzioni and Mike Perkowitz. Category translation: learning to understand information on the Internet. In Proceedings of the International Joint Conference on AI, pp. 930–936 (1995).
[14] I. P. Fellegi and A. B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64:1183–1210 (1969).
[15] Z. Galil and R. Giancarlo. Data structures and algorithms for approximate string matching. Journal of Complexity, 4:33–72 (1988).
[16] C. A. Giles, A. A. Brooks, T. Doszkocs, and D. J. Hummel. An experiment in computer-assisted duplicate checking. In Proceedings of the ASIS Annual Meeting, page 108 (1976).

[17] Patrick A. V. Hall and Geoff R. Dowling. Approximate string matching. ACM Computing Surveys, 12(4):381–402 (1980).
[18] M. Hernandez and S. Stolfo. The merge/purge problem for large databases. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 127–138 (1995).
[19] Mauricio Hernandez. A Generalization of Band Joins and the Merge/Purge Problem. Ph.D. thesis, Columbia University (1996).
[20] J. E. Hopcroft and J. D. Ullman. Set merging algorithms. SIAM Journal on Computing, 2(4):294–303 (1973).
[21] Jeremy A. Hylton. Identifying and merging related bibliographic records. M.S. thesis, MIT, published as MIT Laboratory for Computer Science Technical Report 678 (1996).
[22] C. Jacquemin and J. Royaute. Retrieving terms and their variants in a lexicalized unification-based framework. In Proceedings of the ACM-SIGIR Conference on Research and Development in Information Retrieval, pp. 132–141 (1994).
[23] Robert E. Kahn. An introduction to the CS-TR project [WWW document]. URL, http://www.cnri.reston.va.us/home/cstr.html (1995).
[24] Beth Kilss and Wendy Alvey, editors. Record linkage techniques, 1985: Proceedings of the Workshop on Exact Matching Methodologies, Arlington, Virginia. Internal Revenue Service, Statistics of Income Division, U.S. Internal Revenue Service, Publication 1299 (2-86) (1985).
[25] W. Kim, I. Choi, S. Gala, and M. Scheevel. On resolving schematic heterogeneity in multidatabase systems. Distributed and Parallel Databases, 1(3):251–279 (1993).
[26] Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(2):323–350 (1977).
[27] K. Kukich. Techniques for automatically correcting words in text. ACM Computing Surveys, 24(4):377–439 (1992).
[28] V. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics – Doklady, 10:707–710 (1966).
[29] M. Madhavaram, D. L. Ali, and Ming Zhou. Integrating heterogeneous distributed database systems. Computers & Industrial Engineering, 31(1–2):315–318 (1996).
[30] Tova Milo and Sagit Zohar. Using schema matching to simplify heterogeneous data translation. In Ashish Gupta, Oded Shmueli, and Jennifer Widom, editors, VLDB'98, Proceedings of the 24th International Conference on Very Large Data Bases, August 24-27, 1998, New York City, New York, USA, pp. 122–133. Morgan Kaufmann (1998).
[31] Alvaro E. Monge and Charles P. Elkan. WebFind: Automatic retrieval of scientific papers over the world wide web. In Working Notes of the Fall Symposium on AI Applications in Knowledge Navigation and Retrieval, page 151. AAAI Press (1995).
[32] Alvaro E. Monge and Charles P. Elkan. The field matching problem: Algorithms and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 267–270. AAAI Press (1996).
[33] Alvaro E. Monge and Charles P. Elkan. WebFind: Mining external sources to guide WWW discovery [demo]. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. AAAI Press (1996).
[34] Alvaro E. Monge and Charles P. Elkan. The WebFind tool for finding scientific papers over the worldwide web. In Proceedings of the 3rd International Congress on Computer Science Research, pp. 41–46, Tijuana, Baja California, Mexico (1996).
[35] Alvaro E. Monge and Charles P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In Proceedings of the SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery, Tucson, Arizona (1997).
[36] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequences of two proteins. Journal of Molecular Biology, 48:443–453 (1970).
[37] Howard B. Newcombe. Handbook of record linkage: methods for health and statistical studies, administration, and business. Oxford University Press (1988).
[38] Howard B. Newcombe, J. M. Kennedy, S. J. Axford, and A. P. James. Automatic linkage of vital records. Science, 130:954–959 (1959). Reprinted in [24].
[39] J. Peterson. Computer programs for detecting and correcting spelling errors. Communications of the ACM, 23(12):676–687 (1980).
[40] T. E. Senator, H. G. Goldberg, J. Wooton, M. A. Cottini, et al. The financial crimes enforcement network AI system (FAIS): identifying potential money laundering from reports of large cash transactions. AI Magazine, 16(4):21–39 (1995).

[41] A. Silberschatz, M. Stonebraker, and J. D. Ullman. Database research: achievements and opportunities into the 21st century. A report of an NSF workshop on the future of database research (1995).
[42] B. E. Slaven. The set theory matching system: an application to ethnographic research. Social Science Computer Review, 10(2):215–229 (1992).
[43] T. F. Smith and M. S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147:195–197 (1981).
[44] W. W. Song, P. Johannesson, and J. A. Bubenko Jr. Semantic similarity relations and computation in schema integration. Data & Knowledge Engineering, 19(1):65–97 (1996).
[45] Y. R. Wang, S. E. Madnick, and D. C. Horton. Inter-database instance identification in composite information systems. In Proceedings of the Twenty-Second Annual Hawaii International Conference on System Sciences, pp. 677–684 (1989).
[46] William E. Winkler. Advanced methods of record linkage. In American Statistical Association, Proceedings of the Section of Survey Research Methods, pp. 467–472 (1994).
[47] William E. Winkler. Matching and Record Linkage, pp. 355–384. In Brenda G. Cox [11], Wiley series in probability and mathematical statistics (1995).
[48] M. I. Yampolskii and A. E. Gorbonosov. Detection of duplicate secondary documents. Nauchno-Tekhnicheskaya Informatsiya, 1(8):3–6 (1973).
[49] T. W. Yan and H. Garcia-Molina. Information finding in a digital library: the Stanford perspective. SIGMOD Record, 24(3):62–70 (1995).