The good into the Pot, the bad into the Crop
Preselection of Record Pairs for Database Fusion

Mattis Neiling and Roland M. Müller
{mneiling|rmueller}@wiwiss.fu-berlin.de
Free University of Berlin, Department of Economics, Institute for Information Systems

26th March 2001

Abstract

Our goal is to fuse databases which store information belonging to the same individuals within a population, such as people, cars, or companies. In order to fuse records, we have to recognize which records belong to the same individual. To this end, we apply object identification and assume that the identifying information available from the databases can be viewed in a universal relation. Under this assumption, object identification can be performed by comparing pairs of records and classifying them as same or not same. For large databases containing information about one population, a preselection becomes mandatory: the number of possible comparisons grows very fast; in fact, it is the product of the numbers of records of the two databases. Like Cinderella in Grimm's fairy tale, who had to pick lentils from the ashes, we have to select pairs of records for database fusion. This article briefly presents the C3-Framework for object identification (Conversion-Comparison-Classification) and shows how a preselection can be performed to minimize the computational time for the comparison. With the preselection approach suggested here, a database fusion designer is able to control the error bounds of misclassification on a labeled learning sample by refining the selection rules.

Part of this work was supported by the Berlin-Brandenburg Graduate School in Distributed Information Systems (DFG grant no. GRK 316). Contained in the Proceedings of the 1st Workshop on Database Fusion, Magdeburg, Germany, May 3-4, 2001.

You tame pigeons, you turtle-doves, and all you birds beneath the sky, come and help me to pick the good into the pot, the bad into the crop.

(Cinderella, from the Grimm Brothers' fairy tale Aschenputtel)

1 Introduction

We propose to apply object identification and classification techniques to fuse large databases without common identifiers (cf. [NL00c, NL00b]). Given two or more databases about one population, the problem is as follows: for every record we have to search for other records which may correspond to the same real-world object. One solution to this problem is to apply object identification: the databases are made up of tables that consist of attributes with certain domains, and contain information belonging to real-world objects. Each attribute reflects a property of the real-world objects the table refers to. We can derive further properties from the provided attributes and perform object identification by comparing attribute values. Since pure comparison is unsatisfactory in general, a final classification step is performed, splitting the pairs into at least two classes, same and not same. The object identification process consists of three steps in succession:

A, B ⟹ Conversion ⟹ Comparison ⟹ Classification ⟹ A ⋈ B,

where A ⋈ B denotes the FULL OUTER JOIN of A and B. We introduced this C3-Framework (Conversion-Comparison-Classification) for object identification in our previous publications. The JOIN splits into three parts for a ∈ A, b ∈ B: the NATURAL JOIN and the LEFT and RIGHT OUTER JOIN. The NATURAL JOIN consists of record pairs a ⋈ b that refer to identical objects and contain data from both data sources. Each pair a ⋈ b within the NATURAL JOIN can be enriched by a score measuring the degree (or probability) to which the joined records a ∈ A, b ∈ B belong together. The LEFT and RIGHT OUTER JOIN consist of the records a ⋈ ∅ and ∅ ⋈ b that contain only data from one origin source, A or B, which means they have no counterpart in the other source.

The remainder of this paper is organized as follows. In section 2 we briefly describe the C3-Framework as the methodological background. In section 3 we introduce our optimization approach for the preselection. Finally, in section 4 we summarize our results.

2 The C3-Framework in a Nutshell

In order to fuse records we have to find out which records are related to the same individual. We assume that the identifying information available from the databases can be viewed in a universal relation (one big table) for each database, respectively. With this data we perform the object identification. The C3-Framework for object identification consists of three successive steps:

• Conversion of the data into identifying properties
• Comparison of record pairs
• Classification of the whole comparison space

First we introduce the notation used within the C3-Framework with three definitions. Then we discuss each step in more detail.

2.1 Notation

Let U be a universe of distinct real-world objects (a population such as people, books or cars). There are data sources A and B over U with attribute sets E_A and E_B, respectively. A record from data source A or B is written a ∈ A, b ∈ B. The domain of possible values of a single attribute e is denoted by dom e; for an attribute set E = {e_1, ..., e_d} the domain is defined by dom E := ∏_{i=1}^{d} dom e_i.

Definition 1. An attribute set E is derivable from an attribute set E_A iff there exists a surjective conversion function h_A : dom E_A → dom E which maps the domain of E_A onto the domain of E.

Let E be a set of d attributes with the domain dom E. Let R ⊂ ℝ_+^n be arbitrary with n ≤ d; R is possibly finite and discrete.

Definition 2. A function f : (dom E)² → R is designated as a comparison function. R is called the comparison space.

Let V be any space or set, like R or A × B.

Definition 3. A classification rule on V is given by a decision function δ : V → {1, 2, ..., max}, which returns for v ∈ V the index j = δ(v) of a class c_j.

These definitions result in the following:

Proposition 1. Each classification on R is a classification of A × B, too: a decision function δ on R classifies a pair (a, b) ∈ A × B via δ(f(h_A(a), h_B(b))).
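Read operationally, the proposition states that classifying a pair simply factors through the comparison space. A minimal Python sketch (the function names are generic placeholders for concrete conversion, comparison, and classification functions, not part of the paper's notation):

    def classify_pair(a, b, h_a, h_b, f, delta):
        # h_a, h_b: conversion functions mapping records to derived attributes
        # f:        comparison function mapping two derived tuples into R
        # delta:    decision function on the comparison space R
        r = f(h_a(a), h_b(b))   # comparison value r in R
        return delta(r)         # class index, e.g. 1 = same, 2 = not same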

2.2 Conversion

The records consist of attributes with certain values, where each attribute reflects a property of the real-world objects. We derive further properties from the provided attributes and perform the object identification of records (that is, the identification of the real-world objects the records belong to) based upon these attributes or a suitable subset. Furthermore, we can use additional information for the identification (e.g. frequencies of values, contextual information, or metadata like the semantics of attributes). The conversion of the given data is far less demanding than the comparison and classification. That means any new property that seems to be identifying should be added to the set of derived attributes.
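For illustration, a conversion function for person records might normalize the name, derive a phonetic code, and extract coarse keys. The attribute names below are hypothetical, and the simplified Soundex code merely stands in for a production phonetic code (e.g. Cologne phonetics for German names):

    import re

    def soundex(name):
        # Simplified Soundex: first letter plus up to three consonant codes.
        codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}
        s = re.sub(r"[^a-z]", "", name.lower())
        if not s:
            return "0000"
        encoded = [codes.get(c, "0") for c in s]    # vowels etc. encode as "0"
        result, prev = [], encoded[0]
        for code in encoded[1:]:
            if code != "0" and code != prev:
                result.append(code)
            prev = code
        return (s[0].upper() + "".join(result) + "000")[:4]

    def convert(record):
        # Hypothetical conversion h_A: derive identifying properties
        # from the raw attributes of a person record.
        name = record.get("last_name") or ""
        return {
            "name_norm":  re.sub(r"\s+", " ", name.strip().lower()),
            "name_code":  soundex(name),
            "birth_year": (record.get("birth_date") or "")[:4],
            "zip3":       (record.get("zip") or "")[:3],
        }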

2.3 Comparison

Given a set of derived properties (or attributes) of two databases, we compare the attribute values of record pairs. A comparison function can be chosen as a distance measure, but in general we need nominal categories for comparison cases, since there are exceptions like missing values, constraints on the data, etc.

2.3.1 Discrete Comparison Functions

As usual in record linkage applications, the comparison function may be chosen with a few discrete values. After defining a comparison function for each identifying attribute, one can estimate the multinomial distribution of the comparison values from learning samples of record pairs labeled as identical and not identical, respectively. Discrete comparison values have the advantage that even special cases can be treated similarly to standard cases, e.g. the frequency of names. A disadvantage, however, is the difficulty of transforming continuous comparison scales into discrete ones.
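A discrete comparison function for a name attribute could, for example, distinguish four categories: exact agreement, agreement of the phonetic code only, disagreement, and missing values. A hypothetical sketch (code is any phonetic coding function, e.g. the Soundex sketch above):

    def compare_name(v1, v2, code):
        # Four categories: 0 = exact agreement, 1 = phonetic agreement only,
        # 2 = disagreement, 3 = at least one value missing.
        if not v1 or not v2:
            return 3          # missing values form their own category
        if v1 == v2:
            return 0
        if code(v1) == code(v2):
            return 1
        return 2

The multinomial distribution of such category values can then be estimated by simple counting, separately on the same- and not same-labeled pairs of the learning sample.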

2.3.2 Distance Measure

[NL99] describes the comparison of pairs through a weighted distance measure, so that a classification can be performed without supervised learning. The weights can be viewed as a rescaling (i.e. as a normalization of the axes). Furthermore, the weights can reflect and correct the interaction among the dimensions, if the attributes (and with them the related comparison values) are somehow correlated. Assume that all comparison functions f_i are built as one-dimensional distance measures over R_i ⊂ ℝ_+, respectively. Then we can define a composite n-dimensional distance measure on R as follows: given a positive definite matrix W of dimension n and a comparison space R ⊂ ℝ^n, a distance measure can be defined by Dist(a, b) := √(rᵀ W r), where r ∈ R is shorthand for the comparison vector of a pair of records (a, b) ∈ A × B, formally r := f(h_A(a), h_B(b)). The matrix W defines a quadratic form over ℝ^n, so that the distance measure becomes Euclidean for the identity matrix, rescaled Euclidean for a diagonal matrix, and axes-transformed and rescaled for an arbitrary positive definite matrix (sometimes called elliptic). For instance, if we can make the assumption of an n-dimensional Gaussian distribution on the comparison space and choose W as the inverse of the covariance matrix, Dist(·, ·) becomes the special case of the Mahalanobis distance of two records. The usage of an n-dimensional distance measure seems the most intuitive, but it only works well under specific circumstances, like the distribution assumption described above, or at least the assumption of similarly distributed dimensions (whatever 'similar' means), an assumption that often fails in real applications.
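A sketch of such a weighted distance with NumPy; here W is assumed to be the inverse covariance matrix estimated from comparison vectors of matched pairs in a hypothetical learning sample, which yields the Mahalanobis case:

    import numpy as np

    def weighted_distance(r, W):
        # Dist(a, b) = sqrt(r' W r) for the comparison vector
        # r = f(h_A(a), h_B(b)) and a positive definite matrix W.
        return float(np.sqrt(r @ W @ r))

    # Comparison vectors of matched pairs from a (fabricated) learning sample.
    R_same = np.array([[0.1, 0.0, 0.2],
                       [0.3, 0.1, 0.1],
                       [0.0, 0.2, 0.4],
                       [0.2, 0.1, 0.3]])
    W = np.linalg.inv(np.cov(R_same, rowvar=False))  # inverse covariance matrix
    r = np.array([0.2, 0.1, 0.2])                    # comparison vector of one pair
    print(weighted_distance(r, W))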

2.4 Classification

Different algorithms can be applied for the classification of pairs of records. We assume that in general supervised learning is necessary. The problem is as follows: induce classification rules on a comparison space R from a labeled learning sample from A × B. The criteria can be based on the rules of decision trees, on generalized distance measures, or on a statistical measure like the likelihood ratio test in the record linkage approach. Alternatively, they can be based on other methods like clustering algorithms (single linkage, nearest neighbor, etc.), association rules, or neural networks. Which of these methods can be applied depends upon

• the scales and the cardinality of the comparison space,
• the size and the quality of the learning sample, and
• the distribution assumption on the comparison space.

Decision tree generation and association rule mining have the lowest requirements. If the comparison function is based on a distance measure, methods like clustering, neural networks, and others can be applied (possibly without any learning sample, at the price of uncertainty about the error rates). The record linkage approach is suitable in most situations, because it requires only a discrete and finite comparison space. The larger the learning sample, the better the resulting classification rule. The quality of a classification can be measured by the error rates of misclassification on a testing sample.
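As one concrete instance, a decision tree can be induced from labeled comparison vectors. A minimal sketch with scikit-learn, on a tiny fabricated learning sample (three comparison dimensions, labels 1 = same, 2 = not same):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Labeled comparison vectors r = f(h_A(a), h_B(b)); values are invented.
    R_learn = np.array([[0, 0, 0.05],    # e.g. names agree, years agree, ...
                        [1, 0, 0.10],
                        [0, 1, 0.20],
                        [2, 1, 0.90],
                        [2, 0, 0.70],
                        [2, 1, 0.95]])
    labels = np.array([1, 1, 1, 2, 2, 2])

    tree = DecisionTreeClassifier(max_depth=3).fit(R_learn, labels)
    print(tree.predict([[0, 0, 0.15]]))  # classify a new pair's comparison vector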

3 Preselection

Because very large data sets have to be compared, it is mandatory to reduce the number of comparisons below that of A × B (the cross-product of the data sources). For instance, within the next German Administrative Record Census about 100 million registration records have to be joined with 60 million social insurance records, resulting in 100 million times 60 million potential comparisons! There are three different kinds of computations for object identification, but only one of them needs exhaustive time: the comparison step. Therefore we have to reduce the number of pairs to compare. The idea behind a preselection is based on rejection rules: almost all record pairs can be classified as not same through simple computations. A preselection is the application of rejection rules: choose adequate rejection rules to reduce the number of comparisons, and apply the comparison only to the remaining pairs of records. Evidently, this works properly only if the rejection rules RE_i have misclassification rates ε_i that are very small, say less than a given ε > 0. Due to the fact that the rejected pairs of records are almost certainly not same (that means with certainty 1 − ε_i), one gets a preselection as a partition of the entire cross-product space A × B. This was first discussed in [NL00a]. Next we show that this preselection is more general than pure sorting or grouping of the data, which is only one possibility of preselecting data. The power of a preselection is the reduction of the number of pairs to be compared. But with every applied rejection rule RE_i the accumulated error rate Σ_i ε_i increases; the quality of the preselection degrades. Thus the preselection becomes an optimization problem: we have to select rejection rules subject to an overall error bound ε > 0.

3.1 Rejection Rules

We assume that the object identification process is completely defined for one specific database fusion application, that means the functions for the conversion, comparison, and classification are well defined and implemented. To detect rejection rules, we can use the learning sample from the classification step again. We search for regions (e.g. intervals) of the comparison space which fulfill the following requirements:

• the region is connected and almost completely contained in the not same partition,
• the region contains not same-labeled examples above average, and
• the region contains only a small number of same-labeled examples.

Depending on the type of the comparison space (e.g. discrete categorical or continuous), association rule mining or clustering techniques can be adapted to compute regions (R_i). Each region R_i ⊂ R determines a rejection rule, e.g.

(RE_i) Reject pair (a, b) if its comparison value r ∈ R_i.

Error rates for misclassification can be computed on the learning sample for each region. The error rate ε_i for a rejection rule (RE_i) is equal to the error rate of the corresponding region. Since a rejection rule is applied to comparison values to reject a pair, a comparison seems necessary, and consequently there would be no optimization at all! In general this is absolutely right. However, for special (and typical) comparison functions, the comparison can be left out for a lot of pairs and be replaced by some pre-computations. For instance, if a comparison function on an attribute is defined by

f_1(value_1, value_2) := 0 if value_1 = value_2, and 1 otherwise,

the rejection rule

(RE_1) Reject pair (a, b) if its comparison value for f_1 is 1

can be performed by sorting and grouping the records by their values of the involved attribute, as sketched below. This is quite a good optimization. Indeed, if you lower the number of comparisons down to the number of records of the smaller database, you get the best optimization achievable!
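A sketch of this sort-and-group preselection (a hash join over a hypothetical key attribute); it enumerates only the pairs that survive rejection rule (RE_1), with cost linear in card A + card B plus the number of surviving pairs:

    from collections import defaultdict

    def preselect_equal(A, B, key):
        # Apply (RE_1) without enumerating A x B: group B by the key
        # attribute and emit only pairs whose values agree; all other
        # pairs are rejected implicitly.
        index = defaultdict(list)
        for b in B:
            index[b[key]].append(b)
        for a in A:
            for b in index.get(a[key], []):
                yield a, b   # same-candidates, to be compared in full

    # Usage: candidates = list(preselect_equal(A, B, key="zip3"))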

Remark 1. Grouping by special attribute values (e.g. zip codes, phonetic codes, or initials) is commonly used in record linkage applications; it is called blocking. But for more sophisticated comparison functions than the one mentioned above, the resulting a priori error rate for misclassification may increase considerably. Especially for a comparison based on distance measuring, a more refined preselection is necessary.

The optimization gains are not as easy to obtain if we have a distance measure as a comparison function, e.g. for names the minimum edit distance: D_Edit(name_1, name_2) is defined as the minimal number of operations (insertions, deletions, and substitutions of one letter) needed to transform name_1 into name_2; its computational complexity is bounded by the product of the lengths of both strings. In this case, the rejection rule

(RE_2) Reject pair (a, b) if D_Edit(a, b) ≥ 3

can be performed without a complete comparison of the cross-product space A × B, but only by special pre-computations. One possible optimization is to compute, for each name string that occurs in A, all the strings s contained in the sphere D_Edit(name, s) < 3. These strings s result from at most two changes of the name string. The comparison can then be performed for a name from A by selecting from B only the names contained in this sphere. This procedure can be optimized further by clustering the names within a special sphere, obviously resulting in a non-disjoint sphere covering of the name space. To reduce the size of a sphere, it is possible to include a priori only values that actually occur in database B; in this case, however, the pre-computation has to be redone whenever the database changes. Note that this kind of pre-computation is applicable to distance measures in general. For numerical attributes the pre-computation becomes very fast, since it can be done by pure sorting. Although not covered by the examples above, it is also possible to define rejection rules on arbitrary regions of R, covering more than one attribute value or involving two up to all attributes. It depends on the computational complexity of the pre-computations which rules are "the good ones". Obviously, it can happen that the pre-computation becomes more complex than the comparison of the whole cross-product space A × B, but a database fusion designer should be able to control this trade-off by using our optimization approach.
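A sketch of the sphere pre-computation for (RE_2): for each name occurring in A, enumerate all strings within edit distance two (Norvig-style neighborhood generation) and intersect the sphere with the names that actually occur in B. The alphabet is restricted to lowercase ASCII for brevity:

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    def edits1(s):
        # All strings reachable from s by at most one deletion,
        # substitution, or insertion of one letter.
        splits = [(s[:i], s[i:]) for i in range(len(s) + 1)]
        deletes = {l + r[1:] for l, r in splits if r}
        substitutes = {l + c + r[1:] for l, r in splits if r for c in ALPHABET}
        inserts = {l + c + r for l, r in splits for c in ALPHABET}
        return deletes | substitutes | inserts | {s}

    def sphere2(s):
        # All strings t with D_Edit(s, t) < 3, i.e. at most two operations.
        return {t for e in edits1(s) for t in edits1(e)}

    def preselect_edit(names_A, names_B):
        # For each name in A, keep only the candidate names from B inside
        # its sphere; since the sphere is restricted to values occurring
        # in B, it must be recomputed when B changes (as noted above).
        occurring = set(names_B)
        return {a: sphere2(a) & occurring for a in set(names_A)}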

3.2 The Optimization Problem

The preselection of A × B, that means the partitioning of unclassified pairs of records into definitely not same and possibly same (same-candidates), becomes an optimization problem: we have to select rejection rules subject to an overall error bound ε > 0. Since every rejection rule requires more or less computational time, a cost t_i > 0 can be associated with each rejection rule RE_i. In general, the cost t_i depends upon the cardinality of A (and of B, but not of A × B), on the comparison function, and on the attributes. This cost function can become very complex. With every rejection rule RE_i we associate a reduction rate ρ_i ∈ [0, 1), which measures the gain of reduction of A × B. As mentioned before, the error rate ε_i reflects the expected rate of falsely classified pairs for RE_i, for instance estimated on the labeled learning sample. For simplicity we assume that both ε_i and ρ_i are fixed for a rule RE_i; they do not change whether RE_i is applied first or later in a preselection (otherwise these values would have to be estimated for each combination of rejection rules, resulting in a nonlinear problem). Additionally, we assume that the time for loading and comparing two records is approximately constant, denoted by t_o^const.

With these preliminaries we can formulate the optimization problem: given two record sets A, B with cardinalities card A, card B to compare, an error bound ε > 0, and a set of m rejection rules {RE_i}_{i=1...m} with error rate ε_i, reduction rate ρ_i, and time cost t_i, respectively. A preselection SEL is a sequence of rejection rules, SEL = (RE_σ(i)), where σ denotes the indices of the RE_i succeeding in SEL. We have to minimize the cost function for the preselection SEL on the product space A × B through a variation of the rejection rules RE_i applied in SEL, while fulfilling the error bound condition:

min_{SEL}  t_Total = c + Σ_{i: RE_i ∈ SEL} t_i + ( ∏_{i: RE_i ∈ SEL} (1 − ρ_i) ) · t_o^const · (card A · card B)    (1)

(the sum is the cost of applying the rejection rules, the product term the cost of comparing the remaining pairs), subject to

Σ_{i: RE_i ∈ SEL} ε_i < ε.    (2)

If we apply the comparison to the entire space A × B, i.e. without using any rejection rule, the preselection is SEL = ∅, so that formula (1) becomes

t_MAX = c + Σ_{i: RE_i ∈ ∅} t_i + ( ∏_{i: RE_i ∈ ∅} (1 − ρ_i) ) · t_o^const · (card A · card B) = c + t_o^const · (card A · card B),

since the empty sum is 0 and the empty product is 1. Thus t_MAX is the overall bound for the cost function:

t_Total ≤ c + t_o^const · (card A · card B).    (3)

Hence, we have to solve the optimization problem given by (1)-(3) by variation of the rejection rules for the preselection.

Example 1. We show a fictional calculation for the fusion of databases with cardinalities card A = 1,000,000 and card B = 1,000. We set ε = 0.06, c = 100, and t_o^const = 1, and could apply three rejection rules with the values ε_i, ρ_i, and t_i displayed in the following table.

    Rejection Rule    ε_i      ρ_i         t_i
    RE_1              0.01     0.654342    1,000,000
    RE_2              0.02     0.784562    10,000,000
    RE_3              0.03     0.900321    100,000,000

With (3), the overall bound for t_Total becomes t_MAX = 1,000,000,100, and the performance gain is given by t_MAX / t_Total. According to (1), the optimization then evaluates as follows:

    preselection SEL        t_Total         gain
    RE_1                    346,658,100     2.9
    RE_2                    225,438,100     4.4
    RE_3                    199,679,100     5.0
    RE_1 and RE_2           85,467,968      11.7
    RE_1 and RE_3           135,454,944     7.4
    RE_2 and RE_3           131,474,744     7.6
    RE_1, RE_2 and RE_3     118,422,983     8.4

Hence, the preselection with minimal cost is the combination of the rejection rules RE_1 and RE_2, with a performance gain of factor 11.7, and not the combination of all three rejection rules; the latter would in any case violate the error bound (2), since ε_1 + ε_2 + ε_3 = 0.06 is not below ε = 0.06.
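For a small number of rules, the optimization problem (1)-(2) can be solved by brute-force enumeration of all rule subsets. A sketch that reproduces the example (it treats SEL as a set, which suffices since ε_i and ρ_i were assumed order-independent):

    from itertools import combinations
    from math import prod

    def best_preselection(rules, card_a, card_b, eps, c=100.0, t_o=1.0):
        # rules: list of (eps_i, rho_i, t_i) triples.
        # Returns (t_Total, selected rule indices) minimizing (1) under (2).
        n_pairs = card_a * card_b
        best = (c + t_o * n_pairs, ())          # t_MAX: no rule applied
        for k in range(1, len(rules) + 1):
            for sel in combinations(range(len(rules)), k):
                if sum(rules[i][0] for i in sel) >= eps:
                    continue                    # violates error bound (2)
                t = (c + sum(rules[i][2] for i in sel)
                     + prod(1 - rules[i][1] for i in sel) * t_o * n_pairs)
                best = min(best, (t, sel))
        return best

    rules = [(0.01, 0.654342, 1e6), (0.02, 0.784562, 1e7), (0.03, 0.900321, 1e8)]
    print(best_preselection(rules, 1_000_000, 1_000, eps=0.06))
    # -> approximately (85467968.2, (0, 1)), i.e. RE_1 and RE_2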

4 Summary

In this article we have presented an optimization approach for object identification without common identifiers in the field of database fusion. We described how to minimize the computational time through a preselection of pairs. With this approach the database fusion designer is able to speed up the identification process and can control the error for undiscovered identical records. Like Cinderella in Grimm's fairy tale, who picks the lentils from the ashes, we pick a preselection of pairs of records from the very large cross-product space of two data sources: the good into the pot, the bad into the crop.

References

[AJ97] Wendy Alvey and Bettye Jamerson, editors. Record Linkage Techniques — 1997. Proceedings of an International Workshop and Exposition, March 20-21, 1997, Arlington, Virginia. Federal Committee on Statistical Methodology, Office of Management and Budget, Washington, DC, 1997. Cf. http://www.census.gov/srd/www/reclink/reclink.html.

[FS69] Ivan P. Fellegi and Alan B. Sunter. A theory of record linkage. Journal of the American Statistical Association, 64:1183-1210, 1969. Reprinted in [KA85, 51-78].

[Hor00] Werner Horn, editor. 14th European Conference on Artificial Intelligence (ECAI 2000), Berlin, August 2000. IOS Press, 2000.

[KA85] Beth Kilss and Wendy Alvey, editors. Record Linkage Techniques — 1985. Proceedings of the Workshop on Exact Matching Methodologies, Arlington, Virginia, May 9-10, 1985. Internal Revenue Service Publication, Department of the Treasury, Statistics of Income Division, Washington, DC, 1985. Download at http://www.bts.gov/fcsm/methodology (a 29 MByte PDF file).

[Nei99] Mattis Neiling. Datenintegration durch Objekt-Identifikation [Data integration through object identification]. In Ralf-Detlef Kutsche, Ulf Leser, and Johann Christoph Freytag, editors, 4. Workshop Föderierte Datenbanken, Berlin, Germany, 25.-26. November 1999, pages 117-143, 1999.

[New88] Howard B. Newcombe. Handbook of Record Linkage. Oxford University Press, Oxford, 1988.

[NL99] Mattis Neiling and Hans-Joachim Lenz. The creation of a register based census for Germany in 2001. An application of data integration. Discussion paper 1999/34, Fachbereich Wirtschaftswissenschaft der Freien Universität Berlin, 1999.

[NL00a] Mattis Neiling and Hans-Joachim Lenz. Data fusion and object identification. In International Conference on Advances in Infrastructure for Electronic Business, Science, and Education on the Internet (SSGRR 2000), L'Aquila, Italy, July 31 - August 6, 2000, 2000.

[NL00b] Mattis Neiling and Hans-Joachim Lenz. Data integration by means of object identification in information systems. In Hans Robert Hansen et al., editors, Proceedings of the 8th European Conference on Information Systems (ECIS 2000), Vienna, Austria, July 2000, 2000.

[NL00c] Mattis Neiling and Hans-Joachim Lenz. Supplement of information: Data integration by classification of pairs of records. In 24th Annual Conference of the Gesellschaft für Klassifikation, Passau, Germany, March 15-17, 2000, 2000. To appear.

[Tor00] Vicenç Torra. Towards the re-identification of individuals in data files with non-common variables. In Horn [Hor00].

[Win95] William E. Winkler. Matching and record linkage. In B. G. Cox, editor, Business Survey Methods, pages 355-384. J. Wiley, New York, 1995. Reprinted in [AJ97, 374-403].