Information and Software Technology 42 (2000) 505–515 www.elsevier.nl/locate/infsof

A framework and test-suite for assessing approaches to resolving heterogeneity in distributed databases H.T. El-Khatib, M.H. Williams*, L.M. MacKinnon, D.H. Marwick Department of Computing and Electrical Engineering, Heriot–Watt University, Edinburgh EH14 4AS, UK Received 2 July 1999; received in revised form 21 December 1999; accepted 10 January 2000

Abstract

The problem of connecting together a number of different databases to produce an integrated information system has attracted a considerable amount of attention over the years and various approaches have been developed to handle this. However, the general problem of gathering related information from a number of existing heterogeneous databases is complex because of the differences in representation and meaning of data in different data sets. Many different approaches have been described to resolve this problem, and some prototype systems built. However, it is difficult to compare the effectiveness of different approaches and prototypes. This paper is aimed at addressing the specific issue of assessing the generality of different approaches. To this end it presents a framework for classifying the differences between data in different databases and a test-suite which can be used to evaluate and compare the extent to which different approaches handle different aspects of this heterogeneity. © 2000 Elsevier Science B.V. All rights reserved.

Keywords: Distributed databases; Linking databases; Heterogeneous databases

* Corresponding author. Tel.: +44-131-451-3430; fax: +44-131-451-3327.

1. Introduction

During the past three decades there has been a steady growth in the number of databases. This has led to the storage of related data in different formats across multiple databases. For example, in areas such as healthcare, the information on a single patient may be scattered over a number of different medical databases with no simple way of obtaining a complete record of the patient.

One consequence of this has been the steadily growing interest in connecting together different databases to produce distributed database systems. Initially the focus was on combining similar databases so that the individual databases might be regarded as views of a single underlying common database. However, it was soon realised that this approach was not adequate in many cases involving legacy databases; the area of medical databases is an obvious example. This in turn led to work on heterogeneous distributed database systems, i.e. distributed databases that include heterogeneous components [1] involving different database models, query languages and schemas.

Although much work has been done on this problem, the general issue of handling differences in representation and meaning of data in different data sets is large and complex and our understanding of it is still maturing. A growing number of papers have been published describing a variety of different approaches and a somewhat smaller number of actual prototypes implementing these (see Section 4). Examples of the latter include the MIPS system [2,3], which is used as the starting point for this paper.

However, as the number of different approaches and prototypes increases, it is important to develop means by which to assess them and draw comparisons between them. This is the main aim of this paper. Here we propose a framework for classifying different aspects of heterogeneity in data sets, and relate the various aspects of heterogeneity discussed by different researchers to this framework. The idea behind such a framework is to identify a comprehensive (though not complete) range of different types of heterogeneity that can arise either alone or in combinations.

Using this framework a simple test-suite has been devised which can be used to test and compare different approaches to database interoperability. The suite comprises a small number of data sets and queries, which exercise almost all aspects of the framework.

The focus of this framework is on the relational database model since the vast majority of databases currently in use are relational. Most of the heterogeneities identified are common across all database models including Object

0950-5849/00/$ - see front matter © 2000 Elsevier Science B.V. All rights reserved. PII: S0950-5849(00)00094-X


H.T. El-Khatib et al. / Information and Software Technology 42 (2000) 505–515

Fig. 1. Structure of Database1. Note: Patient 129 does not have a home telephone number. Mark Richard’s telephone number is stored with the old area code, while Karen Taylor’s telephone number is stored with the new area code. The code ‘PNE’ represents ‘pneumonia’. The tax rate before 1991 was 15% and after this date it became 17.5%.

Oriented, but this framework has not been extended to consider these other models at this time.

The next section describes the framework with examples drawn from a small set of example databases. Section 3 describes the test-suite derived from this. Section 4 provides an overview of work done by a number of different researchers on the problem of heterogeneous distributed database systems and shows how they relate to the framework. Section 5 presents a summary and conclusions.

2. Framework for classifying heterogeneities

This section presents a framework for classifying the different types of heterogeneity that arise and need to be catered for. In so doing, the classifications are described in terms of the relational model as mentioned in the previous section. Different instances of heterogeneity can be classified into one or a combination of the following:

1. Naming heterogeneity. This occurs when the same values are stored in different databases but the names given to the attributes are different in different systems. These can be handled by a simple (syntactic) attribute transformation of the query.
2. Relational structure heterogeneity. This occurs when the composition of elementary attributes into composite structures varies but once again the values stored are identical. This can be handled by a (syntactic) relational transformation of the query.
3. Value heterogeneity. In this case the way in which values are represented is different in different databases. This may involve type and value transformations.
4. Semantic heterogeneity. This is the most difficult form to deal with as in this case the data stored in different databases embody different assumptions, e.g. in what they represent or in how they have been captured.
5. Data model heterogeneity. Here the data model itself is the issue and transformations between data models and differences between them are relevant.
6. Timing heterogeneity. This concerns the changes over time in the structure of a database, the representation of attributes and the values themselves. Basically, almost any difference from each of the preceding categories, which can occur between databases, may


Fig. 2. Structure of Database2. Note: Mark Richard’s telephone number was updated on April 24 1992. The code ‘PNE’ represents ‘pneumoconiosis’.

also arise within a single database if it changes with time. One area not covered in this categorisation is that of recording errors in the data. Although this is a factor that does create problems, the issues of noisy data are generally highly dependent on the application and impossible to cover in any generalised way. To illustrate the different aspects of heterogeneity which follow, four simple data sets are given in Figs. 1–4. Each contains a collection of patient record data for patients attending different clinics in different institutions. The aim is to link these together to provide a single integrated information system. The first data set, Database1, comprises three relations: PAT-REC which stores basic patient data, VISIT which records details of individual visits to the clinic and LABTEST which stores information on laboratory tests conducted. The second data set, Database2, is a minor variation on Database1 with essentially the same three relations. Database3 on the other hand consists of four relations. Database4 represents data from a paediatric clinic. It consists of three relations: PATIENT which stores patient data, VISIT which stores information about home visits, C-VISIT which records details of individual visits to the clinic.

The different types of heterogeneity as shown in Fig. 5 are:

2.1. Naming heterogeneity

The simplest form of heterogeneity is associated with concept naming. This arises when the same concept is described by two or more names in different databases (synonyms), or when the same name is used for different concepts (homonyms). This form of heterogeneity is not concerned with the value which is stored but merely with the name by which it is accessed.

2.1.1. Naming synonyms

These include the following:

• Attribute synonyms. The same attribute may be given different names in different databases. For example, the attribute NAME in relation PAT-REC in Database1 corresponds to the attribute PATIENT-NAME in relation PATIENT in Database2.
• Relation (Table) synonyms. The same relation may be represented by different names in different databases. For example, the relation LAB-TEST in Database1 corresponds to the relation TEST in Database2.

2.1.2. Naming homonyms

These include the following:

• Attribute homonyms. Two attributes with the same name occurring in different databases represent different things. For example, the attribute DATE occurring in relation VISIT of Database2 is different from DATE in relation VISIT of Database4. Although they have the same name, they represent different concepts.
• Relation (Table) homonyms. Relations with the same name occurring in different databases contain different things. For example, the relation VISIT in Database1 records details of individual visits to the clinic, while the relation VISIT in Database4 stores information about visits to the patient’s home.
• Attribute-Relation homonyms (or entity-class homonyms). An attribute in one database has the same name as a relation in another database. For example, PATIENT-NAME is an attribute of relation PATIENT in Database2, but it is a relation in Database3.

2.2. Relational structure heterogeneity

This form of heterogeneity arises when the way in which attributes are composed into relations in one database is different from that of another. Once again this form of heterogeneity is not concerned with the values of attributes, but merely how they are assembled into relations.

2.2.1. Relation size

In this case relations with the same name have different numbers of attributes in different databases, and thus are not union-compatible. For example, relation PATIENT in Database2 has five attributes whereas relation PATIENT in Database3 has only four.

2.3. Value heterogeneity

This form of heterogeneity is concerned with the way in which the values of a concept are represented. It is possible that different instances of the same concept occurring in different databases may be represented in different ways.

2.3.1. Numeric–numeric

• Different units: fixed conversion. This arises when different databases use different units for the same data element. For example, an attribute WEIGHT in the VISIT relation of Database1 is expressed in kilograms whereas in Database2 it is expressed in pounds. This represents a straightforward conversion from one set of units to another.
• Different units: time varying conversion. As an example, consider the MEDICATION-PRICE/PRICE attributes in the VISIT relations, which in Database1 contain values expressed in pounds sterling and in Database2 contain values expressed in US dollars. This is also a conversion but the conversion factor varies with time and a conversion factor must be chosen for an appropriate instant of time.
• Units: other conversions. Apart from the standard conversions of the previous two cases, a number of irregular conversions arise. For example, the telephone number value in the PHONE attribute in relation PAT-REC of Database1 is represented with area codes whereas in the PHONE attribute in relation PATIENT of Database2 it is represented without area codes.
• Granularity. This form of heterogeneity arises when data elements representing a particular measurement differ in their level of granularity. For example, the WEIGHT attribute value in the VISIT relation of Database1 is stored to the nearest kilogram while in Database3 it is stored to the nearest tenth of a kilogram.
• Composition of values in a single-valued attribute. Sometimes a value consists of two or more components which are directly related. A classic example is that of the price of an object or service, which may be given inclusive or exclusive of tax. Similarly, prices in a restaurant may be inclusive or exclusive of service charge. As an example of this form of heterogeneity consider the attribute MEDICATION-PRICE of relation VISIT in Database1 which describes the price of the medicine including tax, whereas the attribute PRICE in relation VISIT in Database2 describes the medication price without tax.
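The fixed, time-varying and granularity conversions listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the framework or of any system discussed in this paper: the function names and the exchange-rate table (including its values) are hypothetical, and only the pound/kilogram factor is a fixed constant.

```python
from bisect import bisect_right
from datetime import date

LB_PER_KG = 2.20462262185  # pounds per kilogram: a fixed conversion factor


def pounds_to_kg(pounds: float) -> float:
    """Fixed conversion, e.g. a WEIGHT held in pounds expressed in kilograms."""
    return pounds / LB_PER_KG


def to_granularity(value: float, step: float) -> float:
    """Round to the nearest multiple of `step` (1.0 = whole kg, 0.1 = tenths)."""
    return round(round(value / step) * step, 10)


# Hypothetical US dollars per pound sterling, each rate valid from its start date.
USD_PER_GBP = [(date(1990, 1, 1), 1.6), (date(1991, 1, 1), 2.0)]


def usd_to_gbp(amount: float, on: date) -> float:
    """Time-varying conversion: apply the rate in force on the given date."""
    starts = [start for start, _ in USD_PER_GBP]
    i = bisect_right(starts, on) - 1
    if i < 0:
        raise ValueError(f"no exchange rate known for {on}")
    return amount / USD_PER_GBP[i][1]
```

Note that moving from a finer to a coarser granularity loses information, so comparing a value stored to the nearest kilogram with one stored to the nearest tenth is only meaningful at the coarser level.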

2.3.2. String–string

• Value synonyms. This occurs when the values of an attribute are represented as strings but a slightly different set of values is used in different databases. As an example, the value of the SEX attribute in the PAT-REC relation of Database1 is stored as Male or Female, while in attribute SEX in relation PATIENT of Database2 it is stored as M or F.
• Value homonyms. The value ‘PNE’ occurring in attribute TEST-CODE in relation LAB-TEST of Database1 represents ‘Pneumonia’ but ‘PNE’ represents ‘Pneumoconiosis’ in attribute CODE in relation TEST of Database2.
• Different string formats. These arise when different databases use different string formats for the same element. The most common occurrence of this is in date representation, for example the attribute DATE in relation VISIT of Database3 is represented as Day Month Year, whereas the attribute DATE in relation VISIT of Database2 is represented as Month Day Year. Other forms might include “MM-DD-YY”, “DD/MM/YY”, “DDMMYY”, “MMDDYY”, “YYYYMMDD” and so on.

Fig. 3. Structure of Database3. Note: Mark Richard’s telephone number was updated on May 15 1989. Patient telephone number was not compulsorily captured until 01/01/1990. So, NULL prior to this date represents ‘unknown’ and after this date represents ‘no telephone number’. The code ‘PNE’ represents ‘pneumonia’.
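As a rough illustration of how differing date string formats might be reconciled, the sketch below parses each string with a layout registered for its source database. The separators and four-digit years are assumptions: the text states only the field order (Month Day Year in Database2, Day Month Year in Database3), and `parse_local_date` is a hypothetical helper name.

```python
from datetime import date, datetime

# Assumed layouts; only the order of day, month and year comes from the text.
DATE_FORMATS = {
    "Database2": "%m/%d/%Y",  # Month Day Year
    "Database3": "%d/%m/%Y",  # Day Month Year
}


def parse_local_date(database: str, text: str) -> date:
    """Parse a date string using the layout registered for its source database."""
    return datetime.strptime(text, DATE_FORMATS[database]).date()
```

The same string can denote different days depending on its source: "05/04/1992" is 4 May 1992 under the Database2 layout and 5 April 1992 under the Database3 layout, which is why the format has to be recorded per database rather than guessed from the value.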

2.3.3. Numeric–string

These arise when the same attribute is defined in terms of different data types in different databases. For example, the PHONE attribute in relation PAT-REC of Database1 is of type string, whereas in relation PATIENT of Database2 it is of type integer. The date problem described in the previous section also arises here, for example the VISIT-DATE attribute value in relation VISIT of Database1 is stored as a numeric value while the DATE attribute value in relation VISIT of Database2 is stored as a string.

2.3.4. Structures

These arise when different databases use different formats for the same element. For example, in Database2 the name of the patient is represented as a single attribute in relation PATIENT whereas in Database3, it is represented as a pair of separate attributes in the relation PATIENT-NAME.

2.3.5. Incomplete information

The meaning of null varies amongst databases (unknown, not applicable, unavailable). For example, when the value of an attribute MAIDEN-NAME is NULL, this is interpreted as not applicable if attribute SEX is male. If the AGE attribute value is equal to NULL this is taken as an unknown value. On the other hand, if the PHONE attribute is NULL as in Database1 and Database3, this may mean either not applicable or unknown.

2.4. Semantic heterogeneity

This form of heterogeneity occurs when there are differences in what the data actually represents or the context in which the data has been captured in different databases. We can classify the semantic heterogeneity as follows.

2.4.1. What the data represents

As an example, the PHONE attribute in relation PAT-REC of Database1 is a home phone number; the PHONE attribute in relation PATIENT of Database4 is a contact phone number, which may be the home phone number but may not. They are related concepts but not necessarily identical.

2.4.2. Context in which data is captured

As an example, consider blood pressure. If blood pressure is measured at home by a nurse the measurement may be significantly lower than that obtained in the clinic by a doctor (so-called ‘white coat’ syndrome). In the case of Database4 the blood pressure in relation VISIT is measured at home by a nurse, whereas in relation C-VISIT it is measured in the clinic. Equally one would like to know whether a measurement may be affected by other conditions (e.g. if a patient being examined for condition X is also suffering from condition Y at the same time).

2.4.3. Difference in abstraction level

The requirements of different local DBMSs may cause objects to be modelled at different levels of abstraction. For example, the attribute RESULT in relation LAB-TEST of Database1 describes the result of a test on the scale 0 to 10 whereas attribute RESULT in relation TEST of Database2 describes the result in terms of values {Low, Normal, Above Normal, High}.

2.5. Data model heterogeneity

2.5.1. Paradigm heterogeneity

Local database systems may employ different paradigms, such as relational, hierarchical, object-oriented, or deductive. The focus of this framework is on the relational database model and has not been extended to consider these other models at this time.

2.5.2. Behavioural differences

These arise when different insertion/deletion policies are associated with the same class of objects in distinct schemas. A record type may have constraints on the total number of occurrences, or on the insertions and deletions of records. For example, the details of a patient’s visit to hospital must be kept for a minimum of 10 years before they can be deleted, but in another database details may be kept for 5 years before they can be deleted.

2.5.3. Dependency conflicts

These arise when a group of concepts is related among themselves with different dependencies in different schemas. For example, it is possible for a relationship between two concepts in one database to be 1:1 whereas in another, it could be 1:n.

2.5.4. Differences in constraints

The data model may support different constraints. For example, in Database4 the patients are all children and hence the attribute BIRTHDATE in relation PATIENT is constrained to dates consistent with this (e.g. less than 10 years of age). On the other hand, the corresponding attribute BIRTHDATE in relation PAT-REC in Database1 has no such constraint.

2.5.5. Default value

This form of heterogeneity occurs when there are different definitions of the attribute domain. Two attributes might have different default values in different databases. For example, when inserting a new VISIT record the default value for VISIT-DATE in Database1 may be the current date whereas the default value for DATE in Database2 may be NULL.

2.5.6. Relation keys

In this case, equivalent relations in different databases may have different attributes as keys, which can affect updates to these relations.

2.6. Timing heterogeneity

2.6.1. Domain evolution

This problem occurs when the semantics of values of a domain change over time. This includes many of the different kinds of heterogeneity already described. For example, the form used to represent a value may change over time. An example of this is the change in telephone code occurring in Database1 where the area code changed from ‘031’ to ‘0131’ at a particular point in time. Other forms of domain evolution include changes in composition of values (e.g. when the tax rules changed), changes in granularity, changes in string representations due to changes in coding systems, changes in cardinality, etc.
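Two of the domain-evolution cases in the example databases lend themselves to a short sketch: the area code change from ‘031’ to ‘0131’, and the changed meaning of NULL for telephone numbers in Database3 (the note to Fig. 3). The Python below is a hedged illustration only. It assumes that telephone numbers are held as strings, that each record carries a capture date (`captured_on` is a hypothetical field), and that a NULL recorded on 01/01/1990 itself takes the later meaning.

```python
from datetime import date
from typing import Optional

OLD_AREA_CODE, NEW_AREA_CODE = "031", "0131"
PHONE_COMPULSORY_FROM = date(1990, 1, 1)  # date given in the note to Fig. 3


def normalise_phone(phone: Optional[str]) -> Optional[str]:
    """Keep only the digits and rewrite the old area code to the new one."""
    if phone is None:
        return None
    digits = "".join(ch for ch in phone if ch.isdigit())
    if digits.startswith(OLD_AREA_CODE):
        return NEW_AREA_CODE + digits[len(OLD_AREA_CODE):]
    return digits


def meaning_of_null_phone(captured_on: date) -> str:
    """Interpret a NULL telephone number according to when it was captured."""
    if captured_on < PHONE_COMPULSORY_FROM:
        return "unknown"
    return "no telephone number"
```

Normalising both stored values and query constants in this way lets numbers recorded before and after the change be compared directly.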

2.6.2. Inconsistencies due to asynchronous updates

These happen when data items replicated in different databases are updated at different points in time and become inconsistent. For example, the PHONE attribute in relation PATIENT of Database2 for Mark Richard has been updated without a corresponding update to attribute PHONE in relation PATIENT of Database3, and the two attribute values become inconsistent.
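A minimal sketch of detecting such an inconsistency is given below. It assumes that each copy of a value carries the date of its last update, as the notes to Figs. 2 and 3 do for Mark Richard’s telephone number. Preferring the most recently updated copy is only one possible resolution policy and is not prescribed by the framework; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class Copy:
    """One replicated value together with where and when it was last updated."""
    database: str
    value: str
    updated_on: date


def is_inconsistent(copies: List[Copy]) -> bool:
    """True when the replicated copies do not all hold the same value."""
    return len({c.value for c in copies}) > 1


def most_recent(copies: List[Copy]) -> Copy:
    """Assumed policy: prefer the copy with the latest update date."""
    return max(copies, key=lambda c: c.updated_on)
```

For example, with placeholder values "old-number" (Database3, updated 15 May 1989) and "new-number" (Database2, updated 24 April 1992), `is_inconsistent` returns `True` and `most_recent` selects the Database2 copy.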


Fig. 4. Structure of Database4.

3. The test suite

For the test suite, the following eight simple queries have been selected.

3.1. Find the test code and result for Karen Taylor

This query tests for attribute synonyms (e.g. NAME/PATIENT-NAME), attribute-relation homonyms (e.g. PATIENT-NAME), relation synonyms (e.g. LAB-TEST/TEST), value homonyms (e.g. meaning of PNE), structures (e.g. PATIENT-NAME), relation size (PATIENT has 4 or 5 attributes), and difference in abstraction level (RESULT).

3.2. Find the telephone number for patients born before 01/01/1981

This query involves semantic heterogeneity in what the data represents (home vs work number), other unit conversions (with or without area code), incomplete information (meaning of NULL), different string formats (for BIRTHDATE), relation synonyms (PAT-REC/PATIENT), relation size (PATIENT), domain evolution (area code and changed meaning of NULL), and inconsistencies due to asynchronous updates (the attribute PHONE in relation PATIENT of Database2 is updated (Mark Richard’s phone number) without updating the attribute PHONE in relation PATIENT of Database3).

3.3. Find weights of all male patients weighed within the last year

The query covers fixed conversion between different units (WEIGHT: pounds vs kilograms), different granularity (kgs vs. tenths of kg), value synonyms (Male/Female vs. M/F), relation size, relation homonyms (VISIT), relation synonyms (PATIENT/PAT-REC) and numeric-string conversion (VISIT-DATE).
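The synonym part of this query (relation, attribute and value synonyms) amounts to a syntactic rewrite driven by lookup tables, as the following sketch suggests. It covers only the selection of patients by sex, not the join with VISIT or the unit conversions. The relation, attribute and value names are those given for Database1 and Database2 in Section 2; the table layout, the function name and the generated SQL text are assumptions made for illustration.

```python
# Local names for the global concepts "patient name" and "sex".
SCHEMA_MAP = {
    "Database1": {"relation": "PAT-REC", "name": "NAME", "sex": "SEX"},
    "Database2": {"relation": "PATIENT", "name": "PATIENT-NAME", "sex": "SEX"},
}

# Value synonyms for the SEX attribute.
SEX_VALUES = {
    "Database1": {"male": "Male", "female": "Female"},
    "Database2": {"male": "M", "female": "F"},
}

TEMPLATE = "SELECT \"{name}\" FROM \"{relation}\" WHERE \"{sex}\" = '{value}'"


def names_by_sex_query(database: str, sex: str) -> str:
    """Rewrite the global query 'names of patients of a given sex' for one database."""
    local = SCHEMA_MAP[database]
    value = SEX_VALUES[database][sex]  # raises KeyError for an unmapped value
    return TEMPLATE.format(value=value, **local)
```

Because every substituted name and value comes from the fixed tables rather than from user input, plain string formatting is adequate for this sketch; a real mediator would use parameterised queries.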


Table 1. Relationship between concepts used by other researchers and our classification

Columns (heterogeneity categories):
N.HG (naming heterogeneity): NS (naming synonyms), NH (naming homonyms)
R.S.H. (relational structure heterogeneity): RZ (relation size)
V.H. (value heterogeneity): NN (numeric–numeric), SS (string–string), Ns (numeric–string), II (incomplete information), S (structures)
S.H. (semantic heterogeneity): WDR (what the data represents), CC (context in which data is captured), DAL (difference in abstraction level)
D.M. (data model heterogeneity): PH (paradigm heterogeneity), BD (behavioural differences), DC (dependency conflicts), DIC (differences in constraints), DV (default value), RK (relation keys)
T.H. (timing heterogeneity): IDAU (inconsistencies due to asynchronous updates), D.E. (domain evolution)

Rows (authors): Cardenas, A.F. [37]; Sheth and Larson [32]; Thomas, Thompson et al. [38]; Ferrier and Stangret [39]; Litwin and Abdellatif [40]; Urban and Wu [29]; Hurson and Bright [12]; Larson, Navathe et al. [13]; Chatterjee and Segev [21]; Spaccap., Parent et al. [31]; Navathe and Gadgil [19]; Batini and Lenzerini [20]; Bukhres, Elmag. et al. [14]; Navathe and Savasere [27]; Casanova and Vidal [4]; Motro and Buneman [5]; Al-fedaghi and Scheu. [6]; Yao, Waddle, Housel [7]; Yu, Jia, Sun, and Dao [8]; Teorey and Fry [9]; Kahn [17]; Elmasri and Navathe [10]; Navathe, Sashid. et al. [11]; Mannino and Effelsbe. [23]; Kaul, Drosten, Neuhold [24]; Elmasri, Larson, Nav. [22]; Dayal and Hwang [16]; Batini, Lenzerini et al. [15]; Spaccapietra, Parent [41]; Gangopadhya, Barsalou [44]; Worboys and Deen [45]; Solaco, Saltor et al. [33]; Ventrone and Heiler [34]; Kim and Seo [18]; Reddy, Prasad, Gupta [28]; Breitbart, Olson et al. [42]; Fankhauser, Neuhold [35]; Sheth and Kashyap [36]; Jeffery, Hutchinson et al. [26]; Deen, Amin, Taylor [25].

[Cell entries (‘F’ or ‘P’ for each author and category) are not recoverable from this copy.]

3.4. Find the price of medication for patient 529

This query involves time varying conversion between different units (pounds sterling vs. US dollars), composition of values in a single-valued attribute (PRICE with or without TAX), and domain evolution (TAX rate).
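The composition-of-values and domain-evolution aspects of this query can be illustrated with the tax rates from the note to Fig. 1 (15% before 1991, 17.5% afterwards). The sketch below strips tax from a tax-inclusive MEDICATION-PRICE so that it can be compared with a tax-exclusive PRICE. The exact changeover date of 1 January 1991 and the function names are assumptions, and currency conversion is left out.

```python
from datetime import date

TAX_CHANGE = date(1991, 1, 1)  # assumed changeover; the note says only "before 1991"


def tax_rate(on: date) -> float:
    """Tax rate in force on the given date, per the note to Fig. 1."""
    return 0.15 if on < TAX_CHANGE else 0.175


def price_excluding_tax(price_including_tax: float, on: date) -> float:
    """Recover the tax-exclusive price from a tax-inclusive one."""
    return price_including_tax / (1.0 + tax_rate(on))
```

A price of 115.00 recorded in 1990 and a price of 117.50 recorded in 1992 both correspond to a tax-exclusive price of 100.00, so the date of the visit is needed before the two attributes can be compared.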

3.5. Find the blood-pressure for Alex Brown

This query covers semantic heterogeneity in the context in which data is captured (BLOOD PRESSURE), and attribute homonyms (DATE).

A single query which covers most of the range of heterogeneities in the test suite is as follows: Find the name, date of birth and telephone number of every male patient who has had a high result (>4.0) for test PNE and whose weight exceeds 180 pounds.

4. Overview of previous work and comparison against framework This section provides a brief overview of the aspects covered in a number of different papers on this subject. A summary of how these fit into the proposed framework is given in Table 1. Note that the terminology used differs amongst authors and their coverage of a concept varies; this is indicated by an ‘F’ (if the concept is fully covered) or a ‘P’ (if it is partially covered) in the table. The major problem is that of finding how a data item in one set can be mapped to an appropriate form to make it


Fig. 5. The classification diagram.

accessible in another; in other words, finding attribute equivalences. The simplest form of heterogeneity in this regard is that of naming conflicts and naming heterogeneity. In general, the categories of structural and naming heterogeneities are recognised by most authors, e.g. Casanova and Vidal [4], Motro and Buneman [5], Al-fedaghi and Scheuermann [6], Yao et al. [7], Yu et al. [8], Teorey and Fry [9], Elmasri and Navathe [10], Navathe et al. [11], Hurson and Bright [12], Larson et al. [13], and Bukhres et al. [14]. Batini et al. [15], Dayal and Hwang [16], Kahn [17], and Kim and Seo [18] recognised these two categories and defined naming conflicts as including homonyms and synonyms. Navathe and Gadgil [19], Batini and Lenzerini [20], and Chatterjee and Segev [21] defined naming conflicts as homonyms and synonyms. Elmasri et al. [22] used the same two categories but widened naming conflicts to include attribute equivalence and entity class equivalence. Structural conflicts may be viewed as differences in

abstraction level [22–24] as well as differences in roles, degree and cardinality constraints [11,22]. The second major form of heterogeneity is concerned with differences in the representations of values. Again a number of authors recognise differences in units and granularity as well as differences in data types and structure. These include Deen et al. [25], Bukhres et al. [14], Jeffery et al. [26], and Larson et al. [13] who also cover differences in level of abstraction and in object identifier, Chatterjee and Segev [21] who include codes, incomplete information and recording errors, and Navathe and Savasere [27] who include data type and scale. Another way of viewing this is by distinguishing between schema level conflicts and data level inconsistencies [16]. This notion is elaborated by Kim and Seo [18] who distinguish between data that has been incorrectly entered, obsolete data and different representations for the same data. Reddy et al. [28] refer to quantitative data incompatibilities


which they attribute to different levels of accuracy, asynchronous updates and lack of security. The most complex heterogeneity is semantic heterogeneity, which is addressed by Urban and Wu [29], Colomb and Orlowska [30], Spaccapietra et al. [31], Sheth and Larson [32], and Hurson and Bright [12]. The latter point out the problems in modelling objects of different data structures, different names, different sets of attributes, different behaviour functions, different meanings, etc. Solaco et al. [33] also base the classification of semantic heterogeneities on an object-oriented data model. In addition Ventrone and Heiler [34] describe problems of semantic heterogeneity due to domain evolution. Fankhauser and Neuhold [35] refer to the problem of ambiguity and distinguish model ambiguity (arising from primitives such as is-a, instance-of, part-of) and semantic ambiguity. Sheth and Kashyap [36] include conflicts such as default value conflicts, attribute integrity constraint conflicts and union compatibility conflicts. The data model heterogeneity is addressed by Cardenas [37], Thomas et al. [38], Ferrier and Stangret [39], Litwin and Abdellatif [40], Hurson and Bright [12], Spaccapietra and Parent [41], Breitbart, Olson et al. [42], Navathe and Savasere [27], while Bukhres et al. [14] also cover access heterogeneity. Apart from the heterogeneities covered in this paper, authors have also covered differences in the database management systems [32], differences in data models [43], differences in query languages and differences at the system level (e.g. concurrency control, commit protocols and recovery). Ferrier and Stangret [39] include the network and the operating system and Litwin and Abdellatif [40] physical aspects such as login procedures, concurrency control, etc.

5. Conclusion A lot of research has been done on the problem of accessing heterogeneous distributed database systems and a range of different aspects of heterogeneity have been identified by different authors. This paper presents a framework for classifying the different types of heterogeneity, which brings together the different aspects of heterogeneity addressed by these authors. A summary of this classification is given in Fig. 5. An overview of some of the work done by different researchers on the problem of heterogeneity is given in Section 4. A summary of how the different concepts covered by different authors fits into the proposed framework is given in Table 1. From this framework a test suite has been developed which can be used to evaluate and compare the extent to which different approaches handle different aspects of this heterogeneity. A major advantage of this test suite is that it consists of four small databases and a small set of queries, which are easy to implement. Using it, all aspects of

heterogeneity identified in the framework are covered, with the exception of data model heterogeneity. This classification is based on a relational model, although it could easily be adapted to other paradigms. Such a framework can provide an aid for database designers and for integrating heterogeneous database research. We are currently using this framework as the basis for exploring means of resolving heterogeneity in databases by using a set of software agents. In time, the framework will be expanded and clarified, not least in the area of data model heterogeneity.

Acknowledgements

The authors wish to acknowledge the support given to H. El-Khatib by the Arab Students Aid International towards his PhD studies.

References

[1] A.K. Elmagarmid, C. Pu, Introduction: special issue on heterogeneous databases (guest editors), ACM Computing Surveys 22 (3) (1990) 175–178.
[2] L.M. Mackinnon, D.H. Marwick, M.H. Williams, A model for query decomposition and answer construction in heterogeneous distributed database systems, J. Intelligent Inf. Sys. 11 (1998) 69–87.
[3] W.J. Austin, E.K. Hutchinson, J.R. Kalmus, L.M. Mackinnon, K.G. Jeffery, D.H. Marwick, M.H. Williams, M.D. Wilson, Processing travel queries in a multimedia information system, Proceedings of Information & Comms Technologies in Tourism, Springer, Berlin, 1994, pp. 64–71.
[4] M.A. Casanova, V.M.P. Vidal, Towards a sound view integration methodology, in: Second ACM SIGACT/SIGMOD conference on principles of database systems, Atlanta, GA, ACM, New York, March 21–23, 1983, pp. 36–47.
[5] A. Motro, P. Buneman, Constructing superviews, in: International Conference on Management of Data, Ann Arbor, MI, ACM, New York, April 29–May 1, 1981, pp. 56–64.
[6] S. Al-Fedaghi, P. Scheuermann, Mapping considerations in the design of schemas for the relational model, IEEE Trans. Software Engng SE-7 (1) (1981) 99–111.
[7] S.B. Yao, V.E. Waddle, B.C. Housel, View modelling and integration using the functional data model, IEEE Trans. Software Engng SE-8 (6) (1982) 544–553.
[8] C. Yu, B. Jia, W. Sun, S. Dao, Determining relationships among names in heterogeneous databases, Sigmod Record 20 (4) (1991) 79–80.
[9] T. Teorey, J. Fry, Design of database structures, Prentice-Hall, Englewood Cliffs, NJ, 1982.
[10] R. Elmasri, S. Navathe, Object integration in logical database design, in: IEEE COMPDEC Conference, 1984, pp. 426–433.
[11] S.B. Navathe, T. Sashidhar, R. Elmasri, Relationship merging in schema integration, in: Tenth International Conference on Very Large Data Bases, Singapore, 1984, pp. 78–90.
[12] A.R. Hurson, M.W. Bright, Object-Oriented multidatabase systems, in: O.A. Bukhres, A.K. Elmagarmid (Eds.), Object-Oriented multidatabase systems: a solution for advanced applications, Prentice-Hall, Englewood Cliffs, NJ, 1996, pp. 1–33.
[13] J.A. Larson, S.B. Navathe, R. Elmasri, A theory of attribute equivalence in databases with application to schema integration, IEEE Trans. Software Engng 15 (4) (1989) 449–463.
[14] O.A. Bukhres, A.K. Elmagarmid, F.F. Gherfal, X. Liu, K. Barker, T. Schaller, The integration of database systems, in: O.A. Bukhres, A.K. Elmagarmid (Eds.), Object-Oriented multidatabase systems: a solution for advanced applications, Prentice-Hall, Englewood Cliffs, NJ, 1996, pp. 37–56.
[15] C. Batini, M. Lenzerini, S.B. Navathe, A comparative analysis of methodologies for database schema integration, ACM Computing Surveys 18 (4) (1986) 323–364.
[16] U. Dayal, H. Hwang, View definition and generalization for database integration in a multidatabase system, IEEE Trans. Software Engng SE-10 (6) (1984) 628–645.
[17] B. Kahn, A structured logical database design methodology, PhD dissertation, Department of Computer Science, University of Michigan, Ann Arbor, MI, 1979.
[18] W. Kim, J. Seo, Classifying schematic and data heterogeneity in multidatabase systems, Computer 24 (12) (1991) 12–18.
[19] S. Navathe, S. Gadgil, A methodology for view integration in logical database design, in: Eighth International Conference on Very Large Data Bases, Mexico City, VLDB Endowment, Saratoga, CA, 1982, pp. 142–164.
[20] C. Batini, M. Lenzerini, A methodology for data schema integration in the entity-relationship model, IEEE Trans. Software Engng SE-10 (6) (1984) 650–664.
[21] A. Chatterjee, A. Segev, Data manipulation in heterogeneous databases, Sigmod Record 20 (4) (1991) 64–68.
[22] R. Elmasri, J. Larson, S.B. Navathe, Integration algorithms for federated databases and logical database design, Tech. Rep. Honeywell Corporate Research Center, 1987.
[23] M.V. Mannino, W. Effelsberg, A methodology for global schema design, Tech. Rep. No. TR-84-1, Department of Computer and Information Science, University of Florida, 1984.
[24] M. Kaul, K. Drosten, E.J. Neuhold, ViewSystem: integrating heterogeneous information bases by object-oriented views, in: IEEE Sixth International Conference on Data Engineering, Los Angeles, 1990, pp. 2–10.
[25] S.M. Deen, R.R. Amin, M.C. Taylor, Data integration in distributed databases, IEEE Trans. Software Engng SE-13 (7) (1987) 860–864.
[26] K.G. Jeffery, L. Hutchinson, J. Kalmus, M.
Wilson, W. Behrendt, C. Macnee, A model for heterogeneous distributed database systems, in: D.S. Bowers (Ed.), Directions in Databases; Proceedings BNCOD12, Guildford, UK, July 6–8, Lecture Notes in Computer Science, 826, Springer, Berlin, 1994. S. Navathe, A. Savasere, A schema integration facility using objectoriented data model, in: O.A. Bukhres, A.K. Elmagarmid (Eds.), Object-Oriented Multidatabase Systems: a Solution for Advanced Applications, Prentice-Hall, Englewood Cliffs, NJ, 1996, pp. 105– 127. M.P. Reddy, B.E. Prasad, P.G. Reddy, A. Gupta, A methodology for integration of heterogeneous databases, IEEE Trans. Knowledge and Data Engng 6 (6) (1994) 920–933. S.D. Urban, J. Wu, Resolving semantic heterogeneity through the

[30] [31]

[32]

[33]

[34] [35]

[36]

[37] [38]

[39]

[40] [41] [42]

[43]

[44]

[45]

515

explicit representation of data model semantics, Sigmod Record 20 (4) (1991) 55–58. R.M. Colomb, M.E. Orlowska, Interoperability in information systems, Information Systems Journal 5 (1994) 37–50. S. Spaccapietra, C. Parent, Y. Dupont, Model independent assertions for integration of heterogeneous schemas, VLDB Journal 1 (1) (1992) 81–126. A.P. Sheth, J.A. Larson, Federated database systems for managing distributed, heterogeneous, and autonomous databases, ACM Computing Surveys 22 (3) (1990) 183–232. M. Solaco, F. Saltor, M. Castellanos, Semantic heterogeneity in multidatabase systems, in: O.A. Bukhres, A.K. Elmagarmid (Eds.), ObjectOriented Multidatabase Systems: a Solution For Advanced Applications, Prentice-Hall, Englewood Cliffs, NJ, 1996, pp. 129–202. V. Ventrone, S. Heiler, Semantic heterogeneity as a result of domain evolution, Sigmod Record 20 (4) (1991) 16–20. P. Fankhauser, E.J. Neuhold, Knowledge based integration of heterogeneous databases, in: D.K. Hsiao, E.J. Neuhold, R. Sacks-Davis (Eds.), Interoperable Database Systems (DS-5) (A-25), Elsevier, Amsterdam, 1993, pp. 155–175. A. Sheth, V. Kashyap, So Far (Schematically) yet So Near (Semantically), in: D.K. Hsiao, E.J. Neuhold, R. Sacks-Davis (Eds.), Interoperable Database Systems (DS-5) (A-25), Elsevier, Amsterdam, 1993, pp. 283–311. A.F. Cardenas, Heterogeneous distributed database management: the HD-DBMS, Proceedings of the IEEE 75 (5) (1987) 588–600. G. Thomas, G.R. Thompson, C. Chung, E. Barkmeyer, F. Carter, M. Templeton, S. Fox, B. Hartman, Heterogeneous distributed database systems for production use, ACM Computing Surveys 22 (3) (1990) 237–266. A. Ferrier, C. Stangret, Heterogeneity in the distributed database management system Sirius-Delta, in: Eighth International Conference on Very Large Data Bases, Mexico City, 1982, pp. 45–53. W. Litwin, A. Abdellatif, Multidatabase interoperability, Computer 19 (12) (1986) 10–18. S. Spaccapietra, C. 
Parent, Conflicts and correspondence assertions in interoperable databases, Sigmod Record 20 (4) (1991) 49–54. Y. Breitbart, P.L. Olson, G.R. Thompson, Database integration in a distributed heterogeneous database system, Second IEEE Data Engineering International Conference, CS Press, Los Almitos, CA, 1986 pp. 301–310. D.K. Hsiao, M.N. Kamel, Heterogeneous databases: proliferations, issues, and solutions (invited paper), IEEE Trans. on Knowledge and Data Engng 1 (1) (1989) 45–62. D. Gangopadhyay, T. Barsalou, On the semantic equivalence of heterogeneous representations in multimodel multidatabase systems, Sigmod Record 20(4) (1991) 35–39. M.F. Worboys, S.M. Deen, Semantic heterogeneity in distributed geographic databases, Sigmod Record 20(4) (1991) 30–34.