An Efficient and Scalable Algorithm for Segmented Alignment of Ontologies of Arbitrary Size

Md. Hanif Seddiqui and Masaki Aono*

Toyohashi University of Technology, 1-1 Hibarigaoka, Tempakucho, Toyohashi, Japan

Abstract

Achieving efficiency and scalability in aligning two massive, conceptually similar ontologies has been a formidable task. We assume that an ontology is typically given in RDF (Resource Description Framework) or OWL (Web Ontology Language) and can be represented by a directed graph. A straightforward approach to aligning two ontologies entails an O(N^2) computation, comparing every combination of pairs of nodes from the two ontologies, where N denotes the average number of nodes in each ontology. Our proposed algorithm, called the Anchor-Flood algorithm, achieves O(N) computation on average. It starts off with an anchor, a pair of "look-alike" concepts from the two ontologies, and gradually explores concepts by collecting neighboring concepts, thereby taking advantage of locality of reference in the graph data structure. It outputs a set of alignments between concepts and properties within semantically connected subsets of the two entire graphs, which we call segments. Starting from the anchor, we repeat the similarity comparison between pairs of nodes within the neighborhood iteratively, until either all the collected concepts are explored or no new aligned pair is found. In this way, we significantly reduce the computational time for the alignment. Moreover, since we only perform segment-to-segment comparison, regardless of the entire size of the ontologies, our algorithm not only achieves high performance but also resolves the scalability problem in aligning ontologies. Our proposed algorithm also reduces the number of seemingly-aligned but actually misaligned pairs. Through several experiments with large ontologies, we demonstrate the features of our Anchor-Flood algorithm.

Key words: Ontology alignment, locality of reference, segmented alignment

∗ Corresponding author Email addresses: [email protected] (Md. Hanif Seddiqui), [email protected] (Masaki Aono).

Preprint submitted to Elsevier

1 July 2014

1 Introduction

"An ontology is an explicit specification of a conceptualization" is a prominent definition by T.R. Gruber in 1995 [13]. The definition was later extended by R. Studer et al. in 1998 as "an ontology is an explicit, formal specification of a shared conceptualization of a domain of interest" [39]. An ontology generally consists of entities such as concepts, properties and relations. It is the backbone for fulfilling the semantic web vision [2,23] and serves as a knowledge base that enables machines to communicate with each other effectively. The knowledge captured in ontologies can be used to annotate data, to distinguish between homonyms and polysemies, to drive intelligent user interfaces and even to retrieve new information. The number of ontologies is increasing day by day with new semantic web content, because ontologies are developed to formalize the conceptualization behind the idea of the semantic web. Therefore, in order to achieve semantic interoperability and integrity, ontology alignment plays an important role [1].

Ontologies may be of different sizes, from small to large, having hundreds to millions of RDF triples, where an RDF triple is an RDF statement relating a subject, a predicate, and an object. A small-scale ontology is usually defined within a single segment, where the term "segment" is used by J. Seidenberg et al. [34] for a fragment that stands alone as an ontology in its own right. A large-scale ontology is generally very complex in nature and may contain multiple segments [34]. Stuckenschmidt et al. [38] also observe that a large-scale ontology contains a set of modules about certain subtopics that can be used independently of the other modules. Large-scale ontologies expressed in the Web Ontology Language (OWL) [25] are appearing to capture distributed knowledge in centralized knowledge bases. For example, the biomedical domain has enormous large-scale ontologies, such as FMA [32] and OpenGALEN [31].

Although there are many diverse solutions to the ontology alignment problem [7], only a few are capable of handling large ontologies efficiently [17,19]. Many previously developed ontology alignment systems face severe performance problems when they attempt to deal with large ontologies. Resolving the scalability problem is thus an important current issue in ontology alignment research [35]. Moreover, users (humans or machines) may want to find the specific part of a large ontology that fits their particular interest.

The main contribution of our approach is to attain a performance enhancement by solving the scalability problem in aligning large ontologies. The key idea is to start from an anchor point in the taxonomy of an ontology and to proceed towards the neighboring nodes, exploiting locality of reference. Eventually our proposed algorithm aligns parts of gigantic ontologies and outputs a segmented alignment, i.e. an alignment across two related segments or fragments of ontologies, which are relatively small, yet sufficient to satisfy a user's interest in a particular domain. Therefore, performance, scalability, and segmented alignment are the key motivations of the algorithm proposed in this paper.

In our previous research, we also focused on ontology alignment [16]. Our system was capable of aligning small-scale ontologies effectively, with the feature of eliminating misalignments. However, its major drawback was in aligning large-scale ontologies, as its complexity was O(N^2). To overcome this scalability limitation, we developed a new algorithm, which we call the "Anchor-Flood" algorithm. Our algorithm assumes a seed called an anchor, where the notion of anchor is derived from the Anchor-PROMPT algorithm [30], although it is clearly different from the notion used in that algorithm: Anchor-PROMPT augments existing methods by determining additional possible aligned pairs across ontologies, while our Anchor-Flood algorithm starts aligning from an anchor and produces segmented alignments. Our algorithm also shares the term "flood" with the similarity-flooding algorithm [26], without major similarities in their processing blocks. Our work is inspired by the idea of the similarity-flooding algorithm in that the elements of two distinct hierarchical models are likely to be similar when their adjacent elements are similar; however, unlike the similarity-flooding algorithm, our Anchor-Flood algorithm does not propagate similarity values to adjacent nodes iteratively. In the main processing block of our algorithm, we collect aligned pairs from the neighboring concepts by computing similarities among the collected concepts across ontologies, starting from an anchor, and then explore the neighboring concepts to collect more aligned pairs. As our algorithm starts off an anchor and explores the neighboring concepts, it has the salient feature of scalability. It achieves enhancements in terms of both scalability and performance in aligning large ontologies, reduces the number of seemingly-aligned but actually misaligned pairs [16], and outputs a "segmented alignment", which is a unique characteristic in the field of ontology alignment research.

The rest of the paper is organized as follows. Section 2 introduces general terminology used in the later sections. Section 3 describes our proposed "Anchor-Flood" algorithm. Section 4 covers experiments and evaluation, while Section 5 analyzes the complexity of the algorithm. Related work and the differences from our algorithm are described in Section 6. Section 7 contains concluding remarks along with some future directions of our work.

2 General Terminologies

This section introduces some of the basic definitions in the field of ontology alignment research and familiarizes the reader with the notions and terminologies used throughout the paper. It includes the definitions of ontology, taxonomy, alignment across ontologies, the similarity measures and the idea of a segment in an ontology.

2.1 Ontology, concept, relation, and taxonomy

According to M. Ehrig [7], an ontology contains a core ontology, logical mappings, a knowledge base, and a lexicon. Furthermore, a core ontology is defined as a tuple of five sets: concepts, a concept hierarchy or taxonomy, properties, a property hierarchy, and a concept-to-property function. In a taxonomy, concepts are organized in subsumption relations. For example, if ≤_C represents a taxonomy and c_1 <_C c_2 for c_1, c_2 ∈ C, then c_1 is a subconcept of c_2, and c_2 is a superconcept of c_1. A taxonomy (or concept hierarchy) and concept relations are used widely throughout this paper, and taxonomies are the primary element of our algorithm. We also use some of the components of the knowledge base of an ontology, such as instances.

2.2 Ontology alignment

An alignment A is defined as a set of correspondences, quadruples ⟨e, f, r, l⟩, where e and f are the two aligned entities across ontologies, r represents the relation holding between them, and l ∈ [0, 1] represents the level of confidence, if present in the alignment statement. The relation r is either a simple (one-to-one equivalence) relation or a complex (subsumption, or one-to-many) relation [7]. The correspondence between e and f is called an aligned pair throughout the paper. An alignment is obtained by measuring similarity values between pairs of entities.
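For concreteness, a correspondence can be represented as a small record type. The following is a minimal Python sketch of ours (the field names are our own, not from any alignment API in the paper):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Correspondence:
    """One correspondence <e, f, r, l> of an alignment A (Sec. 2.2)."""
    e: str      # entity from ontology O1
    f: str      # entity from ontology O2
    r: str      # relation, e.g. "=" (equivalence) or a subsumption
    l: float    # confidence level in [0, 1]
```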

2.3 Similarity Measures

Similarity is usually measured by considering the textual contents, structure, and semantics available in an ontology. Concepts, properties and instances often contain labels and comments as their textual contents; sometimes the URI itself is informative. The textual contents associated with an entity, such as a concept, a property or an instance, are referred to as a description. A description usually contains informative terms; therefore, description-based similarity measures are often called terminological similarity measures. Terminological similarity measures are widely used in ontology alignment methods. We use terminological similarity measures based on WordNet [9] and the Jaro-Winkler string distance [43] in our algorithm.

We employ a WordNet based equality, referred to as Sim_WN. WordNet based equality states that two terms are equal if there is at least one sense in common, i.e. one term is a synonym of the other [11], and is defined as

\[
Sim_{WN}(t_1, t_2) =
\begin{cases}
1.0, & \text{if } cond(t_1, t_2) \text{ holds} \\
0.0, & \text{otherwise}
\end{cases}
\tag{1}
\]

\[
cond(t_1, t_2) = \exists x \, \{ x \in senses(t_1) \wedge x \in senses(t_2) \}
\]

The function senses(t_1) returns the WordNet synonym senses of a particular term t_1.

String metric based similarity is defined in [43] and used in [19] and [36]. Let d_i be the description of concept c_i and d_j be the description of concept c_j. The string metric based similarity between c_i and c_j is then defined as follows:

\[
Sim_{SM}(d_i, d_j) = comm(d_i, d_j) - diff(d_i, d_j) + winkler(d_i, d_j),
\tag{2}
\]

where comm(d_i, d_j) stands for the commonality between d_i and d_j, diff(d_i, d_j) for the difference, and winkler(d_i, d_j) for the improvement of the result using the method introduced by Winkler [43].

On the other hand, structural similarity measures in our research rely on the intuition that the elements of two distinct models are similar when their adjacent elements are similar [26]. Structural similarity values are computed by the methods called structural internal matching and structural external matching [36]. In structural internal matching, the similarity value between two concepts is computed as the ratio of the number of terminologically similar properties to the total number of properties of the pair of concepts across ontologies:

\[
Sim_{internal}(c_i, c_j) = \frac{2 \times |\text{Aligned Properties}|}{|\text{Properties} \in c_i, c_j|}
\tag{3}
\]

In structural external matching, the similarity value between two concepts is computed as the ratio of the number of terminologically similar direct superconcepts, siblings, and subconcepts to the total number of direct superconcepts, siblings and subconcepts:

\[
Sim_{external}(c_i, c_j) = \frac{2 \times |\text{Aligned Pairs}|}{|\text{Concepts around } c_i, c_j|}
\tag{4}
\]
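As an illustration of Eq. 1, the WordNet-based equality can be sketched in a few lines of Python using NLTK's WordNet corpus reader (using NLTK here is our assumption; the paper does not prescribe a particular WordNet API):

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

def sim_wn(t1: str, t2: str) -> float:
    """Eq. 1: two terms are equal (1.0) if they share at least one
    WordNet sense, i.e. one is a synonym of the other; 0.0 otherwise."""
    return 1.0 if set(wn.synsets(t1)) & set(wn.synsets(t2)) else 0.0
```

For instance, sim_wn("car", "automobile") returns 1.0, since both terms share the sense car.n.01, while two unrelated terms with disjoint sense sets yield 0.0.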

2.4 Segment

The notion of a "segment" plays an important role in this paper. Generally, a segment within an ontology is a conceptually connected subset of the entire ontology graph that is not a mere fragment, but stands alone as an ontology in its own right [34,38]. A segment can be independent of the other subtopics or other segments of an ontology. A segmented alignment in this paper is defined as a pair of connected components across two ontologies, extended as far as aligned pairs are found among them. The size of a segment is determined dynamically at alignment time, as our Anchor-Flood algorithm starts off with an "anchor" and explores the neighboring concepts, finding aligned pairs among them, until "either all the collected concepts are explored, or no new aligned pair is found". An example of a pair of segments is illustrated by the enclosed polygons in Figs. 1 and 2. Hence, starting from a single anchor, our algorithm may find two aligned segments from the two ontologies, and we call the alignment found within the segments a segmented alignment. As long as the aligned pairs are connected within neighbors across ontologies, the connected components of each ontology form a segment.

3 Anchor-Flood Algorithm

Our proposed algorithm takes an anchor as one of its inputs, along with two ontologies O1 and O2. It outputs a set of aligned pairs within two segments across the ontologies. Before proceeding to the details of our algorithm, we define four frequently used terms.

Definition 1 (Anchor) An anchor is defined as a pair of similar concepts across ontologies. If e ∈ O1, f ∈ O2 and e ≡ f, meaning that e and f are aligned, then an anchor X is defined as the pair (e(O1), f(O2)), where the first element e(O1) comes from ontology O1 and the second element f(O2) comes from ontology O2. The notion of an anchor was first introduced in an ontology alignment technique by Noy et al. [30].

[Figure: taxonomy graph of O1 rooted at owl:Thing, with nodes Entry, Article, Proceedings, Conference, Inproceedings (marked as the anchor), Manual, Booklet, Book, Incollection, TechReport, Inbook, Mastersthesis, Phdthesis, Unpublished and Misc.]

Fig. 1. Bibliographic Ontology, O1 (OAEI-2006 benchmark dataset #301). The segment found by our algorithm is enclosed by the polygon.

[Figure: taxonomy graph of O2 rooted at owl:Thing, with nodes Publisher, Institution, School, MastersThesis, Academic, PhdThesis, foaf:Person, Collection, Conference, Proceedings, Book, foaf:Organization, Monograph, PageRange, Booklet, Date, Unpublished, Address, Informal, Journal, Manual, LectureNotes, Reference, Report, Deliverable, MotionPicture, TechReport, Chapter, rdf:List, Part, Article, Misc, InProceedings (marked as the anchor), PersonList, InBook and InCollection.]

Fig. 2. Bibliographic Ontology, O2 (OAEI-2006 benchmark dataset #101). The segment found by our algorithm is enclosed by the polygon.

Definition 2 (Neighbors) As concepts are organized in a hierarchical structure called a taxonomy, we consider the neighbors of a concept in this paper as a set of concepts, defined as:

\[
neighbors(c) = children(c) \cup grand\_children(c) \cup parents(c) \cup siblings(c) \cup nephews(c) \cup grand\_parents(c) \cup uncles(c),
\tag{5}
\]

where children(c) and parents(c) represent the children and the parents of a particular concept c within a taxonomy, respectively; grand_children(c) is children(children(c)); grand_parents(c) is parents(parents(c)); siblings(c) is children(parents(c)) − {c}; nephews(c) is children(siblings(c)); and, last of all, uncles(c) is children(grand_parents(c)) − parents(c). The term neighboring concept is used for an element of neighbors.
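Given a taxonomy stored as child/parent adjacency maps (a hypothetical representation of ours, not the paper's data structure), Eq. 5 translates directly into set operations. A minimal sketch:

```python
def neighbors(c, children, parents):
    """Eq. 5: neighbor set of concept c, from dict-of-sets maps
    `children` and `parents` of the taxonomy."""
    def expand(nodes, relation):
        out = set()
        for n in nodes:
            out |= relation.get(n, set())
        return out

    ch = children.get(c, set())
    pa = parents.get(c, set())
    siblings = expand(pa, children) - {c}
    grand_parents = expand(pa, parents)
    return (ch | expand(ch, children)                   # children, grand-children
            | pa | siblings                             # parents, siblings
            | expand(siblings, children)                # nephews
            | grand_parents
            | (expand(grand_parents, children) - pa))   # uncles
```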

Definition 3 (Descendant set) A descendant set D is a set of one or more children of concepts explored in a taxonomy, defined as follows: D = {d | d is a descendant of an explored concept}.

Definition 4 (Ancestor set) An ancestor set (or parent set) P is a set of one or more ancestor concepts of the concepts available in the descendant set D: P = {p | p is an ancestor of a concept in D}.

We assume that every concept within a taxonomy of an ontology has semantic relationships with its neighboring concepts. The concepts of two distinct hierarchies are similar if their neighbors are similar [26]. Conversely, if the concepts of two distinct hierarchies are similar, there is a possibility that their neighbors are similar. Assume that there are two concepts e and f such that e ∈ O1, f ∈ O2 and e ≡ f, and let Ne be neighbors(e) and Nf be neighbors(f). Then there is a possibility that some elements of Ne are similar to elements of Nf. If no sufficient similarity is measured between the elements of Ne and those of Nf, then it is considered that no relation holds between e and f; hence, the alignment e ≡ f is a misalignment [16].

Before showing the details of our Anchor-Flood algorithm later in Fig. 4, we illustrate the algorithm in a simplified form and with a comprehensive example. The simplified form of the algorithm is depicted in Fig. 3. The example consists of two simple ontologies that are to be aligned: the two ontologies O1 and O2, describing the domain of bibliography, are given in Figs. 1 and 2, respectively. The ontologies of these figures are taken from the benchmarks of the Ontology Alignment Evaluation Initiative 1 : Fig. 1 shows benchmark ontology number 301, while benchmark ontology number 101 is shown in Fig. 2. Our algorithm aligns the ontologies starting with an anchor; for the given example, the algorithm starts with the anchor ("Inproceedings"(O1), "InProceedings"(O2)).

1 Available at http://oaei.ontologymatching.org/2006/benchmarks/.

Algorithm SimplifiedFormOfAnchorFlood(O1, O2, X)
/* Ontology 1: O1, Ontology 2: O2, Anchor: X = (e(O1), f(O2)) */
/* Aterm and Astruct are the alignments retrieved by the terminological and structural similarity measures, respectively */
Step 1: Parse ontologies O1 and O2 to generate taxonomies and relations
Step 2: Initialize descendant sets, ancestor sets and the alignment
Repeat Steps 3, 4 and 5 until "either all the collected concepts are explored, or no new aligned pair is found":
  Step 3: Select a pair of concepts to explore
  Step 4: Explore and collect neighbors of the selected concepts
  Step 5: Run the aligning process to produce Aterm and Astruct
Step 6: Aggregate the alignments Aterm and Astruct
Step 7: Output the segmented alignment

Fig. 3. Simplified macro steps of the Anchor-Flood algorithm, which aligns within two segments across ontologies to produce a segmented alignment.

3.1 Simplified Form of the Algorithm with an Example

Our Anchor-Flood algorithm first preprocesses the ontologies. The algorithm starts off with an anchor, taken as an input parameter. In each iteration, it selects a pair of concepts to explore, collects the neighboring concepts of the selected concepts across ontologies, and then runs the aligning process to produce aligned pairs from the collected neighbors. The process repeats until "either all the collected concepts are explored, or no new aligned pair is found". The simplified form of the algorithm is illustrated in Fig. 3.

For example, if we start our Anchor-Flood algorithm off the anchor ("Inproceedings"(O1), "InProceedings"(O2)) across ontologies O1 (Fig. 1) and O2 (Fig. 2), it explores and collects the neighboring concepts. It collects "Entry" in O1 and "Part" in O2, using the anchor concepts "Inproceedings" and "InProceedings" respectively. The alignment process then aligns the collected concepts and their properties. Next, the algorithm explores from "Entry" and "Part" to further neighboring concepts and collects them. Eventually, our algorithm collects all the neighboring concepts across the ontologies reachable from the anchor and retrieves aligned pairs through the aligning process, repeating these steps until "either all the collected concepts are explored, or no new aligned pair is found". Hence, it produces aligned pairs within two segments across the ontologies; the segments are illustrated by the enclosed polygons in Figs. 1 and 2.

On the other hand, if the algorithm starts with an anchor such as ("Conference"(O1), "Conference"(O2)), it cannot find any other aligned pair even after exploring their neighboring concepts. Thus, the provided anchor is considered misaligned, although the two concepts are terminologically similar. The next subsections describe the components of the algorithm in more detail.
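To make the control flow of Fig. 3 concrete, the following Python sketch mimics the flooding loop under our own simplifying assumptions: a single descendant set per ontology, taxonomy objects carrying `children` and `parents` maps, the `neighbors` helper sketched after Eq. 5, and an injected `align_descendants` function standing in for the AlignDescendantSets module of Section 3.5.

```python
def anchor_flood(tax1, tax2, anchor, align_descendants):
    """Simplified flooding loop of Fig. 3 (a sketch, not the Fig. 4 code)."""
    e, f = anchor
    d1, d2 = {e}, {f}                 # descendant sets D1, D2
    alignment = {(e, f)}              # alignment A, seeded with the anchor
    explored = set()
    while True:
        frontier = alignment - explored
        if not frontier:
            break                     # all collected concepts explored
        for (x, y) in frontier:
            explored.add((x, y))
            d1 |= neighbors(x, tax1.children, tax1.parents)
            d2 |= neighbors(y, tax2.children, tax2.parents)
        new_pairs = align_descendants(d1, d2) - alignment
        if not new_pairs:
            break                     # no new aligned pair found
        alignment |= new_pairs
    return alignment                  # the segmented alignment
```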

3.2 Preprocessing of Anchor-Flood algorithm

As described at lines 1 and 2 of the pseudo code in Fig. 4, the input ontologies are preprocessed. The Anchor-Flood algorithm takes two ontology files (either in RDF or OWL) as input. The preprocessing module parses each ontology file using the ARP parser 2 , which produces the RDF triples available in the ontology file. This module produces a taxonomy of concepts, a list of properties, and their relations. It also extracts comments and labels of URIs as the textual contents, or descriptions, of entities. Before aligning the ontologies, the preprocessing module normalizes the textual contents of entities by tokenization, stop-word removal, word stemming, and normalization of text such as person names and dates into a uniform format.
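A minimal sketch of the description normalization step, under our own assumptions about the pipeline (stemming and name/date normalization are omitted, and the stop-word list is illustrative):

```python
import re

STOP_WORDS = {"a", "an", "and", "for", "in", "of", "or", "the", "to"}  # illustrative subset

def normalize(description: str) -> list[str]:
    """Split CamelCase, tokenize, lower-case, and drop stop words."""
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", description)  # "InProceedings" -> "In Proceedings"
    tokens = re.findall(r"[A-Za-z0-9]+", spaced.lower())
    return [t for t in tokens if t not in STOP_WORDS]
```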

3.3 Initialization of Anchor-Flood algorithm

As described at lines 3 through 6 of the pseudo code in Fig. 4, the basic components of the algorithm are initialized. An anchor, which is one of the inputs to our algorithm, consists of a pair of concepts e and f, denoted by (e(O1), f(O2)); concept e is supposed to be similar to concept f. We use five sets of entities to keep track of important pieces of information during the algorithm: two ancestor sets P1 and P2, two descendant sets D1 and D2, and one alignment A (see Figs. 4 and 5). If we consider the anchor ("Inproceedings"(O1), "InProceedings"(O2)), then P1 = D1 = {"Inproceedings"}, P2 = D2 = {"InProceedings"}, and A = {("Inproceedings", "InProceedings", ≡, 1.0)}. The alignment A is a set of correspondences with quadruples as defined in Section 2. Note that the algorithm starts by initializing P1 and D1 with e, P2 and D2 with f, and A with (e, f, ≡, 1.0).

2 Another RDF Parser, http://www.hpl.hp.com/personal/jjc/arp/

11

3.4 Exploring the Neighboring Concepts

Having initialized the parent sets (P1, P2), descendant sets (D1, D2), and alignment (A), the algorithm starts exploring the neighboring concepts of the anchor, as described at lines 8 through 36 of the pseudo code in Fig. 4. The main processing block of the algorithm contains three mutual exploring steps: exploring aligned concepts (EA), exploring unaligned concepts (EU), and exploring the parents (EP). The EA step has the highest priority: concepts of aligned pairs are explored until there is no more unexplored aligned pair in the set A. The EU step has the next priority: if there is no more aligned pair to be explored at a particular iteration, the EU step is executed as per the condition stated in the algorithm; otherwise the EP step is executed. When the algorithm terminates, it outputs a set of aligned pairs within two segments across the ontologies. The three exploring steps are elaborated below.

3.4.1 Exploring aligned concepts (EA step)

As described at lines 8 through 12 of the pseudo code in Fig. 4, the EA step explores the neighboring concepts of each of the aligned pairs available in the alignment set A. Since the given anchor is a pair of concepts as defined above, the first aligned pair naturally comes from the anchor. Both concepts of the anchor are explored to their neighboring concepts, which are then inserted into the corresponding descendant set, D1 or D2, as depicted in Fig. 5 with a directed line labeled ea1 or ea2. After every step, the AlignDescendantSets module of Fig. 6 is invoked to find new aligned pairs by computing similarities between the concepts available in D1 and D2. The concepts belonging to every aligned pair are explored one after another (see Fig. 4). For example, if a_i (∈ A) = ⟨"Entry", "Reference"⟩ in Figs. 1 and 2, then after exploring, the descendant sets D1 and D2 are increased by the new concepts {"Article", "Proceedings", "Conference", "Manual", "Booklet", "Book", "Incollection", "Inproceedings", "TechReport", "Inbook", "MasterThesis", "PhdThesis", "Unpublished", "Misc"} and {"Academic", "Book", "Informal", "Report", "MotionPicture", "Part", "Misc"}, respectively.

3.4.2 Exploring unaligned concepts (EU step)

The EU step is described at lines 13 through 24 of the pseudo code in Fig. 4. It explores the still-unexplored concepts to their neighboring concepts, from the descendant set D1 or D2 or from both. Exploring the unaligned concepts in only one of the descendant sets may cause unbalanced growth of the sets. We detect that this happens when the distance from the leaves of the concepts in one ancestor set is not close to that of the concepts of the other ancestor set. To overcome this problem, the sizes of both D1 and D2 are observed in the algorithm. Taking the example of Figs. 1 and 2: if "Article" ∈ D1 and is not yet explored, then D1 will be increased by ∅; if "Academic" ∈ D2 and is not yet explored, then D2 will be increased by the concepts {"MasterThesis", "PhdThesis"}.

3.4.3 Exploring the parents (EP step)

The EP step is described at lines 28 through 36 of the pseudo code in Fig. 4. We assume two ancestor sets P1 and P2 hold the ancestors from ontologies O1 and O2, respectively. Which set is explored to its own ancestors depends on the difference between the number of unaligned concepts of D1 and that of D2, decided by observing the sizes of both D1 and D2. If the size of one set is more than double that of the other set, then the parent set of the other side is explored to its ancestors, and the ancestors are added to the corresponding descendant set; otherwise, the concepts of both ancestor sets are explored. The explored concepts are inserted into the descendant sets D1 or D2, as depicted in Fig. 5 with a directed line labeled ep1 or ep2. For example, if P1 = {"Inproceedings"} in Fig. 1, then the concepts of P1 will be replaced by its parents, i.e. {"Conference", "Entry"}. Similarly, if P2 = {"InProceedings"} in Fig. 2, then the concepts of P2 will be replaced by its parents, i.e. {"Part"}.
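The EP-step balance rule can be phrased as a small helper. This is our reading of the heuristic, not code from the paper:

```python
def parents_to_explore(d1_unaligned: int, d2_unaligned: int) -> tuple:
    """EP step: if one descendant set has grown to more than double the
    other, only the lagging side's ancestor set is explored."""
    if d1_unaligned > 2 * d2_unaligned:
        return ("P2",)            # grow D2 via P2 only
    if d2_unaligned > 2 * d1_unaligned:
        return ("P1",)            # grow D1 via P1 only
    return ("P1", "P2")           # grow both sides
```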

3.5 Aligning Descendant Sets

The alignment process, AlignDescendantSets, is described at line 38 of the pseudo code in Fig. 4. It computes an alignment over the sets of concepts in D1 and D2, as shown in Fig. 5, and produces terminological and structural alignments, since the concepts and properties of ontologies contain terminologies and are organized in well-defined structures. The overall process, described in Fig. 6, includes the terminological alignment and the structural alignment, which are briefly described in the following subsections.

3.5.1 Terminological Alignment

The process of producing the terminological alignment is depicted in Fig. 6 at lines 1 through 7. Our system computes WordNet and string metric based similarity values as in Eqs. 1 and 2, respectively. The string metric based similarity value is obtained as stated in [19,36]. Suppose concept c_i ∈ D1 has description d_i and concept c_j ∈ D2 has description d_j. Moreover, d_i is a set of terms T_i = {t_i1 ... t_in} and d_j is a set of terms T_j = {t_j1 ... t_jm}, where i and j are index values, and n and m are the numbers of terms available in d_i and d_j, respectively. Then the similarity value is calculated as follows:

\[
Sim_t(c_i, c_j) = \min\!\Big( \sum_{1 \le p \le n} \max_{1 \le q \le m} sim_f(t_{ip}, t_{jq}),\; \sum_{1 \le q \le m} \max_{1 \le p \le n} sim_f(t_{ip}, t_{jq}) \Big)
\tag{6}
\]

where max returns the maximum value from a list of values, min returns the minimum of two values, and

\[
sim_f(t_{ip}, t_{jq}) =
\begin{cases}
sim_L(t_{ip}, t_{jq}), & \text{if } sim_L \ge \delta \\
0, & \text{otherwise}
\end{cases}
\]

\[
sim_L(t_{ip}, t_{jq}) =
\begin{cases}
Sim_{WN}(t_{ip}, t_{jq}), & \text{if } t_{ip}, t_{jq} \in \text{WordNet} \\
Sim_{SM}(t_{ip}, t_{jq}), & \text{otherwise}
\end{cases}
\]

where Sim_WN and Sim_SM are defined in Eqs. 1 and 2, respectively. We initially chose the threshold δ as 0.65 in our implementation, as the experimental results shown in [36] indicate that 0.65 is a good threshold. However, our similarity measures split the descriptions into terms and then compute similarities among them, so the overall similarity values decrease; our experiments show that 0.50 is a good threshold for the terminological similarity measure. The terminological alignment process uses the terminological similarity values and is depicted in the pseudo code of Fig. 6.
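Equation 6, as reconstructed above, is the minimum of two directional sum-of-best-matches. A direct Python transcription of ours (with sim_f injected, e.g. built from Sim_WN and Sim_SM with threshold δ; both term lists are assumed non-empty):

```python
def sim_t(terms_i, terms_j, sim_f):
    """Eq. 6: min over both directions of the summed best-match scores."""
    def directed(src, dst):
        # for each term of src, take its best match in dst, then sum
        return sum(max(sim_f(p, q) for q in dst) for p in src)
    return min(directed(terms_i, terms_j), directed(terms_j, terms_i))

def make_sim_f(sim_l, delta=0.50):
    """sim_f thresholds the underlying sim_L (WordNet or string metric)."""
    return lambda p, q: sim_l(p, q) if sim_l(p, q) >= delta else 0.0
```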

3.5.2 Structural Alignment

The process of producing the structural alignment is depicted in Fig. 6 at lines 8 through 14. A structural alignment is retrieved after collecting and computing the terminological alignment for all neighboring concepts across ontologies. In our research, we compute structural similarity values within the direct neighbors of a reference node, e.g. children, parents, and siblings. Terminological alignment is considered a prerequisite for computing structural similarity values. During the terminological alignment process, the properties of the referenced concepts are also aligned by their similarity values. Then the structural similarity value is computed by Eqs. 3 and 4 and averaged as defined in Eq. 7 below:

\[
Sim_{avg} = \frac{Sim_{external} + Sim_{internal}}{2}
\tag{7}
\]

We also use 0.50 as the threshold for structural similarities. The structural alignment process uses the structural similarity values and is illustrated in the pseudo code of Fig. 6.
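The structural measures of Eqs. 3, 4 and 7 are simple ratios. A sketch over pre-computed counts (the counting itself is done by the matcher; the guard against empty denominators is our addition):

```python
def sim_internal(n_aligned_props: int, n_props_ci: int, n_props_cj: int) -> float:
    """Eq. 3: aligned properties vs. all properties of the concept pair."""
    total = n_props_ci + n_props_cj
    return 2 * n_aligned_props / total if total else 0.0

def sim_external(n_aligned_pairs: int, n_concepts_around: int) -> float:
    """Eq. 4: aligned neighbor pairs vs. all direct superconcepts,
    siblings and subconcepts around the pair."""
    return 2 * n_aligned_pairs / n_concepts_around if n_concepts_around else 0.0

def sim_avg(internal: float, external: float) -> float:
    """Eq. 7: plain average of the two structural measures."""
    return (internal + external) / 2
```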

3.6 Aggregating Alignments

An aggregation process (see Fig. 5) is described at line 45 of the pseudo code in Fig. 4; the process is named AggregateAlignedSets. We consider the structural alignment as the basic set of aligned pairs, and then test each terminologically aligned pair for whether there are any aligned neighboring concepts. If a terminologically aligned pair is not aligned by the structural methods, it is considered a probable misalignment [16]. However, AggregateAlignedSets further looks for aligned pairs among the neighboring concepts: if any neighboring concept of one forms an aligned pair with a neighboring concept of the other, the concept pair is considered aligned; otherwise it is discarded as a misaligned pair. The process of aggregating the alignments is also illustrated in the pseudo code of Fig. 7.
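Our reading of AggregateAlignedSets in pseudocode-style Python (a sketch, not the Fig. 7 code; the neighbor maps correspond to the Eq. 5 helper):

```python
def aggregate(term_pairs, struct_pairs, nbrs1, nbrs2):
    """Keep all structural pairs; keep a terminological pair only if some
    pair of its neighbors is itself aligned, otherwise drop it as a
    probable misalignment. nbrs1/nbrs2 map a concept to its neighbor set."""
    accepted = set(struct_pairs)
    support = accepted | set(term_pairs)
    for (e, f) in term_pairs:
        if (e, f) in accepted:
            continue
        if any((x, y) in support and (x, y) != (e, f)
               for x in nbrs1.get(e, ())
               for y in nbrs2.get(f, ())):
            accepted.add((e, f))   # supported by an aligned neighbor pair
        # otherwise discarded as a misaligned pair
    return accepted
```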

4 Experiments and Evaluation

We applied our Anchor-Flood algorithm to a variety of datasets, including the benchmark dataset 3 and larger ontologies, such as anatomical ontologies (FMA.owl, OpenGALEN.owl, human.owl, mouse.owl and so on) and web directory ontologies in OWL format. In this section, we discuss the results obtained from OAEI-2008 and from our extensive experiments in terms of scalability, memory consumption and segmented alignment.

3 http://oaei.ontologymatching.org/2008/benchmarks/

Table 1 contains the results obtained by the participants on the benchmark test cases during OAEI-2008, where H-mean represents harmonic means [3]. (The recall column of the last system is truncated in the source.)

Sys      refalign    edna        Aflood      AROMA       ASMOV       CIDER       DSSim       GeRoMe      Lily        MapPSO      RiMOM
test     Prec. Rec.  Prec. Rec.  Prec. Rec.  Prec. Rec.  Prec. Rec.  Prec. Rec.  Prec. Rec.  Prec. Rec.  Prec. Rec.  Prec. Rec.  Prec. Rec.
1xx      1.00  1.00  0.96  1.00  1.00  1.00  1.00  1.00  1.00  0.98  0.99  0.99  1.00  1.00  0.96  0.79  1.00  1.00  0.92  1.00  1.00  -
2xx      1.00  1.00  0.41  0.56  0.94  0.64  0.96  0.69  0.96  0.70  0.95  0.85  0.97  0.57  0.97  0.64  0.56  0.52  0.97  0.86  0.48  -
3xx      1.00  1.00  0.47  0.82  0.95  0.66  0.82  0.71  0.81  0.77  0.90  0.75  0.90  0.71  0.61  0.40  0.87  0.81  0.49  0.25  0.80  -
H-mean   1.00  1.00  0.43  0.59  0.97  0.71  0.95  0.70  0.95  0.86  0.97  0.62  0.97  0.67  0.60  0.58  0.97  0.88  0.51  0.54  0.96  -

As we participated in two tracks of the OAEI-2008 campaign [33], we obtained the precision and recall for the benchmark dataset and for the anatomy track as feedback, together with performance comparisons among the different systems in the campaign. Moreover, we ran experiments to observe the effects on memory consumption, to evaluate the scalability, and, above all, to observe the nature of the segmented alignment produced by our algorithm. The following subsections describe each of these factors.

4.1 Precision and Recall in OAEI-2008

We participated with our Anchor-Flood algorithm based system in the OAEI-2008 campaign. The campaign does not allow feeding manual anchors to a system; however, our Anchor-Flood algorithm starts off with anchors. Therefore, we adapted our system to generate anchors automatically and fed these anchors in to retrieve the alignments.

4.1.1 Adaptation

As our Anchor-Flood algorithm requires an anchor to produce a segmented alignment, we developed a program module to retrieve anchors automatically, considering ontology entities such as concepts, properties, instances and restrictions. This fragment of code is simple and runs fast. In this module, we use exact string matching of the descriptions of concepts, properties and instances across ontologies. All exactly matched concepts are considered anchors. Moreover, the domains and ranges of matched properties are added to the list of anchors. We also consider the concept type of a pair of matched instances as an anchor. Furthermore, structural similarities provide us with some additional anchors.
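A sketch of the exact-string-match part of the adaptation module (the names and the `normalize` helper from Section 3.2 are our own; the property-, instance- and restriction-derived anchors are omitted):

```python
def generate_anchors(entities1, entities2):
    """Exact match of normalized descriptions across ontologies; every
    match becomes an anchor (a pair of entities)."""
    index = {}
    for e in entities1:
        index.setdefault(tuple(normalize(e.description)), []).append(e)
    anchors = []
    for f in entities2:
        for e in index.get(tuple(normalize(f.description)), []):
            anchors.append((e, f))
    return anchors
```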

4.1.2 Precision and Recall Results

The reference alignment discussed herein is provided along with the benchmark, a set of small-scale ontologies on the OAEI website; the OM-2008 proceedings [3] contain the results from the participating ontology alignment systems.

Table 2 contains the alignment results for the anatomy ontologies in OAEI-2008 [3], where BK stands for background knowledge.

System          Runtime    BK    Precision  Recall  F-Measure
SAMBO [21]      12h        yes   0.869      0.836   0.852
SAMBOdtf [21]   17h        yes   0.831      0.833   0.832
RiMOM [45]      24min      no    0.929      0.735   0.821
Aflood [33]     1min 5s    no    0.874      0.682   0.766
Label Eq.       -          no    0.981      0.613   0.755
Lily [41]       3h 20min   no    0.796      0.693   0.741
ASMOV [20]      3h 50min   yes   0.787      0.652   0.713
AROMA [4]       3min 50s   no    0.803      0.560   0.660
DSSim [28]      17min      no    0.616      0.624   0.620
TaxoMap [15]    25min      no    0.460      0.764   0.574

We were one of the participants, taking part with our Anchor-Flood algorithm; the adaptation is described in our OM-2008 paper [3] and in Section 4.1.1 above. Table 1, Fig. 8 and Table 2 compare the precision and recall of the Anchor-Flood algorithm based system with the other systems that participated in OAEI-2008. The results show that we are among the best four in terms of precision and recall in both the benchmark dataset and the anatomy track. Our Anchor-Flood algorithm achieved the best precision over the other systems in the test set 3xx, which consists of real ontologies, as the algorithm successfully reduced seemingly-aligned but actually misaligned pairs. We obtained comparatively lower recall over the 3xx tests of the benchmark dataset due to the removal of the subsumption module from the local aligning process [33] and other changes necessary to work with large ontologies. Moreover, our system did not pass test 303; since this test seems to be the most difficult among the 3xx tests, the results there were not significant. However, we achieved the best time efficiency in the anatomy track of the OAEI-2008 campaign.

4.2 Time Efficiency

The anatomy track of the OAEI-2008 campaign contained two moderately large ontologies on the anatomy of human and mouse. Our system, based on the Anchor-Flood algorithm, achieved the best runtime among the participating systems: it required 1 minute and 5 seconds to produce aligned results across the human and mouse ontologies, executed by the organizers. The next fastest system, AROMA [4], took 3 minutes and 50 seconds on the same pair of ontologies, also executed by the organizers, and produced a lower precision than our system. Moreover, our system was more than 15 times faster than DSSim [28] and more than 22 times faster than RiMOM [45]; the other systems took even more time. Table 2 depicts these facts.

We also performed further experiments over large ontologies to test performance; the results appear in Tables 3 and 5. The average number of comparisons and the elapsed time displayed in these tables reflect the efficiency of our algorithm. However, the aligning processes for OpenGalen vs. mouse and for FMA vs. OpenGalen took much more time and a larger average number of comparisons; this is due to the large number of children available under several different nodes. We also conducted an experiment aligning the FMA.owl ontology, which contains around 72 thousand concepts, against itself, to observe the behavior of the algorithm over highly similar ontologies. Table 3 summarizes the average number of alignments as well as the average number of comparisons, taking 300 randomly chosen anchors from the FMA dataset. The standard deviation observed across these anchors clearly demonstrates the stability and scalability of the Anchor-Flood algorithm. The total number of correspondences with the FMA dataset is 71,978 and the total number of comparisons was 3,748,970. From these numbers, it is obvious that the number of comparisons is significantly smaller than with a brute-force algorithm, which amounts to O(N^2) (N^2 = 71,978^2 ≈ 5 billion) comparisons. The average number of comparisons per aligned pair of our algorithm is approximately 53, which means that each concept is not compared against every concept across ontologies. Therefore, our algorithm clearly shows a performance enhancement over the other systems. Table 3 contains the results of aligning FMA.owl against itself.

Aligned-pairs   Total Comparisons   Comparisons per Aligned-pair   Memory Consumption   Elapsed Time (min)
71,978          3,748,970           53                             396 MB               7.58

4.3 Memory Efficiency

We experimented with five large, quite popular ontologies from the domain of anatomy. Table 4 lists the ontologies, the memory consumed to store their persistent models in memory, the numbers of concepts, properties and triples, and the average number of children of the intermediate concepts of their taxonomies. We create our own persistent model of an ontology, which stores the textual contents, a taxonomy of concepts, and the relations and restrictions of concepts, properties and instances. We also experimented with retrieving alignments across several pairs of ontologies; Table 5 displays the results. The first two columns give the names of the ontology pair; the third column contains the total memory required for processing, including storage of the persistent memory models; the fourth column shows the total number of aligned pairs; the fifth column shows the elapsed time; and the last column gives the average number of comparisons per aligned pair.

Table 4 contains various statistics of the ontologies we used.

Ontology     Memory Consumption   |Concepts|   |Properties|   |Triples|   Avg. Children
FMA.owl      115 MB               72,559       100            576,462     4
Full-Galen   50 MB                23,141       950            249,421     5
OpenGalen    40 MB                9,565        0.929          59,773      6
Human        20 MB                3,298        1              39,939      6
Mouse        19 MB                2,738        2              19,158      6

Table 5 displays memory consumption, the number of aligned pairs, the elapsed time and the average number of comparisons per aligned pair.

Ontology O1   Ontology O2   Memory Consumption   |Aligned-Pairs|   Time (sec)   Avg. Comp.
FMA           Full-Galen    283 MB               3056              206          1,853
FMA           OpenGalen     310 MB               4154              2588         14,851
FMA           human         251 MB               2163              513          2,908
FMA           mouse         207 MB               1242              90           1,006
Full-Galen    OpenGalen     229 MB               7948              284          363
Full-Galen    human         138 MB               899               40           1,734
Full-Galen    mouse         95 MB                695               24           423
OpenGalen     human         208 MB               1091              330          8,912
OpenGalen     mouse         195 MB               989               86           2,745

A taxonomy whose internal structure has a large number of children under several intermediate concepts negatively affects performance, as shown by the FMA vs. OpenGalen and OpenGalen vs. mouse alignment cases. In spite of the negative effects of large-sized ontologies, our algorithm maintains almost stable memory consumption regardless of the size of an ontology (see Tables 3 and 5).

4.4 Scalability

The data in Tables 3 and 5 demonstrate the scalability. Our algorithm keeps the memory consumption low, the average number of comparisons stable, and the elapsed time linear. Starting from an anchor, it retrieves aligned pairs across related segments only. Moreover, it gradually explores the neighboring concepts, collects concepts based on locality of reference within the segment, and computes similarity values among the collected concepts only. Thus, the algorithm reduces the average number of comparisons, as it is restricted to segments (see Tables 3 and 5). The alignments of FMA vs. OpenGalen and of OpenGalen vs. mouse did take larger elapsed times and larger average numbers of comparisons due to the internal structure of the taxonomies: as these ontologies contain a large number of children under several intermediate concepts, our system collects a large number of neighbors from the taxonomy, producing a large number of comparisons and hence a large elapsed time. Apart from these exceptions, our algorithm manages to keep the elapsed time short, even when the ontologies are very large.

4.5 Segmented Alignment

To demonstrate segmented alignment, another feature of our Anchor-Flood algorithm, we carried out an experiment using a pair of web directory ontologies called source.owl and target.owl, which are also available from OAEI as another alignment task. Table 6 shows the results of segmented alignment between the two web directory ontologies. One anchor (a pair of concepts) produced one segment pair across the ontologies. The first column of the table gives the name of the ontology, while the second column shows the (abbreviated) anchor concept; two consecutive rows form one anchor. The number of concepts in the resulting segments is displayed in the third column; the fourth column of Table 6 shows the number of aligned pairs with the equivalence relation found within the segments, and the last column gives the average number of comparisons. The anchor (Health(Source), Health(Target)) produced two segments with 274 and 313 concepts, respectively, and the AlignDescendantSets process portrayed in Fig. 6 produced 90 aligned pairs with the equivalence (≡) relation. Similarly, the other anchors of Table 6 produced segments of different interests; for example, the anchor (Music(Source), Music(Target)) produced a pair of segments containing only the concepts about Music.

5 Complexity Analysis

As the Anchor-Flood algorithm starts off by aligning an anchor and then explores its neighboring concepts, its performance is not heavily affected by the size of the ontologies. Rather, the performance depends on the size of the neighbors defined in Eq. 5 and on the size of the segments in a segmented alignment, as illustrated in Section 2.4.

Table 6 contains different segmented alignments starting from anchors across the web directory ontologies.

Ontology   Anchor        |Concepts|   |Aligned-pairs|   Average
Name       (Concept)     in segment                     comparisons
Source     Health        274          90                786
Target     Health        313
Source     Wrestling     428          98                840
Target     Wrestling     611
Source     Food Wine     765          28                172
Target     Food Wine     756
Source     Buddhism      186          63                287
Target     Buddhism      251
Source     Science       145          70                290
Target     Science       234
Source     Music         206          70                543
Target     Music         340

Consider two ontologies O1 with N1 entities and O2 with N2 entities; for simplicity, assume N1 = N2 ≈ N and let the average number of children of each node be k for both of them. With a brute-force method, the number of comparisons is O(N^2): every entity of one ontology must be compared against every entity of the other ontology. Let us assume that the size of each of the segments in a segmented alignment is M in both ontologies, where usually M ≤ N. At a particular iteration, our algorithm compares the entities belonging to the neighbors defined in Eq. 5, collected from the children, grandchildren, parents, siblings, nephews, grandparents and uncles of a reference concept pair. As k is the average number of children of a node, we obtain the following average sizes for the components of Eq. 5:

\[
\begin{aligned}
|children(c)| &= k \\
|grand\_children(c)| &= k^2 \\
|parents(c)| &= 1 \\
|siblings(c)| &= k - 1 \\
|nephews(c)| &= k^2 - k \\
|grand\_parents(c)| &= 1 \\
|uncles(c)| &= k - 1
\end{aligned}
\]

Therefore, the total number of concepts in one neighbor set of Eq. 5 is

\[
O(k + k^2 + 1 + (k-1) + (k^2-k) + 1 + (k-1)) = O(2(k + k^2)) = O(k^2),
\]

which shows that the average number of neighboring concepts of a reference concept to be compared is O(k^2). Therefore, the average number of comparisons among the neighboring concepts of two concepts is O(k^4). There are M/k intermediate concepts within a segment that has k children on average. Thus, the average number of comparisons for a segmented alignment is

\[
O\!\left(\frac{M}{k} k^4\right) = O(M k^3).
\]

Now, if we consider that all entities are associated with segments that have similar segments across ontologies, then there are N/M segments. Therefore, the total number of comparisons is

\[
O\!\left(\frac{N}{M} M k^3\right) = O(N k^3).
\]
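As a rough sanity check of the O(Nk^3) estimate (a back-of-the-envelope calculation of ours, not from the original analysis): FMA has N ≈ 72,000 concepts with an average of k = 4 children per node (Table 4), so Nk^3 ≈ 72,000 × 4^3 ≈ 4.6 × 10^6, which is the same order of magnitude as the 3,748,970 comparisons observed when aligning FMA.owl against itself (Table 3).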

As k is usually a small number in real ontologies, our algorithm works fast. In the worst case, when the total depth of an ontology is less than or equal to 3, the complexity becomes O(N^2). However, only a small increase in the depth leads to an exponential improvement in the performance, as the size of k is reduced drastically. If the total depth increases, the value of k decreases logarithmically. Hence, if the depth is large compared to log_2 N, the taxonomy becomes a binary tree; then, in the best case, the computational complexity becomes O(2^3 N) ≈ O(8N) ≈ O(N). The best-case computational complexity O(N) comes from the advantage of "locality of reference", meaning that we can skip unrelated parts of the ontologies: a significant number of unnecessary comparisons are skipped automatically. In spite of the additional operations needed to collect neighboring concepts, the Anchor-Flood algorithm works well. However, the taxonomy of an ontology is in practice neither a binary nor a balanced tree. Ideally, the average number of comparisons is measured as O(Nk^3), where k is usually significantly smaller than N, though not negligible. The number of children k of a node depends on the depth d of a taxonomy and is approximated as

\[
k \approx \frac{\log N}{\log d}.
\]

Therefore, the average complexity of our Anchor-Flood algorithm is O(N log N). Moreover, the experimental results over large ontologies also imply an average complexity of O(N log N).

6 Related Work

There are several related works that deal with ontology alignment, such as COMA++ [6], the NLM Anatomical Ontology Alignment system [44], PBM based Falcon-AO [17,18], iMAP [5] and Chimaera [24]. Moreover, there are other approaches [14,27,42] that address large-scale ontologies or large-scale class hierarchies. The works most relevant to our Anchor-Flood algorithm are described in the subsequent sections.

6.1 COMA++

COMA++ [6] is a generic schema matching tool, providing a library of individual matchers and a flexible infrastructure to combine the matchers and refine their results. It introduces a scalable approach to identify context-dependent correspondences between schemas with shared elements, and a fragment-based matching approach that decomposes a large match task into smaller tasks. By a fragment, they denote a rooted sub-graph down to the leaf level in the schema graph. In general, fragments have little or no overlap, to avoid repeated similarity computation and overlapping match results.

There are some major differences between COMA++ and our Anchor-Flood algorithm. COMA++ identifies fragments before the matchers are applied; similar fragments are then identified by comparing the contexts of their schema roots to the fragment roots. There is always a chance of splitting closely related components across fragments in one schema while they remain well formed in the other schema; the matchers of COMA++ would then have problems aligning across several fragments. If the fragmentation process does not consider enough semantics, this problem arises severely. Moreover, the fragmentation and the identification of similar fragments cost additional clock cycles. Our Anchor-Flood algorithm, on the other hand, does not split the ontology schema before applying the aligning process; rather, it starts off with an anchor, collects neighboring concepts, finds aligned pairs, and eventually produces related fragments or segments across ontologies.

6.2 NLM Anatomical Ontology Alignment System

The NLM Anatomical Ontology Alignment System [44] shares the notion of an "anchor" with Anchor-PROMPT and uses shared paths between anchors across ontologies to validate the proximity of related terms; Anchor-PROMPT is therefore undoubtedly the system to which their approach is most closely related. The NLM system uses a simpler validation scheme based on paths restricted to combinations of taxonomic and partitive relations, suitable for the anatomical domain. Unlike Anchor-PROMPT, the approach does not rely on path length and is therefore less sensitive to differences in granularity between ontologies. NLM does not use anchors to exploit neighbors for alignment, whereas our Anchor-Flood algorithm mainly propagates from an anchor to obtain further alignments.

6.3 PBM based Falcon-AO

Analogously to COMA++, Falcon-AO [17] is extended with the integration of Partition-based Block Matching (PBM) [19,18] into the original ontology alignment system. PBM was introduced to cope with large ontologies: large-scale ontologies are hierarchically partitioned into blocks based on both structural affinities and terminological similarities, and blocks from different ontologies are then matched via predefined anchors. In contrast, our algorithm does not partition the ontologies into blocks first; rather, it extracts pairs of segments eventually.

6.4 iMap

The iMAP system [5] addresses block matching and matches between relational schemas, but the ideas can be generalized to other data representations. The iMAP architecture consists of three main modules: a match generator, a similarity estimator, and a match selector. The match generator takes two schemas S and T as input; for each attribute t of T, it generates a set of match candidates, which can include both one-to-one and complex matches. The similarity estimator then computes a score for each match candidate, indicating the candidate's similarity to attribute t; the output of this module is thus a matrix storing the similarity scores of pairs. Finally, the match selector examines the similarity matrix and outputs the best matches for the attributes of T. iMAP is also based on block matching, in contrast to our system.

6.5 Some Other Systems

There are some other systems for aligning large ontologies or partitioning ontologies into blocks, some of which are introduced here briefly. The Chimaera ontology-merging environment [24] and an interactive merging tool based on the Ontolingua ontology editor [8] consider a limited ontology structure in suggesting merging steps. They target an environment for large ontologies; however, the only relations that Chimaera considers are the subclass-superclass relation and slot attachment. The issue of partitioning large-scale ontologies (including large class hierarchies) is also addressed in [12,38,40]. In [12], an efficient solution for partitioning ontologies is provided using ε-Connections; it guarantees that all concepts that have subsumption relations are partitioned into one block, which becomes a limitation for ontology matching. In [38], large class hierarchies are automatically partitioned into small blocks; the underlying techniques are a dependency graph and an "island" algorithm. Although the main contribution of [40] is ontology visualization, it also presents a method for ontology partitioning with a Force Directed Placement algorithm. While these tools aim to solve the problem of large ontologies, they are quite different from our Anchor-Flood algorithm, as they partition ontologies into blocks prior to other operations. Moreover, without a major relationship to the processing unit of our alignment, there are some interesting approaches dealing with multi-strategy ontology alignment, such as oMap [37] and RiMOM [22,45]; readers may consult the references for further information.

7 Conclusions and Future Work

In this paper, we described the Anchor-Flood algorithm, which can align ontologies of arbitrary size effectively and makes it possible to achieve high performance and scalability over previous alignment algorithms. To achieve these goals, the algorithm takes advantage of the notion of segmentation and produces a segmented output of aligned ontologies. Specifically, owing to the segmentation, our algorithm concentrates on aligning only small subsets of the entire ontology data iteratively, exploiting "locality of reference". This brings the by-product of collecting more alignments in general, since similar concepts are usually densely populated within segments. Although some further refinement in segmentation is needed, we have an advantage over traditional ontology alignment systems in that the algorithm finds aligned pairs within segments across ontologies, which offers additional usability in disciplines with specific modeling patterns.

When the anchor represents a correctly aligned pair of concepts across ontologies, our Anchor-Flood algorithm efficiently finds a segmented alignment within conceptually closely connected segments across the ontologies. Even if the input anchor is not correctly defined, our algorithm is capable of handling the situation by reporting a misalignment error. The complexity analysis and the various experiments demonstrate that our proposed algorithm outperforms other alignment systems in several aspects. The size of the ontologies does not affect the efficiency of the Anchor-Flood algorithm: the best-case computational complexity of our algorithm is O(N), and the worst case is O(N^2), which occurs when the taxonomy is flat. The average number of children per intermediate node in the FMA, Full-Galen, OpenGalen, Mouse and Human ontologies varies from 4 to 6, which is sufficiently small compared to their average numbers of concepts N; therefore, the average running time complexity of our algorithm is O(N log N). In the OAEI-2008 campaign, our Anchor-Flood algorithm based system obtained the best runtime in the anatomy track.

Our future targets include strengthening the local alignment process by extending the alignment algorithm to handle one-to-many (1:n) (complex) mappings, and improving the subsumption alignments. We plan to integrate WordNet sense filtering, a process for computing semantic similarity, and the discovery of missing background knowledge.

Acknowledgments

This work was partially supported by the Global COE Program "Frontiers of Intelligent Sensing" from the Ministry of Education, Culture, Sports, Science and Technology.

Annex

Our Anchor-Flood algorithm can be downloaded from the URL http://www.kde.ics.tut.ac.jp/~hanif/res/anchor_flood.zip.

References

[1] V. Benjamins, J. Contreras, O. Corcho, A. Gómez-Pérez, Six Challenges for the Semantic Web, AIS SIGSEMIS Bulletin 1 (1) (2004) 24–25.
[2] T. Berners-Lee, M. Fischetti, M. Dertouzos, Weaving the Web: The Original Design and Ultimate Destiny of the World Wide Web, Harper San Francisco, 1999.
[3] C. Caracciolo, J. Euzenat, L. Hollink, R. Ichise, A. Isaac, V. Malaisé, C. Meilicke, J. Pane, P. Shvaiko, H. Stuckenschmidt, O. Šváb-Zamazal, V. Svátek, Results of the Ontology Alignment Evaluation Initiative 2008, Proceedings of Ontology Matching Workshop of the 7th International Semantic Web Conference, Karlsruhe, Germany (2008) 73–119.
[4] J. David, AROMA results for OAEI 2008, Proceedings of Ontology Matching Workshop of the 7th International Semantic Web Conference, Karlsruhe, Germany (2008) 128–131.
[5] R. Dhamankar, Y. Lee, A. Doan, A. Halevy, P. Domingos, iMAP: Discovering Complex Semantic Matches between Database Schemas, Proceedings of the 23rd ACM SIGMOD International Conference on Management of Data, Paris, France (2004) 383–394.
[6] H. Do, E. Rahm, Matching Large Schemas: Approaches and Evaluation, Information Systems 32 (6) (2007) 857–885.
[7] M. Ehrig, Ontology Alignment: Bridging the Semantic Gap, Springer, New York, 2007.
[8] A. Farquhar, R. Fikes, J. Rice, Ontolingua Server: A Tool for Collaborative Ontology Construction, International Journal of Human-Computer Studies 46 (6) (1997) 707–727.
[9] C. Fellbaum, WordNet: An Electronic Lexical Database, The MIT Press, Cambridge, 1998.
[10] T. Finin, R. Fritzson, D. McKay, R. McEntire, KQML as an Agent Communication Language, Proceedings of the 3rd International Conference on Information and Knowledge Management (CIKM'94), Gaithersburg, Maryland (1994) 456–463.
[11] F. Giunchiglia, P. Shvaiko, Semantic Matching, The Knowledge Engineering Review 18 (3) (2004) 265–280.
[12] B. Grau, B. Parsia, E. Sirin, A. Kalyanpur, Automatic Partitioning of OWL Ontologies Using ε-Connections, Proceedings of the International Workshop on Description Logics (DL'05), Edinburgh, Scotland (2005).
[13] T. Gruber, Toward Principles for the Design of Ontologies Used for Knowledge Sharing, International Journal of Human-Computer Studies 43 (5/6) (1995) 907–928.
[14] S. Guha, R. Rastogi, K. Shim, Rock: A Robust Clustering Algorithm for Categorical Attributes, Journal of Information Systems 25 (5) (2000) 345–366.


[15] F. Hamdi, H. Zargayouna, B. Safar, C. Reynaud, TaxoMap in the OAEI 2008 alignment contest, Proceedings of Ontology Matching Workshop of the 7th International Semantic Web Conference, Karlsruhe, Germany (2008) 206–213.
[16] S. Hanif, Y. Seki, M. Aono, Automatic Alignment of Ontology Eliminating the Probable Misalignments, Proceedings of the 1st Asian Semantic Web Conference (ASWC2006), Beijing, China (2006) 212–218.
[17] W. Hu, G. Cheng, D. Zheng, X. Zhong, Y. Qu, The Results of Falcon-AO in the OAEI 2006 Campaign, Proceedings of Ontology Matching (OM-2006), Athens, Georgia, USA (2006) 124–133.
[18] W. Hu, Y. Qu, G. Cheng, Matching Large Ontologies: A Divide-and-Conquer Approach, Data and Knowledge Engineering (2008) 140–160.
[19] W. Hu, Y. Zhao, Y. Qu, Partition-based Block Matching of Large Class Hierarchies, Proceedings of the 1st Asian Semantic Web Conference (ASWC2006), Beijing, China (2006) 72–83.
[20] Y. Jean-Mary, M. Kabuka, ASMOV: results for OAEI 2008, Proceedings of Ontology Matching Workshop of the 7th International Semantic Web Conference, Karlsruhe, Germany (2008) 132–139.
[21] P. Lambrix, H. Tan, Q. Liu, SAMBO and SAMBOdtf results for the ontology alignment evaluation initiative 2008, Proceedings of Ontology Matching Workshop of the 7th International Semantic Web Conference, Karlsruhe, Germany (2008) 190–198.
[22] J. Li, J. Tang, Y. Li, Q. Luo, RiMOM: A Dynamic Multi-Strategy Ontology Alignment Framework, IEEE Transactions on Knowledge and Data Engineering 18.
[23] A. Maedche, S. Staab, Ontology Learning for Semantic Web, IEEE Intelligent Systems 16 (2) (2001) 72–79.
[24] D. McGuinness, R. Fikes, J. Rice, S. Wilder, An Environment for Merging and Testing Large Ontologies, Proceedings of the 7th International Conference on Principles of Knowledge Representation and Reasoning (KR2000), Breckenridge, Colorado, USA (2000) 483–493.
[25] D. McGuinness, F. van Harmelen, OWL Web Ontology Language Overview, W3C Recommendation, 2004.
[26] S. Melnik, H. Garcia-Molina, E. Rahm, Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching, Proceedings of the 18th International Conference on Data Engineering (ICDE 2002), San Jose, CA, USA (2002) 117–128.
[27] G. Murray, B. Dorr, J. Lin, J. Hajic, P. Pecina, Leveraging Recurrent Phrase Structure in Large-scale Ontology Translation, Proceedings of the 11th Annual Conference of the European Association for Machine Translation (EAMT2006), Oslo, Norway (2006) 141–150.


[28] M. Nagy, M. Vargas-Vera, P. Stolarski, E. Motta, DSSim results for OAEI 2008, Proceedings of Ontology Matching Workshop of the 7th International Semantic Web Conference, Karlsruhe, Germany (2008) 147–159.
[29] D. Nardi, R. Brachman, An Introduction to Description Logics, The Description Logic Handbook: Theory, Implementation, and Applications (2003) 1–40.
[30] N. Noy, M. Musen, Anchor-PROMPT: Using Non-Local Context for Semantic Matching, Proceedings of Workshop on Ontologies and Information Sharing at International Joint Conference on Artificial Intelligence (IJCAI-01), Seattle, Washington, USA (2001) 63–70.
[31] J. Rogers, OpenGALEN: Making the Impossible Very Difficult, http://www.opengalen.org/, 2005.

[32] C. Rosse, J. Mejino, A Reference Ontology for Biomedical Informatics: the Foundational Model of Anatomy, Journal of Biomedical Informatics 36 (6) (2003) 478–500.
[33] M. Seddiqui, M. Aono, Alignment Results of Anchor-Flood Algorithm for OAEI-2008, Proceedings of Ontology Matching Workshop of the 7th International Semantic Web Conference, Karlsruhe, Germany (2008) 120–127.
[34] J. Seidenberg, A. Rector, Web Ontology Segmentation: Analysis, Classification and Use, Proceedings of the 15th International Conference on World Wide Web (WWW2006), Edinburgh, Scotland (2006) 13–22.
[35] P. Shvaiko, J. Euzenat, Ten challenges for ontology matching, Proceedings of the 7th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), Monterrey, Mexico (2008) 1164–1182.
[36] G. Stoilos, G. Stamou, S. Kollias, A String Metric for Ontology Alignment, Proceedings of the 4th International Semantic Web Conference (ISWC2005), Galway, Ireland (2005) 623–637.
[37] U. Straccia, R. Troncy, Towards distributed information retrieval in the semantic web: Query reformulation using the oMAP framework, Proceedings of the 3rd European Semantic Web Conference (ESWC-06), Budva, Montenegro 4011 (2006) 378–392.
[38] H. Stuckenschmidt, M. Klein, Structure-based Partitioning of Large Concept Hierarchies, Proceedings of the 3rd International Semantic Web Conference (ISWC2004), Hiroshima, Japan (2004) 289–303.
[39] R. Studer, V. Benjamins, D. Fensel, Knowledge Engineering: Principles and Methods, Journal of Data & Knowledge Engineering 25 (1-2) (1998) 161–197.
[40] K. Tu, M. Xiong, L. Zhang, H. Zhu, J. Zhang, Y. Yu, Towards Imaging Large-Scale Ontologies for Quick Understanding and Analysis, Proceedings of the 4th International Semantic Web Conference (ISWC2005), Galway, Ireland (2005) 702–715.
[41] P. Wang, B. Xu, Lily: ontology alignment results for OAEI 2008, Proceedings of Ontology Matching Workshop of the 7th International Semantic Web Conference, Karlsruhe, Germany (2008) 167–175.


[42] Z. Wang, Y. Wang, S. Zhang, G. Shen, D. Tao, Matching Large Scale Ontology Effectively, Proceedings of the 1st Asian Semantic Web Conference (ASWC2006), Beijing, China (2006) 99–105.
[43] W. Winkler, The State of Record Linkage and Current Research Problems, Technical report, Statistical Research Division, U.S. Census Bureau, Washington, USA.
[44] S. Zhang, O. Bodenreider, NLM Anatomical Ontology Alignment System Results of the 2006 Ontology Alignment Contest, Proceedings of Ontology Matching (OM-2006), Georgia, USA (2006) 153–164.
[45] X. Zhang, Q. Zhong, J. Li, J. Tang, RiMOM results for OAEI 2008, Proceedings of Ontology Matching Workshop of the 7th International Semantic Web Conference, Karlsruhe, Germany (2008) 182–189.


Algorithm Anchor-Flood(O1, O2, X)
/* Ontology 1: O1, Ontology 2: O2, Anchor: X */
/* X = (e(O1), f(O2)) */

/* Preprocessing to model ontologies */
 1. MO1 = Preprocessing(O1);
 2. MO2 = Preprocessing(O2);
    /* MO1 = {Taxonomy: TO1(O1), NormalizedText(O1), Relations(O1)};
       MO2 = {Taxonomy: TO2(O2), NormalizedText(O2), Relations(O2)} */

/* Initialization */
 3. P1 = D1 = {e};
 4. P2 = D2 = {f};
 5. A_index = 0;
 6. A = makeAlign(X);   /* A = {(e,f), r, n} */

/* Main block */
 7. while (¬ ("either all the collected concepts are explored, or no new aligned pair is found"))
 8.   if (A[A_index] != ∅)
 9.     D1 = D1 ∪ EA(TO1, A[A_index].e);
10.     D2 = D2 ∪ EA(TO2, A[A_index].f);
11.     A_index = A_index + 1;
12.     Flag e and f as explored;
13.   else if (∃ (¬explored concept ∈ D1 || ¬explored concept ∈ D2))
14.     if (diff > sizeOf(D2))
15.       P2 = EP(TO2, P2);
16.       D2 = D2 ∪ P2;
17.     else if (diff > sizeOf(D1))
18.       P1 = EP(TO1, P1);
19.       D1 = D1 ∪ P1;
20.     d1 = next(¬explored concept ∈ D1);
21.     d2 = next(¬explored concept ∈ D2);
22.     D1 = D1 ∪ EU(TO1, d1);
23.     D2 = D2 ∪ EU(TO2, d2);
24.     Flag d1 and d2 as explored;
25.   else
26.     if (diff > sizeOf(D2))
27.       P2 = EP(TO2, P2);
28.       D2 = D2 ∪ P2;
29.     else if (diff > sizeOf(D1))
30.       P1 = EP(TO1, P1);
31.       D1 = D1 ∪ P1;
32.     else
33.       P1 = EP(TO1, P1);
34.       D1 = D1 ∪ P1;
35.       P2 = EP(TO2, P2);
36.       D2 = D2 ∪ P2;
      /* The following "if" is almost always true. */
37.   if (sizeOf(D1) or sizeOf(D2) is changed)
38.     {At, As} = AlignDescendantSets(D1, D2, MO1, MO2);
39.     AD = At ∩ As;  A = A ∪ AD;
40.     Aterm = Aterm ∩ At;  Astruct = Astruct ∩ As;
41.     D1 = D1 ∩ AD;
42.     D2 = D2 ∩ AD;
43.   diff = absoluteValue(|¬explored concept ∈ D1| − |¬explored concept ∈ D2|);
44. endwhile
45. AggregateAlignedSets(Aterm, Astruct, MO1, MO2);
46. return D1, D2, A;

Fig. 4. Pseudo code of the Anchor-Flood algorithm. Each locally declared set (P1, P2, D1, D2 and A) is an indexed set associated with a flag that records the index up to which the set has already been explored. The next function always returns the next element of an indexed set, and the operator ∪ denotes set union, where new distinct data are always appended to the set.
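To make the control flow of Fig. 4 easier to follow, the following Python fragment is a minimal, hypothetical sketch of the flooding loop, not the authors' implementation: it collapses the EA step and AlignDescendantSets into one pass and omits the EP/EU balancing of the two descendant sets. The parameters neighbors1, neighbors2 and similar are stand-ins of our own choosing for the taxonomy neighborhoods and for the thresholded similarity tests of Eqs. 6 and 7.

```python
from collections import deque

def anchor_flood(neighbors1, neighbors2, similar, anchor):
    """Simplified sketch of the flooding loop of Fig. 4.

    neighbors1/neighbors2: concept -> iterable of neighboring concepts
    similar(c1, c2) -> bool: stand-in for the similarity tests
    anchor: an (e, f) pair of look-alike concepts
    Returns the segments D1, D2 and the list of aligned pairs A.
    """
    e, f = anchor
    d1, d2 = {e}, {f}             # descendant sets D1, D2 (the segments)
    alignments = [(e, f)]         # A, the growing list of aligned pairs
    frontier = deque(alignments)  # aligned pairs not yet expanded (EA step)
    while frontier:               # ends when no new aligned pair is found,
        ae, af = frontier.popleft()  # i.e. all collected concepts explored
        new1 = set(neighbors1(ae)) - d1
        new2 = set(neighbors2(af)) - d2
        d1 |= new1
        d2 |= new2
        # AlignDescendantSets, restricted here to the new concepts:
        for c1 in new1:
            for c2 in new2:
                if similar(c1, c2):
                    alignments.append((c1, c2))
                    frontier.append((c1, c2))
    return d1, d2, alignments
```

For instance, calling anchor_flood with adjacency dictionaries for the two taxonomies and a case-insensitive label comparison as similar grows both segments outward from the anchor, one aligned pair at a time, which is exactly the locality-of-reference behavior the algorithm exploits.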

Fig. 5. The figure shows the process of the Anchor-Flood algorithm, where an anchor is taken as input to produce a segmented alignment. The process "AlignDescendantSets" works locally to produce correspondences among the concepts of the "Descendant Sets" D1, D2.


Algorithm AlignDescendantSets(Set D1, Set D2)
/* Descendant Sets: D1, D2 */
/* Alignments: Aterm, Astruct */
/* Probable Misalignment: Apm */

/* Terminological Alignment */
 1. For each concept c1 ∈ D1
 2.   For each concept c2 ∈ D2
 3.     sim(c1, c2) = Apply Eq. 6
 4.     if sim(c1, c2) ≥ threshold
 5.       if c1 ∈ concept(Aterm) || c2 ∈ concept(Aterm)
 6.         Select the aligned pair with maximum similarity
 7.       Update set Aterm

/* Structural Alignment */
 8. For each concept c1 ∈ D1
 9.   For each concept c2 ∈ D2
10.     sim(c1, c2) = Apply Eq. 7
11.     if sim(c1, c2) ≥ threshold
12.       if c1 ∈ concept(Astruct) || c2 ∈ concept(Astruct)
13.         Select the aligned pair with maximum similarity
14.       Update set Astruct

/* Output */
15. return Aterm, Astruct, Apm

Fig. 6. Pseudo code of AlignDescendantSets, showing the terminological alignment and the structural alignment processes.
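The selection policy of Fig. 6 can be sketched in Python as below. The sort-and-greedy strategy and the threshold value are our illustrative choices for realizing "select the aligned pair with maximum similarity"; the sim parameter stands in for either Eq. 6 or Eq. 7.

```python
def align_sets(d1, d2, sim, threshold=0.8):
    """Sketch of the selection policy of Fig. 6: score every cross pair,
    discard pairs below the threshold, and when a concept occurs in
    several candidate pairs keep only the pair of maximum similarity.
    The threshold value here is illustrative, not the paper's setting."""
    candidates = [(sim(c1, c2), c1, c2) for c1 in d1 for c2 in d2]
    candidates = [t for t in candidates if t[0] >= threshold]
    candidates.sort(key=lambda t: t[0], reverse=True)  # best pairs first
    used1, used2, aligned = set(), set(), []
    for score, c1, c2 in candidates:
        if c1 in used1 or c2 in used2:
            continue  # a higher-similarity pair already claimed c1 or c2
        aligned.append((c1, c2, score))
        used1.add(c1)
        used2.add(c2)
    return aligned
```

Running this routine once with a terminological similarity and once with a structural similarity yields the two sets Aterm and Astruct that Fig. 7 aggregates.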

Algorithm AggregateAlignedSets(Set Aterm, Set Astruct, Model MO1, Model MO2)
/* Ontology Models: MO1, MO2 */
/* Alignments: Atot, Aterm, Astruct, Afinal */
/* Probable Misalignment: Apm */

/* Aggregation of Alignments */
1. Atot = Aterm ∪ Astruct
2. Apm = Aterm − Astruct
3. Afinal = Astruct
4. For each aligned pair a(e, f, =, n) ∈ Apm
5.   if (∃ aN(eN, fN, =, nN) | eN ∈ neighbors(e) ∧ fN ∈ neighbors(f))
6.     Afinal = Afinal ∪ a

/* Output */
7. return Afinal

Fig. 7. Pseudo code of AggregateAlignedSets, where the terminological alignment and the structural alignment are aggregated to produce the final alignment.
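A compact Python sketch of Fig. 7 follows. Treating Atot as the pool of aligned pairs checked on line 5 is our reading of the pseudo code, and the neighbors1/neighbors2 helpers are assumed accessors over the two ontology models.

```python
def aggregate_aligned_sets(a_term, a_struct, neighbors1, neighbors2):
    """Sketch of Fig. 7: keep every structural alignment, and rescue a
    terminological-only pair (a probable misalignment) only when some
    already-aligned pair lies in the neighborhoods of both concepts.
    a_term, a_struct: sets of (e, f) pairs."""
    a_tot = a_term | a_struct
    a_pm = a_term - a_struct        # probable misalignments
    a_final = set(a_struct)
    for e, f in a_pm:
        ne, nf = set(neighbors1(e)), set(neighbors2(f))
        if any(en in ne and fn in nf for en, fn in a_tot):
            a_final.add((e, f))     # supported by a neighboring alignment
    return a_final
```

The design choice this reflects is that a purely terminological match is kept only when its surrounding structure corroborates it, which is how the algorithm prunes seemingly-aligned but actually misaligned pairs.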


Fig. 8. The average precision and recall graph for the systems that participated in the OAEI 2008 campaign, including our Anchor-Flood algorithm based system, aflood [3].
