Finding Important Vocabulary Within Ontology

Xiang Zhang, Hongda Li, and Yuzhong Qu
Department of Computer Science and Engineering, Southeast University, Nanjing 210096, P.R. China
{xzhang, hdli, yzqu}@seu.edu.cn

Abstract. In the Semantic Web community, some research has been done on ranking ontologies, while very little attention has been paid to ranking vocabularies within an ontology. However, finding the important vocabularies within a given ontology benefits ontology indexing, ontology understanding, and even the ranking of vocabularies from a global view. In this paper, a Vocabulary Dependency Graph (VDG) is proposed to model the dependencies among vocabularies within an ontology, and a Textual Score of Vocabulary (TSV) is established based on the idea of virtual documents. A Double Focused PageRank algorithm is then applied to the VDG and the TSV to rank the vocabularies within an ontology. Preliminary experiments suggest that our approach is useful in finding important vocabularies within an ontology.

1 Introduction

An ontology presents a conceptual framework, which includes the description of concepts and the relations between them. Because an identical or similar set of concepts may be represented by different ontologies, it is not easy for knowledge engineers who wish to reuse the vocabulary of others to examine all of the relevant ontologies by human effort alone. Harith Alani terms this condition the knowledge reuse conundrum [1]. To address the problem of reuse, ontology search engines and libraries have emerged. Among them, Swoogle uses a hyperlink analysis approach for ontology ranking [2].

While researchers have identified some major problems of ontology ranking and given their solutions, another problem remains unresolved. Given an ontology, some vocabularies play a crucial role in defining others. We refer to them as “important” vocabularies of the ontology. Although the “importance” of a concept is mentioned in [3] in the context of visualizing large-scale ontologies, the ranking scheme used there is still simple.

We believe that ranking vocabularies within an ontology is beneficial for ranking vocabularies across ontologies, as done in Swoogle, since finding important vocabularies in a global view can be enhanced by the local ranking of vocabularies within each ontology. It is also beneficial for applications in ontology visualization or summarization, in which users are generally concerned only with the most important vocabularies described in an ontology. Finally, it is significant for ontology indexing, considering that important vocabularies tend to be retrieved more frequently than others.

R. Mizoguchi, Z. Shi, and F. Giunchiglia (Eds.): ASWC 2006, LNCS 4185, pp. 106-112, 2006. © Springer-Verlag Berlin Heidelberg 2006


2 Related Works

In this section, we first give an overview of the work related to finding important vocabularies within an ontology, and then introduce the Double Focused PageRank algorithm, which is the basic ranking algorithm applied in our approach.

2.1 Ontology Ranking

AKTiveRank, presented in [1], is a ranking system closely related to our work. While the methodology of Swoogle is query-independent and gives little consideration to the structure inside the ontology, AKTiveRank is based on an analysis of concept structure and measures the relevance of the result ontologies to multiple query keywords. The ranking of an ontology should reflect the importance of each relevant vocabulary in the ontology, which is quantified by the “Centricity” and “Density” of the vocabularies: “Centricity” represents the distance of a vocabulary to the central level of its hierarchy, and the “Density” of a vocabulary is a weighted summation of the number of its subclasses, siblings, relations and instances.

KeWei Tu et al. describe an approach to finding and presenting important classes to users in [3]. Important classes are the classes at a high level of the hierarchy or with many descendants. The “importance” of a class is quantified using a parameterisable formula, which calculates a linear combination of the total “importance” of its direct subclasses and a function of its depth in the class hierarchy.

Compared with the related ranking schemes mentioned above, our work contributes a combined analysis of both structural and textual information when finding important vocabularies within an ontology.

2.2 Double Focused PageRank

Double Focused PageRank is a variation of PageRank, which Diligenti et al. interpret within a unified probabilistic framework in [4]. Picturing a web-surfing scenario, Double Focused PageRank defines two atomic actions from which a surfer chooses while staying at a page: following a hyperlink from the current page, or jumping to another page. In practice, the surfer does not jump or follow links at random. Taking textual information into account, the surfer prefers to jump to, or follow links to, pages whose topics are relevant to the surfer’s interest. In the algorithm, a text classifier attaches to each page a score representing its relevance to a given topic of interest. The ranking of each page is finally obtained by iteratively computing the probability distribution of the surfer being at a certain page at a certain step.

We choose this algorithm for ranking vocabularies within an ontology for the following reason: given a graph and textual information attached to each node, Double Focused PageRank offers a clear and parameterisable ranking scheme that considers the effect of both the graph structure and the textual information on the ranking result. This feature is appropriate for determining the “importance” of vocabularies in our approach, and it distinguishes our approach from others that use structural information only.
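The surfer model described above can be illustrated with a short Python sketch. It is not the exact formulation of [4]: the fixed follow probability `d`, the choice of both link targets and jump targets in proportion to the relevance score, and the names `double_focused_pagerank`, `links` and `score` are all assumptions made for illustration.

```python
def double_focused_pagerank(links, score, d=0.85, iters=100):
    """Rank pages with a relevance-biased random surfer (illustrative sketch).

    links: dict mapping a page to the list of pages it links to.
    score: dict mapping every page to a positive relevance score.
    d:     assumed probability of following a link rather than jumping.
    """
    pages = list(score)
    total = sum(score.values())
    # Focused jump: the jump target is chosen in proportion to its relevance.
    jump = {p: score[p] / total for p in pages}
    rank = dict(jump)  # start from the jump distribution
    for _ in range(iters):
        nxt = {p: 0.0 for p in pages}
        for q in pages:
            out = links.get(q, [])
            out_total = sum(score[h] for h in out)
            # A page without outgoing links forces a jump.
            follow_mass = d * rank[q] if out_total > 0 else 0.0
            jump_mass = rank[q] - follow_mass
            # Focused follow: the link target is chosen in proportion to its relevance.
            for h in out:
                nxt[h] += follow_mass * score[h] / out_total
            for p in pages:
                nxt[p] += jump_mass * jump[p]
        rank = nxt
    return rank


# Hypothetical three-page web: "b" and "c" are linked identically,
# but "b" is more relevant to the topic of interest.
example_links = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
example_score = {"a": 1.0, "b": 3.0, "c": 1.0}
example_rank = double_focused_pagerank(example_links, example_score)
```

In this toy graph the pages "b" and "c" occupy the same structural position, so any difference in their rank comes from the relevance scores alone, which is the effect the double focusing is meant to produce.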


3 Ranking Vocabularies Within Ontology

Since the nature of an RDF graph is different from that of the graph of web pages, it is unsuitable to rank the vocabularies within an ontology by applying link analysis directly to the RDF graph, because an RDF graph cannot explicitly show all the dependencies between vocabularies. In an RDF graph, two vocabularies are adjacent only when they appear in the same triple. Commonly, however, two vocabularies are related via one or more intermediate blank nodes, which is implied by a path between the two vocabularies in the RDF graph. Besides, users are mainly concerned with domain vocabularies, so blank nodes, literals and built-in vocabularies should be discarded in the ranking process. To achieve a more reasonable vocabulary-ranking scheme, a new graph model is needed, which should explicitly exhibit the dependencies between vocabularies. Similar to the interpretation of page ranking, we believe that a vocabulary is intuitively “important” if it is an authoritative node in our graph model.

3.1 RDF Sentence

Definition 1 (B-Triple). A triple in an RDF graph is called a B-Triple if it has at least one blank node as its subject or object.

Definition 2 (B-Connectedness). Two B-Triples in an RDF graph, denoted by b_s and b_t, are said to be B-Connected if and only if one of the following conditions is satisfied:

- b_s and b_t have a blank node in common;
- there exists a sequence b_0 (= b_s), b_1, …, b_n (= b_t) with n > 1 such that b_(i-1) and b_i are B-Connected (for i = 1, …, n).

Definition 3 (RDF Sentence of an RDF Graph). Given an RDF graph, a triple in the RDF graph with no blank node as its subject or object is a simple RDF sentence; a maximal subset of B-Connected B-Triples in the RDF graph is called a complex RDF sentence. Simple and complex RDF sentences are both RDF sentences (or sentences for short) of the RDF graph, and nothing else is an RDF sentence of the given RDF graph. The size of a sentence is the number of triples in the sentence. As shown in Figure 1, two sentences can be parsed from a simple RDF graph.

Definition 4 (Subject of an RDF Sentence). The subject of an RDF sentence is the domain vocabulary that plays the role of subject in some triple contained in the RDF sentence, provided that such a domain vocabulary exists and is unique.

3.2 Vocabulary Dependency Graph

Definition 5 (Domain Vocabularies). The domain vocabularies of an RDF graph are the URI references (or URIrefs for short) that occur in the RDF graph and do not belong to the built-ins provided by ontology languages such as RDF or OWL.

Definition 6 (Vocabulary Dependency Graph of an RDF Graph). The Vocabulary Dependency Graph (VDG), written as <V, E, W>, is a weighted directed graph such that the vertices in V are the domain vocabularies of the RDF graph, and there exists an edge between two vertices if their associated domain vocabularies co-occur in at least one RDF sentence of the RDF graph. W: E → R+, and w(i, j) denotes the strength of the dependency between domain vocabularies i and j. The strength is formulated as a function of the size of the RDF sentence in which the two vocabularies co-occur; the size can simply be deemed the distance between the vocabularies. Dependency is directed: a vocabulary that is the subject of an RDF sentence depends on the other co-occurring vocabularies with the formulated strength, while the co-occurring vocabularies have a reversed dependency on the subject with a weaker strength.

The idea of the VDG is somewhat similar to the Dependency Graph proposed in [7]. However, the Dependency Graph in [7] only extracts dependencies corresponding to the subclass hierarchy and dependencies created by the domain and range restrictions in property definitions.
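To make the construction of the VDG concrete, the following Python sketch builds the weighted edges from a list of RDF sentences. The text only states that the strength is a function of the sentence size, so the choices made here are assumptions for illustration: a strength of 1/size, a reversed dependency scaled by a hypothetical `reverse_factor`, weights summed over sentences, edges created only between the subject and its co-occurring vocabularies, and sentences without a unique subject skipped.

```python
from collections import defaultdict


def build_vdg(sentences, reverse_factor=0.5):
    """Build VDG edge weights from RDF sentences (illustrative sketch).

    sentences: iterable of (subject, vocabularies, size), where subject is the
               unique subject of the sentence (or None), vocabularies is the
               set of domain vocabularies occurring in the sentence, and size
               is the number of triples in the sentence.
    Returns a dict mapping a directed edge (i, j) to its weight w(i, j).
    """
    weights = defaultdict(float)
    for subject, vocabularies, size in sentences:
        if subject is None:
            continue  # assumption: sentences without a unique subject are skipped
        strength = 1.0 / size  # assumption: larger sentence, weaker dependency
        for v in vocabularies:
            if v == subject:
                continue
            # The subject depends on each co-occurring vocabulary ...
            weights[(subject, v)] += strength
            # ... and each of them depends on the subject more weakly.
            weights[(v, subject)] += reverse_factor * strength
    return dict(weights)


# Hypothetical sentences: one simple sentence (size 1), one complex sentence (size 4).
example_sentences = [
    ("foo:Person", {"foo:Person", "foo:Animal"}, 1),
    ("foo:Person", {"foo:Person", "foo:hasFather"}, 4),
]
example_vdg = build_vdg(example_sentences)
```

With these assumed weights, foo:Person depends on foo:Animal with strength 1.0 and on foo:hasFather with strength 0.25, and the reversed edges carry half of those values.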

[Figure 1 not reproduced. Labels appearing in the figure: foo:Person, foo:Animal, _:genid, foo:hasFather.]

Fig. 1. A sample RDF graph and its RDF sentences
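The parsing of an RDF graph into sentences (Definitions 1 to 3) can be sketched in Python as follows. The representation is an assumption: triples are plain (subject, predicate, object) string tuples, blank nodes are recognised by the "_:" prefix, and B-Connectedness is computed with a small union-find structure. The example triples are hypothetical and only loosely modelled on the labels of Figure 1.

```python
def is_blank(term):
    """Assumed convention: blank node identifiers start with '_:'."""
    return term.startswith("_:")


def rdf_sentences(triples):
    """Split a list of (s, p, o) triples into RDF sentences (lists of triples)."""
    # Simple sentences: triples with no blank node as subject or object.
    simple = [[t] for t in triples if not (is_blank(t[0]) or is_blank(t[2]))]
    b_triples = [t for t in triples if is_blank(t[0]) or is_blank(t[2])]

    parent = list(range(len(b_triples)))  # union-find over B-Triples

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    first_seen = {}  # blank node -> index of the first B-Triple containing it
    for i, (s, _, o) in enumerate(b_triples):
        for node in (s, o):
            if is_blank(node):
                j = first_seen.setdefault(node, i)
                parent[find(i)] = find(j)  # B-Triples sharing a blank node are merged

    # Complex sentences: maximal sets of B-Connected B-Triples.
    groups = {}
    for i, t in enumerate(b_triples):
        groups.setdefault(find(i), []).append(t)
    return simple + list(groups.values())


# Hypothetical graph: one triple without blank nodes, four triples around _:genid.
example_triples = [
    ("foo:Person", "rdfs:subClassOf", "foo:Animal"),
    ("foo:Person", "rdfs:subClassOf", "_:genid"),
    ("_:genid", "rdf:type", "owl:Restriction"),
    ("_:genid", "owl:onProperty", "foo:hasFather"),
    ("_:genid", "owl:cardinality", "1"),
]
example_result = rdf_sentences(example_triples)
```

For this hypothetical input the function returns two sentences, a simple one of size 1 and a complex one of size 4 that groups every triple touching the blank node.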

3.3 Textual Score of Vocabulary

In our model, the textual score of each vocabulary is established based on the virtual documents approach described in [5], and reflects the natural-language similarity between a vocabulary and the whole ontology. A virtual document (VDoc) is a collection of weighted words containing the local descriptions and neighboring linguistic information of a URIref, reflecting the intended meaning of the URIref. In an ontology, local descriptions include the local name, labels, comments and other annotations of declared URIrefs. The paper [5] also presents the construction of VDocs and the measurement of similarity between VDocs. Virtual documents were originally defined for ontology mapping.

In our model, for simplicity, the VDoc of a vocabulary is constructed without taking neighboring linguistic information into account. The VDoc of the whole ontology is constructed by combining the VDocs of all vocabularies with the linguistic information of the ontology itself, including ontology comments, the file name of the ontology document and so on. The Textual Score of Vocabulary (TSV) is defined as the similarity between the VDoc of the vocabulary and the VDoc of the ontology as a whole. An n-dimensional vector of textual scores is thus constructed, with each dimension’s value indicating the relevance of the corresponding domain vocabulary to the topic of the ontology.

3.4 Ranking Process

As described in Section 2.2, Double Focused PageRank provides a ranking algorithm that considers both link structure and textual content. In the setting of vocabulary ranking, a surfer may be interested in a vocabulary, and when he decides to know more about it he follows one of its links in the VDG and takes a look at the adjacent vocabulary. A heavily weighted link indicates that the adjacent vocabulary is close to the original one, and it is followed with higher probability. The surfer may choose to jump if he loses interest in the current vocabulary. Although he may jump arbitrarily in the VDG, it is believed that he prefers vocabularies that are more relevant to the topic of the ontology. The ranking of the vocabularies is finally determined by the resulting probability distribution.
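The textual scores of Section 3.3, which express the surfer’s preference for vocabularies relevant to the topic of the ontology, can be sketched in Python as follows. This is a simplification and not the weighting scheme of [5]: each VDoc is assumed to be a plain bag of word counts taken from the local descriptions, similarity is assumed to be the cosine of the two count vectors, and the helper names (`words`, `vdoc`, `cosine`, `textual_scores`) are hypothetical.

```python
import math
import re
from collections import Counter


def words(text):
    """Split a description into lower-case words, breaking camelCase names."""
    return [w.lower() for w in re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", text)]


def vdoc(*descriptions):
    """Virtual document as a bag of word counts (assumed weighting)."""
    return Counter(w for d in descriptions for w in words(d))


def cosine(a, b):
    """Cosine similarity between two bags of words."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def textual_scores(vocab_descriptions, ontology_descriptions=()):
    """Return the TSV of each vocabulary.

    vocab_descriptions:    dict mapping a vocabulary to its local descriptions
                           (local name, labels, comments).
    ontology_descriptions: descriptions of the ontology itself
                           (comments, file name and so on).
    """
    docs = {v: vdoc(*descs) for v, descs in vocab_descriptions.items()}
    # The ontology VDoc combines all vocabulary VDocs with the ontology's own text.
    ontology_doc = vdoc(*ontology_descriptions)
    for d in docs.values():
        ontology_doc.update(d)
    return {v: cosine(d, ontology_doc) for v, d in docs.items()}


# Hypothetical descriptions for two vocabularies of a small ontology.
example_tsv = textual_scores(
    {"Animal": ["Animal", "any living animal"], "hasFather": ["hasFather"]},
    ontology_descriptions=["Animal ontology"],
)
```

In this hypothetical example the vocabulary Animal shares more words with the ontology as a whole than hasFather does, so it receives the higher textual score. The resulting vector is what a focused ranking step would use as its relevance scores.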

4 Experiments

We have performed experiments on a set of relatively small ontologies, so that the results can be evaluated by humans. In this section, we present the experimental results for two sample ontologies, the Animal ontology and the Music ontology, and give our evaluation of the results.

The Animal ontology¹ is a very small ontology presenting a conceptual framework of Person (as a subclass of Animal) and relationships between persons, such as parent, spouse and friend. The RDF graph and class hierarchy of this ontology can be found at our website²,³. The top ten ranked vocabularies are shown in Table 1. Although we treat classes and properties equally as vocabularies in the ranking process, we separate them in the final ranking list.

Another experiment is on the “Music” ontology⁴. Its class hierarchy graph and property graph can also be found at our website⁵,⁶. The top ten ranked classes and properties are shown separately in Table 2. Most of the top-ranked vocabularies are intuitively important among the vocabularies of this ontology, and they are all relevant to the topic of the ontology.

Table 1. Top ten vocabularies within “Animal” ontology

        Classes Ranking                 Properties Ranking
        Local Name        Score         Local Name            Score
No.1    Animal            0.19360       hasAncestor           0.10708
No.2    Person            0.14556       hasParent             0.07476
No.3    Male              0.07522       hasFather             0.05199
No.4    Female            0.07216       hasMother             0.04950
No.5    Woman             0.03412       hasSpouse             0.03365
No.6    Man               0.02989       hasFriend             0.02210
No.7    HumanBeing        0.01241       hasChild              0.01652
No.8    TwoLeggedPerson   0.00676       hasMaleParent         0.01606
No.9    TwoLeggedThing    0.00334       hasFemaleParent       0.01455
No.10                                   biologicalMotherOf    0.01388

¹ http://www.atl.lmco.com/projects/ontology/ontologies/animals/animalsA.owl
² http://xobjects.seu.edu.cn/project/falcon/questionnaire/graph/animalsA_graph.htm
³ http://xobjects.seu.edu.cn/project/falcon/questionnaire/class hierarchy/animalsA_graph.htm
⁴ http://webster.cs.uga.edu/~janik/2004-Fall/8350/ontology/Music.owl
⁵ http://xobjects.seu.edu.cn/project/falcon/questionnaire/class hierarchy/Music_graph.htm
⁶ http://xobjects.seu.edu.cn/project/falcon/questionnaire/property%20graph/Music_graph.htm


Table 2. Top ten vocabularies within “Music” ontology

        Classes Ranking                  Properties Ranking
        Local Name           Score       Local Name              Score
No.1    Musical_Instrument   0.0542      has_tempo               0.0363
No.2    Musician             0.0512      plays_instrument        0.0318
No.3    Group                0.0472      consist_of_movements    0.0315
No.4    Music_piece          0.0453      belongs_to              0.0274
No.5    Movement             0.0353      owns_instrument         0.0272
No.6    String_instruments   0.0325      play_in_ensemble        0.0223
No.7    Performer Piano      0.0236      consist_of_members      0.0221
No.8    Piano                0.0221      used_instruments        0.0180
No.9    Trio                 0.0191      composed_for            0.0169
No.10   Violin               0.0178      plays_piano             0.0162

5 Conclusions and Future Work

In this paper we present a novel approach to finding important vocabularies within a given ontology. The idea of Double Focused PageRank is used to rank vocabularies within an ontology by considering both structure and textual content. The structure of an ontology is characterized by a Vocabulary Dependency Graph, and the Textual Score of Vocabulary is proposed to reflect the relevance of each vocabulary to the ontology. The experimental results suggest that our approach is useful in finding “important” vocabularies within a given ontology.

According to [6], an OWL ontology can also be mapped to an RDF graph to define description formulations. Because our current work is based on the RDF graph of an ontology, it can be extended to reflect OWL features in the ranking result by considering the types of edges and specifying a corresponding weighting scheme on the RDF graph when building the Vocabulary Dependency Graph.

One direction for future work is the problem of multiple topics. Some ontologies describe conceptual frameworks on more than one topic. With the current ranking scheme, important vocabularies in lightly weighted topics are drowned out by those in heavily weighted topics. However, some users would like to see the ontology separated into partitions according to its topics, with vocabularies ranked within each partition. It will be interesting to address this issue. Another direction is ranking vocabularies across multiple ontologies by utilizing the ranking of vocabularies within each ontology.

Acknowledgements

This work is supported in part by the 973 Program of China under Grant 2003CB317004, and in part by the JSNSF under Grant BK2003001. The third author of this paper is also supported by the NCET (New Century Excellent Talents in University) program under Grant NCET-04-0472. We would like to thank our colleagues for their valuable suggestions.


References

1. Alani, H., Brewster, C.: Ontology ranking based on the analysis of concept structures. In Proceedings of Third International Conference on Knowledge Capture (K-Cap), Banff, Alberta, Canada. (2005) 51-58
2. Ding, L., Pan, R., Finin, T.W., Joshi, A., Peng, Y., Kolari, P.: Finding and Ranking Knowledge on the Semantic Web. International Semantic Web Conference 2005 (2005) 156-170
3. Tu, K., Xiong, M., Zhang, L., Zhu, H., Zhang, J., Yu, Y.: Towards Imaging Large-Scale Ontologies for Quick Understanding and Analysis. International Semantic Web Conference 2005 (2005) 702-715
4. Diligenti, M., Gori, M., Maggini, M.: A Unified Probabilistic Framework for Web Page Scoring Systems. IEEE Trans. Knowl. Data Eng. 16(1) (2004) 4-16
5. Qu, Y., Hu, W., Cheng, G.: Constructing Virtual Documents for Ontology Matching. Accepted by the Fifteenth International World Wide Web Conference. (2006)
6. Patel-Schneider, P.F., Hayes, P., Horrocks, I. (ed.): OWL Web Ontology Language Semantics and Abstract Syntax. W3C Recommendation 10 February 2004. Latest version is available at http://www.w3.org/TR/owl-semantics/
7. Stuckenschmidt, H., Klein, M.: Structure-Based Partitioning of Large Class Hierarchies. In Proceedings of the 3rd International Semantic Web Conference (2004) 289-303