Interoperability of Annotation Languages in Semantic Web Applications Design

Valentina Presutti

Technical Report UBLCS-2006-11 March 2006

Department of Computer Science University of Bologna Mura Anteo Zamboni 7 40127 Bologna (Italy)

The University of Bologna Department of Computer Science Research Technical Reports are available in PDF and gzipped PostScript formats via anonymous FTP from the area ftp.cs.unibo.it:/pub/TR/UBLCS or via WWW at URL http://www.cs.unibo.it/. Plain-text abstracts organized by year are available in the directory ABSTRACTS.



Interoperability of Annotation Languages in Semantic Web Applications Design

Valentina Presutti

Technical Report UBLCS-2006-11 March 2006

Abstract

This document reports the work that I developed during the three years of the doctorate school I attended. My research efforts have focused on different aspects that are apparently disconnected from each other. In practice, during these three years I investigated the Semantic Web research field, and this investigation led me to deal with different topics which are actually parts of the overall Semantic Web project. Hence, my intent has been to contribute to the overall vision. The main part of this thesis is about ontology interoperability. Furthermore, the use of ontologies for the development of domain-oriented Web sites is investigated, and an ongoing project concerning the development of a semantic domain-oriented search engine is also presented. The Semantic Web is the new generation of the World Wide Web. It extends the Web by giving information a well-defined meaning, allowing it to be processed by machines. This vision is going to become reality thanks to a set of technologies specified and maintained by the World Wide Web Consortium (W3C), and to growing research efforts from industry and academia. The most important element composing the architecture of the Semantic Web is the ontology. An ontology is the formal definition of a set of semantic concepts, the relations between them, and semantic constraints useful for automatic inference, aimed at describing a specific domain of knowledge. The specification of a Web ontology language is maintained by the W3C. Through the use of ontologies it is possible to deploy information on the Web in a machine-understandable form, as well as to automate tasks and build domain-oriented Web sites. In recent years the number of organizations which present themselves on the World Wide Web with their own Web sites has increased incredibly, and this trend will probably continue in the future. Organizations rely more and more upon these Web portals for offering services to their members or to other people. In order to design, implement, and deploy such Web sites, fundamentally generic tools are used. I propose to use ontologies as a guide for such tools, in order to make the task of realizing domain-aware Web sites easier. In fact, I claim that ontologies can be used for designing Web sites, and that the results are Web sites which have knowledge of the specific domains they support and follow their evolution. Furthermore, this approach makes it easy to obtain automatic mechanisms for the semantic annotation of the Web pages composing the Web sites. I present the design of the architecture of the tool WikiFactory, which represents a concrete application of my proposed approach. However, in order to exploit the assets deriving from the use of Web ontologies, they must be specified with the aim of making them interoperable and sharable. In fact, without ontology interoperability it is particularly difficult to exploit the Semantic Web's benefits. To this aim, the W3C set of Semantic Web technologies provides best practices and mechanisms such as ontology extensibility, ontology patterns, and so on. One issue concerning ontology interoperability is part of this thesis, and it is also a topic of interest and discussion in the W3C Semantic Web Best Practices and Deployment working group. In particular, the problem that this thesis addresses is that there are two different formalisms that are used for


representing meta-information on the Web, and both are standards: RDF and Topic Maps. RDF is a W3C standard; it was born with the Semantic Web in mind and represents its base model. Topic Maps is an ISO standard, and although it was born for different purposes it is suitable for representing metadata on the Web. In fact, there is a huge amount of interesting and important information represented in Topic Maps that needs to be shared on the Semantic Web, as well as that in RDF. The two formalisms are different but present many similarities. This thesis reports the work that I have been doing on this topic. In particular, a study of other existing approaches to the problem of RDF and Topic Maps interoperability and a new proposal concerning a translation model are presented. A tool, named Meta, was developed as well, which supports the defined translation model. Furthermore, much more work has been done: the tool and the translation model were based on specifications that were not stable at that time, and when new revised and stable specifications were released, the translation model and the tool needed to be aligned with them. A new approach for the translation model has been studied, and the converter tool has been re-implemented based on the new approach and specifications.


Chapter 1

Introduction

The World Wide Web has become the greatest repository of information that has ever existed. It contains documents and multimedia resources relating to every imaginable topic, and all of this data is available to anyone who can access the Internet. The Web has grown so fast mainly thanks to its decentralized design: Web pages are hosted by several computers, and each document can point to other documents, either on the same or on different computers. The result is that anyone in the world can provide content on the Web. The Web's dimension has also become its main problem: it is comparable to a library where most of the books are on the floor, making it difficult to locate useful information. Search engines (such as Google and AltaVista) can provide some assistance, but for many users finding the document they are looking for is still like trying to find a needle in a haystack. Furthermore, many users want to use the Web also to perform tasks. For example, a user might want to find the best price on a car, or plan and make reservations that fit the dates in their personal agenda. Currently, working out such tasks often involves visiting a series of pages, integrating their content, and reasoning about them in some way. This is far beyond the capabilities of contemporary directories and search engines. The main obstacle is that Web pages provide information which tells a computer how to display a text or where to go when a link is clicked, but there is nothing telling machines what the text means. In order to support such usage of the Web, the W3C has chartered the Semantic Web initiative. The general aim of the Semantic Web is to give all information a well-defined meaning, not only for humans but also for machines. In this work I present my research efforts, whose general goal is to contribute to the realization of the Semantic Web. The aim of my thesis is to point out that the use of ontologies provides a number of benefits for the creation, interoperability, and maintenance of web-based software systems. Web ontologies and their supporting methods and techniques might be the key technology making the Web evolve into the Semantic Web, which is, in my opinion, one of the most important next technological steps in general. More specifically, I studied the development of a domain-oriented search engine, with the goal of identifying a way of providing users of a specific domain of knowledge with a service allowing them to find the "right" information they are looking for, by exploiting the limited scope (i.e., the specific domain) of coverage of the service. Furthermore, I investigated the use of web ontologies as devices for the design of Semantic Web applications, with particular focus on web sites supporting a given domain of knowledge. Finally, I studied a problem concerning the interoperability between Web ontologies, and I provide a solution for the problem of interoperability between two standard formalisms used for representing web ontologies: RDF and Topic Maps. In particular, section 2 shows a proposal for modeling the mapping rules using a formal approach: I use ontologies for modeling RDF, Topic Maps, and the mapping model. In my opinion, a formal approach is the best way to handle fundamental issues on the Web. The Web's success is due to its "relaxed" nature, and this aspect should not be lost. However, I think that a reasonable compromise can be reached between a relaxed and a formal approach, where the formal approach is applied to fundamental issues such as that of the identity of resources. The intuition is to provide the web with a strong foundation model preserving its


relaxed nature. In this chapter, I give a brief introduction to some of the technologies which characterize the Semantic Web environment, as well as to the notions and concepts which are useful for understanding the rest of the thesis. I also describe the structure of the document.

1 Background

The aim of this section is to give an overall description of the context in which the work described by this thesis is placed: the World Wide Web and, more specifically, the Semantic Web. Nevertheless, I dedicate one section to describing the general structure of a search engine, and another to a brief description of the Unified Modeling Language (UML), which is used in chapter 3 for designing Web sites with an ontology-driven approach.

1.1 The Semantic Web

The Semantic Web is an extension of the current Web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.

The above definition is taken from [BHL01], a paper describing the Semantic Web vision, which can be considered a symbolic point in time from which many research efforts on the Semantic Web began. The technological object fundamental to the realization of the Semantic Web is the ontology. Ontology is a term borrowed from philosophy that refers to the science of describing the kinds of entities in the world and how they are related [SWM04]. In Information Science, an ontology is the product of an attempt to formulate an exhaustive and rigorous conceptual schema about a domain. This domain does not have to be the complete knowledge of that topic, but purely a domain of interest decided upon by the creator of the ontology. An ontology is typically a hierarchical data structure containing all the relevant entities and their relationships and rules within that domain (e.g., a domain ontology). In order to realize the Semantic Web vision, a set of standard technologies has been defined. Figure 1 shows the Semantic Web layered architecture. The Web is an information space where everything is a resource, e.g., documents, images, downloadable files, services, electronic mailboxes. Each resource is identified by a URI. The Uniform Resource Identifier (URI) [BFM98] is a mechanism of short strings that identify resources on the Web, making them available under a variety of naming schemes and access methods such as HTTP, FTP, and Internet mail, addressable in the same simple way. RDF [RDF] provides the general model, and OWL [OWL04] is the language for specifying Web ontologies. SPARQL [PS05] is the query language for RDF: it provides facilities to extract information in the form of URIs, blank nodes, plain and typed literals, to extract RDF subgraphs, and to construct new RDF graphs based on information in the queried graphs. There are also other technologies used in the Semantic Web; among them I cite Topic Maps [MG05], which is considered an alternative to RDF by a wide community that has used it since before the Semantic Web began its concrete realization. A brief introduction to Topic Maps, as well as to RDF, is given in this thesis. One contribution of this work is to give a solution to the problem of their interoperability (see chapters 5, 6, and 7).

1.1.1 The Resource Description Framework in a nutshell

The Resource Description Framework (RDF) [RDF] is a model that was born as the base model for the Semantic Web. It is used for representing information about resources in the World Wide Web, and is particularly intended for representing metadata about Web resources. Its aim is to provide a tool to formally describe the information the Web contains, so that it can be machine-readable.


Figure 1. The architecture of the Semantic Web taken from [T.05]

The key element of such a model is the resource. Together with the concept of resource, two other simple elements contribute to the RDF model: properties and statements. A resource can be a Web page, an entire Web site, an element within a document, or an abstract concept. It is identified uniquely by a URIref, i.e., a URI [BFM98] plus an optional anchor. A property is associated with a resource and describes it by means of a specific characteristic. A statement is a triple formed by a subject, a predicate, and an object, where the subject is always a resource, the predicate is a property, and the object is the value of the property. Furthermore, RDF defines a mechanism allowing the reification of statements, that is, a way to assert facts about a statement: to reify a statement means to make it the subject or object of other statements. This simple model allows the declaration of facts about the world. RDF is defined by a set of specifications including: an XML-based syntax [DB04], a semantics [Hay04], an abstract syntax [KC04], a vocabulary for defining schemas [BG04], a test cases document [JDB04], and an introductory document [FEB04]. RDF Schema [BG04], in particular, is RDF's vocabulary description language. It defines classes and properties that may be used to describe classes, properties, and other resources.
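To make the triple model concrete, here is a minimal sketch that builds and serializes one statement. It assumes the Jena RDF library and an invented example vocabulary; the thesis itself does not prescribe any particular API.

```java
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

public class TripleExample {
    public static void main(String[] args) {
        // An empty RDF graph (a set of statements).
        Model model = ModelFactory.createDefaultModel();

        // Subject: a resource identified by a URIref.
        Resource page = model.createResource("http://www.cs.unibo.it/index.html");

        // Predicate: a property (hypothetical example vocabulary).
        Property author = model.createProperty("http://example.org/terms#", "author");

        // Statement: subject - predicate - object (here the object is a literal).
        page.addProperty(author, "Valentina Presutti");

        // Serialize the graph using the RDF/XML syntax.
        model.write(System.out, "RDF/XML");
    }
}
```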

1.1.2 The Web Ontology Language

The Web Ontology Language (OWL) [OWL04] is a language for defining Web ontologies which provides a richer vocabulary than RDF Schema and is built following the general RDF approach. OWL extends RDF Schema in order to give more support for reasoning on document content. An OWL ontology is composed of definitions of classes, properties, and instances.

OWL has a well-defined formal semantics that allows the derivation of consequences from a given ontology. For example, OWL makes it possible to characterize properties as transitive, symmetric, functional, or inverse. Furthermore, OWL allows us to restrict properties both locally and globally: for example, it can be stated that a property must have values of a certain type when applied to instances of a certain class. It can also be expressed in OWL that a class is equivalent to another. OWL also provides the basic operators of set theory: intersection, union, complement, etc. OWL provides three related languages, each having a different level of expressiveness:
• OWL Lite: supports classification hierarchies and simple constraint features;
• OWL DL: provides the maximum expressiveness without losing computational completeness (all entailments are guaranteed to be computed) and decidability (all computations will finish in finite time) of reasoning systems;
• OWL Full: maximum expressiveness and the syntactic freedom of RDF, with no computational guarantees.
Each sublanguage is an extension of its simpler predecessor, both in what can be legally expressed and in what can be validly concluded [SWM04]. OWL is defined by a family of documents [OWL04].
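A minimal sketch of these constructs, assuming the Jena ontology API and a hypothetical namespace (neither is mandated by the thesis): it declares a class and a transitive property in an OWL DL model.

```java
import com.hp.hpl.jena.ontology.OntClass;
import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.ontology.TransitiveProperty;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class OwlExample {
    static final String NS = "http://example.org/unibo#"; // hypothetical namespace

    public static void main(String[] args) {
        // An OWL DL ontology model held in memory.
        OntModel m = ModelFactory.createOntologyModel(OntModelSpec.OWL_DL_MEM);

        // A class definition.
        OntClass department = m.createClass(NS + "Department");

        // A transitive property: if A isPartOf B and B isPartOf C,
        // a reasoner can conclude that A isPartOf C.
        TransitiveProperty isPartOf = m.createTransitiveProperty(NS + "isPartOf");

        m.write(System.out, "RDF/XML");
    }
}
```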
1.1.3 Topic Maps in a nutshell

The focus of Topic Maps is the conceptual organisation of information, primarily with a view to findability, but the technology has much wider applicability. A major difference with RDF is the greater focus on human semantics in Topic Maps, as opposed to the "machine semantics" (or automated reasoning) of RDF. Topic Maps consist of topics, each of which represents some thing of interest, known as a subject (the definition of subject is deliberately all-inclusive). Associations represent relationships between two or more topics; they have a type (which is a topic) and consist of a set of (role type, role player) pairs (where both elements are topics). Occurrences can be either simple property-value assignments for topics, or references to resources which contain information pertinent to a topic. Occurrences also have a type (which is a topic). In addition, topics may have one or more names (which may also be typed). As in RDF, topics can have identifying URIs attached. The difference is that each URI can be either a subject locator (which means that dereferencing it produces the information resource the topic represents) or a subject identifier (which when dereferenced produces an information resource describing the subject). Another difference is that a topic can have any number of identifying URIs of either type attached. There is a defined procedure for merging topic maps based on these identifiers. Further, any construct can be reified (associations, occurrences, roles, and names), and any construct (except roles) may have a scope, which is a set of topics representing the context in which the construct is valid. The technology stack consists of a data model [MG05], an XML interchange syntax [otTMAG01], a constraint language [MBN05] (comparable to RDFS/OWL, but focused on validation), and a query language [GBN05].

1.2 UML and ontology modeling

Software engineers use specific technologies in order to work out the design and development of applications. Among the available technologies suitable for this aim, the Unified Modeling Language (UML) [grod] is the de facto standard for the specification of software models. Furthermore, UML is the modeling language used in many research works related to ontology definition and development. Examples are [Cra01b, Cra01a], where UML diagrams are translated to RDF Schema documents. More specifically, [Cra01b, Cra01a] show that UML class


and object diagrams can have a direct correspondence to ontology concepts. In [Groa] a method for ontology engineering using UML as the modeling language is defined, and [Cha98] shows that RDF Schema is equivalent to a subset of the class model in UML. Furthermore, the Object Management Group, the standards organization maintaining the UML specification, is promoting research efforts in the Semantic Web field, and has recently published a Request for Proposal for the specification of the "Ontology Definition Metamodel" [Grob]. A revised submission document is already available at [ea05], where the metamodels for Topic Maps, OWL, and RDFS (among others) are specified, as well as their UML profiles.

1.2.1 UML and RDF Schema

Given an information domain, the corresponding ontology is a formal schema. In general, a schema is a collection of classes and properties, the relations between them, and the constraints on their interpretation. RDF, the general model for the Semantic Web, provides constructs to describe these elements. [Cha98] shows that RDF Schema is equivalent to a subset of the class model in UML. The equivalence can be stated through the comparison of the Directed Labeled Graph (DLG) representations of RDF Schema and UML class schemas: it can be shown that the RDF Schema DLG is isomorphic to a subgraph of the UML class schema DLG. Intuitively:
• UML and RDF classes map to each other;
• RDF Schema properties and UML attributes map to each other;
• RDF Schema does not support UML operations;
• UML associations are expressed as RDF Schema properties;
• reification in RDF Schema maps to UML association names and attributes.

1.3 General purpose and domain oriented search engines

Chapter 2 describes the development of a search engine for the University of Bologna. In this section I introduce the concept of a search engine and the elements that compose such a software system. Figure 2 depicts the general architecture of a search engine. Typically, a search engine is composed of:
• a crawler: addresses the task of picking up the documents on the Web, starting from a set of resources given as input;
• a storage manager: a central part of the search engine. It is accessed by the crawler in order to store the information picked up from the Web, by the indexer in order to construct the indexes, and by the query dispatcher when the significant documents related to a given query are identified;
• a page analysis module: the pages collected by the crawler are periodically analyzed by this module in order to extract significant information. The indexing is done concurrently;
• an indexer: performs the construction of the index. This is one of the most important components of the search engine;
• a ranking module: here the documents are associated with a weight that depends on their content. There are several document ranking techniques;
• a query engine: receives the users' queries and uses the lists created during the indexing in order to respond to them.
A minimal sketch of these components as Java interfaces is given after Figure 2.


Figure 2. The architecture of a search engine
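The following sketch expresses the components listed above as Java interfaces; all names and signatures are hypothetical and serve only to make the division of responsibilities explicit.

```java
import java.util.List;

// Hypothetical interfaces mirroring the components of Figure 2.
interface PageRepository {
    void store(String url, String content);
    String load(String url);
}

interface Crawler {
    // Picks up documents on the Web starting from the given seed resources.
    void crawl(List<String> seedUrls, PageRepository storage);
}

interface Indexer {
    // Constructs the index from the collected pages.
    void buildIndex(PageRepository storage);
}

interface RankingModule {
    // Associates a document with a weight that depends on its content.
    double weight(String documentUrl, String query);
}

interface QueryEngine {
    // Uses the lists created during indexing to answer a user query.
    List<String> answer(String query);
}
```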

There are several types of search engines. A main distinction can be drawn between general purpose search engines, like Google [Goo], Altavista [Alt], and AllTheWeb [All], and domain-specific search engines. In the context of this work I approach the problem of studying a specific domain of knowledge and constructing a search engine supporting that domain. The development of a domain-oriented search engine is characterized by a deep study of the domain of knowledge it has to support. Typically, a domain-oriented search engine runs on a limited set of Web domains and is provided as a service by a portal representing that domain of interest. This task is therefore characterized by:
• a limited network composed of the servers populating the considered Web domains;
• a specific ontology describing the domain of knowledge.
Given this, it is easy to understand that the more we know about the network, and the more the resources are annotated with semantic information, the better we are able to build a search engine capable of giving really interesting and meaningful results for users' queries. This approach cannot be followed for the development of general purpose search engines. Intuitively, a general purpose search engine could exploit domain-oriented search engines by relying upon an ontology defining cross relations among the domain ontologies of each domain of knowledge it considers. However, this discussion is outside the scope of this thesis.

2 Motivations and objectives

The general aim of my work is to contribute to the realization of the Semantic Web vision. The approach I have followed is twofold:
• studying and investigating Semantic Web issues theoretically, from a software engineering perspective;


• experimenting with existing tools in order to build Semantic Web applications while testing the presumed theories.
In the context of the above mentioned general goal I have identified a number of issues that this work addresses. Such issues can be described by the following research questions:
• Searching for the "right" information on the Web is often quite difficult. Most of the time the information a user is looking for is related to a specific domain of knowledge. General purpose search engines are useful services that give reasonable help, but users still feel as if they were searching for a needle in a haystack. How can we obtain more accurate results? Building a search engine upon a specific domain could be a way to guarantee more precise results for users belonging to that domain of knowledge. Furthermore, Semantic Web technologies can help in defining methods and techniques that provide search engines with semantic features.
• How does the Semantic Web affect the development of Web applications? In order to exploit Semantic Web benefits, Web-based applications should evolve so as to be able to manage and create semantic information. Furthermore, with the availability of ontologies a new generation of Web applications can be realized, say domain-aware applications: applications able to change their status depending on the evolution and changes of the ontology they rely on.
• How do we guarantee ontology interoperability in the presence of two alternative formalisms for representing Web ontologies? There are two suitable mechanisms for representing semantic information on the Web: RDF and Topic Maps. We need a set of mapping rules, and tools implementing those rules, in order to make the two formalisms interoperate easily.
In order to address these questions, in this work I present:
• the study of a specific domain, that of the University of Bologna (unibo). I studied the unibo domain in terms of the Web domains associated with it, the way they are physically distributed, the ontology representing the domain, and the typical users' needs. I studied all these aspects using the unibo Web portal [Uni], which provides a search engine among its services. I collaborated on the development of such a search engine, named Sub, and I am still working on its improvement through the exploitation of Semantic Web technologies. In chapter 2 I describe the development of Sub and present some experimental results, which show the success of the search engine in terms of its usage and users' satisfaction.
• the study of the application of Web ontologies as a device for designing Web sites. In particular, in chapter 3 I identify a new generation of Web sites, which I call domain-aware Web sites. A domain-aware Web site is able to capture the evolution of the domain of knowledge it represents and supports, by means of the evolution of the domain ontology lifecycle. The consequence is that it changes its status in order to remain consistent with the ontology. In order to demonstrate the applicability of this approach, in chapter 4 I present the design of an application completely based on Web ontologies, which is able to deploy domain-aware Web sites based on the wiki [Web] platform.
• a solution for RDF and Topic Maps interoperability. In particular, I define a set of guidelines for the RDF and Topic Maps mapping, and a formal theory, based on first order logic with identity, for their mapping.

3 Thesis outline

This thesis is organized as follows. Chapter 2 presents the development of a search engine for the University of Bologna (unibo), which is currently provided as a service of the unibo Web portal [Uni]; furthermore, a metric is defined for the validation of the results of the search engine, based


on the comparison of the set of results provided by the search engine for each query against the set of desired results for those queries, called the Query Test Suite. Chapter 3 presents a study of the application of Web ontologies as a device for designing Web sites supporting a specific domain of knowledge, while chapter 4 shows a concrete application of this approach. In chapter 5 I approach the problem of the interoperability between Web ontologies, focusing on the case of RDF and Topic Maps. More specifically, I present a survey of existing proposals, including a first solution I proposed in the past, when neither RDF nor Topic Maps had a formal specification. In chapter 6 I propose a solution consisting of a set of guidelines for obtaining interoperability between the two formalisms. The approach is based on the formal definition of a theory for the mapping, represented in first order logic with identity, and on the definition of a vocabulary which can be used to guide the translation of documents written in one formalism to the other, and vice versa. In chapter 7 I describe the development of a tool which performs the conversion in both directions, implementing the proposed approach. Finally, in chapter 8 I summarize the contributions of this work, discussing possible future evolutions.


Chapter 2

Domain-Oriented Search Engine

During 2004 I started to collaborate with the Web portal of the University of Bologna (Uniboweb). The project I participated in concerned the development of the new search engine to be made available to the users of the University of Bologna (unibo) domain through the unibo Web portal [Uni]. The first phase of the project consisted of evaluating the shortcomings of the legacy search engine, with the aim of designing and developing the new one. This phase led to the first deliverable: the release of the Sub (Searching the University of Bologna) search engine, which is now available on the above mentioned Web portal. Currently, I am collaborating on the second phase of the Sub project, which relates to Sub maintenance and the development of a new prototype of the search engine which supports Semantic Web technologies. In particular, in order to develop the Semantic Web-based prototype of Sub, research efforts have been focused on the definition of:
• an OWL-based ontology for the unibo domain;
• a new crawler able to collect OWL documents as well as the already supported document formats;
• a new indexing mechanism and ranking algorithm, which exploit semantic information.
The current version of Sub was developed with the Semantic Web in mind. In fact, I defined a simple vocabulary for unibo, exploiting the Dublin Core [Duba] features, which allows users to add semantic information to their documents in order to make Sub able to better index and rank them. In fact, Sub's indexing and ranking policies are more accurate when documents are enriched with such semantic metadata. As shown later in this chapter, the approach taken led to a meaningful improvement of the searching service, both in the usage of the search engine and in the positioning of interesting resources among the results. My current research work has the aim of making the next Sub release strongly Semantic Web-based. In this chapter, I describe the analysis of the old search engine's shortcomings and then describe the current version of Sub, which is one of the contributions of this thesis.

1 Sub: a Domain-Oriented Search Engine

The Uniboweb [Uni] is a portal whose user community has grown very fast in the last three years. This has led to the need to improve the services it provides, the most important and most used of which is the search engine. The data on the usage of the Uniboweb search engine were not "brilliant". For this reason the office responsible for the Uniboweb decided to study and implement a new search engine. At that time my collaboration on the design of the new search engine, which we called Sub, began. It has been deployed since November 2004, currently runs on the unibo.it domain, and is available through the unibo Web portal [Uni].


The domain unibo.it represents the University of Bologna (unibo) on the Web and serves its community of users: students, professors, staff, etc. This academic domain is composed of thousands of autonomous Web sites, physically located on hundreds of different servers, which are geographically distributed across different towns (the University of Bologna is a multicampus university). It is probably the largest academic domain in Italy, both in terms of servers and in terms of pages. The authors of the pages composing these Web sites have different levels of familiarity with Web technologies. The Uniboweb office manages the site www.unibo.it [Uni], which is the official Web portal of the university. The pages of the Uniboweb are created and organized with the support of a Content Management System (CMS). Let me describe the situation of the Uniboweb before the realization of Sub.

2 The ‘old’ search engine

The old search engine was actually composed of two different search engines. The portal had an academic license for the use of a proprietary search engine (i.e., Altavista [Alt]) that allowed the indexing of a limited number of resources (up to 250,000). The second search engine was an ad-hoc one. It ran only on the www.unibo.it site, that is, the pages composing the university Web portal, created using the CMS. This approach was chosen because the CMS manages some important metadata for the documents that, with Altavista, would be lost when transferring the documents from the CMS to the Web server. In fact, the old ad-hoc engine accessed the CMS directly to retrieve the documents, so that the metadata were maintained. The rest of the unibo domain was searched with the Altavista search engine. In November 2004 the unibo domain was composed of about 700,000 resources, as we estimated using Google [Goo] (query site:unibo.it). We can reasonably think that Google overestimates the dimension of the domain, but it is also reasonable to assume that the unibo.it Web domain comprised at least 500,000 resources at that time. Hence, the Altavista search engine was not sufficient, because of the license limit mentioned above. However, there was also another major problem: the merging of the results. The two search engines used two different ranking algorithms, and the results of each were simply concatenated, giving priority to the results of the ad-hoc search engine. This was due to the fact that the ranking criterion of Altavista was hidden, hence neither accessible nor known. For the reasons mentioned above, the search engine resulting from composing the two was completely unsatisfactory. Unfortunately, there was nothing that could be reused in order to realize the new search engine.

3 Requirements and design of Sub

The analysis conducted on the old search engine pointed out the problems described above and guided the definition of the requirements for the new search engine. The search engine (Sub from now on) is a main project of the University; hence it is fundamental to be able to make it evolvable and maintainable without relying on external assistance. In fact, one of our main goals is to create internal know-how about Sub. We faced a twofold situation: on one hand, we needed a working search engine as soon as possible, since the availability of such a service was at that moment the task with the highest priority; on the other hand, our aim was to build a well designed search engine which would become the subject of research efforts and of experimentation with new technologies such as those of the Semantic Web. In order to meet these two complementary aspects we chartered two working groups, named the research group and the operations group. Furthermore, the activities of the project were divided into two main phases. In the first phase of the project these two groups worked as one; the aim was to realize Sub and make it available on the unibo Web portal. After the first phase, each group took on its own responsibilities. The second phase is longer than the first one, and iterative as well; it is characterized by the efforts for the maintenance of Sub and for the development of prototypes aimed at extending Sub, mainly based on Semantic Web technologies. The responsibility of the operations group is to maintain Sub, validate and tune it,


and test prototypes provided by the research group in order to select new features to put into production. The research group has the responsibility of experimenting with new and innovative technologies in order to realize prototypes and propose them to the operations group for putting them into production.

A single search engine: the first requirement we defined is that of having a single search engine. Such a search engine must be able to access all the resources via the Web, including those belonging to the unibo Web portal. In particular, all meta-information attached to the documents within the portal should be correctly managed by the crawler and the indexer, in order to be exploited when searching for information.

Open source software: in order to be fast in the realization of Sub we had two possible ways: either using a proprietary search engine with a license allowing the indexing of a larger number of resources, or building our own search engine relying on suitable open source software, if any. The technical drawbacks caused by the use of a proprietary search engine (i.e., Altavista [Alt]), namely the impossibility of modifying the code or even simply knowing the ranking criterion, led us to decide on adopting open source software. Furthermore, the adoption of open source software also allows us to undertake research activities, as well as to "tune" Sub based on a process of validation of the results.

Semantic approach: the search engine we were going to realize was conceived with the Semantic Web in mind. Nevertheless, having a semantic search engine as a first deployment was too challenging, due to the short time constraint. The decision was to define a simple vocabulary for the unibo domain in order to annotate the documents with a Dublin Core-like mechanism. The vocabulary is a built-in feature of the first release of the search engine, which treats such annotations (if any) in a proper way. More complex semantic approaches are currently being investigated.

A method for validation: the definition of a method for periodically monitoring the results of Sub in order to validate it is one of the aims of the project. Such a method allows us to tune the search engine.

4 The architecture

In accordance with the above mentioned high level requirements, a mechanism was implemented that allows us to transfer the CMS documents to the Web server while preserving the metadata. This makes it possible to treat all the resources in the same way, accessing all of them from the Web. Furthermore, the metadata attached to the documents are properly managed by the indexer in order to be exploited during the searching and ranking activities. The adopted open source software is the Apache Lucene search engine [Luc]. Lucene is a full-featured text search engine library written entirely in Java. In fact, Lucene proved to be the most mature open source search engine project among those available, and even though it is based on Java it allows us to guarantee reasonable performance. In this section I describe the functioning of Sub. Figure 1 depicts its architecture. The search engine consists of three main components: the crawler, the indexer, and the searcher. The crawler, which was built from scratch, has the responsibility of traversing the unibo Web domain in order to collect all the resources populating it. A set of resources representing the starting points for the crawler is specified in a configuration file. All the collected resources are then passed to the indexer, which is based on the Apache Lucene library [Luc]. In order to handle the possible metadata embedded in the HTML documents, the crawler parses the tags in the document head and passes all the metadata it finds together with the tokens which derive from the text parsing. The indexer performs a reverse lookup (this mechanism is explained in section 4.2) and creates the local database of resources on which the searcher relies. The searcher is based on Lucene as well and has the responsibility of selecting all documents related to a given query. Such documents are then sorted by means of a ranking algorithm in order to be returned to the user. There is also an additional component which transforms the user queries in order to make them computable by the searcher.


Figure 1. The architecture of Sub

Figure 2 shows the class diagram of Sub's crawler.

4.1 The crawler

BlackWidow is the class containing the main method; hence it creates and starts the crawler. It has the responsibility of initializing all the structures needed for the crawling activity (e.g., the list of collected resources, the list of links to be visited, etc.). This class also defines an inner class named Finalizer, used for the correct termination of the crawling: it kills all the open threads in case of manual termination of the crawler. The class LinksQueue represents the list of links (pointers to resources) that have to be visited (not visited yet). It grows during the crawling: for each parsed resource, the pointers it holds to other resources are added. The elements of the LinksQueue are represented by the class Link. The ResourceList is a list whose elements represent the already visited and processed resources; such elements are modeled by the class Resource. The class Crawler has the responsibility of traversing both the list of already visited resources and the list of links to be visited. It performs a set of checks using proper filters in order to decide whether a new resource has to be added to the list or not (e.g., it checks whether the resource's URL points to a resource already collected). Furthermore, it parses all resources and adds to the link list all the resources which are connected to (pointed by) the one currently parsed: for example, it parses HTML documents and extracts the values of the tags that point to other resources in order to add them to the LinksQueue. The class CrawlerStatistics stores statistical data, e.g., the number of documents accepted and rejected, their types, the crawling time, and so on.

Figure 2. Sub's crawler - UML class diagram
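A minimal sketch of the crawling loop described above; the structure names echo the diagram, but the bodies and helper methods are hypothetical reconstructions, not Sub's actual code.

```java
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.Set;

// A minimal sketch of Sub's crawling loop; the real classes
// (BlackWidow, LinksQueue, ResourceList, ...) are richer than this.
public class CrawlerSketch {
    private final Queue<String> linksQueue = new LinkedList<String>(); // links to visit
    private final Set<String> resourceList = new HashSet<String>();    // processed URLs

    public void crawl(Iterable<String> seeds) {
        for (String seed : seeds) {
            linksQueue.add(seed);
        }
        while (!linksQueue.isEmpty()) {
            String url = linksQueue.poll();
            // Filter: skip resources that have already been collected.
            if (!resourceList.add(url)) {
                continue;
            }
            // Fetch the resource, then enqueue every outgoing link.
            String content = fetch(url);
            for (String link : extractLinks(content)) {
                if (!resourceList.contains(link)) {
                    linksQueue.add(link);
                }
            }
        }
    }

    // Hypothetical helpers: HTTP retrieval and HTML link extraction.
    private String fetch(String url) { return ""; /* HTTP GET omitted */ }
    private Iterable<String> extractLinks(String html) {
        return new LinkedList<String>(); /* tag parsing omitted */
    }
}
```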

4.2 The indexer

The index of Sub is created by using the Lucene [Luc] library features aimed at this task. In particular, all resources collected by the crawler are passed to the indexer classes, which perform reverse lookup indexing. A reverse lookup index is one which, for every word, stores the list of documents in which the word appears (as opposed to a normal index, which maps every document to all the words it contains). Hence, the index is implemented as a list of terms, each of which is associated with a list of documents. A simplification of the result of this mechanism

is depicted in Figure 3, where three documents a, b, and c are indexed using the reverse lookup on the basis of the terms they contain.

Figure 3. Reverse lookup index
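The reverse lookup mechanism can be illustrated with a toy in-memory implementation; this is a sketch of the concept, not Sub's Lucene-based index.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// A toy inverted (reverse lookup) index: term -> list of document ids.
public class InvertedIndex {
    private final Map<String, List<String>> index = new HashMap<String, List<String>>();

    public void addDocument(String docId, String text) {
        for (String term : text.toLowerCase().split("\\s+")) {
            List<String> postings = index.get(term);
            if (postings == null) {
                postings = new ArrayList<String>();
                index.put(term, postings);
            }
            if (!postings.contains(docId)) {
                postings.add(docId);
            }
        }
    }

    public List<String> lookup(String term) {
        List<String> postings = index.get(term.toLowerCase());
        return postings != null ? postings : new ArrayList<String>();
    }

    public static void main(String[] args) {
        InvertedIndex idx = new InvertedIndex();
        // The three documents of Figure 3, heavily abbreviated.
        idx.addDocument("a", "student class topic");
        idx.addDocument("b", "class professor office");
        idx.addDocument("c", "topic office book");
        System.out.println(idx.lookup("topic")); // prints [a, c]
    }
}
```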

In addition to the terms contained in the documents' text, Sub looks at the presence of a set of metadata information in order to create the index of the documents. Such metadata are defined in a specific unibo vocabulary, which is described in the next section.

4.3 The vocabulary and its usage

As mentioned above, I defined a small set of metadata fields that the authors of HTML documents published on the unibo Web domain can use in order to help Sub in indexing the resource. Currently Sub creates one index for each of a set of eight metadata fields. In the definition of this small vocabulary I decided to use the Dublin Core Metadata Initiative (DCMI) [Duba] as a reference. The DCMI is an organization dedicated to promoting the widespread adoption of interoperable metadata standards and developing specialized metadata vocabularies for describing resources that enable more intelligent information discovery systems. In particular, [Dubb] defines all the metadata terms maintained by the DCMI. The metadata fields for the unibo documents are the following:

title: A name given to the resource. Typically, a title will be a name by which the resource is formally known. This property is equivalent to the http://purl.org/dc/elements/1.1/title element.

description: An account of the content of the resource. A description may include, but is not limited to: an abstract, a table of contents, a reference to a graphical representation of the content, or a free-text account of the content. This property is equivalent to the http://purl.org/dc/elements/1.1/description element.

language: A language of the intellectual content of the resource. Recommended best practice is to use RFC 3066 [Lan], which, in conjunction with ISO 639 [Lib], defines two- and three-letter primary language tags with optional subtags. Examples include "en" or "eng" for English,


"akk" for Akkadian, and "en-GB" for English as used in the United Kingdom. This property is equivalent to the http://purl.org/dc/terms/language element.

date: The date of creation of the resource. This property is equivalent to the http://purl.org/dc/terms/created element.

target: A class of entity for whom the resource is intended or useful. This property is equivalent to the http://purl.org/dc/terms/audience element.

author: An entity responsible for making contributions to the content of the resource. Examples include a person, an organisation, or a service. This property is equivalent to the http://purl.org/dc/elements/1.1/contributor element.

areaContent: The main subject (or subjects) to which the content of the resource is related. It is useful for categorizing the document according to its content. The recommended best practice is to choose the value(s) among four possible values: Teaching, Research, Administration, and Personal.

resourceType: The category of the owner and publisher of the document. Three possible values are defined for it: Person, Institution, and Magazine.

As well as the documents' textual content, metadata are handled by the reverse lookup mechanism. In fact, Sub creates a new index for each of the metadata fields described above. Each index is a list of lists, where each inner list represents a value for the field and is linked to all documents having that value for the field. The main goal of this approach is to relate documents with words and concepts that might not appear in the text. Furthermore, areaContent and resourceType are useful for refining a search by specifying contexts. Figure 4 shows an example of four indexes, for the areaContent, resourceType, target, and author metadata fields.
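A sketch of how such fields might be indexed with Lucene, assuming the Lucene 1.4-era API; the field values, and the choice to index every field as analyzed text, are assumptions made for the example.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class MetadataIndexing {
    public static void main(String[] args) throws Exception {
        IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);

        Document doc = new Document();
        // Full text of the page: stored, indexed, and tokenized.
        doc.add(Field.Text("contents", "Course timetable for the degree in informatics"));
        // unibo vocabulary fields (values invented for the example).
        doc.add(Field.Text("title", "Course timetable"));
        doc.add(Field.Text("areaContent", "Teaching"));
        doc.add(Field.Text("resourceType", "Institution"));
        doc.add(Field.Text("target", "student"));
        doc.add(Field.Text("author", "Valentina Presutti"));

        writer.addDocument(doc);
        writer.close();
    }
}
```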

Figure 4. Indexes of metadata values


4.4 The searcher

The searcher function of Sub is implemented by exploiting the Lucene library facilities aimed at that task. First, Sub performs a transformation of the user query: it applies stemming to each single word, cleans the query of all terms considered meaningless (i.e., stop words), and then represents it as a boolean formula. The result of this transformation is passed to the searcher, which performs the search on all the indexes. The relevance (i.e., rank) of each retrieved document is calculated by means of the Lucene ranking formula, reported here as (1) and taken from [J.]:

$$\mathrm{score}_d = \mathrm{coord}_{qd} \sum_{t} \frac{\mathrm{tf}_q \cdot \mathrm{idf}_t}{\mathrm{norm}_q} \cdot \frac{\mathrm{tf}_d \cdot \mathrm{idf}_t}{\mathrm{norm}_{dt}} \cdot \mathrm{boost}_t \qquad (1)$$

where:
• $\mathrm{score}_d$: the score for document $d$;
• $\sum_t$: the sum over all terms $t$;
• $\mathrm{tf}_q$: the square root of the frequency of $t$ in the query;
• $\mathrm{tf}_d$: the square root of the frequency of $t$ in $d$;
• $\mathrm{idf}_t$: $\log\!\left(\frac{\mathrm{numDocs}}{\mathrm{docFreq}_t + 1}\right) + 1.0$;
• $\mathrm{numDocs}$: the number of documents in the index;
• $\mathrm{docFreq}_t$: the number of documents containing $t$;
• $\mathrm{norm}_q$: $\sqrt{\sum_t (\mathrm{tf}_q \cdot \mathrm{idf}_t)^2}$;
• $\mathrm{norm}_{dt}$: the square root of the number of tokens (i.e., terms) in $d$ in the same field as $t$;
• $\mathrm{boost}_t$: the user-specified boost for term $t$;
• $\mathrm{coord}_{qd}$: the number of terms appearing in both the query and the document, over the number of terms in the query. The coordination factor gives an AND-like boost to documents that contain, e.g., all three terms of a three-word query over those that contain just two of the words.

All metadata fields are associated with a value for the boost factor; such values are specified in a configuration file. When a query matches the value of a metadata field of a document, the calculation of the rank for that document is affected through the boost factor.
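A sketch of the searching side, again assuming the Lucene 1.4-era API: the query-time boost (^2.0, an arbitrary value) on a metadata field illustrates how the boost factor influences the rank calculation.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchExample {
    public static void main(String[] args) throws Exception {
        IndexSearcher searcher = new IndexSearcher("index");

        // "timetable" is searched in the default field; documents whose
        // areaContent metadata matches Teaching are boosted by a factor of 2.
        Query query = QueryParser.parse(
                "timetable areaContent:Teaching^2.0",
                "contents",
                new StandardAnalyzer());

        Hits hits = searcher.search(query);
        for (int i = 0; i < hits.length(); i++) {
            System.out.println(hits.score(i) + "\t" + hits.doc(i).get("title"));
        }
        searcher.close();
    }
}
```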

5 The validation of the results

The Uniboweb office provided us with a document containing the results they expected for a set of queries. We used this document in order to fix some bugs and make Sub stable enough for deployment. The results now given by the engine meet the requirements with sufficient precision; nevertheless, we want to study how to enhance it by exploiting Semantic Web based technologies, as underlined in chapter 8. Furthermore, we started an activity of "tuning" the search engine, with the aim of improving the main functions of Sub (i.e., crawling, indexing, and searching) and concurrently defining a method for performing the validation of the results [PDV+05]. Section 5.1 illustrates the main principles the validation activity is based on. Based on the definition of these principles, I have been studying and defining the method for the validation of the results that can be used for the maintenance of Sub.


5.1 Validation principles

The validation activity, in whatever domain, needs to be based on a number of well-defined principles that allow testers to state whether the subject of the evaluation satisfies its requirements. This also applies to search engine validation. Such principles can be of two types: quantitative and qualitative. Quantitative principles allow us to estimate objective requirements, which we term indicators, whose values are enumerable by nature. An example of such a principle is to evaluate the distribution of the results for queries on a given argument of interest, based on the dimension of the domains that a search engine indexes. Quantitative principles are not enough to state whether the results of a search engine satisfy its requirements: they do not give any support for judging non-functional aspects, e.g., the relevance of a Web page with regard to a specific argument of interest. Qualitative principles relate to non-functional aspects. In general, in order to judge such qualitative principles, human cognition of knowledge is needed. What we are trying to do with our method is to combine the use of indicators and the user's judgment with a mechanism based on metadata information for calculating the value of qualitative indicators, i.e., a metric for non-functional aspects. The next section sketches out our method.

5.2 A method for validating a search engine's results

When a customer commissions a search engine, he/she has to provide a list of queries, each associated with the expected results: we call this a Query Test Suite (QTS). The QTS represents the desired behavior of the search engine at a given time. Our aim is to define a number of properties that the QTS must satisfy. These properties allow us to compare the QTS with the actual search engine results. Having the QTS during the analysis phase of the search engine allows us to better understand the client's needs and the domain structure. It is important to notice that the QTS changes in time, because the domain contents are dynamic, but the properties must be satisfied anyway; the changes do not affect the development of the search engine but only the contents of the QTS. Our idea consists of defining a metric that can be used to estimate non-functional aspects of the algorithm used in ranking a search engine's results. In fact, this metric is a guide for the search engine tuning activity after its deployment, and can be used to validate the search engine results by comparing them to the QTS. First of all we propose a formalization to define a model for the representation of the main concepts, such as the ranking algorithm, the QTS, the search engine's actual results, and the difference function between the last two (the metric). In order to model the concepts we need for defining the metric, we use the mathematical theory of sets. The theory of sets plays an important role in mathematical foundations and is currently placed in the context of logic theory. With the concept of set we identify all groups and collections of elements, regardless of their nature. A set is usually identified with an uppercase letter (A, B, C, X, Y, Z, and so on), and it must be determined unequivocally.
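As a toy preview of the comparison developed formally in the following sections, a QTS can be represented as a map from queries to expected results and checked against the engine's actual output; all names here are hypothetical.

```java
import java.util.List;
import java.util.Map;

// A toy comparison of actual results against a Query Test Suite:
// for each query, the fraction of expected URLs the engine returned.
public class QtsCheck {

    public static double overlap(List<String> actual, List<String> expected) {
        if (actual == null) {
            return 0.0;
        }
        if (expected.isEmpty()) {
            return 1.0;
        }
        int found = 0;
        for (String url : expected) {
            if (actual.contains(url)) {
                found++;
            }
        }
        return (double) found / expected.size();
    }

    // qts maps each query to its expected results; engineResults maps
    // each query to what the search engine actually returned.
    public static void report(Map<String, List<String>> qts,
                              Map<String, List<String>> engineResults) {
        for (Map.Entry<String, List<String>> entry : qts.entrySet()) {
            double score = overlap(engineResults.get(entry.getKey()), entry.getValue());
            System.out.println(entry.getKey() + "\t" + score);
        }
    }
}
```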

6 Precision and recall

The basic measures used in evaluating search strategies are precision and recall. They are typical metrics in the Information Retrieval domain. Section 7 describes a metric I defined in the context of this work that can be used in conjunction with these two measures. In this section I briefly describe the basic concepts of precision and recall, which inspired the definition of the other two measures (i.e., CompareTo and Distance). Recall is a measure of the completeness of the list of documents returned for a query, and precision is a measure of the usefulness of such a list [K.]. Recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents in the index database. It is usually expressed as a percentage. It measures how well the engine performs in finding relevant documents. Precision is the ratio of the number of relevant documents retrieved to the total number of relevant and irrelevant documents retrieved. It is usually expressed as a percentage as well. It measures how well the engine performs in not returning non-relevant documents. Recall and precision are measures for the entire index. They do not account for the quality of the ranking of the documents in the list. The requirement of users is to have the retrieved documents


ranked according to their relevance to the query instead of just being returned as a set. The most relevant documents must be among the top few documents returned for a query. It is possible to use the recall and precision metrics, with some variants, to evaluate the quality of the results of a search engine taking into account also the position of documents in the retrieved list. The two metrics presented in the following sections implement this approach.
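As a minimal illustration of the two measures (my own sketch, not code from Sub), precision and recall for a single query can be computed from the set of relevant documents and the retrieved list:

    import java.util.*;

    class PrecisionRecall {
        /** relevant: all relevant documents in the index; retrieved: the list returned for the query. */
        static double precision(Set<String> relevant, List<String> retrieved) {
            if (retrieved.isEmpty()) return 0.0;
            long hits = retrieved.stream().filter(relevant::contains).count();
            return (double) hits / retrieved.size();   // relevant retrieved / all retrieved
        }

        static double recall(Set<String> relevant, List<String> retrieved) {
            if (relevant.isEmpty()) return 0.0;
            long hits = retrieved.stream().filter(relevant::contains).count();
            return (double) hits / relevant.size();    // relevant retrieved / all relevant
        }

        public static void main(String[] args) {
            Set<String> relevant = Set.of("u1", "u2", "u3", "u4");
            List<String> retrieved = List.of("u1", "u5", "u2");
            System.out.printf("precision = %.2f, recall = %.2f%n",
                    precision(relevant, retrieved), recall(relevant, retrieved));
            // precision = 2/3, recall = 2/4
        }
    }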

7 A metric for the results validation

In this section I propose two methods for the validation of the results of a search engine. One is based on a function between two sets, named CompareTo; the other is based on the consistency of two sets, i.e., their ordering and content. Consider the two functions Search (S : Q → R) and Search model (S∗ : Q → A), which map a query to the search engine’s actual and desired results, respectively.

Figure 5. Comparison of functions S and S∗

We are going to compare the set of results and the set of desired results.

Definition 2.1 (Set Result, R). Given U (the document universe), the set of URLs indexed by a search engine M, U = {u1, u2, ..., un} where n ∈ N, the set Result of all possible indexed URLs returned by M after a query is

    RM = P+(U)

where P+(U) denotes the set of non-empty parts (subsets) of U.

Definition 2.2 (Set of desired results, A). Given R, the set Result of all possible indexed URLs returned by a search engine after a query, the set of desired results A is defined as the set of relevant results (URLs) for that query. Furthermore, the following holds:

    A ⊆ R


This inclusion is only partly true. Actually, it is more correct to say

    A ⊆ U

because the set of desired results is

    A = A′ + A″

where:
• A′ is the set of desired URLs obtained after the searching and ranking activities;
• A″ is the set of desired URLs which are not collected by the crawler, or which are wrongly excluded after the searching and ranking activities.

If the Indexer and the Searcher were ideal, then the original assumption would hold, that is, A ⊆ R.

8 Method 1 - The function CompareTo

The metric I use for validating the results of a search engine is based on an equivalence relation between two sets.

Definition 2.3 (Weight function). The weight function Wpos(x) is the function that relates the rank of a URL to the position assigned to that URL in a given ordered set:

    Wpos(x) = rank(x) / pos,  ∀x ∈ A, ∀pos ≤ K    (2)

where:
• x is an element of the set A;
• rank(x) is the score of the document x with regard to a given query;
• pos is the position of x in the list of documents returned for the given query;
• K ∈ N is the maximum position which is considered meaningful for the results.

Now I define a relation between sets by exploiting the weight function.

Definition 2.4 (Relation CompareTo). CompareTo (CT) is the relation that relates elements of A with elements of R having equal weight,

    CT : A → R,

by means of the function Wpos(x) previously defined. That is:

    CT = {(x, y) | x ∈ A, y ∈ R, Wi(x) = Wi(y), for i = 1..K}    (3)
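The following sketch shows one possible reading of Definitions 2.3 and 2.4 (my own illustration and assumptions: desired and actual results are compared position by position up to K, the rank function is supplied externally, and the floating-point weights are compared with a small tolerance):

    import java.util.*;
    import java.util.function.ToDoubleFunction;

    class CompareToRelation {
        /** Pairs (x, y) of desired/actual results whose weights coincide at positions 1..K. */
        static List<String[]> relation(List<String> desired, List<String> actual,
                                       ToDoubleFunction<String> rank, int k) {
            List<String[]> ct = new ArrayList<>();
            int limit = Math.min(k, Math.min(desired.size(), actual.size()));
            for (int pos = 1; pos <= limit; pos++) {            // positions are 1-based
                String x = desired.get(pos - 1), y = actual.get(pos - 1);
                double wx = rank.applyAsDouble(x) / pos;        // W_pos(x) = rank(x) / pos
                double wy = rank.applyAsDouble(y) / pos;
                if (Math.abs(wx - wy) < 1e-9) ct.add(new String[] { x, y });
            }
            return ct;
        }
    }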

8.1 Comments

The method described above is conceived to evaluate quantitatively whether the elements of a set are similar to those of another set. Such evaluation is performed element by element. If the two sets are represented graphically, it can be noticed that for all the points we can draw on the grid diagonal we have the desired results. Figure 6 highlights the elements which verify the relation CT. Intuitively, the weight of the element d is such that it does not satisfy the similarity, so it does not appear in the grid.


Figure 6. An example of graphical representation

9 Method 2 - Distance between sets

A second approach is that of comparing two sets by means of a concept of distance. Two sets have a distance equal to zero if they are equal and their elements occupy the same positions. We want to evaluate the distance between the sets A and R, defined above.

9.1 Distance

The distance is a relation between the set of desired results and the set of actual results related to a query, that is, d(A, R). How can we relate them using the concept of distance? It is important to remember that these sets are ordered.

9.2 Some useful definitions

We need the concepts of order and content in order to define the concept of distance.

Definition 2.5 (Content, C). The content is the number of elements that A and R have in common, that is, the cardinality of the intersection between the two sets:

    C = #(A ∩ R).    (4)

We consider K ∈ N, which is the highest position we want to obtain for the desired URLs within the ranked results of a given query (i.e., the maximum number of query results that are “interesting”). Hence, the content of our sets is limited to the first K elements: C ≤ K. The concept of order can be defined by exploiting the property of the sets of being ordered. We use the length of a permutation in order to define the concept of degree of ordering.

Definition 2.6 (Permutation). A permutation is a way of combining n elements by swapping their positions. A permutation of a set X is defined as a bijective function:

    f : X → X


Figure 7. Intersection

The number of permutations of n elements is the factorial of n:

    n! = n · (n − 1) · (n − 2) · ... · 1

Permutations are represented as follows; each element of the first row is related to the corresponding one on the second row:

    A = ( 1 2 3 4
          3 1 2 4 )

In this example, the element 1 is associated to the element 3, the element 2 to the element 1, and so on. The constraints we defined result in the second row being composed of elements which are all different from each other. A permutation can also be thought of as a “movement”: it identifies how to move an ordered disposition. For example, consider the list

    c a b d

and apply the permutation shown in matrix A. The new list is now

    a b c d

Hence, the permutation of matrix A is the one that has the effect of sorting the list.

Definition 2.7 (Length of a permutation, ℓ). We can represent a permutation using a cycle. A cycle is a list of identifiers where each identifier in the cycle is associated to the following one. The length of a permutation is the number of steps performed in the cycle. For example, the cycle

    (1 2 3 5)

represents the permutation which moves 1 to 2, 2 to 3, and so on, up to 5 to 1. Two cycles are disjoint if they do not have elements in common. For example, (1 2 3) and (4 5) are disjoint, but (1 2 3) and (1 2 4) are not. In order to represent the composition of permutations with cycles, we write the cycles in sequence. In order to calculate the permutation resulting from the composition of a sequence of cycles, we “trace” the destiny of each element. For example:

    (123)(135)(24) = ( 1 2 3 4 5
                       4 5 3 2 1 )

We start from the element 1: the first cycle moves 1 to 2, the second does not affect 2, the third moves 2 to 4; hence, the three cycles move 1 to 4. The first cycle moves 2 to 3, the second moves 3


to 5, the third does not affect 5; hence, the cycles move 2 to 5. And so on. At the end a consistency check is needed: the elements in the second row must all be different. A cycle is represented with the following notation: (i1, ..., ik), where i1 < ih, ∀h = 2, ..., k. In order to understand how a cycle Y = (i1, ..., ik) acts, we can draw the k integers i1, ..., ik on a circle, as shown in Figure 8. In this way we can notice that Y moves each integer along the circle to the next integer, following the direction of the arrow.

Figure 8. Cycle of permutation

Theorem 2.1. A permutation can be represented as a composition of disjoint cycles.

In order to represent a permutation in terms of cycles, we “follow” an element (chosen at random) until we find a cycle. For example, consider matrix A: we have that 1 moves to 3, 3 moves to 2, and 2 moves to 1; hence, the first cycle is (1 3 2). What about 4? It moves but in the end it comes back to 4, so the cycle in this case has length equal to 1. Cycles with length equal to 1 do not have an explicit representation. Hence, the permutation of matrix A is represented by (1 3 2).

9.3 Definition of d(A, R)

Now we have the notions for introducing the second parameter useful for the validation of results. Such parameter is the order of the elements in the set. I want to use these concepts in order to define a concept of distance between two sets. We need a relation that assigns a small value to the distance when C is very big, that is, when the two sets are very similar considering also the ordering of the elements. We define the concept of distance as follows:

    d(A, R) = ℓ + (N − C)    (5)

with ℓ = length of the permutation, C = number of elements that A and R have in common, and N = number of elements of A. In this way we have the distance between A and R as a relation between the ordering and the number of elements that the two sets have in common. This means that, of the set of desired results and that of the actual results, we consider only the intersection set and the ordering of its elements. It has to be noticed that the smaller the cycles of the permutation are, the more similarly the elements are ordered; hence, the closer, i.e., the more similar, the two sets are. Conceptually this notion of distance is simple; Figure 9 shows an example. The x-axis represents the elements of the set A of the desired results, while the y-axis represents the elements of the set R of the actual results. The distance is a function we need to minimize:

    min(d(A, R))    (6)
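To make Theorem 2.1 concrete, the following sketch (my own illustration) decomposes a permutation, given in the two-row form used above, into its disjoint cycles; applied to the matrix A it yields the single non-trivial cycle (1 3 2):

    import java.util.*;

    class Cycles {
        /** perm[i] is the image of element i+1; e.g. A = (3,1,2,4) maps 1->3, 2->1, 3->2, 4->4. */
        static List<List<Integer>> disjointCycles(int[] perm) {
            List<List<Integer>> cycles = new ArrayList<>();
            boolean[] seen = new boolean[perm.length + 1];
            for (int start = 1; start <= perm.length; start++) {
                if (seen[start]) continue;
                List<Integer> cycle = new ArrayList<>();
                for (int cur = start; !seen[cur]; cur = perm[cur - 1]) {
                    seen[cur] = true;
                    cycle.add(cur);
                }
                if (cycle.size() > 1) cycles.add(cycle); // length-1 cycles are left implicit
            }
            return cycles;
        }

        public static void main(String[] args) {
            int[] a = {3, 1, 2, 4};                      // the matrix A from the text
            System.out.println(disjointCycles(a));       // [[1, 3, 2]] -> the cycle (1 3 2)
        }
    }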


Figure 9. Example of difference between two sets

In fact, the function d returns a set of the following type:

    D = {d1, d2, d3, ...}

which contains the distances for all the queries. The corresponding set of queries is

    Q = {q1, q2, q3, ...}

which refers to the queries performed by the user. Figure 9 depicts the set D as the set of distances of the points from the main diagonal. With d = 0 we refer to the points of the diagonal, which belong to both sets. Figure 9 shows that only three queries satisfy the user needs (set b) and that the query corresponding to the point “a” is better satisfied than the one corresponding to the point “c”.
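Under one possible reading of equation (5) (a sketch under my own assumptions: the permutation is the one that reorders the common elements of R into their order in A, and ℓ is the total length of its non-trivial cycles), the distance can be computed from the two ranked lists as follows:

    import java.util.*;

    class Distance {
        /** d(A, R) = l + (N - C): l = length of the permutation that reorders the
         *  common elements of R into their order in A; C = |A ∩ R|; N = |A|. */
        static int distance(List<String> desired, List<String> actual) {
            List<String> common = new ArrayList<>(desired);
            common.retainAll(actual);                    // intersection, in A's order
            int c = common.size(), n = desired.size();

            // perm[i] = position (in A's order) of the i-th common element of R
            int[] perm = new int[c];
            int j = 0;
            for (String url : actual)
                if (common.contains(url)) perm[j++] = common.indexOf(url) + 1;

            int l = 0;                                   // sum of non-trivial cycle lengths
            boolean[] seen = new boolean[c + 1];
            for (int s = 1; s <= c; s++) {
                int len = 0;
                for (int cur = s; !seen[cur]; cur = perm[cur - 1]) { seen[cur] = true; len++; }
                if (len > 1) l += len;
            }
            return l + (n - c);
        }

        public static void main(String[] args) {
            List<String> a = List.of("u1", "u2", "u3", "u4");
            List<String> r = List.of("u3", "u1", "u2", "u5");
            System.out.println(distance(a, r));          // l = 3 (cycle over u1,u2,u3), N - C = 1 -> 4
        }
    }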

10 Experimental Results

Sub has been running on the unibo Web portal [Uni] since November 2004. During this period I monitored several statistics, useful for the activity of “tuning” the search engine and for understanding the usage trend. In this section a set of graphs shows the trend of different parameters about the search engine’s usage and user satisfaction. In order to understand the dimension of the problem and to be able to evaluate the results obtained, it is useful to report some quantitative information about Sub and the unibo Web domain. I queried three popular general-purpose search engines, i.e., Google [Goo], Altavista [Alt], and AllTheWeb [All], in order to estimate the dimension of the unibo Web domain. The first query I submitted was about the dimension of the domain. The results are compared in Figure 10. The reader should consider that such results reference the number of different URLs, and furthermore it is possible to find duplicates among them. This result deserves some comments. Unfortunately, it is well known that it is impossible to know precisely the dimension of a domain in terms of the resources that populate it. Nevertheless, we can evaluate the accuracy of such an estimation by comparing it with the results of the same query applied to a different domain whose dimension we have under control (i.e., we know more precisely). In fact, we know (from CMS logs) that the www.unibo.it domain (i.e., the Web portal of unibo) is populated by about 20.000 documents. Such a number refers to single individuals, meaning that if more than one URL (whether static or dynamic, i.e., a query string) points to the same document, that document is counted only once. We can consider, with reasonable approximation, that the average number of URLs referring to the same document is 3.


Figure 10. Estimation of the dimension of the unibo Web domain (bar chart; reported URL counts: Google 2.300.000, Altavista 575.000, AllTheWeb 566.000)

Figure 11 shows the results of the three above-mentioned search engines to the query about the dimension of the unibo Web portal, i.e., www.unibo.it, compared with the information from the CMS logs. As can be noticed, it is reasonable to deduce that all the search engines I queried report a number of results strongly affected by resource duplicates. The dimension of the current Sub index is about 200.000 entries (notice that Sub indexes the whole unibo domain, not only the unibo Web portal). Such entries relate to resources which are different individuals, and do not comprise dynamic pages (except for those of the www.unibo.it Web domain): when the crawler faces a URL which is a query string it drops it, so Sub does not index such resources yet. It is also true that Sub loses all resources belonging to the unibo Web domain which are referenced only by external Web domains; they are islands from Sub’s perspective. Google [Goo] and other search engines exploiting back-link mechanisms are able to reach them. Given that, I think that the dimension of the index of Sub can be considered a good approximation of the dimension of the unibo Web domain in terms of static resources. The crawling and indexing of dynamic resources is currently the subject of other work efforts, so as far as this thesis is concerned it is left to future work. Figure 12 shows the number of user accesses (i.e., distinct user accesses) that Sub has had since its deployment up to September 2005. As can be seen, the search engine started with a number of accesses equal to 222.553, and in September 2005 we had a peak of 382.888 accesses. Without considering this peak, and also without considering the number of accesses in August, a period during which the University is closed and students are on vacation, the search engine had an average increase in accesses of 22%; the peak represents an increase of 72%. The number of queries had an increasing trend as well; Figure 13 shows the graph for this parameter. The number of different queries that Sub received in its first month online is 150.214. Without considering the peak and the lowest value (i.e., that of August), we had an average increase of 51,2%, and the peak value corresponds to an increase of 123,7%. This can be interpreted as a very positive trend.


Figure 11. Estimation of the dimension of the unibo Web portal, www.unibo.it (bar chart; CMS logs 20.000, Google 175.000, Altavista 87.800, AllTheWeb 87.200)

Figure 12. Monthly accesses to the search engine, November 2004 - September 2005 (from 222.553 accesses in November 2004 to a peak of 382.888 in September 2005)


Figure 13. Number of queries per month, November 2004 - September 2005 (from 150.214 in November 2004 to a peak of 336.034)

Figure 14. Position of the visited links among the results, November 2004 - September 2005 (monthly averages: 5,60; 4,05; 3,66; 3,80; 3,53; 3,81; 3,53; 3,41; 3,72; 3,55; 3,53)


In fact, this huge increase in the number of different queries means that the search engine is becoming more and more of a reference for unibo users on different aspects and topics. Probably, before Sub was online, most users accessed the “old” search engine trying to find the information they needed and, when faced with unsatisfactory results, just looked for an alternative service on the Web. When Sub went online, it is reasonable to believe that users were curious to try the “new” search engine. Although they do not have the possibility of choosing between the old and the new search engine, the increase in accesses, together with that in queries, seems to indicate that they now find what they are looking for, to the point of asking Sub for ever new information. Another parameter we monitor is the position, among the results returned by Sub, of the link the user follows after each query. As can be seen in Figure 14, we started with an average position of 5,60 for the visited link, which then became stable at about 3,50; the higher value reflects what happened in the first two months of Sub’s life. This means that, on average, users find what they are looking for within the first 3,5 results.

11 Conclusion and Future works

I have described the development of a search engine named Sub, which is the search engine running on the Web portal of the University of Bologna. Furthermore, I have described how this experience led us to study the definition of a method for the validation of the results of a search engine, and to implement a Web application for supporting it. Currently, we are working on the following aspects:

• the application of Semantic Web technologies to the Sub approach to the crawling, indexing, and searching operations. In particular, we are studying an algorithm for automating the process of semantic annotation of the resources, and investigating the best way to store such external semantic metadata in order to exploit them in the searching algorithm. Moreover, we are extending the ranking function of Lucene [Luc] with the use of semantic metadata;

• the definition of statistical indicators (related to both quantitative and qualitative principles) based on semantic metadata, and the implementation of their computation in the WebMonitor application.


Chapter 3

Designing Domain-Oriented Web Sites

During the last years, the number of organizations which present themselves on the World Wide Web with their Web sites has increased enormously, and this trend will probably continue in the future. Organizations rely more and more upon these portals for offering services to their members or to other people. The large amount of information and services that these portals make available is accessible through the specification of URI addresses, the use of search engines, or by following links from related documents. In order to support this new usage of Web technologies, the concept of a Semantic Web has been introduced. The goal of the Semantic Web initiative is to give all information available on the Web a machine-“understandable” form [BHL01], so that it can be used for several purposes, including more relevant results when searching for specific information, better data integration from different sources, and the automation of organizational tasks across organizations. The World Wide Web represents a new space through which any kind of organization can offer services and data. The huge diffusion of this Internet service has led to the development of new kinds of software systems, called Web applications. A Web application is an application delivered to users from a Web server over a network such as the World Wide Web or an intranet. The most popular Web application is the Web site: a collection of Web pages providing information and services (e.g., forms where input data can be provided in order to execute some logic on the server side), connected through hypertextual links, and supporting a given domain (i.e., a Web portal). A Web site often reflects the structure of an organization, and provides a set of functionalities to its users, who are members and have specific roles in the organization. With the new concept of the Semantic Web, the development of Web sites is evolving to include, in their implementation, the use of knowledge representation technologies, in order to obtain all the benefits offered by the semantic annotation of documents and data. In fact, a Semantic Web site is intended to be a Web site where all pages are annotated with semantic information, useful for describing machine-processable and “understandable” organizational information. That being so, we should ask ourselves: where are we going? And what do we need to do in order to make the Semantic Web vision succeed? In order to realize the Semantic Web vision, Web sites, services, and all forms of Web applications need to be aware of semantic aspects. Their “awareness” is the key to interoperability, automation of tasks, and improvement of information-based services on the Web. In order to make them aware of the semantics, they have to be built upon the knowledge domains they support. Ontologies are the base technology for knowledge representation, and the Web ontology is the key concept of the Semantic Web. The basic idea I depict in the rest of this chapter can be summarized as: we can use ontology modeling as a way to model Semantic Web sites. A Web site supporting a given domain is composed of objects, each of which is strictly related to concepts described in the ontology representing that domain. It is possible to combine concepts of the domain of knowledge with those that are useful for describing Web sites’


typical elements (e.g., page, form), in order to obtain an ontology which allows users to describe a domain-oriented Web site (i.e., a Web portal). Usually, Web engineers use specific tools in order to design, implement, and deploy Web sites (e.g., content management systems (CMS), Web site building tools). My idea is that of using Web ontologies as the main mechanism upon which such tools’ logic should be based. As an example, consider the development of a Web site using a CMS which is “aware” of the characteristics of the domain that the resulting Web site is going to support. Its awareness covers the type of content that the Web site will contain, the internal structure of the pages, the services that will be provided, the typical users, and so on. All these are known by means of an ontological description that evolves over time, and whose evolution must be reflected on the Web site. In this chapter I investigate how ontologies can be used in typical software engineering tools. A case study is presented in order to show that software design and ontology design can be combined for developing Semantic Web sites.

1 Motivations

The life cycle of a Web site is not intrinsically different from that of a conventional application. That is, it is characterized by several activities, which usually are repeated iteratively over the life cycle. Typically, such activities are: the definition of the requirements, the specification of the software model, and the implementation and maintenance of the software system. These activities are performed for each functionality and feature that the application has to provide. The Semantic Web initiative bases its foundation on the capability of formally describing the meaning of the information available on the Web. Such information is currently represented by formatted documents which embed natural language. By the term “meaning” we do not intend the “sense”, as in the meaning of a word or expression, i.e., the way in which a word, expression, or situation can be interpreted from a cognitive perspective; rather, we intend the logical implications that can be deduced from assertions, and the way they can be exploited for making different sources of information interoperate. In order to succeed, the Semantic Web needs a set of technologies and methods defined with that aim in mind. In fact, the W3C [Con] has defined a set of technologies for supporting the Semantic Web initiative. Among them, RDF [RDF] is the general model for the Semantic Web and OWL [OWL04] is the language for specifying Web ontologies. The general approach for bootstrapping the Semantic Web is that of enriching the information available on the Web with semantic annotations. Such annotations can be extracted using natural language processing techniques, manually added by using tools which support their editing, or automatically generated together with new information when it is created. For information already existing on the Web, such techniques are all suitable, but the most important thing, in my opinion, is to study new techniques and methods for creating new information-oriented applications, so that they have a proper representation of knowledge as a built-in functionality, one which affects their whole life cycle. Such an approach would allow developers to generate a class of “very” Semantic Web applications, which over time would replace the traditional ones. For this reason I decided to investigate how Web ontologies can be used in the design of a Web-based application, and in particular I focused on Web sites. In order to study this aspect, I experimented with existing tools which have been proposed in the context of related works. My approach has been that of presupposing the use of tools for the development of software systems and for the creation and engineering of ontologies, in the context of a general framework and method for the development of Semantic Web applications, more specifically for Semantic Web sites. This means that the tools used in the following examples are just possible choices. The aim of the experimentation is to show a possible application of Web ontologies from a software engineering perspective.


2 The UML and ontology modeling

A first consideration concerns the technologies that are used to express a system specification design. For example, the possibility of using the same formalism to describe both the ontology specification and the software specification would make it simpler for software engineers to approach the specification of Semantic Web sites. The challenge is to exploit the ontological specification of concepts in order to represent both the system design and the domain of knowledge. This approach gives the possibility to include the domain-specific concepts in the design of applications supporting the corresponding domain of knowledge. Let me sketch the general idea. When drawing a UML class diagram for a Web application we are de facto defining the elements that characterize that domain ontology. For example, when we design the Web site of a university, we have to represent concepts like professor, student, faculty, etc. Hence, a first approach could be that of representing the concepts of a domain that match elements of the application. It is then possible to obtain a first base ontology for the domain of knowledge supported by such an application. In [Cra01b, Cra01a] the XML Metadata Interchange (XMI, the standard DTD for UML) [Groc] is used in order to obtain a translation from a UML class diagram to a corresponding RDF schema. In particular, in [Cra01b] a tool named “UML Data Binding” (UDB) is presented which implements the translation from UML class diagrams to RDF schemas and sets of Java classes, exploiting the XMI encodings of the class diagrams. This XML-based representation of the ontology model is the input to a set of stylesheets that realize the translation. The generated Java classes allow an application to represent knowledge about objects in the domain as in-memory data structures. The generated RDF schema defines domain-specific concepts that an application can reference when serializing this knowledge using RDF (in its XML encoding). I used the UDB tool in the context of the case study described in Section 3. On the other hand, another possible approach is that of using an ontological description for specifying the elements which compose the model of the Web site being created (e.g., pages, forms, etc.). Such an ontology allows the description of the Web site model and, combined properly with the ontology that describes the domain of knowledge that the Web site will support, provides a useful device for modeling domain-specific Web sites. A concrete application of this approach is described later in Chapter 4.
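To give a flavour of such generated code (a hypothetical sketch, not UDB’s actual output; the class names are borrowed from the case study below, and the tutorOf target is an assumption), a UML class Professor with a name attribute and a tutorOf association could be bound to a Java class along these lines:

    import java.util.*;

    /** In-memory counterpart of the ontology class UniBOntology#Professor
     *  (hypothetical shape; UDB derives such classes from the XMI encoding). */
    class Professor extends Person {                              // UML generalization -> rdfs:subClassOf
        private String name;                                      // UML attribute     -> RDF property
        private final List<Student> tutorOf = new ArrayList<>();  // UML association   -> RDF property

        String getName() { return name; }
        void setName(String name) { this.name = name; }
        List<Student> getTutorOf() { return tutorOf; }
    }

    class Person {}                                               // from the General ontology package
    class Student extends Person {}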

3 Domain-oriented web sites based on ontologies

We term domain-oriented those Web sites which support a specific domain of knowledge, providing services to users who play a role in such a domain. With an ontology-driven approach to Web site creation, several benefits can be envisaged [CP02]. A domain ontology changes or evolves when new concepts are added, or when new elements and constraints replace existing ones in that specific environment; this is captured during the ontology life cycle. If there is a dependency between the structure and content of the Web site representing a specific domain and the ontology of that domain, then both must be affected by these changes. Consider a domain of knowledge which is represented by a specific ontology. All concepts, relations, and constraints in the ontology can be extended in order to merge such specificity with concepts related to the elements composing the Web site we want to realize. The specification of such concepts with possible instances (i.e., the static definition) can be transformed into a concrete deployment of a domain-oriented Web site. Now imagine the life cycle of this domain-oriented Web site. Two possible situations can be faced:

• new Web pages are created by the users of the Web site;
• the ontology evolves, and this can happen both at the schema and at the instance level.

These two situations represent the dynamic aspects of the domain-oriented Web site life cycle. My idea is that of identifying a new generation of Web sites (i.e., domain-aware Web sites) which are able to react to dynamic behavior such as that described above, in order to preserve


their consistency with respect to the domain of knowledge. To preserve consistency means that the Web site should be able to automatically generate metadata information, based on its own knowledge of the domain ontology, when new contents are manually created, as well as to reflect changes in its structure if the ontology is revised and modified, and to add new “human-readable” content if new meaningful instances of the ontology are created. In order to demonstrate the applicability of this approach, I describe a method, which exploits existing technologies, suitable for handling both ontologies and software modeling.

4 The development of Semantic Web applications

The proposed ontology-driven development method is based on the use of UML as modeling language and on tools that are built upon Web standards. Actually, the method suggests a guideline about the type of technologies that a designer should use during the development of a Semantic Web application, and it can be applied to any kind of process that is suitable for the development of Web-based applications. The process of building Semantic Web applications (or, more generally, hypermedia applications) is not intrinsically different from the one used when building conventional applications. Compared with traditional systems, Web applications have introduced the concept of navigation, and Semantic Web applications have introduced that of domain ontology. Different approaches have been investigated for the purpose of developing Web applications. Among the most common techniques is WebML [ASP00], a method based on the Entity-Relationship (E-R) model that concentrates its features on Web application appearance. It provides an Entity-Relationship based notation for the definition of data structures, and it can be exhaustive in the case of the development of simple Web applications. In order to obtain a full specification of a complex Web application, this language should be supported by another one that provides enough expressive constructs to fill WebML’s shortcomings regarding computational behavior aspects. Two other common techniques are:

• the Object-Oriented Hypermedia Design Method (OOHDM) [RSL99], which was the first to introduce the concept of navigational design; it is summarized in Figure 1;
• the Conallen UML extension for Web applications [J.99a, J.99b], strongly based on business logic. This extension focuses, above all, on the implementation aspects of the development; in fact, it contains concepts such as form, page, script, frame, client, server, etc.

I will try to illustrate how this approach can be applied to the design of a Semantic Web application. Using this approach we produce a model that is used to obtain the corresponding specific RDF schema, expressed in XML syntax, which in turn is used as a conceptual model for a Web site. Actually, we will take into account only a subset of the entire domain. In order to explain the method, we have chosen as a case study a domain that is well known to all academics, namely the organizational structure of a university. A university is a large and complex educational and research organization. Terms such as student, professor, exam, research, library, etc. are typical ways to refer to objects in this domain. For instance, the University of Bologna (UniBO for short) is composed of schools, which coordinate teaching activities, and departments, which coordinate research activities. These two structures are autonomous but strictly related. Each school offers several courses and benefits from the services of a certain number of departments. Professors are employed by and work for one faculty. They are also affiliated with a department, in which they act as researchers. Professors can teach more than one topic in the context of different courses, etc. Since UniBO is a large and complex organization, the UML model which describes its organizational structure consists of a set of class diagrams, each contained in a different package that underlies a specific aspect. At the top level there are two packages: the General and the UniBOntology packages. The former contains the model of a General ontology. Here, concepts that can also be used for other domains, completely different from the university’s, are defined.

Activity: Conceptual Design
  Products: classes, sub-systems, relationships, attribute perspectives
  Formalisms: object-oriented modeling constructs; design patterns
  Mechanisms: classification, aggregation, generalization and specialization
  Design concerns: model the semantics of the application domain

Activity: Navigational Design
  Products: nodes, links, access structures, navigational contexts, navigational transformations
  Formalisms: OO views and state charts, context classes, design patterns, user-centered scenarios
  Mechanisms: classification, aggregation, generalization and specialization
  Design concerns: takes into account user profile and task; emphasis on cognitive aspects; build the navigational structure of the application

Activity: Abstract Interface Design
  Products: abstract interface objects, responses to external events, interface transformations
  Formalisms: abstract data view configuration diagrams, ADV-charts, design patterns
  Mechanisms: mapping between navigation and perceptible objects
  Design concerns: model perceptible objects implementing chosen metaphors; describe interfaces for navigational objects; define layout of interface objects

Activity: Implementation
  Products: running application
  Formalisms: those supported by the target environment
  Mechanisms: those provided by the target environment
  Design concerns: performance, completeness

Figure 1. Summary of the OOHDM Method taken from [RSL99]

The General ontology (whose concepts can be inherited by other well-known ontologies) is extended by the UniBOntology through the use of UML specialization constructs, which correspond to the rdfs:subClassOf property provided by the RDF Schema. The two packages are shown in Figure 2. In order to define an ontology for UniBO, we chose to describe the university structure, its hierarchy, and its organization, using a vocabulary that expresses its typical concepts and relationships. The elements of the UniBOntology are modelled in the UniBOntology package. The classes that it contains have been named using a taxonomy for the domain. The UniBOntology package contains a set of packages, which are depicted in Figure 3. The University package is the abstract model of the UniBO organization, while the People package contains the classes describing the roles that people play within UniBO. The Uni-Documents package describes the types of documents that can be created as reports of university people’s work, and the Teaching and Research packages describe all the typical concepts related to such activities. The Relations package contains other packages, each showing the typical relationships between classes of the various packages. These collections of diagrams are useful for the design of the navigational model, namely the descriptions of the possible interactions between a specific role and the Web site. In fact, these diagrams provide a representation from a particular class view that corresponds to a class of users (i.e., a user profile). At the end of the design process we have a set of class diagrams that describe the typical elements of our domain and how these elements relate to each other. Thus, we have a specific schema to express the entities composing the University of Bologna and to make them publicly available on the World Wide Web. The result is the model for the Web site with all the pages composing it. Such a model is useful in order to deploy such pages with semantic annotations. In fact, this model is combined with the corresponding specification of a collection of pages and services. The Web site specification evolves with its navigational design, interface definition, and implementation, and the ontology schema is used to relate the Web site’s components with their semantic annotations. This description seems to completely separate the two models, suggesting the idea that the ontology serves to integrate the portal with metadata, but it is separately defined.


Figure 2. Top level packages

Figure 3. UniBO UML model: the UniBOntology packages


Figure 4. Ontology-based approach to the OOHDM (process flow: Ontology Design and Conceptual Design, then Navigational Design, Interface Definition, and Implementation; the XMI-encoded ontology feeds a UDB-like tool producing the RDF schema and Java classes, from which the application and its object model produce the information metadata in RDF/XML)

This situation is realistic if we have an existing Web portal and we want to provide it with semantic annotations. On the other hand, if we are creating a novel Semantic Web application from scratch, it is appealing to consider obtaining the desired ontology schema without increasing the design effort. We can have an ontology schema as a starting point and use it to define the Web site objects. This can be achieved considering that the ontology model is essentially the conceptual model. For instance, considering the Object-Oriented Hypermedia Design Method (OOHDM) [RSL99], it is possible to integrate it with the model of the ontology without any additional design effort. Figure 4 shows a schematic description of the steps performed when combining an ontology-based approach with the OOHDM. The activities that this method includes are: Ontology Design, Conceptual Design, Navigational Design, Object Model, Interface Definition, and Implementation. For clarity I have separated the Ontology Design and the Conceptual Design as two different steps, but in practice they reduce to one single step that incorporates both the ontology and the conceptual design. The conceptual model reflects objects and behaviors in the application domain. It also reflects the fact that the application implementation is going to be deployed in a Web environment. The object model is a collection of UML object diagrams that describes an abstract representation of the knowledge that will be contained in the specific Semantic Web application we are developing. It can be constructed on the basis of the ontology and navigational models, which together give the structure of the application. This abstract representation can be used to automatically generate a serialization format that allows Web publication of the represented knowledge. The object model will be integrated with new diagrams every time new information has to be added, so it can be seen as a component that evolves during the application life cycle. In fact, Figure 4 shows that it is not really a process step, but rather uses the results of the process as input elements for producing its own results.


This ontology-driven process is based on UML as the modeling notation for the design activities. There are tools which allow the automatic translation from class diagrams and object diagrams to a corresponding serialization format that allows Web publication. In the context of this work I have used for this purpose a technology called UML Data Binding (UDB) [Cra01a], together with a Rational UML supporting tool. Other tools are suitable for the definition of ontologies; for example, the tool Protégé [Pro] is widely used for this purpose. The difference here is that UML-based tools with ontology serialization capabilities allow designers to combine the modeling of both the Web site and the ontology. As an example of the obtained code, Figure 5 shows the definition of the class Professor, which inherits from the class Person of the General ontology; its property name, which has been modeled as an attribute; and its property tutorOf, which has been modeled as an association. In order to make the code more readable, the following abbreviations have been used:

• UniBOntology, which stands for http://www.cs.unibo.it/schema/UniBOntology
• General, which stands for http://www.cs.unibo.it/schema/General
• rdf-schema, which stands for http://www.w3.org/2000/01/rdf-schema
• rdf, which stands for http://www.w3.org/1999/02/22-rdf-syntax-ns

The UniBO Semantic Web site is then deployed together with the semantic annotations of its pages. Assuming that the ontology model is realized by following best practices of ontology engineering (e.g., sharing of existing and widely used ontologies), this approach allows the communication between different sources of related environments, and the automation of tasks based on data sharing [ACP03]. In the next section I describe two scenarios that become possible with this approach.

5 Two example scenarios

A main goal of our Semantic Web site is that of enabling more exact results when a user performs a Web search on its specific domain. An ontology which defines and organizes knowledge about academic entities and relationships can have several benefits. For instance, suppose that an undergraduate student needs to collect documents about a topic of interest. He could use the UniBO Web site to query about the researchers working on a particular area and the articles that they wrote about that topic. Another possible situation is the following: he can realize a software agent to search for articles on a particular subject whose authors are members of a particular set of institutions. If the search is based on traditional text-based techniques (not based on Semantic Web technologies), the answer would include a large number of uninteresting entries, because “traditional” search engines rank each page based upon the number of query terms it contains. Considering that all the pages of the UniBO Web site are annotated with metadata describing semantic information, the result of the query could be a graph of all possible pathways that match the query. Figure 6 shows an example of such a graph, which could be the result of a search for all the articles whose subject is the “Ontology research area” and whose authors are members of departments belonging to UniBO. The nodes of the graph represent resources, and the connecting arcs represent properties. Each arc has a label indicating the name of the property it represents. All nodes are hypertextual links to the URL containing that resource’s information. For example, in the case of an article the URL could refer to a downloadable version (e.g., a PDF file) or to a Web page containing the article. Thus, the user can follow the links he finds most appropriate for his purposes.


[Figure 5 listing, reconstructed from a corrupted extraction; truncated resource values are shown as “...”]

    <rdfs:Class rdf:about="UniBOntology#Professor">
      <rdfs:subClassOf rdf:resource="General#Person"/>
    </rdfs:Class>
    <rdf:Property rdf:about="UniBOntology#Professor.name">
      <rdfs:domain rdf:resource="UniBOntology#Professor"/>
      <rdfs:range rdf:resource="..."/>
    </rdf:Property>
    <rdf:Property rdf:about="UniBOntology#Professor.tutorOf">
      <rdfs:domain rdf:resource="UniBOntology#Professor"/>
      <rdfs:range rdf:resource="..."/>
    </rdf:Property>