UPGRADE, Vol. VI, issue no. 6, December 2005 - Council of European Professional Informatics Societies (CEPIS)

Call for Take-up Actions: Sub-projects Joined to the AXMEDIS Project of the European Commission - http://www.axmedis.org

The Integrated Project AXMEDIS (Automating Production of Cross Media Content for Multi-channel Distribution) requires the participation of new contractors to carry out take-up actions as sub-projects within the AXMEDIS project, to promote the validation and early application of AXMEDIS technologies through demonstration activities. AXMEDIS provides a framework of technologies, methods and tools to speed up and optimise content production, protection and distribution for leisure, entertainment and digital-content valorisation and exploitation in general, for multi-channel distribution, supporting interoperability of content and DRM. AXMEDIS aims to meet the challenges of digital-content market demand by: (i) reducing costs for content production and management by applying composition, parallel processing, and optimisation techniques for content formatting and representation (format) and workflow control; (ii) reducing distribution and aggregation costs in order to increase accessibility, with a Peer-to-Peer platform at the Business-to-Business level which can integrate content management systems and workflows; and (iii) providing new methods and tools for innovative, flexible and interoperable Digital Rights Management (DRM), including the exploitation of MPEG-21 and overcoming its limitations, and supporting different business and transaction models.

For the technical details of the AXMEDIS framework specification, please visit the AXMEDIS website, where tutorials, the specification, use cases, test cases, and reports about the research activity performed and planned are available. See also the next AXMEDIS conference: http://www.axmedis.org/axmedis2006

The candidate topic areas of this call include the following: application and/or extension of the AXMEDIS framework and tools to support:
• one or more distribution channels, in order to demonstrate interoperability of content and tools with other AXMEDIS distribution channels and tools (mobile devices, PC, STB, portable video players, portable music players, etc.);
• massive and/or coordinated production, aggregation, and protection of cross-media content;
• collaboration among different actors of the production and distribution value chain; collaborations among cultural institutions, etc.;
• production and/or distribution authoring tools and/or players.

Take-up projects should aim at developing real solutions (adoption of the AXMEDIS framework and technology in real-life scenarios) by exploiting AXMEDIS technologies. They should start real, sustainable activities by taking advantage of the AXMEDIS framework services and derived tools. Maximum funding is about 1-1.1 Meuro for the whole set of 3-4 take-up actions. All the necessary information for submitting your proposal is available at the call webpage of the AXMEDIS project: http://www.axmedis.org/callfortakeup/call.html. AXMEDIS project co-ordinator contact details: Prof. Paolo Nesi, [email protected], http://www.dsi.unifi.it/~nesi/

UPGRADE is a promoting partner of AXMEDIS 2006.

UPGRADE is the European Journal for the Informatics Professional, published bimonthly. UPGRADE is the anchor point for UPENET (UPGRADE European NETwork), the network of CEPIS member societies' publications, which currently includes:
• Mondo Digitale, digital journal of the Italian CEPIS society AICA
• Novática, journal of the Spanish CEPIS society ATI
• OCG Journal, journal of the Austrian CEPIS society OCG
• Pliroforiki, journal of the Cyprus CEPIS society CCS
• Pro Dialog, journal of the Polish CEPIS society PTI-PIPS

Publisher: UPGRADE is published on behalf of CEPIS (Council of European Professional Informatics Societies) by Novática, journal of the Spanish CEPIS society ATI (Asociación de Técnicos de Informática). UPGRADE monographs are also published in Spanish (full version printed; summary, abstracts and some articles online) by Novática, and in Italian (summary, abstracts and some articles online) by the Italian CEPIS society ALSI (Associazione nazionale Laureati in Scienze dell'informazione e Informatica) and the Italian IT portal Tecnoteca. UPGRADE was created in October 2000 by CEPIS and was first published by Novática and INFORMATIK/INFORMATIQUE, bimonthly journal of SVI/FSI (Swiss Federation of Professional Informatics Societies).

Editorial Team. Chief Editor: Rafael Fernández Calvo, Spain. Associate Editors: François Louis Nicolet, Switzerland; Roberto Carniel, Italy; Zakaria Maamar, United Arab Emirates; Soraya Kouadri Mostéfaoui, Switzerland.

Editorial Board: Prof. Wolffried Stucky, Former President of CEPIS; Prof. Nello Scarabottolo, CEPIS Vice President; Fernando Piera Gómez and Rafael Fernández Calvo, ATI (Spain); François Louis Nicolet, SI (Switzerland); Roberto Carniel, ALSI - Tecnoteca (Italy).

UPENET Advisory Board: Franco Filippazzi (Mondo Digitale, Italy); Rafael Fernández Calvo (Novática, Spain); Veith Risak (OCG Journal, Austria); Panicos Masouras (Pliroforiki, Cyprus); Andrzej Marciniak (Pro Dialog, Poland).

English Editors: Mike Andersson, Richard Butchart, David Cash, Arthur Cook, Tracey Darch, Laura Davies, Nick Dunn, Rodney Fennemore, Hilary Green, Roger Harris, Michael Hird, Jim Holder, Alasdair MacLeod, Pat Moody, Adam David Moss, Phil Parkin, Brian Robson.

Vol. VI, issue No. 6, December 2005

Monograph: The Semantic Web (published jointly with Novática*)
Guest Editors: Luis Sánchez-Fernández, Michael Sintek, and Stefan Decker

2 Presentation. The Semantic Web or The Next Web Wave – Luis Sánchez-Fernández, Michael Sintek, and Stefan Decker
5 The Semantic Web: Fundamentals and A Brief State-of-the-Art – Luis Sánchez-Fernández and Norberto Fernández-García
12 Leveraging Metadata Creation by Annotation for The Semantic Web – Siegfried Handschuh
19 The Quest for Information Retrieval on The Semantic Web – David Vallet-Weadon, Miriam Fernández-Sánchez, and Pablo Castells-Azpilicueta
24 Functional RuleML: From Horn Logic with Equality to Lambda Calculus – Harold Boley
30 Towards Semantic Desktop Wikis – Malte Kiesel and Leo Sauermann
35 Towards Semantically-Interlinked Online Communities – Uldis Bojars, John G. Breslin, Andreas Harth, and Stefan Decker
41 A Semantic Search Engine for the International Relation Sector – Luis Rodrigo-Aguado, V. Richard Benjamins, Jesús Contreras-Cino, Diego-Javier Patón-Villahermosa, David Navarro-Arnao, Robert Salla-Figuerol, Mercedes Blázquez-Cívico, Pilar Tena-García, and Isabel Martos-Laborde
47 Semantic Search in Digital Image Archives: A Case Study – Julio Villena-Román, José-Carlos González-Cristóbal, Cristina Moreno-García, and José-Luis Martínez-Fernández
55 Configuring e-Government Services Using Ontologies – Dimitris Apostolou, Ljiljana Stojanovic, Tomás Pariente-Lobo, Joan Batlle-Montserrat, and Andreas E. Papadakis

UPENET (UPGRADE European NETwork)

63 From Novática (ATI, Spain) - ICT for Education: An Initiative for Educational Modernization: The Ponte dos Brozos Project – Simón Neira-Dueñas and Felipe Gómez-Pallete Rivas
71 From Pro Dialog (PIPS, Poland) - ICT for Education: On The Superiority of Internet-Based Mass Enrolment to High Schools over Traditional – Andrzej P. Urbanski

* This monograph will also be published in Spanish (full version printed; summary, abstracts, and some articles online) by Novática, journal of the Spanish CEPIS society ATI (Asociación de Técnicos de Informática), and in Italian (online edition only, containing summary, abstracts, and some articles) by the Italian CEPIS society ALSI (Associazione nazionale Laureati in Scienze dell'informazione e Informatica) and the Italian IT portal Tecnoteca.

ISSN 1684-5285

Monograph of next issue (February 2006): Key Success Factors in Software Engineering (The full schedule of UPGRADE is available at our website.)

UPGRADE Newslist available at the journal's website.

Copyright © Novática 2005 (for the monograph and the cover page); © CEPIS 2005 (for the sections MOSAIC and UPENET). All rights reserved. Abstracting is permitted with credit to the source. For copying, reprint, or republication permission, contact the Editorial Team. The opinions expressed by the authors are their exclusive responsibility.

Cover page designed by Antonio Crespo Foix, © ATI 2005. Layout Design: François Louis Nicolet. Composition: Jorge Llácer-Gil de Ramales. Editorial correspondence: Rafael Fernández Calvo. Advertising correspondence:

Presentation

The Semantic Web or The Next Web Wave

Luis Sánchez-Fernández, Michael Sintek, and Stefan Decker

The Semantic Web vision – that of a Web in which software agents can access and process web page content and automatically perform tasks that today require tedious interaction – was proposed by Tim Berners-Lee, the inventor of the current Web, towards the end of the last century. Since then, there has been a flurry of research activity in this field, and applications based on Semantic Web technologies are already beginning to appear. Interested readers are referred to the "Semantic Web Challenge". This UPGRADE and Novática monograph devoted to the Semantic Web (also called the Next-Generation Web) is made up of articles intended to provide a broad overview of the different activities being carried out in this field. In addition to the regular article on the state of the art ("The Semantic Web: Fundamentals and A Brief State-of-the-Art",

by Luis Sánchez-Fernández and Norberto Fernández-García), the monograph will cover the following key areas:
• Fundamental Semantic Web technologies: "Leveraging Metadata Creation by Annotation for The Semantic Web", by Siegfried Handschuh; "The Quest for Information Retrieval on The Semantic Web", by David Vallet-Weadon, Miriam Fernández-Sánchez and Pablo Castells-Azpilicueta; and "Functional RuleML: From Horn Logic with Equality to Lambda Calculus", by Harold Boley.
• Systems that in some way allow us to get more out of the Web: "Towards Semantic Desktop Wikis", by Malte Kiesel and Leo Sauermann; and "Towards Semantically-Interlinked Online Communities", by Uldis Bojars, John G. Breslin, Andreas Harth and Stefan Decker.
• Specific applications based on Semantic Web technologies:

The Guest Editors

Luis Sánchez-Fernández graduated as a telecommunications engineer from the Universidad Politécnica de Madrid, Spain, in 1992 and received his doctorate in Telecommunications Engineering from the same university in 1997. In October 1997 he joined the Universidad Carlos III de Madrid, where he is currently a full professor in the Dept. of Telematic Engineering, holding the post of Assistant Director. He is Director of the Web Technologies Lab, which forms part of the research group Grupo de Aplicaciones y Servicios Telemáticos (Telematic Applications and Services Group) of the Universidad Carlos III de Madrid. He has participated in and/or led a number of national research projects and one European project related to web technologies, including Semantic Web technologies, and has authored more than 50 publications in national and international conferences and journals, as well as a number of chapters in scientific books. His current research activities are focused on the Semantic Web (semantic annotation, ontologies, semantic Web services). He is also interested in other technologies related to Web applications, such as XML. He is a member of the Spanish CEPIS society ATI (Asociación de Técnicos de Informática) and a frequent contributor to its journal Novática.

Michael Sintek studied Computer Science and Economics at the University of Kaiserslautern, Germany, and received the Diplom (Master's degree) in 1996. Since then, he has been working as a research scientist at the German Research Center for Artificial Intelligence (DFKI GmbH), Kaiserslautern. In the research department for Intelligent Engineering Systems he investigated logic programming and machine learning approaches for the maintenance of knowledge bases in the VEGA project. In 2000 and 2001, he was project leader of the FRODO project (DFKI


Knowledge Management Group), which developed a framework for building distributed organizational memories. As a visiting researcher at the Stanford Medical Informatics department (August - October 1999 and November 2000 - February 2001) he developed various plugins for the frame-based knowledge acquisition tool Protégé-2000, including the OntoViz ontology visualization tab and the RDFS and OIL backends. In 2002, he was a visiting researcher at the Stanford Database Group and at ISI, working on the Edutella project and the Semantic Web rule language TRIPLE. Currently, he is co-head of the Competence Center Semantic Web (CCSW) at DFKI.

Stefan Decker received his PhD at the University of Karlsruhe, Germany. He is working as a Senior Research Fellow and Adjunct Lecturer at the National University of Ireland, Galway, and is executive director of the Digital Enterprise Research Institute (DERI) and Cluster Leader of the Semantic Web Cluster within the institute. Previously he worked at ISI, University of Southern California (2 years, as Research Assistant Professor and Computer Scientist), Stanford University, Computer Science Department (Database Group) (3 years, PostDoc and Research Fellow), and Institute AIFB, University of Karlsruhe (4 years, PhD Student and Junior Researcher). He has initiated or participated in several projects and activities regarding the Semantic Web, such as Ontobroker, Protégé, XML-based OIL, Edutella, and the Semantic Web Working Symposium at Stanford University (USA). His research interests include the Semantic Web and P2P technologies, and his current and future objective is the creation and wide dissemination of the next-generation collaboration and augmentation infrastructure - the Social Semantic Desktop.


"A Semantic Search Engine for The International Relation Sector", by Luis Rodrigo-Aguado, V. Richard Benjamins, Jesús Contreras-Cino, Diego-Javier Patón-Villahermosa, David Navarro-Arnao, Robert Salla-Figuerol, Mercedes Blázquez-Cívico, Pilar Tena-García and Isabel Martos-Laborde; "Semantic Search in Digital Image Archives: A Case Study", by Julio Villena-Román, José-Carlos González-Cristóbal, Cristina Moreno-García and José-Luis Martínez-Fernández; and "Configuring e-Government Services Using Ontologies", by Dimitris Apostolou, Ljiljana Stojanovic, Tomás Pariente-Lobo, Joan Batlle-Montserrat, and Andreas E. Papadakis.

From the point of view of their origin, the articles can be broken down into those from industry sources, those produced by research institutes linked (to a greater or lesser extent) to universities or coming directly from the university world, plus one from a European research project, the consortium of which includes both universities and companies.

The presence of the university world is important, but there is also clear evidence of interest from industry. As is normal in monographs published by this journal, the reader can also find a number of useful references in this presentation, complemented on this occasion by a glossary of terms commonly used in this field. We would not like to end this presentation without thanking UPGRADE and Novática for their support during the editing process, and we trust that this edition will be of interest and use to the readers of both journals.

Translation by Steve Turpin

Useful References on the Semantic Web

This section provides a list of some of the most important references related to the Semantic Web, intended to complement those appearing in the articles making up this monograph.

Websites
• W3C (World Wide Web Consortium)
• W3C Semantic Web
• Semantic Web ORG
• Semantic Web Science Association
• AIS SIGSEMIS (Semantic Web and Information Systems)
• OMWG (Ontology Management Working Group)
• SWSI (Semantic Web Services Initiative)

Journals
• Journal of Web Semantics, Elsevier
• IEEE Intelligent Systems, IEEE
• Applied Ontology, IOS Press
• International Journal of Knowledge and Learning, Inderscience

Books
• John Davies, Dieter Fensel, Frank van Harmelen. Towards the Semantic Web: Ontology-Driven Knowledge Management. John Wiley & Sons, 2003. ISBN 0-470-84867-7.
• Dieter Fensel, Wolfgang Wahlster, Henry Lieberman, James Hendler. Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential. The MIT Press, 2002. ISBN 0-262-06232-1.
• Grigoris Antoniou, Frank van Harmelen. A Semantic Web Primer. The MIT Press, 2004. ISBN 0-262-01210-3.
• Siegfried Handschuh, Steffen Staab. Annotation for the Semantic Web. IOS Press, 2004. ISBN 1-58603-345-X.
• Asunción Gómez-Pérez, Mariano Fernández-López, Oscar Corcho. Ontological Engineering: with Examples from the Areas of Knowledge Management, e-Commerce and the Semantic Web. Springer Verlag, 2004. ISBN 1-85233-551-3.
• Steffen Staab, Rudi Studer. Handbook on Ontologies. Heidelberg: Springer Verlag, 2004. ISBN 3-540-40834-7.
• Franz Baader, Peter Patel-Schneider, Diego Calvanese, Deborah L. McGuinness, Daniele Nardi. The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, 2003. ISBN 0-521-78176-0.

Conferences
• 1st Asian Semantic Web Conference, 2006
• European Semantic Web Conference, 2005
• International Semantic Web Conference, 2005
• International Conference on Formal Ontology in Information Systems, 2004
• International World Wide Web Conference, 2005
• International Conference on Artificial Intelligence, ICAI, 2005
• IEEE/WIC/ACM International Conference on Web Intelligence, 2005
• Atlantic Web Intelligence Conference, 2005

Research Projects, Excellence Networks
• KnowledgeWeb
• SEKT (Semantically Enabled Knowledge Technologies)
• DIP (Data, Information, and Process Integration with Semantic Web Services)
• SWAP (Semantic Web and Peer-to-Peer)
• AceMedia
• REWERSE (Reasoning on the Web with Rules and Semantics)
• OntoWeb
• SEWASIE (SEmantic Webs and AgentS in Integrated Economies)
• SWWS (Semantic Web Enabled Web Services)
• WonderWeb
• NEWS

Ontologies
• SUMO (Suggested Upper Merged Ontology)
• MILO (MId-Level Ontology)
• KIMO (Knowledge and Information Management Ontology)

Functional RuleML: From Horn Logic with Equality to Lambda Calculus – Harold Boley

… elements and distinguishes 'uninterpreted' (constructor) vs. 'interpreted' (user-defined) functions just via an XML attribute; another attribute likewise distinguishes the (single vs. set-)valuedness of functions (Section 2). We then proceed to the nesting of all of these (Section 3). Next, for defining (interpreted) functions, unconditional (oriented) equations are introduced (Section 4). These are then extended to conditional equations, i.e., Horn logic implications with an equation as the head and possible equations in the body (Section 5). Higher-order functions are finally added, both named ones such as Compose and λ-defined ones (Section 6).

2 Interpretedness and Valuedness

The different notions of 'function' in LP and FP have been a continuing design issue:
• LP: Uninterpreted functions denote unspecified values when applied to arguments, not using function definitions.
• FP: Interpreted functions compute specified returned values when applied to arguments, using function definitions.
Uninterpreted functions are also called 'constructors', since the values denoted by their application to arguments will be regarded as the syntactic data structure of these applications themselves.

Harold Boley is Adjunct Professor at the Faculty of Computer Science, University of New Brunswick, Canada, and leader of the Semantic Web Laboratory at NRC IIT e-Business (National Research Council, Institute for Information Technology). His current focus is Semantic Web knowledge representation using POSL (POsitional-Slotted Language) and RuleML (Rule Markup Language). He received his PhD and Habilitation degrees in Computer Science from the Universities of Hamburg and Kaiserslautern, Germany, respectively. He developed the Relational-Functional Markup Language (RFML) before starting and co-leading the Rule Markup Initiative. As a member of the Joint Committee he co-designed the Semantic Web Rule Language (SWRL), which combines the W3C-recommended Web Ontology Language (OWL) and RuleML. He also led the design of a First-Order Logic (FOL) web language and helped in the design of a Semantic Web Services Language (SWSL), both extending RuleML.


For example, the function first-born: Man × Woman → Human can be uninterpreted, so that first-born(John, Mary) just denotes the first-born child; or interpreted, e.g. using the definition first-born(John, Mary) = Jory, so that the application returns Jory. The distinction of uninterpreted vs. interpreted functions in RuleML 0.89 is marked up using two different element names. Proceeding to the increased generality of logic with equality (cf. Section 1), this should be changed to a single element name, <Fun>, with different attribute values, in="no" vs. in="yes", respectively. The use of a Function's interpreted attribute with values "no" vs. "yes" directly reflects uninterpreted vs. interpreted functions (those for which, in the rulebase, no definitions are expected vs. those for which they are). Functions' respective RuleML 0.89 applications with Cterm vs. Nano elements can then uniformly become <Expr>essions for either interpretedness. The two versions of the example can thus be marked up as follows (where "u" stands for "no" or "yes"):

<Expr>
  <Fun in="u">first-born</Fun>
  <Ind>John</Ind>
  <Ind>Mary</Ind>
</Expr>

In RuleML 0.89, as well as in RFML and its human-oriented Relfun syntax [4], this distinction is made on the level of expressions, the latter using square brackets vs. round parentheses for applications. Making the distinction through an attribute on the <Fun> rather than the <Expr> element will permit higher-order functions (cf. Section 6) to return, and use as arguments, functions that include interpretedness markup. A third value, "semi", is proposed for the interpreted attribute: semi-interpreted functions compute an application if a definition exists and denote unspecified values otherwise (via the syntactic data structure of the application, which we now write with Relfun-like square brackets). For example, when "u" stands here for "semi", the above application returns Jory if the definition first-born(John, Mary) = Jory exists, and denotes first-born[John, Mary] itself if no definition exists for it. Because of its neutrality, in="semi" is proposed as the default value. In both XML and UML (Unified Modeling Language) processing, functions (like relations in LP) are often set-valued (non-deterministic). This is accommodated by introducing a valued attribute with values including "1" (deterministic: exactly one) and "0.." (set-valued: zero or more). Our val specifications can be viewed as transferring to functions, and generalizing, the cardinality restrictions for (binary) properties (i.e., unary functions) in description logic and the determinism declarations for (moded) relations in Mercury [11].
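To make the three interpretedness modes concrete, here is a minimal Python sketch (ours, not part of RuleML or its tooling) of how an evaluator could treat in="yes", in="no", and in="semi"; the rulebase dictionary and the first-born definition merely mirror the running example:

rulebase = {("first-born", ("John", "Mary")): "Jory"}  # assumed example definition

def apply_fn(fn, args, interpreted="semi"):
    key = (fn, tuple(args))
    if interpreted == "no":
        # Uninterpreted (constructor): the application denotes itself.
        return key
    if interpreted == "yes":
        # Interpreted: a definition is expected to exist in the rulebase.
        return rulebase[key]
    # Semi-interpreted: compute if a definition exists, else stay syntactic.
    return rulebase.get(key, key)

print(apply_fn("first-born", ("John", "Mary")))  # -> 'Jory'
print(apply_fn("first-born", ("John", "Sue")))   # -> ('first-born', ('John', 'Sue'))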


For example, the set-valued function children: Man × Woman → 2^Human can be interpreted and set-valued, using the definition children(John, Mary) = {Jory, Mahn}, so that the application children(John, Mary) returns {Jory, Mahn}. The example is then marked up thus (other legal val values here would be "0..3", "1..2", and "2"):

<Expr>
  <Fun in="yes" val="0..">children</Fun>
  <Ind>John</Ind>
  <Ind>Mary</Ind>
</Expr>

Because of its highest generality, val="0.." is proposed as the default. While uninterpreted functions usually correspond to val="1", attribute combinations of in="no" with a val unequal to "1" will be useful when uninterpreted functions are later to be refined into interpreted set-valued functions (which along the way can lead to semi-interpreted ones). Interpretedness and valuedness constitute orthogonal dimensions in our design space, and are also orthogonal to the dimensions of the subsequent sections, although space limitations prevent the discussion of all of their combinations in this paper.

3 Nestings

One of the advantages of interpreted functions as compared to relations is that the returned values of their applications permit nestings, avoiding flat relational conjunctions with shared logic variables. For example, the function age can be defined for Jory as age(Jory) = 12, so the nesting age(first-born(John, Mary)), using the first-born definition of Section 2, gives age(Jory), which then returns 12. Alternatively, the function age can be defined for the uninterpreted first-born application as age(first-born[John, Mary]) = 12, so the nesting age(first-born[John, Mary]) immediately returns 12. Conversely, the function age can be left uninterpreted over the returned value of the first-born application, so the nesting age[first-born(John, Mary)] denotes age[Jory]. Finally, both the functions age and first-born can be left uninterpreted, so the nesting age[first-born[John, Mary]] just denotes itself. The four versions of the example can now be marked up thus (where "u" and "v" can independently assume "no" or "yes"):

<Expr>
  <Fun in="u">age</Fun>
  <Expr>
    <Fun in="v">first-born</Fun>
    <Ind>John</Ind>
    <Ind>Mary</Ind>
  </Expr>
</Expr>

Nestings are also permitted for set-valued functions, where an (interpreted or uninterpreted) outer function is automatically mapped over all elements of a set returned by an inner (interpreted) function. For example, the element-valued function age can be extended for Mahn with age(Mahn) = 9, and nested, interpreted, over the set-valued interpreted function children of Section 2: age(children(John, Mary)) via age({Jory, Mahn}) returns {12, 9}. Similarly, age can be nested uninterpreted over the interpreted children: age[children(John, Mary)] via age[{Jory, Mahn}] returns {age[Jory], age[Mahn]}. The examples can be marked up thus (only "u" is left open for "no" or "yes"):

<Expr>
  <Fun in="u">age</Fun>
  <Expr>
    <Fun in="yes" val="0..">children</Fun>
    <Ind>John</Ind>
    <Ind>Mary</Ind>
  </Expr>
</Expr>
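The automatic mapping over set-valued results can be sketched in the same illustrative Python style (again ours, with the age and children definitions of the running example; a syntactic application stands in for an uninterpreted result):

definitions = {
    ("children", ("John", "Mary")): {"Jory", "Mahn"},
    ("age", ("Jory",)): 12,
    ("age", ("Mahn",)): 9,
}

def evaluate(fn, *args):
    # Semi-interpreted application: fall back to the syntactic structure.
    return definitions.get((fn, args), (fn, args))

def nest(outer, inner_result):
    # Auto-map the outer function over a set returned by the inner one.
    if isinstance(inner_result, set):
        return {evaluate(outer, elem) for elem in inner_result}
    return evaluate(outer, inner_result)

print(nest("age", evaluate("children", "John", "Mary")))  # -> {9, 12}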

4 Unconditional Equations

In Sections 2 and 3 we have employed expression-defining equations without giving their actual markup. Let us consider these in more detail here, starting with unconditional equations. For this, we introduce a modified RuleML 0.89 <Equal> element, permitting both symmetric (or undirected) and oriented (or directed) equations via an oriented attribute with respective "no" and "yes" values. Since it is more general, oriented="no" is proposed as the default. Because of the potential orientedness of equations, the RuleML 0.89 role tag within the <Equal> type tag will be refined into <lhs> and <rhs> for an equation's left-hand side and right-hand side, respectively. For example, the Section 2 equation first-born(John, Mary) = Jory can now be marked up thus:

<Equal oriented="yes">
  <lhs>
    <Expr>
      <Fun in="yes">first-born</Fun>
      <Ind>John</Ind>
      <Ind>Mary</Ind>
    </Expr>
  </lhs>
  <rhs>
    <Ind>Jory</Ind>
  </rhs>
</Equal>




While the explicit <lhs> and <rhs> role tags emphasize the orientation, and are used as RDF (Resource Description Framework) properties when mapping this markup to RDF graphs, they can be omitted via stripe skipping: the <lhs> and <rhs> roles of <Equal>'s respective first and second subelements can still be uniquely recognized. This, then, is the stripe-skipped example:

<Equal oriented="yes">
  <Expr>
    <Fun in="yes">first-born</Fun>
    <Ind>John</Ind>
    <Ind>Mary</Ind>
  </Expr>
  <Ind>Jory</Ind>
</Equal>

Equations can also have nested left-hand sides, where often the following restrictions apply: the <Expr> directly in the left-hand side must use an interpreted function, while any <Expr> nested into it must use an uninterpreted function to fulfil the so-called "constructor discipline" [9]; the same holds for deeper nesting levels. If we want to obey it, we use in="no" within these nestings. An equation's right-hand side can use uninterpreted or interpreted functions on any level of nesting anyway. For example, employing binary subtract and nullary this-year functions, the equation age(first-born[John, Mary]) = subtract(this-year(), 1993) leads to this stripe-skipped 'disciplined' markup:

<Equal oriented="yes">
  <Expr>
    <Fun in="yes">age</Fun>
    <Expr>
      <Fun in="no">first-born</Fun>
      <Ind>John</Ind>
      <Ind>Mary</Ind>
    </Expr>
  </Expr>
  <Expr>
    <Fun in="yes">subtract</Fun>
    <Expr>
      <Fun in="yes">this-year</Fun>
    </Expr>
    <Data>1993</Data>
  </Expr>
</Equal>

5 Conditional Equations

Let us now proceed to oriented conditional equations, which use a (defining, oriented) <Equal> element as the conclusion of an <Implies> element, whose condition may employ other (testing, symmetric) equations. An equational condition may also bind auxiliary variables.


While condition and conclusion can be marked up with explicit role tags, respectively, also allowing the conclusion as the first subelement, we will use a stripe-skipped markup where the condition must be the first subelement. For example, using a unary birth-year function in the condition, and two ("?"-prefixed) variables, the conditional equation (written with a top-level "⇒") ?B = birth-year(?P) ⇒ age(?P) = subtract(this-year(), ?B) employs an equational condition to test whether the birth-year of a person ?P is known, assigning it to ?B for use within the conclusion. This leads to the following stripe-skipped markup:

<Implies>
  <Equal>
    <Var>B</Var>
    <Expr>
      <Fun in="yes">birth-year</Fun>
      <Var>P</Var>
    </Expr>
  </Equal>
  <Equal oriented="yes">
    <Expr>
      <Fun in="yes">age</Fun>
      <Var>P</Var>
    </Expr>
    <Expr>
      <Fun in="yes">subtract</Fun>
      <Expr>
        <Fun in="yes">this-year</Fun>
      </Expr>
      <Var>B</Var>
    </Expr>
  </Equal>
</Implies>

Within conditional equations, relational conditions can be used besides equational ones. For example, using a binary lessThanOrEqual relation in the condition, the conditional equation lessThanOrEqual(age(?P), 15) ⇒ discount(?P, ?F) = 30, with a free variable ?F (flight) and a data constant 30 (percent), gives this markup:

<Implies>
  <Atom>
    <Rel>lessThanOrEqual</Rel>
    <Expr>
      <Fun in="yes">age</Fun>
      <Var>P</Var>
    </Expr>
    <Data>15</Data>
  </Atom>
  <Equal oriented="yes">
    <Expr>
      <Fun in="yes">discount</Fun>
      <Var>P</Var>
      <Var>F</Var>
    </Expr>
    <Data>30</Data>
  </Equal>
</Implies>




Notice the following interleaving of FP and LP (as is characteristic of FLP): the function discount is defined using the relation lessThanOrEqual in the condition, and the <Atom> element for the lessThanOrEqual relation itself contains a nested <Expr> element for the age function. For conditional equations of Horn logic with equality in general [10], the condition is a conjunction of <Atom> and <Equal> elements, as shown in Appendix A.
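Read procedurally, the discount rule above behaves like the following Python sketch (our paraphrase; only the ages, the bound 15, and the result 30 come from the running example, everything else is assumed):

ages = {"Jory": 12, "Mahn": 9}  # assumed age data for the running example

def discount(person, flight):
    # Condition: lessThanOrEqual(age(?P), 15); ?F (the flight) stays free.
    if ages[person] <= 15:
        return 30  # Conclusion: discount(?P, ?F) = 30 (percent).
    raise LookupError("no discount equation applies")

print(discount("Jory", "some-flight"))  # -> 30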

6 Higher-Order Functions

Higher-order functions are characteristic of FP and thus should be supported by Functional RuleML. A higher-order function permits functions to be passed to it as (actual) parameters and to be returned from it as values. Perhaps the best-known higher-order function is Compose, taking two functions as parameters and returning as its value a function performing their sequential composition. For example, the composition of the age and first-born functions of Section 2 is performed by Compose(age, first-born). Here is the markup for the interpreted and uninterpreted use of both of the parameter functions (where we use the default in="semi" for the higher-order function and let "u" and "v" independently assume "no" or "yes" for the first-order functions):

<Expr>
  <Fun in="semi">Compose</Fun>
  <Fun in="u">age</Fun>
  <Fun in="v">first-born</Fun>
</Expr>

The application of a parameterized Compose expression to arguments is equivalent to the nested application of its parameter functions. For example, when interpreted with the definitions of Section 2, Compose(age, first-born)(John, Mary) via age(first-born(John, Mary)) returns 12. All four versions of this sample application can be marked up thus (with the usual "u" and "v"):

<Expr>
  <Expr>
    <Fun in="semi">Compose</Fun>
    <Fun in="u">age</Fun>
    <Fun in="v">first-born</Fun>
  </Expr>
  <Ind>John</Ind>
  <Ind>Mary</Ind>
</Expr>

Besides being applied in this way, a Compose expression can also be used as a parameter or returned value of another higher-order function. To allow the general construction of anonymous functions, Lambda formulas from λ-calculus [1] are introduced. A λ-formula quantifies variables that occur free in a functional expression much like a


∀-formula does for a relational atom. So we can extend principles developed for explicit-quantifier markup in FOL RuleML, where quantifiers are allowed on all levels of rulebase elements. For example, the function returned by Compose(age, first-born) can now be explicitly given as λ(?X, ?Y) age(first-born(?X, ?Y)). Here is the markup for its interpreted and uninterpreted use (with the usual "u" and "v"):

<Lambda>
  <Var>X</Var>
  <Var>Y</Var>
  <Expr>
    <Fun in="u">age</Fun>
    <Expr>
      <Fun in="v">first-born</Fun>
      <Var>X</Var>
      <Var>Y</Var>
    </Expr>
  </Expr>
</Lambda>

This Lambda formula can be applied just as the Compose expression was above. The advantage of Lambda formulas is that they allow the direct λ-abstraction of arbitrary expressions, not just of (sequential or parallel) compositions. An example is λ(?X, ?Y) plex(age(?X), age(?Y), age(first-born(?X, ?Y))), whose markup should be obvious if we note that plex is the interpreted analog of RuleML's uninterpreted built-in function for n-ary complex-term (e.g., tuple) construction. By also abstracting the parameter functions, age and first-born, Compose can be defined generally via a Lambda formula as Compose(?F, ?G) = λ(?X, ?Y) ?F(?G(?X, ?Y)). Its markup can distinguish object (first-order) variables like ?X vs. function (higher-order) ones like ?F via the attribute values ord="1" vs. ord="h".
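In ordinary functional-programming terms, the Compose and Lambda machinery of this section corresponds to the following Python sketch (ours; the age and first-born definitions mirror the running example):

def first_born(x, y):
    return "Jory" if (x, y) == ("John", "Mary") else None

def age(p):
    return {"Jory": 12, "Mahn": 9}.get(p)

def compose(f, g):
    # Compose(?F, ?G) = lambda(?X, ?Y) ?F(?G(?X, ?Y))
    return lambda x, y: f(g(x, y))

print(compose(age, first_born)("John", "Mary"))              # -> 12
print((lambda x, y: age(first_born(x, y)))("John", "Mary"))  # -> 12 (lambda-defined equivalent)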

7 Conclusions

The design of Functional RuleML as presented in this paper also benefits other sublanguages of RuleML, e.g. because of the more 'logical' complex terms. Functional RuleML, as a development of FOL RuleML, could furthermore benefit all of SWRL FOL. However, there are some open issues, two of which will be discussed below. Certain constraints on the values of our attributes cannot be enforced with DTDs (Document Type Definitions; cf. Appendix A) and are hard to enforce with XSDs (XML Schema Definitions), e.g. in="no" on functions in call patterns in case we wanted to always enforce the constructor discipline (cf. Section 4). However, a semantics-oriented validation tool will be required for future attributes anyway, e.g. for testing whether a rulebase is stratified. Thus we propose that such a static-analysis tool should be developed to make fine-grained distinctions for all 'semantic' attributes.


The proposed defaults for some of our attributes may require further revision. It might be argued that the default in="semi" for functions is a problem, since equations could be invoked inadvertently for functions that are applied without an explicit in attribute. However, notice that the default oriented="no" for equations permits any function call to be 'reverted', using the same equation in both directions. Together, those defaults thus constitute a kind of 'vanilla' logic with equality, which can (only) be changed via our explicit attribute values. While our logical design does not specify any evaluation strategy for nested expressions, we have preferred 'call-by-value' in implementations [5]. A reference interpreter for Functional RuleML is planned as an extension of OO jDREW [2]; the first step has been taken by implementing oriented ground equality via an EqualTable data structure for equivalence classes.

References
[1] Henk Barendregt. "The Impact of the Lambda Calculus in Logic and Computer Science". The Bulletin of Symbolic Logic, 3(2):181–215, 1997.
[2] Marcel Ball, Harold Boley, David Hirtle, Jing Mei, and Bruce Spencer. "The OO jDREW Reference Implementation of RuleML". In Proc. Rules and Rule Markup Languages for the Semantic Web (RuleML-2005). LNCS 3791, Springer-Verlag, November 2005.
[3] Paul A. Bailes, Colin J. M. Kemp, Ian Peake, and Sean Seefried. "Why Functional Programming Really Matters". In Applied Informatics, pp. 919–926, 2003.
[4] Harold Boley. "Functional-Logic Integration via Minimal Reciprocal Extensions". Theoretical Computer Science, 212:77–99, 1999.
[5] Harold Boley. "Markup Languages for Functional-Logic Programming". In Proc. 9th International Workshop on Functional and Logic Programming, Benicàssim, Spain, pp. 391–403, Servicio de Publicaciones de la UPV, Valencia, September 2000.
[6] Harold Boley. "Object-Oriented RuleML: User-Level Roles, URI-Grounded Clauses, and Order-Sorted Terms". In Proc. Rules and Rule Markup Languages for the Semantic Web (RuleML-2003). LNCS 2876, Springer-Verlag, October 2003.
[7] Achille Fokoue, Kristopher Rose, Jérôme Siméon, and Lionel Villard. "Compiling XSLT 2.0 into XQuery 1.0". In Proceedings of the Fourteenth International World Wide Web Conference, pp. 682–691, Chiba, Japan, May 2005. ACM Press.
[8] Dov Gabbay, Christopher Hogger, and J. A. Robinson (eds.). Handbook of Logic in Artificial Intelligence and Logic Programming, vol. 5: Logic Programming. Oxford University Press, Oxford, 1998.
[9] M. J. O'Donnell. Equational Logic as a Programming Language. MIT Press, Cambridge (Mass.), 1985.
[10] P. Padawitz. "Computing in Horn Clause Theories". EATCS Monographs on Theoretical Computer Science, vol. 16, Springer, 1988.


[11] Z. Somogyi, F. Henderson, and T. Conway. "The Execution Algorithm of Mercury, an Efficient Purely Declarative Logic Programming Language". Journal of Logic Programming, 29(1-3):17–64, 1996.

Acknowledgements
Thanks to David Hirtle, Duong Dai Doan, and Thuy Thi Thu Le for helpful discussions and for improving the DTD. This research was partially supported by NSERC.

Appendix A: A DTD for Functional RuleML

A DTD for our stripe-skipped version of Functional RuleML is given below. It mainly consists of declarations specifying the Assertion of a rulebase with zero or more Implies/Atom/Equal clauses. We introduce here for Relations interpretedness distinctions analogous to those for Functions, where the novel uninterpreted Rel accommodates embedded propositions of modal logics. An Expression, say f[i], with an uninterpreted function, here f, can itself be used as the uninterpreted or interpreted function of another expression, e.g. f[i][a] or f[i](a); to specify this distinction, such a 'function-naming' Expression also needs an interpreted attribute. For DTD-technical reasons, only the two most important values are specified for the val attribute (similarly, only two ord values are given). The DTD also does not enforce context-dependent attribute values, such as oriented="no" normally being used in conditions. Moreover, while the DTD does not prevent Lambda formulas from occurring on the lhs of (both kinds of) equations, a static analyzer should confine them to the rhs of oriented equations. A more precise XSD is part of the emerging Functional RuleML 0.9.
<!ENTITY % term "(Data | Ind | Var | Expr)">
<!ENTITY % ateq "(Atom | Equal)">
<!ENTITY % conclusion "(%ateq;)">
<!ENTITY % condition "(And | %ateq;)">

<!ELEMENT Assert (Implies | %ateq;)*>
<!ELEMENT Implies (%condition;, %conclusion;)>
<!ELEMENT And (%ateq;)*>
<!ELEMENT Equal (%term;, %term;)>
<!ELEMENT Atom ((Rel | Expr | Lambda), (%term; | Rel | Fun | Lambda)*)>
<!ELEMENT Expr ((Fun | Expr | Lambda), (%term; | Rel | Fun | Lambda)*)>
<!ELEMENT Lambda ((%term;)+, %term;)>
<!ELEMENT Fun (#PCDATA)>
<!ELEMENT Rel (#PCDATA)>
<!ELEMENT Data (#PCDATA)>
<!ELEMENT Ind (#PCDATA)>
<!ELEMENT Var (#PCDATA)>

<!ATTLIST Equal oriented (yes | no) "no">
<!ATTLIST Expr in (yes | no | semi) "semi">
<!ATTLIST Rel in (yes | no | semi) "semi">
<!ATTLIST Fun in (yes | no | semi) "semi">
<!ATTLIST Fun val (1 | 0..) "0..">
<!ATTLIST Var ord (1 | h) "h">

The Semantic Web

Towards Semantic Desktop Wikis Malte Kiesel and Leo Sauermann To manage information on a personal computer, tools are needed that allow easy entering of new knowledge and that can relate ideas and concepts to existing information. Wikis allow entering information in a quick and easy way. They can be employed for both collaborative and personal information management. Semantic Web standards such as RDF(S) (Resource Description Framework) and OWL (Web Ontology Language) provide means to represent formalized knowledge. Using these standards to represent relations between individual desktop data sources, an integrated view of the user’s information can be realized, known as the Semantic Desktop. In this paper, we propose combining information represented using Semantic Web standards with the simple information management known from wikis. The result is a Semantic Desktop Wiki, which can form a melting pot for ideas and personal information management.

Keywords: Personal Information Management, Semantic Desktop, Semantic Web, Semantic Wiki, Wiki.

1 Introduction

The Semantic Desktop [10] is about creating a network of associations between your sources of information - for example, text documents, web bookmarks, and calendar entries. However, relating these resources semantically requires "semantic glue": when connecting resources, you need a way to express why the resources are connected. Ideally, this information would be located within the resources. Unfortunately, often one cannot or does not want to tamper with the resources themselves. In real life, this problem is not new. If you want to organize many files in a filesystem, you often resort not to using descriptive file and directory names exclusively but to creating new files - README files come to mind. In this example, file and directory names correspond to properties, files correspond to resources, and the READMEs correspond to resources that do not fit into the standard heavyweight resource schema but denote a lightweight information resource. So, is there a kind of widely accepted standard for quickly writing down information that fits in the modern web-based information world? We argue that wikis fit this description. Nowadays, wikis are used for a wide range of applications, from the well-known Wikipedia [12] to corporate intranet applications, and personal wikis, which are the equivalent of a personal notepad. Therefore, it is natural not only to enable standard application data to be linked semantically, but to use wikis for supplying the semantic glue that is necessary for this network to function fully - providing powerful but at the same time simple ways of relating your data. We will take a look at existing wiki implementations



and the upcoming Semantic Wikis. Then we will show how data of the Semantic Desktop can be integrated into wiki pages, giving the opportunity to combine wiki text with structured information. Not only can information from outside the wiki be included - information authored inside the wiki can also be used to augment information present in external resources. Finally, we will present our conclusions and further work.

2 Wikis and Their Problems

Current applications of wikis range from open encyclopaedias and collaborative information spaces (most notably in open source software projects, but one can find wikis in almost every project that needs documentation to be created collaboratively; in software project management software, wikis are already a major feature, see Trac as an example) to personal notepads (WikidPad). For some applications even specialized wiki implementations exist. For example, MediaWiki is crafted for Wikipedia [12], focusing on the requirements of the encyclopaedia use case.

Malte Kiesel studied Computer Science at the University of Kaiserslautern, Germany, actively working on several open source projects during his studies. His master thesis was about "Generating and Integrating Evidence for Ontology Mappings". Working at DFKI (Deutsches Forschungszentrum für Künstliche Intelligenz - German Research Center for Artificial Intelligence) Knowledge Management since 2004, he participates in the SmartWeb project, which focuses on making web contents available on mobile devices using Semantic Web technologies. In parallel, he works on extending standard communication and knowledge management software with semantic features. He is an experienced programmer in Java.

Leo Sauermann studied Information Science at the Vienna University of Technology, Austria. Under the project name "gnowsis" he merged Personal Information Management with Semantic Web technologies, resulting in a master thesis about "Using Semantic Web technologies to build a Semantic Desktop". Working as a researcher at the DFKI since 2004, he has continued this work and now maintains the associated open-source project gnowsis. His research focus is on the Semantic Web and its use in Knowledge Management, and he gives talks about the Semantic Desktop. From 1998 to 2002 he worked in several small software companies, including the position of lead architect at Impact Business Computing, developing mobile CRM solutions. He is an experienced programmer in both Delphi and Java.



2.1 Simple to The Limits

One problem with most existing wiki implementations is that they take the keep-it-simple approach too far. While it is a good idea to make editing information as simple as possible, relying totally on basic wiki functionality for simple state-of-the-art structuring techniques is of doubtful benefit. For example, it may be technically interesting to base a category system on backlinks (backlinks are the set of pages pointing to the currently viewed page, which is, in this case, a page representing a page category; for example, if we create a page named CategoryCompany and insert a link to it on every page that has a company as subject, we find all company pages by looking at the backlinks of the CategoryCompany page), but applying a generic technique (such as backlinks) to implement a specialized application (a category system) quickly becomes uncomfortable - nevertheless, this is a common practice in wiki implementations. We argue that, while giving the user freedom when entering information and thus providing a low entrance barrier, a wiki should also provide more elaborate means to express information. These should be available as an optional feature in order to keep the entry barrier low. One of these "more elaborate means of expressing information" is to introduce semantics.

2.2 Readable for Whom?

To provide a qualitative idea of the term semantics in the current context, one can say that it is about the differences in what words and statements mean to one person, to another person, and to computer programs. Only humans are able to read and understand the texts contained in a wiki - for machines, without sophisticated processing, the wiki is a large number of text pages which link to each other or, more technically speaking, a set of strings (page contents) indexed by strings (page names), interlinked with each other. This is a sad fact since, in this way, much information contained in wikis is either de facto irretrievable or requires vast effort to exploit. Take Wikipedia as an example: almost all common knowledge is in there; however, one cannot find the most important philosophers of a certain epoch, since the data about the individual philosophers' lifetimes is present, but only in a human-readable, not machine-readable way. It has to be noted that there are several ways of solving this problem, most notably the standard "wiki way", which is to simply set up a page for every epoch, describing its most important people. However, it is clear that this neither scales nor satisfies more complex queries.


2.3 Metadata for Wikis

This problem can be solved by supporting the addition of metadata to the wiki's contents, leading to a Semantic Wiki. This can come in various forms: a simple kind of metadata describing the structure of a wiki page consists, for example, in using a concept hierarchy and extending the wiki's contents to use that hierarchy. In a more formal manner, this means linking each wiki page to a concept using a has-subject property. Obviously, categories are very coarse-grained metadata. However, one can think of metadata that are arbitrarily complex, ultimately leading to a formalized representation of the complete wiki's contents. In practice, one will use something in the middle between these extremes. For example, the Wikipedia community is trying to extend the wiki implementation used by Wikipedia with typed links [5], meaning that, in addition to every link you write in the text of the wiki page, you can specify the link's type. E.g., when referencing Germany on the wiki page describing Berlin, the author is able to tag the page's link to Germany with the fact that the link uses the type is-capital-of. Since the Semantic Desktop is also about managing your personal information sources such as files, a Semantic Desktop Wiki should provide means to incorporate references to these information sources into the personal knowledge network.

2.4 Linking to External Resources

A problem when trying to integrate the wiki idea into the Semantic Desktop scenario is that standard, web-based wikis only poorly support linking to desktop resources. The URL/hyperlink idea the WWW (World Wide Web) is based on simply does not support linking to local resources. While it is possible to use workarounds (such as file:// URLs for linking to local files, or e-mail message ids for linking to e-mails), this is cumbersome and impractical on a larger scale (both workarounds have their problems: file:// URLs only work on the host they are intended for but do not bear a link to that host, and e-mail message ids identify a mail but do not supply a clue on how to retrieve it). The Semantic Desktop framework [10] provides means to identify and link resources by associating every resource present on the desktop with a URI, so a Semantic Wiki can make use of this and integrate with desktop resources.

3 Semantic Desktop Wikis

Taking the capabilities of Semantic Wikis, we can create a more productive work environment based on the simplicity of wikis, the semantic power of RDF (Resource Description Framework), and the vast data sources available on the Semantic Desktop. In this section we will show how these three approaches can be merged. The Semantic Desktop is an approach to bring Semantic Web technologies to desktop computers. An overview is given in [9]. The social aspects and a possible roadmap for future developments can be found in Stefan Decker's work [3].


First, wikis are suited for Personal Information Management (PIM). For example, in a customer-relationship-management scenario, the salesperson Peter of company AcmeWear may use a personal wiki to write down information about his customers. It would be possible to generate wiki pages for products, for customers, and for business meetings where Peter meets his customer, the Freistein company. The product of choice would be Security Glove MKL, which the customer needs to handle chemical samples. Peter could now write a wiki note with the following text (bold words are links to other pages).

Title: Customer Freistein.
Text: Kent Brockman working at Freistein noted that he is interested in the SecurityGloveMKL for use in their chemical lab facility. Workers there have to handle glycol and AG342 and are unhappy with the existing gloves by our competitor WorseWear. I already had a good phone call and made an offer, see OfferGU1234.

In such a note, Peter is able to capture the current status of his relationship with the Freistein company and Kent Brockman, the purchasing agent. We are aware that existing customer-relationship-management (CRM) solutions like Siebel Sales, Update.com, or SAP can handle this scenario, but they do not provide the freedom and flexibility of a wiki system. For example, the above text can be entered in a wiki as free text, while in a relational system each link target (Freistein) has to be searched for first; the flexibility of the wiki also allows context information about Freistein to be noted down, such as "their managers play golf at KingsGolfClub", whereas a relational-database system would require the concept of golf clubs to be modelled beforehand. If AcmeWear considers installing a wiki system on Peter's laptop computer to support him in his CRM, a few questions will arise:
• How to integrate existing CRM information (telephone numbers, etc.) into the wiki?
• How to integrate the information in the company's product catalogue and enterprise resource planning (ERP) system, to get offers, prices, etc.?
• Can Peter have reports on all e-mails and phone calls he exchanges with the customer, right from the wiki?

3.1 Integrating External Data Sources

These are typical questions that AcmeWear's IT department has to face in order to make the system more profitable for the company. Using conventional wikis, integration with outside data sources is nearly impossible. Usually wikis only allow links inside the wiki to information that was already entered. In a company scenario, where not all information is kept in the wiki but is instead spread over the e-mail system, the ERP system, and others, a wiki has to be integrated. Also, if the wiki is a personal information source and not a company-wide, shared one, Peter will have little motivation to enter everything from product codes to customers' telephone numbers. Semantic Wikis, as shown in the previous section, offer a solution to the problem of data integration. The straightforward approach is to build adapters that convert the existing data sources into an RDF representation and integrate these into the wiki. But this leaves us with the problem of adapters and their integration. A Semantic Desktop [8] allows the integration of various data sources. Using this approach, all available data sources would first be integrated into a Semantic Desktop data integration framework, and the Semantic Wiki would then use the Semantic Desktop to access the information. Introducing this in-between layer allows developers and IT departments to concentrate on providing adapters that bring company information sources into the Semantic Desktop. Data sources can either be treated as virtual RDF graphs or be buffered completely in RDF databases, the first approach requiring a bigger effort. The work by Bizer and Seaborne on adapting SQL databases [2] sheds light on how to integrate large SQL databases through virtual graphs and mapping definitions; for web data sources, the SECO paper [4] gives some hints. Our own approach to data integration using heterogeneous data sources is described and discussed in [11]. Ready-built adapters can be downloaded from the SourceForge project Aperture or from collections like SIMILE's RDF-izers. Once these adapters and converters exist, they can be integrated into the Semantic Desktop framework. Serving as a data integration hub, it allows querying of all data sources using standardized protocols and query languages such as SPARQL (Simple Protocol and RDF Query Language) [7]. SPARQL is the equivalent of SQL applied to RDF, making it the tool of choice for data integration tasks. On this basis, data from different external data sources can be integrated into wiki pages without having to adapt to all the different systems. Common wiki engines provide APIs (Application Programming Interfaces) and extension points that can be used to integrate the Semantic Desktop features. A typical extension to wikis is a special tag to link to pages on other wikis or external websites. The Semantic Desktop can be seen as such an external website: each resource (identified by a URI - Uniform Resource Identifier) represents an external page. Data about telephone numbers, invoices, products, etc. can be queried using the query language SPARQL, and query results are then rendered into customized wiki pages (see the sketch below). Based on a Semantic Desktop framework, it is possible to answer the three questions of AcmeWear:
• Contact information (telephone numbers, etc.) is integrated by converting the existing address book into RDF and providing it as a Semantic Desktop data source.
• AcmeWear's product catalogue and ERP system are also adapted, using dynamic adapters that can translate questions posed in RDF to retrieve offers, prices, products, invoices, stock levels, etc. The wiki contacts the ERP system via the Semantic Desktop.
• Peter can have reports on e-mails, product offers, invoices, stock, etc. via automatic queries into the ERP system via the Semantic Desktop data integration hub.
This is the first advantage of the Semantic Desktop: any application (in our case, the Semantic Wiki) can access information from other data sources. The next innovation of a Semantic Desktop Wiki is the way users can author the content. The creation of wiki pages requires that users know the titles of existing wiki pages (i.e., they usually have to know the exact spelling and structure of the wiki to create links). This manual authoring of wiki pages is a drawback in conventional wikis and remains a problem in Semantic Wikis. In the following section we will take a look at existing (Semantic) Wikis and at ways to improve them.
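As a rough illustration of this query-and-render step (our sketch, not taken from gnowsis or any shipping wiki engine; the URIs, property names, and export file are invented, and the rdflib library merely stands in for whatever RDF store the hub uses), the wiki could fetch Freistein contact data like this:

from rdflib import Graph

g = Graph()
g.parse("desktop-hub-export.rdf")  # assumed RDF dump of the integrated sources

query = """
    PREFIX crm: <http://example.org/crm#>
    SELECT ?name ?phone WHERE {
        ?contact crm:worksFor <http://example.org/companies/Freistein> ;
                 crm:name ?name ;
                 crm:phone ?phone .
    }
"""

for row in g.query(query):
    # Render each result row as a line of wiki text.
    print("* %s: %s" % (row.name, row.phone))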

4 Introducing Semantic Wikis

Several wiki implementations exist that provide the basic wiki features and also aim to address the problems indicated above. In [6], an overview of Semantic Wikis and personal wikis is given, resulting in the description of SemperWiki, which addresses the problems of Semantic Desktop Wikis. Let us take a look at the ways metadata are implemented in different wiki implementations. In most traditional wikis, the idea of metadata typically only appears in a very technical way. For example, in JSPWiki, metadata is added directly into the wiki text using special tags and mostly serves the purpose of implementing access control. In SnipSnap, metadata come by means of labels that can be attached to wiki pages, which form a kind of categorization scheme. The Semantic Wiki Platypus adds RDF(S) and OWL (Web Ontology Language) metadata to wiki pages. Metadata have to be entered separately from the wiki text and relate a wiki page to another resource; thus, metadata can be transformed into a list of related pages that can be shown along with the actual wiki page. The Semantic MediaWiki [5] is an extension of MediaWiki, the software used by Wikipedia. Again, metadata associated with a wiki page may point to other resources, but here data literals are also allowed. Also, metadata are entered directly into the wiki text and do not have to adhere to a schema. Rhizome builds on a framework that adapts techniques such as XSLT (Extensible Stylesheet Language Transformations) and XUpdate to RDF. In essence, RDF is used throughout the framework for almost everything, and RxSLT (an XSLT variant adapted for RDF) is used for transforming queries' results to HTML (HyperText Markup Language) or other

3

Building the keyword list manually has its drawbacks. We intend to experiment with techniques known from natural language processing for automatic keyword extraction as well as incorporating linguistic ontology annotations [1] which also support multilinguality.

© Novática

output formats. Page metadata have to be entered separately from the page. While the approach is very interesting from a technical point of view, the current implementation requires a lot of experience with the underlying techniques. So, current Semantic Wikis lack of features concerning extraction and usage of metadata - users have to enter metadata manually, and the only means of querying the metadata is either very simple queries built with a user interface, or very complex queries entered manually as text in a query language. Let us take a look at how better metadata handling and exploitation could be achieved.

4.1 Coupling Semantics with the Wiki's Contents
In a standard wiki, certain words (written by the user in "CamelCase" or highlighted using special characters) indicate that these words get linked to the wiki page describing the corresponding topic. In our Semantic Wiki prototype Kaukolu, we take the occurrence of keywords as evidence that semantic concepts and relations occur in the text. For example, let's imagine that we are editing a page named MillersHomepage containing the text "Mysoftware is written in Perl". The wiki links Mysoftware to the instance of some RDF class, written in to some RDF property, and Perl again to an RDF instance; the wiki thus concludes that these three RDF resources occur, and the user may build an RDF triple out of the three recognized resources. This new triple is independent of the page and the user that created it. The list of keywords is generated manually: these 'semantic' keywords link to semantics much like normal WikiWords/page titles link to wiki pages. (Building the keyword list manually has its drawbacks. We intend to experiment with techniques known from natural language processing for automatic keyword extraction, as well as incorporating linguistic ontology annotations [1], which also support multilinguality.)
Providing a formalization of a text in this way is quite an expensive process, as both vocabulary and resources must be created and looked up again when creating instance data (the formalized knowledge). However, this is partly the same problem that occurs when writing standard wiki pages: you have to either look up or remember the name of the page that describes the concept you are currently talking about. Typically, one has to stop writing numerous times and start searching for the proper page name. This problem can be partly solved with several techniques; for example, features such as autocompletion (implemented, for instance, with ECMAScript) should simplify the formalization process greatly from the user's point of view.
Creating RDF instances is only part of the problem. In order to build an ontology-enabled wiki, conformance of the RDF instances to an RDF Schema (RDFS) should be checked, possible properties should be proposed, and ultimately one should be able to create new ontologies. Currently, Kaukolu truly supports none of these features: building RDF Schemas is possible only because RDF Schemas are formulated in RDF (which can be created by Kaukolu), but no special support for building RDF Schemas is available. In the future, we plan to support the user when building RDF instances by listing properties defined in the RDFS class (a kind of semantic TODO editing help). Checking instances against their schema and marking consistent versions of the wiki would be another step in the direction of an ontology-enabled wiki.
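As an illustration, the Turtle triples below sketch what Kaukolu might derive from the MillersHomepage example; the namespace and class names are invented for the example and are not Kaukolu's actual vocabulary.

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix ex:  <http://example.org/wiki/vocab#> .

    # Resources recognised behind the keywords in
    # "Mysoftware is written in Perl":
    ex:Mysoftware rdf:type ex:SoftwareProject .
    ex:Perl       rdf:type ex:ProgrammingLanguage .

    # The triple the user may build from the three recognised
    # resources; it is stored independently of MillersHomepage:
    ex:Mysoftware ex:writtenIn ex:Perl .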

4.2 Building Metadata Queries
So now that we have metadata, we need ways to exploit it. Most Semantic Wikis support very simple queries ("List all resources this resource is related to") and hand-crafted, arbitrarily complex advanced queries. A simple way to formulate queries of medium complexity would be desirable. One solution would be to assist the user by keeping track of the link types the user traversed when using the wiki, and offering these types again when entering the query. Query refinement that takes user feedback on query results into account could also be implemented.
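A query of medium complexity, assembled from link types the user has already traversed, might look like the following SPARQL sketch (the vocabulary is assumed for illustration):

    PREFIX ex: <http://example.org/wiki/vocab#>
    # "Which pages describe software that is written in Perl?"
    SELECT ?page
    WHERE {
      ?page    ex:describes ?project .
      ?project ex:writtenIn ex:Perl .
    }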

5 Conclusion and Further Work
Semantic Wikis will allow a combination of the best of both breeds: the ease of authoring content known from wikis and the explicit semantic information of the Semantic Web. When Semantic Wikis are employed on the Semantic Desktop, they can be integrated into personal information management (PIM) scenarios. First, the integration of diverse external data sources like ERP and other PIM systems allows the user to reuse existing information from systems like MS Outlook in his personal Semantic Wiki. Then, complex queries can be formulated and their results displayed inside wiki pages, allowing the user to get an integrated view of information in the wiki. Finally, we showed approaches that improve the user interface of such applications.
We plan to improve our Semantic Wiki prototypes and to integrate them with our Semantic Desktop framework gnowsis. At the moment, our experiments have been conducted using three separate prototypes that emphasize different aspects. First, a wiki integrated with an early gnowsis version in 2003 contained an integrated web interface. Second, the current 2005 version of gnowsis features a Java Swing GUI (Graphical User Interface) for the wiki that supports drag and drop, and semantic search capabilities. Finally, Kaukolu is our prototype for a completely RDF-based Semantic Wiki, but it currently offers no Semantic Desktop integration. Integrating these three projects will be the next challenge. A downloadable example application will be part of this.
Our aim is to create a personal Semantic Wiki that still provides all the benefits known from wikis: ease of use, low entry barrier, free in form and semantics. Beyond the basic features, users can add explicit semantics to their wiki pages, annotating information inside the wiki as well as resources of their desktop data sources. These extended annotation possibilities and the extended querying and reporting functions create a new form of wiki: the personal Semantic Wiki.


References
[1] P. Buitelaar, M. Sintek, and M. Kiesel. Integrated Representation of Domain Knowledge and Multilingual, Multimedia Content Features for Cross-Lingual, Cross-Media Semantic Web Applications. In Proceedings of the ISWC 2005 Workshop on Knowledge Markup and Semantic Annotation, 2005.
[2] C. Bizer and A. Seaborne. D2RQ - Treating Non-RDF Databases as Virtual RDF Graphs. In Proceedings of the 3rd International Semantic Web Conference (ISWC2004), 2004.
[3] S. Decker and M. Frank. The Social Semantic Desktop. WWW2004 Workshop on Application Design, Development and Implementation Issues in the Semantic Web, 2004.
[4] A. Harth. SECO: Mediation Services for Semantic Web Data. IEEE Intelligent Systems, 19(3):66-71, May-June 2004.
[5] M. Krötzsch, D. Vrandecic, and M. Völkel. Wikipedia and the Semantic Web - The Missing Links. In Proceedings of Wikimania 2005, Wikimedia Foundation, July 2005.
[6] E. Oren. SemperWiki: A Semantic Personal Wiki. In Proceedings of the 1st Semantic Desktop Workshop at ISWC2005, 2005.
[7] E. Prud'hommeaux and A. Seaborne. SPARQL Query Language for RDF. W3C Working Draft, W3C, 2005.
[8] L. Sauermann. The gnowsis - Using Semantic Web Technologies to Build a Semantic Desktop. Diploma thesis, Technical University of Vienna, 2003.
[9] L. Sauermann, A. Bernardi, and A. Dengel. Overview and Outlook on the Semantic Desktop. In Proceedings of the 1st Workshop on The Semantic Desktop, 4th International Semantic Web Conference (Galway, Ireland), 2005. S. Decker, J. Park, D. Quan, and L. Sauermann (Eds.).
[10] L. Sauermann and S. Schwarz. Introducing the Gnowsis Semantic Desktop. In Proceedings of the International Semantic Web Conference 2004, 2004.
[11] L. Sauermann and S. Schwarz. Gnowsis Adapter Framework: Treating Structured Data Sources as Virtual RDF Graphs. In Proceedings of ISWC2005, 2005.
[12] Wikipedia, the free encyclopaedia.


Towards Semantically-Interlinked Online Communities
Uldis Bojars, John G. Breslin, Andreas Harth, and Stefan Decker
Online community sites have replaced the traditional means of keeping a community informed via libraries and publishing. At present, online communities are islands that are not interlinked. Ontologies and Semantic Web technologies offer an upgrade path to providing more complex services. We present the SIOC (Semantically-Interlinked Online Communities) ontology, which combines terms from existing vocabularies with new terms needed to describe the relationships between concepts in the realm of online community sites.

Keywords: Knowledge Management, RDF, Online communities, Ontologies, Semantic Web, Weblogs.

1 Introduction
At the moment, most online communities are islands that are not linked. Sites are hosted on stand-alone systems that cannot be interconnected due to application and interface differences. Parallel discussions on interrelated topics may exist on a number of sites, but their users are unaware of this. There is a huge amount of related information that could be harnessed across online communities, from similar member profile details to common-topic discussion fora. The goal of SIOC (Semantically-Interlinked Online Communities) is to interconnect these online communities. Community sites can include many discussion primitives, such as bulletin boards, weblogs and mailing lists, which we have grouped under the concept of a forum. SIOC will facilitate the location of related and relevant information; by searching on one forum, the ontology and interface will allow users to find information on fora from other sites that use a SIOC-based system architecture. Other uses include cross-site querying, topic-related searches, and the importing of SIOC data into other systems. Therefore, SIOC tries to overcome the serious limitations of current sites in making information accessible to their users in an efficient manner [6].
In a typical usage scenario, a user is searching for information on, for example, installing broadband on a Linux-based PC in their house in Galway. There is a post A discussing local ISPs (Internet Service Providers) on site 1, a bulletin board dedicated to Galway, that references (on the HTML, HyperText Markup Language, level) both a Usenet post B comparing broadband modems and a mailing list post C detailing how to install broadband on Linux. Previously the user would have had to traverse three sites to find the relevant information. However, by making use of the SIOC ontology and remote RDF (Resource Description Framework) querying, a search for broadband on the Galway bulletin board will also yield the relevant text from the interlinked Usenet and mailing list posts B and C.
There are some challenges for SIOC. The grand challenge is adoption by community sites, i.e. how users can be enticed to make use of the SIOC ontology. By using concepts that can be easily understood by site administrators, and by providing properties that are automatically created by an end-user, the SIOC ontology can be adopted in a useful way. A second challenge is how best to use SIOC with existing ontologies. This can be partially solved by mappings and interfaces to commonly-used ontologies. Another challenge is how SIOC will scale. We will keep the scaling challenge in mind when creating a future architecture for an interconnected system of community sites.
The main contributions of this paper are the development of the SIOC ontology and mappings to other RDF vocabularies, and prototypes to produce SIOC metadata from community sites. The remainder of this paper is organised as follows. In Section 2, we describe the SIOC ontology and mappings to other existing vocabularies. In Section 3, we discuss the exchange of SIOC instances. Section 4 describes some usages of the created instances, and related work is discussed in Section 5. Section 6 concludes the paper.

Uldis Bojars is currently studying for his PhD at DERI (Digital Enterprise Research Institute), National University of Ireland, Galway (NUI Galway). His research interests include semantic matching of skills, social networks and online community discussions. John G. Breslin received his PhD at the National University of Ireland, Galway. He is a Postdoctoral Researcher at DERI (Digital Enterprise Research Institute), NUI Galway, Ireland. His research interests include social networks and online communities. Andreas Harth is currently studying for his PhD at DERI (Digital Enterprise Research Institute), NUI Galway, Ireland. His research interest is data interoperation on the Web. Stefan Decker received his PhD at the University of Karlsruhe, Germany. He is an Executive Director and Adjunct Lecturer at DERI (Digital Enterprise Research Institute), NUI Galway, Ireland. His research interests include the Semantic Web and P2P technologies.


2 SIOC Ontology
In this section we present the SIOC ontology. The ontology consists of two major parts: classes and properties that describe the information in online community sites, and mappings that relate SIOC to existing vocabularies. We have identified the main concepts in online communities. The ontology is available online, and its structure is shown in Figure 1.

2.1 Main Classes
We list the major classes that are used in the SIOC ontology, and describe their usage in more detail.
Site is the location of an online community or set of communities, with users in groups creating posts on a set of fora. While an individual forum or group of fora are usually hosted on a centralised site, in the future the concept of a 'site' may be extended (for example, a topic thread could be formed by posts in a distributed forum on a peer-to-peer environment).
Forum is a discussion area on which posts are made. A forum can be linked to the site that hosts it. Fora will usually discuss a certain topic or set of related topics. The hierarchy of fora can be defined in terms of parents and children, allowing the creation of structures conforming to topic categories. Examples of fora include mailing lists, online bulletin boards, Usenet newsgroups and weblogs.
Post is an article or message posted by a user to a forum. A series of posts may be threaded and are connected by reply relationships. Posts have content and may also have attached files. Posts may have one or many topics.
User is an online account of a member of an online community. Users are connected to posts that they create or edit, to fora that they are subscribed to or moderate, to sites that they administer, and to other users that they know. Users can be grouped for the purposes of allowing access to certain fora or enhanced community site features.

2.2 Important Properties
In the next paragraphs, we describe some important properties of SIOC concepts.
topic: A topic definition applies to most of the concepts defined above, and topic metadata can be a useful way to match users and posts to each other. Users or groups of users can define topics of interest when their profiles are created or modified. As regards posts, while it may be more difficult to require a user to assign a topic to a post at creation time, it is more likely that a forum will have an associated topic or set of topics that can be propagated to the posts it contains. Topics can also be assigned to posts via predefined category hierarchies and free-text keywords (using 'folksonomy' tagging). In order to enable the location of related information across sites, the SKOS framework [7] can be used to define the concepts represented by the topics or tags, and to link topics between community sites.
has_creator: The has_creator property links a post to the user profile of its author. Thus, we can follow the link from the post to the creator and locate the other posts by the same person. The community can be seen as a network of posts with users linked to each post, and there is also a network of other posts created by a given user stemming from there. An example of SIOC instance data using these classes and properties is sketched below.
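The following Turtle fragment is a minimal sketch of such instance data, reusing the broadband scenario from the introduction; the URIs are invented, and property names beyond has_creator and topic follow the public SIOC namespace as an assumption.

    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix sioc: <http://rdfs.org/sioc/ns#> .

    # Post A on the Galway bulletin board, its author and its topic:
    <http://site1.example.org/post/A>
        rdf:type         sioc:Post ;
        sioc:has_creator <http://site1.example.org/user/jdoe> ;
        sioc:topic       <http://site1.example.org/topic/Broadband> .

    <http://site1.example.org/user/jdoe>    rdf:type sioc:User .
    <http://site1.example.org/forum/galway> rdf:type sioc:Forum .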

Figure 1: Overview of Classes and Properties Used in SIOC.


    SIOC     FOAF             RSS 1.0    Atom
    ----     ----             -------    ----
    Site     -                -          -
    Forum    -                channel    Feed
    Post     Document         item       Entry
    User     Online Account   -          -

Table 1: Selected SIOC Mappings.


2.3 Mappings
One of the main functions of SIOC is to provide a means for exchanging community instance data. Since there is already a considerable number of classes and properties defined in RDF on the Web, we provide mappings in RDFS and OWL (Web Ontology Language) to allow the import and export of SIOC instance data in different vocabularies, such as FOAF (Friend Of A Friend) and RSS 1.0 / Atom. Therefore, we can leverage the instance data that is already available. In Table 1 we show how classes in FOAF, RSS 1.0 and Atom correspond to SIOC classes. Mappings of properties are described in a similar manner.
Since mappings in SIOC are not restricted to ontologies, we also need to provide a means to extract information from simple data structures. For example, we can map from XML (eXtensible Markup Language) documents such as RSS 0.9x and 2.0 into the SIOC ontology using XSL (eXtensible Stylesheet Language) stylesheets. In this method, titles, descriptions and hyperlinks are extracted from XML documents, somewhat similar to how GRDDL (Gleaning Resource Descriptions from Dialects of Languages) is used to extract information from XHTML (eXtensible HyperText Markup Language) documents.
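Expressed in RDFS/OWL, the Post and User rows of Table 1 could be stated roughly as follows. The exact axioms (subclass versus equivalence) are assumptions here; the published SIOC mapping files may use different constructs.

    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix sioc: <http://rdfs.org/sioc/ns#> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    # Every SIOC post can be read as a FOAF document ...
    sioc:Post rdfs:subClassOf foaf:Document .
    # ... and SIOC users correspond to FOAF online accounts.
    sioc:User owl:equivalentClass foaf:OnlineAccount .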

3 Exchanging Instances
The core use of SIOC will be in the exchange of instance data between sites. In the following, we elaborate on how the exchange, both importing and exporting data, can be carried out. We show how wrappers can help to achieve export functionality, either based on exporting documents containing the information or by rewriting queries. Another solution, incorporating the "document-based" wrapping, is to mirror exported RDF documents in an RDF store and thus allow queries to be performed. We present a third solution, possibly for newly-developed applications, which uses a native RDF repository to store and retrieve statements, making import and export straightforward.

3.1 Wrappers to Existing Tools
Wrappers will allow us to export instances of community site concepts such as fora or posts in RDF. They can also allow us to import SIOC instances into other, non-SIOC systems. Systems for which wrappers could be developed can be divided into two categories: legacy systems and web-based systems.
Legacy Systems. A large number of systems preceding the current Web are still deployed and widely used on the Internet. Email is used for exchanging messages and files in an asynchronous way, Usenet is still used to exchange messages, and IRC (Internet Relay Chat) is used for synchronous communication. Therefore, to really capture a large amount of the data currently exchanged in online communities on the Internet, these legacy systems and protocols need to be considered for SIOC. In contrast to web-based systems, where we just need to translate the data, for legacy protocols we need to employ protocol wrappers to HTTP (HyperText Transfer Protocol). For example, for email we need to translate the data representation format from RFC 822 to SIOC, and provide a wrapper to the access protocol for email stores (usually POP3 - Post Office Protocol version 3 - or IMAP4 - Internet Message Access Protocol). The email export wrapper accepts a conjunctive query over HTTP GET and returns the results in SIOC. In the next step, the query is parsed and translated into IMAP4 commands sent to the original data source. The original data source then returns the results in RFC 822 format, which is translated back into RDF and returned to the original caller via HTTP. We have implemented the wrapper and the mapping in Java.
Web-Based Systems. Providing mappings from web-based systems is easier than mapping from legacy systems, since protocol translation is not needed here. We will discuss two kinds of community sites: bulletin boards and weblogs. All these systems are based on content management systems. Therefore, exporting and importing information from and to such systems can be accomplished by adding wrapper interfaces to these systems. Some export functionality is already available for bulletin boards and CMSs (e.g. FOAF from vBulletin). Most of these systems use an open source architecture, and a wrapper for them can build on existing libraries such as Magpie RSS. We have provided a module for the Drupal CMS that exports SIOC information about Drupal 'nodes'. Weblogs are usually small-scale systems consisting of one or more contributors and a community of readers. Most weblog engines already have RSS export functionality and there are experimental implementations to export metadata, such as the WordPress FOAF plugin. Since the majority of these engines are open source software, it is straightforward to modify existing export functions to generate SIOC metadata.

The main challenge in using SIOC with web-based systems is not the technical implementation of SIOC wrappers, but rather the wide adoption of the SIOC ontology, so as to create incentives for people to provide data and tools for SIOC. By making SIOC data available through exports, we are encouraging the adoption of SIOC concepts. To this end, we have created a SIOC metadata export facility for the WordPress weblog engine. This makes use of existing WordPress PHP (Hypertext Preprocessor) functions to access the information about posts, users and fora (weblog channels) from the underlying relational database. SIOC metadata in RDF is generated for each concept instance; a sketch of what such output might look like is shown below, and the export process is illustrated by example in Figure 2. Other export facilities are being written for phpBB, vBulletin, and b2evolution.
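The Turtle below is a minimal sketch of such exported data; the URI scheme and the rdfs:seeAlso chaining are illustrative assumptions rather than the actual exporter output.

    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix sioc: <http://rdfs.org/sioc/ns#> .

    <http://blog.example.org/2005/12/hello-world>
        rdf:type         sioc:Post ;
        sioc:has_creator <http://blog.example.org/author/peter> ;
        sioc:topic       <http://blog.example.org/category/semantic-web> ;
        # a pointer a crawler can follow to the author's own SIOC data:
        rdfs:seeAlso     <http://blog.example.org/sioc.php?type=user&id=7> .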

Figure 2: SIOC Metadata Export from WordPress.

3.2 Mirror Data in RDF Store
Most of the web-based wrappers just provide simple document-based export facilities. Since our goal is to make SIOC data available for query, and to entice people to use SIOC now, we need a method to allow querying of the information that sites publish in flat files. A solution to provide query facilities for sites that have only simple data export facilities is to replicate the information in a data store that can process queries. Queries are then answered from the replica. The replica is updated either by an RDF crawler that traverses rdfs:seeAlso links, or by the original site, which pushes updates and changes automatically into the mirror store. Replicating the contents of the entire site from the relational database to an RDF store may work initially and create an easy upgrade path. However, in the longer term, storing and integrating data in a native RDF repository is the desirable solution.

3.3 Native RDF Store
The previous two subsections discussed tasks that concerned querying existing sites and their content. We will now describe how newly architected sites can make use of a native RDF repository to store their data. Exporting data is quite simple, because RDF does not restrict the way data can be expressed. On the flip side, the flexibility of RDF creates a problem when importing data into systems with a fixed schema. Issues arise here, for example, when an application imports data using a given schema and certain mandatory data is missing. Since community sites provide access to complex structures of information with different types, it is natural to store that information in RDF directly. Repositories such as Jena2 [10], Sesame [3], Redland [1], or YARS [5] can be used to store and retrieve the data. With an RDF store as the data repository, importing and exporting information is straightforward, and data integration tasks are also facilitated. An API (Application Program Interface) similar to the RDF NetAPI [9] can be used as well. The route we chose for SIOC is to use HTTP methods such as PUT and DELETE for adding and removing data.
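At the HTTP level, adding a post could then look like the following sketch; the resource layout under /rdf/ is an assumption for illustration, not a fixed SIOC interface.

    PUT /rdf/post/42 HTTP/1.1
    Host: community.example.org
    Content-Type: application/rdf+xml

    ... RDF description of the new post ...

Retracting the same post would be a matter of issuing the corresponding DELETE:

    DELETE /rdf/post/42 HTTP/1.1
    Host: community.example.org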

4 Using SIOC Data
Given the ontology, the mappings, and the wrappers, we are now able to pose queries and add data to individual SIOC sites.

4.1 Browsing
Once we have made the data available using a common query infrastructure, we can use various user interfaces to navigate SIOC data. The simplest solution is to use a mapping from SIOC to a data format for which client programs already exist. For example, SIOC data can be mapped to email and then read in any email program. Also, a mapping from SIOC to RSS allows us to navigate a subset of SIOC information inside a regular RSS news reader. Since SIOC has a richer data model than RSS, some information will be lost during the conversion. Another approach is to use existing RDF browsers, such as BrownSauce or Node browser, to view arbitrary RDF data. Leveraging the full potential of SIOC requires the provision of custom programs and user interfaces specially tailored towards SIOC (e.g. for cross-site browsing).

4.2 Query
Representing data in SIOC enables users to pose structural queries against the collected data rather than just using keyword search. An implication of structural queries is that you get precise answers as a result, and not just pieces of documents that match the keywords. One central problem in P2P networks is how to route queries [8]. We plan to exploit the link structure that connects fora or sites to route queries. The forum and site linkage inside SIOC makes routing easier than in general-purpose peer-to-peer networks, since we have some (human-created) links that can be exploited. We expect a scale-free behaviour of these links once SIOC is widely used in practice. By building the infrastructure for distributing queries into the different site management software or wrappers, we can perform queries without any central components. As a result, querying inside an intranet will be simple and already integrated into the tools used to manage the different community sites inside an organisation, such as mailing lists or fora.

4.3 Locating Related Information
Querying the community sites for information on demand is not the only model of end-user interaction. Another way to enhance the end-user experience is to prepare the data in advance, at the creation time of a post. Once a new post is created in a community site and the SIOC information is available, this site then queries the network of community sites to find related posts. A query is performed based on the post metadata, such as other posts by this person or other posts in the set of the post's topics; a sketch of such a query is given below. Results are then stored and can be reused to browse forum entries and navigate through the web of interlinked posts, independent of the underlying site structure that the fora and posts are hosted on. The results of this information retrieval model are the enhanced functionality added to community sites, and better scalability, since the information is prepared in advance.
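For instance, the related-posts query for a freshly created post might be expressed in SPARQL along these lines (the post URI is invented, and only the topic part of the metadata is used here):

    PREFIX sioc: <http://rdfs.org/sioc/ns#>
    # Find other posts sharing a topic with the new post:
    SELECT ?related ?topic
    WHERE {
      <http://site1.example.org/post/A> sioc:topic ?topic .
      ?related sioc:topic ?topic .
      FILTER (?related != <http://site1.example.org/post/A>)
    }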

5 Related Work
Harvest [2] is an early system that can be used to gather information from diverse repositories to build, search, and replicate indexes, and to cache objects as they are retrieved across the Internet. Harvest uses the Summary Object Interchange Format (SOIF) to exchange metadata about resources. In contrast, SIOC uses RDF as the exchange format and allows mappings between different vocabularies, which is not envisioned in SOIF.
The Issue-Based Information Systems (IBIS) model [11] uses argumentative discussions in the process of solving design issues and provides a detailed model for links between conversations. SIOC uses metadata and reply links to connect conversations on online community sites and can be extended to describe argumentative discussions.
Various approaches for data integration on the Web, such as data representation languages, structural information retrieval, and query processing, are surveyed in [4]. However, advanced database techniques have so far failed to surface on the Web. SIOC is a first step in providing a common vocabulary for data representation across online communities.
RDF Site Summary (RSS 1.0) is widely used in weblog systems and news sites. RSS 1.0 defines a lightweight vocabulary for syndicating news items, but is used for all sorts of data exchange. Although RSS works well in practice, there are several issues: firstly, only the last "n" news items are typically exported in RSS; secondly, most systems use non-RDF versions of RSS, which limits their use with other vocabularies.

6 Conclusion
We have presented the SIOC ontology and various mappings to and from other vocabularies that are already deployed on the Web. We have described how instance data in SIOC can be exchanged among online community sites. Our initial SIOC ontology can also be used to enable more complex use cases, for example cross-site structural queries, and integration based on the warehousing approach. To tackle the challenge of adoption, we have provided an upgrade path that allows a gradual migration from existing systems to semantically-enabled sites. For combination with other ontologies, we have presented mappings to and from SIOC that allow the export and import of SIOC data using existing systems and tools. We have developed prototype SIOC exporters for a weblog engine and a content management system, with several more in development. In the future, we intend to exploit the characteristics of intra- and inter-site links to guide query routing in a P2P-like environment.

Acknowledgements
The authors would like to acknowledge the support of Science Foundation Ireland under Grant No. SFI/02/CE1/I131. This paper was originally presented at the 2nd European Semantic Web Conference (ESWC 2005).

References
[1] D. Beckett. The Design and Implementation of the Redland RDF Application Framework. Computer Networks, 39(5):577-588, 2002.
[2] C. M. Bowman, P. B. Danzig, D. R. Hardy, U. Manber, and M. F. Schwartz. The Harvest Information Discovery and Access System. Computer Networks and ISDN Systems, 28(1-2):119-125, 1995.
[3] J. Broekstra, A. Kampman, and F. van Harmelen. Sesame: A Generic Architecture for Storing and Querying RDF and RDF Schema. In 1st International Semantic Web Conference, pages 54-68, 2002.
[4] D. Florescu, A. Y. Levy, and A. O. Mendelzon. Database Techniques for the World-Wide Web: A Survey. SIGMOD Record, 27(3):59-74, 1998.
[5] A. Harth and S. Decker. Optimized Index Structures for Querying RDF from the Web. In 3rd Latin American Web Congress, Buenos Aires, Argentina, October 31 - November 2, 2005, pages 71-80.
[6] R. Lara, S.-K. Han, H. Lausen, M. Stollberg, Y. Ding, and D. Fensel. An Evaluation of Semantic Web Portals. In IADIS Applied Computing International Conference 2004, Lisbon, Portugal, March 23-26, 2004.
[7] A. J. Miles, N. Rogers, and D. Beckett. SKOS Core RDF Vocabulary. 2004.
[8] W. Nejdl, B. Wolf, C. Qu, S. Decker, M. Sintek, A. Naeve, M. Nilsson, M. Palmer, and T. Risch. EDUTELLA: A P2P Networking Infrastructure Based on RDF. In WWW 2002, pages 604-615, 2002.
[9] A. Seaborne. An RDF NetAPI. In 1st International Semantic Web Conference, pages 399-403, 2002.
[10] K. Wilkinson, C. Sayers, H. A. Kuno, and D. Reynolds. Efficient RDF Storage and Retrieval in Jena2. In Proceedings of SWDB'03, 1st International Workshop on Semantic Web and Databases, co-located with VLDB 2003, pages 131-150, 2003.
[11] H. Rittel and W. Kunz. Issues as Elements of Information Systems. Working Paper 131, University of California at Berkeley, Center for Planning and Development Research, 1970.


A Semantic Search Engine for the International Relation Sector
Luis Rodrigo-Aguado, V. Richard Benjamins, Jesús Contreras-Cino, Diego-Javier Patón-Villahermosa, David Navarro-Arnao, Robert Salla-Figuerol, Mercedes Blázquez-Cívico, Pilar Tena-García, and Isabel Martos-Laborde
The Royal Institute Elcano (Real Instituto Elcano, RIE) is a prestigious independent Spanish political institute whose mission is to comment on the geo-political situation in the world, focusing on its relation to Spain. As part of its dissemination strategy it operates a public website. In this paper we present and evaluate the application of a semantic search engine to improve access to the Institute's informational content: instead of retrieving documents based on user queries of keywords, the system accepts queries in natural language and returns answers rather than links to documents. Topics that will be discussed include ontology construction, automatic ontology population, and semantic access through a natural language interface.

Keywords: Knowledge Acquisition, Ontology, Question Answering, Semantic Search.

1 Introduction
Worldwide there are several prestigious institutes that comment on the geo-political situation in the world, such as the UK's Royal Institute for International Affairs or the Dutch Institute for International Relations. In Spain, the Real Instituto Elcano (Royal Institute Elcano, RIE) fulfils this role. The institute provides several types of written reports which discuss the political situation in the world, with a focus on events relevant for Spain. The reports are organized into different categories, such as Economy, Defense, Society, Middle East, etc. In a special periodic report - the "Barometer of the Royal Institute Elcano" - the Institute comments on how the rest of the world views Spain in the political arena. Access to the content is provided by categorical navigation and a traditional full text search engine. While full text search engines are helpful instruments for information retrieval, in domains where relations are important those techniques fall short. For instance, a keyword-based search engine will have a hard time finding the answer to a question such as: "Governments of which countries have a favorable attitude toward the US-led armed intervention in Iraq?", since the crux of answering this question resides in 'understanding' the relation "has-favourable-attitude-toward".
In this paper we present a semantic search engine that accepts natural language questions to access content produced by the Institute. Semantic in this context means related to the domain of International Relations (politics).

Luis Rodrigo-Aguado graduated in Computer Science from the Universidad Politécnica de Madrid (UPM), Spain, and is currently studying towards his doctoral thesis on the subject of the Semantic Web and Natural Language. He divides his time between the Smart Systems Lab of UPM's Dept. of Artificial Intelligence and the company Intelligent Software Components (iSOCO, S.A.), where he is currently working as a project manager, coordinating work related to Natural Language. He has authored a number of articles and presentations in national and international conferences.

V. Richard Benjamins is Director of Research & Development and board member at Intelligent Software Components (iSOCO, S.A.), in Madrid, Spain. He co-founded iSOCO in June 1999, and contributed to its start-up (now 70 persons) and international positioning as a Semantic Web Solutions company. He is also a part-time professor at the Universidad Politécnica de Madrid. He has acquired and managed over 5 million euro in R&D projects in advanced Information Technologies related to the Internet. Before working at iSOCO, he had a permanent position at the University of Amsterdam, The Netherlands, in the area of Knowledge Systems Technology (1998-2000). Between 1993 and 1998, he worked at the University of Sao Paulo, Brazil, the University of Paris-South, France, and the Spanish Artificial Intelligence Research Institute in Barcelona, Spain. He has published over 80 scientific articles in books, journals and proceedings, in areas such as Knowledge Technologies, Artificial Intelligence, Knowledge Management, Semantic Web and Ontologies. He has been guest editor of several journal special issues, serves on many international program committees, and has been co-chair of numerous international workshops and conferences. He is a member of the editorial boards of IEEE Intelligent Systems and of Web Semantics (Elsevier). He received his Master's Degree (1988) and Ph.D. (1993) in Cognitive Science from the University of Amsterdam.

Jesus Contreras-Cino obtained a PhD in Artificial Intelligence (2004) at the Universidad Politécnica de Madrid, Spain. From 1996 he was an Assistant Researcher in the Intelligence Systems Research Group, where he participated in projects oriented towards the development of Knowledge Based Systems and Advanced Artificial Intelligence Applications. In 1998 he joined Software A.G.'s e-business competence center, where he was enrolled in various European projects as software engineer and main researcher. In November 2000, he joined the Innovation Dept. of Intelligent Software Components (iSOCO, S.A.). During his career he has published various articles about Natural Language Processing in Human-Computer Interaction.

Diego-Javier Patón-Villahermosa graduated (HND) in Software Engineering. He is a Knowledge Engineer at Intelligent Software Components (iSOCO, S.A.), specialising in portals and social networks based on semantic search engines. He has participated in Knowledge Parser projects: framework software that enables data to be extracted automatically from online sources and then locally stored in structured warehousing.

David Navarro-Arnao is a developer and researcher in the Innovation Department working on projects such as the Buscador Semántico Residencia de Estudiantes, AMASS (Associative Memory Arrays for Semantic Search), Buscador Semántico Real Instituto Elcano, Esperonto Services (link between the current Web and the Semantic Web), Knowledge Parser (automatic data extraction), NETCASE (a smart system based on Semantic Web technologies for application in legal environments), and the open source library KPONTOLOGY for working with ontologies, used and maintained by a number of projects (HOPS, Semantic Search Engine, SEKT, Esperonto Services, Onto-H, Iuriservice).

Robert Salla-Figuerol graduated as a Technical Engineer (Computer Science) from the Universitat de Lleida (UDL), Spain, with the thesis "Self-similar processes applied to Internet traffic". He has a Master of Computer Science from the Computer Science Faculty of the Universitat Politecnica de Catalunya (UPC), Barcelona, Spain. He has long experience in information retrieval software development for Semantic Web purposes, and has authored several papers for international congresses and journals.

Mercedes Blázquez-Cívico has been working at Intelligent Software Components (iSOCO, S.A.) as a researcher since September 2000. She graduated in Computer Science from the Universidad Politécnica de Madrid (UPM), Spain, in 1997 and studied for a masters degree in knowledge engineering and software engineering at the Computer Science faculty of the same university, finishing in November 2000. She is currently studying towards her doctorate in Computer Science and Artificial Intelligence at the UPM, in knowledge management and its applications. Her research activities include the application of the Semantic Web in knowledge management, and she is currently participating as a work team leader at iSOCO in the SEKT 6th Framework IP (IST-2003-506826). She has also participated in various PROFIT R&D projects related to the application of Semantic Web technologies and ontologies.

Pilar Tena-García graduated in Law and Information Science (branch of Journalism) from the Universidad Complutense de Madrid (UCM), Spain (1977). She has been Deputy Director of Institutional Relations of the Real Instituto Elcano since 2002.

Isabel Martos-Laborde graduated in Information Science from the Universidad Complutense de Madrid (UCM), Spain (1992). She has been the webmaster of the Real Instituto Elcano site since its inception in 2002.

2 An Ontology of International Affairs
When searching for a particular datum, looking for a concrete answer to a precise question, a standard search engine that retrieves documents based on matching keywords falls short. First of all, it does not satisfy the primary need of the user, which is finding a well-defined piece of data: it provides a collection of documents that the user must examine, looking for the desired information. Also, not all of the retrieved documents might contain the appropriate answer, and some of the documents that do contain it may not be included in the collection. These drawbacks suggest the need for a change in the search paradigm, evolving from the extraction of whole documents to the information contained in those documents. This approach, however, is not feasible in all conditions. It is not cost-justifiable to build such a search engine for general usage, but it can be justified for limited, well-defined domains. Such is the semantic search engine developed for the Real Instituto Elcano, which focuses on the topics covered by the reports written by the institute analysts, i.e. international politics.
In order to be able to analyse the documents, and reach sufficient 'understanding' of them to be able to answer the users' questions, the system relies on a representation of the main concepts, their properties and the relations among them in the form of an ontology. This ontology provides the system with the necessary knowledge to understand the questions of the users, provide the answers, and associate with them a set of documents that mention the concept of the answer. Based on the ontology, each document gets its relevant concepts annotated and linked to the representing concept or instance in the ontology, allowing a user to browse from a document to the information of a concept he is interested in, and backwards, from the ontology to any of the reports that mention that concept.

2.1 Ontology Design
An ontology is a shared and common understanding of some domain that can be communicated across people and computers [6][7][3][8]. Ontologies can therefore be shared and reused among different applications [5]. An ontology can be defined as a formal, explicit specification of a shared conceptualization [6][3]. 'Conceptualization' refers to an abstract model of some phenomenon in the world arrived at by identifying the relevant concepts of that phenomenon. 'Explicit' means that the types of concepts used, and the constraints on their use, are explicitly defined. 'Formal' refers to the fact that the ontology should be machine-readable. 'Shared' reflects the notion that an ontology captures consensual knowledge, that is, it is not private to some individual, but accepted by a group. An ontology describes the subject matter using the notions of concepts, instances, relations, functions, and axioms. Concepts in the ontology are organized in taxonomies through which inheritance mechanisms can be applied. It is our experience that building a commonly agreed ontology is not easy, especially the social part [2].
Based on interviews with experts of the Elcano Institute, we used the CIA World Factbook as the basis for the design of the ontology of International Affairs. The CIA World Factbook is a large online repository with current information on most countries of the world, along with relevant information in the fields of geography, politics, society, economics, etc. We have used the competency questions approach [10] to determine the scope and granularity of the domain ontology. The ontology consists of several top-level classes, some of which are:
- Place: a concept representing geographical places such as countries, cities, buildings, etc.
- Agent: a concept taken from WordNet [11] representing entities that can execute actions modifying the domain (e.g. persons, organizations, etc.).
- Events: time expressions and events.
- Relations: a common class for any kind of relation between concepts.
Without instance information, the ontology contains about 85 concepts and 335 attributes (slots, properties). The ontology has been constructed using Protégé [9]; Figure 1 shows a fragment of the ontology in Protégé. A minimal RDF Schema sketch of the top-level classes is given below.

Figure 1: Ontology for International Affairs.
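The RDF Schema fragment below sketches how this top level could look. The namespace and the Country/Person subclasses are illustrative assumptions, since the actual ontology is maintained in Protégé and is not reproduced in this paper.

    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix rie:  <http://example.org/elcano#> .   # hypothetical namespace

    rie:Place    rdf:type rdfs:Class .
    rie:Agent    rdf:type rdfs:Class .   # concept taken from WordNet
    rie:Event    rdf:type rdfs:Class .
    rie:Relation rdf:type rdfs:Class .

    rie:Country rdfs:subClassOf rie:Place .
    rie:Person  rdfs:subClassOf rie:Agent .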

3 Automatic Annotation
One of the challenges for the success of the Semantic Web is the availability of a critical mass of semantic content [17]. Semantic annotation tools play a crucial role in upgrading actual web content into semantic content that can be exploited by semantic applications. In this context we developed the Knowledge Parser© (KP), a system able to extract data from online sources and populate specific domain ontologies, adding new knowledge facts or instances or modifying existing ones. The Semantic Web community often calls this process semantic annotation (or just annotation). The Knowledge Parser offers a software platform that combines different technologies for information extraction, driven by extraction strategies that allow the optimal combination of technologies to be applied to each source type, based on the domain ontology definition.

Ontology population from unstructured sources can be considered as the problem of extracting information from the source, assigning it to the appropriate location in the ontology, and, finally, inserting it coherently into the ontology. The first part deals with information extraction and document interpretation issues. The second part deals with information annotation, in the sense of adding semantics to the extracted information according to domain information and pre-existing strategies. The last part is in charge of population, i.e., inserting and consolidating the extracted knowledge into the domain ontology. The three phases can be seen in the architecture of the system, illustrated in Figure 2.

Figure 2: Architecture of the System. (The original figure shows the processing pipeline: sources are preprocessed, information is identified by layout, DOM, NLP and plain-text operators, hypotheses are evaluated, and the domain ontology is populated, guided by strategies and the domain description.)

3.1 Information Extraction
The KP system at present handles HTML (HyperText Markup Language) pages, and there are plans to extend it to also handle PDF (Portable Document Format), RTF (Rich Text Format), and other popular formats. To capture as much information as possible from the source document, KP analyzes it using four different processors, each one focusing on a different aspect: the plain text processor, the layout processor, the HTML source processor and the natural language processor.
The plain text source interpretation supports the use of regular expression matching techniques. These kinds of expressions constitute an easy way of retrieving data in the case of stable, well-known pages. If the page suffers frequent changes, the regular expression becomes useless.
It is very common that even when documents of the same domain have a very similar visual aspect, they have a completely different internal code structure. Most online banks, for example, offer a position page where all the personal accounts and their balances are shown. These pages have a very similar visual aspect, but their source code is completely different. The KP system includes layout interpretation of HTML sources, which makes it possible to determine whether certain pieces of information are visually ABOVE, UNDER, RIGHT, LEFT, IN_ROW, IN_COLUMN, etc. with respect to another piece of information.
In addition to the HTML rendering of the source code in a visual model, the KP system needs to process the HTML elements in order to browse through the sources. The source description may include a statement that some information is a valid HTML link (e.g., a country name in a geopolitical portal) which, when activated, takes one to another document (a country description).
Finally, the fourth model tries to retrieve information from the texts present in the HTML pages. To do that, the user describes the pieces he is interested in in terms of linguistic properties and the relations among them (verbal or nominal phrases, coordinations, conjunctions, appositions, etc.).

3.2 Information Annotation
Once the document is parsed using different and complementary paradigms, the next challenge is to assign each extracted information piece to the correct place in the domain ontology. This task is called annotation, since it is equivalent to wrapping up the information piece with the corresponding tag from the ontology schema. In most cases the annotation of information is not direct. For instance, a numeric datum extracted from the description of a country could be catalogued as the country's population, its land area, or its number of unemployed. It is necessary to have some extra information that helps reduce this ambiguity. This information, formulated in another model, enlarges the domain ontology with background knowledge, in the same way a human uses such knowledge for understanding. The extraction system needs to know, for example, that in online banking the account balance usually appears in the same visual row as the account number, or that it is usually preceded by a currency symbol. This kind of information, describing the pieces of information expected in the source and the relations among them, is formalized in a so-called wrapping ontology. This ontology supports the annotation process, holding information describing the following elements: document types, information pieces, and relations among the pieces (any kind of relation detectable by the text, layout, HTML or NLP - Natural Language Processing - models).
According to the domain ontology and the background information added, the system constructs possible assignments from the extracted information to the ontology schema. The result of this process is a set of hypotheses about data included in the source and their correspondence with the concepts, properties and relations in the domain ontology. During the construction process the system can evaluate how well the extracted information fits the information description. The different ways in which hypotheses can be generated and evaluated are called strategies. Strategies are pluggable modules that, according to the source description, invoke operators. In the current version of the system two strategies are available. For system usages where the response time is critical we use the greedy strategy. This strategy produces only one hypothesis per processed document, using heuristics to solve possible ambiguities in data identification. On the other hand, when quality of annotation is a priority and requirements on response time are less important, we use a backtracking strategy. This strategy produces a whole set of hypotheses to be evaluated and populated into the domain ontology. A sketch of what a wrapping-ontology fragment might look like is given below.
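The Turtle below illustrates the online-banking example as a wrapping-ontology fragment. The vocabulary is invented for illustration, since the Knowledge Parser's actual schema is not shown in this paper.

    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix wrap: <http://example.org/wrapping#> .   # hypothetical

    # "The account balance appears in the same visual row as the
    # account number and is preceded by a currency symbol."
    wrap:AccountBalance
        rdf:type        wrap:InformationPiece ;
        wrap:inRowWith  wrap:AccountNumber ;
        wrap:precededBy wrap:CurrencySymbol .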

3.3 International Affairs Ontology Population
Using the Knowledge Parser system, we populated the ontology of international affairs designed as described in Section 2.1. The domain experts selected four sources where they could find most of the information that they use on a daily basis:
- CIA World Factbook.
- Nationmaster.
- Cidob.
- International Policy Institute for Counter-Terrorism.
The set of sources is, of course, not exhaustive, but it tries to follow the 80-20 rule, whereby a few sites cover most of the knowledge needed by the users of the system. For each of the sites, a wrapping ontology was developed, describing the data contained in it, the way to detect those data, and the relations among them. The development of these descriptive ontologies is at present done by experienced knowledge engineers, but in the future we intend to develop tools that will allow the domain experts themselves to describe a new source and populate the ontology with its contents. As a result of this process, we evolved from an empty ontology to an ontology with more than 60,000 facts, occupying more than 20 MB of RDF files.

Figure 3: Domain Ontology Population Process. (The original figure shows a loop in which sources on the Internet are navigated and their content retrieved, annotations are produced under supervision, and the domain ontology is populated.)

4 The International Relations Portal
Modeling the domain in the form of an ontology is one of the most difficult and time-consuming tasks in developing a semantic application, but an ontology itself is just a way of representing information; by itself it provides no added value for the user. What becomes really interesting for the user is the kind of applications (or features inside an application) that an ontology enables. In the following, we describe how we have exploited the semantic domain description, in the form of enhanced browsing of the already existing reports and a semantic search engine integrated into the international relations portal, interconnected with each other.

4.1 Establishing Links between Ontology Instances and Elcano Documents
The portal holds two different representations of the same knowledge: the written reports from the institute analysts and the domain ontology, which are mutually independent. However, one representation can enrich the other, and vice versa. For example, an analyst looking for the Gross Domestic Product (GDP) of a certain country may also be interested in reading some of the reports where this figure is mentioned, and, in the same way, someone who is reading an analysis about the situation in Latin America may want to find out the political parties present in the countries of the region. To satisfy these interests, we inserted links between the instances in the ontology and the documents of the Institute. The links are established in both directions: each concept in the ontology has links to the documents that mention it, and each document has links that connect the concepts mentioned in the article with the corresponding concepts in the ontology. This way, the user can ask a question (for example, "Who is the USA president?") and get the information of the instance in the ontology corresponding to George Bush. From this screen, he can follow the links to any of the instances appearing in the text, George Bush being one of them. This process can be seen in Figure 4, where the information about George Bush in the ontology contains a set of links, and the document seen can be reached by following one of them; a sketch of the kind of link triples this produces is shown below.
To generate these links, a batch process is launched that generates, at the same time, both the links in the ontology and the links in the articles. At present, the process of adding links is a batch process that opens a document and looks for appearances of the names of any of the instances of the ontology in that text. For any match, it adds a link in the text to the instance in the ontology, and a link in the ontology with a pointer to the text. To evaluate the matching, not only the exact name of the instance is used, but also its possible synonyms, contained in an external thesaurus which can be easily extended by any user, i.e., the domain expert. Future plans include the automation of this task, so that any new document in the system (the institute constantly produces new reports) is processed automatically by the link generator tool and the new links are transparently included in the system.
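In RDF terms, the generated links amount to a pair of simple triples per match; the following sketch is hypothetical (namespace, property names and document URI are invented):

    @prefix rie: <http://example.org/elcano#> .   # hypothetical

    # From the ontology instance to a report that mentions it ...
    rie:GeorgeBush rie:mentionedIn <http://rie.example.org/report/271> .
    # ... and from the report back to the instance:
    <http://rie.example.org/report/271> rie:mentions rie:GeorgeBush .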

4.2 The Semantic Search Engine
With the objective of making the knowledge contained in the ontology available in a comfortable, easy-to-use fashion, we also designed a semantic search engine. Using this engine, users can ask in natural language (Spanish, in this case) for a concrete piece of data, and the system retrieves the data from the ontology and presents the results to the user. A sketch of the kind of structured query behind such a question is given below.
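Whatever the internal query mechanism, the structured query behind a question like "Who is the USA president?" can be sketched in SPARQL as follows; the vocabulary is assumed for illustration, and the paper does not state that the engine uses SPARQL internally.

    PREFIX rie: <http://example.org/elcano#>
    SELECT ?president
    WHERE {
      ?country rie:name      "USA" ;
               rie:president ?president .
    }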

Figure 4: Links between the Instances and the Documents.

5 Related Work
Our Knowledge Parser is related to several other initiatives in the area of automatic annotation for the Semantic Web, including KIM [12], which is based on GATE [13], Annotea [14] of the W3C, Amilcare [15] of the UK Open University (also based on GATE), and AeroDAML [16]. For an overview of these and other approaches, see [4]. All approaches use NLP as an important factor to extract semantic information. Our approach is innovative in the sense that it combines four different techniques for Information Extraction in a generic, scalable and open architecture. The state of the art of most of these approaches is still not mature enough (few commercial deployments) to allow a concrete comparison in terms of performance and memory requirements.

6 Conclusions
A semantic search engine for a closed domain has been presented. The figures from the evaluation are promising, as more than 60% of spontaneous questions are understood and correctly answered when they belong to the application domain. However, some things still need to be improved: the automatic link generation, a more flexible mechanism for building queries, and an automated process to generate complete synonym files from linguistic resources, to mention just a few. It would also be of great interest to completely decouple the search engine from the domain information, which are currently lightly coupled, in order to be able to apply the semantic search engine to a new domain just by replacing the domain ontology and the synonym files. The semantic search engine is, at the same time, a proof of the utility and applicability of the Knowledge Parser©, which will be further developed in future projects.

Acknowledgements
Part of this work has been funded by the European Commission in the context of the projects Esperonto Services (IST-2001-34373), SWWS (IST-2001-37134) and SEKT (IST-2003-506826), and by the Spanish government in the scope of the project Buscador Semántico, Real Instituto Elcano (PROFIT 2003, TIC). The natural language software used in this application is licensed from Bitext. For ontology management we use the JENA libraries from HP Labs.

References
[1] A. Gómez-Pérez et al. Ontological Engineering. Springer-Verlag, London, UK, 2003.
[2] V. R. Benjamins et al. (KA)2: Building ontologies for the Internet: a mid-term report. International Journal of Human-Computer Studies, 51(3):687–712, 1999.
[3] W. N. Borst. Construction of Engineering Ontologies. PhD thesis, University of Twente, 1997.
[4] Contreras et al. D31: Annotation Tools and Services. Esperonto Project.
[5] A. Farquhar et al. The Ontolingua server: a tool for collaborative ontology construction. International Journal of Human-Computer Studies, 46(6):707–728, June 1997.
[6] T. R. Gruber. A translation approach to portable ontology specifications. Knowledge Acquisition, 5:199–220, 1993.
[7] N. Guarino. Formal ontology, conceptual analysis and knowledge representation. International Journal of Human-Computer Studies, 43(5/6):625–640, 1995. Special issue on the role of formal ontology in information technology.
[8] G. van Heijst et al. Using explicit ontologies in KBS development. International Journal of Human-Computer Studies, 46(2/3):183–292, 1997.
[9] Protégé 2000 tool.
[10] M. Uschold and M. Gruninger. Ontologies: principles, methods, and applications. Knowledge Engineering Review, 11(2):93–155, 1996.
[11] WordNet.
[12] A. Kiryakov et al. Semantic Annotation, Indexing, and Retrieval. 2nd International Semantic Web Conference (ISWC2003), 20-23 October 2003, Florida, USA. LNAI Vol. 2870, pp. 484-499, Springer-Verlag, Berlin Heidelberg, 2003.
[13] H. Cunningham et al. GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications. Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, July 2002.
[14] J. Kahan et al. Annotea: An Open RDF Infrastructure for Shared Web Annotations. Proceedings of the WWW10 International Conference, Hong Kong, May 2001.
[15] F. Ciravegna. (LP)2, an Adaptive Algorithm for Information Extraction from Web-related Texts. Proceedings of the IJCAI-2001 Workshop on Adaptive Text Extraction and Mining, held in conjunction with the 17th International Joint Conference on Artificial Intelligence (IJCAI-01), Seattle, August 2001.
[16] P. Kogut and W. Holmes. AeroDAML: Applying Information Extraction to Generate DAML Annotations from Web Pages. Proceedings of the First International Conference on Knowledge Capture (K-CAP 2001).
[17] V. R. Benjamins et al. Six Challenges for the Semantic Web. White paper, April 2002.


Semantic Search in Digital Image Archives: A Case Study

Julio Villena-Román, José-Carlos González-Cristóbal, Cristina Moreno-García, and José-Luis Martínez-Fernández

This paper describes a commercial project which applies the concepts put forward by the Semantic Web in order to improve image search in a website that sells photographs through the Internet. The specific problem addressed here concerns techniques for the semiautomatic creation of thesauri and the normalization of image descriptors from a previous set of labels containing free keywords with partial morphological expansion. The ultimate goal of this project is to improve customer access to a collection of more than two million photographs. The project has been developed by the Spanish company DAEDALUS - Data, Decisions and Language, S.A. for the Internet website stockphotos.es, of the company Stock Photos S.L.

Keywords: Automatic Classification, Digital Image Library, Information Retrieval, Normalisation Process, Posttranslation, Pretranslation, Subject Hierarchy, Thesaurus.

Julio Villena-Román graduated as a Telecommunications Engineer from the Higher Technical School of Telecommunications Engineers (ETSIT) of the Universidad Politécnica de Madrid (UPM), Spain, in 1997. He was a founder member and is currently the technological director of the Spanish company DAEDALUS. He started his career as a researcher at UPM on an FPI (Research Personnel Training) grant in 1997. He has been lecturing in the Telematic Systems Engineering Department of the Universidad Carlos III de Madrid since 2002. He has led research projects in the field of intelligent systems and has authored many international publications.

José-Carlos González-Cristóbal received his doctorate in Telecommunications from the Higher Technical School of Telecommunications Engineers (ETSIT) of the Universidad Politécnica de Madrid (UPM), Spain, in 1989. He was a founder member and has been the president of the Spanish company DAEDALUS since its inception in 1998. He has been lecturing in the Telematic Systems Engineering Department (ETSIT - UPM) since 1985. He has led numerous research and development projects, as part of national and European programmes, or private initiative projects. He has authored a great many scientific and technical publications in various fields related to artificial intelligence, and he has taken part as an organizer and collaborator in national and international conferences. He has represented UPM as the Chairman of the Technical Committee of CITAM (Research Centre for Multimedia Technologies and Applications, AIE), and is also the chairman of the Spanish chapter of the IEEE Computer Society.

Cristina Moreno-García graduated in Technical Computer Engineering from the UNED (Universidad Nacional de Educación a Distancia, Spanish National Open University). She has been working in the technical department of the Spanish company DAEDALUS since 2000, where she has carried out important work on web technology projects.

José-Luis Martínez-Fernández graduated as a Telecommunications Engineer from the Higher Technical School of Telecommunications Engineering (ETSIT) of the Universidad Politécnica de Madrid (UPM), Spain, in 1998. He was a founder member of the Spanish company DAEDALUS and is currently its director of consulting. Between 2000 and 2001 he worked at SGI (Soluciones Globales de Internet), a business unit of the Spanish group GMV Sistemas S.A. He has been lecturing in the Computer Science Department of the Universidad Carlos III de Madrid since 2002. He has led research projects in the field of intelligent systems, a subject on which he has authored many international publications.

1 Introduction
The ultimate goal of the Semantic Web is to improve access to any kind of information published on the Internet. Nowadays, several languages and standards, mainly promoted by the W3C (World Wide Web Consortium), allow a uniform representation of information, as well as the formalization of inference processes. Both aspects are essential to facilitate the localization of information stored in any digital repository. Today, these standards and supporting tools make it very easy to adopt an ontology for a particular domain, perhaps with some kind of adaptation, or even to build one from scratch for that particular domain or a specific application. These are tasks which require a limited effort and are, in general, achievable in commercial projects.

However, content providers also want to reach the promised land of the Semantic Web from a huge volume of information which is not highly structured or is totally unstructured. Moreover, as large amounts of money and effort have already been invested in the acquisition or tailoring of those resources, any action to profit from them needs to balance accessibility for customers against economic constraints.

This is the main problem that arises in the case we present in this article: how to improve the likelihood that a given customer finds the photograph which he/she needs to illustrate a publication or an advertising campaign, in the shortest time, in an archive with several million images. This objective necessarily demands that the images are tagged in the way which best matches user queries. But tagging a photograph means specifying the objects that are shown, the environment in which they are located, the relationships among them, actions or effects which could be happening at that moment, feelings that are evoked, light, colour range, photographic technique, etc. This work has already been carried out in the past, perhaps with criteria, depth, precision or quality that are less than optimal. Logically, therefore, investing more money is not an option.

The starting point is a digital image archive, tagged with a short title and several keywords from a free (uncontrolled) vocabulary, without diacritics or typographical marks. A stratified sampling was performed at the beginning of the project. The selected set was finally formed by 194,618 images, tagged with 1,008,593 terms in titles and 2,917,973 terms in keywords. As we will see, different inflectional lexical forms were frequently included for the same image to increase the possibility of it being found. A certain proportion of spelling mistakes was also found.

The objective of this article is to illustrate the process of term normalization and, at the same time, the creation of an ad-hoc thesaurus which allows access to all the available contents in a structured and optimised way. This process is particularly demanding in most projects related to the Semantic Web. Questions that arise under these circumstances are:
1. Is it possible, by semiautomatic means and at acceptable cost, to carry out the generation of an ontology suited to the contents of this collection, the normalization of the keys and the multiclassification of those contents according to the generated ontology?
2. To what extent is this project economically worthwhile?
3. What other tools or investments are necessary so that customers can find the appropriate images within the new content structure?
4. What is the impact of changes on the costs of cataloguing new collections, and what are the repercussions of adopting the new technology on maintenance costs?
5. And, lastly, what is the return on this investment?

Sections 3 to 6 of this article focus on the first question, describing the methodology followed in the project. Section 2 is dedicated to putting this work in context, showing the relationship of this work to other work in the area of image annotation for the Semantic Web and to other R&D projects in related areas carried out by the Spanish company DAEDALUS. Section 7 presents some conclusions.

2 Framework
This project is connected to other R&D projects carried out by DAEDALUS in the Information Retrieval field: Omnipaper (Smart Access to European Newspapers, IST-2001-32174) [1][2] and EDDENN (Extracción de Datos de Documentos con Estructura No Normalizada – Data Extraction from Documents with Non-Normalized Structure, FIT-350200-2004-33 and 350100-2005-308, in collaboration with IPSA). Moreover, this project benefits from the participation of DAEDALUS in the European Information Retrieval forum, specifically in everything concerned with multilingual image retrieval [3][4]. The present work is related, although with a very practical approach and goals, to research activities in linguistic annotation for the Semantic Web [5], particularly to the annotation of multimedia objects [6][7]. The application of thesauri and ontologies has been explored, for example, by projects to publish Finnish museum pictures on the Internet [8][9] and images in general [10]. On some occasions, purely textual techniques are combined with others based on specific handling of image content, as in [11] or in our own experience in [4].

3 Description of the Image Archive
The information source is the digital image archive of StockPhotos, mainly consisting of high-resolution digital colour photographs, in different formats and on heterogeneous subjects, including both copyright and royalty-free images.

ID: JAP-000401-LAI
Collection: ETNICAS-III
Title: JAPON
Keywords: ASIA ORIENTE ORIENTAL ASIATICO COLOR PAISAJE MUJERES MUJER JAPONESAS JAPONESA RAZAS RAZA ETNIAS ETNIA SOMBRILLAS SOMBRILLA PARAGUAS SENTADAS SENTADA SENTARSE SENTAR RELAJADAS RELAJADA RELAJARSE RELAJAR RELAZ ARENA MIRANDO MIRAR PAISAJE TABLAS TABLA SURF TRANSPORTES TRANSPORTE DEPORTES DEPORTE AGUA MAR OCEANOS OCEANO PLAYA ARQUITECTURA EDIFICIOS EDIFICIO RASCACIELOS RASCACIELO CIUDADES CIUDAD MODERNAS MODERNA ODAIBA JAPON TOKIO ASIA VIAJES VIAJE ATRACCION TURISTICA TURISMO TIEMPO LIBRE OCIO ENTRETENIMIENTO DIVERSION DIVERTIR RECREACION RECREAR

ID: JAP-000401-LAI
Collection: ETHNICS-III
Title: JAPAN
Keywords: ASIA ORIENT ORIENTAL ASIATIC ASIAN COLOUR LANDSCAPE WOMEN WOMAN JAPANESE RACES RACE ETHNIC GROUP PARASOL UMBRELLA SAT SIT RELAXED RELAX SAND LOOKING LOOK LANDSCAPE SURFBOARD SURF TRANSPORTATION SPORTS SPORT WATER SEA OCEANS OCEAN BEACH ARCHITECTURE BUILDINGS BUILDING SKYSCRAPER CITIES CITY MODERN ODAIBA JAPAN TOKYO ASIA TRAVEL ATTRACTION TOURISTIC TOURISM FREE TIME HOBBY ENTERTAINMENT AMUSEMENT LEISURE

Figure 1: Example of Information about an Image. (The English version of the examples is not exactly aligned with the original Spanish one, as there are several more words, mainly plurals, in the latter.)


The Semantic Web neous subjects, including both copyright and royalty-free images. A stratified sampling based on subjects was performed at the beginning of the project, to finally build a set of more than 100,000 images. As long well as the photograph, each image in the archive had several structured information fields associated with it. Apart from the image identifier (unique id) and the collection in which the image is included, two other text fields were relevant to this project: the title (short description) of the image and keywords, i.e. additional terms which complement tohe description of the image. Actually both fields are concatenated and handled as one single field. As in the example in Figure 1, fields have are free text (without a controlled vocabulary), in Spanish, and describe the image from some points of view: format, technique, author or agency, etc., apart from the image content (objects in foreground or background, concepts and feelings, number, sex and age of people shown, geographical information, common use synonyms, etc.). This information is indexed in a database management system, which is exploited with internal tools and also from the company’s public website. Visitors may make queries which combine search terms and the usual logical operators (AND, OR and NOT), view the images and their attributes, and may make the purchase. Keywords are variable-length literals which are words separated by one or more spaces. To simplify the edition editing and search processes, all words are converted to uppercase and accents are eliminated. Among the keywords there are some ambiguous terms (for example ratón [mouse], a small animal or a computer peripheral) and also grammatical words (those which have no meaning on their own, such as prepositions or conjunctions, whose main function is to build the syntactic structure of the sentence). As there is no specific separation between indexing terms (different from theother than space), multiword terms are undifferentiated from the others (for example: primer plano [foreground], tiempo libre [free time] ). The text also includes incorrect words, due to spelling mistakes or typing errors, more frequently than desirable, as it is usual in documentaary databases. Furthermore, due to the highly inflectional nature of the Spanish morphology, additional indexing terms are generated by semiautomatic means to improve the system recall by increasing the number of results. These terms are the inflectional forms (nominal and/or verbal) corresponding to the original terms, and some of their synonyms. For example: „Nominal inflection (singular ↔ plural): mujeres→mujer [women→woman] „Verbal inflection: (participle↔infinitive): sentadas→ sentada/sentarse/sentar [sat (plural participle)→sat (singular participle)/sits (reflexive)/sit] „Synonyms: sombrilla→paraguas [parasol→umbrella] The objective is to be able to find a given image, independently of the specific word forms which the user includes in his/ her query, for example, edificio/edificios [building/buildings], relajadas/relajar [relaxed/relax] or sombrilla/paraguas [parasol/umbrella]. It is interesting to note that, in the case of nominal words, gender inflectional forms are not included (masculine→feminine) to avoid misleading the user when, for instance, he/she looks for niña bailando [girl dancing] and results include images with niños [boys]. 
Due to this automatic process, a well-known problem is that, in multiword terms, original and additional words are intercalated and mixed up (ciudades modernas→ciudades ciudad modernas moderna [modern cities→modern(plural)

50

UPGRADE Vol. VI, No. 6, December 2005

modern(singular) cities city]), as no distinction is possible between single and multiword terms. Another problem is that some terms may be duplicated, depending on the order in which the expansion has been done. Many descriptions have been machine-translated from other languages (English, French, German), with variable quality. Moreover, depending on their origin, some descriptions include Spanish idiomatic expressions or regional inexpressions from Spain or Latin America, which may make queries more difficult for different Spanish speakers. In short, due to the diverse origins of the resources and despite the exhaustive pre-processing tasks, descriptions are very heterogeneous and their quality changes varies among different collections or even among different images in the same collection.
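The expansion just described can be pictured with a small sketch; the inflection and synonym tables below are invented stand-ins for the real morphological resources:

    INFLECTIONS = {  # hypothetical excerpts from the morphological resources
        "MUJERES": ["MUJER"],
        "SENTADAS": ["SENTADA", "SENTARSE", "SENTAR"],
        "CIUDADES": ["CIUDAD"],
        "MODERNAS": ["MODERNA"],
    }
    SYNONYMS = {"SOMBRILLA": ["PARAGUAS"]}

    def expand(keywords):
        # Insert inflectional variants and synonyms after each original term.
        expanded = []
        for word in keywords:
            expanded.append(word)
            expanded.extend(INFLECTIONS.get(word, []))
            expanded.extend(SYNONYMS.get(word, []))
        return expanded

    # Multiword terms end up intercalated, exactly the problem noted above:
    print(expand(["CIUDADES", "MODERNAS"]))
    # ['CIUDADES', 'CIUDAD', 'MODERNAS', 'MODERNA']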

4 Thesaurus Construction
The objective of the first phase of the project was the construction of a conceptual classification thesaurus for the digital image archive, semantically representative of its contents (archive coverage). The final result of the process would be a catalogue with nodes distributed in a hierarchical tree structure, where each node represents a concept (a category) and may include one or more descriptive terms related to that concept and, in some cases, other nodes representing more specific concepts (subcategories). For example, when the concept is a "place", the descriptive terms would be "geographic references" (names of cities, monuments, rivers or mountains, including synonyms) and the more specific concepts would be "smaller territorial divisions" (such as countries or states). From this point of view, the thesaurus was a semantic net in which nodes (concepts and descriptors) are related to each other by the relationships describes/is_described_by (descriptor ↔ concept) and more_general/more_specific (concept ↔ subconcept).

After initial research into existing classification hierarchies in the area, none of which was perfectly suited to our purposes, our methodology consisted of combining the analysis of the thematic contents of the archive, the know-how in the company and the behaviour of website visitors, in an iterative process with a spiral life cycle. The final classification hierarchy was not aimed at general purposes but was rather pragmatically designed. Our client was looking for a thesaurus closely adapted to their archive contents, and insisted on priority being given to the more frequently used categories, with less emphasis on (or directly omitting) those categories with few images. Therefore, the tree is not balanced. The maximum depth corresponds to the categories alimentación [food] and naturaleza [nature] (5 levels), compared to the category monumentos [monuments] (2 levels), for which one lower specification level was enough. In its final version, the thesaurus includes 34 root categories and 276 categories in all.

XML (eXtensible Markup Language) was adopted for the thesaurus representation from the beginning, as it is considered to be the most suitable format for describing tree-structured data, as in our case. Moreover, XML is platform-independent, permits internationalization as it is fully Unicode-compatible, its computational management is quite easy and, as it is a text-based format, it is possible to read and/or edit XML documents with standard well-known editing tools, if necessary. The main drawback was the increase in the size of the thesaurus due to the tag and syntax overhead, but its impact was not considered to be significant. Finally, the data structure of the thesaurus is defined with the DTD (Document Type Definition) shown in Figure 2.
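The semantic net just described can be pictured with a rough in-memory sketch; the class and field names here are our own, not the project's:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Concept:
        name: str
        descriptors: List[str] = field(default_factory=list)        # is_described_by
        subconcepts: List["Concept"] = field(default_factory=list)  # more_specific

    # alimentación [food] is more_general than frutas [fruits];
    # the descriptors attached here are invented examples
    food = Concept("alimentación", descriptors=["comida", "alimento"])
    fruit = Concept("frutas", descriptors=["fruta", "manzana"])
    food.subconcepts.append(fruit)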


Figure 2: Thesaurus DTD.
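A DTD consistent with the structure described in the text might look roughly as follows. This is a speculative sketch: apart from the name element suggested by the surviving fragment "(name," of the figure, all element names are guesses, not the project's actual DTD:

    <!-- Speculative sketch: apart from "name", element names are guesses -->
    <!ELEMENT thesaurus  (category+)>
    <!ELEMENT category   (name, descriptor*, category*)>
    <!ELEMENT name       (#PCDATA)>
    <!ELEMENT descriptor (#PCDATA)>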

Regarding the descriptors, the design guideline for the team of linguists was that descriptors ought to be, whenever possible, word lemmas. In highly inflectional languages such as Spanish, lemmas are the paradigmatic forms that represent the whole set of inflectional forms that can be obtained through a morphological process, that is, the representatives of the different variants of the same word. Usually the lemma is the masculine singular form for nominal words (nouns, adjectives, determiners and pronouns) when that form exists (if not, the feminine or plural form), the infinitive for verbs, and the word itself for the rest (prepositions, conjunctions, etc.). Lemmas allow any linguistic operation to be performed independently of the specific variant of the word. The thesaurus includes both single and multiword terms (in the latter case, the words are concatenated with an underscore). To distinguish among ambiguous meanings of different terms, compound descriptors in the form term/meaning are used, for instance: cabo/geografía [cape/geography] and cabo/militar [corporal/army]. The thesaurus includes over 7,000 descriptors in its final version.
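A minimal sketch of how such lemma-based descriptors might be looked up follows; the lemmatiser below is a trivial stand-in (the project used commercial NLP software), and the descriptor entries are invented for illustration:

    DESCRIPTORS = {
        # lemma -> compound descriptors (term/meaning disambiguates senses)
        "cabo": ["cabo/geografía", "cabo/militar"],
        "mujer": ["mujer"],
    }

    def lemmatise(word):
        # Stand-in: the real system reduces any inflected form to its lemma.
        return {"mujeres": "mujer", "cabos": "cabo"}.get(word.lower(),
                                                         word.lower())

    def descriptors_for(keyword):
        return DESCRIPTORS.get(lemmatise(keyword), [])

    print(descriptors_for("MUJERES"))  # -> ['mujer']
    print(descriptors_for("CABO"))     # -> ['cabo/geografía', 'cabo/militar']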

5 Normalization Process
The objective of the next stage of the project was to establish a matching between each keyword in the image descriptions and the corresponding descriptor(s) in the thesaurus. This stage was named the normalization process, i.e., "categorize, adjust to a model, rule or norm" (according to the Real Academia Española, the Royal Academy of the Spanish Language), referring to the process of transforming words in a free (uncontrolled vocabulary) text into terms of a restricted or controlled vocabulary.

Obviously, the definition of the normalization process had to take into account the specific features and problems of both the digital archive and the thesaurus. In addition, it was essential to build a robust system in order to anticipate images which will be incorporated into the archive in the future. Therefore, a functional requirement was that any linguistic resource included in the system had to be easily reconfigurable (modified, expanded) by non-experts, to simplify system maintenance and evolution. The normalization process was finally designed in two cycles (a pre-cycle and a post-cycle), each one again with two stages (translation and classification), applied in cascade. The final goal was to classify the original keywords in the image description into two sets: normalized keywords and the remaining keywords. The whole normalization process is described in the next sections.

5.1 Stage 1a. Pretranslation
The first stage of the normalization process, which is called "initial translation" or pretranslation, is used to make simple transformations (translations) to the image keywords. The goal is to prepare the description in the original text before an initial keyword search in the thesaurus (the second stage). The input for this stage is a text field with the image description, and the output is the transformed (translated) text. The process is internally based on the pretranslation table, which contains a list of terms with their corresponding translation. This table is in fact a text file with two columns in the format:

original_term \t translated_term

The first column contains the original term (both single terms and multiword expressions) and the second column, separated by a tab, indicates the word or expression into which