Relating User Tags to Ontological Information

Kees van der Sluijs
Technische Universiteit Eindhoven
PO Box 513, 5600 MB, Eindhoven, The Netherlands
[email protected]

Geert-Jan Houben
Vrije Universiteit Brussel
Pleinlaan 2, 1050 Brussels, Belgium
[email protected]

ABSTRACT

why we chose a simple approach for tagging and which problems that choice introduces. In the section “Matching Component” we describe a component we created to relate user tags to ontological data utilizing Semantic Web techniques. This component uses syntactic, semantic, but also user feedback techniques to find these relationships. Then in section “Test applications” we describe two prototype applications that benefitted from our component by getting more metadata, both on the tagged objects and the users. There we also show how we can capitalize on the result of our techniques in different ways, for example for providing better personalization and faceted navigation for the user. We end the paper with conclusions and further work.

With the help of tagging, a simple and widely accepted technique, users can collaboratively provide metadata for previously uncharted collections of multimedia documents. However, the semantics of tags are rather limited, and tags are not always as helpful in disclosing a dataset as a proper ontology can be. In this paper we look at how applying both syntactic and semantic techniques to connect tags to ontologies can help to quickly obtain more semantics about a tag. Then we describe how the user can help to improve the described techniques and we sketch how user collaboration helps the content provider in obtaining professional quality metadata, and enables personalization for the user. Finally, we also consider how these semantically richer tags can help us in navigating, searching and representing the datasets.

TAGGING

New Web applications with tagging functionality appear every day. Tagging is simply the assignment of a keyword or a short sentence fragment to a resource (e.g. a multimedia document) by a user. An inherent property of tagging is that it is schemaless. This means that the user does not need any prior knowledge of some domain for annotating resources. This is what makes tagging inherently simple.

Author Keywords

Tag, Ontology Alignment, Semantic Web, String Matching, Semantic Expansion, Faceted Browsing ACM Classification Keywords

H.5.1 Information Interfaces and Presentation: Multimedia Information Systems, H.3.3 Information Storage and Retrieval: Information Search and Retrieval

A very well-known tagging system is, for instance, Del.icio.us [1]: a collaborative tagging system for web bookmarks. In [1] it is observed that, based on user behavior, different kinds of tags can be distinguished. Tags that identify what (or who) the content is about are used the most by far. Tags that identify the kind of content, the owner, qualities or characteristics, and refinements are also used. Moreover, users sometimes use unique tags for bookmarking purposes.

INTRODUCTION

With the ongoing Webification most users expect to be able to satisfy their entire information need from the Web and its Web applications. Therefore, most companies and institutes try to open up their information collections to their users. However, this is not equally easy for every information source, multimedia collections being a case in point. This is especially true if these collections have no, or few, metadata descriptions.

Even though tagging is a very simple and accessible mechanism for the user, it is not the best way to get well-structured information. One problem is that it is not always clear what exactly is meant by a tag. This can have a number of causes; consider for instance spelling mistakes, disambiguation concerns (e.g. Pluto: god, planet or cartoon character), words that have more than one common spelling (e.g. “modeling” versus “modelling”) or morphology.

In this paper we look at using user collaboration for obtaining metadata. In the section “Tagging” we explain

Another problem is that tags are not very well structured. It is not clear which tags are related to which other tags, or what property of a resource is actually described. For example, the tags differ in how specific they are. The picture in Figure 1 could for instance be tagged with “Building”, “Church” or “Catharinachurch”. However, if you knew that “Catharinachurch” is a type of “Church” and “Church” is a kind of “Building”, this information could be used during a search for buildings, and you could then also find resources labeled only with “Catharinachurch”. As an example of property confusion consider the tag “old” as applied to Figure 1. Is it meant that the building in the picture is old, that the photo itself is old, or maybe both?

is to be a global library usable by several specific applications. Therefore, the functionality should be rather generic, with configuration options to blend into a specific application. A simplified diagram of the main input/output behavior of the matching component is depicted in Figure 2. The input is a set of tags (strings). Typically, this is just one tag (the one just typed in by the user). However, users can attach several tags to a resource. In this case we might use these other tags for disambiguation purposes (as will be detailed later). The set of tags is then compared with a set of ontologies or other data like previously accepted tags. Based on our analysis we then recommend alternative tags (again strings) for the user input. Every suggestion has a certainty attached to it, which represents our confidence that the suggestion matches the input. Furthermore, a suggestion can also have a set of concepts (i.e. URIs) associated with it, where the suggestion is a label associated with the concept. It is a set of concepts, because several concepts may share the same label.
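The output described above can be pictured as a small record type. The following is only a minimal sketch: the class name `Suggestion`, its field names, the helper `rank` and the example URIs are hypothetical illustrations, not the component's actual interface.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Suggestion:
    """One alternative tag proposed for a user's input tag (hypothetical type)."""
    label: str        # the suggested tag text
    certainty: float  # confidence in [0, 1] that the suggestion matches the input
    # URIs of all concepts that carry this label; a set, because
    # several concepts may share the same label.
    concepts: frozenset = field(default_factory=frozenset)


def rank(suggestions):
    """Return the suggestions ordered from most to least certain."""
    return sorted(suggestions, key=lambda s: s.certainty, reverse=True)


# Made-up example URIs: one label shared by two concepts.
suggestions = [
    Suggestion("banks", 0.7, frozenset({"http://example.org/geo#Banks"})),
    Suggestion("bank", 0.9, frozenset({
        "http://example.org/geo#Bank",
        "http://example.org/building#Bank",
    })),
]
best = rank(suggestions)[0]
```

Here `best` is the "bank" suggestion, which carries two concept URIs because both hypothetical ontologies use the same label.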

Figure 1: Example Picture1

The Semantic Web initiative offers ways to describe metadata in a more structured manner. Furthermore, it allows defining ontologies that can be used for (limited) reasoning capabilities. An ontology is considered here as a knowledge or data structure that in a carefully constructed hierarchical manner lists all the concepts and their relationships, thus representing the semantics of the domain of that knowledge or data. To obtain semantically annotated information the user has to provide more information. However, solutions like letting the user fill in forms or providing a graph notation might seem too complex or too time-consuming for the user. Therefore, we try to maintain the simplicity of tagging while unlocking the potential of Semantic Web metadata by relating the user tags to ontological information. The main envisioned benefit of this approach is an improved searching and browsing experience for the user. For instance, by tying an input tag “diner” to the concept “restaurant” we can exploit the relationships between that concept and concepts like “building” and “bar”, so that people can for example easily navigate from pictures of restaurants to either pictures of bars and other buildings or pictures of some specific restaurant.

Figure 2: Matching Component

At the heart of the matching component is the actual analysis of what suggestions to return to the user given an input tag. The analysis is divided in four steps: string matching, semantic expansion, context disambiguation and user feedback. First, though, we have to look at what to compare to what. As a data model we focus on the use of RDF2 and later on we will use some typical RDF properties. For efficient access to our structured data we store it in a database. Because of our use of the RDF data model we chose Sesame [2] as our data store. As a query language we currently chose the Sesame SeRQL query language [3], because its current implementation is quite mature and has some handy query features. However, in the future we might as well switch to W3C recommendation SPARQL3 when a mature implementation becomes available (for Sesame).

String Matching

The basis of our analysis is string matching: comparing strings to search for similarities.

MATCHING COMPONENT

To relate tags to ontological information we developed a matching component. The goal of the matching component

1 Taken from the RHCe dataset, see section “CHI”.
2 http://www.w3.org/RDF/
3 http://www.w3.org/TR/rdf-sparql-query/

calculated word distance. The string matching result is the input for the next steps of analysis.

The user input, the set of tags, basically consists of strings. As concepts in RDF are denoted by URIs, whose syntax need not carry any meaning, we have to find the labels associated with those concepts. A typical way to model a label is with the rdfs:label property; however, this might differ per source (an alternative would for instance be using SKOS4, i.e. skos:altLabel and skos:prefLabel, for more fine-grained control). Therefore the matching component can be configured per source by defining a graph pattern to find the labels that should be compared with the input tags. For this pattern it has to be defined which variable represents the label and which variable represents the original concept, as we want to discern the two. An example configuration would thus be:

Semantic Expansion

With Semantic Expansion we expand the output of the previous steps with new suggestions based on the structure of the underlying ontology. The underlying assumption is that an ontology connects related concepts and that there are many cases in which such related concepts might be a good alternative to the originally found concept. As an example consider exploiting the rdfs:subClassOf property. This property can be used to find more specific concepts, more generic concepts or sibling concepts. Consider for instance again the example where “church” is a subclass of “building”. By using the rdfs:subClassOf relationship we can suggest “building” as the more generic term for concept “church” and “church” as a more specific type of “building”. Furthermore, if we also know that “city hall” is a subclass of “building”, we can suggest “city hall” as a suggestion for “church” as its sibling concept. Making this suggestion might be convenient if the tagging user is not completely sure what kind of building he is tagging. However, even though this kind of suggestion might be useful in some cases, the new concept is in general less certain than the originally found one. Therefore it can be specified in the configuration how to decrease the certainty of an expansion-concept given the certainty of the original concept.
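The subclass-based expansion can be sketched as follows. This is an illustration under stated assumptions, not the component's implementation: the toy hierarchy `SUBCLASS_OF`, the function `expand` and the multiplicative `decay` factor are all hypothetical; the text only says that the certainty decrease is configurable, not which formula is used.

```python
# Hypothetical toy hierarchy: child concept -> parent concept (rdfs:subClassOf).
SUBCLASS_OF = {
    "church": "building",
    "city hall": "building",
}


def expand(concept, certainty, decay=0.5):
    """Return {related concept: certainty} for the parent, children and siblings.

    Assumption: every property step multiplies the certainty by `decay`,
    so siblings (two steps away) end up less certain than the parent.
    """
    suggestions = {}
    parent = SUBCLASS_OF.get(concept)
    if parent is not None:
        suggestions[parent] = certainty * decay  # more generic concept
        for child, child_parent in SUBCLASS_OF.items():
            if child_parent == parent and child != concept:
                suggestions[child] = certainty * decay * decay  # sibling concept
    for child, child_parent in SUBCLASS_OF.items():
        if child_parent == concept:
            suggestions[child] = certainty * decay  # more specific concept
    return suggestions


from_church = expand("church", 0.8)
from_building = expand("building", 1.0)
```

With these toy values, `from_church` suggests "building" more strongly than the sibling "city hall", and `from_building` suggests both subclasses with equal certainty.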

Source.0.GraphPattern = {x}rdfs:label{y}
Source.0.ConceptVar = x
Source.0.LabelVar = y

All labels ‘y’ will then be matched with the input tag. And every label ‘y’ will have concept ‘x’ associated with it.

We use several algorithms for matching. The simplest matching algorithm is exact matching. We have two versions of exact matching: strict and non-strict. In strict the whole tag has to match the whole label. In non-strict also substrings are considered. As an example consider input-tag “after”. In strict matching this will not match with “afternoon” while in non-strict matching it will.
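A minimal sketch of label lookup followed by strict and non-strict exact matching is given below. It works on plain Python tuples instead of an RDF store, and the function names, the sample triples and the case-insensitive comparison are assumptions made for illustration only.

```python
def find_labels(triples, label_property="rdfs:label"):
    """Yield (concept, label) pairs, mimicking the pattern {x} rdfs:label {y}."""
    for subject, predicate, obj in triples:
        if predicate == label_property:
            yield subject, obj


def exact_match(tag, label, strict=True):
    """Strict: the whole tag equals the whole label.
    Non-strict: the tag may also be a substring of the label.
    Assumption: comparison ignores case."""
    tag, label = tag.lower(), label.lower()
    return tag == label if strict else tag in label


def match_tag(tag, triples, strict=True):
    """Map every matching label to the set of concepts that carry it."""
    matches = {}
    for concept, label in find_labels(triples):
        if exact_match(tag, label, strict):
            matches.setdefault(label, set()).add(concept)
    return matches


# Hypothetical sample data; only rdfs:label triples are considered.
TRIPLES = [
    ("ex:Afternoon", "rdfs:label", "afternoon"),
    ("ex:After", "rdfs:label", "after"),
    ("ex:Afternoon", "rdfs:comment", "after midday"),
]

strict_result = match_tag("after", TRIPLES, strict=True)
loose_result = match_tag("after", TRIPLES, strict=False)
```

In this sketch the strict result contains only the label "after", while the non-strict result also contains "afternoon", as in the example above.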

Which properties to follow can also be specified in the configuration, since the sensible property to follow differs between ontologies. We allow two ways to define which semantic expansion we want to perform. One way is to simply define the property, for example “rdfs:subClassOf” or “skos:related”. It is also possible to specify whether the property should be followed in the forward direction, in the inverse direction, or both (e.g. for finding either the more specific or the more general concept). Note that we also use knowledge of how to find literals for concepts here, so that we do not confuse concepts and their labels.

Other matching algorithms are fuzzier and tolerate spelling mistakes by calculating word distances, e.g. the number of transformations (like deletions, insertions and substitutions) needed to turn one word into the other. We implemented the use of three such metrics, namely Levenshtein distance [4], Jaro-Winkler distance [5] and Soundex [6]. Levenshtein is a very well-known metric for calculating word distances. It calculates the minimum number of edits to transform one word into another (the fewer edits, the more alike the words are). Jaro-Winkler is an alternative in which matching characters in two words are compared based on the position of the character in the words. Words that share many similar characters at similar positions are considered more similar than words that do not. The Soundex algorithm is based on the phonetics of words. Words are considered more similar to each other the more they sound alike.
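As an illustration of the first metric, the sketch below computes the Levenshtein distance with the standard dynamic-programming recurrence. The `similarity` helper, which rescales the distance to a score between 0 and 1 so that it can be compared against a threshold, is an assumption of this sketch; the text does not state how distances are turned into certainties.

```python
def levenshtein(a, b):
    """Minimum number of deletions, insertions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Hypothetical normalization: 1.0 for identical words, 0.0 for no overlap."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


distance = levenshtein("modeling", "modelling")
score = similarity("modeling", "modelling")
```

The two spellings differ by a single insertion, so `distance` is 1 and `score` is close to 1.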

Sometimes, however, this simple scheme is too simple, i.e. if the path we want to follow is more complex than following a single property. Therefore the second way to specify semantic expansion is by specifying an explicit query (using SeRQL). Consider for instance the example of finding siblings of concepts, which requires two property steps instead of one. A slightly more complex example can be found in the query of Figure 3, where we look for the synonyms of a word by using WordNet5. The %inputTerm% variable in the query will be replaced in turn by every tag suggestion we have thus far. The words that will result from this query are specified in the SELECT clause.

The matching component can be configured to use one of the matching algorithms. A tag and a label match if the word distance calculated by the configured algorithm is equal to or greater than a threshold that can also be configured. The result of the string matching process is a set of suggestions, each with a certainty that is equal to the

4 http://www.w3.org/TR/swbp-skos-core-spec/
5 Refer to: http://www.w3.org/2006/03/wn/wn20/

SELECT DISTINCT wordForm
FROM {Synset} ws:containsWordSense {aWordSense} ws:word {aWord} ws:lexicalForm {"%inputTerm%"@en-us},
     {Synset} ws:containsWordSense {bWordSense} ws:word {bWord} ws:lexicalForm {wordForm}
WHERE NOT wordForm = "%inputTerm%"@en-us
LIMIT 5
USING NAMESPACE ws = <http://www.w3.org/2006/03/wn/wn20/schema/>

Figure 3: SeRQL query for finding synonyms in WordNet

Besides expansion of concepts, we also allow the replacement of (previous) suggestions. How this might be useful is illustrated in Figure 4. Suppose we have a concept with several labels where we can discern between kinds of labels, e.g. preferred labels and alternative labels. If the input of the user matches an alternative label we might want to show the preferred label instead of the alternative one. For instance, suppose we have a concept with a preferred label “Building” and an alternative label “Construction”. Now if a user tag matches “Construction”, we might want to give “Building” as the alternative and not “Construction”. Specifying that a query has to replace the previous suggestions instead of just expanding them is done simply by setting the configuration option “suggestionReplace” to true.

Figure 5: Two matches in one region of an ontology

The certainty-increase for tag suggestions that are found in the same region is configurable. Also the number of properties that may lie between two tag suggestions for them to still be considered a neighborhood is configurable.

User Feedback

The former processes for finding alternatives for tags were rather user-independent. Given the input tags, the ontology, and the configuration, the suggestions would always be the same. By utilizing user feedback we want to improve the suggestions. The matching component has a feedback channel. It has a simple and an extensive feedback input. The simple feedback input can be used by the applications to report which suggestion was actually chosen by a user (implicit feedback). The extensive feedback input can be used for concrete feedback of a user on a set of suggestions, indicating which suggestions he thinks were good and which were bad (explicit feedback). The feedback is recorded in a separate RDF store. If supplied, we record the involved user (for the moment simply identified by a URI); we also store the original tag (or tags) and which suggestion the user chose. Similarly, with concrete feedback we store the feedback of the specific user per input tag and suggestion.
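The implicit feedback channel can be sketched as a small in-memory store. This is a hypothetical illustration: the paper records feedback in a separate RDF store, whereas the sketch uses Python dictionaries, and the class `FeedbackStore`, its method names and the `weight` parameter are assumptions, as is the rule that certainty grows with the share of users who chose a suggestion.

```python
from collections import Counter, defaultdict


class FeedbackStore:
    """Records which suggestion users chose for a tag (implicit feedback)."""

    def __init__(self):
        self.events = []                     # (user URI or None, tag, chosen suggestion)
        self._chosen = defaultdict(Counter)  # tag -> how often each suggestion was chosen

    def record_choice(self, tag, suggestion, user=None):
        self.events.append((user, tag, suggestion))
        self._chosen[tag][suggestion] += 1

    def adjust(self, tag, suggestions, weight=0.3):
        """Raise each certainty in proportion to how often the suggestion was chosen.

        `suggestions` maps suggestion -> certainty; tags without feedback are unchanged.
        """
        counts = self._chosen.get(tag)
        if not counts:
            return dict(suggestions)
        total = sum(counts.values())
        return {
            suggestion: min(1.0, certainty + weight * counts[suggestion] / total)
            for suggestion, certainty in suggestions.items()
        }


store = FeedbackStore()
store.record_choice("bank", "bank (building)", user="http://example.org/users/1")
store.record_choice("bank", "bank (building)", user="http://example.org/users/2")
store.record_choice("bank", "bank (river)", user="http://example.org/users/3")

adjusted = store.adjust("bank", {"bank (building)": 0.5, "bank (river)": 0.5})
```

Because two of the three recorded choices went to the building sense, that suggestion ends up with the higher certainty in `adjusted`.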

Figure 4: Example of Concept with multiple labels

Note that it is possible to define more than one semantic expansion query per source in case there is more than one way to expand the tag suggestions. It is of course also possible to skip this phase if it is not needed. The result of the semantic expansion phase is again a set of suggestions (each with certainty and a set of associated concepts).

Context Disambiguation

With context disambiguation we exploit the fact that the input for the matching component is a set of tags instead of a single tag. The idea is, as illustrated by Figure 5, that if two or more tags have a match in the same region of an ontology then the chance is probably higher that the suggestions are good ones. For instance, if the input contains the tags “bank” and “building”, the matches in the building (part of an) ontology get a higher certainty than if the tags had been “bank” and “river”.
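One way to picture this step is a breadth-first search that checks whether a match for one tag lies within a few property steps of a match for another tag. The sketch below is an assumption-laden illustration: the adjacency-list graph, the concept names, and the additive `boost` and `max_hops` parameters are hypothetical stand-ins for the configurable certainty increase and neighborhood size.

```python
from collections import deque


def hops(graph, start, goal, max_hops):
    """Breadth-first search: hop count from start to goal, or None if beyond max_hops."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        if distance < max_hops:
            for neighbour in graph.get(node, ()):
                if neighbour not in seen:
                    seen.add(neighbour)
                    queue.append((neighbour, distance + 1))
    return None


def disambiguate(matches, graph, max_hops=2, boost=0.2):
    """matches: {tag: {concept: certainty}}.

    A concept's certainty is raised when a match for a different input tag
    lies within max_hops property steps of it.
    """
    result = {}
    for tag, concepts in matches.items():
        result[tag] = {}
        for concept, certainty in concepts.items():
            near = any(
                hops(graph, concept, other, max_hops) is not None
                for other_tag, others in matches.items() if other_tag != tag
                for other in others
            )
            result[tag][concept] = min(1.0, certainty + boost) if near else certainty
    return result


# Hypothetical undirected toy graph covering two unconnected ontology regions.
GRAPH = {
    "bld:Bank": ["bld:Building"],
    "bld:Building": ["bld:Bank", "bld:Church"],
    "bld:Church": ["bld:Building"],
    "geo:Bank": ["geo:River"],
    "geo:River": ["geo:Bank"],
}
matches = {
    "bank": {"bld:Bank": 0.5, "geo:Bank": 0.5},
    "building": {"bld:Building": 0.9},
}
result = disambiguate(matches, GRAPH)
```

With the input tags "bank" and "building", only the building-ontology sense of "bank" is boosted; the geographic sense keeps its original certainty.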

The information we thus store is in the first place used to change certainty. The first thing the matching component checks (before the start of string matching) is whether the tag is already in the feedback store. If so, we are able to immediately output the result, but now including the feedback of the user. How the result is adapted by that feedback is configurable. Consider for instance the input tag “bank”. We might have a geographic-features ontology and a building ontology that both have matches for “bank”, and both ontologies are considered by the system to be equally important. Even so, if most users choose an alternative for “bank” in the building ontology, we consider this ontology more relevant for the tag “bank” and will start to increase the certainty of matches in that ontology. A side effect of storing user feedback data is that we can use this information to build a user profile. Even though we currently do not use this data yet, we intend to use it to improve personal tag suggestions. Given knowledge about previous tags and accepted tags we expect to be able to derive knowledge about user involvement with certain (parts of)

Second the matching component can be used during search. The search terms are then matched to concepts in the GTAA ontology and extended in the semantic expansion phase. This should lead to a more extensive query which should lead to a higher recall, also because more items are expected to be tagged with GTAA terms because of the earlier suggestions. Coming user tests will test these and other hypotheses.

ontologies and therefore we could infer that suggestions from those (parts of) ontologies should be given a higher certainty. Furthermore we want to look at how we can reuse user profiles from other applications or share our user profiles with others, which would be in line with previous work we have done like [7].

TEST APPLICATIONS

We applied the matching component within two concrete applications named ViTa and CHI. Both have very similar demands which can be fulfilled by the matching component. In this section we briefly describe the applications and explain how our techniques can contribute to a richer user experience and how we can thus arrive at better personalization, useful for both the user and the content provider.

CHI

The main stakeholder for the CHI application is RHCe. RHCe (Regional Historic Center Eindhoven) is an institute that governs all historic information that is related to the cities in the region around Eindhoven in the Netherlands. The information is gathered from local government agencies or private persons and groups. This includes collections like birth, marriage and death certificates, but also posters, drawings, pictures, videos, city council minutes, etcetera. The amount of information they store is large: physical archives are quantified in kilometers.

ViTa

One of the two applications we discuss in this paper is called ViTa, which is constructed in a project with the same name. ViTa is a Video Tagging system for educational videos. Its goal is “to find more relevant information by better descriptions”.

One of the goals of CHI is to expose these collections to the general public. However, especially for the videos and pictures very little metadata is available which makes indexing this data for navigation or searching hard. Another goal of RHCe is to have high-quality metadata of all their collections for easy retrieval (both online and offline, and both for the general public and for the officials of the local government). RHCe employs a number of domain experts whose full-time job is to provide high quality metadata over multimedia documents. However, in spite of all their efforts by far most of their collections have no metadata at all.

The two main stakeholders in this project are SURFnet6 and Kennisnet7. SURFnet is a Dutch national non-profit organization that provides internet access and internet-related services to most higher education institutes (like universities) and research centers. Kennisnet is a Dutch public organization dedicated to providing IT-services for primary and secondary education and vocational training. Both have a videoportal8 with many educational videos, but also with very little known metadata about most videos.

The goal of CHI is twofold. First, it has to disclose the RHCe-dataset to the users for searching and browsing. Second, it has to provide a way for users to suggest metadata, but in such a way that the domain experts get the most promising metadata first and can simply agree with a suggestion or not.

One of the ideas behind the ViTa application is that it might be better to let the users (e.g. students) tag the videos instead of the experts (e.g. the teachers), because users might use different terms than experts and may find the videos they are looking for more easily if those videos can be found using their own terms (assuming there is some form of homogeneity among the users). This idea will be tested on an actual group of users, in this case students from a high-school class.

In order to accomplish this, CHI also uses a tagging mechanism for the users to overcome the sparseness of the metadata, similar to ViTa, because of its simplicity and time effectiveness. However, within CHI we discern three types of tags based on the three dimensions that are applicable to practically all items in the heterogeneous data source: time, location and keywords. Users can make suggestions for all three dimensions. The matching component has an initial ontology for those dimensions at its disposal (crafted internally by RHCe). The matching component will then look if there are alternatives available in the initial ontologies related to the input tags of the user and will suggest those.

The matching component is used for two reasons in this application. The first reason is improving the user’s tagging quality by providing alternatives from a carefully constructed ontology. During tagging we use the GTAA ontology (an extensive Dutch ontology, e.g. described in [8]) for providing alternatives for input tags. The users can then choose one of these alternatives as ‘correction’ for their input tag.

6 http://www.surfnet.nl/
7 http://www.kennisnetictopschool.nl/
8 http://video.surfnet.nl/ and http://videoportal.kennisnet.nl/

In this way every multimedia document has a set of user tags and a set of RHCe approved tags. Users can vote on the tag suggestions of others to indicate whether they agree with a tag or not (see screenshot Figure 6). RHCe administrators have an overview of the most approved tags and they can simply decide to officially accept a tag or to reject it. They can also see a list of the tags that are used the most, but are not in the initial ontologies. This information can for instance be used to extend the initial ontologies.

ACKNOWLEDGMENTS

This work was carried out within the ViTa project and the CHI project. ViTa is part of the MultimediaN10 program, which is sponsored by the Dutch government. ViTa participants are the Telematica Instituut, SURFnet, Kennisnet, FabChannel, Roessingh Research and Development and Eindhoven University of Technology. The CHI project is a collaborative effort of RHCe (Regionaal Historisch Centrum Eindhoven) and Eindhoven University of Technology.

REFERENCES

1. Scott Golder and Bernardo A. Huberman. Usage Patterns of Collaborative Tagging Systems. In Journal of Information Science, vol. 32, no. 2 (2006), 198-208.

Figure 6: Current Location Tags for an Object (in Dutch)

By using the three different dimensions over the objects we are able to provide a faceted search (for a similar approach refer e.g. to [9]). We are also able to group results based on one of the dimensions in a specialized view. In Figure 7 there is an example of a map view where items with similar keywords are clustered on common locations visualized in Google Maps9. We have similar views for time (with a timeline) and for keywords (with a cloud representation).

2. Jeen Broekstra, Arjohn Kampman and Frank van Harmelen. Sesame: An Architecture for Storing and Querying RDF and RDF Schema. In Proc. First International Semantic Web Conference, Springer-Verlag Lecture Notes in Computer Science (LNCS) no. 2342 (2002), 54-68. See also http://www.openrdf.org/

3. Jeen Broekstra. SeRQL: A Second-Generation RDF Query Language. Chapter 4 in Storage, Querying and Inferencing for Semantic Web Languages. PhD Thesis, Vrije Universiteit Amsterdam (2005). See also: http://www.openrdf.org/doc/sesame/users/ch06.html

4. Fred J. Damerau. A Technique for Computer Detection and Correction of Spelling Errors. In Communications of the ACM, ACM, vol. 7, no. 3 (1964), 171-176.

5. Matthew A. Jaro. Advances in Record Linking Methodology as Applied to the 1985 Census of Tampa Florida. In Journal of the American Statistical Association, vol. 84, no. 406 (1989), 414-420.

6. Donald E. Knuth. The Art of Computer Programming Volume 3: Sorting and Searching. Addison-Wesley (1973), 394-395.

Figure 7: Map view in CHI

CONCLUSION

In this paper we concentrated on multimedia Web information systems that use collaborative tagging to acquire metadata for their data collection. We described a matching component that uses syntactic, semantic and user feedback techniques to relate user tags to ontological information. Then we described two prototype applications that use the matching component to achieve better metadata descriptions, but also improved searching facilities, personalization opportunities and faceted browsing.

7. Kees van der Sluijs and Geert-Jan Houben. A Generic Component for Exchanging User Models between Web-based Systems. In International Journal of Continuing Engineering Education and Life-Long Learning (IJCEELL), Inderscience, vol. 16, no. 1/2 (2006), 64-76.

8. Hennie Brugman, Véronique Malaisé and Luit Gazendam. A Web Based General Thesaurus Browser to Support Indexing of Television and Radio Programs. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006).

9. Eetu Mäkelä, Eero Hyvönen and Samppa Saarela. Ontogator - A Semantic View-Based Search Engine Service for Web Applications. In Proc. International Semantic Web Conference, Springer (2006), 847-860.

9 Via the Google Maps API, refer for more information to: http://www.google.com/apis/maps/
10 http://www.multimedian.nl