Annotation-based Document Retrieval with Probabilistic Logics

Ingo Frommholz
University of Duisburg-Essen, Duisburg, Germany
[email protected]
Abstract. Annotations are an important part of today’s digital libraries and Web information systems as an instrument for interactive knowledge creation. Annotation-based document retrieval aims at exploiting annotations as a rich source of evidence for document search. The POLAR framework supports annotation-based document search by translating POLAR programs into four-valued probabilistic datalog and applying a retrieval strategy called knowledge augmentation, where the content of a document is augmented with the content of its attached annotations. In order to evaluate this approach and POLAR’s performance in document search, we set up a test collection based on a snapshot of ZDNet News, containing IT-related articles and attached discussion threads. Our evaluation shows that knowledge augmentation has the potential to increase retrieval effectiveness when applied in a moderate way.
In today’s digital libraries and Web information systems, annotations are an important instrument for interactive knowledge sharing. By annotating a document, users change their role from passive readers to active content providers. Annotations can be used to establish annotation-based collaborative discussion when they can be annotated again, or they can appear as private notes or remarks. Depending on their role as content-level or meta-level annotations, they either extend the document content or convey interesting information about documents. In any case, annotations are an important adjunct to the primary material a digital library deals with. Examples of annotation-based discussion can be found in newswire systems on the Web like ZDNet News (http://news.zdnet.com/), where users can write comments on the published articles, which in turn can be commented on again. We also find annotation-based discussion in several digital libraries (see, e.g., ). Annotations are an important source for satisfying users’ information needs, which is why we recently evaluated several annotation-based discussion search approaches. But annotations can also play an important part in document search, as an additional source of evidence for the decision whether a document is relevant w.r.t. a query or not. It is thus a straightforward step to seek effective methods for annotation-based document retrieval.
While classical retrieval tools enable us to search for documents only as atomic units without any context, systems like POOL are able to model and exploit document structure and nested documents. But in order to consider the special nature of annotations for retrieval, we proposed POLAR (Probabilistic Object-oriented Logics for Annotation-based Retrieval) as a framework for annotation-based document retrieval and discussion search. POLAR can not only cope with structured documents like POOL, but also with annotations, helping to satisfy various information needs in an annotation environment. Although some POLAR concepts like knowledge and relevance augmentation with so-called context and highlight quotations have already been evaluated for discussion search, there has been no evaluation of annotation-based document search so far. In this paper, we therefore present the results of further experiments applying knowledge augmentation to document search with a prototype of the POLAR system. We start with a brief description of POLAR and its implementation before discussing our test collection and the evaluation.
POLAR is a framework targeted at annotation-based retrieval, i.e. document, annotation and discussion search. With POLAR, developers of digital libraries can integrate methods for document search (exploiting annotations), annotation search and discussion search (considering the structural context in threads) into their systems. It supports annotation types and is able to distinguish between annotations made on the content- or meta-level. In this paper, we assume that our collection consists of main documents and, attached to them, annotation threads establishing a discussion about their corresponding root document. This is the typical annotation scenario we find on the Web and in many digital libraries. In POLAR, documents, annotations and their content, categorisations, attributes and relationships are modeled as probabilistic propositions in a given context. A context, in our case, is a document or an annotation. Figure 1 shows an example knowledge base modeled in POLAR. Line 1 describes the document d as a context. d is indexed with the terms ‘information’ and ‘retrieval’; their term weights are the probabilities of the corresponding term propositions and can be derived, e.g., based on their term frequency within the given context. d is annotated by the content annotation a (also described as a context), which
1  d[ 0.5 information 0.6 retrieval 0.6 *a ]
2  a[ 0.7 search 0.7 *b ] b[...]
3  document(d). annotation(a). annotation(b)
4  0.5 ◦ search  0.4 ◦ information

Fig. 1. An example POLAR knowledge base
contains the term ‘search’ (l. 2). a in turn is annotated by b. Line 3 categorises d, a and b as documents and annotations, respectively. If a context c1 relates to another context c2 (e.g. if c2 annotates c1), we define the access probability that c2 is entered from c1. As an example, b is accessed from a with probability 0.7. Access probabilities can be given directly, for example as a global value valid for all contexts accessed by other contexts, or individually for every entered context. They can also be derived via rules, e.g. to reflect users’ preferences for or against certain authors of annotations. In our experiments in Section 4.2, we assume global values for access probabilities and provide these values directly. Line 4 shows some global term probabilities for ‘search’ and ‘information’; these probabilities can be based, e.g., on the inverse document frequency. To allow for annotation-based retrieval, POLAR supports rules and queries like

rel(D) :- D^q[search]
rel(D) :- D^q[information]
?- rel(D) & document(D)

which return all documents relevant to the query q = “information OR search” (capital letters denote variables). The core concept of annotation-based retrieval in POLAR is augmentation. Augmentation in our scenario means that we extend a context with its corresponding annotation (sub-)threads. In radius-1 augmentation, each context is augmented only with its direct annotations, whereas in full augmentation, it is augmented with its whole annotation (sub-)threads. Figure 2 shows the augmented context for document d in our example. The solid line shows the augmented context d(a) for radius-1 augmentation, whereas the dotted line shows the augmented context d(a(b)) of d when applying full augmentation. In the latter case, a and b are subcontexts of the supercontext d(a(b)), whereas in radius-1 augmentation, only a is a subcontext of d(a).
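To make the augmented-context idea concrete, the following minimal Python sketch builds the radius-1 and full augmentations over the example of Fig. 1. This is an illustrative reading of the semantics, not POLAR’s implementation; the data structure and the function name are made up, and term weights propagated from sub-contexts are multiplied with the access probability and combined like independent events.

```python
# Illustrative sketch only (not POLAR's implementation): a context is a dict
# holding term weights and probabilistically accessed annotations.
# Mirrors Fig. 1: d is annotated by a (access prob. 0.6), a by b (0.7).

b = {"terms": {}, "annotations": []}
a = {"terms": {"search": 0.7}, "annotations": [(0.7, b)]}
d = {"terms": {"information": 0.5, "retrieval": 0.6}, "annotations": [(0.6, a)]}

def augment(ctx, radius=None):
    """Term weights of the augmented context.

    radius=1:    use only direct annotations (radius-1 augmentation).
    radius=None: traverse whole annotation sub-threads (full augmentation).
    A propagated weight is the access probability times the term's weight in
    the sub-context; weights for the same term are combined as independent
    events (probabilistic OR).
    """
    terms = dict(ctx["terms"])
    for p_acc, anno in ctx["annotations"]:
        sub = anno["terms"] if radius == 1 else augment(anno, radius)
        for t, w in sub.items():
            propagated = p_acc * w
            prev = terms.get(t, 0.0)
            terms[t] = prev + propagated - prev * propagated
    return terms

print(augment(d, radius=1))  # 'search' enters d(a) with weight 0.6 * 0.7 ≈ 0.42
```

In the augmented context d(a) the term ‘search’ thus receives the weight 0.42 mentioned in the text; since b carries no terms in the example, full augmentation yields the same weights here.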
It is clear that full augmentation, where we traverse whole annotation threads, is more resource-consuming than radius-1 augmentation, where we only consider direct comments. We recently proposed two basic augmentation strategies for annotation-based
Fig. 2. Augmented contexts for radius-1 and full knowledge augmentation.
retrieval: knowledge augmentation, where the knowledge contained in a context is propagated to its supercontext, and relevance augmentation, where we propagate relevance probabilities (see  for a further discussion). We
focus on knowledge augmentation only. Without augmentation, the above POLAR query would calculate a retrieval status value (RSV) of 0.5 · 0.4 = 0.2 for d, based on the term ‘information’ alone (its weight within the context d times the global term probability), since d does not know about ‘search’. The POLAR program

rel(D) :- //D^q[search]
rel(D) :- //D^q[information]
?- rel(D) & document(D)

applies knowledge augmentation (indicated by the ‘//’ prefix). The term ‘search’ is propagated from a to d according to the access probability of 0.6 and thus has the term weight 0.6 · 0.7 = 0.42 in the augmented context d(a). The RSV of d is now 0.368, based on both terms ‘information’ and ‘search’ in the augmented context d(a) (the calculation of this value is explained in the following section). Note that POLAR supports the augmentation not only of term propositions but also, like POOL, of classifications and attributes. We focus our further considerations on term augmentation, as this is what is applied in the experiments reported later.
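The RSV values 0.2 and 0.368 of the running example can be reproduced with a small sketch (an assumed reading of the probabilistic semantics, not the POLAR engine; the function name is illustrative): each matching query term contributes its context weight times its global term probability, and the disjunctive rules are combined like independent events.

```python
def rsv(term_weights, global_probs, query_terms):
    """Combine per-term contributions with the probabilistic OR for
    independent events (two-event inclusion-exclusion, applied iteratively).
    Illustrative sketch, not POLAR's actual evaluation code."""
    p = 0.0
    for t in query_terms:
        contrib = term_weights.get(t, 0.0) * global_probs.get(t, 0.0)
        p = p + contrib - p * contrib
    return p

global_probs = {"search": 0.5, "information": 0.4}   # line 4 of Fig. 1
query = ["search", "information"]

# Without augmentation, d only knows 'information': 0.5 * 0.4 = 0.2
print(rsv({"information": 0.5, "retrieval": 0.6}, global_probs, query))

# With radius-1 augmentation, 'search' has weight 0.42 in d(a):
# 0.42 * 0.5 = 0.21, and 0.2 + 0.21 - 0.2 * 0.21 = 0.368
print(rsv({"information": 0.5, "retrieval": 0.6, "search": 0.42},
          global_probs, query))
```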
POLAR Implementation and Translation into FVPD
POLAR is implemented on top of four-valued probabilistic Datalog (FVPD) by translating POLAR programs into FVPD ones and executing these with an FVPD engine. FVPD can deal with inconsistent and contradicting knowledge and supports the open world assumption, which are important features for annotation-based discussion or semantic annotation, where annotators might have different opinions about things or about how to tag a document. After being parsed and translated into FVPD, each POLAR program is executed by HySpirit, a probabilistic Datalog implementation. Besides translation methods, the implemented POLAR prototype offers classes for creating an index for the required data structures. Our prototype realises certain translation methods, which we call trans here, to translate POLAR propositions, rules and queries into FVPD. As an example, trans("rel(D) :- D^q[search]") creates the FVPD rule

instance_of(D,rel,this) :- term(search,D) & termspace(search)

(this denotes the global database context). Probabilistic POLAR propositions are translated into probabilistic FVPD ones, e.g.

trans("a[0.7 search]") = "0.7 term(search,a)"
trans("0.5 ◦ search")  = "0.5 termspace(search)"

For knowledge augmentation, the POLAR rule rel(D) :- //D^q[search] is translated into
term_augm(T,D) :- term(T,D)
!term_augm(T,D) :- !term(T,D)
term_augm(T,D) :- acc_contentanno(D,A) & term_augm(T,A)
!term_augm(T,D) :- acc_contentanno(D,A) & !term_augm(T,A)
instance_of(D,rel,this) :- term_augm(search,D) & termspace(search)

if full knowledge augmentation is applied; for radius-1 knowledge augmentation, the third and fourth term_augm rules are non-recursive and formulated as

term_augm(T,D) :- acc_contentanno(D,A) & term(T,A).
!term_augm(T,D) :- acc_contentanno(D,A) & !term(T,A).

Furthermore, trans("d[0.6 *a]") = "0.6 acc_contentanno(d,a)". To execute the translated knowledge augmentation rules from Section 2, the FVPD engine combines the probabilistic evidence using the inclusion-exclusion formula. If e1, . . . , en are jointly independent probabilistic events, the engine computes

P(e1 ∧ . . . ∧ en) = P(e1) · . . . · P(en)

P(e1 ∨ . . . ∨ en) = Σ_{i=1}^{n} (−1)^{i−1} Σ_{1 ≤ j1 < . . . < ji ≤ n} P(e_{j1} ∧ . . . ∧ e_{ji})
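The two formulas can be transcribed directly. The following sketch (illustrative, not the FVPD engine’s actual code) computes the conjunction as a product and the disjunction as the sieve over all non-empty subsets of events; applied to the contributions 0.2 and 0.21 from the running example, it reproduces the augmented RSV.

```python
from itertools import combinations

def p_and(probs):
    """P(e1 ∧ ... ∧ en) for independent events: the product."""
    result = 1.0
    for p in probs:
        result *= p
    return result

def p_or(probs):
    """P(e1 ∨ ... ∨ en) via inclusion-exclusion over non-empty subsets."""
    n = len(probs)
    total = 0.0
    for i in range(1, n + 1):
        for subset in combinations(probs, i):
            total += (-1) ** (i - 1) * p_and(subset)
    return total

print(p_or([0.2, 0.21]))  # ≈ 0.368, the augmented RSV of document d
```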