The Xircus Search Engine∗

Holger Meyer, Ilvio Bruder, Andreas Heuer, Gunnar Weber

University of Rostock, Database Research Group, 18051 Rostock, Germany
{hme,ilr,weber,heuer}@informatik.uni-rostock.de

February 19, 2003

Abstract

Nowadays, XML is the favoured document model for both document-centric and data-centric web applications. Its influence on other, more traditional projects and applications grows as the web and its associated techniques become the de facto standard for user interfaces in such systems. We present an XML-sensitive search engine (Xircus) suited for processing semi-structured queries over large collections of XML documents. Xircus is based on state-of-the-art information retrieval techniques. It is a test bed for research in query processing for XML and semi-structured data in general.

1 Introduction

Traditional search engines are built upon classical information retrieval methods. Even though they are enhanced by evaluating the hyperlink structure of web sites, little effort is made to exploit the document structure itself. Newly built XML-sensitive search engines should rely more on the XML structure and facilitate path expressions and structured queries based on a type system.

The applications of such XML-sensitive search engines are manifold: digital libraries, (web) content management, XML-enabled databases, and many other web-based software projects.

Besides that, there are two reasons why we started the Xircus project.

• In the first place, we wanted an XML search engine which implements state-of-the-art techniques for fulltext search, XML indexing and query processing.
• Secondly, Xircus should offer a research framework for experimenting with information extraction from XML document collections, path indexing and processing, and semi-structured query processing in general that combines information retrieval with structured, SQL-like queries.

The software architecture should allow different methods to be plugged in, such as language-specific stemmers, domain-specific stopword lists, ontologies and thesauri. The search engine builds upon several basic data structures. The meta database describes attributes common to XML document collections and properties of documents within these collections. Per collection, there may be different index structures for accelerating access to documents and their fulltext, XML structure, and frequently queried fragments. The Xircus search engine should be easy to deploy in a distributed, heterogeneous environment and to adapt to different settings.

The paper is organised as follows. First, we give an overview of the system architecture. Then, a brief discussion of query language issues follows. The paper closes with a look at the first prototype and its user interface. Last but not least, some related work and future tasks are discussed.

∗ Xircus is an acronym for XML-based Indexing, Ranking, and Classification Techniques for Customised Search Engines.

2 Architectural Overview

Xircus has a component-based structure. Figure 1 depicts the distributed architecture of the search engine. The main components are the Xircus Agent, the Xircus Server and servlets in a web server.

The Xircus Agent gathers information from distributed XML collections. It performs several document analysis and index preparation steps. Finally, it transmits the collected information to the server. The Xircus Server manages the basic data structures and performs the query evaluation. The user interface is built from a set of servlets. These servlets communicate with the server using JDBC, as the agents do.

[Figure 1 labels: Web Browser; Web Server; Servlet Container; Servlet; JDBC; Xircus Server; Query Engine; Metadata; Link Base; Fulltext Index; Structure Index; Value Index; Xircus Agent; XML Parser; Post Processing]

Figure 1: Xircus Architecture

Besides this component-based architecture, the process of indexing and querying can be illustrated by the necessary processing steps (Figure 2). After the XML document collections have been accessed, the document analysis step starts in the Xircus Agent. The extracted information is then handed over to the index preparation step, which takes place in the Xircus Server.

Document analysis. First, agents collect some metadata on the document collections. This includes data like timestamps, document type, document length, checksum and others. The documents themselves are further analysed in two steps. First, a structure analysis takes place, which includes the extraction of the document structure tree and its relations to the content. Secondly, a content analysis is performed. There are a couple of analysis tools for the textual content analysis, e.g., linguistic tools, thesauri or ontologies.

There are some dependencies between the analysis steps because some results of one analysis are helpful or even necessary for another. Term positions must be determined before stopword elimination; otherwise the eliminated terms are not counted and some phrase searches may fail. Stopword elimination should be performed before stemming because stemming is expensive and its cost depends on the number of words. Generally, the metadata are collected first because some analyses depend on the document or schema type.
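The ordering constraints of the content analysis can be illustrated with a small Java sketch. It is not the Xircus implementation: the class `AnalysisPipeline`, the stopword list and the one-line `stem` placeholder (which is not the Porter algorithm) are assumptions made only for this illustration. Positions are assigned to all words first, stopwords are dropped afterwards, and only the surviving terms are stemmed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Sketch of the analysis ordering: term positions, then stopword elimination, then stemming. */
final class AnalysisPipeline {

    /** A term together with its word position in the original text. */
    static final class PositionedTerm {
        final String term;
        final int position;

        PositionedTerm(String term, int position) {
            this.term = term;
            this.position = position;
        }

        @Override
        public String toString() {
            return term + "@" + position;
        }
    }

    // Hypothetical stopword list; the paper allows one per collection.
    static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("the", "of", "a", "and", "in"));

    /** Step 1: split into words and number every word, stopwords included. */
    static List<PositionedTerm> tokenize(String text) {
        List<PositionedTerm> terms = new ArrayList<>();
        int position = 0;
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!word.isEmpty()) {
                terms.add(new PositionedTerm(word, position++));
            }
        }
        return terms;
    }

    /** Step 2: drop stopwords; surviving terms keep their original positions. */
    static List<PositionedTerm> removeStopwords(List<PositionedTerm> terms) {
        List<PositionedTerm> kept = new ArrayList<>();
        for (PositionedTerm t : terms) {
            if (!STOPWORDS.contains(t.term)) {
                kept.add(t);
            }
        }
        return kept;
    }

    /** Step 3: placeholder stemmer that only strips a trailing "s" (NOT the Porter algorithm). */
    static String stem(String term) {
        return term.length() > 3 && term.endsWith("s")
                ? term.substring(0, term.length() - 1)
                : term;
    }

    /** Runs the three steps in the order required by the dependencies above. */
    static List<PositionedTerm> analyse(String text) {
        List<PositionedTerm> result = new ArrayList<>();
        for (PositionedTerm t : removeStopwords(tokenize(text))) {
            result.add(new PositionedTerm(stem(t.term), t.position));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [ranking@1, xml@3, document@4]
        System.out.println(analyse("The ranking of XML documents"));
    }
}
```

Because positions are fixed before any term is removed, the gaps at positions 0 and 2 remain visible in the output, and the stemmer is called on three words instead of five.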

The storage and index structures. Several data and index structures are managed in the Xircus Server:

• Metadata storage: collections, documents, statistics, schemes (DTD, XML Schema, index structures per collection)
• Fulltext index: words of the fulltext, sentences, phrases of a natural language, IR-based querying
• Structure or path index: for querying the document structure and evaluating path expressions
• Value index: atomic element and attribute content, typed values (XPath 1.0 type system), for structured parts of a document and SQL-like queries
• Link base: outgoing and incoming edges per document, to analyse the hyperlink structure

The metadata encompasses information on documents, e.g., checksum, timestamps, document type, collection affiliation and term statistics, and information on collections, such as the document schema (DTD, XML Schema), the main language, and other document statistics. Stopword lists or stop contexts can be defined on a per-collection basis. A stop context is an XML fragment to be excluded from processing. It can be described by a path expression. The data for the fulltext index consist of terms, their occurrences and the term positions. The terms are derived from the document words by stopword elimination, stemming and, possibly, the use of thesauri or ontologies.

The term positions are determined by sentence and word recognition. The structure index includes information on elements, element-subelement relationships, attributes, and paths. XML elements are annotated with a position number too. Values in the value index, like author names or publication years, are extracted from XML attributes or elements. They are associated with a data type as defined in XML Schema. Two kinds of links (references or citations) are distinguished: in-collection links and external hyperlinks and references. The links must be defined by structured elements in a known way, i.e., using ID/IDREF in a DTD or XLink/XPointer.
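To make the role of term positions in the fulltext index concrete, the following Java sketch keeps terms, their occurrences and their positions in memory and answers a phrase query from them. It is only an illustration: the prototype stores its index structures in database tables, and the class `PositionalIndex` with its methods is a hypothetical name chosen for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** In-memory sketch of a fulltext index holding terms, occurrences and term positions. */
final class PositionalIndex {

    // term -> (document id -> word positions of the term in that document)
    private final Map<String, Map<String, List<Integer>>> postings = new HashMap<>();

    /** Records one occurrence of a term at a word position of a document. */
    void add(String docId, String term, int position) {
        postings.computeIfAbsent(term, t -> new TreeMap<>())
                .computeIfAbsent(docId, d -> new ArrayList<>())
                .add(position);
    }

    private List<Integer> positions(String docId, String term) {
        Map<String, List<Integer>> byDoc = postings.get(term);
        if (byDoc == null) {
            return Collections.emptyList();
        }
        List<Integer> found = byDoc.get(docId);
        return found == null ? Collections.<Integer>emptyList() : found;
    }

    /** Number of occurrences of a term in a document. */
    int occurrences(String docId, String term) {
        return positions(docId, term).size();
    }

    /** Documents in which the given terms occur at consecutive word positions. */
    SortedSet<String> phrase(String... terms) {
        SortedSet<String> hits = new TreeSet<>();
        if (terms.length == 0) {
            return hits;
        }
        Map<String, List<Integer>> first = postings.get(terms[0]);
        if (first == null) {
            return hits;
        }
        for (String docId : first.keySet()) {
            for (int start : first.get(docId)) {
                if (followsFrom(docId, start, terms)) {
                    hits.add(docId);
                    break;
                }
            }
        }
        return hits;
    }

    /** True if terms[1..] occur at start+1, start+2, ... in the document. */
    private boolean followsFrom(String docId, int start, String[] terms) {
        for (int i = 1; i < terms.length; i++) {
            if (!positions(docId, terms[i]).contains(start + i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        PositionalIndex index = new PositionalIndex();
        index.add("d1", "xml", 0);
        index.add("d1", "search", 1);
        index.add("d1", "engine", 2);
        index.add("d2", "search", 0);
        index.add("d2", "xml", 1);
        // prints [d1]
        System.out.println(index.phrase("xml", "search"));
    }
}
```

Both documents contain both words, but only `d1` has them at adjacent positions, which is the information a plain term-occurrence index would not provide.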

[Figure 2 labels: Document Analysis; Linguistic Techniques; Ontologies, Thesauri; Document Structure; Link Structure; Index Preparation; Terms, Descriptors; XML Structure, Paths; Hyperlinks; Value Index; Query Processing; Search Pattern; Boolean Queries; Path Expressions; XQuery, FLWOR; Retrieval; Ranking; User Interface; Surface Language; Result Presentation; Customizing]

Table 1: Information retrieval like expressions

Expression type            Meaning
term                       words
'term1 term2 ...'          phrases
{term1 term2 ...}          sentences
(expr ...)                 grouping
expr1 op expr2             boolean operators, op: and, or, not
expr * factor              weighted expressions to influence the ranking

Table 2: Query Expressions Involving Structure

Expression type            Meaning
path(pexpr)                embedded path expression, XPath 1.0
path(pexpr) contains expr  path restriction
pexpr comp const           value-based comparison, comp: =, ...
return pexpr

... = 1998 return /article/fm

The back matter of an article is searched for the author Heikki Mannila. The search is restricted by an exact query term, which selects references from 1998 up to now. Since we are interested in authors who cited Heikki Mannila, we just want to return the front matter (author, title) of the article.

3.2 Ranking

The ranking mechanism of Xircus assigns relevance measures to documents or fragments based on the statistics stored in the database. The ranking value is calculated from four measures of similarity between documents and queries. These similarities are based on (1) terms, (2) the XML structure, (3) element and attribute values, and (4) the linking structure. The four measures can be combined in a ranking function. The combination is controlled by user ratings or by a user-defined ranking function. Computing these similarities involves processing the related index structures: the fulltext/term index, the path index, the value index, and the hyperlink base.

• Ranking for term-based queries is based on tf · idf (term frequency and inverse document frequency).
• A similar ranking for embedded path expressions is based on ef · iff. Element frequency ef: element occurrences divided by the number of elements in the XML fragment. Inverse fragment frequency iff: logarithm of the number of fragments divided by the number of fragments containing the element.
• Ranking of value-based comparisons is mapped to the boolean values {1, 0}.

Since XircL combines IR-like queries, which result in a ranking, with structured queries, the challenge is how to integrate the results (rankings) of the different sub-query types. We adopted a technique used in multimedia database systems [3]. Ranking values for different sub-queries are combined based on graded sets (fuzzy sets).

A graded set is a set of pairs (i, g), where i is an item (document, fragment, object) and g is a real number in the interval [0, 1]. The following rules hold for $\mathrm{rank}_Q(i)$, the grade or rank of an item i under the query Q:

• conjuncts: $\mathrm{rank}_{A \land B}(i) = \min\{\mathrm{rank}_A(i), \mathrm{rank}_B(i)\}$
• disjuncts: $\mathrm{rank}_{A \lor B}(i) = \max\{\mathrm{rank}_A(i), \mathrm{rank}_B(i)\}$
• negations: $\mathrm{rank}_{\lnot A}(i) = 1 - \mathrm{rank}_A(i)$

Based on these rules, the query evaluation returns a combined ranking for queries on both the fulltext and the structure of an XML document.

4 The Xircus prototype and user interface

The first Xircus prototype was implemented by students of the Complex Software Systems class at the University of Rostock during the summer term 2002. The prototype realizes all major concepts except the index structures. For now, this functionality is provided by an object-relational database system (IBM DB2) and its extenders. Most index structures are implemented with user tables and indexes.

Xircus is implemented in the Java language. It makes heavy use of free software, e.g., for the checksum tool, based on the MD5 hash value (RFC 1321), the stemmer, based on the Porter stemming algorithm, and the synonym sets of WordNet¹ (a project at Princeton University).

Figure 3: Xircus Search Interface

The user interface (Figure 3) is realized as a set of servlets executed in the usual Apache/Tomcat web server. The servlets issue search queries in the Xircus surface language and interoperate with the Xircus search server using a standard JDBC interface. This gives much freedom in independently changing the design of both components. Figure 4 exemplifies the search form and the search result presentation.

Figure 4: Xircus Result Presentation

¹ http://www.cogsci.princeton.edu/~wn/

5 Related Work

We take a short look at some products, projects and research issues related to the Xircus project.

XML search engines. GoXML [10] provides storage of XML Schema or DTD structure definitions. When XML content is inserted or updated in a database, it is checked for compliance with a schema and the data types defined within that schema. The Index System creates and maintains indexes over attribute and element values. These are used by the XPath Query Engine, which supports XPath with proprietary extensions. GoXML DB also includes support for “... a major subset of XQuery (FLWR, SORTBY, distinct) as specified in the June 2001 public W3C drafts.”

The TEXTML Server [7] processes any well-formed XML without being constrained by a particular schema or DTD. Indexes can be created to search for words (fulltext), dates, strings (whole content of an XML document), numerical values, and date and time values. The server offers fine-grained indexes, which can account for the position of every occurrence of a word within a document, thereby allowing advanced search capabilities like proximity search. The query language is expressed as an XML document and provides Boolean search and fulltext search over whole documents or individual elements.

XYZFind [11] builds a searchable repository of all data from all XML documents, indexing values, numbers, structural names, namespaces, and content. It accepts any number of well-formed XML documents. XYZFind provides a powerful query language called XYZQL. XYZQL supports path-level queries, Boolean queries, keyword search, and numeric range queries. An XYZQL query is a filter specification that constrains which XML documents are returned as well as which parts of documents are returned.

Linguistic techniques. An overview of IR-related text analysis and processing is given in [1]. It describes linguistic analysis with a focus on collecting statistical term information and preparing terms for indexing. A robust and fast linguistic analysis tool (SMES) is presented in [8]. SMES is a linguistic tool for the German language and consists of lexical, morphological and syntactical analysis. It can extract linguistically annotated word lists and also linguistic relations between words and word phrases.

Ranking. [6] gives an overview of ranking algorithms. It describes several ranking aspects in the IR research area, including a guide to selecting ranking techniques. A survey on combining ranking algorithms in general is given in [2]. A ranking approach for structured data using the probabilistic model is XPRES [9] from the University of Bonn. XPRES describes extensions to the probabilistic ranking function that use the structure information given in XML documents. Another approach [5, 4] consists of an inference machine for probabilistic document weights combined with structural data. It defines different contexts for term weightings in different structural areas.

Using fuzzy sets for integrating scoring values into a structured query language like SQL was first introduced by Fagin [3].

6 Future Tasks

Based on the first prototype, future investigations will go in several directions. We will improve the path index structures, especially if an XML schema for a document collection is present. A redesign of the distributed software architecture is needed to better support index preparation and distributed query processing. We plan to use the search engine in digital library projects in large-scale, distributed environments where replication, caching and distributed query processing are important.

A recently started second student project will reimplement the fulltext index using compression algorithms and experiment with path index structures. The user interface will be extended, and a performance evaluation based on the INEX collection will take place.

Detailed information on the ongoing Xircus project can always be found on the project homepage².

² http://www.xircus.de/
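As a worked illustration of the ranking rules in Section 3.2, the following Java sketch combines sub-query grades with the graded-set rules (min, max, 1 − x) and computes a frequency-based score in the form given for ef · iff. It is a sketch under stated assumptions, not the Xircus code: the class `GradedRanking` and its method names are hypothetical, the natural logarithm is assumed because the text only says "logarithm", and clamping scores into [0, 1] before combining them is an assumption, since the paper does not say how tf · idf or ef · iff values are normalised.

```java
/** Sketch of the graded-set combination rules and a frequency-based score. */
final class GradedRanking {

    /** Conjunction: the grade of "A and B" is the minimum of both grades. */
    static double and(double a, double b) {
        return Math.min(a, b);
    }

    /** Disjunction: the grade of "A or B" is the maximum of both grades. */
    static double or(double a, double b) {
        return Math.max(a, b);
    }

    /** Negation: the grade of "not A" is one minus the grade of A. */
    static double not(double a) {
        return 1.0 - a;
    }

    /** Value-based comparisons are mapped to the boolean grades 1 and 0. */
    static double valueComparison(boolean holds) {
        return holds ? 1.0 : 0.0;
    }

    /**
     * Frequency times inverse frequency, following the ef * iff description:
     * (occurrences / unit size) * log(total units / units containing the item).
     * Natural logarithm is an assumption of this sketch.
     */
    static double frequencyScore(int occurrences, int unitSize,
                                 int totalUnits, int unitsContaining) {
        if (occurrences == 0 || unitSize == 0 || unitsContaining == 0) {
            return 0.0;
        }
        double frequency = (double) occurrences / unitSize;
        double inverseFrequency = Math.log((double) totalUnits / unitsContaining);
        return frequency * inverseFrequency;
    }

    /** Assumed normalisation: force a score into the grade interval [0, 1]. */
    static double clamp(double score) {
        return Math.max(0.0, Math.min(1.0, score));
    }

    public static void main(String[] args) {
        // Grade of "A and (B or not C)" for grades A = 0.6, B = 0.3, C = 0.8.
        // prints 0.3
        System.out.println(and(0.6, or(0.3, not(0.8))));

        // An element occurring 2 times among 10 elements of a fragment,
        // and present in 10 of 100 fragments.
        System.out.println(clamp(frequencyScore(2, 10, 100, 10)));
    }
}
```

In the first example the negated grade (about 0.2) loses against B in the disjunction, and the conjunction then takes the smaller of 0.6 and 0.3, so a weak sub-query bounds the overall grade.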

7 Acknowledgements

The authors would like to thank the students of the Complex Software Systems class at the University of Rostock who implemented the first Xircus prototype: Ramona Bunk, Sebastian Dolke, Thomas Lange, Lars Milewski, Manja Nelius, Mathias Reusch, Gunnar Söllig, Sven Schattat, Matthias Schulz, and Ines Weber.

References

[1] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. ACM Press & Addison Wesley, New York, USA, 1999.
[2] W. B. Croft. Combining Approaches to Information Retrieval. In W. B. Croft, editor, Advances in Information Retrieval. Kluwer Academic Publishers, Boston, 2000.
[3] R. Fagin. Combining Fuzzy Information from Multiple Systems. In Proceedings of the Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 3-5, 1996, Montreal, Canada, pages 216–226. ACM Press, 1996.
[4] N. Fuhr and K. Großjohann. XIRQL: A Query Language for Information Retrieval in XML Documents. In Proceedings of the 24th Annual International Conference on Research and Development in Information Retrieval, pages 172–180. ACM, 2001.
[5] N. Fuhr and G. Weikum. Classification and Intelligent Search on Information in XML. Bulletin of the IEEE Technical Committee on Data Engineering, 25(1), 2002.
[6] D. Harman. Ranking Algorithms. In W. B. Frakes and R. Baeza-Yates, editors, Information Retrieval: Data Structures & Algorithms, pages 363–392. Prentice Hall PTR, New Jersey, USA, 1992.
[7] IXIASOFT. TEXTML-Server, Nov. 2001.
[8] G. Neumann, R. Backofen, J. Baur, M. Becker, and C. Braun. An Information Extraction Core System for Real World German Text Processing. In Proc. of the 5th International Conference of Applied Natural Language, Washington, USA, 1997.
[9] J. E. Wolff, H. Floerke, and A. B. Cremers. XPRES: a Ranking Approach to Retrieval on Structured Documents. Technical report, University of Bonn, July 1999.
[10] XML Global Technologies, Inc. Choosing The Correct Database For XML Content, 2002.
[11] XYZFind Corporation. XYZFind Server User’s Guide, Version 1.01, Mar. 2001.