Optimizing information retrieval in question answering using syntactic annotation

Jörg Tiedemann
Alfa Informatica, University of Groningen
[email protected]

Abstract

One of the bottlenecks in open-domain question answering (QA) systems is the performance of the information retrieval (IR) component. In QA, IR is used to reduce the search space for answer extraction modules, and its performance is therefore crucial for the success of the overall system. However, natural language questions are different from the sets of keywords used in traditional IR. In this study we explore the possibilities of integrating linguistic information taken from machine-annotated Dutch newspaper text into information retrieval. Various types of morphological and syntactic features are stored in a multi-layer index to improve IR queries derived from natural language input. The paper describes a genetic algorithm for optimizing queries sent to such an enriched IR index. The experiments are based on the CLEF test sets for Dutch QA from the last two years. We show an absolute improvement of about 8% in mean reciprocal rank scores compared to a baseline using traditional IR with plain text keywords.

1 Introduction

One of the strategies in question answering (QA) systems is to identify possible answers in large document collections. The task of the information retrieval (IR) component in such a system is to reduce the search space for information extraction modules that look for possible answers in relevant text passages. Obviously, the system fails if IR does not provide appropriate segments to the subsequent modules. Hence, the performance of IR is crucial for the entire system. The main problem for IR is to match a given query with relevant documents. This is usually done in a bag-of-words approach, i.e. sets of query keywords are matched with word type vectors describing documents in the collection. However, in QA we start with a well-formed natural language question from which an appropriate query has to be formulated and sent to the IR component. The baseline approach is simply to use all content words in the question as keywords and run traditional IR. In many cases this is not satisfactory,

especially when questions are short, with only a few informative content words. In some cases we want to restrict the query to narrow down possible matches (to improve precision). In other cases, where keywords from the question are too restrictive, we want to widen the query to increase recall. Natural language questions are more than bags of words and contain additional information besides possible keywords. Syntactic constructions and dependencies between constituents in the question bear valuable information about the given request. The challenge for QA is to take advantage of any linguistic clue in the question that might be necessary to find an appropriate answer. Therefore, natural language processing (NLP) is used in many components of QA systems, for example in question analysis, answer extraction and off-line information extraction (see e.g. (Moldovan et al. 02; Jijkoun et al. 04; Bouma et al. 05)). The use of NLP tools in information retrieval has been studied mainly to find better and/or additional index terms, e.g. complex noun phrases, named entities, or disambiguated root forms (see e.g. (Zhai 97; Prager et al. 00; Neumann & Sacaleanu 04)). Several studies also investigate the use of other syntactically derived word pairs (Fagan 87; Strzalkowski et al. 96). (Katz & Lin 03) argue that syntactic relations can be very effective in information retrieval for question answering when selected carefully. Following up on these ideas, we would like to combine various features and relations that can be extracted from linguistically analyzed documents in our IR component to find better matches between natural language questions and relevant text passages. Our investigations focus on open-domain question answering for Dutch using dependency relations. We use the wide-coverage dependency parser Alpino (Bouma et al. 01) to produce linguistic analyses of both questions and sentences in documents in which we expect to find

the answers. An example of a syntactic dependency tree produced by Alpino can be seen in figure 1.

Figure 1: A dependency tree produced by Alpino for a Dutch CLEF question (When did the German re-unification take place?).

From the dependency trees produced by the parser we can extract features and relations that might be useful for IR, for example part-of-speech information, named-entity labels, and, of course, syntactic dependency relations. The idea is to add this information to the index in some way to make it searchable via the IR component. Questions are analyzed in the same way and similar features and relations can be extracted from the annotation. Hence, we can match them with the enriched IR index to find relevant text passages. For this we assume that questions do not only share lexical items with relevant text passages but also other linguistic features such as syntactic relations. For example, if the question is about "winning the world cup" we might want to look for documents that include sentences where "world cup" is the direct object of any inflectional form of "to win". This would narrow down the query compared to a plain keyword search for "world", "cup" and "winning". The nice thing about relevance ranking in IR is that we can also combine traditional keyword queries with more restrictive queries using, e.g., dependency relations. Documents that contain both types will be ranked higher than the ones where only one type is matched. In this way we influence the ranking but we do not reduce the number of selected documents. Linguistic annotation can be used in many

other ways. For example, part-of-speech information can be useful for disambiguation and weighting of keywords. Certain keyword types (e.g. nouns and names) can be marked as “required” or as “more important” than others. Named entity labels can be used to search for text passages that contain certain name types (for example, to match the expected answer type provided by question analysis). Morphological analyses can be used to split compositional compounds. There is a large variety of possible features and feature combinations that can be included in a linguistically enriched IR index. There is also a wide range of possible queries to such an index using all the features extracted from analyzed questions. Finding appropriate features and query parameters is certainly not straightforward. In our experiments, we use data from the CLEF (Cross-Language Evaluation Forum) competition on Dutch QA to measure the success of linguistically extended queries. The following sections describe the IR component in our QA system and an iterative learning approach to feature selection and query formulation in the QA task.

2 The IR component

The IR component in our QA system (Joost) (Bouma et al. 05) is implemented as an interface to several off-the-shelf IR engines. The system may switch between seven engines that have been integrated into the system. One of them is based on the IR library Lucene from the Apache Jakarta project (Jakarta 04). Lucene is implemented in Java with a well-documented API. It implements a powerful query engine with relevance ranking and many additional interesting features. For example, Lucene indices may include several data fields connected to each document. This feature makes it very useful for our approach, in which we want to store several layers of linguistic information for each document in the collection. Besides the data fields, Lucene also implements a powerful query language that makes it possible to adjust queries in various ways. For example, query terms can be weighted (using numeric "boost factors"), boolean operators are supported, and proximity searches can be specified as well. It also allows for phrase searches and fuzzy matching. The support for data fields and the flexible query language are the main reasons for selecting Lucene as the IR engine in this study.
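To give a flavour of the query-language features mentioned above, here are a few illustrative query strings (a minimal sketch: the terms and field names are only illustrative at this point, anticipating the index layers introduced in section 2.1, but the operators shown, i.e. field prefixes, boolean operators, ^ boost factors, quoted phrases and ~ proximity windows, are standard Lucene query syntax):

# Illustrative Lucene query strings (terms are made up; field names are
# only examples of the data fields used later in this paper).
examples = [
    'text:(hereniging Duits)',           # plain keywords in one data field
    'text:(hereniging^3 Duits^0.5)',     # per-term boost factors
    'text:(+hereniging -voetbal)',       # required / prohibited terms
    'text:("Duitse hereniging"~10)',     # proximity: terms within 10 positions
    'neTypes:PER AND text:(minister)',   # combining sub-queries on different fields
]
for query in examples:
    print(query)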

The IR interface can be used independently of Joost. In this way we can run batch calls on pre-defined queries without requiring other modules of the QA system.

2.1 The multi-layer index

The QA task in CLEF is corpus-based question answering. The corpus for the Dutch competition contains several years of newspaper texts, including about 190,000 documents with about 77 million words. Documents are marked with paragraph boundaries (which might be headers as well). We decided to use paragraphs for IR, which gave the best balance between IR recall and precision. Paragraphs also seem to be a natural segmentation level for answer extraction, even though the mark-up does not seem to be very homogeneous in the corpus. The entire corpus consists of about 1.1 million paragraphs that include altogether about 4 million sentences. The sentences have been parsed by Alpino and stored in XML tree structures.1

1 About 0.35% of the sentences could not be analyzed because of parsing timeouts.

From the parse trees, we extracted various kinds of features and feature combinations to be stored in different data fields in the Lucene index. Henceforth, we will call these data fields index layers and, thus, the index will be called a multi-layer index. We distinguish between token layers, type layers and annotation layers. Token layers include one item per token in the corpus. Table 1 lists the token layers defined in our index.

Table 1: Token layers

text         stemmed plain text tokens
root         root forms
RootPos      root form + POS tag
RootHead     root form + head word
RootRel      root form + relation name
RootRelHead  root form + relation + head

As shown in table 1, certain features may appear in several layers combined with others. Features are simply concatenated (using special delimiting symbols between the various parts) to create individual items within a layer. Tokens in the text layer and in the root layer have also been split at hyphens and underscores to separate compositional compounds (Alpino adds underscores between the compositional parts of words that have been identified as compounds).
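The following sketch illustrates how such layer items could be generated from parsed tokens (a rough sketch: the token representation and the helper function are assumptions for illustration; the '/' delimiter matches the item format visible in the example query in figure 2, and stemming of the text layer is omitted):

import re

# Sketch: turning dependency-parsed tokens into items for the token layers.
# Each token is assumed to carry its surface form, root, POS tag, dependency
# relation and the root of its head word.
def layer_items(tokens):
    layers = {"text": [], "root": [], "RootPos": [],
              "RootHead": [], "RootRel": [], "RootRelHead": []}
    for t in tokens:
        # text and root tokens are split at hyphens/underscores (compounds)
        layers["text"].extend(re.split(r"[-_]", t["word"]))
        layers["root"].extend(re.split(r"[-_]", t["root"]))
        layers["RootPos"].append(t["root"] + "/" + t["pos"])
        layers["RootHead"].append(t["root"] + "/" + t["head"])
        layers["RootRel"].append(t["root"] + "/" + t["rel"])
        layers["RootRelHead"].append(t["root"] + "/" + t["rel"] + "/" + t["head"])
    return layers

# Two tokens from the question in figure 1:
tokens = [
    {"word": "Duitse", "root": "Duits", "pos": "adj",
     "rel": "mod", "head": "hereniging"},
    {"word": "hereniging", "root": "hereniging", "pos": "noun",
     "rel": "su", "head": "vind_plaats"},
]
print(layer_items(tokens)["RootRelHead"])
# ['Duits/mod/hereniging', 'hereniging/su/vind_plaats']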

Type layers include only specific types of tokens in the corpus, e.g. named entities or compounds (see table 2).

Table 2: Type layers

compound  compounds (non-split root forms)
ne        named entities (non-split roots)
neLOC     location names
nePER     person names
neORG     organization names

Annotation layers include only the labels of (certain) token types. So far, we defined only one annotation layer for named entity labels. This layer may contain the items 'ORG', 'PER' or 'LOC' if such a named entity occurs in the paragraph.

2.2 Multi-layer IR queries

Features are extracted from analyzed questions in the same way as for the entire corpus when creating the IR index (see section 2.1). Complex queries can then be sent to the multi-layer index described above. Each individual layer can be queried using keywords of the same type. Furthermore, we can restrict keywords to exclude or include certain types using the linguistic labels of the analyzed question. For example, we can restrict RootPos keywords to nouns only. We can also add another restriction on the relation of these nouns within the dependency tree, for example using only the nouns that are in an object relation to their head. We can also change the weights of certain types (using Lucene's boost factors) and run proximity searches using pre-defined token window sizes. In summary, the query items that we may use in IR queries are:

basic: a keyword in one of the index layers

restricted: token-layer keywords can be restricted to a certain word class ('noun', 'name', 'adj', 'verb') and/or a certain relation type ('obj1' (direct object), 'mod' (modifier), 'app' (apposition), 'su' (subject))

weighted: keywords can be weighted using a boost factor

proximity: a window can be defined for each set of (restricted) token-layer keywords

The restriction features (the second keyword type) are limited to the ones listed above.

We could easily extend the list with additional POS labels or relation types. However, we want to keep the number of possible keyword types at a reasonable level. Altogether there would be 304 different keyword types using all combinations of restrictions and basic index layers, although some of them are pointless because they cannot be instantiated. For example, a verb will never be found in an object relation to its head, and such a combination of restrictions is therefore useless. For simplification, we consider only a small pre-defined set of combined POS/relation-type restrictions: noun-obj1, name-obj1 (nouns or names as objects); noun-mod, name-mod (nouns or names as modifiers); noun-app, name-app (nouns or names as appositions); and noun-su, name-su (nouns or names as subjects). In this way we get a total set of 208 keyword types. Figure 2 shows a rather simple example query using different keyword types, weights and one proximity query.

Wanneer vond de Duitse hereniging plaats?
(When did the German re-unification take place?)

RootRelHead:(Duits/mod/hereniging hereniging/su/vind_plaats)
root:((vind plaats)^0.2 Duits^0.2 hereniging^3)
text:("vond Duitse hereniging"~50)

Figure 2: An example query using linguistic features derived from a dependency tree: root-relation-head triples, roots with boost factor 0.2, noun roots with boost factor 3, and text tokens in a window of 50 words (stop words have been removed).

Note that all parts of the query are composed in a disjunctive way (which is the default operator in Lucene). In this way, each "sub-query" may influence the relevance of matching documents but does not restrict the result to documents for which every sub-query can be matched. In other words, no sub-query is required, but all of them may influence the ranking according to their weights. An extension would be to allow conjunctive parts in the query for items that are required in relevant documents. However, this is not part of the present study.
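A rough sketch of how such a query string could be assembled from the features extracted from an analyzed question (the helper below is hypothetical and simplified; the field prefixes, ^ boost factors and ~ proximity windows follow the Lucene query syntax used in figure 2):

# Sketch: compose a disjunctive multi-layer Lucene query from per-layer
# keyword lists with an optional boost factor or proximity window.
def build_query(parts):
    """parts: list of (layer, keywords, boost, window) tuples,
    where boost and window may be None."""
    subqueries = []
    for layer, keywords, boost, window in parts:
        if window is not None:
            phrase = " ".join(keywords)
            sub = layer + ':("' + phrase + '"~' + str(window) + ")"
        elif boost is not None:
            boosted = " ".join(k + "^" + str(boost) for k in keywords)
            sub = layer + ":(" + boosted + ")"
        else:
            sub = layer + ":(" + " ".join(keywords) + ")"
        subqueries.append(sub)
    # sub-queries are simply juxtaposed: Lucene's default operator is
    # disjunctive, so each part only influences the ranking
    return " ".join(subqueries)

query = build_query([
    ("RootRelHead", ["Duits/mod/hereniging", "hereniging/su/vind_plaats"], None, None),
    ("root", ["Duits", "hereniging"], 0.2, None),
    ("text", ["vond", "Duitse", "hereniging"], None, 50),
])
print(query)
# (printed on one line)
# RootRelHead:(Duits/mod/hereniging hereniging/su/vind_plaats)
#   root:(Duits^0.2 hereniging^0.2) text:("vond Duitse hereniging"~50)

In the actual system, the choice of keyword types, boost factors and window sizes is not made by hand but is tuned by the optimization procedure described in section 4.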

3 The CLEF test set

We used the CLEF test sets from the Dutch QA tracks in the years 2003 and 2004 as training and evaluation data. Both collections contain Dutch

questions from the CLEF competitions that have been answered by the participating systems. The test sets include the answer string(s) and document ID(s) of possible answers in the CLEF corpus. We excluded the questions for which no answer had been found. Most of the questions are factoid questions such as 'Hoeveel inwoners heeft Zweden?' (How many inhabitants does Sweden have?). Altogether there are 570 questions with 821 answers.2

2 Each question may have multiple possible answers. We also added some obvious answers which were not in the original test set when we encountered them in the corpus. For example, names and numbers can be spelled differently (Kim Jong Il vs. Kim Jong-Il, Saoedi-Arabië vs. Saudi-Arabië, bijna vijftig jaar vs. bijna 50 jaar).

For evaluation we used the mean reciprocal rank (MRR) of relevant paragraphs retrieved by IR:

MRR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{\mathrm{rank}_i(\text{first answer})}

where N is the number of questions and rank_i(first answer) is the rank of the first retrieved paragraph that contains an answer to question i.

We used the provided answer string rather than the document ID to judge whether a retrieved paragraph was relevant or not. In this way, the IR engine may provide passages with correct answers from documents other than the ones marked in the test set. We do simple string matching between answer strings and words in the retrieved paragraphs. Obviously, this introduces errors where the matching string does not correspond to a valid answer in its context. However, we believe that this does not influence the global evaluation figure significantly, and we therefore use this approach as a reasonable compromise for automatic evaluation.
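A minimal sketch of this evaluation procedure (the function is an assumed helper; plain case-insensitive substring matching stands in for the exact matching criteria used here):

# Sketch: mean reciprocal rank over questions, where a retrieved paragraph
# counts as relevant if it contains one of the known answer strings.
def mean_reciprocal_rank(results, answers):
    """results: question ID -> ranked list of retrieved paragraph texts;
    answers: question ID -> list of acceptable answer strings."""
    total = 0.0
    for qid, paragraphs in results.items():
        for rank, paragraph in enumerate(paragraphs, start=1):
            if any(a.lower() in paragraph.lower() for a in answers[qid]):
                total += 1.0 / rank
                break  # only the first relevant paragraph counts
    return total / len(results)

mrr = mean_reciprocal_rank(
    {"q1": ["... duurde bijna 50 jaar ...", "some irrelevant paragraph"]},
    {"q1": ["bijna vijftig jaar", "bijna 50 jaar"]},
)
print(round(100 * mrr, 2))  # 100.0: the answer is found at rank 1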

4 Automatic query optimization

As described above, we have a large variety of possible keyword types that can be combined to query the multi-layer index. It would be possible to use intuition to set keyword restrictions, weights and window sizes. However, we would like to carry out a more systematic search to optimize queries over the possible types and parameters. For this we use a simplified genetic algorithm in the form of an iterative "trial-and-error beam search". The optimization loop works as follows (using a subset of the CLEF questions):

1. Run initial queries (one keyword type per IR run) with default weights and default window settings.

2. Combine the parameters of two of the N best IR runs (= crossover). For simplicity, we require each setting to be unique (i.e. we do not have to run a setting twice; the good ones survive anyway). Apply mutation operations (see next step) if crossover does not produce a unique setting. Do crossovers until we have a maximum number of new settings.

3. Change some settings at random (mutation).

4. Run the queries using the new settings and evaluate them (determine fitness).

5. Continue with step 2 until "bored".

This setup is very simple and straightforward. However, some additional parameters of the algorithm have to be set initially. First of all, we have to decide how many IR runs ("individuals") we would like to keep in our "population". We decided to keep only a very small set of 25 individuals. "Fitness" is measured using the MRR scores for finding the answer strings in retrieved documents. Selecting "parents" for the combination of settings is done by randomly picking two of the 25 "living individuals". We compute the arithmetic mean of weights (or window sizes) if we encounter identical keyword types in both parents. We also have to set the number of new settings ("children") that should be produced at a time. We set this value to a maximum of 50. Selection according to the fitness scores is done immediately when a new IR run is finished. Finally, we have to define mutation operations and their probabilities. Settings may be mutated by adding a keyword type (with a probability of 0.2), removing a keyword type (with a probability of 0.1), or by increasing/decreasing weights or window sizes (with a probability of 0.2). Window sizes are changed by a random value between 1 and 10 (shrinking or expanding) and weights are changed by a random real value between 0 and 5 (decreasing or increasing). The initial weight is 1 (which is also the default in Lucene) and the initial window size is 20. The optimization parameters were chosen intuitively. The mutation probabilities are set to rather high values to enforce quicker changes within the process. Natural selection is simplified to a top-N selection without giving individuals with lower fitness values a chance to survive. Experimentally, we found that this improves

the convergence of the optimization process compared to a probabilistic selection method. Note that there is no obvious termination condition. A simple approach would be to stop if the fitness scores cannot be improved anymore. However, this condition is too strict and would cause the process to stop too early. We simply stop after a certain number of runs, in particular once we observe that the optimization levels out.
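The sketch below condenses this procedure into a generational loop under the parameter values given above (population of 25, up to 50 new settings per iteration, the stated mutation probabilities). It is an approximation rather than the exact implementation: a "setting" is modelled as a plain mapping from keyword types to weights, the fitness function is a placeholder for running and scoring the IR queries, window-size mutation and the uniqueness check are omitted, and selection happens per generation rather than as soon as each run finishes.

import random

POPULATION, CHILDREN = 25, 50
P_ADD, P_REMOVE, P_CHANGE = 0.2, 0.1, 0.2

def crossover(a, b):
    child = dict(a)
    for ktype, value in b.items():
        # identical keyword types in both parents: arithmetic mean of values
        child[ktype] = (child[ktype] + value) / 2 if ktype in child else value
    return child

def mutate(setting, all_keyword_types):
    s = dict(setting)
    if random.random() < P_ADD:
        s[random.choice(all_keyword_types)] = 1.0   # new keyword type, default weight
    if s and random.random() < P_REMOVE:
        del s[random.choice(list(s))]
    if s and random.random() < P_CHANGE:
        k = random.choice(list(s))
        s[k] = max(0.0, s[k] + random.uniform(-5, 5))   # adjust weight
    return s

def optimize(initial_settings, all_keyword_types, evaluate, generations=20):
    # evaluate(setting) is assumed to run the queries on the training
    # questions and return the MRR score (the "fitness")
    population = sorted([(evaluate(s), s) for s in initial_settings],
                        key=lambda x: x[0], reverse=True)[:POPULATION]
    for _ in range(generations):
        children = []
        while len(children) < CHILDREN:
            (_, a), (_, b) = random.sample(population, 2)
            children.append(mutate(crossover(a, b), all_keyword_types))
        scored = [(evaluate(c), c) for c in children]
        # top-N selection: only the fittest settings survive
        population = sorted(population + scored,
                            key=lambda x: x[0], reverse=True)[:POPULATION]
    return population[0]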

5 Experiments

For our experiments, we put together the CLEF questions from the last two years of the competition. From this set we randomly selected a training set of 420 questions (and their answers) and an evaluation set of 150 questions with answers (held-out data). The main reason for merging both sets, rather than using one year's data for training and another year's data for evaluation, is simply to avoid unwanted training/evaluation-set mismatches. Each year, the CLEF tasks are slightly different from previous years to avoid over-training on certain question types. By merging both sets and selecting at random we hope to create a more general training set with similar properties to the evaluation set. For optimization, we used the algorithm described in the previous section together with the multi-layer index and the full set of keyword types as listed earlier. IR was run in parallel (on 3-7 Linux workstations on a local network) and a top-15 list was printed after every 10 runs. For each setting we also compute the "fitness" on the held-out data to compare training scores with evaluation scores. Table 3 summarizes the optimization process by means of MRR scores and compares it to the baseline of using traditional IR with plain text keywords. The algorithm was stopped after 1000 different settings. Figure 3 plots the training curve for 1000 settings on a logarithmic scale (left plot). In addition, the curve of the corresponding evaluation scores is plotted in the right part of the figure. The thin lines in figure 3 refer to the scores of individual settings tested in the optimization process. The solid bold lines refer to the top scores using the optimized queries. Both plots illustrate the nature of the iterative optimization process. Random modifications cause the oscillation of the fitness scores (see the thin lines). However, the algorithm picks up the advantageous features and promotes them in the competitive selection process.

Figure 3: Parameter optimization, training (left) and evaluation (right): answer-string MRR in % plotted against the number of settings tested (logarithmic scale), compared to the baselines of 46.27% (training) and 46.71% (evaluation).

Table 3: Optimization of query parameters (MRR scores of answer strings, in %). Baseline: IR with plain text tokens (+ stop word removal & Dutch stemming).

nr of settings  training  evaluation
baseline        46.27     46.71
10              42.36     46.70
150             49.69     51.28
250             51.51     53.55
450             52.32     55.82
600             52.79     55.62
1000            53.16     54.76

The scores after the optimization are well above the baseline of using plain text tokens only (about 8% measured in MRR). Most of the improvement can be observed at the beginning of the optimization process,3 which is very common in machine learning approaches. The training curve levels out already after about 300 settings. The two plots also illustrate the relation between scores in training and evaluation. There seems to be a strong correlation between training and evaluation data: the general tendency of the evaluation scores is similar to the training curve, with step-wise improvements throughout the optimization process, even though the development of the evaluation score is not monotonic. Besides the drops in evaluation scores, we can also observe a slight tendency for values to decrease after about 500 settings, which is probably due to over-fitting.

3 Note that the X scale in figure 3 is logarithmic for both training and evaluation.

Note also that the evaluation scores for the optimized queries do not always reach the top scores among all tested individuals. However, the optimized queries are close to the best possible query according to the fitness scores measured on the evaluation data. We are also interested in the features that have been selected in the optimization process. Table 4 shows the settings of the best query after trying 1000 different settings. It also lists, for each keyword type, the total number of keywords produced for the questions in the training set using these settings. Eight token layers are used in the final query. It is somewhat surprising that the named-entity layers are not used at all, except for the meta-layer that contains the named-entity labels (neTypes). However, the features captured in these layers are also used in some of the query parts where the POS restriction is set to 'name'. This overlap probably makes it unnecessary to add named-entity keywords to the query. Most keyword restrictions are applied to the root layer. The largest weight, however, is set to the plain text token layer. This seems reasonable when looking at the performance of the single layers (the text layer performs best, followed by the root layer). The most popular constraints are 'noun' and 'name' among the POS labels, and subject (su) and direct object (obj1) among the relation types. This also seems natural, as noun phrases usually include the most informative parts of a sentence. Looking at the proximity queries, we can observe that the optimized query is quite strict, with rather small windows. Many proximity queries use a window size below the initial setting of 20 tokens.

Table 4: Optimized query parameters after 1000 settings, including the number of keywords produced for the training set using these parameters. For each query part, the table lists the index layer, the POS and relation restrictions, the number of keywords, and the weight or window size.

However, it is hard to judge the influence of the proximity queries and their parameters on the entire query, where all parts are combined in a disjunctive way. Altogether, the system makes extensive use of the enriched index layers and also gives them significant weights (see for example the RootPos layer for nouns and the RootRelHead layer for noun subjects). They seem to contribute to the IR performance in a positive way.

6 Conclusions and future work

In this paper we describe the information retrieval component of our open-domain question answering system. We integrated linguistic features produced by Alpino, a wide-coverage parser for Dutch, into the IR index to improve the retrieval of relevant paragraphs. These features are stored in several index layers that can be queried by the QA system in various ways. We also use word class labels and syntactic relations to restrict keywords in queries. Furthermore, we use keyword weights

and proximity queries in the retrieval system. In the paper, we demonstrate an iterative algorithm for optimizing query parameters to take advantage of the enriched IR index. Queries are optimized on a training set of questions annotated with answers, taken from the CLEF competitions on Dutch question answering. We showed that the performance of the IR component can be improved by about 8% on unseen evaluation data, measured as the mean reciprocal rank of retrieved relevant paragraphs. We believe that this improvement also helps to boost the performance of the entire QA system. It will be part of future work to test the QA system with the adjusted IR component and the improved ranking of relevant passages. We would also like to explore further techniques for integrating linguistic information in IR to improve retrieval recall and precision even further.

References

(Bouma et al. 01) Gosse Bouma, Gertjan van Noord, and Robert Malouf. Alpino: Wide coverage computational analysis of Dutch. In Computational Linguistics in the Netherlands (CLIN 2000). Rodopi, 2001.

(Bouma et al. 05) Gosse Bouma, Jori Mur, and Gertjan van Noord. Reasoning over dependency relations for QA. In Knowledge and Reasoning for Answering Questions (KRAQ'05), IJCAI Workshop, Edinburgh, Scotland, 2005.

(Fagan 87) Joel L. Fagan. Automatic phrase indexing for document retrieval. In SIGIR '87: Proceedings of the 10th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 91-101, New York, NY, USA, 1987. ACM Press.

(Jakarta 04) Apache Jakarta. Apache Lucene - a high-performance, full-featured text search engine library. http://lucene.apache.org/java/docs/index.html, 2004.

(Jijkoun et al. 04) Valentin Jijkoun, Jori Mur, and Maarten de Rijke. Information extraction for question answering: Improving recall through syntactic patterns. In Proceedings of COLING-2004, 2004.

(Katz & Lin 03) Boris Katz and Jimmy Lin. Selectively using relations to improve precision in question answering. In Proceedings of the EACL-2003 Workshop on Natural Language Processing for Question Answering, 2003.

(Moldovan et al. 02) Dan Moldovan, Sanda Harabagiu, Roxana Girju, Paul Morarescu, Finley Lacatusu, Adrian Novischi, Adriana Badulescu, and Orest Bolohan. LCC tools for question answering. In Proceedings of TREC-11, 2002.

(Neumann & Sacaleanu 04) Günter Neumann and Bogdan Sacaleanu. Experiments on robust NL question interpretation and multi-layered document annotation for a cross-language question/answering system. In Working Notes of the CLEF 2004 Workshop (QA@CLEF), Bath, 2004.

(Prager et al. 00) John Prager, Eric Brown, Anni Coden, Dragomir Radev, and Valerie Samn. Question-answering by predictive annotation. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Athens, Greece, July 2000.

(Strzalkowski et al. 96) Tomek Strzalkowski, Louise Guthrie, Jussi Karlgren, Jim Leistensnider, Fang Lin, José Pérez-Carballo, Troy Straszheim, Jin Wang, and Jon Wilding. Natural language information retrieval: TREC-5 report, 1996.

(Zhai 97) Chengxiang Zhai. Fast statistical parsing of noun phrases for document indexing. In Proceedings of the Fifth Conference on Applied Natural Language Processing, pages 312-319, San Francisco, CA, USA, 1997. Morgan Kaufmann Publishers Inc.