Mining Multilingual Association Dictionary from Wikipedia for Cross-Language Information Retrieval

Zheng Ye 1, Jimmy Huang 2, Ben He 3
School of Information Technology, York University, Toronto, Ontario, M3J 1P3, Canada

Hongfei Lin 4
Department of Computer Science and Engineering, Dalian University of Technology, Dalian, Liaoning, 116023, China

Abstract

Wikipedia is characterized by its dense link structure and a large number of articles in different languages, which make it a notable Web corpus for knowledge extraction and mining, in particular for mining multilingual associations. In this paper, motivated by a psychological theory of word meaning, we propose a graph-based approach to constructing a cross-language association dictionary (CLAD) from Wikipedia, which can be used in a range of cross-language access and processing applications. In order to evaluate the quality of the mined CLAD, and to demonstrate how it can be used in practice, we explore two different applications of the mined CLAD in cross-language information retrieval (CLIR). First, we expand cross-language retrieval queries using the mined CLAD; and second, we use it to filter out translation candidates with low translation probabilities. Experimental results on a variety of standard CLIR test collections show that the CLIR retrieval performance can be substantially improved with the above two applications of the CLAD.

Keywords: Association Dictionary, Wikipedia, CLIR.

1 Introduction

Multilingual resources are essential to many cross-language applications, such as machine translation and cross-language information retrieval (CLIR) [3,41]. CLIR is a subfield of information retrieval (IR) that retrieves documents written in the so-called target language for queries written in other languages (a.k.a. source languages). Usually, multilingual resources, such as bilingual dictionaries, are manually built and maintained. There are two generally recognized limitations of these manually built dictionaries. First, it is very expensive and time-consuming to manually build and maintain a bilingual dictionary. Second, the dictionaries are usually limited in size, lack new words, and are not comprehensive enough for real applications. Consequently, with the rapid growth of the vocabulary of different languages, and particularly with the dynamic development of the Internet, manually maintained multilingual dictionaries are no longer capable of providing a satisfactory service for cross-language applications.

1 Email: [email protected]
2 Email: [email protected]
3 Email: [email protected]
4 Email: [email protected]

The recent development of the Internet has attracted many researchers' interest in mining bilingual dictionaries from the Web, since more and more high-quality multilingual resources can be freely obtained (e.g. Wikipedia), and new words in different languages usually emerge first on the Web. For example, Cao et al. [3] built a system to mine large-scale bilingual dictionaries from monolingual Web pages.

In this paper, we aim to mine a high-quality cross-language association dictionary (CLAD) from Wikipedia, which enables the effective extraction of associated words or phrases across different languages for any given word. The CLAD differs from a traditional bilingual dictionary in its ability to expand word associations from a semantic perspective. In the mined CLAD, the words associated with a given word are not limited to its direct translations in other languages, but also include related words in other languages. For example, when we look up "most-favored-nation" in the mined CLAD, we find its related terms (最惠国 ⟨most favored nation⟩, 待遇 ⟨treatment⟩, 人权 ⟨human rights⟩, 美国 ⟨America⟩, 国民党 ⟨the Nationalist Party⟩, 一个 ⟨one⟩, 权利 ⟨rights⟩, 英国 ⟨Britain⟩, 卫生 ⟨sanitation⟩, 价值 ⟨value⟩). This unique feature of the CLAD enables semantic reasoning about a word in the context of languages other than that of the given word. According to the theory of Hyperspace Analogue to Language (HAL) [22,21], when a human encounters a new concept, he/she derives its meaning via other concepts that occur within the same contexts. Therefore, in the case of our study, one would expect to derive the meaning of a word in a foreign language from its related words in one's native language. Our mined CLAD can be beneficial to many multilingual and cross-language applications, such as multilingual text categorization, question answering and cross-language information retrieval, especially for languages with scarce translation resources. We demonstrate the applications of the CLAD to CLIR in Section 4.

In this paper, we choose to mine the CLAD from Wikipedia 5 for the following characteristics. First, it has a dense link structure that can be employed to mine word associations; and second, it has a huge number of articles with translations in various languages, which enables mining word associations across different languages. All the above features make Wikipedia a notable Web corpus for knowledge extraction and mining for multilingual applications [23]. However, most of the previous studies on automatic word association dictionary construction from the Web focus on building dictionaries in single languages, which can only be applied to monolingual applications [7,16,32].

5 http://www.wikipedia.org, June, 2008

Fig. 1. An Example Page for Concept “Wikipedia”

Indeed, there has been little research on how to properly utilize the advantages brought by the development of the Internet to automatically mine a CLAD. In this paper, motivated by the psychological theory of word meaning [22,21], we propose a graph-based approach to automatically constructing a cross-language association dictionary from Wikipedia based on an analysis of Wikipedia's link structure. In particular, we employ two kinds of links in Wikipedia articles, as shown in Figure 1. The first one is the concept link, which typically indicates a topical association between articles, or rather between the concepts being described by the corresponding articles. The second one is the multilingual link on the left side of an online Wikipedia page, which links a concept to its variants in other languages. This makes the concepts language-independent, since we can easily gather a concept's names in different languages with the help of the multilingual links. We also observe that related words in different languages tend to co-occur with the same concepts or highly related concepts within a certain distance [39], which confirms the HAL theory to some extent. This motivates us to explore ways to mine a cross-language association dictionary from Wikipedia articles.

The main contributions of this paper are as follows. First, we extend the

HAL theory to a cross-language scenario. Second, we propose a novel graph-based approach to model cross-language word associations in the context of Wikipedia, and apply graph-based techniques to estimate word associations. As a result, the cross-language association dictionary mined from Wikipedia significantly strengthens the existing multilingual resources and can be applied to enhance many cross-language applications. As a demonstration, we apply the mined CLAD to two CLIR applications, namely cross-language query expansion and translation candidate filtering. The experimental results indicate a marked improvement in CLIR performance brought by the CLAD mined from Wikipedia.

The rest of the paper is organized as follows. In Section 2, we review the related work. Section 3 introduces the proposed approach in detail. Section 4 presents two applications of the mined multilingual dictionary. Sections 5 and 6 present the experimental settings and results. Section 7 provides a brief conclusion and some future directions.

2 Related Work

According to Ito et al. [16], thesauri or dictionaries can be classified into two types. The first one is the "relation dictionary", which defines explicit relationships such as "is-a" or "part-of". The second one is the "association dictionary", which enables users to extract words associated with a specific word [16]. The main difference between these two kinds of thesauri is that association thesauri provide a kind of associative relation between words that is not structured as in relation dictionaries.

An association dictionary is usually built automatically from a text corpus. Based on the type of corpora used, two kinds of methods have been proposed to build an association dictionary. The first kind builds association dictionaries from ordinary plain-text corpora. A series of thesaurus construction methods, such as word co-occurrence based methods and word clustering based methods, have been proposed in [7,32]. The second kind is Web-based, building association dictionaries from Web corpora. One of the advantages of Web-based association dictionaries is the existence of hyperlinks in Web corpora, which bring valuable information for multilingual analysis.

Wikipedia is currently the world's largest free online encyclopedia, written collaboratively by many of its readers. Recently, Wikipedia has attracted great attention as an invaluable corpus for knowledge extraction

and mining [16,24,36,23,15]. There are more than 7.8 million articles 6 for the top 10 languages in Wikipedia, and it covers concepts from various fields such as sports and science. Moreover, Wikipedia has a dense link structure, and the hyperlinks between articles denote semantic associative relations between the corresponding concepts.

Based on the structure of Wikipedia, a large number of studies on automatic thesaurus construction have been conducted on the Wikipedia corpus. For example, Milne et al. presented a case study on how to mine domain-specific relation thesauri from Wikipedia [24]. The "redirects" information, the category structure and the inner hyperlinks in Wikipedia are used to identify equivalence, hierarchical and associative relations, respectively. In addition, Gabrilovich et al. proposed a TF-IDF 7 based method to compute the relatedness between concepts in Wikipedia [10]. A vector space model is used to represent the concept corresponding to a page, in which every dimension denotes an inner hyperlink (concept) of the page, and the traditional TF-IDF weighting formula [30] is used to estimate its importance. Correlation metrics can then be used to calculate the relatedness between two concepts. The main disadvantage of this method is that when the pages representing the concepts are poorly created or unreliable, the accuracy decreases seriously, as only local information is used. To solve this problem, Ito et al. proposed a thesaurus construction method based on global concept co-occurrence analysis [16]. Local information, namely the importance of concepts based on the TF-IDF weighting scheme, is also used to enhance their method through a linear combination, and they report that the combined method produces better performance. However, most of these studies focused on building a monolingual dictionary in the context of Wikipedia, which cannot be applied directly to multilingual applications. In this paper, we extend them to a cross-language environment, and verify the quality of the mined CLAD via its applications in CLIR evaluation.

In fact, this is not the first study to exploit Wikipedia to enhance CLIR. For example, Juffinger et al. [17] proposed a query reconstruction approach via returned Wikipedia articles in the target language 8 . The advantage of this approach is that no translation dictionary is needed.

6 http://Wikipedia.org/, Feb 10, 2009
7 TF-IDF (term frequency-inverse document frequency) [30] is a frequently used weighting function in information retrieval. It measures how important a word is to a document in a collection.
8 In this paper, we refer to the query language as the source language, and the document language as the target language in CLIR.

Potthast et al. introduced CL-ESA, a multilingual retrieval model for the analysis of cross-language similarity via converting documents to a concept space, the advantage of which is that no direct translation effort is necessary [28]. However, it is very time-consuming to build the concept vectors, since the similarity of each pair of documents needs to be computed. Schönhofen et al. used bilingual Wikipedia articles for dictionary extension and exploited Wikipedia hyperlinkage for query term disambiguation [31]. Our approach is different from these studies in that: 1) we explicitly mine a CLAD dictionary, which is not limited to CLIR; 2) the mined CLAD dictionary can be used not only for translation disambiguation, but also for expanding related words.

3 A Graph-based Approach

An empirical observation one can make on the Wikipedia corpus is that related words in different languages are likely to co-occur with the same concepts or with highly related concepts. For example, if we choose two sentences from the Wikipedia articles for the concept "Wikipedia" in English and Chinese respectively, as shown in Figure 2, we can see several pairs of related words, or even translations of each other, in these two sentences. This observation motivates us to devise a graph-based approach to mining a high-quality cross-language association dictionary by considering the co-occurrences of related words in the descriptions of the same concept in different languages.

The Hyperspace Analogue to Language (HAL) model [22,21] is a computational instantiation of a psychological theory of word meaning. It hypothesizes that when a human encounters a new concept, he/she derives its meaning via other concepts that occur within the context. In the case of our study, we restrict the context to the concepts in the Wikipedia corpus. As discussed in Section 1, the names of a Wikipedia concept in different languages can be mapped to a single concept ID. Therefore, the meaning of a word in any language can be derived from the Wikipedia concepts, which are language-independent. With the help of the language-independent concepts, we can connect two words in different languages via these concept links. It is then possible to use statistical methods to identify which of these pairs are highly related, since the Wikipedia articles contain plenty of such sentence pairs and rich concept links.

In this paper, we represent the CLAD dictionary by a weighted graph, in which the Wikipedia concepts and normal words are the nodes, and the edges carry the relatedness between these nodes. A simple example of this graph

English Sentence: Wikipedia is a free, Web-based and collaborative multilingual encyclopedia, born in the project supported by the non-profit Wikipedia Foundation.

Chinese Sentence: 维基百科是一个基于Wiki技术的多语言百科全书协作计划，也是一部用不同语言写成的网路百科全书。 (English gloss: Wikipedia is a multilingual encyclopedia collaboration project based on wiki technology, and is also an online encyclopedia written in different languages.)

Related Pairs: (multilingual, 多语言), (encyclopedia, 百科全书), (collaborative, 协作计划), (collaborative multilingual encyclopedia, 多语言百科全书)

Fig. 2. An Example for the Concept "Wikipedia"

is presented in Figure 3. In particular, there are two kinds of edges in this graph. The first one is the edge between two concepts, and the second one is the edge between a concept and a word. In this way, the words in different languages can be associated via these edges, and their relatedness can then be estimated via a function of the weights of these edges. The main research challenges we are facing are as follows. First, how do we construct this graph? Second, how do we estimate the weights of the edges? And third, how do we compute the relatedness between words in different languages from the graph? Detailed solutions for these three issues are described in the following three subsections respectively.

3.1 Constructing the Association Graph

As stated above, the CLAD dictionary can be represented as a weighted undirected graph. Figure 3 shows a simple example. This can be formalized as follows. We represent the CLAD dictionary as a weighted undirected graph G = (V, E), where V denotes the set of nodes and E denotes the set of edges. To distinguish the two different kinds of vertices and the corresponding edges, we use the superscripts c and w. For example, v^c represents a concept node and v^w represents a normal word node. The algorithm for building the graph is presented in Figure 4. Firstly, we use the concept links to construct a weighted undirected graph, in which

Fig. 3. A Simple Example of the CLAD Graph

the vertices are concepts and there is an edge between a pair of concepts if they co-occur within a certain distance (see Section 3.2). We call this graph the "center concept graph" (denoted as G_CCG). Secondly, we add all words in the Wikipedia corpus as vertices to the center concept graph. Specifically, if a word co-occurs with a concept within a certain distance, we add an edge between the word node and the concept node, and the weight of the edge is the relatedness between the corresponding word and concept. After these two steps, we obtain a new association graph that contains two kinds of vertices, word vertices and concept vertices, which we call the "overall association graph" (denoted as G_OAG). With this graph, we link the words in different languages together, and the relatedness among them can be estimated as a function of their weighted path lengths. In the following, we describe in detail how the weights of these two kinds of edges are estimated.

3.2 Estimating the Weights

Technically, there are many statistical methods that could be used to estimate the weights of the edges in graph G_OAG. However, this is not the main concern of this paper. So, for both kinds of edges in G_OAG, we simply use a co-occurrence based method to estimate their weights, similar to that used by Ito et al. [16]. To count the number of co-occurrences, we divide an article into sentences, extract all concepts in the sentences, and map these concepts to their concept identifiers. In this way, an article is seen as a series of sliding windows, each consisting of k consecutive sentences. If a word and a concept, or two concepts, appear in an article within the same window, we count one co-occurrence of them. k is empirically set to 3 in our experiments.

0. Input: the original set of Wikipedia articles D
1. Output: the "overall association graph" (OAG) – G_OAG
2. Algorithm:
   for each article d in collection D {
       extract all pairs of concepts or of a word and a concept from d: (V_i^c, V_j^c), (V_i^w, V_j^c)
       for each pair of concepts (V_i^c, V_j^c) {
           add V_i^c and V_j^c as vertices in G_OAG;
           if V_i^c and V_j^c co-occur within a certain distance {
               add an edge E_{i,j}^c between these two vertices in G_OAG;
           }
       }
       for each pair of a word and a concept (V_i^w, V_j^c) {
           add V_i^w and V_j^c as a word vertex and a concept vertex in G_OAG, respectively;
           if V_i^w and V_j^c co-occur within a certain distance {
               add an edge E_{i,j}^w between these two vertices in G_OAG;
           }
       }
   }

Fig. 4. Algorithm of Building the Graph
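For concreteness, the following is a minimal Python sketch of the window-based co-occurrence counting behind Figure 4. It assumes that each article has already been split into sentences, that each sentence is a list of tokens, and that concept links have been mapped to language-independent identifiers prefixed with 'C:'; these data-layout choices, as well as the function and variable names, are illustrative assumptions rather than part of the original implementation.

```python
from collections import defaultdict
from itertools import combinations

def count_cooccurrences(articles, window_size=3):
    """Count window frequencies for the graph of Figure 4.

    `articles` is an iterable of articles, each a list of sentences, where
    every sentence is a list of tokens; concept tokens are assumed to be
    already mapped to language-independent IDs such as 'C:12345'.
    Returns (node_freq, pair_freq): the number of windows in which each node
    occurs, and in which each unordered pair of nodes co-occurs.
    """
    node_freq = defaultdict(int)
    pair_freq = defaultdict(int)
    for sentences in articles:
        # Slide a window of `window_size` consecutive sentences over the article.
        for start in range(max(1, len(sentences) - window_size + 1)):
            window = set()
            for sent in sentences[start:start + window_size]:
                window.update(sent)
            for node in window:
                node_freq[node] += 1
            for a, b in combinations(sorted(window), 2):
                # Keep only concept-concept and word-concept pairs,
                # as in the construction of G_OAG.
                if a.startswith('C:') or b.startswith('C:'):
                    pair_freq[(a, b)] += 1
    return node_freq, pair_freq
```

The counts returned here are exactly the window frequencies needed by the Dice coefficient introduced next.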

In our experiments, we use the Dice coefficient [18] to compute the co-occurrence strength between two terms, which is reported to be effective on Wikipedia articles [16]. The Dice coefficient can be written as follows:

$$\mathrm{Dice}(x, y) = \frac{2 f_{xy}}{f_x + f_y}, \qquad 0 \le \mathrm{Dice}(x, y) \le 1 \qquad (1)$$

where f_x and f_y are the numbers of windows in which terms x and y occur, respectively, and f_{xy} is the number of windows in which x and y co-occur.
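As a small illustration, an edge weight can be computed directly from the window counts above; the helper below is a sketch under the same assumptions (count dictionaries keyed by node and by sorted node pair), not the authors' code.

```python
def dice_weight(x, y, node_freq, pair_freq):
    """Dice coefficient of Equation (1), computed from the window counts
    returned by count_cooccurrences (see the sketch above)."""
    key = (x, y) if x <= y else (y, x)
    f_xy = pair_freq.get(key, 0)
    f_x, f_y = node_freq.get(x, 0), node_freq.get(y, 0)
    return 2.0 * f_xy / (f_x + f_y) if (f_x + f_y) > 0 else 0.0
```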

3.3 Computing the Relatedness between Two Words

Every path between two words, w_1 and w_2, represents a certain kind of association between them, and the weighted path length wpl(p) of a path can be used to estimate the corresponding relatedness. Intuitively, however, a longer path indicates a weaker relatedness between w_1 and w_2, since the words are not directly associated. We therefore divide the weighted path length by the length of the path, and the association along one path is estimated by the following formula:

$$AP(p, w_1, w_2) = \frac{wpl(p)}{len(p)} \qquad (2)$$

where p is a path between two words, wpl(p) is the weighted path length of p 9 , and len(p) is the number of edges along path p. Moreover, it would be infeasible to take all paths into account due to the computational cost. So we take the maximum value of the per-path association of two words as their relatedness, and only paths with a length of 3 or 4 are taken into account in our experiments. In other words, we only consider associations between two words linked through one or two concepts. The relatedness between two words, w_1 and w_2, can then be computed by the following formula:

$$ASCT(w_1, w_2) = \max_{i=1}^{n} AP(p_i, w_1, w_2) \qquad (3)$$

where n is the number of paths between words w_1 and w_2. The above describes how the CLAD dictionary is built. However, it is hard to verify the quality of the CLAD dictionary directly, since we cannot find a multilingual word similarity test collection. We therefore validate the effectiveness of the proposed approach in a cross-language access application. In particular, we propose to apply the CLAD dictionary in two ways to enhance the structure-based query translation approach [25] in CLIR, which will be detailed in Section 4.

9 The estimation of the weight of each edge is described in Section 3.2.
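To make Equations (2) and (3) concrete, the following Python sketch computes the relatedness of two words over the overall association graph. It follows the "one or two intermediate concepts" reading of the path restriction and assumes, as in the earlier sketches, that the graph is a dict of weighted adjacency dicts with concept IDs prefixed by 'C:' and that wpl(p) is the sum of the edge weights along p; these are assumptions made for illustration.

```python
def asct(w1, w2, graph):
    """Relatedness of Equations (2) and (3): the maximum of wpl(p)/len(p)
    over paths linking w1 and w2 through one or two concept nodes."""
    def neighbours(node):
        return graph.get(node, {})

    best = 0.0
    for c1, e1 in neighbours(w1).items():
        if not c1.startswith('C:'):
            continue
        # Path w1 - c1 - w2: one intermediate concept, two edges.
        e2 = neighbours(c1).get(w2)
        if e2 is not None:
            best = max(best, (e1 + e2) / 2.0)
        # Path w1 - c1 - c2 - w2: two intermediate concepts, three edges.
        for c2, e12 in neighbours(c1).items():
            if not c2.startswith('C:') or c2 == c1:
                continue
            e3 = neighbours(c2).get(w2)
            if e3 is not None:
                best = max(best, (e1 + e12 + e3) / 3.0)
    return best
```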

4 Applications of Word Association Dictionary

4.1 Filtering Low Probability Translations

First, we propose to use the CLAD dictionary to filter out low-probability translations. Translation dictionaries can often provide multiple translations for the same source-language word or phrase. In reality, not all translations are equally likely, and CLIR performance can be improved by choosing the best possible translation among the alternatives [13]. As reported in [38], low-probability translations can hurt the retrieval performance of CLIR, and removing them can significantly improve retrieval performance. We apply a co-occurrence based method for filtering, which has been extensively studied in the CLIR community [1,11,12,19]. Generally, the basic assumption of this method is that correct translations of the query words tend to co-occur in the target language. The probability of selecting a translation candidate word is usually measured by a function of the co-occurrence statistics between this candidate word and the other selected translation words in the query, but this easily becomes a "chicken-and-egg" problem, since the selection of translations for one word is determined by the translations selected for the other words [19]. However, we can estimate the translation probabilities directly by utilizing the multilingual association dictionary learnt from Wikipedia. We assume that the correct translations are likely to be related to the query words in the source language, and we use the following formula to estimate the selection probabilities:

$$SP(cw) = \sum_{i=1}^{num} \frac{ASCT(cw, sw_i)}{num} \qquad (4)$$

where cw is a translation candidate, sw_i is the i-th word in the source query and num is the number of words in the query. This formula measures the average relatedness between the translation candidate and the source query. In our experiments, we empirically remove all the candidates with translation probabilities lower than 0.2, but we always keep at least two.
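The filtering step itself can then be sketched as follows; this is a minimal illustration built on the asct() helper above, with the threshold of 0.2 and the "keep at least two" rule from the text, and with all names and data layouts being assumptions.

```python
def filter_translations(candidates, source_query, graph, threshold=0.2, keep_min=2):
    """Filter low-probability translation alternatives with Equation (4).

    `candidates` is the list of target-language alternatives for one source
    word, `source_query` the list of source-language query words, and
    `graph` the CLAD graph used by asct().  Candidates whose average
    relatedness to the source query falls below `threshold` are dropped,
    but at least `keep_min` candidates are always retained.
    """
    def sp(cw):
        scores = [asct(cw, sw, graph) for sw in source_query]
        return sum(scores) / len(scores) if scores else 0.0

    ranked = sorted(candidates, key=sp, reverse=True)
    kept = [cw for cw in ranked if sp(cw) >= threshold]
    return kept if len(kept) >= keep_min else ranked[:keep_min]
```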

4.2 Conducting Query Expansion

Second, we propose to use the CLAD dictionary to conduct pseudo relevance feedback via query expansion, which has proven to be an effective technique for information retrieval in a series of studies [6,14,29,37]. Generally, there are two categories of query expansion approaches for

CLIR: pre-translation expansion and post-translation expansion. The first one performs the expansion prior to query translation, for which an extra corpus in the source language is needed; the second one performs the expansion after retrieval with the translated query, where the expansion terms are selected from the top returned documents. In our experiments, we use a pre-translation-like expansion technique. The main difference is that we use the CLAD dictionary to achieve the same effect as the extra corpus in traditional pre-translation expansion approaches. The concrete process flow is presented in Figure 5.

0. Input: the original query Q = (q_1, q_2, ..., q_k), the vocabulary L_o in the original language and L_t in the target language
1. Output: the expanded query Q'
2. Conversion:
   for the original query Q {
       convert it to a vector V_Q = (p(q_1), p(q_2), ..., p(q_k)), where p(q_i) is the probability of query term q_i in the query language model.
   }
   for each word tw in L_t {
       construct a pseudo document represented by a vector d_tw = (w_1, ..., w_n), in which every dimension represents a word in the source language and the weight is the relatedness between the word tw and w_i.
   }
3. Get the expansion set Q_CLAD:
   a) rank the pseudo documents using cosine similarity: cos(V_Q, d_tw);
   b) extract the top K words corresponding to the top K pseudo documents;
   c) combine the original query and the expansion terms with equal weight using the #combine operator in the Indri search engine: Q' = #combine( 1.0 Q 1.0 Q_CLAD ).

Fig. 5. Query Expansion Process
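The procedure of Figure 5 can be sketched in a few lines of Python. The dictionary layouts (`query_lm` mapping source terms to query-language-model probabilities, `clad` mapping each target word to its pseudo document of source-word relatedness values) and the function names are illustrative assumptions; the final combination follows the operator usage written in Figure 5.

```python
import math

def expand_query(query_terms, query_lm, clad, k=10):
    """Pre-translation-style expansion of Figure 5: rank CLAD pseudo
    documents by cosine similarity to the query vector and take the top k."""
    def cosine(vec_q, vec_d):
        common = set(vec_q) & set(vec_d)
        dot = sum(vec_q[t] * vec_d[t] for t in common)
        nq = math.sqrt(sum(v * v for v in vec_q.values()))
        nd = math.sqrt(sum(v * v for v in vec_d.values()))
        return dot / (nq * nd) if nq and nd else 0.0

    vec_q = {q: query_lm.get(q, 0.0) for q in query_terms}
    ranked = sorted(clad, key=lambda tw: cosine(vec_q, clad[tw]), reverse=True)
    expansion = ranked[:k]
    # Combine the original query and the expansion set with equal weight,
    # as written in step 3(c) of Figure 5.
    indri_query = "#combine( 1.0 %s 1.0 %s )" % (" ".join(query_terms),
                                                 " ".join(expansion))
    return expansion, indri_query
```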

5 Experimental Settings

In this paper, we evaluate our proposed approach on English-Chinese datasets, although it can be applied to any language pair. We refer to the language of the queries as the source language (i.e., English in this paper), and the language of the documents as the target language (i.e., Chinese in this paper). In the rest of this section, we describe the bilingual resources, the datasets used and the evaluation measures.

5.1 Bilingual Resources

Firstly, the basic bilingual translation dictionary adopted in our experiments is the LDC EC2.0 dictionary from the Linguistic Data Consortium (LDC) 10 , which contains 86,000 English entries, 137,000 Chinese entries and 240,000 translation pairs. This dictionary is widely used as a basic translation resource in the CLIR community [38,42]. Secondly, Wikipedia articles are used as the multilingual corpus for mining the CLAD dictionary. Several kinds of static dumps are available from its official website 11 , and we used the English and Chinese static HTML dumps released in June 2008 in our experiments. We discard the articles that are written in only one language. After this preprocessing, we obtain 104,539 concepts that have articles written in both English and Chinese. All of our CLIR experiments are conducted based on the LDC dictionary; the names of the corresponding runs in Tables 2 and 3 start with LDC.

5.2 Test Collections

We evaluate our approach on two standard English-Chinese CLIR datasets, the TREC5&6 and NTCIR5&6 datasets, from TREC 12 and NTCIR 13 . These two projects aim to support research within the information retrieval community by providing the infrastructure necessary for large-scale evaluation of text retrieval methodologies, and they allow researchers to make cross-comparisons of different IR systems and techniques. The corresponding test collections are frequently used for CLIR evaluation [4,42]. The TREC5&6 and NTCIR5&6 datasets consist of Chinese documents with queries in both English and Chinese. Having two versions of the queries

10 http://www.ldc.upenn.edu/
11 http://download.Wikipedia.org/, June, 2008
12 http://trec.nist.gov/
13 http://research.nii.ac.jp/ntcir/index-en.html

allows us to conduct both monolingual and cross-lingual experiments. For the query topics of each dataset, there are three fields (title, description and narrative) that can be used for retrieval. Figure 6 gives an example of the TREC-5 topics. All three fields are used in query formulation, but redundant phrases (e.g. "A relevant document should describe") in the narrative field are removed. Table 1 shows the statistics of the datasets. In total, there are 54 topics from TREC and 100 topics from NTCIR. For Chinese text segmentation, we use a simple dictionary-based algorithm similar to that in [38]. The words in the segmentation dictionary are obtained from the LDC bilingual dictionary mentioned in Section 5.1. To segment Chinese text, we treat every substring of 2 or more characters as a word if it can be found in the segmentation dictionary. This strategy can be used to optimize cross-language retrieval performance [38]. The Indri toolkit 14 is used to index and retrieve documents in all the experiments. Except for removing stop words, no further text analysis procedure is used in our experiments.
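The segmentation strategy just described can be sketched as below. Emitting every matching substring of two or more characters is one possible reading of the description above; the `max_len` bound and the function name are illustrative assumptions.

```python
def segment(text, seg_dict, max_len=8):
    """Dictionary-based segmentation of Chinese text: every substring of two
    or more characters found in the segmentation dictionary is emitted as a
    word (overlapping matches are all kept in this reading)."""
    words = []
    for i in range(len(text)):
        for j in range(i + 2, min(len(text), i + max_len) + 1):
            if text[i:j] in seg_dict:
                words.append(text[i:j])
    return words
```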

Fig. 6. An Example Topic – TREC-5 CH1

5.3 Performance Measures

There are three major evaluation measures used in this paper:

14 http://www.lemurproject.org/

Table 1
Statistical Information of the Datasets

Dataset             TREC5&6                NTCIR5&6
Topic               TREC-5     TREC-6      NTCIR-5    NTCIR-6
Query Language      English    English     English    English
Document Language   Chinese    Chinese     Chinese    Chinese
Query Count         28         26          50         50
Document Count      164,789                901,446

• MAP (Mean Average Precision): the mean of the average precisions over all the test queries, which reflects the overall retrieval performance of an IR system. It is also the main official evaluation measure for the TREC5&6 and NTCIR5&6 CLIR datasets.

• P@10: precision after 10 documents are retrieved, which measures the percentage of the retrieved documents that are actually relevant. It is often used in scenarios where users pay most attention to the top 10 returned documents, such as in a search engine.

• R-Precision: precision after R documents are retrieved, where R is the number of relevant documents for a given topic.
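For reference, the three measures can be computed as follows; this is a small, self-contained Python sketch (MAP is the mean of average_precision over all topics), not tied to any particular evaluation toolkit.

```python
def average_precision(ranked, relevant):
    """Average precision for one topic: `ranked` is the ranked list of
    retrieved document IDs, `relevant` the set of relevant IDs."""
    hits, score = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0

def precision_at(ranked, relevant, n=10):
    """P@n: fraction of the top n retrieved documents that are relevant."""
    return sum(1 for doc in ranked[:n] if doc in relevant) / n

def r_precision(ranked, relevant):
    """R-Precision: precision after R documents, R = number of relevant docs."""
    r = len(relevant)
    return sum(1 for doc in ranked[:r] if doc in relevant) / r if r else 0.0
```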

More information about these measures can be found in [2]. For the evaluation of CLIR, we first translate the queries in the source language into the target language using the proposed approach, and then retrieve documents using the translated queries. The performance is always compared with that of monolingual retrieval, in which the queries are translated by a human expert. The performance of monolingual retrieval has always been considered an "upper bound" for CLIR, since the process of automatic translation is inherently noisy [42]. In this paper, all three evaluation measures are used for comparison. In addition, we apply the CLAD dictionary to enhance CLIR, and this is also evaluated by the measures mentioned above. If the CLAD dictionary mined in our experiments is of good quality, it should improve the performance of CLIR.

6 Experiments and Analyses

In this section, we describe a series of cross-language retrieval experiments designed to evaluate the quality of the mined CLAD dictionary. A total of three categories of experiments have been conducted on both datasets. The first one is the monolingual and bilingual baseline experiments, which are detailed in Subsection 6.1. The second one uses the CLAD dictionary to perform cross-language query expansion; the corresponding runs, whose names contain FB, are presented in Tables 2 and 3. The third one uses the CLAD dictionary to filter low-probability translation alternatives; the corresponding runs, whose names contain FILTER, are presented in Tables 2 and 3. We summarize the different methods used in our experiments as follows:

• Monolingual baseline: in this run, we use the manually translated version of the queries for retrieval; that is, the queries and the documents are in the same language. As noted above, the corresponding performance has always been considered an "upper bound" for a CLIR system, since automatic translation is always noisy.

• LDC baseline: in this run, we use the structure-based query translation approach (see Section 6.1) to translate the queries, using only the LDC bilingual dictionary.

• LDC-FB: in this run, we first use the structure-based query translation approach to translate the queries using the LDC bilingual dictionary, and then use the algorithm described in Section 4.2 to expand the original query. The expanded queries are used to obtain the final retrieval results.

• LDC-FILTER: in this run, we first use the structure-based query translation approach to translate the queries using the LDC bilingual dictionary, and then filter out translation alternatives with low probabilities using the algorithm described in Section 4.1.

6.1 Baseline Experiments

For the basic monolingual retrieval, we use the query-likelihood language model (QL). In particular, we use a Dirichlet prior (with a hyperparameter of µ = 1000) for smoothing the document language model, as shown in Equation 5, which generally achieves good performance [40]:

$$p(w|d) = \frac{c(w; d) + \mu\, p(w|C)}{|d| + \mu} \qquad (5)$$

where c(w; d) is the frequency of query term w in document d, p(w|C) is the probability of term w in the collection language model C, and |d| is the length of document d. p(w|C) is estimated by Maximum Likelihood Estimation (MLE). We fix the document language model in all the experiments in order to make a reasonable comparison.
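As an illustration, Dirichlet-smoothed query-likelihood scoring of Equation 5 amounts to the following; the data layout (term-frequency and collection-probability dictionaries) is an assumption made for the sketch.

```python
import math

def query_likelihood(query_terms, doc_tf, doc_len, coll_prob, mu=1000.0):
    """Log query-likelihood score with Dirichlet smoothing (Equation 5).

    `doc_tf` maps a term to its frequency c(w; d) in the document, `doc_len`
    is |d|, and `coll_prob` maps a term to its maximum-likelihood collection
    probability p(w|C)."""
    score = 0.0
    for w in query_terms:
        p_wc = coll_prob.get(w, 1e-10)                      # background probability
        p_wd = (doc_tf.get(w, 0) + mu * p_wc) / (doc_len + mu)
        score += math.log(p_wd)
    return score
```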

For the cross-language retrieval, we employ the structure-based query translation approach [25,26], which has been shown to be effective in many CLIR studies [25,26,34,27]. Query translation is the most widely used matching strategy in CLIR due to its effectiveness and simplicity. One of the key issues is the selection of translation alternatives. The structure-based query translation approach treats all the translation alternatives in a translation dictionary as synonyms. In contrast to other query translation models [8,9,20,35], it addresses the translation ambiguity problem indirectly. In addition, it can be easily integrated with the mined CLAD dictionary, as shown in Section 4. Thus, we can focus our attention on mining the CLAD dictionary rather than on the selection of translation alternatives. For the implementation of the structure-based query translation model, we use Indri's structured query language. In particular, we use the following query operators in Indri:

• #syn: the #syn operator treats its operand keys as synonyms, and each search keyword in this operator is treated as a match. For the translation alternatives a_1, a_2, ..., a_n of a source-language key A, we issue a search statement #syn(a_1 a_2 ... a_n), whose likelihood in a document model corresponds to that of the source-language key A. The formula for computing the probability of the #syn operator is as follows:

$$P(\#syn(a_1\, a_2 \ldots a_n)\,|\,d) = 1 - \prod_i (1 - p_i) \qquad (6)$$

where p_i is the probability of a_i in a document model.

• #odn: the #odn(a_1 a_2 ... a_n) operator matches if the a_i's appear in order with no more than N words between adjacent terms. In the structured query model, it is used in compound and phrase-based structuring. For example, #od1 (equivalent to #1) means the words in this operator should occur as a phrase in a document.

• #combine: the #combine operator allows other operators, terms, phrases, etc. to be combined.

In our experiments, we first recognize all the phrases in the query using a dictionary-based matching method and translate them. Then, for the remaining words, we translate them word by word. If a word or phrase has more than one translation alternative, we use the #syn operator in Lemur to combine them. Here the #syn operator implicitly plays the role of translation disambiguation. For example, the English topic "China's Expectations about APEC" is translated as follows: #combine( #syn(中国 瓷器) #syn(可能性 #1(将来 的 指望)

指望 期待 期望 #1(继承 遗产 的 指望) #1(被 期待 物) 预期) apec ), where #1 (equivalent to #od1) means that the words in this operator should occur as a phrase in a document. This operator is used here because a Chinese translation alternative may be a compound word that is segmented into more than one word. In addition, the stop word "about" is removed, and the out-of-dictionary word "APEC" is kept as it is, since it is possible that it appears in the Chinese documents in its original form.
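Assembling such a structured query can be sketched as follows; the translation-dictionary layout and the function name are assumptions for illustration, and the output has the same shape as the example above.

```python
def build_structured_query(source_terms, trans_dict):
    """Assemble the structure-based Indri query described above.

    `trans_dict` maps a source-language word or phrase to its list of
    translation alternatives; multi-word alternatives are wrapped in #1(...)
    so that they match as phrases, several alternatives are grouped with
    #syn(...), and out-of-dictionary terms are kept untranslated."""
    parts = []
    for term in source_terms:
        alternatives = trans_dict.get(term)
        if not alternatives:
            parts.append(term)                 # e.g. "APEC" stays as it is
            continue
        wrapped = ["#1(%s)" % alt if " " in alt else alt for alt in alternatives]
        parts.append(wrapped[0] if len(wrapped) == 1
                     else "#syn(%s)" % " ".join(wrapped))
    return "#combine( %s )" % " ".join(parts)

# For instance, build_structured_query(["china", "expectation", "APEC"], some_dict)
# would produce a query of the same shape as the translated example above.
```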

TREC5&6               MAP                        P@10                       R-Precision
Monolingual baseline  0.4488                     0.6815                     0.4606
LDC baseline          0.3016 (67.20%)            0.5441 (79.83%)            0.3377 (73.31%)
LDC-FB                0.3263* (72.27%, 8.18%)    0.5578 (81.84%, 2.51%)     0.3672* (79.72%, 8.73%)
LDC-FILTER            0.3227* (71.90%, 6.99%)    0.5763* (84.56%, 5.91%)    0.3516* (76.33%, 4.11%)

Table 2: MAP, P@10 and R-Precision for all the experiments on the TREC5&6 dataset. A star indicates a statistically significant improvement over the cross-language retrieval baseline, according to the Wilcoxon matched-pairs signed-ranks test at the 0.05 level; the values in parentheses are, respectively, the relative performance over that of monolingual retrieval and the improvement over that of structure-based cross-language retrieval. Bold face marks the best result for each measure.

NTCIR5&6              MAP                         P@10                        R-Precision
Monolingual baseline  0.2813                      0.4060                      0.3089
LDC baseline          0.1826 (64.49%)             0.2740 (67.48%)             0.2163 (70.02%)
LDC-FB                0.2107* (74.90%, 15.33%)    0.3060* (72.90%, 11.67%)    0.2341* (75.78%, 8.22%)
LDC-FILTER            0.2016* (71.66%, 10.40%)    0.3010* (71.67%, 9.85%)     0.2324* (75.23%, 7.44%)

Table 3: MAP, P@10 and R-Precision for all the experiments on the NTCIR5&6 dataset. A star indicates a statistically significant improvement over the cross-language retrieval baseline, according to the Wilcoxon matched-pairs signed-ranks test at the 0.05 level; the values in parentheses are, respectively, the relative performance over that of monolingual retrieval and the improvement over that of structure-based cross-language retrieval. Bold face marks the best result for each measure.

Fig. 7. Standard Precision-Recall Curves for All Experiments on the TREC-5 and NTCIR-5 Datasets

6.2 Discussions

Tables 2 and 3 present the retrieval performance in terms of three traditional evaluation measures: 1) MAP, 2) P@10, 3) R-Precision. Figure 7 shows the standard precision/recall curves (interpolated precision averages at r recall) for all our results on the TREC5&6 and NTCIR5&6 datasets respectively.

First, as Tables 2 and 3 show, the cross-language retrieval performance can in general be significantly improved by applying the CLAD dictionary in the two different applications, which evaluate the quality of the CLAD dictionary from two different perspectives. In the first application, described in Section 4.1, the improvements in the experiments show at least that our proposed approach can estimate the relative relatedness of a word to its synonyms in the context of a given query topic; if the correct translations of key search terms were filtered out of the translation alternatives, the performance would likely drop markedly. In the second application, described in Section 4.2, the improvements show that good expansion terms can be obtained from the CLAD dictionary for a given query topic. When one or more important keywords are missing from the LDC translation dictionary, the improvements are much more impressive. We think the evaluation from these two applications is adequate to demonstrate that, in general, we can find truly associated words in the CLAD dictionary and rank them properly.

In addition, from Figure 7 we can see that the performance of using the CLAD dictionary for expansion is consistently better than that of using the CLAD dictionary for filtering, especially on the NTCIR5&6 dataset. There are

probably three reasons for this: 1) some correct translations or synonyms may be filtered out, which decreases retrieval performance, especially when they are important keywords in the query topic; 2) although it is also possible to expand negative expansion terms into the query, their impact can be offset by the good expansion terms and even by the translations of the query; 3) as mentioned earlier, the out-of-dictionary words, proper nouns and phrases, which are not included in the translation dictionary or cannot be properly translated word by word, are important for CLIR. It is possible that their translations are included in the expansion terms, whereas filtering can do nothing about this. For example, the phrase "most-favored-nation" in topic CH1 is not included in LDC2.0 and cannot be properly translated word by word, but its translation (最惠国) is among the expansion terms (最惠国, 待遇, 人权, 美国, 国民党, 一个, 权利, 英国, 卫生, 价值). As a result, the MAP score for topic CH1 improves from 0.1019 to 0.1594.

Finally, note that we only experiment with the structure-based query translation approach to evaluate the CLAD dictionary. Better performance could be expected when hybrid approaches with multiple resources are used. However, it is hard to make a solid comparison with the performance of other work. We believe the structure-based query translation approach is still a good baseline for evaluating the quality of the mined CLAD dictionary. It is also important to keep in mind that the main focus of this study is on mining a CLAD dictionary from Wikipedia articles; therefore, we have not paid much attention to how to combine the CLAD dictionary with other CLIR models.

In summary, the cross-language retrieval performance can be significantly enhanced by using the CLAD dictionary, which demonstrates that we can find truly associated words in the CLAD dictionary and rank them properly. This verifies that the CLAD dictionary mined from Wikipedia is of sound quality.

7 Conclusions and Future Work

In this study, we have proposed a graph-based method to mine a cross-language association dictionary from Wikipedia articles, in which the link structure of Wikipedia is utilized to build the graph. In addition, we have proposed two different applications in CLIR in order to evaluate the quality of the CLAD dictionary. Experimental results have shown that the mined CLAD dictionary can be used to enhance structure-based CLIR. We conclude that it is possible to mine a good CLAD dictionary from Wikipedia articles, which could also be used in other applications of multilingual information

access, especially for language pairs with few bilingual resources. As Wikipedia keeps growing, we believe that an even better CLAD dictionary can be obtained. In future work, it would be interesting to apply the mined dictionary to other applications, such as multilingual text categorization. Furthermore, translation disambiguation and pseudo relevance feedback have been intensively studied in the CLIR community. We believe that applying these more advanced techniques in our experiments can improve the retrieval performance, since only basic techniques have been used to filter or expand the translated queries in our experiments. Finally, we have employed a simple dictionary-based method to recognize phrases in our experiments, and out-of-vocabulary phrases are translated word by word. However, there are still many phrases that cannot be translated properly. It would be possible to use natural language processing (NLP) techniques [5,33] to mine cross-language phrase associations, and we will pursue our study in this direction.

8 Acknowledgements

This research is jointly supported by NSERC of Canada, the Early Researcher/Premier's Research Excellence Award, the Natural Science Foundation of China (No. 60373095 and 60673039) and the National High Tech Research and Development Plan of China (2006AA01Z151).

References

[1] Ballesteros, L. and W. B. Croft, Resolving ambiguity for cross-language retrieval, in: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998, pp. 1606–1611.
[2] Buckley, C. and E. M. Voorhees, Evaluating evaluation measure stability, in: SIGIR '00: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000, pp. 33–40.
[3] Cao, G., J. Gao and J.-Y. Nie, A system to mine large-scale bilingual dictionaries from monolingual web pages, in: Proceedings of MT Summit XI, 2007.
[4] Cao, G., J. Gao, J.-Y. Nie and J. Bai, Extending query translation to cross-language query expansion with Markov chain models, in: CIKM '07: Proceedings of the Sixteenth ACM Conference on Information and Knowledge Management, 2007, pp. 351–360.
[5] Carballo, J. P. and T. Strzalkowski, Natural language information retrieval: progress report, Information Processing and Management 36 (2000), pp. 155–178.
[6] Carpineto, C., G. Romano and V. Giannini, Improving retrieval feedback with multiple term-ranking function combination, ACM Transactions on Information Systems 20 (2002), pp. 259–290.
[7] Crouch, C., A cluster-based approach to thesaurus construction, in: Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998, pp. 309–320.
[8] Federico, M. and N. Bertoldi, Statistical cross-language information retrieval using n-best query translations, in: Proceedings of the 25th ACM SIGIR Conference on Research and Development in Information Retrieval, 2002, pp. 167–174.
[9] Franz, M., J. S. McCarley and S. Roukos, Ad hoc and multilingual information retrieval at IBM, in: Proceedings of the Text REtrieval Conference, 1998, pp. 104–115.
[10] Gabrilovich, E. and S. Markovitch, Computing semantic relatedness using Wikipedia-based explicit semantic analysis, in: Proceedings of the International Joint Conference on Artificial Intelligence, 2007, pp. 1606–1611.
[11] Gao, J. and J. Nie, A study of statistical models for query translation: finding a good unit of translation, in: Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006, pp. 194–201.
[12] Gao, J., J. Nie, H. He, W. Chen and M. Zhou, Resolving query translation ambiguity using a decaying co-occurrence model and syntactic dependence relations, in: Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 2002, pp. 183–190.
[13] Gey, F. C., N. Kando and C. Peters, Cross-language information retrieval: the way ahead, Information Processing and Management 41 (2004), pp. 415–431.
[14] He, D. and D. Wu, Translation enhancement: a new relevance feedback method for cross-language information retrieval, in: Proceedings of the 17th ACM Conference on Information and Knowledge Management, 2008, pp. 729–738.
[15] Hu, B., Wiki'mantics: interpreting ontologies with Wikipedia, Knowledge and Information Systems (2009). URL: http://dx.doi.org/10.1007/s10115-009-0247-6
[16] Ito, M., K. Nakayama, T. Hara and S. Nishio, Association thesaurus construction methods based on link co-occurrence analysis for Wikipedia, in: Proceedings of the 17th ACM Conference on Information and Knowledge Management, 2008, pp. 817–826.
[17] Juffinger, A., R. Kern and M. Granitzer, Crosslanguage retrieval based on Wikipedia statistics, in: Proceedings of the CLEF 2008 Workshop, 2008.
[18] Kondrak, G., D. Marcu and K. Knight, Cognates can improve statistical translation models, in: Proceedings of HLT-NAACL 2003: Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, 2003, p. 46.
[19] Liu, Y., R. Jin and J. Chai, A maximum coherence model for dictionary-based cross-language information retrieval, in: Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 2005, pp. 536–543.
[20] Lu, C., Y. Xu and S. Geva, Translation disambiguation in web-based translation extraction for English-Chinese CLIR, in: Proceedings of the 2007 ACM Symposium on Applied Computing, 2007, pp. 819–823.
[21] Lund, K. and C. Burgess, Producing high-dimensional semantic spaces from lexical co-occurrence, Behavior Research Methods, Instruments, and Computers 28 (1996), pp. 203–208.
[22] Lund, K., C. Burgess and R. A. Atchley, Semantic and associative priming in high-dimensional semantic space, in: Proceedings of the 17th Annual Conference of the Cognitive Science Society, 1995, pp. 660–665.
[23] Medelyan, O., D. Milne, C. Legg and I. H. Witten, Mining meaning from Wikipedia, International Journal of Human-Computer Studies 67 (2009), pp. 716–754.
[24] Milne, D., O. Medelyan and I. H. Witten, Mining domain-specific thesauri from Wikipedia: a case study, in: Proceedings of the ACM International Conference on Web Intelligence, 2006, pp. 442–448.
[25] Pirkola, A., The effects of query structure and dictionary setups in dictionary-based cross-language information retrieval, in: Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 1998, pp. 55–63.
[26] Pirkola, A., T. Hedlund, H. Keskustalo and K. Järvelin, Dictionary-based cross-language information retrieval: problems, methods, and research findings, Information Retrieval 4 (2001), pp. 209–230.
[27] Pirkola, A., D. Puolamäki and K. Järvelin, Applying query structuring in cross-language retrieval, Information Processing and Management 39 (2003), pp. 391–402.
[28] Potthast, M., B. Stein and M. Anderka, A Wikipedia-based multilingual retrieval model, in: Proceedings of ECIR 2008, 2008, pp. 522–530.
[29] Salton, G. and C. Buckley, Improving retrieval performance by relevance feedback, Journal of the American Society for Information Science 41 (1990), pp. 288–297.
[30] Salton, G. and M. McGill, "An Introduction to Modern Information Retrieval," McGraw-Hill, Inc., New York, NY, USA, 1986.
[31] Schönhofen, P., A. Benczúr, I. Bíró and K. Csalogány, Cross-language retrieval with Wikipedia, 2008, pp. 72–79.
[32] Schütze, H. and J. Pedersen, A cooccurrence-based thesaurus and two applications to information retrieval, Information Processing and Management 33 (1997), pp. 307–318.
[33] Smeaton, A. F., Using NLP or NLP resources for information retrieval tasks, in: Natural Language Information Retrieval, 1997, pp. 99–111.
[34] Sperer, R. and D. W. Oard, Structured translation for cross-language information retrieval, in: Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2000, pp. 120–127.
[35] Wang, J. and D. W. Oard, Combining bidirectional translation and synonymy for cross-language information retrieval, in: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2006, pp. 202–209.
[36] Wang, P., J. Hu, H.-J. Zeng and Z. Chen, Using Wikipedia knowledge to improve text classification, Knowledge and Information Systems 19 (2009), pp. 265–281.
[37] Xu, J. and W. Croft, Improving the effectiveness of information retrieval with local context analysis, ACM Transactions on Information Systems 18 (2000), pp. 79–112.
[38] Xu, J., R. Weischedel and C. Nguyen, Evaluating a probabilistic model for cross-lingual information retrieval, in: Proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval, 2001, pp. 105–110.
[39] Ye, Z., X. Huang and H. Lin, A graph-based approach to mining multilingual word associations from Wikipedia, in: Proceedings of the 32nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2009, pp. 690–691.
[40] Zhai, C. and J. Lafferty, A study of smoothing methods for language models applied to information retrieval, ACM Transactions on Information Systems 22 (2004), pp. 179–214.
[41] Zhang, Y. and P. Vines, Using the web for automated translation extraction in cross-language information retrieval, in: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2004, pp. 162–169.
[42] Zhou, D., M. Truran, T. Brailsford and H. Ashman, A hybrid technique for English-Chinese cross-language information retrieval, ACM Transactions on Asian Language Information Processing (TALIP) 7 (2008), pp. 1–35.