2014 International Conference on Advanced Networking, Distributed Systems and Applications

A new Term-Term similarity measure for selecting expansion features in Big Data

Ilyes Khennak and Habiba Drias
Laboratory for Research in Artificial Intelligence, Computer Science Department, USTHB, Algiers, Algeria

Abstract—The massive growth of information and the exponential increase in the number of documents published and uploaded online each day have led to the appearance of new words on the Internet. Because it is difficult to reach the meanings of these new terms, which play a central role in retrieving the desired information, it becomes necessary to give more importance to the sites and topics where these new words appear, or rather, to give value to the words that occur frequently with them. For this purpose, we propose in this paper a new term-term similarity measure based on the co-occurrence and closeness of words. It relies on searching, for each query feature, for the locations where it appears, then selecting from these locations the words which often neighbor and co-occur with the query features, and finally using the selected words in the retrieval process. Our experiments were performed on the OHSUMED test collection and show a significant performance enhancement over the state of the art.

Keywords-term co-occurrence, term proximity, query expansion, information retrieval

I. INTRODUCTION

Over the years, many different retrieval models, including vector space models [1], language models [2], Okapi BM25 [3], divergence from randomness [4], and the Markov random field (MRF) model for Information Retrieval (IR) [5], have been proposed and studied in order to solve the problem of finding the documents in a collection that satisfy a user's information need [6]. Further, several text retrieval systems dealing with the proximity and the interdependence of words have been developed in this respect. Nevertheless, building text retrieval systems that are at once effective, robust, and efficient has remained a significant challenge. From this perspective, the main goal of this work is to propose a new term-term similarity measure for generating expansion features based on the co-occurrence and proximity of words. This principle gives importance, during the search process, to words which often appear in the same context. For example, the word 'INDS' is often found in the same locations as the words 'Conference', '2014', and 'Algeria'. Relying on this principle was not a coincidence; it came as a result of recent studies on the evolution and growth of the Web, all of which have shown an exponential growth of the Web and a rapid increase in the number of new pages created.

In his study [7], Ranganathan estimated that the amount of online data indexed by Google had increased from 5 exabytes in 2002 to 280 exabytes in 2009. According to [8], this amount is expected to double in size every 18 months. Ntoulas et al. [9] read these statistics in terms of the number of new pages created and showed that their number is increasing by 8% a week. Bharat and Broder [10] went further and estimated that World Wide Web pages are growing at a rate of 7.5 pages every second. This revolution, which the Web is witnessing, has led to two notable points:

- The first point is the entry of new words into the Web, estimated, according to [11], at about one new word in every two hundred words. Studies [11][12][13] have shown that this influx is mainly due to neologisms, first occurrences of rare personal names and place names, abbreviations, acronyms, emoticons, URLs, and typographical errors.
- The second point is that users employ these new words during search. Chen et al. [14] indicated in their study that more than 17% of query words are out of vocabulary (non-dictionary words); of these, 45% are e-speak (lol), 18% are companies and products (Google), 16% are proper names, and 15% are misspellings and foreign words (womens, lettre) [15][16].

Given these two points, and because of the difficulty, or rather the impossibility, of using the meanings of these new words, we propose a method based on finding the locations and topics where these words appear, and then using the terms which neighbor and occur with them in the search process. We use the best-known instantiation of the Probabilistic Relevance Framework, Okapi BM25, together with Blind Relevance Feedback, as the baselines for comparison, and we evaluate our approach on the OHSUMED test collection. The main contributions of this paper are the following:

- The adoption of an external distance measure to evaluate the co-occurrence of words with respect to the query features.
- The definition of an internal distance measure to assess the proximity and closeness of words relative to the query features.

In the next section, we discuss the BM25 model and the Blind Feedback approach. The proposed approach is presented in Section III, and finally Section IV describes our experiments and results.

II. PROBABILISTIC RELEVANCE FRAMEWORK: OKAPI BM25

The Probabilistic Relevance Framework is a formal framework for document retrieval which led to the development of one of the most successful text-retrieval algorithms, Okapi BM25. In the classic version of the Okapi BM25 term-weighting function, the weight $w_i^{BM25}$ attributed to a given term $t_i$ in a document $d$ is obtained using the following formula:

$$w_i^{BM25} = \frac{tf}{k_1\left((1-b) + b\,\frac{dl}{avdl}\right) + tf}\; w_i^{RSJ} \quad (1)$$

Where: $tf$ is the frequency of the term $t_i$ in the document $d$; $k_1$ and $b$ are constants; $dl$ is the document length; $avdl$ is the average document length; and $w_i^{RSJ}$ is the well-known Robertson/Sparck Jones weight [17]:

$$w_i^{RSJ} = \log \frac{(r_i + 0.5)(N - R - n_i + r_i + 0.5)}{(n_i - r_i + 0.5)(R - r_i + 0.5)} \quad (2)$$

Where: $N$ is the number of documents in the whole collection; $n_i$ is the number of documents in the collection containing $t_i$; $R$ is the number of documents judged relevant; and $r_i$ is the number of judged relevant documents containing $t_i$.

The RSJ weight can be used with or without relevance information. In the absence of relevance information (the more usual scenario), the weight reduces to a form of the classical idf:

$$w_i^{IDF} = \log \frac{N - n_i + 0.5}{n_i + 0.5} \quad (3)$$

The final BM25 term-weighting function is therefore given by:

$$w_i^{BM25} = \frac{tf}{k_1\left((1-b) + b\,\frac{dl}{avdl}\right) + tf}\; \log \frac{N - n_i + 0.5}{n_i + 0.5} \quad (4)$$

Concerning the internal parameters, a considerable number of experiments have been done, and they suggest that values in the ranges $1.2 < k_1 < 2$ and $0.5 < b < 0.8$ are reasonably good in many cases. Robertson and Zaragoza [18] have indicated that published versions of Okapi BM25 are based on the specific values $k_1 = 2$ and $b = 0.5$. As part of the indexing process, an inverted file is created containing the weight $w_i^{BM25}$ of each term $t_i$ in each document $d$. The similarity score between a document $d$ and a query $q$ is then computed as follows:

$$Score_{BM25}(d, q) = \sum_{t_i \in q} w_i^{BM25} \quad (5)$$

During the interrogation process, the relevant documents are selected and ranked using this similarity score. Multiple occurrences of a term in the query are treated as different terms.

A. Blind Relevance Feedback for Query Expansion

Blind Feedback (also called pseudo-relevance feedback, or retrieval feedback) is based on modifying the initial query by adding new terms. We start by performing an initial search on the original query using the BM25 term weighting and the document-scoring function given by formula (5); we assume the best-ranked documents to be relevant, extract all terms from these documents, and rank them according to formula (2). We then expand the initial query by adding the top-ranked terms, and finally repeat the BM25 interrogation process. In this paper, we use the BM25 model and the Blind Feedback approach as the baselines against which our proposed approach is compared.

III. THE CLOSENESS AND CO-OCCURRENCE OF TERMS FOR EFFECTIVENESS IMPROVEMENT

The main goal of our proposed method is to return only the relevant documents, and as quickly as possible. For the purpose of selecting the relevant ones, we have introduced the concept of co-occurrence and closeness into the search process. This concept is based, at first, on finding for each query term the locations where it appears and then selecting, from these locations, the terms which frequently neighbor and co-occur with that query term. Put simply, we retrieve for each query term the documents where it appears, and then assess the relevance of the terms contained in these documents to the query term on the basis of:

1. Co-occurrence, which gives value to words that appear in the largest possible number of those documents.
2. Proximity and closeness, which gives value to words separated from the query term, within a document, by only a small number of words.

After that, we rank these words on the basis of their relevance to the query terms, add the top-ranked words to the initial query, and finally repeat the search process. Selecting the words that frequently appear with, and close to, the query terms, and employing them in the search step, leads us to consider the terms of the query as words that are themselves close to each other, so that the probability of their presence in the same documents is higher. Accordingly, the more of the query terms a document contains, the more relevant it is; and further, the more a word appears in these documents, the higher is its importance, i.e. relevance, to the query terms.

From this standpoint, we started our work by reducing the search space, giving importance to documents which contain at least two words of the initial query. This means that the terms which will be added to the original query depend only on this set of documents. The following formula allows us to select the documents that contain at least two words of the query, i.e. to pick out any document $d$ whose $Score_{Bigram}$ for a query $q$ is greater than zero:

$$Score_{Bigram}(q, d) = \sum_{(t_i, t_j) \in q,\; i \neq j} \left(w_i^{BM25} + w_j^{BM25}\right) \quad (6)$$

Where: $w_i^{BM25} > 0$ and $w_j^{BM25} > 0$.

As previously mentioned, in the first step we find the terms which often appear together with the query terms. This is done by assigning more importance to words that occur in the largest number of documents in which each query term appears. We interpret this importance via the measurement of the external distance of each term $t_i$ of the vocabulary of $R_c$ to each term $t_{j(q)}$ of the query $q$ ($R_c$ being the set of documents returned by formula (6)). This distance, which does not take the content of documents into consideration, computes the rate of appearance of $t_i$ with $t_{j(q)}$ in the collection of documents $R_c$. If $t_i$ appears in all the documents in which $t_{j(q)}$ occurs, the external distance is 1.0; if $t_i$ appears in none of the documents in which $t_{j(q)}$ occurs, it is 0.0. Based on this interpretation, the external distance $ExtDist$ of $t_i$ to $t_{j(q)}$ is calculated as follows:

$$ExtDist(t_i, t_{j(q)}) = \frac{\sum_{d_k \in R_c} x_{(i,k)} \cdot x_{(j,k)}}{\sum_{d_k \in R_c} x_{(j,k)}} \quad (7)$$

Where: $d_k$ is a document that belongs to $R_c$, and

$$x_{(i,k)} = \begin{cases} 1 & \text{if } t_i \in d_k, \\ 0 & \text{otherwise.} \end{cases}$$

The total external distance between a given term $t_i$ and the query $q$ is estimated as follows:

$$ExtDist(t_i, q) = \sum_{t_{j(q)} \in q} ExtDist(t_i, t_{j(q)}) \quad (8)$$

Our reliance on this distance came as a result of the remarkable outcomes achieved in [19][20]. The distance is used during the indexing process to compute the external distance between each pair of dictionary terms that appear together in at least one document. During the search process, we then expand the original query by adding, for each term $t$ of the initial query, the term whose external distance to $t$ is the highest.

In the second step, we find the terms which are often neighbors of the query terms. We therefore attribute more importance to terms lying at a short distance from the query keywords. We interpret this importance via the measurement of the internal distance between each term $t_i$ of the vocabulary of $R_c$ and each term $t_{j(q)}$ of the query $q$. This distance, which does take the content of documents into consideration, computes the distance between $t_i$ and $t_{j(q)}$ within a given document $d$ in terms of the number of words separating them. If $t_i$ appears next to $t_{j(q)}$, as in "Information Retrieval", the internal distance is 1.0; if more than five words separate $t_i$ and $t_{j(q)}$, the internal distance is 0.0. That is, the closer $t_i$ is to $t_{j(q)}$, the greater its internal distance. The internal distance between $t_i$ and $t_{j(q)}$ within a given document $d$ is obtained by the following formula:

$$IntDist^{(d)}(t_i, t_{j(q)}) = \frac{1}{dist(t_i, t_{j(q)})} \quad (9)$$

Where: $dist(t_i, t_{j(q)})$ is the distance, expressed in number of words, between $t_i$ and $t_{j(q)}$ in the document $d$.

The terms $t_i$ and $t_{j(q)}$ may appear more than once in a document $d$. Therefore, the internal distance between the term pair $(t_i, t_{j(q)})$ is estimated by averaging all possible $IntDist^{(d)}$ values between $t_i$ and $t_{j(q)}$. The preceding formula thus becomes:

$$IntDist^{(d)}(t_i, t_{j(q)}) = \frac{\sum_{k=1}^{occ(t_i, t_{j(q)})} \frac{1}{dist_k(t_i, t_{j(q)})}}{tf(t_i) \cdot tf(t_{j(q)})} \quad (10)$$

Where: $tf(t_i)$ is the frequency of the term $t_i$ in the document $d$, and $occ(t_i, t_{j(q)})$ is the number of appearances of the term pair $(t_i, t_{j(q)})$ in the document $d$.
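To make the two distances concrete, here is a minimal Python sketch. It is our own illustration, not the authors' code: the identifiers (`postings`, `doc_positions`, `rank_expansion_terms`) are hypothetical, the five-word proximity cut-off follows the description above, and the final ranking folds in the aggregation of the distances over documents and query terms described in the surrounding formulas.

```python
MAX_GAP = 5  # beyond five intervening words the internal distance is taken as 0


def ext_dist(term, q_term, postings):
    """External distance (formula (7)): the fraction of the documents
    containing q_term that also contain term.  postings maps a term to
    the set of ids of documents (in Rc) containing it."""
    docs_q = postings[q_term]
    if not docs_q:
        return 0.0
    return len(docs_q & postings[term]) / len(docs_q)


def int_dist_doc(term, q_term, positions):
    """Per-document internal distance (formulas (9)-(10)): average of
    1/dist over all occurrence pairs; positions maps each term of the
    document to the list of its word offsets."""
    pos_t, pos_q = positions.get(term, []), positions.get(q_term, [])
    if not pos_t or not pos_q:
        return 0.0
    total = 0.0
    for i in pos_t:
        for j in pos_q:
            d = abs(i - j)
            if 0 < d <= MAX_GAP:
                total += 1.0 / d
    return total / (len(pos_t) * len(pos_q))


def rank_expansion_terms(query, vocab, postings, doc_positions):
    """Rank candidate expansion terms by the overall distance to the
    query: total external distance times total internal distance,
    the internal part averaged over the documents containing both
    terms.  Higher score means a better expansion candidate."""
    scores = {}
    for t in vocab:
        ext = sum(ext_dist(t, qt, postings) for qt in query)
        intd = 0.0
        for qt in query:
            shared = postings[t] & postings[qt]  # docs containing both terms
            if shared:
                intd += sum(int_dist_doc(t, qt, doc_positions[d])
                            for d in shared) / len(shared)
        scores[t] = ext * intd  # overall distance
    return sorted(scores, key=scores.get, reverse=True)
```

On a toy index in which 'information' is adjacent to 'retrieval' in one document while 'expansion' is two words away from it in another, 'information' is ranked first, since the closer neighbor receives the larger internal distance.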

The average internal distance between $t_i$ and $t_{j(q)}$ over the whole of $R_c$ is then determined as follows:

$$IntDist(t_i, t_{j(q)}) = \frac{\sum_{d_k \in R_c} IntDist^{(d_k)}(t_i, t_{j(q)})}{\sum_{d_k \in R_c} x_{(i,k)} \cdot x_{(j,k)}} \quad (11)$$

The following formula calculates the total internal distance between a given term $t_i$ and the query $q$:

$$IntDist(t_i, q) = \sum_{t_{j(q)} \in q} IntDist(t_i, t_{j(q)}) \quad (12)$$

Finally, we compute the overall distance between $t_i$ and $q$ as follows:

$$Dist(t_i, q) = ExtDist(t_i, q) \cdot IntDist(t_i, q) \quad (13)$$

$$Dist(t_i, q) = \sum_{t_{j(q)} \in q} ExtDist(t_i, t_{j(q)}) \;\cdot \sum_{t_{j(q)} \in q} IntDist(t_i, t_{j(q)}) \quad (14)$$

Using formula (14), we evaluate the relevance of each term $t$ of the vocabulary of $R_c$ to the query $q$. We then rank the terms on the basis of their relevance and add the top-ranked ones to the original query. Finally, we use the BM25 similarity score, presented in Section II, to retrieve the relevant documents.

IV. EXPERIMENTS

In order to evaluate the effectiveness of the proposed approach, we carried out a set of experiments. First, we describe the dataset, the software, and the effectiveness measures used. Then, we present the experimental results.

A. Dataset

Extensive experiments were performed on the OHSUMED test collection (1). The collection consists of 348,566 references from MEDLINE, the on-line medical information database, consisting of titles and/or abstracts from 270 medical journals over a five-year period (1987-1991). The available fields are title, abstract, MeSH indexing terms, author, source, and publication type. In addition, the OHSUMED collection contains a set of queries and relevance judgments (a list of which documents are relevant to each query). In order to make the results more accurate and credible, we divided the OHSUMED collection into 6 sub-collections. Each sub-collection is defined by a set of documents, queries, and a list of relevant documents. Table I summarizes the characteristics of each sub-collection in terms of the number of documents it contains, its size, and the number of terms in its vocabulary (dictionary).

TABLE I: Characteristics of the sub-collections used for evaluating the proposed approach.

#docs     Size (Mb)   Number of terms in the dictionary
50000     3.34        35443
100000    6.66        50232
150000    10.1        62126
200000    13.5        71699
250000    17          80717
300000    20.6        88764

Regarding the queries, the OHSUMED collection includes 106 queries, each accompanied by a set of relevance judgments drawn from the whole collection of documents. Partitioning the collection into sub-collections inevitably decreases the number of relevant documents for each query. In other words, if n documents are relevant to a given query q with respect to the entire collection, then m documents are relevant to the same query with respect to one of the sub-collections, where n is certainly greater than or equal to m, and m may even be zero. In this last case (m = 0), we removed, for each sub-collection c, every query that has no relevant document in c. Table II shows, for each sub-collection, the number of queries (Nb Queries), the average number of relevant documents (Avr Rel Doc), and the average query size in number of words (Avr Query Size).

TABLE II: Some statistics on the OHSUMED sub-collections queries.

#docs            50000   100000   150000   200000   250000   300000
Nb Queries       82      91       95       97       99       101
Avr Rel Doc      4.23    7        10.94    13.78    15.5     19.24
Avr Query Size   6.79    6.12     5.68     5.74     5.62     5.51

There is no doubt that the title of a document is a set of keywords which summarize and highlight the document's content, and also that these keywords are often found together in the same sites and documents and are frequently adjacent to each other; this is consistent with our proposed idea, which gives value to words that often appear in the same sites. As a result, and in order to rely on sources of information in which words are very adjacent and close to each other, we decided to use only the titles of documents in the experiments. During the indexing phase, all non-informative words such as prepositions, conjunctions, pronouns, and very common verbs were discarded using a stop-word list. Subsequently, the most common morphological and inflectional suffixes were removed by adopting a standard stemming algorithm. Finally, the weight of each word in each document was calculated using the well-known BM25 term weighting presented in Section II.

(1) We used only the document titles in the experiments, instead of the full content of the documents.
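Since the indexing phase relies on the BM25 weighting of Section II, it can be sketched as follows. This is our own minimal illustration of formulas (3)-(5), not the paper's implementation; the function names and the toy values in the usage note are ours. The defaults $k_1 = 2$ and $b = 0.5$ are the parameter values the paper adopts.

```python
import math


def bm25_weight(tf, df, N, dl, avdl, k1=2.0, b=0.5):
    """BM25 weight of one term in one document (formula (4)).

    tf: term frequency in the document; df: number of documents
    containing the term (n_i); N: collection size; dl and avdl:
    document length and average document length."""
    idf = math.log((N - df + 0.5) / (df + 0.5))  # formula (3)
    return tf / (k1 * ((1 - b) + b * dl / avdl) + tf) * idf


def bm25_score(query_terms, doc_terms, df, N, avdl):
    """Similarity score of a document against a query (formula (5)):
    the sum of the BM25 weights of the query terms; a term repeated
    in the query contributes once per occurrence."""
    dl = len(doc_terms)
    score = 0.0
    for t in query_terms:
        tf = doc_terms.count(t)
        if tf:
            score += bm25_weight(tf, df[t], N, dl, avdl)
    return score
```

For example, on a four-document toy collection, a title containing both query terms receives a positive score while a title containing neither scores zero.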

B. Software and effectiveness measures

The BM25 model, the Blind Feedback method, and the proposed approach were implemented in Python. All experiments were performed on a Sony Vaio workstation with an Intel i3-2330M/2.20GHz processor and 4GB of RAM, running Ubuntu GNU/Linux 12.04. Precision and Mean Average Precision (MAP) were used as measures to evaluate the effectiveness of the systems and to compare the different approaches.

C. Results

In the first stage of testing, we evaluated and compared the results of our suggested approach (EXTER/INTER), which uses both the external and internal distances, against those of BM25 and Blind Feedback, computing the precision after retrieving 5 (P@5), 10 (P@10), and 20 (P@20) documents. The number of words added to the original query, for both our system and the Blind Feedback approach, is equal to the size of the initial query. Figure 1 shows the precision values for the proposed approach, BM25, and Blind Feedback. From Figure 1a, we note an obvious superiority of the suggested EXTER/INTER approach over BM25, and this superiority is even more significant in comparison with Blind Feedback. In Figures 1b and 1c the margin is smaller than that observed in 1a; nevertheless, the precision values of the proposed approach remain the best on the majority of the sub-collections.

Fig. 1: Effectiveness comparison of the EXTER/INTER approach to the BM25 and the Blind Feedback in terms of precision: (a) precision after retrieving 5 documents (P@5); (b) precision after retrieving 10 documents (P@10); (c) precision after retrieving 20 documents (P@20). (Plots of precision against collection size, in thousands of documents.)

In the next phase of experiments, we calculated the percentage change in precision relative to BM25 and Blind Feedback (Table III).

TABLE III: Percentage improvement of the EXTER/INTER approach over the BM25 and the Blind Feedback (BF) in terms of precision.

(a) Precision improvement after retrieving 5 documents.
#docs   50000     100000    150000    200000    250000    300000
BM25    +1.46%    +10.01%   +1.52%    +0.91%    +0.82%    +0.50%
BF      +33.38%   +57.23%   +73.56%   +30.23%   +45.57%   +61.38%

(b) Precision improvement after retrieving 10 documents.
#docs   50000     100000    150000    200000    250000    300000
BM25    +7.78%    +4.80%    +3.08%    +6.15%    +5.11%    +2.45%
BF      +24.14%   +24.80%   +37.45%   +32.15%   +40.59%   +48.68%

(c) Precision improvement after retrieving 20 documents.
#docs   50000     100000    150000    200000    250000    300000
BM25    +15.92%   +7.68%    0.00%     +3.80%    +4.46%    +0.32%
BF      +31.82%   +24.27%   +28.82%   +32.80%   +38.70%   +35.21%

It is clearly seen from Table III that the proposed approach improved the search results, after retrieving 5, 10, and 20 documents, on all the sub-collections compared with BM25 and Blind Feedback. This improvement reaches 15% (see Table IIIc) over BM25 and 73% (see Table IIIa) over Blind Feedback. From Figure 1 and Table III we can conclude that the proposed EXTER/INTER method, compared with the other search techniques, succeeds in improving the ranking of the relevant documents and placing them first. The precision values of the suggested system after retrieving 5 documents show a clear and significant superiority over both BM25 and Blind Feedback, which confirms the effectiveness of the EXTER/INTER approach.
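The two effectiveness measures used in these experiments can be sketched as follows. This is our own minimal implementation of the standard definitions of P@k and MAP, not the paper's evaluation code.

```python
def precision_at_k(ranked, relevant, k):
    """P@k: the fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k


def average_precision(ranked, relevant):
    """Average precision of one query: the mean, over the relevant
    documents, of the precision at each rank where a relevant
    document is retrieved."""
    hits, total = 0, 0.0
    for k, d in enumerate(ranked, start=1):
        if d in relevant:
            hits += 1
            total += hits / k
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP: average precision averaged over a set of queries;
    runs is a list of (ranked_doc_ids, relevant_doc_ids) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

For instance, a ranking [1, 2, 3, 4] with relevant documents {1, 3} gives P@2 = 0.5 and an average precision of (1/1 + 2/3)/2 = 5/6.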

In the final stage of testing, we computed the Mean Average Precision (MAP) score to evaluate the retrieval performance of the proposed system, the BM25, and the Blind Feedback method.

TABLE IV: Mean Average Precision (MAP) results of the EXTER/INTER approach, the BM25 and the Blind Feedback.

#docs            50000    100000   150000   200000   250000   300000
EXTER/INTER      0.2216   0.1777   0.1835   0.169    0.1671   0.1752
BM25             0.2032   0.1768   0.1777   0.164    0.1627   0.1706
Blind Feedback   0.1580   0.1154   0.092    0.1067   0.103    0.1081

Fig. 2: Mean Average Precision (MAP) results of the EXTER/INTER approach, the BM25 and the Blind Feedback. (Plot of MAP against collection size, in thousands of documents.)

Table IV and Figure 2 show a clear advantage of the EXTER/INTER approach over Blind Feedback on all the sub-collections. They also show a slight superiority over the BM25 results.

V. CONCLUSION

In this paper, we proposed an external/internal distance measure for improving retrieval effectiveness based on the co-occurrence and closeness of words. The concept of the external/internal distance of terms is based on finding, for each query term, the locations where it appears and then selecting from these locations the words which often neighbor and co-occur with that query term. The top selected terms are then added to the initial query and the search process is repeated. We thoroughly tested our approach on the OHSUMED test collection. The experimental results show that the proposed EXTER/INTER approach achieves a significant enhancement in effectiveness (up to 15% over BM25 and 73% over Blind Feedback). Even though our method performs quite well, some issues remain to be investigated further. One limitation of this work is the use of a single test collection; another is that the semantic aspect of terms was not exploited to improve search effectiveness.

REFERENCES

[1] G. Salton, The SMART Retrieval System – Experiments in Automatic Document Processing. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1971.
[2] F. Song and W. B. Croft, "A general language model for information retrieval," in Proceedings of the 8th International Conference on Information and Knowledge Management, ser. CIKM'99. New York, NY, USA: ACM, 1999, pp. 316–321.
[3] S. E. Robertson, S. Walker, M. Beaulieu, M. Gatford, and A. Payne, "Okapi at TREC-4," in Proceedings of the 4th Text REtrieval Conference, ser. TREC-4. NIST Special Publication, 1995, pp. 73–96.
[4] G. Amati and C. J. Van Rijsbergen, "Probabilistic models of information retrieval based on measuring the divergence from randomness," ACM Transactions on Information Systems (TOIS), vol. 20, no. 4, pp. 357–389, 2002.
[5] H. Fang and C. Zhai, "An exploration of axiomatic approaches to information retrieval," in Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, ser. SIGIR'05. New York, NY, USA: ACM, 2005, pp. 480–487.
[6] C. J. Van Rijsbergen, Information Retrieval. London, UK: Butterworths, 1979.
[7] P. Ranganathan, "From microprocessors to nanostores: Rethinking data-centric systems," IEEE Computer, vol. 44, no. 1, pp. 39–48, 2011.
[8] Y. Zhu, N. Zhong, and Y. Xiong, "Data explosion, data nature and dataology," in Proceedings of the 2009 International Conference on Brain Informatics, ser. BI'09. Berlin, Heidelberg: Springer-Verlag, 2009, pp. 147–158.
[9] A. Ntoulas, J. Cho, and C. Olston, "What's new on the web?: The evolution of the web from a search engine perspective," in Proceedings of the 13th International Conference on World Wide Web, ser. WWW'04. New York, NY, USA: ACM, 2004, pp. 1–12.
[10] K. Bharat and A. Broder, "A technique for measuring the relative size and overlap of public web search engines," Computer Networks and ISDN Systems, vol. 30, no. 1, pp. 379–388, 1998.
[11] H. E. Williams and J. Zobel, "Searchable words on the web," International Journal on Digital Libraries, vol. 5, no. 2, pp. 99–105, 2005.
[12] J. Eisenstein, B. O'Connor, N. A. Smith, and E. P. Xing, "Mapping the geographical diffusion of new words," in Workshop on Social Network and Social Media Analysis: Methods, Models and Applications, ser. NIPS'12, 2012.
[13] H.-m. Sun, "A study of the features of internet english from the linguistic perspective," Studies in Literature and Language, vol. 1, no. 7, pp. 98–103, 2010.
[14] Q. Chen, M. Li, and M. Zhou, "Improving query spelling correction using web search results," in Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, ser. EMNLP-CoNLL'07. Stroudsburg, PA, USA: ACL, 2007, pp. 181–189.
[15] L. V. Subramaniam, S. Roy, T. A. Faruquie, and S. Negi, "A survey of types of text noise and techniques to handle noisy text," in Proceedings of the Third Workshop on Analytics for Noisy Unstructured Text Data, ser. AND'09. New York, NY, USA: ACM, 2009, pp. 115–122.
[16] F. Ahmad and G. Kondrak, "Learning a spelling error model from search query logs," in Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, ser. HLT'05. Stroudsburg, PA, USA: ACL, 2005, pp. 955–962.
[17] S. E. Robertson and K. S. Jones, "Relevance weighting of search terms," Journal of the American Society for Information Science, vol. 27, no. 3, pp. 129–146, 1976.
[18] S. Robertson and H. Zaragoza, "The probabilistic relevance framework: BM25 and beyond," Foundations and Trends in Information Retrieval, vol. 3, no. 4, pp. 333–389, 2009.
[19] I. Khennak and H. Drias, "Term proximity and data mining techniques for information retrieval systems," in Proceedings of the 2013 World Conference on Information Systems and Technologies, ser. WorldCIST'13. Berlin, Heidelberg: Springer-Verlag, 2013, pp. 477–486.
[20] I. Khennak, "Classification non supervisée floue des termes basée sur la proximité pour les systèmes de recherche d'information," in Proceedings of the 10th French Information Retrieval Conference, ser. CORIA'13. Neuchâtel, Switzerland: Unine, 2013, pp. 341–346.