Journal of Convergence Information Technology Volume 5, Number 2, April 2010

Optimizing Document Similarity Detection in Persian Information Retrieval

Omid Kashefi, Nina Mohseni, Behrouz Minaei
Computer Engineering Department, Iran University of Science and Technology, Tehran, Iran
kashefi@{ieee.org, iust.ac.ir}, [email protected], [email protected]
doi: 10.4156/jcit.vol5.issue2.11

Abstract
Most data on the Web is in the form of text or images. Finding desired data on the Web in a timely and cost-effective way is a problem of wide interest. In the last several years, many search engines have been created to help Web users find desired information. In this paper we present a new technique to eliminate affixes and their effects on recognizing similar Persian documents. Reviewing affix rules and exceptions in the Persian language, we extracted about 300 common inflectional suffixes and their combinations. We evaluate the effectiveness of eliminating affixes from Persian texts on document similarity using four major document similarity approaches: Latent Semantic Indexing, Shingling, Vector Space Model, and Co-occurrence. Evaluation results demonstrate an improvement in the retrieval and detection of similar documents after eliminating affixes.

Keywords
Information Retrieval, Document Similarity, Stemming, Lemmatization, Affix, Suffix, NLP, Persian Language

1. Introduction

One of the major issues in information retrieval (IR) systems since the 1990s has been the sheer growth in the volume of data, as seen in the growth of the web [1, 2]. This growth also raises challenges like duplicate data detection, both from the users' perspective and for IR systems. As an example, for the search query "apache", a popular open-source web server, a web search system may return hundreds of thousands of results containing the term "apache", of which thousands are exact duplicates or default pages [2]. This redundancy does not help users, who must look through tens of result pages for new or different results. Additionally, redundant pages slow down indexing and web crawling and increase search time [2].

For thousands of years people have realized the importance of archiving and finding information. With the advent of computers, it became possible to store large amounts of information, and finding useful information from such collections became a necessity [3]. The field of IR was born in the 1950s out of this necessity, and over the last forty years it has matured considerably [3]. Several IR systems are used on an everyday basis by a wide variety of users. Early IR systems were Boolean systems, which allowed users to specify their information need using a complex combination of Boolean ANDs, ORs, and NOTs [4]. Boolean systems have several shortcomings: there is no inherent notion of document ranking, and it is very hard for a user to form a good search query [4]. Relevance ranking is not central in Boolean systems; documents are usually retrieved ordered by date or some other feature rather than by relevance [4]. Modern IR systems rank documents by estimating their usefulness for a user's query. Most IR systems assign a numeric score to every document and rank documents by this score. Several models have been proposed for this process; the three most widely used in IR research are 1) the vector space model, 2) the probabilistic models, and 3) the inference network model [5]. A common problem of IR in the web environment is that most searches return many duplicate documents [6]. Existing search engines are not always ideal, and older IR techniques such as Boolean retrieval cannot resolve these problems [7, 8]. Finding desired data on the Web in a timely and cost-effective way is a problem of wide interest [9].

Measuring the text similarity of documents has been a research topic since the 1970s [10]. Text similarity techniques for retrieving similar documents include retrieving terms that exactly match the search term. With the amount of information increasing every day, identifying near-redundant information is a key challenge for IR systems. In addition, finding similar documents can be employed to solve various problems, such as ranking related documents and classifying documents.


One of the main challenges in IR systems is variation in word forms. In most cases, morphological variants of words have similar semantic interpretations and can be considered equivalent for the purposes of IR applications. To address this challenge, a stemming process is used to reduce inflected (or sometimes derived) words to their stem, base, or root form. The key terms of a query or document are then represented by stems rather than by the original words. One kind of stemming, which removes affixes from a word in order to obtain its root, is called affix-stripping [11].

In this paper, we focus on optimizing the detection of similar Persian documents by stripping affixes from words, and thereby returning more relevant results. To this end, we study Persian affix combination rules, design a Persian affix-stripper, and investigate the results empirically. Evaluation results indicate an improvement in the recall of detecting similar documents after eliminating affixes from Persian documents, compared to raw, unprocessed documents.

The rest of this paper is organized as follows: Section 2 describes related work. Affixes and their related challenges are reviewed in Section 3. Section 4 presents our approach. Finally, conclusions and future work are presented in Section 5.

2. Related Work

Traditional IR systems, extensively used in digital library systems, have shortcomings that stem from two basic problems. First, IR systems cannot understand the concepts behind a user's query terms; second, they cannot understand the content of documents. In fact, each keyword in a document or a user query may have several meanings, so merely measuring the surface similarity between a document and a user's query cannot produce the best result. Many techniques have been developed to identify similar documents. Some are based on the semantic properties of terms, such as the relationships between them. A very common approach to finding related terms is using thesauri: a thesaurus represents the knowledge of a domain with a collection of terms and a limited set of relations between them. Other methods are word-based and focus on the co-occurrence of terms in texts. The most well-known of these is the vector space model (VSM). In this model, a document is represented as a vector of words, and the frequency of each word in each document is calculated. If words are chosen as terms, then every word in the vocabulary becomes an independent dimension in a very high-dimensional vector space [4, 12, 13]. The similarity between two documents is determined by a similarity measure, such as the cosine similarity [4, 13] between their corresponding vectors (cosine has the nice property that it is 1.0 for identical vectors and 0.0 for orthogonal vectors).

The major disadvantages of this method are the large vector space and the lack of consideration for relationships and order between terms. The method fails to deal with synonymy and polysemy, leading to inaccurate results, and it is computationally inefficient due to sparse document vectors. Another drawback is that texts with similar meanings do not necessarily share many words [5, 12]. Another important word-based method is Latent Semantic Indexing (LSI) [14], which focuses on term co-occurrence but is inefficient in both time and accuracy because it ignores other relations between terms [12, 14]; this results in relations being created between terms that are not real. It also does not take the order of words into account [12, 14].

W-shingling [15, 16] is another document similarity method. A w-shingling is a set of unique "shingles" (contiguous subsequences of tokens in a document) that can be used to gauge the similarity of two documents; the w denotes the number of tokens in each shingle. For a given shingle size, the degree to which two documents A and B resemble each other can be expressed as the ratio of the magnitudes of their shingle-set intersection and union, as shown in Equation 1 [15].

r(A, B) = |S(A) ∩ S(B)| / |S(A) ∪ S(B)|    (Equation 1)
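As a concrete illustration of the two measures discussed above, the following Python sketch (our own example, not the paper's implementation) computes cosine similarity over term-frequency vectors and the shingle resemblance of Equation 1 for whitespace-tokenized texts:

```python
# Illustrative sketch: cosine similarity over term-frequency vectors (VSM)
# and w-shingling resemblance (Equation 1).
import math
from collections import Counter

def cosine_similarity(doc_a: str, doc_b: str) -> float:
    """Cosine of the angle between the term-frequency vectors of two texts."""
    a, b = Counter(doc_a.split()), Counter(doc_b.split())
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def shingles(doc: str, w: int = 3) -> set:
    """Set of contiguous w-token subsequences ("shingles") of a text."""
    tokens = doc.split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}

def resemblance(doc_a: str, doc_b: str, w: int = 3) -> float:
    """Equation 1: |S(A) ∩ S(B)| / |S(A) ∪ S(B)|."""
    sa, sb = shingles(doc_a, w), shingles(doc_b, w)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
```

Note that the shingle measure counts only exact w-token matches, which is why texts that are similar in meaning but not in wording score low under it.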

Due to the complexity of the Persian language compared to languages like English, implementing a Persian IR system requires particular attention. Unfortunately, few efforts have focused on this problem [17]. One of the major document similarity approaches focused on Persian is the method of Zamanifar et al. [12] (hereinafter referred to as "Co-occurrence"). This method has three main steps: 1) topic identification, 2) topic interpretation, and 3) similarity measurement. Topic identification includes text segmentation, removal of trivial and redundant information, and stemming. The main part of the method is topic interpretation, which consists of finding and ranking important words. At the final stage, the degree of similarity between sentences is calculated using a similarity measure. The method is built on lexical chains of important words and the term co-occurrence property of the text. It prevents irrelevant documents from being identified as similar due to the polysemy of words. It also considers the


order of the words in identifying similar documents [12].

F. Oroumchian et al. [18] studied existing IR techniques on Persian texts and concluded that the vector space model with Lnu.ltu weighting over non-stemmed single words performs best for Persian IR. Nayyeri et al. [19] designed a fuzzy retrieval system for Persian (FuFaIR); their experimental results show better performance than the vector space model. Taghva et al. [20] designed a Persian stemmer, having earlier, in 2003, proposed a stemmer based on the morphology of the Persian language. Afterwards, Hessami Fard et al. [21] modified the Krovetz algorithm for Persian stemming, using POS tagging to reduce errors and increase performance. Mokhtaripour et al. [22] presented a statistical stemmer for Persian text, and Usefan et al. [23] studied the challenges of stemming Persian verbs and presented an algorithm for them.

3. Affixes' Role in the Persian Language

Persian script poses unique hardships in the electronic domain that have deleterious effects on the quality of content search [24]. Thus, new arrangements must be envisioned to transform the Persian language from a language of poetry and mysticism into one more suitable for the electronic domain of scientific exchange [24]. In recent decades, most variation in Persian spelling has concerned the separate versus continuous writing of compound words [24]. Persian has a complicated and uncertain morphology in compound word construction and verb stemming [25]. In Persian, in addition to the white space between words, there is another intra-word distance between the letters of a word, known as pseudo-space. For example, »آب سرد کن« is a complete, correct, and meaningful sentence meaning "cool the water", but »آبسردکن« is a single word meaning "water cooler". Using space and pseudo-space in place of each other is widely common, even among educated Persians, and may change the meaning of a word or sentence. Other computational challenges of the Persian language that may cause inconsistent IR results are as follows [24, 25, 26, 27]:

- Combination of affixes with words. Whether an affix attaches to a word closely, separately, or with a pseudo-space is a challenging issue in Persian.
- Intra-affix combination. In Persian, affixes can also combine with each other; for instance, there are more than 2600 valid combinations of inflectional suffixes alone.
- Complex morphophonology of affixes. In some cases affixes, especially suffixes, change depending on the base word. These morphophonetic rules also affect intra-affix combination; they require phonetic attributes of words and so are hard to implement computationally.
- Multiple correct spellings. In Persian, some words have several correct spellings. As an example, »اتاق« and »اطاق« are both spelled correctly, are pronounced the same, and equally mean "room".
- Multiple identical characters. In Unicode, some letters, like »ک«, »ی«, and the pseudo-space, are denoted by more than one character, so users with different keyboard layouts may use different characters for one letter, leading to inconsistency in IR: a user might search for a particular keyword on the web while a document contains that keyword in a different character form.

Affixes are letters that are usually not used independently but can combine with other words, forming new words and new meanings. Affixes are divided into three categories: 1) prefixes, added to the beginning of words; 2) infixes, located between two words; and 3) suffixes, added to the end of words. Persian belongs to the Indo-European languages, in which the importance and effect of affixes on words and meanings are high, and Persian is rich in affixes. In Persian, the total vocabulary can be reduced to a smaller set of words considered as stems, with the rest of the words being derivations of these stems. In other words, each word contains a stem that carries its original idea and meaning; deriving a word from the principal stem shapes the main concept or expresses a syntactic role in a sentence.

In Persian, derivation and word formation are done by combining stem words with prefixes and suffixes. The aim of affix-stripping is to remove these combinations and find the essence of a word. Nevertheless, in reality, affixes sometimes change meanings, so eliminating them can cause the loss of the original meaning; removing affixes therefore involves some confusion [24]. Persian affixes fall into 1) inflectional (declensional) and 2) derivational affixes. Persian declensional affixes inflect nouns, adjectives, adverbs, determiners, and pronouns, but derivational affixes make new words, usually with a different meaning from the
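The "multiple identical characters" problem above is commonly handled by mapping visually identical letters to a single code point before indexing and search. A minimal sketch of such normalization (our own illustration, not the paper's implementation) is:

```python
# Illustrative sketch: normalize visually identical Persian letters to one
# code point so that the same word typed on different keyboard layouts
# compares equal.
CHAR_MAP = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH (ي) -> ARABIC LETTER FARSI YEH (ی)
    "\u0643": "\u06A9",  # ARABIC LETTER KAF (ك) -> ARABIC LETTER KEHEH (ک)
}

def normalize(text: str) -> str:
    """Replace Arabic-layout variants with their Persian counterparts."""
    return "".join(CHAR_MAP.get(ch, ch) for ch in text)
```

With this mapping, two raw strings that differ only in keyboard-layout variants of »ک« or »ی« normalize to the same indexed form; the pseudo-space (ZERO WIDTH NON-JOINER, U+200C) can be handled with a similar mapping policy.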


base word. So eliminating derivational affixes may change the meaning of a word. Persian declensional affixes also have many combinations (more than 2600) with many morphological exceptions, and detecting these different forms is one of the most challenging parts of Persian affix-stripping.

In web search, the parameters that make a search result favorable for a user are as follows [24]:

- Precision. This performance measure is the percentage of retrieved documents that are relevant to the search.
- Recall. Recall measures whether all web pages matching the user's keywords are found; in other words, no desirable page should be missed. Recall and precision may vary based on the language of the page content; especially in languages such as Persian, which has many orthographic challenges [25], inappropriate results are more probable.

4. Our Proposed Method

In this paper, we focus on affix removal to improve the detection of similar documents and so return more relevant results. We compare the similarity scores of four major document similarity methods before and after eliminating affixes from words: 1) LSI, 2) VSM, 3) Shingling, and 4) Co-occurrence. As mentioned before, affixes may cause ambiguity in word meaning and in recognizing the correct stem. Through an exhaustive survey of Persian affixes and their different combinations, we have extracted about 300 common inflectional affixes. After finding affixes with our affix-stripping algorithm, shown in Figure 1, all declensional affixes are removed from the texts and the similarity scores are recalculated with the mentioned methods. To evaluate our proposed affix-stripper, 10 pairs of similar documents were processed by an expert user, who manually eliminated the affixes from the texts, and the results were compared with the affix-stripping method. The numbers of affixes eliminated from the texts by the expert and by the affix-stripping method are shown in Table 1.

Table 1. Our proposed affix-stripper evaluation

ID      Number of Affixes by User   Number of Affixes by Our Affix-Stripper
Doc1    25                          28
Doc2    27                          31
Doc3    16                          15
Doc4    24                          24
Doc5    29                          30
Doc6    26                          27
Doc7    31                          31
Doc8    24                          24
Doc9    19                          20
Doc10   25                          23

As there is no other notable declensional affix-stripper for Persian, we cannot compare our results against one, but the precision and recall scores of our approach are as follows:

Recall = |AllAffixes ∩ RetrievedAffixes| / |AllAffixes| = 243 / 246 ≈ 0.98

Precision = |AllAffixes ∩ RetrievedAffixes| / |RetrievedAffixes| = 243 / 253 ≈ 0.96

Persian Affix-Stripping Algorithm
Procedure AffixStripping(w: Word, Wlist: WordSet): Word
  if w is not in Wlist then
    PersianPatternInfo[] patternlist = PersianAffixRecognizer.MatchForAffix(w)
    foreach PersianPatternInfo pi in patternlist do
      if pi.stem is in Wlist then
        return pi.stem
    endfor
  endif
  return w
Figure 1. Partial pseudocode of our affix-stripper
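The idea in Figure 1 can be sketched in Python as follows. This is our own illustration, not the authors' code: the tiny suffix list and lexicon are hypothetical stand-ins for the paper's ~300 extracted inflectional affixes and its word set, and we substitute simple longest-suffix matching for the paper's pattern recognizer:

```python
# Minimal sketch of the affix-stripping idea: try known suffix patterns
# longest-first and accept a candidate stem only if it appears in the word
# list, otherwise return the word unchanged (as in Figure 1).
SUFFIXES = ["\u200cها", "ها", "تر"]  # hypothetical inflectional suffixes

def affix_strip(word: str, wlist: set) -> str:
    if word in wlist:                      # already a known stem
        return word
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix):
            stem = word[: -len(suffix)]
            if stem in wlist:              # accept only lexicon-backed stems
                return stem
    return word                            # no valid pattern: leave unchanged
```

The lexicon check mirrors the `if pi.stem is in Wlist` guard in Figure 1: it prevents stripping an apparent suffix that is actually part of the stem.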


After eliminating affixes from the texts, we evaluate our optimization claim by recalculating the similarity of the corresponding texts with the LSI, VSM, Shingling, and Co-occurrence similarity measures. Our experiments show that similarity results improve after eliminating affixes from the texts. We use over 50 pairs of human-reviewed similar documents from sports, politics, and scientific news. Table 2 compares the average results of the four methods before and after deletion of affixes on 50 pairs of similar documents; note that the results in Table 2 are normalized similarity scores, where 0 means not similar and 1 means exactly the same.

Table 2. Similarity scores of similar documents before and after removing affixes

Similarity Method   Before Removing Affixes   After Removing Affixes
LSI                 0.700                     0.803
VSM                 0.728                     0.810
Shingling           0.268                     0.312
Co-occurrence       0.781                     0.871

To investigate the accuracy of the proposed method, we also chose 25 pairs of non-similar documents and computed their similarity before and after affix elimination. Table 3 shows the average similarity scores of the non-similar documents.

Table 3. Similarity scores of non-similar documents before and after removing affixes

Similarity Method   Before Removing Affixes   After Removing Affixes
LSI                 0.302                     0.329
VSM                 0.275                     0.301
Shingling           0.00446                   0.00448
Co-occurrence       0.247                     0.285

We also evaluate the precision and recall of detecting similar documents in a set of similar and non-similar documents together. We use a set of 25 similar document pairs, where each document is similar to only one counterpart and the other 600 (24 × 25) possible pairs are not (or may not be) similar. Table 4 shows the recall of detecting similar documents and the effect of our approach.

Table 4. Effect of our approach on the recall of similarity detection

Similarity Method   Before Removing Affixes   After Affix Elimination
LSI                 0.88                      0.92
VSM                 0.88                      0.96
Shingling           0.32                      0.32
Co-occurrence       0.92                      0.96

As shown in Table 4, the recall of all document similarity methods improves considerably after applying our proposed method, meaning that more of the similar documents are identified and retrieved.

Table 5 reflects the effect of our approach on the precision of detecting similar documents.

Table 5. Effect of our approach on the precision of similarity detection

Similarity Method   Before Removing Affixes   After Affix Elimination
LSI                 0.48                      0.44
VSM                 0.53                      0.51
Shingling           0.88                      0.88
Co-occurrence       0.59                      0.57
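The recall and precision figures above come from comparing the set of document pairs each method flags as similar against the set of truly similar pairs. A small sketch of that bookkeeping (our own illustration, with made-up pair IDs) is:

```python
# Illustrative sketch: precision/recall of similar-pair detection, computed
# from the sets of retrieved pairs and ground-truth similar pairs.
def precision_recall(retrieved: set, relevant: set) -> tuple:
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {(1, 1), (2, 2), (3, 3), (4, 4)}    # truly similar pairs
retrieved = {(1, 1), (2, 2), (3, 3), (2, 4)}   # pairs flagged as similar
# precision_recall(retrieved, relevant) -> (0.75, 0.75)
```

In the evaluation above, the ground-truth set contains the 25 matched pairs, while a method's retrieved set contains every pair whose similarity score exceeds its detection threshold.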

In the Shingling method, since the test documents are similar only in concept, not in wording and spelling, the similarity coefficient is very low, so affix elimination shows less improvement for this method. The similarity scores of non-similar documents increase because of the inherent challenges of word-based similarity detection: after affix elimination, a larger share of words become identical. For example, two texts may contain many non-similar words yet both include many occurrences of »آن«, a determiner meaning "that", and »آنها«, where »ها« is a plural affix, meaning "those"; after affix elimination these two words become the same and the similarity score increases.

5. Concluding Remarks

In this paper we presented an approach for removing affixes, and thereby their effects, when detecting similar Persian documents. We considered four major document similarity approaches (VSM, Shingling, LSI, and Co-occurrence) in our evaluation. Our experimental results show that eliminating affixes from texts improves the detection and retrieval of similar documents: the recall of detecting similar documents improved considerably for all four similarity detection methods. The accuracy of the proposed affix-stripper itself was validated by its precision and recall scores; the affixes it detected were very close to the human result, and its error rate was very low.


6. References

[1] W. S. Al halabi, M. Kubat, and M. Tapia, "A Tool to Personalize the Ranking of the Documents Returned by an Internet Search Engine", Journal of Convergence Information Technology, Vol. 2, No. 3, pp. 6-10, 2007.
[2] S. Antonia and H. Leong, "Duplicate data detection", in Proceedings of the 6th International Web Conference, 1997.
[3] A. Singhal, "Modern Information Retrieval: A Brief Overview", Journal of the American Society for Information Science, Vol. 24, pp. 115-139, 2001.
[4] J. Becker and D. Kuropka, "Topic-based Vector Space Model", in Proceedings of the 6th International Conference on Business Information Systems, 2003.
[5] E. M. Voorhees, "Natural Language Processing and Information Retrieval", Lecture Notes in Computer Science, Springer Berlin/Heidelberg, pp. 32-48, 1999.
[6] J. W. Cooper, A. R. Coden, and E. W. Brown, "Detecting similar documents using salient terms", in Proceedings of the 11th International Conference on Information and Knowledge Management, Virginia, USA, 2002.
[7] M. Azadnia, "Presenting an expert system for automatic correcting Persian texts", International Journal of Computer Science and Network Security, Vol. 8, No. 3, pp. 27-31, 2008.
[8] H. Al-Mubaid and P. Chen, "Context-based similar words detection and its application in specialized search engines", in Proceedings of the 10th International Conference on Intelligent User Interfaces, San Diego, California, USA, 2005.
[9] W. Meng, C. Yu, and K. Lup Liu, "Building efficient and effective metasearch engines", ACM Computing Surveys, Vol. 34, pp. 48-89, 2002.
[10] S. Helmer, "Measuring the structural similarity of semistructured documents using entropy", in Proceedings of the 33rd International Conference on Very Large Data Bases, 2007.
[11] M. Sankupellay, "Malay-Language Stemmer", Sunway Academic Journal, Vol. 3, pp. 147-153, 2006.
[12] A. Zamanifar, B. Minaei-Bidgoli, and O. Kashefi, "A New Technique for Detecting Similar Documents based on Term Co-occurrence and Conceptual Property of the Text", in Proceedings of the 3rd International Conference on Digital Information Management, London, England, 2008.
[13] G. Salton, A. Wong, and C. S. Yang, "A Vector Space Model for Automatic Indexing", Communications of the ACM, Vol. 18, No. 11, pp. 613-620, 1975.
[14] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, Vol. 41, pp. 391-407, 1990.
[15] A. Broder, S. Glassman, M. Manasse, and G. Zweig, "Syntactic Clustering of the Web", SRC Technical Note, 1997.
[16] A. Broder, "Identifying and filtering near-duplicate documents", in Proceedings of the 11th Annual Symposium on Combinatorial Pattern Matching, Springer, 2000.
[17] A. Aleahmad, P. Hakimian, F. Mahdikhani, and F. Oroumchian, "N-Gram and Local Context Analysis for Persian Text Retrieval", in Proceedings of the Third Text Retrieval Conference (TREC-3), 2007.
[18] F. Oroumchian and F. Mazhar Garamaleki, "An Evaluation of Retrieval Performance Using Farsi Text", in Proceedings of the First Eurasia Conference on Advances in Information and Communication Technology, Tehran, Iran, 2002.
[19] A. Nayyeri and F. Oroumchian, "FuFaIR: a Fuzzy Farsi Information Retrieval System", in Proceedings of the IEEE International Conference on Computer Systems and Applications, 2006.
[20] K. Taghva, R. Beckley, and M. Sadeh, "A Stemming Algorithm for the Farsi Language", in Proceedings of the International Conference on Information Technology: Coding and Computing, 2005.
[21] R. Hessami Fard and G. Ghasem Sani, "Stemmer Algorithm Design for Persian Language", in Proceedings of the 11th International CSI Computer Conference, Tehran, Iran, 2006.
[22] A. Mokhtaripour and S. Jahanpour, "Introduction to a New Farsi Stemmer", in CIKM, Virginia, USA, 2006.
[23] A. Usefan, S. Salehi, and B. Minaei-Bidgoli, "Stemming Challenges and a Stemming Algorithm for Farsi Verbs", in Proceedings of the First Workshop on Persian Language and Computers, Tehran University, Iran, 1993.
[24] M. Sadiqi and K. Zamanifar, "A method for overcoming Persian web content mining", Iran Documents, Vol. 2, Iran, 2005.
[25] M. Nasri and O. Kashefi, "Defects of Persian Official Writing Rules from Computational View", Technical Report PMF-1-I, IUST, Tehran, 2008.
[26] O. Kashefi, B. Minaei-Bidgoli, and M. Sharifi, "A novel string distance metric for ranking Persian spelling error corrections", Language Resources and Evaluation, 2010.
[27] H. Anvari and H. A. Givi, Persian Language, 3rd ed., Tehran: Fatemi, 2006.