Rich Document Representation for Document Clustering

Azam Jalali
Department of Computer and Electrical Engineering, Faculty of Engineering, University of Tehran
[email protected]

Farhad Oroumchian
University of Wollongong in Dubai, P.O. Box 20183, Dubai, UAE
[email protected]

Abstract

In traditional document clustering models, a document is treated as a bag of words. In this paper we present a new method for generating feature vectors that uses sentence fragments, called logical terms and statements, as captured by the PLIR system. PLIR is a knowledge-based information retrieval system built on the theory of Plausible Reasoning. We conducted a number of experiments on the OHSUMED document collection using K-means clustering with seven different inter-document similarity measures. The experiments indicate that, within our experimental domain (the second phase of a two-stage retrieval system), clustering with richer features such as logical terms or statements tends to outperform the simple bag-of-words approach.

1 Introduction

Document clustering, the process of finding natural groupings in documents, is an important task in information retrieval. The cluster hypothesis states that relevant documents tend to be more similar to each other than to non-relevant documents, and therefore tend to appear in the same clusters. There has been substantial research on how to exploit clustering to improve retrieval results. In most of the previous attempts, the strategy was to build a static clustering of the entire collection and then match the query against the cluster centroids. Many researchers have examined the effectiveness of hierarchic clustering methods and compared it to conventional inverted file search (Croft, 1980). More recently, clustering has been investigated as a methodology for improving the search and browsing of retrieved documents (Tombros, 2002; Cutting et al., 1992).

In cluster-based search, a single cluster is retrieved in response to a query. The documents within the retrieved cluster are not ranked in relation to the query; rather, the whole cluster is retrieved as an entity. Cluster representation refers to the formation of cluster representatives, or centroids, that attempt to summarize the contents of a cluster for the purpose of retrieving it. Incoming queries are matched against the representatives, and the cluster whose representative is most similar to the query is retrieved (Tombros, 2002). Three different types of cluster-based search have been studied in IR: top-down search, bottom-up search, and optimal cluster search.

The cluster-based browsing paradigm groups documents into topically coherent clusters and presents descriptive textual summaries to the user (Cutting et al., 1992). Informed by the summaries, the user may select clusters, forming a sub-collection for interactive examination. The clustering and re-clustering is done on the fly. Here, cluster representation refers to textual or graphical representations of the cluster contents that support the user's judgments about the utility of those contents.

There are many algorithms for automatic clustering, such as partitioning and hierarchical methods, that can be applied to a set of vectors to form clusters. Traditionally, documents are represented as bags of weighted words, where the weights can be calculated from the frequency of the

words appearing in the documents. Normally, the words and documents are assumed to be independent of each other, and the relationships or associations among the words are not exploited. In this paper we describe the use of a document representation method called Rich Document Representation (RDR) for clustering. This method uses single words, phrases, logical terms and statements as they are captured from text by the PLIR system. Based on our experiments, we believe this method provides a better representation, which results in better performance.

The rest of the paper is structured as follows: Section 2 provides details of document vector construction using Plausible Reasoning; Section 3 details the system, environment and procedures under which the experiments were conducted; Section 4 discusses the experimental results; and Section 5 outlines the conclusions and future directions of the work.

2 Document Vector Construction using Rich Document Representation

The theory of plausible reasoning was developed by Collins and Michalski (Collins & Michalski, 1989; Collins & Burstein, 1988) for question-answering situations in which information is incomplete, uncertain or dynamically changing. They collected and organized a wide variety of plausible inferences that humans make when incomplete and inconsistent information is present. These observations led to a descriptive theory of human plausible inference that categorizes plausible inferences in terms of a set of frequently recurring inference patterns and a set of transformations on those patterns.

In (Oroumchian & Oddy, 1996), the authors describe an information retrieval system based on the theory of plausible reasoning called PLIR. That system represented documents by single words, phrases and logical terms extracted from the text. In (Ashori, Oroumchian & Arabi, 2003), the authors describe an information filtering system based on PLIR. We were interested in whether the same document representation could be used for clustering, and whether the representation remains useful without the reasoning.

In (Oroumchian, 1995), the author describes methods for analyzing text, extracting shallow relationships, and converting the relationships one-to-one into logical forms. The author distinguishes seven different relationships between words and phrases, namely ISA, Broader-Narrower (BN), REF (reference), X, Y, AUTH (author) and CITE (citation), and implements algorithms to extract them automatically from documents. X and Y are relationships for which there is enough evidence of existence, but whose semantic meaning we are not able or willing to identify completely. PLIR uses a rule-based, or clue-based, approach to locate these relationships in the text. For example, the sentence fragment "unstructured elicitation techniques such as protocol analysis and interviews" signifies an ISA relationship between "protocol analysis" and "unstructured elicitation techniques", based on the clue words "such as". Table 1 provides examples of sentence fragments, clues and the relationships they imply. A confidence value is associated with the effectiveness of each clue. The system also removes all rare relationships, on the assumption that a real and useful relationship is far more frequent and dominant than noise in the text. The system is also able to cope with 5-10% completely wrong relationships in its knowledge base (KB).

All of the remaining relationships extracted from the text are used to build the KB, but only the REF, AUTH and Y relationships are used in representing the documents. PLIR has methods to convert relationships into logical terms and logical statements of the theory of Plausible Reasoning. PLIR assumes that a document or query is only partially represented by its single words, phrases, logical terms (Y relations) and logical statements; it therefore uses the inferences of the theory of plausible reasoning to transform the query representation into other possible representations that are close enough to a document representation. In this work, we experimented only with PLIR's document representation (called Rich Document Representation) for clustering documents, without its reasoning. We also did not use the weight called Dominance reported in (Oroumchian & Oddy, 1996; Ashori, Oroumchian & Arabi, 2003). We only borrowed the extraction and conversion methods in order to produce a representation consisting of stemmed single words, phrases and sentence fragments (logical terms).

| No. | Sentence fragment | Clue | Relationship |
|-----|-------------------|------|--------------|
| 1 | Automatic Training Systems (ATS) operating in a man-machine dialog mode | Abbreviation | ATS ISA Automatic_Training_Systems |
| 2 | the development of knowledge-based systems in organizations | Proposition | Development Y-rel knowledge-based systems |
| 3 | Query language | Phrase structure | Query X-rel Query_language |
| 4 | Query language | Phrase structure | Query_language BN language |

Table 1: Examples of sentence fragments, clues and the relationships they imply.
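For illustration, the sketch below shows how a single clue of this kind can be operationalized. This is a simplified illustration and not the PLIR implementation; the regular expression, the confidence value of 0.8 and the function name are our own assumptions.

```python
import re

# Illustrative clue pattern: "<general concept> such as <example>[, <example>...] [and <example>]".
# The actual PLIR system uses a richer rule base with a confidence value per clue.
SUCH_AS = re.compile(r"(?P<general>[\w\- ]+?)\s+such as\s+(?P<examples>[\w\- ,]+)")

def extract_isa(fragment, confidence=0.8):
    """Return (specific, 'ISA', general, confidence) tuples found in a sentence fragment."""
    relations = []
    for match in SUCH_AS.finditer(fragment):
        general = match.group("general").strip()
        for example in re.split(r",| and ", match.group("examples")):
            example = example.strip()
            if example:
                relations.append((example, "ISA", general, confidence))
    return relations

# The example fragment from Section 2:
for rel in extract_isa("unstructured elicitation techniques such as protocol analysis and interviews"):
    print(rel)
# ('protocol analysis', 'ISA', 'unstructured elicitation techniques', 0.8)
# ('interviews', 'ISA', 'unstructured elicitation techniques', 0.8)
```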

3 Experimental Details

The context of the experiments is the second stage of a two-stage search and browsing engine. The goal is to provide a browsing tool that helps users find the documents they are interested in faster. The first stage of the system is a general retrieval system such as SMART. The second stage clusters the top n documents returned by the first stage and then uses summarization to create meaningful representations of the clusters. The retrieval process is thus one of search and browse, and we hope it will help users of commercial or web search engines find their documents faster. Here, the objective was to investigate whether Rich Document Representation can be separated from its reasoning and used independently, and whether it is useful for representing documents in the clustering stage of this system. This objective was pursued by representing documents both by single words and by Rich Document Representation, and then comparing the optimal cluster effectiveness and clustering quality obtained with seven different similarity measures.

In these experiments, the OHSUMED medical abstract collection was selected as the test collection. It is a clinically oriented MEDLINE subset consisting of 348,566 references covering 270 medical journals over a five-year period (1987-1991). In this study we used a subset of the OHSUMED collection (Hersh et al., 1994): all documents from 1987, comprising 54,710 documents and 63 queries. We excluded all queries that had fewer than 6 relevant documents among their top 100 ranked documents, which left 28 queries.

Figure 1 depicts the two phases of the cluster-based search system. The SMART version 11 document retrieval system (Salton, 1971) was used to perform the initial retrieval, with the atc.atc weighting scheme. The atc measure normalizes the weights for document length, giving all documents an equal chance of retrieval. After the initial retrieval, the top n ranked documents were used for clustering. Six different values of n were tested: n = 100, 200, 300, 400, 500 and 1000.

Formula (1) shows the atc weighting scheme of the SMART retrieval system. To assign an indexing weight w_ij that reflects the importance of each single term T_j in a document D_i, the following factors are considered:

• within-document term frequency tf_ij, which corresponds to the first letter of the SMART label;
• collection-wide term frequency df_j, which corresponds to the second letter of the SMART label;
• the normalization scheme, which corresponds to the third letter of the SMART label.

Figure 1: The cluster-based search system.

$$w_{ij} = \frac{idf_j \times \left(0.5 + \frac{tf_{ij}}{2 \times \max tf_i}\right)}{\sqrt{\sum_{k=1}^{n}\left[idf_k \times \left(0.5 + \frac{tf_{ik}}{2 \times \max tf_i}\right)\right]^2}} \qquad (1)$$

In Formula (1), idf stands for inverse document frequency and is computed as

$$idf_j = \log \frac{N}{F_j}$$

where N is the number of documents and F_j is the document frequency of term T_j.
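To make Formula (1) concrete, the following is a minimal sketch of the atc weighting written directly from the definitions above. It is an illustration, not SMART's actual code; the function and variable names are ours.

```python
import math

def atc_weights(tf, df, N):
    """SMART atc weights for one document, per Formula (1).

    tf: term -> within-document frequency (tf_ij)
    df: term -> document frequency (F_j)
    N : number of documents in the collection
    """
    max_tf = max(tf.values())
    # augmented term frequency ('a') scaled by inverse document frequency ('t')
    raw = {t: math.log(N / df[t]) * (0.5 + f / (2.0 * max_tf)) for t, f in tf.items()}
    # cosine normalization ('c'): every document gets an equal chance of retrieval
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

# Toy example: a three-term document in a 1000-document collection
print(atc_weights({"query": 3, "language": 1, "clustering": 2},
                  {"query": 120, "language": 400, "clustering": 35}, N=1000))
```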

The default SMART stoplist and stemmer were also used in indexing all the collections and queries.

3.1 The Clustering Algorithm

Partitioning methods of clustering have been used and examined extensively in the context of IR (Hearst & Pedersen, 1996). We employed the k-means clustering algorithm as implemented in the C Clustering Library (de Hoon, Imoto & Miyano, 2003), a library originally developed for clustering gene expression data, and applied the seven distance functions it provides to measure similarity or, conversely, distance. These measures are defined as follows:

1. ‘c’ Pearson correlation. The Pearson correlation coefficient is defined as

$$r = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{\sigma_x}\right)\left(\frac{y_i-\bar{y}}{\sigma_y}\right)$$

in which $\bar{x}$ and $\bar{y}$ are the sample means of x and y respectively, and $\sigma_x$ and $\sigma_y$ are the sample standard deviations of x and y. The Pearson distance is then defined as $d_P = 1 - r$.

2. ‘a’ Absolute value of the Pearson correlation. The distance is defined as $d_a = 1 - |r|$, where r is the Pearson correlation coefficient.

3. ‘u’ Uncentered Pearson correlation (equivalent to the cosine of the angle between two data vectors). The uncentered correlation is defined as

$$r_u = \frac{1}{n}\sum_{i=1}^{n}\left(\frac{x_i}{\sigma_x^{(0)}}\right)\left(\frac{y_i}{\sigma_y^{(0)}}\right)$$

where

$$\sigma_x^{(0)} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}x_i^2}, \qquad \sigma_y^{(0)} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}y_i^2}.$$

This is the same expression as for the regular Pearson correlation coefficient, except that the sample means $\bar{x}$ and $\bar{y}$ are set to zero. The distance corresponding to the uncentered correlation coefficient is defined as $d_u = 1 - r_u$, where $r_u$ is the uncentered correlation.

4. ‘x’ Absolute uncentered Pearson correlation (equivalent to the cosine of the smallest angle between two data vectors). The distance measure is defined using the absolute value of the uncentered correlation, $d_x = 1 - |r_u|$, where $r_u$ is the uncentered correlation coefficient.

5. ‘s’ Spearman’s rank correlation. The Spearman rank correlation is an example of a non-parametric similarity measure. To calculate it, each data value is replaced by its rank when the data in each vector are ordered by value; the Pearson correlation is then calculated between the two rank vectors instead of the data vectors. The Spearman rank distance corresponding to the Spearman rank correlation is $d_s = 1 - |r_s|$, where $r_s$ is the Spearman rank correlation.

6. ‘e’ Euclidean distance. The Euclidean distance is the only true metric among the distance functions available in the C Clustering Library, being the only one satisfying the triangle inequality. It is defined as

$$d = \sum_{i=1}^{n}(x_i - y_i)^2$$

7. ‘h’ Harmonically summed Euclidean distance. The harmonically summed Euclidean distance is a variation of the Euclidean distance in which the terms for the different dimensions are summed inversely (similar to the harmonic mean):

$$d = \left(\frac{1}{n}\sum_{i=1}^{n}\frac{1}{(x_i - y_i)^2}\right)^{-1}$$
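For concreteness, the sketch below restates four of the seven distances directly from the definitions above. It illustrates the formulas only and is not the C Clustering Library's implementation, which selects these measures through the one-character codes.

```python
import numpy as np

def pearson_dist(x, y):
    """'c': d_P = 1 - r, with r the centered Pearson correlation."""
    xc, yc = x - x.mean(), y - y.mean()
    return 1.0 - (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

def uncentered_dist(x, y):
    """'u': d_u = 1 - r_u; r_u is the cosine of the angle between x and y."""
    return 1.0 - (x @ y) / np.sqrt((x @ x) * (y @ y))

def euclidean_dist(x, y):
    """'e': sum of squared coordinate differences, as defined above."""
    return float(((x - y) ** 2).sum())

def harmonic_euclidean_dist(x, y):
    """'h': inverse of the mean of 1/(x_i - y_i)^2 (undefined when some x_i == y_i)."""
    return 1.0 / np.mean(1.0 / (x - y) ** 2)

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 8.0])
for dist in (pearson_dist, uncentered_dist, euclidean_dist, harmonic_euclidean_dist):
    print(dist.__name__, round(dist(x, y), 4))
```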

The characters in front of the distance measures are used in the figures.

3.2 Optimal Cluster Evaluation

In these experiments, the E effectiveness function proposed by Jardine and Van Rijsbergen (1971) is used as the optimality criterion. The measure is given by

$$E = 1 - \frac{(\beta^2 + 1)PR}{\beta^2 P + R}$$

where P and R correspond to the standard definitions of precision and recall (over the set of documents of a specific cluster), and β is a parameter reflecting the relative importance attached to precision and recall. Three values of this parameter are usually used: 1, 0.5 and 2. The first attributes equal importance to precision and recall, the second deems precision twice as important as recall, and the third treats recall as twice as important as precision. The E effectiveness measure and these three values of β are used in the experiments reported in this paper. The optimal cluster for any given query is the cluster that yields the smallest E value for that query. Jardine and Van Rijsbergen named this measure MK1; it is used to measure optimal cluster effectiveness in our experiments.

The main advantage of optimal measures is that they eliminate any bias that may be introduced by external sources. External sources include the choice of a particular cluster-based search strategy that matches queries to clusters, and the ability of a user during a browsing session to choose the cluster that is most relevant to his/her information need (Tombros, 2002).
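As an illustration of how MK1 is computed from the definitions above, consider the sketch below; the cluster memberships and relevance judgments are invented toy data.

```python
def e_measure(precision, recall, beta):
    """Jardine & Van Rijsbergen's E: 1 - ((b^2 + 1) * P * R) / (b^2 * P + R)."""
    if precision == 0 and recall == 0:
        return 1.0  # no relevant documents retrieved: worst possible E
    b2 = beta * beta
    return 1.0 - ((b2 + 1) * precision * recall) / (b2 * precision + recall)

def mk1(clusters, relevant, beta=1.0):
    """MK1 for one query: the smallest E value over all clusters.

    clusters: list of sets of document ids (one set per cluster)
    relevant: set of document ids judged relevant for the query
    """
    best = 1.0
    for cluster in clusters:
        hits = len(cluster & relevant)
        p = hits / len(cluster) if cluster else 0.0
        r = hits / len(relevant)
        best = min(best, e_measure(p, r, beta))
    return best

clusters = [{1, 2, 3, 4}, {5, 6, 7}, {8, 9}]
relevant = {2, 3, 5, 9}
print(mk1(clusters, relevant, beta=2.0))  # E of the optimal cluster, ~0.72
```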

4 Optimal Cluster Evaluation Results

Besides the main objective of comparing the effectiveness of the bag-of-words approach and RDR after the initial run, we wanted to examine the effectiveness of different similarity measures in clustering. Table 2 shows the naming convention used in this project. A run name such as "PCL.S.RD" means that K-means clustering is used with Spearman's rank correlation as the similarity measure and RDR as the document representation.

| Method | Symbol | Use |
|--------|--------|-----|
| Rich Document Representation | RD | Document representation |
| Single words representation | SW | Document representation |
| K-means (k=6) clustering | PCL | Partitional clustering |
| Pearson correlation | C | Similarity measure |
| Absolute value of Pearson correlation | A | Similarity measure |
| Uncentered Pearson correlation | U | Similarity measure |
| Absolute uncentered Pearson correlation | X | Similarity measure |
| Spearman's rank correlation | S | Similarity measure |
| Euclidean distance | E | Similarity measure |
| Harmonically summed Euclidean distance | H | Similarity measure |

Table 2: The naming convention used in the experiments.

Optimal cluster evaluation results on the MK1 measure are reported in Figures 2 through 4. The figures compare MK1 values for the seven clustering methods (K-means with different similarity measures) based on stemmed single words and the RDR representation, for six different numbers of documents (from 100 to 1000) and three different β values (0.5, 1, 2). Since many of the runs performed very similarly, only four of them are depicted in these figures. In general, the PCL.E.RD, PCL.A.RD, PCL.X.RD, PCL.U.RD and PCL.C.RD runs produce very similar results, so only PCL.E.RD is depicted as their representative. Likewise, PCL.E.SW represents the PCL.A.SW, PCL.X.SW, PCL.U.SW and PCL.C.SW runs; PCL.S.RD also represents the PCL.H.RD run; and PCL.H.SW performs very close to PCL.S.SW in the figures below. Figures 5 and 6 compare the average cluster size and the average number of relevant documents for β = 2.

As depicted in these figures, each instance of Rich Document Representation consistently produces better cluster effectiveness than its single-term counterpart. With the exception of Spearman rank correlation (s) and harmonically summed Euclidean distance (h), all the clustering measures for the rich representation perform better than all instances of the single-term experiments. Although the PCL.H.RD and PCL.S.RD runs are better than their single-word counterparts (PCL.H.SW and PCL.S.SW), they are worse than the PCL.E.SW, PCL.A.SW, PCL.X.SW, PCL.U.SW and PCL.C.SW single-word runs.

As the figures show, the MK1 results for n = 1000 are worse than the others, mostly because the number of relevant retrieved documents does not increase as fast as the number of retrieved documents. It seems that the first few hundred documents produce better clustering effectiveness (represented by low MK1 values) than the last several hundred. This suggests that, considering the cost of online clustering, a smaller value of n (between 400 and 500) could be sufficient for clustering purposes without significant degradation of the results. This could also have been

caused by the fact that we only changed the number of documents while keeping the number of clusters fixed at six: as the number of documents grows, each cluster contains more and more documents, and the clustering effectiveness degrades. It also seems that fewer documents (fewer than 200) favor precision-oriented evaluation (β = 0.5) over recall-oriented evaluation (β = 2), and vice versa.

Figure 6 demonstrates that the average number of relevant documents in the optimal clusters is higher for all the rich representation instances than for the single-term instances (with the exception of the s and h measures). The average size of the optimal clusters for the k-means method is equal across the different distance measures and weightings. The average number of relevant documents obtained with the h and s distance measures is worse than with the others. As seen in Figure 6, the average number of relevant documents in the optimal clusters does not always increase in proportion to the number of top-ranked documents clustered. Table 3 reports the mean number of relevant documents per query.

Van Rijsbergen (1979), and also Sneath and Sokal (1973), emphasized that the various association and distance measures are monotone with respect to each other; consequently, a clustering method that depends only on the rank ordering of the resemblance values would give similar results for all such measures. Based on the results of our experiments, as depicted in Figures 2 through 6, this observation does not hold in our situation, where documents and clusters are limited (at most 1000 documents and 6 clusters). We observed that the results of k-means document clustering with the seven distance functions were different. Spearman rank correlation (s) and harmonically summed Euclidean distance (h) produced results that are worse than the other distance measures for both document representations.

| Top-n | Mean relevant documents per query |
|-------|-----------------------------------|
| 100 | 13.58 |
| 200 | 16.35 |
| 300 | 16.5 |
| 400 | 16.64 |
| 500 | 16.64 |
| 1000 | 16.71 |

Table 3: Mean relevant documents per query for different numbers of best-ranked documents.

Figure 2: MK1 with β = 0.5.

Figure 3: MK1 with β = 1.

Figure 4: MK1 with β = 2.

Figure 5: Average size of optimal clusters with β = 2.

Figure 6: Average number of relevant documents in optimal clusters with β = 2.

Table 4 summarizes the mean difference in MK1 values between the single-term representation and Rich Document Representation. The table shows that Rich Document Representation consistently produces better cluster effectiveness than its single-term counterpart for all seven distance measures.

| Distance | Mean difference in MK1, β = 0.5 | Mean difference in MK1, β = 1 | Mean difference in MK1, β = 2 |
|----------|------|------|------|
| a | 0.185 | 0.12 | 0.109 |
| e | 0.16 | 0.099 | 0.110 |
| c | 0.12 | 0.114 | 0.111 |
| x | 0.151 | 0.109 | 0.109 |
| u | 0.125 | 0.14 | 0.109 |
| s | 0.12 | 0.102 | 0.112 |
| h | 0.101 | 0.108 | 0.112 |

Table 4: Mean difference in MK1 values between the single-term representation and Rich Document Representation, for all seven distance measures and different values of β.

5 Future Work and Conclusions

We have experimented with a method called Rich Document Representation for constructing document vectors from single words, phrases and logical terms, as defined in the theory of plausible reasoning and the PLIR system. The experiments demonstrate that the quality and effectiveness of clustering with this method is better than with the usual bag-of-words representation (single words only). Currently we are conducting more experiments using this method with hierarchical clustering, and with the PLIR certainty weight as the clustering similarity measure. In the future, more experiments will be conducted with other collections and different clustering methods. Another set of experiments will combine the best clustering and document representation methods with different summarization techniques, to find the best way of integrating the two. In such a scenario, after clustering, the cluster summaries would be presented to the user, enabling him/her to choose the cluster closest to his/her need. After a cluster is selected, the documents in that cluster would be clustered again and their cluster summaries presented again, and the process would continue until the user chooses to view the actual documents of a cluster.

References

Collins, A. and Michalski, R. (1989). The logic of plausible reasoning: a core theory. Cognitive Science, 13, 1-49.

Collins, A. and Burstein, M. H. (1988). Modeling a theory of human plausible reasoning. Artificial Intelligence III.

Tombros, A. (2002). The Effectiveness of Query-Based Hierarchic Clustering of Documents for Information Retrieval. Ph.D. thesis, Department of Computing Science, University of Glasgow.

Tombros, A., Villa, R. and Van Rijsbergen, C. J. (2002). The effectiveness of query-specific hierarchic clustering in information retrieval. Information Processing and Management, 38, 559-582.

Buckley, C., Mitra, M., Walz, J. and Cardie, C. (2000). Using clustering and super-concepts within SMART: TREC 6. Information Processing & Management, 36(1), 109-131.

Croft, W. B. (1980). A model of cluster searching based on classification. Information Systems, 5, 189-195.

Cutting, D. R., Karger, D. R., Pedersen, J. O. and Tukey, J. W. (1992). Scatter/Gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th Annual ACM SIGIR Conference, Copenhagen, Denmark, pp. 126-135.

Ellis, D., Furner-Hines, J. and Willett, P. (1993). Measuring the degree of similarity between objects in text retrieval systems. Perspectives in Information Management, 3(2), 128-149.

El-Hamdouchi, A. and Willett, P. (1987). Techniques for the measurement of clustering tendency in document retrieval systems. Journal of Information Science, 13, 361-365.

Oroumchian, F., Arabi, B. and Ashori, E. (2002). Using plausible inferences and Dempster-Shafer theory of evidence for adaptive information filtering. In 4th International Conference on Recent Advances in Soft Computing (RASC 2002), Nottingham, United Kingdom, Dec 12-13.

Oroumchian, F., Arabi, B. and Ashori, E. (2002). An application of plausible reasoning and Dempster-Shafer theory in information retrieval. In International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2002), IEEE Neural Network Society, Singapore, Nov 18-22.

Oroumchian, F. and Oddy, R. N. (1996). An application of plausible reasoning to information retrieval. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland, August 18-22.

Oroumchian, F. (1995). An Application of Plausible Reasoning to Information Retrieval. Ph.D. thesis, Syracuse University.

Ashori, E., Oroumchian, F. and Arabi, B. (2003). Improving the ranking of the PLIR system by local and global approaches. WSEAS Transactions on Systems, 2(3), July 2003.

Gordon, A. D. (1987). A review of hierarchical classification. Journal of the Royal Statistical Society, Series A, 150(2), 119-137.

Hearst, M. A. and Pedersen, J. O. (1996). Re-examining the cluster hypothesis: Scatter/Gather on retrieval results. In Proceedings of the 19th Annual ACM SIGIR Conference, Zurich, Switzerland, pp. 76-84.

Hersh, W. R., Buckley, C., Leone, T. J. and Hickam, D. H. (1994). OHSUMED: an interactive retrieval evaluation and new large test collection for research. In Proceedings of the 17th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '94), pp. 192-201.

Jardine, N. and Van Rijsbergen, C. J. (1971). The use of hierarchical clustering in information retrieval. Information Storage and Retrieval, 7, 217-240.

Dontas, K. and Zemankova, M. (1988). APPLAUSE: an implementation of the Collins-Michalski theory of plausible reasoning. In Proceedings of the 3rd International Symposium on Methodologies for Intelligent Systems, Torino, Italy.

de Hoon, M., Imoto, S. and Miyano, S. (2003). The C Clustering Library. The University of Tokyo, Institute of Medical Science, Human Genome Center.

Zamir, O. E. (1999). Clustering Web Documents: A Phrase-Based Method for Grouping Search Engine Results. Ph.D. thesis, University of Washington.

Salton, G. (1971). The SMART Retrieval System: Experiments in Automatic Document Processing. Englewood Cliffs, NJ: Prentice-Hall.

Sneath, P. H. A. and Sokal, R. R. (1973). Numerical Taxonomy: The Principles and Practice of Numerical Classification. San Francisco: W. H. Freeman.

Ruger, S. and Gauch, S. (2000). Feature reduction for document clustering and classification. Technical Report DTR 2000/8, Department of Computing, Imperial College, London, England.

Van Rijsbergen, C. J. (1974). Further experiments with hierarchic clustering in document retrieval. Information Storage and Retrieval, 10, 1-14.

Van Rijsbergen, C. J. and Croft, W. B. (1975). Document clustering: an evaluation of some experiments with the Cranfield 1400 collection. Information Processing & Management, 11, 171-182.

Van Rijsbergen, C. J. (1979). Information Retrieval, 2nd edition. London: Butterworths.