Automatic Thesaurus Generation using Co-occurrence

Rogier Brussee and Christian Wartena

Telematica Instituut, P.O. Box 589, 7500 AN Enschede, The Netherlands

Abstract

This paper proposes a characterization of useful thesaurus terms by the informativity of co-occurrence with such a term. Given a corpus of documents, informativity is formalized as the information gain of the weighted average term distribution of all documents containing that term. While the resulting algorithm for thesaurus generation is unsupervised, we find that high informativity terms correspond to large and coherent subsets of documents. We evaluate our method on a set of Dutch Wikipedia articles by comparing high informativity terms with keywords for the Wikipedia category of the articles.

1 Introduction

We consider the problem of generating a thesaurus for a given collection using statistical methods. This problem is related to, but different from, that of assigning keywords to a text from a list of keywords, and that of finding the most characteristic terms for a given subset of the corpus. Our approach is to produce a list of terms which are the most informative for understanding the collection as a whole.

Part of the attraction of the current approach is that it proposes a statistical model to formalize the notion of a (good) keyword. If we believe in the assumptions leading to this model, the high level algorithms to generate a thesaurus are almost forced upon us. Our main assumption is that co-occurrence of terms is a proxy for their meaning [11, 14, 9]. To use this information, we compute for each term the distribution of all co-occurring terms. We can then use this co-occurring term distribution as a proxy for the meaning of the term in the context of the collection and compare it with the term distribution of a single document. We assume that a document is semantically related to a term if the term distribution of the document is similar to the term's distribution of co-occurring terms. Fortunately, there is a natural similarity measure for probability distributions, the relative entropy or Kullback-Leibler divergence. If we follow this formalization through, there is an obvious strategy for generating a thesaurus as the set of terms which give the greatest overall information gain, defined as a difference of Kullback-Leibler divergences. In practice this model is a slight oversimplification, e.g. because the same subject can be characterized by different terms. We will discuss this in section 5.

The organization of this paper is as follows. In section 2 we discuss some related work. In section 3 we introduce the different probability distributions and the information theoretic notion of Kullback-Leibler divergence that will be the basis for the rest of the paper. We use these in section 4 to give various definitions of information gain that can be used to rank keywords. In section 5 we evaluate our notions on a corpus of the Dutch Wikipedia.

2 Related work

The problem of finding a set of thesaurus terms to describe the documents of a collection is closely related to the problem of determining term weights for indexing documents. In particular it is related to approaches in which the weight of a term for a document is computed from its frequency in the document and its importance in the collection as a whole. One of the most elegant approaches is the term discrimination model of Salton (see [15, 18], and references cited there). Given a vector space model in which each dimension represents one term, each document can be represented as a vector. The discrimination value of a term is the change of the average distance between all documents if the corresponding dimension is deleted from the model. Of course, various distance measures can be used to compute the distance between documents in this model. Conceptually, this approach is somewhat similar to ours: deleting a term with a high discrimination value causes a higher density, or lower entropy. However, we do not look at the effect of deleting a term, but at the compression that can be obtained by knowing that a term is contained in a document. Thus co-occurrence of terms plays a crucial role in our approach, while it is not taken into account in the computation of the discrimination value. Another approach to the problem of finding a set of thesaurus terms is pursued by [3]. They start by clustering documents (based on distances in a term vector space model) and then try to find discriminative terms for the clusters.

The methods used in this paper are somewhat similar to latent semantic analysis [5, 9], since both have co-occurrence matrices of terms and documents as their starting point. While the current approach also naturally leads to combinations of keywords, it has the major advantage of leading only to probability distributions over keywords (i.e. weighted sums of keywords with total weight 1) rather than positive and negative combinations. It is also based on a conceptually similar, but technically different, notion of proximity between keywords that has a more direct information theoretic interpretation. Unfortunately, like the singular value decomposition needed for latent semantic analysis, it is computationally heavy. Our method is even closer to, and influenced by, the probabilistic latent semantic analysis (PLSA) of [6, 7]. Like PLSA, our method is based on maximizing Kullback-Leibler divergence and maximum likelihood. However, unlike Hofmann's work, we try to find probability distributions over terms rather than some assumed underlying abstract aspects of the text that "explain" the observed distribution of terms in documents.

A lot of work has also been done on the assignment of keywords to individual documents or to subsets of a corpus of documents. As examples of this field of research we refer to [1, 8] and to [17], who focus on finding keywords in different sections of biomedical articles. There is a vast literature on the information theoretic background of this paper. We used [2] as our main information theoretic reference, which we found to be a quite readable introduction.

3 Term and document distributions

We simplify a document to a bag of words or terms¹. Once we accept this model, the number of occurrences of all the different terms in all the different documents is the only information we have left. Thus, consider a set of $n$ term occurrences $W$, each being an instance of a term $t$ in $T = \{t_1, \ldots, t_m\}$, and each occurring in a source document $d$ in a collection $C = \{d_1, \ldots, d_M\}$. Let $n(d, t)$ be the number of occurrences of term $t$ in $d$, let $n(t) = \sum_d n(d, t)$ be the number of occurrences of term $t$, and let $N(d) = \sum_t n(d, t)$ be the number of term occurrences in $d$.

¹ In the experiments discussed in section 5 we will do some preprocessing, like stemming and multiword detection. Thus we use terms rather than words.
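To make the counting concrete, the following minimal Python sketch (not part of the paper; the toy corpus and all variable names are hypothetical) builds $n(d,t)$, $n(t)$, $N(d)$ and $n$ from already preprocessed documents:

```python
# Hypothetical sketch: building the counts n(d, t), n(t) and N(d) of section 3
# from a toy corpus of preprocessed (lemmatized, filtered) documents.
from collections import Counter, defaultdict

docs = {
    "doc1": ["raket", "lancering", "raket", "aarde"],
    "doc2": ["gebouw", "kerk", "gebouw", "aarde"],
}

counts = {d: Counter(tokens) for d, tokens in docs.items()}  # n(d, t)
N_d = {d: sum(c.values()) for d, c in counts.items()}        # N(d)
n_t = defaultdict(int)                                       # n(t)
for c in counts.values():
    for t, k in c.items():
        n_t[t] += k
n = sum(N_d.values())                                        # total number of term occurrences
```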

3.1 Basic Distributions

We define a probability distribution $Q$ on $C \times T$, a distribution $Q$ on $C$ and a distribution $q$ on $T$ that measure the probability to randomly select a term occurrence, the corresponding source document, or the corresponding term:

$$Q(d, t) = n(d, t)/n \qquad \text{on } C \times T$$
$$Q(d) = N(d)/n \qquad \text{on } C$$
$$q(t) = n(t)/n \qquad \text{on } T$$

These distributions are the baseline probability distributions for everything that we will do in the remainder, and we will use them in favor of the simple counting measure when determining the "size" of subsets of $C$ and $W$. In addition we have two important conditional probability distributions

$$Q(d|t) = Q_t(d) = n(d, t)/n(t) \qquad \text{on } C$$
$$q(t|d) = q_d(t) = n(d, t)/N(d) \qquad \text{on } T$$

We use the notation $Q(d|t)$ for the source distribution of $t$, as it is the probability that a random occurrence of term $t$ has source $d$. Similarly, $q(t|d)$, the term distribution of $d$, is the probability that a random term occurrence in document $d$ is an instance of term $t$. Other probability distributions on $C \times T$, $C$ and $T$ will be denoted by $\mathsf{P}$, $P$ and $p$ with various sub- and superscripts.
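Continuing the hypothetical sketch above (an illustration rather than the authors' implementation), these distributions can be derived directly from the counts:

```python
# Continuing the hypothetical sketch: the distributions of section 3.1.
from collections import defaultdict

Q_dt = {(d, t): k / n for d, c in counts.items() for t, k in c.items()}  # Q(d, t)
Q_d = {d: N_d[d] / n for d in counts}                                    # Q(d)
q_t = {t: n_t[t] / n for t in n_t}                                       # q(t)

# Conditional distributions: the term distribution q(t|d) of a document
# and the source distribution Q(d|t) of a term.
q_t_given_d = {d: {t: k / N_d[d] for t, k in c.items()} for d, c in counts.items()}
Q_d_given_t = defaultdict(dict)
for d, c in counts.items():
    for t, k in c.items():
        Q_d_given_t[t][d] = k / n_t[t]
```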

3.2 Distribution of Co-occurring Terms

The setup in the previous section allows us to set up a Markov chain on the set of documents and terms, which will allow us to propagate probability distributions from terms to documents and vice versa. Consider a Markov chain on $T \cup C$ having only transitions $T \to C$ with transition probabilities $Q(d|t)$ and transitions $C \to T$ with transition probabilities $q(t|d)$. Given a term distribution $p(t)$ we compute the one step Markov chain evolution. This gives us a document distribution $P_p(d)$, the probability to find a term occurrence in a particular document given that the term distribution of the occurrences is $p$:

$$P_p(d) = \sum_t Q(d|t)\, p(t).$$

Likewise, given a document distribution $P(d)$, the one step Markov chain evolution is the term distribution $p_P(t) = \sum_d q(t|d)\, P(d)$. Since $P(d)$ gives the probability to find a term occurrence in document $d$, $p_P$ is the $P$-weighted average of the term distributions of the documents. Combining these, i.e. running the Markov chain twice, every term distribution gives rise to a new distribution

$$\bar{p}(t) = p_{P_p}(t) = \sum_d q(t|d)\, P_p(d) = \sum_{t', d} q(t|d)\, Q(d|t')\, p(t').$$

In particular, starting from the degenerate "known to be $z$" term distribution $p_z(t) = p(t|z) = \delta_{tz}$ (1 if $t = z$ and 0 otherwise), we get the distribution of co-occurring terms $\bar{p}_z$:

$$\bar{p}_z(t) = \sum_{d, t'} q(t|d)\, Q(d|t')\, p_z(t') = \sum_d q(t|d)\, Q(d|z).$$

This distribution is the weighted average of the term distributions of the documents containing $z$, where the weight $Q(d|z)$ is the probability that a random occurrence of $z$ has source $d$. Note that the probability measure $\bar{p}_z$ is very similar to the setup in [10, section 3]. However, we keep track of the density of a keyword in a document rather than just the mere occurrence or non-occurrence of a keyword in a document. This difference is particularly relevant for a long document in which a term occurs with low density, because in a purely occurrence-based setup such a document makes a relatively high contribution to the mean word distribution. Unfortunately, $\bar{p}_z$ is expensive to compute.
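The formula for $\bar{p}_z$ translates directly into code. The sketch below (an illustration under the same hypothetical data layout as before, not the authors' implementation) avoids the full double sum by only visiting documents that actually contain $z$:

```python
from collections import defaultdict

def cooccurring_term_distribution(z, counts, n_t, N_d):
    """Sketch of p_bar_z(t) = sum_d q(t|d) Q(d|z): the Q(d|z)-weighted average
    of the term distributions of the documents containing z."""
    p_bar = defaultdict(float)
    for d, c in counts.items():
        if z not in c:
            continue  # Q(d|z) = 0, so d contributes nothing
        weight = c[z] / n_t[z]  # Q(d|z)
        for t, k in c.items():
            p_bar[t] += weight * k / N_d[d]  # Q(d|z) * q(t|d)
    return dict(p_bar)
```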

4 Informativeness of Keywords

Intuitively, keywords make it easier to remember the content of a document given some knowledge of similar documents. Formalizing this intuition, we base a criterion for the informativity of keywords on the Kullback-Leibler divergence (KL-divergence) between the mean distribution of terms co-occurring with a given keyword and the term distribution of the whole collection. For the convenience of the reader, we recall the definition, basic properties and interpretation of the KL-divergence [2, sec. 2.3]. Given a finite set of symbols $X = \{x_1, x_2, \ldots, x_n\}$ with two probability distributions $p(x_i) = p_i$ and $q(x_i) = q_i$, the KL-divergence is defined as

$$D(p\|q) = \sum_{i=1}^n p_i \log\left(\frac{p_i}{q_i}\right).$$

It is easy to show that $D(p\|q) \geq 0$ with equality iff $p_i = q_i$ for all $i$. The KL-divergence has the following standard interpretation. In an optimal compression scheme, a stream of symbols $(x_{i_1}, x_{i_2}, \ldots)$ (with $1 \leq i_k \leq n$) over the alphabet $X$ distributed according to a probability distribution $p$ uses at least $\log(p_i^{-1})$ bits for the symbol $x_i$. Thus with optimal compression we need on average $\sum_{i=1}^n p_i \log(p_i^{-1})$ bits per symbol. If the stream is compressed using a scheme adapted to a distribution $q$ instead, we use on average $\sum_{i=1}^n p_i \log(q_i^{-1})$ bits per symbol. Therefore, the KL-divergence is the average number of bits saved per symbol by using the actual probability distribution $p$ of the stream rather than some a priori distribution $q$.
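For completeness, a small helper computing the KL-divergence in bits, assuming distributions are represented as dictionaries mapping symbols to probabilities (a sketch, not taken from the paper):

```python
import math

def kl_divergence(p, q):
    """D(p||q) in bits; returns infinity if p puts mass on a symbol where q has none."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return float("inf")
        total += px * math.log2(px / qx)
    return total
```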

4.1 Information Gain for a Document

Given a collection of documents and a term, there is a subcollection for which the term is relevant. To determine this subcollection we will define an information theoretic measure, namely the average number of bits saved by using a specialized compression scheme for the subcollection compared to using a single scheme for the whole collection. The net information gain of using a term distribution $p$ for a document $d$ is defined as

$$IG(d, p) = D(q_d \| q) - D(q_d \| p) = \sum_{t:\, q_d(t) \neq 0} q_d(t) \log\left(\frac{p(t)}{q(t)}\right)$$

It measures the gain or loss of compression obtained from using $p$ compared to the corpus term distribution $q$. We will call this the specific net information gain of using $p$ to emphasize that this is a gain per word. Clearly $IG(d, p) \leq D(q_d \| q)$, and the unique maximum is attained for $p = q_d$. It is possible that $IG(d, p) = -\infty$, but $IG(d, p)$ is finite if and only if $p(t) = 0$ implies $q_d(t) = 0$, since $q(t) = 0$ certainly implies $q_d(t) = 0$.
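A direct way to compute $IG(d, p)$, continuing the hypothetical sketches above (the document is represented by its term distribution $q_d$, the corpus by $q$):

```python
import math

def information_gain(q_d, p, q):
    """Sketch of IG(d, p) = D(q_d||q) - D(q_d||p) = sum_t q_d(t) log(p(t)/q(t)),
    in bits. Returns -infinity when p(t) = 0 for some t with q_d(t) > 0."""
    gain = 0.0
    for t, w in q_d.items():
        if w == 0.0:
            continue
        pt = p.get(t, 0.0)
        if pt == 0.0:
            return float("-inf")
        gain += w * math.log2(pt / q[t])  # q(t) > 0 whenever q_d(t) > 0
    return gain
```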

4.2 Information Gain for a Subcollection

For a subcollection of documents $D \subset C$, we define the fractional net specific information gain by weighting the specific information gain with the size of the documents relative to the collection:

$$IG(D, p) = \sum_{d \in D} Q(d)\, IG(d, p) = \sum_{d \in D,\, t \in T} Q(d)\, q_d(t) \log\left(p(t)/q(t)\right)$$

Defining the fraction of term occurrences in $D$, $\beta_D = \sum_{d \in D} Q(d)$, and the average term distribution of the documents in the subcollection $D$, $p_D = \beta_D^{-1} \sum_{d \in D} Q(d)\, q_d$, we rewrite $IG(D, p)$ as

$$IG(D, p) = \beta_D \sum_{t \in T} p_D(t) \log\left(p(t)/q(t)\right) = \beta_D \left( D(p_D \| q) - D(p_D \| p) \right)$$

Typically, we let the term distribution $p$ depend on the chosen subcollection $D$. The most obvious choice for $p$ is $p = p_D$. Clearly $p = p_D$ maximizes $IG(D, p)$, and for the fractional specific gain of the subcollection we have

$$IG(D) = IG(D, p_D) = \beta_D\, D(p_D \| q) \geq 0. \qquad (1)$$
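Equation (1) can be evaluated with the same hypothetical building blocks; the sketch below reuses the kl_divergence helper from above and is again only an illustration:

```python
from collections import defaultdict

def subcollection_gain(D, Q_d, q_t_given_d, q_t):
    """Sketch of IG(D) = beta_D * D(p_D||q): beta_D is the fraction of term
    occurrences in D, p_D the Q(d)-weighted average term distribution of D."""
    beta_D = sum(Q_d[d] for d in D)
    p_D = defaultdict(float)
    for d in D:
        for t, w in q_t_given_d[d].items():
            p_D[t] += Q_d[d] * w / beta_D
    return beta_D * kl_divergence(p_D, q_t)
```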

4.3 Information Gain for a Keyword

Consider the subset $C_z$ of documents benefiting from compression against the average distribution $\bar{p}_z$ of terms co-occurring with $z$, i.e. $C_z = \{d \in C \mid IG(d, \bar{p}_z) > 0\}$. We then have a natural measure for the informativeness of a key term. We define $IG^+_z$, the fractional net specific information gain of a key term $z$, as the non-negative number

$$IG^+_z = IG(C_z, \bar{p}_z) = \sum_{d \in C_z} Q(d)\, IG(d, \bar{p}_z).$$

Note that a document $d$ might be in $C_z$ even if there is no occurrence of $z$ in $d$. It is also possible that $IG(C_z) > IG^+_z$, because it need not be the case that $\bar{p}_z = p_{C_z}$. In fact, $Q(d|z)$ takes into account the fraction of the occurrences of $z$ that appear in a document rather than the mere occurrence or non-occurrence of $z$, and thus can be seen as a "weighted" version of the collection of documents containing $z$.
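Putting the pieces together, $IG^+_z$ can be sketched as follows (again an illustration that reuses the hypothetical helpers from the previous sections, not the authors' code):

```python
def keyword_information_gain(z, counts, Q_d, q_t_given_d, q_t, n_t, N_d):
    """Sketch of IG+_z = sum over d in C_z of Q(d) * IG(d, p_bar_z), where C_z
    is the set of documents with positive information gain against p_bar_z."""
    p_bar_z = cooccurring_term_distribution(z, counts, n_t, N_d)
    gain = 0.0
    for d in counts:
        ig_d = information_gain(q_t_given_d[d], p_bar_z, q_t)
        if ig_d > 0.0:  # d is in C_z; note that d need not contain z itself
            gain += Q_d[d] * ig_d
    return gain
```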

5 Evaluation

We have implemented a number of techniques to extract a thesaurus from a collection of texts. All tested algorithms give a ranking of all detected terms. To test and compare the different strategies we compiled a small corpus of Dutch Wikipedia articles consisting of 758 documents. In the analysis phase, 118099 term occurrences and 26373 unique terms were found. The articles were taken from 8 Wikipedia categories: spaceflight, painting, architecture, trees, monocots, aviation, pop music and charadriiformes. Categories were selected for subjective similarity, like spaceflight and aviation, and subjective dissimilarity, like pop music and monocots. Articles are equally distributed over the categories, but articles in some categories are significantly longer than in others. Moreover, homogeneity and specificity of articles differ significantly between categories. This is clearly reflected in the information gain that is obtained by splitting the collection into the articles from one category and those from the remaining categories and using formula (1) from section 4.2. The results are given in Table 1. One reason for the high score of the category pop music is the fact that the Dutch articles on this subject contain many lists of English song titles, whereas in the other categories English words are rather rare.

category           information gain
spaceflight        0.31
architecture       0.24
monocots           0.19
pop music          0.47
painting           0.40
trees              0.31
aviation           0.38
charadriiformes    0.20

Table 1: Information gain for all document categories

5.1 Preprocessing

Before extracting thesaurus terms we do some preprocessing using the GATE framework [4]. The main analysis steps are lemmatization, multiword lookup and named entity recognition. Lemmatization and tagging are done using the TreeTagger [16]. Tagging allows us to distinguish content words from function words; we only use content words. After lemmatization all inflected forms of verbs, nouns and adjectives are mapped onto their lexical form, substantially reducing the variation in the input data. Since adjectives are seldom used as keywords, we have filtered out the adjectives from the result lists in all experiments, even though adjectives are bearers of relevant semantic information, e.g. in a phrase like chemical process. For multiword lookup we used article titles from the Dutch Wikipedia, since it is reasonable to assume that each title represents a single concept. Finally, some simple rules are applied to identify proper names. While some of the components are language dependent, all of them are available for a number of languages within the GATE framework.

As an additional filter we require that a potential thesaurus term $z$ be specific enough. It turned out that there are some highly frequent words that have a high $IG^+_z$ but that cannot be correlated to any subject. We therefore require that $\bar{p}_z$ be different enough from the background distribution $q$, as measured by the Kullback-Leibler divergence. We used a cutoff $D(\bar{p}_z \| q) > 1$ bit, which turned out to give decent results.
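The specificity filter can be expressed with the same hypothetical helpers (a sketch; the 1 bit threshold is the cutoff mentioned above):

```python
def specific_enough(z, counts, n_t, N_d, q_t, threshold_bits=1.0):
    """Keep a candidate term z only if its co-occurrence distribution differs
    enough from the background distribution q, measured in bits."""
    p_bar_z = cooccurring_term_distribution(z, counts, n_t, N_d)
    return kl_divergence(p_bar_z, q_t) > threshold_bits
```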

5.2 Thesaurus Construction

The resulting list of terms is now ranked based on the value of $IG^+$. In fact we obtained slightly better results by modifying $IG^+$ slightly and using a naively smoothed version of the co-occurrence distribution, $\bar{p}^{(sm)}_t = 0.95\, \bar{p}_t + 0.05\, q$. Smoothing also simplified software development, as the KL-divergence of a distribution with respect to a smoothed distribution is automatically finite.

Since we were not aware of other research on thesaurus extraction without prior knowledge of categories, we compared the $IG^+$ ranking with the ranking given by the discrimination values for term weighting from [15] and some simpler, more naive rankings. For the computation of discrimination values (DV) we have implemented the algorithm from [18] using cosine dissimilarity. For our corpus we found that this method strongly favors terms specific to one or a few documents, especially large ones. This method might be suited to finding discriminating terms for a specific document, but seems less suited for identifying useful global keywords. The first "naive" method we evaluated is to take the most frequent words. This approach gives reasonable results, probably because we only consider nouns, verbs (without auxiliaries) and proper names. The last method is to compute the average tf.idf value over all documents for each term. We use the following formula to compute tf.idf for a corpus $C$, a term $t$ and a document $d \in C$:

$$tf.idf(d, t) = (1 + \log n(d, t)) \cdot M / df(t)$$

where $M$ is the number of documents in $C$ and $df(t) = \sum_d m(t, d)$ with $m(t, d) = 1$ if $n(d, t) > 0$ and $m(t, d) = 0$ otherwise (see e.g. [13]).
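The smoothing step and the average tf.idf baseline described above could be sketched as follows (illustrative only; the smoothing constant 0.95 and the tf.idf variant follow the formulas in the text, all function names are our own):

```python
import math
from collections import defaultdict

def smooth(p_bar, q, alpha=0.95):
    """Sketch of the naively smoothed co-occurrence distribution
    p_bar^(sm) = alpha * p_bar + (1 - alpha) * q."""
    return {t: alpha * p_bar.get(t, 0.0) + (1.0 - alpha) * q[t] for t in q}

def average_tfidf(counts):
    """Sketch of the average tf.idf baseline: tf.idf(d, t) = (1 + log n(d, t)) * M / df(t),
    averaged over all M documents of the corpus (documents without t contribute 0)."""
    M = len(counts)
    df = defaultdict(int)
    for c in counts.values():
        for t in c:
            df[t] += 1
    scores = defaultdict(float)
    for c in counts.values():
        for t, k in c.items():
            scores[t] += (1.0 + math.log(k)) * M / df[t]
    return {t: s / M for t, s in scores.items()}
```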

5.3 Results

To evaluate the resulting generated thesaurus we determine (a) the number of the thesaurus terms that are really informative and (b) the number of topics in the data set that are covered. These two evaluation aspects resemble the usual precision and recall measures.

To evaluate the precision aspect (a) we have compared the generated thesaurus terms with a thesaurus of keywords used by the Dutch Institute of Sound and Vision, the Gemeenschappelijke Thesaurus voor Audiovisuele Archieven (GTAA, Common Thesaurus for Audiovisual Archives), containing about 9000 subject terms and extensive lists of person names, company names and geographical names [12]. The fraction of the keywords found in this thesaurus was used as an indication of the quality of the set of automatically generated thesaurus terms.

The recall aspect (b) presupposes knowledge of the topics in the corpus. Of course the full list of "topics" is somewhat open to debate, but since we have selected articles from eight Wikipedia categories it seems reasonable to count each Wikipedia category as a separate topic. In order to get an impression of the coverage of these topics we have extracted the five best keywords for each category. Thereby we obtain a set of 30 terms. A thesaurus generation algorithm is considered better if more of these terms are found without training on these categories. In order to find the best keywords for each category we have used a slight variation of the method proposed by [1]: we compare the term distribution of the whole corpus with the term distribution of the subset and select those terms that have the most deviating distributions. The resulting sets of keywords are given in Table 2.

category           keywords
spaceflight        aarde, lancering, lanceren, raket, satelliet
architecture       gebouw, huis, station, bouwen, kerk
monocots           familie, plant, soort, geslacht, eenzaadlobbig
pop music          album, band, the, single, nummer
painting           werk, werken, Van Gogh, schilderen, schilderij
trees              boom, blad, hout, vrucht, gebruiken
aviation           vliegtuig, toestel, vliegen, motor, Schiphol
charadriiformes    vogel, soort, geslacht, zwart, kust

Table 2: Keywords for the document categories.

The results for the coverage of the best n terms by the GTAA are shown in Table 3 (left). We have determined the coverage for n = 40, 80 and 160. The low percentages are partially due to the fact that many of the generated thesaurus terms would not be considered good keywords (neither intuitively nor in the GTAA) although they do clearly relate to a specific domain, e.g. 'guitar player' or 'French'. For the $IG^+$ method we see a slight tendency to give higher ranks to the more meaningful terms. The results of our method are clearly better than the average tf.idf and at the low levels slightly better than the term frequency method.

n      DV     freq   tf.idf   IG+          n      DV      freq     tf.idf   IG+
40     0.18   0.33   0.38     0.48         40     1 (1)   16 (8)   11 (4)   12 (7)
80     0.16   0.39   0.35     0.40         80     1 (1)   28 (8)   13 (5)   15 (8)
160    0.18   0.43   0.34     0.38         160    1 (1)   33 (8)   20 (8)   20 (8)

Table 3: Left table: coverage by the GTAA. Right table: recall of keywords (categories).

To evaluate the recall of the eight subjects from which the texts were taken we can either count the number of keywords from Table 2 or the number of categories from which a keyword (according to the same table) is present in the suggested list. The results for the sets of the 40, 80 and 160 best terms according to each method are given in Table 3 (right). Surprisingly, the simple term frequency gives the best results for this evaluation measure. At the low levels the method presented in this paper is clearly better than the average tf.idf. On closer inspection of the generated terms one reason for the somewhat disappointing result becomes immediately clear: our method tends to select words that have many co-occurring words with a similar distribution. This implies that all these co-occurring words are found as well. Thus, most of the best ranked terms are near synonyms, or at least terms on the same subject, suppressing the less prominent subjects. Clustering of terms with similar distributions (e.g. using a symmetrised KL-divergence as a similarity measure) would be a solution to this problem. However, results depend strongly on the clustering method and are difficult to compare to the other methods.

6 Conclusion

We developed an information theoretic measure $IG^+$ to rank terms by their usefulness as keywords, based on the co-occurrence of a term with other terms. The measure gives a natural balance between specificity and generality of a term, and detects large and coherent subsets of a document collection that can be characterized by a term. To generate a thesaurus for a given corpus we used the highest ranked terms after some filtering.

To evaluate the measure we have constructed a corpus with articles from the Dutch Wikipedia, taken equally from 8 categories. As a measure of precision we have determined the fraction of the n highest ranked terms that is contained in a carefully constructed general thesaurus of keywords. As an approximation of recall we have compared the n highest ranked terms with the set of keywords that were extracted with an independent method for each of the 8 original categories.

Acknowledgements

We thank Wolf Huijsen for extracting the data from Wikipedia. The research for this paper was part of a MultimediaN project sponsored by the Dutch government under contract BSIK 03031.

References

[1] M. A. Andrade and A. Valencia. Automatic extraction of keywords from scientific text: application to the knowledge domain of protein families. Bioinformatics, 14(7):600–607, 1998.
[2] T. Cover and J. Thomas. Elements of Information Theory. John Wiley and Sons, Inc., 1991.
[3] C. J. Crouch and B. Yang. Experiments in automatic statistical thesaurus construction. In SIGIR '92: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 77–88, New York, NY, USA, 1992. ACM.
[4] H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. GATE: A framework and graphical development environment for robust NLP tools and applications. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics, 2002.
[5] S. T. Dumais, G. W. Furnas, T. K. Landauer, S. Deerwester, and R. Harshman. Using latent semantic analysis to improve access to textual information. In CHI '88: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pages 281–285, New York, NY, USA, 1988. ACM.
[6] T. Hofmann. Probabilistic latent semantic analysis. In UAI '99: Uncertainty in Artificial Intelligence, 1999.
[7] T. Hofmann. Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1-2):177–196, January 2001.
[8] A. Hulth and B. Megyesi. A study on automatically extracted keywords in text categorization. In ACL. The Association for Computer Linguistics, 2006.
[9] T. Landauer, P. Foltz, and D. Laham. Introduction to latent semantic analysis. Discourse Processes, 25:259–284, 1998.
[10] H. Li and K. Yamanishi. Topic analysis using a finite mixture model. Information Processing & Management, 39(4):521–541, 2003.
[11] K. Lund and C. Burgess. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments, & Computers, 28(2):203–208, 1996.
[12] V. Malaisé, L. Gazendam, and H. Brugman. Disambiguating automatic semantic annotation based on a thesaurus structure. In Actes de la 14e conférence sur le Traitement Automatique des Langues Naturelles, pages 197–206, 2007.
[13] C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. The MIT Press, Cambridge, Massachusetts, 1999.
[14] Y. Niwa and Y. Nitta. Co-occurrence vectors from corpora vs. distance vectors from dictionaries. In Proceedings of the 15th Conference on Computational Linguistics, pages 304–309, Morristown, NJ, USA, 1994. Association for Computational Linguistics.
[15] G. Salton, C. Yang, and C. Yu. A theory of term importance in automatic text analysis. Technical Report TR 74-208, Department of Computer Science, Cornell University, July 1974.
[16] H. Schmid. Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, Manchester, UK, 1994.
[17] P. K. Shah, C. Perez-Iratxeta, P. Bork, and M. A. Andrade. Information extraction from full text scientific articles: Where are the keywords? BMC Bioinformatics, 4(20):1–9, 2003.
[18] P. Willett. An algorithm for the calculation of exact term discrimination values. Information Processing & Management, 21(3):225–232, 1985.