A Probabilistic Hierarchical Clustering Method for Organising Collections of Text Documents

Alexei Vinokourov and Mark Girolami
Computational Intelligence Research Unit
Department of Computing and Information Systems
University of Paisley, Paisley, Scotland
{vino-ci0, giro-ci0}@wpmail.paisley.ac.uk

Abstract

In this paper a generic probabilistic framework for the unsupervised hierarchical clustering of large-scale sparse high-dimensional data collections is proposed. The framework is based on a hierarchical probabilistic mixture methodology. Two classes of models emerge from the analysis, and these have been termed symmetric and asymmetric models. For text data specifically, asymmetric and symmetric models based on the multinomial and binomial distributions are most appropriate. An Expectation Maximisation parameter estimation method is provided for all of these models. An experimental comparison of the models is obtained for a number of extensive online document collections.

1. Introduction

Along with the increasing number of documents available through the Internet, the attendant problem of intelligent automatic document classification has arisen. A variety of recent publications has demonstrated the utility of statistical approaches in learning to automatically classify text documents. In this paper we concentrate on hierarchical probabilistic unsupervised clustering, as this has a number of significant potential advantages for text documents and Information Retrieval (IR). If we look, for example, at an IR system such as YAHOO!, it has a large text database manually organised in a hierarchic manner. This database has to be maintained and updated regularly to keep pace with the number of new pages appearing on the web. An automatic hierarchical document clustering system would significantly reduce the amount of manual labour and consequently the cost of maintaining hierarchical IR systems.

Formally, text collections are commonly represented by

a bag-of-words model, in which a word $w \in W$, $|W| = M$, occurs in a part of a sample $S$ representing a document $d \in D$, $|D| = N$, and $n_{dw}$ denotes the number of times word $w$ occurs in document $d$. The sample $S$ corresponds to the observed part of the data. It has been shown that one effective technique for either reducing the dimensionality of the input data or discovering its internal structure is to introduce unobserved variables and to apply the Expectation Maximisation (EM) algorithm [3][10] for parameter estimation. In this case the hidden variables represent classes of documents or of document-word pairs.
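To make the representation concrete, the following minimal sketch builds the term-count table $n_{dw}$ for a toy two-document corpus; the documents, vocabulary and variable names are illustrative stand-ins rather than data or code from the paper.

```python
from collections import Counter

# Toy corpus standing in for the collection D; in the paper the counts n_dw
# come from real corpora such as 20 Newsgroups, Reuters-21578 and WebKB.
docs = [
    "wheat prices rise as grain exports fall",
    "new graphics hardware announced for workstations",
]

# Vocabulary W (|W| = M) and an index for each word.
vocab = sorted({w for d in docs for w in d.split()})
word_id = {w: j for j, w in enumerate(vocab)}

# n[d][w_id] = n_dw, the number of times word w occurs in document d.
n = [[0] * len(vocab) for _ in docs]
for i, d in enumerate(docs):
    for w, c in Counter(d.split()).items():
        n[i][word_id[w]] = c

print(len(docs), len(vocab))  # N, M
print(n[0])                   # term-count vector of the first document
```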

2. The family of hierarchical probabilistic mixture models

2.1. Symmetric and Asymmetric models

In the sequel we consider hierarchical mixture models directly. Plain, or flat, models are special cases of hierarchic models once we reduce the hierarchy to the simplest form, that is, one layer with the root node having the nodes of that layer as its children. The idea of hierarchical mixtures has been exploited in a number of recent works [6], [1], [12]. For convenience we let the mixture components $p(\alpha_m|\alpha_{m-1}) \equiv 0$ if cluster $\alpha_m$ is not a child of $\alpha_{m-1}$ in some chosen hierarchic topology $T$, where $m$ signifies the $m$-th layer of $T$. We also denote the marginal probability of a cluster (node) $\alpha_l = (\alpha_1, \alpha_2, \ldots, \alpha_l)$ from layer $l$, whose antecedents $\alpha_{l-1}, \alpha_{l-2}, \ldots$ are defined recursively, as

$$p(\alpha_l) \equiv \prod_{m=1}^{l} p(\alpha_m|\alpha_{m-1}), \qquad p(\alpha_1|\alpha_0) \equiv p(\alpha_1) \qquad (1)$$

and we denote the posterior probabilities of the cluster for a document and for a document-word pair as

$$p(\alpha_l|d) \equiv \prod_{m=1}^{l} p(\alpha_m|\alpha_{m-1}, d) \qquad (2)$$

$$p(\alpha_l|d, w) \equiv \prod_{m=1}^{l} p(\alpha_m|\alpha_{m-1}, d, w) \qquad (3)$$

where we denote $p(\alpha_1|\alpha_0, d) \equiv p(\alpha_1|d)$ and $p(\alpha_1|\alpha_0, d, w) \equiv p(\alpha_1|d, w)$ for convenience accordingly.
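As a small illustration of equation (1), the sketch below computes the marginal probability of a node by multiplying the conditional probabilities along its path to the root; the small tree and its probabilities are invented purely for this example.

```python
# Hypothetical two-layer topology T: each node maps to (parent, conditional
# probability). Layer-1 nodes condition on the root, i.e. p(a_1 | a_0) = p(a_1).
tree = {
    "a1":   (None, 0.6),   # p(a1)
    "a2":   (None, 0.4),   # p(a2)
    "a1.1": ("a1", 0.7),   # p(a1.1 | a1)
    "a1.2": ("a1", 0.3),   # p(a1.2 | a1)
    "a2.1": ("a2", 1.0),   # p(a2.1 | a2)
}

def marginal(node):
    """Equation (1): p(alpha_l) = prod_m p(alpha_m | alpha_{m-1}) along the path."""
    prob = 1.0
    while node is not None:
        parent, cond = tree[node]
        prob *= cond
        node = parent
    return prob

print(marginal("a1.2"))  # 0.6 * 0.3 = 0.18
```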



There are two basic models of a collection of documents: the sample is either a number of documents, each represented by a term vector, or a number of document-term pairs. Thus we obtain a sub-family of asymmetric models, where the document and word variables $d$ and $w$ are not treated equally and which is defined by the following equation for the probability of a document $p(d)$

$$p(d) = \sum_{\alpha_l} p(\alpha_l)\, p(d|\alpha_l). \qquad (4)$$

For the symmetric sub-family we have

$$p(d, w) = \sum_{\alpha_l} p(\alpha_l)\, p(d, w|\alpha_l). \qquad (5)$$
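The following sketch contrasts the two data views that give rise to equations (4) and (5): the same counts $n_{dw}$ can be read either as one term vector per document (asymmetric view) or as a multiset of document-word pair observations (symmetric view). The toy counts are invented for illustration.

```python
# Toy n_dw table; real counts would come from a corpus.
counts = {
    ("d1", "grain"): 2, ("d1", "wheat"): 1,
    ("d2", "graphics"): 3,
}

# Asymmetric view: each document d is one observation carrying its term vector.
doc_vectors = {}
for (d, w), n_dw in counts.items():
    doc_vectors.setdefault(d, {})[w] = n_dw

# Symmetric view: the sample S is the multiset of (document, word) pairs.
pairs = [(d, w) for (d, w), n_dw in counts.items() for _ in range(n_dw)]

print(doc_vectors)  # {'d1': {'grain': 2, 'wheat': 1}, 'd2': {'graphics': 3}}
print(len(pairs))   # |S| = 6 pair observations
```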

2.2. Multinomial Asymmetric Hierarchical Analysis (MASHA)

Consider now the asymmetric models in detail. Two probabilistic discrete distributions have been commonly used for text classification and clustering tasks: the multinomial and binomial distributions. Instantiating the generic asymmetric model by the multinomial distribution gives

$$p(d|\alpha_l) \propto \prod_{w} p(w|\alpha_l)^{n_{dw}} \qquad (6)$$

with the following constraint: $\sum_{w} p(w|\alpha_l) = 1$. The expectation of the complete log-likelihood at level $l$ is

$$L_l^{ae} = \sum_{d} \sum_{\alpha_l} p(\alpha_l|d) \left\{ \log p(\alpha_l) + \log p(d|\alpha_l) \right\} + C_l^{ae} \qquad (7)$$

where $C_l^{ae}$ is a constant independent of the parameters. Standard calculations lead to the following EM algorithm, with the E-step

$$p(\alpha_l|\alpha_{l-1}, d) \propto p(\alpha_l|\alpha_{l-1})\, p(d|\alpha_l) \qquad (8)$$

and M-step

$$p(w|\alpha_l) = \frac{1 + \sum_{d} n_{dw}\, p(\alpha_l|d)}{2 + \sum_{w} \sum_{d} n_{dw}\, p(\alpha_l|d)} \qquad (9)$$

$$p(\alpha_l|\alpha_{l-1}) \propto \sum_{d} p(\alpha_l|\alpha_{l-1}, d). \qquad (10)$$

In equation (9), for updating the class means, Laplacean smoothing [8] is used.
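For concreteness, here is one EM sweep for the flat (single-layer) special case of MASHA, following equations (8)-(10); the toy counts, the random initialisation and the single-level simplification are assumptions of this sketch, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, K = 6, 8, 2                        # documents, vocabulary size, clusters
n = rng.integers(0, 4, size=(N, M))      # toy term counts n_dw

p_alpha = np.full(K, 1.0 / K)            # p(alpha)
p_w = rng.dirichlet(np.ones(M), size=K)  # p(w | alpha), shape (K, M)

# E-step, eq (8): p(alpha | d) is proportional to p(alpha) prod_w p(w|alpha)^n_dw,
# computed in log space for numerical stability.
log_resp = np.log(p_alpha) + n @ np.log(p_w).T      # shape (N, K)
log_resp -= log_resp.max(axis=1, keepdims=True)
resp = np.exp(log_resp)
resp /= resp.sum(axis=1, keepdims=True)             # p(alpha | d)

# M-step, eq (9): smoothed word update; the constants 1 and 2 follow the
# Laplacean smoothing of equation (9).
weighted = resp.T @ n                                # sum_d n_dw p(alpha | d)
p_w = (1.0 + weighted) / (2.0 + weighted.sum(axis=1, keepdims=True))

# M-step, eq (10): p(alpha) proportional to sum_d p(alpha | d).
p_alpha = resp.sum(axis=0) / resp.sum()

print(resp.round(2))
```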

2.3. Binomial Asymmetric Hierarchical Analysis (BASHA)

The binomial conditional distribution is given by

$$p(d|\alpha_l) = \prod_{w} p(w|\alpha_l)^{n_{dw}} \left(1 - p(w|\alpha_l)\right)^{1 - n_{dw}}. \qquad (11)$$

Substituting the equation above into the general equation (7) and applying the standard technique, we derive the following E-step

$$p(\alpha_l|\alpha_{l-1}, d) \propto p(\alpha_l|\alpha_{l-1})\, p(d|\alpha_l) \qquad (12)$$

and M-step

$$p(w|\alpha_l) = \frac{1 + \sum_{d} n_{dw}\, p(\alpha_l|d)}{2 + \sum_{d} p(\alpha_l|d)} \qquad (13)$$

$$p(\alpha_l|\alpha_{l-1}) \propto \sum_{d} p(\alpha_l|\alpha_{l-1}, d) \qquad (14)$$

in the EM algorithm accordingly.
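A matching sketch for the flat-case BASHA M-step of equations (13) and (14). Here $n_{dw}$ is treated as a binary occurrence indicator, which is one reading of equation (11); the toy counts and responsibilities are invented, and `resp` would normally come from the E-step of equation (12).

```python
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 5, 6, 2
n = rng.integers(0, 3, size=(N, M))       # raw term counts n_dw
x = (n > 0).astype(float)                 # binary occurrence indicators
resp = rng.dirichlet(np.ones(K), size=N)  # placeholder p(alpha | d), shape (N, K)

# Eq (13): p(w|alpha) = (1 + sum_d x_dw p(alpha|d)) / (2 + sum_d p(alpha|d)).
p_w = (1.0 + resp.T @ x) / (2.0 + resp.sum(axis=0))[:, None]

# Eq (14): p(alpha) proportional to sum_d p(alpha | d)  (flat, single-layer case).
p_alpha = resp.sum(axis=0) / N

print(p_w.round(2))
print(p_alpha.round(2))
```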

2.4. Hierarchical Symmetric Multinomial Semantic Analysis (HPLSA)

Like most latent variable models, HPLSA explicitly introduces a conditional independence assumption, namely that $d$ and $w$ are independent conditioned on the state of the associated latent variable. Hence we decompose the joint probability of a document-term pair as

$$p(d, w|\alpha_l) = p(d|\alpha_l)\, p(w|\alpha_l) \qquad (15)$$

with the following constraints: $\sum_{d} p(d|\alpha_l) = 1$ and $\sum_{w} p(w|\alpha_l) = 1$. This gives the following equation for the likelihood

$$L_l^{HPLSAe} = \sum_{d} \sum_{w} n_{dw} \sum_{\alpha_l} p(\alpha_l|d, w) \left\{ \log p(\alpha_l) + \log p(d, w|\alpha_l) \right\} \qquad (16)$$

which leads to the following E-step

$$p(\alpha_l|\alpha_{l-1}, d, w) \propto p(\alpha_l|\alpha_{l-1})\, p(d, w|\alpha_l) \qquad (17)$$

and M-step

$$p(d|\alpha_l) \propto \sum_{w} n_{dw}\, p(\alpha_l|d, w) \qquad (18)$$

$$p(w|\alpha_l) \propto \sum_{d} n_{dw}\, p(\alpha_l|d, w) \qquad (19)$$

$$p(\alpha_l|\alpha_{l-1}) \propto \sum_{d} \sum_{w} n_{dw}\, p(\alpha_l|\alpha_{l-1}, d, w) \qquad (20)$$

of the EM algorithm respectively. We term the method Hierarchical PLSA to stress the connection with Probabilistic Latent Semantic Analysis (PLSA) [4]. It follows from the discussion above that plain PLSA is a particular case of the more general HPLSA in which the hierarchy has a single layer.
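The sketch below runs one EM sweep of the flat special case of HPLSA (i.e. plain PLSA), following equations (17)-(20); the toy counts, the random initialisation and the single-layer simplification are assumptions of this example rather than the authors' code.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M, K = 4, 7, 2
n = rng.integers(0, 3, size=(N, M)).astype(float)  # toy counts n_dw

p_alpha = np.full(K, 1.0 / K)             # p(alpha)
p_d = rng.dirichlet(np.ones(N), size=K)   # p(d | alpha), shape (K, N)
p_w = rng.dirichlet(np.ones(M), size=K)   # p(w | alpha), shape (K, M)

# E-step, eq (17): p(alpha | d, w) proportional to p(alpha) p(d|alpha) p(w|alpha).
joint = p_alpha[:, None, None] * p_d[:, :, None] * p_w[:, None, :]  # (K, N, M)
resp = joint / joint.sum(axis=0, keepdims=True)                     # p(alpha|d,w)

# M-step, eqs (18)-(20): responsibilities weighted by the counts n_dw.
weighted = resp * n[None, :, :]
p_d = weighted.sum(axis=2); p_d /= p_d.sum(axis=1, keepdims=True)   # eq (18)
p_w = weighted.sum(axis=1); p_w /= p_w.sum(axis=1, keepdims=True)   # eq (19)
p_alpha = weighted.sum(axis=(1, 2)); p_alpha /= p_alpha.sum()       # eq (20)

print(p_alpha.round(3))
```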

2.5. Hierarchical Binomial Symmetric Semantic Analysis (HBSSA)

Assuming the conditional distribution of document-word pairs to be binomial, we have

$$p(d, w|\alpha_l) = p(d|\alpha_l)\, p(w|\alpha_l)^{n_{dw}} \left(1 - p(w|\alpha_l)\right)^{1 - n_{dw}} \qquad (21)$$

where the document conditional has to be normalised, $\sum_{d} p(d|\alpha_l) = 1$. We obtain the following likelihood equation

$$L_l^{HBSSAe} = \sum_{d} \sum_{w} \sum_{\alpha_l} p(\alpha_l|d, w) \left\{ \log p(\alpha_l) + \log p(d, w|\alpha_l) \right\} \qquad (22)$$

and consequently the following E-step

$$p(\alpha_l|\alpha_{l-1}, d, w) \propto p(\alpha_l|\alpha_{l-1})\, p(d, w|\alpha_l) \qquad (23)$$

and M-step

$$p(d|\alpha_l) \propto \sum_{w} n_{dw}\, p(\alpha_l|d, w) \qquad (24)$$

$$p(w|\alpha_l) = \frac{1 + \sum_{d} n_{dw}\, p(\alpha_l|d, w)}{2 + \sum_{d} p(\alpha_l|d, w)} \qquad (25)$$

$$p(\alpha_l|\alpha_{l-1}) \propto \sum_{d} \sum_{w} p(\alpha_l|\alpha_{l-1}, d, w) \qquad (26)$$

of the EM algorithm.
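For completeness, here is a flat-case sketch of the HBSSA sweep of equations (23)-(26): like the HPLSA sketch above, but on binary occurrence data, without the $n_{dw}$ weighting in the prior update, and with the Laplacean-smoothed word update of equation (25). All inputs are toy values; this is an illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, K = 4, 7, 2
x = (rng.integers(0, 3, size=(N, M)) > 0).astype(float)  # occurrence indicators

p_alpha = np.full(K, 1.0 / K)
p_d = rng.dirichlet(np.ones(N), size=K)    # p(d | alpha), shape (K, N)
p_w = rng.uniform(0.2, 0.8, size=(K, M))   # Bernoulli p(w | alpha)

# E-step, eq (23): p(alpha|d,w) prop. to p(alpha) p(d|alpha) p(w|alpha)^x (1-p(w|alpha))^(1-x).
lik = np.where(x[None, :, :] > 0, p_w[:, None, :], 1.0 - p_w[:, None, :])
joint = p_alpha[:, None, None] * p_d[:, :, None] * lik
resp = joint / joint.sum(axis=0, keepdims=True)           # shape (K, N, M)

# M-step: eq (24) document update, eq (25) smoothed word update, eq (26) priors.
p_d = (resp * x).sum(axis=2); p_d /= p_d.sum(axis=1, keepdims=True)   # eq (24)
p_w = (1.0 + (resp * x).sum(axis=1)) / (2.0 + resp.sum(axis=1))       # eq (25)
p_alpha = resp.sum(axis=(1, 2)); p_alpha /= p_alpha.sum()             # eq (26)

print(p_w.round(2))
```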

3. Experimental results

For an assessment of the results we need an evaluation measure of the achieved clustering. In all the data sets that we experimented with, documents were classified manually, i.e. they have existing class labels. Therefore two distributions of the class variable exist, one given manually and another obtained by automatic clustering. Information theory provides a popular criterion, known as mutual information, for assessing the dependence between two such variables. This criterion has been suggested in [11] for unsupervised clustering and we adopt this measure. The mutual information between the cluster and class label variables is given by

$$MI(\alpha_L, k) = \sum_{\alpha_L} \sum_{k} p(\alpha_L, k) \log \frac{p(\alpha_L, k)}{p(\alpha_L)\, p(k)} \qquad (27)$$

where $p(\alpha_L, k)$ is the joint probability of cluster $\alpha_L$ at the last layer $L$ of the hierarchy and labelled class $k$. The terms $p(\alpha_L)$ and $p(k)$ are the marginal probabilities, with $\langle p(k) \rangle = N_k / N$, $N_k = |k|$. As an estimate of the joint distribution we used an empirical average of the posteriors delivered by the EM algorithm. For the asymmetric models it is given by

$$\langle p_{asymm}(\alpha_L, k) \rangle = \frac{1}{N} \sum_{w} \sum_{d \in k} p(\alpha_L|d) \qquad (28)$$

where $\langle \cdot \rangle$ denotes the estimate of the parameter. An analogous expression is used for HBSSA,

$$\langle p_{HBSSA}(\alpha_L, k) \rangle = \frac{1}{|S|} \sum_{w} \sum_{d \in k} p(\alpha_L|d, w). \qquad (29)$$

For HPLSA the expression is of the form

$$\langle p_{HPLSA}(\alpha_L, k) \rangle = \frac{1}{|S|} \sum_{w} \sum_{d \in k} n_{dw}\, p(\alpha_L|d, w). \qquad (30)$$
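The mutual information criterion of equation (27) can be computed directly once an estimate of the joint distribution is available; the sketch below uses hard toy assignments in place of the posterior averages of (28)-(30), purely to illustrate the calculation.

```python
import numpy as np

clusters = np.array([0, 0, 1, 1, 1, 2, 2, 0])  # toy cluster of each document
labels   = np.array([0, 0, 1, 1, 0, 1, 1, 0])  # toy manual class of each document

K = clusters.max() + 1
C = labels.max() + 1
joint = np.zeros((K, C))
for a, k in zip(clusters, labels):
    joint[a, k] += 1.0
joint /= joint.sum()                           # estimate of p(alpha_L, k)

p_a = joint.sum(axis=1, keepdims=True)         # p(alpha_L)
p_k = joint.sum(axis=0, keepdims=True)         # p(k)

# Eq (27): MI = sum p(alpha_L, k) log[ p(alpha_L, k) / (p(alpha_L) p(k)) ].
nz = joint > 0
mi = float(np.sum(joint[nz] * np.log(joint[nz] / (p_a @ p_k)[nz])))
print(round(mi, 3))
```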

A summary of the experimental results is given in Table 1, which shows the performance of the hierarchical models along with the plain ones, in terms of the MI measure, for a number of configurations. All experiments were carried out using the Bow library [9]. For all corpora we selected the 100 most frequent words and removed HTML tags, and for 20 Newsgroups [5] and Reuters-21578 [7] we also removed stop-words. Due to the large memory demands of the symmetric models, proportional to $N \times M \times K$ where $K$ is the number of clusters, subsets of the corpora have been selected: four comp.* newsgroups, namely all except comp.graphics (N = 4000), and only the Reuters documents from the following six topics: grain, wheat, corn, ship, trade, crude (N = 1977). The WebKB collection [2] (N = 8277) was taken entirely, although, as yet, we have not carried out experiments with the symmetric algorithms for all configurations.

Table 1. The MI comparison of the clustering methods for different model configurations and different text corpora. Configurations are given in the format a-b (c), where a is the number of nodes at the first layer, b the number of children per node at the second layer, and c is the number of clusters in the plain model. For the plain methods (P) the number of clusters is taken the same as the number of nodes at the last layer of the hierarchy in the hierarchical (H) methods. The best result over 30 iterations for each method is taken.

4 Newsgroups (N=4000)
Conf      MASHA-H  MASHA-P  BASHA-H  BASHA-P  HPLSA-H  HPLSA-P  HBSSA-H  HBSSA-P
4-2(8)    0.67     0.63     0.37     0.59     0.29     0.49     0.04     0.08
4-3(12)   0.82     0.66     0.40     0.61     0.66     0.45     0.05     0.06
4-4(16)   0.90     0.68     0.60     0.60     0.59     0.50     0.09     0.07
5-2(10)   0.78     0.66     0.88     0.60     0.58     0.49     0.05     0.07
5-3(15)   0.83     0.67     0.61     0.60     0.42     0.45     0.08     0.08
5-4(20)   0.86     0.68     0.90     0.61     0.58     0.50     0.04     0.08
6-2(12)   0.29     0.66     1.08     0.61     0.32     0.45     0.08     0.06
6-3(18)   0.83     0.66     0.31     0.61     0.38     0.44     0.10     0.05
6-4(24)   0.90     0.58     0.51     0.53     1.14     0.49     0.17     0.09
7-2(14)   1.61     0.65     0.10     0.57     0.83     0.51     0.11     0.05
7-3(21)   0.81     0.57     0.58     0.51     0.67     0.48     0.09     0.06
7-4(28)   0.82     0.60     0.67     0.53     0.59     0.53     0.07     0.09

Reuters (6 topics, N=1977)
Conf      MASHA-H  MASHA-P  BASHA-H  BASHA-P  HPLSA-H  HPLSA-P  HBSSA-H  HBSSA-P
4-2(8)    0.47     0.37     0.46     0.34     0.21     0.24     0.06     0.07
4-3(12)   0.88     0.49     0.87     0.42     0.25     0.27     0.04     0.08
4-4(16)   0.80     0.57     0.53     0.48     0.32     0.32     0.08     0.09
5-2(10)   0.52     0.44     0.49     0.39     0.17     0.26     0.05     0.07
5-3(15)   0.58     0.55     0.58     0.47     0.31     0.29     0.05     0.08
5-4(20)   1.22     0.64     0.69     0.50     0.48     0.33     0.07     0.09
6-2(12)   0.67     0.49     0.70     0.42     0.31     0.27     0.05     0.08
6-3(18)   0.85     0.60     0.55     0.50     0.24     0.33     0.08     0.09
6-4(24)   1.03     0.44     0.72     0.54     0.32     0.36     0.09     0.10
7-2(14)   0.92     0.53     0.45     0.45     0.24     0.30     0.07     0.08
7-3(21)   0.75     0.66     0.58     0.48     0.28     0.34     0.08     0.10
7-4(28)   0.81     0.49     0.74     0.47     0.31     0.37     0.07     0.12

WebKB (N=8277)
Conf      MASHA-H  MASHA-P  BASHA-H  BASHA-P  HPLSA-H  HPLSA-P  HBSSA-H  HBSSA-P
4-2(8)    0.41     0.60     0.78     0.54     0.11     0.25     0.10     0.12
4-3(12)   0.48     0.71     0.34     0.64     0.35     0.27     0.06     0.15
4-4(16)   0.89     0.76     0.69     0.72     0.33     0.30     0.14     0.12
5-2(10)   0.58     0.62     0.34     0.64     0.44     0.26     0.10     0.15
5-3(15)   0.76     0.74     0.49     0.69     0.34     0.30     0.12     0.15
5-4(20)   0.88     0.79     0.80     0.76     0.29     0.32     0.15     0.18
6-2(12)   0.50     0.71     0.83     0.64     0.33     0.27     0.11     0.15
6-3(18)   0.84     0.78     0.58     0.77     0.16     0.28     0.12     0.17
6-4(24)   0.60     0.81     0.78     0.76     0.26     0.29     0.11     0.18
7-2(14)   0.99     0.69     0.79     0.68     0.34     0.29     0.08     0.14
7-3(21)   0.54     0.82     0.84     0.76     0.29     0.30     0.22     0.18
7-4(28)   0.89     0.88     0.76     0.80     0.29     0.34     0.11     0.22

4. Discussion

The results clearly show 1) the superiority of the asymmetric methods over the symmetric ones; 2) the superiority of the multinomial methods over the binomial ones; and 3) that the hierarchical methods achieve better results than the plain ones. One can see from Table 1 that the choice of an appropriate configuration can considerably improve the performance of a model. The authors have derived a model selection criterion [13] that allows one to choose between different configurations. This has proved to be rather reliable in use, although a detailed discussion is outwith the scope of this paper. Following the work of McCallum and Nigam [8], it is interesting to investigate the behaviour of the multinomial and binomial models over different numbers of terms. This informs our future work.

5. Acknowledgements

This work is funded by the British Library, The Council for Museums, Archives and Libraries, Grant Number LIC/GC/991.

References

[1] C. M. Bishop and M. E. Tipping. A hierarchical latent variable model for data visualization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):281-293, 1998.
[2] M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to extract symbolic knowledge from the World Wide Web. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), 1998.
[3] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1-38, 1977.
[4] T. Hofmann. Probabilistic latent semantic analysis. In Proceedings of the Fifteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-99), pages 289-296, San Francisco, CA, 1999. Morgan Kaufmann Publishers.
[5] T. Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proceedings of the 1997 International Conference on Machine Learning (ICML '97), 1997.
[6] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214, 1994.
[7] D. D. Lewis. An evaluation of phrasal and clustered representations on a text categorization task. In Fifteenth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 37-50, 1992.
[8] A. McCallum and K. Nigam. A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, 1998.
[9] A. K. McCallum. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
[10] G. McLachlan and T. Krishnan. The EM Algorithm and Extensions. John Wiley and Sons, New York, 1997.
[11] S. Vaithyanathan and B. Dom. Generalized model selection for unsupervised learning in high dimensions. In Advances in Neural Information Processing Systems (NIPS'99), volume 12. MIT Press, 2000.
[12] N. Vasconcelos and A. Lippman. Learning mixture hierarchies. In Advances in Neural Information Processing Systems (NIPS'98), volume 11. MIT Press, 1999.
[13] A. Vinokourov and M. Girolami. A probabilistic hierarchical clustering method for organising collections of text documents. Technical Report 5, Department of Computing and Information Systems, University of Paisley, Paisley PA1 2BE, UK, March 2000. ISSN 1461-6122.