Persian Text Classification based on Topic Models

Persian Text Classification based on Topic Models

Parvin Ahmadi, Iman Gholampour, Mahmoud Tabandeh
Department of Electrical Engineering, Sharif University of Technology, Tehran, Iran

Abstract—With the growth of available information, document classification, as one of the methods of text mining, plays a vital role in managing and organizing information. Most text categorization algorithms represent a document collection as a Bag of Words (BOW) and use the words of a text document as its features. This produces a very large number of features, so computations such as classification face serious problems. Moreover, the BOW representation is unable to recognize semantic relationships between terms. Recently, topic-model approaches have been successfully applied to text classification to overcome the problems of BOW. Our main goal in this paper is to investigate the applicability of topic models to Persian text classification and to compare the BOW feature representation with topic model based approaches. The experimental results show that representing Persian documents with topic models yields at least a 9% accuracy improvement over the BOW based algorithm.

Keywords—Persian text, document classification, topic models, bag of words

I. INTRODUCTION

Text mining studies have been gaining importance recently because of the availability of an increasing number of electronic documents from a variety of sources [1]. The main goal of text mining is to enable users to extract information from textual resources; it deals with operations like retrieval, classification, clustering and summarization [1]. Document classification, or text classification (categorization), is a supervised learning task of assigning natural language text documents to one or more predefined categories or classes according to their contents. The workflow in most text classification systems is to train the classification system on a training dataset containing many text documents whose categories are known (training phase), and then to assign a category to a new document using this learned system (test phase). However, text classification involves several challenges, such as proper annotation of the documents, appropriate document representation, dimensionality reduction to handle algorithmic issues, and an appropriate classifier function to obtain good generalization and avoid over-fitting [1].

The text classification process usually adopts supervised machine learning algorithms for learning the classification model [2]. To prepare the term feature set, the bag of words (BOW) is usually applied to represent the feature space. Under the BOW model, each document is represented by a vector of weight values calculated, for example, from the term frequency-inverse document frequency (TF-IDF) [3] of each term occurring in the document.
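As an illustration of this baseline representation, the following is a minimal sketch, assuming scikit-learn; `docs` (raw document strings) and `labels` are hypothetical stand-ins for a labeled corpus, not the dataset used in this paper.

```python
# BOW/TF-IDF features with a linear SVM classifier (sketch).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

vectorizer = TfidfVectorizer()        # one feature per vocabulary term
X = vectorizer.fit_transform(docs)    # (n_documents, vocabulary_size) TF-IDF weights
clf = LinearSVC().fit(X, labels)      # supervised classifier over the BOW features
```

Note that the feature dimension here equals the vocabulary size, which is exactly the scalability problem discussed next.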

The BOW model is very simple to create; however, it discards the semantic information of the terms (i.e., synonymy). Therefore, different terms whose meanings are similar or identical are represented as different features. As a result, the performance of a classification model learned using the BOW model can deteriorate [2]. Recently, some researchers have applied the topic model approach to cluster the words (or terms) into a set of topics. Topic models improve the performance of a classification model by (1) reducing the number of features or dimensions and (2) mapping semantically related terms into the same feature dimension [2].

In this paper, Farsi, also known as Persian, a living language in the Middle East and the Caucasus, is considered, in view of the fact that the amount of electronic Farsi text is growing rapidly. Because of the complex nature of the Persian language, with features such as words with separate parts and compound verbs, most text classification systems are not applicable to Farsi texts [4]. Consequently, previous work on automated Farsi text classification is very limited. Our main goal in this paper is to investigate the applicability of topic models to Persian text classification and to compare the BOW feature representation with topic model based approaches.

The rest of this paper is organized as follows. Section II describes related work on Persian document classification. A brief review of some topic models is presented in Section III. In Section IV, our proposed approach for Persian text classification is presented. The implementation and results are demonstrated in Section V. Finally, the paper is concluded in Section VI.

II. RELATED WORKS

Arabsorkhi and Feili [5] developed a Farsi text classifier using a Bayesian model. Basiri et al. [6] presented a comparison between K-Nearest Neighbor (KNN) and fuzzy KNN approaches for Farsi text classification based on information gain and document frequency feature selection. Bina et al. [7] developed a Farsi text classifier using n-grams and KNN. Pilevar et al. [8] provided a Farsi text classification system using a Learning Vector Quantization network, in which each class is represented by an essence vector called the codebook; these vectors are placed in the feature space so that decision boundaries are approximated by the KNN rule. Maghsoodi and Homayounpour [9] used a Support Vector Machine (SVM) classifier based on extending the feature vector with words extracted from a thesaurus. This method improved classifier performance when the training dataset is unbalanced and not comprehensive for some classes.

Elahimanesh et al. [4] improved the KNN text classifier by inserting a factor into the KNN formula to account for the effects of unbalanced training datasets, and used N-grams with lengths of more than 3 characters in text preprocessing. Their approach improves the KNN algorithm especially when 8-gram indexing and stop word removal are applied. Parchami et al. [10] proposed a method that uses WordNet to increase the similarity of documents under the same category: documents are represented by single words and their frequencies, and, by using WordNet, the frequencies of related words are adjusted to achieve higher accuracy.

III. BACKGROUND THEORY

Probabilistic topic models (PTMs) such as probabilistic Latent Semantic Analysis (pLSA) [11], Latent Dirichlet Allocation (LDA) [12] and the Hierarchical Dirichlet Process (HDP) [13] were developed to capture latent topics in large collections of textual documents and to process text effectively and accurately. In 2011, Zhu and Xing [14] presented a non-probabilistic topic model called Sparse Topical Coding (STC), which assigns a sparse set of topics to each document. A brief review of some topic models follows.

A. pLSA

pLSA is a statistical model which originates from a statistical view of LSA. Topics are used to build a joint probability model over documents and words, defined as the mixture:

P(w, d) = P(d) P(w|d) = P(d) Σ_z P(z|d) P(w|z)    (1)

pLSA introduces a conditional independence assumption: given a topic z, the occurrence of a word w is independent of the document d it belongs to.

In the training phase, given a set of training documents D, the parameters of the model are estimated by maximum likelihood. The log-likelihood of the model with parameters θ can be expressed as:

L(θ|D) = Σ_{d∈D} Σ_w n(d, w) log P(w|d)    (2)

where n(d, w) is the number of occurrences of word w in document d. Thus, the topic distributions P(w|z) are learnt. In the test phase, to estimate the topic proportions P(z|d) of a document d, pLSA keeps the learned model P(w|z) fixed and maximizes the log-likelihood of the words in the document:

L_d(P(z|d)) = Σ_w n(d, w) log ( Σ_z P(z|d) P(w|z) )    (3)
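For concreteness, here is a compact EM sketch, assuming numpy, that maximizes the log-likelihood of Eq. (2) over P(z|d) and P(w|z); the function and argument names (`plsa_em`, `n_dw`) are hypothetical, and this is an illustration rather than the implementation evaluated in this paper.

```python
import numpy as np

def plsa_em(n_dw, K, n_iter=100, seed=0):
    """EM for pLSA on a (D, V) count matrix n_dw, where n_dw[d, w] = n(d, w)."""
    rng = np.random.default_rng(seed)
    D, V = n_dw.shape
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(1, keepdims=True)  # P(z|d)
    p_w_z = rng.random((K, V)); p_w_z /= p_w_z.sum(1, keepdims=True)  # P(w|z)
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w) proportional to P(z|d) P(w|z), as in Eq. (1)
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]                  # (D, K, V)
        resp = joint / np.clip(joint.sum(axis=1, keepdims=True), 1e-12, None)
        # M-step: re-estimate both distributions from count-weighted responsibilities
        weighted = n_dw[:, None, :] * resp
        p_w_z = weighted.sum(axis=0)
        p_w_z /= np.clip(p_w_z.sum(1, keepdims=True), 1e-12, None)
        p_z_d = weighted.sum(axis=2)
        p_z_d /= np.clip(p_z_d.sum(1, keepdims=True), 1e-12, None)
    return p_z_d, p_w_z
```

The test-phase maximization of Eq. (3) ("folding in") corresponds to re-running only the P(z|d) update with `p_w_z` held fixed.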

B. LDA

The most commonly used topic model is LDA. LDA has become popular because it enforces a Dirichlet prior over the topic and word distributions, which has been shown to improve performance compared to pLSA. In a corpus of D documents, each document is modeled as a mixture of K topics, and each topic k is modeled as a multinomial distribution over a word vocabulary, given by β = {β_k}. α is the parameter of the Dirichlet prior on the per-document topic mixtures. For each document d, the parameter θ_d of its topic multinomial is drawn from the Dirichlet distribution Dir(α). For each word w_dn in document d, a topic z_dn is drawn with probability θ_dk, and the word w_dn is drawn from the multinomial distribution β_{z_dn}. α and β are the hyperparameters to be optimized. Given the parameters α and β, the joint distribution of a topic mixture θ_d, a set of N topic assignments z_d, and a set of N words w_d is:

P(θ_d, z_d, w_d | α, β) = P(θ_d|α) ∏_{n=1}^{N} P(z_dn|θ_d) P(w_dn|z_dn, β)    (4)

The marginal likelihood P(w_d|α, β), and hence the posterior distribution P(θ_d, z_d | w_d, α, β), are intractable to compute exactly, so an approximate inference method, such as variational Bayes, must be utilized. A supervised version of LDA, called MedLDA, is proposed in [15] for the classification problem; MedLDA integrates latent topic discovery with a multi-class max-margin classifier.
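To make the generative story behind Eq. (4) concrete, the following is a small sampling sketch, assuming numpy; the sizes and hyperparameter values are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V, N, alpha = 5, 1000, 200, 0.1               # hypothetical sizes and prior
beta = rng.dirichlet(np.ones(V) * 0.01, size=K)  # K topic-word multinomials beta_k

theta_d = rng.dirichlet(np.ones(K) * alpha)      # theta_d ~ Dir(alpha)
doc = []
for n in range(N):
    z_dn = rng.choice(K, p=theta_d)              # topic assignment z_dn ~ Mult(theta_d)
    doc.append(rng.choice(V, p=beta[z_dn]))      # word w_dn ~ Mult(beta_{z_dn})
```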

C. STC

STC is a non-probabilistic topic model which assigns a sparse set of topics to each document. θ_d ∈ R^K is the code of document d, and s_dn ∈ R^K is the code of its n-th word. A joint distribution can be defined as follows:

P(θ, s, w, β) = P(β) P(θ) ∏_{n=1}^{N} P(s_n|θ) P(w_n|s_n, β)    (5)

For discrete word counts, STC uses a Poisson distribution with mean parameter s_n^T β_{.n} to generate the observations, i.e.:

P(w_n|s_n, β) = (s_n^T β_{.n})^{w_n} exp{−s_n^T β_{.n}} / w_n!    (6)
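For intuition, the observation model of Eq. (6) can be simulated in a few lines, assuming numpy; all sizes and codes below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 10, 50                             # hypothetical topic and vocabulary sizes
beta = rng.dirichlet(np.ones(V), size=K)  # dictionary: each row beta_k on the simplex
s_n = np.abs(rng.normal(size=K))          # a non-negative word code s_n
mean = s_n @ beta[:, 7]                   # s_n^T beta_{.n} for vocabulary word n = 7
w_n = rng.poisson(mean)                   # a word count drawn as in Eq. (6)
```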

In order to achieve sparse representations for θ and s, STC chooses a Laplace prior and a super-Gaussian distribution, respectively:

p(θ) ∝ exp(−λ ||θ||_1)    (7)

p(s_n|θ) ∝ exp(−γ ||s_n − θ||_2^2 − ρ ||s_n||_1)    (8)

STC then minimizes the following objective function:

min_{θ,s,β} Σ_{d,n} −log p(w_dn|s_dn, β) + Σ_{d,n} (γ ||s_dn − θ_d||_2^2 + ρ ||s_dn||_1) + λ Σ_d ||θ_d||_1

s.t.: θ_d ≥ 0, ∀d;  s_dn ≥ 0, ∀d, n ∈ I_d;  β_k ∈ P, ∀k    (9)

where I_d is the set of word indices of document d, P is the probability simplex, and λ, γ and ρ are non-negative hyper-parameters set by the user [14]. A supervised version of STC, called MedSTC, is also proposed in [14] for the classification problem. MedSTC learns predictive representations and a supervised dictionary by exploiting the available side information of categorical labels. Since the non-probabilistic STC can be naturally integrated with any convex loss function, the large-margin principle, which was also successfully explored in MedLDA [15], is adopted to define a classifier. Specifically, the document code θ is used as the input feature of a multi-class SVM, and the max-margin supervised STC (MedSTC) is defined as jointly learning a large-margin classifier, learning a dictionary β, and discovering the latent representations s and θ.
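The per-document loss in Eq. (9) is straightforward to evaluate. The following is a minimal sketch, assuming numpy and scipy; `stc_doc_objective` and its arguments are hypothetical names, and it merely evaluates the objective rather than optimizing it as in [14].

```python
import numpy as np
from scipy.special import gammaln

def stc_doc_objective(counts, idx, S, theta, beta, lam, gamma, rho):
    """Eq. (9) terms for a single document.
    counts: (N,) observed counts w_dn; idx: (N,) vocabulary indices n in I_d;
    S: (N, K) word codes s_dn; theta: (K,) document code; beta: (K, V) dictionary."""
    mean = np.einsum('nk,kn->n', S, beta[:, idx])   # s_dn^T beta_{.n} for each word
    # Negative log of the Poisson likelihood of Eq. (6); gammaln(c + 1) = log(c!)
    nll = -np.sum(counts * np.log(mean) - mean - gammaln(counts + 1.0))
    sparsity = gamma * np.sum((S - theta) ** 2) + rho * np.abs(S).sum()
    return nll + sparsity + lam * np.abs(theta).sum()
```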

Fig. 1. Block diagram of text classification based on the BOW model or on supervised topic models

IV. PROPOSED APPROACH

In the text classification domain, the dimensionality of the feature space is a crucial problem. Even for a limited document collection, the number of unique words can be exceedingly large, yet not all terms in this bag of words are necessary or discriminative features. For this reason, non-useful words must be removed in order to extract a subset better suited to the categorization task. In this article, stop word removal is performed using the TF-IDF method [3], providing a more efficient description of the Persian documents. After stop word removal, the histogram of the remaining words is calculated for each document.

For classification based on the BOW model, these document-word representations are classified directly, using an SVM to learn the classification model. To overcome the problems of BOW, we propose to use the unsupervised topic models pLSA, LDA and STC, and the supervised models MedLDA and MedSTC, for Persian document classification. For the unsupervised models, we use all the data to learn the model parameters and then use the training documents, with their topical representations as features, to build multi-class SVM classifiers. Fig. 1 shows the block diagram of text classification based on BOW or on supervised topic models. Fig. 2 shows the block diagram of text classification based on unsupervised topic models; a code sketch of this unsupervised pipeline is given after Fig. 2.

Fig. 2. Block diagram of text classification based on unsupervised topic models
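As referenced above, the following is a minimal sketch of the unsupervised pipeline of Fig. 2, using scikit-learn's LDA as the topic model; `train_docs`, `test_docs`, `y_train`, `y_test` and the parameter values are hypothetical, and this illustrates the pipeline rather than reproducing the exact systems evaluated below.

```python
# Word histograms -> topic features (learned on all documents) -> multi-class SVM.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

vectorizer = CountVectorizer(max_features=3500)           # word histograms
X_all = vectorizer.fit_transform(train_docs + test_docs)

lda = LatentDirichletAllocation(n_components=36, random_state=0)
lda.fit(X_all)                                            # topics learned on all data

theta_train = lda.transform(vectorizer.transform(train_docs))  # topic mixtures
theta_test = lda.transform(vectorizer.transform(test_docs))

clf = LinearSVC().fit(theta_train, y_train)               # SVM over topic features
print(accuracy_score(y_test, clf.predict(theta_test)))
```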

V. EXPERIMENTAL RESULTS

Classification approaches are not tied to a specific language and can be applied to any language. Because work on the Persian language is very rare, in this paper we tested the classification methods on a Persian text dataset. Persian is the official language of several countries, including Iran, Tajikistan and Afghanistan, and is a second language in other countries of the Middle East. The Persian language has its own structure and complexity.

We evaluate the performance of the proposed approach for Persian text classification on a subset of the Bijankhan corpus^1 including 3947 documents; 3347 documents are used for training and 574 documents for testing. The documents are from 9 different subject categories: literary, art, archeology, economic, social, sporty, political, religious and medical. Table I describes this subset of the Bijankhan dataset. After stop word removal, 3500 words remain; therefore we have a 3347x3500 document-word matrix for training and a 574x3500 matrix for testing.

Table II reports the classification accuracy of the topic model based methods for different numbers of topics (K). In these methods, the number of topics defines the dimension of the input feature vectors of the classifiers and has an important impact on accuracy. Generally, accuracy increases with the number of topics up to a certain point and then begins to decrease. This accuracy decline is a symptom of over-fitting and is a direct result of the curse of dimensionality. The larger the number of topics at which over-fitting starts, the more latent topic features the model can handle: on one hand, a large number of topics increases the possibility of over-fitting; on the other hand, it provides more latent features for building the classifier.

The classification accuracy of the BOW method and the best accuracy of each topic model based approach are reported in Table III. As the results show, all the topic model based approaches achieve at least a 9% improvement over the BOW model. The best result for Persian document classification is obtained with the MedSTC model, which improves accuracy by 29% in comparison with the BOW model. This is because the bag of words model offers a rather impoverished representation of the data, due to ignoring any relationships between the terms. A topic model assumes that documents, and the corpus, are composed of mixtures of topics, so a document can be thought of as a "bag of topics". Each topic is a probability distribution over words that captures semantic coherence. Thus, these models handle term dependency effectively.

VI. CONCLUSION

In this paper, the problem of text classification was investigated for the Persian language. To improve the performance of text categorization based on the bag of words

^1 http://ece.ut.ac.ir/dbrg/bijankhan/

feature representation, topic model based algorithms were used to cluster the term features into a set of latent topics. Topic models transfer the data to a new low-dimensional semantic topic space, in which classification requires less computation and achieves higher accuracy. We compared the BOW feature representation with the topic model based representations for Persian text classification. In the experimental results, feature representation with the topic models yielded an accuracy improvement of at least 9% over the BOW model.

TABLE I. Description of the Subset of the Bijankhan Corpus

Class (Category)   Number of Training Documents   Number of Test Documents
Literary           190                            24
Art                422                            37
Archeology         191                            60
Economic           641                            107
Social             615                            112
Sporty             370                            44
Political          167                            38
Religious          140                            50
Medical            367                            102

TABLE II. Classification Accuracy of Topic Model based Methods for Different Numbers of Topics

Method    K=9    K=18   K=27   K=36   K=45
pLSA      0.57   0.67   0.66   0.63   0.61
LDA       0.58   0.67   0.68   0.68   0.62
STC       0.62   0.70   0.75   0.77   0.76
MedLDA    0.75   0.83   0.82   0.81   0.80
MedSTC    0.81   0.85   0.87   0.87   0.86

TABLE III. Classification Accuracy of BOW and Different Topic Model based Methods

Method    Classification Accuracy
BOW       0.58
pLSA      0.67
LDA       0.68
STC       0.77
MedLDA    0.83
MedSTC    0.87

ACKNOWLEDGEMENT

We would like to thank Dr. Bahram Vazirnezhad and Basira Khaki Ardekani from the Sharif Languages & Linguistics Center for providing us with the Persian dataset.

REFERENCES

[1] B. Baharudin, L. H. Lee, and K. Khan, "A review of machine learning algorithms for text-documents classification", Journal of Advances in Information Technology, vol. 1, no. 1, pp. 4-20, 2010.
[2] W. Sriurai, P. Meesad, and C. Haruechaiyasak, "Improving Web Page Classification by Integrating Neighboring Pages via a Topic Model", In IICS, pp. 238-246, 2010.
[3] G. Salton, A. Wong, and C. S. Yang, "A vector space model for automatic indexing", Communications of the ACM, 18(11), pp. 613-620, 1975.
[4] M. H. Elahimanesh, B. Minaei, and H. Malekinezhad, "Improving K-Nearest Neighbor Efficacy for Farsi Text Classification", In Proceedings of the International Conference on Language Resources and Evaluation (LREC), pp. 1618-1621, 2012.
[5] M. Arabsorkhi and H. Feili, "Using Bayesian model to Persian text classification", In Proceedings of the Second Workshop on Persian Language and Computer, pp. 245-249, 2006 [in Persian].
[6] M. E. Basiri, S. Nemati, and N. Aqaee, "Comparing KNN and FKNN algorithms in Farsi text classification based on information gain and document frequency feature selection", In Proceedings of the 13th International Computer Conference of the Computer Society of Iran, pp. 383-406, 2008.
[7] B. Bina, M. Ahmadi, and M. Rahgozar, "Farsi text classification using n-grams and KNN algorithm: A comparative study", In Proceedings of the 4th International Conference on Data Mining, pp. 385-390, 2008.
[8] M. T. Pilevar, H. Feili, and M. Soltani, "Classification of Persian textual documents using learning vector quantization", In Proceedings of the International Conference on Natural Language Processing and Knowledge Engineering, pp. 1-6, 2009.
[9] N. Maghsoodi and M. Homayounpour, "Using Thesaurus to Improve Multiclass Text Classification", In Computational Linguistics and Intelligent Text Processing, Springer Berlin Heidelberg, pp. 244-253, 2011.
[10] M. Parchami, B. Akhtar, and M. Dezfoulian, "Persian text classification based on K-NN using wordnet", In Advanced Research in Applied Artificial Intelligence, Springer Berlin Heidelberg, pp. 283-291, 2012.
[11] T. Hofmann, "Probabilistic latent semantic analysis", In UAI, pp. 289-296, 1999.
[12] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet Allocation", Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
[13] Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei, "Hierarchical Dirichlet processes", Journal of the American Statistical Association, 101(476), pp. 1566-1581, 2006.
[14] J. Zhu and E. Xing, "Sparse topical coding", In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence (UAI), pp. 831-838, 2011.
[15] J. Zhu, A. Ahmed, and E. P. Xing, "MedLDA: maximum margin supervised topic models for regression and classification", In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 1257-1264, ACM, 2009.