
Extreme Learning Machine for Multi-Class Sentiment Classification of Tweets

Dr. Zhaoxia Wang^1 and Yogesh Parth^2

^1 Social and Cognitive Computing (SCC) Department, Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore 138632
^2 Indian Institute of Space Science and Technology (IIST), Department of Space, India 695547
{[email protected], [email protected]}

Abstract. The increasing popularity of social media in recent years has created new opportunities to study and evaluate public opinions and sentiments for use in marketing and social behavioural studies. However, binary classification into positive and negative sentiments may not reveal much information about a product or service. This paper explores multi-class sentiment classification using machine learning methods. Three machine learning methods are investigated to examine their respective performance in multi-class sentiment classification of tweets. Experimental results show that the Extreme Learning Machine (ELM) achieves better performance than the other machine learning methods.

Key words: Extreme Learning Machine, Machine Learning, Multi-class Classification, Sentiment Analysis, Social Media, Tweets.

1 Introduction

The booming of tweets on Twitter, covering a wide variety of topics (e.g., products, organizations, people), is becoming a rapid and effective way of gauging public opinion for business marketing or social studies, which can be beneficial to private companies and public organizations. Sentiment analysis, or opinion mining, can be defined as the computational study of consumer opinions, sentiments and emotions, particularly towards specific products or services [1]. Sentiment classification can be thought of as a pattern-recognition and classification task that analyses unstructured data for the purpose of improving product or service quality [2].

Sentiment classification methods can be broadly categorized into two main groups: machine learning-based and non-machine learning-based methods [3, 4]. In general, machine learning-based methods achieve better classification results than non-learning-based methods (such as simple lexicon-based methods), and are widely used [4].

Extreme Learning Machine (ELM) is one of the more recent and popular machine learning-based methods [5]. It is a kind of feedforward network which considers the multi-hidden-layer network as a white box, trained layer by layer [6]. In general, ELM tends to perform better than other gradient-based learning algorithms [5]. It has been successfully applied in many real-world applications [7-11]. Other machine learning methods, such as Support Vector Machine (SVM), Naïve Bayes (NB) and Maximum Entropy, have also been used in classification applications [12-16]. Most of these machine learning methods have been applied to binary classification problems, and their performance in handling multi-class sentiment classification has not been well researched.

In this paper, we investigate the performance of different machine learning methods, namely ELM, multi-class SVM, and Multinomial Naïve Bayes, in multi-class sentiment classification.

The rest of the paper is organized as follows. Section 2 reviews relevant work on machine learning methods for classification problems. This is followed by the implementation of these methods for sentiment analysis of tweets in Section 3. The performance of the different machine learning-based methods is evaluated with case studies in Section 4. Section 5 concludes the paper with recommendations for future studies.

2 Relevant Work of Machine Learning Methods

The machine-learning methods reviewed herein include Multinomial Naïve Bayes, multi-class SVM, and ELM.

The Naïve Bayes classifier is a probabilistic classifier based on Bayes' theorem [17, 18]. By relaxing the conditional independence assumption for each of the features used in binary classification, the Multinomial Naïve Bayes classifier can deal with multi-class problems [19]. Given a set of objects, each belonging to a known class and described by a known vector of features, the algorithm constructs a rule for assigning future objects to a class, given only the feature vectors describing those objects. Let x_i be the feature vector of the multinomial model for the i-th document D_i, let n_i = \sum_t x_{it} be the total number of words in D_i, where x_{it} is the t-th element of x_i, and let P(w_t | C) be the probability of word w_t occurring in class C. Then, by the Naïve Bayes assumption of independence, the document likelihood P(D_i | C) can be written as:

\[ P(D_i \mid C) = \frac{n_i!}{\prod_{t=1}^{|V|} x_{it}!} \prod_{t=1}^{|V|} P(w_t \mid C)^{x_{it}} \tag{1} \]

The probability P(w_t | C) of each word in a given document class C can be estimated as:

\[ \hat{P}(w_t \mid C = k) = \frac{\sum_{i=1}^{N} x_{it} z_{ik}}{\sum_{s=1}^{|V|} \sum_{i=1}^{N} x_{is} z_{ik}} \tag{2} \]


where N is the total number of documents, and z_{ik} equals 1 when D_i belongs to class C = k and 0 otherwise. Each P(w_t | C) term in Multinomial Naïve Bayes is assumed to follow a multinomial distribution. The multinomial distribution works well for data that can easily be turned into counts, in this case word counts in text.
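As a concrete illustration, the following is a minimal sketch (our own, using scikit-learn rather than the paper's implementation) of a Multinomial Naïve Bayes sentiment classifier over word counts; the example tweets and labels are placeholders:

# Minimal Multinomial Naive Bayes sketch with scikit-learn (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

tweets = ["love the new update", "terrible service today", "train arrives at 5pm"]
labels = ["positive", "negative", "neutral"]        # three sentiment classes

vectorizer = CountVectorizer(ngram_range=(1, 2))    # unigram and bigram counts
X = vectorizer.fit_transform(tweets)                # x_it: count of word t in document i

clf = MultinomialNB()                               # estimates P(w_t | C = k) as in Eq. (2)
clf.fit(X, labels)
print(clf.predict(vectorizer.transform(["awful delays again"])))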

SVM is a non-probabilistic classifier that constructs a hyperplane in a high-dimensional space through the classification training process [4, 20]. The decision function can be defined as follows:

\[ f(x) = \operatorname{sign}\Big( \sum_i \alpha_i K(x_i^v, x) + b \Big) \tag{3} \]

where α_i is the Lagrange multiplier determined during SVM training. The parameter b, representing the shift of the hyperplane, is likewise determined during SVM training, with K(x_i^v, x) as the kernel function [21].
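A minimal multi-class SVM sketch (our assumption of a typical setup, not the paper's exact configuration) using scikit-learn, whose SVC handles multi-class labels internally:

# Minimal multi-class SVM sketch with scikit-learn (illustrative only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

tweets = ["great ride home", "stuck for an hour", "platform 2 as usual"]
labels = ["positive", "negative", "neutral"]

X = TfidfVectorizer().fit_transform(tweets)
clf = SVC(kernel="rbf")     # K(x_i^v, x) in Eq. (3); the kernel choice is an assumption
clf.fit(X, labels)
print(clf.predict(X[:1]))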

ELM, initially proposed by Huang [22], is different from BP and SVM, which consider the multi-layer network as a black box. It is also different from deep learning, which requires intensive tuning of hidden layers and hidden neurons; ELM theory shows that the hidden neurons do not need tuning, because the hidden-node parameters (c_i, a_i) are randomly assigned [23]. For N arbitrary distinct samples (x_k, t_k) ∈ R^n × R^m, a single ELM classifier with Ñ hidden nodes becomes the linear system

\[ \sum_{i=1}^{\tilde{N}} \beta_i G(x_k; c_i, a_i) = t_k, \quad k = 1, \ldots, N \tag{4} \]

where c_i ∈ R^n and a_i ∈ R are the learning parameters of the hidden nodes, which are randomly assigned; β_i is the weight connecting the i-th hidden node to the output nodes; x_k are the training examples; t_k is the target output for k = 1, ..., N; and G(x_k; c_i, a_i) is the output of the i-th hidden node with respect to the input x_k. The output weights can be written in matrix form as

\[ \beta = [\beta_1^T \ \cdots \ \beta_{\tilde{N}}^T]^T_{\tilde{N} \times m} \tag{5} \]

Equation (4) can be rewritten as:

\[ H\beta = T \tag{6} \]

where

\[ H(c_1, \ldots, c_{\tilde{N}}, a_1, \ldots, a_{\tilde{N}}, x_1, \ldots, x_N) = \begin{bmatrix} G(x_1; c_1, a_1) & \cdots & G(x_1; c_{\tilde{N}}, a_{\tilde{N}}) \\ \vdots & \ddots & \vdots \\ G(x_N; c_1, a_1) & \cdots & G(x_N; c_{\tilde{N}}, a_{\tilde{N}}) \end{bmatrix}_{N \times \tilde{N}} \tag{7} \]

\[ T = [t_1^T \ \cdots \ t_N^T]^T_{N \times m} \tag{8} \]


The output weights β can be determined by finding the least-squares solution

\[ \hat{\beta} = H^{\dagger} T \tag{9} \]

where H^† is the Moore-Penrose generalized inverse [24] of the hidden-layer output matrix H.

A single ELM classifier implements multi-class classification using a network architecture whose number of output nodes equals the number of pattern classes n. The network output can be written as y = (y_1, y_2, ..., y_n)^T. For each training example x, the target output t is coded into n bits (t_1, ..., t_n)^T: for a pattern of class i, only the target output t_i is "1" and the rest are "-1".
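To make the training procedure concrete, the following is a minimal NumPy sketch (our illustration, not the authors' code) of a single ELM classifier: the hidden-node parameters are randomly assigned and the output weights are solved with the Moore-Penrose pseudoinverse as in Eq. (9):

# Minimal ELM sketch: random hidden nodes (c_i, a_i), pseudoinverse for beta.
import numpy as np

def elm_train(X, T, n_hidden, rng):
    C = rng.standard_normal((X.shape[1], n_hidden))  # random input weights c_i
    a = rng.standard_normal(n_hidden)                # random biases a_i
    H = np.tanh(X @ C + a)                           # hidden-layer output matrix, Eq. (7)
    beta = np.linalg.pinv(H) @ T                     # least-squares solution, Eq. (9)
    return C, a, beta

def elm_predict(X, C, a, beta):
    return np.tanh(X @ C + a) @ beta                 # network output y

# Toy 3-class example; targets coded +1 for the true class, -1 otherwise.
rng = np.random.default_rng(1)
X = rng.standard_normal((30, 5))
y = np.repeat([0, 1, 2], 10)
T = -np.ones((30, 3))
T[np.arange(30), y] = 1
C, a, beta = elm_train(X, T, n_hidden=20, rng=rng)
pred = elm_predict(X, C, a, beta).argmax(axis=1)     # class = maximal output node
print((pred == y).mean())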

3 Proposed Implementation of ELM and Other Machine Learning Methods for Sentiment Classification

Applying machine learning methods to classification requires several steps. Pre-processing of the unstructured tweet data is the first step in implementing machine learning methods for sentiment classification. The collected data undergo cleaning, tokenization [25], and stemming [26] to convert them into structured text data. Cleaning involves the removal of URL links, usernames ("@username"), punctuation, whitespace, hashtags, etc. The structured texts are then tokenized with labels to create a word-feature list. Using the chi-square (χ²) statistic, the word features are assigned an intermediate score; the scores are subsequently updated to find collocations and bigrams. After pre-processing, the top n features in the list with the best bigrams and collocations are extracted to train the machine learning classifier; a pre-processing sketch is given below.

The latest enhancement methods [4] have been used to obtain the best possible results for the SVM and Naïve Bayes classifiers in this paper. In the case of ELM, the parameters are chosen to minimize the root mean square error (RMSE). Tables 1 and 2 list the parameter sets for a particular type of data, optimized without a regressor and with a ridge regressor, respectively. The selected parameter sets producing the least RMSE (marked with an asterisk in Tables 1 and 2) are used to train the classifier. The number of iterations for training and testing the classifier is limited to a maximum of 20.

Algorithm 1 shows the pseudo-code for minimizing the RMSE, implemented with the Python hyper-parameter optimization library Hyperopt [27]. The optimization is performed over a search space of parameters such as the number of hidden nodes, alpha, ridge alpha, and the radial basis function (RBF) width. The Tree-structured Parzen Estimator (TPE) [28] is used to optimize over the given conditions for training the ELM classifier.
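Concretely, the pre-processing described above might look as follows (a sketch with assumed details, using NLTK; the paper's exact pipeline is not given):

# Cleaning, tokenization, stemming, and chi-square bigram scoring (illustrative).
import re
from nltk.stem.porter import PorterStemmer
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

def clean(tweet):
    tweet = re.sub(r"http\S+|@\w+|#\w+", "", tweet)  # URLs, usernames, hashtags
    tweet = re.sub(r"[^\w\s]", "", tweet).lower()    # punctuation
    return tweet.split()                             # whitespace tokenization

stemmer = PorterStemmer()
tokens = [stemmer.stem(w) for w in clean("@user The MRT is so slow today! http://t.co/x")]

# Score bigrams with the chi-square measure and keep the top n.
finder = BigramCollocationFinder.from_words(tokens)
best_bigrams = finder.nbest(BigramAssocMeasures.chi_sq, 10)
features = dict.fromkeys(tokens + best_bigrams, True)  # word + bigram features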


Algorithm 1 Minimize Root Mean Square Error

procedure Require(numpy, hyperopt, sklearn, elm)
    input ← training file
    iteration_max ← 20
    test run: test(params, ridge)

test(params, ridge):
    hidden_n, alpha, rbf_width, activation_func ← params
    layer ← RandomLayer(params)
    ridge ← Ridge(ridge_alpha)
    elm ← pipeline([layer, ridge])
    elm.train
    p ← elm.predict
    rmse ← sqrt(mse(test, p))
    return rmse

parameters:
    hidden_n ← ploguniform(a, b)
    alpha ← uniform(a', b')
    rbf_width ← loguniform(c, d)
    ridge_alpha ← uniform(c', d')
    activation_func ← choice(tanh, sine, ..., gaussian)

main():
    best ← min(test_run, parameters, algo=tpe, iteration_max)
    print [rmse], [params], [ridge_alpha]
    print best


RMSE | Hidden node | alpha | RBF width | Activation function
5.53667074826311 | 96 | 0.524301588135944 | 0.00192745905713552 | inv tribas
0.117694994097736 | 23 | 0.362227734140632 | 7.05900482368913 | tanh
0.119176722319975 | 14 | 0.0862999041065237 | 0.0000154232670105994 | multiquadric
0.121964679113459 | 360 | 0.267976516333136 | 0.0000335082533900404 | softlim
0.128636656277498 | 27 | 0.81584265132577 | 0.0533106256560502 | sigmoid
0.116351185300275* | 728 | 0.82093232140078 | 0.0153925477267926 | hardlim
0.129012115416406 | 147 | 0.399149654069297 | 22.5662635770545 | inv multiquadric
0.117714432502741 | 734 | 0.301426069647834 | 3.6344151437243 | hardlim
0.119782506893025 | 570 | 0.430990031149163 | 0.000191208358888382 | tanh
0.119752243475673 | 10 | 0.869747469515247 | 0.0174366766856871 | softlim
0.123482000618614 | 17 | 0.999118291073565 | 1.94065082807598 | hardlim
0.156298356419104 | 405 | 0.854263388991001 | 0.0034193793063242 | gaussian
0.193590715031852 | 192 | 0.319027484499291 | 0.0644530059543435 | inv tribas
0.117524241972222 | 850 | 0.373112162422129 | 1.15291865399892 | hardlim
0.120402689130944 | 13 | 0.622356117194812 | 0.00298340650893961 | multiquadric
0.130039185254618 | 17 | 0.605762140378522 | 0.0165939924916861 | inv tribas
0.146980644034541 | 30 | 0.19477643934742 | 0.0000137035341999231 | tribas
0.126685956956185 | 23 | 0.813360130784076 | 0.923405718456997 | sigmoid
0.151775202209191 | 81 | 0.478881694502156 | 27.3035063972805 | inv multiquadric

Table 1. ELM optimized parameter sets for tweet data without any regressor (* = parameter set with the least RMSE)


RMSE | Hidden node | alpha | RBF width | Activation function | Ridge alpha
0.117219772709406 | 12 | 0.0973901333207798 | 1.48500198321801 | gaussian | 0.000240288138976629
0.171115181719981 | 87 | 0.894978744640209 | 48.345992769188 | multiquadric | 0.00518583033923811
0.112284877403487* | 965 | 0.1340791976002 | 0.0984951895364843 | multiquadric | 0.298806226947076
0.125655943802341 | 284 | 0.29186818541362 | 0.0985569804485006 | softlim | 0.541801573766452
0.122583509721702 | 196 | 0.018282092561289 | 0.927267663139151 | sine | 0.000561830406504939
0.112803169495448 | 675 | 0.0638767992050295 | 0.0778255414360264 | multiquadric | 0.719153883848897
0.1179177952119 | 77 | 0.516497025526854 | 0.190687099943598 | inv multiquadric | 31.118255688534
0.122992691773984 | 18 | 0.651179646424338 | 0.0125897731660419 | sigmoid | 0.0000180218698166426
0.131352330647929 | 17 | 0.596875638531792 | 2.90337806004152 | gaussian | 0.00000638733172727182
0.15669449022685 | 60 | 0.459878608804425 | 0.000359990252415064 | sigmoid | 0.00000856192646740847
0.117270497665036 | 26 | 0.182715737779089 | 6.22597637491161 | tanh | 0.00000288300935812876
0.117270497665036 | 109 | 0.381036038270451 | 7.75350974738416 | hardlim | 0.00000104973023516875
0.1131814807259 | 652 | 0.0769244046713881 | 0.0890436054223181 | multiquadric | 0.841414102931452
0.135652404731523 | 46 | 0.351763852755516 | 1.04207335382151 | inv multiquadric | 0.0195730715339201

Table 2. ELM optimized parameter sets for tweet data with ridge as the regressor (* = parameter set with the least RMSE)


4 Performance Evaluation with Discussion

4.1 Data Collection and Preparation

The datasets used in this study are tweet data obtained from two different sources.

In case 1, the data were downloaded from the "twitter-sentiment-analyzer" repository (https://github.com/ravikiranj/twitter-sentiment-analyzer/tree/master/data), which contains 1.6 million pre-classified tweets reported previously [4]. We downloaded ds 5k, ds 10k, ds 20k, and ds 40k, which consist of 5k, 10k, 20k, and 40k pre-classified tweets respectively. However, this dataset contains only binary sentiment tweets, i.e. positive and negative tweets, so we extracted neutral tweets from our own tweet collections, gathered using a Twitter wrapper API (application program interface), and appended them to the downloaded datasets.

In case 2, the data were collected through the Twitter API using the keyword "MRT" (Mass Rapid Transit) over the region of Singapore. Location-constraining geocodes were used to ensure that the tweets were collected within and around Singapore; a hedged collection sketch is given below. This data collection was performed with the aim of investigating public attitudes towards Singapore's public transportation services. The collected tweets were annotated manually with the help of field experts to obtain ground-truth data for the machine-learning-based methods.

In both cases 1 and 2, the training set contains 75% of the data, while the test set consists of the remaining 25%.
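The paper does not include its collection code; the sketch below shows one plausible way (assuming the tweepy wrapper, with placeholder credentials) to collect keyword- and geocode-constrained tweets. The centre point is Singapore (1.3521 N, 103.8198 E); the radius is an assumption:

# Keyword + geocode tweet collection via tweepy (illustrative; v4-style API).
import tweepy

auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET",
                                "ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

# Search for "MRT" tweets within ~50 km of central Singapore.
for tweet in tweepy.Cursor(api.search_tweets, q="MRT",
                           geocode="1.3521,103.8198,50km").items(100):
    print(tweet.text)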

4.2 Performance Evaluation

We trained each classifier, namely Multinomial Naïve Bayes, multi-class SVM, and ELM, on the training set and tested its accuracy on the test sets. The performance metric is classification accuracy, which measures how close the performance is to the ideal or benchmark value. For binary classification problems, accuracy is calculated as:

\[ \text{Accuracy} = \frac{T_p + T_n}{T_p + T_n + F_p + F_n} \tag{10} \]

where T_p is the number of correctly identified positives, F_p is the number of incorrectly identified positives, T_n is the number of correctly identified negatives, and F_n is the number of incorrectly identified negatives. For multi-class classification problems, accuracy is calculated as:

\[ \text{Accuracy} = N_c / N_t \tag{11} \]

where N_c is the number of correctly classified samples and N_t is the total number of samples.
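As a quick illustration of Eq. (11) (with made-up labels, not data from the paper):

# Multi-class accuracy: the fraction of correctly classified samples.
y_true = ["pos", "neg", "neu", "pos", "neg"]
y_pred = ["pos", "neg", "pos", "pos", "neu"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)  # N_c / N_t
print(accuracy)  # 0.6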

4.3 Case Studies

Case Study 1. In this case study, we used tweet data downloaded from the web and extracted from our tweet collections. Table 3 compares the classification accuracy of the different machine learning methods. The accuracy of ELM ranges from 68% to 99% and is higher than that of multi-class SVM and Multinomial Naïve Bayes for all the datasets: ELM is significantly better than Multinomial Naïve Bayes and marginally better than multi-class SVM. In general, ELM outperforms the others on the larger datasets, indicating the efficiency of ELM for multi-class classification.

Table 3. Comparison of machine-learning algorithms for classifying the downloaded tweet datasets

Datasets | Number of class | Number of feature | ELM | Multi-class SVM | Multinomial NB
ds 5k | 3 | 1000 | 83.50% | 83.07% | 71.07%
ds 10k | 3 | 4000 | 68.53% | 68.18% | 43.44%
ds 20k | 3 | 2000 | 91.12% | 84.73% | 78.92%
ds 40k | 3 | 8000 | 78.73% | 77.49% | 74.67%
tweets* | 3 | 2000 | 99.27% | 93.71% | 92.98%

Case Study 2. In this case study, tweets were collected through the Twitter API with search queries related to MRT services over the region of Singapore. The accuracy of the different classifiers is compared in Table 4. ELM achieves an accuracy of nearly 90%, outperforming multi-class SVM (84%) and Multinomial Naïve Bayes (76%).

Table 4. Comparison of machine-learning algorithms for classifying the MRT tweet dataset

Datasets | Number of class | Number of feature | ELM | Multi-class SVM | Multinomial NB
MRT DATA | 3 | 700 | 89.09% | 83.64% | 76.36%


5 Conclusions and Future Work

In this paper, we have investigated the performance of different machine learning classifiers, namely ELM, multi-class SVM, and Multinomial Naïve Bayes, for multi-class sentiment analysis. The experimental results show that ELM achieves better performance than the other machine learning methods for multi-class sentiment classification of tweet data. As the performance of machine learning methods depends on how features are selected, machine-learning-based multi-class sentiment classifiers may be improved by incorporating enhanced feature selection. Further studies on ELM with sophisticated feature selection techniques are currently being explored.

Acknowledgments. This work is supported by the A*STAR Joint Council Office Development Programme "Social Technologies+ Programme".

References

1. B. Pang and L. Lee, "Opinion mining and sentiment analysis," Foundations and Trends in Information Retrieval, vol. 2, no. 1-2, pp. 1-135, 2008.
2. B. Liu and L. Zhang, "A survey of opinion mining and sentiment analysis," in Mining Text Data. Springer, 2012, pp. 415-463.
3. Z. Wang, V. J. C. Tong, and D. Chan, "Issues of social data analytics with a new method for sentiment analysis of social media data," in 2014 IEEE 6th International Conference on Cloud Computing Technology and Science (CloudCom 2014). IEEE, 2014, pp. 899-904.
4. Z. Wang, V. J. C. Tong, and H. C. Chin, "Enhancing machine-learning methods for sentiment classification of web data," in Information Retrieval Technology. Springer, 2014, pp. 394-405.
5. G.-B. Huang, "Extreme learning machine: theory and applications," Neurocomputing, vol. 70, no. 1, pp. 489-501, 2006.
6. G.-B. Huang, Z. Bai, L. L. C. Kasun, and C. M. Vong, "Local receptive fields based extreme learning machine," IEEE Computational Intelligence Magazine, vol. 10, no. 2, pp. 18-29, 2015.
7. N.-Y. Liang, P. Saratchandran, G.-B. Huang, and N. Sundararajan, "Classification of mental tasks from EEG signals using extreme learning machine," International Journal of Neural Systems, vol. 16, no. 1, pp. 29-38, 2006.
8. S. D. Handoko, K. C. Keong, O. Y. Soon, G. L. Zhang, and V. Brusic, "Extreme learning machine for predicting HLA-peptide binding," in Advances in Neural Networks - ISNN 2006. Springer, 2006, pp. 716-721.
9. C.-W. Yeu, M.-H. Lim, G.-B. Huang, A. Agarwal, and Y.-S. Ong, "A new machine learning paradigm for terrain reconstruction," IEEE Geoscience and Remote Sensing Letters, vol. 3, no. 3, pp. 382-386, 2006.
10. J. Kim, H. Shin, Y. Lee, and M. Lee, "Algorithm for classifying arrhythmia using extreme learning machine and principal component analysis," in Engineering in Medicine and Biology Society, 2007 (EMBS 2007), 29th Annual International Conference of the IEEE. IEEE, 2007, pp. 3257-3260.


11. G. Wang, Y. Zhao, and D. Wang, "A protein secondary structure prediction framework based on the extreme learning machine," Neurocomputing, vol. 72, no. 1, pp. 262-268, 2008.
12. P. Chaovalit and L. Zhou, "Movie review mining: A comparison between supervised and unsupervised classification approaches," in Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS'05). IEEE, 2005, pp. 112c-112c.
13. B. Galitsky and E. W. McKenna, "Sentiment extraction from consumer reviews for providing product recommendations," US Patent App. 12/119,465, May 12, 2008.
14. M. Hu and B. Liu, "Mining and summarizing customer reviews," in Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2004, pp. 168-177.
15. J. Si, A. Mukherjee, B. Liu, Q. Li, H. Li, and X. Deng, "Exploiting topic based twitter sentiment for stock prediction," in ACL (2), 2013, pp. 24-29.
16. L. Chi, X. Zhuang, and D. Song, "Investor sentiment in the Chinese stock market: an empirical analysis," Applied Economics Letters, vol. 19, no. 4, pp. 345-348, 2012.
17. L. A. Dalton and E. R. Dougherty, "Optimal classifiers with minimum expected error within a Bayesian framework, part II: Properties and performance analysis," Pattern Recognition, vol. 46, no. 5, pp. 1288-1300, 2013.
18. V. Muralidharan and V. Sugumaran, "A comparative study of naïve Bayes classifier and Bayes net classifier for fault diagnosis of monoblock centrifugal pump using wavelet analysis," Applied Soft Computing, vol. 12, no. 8, pp. 2023-2029, 2012.
19. E. Pappas and S. Kotsiantis, "Integrating global and local application of discriminative multinomial Bayesian classifier for text classification," in Intelligent Informatics. Springer, 2013, pp. 49-55.
20. C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 2, no. 3, p. 27, 2011.
21. E. Byvatov, U. Fechner, J. Sadowski, and G. Schneider, "Comparison of support vector machine and artificial neural network systems for drug/nondrug classification," Journal of Chemical Information and Computer Sciences, vol. 43, no. 6, pp. 1882-1889, 2003.
22. G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, "Extreme learning machine: a new learning scheme of feedforward neural networks," in Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, vol. 2. IEEE, 2004, pp. 985-990.
23. G.-B. Huang, "An insight into extreme learning machines: random neurons, random features and kernels," Cognitive Computation, vol. 6, no. 3, pp. 376-390, 2014.
24. C. R. Rao and S. K. Mitra, Generalized Inverse of Matrices and Its Applications. Wiley, New York, 1971, vol. 7.
25. J. Maršík and O. Bojar, "TrTok: A fast and trainable tokenizer for natural languages," The Prague Bulletin of Mathematical Linguistics, vol. 98, pp. 75-85, 2012.
26. P. Willett, "The Porter stemming algorithm: then and now," Program, vol. 40, no. 3, pp. 219-223, 2006.
27. J. Bergstra, D. Yamins, and D. D. Cox, "Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms," 2013.
28. J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, "Algorithms for hyper-parameter optimization," in Advances in Neural Information Processing Systems, 2011, pp. 2546-2554.