Enhancing Machine-Learning Methods for Sentiment Classification of Web Data

Zhaoxia Wang¹, Victor Joo Chuan Tong¹, and Hoong Chor Chin²

¹ Social and Cognitive Computing Department, Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore 138632
{wangz,tongjc}@ihpc.a-star.edu.sg
² Department of Civil & Environmental Engineering, National University of Singapore, Singapore 117578
[email protected]

Abstract. With advances in Web technologies, more and more people are turning to popular social media platforms such as Twitter to express their feelings and opinions on a variety of topics and current issues online. Sentiment analysis of Web data is becoming a fast and effective way of evaluating public opinion and sentiment for use in marketing and social behavioral studies. This research investigates enhancement techniques for machine-learning methods for sentiment classification of Web data. Feature selection, negation dealing and emoticon handling are studied in this paper for their ability to improve the performance of machine-learning methods. The range of enhancement techniques is tested on different text data sets, such as tweets and movie reviews. The results show that the different enhancement methods improve classification efficacy and accuracy to different degrees.

Keywords: Emoticon handling, negation dealing, feature selection, hybrid method, machine learning, sentiment classification, Twitter, Web data.

1 Introduction

With the advent of social media, the ways in which people communicate their comments, feedback and critiques have changed dramatically. They can now post reviews and opinions on discussion topics, products, services, policies and other issues through blogs, social networks and social media such as Twitter. Twitter is a microblogging service that allows its users to publish short status updates known as "tweets", which are limited to 140 characters in length. The service boasts over 140 million active users and over 340 million tweets per day [1]. Tweets, as a popular form of Web data, reflect users' emotions and attitudes on almost every topic for which they can find readers and listeners [2]. As a result, sentiment analysis has emerged as a powerful tool for using tweet data to extract useful information for public organizations and governments, as well as private organizations and citizens' groups [1].


Sentiment analysis can be used to characterize a user's attitude towards a topic or issue of interest based on the identification of patterns of reactions that can be discovered within text-based data that is posted and made available for collection [3,4,5]. Sentiment patterns hidden within comments, feedback and critiques often provide useful information that can be leveraged for different purposes [1]. One example is to use the data shared by Internet users, reflecting their sentiments toward services or products, to help improve product or service quality.

Sentiment patterns can be categorized into various types, for example, positive, negative, neutral and ambivalent (mixed), or into more detailed categories, such as very good, good, satisfactory, bad, and very bad [6]. Sentiment pattern analysis can therefore be thought of as a pattern-detection and classification task in which each category takes the form of a sentiment pattern [3].

Sentiment-classification methods can be broadly categorized into two main classes: lexicon-based and machine-learning-based methods. A lexicon-based method derives the dominant polarity of a text (i.e., positive or negative) by searching for opinion and emotion indicators based on lexicons [7,8]. It derives the sentiment of the entire text from the occurrences of these words or phrases in the lexicons. However, the accuracy of existing lexicon-based approaches is limited by semantic ambiguity. The other class comprises learning-based approaches, which derive relationships between the features of text segments. Models based on such machine-learning methods typically require a large training database. This approach can achieve better classification results than simple lexicon-based approaches and is widely used [9,10]. However, it is difficult to improve the performance of such machine-learning methods even when there are enough training data [11]. Therefore, how to increase the classification accuracy of machine-learning-based methods through improved knowledge and design has been a key concern for many researchers [11].

Several enhancements, i.e., feature selection [12,13,14], negation dealing [10,15] and emoticon handling [16,17], have been used to handle the sentiment classification problem. While these enhancements have been employed previously, how they perform alongside different machine-learning methods has not been well researched. Therefore, in this paper, we seek to demonstrate how these enhancement techniques can improve the efficacy and accuracy of sentiment classification of Web data for the machine-learning methods studied. Among the many existing machine-learning-based methods, naïve Bayes (NB) [18,19], Maximum Entropy (MaxEnt) [20] and support vector machine (SVM) [21,22] classifiers are chosen for investigation in this paper, because they have been commonly applied in text data analysis.

2 Web Data Collections and Preparations

The Internet has long been recognized as an accessible, inexpensive and effective channel through which to collect all kinds of cross-sectional data. Web data can be downloaded directly from the Web or collected by using various application programming interfaces (APIs) and web crawlers that are provided by third parties.


We made use of two types of dataset: (1) data downloaded directly from third parties and (2) data collected using an API.

For the first type of dataset, we downloaded data from a "twitter-sentiment-analyzer" website which contained 1.6 million pre-classified tweets prepared as part of a research effort [23]. We downloaded ds_10k, ds_20k, ds_40k, ds_200k, ds_400k and ds_1400k, which consisted of 10k, 20k, 40k, 200k, 400k and 1.4m pre-classified tweets respectively. We also downloaded movie-review data [24] for use in the sentiment-analysis experiments. This is a collection of 1000 positive and 1000 negative processed movie-review documents labeled with respect to their overall sentiment polarity.

Twitter provides an API that allows easy access to tweets. Using the GET search/tweets resource, we could search for tweets on a specific topic or keyword and limit the search to a specific geographical region and language. Sentiment patterns derived from public-domain social media data may be indicators of changes that can have negative and potentially serious consequences, as is the case with worsening social sentiment related to public transportation, the degradation of air and water quality, and other issues that affect land use and livability [25]. Therefore, the second type of data was collected through the Twitter API using the keyword "MRT" (for "Mass Rapid Transit") in Singapore. We used location-constraining geocodes to ensure that the tweets originated from Singapore. The collection and analysis of such data can help government agencies and other organizations to understand public attitudes toward urban transportation services through sentiment analysis.

Since a training set of pre-classified data is necessary for machine-learning-based methods, the collected data needed to be pre-classified to obtain training data if it had not already been annotated or labeled. To analyze the tweets on Singapore's MRT service from citizens and residents, two social scientists with domain expertise in MRT performance were tasked to extract relevant tweets. Eight assistants with different backgrounds (e.g. students and researchers from the National University of Singapore, students from high schools, and scientists from the Social and Cognitive Computing group) worked as annotators to classify the tweets manually. They performed the classification tasks independently. We compared and analyzed the classifications of the eight annotators and found that the agreement percentages among them were between 80.1% and 86.8%; different people have different understandings of the same tweet. We selected the tweets to which all annotators gave the same classification and excluded those on which the annotators disagreed, to form a human-annotated dataset. The resulting human-annotated datasets (HA data), together with the downloaded pre-classified data, were used to test the enhancement of the classifiers. We separated each dataset into a training set and a testing set, containing three-quarters and one-quarter of the entries respectively, as sketched below.
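The paper does not give code for this preparation step; the following is a minimal Python sketch, under assumed data structures and names, of keeping only unanimously labeled tweets and then splitting them 3:1 into training and testing sets.

```python
# Sketch only: keep tweets on which every annotator agrees, then split 3:1
# into training and testing sets. The data structure, names and example
# tweets are assumptions, not the authors' actual pipeline.
import random

def build_datasets(annotations, train_fraction=0.75, seed=42):
    """annotations maps tweet text -> list of labels given by the annotators."""
    agreed = [(tweet, labels[0]) for tweet, labels in annotations.items()
              if len(set(labels)) == 1]          # unanimous label only
    random.Random(seed).shuffle(agreed)
    cut = int(len(agreed) * train_fraction)
    return agreed[:cut], agreed[cut:]            # (training set, testing set)

train_set, test_set = build_datasets({
    "MRT broke down again this morning": ["negative"] * 8,
    "smooth ride on the new line today": ["positive"] * 8,
    "waiting at the platform": ["neutral"] * 7 + ["negative"],  # dropped: disagreement
    "love the new MRT carriages": ["positive"] * 8,
})
```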

3 Methods and Enhancements

Sentiment classification can be described as a process in which a classification algorithm determines the target category to which a piece of data belongs. There is a wide variety of machine-learning-based methods, such as the naïve Bayes (NB) classifier, the Maximum Entropy (MaxEnt) classifier and the support vector machine (SVM).

The naïve Bayes classifier is a probabilistic classifier that assumes statistical independence of each feature (or word) and is a conditional model based on Bayes' formula [18,19]:

P(ci | d) = P(ci) · P(d | ci) / P(d)    (1)

where P(ci | d) is the posterior probability of instance d being in class ci, and P(ci) is the prior probability of occurrence of class ci. It can be calculated as P(ci) = Ni / N, where Ni is the number of textual data items assigned to class ci and N is the total number of textual data items. P(d | ci) is the probability of generating instance d given class ci, and P(d) is the prior probability of instance d occurring. For a textual data item d = (d1, d2, …, dn), with the independence assumption, the naïve Bayes classifier is the function [18,19]:

Classify(d) = argmax_ci P(ci) · P(d1 | ci) · P(d2 | ci) ⋯ P(dn | ci)    (2)
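As an illustration of Eqns. 1 and 2, the sketch below trains a multinomial naïve Bayes sentiment classifier on bag-of-words counts. The paper does not specify an implementation; scikit-learn and the toy training data are assumptions made purely for illustration.

```python
# Illustrative naive Bayes sentiment classifier (Eqns. 1-2) over word counts.
# scikit-learn and the toy data are illustrative assumptions, not the paper's setup.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["I love this service", "great ride today",
               "the train was late again", "terrible crowding"]
train_labels = ["positive", "positive", "negative", "negative"]

vectorizer = CountVectorizer()              # bag-of-words features d1..dn
X_train = vectorizer.fit_transform(train_texts)

clf = MultinomialNB()                       # estimates P(ci) and P(dk|ci) from counts
clf.fit(X_train, train_labels)

print(clf.predict(vectorizer.transform(["what a great trip"])))  # ['positive']
```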

MaxEnt is another probabilistic classifier and uses a multinomial logistic regression model [20]. It is closely related to the naïve Bayes classifier, but the model uses search-based optimization to find the feature weights that maximize the likelihood of the training data. For each feature di and class cj, a joint feature g(di, cj) = m is defined, where m is the number of times that di occurs in a document of class cj. Via iterative optimization, a weight γi is assigned to each joint feature g(di, cj) so as to maximize the log-likelihood of the training data. The probability of class cj given a document d and weights γ is:

P(cj | d, γ) = exp( Σi γi g(di, cj) ) / Σc′ exp( Σi γi g(di, c′) )    (3)

The difference between these two probabilistic classifiers, naïve Bayes and MaxEnt, can be seen by comparing Eqns. 1 and 3.
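For comparison, a MaxEnt model can be fitted as a multinomial logistic regression, as sketched below; scikit-learn's LogisticRegression stands in for the iterative weight optimization of Eqn. 3, and the toy data are again hypothetical.

```python
# Illustrative MaxEnt classifier: logistic regression over the same bag-of-words
# features; the solver's iterative optimization finds the weights gamma of Eqn. 3.
# Library choice and toy data are assumptions for illustration only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["awesome new trains", "smooth and fast commute",
         "broke down again", "so slow and packed"]
labels = ["positive", "positive", "negative", "negative"]

vec = CountVectorizer()
X = vec.fit_transform(texts)                 # word counts play the role of g(di, cj)

maxent = LogisticRegression(max_iter=1000)   # maximizes the training log-likelihood
maxent.fit(X, labels)
print(maxent.predict_proba(vec.transform(["so packed again"])))  # P(cj | d, gamma)
```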

SVM is a non-probabilistic classifier that works by constructing a decision surface in a high-dimensional space [21,22]. The principle of the SVM algorithm is to find a decision surface, called the hyperplane, that optimally splits the training set. The training data are mapped to a very high-dimensional space, and the algorithm then finds the hyperplane in this space with the largest margin, separating the data into different groups [21]. The decision function is defined as follows:

f(x) = sign( Σi αi yi K(xi, x) + b )    (4)


where the αi are Lagrange multipliers determined during SVM training, yi is the class label of training example xi, and the parameter b, which determines the shift of the hyperplane, is also determined during SVM training. K(xi, x) is the kernel function [26]. Parameter selection is a pivotal step that decides the performance of the SVM [21], [26]. We tested different parameters and found that, for this problem, the best parameter selection is svm_type "C-SVC" (multi-class classification) with kernel_type "LINEAR".
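A minimal sketch of that linear C-SVC configuration is given below, using scikit-learn's SVC (a wrapper around LIBSVM); the toy data and the bag-of-words features are illustrative assumptions, not the authors' exact setup.

```python
# Illustrative linear C-SVC sentiment classifier; SVC wraps LIBSVM's C-SVC,
# matching the svm_type/kernel_type choice described above. Toy data only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

texts = ["love the new carriages", "clean and punctual service",
         "stuck in the tunnel again", "awful breakdown this morning"]
labels = ["positive", "positive", "negative", "negative"]

vec = CountVectorizer()
X = vec.fit_transform(texts)

svm = SVC(kernel="linear", C=1.0)          # linear kernel K(xi, x) = xi . x
svm.fit(X, labels)

x_new = vec.transform(["another breakdown"])
print(svm.decision_function(x_new))        # signed distance from the hyperplane (Eqn. 4)
print(svm.predict(x_new))                  # predicted sentiment class
```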

In this paper we explore these three machine-learning methods and integrate feature-selection, negation-dealing and emoticon-handling techniques to test and compare the performance of the machine-learning methods. We started with a basic implementation of each of these machine-learning classifiers. After analyzing their performance, we improved them by integrating different enhancement techniques, including feature selection [12,13,14], negation dealing [10] and emoticon handling [16,17].

3.1 Feature Selection

The main difficulties in the implementation of machine-learning classifiers are the learning speed and effectiveness [12]. Without supercomputers, it is difficult to perform the task using machine-learning methods when dealing with high-dimensional data, especially with huge training datasets. In this paper, Chi-Square feature selection is leveraged to reduce the number of data dimensions and thereby improve the performance of the classifiers. Chi-Square feature selection not only reduces the number of data dimensions, it also removes irrelevant, redundant and noisy data [13]. Chi-Square feature selection can be described as follows:

χ²(c, f) = N (AI − BH)² / [ (A + H)(B + I)(A + B)(H + I) ]    (5)

where N is the total number of training data items, A is the number of data items that contain the feature f and belong to category c, B is the number of data items that contain the feature f but do not belong to category c, H is the number of data items that do not contain the feature f but belong to category c, and I is the number of data items that neither contain the feature f nor belong to category c. The priority of each feature is calculated with the Chi-Square equation (Eqn. 5), and only the top n features are selected, as sketched below.
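The sketch below illustrates this selection step with scikit-learn's chi2 scorer and SelectKBest; the library choice, n = 3 and the toy data are assumptions for illustration only.

```python
# Illustrative Chi-Square feature selection (Eqn. 5): score every word feature
# against the class labels and keep only the top-n. Library and data are
# assumptions, not the authors' exact implementation.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2

texts = ["love the smooth ride", "great service today",
         "hate the long delay", "worst commute ever"]
labels = ["positive", "positive", "negative", "negative"]

vec = CountVectorizer()
X = vec.fit_transform(texts)

selector = SelectKBest(chi2, k=3)              # keep the 3 highest-priority features
X_reduced = selector.fit_transform(X, labels)

kept = vec.get_feature_names_out()[selector.get_support()]
print(kept, X_reduced.shape)                   # selected words and reduced matrix size
```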

3.2 Negation Dealing

During human comprehension of a sentence, negative words or phrases such as 'do not', 'never' and 'seldom' can be indicators for judging the orientation of the sentence [10]. For example, 'love' is a positive word, but it does not make the sentence positive in "I do not love this". Two ways of dealing with negation words are considered in this paper: (1) appending a NEGATE tag to the word directly after the negation word (NEGdword), and (2) appending a NEGATE tag to all words after the negation word until reaching a punctuation mark (NEGall). Both schemes are sketched below.
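The two schemes can be implemented as simple token-level transforms, as in the sketch below; the _NEGATE suffix, the negation word list and the punctuation set are illustrative choices rather than the paper's exact ones.

```python
# Sketch of the two negation-marking schemes (NEGdword and NEGall).
# The tag name, negation list and punctuation set are illustrative choices.
import re

NEGATIONS = {"not", "no", "never", "seldom", "cannot", "don't", "doesn't", "didn't"}
PUNCT = re.compile(r"[.,!?;:]")

def neg_dword(tokens):
    """Append _NEGATE to the single word directly after a negation word."""
    out = list(tokens)
    for i in range(len(tokens) - 1):
        if tokens[i].lower() in NEGATIONS:
            out[i + 1] += "_NEGATE"
    return out

def neg_all(tokens):
    """Append _NEGATE to every word after a negation word, up to punctuation."""
    out, negating = [], False
    for tok in tokens:
        if PUNCT.search(tok):
            negating = False
            out.append(tok)
        elif tok.lower() in NEGATIONS:
            negating = True
            out.append(tok)
        else:
            out.append(tok + "_NEGATE" if negating else tok)
    return out

print(neg_dword("I do not love this".split()))
# ['I', 'do', 'not', 'love_NEGATE', 'this']
print(neg_all("I do not love this train , it is fine".split()))
# ['I', 'do', 'not', 'love_NEGATE', 'this_NEGATE', 'train_NEGATE', ',', 'it', 'is', 'fine']
```

Marking only the immediately following word (NEGdword) is conservative, while marking every word up to the next punctuation mark (NEGall) captures longer negated phrases at the risk of over-marking.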

3.3 Emoticon Handling

Analyzing the collected tweets, we observed that emoticons were very often used by Twitter users and indicated the sentiment orientation of the tweets [17]. To make use of emoticons, we considered the variants of positive and negative emoticons shown in Table 1. We considered only these emoticons because they are widely used and have no ambiguous meaning. Analyzing the collected data, we found that there were sometimes negative emoticons in positive comments; typing errors (e.g. typing a frowning face instead of a smiling face) could account for these. We assumed that the chance of mistyping an emoticon was low.

Table 1. Emoticon list

Positive Emoticons: :P ;P :)) :) :p ;p ^_^ (: (; =) (= :-) :3 ^-^ ;-) (-: (-; :~) ;~) ;3
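A sketch of this handling is given below: unambiguous emoticons are mapped to explicit sentiment tokens before feature extraction. The positive list follows Table 1; the negative list and the token names are assumptions, since the negative column of the table is not reproduced in this excerpt.

```python
# Sketch of emoticon handling: replace unambiguous emoticons with sentiment
# tokens the classifiers can learn from. Positive list from Table 1; the
# negative list and token names are assumptions for illustration.
POSITIVE_EMOTICONS = [":))", ":)", ":P", ";P", ":p", ";p", "^_^",
                      "(:", "(;", "=)", "(=", ":-)", ":3", "^-^",
                      ";-)", "(-:", "(-;", ":~)", ";~)", ";3"]
NEGATIVE_EMOTICONS = [":(", ":-(", ":'("]        # assumed examples only

def replace_emoticons(text):
    """Map emoticons to EMO_POS / EMO_NEG tokens before tokenization."""
    for emo in POSITIVE_EMOTICONS:               # longer patterns first (":))" before ":)")
        text = text.replace(emo, " EMO_POS ")
    for emo in NEGATIVE_EMOTICONS:
        text = text.replace(emo, " EMO_NEG ")
    return text

print(replace_emoticons("finally a train on time :) ^_^"))
# prints the text with both emoticons replaced by EMO_POS
```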

4