Sentiment Analysis and Visualization of Social Media Data: The #BostonMarathon #Bombings Test Case

Amir Salarpour
Department of Computer Engineering, Bu-Ali Sina University, Hamedan, Iran
[email protected]

Mohammad Hossein Bamneshin
Department of Computer Engineering, Bu-Ali Sina University, Hamedan, Iran
[email protected]

Dimitris Proios
Department of Informatics and Telematics, Harokopio University of Athens, Athens, Greece
[email protected]

Abstract—This work aims a) to perform sentiment analysis on social media data using machine learning methods and b) to propose a user-friendly visualization of these data.

Keywords—sentiment analysis, data visualization, machine learning

I. INTRODUCTION

The target of this project is a) to perform sentiment analysis (SA) on social media text data (e.g. product reviews, tweets) using machine learning algorithms and b) to create a visualization summary of these data, taking also into account the SA output.

To perform sentiment analysis we pre-processed the text data using GATE (General Architecture for Text Engineering). The output was used to construct feature vectors (feature extraction), which in turn were used to train several machine learning models. Finally, the learnt models were evaluated in terms of accuracy, which is the proportion of test instances (reviews) that were classified in the correct category. For the visualization task we used D3 (Data-Driven Documents), a JavaScript library that provides functionality to display data in graphical charts.

Fig. 1. Overall system's workflow

Fig. 2. Sentiment analysis flow

II. DATASET AND ANNOTATION

A. Dataset Description

The following two datasets were used for the experiments:

Product Reviews Corpus (PRC). The PRC is part of the Wishful Expressions Corpora [1]. It contains 1235 sentences with customer product reviews from Amazon.com and cnet.com, collected by Bing Liu¹ and his colleagues and used in several publications. Two examples of such reviews are the following:

“i will never buy their product again at this rate and neither should you”
“the product has worked perfectly for me on my xp”

This corpus was annotated by ILSP² with respect to specific sentiment categories (see Section II.B below).

Boston Corpus (BC). The BC has been collected and annotated by ILSP³. It contains 5000 tweets related to the Marathon event and the bombings that took place in Boston on 15/4/2013. The Marathon started at 09:00 and the explosions occurred at 14:59. The tweets were collected in the timeframe between 14/4/2013 at 21:00 and 15/4/2013 at 19:46.

¹ The original data and publications can be found at http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
² Institute for Language and Speech Processing, Athena R.C.
³ This corpus cannot be distributed without the permission of ILSP. Any questions on this data should be directed to Haris Papageorgiou ([email protected])

Some examples of these tweets are the following:

“Excited for the #bostonmarathon tomorrow #makeitcount #findgreatness”
“Best of luck to all those running #bostonmarathon today! Have FUN and enjoy!!”
“#bostonmarathon explosions! Horrifying site :( let's pray for the may affected #PrayersforBoston”
“our prayers go out to Boston this afternoon.”

We randomly selected and annotated a subset of 1027 tweets. This sub-corpus was initially used to test the best classifier trained on the product reviews corpus. In a second phase, it was combined with the product reviews corpus in order to train the final sentiment model.

B. Annotation and Inter-Annotator Agreement (IAA)

Product Reviews Corpus: Each sentence of the PRC was judged⁴ by two annotators with respect to the following categories:

Subjective: any text about private states (sentiments, opinions, emotions, feelings, thoughts, beliefs, etc.) expressed by an author.
Objective: text that contains only factual information.
Positive: any text containing positive opinions, emotions, feelings, etc., or facts/events that may trigger positive sentiment.
Negative: any text containing negative opinions, emotions, feelings, etc., or facts/events that may trigger negative sentiment.
Praise: positive evaluations and opinions about specific entities or topics and their aspects, explicitly or implicitly expressed by an author.
Criticism: negative evaluations and opinions about specific entities or topics and their aspects, explicitly or implicitly expressed by an author.

To assess the agreement of the two annotators we used Cohen's kappa coefficient and PSA (Proportion of Specific Agreement). The results are shown in Table I.

TABLE I. INTER-ANNOTATOR AGREEMENT

Method          | Sub    | Obj    | Pos    | Neg    | Prais  | Crit
Kappa           | 0.0124 | 0.0123 | 0.6966 | 0.6254 | 0.6785 | 0.5690
PSA for 0-label | 0.0171 | 0.8372 | 0.8517 | 0.8778 | 0.8749 | 0.9071
PSA for 1-label | 0.8377 | 0.0171 | 0.8436 | 0.7450 | 0.8029 | 0.6616
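For reference, the two agreement measures reported in Table I can be computed as in the following minimal Python sketch. It only illustrates the formulas on toy data; it is not the tooling actually used for the annotation study.

from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length lists of binary labels."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    pa, pb = Counter(a), Counter(b)
    p_e = sum((pa[c] / n) * (pb[c] / n) for c in (0, 1))  # chance agreement
    return (p_o - p_e) / (1 - p_e)

def specific_agreement(a, b, label):
    """PSA for one class: 2 * joint choices of the label / total times either annotator chose it."""
    both = sum(x == label and y == label for x, y in zip(a, b))
    total = sum(x == label for x in a) + sum(y == label for y in b)
    return 2 * both / total

# toy example with skewed classes: raw agreement is high, kappa is near zero
ann1 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
ann2 = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
print(cohen_kappa(ann1, ann2))            # 0.0  (chance-corrected agreement)
print(specific_agreement(ann1, ann2, 1))  # ~0.95 (agreement on the 1-label)
print(specific_agreement(ann1, ann2, 0))  # 0.0  (agreement on the 0-label)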

When the numbers of instances in each class differ considerably, the kappa coefficient can be misleading. We therefore also used PSA, which shows the agreement for each class separately. For the subjective category we notice that the agreement for the 1-label class is high but not for the 0-label class, and vice versa for the objective category. In contrast, the results show high agreement for the positive and negative categories. The IAA is also high for the praise category. In the criticism category the agreement is high for the 0-label class (0.9071) and lower for the 1-label class (0.6616); this asymmetry between the two criticism classes results in a low kappa. We decided to focus on the positive and negative categories, where the IAA was substantial. In total, the two annotators disagreed on 300 sentences. These conflicts were resolved by an expert annotator in order to create a reliable corpus.

Boston Corpus: The randomly selected subset of 1027⁵ tweets was annotated similarly to the product reviews corpus by a domain expert from ILSP⁶. Again, we focus on the positive and negative categories.

III. FEATURES AND DATA PRE-PROCESSING

A. Features

ILSP provided two types of features for the sentiment experiments:

• Lexical features assigned to each token of a text after pre-processing it with a custom Natural Language Processing pipeline in GATE (see Section III.B below). Examples of such features are part-of-speech tags, punctuation, orthography, etc.

• Sentiment lexicon-based features resulting from the following lexica:

Opinion Lexicon [2, 3]: contains 4783 negative and 2006 positive opinion words, i.e. words used by writers and speakers to express their opinions toward some target.
ANEW (Affective Norms for English Words) [4]: contains 1034 words with valence, arousal and dominance scores. These words had been previously identified as bearing meaningful emotional content.
Attitude Lexicon [5]: contains 4075 "attitude" words classified into particular categories using specific syntactic and semantic criteria. Attitude is examined within the scope of Appraisal Theory [6], and these words are considered a linguistic device for expressing evaluations (criticism or praise) toward some target.
Intention Lexicon⁷: contains 355 words and expressions used to express future intentions such as commitments, promises and desires. Each entry is classified into particular categories using specific syntactic and semantic criteria.

B. Data Pre-processing

The dataset was pre-processed using GATE (General Architecture for Text Engineering) by applying the following pipeline:

⁴ The annotated corpus cannot be distributed without the permission of ILSP.
⁵ After removing the duplicates (re-tweets) we ended up with a corpus of 776 distinct tweets.
⁶ The annotated corpus cannot be distributed without the permission of ILSP.
⁷ This lexicon is being developed by ILSP (Pontiki Maria, Thanasis Kalogeropoulos and Haris Papageorgiou) and is not published yet.


Fig. 3. NLP pipeline

The final output of the GATE pre-processing for each text is stored in an XML file.

IV. SENTIMENT ANALYSIS USING MACHINE LEARNING

A. Experiments on the product reviews dataset

We split the 1235 reviews of the PRC into two parts: 70% was used for training and 30% for testing.

1) Feature extraction: We parsed each GATE XML file (using MATLAB) and calculated the following 32 features for each review (an illustrative sketch of this step is given after the list):

1. Number of different categories present in each sentence.
2. Number of NN tags.
3. Number of JJ tags.
4. Number of VB tags.
5. Number of RB tags.
6. Number of W tags.
7. Number of negation words.
8. Number of words detected by the Opinion Lexicon.
9. Number of negative words detected by the Opinion Lexicon.
10. Number of positive words detected by the Opinion Lexicon.
11. Number of words detected by the Attitude Lexicon.
12. Number of negative words detected by the Attitude Lexicon.
13. Number of positive words detected by the Attitude Lexicon.
14. Number of "both" (positive and negative) words detected by the Attitude Lexicon.
15. Number of JJ words detected by the Attitude Lexicon.
16. Number of NN words detected by the Attitude Lexicon.
17. Number of RB words detected by the Attitude Lexicon.
18. Number of negative JJ words detected by the Attitude Lexicon.
19. Number of negative NN words detected by the Attitude Lexicon.
20. Number of negative RB words detected by the Attitude Lexicon.
21. Number of positive JJ words detected by the Attitude Lexicon.
22. Number of positive NN words detected by the Attitude Lexicon.
23. Number of positive RB words detected by the Attitude Lexicon.
24. Number of "both" JJ words detected by the Attitude Lexicon.
25. Number of "both" NN words detected by the Attitude Lexicon.
26. Number of "both" RB words detected by the Attitude Lexicon.
27. Average of the Valence mean for words covered by the ANEW Lexicon.
28. Average of the Dominance mean for words covered by the ANEW Lexicon.
29. Average of the Arousal mean for words covered by the ANEW Lexicon.
30. Number of Desire words detected using the Intention Lexicon.
31. Number of Commitment words detected using the Intention Lexicon.
32. Number of Purpose words detected using the Intention Lexicon.
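The feature extraction itself was implemented in MATLAB over the GATE XML output. The following Python sketch illustrates the same idea for a few of the count-based features; the XML element and attribute names (Token, category, string) and the lexicon file names are assumptions made for this example, not the exact GATE output schema or files used in the paper.

import xml.etree.ElementTree as ET

def load_lexicon(path):
    """One word per line; returns a set for fast membership tests."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

# hypothetical lexicon files standing in for the Opinion Lexicon [2, 3]
POS_OPINION = load_lexicon("positive-words.txt")
NEG_OPINION = load_lexicon("negative-words.txt")
NEGATIONS = {"not", "no", "never", "n't", "cannot"}

def extract_features(gate_xml_path):
    """Compute a handful of the count-based features (POS counts, negations, opinion words)."""
    root = ET.parse(gate_xml_path).getroot()
    feats = {"NN": 0, "JJ": 0, "VB": 0, "RB": 0,
             "negations": 0, "opinion_pos": 0, "opinion_neg": 0}
    # assumed schema: one <Token> element per token with 'category' (POS tag) and 'string' attributes
    for tok in root.iter("Token"):
        pos = tok.get("category", "")
        word = tok.get("string", "").lower()
        for prefix in ("NN", "JJ", "VB", "RB"):
            if pos.startswith(prefix):
                feats[prefix] += 1
        if word in NEGATIONS:
            feats["negations"] += 1
        if word in POS_OPINION:
            feats["opinion_pos"] += 1
        if word in NEG_OPINION:
            feats["opinion_neg"] += 1
    return feats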

In Table II we present summary statistics for the extracted features:

TABLE II. FEATURES PROPERTIES

Feature                        | Min | Max  | Mean    | Mode | Variance | Range
Number of Category             | 2   | 83   | 19.5789 | 13   | 11.6469  | 81
Number of NN                   | 0   | 28   | 5.43157 | 3    | 3.51322  | 28
Number of JJ                   | 0   | 9    | 1.40323 | 1    | 1.47017  | 9
Number of VB                   | 0   | 18   | 3.51336 | 2    | 2.33056  | 18
Number of RB                   | 0   | 10   | 1.35708 | 1    | 1.35794  | 10
Number of W                    | 0   | 0    | 0       | 0    | 0        | 0
Number of Negations            | 0   | 3    | 0.20242 | 0    | 0.46375  | 3
Number of Opinion              | 0   | 7    | 1.26315 | 1    | 1.21025  | 7
Number of Positive Opinion     | 0   | 6    | 0.88178 | 0    | 0.98314  | 6
Number of Negative Opinion     | 0   | 4    | 0.38137 | 0    | 0.67926  | 4
Number of Attitude             | 0   | 10   | 0.98542 | 0    | 1.34301  | 10
Number of Negative Attitude    | 0   | 4    | 0.13117 | 0    | 0.42676  | 4
Number of Positive Attitude    | 0   | 9    | 0.72874 | 0    | 1.09593  | 9
Number of Both Attitude        | 0   | 6    | 0.12550 | 0    | 0.51204  | 6
Number of JJ Attitude          | 0   | 7    | 0.62429 | 0    | 0.88002  | 7
Number of NN Attitude          | 0   | 2    | 0.05506 | 0    | 0.23860  | 2
Number of RB Attitude          | 0   | 5    | 0.30607 | 0    | 0.60273  | 5
Number of JJ Negative Attitude | 0   | 7    | 0.64939 | 0    | 0.91236  | 7
Number of NN Negative Attitude | 0   | 6    | 0.41943 | 0    | 0.73835  | 6
Number of RB Negative Attitude | 0   | 4    | 0.17894 | 0    | 0.48096  | 4
Number of JJ Positive Attitude | 0   | 9    | 0.89959 | 0    | 1.21829  | 9
Number of NN Positive Attitude | 0   | 9    | 0.80728 | 0    | 1.16684  | 9
Number of RB Positive Attitude | 0   | 9    | 0.73603 | 0    | 1.09512  | 9
Number of JJ Both Attitude     | 0   | 9    | 0.68502 | 0    | 1.00423  | 9
Number of NN Both Attitude     | 0   | 6    | 0.37085 | 0    | 0.76074  | 6
Number of RB Both Attitude     | 0   | 6    | 0.18056 | 0    | 0.56558  | 6
Valence Average                | 0   | 8.72 | 3.85172 | 0    | 3.35980  | 8.72
Dominance Average              | 0   | 7.39 | 3.33664 | 0    | 2.84397  | 7.39
Arousal Average                | 0   | 8.02 | 3.05978 | 0    | 2.62077  | 8.02
Number of Desire               | 0   | 3    | 0.22186 | 0    | 0.47906  | 3
Number of Commitment           | 0   | 5    | 0.24372 | 0    | 0.52308  | 5
Number of Purpose              | 0   | 3    | 0.19190 | 0    | 0.46386  | 3

2) Experiments: For the sentiment analysis task several experiments were conducted using different machine learning (ML) algorithms. To evaluate the learnt models we used accuracy, defined as the number of correctly classified instances divided by the total number of instances. In our models we kept all the features listed in the previous section, since the feature selection experiments we ran using various methods (Information Gain, Mutual Information and a Ranking Algorithm) did not show any improvement.

As a baseline system we used a majority classifier, which always chooses as the correct category the one that is more frequent in the training data. The accuracy of this method also indicates the level of difficulty of the task. Below we present the experiments we performed on the product reviews dataset using various ML methods:

K-Nearest Neighbors (k-NN): We tried k-NN, a simple ML algorithm that classifies a test instance to the majority category of its k nearest training examples. We tried a wide range of k values (k = 1, ..., 100) and found, using cross-validation on the training set, that the optimal one is 21. As a distance measure between feature vectors we used the Euclidean distance. We also used Principal Component Analysis (PCA) to remove correlations between features. PCA improves accuracy for both the positive and negative categories (see Tables III and IV).

Naïve Bayes: Another well-known ML algorithm is Naïve Bayes (NB). NB assumes that the feature variables (x1, ..., xn) are independent given the class c (category). The distributions of these variables P(xi | c) are estimated from the training data. When a test instance defined by its feature vector x is given to NB, it is classified to the class c with the highest P(c | x); the latter probability is estimated using Bayes' theorem and the learnt P(xi | c) probabilities.

SVM with MLP or RBF kernel: We also tried Support Vector Machines (SVM), which attempt to learn a separating hyperplane for the given classes (categories) from the training data. We experimented with a Multilayer Perceptron (MLP) kernel and a Radial Basis Function (RBF) kernel. The best accuracy was obtained using the RBF kernel. We tuned the model parameters on the training set using Genetic Algorithms (GA), which significantly improved accuracy for both kernels (see Tables III and IV). As previously, we used PCA to remove correlations between features. The model with the best results is the SVM with RBF kernel, which we use in our experiments with the Twitter data.
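For illustration, the sketch below shows one way to reproduce the PCA + SVM-RBF setup with scikit-learn. The original experiments were run in MATLAB and tuned C and gamma with a Genetic Algorithm, so the plain grid search here only stands in for that tuning step, and the random matrix is a placeholder for the real 32-dimensional feature vectors.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: (n_reviews, 32) feature matrix, y: 0/1 labels for one category (e.g. positive)
X, y = np.random.rand(200, 32), np.random.randint(0, 2, 200)   # placeholder data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),        # PCA is sensitive to feature scales
    ("pca", PCA(n_components=0.95)),    # decorrelate features, keep 95% of the variance
    ("svm", SVC(kernel="rbf")),
])

# grid search in place of the Genetic Algorithm used in the paper
param_grid = {"svm__C": [0.1, 1, 10, 100], "svm__gamma": [0.001, 0.01, 0.1, 1]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("cross-validation accuracy:", search.best_score_)
print("test accuracy:", search.score(X_test, y_test))   # proportion classified correctly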

TABLE III. ACCURACY OF DIFFERENT METHODS FOR NEGATIVE CATEGORY

Method                     | Test   | Train: cross-validation | Train: full training set
Baseline                   | 0.6280 | -                       | -
k-NN (k=21)                | 0.6523 | 0.6837                  | 0.7014
k-NN + PCA (k=21)          | 0.6631 | 0.7081                  | 0.7199
SVM MLP + PCA              | 0.5606 | 0.5686                  | 0.4641
SVM MLP + PCA + tuning     | 0.6280 | 0.7035                  | 0.6713
SVM RBF + PCA              | 0.6226 | 0.6605                  | 0.9769
SVM RBF + PCA + tuning     | 0.7062 | 0.7279                  | 0.7326
Naïve Bayes                | 0.5391 | 0.4849                  | 0.5486
Naïve Bayes + PCA          | 0.6442 | 0.6640                  | 0.7523
Naïve Bayes + tuning       | 0.5391 | 0.4849                  | 0.5486
Naïve Bayes + tuning + PCA | 0.6631 | 0.6744                  | 0.7280

TABLE IV. ACCURACY OF DIFFERENT METHODS FOR POSITIVE CATEGORY

Method                     | Test   | Train: cross-validation | Train: full training set
Baseline                   | 0.5256 | -                       | -
k-NN (k=21)                | 0.6011 | 0.6395                  | 0.6991
k-NN + PCA (k=21)          | 0.6927 | 0.6779                  | 0.7326
SVM MLP + PCA              | 0.6199 | 0.5988                  | 0.5336
SVM MLP + PCA + tuning     | 0.7008 | 0.7442                  | 0.7338
SVM RBF + PCA              | 0.6523 | 0.6407                  | 0.9664
SVM RBF + PCA + tuning     | 0.7278 | 0.7384                  | 0.7581
Naïve Bayes                | 0.5633 | 0.6140                  | 0.6134
Naïve Bayes + PCA          | 0.5984 | 0.6465                  | 0.6516
Naïve Bayes + tuning       | 0.5660 | 0.5221                  | 0.6354
Naïve Bayes + tuning + PCA | 0.6388 | 0.5198                  | 0.7338

We also wanted to assess how much better our models predict the target category as we increase the number of training instances. Therefore, we built models using 10%, 20%, ..., 100% of the training set and evaluated them on the test set. The results are shown in Figures 4-7 for the negative category and in Figures 8-11 for the positive category. k-NN with PCA, SVM with an RBF or MLP kernel using parameter tuning and PCA, and Naïve Bayes using parameter tuning and PCA all improve their accuracy as more training data are added.
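A minimal sketch of how such learning curves can be generated, assuming the scikit-learn pipeline from the previous sketch (our own illustration, not the authors' MATLAB code):

def learning_curve_points(model, X_train, y_train, X_test, y_test, steps=10):
    """Train on growing fractions (10%, 20%, ..., 100%) of the training set
    and record accuracy on the fixed test set. Assumes the training data are shuffled."""
    n = len(y_train)
    points = []
    for k in range(1, steps + 1):
        m = int(n * k / steps)
        model.fit(X_train[:m], y_train[:m])
        points.append((m, model.score(X_test, y_test)))
    return points

# example: reuse the tuned PCA + SVM-RBF pipeline defined above
# for m, acc in learning_curve_points(search.best_estimator_, X_train, y_train, X_test, y_test):
#     print(m, acc)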

Fig. 4. KNN Learning curve comparison for using PCA and not using it (negative class)

Fig. 5. Naive Bayes Learning Curves - comparison for using and not using PCA and parameter tuning (negative class)

Fig. 6. SVM-RBF Learning Curve using PCA compared with the same method with tuned parameters (negative class)

Fig. 7. SVM-MLP Learning Curve using PCA compared with the same method with tuned parameters (negative class)

Fig. 8. KNN Learning curve comparison for using PCA and not using it (positive class)

Fig. 9. Naive Bayes Learning Curves - comparison for using and not using PCA and parameter tuning (positive class)

Fig. 10. SVM-RBF Learning Curve using PCA compared with the same method with tuned parameters (positive class)

Fig. 11. SVM-MLP Learning Curve using PCA compared with the same method with tuned parameters (positive class)

B. Experiments on the BC

We used our best models (SVM RBF + PCA + tuning), trained on 70% of the product reviews, and evaluated them on the 776 tweets. The models achieve 64.1% and 77.7% accuracy for the positive and negative class, respectively. Both outperform the corresponding majority baselines.

TABLE V. SVM RBF + PCA + TUNING (EVALUATED ON THE BOSTON TWEETS)

Method                 | Positive class | Negative class
Majority baseline      | 0.5555         | 0.7094
SVM RBF + PCA + tuning | 0.6410         | 0.7777

We also created a training set using 70% of the product reviews and 70% of the labelled Twitter data. Similarly, we created a test set by combining the remaining 30% of the two aforementioned datasets. We then trained models by progressively adding more training data, as in the previous section. As shown in Figure 12, our models achieve better accuracy as more training instances are added.

Fig. 12. Learning Curve of SVM-RBF using GA on the combined dataset (positive and negative labels)

In the following table we show the accuracy of our classifier trained on 100% of the combined training set. This model was used to classify the remaining 3963 tweets from the BC, and its output was fed to the data visualization algorithm.

TABLE VI. SVM RBF + PCA + TUNING (TRAINED ON THE FULL COMBINED TRAINING SET)

Method                 | Positive class | Negative class
Majority               | 0.505785124    | 0.659504132
SVM RBF + PCA + tuning | 0.707438017    | 0.73553719

V. DATA VISUALIZATION

Two types of data graph visualizations are presented:

A. Hashtag graph

This graph presents the most important (most frequent) topics discussed in the 3963 tweets using a D3 bubble chart. The biggest bubbles correspond to the most frequently discussed topics.

Fig. 13. General Hashtags Bubble Chart

Fig. 14. General Hashtags Bubble Chart using logarithmic scale for radial size

As seen in the above figures, each topic corresponds to a set of Twitter hashtags whose names arise from one another through minor lexical or stylistic transformations (e.g. "BostonMarathon", "bostonmarathon"). These hashtags are detected using simple heuristics and/or Levenshtein (edit) distance; a minimal sketch of such a consolidation step is given below. A graph with the consolidated hashtags is shown in Figures 15 and 16.
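A minimal Python sketch of such a consolidation step follows. The greedy grouping and the edit-distance threshold are our own illustrative choices, not necessarily the exact heuristics used to produce the charts.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def consolidate(hashtag_counts, max_dist=2):
    """Greedily merge hashtags whose lowercased forms are within max_dist edits
    of an already-seen canonical form; counts of merged hashtags are summed."""
    canonical = {}   # lowercased representative -> total count
    for tag, count in sorted(hashtag_counts.items(), key=lambda kv: -kv[1]):
        key = tag.lower().lstrip("#")
        match = next((c for c in canonical if levenshtein(key, c) <= max_dist), None)
        target = match if match else key
        canonical[target] = canonical.get(target, 0) + count
    return canonical

print(consolidate({"#BostonMarathon": 120, "#bostonmarathon": 95,
                   "#bostonmaraton": 3, "#prayforboston": 40}))
# {'bostonmarathon': 218, 'prayforboston': 40}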

Fig. 15. Bubble Chart after the hashtag consolidation

Fig. 16. Bubble Chart after the hashtag consolidation using logarithmic scale for radial size

B. Sentiment graph

This graph presents the frequency of the positive and negative tweets over time. As shown below (Figures 17 and 18), the number of tweets about the Boston Marathon was relatively small in the beginning; after the bomb explosions, however, it increased rapidly. The figures also show that negative tweets came to dominate the positive ones as time passed, since more people expressed their sadness or anger about the event. The fact that many positive tweets are still detected is mainly due to people expressing hopes and wishes (e.g. "Best wishes to those at #BostonMarathon", "I hope everyone is ok").

Fig. 17. The distribution of positive and negative tweets on a four-hour time frame

Fig. 18. The distribution of positive and negative tweets per hour

VI. CONCLUSIONS AND FUTURE WORK

We have experimented with a variety of well-known machine learning algorithms to predict the expression of positive or negative sentiment in social media data. We have shown that a Support Vector Machine with an RBF kernel obtained the best results for both categories on a dataset of product reviews. We have also shown that the same classifier achieves competitive results on a different domain (the Twitter dataset). In addition, using the D3 JavaScript library we created a concise visualization summary of the data. This visualization presents in a user-friendly way a) the most important topics discussed and b) the dominant sentiment expressed in the data over time.

In future work we plan to assess the effectiveness of each lexicon and to test different feature sets and machine learning algorithms (e.g. Logistic Regression). In addition, we would like to perform an error analysis to detect the cases in which our classifier fails to predict the correct sentiment. Furthermore, a more sophisticated visualization is planned, in which we will present the dominant topics per time unit separately for each sentiment category.

REFERENCES

[1] Andrew B. Goldberg, Nathanael Fillmore, David Andrzejewski, Zhiting Xu, Bryan Gibson and Xiaojin Zhu. "May All Your Wishes Come True: A Study of Wishes and How to Recognize Them." Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies (NAACL HLT 2009).
[2] Minqing Hu and Bing Liu. "Mining and Summarizing Customer Reviews." Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.
[3] Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion Observer: Analyzing and Comparing Opinions on the Web." Proceedings of the 14th International World Wide Web Conference (WWW-2005), May 10-14, 2005, Chiba, Japan.
[4] Bradley, M., & Lang, P. (1999). Affective norms for English words (ANEW): Stimuli, instruction manual and affective ratings. Technical report C-1, Gainesville, FL: University of Florida.
[5] Pontiki Maria, Aggelou Zoe, Maltezou Sofia & Papageorgiou Haris (2013). Sentiment Analysis: Building Bilingual Lexical Resources. To be published in the Proceedings of the 13th International Conference on Greek Linguistics, September 26-29, 2013.
[6] Martin, J.R. and White, P.R.R. (2005). The Language of Evaluation: Appraisal in English. Palgrave Macmillan, London & New York.