Opinion mining in Social Big Data

Corresponding author: Peter Wlodarczak
Affiliation: University of Southern Queensland
Corresponding address: [email protected]

Dr. Mustafa Ally
Affiliation: University of Southern Queensland, Toowoomba, QLD 4350, AUSTRALIA

Prof. Dr. Jeffrey Soar
Affiliation: University of Southern Queensland, Toowoomba, QLD 4350, AUSTRALIA

Abstract. Opinion mining has rapidly gained importance due to the unprecedented amount of opinionated data on the Internet. People share their opinions on products and services, and they rate movies, restaurants or vacation destinations. Social media such as Facebook or Twitter have made it easier than ever for users to share their views and make them accessible to anybody on the Web. The economic potential has been recognized by companies that want to improve their products and services, detect new trends and business opportunities, or find out how effective their online marketing efforts are. However, opinion mining using social media faces many challenges due to the amount and the heterogeneity of the available data. Spam and fake opinions have also become a serious issue. In addition, there are language-related challenges such as the use of slang and jargon, or of special characters like smileys, which are widely adopted on social media sites. These challenges create many interesting research problems such as determining the influence of social media on people’s actions, understanding opinion dissemination or determining the online reputation of a company. Not surprisingly, opinion mining using social media has become a very active area of research, and a lot of progress has been made over the last years. This article describes the current state of research and the technologies that have been used in recent studies.

Keywords: Big Data; Social Media; opinion mining; sentiment analysis

Opinion mining in Social Big Data, p. 1, 2014.

Electronic copy available at: http://ssrn.com/abstract=2565426

1 Introduction

Opinion mining, also called sentiment analysis, has become a very active area of research. It analyses people’s opinions, appraisals, attitudes, and emotions toward entities, individuals, issues, events, topics, and their attributes [1]. Opinion mining using Social Media (SM) is still in its infancy, but there is growing interest in it for several reasons. Opinions are important because they are key influencers of our behaviour [1]. Also, SM such as Facebook or Google+ have made it very easy for users to share opinions, views, interests and ideas on the Internet and make them visible worldwide. This has yielded an unprecedented amount of user opinions on products, services, or political events. In the past an organisation had to conduct polls or surveys to get user or voter opinions, whereas SM gives direct access to large amounts of user opinions. For the first time in human history, we now have a huge volume of opinionated data in the social media on the Web [2]. Also, opinion mining using SM data poses many challenging problems, which makes it a very interesting area of research. Since opinions are subjective, one opinion is usually not enough for an application and a collection of opinions needs to be analysed. SM provides a large collection of opinions, which makes it an invaluable source for opinion mining applications. Using Social Media Mining (SMM), all the data can be analysed, and there is no need to select a sample.

SM data are typically in the form of textual content (e.g. in blogs, reviews and status updates), rating scores in Likert scales or stars (e.g. review ratings), like or dislike indications (e.g. helpful votes on reviews and Facebook’s like or Google’s “+1” buttons), web search queries (e.g. Google Trends), tags and profile information (e.g. social network graphs) [14]. But the amount of available data is a challenge in itself, and some sort of summary has to be created. The summary can take the form of a binary sentiment classification, positive or negative opinions, or of multiclass sentiments, for instance a Likert scale (“very bad”, “bad”, “neutral”, “good”, “very good”) or rating scores such as 1 to 5 stars.

Opinion mining using SM data has been used in many domains. It has been used to find out about a company’s online reputation, to detect new trends, to analyse user intents and to gain knowledge, and there are many commercial applications. Several research attempts at sentiment analysis have been made in the past years, among others to analyse political opinions [3][4], to find influential participants and groups [5], to analyse why users move from one service to another [6], to detect mood polarity [7], and in predictive analytics [8,9,10,14,22].

Sentiment analysis is a Natural Language Processing problem. It is a cross-disciplinary research field with theoretical underpinnings including computer science, linguistics and psychology. Natural Language Processing (NLP) is an area of research and application that explores how computers can be used to understand and manipulate natural language text or speech to do useful things [13]. Sentiment analysis touches every aspect of NLP like word sense disambiguation, coreference resolution or negation handling. However, linguistic issues will not be covered in this article. This article describes the state-of-the-art techniques adopted for sentiment analysis used in current research.

2 Methods

2.1 Types of sentiment expressions

Sentiment analysis, also called opinion mining, is the field of study that analyses people’s opinions, sentiments, evaluations, appraisals, attitudes, and emotions towards entities such as products, services, organizations, individuals, issues, events, topics, and their attributes [2]. Sentiments can be expressed at the entity level or at the aspect level. Entity level sentiment analysis looks directly at the entire target of an opinion. For example, “The new iPhone is excellent” expresses an opinion on the product as a whole. Aspect level sentiment analysis looks at features of a product. For instance, “This car is very quiet but it uses a lot of petrol”. In this example the opinion targets are features of the car: the noise it produces and its fuel consumption. At the entity level, an opinion is a quadruple, (e, s, h, t), where e is the entity, the opinion target, s is the sentiment, h is the sentiment holder and t is the time when the sentiment was expressed. At the aspect level, a sentiment is a quintuple, (e, a, s, h, t), where the additional element a is the aspect. Here e and a together are the opinion target.

Regular opinions express sentiments on an entity or a feature directly, whereas comparative opinions compare multiple entities based on some of their aspects. For example, “BMW makes more ecological cars than Mercedes”. Here the cars are compared based on their consumption of resources.

Sentiment words such as “excellent”, “good”, “poor” are the most important indicators of sentiments. However, there are other ways of expressing sentiments. Phrases or idioms without sentiment words can express opinions. For instance, “This car cost me an arm and a leg” or, “This mattress had a valley after one month”. Opinions can also be expressed using verbs. For example, “This TV sucks” or, “This washer uses a lot of water”. However, the sentence “This hoover really sucks!” expresses a positive opinion. Detecting the real meaning, whether a statement is meant positively or negatively, is one of the challenges of opinion mining.

Another challenge is spam detection. Spammers spread unsolicited or malicious messages. Social spammers have become rampant and the volume of spam has increased dramatically [12]. SM can be accessed from anywhere in the world and users can express their opinions without disclosing their identity and without fear of consequences. While this is a highly desirable feature in some cases, it also allows people with malicious intent or hidden agendas to post fake opinions to promote or discredit products, companies or individuals. Such individuals are called opinion spammers and their activities are called opinion spamming [2]. Advances have been made in spam detection, but these techniques are beyond the scope of this article.


Detecting sarcasm automatically is very difficult. Given the sentence, “The government wants to legalize marijuana, oh great!”, it is impossible to say whether the person is in favour of legalization or is being sarcastic about it. Several studies have tried to detect sarcasm. The most accurate of them detected sarcasm in only 57% of the cases [2]. These are only some of the challenges faced in sentiment analysis, but they illustrate that opinion mining is highly domain specific and that, as with most data mining tasks, it usually starts with understanding the domain. The examples given here are all in English. Opinions are also expressed on SM in other languages, which pose different challenges.
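To make the opinion tuples defined in this section concrete, the following minimal sketch models them as a small Python data structure. The class name `Opinion`, the field names, the holder names and the date are illustrative assumptions and are not part of the cited definitions; only the tuple layouts (e, s, h, t) and (e, a, s, h, t) come from the text.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class Opinion:
    entity: str                    # e: the entity, the opinion target
    sentiment: str                 # s: e.g. "positive" or "negative"
    holder: str                    # h: who expressed the sentiment
    time: date                     # t: when the sentiment was expressed
    aspect: Optional[str] = None   # a: only set for aspect-level opinions

    def as_tuple(self) -> Tuple:
        """Return (e, s, h, t) at entity level, (e, a, s, h, t) at aspect level."""
        if self.aspect is None:
            return (self.entity, self.sentiment, self.holder, self.time)
        return (self.entity, self.aspect, self.sentiment, self.holder, self.time)


posted = date(2014, 1, 15)  # hypothetical posting date

# "The new iPhone is excellent": one entity-level opinion
entity_opinion = Opinion("iPhone", "positive", "user_a", posted)

# "This car is very quiet but it uses a lot of petrol": two aspect-level opinions
aspect_opinions = [
    Opinion("car", "positive", "user_b", posted, aspect="noise"),
    Opinion("car", "negative", "user_b", posted, aspect="fuel consumption"),
]
```

A single sentence can therefore yield several aspect-level tuples with different sentiments, which is why a single label per post is sometimes too coarse.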

2.2 Preconditioning

Text categorisation is usually done using clustering or classification techniques. Text documents are represented as vectors using the Vector Space Model (VSM). However, most SM posts are not in a form that is usable for text mining techniques. Social media data are multi-modal in nature, including content such as images, audio, and videos, concept such as discussion topic, tag, and annotation, and context such as links, profile, timestamp, and click-through [11]. Irrelevant data has to be removed first.

The cleaning process, or data preconditioning, can involve stop-word removal. Stop-words are words like “the” and “is” in sentences such as “The new Panasonic GM1 camera is excellent!”. They only have grammatical significance and are thus eliminated. Special characters such as “!”, brackets or smileys are usually removed as well. In addition, words are converted into their canonical form using stemming algorithms. Stemming algorithms such as unsupervised morpheme segmentation find the word stem of inflected or derived words. For instance, “sucks” is reduced to “suck”. Lemmatization might be performed for further analysis. It is the process of grouping together the different inflected forms of a word, so they can be analyzed as a single item [15].

An often used preconditioning task is creating a bag-of-words. Bag-of-words based approaches model news articles with the vector space model, which translates each news piece into a vector of statistical word measurements, such as the number of occurrences [16]. A bag-of-words is a collection of all the words in a text, disregarding grammar and word order. Bags-of-words are suitable inputs for machine learning methods.
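The preconditioning steps described in this section (special character removal, stop-word removal, stemming and bag-of-words creation) can be sketched as follows. This is a minimal illustration under stated assumptions: the stop-word list is a tiny hand-picked sample, and `naive_stem` is a crude suffix stripper standing in for a real stemming algorithm, not the morpheme segmentation mentioned above.

```python
import re
from collections import Counter

# Illustrative stop-word sample (assumption); real lists are much longer.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "it", "this"}


def naive_stem(word):
    """Strip a few common English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(post):
    """Lower-case, drop special characters and stop-words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", post.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def bag_of_words(post):
    """Map each remaining term to its number of occurrences."""
    return Counter(preprocess(post))


tokens = preprocess("The new Panasonic GM1 camera is excellent!")
bag = bag_of_words("This TV sucks, it really sucks")
```

Here `tokens` keeps only the content words of the camera example, and `bag` counts the stem “suck” twice. Note that removing all special characters also discards smileys, which may themselves carry sentiment; whether to keep them is a design choice.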

2.3 Sentiment lexicons

As shown in Section 2.1, words that express positive or negative sentiments are essential for opinion mining. The examples also showed that these words are highly domain specific. A word can bear a different meaning depending on whether the opinion is expressed about a car, a mobile phone or a mattress. Sentiment words are not the only important words. The sentiment strength can be altered using sentiment modifiers such as “very”. For instance, “The new iPhone is very good”. Here “good” is the sentiment word and “very” is the modifier. Sentiment words can be base types such as “good” or “bad” or comparative types such as “better” or “worse”. Negators such as “not” are also important since they can change the sentiment to its opposite. For instance, “The new Chevrolet is not great”. They are called sentiment polarity shifters. They can also change the opinion in a positive way. For instance, “The new Mercedes doesn’t suck”.
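The interplay of sentiment words, modifiers and negators described above can be illustrated with a small scoring sketch. The word lists and numeric weights are hypothetical and chosen only for this example. The rule that a modifier or negator applies to the next sentiment word is a deliberate simplification, not a method taken from the cited literature.

```python
import re

# Hypothetical lexicon entries and weights, for illustration only.
LEXICON = {"good": 1.0, "great": 1.5, "excellent": 2.0,
           "bad": -1.0, "poor": -1.0, "suck": -1.5}
MODIFIERS = {"very": 1.5, "really": 1.5}
NEGATORS = {"not", "doesn't", "don't", "never"}


def score(sentence):
    """Sum the weighted polarities of the sentiment words in a sentence."""
    text = sentence.lower().replace("’", "'")  # normalise curly apostrophes
    total, weight, negate = 0.0, 1.0, False
    for token in re.findall(r"[a-z']+", text):
        if token in NEGATORS:
            negate = True                  # shift polarity of next sentiment word
        elif token in MODIFIERS:
            weight *= MODIFIERS[token]     # strengthen next sentiment word
        elif token in LEXICON:
            value = LEXICON[token] * weight
            total += -value if negate else value
            weight, negate = 1.0, False    # reset after each sentiment word
    return total
```

With these assumed weights, “The new iPhone is very good” scores higher than “The new iPhone is good”, “The new Chevrolet is not great” scores negative, and “The new Mercedes doesn’t suck” scores positive.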


A common way of performing opinion mining is to create a list of sentiment words, a sentiment lexicon, and to use it to analyse the opinion texts. Sentiment lexicons can be compiled manually. However, this is labour intensive and an automatic approach is usually preferred. Dictionaries such as WordNet (http://wordnet.princeton.edu/) or Dictionary.com (http://dictionary.reference.com/) list synonyms and antonyms of words. They can be used to generate sentiment lexicons automatically. This approach works as follows. A small set of seed sentiment words is compiled manually. Starting from the seed words, an algorithm searches the online dictionary for synonyms and antonyms, which are added to the word list. The search is repeated iteratively until no more sentiment words can be found. Some sentiment lexicons also weight sentiment words. For instance, “excellent” is stronger than “good”. Weighted lexicons are useful when posts are not just analysed for positive or negative opinions but divided into multiclass sentiment categories such as “good” reviews or “very good” reviews.
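The iterative, dictionary-based expansion described above can be sketched as a breadth-first search. The two small dictionaries below are a hypothetical stand-in for a resource such as WordNet; no real dictionary API is called. The sketch also assumes that synonyms inherit the polarity of the word they were reached from and antonyms receive the opposite polarity.

```python
from collections import deque

# Toy synonym/antonym dictionary (assumption), standing in for an online dictionary.
SYNONYMS = {"good": {"fine", "great"}, "great": {"excellent"},
            "bad": {"poor"}, "poor": {"awful"}}
ANTONYMS = {"good": {"bad"}, "excellent": {"awful"}}


def expand_lexicon(seeds):
    """Grow a {word: polarity} lexicon from seed words until nothing new is found."""
    lexicon = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        polarity = lexicon[word]
        related = [(w, polarity) for w in sorted(SYNONYMS.get(word, ()))]
        related += [(w, -polarity) for w in sorted(ANTONYMS.get(word, ()))]
        for other, other_polarity in related:
            if other not in lexicon:       # each word is added only once
                lexicon[other] = other_polarity
                queue.append(other)
    return lexicon


lexicon = expand_lexicon({"good": 1})
```

Starting from the single seed “good”, the search reaches all seven words in the toy dictionary. In practice the resulting list usually still needs manual inspection, because dictionary relations do not always preserve sentiment polarity.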


2.4 Supervised and unsupervised machine learning methods

The idea of text categorization is to assign semantically similar documents to the same group, while the resulting groups should be as dissimilar from each other as possible [17]. In opinion mining on SM, posts are classified as expressing positive or negative sentiments. Text categorization can be divided into classification and clustering problems. Texts are represented in the vector space and every word is given a specific weight. Term Frequency and Inverse Document Frequency (TF-IDF) is one of the best known term weighting methods [17]. It is defined as:

$$ w_{t,d} = tf_{t,d} \cdot \log\left(\frac{N}{df_t}\right) \qquad (1) $$

where $tf_{t,d}$ is the number of occurrences of term $t$ in document $d$, $N$ is the number of documents in the collection and $df_t$ is the number of documents in which term $t$ appears [17]. The similarity of the documents is then computed using a distance measure such as the Euclidean distance, Manhattan distance or Chebyshev distance.

Sentiment classification is usually formulated as a binary classification problem, positive or negative. Supervised techniques are used when the class label is known, unsupervised techniques when it is unknown. Here the class label is “positive” or “negative” reviews. Supervised machine learning techniques are a common way of classifying text. A set of data, here SM posts, is divided into a training set and a test set. The model is trained using the training set. The test set is used to determine how well the model performs by calculating the classification error or accuracy. This process is repeated until the result is acceptable. The trained model can then be applied to future, unseen SM posts.

There are many supervised machine learning algorithms. Popular algorithms are the Naïve Bayes classifier, Support Vector Machines (SVM) and k-Nearest Neighbor (k-NN). They take a feature vector as input. A feature vector can consist of unigrams, a bag-of-words containing the sentiment words identified in sentiment lexicon generation, terms and their frequencies, part of speech (POS) tags or sentiment shifters.

Since sentiment words are often the dominant factor for sentiment classification, it is not hard to imagine that sentiment words and phrases may be used for sentiment classification in an unsupervised manner [2]. In unsupervised machine learning, documents are clustered into similarity groups. SM posts are grouped together using a similarity, or distance, function. The most commonly used distance functions for numeric attributes are the Euclidean distance and Manhattan (city block) distance [1]. However, others such as the Chebyshev and Minkowski distance functions are also used. K-means clustering is probably the most popular clustering algorithm.
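As a worked illustration of the TF-IDF weighting in equation (1) and of the distance functions named in this section, the following sketch computes weight vectors for three tiny tokenised posts and compares them. The example posts are invented, and the natural logarithm is an assumption, since the text does not specify the base of the logarithm.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """documents: list of token lists. Returns (vocabulary, weight vectors)."""
    n_docs = len(documents)
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))             # document frequency df_t
    vocabulary = sorted(df)
    vectors = []
    for tokens in documents:
        tf = Counter(tokens)               # term frequency tf_{t,d}
        vectors.append([tf[t] * math.log(n_docs / df[t]) for t in vocabulary])
    return vocabulary, vectors


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


def chebyshev(u, v):
    return max(abs(a - b) for a, b in zip(u, v))


# Invented, already preconditioned posts
posts = [["good", "camera"], ["good", "phone"], ["bad", "phone", "bad"]]
vocabulary, vectors = tf_idf_vectors(posts)
```

A term that occurs in every document receives weight zero, because log(N / df_t) = log(1) = 0. The resulting vectors can be passed to a distance-based method such as k-NN or k-means.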

2.5 Latent Dirichlet Allocation

In recent studies Latent Dirichlet Allocation (LDA) has been used for sentiment analysis using SM [17,18,19,20,23]. LDA is based on Latent Semantic Indexing and represents a probabilistic model that finds the co-occurrence patterns of terms that correspond to semantic topics, and it has been used in document classification. It is a generative probabilistic model that is based on the multinomial and Dirichlet distributions [17]. The Dirichlet distribution is defined as:

$$ p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{n} \alpha_i\right)}{\prod_{i=1}^{n} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_n^{\alpha_n - 1} \qquad (2) $$

where every document $d$ is characterized by a Dirichlet distribution $\theta_d$ with parameter $\alpha$, $n$ is the number of predefined topics, $\alpha_i > 0$ and $\sum_{i=1}^{n} \theta_i = 1$. $\Gamma(x)$ is the Gamma function. The parameters can be estimated with several approximate inference methods such as Gibbs sampling.

As with machine learning, LDA is first trained with a set of SM posts and creates a latent description of the posts. It thus creates a profile of, for instance, positive or negative Tweets. It can then filter out the relevant Tweets from a corpus of Tweets. LDA has also been used for relevance filtering [19], and extensions have been proposed that consider the underlying sequential structure of the document [20] or filter out background topics [23]. LDA has proven to be useful in exploratory as well as predictive text analytics.
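To make the Dirichlet distribution used by LDA more tangible, the following sketch evaluates the Dirichlet density for a given topic proportion vector and draws random topic proportions. It assumes the standard requirement that every concentration parameter is positive, and it uses the well-known construction of a Dirichlet sample from normalised Gamma draws. The function names are illustrative, and this is not an LDA implementation.

```python
import math
import random


def dirichlet_pdf(theta, alpha):
    """Density of the Dirichlet distribution at theta for parameters alpha."""
    if len(theta) != len(alpha):
        raise ValueError("theta and alpha must have the same length")
    if any(a <= 0 for a in alpha):
        raise ValueError("all alpha_i must be positive")
    if any(t <= 0 for t in theta) or not math.isclose(sum(theta), 1.0):
        raise ValueError("theta must be positive and sum to 1")
    # Work in log space to avoid overflow in the Gamma function.
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    log_kernel = sum((a - 1) * math.log(t) for a, t in zip(alpha, theta))
    return math.exp(log_norm + log_kernel)


def sample_dirichlet(alpha, rng=random):
    """Draw topic proportions by normalising independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


uniform_density = dirichlet_pdf([0.2, 0.3, 0.5], [1.0, 1.0, 1.0])
topic_mix = sample_dirichlet([0.5, 0.5, 0.5], random.Random(0))
```

With all parameters equal to 1 the density is constant over the simplex, and for three topics it equals Γ(3) = 2. Small parameter values tend to produce sparse topic mixtures, where a document is dominated by few topics.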

3 Discussion

Sentiment analysis remains a challenging area of research. While the classification algorithms, whether machine learning, LDA or other statistical methods, are important, data pre-processing remains an equally important task. SM data is typically noisy, and there is a lot of irrelevant data. The classification algorithms won’t perform well if the data is not properly preconditioned, and the accuracy of the results will suffer. Preconditioning encompasses relevance filtering, noise removal and feature vector preparation. The feature vector can contain word frequencies of sentiment words and POS tags, but attributes might also be weighted. For instance, not all sentiment words might have the same importance, so the feature vector might also contain weights. Feature vector creation is at least as important as selecting the appropriate classification algorithm; nevertheless, there seems to be much less research in this area than in the area of classification.

Other challenges originate from the complexities of natural language, with linguistic constructs such as humour, sarcasm or innuendo that are very difficult for computers to detect. Microblogging SM sites such as Twitter and Sina Weibo usually have character limits, and posts are usually clear statements. Forums and blogs often cover several topics and contain opinions on different subjects, which makes them more difficult to mine. More complex sentences can express sentiments on different targets. For instance, “Microsoft is doing well in this bad market”. That is why some studies only considered explicit statements [21]. Probably the most difficult posts to analyse are political opinions, since they are full of irony and sarcasm.

The author uses Tweets for his research. Tweets are limited to 140 characters, so they are typically straight to the point, which makes them suitable targets for getting opinions. However, due to their shortness, Tweets are usually full of slang or emoticons, which poses a challenge. For instance, Tweets have no subject line, so subject words are highlighted using a hash tag, for example “#IBM #share is plummeting”, or companies are denoted using a $ sign, for example “$AAPL” for Apple Inc. Slang such as “ATTA car” (“that’s a car”), abbreviations or acronyms typically found on SM such as IMHO (In My Humble Opinion) or LOL (Laugh Out Loud), or texts such as “gooooood car” pose additional challenges for sentiment analysis.
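Some of the Tweet-specific conventions mentioned above can be handled with simple pattern matching before classification. The sketch below extracts hash tags and $-prefixed company symbols, collapses elongated words and expands a few acronyms. The regular expressions and the two-entry acronym table are illustrative assumptions and far from complete.

```python
import re

HASHTAG = re.compile(r"#(\w+)")
CASHTAG = re.compile(r"\$([A-Za-z]{1,6})\b")
ELONGATED = re.compile(r"(\w)\1{2,}")      # a character repeated 3 or more times

# Tiny illustrative acronym table (assumption)
ACRONYMS = {"imho": "in my humble opinion", "lol": "laugh out loud"}


def extract_tags(tweet):
    """Return the subject words and company symbols marked in a Tweet."""
    return {"hashtags": HASHTAG.findall(tweet),
            "cashtags": CASHTAG.findall(tweet)}


def normalise(tweet):
    """Lower-case, collapse elongated characters and expand known acronyms."""
    text = ELONGATED.sub(r"\1\1", tweet.lower())
    return " ".join(ACRONYMS.get(word, word) for word in text.split())


tags = extract_tags("#IBM #share is plummeting, buying $AAPL instead")
clean = normalise("IMHO this is a gooooood car LOL")
```

Collapsing every run of three or more identical characters to two is a heuristic: it turns “gooooood” into “good”, but it would turn “sooo” into “soo”, so a dictionary lookup is typically needed afterwards.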

4 Conclusions

A fully automated and accurate solution for opinion mining using SM is nowhere in sight. The main issue is that opinion mining is a natural language processing problem, and natural languages contain many ambiguities, irregularities and much fuzziness. Other reasons are the limitations of the algorithms. Many opinion mining algorithms give satisfactory results, but manual steps are still necessary. Also, the algorithms usually don’t produce human readable results, which often makes it difficult to understand the whole process.

Future research should focus, among other things, on feature vector creation as a crucial step in opinion mining. Automatic opinion spam detection is an area where a lot of progress has been made, but spam is usually artfully created, and as spam filters detect new forms of spam, opinion spammers find more sophisticated ways too. Domain-independent sentiment lexicons are still nowhere in sight, but automatic sentiment lexicon creation for a specific domain would improve the whole opinion mining process. While many studies have analysed the effectiveness of online marketing campaigns, how influential online opinions are is still not well understood. More research on the influence of SM users on user behaviour would be desirable.

5 Appendix

5.1 References

[1] B. Liu, Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, 2 ed., Heidelberg: Springer, 2011.
[2] B. Liu, Sentiment Analysis and Opinion Mining: Morgan & Claypool, 2012.
[3] M. Kaschesky, P. Sobkowicz, and G. Bouchard, “Opinion mining in social media: modeling, simulating, and visualizing political opinion formation in the web,” in Proceedings of the 12th Annual International Digital Government Research Conference: Digital Government Innovation in Challenging Times, College Park, Maryland, 2011, pp. 317-326.
[4] S. Stieglitz, and L. Dang-Xuan, “Social media and political communication: a social media analytics framework,” Social Network Analysis and Mining, vol. 3, no. 4, pp. 1277-1291, 2013.
[5] D. King, “Introduction to Mining and Analyzing Social Media Minitrack,” pp. 3108-3108.
[6] Jones, and L. Huan, “Mining Social Media: Challenges and Opportunities,” pp. 90-99.
[7] V. Hangya, and R. Farkas, “Target-oriented opinion mining from tweets,” pp. 251-254.
[8] S. Asur, and B. A. Huberman, “Predicting the Future with Social Media,” pp. 492-499.
[9] J. Bollen, H. Mao, and X.-J. Zeng, “Twitter mood predicts the stock market,” Journal of Computational Science, vol. 2, pp. 8, 2010.
[10] Siganos, E. Vagenas-Nanos, and P. Verwijmeren, “Facebook's daily sentiment and international stock markets,” Journal of Economic Behavior & Organization, no. 0, 2014.
[11] H. Shen, X.-S. Hua, J. Luo, and V. Oria, “Guest editorial: content, concept and context mining in social media,” World Wide Web, vol. 15, no. 2, pp. 115-116, 2012.
[12] J. Tang, Y. Chang, and H. Liu, “Mining social media with social theories: a survey,” SIGKDD Explor. Newsl., vol. 15, no. 2, pp. 20-29, 2014.
[13] Preeti, and BrahmaleenKaurSidhu, “Natural Language Processing,” International Journal of Computer Technology and Applications, vol. 4, pp. 751-758, 2013.
[14] E. Kalampokis, E. Tambouris, and K. Tarabanis, “Understanding the predictive power of social media,” Internet Research, vol. 23, no. 5, pp. 544-559, 2013.
[15] S. Stieglitz, and L. Dang-Xuan, “Social media and political communication: a social media analytics framework,” Social Network Analysis and Mining, vol. 3, no. 4, pp. 1277-1291, 2013.
[16] X. Li, H. Xie, L. Chen, J. Wang, and X. Deng, “News impact on stock price return via sentiment analysis,” Knowledge-Based Systems, no. 0, 2014.
[17] Z. Daniel, Z. Daniel, S. Ján, J. Jozef, and C. Anton, “Text Categorization with Latent Dirichlet Allocation,” Journal of electrical and electronics engineering, vol. 7, pp. 161-164, 2014.
[18] T. Shulong, L. Yang, S. Huan, G. Ziyu, Y. Xifeng, B. Jiajun, C. Chun, and H. Xiaofei, “Interpreting the Public Sentiment Variations on Twitter,” Knowledge and Data Engineering, IEEE Transactions on, vol. 26, no. 5, pp. 1158-1170, 2014.
[19] M. Arias, A. Arratia, and R. Xuriguera, “Forecasting with twitter data,” ACM Trans. Intell. Syst. Technol., vol. 5, no. 1, pp. 1-24, 2014.
[20] L. Du, W. Buntine, H. Jin, and C. Chen, “Sequential latent Dirichlet allocation,” Knowledge and Information Systems, vol. 31, no. 3, pp. 475-503, 2012.
[21] J. Bollen, H. Mao, and X.-J. Zeng, “Twitter mood predicts the stock market,” Journal of Computational Science, vol. 2, pp. 8, 2010.
[22] A. Porshnev, I. Redkin, and A. Shevchenko, “Machine Learning in Prediction of Stock Market Indicators Based on Historical Data and Data from Twitter Sentiment Analysis,” pp. 440-444.
[23] T. Shulong, L. Yang, S. Huan, G. Ziyu, Y. Xifeng, B. Jiajun, C. Chun, and H. Xiaofei, “Interpreting the Public Sentiment Variations on Twitter,” Knowledge and Data Engineering, IEEE Transactions on, vol. 26, no. 5, pp. 1158-1170, 2014.