A Comprehensive Study of Twitter Social Networks

Silvia Ciotec, Mihai Dascalu, Stefan Trausan-Matu
Computer Science Department, University Politehnica of Bucharest, Bucharest, Romania
[email protected], [email protected], [email protected]

Abstract—As most approaches perform social network analysis from a static point of view, our paper is centered on the analysis of the Twitter network, emphasizing its dynamic aspects by using an analytics and visualization-centered application. Our aim is to model the activity and importance of individual users over time, as well as the connection between the recent activity of the entire network and a given subject (for example, trending topics like an event or a celebrity). A user’s influence is measured based on his/her followers and retweets, making it possible to classify the members of a certain community. Therefore, we shift the perspective towards analyzing the Twitter network as a news-spreading platform by studying the behavior of users, the underlying timelines and relationships.

Keywords—Twitter analysis; Social Network Analysis; retweets; user behavior; interactive visualizations.

I. INTRODUCTION

Twitter has become a popular social networking and microblogging service in which users can post and receive short messages of up to 140 characters (named tweets). In other words, basic interactions enable a user to communicate and to stay updated via the tweets of the users he/she is following. The service’s popularity is reflected in the number of users (500 million registered users in 2012), as well as the number of tweets per day (200 million as of 1st Aug 2011) [1]. One important aspect of this social network is its ability to spread news and to connect people from all around the world.

Our focus was to automatically analyze Twitter’s potential to speak about a topic of interest, as well as the behavior of its users and the relationships among them. Thus, the paper presents statistics about tweets from a geographical point of view, correlated with their distribution among places on the world map. Our application can also extract information about individual users, for example tweeting peak hours, statistics about their followers or communities among a user’s followers.

The rest of the document is structured as follows: the second section presents related work, the third and fourth sections describe the tools that were integrated in order to conduct our experiments, the fifth section highlights the experimental data and interpretation of results, while the last section presents conclusions and future work.

II. STATE OF THE ART

Several recent efforts have been made to analyze social networks, especially Twitter. Firstly, Lim and Datta [1]

address the detection of communities with common interests on Twitter. Their initial focus was to identify celebrities that are representative for an interest category, before detecting communities based on links among the followers of these celebrities. Their study also covers the attributes of these communities and the effects of deepening or specializing interests.

Secondly, Cha, et al. [2] present an in-depth comparison of three measures of influence (in-degree, retweets and mentions) applied to a large volume of data collected from Twitter, consisting of 2 billion follow links among 54 million users who produced a total of 1.7 billion tweets. After investigating topics and the behavior of users over time, their study highlights that the most influential users can impact distinct topics and that this influence is not gained by chance, but through the concerted effort of limiting tweets to a single topic.

An interesting approach regarding the Twitter social network consists of a sentiment-driven perspective. As a practical example, Agarwal, et al. [3] built models for classifying tweets as positive, negative or neutral. They experimented with a unigram model, a feature-based model and a tree kernel-based model. Their experiments demonstrated that Twitter-specific features (for example, emoticons or hashtags) add only marginal value to the classifier, while features that combine prior word polarities with their corresponding part-of-speech tags are more relevant for the classification task. The authors also created two publicly available resources: a hand-annotated dictionary that maps emoticons to their polarity and an acronym dictionary collected from the web with English translations of over 5,000 frequently used acronyms.

In terms of visualization, Ediger, et al. [4] analyzed Twitter’s public data stream as a graph by using GraphCT (http://trac.research.cc.gatech.edu/graphs/wiki/GraphCT), the Graph Characterization Toolkit for massive graphs representing social network data. Their analysis of graphs with over 60 million vertices and approximately 1.5 billion edges demonstrated that the packaged metrics reveal characteristics of Twitter users’ interactions. Also, actors were ranked within conversations, which might help analysts focus attention on smaller, representative data subsets.

From a different perspective, Crymble [5] presents an analysis of how the archival community is using social networking services such as Twitter and Facebook as outreach

tools. The study shows that archival organizations overwhelmingly use the services to promote content they have created themselves, whereas archivists promote information they find useful. In all cases, more frequent posting did not correlate with a larger audience [6].

III. TWITTER SOCIAL NETWORK AND INTEGRATED TOOLS

A. Twitter Overview

Twitter is a social network in which registered users can share thoughts and ideas by using short text messages of up to 140 characters in length. These messages are named “tweets” and form the basis of social interactions. Also, users can be notified when the users they follow have posted messages. This “mesh” of users following other users generates the social structure of the Twitter network. Users can post their own tweets or re-post tweets from other members in a process called retweeting. As an extra functionality, users can reference each other in their tweets (through the use of the “@” symbol) or they can mention keywords or key subjects for an easier search by using the “#” (hashtag), for example #.

In addition, the presence of celebrities and the multitude of young people, as well as Twitter’s tendency to make public both local gossip and the hottest news, create an ideal environment for searching and spreading information about celebrities and details of their lives. As proof, the 10 most followed users are not corporations or mass media organizations, but individuals, most of them famous (e.g., Justin Bieber dethroned Lady Gaga as the most-followed Twitter user in January 2013, but ceded the top spot in November 2013 to Katy Perry, who still holds first place at the time of writing, July 2014).
Furthermore, these celebrities communicate with the millions of users that follow them by using tweets, often published by themselves or by publicists, thus bypassing the traditional mass-media channels between themselves and their fans. Alongside conventional celebrities, another class of Twitter users composed of bloggers, writers and journalists has started to occupy a small, but important share of tweets and followers. Thus, Twitter covers a whole spectrum of communication that spans from the personal and private level to traditional mass-media messages. Hence, Twitter offers an interesting overall context, especially because, as opposed to TV, radio and printed media, it permits easy observation of the information flow between its users [7].

B. Twitter API

Twitter offers an API (https://dev.twitter.com/docs/api) that enables data collection by providing developers with the capability to search for and store user profile information, user connections, tweets and retweets, as well as geographical information for the tweets, if available. This ensures extensibility, as online social network

analyses can be performed from the individual level to the online community level. From a technical perspective, the API is based on the REST architecture (Representational State Transfer), a collection of network design principles that define resources and the methods of accessing and using the underlying data. In terms of authentication, OAuth [8] is an open standard that provides “secure delegated access” to server resources and data on behalf of a resource owner. Nevertheless, the API has multiple limitations, of which the most problematic are:

1) Per user and per application limits
The usage rate for API version 1.1 is primarily tracked per user or, to be more specific, per access token. If a method allows 15 requests per time window, then 15 requests to this method are granted for each token. When application-level authentication is used, the limits are determined globally for the entire application, imposing an analogous cap on the requests per window made on behalf of users.

2) The 15 minutes time window limit
The time windows of Twitter API version 1.1 are 15-minute intervals and authentication is mandatory. There are two initial rates available for performing GET requests: 15 requests per 15-minute window and 180 requests per 15-minute window. More details regarding operations can be found at https://dev.twitter.com/docs/rate-limiting/1.1/limits.

3) Search timeframe limits
The search is limited to 180 requests per 15-minute window. Twitter does not permit an exhaustive search through all possible tweets in its API, but only a search through its most recent posts. With the current version, the results are returned from the latest 6 to 9 days.
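The window-based limits listed above can be illustrated with a small client-side throttle. The following is only a minimal sketch in Python and not part of Twitter4J or the Twitter API: the class `WindowLimiter` and its methods are hypothetical names, and the defaults simply reuse the 15 requests per 15-minute window quoted in the text. The real windows are enforced on the server, so a client-side counter like this one can only approximate them.

```python
import time


class WindowLimiter:
    """Client-side guard for a fixed-window request limit (hypothetical helper)."""

    def __init__(self, max_requests=15, window_seconds=15 * 60, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so the limiter can be tested without waiting
        self.window_start = None    # start time of the current window
        self.used = 0               # requests already spent in the current window

    def try_acquire(self):
        """Return True if a request may be sent now, False if the window is exhausted."""
        now = self.clock()
        if self.window_start is None or now - self.window_start >= self.window_seconds:
            # A new window begins: reset the counter.
            self.window_start = now
            self.used = 0
        if self.used < self.max_requests:
            self.used += 1
            return True
        return False

    def seconds_until_reset(self):
        """Return how long the caller should wait before the window reopens."""
        if self.window_start is None:
            return 0.0
        remaining = self.window_seconds - (self.clock() - self.window_start)
        return max(0.0, remaining)
```

A collector would call `try_acquire()` before each request and sleep for `seconds_until_reset()` when it returns `False`; one limiter per access token mirrors the per-token accounting described in item 1.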
In order to take full advantage of the API, Twitter4J (http://twitter4j.org) has been selected as the most mature Java interface that makes use of the latest version of the Twitter API.

C. IMDB

The Internet Movie Database, better known as IMDB (http://www.imdb.com/), is a commercial website that hosts an online database of movies, TV shows, actors, production crews, video games and fictional characters from visual entertainment media [9]. IMDB does not offer a query API but, even without such automation capabilities, most data can be downloaded in JSON format (JavaScript Object Notation). In the current experiments, we opted to extract only the exact names and the popular alternative/stage names, as these turned out to be the most relevant fields that could be retrieved from IMDB.

D. Gephi

Gephi (https://gephi.org/) [10] is an open-source software for network analysis and visualization whose flexible and multitasking architecture brings new possibilities of working with complex data sets, while generating representative visual graphs. In a nutshell, Gephi offers broad and easy access to network data, as well as facilities that enable data filtering, navigation, manipulation and clustering.
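Gephi can import graphs stored as GraphML, the format in which the application exports its follower graphs. As a rough illustration of what such a file contains, the sketch below builds a directed GraphML document with Python's standard `xml.etree.ElementTree` module. The function name `follower_graph_to_graphml`, the graph id and the sample user names are assumptions made for this example; the actual exporter used by the application is not described in the paper.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def follower_graph_to_graphml(edges):
    """Build a directed GraphML document from (follower, followed) pairs."""
    root = ET.Element("graphml", {"xmlns": GRAPHML_NS})
    graph = ET.SubElement(root, "graph", {"id": "followers", "edgedefault": "directed"})
    # Every user that appears in at least one edge becomes a node.
    nodes = sorted({user for edge in edges for user in edge})
    for user in nodes:
        ET.SubElement(graph, "node", {"id": user})
    # Each edge points from the follower to the followed user.
    for index, (follower, followed) in enumerate(edges):
        ET.SubElement(
            graph, "edge",
            {"id": f"e{index}", "source": follower, "target": followed},
        )
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample data: three users and three follow relationships.
sample_edges = [("alice", "bob"), ("carol", "bob"), ("alice", "carol")]
graphml_text = follower_graph_to_graphml(sample_edges)
```

Writing `graphml_text` to a file with a `.graphml` extension should yield something Gephi can open; node attributes such as follower counts would need additional `<key>` and `<data>` elements, which are omitted here.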

E. Google Fusion Tables

Fusion Tables (http://tables.googlelabs.com) is an online data management application specifically conceived for collaboration, visualization and data publication. In contrast to traditional databases focusing on SQL queries and transaction processing, this toolset is built for data management and collaboration: combining multiple data sources, data analysis, querying, visualization and web publishing. After loading datasets in table form, the results can be filtered, aggregated and visualized using Google Maps or other visualization APIs provided by Google. Also, data from multiple sources can be combined when the corresponding data sets refer to the same entities.

IV. IMPLEMENTATION

Our application provides a web interface through which a user can search for an actor from IMDB, correlated with celebrities’ profiles on Twitter. The results are sorted according to relevance and include fan-made profiles, parody profiles, informative profiles or other related content. The relevance of a Twitter result is given by a metric computed as the product of the number of followers and the number of retweets. Multiple formulas and approaches have been tried, but the presented heuristic generated the most relevant results. The advantage of using IMDB is that results are returned even if typos are made in the actor’s name. Thus, if the exact spelling of the actor’s name is unknown to the user, the most viable alternative is automatically used.

Overall, the application collects recent data and stores it locally, within the limits imposed by the Twitter API, which allows only up to 180 requests every 15 minutes. Visualization is achieved by integrating the previously described tools, Gephi and Google Fusion Tables. Thus, the experimental results contain information about the geographical area in which the tweets were posted.
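The relevance heuristic described in this section (followers multiplied by retweets) is simple enough to sketch directly. The snippet below is an illustration under stated assumptions rather than the application's code: the `Profile` record, its field names and the sample values are hypothetical, and the text does not specify over which tweets or time span the retweet count is taken.

```python
from dataclasses import dataclass


@dataclass
class Profile:
    """Candidate Twitter profile returned by a search (hypothetical record)."""
    screen_name: str
    followers: int
    retweets: int


def relevance(profile):
    """Relevance heuristic from the text: number of followers times number of retweets."""
    return profile.followers * profile.retweets


def rank_profiles(profiles):
    """Sort candidate profiles from most to least relevant."""
    return sorted(profiles, key=relevance, reverse=True)


# Made-up candidates for one search term.
candidates = [
    Profile("fan_page", followers=12_000, retweets=40),
    Profile("official", followers=900_000, retweets=350),
    Profile("parody", followers=55_000, retweets=5),
]
ranked = [p.screen_name for p in rank_profiles(candidates)]
```

Because the two factors are multiplied, a profile with many followers but almost no retweets (the parody account in this sample) ranks below a smaller but more actively retweeted one.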
Because not all Twitter users have geo-location activated, not all tweets have geographical information, which reduces the data to a limited set and can sometimes lead to a loss of precision. The majority of the data sets contain approximately 1,000 posts. For a better visualization of the data, a heat map of each data set has been generated (see Fig. 2). Afterwards, the most active hours in the last week and the number of tweets posted can be further analyzed. Additionally, users are also analyzed from the followers’ point of view: statistics about the followers of a user’s followers emphasize the user’s importance.

The application generates a graph in GraphML format (http://graphml.graphdrawing.org/) that can be imported into Gephi. Several measurements can then be performed with Gephi on the directed user graph containing the user’s followers and the connections between them. Firstly, betweenness centrality (i.e., the number of shortest paths from all vertices to all others that pass through the current node) [11, 12] highlights the most important and central nodes within the community. Afterwards, modularity (i.e., the strength of the division of a network into corresponding modules) is used to detect the community structure of our follower graph.

V. EXPERIMENTAL RESULTS AND DISCUSSIONS

A. Tweets Assigned to a Topic

This section presents a dataset comprising data about Hollywood or European actors, musicians, public persons and events.

1) Actors
Fig. 1 depicts the geographical distribution of tweets about Jennifer Lawrence, a trending movie actress ranked among the top 10 actors according to the IMDB STARmeter in May 2014. The used subset contains tweets from the week of 24-31 May 2014, out of which only 790 were geolocated.

Fig. 1. Tweets containing geographical information regarding “Jennifer Lawrence” as a search term

For a better visualization of the areas where users are more active with regard to a given topic, a heat map of the data set has been generated (see Fig. 2). A heat map is a graphical representation of data in which the individual values are accumulated and represented as colors. Thus, red expresses the highest intensity of tweets in the rendered region, while light green expresses a lower number of tweets.

Fig. 2. Tweet intensity over the course of one week regarding "Jennifer Lawrence"
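The accumulation step behind such a heat map can be sketched in a few lines. In the paper the rendering is delegated to Google Fusion Tables, so the code below is only an illustrative approximation: the fixed latitude/longitude grid, the 5-degree cell size, the function names and the sample coordinates are all assumptions made for this example.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_degrees=5.0):
    """Return the (row, column) index of the grid cell containing a point."""
    row = math.floor((lat + 90.0) / cell_degrees)
    col = math.floor((lon + 180.0) / cell_degrees)
    return row, col


def heat_counts(points, cell_degrees=5.0):
    """Accumulate the number of geotagged tweets per grid cell."""
    return Counter(grid_cell(lat, lon, cell_degrees) for lat, lon in points)


def intensity(counts):
    """Scale counts to [0, 1]; 1.0 marks the busiest cell (rendered as red)."""
    peak = max(counts.values())
    return {cell: n / peak for cell, n in counts.items()}


# Made-up (latitude, longitude) pairs: three near London, one near New York.
points = [(51.5, -0.1), (51.6, -0.2), (52.1, -1.0), (40.7, -74.0)]
counts = heat_counts(points)
levels = intensity(counts)
```

A renderer would then map each normalized level to a color ramp, for example light green for low values and red for values close to 1.0, matching the legend described above.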

The next analyzed subset is about Jim Carrey, another Hollywood actor, who was rated among the top 20 actors of the last 20 years on IMDB (the ranking was last updated on 13th of June 2013). Fig. 3 shows a visualization of approximately 200 tweets about him between 24th of May and 1st of June, 2014. The intensity map presented in Fig. 4 is quite similar to the heat map for Jennifer Lawrence, the most active regions being the United States, United Kingdom and Indonesia. This highlights where people engage with the Hollywood community, but also the places where Twitter is popular.

Fig. 3. Tweets containing geographical information regarding “Jim Carrey” as a search term

Fig. 4. Tweet intensity over the course of one week regarding "Jim Carrey"

2) Other Celebrities
A geographical visualization of tweets about Bill Gates, one of the most renowned and emblematic figures of the IT&C community, from 24th of May to 1st of June, can be found in Fig. 5. The data is composed of about 500 tweets.

Fig. 5. Tweets containing geographical information regarding “Bill Gates” as a search term

Fig. 6 shows the heat map associated with the collected data for the “Bill Gates” search. It can be easily noticed that a wide range of people are talking about him, from all corners of the world, as the IT community is vast and more widely spread. Based on this worldwide popularity, the regions with the most active Twitter users can be identified: southern USA, Brazil, Europe (mostly the United Kingdom), Turkey, Indonesia and Japan.

Fig. 6. Tweet intensity over the course of one week regarding "Bill Gates"

The next dataset is dedicated to Stephen Fry, who was born in the UK and lives in London. He also wields a considerable amount of influence through his use of Twitter [13]. About 60 tweets were gathered, from 24th of May to 1st of June. Fig. 8 represents the intensity of these tweets, underlining the fact that he is mostly acclaimed in his own country.

Fig. 7. Tweets containing geographical information regarding “Stephen Fry” as a search term

Fig. 8. Tweet intensity over the course of one week regarding "Stephen Fry"

3) Events
The analyzed event is the Fukushima nuclear disaster in Japan, rated at level 7, the maximum, on the International Nuclear Event Scale. Data was gathered from 24th of May to 1st of June 2014 and consists of about 600 tweets. The intensity of these tweets is represented in Fig. 10. As the event happened in Japan, it was, as expected, tweeted about mostly in Japan.

Fig. 9. Tweets containing geographical information regarding “Fukushima” as a search term

Fig. 10. Tweet intensity over the course of one week regarding "Fukushima"

B. Tweets Statistics

Taking into account the analyzed topics and their distributions, the corresponding statistics can reveal meaningful interpretations. The statistics built from the data set rely on UTM (the Universal Transverse Mercator coordinate system) [14]. The UTM conformal projection uses a 2-dimensional Cartesian coordinate system to pinpoint locations all around the globe, independently of their vertical position. Therefore, for each UTM zone, the number of tweets is reported.

Jennifer Lawrence, the first search term, has a wide distribution of tweets among the world map zones, as can be seen in Fig. 11. The graphic indicates that the zones with the highest density are zone 31 (mostly the United Kingdom) with 106 tweets and zone 18 (New Jersey, Virginia, Pennsylvania, New York, Delaware and Maryland in the eastern United States) with 100 tweets.

Fig. 11. Jennifer Lawrence tweet distribution by UTM zones

The data subset for the search term “Jim Carrey” is smaller than the previous one. Although Fig. 12 presents a wide distribution, the total number of tweets is only 183. The graphic highlights the most important zones: zones 15 to 19 (eastern US) with around 20 tweets per zone and zones 31 and 32 (mostly UK) with 20 and 13 tweets, respectively. Again, the highest density of tweets corresponds to the eastern part of the US and the UK, therefore highlighting areas in which Twitter is highly adopted.

Fig. 12. Jim Carrey tweet distribution by UTM zones

Our dataset contains 488 tweets about Bill Gates, the co-founder of Microsoft. Fig. 13 shows that the highest density of tweets is in zones 18 and 19, with about 60 tweets per zone, followed by zones 31 and 32, with around 30 tweets per zone. Following the same pattern, the zones correspond to the eastern US and the United Kingdom.

Fig. 13. Bill Gates tweet distribution by UTM zones

Stephen Fry, a distinctly British celebrity, is tweeted about mostly in the United Kingdom. As the graph shows, the highest density of tweets can be found in zone 31, corresponding to the UK, with 39 tweets.

Fig. 14. Stephen Fry tweet distribution by UTM zones

A completely different perspective is revealed through the analysis of tweets about the Fukushima disaster. From a data set containing about 600 tweets, most are located in zone 55, corresponding to Japan.
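The per-zone counts used in this subsection can be derived from tweet longitudes alone, since the standard UTM grid divides the globe into 60 longitudinal zones of 6 degrees each, numbered eastward starting at 180° W. The sketch below applies that standard rule; it ignores the special-case zones around Norway and Svalbard, and the function names and sample longitudes are assumptions for illustration. The exact zone numbering used to produce the paper's figures is not specified, so the output of this sketch may not match the reported zone labels exactly.

```python
import math
from collections import Counter


def utm_zone(longitude):
    """Return the UTM longitudinal zone (1..60) for a longitude in degrees."""
    if not -180.0 <= longitude <= 180.0:
        raise ValueError("longitude must be within [-180, 180]")
    zone = math.floor((longitude + 180.0) / 6.0) + 1
    return min(zone, 60)  # longitude +180 is folded into zone 60


def tweets_per_zone(longitudes):
    """Count geotagged tweets per UTM zone."""
    return Counter(utm_zone(lon) for lon in longitudes)


# Made-up longitudes: three on the US east coast, two in western Europe.
sample_longitudes = [-74.0, -73.9, -77.0, 2.3, 4.9]
zone_counts = tweets_per_zone(sample_longitudes)
```

Averages and standard deviations per zone across the search terms can then be computed from such counters, for instance with Python's `statistics` module.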

Fig. 15. Fukushima tweet distribution by UTM zones

As the graphics show, the most popular zone is zone 31, corresponding to the UK, with a standard deviation of 38.65 and an average of 40.6 tweets, followed by the zones from the eastern US, especially zone 19, with a standard deviation of 34.50 and an average of 32 tweets.

C. Tweets analyzed from the user’s point of view

Our application is able to determine the most active hours for a user, according to his/her tweets from the past week. Thus, for the user @stephenfry, the hours when he posted the most tweets during the week of 25th of May to 1st of June can be visualized in Fig. 16.

Fig. 16. Peak hours for @stephenfry user

D. User analyzed from the follower’s point of view

Additionally, the application can classify a user’s followers according to their own number of followers. For the user @JoanaJord14, we grouped the followers’ counts into 8 slices. Fig. 17 shows that most followers (~30%) of @JoanaJord14 have between 5,000 and 50,000 followers.

Fig. 17. Followers count for @JoanaJord14’s followers

E. User Graph and communities

As presented in the fourth section, Gephi was used to build a directed graph of followers for each user. Fig. 18 depicts an example for the user @JoanaJord14, her followers and the connections between her followers. The users are represented as nodes, while friend/follower relationships are represented as edges.

Fig. 18. User Graphs for @JoanaJord14 and @_RuxandraD

The communities are differentiated by colors and the importance of users by the size of the node. The analyzed user is in the center of the graph. The graph is also automatically filtered by keeping solely the most important nodes in terms of betweenness centrality. In this manner, a user can easily identify the communities among his/her followers.

VI. CONCLUSIONS AND FUTURE WORK

As our application allows the accumulation of the most recent data, the paper presents statistics about different search subjects (celebrities, trending topics or events) according to the geographical location of the tweets. Thus, areas in which tweeting is frequent can be inspected, regardless of the subject, suggesting that the Twitter network is most frequently used or preferred in certain areas. Our study confirms that even though Twitter users can be found anywhere on the globe, the active half of the users (the ones that post a tweet at least once a month) are mostly located in five countries: USA, Japan, Indonesia, Great Britain and Brazil (source: http://mashable.com). In addition, areas in which users make use of other writing systems (e.g., Russia or China) seldom appear in the collected dataset. Furthermore, our application is also capable of capturing the peak hours in which Twitter users posted during the last week.

Starting from all the previously performed analytics, our study confirms common user behaviors on social networks, and the generated statistics provide valuable insights in terms of usage, distribution and interests. As future developments, the Twitter profile search will be improved regarding the relevance and diversity of data, the user links and their role within specific user groups. The user search can be extended according to the targeted social network, for academic subjects, or friends of users with specific interests. The social network analysis can also be improved by adding new methods of research, like presenting

the users and their friends/followers in a clustered form or emphasizing the analysis of a user’s friends. From a different perspective, current experiments envision an integration with sentiment analysis tools [15], as in the experiments performed by Martínez-Cámara, et al. [16], as well as creating an interface with our discourse analysis platform, ReaderBench [17], for a deeper representation of cohesive links among tweets.

ACKNOWLEDGMENT

This research has been partially supported by the Sectoral Operational Programme Human Resources Development 2007-2013 of the Ministry of European Funds through the Financial Agreement POSDRU/159/1.5/S/134398.

REFERENCES

[1] K. H. Lim and A. Datta, "Finding twitter communities with common interests using following links of celebrities," in 3rd Int. workshop on Modeling social media, Milwaukee, Wisconsin, USA, 2012, pp. 25–32.
[2] M. Cha, H. Haddadi, F. Benevenuto, and K. P. Gummadi, "Measuring User Influence in Twitter: The Million Follower Fallacy," in 4th Int. AAAI Conf. on Weblogs and Social Media (ICWSM), Washington, DC, USA, 2010, pp. 10–17.
[3] A. Agarwal, B. Xie, I. Vovsha, O. Rambow, and R. Passonneau, "Sentiment analysis of Twitter data," in Workshop on Languages in Social Media, Portland, Oregon, 2011, pp. 30–38.
[4] D. Ediger, K. Jiang, J. Riedy, D. A. Bader, and C. Corley, "Massive Social Network Analysis: Mining Twitter for social good," in 39th Int. Conf. on Parallel Processing, San Diego, CA, 2010, pp. 583–593.
[5] A. Crymble, "An Analysis of Twitter and Facebook Use by the Archival Community," Archivaria, vol. 70, pp. 125–151, 2010.
[6] A. Spink, B. J. Jansen, and J. Pedersen, "Searching for people on Web search engines," Journal of Documentation, vol. 60, pp. 266–278, 2004.
[7] S. Wu, J. M. Hofman, W. A. Mason, and D. J. Watts, "Who says what to whom on twitter," in 20th Int. Conf. on World Wide Web, Hyderabad, India, 2011, pp. 705–714.
[8] Internet Engineering Task Force (IETF), "The OAuth 2.0 Authorization Framework (RFC-6749)," 2012.
[9] F. Gao, "Modeling and Interference of the Internet Movie Database," Master thesis, Department of Mathematics and Computer Science, Technische Universiteit Eindhoven, 2011.
[10] M. Bastian, S. Heymann, and M. Jacomy, "Gephi: An open source software for exploring and manipulating networks," in International AAAI Conference on Weblogs and Social Media, San Jose, CA, 2009, pp. 361–362.
[11] U. Brandes, "A faster algorithm for betweenness centrality," Journal of Mathematical Sociology, vol. 25, pp. 163–177, 2001.
[12] L. Freeman, "A set of measures of centrality based on betweenness," Sociometry, vol. 40, pp. 35–41, 1977.
[13] BBC News. (2009). A portrait of the decade [Online]. Available: http://news.bbc.co.uk/2/hi/in_depth/8409040.stm
[14] U.S. Geological Survey, "The Universal Transverse Mercator (UTM) Grid," U.S. Geological Survey, Reston, VA 077-01, 2001.
[15] D. Lupan, M. Dascalu, S. Trausan-Matu, and P. Dessus, "Analyzing emotional states induced by news articles with Latent Semantic Analysis," in 15th Int. Conf. on Artificial Intelligence: Methodology, Systems, Applications (AIMSA 2012), Varna, Bulgaria, 2012, pp. 59–68.
[16] E. Martínez-Cámara, M. T. Martín-Valdivia, L. A. Ureña-López, and A. R. Montejo-Ráez, "Sentiment analysis in Twitter," Natural Language Engineering, pp. 1–28, 2013.
[17] M. Dascalu, Analyzing discourse and text complexity for learning and collaborating, Studies in Computational Intelligence vol. 534. Switzerland: Springer, 2014.