Master's Thesis

Validating the Merit of Properties that Predict the Influence of a Twitter User

Author:

Stefan Räbiger
Magdeburg, March 24th 2014

Reviewer: Prof. Myra Spiliopoulou
Supervisor: Prof. Myra Spiliopoulou

Abstract
What characterizes an influential user? While there is much research on finding the concrete influential members of a social network, there are fewer findings about the properties distinguishing influential from non-influential users. A major challenge is the absence of a ground truth on which supervised learning can be performed. In this thesis, such a ground truth for users and tweets is established for the social network Twitter: annotators assigned users and tweets a label through an interactive annotation tool. On the basis of this dataset, a relation graph and an interaction graph are constructed, from which attributes are extracted that allow the application of supervised learning. The experiments show that there are predictive properties associated with the activity level of users and their involvement in communities, but also that writing influential tweets is not a prerequisite for being an influential user.

Key Words: Identification of Influential Users, Properties of Influential Users, Classification of Influential Users, Community Mining, Influential Users in Twitter, Influential Users, Influential Tweets, Annotation of Tweets, Twitter

Acknowledgements
First and foremost, I would like to thank Prof. Spiliopoulou as my supervisor for being approachable at literally all times and for the many insightful discussions that improved the quality of this thesis significantly [1]. Then, of course, I would like to express my deepest gratitude to all annotators who altruistically labeled the myriad of tweets and users in their precious spare time. Without them, my thesis would not have been possible at all. Moreover, special thanks go to Prof. Evans for his advice on the LineGraphCreator, as well as to Christian Beyer for the countless discussions and for proofreading my thesis. Last but not least, I am also grateful for the great support of my parents, brother and grandmother.

[1] No statistical test performed here


Contents

List of Figures
List of Tables

1. Introduction
   1.1. Motivation
   1.2. Scientific Questions
   1.3. Main Outcomes
   1.4. Methodology
   1.5. Text Organization
2. Background
   2.1. Related Work
      2.1.1. What is Influence?
      2.1.2. Methods for Detecting Influentials
      2.1.3. Role of Community Overlaps
      2.1.4. Methods for Detecting Overlapping Communities
      2.1.5. Using Communities for the Detection of Influentials
      2.1.6. Why Use Twitter?
   2.2. Basic Concepts of Twitter
3. Overview of Components
   3.1. SNAnnotator
   3.2. InfluenceLearner
   3.3. InfluenceSimulator
4. Workflow of InfluenceLearner
   4.1. Turning Dataset into Graphs
   4.2. Detecting Overlapping Communities
   4.3. Considered Attributes to be Extracted
   4.4. Deriving Labels from the Dataset with Ground Truth
   4.5. Learning MaxClassifier & DirectClassifier
   4.6. Identification of Most Predictive Attributes of Influence
5. Tasks of InfluenceLearner
   5.1. Analysis of the Graphs' Properties Regarding Social Networks
      5.1.1. Reciprocity
      5.1.2. Properties of Scale-Free Networks
   5.2. Community Detection Tasks
      5.2.1. Transformation of the Original Graph to a Line Graph
      5.2.2. Detection of Non-Overlapping Communities on the Line Graph
      5.2.3. Detecting Overlapping Communities
   5.3. Details on Attribute Extraction
   5.4. Implementation Details
6. SNAnnotator
   6.1. Establishing the Ground Truth Dataset
      6.1.1. Dataset Crawl
      6.1.2. Annotation Process
      6.1.3. Merging and Cleaning the Dataset
   6.2. Analysis of the Ground Truth Dataset
      6.2.1. Analysis of the Annotator Agreement
      6.2.2. Analysis of the Label Distribution
7. InfluenceSimulator
   7.1. Linear Threshold Model
   7.2. Weighted Cascade Model
   7.3. Deriving Influence Probabilities
   7.4. General Procedure of InfluenceSimulator
8. Evaluation
   8.1. Distinguishing between Influential Users and Influential Tweets
   8.2. Deriving the Baselines
   8.3. Quality of Learned Classifiers
   8.4. Simulation of Influence Propagation
      8.4.1. Linear Threshold Model
      8.4.2. Weighted Cascade Model
      8.4.3. Interpretation of Results
   8.5. Most Predictive Attributes of Influence
9. Conclusions
   9.1. Contributions & Findings
   9.2. Conclusion
   9.3. Future Work
A. Annotation Protocol
B. Annotation Tutorial
Glossary
References
Statement of Authenticity

List of Figures
1. Degree Distribution IG
2. Degree Distribution RG
3. Annotation Tool Screenshot
4. Stored Metadata During Annotation
5. Comparison of ROC Curves (MaxClassifier)
6. Comparison of ROC Curves (DirectClassifier)
7. Results of Simulation According to LT with MaxLabel
8. Results of Simulation According to LT with DirectLabel
9. Results of Simulation According to WC with MaxLabel
10. Results of Simulation According to WC with DirectLabel
11. Comparison of ROC Curves on Reduced Attribute Set with MaxLabel
12. Comparison of ROC Curves on Reduced Attribute Set with DirectLabel
13. Comparison of ROC Curves on Reduced Attribute Set Without Influence

List of Tables
1. Glossary of Twitter Terms
2. Extracted Attributes per User
3. Properties of RG and IG to be Considered Scale-Free
4. Test Datasets for LineGraphCreator
5. Labels in RG and IG
6. Correlation Between Influentials and Tweets
7. Most Predictive Attributes of Influence
8. Most Predictive Attributes of Influence Without Influence

1. Introduction
This chapter first motivates the problem of identifying influential users. Afterwards, the questions to be answered in this thesis are stated. The next section summarizes the main findings, followed by the methods applied in this thesis to unveil them. The chapter concludes with an outline of the remaining chapters.

1.1. Motivation
What do propaganda, recommendations, the spreading of diseases and viral marketing have in common? All of them can be modeled as networks and therefore be analyzed in a similar manner. A company might ask: we want to start a viral marketing campaign for our brand new product - whom do we have to contact to reach our target audience? Similarly, a person inexperienced in quantum mechanics wants to learn everything about this topic and asks for guidance in a forum - who is most likely able to help? Likewise, doctors have discovered a new disease that might turn into an epidemic - how can an outbreak be prevented efficiently? Correspondingly, someone is interested in shaping the public's opinion in a certain way - who should be contacted? Such questions can be answered by identifying the influential entities of the respective network, referred to as influentials for short. In the context of diseases, a network consisting of genes could be modeled by inserting edges between two entities known to interact with each other. Influentials could then be the genes mainly responsible for triggering a specific disease. Identifying them might give biologists useful insights into where to start searching for an antidote in order to prevent the disease from turning into an epidemic. In the case of expertise like quantum mechanics, a network could be modeled with forum users, and an edge would be inserted if a user replied to a post by a different user. Influentials would be those users in the network who have the best overview of the topic. Detecting such users makes it possible to refer beginners to them. This, in turn, enables beginners to learn efficiently, as these experienced users could give them pointers on where to start their research or which literature to read. Usually, beginners tend to listen to


experts' recommendations [SCHT07]. Someone who wants to shape public opinion in a preferred way could hire influential individuals in the field of interest, who are able to propagate the desired (possibly counterfactual) opinion in order to reach as many people as possible. In viral marketing, a company seeks to promote a product with little effort and cost by giving the product to selected individuals for free or at a discount. Such carefully chosen people spread positive information about the received product, motivating others to buy it as well, which makes viral marketing appealing to companies. These examples illustrate that many practical applications exist for the identification of influentials and that detecting such individuals generalizes to a large extent, so that, in principle, methods used in one field can be applied to other areas. Nonetheless, solving this problem does not only help answer the aforementioned questions, but can easily be misused as well, for instance for propaganda. Therefore, it is important to deal critically with the subject and to consider the potential consequences of introducing new techniques to this field in advance.

1.2. Scientific Questions
As outlined above, a wide range of applications exists for the problem of identifying influential users in networks. This thesis, however, focuses on identifying influentials in the context of viral marketing on Twitter, although the basic ideas might be transferable to other fields as well. The problem of finding influentials is expressed as a supervised classification task and aims at identifying the characteristics of influentials rather than detecting influential users themselves, although the latter is accomplished as a byproduct. Besides attributes related to certain assumptions that are known to correlate with influence, attributes associated with community structure are utilized as well; this aspect has been largely neglected in studies so far, although some findings suggest a correlation with influence. One part of analyzing the characteristic properties of influentials deals with the question whether influentials write influential tweets. Thus, not only influentials, but also users writing influential tweets are considered throughout this thesis. Last but not least, there exists extensive research with respect to influence propagation models for social networks. Nonetheless, it is still an open question


[Bon11] how well they capture influence propagation given a ground truth, for they are normally used in the absence of a ground truth. Thus, this thesis aims to answer the following four questions:

1. What are the characteristic attributes of influential users and users posting influential tweets?
2. Does community structure help identify influential users and users writing influential tweets?
3. Do influential tweets characterize influential users?
4. Are influence spread models an appropriate means for evaluating algorithms in the absence of ground truth data?

1.3. Main Outcomes
This thesis makes three main contributions and reports five main findings. This thesis contributes:

1. A workflow that is applicable to any Twitter topic, covering all steps from collecting and establishing a ground truth dataset to applying supervised learning methods to identify influentials and their characteristic properties;
2. An analyzed Twitter dataset in English on the topic "Amazon Kindle", containing user and tweet labels with respect to influence;
3. An interactive annotation tool for assigning influence labels to tweets and users.

The main findings of this thesis are:

1. Attributes related to community structure contribute to detecting influential users and users writing influential tweets;
2. Five to six attributes covering tweet content and graph topology suffice to build a classifier whose quality can compete with that of the classifiers utilizing all attributes;


3. Based on the collected dataset, there is no statistical evidence that influential users post many influential tweets. This result is partly intuitive, because some individuals are influential due to external factors, e.g. celebrities;
4. The underlying assumptions of the Weighted Cascade Model and the Linear Threshold Model are unrealistic, making them unsuited as a means of evaluation in the absence of a ground truth;
5. Analyzing the interaction and relation graphs suggests that the collected dataset does not exhibit common properties of social networks (a degree distribution obeying a power law, assortative mixing, a small average shortest path length, a high clustering coefficient).

1.4. Methodology
This section sheds light on the reasoning applied in this thesis to answer the questions stated above. Since the problem of identifying influentials is tackled with the help of a supervised learning algorithm, a ground truth must be devised. For this purpose, a Twitter dataset is collected, and annotators then assign tweets and users a label in terms of influence. This is necessary for distinguishing between influentials and users who write influential tweets: influential users are determined from user labels, whereas users composing influential tweets are identified with the help of tweet labels. The final dataset is then transformed into two graphs - an interaction graph and a relation graph - from which a set of attributes for the users in the network is extracted. These attributes belong to one of four assumptions characterizing influentials: activity, community structure, quality and centrality. Based on these attributes, classifiers are built for influentials and for users posting influential tweets. In order to answer the questions of this thesis, four different experiments are conducted: (1) The relationship between influentials and users posting influential tweets is analyzed using Pearson's r and Spearman's coefficient in order to find out whether influential tweets characterize influentials. (2) The quality of the learned classifiers compared to baseline methods is examined by means of ROC curves. (3) Two popular and basic influence propagation models, the Weighted Cascade Model


and the Linear Threshold Model, are applied to the constructed graphs in order to find out whether the results obtained in (2) can be confirmed by simulating the spread of influence. (4) The most predictive attributes of influence are determined by computing the subset of most predictive attributes that show a low intercorrelation among each other.
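Experiment (1) relies on Pearson's r and Spearman's coefficient. The following is a minimal, dependency-free sketch of the two measures; the score lists are hypothetical placeholders for the per-user values derived from the annotated dataset, and this sketch applies no tie correction in the Spearman ranking:

```python
def pearson_r(xs, ys):
    """Pearson's r: covariance of xs and ys divided by the product of
    their standard deviations."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman_rho(xs, ys):
    """Spearman's coefficient: Pearson's r computed on the ranks of
    the values (no tie correction in this sketch)."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0.0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank + 1.0
        return r
    return pearson_r(ranks(xs), ranks(ys))

# Hypothetical per-user scores: user-level influence vs. the share of
# influential tweets written by that user.
user_scores = [0.9, 0.1, 0.4, 0.7, 0.2]
tweet_scores = [0.8, 0.2, 0.3, 0.9, 0.1]
print(pearson_r(user_scores, tweet_scores), spearman_rho(user_scores, tweet_scores))
```

Pearson's r captures linear agreement of the raw scores, while Spearman's coefficient only asks whether the two rankings of users agree, which is the more robust question when label counts vary per user.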

1.5. Text Organization
Chapter 2 describes the basics of Twitter, important Twitter terms, and related work with respect to social networks and the identification of influentials. This is followed by a short overview of the developed components in Chapter 3. Chapter 4 introduces the general workflow of the first component, describing all steps from the construction of the graphs to the extraction of attributes to the identification of characteristic properties. The following chapter discusses the details of the proposed workflow, while Chapter 6 presents the second component, which is responsible for collecting the Twitter dataset and establishing a ground truth. The last component, which allows simulating influence spread, is described in Chapter 7. Chapter 8 tries to find clues for answering the four scientific questions of this thesis by conducting four different experiments. A summary of the content and a discussion of potential improvements conclude this thesis.


2. Background
At first, the work most relevant to this thesis is discussed, including a motivation for choosing Twitter to study the problem of identifying influentials in social networks. The chapter concludes by introducing the basic concepts of Twitter that are referred to throughout this thesis.

2.1. Related Work
This section reviews existing work that is relevant to this thesis. It is divided into six parts, each dealing with a different aspect. At first, the question of what constitutes influence is discussed, followed by a distinction between different types of methods for detecting influentials. Since community structure is a vital aspect for identifying influentials in this work, the following two parts shed light on how to detect communities in networks. Thereafter, the studies most similar to this thesis are introduced, in which the authors utilize community structure for the identification of influentials. The section concludes by motivating why Twitter is suitable for analyzing influence in social networks.

2.1.1. What is Influence?
It is, in general, difficult to measure influence because it is a highly subjective term, so that different people associate different meanings with it. Cha et al. [CHBG10] even note that "[i]t has been unclear what influence means." Nonetheless, various definitions of the term exist, e.g. [GBL10, WD07, KB03, BCM+12, BHMW11]. For instance, Keller et al. [KB03] found that influentials can be characterized as well-connected in the network, active, meaning they engage their audience to some extent, and considered knowledgeable by their peers, which is why they are often consulted. This definition comes from a marketing background. Bakshy et al. [BHMW11] consider an individual influential if she disproportionately impacts the spread of information throughout the network. More specifically, for each Twitter user they count how many of her URLs were reposted by others


and interpret this number as influence. Thus, influence is defined in a pragmatic way, not necessarily aiming to cover all aspects of the term. Likewise, other definitions exist depending on the task at hand; for example, Bigonha et al. [BCM+12] assume influentials to exhibit the following characteristic properties: activity, well-connectedness, high tweet quality, and expressing their opinions about a product, thereby engaging others. Some of these assumptions regarding how influentials are characterized are reused in this thesis when it comes to selecting adequate attributes for identifying influential users.

2.1.2. Methods for Detecting Influentials
Ever since the problem of identifying influentials was studied by Domingos et al. in [DR01], the field has attracted a lot of attention, as the prospect of viral marketing is intriguing. Different approaches for detecting influentials exist. The first type of approach involves maximizing the spread of influence in the network. This problem was introduced by Kempe et al. in [KKT03]: given the probabilities for influencing other nodes in the network, a subset of k nodes maximizing the spread of influence must be found. The problem is known to be NP-hard, but different methods exist to approximate the most influential users. Various models exist to model this spread of influence, among them the basic yet popular Independent Cascade Model (IC) and Linear Threshold Model (LT). Both are basic in the sense that they are restrictive: in IC each node has only one chance to activate its neighbors, whereas in LT each node picks a threshold at random which must be exceeded by the weights of its incoming edges from active neighbors in order to activate the node. Additionally, discrete time steps are used, and both models are static, meaning the probabilities of influencing neighbors do not change over time.
Moreover, every node is either active or inactive in each time step, and it is impossible to turn an active node into an inactive one again. Last but not least, IC and LT assume that the influence probabilities of the entire network are known, that is, for every edge in the network there exists a probability with which the target node gets activated. More realistic models incorporate a time delay before an activation of neighbors takes place [SKOM10], utilize continuous time [LPLS13], or even overcome the problem of estimating influence probabilities by learning them


from a collected action log [GBL10] instead of approximating them by heuristics. The action log records at what time a user performed an action, which allows considering the initiators of actions more influential. More simulation models are described in [Bon11]. LT and a variant of IC are used in this thesis for evaluating the quality of the learned classifiers.
The second type of approach for detecting influentials focuses on heuristics and attributes obtained directly from the network; only methods related to Twitter are considered here. Cha et al. [CHBG10] evaluate three simple measures that can be obtained from the network immediately: the number of a user's followers, the number of retweets, and how often a user name was mentioned in tweets, as indicators of influence. They conclude that only the two latter measures correlate with influence. Bigonha et al. [BCM+12] rank users according to a formula utilizing topological (well-connectedness) and topical attributes (tweet sentiment and quality). Kong et al. [KF11] identify influentials based on tweet quality. Besides well-connectedness, different attributes in terms of tweet quality are reapplied in this thesis.

2.1.3. Role of Community Overlaps
Overlap means that nodes belong to more than one community at a time; in social networks, for example, individuals often belong to the communities "sport" and "video gaming" simultaneously. Furthermore, Barbieri et al. [BBM13] observed that overlaps play a crucial role in spreading information (= cascades) further through a network:

   In particular, individuals tend to adopt the behavior of their social peers, so that cascades happen first locally, within close-knit communities, and become global "viral" phenomena only when they are able to cross the boundaries of these densely connected clusters of people. Therefore, the study of social contagion is intrinsically connected to the problem of understanding the modular structure of networks (known as community detection), and together form the central core of network science.
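The cascade behavior described above can be made concrete with the Independent Cascade Model from Section 2.1.2. The following toy sketch runs it on a hypothetical graph with two close-knit groups joined by a single weak bridge edge; the graph, seed set and edge probabilities are invented for illustration and are not taken from the thesis dataset:

```python
import random

def independent_cascade(out_neighbors, seeds, prob, rng):
    """One run of the Independent Cascade Model: each newly activated
    node gets exactly one chance to activate each still-inactive
    neighbor, and activations proceed in discrete time steps."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in out_neighbors.get(u, []):
                if v not in active and rng.random() < prob[(u, v)]:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return active

# Two hypothetical communities {a, b, c} and {x, y, z}, bridged by (c, x).
graph = {"a": ["b", "c"], "b": ["c"], "c": ["x"], "x": ["y", "z"], "y": ["z"]}
prob = {e: 0.9 for e in [("a", "b"), ("a", "c"), ("b", "c"),
                         ("x", "y"), ("x", "z"), ("y", "z")]}
prob[("c", "x")] = 0.1  # cascades rarely cross the community boundary
reached = independent_cascade(graph, {"a"}, prob, random.Random(42))
print(sorted(reached))
```

With the weak bridge edge, most runs stay inside the seed's community; the cascade becomes "viral" only in the rare runs where the boundary is crossed, mirroring the quoted observation.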


A recent study by Yang et al. [YL12] suggests that overlaps are more densely connected than previously expected. Hence, it is likely that nodes in an overlap are connected to many communities. In conjunction with the finding of Barbieri et al., this observation motivates why overlaps must be taken into account when identifying communities in a network, so that no information is lost.

2.1.4. Methods for Detecting Overlapping Communities
In general, two different approaches for detecting communities in graphs exist, namely node-based and edge-based clustering. The former exhibits a drawback, though: incorporating overlaps with such algorithms is difficult. Nevertheless, several node-based approaches exist that account for overlaps, e.g. [Gre09, GKMI+10, PDFV05]. In [Gre09], Gregory transforms the underlying graph into a graph without any overlapping communities. This allows applying fast community detection algorithms, such as [WT07, CNM04, BGLL08], which are otherwise unable to consider overlaps. Specifically, the Louvain method must be mentioned, as it detects communities of high quality, yet is also faster than [WT07, CNM04]; hence, the algorithm is applicable to large networks containing more than 100 million nodes. Two algorithms proposed by Ahn et al. [ABL10] and Evans et al. [EL09] overcome the problem of incorporating overlaps by clustering the network based on edges instead of nodes and by assigning each node to all communities its incident edges belong to. By virtue of this change of perspective, overlaps are incorporated naturally. Ahn et al. cluster similar edges together, where edge similarity is calculated with the help of the Jaccard index. Evans et al. [EL09] developed, independently of Ahn et al., a method that is also based on the idea of clustering edges.
They transform the original graph into a line graph, which maps each edge of the original graph to a node in the line graph and connects two such nodes whenever the corresponding original edges share an endpoint. Afterwards, any community detection algorithm that does not incorporate overlaps can in principle be applied, for instance the Louvain method. The algorithm proposed by Evans et al. together with the Louvain method, as well as the algorithm of Ahn et al., are applied in this thesis for detecting overlapping communities.
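A minimal sketch of this line-graph idea, assuming an undirected graph given as a list of node-pair tuples; the edge-weighting refinements discussed by Evans et al. are omitted:

```python
from itertools import combinations

def line_graph(edges):
    """Build the line graph: every edge of the original graph becomes
    a node, and two such nodes are joined whenever the original edges
    share an endpoint."""
    incident = {}                      # endpoint -> edges touching it
    for e in edges:
        for v in e:
            incident.setdefault(v, []).append(e)
    lg_edges = set()
    for touching in incident.values():
        for e1, e2 in combinations(touching, 2):
            lg_edges.add(frozenset((e1, e2)))
    return lg_edges

def node_communities(edge_communities):
    """Assign each node to every community one of its incident edges
    belongs to -- overlaps arise naturally."""
    result = {}
    for cid, edges in edge_communities.items():
        for e in edges:
            for v in e:
                result.setdefault(v, set()).add(cid)
    return result

# A triangle: its line graph is again a triangle (3 nodes, 3 edges).
triangle = [("a", "b"), ("b", "c"), ("a", "c")]
print(len(line_graph(triangle)))  # 3
```

After clustering the line graph's nodes (i.e. the original edges) into disjoint communities, `node_communities` recovers the overlapping node memberships described above.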


2.1.5. Using Communities for the Detection of Influentials
The interplay of social contagion and community detection has not been studied frequently; instead, both subjects are mainly investigated separately. One of the few exceptions is the work of Barbieri et al. [BBM13], who constructed a generative model incorporating the structure of the social graph (= communities) and information cascades; it resembles the properties of real-world networks remarkably well. The key idea of Wang et al. [WCSX10] is closely related to this master's thesis: they also examine the idea of finding influential users based on detected communities. For this purpose, they consider social contagion for detecting appropriate communities, which, however, do not account for overlaps. In contrast to this thesis, the authors pursue no supervised learning approach. Moreover, the focus in this thesis is put on identifying the characteristics of influential users rather than detecting the influentials themselves.

2.1.6. Why Use Twitter?
Among social networks, Twitter is particularly suited for investigating influence. First of all, Cha et al. [CHBG10] and Kwak et al. [KLPM10], both examining almost the complete Twitter graph, argue that following other users indicates that those users are more influential, because both author groups found a low link reciprocity, meaning only a few users follow their followers back. Their observation also holds for user actions like retweeting: if user A retweets user B, B does not have to retweet A in return. Due to the low reciprocity, Bakshy et al. [BHMW11] call Twitter a "who listens to whom" network, as users are free to subscribe to receiving updates from any person they find interesting. However, it has to be noted that Weng et al. [WLJH10] report a high link reciprocity, in contrast to the previous two studies by Cha and Kwak. Hence, Weng et al.
conclude that following other Twitter users indicates homophily, which means that individuals tend to befriend or interact with people similar to them. But the authors analyze only a small fraction of the Twitter network, which suggests that their sample of the entire Twitter graph is biased. Nonetheless, it has to be kept in mind that, according to Kwak et al. [KLPM10], the Twitter network is mainly used for satisfying information needs, which implies that it also exhibits different properties


than other social networks like Facebook (www.facebook.com), although Twitter is considered a social network throughout this thesis. The properties of the graphs constructed in this thesis must therefore be analyzed in detail. But since Twitter users are dedicated to spreading information in the network, this microblogging service is particularly suitable for observing information cascades. Thus, Twitter is used in this thesis as the basis for analyzing influential users. The basics of Twitter are explained in the following section.
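Link reciprocity, the measure on which the disagreement between Cha/Kwak and Weng turns, is simple to compute. A sketch on a hypothetical "who follows whom" edge list:

```python
def link_reciprocity(edges):
    """Fraction of directed edges (u, v) for which the reverse edge
    (v, u) also exists in the graph."""
    edge_set = set(edges)
    reciprocated = sum(1 for u, v in edge_set if (v, u) in edge_set)
    return reciprocated / len(edge_set)

# Hypothetical follower edges: only alice and bob follow each other back.
follows = [("alice", "bob"), ("bob", "alice"),
           ("carol", "alice"), ("dave", "alice")]
print(link_reciprocity(follows))  # 2 of 4 edges are reciprocated -> 0.5
```

A value close to 0 supports the "who listens to whom" reading; a value close to 1 would instead point toward the homophily interpretation of Weng et al.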

2.2. Basic Concepts of Twitter This section describes briefly the most important aspects of Twitter that are utilized throughout this thesis. Twitter is a microblogging service allowing users to post text-based messages of 140 characters at maximum, which are called tweets. They are publicly available in real-time, so that additional individuals can receive or find the news, and are thereby able to interact with each other like in a social network. Twitter has currently over 200 million users3 who post about 400 million tweets per day on average. A main characteristic of Twitter relationships is that they do not have to be mutual. Hence, it is common that a user subscribes (follows) to receive updates from a fellow, but the latter individual does not follow her back. There exist different possibilities for users to interact with their audience. Firstly, they can credit the original authors by mentioning their names anywhere in the tweet. Depending on the position of the mention in the tweet, one can distinguish between attribution and mention. In the former the user name, which is preceded by ”via”, is cited at the end of the tweet, whereas in the latter interaction the user name can occur anywhere in the tweet. In the course of this thesis attribution and mention are not distinguished. Hence, whenever ”mention” is used, attributions and mentions are meant. A second way of an interaction is established by reposting (retweet) a message. Last but not least, a user can comment (reply) on a tweet. In the remainder of this thesis only mentions, replies and retweets are regarded as interactions. In Table 1 an overview of the above introduced Twitter terms is given and extended with more definitions of Twitter 2 3

² www.facebook.com, Retrieved: 03-23-2014
³ https://blog.twitter.com/2013/celebrating-twitter7, Retrieved: 03-23-2014


Term: Explanation

Tweet: It is Twitter’s name for a message written by a user.
To Follow: It describes the activity of subscribing to another Twitter account. This allows receiving the latest tweets by that account.
Follower: All the users following the same account are its followers.
Followee: These are all the Twitter accounts one user follows.
Retweet: A user reposts another tweet. This is indicated by ”RT @username” in a tweet.
Protected Account: Tweets from such an account are only displayed to approved followers, and cannot be found in the search results.
Timeline: It is a list of tweets ordered by date that can be viewed publicly. Additionally, tweets from timelines can be searched via Twitter unless the user account is protected. The owner of the timeline can also add further information like URLs to a personal blog.
Direct Message: A direct message can only be sent to a user’s followers. Specifically, it is broadcast only to the addressed users. It is not published on the timeline.
Reply: A user replies to a sender’s post. It is posted to the recipient’s timeline, and if both follow each other, then also to the sender’s one. Otherwise the sender receives a notification in a different way.
Mention: A user is mentioned at some position in the tweet, indicated by ”@username”. The tweet is posted to the sender’s timeline, and if the recipient follows the sender, it is posted to hers as well. Otherwise the recipient is notified differently.
Attribution: See Mention, but indicated by ”via @username”.
Hashtag (#): A word preceded by # indicates a topic or keyword.

Table 1: Glossary of Twitter Terms

concepts and commonly used expressions, to facilitate a better understanding of the thesis. The next chapter gives an overview of the components that are implemented for identifying influentials and users writing influential tweets, together with their respective characteristics.


3. Overview of Components
This chapter presents an overview of the three components that are developed in order to answer the questions from Section 1.2. They are discussed in detail in the following chapters.

3.1. SNAnnotator
SNAnnotator produces the ground truth dataset. This involves collecting and annotating data from Twitter as well as deriving the respective ground truth. After collecting tweets written in English on a specific topic, their respective authors are retrieved as well. Independent annotators then assign tweets and users appropriate labels with respect to influence by means of a purpose-built annotation tool. The final dataset, which serves as input for InfluenceLearner, is derived from these suggested labels in order to improve their reliability, as influence remains highly subjective.

3.2. InfluenceLearner
InfluenceLearner constructs two different graphs from the dataset created with SNAnnotator, identifies overlapping communities, and afterwards extracts 16 attributes per graph. Two different sets of labels are derived from the tweet and user labels of the final dataset, both serving as a ground truth. This allows learning four different classifiers based on the extracted attributes: one classifier per label set per graph, yielding four combinations in total. Finally, for each of them the most predictive attributes are determined.

3.3. InfluenceSimulator
InfluenceSimulator allows comparing the performance of classifiers according to the influence spread models Weighted Cascade Model and Linear Threshold Model. The subset of the most influential users according to the respective classifiers is used as the seed set for starting the simulation of influence spread. The message diffuses in the network to neighbors of the initial subset of users according to the chosen influence spread model. After a certain time span the process is stopped and the performances of the classifiers are compared: the better a classifier is, the more users should have received the propagated message in the end. The introduction of the three developed components over the following chapters differs slightly from the order in this chapter, in that InfluenceLearner is described before SNAnnotator, because the requirements for the dataset to be collected are then clear from the previous chapters. Hence, the next chapter introduces InfluenceLearner.
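To illustrate the kind of diffusion process simulated here, the following sketch runs one spread under a Weighted Cascade-style model, in which an active user u activates a neighbor v with probability 1/in-degree(v). The toy graph, the seed set and the fixed random seed are hypothetical; this is a minimal illustration, not the thesis implementation.

```python
import random
from collections import defaultdict

def weighted_cascade(edges, seeds, rng=None):
    """Simulate one spread under a Weighted Cascade-style model: an
    active user u activates a neighbor v with probability 1/in-degree(v)."""
    rng = rng or random.Random(42)  # fixed seed for reproducibility
    out = defaultdict(list)
    indeg = defaultdict(int)
    for u, v in edges:
        out[u].append(v)
        indeg[v] += 1
    active, frontier = set(seeds), list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in out[u]:
                # Each newly activated node gets one attempt per out-edge.
                if v not in active and rng.random() < 1.0 / indeg[v]:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return active  # all users the message reached

edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
print(len(weighted_cascade(edges, {"a"})))
```

A larger final `active` set for the seed users picked by one classifier than for those picked by another corresponds to the comparison criterion described above.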


4. Workflow of InfluenceLearner
This chapter gives a general explanation of the capabilities of InfluenceLearner, that is, how classifiers are learned from a collected dataset so that characteristic properties of influentials and of users writing influential tweets may be identified. First, the dataset must be converted into a graph. On this graph, overlapping communities are then detected, which allows extracting all necessary attributes, as some of them are based on community structure. Labels serving as a ground truth are subsequently derived from the dataset, permitting supervised learning. This eventually makes it possible to identify the most predictive properties of influentials.

4.1. Turning the Dataset into Graphs
This section describes how InfluenceLearner constructs two different graphs, the relation graph and the interaction graph, from the ground truth dataset that was created with SNAnnotator. Oftentimes, for instance in [BCM+12], two graphs are constructed from a dataset when the goal is the detection of influential users, namely the interaction graph IG and the relation graph RG, which are defined hereafter. The reasoning for this approach is motivated by [HRW08], in which the authors note that relationships between followers and followees lead to a dense graph, whereas in reality individuals interact only with a few carefully selected users. Thus, a graph based on interactions is sparse.

Definition 1. A graph G = (V, E) describes the set of pairwise relationships E between a set of elements V, where the elements are called nodes and the relationships are called edges.

Definition 2. Let G(U, E) be the original social network, where U is the set of users and E is the set of edges linking them, and a directed edge e = (u1, u2) denotes any kind of connection from u1 to u2. The ”relation graph” RG(U, R) is a directed graph, where R ⊂ E is the set of follower links, i.e. for an edge (u1, u2) ∈ E it holds that (u1, u2) ∈ R if and only if u1 is a follower of u2.


Definition 3. Let G(U, E) be the original social network as above. The ”interaction graph” IG(U, I) is a directed graph, where I ⊂ E is the set of interaction links, encompassing direct replies, mentions and retweets. This means that for an edge (u1, u2) ∈ E it holds that (u1, u2) ∈ I if and only if u1 has replied to, retweeted a tweet of, or otherwise mentioned u2. Notice that in the remainder of this thesis only the largest weakly connected component is considered when referring to either IG or RG, because certain attributes can only be computed for connected graphs, particularly the attributes related to centrality and community, which are discussed in Section 4.3. A directed graph is weakly connected if all nodes are connected after all directed edges are replaced by undirected ones [WCC09]. Thus, all users having no connection with this weakly connected component of either RG or IG are discarded, although the fraction of such users is negligible: 62 out of 4500 users. Therefore, the final graphs comprise fewer users (4438) than actually exist in the dataset. Constructing both graphs already permits extracting a few attributes for the classifiers. However, a crucial aspect, community structure, has not been addressed so far; it is required for obtaining the remaining attributes and is therefore dealt with in the next section.
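Restricting a directed graph to its largest weakly connected component, as described above, can be sketched in pure Python. The edge-list representation and the toy graph are assumptions for illustration, not the thesis code:

```python
from collections import defaultdict, deque

def largest_weakly_connected(edges):
    """Return the node set of the largest weakly connected component
    of a directed graph given as a list of (u, v) edges."""
    # Build an undirected adjacency view: edge direction is ignored
    # for weak connectivity.
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        # BFS over the undirected view collects one component.
        comp, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adj[node]:
                if nxt not in comp:
                    comp.add(nxt)
                    queue.append(nxt)
        seen |= comp
        if len(comp) > len(best):
            best = comp
    return best

# Toy graph: 1->2->3 form one component, 4->5 a smaller one.
edges = [(1, 2), (2, 3), (4, 5)]
print(sorted(largest_weakly_connected(edges)))  # [1, 2, 3]
```

Users outside the returned component would be discarded, as in the construction of the final RG and IG.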

4.2. Detecting Overlapping Communities
This subsection outlines how InfluenceLearner detects overlapping communities. To this end, the method proposed by Evans et al. [EL09] is applied: a directed, unweighted graph DG (e.g. IG or RG) is transformed into a weighted, directed line graph L(DG), which, in general, maps nodes from DG to edges in L(DG) and edges from DG to nodes in L(DG). This permits applying any community detection algorithm on L(DG) that does not take overlaps into account. In this work the Louvain method [BGLL08] is selected for this task because it is particularly fast, which is crucial as L(DG) is dense, especially in the case of RG, where it contains more than 550000 nodes and 280 million edges. The resulting communities in L(DG) are then transferred back to DG, so that a user in DG belongs to all the communities her incident edges belong to. Now that the overlapping communities in the graphs are known, all attributes that are necessary for building the classifiers may be extracted.
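The final transfer step, from communities of edges back to overlapping communities of users, can be sketched as follows. Here `edge_partition`, a mapping from DG-edges to community ids, stands in for a hypothetical Louvain result on L(DG):

```python
from collections import defaultdict

def node_communities(edge_partition):
    """Given a community id per DG-edge (i.e. per L(DG)-node), assign
    each DG-node the set of communities of its incident edges; nodes
    with edges in several communities end up in overlapping communities."""
    comms = defaultdict(set)
    for (u, v), cid in edge_partition.items():
        comms[u].add(cid)
        comms[v].add(cid)
    return dict(comms)

# Hypothetical edge-level partition: edge -> community id.
edge_partition = {("a", "b"): 0, ("b", "c"): 0, ("c", "d"): 1}
print(node_communities(edge_partition))
# "c" is incident to edges of both communities, so it overlaps.
```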


4.3. Considered Attributes to be Extracted
This section motivates the underlying assumptions in this thesis as to what constitutes influential users and introduces all associated attributes that InfluenceLearner extracts in order to distinguish influentials from non-influentials. The attributes to be extracted are based on four assumptions:

1. Role of community structure: Attributes related to community structure are extracted because Barbieri et al. state in [BBM13] that communities connected more frequently to the outside world foster the spreading of messages in the network, and therefore enable word-of-mouth propagation.

2. Role of activity: In [CHBG10] the authors report that influentials gain their importance over time through the constant involvement of their followers. Thus, it is assumed that influentials are active and engage their followers in interactions, which motivates the extraction of related attributes.

3. Role of quality: In [KF11] the authors explain that tweets of high quality characterize influentials. Hence, appropriate attributes are incorporated in this thesis as well.

4. Role of centrality: In [BCM+12] the authors state that being well connected in the network is typical for influentials. Therefore, influentials are to some extent ”central” and possess more power than non-influentials, which is reflected by the choice of attributes to be extracted in terms of centrality.

An overview of all considered attributes is presented in Table 2. They are extracted once from RG and once from IG. In the following, it is described what information each attribute captures.

Role of Community
This assumption encompasses the following attributes, which are extracted from DG:

1. Inter-community edges measures the overall strength of a user’s connections to adjacent communities. For this attribute multiple edges to the same community contribute to the result. In contrast to Overlaps, this attribute


No.  Attribute Name                  Assumption
1    Inter-community edges           Community
2    # Connected communities         Community
3    # Edges from influentials       Community
4    # Overlaps                      Community
5    Potential audience              Community
6    Influence                       Activity
7    # Interactions                  Activity
8    # Tweets                        Activity
9    TFF [BCM+12]                    Quality
10   User readability [BCM+12]       Quality
11   User quality [KF11]             Quality
12   In-degree [NMB05]               Centrality
13   Closeness [NMB05]               Centrality
14   Vertex betweenness [NMB05]      Centrality
15   Eigenvector centrality [NMB05]  Centrality
16   Edge betweenness [GN02]         Centrality

Table 2: Attributes to be extracted per user.

takes the edge weights connecting two communities into account; hence, the results can differ.

2. # Connected communities counts to how many different communities a user belongs in DG.

3. # Edges from influentials counts how often influentials have outgoing edges to a specific user, as this direction is more meaningful. For instance, in RG it is easy for a user to follow an influential one, whereas the opposite case indicates that a non-influential user is also ”interesting” to some extent.

4. # Overlaps counts how often a user is part of more than one community. Multiple overlaps with the same community contribute to the final result, as in Inter-community edges.

5. Potential audience approximates how many individuals a single user can reach.


Role of Activity
This assumption encompasses the following attributes, which are extracted from the collected dataset:

1. Influence measures how much influence users exert based on their collected tweets.

2. # Interactions regarding the topic describes the number of interactions (replies, mentions, retweets) with users from the dataset.

3. # Tweets regarding the topic is the number of collected tweets by this user.

Role of Quality
This assumption encompasses the following attributes, which are extracted from the collected dataset:

1. TFF (Twitter Followee Follower ratio) relates the number of followers and followees of a user. If users have more followers than they are following themselves, this suggests that they are to some extent ”interesting” as well. Thus, a larger TFF value indicates a more relevant user.

2. User readability measures the readability of a user’s tweets. The lower this value is, the more readable are the tweets written by this user.

3. User quality considers the quality of all tweets composed by a user. Tweet quality is based on the assumption that tweets of higher quality are retweeted more often. A tweet is assumed to be of even higher quality if users add further content to it before reposting.

Role of Centrality
This assumption encompasses the following attributes based on DG:

1. In-degree counts how many incoming edges users have, and therefore indicates in RG how many others follow them, whereas in IG it signifies that


other users interacted with them. Normally, the node degree would have been chosen as the attribute, but since DG is directed, it is more natural to use the in-degree instead, because a high in-degree increases the likelihood of a user being influential in both graphs.

2. Closeness: The more central users are in the network, the lower is their total distance to all other users, and therefore they can propagate information more easily. Closeness tends to assign higher values to users near the center of communities.

3. Vertex betweenness is the number of shortest paths between any two individuals that run through a specific user. Hence, users exhibiting higher Vertex betweenness values act as bridges, connecting otherwise disconnected parts of the network.

4. Eigenvector centrality describes to which extent a user is connected to the best connected part of the network. Thus, users exhibiting high Eigenvector centrality values are connected with users characterized by similarly high values.

5. Edge betweenness is strongly related to Vertex betweenness, but considers edges instead of users.

Before classifiers can be built based on these extracted attributes, the next section describes which ground truth is chosen for users.
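Two of the centrality attributes above, in-degree and closeness, can be sketched in pure Python on an edge-list representation. This is a minimal illustration with a toy graph; in practice a graph library such as NetworkX provides these measures directly:

```python
from collections import defaultdict, deque

def in_degrees(edges):
    """In-degree per node: followers in RG, incoming interactions in IG."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[v] += 1
        deg.setdefault(u, 0)  # keep sources with no incoming edges
    return dict(deg)

def closeness(edges, node):
    """Closeness of `node`: reachable nodes divided by the sum of their
    BFS distances (directed paths); 0.0 if nothing is reachable."""
    out = defaultdict(list)
    for u, v in edges:
        out[u].append(v)
    dist, queue = {node: 0}, deque([node])
    while queue:
        cur = queue.popleft()
        for nxt in out[cur]:
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    reached = len(dist) - 1
    if reached == 0:
        return 0.0
    return reached / sum(d for d in dist.values() if d > 0)

edges = [("a", "b"), ("b", "c"), ("a", "c")]
print(in_degrees(edges))       # a has 0, b has 1, c has 2
print(closeness(edges, "a"))   # 1.0: both others at distance 1
```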

4.4. Deriving Labels from the Dataset with Ground Truth
This section describes which ground truth is utilized for building the classifiers. To analyze the question whether influentials can be characterized by posting influential tweets, two sets of labels are derived from the collected dataset, which are then assigned to the users as their respective ground truth. One of them is based on tweet labels, the other on user labels from the collected ground truth dataset, whose creation is described in Chapter 6. More precisely, DirectLabel is derived directly from the user labels, in that it utilizes the user labels of the ground truth dataset. In contrast, MaxLabel is derived indirectly from the tweet labels of a user; it is computed as the highest


label in the ground truth dataset that annotators assigned to any of her tweets, as influential tweets are rare. Since all prerequisites for applying supervised learning are now fulfilled, InfluenceLearner is able to build classifiers, which is described in the following section.
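The two label derivations can be sketched as follows. The label scale (higher means more influential) and the example users are hypothetical placeholders for the annotated dataset:

```python
# Hypothetical annotations: higher label value = more influential.
user_labels = {"alice": 2, "bob": 0}                   # per-user labels
tweet_labels = {"alice": [0, 1, 2], "bob": [0, 0, 1]}  # per-user tweet labels

def direct_label(user):
    """DirectLabel: the annotated user label itself."""
    return user_labels[user]

def max_label(user):
    """MaxLabel: the highest label any of the user's tweets received,
    chosen because influential tweets are rare."""
    return max(tweet_labels[user])

print(direct_label("bob"), max_label("bob"))  # 0 1
```

Note how the two ground truths can disagree: "bob" is non-influential as a user but has one tweet labeled above his user label.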

4.5. Learning MaxClassifier & DirectClassifier
This section specifies the different classifiers and their inputs, in particular which ground truth each one utilizes and to which graph it is applied. InfluenceLearner builds different classifiers based on both sets of labels, DirectLabel and MaxLabel. Classifiers built on the basis of DirectLabel are referred to as DirectClassifier, whereas those trained on MaxLabel are referred to as MaxClassifier. Since MaxClassifier and DirectClassifier are each built using attributes obtained from RG and from IG, four different classifiers must be learned in total. Specifically, MaxClassifierRG and DirectClassifierRG are trained on RG, and MaxClassifierIG and DirectClassifierIG are trained on IG. Apart from using a different ground truth, the inputs for MaxClassifierRG and DirectClassifierRG (MaxClassifierIG and DirectClassifierIG) remain identical; the same attributes extracted from RG (IG) are given to the respective classifiers. Since influentials naturally occur more rarely than non-influentials regardless of which ground truth is used, InfluenceLearner takes the highly skewed class distribution of the collected dataset into account by using a combination of SMOTE (Synthetic Minority Over-sampling Technique) [CBHK02] and a cost-sensitive matrix. The former technique adds influentials with artificially generated attribute values derived from real influentials to the dataset; specifically, InfluenceLearner doubles the number of positive instances by applying SMOTE. InfluenceLearner balances the remaining skew by setting up an appropriate cost matrix, such that misclassifying the minority class results in a proportionally large penalty. The last task of InfluenceLearner is the identification of characteristic attributes of influentials, which is discussed hereafter, concluding this chapter.
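The oversampling step can be illustrated with a minimal SMOTE-style sketch in pure Python. The feature vectors are hypothetical, and for brevity this sketch interpolates toward the single nearest minority neighbour, whereas the original SMOTE [CBHK02] samples among the k nearest neighbours:

```python
import random

def smote_double(minority, rng=None):
    """Minimal SMOTE-style oversampling: for every minority instance,
    synthesize one new instance by interpolating toward its nearest
    minority neighbour, doubling the minority class."""
    rng = rng or random.Random(0)  # fixed seed for reproducibility
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    synthetic = []
    for idx, x in enumerate(minority):
        neighbours = [m for i, m in enumerate(minority) if i != idx]
        nn = min(neighbours, key=lambda m: dist(x, m))
        gap = rng.random()  # random position on the segment x -> nn
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nn)))
    return minority + synthetic

# Hypothetical attribute vectors of the rare influential class.
influentials = [(1.0, 5.0), (1.2, 4.8), (0.9, 5.3)]
print(len(smote_double(influentials)))  # 6
```

Each synthetic instance lies on a line segment between two real influentials, so it stays inside the region the minority class already occupies.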


4.6. Identification of Most Predictive Attributes of Influence
This section describes how InfluenceLearner determines the most predictive attributes of influentials. The prerequisites for performing this task are an existing ground truth and the respective graph with the extracted attributes per user. Most importantly, the data are prepared as in Section 4.5, meaning the skewed distribution is handled exactly as described there. Only then is the actual task carried out. At first, InfluenceLearner selects a subset of attributes by applying Correlation-based Feature Selection (CFS) [Hal99] to the dataset associated with the respective classifier. For instance, for DirectClassifierRG, CFS is employed on the attributes extracted from RG with DirectLabel, whereas for DirectClassifierIG it is employed on the attributes obtained from IG with DirectLabel. CFS prefers subsets of attributes that correlate with the ground truth while exhibiting a low inter-correlation within the subset at the same time. The resulting subset of attributes is then ranked according to the scores calculated by the Chi-Squared test (χ2). χ2 assesses attributes separately by measuring their χ2 statistic with respect to the ground truth; the score is larger the more an attribute deviates from being independent of the ground truth. The larger this value turns out to be, the more relevant the attribute is. CFS and χ2 were applied separately, but successfully, in [LLW02], which is why they are selected in this thesis. This concludes the description of InfluenceLearner’s capabilities. Many details have not been mentioned yet, however, which is therefore done in the following chapter.
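The χ2 scoring can be sketched for a binary attribute against a binary ground truth; real attributes would first be discretized. This is a minimal illustration, not the thesis implementation:

```python
def chi_squared(attribute, labels):
    """Chi-squared statistic of a binary attribute against a binary
    ground truth; larger values mean stronger dependence, i.e. a more
    relevant attribute."""
    # Observed counts of the 2x2 contingency table.
    obs = {(a, l): 0 for a in (0, 1) for l in (0, 1)}
    for a, l in zip(attribute, labels):
        obs[(a, l)] += 1
    n = len(labels)
    score = 0.0
    for a in (0, 1):
        for l in (0, 1):
            row = sum(obs[(a, x)] for x in (0, 1))
            col = sum(obs[(x, l)] for x in (0, 1))
            expected = row * col / n  # counts expected under independence
            if expected > 0:
                score += (obs[(a, l)] - expected) ** 2 / expected
    return score

# A perfectly informative attribute vs. a useless one.
labels = [1, 1, 0, 0]
print(chi_squared([1, 1, 0, 0], labels))  # 4.0 (fully dependent)
print(chi_squared([1, 0, 1, 0], labels))  # 0.0 (independent)
```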


5. Tasks of InfluenceLearner
Although the analysis of the properties of IG and RG is not part of InfluenceLearner, it is essential for understanding the underlying dynamics of both graphs. Hence, this topic is addressed at the beginning of the chapter, where the properties of both graphs are compared against those of other social networks. The remainder of this chapter focuses on the details that are important for implementing InfluenceLearner appropriately, that is, how overlapping communities and all attributes are extracted. A brief discussion of implementation details, which utilizes the findings from Section 5.1, concludes the chapter.

5.1. Analysis of the Graphs’ Properties Regarding Social Networks
This section analyzes the main properties of RG and IG, which are both constructed according to Section 4.1, where the collected dataset encompasses 4438 users and 35742 tweets. Their properties are compared to those of other social networks, which often turn out to be ”scale-free” because nodes with highly different degrees coexist. The motivation for this investigation stems from Kwak et al. [KLPM10], who note that Twitter exhibits different properties than social networks in general. Thus, the analysis of such properties helps to decide whether the findings of this thesis are generalizable in the sense that they can be applied directly to other social networks. First, the reciprocity of both graphs is examined, that is, the fraction of node pairs having reciprocal relationships. Afterwards, further properties are discussed; if IG or RG exhibited them, that would indicate that the respective graph is scale-free.

5.1.1. Reciprocity
In [CHBG10, KLPM10] the authors had access to nearly the entire Twitter graph and reported a low reciprocity of 10-22%, and therefore interpreted reciprocity as an indicator of influence. In contrast, Weng et al. [WLJH10] noted a much higher reciprocity of 72% and concluded that reciprocity suggests


homophily, the tendency of individuals to befriend similar users, rather than influence. However, those authors analyzed only a very small fraction of the Twitter graph, which suggests their sample might have been biased. Interestingly, it turns out that the reciprocity in RG is 99.7%, which would permit interpreting the graph as undirected almost without any loss of information. Hence, it seems more likely that the reciprocity in this graph indicates homophily, since many book authors exist in the dataset. In contrast, the reciprocity in IG is only 43.4%, meaning that only this fraction of user pairs returned the favor of interacting with the initiator; interpreting this graph as undirected would therefore lead to a loss of information. These reciprocity results suggest that the collected dataset is biased, and it is therefore unclear to what extent the findings of this thesis can be transferred directly to other Twitter datasets.

5.1.2. Properties of Scale-Free Networks
Many social networks have been analyzed with respect to their characteristics, and many of them turned out to be ”scale-free”. The term stems from the observation that in random networks the node degrees, i.e. the number of connections per node, are comparable, whereas they differ largely in scale-free networks. Hence, in contrast to a random network, the average node degree cannot be used as an internal scale in a scale-free network to predict the degree of a randomly drawn node. Scale-free networks share four typical properties, which are analyzed for IG and RG in the following: (1) the degree distribution obeys a power law, (2) the average distances between nodes are small, (3) they are assortative, and (4) they exhibit a large clustering coefficient compared to a random network. One key property of scale-free networks is the widely varying degree distribution of their nodes.
The reason for such a degree distribution is the existence of a few nodes, called ”hubs”, that have exceedingly many edges compared to other nodes in the network. In other words, the degree distribution of scale-free networks follows a power law, which is of the form y = cx^(−γ), where c is a normalization constant and γ the exponent. For social networks, γ is of particular interest and typically lies between two and three. The degree distribution of the collected dataset is plotted for RG and IG using logarithmic binning, as described in [Mil10], on a


log-log scale. Logarithmic binning improves the estimation of γ, as less weight is assigned to the typical heavy tail of the power law distribution; in fact, each bin contributes equally, so that this bias is removed. The exponents are estimated with the least-squares method, assuming the degree distribution is generated by a pure power law. Furthermore, the plots in Figures 1 and 2 are depicted with a decade of 0.1, whereby a decade denotes an interval in which the value on the x-axis changes by a factor of 10; hence, 10 bins exist per decade. Last but not least, logarithmic binning is only applied to degrees larger than or equal to five in order to avoid oversmoothing the data; for plotting smaller degrees, linear binning is applied. All graphs in Figures 1 and 2 resemble lines, so the degree distributions of IG and RG seem to obey power laws. According to Barabási [Bar12] (chapter four, p. 27), IG and RG are not scale-free, since the computed γ values are smaller than two. However, there are several reasons why both graphs could still be scale-free. For instance, it is possible that the degree distribution follows a combination of a power law and an exponential function rather than a pure power law, so that γ would have to be re-estimated accordingly. Analyzing all such combinations is beyond the scope of this thesis due to the limited time frame. Social networks commonly exhibit a small average shortest path length a, which allows individuals to reach all other network members within a few hops; the reason for this characteristic is the existence of hubs. The average shortest path length is computed according to [BLM+06]:

    a = \sum_{s,t \in V} \frac{d(s,t)}{n(n-1)},

where V is the set of nodes in the graph, d(s, t) is the length of the shortest path from user s to user t, and n is the total number of nodes in the graph. The average shortest path length is 1.44 in IG and 2.21 in RG. Since the average shortest path lengths are small, it is clear that hubs exist in both graphs. Assortativity r describes the preference of nodes to connect to ”similar” nodes, where this similarity can be defined in many ways [New03]. Here, as in most other works, a node’s degree is chosen as the measure of similarity. Hence, if the network exhibits a positive mixing (r > 0), hubs are more likely to be connected with other hubs. Conversely, if disassortative mixing (r < 0) is observed, the


Figure 1: Log-log plot of degree distributions of IG using logarithmic binning. The corresponding γ is also visualized as a dashed line. (a): In-degree. (b): Out-degree. (c): Total degree.


Figure 2: Log-log plot of degree distributions of RG using logarithmic binning. The corresponding γ is also visualized as a dashed line. (a): In-degree. (b): Out-degree. (c): Total degree.


opposite behavior is more likely. Interestingly, almost without exception, positive mixing is reported for social networks; it is argued that this is a unique property of social networks arising from social interactions [NP03]. Assortativity is computed as the standard Pearson correlation coefficient r [New03]:

    r = \frac{\sum_{xy} xy\,(e_{xy} - a_x b_y)}{\sigma_a \sigma_b},

where a_x is the fraction of edges that start at nodes with value x, b_y is the fraction of edges that end at nodes with value y, e_{xy} is the fraction of all edges that join nodes with values x and y, and σ_a, σ_b are the standard deviations of the distributions a_x and b_y. Assortativity takes values in the range −1 ≤ r ≤ 1, where r = 1 indicates perfect assortativity and r = −1 perfect disassortativity. Since IG and RG are directed, it has to be stated for which edge direction r is computed for edges of the form e(s, t), where s, t are nodes of the respective graph. Valid combinations are (in-degree(s), in-degree(t)), (out-degree(s), in-degree(t)), (in-degree(s), out-degree(t)) and (out-degree(s), out-degree(t)). Here, the out-degree is considered for s and the in-degree for t. Both constructed graphs exhibit a slight disassortative mixing accordingly: in IG it is less evident with an assortativity coefficient of -0.07, whereas it is -0.24 in RG. The latter value can be explained by the fact that users following only a few others tend to follow those who already have many followers, where the large number of followers indicates that these users post ”interesting” content or are ”interesting” themselves. For IG, the result suggests that users who interact infrequently with others try to get in touch with those who are eager to interact. The last property to be analyzed is the clustering coefficient c_d, which characterizes the extent to which the nodes in the network tend to cluster together, or in other words, form communities. It has to be large compared to a random network in which the degree distribution of the original network is preserved (configuration model). The clustering coefficient of a user s is computed over triangles of users s, t, u according to the following formula [OSKK05]:

    c_d = \frac{2}{\deg(s)(\deg(s)-1)} \sum_{t,u} (\tilde{w}_{st}\,\tilde{w}_{su}\,\tilde{w}_{tu})^{1/3},


Property                RG             IG
Avg. shortest path      2.21           1.44
Clustering coefficient  0.394 (0.030)  0.196 (0.009)
γ (in-degree)           -0.88          -1.84
γ (out-degree)          -0.88          -1.45
γ (total degree)        -0.97          -1.78
Assortativity           -0.24          -0.07

Table 3: Overview of properties that are important for IG and RG to be considered scale-free. Values in brackets denote results obtained from a random network preserving the degree distribution of the original network.

where deg(s) denotes the degree of user s and all edge weights w_st, w_su, w_tu are normalized by the respective maximum value in the network, i.e. w̃_st = w_st / max(w_st). The clustering coefficient is 0.394 for RG, whereas it is only 0.030 in the corresponding configuration model. As for IG, c_d is lower at 0.196, but still much larger than the clustering coefficient of 0.009 in the respective configuration model. Clustering is therefore much more evident in RG and IG than in their random equivalents, meaning that both graphs contain tightly knit areas (communities) that cannot be explained by chance. A summary of all computed properties for both graphs is given in Table 3 to facilitate readability⁴. The characteristics of RG and IG conform to social networks and scale-free networks with respect to the clustering coefficient and the average shortest path length, but they deviate from them in the computed exponents of the degree distributions and in the disassortative mixing. Thus, IG and RG do not satisfy the requirements to be considered scale-free, and one has to be cautious when deriving general statements based on both graphs. The findings from this analysis are important for the implementation of detecting overlapping communities in RG and IG. Beforehand, however, the steps required to detect overlapping communities need to be addressed, which is done in the next sections.

⁴ For computing the average shortest path length, clustering coefficient and assortativity, NetworkX 1.8.1 is used [HSS08].
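The weighted clustering coefficient used in Section 5.1.2 can be sketched as follows in pure Python. The toy graph, the dict-of-dicts representation and the already normalized weights are assumptions for illustration:

```python
def weighted_clustering(node, adj):
    """Weighted clustering coefficient of `node` over closed triangles,
    using the geometric mean of the three (pre-normalized) edge weights."""
    neigh = list(adj[node])
    k = len(neigh)
    if k < 2:
        return 0.0  # no triangle possible with fewer than two neighbours
    total = 0.0
    for i in range(k):
        for j in range(i + 1, k):
            t, u = neigh[i], neigh[j]
            if u in adj[t]:  # the triangle (node, t, u) is closed
                total += (adj[node][t] * adj[node][u] * adj[t][u]) ** (1 / 3)
    return 2.0 * total / (k * (k - 1))

# Toy undirected weighted graph; weights already normalized to [0, 1].
adj = {
    "s": {"t": 1.0, "u": 0.5},
    "t": {"s": 1.0, "u": 0.8},
    "u": {"s": 0.5, "t": 0.8},
}
print(weighted_clustering("s", adj))  # (1.0 * 0.5 * 0.8) ** (1/3)
```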


5.2. Community Detection Tasks
This section describes the different steps InfluenceLearner performs until overlapping communities are eventually extracted from RG or IG, respectively. First, the graphs must be transformed into corresponding line graphs according to Evans’ method. Next, after describing the adaptations made to the applied community detection algorithm, communities are identified on the line graphs using the extended Louvain method. This finally allows determining the overlapping communities on RG and IG, respectively.

5.2.1. Transformation of the Original Graph to a Line Graph
In order to extract communities from DG, Evans’ method [EL09], which was introduced in Section 2.1, needs to be adapted, because the original implementation consumes much memory for large graphs. Since the resulting line graph of RG, L(RG), is dense, reducing memory consumption is important; this is accomplished by the proposed modification, whose pseudocode is presented in Algorithm 1. The modification takes as input the set of users U and the incoming and outgoing edges of DG, and transforms DG into its corresponding line graph L(DG) by exploiting the following observation: an edge pair can only be ”connected” in L(DG) if both edges share a node sh in DG. ”Connected” in this sense means that one edge points toward sh, whereas the other one points outward from sh. More formally, an edge pair (i, j), (k, l) is ”connected” by a user sh if and only if j = sh and k = sh hold, where {i, j, k, l, sh} ⊆ U. If no self-links are allowed, additionally i ≠ sh and l ≠ sh must hold. The modification iterates over all users in DG (line 4) and assumes that the current user ”connects” an edge pair. This assumption does not hold if the user does not have at least one incoming and one outgoing edge, in which case she does not exist in L(DG); otherwise she exists in L(DG) (line 6). Now all possible combinations of incoming and outgoing edges are created (lines 8-9).
An edge in L(DG) is directed from sh's incoming edge to its outgoing edge. If the edge pair is permissible, which depends among other things on whether self-links are allowed (line 10), the weight w of the resulting edge in L(DG) is computed as w = 1/out-degree(sh) (line 11), and the edge is stored in memory


until either the memory limit is exceeded (lines 15-16) or the transformation of DG into L(DG) has finished successfully. In the former case, all edges of L(DG) computed so far are stored in a file, so that memory becomes available again. At the end, all edges remaining in memory are written to this file as well (lines 21-22).

Algorithm 1 LineGraphCreator.
 1: Input: Users, outgoing edges OE, incoming edges IE of unweighted, directed graph DG and maximal RAM consumption.
 2: Output: Line graph L(DG).
 3:
 4: for each user in DG do
 5:     sh ← user
 6:     if user has outgoing and incoming edges then
 7:         // Edge pair: (in, sh), (sh, out)
 8:         for each in ∈ IE of user do
 9:             for each out ∈ OE of user do
10:                 if (in, out) is allowed then
11:                     weight w ← 1/out-degree(sh)
12:                     edge in L(DG) ← (in, out, w)
13:                 end if
14:             end for
15:             if allocated RAM is consumed then
16:                 store computed edges in file
17:             end if
18:         end for
19:     end if
20: end for
21: if edges in L(DG) exist then
22:     store computed edges in file
23: end if
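The core of Algorithm 1 can be sketched in Python as follows. The function name and the data layout (`incoming[u]` and `outgoing[u]` as lists of neighbor ids) are illustrative assumptions, not taken from the thesis implementation; the memory-to-file spilling of lines 15-16 is omitted for brevity.

```python
def line_graph_edges(incoming, outgoing, allow_self_links=False):
    """Sketch of Algorithm 1: every user sh with at least one incoming
    and one outgoing edge connects each pair of DG edges (i, sh) and
    (sh, o) in L(DG), with weight 1 / out-degree(sh)."""
    edges = []
    for sh in set(incoming) | set(outgoing):
        ins, outs = incoming.get(sh, []), outgoing.get(sh, [])
        if not ins or not outs:
            continue                      # sh does not appear in L(DG)
        w = 1.0 / len(outs)               # weight of every resulting edge
        for i in ins:
            for o in outs:
                # without self-links, i != sh and o != sh must hold
                if not allow_self_links and (i == sh or o == sh):
                    continue
                edges.append(((i, sh), (sh, o), w))
    return edges
```

For the toy graph a→b, b→c, b→d, c→b, user b contributes four line-graph edges of weight 0.5 and user c one edge of weight 1.0.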

5.2.2. Detection of Non-Overlapping Communities on the Line Graph

For detecting non-overlapping communities on the line graph, InfluenceLearner utilizes the Louvain method [BGLL08], which is based on maximizing the modularity Q. Modularity measures the strength of a partitioning of the graph into communities. It is defined as the fraction of edges within communities minus the expected such fraction, and therefore "[l]arge positive values of


modularity indicate a statistically surprising fraction of edges that fall within the chosen communities" [LN08]. Modularity is computed according to:

$$Q = \frac{1}{2m} \sum_{ij} \Big[ A_{ij} - \frac{k_i k_j}{2m} \Big] \, \delta_{c_i, c_j} \qquad (1)$$

In Equation 1, m denotes the number of edges in the network; δ_{c_i,c_j} is the Kronecker delta, which yields one if c_i = c_j and zero otherwise; c_i is the community label to which node i is assigned; A_ij is the entry of the adjacency matrix that is one if nodes i and j are adjacent and zero otherwise. The term k_i k_j / (2m) describes the probability with which an edge between i and j exists by chance, where i has degree k_i and j degree k_j.

At the beginning of the Louvain method, each node forms its own community. In the first step, the method iterates over all nodes in the network and checks for each of them whether Q would increase if the node were removed from its current community and added to one of its neighboring communities instead. The node is assigned to the community yielding the highest Q. Once no node is assigned to a new community anymore, the second step aggregates all nodes of each community into a single node, updating the edges and corresponding weights; then the algorithm is executed again on this aggregated network. This process is repeated until no further improvement in Q is observed or until steps one and two have been repeated a fixed number of times.

The Louvain method is extended in this thesis in a straightforward manner to directed graphs by replacing Q with the directed modularity Q_d [LN08]:

$$Q_d = \frac{1}{m} \sum_{ij} \Big[ A_{ij} - \frac{k_i^{out} k_j^{in}}{m} \Big] \, \delta_{c_i, c_j} \qquad (2)$$

The meaning of the variables in Equation 2 is unchanged compared to Equation 1. The main difference is that the total number of edges in the graph has effectively doubled, which is why the factor of two is discarded in Q_d. In addition, the probability of an edge (i, j) existing between nodes i and j now depends on the out-degree of node i and the in-degree of node j. Besides using Q_d, the extension of the Louvain method also takes edge directions into account when the edges and weights of the network are updated during the second step of the algorithm. Both adaptations are incorporated into an existing Python implementation⁵, which is then applied to L(DG) to discover non-overlapping communities.

5.2.3. Detecting Overlapping Communities

In order to determine the (multiple) community memberships of users in DG, a mapping MAP is utilized, which was created while transforming DG into L(DG). In MAP, each edge pair ep from DG that also exists in L(DG) is assigned to its corresponding node n in L(DG). Hence, each entry in MAP is a pair <n, ep>. Thus, it is straightforward to iterate once over MAP and append the community of n to all four nodes involved in ep. Knowing these communities allows extracting all attributes related to community structure. In fact, all attributes may now be extracted; the next section therefore describes how InfluenceLearner extracts them.
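The mapping step above can be illustrated with a simplified sketch. The data layout is an assumption made for this illustration: each node of L(DG) corresponds to an edge of DG, and the Louvain step has assigned each such node a community; every user then inherits the communities of all edges she participates in, which yields her overlapping memberships.

```python
from collections import defaultdict

def overlapping_memberships(edge_communities):
    """Hedged sketch of the MAP lookup: `edge_communities` maps a DG
    edge (a node of L(DG)) to the community the Louvain step assigned
    to it.  Both endpoints of the edge inherit that community."""
    members = defaultdict(set)
    for (src, dst), comm in edge_communities.items():
        members[src].add(comm)
        members[dst].add(comm)
    return dict(members)
```

A user incident to edges from different communities, such as b below, ends up with multiple memberships.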

5.3. Details on Attribute Extraction

This section presents details on how InfluenceLearner extracts each attribute. All attributes that are derived from graphs are extracted from the original graphs RG and IG, respectively. Attributes regarding the role of centrality are extracted using the Python library NetworkX 1.8.1⁶ [HSS08].

Role of Community

This assumption encompasses the following attributes:

1. Inter-community edges computes the total sum of the weights of inter-community edges going outward from user u1. If u1 has an outgoing edge to a user from a different community, the weight w of the edge connecting both users in L(DG) has to be added. In practice, w is computed as w = 1/out-degree(u1) in DG, exactly as during the creation of the line graph, because if u1 exists in MAP, she is guaranteed to be present in L(DG) as well.

5 https://bitbucket.org/taynaud/python-louvain Retrieved: 03-23-2014
6 http://networkx.github.io/ Retrieved: 03-23-2014


2. # Connected communities counts, based on MAP, to how many different communities a user belongs in DG.

3. # Edges from influentials counts for each user the in-degree in DG, but only edges from influential users are taken into account.

4. # Overlaps sums up, similarly to Inter-community edges, to how many different communities each user is connected in DG via her outgoing edges. This time, however, each connection increments the count by one.

5. Potential audience is approximated by summing up the sizes of the communities the user is a member of. This sum is normalized by the total number of users in DG.

Role of Activity

This assumption encompasses the following attributes:

1. Influence is measured by choosing the median tweet label of a user's collected tweets, because users rarely publish influential tweets and the median therefore reflects this observation better than the average or the third quartile. The latter was tested, but showed no improvement over the median.

2. # Interactions regarding the topic counts the number of interactions (retweets, mentions, replies) per user with other collected users, based on the collected metadata.

3. # Tweets regarding the topic is the number of collected tweets by this user, because only tweets with respect to a specific topic were collected using appropriate hashtags.

Role of Quality

This assumption encompasses the following attributes:

1. TFF is a user's ratio of followers over followees. The follower and followee counts from a user's Twitter profile are utilized instead of selecting the


counts obtained from DG. These counts were retrieved during the collection of the dataset.

2. User readability employs the average Flesch-Kincaid Grade Level (FKGL) score over all tweets composed by the respective user [K+75]. FKGL describes how many years of education an individual requires to understand the given text; for this purpose, it takes a tweet's words, syllables and sentences into account. Before calculating the respective FKGL value, each tweet is preprocessed by replacing hashtags, URLs, popular emoticons, "RT" (indicating retweets) and user names with appropriate placeholder texts. Otherwise, the FKGL scores could be biased. For instance, assuming the same tweet exists twice and the only difference between both is the user mentioned within the tweet, FKGL would depend on the length of the mentioned user name. But if every user name is replaced with the same placeholder text, FKGL yields identical results for both tweets, which is more intuitive. FKGL is computed per tweet t according to:

FKGL(t) = 0.39 · (total words / total sentences) + 11.8 · (total syllables / total words) − 15.59

3. User quality is determined by averaging the tweet quality values over all collected tweets of a user [KF11]. If a tweet is reposted by a different user, a score of 0.5 is assigned to this tweet. If the retweet includes comments, another 0.5 is added to the score. Thus, the minimum score is zero, but there is no upper limit.

Role of Centrality

This assumption encompasses the following attributes:

1. In-degree C_in for a user u is computed as C_in(u) = (number of incoming edges) / (number of nodes in the graph − 1), where the denominator represents the maximal number of nodes u could be connected with.

2. Closeness C_c is computed per user u as [New05]: C_c(u) = 1 / (average distance to all other nodes).


3. Vertex betweenness C_v is defined for a user u as [Bra08]:

$$C_v(u) = \frac{1}{(n-1)(n-2)} \sum_{s \neq u \neq t} \frac{\sigma_{st}(u)}{\sigma_{st}},$$

where σ_st is the number of all shortest paths running from user s to t. Accordingly, σ_st(u) denotes all shortest paths from s to t passing through u. The factor 1/((n − 1)(n − 2)) normalizes C_v(u) by the number of possible user pairs not including u, where n represents the number of users in the network.

4. Eigenvector centrality C_ei for a user u is defined as [New08]:

$$C_{ei}(u) = \frac{1}{\lambda} \sum_{v=1}^{n} a_{u,v} \, C_{ei}(v),$$

where λ is a constant, a_{u,v} is an entry in the adjacency matrix of RG (IG), which is one if u has an outgoing edge toward v and zero otherwise; n denotes the total number of users in the network.

5. Edge betweenness C_e is similar to Vertex betweenness, but considers edges instead of users U. Hence, it is defined for an edge e in the following way [Bra08]:

$$C_e(e) = \frac{1}{n(n-1)} \sum_{s,t \in U} \frac{\sigma_{st}(e)}{\sigma_{st}},$$

where σ_st is the number of all shortest paths running from user s to t. Accordingly, σ_st(e) denotes all shortest paths from s to t passing through e. The factor 1/(n(n − 1)) normalizes C_e(e) by the number of possible user pairs, where n denotes the number of users in the graph. To obtain a value for a user u in this thesis, the Edge betweenness scores of all edges including u either as source or target node are summed up.

The next steps are to implement the extraction of the aforementioned attributes and to learn the classifiers. While the former is straightforward, a few selected aspects of the implementation are discussed in the following section, for instance which algorithm is chosen for building the classifiers.
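Two of the simpler centrality measures above can be sketched without NetworkX, using plain BFS; this is an illustrative re-implementation, not the thesis code, and the treatment of unreachable nodes is a simplifying assumption.

```python
from collections import deque

def in_degree_centrality(edges, n):
    """C_in(u) = in-degree(u) / (n - 1); `edges` is a list of directed
    (source, target) pairs and n the number of nodes in the graph."""
    indeg = {}
    for _, dst in edges:
        indeg[dst] = indeg.get(dst, 0) + 1
    return {u: d / (n - 1) for u, d in indeg.items()}

def closeness(adj, u):
    """C_c(u) = 1 / (average BFS distance from u to the nodes it can
    reach); unreachable nodes are simply ignored in this sketch."""
    dist, queue = {u: 0}, deque([u])
    while queue:
        v = queue.popleft()
        for w in adj.get(v, []):
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    reached = [d for node, d in dist.items() if node != u]
    return len(reached) / sum(reached) if reached else 0.0
```

On the chain a→b→c, node c has in-degree centrality 0.5 and a has closeness 1/1.5.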


5.4. Implementation Details

Although all of the tasks described in the previous sections are implemented, additional remarks are only included for four aspects, because the implementation is straightforward in the remaining cases. First, the hardware and programming environments are discussed, along with the algorithm used for building the four classifiers. The following two remarks address aspects of community detection: the steps for verifying the correctness of the algorithm from Section 5.2.1, which transforms the graphs into line graphs, as well as the reason why communities on L(RG) are detected differently than described in Section 5.2.2.

The hardware environment for implementing all algorithms and tasks encompasses a dual-core CPU (2.8 GHz) and 16 GB RAM. All tasks dealing with supervised learning, which are described in Sections 4.5 and 4.6, are carried out by means of Weka [HFH+09]. After evaluating different algorithms with respect to their ROC curves on IG and RG, a Bayesian network was selected from Weka for building the four classifiers. For implementing the remaining tasks, Python 2.7 is used.

To test the consistency of the modified LineGraphCreator from Section 5.2.1, which transforms a directed, unweighted graph into its corresponding line graph, it is run on various small, synthetic networks and on four real-world networks against the original implementation of Evans⁷. In all cases, both versions of the algorithm yield the same results. Two of those real-world networks are taken from [RFI02]⁸, where the authors analyze the Gnutella network. Furthermore, the dataset from Adamic et al. [AG05]⁹ and IG are selected. All datasets are chosen because they are directed and, except for IG, publicly available. The respective network sizes are documented in Table 4.
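Such a consistency check boils down to comparing the weighted edge sets produced by both implementations. The following sketch makes the comparison explicit (function name and edge layout are illustrative assumptions):

```python
def same_line_graph(edges_a, edges_b, tol=1e-9):
    """Two line-graph implementations agree if they produce the same
    set of weighted edges.  Edges are (source, target, weight) triples;
    weights are compared with a small tolerance."""
    a, b = sorted(edges_a), sorted(edges_b)
    if len(a) != len(b):
        return False
    return all(x[:2] == y[:2] and abs(x[2] - y[2]) < tol
               for x, y in zip(a, b))
```

Sorting makes the check independent of the order in which either implementation emits its edges.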
From the political blogs dataset, only the largest weakly connected component is considered, and self-links are removed, since the modified algorithm assumes that none exist. For the detection of overlapping communities, the Louvain method is applied to the extracted line graph. Since RG is dense and L(RG) is even denser (280 million edges and 550,000 nodes), the latter is interpreted as undirected. Interpreting

7 http://theory.ic.ac.uk/~time/networks/LineGraphCreator.html Retrieved: 03-23-2014
8 http://snap.stanford.edu/data/index.html Retrieved: 03-23-2014
9 http://www-personal.umich.edu/~mejn/netdata/ Retrieved: 03-23-2014


Dataset name       Nodes    Edges
p2p-Gnutella08      6307    20777
p2p-Gnutella25     22687    54705
Political blogs     1222    19021
IG                  1746     5631

Table 4: Real-world datasets used for testing LineGraphCreator.

L(RG) as undirected is justified because RG's reciprocity is 99.7%, as explained in Section 5.1, and because reciprocal edges are retained in the transformation into L(RG), so almost no information is lost. This makes it possible to apply the original implementation¹⁰ of the Louvain method to L(RG) instead of the extended version from Section 5.2.2, which is significantly slower: the extended version required several weeks to finish, while the original implementation needed only three hours. For community detection on IG, however, the extended Louvain method described in Section 5.2.2 is employed.

The main reason for preferring Evans' approach for detecting overlapping communities over that of Ahn et al. [ABL10] stems from preliminary experiments, in which the method introduced by Ahn et al., extended to take directed graphs into account, yielded unintuitive results on small synthetic datasets. More precisely, mainly trivial communities including only two users were detected.

Now that all details about InfluenceLearner are known, the focus shifts to how the ground truth in the collected Twitter dataset was established, which is explained in the next chapter.

10 https://sites.google.com/site/findcommunities/ Retrieved: 03-23-2014
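The reciprocity figure cited above (the fraction of directed edges whose reverse edge also exists) can be computed with a few lines; this is an illustrative sketch, not the thesis code:

```python
def reciprocity(edges):
    """Fraction of directed edges whose reverse edge also exists;
    values close to one justify treating the graph as undirected."""
    edge_set = set(edges)
    mutual = sum(1 for s, t in edge_set if (t, s) in edge_set)
    return mutual / len(edge_set) if edge_set else 0.0
```

For a graph with edges a→b, b→a and a→c, two of the three edges are reciprocated, giving a reciprocity of 2/3.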


6. SNAnnotator

The objective of this chapter is to describe SNAnnotator, which collects, labels and creates the final Twitter dataset so that the respective classifiers can be built. In addition, the obtained labels are examined in terms of their reliability and their distribution. The reason for collecting a new Twitter dataset is that only a few, at least partially manually annotated datasets for influentials exist, for instance [BCM+12, Tiw11]. Unfortunately, neither of them can be used in this thesis, as the authors could not grant access to the tweet content due to Twitter's terms of service. But since attributes related to quality and activity require access to the tweet content, a new dataset had to be collected.

6.1. Establishing the Ground Truth Dataset

The task of collecting and labeling a dataset was conducted by loosely following the idea of crowdsourcing [DRH11, NR10]; that is, annotators worked separately on the task at home in front of their computers. The typical crowdsourcing problem of adversarial annotators maximizing their benefit by labeling as much data as possible can be neglected here, as every individual was informed in advance about the absence of monetary incentives, but still chose to participate. One key challenge during the annotation process remained the fact that "influence" is highly subjective and thus implies ambiguity; therefore, several countermeasures were taken, which are mentioned in the respective paragraphs. This section focuses on describing the process of creating the final dataset: first, the collection of the dataset is addressed, followed by an explanation of how the annotators assigned labels to tweets and users. The section concludes by describing how the acquired labels are incorporated into the ground truth dataset.


6.1.1. Dataset Crawl

Given the requirements in terms of the attributes to be extracted by InfluenceLearner, tweets regarding a single topic, "Amazon Kindle", and their corresponding authors were collected. Hashtags were used to identify suitable tweets, which were then collected in real time using the Twitter Streaming API¹¹; specifically, the hashtags #Kindle, #KINDLE and #kindle were used in this thesis. The metadata of the respective users were obtained through the Twitter REST API¹². Furthermore, only tweets written in English were retrieved, to simplify the annotation process for the annotators. For this purpose, SNAnnotator collected only tweets from users who had chosen English in their Twitter front end, which was detected based on the metadata. This preprocessing step filtered out most, but not all, of the unrelated users and tweets. The dataset was crawled within one week (April 5th 2013 - April 12th 2013).

6.1.2. Annotation Process

The goal of the annotation process was to assign each user and tweet a label with respect to influence, so that it is possible to distinguish between influentials and users posting influential tweets, as described in Section 4.6; otherwise, obtaining user labels alone would have been sufficient. For this purpose, an annotation tool was developed that allows independent annotators to assign each tweet and user a label regarding influence. A screenshot of the tool is shown in Figure 3. The tool also takes the independence of labels into account by presenting the next user or tweet to the annotators in random order. In the case of tweets, the annotators were shown the tweet content, and if they required more information, they could visit the tweet online by clicking on a link. If users had to be labeled, the annotators were provided with a link to the user's Twitter timeline, so that they could access her latest tweets, personal information and external links.
Although the latest tweets on a user's timeline might be unrelated to Kindle, it is still possible to assign appropriate user labels, because Cha et al. [CHBG10] found that influentials hold influence over multiple topics.

11 https://dev.twitter.com/docs/streaming-apis Retrieved: 03-23-2014
12 https://dev.twitter.com/docs/api/1.1 Retrieved: 03-23-2014


Figure 3: Annotating a tweet with the annotation tool

The decision whether a user (or tweet) is influential was not binary: rather, five labels for users and tweets were defined, namely "Influential", "Probably Influential", "Probably Non-influential", "Non-influential" and "Strange". The reason for the first four labels was that some users or tweets seem promising but are not influential yet. The last label, "Strange", was to be used for tweets that were written in a language other than English, were incomprehensible, or were not available online anymore. For users, "Strange" was permissible only if their account had been suspended or deleted, or if the majority of their tweets were incomprehensible or not written in English.

In the end, nine annotators could be recruited for the annotation process, all of whom annotated in their spare time without receiving any monetary incentive. None of them were experts in marketing, and they had diverse backgrounds (economics, informatics, social science). The annotators were not provided with a definition of influence prescribing when to assign which label to a tweet or user. Instead, they were given examples from different topics to get a better idea of influence. Each annotator was provided with two documents: (1) a protocol summing up the goals of the annotation process and the individual responsibilities and


Figure 4: Additional metadata being stored during the annotation of users and tweets.

(2) a tutorial for the annotation tool. The complete protocol can be found in Appendix A and the tutorial for the annotation tool in Appendix B. In addition to the written tutorial, a publicly available video¹³ was prepared as well. This helped reduce misunderstandings, as some individuals tend to understand better through a more interactive or visual approach [Mou07]. All these steps aimed to minimize noise across the labels. But in the end, the annotators defined influence for themselves, which introduced some noise. This problem was tackled by requiring each user and tweet of the dataset to be annotated three times, while every annotator labeled each of them at most once. The five labels are represented internally as ordinal values; hence, it is straightforward to determine the final tweet and user labels with the help of the median, which is recommended for ordinal data [J+04].

While annotating, additional metadata were recorded, as depicted in Figure 4. The total time t_total starts when the annotator loads a tweet or a user's timeline link into the tool; t_total stops as soon as the annotator assigns a label. Furthermore, it is stored how much time each annotator spends browsing a user's timeline (t_profile/tweet) and for how long a linked URL is visited (t_URL). These data are valuable for analyzing the labeling behavior of annotators over time in drift mining, as it is likely that annotators adapted their criteria for assigning labels; due to the restricted time frame, however, this analysis is not conducted in this thesis. Note that some of these metadata are only available if the links were visited directly through the annotation tool. In total, it took 17 weeks to complete the annotation process (July 1st 2013 - October 30th 2013).

13 http://www.youtube.com/watch?v=5h2Zfm9Nza8 Retrieved: 03-23-2014



6.1.3. Merging and Cleaning the Dataset

SNAnnotator determines the ground truth for user and tweet labels by selecting the median of the labels assigned by the annotators. Users and tweets that were marked for removal are excluded from the dataset, since they could not be labeled a sufficient number of times by all responsible annotators. All tweets written by a deleted user are removed as well; likewise, users without any remaining valid tweets are discarded. Following these steps results in the removal of 310 users and 1497 tweets in total. The resulting dataset, encompassing 4438 users and 35742 tweets, serves as input for InfluenceLearner.

Note, however, that in the remainder of this thesis only two labels are considered, namely "Influential" and "Non-influential". Unless otherwise stated, when referring to tweet and user labels, binary labels with respect to influence are assumed. Thus, "Influential" and "Probably Influential" are merged into "Influential", and "Probably Non-influential" is merged with "Non-influential" into "Non-influential". This decision is motivated by the fact that it was subjective for annotators to decide whether a user or tweet is "Influential" rather than "Probably Influential"; moreover, the annotators turned out to be reluctant to assign the extreme labels "Influential" and "Non-influential". A study of why this is so is beyond the scope of this thesis. Merging the classes introduces noise, because it is more difficult to distinguish between "Probably Non-influential" and "Probably Influential" than between "Influential" and "Probably Non-influential" or between "Non-influential" and "Probably Influential". This makes InfluenceLearner's learning task more challenging. It is now interesting to analyze to what extent the annotators agreed when assigning labels to the same users and tweets, which is discussed in the next section.
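The median-and-merge step can be sketched as follows. The numeric encoding of the ordinal labels is an assumption made for this illustration; the thesis does not state concrete values.

```python
# Ordinal encoding of the labels (assumed values, for illustration only).
SCALE = {"Influential": 4, "Probably Influential": 3,
         "Probably Non-influential": 2, "Non-influential": 1}

def final_binary_label(annotations):
    """Median of the three ordinal annotations, then merged into the
    binary scheme used in the remainder of the thesis: (Probably)
    Influential -> "Influential", the rest -> "Non-influential"."""
    values = sorted(SCALE[a] for a in annotations)
    median = values[len(values) // 2]      # middle of three ratings
    return "Influential" if median >= 3 else "Non-influential"
```

With three ratings the median is simply the middle element; for an even number of ratings a tie-breaking rule would be needed.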

6.2. Analysis of the Ground Truth Dataset

This section investigates two aspects of the ground truth dataset. First, it is analyzed how reliable the assigned user and tweet labels are. Second, the label distribution of tweets and users is examined for RG and IG.


6.2.1. Analysis of the Annotator Agreement

In order to get a rough idea about the agreement among annotators when assigning labels, Fleiss' kappa κ [Fle71] is employed:

$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e} \qquad (3)$$

In Equation 3, the numerator represents the agreement that is actually achieved above chance, whereas the denominator describes the degree of agreement that is attainable above chance. Equations 4-7 describe how the respective terms P̄ and P̄_e are calculated.

$$\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i \qquad (4)$$

$$\bar{P}_e = \sum_{j=1}^{k} p_j^2 \qquad (5)$$

$$P_i = \frac{1}{n(n-1)} \Big[ \Big( \sum_{j=1}^{k} n_{ij}^2 \Big) - n \Big] \qquad (6)$$

$$p_j = \frac{1}{N n} \sum_{i=1}^{N} n_{ij} \qquad (7)$$
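Anticipating the variable definitions below (n_ij is the number of annotators who assigned category j to item i), Equations 3-7 can be sketched directly in Python:

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for `ratings[i][j]` = number of annotators who
    assigned category j to item i; every item must have the same
    number of ratings n (here: three annotators per item)."""
    N, k = len(ratings), len(ratings[0])
    n = sum(ratings[0])
    # Eq. 7: fraction of all assignments falling into category j
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    # Eq. 6: agreement on item i
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N                    # Eq. 4
    Pe_bar = sum(x * x for x in p)          # Eq. 5
    return (P_bar - Pe_bar) / (1 - Pe_bar)  # Eq. 3
```

Perfect agreement on every item yields κ = 1, while agreement at or below the chance level yields κ ≤ 0.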

N always denotes the number of tweets (users), n the number of ratings per tweet (user), and k the number of categories into which assignments are made. The tweets (users) are indexed as i = 1, ..., N and the categories as j = 1, ..., k. Thus, n_ij represents the number of annotators who assigned category j to tweet (user) i. This way, p_j is the fraction of all assignments belonging to the j-th category (Equation 7), and P_i is the level of agreement for the i-th tweet (user), that is, how many annotator pairs are in agreement relative to the number of all possible annotator pairs (Equation 6). When the annotators are in perfect agreement, κ = 1; if there is no agreement beyond chance among the annotators, κ ≤ 0. In this thesis, the parameters for the computation of κ are set as follows: k = 2, because only the merged labels "Influential" and "Non-influential" are considered; n = 3, as three annotators label the same tweet (user); in terms


of tweets, N = 35742 (the number of tweets, each rated n = 3 times) and in the case of users N = 4438. For user labels, it turns out that κ = 0.084, whereas κ = 0 for tweet labels. Not surprisingly, the agreement is higher for user labels, because annotators are given more data on a user's timeline on which to base their decisions, as opposed to single tweets. Another explanation for the lower κ for tweet labels was found by discussing with some annotators and manually inspecting the tweets afterwards: most of the tweets promote books, and if annotators are not interested in the specific genres, they tend to automatically assign lower ratings. Hence, the available genres affect the assigned tweet labels to some extent. This problem is less likely to occur when labeling users, because their Twitter timelines include tweets regarding different topics. Both computed agreements are relatively low, but considering that influence remains subjective, that the annotators had diverse backgrounds and that they could decide for themselves how to define influence, the results are not surprising. It is therefore essential to use the median labels for users and tweets.

6.2.2. Analysis of the Label Distribution

Looking at the nodes (= users) and edges in Table 5, it becomes obvious that IG is sparser than RG. The number of influential users depending on which label is used is also depicted: if they are selected based on DirectLabel, there exist 250 influentials. In contrast, there are only 161 influentials in IG, but relative to the number of users in this graph, the fraction of influentials has nearly doubled. The same observation holds for influentials determined based on MaxLabel. Most notably, only 15 out of 92 influentials based on MaxLabel are also considered influential when assigned their label according to DirectLabel on RG. Hence, only very few influentials have posted influential tweets in the dataset.
Furthermore, building IG seems to exclude proportionally many non-influentials. On the other hand, not all identified influentials are contained in IG, in contrast to RG. In RG there are 35741 tweets, but only 131 of them are influential. IG contains only 25800 tweets, but the fraction of influential tweets is identical to that of RG. Surprisingly, fewer influential tweets than influential users could be identified by the annotators, although eight times more messages than users exist in the dataset.


                             RG            IG
Users                      4437          1685
Edges                    550931          5471
Influentials (DirectLabel)  250 (5.6%)    161 (9.6%)
Influentials (MaxLabel)      92 (2.0%)     74 (4.4%)
Influential tweets          131 (0.4%)    113 (0.4%)

Table 5: Final statistics for RG and IG. IG contains 25800 tweets, whereas RG encompasses 35741 tweets. Percentages in brackets denote the fraction of influential users (tweets) with respect to all users (tweets).

This analysis concludes the chapter. The only component that has not been introduced yet, InfluenceSimulator, is described in the following chapter.


7. InfluenceSimulator

This chapter introduces InfluenceSimulator, which simulates the propagation of influence in networks according to two influence spread models: the Weighted Cascade Model (WC), a variant of the Independent Cascade Model, and the Linear Threshold Model (LT). It simulates how far information diffuses in the network when a subset of k influential users is given. Influence spread models assume that a subset of more influential users is able to propagate a message further in the network than a subset of less influential users, quantified by the number of users who know about the message after a certain time span has passed. Hence, it is expected that the user subsets selected by qualitatively better classifiers diffuse a message further in the network than those selected by worse ones. Both models are investigated in this thesis because it is unclear whose underlying assumptions about influence spread are more realistic; InfluenceSimulator thereby aims to shed light on this open question [Bon11].

This chapter first introduces both models separately and then focuses on deriving influence probabilities for the entire network, which are assumed to exist according to Section 2.1.2, meaning that for every user it is known with which probability she can propagate a message to her neighbors. In this thesis, these influence probabilities are classifier-dependent and not set to fixed values. The chapter concludes with a description of InfluenceSimulator, which incorporates both models and derives the respective influence probabilities.

7.1. Linear Threshold Model

This section explains how LT spreads a message in DG. First, it is required that the weights of the incoming edges of each user s sum to one. This is necessary because in LT each user is initially assigned a uniformly drawn threshold between zero and one, which must be exceeded by the summed weight of the incoming edges from active users in order to activate s. Users are either active or inactive, where the former indicates that a user has received the message to be spread and continues to propagate it in the network. A user is only permitted to turn from inactive to active. It is then checked in each discrete


time step whether new users become activated. If new users turn active, they do so in the next time step, which allows them to try to activate their neighbors. Thus, the more of a user's neighbors are active, the more likely she becomes active herself. The simulation completes when no additional users can be activated in a time step or when the maximal number of time steps to be simulated is reached. The second influence spread model, the Weighted Cascade Model, is explained in the following section.
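The LT process described above can be sketched as follows. The function name and the data layout (`adj_in[t]` as a dict of incoming edge weights) are illustrative assumptions; whether the threshold must be strictly exceeded or merely reached makes no practical difference for uniformly drawn thresholds.

```python
import random

def linear_threshold(adj_in, seeds, max_steps=100, rng=None):
    """Sketch of LT diffusion: `adj_in[t]` maps user t to a dict
    {s: w} of incoming edge weights summing to one.  Every user draws
    a uniform threshold; t activates once the summed weight of her
    active in-neighbours reaches that threshold."""
    rng = rng or random.Random(0)          # fixed seed for repeatability
    threshold = {u: rng.random() for u in adj_in}
    active = set(seeds)
    for _ in range(max_steps):
        newly = {t for t, weights in adj_in.items()
                 if t not in active and
                 sum(w for s, w in weights.items() if s in active)
                 >= threshold[t]}
        if not newly:
            break                          # diffusion has converged
        active |= newly
    return active
```

With a single edge of weight 1.0 pointing at b, activating the seed a always activates b, since every threshold is drawn below one.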

7.2. Weighted Cascade Model
This subsection explains how WC spreads a message in DG. WC is a variant of the Independent Cascade Model in which the probability of a user s influencing user t is estimated by p(s, t) = 1/in-degree(t). As in LT, users are at all times in one of two states, active or inactive, and once they turn active, they remain in this state. Similarly to LT, in each time step it is checked whether new users become active, and newly activated users can try to activate their neighbors in the next time step. However, each user has only a single attempt to activate an inactive neighbor with the respective influence probability; if this attempt fails, this user can never activate that inactive user. As in LT, the simulation finishes if no new users turn active or if the maximum number of time steps is reached. Before InfluenceSimulator can be described, it is necessary to derive the classifier-dependent influence probabilities of the network, which is done in the next section.
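Correspondingly, the WC process just described can be sketched as follows (an illustrative sketch with an assumed adjacency-list representation, not the thesis implementation):

```python
import random

def run_weighted_cascade(out_neighbors, in_degree, seeds, max_steps):
    """Weighted Cascade Model: every newly activated user s gets exactly
    one attempt to activate each inactive out-neighbor t, succeeding
    with probability p(s, t) = 1 / in_degree[t]."""
    active = set(seeds)
    frontier = set(seeds)              # users activated in the last step
    for _ in range(max_steps):
        newly_active = set()
        for s in frontier:
            for t in out_neighbors.get(s, []):
                if t in active or t in newly_active:
                    continue
                if random.random() < 1.0 / in_degree[t]:  # single attempt
                    newly_active.add(t)
        if not newly_active:
            break
        active |= newly_active
        frontier = newly_active        # failed attempts are never retried
    return active
```

Advancing the frontier to only the newly activated users enforces the single-attempt rule: a user tries to activate each neighbor exactly once.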

7.3. Deriving Influence Probabilities
This section describes how the probability p(s, t) of user s influencing user t is estimated depending on a classifier, based on the following reasoning: provided a ranking among users can be established, users ranked higher exert more influence on their neighbors. Each classifier determines this ranking individually. For this purpose, the probability with which the respective classifier assigns a user the label "Influential" is interpreted as its confidence. Since a Bayesian network is chosen for building the classifiers, labels are assigned with probabilities in the range from zero to one. Ordering this list of confidences in descending order yields the classifier-dependent ranking. In case of ties, all users with the same probability are assigned the same average rank. For instance, given three users a, b and c, where a has the highest probability p(a), with p(b) = p(c) and p(a) > p(b), a's rank would be 1, whereas b and c would both have the rank 2.5. Based on this ranking, each user s is able to activate user t with a probability between zero and one depending on the rank of s, where the rank lies in the range 1 to n with n denoting the number of users in the network:

\[ \alpha = \frac{n - (\mathrm{rank}(s) - 1)}{n}, \tag{8} \]

where rank(s) indicates the rank assigned to s by the respective classifier. Thus, the higher s is ranked, the larger α becomes. In addition, α is normalized by the in-degree of t, i.e. multiplied by 1/in-degree(t), so that the influence probabilities of all incoming edges of t sum to at most one, which finally yields for p(s, t):

\[ p(s, t) = \frac{n - (\mathrm{rank}(s) - 1)}{n \cdot \mathrm{in\text{-}degree}(t)} \tag{9} \]

Equation 9 guarantees that p(s, t) lies in the range from zero to one, which is essential for using this formula for LT and WC. This formula is incorporated into InfluenceSimulator, which is described next.
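Equations 8 and 9, including the average-rank handling of ties from the example above, can be sketched in code as follows (function and variable names are illustrative, not those of the actual implementation):

```python
def influence_probabilities(confidence, in_degree, edges):
    """Derive p(s, t) per Equation 9 from classifier confidences.

    confidence -- dict: user -> probability of the label "Influential"
    in_degree  -- dict: user -> in-degree in DG
    edges      -- iterable of directed edges (s, t)
    """
    # Rank users by descending confidence; ties receive their average
    # rank, e.g. two users sharing ranks 2 and 3 both obtain rank 2.5.
    ordered = sorted(confidence, key=confidence.get, reverse=True)
    rank = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and confidence[ordered[j]] == confidence[ordered[i]]:
            j += 1
        avg = (i + 1 + j) / 2                  # average of ranks i+1 .. j
        for u in ordered[i:j]:
            rank[u] = avg
        i = j
    n = len(ordered)
    # Equation 9: p(s, t) = (n - (rank(s) - 1)) / (n * in-degree(t))
    return {(s, t): (n - (rank[s] - 1)) / (n * in_degree[t]) for (s, t) in edges}
```

For the three-user example above (p(a) > p(b) = p(c)), a obtains rank 1 and b and c both obtain rank 2.5, so a's α equals 1 while b and c share α = 0.5.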

7.4. General Procedure of InfluenceSimulator
This section outlines how InfluenceSimulator combines WC and LT with the derived influence probabilities in order to simulate the influence spread. Algorithm 2 shows this procedure. InfluenceSimulator takes as input the maximal number of time steps t to simulate per subset of top-influential users, a list K of subset sizes k whose top-influential users according to the classifier are active at the beginning of the simulation, the directed graph DG and the number r indicating how often the active users should be computed for each subset size k. The algorithm outputs all users who are activated in at least 50% of the repetitions. First, the influence probabilities for all edges in DG are computed according to Equation 9 (line 4). Afterwards, for each k (line 5) the top-influentials are selected according to the confidence of the respective classifier (line 7) and either WC or LT is run (line 8). This is repeated r times (line 6). Finally, only those users who were activated in at least 50% of the repetitions are retained (line 10). Since all three components have now been introduced, it is possible to focus on conducting experiments with them in order to answer the questions raised in Section 1.2 in the next chapter.

Algorithm 2 Simulation.
1: Input: Time steps t, list K of how many influentials are initially active, directed graph DG, number of repetitions r, results from classifier c.
2: Output: Active users.
3:
4: Compute influence probabilities according to Equation 9 with respect to c
5: for each k in K do
6:     for each of the r repetitions do
7:         Select the top k influential users according to the classifier's confidence
8:         Run the simulation with the k influential users, DG and t
9:     end for
10:    Retain only the users that were activated in at least 50% of the r runs
11: end for
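This procedure might be sketched as follows, with the spread model passed in as a callable (names are illustrative, not those of the actual implementation):

```python
from collections import Counter

def influence_simulator(ranking, k_values, r, run_model):
    """Sketch of the simulation procedure: for each subset size k, run the
    spread model r times from the top-k users and keep everyone who was
    active in at least 50% of the repetitions.

    ranking   -- users ordered by descending classifier confidence
    run_model -- callable mapping a seed set to the resulting active set
                 (e.g. one run of WC or LT on DG for t time steps)
    """
    results = {}
    for k in k_values:
        seeds = ranking[:k]                      # top-k influentials
        counts = Counter()
        for _ in range(r):                       # r repetitions
            counts.update(run_model(set(seeds)))
        # majority vote over the r runs
        results[k] = {u for u, c in counts.items() if c >= r / 2}
    return results
```

Passing the model as a callable keeps the repetition and majority-vote logic identical for WC and LT.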


8. Evaluation
This chapter describes the four experiments conducted to answer the scientific questions raised in the beginning of this thesis. The first experiment examines whether influentials and users posting influential tweets are correlated; in other words, it is investigated whether influential tweets characterize influential users. After explaining the various baselines utilized for comparison, the next two experiments analyze the quality of the built classifiers against these baselines from two different perspectives: the first examines the quality of the classifiers in terms of ROC curves, while the second simulates the influence spread with the help of InfluenceSimulator in order to confirm the results obtained from the previous experiment. The last experiment tries to identify characteristic attributes of influential users and of users writing influential tweets. Note that for all conducted experiments the baselines are derived from the proposed classifiers, which is why they are also built using Bayesian networks. All baselines and classifiers are learned with the help of Weka using 10-fold cross-validation.

8.1. Distinguishing between Influential Users and Influential Tweets
The first experiment analyzes the correlation between influentials and influential tweets and aims to answer the question whether influentials may be identified with the help of influential tweets or whether both entities exist independently of one another. For this purpose, the classifiers derived from DirectLabel are compared with those using MaxLabel. Since two different graphs exist, MaxClassifierIG is compared with DirectClassifierIG and MaxClassifierRG with DirectClassifierRG, which allows examining the correlations separately on RG and IG. In this experiment, the correlations are analyzed with the help of Spearman's coefficient ρ and Pearson's r. The former is able to detect monotonic, possibly non-linear relationships, whereas the latter can discover linear relationships only. Pearson's r is computed for a sample size of n according to:

\[ r = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}, \]

where Xi denotes the confidence of MaxClassifierIG (MaxClassifierRG) for assigning user i the label "Influential", Yi corresponds to the confidence of DirectClassifierIG (DirectClassifierRG) for assigning user i the label "Influential", X̄ (Ȳ) represents the mean of X (Y), and n is the number of users in IG (RG). For ρ it is necessary to establish a ranking among the users. Similarly to InfluenceSimulator (see Section 7.3), this ranking is established by sorting in descending order the confidences with which the respective classifier assigns users the label "Influential". This ranking also introduces ties, but ρ is able to take them into account by assigning each of those instances their average rank. Spearman's coefficient is calculated like Pearson's r, but instead of comparing the confidences Xi and Yi directly with each other, the ranks of the confidences are compared. Thus, it is computed as follows:

\[ \rho = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}, \]

where the confidences Xi and Yi are transformed to ranks xi and yi, and x̄ (ȳ) corresponds to the mean of x (y). In case of a perfect correlation, both coefficients yield 1 (positive correlation) or -1 (negative correlation), while values equal to zero indicate no correlation. In Table 6, both coefficients are reported together with their p-values in parentheses (significance level 0.05); in the left column MaxClassifierRG is compared with DirectClassifierRG, and in the right one MaxClassifierIG is compared with DirectClassifierIG. All coefficients turn out to be close to zero, which confirms the preliminary observation that only a few influentials, namely 15, also write influential tweets in RG (see Section 6.2.2). Hence, the null hypothesis, "influential tweets and users writing influential tweets do not correlate with one another", cannot be rejected.

        RG               IG
ρ    -0.014 (0.342)   -0.015 (0.550)
r    -0.014 (0.347)    0.009 (0.714)

Table 6: Results for Pearson's r (r) and Spearman's coefficient (ρ) when comparing MaxClassifierRG with DirectClassifierRG (left column) and MaxClassifierIG with DirectClassifierIG (right column). The corresponding p-values are reported in parentheses.

This means it is necessary to distinguish between influentials and users writing influential tweets in general, for influential tweets do not characterize influential users. There are several possible explanations for this. First, the annotators decide subjectively whether a user and whether a tweet is influential, so the result may be due to the annotators' behavior. Moreover, the annotators have little available information on tweets (just the tweet content) in contrast to the information on users (all tweets of each user, personal information and external URLs), so the labels of the tweets may have been assigned in a less well-informed way. Finally, there exist users, e.g. celebrities, who presumably have many followers and many discussions on their tweets because of inherent properties of the users rather than because of the tweets' contents. To investigate this further, data is needed in which celebrities can be identified and treated separately. The following experiments require baselines in order to investigate the performance of the classifiers. Thus, the next section addresses how these baselines are derived.
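For illustration, the two coefficients can be computed as follows. This is a self-contained sketch; a statistics library would additionally provide the p-values, which are omitted here:

```python
from math import sqrt

def pearson(x, y):
    """Pearson's r for two equally long samples x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def average_ranks(values):
    """Descending ranks with ties replaced by their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and values[order[j]] == values[order[i]]:
            j += 1
        for k in order[i:j]:
            ranks[k] = (i + 1 + j) / 2      # average of positions i+1 .. j
        i = j
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson's r applied to the ranks of x and y."""
    return pearson(average_ranks(x), average_ranks(y))
```

The rank transformation is exactly what distinguishes ρ from r: tied confidences receive their average rank before Pearson's formula is applied.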

8.2. Deriving the Baselines
This section describes which baselines are used for MaxClassifierRG, MaxClassifierIG, DirectClassifierRG and DirectClassifierIG in the following experiments. Most importantly, these baselines are always derived from the respective classifier; 15 different baselines are derived per classifier. At first, four baselines are built incorporating only attributes related to one of the four assumptions community, activity, quality and centrality from Section 4.3. Likewise, six baselines are derived by utilizing attributes related to two assumptions, and four more baselines are added which consider attributes related to three assumptions. Finally, a random baseline is built that predicts the label of a user depending on the distribution of the user labels in the respective graph. For clarity, only the best baselines regarding one, two and three assumptions are depicted in the respective plots of the following experiments to illustrate the differences.
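The enumeration of these baseline attribute groups can be sketched with itertools (the assumption names are taken from Section 4.3; the mapping from assumptions to concrete attributes is omitted here):

```python
from itertools import combinations

# The four assumptions behind the extracted attributes (Section 4.3).
ASSUMPTIONS = ("community", "activity", "quality", "centrality")

def baseline_attribute_groups():
    """All subsets of one, two or three assumptions: 4 + 6 + 4 = 14
    attribute-based baselines; the random baseline makes 15 per classifier."""
    groups = []
    for size in (1, 2, 3):
        groups.extend(combinations(ASSUMPTIONS, size))
    return groups
```

The binomial counts 4, 6 and 4 confirm the 15 baselines stated above once the random baseline is included.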

8.3. Quality of Learned Classifiers
The second experiment compares the quality of the four classifiers with that of their baselines in terms of ROC curves and thus aims to answer the question whether the chosen set of attributes from Section 4.3 yields an improvement over smaller subsets of the same attributes when detecting influentials. If this is the case, the classifiers should outperform their baselines. ROC curves are used for evaluating the quality of the classifiers because, as opposed to precision and recall, they are independent of the underlying skewed class distribution [Faw04]. In Figures 5 and 6 the ROC curve of each final classifier is plotted against its respective baselines. All four classifiers MaxClassifierIG, MaxClassifierRG, DirectClassifierIG and DirectClassifierRG show an improvement over the random baseline. As can be seen from both figures, the more assumptions are added, the better the performance of the classifier becomes, except for MaxClassifierIG, where the baseline utilizing no attributes related to quality slightly outperforms the final model. This observation suggests that fewer extracted attributes might suffice to detect influentials almost without worsening the classifiers' performance. This assumption is further supported by the fact that for MaxClassifierRG and DirectClassifierRG the baselines utilizing attributes related to only three instead of four assumptions are also able to compete with the proposed classifiers. Moreover, identifying influentials seems easier on RG (Figures 5(b) and 6(b)) than on IG (Figures 5(a) and 6(a)), at least in the collected dataset. This observation disagrees with the findings of [BCM+ 12, HRW08], where the authors found that IG is a better representation of influence than RG. It is suspected in this thesis that the inferior performance on IG is due to the construction process of IG: if no interactions of a user are part of the crawled data, then this user is removed from IG but is still present in RG. Hence, RG contains information on influential users that do not appear in IG.

Figure 5: ROC curves of final classifiers of MaxClassifier compared with a random baseline and the best baselines capitalizing on attributes related to one, two and three out of four assumptions. Performance on IG (a) and on RG (b). Fallout = 1 − Specificity.

This construction caveat can only be amended in an unbiased way if the whole Twitter network is copied. However, this is not feasible, because new tweets are continuously added to Twitter, so sampling is inevitable. In summary, the results obtained from this experiment suggest that the four proposed classifiers are indeed able to identify influentials more reliably than the baselines, except for MaxClassifierIG, where one baseline is slightly better. In other words, the attributes selected in this thesis capture the notions of influential users and of users writing influential tweets well.
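The ROC curves used here plot recall against fallout. A minimal sketch of how such points are obtained from classifier confidences (a hypothetical helper for illustration, not part of Weka; for brevity, tied scores are not grouped):

```python
def roc_points(scores, labels):
    """ROC points (fallout, recall) from confidence scores and binary
    labels (1 = "Influential"); fallout = 1 - specificity. Each score
    acts as its own threshold, i.e. ties are not grouped."""
    ranked = sorted(zip(scores, labels), reverse=True)
    pos = sum(labels)
    neg = len(labels) - pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, label in ranked:
        if label == 1:
            tp += 1           # an "Influential" user ranked above the cut
        else:
            fp += 1           # a non-influential user ranked above the cut
        points.append((fp / neg, tp / pos))
    return points
```

A classifier that ranks all influentials above all non-influentials produces a curve through (0, 1), whereas a random ranking stays near the diagonal.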

Figure 6: ROC curves of final classifiers of DirectClassifier compared with a random baseline and the best baselines capitalizing on attributes related to one, two and three out of four assumptions. Performance on IG (a) and on RG (b). Fallout = 1 − Specificity.

The next experiment is conducted in order to confirm these results. Hence, it is expected that the four proposed classifiers outperform their respective baselines.

8.4. Simulation of Influence Propagation
The third experiment re-examines the quality of the proposed classifiers from a different perspective by applying InfluenceSimulator. It thereby aims to answer the question whether the underlying assumptions of WC and LT for propagating influence are realistic. If so, it is expected that the results of the simulations reflect the quality of the classifiers and baselines from Section 8.3, which essentially means that the classifiers perform better than their respective baselines, i.e. they are able to activate more users than their baselines. For the experiment, the following parameter settings are chosen for InfluenceSimulator: the number k of initially activated users is varied from 1 to 30, the number of discrete time steps t is ten, and the number of repetitions r is three; the simulation per k is repeated three times in order to obtain more stable results, and users are considered activated only if they were activated in at least 50% of the repetitions. The outcomes of the simulations using LT and WC are first analyzed individually, and the implications are discussed thereafter.


8.4.1. Linear Threshold Model

Figure 7: Activated users in at least 50% of the three repetitions, given a subset of k initially activated influential users according to (a) MaxClassifierIG and (b) MaxClassifierRG with their respective best baselines.

Figure 8: Activated users in at least 50% of the three repetitions, given a subset of k initially activated influential users according to (a) DirectClassifierIG and (b) DirectClassifierRG with their respective best baselines.

The performance of the classifiers is plotted in Figure 7 (using MaxLabel) and Figure 8 (using DirectLabel) together with their respective baselines, where the x-axis denotes the size k of the subset of initially activated users and the y-axis shows the number of activated users. When considering the separate graphs RG and IG, the plots of Figures 7(a) and 8(a) look similar for IG; likewise, Figures 7(b) and 8(b) look alike for RG. On IG, the random baseline clearly outperforms the classifier and the remaining baselines, as it activates the highest number of users for each subset of initially activated users. The classifier and the remaining baselines converge to a similar level in terms of activated users for higher values of k, although the baselines incorporating only attributes related to a single assumption require the largest k in order to spread the message in the network as far as the remaining baselines and classifiers. On RG, the same observation holds for the baselines associated with attributes of a single assumption. But here the random baseline is only able to activate more users for k ≤ 12; for larger k all methods perform equally well.


8.4.2. Weighted Cascade Model

Figure 9: Activated users in at least 50% of the three repetitions using WC with MaxLabel, given a subset of k initially activated influential users according to (a) MaxClassifierIG and (b) MaxClassifierRG with their respective best baselines.

Figure 10: Activated users in at least 50% of the three repetitions using WC with DirectLabel, given a subset of k initially activated influential users according to (a) DirectClassifierIG and (b) DirectClassifierRG with their respective best baselines.

The performance of the classifiers is plotted in Figure 9 (using MaxLabel) and Figure 10 (using DirectLabel) together with their respective baselines. For WC the same observations as for LT hold: the plots look similar when considering the graphs separately, and on RG all classifiers and their respective baselines activate about the same number of users. On IG, the random baseline again manages to outperform the classifiers and the remaining baselines. A small difference, however, is that in Figure 9(a) it only manages to spread the message further in the network for k ≤ 6. Thus, the difference between this baseline and the other methods is less pronounced than when using LT on the same graph (see Figure 7(a)).

8.4.3. Interpretation of Results
It is clear that the results from simulating the influence spread according to either WC or LT do not reflect the classifiers' qualities from Section 8.3, as the random baseline is in all cases able to propagate the message in the network as far as any other baseline or classifier; on IG, it even manages to outperform them clearly. In terms of ROC curves, in contrast, the random baseline was always dominated by every baseline and classifier. This result suggests that the underlying assumptions in both influence spread models are unrealistic, as they cannot simulate the spread of influence in accordance with the ground truth. There are several possible explanations why the results obtained from this experiment do not agree with those from Section 8.3. First of all, the estimation of influence probabilities (see Equation 9) might be inappropriate; for instance, the influence probabilities are small due to the normalization by the in-degree. This is particularly noticeable with regard to the dense RG, in which the simulations yield almost identical results for all classifiers and their baselines, whereas on the less dense IG small differences in the performances of the classifiers and their baselines are observable. Related to this issue is also the ranking of the users according to the classifiers' confidence: perhaps the top-ranked influential users according to a classifier are in fact non-influential, so that a different ranking scheme would yield results in accordance with the classifiers' quality in terms of ROC curves.
Another potential reason for the deviating results compared to the classifiers' ROC curves could be the noisy labels obtained from SNAnnotator. One last problem could also involve LT and WC themselves, as they might not be able to capture the process of influence spread appropriately. But due to the several possible explanations for the deviating results and the limited time frame, it is beyond the scope of this thesis to analyze all potential sources of error systematically. Hence, it is still unclear whether influence spread models are capable of evaluating algorithms for the detection of influentials in the absence of ground truth data. The previous experiments have not addressed the identification of characteristic properties of influentials and users writing influential tweets; this is dealt with in the following experiment.

DirectClassifier:
Rank  RG                      IG
1     Edge betweenness        Influence
2     Influence               In-degree
3     # Tweets                # Edges from influentials
4     Eigenvector centrality  # Interactions
5     # Interactions          TFF
6     TFF

MaxClassifier:
Rank  RG                      IG
1     Influence               In-degree
2     Edge betweenness        Influence
3     # Tweets                Edges from influentials
4     Closeness               # Interactions
5     # Interactions          TFF
6     TFF

Table 7: Subsets of most discriminatory attributes for DirectClassifier (top) and MaxClassifier (bottom), ordered with respect to their χ2 score.

8.5. Most Predictive Attributes of Influence
The fourth and last experiment aims to identify the most predictive attributes of influential users and users writing influential tweets. Table 7 lists the best attribute subsets for each graph according to CFS and χ2. Figure 11 (with MaxLabel) and Figure 12 (with DirectLabel) show the quality of the classifiers trained on the reduced subsets from Table 7 compared with the respective classifiers trained on the full attribute set. The plots indicate the meaningfulness of the discovered subsets, although it seems easier to select a subset of attributes on RG (Figures 11(b) and 12(b)) than on IG (Figures 11(a) and 12(a)). Again, as already explained in Section 8.3, the absence of some influentials in IG as opposed to RG is assumed to be the reason for this observation. Yet, the quality of the resulting classifiers built on the reduced attribute sets indicates their usefulness: by incorporating only five attributes on IG or six attributes on RG, respectively, it is possible to identify almost the same influentials as with the four proposed classifiers. When looking at the most predictive attributes in Table 7, it turns out that Influence, # Interactions and TFF are utilized across all four classifiers. Furthermore, when comparing DirectClassifierRG with MaxClassifierRG and DirectClassifierIG with MaxClassifierIG, the selected attributes are identical and only their rankings differ slightly. Most notably, Influence ranks highest across all four learners on average, which suggests that averaged tweet labels do correlate with influential users to some extent. But this relationship is weak and not statistically significant, as explained in Section 8.1. Motivated by the observation that a smaller subset of attributes suffices to identify most influentials, one could ask: what happens if no tweet labels (and therefore no Influence attribute) existed? Since it is time-consuming to obtain a second set of labels from annotators, it is interesting to investigate this question as well. Because MaxClassifierRG and MaxClassifierIG cannot be built without the existence of tweet labels, both classifiers are neglected regarding this question. Therefore, DirectClassifierRG and DirectClassifierIG are built using InfluenceLearner, but now Influence is excluded from the set of attributes in the first place. The most predictive attributes are then determined as above and are shown in Table 8. In IG, # Tweets replaces Influence and the remaining attributes correspond to those from Table 7; likewise, Influence is replaced by In-degree in RG.
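The χ2 ranking of attributes can be sketched as follows for a single discretized attribute. This is a textbook contingency-table χ2 statistic; the implementation used by Weka may differ in details:

```python
from collections import Counter

def chi2_score(attribute_values, labels):
    """Chi-square statistic between a discretized attribute and the
    binary user label; larger scores indicate stronger dependence and
    hence a higher rank of the attribute."""
    n = len(labels)
    observed = Counter(zip(attribute_values, labels))
    attr_totals = Counter(attribute_values)
    label_totals = Counter(labels)
    score = 0.0
    for a in attr_totals:
        for l in label_totals:
            # expected cell count under independence of attribute and label
            expected = attr_totals[a] * label_totals[l] / n
            score += (observed[(a, l)] - expected) ** 2 / expected
    return score
```

Ranking the attributes by this score in descending order yields orderings of the kind reported in Table 7 and Table 8.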
When looking at the ROC curves of these classifiers in Figure 13, it turns out that their quality is comparable with that of the respective classifiers built with the reduced set of attributes. This observation also supports the assumption that the correlation between tweet and user labels is weak. In fact, if any single attribute of Table 7 is removed from the classifiers, their quality does not change. Both observations suggest that all attributes in Table 2 exhibit a low correlation with user and tweet labels. However, as soon as a subset of those attributes is removed, the ROC curves deteriorate. This indicates that the differences between influentials and non-influentials are very subtle, and therefore the attributes for identifying influentials should be manifold. For example, in the collected dataset the top-ranked attributes are related to at least three different assumptions. Hence, it seems more promising to add further assumptions about influential users than to look for more attributes related to the proposed assumptions. For instance, influential users express their positive or negative opinions about a subject in tweets, so sentiment could be added as a fifth assumption together with appropriate attributes, especially since its correlation with influentials is already known [BCM+ 12]. This experiment concludes this chapter; the thesis is summarized in the following one.

Figure 11: ROC curves of final classifiers trained on the reduced attribute set ("Reduced") compared with those trained on the full attribute set ("Full") and the random baseline using the full attribute set. All classifiers use MaxLabel. Fallout = 1 − Specificity.

Figure 12: ROC curves of final classifiers trained on the reduced attribute set ("Reduced") compared with those trained on the full attribute set ("Full") and the random baseline using the full attribute set. All classifiers use DirectLabel. Fallout = 1 − Specificity.

Rank  RG                      IG
1     Edge betweenness        In-degree
2     In-degree               Edges from influentials
3     # Tweets                # Interactions
4     Eigenvector centrality  # Tweets
5     # Interactions          TFF
6     TFF

Table 8: Subsets of most discriminatory attributes when Influence is removed from DirectClassifier, ordered with respect to their χ2 score.

Figure 13: ROC curves of final classifiers trained on the reduced attribute set discarding Influence ("Reduced + No Influence") compared with those trained on the full attribute set ("Full"), the random baseline using the full attribute set, and classifiers trained on the reduced attribute set ("Reduced"). Fallout = 1 − Specificity.

9. Conclusions
First, the main ideas of this thesis, including the important findings and contributions, are summed up. The next section deals critically with those findings. The chapter concludes with a discussion of improvements to the proposed method for identifying influential users.

9.1. Contributions & Findings
This section summarizes the study and briefly presents its main findings and contributions. This thesis aimed to identify the characteristic attributes of influential users, and influential users themselves as a byproduct, by means of a supervised learning approach. In particular, this study utilized attributes related to community structure, which had been neglected in most studies, besides well-known attributes with respect to activity, centrality and quality. For this purpose, a Twitter dataset containing English tweets and the respective authors was collected with respect to the topic "Amazon Kindle". All tweets and users were assigned labels in terms of influence by annotators with the help of a developed annotation tool. Based on this dataset, two different graphs were constructed, from which 16 attributes were extracted. Due to the user and tweet labels in the dataset, two types of users could be distinguished, influential users and users writing influential tweets, which served as a ground truth for learning the classifiers. This thesis has the following findings. The built classifiers identifying both types of users outperform the respective baselines in terms of ROC curves. When identifying the characteristic properties of influential users and users writing influential tweets, five to six attributes suffice to learn classifiers that achieve about the same quality in terms of ROC curves as those built using all attributes. That means five to six attributes characterize influential users and users writing influential tweets appropriately. The distinction between influential users and users posting influential tweets is necessary, for influential users usually do not post influential tweets and no statistically significant correlation could be measured between influential users and influential tweets. One potential explanation is that such users turned influential due to external factors, e.g. being celebrities. Another important result is that attributes related to community structure improve the classifiers' overall quality. The last finding is that Weighted Cascade Model and Linear Threshold Model make unrealistic assumptions about the influence spread in the network, which is why they are unable to reflect the classifiers' quality in terms of ROC curves. This result needs to be investigated further, however, as there are many potential explanations why both models failed to simulate influence spread consistently with the ground truth. In total, this thesis makes three contributions. First of all, the proposed workflow for identifying influential users and their characteristic properties is independent of the topic. This allows applying the workflow to arbitrary topics, which in turn permits collecting data with respect to different topics and comparing the results with those from this thesis. Second, the developed annotation tool may be reused for assigning labels to tweets and users. Last but not least, the established ground truth dataset regarding "Amazon Kindle" serves as a benchmark for future work. Nonetheless, the outcomes of this thesis need to be verified on different Twitter datasets; until then, they must be treated with caution. This problem is discussed in more detail in the next section.

9.2. Conclusion This section describes the limitations and the relevancy of the results obtained from this study. This work proposed a workflow for identifying the characteristic properties of influential users and users posting influential tweets which is independent of the Twitter dataset. The results suggest that a few attributes related to different assumptions suffice for this purpose. But one important question, that naturally arises from the fact that the problem of identifying influential entities is important in many fields, is: can the results from this thesis be directly transferred to other Twitter datasets, social networks or even other fields like biology for preventing diseases to spread? It is impossible to answer this question without further research. For instance, at the moment it is even unclear whether the identified characteristic properties



would hold for different Twitter datasets, as the interaction and relation graphs exhibit an exceedingly high reciprocity compared to studies that had access to the near-complete Twitter graph, and it is unknown to what extent this observation biased the results of this thesis. For other social networks, certain steps of the workflow might have to be adjusted; in particular, the extraction of attributes would have to be tailored to the specific network, because not all attributes can be extracted in the same manner across different networks. Moreover, neither the relation nor the interaction graph is scale-free, which might affect the results when the workflow is applied to other social networks, which usually are scale-free. In the long run, however, the results of this thesis might even prove transferable to other fields, at least to some extent. The Twitter dataset collected in this thesis allows an analysis of open questions in the realm of detecting influential users that require a ground truth, e.g. examining how realistic the underlying assumptions of influence spread models are [Bon11]. Although a first step toward such an analysis was made in this thesis, more systematic experiments are required to draw conclusions. The metadata additionally stored during the annotation process is helpful for this purpose. More concrete topics for future research are proposed in the following section.
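The scale-free property mentioned above can be probed by inspecting the degree distribution: scale-free networks have many low-degree nodes and a few hubs. A minimal sketch, assuming NetworkX and a synthetic Barabási-Albert graph as a stand-in for RG/IG (for a rigorous check, logarithmic binning as in [Mil10] and a proper power-law fit would be needed):

```python
import collections

import networkx as nx

# Synthetic stand-in for the relation/interaction graph (assumption).
G = nx.barabasi_albert_graph(1000, 3, seed=42)

# Count how many nodes have each degree.
degree_counts = collections.Counter(d for _, d in G.degree())

# A heavy tail shows up as many low-degree nodes and few high-degree hubs.
bulk = sum(c for d, c in degree_counts.items() if d <= 5)
tail = sum(c for d, c in degree_counts.items() if d >= 30)
print(f"nodes with degree <= 5: {bulk}, nodes with degree >= 30: {tail}")
```

The same counts, computed on RG and IG instead of the synthetic graph, would make the non-scale-free claim directly inspectable.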

9.3. Future Work
This section suggests potential improvements to the proposed method of detecting influential users and users writing influential tweets, together with their respective characteristic properties. Community structure, among others, is assumed to help detect influential users; all extracted attributes associated with community structure therefore depend on the chosen community detection algorithm. In this thesis, the Louvain method, which is based on the optimization of directed modularity, was applied to the graphs. In [KSJ09] the authors reported that directed modularity as defined in [LN08] is unable to distinguish the direction of edges in certain cases. Therefore, it is worth analyzing whether the results of this thesis would change when applying a different community detection algorithm that does not suffer from this


shortcoming. Potential candidates are [RB11] and [RAK07]. The former uses a scheme similar to the Louvain method but optimizes a measure other than modularity, while the latter uses label propagation, meaning a node is assigned to the community most of its neighbors belong to. Since the collected dataset is assumed to be biased due to the high reciprocity of the graphs, the proposed workflow would have to be applied to more datasets. Specifically, it would first have to be applied to a different Twitter dataset. If the results agree with this thesis, the workflow could then be transferred to different scale-free social networks, in order to investigate whether the results there deviate from this thesis, since neither the interaction nor the relation graph is scale-free. Only then can a reliable statement be made about the generalizability of the proposed workflow. Although the classifiers are able to identify most influential users and users writing influential tweets, their quality can still be improved, particularly by introducing new assumptions with corresponding attributes rather than adding further attributes related to the four established assumptions of centrality, community, quality, and activity. Hence, attributes associated with sentiment [BCM+12], linguistic attributes [QECC11], and tweet/user credibility could be added. If individuals express strong opinions regarding a product, this triggers emotions in readers and therefore attracts more interest. Influential users also tend to structure their messages in a characteristic way, and trustworthiness is essential for readers to trust the information posted by an individual. Another major challenge is the reduction of noise across the user and tweet labels, which is essential for building reliable classifiers. Thus, it is worth testing active learning methods to obtain a ground truth dataset with high reliability.
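Returning to the community-detection candidates above, the label-propagation idea of [RAK07] can be sketched with the NetworkX implementation on an example graph; using the Karate Club graph as a stand-in, and treating the graphs as undirected, are assumptions of this sketch:

```python
import networkx as nx
from networkx.algorithms.community import label_propagation_communities

# Stand-in for the (undirected projections of the) relation/interaction graph.
G = nx.karate_club_graph()

# Each node iteratively adopts the label most common among its neighbors.
communities = list(label_propagation_communities(G))

for i, c in enumerate(communities):
    print(f"community {i}: {len(c)} nodes")
```

The resulting community sets could then replace the Louvain partition when re-extracting the community-related attributes.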
One simple active-learning-style scheme is to use only those user and tweet labels on which all annotators agree; otherwise, the tweet (user) is assigned the label of the most similar tweet (user). A more sophisticated method, tailored to a scenario in which multiple experts with unknown backgrounds assign labels, is described by Donmez et al. in [DCS09]. This can be combined with the knowledge acquired from analyzing the annotators’ rating behavior over time, thanks to the metadata additionally collected by the annotation tool. This thesis found that there are many possible reasons why the Weighted Cascade Model and the Linear Threshold Model yield different results for assessing


the quality of the classifiers and their baselines compared with their ROC curves. Hence, it is planned to analyze these potential explanations so that the cause of the difference can be identified. This involves, first of all, testing more sophisticated models [SKOM10, GBL10, LPLS13]. A second aspect to be investigated for influence spread models is the derivation of classifier-dependent influence probabilities with which users influence their neighbors. More precisely, the learning method proposed in [GBL10] would be applied to determine these probabilities based on the metadata of the collected dataset.
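For reference, the Linear Threshold Model discussed throughout can be sketched in a few lines: a node activates once the summed weights of its active in-neighbors reach its threshold. The graph, edge weights, and thresholds below are purely illustrative, not taken from the thesis data:

```python
import networkx as nx

# Toy influence network; edge weight = influence of source on target.
G = nx.DiGraph()
G.add_weighted_edges_from([
    ("a", "b", 0.6), ("a", "c", 0.3), ("b", "c", 0.4), ("c", "d", 0.8),
])
thresholds = {"a": 0.1, "b": 0.5, "c": 0.6, "d": 0.7}


def linear_threshold(G, thresholds, seeds):
    """Return the set of nodes active at the model's fixed point."""
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for node in G:
            if node in active:
                continue
            weight_in = sum(
                G[u][node]["weight"]
                for u in G.predecessors(node)
                if u in active
            )
            if weight_in >= thresholds[node]:
                active.add(node)
                changed = True
    return active


print(linear_threshold(G, thresholds, {"a"}))
```

Because activation is monotone, the loop always converges to the same fixed point regardless of iteration order, which makes the model convenient as a deterministic baseline.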


A. Annotation Protocol

Tutorial for Annotation Tool

1 Description of the Dataset
The dataset you are going to annotate contains data from Twitter, a social network where people can post short messages called “tweets” of up to 140 characters. My dataset encompasses about 37k tweets from approximately 4.7k users. All tweets were composed regarding the topic “Amazon Kindle”. Kindle is a series of e-book readers. These devices enable users to download, browse, and read e-books, newspapers, and other digital media via wireless networking. A screenshot of a Kindle is depicted below in Fig. 1.

2 Big Picture
The goal of my master’s thesis is to predict influential users in my dataset. Therefore, I need to know which users are truly influential. This has to be determined by humans, as the notion of “influence” is highly subjective and you have to define this term for yourselves - I am not going to restrict you. Then I am able to compare the ratings


of my algorithm with yours, so that I can conclude whether my algorithm is useful or not. Since each user and tweet will be rated three times by different annotators, the final rating is the median of the three values (order the ratings ascendingly/descendingly and choose the rating in the middle). For instance, if two annotators labeled an item as “influential” and one as “noninfluential”, the final label of the item would be “influential”, as ordering the ratings would yield “noninfluential”, “influential”, “influential”.
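For the technically inclined reader, the median rule above can be sketched in a few lines of Python; the ordering of the scale (with “Strange” kept in the tool’s position) and the helper name final_label are illustrative assumptions:

```python
# Scale as shown in the annotation tool, ordered from least to most
# influential ("Strange" kept in the tool's position; an assumption).
SCALE = ["Noninfluential", "Probably Noninfluential", "Strange",
         "Probably Influential", "Influential"]


def final_label(ratings):
    """Median of an odd number of ordinal ratings."""
    ordered = sorted(ratings, key=SCALE.index)
    return ordered[len(ordered) // 2]  # middle element = median


# The example from the text above:
print(final_label(["Noninfluential", "Influential", "Influential"]))
# prints "Influential"
```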

3 Task
Your task is now to rate the influence of each single tweet and user. This will be done with the help of my annotation tool, which displays all tweets and users sequentially (you have to rate each one to proceed to the next). More details regarding the annotation tool can be found in the document “Annotation Tool Tutorial” I sent you as a separate file.

4 What does “influential” mean?
As stated above, this is highly subjective, so you have to rely on your intuition. I will not restrict your personal views by any means. Moreover, you need not share your definition with me. I would only like you to think about potential indicators of influence before you start annotating, as a tweet is very short. If you are confused and have no idea how you could recognize influential tweets or users, then contact me and I will provide you with potential examples. But I will only give them on demand, as I do not want to bias you in any way. Check out the question “Examples of Tweets” for inspiration.

5 How do I use the Annotation Tool?
Check out the tutorial video (for a quick start) and the separate document “Annotation Tool Tutorial”.

6 How many users and tweets do I have to label and how do I receive the dataset?
As mentioned before, the dataset has to be labeled thrice. In total, a dataset containing roughly 126k items (users and tweets) has to be rated. Since the number of annotators will change over time, I cannot give you a fixed number of items you have to label; it will be about 10-12k. Hence, annotating will take quite some time. I am planning on roughly 10 weeks for the process to be completed. At first, I would send you a chunk of the dataset that you should label within one week. Its size was computed so that the annotation process would finish in exactly 10 weeks. Thus, it can


happen that the size changes from chunk to chunk, depending on how many annotators are currently available, how many tweets and users are already annotated, and so on. That is why I would like you to try to finish annotating a chunk on time. However, you are free to finish your chunk earlier, of course :) After sending me your results, I would provide you with a new chunk. The main reason for not sending you a larger part of the dataset is that I need to keep an overview of the annotation progress. Additionally, I do not want to bother you all the time by asking for updates on your progress. So, in short, the process goes like this:
1. I send you a chunk of the dataset. Ideally, you would label it within 7 days. But it can take more or (even better) less time, of course.
2. Once you have finished, send the annotated data back to me (please check out “Annotation Tool Tutorial” for details). Now you can delete the entire dataset from your PC if you wish.
3. Then I send you a new chunk to be annotated. You annotate it, send it back to me, and so on.

7 Description of the Annotation Procedure
You can annotate the data at any time and at your preferred place with the annotation tool. You can also exit the program whenever you want to stop. The results will be saved automatically, and next time you start the annotation tool, you continue from where you stopped last time. While annotating, it is crucial that you stick to the instructions described hereafter, because otherwise I cannot obtain meaningful results.
1. Start the annotation tool.
2. Rate a tweet’s/user’s influence by choosing the appropriate value from the scale (“Influential”, “Probably Influential”, “Strange”, “Probably Noninfluential”, “Noninfluential”).
a) When annotating a tweet, feel free to check the tweet online by clicking on the corresponding link. This is particularly recommended if you feel you need the context (e.g. when the tweet is taken from a conversation). If you want to check a website linked from the tweet, please click the corresponding link within my annotation tool and NOT on Twitter. This is important for me because I save these times in my tool as well, so that this dataset can be used for evaluating different open research questions in the future.
b) Before rating a user’s influence, go to the user’s profile (simply click on the user’s name in the annotation tool) and briefly check the information you consider important for making your decision.
c) Use “Strange” only in the following cases:


• A language other than English was used for composing the tweet.

• The tweet is incomprehensible due to severe grammatical errors/missing words, e.g. “what #kindle hamburger no” or “bla”, even after you checked for more context on Twitter by clicking on the respective tweet link.
• The user profile does not exist when you click on the link in my annotation tool, or all her other tweets are written in a language other than English.
Otherwise assign a different label.
3. Note that I also save how much time you require for rating a user/tweet. You can take as much time as you want - but I want to use this information in my evaluation.
4. Send me the annotated dataset.

8 Hmm, I have assigned the same label to many tweets and users - what am I doing wrong?
In short: nothing. In fact, it is very likely that the dataset contains substantially more tweets and users that are not influential. However, this is only my expectation; if you rated more tweets and users as influential to some degree, this can also be fine. Simply trust your intuition and your own considerations of what characterizes influential tweets and users. Never rate an item with a certain value just because you have not used that value for a while.

9 Examples of Tweets
Here I would like to give you a first idea of what an “Influential”, “Probably Influential”, “Probably Noninfluential”, and “Noninfluential” tweet MIGHT look like. Notice that this does not have to coincide with your perception; it is my personal view :)
• Influential Message: A master class in customer service from Lego. Boy writes to Lego after losing a mini-figure. Here’s their reply... pic.twitter.com/ldoMH29j
• Probably Influential Message: Febreeze just makes my bathroom smell like I took a shit in Hawaii.
• Probably Noninfluential Message: @MikeMcFarlandVA IT WILL BE AFTER SOME COLA! :D #COLA
• Noninfluential Message: Officially getting a #MAC, pumped!


I will leave the question of examples of influential users open, as I’m confident you will find an intuitive way to decide on rating users for yourselves.

10 Ok, in short, what do I have to do again?
1. To evaluate whether my algorithm is capable of detecting influentials in my collected Twitter dataset, I first need to know who is actually influential. This task has to be solved by the annotators.
2. Answer for yourself the following two questions prior to annotation: what are indicators of influential tweets? What are signs of influential users?
3. Start the annotation tool and rate tweets and users. Only use the label “Strange” in case you do not understand the language, the tweet is incomprehensible, the user profile does not exist, or all remaining tweets on that page are written in a different language. Before rating a user, visit her Twitter profile. Check it and take as much time as you need to rate her properly. For tweets, you can have a look at the tweet online in case you need more context for rating it. But please click on links in my annotation tool. Results will be saved automatically.
4. Once you have finished the annotation process, send me your results. Now you can delete the dataset on your PC (if you want to). Then I will send you a new chunk of the dataset.

Thank you very much for your help - I really appreciate it :)


B. Annotation Tutorial


How do I annotate the given dataset?

1 Introduction
Dear annotator, welcome to the tutorial of my annotation tool! And thank you for taking your job seriously and taking the time to read this tutorial carefully. I hope this document covers all important questions. If not, I refer you to question 10.

2 How do I load the dataset?
By now you should have received a file containing the dataset in a compressed version. Uncompress this file on your computer at any place you want. When starting the annotation tool for the first time, please choose the root folder of the uncompressed dataset. If you double-click on the folder, you should see, amongst others, two different subfolders, namely raw and annotated. For instance, if you have saved the uncompressed version of your dataset (called data) under C:/users/my-name/Downloads/data/, then you would have to choose the directory C:/users/my-name/Downloads/data/ when the annotation tool asks you for the path to the dataset. From then on, the dataset will be automatically loaded from the specified path each time you start the annotation tool. Hence it is important not to move the dataset to a different place on your system. If you can’t avoid this, you have to delete the file “AnnotationTool.config”, which is located in the same folder as AnnotationTool.exe, and then move the dataset to its new destination. Restarting the tool should prompt you for the path to the dataset again. If you are unsure what to do, don’t hesitate to contact me.

3 How can I save my results?
You don’t have to worry about this, as everything will be saved ”automagically” under /annotated/. That’s why you can exit the application at any time you like, and next time you start the application you can continue from where you stopped in your last session.


4 How do I send you my results?
All your ratings will be saved under /annotated/. Therefore, it suffices to send me only the annotated directory. So you can create a compressed version of this subdirectory and send it to me via email, Skype, or whatever you prefer. For instance, if your uncompressed dataset is stored under C:/users/my-name/Downloads/data/, then you would only have to send me C:/users/my-name/Downloads/data/annotated/, preferably as a compressed version. If you have installed the program “WinRAR”, then simply right-click on the annotated subfolder, choose “Add to rar/zip” and confirm your choice. Now you are able to send me the compressed folder.

5 How do I label a tweet?
Important: you have to be connected to the internet to rate tweets properly. Otherwise you’ll be unable to examine a tweet online more closely.
1. Read the tweet carefully. In case you need more context (e.g. the tweet is part of a conversation), click on “Tweet Details” to visit the tweet online and examine the conversation more closely. It may happen that when clicking “Tweet Details” the page says something like “Oops, something went wrong/The page doesn’t exist”. In this case the author of the tweet has deleted her message, and you can only base your rating on the information available in the annotation tool.
2. If you want to visit a URL from the tweet, please click on it in the annotation tool and not online, because I need to know which links you clicked and for how long. This is only possible if you click on the links in my tool and not in the browser. These data are important for future research questions. And no, I won’t share these data with the NSA ;)
3. On the right side, choose the label that, in your opinion, describes the influence best and click it.
4. Now press “Annotate” at the bottom, which saves your rating.
This procedure is illustrated in Figure 1.

6 How do I label a user?
Important: you have to be connected to the internet to rate users. Otherwise it is impossible, as you have to browse a user’s tweets and check out her profile.
1. Click on the username to visit her Twitter profile. Browse her tweets and visit links on the profile page you regard as interesting (this includes personal web blogs as well) until you have come to a conclusion regarding the influence of this user.


Figure 1: Key Functionality for Annotating Tweets
2. On the right side, choose the label that, in your opinion, describes the user’s influence best and click it.
3. Now press “Annotate” at the bottom, which saves your label.
This procedure is illustrated in Figure 2.

7 How do I know whether the current item I have to rate is a user or a tweet?
As you have noticed, labeling a user and labeling a tweet are very similar. You can easily see that you are annotating a user by either checking the question at the very top, or by the message in red kindly asking you to click on the user profile.

8 I need to change the label of a previously labeled tweet/user. How can I revise my choice?
Notice that you can only change ratings as long as you haven’t closed the program. Otherwise it is impossible to revisit the respective item.


Figure 2: Key Functionality for Annotating Users
• You can return to the previously labeled item (= user or tweet) by clicking “Previous item”. A warning will be displayed that this item is already labeled.
• Furthermore, you can see which label you assigned earlier to this item, namely the label which is currently selected. By choosing a different label and clicking “Annotate” you change this item’s rating. If you don’t press “Annotate”, your changes won’t be saved.
• Now the item that was labeled before this (previously labeled) item will be automatically loaded and displayed.
• Once you have changed the desired label, you can jump to the next item to be annotated by clicking on “Current item”.

Notice the warning displayed above the tweet content indicating whether the item was labeled previously or not. You can jump to the next item which isn’t annotated yet by pressing “Current item”. Then the warning will be gone and “The current item isn’t labeled yet” will be displayed instead, indicating that you can continue as usual from then on.


Figure 3: Annotation Process Completed

9 How do I know when all items are annotated?
I don’t want to spoil it, but when you have completed annotating all tweets and users, the annotation tool would ideally display instructions on how you can receive your well-earned 1,000,000 € reward :) Unfortunately I don’t have that much money, so the annotation tool will instead display a “Thank you” message and all buttons will be disabled, so that you can’t press any. This is depicted below.

10 I have another question...
Ok, then you should definitely contact me. Perhaps I forgot to deal with an important aspect, and then I would also update the other annotators about the solution to your question. My email address is [email protected]. You can also contact me directly on Skype (stefan-raebiger).


Glossary


DG Represents a directed, unweighted graph like RG or IG.
DirectLabel This label is used as ground truth for each user of the collected dataset. It is defined as the final user label, derived as the median of the three user labels that the annotators assigned per user.
IG Interaction graph, which is directed and unweighted. An edge from user A to user B exists only if A interacted with B. An interaction is defined as replying to or retweeting a tweet of B, or mentioning B.
MaxLabel Similarly, this label is used as ground truth for each user of the collected dataset. It is defined as the maximum final tweet label among the user’s tweets. The final tweet label is also derived as the median of the three labels that the annotators assigned to this tweet.
RG Relation graph, which is directed and unweighted. An edge from user A to user B exists only if A follows B.
SNAnnotator Is responsible for collecting, labeling and deriving a ground truth dataset.
InfluenceLearner Is responsible for transforming the ground truth dataset into graphs, extracting attributes from them and learning appropriate classifiers.
DirectClassifier Refers to all classifiers using DirectLabel as ground truth.
DirectClassifierIG Refers to a classifier that is trained using DirectLabel on IG.
DirectClassifierRG Refers to a classifier that is trained using DirectLabel on RG.
MaxClassifier Refers to all classifiers using MaxLabel as ground truth.
MaxClassifierIG Refers to a classifier that is trained using MaxLabel on IG.
MaxClassifierRG Refers to a classifier that is trained using MaxLabel on RG.


InfluenceSimulator Evaluates the performance of the classifiers with respect to the underlying influence spread models, which can be either the Weighted Cascade or the Linear Threshold Model.
Influentials The term refers to any influential entity in the given context. Since this thesis focuses on viral marketing, the term refers to influential users.
LT Linear Threshold Model for simulating the influence spread in a network.
WC Weighted Cascade Model for simulating the influence spread in a network.


References

[ABL10] Ahn, Yong-Yeol; Bagrow, James P.; Lehmann, Sune: Link communities reveal multiscale complexity in networks. In: Nature 466 (2010), No. 7307, pp. 761–764

[AG05] Adamic, Lada A.; Glance, Natalie: The political blogosphere and the 2004 US election: divided they blog. In: Proceedings of the 3rd International Workshop on Link Discovery, ACM, 2005, pp. 36–43

[Bar12] Barabási, Albert-László: Network Science. Free online book. http://barabasilab.neu.edu/networksciencebook/. Version: 2012

[BBM13] Barbieri, Nicola; Bonchi, Francesco; Manco, Giuseppe: Cascade-based community detection. In: Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, ACM, 2013, pp. 33–42

[BCM+12] Bigonha, Carolina; Cardoso, Thiago N.; Moro, Mirella M.; Gonçalves, Marcos A.; Almeida, Virgílio A. F.: Sentiment-based influence detection on Twitter. In: Journal of the Brazilian Computer Society (2012), pp. 1–15

[BGLL08] Blondel, V. D.; Guillaume, J. L.; Lambiotte, R.; Lefebvre, E.: Fast unfolding of communities in large networks. In: Journal of Statistical Mechanics: Theory and Experiment 2008 (2008), No. 10, P10008

[BHMW11] Bakshy, Eytan; Hofman, Jake M.; Mason, Winter A.; Watts, Duncan J.: Everyone’s an influencer: quantifying influence on twitter. In: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, ACM, 2011, pp. 65–74

[BLM+06] Boccaletti, Stefano; Latora, Vito; Moreno, Yamir; Chavez, Martin; Hwang, D.-U.: Complex networks: Structure and dynamics. In: Physics Reports 424 (2006), No. 4, pp. 175–308


[Bon11] Bonchi, Francesco: Influence Propagation in Social Networks: A Data Mining Perspective. In: IEEE Intelligent Informatics Bulletin 12 (2011), No. 1, pp. 8–16

[Bra08] Brandes, Ulrik: On variants of shortest-path betweenness centrality and their generic computation. In: Social Networks 30 (2008), No. 2, pp. 136–145

[CBHK02] Chawla, Nitesh V.; Bowyer, Kevin W.; Hall, Lawrence O.; Kegelmeyer, W. P.: SMOTE: Synthetic Minority Over-sampling Technique. In: Journal of Artificial Intelligence Research 16 (2002), pp. 321–357

[CHBG10] Cha, Meeyoung; Haddadi, Hamed; Benevenuto, Fabricio; Gummadi, P. K.: Measuring User Influence in Twitter: The Million Follower Fallacy. In: ICWSM 10 (2010), pp. 10–17

[CNM04] Clauset, A.; Newman, M. E. J.; Moore, C.: Finding community structure in very large networks. In: Physical Review E 70 (2004), No. 6, 066111

[DCS09] Donmez, Pinar; Carbonell, Jaime G.; Schneider, Jeff: Efficiently learning the accuracy of labeling sources for selective sampling. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2009, pp. 259–268

[DR01] Domingos, Pedro; Richardson, Matt: Mining the network value of customers. In: KDD ’01: Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York, NY, USA: ACM, 2001, pp. 57–66

[DRH11] Doan, Anhai; Ramakrishnan, Raghu; Halevy, Alon Y.: Crowdsourcing systems on the world-wide web. In: Communications of the ACM 54 (2011), No. 4, pp. 86–96


[EL09] Evans, T. S.; Lambiotte, R.: Line graphs, link partitions, and overlapping communities. In: Physical Review E 80 (2009), No. 1, 016105

[Faw04] Fawcett, Tom: ROC graphs: Notes and practical considerations for researchers. In: Machine Learning 31 (2004), pp. 1–38

[Fle71] Fleiss, Joseph L.: Measuring nominal scale agreement among many raters. In: Psychological Bulletin 76 (1971), No. 5, p. 378

[GBL10] Goyal, Amit; Bonchi, Francesco; Lakshmanan, Laks V.: Learning influence probabilities in social networks. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, ACM, 2010, pp. 241–250

[GKMI+10] Goldberg, M.; Kelley, S.; Magdon-Ismail, M.; Mertsalov, K.; Wallace, A.: Finding overlapping communities in social networks. In: Social Computing (SocialCom), 2010 IEEE Second International Conference on, IEEE, 2010, pp. 104–113

[GN02] Girvan, Michelle; Newman, Mark E.: Community structure in social and biological networks. In: Proceedings of the National Academy of Sciences 99 (2002), No. 12, pp. 7821–7826

[Gre09] Gregory, Steve: Finding overlapping communities using disjoint community detection algorithms. In: Complex Networks (2009), pp. 47–61

[Hal99] Hall, Mark A.: Correlation-based feature selection for machine learning. PhD thesis, The University of Waikato, 1999

[HFH+09] Hall, Mark; Frank, Eibe; Holmes, Geoffrey; Pfahringer, Bernhard; Reutemann, Peter; Witten, Ian H.: The WEKA data mining software: an update. In: ACM SIGKDD Explorations Newsletter 11 (2009), No. 1, pp. 10–18

[HRW08] Huberman, Bernardo; Romero, Daniel M.; Wu, Fang: Social networks that matter: Twitter under the microscope. In: First Monday 14 (2008), No. 1


[HSS08] Hagberg, Aric A.; Schult, Daniel A.; Swart, Pieter J.: Exploring network structure, dynamics, and function using NetworkX. In: Proceedings of the 7th Python in Science Conference (SciPy2008). Pasadena, CA, USA, August 2008, pp. 11–15

[J+04] Jamieson, Susan et al.: Likert scales: how to (ab)use them. In: Medical Education 38 (2004), No. 12, pp. 1217–1218

[K+75] Kincaid, J. P. et al.: Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy Enlisted Personnel. 1975

[KB03] Keller, Ed; Berry, Jon: The influentials: one American in ten tells the other nine how to vote, where to eat, and what to buy. NY: Simon and Schuster, 2003

[KF11] Kong, Shoubin; Feng, Ling: A tweet-centric approach for topic-specific author ranking in micro-blog. In: Advanced Data Mining and Applications. Springer, 2011, pp. 138–151

[KKT03] Kempe, David; Kleinberg, Jon; Tardos, Éva: Maximizing the spread of influence through a social network. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2003, pp. 137–146

[KLPM10] Kwak, Haewoon; Lee, Changhyun; Park, Hosung; Moon, Sue: What is Twitter, a social network or a news media? In: Proceedings of the 19th International Conference on World Wide Web, ACM, 2010, pp. 591–600

[KSJ09] Kim, Youngdo; Son, Seung-Woo; Jeong, Hawoong: LinkRank: Finding communities in directed networks. In: arXiv preprint arXiv:0902.3728 (2009)

[LLW02] Liu, Huiqing; Li, Jinyan; Wong, Limsoon: A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns. In: Genome Informatics Series (2002), pp. 51–60


[LN08]

Leicht, Elizabeth A. ; Newman, Mark E.: Community structure in directed networks. In: Physical review letters 100 (2008), Nr. 11, S. 118703 32, 71

[LPLS13]

Li, Jingxuan ; Peng, Wei ; Li, Tao ; Sun, Tong: Social Network User Influence Dynamics Prediction. In: Web Technologies and Applications. Springer, 2013, S. 310–322 7, 73

[Mil10]

´, Staˇsa: Power law distributions in information science: Milojevic Making the case for logarithmic binning. In: Journal of the American Society for Information Science and Technology 61 (2010), Nr. 12, S. 2417–2425 24

[Mou07]

Moura, Heloisa: Learning Styles: From History to Future Reasearch Implications for Distance Learning. In: Revista Brasileira de Aprendizagem Aberta ea Distˆancia 1 (2007) 42

[New03]

Newman, Mark E.: Mixing patterns in networks. In: Physical Review E 67 (2003), Nr. 2, S. 026126 25, 28

[New05]

Newman, Mark E.: A measure of betweenness centrality based on random walks. In: Social networks 27 (2005), Nr. 1, S. 39–54 35

[New08]

Newman, MEJ: The mathematics of networks. In: The new palgrave encyclopedia of economics 2 (2008) 36

[NMB05]

Nooy, Wouter de ; Mrvar, Andrej ; Batagelj, Vladimir: Exploratory social network analysis with Pajek. Bd. 27. Cambridge University Press, 2005 18

[NP03]

Newman, Mark E. ; Park, Juyong: Why social networks are different from other types of networks. In: Physical Review E 68 (2003), Nr. 3, S. 036122 28

[NR10] Nowak, Stefanie ; Rüger, Stefan: How reliable are annotations via crowdsourcing: a study about inter-annotator agreement for multilabel image annotation. In: Proceedings of the international conference on Multimedia Information Retrieval, ACM, 2010, S. 557–566 39


[OSKK05] Onnela, Jukka-Pekka ; Saramäki, Jari ; Kertész, János ; Kaski, Kimmo: Intensity and coherence of motifs in weighted complex networks. In: Physical Review E 71 (2005), Nr. 6, S. 065103 28

[PDFV05] Palla, G. ; Derényi, I. ; Farkas, I. ; Vicsek, T.: Uncovering the overlapping community structure of complex networks in nature and society. In: Nature 435 (2005), Nr. 7043, S. 814–818 9

[QECC11] Quercia, Daniele ; Ellis, Jonathan ; Capra, Licia ; Crowcroft, Jon: In the mood for being influential on Twitter. In: 2011 IEEE Third International Conference on Privacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third International Conference on Social Computing (SocialCom), IEEE, 2011, S. 307–314 72

[RAK07] Raghavan, Usha N. ; Albert, Réka ; Kumara, Soundar: Near linear time algorithm to detect community structures in large-scale networks. In: Physical Review E 76 (2007), Nr. 3, S. 036106 72

[RB11] Rosvall, Martin ; Bergstrom, Carl T.: Multilevel compression of random walks on networks reveals hierarchical organization in large integrated systems. In: PLoS ONE 6 (2011), Nr. 4, S. e18209 72

[RFI02] Ripeanu, Matei ; Foster, Ian ; Iamnitchi, Adriana: Mapping the Gnutella network: Properties of large-scale peer-to-peer systems and implications for system design. In: arXiv preprint cs/0209028 (2002) 37

[SCHT07] Song, Xiaodan ; Chi, Yun ; Hino, Koji ; Tseng, Belle L.: Information flow modeling based on diffusion rate for prediction and ranking. In: Proceedings of the 16th international conference on World Wide Web, ACM, 2007, S. 191–200 2

[SKOM10] Saito, Kazumi ; Kimura, Masahiro ; Ohara, Kouzou ; Motoda, Hiroshi: Generative Models of Information Diffusion with Asynchronous Time-delay. In: Journal of Machine Learning Research Proceedings Track 13 (2010), S. 193–208 7, 73

[Tiw11] Tiwari, Garima: Crowdsourced Evaluation for Reranked Twitter Search, University of Washington, Diss., 2011 39

[WCC09] Wang, Laung-Terng ; Chang, Yao-Wen ; Cheng, Kwang-Ting T.: Electronic design automation: synthesis, verification, and test. Morgan Kaufmann, 2009 16

[WCSX10] Wang, Yu ; Cong, Gao ; Song, Guojie ; Xie, Kunqing: Community-based greedy algorithm for mining top-k influential nodes in mobile social networks. In: Proceedings of the 16th ACM SIGKDD international conference on Knowledge Discovery and Data Mining, ACM, 2010, S. 1039–1048 10

[WD07] Watts, Duncan J. ; Dodds, Peter S.: Influentials, networks, and public opinion formation. In: Journal of Consumer Research 34 (2007), Nr. 4, S. 441–458 6

[WLJH10] Weng, Jianshu ; Lim, Ee-Peng ; Jiang, Jing ; He, Qi: TwitterRank: finding topic-sensitive influential twitterers. In: Proceedings of the third ACM international conference on Web Search and Data Mining, ACM, 2010, S. 261–270 10, 23

[WT07] Wakita, Ken ; Tsurumi, Toshiyuki: Finding community structure in mega-scale social networks [extended abstract]. In: Proceedings of the 16th international conference on World Wide Web, ACM, 2007, S. 1275–1276 9

[YL12] Yang, Jaewon ; Leskovec, Jure: Structure and overlaps of communities in networks. In: arXiv preprint arXiv:1205.6228 (2012) 9


Statement of Authenticity

I hereby certify that this thesis represents my own work, that no one has written it for me, that I have not copied the work of another person, and that all sources that I have used have been properly and clearly documented.

Otto-von-Guericke University Magdeburg, 24th March 2014

Stefan Räbiger