Tag Clouds - Springer Link


Tag Clouds

Christoph Trattner1, Denis Helic2, and Markus Strohmaier3,4
1 Knowledge Management Institute and Institute for Information Systems and Computer Media, Graz University of Technology, Graz, Austria
2 Knowledge Management Institute, Graz University of Technology, Graz, Austria
3 Computational Social Science Group, GESIS – Leibniz Institute for the Social Sciences, Cologne, Germany
4 Department of Computer Science, University of Koblenz-Landau, Koblenz, Germany

Synonyms

Keyword clouds; Tagclouds; Term clouds

Glossary

Resource Any kind of web content, e.g., documents, hyperlinks, images, or videos, that is uniquely addressable
Tag A short string, term, or word that describes an online resource and that is applied by a person
Word Cloud A visualization method that shows the top N most frequent words of a text document

Definition

A tag cloud is a visualization method that summarizes a set of tags related to a certain resource or set of resources in a visually appealing manner. Contrary to a word cloud, the tags in a tag cloud are generated by people and refer to resources through links. Usually, a tag cloud shows the top N tags of one particular online resource, a set of resources, or the resources of the whole system. A very basic and at the same time very popular approach to tag cloud calculation is an algorithm that sorts the tags alphabetically and indicates the importance of each tag by font size (see Fig. 1). However, today a large variety of tag cloud calculation algorithms exist. Some of them display tags in different colors, some of them cluster tags into categories (see for example Fig. 3) or according to their semantic meaning, while others manipulate the font and the intensity of the tags or simply display the tags as a list (Bateman et al. 2008; Kaser and Lemire 2007).
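The basic approach just described (select the top N tags, sort them alphabetically, scale font size by frequency) can be sketched in a few lines; the tag data and the pixel range below are illustrative assumptions, not taken from any particular system.

```python
from collections import Counter

def tag_cloud(tag_assignments, n=50, min_px=10, max_px=36):
    """Top-n tags, alphabetically sorted, each paired with a font size
    scaled linearly between min_px and max_px by tag frequency."""
    counts = Counter(tag_assignments)
    top = dict(counts.most_common(n))
    lo, hi = min(top.values()), max(top.values())
    span = (hi - lo) or 1                     # all counts equal -> min size
    return [(tag, min_px + (cnt - lo) * (max_px - min_px) // span)
            for tag, cnt in sorted(top.items())]

# Hypothetical tag applications, one entry per (person, resource, tag) triple
tags = ["photo"] * 9 + ["travel"] * 5 + ["sunset"] * 3 + ["cat"]
print(tag_cloud(tags, n=3))
```

A renderer would then emit each (tag, size) pair as a link whose font size is the computed pixel value.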

Historical Background

One of the first well-known uses of tag clouds in an online information system was in the photo-sharing system Flickr, which integrated tag clouds as a method to visualize the tags of images on a large scale in 2004 (Tagclouds 2012). Tag clouds were also popularized around the same time by the online bookmarking site Delicious and Technorati

R. Alhajj, J. Rokne (eds.), Encyclopedia of Social Network Analysis and Mining, DOI 10.1007/978-1-4614-6170-8, © Springer Science+Business Media New York 2014


Tag Clouds, Fig. 1 An example of a tag cloud – in this case Amazon’s global tag cloud – showing the most popular tags of the system sorted by alphabet and boosted in font size according to their importance

Tag Clouds, Fig. 2 An example of how tag clouds are used in Flickr as a navigational tool to browse from one resource to another

(Tagclouds 2012). Today, many popular online platforms utilize tag clouds on a global or local basis to visualize the tags either of the whole system or of a particular resource and to support the users in their information-seeking process.

Key Applications

In today's online information systems, a key application of tag clouds is content summarization of one particular resource or of multiple resources, serving as a visually appealing tool to support the user in her information-seeking process. If used as a tool for navigation, a tag in the tag cloud refers to a list of resources that is usually sorted by date, alphabet, or similarity (Helic et al. 2011). Prominent examples of online systems utilizing tag clouds as a navigational tool include LastFM, Delicious, and Flickr (see Fig. 2). An application utilizing tag clouds for search result summarization is the Yahoo! Tag Explorer (TagExplorer 2012), which provides the user with the possibility to search Flickr photos with a so-called faceted tag cloud (see Fig. 3).


Tag Clouds, Fig. 3 Yahoo! Tag Explorer as an example of an online system utilizing so-called faceted tag clouds for search result summarization

Usefulness of Tag Clouds

Due to their visual appeal, tag clouds have gained tremendously in popularity over the past few years, mostly serving as a tool for better information access in information systems. Interestingly, while most of the research on tag clouds was devoted to the development of better visualization algorithms, the usefulness of tag clouds for "efficient" information access remained unexplored for a long time (Trattner et al. 2012; Venetis et al. 2011). One of the early research papers evaluating the usefulness of tag clouds for information access was a study by Halvey and Keane in 2007. In their work (Halvey and Keane 2007), they performed a user study with 62 users to compare 6 different and popular tag cloud calculation algorithms against an alphabetically sorted list. For evaluation, they used a selection task where users had to find a randomly chosen item. They found that tag clouds providing both alphabetization and font-size cues aid users in selecting items more easily and quickly than other approaches. Another important work investigating the usefulness of tag clouds for search result summarization was a study by Kuo et al. In their work (Kuo et al. 2007), they analyzed the utility of tag clouds for the summarization of search results from queries over a biomedical literature database. A user study showed that "the tag cloud interface is advantageous in presenting descriptive information and in reducing user frustration" compared to a standard layout.


However, Kuo et al. also observed that "it is less effective at the task of enabling users to discover relations between concepts." Another work in this context, and the first study investigating tag clouds for the task of navigation, is a study by Helic et al. In their work (Helic et al. 2011), the authors modeled tag clouds as a directed bipartite network and showed on a network-theoretic level that tag clouds spawn networks which are in general efficiently navigable. However, taking user interface decisions such as "pagination" combined with reverse-chronological listing of resources into account, the authors demonstrated that tag clouds are significantly impaired in their potential as a useful tool for navigation.

Future Directions

Although tag clouds are widely used today and research has shown their usefulness, for instance, for search result summarization or selection tasks, research in this area is still inconclusive. In particular, studies on the cognitive or navigational aspects of tag clouds are at an early stage. While recent work (Trattner 2012) shows that it is possible to produce efficiently navigable tag clouds from a network-theoretic perspective, it remains unclear to what extent tag clouds aid users in cognitive processing and/or navigation of information.

Acknowledgments

This work is supported in part by funding from the BMVIT – the Federal Ministry for Transport, Innovation and Technology (grant no. 829590); the FWF Austrian Science Fund grant I677; and the Know-Center, Graz.

Cross-References

▸ Analysis and Mining of Tags, (Micro)Blogs, and Virtual Communities
▸ Folksonomies


References

Bateman S, Gutwin C, Nacenta M (2008) Seeing things in the clouds: the effect of visual features on tag cloud selections. In: Proceedings of the 19th ACM conference on hypertext and hypermedia, HT'08. ACM, New York, pp 193–202. http://dx.doi.org/10.1145/1379092.1379130
Halvey MJ, Keane MT (2007) An assessment of tag presentation techniques. In: Proceedings of the 16th international conference on world wide web, WWW'07. ACM, New York, pp 1313–1314. http://dx.doi.org/10.1145/1242572.1242826
Helic D, Trattner C, Strohmaier M, Andrews K (2011) Are tag clouds useful for navigation? A network-theoretic analysis. Int J Soc Comput Cyber-Phys Syst 1(1):33–55. http://dx.doi.org/10.1504/IJSCCPS.2011.043601
Kaser O, Lemire D (2007) Tag-cloud drawing: algorithms for cloud visualization. In: Proceedings of tagging and metadata for social information organization (WWW 2007). http://arxiv.org/abs/cs/0703109
Kuo BYL, Hentrich T, Good BM, Wilkinson MD (2007) Tag clouds for summarizing web search results. In: Proceedings of the 16th international conference on world wide web, WWW'07. ACM, New York, pp 1203–1204. http://dx.doi.org/10.1145/1242572.1242766
Tagclouds (2012) Tagclouds.com. http://www.tagclouds.com. Accessed 08 Aug 2012
TagExplorer (2012) Sandbox from Yahoo! research. http://tagexplorer.sandbox.yahoo.com. Accessed 08 Aug 2012
Trattner C (2012) On the navigability of social tagging systems. Dissertation, Graz University of Technology, Graz, pp 1–240. https://online.tugraz.at/tug_online/wbAbs.showThesis?pThesisNr=46709
Trattner C, Lin Yl, Parra D, Yue Z, Real W, Brusilovsky P (2012) Evaluating tag-based information access in image collections. In: Proceedings of the 23rd ACM conference on hypertext and social media, HT'12. ACM, New York, pp 113–122. http://dx.doi.org/10.1145/2309996.2310016
Venetis P, Koutrika G, Garcia-Molina H (2011) On the selection of tags for tag clouds. In: Proceedings of the 4th ACM international conference on web search and data mining, WSDM'11. ACM, New York, pp 835–844. http://dx.doi.org/10.1145/1935826.1935855

Recommended Reading

Hearst MA, Rosner D (2008) Tag clouds: data analysis tool or social signaler? In: Proceedings of the 41st annual Hawaii international conference on system sciences, HICSS'08. IEEE Computer Society, Washington, DC, p 160. http://dx.doi.org/10.1109/HICSS.2008.422
Helic D, Körner C, Granitzer M, Strohmaier M, Trattner C (2012) Navigational efficiency of broad vs. narrow folksonomies. In: Proceedings of the 23rd ACM conference on hypertext and social media, HT'12. ACM, New York, pp 63–72. http://doi.acm.org/10.1145/2309996.2310008
Knautz K, Soubusta S, Stock WG (2010) Tag clusters as information retrieval interfaces. In: Proceedings of the 2010 43rd Hawaii international conference on system sciences, HICSS'10. IEEE Computer Society, Washington, DC, pp 1–10. http://dx.doi.org/10.1109/HICSS.2010.360
Koutrika G, Zadeh ZM, Garcia-Molina H (2009) Data clouds: summarizing keyword search results over structured data. In: Proceedings of the 12th international conference on extending database technology: advances in database technology, EDBT'09. ACM, New York, pp 391–402. http://doi.acm.org/10.1145/1516360.1516406
Rivadeneira AW, Gruen DM, Muller MJ, Millen DR (2007) Getting our head in the clouds: toward evaluation studies of tagclouds. In: Proceedings of the SIGCHI conference on human factors in computing systems, CHI'07. ACM, New York, pp 995–998. http://dx.doi.org/10.1145/1240624.1240775
Seifert C, Kump B, Kienreich W, Granitzer G, Granitzer M (2008) On the beauty and usability of tag clouds. In: 12th international conference on information visualisation, IV'08, pp 17–25. http://dx.doi.org/10.1109/IV.2008.89
Sinclair J, Cardew-Hall M (2008) The folksonomy tag cloud: when is it useful? J Inf Sci 34(1):15–29. http://dx.doi.org/10.1177/0165551506078083

Tag-Based Recommendation
▸ Recommender Systems, Semantic-Based

Tagclouds
▸ Tag Clouds

Tagging System
▸ Folksonomies

Targeted Advertising
▸ Online Privacy Paradox and Social Networks

Task Assignment
▸ Social Interaction Analysis for Team Collaboration

Taxonomies
▸ Semantic Social Networks

Team Collaboration
▸ Social Interaction Analysis for Team Collaboration

Team Formation
▸ Social Interaction Analysis for Team Collaboration

Telco Operators
▸ Social Networking in the Telecom Industry

Telecommunications Fraud
▸ Telecommunications Fraud Detection, Using Social Networks for

Telecommunications Fraud Detection, Using Social Networks for

Chris Volinsky
Statistics Research Department, AT&T Labs-Research, Florham Park, NJ, USA

Synonyms

Signatures; Social networks; Telecommunications fraud


Glossary

Telecommunications Fraud occurs when someone uses or sells telecommunications services with no intention to pay for that service
Social Networks are the networks or graphs created by the communication patterns, friendships, or trust relationships between individuals
Signatures are statistical profiles of entities based on transactional data. These signatures, like our handwritten ones, evolve through time

Definition

Telecommunications fraud can be challenging for service providers to detect and eradicate. This entry describes one effective way of finding new cases of fraud: mining the network of relationships that emerges from the transactional data collected in the network. For each entity in the network, the social network – or community of interest (COI) – is observed and analyzed to look for patterns which might indicate fraud. The COIs are updated regularly via an exponentially weighted moving average, which adapts the signature using the most recent data while incorporating historical information.

Introduction

Historical Background

Telecommunications has long been a breeding ground for fraudulent activity. Due in part to the sheer volume of transactions and the need for automated systems to connect those transactions, many bright computer scientists have been attracted to the problem of hacking into these production systems and stealing phone service. Famously, Steve Jobs was captivated by the early phone fraudsters (known as phreakers), and breaking into phone systems using the so-called blue boxes was an early and important exposure for him to the computer hacking world (Isaacson 2011). For over 50 years, phone companies and fraudsters have been locked in an arms race of technology, where hackers try to subvert telecommunications systems while the phone companies attempt to keep up with preventing revenue loss. In the US, AT&T has been a leader in developing technology to identify and fight fraudulent behavior in the telecommunications network (Becker et al. 2010). Much of our success came through our ability to work with the massive scale of the data through statistical profiles, or signatures, of each phone number, which allowed us to look for anomalies in behavior without resorting to a time-consuming and costly dip into a data warehouse (Cortes and Pregibon 2001). A parameter in signature X can be updated on a call-by-call basis via an exponentially weighted moving average (EWMA):

    X_n = λ · D_c + (1 − λ) · X_p    (1)

where X_p and X_n are the previous and new values of the parameter and D_c is the information that comes from the current call. The parameter λ controls the decay function of the moving average. EWMAs update parameters in a smoother fashion than simple moving averages and allow for streaming updating of parameters without the need to access data that has already been seen.

Social Networks and Fraud

Fraud can come in many shapes and sizes and can be based on invasive technology, social engineering, identity theft, or exploiting loopholes in regulation (Becker et al. 2010; Phua et al. 2005). Each type of fraud requires unique data collection and statistical models to combat it. However, common themes emerged. We discovered that fraudsters often operate in a community – either they communicated directly with each other or through intermediaries. Even if they did not communicate in social circles with other fraudsters, they often tended to have the same innocent targets – people who could get duped repeatedly


into calling fraudulent numbers. Using this information, we recognized that fraudulent numbers tended to be close to each other in the social network created by our telecommunication transaction records (otherwise known as the call graph), and this information could be used to catch new fraudsters. Here is an example: Fraudster Frank has an adult chat line and would like to attract customers (callers). He sets up a phony toll-free account perhaps through stolen credit cards or identity theft and advertises the fraudulent numbers in certain magazines or online bulletin boards. Once we detect the fraud, we shut down the number, but only after days or weeks of calls were made. Frank then attempts to set up a second number in the same way. Since we never catch the person behind the fraud, this cycle can frustratingly repeat. However, we can catch the new numbers easily through the call graph. Given that the fraudulent numbers are advertised in the same magazines, we observe that the new fraudulent line is connected to the old fraudulent line in the call graph through the innocent people who are calling both numbers. It turns out to be fairly easy to quickly identify new fraudulent numbers that cater to the same community of people. Once we observed this fact, we realized that many different types of fraud could be identified through the call graph, and there was a need for an efficient social network based signature that could be used for analysis.
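The EWMA update from Eq. 1 can be sketched as a streaming routine; the decay value λ = 0.15 and the call durations below are illustrative assumptions, not values from the production system.

```python
def ewma_update(x_prev, d_current, lam=0.15):
    """Eq. 1: blend the current call's value into a signature
    parameter without re-reading any historical raw data."""
    return lam * d_current + (1 - lam) * x_prev

# e.g., maintaining an "average call duration" parameter call by call
x = 100.0                               # previous signature value (seconds)
for duration in [80.0, 120.0, 300.0]:   # today's calls, oldest first
    x = ewma_update(x, duration)
print(round(x, 2))
```

Because each update only needs the previous value and the current call, the signature can be maintained over a stream of billions of call records.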

Communities of Interest: Key Techniques

The telecommunications call graph consists of nodes (phone numbers) and edges (calls between those phone numbers). The US-based call graph contains hundreds of millions of unique nodes and billions of edges between those nodes. Our task was to create a signature for each individual node which represented the local social network of that node. This network would give us the most information about phone number A by including nodes that are informative about A and removing


noise from the network. The easiest way to define this signature is by the ego-net, the set of other numbers that communicate with A. However, there may be numbers that are two or more hops away from A that are relevant to A (especially if they have many friends in common). But including all numbers two or three hops away could grow the signature to hundreds or thousands of numbers, and intuition tells us that the relevant set of numbers for a given node cannot be that large. Conversely, there may be numbers that communicate directly with A that are not very relevant (such as a wrong number or local business). Some mechanism is needed to grow and prune the network to create a succinct signature. Our solution was to create a graph-based signature for each phone number that allowed for growing and pruning via a top-k mechanism and allowed edges to decay off of the graph if there was no communication between A and B for a given time. We call this profile the community of interest signature, or COI signature. The COI signature is defined as follows. Let Ĝ_{t−1} denote the top-k approximation to G_{t−1} at time t − 1 and let g_t denote the graph derived from the new transactions at time step t. The approximation to G_t is formed from Ĝ_{t−1} and g_t, node by node, using a top-k approximation to the EWMA from Eq. 1:

    Ĝ_t = top-k{ λ · g_t ⊕ (1 − λ) · Ĝ_{t−1} }    (2)

where the ⊕ operator is a graph sum operation which takes the union of the nodes and edges in the two graphs for the aggregate graph. The "top-k" part of the updating is a pruning function which only includes the neighbor nodes with the highest weight in the COI signature. Everything that is not included in the top-k edges gets aggregated into an overflow bin called other. Figure 1 shows an example of this updating scheme. Here, k = 9 and the COI signature contains the top 9 other numbers called. The middle panel shows the calls that were made today; these calls include one number that does not currently exist in the signature. The final panel shows the resulting blend of old information and today's data.
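The per-node top-k update of Eq. 2 can be sketched over a simple dict-based signature. The λ and k values, the phone numbers, and the exact treatment of the other bin are illustrative assumptions, not the production implementation.

```python
def update_coi_signature(signature, todays_calls, lam=0.15, k=9):
    """Top-k EWMA update sketch for Eq. 2. `signature` and
    `todays_calls` map neighbor -> edge weight; edges without calls
    today decay, and everything outside the top k is aggregated
    into the 'other' overflow bin."""
    signature = dict(signature)                  # don't mutate the caller's copy
    other = signature.pop("other", 0.0)
    merged = {n: (1 - lam) * w for n, w in signature.items()}  # decay old edges
    for n, w in todays_calls.items():
        merged[n] = merged.get(n, 0.0) + lam * w               # blend in today
    ranked = sorted(merged.items(), key=lambda kv: kv[1], reverse=True)
    top, spill = dict(ranked[:k]), ranked[k:]
    top["other"] = (1 - lam) * other + sum(w for _, w in spill)
    return top

# Hypothetical signature with k = 2: one new number appears today and
# pushes the weakest existing edge into the overflow bin
sig = {"555-0001": 40.0, "555-0002": 5.0, "other": 2.0}
sig = update_coi_signature(sig, {"555-0003": 30.0}, k=2)
print(sig)
```

This mirrors the behavior shown in Fig. 1: edges called today gain weight, stale edges decay, and low-weight edges spill into other.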


Telecommunications Fraud Detection, Using Social Networks for, Fig. 1 Computing a new top-k edge set from the old top-k edge set and today's edges. Note how a new edge enters the top-k edge set, forcing an old edge to be added to other

The new number has knocked a low-weight edge into the other bin, and the other weights have either been updated or decayed based on whether a call was seen to that number today. Truncation of the signature in this fashion ensures that only the most relevant nodes will make it into the signature. In general, calling behavior is heavily skewed such that the vast majority of calling minutes are concentrated on one's top few friends. We set λ and k such that typically 95 % of all of the communication behavior is accounted for in the top-k links (Hill et al. 2006). Once we have the COI signature for each phone number, we can use it to build a community of interest for each phone number by recursion – find the COI signature of A, and then recurse to find the COI signatures of the friends of A. A final pruning step of low-weight edges typically results in a COI of a few dozen nodes – the most relevant nodes to use for analysis of A. Each phone number's signature can be updated once a day in batch or on a call-by-call basis, depending on the application, allowing social networks to be investigated at scale and in real time.
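The recursive COI construction just described can be sketched as follows, assuming a hypothetical signature store that maps each number to its top-k neighbor weights; the minimum-weight threshold and the data are illustrative.

```python
def build_coi(center, signatures, min_weight=1.0):
    """Two-level community of interest: the center's signature
    neighbors plus their signature neighbors, with low-weight
    edges pruned away."""
    coi_edges = {}
    first_hop = [n for n in signatures.get(center, {}) if n != "other"]
    for node in [center] + first_hop:
        for nbr, w in signatures.get(node, {}).items():
            if nbr != "other" and w >= min_weight:
                coi_edges[(node, nbr)] = w
    return coi_edges

# Hypothetical signature store: number -> {neighbor: EWMA edge weight}
sigs = {
    "A": {"B": 10.0, "C": 4.0, "other": 0.5},
    "B": {"A": 9.0, "D": 2.0},
    "C": {"E": 0.2},  # below min_weight, pruned from the final COI
}
print(sorted(build_coi("A", sigs)))
```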

Key Applications

The COI signature can be thought of as "sufficient statistics" for social network analysis of the call graph. Once we had built the infrastructure, it allowed for analysis of large sets of phone numbers without having to resort to the painful process of collecting all of the raw transaction data from a data warehouse. In the Fraudster Frank example presented earlier, Frank's second attempt at setting up a fraudulent number could be easily identified by building the COI of the new phone number. This number's COI would connect it to a known fraudulent number through many different paths – each path corresponding to an innocent user that called both phone numbers. This pattern would be easily identifiable to a company fraud analyst. Alternatively, systems can automatically detect these patterns and shut down obvious cases of fraud. Along these lines, it became clear that any relatively new phone number that had several known fraud numbers in its COI could immediately be tagged for further investigation by the fraud team. Through post hoc analysis we discovered that when new numbers had more than three fraudsters in their COI, their probability of some kind of fraudulent behavior was 90 % or greater. This guilt-by-association fraud model was a very successful component of our production fraud system. COIs were also helpful for cases of identity theft or identity obfuscation. Phantom churn is a type of fraud that occurs when a customer pretends to be somebody else in order to get a special deal. One example occurs for mobile phones, where free phones or other bargains are offered to new customers and not to existing


customers. By using COI, we can identify that a new account has the same social network as a recently closed account, and calculate with high probability that the users are the same. This allows us to estimate the impact of phantom churn and take measures to curtail it.
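Both applications reduce to simple set operations on COI signatures. The sketch below uses hypothetical numbers, an assumed fraud list, and illustrative helper names; it is not the production scoring logic.

```python
def fraud_neighbors(coi, known_fraud):
    """Guilt by association: count known fraudulent numbers in a COI."""
    return sum(1 for n in coi if n in known_fraud)

def coi_overlap(coi_a, coi_b):
    """Jaccard overlap of two COIs, used to spot phantom churn: a 'new'
    account whose social circle matches a recently closed account."""
    a, b = set(coi_a) - {"other"}, set(coi_b) - {"other"}
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical COI signatures (neighbor -> weight) and fraud list
new_acct = {"555-0001": 3.0, "555-0002": 7.0, "555-0009": 1.0}
closed = {"555-0001": 4.0, "555-0002": 6.0, "555-0005": 2.0}
print(fraud_neighbors(new_acct, {"555-0002", "555-0009", "555-0042"}))
print(coi_overlap(new_acct, closed))
```

A production system would threshold these scores (e.g., more than three fraud neighbors, or high COI overlap with a closed account) to queue cases for a fraud analyst.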

Conclusions

Using social network signatures through our COI infrastructure has allowed us to identify fraud in situations that standard profiling methods would have missed. In cases where a fraudster is pretending to be someone else through identity theft or use of a stolen credit card, the fraudster will typically not change the people they want to communicate with. Because of this, social network analysis through COI signatures has been extremely effective – in fact, these signatures have been some of the most effective modules in our internal fraud detection system. Nonetheless, as fraudsters figure out our methods, they can change their behaviors to try to thwart us – for instance, by injecting random calls to dilute the social network signal. Our methods also continually evolve to keep up with these fraudsters as we detect new schemes. The arms race continues!


Acknowledgments

I thank Deepak Agarwal, Rick Becker, Robert Bell, Corinna Cortes, Shawndra Hill, Daryl Pregibon, and Allan Wilks, who were all key contributors to the methodologies and applications described here.

References

Becker RA, Volinsky C, Wilks AR (2010) Fraud detection at AT&T: a historical perspective. Technometrics 52(1):20–33
Cortes C, Pregibon D (2001) Signature-based methods for data streams. Data Min Knowl Discov 5(3):167–182
Hill S, Agarwal D, Bell R, Volinsky C (2006) Building an effective representation for dynamic networks. J Comput Graph Stat 15(3):584–608
Isaacson W (2011) Steve Jobs: the exclusive biography. Little Brown Book, London
Phua C, Lee V, Smith K, Gayler R (2005) A comprehensive survey of data mining-based fraud detection research. Artif Intell Rev

Temporal Analysis
▸ Temporal Analysis on Static and Dynamic Social Networks Topologies

Temporal Analysis on Static and Dynamic Social Networks Topologies

Idrissa Sarr1 and Rokia Missaoui2
1 Department of Computer Science and Mathematics, Université Cheikh Anta Diop, Dakar, Sénégal
2 Department of Computer Science and Engineering, Université du Québec en Outaouais (UQO), Gatineau, QC, Canada

Synonyms

Community evolution; Dynamic networks; Social data analysis; Temporal analysis

Glossary

Dynamic Networks Networks that change over time
Temporal Analysis on Social Networks Exploring the evolution of social networks over time
Transient Community A community formed by a set of actors and temporal ties they share during a time window
Viral Marketing Techniques that use preexisting social networks and other technologies to produce increases in brand awareness or achieve other marketing objectives
HIV Human Immunodeficiency Virus
AIDS Acquired Immunodeficiency Syndrome


Definition

Online social media have spread very widely in the Internet and Web era due to their great impact on many societies and organizations. In fact, using social media may ease communication, marketing, customer services, and even back-end business processes. At the heart of this spectacular growth is the concept of a social network, which stems from the collection of actors who use the online media and have social ties. Analyzing such social networks is a process of exploring nodes and ties between actors in order, for instance, to identify influential customers and best-ranked products, estimate disease propagation, and facilitate coordination and cooperation (Scott 2012). Due to important advances in information technology and hence the ease with which many interactions and transactions can be conducted, social networks have grown drastically in complexity, size, and variety. For instance, the networks of Facebook, LinkedIn, and YouTube deal with hundreds of millions of users with various contents and different goals. The creation or birth of these networks is a dynamic process in which the network state evolves over time. This dynamic process unveils the intrinsic temporal aspects of the network. To analyze the network properties, one may consider the network as a static view in which all the links in the final network are present throughout the study. This is a useful and simplifying assumption for a network which is built instantly and does not evolve frequently over time. A typical example is a network of disease spread, where diseases are relatively contagious and spread faster than the creation or dissolution of a new contact or interaction. However, if the network evolves over longer time scales, it is worthwhile to take into account the fact that ties may be transient and the network structure can change at many points in time and have an impact on the final network status. One example is the case of HIV/AIDS, a disease which propagates within the population over a relatively longer period of time. At a single time t, an individual has zero or very few contacts, but his contacts can change significantly over time (and sometimes during a given time period such as summer) since new partnerships are formed and other ones break down. Therefore, it is important to consider windows of time in which some contacts exist and analyze a sequence of windows to both assess how the disease spreads over time and identify the main diffusion factors. Overall, temporal analysis is important to understand the creation and evolution of a network (mainly dynamic ones). It raises challenging issues such as how to track changes over time and which network properties one may consider to this end.

Introduction

Analyzing a social network is a process of studying nodes and the ties or interactions between them. Such interactions or relationships can be used to group actors into communities that represent people with similar patterns or interests. For example, a community can represent a set of individuals who share the same pathology of a disease or a set of Twitter users who tweet or re-tweet the same information through the microblogging platform. There have been studies in the literature that tackle the problem of identifying communities within a network (Newman 2004b; Fortunato 2010). Early studies in this context rely entirely on the static properties of the network and neglect the fact that most existing real-life networks evolve over time. In fact, interactions between actors change constantly with the insertion and deletion of links, and this process may have an impact on the final community structure. To figure out such changes and report them appropriately, we need to continuously analyze temporal data to discover network structures and their evolution. With this insight, studies were conducted recently to analyze dynamic networks (Goldberg et al. 2012; Leskovec et al. 2010) and mainly community evolution. Most of these studies use topological properties to identify the portions of


the network that are changing and characterize the type of changes, such as network shrinking, growing, splitting, and merging (Bródka et al. 2013). In this entry, we do not focus directly on detecting community evolution as done in most of the literature, but we aim to track temporal communities, which are built based on transient ties created between a set of actors during a time slot. Basically, we assume that actors may have temporal (or transient) links (e.g., during a set of events) that disappear afterwards. Such links are mined in order to extract dominant features of the network, like the temporal communities that we call transient communities. Moreover, we use temporal links to identify either active or passive actors.

Key Points

From a viral marketing or an information flow perspective, it is important to assess how likely it is that information will spread within a network or a community. To this end, it is crucial to track the behavior of network actors with respect to the goals of that structure. The objective of this study is to track activities or events that occur within a network and observe how actors react to them. We mean by activity or event any kind of social or professional activity, like a meeting, conference, post, or tweet, in which users are invited to participate. For instance, an event in Twitter is a tweet message, and all users who forward the tweet to their followers are considered as reacting to or participating in the event. By gathering data related to events that occur within time windows, i.e., snapshots of network data at different times, we aim to figure out (1) whether an actor A of a community C is still active or not (churn) and (2) what are the temporal communities formed by active nodes and their transient interactions. Knowing active nodes or influencers may help identify key players in disease or rumor dissemination within a network, while tracking virtual communities can be used to portray the evolution details of the infected or concerned groups.
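The windowed bookkeeping described above can be sketched as follows; the event log, window length, and helper names are illustrative assumptions, not a reference implementation.

```python
from collections import defaultdict

def activity_by_window(events, window=7):
    """Bucket timestamped (actor, day) events into fixed-size time
    windows and return the set of active actors per window."""
    windows = defaultdict(set)
    for actor, day in events:
        windows[day // window].add(actor)
    return dict(windows)

def churned(events, current_window, window=7):
    """Actors seen in an earlier window but absent from the current one."""
    w = activity_by_window(events, window)
    earlier = set().union(*(a for k, a in w.items() if k < current_window))
    return earlier - w.get(current_window, set())

# Hypothetical event log: (actor, day of observation)
events = [("ann", 1), ("bob", 2), ("ann", 9), ("cid", 10)]
print(sorted(churned(events, current_window=1)))
```

The same per-window actor sets can then be intersected across consecutive windows to delimit transient communities.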


Historical Background

In the simplest way, a social network is represented by a graph with a set of nodes (vertices), joined in pairs by edges or ties. Many networks exhibit a collection of distinct groups or communities such that there are many edges between vertices inside a community while there are very few edges between communities. There are many studies on community detection (Fortunato 2010; Newman 2004a). A traditional approach used by computer scientists is to decompose a network into a predefined number n of homogeneous groups (Fiedler 1973; Kernighan and Lin 1970). The groups have almost the same size, and the number of edges between two groups is minimized. These approaches are based on iterative bisection: a best division of the initial network into two groups is first found, and further subdivisions are then conducted on the generated groups until the required number of groups is reached. Another mechanism, used by sociologists, is the agglomerative algorithm known as hierarchical clustering (Scott 2012). In this approach, a measure of similarity $x_{ij}$ between pairs $(i, j)$ of vertices is computed from the given network structure. Several similarity measures can be used, and the reader may refer to Newman (2004a) for further details. Starting with the pair having the highest similarity score, new edges are added in order of decreasing similarity. Moreover, the number of groups is set by the analyst. A well-known approach for community detection is described in Girvan and Newman (2002) and is based on the intuition that groups within a network may be detected through "natural" divisions among the vertices, without requiring the analyst to set the number of groups or to put restrictions on their size. As opposed to the hierarchical approach, the proposed algorithm is a divisive method in which edges are progressively removed from a network. Links that lie between communities are removed first, since they are interpreted as bottlenecks.
Many other approaches have been developed for tracking the evolution of social communities over time (Backstrom et al. 2006; Palla et al. 2007; Leskovec et al. 2007; Toivonen et al. 2009).


Temporal Analysis on Static and Dynamic Social Networks Topologies

To that end, they use several static views of the network at different times. For each view, one may use an existing community detection algorithm (Fortunato 2010) to depict the community topology. Between two time points, changes such as network growth or partition may then occur. Most of the new community detection approaches are devised on an underlying event framework that defines specific behaviors of a community, like birth, growth, and merging, in network evolution (Backstrom et al. 2006). More recent approaches to detecting or predicting the evolution of a group or a network consider heterogeneous information networks (Sun and Han 2012) or study the behavior of an actor over time with respect to the network content or the actions of other actors (Kashoob and Caverlee 2012). In Sun and Han (2012), the authors report different studies on mining and analyzing a more general kind of network, namely, heterogeneous information networks, and tackle many challenging issues such as relationship prediction, node ranking combined with clustering (or classification), similarity search (e.g., looking for a researcher whose profile is similar to a given one), and so on. Such networks contain more than one type of link or node, and each type of link indicates a specific relationship among actors. A simple example is a network with two node types, researcher and publication, together with two kinds of links: collaboration between researchers and authorship between researchers and publications. Another example of a heterogeneous information network is a Facebook graph that highlights who is a friend of whom, who is a relative of whom, and who is a colleague of whom. Heterogeneity brings useful semantic information that helps better identify influential actors and groups within a network as well as predict new relationships.
For instance, the intensity of the links can be a factor in assigning an actor to one group rather than to another cluster. That is, if the intensity of a collaboration link between two individuals is far higher than that of their friendship link, it is more meaningful to put such actors in the collaboration

community than in the friendship one. In Kashoob and Caverlee (2012), the authors rely on social bookmarking to analyze communities over time. The approach assumes that aggregating the uncoordinated tagging actions of a large and nonhomogeneous group of actors can be exploited for enhanced knowledge discovery and sharing. Based on the tags and the actors who made them, they provide a framework for community-based organization of web resources. To that end, they devise a model that builds an underlying temporal community structure in which users belong to implicit groups of interest. The authors then show how the approach captures the evolution, dynamics, and relationships among the discovered temporal communities and their important implications for designing future bookmarking systems, anticipating users' future requirements, and so on. To summarize, studying the evolution of communities has the advantage of foreseeing the overall trend of a group and anticipating some positive or negative effects it may lead to. For example, detecting the growth of a botnet at an early stage may help foresee criminal or suspicious attacks. The approach proposed in the present work is closely related to the recent approaches that oversee evolving networks over time, since it relies entirely on actor behavior with respect to events that occur in the network.

Methodology for Tracking Network Evolution

A common way to track temporal network behavior is to split time into a set of windows and study the structural characteristics of the static networks within each window. In other words, time is divided into a set of windows where a given window $w_j$ represents an interval $[t_j, t_j + \Delta]$, with $\Delta$ the window length. For each window, all events that happen within the time interval are collected. Moreover, for each event in $w_j$, the actors that are involved in $w_j$ as well as their transient ties are reported. Hence, we may compute the most active nodes in the considered time interval and the possible transient communities.
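The windowing step just described can be sketched in Python. This is a minimal illustration; the event-log format, the function name, and the window length `delta` are our assumptions, not part of the original method:

```python
from collections import defaultdict

def split_into_windows(events, delta):
    """Group a time-stamped event log into consecutive windows of length delta.

    `events` is a list of (timestamp, event_id, participants) triples; window
    w_j collects every event whose timestamp falls in
    [t0 + j*delta, t0 + (j+1)*delta).
    """
    if not events:
        return {}
    t0 = min(t for t, _, _ in events)          # start of the observation period
    windows = defaultdict(list)
    for t, event_id, participants in events:
        j = int((t - t0) // delta)             # index of the window containing t
        windows[j].append((event_id, set(participants)))
    return dict(windows)
```

For example, with `delta = 5`, events at times 0, 3, and 7 fall into windows $w_0$, $w_0$, and $w_1$, respectively; the per-window participant sets are then the input for detecting active nodes and transient ties.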


Active Actors of a Social Network
Active actors (nodes) in a given time slot $w_j$ are those who contribute to the social network activities by participating in most of the events that happen in this time window. Such nodes play a key role in the network topology as well as in information dissemination. To check whether a node is an active actor, let us consider $E(w_j) = \{e_j^1, e_j^2, \ldots, e_j^n\}$ as the set of events happening within the $w_j$ interval, where $e_j^i$ denotes the $i$th event that occurs in $w_j$. Then, we define $N^k_{w_j}$ as the total number of events in which node $k$ participates in $w_j$, while $N^E_{w_j}$ is the number of events that happen within the same window. Therefore, a node $k$ is active if the following inequality holds:

$$\frac{N^E_{w_j} - N^k_{w_j}}{N^E_{w_j}} \le \text{accepted\_laziness} \qquad (1)$$
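Formula 1 translates directly into code. The sketch below is a hypothetical illustration (the names are ours): `participation` maps each actor to the set of events he attends in $w_j$, and `accepted_laziness` is the tolerated fraction of missed events:

```python
def active_actors(participation, n_events, accepted_laziness):
    """Return the actors k satisfying Formula 1:
    (N^E_{w_j} - N^k_{w_j}) / N^E_{w_j} <= accepted_laziness.

    participation: dict mapping actor -> set of events attended in window w_j
    n_events:      N^E_{w_j}, the total number of events in the window
    """
    return {
        k for k, attended in participation.items()
        if (n_events - len(attended)) / n_events <= accepted_laziness
    }
```

With the event participation data of Table 1 and `accepted_laziness = 0.5`, this returns every actor except Actor 4; with `accepted_laziness = 0`, only actors attending all ten events (none in Table 1) would qualify as ubiquitous.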

The parameter accepted_laziness is the laziness ratio, i.e., the percentage of events to which a node inside a community may not react. The value of accepted_laziness is set by the analyst. When it is set to 0, i.e., $N^E_{w_j} = N^k_{w_j}$, the active actor $k$ is a ubiquitous one, since he is assumed to attend all events. Ubiquitous actors may be interpreted as the leaders that keep and/or oversee the community activity for a while (e.g., leading researchers in a given community). Moreover, tracking the active actors of a community over a set of intervals has the advantage of forecasting the group evolution and ranking actors. It can also be exploited to destabilize a group (e.g., a criminal group) by taking appropriate actions on such actors to weaken their attachment to the group. Finally, a permanent decrease in the number of active actors could be a precursor of a group dissolution.

Transient Communities: Definition and Goals
A transient community is a community formed by a set of actors and the transient ties they share during one or many events. This situation is well


illustrated by a research community. Consider two researchers (say Peter and Paul) who are not originally tied in the coauthorship network that describes who cowrites a publication with whom. Assume that Peter and Paul attend together a couple of conferences in which they organize or cochair the same sessions or workshops. Thus, Peter and Paul are tied during conferences, and such ties are called transient ones. The purpose of this work is to use the transient ties that connect actors during events to figure out transient communities. We define a transient community based on the pattern "who participates with whom," and we build it through two steps: identification of active nodes and link creation.

Finding Transient Communities
Most of the online systems created in recent years, like Facebook, MySpace, Twitter, and so on, offer both a rich set of events/activities and the opportunity for extensive interactions (Crandall et al. 2008). These online systems record both activities and interactions, thereby enabling a social graph visualization after any event. Nevertheless, we aim to find a transient community after a set of events rather than after each event. The reason is that two nodes may interact during only one event and not at all during the rest of the events; hence, data from a single event is not enough to conclude that two nodes are likely to be tied. With this insight, we consider a time window $w_j$ and all events that occur within this time frame. Furthermore, we consider all active actors as members of the transient community for that window. Using only active actors as members of a transient group is motivated by the fact that a transient community is formed with transient ties that are mostly established by active actors. Afterwards, we set links between two actors based on their co-occurrences in several events. Precisely, a new link is added between two actors if the number of their co-occurrences in several events is greater than a given threshold c.
To measure the co-occurrence of two nodes, several metrics can be used, as described in Manning and Schütze (1999) and Matsuo et al. (2006). For instance, one can use the matching coefficient


$|X \cap Y|$, the Jaccard coefficient $\frac{|X \cap Y|}{|X \cup Y|}$, and the overlap coefficient $\frac{|X \cap Y|}{\min(|X|, |Y|)}$, where $|X|$ and $|Y|$ denote the occurrence numbers of actors X and Y, respectively, while $|X \cap Y|$ and $|X \cup Y|$ represent the hit counts of "X AND Y" and "X OR Y," respectively. We use the overlap coefficient because it is shown in Matsuo et al. (2006) to be better suited to social network analysis than the matching and Jaccard coefficients. To add links between two nodes k and l following their interactions during $w_j$, in which the set of events $E(w_j)$ occurs, we define $P_{k,l}(w_j)$ as the probability that a link between k and l exists, given by the following formula:

$$P_{k,l}(w_j) = \frac{|N^k_{w_j} \cap N^l_{w_j}|}{\min(N^k_{w_j}, N^l_{w_j})} \qquad (2)$$

Temporal Analysis on Static and Dynamic Social Networks Topologies, Fig. 1 Initial community

As previously stated, $N^i_{w_j}$ gives the number of events that node i attends within $w_j$, while the numerator indicates the number of events k and l attend together. Basically, the procedure to build a transient community for a time window $w_j$ works as follows. For any couple of active nodes k and l, $P_{k,l}(w_j)$ is computed, and a link is created between them whenever $P_{k,l}(w_j) \ge c$. The probability $P_{k,l}(w_j)$ captures the intensity of relations between actors: the higher the value of $P_{k,l}(w_j)$, the more likely the transient links are to become permanent and to have an impact on network evolution. In other words, by considering transient communities, one may draw the evolution of a real community by including transient ties with higher intensity. In a social network like Facebook or LinkedIn, such changes may occur by suggesting that two members update their ties since they already share a couple of events.

Illustrative Example
To illustrate our approach, we consider a collaboration network of researchers. Basically, the network is drawn based on coauthorship patterns, and we track the co-participation of actors in events related to the main topics of the considered

research community. To this end, we assume a community of researchers depicted in Fig. 1. The network is set by using the pattern “who cowrites with whom.” Let us consider afterwards a time window in which we report ten events (e.g., meetings or conferences). We gather in Table 1 the events and their attendees during a time window. For instance, the set of events to which Actor 1 takes part are e2 ; e3 ; e5 ; e8 ; and e10 . When we set the accepted laziness rate to 40 %, the active researchers given by Formula 1 are Actors 7 and 9. In other words, only researchers who have participated to more than 60 % of the events are considered active. However, if this rate is set to 50 %, all researchers are active except Actor 4 who takes part to less than 50 % of events. Once active researchers are identified, Formula 2 is used to create new ties between them in order to know whether they belong to a more cohesive group or not. To this end, we set two distinct values of the threshold c : 40 and 60 %, and we draw the resulting networks. Figures 2 and 3 depict the transient networks when c D 40 % and c D 60 %, respectively. With c (D 40 %), the transient community is dense since new links are added even when two actors share a low number of events. That is the reason we have more links in Fig. 2 than in Fig. 1, which represents the initial network. Furthermore, with a low value of c , it is unlikely to estimate


Temporal Analysis on Static and Dynamic Social Networks Topologies, Table 1 Event participation

Actors  e1  e2  e3  e4  e5  e6  e7  e8  e9  e10
1        0   1   1   0   1   0   0   1   0   1
2        0   1   0   1   1   1   0   1   0   1
3        0   1   1   0   1   1   0   1   0   1
4        1   0   0   1   0   0   1   0   1   0
5        1   1   0   1   0   0   1   0   1   0
6        1   0   1   1   0   0   1   0   1   0
7        1   1   1   0   1   1   0   1   0   1
8        1   0   0   1   0   0   1   1   1   0
9        1   1   1   0   1   1   0   1   0   1

how relatively close two actors are with good precision. However, when c = 60 %, links are added only between actors whose co-participation rate exceeds 60 %, which has the effect of clustering actors with a similar pattern into a cohesive group. Figure 3 highlights two distinct groups formed on the basis of the intensity of the temporal links established by the actors. The colors are only used to differentiate the groups; the members of a group are those who take part in the same events or are interested in the same research topics. Moreover, one may observe in Table 1 that the nodes in the group with blue links have a participation rate smaller than that of the nodes in the other group. If such a behavior is observed (or reinforced) over subsequent time windows (or over a long period of time), an attrition of the corresponding group may be expected. We recall that transient communities depict only temporary interactions (e.g., who co-participates with whom) and are different from the more stable communities in the initial network (e.g., the coauthorship network). However, when mapped over the initial network, transient communities give additional information to predict new cohesive groups that arise from event occurrences.
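Formula 2 and the link-creation rule can be sketched as follows. This is a hedged illustration with our own function names; the event sets in the usage note correspond to Table 1:

```python
from itertools import combinations

def overlap_coefficient(events_k, events_l):
    """P_{k,l}(w_j) = |N^k ∩ N^l| / min(|N^k|, |N^l|) (Formula 2)."""
    return len(events_k & events_l) / min(len(events_k), len(events_l))

def transient_links(participation, threshold):
    """Link every pair of active actors whose co-participation
    intensity P_{k,l}(w_j) reaches the threshold c."""
    return {
        (k, l)
        for k, l in combinations(sorted(participation), 2)
        if overlap_coefficient(participation[k], participation[l]) >= threshold
    }
```

Actors 7 and 9 of Table 1 attend exactly the same seven events, so $P_{7,9} = 1$ and they are linked at any threshold, whereas Actors 1 and 4 share no event and are never linked.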

Key Applications

The method described in this paper can be applied in several situations, including the following cases:

Temporal Analysis on Static and Dynamic Social Networks Topologies, Fig. 2 Transient community with c = 40 %

Temporal Analysis on Static and Dynamic Social Networks Topologies, Fig. 3 Transient communities with c = 60 %

1. Churn Detection. After a subsequent set of time windows, our method identifies inactive nodes that may be considered as churners (Karnstedt et al. 2010). Churn detection is fruitful for most service-based companies, such as telecommunication, banking, and social network services, which may see their profitability decrease with the loss of customers or members. It is also useful for predicting employee attrition based on the decrease of an employee's participation in social or professional events within an organization. Therefore, predicting or detecting customer or employee attrition at an early stage gives companies more flexibility to apply appropriate incentives to keep customers or employees in their business. 2. Ranking and Clustering a Set of Actors. Detecting active actors can be applied to identify and/or rank actors who may react positively to a call for papers or an invitation. Ranking actors based on their research productivity helps answer the following


question: Who are the leading researchers in social network analysis? Meanwhile, the transient ties built by those active actors help answer the following questions: Who are the peer researchers of Bob? Who may become a coauthor of John? Furthermore, from a research project perspective, it is desirable to know which individuals are always interested in, or still work on, a topic in order to send them a call for conference papers or ask for their participation. Our approach is suited to fill such needs, since individuals who keep an interest in some topics are depicted as active nodes due to their participation in similar past events. 3. Viral Marketing/Information Flow Diffusion. Transient communities that are formed during activities reveal ties that are not necessarily depicted by the global network topology. For instance, researchers who attend the same conference more than once may establish interactions or ties that are not reflected by the coauthorship collaboration network (e.g., the DBLP network). However, such transient ties could be precursors of a future collaboration.

Future Directions

Our ongoing work and future directions cover two issues: • Influence and Relation Strength Learning. We plan to extend our work to measure influence propagation over time. The influence concept denotes the ability, power, or capacity of a network actor to have a direct or indirect effect on the remaining actors. In our context, it may be the effect of an actor (or the intensity of his links) on other actors with respect to their common participation in different events. For instance, it may be interesting to know who the influencers for a conference attendance are or which types of relationships are most influential for an author taking a given decision regarding his research topics. To reach our goal, we rely on the intensity of (direct or indirect) links between two nodes k and l, and we set a probability that states how likely a

node l is to get information from k during events held in a time window $w_j$. The higher the value of the probability, the more likely node k may impact the decision or behavior of l. • Tuning the Intensity Metrics of Transient Links. We use the intensity of interactions between two active actors to decide whether a new link should be added between them. The metric used assumes that when two actors participate in several events, they can be linked, ignoring the fact that there is a probability, even a low one, that these actors will never work together because of a lack of affinity. To handle such a situation, we plan to use possibility theory (Masson and Denoeux 2006) to predict whether two actors will be linked, by including the probability that they do not share any affinity.

Cross-References
▸ Dynamic Community Detection
▸ Temporal Networks

References
Backstrom L, Huttenlocher D, Kleinberg J, Lan X (2006) Group formation in large social networks: membership, growth, and evolution. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, KDD '06, Philadelphia, pp 44–54
Bródka P, Saganowski S, Kazienko P (2013) GED: the method for group evolution discovery in social networks. Soc Netw Anal Min 3(1):1–14
Crandall D, Cosley D, Huttenlocher D, Kleinberg J, Suri S (2008) Feedback effects between similarity and social influence in online communities. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, KDD '08, Las Vegas, pp 160–168. ACM
Fiedler M (1973) Algebraic connectivity of graphs. Czech Math J 23:298–305
Fortunato S (2010) Community detection in graphs. Phys Rep 486(3–5):75–174
Girvan M, Newman MEJ (2002) Community structure in social and biological networks. Proc Natl Acad Sci USA 99(12):7821–7826
Goldberg MK, Magdon-Ismail M, Thompson J (2012) Identifying long lived social communities using structural properties. In: ASONAM, Istanbul, pp 647–653
Karnstedt M, Hennessy T, Chan J, Hayes C (2010) Churn in social networks: a discussion boards case study. In: Proceedings of the 2010 IEEE second international conference on social computing, SOCIALCOM '10, Minneapolis. IEEE Computer Society, pp 233–240
Kashoob S, Caverlee J (2012) Temporal dynamics of communities in social bookmarking systems. Soc Netw Anal Min 2(4):387–404
Kernighan BW, Lin S (1970) An efficient heuristic procedure for partitioning graphs. Bell Syst Tech J 49(1):291–307
Leskovec J, Kleinberg J, Faloutsos C (2007) Graph evolution: densification and shrinking diameters. ACM Trans Knowl Discov Data 1(1):1–41
Leskovec J, Huttenlocher D, Kleinberg J (2010) Predicting positive and negative links in online social networks. In: Proceedings of the 19th international conference on world wide web, Raleigh, pp 641–650
Manning CD, Schütze H (1999) Foundations of statistical natural language processing. MIT, Cambridge
Masson MH, Denoeux T (2006) Inferring a possibility distribution from empirical data. Fuzzy Sets Syst 157(3):319–340
Matsuo Y, Mori J, Hamasaki M, Ishida K, Nishimura T, Takeda H, Hasida K, Ishizuka M (2006) Polyphonet: an advanced social network extraction system from the web. In: Proceedings of the 15th international conference on world wide web, Edinburgh. ACM, pp 397–406
Newman MEJ (2004a) Detecting community structure in networks. Eur Phys J B Condens Matter Complex Syst 38(2):321–330
Newman MEJ (2004b) Fast algorithm for detecting community structure in networks. Phys Rev E 69(6):066133
Palla G, Barabasi AL, Vicsek T (2007) Quantifying social group evolution. Nature 446:664–667
Scott JP (2012) Social network analysis. SAGE, London
Sun Y, Han J (2012) Mining heterogeneous information networks: principles and methodologies. Synthesis lectures on data mining and knowledge discovery. Morgan & Claypool Publishers, San Rafael
Toivonen R, Kovanen L, Kivelä M, Onnela JP, Saramäki J, Kaski K (2009) A comparative study of social network models: network evolution models and nodal attribute models. Soc Netw 31(4):240–254

Temporal Analytics
▸ Stream Querying and Reasoning on Social Data


Temporal Communities
▸ Community Evolution

Temporal Graphs
▸ Temporal Networks

Temporal Metrics
▸ Stability and Evolution of Scientific Networks

Temporal Networks
Petter Holme
Department of Energy Science, Sungkyunkwan University, Suwon, Korea
IceLab, Department of Physics, Umeå University, Umeå, Sweden

Synonyms
Dynamic graphs; Dynamic networks; Dynamical graphs; Evolving graphs; Temporal graphs; Temporal networks; Time-aggregated graphs; Time-stamped graphs; Time-varying graphs

Glossary
Temporal Network A system that could be modeled as a graph with additional information about when contacts happen, or the representation itself
Node, Vertex One unit that interacts with others to form a temporal network
Contact One interaction event, limited in time, between a pair of vertices
Edge, Link A pair of vertices that at some point are in contact


Definition

Temporal network is a subfield of network theory, or complex-network analysis, in which one treats the timing of when two vertices are in contact explicitly. A temporal network is any system that can be modeled, mathematically and computationally, as a graph of vertices with explicit timing of the contacts along edges.

Introduction

To understand how large-scale complex systems function, one needs to zoom out and look at the system from a distance, i.e., disregard unimportant details. For many areas – from wireless networks to protein interactions and from friendship networks to chains of historical events – a systematic approach to zooming out is to represent the system as a network (Newman 2010). Part of this development has been inspired by statistical physics – the branch of physics zooming out from microscopic (molecular) to macroscopic (thermodynamic) properties of systems. In networks, nodes represent the interacting units (which would be atoms and molecules in traditional physics) and links represent interactions between nodes. Although this approach of simplification is more approximative outside of physics, it has led to several discoveries. One such finding is that many properties are universal in the sense that they are shared between networks representing a variety of systems. Perhaps the most celebrated such property is the common feature that real-world networks have a probability distribution of degree (the number of neighbors) following a power law. Another class of discoveries concerns how such network structures affect dynamic systems (like disease spreading or Internet traffic) on the networks. Until recently, in most data-driven network studies, the time dimension has been projected out by aggregating the contacts between vertices to links. Of course, information will inevitably get lost in such an aggregation. If the contact sequences were homogeneous in time, this approximation would not be so bad. However, it seems like, for a large number of systems, the opposite is true – interaction sequences are highly heterogeneous in time, displaying burstiness (close to power-law inter-event time distributions) and correlations (see Fig. 1 and Barabási (2005); Eckmann et al. (2004)). These heterogeneities have, like the static network structure, been shown to influence dynamic processes over the contacts (Karsai et al. 2011; Rocha et al. 2011). Such effects are not yet fully categorized. This is especially true when the network structure and the temporal structure together affect spreading processes. Some of the effects follow from the fact that all dynamic events have to obey the time ordering of paths, as illustrated in Fig. 2 (Holme 2005; Kempe et al. 2002). One effect of the time ordering of temporal-network paths is that paths become (as opposed to static networks) intransitive, i.e., there can be a path from A via B to C at the same time as there is no path from C via B to A. There is no static network representation that can capture intransitivity while having nodes representing the same objects as the nodes of the contact sequences. Other, even more obscure effects come from the interplay between burstiness and network topology. We can, once we have the proper tools and methods, expect a wealth of discoveries and useful applications from this more complete representation (than static networks) of many large-scale systems. To reach this goal we need to integrate tools and theories from statistical physics, information theory, data science, statistics, and computer science.

Key Points

To understand the large-scale features of systems in society, biology, and technology, one can model them as temporal networks. These are mathematical objects that record the time of the beginning and end of a contact between two units and the identity of the two vertices in contact. Temporal networks can be used to predict the behavior of dynamic systems that spread via the contacts and as a tool to explore the formative forces of the system.


Temporal Networks, Fig. 1 Burstiness in human behavior. Each line represents the time of an email sent by the author. There are periods of quiescence and periods of intense activity, the hallmarks of burstiness. Most models of humans assume a more uniform distribution of contact times (like Poissonian)
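The burstiness illustrated in Fig. 1 is commonly quantified through the inter-event time statistics. One widely used summary (our choice here, not prescribed by this entry) is the burstiness parameter B = (σ − m)/(σ + m), where m and σ are the mean and standard deviation of the inter-event times: B is −1 for perfectly regular sequences, near 0 for Poissonian ones, and approaches 1 for very bursty ones. A minimal sketch:

```python
from statistics import mean, pstdev

def inter_event_times(timestamps):
    """Gaps between consecutive events of one node or node pair."""
    ts = sorted(timestamps)
    return [b - a for a, b in zip(ts, ts[1:])]

def burstiness(timestamps):
    """B = (sigma - m) / (sigma + m): -1 regular, ~0 Poissonian, ->1 bursty."""
    gaps = inter_event_times(timestamps)
    m, s = mean(gaps), pstdev(gaps)
    return (s - m) / (s + m)
```

A perfectly regular email sequence such as [0, 1, 2, 3, 4] gives B = −1, while a sequence with a long quiescent period, such as [0, 1, 2, 3, 100], gives B > 0.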

Temporal Networks, Fig. 2 Illustration of the nontransitive nature of temporal networks. Information (or something else carried by the dynamic system) can spread from A to B, from B to C, from C to B, and from B to A; it can also spread from A to C via B. However, it cannot spread from C to A (since by the time the information has reached B, all contacts between A and B have already happened). Since static networks are transitive, even directed ones, one cannot simply reduce a temporal network to a static one without changing the meaning of the concept of vertices

Historical Background

Temporal networks share much of their legacy with the complex-network, or network theory, field. In a way, it is one of the early interdisciplinary data-driven sciences. With the explosion of available data that happened late in the last century, many scientific fields faced the same question – how can we characterize a dataset's large-scale structure? For many types of data, the answer is to represent it as a network and study its structure, i.e., how it differs from a random network. Although this approach has a history dating back to the first half of the last century (maybe longer), it really exploded as a scientific discipline around the year 2000, with contributions from the physics, computer science, mathematics, statistics, and many other communities. Temporal networks have been a minor subfield since the early complex-network studies (Kempe et al. 2002). Only in the last few years has it begun to assemble a unified body of research (Holme and Saramäki 2012). Temporal networks also have some origins in time-series analysis – the data-driven discipline of measuring and characterizing regularities in temporally resolved series of events.

Temporal Networks as a Modeling Framework

Overview of Applications
The study of temporal networks is an emerging field of complex systems, occupying the same interdisciplinary realm as the wider field of complex-network research. The field of temporal networks draws on methods from physics, statistics, applied mathematics, computer science, sociology, and quantitative biology to study systems from biology, technology, and social systems. Networks of human interactions – face-to-face contacts, communication through electronic channels, etc. – are prime examples for a temporal-network approach, for many reasons. First, the contact patterns of humans are heterogeneous and bursty, with power-law-like inter-contact time distributions, and often also correlated (see Fig. 1 for an example). Understanding the consequences and driving forces behind these patterns is crucial for understanding the spreading of biological and electronic viruses and of information. Here, the results we have so far indicate that the effects can be large (Holme and Saramäki 2012). Second, modern technologies allow for detailed recording of massive amounts of such contacts via RFID sensors (Cattuto et al. 2010; Isella et al. 2011; Santoro et al. 2011; Stehlé et al. 2011),


cell-phone call records (Karsai et al. 2011; Kovanen et al. 2012), and tracking apps on mobile devices (Eagle and Pentland 2006). Third, in order to understand fundamental patterns of human behavior at large scales, one needs to go beyond the static approaches typically used by network scientists and sociologists alike. Fourth, the number of potential technological and medical applications is large. However, there are plenty of other systems that also lend themselves to a temporal-network framework. Ecological networks (of trophic or mutualistic) interactions between species are another example, where the links are on and off depending on environmental changes. Gene networks could also benefit from a temporal network approach to capture the influence of outer stimuli on the transcription. Networks of economic systems, including the trade between companies or countries, are yet another example. The list can be much longer. In this entry, we present temporal networks from a general perspective. Measuring Temporal Network Structure In this section, we will review some of measures that attempt to capture structure of both the temporal and topological dimension. For the rest of the chapter, we will mostly consider systems represented as lists of contacts – triplets of pairs of vertices and the time of the contact. A similar system we will not mention much is that where the contacts are extended in time (so that they have the start- and end-time of the contact). We call the first type of temporal network contact sequence, the other interval graph. We note that temporal networks are notoriously difficult to visualize in such a way that they both show all information and highlight the important structures (in a way like springembedding algorithms can successfully do for static networks). Two representations, labeled graphs and timeline plots, are illustrated in Fig. 3. 
Of these, the timeline plots put an emphasis on the temporal structure (bursts, daily patterns, etc.), while the labeled graphs highlight the network topology – but neither of them can be scaled up to more than a dozen or so vertices. There are other attempts at showing the time


Temporal Networks, Fig. 3 Visualization of temporal networks. (a) shows a labeled network of aggregate contacts. Panel (b) shows a timeline plot

evolution of a simplified topology – for example, the alluvial diagrams of Rosvall and Bergstrom (2010). However, these would typically not be able to visualize temporal structure such as non-transitivity, burstiness, or other aspects of temporal networks.

Reachability and Latency
One of the most conspicuous differences between static and temporal networks is that the latter type is not transitive. In general, even if A is related to B and B is related to C, it might be the case that A is not related to C (see Fig. 2). The relation in this case is the possibility of something spreading from one vertex to another through a series of contacts whose times are increasing (anything else would not be feasible in reality). For this reason, statistics of such paths over which something can spread should be informative. Some authors have, for example, investigated the average durations of time-respecting paths (Holme 2005; Kossinets et al. 2008; Tang et al. 2010). One of the key quantities is latency. Given a pair of vertices (i, j) and


an instant of time t, the latency is the shortest duration of any time-respecting path from i to j starting at time t. There is more to the statistics of time-respecting paths than latency. Just like regular graphs, temporal networks can be disconnected. This is, for the empirical data sets we are aware of, more common in temporal networks than in static ones. A practical measure capturing the tendency of what would be paths in static networks to be disconnected in temporal networks is the expected fraction of vertex pairs, connected in the aggregated network, that have infinite latency (Holme 2005). One can extend latency-like measures in many ways to capture different aspects of contact sequences. For example, if the dynamics does not allow spreading to pass a vertex immediately, i.e., if an incoming signal cannot spread onward in the same time step, then that calls for other structural measures (Pan and Saramäki 2011). It could furthermore be interesting to monitor the number of time-respecting paths between two vertices or to resolve the average latency in time – it might be that the average latency follows, e.g., a daily pattern where time-respecting paths are faster when contacts are more frequent.

Local Structures
Correlations
In static networks, the local structure – focusing on average properties of the immediate surroundings of vertices – is a powerful predictor of the behavior of dynamic systems on the network. For example, in a disassortative network (where high-degree vertices tend to be attached to low-degree vertices), disease outbreaks happen less easily than in a neutral network, but if they do occur, they tend to be more severe (Estrada 2011; Newman 2010). Adding a time dimension to a measure like assortativity (which measures the tendency of a network to be assortative or disassortative) is not straightforward.
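Returning to reachability: the time-respecting-path and latency concepts defined above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical function names, assuming contacts are given as (u, v, t) triplets:

```python
from collections import defaultdict

def latencies(contacts, source, t_start):
    """Earliest-arrival latencies from `source` over time-respecting paths.

    `contacts` is a list of (u, v, t) triplets (undirected contacts);
    spreading can only follow contacts with non-decreasing times.
    """
    arrival = defaultdict(lambda: float("inf"))
    arrival[source] = t_start
    for u, v, t in sorted(contacts, key=lambda c: c[2]):
        if t < t_start:
            continue
        # a contact at time t can transmit from a vertex reached by time t
        if arrival[u] <= t and t < arrival[v]:
            arrival[v] = t
        if arrival[v] <= t and t < arrival[u]:
            arrival[u] = t
    # latency = earliest arrival time minus the starting instant
    return {v: arrival[v] - t_start for v in arrival}

# Non-transitivity: A-B happens at t=2, B-C at t=1,
# so nothing can spread from A to C.
contacts = [("A", "B", 2), ("B", "C", 1)]
print(latencies(contacts, "A", 0))  # C stays at infinite latency
```

The toy example also demonstrates the non-transitivity discussed above: the aggregated network connects A to C through B, but no time-respecting path exists.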
There are many options for integrating the time dimension – aggregating the interactions by binning them or applying a sliding window, or focusing on a temporal-path-based measure that captures non-transitive effects. This could be the reason why there


have been rather few attempts in that direction. Nevertheless, we anticipate future efforts along these lines.

Persistent Patterns
As the contacts of a temporal network (especially in systems representable as interval graphs) evolve, some parts of it will be more active than others. Such patterns of persistent contacts define subnetworks that are candidates for functional subunits. Looking at the network of such contacts can be an alternative to aggregating all contacts when one wants to reduce the temporal network to a static network. An approach to investigating persistency is to let a time window slide through an interval graph and calculate the following function:

\gamma_i(t) = \frac{\sum_{j \in \Gamma(i,t)} a(i,j,t)\, a(i,j,t+1)}{\sqrt{\sum_{j \in \Gamma(i,t)} a(i,j,t)}\ \sqrt{\sum_{j \in \Gamma(i,t)} a(i,j,t+1)}}   (1)

where t is the start of the time window and Γ(i,t) is the set of vertices j with nonzero entries in the (time-dependent) adjacency matrix row of i. This function, originally proposed in Clauset and Eagle (2007), is called the adjacency correlation function or vertex persistency.

Motifs
Network motifs were first proposed for directed static networks (Shen-Orr et al. 2002). They are subgraphs that are overrepresented with respect to a randomized null model. Motifs are often interpreted as candidates for functional subunits. In static directed networks, motifs in biological networks can be mapped to electronic components (Shen-Orr et al. 2002), but in temporal networks this is harder. Rather, motifs in temporal networks correspond to typical sequences of events. There are many ways of defining such motifs. To take one example, Kovanen et al. (2012) look at sequences of contacts between a few vertices that are separated by at most a time Δt (they are Δt-adjacent). More precisely, two contacts e_i and e_j are adjacent if


and only if they share a vertex and are Δt-adjacent. Kovanen et al. (2012) find an overrepresentation of temporal-network motifs that are acyclic. This, they argue, is natural since causal networks are by nature acyclic (so that a subnetwork being acyclic could mean it constitutes a causal structure).

Mesoscopic Structures
In static networks, there is a large number of methods for discovering mesoscopic structures (a.k.a. clusters, communities, or modules (Fortunato 2010)). These are loosely defined as groups of vertices more densely connected within the group than to the rest of the network. Much of the literature focuses on deriving a method for decomposing a static network based on some conceptually simple principle; very few studies seek to identify structures known beforehand to exist. The papers incorporating a time dimension into community detection typically operate on aggregated time slices of the contact sequence (Rosvall and Bergstrom 2010). One can imagine clustering algorithms that operate on more informative representations of temporal networks, such as contact sequences or interval graphs (an exception is Lin et al. (2008)). As mentioned elsewhere, visualizing temporal networks (in a printed diagram containing all information in the data) is difficult, and this is a major obstacle to intuitive reasoning about temporal-topological structure. Reducing the network to a network of clusters that split and merge with time is perhaps the most promising path in this direction. However, such a reduction would also destroy, for example, any non-transitive features of the original structure and all temporal structures (bursts and the like) on timescales shorter than the time windows.
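The Δt-adjacency underlying the motif definition above can be sketched directly. This is a minimal illustration (the function name is ours; contacts are assumed as (u, v, t) triplets):

```python
def dt_adjacent_pairs(contacts, dt):
    """Pairs of contacts that share a vertex and are at most `dt` apart.

    Such Delta-t-adjacent contact pairs are the building blocks from which
    temporal motifs are assembled.
    """
    pairs = []
    for i, (u1, v1, t1) in enumerate(contacts):
        for u2, v2, t2 in contacts[i + 1:]:
            if {u1, v1} & {u2, v2} and abs(t2 - t1) <= dt:
                pairs.append(((u1, v1, t1), (u2, v2, t2)))
    return pairs

contacts = [("a", "b", 0), ("b", "c", 3), ("c", "d", 10)]
print(dt_adjacent_pairs(contacts, dt=5))
# (a,b,0) and (b,c,3) share vertex b and are 3 time units apart;
# (b,c,3) and (c,d,10) share c but are 7 apart, hence not dt-adjacent
```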

Models of Temporal Networks
As in other areas of theoretical science, our understanding of temporal networks needs mathematical and computational models. Such models come in different flavors for different purposes. The simplest, already mentioned above, are the null or reference models one needs, in addition to the network quantities,


to infer structure. Generative models can also work as null models and, in addition, serve as the underlying structure on which to explore the behavior of dynamic systems. A third class is the mechanistic models for exploring the emergence of the network structures one measures; finally, there are predictive models tailored to forecast future aspects of a temporal network. There have been surprisingly few models proposed that control both temporal and topological structure. This is in contrast to the early years of rapid development of static network theory, where there was a collective emphasis on model development – in particular, on mechanistic models (Newman 2010).

Randomization as Null Models
As mentioned above, to interpret a quantity measuring temporal-network structure, one needs to compare it to something – a null model. The procedure is to measure the quantity on the empirical data and also over an ensemble of model networks intended to represent neutrality (except for some selected fundamental constraints from the environment of the system). Comparing the results for the empirical network to those of the null model then summarizes the structure of the real-world data. In practice, "comparing" can mean subtracting (or dividing) the empirical and null-model values, or, more elaborately, calculating the Z-score of the empirical value with respect to the null-model ensemble. For static networks, the most used null model is to randomly rewire the edges (we refer to it as RE below) while keeping the degrees and the number of vertices constant (Maslov and Sneppen 2002). This model is closely related to a more mathematically formulated model – the configuration model – which is defined by a set of degrees that are connected as randomly as possible (Newman 2010). The only difference between edge randomization and the configuration model is that the former is more computationally oriented and takes a graph, rather than a degree distribution, as its input.
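The RE rewiring just described can be sketched with double-edge swaps. This is a minimal illustration (function name and data layout are ours), not a reference implementation:

```python
import random

def rewire_edges(edges, n_swaps, seed=0):
    """Degree-preserving edge rewiring (the RE null model).

    Repeatedly picks two edges (a, b) and (c, d) and swaps their endpoints
    to (a, d) and (c, b), rejecting swaps that would create self-loops or
    multi-edges. Every vertex keeps its degree.
    """
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    present = {frozenset(e) for e in edges}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        if len({a, b, c, d}) < 4:
            continue  # shared vertex: swap would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in present or new2 in present:
            continue  # swap would create a multi-edge
        present -= {frozenset((a, b)), frozenset((c, d))}
        present |= {new1, new2}
        edges[i], edges[j] = (a, d), (c, b)
    return edges

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
rewired = rewire_edges(edges, n_swaps=100)
# the degree sequence of `rewired` equals that of `edges`
```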
For temporal networks, randomization schemes serve somewhat different roles than


in static networks. On one hand, there is no well-studied theoretical counterpart like the configuration model. On the other hand, there are many more possibilities for randomizing a temporal network than a static one. This means one can obtain a systematic characterization of a temporal network through a sequence of randomization models that preserve less and less structure of the original data. We now list a few such randomization procedures for temporal networks. An illustration can be found in Fig. 4.

Randomly Permuted Times (RP)
As a time-dimension counterpart to the above-mentioned randomization of edges, one can permute the times of the contacts while keeping the network's static topology and the numbers of contacts between all pairs of vertices fixed. This randomization scheme thus retains the topology of the aggregated network of contacts and the number of contacts on each edge. One can use this randomization to study the effects of the order of contacts, including burstiness, inter-contact time distributions, and correlations between contacts on adjacent edges. The model also preserves the overall rate of events in the network over time, such as daily or weekly patterns.

Randomized Edges with Randomly Permuted Times (RE + RP)
The RE and RP procedures are independent and can thus be combined. First, the network structure is randomized using the RE procedure. Second, the time stamps of all contacts are reshuffled using the RP scheme. The output is a temporal network in which all topological and temporal correlations (with the exception of the overall rate of contacts, such as circadian rhythms) have been destroyed.

Random Times (RT)
The set of time stamps is conserved by the RP procedure. Hence, the overall contact rates follow the same patterns as in the original data – if there are daily or weekly patterns (which is the case in many biological and social systems), they will be unaltered in the randomized networks.
If one wants to see the influence of these patterns, one can draw

Temporal Networks, Fig. 4 Illustration of some randomization methods. Panel (a) illustrates the randomly permuted times (RP) scheme that removes structures in the order of events. Panel (b) shows the random times (RT) scheme, and panel (c) depicts a static-network rewiring as it appears in a contact sequence. The boxes highlight changes (which can also occur at contacts other than the ones swapped)

new times of the interactions from a random distribution, perhaps obeying some constraints set by the environment, and compare the results to those from the RP ensemble.

Randomized Contacts (RC)
This is another translation of the RE scheme to the time dimension (besides RP and RT), where one keeps the original set of times (like RP) but does not conserve the number of contacts per edge.


To be more precise, one keeps the graph topology fixed but randomly redistributes the contacts among the edges. After this randomization, the number of contacts per edge follows a binomial distribution rather than (as is typical in empirical networks) some broad, right-skewed distribution. This randomization scheme is suitable for testing effects of the distribution of the number of contacts per edge in combination with the order of contacts.

Equal-Weight Edge Randomization (EWER)
Karsai et al. (2011) used another randomization scheme to remove the correlations between the contacts of adjacent edges while keeping other statistics of the contact times of individual edges. In this approach, the entire time series of contacts associated with an edge is randomly exchanged with that of another edge having the same total number of contacts. Single-edge patterns, such as burstiness, are thus retained, together with all temporal-statistical properties conserved by the RP model. This null model requires a system large enough that there are enough edges with the same number of events to remove the correlations between them. Alternatively, one can make an approximate EWER randomization by relaxing the strict constraint that the numbers of contacts be equal and instead randomizing among edges whose numbers of contacts lie within some interval.

Edge Randomization (ER)
This null model is similar to the EWER model, with the difference that contact sequences can be exchanged between edges regardless of their numbers of contacts. This is similar to randomly swapping the edge weights (measured as aggregate numbers of contacts) in the aggregated network. Thus, this scheme removes weight-topology correlations.

Time Reversal (TR)
The last null model we describe (certainly not the last conceivable one) is designed for counting potential sequences of causal events (Bajardi et al. 2011).
The randomization (or, better, alteration) simply consists of running the original contact sequence backwards


in time. If sequences of consecutive contacts were a consequence of temporal correlations alone, then one would expect the numbers of such sequences observed when time runs forwards and backwards to be similar. A lack of such chains in the time-reversed null model compared to the original sequence could then be attributed to chains of causality.

Generative, Mechanistic, and Predictive Models
Many temporal-network studies have been performed by analyzing empirical example networks, often complemented by a randomization null-model approach (as described above). This approach has the advantage that we can understand the consequences of real-world temporal-network structures even though we do not have a canonical set of structural measures like we do for static networks. The disadvantage is that this understanding is not as systematic as it could otherwise be. To achieve a systematic understanding, we would need generative models that can output temporal networks with tunable temporal-network structure. In case the topological and temporal dimensions are independent, creating a generative model is rather straightforward: one would first generate the network topology according to some static network model and then generate time series of contacts over the edges. We do not yet know of any approaches to generate temporal networks with correlations between the topology and the temporal structure. Studies using generative models to explore the effects of temporal-network structure include Perra et al. (2012) and Rocha et al. (2013). Another class of models that is common in the static-network literature, but not yet for temporal networks, is mechanistic models. These are designed to explain particular large-scale structures observed in real networks. This is in contrast to the static-network field, where the development of mechanistic models was a major driving force. The third type of models is predictive models.
These are targeted solely at forecasting the future development of the contact structure.


Drawing more on machine learning and statistical theory, such a model would not explain why a temporal network looks the way it does; rather, given a contact sequence or interval graph, it would predict its continuation in the near future. Such models are related to vaccination strategies (forecasting the most important spreaders in the near future) (Lee et al. 2012; Prakash et al. 2010).

Processes on Temporal Networks
Networks are never just a collection of vertices and edges – or contacts, in the case of temporal networks. They are the underlying structure of a dynamic system – be it contagion of infections or Internet traffic – that is more closely related to the operation of the system than the network itself is. As mentioned, the time ordering of contacts can affect any type of dynamics that depends on shortest paths between vertices. Moreover, temporal effects influence diffusion-type spreading events. This has been investigated by comparing quantities describing spreading phenomena – often simulated models of infectious disease spreading – on empirical contact sequences and on randomized versions thereof. In contrast to static complex networks, we do not have a comprehensive theory of how temporal-network structure affects disease spreading. For some temporal networks, the structure in combination with the disease dynamics speeds up the epidemics (Karsai et al. 2011); in other systems, the structure seems to decelerate the spreading (Rocha et al. 2011). The temporal structure in focus in these studies is burstiness – the property that (usually human) activity happens in short intense periods followed by periods of relative quietude. To study the effect of burstiness, one compares simulated compartmental models on real-world contact sequences with the same models on randomized versions. Another category of models of social diffusion – in particular, social influence or opinion spreading – is threshold models.
In such models, an agent, or vertex, changes state whenever the social influence from its network neighborhood exceeds a threshold. It is perhaps too early to generalize, but it seems that threshold-model spreading is sped up by bursty contact patterns. Examples of threshold models on temporal networks include Karimi and Holme (2013), who


studied a modification of Watts's (2002) cascade model, and Takaguchi et al. (2012), who studied a threshold model with exponentially decaying influence. Both studies were performed on empirical networks and randomized null models thereof.
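The comparison described in this section – spreading on an empirical contact sequence versus a randomized null model – can be sketched with a minimal SI (susceptible-infected) process and the RP (randomly permuted times) scheme. Function names and the toy contact list are ours, an illustrative sketch rather than any of the cited models:

```python
import random

def si_outbreak_size(contacts, source, beta=1.0, seed=0):
    """Final outbreak size of an SI process run over a contact sequence.

    A contact transmits with probability `beta` when exactly one of its
    two vertices is infected; contacts are processed in time order.
    """
    rng = random.Random(seed)
    infected = {source}
    for u, v, t in sorted(contacts, key=lambda c: c[2]):
        if (u in infected) != (v in infected) and rng.random() < beta:
            infected |= {u, v}
    return len(infected)

def rp_shuffle(contacts, seed=0):
    """RP null model: permute contact times, keep who contacts whom."""
    rng = random.Random(seed)
    times = [t for _, _, t in contacts]
    rng.shuffle(times)
    return [(u, v, t) for (u, v, _), t in zip(contacts, times)]

contacts = [("a", "b", 1), ("b", "c", 2), ("c", "d", 3), ("a", "d", 4)]
print(si_outbreak_size(contacts, "a"))              # ordered chain: all 4 infected
print(si_outbreak_size(rp_shuffle(contacts), "a"))  # shuffled order may block spreading
```

Averaging outbreak sizes over many RP shuffles and comparing with the empirical sequence is the basic experiment behind the burstiness results cited above.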

Key Applications
Temporal networks constitute an appropriate modeling framework when analyzing a large-scale system of pairwise-interacting agents where the timescale of the contacts between agents is comparable to that of the dynamic systems on the network. If this is the case, then temporal networks can be used to predict the performance of the dynamic system and to suggest changes to the network that improve performance. Temporal networks also provide a family of tools for characterizing the organization and functionality of such systems. Systems appropriate for temporal-network modeling are abundant in social media, human communication, cell biology, ecology, and a range of other areas.

Future Directions
The study of temporal networks is not yet a mature field, and there are many open directions. Several concepts, models, and methods in the complex-network literature still lack a temporal-network counterpart. Perhaps not all static structural measures are suitable for such generalizations, but quantities such as the clustering coefficient, modularity, and assortativity do not yet have temporal-network counterparts. The most relevant measures and methods for temporal networks could well turn out to be rather different from static network measures. With every new, relevant measure come questions about how the corresponding structure affects dynamic systems on the network and why real-world networks have the structure they have. There are also more specific questions with no counterpart in static networks, such as how to visualize temporal networks, both as an image and as a movie, and how to predict future contacts.


Acknowledgments The author thanks Jari Saramäki for comments and acknowledges financial support from the Swedish Research Council and the WCU program through NRF Korea funded by MEST R31–2008–10029.

Cross-References
▸ Dynamic Community Detection
▸ Spatial Networks

References
Bajardi P, Barrat A, Natale F, Savini L, Colizza V (2011) Dynamical patterns of cattle trade movements. PLoS ONE 6, Art no e19869
Barabási A-L (2005) The origin of bursts and heavy tails in human dynamics. Nature 435:207–212
Blonder B, Wey TW, Dornhaus A, James R, Sih A (2012) Temporal dynamics and network analysis. Methods Ecol Evol 3:958–972
Cattuto C, van den Broeck W, Barrat A, Colizza V, Pinton J-F, Vespignani A (2010) Dynamics of person-to-person interactions from distributed RFID sensor networks. PLoS ONE 5, Art no e11596
Clauset A, Eagle N (2007) Persistence and periodicity in a dynamic proximity network. In: DIMACS workshop on computational methods for dynamic interaction networks, DIMACS, Piscataway
Eagle N, Pentland A (2006) Reality mining: sensing complex social systems. Pers Ubiquitous Comput 10:255–268
Eckmann J-P, Moses E, Sergi D (2004) Entropy of dialogues creates coherent structures in e-mail traffic. Proc Natl Acad Sci USA 101:14333–14337
Estrada E (2011) The structure of complex networks: theory and applications. Oxford University Press, Oxford
Fortunato S (2010) Community detection in graphs. Phys Rep 486:75–174
Holme P (2005) Network reachability of real-world contact sequences. Phys Rev E 71, Art no 046119
Holme P, Saramäki J (2012) Temporal networks. Phys Rep 519:97–125
Isella L, Romano M, Barrat A, Cattuto C, Colizza V, van den Broeck W, Gesualdo F, Pandolfi E, Ravà L, Rizzo C, Tozzi AE (2011) Close encounters in a pediatric ward: measuring face-to-face proximity and mixing patterns with wearable sensors. PLoS ONE 6, Art no e17144
Karimi F, Holme P (2013) Threshold model of cascades in empirical temporal networks. Physica A 392:3476–3483
Karsai M, Kivelä M, Pan RK, Kaski K, Kertész J, Barabási A-L, Saramäki J (2011) Small but slow world: how network topology and burstiness slow down spreading. Phys Rev E 83, Art no 025102
Kempe D, Kleinberg J, Kumar A (2002) Connectivity and inference problems for temporal networks. J Comput Syst Sci 64:820–842
Kossinets G, Kleinberg J, Watts DJ (2008) The structure of information pathways in a social communication network. In: Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining, Las Vegas, pp 435–443
Kovanen L, Karsai M, Kaski K, Kertész J, Saramäki J (2012) Temporal motifs in time-dependent networks. J Stat Mech, Art no P11005
Kuhn F, Oshman R (2011) Dynamic networks: models and algorithms. ACM SIGACT News 42:82–96
Lee S, Rocha LEC, Liljeros F, Holme P (2012) Exploiting temporal network structures of human interaction to effectively immunize populations. PLoS ONE 7:e36439
Lin Y-R, Chi Y, Zhu S, Sundaram H, Tseng BL (2008) Facetnet: a framework for analyzing communities and their evolutions in dynamic networks. In: Proceedings of the 17th international conference on world wide web, Beijing, pp 685–694
Maslov S, Sneppen K (2002) Specificity and stability in topology of protein networks. Science 296:910–913
Newman MEJ (2010) Networks: an introduction. Oxford University Press, Oxford
Pan RK, Saramäki J (2011) Path lengths, correlations, and centrality in temporal networks. Phys Rev E 84, Art no 016105
Perra N, Gonçalves B, Pastor-Satorras R, Vespignani A (2012) Activity driven modeling of time varying networks. Sci Rep 2, Art no 469
Prakash BA, Tong H, Valler N, Faloutsos M, Faloutsos C (2010) Virus propagation on time-varying networks: theory and immunization algorithms. Lect Notes Comput Sci 6323:99–114
Rocha LEC, Decuyper A, Blondel VD (2013) Epidemics on a stochastic model of temporal network. In: Mukherjee A et al (eds) Dynamics on and of complex networks, vol 2. Springer, Berlin, pp 301–314
Rocha LEC, Liljeros F, Holme P (2011) Simulated epidemics in an empirical spatiotemporal network of 50,185 sexual contacts. PLoS Comput Biol 7:e1001109
Rosvall M, Bergstrom CT (2010) Mapping change in large networks. PLoS ONE 5, Art no e8694
Santoro N, Quattrociocchi W, Flocchini P, Casteigts A, Amblard F (2011) Time-varying graphs and social network analysis: temporal indicators and metrics. In: Proceedings of the 3rd AISB social networks and multiagent systems symposium (SNAMAS), York, pp 32–38
Shen-Orr S, Milo R, Mangan S, Alon U (2002) Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 31:64–68
Stehlé J, Voirin N, Barrat A, Cattuto C, Isella L, Pinton J-F, Quaggiotto M, van den Broeck W, Régis C, Lina B, Vanhems P (2011) High-resolution measurements of face-to-face contact patterns in a primary school. PLoS ONE 6:e23176
Takaguchi T, Masuda N, Holme P (2012) Bursty communication patterns facilitate spreading in a threshold-based epidemic dynamics. E-print arXiv:1206.2097
Tang J, Musolesi M, Mascolo C, Latora V (2010) Characterising temporal distance and reachability in mobile and online social networks. Comput Commun Rev 40:118–124
Watts DJ (2002) A simple model of global cascades on random networks. Proc Natl Acad Sci USA 99:5766–5771

Recommended Reading
There are, to our knowledge, four review papers related to temporal networks. These are all recommended for further reading. Holme and Saramäki (2012) give a broad introduction to temporal networks, perhaps with an emphasis on the physics literature. Blonder et al. (2012) focus on temporal networks in ecology. Santoro et al. (2011) target a computer-science audience, while an overview of contributions from the network engineering community can be found in Kuhn and Oshman (2011).

Temporal Networks or Graphs
▸ Analysis and Visualization of Dynamic Networks

Temporal Social Network Analysis
▸ Models for Community Dynamics

Temporal-Textual Web Search
▸ Spatiotemporal Information for the Web

Temporary Organizations
▸ Inter-organizational Networks

Term Clouds
▸ Tag Clouds

Test Instances
▸ Benchmarking for Graph Clustering and Partitioning

Text Mining
▸ Multi-classifier System for Sentiment Analysis and Opinion Mining
▸ Semantic Social Networks Analysis
▸ User Sentiment and Opinion Analysis

Text Networks
▸ Semantic Social Networks

The Prisoner's Dilemma
▸ Incentives in Collaborative Applications

Theory of Probability, Basics and Fundamentals

Muhammad El-Taha
Department of Mathematics and Statistics, University of Southern Maine, Portland, ME, USA

Synonyms
First moment, mean, expected value; Population, sample space; Sample point, simple event, elementary event


Glossary
Population The collection of all possible observations
Simple Event One possible outcome of an experiment
Sample Space The collection of all simple events
Event One or more possible outcomes of an experiment
Relative Frequency The proportion of observations with a certain characteristic
Random Sample A sample obtained in such a way that every element in the population has an equal chance of being selected
Discrete Distribution A probability distribution where the possible outcomes are discrete or countable
Continuous Distribution A probability distribution where the possible outcomes cover the entire continuum of possible values

Introduction
In this entry we introduce the basic elements of probability theory and discuss the most basic probability distributions. We assume no previous knowledge of the subject, but reader maturity and familiarity with the basic elements of calculus are assumed. Probability theory is the formal study of the notion of uncertainty that people have faced from the beginning of time and still face in everyday life. This notion was not formalized until the seventeenth century, when the great mathematician Jacob Bernoulli (1654–1705) placed the notion of probability on a real theoretical basis. His "Ars Conjectandi" was the first real and substantial treatment of probability. It contained the general theory of permutations and combinations. Bernoulli also discovered the well-known law of large numbers. At that time probabilities were defined as relative frequencies, and that constituted a big step forward. However, the relative-frequency definition was not completely satisfactory. A more formal framework for studying probability came with the Kolmogorov (1956)


monograph, where his measure-theoretic formulation of the notion of probability became the standard that we continue to use today; see Shafer and Vovk (2006). Kolmogorov's framework made it possible to incorporate the mathematical knowledge known at the time into the study of probability. That was a huge leap forward. In the time since 1933, probability theory has advanced to a point where it is difficult to imagine a field of study in which it does not play a significant role. There is a vast literature that covers the topic of probability from the most basic level to the most advanced. A good, readable undergraduate-level book with intuitive explanations is Ross (2009); see also Ross (2007). A nonmathematical elementary book is Weiss (2012). Feller (1968), considered a classic, covers probability at a level suitable for advanced undergraduate and first-year graduate students. One may also consult Hogg and Craig (1995) and Mood et al. (1974). An advanced measure-theoretic book is Billingsley (1985); see also Loève (1977) and Gnedenko (1978). Applications of probability and statistical techniques in social networks can be found in Bandyopadhyay et al. (2011) and Thai and Pardalos (2012), among others. We start, in section "Combinatorial Analysis," by introducing the basic elements of combinatorial analysis at an elementary level. In section "Basics of Probability Theory" we discuss the basic elements of probability theory. In section "Random Variables and Their Moments" we introduce the concept of random variables and discuss their properties, including their moments and probability distributions. Section "Discrete Distributions" covers discrete distributions, and section "Continuous Distributions" covers continuous distributions. Finally, in section "Concluding Remarks" we end with concluding remarks on using probability distributions to model sources of randomness.

Combinatorial Analysis
In probability theory one frequently encounters situations where there is a need to count the number of different ways a certain event can occur.


It would be useful to have effective methods for counting the number of ways the different outcomes of random experiments can occur. These counting methods are known as combinatorial analysis. In this section we describe the basic counting principles.

Basic Principle of Counting: mn Rule
Suppose that two experiments are to be performed. If experiment 1 can result in any one of m possible outcomes and if, for each outcome of experiment 1, there are n possible outcomes of experiment 2, then together there are mn possible outcomes of the two experiments. This rule simply says that if one experiment can result in m possible outcomes and another experiment results in n possible outcomes, then there are mn possible outcomes when the two experiments are considered together.

Example 1 In a toss of two coins the number of possible outcomes is mn = 2 × 2 = 4. In a throw of two dice, the number of possible outcomes is mn = 6 × 6 = 36. Now consider a more interesting example. A small community consists of eight women, each of whom has four daughters. If one woman and one of her daughters are to be chosen as mother and daughter of the year, how many different choices are possible? Regard the choice of the woman as the outcome of the first experiment and the subsequent choice of one of her daughters as the outcome of the second experiment; the total number of possible choices is mn = 8 × 4 = 32.

The above principle generalizes to any number of experiments.

Generalized Basic Principle of Counting
If r experiments are to be performed such that the first may result in any of n1 possible outcomes, and if for each of these n1 possible outcomes there are n2 possible outcomes of the second experiment, and if for each of the possible outcomes of the first two experiments there are n3 possible outcomes of the third experiment, and so on, then there are a total of n1 × n2 × ⋯ × nr possible outcomes of the r experiments.
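The mn rule and its generalization can be checked directly by enumerating outcomes, for example with Python's itertools.product:

```python
from itertools import product

# Two coins: 2 x 2 = 4 outcomes; two dice: 6 x 6 = 36 outcomes.
coins = list(product("HT", repeat=2))
dice = list(product(range(1, 7), repeat=2))
print(len(coins), len(dice))  # 4 36

# Mother/daughter of the year: 8 women x 4 daughters each = 32 choices.
choices = list(product(range(8), range(4)))
print(len(choices))  # 32
```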


Example 2
(i) A college planning committee consists of three freshmen, four sophomores, five juniors, and two seniors. A subcommittee of 4, consisting of 1 individual from each class, is to be chosen. How many different subcommittees are possible? It follows from the generalized principle of counting that there are $3 \times 4 \times 5 \times 2 = 120$ possible subcommittees.
(ii) How many different 7-place license plates are possible if the first 3 places are to be occupied by letters and the final 4 by numbers? It follows from the generalized principle of counting that there are $26 \cdot 26 \cdot 26 \cdot 10 \cdot 10 \cdot 10 \cdot 10 = 175{,}760{,}000$ possible license plates.
(iii) In (ii), how many license plates would be possible if repetition among letters or numbers were not allowed? In this case there would be $26 \cdot 25 \cdot 24 \cdot 10 \cdot 9 \cdot 8 \cdot 7 = 78{,}624{,}000$ possible license plates.

Permutations

The number of ways of ordering $n$ distinct objects taken $r$ at a time (order is important) is given by

$$P_r^n = \frac{n!}{(n-r)!} = n(n-1)(n-2)\cdots(n-r+1).$$

Permutations are also called ordered arrangements.

Example 3 A box contains ten balls. Balls are selected without replacement one at a time. In how many different ways can you select three balls? Note that $n = 10$ and $r = 3$. The number of different ways is $10 \cdot 9 \cdot 8 = \frac{10!}{7!} = 720$ (which is equal to $\frac{n!}{(n-r)!}$).

Combinations

Combinations are similar to permutations except that here order is not important. For $r \le n$, we define

$$\binom{n}{r} = \frac{n!}{(n-r)!\,r!}$$

and say that $\binom{n}{r}$ represents the number of possible combinations of $n$ objects taken $r$ at a time (with no regard to order). We note that sometimes combinations are denoted as $C_r^n$.

Example 4
(i) A committee of 3 is to be formed from a group of 20 people. How many different committees are possible?

Solution 1 There are $\binom{20}{3} = \frac{20!}{3!\,17!} = \frac{20 \cdot 19 \cdot 18}{3 \cdot 2 \cdot 1} = 1{,}140$ possible committees.

(ii) From a group of five men and seven women, how many different committees consisting of two men and three women can be formed?

Solution 2 There are $\binom{5}{2}\binom{7}{3} = 350$ possible committees. Note that we have used combinations and the mn rule.

Binomial Theorem

The values $\binom{n}{r}$ are often referred to as the binomial coefficients. They appear in the following well-known binomial theorem:

$$(x+y)^n = \sum_{k=0}^{n} \binom{n}{k} x^k y^{n-k}.$$
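The permutation and combination counts above, and the binomial theorem, can be verified with the standard library. A small sketch (the values of $x$, $y$, $n$ in the last check are arbitrary):

```python
from math import perm, comb

# perm(n, r) = n! / (n - r)!  -- ordered arrangements
assert perm(10, 3) == 10 * 9 * 8 == 720          # Example 3

# comb(n, r) = n! / ((n - r)! r!) -- order ignored
assert comb(20, 3) == 1_140                      # Example 4(i)
assert comb(5, 2) * comb(7, 3) == 350            # Example 4(ii): combinations + mn rule

# Binomial theorem: (x + y)^n == sum_k C(n, k) x^k y^(n - k)
x, y, n = 3, 5, 7
assert (x + y) ** n == sum(comb(n, k) * x**k * y ** (n - k) for k in range(n + 1))
```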

Basics of Probability Theory

In this section we address the notions of sample space and events, the probability of an event, the laws of probability, equally likely outcomes, and random sampling. The laws of probability covered include the complementation law, the additive law, the definition of conditional probability and independence, the multiplicative law, the law of total probability, and Bayes' law. The notion of probability can be defined rigorously using the basic notions of measure theory (Billingsley 1985), but that is beyond the level of this presentation. Our treatment mimics Kolmogorov's definition, but without invoking any measure-theoretic concepts. We then give an intuitive explanation of the notion of probability, including the relative frequency interpretation. We start with a few definitions.

A random experiment involves obtaining observations of some kind. Examples include tossing a coin, throwing a die, polling, inspecting an assembly line, counting arrivals at an emergency room, and counting the number of friends of a Facebook subscriber. A population is the set of all possible observations. Conceptually, a population could be generated by repeating an experiment indefinitely. The outcome of an experiment is what we observe when a random experiment is performed. An elementary event (simple event or sample point) is one possible outcome of an experiment, while an event (compound event) is one or more possible outcomes of a random experiment. For example, in a die-throw experiment, an interest in observing a 6 describes an elementary event, while an interest in observing an odd number describes an event. Finally, a sample space is the set of all sample points (simple events) for an experiment; equivalently, it is the set of all possible outcomes of an experiment.

We need the following set-theoretic notation to describe events and sample spaces. Let the sample space be denoted by $\Omega$ and a sample point by $\omega$. We will use uppercase letters (e.g., $A, B, C, D, E$) to describe events. Note that an event is nothing but a subset of $\Omega$, and the sample space represents a universal set. A sample space and events can sometimes be represented by a Venn diagram.

Example 5 Let $\Omega = \{\omega_i, i = 1, \ldots, 6\}$, where $\omega_i = i$. In other words, $\Omega = \{1, 2, 3, 4, 5, 6\}$ is composed of six elementary events. We may think of $\Omega$ as a representation of the possible outcomes of a throw of a die, but it also represents any random experiment with six possible outcomes.

Example 6 Now consider the random experiment that consists of tossing two coins. Then $\Omega = \{(H,H), (H,T), (T,H), (T,T)\}$, where $(H,T)$ represents heads on the first coin and tails on the second coin.

Since events are subsets of $\Omega$, the notions of union, intersection, and complementation are

used to describe relations between events. Here we give brief definitions of these concepts.

Definition 1 Given two events $A$ and $B$ in a sample space $\Omega$:
(i) The union of $A$ and $B$, $A \cup B$, is the event containing all sample points in either $A$ or $B$ or both.
(ii) The intersection of $A$ and $B$, $A \cap B$, is the event containing all sample points that are in both $A$ and $B$. Sometimes we write $AB$ for the intersection.
(iii) The complement of $A$, $A^c$, is the event containing all sample points that are not in $A$.
(iv) Two events are said to be mutually exclusive (or disjoint) if their intersection is empty (i.e., $A \cap B = \emptyset$). Two events are said to be collectively exhaustive if their union is the entire sample space (i.e., $A \cup B = \Omega$).

Example 7 Suppose $\Omega = \{E_1, E_2, \ldots, E_6\}$. Let $A = \{E_1, E_3, E_5\}$ and $B = \{E_1, E_2, E_3\}$. Then $A \cup B = \{E_1, E_2, E_3, E_5\}$, $AB = \{E_1, E_3\}$, $A^c = \{E_2, E_4, E_6\}$, $B^c = \{E_4, E_5, E_6\}$, and $A$ and $B$ are not mutually exclusive.

Proposition 1 The operations $\cup$ and $\cap$ satisfy the following relations:

$$A \cup B = B \cup A, \qquad A \cap B = B \cap A \quad \text{(commutative law)}$$

$$A \cup (B \cup C) = (A \cup B) \cup C = A \cup B \cup C, \qquad A \cap (B \cap C) = (A \cap B) \cap C = A \cap B \cap C \quad \text{(associative law)}$$

$$A \cap (B \cup C) = (A \cap B) \cup (A \cap C) \quad \text{(distributive law)}$$

$$\left(\cup_{i=1}^{n} E_i\right)^c = \cap_{i=1}^{n} E_i^c, \qquad \left(\cap_{i=1}^{n} E_i\right)^c = \cup_{i=1}^{n} E_i^c$$

The last two relations are called De Morgan's laws. It would be helpful for the reader to state De Morgan's laws for two events $A$ and $B$.

Probability of an Event

A simple approach to probability is to calculate the relative frequency of an event and assign it as the event probability. However, this is not completely satisfactory. In this subsection we give a formal definition of probability and discuss its connections to relative frequencies and other ideas.

Definition of Probability Consider a random experiment whose sample space is $\Omega$. For each event $E$ of the sample space $\Omega$, define a number $P(E)$ that satisfies the following three axioms (conditions):
(i) $0 \le P(E) \le 1$
(ii) $P(\Omega) = 1$
(iii) (Additive property) For any sequence of mutually exclusive events $E_1, E_2, \ldots$ (i.e., for which $E_i \cap E_j = \emptyset$ whenever $i \ne j$),

$$P\left(\cup_{i=1}^{\infty} E_i\right) = \sum_{i=1}^{\infty} P(E_i).$$

We refer to $P(E)$ as the probability of the event $E$.

Immediate Consequences
(i) As a consequence of conditions (i)–(iii), we conclude that $P(\emptyset) = 0$, where $\emptyset$ represents the empty set.
(ii) If $E_i$ is null for all $i > n$, then

$$P\left(\cup_{i=1}^{n} E_i\right) = \sum_{i=1}^{n} P(E_i),$$

which gives the finite additive property.

Example 8 Let $\Omega = \{E_1, \ldots, E_{10}\}$, $P(E_i) = 1/20$ for $i = 1, \ldots, 6$, $P(E_i) = 1/5$ for $i = 7, 8, 9$, and $P(E_{10}) = 1/10$. Calculate $P(A)$, where $A = \{E_i, i \ge 6\}$. First, we need to establish that the assigned probabilities satisfy the axioms. It is easy to see that they do, so we proceed to calculate $P(A)$. Now, $P(A) = P(E_6) + P(E_7) + P(E_8) + P(E_9) + P(E_{10}) = 1/20 + 1/5 + 1/5 + 1/5 + 2/20 = 0.75$.
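The axiom check and the additivity computation in Example 8 can be done exactly with rational arithmetic. A minimal sketch:

```python
from fractions import Fraction as F

# Assigned probabilities from Example 8
p = {i: F(1, 20) for i in range(1, 7)}
p.update({7: F(1, 5), 8: F(1, 5), 9: F(1, 5), 10: F(1, 10)})

# Axioms: each P(E_i) lies in [0, 1], and P(Omega) = sum over all sample points = 1
assert all(0 <= pi <= 1 for pi in p.values())
assert sum(p.values()) == 1

# P(A) for A = {E_i : i >= 6}, by finite additivity
P_A = sum(p[i] for i in range(6, 11))
assert P_A == F(3, 4)  # i.e., 0.75
```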


Remark 1
(i) A probability function (set function) is a real-valued function $P : \Omega \to [0, 1]$. This means that $P(\cdot)$ maps events (subsets) into the interval $[0, 1]$.
(ii) The assumption of the existence of a set function $P$, defined on the events of the sample space and satisfying axioms (i)–(iii), constitutes the modern mathematical approach to probability theory.
(iii) The axioms (conditions) are natural and in accordance with our intuitive concept of probability as related to the phenomena of chance and randomness.
(iv) Using these axioms one can prove that if an experiment is repeated over and over again, then, with probability 1, the proportion of time during which any specific event $E$ occurs will equal $P(E)$. This result is known as the strong law of large numbers (SLLN); see Feller (1968).
(v) We have supposed that $P(E)$ is defined for all events $E$ of the sample space. Actually, when the sample space is an uncountably infinite set, $P(E)$ is only defined for the so-called measurable events. However, this restriction need not concern us, as all events of any practical interest are measurable.

Interpretations of Probability
(i) Relative frequency interpretation: If a random experiment is repeated a large number, $n$, of times and the event $E$ is observed $n_E$ times, then

$$P(E) \approx \frac{n_E}{n}.$$

Note that $n_E/n$ is the relative frequency of the event $E$, and by the SLLN, $P(E) = \lim_{n \to \infty} n_E/n$. This is sometimes given as the relative frequency definition of probability.
(ii) In real-world applications, one observes (measures) relative frequencies, while one cannot measure probabilities, only estimate them. At the conceptual level we assign probabilities to events. The assignment, however, should make sense (e.g., $P(H) = 0.5$, $P(T) = 0.5$ in a toss of a fair coin).
(iii) In some cases probabilities can be a measure of belief (subjective probability). For example, one may state that there is a 60 % chance that Shakespeare did not write Hamlet. This measure of belief should, however, satisfy the axioms.
(iv) Typically, we would like to assign probabilities to simple events directly and then use the laws of probability to calculate the probabilities of compound events.

It should be pointed out that, generally speaking, there are two schools of thought regarding probability interpretations. The first school is the "frequentists," who interpret probability as a relative frequency, as in (i) above; the second school is the "Bayesians," who interpret probability as a measure of subjective plausibility, as in (iii) above. In other words, probability is a measure of how strongly one believes something tends to occur; see DeFinetti (1974).

Laws of Probability

In this subsection we present some basic laws of probability. Note that $E$ and $E^c$ are always mutually exclusive (i.e., $E \cap E^c = \emptyset$) and collectively exhaustive (i.e., $E \cup E^c = \Omega$). Moreover, $1 = P(\Omega) = P(E \cup E^c) = P(E) + P(E^c)$. Thus we have the following fact.

Proposition 2 (Complementation law)

$$P(E^c) = 1 - P(E).$$

In other words, the complementation law says that the probability that an event does not occur is 1 minus the probability that it does occur.

Proposition 3 If $E \subseteq F$, then $P(E) \le P(F)$.

Proposition 4 (Additive law)

$$P(E \cup F) = P(E) + P(F) - P(E \cap F) \le P(E) + P(F).$$

If $E$ and $F$ are mutually exclusive,

$$P(E \cup F) = P(E) + P(F).$$

Proposition 5 (Additive law for three events)

$$P(E_1 \cup E_2 \cup E_3) = P(E_1) + P(E_2) + P(E_3) - P(E_1 \cap E_2) - P(E_1 \cap E_3) - P(E_2 \cap E_3) + P(E_1 \cap E_2 \cap E_3).$$

The above proposition can be generalized to any number of events (the inclusion-exclusion formula). For a reference, see Chap. 1 of Ross (2007).

Equally Likely Outcomes

Here we consider the special case when all elementary events are equally likely. Consider an experiment whose sample space is a finite set, say $\Omega = \{\omega_1, \ldots, \omega_N\}$.

Definition 2 The equally likely probability function $P$ defined on a finite sample space $\Omega = \{\omega_1, \ldots, \omega_N\}$ assigns the same probability $P(\omega_i) = 1/N$ to each sample point.

In this case, for any (compound) event $A$,

$$P(A) = \frac{N_A}{N} = \frac{\#(\text{sample points in } A)}{\#(\text{sample points in } \Omega)},$$

where $N$ is the number of sample points in $\Omega$ and $N_A$ is the number of sample points in $A$.

Example 9 Toss a fair coin three times. (i) List all the sample points in the sample space. The answer is $\Omega = \{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}$. (ii) Find the probability of observing exactly two heads. Let $A$ be the event of observing exactly two heads. Then $A = \{HHT, HTH, THH\}$. Since all elementary events are equally likely, $A$ has three sample points, and $\Omega$ has eight, we have $P(A) = 3/8$.

Example 10 Repeat Example 9 (ii) when ten fair coins are tossed.

Solution 3 Let $A$ be the event of observing exactly two heads. Using the combinatorial counting rules, we obtain $P(A) = \binom{10}{2}/2^{10} = 45/1{,}024$.

Example 11 If two fair dice are rolled, what is the probability that the sum of the upturned faces will equal 7?

Solution 4 Assume that all 36 possible outcomes are equally likely. Let $A$ be the event that the sum of the upturned faces equals 7; then $A = \{(1,6), (2,5), (3,4), (4,3), (5,2), (6,1)\}$, so the number of sample points in $A$ is 6 and $P(A) = 6/36 = 1/6$.

Example 12 If two balls are randomly selected from a bowl containing six white and five black balls, what is the probability that one of the drawn balls is white and the other is black?

Solution 5 Let $A$ be the event of interest. Then, using the combinatorial counting rules, we obtain

$$P(A) = \frac{\binom{6}{1}\binom{5}{1}}{\binom{11}{2}} = 6/11.$$

Example 13 If $n$ people are present in a room, what is the probability that no two of them celebrate their birthday on the same day of the year?

Solution 6 Let $A$ be the event of interest. Then

$$P(A) = \frac{(365)(364)(363)\cdots(365 - n + 1)}{365^n}.$$

Example 14 If $n$ people are present in a room, what is the probability that at least two of them celebrate their birthday on the same day of the year?

Solution 7 Let $B$ be the event that at least two of them celebrate their birthday on the same day of the year. Note that this event is the complement of the one in the previous example, so $P(B) = 1 - P(A)$.
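Solutions 6 and 7 can be computed directly; a minimal sketch (the check that 23 people already make a shared birthday more likely than not is the classic illustration, not a claim from the text):

```python
from math import prod

def p_no_shared(n: int) -> float:
    """Solution 6: P(no two of n people share a birthday), 365 equally likely days."""
    return prod(365 - i for i in range(n)) / 365**n

# Solution 7 (complementation law): P(at least one shared) = 1 - P(A)
assert abs((1 - p_no_shared(2)) - 1 / 365) < 1e-12   # two people: 1/365
assert 1 - p_no_shared(23) > 0.5                     # 23 people suffice
```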


Conditional Probability

Sometimes we wish to revise an estimate of the probability of an event in light of partial information that becomes available. The new probability estimate is called a conditional probability. The conditional probability of the event $A$ given that event $B$ has occurred is denoted by $P(A \mid B)$.

Definition 3 If $P(B) > 0$, then

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

It can be shown that the function $P(A \mid B)$, for fixed $B$ with $P(B) > 0$, satisfies the axioms of probability. It follows immediately from the definition that

$$P(A \cap B) = P(A \mid B)\,P(B).$$

Similarly, $P(A_3 A_2 A_1) = P(A_3 \mid A_2 A_1)\,P(A_2 A_1) = P(A_3 \mid A_2 A_1)\,P(A_2 \mid A_1)\,P(A_1)$. Proceeding inductively in this manner, we have:

Proposition 6 (Multiplicative law (Product Rule))

$$P(A_n \cdots A_1) = P(A_n \mid A_{n-1} \cdots A_1) \cdots P(A_3 \mid A_2 A_1)\,P(A_2 \mid A_1)\,P(A_1).$$

This general result is known as the Product Rule for conditional probabilities.

Definition 4 Any collection of events that is mutually exclusive and collectively exhaustive is said to be a partition of the sample space $\Omega$.

Let $A_1, A_2, \ldots, A_n$ be a partition of the sample space $\Omega$, and let $B$ be any event. Then

$$B = (B \cap A_1) \cup (B \cap A_2) \cup \cdots \cup (B \cap A_n).$$

Therefore,

$$P(B) = \sum_{i=1}^{n} P(B \cap A_i) = \sum_{i=1}^{n} P(B \mid A_i)\,P(A_i),$$

which leads to the following important probability law.

Theorem 1 (Law of total probability) Let the events $A_1, A_2, \ldots, A_n$ be a partition of the sample space $\Omega$, and let $B$ denote an arbitrary event. Then

$$P(B) = \sum_{i=1}^{n} P(B \mid A_i)\,P(A_i).$$

Using the multiplicative law and the law of total probability, we have, for all $k$,

$$P(A_k \mid B) = \frac{P(A_k \cap B)}{P(B)} = \frac{P(B \mid A_k)\,P(A_k)}{\sum_{i=1}^{n} P(B \mid A_i)\,P(A_i)},$$

which gives Bayes' law.

Theorem 2 (Bayes' Theorem) Let the events $A_1, A_2, \ldots, A_n$ be a partition of the sample space $\Omega$, and let $B$ denote an arbitrary event with $P(B) > 0$. Then

$$P(A_k \mid B) = \frac{P(B \mid A_k)\,P(A_k)}{\sum_{i=1}^{n} P(B \mid A_i)\,P(A_i)}.$$

Special Case Let the events $A, A^c$ be a partition of the sample space $\Omega$, and let $B$ denote an arbitrary event with $P(B) > 0$. Then

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid A^c)\,P(A^c)}.$$

Remark 2
(i) The events of interest here are the $A_k$; the $P(A_k)$ are called prior probabilities, and the $P(A_k \mid B)$ are called posterior probabilities.
(ii) Bayes' Theorem is important in several fields of application.
(iii) Bayes' Theorem is also useful when one has to reassess one's personal probabilities in light of additional information, as in the following example.

Example 15 At a certain stage of a criminal investigation, the inspector in charge is 60 % convinced of the guilt of a certain suspect. Suppose now that a new piece of evidence showing that the criminal has a certain characteristic (such as left-handedness, baldness, or brown hair) is uncovered. If 20 % of the population possesses

this characteristic, (i) what is the probability that the suspect has this characteristic, and (ii) how certain of the guilt of the suspect should the inspector now be if it turns out that the suspect has this characteristic?

Solution 8 Let $C$ be the event that the suspect possesses the characteristic of the criminal and $G$ be the event that the suspect is guilty. Then, by the law of total probability,

$$P(C) = P(C \mid G)\,P(G) + P(C \mid G^c)\,P(G^c) = (1)(0.6) + (0.2)(0.4) = 0.68.$$

Now, using Bayes' law, we have

$$P(G \mid C) = \frac{P(G \cap C)}{P(C)} = \frac{P(C \mid G)\,P(G)}{P(C \mid G)\,P(G) + P(C \mid G^c)\,P(G^c)} = \frac{(1)(0.6)}{(1)(0.6) + (0.2)(0.4)} \approx 0.882.$$

Independence

Informally, two events are called independent if the probability of occurrence of one is not affected by the occurrence of the other. We now make this notion formal.

Definition 5
(i) Two events $A$ and $B$ are said to be independent if

$$P(A \cap B) = P(A)\,P(B).$$

(ii) Two events $A$ and $B$ that are not independent are said to be dependent.

Remark 3
(i) If $A$ and $B$ are independent, then $P(A \mid B) = P(A)$ and $P(B \mid A) = P(B)$.
(ii) If $A$ is independent of $B$, then $B$ is independent of $A$.

Definition 6 Three events $A$, $B$, and $C$ are said to be independent if

$$P(A \cap B \cap C) = P(A)\,P(B)\,P(C),$$
$$P(A \cap B) = P(A)\,P(B), \quad P(A \cap C) = P(A)\,P(C), \quad P(B \cap C) = P(B)\,P(C).$$

The notion of independence can be generalized to any number of events.

Random Sampling

Sampling from a population in a way that ensures the sample accurately represents the population plays an important role in statistics.

Definition 7 A sample of size $n$ is said to be a random sample if the $n$ elements are selected in such a way that every possible combination of $n$ elements has an equal probability of being selected. In this case the sampling process is called simple random sampling.

Remark 4
(i) If $n$ is large, we say the random sample provides an honest representation of the population.
(ii) For finite populations, the number of possible samples of size $n$ is $\binom{N}{n}$. For instance, the number of possible samples when $N = 28$ and $n = 4$ is $\binom{28}{4} = 20{,}475$.
(iii) Tables of random numbers may be used to guide the selection of random samples.

Computer-generated random numbers play an important role in statistical analysis. For example, they can be used to ensure that a sample is random when using a statistical sampling technique. We point out that computer-generated random numbers are not random in the sense of being unpredictable: if we know the seed(s), then the whole sequence is determined; that is why they are called pseudorandom numbers. The most popular methods of generating pseudorandom numbers are called linear congruential generators. These methods generate numbers in such a manner that they behave as if they were random, in the sense that they pass a battery of statistical tests of randomness, independence, uniformity, etc. The advantages of these methods are their ease of implementation, relatively long periods for most applications, and their good statistical properties when parameters are carefully selected. Being able to reproduce a random sequence has its advantages; for example, one can study the effect of a change in an input parameter in exactly the same random environment. See, for example, the book by Law (2007), where random number generation and stochastic modeling applications are discussed. Generalizations to other, more sophisticated methods, e.g., multiple recursive generators, have been studied by L'Ecuyer (2012) and the references therein. The advantages of these newer methods are their extremely large periods and outstanding statistical properties.

Modeling Uncertainty

In this subsection we reflect on the role probability theory plays in modeling randomness or uncertainty. The purpose of modeling uncertainty (randomness) is to discover the laws of chance. We note that even though probability (chance) involves the notion of change, the laws governing the change may themselves remain fixed as time passes. For example, consider the simple chance experiment of tossing a fair coin. A probabilistic law would state: in a fair coin tossing experiment the percentage of heads is very close to 0.5. In the abstract probabilistic model, the law would state $P(H) = 0.5$ exactly. This brings up the question: why probabilistic reasoning? We illustrate with an example.

Example 16 Toss five fair coins repeatedly and write down the number of heads observed in each trial. What percentage of trials produce two heads?

Answer We could toss five coins repeatedly and calculate the relative frequency of observing two heads. Instead, use the binomial law to show that

$$P(2 \text{ heads}) = \binom{5}{2}(0.5)^2(1 - 0.5)^3 = \frac{5!}{2!\,3!}(0.5)^2(0.5)^3 = 0.3125.$$

Therefore, there is no need to carry out the experiment to answer the question, which saves time and effort. There is an interplay between theory and applications, in this case between probability and statistics or data analysis. Theory is an exact discipline developed from logically defined axioms (conditions), and it is related to physical phenomena only in inexact terms (i.e., approximately). A theory is useful because when it is applied to real problems, it works (i.e., it makes sense).

Example 17 A fair die is tossed a very large number of times, and face 6 is observed 1,500 times. Estimate how many times the die was tossed.

Answer Since the relative frequency of face 6 should be close to 1/6, the number of tosses is approximately $6 \times 1{,}500 = 9{,}000$.
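The pseudorandom-generator discussion and Example 16 can both be illustrated with a tiny linear congruential generator. A minimal sketch, not the method any particular package uses: the multiplier 16807 and modulus $2^{31}-1$ are the classic Park–Miller parameters, while the seeds, trial count, and tolerance are arbitrary illustrative choices.

```python
from itertools import islice
from math import comb

M = 2**31 - 1  # Park-Miller LCG: x <- 16807 * x mod M

def lcg(seed: int):
    """Yield pseudo-uniform values in (0, 1); the seed fixes the whole sequence."""
    x = seed
    while True:
        x = (16807 * x) % M
        yield x / M

# Same seed -> identical sequence: pseudorandom, not unpredictable
assert list(islice(lcg(42), 5)) == list(islice(lcg(42), 5))

# Example 16 by simulation: relative frequency of exactly two heads in five tosses
u = lcg(seed=12345)
trials = 100_000
freq = sum(sum(next(u) < 0.5 for _ in range(5)) == 2 for _ in range(trials)) / trials
assert abs(freq - comb(5, 2) / 2**5) < 0.02  # binomial law gives 0.3125
```

Rerunning with the same seed reproduces `freq` exactly, which is the reproducibility advantage mentioned above.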

Random Variables and Their Moments

In this section we introduce and discuss random variables, the probability mass function (pmf), probability density function (pdf), cumulative distribution function (cdf), expected value, moments, and variance. We motivate the discussion of random variables with the following example. Toss a coin three times; then

$$\Omega = \{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}.$$

Let the variable of interest $X$ be the number of heads observed; then the relevant events are $\{X = 0\} = \{TTT\}$, $\{X = 1\} = \{HTT, THT, TTH\}$, $\{X = 2\} = \{HHT, HTH, THH\}$, and $\{X = 3\} = \{HHH\}$. In this example, the domain of $X$ is $\Omega$, and its range is $\{0, 1, 2, 3\}$. The relevant question here is to find the probability of each of these events. Note that $X$ takes integer values even though the sample space consists of Hs and Ts.
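The mapping from sample points to values of $X$, and the induced probabilities, can be checked by direct enumeration; a small sketch:

```python
from itertools import product
from collections import Counter

# Sample space of three coin tosses; X(w) = number of heads in outcome w
omega = list(product("HT", repeat=3))
counts = Counter(w.count("H") for w in omega)

# Each of the 8 outcomes is equally likely, so P(X = x) = counts[x] / 8
pmf = {x: c / 8 for x, c in counts.items()}
assert pmf == {0: 1 / 8, 1: 3 / 8, 2: 3 / 8, 3: 1 / 8}
```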

The variable $X$ transforms the problem of calculating probabilities from one of set theory to one of calculus. In contrast to discrete random variables, a continuous random variable assumes a continuum of values. For example, observe the lifetime of a light bulb; then $\Omega = \{x : 0 \le x < \infty\}$.

Poisson A random variable $X$ is said to have a Poisson pmf with parameter $\lambda > 0$ if

$$f(x) = \frac{e^{-\lambda}\lambda^x}{x!}, \quad x = 0, 1, \ldots,$$

and $f(x) = 0$ elsewhere. The mean, variance, and standard deviation are given by $E[X] = \lambda$, $V(X) = \lambda$, and $\sigma_X = \sqrt{\lambda}$, respectively.

Example 18 Suppose the number of typographical errors on a single page of this entry has a Poisson distribution with parameter $\lambda = 1/2$. Calculate the probability that there is at least one error on this page.

Solution 9 Letting $X$ denote the number of errors on a single page, we have

$$P(X \ge 1) = 1 - P(X = 0) = 1 - e^{-0.5} \approx 0.393.$$
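Solution 9 can be verified numerically; a minimal sketch:

```python
from math import exp, factorial

def poisson_pmf(lam: float, x: int) -> float:
    """P(X = x) for X ~ Poisson(lam)."""
    return exp(-lam) * lam**x / factorial(x)

lam = 0.5  # typographical errors per page (Example 18)
p_at_least_one = 1 - poisson_pmf(lam, 0)
assert round(p_at_least_one, 3) == 0.393

# Sanity check: the pmf sums to 1 (truncating a negligible tail)
assert abs(sum(poisson_pmf(lam, x) for x in range(50)) - 1) < 1e-12
```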

Rule of Thumb The Poisson pmf provides good approximations to binomial probabilities when $n$ is large and $\lambda = np$ is small, preferably with $np \le 7$.

Example 19 Suppose that the probability that an item produced by a certain machine will be defective is 0.1. Find the probability that a sample of 10 items will contain at most 1 defective item.

Solution 10 Using the binomial distribution, the desired probability is

$$P(X \le 1) = \binom{10}{0}(0.1)^0(0.9)^{10} + \binom{10}{1}(0.1)^1(0.9)^9 = 0.7361.$$

Using the Poisson approximation, we have $\lambda = np = 1$ and

$$P(X \le 1) = P(X = 0) + P(X = 1) = e^{-1} + e^{-1} \approx 0.7358,$$

which is close to the exact answer.

Geometric

The geometric distribution arises in situations where one has to wait until the first success. For example, in a sequence of coin tosses (with $p = P(\text{success})$), the number of trials, $X$, until the first head is thrown is a geometric random variable. A random variable $X$ is said to have a geometric pmf with parameter $p$ if

$$f(x) = p(1-p)^{x-1}, \quad x = 1, 2, \ldots.$$


The mean and variance are given by $E[X] = 1/p$ and $V(X) = (1-p)/p^2$, respectively. The complement of the cdf is given by

$$P(X \ge k) = (1-p)^{k-1}.$$

Negative Binomial

The negative binomial distribution arises in situations where one has to wait until the $r$th success. For example, in a sequence of coin tosses (with $p = P(\text{success})$), the number of trials, $X$, until the $r$th head is thrown is a negative binomial random variable. A random variable $X$ is said to have a negative binomial pmf with parameters $p$ and $r$ if

$$f(x) = \binom{x-1}{r-1} p^r (1-p)^{x-r}, \quad x = r, r+1, \ldots.$$

The mean and variance are given by $E[X] = r/p$ and $V(X) = r(1-p)/p^2$, respectively.

Continuous Distributions

The continuous distributions covered here include the uniform, normal, exponential, and gamma.

Uniform

The uniform distribution arises in situations where intervals of equal length are equally probable. If $X$ is a random variable with a uniform pdf on $(a, b)$, then

$$f(x) = \frac{1}{b-a}, \quad a < x < b,$$

and $f(x) = 0$ elsewhere. When $a = 0$ and $b = 1$, we say $X$ is $U(0, 1)$. The $U(0, 1)$ distribution forms the basis for methods used in simulating data from other probability distributions.

Normal

The normal distribution, also known as the bell curve, is the most widely used distribution in statistics. It occurs naturally in many applications and is mathematically tractable. Its importance also stems from the fact that if a large number of independent identically distributed random variables are averaged, then, in the limit, the (suitably standardized) average converges to a normal random variable. This fact is known as the central limit theorem. A random variable $X$ is said to have a normal pdf with parameters $\mu$ and $\sigma$ if

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}, \quad -\infty < x < \infty,$$

where $-\infty < \mu < \infty$ and $0 < \sigma < \infty$. The mean and variance are given by $E[X] = \mu$ and $V(X) = \sigma^2$, respectively. Note that

$$\int_{-\infty}^{+\infty} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}\,dx = 1.$$

Definition 16 A random variable $X$ with cdf

$$F(x) = \int_{-\infty}^{x} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(t-\mu)^2/2\sigma^2}\,dt$$

is a normal random variable with parameters $\mu$ and $\sigma$. For example, with $\sigma = 3$, letting $Z = (X - \mu)/\sigma$ denote the standard normal variable,

$$P(|X - \mu| > 6) = P(|Z| > 2) = P(Z > 2) + P(Z < -2) = 0.0456.$$

The value 0.0456 is obtained from any $Z$ table.

Exponential

The exponential pdf often arises, in practice, as the distribution of the amount of time until some specific event occurs. Examples include the time until a new car breaks down and the time until an arrival at an emergency room. The exponential distribution arises frequently in the modeling of stochastic systems such as queueing and inventory systems. A random variable $X$ is said to have an exponential pdf with parameter $\lambda > 0$ if $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$, and $f(x) = 0$ elsewhere. For example, suppose the lifetime (in hours) of a computer is exponential with $\lambda = 1/100$. Then the probability that the computer will function between 50 and 150 h before breaking down is given by

$$P(50 \le X \le 150) = \int_{50}^{150} \frac{1}{100}\, e^{-x/100}\,dx = -e^{-x/100}\Big|_{50}^{150} = e^{-1/2} - e^{-3/2} \approx 0.38.$$
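The exponential interval probability above can be computed from the closed-form cdf $F(x) = 1 - e^{-\lambda x}$; a minimal sketch:

```python
from math import exp

def expo_prob(lam: float, a: float, b: float) -> float:
    """P(a <= X <= b) for X ~ Exponential(lam): F(b) - F(a), F(x) = 1 - e^(-lam*x)."""
    return exp(-lam * a) - exp(-lam * b)

p = expo_prob(1 / 100, 50, 150)        # the computer-lifetime example
assert abs(p - (exp(-0.5) - exp(-1.5))) < 1e-12
assert round(p, 2) == 0.38
```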


The exponential distribution is the only continuous distribution with the memoryless property, which we define here.

Definition 18 A nonnegative random variable $X$ is said to be memoryless if

$$P(X > h + t \mid X > t) = P(X > h) \quad \text{for all } h, t \ge 0.$$

Proposition 10 The exponential random variable has the memoryless property.

Suppose that an item has an exponential lifetime. The memoryless property says that, given the item has survived the first $t$ time units, the distribution of its remaining life is the same as the original exponential distribution; that is, it is unaffected by $t$ or, in other words, the item does not remember how long it has survived.
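Proposition 10 can be checked numerically from the survival function $P(X > x) = e^{-\lambda x}$; a small sketch (the rate and the grid of $h$, $t$ values are arbitrary illustrative choices):

```python
from math import exp, isclose

lam = 1 / 100  # same rate as the computer-lifetime example

def survival(x: float) -> float:
    """P(X > x) for X ~ Exponential(lam)."""
    return exp(-lam * x)

# Memoryless: P(X > h + t | X > t) = P(X > h) for all h, t >= 0
for t in (0.0, 10.0, 250.0):
    for h in (5.0, 100.0):
        conditional = survival(h + t) / survival(t)
        assert isclose(conditional, survival(h))
```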

Gamma

A random variable $X$ is said to have a gamma pdf with parameters $\lambda$ and $\alpha$ if

$$f(x) = \frac{\lambda e^{-\lambda x} (\lambda x)^{\alpha - 1}}{\Gamma(\alpha)}, \quad x \ge 0,$$

and $f(x) = 0$ for $x < 0$.

Theory of Statistics, Basics, and Fundamentals

Introduction to Probability

A probabilistic experiment is a random process whose outcome is an uncertain event belonging to the set of all possible outcomes $\Omega$; $\Omega$ is the sample space of the given random experiment. A number of useful operations can be defined on events (or sets of events) $A$ and $B$ defined on $\Omega$. The union of $A$ and $B$ is denoted by $A \cup B$ and consists of all the outcomes that belong to at least one of $A$ and $B$. The intersection of $A$ and $B$ is denoted by $A \cap B$ and consists of all the outcomes that belong to both $A$ and $B$. The difference of $A$ from $B$ is denoted by $A - B$ and consists of all the outcomes that belong to $A$ and do not belong to $B$. The complement of $A$ is denoted by $A^c$ and consists of all the outcomes in $\Omega$ that do not belong to $A$.

Given a probabilistic experiment with sample space $\Omega$, we define the probability function $P$ on the subsets of $\Omega$; it assigns a real number to each possible event, and this number represents the probability that the given event occurs. Following the approach of Kolmogorov (1933), the probability function $P$ must satisfy a number of axioms (see Billingsley (1995) for more details):
(i) $0 \le P(A) \le 1$ for any event $A$.
(ii) $P(\Omega) = P(A) + P(A^c) = 1$ for any event $A$ and its complementary event $A^c$.
(iii) $P(A \cup B) = P(A) + P(B)$ for any disjoint events $A$ and $B$ ($A \cap B = \emptyset$).

From (ii) we can derive that the probability of the complement of an event is equal to 1 minus the probability of the event: $P(A^c) = 1 - P(A)$.

It is possible to define the probability of an event given the occurrence of another event. If $P(B) > 0$, then

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \tag{1}$$

is called the conditional probability of $A$ given $B$. If $P(A \mid B) = P(A)$, we say that the events $A$ and $B$ are independent, since then

$$P(A \cap B) = P(A)\,P(B). \tag{2}$$

Conditional probabilities allow us to write Bayes' theorem (Bayes 1763):

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}. \tag{3}$$

In many problems, we are given $P(A \mid B_1), \ldots, P(A \mid B_k)$, where $B_1, \ldots, B_k$ are disjoint events with nonzero probability whose union is the sample space. In this case we can generalize the results above and obtain the law of total probability:

$$P(A) = \sum_{i=1}^{k} P(B_i)\,P(A \mid B_i), \tag{4}$$

and a more general form of Bayes' theorem:

$$P(B_j \mid A) = \frac{P(A \mid B_j)\,P(B_j)}{\sum_{i=1}^{k} P(B_i)\,P(A \mid B_i)}. \tag{5}$$
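Equations (4) and (5) can be sketched together in a few lines. The three-event partition below uses hypothetical numbers chosen only for illustration; the two-event special case reproduces the inspector example (Example 15) above.

```python
def posteriors(priors, likelihoods):
    """Eq. (5): P(B_j | A) for a partition {B_j}, given P(B_j) and P(A | B_j)."""
    total = sum(p * l for p, l in zip(priors, likelihoods))  # eq. (4): P(A)
    return [p * l / total for p, l in zip(priors, likelihoods)]

# Hypothetical three-event partition
post = posteriors([0.5, 0.3, 0.2], [0.9, 0.5, 0.1])
assert abs(sum(post) - 1.0) < 1e-12   # posteriors form a probability distribution

# Two-event special case: Example 15 gives P(G | C) = 0.6 / 0.68
g, _ = posteriors([0.6, 0.4], [1.0, 0.2])
assert abs(g - 0.6 / 0.68) < 1e-12
```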

Frequentist and Bayesian Statistics

The frequentist definition sees probability as the long-run expected frequency of occurrence of an event. For example, the probability of occurrence of an event $A$ is defined as $P(A) = n/N$, where $n$ is the number of times event $A$ occurs in $N$ opportunities. The Bayesian view relates probability to degree of belief: probability is a measure of the plausibility of an event given incomplete knowledge. To evaluate the probability of a hypothesis from a Bayesian viewpoint, it is necessary to specify some prior probability, which is then updated in light of the observed data through Bayes' theorem (Bernardo and Smith 1994).


Random Variables
A random variable X is a function that maps the sample space to the real line; formally, let X be defined on Ω: for each ω ∈ Ω, X(ω) is a real number. A random variable X is discrete if its range is a finite or countably infinite set of discrete values. The probability mass function of a discrete random variable is completely determined by specifying P(X = x) for all x: the relative frequency function of a discrete random variable X is defined by f(x) = P(X = x). A random variable X is continuous if it has a continuous sample space. A continuous random variable can therefore take on an uncountably infinite number of outcomes, which gives P(X = x) = 0 for any x. The probability distribution of a continuous random variable can be found by introducing the probability density function f(x), which has the properties f(x) ≥ 0 and ∫_{−∞}^{+∞} f(x) dx = 1, so that for −∞ < a < b < +∞ we can define

P(a ≤ X ≤ b) = ∫_a^b f(x) dx.    (6)

The cumulative distribution function (CDF) F of X is defined by

F(x) = P(X ≤ x) = P(ω ∈ Ω : X(ω) ≤ x).    (7)

If X is a discrete random variable, it can be written as

F(x) = Σ_{x_i ≤ x} P(X = x_i).    (8)

If X is a continuous random variable, it can be written as

F(x) = ∫_{−∞}^{x} f(t) dt.    (9)

The cumulative distribution function has the following properties:
• F is non-decreasing: if x ≤ y then F(x) ≤ F(y).
• F is a right-continuous function.
• lim_{x→−∞} F(x) = 0.
• lim_{x→+∞} F(x) = 1.
Suppose that X is a random variable with cumulative distribution function F. Then we have:
• P(a < X ≤ b) = F(b) − F(a).
• P(X ≤ a) = F(a).
• P(X > a) = 1 − F(a).
• If F is continuous, P(X = a) = 0.

Expected Value
Sometimes it is useful to summarize the distribution of X by certain characteristics of the distribution; one of these is the expected value E(X). The expected value E(X) of a random variable X is also called the mean or the expectation of X. If X is a discrete random variable with frequency function f(x), then the expected value of X is defined as

E(X) = Σ_x x f(x),    (10)

and the expected value of a function h of X is

E(h(X)) = Σ_x h(x) f(x).    (11)

If X is a continuous random variable with probability density function f(x), then the expected value of X is defined as

E(X) = ∫_{−∞}^{+∞} x f(x) dx,    (12)

and the expected value of a function h of X is

E(h(X)) = ∫_{−∞}^{+∞} h(x) f(x) dx.    (13)
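As a hedged illustration of the discrete case, Eqs. (10) and (11), the following sketch computes E(X) and E(X²) for a fair six-sided die (an assumed toy example, not from the text):

```python
# Expected value of a discrete random variable: a fair die with
# frequency function f(x) = 1/6 for x = 1, ..., 6.

f = {x: 1 / 6 for x in range(1, 7)}   # f(x) = P(X = x)

# E(X) = sum_x x f(x), Eq. (10)
mean = sum(x * p for x, p in f.items())

# E(h(X)) = sum_x h(x) f(x), Eq. (11), here with h(x) = x^2
second_moment = sum(x**2 * p for x, p in f.items())

print(mean)            # approximately 3.5
print(second_moment)   # approximately 91/6
```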


Variance and Standard Deviation
Let X be a random variable with μ = E(X); then the variance of X, denoted by Var(X), is defined to be

Var(X) = E[(X − μ)²].    (14)

If E(X²) < ∞, then Var(X) < ∞; if E(X²) = ∞, then we will define Var(X) = ∞. We can define the standard deviation of X to be

SD(X) = √Var(X).    (15)

There are a number of important properties of the variance and standard deviation of a random variable:
• Var(X) ≥ 0.
• Var(X) = 0 if, and only if, P(X = μ) = 1, where μ = E(X).
• Var(X) = E(X²) − μ².
• Var(aX + b) = a² Var(X) for any constants a and b, and so it follows that SD(aX + b) = |a| SD(X).

Moment-Generating Function
A useful tool for computing means and variances is the moment-generating function, which, when it exists, uniquely characterizes a probability distribution. Let X be a random variable; the moment-generating function of X,

m(t) = E(exp(tX)), t ∈ R,    (16)

can be defined if there exists a b > 0 such that m(t) < ∞ for |t| < b. m(0) always exists and is equal to 1. If m(t) is differentiable at zero, then it is possible to find the i-th moment, the expected value, and the variance of X. The i-th raw moment can be found by calculating the i-th derivative of m(t) at 0:

E(X^i) = m^(i)(0);    (17)

the expected value of X can be easily found by calculating the first derivative:

E(X) = m^(1)(0),    (18)

and the variance of X:

Var(X) = E(X²) − E(X)² = m^(2)(0) − (m^(1)(0))².    (19)

Multidimensional Random Variables and Multivariate Distributions
Suppose we have random variables X_1, ..., X_k defined on some sample space. The joint distribution function of a random vector X = (X_1, ..., X_k) is defined as

F(x_1, ..., x_k) = P(X_1 ≤ x_1, ..., X_k ≤ x_k),    (20)

where the event (X_1 ≤ x_1, ..., X_k ≤ x_k) is the intersection of the events (X_1 ≤ x_1), ..., (X_k ≤ x_k). Given the joint distribution function of a random vector X, we can determine P(X ∈ A) for any set A ⊂ R^k. Suppose that X_1, ..., X_k are discrete random variables defined on the same sample space. Then the joint probability function of X = (X_1, ..., X_k) is defined to be

f(x_1, ..., x_k) = P(X_1 = x_1, ..., X_k = x_k)    (21)

and has to satisfy the condition

Σ_{x_1, ..., x_k} f(x_1, ..., x_k) = 1.    (22)

If X_1, ..., X_k are discrete, then the joint frequency function must exist. Suppose that X_1, ..., X_k are continuous random variables defined on the same sample space and that

P(X_1 ≤ x_1, ..., X_k ≤ x_k) = ∫_{−∞}^{x_1} ... ∫_{−∞}^{x_k} f(t_1, ..., t_k) dt_1 ... dt_k    (23)

for all x_1, ..., x_k. f(x_1, ..., x_k) is the joint density function of (X_1, ..., X_k) if f(x_1, ..., x_k) ≥ 0 and

∫_{−∞}^{+∞} ... ∫_{−∞}^{+∞} f(x_1, ..., x_k) dx_1 ... dx_k = 1.    (24)
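The moment-generating relations of Eqs. (16)–(19) above can be checked numerically. The sketch below uses the Bernoulli(p) variable, whose moment-generating function is m(t) = (1 − p) + p·exp(t); p = 0.3 is an arbitrary toy choice, and the derivatives at 0 are approximated by finite differences:

```python
# Numerical check of Eqs. (16)-(19) for a Bernoulli(p) random variable.
import math

p = 0.3

def m(t):
    # Moment-generating function of Bernoulli(p): E(exp(tX)) = (1-p) + p*e^t
    return (1 - p) + p * math.exp(t)

h = 1e-4
m1 = (m(h) - m(-h)) / (2 * h)            # m'(0)  ~ E(X)   = p
m2 = (m(h) - 2 * m(0) + m(-h)) / h**2    # m''(0) ~ E(X^2) = p
var = m2 - m1**2                         # Eq. (19): Var(X) = p(1 - p)

print(m1, var)   # approximately 0.3 and 0.21
```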

Let X_1, ..., X_k be random variables defined on the same sample space. X_1, ..., X_k are said to be independent if the events (a_1 < X_1 ≤ b_1), ..., (a_k < X_k ≤ b_k) are independent for all a_i < b_i, i = 1, ..., k. In general, an infinite collection of random variables is independent if every finite collection of its random variables is independent. If X_1, ..., X_k are independent and have joint density (or frequency) function f(x_1, ..., x_k), then

f(x_1, ..., x_k) = Π_{i=1}^{k} f_i(x_i),    (25)

where f_i(x_i) is the marginal density (frequency) function of X_i. If X_1, ..., X_k are independent random variables with the same marginal distribution, we say that X_1, ..., X_k are independent, identically distributed (i.i.d.) random variables.

Expected Value
Suppose that X = (X_1, ..., X_k) is a vector of random variables defined on some sample space and let Y = h(X) for some real-valued function h. If X has a joint density or frequency function, we can define the expected value of Y as

E(Y) = E(h(X)) = Σ_x h(x) f(x)    (26)

if X has joint frequency function f(x), and

E(Y) = E(h(X)) = ∫_{−∞}^{+∞} ... ∫_{−∞}^{+∞} h(x) f(x) dx_1 ... dx_k    (27)

if X has joint density function f(x). Suppose that X_1, ..., X_k are random variables with finite expected values.
(a) If X_1, ..., X_k are defined on the same sample space, then

E(X_1 + ... + X_k) = Σ_{i=1}^{k} E(X_i).    (28)

(b) If X_1, ..., X_k are independent random variables, then

E(Π_{i=1}^{k} X_i) = Π_{i=1}^{k} E(X_i).    (29)

Covariance and Correlations
To evaluate the linear relationship between two random variables, it is possible to use the covariance. Suppose X and Y are random variables with E(X²) and E(Y²) both finite, and let μ_X = E(X) and μ_Y = E(Y). The covariance between X and Y is

Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)] = E(XY) − μ_X μ_Y.    (30)

If two variables are independent, then Cov(X, Y) = 0; but Cov(X, Y) = 0 does not necessarily imply that X and Y are independent, in which case we simply say that they are uncorrelated. If Cov(X, Y) > 0, X and Y are positively correlated; if Cov(X, Y) < 0, X and Y are negatively correlated. Using the properties of expected values, it is quite easy to derive the following properties. For any constants a, b, c, and d,

Cov(aX + b, cY + d) = a c Cov(X, Y).    (31)

Suppose that X_1, ..., X_k are random variables with E(X_i²) < ∞ for all i. Then

Var(Σ_{i=1}^{k} a_i X_i) = Σ_{i=1}^{k} a_i² Var(X_i) + 2 Σ_{j=2}^{k} Σ_{i=1}^{j−1} a_i a_j Cov(X_i, X_j).    (32)
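Equations (30)–(32) can be verified exactly on a small finite joint distribution. The following sketch uses a made-up joint distribution of (X_1, X_2) and compares Var(a_1 X_1 + a_2 X_2) computed directly with the value given by Eq. (32):

```python
# Exact check of Eqs. (30)-(32) on a toy joint distribution given as a
# dict mapping (x1, x2) -> probability.
joint = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.4}

def E(h):
    """Expected value of h(x1, x2) under the joint distribution, Eq. (26)."""
    return sum(p * h(x1, x2) for (x1, x2), p in joint.items())

mean1, mean2 = E(lambda a, b: a), E(lambda a, b: b)
var1 = E(lambda a, b: (a - mean1) ** 2)
var2 = E(lambda a, b: (b - mean2) ** 2)
cov = E(lambda a, b: (a - mean1) * (b - mean2))        # Eq. (30)

# Var(a1*X1 + a2*X2) computed directly and via Eq. (32):
a1, a2 = 2.0, -1.0
direct = E(lambda a, b: (a1 * a + a2 * b - (a1 * mean1 + a2 * mean2)) ** 2)
formula = a1**2 * var1 + a2**2 * var2 + 2 * a1 * a2 * cov

print(abs(direct - formula) < 1e-9)   # True
```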

The covariance is a measure of the linear association between two random variables, but


its value is dependent on the scale of the two random variables. The correlation is a measure of linear association invariant to linear transformations; it measures the degree to which we may approximate one random variable by a linear function of another. The correlation between X and Y is

Corr(X, Y) = Cov(X, Y) / [Var(X) Var(Y)]^{1/2}.    (33)

The correlation can take only values between −1 and 1. In particular, Corr(X, Y) = 1 if, and only if, Y = aX + b for some a > 0, and Corr(X, Y) = −1 if, and only if, Y = aX + b for some a < 0. If X and Y are independent random variables (with E(X²) and E(Y²) finite), then Corr(X, Y) = 0. However, as with the covariance, a correlation of 0 does not imply independence. The invariance to linear transformations gives that if U = aX + b and V = cY + d, then

Corr(U, V) = Corr(X, Y)    (34)

if a and c have the same sign; if a and c have different signs, then Corr(U, V) = −Corr(X, Y).

Conditional Distributions
We are often interested in the probability distribution of random variables given knowledge of some event, or set of events, A. If P(A) > 0, we can define conditional distributions, conditional density functions (marginal and joint), and conditional frequency functions using the definition of conditional probability,

P(X_1 ≤ x_1, ..., X_k ≤ x_k | A) = P(X_1 ≤ x_1, ..., X_k ≤ x_k, A) / P(A).    (35)

If we have discrete random variables, the conditional frequency function of X_1, ..., X_j given the set of events X_{j+1} = x_{j+1}, ..., X_k = x_k can be defined as

f(x_1, ..., x_j | x_{j+1}, ..., x_k) = P(X_1 = x_1, ..., X_j = x_j | X_{j+1} = x_{j+1}, ..., X_k = x_k)
  = P(X_1 = x_1, ..., X_j = x_j, X_{j+1} = x_{j+1}, ..., X_k = x_k) / P(X_{j+1} = x_{j+1}, ..., X_k = x_k),    (36)

which is simply the joint frequency function of X_1, ..., X_k divided by the joint frequency function of X_{j+1}, ..., X_k.
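As a hedged illustration of Eq. (36), the sketch below computes the conditional frequency function of X_1 given X_2 from a made-up joint frequency table:

```python
# Conditional frequency function f(x1 | x2) from a toy joint frequency table.
joint = {(0, 0): 0.2, (0, 1): 0.3, (1, 0): 0.1, (1, 1): 0.4}

def conditional(x1, x2):
    """f(x1 | x2) = f(x1, x2) / f_X2(x2), Eq. (36)."""
    marginal_x2 = sum(p for (a, b), p in joint.items() if b == x2)
    return joint[(x1, x2)] / marginal_x2

print(conditional(0, 1))                        # 0.3 / 0.7
print(conditional(0, 1) + conditional(1, 1))    # approximately 1.0
```

As expected, the conditional frequencies over x_1 sum to 1 for each fixed x_2.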

Common Distributions

Binomial and Bernoulli Distributions
A discrete random variable X is said to have a binomial distribution with parameters n and θ (X ~ Bin(n, θ)) if its frequency function is given by

f(x) = (n choose x) θ^x (1 − θ)^{n−x}  for x = 0, 1, ..., n.    (37)

When n = 1, X has a Bernoulli distribution with parameter θ (X ~ Bern(θ)). The Bernoulli distribution describes all situations where a random experiment results in either "success" or "failure". As a consequence, the binomial distribution is based on the repetition of a Bernoulli experiment and counts the number X of successes obtained.

Normal Distribution
A continuous random variable X is said to have a normal distribution with parameters μ and σ² (X ~ N(μ, σ²)) if its density is

f(x) = (1 / (σ √(2π))) exp(−(x − μ)² / (2σ²)).    (38)
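The binomial frequency function of Eq. (37) can be evaluated directly with the standard library; n and θ below are toy values chosen for illustration:

```python
# Binomial frequency function, Eq. (37), using math.comb for (n choose x).
import math

def binom_pmf(x, n, theta):
    return math.comb(n, x) * theta**x * (1 - theta) ** (n - x)

n, theta = 10, 0.4
pmf = [binom_pmf(x, n, theta) for x in range(n + 1)]

print(abs(sum(pmf) - 1.0) < 1e-9)   # True: the pmf sums to 1
print(binom_pmf(1, 1, theta))       # 0.4: Bern(theta) is the n = 1 case
```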


When μ = 0 and σ² = 1, X is said to have a standard normal distribution; we will denote the distribution function of X ~ N(0, 1) by

Φ(x) = (1/√(2π)) ∫_{−∞}^{x} exp(−t²/2) dt.    (39)

If X ~ N(μ, σ²), then its distribution function is

F(x) = Φ((x − μ)/σ).    (40)
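A hedged computational sketch: the standard normal CDF of Eq. (39) has no closed form, but it can be expressed through the error function in the standard library as Φ(x) = (1 + erf(x/√2))/2, and Eq. (40) then gives the CDF of any normal variable:

```python
# Standard normal CDF via math.erf, and the general normal CDF of Eq. (40).
import math

def phi(x):
    """Phi(x) = (1 + erf(x / sqrt(2))) / 2, the N(0, 1) CDF of Eq. (39)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def normal_cdf(x, mu, sigma):
    """F(x) = Phi((x - mu) / sigma), Eq. (40)."""
    return phi((x - mu) / sigma)

print(phi(0.0))                    # 0.5 by symmetry
print(normal_cdf(3.0, 3.0, 2.0))   # 0.5 as well: x equals the mean
```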

Uniform Distribution
Suppose that X is a continuous random variable with density function

f(x) = (b − a)^{−1} for a ≤ x ≤ b, and 0 otherwise.    (41)

X is said to have a uniform distribution on the interval [a, b] and is usually denoted as X ~ Unif(a, b). The distribution function is given by F(x) = (x − a)/(b − a) for a ≤ x ≤ b (F(x) = 0 for x < a, and F(x) = 1 for x > b). The uniform distribution assigns equal probability over a given range or set of possible events. One of the most important applications of the uniform distribution is in the generation of random numbers.
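As a hedged sketch of that last point, a standard uniform draw U ~ Unif(0, 1) can be mapped onto Unif(a, b) by the transform a + (b − a)·U, which is the idea behind typical uniform random-number generation; a and b below are toy values:

```python
# Uniform distribution on [a, b]: CDF and random-number generation.
import random

def uniform_cdf(x, a, b):
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)

random.seed(42)                          # arbitrary seed for reproducibility
a, b = 2.0, 5.0
sample = a + (b - a) * random.random()   # same idea as random.uniform(a, b)

print(a <= sample <= b)        # True
print(uniform_cdf(3.5, a, b))  # 0.5: the midpoint of [2, 5]
```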

Exponential Distribution
A continuous random variable X is said to have an exponential distribution with parameter λ > 0 (X ~ Exp(λ)) if its density function is

f(x) = λ exp(−λx) for x ≥ 0, and 0 otherwise.    (42)

Its distribution function is

F(x) = 1 − exp(−λx) for x ≥ 0.    (43)

The exponential distribution is memoryless:

P(X > x + t | X > t) = P(X > x)  ∀ x, t ≥ 0.    (44)

This means that the conditional probability that we need to wait more than another x units of time before the first occurrence, given that the first occurrence has not yet happened after t units of time, is equal to the probability that we need to wait more than x units of time for the first occurrence.

Multivariate Normal Distribution
Suppose that X_1, ..., X_p are i.i.d. normal random variables with mean 0 and variance 1. Then the joint density function of X = (X_1, ..., X_p)^T is

f(x) = (1/(2π)^{p/2}) exp(−x^T x / 2).    (45)

The random vector X has a standard multivariate (or p-variate) normal distribution. Let A be a p × p matrix and μ = (μ_1, ..., μ_p)^T a vector of length p. Given a


standard multivariate normal random vector X, define

Y = μ + AX.    (46)

We say that Y has a multivariate normal distribution with mean vector μ and variance-covariance matrix C = AA^T (Y ~ N_p(μ, C)). Note that C = AA^T is a symmetric, nonnegative definite matrix (i.e., C^T = C and v^T C v ≥ 0 for all vectors v). If A is an invertible matrix, then the joint density of Y exists and is given by

f(y) = (1 / ((2π)^{p/2} |det(C)|^{1/2})) exp(−(y − μ)^T C^{−1} (y − μ) / 2).    (47)
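As a hedged illustration of the covariance structure in Eq. (46), the sketch below computes C = AA^T for a made-up 2 × 2 matrix A and checks that C is symmetric and that one quadratic form v^T C v is nonnegative; the pure-Python helpers keep it self-contained:

```python
# For Y = mu + A X, the covariance matrix is C = A A^T: symmetric and
# nonnegative definite. A is an arbitrary toy matrix.

A = [[2.0, 0.0],
     [1.0, 3.0]]

def matmul_T(M):
    """Compute C = M M^T for a square matrix M."""
    n = len(M)
    return [[sum(M[i][k] * M[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

C = matmul_T(A)

symmetric = all(C[i][j] == C[j][i] for i in range(2) for j in range(2))
v = [1.0, -1.0]                                   # an arbitrary test vector
quad_form = sum(v[i] * C[i][j] * v[j] for i in range(2) for j in range(2))

print(C)                # [[4.0, 2.0], [2.0, 10.0]]
print(symmetric)        # True
print(quad_form >= 0)   # True for this v; holds for every v since C = A A^T
```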

Note that C^{−1} exists since A^{−1} exists.

Chi-Square Distribution
Let X ~ N_p(0, I) and define V = X^T X = ||X||². The random variable V is said to have a χ² (chi-square) distribution with p degrees of freedom and is usually denoted as V ~ χ²(p).

t Distribution
Let Z ~ N(0, 1) and V ~ χ²(n) be independent random variables. Define T = Z / √(V/n); the random variable T is said to have Student's t distribution with n degrees of freedom and is usually denoted as T ~ T(n).

F Distribution
Let V ~ χ²(n) and W ~ χ²(m) be independent random variables. Define F = (V/n) / (W/m); the random variable F is said to have an F distribution with n and m degrees of freedom and is usually denoted as F ~ F(n, m).

Estimators
A statistical estimator θ̂ (or simply an estimator) of an unknown parameter θ for a sample X_1, ..., X_n is a function θ̂ = θ̂(X_1, ..., X_n) depending only on the sample X_1, ..., X_n. An estimator is a random variable and varies depending on the sample (Lehmann and Casella 1998). The difference E[θ̂] − θ is called the bias of the estimator θ̂ of the parameter θ. The estimator θ̂ is said to be unbiased if its expectation is equal to the parameter to be estimated, i.e., if E[θ̂] = θ; otherwise, the estimator θ̂ is said to be biased. If an estimator is not unbiased, then it either overestimates or underestimates θ. In both cases, this results in systematic errors of the same sign in the estimate of the parameter θ. If E[θ̂_n] → θ as n → ∞, then the estimator θ̂ is said to be asymptotically unbiased (Lehmann 1951).

Sample Mean
The sample mean of a random sample X_1, ..., X_n is defined as

X̄ = μ̂ = (1/n) Σ_{i=1}^{n} X_i.    (48)

The sample mean μ̂ is an unbiased, consistent estimator of the population expectation E[X] = μ. If the population variance σ² exists, then the sample mean μ̂ is asymptotically normally distributed with parameters (μ, σ²/n). The sample mean for the function Y = f(X) of a random variable X is

Ȳ = (1/n) Σ_{i=1}^{n} f(X_i).    (49)

Sample Variances
The quantity

σ̂² = (1/n) Σ_{i=1}^{n} (X_i − X̄)²    (50)

is called the sample variance of the sample X_1, ..., X_n and has the properties of asymptotic unbiasedness, E[σ̂²] = ((n−1)/n) σ², and consistency.
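The bias of the sample variance can be checked exactly for a small case. The sketch below enumerates all i.i.d. samples of size n = 2 from a made-up three-point distribution and verifies E[σ̂²] = ((n−1)/n)σ²:

```python
# Exact check of the bias of the sample variance, Eq. (50), by enumerating
# every ordered i.i.d. sample of size n from a toy discrete distribution.
from itertools import product

dist = {0: 0.5, 1: 0.3, 4: 0.2}                  # toy frequency function
mu = sum(x * p for x, p in dist.items())
sigma2 = sum((x - mu) ** 2 * p for x, p in dist.items())

n = 2
expected_s2 = 0.0
for sample in product(dist, repeat=n):           # all ordered samples
    prob = 1.0
    for x in sample:
        prob *= dist[x]
    xbar = sum(sample) / n
    s2 = sum((x - xbar) ** 2 for x in sample) / n    # Eq. (50)
    expected_s2 += prob * s2

print(abs(expected_s2 - (n - 1) / n * sigma2) < 1e-9)   # True
```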


The quantity

s̄² = (1/(n−1)) Σ_{i=1}^{n} (X_i − X̄)²    (51)

is called the adjusted sample variance, and s̄ is called the sample mean square deviation of the sample X_1, ..., X_n. The s̄² is an unbiased estimator of the variance σ²: E[s̄²] = σ² (Casella and Berger 2002).

Linear Models
The general form of the linear model can be written as follows:

Y_i = β_0 + β_1 x_{i1} + ... + β_p x_{ip} + ε_i = x_i^T β + ε_i  (i = 1, ..., n),    (52)

where x_i = (1, x_{i1}, ..., x_{ip})^T is a vector of known constants called covariates or predictors, β = (β_0, β_1, ..., β_p)^T is a vector of unknown parameters, and ε_1, ..., ε_n are i.i.d. normal random variables with mean 0 and unknown variance σ². Alternatively, we can define Y_1, ..., Y_n as independent normal random variables with E(Y_i) = x_i^T β and Var(Y_i) = σ². To find an estimate β̂ of the vector of parameters β, we can consider the residual sum of squares:

S² = Σ_{i=1}^{n} (y_i − x_i^T β)².    (53)

The value β̂ that minimizes S² is called the least squares estimate of β and yields an estimator having the smallest variance among all the unbiased estimates of β. For large values of n, the sampling distribution of β̂ can be approximated by a normal distribution with mean β and covariance matrix σ² (X^T X)^{−1}, where X is the n × p matrix containing the n vectors x_i. This approximation is exact when the error terms ε_i have a normal distribution conditional on their x_i's. From the sampling distribution, it is possible to construct confidence intervals and hypothesis tests for the estimate β̂.

Cross-References
▶ Probabilistic Analysis
▶ Probabilistic Logic and Relational Models
▶ Regression Analysis
▶ Spatial Statistics
▶ Theory of Probability, Basics and Fundamentals

References
Bayes T (1763) An essay towards solving a problem in the doctrine of chance. Philos Trans R Soc 53:370–418
Bernardo JM, Smith AFM (1994) Bayesian theory. Wiley, New York
Billingsley P (1995) Probability and measure, 3rd edn. Wiley, New York
Casella G, Berger R (2002) Statistical inference. Duxbury/Pacific Grove, Thomson Learning
Kolmogorov AN (1933) Grundbegriffe der Wahrscheinlichkeitsrechnung. Springer, Berlin
Lehmann EL (1951) A general concept of unbiasedness. Ann Math Stat 22(4):587–592
Lehmann EL, Casella G (1998) Theory of point estimation (Springer texts in statistics), 2nd edn. Springer, New York

Ties
▶ Mapping Online Social Media Networks

Time
▶ Spatiotemporal Footprints in Social Networks

Time-Aggregated Graphs
▶ Temporal Networks


Time- and Event-Driven Modeling of Blogger Influence Nitin Agarwal1, Debanjan Mahata1 , and Huan Liu2 1 Department of Information Science, University of Arkansas at Little Rock, Little Rock, AR, USA 2 Data Mining and Machine Learning Lab, School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, Tempe, AZ, USA

Synonyms Authority; Centrality; Impact; Prestige; Status

Glossary
Blog or a Blog Site A blog can be defined as a website that displays, in reverse chronological order, the entries by one or more individuals and usually has links to comments on specific postings. Blogs often provide opinions, commentaries, or news on a particular subject, such as food, politics, or local news; some function more like personal online diaries. Blogs often archive the old entries and keep them accessible. RSS or XML feeds of blogs are made available by the blogging platforms for convenient syndication
Blog Post An entry in a blog is called a blog post. A typical blog post can combine text, images, and links to other blogs, web pages, and other media related to its topic
Blogroll Some blogs provide a list of links to similar or related blogs. Such a list of links is called a blogroll
Blogger An individual who authors a blog post is referred to as a blogger
Blogging The act of updating a blog (adding an entry) to the blog website is known as "blogging" (Gill 2004)


Blogosphere The universe of all the blogs is often referred to as the blogosphere
Influence and Influential The capacity to have an effect on the character, development, or behavior of someone or something, or the effect itself, is known as influence, and someone or something possessing the capability to have the effect is known as an influential
Collective Action Collective action is defined as an action aiming to improve the group's conditions (such as status or power), which is enacted by a representative of the group. The action benefits multiple members of the group but has an associated cost which is impossible for a single member to undertake. Hence, the action is undertaken collectively to share the cost
Information Cascade An information (or informational) cascade occurs when there is a flow of information from one individual to another such that the receiver is compelled by the influence of the transmitter of the information to accept and adopt the information. The receivers and transmitters could be individuals, groups of individuals, or sources like blogs and web pages
Power Law A power law is a mathematical relationship between two quantities. If the frequency (with which an event occurs) varies as a power of some attribute of that event (e.g., its size), the frequency is said to follow a power law
Social Network A social network is a social structure made up of a set of actors (such as individuals or organizations) and ties between these actors
Social Media Media designed to
– Support democratization of the content, transforming people from content consumers to content producers
– Leverage Internet and web-based technologies to transform monologues into dialogues
– Disseminate information through interactions
– Create highly accessible, scalable, and dynamic publishing techniques
Splog A spam blog


Introduction
The widespread use of social media has turned former mass information consumers into information producers. Social media websites include blogs, wikis, collaborative tagging, media sharing, and other such services. Most of these websites facilitate interconnections between the users and encourage sharing of information. This makes it easier for social media users to discover new information and get influenced by the other users (peers) with whom they are connected. This article discusses the phenomenon of influence in blogs. In the physical world, 83 % of the people prefer consulting family, friends, or an expert over traditional advertising before trying a new restaurant; 71 % of the people do the same before buying a prescription drug or visiting a place; and 61 % of the people talk to family, friends, or an expert before watching a movie (Keller and Berry 2003). In short, before people make decisions, they talk and they listen to others' experiences, opinions, and suggestions. The virtual world of social media shows similar characteristics, with a growing trend of trusting online friends and experts. A recent report (Bazaarvoice 2012) suggests that over half of Millennials (consumers aged 18–34) trust the opinions of strangers in online forums over those of friends and family. The individuals whose experiences, opinions, and suggestions are sought after are aptly termed the influentials (Keller and Berry 2003). Before we go further, it is essential to introduce the concept of a blog and its related terminologies. A "blog" is a website where the entries by individuals are displayed in reverse chronological order. A typical blog can combine text, images, videos, and links to other blogs and to web pages. These entries can be blog posts or follow-up comments linked to specific posts. The blogosphere is the virtual universe that contains all blogs.
Blogging has become a popular choice for many to serve their interpersonal communication needs. Blogs act as a platform for masses to share their likes and dislikes, voice their opinions, provide


suggestions, and report news. The blog writers, also known as bloggers, loosely form special interest communities where they debate and discuss issues, spread awareness, gather support, and organize and mobilize campaigns – utilizing and in many ways demonstrating the democratic nature of the Internet. These bloggers could vary from hobbyists to full-time professionals (Technorati 2011). The highly interactive and informal nature of the blogosphere helps in creating new relationships and in enhancing existing ones (Stefanone and Jang 2008), making the blogosphere one of the most lucrative platforms for online users to seek others' opinions and the place where the influentials can thrive. The identification of influential bloggers could be beneficial for developing innovative business opportunities (Onishi and Manchanda 2012), forging political agendas (Davis 2009), discussing societal issues (Kumar et al. 2009), and leading to many interesting applications. For example, the influentials are often market movers and can influence buying decisions of fellow bloggers. Identifying the influential bloggers can help companies better understand the key concerns and new trends about products interesting to them. With additional information and consultation, the influential bloggers could act as unofficial spokesmen for their brands. It has also become commonplace to find fashion bloggers in the front row of runway shows with the rise of a new breed of bloggers known as "beauty bloggers" (http://www.nytimes.com/2013/02/07/fashion/at-fashion-shows-more-beauty-bloggers-skin-deep.html). According to a report from Technorati (2011), 59 % of professional part-time bloggers and 66 % of professional full-time bloggers have been approached to write about or review products. It is also reported that while making brand decisions, Millennials are 247 % more likely to be influenced by blogs or social networking sites (Symphoni IRI Group 2012).
As representatives of communities, the influential bloggers could sway opinions in political campaigns and affect reactions to government policies (Davis 2009). Tapping the


influentials can further help understand the changing interests, foresee potential pitfalls and likely gains, and adapt plans timely and proactively (not just reactively). A blogger may not always be equally influential. Further, the blogosphere is growing at a rate of three million blogs per month (Technorati 2011). As new bloggers join the blogosphere and existing ones drop out over time, the likelihood that a new set of influential bloggers exists increases. Thus, the extremely dynamic landscape of the blogosphere necessitates the consideration of the temporal aspect while studying the influential bloggers (Akritidis et al. 2011). This waxing and waning of influence over time adds a huge complexity to an already challenging problem of identifying the influentials and also adds a new dimension to the study. An equally interesting problem is to identify the influentials in the networks of diffusion. Tracking information flow between the bloggers can help in identifying creators and curators of information. This may further result in identifying who influences whom in the blogosphere. Such aspects are explored later in this article. The blogosphere has also become a popular platform for disseminating information, discussing, organizing, and coordinating different types of real-life events. These events might be launching of new products (Onishi and Manchanda 2012), political campaigns (Adamic and Glance 2005), sociopolitical movements (Agarwal et al. 2012), natural disasters (http://www.australianscience.com.au/news/sandysaftermath), and personal and social events (Gordon and Swanson 2009). The bloggers provide live commentaries on real-world events, promoting citizen journalism all over the world. Often, mainstream media relies on blogs for reporting first-hand accounts of an event (Ekdale et al. 2007). Identifying the influential bloggers blogging about these events could help in gleaning insights into the event.
The views and opinions expressed by the influential bloggers can also motivate online and offline reactions of the people towards the events. Tapping such influentials could help in identifying the opinion leaders, studying the opinion diffusion among the


bloggers, and understanding the manifestation of collective actions in real-world events (Agarwal et al. 2012). Such aspects are explored later in this article. In the following sections, we investigate the historical background of the influential bloggers. We discuss the challenges pertaining to their study and explain the scientific fundamentals for studying time- and event-driven influence in the blogosphere. We then review the key research findings from the scientific literature and explore the key real-world applications using the concept of influence in the blogosphere. References to different datasets and tools are provided for interested readers for studying the significance and challenges of identifying influentials in the blogosphere.

Historical Background The history of influence goes back to the study of prominence, role, position, and prestige of social actors in a social network. A social network is a network between “actors” (denoted by a node in a graph) with links between them known as “ties” (denoted by the edges between the nodes in a graph). Measuring influence has been a topic of keen interest for social scientists for a long time and has its roots in the concepts of “centrality” (Bavelas 1948; Freeman 1979; Borgatti and Everett 2006) and “prestige” or “status” (Moreno 1934; Zeleny 1940; Proctor and Loomis 1951) in the social network literature. Building upon the concepts from social networks, similar measures and methodologies were developed for ranking web pages (Kleinberg 1999; Page et al. 1999). The problem of measuring the influence of blogs became eminent recently with the increasing popularity of blogs (Gill 2004). The blogosphere can also be considered as a network of blogs, and the influential blogs could be considered as prominent nodes in the network who are authoritative and possess the capability to affect peoples’ opinions, choices, and attitudes (Agarwal et al. 2008). Several studies have been conducted for measuring bloggers’ and blog


posts’ influence over each other and the spread of influence across the network of blogs (Java et al. 2006; Gruhl et al. 2004; Richardson and Domingos 2002). However, Agarwal et al. (2008) were among the pioneering works to develop a model to quantify or measure the influence of a blogger. There are several studies that have further explored specific aspects of influence in blogs including model robustness, effects of time and events on influence, topical significance on influence, and influence and social contagion among others.

Challenges
The study of influential bloggers is subject to various challenges. The additional dimensions of time and event make the problem even more complex. The major challenges underlying such studies, as noted by researchers and practitioners, are summarized below:
• Quantifying influence: One of the biggest challenges is to define the influence of a blogger, as the concept of influence is highly subjective, depending on the need for identifying them. The existence of the influentials in the blogosphere, as in the real world, and their distinguishing characteristics are very difficult to comprehend and quantify into a measure.
• Differentiating influential from active bloggers: There is often a dilemma in distinguishing influential bloggers from the voluble or active bloggers. Studies have shown that the active bloggers are not always the influentials (Agarwal et al. 2008). On the other hand, there are studies that take into account the productivity of the bloggers (assessed by their blogging activity) in order to measure influence (Akritidis et al. 2011).
• Absence of ground truth: As there are no benchmark datasets studying blogger influence, the problem cannot be posed as a supervised machine learning task. One has to solely depend on the statistics collected for an individual or a group of bloggers in order to create a model for identifying the influentials. Such a methodology raises questions about the


validation of the models due to the absence of ground truth. Such methodologies are also dependent upon the platform from which the statistics are collected and are hence less adaptive. Given such a model, it could be very difficult to tune it to identify the influential bloggers for different purposes.
• Dynamic nature of the blogosphere: Due to the highly dynamic nature of the blogosphere and the limitations imposed by blogging platforms for data collection, it is extremely challenging to capture and track the temporal aspects of the bloggers' influence. Furthermore, the aperture of the time window is a critical factor in the estimation of a blogger's influence and its longitudinal analysis.
• Missing links: Due to the casual environment of the blogosphere, not many bloggers cite the original sources they refer to. In the absence of explicit links, the underlying network of the bloggers could be unknown. It is essential to have complete information about who influences whom, as a single missing link could be misleading.
• Evolving influentials: The evolving interests of the bloggers and often-inconsistent opinions require constantly looking for the influentials. A blogger once influential for an event may not be influential any longer, or he could be influential for another event. Further, events have different characteristics, which requires determining the ideal set of parameters for different events. This poses a great challenge for the model stability for a real-life event.
• Sparse link structure and colloquial text: The informal nature of the blogosphere, sparse link structure, presence of splogs, colloquial usage of language, different languages used for blogging, and differences in various blogging platform structures present additional difficulties in studying influence of the bloggers.
The different studies conducted in identifying the influential bloggers have taken into consideration one or more of these challenges.
We highlight the endeavors of various researchers in overcoming these challenges as we discuss their major research findings in the Key Research Findings section of this article. Next, we present the underlying scientific fundamentals for studying time- and event-driven influence in the blogosphere.

Scientific Fundamentals
Time- and event-driven influence in the blogosphere has been studied leveraging the following fundamental principles:
• Behavioral aspects of bloggers
• Models for diffusion of information
We elaborate on these fundamentals next.

Behavioral Aspects of Bloggers
Bloggers interact with each other by posting, adding comments, and linking to other posts, forming virtual relationships among themselves. These blogging behaviors induce a network among the bloggers that has been studied in order to understand influence in the blogosphere (Agarwal et al. 2008; Akritidis et al. 2011; Li et al. 2011; Lim et al. 2011; Kwon et al. 2009). The behaviors are explained below.
• Inlinks and Outlinks: The inlinks and outlinks to and from a blog post, along with their time information, have been taken as indicators of the influence of that blog post. The number of inlinks has been taken as an indication of the recognition of a blog post, and the number of outlinks as an indication of its novelty.
• Comments: The number of comments and their age are good indicators of the activity that a blog post generates. A large number of comments indicates that the post affects many others who care to write a comment about it and can be a good measure of the influence the post has on others.
• Trackbacks/Scraps: Trackbacks allow bloggers to know when another blogger refers to their blogs. Sometimes, a blogger may fully or partly copy another blogger's post; such an action is known as scraping. All these activities lead to the dissemination of posts, views, and opinions, representing explicit and implicit influence of bloggers on one another.


• Bookmarks or Followers: Bookmarks are also indicators of influence, as they explicitly show the interest of a blogger towards another blog. Bookmarks also allow bloggers to follow others, notifying them whenever there is a new post.
• Time of Posting: The time of posting a blog post is very important for studying the temporal aspect of influence. The difference between the time at which influence is to be measured and the time of the post plays a vital role in determining the present influence of the post.
• Blog Text: Researchers have performed linguistic analysis on blog text to assess influence. The eloquence of an author, assessed from the blog post text, has been demonstrated as an indicator of influence. The textual content has also been analyzed to identify topics, keywords, and named entities for studying real-life events.
The above set of properties may not be exhaustive for modeling influential bloggers. Other properties, like the number of times a blog post is read or the number of times it is shared on other social networking websites, can also be considered for measuring influence in the blogosphere. Next, we discuss influence assessment models based on the diffusion of information.

Diffusion of Information
Writing about new information, linking to it, and commenting on it are very important activities for bloggers. Such activities portray these bloggers as authoritative and help them influence other bloggers. Other bloggers may come to know about the posts of these authoritative bloggers at a later time, which may influence them to write posts on similar topics. The bloggers may or may not provide explicit links to the actual blog posts by which they were influenced, resulting in the concepts of explicit and implicit links, respectively. All these activities in the blogosphere result in the diffusion of information between bloggers over time.
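Illustratively, the behavioral properties listed above (inlinks, outlinks, comments, posting time) can be collected into a simple per-post record and combined into a toy influence score. The field names, weights, and decay factor below are invented for illustration and are not taken from any published model:

```python
from dataclasses import dataclass

@dataclass
class PostStats:
    """Assumed per-post statistics drawn from the behavioral signals above."""
    inlinks: int      # recognition: how many posts link here
    outlinks: int     # novelty: fewer outlinks suggests more original content
    comments: int     # activity the post generates
    age_days: float   # time since posting

def naive_influence(post, w_in=1.0, w_com=0.5, w_out=0.25, decay=0.01):
    """Toy score: reward inlinks and comments, penalize outlinks, and
    decay the score with the age of the post (illustrative weights only)."""
    raw = w_in * post.inlinks + w_com * post.comments - w_out * post.outlinks
    return raw / (1.0 + decay * post.age_days)

# Identical statistics, but the older post scores lower due to time decay
fresh = PostStats(inlinks=10, outlinks=2, comments=8, age_days=1)
stale = PostStats(inlinks=10, outlinks=2, comments=8, age_days=365)
```

Real systems would of course learn or calibrate such weights rather than fix them by hand; the sketch only shows how the behavioral signals can enter a single time-aware score.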
The creators of new information are often the influentials, also known as opinion leaders. These influentials play a key role in


the spread of information and opinions in blog networks. A vast majority of the work on measuring time- and event-driven influence of bloggers comes from the study of such diffusion of information across blog networks and from identifying the creators of new information. The basic models of diffusion are the (a) linear threshold model (Granovetter 1978), (b) independent cascade model (Goldenberg et al. 2001), and (c) generalized cascade model (Kempe et al. 2003). According to Lim et al. (2011), due to the independent relationships between bloggers, the independent cascade model is the most suitable one for modeling information diffusion in the blogosphere. When applying these models to the blogosphere, the initial set of active nodes may be taken as the initial set of influential blogs or bloggers, the opinion leaders, who influence their neighboring bloggers. As time passes, more bloggers get influenced by each other. For example, the above-mentioned models could be used to study the spread of information about a product in the blogosphere. An influential set of bloggers who might have bought a product or are interested in it may write about the product and act as an initial set of active nodes capable of spreading the information about it. Neighboring bloggers may get influenced by their writings according to the models, resulting in the spread of product information in the blogosphere. Next, we present the key research findings in the area of studying time- and event-driven influence in the blogosphere, based on the scientific fundamentals explained in this section.
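As a concrete illustration of the independent cascade dynamics described above, consider a minimal simulation. The blogger network and the uniform activation probability are invented for illustration; real studies estimate per-edge probabilities from data:

```python
import random

def independent_cascade(graph, seeds, prob=0.1, rng=None):
    """Simulate one run of the independent cascade model.

    graph: dict mapping a blogger to the bloggers they can influence
    seeds: initially active bloggers (the assumed opinion leaders)
    prob:  assumed uniform activation probability per edge
    """
    rng = rng or random.Random(42)
    active = set(seeds)        # all bloggers influenced so far
    frontier = list(seeds)     # bloggers who became active in the last step
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbor in graph.get(node, []):
                # each newly active blogger gets exactly one chance
                # to activate each of its inactive neighbors
                if neighbor not in active and rng.random() < prob:
                    active.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return active

# Hypothetical blog network: edges point from an influencer to a reader
network = {
    "alice": ["bob", "carol"],
    "bob": ["dave"],
    "carol": ["dave", "erin"],
    "dave": [],
    "erin": [],
}
influenced = independent_cascade(network, seeds={"alice"}, prob=0.5)
```

With `prob=1.0` the cascade reaches every blogger reachable from the seed set, and with `prob=0.0` it never leaves the seeds, which matches the intuition that edge probabilities govern how far influence spreads.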

Key Research Findings
In this section, we discuss the key findings of researchers and practitioners working on the problem of modeling bloggers' influence. We present the problem from the perspectives of time and real-life events. As the occurrence and span of real-life events depend on temporal factors, the studies based on the two perspectives often overlap. In most cases, the underlying techniques and concepts rely on the key scientific
fundamentals discussed in the previous section. However, for better presentation, we segregate the findings into separate sections on (a) time-driven and (b) event-driven studies. The studies taking time as an integral factor are classified as "time driven," and the studies that explicitly mention events are classified as "event driven" (see Table 1).

Time- and Event-Driven Modeling of Blogger Influence, Table 1 Highlights of the key findings related to time- and event-driven influence in the blogosphere

Time driven, diffusion based:
Adar et al. (2004) – Inferring implicit links, iRank algorithm, identification of influential blogs responsible for the spread of information
Java et al. (2006) – Prediction of a set of influential blogs that can help in effective spreading of ideas, PageRank-based heuristics for influence models in the blogosphere, and the effect of splogs
Leskovec et al. (2007b) – Temporal patterns in the posting and popularity of blog posts, shapes and sizes of cascades, and a flu-like epidemiological model for information diffusion
Leskovec et al. (2007a) – Objective functions for detecting outbreaks are submodular; efficient approximation algorithm CELF for detecting blogs responsible for outbreaks of information in blog networks
Kwon et al. (2009) – Super node, broadcast edge, register edge; influence of high-quality blogs that are displayed on the main page of blog service providers
Li et al. (2009) – Affinity of a blog towards a cascade, content-oblivious features for predicting it
Gomez-Rodriguez et al. (2012) – NETINF algorithm for tracing paths of influence and diffusion

Time driven, blogger behavior based:
Agarwal et al. (2008) – Identified collectible statistics from the blogosphere for measuring influence, iFinder algorithm, BlogTrackers tool
Akritidis et al. (2011) – Change of influence and productivity of bloggers over time in a community

Event driven, diffusion based:
Gruhl et al. (2004) – Model for information diffusion, spread of real-life topics from blog to blog, identification of effective individuals contributing to the spread of the topics
Song et al. (2007) – Novel algorithm called InfluenceRank to identify opinion leaders in the blogosphere
Stewart et al. (2007) – Defined the problem of discovering information diffusion paths from a blogspace as a frequent pattern mining problem, IDPMiner algorithm for mining information diffusion paths
Kumar et al. (2010) – Methodology for detecting hot topics in the blogosphere by studying the influential bloggers in a community

Event driven, blogger behavior based:
Agarwal et al. (2011) – Identified influential bloggers in order to discover and understand the opinion leaders and the dissemination of opinion among bloggers, for the purpose of studying online collective action during real-life events
Karpf (2007) – Blogosphere authority index (BAI) for tracking online influence of blog sites
Nallapati and Cohen (2008) – Identification of influential blogs related to a specific topic, Link-PLSA-LDA model
Li et al. (2011) – Marketing influential value model

Time Driven
The majority of the works on time-driven influence in the blogosphere are inspired by studies of information diffusion. Explicit links between blog posts, the inference of implicit links, and their times of formation are the fundamental properties enabling such studies. By studying these properties, scholars have modeled the flow of influence in blog networks and proposed metrics for quantifying the influence of bloggers and blog posts over time. The flow of information between blog posts and bloggers results in the formation of information cascades. These cascades are studied and characterized to build different models of information flow (Adar et al. 2004; Leskovec et al. 2007b; Gomez-Rodriguez et al. 2012). Information routes are constructed by inferring links between blog posts and the influence of one post over another (Adar et al. 2004). The information cascades are characterized, and most of their properties are found to follow a power-law distribution (Leskovec et al. 2007b). Confirming such observations through experiments, a flu-like epidemiological model is proposed for generating cascades (Leskovec et al. 2007b). A near-optimal network that best describes the times of influence is constructed using knowledge of the times at which nodes in a network become influenced (Gomez-Rodriguez et al. 2012). The prediction of the affinity of a blog post towards an already existing cascade has been studied in Li et al. (2009); content-oblivious features are identified for this prediction task, which helps overcome the challenges of mining colloquial text from blog posts.

Maximizing the flow of influence in a blog network is another approach studied in order to identify the most effective blogs in a network for spreading information (Java et al. 2006; Leskovec et al. 2007a). Java et al. (2006) pointed out drawbacks of such a model proposed by Kempe et al. (2003) when applied to large blog networks, especially in the presence of splogs. They further contributed PageRank-based (Page et al. 1999) heuristics to estimate influence in the blogosphere. The study also points out that the existing heuristics are biased towards already popular blogs, and suggests the need for influence models that consider the dynamics of changing blog popularity and identify upcoming and rising influential blogs. Leskovec et al. (2007a) proposed an approximation algorithm named cost-effective lazy forward selection
(CELF) for identifying the most effective blogs, looking at the problem from the different perspective of outbreaks of new information. Such blogs can also be considered the most influential set of blogs in the network for viral marketing. It was also concluded that the objective functions for detecting outbreaks in the real world are submodular, that is, they exhibit the property of diminishing returns.

A different approach to modeling influence is taken by the next set of works. A rigorous study was conducted to identify an intuitive set of properties for measuring the influence of a blog (Agarwal et al. 2008). The number of inlinks to a post was used as a measure of its recognition, and the number of outlinks represented its novelty. Comments to a blog post represented its potential for activity generation in the blogosphere, and the length of a post was taken as a measure of eloquence. These properties were used to develop the iFinder algorithm, which identifies a ranked set of influential bloggers in a community. The iFinder algorithm can further classify influential bloggers into one of four categories, i.e., long term, short term, intermittent, and burgeoning, based on how their influence is sustained over time. The algorithm and the findings of the work were further used for developing a tool named BlogTrackers, which is described in the Key Applications section of this article. Time-aware metrics, the bloggers' productivity index and the bloggers' influence index, are considered in Akritidis et al. (2011) as an extension to Agarwal et al. (2008).

Event Driven
The discussion in the blogosphere is often inspired by real-life events. As bloggers may discuss multiple topics in their blogs, topic segregation is considered a useful technique in studying influence vis-à-vis events (Gruhl et al. 2004; Nallapati and Cohen 2008; Stewart et al. 2007; Kumar et al. 2010). The dynamics and patterns of the topics discussed in the blogosphere are characterized into short-term, highly intense spikes induced by external events and ongoing chatter determined by the bloggers (Gruhl et al. 2004). Such findings are further used for studying
the paths of diffusion of various topics among the bloggers, and a formal model is developed drawing on the theory of infectious diseases. The flow of a specific topic across blogs is tracked for developing a frequent pattern mining algorithm, the information diffusion path miner (Stewart et al. 2007). The algorithm is used for mining paths of information diffusion from blogging communities. The topical relationship between linking and linked blogs is also used for identifying the most influential blogs on a topic, in terms of hyperlinks as well as content (Nallapati and Cohen 2008). A methodology was devised for detecting hot topics in the blogosphere by analyzing the activities of the influential bloggers in terms of the topics they discuss (Kumar et al. 2009). The authors detected the discussion about the Indonesian presidential elections among the influential bloggers in a case study conducted on Indonesian blogs. This finding also confirmed the correlation between the topics discussed among the bloggers and the ongoing real-life events relevant to the selected sample of blogs.

The influence and significance of opinion leaders in the blogosphere are studied by Song et al. (2007) and Agarwal et al. (2011). Song et al. (2007) propose an algorithm, InfluenceRank, for identifying opinion leaders who contribute novel information in blogger networks. They introduce the concept of hidden nodes and make use of it in devising the algorithm. The authors state that whenever a new blog entry is generated, its original content might be influenced by an external influencer such as mainstream media, another blog post, or a novel idea. They call these sources hidden nodes, which allow blogs to generate extra information that could not be generated on the basis of the blogs' connections to other blogs alone. They conclude that the opinion leaders are also the most informative and influential nodes in a network and capture its most representative opinions. Influential bloggers from communities are identified in the context of studying collective action in real-life events in Agarwal et al. (2011). Identification of the opinion leaders
helps to understand how issues and concerns travel across blogger networks from leaders to followers. Metrics have also been developed for identifying influential bloggers in the political arena (Karpf 2007) and for marketing campaigns (Li et al. 2011). The blogosphere authority index (BAI) is proposed by considering the networking and commenting aspects of political bloggers. The work is suitable for identifying influential political bloggers, which can be further used during political campaigns and for tracking the reactions of the influentials regarding different political issues. Extending the work done by Agarwal et al. (2008), Li et al. (2011) proposed network-based, content-based, and activeness-based factors for measuring marketing influence strength and identifying potential and influential authors in the blogosphere. An artificial neural network-based approach was used along with these factors for constructing a model named marketing influential value (MIV). The observations presented above emphasize the challenges of defining influence and the related subjectivity. A summary of the key findings is presented in Table 1; time-driven and event-driven approaches, along with diffusion- and blogger behavior-based studies, can be distinctly identified from the matrix. Although these studies address several key challenges, many remain unaddressed, providing opportunities for innovation. Handling the informal textual content of blogs and inferring the sources of influence in the absence of explicit links still pose tremendous challenges.
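Several of the surveyed approaches (e.g., Java et al. 2006; Song et al. 2007) build on PageRank-style link analysis of the blog citation graph. The following sketch is a generic power-iteration PageRank over a hypothetical network of three blogs, not any specific published variant; the `links` dict is assumed to list every blog as a key, with dangling blogs (no outlinks) spreading their rank uniformly:

```python
def pagerank(links, damping=0.85, iters=50):
    """Plain power-iteration PageRank over a blog citation graph.
    links maps each blog to the blogs it links to (its outlinks)."""
    nodes = set(links) | {v for targets in links.values() for v in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # teleportation term shared by all nodes
        new = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src in nodes:
            targets = links.get(src, [])
            if targets:
                share = damping * rank[src] / len(targets)
                for dst in targets:
                    new[dst] += share
            else:
                # dangling blog: spread its rank uniformly over all blogs
                for dst in nodes:
                    new[dst] += damping * rank[src] / len(nodes)
        rank = new
    return rank

# Hypothetical network: blogs b and c both cite a, c also cites b
citations = {"a": [], "b": ["a"], "c": ["a", "b"]}
ranks = pagerank(citations)
most_influential = max(ranks, key=ranks.get)
```

In this toy graph the most-cited blog `"a"` ends up with the highest rank, mirroring the intuition that inlinks from other blogs signal recognition and, by extension, influence.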

Key Applications In earlier sections, we have discussed the significance of finding influential bloggers. Calculating influence scores for bloggers is of utmost importance to the companies that intend to establish expertise in social media analytics. A plethora of such companies exist in the market, e.g., Collective Intellect, Lithium, Attensity360, Radian6, Sysomos, Spinn3r, and Technorati.


However, considering the scope of this article, we discuss only those companies that prioritize calculating influence in the blogosphere. In this section, we list some of the key real-world applications that help in identifying influential blogs or bloggers.
• Klout: Klout (http://klout.com) is a service that measures a person's influence based on his/her ability to drive actions on social networks, including major blogging platforms like Blogger and WordPress. The Klout score is used by various businesses for reaching the influentials in order to market their products.
• BlogTrackers: BlogTrackers (http://blogtrackers.fulton.asu.edu) is a tool capable of tracking and analyzing blogs. In addition to identifying influential bloggers, the tool provides additional features for detecting topics of discussion in the blogosphere and tracking them over time. It is actively used by researchers and is freely available.
• Sysomos: Sysomos (http://www.sysomos.com/) is an online proprietary product for analyzing and monitoring social media. It grew out of an academic project, BlogScope (Bansal and Koudas 2007), which monitored only blogs and had measures for identifying authoritative blogs in the blogosphere. Sysomos employs sophisticated metrics for identifying and tracking influentials across the entire spectrum of social media sites, including major blogging platforms.
• Technorati: Technorati (http://technorati.com) is a blog indexing and search engine. It assigns authority scores to blogs, which helps in identifying top blogs and bloggers in terms of their popularity and influence. The company also offers services providing social media metrics for marketers and industry.
• Spinn3r: Spinn3r (http://spinn3r.com/) is a web service providing access to high volumes of blogs, news, and raw data from various social media platforms. It ranks the sources it indexes, including blogs, using proprietary ranking metrics. The Spinn3r service is actively used in industry and academic research.

Time- and Event-Driven Modeling of Blogger Influence, Table 2 Few publicly available datasets for studying influence in blogs

ICWSM datasets – http://icwsm.org/data/index.php
TREC Blog datasets – http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG
MemeTracker dataset – http://snap.stanford.edu/data/memetracker9.html
TUAW dataset – http://socialcomputing.asu.edu/datasets/TUAW

• MemeTracker: MemeTracker (http://www.memetracker.org/) is a service that builds maps of the daily news cycle by analyzing around 900,000 news stories and blog posts per day from one million online sources, ranging from mass media to personal blogs. It also tracks the most frequent quotes and phrases over time to analyze how different stories compete for daily news and blog coverage. The tool helps in understanding the external influencers of the blogosphere over time.
Apart from the tools and services mentioned above, various search engines like Google, Bing, and IceRocket use various techniques to identify authoritative, popular, and influential blogs in the blogosphere.
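The news-cycle maps above are built from raw post records. As a hedged sketch, the SNAP MemeTracker files are distributed in a simple line-prefixed layout (to the best of our knowledge: `P` = post URL, `T` = timestamp, `Q` = quoted phrase, `L` = outgoing link, with a blank line between records); the sample data below is invented for illustration:

```python
def parse_memetracker(lines):
    """Yield one dict per post from MemeTracker-style records.
    Assumed format: tab-separated 'P', 'T', 'Q', 'L' prefixed lines,
    records separated by blank lines."""
    post = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # blank separator between records
        tag, _, value = line.partition("\t")
        if tag == "P":
            if post:
                yield post
            post = {"url": value, "time": None, "quotes": [], "links": []}
        elif post and tag == "T":
            post["time"] = value
        elif post and tag == "Q":
            post["quotes"].append(value)
        elif post and tag == "L":
            post["links"].append(value)
    if post:
        yield post

# Invented sample: post2 links back to post1, giving one edge of a link graph
sample = [
    "P\thttp://example-blog.com/post1",
    "T\t2008-09-09 22:35:24",
    "Q\tsome quoted phrase",
    "L\thttp://example-news.com/story",
    "",
    "P\thttp://example-blog.com/post2",
    "T\t2008-09-10 08:00:00",
    "L\thttp://example-blog.com/post1",
]
posts = list(parse_memetracker(sample))
```

The `links` fields of the parsed records are exactly what is needed to assemble the timestamped blog citation graph that the time-driven influence studies discussed earlier operate on.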


Datasets

Researchers and practitioners have used different datasets for studying the phenomenon of influence in blogs from various perspectives. We list some of the publicly available datasets that can help in studying and modeling the problem of influential bloggers. The datasets are summarized in Table 2.
• ICWSM Datasets: ICWSM gives access to several datasets provided by the Spinn3r company, suitable for studying the problem of influential bloggers. The datasets consist of blogs and social media content related to various events occurring over different time periods. After signing an agreement, the datasets can be freely obtained from http://www.icwsm.org/data/.
• TREC Blog Datasets: TREC (http://trec.nist.gov/) provides several datasets related to blogs and behavioral aspects of bloggers, suitable for studying influential bloggers. The datasets are not completely free and can be obtained from http://ir.dcs.gla.ac.uk/wiki/TREC-BLOG.
• MemeTracker Dataset: The dataset consists of 210,999,824 memes from MemeTracker (mentioned in Key Applications), 96,608,034 blogs and news articles, and 418,237,269 links between them. The huge number of blogs, news articles, and links makes the dataset quite suitable for studying the problems of time-driven and event-driven influential bloggers. The dataset can be freely obtained from http://snap.stanford.edu/data/memetracker9.html.
• TUAW Dataset: The dataset consists of blog posts crawled from The Unofficial Apple Weblog (TUAW). The blog site consists of a closed community of bloggers, where other users are allowed to comment on the blog posts. The dataset contains blog posts from January 2004 to February 2007, along with metadata like the number of inlinks. The dataset can be freely obtained from http://socialcomputing.asu.edu/datasets/TUAW.

Future Directions
In this article, we have discussed various challenges and fundamental concepts used for studying influence in the blogosphere. We focused on the key contributions in this domain from the perspective of time- and event-driven analysis of influence. Major contributions can be grouped by the nature of the studies, primarily under blogging behavioral characteristics and diffusion-based models. Along the way, we have identified several noteworthy applications and datasets that afford opportunities for researchers and practitioners to delve more deeply into this subject. Although there exists a huge body of work on studying influence, we envision several avenues for further advancement as our reliance on
social media and its efficacy increases. With the increasing role of social media in real-life events and the consequent rise of citizen journalism, further research is needed to study event-specific influence models. This would entail not only exploring the influence of blogs on other social media channels but also its effects on real-life events outside the virtual world of the web. A rigorous investigation is needed to study the different genres of events, such as sociopolitical, economic, and crisis-related events, among others. Further, as blogging platforms move towards standardization and the structuring of available information using semantic web technologies, newer methodologies could be envisioned to study contextual influence. The majority of the literature, as described in this article, leverages connectivity in blog networks as a key fundamental to measure influence. Blog content has not been used to its fullest potential due to the various challenges pertaining to natural language and multimedia processing. A large amount of text is available in blogs, both in the posts and in comments, that could be a rich information source for studying contextual influence. Various characteristics of the blog text could be used to analyze high-quality blog posts, sentiments, opinion diffusion, and influence in the blogosphere. Moreover, concepts and ideas such as homophily (Aral et al. 2009) and social capital (http://en.wikipedia.org/wiki/Social_capital), inspired by psychology, sociology, economics, political science, etc., could be explored for studying influence. The strategic placement of the blogosphere in the social media ecosystem presents unique challenges and significant opportunities for innovation for researchers and practitioners for years to come.

Cross-References
 Actionable Information in Social Networks, Diffusion of
 Centrality Measures
 Human Behavior and Social Networks
 Inferring Social Ties
 Linked Open Data
 Link Prediction: A Primer
 Mining Trends in the Blogosphere
 Modeling and Analysis of Spatiotemporal Social Networks
 Scale-Free Nature of Social Networks
 Social Influence Analysis
 User Behavior in Online Social Networks, Influencing Factors
References
Adamic L, Glance N (2005) The political blogosphere and the 2004 US election: divided they blog. In: Proceedings of the 3rd international workshop on link discovery, Chicago. ACM, pp 36–43
Adar E, Zhang L, Adamic L, Lukose R (2004) Implicit structure and the dynamics of blogspace. In: Workshop on the weblogging ecosystem, vol 13, New York
Agarwal N, Liu H, Tang L, Yu P (2008) Identifying the influential bloggers in a community. In: Proceedings of the international conference on web search and web data mining, Palo Alto. ACM, pp 207–218
Agarwal N, Lim M, Wigand R (2011) Collective action theory meets the blogosphere: a new methodology. In: Networked digital technologies, Macau, pp 224–239
Agarwal N, Lim M, Wigand R (2012) Online collective action and the role of social media in mobilizing opinions: a case study on women's right-to-drive campaigns in Saudi Arabia. In: Reddick CG, Aikins SK (eds) Web 2.0 technologies and democratic governance. Springer, New York, pp 99–123
Akritidis L, Katsaros D, Bozanis P (2011) Identifying the productive and influential bloggers in a community. IEEE Trans Syst Man Cybern Part C: Appl Rev 41(5):759–764
Aral S, Muchnik L, Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proc Natl Acad Sci 106(51):21544–21549
Bansal N, Koudas N (2007) Blogscope: a system for online analysis of high volume text streams. In: Proceedings of the 33rd international conference on very large data bases, Vienna. VLDB Endowment, pp 1410–1413
Bavelas A (1948) A mathematical model for group structures. Hum Organ 7(3):16–30
Bazaarvoice (2012) Social trends report 2012. Technical report, Bazaarvoice, Social Summit 2012
Borgatti S, Everett M (2006) A graph-theoretic perspective on centrality. Soc Netw 28(4):466–484
Davis R (2009) Typing politics: the role of blogs in American politics. Oxford University Press, New York
Ekdale B, Namkoong K, Fung T, Hussain M, Arora M, Perlmutter D (2007) From expression to influence: understanding the change in blogger motivations over the blogspan. AEJMC, Washington, DC
Freeman L (1979) Centrality in social networks: conceptual clarification. Soc Netw 1(3):215–239
Gill K (2004) How can we measure the influence of the blogosphere? In: WWW 2004 workshop on the weblogging ecosystem: aggregation, analysis and dynamics, New York
Goldenberg J, Libai B, Muller E (2001) Using complex systems analysis to advance marketing theory development: modeling heterogeneity effects on new product growth through stochastic cellular automata. Acad Mark Sci Rev 9(3):1–18
Gomez-Rodriguez M, Leskovec J, Krause A (2012) Inferring networks of diffusion and influence. ACM Trans Knowl Discov Data (TKDD) 5(4):21
Gordon A, Swanson R (2009) Identifying personal stories in millions of weblog entries. In: Third international conference on weblogs and social media, data challenge workshop, San Jose
Granovetter M (1978) Threshold models of collective behavior. Am J Sociol 83:1420–1443
Gruhl D, Guha R, Liben-Nowell D, Tomkins A (2004) Information diffusion through blogspace. In: Proceedings of the 13th international conference on world wide web, New York. ACM, pp 491–501
Java A, Kolari P, Finin T, Oates T (2006) Modeling the spread of influence on the blogosphere. In: Proceedings of the 15th international world wide web conference, Edinburgh, 22–26 May 2006
Karpf D (2007) Measuring influence in the political blogosphere: who's winning and how can we tell? Politics and technology review, George Washington University: Institute for Politics, Democracy and the Internet, pp 33–41
Keller E, Berry J (2003) The influentials: one American in ten tells the other nine how to vote, where to eat, and what to buy. Free Press, New York
Kempe D, Kleinberg J, Tardos É (2003) Maximizing the spread of influence through a social network. In: Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, Washington, DC. ACM, pp 137–146
Kleinberg J (1999) Authoritative sources in a hyperlinked environment. J ACM (JACM) 46(5):604–632
Kumar S, Agarwal N, Lim M, Liu H (2009) Mapping socio-cultural dynamics in Indonesian blogosphere. In: 3rd international conference on computational cultural dynamics, Washington, DC, pp 37–44
Kumar S, Zafarani R, Abbasi M, Barbier G, Liu H (2010) Convergence of influential bloggers for topic discovery in the blogosphere. In: Chai S-K, Salerno JJ, Mabry PL (eds) Advances in social computing. Springer, Berlin/New York, pp 406–412
Kwon Y, Kim S, Park S, Lim S, Lee J (2009) The information diffusion model in the blog world. In: Proceedings of the 3rd workshop on social network mining and analysis, Paris. ACM, p 4
Leskovec J, Krause A, Guestrin C, Faloutsos C, VanBriesen J, Glance N (2007a) Cost-effective outbreak detection in networks. In: Proceedings of the 13th ACM SIGKDD international conference on knowledge discovery and data mining, San Jose. ACM, pp 420–429
Leskovec J, McGlohon M, Faloutsos C, Glance N, Hurst M (2007b) Cascading behavior in large blog graphs. arXiv preprint arXiv:0704.2803
Li H, Bhowmick S, Sun A (2009) Blog cascade affinity: analysis and prediction. In: Proceedings of the 18th ACM conference on information and knowledge management, Hong Kong. ACM, pp 1117–1126
Li Y, Lai C, Chen C (2011) Discovering influencers for marketing in the blogosphere. Inf Sci 181(23):5143–5157
Lim S, Kim S, Kim S, Park S (2011) Construction of a blog network based on information diffusion. In: Proceedings of the 2011 ACM symposium on applied computing, TaiChung. ACM, pp 937–941
Moreno J (1934) Who shall survive? vol 58. Nervous and Mental Disease Publishing Company, Washington, DC
Nallapati R, Cohen W (2008) Link-PLSA-LDA: a new unsupervised model for topics and influence of blogs. In: International conference for weblogs and social media, Seattle. AAAI
Onishi H, Manchanda P (2012) Marketing activity, blogging and sales. Int J Res Mark 29:221–234
Page L, Brin S, Motwani R, Winograd T (1999) The PageRank citation ranking: bringing order to the web. Technical report 1999-66, Stanford InfoLab, http://ilpubs.stanford.edu:8090/422/
Proctor C, Loomis C (1951) Analysis of sociometric data. Res Methods Soc Relat 2:561–85
Richardson M, Domingos P (2002) Mining knowledge-sharing sites for viral marketing. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, Edmonton. ACM, pp 61–70
Song X, Chi Y, Hino K, Tseng B (2007) Identifying opinion leaders in the blogosphere. In: Proceedings of the sixteenth ACM conference on information and knowledge management, Lisboa. Citeseer, pp 971–974
Stefanone M, Jang C (2008) Social exchange online: public conversations in the blogosphere. In: Proceedings of the 41st annual Hawaii international conference on system sciences, Waikoloa. IEEE, pp 148–148
Stewart A, Chen L, Paiu R, Nejdl W (2007) Discovering information diffusion paths from blogosphere for online advertising. In: Proceedings of the 1st international workshop on data mining and audience intelligence for advertising, San Jose. ACM, pp 46–54
Symphoni IRI Group (2012) Millennial shoppers: tapping into the next growth segment. Technical report, Symphoni IRI Group
Technorati (2011) State of the blogosphere 2011. http://technorati.com/social-media/article/state-of-the-blogosphere-2011-introduction/
Zeleny L (1940) Measurement of social status. Am J Sociol 45:576–582

T

T

2166

Time-Sensitive Recommendation

Time-Sensitive Recommendation
 Spatiotemporal Personalized Recommendation of Social Media Content

Time-Stamped Graphs
 Analysis and Visualization of Dynamic Networks
 Temporal Networks

Time-Varying Graphs
 Analysis and Visualization of Dynamic Networks
 Stability and Evolution of Scientific Networks
 Temporal Networks

Tools
 NodeXL: Simple Network Analysis for Social Media

Tools for Networks

Alen Orbanič
Faculty of Mathematics and Physics, University of Ljubljana, Ljubljana, Slovenia
Institute of Mathematics, Physics and Mechanics, Ljubljana, Slovenia
Andrej Marušič Institute, University of Primorska, Koper, Slovenia
Abelium d.o.o., Research and Development, Ljubljana, Slovenia

Synonyms
SNA software; Social network analysis packages

Glossary
Big Data Computer science and engineering term for large and complex datasets for which special data management (collection, storage, processing, visualization), usually distributed, has to be used
Cloud Computing Computer engineering paradigm where hardware, network, and software resources are provided as services over the network
INSNA International Network for Social Network Analysis
Internet of Things (IoT) An experimental concept of integration of real-world devices (sensors, actuators, machines) and objects (buildings, cars, trains, etc.) into the Internet over standardized interfaces using unique identifiers
NOSQL Database A database implemented on some data storage model and query concept alternative to relational databases (document databases, key-value stores, graph databases, etc.). Relational databases use the Structured Query Language (SQL) for queries and database operations, while NOSQL databases use other approaches
SNA Social network analysis

Introduction
Software packages for social network analysis (SNA software) are essential tools for network analysis and mining. Due to the relatively large number of SNA software packages (see Wikipedia; Huisman and van Duijn 2011), a general overview is useful when choosing appropriate tools to address specific analytic and/or visualization challenges. In addition to understanding the capabilities of software packages, it is also important to have a general understanding of the trends in the development of SNA packages. Network analysis or network mining starts with data collection and then continues with an iterative analytic and exploratory process of data transformation, visualization, and interpretation. A network analysis tool can address the whole analytic process or just a specific part of it. For example, visualization tools provide means for graphical visualization of networks (drawings) or related data (charts, dendrograms, blockmodels, etc.); data collection tools help with extraction of network data from data sources into one of the established formats. Analytic algorithms are used to transform data and extract useful information. To review SNA software packages, some general review methodology has to be used. Comprehensive SNA software reviews like the ones by Huisman and van Duijn (2005, 2011) established good directions for future reviews. SNA software packages will be classified by categories describing the main purposes of the packages, analytic and visualization capabilities, availability (commercial, free), platforms supported (Windows, Mac OS, Linux), etc. General trends regarding SNA software will be reviewed. We will focus on modern and accessible packages. Older, obsolete, or closed commercial packages will just be mentioned but will not be evaluated.

Historical Background
In the past, several SNA software reviews have been carried out (see Wasserman and Faust 1994; Freeman 1988; Degenne and Forse 1999; Scott 2002; Huisman and van Duijn 2005, 2011). The latest and most comprehensive was by Huisman and van Duijn in 2011 (published in 2012), where the history of SNA software development and the history of SNA software reviews are summarized. The beginnings of SNA packages date back to the 1980s (UCINET, NEGOPY/FATCAT/Multinet, GRADAP, STRUCTURE), and development continued into the 1990s (STRAN, Model, Pajek, StOCNET) and further into the 2000s (ORA, statnet/sna, R-packages, NetMiner) up to the present day. In the review by Huisman and van Duijn (2011), 56 software packages or toolkits were considered. Among them the following were identified as the most popular at the time:


MultiNet, NetMiner, ORA, Pajek, statnet/sna, and UCINET+NetDraw. One of the criteria for popularity was the established tradition of organizing workshops at INSNA Sunbelt conferences. As general starting points for such reviews, lists on Wikipedia and the list of SNA software maintained by INSNA can be used.

General Overview and Trends
Motivation for the development of SNA software packages often comes from academic environments and is usually based on research projects of particular groups. Certain SNA packages are further developed and made commercial through private companies like university spin-offs or through knowledge transfers (ORA, UCINET, Meerkat, GAUSS/SNAP, VennMaker). There are also specialized companies with commercial solutions based on their own research and development (NetMiner, SocioMetrica Suite, Commetrix), sometimes offering software and consulting packages targeting cases related to company organization (Inflow, FirmNet Online, MetaSight), survey taking (Network Genie, ONA Surveys), fraud detection and financial flow analyses (Financial Network Analyzer), customer base analysis (IDIRO, iPoint, KXEN Social Network), and data visualization (Future Points System), and offering support to government services in the fields of intelligence, economy, security, and defense (Centrifuge Visual Network Analyzer, Detica NetReveal, iDetect, Sentinel Visualizer). On the other hand, there is also a strong and growing open-source community developing high-quality alternatives in all aspects of SNA software. Open source is usually the fastest way of providing new algorithms and analytic approaches motivated by current research (R packages, various toolkits and libraries in Java, Python, R, Perl, C/C++, etc.). There are also many free (or free for noncommercial or academic use) but not open-source packages (Agna, GUESS, MultiNet, Pajek, StOCNET, visone, etc.). While most SNA software packages include some form of


visualization and visual graph data management, specialized graph and data visualization tools represent important building blocks of social network analysis (Gephi, Tulip, Mage, Sonia, Graphviz, etc.). Network data collection tools enable us to obtain network data through surveys (collection from people), web crawling, and text mining, or by connecting to databases (Network Genie, ONA Surveys, UrlNet, Deep Email Miner, Discourse Network Analyzer, etc.). With the recent trends related to "big data" and "NOSQL databases" (also "cloud computing"), graph-oriented databases (e.g., DEX, AllegroGraph, Neo4j) and distributed graph-processing platforms (e.g., Google Pregel, Hadoop-based Apache Giraph) are becoming more and more popular in SNA. Big network data from sources like Facebook, Twitter, and LinkedIn provides motivation for such approaches. In business applications the popularity of "Business Intelligence" (BI) tools is increasing as well. Business Intelligence addresses data analytics and reporting on business data for daily operations. The general approach involves data collection from various data sources (connectors), data transformations (ETL – extract-transform-load), and storage into data warehouses from where the data is presented (reporting). We believe that one of the next steps of SNA development (mostly commercial) will be inclusion or integration into general BI packages. ETL pipelines are based on the data-flow programming model, which is often supported by graphical programming tools. Pipelines can represent the steps of analytic procedures and can be stored like programs. This approach is well established in the engineering community (e.g., LabVIEW), in advanced visualization (e.g., VTK – Visualization Toolkit, ParaView), and is becoming increasingly popular in the computer-aided design (CAD) community (Rhino/Grasshopper). Judging by available documentation, among SNA packages, Blue Spider is heading in this direction. The equivalent of pipelines in Pajek are macros. Packages supporting scripting (e.g., all R or Python packages) use an alternative approach by storing the steps of analyses in program scripts.
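The data-flow idea behind such pipelines can be sketched in a few lines of stdlib-only Python (the stage names and input data are invented for illustration): a pipeline is simply an ordered, storable list of transformation steps that can be re-run like a program.

```python
# A minimal data-flow pipeline: each stage is a function from data to data.
def extract(raw):
    # parse "a-b" strings into link tuples
    return [tuple(line.split("-")) for line in raw]

def transform(links):
    # drop self-loops
    return [(u, v) for u, v in links if u != v]

def load(links):
    # "store" the result as a sorted, de-duplicated link list
    return sorted(set(links))

# The pipeline itself is data: it can be stored, inspected, and replayed.
pipeline = [extract, transform, load]

data = ["a-b", "b-b", "a-b", "b-c"]
for stage in pipeline:  # run the stored steps in order
    data = stage(data)
# data is now [("a", "b"), ("b", "c")]
```

Graphical tools in the LabVIEW/Grasshopper tradition present exactly this kind of stage list as a wiring diagram; scripting packages store the same steps as program text.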


With well-defined methodologies, applications of SNA tools are extended to other fields (biology, bioinformatics, communications, computer science, etc.), and SNA tools are slowly but surely losing the "social" prefix and becoming more generally applicable "network analysis" tools.

The Most Popular Packages
In 2011 Huisman and van Duijn identified MultiNet, NetMiner, ORA, Pajek, statnet/sna, and UCINET+NetDraw as the most popular SNA packages. They already noticed a drop in the popularity of MultiNet. In the last few years, the statistical programming language R with its specialized packages has been gaining more and more popularity in various scientific disciplines as well as in business applications. It appears that currently the most popular packages are:
• NetMiner 4 (commercial general package with professional support)
• ORA (analysis of dynamic networks)
• Pajek/PajekXXL (large networks)
• R with packages statnet, sna, tnet, RSiena, latentnet, and igraph (modeling and statistical approaches)
• UCINET + NetDraw (traditional, well-established package for general network analysis of small- and medium-size networks)
One way of measuring the popularity of SNA software packages in the scientific community is the presence of packages at INSNA Sunbelt conferences in the last few years (see Table 1).

Evaluation and Comparison of Software Tools
In order to evaluate the software packages, we use metrics similar to those in Huisman and van Duijn (2011). However, we shall not take a historical perspective and will drop the evaluation of older or obsolete packages (GRADAP, STRUCTURE, PGRAPH, Referral Web, Snowball), packages no longer maintained and/or not accessible (DyNET SE/LE, C-IKNOW, Blanche, Permnet, UNISoN), and commercial packages providing no trial version (Blue Spider, MetaSight). The results of the evaluation in Huisman and van Duijn (2011)


Tools for Networks, Table 1 Workshops at INSNA Sunbelt conferences (from 2009 to 2013) dedicated to SNA software packages (packages covered): C-IKNOW, Citespace, Inflow, KliqueFinder, Net-map, NetworkGenie, NetworkX, ORA, Pajek, R (the R language, general SNA, multiplex, Pnet, Rsiena/SIENA, statnet/sna, tnet), Tulip, UCINET, VennMaker, and visone
will be used and updated. We also consider some new packages (among them Gephi).

Aspects of Evaluation
Software packages are evaluated from the following aspects. Each SNA package belongs to one of the categories describing the purpose of the software. A short description of the general goal of the package is provided. It is identified on which platforms (operating systems) the package can be used. The supported internal data representation is considered, as well as visualization capabilities. The most important aspect is the analytic methods provided in the package. Additionally, we consider the form of software licensing, documentation, and support.
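Of these aspects, the internal data representation has the most direct computational consequences. The trade-off between the two common representations (matrix versus node-link list) can be illustrated with a minimal, stdlib-only Python sketch (the toy graph is invented for illustration):

```python
# The same undirected network in the two common internal representations.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4  # number of nodes |V|

# Matrix representation: O(|V|^2) memory, suited to small dense networks.
matrix = [[0] * n for _ in range(n)]
for u, v in edges:
    matrix[u][v] = matrix[v][u] = 1

# Node-link list representation: O(|E|) memory, the practical choice
# for large sparse networks.
adjacency = {i: [] for i in range(n)}
for u, v in edges:
    adjacency[u].append(v)
    adjacency[v].append(u)

# A descriptive measure (node degree) computed from either representation.
degree_from_matrix = [sum(row) for row in matrix]
degree_from_list = [len(adjacency[i]) for i in range(n)]
assert degree_from_matrix == degree_from_list == [2, 2, 3, 1]
```

Even in this toy case the matrix already allocates 16 cells for 4 links; for a sparse network with millions of nodes, only the node-link list remains feasible.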

Category
Each SNA software package belongs to one of the following categories: general, special, toolkit/library, data collection tool, data visualization tool, or graph database. General packages provide capabilities for reading and storing network data, general network analytic functionalities, and basic (or advanced) visualization. Special packages focus on specific types of analyses (either a specific type of real-life networks or specific analytic methods). Toolkits and libraries provide functionalities through scripting or inclusion into other packages. Data collection packages focus mainly on obtaining network data (through surveys, web crawling, text mining, connecting to databases, etc.). Visualization packages focus strictly on methods for network data visualization. Graph databases are NOSQL databases based on the graph data structure, which usually have a much wider range of applications than just social network analysis. In each category there are software packages that are considered old or obsolete (no longer maintained) but were important at some point of SNA software development (see Huisman and van Duijn 2011).

Platforms
SNA software packages can be provided for a specific platform (operating systems: Windows, Linux, Mac OS) or for several platforms. A package can also be implemented in a multiplatform programming language (Java, Python, R, Matlab). Some software packages are available as web applications ("cloud" platform). One of the important aspects, especially for large networks,


is the transition from 32- to 64-bit operating systems, which enables the use of much larger memory spaces. The transition is often difficult for some packages, usually due to the use of legacy software libraries.

Supported Internal Network Representation
A network on the node set V with links E is usually represented either in an incidence matrix representation requiring O(|V|^2) memory space or in a node-link list requiring O(|E|) space. The former representation is suitable for smaller networks and allows the usage of algorithms with complexity larger than linear in the number of links. The latter representation is practically the only suitable choice for larger networks, which in real life tend to be sparse. On larger networks, algorithms with time complexity higher than subquadratic (in |V|) are not practical to use. For example, Pajek (and PajekXXL) was designed with special care for efficient handling of large networks. On the other hand, some other programs are more focused on smaller networks in the incidence matrix representation (like UCINET and some R packages). Most other general packages target general use with up to reasonably large networks.

Visualization Capabilities
Most general SNA software packages contain some kind of visualization and visual editing tools. There are also packages focused on efficient network visualization, like Tulip and Gephi. Besides static visualization, a package can also support visualization of network evolution, which is necessary for visualizing dynamic networks over periods of time. In addition, SNA software packages often offer visualizations of network properties or network-related characteristics in the form of charts, dendrograms, or blockmodels. Some packages provide 2D visualization only, while others support 3D visualizations as well.

Analyses
Following the classification of Huisman and van Duijn (2011), which is based on the classification of analytic methods by the chapters of the book by Wasserman and Faust (1994), we divide analytic methods into the following groups:
• Descriptive measures (for nodes and links)
• Structure and location (cohesion approaches – cliques, k-cores, slices, structural balance, affiliations; brokerage approaches – center and periphery, brokers, bridges, diffusion)
• Roles and positions (prestige, ranking, genealogies, citations, blockmodels)
• Dyadic and triadic methods
• Statistical probability models (QAP, ERGM)
• Network dynamics (models for network evolution and dynamics)

Licensing
Commercial SNA software products have a commercial license, and some of them offer a trial version, usually for a limited time. Some commercial providers as well as academic groups provide packages under a free license for noncommercial or academic purposes, while others offer software packages completely free for commercial or noncommercial use. Open-source versions are free versions where the code is available for further development under various open-source licenses (GPL, BSD, MIT, Apache, etc.).

Documentation
An SNA software package can offer built-in help, a manual, reference, or user guide, provide online or video tutorials, or even offer user group support on forums. One important aspect of SNA packages is user-friendliness. As in other fields, major commercial software packages tend to have better user interfaces and are more user-friendly. An example of such a package is NetMiner. Software packages built by academics tend to have less user-friendly graphical user interfaces.

General Packages
Evaluation and comparison of general SNA packages is shown in Table 2. Development of the main packages is still active, with fresh updates. For some packages (Agna, CoSBiLab, GUESS, MultiNet, NetVis, Network Workbench),

General Packages Evaluation and comparison of general SNA packages is shown in Table 2. Development of the main packages is still active with fresh updates. For some packages (Agna, CosbiLab, GUESS, Multinet, NetVis, Network Workbench),


Tools for Networks, Table 2 Evaluation of general SNA software packages (packages evaluated): Agna, CoSBiLab Graph, Gephi, GUESS, Meerkat, MultiNet, NetMiner, NetVis, Network Workbench, ORA (*ORA Netscenes), Pajek/Pajek XXL, Sentinel Visualizer, Sociometrica Suite (Linkalyzer, Visualyzer, EgoNet), SocNetV, UCINET 6 (+NetDraw), and visone, compared by platform, version and last change, internal representation, visualization, analytic capabilities, license, and documentation
it seems that development stopped before 2010. We did not evaluate some older packages (DyNet SE/LS, GRADAP, STRUCTURE) since they are obsolete or no longer supported. We also did not evaluate some commercial packages providing no trial version (Blue Spider, InFlow). Evaluation of the above-mentioned packages is available in Huisman and van Duijn (2011). Most packages operate with both matrix and node-link list representations. All of them provide some means of static network visualization, while some also provide visualization of network evolution. All packages provide basic analytic capabilities, but only the most popular ones offer advanced analytic capabilities. The package



visone is a package gaining popularity. Regarding licensing, the most popular packages tend to be commercial or free for noncommercial (academic) use. It seems that open-source development is not very popular or is stalling for general packages, with the exception of Gephi (note that we consider R packages as libraries and toolkits). All packages are well documented.

Specialized Packages
Evaluation and comparison of special SNA packages is shown in Table 3. For evaluation of some older packages and packages no longer supported (C-IKNOW, Referral


Tools for Networks, Table 3 Specialized SNA packages (name and purpose):
CFinder – dense groups and visualization
CID-ABM – propagation of information
CiteSpace – citation networks
Commetrix – dynamic network visualization
EgoNet – egocentric networks
E-Net – egocentric network analysis
Financial Network Analyzer – statistical analysis of financial data
KeyPlayer – identifying nodes for interventions
KliqFinder – cohesive subgroups
PNet – exponential random graph models
Puck – analysis of kinship data
Radatools – community detection tools
SONIVIS – virtual information spaces
StOCNET (+SIENA) – advanced statistical analysis
VennMaker – actor-centered network analysis
Web, UNISoN, PGRAPH, Blanche, Permnet, Snowball), see Huisman and van Duijn (2011). Several commercial packages offering no trial version fall into this group: Centrifuge Visual Network Analytics, Detica NetReveal, iDetect, Idiro SNA Plus, iPoint, KXEN Social Network, MetaSight, and Xanalys Link Explorer. Table 3 shows that most special SNA packages are updated regularly. It seems that the StOCNET + SIENA package is no longer updated (the SIENA package became the RSiena package in R). The PNet package has also not been updated since 2005. Special packages tend to run on Windows or Java (multiplatform). By definition they address specific types of analyses. Due to the exclusion of many commercial packages (no trial version), Table 3 contains mostly freely

available packages. The presence of open-source packages is not very strong. The general level of available documentation is acceptable.

Libraries and Toolkits
Evaluation and comparison of SNA libraries and toolkits is shown in Table 4. Most of them are multiplatform (Java, R, or Python), free or open-source, and regularly updated. A suite of R packages based on the packages statnet and sna represents one of the most popular tools for network analysis. Besides R, the other programming language of choice is Python. Some of the libraries support visualization. Visualization of dynamic networks is supported by the JUNG 2 package. Matrix-oriented


Tools for Networks, Table 4 Evaluation of SNA toolkits and libraries (name – purpose; version, environment, last change):
GraphStream – library for dynamic graphs; 1.1.2, Java, 2012
Graph-tool – manipulation and statistical analysis of graphs; 2.2.23, Python, 2013
igraph – creating and manipulating graphs; 0.6.5, R/Python/C, 2013
JUNG 2 – modeling, analysis, visualization; 2.0, Java, 2010
latentnet – latent position and cluster models; 2.4.4, R, 2013
LibSNA – social network analysis; 0.32, Python, 2008
NetworkX – complex networks; 1.7, Python, 2012
NodeXL – analysis and visualization; 1.0.1, Excel, 2013
RSiena – evolution of network and behavior; 1.1-227, R, 2013
sna – social network analysis; 2.2, R, 2010
SocProg – programs for analyzing social structure; 2.4, Matlab, 2009
statnet – statistical modeling of networks; 3.0, R, 2012
tnet – weighted, two-mode and longitudinal data; 3.0.11, R, 2012
yFiles – network visualization; 2.10/4.2, Java/.NET, 2013
libraries/toolkits like SocProg tend to cover a wider range of analytic methods, but of course only for smaller networks. For evaluation of the commercial package GAUSS/SNAP, see Huisman and van Duijn (2011) (not evaluated here since no trial version is available). The newest version of Mathematica 9 (by Wolfram Research) includes social network analysis tools as well (visualization, descriptive measures, structure/location, roles/positions).

Data Collection Packages
Data collection packages are used to obtain data from various sources. When data is available in some digital data source (Internet, databases, files/documents, emails), general programming languages and libraries/toolkits can be used. Survey tools are used to extract data from groups

















































of people, with applications especially in the domains of social research, human resources, relations, and company organization. Another type of tool is text mining tools, which provide transformation of texts into network data (semantic networks). The main representatives of these packages are:
• Survey data collection: Network Genie, ONA Surveys (both online applications)
• Text mining and semantic network extraction: Automap, Text2Pajek
• Emails: Deep Email Miner
• Other: Discourse Network Analyzer, WoS2Pajek, Leydesdorff software, etc.
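As a rough illustration of the text-mining route, the following stdlib-only Python sketch (the sentences and the weight threshold are made up) turns a few sentences into a word co-occurrence network, the kind of semantic network such tools produce at scale:

```python
from itertools import combinations
from collections import Counter

sentences = [
    "network analysis tools support visualization",
    "visualization tools support exploration",
]

# Count how often two distinct words co-occur within a sentence.
cooccur = Counter()
for s in sentences:
    words = sorted(set(s.split()))
    for pair in combinations(words, 2):
        cooccur[pair] += 1

# Keep pairs above a (hypothetical) weight threshold as network links.
links = {pair: w for pair, w in cooccur.items() if w >= 2}
# e.g. ("support", "tools") and ("tools", "visualization") occur in both sentences
```

Real tools add tokenization, stemming, and stop-word removal, but the resulting weighted link list is exactly the network data format the analysis packages above consume.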

Visualization Packages
Visualization is an important way of carrying out exploratory network analysis, where the human mind can identify and interpret certain patterns


carrying new or additional interesting information. Most of the general SNA packages contain powerful visualization tools. However, packages specialized (mostly) for visualization exist as well. Tulip and Cytoscape represent advanced visualization tools. The SoNIA and Nevada packages provide visualization of dynamic networks. There are also some smaller packages providing various types of graph drawing tools, like aiSee, Apache Agora, and CuttleFish. Often, general scientific visualization tools like Mage are used for visualization of graphs. The graph drawing library OGDN (Open graph drawing network) contains a wide range of algorithms for graph drawing and is implemented in C++. Some algorithms used to support graph drawing can be used for analysis as well (structure/location). A similar library implemented in C++ is the Graph Drawing Toolkit (GDToolkit). It is available as a commercial library but is free for academic use. The commercial library and software development kit Tom Sawyer can be used in advanced professional applications.

Graph Databases
Graph databases are NOSQL databases, offering alternative database models to well-established relational databases. Graph databases use the graph data structure for storing data, which is a much more convenient approach for network and relational data. The most popular graph database is the open-source database Neo4j. There are also two commercial alternatives, AllegroGraph and DEX, but several other graph databases exist as well. Graph databases provide storage capabilities for large network data, which can be accessed through queries in languages like SPARQL and Cypher or by function calls through interfaces (like the function calls of a library). Graph databases are a rapidly developing field (Wikipedia, Graph Database).
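Why graph storage suits relational queries can be sketched with a toy Python store (this only mimics the idea of a traversal query; production systems such as Neo4j express it in Cypher, and the node names here are invented):

```python
# A toy property-graph store: nodes with attributes, adjacency for links.
nodes = {"ann": {"kind": "person"}, "bob": {"kind": "person"},
         "eve": {"kind": "person"}, "acme": {"kind": "company"}}
links = {"ann": ["bob", "acme"], "bob": ["eve"], "eve": [], "acme": []}

def friends_of_friends(start):
    """Traverse two hops - the kind of query that needs self-joins
    in a relational database but is a natural walk over a graph."""
    return {third
            for second in links[start]
            for third in links[second]
            if nodes[third]["kind"] == "person"}

assert friends_of_friends("ann") == {"eve"}
```

A query language like Cypher packages this traversal pattern declaratively; the underlying storage model is still adjacency, as above.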

Future Directions
Massive amounts of network data are produced on a daily basis by popular social networking applications (Facebook, Twitter, LinkedIn, etc.), the Internet, and mobile and wireless communication networks. Future integration of information systems supporting government functions (defense, security, energy, healthcare, transportation), together with emerging "smart" concepts (Smart city, Smart grid, Smart transportation), concepts like machine-to-machine (M2M) communication based on wireless and sensor networks, and the concept of the Internet of Things, implies increases in generated network data by orders of magnitude in the near future. Requirements for data scientists and analysts are constantly increasing. New start-ups with businesses requiring some kind of network management and analysis tools and skills are constantly appearing in the cloud computing economy. With all these commercial opportunities, expectations, and sources of motivation, we can expect immense pressure for fast development of new tools for network analysis. As we can currently observe in cloud computing development, open-source approaches offer the highest level of flexibility, and many market-leading companies join in supporting open-source packages.

Acknowledgments This work has been partially financed by the Slovenian Research Agency (ARRS) within the EUROCORES program EUROGIGA (project GReGAS) of the European Science Foundation (ARRS grant number N1-0011). The author would like to thank Vladimir Batagelj for valuable suggestions.

Cross-References
 Centrality Measures
 Cloud Computing
 Data Mining
 Exponential Random Graph Models
 Gephi
 GUESS
 Netminer
 NodeXL: Simple Network Analysis for Social Media
 ORA
 Pajek
 Siena: Statistical Modeling of Longitudinal Network Data
 SPARQL
 Tulip III
 UCINET
 Visualization of Large Networks

References
Degenne A, Forse M (1999) Introducing social networks. Sage, London
Freeman LC (1988) Computer programs and social network analysis. Connections 11(2):26–31
Huisman M, van Duijn MAJ (2005) Software for social network analysis. In: Carrington PJ, Scott J, Wasserman S (eds) Models and methods in social network analysis. Cambridge University Press, Cambridge, pp 270–316
Huisman M, van Duijn MAJ (2011) A reader's guide to SNA software. In: Scott J, Carrington PJ (eds) The SAGE handbook of social network analysis. Sage, London, pp 578–600. See also: http://www.gmw.rug.nl/huisman/sna/software.html. Accessed Apr 2013
Scott J (2002) Social network analysis: a handbook, 2nd edn. Sage, London
Wasserman S, Faust K (1994) Social network analysis: methods and applications. Cambridge University Press, Cambridge/New York

Web References
INSNA, list of SNA software. http://www.insna.org/software/. Apr 2013
Software for social network analysis. http://www.gmw.rug.nl/huisman/sna/software.html. Apr 2013
STRAN, Model packages. http://vlado.fmf.uni-lj.si/pub/networks/stran/. Apr 2013
Wikipedia, Graph Database. http://en.wikipedia.org/wiki/Graph_database. Apr 2013
Wikipedia, Social network analysis software. http://en.wikipedia.org/wiki/Social_network_analysis_software. Apr 2013


Top Management Team Networks

Kevin D. Clark and Patrick G. Maggitti
Villanova School of Business, Villanova University, Villanova, PA, USA

Synonyms Executive teams; Innovation; Leadership team; Relational view; Social networks; Strategic decision-making; Top management teams

Glossary
CEO Chief executive officer of the firm
Philos A type of affinity based on trust and/or friendship
Structural Hole A gap that exists between two sets of actors who could benefit from being connected
TMT Top management team

Definition Top management teams (TMTs) consist of the most senior members of the organization, those who are responsible for formulating strategy. Typically, this is a very small group of senior executives, numbering from two to about a dozen. Because of its importance to organizational performance and its position as a key boundary spanner at the periphery of the organization, the TMT has been the subject of a great deal of research, particularly into composition and group process. More recently, interest has shifted to the social networks of TMTs and to how these networks can be designed to improve decision-making effectiveness, particularly in novel circumstances such as innovation. This essay provides a basic familiarity with TMTs and their networks and presents a model of TMT effectiveness.


Introduction
Since Chester Barnard's (1938) classic book The Functions of the Executive, scholars have attempted to explain how top management affects organizational outcomes. Most research on top management teams (TMTs) has focused on demographic composition (who is on the TMT) and group process (how team members interact). More recently, interest has shifted based on the observation that executives use their networks of relations to gather information and resources. Whereas the study of composition reflects the stocks of experience and values held by the team, and the study of process measures how the team interacts, the study of TMT social networks exposes the types and amount of potential additional information that is available to team members through their connections to others. The TMT is a small number of senior managers who are charged with the strategic decision-making task within the firm. TMTs differ from other types of teams in several ways important to the design and utilization of their networks. First, TMTs are composed of highly successful and experienced executives whose tenure in their current and (usually) former organizations has allowed them to develop large and influential networks. Moreover, there is a distinct power structure within TMTs, whereby the CEO has hierarchical power and may have selected other members of the team, who may owe their position to him or her. TMTs also hold positional authority over the organization and, situated at the boundary of the organization, are well placed to develop both internal and external ties. Finally, the decision-making context within which TMTs operate is distinct from that of most other types of teams.
Strategic decisions are high stakes, involving a significant commitment of resources, so the TMT may experience high levels of stress and take more time to ensure a comprehensive process; however, hypercompetitive conditions pressure TMTs to make speedy decisions even when important information is lacking. In addition, strategic decisions are complex and often involve both internal and external actors. In sum, because the strategic decision-making


environment is high stakes and fraught with uncertainty, TMTs need to gather and process information effectively in order to make timely, high-quality decisions. Indeed, researchers argue that the management of uncertainty is the principal task of top-level decision-makers as they buffer and protect the organizational core (Thompson 1967). Because the TMT is a key boundary spanner, important factors in its information processing capacity are the stocks of knowledge and skills team members hold, how the team processes knowledge and information, and the relational contacts in the TMT social network whose information team members can draw upon. In this article, we introduce some basic characteristics of TMT networks and present a model of an effective network for TMT decision-making around novel circumstances, particularly innovation.

The TMT Network
The TMT network consists of everyone the team is in contact with for the purpose of achieving organizational goals. Social networks are complex phenomena, and, accordingly, research has examined many characteristics of TMT networks. Two main perspectives emerge: (1) a structural view that focuses on the dimensions of the network (i.e., size and reach) and one's position within it (i.e., central versus peripheral) and (2) a relational view that focuses on the types of ties present in the network (i.e., strong versus weak ties). Given limited resources with which to build social networks, an implicit tension exists between the desire to develop a larger network of associations and the need to develop stronger, more expensive linkages with specific others. Moreover, because TMTs are a key boundary-spanning mechanism for the organization, it is important to differentiate ties to internal versus external actors. In sum, the trade-offs between a structural and a relational approach must be made in conjunction with the functionality required of a given portion of the network. The ability of an organization to innovate is dependent on the availability of new knowledge


and information. The popular adage "it's not what you know, but who you know" is a cynical commentary on the nature of organizational life. Instead, this research posits that who you know (and how you know them) impacts what you know. Top managers play a crucial boundary-spanning role, serving as a conduit for information flows between internal and external stakeholders. According to the weak-ties structural argument (Granovetter 1973), the TMT should maximize the size and range of its contacts by using weak ties, which are less costly to develop. A boundary spanner with a large and broad range of contacts, for example, with banks, investment houses, universities, technology centers, and business associations, each of which potentially contains novel information, will be able to obtain information faster, access a richer set of data, and draw from a broader set of referrals. Conversely, boundary spanners with a narrow range of contacts will be limited in the types of information and knowledge they can search for and discover. Positioning the TMT centrally, or as a hub, in the network increases the information flow to and through the team and ensures quicker access to information relative to those in more peripheral locales. In contrast, the relational view suggests that strong ties (e.g., of long duration, exercised frequently, and with an emotional closeness or friendship component) will be critical when the information is uncertain and ambiguous. Exchange of trustworthy information is more likely when ties are strong, and this is especially true in high-velocity environments where information is more complex, tacit, or sensitive. Thus, TMTs with strong ties are able to garner richer information and to be more confident in that information than TMTs with weak ties. One way a TMT can develop stronger ties is when two or more members of the team have linkages to the same actor.
Overlap or redundancy in the individual networks of executives, however, constrains the possible size and range of the TMT network because highly overlapping networks would access the same information. In summary, trade-offs exist and the TMT needs to understand when and where strong ties provide enough value to offset the cost to network structure.
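The structural trade-offs just described can be made concrete with a small sketch. All member names, contacts, and ties below are invented for illustration; the sketch computes total network size, the redundancy created when members' ego networks overlap, and the "multiply anchored" contacts that the text treats as candidates for stronger ties.

```python
from collections import Counter

# Each TMT member's personal contacts (ego network), as sets of contact ids.
tmt_contacts = {
    "ceo": {"bank_a", "university_x", "vc_b", "board_c"},
    "cfo": {"bank_a", "bank_b", "auditor_d"},
    "cto": {"university_x", "tech_center_e", "vc_b"},
}

def network_size(contacts):
    """Unique external contacts reachable by the whole team."""
    return len(set().union(*contacts.values()))

def redundancy(contacts):
    """Share of contact 'slots' spent on contacts already reached by
    another member: 1 - unique contacts / total ties."""
    total_ties = sum(len(c) for c in contacts.values())
    return 1 - network_size(contacts) / total_ties

def strong_tie_candidates(contacts):
    """Contacts linked to two or more members: following the text, a tie
    anchored by multiple members can be developed into a stronger tie."""
    counts = Counter(c for member in contacts.values() for c in member)
    return {c for c, n in counts.items() if n >= 2}

print(network_size(tmt_contacts))                  # 7 unique contacts
print(round(redundancy(tmt_contacts), 2))          # 0.3: 3 of 10 ties overlap
print(sorted(strong_tie_candidates(tmt_contacts)))
```

The example shows the tension directly: the three overlapping ties reduce the team's unique reach but are exactly the ties that can be cultivated into stronger relationships.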


Internal and External Network
Researchers make an important distinction between the ties TMTs have to others in the organizational core (the internal network) and boundary-spanning links to external actors. This is because the information processing value provided by each type of network differs. Three distinct activities must take place for successful information processing to occur: information gathering, processing of the information, and distribution to appropriate organizational locales so that action can take place (Galbraith 1973). Two of the three components of the model take place inside the organization, while information gathering can arguably take place both internally and externally. As detailed above, there is some debate as to the value of strong versus weak ties given the expense of developing and maintaining close ties. In general, the strength-of-ties argument applies most directly to external contacts. External networks are potentially limitless and consist of relationships with people who may not have an intrinsic interest in sharing information. Thus, the focused development of selected links into strong relationships, even at the expense of network size, may be the most effective way to garner information from external sources. Internal contacts, on the other hand, may feel more obligated to share information based on a common organizational identity and increased goal alignment. Moreover, top managers hold positional power over those in the core. Finally, much of the boundary-spanning role vis-à-vis internal contacts involves transfer from the TMT to the core, rather than extraction. Information exchange with external actors is more complex, and researchers agree that strong ties are necessary for the exchange of complex information, especially when the extraction of information requires motivational influence (Hansen 1999). Strong ties also provide benefits not available in weaker links.
Chief among these is a type of trust or affiliation termed philos, which enables the parties to exchange sensitive or timely information (Krackhardt 1992). In the context of rapidly changing and uncertain environments, the trust and familiarity


present in strong tie relationships may be a necessary lubricant to the exchange of timely and sensitive information, especially across organizational boundaries. Since the motivational hurdles are much lower, large internal networks composed of weaker ties are beneficial for innovation in that they allow for efficient dissemination of information inside the organization. Indeed, TMTs with larger internal networks are more connected to the organization and would therefore be able to know who inside the firm has important information and then to gather and disseminate the information in an effective and efficient manner. Greater connectedness also enhances the TMT’s understanding of the information needs of those in the organizational core, and armed with this understanding, TMTs can better focus their external information search to attend to knowledge critical to ongoing innovation efforts and potentially fruitful future initiatives alike.

Cross-References
 Actionable Information in Social Networks, Diffusion of
 Creating a Space for Collective Problem-Solving
 Entrepreneurial Networks
 Innovator Networks
 Inter-organizational Networks
 Intra-organizational Networks
 Managerial Networks
 R&D Networks
 Social Networking for Open Innovation

References
Barnard C (1938) The functions of the executive. Harvard University Press, Cambridge
Galbraith J (1973) Designing complex organizations. Addison-Wesley, Reading
Granovetter M (1973) The strength of weak ties. Am J Sociol 78:1360–1380
Hansen M (1999) The search-transfer problem: the role of weak ties in sharing knowledge across organizational subunits. Adm Sci Q 44:82–111
Krackhardt D (1992) The strength of strong ties: the importance of philos in organizations. In: Nohria N, Eccles RG (eds) Networks and organizations: structure, form, and action. Harvard University Press, Cambridge, pp 216–239
Thompson JD (1967) Organizations in action. McGraw-Hill, New York

Top Management Teams
 Top Management Team Networks

Topic Identification
 Scholarly Networks Analysis

Topic Information Diffusion
 Opinion Diffusion and Analysis on Social Networks

Topic Model
 Topic Modeling in Online Social Media, User Features, and Social Networks for

Topic Modeling in Online Social Media, User Features, and Social Networks for Bo Hu, Zhao Song, and Martin Ester School of Computing Science, Simon Fraser University, Burnaby, BC, Canada

Synonyms Document; LDA; Topic model; Tweet; Update


Glossary LDA Latent Dirichlet Allocation

Introduction
Online social media websites, such as Epinions, Twitter, and Google+, have recently attracted millions of users and have become ubiquitous in our everyday lives. Thanks to the widespread adoption of various applications on mobile devices, people can easily post their routine status updates or reviews anytime, anywhere. Consequently, we have unprecedented access to news, events, and activities through vast amounts of highly dynamic user-generated data. User-generated data, such as a review or a microblog, is usually textual and short, which differs from a traditional, long document, such as a scientific paper or news article. Beyond the large collection of user-generated text, online social media websites also have a social network, allowing users to exchange and share information. This rich textual and relational environment is increasingly becoming a crucial facet of life for the general public; it is also a popular topic of study among researchers in the social sciences as well as in computer science. A particular area of interest is the study of topic modeling in online social media websites. In order to understand text documents, topic models, such as latent Dirichlet allocation (LDA) (Blei et al. 2003), have been proposed and shown to be successful methods. LDA assumes that there is a set of latent topics in all documents, and each document contains a mixture of topics, which are associated with different word distributions. The majority of existing topic models (Blei et al. 2003; Hofmann 1999), including LDA, focus on traditional document collections, such as research papers, consisting of a relatively small number of long and high-quality documents. However, user-generated documents tend to be shorter and noisier than traditional documents. Authors are not professional writers and use very diverse vocabulary, and there are many


abbreviations and typos; for many users, we do not have enough information to confidently learn their topic distribution. Moreover, online social media websites have a social network full of context information, such as user features and user-generated labels, which existing topic models normally ignore. Not all users have the same interests, not all friends are equally important, and not all posts are interesting. Therefore, we argue that the social network and user features, such as age, gender, and location, are critically important when one wants to model and extract the latent topics.
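The generative assumptions of LDA described above can be simulated in a few lines. The two topics, their word distributions, and the document's topic mixture below are invented toy values, not learned from data:

```python
import random

random.seed(0)

# Two hypothetical topics as word distributions (the phi_t of LDA).
topics = {
    "sports": {"game": 0.5, "team": 0.3, "win": 0.2},
    "tech":   {"phone": 0.4, "app": 0.4, "code": 0.2},
}

def sample(dist):
    """Draw one item from an {item: probability} distribution."""
    r, acc = random.random(), 0.0
    for item, p in dist.items():
        acc += p
        if r < acc:
            return item
    return item  # guard against floating-point rounding

def generate_document(topic_mixture, length):
    """Generate a document from a per-document topic mixture (theta):
    for each word position, first draw a topic, then draw a word."""
    words = []
    for _ in range(length):
        t = sample(topic_mixture)          # choose a topic
        words.append(sample(topics[t]))    # choose a word from that topic
    return words

doc = generate_document({"sports": 0.7, "tech": 0.3}, length=10)
print(doc)
```

Inference in LDA runs this process in reverse: given only documents like `doc`, it recovers the topic-word distributions and per-document mixtures.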

Key Points
In this paper, we address the problem of modeling and extracting user interests and latent topics in large-scale user-generated documents from online social media websites. We propose a feature-based topic model and a social-based topic model, taking into account the user features and the social network, respectively. The models are based on the assumptions that (1) friends tend to have similar topic distributions and (2) users with similar features also tend to have similar topic distributions. The assumptions are motivated by the theory of "homophily" (McPherson et al. 2001), which states that users connect to similar users, and by the theory of "social influence" (Friedkin 1998), which claims that connected users become more similar to each other. These phenomena of homophily and social influence interact with each other, and their collective effect is referred to as "social correlation." We do not consider the combination of features and social network in a unified model, since, due to social correlation, a user's features and links are somewhat redundant, i.e., one can be used to infer the other. In addition, our models address the noisiness challenge by regularizing the user topic distributions, i.e., getting additional input for a user's topic distribution from his or her features or from the distributions of his or her friends.
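The regularization idea can be sketched as follows. This is a deliberately simplified illustration with an invented mixing weight `lam`; the actual models integrate friend information through the Dirichlet prior rather than through a fixed linear mixture:

```python
def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def regularized_interest(user_counts, friend_dists, lam=0.5):
    """Smooth a user's sparse empirical topic counts toward the mean of
    the friends' topic distributions (lam is a hypothetical weight)."""
    own = normalize([c + 1e-9 for c in user_counts])  # guard empty users
    mean_friends = [sum(col) / len(friend_dists) for col in zip(*friend_dists)]
    return normalize([lam * o + (1 - lam) * f
                      for o, f in zip(own, mean_friends)])

# A noisy user (only topic 0 ever observed) with two friends.
theta = regularized_interest([2, 0, 0],
                             [[0.2, 0.5, 0.3], [0.4, 0.4, 0.2]])
print([round(t, 2) for t in theta])
```

The smoothed distribution assigns nonzero mass to topics the user never mentioned but the friends care about, which is exactly the extra input a short, noisy document stream cannot provide on its own.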



The major contributions of this paper are as follows: • We propose novel topic models for online social media, which consider the user features and social networks. • We present an extensive experimental evaluation on three real-life data sets from Epinions (http://www.Epinions.com), Twitter (http://www.Twitter.com) and Google+ (https://plus.google.com/), demonstrating the superiority of our proposed models over a baseline LDA model and the importance of the user features and social networks. • To the best of our knowledge, this paper is the first to analyze a Google+ data set and to discuss how to model a social network from user interactions available in Twitter and Google+.

Historical Background
In this section, we review related work in the areas of analysis of online social media websites, topic modeling, and topic modeling in online social media websites.

Analysis of Online Social Media Websites
Kwak et al. (2010) were the first to analyze Twitter demographics and topological structure. They also download the public tweets on trending topics by keyword search and calculate the active period of each trending topic. Their conclusion is that Twitter is a tool for sharing information as well as for social chatting. Finding interesting patterns in the content of microblogs has become a popular research direction. "Who says what to whom" is analyzed in Wu et al. (2011). Based on the content of tweets, Cheng et al. (2010) develop models to predict a user's location. The same authors show the existence of a correlation between locations and social relationships in Cheng et al. (2011). Pennacchiotti and Popescu (2011) present a method to predict Twitter users' political orientation from their tweets. In Castillo et al. (2011) and Wang (2010), the authors distinguish credible tweets from rumors.

This line of work differs from the task in our paper, which is to extract user interests and hidden topics from user-generated documents.

Topic Models
Topic modeling is a classic problem and has a long history in text mining. The most representative models are PLSA (Hofmann 1999) and LDA (Blei et al. 2003), which model and extract topics purely from documents. Author-LDA (Steyvers et al. 2004) also models topic distributions from documents; however, it assumes that the topics of a document are uniformly drawn from multiple authors' topic distributions. A few works (Mei et al. 2008; Sun et al. 2009) have incorporated social network regularization. Specifically, Mei et al. (2008) use a harmonic function to enforce the constraint that friends in the social network tend to have similar topic distributions. In a generative way, Sun et al. (2009) model both the influence between documents and the content of documents by defining a Markov random field on the document graph. Kim and Shim (2011) propose a probabilistic model to profile Twitter users and assume that when users post a microblog, they borrow the microblog topic from one of their followees. The work mentioned above does not exploit user features. Furthermore, the social network in this paper has a different meaning from the one in their papers.

Topic Modeling in Online Social Media Websites
A few works (Hong and Davison 2010; Zhao et al. 2011; Ramage et al. 2010; Kim and Shim 2011; Chen and Fong 2011) have been proposed to extract hidden topics from microblogs. In Hong and Davison (2010), the authors combine all the tweets from each user into one document and apply LDA to extract the document topic mixture, which represents the user interest. Zhao et al. (2011) propose the Twitter-LDA model. It assumes that a single tweet contains only one topic, which differs from the standard LDA model; the same assumption is also found in Rosa and Ellen (2009).
It uses the following process to generate tweets: (1) choose a topic from the user's topic


distribution and (2) generate each word from that specific topic. The tweets usually contain user-provided tags (also known as hashtags). For the tagged tweets, Ramage et al. (2010) propose a Labeled-LDA model to include the supervision of labels. Unlike LDA, the Labeled-LDA model assumes a set of labels in advance. Although the above models consider several characteristics of user-generated documents, they do not consider user features and social networks.

Problem Definition
In this section, we introduce some related concepts and formulate our problem definition. Social media users generate content, which is known as "reviews," "tweets," or "posts" in Epinions, Twitter, and Google+, respectively. Tweets are up to 140 characters, while posts are limited to 10,000 characters. Based on our observations, posts in Google+ are usually as short as tweets. Reviews are much longer than tweets. These documents normally consist of personal information, news, or links to content, such as images, videos, and articles. In Twitter and Google+, the documents posted by the users are automatically updated on their profile pages and presented to their followers or friends simultaneously. We associate each user with a set of documents, and each document is represented by a set of words. For convenience, in this paper, we consider "tweet," "post," "review," and "document" as synonyms. All notations used in the following definitions and our models are listed in Table 1.

Topic Modeling in Online Social Media, User Features, and Social Networks for, Table 1 Notations in the FT and ST models

U            Number of users                                                   Scalar
N            Number of words in the document collection                        Scalar
N_u          Number of documents posted by user u                              Scalar
N_{u,s}      Number of words in the sth document posted by user u              Scalar
W            Number of unique words in the vocabulary                          Scalar
T            Number of topics                                                  Scalar
M_u          Number of friends of user u                                       Scalar
w            Words in the document collection                                  N-dimensional vector
w_{u,s,n}    nth word in the sth document posted by user u                     Scalar
x            User features                                                     U × F matrix
x_{u,l}      User u's lth feature value                                        Scalar
z            Topic assignments                                                 N-dimensional vector
z_{u,s,n}    Topic assignment of the nth word of the sth document by user u    Scalar
a            Topic-feature weights                                             T × F matrix
a_{t,l}      Weight of topic t on feature l                                    Scalar
β            Dirichlet prior                                                   Scalar
Φ            Probabilities of words given topics                               W × T matrix
φ_t          Probabilities of words given topic t                              W-dimensional vector
Θ            Probabilities of topics given users                               T × U matrix
θ_u          Probabilities of topics given user u                              T-dimensional vector

We define a document collection formally as follows:

Definition 1 (Document Collection) A document collection $\mathcal{W}$ is defined as the set of documents from all users:
$$\mathcal{W} = \{\mathbf{w}_u\}_{1 \le u \le U}$$
where $U$ is the number of users, and the set of documents $\mathbf{w}_u$ authored by user $u$ is defined as
$$\mathbf{w}_u = \{\mathbf{w}_{u,s}\}_{1 \le s \le N_u}$$
where $N_u$ is the number of documents from user $u$, and each document $\mathbf{w}_{u,s}$ is defined as a bag of words:
$$\mathbf{w}_{u,s} = \{w_{u,s,n}\}_{1 \le n \le N_{u,s}}$$
where $N_{u,s}$ is the number of words in the $s$th document by user $u$. Moreover, all the words $w_{u,s,n}$ in the document collection are drawn from a fixed vocabulary $\mathcal{V}$, where $W$ is the number of unique words in $\mathcal{V}$.

In addition to a document collection, we have a social network. For instance, in Epinions and Twitter, the social networks are directed, also known as a "truster-trustee" and a "follower-followee" network, respectively. In Google+, there is a "follower-followee" relationship similar to Twitter. However, this information is not publicly available; therefore, we need to infer implicit social networks from explicit user actions and user interactions. A user action is defined as posting a document. There are three types of user interactions:
• "Retweet" or "Reshare." A tweet originally posted by another user can be reposted with the head "RT" on a user's profile page and thereby shared with his/her followers. It is a popular way to propagate interesting tweets and posts to one's followers.
• "Mention" or "Reply." If a tweet starts with a user's name, like @username, the message is considered a "mention" of that user and shows up in the user's mention stream. It is a good way to call someone's attention, since it is narrower than just broadcasting.
• "Plus one." Google+ offers the "plus one" service, i.e., users can press the "plus one" button to indicate their preference for a post.
Note that we use the terms "retweet" or "reshare" and "mention" or "reply" interchangeably. Both "retweet" and "mention" are shared by Twitter and Google+. Consequently, we have four different types of social networks, follow, retweet, reply, and plus one, in Twitter and Google+ and a trust social network in Epinions, and each one of them is formally defined as:

Definition 2 (Social Network) A social network is defined as a directed graph $G = (U, E)$, where $U$ is the set of users and $E$ is the set of links between users. A directed edge $(u, v) \in E$, with $u \in U$ and $v \in U$, from user $u$ to user $v$ indicates that user $u$ "follows," "trusts," "retweets," "replies to," or "plus ones" user $v$ or his/her posts. We use $M_u$ to denote the number of friends of user $u$.

Besides the links, we have node information in the social network. Users often have features associated with them, such as age, gender, location, and relationship status. Furthermore, in Twitter, "hashtags" ("#" followed by a textual tag name) help to designate topics that people might search for, especially when they want to distinguish the word from a common phrase. Hashtags are popular in Twitter because writing space is limited and people can associate their tweets with an event (or product) without explaining the full context. In other words, hashtags can be considered as user-specified topics. For every user, we obtain the set of hashtags that the user has used. Frequently used hashtags are chosen as user features, and a binary variable is used to represent each feature value, i.e., one means that the user has mentioned the hashtag, and zero means the user has not mentioned it. We formally define user features as follows:

Definition 3 (User Feature) User features $\mathbf{x}$ are defined as vectors from a feature space:
$$\mathbf{x} = \{x_{u,l}\}_{1 \le u \le U,\ 1 \le l \le F}$$
where $F$ is the number of features per user, and $x_{u,l} \in \{0, 1\}$ represents user $u$'s $l$th feature value.

We assume that there is a set of topics in the document collection $\mathcal{W}$, and we define a topic as:

Definition 4 (Topic) A semantically coherent topic $\phi$ in the document collection $\mathcal{W}$ is defined as a multinomial distribution over all words in the vocabulary $\mathcal{V}$, i.e., $\{p(w \mid \phi)\}_{w \in \mathcal{V}}$, with the constraint $\sum_{w \in \mathcal{V}} p(w \mid \phi) = 1$. The set of latent topics in $\mathcal{W}$ is represented as $\Phi = \{\phi_t\}_{1 \le t \le T}$, where $T$ is the number of topics.

We assume that each user shows a different distribution over all topics. We define a user interest over topics as follows:


Definition 5 (User Interest) A user interest $\theta$ is defined as a multinomial distribution over the set of topics $\Phi$ in the document collection $\mathcal{W}$, i.e., $\{p(\phi \mid \theta)\}_{\phi \in \Phi}$, with the constraint $\sum_{\phi \in \Phi} p(\phi \mid \theta) = 1$.

Based on the definitions of these concepts, we formalize our research problem as follows:

Problem 1 (Topic and User Interest Modeling) Given a social network $G$ with all user features from the feature space $\mathbf{x}$, and a document collection $\mathcal{W}$ posted by the users, the task is to model and extract a set of topics $\{\phi_1, \phi_2, \ldots, \phi_T\}$ and a set of user interests $\{\theta_1, \theta_2, \ldots, \theta_U\}$, where $\phi_i$ is a topic in $\mathcal{W}$ and $\theta_j$ is the user interest of user $u_j \in U$.
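The concepts in Definitions 1-3 can be instantiated with small toy data structures. Every user, word, edge, and feature below is hypothetical, chosen only to show the shape of the inputs that the problem statement assumes:

```python
# Definition 1: document collection W = {w_u}, w_u = {w_{u,s}} (bags of words)
documents = {
    "alice": [["coffee", "morning"], ["deep", "learning", "paper"]],
    "bob":   [["deep", "learning"]],
}

# Definition 2: directed social network G = (U, E); (u, v) means u follows v
edges = {("bob", "alice")}

# Definition 3: binary feature vectors x_{u,l}, e.g., frequently used hashtags
features = {
    "alice": [1, 0, 1],  # hypothetical: has used #ml and #coffee, not #sports
    "bob":   [1, 0, 0],
}

U = len(documents)                                      # number of users
N_u = {u: len(docs) for u, docs in documents.items()}   # documents per user
vocabulary = {w for docs in documents.values() for d in docs for w in d}
W = len(vocabulary)                                     # unique words
M_u = {u: sum(1 for (a, b) in edges if a == u) for u in documents}  # friends

print(U, N_u["alice"], W, M_u["bob"])
```

The task of Problem 1 is then to map these three inputs to the topic-word distributions φ_t and user interests θ_u.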

Feature- and Social-Based Topic Models In this section, we first explain some intuitions behind our models, and then we introduce our probabilistic feature topic (FT) and social topic (ST) models. The Model Figures 1 and 2 show the graphical models of the FT and ST models. All the notations used in our FT and ST models are listed in Table 1. We assume a set of observed random variables w D fwu;s;n g, where wu;s;n represents an nth word in sth document posted by user u; a set of observed user features x D fxu;l g, where xu;l represents user u’s lth feature; three sets of latent random variables z D f´u;s;ng, ˚ D ft g, and  D fu g, where ´u;s;n represents a topic assignment on the word wu;s;n, t represents a word probability distribution of topic t, and u represents the user u’s topic probability distribution; and a set of latent parameters a D fat;l g, where at;l indicates the weight of the topic t on the lth feature. In addition, there is a given Dirichlet hyperparameter fˇg, and vi (i D 1 : : : Mu ) are the set of friends of user u. The generative process of the FT and ST models is divided into two steps: users interest

2183

T

generation and words generation. FT and ST have different processes of generating users interest, while they share the words generation process. The detailed two-step generation process is as follows: 1. (Users interest generative process) For each user u = 1,. . . ,U, P a x (FT) (a) ˛u D F lD1 P ut;l u;l or ˛u D M i D1 vi =Mu (ST) (b) Draw a user u’s interest u Dirichlet.˛u / 2. (Words generative process) For each topic t D 1; : : : ; T, draw a topicword distribution t Dirichlet.ˇ/. For each user u D 1; : : : ; U, and for each document s D 1; : : : ; Nu , and for each word n D 1; : : : ; Nu;s . (a) Draw a topic ´u;s;n Multinomial.u / (b) Draw a word wu;s;n Multinomial .´u;s;n /. Parameter Learning Our goal is to learn parameters that maximize the marginal log likelihood of the observed random variables w. The marginalization is performed with respect to the latent random variables fz; ˚; g, and it is hard to be maximized directly. Therefore, we maximize the complete data likelihood of fw; z; ˚; g. In this paper, we adopt the Monte Carlo expectation maximization (MCEM) algorithm. To derive the MCEM equations, we need the joint distribution of all random variables, including observed and latent variables, and parameters. Specifically, the joint probability of all random variables, i.e., the set of words in the document collection w, the set of topic assignments z, the set of a coin assignments y, the set of user interests , and the set of topic mixtures ˚, given all other parameters, can be simplified in (1) and (2) for the FT and ST models, respectively. W T Y Y

p.w; z; ; ˚jˇ; x; a/ /

! C T W Cˇ 1 t;wt;w

t D1 wD1



T U Y Y uD1 t D1

CUT C u;tu;t

PF

lD1

at;l xu;l 1

! (1)

T

T

2184

Topic Modeling in Online Social Media, User Features, and Social Networks for

Xu

The complete data log-likelihood of FT is denoted as follows:

φt T

L D ln p.w; z; ; ˚jˇ; x; a/ D ln p.wjz; ˚/ C ln p.˚jˇ/ C ln p.zj/ C ln p.jx; a/ (3)

a

Wu,s,n

Zu,s,n

Θu

Nu,s U

Nu

Topic Modeling in Online Social Media, User Features, and Social Networks for, Fig. 1 The graphical model of the feature topic model

φt

Θv

Let fak g denote the current estimate at the kth iteration. The MCEM algorithm iterates through the following two steps until convergence. • E-step: Given kth iteration parameters fak g, we compute Ez;;˚ ŒL, where z; ; ˚ are sampled from the posterior distribution of p.z; ; ˚jw; ˇ; x; ak /. • M-step: Find the .k C 1/th iteration parameters fakC1 g that maximizes the expectation Ez;;˚ ŒL computed in the last E-step.

T

Mu

Monte Carlo E-Step Since the expectation Ez;;˚ ŒL is not available in closed form, we compute it based on samples generated by Gibbs sampler. We compute the expectation for the FT model as follows:

Θu

Zu,s,n

Wu,s,n

Ez;;˚ ŒL D Ez;;˚ Œln p.wjz; ˚/

Nu,s U

C Ez;;˚ Œln p.˚jˇ/

Nu

C Ez;;˚ Œln p.zj/ h i C Ez;;˚ ln p.jx; ak / (4)

Topic Modeling in Online Social Media, User Features, and Social Networks for, Fig. 2 The graphical model of the social topic model

$$p(w, z, \Theta, \Phi \mid \beta, M) \propto \prod_{t=1}^{T} \prod_{w=1}^{W} \phi_{t,w}^{\,C^{TW}_{t,w} + \beta - 1} \prod_{u=1}^{U} \prod_{t=1}^{T} \theta_{u,t}^{\,C^{UT}_{u,t} + \left(\sum_{v=1}^{M_u} \theta_{v,t}/M_u\right) - 1} \quad (2)$$

where $M$ represents the set of user interests of users' friends, $C^{TW}_{t,w}$ is the number of times the word $w$ in the vocabulary is assigned to topic $t$, and $C^{UT}_{u,t}$ is the number of documents assigned to topic $t$ for user $u$. Since FT and ST have similar MCEM formulas, we only present the FT-related formulas to save space.

First, given the assignment of all other hidden variables, to sample a value for $z_{u,s,n}$ we obtain the following Gibbs sampling equation (5), using the identity $\Gamma(k+1) = k\,\Gamma(k)$:

$$p(z_{u,s,n} = t \mid z^{\neg\{u,s,n\}}, w, \beta, x, a^k) \propto \frac{C^{TW,\neg\{u,s,n\}}_{t,w} + \beta}{\sum_{w'=1}^{W} C^{TW}_{t,w'} + W\beta} \cdot \frac{C^{UT,\neg\{u,s,n\}}_{u,t} + \sum_{l=1}^{F} a^k_{t,l} x_{u,l}}{\sum_{t'=1}^{T} \left( C^{UT,\neg\{u,s,n\}}_{u,t'} + \sum_{l=1}^{F} a^k_{t',l} x_{u,l} \right)} \quad (5)$$


where $\neg\{u,s,n\}$ indicates that the count excludes the token with index $\{u,s,n\}$. Given all other parameters, computing the posterior distributions on $\Theta$ and $\Phi$ is straightforward: given the sampled $C^{UT}_{u,t}$ and $C^{TW}_{t,w}$, we evaluate the posterior means of $\theta_{u,t}$ and $\phi_{t,w}$ as follows:

$$\theta_{u,t} = \frac{C^{UT}_{u,t} + \sum_{l=1}^{F} a^k_{t,l} x_{u,l}}{\sum_{t'=1}^{T} \left( C^{UT}_{u,t'} + \sum_{l=1}^{F} a^k_{t',l} x_{u,l} \right)} \quad (6)$$

$$\phi_{t,w} = \frac{C^{TW}_{t,w} + \beta}{\sum_{w'=1}^{W} C^{TW}_{t,w'} + W\beta} \quad (7)$$
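A runnable toy sketch of one collapsed Gibbs sweep following Eq. (5), followed by the posterior means of Eqs. (6) and (7), may help make the count bookkeeping concrete (not the authors' code; the corpus, dimensions, and the small positive constant added to the prior are all hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical tiny corpus: docs[(u, s)] lists the word ids of doc s of user u.
T, W, U, F = 3, 8, 2, 4
beta = 0.1
x = rng.integers(0, 2, size=(U, F)).astype(float)
a_k = rng.uniform(0.1, 1.0, size=(T, F))      # current M-step estimate a^(k)
prior = x @ a_k.T + 0.01                      # sum_l a[t,l] x[u,l], kept positive

docs = {(u, s): rng.integers(0, W, size=6) for u in range(U) for s in range(2)}
z = {key: rng.integers(0, T, size=len(ws)) for key, ws in docs.items()}

# Count matrices C^{TW} and C^{UT} built from the current topic assignments.
C_tw = np.zeros((T, W))
C_ut = np.zeros((U, T))
for (u, s), ws in docs.items():
    for n, word in enumerate(ws):
        C_tw[z[(u, s)][n], word] += 1
        C_ut[u, z[(u, s)][n]] += 1

def gibbs_sweep():
    """One sweep of the collapsed sampler following Eq. (5)."""
    for (u, s), ws in docs.items():
        for n, word in enumerate(ws):
            t_old = z[(u, s)][n]
            C_tw[t_old, word] -= 1          # exclude token {u, s, n}
            C_ut[u, t_old] -= 1
            p = ((C_tw[:, word] + beta) / (C_tw.sum(axis=1) + W * beta)
                 * (C_ut[u] + prior[u]) / (C_ut[u] + prior[u]).sum())
            t_new = rng.choice(T, p=p / p.sum())
            z[(u, s)][n] = t_new
            C_tw[t_new, word] += 1
            C_ut[u, t_new] += 1

gibbs_sweep()

# Posterior means per Eqs. (6) and (7).
theta = (C_ut + prior) / (C_ut + prior).sum(axis=1, keepdims=True)
phi = (C_tw + beta) / (C_tw.sum(axis=1, keepdims=True) + W * beta)
```

After each sweep the count matrices stay consistent with the assignments, and the normalized rows of `theta` and `phi` are the estimated user-interest and topic-word distributions.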

Monte Carlo M-Step
In the M-step, we want to find the parameter setting $\{a^{k+1}\}$ that maximizes the expected complete-data log-likelihood from the E-step, which is equivalent to minimizing the negative log-likelihood of the complete data in (1):

$$-\mathcal{L} = -\ln p(w, z, \Theta, \Phi \mid \beta, x, a) = -\ln p(w \mid z, \Phi) - \ln p(\Phi \mid \beta) - \ln p(z \mid \Theta) - \ln p(\Theta \mid x, a) \quad (8)$$

Obviously, only the $\ln p(\Theta \mid x, a)$ term of the log-likelihood depends on the parameter $a$. We solve for it using an iterative gradient method.

Key Applications
In this section, we report our experimental results on three online social media data sets from Epinions, Twitter, and Google+. We compare our FT and ST models against a baseline method which we call User-LDA (ULDA), a model modified from Author-LDA (Steyvers et al. 2004). The modification is that ULDA assumes that one document contains one topic, while Author-LDA assumes a mixture of topics in one document. ULDA is equivalent to FT without features and to ST without a social network. We evaluate all models by the likelihood of the test data set and the accuracy of document recommendation.

Description of Data Sets
We download the social networks among users and all available documents for each user from Epinions, Twitter, and Google+.

Epinions: Epinions was crawled by the authors of Moghaddam and Ester (2011). In Epinions, users can read reviews about a variety of items to help them decide on a purchase. Moreover, users can write reviews and establish trust relationships with other users.

Twitter: We crawled the Twitter data set using the Twitter REST API. More specifically, we start by selecting a famous Twitter user, @jack, who is a cofounder of Twitter, one of the earliest Twitter users, and has millions of followers. We then randomly sample 1,000 of his followers, and iteratively, for each follower, we sample 1,000 random followers until we obtain 14,905 users. Meanwhile, we download all available tweets in Oct. 2011 (the Twitter API provides up to 3,200 public tweets per user).

Google+: Currently, the Google+ API provides read-only access to public data with some restrictions; e.g., "circles" information is not available. Since the profile of the most popular Google+ user, Mark Zuckerberg (the cofounder of Facebook), is not publicly accessible, we start from the second most popular Google+ user, Larry Page (a cofounder of Google), to obtain a well-connected social network. We first download all available posts from Larry Page. Based on these posts, we extract millions of users who have replied to, reshared, or plus-oned at least one of Larry Page's posts. We then randomly pick 4,000 users and collect all their available posts as well as available user features, such as gender and relationship status. Again we extract all users who have replied to, reshared, or plus-oned posts of these 4,000 users. We continued to collect users and posts in Oct. 2011 until we obtained 10,555 users.
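Both crawls follow the same snowball-sampling pattern; a minimal sketch (the graph, function names, and parameters are illustrative stand-ins for real API calls, not the authors' crawler):

```python
import random
from collections import deque

random.seed(0)

def snowball_sample(get_neighbors, seed_user, per_node=1000, target=14905):
    """BFS-style snowball crawl as described above: starting from a seed
    user, repeatedly sample up to `per_node` neighbors of each collected
    user until `target` users are gathered. `get_neighbors` stands in for
    an API call (e.g., listing a user's followers)."""
    collected, frontier = {seed_user}, deque([seed_user])
    while frontier and len(collected) < target:
        user = frontier.popleft()
        neighbors = get_neighbors(user) or []
        for v in random.sample(neighbors, min(per_node, len(neighbors))):
            if v not in collected:
                collected.add(v)
                frontier.append(v)
                if len(collected) >= target:
                    break
    return collected

# Toy follower graph for illustration.
toy = {0: [1, 2, 3], 1: [4, 5], 2: [5, 6], 3: [4, 6], 4: [], 5: [], 6: []}
users = snowball_sample(toy.get, seed_user=0, per_node=2, target=5)
```

Sampling a bounded number of neighbors per user keeps the crawl tractable on very high-degree accounts while still producing a well-connected subgraph.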


Topic Modeling in Online Social Media, User Features, and Social Networks for, Table 2 User features and social networks in Epinions, Twitter, and Google+

Data sets | User features | Social networks
Epinions | Product categories in user profiles | Trust links
Twitter | Top 100 frequently used hashtags | Retweet, reply
Google+ | Gender and relationship status | Reshare, reply, plus one

Table 2 shows the user features and social networks we created from our data sets. In Epinions, we have user-provided category interests, such as car, wine, and electronic device, and we use the user-provided trust network. In the Twitter data set, since around 70 % of user profile information (such as location, education, and interests) is missing, incomplete, or even fake, the top 100 most frequently used hashtags are chosen as user features. There are follow, retweet, and reply social networks in Twitter. Based on our experimental analysis, we find that a follow relationship does not imply that two users have similar interests, because users regard Twitter not only as a social interaction service but also as an information source (Kwak et al. 2010). Furthermore, Twitter users do not have serious commitments to followers or followees, so they switch topic interests frequently by following new users or unfollowing current ones. This is different from collaboration networks in academia, where users usually establish solid relationships by coauthoring papers. The retweet and reply networks are much more solid than the follow network; however, they are much sparser. Therefore, we construct the Twitter social network by creating a link between two users if there is an edge in either the retweet or the reply network. In Google+, the relatively complete and valid user features are gender and relationship status, and we establish the Google+ social network similarly to Twitter.

Experimental Setup
In our experiments, we split the whole data set into a training set and a test set. We withhold the 30 % most recent user documents as the test set and build all models using the older 70 % of user documents, together with the social networks and user features. We compare the following models:
• ULDA. A baseline LDA model, where a document's topic assignment is drawn from the user's interest distribution.
• FT. Our feature topic model.
• ST. Our social topic model.
Note that Twitter-LDA (Zhao et al. 2011) is also a competitive method, which assumes one document contains one topic. Since reviews are much longer than tweets and usually contain multiple topics, the experimental results show that Twitter-LDA is not as good as ULDA on Epinions; therefore, we do not consider Twitter-LDA as a comparison partner. All the above models were implemented in C and C++, except that for ULDA we use the toolkit GibbsLDA++ (http://gibbslda.sourceforge.net/). All topic models used in the experiments have symmetric Dirichlet priors; specifically, we set $\beta = 0.1$ as recommended by Griffiths and Steyvers (2004). All experiments were performed on a server running Windows 7 with an Intel Xeon E5630 2.53 GHz CPU and 8 GB RAM.

Perplexity
We estimate the likelihood of the test data set given the trained models. Perplexity is a standard measure of how well a model fits the data (Blei et al. 2003); it is monotonically decreasing in the likelihood of the test data set, so a lower perplexity indicates better performance. We compute the perplexity as follows:

$$\mathrm{perplexity}(\mathrm{test}) = \exp\left( -\frac{\sum_u \sum_s \log p(w_{u,s})}{N_{\mathrm{test}}} \right) \quad (9)$$
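Eq. (9) can be sketched directly (the per-document likelihood values below are hypothetical placeholders for $p(w_{u,s})$ computed by a trained model):

```python
import numpy as np

def perplexity(doc_likelihoods):
    """Eq. (9): exp of the negative mean log-likelihood over test documents;
    lower values indicate a better fit."""
    log_p = np.log(np.asarray(doc_likelihoods, dtype=float))
    return float(np.exp(-log_p.sum() / len(log_p)))

# Hypothetical per-document likelihoods from two models: the model that
# assigns higher likelihoods to the test documents gets the lower perplexity.
weak_model = perplexity([0.5, 0.5, 0.25])
strong_model = perplexity([0.9, 0.8, 0.7])
```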


Topic Modeling in Online Social Media, User Features, and Social Networks for, Fig. 3 Test set perplexity. (a) Epinions data set. (b) Twitter data set. (c) Google+ data set

where $N_{\mathrm{test}}$ is the number of documents in the test data set. We compute the estimated likelihood of a test document $w_{u,s}$ by (10):

$$p(w_{u,s}) = \prod_{w' \in w_{u,s}} \left( \sum_{t=1}^{T} \theta_{u,t} \, \phi_{t,w'} \right) \quad (10)$$
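Eq. (10) as a short sketch (the 2-topic, 3-word model below is hypothetical):

```python
import numpy as np

def doc_likelihood(words, theta_u, phi):
    """Eq. (10): product over tokens of sum_t theta[u, t] * phi[t, w]."""
    return float(np.prod([theta_u @ phi[:, w] for w in words]))

# Hypothetical 2-topic, 3-word model; theta_u is the user's interest vector.
theta_u = np.array([0.8, 0.2])
phi = np.array([[0.7, 0.2, 0.1],    # topic 0 over the vocabulary
                [0.1, 0.1, 0.8]])   # topic 1

score = doc_likelihood([0, 2], theta_u, phi)
```

Ranking all test documents by this score and keeping the $k$ highest yields the top-$k$ recommendation list used in the evaluation.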

We empirically choose the number of topics as 10, 25, and 10 for Epinions, Twitter, and Google+, respectively, because these settings achieve the best perplexity. We train all compared models on the training set and compute the perplexity of the test set. Figure 3 shows the perplexity of the compared models on Epinions, Twitter, and Google+. We observe that the performance of the ST and FT models is always better than ULDA on all three data sets. The major reason for the performance improvement of FT and ST over ULDA is that they capture the feature and social network associations among users. On the Epinions and Google+ data sets, FT outperforms ST, but on the Twitter data set, ST is better than FT. The better performance of FT on Google+ is somewhat surprising, since the user features of Google+ are relatively weak.

Document Recommendation
The document recommendation task can be considered a ranking problem. Given a particular user, we compute a score for each test document using (10) and recommend the k test documents with the highest scores. We evaluate the accuracy of top-k document recommendation

using precision, i.e., the number of hits over k, and recall, i.e., the number of hits over the number of documents posted by the user. We count a recommended document as (1) a "my hit" if it was actually posted by the user or (2) an "all hit" if it was posted by the user or the user's friends. The task is extremely hard because many short user-generated documents are similar to each other. In Figs. 4–6, we show the precision and recall of the compared models for document recommendation. Epinions: We observe that FT is the winner for my hits, and ST is the winner for all hits. The reason is that user features are effective for recommending to the user himself but not to his friends, while the social network is effective for recommending to friends. According to Fig. 4b, FT outperforms ULDA, demonstrating that the features from the product categories are effective, while ST cannot improve the recall of ULDA. Figure 4d shows a different picture when measuring all hits: FT is worse than ULDA, but ST is clearly better than ULDA, which is not unexpected given that all hits favor a social network-based method. The network in the Epinions data set is a trust network, not a "social network," and it cannot improve the precision. Twitter: Figure 5b, d show that FT and ST have very similar recall (indicating strong social correlation) and both outperform ULDA up to k = 30. Note that in real-life applications the number of



Topic Modeling in Online Social Media, User Features, and Social Networks for, Fig. 4 Document recommendation in Epinions. (a) Precision on my hits. (b) Recall on my hits. (c) Precision on all hits. (d) Recall on all hits

recommendations presented to a user will typically be fairly small. Figure 5a, c show that FT and ST consistently outperform ULDA in terms of precision, both for my hits and all hits, with the relative gain being much larger for my hits. Google+: According to Fig. 6b, d, as in Epinions, FT achieves better recall for my hits and ST better recall for all hits. In this data set, FT and ST can outperform ULDA in terms of recall only for my hits up to k = 30. In terms of precision, ST consistently outperforms ULDA, which does better than FT. This is not surprising

since the user features available in Google+ are relatively weak. We also note that in a few cases (Figs. 4a, c and 6d) ULDA outperforms both ST and FT. In conclusion, ST can boost the performance of ULDA if a suitable social network is available (e.g., in Google+ and Twitter, but not Epinions); trust or follow networks do not support the assumption that friends have similar interests. FT can improve the performance if suitable features are available (e.g., in Twitter and Epinions, but not Google+).
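The hit-based precision and recall used in this evaluation can be sketched as follows (the ranking and ground-truth sets are hypothetical, not from the experiments):

```python
def precision_recall_at_k(ranked_docs, relevant, k):
    """Precision = hits / k; recall = hits / |relevant|, where a hit is a
    recommended document in the relevant set ("my hits": documents the user
    posted; "all hits": documents the user or the user's friends posted)."""
    hits = sum(1 for d in ranked_docs[:k] if d in relevant)
    return hits / k, hits / len(relevant)

# Hypothetical top-k ranking (by Eq. (10) scores) and ground truth.
ranking = ["d3", "d1", "d7", "d2"]
my_docs = {"d1", "d2", "d9"}
p, r = precision_recall_at_k(ranking, my_docs, k=3)
```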


Topic Modeling in Online Social Media, User Features, and Social Networks for, Fig. 5 Document recommendation in Twitter. (a) Precision on my hits. (b) Recall on my hits. (c) Precision on all hits. (d) Recall on all hits

Future Directions
In this paper, we propose comprehensive feature-based and social-based topic models to learn user interests and latent topics in a document collection. The models are based on the assumptions that (1) friends tend to have similar topic distributions and (2) users with similar features also tend to have similar topic distributions. We perform experiments on three real-life data sets from Epinions, Twitter, and Google+. We evaluate all compared models by the perplexity of the test data set and by precision and recall on document recommendation, and we demonstrate that our proposed models in many cases outperform a baseline LDA model. This paper suggests several interesting directions for future research. First, there are diverse social networks in online social media; although the combination of social networks proposed in this paper appears to work, more principled approaches for constructing the social network should be investigated. Second, since Twitter and Google+ receive millions of new documents every day, building an efficient and incremental model becomes important and challenging.



Topic Modeling in Online Social Media, User Features, and Social Networks for, Fig. 6 Document recommendation in Google+. (a) Precision on my hits. (b) Recall on my hits. (c) Precision on all hits. (d) Recall on all hits

Cross-References
▸ Flickr and Twitter Data Analysis
▸ Gibbs Sampling
▸ Probabilistic Graphical Models
▸ Social Influence Analysis
▸ User Behavior in Online Social Networks, Influencing Factors

References
Blei DM, Ng A, Jordan M (2003) Latent Dirichlet allocation. JMLR 3:993–1022
Castillo C, Mendoza M, Poblete B (2011) Information credibility on Twitter. In: Proceedings of the 20th international conference on world wide web, Hyderabad, pp 675–684
Chen W, Fong S (2011) Social network collaborative filtering framework and online trust factors: a case study on Facebook. IJWA 3:17–28
Cheng Z, Caverlee J, Lee K (2010) You are where you tweet: a content-based approach to geo-locating Twitter users. In: Proceedings of the ACM international conference on information and knowledge management, Toronto, pp 759–768
Cheng Z, Caverlee J, Lee K, Sui D (2011) Exploring millions of footprints in location sharing services. In: International AAAI conference on weblogs and social media, Barcelona
Friedkin NE (1998) A structural theory of social influence. Cambridge University Press, Cambridge/New York
Griffiths TL, Steyvers M (2004) Finding scientific topics. PNAS 101:5228–5235
Hofmann T (1999) Probabilistic latent semantic analysis. In: UAI, Stockholm
Hong L, Davison BD (2010) Empirical study of topic modeling in Twitter. In: Proceedings of the first workshop on social media analytics, Washington, DC, pp 80–88
Kim Y, Shim K (2011) TWITOBI: a recommendation system for Twitter using probabilistic modeling. In: IEEE international conference on data mining, Vancouver, pp 340–349
Kwak H, Lee C, Park H, Moon S (2010) What is Twitter, a social network or a news media? In: Proceedings of the 19th international conference on world wide web, Raleigh, pp 591–600
McPherson M, Smith-Lovin L, Cook JM (2001) Birds of a feather: homophily in social networks. Annu Rev Sociol 27:415–444
Mei Q, Cai D, Zhang D, Zhai C (2008) Topic modeling with network regularization. In: WWW, Beijing
Moghaddam S, Ester M (2011) ILDA: interdependent LDA model for learning latent aspects and their ratings from online product reviews. In: SIGIR, Beijing, pp 665–674
Pennacchiotti M, Popescu AM (2011) A machine learning approach to Twitter user classification. In: ICWSM, The AAAI Press, pp 281–288
Ramage D, Dumais S, Liebling D (2010) Characterizing microblogs with topic models. In: ICWSM, Washington, DC
Rosa KD, Ellen J (2009) Text classification methodologies applied to micro-text in military chat. In: Proceedings of the 2009 international conference on machine learning and applications, Miami Beach, pp 710–714
Steyvers M, Smyth P, Rosen-Zvi M, Griffiths T (2004) Probabilistic author-topic models for information discovery. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining, Seattle, pp 306–315
Sun Y, Han J, Gao J, Yu Y (2009) iTopicModel: information network-integrated topic modeling. In: ICDM 2009, the ninth IEEE international conference on data mining, Miami, 6–9 Dec 2009, pp 493–502
Wang AH (2010) Don't follow me – spam detection in Twitter. In: SECRYPT, Athens, pp 142–151
Wu S, Hofman JM, Mason WA, Watts DJ (2011) Who says what to whom on Twitter. In: Proceedings of the 20th international conference on world wide web, Hyderabad, pp 705–714
Zhao WX, Jiang J, Weng J, He J, Lim EP (2011) Comparing Twitter and traditional media using topic models, pp 338–349

Topic Networks
▸ Semantic Social Networks


Topology of Online Social Networks

Kon Shing Kenneth Chung1, Mahendra Piraveenan2, and Liaquat Hossain3,4
1 Complex Systems Research Group, Project Management Program, Faculty of Engineering & Information Technologies, School of Civil Engineering, J05, The University of Sydney, Sydney, NSW, Australia
2 Complex Systems Research Group, Project Management Program, Faculty of Engineering & Information Technologies, The University of Sydney, Sydney, NSW, Australia
3 Professor, Information Management, Division of Information and Technology Studies, The University of Hong Kong, Hong Kong
4 Honorary Professor, Complex Systems Research Group, Project Management Program, Faculty of Engineering & Information Technologies, The University of Sydney, Sydney, NSW, Australia

Synonyms Online forums; Online social networks; Social media; Social network analysis; Web 2.0

Glossary ICT Information and communication technology Multiplex Ties Ties consisting of more than one relation Web 2.0 Websites, generally social media, that feature dynamic content

Introduction The World of Complex Networks The study of complex networks is a dominant trend in recent research that transcends domain boundaries. The pervasiveness of networked systems in biology, technology, and society has led to a recent surge of interest in uncovering the organizing principles that govern the topology


and the dynamics of various complex networks (Park and Barabasi 2007). Indeed, complex networks research can be conceptualized as lying at the intersection between graph theory and statistical mechanics, which endows it with a truly multidisciplinary nature (Costa et al. 2007). In particular, many social systems can be modelled as complex networks. These include online social media, collaborations of scientists, interconnected groups of corporations and banks with shareholding links between them, ecological systems of species, communities of people who are subject to spread of infection, and operational hierarchies in defense organizations, among others. It has been shown that many of these networks from various domains can exhibit surprisingly similar underlying structures. For example, most social networks are shown to have the “scale-free” structure (Dorogovtsev and Mendes 2003; Piraveenan et al. 2012b, 2007, 2008, 2010), where the topological structure is largely independent of scale, and many also display the “small-world” property, where the average diameter of the network remains largely independent of the size of the network (Albert and Barabasi 2002; Milgram 1967). A number of structural properties of these networks such as modularity, topological robustness, mixing patterns, network diameter, and clustering have been analyzed in detail. Furthermore, it has been explained that many of these structural features are closely related to the functions the networks, or subnetworks and motifs contained therein, are intended to perform (Dorogovtsev and Mendes 2003). Therefore, many topological and behavioral patterns of social networks can be studied generically across domains. Overview of Online Social Networks Loosely defined, a social network is comprised of a set of actors who are engaged in one or more social ties. Actors are generally individuals but can be aggregate units such as departments, organizations, or families. 
Ties range from information exchange, such as communication or advice, to resource exchange, such as goods and services, and social or financial


support. The strength of ties may range from weak to strong, depending on the number and types of resources they exchange, the frequency of exchanges, and the intimacy of the exchanges (Marsden and Campbell 1984). Furthermore, social ties may consist of multiple relations (e.g., in the case of doctors who have a doctor–patient relationship as well as a friendship relationship) and are therefore called "multiplex ties" (Haythornthwaite 2002). With the advent and rapid adoption of the Internet, Hinds and Kiesler (1995) noted that the connectivity offered by computers through computer networks (and the Internet) enabled communication that traversed spatial, organizational, structural, and temporal barriers. They termed this capability the second-order effect of information and communication technologies (ICTs), the first-order effects being related to task efficiency and job productivity. According to Wellman and colleagues (Wellman 1997; Wellman et al. 1996), an electronic group is virtually a social network, where social ties in offline and online environments influence each other. Thus, with the advent of Web 3.0 (Hendler 2009) and current mobile and Web 2.0 technologies such as Facebook, Flickr, YouTube, and Twitter, it is a widely accepted norm that personal relations these days are no longer conducted face-to-face only. The revolution of technology and the Internet means that the entire communication environment has taken on a virtual dimension. With online social networks, there exists a multiplex character of personal networks, which tend precisely to intersect several social relations (Licoppe and Smoreda 2005). ICTs now serve as supplements and even substitutes for traditional resources (e.g., town-hall style community meetings, massive open online courses) for developing an actor's social network (Nardi et al. 2000). Thus, social networks not only shape the use of ICT for communication, but ICTs are also shaping personal networks and redrawing social boundaries at both the personal and societal levels. For instance, network scholars and sociologists are convinced that the definition of community is


not only confined to physical and geographical boundaries but is recreated by virtual boundaries resulting from interactions over online social network sites such as discussion forums (Chung 2011; Chung and Chatfield 2011) and the Twittersphere (Gruzd et al. 2011). In this essay, we focus on the topology of key social networking sites such as Facebook, Google+, Flickr, YouTube, and Twitter because we consider these key SNS responsible for generating much of the online network activities. A brief history of these SNS is provided below.

Historical Background
From 2004 to 2011, the world witnessed the growth of social networking sites (SNS), some of which have become not only some of the world's top technology companies but also a mainstream form of communication across the globe.
Facebook: In 2004, the social network site Facebook was created by Mark Zuckerberg, although the notion was originally conceived in 2003 while Zuckerberg was a student at Harvard University. In its very first form, Facebook (then called Facemash) attracted 450 visitors and 22,000 photo views in its first 4 h online (Locke 2007). Initially limited to students at Harvard, it was soon made available to the Ivy League universities, then to high schools and companies including Apple and Microsoft. By 2007, Facebook had over 100,000 business pages, and hits to the site increased steadily from 2009 onwards, surpassing Google by March 2010. As of June 2011, Facebook had reached one trillion page views, and the company was made public in 2012.
Google+: In an effort to enter the online social networking space, Google launched a number of SNS such as Google Buzz, Google Friend Connect, and Orkut. While the former two have ceased operations, Orkut has over 33 million active users worldwide, predominantly in India and Brazil


(Google 2013). The fourth service that was launched is Google+, initially as an invitation-only service in June 2011, but opened to the public in September 2011. It was estimated that there were over 500 million members by the end of 2012.
YouTube: In 2005, three former employees of PayPal, Steve Chen, Chad Hurley, and Jawed Karim, frustrated by the lack of efficient Internet technologies to share video files online, developed YouTube, a video-sharing website that allows users to upload, share, and view videos. By July 2006, more than 65,000 videos were being uploaded every day, while there were 100 million views per day (USA Today 2006). In early 2013, as YouTube reached one billion monthly active users, Google, which bought the company in 2006, stated, "If YouTube were a country, we'd be the third largest in the world after China and India" (Mukherjee and Ghosh 2013).
Flickr: Flickr, a multimedia hosting company generally associated with photo sharing, although the sharing of videos is also possible, was launched in 2004 by a company called Ludicorp and acquired by Yahoo! in 2005. The popularity of Flickr is largely attributed to its use by bloggers worldwide and their need to embed multimedia into their blogs. The use of "tags" was also popularized by Flickr in such a way that it allowed users to "tag" their photographs with information such as names of people, location of the photograph, and other details. As of June 2011, Flickr had a total of 51 million registered users with over 80 million unique visitors (Yahoo! 2011) and is growing fast.
Twitter: While the late 1990s witnessed the emergence of short text messaging (SMS), which allowed individuals to send short messages from one mobile phone to another, Twitter can be described as the "SMS of the Internet." Simply put, Twitter is a social networking service that allows users to send and receive messages of up to 140 characters, known as "tweets." This service is particularly useful for users to microblog, i.e., to make short and quick updates regarding their


status. Created in 2006 by Jack Dorsey, Twitter had over 500 million registered users as of 2012, with over 340 million tweets a day (Lunden 2012). Other well-known online social networking sites include Hi5, Tagged, LinkedIn, and many others that are popular in certain countries.

Measuring the Topology of Networks
A complete description of the way the components of a network are connected to each other is called the network's topology. Understanding the topology of a network is vital for understanding its function, since the topology evolves (or is designed) to better undertake the function, and the efficiency of network function is influenced by its topology. For this reason, topological analysis of complex networks has been an intensely researched area in the last decade. An ensemble of standard measures can be utilized to quantify the topology of any network, including online social networks. In this section, we briefly review some of these measures.

Centrality Measures
Degree Centrality: The degree of a node is simply the number of links it has to other nodes in the network. Whether double links and self-links are counted is a matter of convention.
Betweenness Centrality: Betweenness centrality measures the fraction of shortest paths that pass through a given node (Dorogovtsev and Mendes 2003), averaged over all pairs of nodes in a network. It is formally defined, for a directed graph, as

$$BC(v) = \frac{1}{(N-1)(N-2)} \sum_{s \neq v \neq t} \frac{\sigma_{s,t}(v)}{\sigma_{s,t}}$$

where $\sigma_{s,t}$ is the number of shortest paths between source node $s$ and target node $t$, while $\sigma_{s,t}(v)$ is the number of those shortest paths that pass through node $v$.

Closeness Centrality:

$$CC(v) = \frac{1}{\sum_{i \neq v} d_g(v, i)}$$

where $d_g(v, i)$ is the shortest path (geodesic) distance between nodes $v$ and $i$. Note that the distance sum is inverted, so that the node which is "closest" to all other nodes has the highest closeness centrality.
Eigenvector Centrality: The eigenvector centrality measure is based on the assumption that a node's centrality is influenced by the centrality scores of its neighbors – that the centrality score of a node is proportional to the sum of the centrality scores of its neighbors. As such, it is defined iteratively. If the centrality scores of the nodes are given by the vector $x$ and the adjacency matrix of the network is $A$, then we can define $x$ iteratively as $x \propto Ax$, i.e., $\lambda x = Ax$. The centrality scores are obtained by solving this eigenvalue equation. It can be shown that, while there can be many values for $\lambda$, only the largest value results in positive scores for all nodes.

Community Structure Measures
Assortativity: Assortativity is the tendency observed in networks where nodes mostly connect with similar nodes. Typically, this similarity is interpreted in terms of degrees of nodes. Assortativity has been formally defined as a correlation function of the excess degree distribution and the link distribution of a network (Newman 2003). The concepts of degree distribution $p(k)$ and excess degree distribution $q(k)$ for undirected networks are well known (Sole and Valverde 2004). Given $q(k)$, one can introduce the quantity $e_{j,k}$ as the joint probability distribution of the remaining degrees of the two nodes at either end of a randomly chosen link. Given these


distributions, the assortativity of an undirected network is defined as

\rho = \frac{1}{\sigma_q^2} \left[ \sum_{jk} j k \, (e_{j,k} - q_j q_k) \right]

where \sigma_q is the standard deviation of q(k). Assortativity distributions can be constructed by considering the local assortativity of all nodes in a network (Piraveenan et al. 2008, 2010).

Modularity: Network modularity is the extent to which a network can be separated into independent subnetworks. Formally, modularity quantifies the fraction of links that are within the respective modules compared to all links in a network (Alon 2007). Alon (2007) introduces an algorithm which can partition a network into k modules and measure the partition's modularity Q. The measure uses the concept that a good partition of a network should have many within-module links and very few between-module links. The modularity can be written as

Q = \sum_{s=1}^{k} \left[ \frac{l_s}{L} - \left( \frac{d_s}{2L} \right)^2 \right]

where k is the number of modules, L is the number of links in the network, l_s is the number of links between nodes in module s, and d_s is the sum of the degrees of nodes in module s. To avoid getting a single module in all cases, this measure imposes Q = 0 if all nodes are in the same module or nodes are placed randomly into modules.

Clustering Coefficient: The clustering coefficient of a node characterizes the density of links in the environment closest to the node. Formally, the clustering coefficient C of a node is the ratio between the total number y of links connecting its nearest neighbors and the total number of all possible links between all these z nearest neighbors (Dorogovtsev and Mendes 2003):

C = \frac{2y}{z(z-1)}


The clustering coefficient C for a network is the average of C over all nodes.

Average Path Length: The average path length l of a network is defined as the average length of the shortest paths between all pairs of nodes in that network. For many real-world networks, this average path length is much smaller than the size of the network, that is, l \ll N. Such networks are said to show the small-world property (Newman 2000; Watts and Strogatz 1998).

Reciprocity: Reciprocity is used to evaluate discrepancies between the in- and out-degrees of a given node (Marsden and Campbell 1984). The reciprocity of a given node v is simply the ratio between the number of nodes which have both incoming and outgoing links from/to v and the number of nodes which have incoming links from v. Therefore, reciprocity is always a fraction between zero and unity. The average reciprocity of the network can be calculated by averaging the node reciprocity over all nodes.

Other Measures: A number of other topological measures can be used to quantify the structure of online social networks, including entropy and information measures (Piraveenan et al. 2012a; Sole and Valverde 2004), rich-club measures (Colizza et al. 2006), and local assortativity measures (Piraveenan et al. 2012b, 2008).
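The clustering-coefficient, path-length, and node-reciprocity definitions above translate directly into code; below is a minimal plain-Python sketch for small graphs (the adjacency-list representation and function names are ours):

```python
from collections import deque

def clustering(adj, v):
    """C = 2y / (z(z - 1)): density of links among v's z nearest neighbors."""
    nbrs = adj[v]
    z = len(nbrs)
    if z < 2:
        return 0.0
    y = sum(1 for i, a in enumerate(nbrs) for b in nbrs[i + 1:] if b in adj[a])
    return 2 * y / (z * (z - 1))

def average_path_length(adj):
    """Mean BFS distance over all ordered pairs (assumes a connected graph)."""
    total = pairs = 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

def node_reciprocity(out_adj, v):
    """Fraction of v's out-neighbors that also link back to v (directed graph)."""
    outs = set(out_adj[v])
    if not outs:
        return 0.0
    return sum(1 for u in outs if v in out_adj[u]) / len(outs)
```

For a triangle, clustering returns 1.0 for every node; for the middle node of a path a–b–c it returns 0.0, since the two neighbors are not linked.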

Topological Features Observed in Online Social Networks

Facebook

The Facebook online social network had over a billion active nodes (users) as of October 2012 (Wikipedia 2013). Facebook is an undirected network, since it requires friend requests to be "accepted" before two nodes (people) can be connected. The Facebook network appears to be nearly fully connected, with one study finding that 99.91 % of the nodes belong to the largest component (Ugander et al. 2011). Even though most online social networks are claimed to be


Topology of Online Social Networks, Fig. 1 Degree distribution of the Facebook network as of May 2011, consisting of 721 million nodes. Both the global distribution and the distribution for the USA alone are shown (Adapted from Ugander et al. (2011))

scale-free (Mislove et al. 2007), Facebook seems to depart from this trend, and no study has found that its degree distribution fits a power law. However, the degree distribution seems to be monotonically decreasing with degree, as shown in Fig. 1. There is an anomaly at degree 20, because Facebook encourages the friends of individuals who have fewer than 20 links to find links for them, and this encouragement stops once an individual reaches 20 friends. The degree distribution shows a cutoff at 5,000, the maximum number of friends permitted by Facebook (Ugander et al. 2011). Most people in Facebook have a moderate number of friends, while only a minority has thousands of friends. One study as of May 2011, considering 721 million nodes, found the median number of friends in Facebook to be 99, while the average degree was 190.2 (Ugander et al. 2011). Facebook is essentially a small-world network (Milgram 1967), where the average path lengths are much smaller than the network diameter. It was found that in May 2011, the average path length of the global network was 4.7, while the average distance between users within the USA at the same time was 4.3. Compared to a network diameter of 41, this is quite small (Magno et al. 2012). It was found that, in May 2011, 92 % of all pairs of users had less than

five degrees of separation, while 99.6 % had less than six degrees of separation. Considering the USA alone, the small-world effect was even more pronounced, with 96 % of user pairs within five degrees of separation (Ugander et al. 2011). The same Facebook network has an assortativity coefficient of r = 0.226, making it moderately assortative, similar to other social networks (Newman 2002; Ugander et al. 2011). In terms of clustering, the local clustering coefficient in Facebook seems to be fairly high, similar to Google+ (Magno et al. 2012). For example, for users with 100 friends, the average local clustering coefficient is 0.14, indicating that for a median user, 14 % of all their friend pairs are themselves friends (Ugander et al. 2011). Compared to a study of MSN messenger correspondences undertaken in 2008, this is five times as high (Leskovec and Horvitz 2008). The clustering coefficient decreases monotonically with degree and drops drastically for users closer to having 5,000 links. This may indicate that users with a high number of friends are befriending people less coherently and more indiscriminately. Regarding mixing patterns in terms of node attributes, it was found that there is a strong preference for friends of around the same age and from the same country. Community structure


was shown to be clearly evident in the global Facebook network, at the scale of friendships between and within countries. Countries in turn were seen to themselves exhibit a modular organization, largely dictated by geographical distance (Ugander et al. 2011).

Topology of Online Social Networks, Table 1 Comparison of topological features of some prominent OSNs (Adapted from Magno et al. (2012))

Network    Nodes   Edges   Crawled (%)   Path length   Reciprocity (%)   Diameter   In-degree   Out-degree
Google+    35M     575M    56            5.9           32                19         16.4        16.4
Facebook   721M    62G     100           4.7           100               41         190.2       190.2
Twitter    41.7M   106M    100           4.1           22                18         28.19       29.34
Orkut      3M      223M    11            4.3           100               9          –           –

Google+

The Google+ network had 500 million nodes (users) as of September 2012 according to Wikipedia. Unlike Facebook, Google+ is a directed network, since it is possible to add someone to one's friend circles without the favor being returned. Also unlike the Facebook network, it has been claimed that the Google+ network is scale-free (Magno et al. 2012). A study undertaken in December 2011 by crawling the publicly available profiles in Google+ (numbering some 35 million) found that the in-degree distribution has a scale-free exponent γ = 1.3 and the out-degree distribution has a scale-free exponent γ = 1.2. The slight disparity of the exponents is explained by the fact that while there is no threshold for in-degree (any number of people may add a person to their circles), there is a threshold for out-degree in many cases. That is, Google+ allows only a limited number of privileged users to add more than 5,000 friends to their circles. It is not yet established whether this conclusion extends to the entire Google+ network and not just the public network. A list of the nodes (people) from the public network which have the highest in-degree was compiled by the same study (Magno et al. 2012). The list is headed by Larry Page, coauthor of the paper that introduced the Google search engine and the PageRank algorithm used in Google (Page et al. 1999), followed by Mark Zuckerberg, founder of Facebook, and Britney Spears, musician. The study notes that IT professionals, academics, and entrepreneurs tend to dominate the list of highest-ranked nodes in terms of in-degree, compared to other social networks. The public network has an average degree of 32.8 (ignoring directionality), which is much lower than that of Facebook, suggesting that the number of inactive users may be high and that even active users may use it less regularly than Facebook. The average path length of the public network is 5.9, according to Magno et al. (2012). Compared to the network diameter of 19, this hints at a small-world nature. It is postulated that the small-world structure would be even more prominent if all nodes, and not just publicly available ones, were considered. The average path length may be slightly higher than in other networks also because Google+ is a new system where relationships are still rapidly growing. The network diameter of 19 is comparable to that of Twitter, another directed network, which has a diameter of 18. The public network contains a giant component which connects 70 % of the users. Since Google+ is a directed network, relational reciprocity is an important metric in understanding its structure. More than 60 % of the users seem to have relational reciprocity higher than 0.6. The percentage of global reciprocal relations is calculated to be 32 % in Google+, comparatively higher than the 22.1 % reported for Twitter, another directed network (Magno et al. 2012). Table 1, adapted from Magno et al. (2012), summarizes and compares some of the topological aspects of Google+ with other prominent OSNs.
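The global reciprocity percentages reported in Table 1 measure the fraction of directed links that are returned; a minimal sketch (edge-list representation and function name are ours):

```python
def edge_reciprocity(edges):
    """Fraction of directed links (u, v) whose reverse (v, u) also exists."""
    links = set(edges)
    return sum((v, u) in links for (u, v) in links) / len(links)
```

For the edge list [('a', 'b'), ('b', 'a'), ('a', 'c')], two of the three links are reciprocated, giving 2/3; in a fully symmetric (undirected) network the value is 1.0, which is why Facebook and Orkut report 100 % in Table 1.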

Topology of Online Social Networks, Fig. 2 Density of the Flickr network (Adapted from Kumar et al. (2006))

YouTube and Flickr

The topological analyses for Flickr and YouTube are grouped together, as the two SNS are primarily used for photo-sharing and video-sharing services, respectively. In a study by Kumar et al. (2006), the evolution of structure within Flickr was considered during 2006, when there were approximately one million active nodes and eight million directed edges. The authors note that the density of the entire network was marked by growth, decline, and then a slow but steady increase, as evidenced by the graph in Fig. 2. The authors also note that they could classify members of the social network into three groups:
1. Singletons: the degree-zero nodes who joined the service but have never connected with another user in the social network.
2. Giant component: the larger group of people who are connected to one another directly or indirectly within the network. These form the most active and gregarious users in the network.
3. Middle region: isolated groups or communities who interact with each other but not with the larger part of the network. This group makes up a significant portion of the network.
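This three-way classification can be reproduced from connected components; a plain-Python sketch over an undirected adjacency list (names are ours):

```python
def classify_members(adj):
    """Split nodes into singletons, the giant component, and the middle region."""
    seen, components = set(), []
    for s in adj:
        if s in seen:
            continue
        comp, stack = [], [s]
        seen.add(s)
        while stack:                       # iterative DFS over one component
            v = stack.pop()
            comp.append(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        components.append(comp)
    giant = max(components, key=len)
    singletons = [c[0] for c in components if len(c) == 1]
    middle = [v for c in components
              if len(c) > 1 and c is not giant for v in c]
    return singletons, giant, middle
```

For directed networks such as Flickr's follower graph, the same idea applies after symmetrizing the edges (or by using weakly connected components).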

In a similar study by Mislove et al. in 2007, in which YouTube was included, Flickr had 1.8 million users and YouTube 1.15 million. The number of edges was 22.6 million for Flickr and 4.9 million for YouTube, and the average number of friends per user was 12.24 for Flickr and 4.29 for YouTube. In terms of components within the network, described by user groups, there were 103,648 for Flickr and 30,087 for YouTube. It is also interesting to note that the fraction of symmetric links is 62 % for Flickr and 79.1 % for YouTube. Thus, for both SNS, there is a high level of link symmetry, which is consistent with that of offline networks. Both types of networks also exhibited node-degree conformance with the power law observed in social networks, where the majority of the nodes have small degree and a few nodes have significantly higher degree. This test was conducted using the Kolmogorov-Smirnov goodness-of-fit metric. Another interesting statistic is the average path lengths and diameters for both networks, which are quite short in absolute terms. For instance, the average path length was 5.67 for Flickr and 5.10 for YouTube, and the diameter was 27 for Flickr and 21 for YouTube. These statistics support regarding the networks of both SNS as small-world networks.
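A Kolmogorov-Smirnov comparison of an empirical degree distribution against a power law can be sketched as follows (a simplified version: it assumes the exponent is already given, whereas the cited studies also fit the exponent and lower cutoff; all names are illustrative):

```python
def ks_distance(degrees, gamma, kmax=None):
    """Kolmogorov-Smirnov distance between the empirical degree CDF and a
    discrete power law p(k) proportional to k**(-gamma) on 1..kmax."""
    kmax = kmax or max(degrees)
    weights = [k ** -gamma for k in range(1, kmax + 1)]
    z = sum(weights)                                  # normalization constant
    n = len(degrees)
    dmax, acc = 0.0, 0.0
    for k, w in enumerate(weights, start=1):
        acc += w / z                                  # model CDF at k
        emp = sum(d <= k for d in degrees) / n        # empirical CDF at k
        dmax = max(dmax, abs(emp - acc))
    return dmax
```

A small KS distance indicates a good fit: a sample concentrated at low degrees is close to a steep power law, while a sample of uniformly high degrees is far from it.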

In terms of assortative mixing, the assortativity coefficient r was used to assess the node-level tendency to connect to other similar nodes based on degree distribution. For Flickr, the assortativity coefficient was found to be 0.202, while for YouTube, the coefficient was 0.033.

Smaller Social Media

A number of studies have quantified the topological patterns of smaller social media. For example, Mislove et al. (2007) compare the smaller networks of Orkut and LiveJournal with Flickr and YouTube. They found that all the networks considered are scale-free and display relatively small average shortest path lengths, though in the case of Orkut, the average shortest path length (4.25) is comparable to the network diameter (6.0). They found that all networks except YouTube are assortative, that the node clustering coefficient is inversely proportional to node degree, and that the networks each contain a large, densely connected core (Mislove et al. 2007).

Online Forums

Online forums are another form of OSNs, though they are relatively transient compared to the social media networks mentioned above. Scientists studying online social networks have noted several trends in network relations and structure. In online discussion forums particularly, members of a community may post several messages or requests for information in the hope that the message will have wide reception and generate useful responses. While offline interactions are temporal, an interaction recorded by an online forum remains static for as long as the forum is available. This offers immense benefits, including not only knowledge building but also the possibility of reaching out and developing latent relationships. At the tie level, Haythornthwaite (2002) claims that in addition to strong and weak ties, the potential for ICT when used to initiate a new contact suggests another type of tie called the "latent tie," a tie for which a connection is available technically but has not yet been activated by social interaction.

Such ties come into existence through the structures established by formal means (e.g., the management of an organization). For example, the NSW Community Capacity Building Forum in Australia encourages community members to use its online discussion forum to post messages and seek advice from other community members on diverse subjects, as a result of which latent ties may be formed, developing into weak ties (when members become acquainted with one another) and eventually into strong ties (when they become close friends over time). This is especially fruitful for bridging social capital: latent ties can be created and then developed further, which is particularly useful for isolates in the network structure or those spanning geographical distances. Apart from the strength and multiplex nature of ties, online discussion forums on a broader level are no different from interpersonal media and print and electronic mass media. However, they differ crucially from all the other channels in their fundamental network form: it differs from the one-to-one communication among pairs of people communicating face-to-face, by letter, or by telephone as well as from the one-to-many communication between a mass-medium broadcaster and its audiences. Rather, an online forum supports any network form, with one, few, or many participants communicating with one, few, or many others. Therefore, online forums provide not only different forms and scales of communication but also different kinds of reach and benefits than traditional communication channels do. Katz and Rice (2002) endorse the idea of such forums as having great potential for connectivity without much social cost. This change in the way people maintain interpersonal ties has significant implications. For example, the networking and associative attributes of online forums allow people to reestablish broken social ties. Information sharing, use, and communication with others can be performed at little financial or social cost compared to the traditional brokering needed in unmediated personal relationships. Online community identities can be used to quickly introduce, assist, and socialize new participants, as well as sustain participation over time (Fig. 3).


Topology of Online Social Networks, Fig. 3 Sociogram showing communication network from 2001 to 2010 for an online e-learning forum (Alon 2007; Chung 2011; Chung and Chatfield 2011)

In order to study how networks within online forums are sustained and evolve over time, Chung (2011), Chung and Chatfield (2011), and Chung et al. (2013) assessed community capacity building over 10 years via an online e-government-sponsored forum in Australia. Using social network analytics, they postulated that as communities evolve over time, the pattern of communication becomes denser and less centralized. They also postulated that there are clear patterns of assortativity where similar actors engage in communication with each other over time (Table 2). Mixed results were observed in terms of growth in density and centralization values. Using a sliding-window analysis, a clear pattern was observed of networks losing their disassortative character in the early years, followed by disassortative networks in the later years. These network-level results added further insights to government-level metrics of community-building success such as unique visitors per month or webpage hits. The research thus provided a network analytics perspective as an empirical avenue for achieving a richer understanding of the social processes involved in community building.
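A per-window density computation of the kind used in such longitudinal studies can be sketched as follows (the (sender, receiver, year) edge format and the function name are our assumptions, not the cited authors' code):

```python
def sliding_density(edges, window=1):
    """Directed network density per sliding window of years, where density is
    the number of distinct links over the number of possible ordered pairs."""
    years = sorted({y for (_, _, y) in edges})
    out = {}
    for start in years:
        win = {(u, v) for (u, v, y) in edges if start <= y < start + window}
        nodes = {u for u, v in win} | {v for u, v in win}
        n = len(nodes)
        out[start] = len(win) / (n * (n - 1)) if n > 1 else 0.0
    return out
```

Widening the window smooths the yearly fluctuations visible in Table 2 at the cost of temporal resolution.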

Topology of Online Social Networks, Table 2 Network measures from 2001 to 2010 for an online e-learning forum (Alon 2007; Chung 2011; Chung and Chatfield 2011)

Period   Network density   Network centralization   Network assortativity
2001     0.0095            0.1738                   0.653
2002     0.004             0.1095                   0.234
2003     0.0037            0.0801                   0.173
2004     0.0029            0.0692                   0.07
2005     0.0033            0.0691                   0.317
2006     0.0036            0.1257                   0.359
2007     0.0044            0.0504                   0.058
2008     0.0023            0.0494                   0.245
2009     0.0028            0.0683                   0.223
2010     0.0053            0.0802                   0.229
Mean     0.0042            0.0876                   0.256

Summary

We may observe a number of common features in most of the online social networks that have been analyzed by researchers. All major social media networks, except Facebook, display a strong scale-free nature. However, Facebook is the largest of these networks, so it could be


postulated that as the links begin to saturate in the global community of users, the scale-free nature is beginning to be lost. Indeed, when all users have the maximum number of friends allowed, there can be no scale-free structure. On the other hand, it could be argued that rapidly evolving social networks display strong scale-free features. Almost all the social networks studied display the small-world feature, and this will only increase as the links saturate within a network. Facebook has a much higher average degree compared to the rest of the OSNs, hinting that it is at a more mature phase of growth. Most OSNs are assortative, as predicted for social networks (Newman 2002), and all of them display strong clustering organization, though the clustering is limited by geographical constraints. Therefore, it could be argued that OSNs display strong "local" clustering. There are large giant components in all OSNs, with a majority of nodes already belonging to a single giant component. Online forums display fundamentally different network topologies from online social media networks, because in the case of online forums, one-to-many and many-to-one links are possible, and the notion of "friendship" has a vague meaning. Online forums tend to be disassortative, especially during the early stages of their life cycle.

Cross-References
▸ Analysis and Mining of Tags, (Micro)Blogs, and Virtual Communities
▸ Centrality Measures
▸ Community Evolution
▸ E-Government
▸ Mapping Online Social Media Networks
▸ Misinformation in Social Networks, Analyzing Twitter During Crisis Events
▸ Online Communities

References

Albert R, Barabasi A-L (2002) Statistical mechanics of complex networks. Rev Mod Phys 74:47–97
Alon U (2007) Introduction to systems biology: design principles of biological circuits. Chapman and Hall, London
Chung KSK (2011) Community building through social networks: evolution and engagement. In: Australasian conference on information systems, University of Sydney, Sydney
Chung KSK, Chatfield AT (2011) An empirical analysis of online social network structure to understand citizen engagement in public policy and community building. Int J Electron Gov 4(1/2):85–103
Chung KSK, Piraveenan M, Levula AV, Uddin S (2013) Building assortativity, density and centralization in social networks. In: Hawaii international conference on system sciences, Maui
Colizza V, Flammini A, Serrano MA, Vespignani A (2006) Detecting rich-club ordering in complex networks. Nat Phys 2:110–115
Costa LdF, Rodrigues FA, Travieso G, Boas PRV (2007) Characterization of complex networks: a survey of measurements. Adv Phys 56(1):167–242
Dorogovtsev SN, Mendes JFF (2003) Evolution of networks: from biological nets to the internet and WWW. Oxford University Press, Oxford
Google (2013) Orkut profile – Google Ad Planner
Gruzd A, Wellman B, Takhteyev Y (2011) Imagining Twitter as an imagined community. Am Behav Sci 55(10):1294–1318
Haythornthwaite C (2002) Strong, weak, and latent ties and the impact of new media. Inf Soc 18(5):385–401
Hendler JA (2009) Web 3.0 emerging. Computer 42(1):111–113
Hinds P, Kiesler S (1995) Communication across boundaries: work, structure, and use of communication technologies in a large organization. Organ Sci 6(4):373–393
Kumar R, Novak J, Tomkins A (2006) Structure and evolution of online social networks. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining, Philadelphia. ACM, pp 611–617
Leskovec J, Horvitz E (2008) Planetary-scale views on a large instant-messaging network. In: Proceedings of the 17th international conference on world wide web, Beijing. ACM, pp 915–924
Licoppe C, Smoreda S (2005) Are social networks technologically embedded? How networks are changing today with changes in communication technology. Soc Netw 27(4):317–335
Locke L (2007) The future of Facebook. Time
Lunden I (2012) Twitter passed 500M users in June 2012, 140M of them in US; Jakarta 'Biggest Tweeting' City
Magno G, Comarela G, Saez-Trumper D, Cha M, Almeida V (2012) New kid on the block: exploring

the Google+ social graph. In: Proceedings of the 2012 ACM conference on internet measurement conference, Boston. ACM, pp 159–170
Marsden P, Campbell KE (1984) Measuring tie strength. Soc Forces 63(2):482–501
Milgram S (1967) The small-world problem. Psychol Today 1:62–67
Mislove A, Marcon M, Gummadi KP, Druschel P, Bhattacharjee B (2007) Measurement and analysis of online social networks. In: Proceedings of the 7th ACM SIGCOMM conference on internet measurement, San Diego. ACM, pp 29–42
Mukherjee S, Ghosh S (2013) YouTube reaches 1 billion monthly active users. Sydney Morning Herald
Nardi BA, Whittaker S, Schwarz H (2000) It's not what you know: work in the information age. http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/741/650. Accessed 12 Jan 2013
Newman MEJ (2000) Models of the small world. J Stat Phys 101(3):819–841
Newman MEJ (2002) Assortative mixing in networks. Phys Rev Lett 89(20):1–14
Newman MEJ (2003) Mixing patterns in networks. Phys Rev E 67(2):026126
Page L, Brin S, Motwani R, Winograd T (1999) The PageRank citation ranking: bringing order to the web. Stanford InfoLab
Park J, Barabasi A-L (2007) Distribution of node characteristics in complex networks. Proc Natl Acad Sci 104(46):17916–17920
Piraveenan M, Prokopenko M, Zomaya AY (2007) Information-cloning of scale-free networks. In: Costa FAe, Rocha LM, Costa E, Coutinho A, Harvey I (eds) Advances in artificial life: 9th European conference on artificial life (ECAL-2007), Lisbon. Springer, pp 925–935
Piraveenan M, Prokopenko M, Zomaya AY (2008) Local assortativeness in scale-free networks. Europhys Lett 84(2):28002
Piraveenan M, Prokopenko M, Zomaya AY (2010) Local assortativeness in scale-free networks – Addendum. Europhys Lett 89(4):49901
Piraveenan M, Prokopenko M, Zomaya A (2012a) On congruity of nodes and assortative information content in complex networks. Netw Heterog Media 7(3):441–461. doi:10.3934/nhm.2012.7.441
Piraveenan M, Prokopenko M, Zomaya AY (2012b) Assortative mixing in directed biological networks. IEEE/ACM Trans Comput Biol Bioinform 9(1):66–78
Sole RV, Valverde S (2004) Information theory of complex networks: on evolution and architectural constraints. In: Ben-Naim E, Frauenfelder H, Toroczkai Z (eds) Complex networks. Springer, Berlin
Ugander J, Karrer B, Backstrom L, Marlow C (2011) The anatomy of the Facebook social graph. http://arxiv.org/abs/1111.4503. Accessed 22 Mar 2013
USA Today (2006) YouTube serves up 100 million videos a day online. USA Today
Watts DJ, Strogatz SH (1998) Collective dynamics of small-world networks. Nature 393(6684):440–442

Wellman B (1997) An electronic group is virtually a social network. In: Kiesler S (ed) Culture of the Internet. Lawrence Erlbaum, Hillsdale, pp 179–205
Wellman B, Salaff J, Dimitrova D, Garton L, Gulia M, Haythornthwaite C (1996) Computer networks as social networks: collaborative work, telework, and virtual community. Annu Rev Sociol 22:213–238
Wikipedia (2013) Facebook. http://en.wikipedia.org/wiki/Facebook. Accessed 21 Mar 2013
Yahoo! (2011) Flickr – advertising solution, Yahoo!. http://advertising.yahoo.com/article/flickr.html. Accessed 20 Mar 2013

Trade-Offs
▸ Online Privacy Paradox and Social Networks

Training
▸ Social Media Policy in the Workplace: User Awareness

Trajectory
▸ Spatiotemporal Footprints in Social Networks

Transforming and Integrating Social Networks and Social Media Data

Mauro San Martín1 and Claudio Gutierrez2
1 Department of Mathematics, Universidad de La Serena, La Serena, Chile
2 Department of Computer Science, Universidad de Chile, Santiago, Chile

Synonyms Data integration; Data transformation; Querying and sampling social networks; Restructuring social networks


Glossary Data Model An abstraction that defines a data structure (data elements and their relations), a set of operations (a language), and a set of integrity constraints Database A collection of data organized following a data model and collected and maintained with a purpose Data Base Management System (DBMS) A software system that implements a data model and provides data management services to other applications Schema Metadata that describes and defines the constraints (e.g., types, domains, and dependencies) over the data in a database Transformation Language A set of operations to map a database from one schema to another Social Networks (SN) A set of actors and the relations among them; both actors and relations may be described by attributes Social Media Social networks that include diverse types of media (e.g., text, photographs, and videos) SN&M Social networks and social media Pattern A graph pattern used to define a network structure to be found or to be constructed as part of a transformation

Definition

The availability, volume, and diversity of formats of social networks and social media data sets call for well-supported transformation and integration operations. Data transformation is the process in which the social units and relations present in a source of social network and social media data are used to produce new social units and relations. For instance, a social network comprising the employees of a company connected by the emails they exchange may be transformed into a network of communication among departments, where all the employees working for the same department are coalesced into the same actor, representing the department, and relations are grouped and summarized accordingly.

Note that transformation is not only a change in the representation format but also a change in the structure of the network itself. Data integration is the process where two or more networks are joined to form a new network including all their actors and relations. Usually, all the involved data sets describe the same or related networks but may have been collected independently, from different sources, with different tools, and/or at different times. Consequently, the integration process may require transforming all networks to the same schema of social units and relations before the join can proceed.
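The employees-to-departments example above can be sketched as a grouping transformation (the weighted-edge representation and all names are illustrative, not taken from any particular system):

```python
def coalesce(edges, group_of):
    """Collapse an actor-level network into a group-level network:
    map each endpoint to its group and sum the weights of parallel relations."""
    agg = {}
    for u, v, w in edges:
        key = (group_of[u], group_of[v])
        if key[0] != key[1]:               # drop within-group self-links
            agg[key] = agg.get(key, 0) + w
    return agg
```

Whether within-group relations become self-loops or are dropped is a modeling choice; the sketch drops them, so two employees of the same department emailing each other does not contribute to the department network.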

Introduction

Data management in social networks and social media (SN&M) usually requires data to undergo successive changes. This workflow starts with data collection and/or the reuse of existing data sets, and it may include a variable number of intermediate data manipulation stages. Most of these intermediate steps require transforming the network data or integrating data from several sources, as shown in Fig. 1. Transformation and integration of SN&M data take place in a space of producers and consumers of data. Producers collect data sets by diverse means, for instance, direct observation, surveys, and/or automatic capture of the activities of users. Consumers, on the other hand, require access to the captured data to transform and integrate the subsets that are relevant for their applications. For instance, social network analysis (SNA) tools can be viewed as consumers of social networks to be analyzed and producers of structural measures; annotators and users produce and update social networks; developers of applications consume and produce networks; and so on. In this context, each individual workflow and the interaction of producers and consumers require a data management infrastructure to provide access to the data. This is ideally done through a high-level language to express the operations the user will need. The implementation of such an infrastructure should provide the same

Transforming and Integrating Social Networks and Social Media Data, Fig. 1 Data workflow in SN&M study. Once the data sets to be studied are collected or identified among existing data sets, each data set is usually prepared and integrated in one consolidated data set. Then this consolidated data set may be transformed several times until it is finally ready for analysis and mining. The

appropriate foundation for SN&M data transformation and integration is a common standard data model. First, all networks to be integrated are translated to the common data model. Then the transformation language of the model is used to map the networks to a common schema and join them. Finally, the resulting network can be further transformed and, if needed, translated to any data format

benefits to the social networks practitioners as the relational database management systems (RDBMS) provide to the business domain, namely, separation of concerns between data and applications, avoiding data redundancy and inconsistency, user-friendliness and flexibility, and reduction of development costs. The appropriate foundation for this infrastructure is a common data model capable to represent the diversity of SN&M types of data. Once data is represented under this common model, it is possible to define a high-level language to integrate and transform the data (see Fig. 1). In the following sections, we present the historical antecedents from the SNA and Web perspectives. Then we discuss the main topics regarding the data representation requirements for transformations and integration, and a standard abstract language to support transformation. Finally, we briefly present the relevant future directions.

Key Points
• The study of SN&M requires repeatedly transforming and integrating data from different sources, collected at different times and/or with different methods or tools.
• The volume and diversity of available SN&M data require a standard, computer-supported infrastructure for these recurrent tasks: data sets of this size and complexity cannot be managed by hand, and the in-house development of custom-made data transformation programs is error-prone and may hinder the repeatability of experiments.
• The appropriate framework for a solution is a common standard data model, including an adequate data structure and a transformation language covering the needs of SN&M studies.

Historical Background
There are two main sources of antecedents for transforming and integrating SN&M data. On the one hand, the SNA community has a well-developed methodology for studying structural data, albeit one focused on small data sets by current standards. On the other hand, the database and Web communities developed transformation languages to process and restructure Web pages. Driven by the explosive creation of Web-based SN&M, whose size demanded semiautomatic processing of their data, the two streams of development began to overlap around the year 2000.

The SNA Tradition
Freeman (2004) states that the foundational aspects of SNA were already present long before Moreno started his work in the early 1930s: a structural intuition, a sustained effort towards systematic data collection, and the development of visualization devices and mathematical and computational models (see also Wasserman and Faust 1994; Scott 2000). These aspects shaped an SNA practice with a well-developed methodology but focused mainly on small data sets (dozens to hundreds of elements). Freeman also identifies a tipping point for the field in the late 1990s, when increasingly available structural data in seemingly unrelated fields like physics and biology, together with the newly available online social networks, changed the SNA community forever (Freeman 2011). The volume and complexity of the newly available structural data required new tools and methods. Figure 2 shows the evolution from the standard parameters of classic social networks (Southern Women, International Trade, etc.) to modern huge SN&M like Facebook or Twitter. The historical evolution of classical SNA data management practice shows a clear trend towards automation of procedures and the use of computational tools to share and reuse data. Thus, the Web became a natural platform to support the collection and manipulation of social network data.

Data Management and the Web
The Web has evolved into the global information space for exchanging information; in particular, since the year 2000, it has become the host of a variety of SN&M. Due to the volume of information, the automation of its data management became mandatory early in its history. Thus, several models and languages for managing Web data emerged. Florescu et al. (1998) survey them, classified into two generations. The first generation comprises languages that essentially extract (query) information from the Web, while the second incorporated restructuring functionalities. In all these languages, it is possible to distinguish a data extraction stage and a data production (restructuring) stage. A good example is WebLog (Lakshmanan et al. 2001), a declarative language that defines an extraction pattern as a set of rules over the content values and structure of each page and the structure of links among them. The construction phase uses rules whose heads define the pages to be constructed from the collected and computed information. Although these languages serve as references, they do not cover the actual needs of SN&M studies. More recently, in the context of the Semantic Web, several languages have been proposed that query and transform XML (e.g., XSLT and XQuery) and RDF (SPARQL) data. Furche et al. (2004) review these languages.

The Need for a Common Data Model
As we discussed above, most SN&M data analysis and mining applications use proprietary data formats, which complicates integration and transformation. A common, standard data representation encompassing the representation capabilities of the different formats is needed. Freeman's maximal structure experiment (Freeman et al. 1992) is a reference base for the requirements of this common representation. Freeman starts by typifying a social network in the simplest case: a single relation recorded at a single time over an undifferentiated and unchanging population. He defines an experiment which uses two types of information: a set of social units (which at the lowest possible level refers to individuals, e.g., persons) and a set of pairs of social units that exhibit some social relation of interest between them. The natural choice for representing these simple networks is to use standard graphs where actors are nodes and relations are edges. However, doing so limits the representation power to that of binary relations and forbids attributes as part of the graph itself (see Fig. 3). From this basic setting, Freeman progressively builds up the maximal social structure experiment by adding the following elements: (1) several kinds of relations, (2) two or more types or levels (groups) of social units, (3) structures that change through time, (4) sets of social units that grow or shrink, (5) attributes of social units, and (6) attributes that change. To cope with the requirements developed in recent years, two further elements should be added: attributes of relations and a variable number of participants in relations, i.e., relations linking a number of actors that is not fixed at modeling time.

Transforming and Integrating Social Networks and Social Media Data, Fig. 2 SN&M data evolution: SN&M data size and complexity have evolved from manually collected and curated small SNA data sets (e.g., Southern Women, International Trade, EIES) to current huge online SN&M (e.g., Twitter and Facebook) that are continuously and automatically collected and require automatic data management

| Name        | Modes      | Size (n, m)  | Types of Rel. | Attr. on Actors | Attr. on Rels. | Prim. Purpose         | Dynamic        | Comments                      |
|-------------|------------|--------------|---------------|-----------------|----------------|-----------------------|----------------|-------------------------------|
| S. Women    | 2-mode     | ((18,14),89) | one           | Demographic     | No             | SNA                   | No             | Classical (observed data)     |
| Intl. Trade | 1-mode     | (24,310)     | one (5 sets)  | Demographic     | No             | SNA                   | No             | Imported data                 |
| EIES        | 1-mode     | (32,650)     | one (4 sets)  | Dis. & Cit.     | Scalar         | SNA                   | Long. (3 sets) | Manual & automatic collection |
| DBLP        | multi-mode | (~1M,~Ms)    | many          | Yes             | Purpose dep.   | Bibl. database        | Yes            | Periodic updates              |
| Twitter     | multi-mode | (~Ms,~Ms)    | many          | Yes             | Purpose dep.   | Information spreading | Yes            | Online                        |
| Facebook    | multi-mode | (~Ms,~Ms)    | many          | Yes             | Purpose dep.   | Social interaction    | Yes            | Online                        |

Transforming and Integrating Social Networks and Social Media Data, Fig. 3 SNA classical data model. In SNA the classic approach is to represent social networks as directed graphs and to store the attributes in a separate data structure. This strategy limits the variety of networks that can be faithfully represented and the expressiveness of the associated transformation languages

N = {a1, a2, a3, a4, a5}

Matrix M (rows and columns a1..a5):
a1: -, -, 1, 2, 3
a2: -, -, -, 1, -
a3: 1, -, -, 2, -
a4: -, 2, 1, -, -
a5: -, 1, -, -, -

Attributes:
| Actor | Name | Age |
|-------|------|-----|
| a1    | Amy  | 20  |
| a2    | Ann  | 25  |
| a3    | Joe  | 24  |
| a4    | John | 27  |
| a5    | Mary | 26  |

The requirements discussed above demand representations more elaborate than simple graphs (or matrices). This challenge is not exclusive to SN&M and has been explored extensively in other contexts (Guting 1994; Blau et al. 2002; Angles and Gutierrez 2008). From this background, and from trends in information exchange indicating that all the information should be in the same data structure and that it should support the addition of arbitrary metadata (e.g., provenance), it follows that the data structure that represents SN&M should have at least the following characteristics:
1. Actors have a unique identifier and a set of attributes and can participate in any number of relations.
2. Relations have a unique identifier, a set of attributes, and a number of participant actors. The number of participants can be one or more, and it may change without affecting the other properties of the relation.
3. Attributes have an associated meaning and a literal value. An attribute is identified by the identifier of the object to which it is attached (actor or relation), by its meaning, and by its literal value. The class of an object – either an actor or a relation – is a special kind of attribute: its family.
4. Actors, relations, attributes, and their connections form a social network. Sharing and reuse require metadata at the network level to record, for instance, the provenance of the data sets.

Transforming and Integrating Social Networks and Social Media Data, Fig. 4 Social media example. The figure shows a portion of the network of a fictitious photo-sharing site. Round nodes represent actors (e.g., a1, a5, and a7), rectangles represent relations (e.g., r1 and r2), and gray dots represent attribute values. Arc labels represent participation roles of actors in relations (e.g., "publisher" and "appears-in") and the meaning of attributes (e.g., "pub-date"). Finally, gray labels represent the families of actors and relations (e.g., "person" and "group photo")

The example in Fig. 4 shows a network of a fictitious social media site that stores persons and the photographs depicting them. Every actor (circle) and relation (rectangle) belongs to a family: person, place, shot, etc. We can modularly add actors and relations and update family membership and attributes. Nodes may participate in several relations. For example, node a5 participates in two (r1 and r2) and could participate in new relations without altering its identity. Note that in this model the same holds for relations. Relations in classical graphs are binary; this model allows variable arity: for example, relation r1 has six participants, and it is possible to change its arity without modifying its identity. A data structure fulfilling these requirements will be capable of representing any social network as described by Freeman, including the modern requirements imposed by social media, like a diverse variety of actors and relations. It is possible to formally define this data structure as follows:

Definition 1 (SN&M data structure) An SN&M is a tripartite directed labeled graph defined as follows:
1. A schema representing the types of the elements involved:
– A collection 𝒜 of families of actors
– A collection 𝒯 of families of relations
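The four characteristics above can be made concrete in a few lines of code. The following Python sketch is illustrative only (the class and field names are ours, not part of any SNQL implementation); it encodes a fragment of the Fig. 4 example and shows that a relation's arity can change without touching its identity.

```python
from dataclasses import dataclass, field

@dataclass
class Actor:
    id: str
    families: set                                   # e.g., {"person"}
    attributes: dict = field(default_factory=dict)  # meaning -> literal value

@dataclass
class Relation:
    id: str
    families: set                                   # e.g., {"group photo"}
    # role -> list of actor ids: variable arity, not fixed at modeling time
    participants: dict = field(default_factory=dict)
    attributes: dict = field(default_factory=dict)

# A fragment of the Fig. 4 photo-sharing example
a1 = Actor("a1", {"person"}, {"name": "Katy"})
a5 = Actor("a5", {"person"}, {"name": "Raquel"})
r1 = Relation("r1", {"group photo"},
              participants={"publisher": ["a1"],
                            "appears-in": ["a2", "a3", "a5"]},
              attributes={"pub-date": "2012-05-16"})

# Adding a participant changes the arity but not the relation's identity
r1.participants["appears-in"].append("a1")
assert r1.id == "r1" and len(r1.participants["appears-in"]) == 4
```

Any number of families, roles, and attributes can be attached without changing the shape of the structure, which is what the requirements above demand.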

2. A set of nodes N = A ∪ T ∪ C, a disjoint union of:
– The actor set A, where each a ∈ A belongs to at least one family in 𝒜
– The relation set T (ties), where each t ∈ T belongs to at least one family in 𝒯
– The attribute set C, whose elements describe actors and relations
3. A set of arcs given by ordered pairs belonging to:
– A × T: the set of participation roles
– (A ∪ T) × C: the set of meanings of the values of the attributes

Notes
1. The definition includes a set of labels and identifiers to properly describe each of the elements above, whose details we omit here. For a complete formal description, see San Martín (2012).
2. Observe that, from point 3 and the note above, it follows that an SN&M can be represented as a set of triples (e.g., an element of A × T is the triple (a, l, t), where a ∈ A, t ∈ T, and l is the label of the participation role). See the details in San Martín (2012).

Figure 5 shows the same network depicted in Fig. 4 according to the formal definition above.

Transforming and Integrating Social Networks and Social Media Data, Fig. 5 Data structure elements in Fig. 4. The photography-sharing site network of Fig. 4 with the elements of the model highlighted: the actor set in green, the relation set in blue, and the attribute values in orange. Note that the network can alternatively be represented as a set of triples, one per arc in the graph; for instance, the participation as "publisher" of a1 in r1 can be represented by the triple (a1, "publisher", r1)
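Note 2 above suggests a direct encoding: the entire network becomes a set of triples in which both participation roles and attribute meanings appear as labels. A minimal sketch in Python, using triples from the Fig. 4/Fig. 5 example:

```python
# Each arc of the tripartite graph as a triple (subject, label, object):
# participation roles connect actors to relations; attribute arcs connect
# actors or relations to literal values.
network = {
    ("a1", "publisher", "r1"),
    ("a2", "appears-in", "r1"),
    ("a3", "appears-in", "r1"),
    ("a5", "appears-in", "r1"),
    ("r1", "pub-date", "2012-05-16"),
    ("a1", "name", "Katy"),
}

# Which actors appear in relation r1?
appears = sorted(s for (s, l, o) in network if l == "appears-in" and o == "r1")
print(appears)  # ['a2', 'a3', 'a5']
```

The same set can hold actors, relations, and attributes uniformly, which is what makes a single transformation language over triples possible.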

A Language for Transforming and Integrating SN&M Data
As we saw, SN&M integration reduces to converting SN&M to a common data (and schema) model. By Definition 1, this amounts to defining a common schema for the different SN&M involved in the integration, that is, defining common families of actors, relations, and attributes (and, of course, the corresponding low-level mapping from the original schema to the new one). In this section, we assume this process has been done and concentrate on what is, from a theoretical and practical point of view, the main challenge: the transformation of SN&M.

Transforming and Integrating Social Networks and Social Media Data, Table 1 SNA use cases: use cases from SNA practice and the corresponding transformation language features needed. (Cases are taken from Exploratory Social Network Analysis with Pajek (de Nooy et al. 2005), which addresses characteristic SNA practice. See further arguments in San Martín 2012)

| Use case                       | Required transformation language features                          |
|--------------------------------|--------------------------------------------------------------------|
| Selecting groups and patterns  | Pattern matching, filtering by attribute value                     |
| Promoting attributes to actors | Pattern matching, computing new ids and values, pattern production |
| Identifying brokers            | Pattern matching, negation, pattern production                     |
| Counting social ties           | Pattern matching, aggregation, pattern production                  |
| Ego-network selection          | Pattern matching, induced subgraphs                                |
| Reachable neighborhood         | Pattern matching, transitive closure, induced subgraphs            |

Transformation Requirements
A universal ("Turing complete") language, although fully expressive, is of little help to common users. A dedicated language for transforming SN&M must strike a compromise between expressiveness and complexity, allowing users to express their transformation needs and to perform the most common tasks efficiently. The theoretical framework behind SN&M is that of SNA. Table 1 systematizes use cases from SNA practice, complemented with the features that a transformation language must provide to implement each use case. Grouping the basic types of operations involved, we obtain the following generic requirements:
1. Filtering by object identifiers and attributes. To be able to extract a subnetwork based on the identity of objects and the values of their attributes.
2. Pattern matching and negation. A common task is to search for the occurrences of a given fixed structure. Pattern matching can be seen as a structural filter. In some cases negation is required; for example, a successful match may require that a part of the pattern is not present in the network.
3. Pattern production and creation of new objects and values. Some use cases require computing new values or object identifiers using the data collected by matching a pattern in the source network.
4. Induced subgraphs. Some use cases require that, once a subnetwork is identified, some additional related elements be included in the result. For instance, once all neighbors of a given actor are identified, all relations among them can be added.
5. Transitive closure. To capture certain groups, a fixed pattern is not enough, as in the case of all the actors reachable from a given one. A more powerful operation is the transitive closure, which allows building a match by repeatedly applying a pattern in a sequence. (Regarding complexity and the actual performance of the implemented systems, a key design decision is the nature of the allowed output: only start and end points, e.g., start and end nodes in a path; all the matched subgraphs in the sequence, e.g., the path itself; and/or aggregated values from the matches, e.g., the length of the path.)
6. Set theoretical operations. Given two social networks with compatible schemas, at least two set theoretical operations are useful: union and difference. Set theoretical union is used to merge networks, and the difference to compare them.

Transformation Language
What is a "good" language for these requirements? From a database perspective, nobody has yet proposed a set of primitives flexible and expressive enough to represent the full diversity of queries implied by network databases (Angles and Gutierrez 2008). The good news is that for SN&M data transformation, the required set of functionalities is explicit, and most of them (though not all) are computable by efficient graph algorithms (paths, connectedness, etc.). From a social network practice point of view, there are two types of operations involved: data management operations (e.g., transformations and integration) that return social networks, and structural measures (i.e., analysis and mining) that return values or sets of values for structural properties,
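Over the triple representation of Definition 1, requirements 5 and 6 in particular reduce to a few lines of code. The sketch below is illustrative only (the relation label "friend-of" is invented for the example): transitive closure as a breadth-first search, and the set-theoretic operations as plain set union and difference.

```python
from collections import deque

# Two toy social networks as sets of (subject, label, object) triples;
# the "friend-of" label is invented for this example.
net_a = {("a1", "friend-of", "a2"),
         ("a2", "friend-of", "a3"),
         ("a4", "friend-of", "a5")}
net_b = {("a1", "friend-of", "a2"),
         ("a6", "friend-of", "a1")}

def reachable(net, start, label):
    """Requirement 5: transitive closure of one relation label, via BFS."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for s, l, o in net:
            if s == node and l == label and o not in seen:
                seen.add(o)
                queue.append(o)
    return seen

# Requirement 6: union merges two compatible networks; difference compares them.
merged = net_a | net_b
only_in_a = net_a - net_b

print(sorted(reachable(net_a, "a1", "friend-of")))  # ['a2', 'a3']
```

Filtering (requirement 1) and fixed-pattern matching (requirement 2) are set comprehensions over the same triples, so a single data structure serves all the requirements above.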

such as centrality. In this chapter, we deal with data management operations that produce networks from networks. For structural measures, the reader can consult the SNA and mining topics in this encyclopedia.

Abstractly (i.e., independently of its implementation architecture), a transformation language consists of two main modules, or can be viewed as a two-step process, as shown in Fig. 6. These two modules are:
1. Data collection (input processing). The relevant data is identified in the source instance and filtered according to the user requirements.
2. Construction (output processing). The resulting network is built using the intermediate data processed in step 1.
Currently, most proposals use pattern matching facilities to implement both modules.

Transforming and Integrating Social Networks and Social Media Data, Fig. 6 Transforming SN&M data as a two-step process. A transformation proceeds in two steps. First, information is extracted from the source data set by pattern matching (1); grouping and additional filtering may also be applied at this step (2). Second, from the data collected in the first step, new values and ids are computed; all this data is used to populate a pattern given as a construction template (3). All individual patterns thus produced are joined to form the result (4)

We illustrate these features in more detail with the case of the SNQL language (whose detailed syntax and semantics can be consulted in San Martín et al. 2011). An SNQL transformation follows the standard structure of languages like SQL and SPARQL:

CONSTRUCT <list-of-construction-patterns>
WHERE <extraction-pattern>
FROM <list-of-sources>

It receives input networks (in the FROM clause), extracts information using patterns (in the WHERE clause), and outputs a new network, possibly with new computed values, using patterns as construction templates (in the CONSTRUCT clause). The <list-of-sources> is a list of SN&M data sources. Recall from Definition 1, note 2, that a social network is a collection of triples composed of object identifiers and constant literals. The simplest form of an <extraction-pattern> is a set of triple patterns. But the requirements need more expressiveness here; thus, it can be a Boolean combination of simple patterns and incorporate functionalities such as transitive closure and aggregation (carefully modularized to avoid a complexity explosion). Theoretically, this is a Datalog program with certain restrictions: Datalog/GraphLog (Abiteboul et al. 1995; Consens and Mendelzon 1990). Each element in the <list-of-construction-patterns> is a collection of triple patterns (a set of triples including variables), possibly subject to certain constraints. Theoretically, this is based on a data exchange formalism called tuple-generating dependencies (Fagin et al. 2005). The transformation is evaluated as indicated in Fig. 7. First, based on the <extraction-pattern>, we get a table of values (a set of bindings for each variable, much like an SQL table). Then each pattern in the <list-of-construction-patterns> is instantiated with the values of a tuple in the table produced before. The union of all these results gives the final network. We illustrate this procedure by means of an example (see Fig. 8): a friendship network
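The two-step evaluation can be mimicked directly over the triple representation: matching a triple pattern yields a table of bindings (step 1), and each binding row instantiates a construction template (step 2). The following toy Python sketch illustrates the idea; it is not SNQL's actual implementation, and the pattern syntax (variables prefixed with "?") is borrowed from SPARQL for readability.

```python
def match(net, pattern):
    """Step 1: match one triple pattern against a set of triples.
    Strings starting with '?' are variables; returns one binding dict
    (variable -> value) per matching triple."""
    rows = []
    for triple in net:
        binding, ok = {}, True
        for p, v in zip(pattern, triple):
            if p.startswith("?"):
                if binding.get(p, v) != v:   # repeated variable must agree
                    ok = False
                    break
                binding[p] = v
            elif p != v:                     # constant must match exactly
                ok = False
                break
        if ok:
            rows.append(binding)
    return rows

def construct(rows, template):
    """Step 2: instantiate the construction template once per binding row;
    the union of all instances is the result network."""
    return {tuple(row.get(t, t) for t in template) for row in rows}

net = {("a1", "name", "Mary"), ("a2", "name", "John"), ("a1", "friend", "r1")}
rows = match(net, ("?x", "name", "?n"))          # bindings for each named actor
out = construct(rows, ("?n", "named-by", "?x"))  # a new network of inverted arcs
print(sorted(out))  # [('John', 'named-by', 'a2'), ('Mary', 'named-by', 'a1')]
```

A real engine would support conjunctions of patterns, negation, aggregation, and transitive closure in step 1, but the table-of-bindings pipeline is the same.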

Transforming and Integrating Social Networks and Social Media Data, Fig. 7 Transformation evaluation. In the general case, the evaluation proceeds in two steps. First, data is collected by pattern matching: each variable in an extraction pattern is bound to a value each time a match occurs. All these values are gathered in an intermediate table that has a column for each variable in the extraction pattern and a tuple for each match. In the second step, each row of the intermediate table is used to populate a construction pattern. All these partial results are joined to produce the network that is the result of the transformation

Transforming and Integrating Social Networks and Social Media Data, Fig. 8 Friendship network and a simple query result. (a) A social network representing the friendship relation (square node) between Mary and John, who were introduced by Ann (actors as round nodes and attribute values as gray dots). Note that the flexibility of the data structure allows "introducers" to be present only where they apply. (b) The same social network after promoting city attributes to actors

along with a transformation and the resulting network. In Fig. 8a, the fact that two persons live in the same city is not represented as part of the structure. It is possible to define a transformation from cities as attributes to cities as actors, producing a network where all persons that live in the same city are connected to the same actor, which represents that city (see Fig. 8b). This transformation is called a promotion of an attribute to an actor. The query creates a new kind of actor (city) and a new kind of relation (lives-in) to associate people with cities. The following is the skeleton of the transformation described above, where FriendshipNetwork is the social network shown in Fig. 8a, and CP and EP are the patterns shown in Fig. 9:

CONSTRUCT CP
WHERE EP
FROM FriendshipNetwork

As defined, the evaluation procedure comprises two steps. First, it produces the intermediate table defined by the bindings of the variables in EP each time EP matches the friendship network (see Table 2), that is, whenever it finds a "person" (A1) with attributes "city" (L1) and "name" (L2) that participates in a "friendship" relation (R1) with any role (P1). Then, for each row in the intermediate table, an instance of the construction pattern is produced with the corresponding values in the row. In this case, for each row two additional values are computed, corresponding to the identifier of each new city (A2) and the relation between the city and the person that lives in it (R2).

Transforming and Integrating Social Networks and Social Media Data, Fig. 9 Graphical representation of a simple SNQL query. This is the graphical representation of the patterns (extraction pattern EP and construction pattern CP) used by the transformation query that promotes city attributes to actors, producing the network depicted in Fig. 8b from the network depicted in Fig. 8a. Uppercase labels indicate variables. A2 and R2 represent new object identifiers created functionally: A2 = g(L1) and R2 = f(A1, A2)

Transforming and Integrating Social Networks and Social Media Data, Table 2 Variables involved in the matching process in the transformation from Fig. 8a to b. Input variables (left) match three instances. Note that the computed variables A2 and R2 take functionally created values: A2 = g(L1) and R2 = f(A1, A2)

| A1 | R1 | L1           | L2   | P1         | A2 | R2 |
|----|----|--------------|------|------------|----|----|
| a1 | r1 | Central City | Mary | friend     | a5 | r3 |
| a2 | r1 | Capital City | John | friend     | a4 | r2 |
| a3 | r1 | Central City | Ann  | introducer | a5 | r4 |

Let us remark that further structural analysis options arise. For instance, another type of transformation would be to group people by city
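The promotion query can be traced concretely over the bindings of Table 2. In the Python sketch below, the identifier-minting functions g and f are simplified to lookup tables whose values follow Table 2, and the triples emitted per row are our reading of the construction pattern in Fig. 9, not SNQL's literal output:

```python
# Bindings already extracted by matching EP against the Fig. 8a network:
# one row per person participating in the friendship relation r1 (cf. Table 2).
rows = [
    {"A1": "a1", "R1": "r1", "L1": "Central City", "L2": "Mary", "P1": "friend"},
    {"A1": "a2", "R1": "r1", "L1": "Capital City", "L2": "John", "P1": "friend"},
    {"A1": "a3", "R1": "r1", "L1": "Central City", "L2": "Ann",  "P1": "introducer"},
]

# A2 = g(L1) and R2 = f(A1, A2), simplified here to lookups whose ids
# follow Table 2 (one city actor per distinct city name).
g = {"Central City": "a5", "Capital City": "a4"}
f = {("a1", "a5"): "r3", ("a2", "a4"): "r2", ("a3", "a5"): "r4"}

result = set()
for row in rows:
    city = g[row["L1"]]
    rel = f[(row["A1"], city)]
    # Instantiate the construction pattern CP for this row:
    result |= {
        (city, "name", row["L1"]),          # promoted city actor with its name
        (city, "place", rel),               # city participates in a lives-in relation
        (row["A1"], "inhabitant", rel),     # ... together with the person
        (row["A1"], "name", row["L2"]),     # person keeps its attributes
        (row["A1"], row["P1"], row["R1"]),  # friendship participation is preserved
    }

# Mary (a1) and Ann (a3) end up connected to the same city actor a5
assert ("a1", "inhabitant", "r3") in result and ("a3", "inhabitant", "r4") in result
```

Because g is a function of the city name, the two rows with "Central City" share the actor a5, which is exactly what connects co-residents in Fig. 8b.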

of residence, thus defining a network of cities, where relations summarize friendships among the residents of the cities. Additionally, one might describe in the network the population (person count) of each city and label the relations between cities with the number of friendship relations between their residents. The language SNQL allows all of these transformations (San Martín 2012).
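The grouping variant just described is an aggregation over the same data; here is a toy sketch with the cities and the single friendship of the running example (counts are illustrative):

```python
from collections import Counter

# person -> city, and friendship pairs, from the running Fig. 8 example
city_of = {"a1": "Central City", "a2": "Capital City", "a3": "Central City"}
friendships = [("a1", "a2")]  # Mary and John are friends

# Population of each city actor (person count)
population = Counter(city_of.values())

# Relations between cities, labeled with the number of friendships they summarize
city_ties = Counter((city_of[u], city_of[v]) for u, v in friendships)

print(population["Central City"])                   # 2
print(city_ties[("Central City", "Capital City")])  # 1
```

In SNQL such counts would be produced by the aggregation facilities of the extraction pattern rather than by explicit loops.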

Key Applications
The key applications of the data structure and transformation language presented are twofold:
• At the abstract level, they allow the standard description of data sets and their transformations with a clear syntax and precise formal semantics, independently of specific tools or programs.
• At the practical level, they allow the automation of transformation and integration operations, of complete workflows, and of exploratory analysis processes. They also support the development of local and remote SN&M data repositories.


Future Directions
Current trends in the development of transformation languages for SN&M follow similar lines. There are several proposals oriented towards transforming SN&M data with related aims and scope: BiQL (Dries et al. 2009), SocialScope (Amer-Yahia et al. 2009), SoQL (Ronen and Shmueli 2009), and the one discussed in this article, SNQL (San Martín 2012). These developments mark the lines along which researchers are developing transformation languages for SN&M. The automation of SN&M data management and the development of simple tools for end users based on general principles are becoming a reality. The type and size of SN&M indicate that a new era of SN&M data management is ad portas. It will include the development of standards for data and for transformation languages that incorporate the experiences of the SNA and Web database communities, which historically have been working in this direction. This will give SN&M users, practitioners, and developers tools to handle and operate a wide range of SN&M of different types and sizes. The SN&M community should incorporate into its agenda the development of standards for a data model and transformation languages, whose theoretical bases were presented in this article.

Acknowledgments The authors thank the support of Conicyt (Chile) through the project Fondecyt 1110287.

Cross-References
 Collection and Analysis of Relational Data from Digital Archives
 Collection and Analysis of Relational Data in Organizational and Market Settings
 Linked Open Data
 Process of Social Network Analysis
 Query Answering in the Semantic Social Web: An Argumentation-based Approach
 RDF
 Sampling Effects in Social Network Analysis
 Social Media
 Social Network Datasets
 Social Networking Sites
 Sources of Network Data
 SPARQL
 XPath/XQuery
 XSLT

References
Abiteboul S, Hull R, Vianu V (1995) Foundations of databases, 1st edn. Addison-Wesley, Boston
Amer-Yahia S, Lakshmanan LVS, Yu C (2009) SocialScope: enabling information discovery on social content sites. In: CIDR, Asilomar. www.crdrdb.org
Angles R, Gutierrez C (2008) Survey of graph database models. ACM Comput Surv (CSUR) 40(1):1–39
Blau H, Immerman N, Jensen D (2002) A visual language for querying and updating graphs. Computer Science Technical Report 2002-037, University of Massachusetts, Amherst
Consens MP, Mendelzon AO (1990) GraphLog: a visual formalism for real life recursion. In: PODS, Nashville. ACM, New York, pp 404–416
de Nooy W, Mrvar A, Batagelj V (2005) Exploratory social network analysis with Pajek. Cambridge University Press, Cambridge
Dries A, Nijssen S, Raedt LD (2009) A query language for analyzing networks. In: Cheung DWL, Song IY, Chu WW, Hu X, Lin JJ (eds) CIKM, Hong Kong. ACM, New York, pp 485–494
Fagin R, Kolaitis PG, Popa L, Tan WC (2005) Composing schema mappings: second-order dependencies to the rescue. ACM Trans Database Syst 30(4):994–1055
Florescu D, Levy A, Mendelzon A (1998) Database techniques for the World-Wide Web: a survey. SIGMOD Rec 27(3):59–74
Freeman LC (2004) The development of social network analysis. Empirical, Vancouver
Freeman LC (2011) The development of social network analysis – with an emphasis on recent events. In: Scott J, Carrington PJ (eds) The SAGE handbook of social network analysis. Sage, London, pp 26–39
Freeman LC, Romney AK, White DR (1992) Research methods in social network analysis. Transaction, New Brunswick
Furche T, Bry F, Schaffert S, Orsini R, Horrocks I, Krauss M, Bolzer O (2004) Survey over existing query and transformation languages. http://rewerse.net/deliverables/m24/i4-d9a.pdf. Accessed 23 Jan 2013
Guting R (1994) GraphDB: modeling and querying graphs in databases. In: 20th VLDB conference, Santiago de Chile, pp 297–308
Lakshmanan LVS, Sadri F, Subramanian SN (2001) SchemaSQL: an extension to SQL for multidatabase interoperability. ACM Trans Database Syst 26(4):476–519
Ronen R, Shmueli O (2009) SoQL: a language for querying and creating data in social networks. In: ICDE, Shanghai. IEEE, Piscataway, pp 1595–1602
San Martín M (2012) A model for social networks data management. PhD thesis, Universidad de Chile. www.tesis.uchile.cl/handle/2250/111467
San Martín M, Gutierrez C, Wood PT (2011) SNQL: a social networks query and transformation language. In: Barceló P, Tannen V (eds) AMW, CEUR workshop proceedings, Santiago, vol 749. CEUR-WS.org
Scott J (2000) Social network analysis, 2nd edn. Sage, London
Wasserman S, Faust K (1994) Social network analysis: methods and applications, 1st edn. Structural analysis in the social sciences. Cambridge University Press, Cambridge

Recommended Reading
• For the base requirements for the common standard data model for SN&M data transformation and integration, see Freeman et al. (1992, Chap. 1).
• For the complete discussion and formal definition of the common standard data model, see San Martín (2012).
• For a discussion on transformation languages on the Web, see Florescu et al. (1998).

Transportable Networks  Mobile Communication Networks

Transportation Systems  Spatial Networks

Transpose of a Matrix  Matrix Algebra, Basics of

Traveling Networks  Mobile Communication Networks

Trend Detection  Mining Trends in the Blogosphere

Trial Heat Polls  Election Forecasting, Scientific Approaches

Trust  Social Capital

Trust and Reputation  Friends Recommendations in Dynamic Social Networks

Trust Evaluation  Subgraph Extraction for Trust Inference in Social Networks

Trust in Social Networks

Zhen Wen
IBM T. J. Watson Research Center, Yorktown Heights, NY, USA

Synonyms Belief; Reliance

Glossary
Trust One party's willingness to rely on the actions of another party

Definition
Trust is an important phenomenon in social networks. People trust one another for different reasons in various social contexts. Person A may trust person B because of B's characteristics, e.g., B's expertise in the area of A's current task. In addition, person A may trust person B because of social effects, such as homophily, or because other trusted people in A's network already trust B. In social networks, a person's trustworthiness can be categorized into long-term trustworthiness and real-time trustworthiness.

Long-Term Trustworthiness
A person's long-term trustworthiness measures how much he/she can be trusted over a long period of time. This is an aggregated measure that can be derived from the person's past behavior in his/her social networks, such as whether he/she always conveys truthful information or whether his/her opinions on a certain topic are valued by other people.

Real-Time Trustworthiness
In contrast to long-term trustworthiness, real-time trustworthiness measures how much a person can be trusted at the moment for the current task. For example, in military scenarios, a soldier who went over the hill with his/her communication interrupted may be a very trustworthy person judging from his/her past behavior in social and information networks. However, such long-term trustworthiness gives few cues as to whether he/she or his/her communication has been compromised by enemies and has become untrustworthy. Similarly, a soldier who was highly trusted in previous tasks may be overwhelmed by an adversarial environment at the moment and thus cannot be trusted to complete his/her current task.

Trust Prediction
To predict a person's long-term trustworthiness, machine learning techniques can be applied to features extracted from the person's behavioral patterns and social network structures. Tang et al. (2012) studied such long-term trustworthiness in the context of product reviews. In particular, researchers have started to evaluate reliable features of social network structures for predicting the trust between two people. Initial studies have been conducted on several different large-scale social network datasets, ranging from massively multiplayer online games to workplace e-mail and instant messaging communications in large enterprises. These studies have identified a set of reliable network features that are effective, such as shortest distances, node degrees, and the resource allocation index (Borbora et al. 2011). For real-time trustworthiness, real-time signals on people's cognitive and mental states need to be acquired by leveraging recent developments in neural sensing, including dry electrodes and wireless data acquisition (Yasui 2009). Traditional EEG signal detection devices usually require some type of messy conductive medium (gel, paste, or saline) and tether the individual to an amplifier with wires, allowing only a limited range of movement. These constraints make them unsuitable for field use. Fortunately, the relatively recent innovation of "dry" (no gel) wireless sensors and progress in
digital signal processing and chip design provide potential solutions for using EEG sensors in practical daily-life environments (Yasui 2009). Similar progress was recently demonstrated in using dry sensors to detect the steady-state visual evoked potential (SSVEP) for brain-computer interface (BCI) systems (Luo and Sullivan 2010). After people's cognitive signals are collected, cognitive experts need to annotate the mental states underlying these signals. Using these manually annotated data, predictive models such as Support Vector Machines (SVM) can then be learned to estimate human mental states from the signals. The estimated state information can be used to judge whether people are in a mental state in which they can be trusted to contribute to distributed decision making.
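The structural features named above (shortest distance, node degrees, resource allocation index) are straightforward to compute directly on an interaction graph. The following toy sketch (illustrative function names, not taken from Borbora et al. 2011) extracts them for a candidate pair; in a real study such feature vectors would then be fed to a classifier such as an SVM:

```python
from collections import deque

def shortest_distance(adj, s, t):
    """BFS shortest-path length between s and t (None if unreachable)."""
    if s == t:
        return 0
    seen, frontier = {s}, deque([(s, 0)])
    while frontier:
        node, dist = frontier.popleft()
        for nb in adj[node]:
            if nb == t:
                return dist + 1
            if nb not in seen:
                seen.add(nb)
                frontier.append((nb, dist + 1))
    return None

def resource_allocation_index(adj, u, v):
    """Sum of 1/degree(w) over common neighbors w of u and v."""
    return sum(1.0 / len(adj[w]) for w in adj[u] & adj[v])

def trust_features(adj, u, v):
    """Feature vector for the candidate trust pair (u, v)."""
    return {
        "distance": shortest_distance(adj, u, v),
        "deg_u": len(adj[u]),
        "deg_v": len(adj[v]),
        "ra_index": resource_allocation_index(adj, u, v),
    }

# Tiny undirected interaction graph, stored as adjacency sets.
adj = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c"},
}
feats = trust_features(adj, "a", "d")
```

In practice these features are computed per candidate pair over a large network and used as training input for the predictive models discussed above.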

Cross-References
 Computational Trust Models
 Subgraph Extraction for Trust Inference in Social Networks

References
Borbora Z, Ahmad M, Haigh K, Srivastava J, Wen Z (2011) Exploration of robust features of trust across multiple social networks. In: 2nd workshop on trustworthy self-organizing systems, in conjunction with the 2011 IEEE conference on self-adaptive and self-organizing systems, Ann Arbor, 3 Oct 2011, pp 27–32
Luo A, Sullivan T (2010) A user-friendly SSVEP-based brain-computer interface using a time-domain classifier. J Neural Eng 7(2)
Tang J, Liu H, Gao H, Das Sarma A (2012) eTrust: understanding trust evolution in an online world. In: KDD 2012, Beijing, pp 253–261
Yasui Y (2009) A brainwave signal measurement and data processing technique for daily life applications. J Physiol Anthropol 28(3):145–150

Trust Network  Subgraph Extraction for Trust Inference in Social Networks

Trust Prediction  Subgraph Extraction for Trust Inference in Social Networks

Tulip III

David Auber, Daniel Archambault, Romain Bourqui, Maylis Delest, Jonathan Dubois, Bruno Pinaud, Antoine Lambert, Patrick Mary, Morgan Mathiaut, and Guy Melançon
CNRS UMR 5800 LaBRI, INRIA Bordeaux – Sud-Ouest, Talence, France

Synonyms
Data analysis; Graph visualization; Visualization framework

Tool’s ID Card • Tool name: Tulip • Creation year: 1999 • Authors: David Auber (original author), Daniel Archambault, Romain Bourqui, Maylis Delest, Jonathan Dubois, Bruno Pinaud, Antoine Lambert, Patrick Mary, Morgan Mathiaut, and Guy Melanc¸on • Scope: general • Copyright: LGPL • Type: program/library • Size limits: 500 K nodes 500 K edges • Programming language: C++ Python • Orientation social, bio

Introduction Although this article presents a system and discusses its design, its content goes much further. In a sense, this paper is a position paper following 10 years of lessons learned working in graph visualization, developing new visualization techniques, and building systems for users.


The strategy we have adopted is to develop, maintain, and improve the Tulip framework, aiming for an architecture with optimal data structure management from which target applications can easily be derived. This strategy has paid off on several fronts. We have used the framework to demonstrate the Reproducibility of work published by others, allowing us to experiment with and validate our own work. The architecture has promoted the Extensibility and Reusability of our results and those of other researchers, as discussed in detail in forthcoming sections. Tulip has facilitated scientific collaboration and technology adoption; the framework serves as a tool to demonstrate our expertise and know-how when interacting with scientific collaborators or end users. As we shall argue, the evolution path of our framework brings it into full coherence with Munzner's nested model (Munzner 2009) and serves all facets of InfoVis, guiding the creation and analysis of visualization systems. Tulip is one of the very few systems that offer the possibility to efficiently define and navigate graph hierarchies or cluster trees (nested subgraphs). This technique has been a central visual paradigm in our group, as it often provides answers to data analysts. The reason is quite simple: large graphs must be clustered to reduce visual complexity, turning the data exploration process into one involving a hierarchy built by a clustering algorithm. Hence, Tulip's low-level data structure was designed from the outset to support the creation of nested and/or overlapping subgraphs, integrating at the heart of the system a property inheritance mechanism that provides both coherence and optimal space usage. Tulip started after David Auber decided to enter the huge graph visualization arena (Auber 2001, 2002a, 2003).
The library was designed to deal with graphs (relational data), focusing on graph topology as the main ingredient for visual encodings and mainly exploiting nodelink diagrams (points and straight lines) as a central visual metaphor. The framework was primarily designed to challenge scalability; its core architecture and low-level data structures were optimized to reach ambitious goals in terms of graph size (nodes and edges) that could be
handled and visualized. After these initial efforts, Tulip found a place within our research group and soon became an everyday experimental tool. Because data analysis and combinatorial mathematics are companion fields to graph visualization, Tulip included a rather exhaustive list of node and edge metrics that could then be mapped to color or size. Obviously, Tulip initially served as an experimental framework from which the design of drawing algorithms and visualization techniques were developed, tested, and validated. From this point of view, Tulip can certainly claim to be part of the champion’s club of state-of-the-art graph visualization libraries and software. The growth of our community helped us gain visibility, and we were soon asked to cooperate with the end users to build visualization applications: navigating protein interaction networks (Iragne et al. 2005), producing automated drawing for secondary RNA structures (Auber et al. 2006), visualizing software reverse engineering graphs (Chiricota et al. 2003), social networks (Auber et al. 2003a), or air passenger traffic (Rozenblat et al. 2006). The graph hierarchy paradigm residing deep within Tulip was later fully exploited by the work of Archambault et al. (2006, 2007a, b, 2008, 2009). We also became aware of the use of Tulip by others (see Perego 2005; Boulet et al. 2008, for instance). Graph visualization is often a possible avenue for data analysis as seen from these numerous collaborations. However, once the graph has been established, the visualization process often needs to be supported by techniques in graphical statistics or visual data mining. Tulip has been extended to offer visual encodings for relational as well as non-relational data. The libraries have matured from their algorithmic-centered viewpoint towards a data analysis dashboard combining different visualization techniques and support for visual analytics. 
The Tulip architecture has been designed to promote extensibility and reusability of results. As such, from a software engineering perspective, it heavily relies on object composition rather than inheritance. Even if object composition is often more complex for
the programmer, it considerably reduces code duplication and dependencies between modules. We are constantly improving and refactoring our library to minimize code duplication and reimplementation, to ease the addition of future research results, and to preserve architecture scalability. Tulip offers a software library that is in coherence with Munzner's nested model and has software support for validation at every level of this model. Our paper is thus structured to illustrate this property. Section "Historical Background" describes previous and related software systems that inspired the design of many parts of Tulip. Section "Tulip Main Features" describes the architecture of the Tulip libraries and software. In this section, we describe elements that support each level of validation in Munzner's nested model: algorithm plug-ins (section "Algorithms") provide support for validating algorithm design, views and interactors (sections "Views" and "Interactors") provide support for validating encoding/interaction technique design, and perspectives (section "Perspectives") provide support for validating data/operation abstraction design and domain problem characterization. Section "Key Applications" presents applications, two supporting the needs of information visualization groups and one supporting a domain-specific application, where Tulip was found to be helpful. Finally, section "Future Directions" presents some conclusions and future work.

Historical Background Developing a framework over an extended period often means being compared to or challenged by competitor systems and libraries. This section presents a representative subset of the libraries that are closest in spirit to our work. We briefly discuss the philosophy or underlying principles of each, contrasting them to Tulip. Many of these competitors have been benchmarked against Tulip in terms of scalability, one of Tulip’s strong points.


Libraries
LEDA/AGD/OGDF (Mehlhorn and Näher 1995; Mutzel et al. 1998; Chimani et al. 2007) The LEDA/AGD/OGDF series of graph drawing libraries was built to provide a collection of efficient graph drawing algorithms. These libraries include some of the most powerful, sophisticated, and complex algorithms for producing graph drawings. However, their aim is to draw graphs – that is, to decide the positions of nodes in the plane – rather than to offer a fully integrated information visualization library. Furthermore, these libraries tend to focus on graph connectivity; extra information linked to the nodes and edges of the graph is difficult to integrate into the visualization process. That said, LEDA/AGD have inspired our work (see section "Algorithms").

GraphViz (Ellson et al. 2002) This library is similar to OGDF but supports extrinsic data in its graph drawing algorithms (for instance, labels, size, and orientation of graph elements are all supported). GraphViz has been successful from both an end-user and an InfoVis community member perspective, and it offers one of the best, state-of-the-art solutions for drawing hierarchical (directed acyclic) graphs. However, the library does not focus on fully integrating its algorithms into a fully functional information visualization system.

VTK/Titan (Schroeder et al. 2006; Wylie and Baumes 2009) VTK is the standard library for producing applications supporting scientific visualization techniques. Recent developments of this library extend its scope to information visualization. With the integration of VTK and Boost, the latest versions support many information visualization techniques, even though VTK was not originally designed to support the visualization of abstract (nongeometric) data. The original strength of the library was its efficient rendering of meshes in three dimensions, and optimizations can be made under the assumption that most information visualization techniques are focused on rendering information in two dimensions. However, information visualization often focuses on user interaction and visual data manipulation, which requires efficient methods for tracking changes to the data, and this library does not appear to directly support this functionality. We compare the performance of this library to that of Tulip in section "Tulip Graphics."

Tulip III, Fig. 1 Tulip III is a framework that enables visualization researchers and application designers to operate at the algorithm, technique/interaction, and visual encoding levels. (a) Results of a number of graph drawing algorithms and metrics. (b) Several views of the same data set with custom interactions. (c) The Systrip perspective, which implements a visualization pipeline supporting exploratory analysis of Trypanosome metabolism

Toolkits
Toolkits offer users an environment for the development of InfoVis applications. They offer an off-the-shelf data import/storage solution and often include a variety of widely used graph layouts and node/edge metrics. The two toolkits we comment on here primarily support the design, development, and validation of new interactive visualization techniques, rather than offering sophisticated support for graph drawing algorithms.

Prefuse (Heer et al. 2005) This framework provides a comprehensive set of interactive information visualization techniques. Its clever design and management of interaction make this toolkit one of the most widely used for information visualization applications. On the other hand, the toolkit supports only a few graph drawing algorithms and node/edge metrics. The latest “pure” Prefuse release goes back to 2007, but recently Prefuse/Flare targeted the toolkit towards web-based InfoVis. In terms of scalability, efforts have been made by the authors to provide an efficient JAVA-based implementation. However, Tulip can handle larger data sets. For instance, a graph of 300,000
nodes and 600,000 edges takes 1.2 GB in Prefuse, whereas it takes only 170 MB in Tulip. Furthermore, interaction with such a graph is almost impossible in Prefuse, whereas it remains reasonable in Tulip.

InfoVis Toolkit (Fekete 2004) The InfoVis Toolkit shares similarities with Prefuse and offers a comprehensive set of information visualization techniques, for instance, node-link diagrams, tree maps, and matrix views. As such, it has many of the advantages and disadvantages of Prefuse. The toolkit supports few but relevant graph drawing algorithms and metrics. The last release of this toolkit was in 2006. The concept of multi-views implemented in this framework has inspired a similar design in Tulip.
Software
ASK-GraphView/CGV (Abello et al. 2006; Tominski et al. 2009) This software system shares an important feature with Tulip, as it relies on the computation of subgraph hierarchies and implements multi-scale graph drawing techniques to explore large data sets that do not necessarily fit into main memory. ASK-GraphView is one of the few scalable graph visualization frameworks. However, it essentially offers a single visualization technique, relying on multi-scale graph drawing as a central visual paradigm.

GUESS (Adar 2006) GUESS uses a scripting language to perform basic tasks (searching, filtering, etc.). This scripting language is very useful and powerful for users with programming experience in Python. However, direct manipulation of the data through interactive techniques, which is the focus of Tulip, may be preferable for some users. Through the plug-in architecture of Tulip, it would be possible to implement a scripting language such as this one, but as of yet, we have not implemented such a feature. Also of concern is the scalability of an interpreted scripting language on very large data set sizes.

Pajek (Batagelj and Mrvar 2003) The Pajek software focuses on the analysis of large graphs, providing several powerful tools such as k-core computation, eccentricity, and others. In earlier versions, Tulip shared many ideas with this software. However, few visualization techniques outside graphs are supported. Also, the software is not open source, making it difficult to use for information visualization research.

Cytoscape (Shannon et al. 2003) Cytoscape is dedicated software for the visualization of networks in biology. In many ways, it shares ideas with the Tulip framework. However, it is primarily focused on biological networks and can have scalability problems. For instance, loading and displaying a grid graph with 10,000 nodes and 20,000 edges requires 1.5 GB in Cytoscape, whereas it requires only 98 MB with Tulip.

Key Points
The Tulip framework consists of four packages. The first package, the core of the Tulip library, provides an efficient data structure designed for abstract data visualization. The second package is a complete OpenGL rendering engine tailored for information visualization techniques. The third package is a library of GUI components created using the Qt-Nokia library. Finally, Tulip software is an application in which one can embed an algorithm, a visualization technique, or a complete information visualization system. Figure 2 summarizes the connections between these different libraries. In the following, we detail the first three packages of our architecture.

Tulip Main Features
Tulip Core
The Tulip core library was created for the purpose of visualizing data sets consisting of entities and the relationships between them. It enables these entities/relations, as well as the attributes attached to them, to be stored in memory efficiently. Furthermore, it provides the necessary functions to access these data, along with standard useful algorithms. For instance, it includes
functions to test whether or not a graph is planar or to compute a uniform quantification of a set of values.
The Tulip core library also integrates a generic plug-in mechanism (Auber and Mary 2006). It is used many times in our library to enable easy extension of our framework. The principle of this plug-in mechanism is to let each plug-in specify its input/output requirements as well as its dependencies on other plug-ins. Similar to what is done with JavaBeans, we are able to call these plug-ins directly in a program or to use them through an automatically generated user interface. Furthermore, since plug-ins are dynamically loaded, the dependency mechanism enables us to check the coherence of a set of plug-ins. In the following, we describe the Tulip meta-model, which is, from our point of view, the part that most differentiates Tulip from other information visualization systems and libraries. For more details on the basic data structures and functions (matrices, convex hulls, etc.) provided by Tulip, the reader can consult the developer manual.

Tulip III, Fig. 2 Tulip architecture overview. The Tulip framework consists of four packages. Tulip core provides efficient data structures for relational data. Tulip graphics is a complete OpenGL rendering engine. Tulip GUI is a collection of widgets built on top of the Qt-Nokia library for the purpose of information visualization. Finally, Tulip software is an application for embedding algorithms, visualization techniques/interaction, and complete information visualization systems. All these packages can be dynamically extended through the plug-in architecture of Tulip
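A plug-in mechanism of this kind, in which each plug-in declares its input parameters and its dependencies on other plug-ins so that the framework can check the coherence of the loaded set, can be sketched in miniature as follows (a hypothetical Python toy, not Tulip's actual C++ API):

```python
# Toy plug-in registry: each plug-in declares its input parameters and its
# dependencies; the registry can then check the coherence of the loaded set.
PLUGINS = {}

def register(name, params=(), depends=()):
    def wrap(fn):
        PLUGINS[name] = {"fn": fn, "params": tuple(params), "depends": tuple(depends)}
        return fn
    return wrap

def check_coherence():
    """Return (plugin, missing-dependency) pairs; an empty list means coherent."""
    return [(n, d) for n, p in PLUGINS.items()
            for d in p["depends"] if d not in PLUGINS]

def run(name, **kwargs):
    plug = PLUGINS[name]
    for p in plug["params"]:  # declared inputs double as a call contract
        if p not in kwargs:
            raise TypeError(f"{name}: missing declared parameter {p!r}")
    return plug["fn"](**kwargs)

@register("degree_metric", params=("graph",))
def degree_metric(graph):
    """A 'measure'-style plug-in: map each node to a value."""
    return {n: len(nbrs) for n, nbrs in graph.items()}

@register("color_by_degree", params=("graph",), depends=("degree_metric",))
def color_by_degree(graph):
    """A plug-in that reuses another plug-in through the registry."""
    deg = run("degree_metric", graph=graph)
    top = max(deg.values())
    return {n: "red" if d == top else "gray" for n, d in deg.items()}
```

The declared parameter lists are what would let a framework auto-generate an input dialog for each plug-in, as the text describes for Tulip.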

Meta-Model III
Based on the previous Tulip version (Auber 2003), the Tulip meta-model III focuses on minimizing the amount of memory used while providing efficient operations on the data set. The original idea behind the data structure was to manage, within the data structure itself, the high-level operations used in the visual analysis process. Integrating all these operations provides global optimization during the interactive exploration of abstract data. As shown in Fig. 3, the user of the Tulip meta-model only has access to the class called Graph. In design pattern terminology, this class is a facade: it provides simplified and centralized access to a set of complexly interacting classes. The programmer does not need
to understand the behavior of the objects they manipulate through the facade. Furthermore, it eases the implementation of data storage optimizations in the library, as external modules are not accessed directly. One should note that this facade can be used even when working on nonrelational data: a graph data structure with an unbounded set of attributes is extremely versatile and can store a wide variety of data (relational, multidimensional, geospatial, etc.). In the following section, we present some of the operations provided by this facade.

Tulip III, Fig. 3 Overview of the meta-model class diagram. Instead of providing a complex set of classes for programmers to use, the Tulip philosophy is to provide centralized access to the data structure through the Graph interface. This approach eases the implementation of an optimized and extensible data structure

Subgraph hierarchy
One of the first requirements was to provide efficient management of subgraphs. As a subgraph generalizes the notion of a subset to relational data, it is often used in graph visualization systems that follow Shneiderman's "overview first, zoom and filter, details on demand" mantra (Shneiderman 1996). In Fig. 3, we see that the facade currently uses two classes: GraphImpl is responsible for storing the entities and relations, while GraphView is responsible for storing subgraphs by using a filtering mechanism on a Graph. This approach is efficient in terms of memory because, in most cases, the storage needed for an entity or relation in a filter is a single bit (worst cases appear when fragmentation of these indexes is maximal). Furthermore, when a subgraph structure is implemented with filtering, the entities and relations used are exactly the same; thus, no overhead is required for correspondence between entities and relations and their subgraphs. To guarantee coherence in the subgraph hierarchy, all modification operations on a subgraph apply recursively to its sub-subgraphs or its supergraphs when necessary. This implementation allows the Tulip framework to manage a large number of subgraphs: a graph having 1,000,000 nodes and 5,000,000 edges with 200,000 subgraphs requires 825 MB on a 64-bit architecture. If one is only interested in graph partitions, where elements must be strictly contained in a subgraph and all its ancestors up to the root, this data structure can be optimized. Tulip does not support this optimization, as it would limit visualization techniques for overlapping subgraphs and clusters. HGV (Raitner 2002) does implement this efficient data structure, and the interested reader can find more details there.

Property sharing
Our second requirement was to support storing an unbounded number of properties, or attributes, on graph elements. The philosophy of Tulip is not to store properties inside the entities and relations, but to have a single object for each property. Even if this data structure is slightly less intuitive for a programmer, this choice is necessary to enable global optimization and to increase cache hits during iteration over entities (especially during rendering). This idea is also used in the IVTK (Fekete 2004) framework. In order to enable sharing of properties between subgraphs, we provide an inheritance mechanism for properties. As shown in Fig. 4, each subgraph inherits its supergraph properties and can also redefine or create its own
properties, similar to the inheritance mechanism in object-oriented languages. Finally, the model integrates a mechanism similar to virtual tables to optimize access to properties when dealing with a deep hierarchy of subgraphs. In all the visualization techniques and systems we have developed, this property-sharing mechanism has been key in providing overview+detail implementations and synchronization.

Tulip III, Fig. 4 Graph hierarchy: Tulip provides management of a hierarchy of subgraphs through an efficient filtering mechanism on graphs. For example, a graph with 1,000,000 nodes, 5,000,000 edges, and 200,000 subgraphs requires 825 MB on a 64-bit architecture. Furthermore, through an inheritance mechanism for graph properties across that hierarchy, Tulip maximizes the number of properties shared between subgraphs. For instance, subgraph1.2 inherits the layout of the root graph. The inheritance mechanism is also able to redefine properties in subgraphs, as one would do in an object-oriented programming language. The subgraph2, for example, has redefined its layout but inherits the colors, sizes, and shapes of its parent

Aggregation
The third key feature is to enable hierarchical aggregation (Elmqvist and Fekete 2010) of entities/relations, and the Tulip meta-model III has been extended and optimized for this purpose. As presented in Auber and Jourdan (2005), the subgraph hierarchy presented above can support the efficient aggregation of subgraphs. However, after applying this technique to several multiscale problems (Archambault et al. 2008, 2009; Bourqui and Auber 2009), we have integrated into the facade accessors to meta-information on graph elements that are stored in memory (GraphImpl). This solution incurs a memory overhead when compared to Auber and Jourdan (2005) but enables independence from the metagraph construction order and helps support de-aggregation operations. We also introduce aggregation functions in order to be able to modify the way aggregated values are computed.

Observable data structure
Interactive visualization often requires the modification of graph topology (graph structure), decomposition (subgraphs or aggregation), and attributes (properties). To prevent static links between the Tulip data structure and an external algorithm or system, we provide an observer mechanism that listens for all modifications applied to the data structure.
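The subgraph-as-filter and property-inheritance ideas described above can be illustrated with a minimal sketch (Python for brevity; Tulip itself is C++, and all names here are hypothetical): a subgraph filters the node set of its parent, and a property lookup falls back to the supergraph unless the subgraph has redefined it.

```python
class Graph:
    """Toy facade: the root graph owns nodes and per-name property tables;
    a subgraph filters the node set and inherits properties from its parent."""
    def __init__(self, nodes, parent=None):
        self.nodes = list(nodes)
        self.parent = parent
        self.properties = {}  # property name -> {node: value}

    def set_property(self, name, values):
        self.properties[name] = dict(values)  # (re)define locally

    def get_property(self, name):
        g = self
        while g is not None:  # walk up until the property is (re)defined
            if name in g.properties:
                return g.properties[name]
            g = g.parent
        raise KeyError(name)

    def subgraph(self, keep):
        # Real Tulip stores roughly one bit per element for such a filter;
        # a list comprehension stands in for that here.
        return Graph([n for n in self.nodes if n in keep], parent=self)

root = Graph(["a", "b", "c", "d"])
root.set_property("color", {"a": "red", "b": "red", "c": "blue", "d": "blue"})
sub = root.subgraph({"a", "c"})
# The subgraph inherits "color" from the root, but can redefine its own
# layout without touching the supergraph.
sub.set_property("layout", {"a": (0, 0), "c": (1, 0)})
```

Because the subgraph holds no copy of inherited properties, the sharing is maximal, which is exactly what the text credits for Tulip's overview+detail support.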


State management
The most substantial improvement in the new meta-model is the addition, to the facade, of the ability to save the current state of the data structure. Like the OpenGL matrix stack, we provide two functions, push and pop, which save or restore the current state of the data structure through a stack. A naive implementation of this feature would be suboptimal when dealing with a large number of graph elements and their properties. In Tulip, this mechanism has been designed with the proxy design pattern. This pattern allows objects to behave like other objects, hiding direct manipulation of the data structure from the user and allowing data sharing to be globally optimized. Using that stack of states, we were able to provide an efficient implementation of the command design pattern, and thus efficient undo/redo operations on large data sets. For instance, a graph with 40,000 nodes and 80,000 edges under the following modifications, "change all the sizes," "change the layout," "change all the colors," requires less than 115 MB (including Tulip GUI, 3D rendering engine, and plug-in memory usage), enabling immediate undo/redo on a 64-bit Intel Q9300 processor.

Algorithms
Several kinds of algorithms are used in information visualization systems but can be clearly separated from the visualization technique. In Tulip, based on our plug-in mechanism, we provide a way to add such new features. To remain independent from visualization techniques, these plug-ins are only authorized to modify the meta-model described above. Furthermore, we provide a call-back mechanism inside our algorithms, allowing for interactive use in visualization techniques or information visualization systems. For all algorithms, we do not limit the input parameters; thus, by using our dynamic parameter declaration mechanism, a programmer can write a large variety of algorithms.
However, in order to categorize major classes of algorithms and ease automatic connection with the user interface, we provide interfaces for algorithms that modify a single Tulip property. For instance, standard graph drawing algorithms only need to modify the positions of nodes and the positions and number of bends of edges, which can be stored in a layout property. Based on this idea, we provide plug-ins for hierarchical graph drawing, radial trees, force-directed approaches, spectral methods, planar graph drawing, space-filling curves, edge bundling, and bin packing. Measure algorithms are based on the same idea and produce real values on entities/relations; among them are the computation of k-cores, eccentricity, betweenness centrality, PageRank, (bi/tri)connected components, the strength metric, and the Strahler number. Furthermore, we also provide a general algorithm type that can modify any element of the data structure if necessary. We use it for clustering algorithms, and it enables us to provide implementations for many approaches, including agglomerative clustering methods, divisive clustering methods, and metric-based approaches. We also provide an adapter (i.e., wrapper) to directly use the algorithms provided in the Open Graph Drawing Framework (OGDF) library.
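The push/pop state stack and command-pattern undo/redo described in the state-management paragraph can be sketched as follows. This is a naive snapshot-based sketch for a single property; Tulip's real implementation shares unchanged data via the proxy pattern, and the class and method names are illustrative:

```cpp
#include <cassert>
#include <map>
#include <stack>

// A color property with snapshot-based undo/redo. Each push() saves the
// full map; a proxy/copy-on-write scheme would avoid this duplication.
class ColorProperty {
    std::map<int, int> color_;             // node id -> color value
    std::stack<std::map<int, int>> undo_;  // saved states
    std::stack<std::map<int, int>> redo_;
public:
    // Save the current state before a bulk modification; a new edit
    // invalidates any pending redo states.
    void push() {
        undo_.push(color_);
        while (!redo_.empty()) redo_.pop();
    }
    void setColor(int node, int c) { color_[node] = c; }
    int getColor(int node) const {
        auto it = color_.find(node);
        return it == color_.end() ? 0 : it->second;
    }
    void undo() {
        if (undo_.empty()) return;
        redo_.push(color_);
        color_ = undo_.top();
        undo_.pop();
    }
    void redo() {
        if (redo_.empty()) return;
        undo_.push(color_);
        color_ = redo_.top();
        redo_.pop();
    }
};
```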

Data Import and Export The efficient import and export of a variety of data formats is key for building generic information visualization libraries. However, supporting these formats in a generalizable way is not obvious. A basic version of Tulip is able to import CSV (comma-separated values) files and the GML and dot formats for graphs and their attributes. We also designed our own format (tlp) that allows meta-information to be saved to disk and the appearance of graphs to be customized. Import algorithms are also available for randomly generating graphs, importing web graphs, or importing a file system. An important feature of the import/export architecture in Tulip is that it also forms part of the plug-in architecture. Therefore, programmers can extend the import and export capabilities of Tulip by designing their own plug-ins for custom file formats.
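As an illustration of the simplest case, a CSV edge-list importer might look like the following sketch. The function name and the "src,dst" line format are assumptions for illustration, not Tulip's actual tlp/CSV parser:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Parse a CSV edge list, one "src,dst" pair per line, into an edge vector.
// Malformed lines are skipped rather than aborting the import.
std::vector<std::pair<int, int>> importCsvEdges(const std::string& csv) {
    std::vector<std::pair<int, int>> edges;
    std::istringstream in(csv);
    std::string line;
    while (std::getline(in, line)) {
        auto comma = line.find(',');
        if (comma == std::string::npos) continue;  // skip malformed lines
        int src = std::stoi(line.substr(0, comma));
        int dst = std::stoi(line.substr(comma + 1));
        edges.emplace_back(src, dst);
    }
    return edges;
}
```

Registering such a function behind a plug-in interface is what lets new formats be added without touching the core library.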


Tulip Graphics Efficient rendering of large amounts of geometric information is a bottleneck in most information visualization systems. In the Tulip Graphics library, we provide an OpenGL-based, multilayer rendering engine that includes the necessary functions for implementing information visualization techniques. In our multilayer rendering engine, three-dimensional information can be displayed on different layers. For instance, using layers and transparency enables the graphics library to render textured quads behind the scene, render transparent convex hulls on top of graph elements, or display legends (2D rendering on top of the scene) for visualizations. Through the OpenGL stencil buffer, we are able to force the visibility of elements on layers. This functionality implements guaranteed visibility (Munzner et al. 2003) for rendered elements. For example, in our visualization techniques, we use this capability to guarantee that selected elements are always visible. To ease the implementation of new techniques, we provide functions to manipulate the camera, select elements, render aggregated elements, render basic geometric entities, and facilitate the use of vertex/pixel/geometry shaders. Special attention has been paid to keeping these operations usable on huge data sets. For instance, curves such as Bézier curves, splines, and B-splines are computed and rendered on the GPU, allowing Tulip to render more than 10,000 curves, each with more than 100 bends, in real time without storing any precomputed geometry. In this example, we save the storage and transfer of the 2,000,000 triangles required to render this set of curves. To be able to extend existing visual metaphors without modification, we provide a plug-in mechanism for adding new visual objects. These geometric plug-ins can be used to create glyphs. For example, a programmer can create a new plug-in for rendering pie charts according to specific attribute values. After installing the plug-in, all views (node-link diagrams, scatterplots, etc.) can render graph elements using the new representation.


Using an external rendering engine would have been possible, but two main reasons led us to design our own. First, external rendering engines can introduce memory overhead, making it impossible to handle graphs of over 500,000 elements in less than 256 MB of memory. Second, when the Tulip project began in 2000, 3D rendering engines were not readily available. Moreover, designing an OpenGL rendering engine for the purpose of abstract data visualization allows us to optimize and tune the engine according to the visualization techniques we have implemented. As an example, in earlier versions of Tulip, the skeletons of graphs were computed using Strahler numbers to incrementally render graph nodes and edges (Auber 2002b). In software engineering terminology, a composite design pattern is used to model the hierarchy of visual objects to be rendered. A naive implementation of this pattern requires the instantiation of a large number of objects and therefore does not scale to large data sets because of memory constraints. To solve this problem, the Tulip Graphics library accesses this composite using a visitor pattern. First, the visitor pattern adds new functionality to the composite without any modification to its data. For instance, the visitor can compute the bounding boxes needed for the level of detail used during rendering. Second, the visitor pattern can simulate a hierarchy of objects without building it. For example, when using a GraphComposite, the visitor traverses a dynamically created hierarchy of objects instead of creating this hierarchy beforehand. Objects are generated and reused on the fly during rendering, in a way that is similar to the flyweight design pattern. This avoids data duplication between the data model and the graphics library, allowing the system to scale to larger data sets and keeping rendering synchronized with the model.
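A minimal sketch of this visitor idea — computing a bounding box over node positions with one reused visitor object instead of one graphical object per node — might look like this. The names are illustrative, not Tulip's actual GraphComposite API:

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Point { float x, y; };

// Visitors traverse graph elements without the scene materializing a
// per-node object hierarchy.
struct SceneVisitor {
    virtual ~SceneVisitor() = default;
    virtual void visitNode(const Point& p) = 0;
};

// Computes a bounding box for level-of-detail tests; only one instance
// of this visitor exists, in the spirit of the flyweight pattern.
struct BoundingBoxVisitor : SceneVisitor {
    float minX = 1e9f, minY = 1e9f, maxX = -1e9f, maxY = -1e9f;
    void visitNode(const Point& p) override {
        minX = std::min(minX, p.x); maxX = std::max(maxX, p.x);
        minY = std::min(minY, p.y); maxY = std::max(maxY, p.y);
    }
};

// The "composite" here is simply the layout; the visitor is applied to
// every node without instantiating any graphical object.
void acceptAll(const std::vector<Point>& layout, SceneVisitor& v) {
    for (const auto& p : layout) v.visitNode(p);
}
```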
The philosophy behind the Tulip Graphics library is the efficient, direct rendering of data stored in the Tulip data structure without duplication. However, as the amount of available memory has increased significantly, we have integrated into the latest version of Tulip optimizations that are more memory intensive. For example, we use octrees to optimize selecting elements or computing level of detail, and we use texture-based rendering to accelerate the rendering of aggregated elements during zoom and pan. Comparing the performance of Tulip to that of VTK/Titan in terms of speed and memory efficiency, we found that loading and rendering a grid of 1,000,000 nodes and 2,000,000 edges from scratch takes 20 s and 320 MB in Tulip and 50 s and 1.3 GB with VTK/Titan. After this initial rendering, VTK/Titan is five times faster than Tulip for subsequent renderings under simple zoom and pan navigation, without modification of graph structure or selection of elements. However, if the selection is modified, selection of elements on this grid is immediate with Tulip, while it takes more than 60 s with VTK/Titan. These results illustrate the trade-offs Tulip has made between rendering performance and memory usage for the implementation of information visualization techniques.

Tulip GUI According to Munzner (2009), the proper visual encoding should be chosen after problems from real-world users have been characterized. Not every problem calls for a unique and completely new visualization technique; the problem often becomes one of selecting the proper techniques to assemble and implement, together with the proper operations. Some techniques have now been used and studied long enough that their scope of usability is more or less established. Because Tulip aims to be used for implementing end-user visualization systems, it has to implement a wide palette of existing techniques. Thus, a choice has been made to implement pairs of visual encodings and operations based on their usefulness and scope as assessed by the InfoVis community. Tulip progressively added new features that allowed users to go back and forth between a node-link diagram, where metrics were mapped as color or size, and histograms that helped understand how a metric was able to capture a key property in the data.
These data analysis features have grown and now include a set of well-established data visualization techniques (see section "Views"). Tulip has evolved from essentially offering a unique visual encoding (node-link diagrams) to a variety of data analysis techniques that can moreover be astutely combined and synchronized. All these new features were carefully and coherently integrated into the framework using an agile development methodology (see section "Generic Tulip Perspective"). We obtained an architecture based on the model-view-controller (MVC) architectural pattern. The model-view-controller approach is a well-known approach for designing interactive systems. The pattern splits the software architecture into three independent components: the model component stores the information, the view component gives a representation of the information, and the controller manages communication between one or more views and the model. This architecture disassociates the data structure (Model) from the representation (View) and the system behavior (Controller). In the following, we describe the three main components of the Tulip GUI library.

Views Views can be defined as visual representations of data. Node-link diagrams, parallel coordinates, and scatterplots are just a few examples of views that can be used to gain insight into a data set. Tulip uses the above-described meta-model to create multiple views of the same data set. The idea is to use the same data independent of the current view. For example, nodes in the node-link view of a graph may have several attributes, and these attributes could be placed in a 2D scatterplot. Having all views share the same data model helps maintain system coherency and enables working with several views simultaneously. Structuring data manipulation in this way allows the information in one view to be easily analyzed in all other views, hopefully providing more insight. Figure 5 shows three different views, and in each view, one can see that shapes, colors, and relative sizes are preserved.
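The shared-model idea described above can be sketched minimally: several views hold a pointer to one model, so a controller update through any path is immediately visible everywhere. The class names are illustrative assumptions, not Tulip's GUI classes:

```cpp
#include <cassert>
#include <map>

// The model stores the data once; views never copy it.
struct Model {
    std::map<int, double> metric;  // node id -> metric value
};

// A view only holds a pointer to the shared model and reads through it,
// so any update to the model is instantly visible in every view.
struct View {
    const Model* model = nullptr;
    void setModel(const Model* m) { model = m; }
    double read(int node) const { return model->metric.at(node); }
};
```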
This shared model makes a fundamental, although simple, user interaction quite powerful. As an example, when selecting


Tulip III, Fig. 5 A centralized meta-model maintains coherence between views. (Left) Histogram view. (Middle) Node-link diagram. (Right) Scatterplot view. All three views share the same visual attributes, enabling the user to switch between views easily



Tulip III, Fig. 6 (Left) The node-link diagram view renders glyphs for nodes and curves for edges. The view provides navigation such as zoom and pan, bring and go (Moscovich et al. 2009), fish-eye views, and a magnifying glass. Direct editing of graph elements and data, such as adding or removing nodes and edges or translating, rotating, or scaling elements, is also supported. Other operations on this view include graph splatting, metanode/graph hierarchy exploration, and texture-based animation. (Right) The scatterplot 2D view renders attribute values to depict possible correlations between properties, and the matrix allows efficient navigation between dimensions. The view provides similar interaction to the node-link view and implements an interactor to search for correlations in interactively defined subsets of elements. Splatting is also available in this view

nodes in a histogram, to focus on high-value nodes, for instance, the user instantly sees where these nodes spread in the node-link view. For optimization purposes, or in order to implement specific types of views, the programmer occasionally needs a custom data structure. For these cases, views can observe any change to the meta-model (see section "Meta-Model III" for details), synchronizing all views to it. As an example, consider the scatterplot matrix view (see Fig. 6) implemented in Tulip. This view generates a buffer of textures for efficient navigation through the matrix. The data model, in this case, is used to generate the scatterplot representation for each pair of dimensions, and the view stores these results as images. During interactive navigation, the rendering engine displays only the textured quads. If the data set is modified by other views or interactors, the set of textures needs to be rendered again. The observer mechanism of Tulip notifies the appropriate views so that the textures are re-rendered only when necessary. Views are implemented as Tulip plug-ins. Currently, all views are implemented using the Tulip rendering engine, but programmers are not limited to this engine. Integrating rendering engines such as VTK, or even multiple engines simultaneously, inside a single view can be supported; however, the programmer would need to synchronize all views manually. An example of a foreign rendering engine used in conjunction with the Tulip rendering engine inside a single view is the Google Map mashup, where the Google Map API renders a map in one layer while the Tulip rendering engine renders the remaining layers on top of this map. Figures 6–9 present an overview of the major views implemented in the current Tulip release.
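The lazy invalidation scheme used by the scatterplot matrix view can be sketched as follows; cached textures are rebuilt only after the observer reports a data change. `CachedView` and its methods are illustrative assumptions:

```cpp
#include <cassert>

// A view that caches its rendered textures and rebuilds them only when
// the observer mechanism has flagged the underlying data as modified.
class CachedView {
    bool dirty_ = true;  // starts dirty: nothing rendered yet
    int renders_ = 0;
public:
    void onDataChanged() { dirty_ = true; }  // called by the observer
    void draw() {
        if (dirty_) {          // rebuild the texture buffer once
            ++renders_;
            dirty_ = false;
        }
        // otherwise just blit the cached textured quads
    }
    int renderCount() const { return renders_; }
};
```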

Interactors Interaction is essential for most information visualization techniques. However, generalizing interaction in an extendable way raises a significant challenge as a wide range of methods require support. Some selections require transparent rectangles to be drawn on top of selected elements. Opening a metanode requires a single click, a small amount of zooming and panning, and modifying graph structure locally at the metanode.


Tulip III, Fig. 7 (Top) The parallel coordinates view depicts multivariate data, using the traditional parallel coordinates representation as well as a circular representation. In both views, lines can be rendered with smooth Bézier curves. Interaction with the view is supported through zoom and pan, axis editing/permutation/shifting, and multi-criteria/statistical selection. (Bottom) The histogram view provides a view of element frequency. A matrix of histograms allows for the visual comparison of several statistical properties of a set of dimensions. This view has a standard set of navigation and statistical interactors. Additionally, an interactor enables the user to build nonlinear mapping functions to any of the graph attributes such as size, colors, glyphs, etc.



Tulip III, Fig. 8 (Top) The Google Map view implements a mash-up of the Google Map API. With this API, geospatial positions for the layout of graph elements can be specified. When working with geographic data, graphs can be displayed on top of the map. This view supports standard zoom and pan as well as the selection of elements. (Bottom) The pixel-oriented view uses space-filling curves to display a large number of entities and relations on a screen. This view supports Hilbert curves, Z-order curves, and spiral curves. The pixel-oriented view is based on our previous data cube (Auber et al. 2007) visualization and supports zoom and pan/selection interaction as well as focus+context techniques



Tulip III, Fig. 9 (Left) The self-organizing view implements Kohonen self-organizing maps (Kohonen 1982). Several kinds of topology/connectivity for the generated maps are supported, as well as navigation and selection interactors. (Right) The matrix view implements a matrix view of the graph. This view has been built to support graphs with a large number of nodes and edges. Zooming and selection interactors are available for this view

The bring-and-go technique (Moscovich et al. 2009) changes the layout of the graph and requires both zoom and pan of the camera along a well-defined trajectory. Furthermore, programmers should be able to combine all these interactive techniques in the final visualization. As an example configuration, the mouse wheel could handle both zoom and pan, a left click could modify the element selection, and a right click could display a context menu. To support a range of interaction methods, we implemented the chain of responsibility design pattern. This pattern models the transmission of a message through a chain of linked objects. During the transmission, the message can stop or continue along its path according to the object it passes through. In Tulip, we call the entire chain an Interactor and an object in the chain an InteractorComponent. An InteractorComponent implements an interaction method and can handle all GUI events on a view, modify the Tulip data structure, modify the view, and render objects on top of the view. In the model-view-controller paradigm, this component can be seen as a microcontroller. To encourage reuse, an InteractorComponent is programmed to be as small as possible. For instance, the zoom and pan, fish-eye lens, magnifying glass, zoom box, and box selection interactors are often reused and are implemented as five individual interactor components. An Interactor is an ordered set of InteractorComponents. The interactor receives all events from the view and implements the chain of responsibility, asking each InteractorComponent whether or not it can handle an event. The Qt (Nokia) library is used as much as possible for these operations. The interactor is also responsible for providing configuration widgets, documentation, and an icon for display in toolbars. Furthermore, interactors report the views with which they are compatible. In order to reuse an interactor without modification of its source code, the set of views that an interactor supports can be dynamically extended. Interactors also implement the plug-in interface. Thus, programmers can create their own interactors by combining interactor components or developing new ones. As a result, interactors can be reused across views, and the programmer can extend the different types of interactions


supported by Tulip. For example, GPU-based graph splatting can be implemented as an interactor.

Perspectives As each application requires considerable programming effort, which we hope to reuse, Tulip recently added domain-specific or user-centered perspectives. Following Munzner (2009), real-world problems should first be characterized and abstracted into good operations and data types. There are good reasons to believe that Tulip contains several of the basic ingredients needed to properly combine and/or develop these operations and data types using Tulip's plug-in-based architecture. After applying the Tulip framework in a variety of domains, including biology, social network analysis, and geography, we realized that many aspects of a visualization system cannot be generalized and must be left to the developer to specify. However, in order to reduce re-implementation, we tried to contain all domain-specific elements inside perspective plug-ins, allowing general system components and interactions to be reused across applications. A Tulip perspective specifies the visualization techniques (algorithms, views, and interactors) to assemble and how to load them. These plug-ins can use domain-specific widgets, menus, and libraries. Perspectives are very different from the generic perspective that comes with the open source release: they are designed through user interviews and problem characterizations and are customized using Tulip libraries and plug-ins. As our meta-model is generic, we hypothesized that one could keep the same data representation and switch between user interfaces depending on the task. The development of Tulip perspectives was inspired by this requirement. In the MVC model, controllers are responsible for managing connections between models and views. Thus, by changing the controller (also known as the mediator pattern in design pattern terminology), one can change the system behavior. We have had some experience using Tulip in such a context.
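The controller swap described above — same model, different perspective — can be sketched minimally. The perspective classes and their behaviors here are illustrative assumptions, not Tulip's actual perspective API:

```cpp
#include <cassert>

struct Model { int value = 0; };

// The controller is the mediator between the model and the user interface.
struct Controller {
    virtual ~Controller() = default;
    virtual void onUserAction(Model& m) = 0;
};

// Two hypothetical perspectives giving the same user action different
// domain-specific meanings over the same data model.
struct BiologyPerspective : Controller {
    void onUserAction(Model& m) override { m.value += 1; }
};

struct GenericPerspective : Controller {
    void onUserAction(Model& m) override { m.value += 10; }
};

// Swapping the controller changes the system behavior; the model stays put.
struct Application {
    Model model;
    Controller* controller = nullptr;
    void trigger() { if (controller) controller->onUserAction(model); }
};
```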
In order to properly assess the effectiveness of Tulip in visual analytics solutions, more work is needed. However, what is clear is the gain we experience as visualization designers and experts: Tulip is a toolbox we use when demonstrating the potential use of visual encodings to define paths to follow with end users.

Tulip Run-Time Environment As described above, the philosophy of the Tulip framework is to facilitate the reuse of plug-ins in many contexts. The advantage of this approach is that it allows easier framework extension. A disadvantage is that programming an application that exploits a collection of plug-ins is more difficult; this added complexity is, more generally, a drawback of plug-in-based systems. The Tulip software aims to provide the needed organization of these plug-ins so that they can be more easily used. Section "Perspectives" shows that it is not the Tulip software itself that creates a visualization system, but a perspective plug-in launched by the Tulip software. The design of the Tulip software was inspired by all the stand-alone applications that we have implemented with the Tulip libraries (Auber et al. 2003a, b, 2006; Iragne et al. 2005; Bourqui et al. 2007). Using agile methods, refactoring aims to place all duplicated code inside this software. The primary difficulty of designing the Tulip software is to determine the maximal set of common functions between perspectives. In our experience, the following functions proved either necessary or general enough to be used by all designed systems:

Model management: All perspectives store data inside the Tulip data model, so the Tulip software supports this model. The software provides import, export, open, and close operations and checks the data structure for modifications. As the model can be analyzed with different perspectives, the Tulip software is also responsible for changing/choosing the perspective used and for managing the multi-document interface with tab widgets.
Plug-in management: Since perspectives are plug-ins, they cannot be used until they are loaded. Thus, the software initializes all the libraries and plug-ins. The software automatically checks for plug-in dependencies and can update or download plug-ins using the Tulip plug-in web service. For a desktop application, as opposed to a web application, this functionality is necessary to keep end users involved in the development: frequent manual installation of new releases is one of the most important problems for end users.

Cross-platform support: Supporting multiple platforms is very time consuming when designing new applications. In Tulip, we aim to provide a platform-independent execution environment, so that the programmer can focus on the implementation of their visualization workflow. Tulip is available for Linux, Windows, and Mac OS, and the plug-in web service provides access to plug-ins compiled for all three platforms.

Key Applications The Tulip framework consists of a set of libraries and an application for managing plug-ins that use these libraries. In a way, without plug-ins Tulip is not able to visualize data; however, it provides the necessary functions and data structures to build a system tailored to the task of the user. In this section, we describe some visualization systems we have built using Tulip.

Generic Tulip Perspective The Tool Box system (known as the Tulip Graph Visualization Software) provides a generic software interface for the purpose of information visualization research. We identified the following tasks as the most important for our research.

Reproducibility: The most important task is the reproducibility of results published in our community. Such a system should be able to integrate many different types of techniques and algorithms. In early versions of Tulip, the focus was on graph visualization, and therefore our requirements consisted of graph metrics, graph drawing algorithms, and graph clustering algorithms. In later versions, we furthered this idea to include visualization techniques and user interaction approaches.

Rapid prototyping: We would like to quickly prototype new algorithms or visualization techniques and analyze them in a general visualization context. For example, we could see how a new clustering algorithm or graph drawing algorithm eases understanding of a data set.

Pipeline exploration: We would like to interactively combine existing algorithms, techniques, and interaction methods to easily construct domain-specific visualizations. This feature is helpful for interviews with end users, as a combination of existing features can often be used as a starting point for user feedback, delaying the implementation of custom visualization methods to later stages in the project. For instance, when working with biologists, we prototyped the analysis pipeline using the generic perspective (see section "Systrip Perspective") before further implementation.

We have implemented this generic Tulip perspective, which supports editing graph element properties, exploration of the subgraph hierarchy, and access to built-in functions of the Tulip core libraries. Sample operations that are available include undo/redo, aggregation, subgraph creation, planarity testing, and cut/paste. Moreover, this system automatically constructs menu items and tool bars, allowing access to all installed algorithms, views, and interactors. The connection statistics for our plug-in web service, a service that checks for updates to perspectives when they are launched, indicate that the perspective is frequently used for direct data analysis. Every day, more than 100 people use this perspective, and the number of hits to its web site was about 8,000 in March 2010 (Fig. 10).

Systrip Perspective The Systrip perspective was constructed in order to help biologists understand the metabolism of

Tulip III, Fig. 10 The Tulip generic perspective provides an automatically generated user interface depending on available plug-ins. It also provides tools for manual configuration of both views and interactors


[Fig. 11 diagram: inputs (SBML, tlp, gml, csv) feed drawing algorithms (FM3 layout, GEM 2D, GEM 3D, hierarchical, MetaViz), graph metrics (betweenness, degree, eccentricity, Strahler, strength), and bioinformatics algorithms (scope, choke points, pathway selection), composed into two Systrip pipelines.]

Tulip III, Fig. 11 The Systrip pipeline was created to help biologists understand the metabolism of the tsetse fly parasite that causes sleeping sickness. First through the generic perspective of Tulip and then through a custom Systrip perspective, we began to understand the task requirements, providing a visualization pipeline customized to the task of the biologist

the tsetse fly parasite that causes sleeping sickness. During initial user interviews, we found that they seem to follow an analytic process which involves getting an overview of the data first and then focusing on a few relevant sub-networks. Using the generic perspective, we tested various interaction methods via manual selection of elements and the subgraph hierarchy. In a sense, this stage experimented with many different visualization pipelines for exploratory analysis with little implementation effort. After this initial stage, we implemented biology-specific algorithms to extract these subgraphs using a Tulip clustering plug-in. A custom import plug-in allowed Tulip to directly load their dedicated data format. By using this generic perspective, we were able to run a second round of interviews to determine if we were on the right track. Figure 11 shows two pipelines identified to be useful for their tasks. After the preliminary prototypes, we implemented a custom perspective (see Fig. 1) that integrates these two pipelines. This perspective limits access to only the Tulip functionality that is relevant to their task and uses domain-specific terminology in the user interface. As an example, graph terminology is ineffective with this audience, and the terms network and sub-network need to be used instead. With this prototype, the user community experimented with the perspective without our assistance, allowing them to suggest improvements. For instance, data is generated over time during experiments, and the users required animation capabilities showing the changes induced by biological events. We were able to manually simulate this behavior using the generic perspective to gather feedback; subsequently, the functionality was implemented as an interactor for the node-link diagram view. The final perspective integrates other domain-specific capabilities such as connections to databases and the three-dimensional rendering of molecules.


Tulip III, Fig. 12 Images produced by GrouseFlocks and TugGraph. (a) Two perspectives of a movie graph; nodes are movies, and edges link movies that share an actor, indicating genre lock. In both the action and documentary genres, we get a large metanode of non-genre (yellow) movies and a large metanode of in-genre movies (pink). (b) TugGraph explores the structure of the Internet around UBC. In this case, ubci9 is tugged in the left image, revealing its direct connections in saturated blue on the right. In many cases, these direct adjacencies fragment the graph into multiple connected components, shown in light blue

Grouse, GrouseFlocks, and TugGraph Perspective Sometimes, when the number of nodes and edges in a graph becomes large, rendering all of them directly can be an obstacle to graph readability. Also, computing a full drawing of the graph can be expensive in terms of running time. As a new approach to dealing with these problems, members of our team applied Tulip to research new techniques for graph visualization. The perspective for this system was originally an application that used the Tulip and Qt libraries; subsequently, the application was converted into a Tulip perspective when support became available in version 3.


In this approach, the contents of metanodes, derived from either topological structures or attribute information, were constructed and/or drawn on demand as the user explored the data. Grouse (Archambault et al. 2007a) took a large graph and hierarchy as input and was able to draw parts of it on demand as users opened metanodes. Appropriate graph drawing algorithms were used to draw the subgraphs based on their topological structure; for example, if a metanode contains a tree, a tree drawing algorithm is used. GrouseFlocks (Archambault et al. 2008) was created to construct graph hierarchies based on attribute data and progressively draw them. For instance, subgraphs induced by a set of attribute values were placed inside connected metanodes. These metanodes could be drawn on demand with Grouse. However, often the parts of a graph near certain nodes and metanodes are of interest, and certain metanodes can be too large to draw on demand. TugGraph (Archambault et al. 2009) was created for these situations, when the topology near a node or metanode is interesting; it can also summarize specific sets of paths in the graph. Efficient implementation of these three techniques benefited greatly from Tulip's metanode/metagraph structure, its ability to handle the large numbers of subgraphs generated by the systems, and its functions for animating graph elements on the screen. One of the biggest advantages of using Tulip for developing this software was the number of graph drawing algorithms it supports. Using the Tulip plug-in mechanism, all three systems were made configurable so that new graph drawing algorithms can be added dynamically. Images of TugGraph and GrouseFlocks are shown in Fig. 12.

Future Directions
We have presented the Tulip 3 framework, which is based on 10 years of our research, and explained the architecture choices we made to create a stable and maintainable platform for information visualization research. The framework allows us to test all levels of the Munzner nested model: from the algorithm, to technique/interaction, to encoding, and finally to validating a complete system taking end users into account. Through technical details and a few experiments, we have demonstrated that our framework can scale to large data sets. Furthermore, we provide this framework to the information visualization community under the LGPL license for reproducibility of our research. Tulip is available for Windows, Linux, and Mac OS. Future challenges for Tulip include integrating our initial experiences with dynamic graphs (Archambault 2009) into this model and optimizing data storage for dynamic data. Furthermore, integrating this concept directly into our facade will provide a unified set of visualization techniques using relational data as a basis.

Cross-References
- Clustering Algorithms
- Gephi
- GUESS
- Pajek
- Visualization of Large Networks

References
Abello J, van Ham F, Krishnan N (2006) ASKgraphview: a large scale graph visualization system. IEEE Trans Vis Comput Graph 12(5):669–676
Adar E (2006) GUESS: a language and interface for graph exploration. In: CHI '06: proceedings of the SIGCHI conference on human factors in computing systems, Montréal, pp 791–800. http://graphexploration.cond.org/ (visited 18 Mar 2010)
Archambault D (2009) Structural differences between two graphs through hierarchies. In: Proceedings of the graphics interface, Kelowna, pp 87–94
Archambault D, Munzner T, Auber D (2006) Smashing peacocks further: drawing quasi-trees from biconnected components. IEEE Trans Vis Comput Graph (Proc Vis/InfoVis 2006) 12(5):813–820
Archambault D, Munzner T, Auber D (2007a) Grouse: feature-based, steerable graph hierarchy exploration. In: Proceedings of the Eurographics/IEEE VGTC symposium on visualization (EuroVis '07), Norrköping, pp 67–74
Archambault D, Munzner T, Auber D (2007b) TopoLayout: multilevel graph layout by topological features. IEEE Trans Vis Comput Graph 13(2):305–317
Archambault D, Munzner T, Auber D (2008) GrouseFlocks: steerable exploration of graph hierarchy space. IEEE Trans Vis Comput Graph 14(4):900–913
Archambault D, Munzner T, Auber D (2009) TugGraph: path-preserving hierarchies for browsing proximity and paths in graphs. In: Proceedings of the 2nd IEEE pacific visualization symposium, Beijing, pp 113–121
Auber D (2001) Tulip. In: Mutzel P, Jünger M, Leipert S (eds) 9th international symposium on graph drawing, GD 2001, Vienna. Lecture notes in computer science, vol 2265. Springer, pp 335–337
Auber D (2002a) Outils de visualisation de larges structures de données. PhD thesis, University Bordeaux I
Auber D (2002b) Using Strahler numbers for real time visual exploration of huge graphs. In: International conference on computer vision and graphics, Poland, vol 1–3, pp 56–69
Auber D (2003) Tulip: a huge graph visualization framework. In: Mutzel P, Jünger M (eds) Graph drawing software. Mathematics and visualization. Springer, New York, pp 105–126
Auber D, Jourdan F (2005) Interactive refinement of multi-scale network clusterings. In: IV '05: proceedings of the ninth international conference on information visualisation, London. IEEE Computer Society, Washington, DC, pp 703–709. doi:10.1109/IV.2005.65
Auber D, Mary P (2006) Mise en place d'un mécanisme de plugins en C++. Programmation sous Linux 1(5):74–79. http://www.labri.fr/publications/mabiovis/2006/AM06 (visited 18 Mar 2010)
Auber D, Chiricota Y, Jourdan F, Melançon G (2003a) Multiscale navigation of small world networks. In: IEEE symposium on information visualisation, Seattle. IEEE Computer Science, pp 75–81
Auber D, Delest M, Domenger JP, Ferraro P, Strandh R (2003b) EVAT: environment for visualization and analysis of trees. In: Proceedings of the IEEE symposium on information visualization, Seattle, pp 124–125
Auber D, Delest M, Domenger JP, Dulucq S (2006) Efficient drawing of RNA secondary structure. J Graph Algorithms Appl 10(2):329–351
Auber D, Novelli N, Melançon G (2007) Visually mining the datacube using a pixel-oriented technique. In: IV, Zurich, pp 3–10
Batagelj V, Mrvar A (2003) Pajek – analysis and visualization of large networks. In: Mutzel P, Jünger M (eds) Graph drawing software, vol 2265. Springer, New York, pp 77–103
Boulet R, Jouve B, Rossi F, Villa N (2008) Batch kernel SOM and related Laplacian methods for social network analysis. NeuroComputing (Special Issue on Progress in Modeling, Theory, and Application of Computational Intelligence – 15th European Symposium on Artificial Neural Networks 2007) 71(7–9):1257–1273. See also http://www.nature.com/news/2008/080519/full/news.2008.839.html
Bourqui R, Auber D (2009) Large quasi-tree drawing: a neighborhood based approach. In: IV '09: proceedings of the 13th international conference on information visualisation (IV'09), Barcelona. IEEE Computer Society, Washington, DC, pp 653–660
Bourqui R, Lacroix V, Cottret L, Auber D, Mary P, Sagot MF, Jourdan F (2007) Metabolic network visualization eliminating node redundance and preserving metabolic pathways. BMC Syst Biol 1:29
Chimani M, Gutwenger C, Jünger M, Klein K, Mutzel P, Schulz M (2007) The open graph drawing framework. In: Posters of the 15th international symposium on graph drawing (GD'07), Sydney. http://www.ogdf.net/ogdf.php/ogdf:publications (visited 18 Mar 2010)
Chiricota Y, Jourdan F, Melançon G (2003) Software components capture using graph clustering. In: 11th IEEE international workshop on program comprehension, Portland. IEEE/ACM, pp 217–226
Ellson J, Gansner ER, Koutsofios E, North S, Woodhull G (2002) Graphviz – open source graph drawing tools. In: The 9th international symposium on graph drawing (GD'01), Vienna. Lecture notes in computer science, vol 2265, pp 483–484
Elmqvist N, Fekete JD (2010) Hierarchical aggregation for information visualization: overview, techniques, and design guidelines. IEEE Trans Vis Comput Graph 16:439–454. doi:10.1109/TVCG.2009.84
Fekete JD (2004) The InfoVis toolkit. In: The 10th IEEE symposium on information visualization (InfoVis '04), Austin, pp 167–174. http://ivtk.sourceforge.net/ (visited 18 Mar 2010)
Heer J, Card SK, Landay JA (2005) Prefuse: a toolkit for interactive information visualization. In: CHI '05: proceedings of the SIGCHI conference on human factors in computing systems, Portland, pp 421–430. http://prefuse.org/ (visited 18 Mar 2010)
Iragne F, Nikolski M, Mathieu B, Auber D, Sherman DJ (2005) ProViz: protein interaction visualization and exploration. Bioinformatics 21(2):272–274
Kohonen T (1982) Self-organized formation of topologically correct feature maps. Biol Cybern 43:59–69
Mehlhorn K, Näher S (1995) LEDA: a platform for combinatorial and geometric computing. Commun ACM 38(1):96–102
Moscovich T, Chevalier F, Henry N, Pietriga E, Fekete JD (2009) Topology-aware navigation in large networks. In: SIGCHI conference on human factors in computing systems 2009, Boston, pp 2319–2328
Munzner T (2009) A nested process model for visualization design and validation. IEEE Trans Vis Comput Graph 15(6):921–928
Munzner T, Guimbretière F, Tasiran S, Zhang L, Zhou Y (2003) TreeJuxtaposer: scalable tree comparison using focus+context with guaranteed visibility. Proc SIGGRAPH 2003, ACM Trans Graph 22(3):453–462
Mutzel P, Gutwenger C, Brockenauer R, Fialko S, Klau G, Krüger M, Ziegler T, Näher S, Alberts D, Ambras D, Koch G, Jünger M, Buchheim C, Leipert S (1998) A library of algorithms for graph drawing. In: The 6th international symposium on graph drawing (GD'98), Montréal. Lecture notes in computer science, vol 1547, pp 456–457
Perego UA (2005) The power of DNA: discovering lost and hidden relationships. How DNA analysis techniques are assisting in the great search for our ancestors. In: World library and information congress: 71st IFLA general conference and council, Oslo, pp 1–19
Raitner M (2002) HGV: a library for hierarchies, graphs, and views. In: 10th international symposium on graph drawing, GD 2002, Irvine, pp 236–243
Rozenblat C, Melançon G, Amiel M, Auber D, Discazeaux C, L'Hostis A, Langlois P, Larribe S (2006) Worldwide multi-level networks of cities emerging from air traffic (2000). In: International geographical union IGU 2006 cities of tomorrow, Santiago de Compostela
Schroeder W, Martin K, Lorensen B (2006) The visualization toolkit: an object-oriented approach to 3D graphics, 4th edn. Kitware, Inc., Clifton Park
Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504
Shneiderman B (1996) The eyes have it: a task by data type taxonomy for information visualizations. In: VL '96: proceedings of the 1996 IEEE symposium on visual languages, Boulder, pp 336–344
Tominski C, Abello J, Schumann H (2009) CGV – an interactive graph visualization system. Comput Graph 33(6):660–678
Wylie B, Baumes J (2009) A unified toolkit for information and scientific visualization. Vis Data Anal 7243(1):72430H. http://titan.sandia.gov/ (visited 18 Mar 2010)

Tweet
- Microtext Processing
- Topic Modeling in Online Social Media, User Features, and Social Networks for Tweet

Twitris: A System for Collective Social Intelligence Amit Sheth, Ashutosh Jadhav, Pavan Kapanipathi, Chen Lu, Hemant Purohit, Gary Alan Smith, and Wenbo Wang Ohio Center of Excellence in Knowledge-enabled Computing (Kno.e.sis), Wright State University, Dayton, OH, USA

Synonyms Citizen sensing; Community evolution; Event analysis on social media; Interaction network; People-content-network analysis; Real-time social media analysis; Semantic perception; Semantic Social Web; Sentiment-emotion-intent analysis; Social media analysis; Spatio-temporal-thematic analysis; Web 3.0

Glossary
Citizen Sensing Humans as citizens on the ubiquitous Web, acting as sensors and sharing their observations and views using mobile devices and Web 2.0 services
Citizen-Sensor Network An interconnected network of people who actively observe, report, collect, analyze, and disseminate information via text, links to other resources, and various media including audio, images, and videos
Social Media Analytics The practice of gathering data from social media websites and analyzing that data to gain new insights from social media and to make informed decisions
Semantic Web A group of methods and technologies to help machines and humans understand the meaning – or "semantics" – of data on the World Wide Web
Spatio-Temporal-Thematic (STT) Analysis Social media analytics taking into account what is being said about an event (theme) and where (spatial) and when (temporal) it is being said


People-Content-Network Analysis (PCNA) Social media analytics taking into account the social media user (people), the data shared on social media websites (content), and the network of social media users (network)
Sentiment-Emotion-Intent (SEI) Extraction Analyzing social media content to extract insights about users' sentiment (positive, negative, or neutral), emotion (happy, angry, upset, etc.), and intent (seeking information, sharing information, etc.)

Introduction

Twitris: A System for Collective Social Intelligence, Fig. 1 Twitris – three primary dimensions of analysis

Well over a billion people have become "citizens" of an Internet- or Web-enabled social community. Web 2.0 fostered the open environment and applications for tagging, blogging, wikis, and social networking sites that have made information consumption, production, and sharing incredibly easy. With over five billion mobile connections, over a billion of them data connections (smartphones), and many more people able to communicate via SMS, digital media can be shared with the rest of humanity instantly. As a result, humanity is interconnected as never before. This interconnected network of people who actively observe, report, collect, analyze, and disseminate information via text, audio, or video messages, increasingly through pervasively connected mobile devices, has led to what we term citizen sensing (Sheth 2009a, b). This phenomenon differs from traditional centralized information dissemination and consumption environments, where citizens primarily act as consumers of information reported by several authoritative sources. Citizen sensing is complemented by the growing ability to access, integrate, dissect, and analyze the individual and collective thinking of humanity, giving us a capability recognized as collective intelligence. Citizen sensing involves humans in the loop, and with it all the complexities associated with, and intelligence captured in, human communication. As citizen sensing has gained momentum, it is generating millions of observations and creating significant information overload. In many cases it becomes nearly impossible to make sense of the information around a topic of interest. Given this data deluge, analyzing the numerous social signals can be extremely challenging.

In response to this growing citizen sensing data deluge, Twitris has been developed with the vision of performing semantics-empowered analysis of a broad variety of social media exchanges. Twitris, named by combining Twitter with Tetris, the tile-matching puzzle game, has incorporated increasingly sophisticated analysis of social data and associated metadata, combining it with background knowledge and, more recently (albeit not discussed here), machine data captured from the sensors and devices that make up the Internet of Things (IoT). Twitris' evolution can be characterized in three phases (and corresponding versions of the system). Figure 1 outlines the corresponding dimensions Twitris considers. Twitris is a comprehensive platform for analyzing social content along multiple dimensions, leading to in-depth insights into various aspects of an event or a situation. The central thesis behind this work is that citizen sensor observations are inherently multidimensional in nature, and taking these dimensions into account while processing, aggregating, connecting, and visualizing data will provide useful organization and consumption principles. Twitris evolved in


three phases, characterized by the versions of the system:
• Twitris v1: spatio-temporal-thematic (STT) processing of Twitter and associated news, multimedia, and Wikipedia content (Sheth 2009b; Nagarajan et al. 2009a; Jadhav et al. 2010)
• Twitris v2: people-content-network analysis (PCNA) (Purohit et al. 2011), with use of background knowledge and semantic metadata extraction and querying/exploration
• Twitris v3: sentiment-emotion-intent (SEI) extraction (Chen et al. 2012; Wang et al. 2012; Nagarajan et al. 2009b), along with personalization (Kapanipathi et al. 2011a) and an emerging continuous semantics (Sheth et al. 2010) capability involving streaming (i.e., real-time) semantic processing of social streams using dynamically generated and updated domain models for semantics and context
These phases of Twitris' development are not as cleanly separated as painted above; that is, the issues identified above are not explicitly segregated by version, as Twitris has been in continuous development, with senior students graduating and new students picking up the work. Four talks, including a tutorial, cover many of the issues addressed by Twitris (Sheth 2009a; Nagarajan 2010; Nagarajan et al. 2011; Sheth 2011).

Key Points
Throughout this article we investigate the role and benefits of a semantic approach, especially metadata extraction and enrichment and the contextual application of relevant background knowledge, demonstrated with examples on real-world data using the Twitris system developed at Kno.e.sis. The article focuses on the following key points:
1. Event-specific analysis of citizen sensing, and discussion of the opportunities and challenges in understanding temporal, spatial, and thematic cues.
2. Facets of people-content-network analysis, with a focus on user-community engagement analysis.


3. Real-time social media data analysis and the concept of continuous semantics supported by dynamic model creation.
4. Sentiment and emotion identification from citizen sensing data.
5. Recent advances in developing semantic abstracts, or semantic perception, to convert massive amounts of raw observational data into nuggets of information and insights that can aid human decision making.

Historical Background
The idea for the research leading to Twitris occurred on November 26, 2008. Terrorists struck Mumbai, India, and over the next 3 days proceeded to wreak havoc at nine locations. Each of the nine sub-events of this overall event, separated by time and location (space), had distinct thematic elements or topical content. The importance of Twitter, especially in terms of citizen sensing – the ability of a regular person to use his or her mobile device to share personal observations, thoughts, and beliefs well before traditional news media has a chance to report and to shape opinions – was extensively discussed in the immediate aftermath of this momentous event. This event also gave us a clear case for the need for, and benefits of, analyzing social media content such as tweets and Flickr posts and related news stories along three dimensions: spatial (location of observation), where; temporal (time of observation), when; and thematic (the event in question), what (Battle 2009; Impact Lab 2008; Keralaravind 2008).

Twitris Platform and Three Stages of Its Evolution

Twitris v1: Spatio-Temporal-Thematic (STT) Processing of Twitter and Associated News, Multimedia, and Wikipedia Content
Twitris v1 (Jadhav et al. 2010) was designed with the following three major steps:
1. Data collection: collect user-posted tweets pertaining to an event from Twitter,


Twitris: A System for Collective Social Intelligence, Fig. 2 A snapshot of a spatio-temporal-thematic slice of citizen sensing showing content related to Mumbai terrorism (thematic) related to the Taj hotel (spatial, thematic) during a period of interest (temporal)

associated news, multimedia, and Wikipedia content (Fig. 2).
2. Data analysis: (a) process the obtained tweets to extract strong event descriptors considering spatial, temporal, and thematic event attributes, and (b) process event-related news, multimedia, and Wikipedia content to get event context and gain a better understanding.
3. Visualization: present the extracted summaries on the Twitris v1 user interface.
Twitris v1 performs two-step processing to extract strong event descriptors from tweets. First, it creates spatio-temporal clusters of the tweet corpus surrounding an event, since every event is different and we want to preserve the social perceptions that generated this data; a TFIDF computation is performed to fetch the n-grams from this set. The second step associates spatial, temporal, and thematic bias with these n-grams by enhancing their weights while preserving the contextual relevance of these event descriptors to the event. Further details of the text-processing algorithm are available in Nagarajan et al. (2009a). The Twitris v1 user interface (Fig. 3) facilitates effective browsing of when (temporal/time), where (space/location), and what (thematic/context) slices of the social perceptions behind an event. The objective of the Twitris v1 user interface is to integrate the results of the data analysis (extracted descriptors and surrounding discussions) with emerging visualization paradigms to facilitate sensemaking. To start browsing, users are required to select an event. Once the user chooses a theme, the date is set to the earliest date of recorded observations for the event, and the map is overlaid with markers indicating the spatial locations from which observations were made on that date. We call this the spatio-temporal slice (Figs. 4 and 5). Users can further explore activity in a particular location by clicking on its overlay marker. The event descriptors extracted from observations in this spatio-temporal setting are displayed as an event descriptor cloud; the spatio-temporal-thematic (STT) scores determine the size of each descriptor in the tag cloud. To provide event context and a better understanding of the event, we enhanced Twitris by integrating event-related news, multimedia (images and videos), and Wikipedia articles, leveraging explicit semantic information from DBpedia to identify relevant news and Wikipedia articles.
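The two-step descriptor scoring described above can be sketched as follows. This is a simplified unigram illustration with an assumed boost factor, not the exact algorithm of Nagarajan et al. (2009a):

```python
import math
from collections import Counter

def tfidf(cluster_docs, all_docs):
    """Step 1: TFIDF of unigrams in one spatio-temporal cluster vs. the corpus."""
    tf = Counter(w for doc in cluster_docs for w in doc.split())
    n = len(all_docs)
    df = Counter()
    for doc in all_docs:
        df.update(set(doc.split()))
    return {w: tf[w] * math.log(n / df[w]) for w in tf}

def stt_score(scores, spatial_terms, temporal_terms, boost=1.5):
    """Step 2: enhance weights of descriptors tied to the cluster's
    spatial/temporal context (the boost factor is an assumption)."""
    biased = set(spatial_terms) | set(temporal_terms)
    return {w: s * (boost if w in biased else 1.0) for w, s in scores.items()}

# Toy corpus: "mumbai" occurs everywhere, so its IDF (and TFIDF) collapses,
# while the spatially biased "taj" rises to the top, mirroring Fig. 3.
corpus = ["mumbai attack taj", "mumbai news", "cricket mumbai score"]
cluster = ["mumbai attack taj", "mumbai news"]
scores = stt_score(tfidf(cluster, corpus), spatial_terms={"taj"}, temporal_terms=set())
top = max(scores, key=scores.get)
print(top)  # taj
```

The real system works over n-grams and a learned bias rather than a fixed boost, but the shape of the computation is the same.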


Twitris: A System for Collective Social Intelligence, Fig. 3 STT-biased scoring mechanism of Twitris v1 for relevance and ranking of keyphrases compared to traditional TFIDF-based ranking: "mumbai," ranked highest based on TFIDF, is far less informative than "foreign relations perspective"

Event descriptors sorted by their TFIDF scores: mumbai (1.4553), photographers capture images (1.3998), images of mumbai (1.2792), foreign relations perspective (1.2165), attacks in mumbai (1.1261), photographers capture (1.0986), capture images (1.0986), india prime minister (1.0839), country of india (1.0280), pakistan pres promised (1.0065), mumbai attacks (0.9594), foreign relations (0.9490), rejected evidence (0.8741), evidence provided (0.8741), uk indicating (0.8741), mumbai attacks in (0.7927), rejected evidence provided (0.7916)

Event descriptors sorted by their enhanced spatio-temporal-thematic scores: foreign relations perspective (1.7185), india prime minister (1.5853), country of india (1.5295), pakistan pres promised (1.5080), foreign relations (1.4510), rejected evidence (1.3758), evidence provided (1.3758), uk indicating (1.3758), attacks in mumbai (1.3293), photographers capture images (1.3028), rejected evidence provided (1.2933), mumbai attacks (1.2048), images of mumbai (1.1822), mumbai (1.1083), mumbai attacks in (1.0797), photographers capture (1.0017), capture images (1.0017)

Twitris: A System for Collective Social Intelligence, Fig. 4 Early version of Twitris v1 user interfaces for displaying thematic component (using STT biasing) on right (b) based on spatial and temporal selection on left (a)

Twitris: A System for Collective Social Intelligence, Fig. 5 Twitris v1 user interface with spatio-temporal slice and multimedia widgets


Twitris: A System for Collective Social Intelligence, Fig. 6 Twitris v1 user interface with event descriptor cloud, related tweets, news, and Wikipedia articles for the event "Austin plane attack." Joe Stack, the man responsible for the Austin suicide plane attack on the IRS office, put up his suicide note online about the attack. He was a former bass player for the Billy Eli band. Here Twitris captures STT event descriptors summarizing the important facets

When a user clicks on a particular descriptor, we display tweets containing the event descriptors and the top current news items as well as related Wikipedia articles (Fig. 6).

Twitris v2: People-Content-Network Analysis (PCNA) with Use of Background Knowledge and Semantic Metadata Extraction and Querying/Exploration
The Mumbai terrorism event of 2008 gave the impetus to study events from STT dimensions and to focus on connecting with relevant news content. Social media continues to grow and revolutionize the way users interact with each other and with information. Social network users are not only creators and recipients of information but also critical relays that propagate it. This powerful ability to share has played an important role in events with varied social significance, audience, and duration, such as political movements (e.g., the Jasmine Revolution in Tunisia), brand management and marketing, and, perhaps most visibly, crisis and disaster management (e.g., the Haitian and Japanese earthquakes). The Twitris team started to look at issues such as the role of content nature in high- vs. low-attributed information diffusion (the phenomenon of propagating messages via friendship/follower connections among users of a social network) (Nagarajan et al. 2010b) and user engagement (given a discussion topic on social media, what motivates a user to engage in the discussion for his or her first interaction) (Purohit et al. 2011; Ruan et al. 2012). Consequently, Twitris v2 embarked on a more comprehensive analysis along the three pillars of what makes anything social: who is engaging in the social activity, what is being communicated, and how this communication flows between those engaged in the social activity. The idea is to gain insights into how permanent and transient networks arise



Twitris: A System for Collective Social Intelligence, Fig. 7 Contrast in the community structure of influencers in user interaction networks, centered on two popular events #OccupyChicago and #OccupyLA
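The structural contrast illustrated in Fig. 7 can be quantified with something as simple as graph density over interaction (retweet/mention) edges. The sketch below uses made-up edge lists, not Twitris code or real Occupy data:

```python
def density(nodes, edges):
    """Undirected graph density: 2|E| / (|V| * (|V| - 1))."""
    n = len(nodes)
    return 0.0 if n < 2 else 2 * len(edges) / (n * (n - 1))

# Hypothetical influencer interaction edges (pairs of users who retweet or
# mention each other) for two event-centric communities.
tightly_knit = {"nodes": ["a", "b", "c", "d"],
                "edges": [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("c", "d")]}
sparse = {"nodes": ["p", "q", "r", "s"],
          "edges": [("p", "q")]}

# The denser influencer network is the one more likely to propagate a
# "call for action" effectively.
print(density(**tightly_knit) > density(**sparse))  # True
```

Twitris's actual analysis goes further (influencer roles, community detection), but density already separates the two structures shown in the figure.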


Twitris: A System for Collective Social Intelligence, Fig. 8 Sentiment of the influencers for the target candidate in the interaction network centered on that target: Romney (1st cluster) vs. Ron Paul (2nd cluster)
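The sentiment segmentation of Fig. 8 can be approximated by grouping users according to a lexicon-based polarity score towards the target. The lexicon and tweets below are toy assumptions, not Twitris's actual sentiment model:

```python
# Toy sentiment lexicon (an assumption for illustration only).
POS = {"great", "fantastic", "win"}
NEG = {"bad", "lose", "awful"}

def sentiment(text):
    """Crude lexicon polarity: pos, neg, or objective."""
    words = set(text.lower().split())
    score = len(words & POS) - len(words & NEG)
    return "pos" if score > 0 else "neg" if score < 0 else "objective"

# Hypothetical tweets from influencers about a target candidate.
tweets = {
    "u1": "great debate win",
    "u2": "awful performance",
    "u3": "the debate is tonight",
}

# Cluster users into sentiment segments, as in the figure's legend.
segments = {}
for user, text in tweets.items():
    segments.setdefault(sentiment(text), []).append(user)
print(segments)  # {'pos': ['u1'], 'neg': ['u2'], 'objective': ['u3']}
```

Replacing the lexicon lookup with a trained, target-aware classifier gives the segmentation Twitris actually renders.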

and what and why information flows across such networks. Twitris v2 developed a significant capability to extract more types of metadata, and the infrastructure became more semantic with the use of the Semantic Web standard RDF as well as relevant background knowledge. The latter enabled Twitris v2 to support a deep exploration capability using DBpedia and SPARQL over metadata extracted from the tweets. Twitris v2 research focusing on coordination during disasters also led to integrating Twitris with Ushahidi's SwiftRiver open source platform and to supporting ingestion of SMS messages, which were used for events such as the Pakistan floods in 2010. Let's look at some examples of Twitris v2 capabilities:
• Evolving ad hoc nature of social media communities: Event-centric communities of varied nature (Purohit et al. 2011) often bring together users from different parts of the social network, especially on Twitter, where we keep switching among discussions of interest and may not already be connected to the other participants in those communities. In such ad hoc communities, it is therefore difficult to depend on follower graphs alone to understand the dynamics. Twitris v2 introduced analysis of user interaction networks so that human dynamics in evolving communities can be understood at granular levels: influencer analysis, contextually important people with roles to engage with, community evolution, etc. Twitris v2 built this feature by extending our research on user interaction network analysis in brand-page communities (Purohit et al. 2012a).
• Contrast in the structure of interaction networks: Figure 7 shows the networks of influencers in two topical communities during the Occupy Wall Street (OWS) movement, "OccupyChicago" on the left and "OccupyLA" on the right. Such an analysis provides insights to understand not only the real dynamics of the actors (e.g., what organizations supporters belong to and to whom they are strongly connected) but also the potential of the influencers to drive actions in the communities (tightly connected influencers are likely to drive effective "call for action" propagation). In this figure, the influencer network of OccupyLA is highly connected and self-organized as compared to the sparsely connected


Twitris: A System for Collective Social Intelligence, Fig. 9 Interaction network evolution for topical community surrounding Mitt Romney, US Presidential Election 2012

one for OccupyChicago and is, therefore, likely to reach the masses effectively for any call for action. Even the Facebook page for OccupyLA reflected such activism.
• Slicing and dicing the networks by user features: To glean insights about actionable information in ad hoc communities, we need to understand the participants better. Twitris v2 therefore introduced slicing-and-dicing analysis of the interaction networks by providing user/node-centric features. For example, the professional or organizational affiliation of users provides clues to understand the causes of the dynamics, e.g., who are the people behind the organized network of OccupyLA? Do users from the same type of organization lead to coordinated actions? Similarly, Twitris v2 introduced content-centric analysis, thus realizing the full potential of PCNA. Users are clustered by grouping them into sentiment segments on the target topic, answering questions like: which candidate, Mitt Romney or Ron Paul, is going stronger in the influencer network from a sentiment perspective (Fig. 8), and on what issues?
• Understanding group dynamics by community evolution: Twitris v2 focused on the larger goal of predictive ability for group dynamics, and the people-content-network analysis (PCNA) framework was the key to this untapped potential. Twitris v2 therefore created clusters in the ad hoc communities based on the sentiment of the users for a targeted topic over time and associated events on the timeline for causal analytics. Figure 9 shows an example of community evolution centered around Republican presidential nominee Mitt Romney during March 1–31, 2012. It shows three snapshots taken over 10-day periods; we observed an extremely modularized community at the end of the analysis, which was not the case for the closest competitor, Rick Santorum. As we know, Santorum exited the race on April 9. The analysis of community evolution thus made Twitris v2 capable of understanding the group dynamics of ad hoc communities, not limiting the output to understanding individual users but extending it to group behavior.
Twitris v2 leverages Semantic Web technologies through the use of background knowledge such as DBpedia to provide deeper insights about an event. Background knowledge changes the way you can look at information, as it puts the information in context. This is especially important for tweets, because they are short and therefore individually lack the volume of information that provides an informative context. For example, in Fig. 10, questions such as "Who are the dead people that are mentioned in the context of the OWS movement?" can be answered using background knowledge, whereas simple keyword search cannot put the information in the tweets into context. Further, to answer such questions and generate answers such as Rosa Parks, the system has to have background knowledge about this named entity as a Person and also that she is dead. Going deeper into the



Twitris: A System for Collective Social Intelligence, Fig. 10 Leveraging Semantic Web technologies to provide insights of events
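The kind of question answering illustrated in Fig. 10 amounts to joining entities extracted from tweets with background knowledge. In Twitris this is done with SPARQL over RDF (e.g., DBpedia); the following pure-Python sketch uses a hand-made triple table as a stand-in:

```python
# Toy background knowledge: a stand-in for DBpedia triples,
# keyed by (subject, predicate).
kb = {
    ("Rosa_Parks", "type"): "Person",
    ("Rosa_Parks", "deathDate"): "2005-10-24",
    ("Rosa_Parks", "knownFor"): "Montgomery Bus Boycott",
    ("Zuccotti_Park", "type"): "Place",
}

# Entities spotted in OWS tweets (hypothetical extraction output).
tweet_entities = ["Rosa_Parks", "Zuccotti_Park"]

# "Who are the dead people mentioned in the context of OWS?"
# Keyword search over the tweets alone cannot answer this; the join can.
dead_people = [e for e in tweet_entities
               if kb.get((e, "type")) == "Person" and (e, "deathDate") in kb]
print(dead_people)  # ['Rosa_Parks']
```

Against DBpedia proper, the same join is a SPARQL query filtering the extracted entities by `rdf:type` and the presence of a death-date property.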

background knowledge provides information that Rosa Parks was famous for the Montgomery Bus Boycott during the US civil rights movement in 1955–1956. Twitris v3: Emotion-Sentiment-Intent, Real-Time View and Other Advancements Behind every (well, most of the important) tweet, there is a human. And a human is complex. Through a tweet, a person expresses emotion, sentiment, and intent. Understanding this dimension is a key to unlock the true potential of social media. This is especially true for monetization of social media. Understanding an underlying intent can tell us if a user is expressing a transactional (potentially for buying a product) intent, seeking information, or just sharing information (Nagarajan et al. 2009b). Sentiment is perhaps the most sought after type of analysis of social data. Currently, it is the primary basis of social media

analysis to predict whether a product or a movie will succeed, who is more likely to win an election, or to identify consumer interest and hence use it for targeted advertising. Analysis of emotion is likely the dark horse of the three: while techniques for its analysis are not yet as mature as sentiment analysis, it is likely to be combined with the other two to give far more signal than without it. A key innovation in sentiment analysis, employed in Twitris v3, is topic-specific sentiment analysis, which associates sentiment with an entity (Chen et al. 2012). This enables us to identify two different sentiments associated with different entities in a single tweet. For example, in the tweet "The King's Speech was bloody brilliant. Colin Firth and Geoffrey Rush were fantastic!" we can identify both the sentiment (i.e., bloody brilliant) associated with the movie "The King's Speech" and the sentiment (fantastic) associated


Twitris: A System for Collective Social Intelligence, Fig. 11 Twitter users show opposite sentiments towards the two candidates on the same topic "final debate" in the 2012 presidential election

with the actors Colin Firth and Geoffrey Rush. More recently, we are associating sentiments with events: when there is a significant change in sentiment, we attempt to associate it with real-world events. For example, by tracking both the event- and entity-specific sentiments, Twitris v3 is able to capture a substantial increase of positive sentiment towards President Obama on the immigration issue on June 15, 2012 (the day on which President Obama outlined a new immigration policy), and associate it with event descriptors such as "dream act," "obama's immigration move," and "new

immigration policy." Figure 11 shows that Twitter users have opposite sentiments towards the two candidates: Obama (green/positive) and Romney (red/negative) on the same topic, "final debate." The reason is that Obama received more positive feedback from Twitter users than Romney did, which is in line with the impression from the news media. This example demonstrates Twitris's power in identifying topic-specific sentiments. Compared with sentiment, emotion is more implicit. Consider, for example, "I will have a calculus test in two hours, but I'm not prepared at all." We


Twitris: A System for Collective Social Intelligence, Fig. 12 Peak patterns of the emotion joy due to the excitement of Twitter users caused by three debates and one TV program (the Daily Show) in the 2012 presidential election

can infer that the person is nervous about the test, though there are no explicit emotion words such as "nervous" or "panic." Given this implicitness, it is very difficult and time consuming to label sentences with emotions. In Twitris v3, we are able to automatically create a large emotion-labeled dataset (of about 2.5 million tweets) covering joy, sadness, anger, love, fear, thankfulness, and surprise by harnessing emotion-related hashtags available in the tweets (Wang et al. 2012). Machine learning classifiers are trained on this large dataset to learn to identify people's emotions behind their tweets. As another key innovation, Twitris v3 can analyze people's emotional responses to different events. For example, Fig. 12 shows the volume of joyful tweets reaching peaks on Oct. 3, 2012 (first debate), Oct. 16, 2012 (second debate), Oct. 22, 2012 (third debate), and Oct. 19, 2012 (Obama's appearance on the Daily Show). The reason is that Twitter users were very enthusiastic about all three presidential debates and Obama's appearance on the Daily Show TV program. Beyond analyzing emotions in tweets, Twitris v3 is also able to identify emotions in blogs, news headlines, etc., because we adapt the classifiers trained on Twitter data to other domains using a relatively small amount of labeled emotion data in those domains.
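The hashtag-based distant labeling used to build such an emotion dataset can be sketched as follows. This is a simplified illustration, not the Twitris v3 pipeline: the hashtag-to-emotion table and the trailing-hashtag filtering heuristic are assumptions made for the example.

```python
# Distant supervision via emotion hashtags: tweets whose authors self-label
# with a hashtag such as #angry become training data without manual annotation.
EMOTION_TAGS = {  # illustrative subset; a real lexicon is much larger
    "#joy": "joy", "#happy": "joy",
    "#sad": "sadness", "#angry": "anger",
    "#love": "love", "#scared": "fear",
    "#thankful": "thankfulness", "#surprised": "surprise",
}

def label_by_hashtag(tweet):
    """Return (cleaned_text, emotion) if the tweet ends with exactly one known
    emotion hashtag, else None. Restricting to a single trailing tag reduces
    noise: a trailing tag usually describes the whole message."""
    tokens = tweet.strip().split()
    if not tokens or tokens[-1].lower() not in EMOTION_TAGS:
        return None
    if any(t.lower() in EMOTION_TAGS for t in tokens[:-1]):
        return None  # ambiguous: further emotion tags appear in the body
    return " ".join(tokens[:-1]), EMOTION_TAGS[tokens[-1].lower()]

corpus = [
    "Passed my driving test on the first try #thankful",
    "Stuck in traffic for two hours #angry",
    "Nice weather today",  # no emotion tag -> stays unlabeled
]
labeled = [r for r in (label_by_hashtag(t) for t in corpus) if r]
```

The resulting (text, label) pairs can then feed any supervised classifier; adapting such a classifier to blogs or news headlines would additionally require a small labeled sample from the target domain, as noted above.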

Besides detecting users' emotional states, we also explore how to automatically identify users' intents from posts so that monetization can be better targeted to users' needs (Nagarajan et al. 2009b). The highlight of our study is that we discover and differentiate three types of posts: (a) transactional posts, e.g., "I am looking for a 32 GB iTouch"; (b) information sharing posts, e.g., "I like my new 32 GB iTouch"; and (c) information seeking posts, e.g., "What do you think about the 32 GB iTouch?" For monetization purposes, transactional and information seeking posts are more valuable than information sharing posts because their authors are looking for information that advertisers can exploit. By extracting intent/keywords/cues from transactional and information seeking posts, our system achieved an accuracy of 52 % on ad impressions using MySpace and Facebook data, while the baseline, without our system, achieved only 30 %. All the above-mentioned precious assets of content (sentiment, emotion, and intent) exist due to an actionable purpose of humans. When such individual-level purposes start to bring higher engagement in groups, they become sources of group-level actions, leading to the evolution of human dynamics in the social network. Therefore, we are exploring the integral


Twitris: A System for Collective Social Intelligence, Fig. 13 Some of the capabilities of Twitris v3: (1) show popular topics, also called social signals (weighted n-grams), related to the chosen event for today and any day since the event began to be tracked; (2) search among the event-related tweets with autocomplete, popular event hashtags, and active users, and explore content for deep analysis (e.g., who are the dead people mentioned most often in the "occupy wall street" event) using background knowledge (default source is Wikipedia/DBpedia) and Semantic Web technologies (RDF/SPARQL); (3) show key topics of discussion by locations/regions, states, and country (e.g., see the differences in social signals from Mississippi (a "red state") vs. Massachusetts (a "blue state") related to President Obama's Nobel Prize); (4) see event-relevant


tweets in real time on a world map or any region; (5) analyze topic-/people-/region-specific sentiment (e.g., for the US election, sentiment on candidates by state and by election-specific topics); (6) see the networks with insights from static (e.g., followers) and dynamic features (e.g., retweets) and people/demographics (e.g., with knowledge of the profession of each person); (7) display tweets, recent news, and Wikipedia pages related to selected events and social signals; (8) show event-specific multimedia (images and video); (9) see tweet traffic; (10) change the date of video/analysis; (11) select a location of interest (each pin shows a collection of social signals emanating from a location); and (13) select an event of interest (e.g., US Election, Occupy Wall Street, Japanese Tsunami)

role of intent with sentiment and emotions for purposeful actions in groups. Specifically, we are focusing on the intent and sentiments behind group coordination because coordinated activity has the potential to make or break the system (Fig. 13).

Advanced Research
Detailed research on social data analysis encompasses social intelligence in real time (Gruhl et al. 2010), which involved a Kno.e.sis-IBM

collaboration leading to the operationally deployed BBC Sound Index system, prediction of topic volume on Twitter (Ruan et al. 2012), emotion identification using Twitter “big data” (Wang et al. 2012), brand tracking (Purohit et al. 2012a), psycholinguistic analysis during emerging coordination (Purohit et al. 2012b), privacy-aware content dissemination (Kapanipathi et al. 2011b), user-community engagement (Purohit et al. 2011), information diffusion (Nagarajan et al. 2010b), trust in social


media (Thirunarayan and Anantharam 2011), and monetization of social activities (Nagarajan et al. 2009b) reported in over 30 publications and summarized in a comprehensive tutorial (Nagarajan et al. 2011).

Key Applications
Twitris has been used in a research context for studying and analyzing social sensing and perception of a broad variety of events: politics and elections, social movements and uprisings, crises and disasters, entertainment, environment, etc. We are now investigating more commercial applications, including brand tracking, advertisement campaign effectiveness, and empowering professional users.

Future Directions
The next evolution of Twitris will be in incorporating social media content along with data from sensors and the Web of Things, as well as in advanced applications for crisis and disaster support.

Acknowledgments
Alumni Karthik Gomadam, Meena Nagarajan, and Ajith Ranabahu have made substantial contributions to Twitris. Twitris continues to benefit from current collaborators, including Prof. Krishnaprasad Thirunarayan and Prof. Valerie Shalin, and team members Pramod Anantharam and Shreyansh Bhatt. This work is partially supported by the NSF-funded grants "SoCS: Social Media Enhanced Organizational Sensemaking in Emergency Response" (IIS-1111182) and "EAGER: Expressive Scalable Queries over Linked Open Data" (IIS-1143717).

Cross-References
▶ Collective Intelligence, Overview
▶ Dynamic Community Detection
▶ Futures of Social Networks: Where Are Trends Heading?
▶ Semantic Social Networks Analysis
▶ Spatiotemporal Footprints in Social Networks
▶ User Sentiment and Opinion Analysis

References
Battle C (2009) New media's moment in Mumbai. Foreign Policy J, 15 Jan 2009. http://www.foreignpolicyjournal.com/2009/01/15/new-media. Accessed 31 Mar 2013
Chen L, Wang W, Nagarajan M, Wang S, Sheth A (2012) Extracting diverse sentiment expressions with target-dependent polarity from Twitter. In: Proceedings of the 6th international AAAI conference on weblogs and social media (ICWSM), Dublin, 5–7 June 2012
Gruhl D, Nagarajan M, Pieper J, Robson C, Sheth A (2010) Multimodal social intelligence in a real-time dashboard system. VLDB J 19(6):825–848 (special issue on data management and mining for social networks and social media)
Impact Lab (2008) Twitter provided a vital link in Mumbai terrorist attacks, 28 Nov 2008. http://www.impactlab.net/2008/11/28/twitter-provided-a-vital-link-in-mumbai-terrorist-attacks/. Accessed 31 Mar 2013
Jadhav A, Purohit H, Kapanipathi P, Ananthram P, Ranabahu A, Nguyen V, Mendes P, Smith AG, Cooney M, Sheth A (2010) Twitris 2.0: semantically empowered system for understanding perceptions from social data. In: Semantic web application challenge at ISWC, Shanghai, 7–11 Nov 2010
Kapanipathi P, Orlandi F, Sheth A, Passant A (2011a) Personalized filtering of the Twitter stream. In: 2nd workshop on semantic personalized information management at ISWC 2011, Koblenz, 23–27 Oct 2011
Kapanipathi P, Anaya J, Sheth A, Slatkin B, Passant A (2011b) Privacy-aware and scalable content dissemination in distributed social networks. In: International semantic web conference (ISWC), Koblenz, 23–27 Oct 2011
Keralaravind (2008) 'Hash Mumbai', timeline of citizen journalism & social media during Mumbai terrorist attacks, Youtube.com, uploaded on 28 Nov 2008. http://www.youtube.com/watch?v=copw-W-IfvY. Accessed 31 Mar 2013
Nagarajan M (2010) Understanding user-generated content on social media. Ph.D. dissertation, Wright State University
Nagarajan M, Gomadam K, Sheth A, Ranabahu A, Mutharaju R, Jadhav A (2009a) Spatio-temporal-thematic analysis of citizen-sensor data – challenges and experiences. In: Tenth international conference on web information systems engineering, Poznan, 5–7 Oct 2009
Nagarajan M, Baid K, Sheth A, Wang S (2009b) Monetizing user activity on social networks – challenges and experiences. In: IEEE/WIC/ACM international conference on web intelligence, Milan, 15–18 Sept 2009
Nagarajan M, Purohit H, Sheth A (2010) A qualitative examination of topical tweet and retweet practices. In: 4th international AAAI conference on weblogs and social media (ICWSM), Washington, DC, 23–26 May 2010, pp 295–298
Nagarajan M, Sheth A, Velmurugan S (2011) Citizen sensor data mining, social media analytics and development centric web applications. In: Proceedings of the WWW 2011, Hyderabad, 28 Mar–1 Apr 2011
Purohit H, Ruan Y, Joshi A, Parthasarathy S, Sheth A (2011) Understanding user-community engagement by multi-faceted features: a case study on Twitter. In: SoME 2011 (workshop on social media engagement, in conjunction with WWW 2011), Hyderabad, 28 Mar–1 Apr 2011
Purohit H, Ajmera J, Joshi S, Verma A, Sheth A (2012a) Finding influential authors in brand-page communities. In: 6th international AAAI conference on weblogs and social media (ICWSM), Dublin, 5–7 June 2012
Purohit H, Hampton A, Shalin V, Sheth A, Flach J (2012b) What kind of #communication is Twitter? A psycholinguistic perspective on communication in Twitter for the purpose of emergency coordination. In: NSF SoCS symposium, Evanston, 2012
Ruan Y, Purohit H, Fuhry D, Parthasarathy S, Sheth A (2012) Prediction of topic volume on Twitter. In: 4th international ACM conference on web science (WebSci), Evanston, 22–24 June 2012
Sheth A (2009a) Semantic integration of citizen sensor data and multilevel sensing: a comprehensive path towards event monitoring and situational awareness. In: From E-Gov to connected governance: the role of cloud computing, Web 2.0 and Web 3.0 semantic technologies, Falls Church, 17 Feb 2009
Sheth A (2009b) Citizen sensing, social signals, and enriching human experience. IEEE Internet Comput, July/Aug 2009, pp 80–85
Sheth A (2011) Citizen sensing – opportunities and challenges in mining social signals and perceptions. In: Invited talk at Microsoft Research faculty summit 2011, Redmond, 19 July 2011
Sheth A, Thomas C, Mehra P (2010) Continuous semantics to analyze real-time data. IEEE Internet Comput 14(6):84–89
Thirunarayan K, Anantharam P (2011) Trust networks: interpersonal, sensor, and social. In: Proceedings of the 2011 international conference on collaborative technologies and systems (CTS 2011), Philadelphia, 23–27 May 2011
Wang W, Chen L, Thirunarayan K, Sheth A (2012) Harnessing Twitter 'big data' for automatic emotion identification. In: Proceedings of the international conference on social computing (SocialCom 2012), Amsterdam, 3–5 Sept 2012


Recommended Reading
Sheth A, Thirunarayan K (2012) Semantics empowered Web 3.0: managing enterprise, social, sensor, and cloud-based data and services for advanced applications. Morgan & Claypool. ISBN 1608457168

Twitter
▶ Social Networking in Political Campaigns

Twitter Microblog Sentiment Analysis Guangxia Li, Kuiyu Chang, and Steven C. H. Hoi School of Computer Engineering, Nanyang Technological University, Singapore, Singapore

Synonyms
Microblog sentiment analysis; Twitter opinion mining

Glossary
Sentiment Analysis The automatic analysis of opinions, sentiments, and subjectivity in text. It aims to determine the sentiment associated with a topic or context
Online Learning Online learning algorithms update the learning model incrementally whenever they receive new data. They are usually highly efficient and scalable
Multitask Learning The problem of jointly solving several related machine learning tasks by leveraging the commonality among tasks

Definition
Twitter microblog sentiment analysis aims to identify and detect the sentiments or emotions


present in a microblog post. The techniques developed for microblog sentiment analysis can also be applied to classify social media data in a real-time manner.

Introduction
Microblogs, such as Twitter (http://www.twitter.com) and Facebook status updates (http://www.facebook.com), allow users to publish short snippets of text online. Compared to blogs, microblogs are typically shorter in length but updated much more frequently. Twitter, one of the most popular microblogging platforms, allows users to publish short messages known as tweets (with a limit of 140 characters). The published messages are immediately announced to the user's subscribers, who are known as followers. According to Java et al. (2007), the vast majority of posts on Twitter fall into one or more of the following types: chatter, conversations, information sharing, and news reporting. Due to its instantaneous nature, Twitter and other microblogging platforms have been widely employed for political and marketing campaigns, tracking of emergencies, opinion surveys, live news reporting, etc. A typical microblogger has tens if not hundreds of friends in his/her network, which translates to hundreds of microblog updates every day. As more people join, this number will only grow. Therefore, one critical problem faced by microbloggers is the overwhelming amount of updates received. It is therefore imperative to pose the following question: can microblogs be filtered according to each user's information need? In this chapter, we study the microblog sentiment detection problem, whose goal is to identify whether a user's microblog post contains emotions or sentiments. Automatically and accurately identifying emotional and nonemotional microblog posts has significant real-life implications. For example, a user may opt to filter emotional posts from acquaintances and instead be alerted to every emotional post from his/her loved ones or close friends. Organizations could also monitor


microblogs of employees and customers to gather feedback about their products and services. Microblog sentiment detection is, however, quite challenging on several fronts. One difficulty is that microblog posts are often very short, making them hard to categorize (even by human beings). Another concern is that microblogging services emphasize timely updates, which poses a strict requirement for a highly efficient and scalable solution. Finally, microblogging styles among different users can be very diverse, which increases the difficulty of the classification task, since traditional classification approaches based on a single global classifier trained from a pool of user data may fail to capture the peculiarities of individual users, which leads to poor accuracy. To overcome these challenges, a collaborative online learning framework can be used. The basic idea is to first build a generic global classification model from a large amount of user data and then leverage this global model to refine the personalized models of individual users.
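The online learning cycle underlying such a framework (receive an instance, predict, observe the true label, update) can be illustrated with a minimal single-model sketch. The toy feature vectors are made up for the example; the update shown is a hinge-loss, passive-aggressive-style rule, while the coupling of a global and per-user model is described later in this entry.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def hinge_loss(w, x, y):
    # Hinge loss of linear model w on instance (x, y), with y in {+1, -1}.
    return max(0.0, 1.0 - y * dot(w, x))

def pa_update(w, x, y, C=1.0):
    """One passive-aggressive round: leave w unchanged when the margin is
    already satisfied, otherwise move just enough to correct the mistake."""
    loss = hinge_loss(w, x, y)
    if loss == 0.0:
        return w
    tau = min(C, loss / dot(x, x))  # capped step size
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

# A stream of (features, label) pairs processed one by one.
stream = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 1.0], 1)]
w = [0.0, 0.0]
for x, y in stream:
    prediction = 1 if dot(w, x) >= 0 else -1  # predict first ...
    w = pa_update(w, x, y)                    # ... then learn from the label
```

Because each round touches only one instance, the learner never needs the full dataset in memory, which is what makes this family of algorithms suitable for high-volume microblog streams.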

Background
Sentiment analysis, or opinion mining, analyzes a given piece of text to determine whether it contains subjective information. In the presence of subjective content, it further determines the sentiment polarity (positive or negative opinion) of the opinionated text with respect to a topic or context. Given the enormous amount of data published in online blogs and social networks, sentiment analysis has recently become a field of interest for many areas of research, ranging from computer science to marketing. A broad overview of existing techniques and approaches for sentiment analysis can be found in Pang and Lee (2007). Current research in sentiment analysis typically treats the task as a two-stage process. The first stage distinguishes subjective from objective instances. The second stage determines the sentiment polarity of opinionated texts recognized in the first stage. Identifying subjective instances has often proved to be more difficult than the subsequent polarity classification (Pang and Lee 2007). An alternative way


is to perform subjectivity detection and sentiment polarity identification at the same time, yielding a result that classifies the given piece of text as objective (lack of opinion), positive (expressing a positive opinion), or negative (expressing a negative opinion). Whatever the approach, sentiment analysis can be formulated as a classification problem where the task is to assign text instances to one of several predefined categories. Text classification using machine learning techniques is a well-studied field and has been widely used to solve sentiment analysis problems. It has been shown that machine learning techniques achieve better results in classifying movie review sentiments than methods based on handpicked rules (Pang et al. 2002). To solve classification problems with machine learning techniques, a training set consisting of samples whose sentiment class labels are known must be provided. Each sample in the training set can be a microblog post, whose class label can be positive, negative, or neutral. The training set, comprising samples of attributes and associated class labels, is used to build a classification model, which is subsequently applied to records with unknown class labels. A learning algorithm is employed to build a classification model that estimates the relationship between the attributes and the class label of the training data. The model generated by a learning algorithm should both fit the input data well and correctly predict the class labels of records it has never seen before. Examples of popular learning algorithms include decision trees, neural networks, naïve Bayes classifiers, and support vector machines (SVM). A human-labeled set of training data is typically necessary for training a classifier. Generally, the accuracy of the classifier increases with an increasing amount of labeled data. With the Twitter API (https://dev.twitter.com), it is easy to collect millions of tweets.
However, manually labeling tweets as positive or negative is a laborious and expensive endeavor. Researchers have investigated various ways to automatically generate labeled training data for Twitter microblog sentiment analysis. For example, several researchers rely on emoticons to automatically

2255

T

label training data (Bifet and Frank 2010; Pak and Paroubek 2010). An emoticon is a pictorial representation of a facial expression using the characters available on a standard keyboard, usually written to express a person's feeling or mood. Emoticons in tweets are good indicators of a writer's emotional state and thus can be used to annotate training data accurately. Pak and Paroubek (2010) queried Twitter for two types of emoticons: happy emoticons and sad emoticons. The collected tweets were used to train a classifier to recognize positive and negative sentiments. In order to collect a corpus of objective posts, the authors retrieved tweets from the Twitter accounts of popular newspapers and magazines. Twitter uses hashtags, words or phrases prefixed with a "#" sign, to denote subjects or categories. Some of these tags are sentiment tags, which assign one or more sentiment values to a tweet. Davidov et al. (2010b) made use of the #sarcasm tag to automatically collect sarcastic tweets. Others (Davidov et al. 2010a; Kouloumpis et al. 2011) manually selected a set of frequent hashtags that are indicative of positive, negative, and neutral tweets. These hashtags were used to create the required training data. Apart from training set size, the accuracy of the learned classifier also depends strongly on how the input sample is represented. Typically, the input sample is transformed into a feature vector, which contains a number of features that are descriptive of the sample. In classical data mining, this is known as the feature selection problem. A variety of features have been studied for Twitter sentiment classification, including unigrams, bigrams, part-of-speech (POS) tags, and punctuation-based features. A natural choice is to use word-based n-gram features. Each word appearing in a sentence, or each consecutive sequence of n words, serves as a binary feature.
Weighting techniques have been applied to assign higher weights to rare words (Davidov et al. 2010a). Barbosa and Feng (2010) mapped each word in a tweet to its POS using a POS dictionary. Kouloumpis et al. (2011) treated the numbers of verbs, adverbs, adjectives, nouns, etc. in each tweet as POS features. The intuition is that certain POS tags


are good indicators of sentiment. Generic features such as sentence length, the number of special punctuation marks, and the presence of emoticons and abbreviations have also been used (Davidov et al. 2010a; Kouloumpis et al. 2011). Lexicon features have also been investigated: they map a word to its prior subjectivity (weak or strong) and polarity (positive, negative, or neutral) according to a lexicon (Barbosa and Feng 2010; Kouloumpis et al. 2011).
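A minimal feature extractor combining binary unigrams with the generic surface features just mentioned might look like the sketch below. The emoticon set and feature names are illustrative choices; POS and lexicon features are omitted because they require a tagger and a subjectivity lexicon.

```python
import re

EMOTICONS = {":)", ":-)", ":(", ":-(", ":D", ";)"}  # small illustrative set

def tweet_features(tweet):
    """Map a tweet to a sparse feature dict: binary unigram features plus
    generic surface features (length, punctuation, emoticons, hashtags)."""
    tokens = tweet.lower().split()
    feats = {"w=" + t: 1 for t in tokens if t.isalpha()}  # binary unigrams
    feats["len"] = len(tokens)
    feats["n_exclaim"] = tweet.count("!")
    feats["n_question"] = tweet.count("?")
    feats["has_emoticon"] = int(any(e in tweet for e in EMOTICONS))
    feats["n_hashtags"] = len(re.findall(r"#\w+", tweet))
    return feats

f = tweet_features("Loving the new phone !!! :) #gadgets")
```

Such sparse dictionaries map directly onto the d-dimensional vectors used by the learning algorithms in the next section, e.g., via a vocabulary index or feature hashing.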


Twitter Microblog Sentiment Detection by Collaborative Online Learning

Twitter Microblog Sentiment Analysis, Fig. 1 Collaborative online learning

Twitter messages have many unique attributes that are unfriendly to automatic sentiment analysis. One difficulty is that microblog posts are often very short: the maximum length of a tweet is 140 characters. Misspellings, slang, and colloquialisms are also prevalent in tweets. These make Twitter messages hard to categorize using natural language-based tools. Another concern is that microblogging services emphasize timely updates, which poses a strict requirement for a highly efficient and scalable classifier. Finally, microblogging styles among different users can be very diverse, which further exacerbates the difficulty of the task. As such, classical classification approaches based on a single global classifier trained from a collection of user data may fail to capture the peculiarities of different users and thus work poorly. To overcome these challenges, a collaborative online learning framework (Li et al. 2010, 2011) can be employed. Online learning algorithms work on a sequence of data by processing instances one by one. On each round, the learner receives an input, makes a prediction using an internal hypothesis kept in memory, and then sees the true label. It uses the newly exposed example to modify its hypothesis according to some predefined rules. The goal is to minimize the overall number of rounds with incorrect predictions. Online learning algorithms are known for their high efficiency and scalability and are especially suitable for problems whose data are

generated in a sequential manner, and for processing continuous and real-time data that arrive in huge amounts. Figure 1 illustrates the basic idea of collaborative online learning. Specifically, the collaborative online learning algorithm operates in a sequential manner. At each learning round, it collects the current global set of data, one instance from each of the engaged microblog users, which is employed to update the global classification model. At the same time, a collaborative model is maintained for each user. Each individual collaborative model is subsequently updated using the latest individual microblog post and the global model parameters. Therefore, the solution can make use of global common knowledge and adapt to individual nuances as well. The goal of microblog sentiment detection is to automatically classify each user's post into two categories: emotional or nonemotional. For simplicity, it is assumed that the training data from all users can be represented in a global feature space and that the sequences of training data among all users are ordered according to time step t. Let (x_t, y_t) denote a training instance at round t, where x_t ∈ ℝ^d is a d-dimensional vector representing the microblog post, and y_t ∈ {+1, −1} indicates the presence/absence of emotions: emotional ("+1") or nonemotional ("−1"). Further, we denote by D_i = {(x_t, y_t) : t = 1, …, T_i} the collection of training data for the i-th user and by D = {(x_t, y_t) : t = 1, …, T} the entire collection of training


data from a group of K users, where T = Σ_i T_i denotes the total number of posts across all users. The goal of the algorithm is to learn a set of K classification models, i.e., f^(i)(·): ℝ^d → {+1, −1}, i = 1, …, K.

Learning the Global Model
The first step is to build a global classification model, i.e., f(·): ℝ^d → {+1, −1}. The online passive-aggressive (PA) learning framework (Crammer et al. 2006) is used to build a linear global classification model using tweets published by all users at round t, i.e.,

f_t(x) = u_t · x    (1)

where u_t ∈ ℝ^d is the weight vector of the global model learned at round t. Specifically, at learning round t, the algorithm uses the latest training instance (x_t, y_t) to update the classification model as follows:

u_{t+1} = argmin_{u ∈ ℝ^d} (1/2)‖u − u_t‖² + C ξ    (2a)
s.t. ℓ(u; (x_t, y_t)) ≤ ξ    (2b)
     ξ ≥ 0    (2c)

where C is a positive parameter controlling the influence of the slack term ξ on the objective function, and ℓ is the hinge loss function defined as

ℓ(u; (x_t, y_t)) = max(0, 1 − y_t u · x_t)    (3)

The closed-form solution to the optimization problem (Eq. 2) can be derived as

u_{t+1} = u_t + τ_t y_t x_t    (4)

where τ_t is given by τ_t = min{C, ℓ_t / ‖x_t‖²} (Crammer et al. 2006).

Learning the Collaborative Models
The key idea of collaborative online learning is to leverage the knowledge present in the global model f(x) = u · x to influence the individual collaborative models f^(i), i = 1, …, K. Using the same PA formulation (Crammer et al. 2006), the goal is to learn a linear classification model for the i-th user as follows:

f_t^(i)(x) = w_t^(i) · x    (5)

where w_t^(i) ∈ ℝ^d is the weight vector of the collaborative model learned at round t for the i-th user. For simplicity, w_t will be used to denote w_t^(i) henceforth. The collaborative model is formulated as a convex optimization problem that minimizes the deviation of the new weight from both the prior collaborative weight and the global weight, as follows:

w_{t+1} = argmin_{w ∈ ℝ^d} (λ/2)‖w − w_t‖² + ((1 − λ)/2)‖w − u_t‖² + C ξ    (6a)
s.t. ℓ(w; (x_t, y_t)) ≤ ξ    (6b)
     ξ ≥ 0    (6c)

where 0 ≤ λ ≤ 1 is a parameter that adjusts the relative influence of the global model u_t and the collaborative model w_t, and the parameter C ≥ 0 determines the amount of slack ξ tolerated. The above formulation addresses the concern about training classifiers for a group of users: each user's personality is maintained while incorporating the global model of all members. Clearly, if λ = 1, the approach reduces to learning an individual model without engaging the global one; if λ = 0, it reduces to the global model. By setting 0 < λ < 1, the contribution of each model can even be fine-tuned on an individual basis. The optimal update w_{t+1} for the above optimization problem (Eq. 6) is

w_{t+1} = λ w_t + (1 − λ) u_t + τ y_t x_t    (7)

where τ is given by

τ = max{0, min{C, (1 − y_t (λ w_t + (1 − λ) u_t) · x_t) / ‖x_t‖²}}    (8)

It is not appropriate to use a constant λ during the entire online learning process, since the optimal trade-off between the two models is unknown without an oracle. The value of λ is therefore adjusted dynamically during the online learning process based on the cumulative errors of both models, i.e., the total number of mistakes a model has made during previous learning rounds. The value of λ is decreased if the global model gives fewer cumulative errors, and increased otherwise, as shown for the i-th user below:

λ_{t+1}^(i) = max{δ λ_t^(i), λ_min}    if e(f_t) < e(f_t^(i))
λ_{t+1}^(i) = min{λ_t^(i) / δ, λ_max}  otherwise    (9)

where e(f_t) and e(f_t^(i)) refer to the cumulative errors of the global and collaborative models, respectively, and δ is close to one. In practice, δ can be set to 0.95, λ_min to 0.01, and λ_max to 1. Algorithm 1 summarizes the collaborative online learning algorithm.

Algorithm 1: Collaborative online learning
Input: C > 0; initialize u_1 = 0 and w_1^(k) = 0 for k = 1, …, K
for t = 1 to T do
  // Update the collaborative models
  for k = 1 to K do
    adjust the parameter λ (Eq. 9);
    receive the weight vector of the global model u_t;
    receive the training instance (x_t^(k), y_t^(k));
    set the loss ℓ(w^(k); (x_t^(k), y_t^(k))) = max(0, 1 − y_t^(k) w^(k) · x_t^(k));
    if ℓ(w^(k); (x_t^(k), y_t^(k))) = 0 then
      τ = 0
    else
      τ = min{C, (1 − y_t^(k) (λ w_t^(k) + (1 − λ) u_t) · x_t^(k)) / ‖x_t^(k)‖²}
    end
    update w_{t+1}^(k) = λ w_t^(k) + (1 − λ) u_t + τ y_t^(k) x_t^(k);
  end
  // Update the global model
  for k = 1 to K do
    receive the training instance (x_t^(k), y_t^(k));
    set the loss ℓ(u_t; (x_t^(k), y_t^(k))) = max(0, 1 − y_t^(k) u_t · x_t^(k));
    if ℓ(u_t; (x_t^(k), y_t^(k))) > 0 then
      τ = min{C, (1 − y_t^(k) u_t · x_t^(k)) / ‖x_t^(k)‖²};
      update u_{t+1} = u_t + τ y_t^(k) x_t^(k);
    end
  end
end

end end

benchmarks (Crammer et al. 2006) that solve Twitter sentiment classification tasks individually. More details can be found in Li et al. (2010).
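Algorithm 1, together with the dynamic α rule of Eq. 9, can be sketched as follows. The error bookkeeping is our reading of the description, not a verbatim transcription of the original implementation: the global model's mistakes are counted separately on each user's stream so that e(f_t) and e(f_t^{(i)}) are directly comparable.

```python
import numpy as np

def collaborative_online_learning(streams, d, C=1.0, rho=0.95,
                                  alpha_min=0.01, alpha_max=1.0):
    """Sketch of Algorithm 1 with the dynamic alpha of Eq. 9.

    streams: list of K per-user sequences of (x, y) pairs, all of length T,
    with labels y in {-1, +1}. Returns the global model u and the per-user
    collaborative models w.
    """
    K, T = len(streams), len(streams[0])
    u = np.zeros(d)                       # global model u_t
    w = [np.zeros(d) for _ in range(K)]   # collaborative models w_t^k
    alpha = [0.5] * K                     # per-user trade-off parameters
    err_u = [0] * K                       # mistakes of u on user k's stream
    err_w = [0] * K                       # mistakes of w[k] on its stream

    for t in range(T):
        # --- Update the collaborative models ---
        for k in range(K):
            x, y = streams[k][t]
            # Adjust alpha (Eq. 9): lean toward the model with fewer errors.
            if err_u[k] <= err_w[k]:
                alpha[k] = max(alpha_min, rho * alpha[k])
            else:
                alpha[k] = min(alpha_max, alpha[k] / rho)
            if y * w[k].dot(x) <= 0:      # cumulative error e(f_t^(k))
                err_w[k] += 1
            blend = alpha[k] * w[k] + (1.0 - alpha[k]) * u
            loss = max(0.0, 1.0 - y * blend.dot(x))
            tau = 0.0 if loss == 0.0 else min(C, loss / x.dot(x))
            w[k] = blend + tau * y * x    # Eq. 7
        # --- Update the global model (a standard PA step) ---
        for k in range(K):
            x, y = streams[k][t]
            if y * u.dot(x) <= 0:         # cumulative error e(f_t)
                err_u[k] += 1
            loss = max(0.0, 1.0 - y * u.dot(x))
            if loss > 0.0:
                u = u + min(C, loss / x.dot(x)) * y * x
    return u, w
```

Per round, each of the K models is touched once, so the cost is linear in the number of instances and dimensions, in line with the complexity analysis above.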

Future Directions
The collaborative online learning method focuses on the qualitative aspect of sentiment analysis: the presence or absence of sentiment. To distinguish subjective tweets as positive or negative, one can refer to Li et al. (2011). For future work, one may take into consideration the time element, which has so far been left out: each microblog is simply presented sequentially, regardless of its publication timestamp. Intuitively, microblogs that are published in quick succession (minutes, hours) should be more relevant to one another than those that are published days, weeks, or months apart.
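One possible, purely illustrative instantiation of such a time element – not part of the method described above – would be to damp each update by an exponentially decaying recency weight. The decay form and the half-life below are our assumptions:

```python
def recency_weight(delta_seconds, half_life=3600.0):
    """Exponential recency decay: 1.0 for back-to-back microblogs,
    0.5 after one (assumed) half-life of one hour, and close to 0
    for microblogs that are days or weeks apart."""
    return 0.5 ** (delta_seconds / half_life)

# A time-aware variant could, e.g., scale the aggressiveness parameter C
# by this weight, so that stale instances trigger smaller updates:
#   tau = max(0.0, min(recency_weight(dt) * C, loss / x.dot(x)))
```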

Cross-References
▶ Data Mining
▶ Multi-classifier System for Sentiment Analysis and Opinion Mining
▶ Sentiment Analysis in Social Media
▶ User Sentiment and Opinion Analysis

References
Barbosa L, Feng J (2010) Robust sentiment detection on Twitter from biased and noisy data. In: COLING (posters), Beijing, pp 36–44
Bifet A, Frank E (2010) Sentiment knowledge discovery in Twitter streaming data. In: Pfahringer B, Holmes G, Hoffmann A (eds) Discovery science. Springer, Berlin, pp 1–15
Crammer K, Dekel O, Keshet J, Shalev-Shwartz S, Singer Y (2006) Online passive-aggressive algorithms. J Mach Learn Res 7:551–585
Davidov D, Tsur O, Rappoport A (2010a) Enhanced sentiment learning using Twitter hashtags and smileys. In: COLING (posters), Beijing, pp 241–249
Davidov D, Tsur O, Rappoport A (2010b) Semi-supervised recognition of sarcastic sentences in Twitter and Amazon. In: Proceedings of the 23rd international conference on computational linguistics (COLING), Beijing
Java A, Song X, Finin T, Tseng BL (2007) Why we Twitter: an analysis of a microblogging community. In: WebKDD/SNA-KDD, San Jose, pp 118–138
Kouloumpis E, Wilson T, Moore J (2011) Twitter sentiment analysis: the good the bad and the omg! In: ICWSM, Barcelona
Li G, Hoi SCH, Chang K, Jain R (2010) Micro-blogging sentiment detection by collaborative online learning. In: ICDM, Sydney, pp 893–898
Li G, Chang K, Hoi SCH, Liu W, Jain R (2011) Collaborative online learning of user generated content. In: CIKM, Glasgow, pp 285–290
Pak A, Paroubek P (2010) Twitter as a corpus for sentiment analysis and opinion mining. In: LREC, Valletta
Pang B, Lee L (2007) Opinion mining and sentiment analysis. Found Trends Inf Retr 2(1–2):1–135
Pang B, Lee L, Vaithyanathan S (2002) Thumbs up? Sentiment classification using machine learning techniques. In: Proceedings of EMNLP, Stroudsburg, pp 79–86

Twitter Opinion Mining
▶ Twitter Microblog Sentiment Analysis

Two-Mode Graph
▶ New Intermediaries of Personal Information: The FB Ecosystem
