The Importance of Outlying Relationships in Mobile Call Graphs

Derek Doran Dept. of Computer Science & Engineering University of Connecticut Storrs, CT, 06269 [email protected]

Veena Mendiratta, Chitra Phadke, and Huseyin Uzunalioglu Bell Laboratories Alcatel-Lucent Murray Hill, NJ, 07974 {veena.mendiratta, chitra.phadke, huseyin}@alcatel-lucent.com

Abstract: Mobile phones have become one of the primary tools for individuals to communicate, to access data networks, and to share information. Service providers collect data about the calls placed on their network, which exhibits a large degree of variability. Providers also model the structure of the relationships between network subscribers as a mobile call graph. In this paper, we apply a new measure that quantifies how much a relationship between users deviates from the average, and use it to explore the connection between calling behaviors and the complex structure that mobile call graphs take. By studying a large mobile call graph from a major service provider, we learn that distant, outlier relationships play the strongest role in maintaining connectivity between cellular users, that the calling features of users influence tie variation more strongly than social features, and that the massive connected component of the graph rapidly decays as highly varying ties are removed.

I. INTRODUCTION

The cell phone has become a ubiquitous communication technology used by people to stay connected with each other. Service providers are able to collect a wealth of information about the calls placed among their subscribers, such as the time a call was placed, its duration, the cost a subscriber is charged for the call, and the geographic location from which the call was placed, for business analytics and predictive analysis. The information collected about the calls between two users¹ helps establish a sense of the relationship that they share.

¹ The terms subscriber and user are used interchangeably throughout the paper.

A mobile call graph is a structure used to represent the network of relationships among subscribers of a cellular service. In such a graph, vertices represent users and a directed edge connects two users if a phone call was placed between them. The features of the aggregation of all calls placed between two users (e.g., call times, dates, and call locations) are represented as the set of attributes over each edge. Because these attributes are also linked to the social connection the two users share, social network analysis and machine learning techniques are applied to mobile call graphs for a number of applications, including network performance improvement [1], human movement pattern analysis [2], and customer churn prediction [3], [4].
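As a minimal sketch of the structure described above, the following Python snippet aggregates raw call records into directed edges with per-edge attributes. The record layout `(caller, callee, duration_seconds)`, the attribute names `num_calls` and `total_duration`, and the function name `build_call_graph` are assumptions made for illustration; they are not taken from the provider's data set.

```python
from collections import defaultdict


def build_call_graph(calls):
    """Aggregate call records into a directed call graph.

    `calls` is an iterable of (caller, callee, duration_seconds) tuples.
    This record layout is hypothetical and chosen only for illustration.
    Returns the vertex set and a dict mapping each directed edge
    (caller, callee) to its aggregated attributes.
    """
    edges = defaultdict(lambda: {"num_calls": 0, "total_duration": 0})
    for caller, callee, duration in calls:
        if caller == callee:
            continue  # a simple graph has no self-loops
        attrs = edges[(caller, callee)]
        attrs["num_calls"] += 1
        attrs["total_duration"] += duration
    vertices = {user for edge in edges for user in edge}
    return vertices, dict(edges)


# Four calls collapse into three directed edges (relationships).
calls = [("a", "b", 60), ("a", "b", 120), ("b", "a", 30), ("c", "a", 45)]
vertices, edges = build_call_graph(calls)
```

Note that the two calls from `a` to `b` become a single edge, which is why the number of edges generally differs from the number of calls recorded.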

Despite evidence that there is only a very limited correlation between the friendship of cell phone users and individual calling attributes [5], modern studies of call graphs tie their analysis to the way in which a single attribute varies, such as the number of times that two people talk to each other [6] or the total duration for which two people communicate [7]. Intuitively, examining variation across a single attribute carries less meaning than examining multiple attributes. For example, consider the frequency with which calls are placed across a connection. A user may call a bank to check balances and pay bills more often than they call a friend, even though the connection between friends is more significant.

To improve our understanding of the variability and complexity in mobile calling data for future modeling applications, in this paper we examine the connection between user relationships and the structure of a mobile call graph. We define a metric that quantifies the extent to which a relationship varies from an average one, and use this metric to consider how the graph structure changes as outlier and ordinary relationships are removed. Our findings suggest that outlier relationships play a role similar to the one that weak ties play in social networks [7], [8]: they are critical to maintaining the structure of the network. Machine learning algorithms and social network analyses that assume reliable communication across the structure of a call graph must account for the fact that the ties most critical to the spread of information or influence exhibit outlying and unpredictable behavior.

The layout of this paper is as follows. Section II introduces our algorithm to quantify variation across relationships. Section III gives a preliminary structural analysis of the call graph analyzed. Section IV applies the tie variation algorithm and then analyzes the structure of decomposed mobile call graphs. Section V discusses related work.
Section VI concludes the paper and offers future research directions.

II. TIE VARIATION

A mobile call graph is formally defined as a simple directed graph G = (V, E) where the set of vertices V represents mobile phone users, and an edge e = (a, b) ∈ E iff a, b ∈ V and a placed a call to b. In this way, G represents the |E| = m unique relationships that are made between the |V| = n users (note that m is different from the total number of calls recorded in the dataset). The values of the k calling attributes for edge i are represented in the row vector x(i). These m row vectors compose the m × k call attribute matrix X, so that the value of attribute j on edge i is given by [X]ij.

In this section, we define tie variation with a function f : E → S that maps the set of edges to the m-dimensional vector S, where Si is the tie variation for edge i. The tie variation of an edge in the call graph represents how much the calling attributes of that edge deviate from the values in an average, or expected, relationship. We refer to relationships with large tie variation as “outliers”, and to those with small tie variation as “ordinary”.

Simply summing the differences of each calling attribute on an edge from its expected value is not an appropriate way to define tie variation. This is because different calling attributes may be measured on different scales, and some attributes may be more indicative than others of how much a relationship deviates from the norm. Thus, we define f as a projection of the calling attribute matrix X onto a subspace that better represents the variation within its columns (calling attributes). The projection is performed over an orthogonal basis of vectors that point in the directions where the variation in the data is the largest. These directions can be found by performing principal component analysis (PCA) [9] over X. PCA identifies directions (each defined by a unit vector u) such that the projection of the row vectors of X along that direction maximizes the sum of the squared distances of the projected vectors from the origin. Identifying the first such direction is defined as the operation:

$$\max_{u \,:\, u^{T} u = 1} \; \sum_{i=1}^{m} \left( x^{(i)} u \right)^{2}$$
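As a hedged numerical illustration of the objective above, the sketch below checks on synthetic data that the leading eigenvector of XᵀX attains the maximum of the sum of squared projections over unit vectors (a standard property of the Rayleigh quotient). The synthetic data, the random seed, and the names `objective`, `u_star`, and `random_best` are assumptions made for this example and are unrelated to the call data set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, correlated data standing in for a centered attribute matrix X.
mixing = np.array([[2.0, 0.5, 0.0],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(200, 3)) @ mixing
X = X - X.mean(axis=0)


def objective(u):
    """Sum of squared projections of the rows of X onto direction u."""
    return float(np.sum((X @ u) ** 2))


# eigh returns eigenvalues in ascending order, so the last column of
# eigvecs is the eigenvector of the largest eigenvalue of X^T X.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)
u_star = eigvecs[:, -1]
best = objective(u_star)

# No randomly drawn unit vector should beat the leading eigenvector.
random_dirs = rng.normal(size=(1000, 3))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)
random_best = max(objective(u) for u in random_dirs)
```

Here `best` equals the largest eigenvalue of XᵀX up to floating-point error, and `random_best` cannot exceed it.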

It can be shown that the solution to this optimization problem must satisfy the equality Σu = λu, where λ is a constant and Σ is the covariance matrix of X. This means that the directions of largest variance among the row vectors of X are given by the k right eigenvectors of its covariance matrix, and that the directions corresponding to larger eigenvalues explain a greater amount of the variation within the columns of X. Σ is symmetric and positive semidefinite by definition, which means that its eigenvectors exist, are real, and can be chosen to be orthogonal to each other. Thus, the set of eigenvectors can be used to form a basis onto which we project X. Through such a projection, each x(i) is transformed in a way that preserves the total variation within the data, scales the components of x(i), and removes the linear correlation across components. The sum of the eigenvalues of Σ is equal to the total variance within the data, which is the same as the dimensionality of the data if it has zero mean and unit variance. This means

that these eigenvalues relate the amount of variation that is explained by each dimension of the projected data to the variation along the dimensions of the original data [9]. To appropriately weight each direction of maximum variance in the tie variation metric, we take each component of the projected data, multiply it by the corresponding eigenvalue, and sum these weighted components. Before applying PCA we must first normalize the mean and variance of the data. The entire algorithm is summarized as follows:

1) Set [X]ij = [X]ij − (1/m) Σ_{r=1}^{m} [X]rj for all i and j.
2) Set [X]ij = [X]ij / σj, where σj² is the variance of attribute j across all m edges.
3) Find the covariance matrix Σ of X.
4) Find Λ, a k × 1 column vector where Λl is the lth largest eigenvalue of Σ.
5) Find U, a k × k matrix whose lth column is the right eigenvector corresponding to Λl.
6) The tie variation of edge ei is given by the ith component of the vector S = XUΛ.

III. PRELIMINARY STRUCTURAL ANALYSIS

In this section, we give a preliminary analysis highlighting the structural properties of a call graph constructed with data provided by a major mobile service provider, and examine how the structure of this call graph changes as the relationships with the highest (most different) and lowest (most average) tie variation are removed. The data set captures the calls placed by subscribers across an entire country over a period of 3 contiguous weeks in 2011. The constructed call graph contains over 3 million vertices and 4.5 million edges, and represents approximately 10 million calls.

Mobile call graphs, similar to large-scale social networks, are scale-free and can be modeled by a preferential attachment process [10]. In such a process, edges are randomly attached to vertices such that the probability an edge is attached to a given vertex is proportional to the number of edges already attached to it.
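The attachment rule just described can be illustrated with a small simulation. The following is a simplified sketch, not the model fitted to the call data: it assumes each new vertex brings exactly one edge and counts degree without regard to edge direction. The function name `preferential_attachment` and the graph size are hypothetical choices for illustration.

```python
import random
from collections import Counter


def preferential_attachment(num_vertices, seed=0):
    """Grow a graph in which each new vertex attaches one edge to an
    existing vertex chosen with probability proportional to its degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]      # start from a single edge
    endpoints = [0, 1]    # each vertex appears once per incident edge
    for new_vertex in range(2, num_vertices):
        # A uniform draw from `endpoints` selects a vertex with
        # probability proportional to its current degree.
        target = rng.choice(endpoints)
        edges.append((new_vertex, target))
        endpoints.extend([new_vertex, target])
    return edges


edges = preferential_attachment(5000)
degree = Counter(v for edge in edges for v in edge)
num_leaves = sum(1 for d in degree.values() if d == 1)
```

In a run of this sketch, most vertices end up with a single edge while a few early vertices accumulate many, which is the qualitative behavior the process is meant to capture.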
Through this process, vertices that are rich with edges are likely to get even richer, while the remaining majority of vertices become starved for edges. The in- and out-degree distributions of these networks follow a power law, defined as a distribution whose reliability function takes the form R(x) = cx^(−α), where c is a constant and α is a scaling parameter defining how “heavy” the tail is. Figures 1 and 2 plot the out-degree and in-degree distributions of the call graph we analyzed on a log-log scale. We fitted a straight line to the data to estimate α, and found αin = 3.41 and αout = 2.63. These values are similar to those reported for other mobile call graphs in the literature [3], [7], [11], [12].

Another critical property of the mobile call graph is the existence of a single, massive connected component. To see this, we plotted the distribution of connected component sizes in Figure 3. This massive component accounts for approximately 30% of all nodes in the graph (80% if the edges are collapsed to be undirected).

Figure 1: Out-degree distribution of call graph nodes

Figure 2: In-degree distribution of call graph nodes

Figure 5: Decomposition of massive connected component

IV. TIE VARIATION AND GRAPH STRUCTURE
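The six-step tie variation algorithm of Section II can be sketched in NumPy as follows. This is an illustrative reading of the steps rather than the authors' code. Two points are assumptions that the text does not specify: the population variance (dividing by m) is used for normalization, and, because the sign of an eigenvector is arbitrary, each eigenvector is flipped so that its largest-magnitude component is positive. The function name `tie_variation` and the toy attribute values are hypothetical.

```python
import numpy as np


def tie_variation(X):
    """Return one tie variation score per row (edge) of the m x k matrix X."""
    X = np.asarray(X, dtype=float)
    # Steps 1-2: zero mean and unit variance per attribute (column).
    # Assumes no attribute is constant across edges.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 3: covariance matrix of the normalized data.
    sigma = np.cov(Z, rowvar=False, bias=True)
    # Steps 4-5: eigenvalues in descending order with matching eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(sigma)
    order = np.argsort(eigvals)[::-1]
    lam = eigvals[order]
    U = eigvecs[:, order]
    # Assumed sign convention: largest-magnitude component is positive.
    pivot = np.argmax(np.abs(U), axis=0)
    U = U * np.sign(U[pivot, np.arange(U.shape[1])])
    # Step 6: project onto the eigenvectors and weight by the eigenvalues.
    return Z @ U @ lam


# Toy edges with columns (number of calls, total call length in seconds);
# the last edge is far from the others in both attributes.
X = np.array([[1, 60], [2, 100], [1, 50], [2, 120], [3, 150], [40, 5000]])
S = tie_variation(X)
```

On this toy input the last edge receives the largest score. Without a sign convention the scores would only be defined up to the sign of each projected component, so any ranking built on them depends on the convention chosen.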

In this section, we study the relationship between the tie variation of a relationship and the structure of the mobile call graph. We use three calling attributes to derive tie variation, namely, the total length of calls placed, the total number of calls placed, and the neighborhood overlap of the users incident on an edge. Neighborhood overlap is a metric measuring the proportion of neighbors that two users have in common with each other. For two users a and b, it is defined as N(a, b) = |N(a) ∩ N(b)| / |N(a) ∪ N(b)|, where N(a) and N(b) are the sets of all neighbors of a and b, respectively.

Figure 4 shows the relationship between the tie variation metric and the attributes used. Figures 4a and 4b demonstrate that tie variation tends to increase with the total amount of time users spend talking to each other and with the number of calls placed. In other words, the average relationship involves neither many calls nor long talk times across the three-week period. We also see that the neighborhood overlap only varies across relationships with little tie variation (ordinary, average relationships). This indicates that the algorithm uses neighborhood overlap only to differentiate among relationships that would otherwise appear to have a similar amount of tie variation.

A. Structural analysis

Figure 3: Strongly-connected component size distribution

We next analyze the relationship between tie variation and call graph structure by considering different decompositions of the call graph. A decomposition is defined as a subgraph G′ = (V(c, S), E(c, S)), where E(c, S) ⊂ E is the set of edges whose tie variation value S satisfies the criterion c, and V(c, S) ⊂ V is the set of vertices that are incident to the edges in E(c, S).

1) Degree correlation: In Figure 6, we compare the mean out-degree for vertices with a given in-degree and the mean in-degree for vertices with a given out-degree. Each plot in the figure considers decompositions of the call graph where ordinary relationships are gradually removed. For example, the upper-left plot considers a decomposition where edges in the top 95th percentile of tie variation are retained, while the bottom-right plot is a decomposition where only edges in the top 5th percentile are retained. The plots show that vertices maintain a positive in-to-out degree correlation as more and more average (ordinary) relationships are eliminated. In other words, users that receive calls from many people also tend to make calls to many people, no matter how dissimilar (outlying) their relationships are. In examining the out-to-in degree correlation, we find that when ordinary relationships are included there is no correspondence between the number of calls placed and the number of calls received. As we decompose the graph and eliminate ordinary relationships, however, a positive correlation between the two becomes apparent. This tells us that calls placed along outlying ties are more likely to be reciprocated, while calls over ordinary ties do not encourage call backs.

Figure 4: Correlation between social tie variation and selected attributes: (a) call length, (b) number of calls, (c) neighborhood overlap

Figure 6: In- and out-degree distributions under different decompositions

2) Connected components: Next, we examine the relationship between tie variation and the structure of the massive connected component in the call graph. In Figure 5, we observe that the number of nodes in the massive connected component decreases faster when the edges are removed in the order of outlying to ordinary behavior. Figure 7 examines the number of connected components in call graphs that have been decomposed to different levels. The trend in the figure labeled “weak to strong” (“strong to weak”) removes the most weakly varying ties first (last). We find that the total number of connected components in the entire network rises quickly when outlying ties are removed first. The rapid decrease in the size of the massive connected component, along with the rise in the number of connected components, suggests that the massive connected component of the graph breaks apart and fragments as highly varying ties in the graph are removed.

When ordinary relationships are removed first, the massive connected component does not fragment as quickly. In fact, Figure 7 tells us that the number of connected components actually remains constant up until the graph only contains edges whose tie variation is in the top 45th percentile. The knee in the trend at 25% likely represents the point where the structure of the massive connected component finally begins to shatter into multiple, smaller components. The decrease in the number of connected components starting at 45% thus represents a critical percentile of outlier relationships where the other, non-massive components begin to break down. These observations together tell us that the mobile call graph is not structurally robust against the loss of outlier relationships.

Figure 7: Fragmentation of massive connected component

B. Graph formation and social ties

The findings on how node degree and connected components change as the graph decomposes paint an interesting picture of the make-up of a mobile call graph. Based on our analysis, we hypothesize that the connected components of a call graph grow starting from a “trunk” structure. This “trunk” is composed of pairs of users exhibiting outlying calling behavior. A connected component grows as users that are part of a trunk begin to communicate with other people in a more ordinary fashion. In this way, branches off of the trunk begin to form. Figure 8 visualizes an example of a trunk structure by drawing a connected component from a decomposed call graph where only the top percentile of outlying relationships remain. Our hypothesis about graph formation matches well with the observations made in our study. For example, a rise in the total number of connected components as highly varying ties are removed can be explained by branches that become new connected components once the strongly linked “trunk” of a connected component becomes disconnected.

Figure 8: Connected component “trunk” structure

V. RELATED RESEARCH

Many studies of mobile call graphs have also focused on understanding their structure. Nanavati et al. analyzed the structure of a mobile call graph and studied how this structure evolved over time [11]. Hidalgo and Rodriguez-Sickert studied the dynamics of a mobile call graph to understand node persistence [14]. Du et al. found patterns in mobile call graphs, leading to a model that can generate weighted time-evolving networks [15]. Onnela et al. examined the structure of a mobile call graph based on the amount of time users spend talking to each other [7].

More recent work considers the application of social network analysis to mobile call graphs. Zang and Bolot improve the performance of paging schemes that determine the location of a mobile user by leveraging information in a mobile call graph [1]. Dasgupta et al. propose a way to predict whether customers in a mobile graph will churn based on the diffusion of information within the call graph [3]. Richter et al. proposed a churn prediction algorithm that identifies groups of users using second-order social metrics [4]. Belik et al. examine the spread of infectious diseases through the human mobility available in call graphs [2]. Belo and Ferreira apply a randomization technique that combines a number of calling attributes in a linear model [16]. Pan et al. synthesize calling, location, spatial, and social network data from cell phones to gauge the influence users have over others [17].
In contrast to the previous work, this study examined the structure of the mobile call graph with respect to relationships with ordinary and outlying behavior. We explored a new way to understand the arrangement of the graph by comparing different decompositions of the call graph, and identified roles that different types of user relationships play in the structure of the graph.

VI. CONCLUSIONS AND FUTURE WORK

In this paper, we presented a new algorithm that incorporates any number of calling attributes to quantify how much a relationship between two cell phone users varies from ordinary behavior. We then observed how the structure of the call graph decomposes as outlying and ordinary ties are removed. Our observations can be summarized as follows:

• No matter how outlying a user's relationships are, users that receive calls from many people also tend to make calls to many people.
• Calls placed along outlying ties are more likely to be reciprocated, while calls over ordinary ties do not encourage call backs.
• Mobile call graphs are not structurally robust against the loss of outlier relationships:
  - The number of connected components rapidly rises as outlying connections are removed (i.e., the massive component quickly fragments).
  - There exists a critical percentile of outlier relationships where all connected components begin to fragment.

Future work seeks to examine additional relationships between structure and tie variation, including shortest paths, clustering coefficients, and neighborhood distributions. We will further explore our hypothesis on the role of outlier ties in the development of connected components in the call graph. Finally, we seek to better understand how ties that deviate by exhibiting more, or less, calling activity than expected influence the structure of the call graph.

REFERENCES

[1] H. Zang and J. Bolot, “Mining Call and Mobility Data to Improve Paging Efficiency in Cellular Networks,” in Proc. of 13th ACM International Conference on Mobile Computing and Networking, 2007.
[2] V. Belik, T. Geisel, and D. Brockmann, “Human movements and the spread of infectious diseases,” in Proc. of 1st Workshop on the Analysis of Mobile Phone Datasets and Networks, 2010, pp. 44–46.
[3] K. Dasgupta, R. Singh, B. Viswanathan, D. Chakraborty, S. Mukherjea, and A. Nanavati, “Social Ties and their Relevance to Churn in Mobile Telecom Networks,” in Proc. of 11th ACM Intl. Conference on Extending Database Technology, 2008.
[4] Y. Richter, E. Yom-Tov, and N. Slonim, “Predicting customer churn in mobile networks through analysis of social groups,” in Proc. of SIAM Intl. Conference on Data Mining, 2010.
[5] N. Eagle and A. (Sandy) Pentland, “Reality mining: sensing complex social systems,” Personal Ubiquitous Comput., vol. 10, pp. 255–268, March 2006. [Online]. Available: http://dx.doi.org/10.1007/s00779-005-0046-3
[6] H. Zhang and R. Dantu, “Social Relationship and Behavior Analysis in Mobile Social Networks,” in Proc. of 1st Workshop on the Analysis of Mobile Phone Datasets and Networks, 2010, pp. 19–24.
[7] J.-P. Onnela, J. Saramaki, J. Hyvonen, G. Szabo, D. Lazer, K. Kaski, J. Kertesz, and A.-L. Barabasi, “Structure and tie strengths in mobile communication networks,” Proceedings of the National Academy of Sciences of the United States, vol. 104, pp. 7332–7336, 2007.
[8] M. Granovetter, “The Strength of Weak Ties,” American Journal of Sociology, vol. 78, pp. 1360–1380, 1973.
[9] J. E. Jackson, A User’s Guide to Principal Components. John Wiley & Sons, 2004.
[10] A.-L. Barabasi and R. Albert, “Emergence of Scaling in Random Networks,” Science, vol. 286, pp. 509–512, 1999.
[11] A. Nanavati, R. Singh, D. Chakraborty, K. Dasgupta, S. Mukherjea, G. Das, S. Gurumurthy, and A. Joshi, “Analyzing the Structure and Evolution of Massive Telecom Graphs,” IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 5, 2008.
[12] V. Tomar, H. Asnani, A. Karandikar, V. Chander, S. Agrawal, and P. Kapadia, “Social Network Analysis of the Short Message Service,” in Proc. of 2010 National Conference on Communications, 2010, pp. 1–5.
[13] H. Kwak, C. Lee, H. Park, and S. Moon, “What is Twitter, a Social Network or a News Media?” in Proc. of 19th Intl. World Wide Web Conference, 2010, pp. 591–600.
[14] C. Hidalgo and C. Rodriguez-Sickert, “The dynamics of a mobile phone network,” Physica A: Statistical and Theoretical Physics, no. 387, pp. 3017–3024, 2008.
[15] N. Du, C. Faloutsos, B. Wang, and L. Akoglu, “Large Human Communication Networks: Patterns and a Utility Driven Generator,” in Proc. of 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, pp. 269–277.
[16] R. Belo and P. Ferreira, “Using randomization methods to identify social influence in mobile networks,” in Book of Abstracts on the 2nd Conference on the Analysis of Mobile Phone Datasets and Networks, 2011, pp. 79–82.
[17] W. Pan, N. Aharony, and A. Pentland, “Composite social network for predicting mobile apps installation,” in Proc. AAAI Conference on Artificial Intelligence, 2011, pp. 821–827.
[18] Y. Altshuler, N. Aharony, M. Fire, and A. P. Yuval Elovici, “Incremental learning with accuracy prediction of social and individual properties from mobile-phone data,” CoRR abs/1111.4645, 2011.