The 8th International Conference for Internet Technology and Secured Transactions (ICITST-2013)
Anomaly Detection in Online Social Networks Using Structure-Based Technique

Abdolazim Rezaei, Zarinah Mohd Kasirun, Vala Ali Rohani
Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia

Touraj Khodadadi
Malaysia-Japan International Institute of Technology, Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia
Abstract-Online social networks, as a new phenomenon, have affected our lives in many positive ways; however, they can also be considered a channel for malicious activities. Identifying anomalous users has become a challenge, and although much research has been conducted, it is not enough. In this paper we propose a methodology based on graph metrics of online social networks. The experimental results illustrate that the majority of friends in online social networks have common friends with their friends, while anomalous users may not follow this pattern.

Keywords-online social networks; anomaly detection; graph mining

I. INTRODUCTION

Online social networks (OSNs) have thrived to become a phenomenon within the past decades. Social network sites (SNS) provide an environment in which users can carry out online social networking activities. Since their emergence, social network sites have increasingly become a way of communication for people from various locations. Facebook, Twitter, MySpace, and LinkedIn, which are known as the most popular OSNs, have connected millions of people over the world. Social networks have been used in different domains such as education, communication, business and many others. The rapid growth of online social network sites has raised many security issues that need to be dealt with. With the increasing popularity of online social networks, the misuse of these social networks and services has also increased, as has been reported in various sources [3].

We define an anomaly in an OSN as the unexpected behavior of users whose behavior does not conform to normal behavior in the network, and anomaly detection refers to detecting these users or such behavior. Anomalous behavior could signify irregularities, like cross-disciplinary authors in an author-paper graph [5], credit card fraud, calling card fraud, campaign donation irregularities, network intrusion detection [7], electronic auction fraud [2], email spam and phishing [9].

The tremendous increase in the popularity of SNS allows them to collect a great amount of personal information about the users, their friends, and their habits. Unfortunately, this wealth of information, as well as the ease with which one can reach many users, also attracts the interest of malicious parties. Many studies have been done on online social networks. Social Network Analysis encompasses visual [10], [6] and mathematical analyses of relationships. Considering an online social network as a graph, the anomalous parts of the graph can be detected using an anomaly detection technique. We apply a structure-based anomaly detection technique in order to detect anomalous users and measure the degree of anomalies on an online social network.

978-1-908320-20/9/$25.00©2013 IEEE
II. RELATED WORK

Researchers study anomaly detection using two main techniques: behavior-based techniques and structure-based (or graph-based) techniques. The first deals with behavioral properties and characteristics of users in order to analyze users' behavior, for example the number of messages, the content of messages, hours spent on an event, the number of shares or likes, the duration of an activity, or the details of shared items. The latter concentrates on graph-based properties of the OSN and its users. In this method, researchers analyze OSNs modeled as graphs and use graph-based properties such as nodes, edges, number of edges, number of nodes, betweenness centrality, degree centrality, in-degree, and out-degree.

Detecting worm publishers is one type of anomaly detection that researchers have been concentrating on. Xu et al. [11] explain a behavior-based approach, simple in design, to detect worms on a social network. Silva et al. [8] discuss anomalous meetings on social networks by monitoring behavior-related interactions recorded from different members of a network. They aim to estimate the probability that a meeting is anomalous and its degree of anomaly.

Ways to detect spam and spammers are on the increase, in line with the steadily growing amount of spam. These malicious users may appear in different states with different methods. Chu et al. [3] discuss social spamming on Twitter and introduce a machine learning classification based on the properties of shared URLs, while considering the efficiency and robustness of the method. Yeung et al. [12] introduce a graph-based approach to spam detection, based on a learning approach which extracts social network features such as the in-count and out-count of emails and the in-degree and out-degree of emails. Eberle and Holder [4] introduce another approach, in which they detect anomalous substructures which are known to be part of a non-anomalous substructure. Akoglu et al. [1] introduce a new graph-based method of anomaly detection which is appropriate for large, weighted graphs. The aim of their study is to spot anomalous subgraphs whose users' behavior deviates from other users. They consider the number of edges for each node and their weight.

Behavior-based techniques cover a wide range of properties and features of OSN users, which may lead to confusion, while structure-based techniques deal with a particular and defined range of metrics and features. Moreover, behavior-based techniques are very technology dependent, while a large number of features can be included.

III. RESEARCH METHODOLOGY

The proposed solution is based on structure-based techniques and considers an online social network as a graph with nodes and edges.

A. Anomaly Detection Metrics

There are many graph metrics that can be used, but in this study we use two commonly used graph metrics. Considering user i or node i as $ego_i$, its friends form $egonet_i$, and its friends of friends form $superegonet_i$. Similarly, the degree of a node also represents the number of its friends, neighbors or edges. In other words, a user i is known as $ego_i$, his/her number of friends as $egonet_i$, and the number of friends-of-friends as $superegonet_i$. Therefore, we define the following for node i:

$N_i$: number of friends of $ego_i$
$E_i$: number of friends of $egonet_i$

B. Compute Graph Metrics

Applying algorithms assists in computing the number of friends ($N_i$) for $ego_i$. The same algorithm is applied to the friends in $egonet_i$ in order to calculate $E_i$.

C. Compute Fitting Curve

Users on online social networks tend to keep their relationships with other friends. One of the ways of analyzing online social networks is to look at users' ties and relationships. Therefore, modeling the relationships between users can help in knowing them better. Utilizing local metrics like the number of friends and using distribution models such as the power law can help us perform a better analysis of social networks. In this modeling, $R^2$ is the coefficient of determination for each model generated from empirical data. $R^2$ is computed as follows:

$$R^2 = 1 - \frac{SS_{residual}}{SS_{total}} \qquad (1)$$

where $SS_{residual} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ and $SS_{total} = \sum_{i=1}^{n} (y_i - E(y_i))^2$, in which $\hat{y}_i$ is the predicted value of $y_i$ and $E(\cdot)$ is the expected value. We look into the relationship between pairs of metrics and analyze them according to regression equations that might be linear or power law.

Akoglu et al. [1] have shown that N vs. E follows a power law with fitting line $E_i \propto N_i^{\alpha}$, where $1 \le \alpha \le 2$. $E_i$ is the number of edges, $N_i$ represents the number of nodes, and $\alpha$ is the power-law exponent for user i's egonet.

D. Compute Anomaly Score

Using graph metrics of social networks in the anomaly measurement formula is a way of computing anomaly scores. The distance from the fitting line can tell us whether a node may be anomalous. The formula presented in the OddBall method by Akoglu et al. [1] is used in this work:

$$aScore_i = \frac{\max(y_i, C x_i^{\theta})}{\min(y_i, C x_i^{\theta})} \cdot \log(|y_i - C x_i^{\theta}| + 1) \qquad (2)$$

where $y_i$ is the Y-axis value and $x_i$ is the X-axis value taken by node i, and $\theta$ is the power-law exponent in the fitted equation $y = C x^{\theta}$. The score quantifies the deviation from the fitted line: the vertical difference between a node's observed value and the value the fitted curve gives at the same x. The more a node deviates, the larger its penalty, in proportion to the number of times it deviates from the line.
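Steps A to D above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the graph is assumed to be an undirected adjacency dictionary, $E_i$ is assumed to count the distinct friends-of-friends of node i (excluding i itself), the power law is fitted by least squares in log-log space (so $R^2$ is measured on the log values), and the logarithm in (2) is assumed to be natural.

```python
import math
import numpy as np


def ego_metrics(adj):
    """Return {node: (N_i, E_i)} for a graph given as {node: set_of_friends}.

    Assumption: E_i counts distinct friends-of-friends, excluding the ego.
    """
    metrics = {}
    for node, friends in adj.items():
        friends_of_friends = set()
        for friend in friends:
            friends_of_friends |= adj[friend]
        friends_of_friends.discard(node)
        metrics[node] = (len(friends), len(friends_of_friends))
    return metrics


def fit_power_law(n_values, e_values):
    """Fit e = C * n**theta by linear least squares on the logs.

    Returns (C, theta, r2), where r2 follows equation (1) on the log values.
    All inputs must be strictly positive.
    """
    x = np.log(np.asarray(n_values, dtype=float))
    y = np.log(np.asarray(e_values, dtype=float))
    theta, log_c = np.polyfit(x, y, 1)  # slope first, then intercept
    y_hat = theta * x + log_c
    ss_residual = float(np.sum((y - y_hat) ** 2))
    ss_total = float(np.sum((y - y.mean()) ** 2))
    r2 = 1.0 - ss_residual / ss_total
    return math.exp(log_c), float(theta), r2


def anomaly_score(y, x, c, theta):
    """Equation (2): ratio to the fitted value times log of the distance."""
    expected = c * x ** theta
    if y <= 0 or expected <= 0:
        raise ValueError("observed and expected values must be positive")
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1.0)
```

A node that lies exactly on the fitted curve scores 0, and the score grows with both the ratio and the absolute distance to the curve. Nodes with $N_i = 0$ or $E_i = 0$ would have to be filtered out before the log-log fit.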
E. Labeling Based on Possible Pattern

Users follow patterns of friendship: they generally follow the tenet that "friends of friends are often friends", and in rare cases they may follow either the "cliques or near-cliques" pattern or the "stars or near-stars" pattern. In other words, anomalous users tend to follow these latter patterns. Akoglu et al. [1] discuss some of these patterns: in Star or Near-Star and Clique or Near-Clique, users whose neighbors are well connected to each other (near-cliques) or are not connected (stars) turn out to behave strangely. In order to find the threshold in the next step, 100 nodes are chosen for labeling. These nodes are labeled programmatically as "cliques or near-cliques" or "stars or near-stars".

F. Calculating Threshold

For each metric we determine a threshold value on the outlier score aScore that maximizes the F-score, which accounts for the false positives and false negatives in the labeled dataset:

$$F_{score} = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (3)$$

Its highest value (1) indicates the best classification of the labeled data, whereas its lowest value (0) indicates a false classification of the labeled data.

IV. EXPERIMENTAL RESULTS

This study conducted all the steps in the research methodology section on the Twitter dataset officially released by the Stanford Network Analysis Project (SNAP). The distribution model covers the distribution of $N_i$ vs. $E_i$, that is, the number of friends of a node and its number of friends of friends. This model shows the relationship between the number of friends of a particular node and its friends-of-friends. There should be a relation between the distributions of values on two-dimensional graphs. Importing the data into Matlab and calculating the fitting curve produces the regression shown in Fig. 1.

[Figure 1. N vs. E Distribution Model]

Every dot in the plot, which shows the concentration at that point, indicates the relation between $N_i$ and $E_i$ for a single user. The coefficient of determination, which shows the goodness of fit, is $R^2 = 0.3401$. Based on this regression, the fitted curve is drawn in Fig. 2.

[Figure 2. N vs. E Fitted Curve (x-axis: $N_i$; legend: fitted curve)]

Given the fitted curve, we also obtain the following values for the fitted line equation:

$$c_2(x) = a(x - b)^n \qquad (4)$$

where the coefficients a and b are 0.2915 and $1.4786 \times 10^{-7}$, respectively. It is important to note that the coefficients are calculated with 91% confidence bounds. As the parameter n in the problem equals 2:

$$E_i = 0.2915 N_i^2$$
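The threshold step in Section F can be illustrated with a short Python sketch. It is a hedged illustration only: `f_score` and `best_threshold` are hypothetical names, a node is assumed to be flagged when its anomaly score is greater than or equal to the threshold, and the candidate thresholds are simply the distinct scores of the labeled nodes. The scores and labels passed in would come from the anomaly score in (2) and the labeling step.

```python
def f_score(precision, recall):
    """Equation (3): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Return (threshold, f) maximizing the F-score on labeled nodes.

    scores: anomaly scores, one per labeled node.
    labels: True if the node was labeled as following an anomalous pattern.
    Assumption: a node is flagged when score >= threshold.
    """
    best_t, best_f = None, -1.0
    for t in sorted(set(scores)):
        flagged = [s >= t for s in scores]
        tp = sum(1 for f, l in zip(flagged, labels) if f and l)
        fp = sum(1 for f, l in zip(flagged, labels) if f and not l)
        fn = sum(1 for f, l in zip(flagged, labels) if not f and l)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = f_score(precision, recall)
        if f > best_f:  # keep the lowest threshold among ties
            best_t, best_f = t, f
    return best_t, best_f
```

As a check of the arithmetic in (3), a precision of 0.74 and a recall of 0.61 give `f_score(0.74, 0.61)` of about 0.6687.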
Program code was applied to 100 high-degree nodes, and 74 of them were observed to follow the pattern according to [1]. Thus, 74% of the users with a high degree (a high number of friends) followed the pattern. These 74 nodes must also be assessed on whether they have a high anomaly score. This comparison yields the F-score as the final score. The 74% is taken as the precision and 61% as the recall, which reflects those who have a high anomaly score as well as conforming to one of the anomalous patterns. Thus, from (3), the threshold is:

$$F_{score} = \frac{2(74)(61)}{74 + 61} = 66.87\%$$

The final F-score shows the percentage of those people who have a high anomaly score and whose behavior is known to be truly anomalous. In other words, 66.87% of users with a high anomaly score contributed to malicious activities that have been detected in this research.

V. CONCLUSION

Although online social networks have motivated millions of people throughout the world to come together and receive many benefits, they may also provide conditions for malicious activities. Much research has been conducted, and in this study the results of related research on anomalous activity detection in online social networks are discussed. Users still encounter such malicious activities on OSNs. Considering an OSN as a graph, the network is analyzed based on defined graph metrics. We applied the proposed methodology and labeled 100 nodes with a high probability of following an anomalous pattern. Moreover, the distribution model of the calculated metrics led to an anomaly score for each user based on their deviation from the fitted curve. Evaluating labeled nodes with high anomaly scores helped to find the threshold used to identify anomalous nodes that cross it.

REFERENCES

[1] Akoglu, Leman, McGlohon, Mary, & Faloutsos, Christos. (2010). OddBall: spotting anomalies in weighted graphs. Paper presented at the Proceedings of the 14th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, Volume Part II, Hyderabad, India.
[2] Chau, Duen Horng, Pandit, Shashank, & Faloutsos, Christos. (2006). Detecting fraudulent personalities in networks of online auctioneers. Paper presented at the Proceedings of the 10th European Conference on Principle and Practice of Knowledge Discovery in Databases, Berlin, Germany.
[3] Chu, Zi, Widjaja, Indra, & Wang, Haining. (2012). Detecting Social Spam Campaigns on Twitter. In F. Bao, P. Samarati & J. Zhou (Eds.), Applied Cryptography and Network Security (Vol. 7341, pp. 455-472): Springer Berlin Heidelberg.
[4] Eberle, William, & Holder, Lawrence. (2007). Discovering Structural Anomalies in Graph-Based Data. Paper presented at the Proceedings of the Seventh IEEE International Conference on Data Mining Workshops.
[5] Jimeng, Sun, Huiming, Qu, Chakrabarti, D., & Faloutsos, C. (2005, 27-30 Nov. 2005). Neighborhood formation and anomaly detection in bipartite graphs. Paper presented at the Data Mining, Fifth IEEE International Conference on.
[6] Payne, J., Solomon, J., Sankar, R., & McGrew, B. (2008, 19-24 Oct. 2008). Grand challenge award: Interactive visual analytics palantir: The future of analysis. Paper presented at the Visual Analytics Science and Technology, 2008. VAST '08. IEEE Symposium on.
[7] Sequeira, Karlton, & Zaki, Mohammed. (2002). ADMIT: anomaly-based data mining for intrusions. Paper presented at the Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Edmonton, Alberta, Canada.
[8] Silva, J., & Willett, R. (2008, 19-21 March 2008). Detection of anomalous meetings in a social network. Paper presented at the Information Sciences and Systems, 2008. CISS 2008. 42nd Annual Conference on.
[9] Stringhini, Gianluca, Kruegel, Christopher, & Vigna, Giovanni. (2010). Detecting spammers on social networks. Paper presented at the Proceedings of the 26th Annual Computer Security Applications Conference, Austin, Texas.
[10] Swing, E. (2008, 19-24 Oct. 2008). Award: Efficient toolkit integration solving the cell phone calls challenge with the Prajna Project. Paper presented at the Visual Analytics Science and Technology, 2008. VAST '08. IEEE Symposium on.
[11] Xu, Wei, Zhang, Fangfang, & Zhu, Sencun. (2010). Toward worm detection in online social networks. Paper presented at the Proceedings of the 26th Annual Computer Security Applications Conference, Austin, Texas.
[12] Yeung, Ho-Yu Lam; Dit-Yan. (2007). A Learning Approach to Spam Detection Based on Social Networks. Fourth Conference on Email and Anti-Spam.