The 8th International Conference for Internet Technology and Secured Transactions (ICITST-2013)

Anomaly Detection in Online Social Networks Using Structure-Based Technique

Abdolazim Rezaei
Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
[email protected]

Zarinah Mohd Kasirun
Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
[email protected]

Vala Ali Rohani
Faculty of Computer Science and Information Technology, University of Malaya, Kuala Lumpur, Malaysia
[email protected]

Touraj Khodadadi
Malaysia-Japan International Institute of Technology, Universiti Teknologi Malaysia, Kuala Lumpur, Malaysia
[email protected]

Abstract: Online social networks, as a new phenomenon, have affected our lives in many positive ways; however, they can also serve as a channel for malicious activities. Identifying anomalous users has become a challenge, and although much research has been conducted, it is not yet sufficient. In this paper we propose a methodology based on graph metrics of online social networks. The experimental results illustrate that the majority of users in online social networks share common friends with their friends, while anomalous users may not follow this pattern.

Keywords: online social networks; anomaly detection; graph mining

I. INTRODUCTION

Online social networks (OSNs) have thrived to become a phenomenon within the past decades. Social network sites (SNS) provide an environment in which users can carry out online social networking activities. Since the emergence of social network sites, they have increasingly become a way of communication for people from various locations. Facebook, Twitter, MySpace, and LinkedIn, which are known as the most popular OSNs, have connected millions of people over the world. Social networks have been used in different domains such as education, communication, business and many others. The rapid growth of online social network sites has raised many security issues that need to be dealt with. With the increasing popularity of online social networks, the misuse of these social networks and services has also increased, as has been reported in various sources [3].

We define an anomaly in an OSN as the unexpected behavior of users whose behavior does not conform with normal behavior in the network, and anomaly detection refers to detecting these users or such behavior. Anomalous behavior could signify irregularities, like cross-disciplinary authors in an author-paper graph [5], credit card fraud, calling card fraud, campaign donation irregularities, network intrusion [7], electronic auction fraud [2], and email spam and phishing [9]. The tremendous increase in the popularity of SNS allows them to collect a great amount of personal information about the users, their friends, and their habits. Unfortunately, this wealth of information, as well as the ease with which one can reach many users, also attracts the interest of malicious parties.

Much research has been done on online social networks. Social network analysis encompasses both visual [10], [6] and mathematical analyses of relationships. Considering an online social network as a graph, the anomalous parts of the graph can be detected using an anomaly detection technique. We apply a structure-based anomaly detection technique in order to detect anomalous users and measure the degree of anomalies on an online social network.

978-1-908320-20/9/$25.00©2013 IEEE

II. RELATED WORK

Researchers study anomaly detection using two main techniques: behavior-based techniques and structure-based techniques (or graph-based techniques). The first deals with behavioral properties and characteristics of users in order to analyze their behavior, for example the number of messages, the content of messages, hours spent on an event, the number of shares or likes, the duration of an activity, or the details of a shared item. The latter concentrates on graph-based properties of OSNs and users. In this method, researchers analyze OSNs modeled as graphs and use graph-based properties such as nodes, edges, number of edges, number of nodes, betweenness centrality, degree centrality, in-degree, and out-degree.
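As a rough illustration of the simplest of these graph-based properties, the sketch below computes in-degree, out-degree and degree centrality from a directed edge list. The function name `degree_metrics`, the reading of an edge `(u, v)` as "u follows v", and the normalization by n - 1 are assumptions made for this example, and duplicate edges are assumed to be absent.

```python
from collections import Counter


def degree_metrics(edges):
    """Return {node: (in_degree, out_degree, degree_centrality)}.

    `edges` is an iterable of directed pairs (u, v), read here as
    "u follows v" (an assumption made for this sketch).
    """
    in_deg, out_deg = Counter(), Counter()
    nodes = set()
    for u, v in edges:
        out_deg[u] += 1
        in_deg[v] += 1
        nodes.update((u, v))

    n = len(nodes)
    result = {}
    for node in nodes:
        total = in_deg[node] + out_deg[node]
        # Degree centrality: total degree normalized by the largest
        # possible number of distinct neighbors, n - 1.
        centrality = total / (n - 1) if n > 1 else 0.0
        result[node] = (in_deg[node], out_deg[node], centrality)
    return result


metrics = degree_metrics([("a", "b"), ("a", "c"), ("b", "c"), ("d", "a")])
```

Betweenness centrality needs shortest-path counting and is better taken from a graph library than written by hand.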

Detecting worm publishers is one type of anomaly detection that researchers have been concentrating on. Xu et al. [11] explain a behavior-based approach, simple in design, for detecting worms on a social network. Silva et al. [8] discuss anomalous meetings on social networks by monitoring behavior-related interactions recorded from different members of a network. They aim to estimate the probability that a meeting is anomalous and its degree of anomaly. Spam is another target: ways to detect spam or spammers are on the increase, in line with the steadily growing volume of spam. These malicious users may appear in different states and use different methods. Chu et al. [3] discuss social spamming on Twitter and introduce a machine learning classification based on the properties of shared URLs, while considering the efficiency and robustness of the method. Lam and Yeung [12] introduce a graph-based approach to spam detection, based on a learning approach which extracts social network features such as in-count and out-count of emails and in-degree and out-degree of emails. Eberle and Holder [4] introduce another approach, in which they detect anomalous substructures which are known to be part of a non-anomalous substructure. Akoglu et al. [1] introduce a new graph-based method of anomaly detection which is appropriate for large, weighted graphs. The aim of their study is to spot anomalous subgraphs whose users' behavior deviates from that of other users. They consider the number of edges of each node and their weights.

Behavior-based techniques cover a wide range of properties and features of OSN users, which may lead to confusion, while structure-based techniques deal with a particular and defined range of metrics and features. Moreover, behavior-based techniques are very technology dependent, and a large number of features can be included.

III. RESEARCH METHODOLOGY

The proposed solution is based on structure-based techniques and considers an online social network as a graph with nodes and edges.

A. Anomaly Detection Metrics

There are many graph metrics that can be used, but in this study we use two of the commonly used ones. Considering user i or node i as $ego_i$, its friends form $egonet_i$, and the friends of friends form $superegonet_i$. Similarly, the degree of a node also represents the number of its friends, neighbors or edges. In other words, a user i is known as $ego_i$, his/her number of friends as $egonet_i$, and number of friends-of-friends as $superegonet_i$. Therefore, we define the following considering node i:

$N_i$: number of friends of $ego_i$

$E_i$: number of friends of $egonet_i$

B. Compute Graph Metrics

An algorithm is applied to compute the number of friends ($N_i$) for $ego_i$. The same algorithm is applied to the friends in $egonet_i$ in order to calculate $E_i$.

C. Compute Fitting Curve

Users on online social networks tend to keep their relationships with other friends. One way of analyzing online social networks is to look at users' ties and relationships. Therefore, modeling the relationships between users can help in knowing them better. Utilizing local metrics like the number of friends and using distribution models such as the power law can help us perform a better analysis of social networks. In this modeling, $R^2$ is the coefficient of determination for each model generated from empirical data. $R^2$ is computed as follows:

$$R^2 = 1 - \frac{SS_{residual}}{SS_{total}} \qquad (1)$$

where $SS_{residual} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ and $SS_{total} = \sum_{i=1}^{n} (y_i - E(y_i))^2$, in which $\hat{y}_i$ is the predicted value, $y_i$ is the known value, and $E(\cdot)$ is the expected value. We look into the relationship between pairs of metrics and analyze them according to regression equations that might be linear or power law. Akoglu et al. (2010) [1] have shown N vs. E to follow a power law with fitting line $E_i \propto N_i^{\alpha}$, $1 \le \alpha \le 2$, where $E_i$ is the number of edges, $N_i$ is the number of nodes, and $\alpha$ is the power-law exponent for user i's egonet.

D. Compute Anomaly Score

Using graph metrics of social networks in the anomaly measurement formula is a way of computing anomaly scores. The distance from the fitting line can indicate a node which may be anomalous. The formula presented in the OddBall method by Akoglu et al. [1] is used in this work:

$$aScore(i) = \frac{\max(y_i,\, C x_i^{\theta})}{\min(y_i,\, C x_i^{\theta})} \cdot \log\left(\left|y_i - C x_i^{\theta}\right| + 1\right) \qquad (2)$$

where $y_i$ is the value on the Y axis and $x_i$ is the value on the X axis taken by node i, and $\theta$ is the power-law exponent in the equation $y = C x^{\theta}$. The score gives the amount of deviation from the fitted line. This deviation is the vertical difference between the value $y_2$ given by the fitted curve at a given $x$ and the original value $y_1$ at the same $x$. The anomaly score thus shows how far that particular node deviates from the fitting line. The more it deviates, the larger the penalty, according to the number of times it deviates from the line.
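The metrics $N_i$ and $E_i$ and the anomaly score of (2) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: $E_i$ is taken to be the number of edges inside the egonet (the OddBall reading in [1]), although the wording "number of friends of $egonet_i$" could also be read as a friends-of-friends count; the logarithm is taken as natural; and the names `egonet_metrics` and `anomaly_score` are hypothetical.

```python
import math
from collections import defaultdict


def egonet_metrics(edges):
    """Return {node: (N_i, E_i)} for an undirected edge list.

    N_i is the number of friends of the ego. E_i is assumed here to be
    the number of edges inside the egonet: ego-to-friend edges plus
    edges between friends.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    metrics = {}
    for ego, friends in adj.items():
        n_i = len(friends)
        # Each friend-friend edge is seen from both endpoints, so halve.
        between = sum(len(adj[f] & friends) for f in friends) // 2
        metrics[ego] = (n_i, n_i + between)
    return metrics


def anomaly_score(y, x, c, theta):
    """Score of equation (2) for one node; y and c * x**theta must be > 0."""
    expected = c * x ** theta
    ratio = max(y, expected) / min(y, expected)
    return ratio * math.log(abs(y - expected) + 1)


# A triangle (1, 2, 3) with one pendant friend (4) attached to node 3.
metrics = egonet_metrics([(1, 2), (2, 3), (1, 3), (3, 4)])
n_3, e_3 = metrics[3]
score_3 = anomaly_score(e_3, n_3, c=1.0, theta=1.0)
```

A node that lies exactly on the fitted line gets a score of 0, because the ratio is 1 and log(1) = 0; the score grows with both the ratio and the absolute distance from the line.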

E. Labeling Based on Possible Pattern

Users follow patterns of friendship: they generally follow the tenet that "friends of friends are often friends", and in rare cases they may follow either the "cliques or near-cliques" pattern or the "stars or near-stars" pattern. In other words, anomalous users follow the latter patterns. In order to find the threshold in the next step, 100 nodes are chosen for labeling. These nodes are labeled programmatically as "cliques or near-cliques" or "stars or near-stars". Anomalous nodes tend to follow particular patterns, and Akoglu et al. [1] discuss some of them. They explain the star or near-star and clique or near-clique patterns, in which users whose neighbors are well connected to each other (near-cliques) or are not connected (stars) turn out to behave strangely.

F. Calculating Threshold

For each metric we determine a threshold value on the outlier score $aScore$ that maximizes the F-score, which accounts for the number of false positives and false negatives in the labeled dataset:

$$F_{score} = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (3)$$

Its highest value (1) indicates the best classification of the labeled data, whereas its lowest value (0) indicates a false classification of the labeled data.

IV. EXPERIMENTAL RESULTS

This study conducted all steps included in the research methodology section over the Twitter dataset which is officially released by the Stanford Network Analysis Project (SNAP). The distribution model includes the distribution of $N_i$ vs. $E_i$, that is, the number of friends of a node and its number of friends of friends. This model shows the relationship between the number of friends of a particular node and its friends-of-friends. There should be a relation between the distributions of values on two-dimensional graphs. Importing the data into Matlab and calculating the fitting curve produces the regression shown in Fig. 1.

Figure 1. N vs. E Distribution Model

Every dot in the plot, which shows the concentration at that point, indicates the relation between $N_i$ and $E_i$ for a single user. The coefficient of determination, which shows the goodness of fit, is $R^2 = 0.3401$. Based on this regression, the related fitted curve is drawn in Fig. 2.
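The power-law model of the Compute Fitting Curve step and its coefficient of determination can be sketched as follows. The paper reports fitting in Matlab; this sketch swaps in an ordinary least-squares fit in log-log space, which is a common way to estimate a power law but is not claimed to be the authors' procedure. The function names and the sample data are hypothetical.

```python
import math


def fit_power_law(ns, es):
    """Fit E = C * N**theta by least squares on log(E) vs. log(N).

    All values must be strictly positive. Returns (C, theta).
    """
    lx = [math.log(n) for n in ns]
    ly = [math.log(e) for e in es]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    theta = sxy / sxx
    c = math.exp(mean_y - theta * mean_x)
    return c, theta


def r_squared(ys, predictions):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_y = sum(ys) / len(ys)
    ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_residual / ss_total


# Synthetic (N_i, E_i) pairs lying exactly on E = 2 * N**1.5.
ns = [1, 2, 4, 8, 16]
es = [2 * n ** 1.5 for n in ns]
c, theta = fit_power_law(ns, es)
r2 = r_squared(es, [c * n ** theta for n in ns])
```

On noise-free data the fit recovers C and theta and $R^2$ is essentially 1. A low value such as the reported 0.3401 means that much of the variance in $E_i$ is not explained by $N_i$ alone, which is the kind of spread the anomaly score is meant to rank.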

Figure 2. N vs. E Fitted Curve

Given the fitted curve, we also obtain the following values in order to reach the fitted line equation:

$$C_2(x) = a(x - b)^n \qquad (4)$$

where the coefficients $a$ and $b$ are 0.2915 (1, 1) and 1.4786e-7, respectively. It is important to note that the coefficients are calculated with 91% confidence bounds. As the parameter $n$ in the problem equals 2, hence:

$$E_i = 0.2915\, N_i^2$$

Program code was applied to 100 nodes with high degree, and 74 nodes were observed to follow the pattern according to [1]. Therefore, 74% of the users with high degree (high numbers of friends) followed the pattern. These nodes must also be assessed by their anomaly score, to check whether they have a high score. This comparison yields $F_{score}$ as the final score. The 74% is taken as the precision, and 61% is the recall, which covers those nodes that have a high anomaly score as well as conforming to one of the anomalous patterns. Thus, as in (3), the threshold would be as follows:

$$F_{score} = \frac{2(74)(61)}{74 + 61} = 66.87\%$$

The final $F_{score}$ shows the percentage of those people who have a high anomaly score and whose behavior is known to be truly anomalous. In other words, 66.87% of users with a high anomaly score contributed to malicious activities that have been detected in this research.

V. CONCLUSION

While online social networks have motivated millions of people throughout the world to come together and receive many benefits, they may also provide conditions for malicious activities. Much research has been conducted, and in this study the results of related research on anomalous activity detection in online social networks are discussed. Users still encounter such malicious activities on OSNs. Considering an OSN as a graph, the network is analyzed based on defined graph metrics. We applied the proposed methodology and labeled 100 nodes with a high probability of following an anomalous pattern. Moreover, the distribution model of the calculated metrics led to finding an anomaly score for each user based on their deviation from the fitted curve. Evaluating labeled nodes with high anomaly scores helped to find the threshold for identifying anomalous nodes, namely those which cross the threshold.

REFERENCES

[1] Akoglu, Leman, McGlohon, Mary, & Faloutsos, Christos. (2010). OddBall: spotting anomalies in weighted graphs. Paper presented at the Proceedings of the 14th Pacific-Asia conference on Advances in Knowledge Discovery and Data Mining, Volume Part II, Hyderabad, India.

[2] Chau, Duen Horng, Pandit, Shashank, & Faloutsos, Christos. (2006). Detecting fraudulent personalities in networks of online auctioneers. Paper presented at the Proceedings of the 10th European conference on Principle and Practice of Knowledge Discovery in Databases, Berlin, Germany.

[3] Chu, Zi, Widjaja, Indra, & Wang, Haining. (2012). Detecting Social Spam Campaigns on Twitter. In F. Bao, P. Samarati & J. Zhou (Eds.), Applied Cryptography and Network Security (Vol. 7341, pp. 455-472): Springer Berlin Heidelberg.

[4] Eberle, William, & Holder, Lawrence. (2007). Discovering Structural Anomalies in Graph-Based Data. Paper presented at the Proceedings of the Seventh IEEE International Conference on Data Mining Workshops.

[5] Jimeng, Sun, Huiming, Qu, Chakrabarti, D., & Faloutsos, C. (2005, 27-30 Nov. 2005). Neighborhood formation and anomaly detection in bipartite graphs. Paper presented at the Data Mining, Fifth IEEE International Conference on.

[6] Payne, J., Solomon, J., Sankar, R., & McGrew, B. (2008, 19-24 Oct. 2008). Grand challenge award: Interactive visual analytics palantir: The future of analysis. Paper presented at the Visual Analytics Science and Technology, 2008. VAST '08. IEEE Symposium on.

[7] Sequeira, Karlton, & Zaki, Mohammed. (2002). ADMIT: anomaly-based data mining for intrusions. Paper presented at the Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, Edmonton, Alberta, Canada.

[8] Silva, J., & Willett, R. (2008, 19-21 March 2008). Detection of anomalous meetings in a social network. Paper presented at the Information Sciences and Systems, 2008. CISS 2008. 42nd Annual Conference on.

[9] Stringhini, Gianluca, Kruegel, Christopher, & Vigna, Giovanni. (2010). Detecting spammers on social networks. Paper presented at the Proceedings of the 26th Annual Computer Security Applications Conference, Austin, Texas.

[10] Swing, E. (2008, 19-24 Oct. 2008). Award: Efficient toolkit integration solving the cell phone calls challenge with the Prajna Project. Paper presented at the Visual Analytics Science and Technology, 2008. VAST '08. IEEE Symposium on.

[11] Xu, Wei, Zhang, Fangfang, & Zhu, Sencun. (2010). Toward worm detection in online social networks. Paper presented at the Proceedings of the 26th Annual Computer Security Applications Conference, Austin, Texas.

[12] Lam, Ho-Yu, & Yeung, Dit-Yan. (2007). A Learning Approach to Spam Detection Based on Social Networks. Fourth Conference on Email and Anti-Spam.