Mining WiFi Data for Business Intelligence - IEEE Xplore

2013 Eighth International Conference on P2P, Parallel, Grid, Cloud and Internet Computing

Mining WiFi Data for Business Intelligence
Deepali Arora, Stephen W. Neville and Kin Fun Li
Department of Electrical and Computer Engineering, University of Victoria
P.O. Box 3055 STN CSC, Victoria, B.C., CANADA, V8W 3P6
{darora,sneville,kinli}@ece.uvic.ca

Abstract—WiFi networks make it easy to access email, the web, and other Internet applications while on the move. However, deploying additional WiFi hotspots that both increase coverage and enhance user quality of service depends largely on the number of existing access points and on user densities. Extracting usage patterns and information from the available data has the potential to answer several business-focussed questions. In this paper, we show that by plotting WiFi locations in a two-dimensional space of incoming (downloading) and outgoing (uploading) data amount, in conjunction with simple k-means clustering, it is possible to gain insight into basic data usage patterns. When combined with information about the geographic location of the WiFi hotspots, such analysis can answer questions related to spatial patterns of data usage and support informed business decisions, including charging customers at selected locations for WiFi service.

978-0-7695-5094-7/13 $31.00 © 2013 IEEE. DOI 10.1109/3PGCIC.2013.67

I. INTRODUCTION

WiFi, which stands for wireless fidelity, refers to wireless local area networks based on IEEE 802.11 standards such as 802.11b and 802.11g, and has become one of the most popular choices for communication [1]. Annual industry revenue for WiFi communications is expected to exceed 4 billion [2], associated with an increase in mobile data traffic, which is expected to reach around 17 million terabytes by 2014. The number of WiFi hotspots continues to grow and is expected to reach nearly 2.7 million globally by 2014 [3]. WiFi hotspots are considered an important part of the wireless infrastructure: they seek to enhance user experience, offload busy mobile broadband networks, and provide a platform for value-added services [3]. WiFi hotspots are no longer limited to traditional prime locations such as airports or hotels and are moving fast to neighbourhood retail outlets, parks, and shopping complexes.

Deployment of WiFi hotspots, however, faces a number of challenges, including providing coverage, capacity, user density and security in a manner that enhances a user's overall quality of service (QoS) while keeping the network secure. Telecommunication operators have direct control over the first two challenges, i.e., the capacity and coverage of WiFi hotspots, which are influenced by the availability of WiFi access points (APs), i.e., the direct means of communication between the hotspot LAN and the user's device. Telecommunication companies have access to information about the number of WiFi hotspots installed at any location. However, extracting additional information about user density, behaviour and patterns at WiFi hotspots is a challenging task and requires knowledge of various data mining techniques. Such information, when combined with the geographical location of the WiFi hotspots, has the potential to aid business decisions related to the addition (deletion) of new (existing) APs based on assessing user density, behaviour and patterns and their trends over time.

Data mining has been used in the telecommunication industry for a number of applications, including marketing, security and network reliability. A number of approaches have been adopted which attempt to find solutions to a range of questions associated with the telecommunication industry. These approaches include classification, clustering, visualization and traffic monitoring [4],[5],[6]. For example, classification techniques have been used to determine churn and non-churn customers [7]; graph theory approaches have been used to obtain information about customers' social network structure [8]; clustering approaches have been used for intrusion or fraud detection [9],[10],[11]; and both clustering and statistical data analysis approaches have been used for fault detection [12] and network reliability [13],[14]. Some studies have also used these approaches to determine user density or locations using data from WiFi networks [15],[16],[17],[18]. Here, we explore the use of a clustering-based approach with WiFi access data about incoming (downloading) and outgoing (uploading) traffic rates in relation to more than 200 hotspots across an urban area, for mining information about user density and behaviour.

Such information, when related to the geographical location of the hotspots, is able to provide information about incoming (downloading) and outgoing (uploading) data rates in relation to user density at various hotspots and their usage patterns. Usage in this context refers to the number of users trying to access the system and their incoming (downloading)/outgoing (uploading) data rates. Having few access points at a location with high user density affects both coverage and QoS. Since access to WiFi networks is a service provided to customers to access the Internet on the move, data mining techniques can potentially help to make decisions about the assignment of future resources for infrastructure development such that both QoS and coverage can be improved.

This paper is organized as follows. Related work is discussed in Section II and the objective of selecting the actual data used in this paper is explained in Section III. A brief introduction to clustering is given in Section IV and results are presented in Section V. Finally, conclusions are given in Section VI.

II. RELATED WORK

A number of studies have assessed user density, length of stay or number of access points at WiFi hotspots using different data mining techniques. For example, [15] developed a framework known as ToGo that attempts to predict the length of time a customer will stay at a WiFi hotspot. In the ToGo framework, a mobile device reports its sensor readings to the access point (AP), which runs a machine learning algorithm to interpret user behavior. It then combines this feature with other features, including received signal strength (RSSI), and periodically predicts the user's dwell time, which gives an estimate of the time the user is present at that location. ToGo relies on the hypothesis that users' dwell times are directly correlated with their activities at any WiFi hotspot. [16] proposed a density-based clustering approach that attempts to learn geographical locations directly from a set of raw WiFi measurements. Nurmi and Bhattacharya [19] also proposed a non-parametric density-based clustering approach to extract place information from discontinuous global positioning system (GPS) measurements and compared their results with those obtained using the k-means clustering algorithm. [17] studied the usage patterns in urban WiFi networks and found that modem users place the highest demand on the network, while WiFi hotspot users have moderate network demands and smart phone users place the lowest demands. [18] proposed an approach called Serendipity, which focusses on locating WiFi access points in an unsupervised manner using radio scans collected by ordinary smart phone users. They extracted dissimilarities between all pairs of WiFi APs from these radio scans and estimated the relative positions of APs by analyzing the dissimilarities with a multidimensional scaling technique.

While both clustering and signal strength approaches have been used to estimate AP locations or usage patterns, here we use the relationship between incoming (downloading) and outgoing (uploading) data rates in the context of individual users and APs to identify high traffic locations.

III. WIFI HOTSPOT DATA USED FOR DATA MINING

The accuracy of any data-mining technique is dependent upon the underlying data used. For the analysis conducted here, we have used data collected by the non-profit organization Ile Sans Fil (French for “Island Without Wires”, also known as ISF), available at the CRAWDAD website [20]. The data set contains user session traces collected from a large number of free Wi-Fi hotspots in Montreal, Quebec, Canada over three years. The objective of collecting these data is to provide solutions for managing hotspots, detecting and controlling bandwidth hogging, and assessing how WiFi networks are being used. The data collected by the ISF team include user identifications, medium access control (MAC) addresses, login and logout times, hotspot identifications, and the amount of incoming (downloading) and outgoing (uploading) data transferred in bytes.

The data collection methodology used by ISF was as follows. When a user connected to a gateway installed on a wireless router, the gateway opened a session on the authentication server, which assigned a session number (connection id) and gave it to the gateway. The gateway then calculated the total incoming (downloading) and outgoing (uploading) traffic in bytes transferred during that session, and sent it periodically to the authentication server for each connected user. The logged data were stored in a relational database in normalized form, without any pre-processing, and retrieved as needed. Although the data trace contains 587,782 user sessions for 69,689 distinct users, collected from 206 hotspots, here we use a subset of the available data for our analysis to illustrate the proof of concept. For our analysis we followed the clustering approach explained below.

IV. CLUSTERING WIFI HOTSPOTS FOR DETECTING USAGE PATTERNS

Clustering is an unsupervised machine learning problem and is defined as the technique of organizing objects into groups, known as clusters, whose members are similar in some way. It is a common technique for statistical data analysis used in many fields, including pattern recognition, image analysis, information retrieval, and bioinformatics. A clustering algorithm should be
• scalable,
• able to deal with different types of attributes,
• capable of discovering clusters with arbitrary shape,
• able to deal with noise and outliers, and
• insensitive to the order of input records.

A number of clustering algorithms are available, including hierarchical clustering, which uses distance connectivity in building clusters; k-means clustering, in which each cluster is represented by a single mean vector; the expectation-maximization clustering algorithm, which uses statistical distributions to form clusters; and density-based clustering, in which clusters are formed by connecting dense regions in the data space [21],[22]. For our analysis we have used the k-means clustering algorithm because of its simplicity; it is discussed below.

A. K-means clustering

The k-means clustering algorithm is one of the simplest unsupervised machine learning algorithms; its objective is to group n objects into K clusters [21],[22]. The K centroids, one for each cluster, are placed as far as possible from each other. Each data point is then associated with its nearest centroid. The algorithm aims at minimizing an objective function, in this case a squared error function,

$$J = \sum_{j=1}^{K} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2 \qquad (1)$$

where $\| x_i^{(j)} - c_j \|^2$ is a measure of the distance between a data point $x_i^{(j)}$ and its cluster centre $c_j$. The k-means clustering algorithm selects an initial set of K clusters with an a priori estimate of their centroids, and data points are assigned to these clusters. The value of the cost function J is then calculated. Then the positions of the K centroids are recalculated and data points are reassigned with the objective of reducing the value of the cost function J. The process is repeated until the centroids no longer move and the cost function J cannot be reduced any further. The result is a separation of the data into clusters as a function of the number of clusters desired. Since the number of clusters cannot be determined a priori, the k-means algorithm requires that the above process be repeated for different numbers of clusters K. The minimum cost function J for each K is then used to assess the optimum number of clusters into which the data can be clustered. Although increasing K may result in smaller cost function values, it may also lead to overfitting, and some subjective judgement is required.

V. RESULTS

In order to extract information from the data collected by the ISF team, we first plot APs (or node locations) in the two-dimensional space of incoming (downloading) and outgoing (uploading) data amount. When organized in this manner, as shown in Figure 1a, we first observe a broadly linear relationship between incoming (downloading) and outgoing (uploading) traffic amount when viewed across more than 200 APs. Note the logarithmic scale on both axes, given the multiple orders of magnitude difference between data amounts across the APs. The outgoing (uploading) data amount is typically an order of magnitude less than the incoming (downloading) data amount. This is an expected result, since a normal user typically downloads more data than he/she uploads. In Figure 1a the APs are clustered into three groups, as an example.

Figure 1b plots individual users, as identified by their unique MAC address, in the same two-dimensional space of incoming (downloading) and outgoing (uploading) data amount. Similar to APs, we observe a linear relationship between incoming (downloading) and outgoing (uploading) data amount, and an outgoing (uploading) data amount that is an order of magnitude lower than the incoming (downloading) data amount, for each user. Again, as an example, individual users are clustered into three groups. Figure 1, panels c and d, show the same results as panels a and b, respectively, but with the APs and individual users clustered into four groups. Figure 2, panel a, shows the distribution of the total number of unique users at each AP, sorted in descending order. Figure 2a shows that the number of users across more than 200 APs varies from up to 75 users (at the busiest AP) to just a few users (at the least busy AP).

The results shown in Figure 1, when combined with information about the geographic location of the APs, have the potential to provide answers to a number of business-focussed questions. For example, it is possible to determine whether the access points at certain locations are enough to serve the number of users (based on user densities) in an area and whether adding more access points would enhance users' QoS without increasing interference to neighbouring users. Additionally, based on the total incoming (downloading) and outgoing (uploading) data amount at any access point (Figure 1a,c) and the users (Figure 1b,d) active at that location (based on user MAC address), it is possible to identify locations where users place stringent demands on networks, and fair usage policies can be designed accordingly to limit network access per user. Additionally, telecommunication companies can also design pricing policies limiting free access and specify maximum incoming (downloading)/outgoing (uploading) data amounts at the access points with high user density. Designing fair usage policies at locations with high user density and traffic would be beneficial to users as a whole and to network providers.

Finally, Figure 2 (panels b and c) shows the effect of increasing the pre-determined number of groups into which APs are clustered (i.e., increasing K), plotted against the mean silhouette values. The mean silhouette value is a measure of how close each point in one cluster is to points in the neighboring clusters. This measure ranges from -1 to +1. A value of +1 indicates points that are very distant from neighboring clusters; 0 indicates points that are not distinctly in one cluster or another; and -1 indicates points that are probably assigned to the wrong cluster. Mean silhouette values close to +1 thus indicate the best clustering. For both APs and individual MAC users, a value of K = 3, found experimentally, yields the highest mean silhouette value, indicating that APs and MAC users are best clustered when divided into three groups.
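The k-means procedure of Section IV-A and the silhouette-based choice of K described above can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the code used for the reported results: the points are synthetic log10(download), log10(upload) pairs rather than ISF data, the helper names (`farthest_first`, `kmeans`, `mean_silhouette`) are hypothetical, and the farthest-first initialization is one possible reading of placing the centroids "as far as possible from each other".

```python
import math


def farthest_first(points, k):
    """Pick k starting centroids that lie far apart (assumed initialization)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids


def kmeans(points, k, max_iter=100):
    """Return (labels, centroids, J), where J is the squared-error cost of Eq. (1)."""
    centroids = farthest_first(points, k)
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                  for p in points]
        # Recompute each centroid as the mean of its assigned points.
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            updated.append(tuple(sum(axis) / len(members) for axis in zip(*members))
                           if members else centroids[j])
        if updated == centroids:  # centroids no longer move
            break
        centroids = updated
    cost = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return labels, centroids, cost


def mean_silhouette(points, labels):
    """Mean silhouette value over all points (ranges from -1 to +1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != lab)
        total += (b - a) / max(a, b)
    return total / len(points)


# Synthetic points in log10 space: three tight groups (illustrative, not ISF data).
offsets = (-0.2, 0.0, 0.2)
centres = [(5.0, 4.0), (7.0, 6.0), (9.0, 8.0)]
points = [(cx + dx, cy + dy) for cx, cy in centres for dx in offsets for dy in offsets]

scores = {}
for k in range(2, 6):
    labels, _, cost = kmeans(points, k)
    scores[k] = mean_silhouette(points, labels)
    print(f"K={k}  J={cost:.3f}  mean silhouette={scores[k]:.3f}")
best_k = max(scores, key=scores.get)
```

On these synthetic points the mean silhouette peaks at the number of groups that were built into the data, while J keeps shrinking as K grows, which is why J alone is not a sufficient criterion for choosing K.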

VI. DISCUSSION AND CONCLUSIONS

The results shown in Figure 1, when combined with information about the geographic location of the APs, have the potential to provide answers to a number of business-focussed questions. Unfortunately, we did not have access to information about the geographical location of the APs. The results in Figure 1a, for example, can be used to ask where in the two-dimensional space of incoming (downloading) and outgoing (uploading) data amount a given AP lies, and which users were active at that location, as seen in their own two-dimensional space of incoming (downloading) and outgoing (uploading) data amounts (as in Figure 1b). This information can potentially guide policies around usage or pricing. It can also be determined whether more access points need to be installed at certain locations to enhance QoS for the users. The clustering algorithm we have used clusters APs into a predetermined number of groups to illustrate a proof of concept, but how meaningful such clustering is can only be determined when information about the geographical location of APs and the kind of businesses where APs are located is additionally analyzed. When the geographic location of APs is available, the information shown in Figure 1 can also potentially be combined with a spatial representation of the users at AP locations, for example using Google Maps, to give operators a visual representation of the business intelligence. Finally, if data are being collected on a continuous basis, then it is possible to obtain trends of activity and usage patterns over time to see if a given AP moves up or down the two-dimensional space of incoming (downloading) and outgoing (uploading) data amount, showing an increase or decrease in activity, respectively. This information can help the operator in capacity and usage planning.
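The idea of tracking whether an AP moves up or down the two-dimensional space over time can be sketched as follows. This is only an illustration under assumptions: the per-period totals are hypothetical, the function names are invented for this example, and measuring movement as the mean change of the two log10 coordinates with a 0.1-decade threshold is an arbitrary choice rather than a method from this paper.

```python
import math


def activity_shift(before, after):
    """Displacement of one AP along the diagonal of the log-log plane.

    `before` and `after` are (incoming_bytes, outgoing_bytes) totals for two
    collection periods of equal length. A positive result means the AP moved up.
    """
    d_in = math.log10(after[0]) - math.log10(before[0])
    d_out = math.log10(after[1]) - math.log10(before[1])
    return (d_in + d_out) / 2


def classify(shift, threshold=0.1):
    """Label the shift; the 0.1-decade threshold is an arbitrary illustration."""
    if shift > threshold:
        return "increase"
    if shift < -threshold:
        return "decrease"
    return "stable"


# Hypothetical byte totals for two periods, keyed by AP identifier.
period_1 = {"ap1": (8_000_000, 550_000),
            "ap2": (1_000_000, 100_000),
            "ap3": (5_000_000, 400_000)}
period_2 = {"ap1": (80_000_000, 5_500_000),
            "ap2": (950_000, 100_000),
            "ap3": (500_000, 40_000)}

trends = {ap: classify(activity_shift(period_1[ap], period_2[ap]))
          for ap in period_1}
print(trends)
```

Working in log10 units keeps the comparison consistent with the logarithmic axes of Figure 1, so a tenfold change in traffic counts the same for a quiet AP as for a busy one.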


[Figure 1: four scatter plots with logarithmic axes, showing uploading data amount (vertical axis) against downloading data amount (horizontal axis). Panels (a) and (c): "Downloading vs. uploading data amount for various access points". Panels (b) and (d): "Downloading vs. uploading data amounts for various MAC addresses".]

Fig. 1: Total incoming (downloading) versus outgoing (uploading) traffic amount (bytes) at each access point (node ID) (panels a,c) and for each unique user (identified by their MAC address) (panels b, d) when divided into 3 (panels a, b) and 4 clusters (panels c, d).
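As a rough illustration of how the points in Fig. 1 could be derived from session records, the sketch below sums incoming and outgoing bytes per access point and per MAC address and maps the totals to log10 coordinates. The record layout, the sample values and the helper names are assumptions made for this example; they do not reproduce the ISF schema or data.

```python
import math
from collections import defaultdict

# Hypothetical session records: (user MAC, hotspot ID, incoming bytes, outgoing bytes).
sessions = [
    ("aa:01", "ap1", 5_000_000, 400_000),
    ("aa:02", "ap1", 2_000_000, 100_000),
    ("aa:04", "ap1", 1_000_000, 50_000),
    ("aa:01", "ap2", 30_000, 2_000),
    ("aa:03", "ap2", 70_000, 8_000),
    ("aa:03", "ap2", 900_000, 90_000),
]


def totals_by(records, key_index):
    """Sum incoming and outgoing bytes per key (0 = MAC address, 1 = hotspot)."""
    totals = defaultdict(lambda: [0, 0])
    for rec in records:
        totals[rec[key_index]][0] += rec[2]
        totals[rec[key_index]][1] += rec[3]
    return {key: tuple(value) for key, value in totals.items()}


def log_point(incoming, outgoing):
    """Map byte totals to log10 coordinates (totals must be greater than zero)."""
    return (math.log10(incoming), math.log10(outgoing))


per_ap = totals_by(sessions, 1)    # one point per access point (panels a, c)
per_mac = totals_by(sessions, 0)   # one point per unique user (panels b, d)

# Unique users per hotspot, sorted in descending order (cf. Fig. 2, panel a).
users = defaultdict(set)
for mac, ap, _, _ in sessions:
    users[ap].add(mac)
user_counts = sorted(((ap, len(macs)) for ap, macs in users.items()),
                     key=lambda item: -item[1])

print(per_ap)
print(user_counts)
```

Hotspots or users with a zero byte total would need separate handling before taking logarithms.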

The analysis conducted here shows that using machine learning techniques, such as clustering, it is possible to extract information from telecommunication data which can potentially answer several business-focussed questions.

[Figure 2: (a) "Unique MAC addresses at each node sorted in descending order" (horizontal axis: node number; vertical axis: number of unique MAC addresses); (b) mean silhouette value for access points versus k; (c) mean silhouette value for users (identified by MAC addresses) versus k.]

Fig. 2: Total number of users (based on MAC ID) at different access points (panel a) and number of clusters K plotted against mean silhouette value for k-means clustering for APs (panel b) and individual MAC users (panel c). Mean silhouette values close to +1 indicate the best clustering, as explained in the text.

REFERENCES

[1] P. Henry and H. Luo, “Wifi: what’s next?” Communications Magazine, IEEE, vol. 40, no. 12, pp. 66–72, 2002.
[2] G. William, M. Richard, T. Stuart, and T. Andrew, “Profiting from the rise of wi-fi new, innovative business models for service providers,” Cisco Technical Report, 2012.
[3] Global developments in public wi-fi. [Online]. Available: http://www.wballiance.com/resource-center/wbaindustry-report/
[4] F. Malabocchia, L. Buriano, M. Mollo, M. Richeldi, and M. Rossotto, “Mining telecommunications data bases: an approach to support the business management,” in Network Operations and Management Symposium, 1998. NOMS 98., IEEE, vol. 1, 1998, pp. 196–204.
[5] L. Carbonara, H. Roberts, and B. Egan, “Data mining in the telecommunications industry,” in Principles of Data Mining and Knowledge Discovery, ser. Lecture Notes in Computer Science, J. Komorowski and J. Zytkow, Eds., vol. 1263. Springer Berlin Heidelberg, 1997, pp. 396–396.
[6] G. M. Weiss, “Data mining in the telecommunications industry,” in Encyclopedia of Data Warehousing and Mining, Second Edition, J. Wang, Ed., 2009, pp. 486–491.
[7] T. Rashid, “Classification of churn and non-churn customers for telecommunication companies,” in CSC Journals. [Online]. Available: www.cscjournals.org/csc/manuscript/Journals/IJBB32.pdf
[8] L. Cutillo, R. Molva, and M. Onen, “Analysis of privacy in online social networks from the graph theory perspective,” in Global Telecommunications Conference (GLOBECOM 2011), 2011 IEEE, 2011, pp. 1–5.
[9] R. Nussbaum, A. H. Esfahanian, and P.-N. Tan, “Clustering social networks using distance-preserving subgraphs,” in Advances in Social Networks Analysis and Mining (ASONAM), 2010 International Conference on, 2010, pp. 380–385.
[10] W. Zhuang, Y. Ye, Y. Chen, and T. Li, “Ensemble clustering for internet security applications,” Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, vol. 42, no. 6, pp. 1784–1796, 2012.
[11] A. Mi and L. Hai, “A clustering-based classifier selection method for network intrusion detection,” in Computer Science and Education (ICCSE), 2010 5th International Conference on, 2010, pp. 1001–1004.
[12] M. Manar and S. Foda, “Fault location in distribution networks using clustering techniques,” in Microelectronics, 2001. ICM 2001 Proceedings. The 13th International Conference on, 2001, pp. 197–202bis.
[13] N. Banerjee and P. Khilar, “Distributed intermittent fault diagnosis in wireless sensor networks using clustering,” in Integrated Intelligent Computing (ICIIC), 2010 First International Conference on, 2010, pp. 264–269.
[14] K. Suganthi, B. Sundaram, K. Kumar, J. Ashim, and S. Kumar, “Improving energy efficiency and reliability using multiple mobile sinks and hierarchical clustering in wireless sensor networks,” in Recent Trends in Information Technology (ICRTIT), 2011 International Conference on, 2011, pp. 257–262.
[15] J. Manweiler, N. Santhapuri, R. Roy Choudhury, and S. Nelakuditi, “Predicting length of stay at wifi hotspots,” in IEEE INFOCOM, Apr. 2013.
[16] O. Dousse, J. Eberle, and M. Mertens, “Place learning via direct wifi fingerprint clustering,” in Proceedings of the 2012 IEEE 13th International Conference on Mobile Data Management (mdm 2012), ser. MDM ’12. Washington, DC, USA: IEEE Computer Society, 2012, pp. 282–287.
[17] M. Afanasyev, T. Chen, G. Voelker, and A. Snoeren, “Usage patterns in an urban wifi network,” Networking, IEEE/ACM Transactions on, vol. 18, no. 5, pp. 1359–1372, 2010.
[18] J. Koo and H. Cha, “Unsupervised locating of wifi access points using smartphones,” Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, vol. 42, no. 6, pp. 1341–1353, 2012.
[19] P. Nurmi and S. Bhattacharya, “Identifying meaningful places: The non-parametric way,” in Pervasive Computing, ser. Lecture Notes in Computer Science, J. Indulska, D. Patterson, T. Rodden, and M. Ott, Eds. Springer Berlin Heidelberg, 2008, vol. 5013, pp. 111–127.
[20] CRAWDAD: A Community Resource for Archiving Wireless Data At Dartmouth. Global developments in public wi-fi. [Online]. Available: http://crawdad.cs.dartmouth.edu/wifidog
[21] S. Theodoridis and K. Koutroumbas, Introduction to Pattern Recognition. Academic Press, 2003.
[22] k-means clustering. [Online]. Available: http://en.wikipedia.org/wiki/K-means_clustering