Traffic Flow Classification and Visualization for Network ... - IEEE Xplore

11 downloads 0 Views 739KB Size Report
illustrate network communication for forensic analysis. In primarily analysis process, the timeline of events is reconstructed from traffic logs. An analyst can track ...
2015 IEEE 29th International Conference on Advanced Information Networking and Applications

Traffic Flow Classification and Visualization for Network Forensic Analysis Nuttachot Promrit Faculty of Information Technology King Mongkut’s University of Technology North Bangkok Bangkok, Thailand nuttachot|@|hotmail|.|com

Anirach Mingkhwan Faculty of Industrial and Technology Management King Mongkut’s University of Technology North Bangkok Prachinburi, Thailand anirach|@|ieee|.|org behaviour of hosts or users and identify the disorder that may cause the problem.

Abstract-This paper presents an iterative visualization technique including the timeline and parallel coordinates to illustrate network communication for forensic analysis. In primarily analysis process, the timeline of events is reconstructed from traffic logs. An analyst can track the related anomaly event on-demand. In addition the details of abnormal and normal activities are shown in multiple dimensions of parallel coordinates.

In the network security, traffic classification can use automatic method or visualization method. Although we can use intrusion detection system to detect anomalies on the network automatically without a need to see anomalies image. There are at least two reasons why we introduced information visualization to present the activity on the network 1) anomaly traffic and application traffic visualization allow us to see the relationship of variables such as source IP address, source port number, time, destination port number, destination IP address and packet size of traffic flow with object characteristics such as proximity, colour, size, orientation, direction shape, etc. It enables us to discover communication patterns on the network quickly from human intuition instead of showing the letter and statistical data which requires more thinking ability 2) the visualization allows the analyst discovers new anomaly and application traffic patterns easily or even see the different types of tools, algorithm or operating systems, etc. of each anomaly traffic.

The novelty of this research is not a presentation of the timeline and parallel coordinates technique, but iterative visualization framework to illustrate both anomaly traffic and application traffic pattern. We applied frequent item-set mining to search dominant traffic flow and classify them by traffic flow shape and entropy. Although some studies have been applied frequent itemset mining with traffic dataset, but as we have known, this is the first research to 1) take advantages of the frequent item-set mining and parallel coordinates, which allow us to find both the anomaly traffic and application traffic and it can easily understand the patterns of traffic flow with the multi-dimensional visualization, and 2) classify the application traffic from the entropy values of traffic flow discovered by frequent item-set mining. This method is able to classify the encrypted traffic data and it does not violate a user privacy.

The remaining of the paper is as follows. Section 2 provides a discussion on related researches on network anomaly detection, application traffic classification and information visualization. Section 3 depicts the approach behind the research put forward in this paper. Section 4 describes the to show anomaly and multi-dimension visualization application traffic and Section 5 provides an account of the classification. Section 6 shows analysis with the timeline of events. Section 7 depicts system design. Section 8 describes on the obtained results. The paper is concluded in Section 9.

The results of this research and development of a visual network communication tool can: 1) show abnormalities and normal communication activities, 2) have application traffic classification 92% accurate, 3) be a visual network communication prototype which helps an analyst to find the cause of the network malfunction. Index Terms—Traffic Flow Extraction, Entropy, Traffic Flow Classification, Behavior Analysis

1. INTRODUCTION In the age where we found the Internet everywhere, many people can use many devices such as laptop computers, tablet s or smartphones to access the network. However, when there is network malfunction, it is not always easy to analyze and find the cause of the problem.

2. RELATED WORK We found that in the previous research, such as a paper by Choi et al [1] presented a technique for anomaly detection from the 6 unique traffic flow shapes with 4 features include: 1) source IP address 2) destination IP address 3) destination port and 4) packet size and displayed each anomaly traffic with parallel coordinates.

When anomaly or damage occurs on the network, the investigation can be processed by piecing the event sequence that includes a lot of both normal and abnormal activity. The network forensic process allows an analyst to understand the 1550-445X/15 $31.00 © 2015 IEEE DOI 10.1109/AINA.2015.207

358

Parallel coordinates visualization was proposed by Inselberg [2] that it is a techniques to display multi-dimensional data. Theoretically, parallel coordinates, can illustrate the unlimited number of dimensions. While another technique such as a line graph or a bar graph can shows the maximum number of dimensions which human is able to understand as much as possible, 3 dimensions. So it is a proper technique to show the network communication patterns that consists of multidimensional data.

purpose of this research is showing dominant patterns of communication network, but from empirically evidence, the relationship of items are representative network traffic flow as well. Therefore in our research, we applied frequent item-set mining to extract the traffic flow instead of applying hash function, as in the research of Choi et al [1]. It can find patterns of attacks including application traffic without a need to know the characteristic of traffic flows in advance. After traffic flow was found, we classified it by comparing with well-known attack graphical signatures to classify the network attack and comparing with training datasets to classify the application traffic as well.

(a)

(b)

!

Another research, I. Paredes-Oliva et al [5], applied frequent item-set mining to detect network attacks. They used machine learning techniques such as C4.5 to create a decision tree and used frequent item-set mining to find traffic flow.

!

U.K. Chaudhary et al [6] applied association rule mining for application traffic classification. Traffic flow is created by grouping simple flows (a group of packets has a same source IP address, source port number, destination IP address, destination port number, and protocol) with model-based clustering algorithm. For the next step, traffic features of traffic flow such as IP address and port number were compared with association rules. Each rule which related to IP address and port number of a real network application was created by the association rule algorithm.

(c)

Figure 1. Host Scan, Internet Worm Attack

Unfortunately, the two traffic flow classification techniques [5], [6] still rely on port number and IP address, while our research classify traffic flow by considering the shape and entropy of traffic flow instead of using IP address and port number directly.

For example, the host scan attack attempts to send probe packets to several destination IP address in the range of subnet to discover hosts on the network that are opening service and Internet worm which is a program that attempt to exploit the vulnerability of the target host to copy itself is shown in Fig. 1 (a).

Research features which are the basis and inspiration for the new concept in this research are summarized in Table 1.

However, the image of host scan and Internet worm on parallel coordinates with only 4 features are not quite different from predefine graphical attack signatures in Fig. 1 (b). So, it is difficult to see the attack that have variety patterns by attack tools, algorithms or operating system, etc. In the left of Fig. 1 (a), we can only see lines distributed on the destination IP address axis. Therefore, we modified the parallel coordinates by adding time (packet arrival time) axis that it can show more traffic flow patterns [3] as Fig. 1 (c).

3. BACKGROUND Frequency Item-set Mining: frequency item-set mining is a process of extracting patterns or item-sets from transactions. Generally item-sets are used as input for creating association rules. Frequency item-set is defined as follows. Let I = {i1, ..., in} is a set of n items and D = {t1, ..., tm} is a set of m transactions. Each transaction is a subset of items in set I, while the frequency item-set is a subset of the items in I that is also a subset of at least s transaction in D [4].[7] In our research context, transaction is a simple flow or network packet, while items are traffic features of simple flow or packet header itself such as source IP address, source port number and simple flow size, etc. So item-sets are traffic features in a dataset that are frequently found, such as {SIP = 10.182.221.186, DIP = 10.182.208.1, DPort = 53} or {Simple Flow Size = 1}. These item-sets are created when the support value (the ratio of item-set frequency per number of transaction) is greater than or equal to a threshold. Entropy-based Traffic Analysis: entropy-based is a simple method which can be used to represent the distribution of the data in each dimension. Instead of using a histogram, we use

However, sometimes network traffic visualization with the parallel coordinates is more clusters. A research by Choi et al. [1] presented a flow extraction method with hash function and displayed each network attack with parallel coordinates. This technique can process in real-time, but it needs to know the shape of the attack before. Also, it only shows the attack flow. It is not enough for forensic analysis to understand all activities including application traffic activities. Instead of displaying the communication network image with parallel coordinates, E. Glatz et al [4] applied the frequent item-set mining technique to discover patterns that hidden within the dataset and showed the relationship of items (the data in different dimensions) with hyper graphs. Although the

359

Table 1. Summarize of related research features -

No use well-known port number to classify ! !

Simple Flow Simple Flow

Visualization

Flow Extraction

Anomaly detection

Application Traffic Classification

Parallel coordinates -

Hash Function FIM

Graphical Signatures Histogram-based

4. 5.

Fast Detection, 2009 [1] Anomaly Extraction, 2012 [7] Visualizing Big Network Traffic, 2014 [4] Practical Anomaly, 2012 [5] Flow Classification, 2010 [6]

6.

Our Research

1. 2. 3.

Datasets

Hyper graphs

FIM

-

-

N/A

Simple Flow

-

FIM Clustering

C4.5-Decision Tree -

Association Rules

-

Parallel coordinates

FIM

Graphical Signatures

Flow Entropy

!

Simple Flow Simple Flow Simple Flow + Network Traffic

the entropy-based method to reduce the multi-value of the histogram to only one value. Our research uses entropy-based method called sample entropy for classify application traffic. Sample entropy is a technique that Anukool Lakhina et al [8] used to represent the distribution of traffic features to detect anomalies on the network-wide. As a part of the information theory, sample entropy is a method for information average calculation, as in equation (1). !&$'("#$ % ) *+, - "#$ %

(1)

(a)

Let pi is a probability of the event and b is the base. In other words entropy in (1) is an expectation equation which is multiplied by -1 to change a negative value from log function to a positive value.

Figure 2. Two difference DNS traffic patterns on!parallel coordinates quickly see packet arrival, time distribution and the behavior of process on the host at each communication side. Such behavior of the server process is often used a! low port number. In addition we can see the server process that typically send a large size packet from a packet size axis.

Entropy is able to show the distribution of the traffic feature in equation (2). Let X is a traffic feature like source IP address, port number and packet size etc.

H(X) = . !1$'("&0/% ) *+, 2 "&0/%

(b)

(2)

The parallel coordinates in Fig. 2 (a) and (b) shows the DNS traffic that replied by the server process to the client process. We can clearly see the different behavior of two clients. The client (Mac OSX client) in Fig. 2 (a) communicate with DNS Server, with a large number of random port numbers while the client in Fig. 2 (b) communicates with DNS Server, with lower port number. However, the main problem of original parallel coordinates is it cannot show the distinction between the line graph of one record and multiple record that have the same value. That line is drawn on those axes at the same position. Therefore, we improve the parallel coordinates by adding grey points with difference contrast level on each axis to show the frequency of data in that area (in the same way of the histogram, the height of the bin represents the frequency in the class interval). The more overlap area (more frequency), the darker area.

When ni is a number of frequencies of each event, i {i = 1, ..., N} and S is the total number of frequencies. Entropy values are between 0 - log2(N), which the high entropy represents high data distribution. In contrast, the low entropy represents low data distribution. 4. ANOMALY AND APPLICATION TRAFFIC VISUALIZATION To be able to see the! network attack and application traffic pattern that may come from variety of tools, algorithm or operating system, etc., we designed parallel coordinates by adding traffic features such as the source port number, time (packet arrival time). We plot a line from each packet header rather than simple flow, so the graph that lay on each axis of parallel coordinates shows the distribution of the traffic feature. 6 features or 6 dimensions of parallel coordinates are source IP address (SIP), source port number (SPort), time, destination port number (DPort), destination IP address (DIP) and packet size (size), as shown in Fig. 2. Fig. 2 shows DNS traffic with 6-axis parallel coordinates. We designed to lay time axis between two sides of the network communication and plotted the line on the source and destination port number axis as absolute orientation. It makes the line of port number 0 is laid at the bottom and the line of port number 65535 is laid at the top of the axis. The analyst can

5. ANOMALY SIGNATURES AND APPLICATION TRAFFIC PATTERNS The anomaly traffic is classified by graphical signature without learning patterns from past data. Unfortunately, we were unable to classify the application traffic as well as the anomaly traffic because many network application create traffic that the graphical signatures are not different. Therefore, we classify traffic flow with another traffic feature, which

360

based on two constrains: 1) it is able to classify encrypted network traffic, and 2) it does not violate the privacy of users. At last, we select the traffic flow entropy for classification. Examples of anomaly traffic and application traffic classification are shown in the following situation. Assume that in the official time of one day, some users watched the television on the Internet that made network throughput was dropped. Many users complained and inform an admin to find out the cause of the problem. The investigation began with frequency item-set mining of the network traffic logs. The large traffic flow from IP Address 10.0.137.7 to 10.182.221.186 was found. From traffic logs, the admin found the destination IP address was used by Bob account.

Figure 4. Alpha flow with!parallel coordinates When alpha flow was shown on 6-axis parallel coordinates, an analyst is able to see the traffic flow shape which is a bit different from the signature. The shape on the parallel coordinates looks like a pyramid or a dome that the peak is on time axis and the base is on the source and destination port number axis. It allows the analyst to see 5 minutes traffic flow behavior of the process on each network communication side via port number 1464 from both sides. By these characteristics, it seems to be a streaming traffic between the VLC media player and streaming server. In the same case, if it is shown with 4axis parallel coordinates [1] we will see it looks like a straight line only.

The detail of traffic flow was discovered by the frequency item-set mining which was shown below. Source IP Address: 10.0.137.7 Source Port Number: 1464 Destination IP Address: 10.182.221.186 Destination Port Number: 1464 #Packets: 13,242 Packets Average Packet Size: 1,344 Bytes From the detail we found that the shape of traffic flow above matched with the alpha flow signature as shown in Fig. 3

We determined the attack signatures into four major types: 1) scan attack 2) denial of service attack 3) worm attack, and 4) other anomaly signature (flash crowd and alpha flow). Each anomaly signature which is not fake source IP address and source port number can be reflex flow that replied to source process as in Fig. 3. To classify application traffic, the distribution of traffic flow on each axis of parallel coordinates was transformed to sample entropy as Equation (2). The example of H(SIP), H(time) and H(DPort) of DNS traffic in Fig. 2 (b) are 0, 1.48 and 0. Although two entropy values are zero, the graph can be skewed to the middle or the bottom of parallel coordinates. Only entropy method cannot show port number behavior selection of process in each side of the network communication completely. But it is enough for training to classify application traffic because we do not allow the port number to influence classification. So we can classify similar application traffic with different port number as DNS traffic in Fig. 2 (a) and (b) as well as encrypted traffic. Therefore, visualization of traffic flow on the parallel coordinates is important for the behavior understanding of network applications which allow an analyst to understand behavior and patterns of traffic flow from a multidimensional visualization of data. 6. TIMELINE Only classification and traffic flow! presentation on parallel coordinates may not enough for forensic task, which needs to find the cause of the disorder. We developed a prototype to illustrate network communication. The prototype tool can show the behavior on the network with parallel coordinates and the timeline of events. An analyst can create new parallel

Figure 3. An example of!TCP/UDP anomaly flow signature

361

coordinates and timeline that related to suspect event by ondemand parameter customization as shown in the next example.

(d)

(c)

(e)

(f)

Figure 6 The investigation with!the timelines and! parallel coordinates 7. SYSTEM DESIGN

Figure 5. Dominant traffic flow From Fig. 5 an analyst explored the timeline of events on June 5, 2557 3.50 PM., at that time, an analyst found 2 dominant traffic flow including reflex port scan (average packet size 32 bytes) and DNS request traffic (average packet size 68.38 bytes) which replied to the destination IP address 10.182.208.1. It is noted that reflex port scan traffic from IP address 10.182.208.186 was sent by the host 10.182.208.1, which acted as a DNS server process, so the analyst wanted to examine this discrepancy by creating a new timeline, but in the opposite traffic directions. It can be seen in Fig. 6 (a) - (b)

Figure 7. System process

In addition, from the timeline at Fig. 6 (c) - (d), the analyst found UDP traffic from two IP address which was sent to IP address 10.182.208.186 with 3 well-known port number. When inbound traffic with parallel coordinates is shown, the analyst found only DNS traffic that reply from domain name server. It was probably not to be port scan from IP address 10.182.208.1 but it likely to be probe packets from IP address 10.182.221.186 which was sent to port number 192 (OSU-NMS) on IP Address 10.182.208.1. The IP address 10.182.208.1 is a source of this anomaly traffic.

In order to classify traffic flow and display it with parallel coordinates, we designed a system that consists of 8 major process include: 1) data collection 2) simple flow generation 3) frequent item-set mining 4) entropy generation 5) feature extraction 6) heuristic classification 7) parallel coordinates visualization and 8) manual inspection. Manual inspection and parallel coordinates visualization are directly interact with a user while the remaining 6 process are automatic process. From Fig. 7, network traffic on the link between external and internal network are captured and stored in a database. Network traffic are divided into outbound and inbound traffic and combined to simple flow. Then, simple flow and network traffic are mined with frequent item-set mining to find item-sets from traffic features of traffic flow (SIP, Sport, Dport, DIP, simple flow size) and simple flow (SIP, Sport, DIP, Dport), which allowed a chance to discover port scan, host scan, back scatter or distributed host scan that usually have a simple flow size equal to one from simple flow and find single source DoS or large Alpha Flow from network traffic.

(a)

(b)

To classification, entropy generation process calculated traffic feature entropy including H(SIP), H(SPort), H(Time), H(DPort), H(DIP) and H(Size) of each traffic flow while

362

feature extraction process collected those shapes of traffic flow. Traffic entropy of traffic flow became input data for classification by Naive Bayes classifier, which it learned the patterns of application traffic from the training dataset. We selected Naive Bayes classifier for application traffic classification although it does not return the highest accuracy among classifiers in the experimental results [9]. However it can classify and provide probability value for the heuristic algorithm, which can be used to predict results of both Naive Bayes classifier and anomaly traffic classifier to classify traffic flow.

TP

FN

FP

TN

(3)

Using the confusion matrix the success rate of each classifier can be evaluated by dividing the number of true positive and true negative results by the total number of feature vectors, as displayed in the following formula [9]: 34 5 36 34 5 74 5 36 5 76

The heuristic algorithm used a probability value of Naive Bayes classifier. If the probability value was greater than or equal the threshold, the heuristic algorithm would use Naive Bayes classifier prediction result for classification. Otherwise, heuristic algorithm would use anomaly classifier prediction for classification. Finally the classification result and traffic flow were shown with parallel coordinates visualizer.

(4)

The application traffic classifier can classify data with the 92% accuracy that is computed from the confusion matrix.

8. EVALUATION This section discusses how to measure the performance of application traffic classification and visualization techniques. We captured network traffic from real network communication and used them to create a timeline of events. We used frequent item-set mining to discover traffic flow and explored each of traffic flow by selecting related! packets from text filter that is created automatically for the Wireshark program (in Fig. 8).

(a) Netflix streaming

!!!!!!

(b) Youtube streaming

Figure 8. Text filter We manually collected 50 application traffic flows in 5 minutes! of a flow which are included 1) 20 flows of Dropbox 2) 10 flows of Bit 3) 20 flows of streaming video (YouTube, NetFlix) and used 25 flows for training and 25 flows for testing. True Labels Dropbox Bit Streaming Totals

(c) Upload Dropbox

(d) Upload Bit

Table 2. Confusion matrix Dropbox 10 2 0 12

Estimated Labels Bit Streaming 0 0 3 0 0 10 3 10

Totals 10 5 10 25 (e) One subnet host scan (ICMP)

Table 2 shows the confusion matrix, that determines the distribution of error across three classes (Dropbox, Bit, streaming video) and the point where miss-classification occurs [10]. In other words, it shows true positive (TP), false positive (FP), true negative (TN) and false negative (FN) values. diagonal elements show the effectiveness of the classifier, while off diagonal presents errors [9].

(f) Nine subnet host scan (ICMP)

Figure 9. Traffic flow images from the real network Instead of sighting the messy traffic during!"!minutes, each pattern of dominant traffic flow on parallel coordinate is shown. Application traffic and anomaly traffic are shown with 6 axes. It allows us to see the shape and the distribution of information on the SIP, Sport, Time, DIP, Dport and Size

363

[6] U. K. Chaudhary, I. Papapanagiotou, and M. Devetsikiotis, “Flow classification using clustering and association rule mining,” in 2010 15th IEEE International Workshop on Computer Aided Modeling, Analysis and Design of Communication Links and Networks (CAMAD), 2010, pp. 76–80. [7] D. Brauckhoff, X. Dimitropoulos, A. Wagner, and K. Salamatian, “Anomaly Extraction in Backbone Networks Using Association Rules,” IEEEACM Trans. Netw., vol. 20, no. 6, pp. 1788–1799, Dec. 2012. [8] A. Lakhina, M. Crovella, and C. Diot, “Mining Anomalies Using Traffic Feature Distributions,” in Proceedings of the 2005 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, New York, NY, USA, 2005, pp. 217– 228. [9] N. Promrit, A. Mingkhwan, M. Merabti, and W. Hurst, “Advanced Feature Extraction for Evaluating Host Behaviour in a Network,” in The 15th Annual Conference on the Convergence of Telecommunications, Networking and Broadcasting, Liverpool, 2014. [10] N. D. Marom, L. Rokach, and A. Shmilovici, “Using the confusion matrix for improving ensemble classifiers,” in 2010 IEEE 26th Convention of Electrical and Electronics Engineers in Israel (IEEEI), 2010, pp. 000555–000559.

(Packet Size) axis such as Netflix traffic that is sent from port number 80 smoothly (Fig. 9 (a)), Youtube traffic is sent from port number 80, but less smoothly (Fig. 9 (b)) because Netflix has a higher quality of the video and the higher variety of the packet size from the unique characteristic of the steaming server.! However, both traffic flows have the same shape which are the communication from the lower port number to higher port numbers. This style is a pattern of the server process to the client process communication. In Fig. 9 (c), the upload Dropbox traffic is sent to multiple servers (Amazon cloud servers), Bit traffic is sent to multiple peers (Fig.9 (d)), a subnet host scan sends packets to a higher destination IP address in the order of time (Fig. 9 (e)) and the multiple subnet host scan send packets to random subnet hosts. (Fig. 9 (f)). 9. CONCLUSION AND FUTURE WORK In this paper, an approach (framework) was put forward for the network forensic task using iterative visualization (the timeline and parallel coordinates). Using this approach, the results demonstrate that the technique can identify the cause of the network problem. Using frequent item-set mining to search dominant traffic flow and classify them by traffic flow shape and entropy including showing each traffic flow with parallel coordinates can provide a detailed account of the network application and network anomaly behaviour. Future work will include measurement the performance of the anomaly detection in the real network data. ACKNOWLEDGEMENT The research was financially supported by the National Science and Technology Development Agency, Thailand. REFERENCES [1] H. Choi, H. Lee, and H. Kim, “Fast Detection and Visualization of Network Attacks on Parallel Coordinates,” Comput Secur, vol. 28, no. 5, pp. 276–288, Jul. 2009. [2] A. Inselberg, “Multidimensional detective,” in , IEEE Symposium on Information Visualization, 1997. Proceedings, 1997, pp. 100–107. [3] N. Promrit, A. Mingkhwan, S. Simcharoen, and N. Namvong, “Multi-dimensional visualization for network forensic analysis,” in 2011 The 7th International Conference on Networked Computing (INC), 2011, pp. 68–73. [4] E. Glatz, S. Mavromatidis, B. Ager, and X. Dimitropoulos, “Visualizing big network traffic data using frequent pattern mining and hypergraphs,” Computing, vol. 96, no. 1, pp. 27–38, Jan. 2014. [5] I. Paredes-Oliva, I. Castell-Uroz, P. Barlet-Ros, X. Dimitropoulos, and J. Sole-Pareta, “Practical anomaly detection based on classifying frequent traffic patterns,” in 2012 IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2012, pp. 49–54.

364