ARTICLE IN PRESS

Computers and Electrical Engineering xxx (2007) xxx–xxx www.elsevier.com/locate/compeleceng

Improving network security using genetic algorithm approach

Zorana Banković a,*, Dušan Stepanović b, Slobodan Bojanić a, Octavio Nieto-Taladriz a

a ETSI Telecomunicación, Technical University of Madrid, Ciudad Universitaria s/n, 28040 Madrid, Spain

b Faculty of Electrical Engineering, University of Belgrade, Bulevar Kralja Aleksandra 78, 11000 Beograd, Serbia

Abstract

With the expansion of the Internet and its growing importance, the types and number of attacks have also grown, making intrusion detection an increasingly important technique. In this work we have realized a misuse detection system based on a genetic algorithm (GA) approach. The KDD99Cup training and testing datasets were used for evolving and testing new rules for intrusion detection. To be able to process network data in real time, we have deployed principal component analysis (PCA) to extract the most important features of the data. In that way we were able to keep high attack detection rates while speeding up the processing of the data.

© 2007 Published by Elsevier Ltd.

Keywords: Intrusion detection; Genetic algorithm; Principal component analysis

1. Introduction

The Internet and local area networks have expanded at a remarkable rate in recent years, not only in size but also in the services offered and in the mobility of users, which makes them more vulnerable to various kinds of complex attacks. While we benefit from the convenience that new technology has brought us, computer systems are exposed to an increasing number and complexity of security threats. Of particular importance, thus, is the ability to apply new network security policies rapidly in order to detect and react as quickly as possible to occurring attacks.

Different techniques have been developed and deployed to protect computer systems against network attacks (anti-virus software, firewalls, message encryption, secured network protocols, password protection). Despite all these efforts, it is impossible to have a completely secure system. Therefore, intrusion detection is becoming an increasingly important technique that monitors network traffic and identifies network intrusions such as anomalous network behaviors, unauthorized network access, or malicious attacks on computer systems. Most of the existing solutions are developed for well-defined networks and systems [1–3]. However, they are not adapted to dynamic environments or to the increasing complexity of user behaviors.

Corresponding author. E-mail address: [email protected] (Z. Banković).

0045-7906/$ - see front matter © 2007 Published by Elsevier Ltd. doi:10.1016/j.compeleceng.2007.05.010

Please cite this article in press as: Banković Z et al., Improving network security using genetic algorithm approach, Comput Electr Eng (2007), doi:10.1016/j.compeleceng.2007.05.010


There are two general categories of intrusion detection systems (IDSs): misuse detection and anomaly detection. Misuse detection systems are the most widely used, and they detect intruders with known patterns. The signatures and patterns used to identify attacks consist of various fields of a network packet, such as source address, destination address, source and destination ports, or even some key words of the payload of a packet. The drawback of these systems is that only attacks that already exist in the attack database can be detected, so the model needs continuous updating, but they have the virtue of a very low false-positive rate. Anomaly detection systems identify deviations from normal behaviour and alert to potential unknown or novel attacks without having any prior knowledge of them. They exhibit a higher rate of false alarms, but they are able to detect unknown attacks and perform their task of looking for deviations much faster.

The application and development of specialized machine learning techniques is gaining increasing attention in the intrusion detection community [19]. Soft computing is a collection of methodologies that aim to exploit tolerance for imprecision, uncertainty and partial truth to achieve tractability, robustness and low solution cost. As soft-computing techniques can also be used for machine learning, different soft-computing techniques have been used for intrusion detection (Fuzzy Logic, Artificial Neural Networks, Genetic Algorithms) [4,7,15,17], but their possibilities are still under-utilized.

In this work we have realized a misuse detection system that is based on a Genetic Algorithm (GA). We have exploited both possibilities: either to classify network traffic as normal or abnormal, or to further classify the attacks by their type. Many features of GA make it very suitable for intrusion detection, such as robustness to noise, self-learning capabilities, and the fact that initial rules can be built randomly, so there is no need to know the exact mechanism of an attack at the beginning. Further classification of the attacks is not very important for intrusion detection, but it is important for network forensics, because knowing the exact type of a threat and the way it performs its attack makes the recovery after an attack more successful.

The genetic algorithm (GA) field is one of the up-and-coming fields in computer security, especially in intrusion detection systems (IDS) [7,15,18]. GA operates on a population of potential solutions, applying the principle of survival of the fittest to produce better and better approximations to the solution of the problem that GA is trying to solve. At each generation, a new set of approximations is created by selecting individuals according to their fitness value in the problem domain and breeding them together using operators borrowed from the genetic process performed in nature, i.e. crossover and mutation. This process leads to the evolution of populations of individuals that are better adapted to their environment than the individuals that they were created from, just as happens in natural adaptation.

For evolving new rules the KDD99Cup training and testing dataset was used [5]. The KDD99Cup dataset has been found to have considerable drawbacks, such as missing and useless features and the impossibility of detecting some attacks [6]. Despite these shortcomings, it is still the prevailing dataset used for training and testing of IDSs, due to its good structure (every connection is described using 41 features and is labelled, thus providing the information whether the connection is normal or a specific attack type) and the fact that it is the only dataset of that kind available [7,8].
It is extremely difficult to process a large amount of network traffic data in real time in order to detect network attacks and take the appropriate actions. On the other hand, not all the features have the same relevance for intrusion detection [9]. To process network data in real time and perform efficient intrusion detection, we need to extract the most important pieces of information that can be deployed for efficient detection of network attacks. We have deployed principal component analysis (PCA), also known as the Karhunen–Loève transform, in order to extract the most relevant features of the data. PCA seeks to reduce the dimension of the data by finding a few orthogonal linear combinations of the original variables with the largest variance. Our results confirm that a high detection rate is maintained while using data of lower dimension. The other benefit is that data processing, and the decision whether a connection is an attack, are performed much faster.

The rest of the work is organized as follows. In Section 2 a survey of the machine learning techniques is given. Section 3 gives an overview of the dimension reduction technique deployed. In Section 4 an overview of the genetic algorithm is given. The genetic algorithm approach to intrusion detection as deployed is presented in Section 5. In Section 6 the software implementation of the proposed approach is shown. Obtained results are given in Section 7. Possible application environments of the system are presented in Section 8. Conclusions are drawn in Section 9.


2. Survey on the machine learning techniques used for intrusion detection

The large amount of network data and the large number of network attacks have imposed the usage of intelligent machine learning techniques in order to discover attacks and their way of functioning. The past few years have witnessed a growing recognition of intelligent techniques for the construction of efficient and reliable intrusion detection systems. Most of the well-known pattern recognition techniques, both supervised and unsupervised, and their combinations resulting in meta-classifiers have been used for intrusion detection. Some of the state-of-the-art techniques [15,17–24] and their results on the KDD99Cup dataset are presented in Table 1.

The genetic algorithm is one of the techniques that have recently been recognized as having potential in the intrusion detection field. Some of its applications are presented in [7,8,15,18]. The novelty of our approach is that we use only three features out of 41 to describe a network connection while maintaining high detection rates. This gives the system the ability to perform intrusion detection rapidly, in terms of both training and testing the rules for detection of intrusions, and makes it applicable to high speed networks. Our approach exhibits detection and false-positive rates similar to those of the approaches presented in [7,8], but at the same time it exhibits a much shorter process of training and thus of refreshing the rule set. Frequent refreshing of the rule set is a very important characteristic considering the rate at which new attacks emerge [25].

3. Dimension reduction technique

3.1. PCA – overview

Advances in data collection and storage capabilities during the past decades have led to an information overload in most sciences. Researchers working in various domains face an ever larger number of observations and simulations.
Traditional statistical methods do not give satisfactory results, mostly because of the growing number of variables associated with each observation. The dimension of the data is the number of variables used to describe the data. High dimensional datasets have given rise to new theoretical developments, such as dimension reduction techniques, as one of the problems with high dimensional datasets is that, in many cases, not all measured variables are important for the phenomena of interest. A large number of dimension reduction techniques have been developed [10], such as principal component analysis, factor analysis, projection pursuit and independent component analysis. For our work, we have deployed principal component analysis (PCA), or the Karhunen–Loève transform, because it is the best linear dimension reduction technique in the mean-square error sense [11].

Table 1
Machine learning techniques deployed to intrusion detection and their performance on the KDD99Cup dataset

Technique                                                                   Detection rate (%)   False positive rate (%)
C4.5                                                                        95                   1
Support vector machine (SVM)                                                95.5                 1
Multi layer perceptron (MLP)                                                94.5                 1
k-nearest neighbor (k-NN)                                                   92                   1
Linear programming machine (LPM)                                            94                   1
Regularized discriminant analysis (RDA)                                     94                   1
Fisher linear discriminant (FD)                                             89                   1
γ-algorithm                                                                 80                   1
k-means clustering                                                          65                   1
Single linkage clustering                                                   69                   1
Quarter-sphere SVM                                                          65                   1
Y-means clustering                                                          89.89                1
Genetic programming ensemble for distributed intrusion detection (GEdIDS)   91                   0.43
SVM + GA                                                                    99                   –
SVM + Fuzzy Logic                                                           99.56                0.44
Neural Networks + PCA                                                       92.22                –
C4.5 + PCA                                                                  92.16                –
GA                                                                          97.47                0.69
C4.5 + Hybrid neural networks                                               93.28                0.2
Hidden Markov model (HMM)                                                   79                   –

In our approach it was more convenient to select a subset of the original features that preserves most of the relevant information according to some optimality criterion, in our case the features that are most likely to participate in an attack, rather than to find a mapping that uses all of the original features. The benefit of finding the proper subset is that it avoids the cost of computing unnecessary features, leading to a speed gain in both detecting intrusions and training the rules. Different techniques have been deployed for feature selection, such as regression techniques, clustering or PCA-based methods [31–33]. It has been demonstrated that PCA-based techniques exhibit certain advantages over regression techniques in terms of optimality and speed [33].

3.2. Implementation of the PCA technique deployed for feature selection

In essence, PCA seeks to reduce the dimension of the data by finding a few orthogonal linear combinations (the PCs) of the original variables with the largest variance [10]. The first PC, s1, is the linear combination with the largest variance. The second PC is the linear combination with the second largest variance that is orthogonal to the first PC, and so on. There are as many PCs as original variables. For many datasets, the first several PCs explain most of the variance, so that the rest can be discarded with minimal loss of information. We have deployed an alternative way to reduce the dimension of a dataset using PCA, proposed in [12].
Instead of using the PCs as new variables, this method uses the information in the PCs to find important variables in the original dataset. As before, one first calculates the PCs and then studies the scree plot, i.e. a plot of the sorted eigenvalues, from large to small, as a function of the eigenvalue index, to determine the number k of important variables to keep. Next, one considers the eigenvector corresponding to the smallest eigenvalue (the least important PC) and discards the variable that has the largest coefficient (in absolute value) in that vector. Then, one considers the eigenvector corresponding to the second smallest eigenvalue and discards, among the variables not discarded earlier, the one with the largest absolute coefficient. The process is repeated until only k variables remain.

4. Genetic algorithm overview

Genetic algorithms (GA) are search algorithms based on the principles of natural selection and genetics. The bases of the genetic algorithm approach were given by Holland [13], and it has been deployed to solve a wide range of problems. GA evolves a population of initial individuals into a population of high quality individuals, where each individual represents a solution of the problem to be solved. Each individual is called a chromosome and is composed of a predetermined number of genes [14]. In our case each individual is a rule, and the quality of each rule is measured by a fitness function as the quantitative representation of the rule's adaptation to a certain environment. The procedure starts from an initial population of randomly generated individuals. Then the population is evolved for a number of generations while gradually improving the quality of the individuals, in the sense of increasing the fitness value as the measure of quality. During each generation, three basic genetic operators, i.e. selection, crossover and mutation, are sequentially applied to each individual with certain probabilities. The algorithm flow is presented in Fig. 1.
The following factors have a crucial impact on the efficiency of the algorithm: the selection of the fitness function, the representation of individuals, and the values of the GA parameters (crossover and mutation rate, size of population, threshold of fitness value). The determination of these factors usually depends on the application. In our work we have employed two simple fitness functions. The first one calculates the detection rate of a rule penalized by its false detection rate, and the second one is the support-confidence framework. GA parameters were chosen after a large number of experiments.

[Figure: flowchart. Start; generate initial population; evaluate population; is end of evolution reached? If yes: best individuals, result. If no: generation of new population (selection, crossover, mutation), then evaluate population again.]

Fig. 1. Genetic algorithm flow.
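The flow in Fig. 1 can be illustrated with a short sketch. The Python code below is only a minimal illustration and not the authors' implementation: the function name `evolve`, the truncation selection that keeps the fitter half of the population, the default parameter values and the toy fitness function are all assumptions made for the example.

```python
import random

def evolve(fitness, n_genes=3, pop_size=50, generations=100,
           crossover_rate=0.6, mutation_rate=0.01, seed=0):
    """Evolve byte-valued chromosomes; return the final population, best first."""
    rng = random.Random(seed)
    # Generate initial population: every gene gets a random byte value.
    population = [[rng.randrange(256) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Evaluate population, then select the fitter half as parents
        # (truncation selection is an assumption of this sketch).
        population.sort(key=fitness, reverse=True)
        parents = population[:pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = list(a)
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_genes)  # one-point crossover
                child = a[:cut] + b[cut:]
            # Mutation: each gene is replaced by a random byte with small probability.
            child = [rng.randrange(256) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = parents + children
    population.sort(key=fitness, reverse=True)
    return population

# Toy fitness (hypothetical): closeness to an arbitrary target chromosome.
def toy_fitness(chromosome, target=(1, 0, 50)):
    return -sum(abs(g - t) for g, t in zip(chromosome, target))

best = evolve(toy_fitness)[0]
print(best, toy_fitness(best))
```

Because the parents are carried over unchanged, the best fitness in this sketch never decreases from one generation to the next; other selection schemes trade that guarantee for more diversity.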

Deployment of GA in the intrusion detection field offers a number of advantages, namely:

• GAs are intrinsically parallel: since they have multiple offspring, they can explore the solution space in multiple directions at once. If one path turns out to be a dead end, they can easily eliminate it and continue working on more promising avenues, giving them a greater chance in each run of finding the optimal solution.

• Due to the parallelism that allows them to implicitly evaluate many schemas at once, GAs are particularly well suited to solving problems where the space of all potential solutions is too vast to search exhaustively in any reasonable amount of time, as is the case with network data.

• A system based on GA can easily be re-trained, thus providing the possibility of evolving new rules for intrusion detection. This property provides the adaptability of a GA-based system, which is an imperative quality of an intrusion detection system given the high rate at which new attacks emerge.

5. Genetic algorithm approach to intrusion detection

The proposed approach contains two stages. In the first one, the training stage, a set of rules for detecting intruders is generated offline using network audit data. In the second stage, the best rules, i.e. the rules with the highest fitness values, are used for intrusion detection in the real-time environment.

As some network characteristics are more likely than others to be involved in network intrusions, we have deployed the PCA approach to identify these characteristics. The PCA algorithm described above was implemented in MATLAB and applied to the training dataset in order to identify the features that participate most frequently in the mechanism of an attack. According to the obtained results, we have selected three features out of the forty-one used to describe each connection of the KDD99Cup dataset. The objective was to select the smallest possible number of features while maintaining a high detection rate of intrusions, so that detection could be performed in real time. The selected features and their descriptions are presented in Table 2. The Appendix presents the original 41 features of the KDD99Cup dataset with their descriptions, as well as the selected features up to dimension 16 that we have obtained by deploying the same PCA technique.

Every feature represents one gene of the chromosome. As one byte is used to represent every feature, i.e. every gene, the chromosome that represents each individual is composed of three bytes.


Table 2
Selected network features

Name of the feature        Explication                                        Number of genes
duration                   Length (number of seconds) of the connection       1
src_bytes                  Number of data bytes from source to destination    1
dst_host_srv_serror_rate   Percentage of connections that have "SYN" errors   1
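The variable-discarding procedure of Section 3.2, which was used to arrive at the features in Table 2, can be sketched as follows. The authors implemented it in MATLAB; the Python function below is only an illustration under the assumption that the eigenvalues and eigenvectors of the data have already been computed elsewhere. The name `select_variables` and the small three-variable example are hypothetical.

```python
def select_variables(eigenpairs, k):
    """Return the indices of the k original variables that are kept.

    eigenpairs: list of (eigenvalue, eigenvector) tuples, where each
    eigenvector is a list with one coefficient per original variable.
    """
    n_vars = len(eigenpairs[0][1])
    kept = set(range(n_vars))
    # Walk through the PCs from the least important (smallest eigenvalue) upwards.
    for _, vector in sorted(eigenpairs, key=lambda pair: pair[0]):
        if len(kept) <= k:
            break
        # Among the variables not discarded earlier, drop the one with the
        # largest absolute coefficient in this eigenvector.
        worst = max(sorted(kept), key=lambda i: abs(vector[i]))
        kept.remove(worst)
    return sorted(kept)

# Illustrative eigen-decomposition of a 3-variable dataset (made-up numbers).
pairs = [
    (2.5, [0.8, 0.6, 0.0]),
    (0.4, [0.0, 0.0, 1.0]),
    (0.1, [0.6, -0.8, 0.0]),
]
print(select_variables(pairs, k=1))
```

In this example the least important PC removes variable 1 and the next one removes variable 2, so only variable 0 survives for k = 1. The value of k itself still has to be read off the scree plot, as described in Section 3.2.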

Every rule for intrusion detection is a simple if-then clause. The features from Table 2 are connected using the and function, thus forming the conditional part of a rule. The result of every rule is the confirmation of an intrusion. For example, one rule could be:

if (duration = "1" and src_bytes = "0" and dst_host_srv_serror_rate = "50") then intrusion;

To determine the fitness value of each rule, the following fitness function is deployed [15]:

$$\mathrm{fitness} = \frac{a}{A} - \frac{b}{B} \qquad (1)$$

where a is the number of correctly detected attacks, A is the total number of attacks in the training dataset, b is the number of normal connections incorrectly characterized as attacks, i.e. false-positives, and B is the total number of normal connections in the training dataset. Fitness values lie in the range [−1, 1], where −1 is the lowest and 1 the highest value. A high detection rate and a low rate of false-positives result in a high fitness value. Conversely, a low detection rate and a high rate of false-positives result in a low fitness value.

We have also exploited the possibility of GA to detect the exact type of an attack. Detecting the type of each intrusion is not very important for intrusion detection, but it is important for forensics in order to recover from an attack. In this case a rule can be presented as:

if (duration = "1" and src_bytes = "0" and dst_host_srv_serror_rate = "50") then (attack_name = "portsweep");

As the previous fitness function is aware only of the total number of intrusions and not of their exact type, we have deployed the support-confidence framework [8] to determine the fitness of each rule:

$$\mathrm{support} = \frac{|A \text{ and } B|}{N}, \qquad \mathrm{confidence} = \frac{|A \text{ and } B|}{|A|}$$

$$\mathrm{fitness} = w_1 \cdot \mathrm{support} + w_2 \cdot \mathrm{confidence} \qquad (2)$$

where N is the total number of network connections in the testing dataset, |A| stands for the number of connections matching the condition A, and |A and B| stands for the number of connections that match the rule if A then B. The weights w1 and w2 are used to control the balance between the two terms.

The algorithm for generating new rules is performed as follows. The first step is the initialization of the initial population, in which each gene is given a random value. Then the parameters of the genetic algorithm (crossover and mutation rate, size of population, end of evolution of rules) are specified and the network audit data is loaded. After that the initial population is evolved for a number of generations. In every generation the quality of every rule, i.e. its fitness value, is calculated according to the fitness function, then a number of rules with the highest fitness values are selected, and finally the genetic operators (crossover and mutation) are performed with a certain probability. The output of the algorithm is a set of rules for intrusion detection.

6. Implementation of the system

The system proposed here is implemented in software using the C++ programming language. The implementation contains two systems: an "offline" system used to derive the rules from network audit data, and an "online" system that uses the derived rules for intrusion detection and is supposed to perform its function in a real-time environment.
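The rule format and the two fitness functions of Section 5 can be made concrete with a small sketch. It is written in Python for brevity and is not the C++ code of the system; the dictionary-based representation of rules and connections, the function names, the default weights and the four toy connections are assumptions made for illustration, as is the choice to return a confidence of 0 when no connection matches the condition.

```python
def matches(rule, conn):
    """True when every feature in the rule's condition equals the connection's value."""
    return all(conn[name] == value for name, value in rule["if"].items())

def fitness_detection(rule, data):
    """Fitness function (1): a/A - b/B, a value in [-1, 1]."""
    attacks = [c for c in data if c["label"] != "normal"]
    normals = [c for c in data if c["label"] == "normal"]
    a = sum(matches(rule, c) for c in attacks)  # correctly detected attacks
    b = sum(matches(rule, c) for c in normals)  # false positives
    return a / len(attacks) - b / len(normals)

def fitness_support_confidence(rule, data, w1=0.5, w2=0.5):
    """Fitness function (2): w1 * support + w2 * confidence for 'if A then B'."""
    n_a = sum(matches(rule, c) for c in data)
    n_ab = sum(matches(rule, c) and c["label"] == rule["then"] for c in data)
    support = n_ab / len(data)
    confidence = n_ab / n_a if n_a else 0.0  # assumption: 0 when A never matches
    return w1 * support + w2 * confidence

rule = {"if": {"duration": 1, "src_bytes": 0, "dst_host_srv_serror_rate": 50},
        "then": "portsweep"}

# Four made-up connections; real training data would come from KDD99Cup.
data = [
    {"duration": 1, "src_bytes": 0, "dst_host_srv_serror_rate": 50, "label": "portsweep"},
    {"duration": 0, "src_bytes": 0, "dst_host_srv_serror_rate": 100, "label": "neptune"},
    {"duration": 0, "src_bytes": 200, "dst_host_srv_serror_rate": 0, "label": "normal"},
    {"duration": 1, "src_bytes": 0, "dst_host_srv_serror_rate": 50, "label": "normal"},
]

print(fitness_detection(rule, data))
print(fitness_support_confidence(rule, data))
```

On this toy data the rule detects one of two attacks (a/A = 0.5) but also fires on one of two normal connections (b/B = 0.5), so function (1) gives 0. For function (2), the condition matches two connections of which one is a portsweep, giving support 1/4 and confidence 1/2.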


Fig. 2. Class diagram of the realized system.

The program contains the following classes:

• Individual: represents an individual of the rule population; contains its chromosome representation and fitness value; also contains the field called attack_name, which is used in the second case of rule generation to determine the exact type of an attack;

• Fitness: used for the calculation of a fitness value; contains the definitions of both fitness functions described above, each of which is selected when necessary;

• Initializer: used for the initialization of a population;

• Evaluator: used for selecting the rules whose fitness values are higher than the determined threshold value ("best-fit" rules);

• Breeder: used for the breeding of each generation; it first selects two individuals randomly, performs their one-point crossover with a certain probability, thus generating two new individuals, and then performs the mutation of the new individuals with a small probability;

• PriorityQueue: contains a queue of elements organized by their priority, in this case the individuals organized in descending order of their fitness values, which facilitates the work of the Breeder and Evaluator classes and the selection of the best-fit individuals at the end of the training process.

The classes (with their most important attributes and functions) and their interconnections and dependencies are depicted in Fig. 2.

7. Training and testing the rules for intrusion detection

7.1. Training and testing data subsets

For the purpose of this work, two subsets of the KDD99Cup dataset were derived, one for training and one for testing. Each connection has a corresponding label that states whether it is a normal connection or a certain type of attack. The subset used for training contained 137 attacks and 839 normal connections. The testing subset contained 234 attacks and 743 normal connections. Most of the selected connections are normal, which is generally the case in real-world networks.

The subsets contained three types of network attacks: portsweep, smurf and neptune [16]. Portsweep is an attack that sweeps through many ports to determine which services are supported on a single host. Smurf is a denial-of-service attack that sends a stream of ICMP "ECHO" requests to the broadcast address of many subnets, resulting in a large continuous stream of "ECHO" replies that floods the victim. Neptune (or SYN-flood) is a denial-of-service attack in which the attacking system keeps sending IP-spoofed packets requesting new connections faster than the victim system can close pending connections, i.e. faster than they expire. In some cases, the system may exhaust memory, crash or be rendered otherwise inoperative.

In terms of the features used to describe a particular type of attack, portsweep attacks exhibit a greater range of durations, i.e. the duration feature usually has a value higher than '0', considering that the attack takes some time to perform, while for the other two attacks it is '0' in most cases. Portsweep attacks also exhibit a wider range of dst_host_srv_serror_rate than the other two attacks. The neptune attack exhibits a high value of dst_host_srv_serror_rate, usually around 100%, considering that the attacker sends a stream of SYN packets to a port on the target machine and dst_host_srv_serror_rate gives the percentage of connections that have SYN errors, while the duration and src_bytes features are '0' in most cases. The smurf attack, on the other hand, has a high value of the src_bytes feature, a few times bigger than the value of src_bytes in a normal connection, while the values of dst_host_srv_serror_rate and duration remain '0'.
The training dataset contained 74 neptune, 24 smurf and 39 portsweep attacks. The testing dataset contained 87 neptune, 107 smurf and 40 portsweep attacks.

7.2. GA parameters deployed for training the rules

The system was trained using the fitness function defined in formula (1) with the following parameters of the genetic algorithm: 1000 generations, 500 initial rules, "one-point" crossover with a probability of 0.6 in the first and 0.7 in the second experiment, and a mutation rate of 0.01 in the first and 0.05 in the second experiment. When the process of training was finished, the 10 "best-fit" rules were selected for the classification of the intrusions and the normal connections in the testing dataset.

7.3. Obtained results

Obtained results are presented in Table 3. It can be observed from the table that in the first experiment normal connections are classified 100% correctly, i.e. there are no false-positives, but the trade-off is a lower attack detection rate. The obtained detection rates are similar to those obtained in [7,8], but the main advantage of our approach is that we use only 3 features of a network connection, while they use 7 and 41 respectively. Hence, our system can perform the training process and the process of detecting intrusions faster and could be applied to high speed networks while maintaining high detection rates.

Table 3
Detection rates (%) in experiments 1 and 2 of the system trained with fitness function (1)

Type of connection   Experiment 1   Experiment 2
Normal               100            98.38
Attack               92.74          94.87

The system was also trained using the support-confidence framework, with the fitness function defined by formula (2) and the same parameters of the algorithm as defined before. In our experiments we have varied the values of the weight coefficients w1 and w2. Obtained results are presented in Table 4. From Table 4, we can conclude that the maximum detection rate of the rules obtained by training with fitness function (2) is 87.6%, and that the set of rules trained with fitness function (1) gives a higher detection rate. The detection rates of neptune and smurf are very high, but the detection rate of portsweep remains low, probably because of the smaller sample used for training and the higher diversity of individual attacks with regard to the features deployed to describe the attacks. The false-positive rate in this case is also 0, which confirms the previous statement that the GA approach provides a low false-positive rate. Again, we have maintained a high detection rate while using only three features of a network connection.

Table 4
Detection rates (%) in experiments 1 and 2 of the system trained with fitness function (2)

             Exp1                                           Exp2
w1    w2     Neptune  Smurf  Portsweep  False-positive      Neptune  Smurf  Portsweep  False-positive
0     1      0        0      22.5       0                   0        0      30         1.6
0.1   0.9    0        0      7.5        0                   0        0      30         0
0.2   0.8    0        99     22.5       0                   0        0      22.5       0
0.3   0.7    0        0      22.5       0                   0        99     22.5       0
0.4   0.6    0        0      35         1.6                 0        0      35         1.6
0.5   0.5    0        99     30         0                   0        99     30         0
0.6   0.4    100      99     30         0                   100      99     30         0
0.7   0.3    100      99     30         0                   100      99     30         0
0.8   0.2    100      99     30         0                   100      99     30         0
0.9   0.1    100      99     30         0                   100      99     30         0
1     0      100      99     30         0                   100      99     30         0

8. Application environments

The implementation of the algorithm presented here is intended to be deployed in a wired network environment. Nevertheless, the system could also be deployed in a wireless network environment as a supplementary item for reinforcing security, considering the number of existing security flaws in wireless networks [26], provided that a proper attack taxonomy customized to wireless network attacks and a proper training dataset are available. The IEEE 802.11 protocol for Wireless Local Area Networks (WLAN) has security issues similar to those of the Ethernet protocol for LAN or WAN networks [27], with some new ones added by the security issues concerning Wired Equivalent Privacy (WEP) [26,30]. This implies the necessity of deploying supplementary items for security reinforcement. For example, as the RTS/CTS (RequestToSend/ClearToSend) combination is similar to TCP's synchronize (SYN) and acknowledge (ACK) [28], an attack with a mechanism similar to that of the SYN-flood (Neptune) attack could be deployed in the WLAN environment. The RequestToSend (RTS) frame is followed by a ClearToSend (CTS) frame to ensure that no hidden node can transmit while another node out of the sender's range is also transmitting [29]. This allows other nodes in the broadcast area to suspend transmission until the current frame has been transmitted. Many RTS frames could be sent in a flood, thus tying up the medium and causing a DoS. Because of a lack of (or improper) validation of the senders of the packets, an RTS flood could be developed.
The RTS flood example, and the others presented in [27], imply that some existing intrusion detection systems designed for securing wired networks are flexible enough to be deployed in a wireless environment after the necessary changes have been made, as is the case with the system presented here.

9. Conclusions

In this work we have applied a genetic algorithm approach to intrusion detection and presented a software implementation of it. The genetic algorithm was used to obtain classification rules for intrusion detection, while principal component analysis was used to identify the most important features of network connections. As in the real world the types of intrusions are changing rapidly and becoming increasingly complex, an intrusion detection system should be adaptive in order to cope with the evolution of the threat space. As our system can upload and update new data and evolve new rules for detecting new intrusions, it is adaptive and also cost-effective because it is easy to maintain. Therefore, the GA approach, with the appropriate and simple representation of the rules and effective fitness functions that can be applied, is easy to implement and maintain. Moreover, our system is flexible enough to be used in different application environments, provided the proper attack taxonomy and the proper training dataset.

We have demonstrated that the GA approach can be used either to classify network connections as normal or intrusive, or to further classify attacks by their type. Classification of attacks is not important in intrusion detection itself, considering that its goal is to detect attacks in real time so that they can be stopped before causing any damage. In contrast, classification of attacks is very important in network forensics, because good recovery from the damage caused by an attack requires knowing the exact type of the attack and its mechanism. The high attack detection rate and low false-positive rate demonstrate the advantages of applying this technique to intrusion detection without any of the complementary techniques typically used with other soft-computing techniques. Our system uses only three features of the network connections while maintaining high detection rates, so it can perform intrusion detection quickly and could be applied to high-speed networks.

Appendix

See Tables A1–A3.

Table A1. List of features with their description and data types of KDD99Cup data set [4]

No. | Feature | Description | Type
--- | --- | --- | ---
1 | Duration | Duration of the connection | Continuous
2 | Protocol type | Connection protocol (udp, tcp, icmp) | Discrete
3 | Service | Destination service (e.g. telnet, ftp) | Discrete
4 | Flag | Status flag of the connection | Discrete
5 | Source bytes | Bytes sent from source to destination | Continuous
6 | Destination bytes | Bytes sent from destination to source | Continuous
7 | Land | 1 if connection is from/to the same host/port; 0 otherwise | Discrete
8 | Wrong fragments | Number of wrong fragments | Continuous
9 | Urgent | Number of urgent packets | Continuous
10 | Hot | Number of hot indicators | Continuous
11 | Failed logins | Number of failed logins | Continuous
12 | Logged in | 1 if successfully logged in; 0 otherwise | Discrete
13 | # Compromised | Number of "compromised" conditions | Continuous
14 | Root shell | 1 if root shell is obtained; 0 otherwise | Discrete
15 | su attempted | 1 if "su root" command attempted; 0 otherwise | Discrete
16 | # Root | Number of "root" accesses | Continuous
17 | # File creations | Number of file creation operations | Continuous
18 | # Shells | Number of shell prompts | Continuous
19 | # Access files | Number of operations on access control files | Continuous
20 | # Outbound cmds | Number of outbound commands in an ftp session | Continuous
21 | Is hot login | 1 if the login belongs to the "hot" list; 0 otherwise | Discrete
22 | Is guest login | 1 if the login is a "guest" login; 0 otherwise | Discrete
23 | Count | Number of connections to the same host as the current connection in the past two seconds | Continuous
24 | srv count | Number of connections to the same service as the current connection in the past two seconds | Continuous
25 | serror rate | % of connections that have "SYN" errors | Continuous
26 | srv serror rate | % of connections that have "SYN" errors | Continuous
27 | rerror rate | % of connections that have "REJ" errors | Continuous
28 | srv rerror rate | % of connections that have "REJ" errors | Continuous
29 | Same srv rate | % of connections to the same service | Continuous
30 | Diff srv rate | % of connections to different services | Continuous
31 | srv diff host rate | % of connections to different hosts | Continuous
32 | dst host count | Count of connections having the same destination host | Continuous
33 | dst host srv count | Count of connections having the same destination host and using the same service | Continuous
34 | dst host same srv rate | % of connections having the same destination host and using the same service | Continuous
35 | dst host diff srv rate | % of different services on the current host | Continuous
36 | dst host same src port rate | % of connections to the current host having the same src port | Continuous
37 | dst host srv diff host rate | % of connections to the same service coming from different hosts | Continuous
38 | dst host serror rate | % of connections to the current host that have an S0 error | Continuous
39 | dst host srv serror rate | % of connections to the current host and specified service that have an S0 error | Continuous
40 | dst host rerror rate | % of connections to the current host that have RST errors | Continuous
41 | dst host srv rerror rate | % of connections to the current host and specified service that have an RST error | Continuous
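As an illustration of how the 41 features of Table A1 can be handled in software, the sketch below turns one comma-separated connection record into a dictionary keyed by feature name. It is a minimal sketch under stated assumptions, not the implementation used in this work: the underscore spellings follow the style of Tables A2 and A3 and are assumed where those tables do not list a feature, feature 38 is taken to be `dst_host_serror_rate` as its description suggests, the record layout (41 values followed by an optional label ending in a full stop) is assumed, and the sample line in the usage note is illustrative only.

```python
# Feature names in the order of Table A1 (spellings partly assumed).
NAMES = """duration protocol_type service flag src_bytes dst_bytes land
wrong_fragment urgent hot num_failed_logins logged_in num_compromised
root_shell su_attempted num_root num_file_creations num_shells
num_access_files num_outbound_cmds is_hot_login is_guest_login count
srv_count serror_rate srv_serror_rate rerror_rate srv_rerror_rate
same_srv_rate diff_srv_rate srv_diff_host_rate dst_host_count
dst_host_srv_count dst_host_same_srv_rate dst_host_diff_srv_rate
dst_host_same_src_port_rate dst_host_srv_diff_host_rate
dst_host_serror_rate dst_host_srv_serror_rate dst_host_rerror_rate
dst_host_srv_rerror_rate""".split()

# Features marked "Discrete" in Table A1; all others are parsed as floats.
DISCRETE = {"protocol_type", "service", "flag", "land", "logged_in",
            "root_shell", "su_attempted", "is_hot_login", "is_guest_login"}


def parse_record(line):
    """Split one record into ({feature: value}, label or None)."""
    fields = line.strip().split(",")
    if len(fields) not in (len(NAMES), len(NAMES) + 1):
        raise ValueError(f"expected 41 or 42 fields, got {len(fields)}")
    record = {}
    for name, raw in zip(NAMES, fields):
        # Discrete features stay as strings, continuous ones become floats.
        record[name] = raw if name in DISCRETE else float(raw)
    label = fields[len(NAMES)].rstrip(".") if len(fields) > len(NAMES) else None
    return record, label
```

For example, `parse_record("0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.")` yields a record whose `src_bytes` is `181.0` and whose label is `"normal"`.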

Table A2. List of important features for each dimension, part I

Dimension | Important features
--- | ---
1 | src_bytes
2 | duration, src_bytes
3 | duration, src_bytes, dst_host_srv_serror_rate
4 | duration, src_bytes, serror_rate, dst_host_srv_serror_rate
5 | duration, src_bytes, serror_rate, dst_bytes, dst_host_srv_serror_rate
6 | duration, flag, src_bytes, serror_rate, dst_bytes, dst_host_srv_serror_rate
7 | duration, flag, src_bytes, hot, serror_rate, dst_bytes, dst_host_srv_serror_rate
8 | duration, flag, src_bytes, srv_rerror_rate, hot, serror_rate, dst_bytes, dst_host_srv_serror_rate

Table A3. List of important features for each dimension, part II

Dimension | Important features
--- | ---
9 | duration, service, flag, src_bytes, srv_rerror_rate, hot, serror_rate, dst_bytes, dst_host_srv_serror_rate
10 | duration, service, flag, src_bytes, srv_rerror_rate, hot, count, serror_rate, dst_bytes, dst_host_srv_serror_rate
11 | duration, protocol_type, service, flag, src_bytes, srv_rerror_rate, hot, count, serror_rate, dst_bytes, dst_host_srv_serror_rate
12 | duration, protocol_type, service, flag, src_bytes, srv_rerror_rate, hot, count, serror_rate, dst_bytes, dst_host_srv_serror_rate, num_root
13 | duration, protocol_type, service, flag, src_bytes, srv_rerror_rate, hot, count, serror_rate, dst_bytes, dst_host_srv_serror_rate, num_root, srv_count
14 | duration, protocol_type, service, flag, src_bytes, srv_rerror_rate, hot, count, serror_rate, dst_bytes, dst_host_srv_serror_rate, num_root, srv_count, num_compromised
15 | duration, protocol_type, service, flag, src_bytes, srv_rerror_rate, hot, count, serror_rate, dst_bytes, dst_host_srv_serror_rate, num_root, srv_count, num_compromised, wrong_fragment
16 | duration, protocol_type, service, flag, src_bytes, srv_rerror_rate, hot, count, serror_rate, dst_bytes, dst_host_srv_serror_rate, num_root, srv_count, num_compromised, wrong_fragment, srv_diff_host_rate
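The feature lists of Tables A2 and A3 can be used to reduce a connection record before any rule is evaluated. The sketch below shows this for the first three dimensions and then checks the reduced record against a rule. It is a sketch under stated assumptions only: the interval form of the rule, the helper names `project` and `matches`, and every threshold in `EXAMPLE_RULE` are hypothetical and are not taken from the rules evolved in this work.

```python
# Important features for dimensions 1 to 3, as listed in Table A2.
IMPORTANT_FEATURES = {
    1: ["src_bytes"],
    2: ["duration", "src_bytes"],
    3: ["duration", "src_bytes", "dst_host_srv_serror_rate"],
}


def project(record, dimension):
    """Keep only the features listed for the given dimension."""
    return {name: record[name] for name in IMPORTANT_FEATURES[dimension]}


def matches(rule, reduced):
    """True if every feature value lies inside the rule's closed interval."""
    return all(low <= reduced[name] <= high
               for name, (low, high) in rule.items())


# Hypothetical rule: the thresholds are invented for illustration only.
EXAMPLE_RULE = {
    "duration": (0.0, 1.0),
    "src_bytes": (0.0, 10.0),
    "dst_host_srv_serror_rate": (0.9, 1.0),
}
```

With a record such as `{"duration": 0.0, "src_bytes": 0.0, "dst_bytes": 0.0, "dst_host_srv_serror_rate": 1.0}`, `project(record, 3)` drops `dst_bytes`, and `matches(EXAMPLE_RULE, ...)` then compares only the three remaining values, which is what keeps the per-connection cost small.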

References

[1] Balasubramaniyan JS, Garcia-Fernandez JO, Isacoff D, Spafford E, Zamboni D. An architecture for intrusion detection using autonomous agents. In: Proceedings of 14th annual computer security applications conference, 1998.
[2] Heberlein LT, Mukherjee B, Levitt KN, Mansur DL. Towards detecting intrusions in a networked environment. In: Proceedings of 14th department of energy computer security group conference, 1991.
[3] White B, Fisch EA, Pooch UW. Cooperating security managers: a peer-based intrusion detection system. IEEE Network J 1996(Jan/Feb):20–3.
[4] Yao JT, Zhao SL, Saxton LV. A study on fuzzy intrusion detection. In: Dasarathy BV, editor. Proceedings of SPIE Vol. 5812, Data mining, intrusion detection, information assurance, and data networks security, 28 March–1 April 2005, Orlando, Florida, USA. Bellingham, WA: SPIE; 2005. p. 23–30.
[5] KDD Cup 1999 data. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, October 1999.
[6] McHugh J. Testing intrusion detection systems: a critique of the 1998 and 1999 DARPA IDS evaluation as performed by Lincoln laboratory. ACM Trans Inform Syst Security 2000;3(4):262–94, November.
[7] Gong RH, Zulkernine M, Abolmaesumi P. A software implementation of a genetic algorithm based approach to network intrusion detection. In: Proceedings of the sixth international conference on software engineering, artificial intelligence, networking and parallel/distributed computing and first ACIS international workshop on self-assembling wireless networks (SNPD/SAWN'05), 2005.
[8] Lu W, Traore I. Detecting new forms of network intrusion using genetic programming. Comput Intell 2004;20(3):470–90.
[9] Kayacik HG, Zincir-Heywood AN, Heywood MI. Selecting features for intrusion detection: a feature relevance analysis on KDD99 intrusion detection datasets. .
[10] Fodor IK. A survey of dimension reduction techniques. .
[11] Jolliffe IT. Principal component analysis. Springer Verlag; 1986.
[12] Mardia KV, Kent JT, Bibby JM. Multivariate analysis. Probability and mathematical statistics. Academic Press; 1995.
[13] Holland J. Adaptation in natural and artificial systems. Ann Arbor: The University of Michigan Press; 1975.
[14] Pohlheim H. Genetic and evolutionary algorithms: principles, methods and algorithms. , accessed in 2006.
[15] Chittur A. Model generation for an intrusion detection system using genetic algorithms. http://www1.cs.columbia.edu/ids/publications/gaids-thesis01.pdf, accessed in 2006.
[16] Kendall K. A database of computer attacks for the evaluation of intrusion detection systems. , accessed in 2005.
[17] Pan Z, Chen S, Hu G, Zhang D. Hybrid neural network and C4.5 for misuse detection. In: Proceedings of the second international conference on machine learning and cybernetics, vol. 4. 2003. p. 2463–7 [Nov.].
[18] Folino G, Pizzuti C, Spezzano G. GP ensemble for distributed intrusion detection systems. In: ICAPR 2005, 3rd international conference on advances in pattern recognition, LNCS, Springer Verlag, 3686/2005, Bath, UK, August 2005.
[19] Laskov P, Düssel P, Schäfer C, Rieck K. Learning intrusion detection: supervised or unsupervised? CIAP: international conference on image analysis and processing No. 13, vol. 3617. Italy: Cagliari; 2005, September.
[20] Guan Y, Ghorbani AA, Belacel N. Y-means: a clustering method for intrusion detection. In: Canadian conference on electrical and computer engineering, IEEE CCECE, vol. 2. 2003. p. 1083–6.
[21] Kim DS, Nguyen HN, Park JS. Genetic algorithm to improve SVM based network intrusion detection system. In: Proceedings of the 19th international conference on advanced information networking and applications, vol. 2. 2005. p. 155–8.
[22] Yao JT, Zhao SL, Saxton LV. A study on fuzzy intrusion detection. Data mining, intrusion detection, information assurance, and data networks security 2005. Orlando FL: 2005; [28–29 March].
[23] Bouzida Y, Gombault S. Eigenconnections to intrusion detection. In: Proceedings of the 19th IFIP international information security conference. Kluwer Academic; 2004. August.
[24] Joshi SS, Phoha VV. Investigating hidden Markov model capabilities in anomaly detection. In: Proceedings of the 43rd annual ACM southeast regional conference, vol. 1, 2005; p. 98–103.
[25] .
[26] Sklavos N, Koufopavlou O. Mobile communications world: security implementations aspects – a state of the art. CSJM J, vol. 11. Institute of Mathematics and Computer Science; 2003, Number 2(32), p. 163–87.
[27] Lough DL. A taxonomy of computer attacks with applications to wireless networks. PhD Dissertation. .
[28] Postel J. Transmission control protocol: DARPA Internet Program protocol specification. Request for Comments (RFC) 793, September 1981. Internet Engineering Task Force; .
[29] O'Hara B, Petrick A. The IEEE 802.11 handbook: a designer's companion. Standards Information Network. 3 Park Avenue, New York, New York, 10016-5997: IEEE Press; 1999.
[30] Borisov N, Goldberg I, Wagner D. Intercepting mobile communications: the insecurity of 802.11. In: The seventh annual international conference on mobile computing and networking. Rome: 2001. p. 180–9.
[31] Hocking RR. Development in linear regression methodology: 1959–1982. Technometrics 1983;25:219–49.
[32] Jolliffe IT. Discarding variables in principal component analysis. I: artificial data. Appl Statist 1972;21:160–72.
[33] Krzanowski WJ. Selection of variables to preserve multivariate data structure, using principal component analysis. Appl Stat – J Roy Stat Soc Series C 1987;36:22–33.

Zorana Banković received the title of Electrical Engineer from the Faculty of Electrical Engineering, University of Belgrade (Serbia) in 2005. Currently she is a PhD student at the Technical University of Madrid. Her main research interests are related to network security and FPGA implementations of common intrusion detection algorithms.

Dušan Stepanović received the title of Electrical Engineer from the University of Belgrade, Serbia, in 2004. Currently he is a PhD student at the University of California, Berkeley. His research interests include the design of high-speed and low-power digital integrated circuits and the general area of signal processing and its applications to image and video compression.

Slobodan Bojanić received the B.Sc., M.Sc. and Ph.D. degrees from the University of Belgrade (Serbia) in 1986, 1991 and 1997, respectively. Currently he is a Research Scientist at the Technical University of Madrid. His research interests are in the areas of data security related to FPGA implementations for accelerating cryptographic algorithms, cryptanalysis, network security and bioinformatics.

Octavio Nieto-Taladriz received the B.Sc. and Ph.D. degrees from the Universidad Politécnica de Madrid in 1984 and 1989, respectively. Currently he is Full Professor and Head of the Departamento de Ingeniería Electrónica at the ETSI Telecomunicación of the Universidad Politécnica de Madrid. His main research and development fields are embedded systems, high-performance digital architectures mainly focused on broadband radio communications, and the development and integration of services and applications for mobility over heterogeneous communication platforms, security, ambient intelligence and domotics.