Binary Gravitational Search Algorithm (BGSA): Improved Efficiency

Mansour Sheikhan
Department of Electrical Engineering, Islamic Azad University, South Tehran Branch, Tehran, Iran

Hamid Bostani
Department of Computer Engineering, Islamic Azad University, South Tehran Branch, Tehran, Iran


Abstract

Today, detecting anomalous traffic and preventing it in computer networks has become increasingly important for the community of security researchers. An intrusion detection system (IDS) is an effective tool for reaching high security. It is a software tool that analyzes system behavior or network traffic as input data to detect deviations from normal behavior. With the development of computer networks, high-dimensional input data analysis has become a huge problem in IDSs. One solution for overcoming this problem is feature selection, which is the process of selecting an optimal subset of features. Population-based heuristic search algorithms have been widely used for this optimization problem. This entry presents a novel feature selection method based on a binary gravitational search algorithm (BGSA). The proposed method, which is called modified BGSA (MBGSA), uses BGSA to perform the global search for the best subset of features through the wrapper method. Moreover, to improve the efficiency of BGSA, the mutual information (MI) feature selector under uniform information distribution (MIFS-U) method, which works as a filter method, is integrated into BGSA as an inner optimization layer. In fact, by computing the relevance between each selected feature and the target class and the redundancy between the selected features (in the feature subset generated by the wrapper), MIFS-U finds more valuable features that have maximum relevance to the target class and minimum redundancy to each other. The experimental results on the NSL-KDD dataset using different classifiers show that the proposed method can find better feature subsets and achieve higher accuracy and an improved detection rate using fewer features as compared to the standard BGSA and binary particle swarm optimization (BPSO) feature selection methods.

INTRODUCTION

In recent decades, the development of computer networks, especially the Internet, has meant that more sophisticated intrusions are being experienced. Intrusion detection systems (IDSs) now represent a significant issue in the information security field and are widely used in detecting intruders or malicious behaviors in computer networks. IDSs are classified into three main categories: misuse detection,[1] anomaly detection,[2] and specification-based systems.[3] In misuse detection systems, predefined attack patterns are modeled and maintained in a database of attack signatures. These systems raise an alert when a match between the signatures and the observed traffic or system behavior is identified. This method can detect all known attacks with a low false alarm rate (FAR), but new attacks are not detectable. Anomaly detection systems are based on the typical behavior of the network or system: they build a model of normal behavior using statistical or machine-learning methods and detect anomalies in the observed

data by noticing deviations from these models. In these methods, only normal data are required to build the profiles.[2] Anomaly detection algorithms are useful against new intrusions, but they are not as effective as misuse detection models in detecting known attacks,[4] and they suffer from a high false-positive rate.[4] Specification-based systems are similar to anomaly detection systems; however, in these systems, in addition to relying on machine-learning techniques, the guidance of experts is required to develop the model of traffic or system behavior.[3] Many different methods, such as Bayesian networks, artificial neural networks (ANNs), fuzzy logic, and genetic algorithms (GAs) (as soft computing models), have been used in the intrusion-detection context. The detection methods that are used in anomaly and misuse detection systems are classified into three main categories: statistically based, knowledge based, and machine learning.[5] To reduce the computational complexity and increase the accuracy of these methods, feature selection is needed in a preprocessing stage, because the number of features that
are used in an IDS is usually large, and feature selection algorithms can identify the most relevant ones. Thus, feature selection methods are used to determine the best minimal set of features that does not contain redundant features.[6]

In general, a feature selection problem in the context of machine learning can be described as follows. Suppose that T = (D, F, C) is a dataset with n instances, m dimensions or features, and k labels or target classes, where D = {o_1, o_2, …, o_n}, F = {f_1, f_2, …, f_m}, and C = {c_1, c_2, …, c_k} are the sets of instances, features, and labels, respectively. The goal of machine learning is to provide a model h: F → C that maps the input feature space F onto the class space C, and it would be better if the model h contained those input features that contribute the most information to the distribution of the classes.[7] A feature subset selection problem refers to finding an optimal feature subset F′ with d members of the whole feature space, where F′ ⊆ F and d ≤ m, that leads to the best possible accuracy in classification or, more precisely, optimizes a criterion function. Feature selection is useful for avoiding the over-fitting problem, improving the performance of classification models, reducing training and testing time, improving stability against noise, and reducing measurement and storage requirements.[8]

Depending on how feature subsets are evaluated, feature selection methods can be classified into filter, wrapper, and embedded methods.[8] The filter methods select a subset of features without using a learning algorithm and are faster than wrapper methods,[6] so these methods are appropriate for processing high-dimensional data. Wrapper methods use learning algorithms to evaluate the selected feature subsets, which means that they use classification feedback for feature selection. These methods improve the accuracy of classification at a high computational cost.[6] In the embedded approach, feature selection is a part of the training phase in machine learning, so the feature selection is specific to the applied learning algorithm.[6] Another approach to feature selection is based on hybrid methods. This approach is a combination of filter- and wrapper-based methods: the filter method selects the candidate features, which are then refined by the wrapper part.[6]

A feature selection algorithm has two main steps: a) a generation procedure that finds an optimal feature subset and b) an evaluation procedure that measures the optimality of the generated subset of features using an evaluation criterion.[9] Let the number of features in the dataset be n; the number of possible feature subsets is then 2^n − 1. Because the number of the best features is not predictable, all feature subsets should be evaluated to find an optimal feature subset. On the other hand, when n is large, it is impossible to evaluate all feature subsets, and therefore finding an optimum feature subset is usually NP-hard.[9] Finding an optimum feature subset depends on the search strategy. Search strategies are classified into exhaustive, heuristic, and random searches, and are combined with several types of measures to form different algorithms.[10] The time complexity is exponential in terms of data dimensionality for exhaustive search and quadratic for heuristic search.[10] In this work,
feature selection is defined as an optimization problem, and a hybrid feature selection procedure based on mutual information (MI) and the binary gravitational search algorithm (BGSA), called modified BGSA (MBGSA), is proposed in order to use the advantages of both filter and wrapper methods. MBGSA selects features by applying filter and wrapper methods in a hybrid way. This hybrid framework integrates the MI feature selector under uniform information distribution (MIFS-U), an MI-based filter method, into the BGSA-based wrapper model. In the proposed method, BGSA is used as a population-based heuristic search strategy to find the best feature subset. However, the relevance between the features and the target class and the redundancy between selected features are not considered in the feature subsets generated by BGSA. Hence, the MI approach is used as a filter method in the feature subset generation phase to improve the performance of BGSA. To evaluate each selected feature subset, support vector machine (SVM), classification and regression tree (CART), and naïve Bayes (NB) classifiers are used as popular machine-learning methods.

The rest of this entry is organized as follows: section "Related Work" reviews related work. In section "Preliminaries," the foundations of MI theory, the investigated classifiers (i.e., SVM, CART, and NB), and BGSA are introduced briefly. The proposed hybrid feature selection method and the NSL-KDD intrusion dataset are introduced in sections "Proposed Model" and "Intrusion Dataset," respectively. The performance of the proposed method is compared with the standard BGSA and binary particle swarm optimization (BPSO) feature selection methods in section "Experimental Results," where the experimental results on the NSL-KDD dataset are reported. The entry is concluded in section "Conclusion."

RELATED WORK

Many feature selection algorithms have been proposed in recent decades. Depending on different criteria, such as the availability of the class labels, the number of features evaluated at one time, and the evaluation manner, feature selection algorithms can be classified into different categories.[7] In this section, we focus solely on the evaluation manner. According to the evaluation manner, feature selection methods can be classified into three main groups: filter, wrapper, and embedded. The filter and embedded methods evaluate features by the worthiness of their relevance to the class labels, whereas the wrapper methods often take classification performance as the evaluation measure.[7] A variety of evaluation criteria have been introduced for filter-based methods, which can be classified into five groups:[8] distance, information (or uncertainty), dependency, consistency, and classifier error rate. The information metric is a nonlinear correlation metric[8] and is a good measure for quantifying the uncertainty of a feature.[9] In this vein, Deisy et al.[9] proposed the information theoretic-based interact (IT-IN) algorithm, which selects
relevant features through efficient feature interaction. They ranked the features in a special hashing data structure to overcome the feature-order problem and to avoid repeated scanning of the dataset, exploiting properties of the c-contribution measure. The experimental results of their study showed that IT-IN significantly reduced the number of features and performed better than the Relief,[11] correlation-based feature selection (CFS),[12] and interact[13] algorithms. Liu, Wu, and Zhang[7] introduced a supervised feature selection method that picks important features using information criteria. Their method works like clustering and uses maximal relevance to the class labels and minimal redundancy to the selected features. To measure the relevance and redundancy of a feature, two different information criteria (i.e., MI and the coefficient of relevance) were adopted in their method. The results of simulation experiments on 12 datasets showed that their method outperformed the other feature selection methods investigated in their work. Zhang and Hancock[14] presented a hypergraph-based, information-theoretic approach to feature selection. They used hypergraph clustering to select the most informative feature subset (MIFS) from a set of objects using high-order similarities. In the initial step, they used multidimensional interaction information (MII) to measure the significance of different feature combinations with respect to the class labels, and, in the next step, hypergraph-based clustering was used to select the MIFS. Hence, the optimal size of the feature subset can be specified automatically.

In recent years, algorithms inspired by natural phenomena have been used in the soft-computing community to solve complex optimization problems. For this purpose, several evolutionary and swarm-based methods have been used for feature selection. For example, Unler, Murat, and Chinnam[15] proposed a hybrid feature selection method based on particle swarm optimization (PSO) for SVM classification. They used MI to measure feature relevance and redundancy and to weigh the bit-selection probabilities in discrete PSO. They compared the performance of their method with a hybrid filter–wrapper method based on GA and a wrapper algorithm based on PSO. The results showed that their method is superior in terms of classification accuracy and computational performance. Sheikhan and Mohammadi[16] proposed a hybrid of GA and ant colony optimization (ACO) feature selection algorithms, using a multilayer perceptron (MLP) neural network as the classifier, in a short-term load forecasting (STLF) application. The performance of the ACO + MLP and GA-ACO + MLP hybrid models was compared with a principal component analysis (PCA) + MLP hybrid model and also with the case of no feature selection (NFS) when using MLP and radial basis function (RBF) neural models. Experimental results showed that the feature selection based on GA-ACO performs better than the other simulated methods. Sivagaminathan and Ramakrishnan[17] presented a hybrid feature selection method based on ACO and ANN.

The gravitational search algorithm (GSA) is one of the more recent heuristic optimization algorithms; it is based on the metaphor of gravitational interaction among masses.[18] The binary version of GSA has also been used for feature selection in recent years. For example, Sheikhan[19] compared the performance of BGSA-based feature selection with the binary PSO algorithm in a prosody generation application based on a recurrent neural network (RNN).

In this entry, a hybrid filter–wrapper algorithm based on BGSA is introduced for the feature selection problem. Many of the proposed hybrid methods are based on two sequential stages: in the first stage (the filter phase), some features are removed to reduce the number of features; in the second stage, the remaining features are used by the wrapper method to find the optimum feature subset. In contrast, the method proposed in this study encapsulates the filter method inside the wrapper module. In other words, to reach minimum redundancy between features and maximum relevance between features and the target class, a filter method based on MI is integrated into BGSA, which is used as the wrapper model. This improves the performance of feature selection as compared to the standard BGSA and BPSO feature selection methods. In our experiments, three classic classifiers (i.e., SVM, CART, and NB) are used for classification.

PRELIMINARIES

In this section, we briefly review the foundations of MI, the investigated classifiers (i.e., SVM, CART, and NB), and BGSA.

Mutual Information

Bonev,[20] in his Ph.D. thesis, classified features into three categories: relevant, redundant, and noisy. Relevant features are the features that carry information about the target classes; that is, relevant features can classify instances by themselves or in a subset with other features. These features are of two different types: strongly relevant and weakly relevant. A feature f_i is strongly relevant if its removal degrades the performance of a classifier. It is weakly relevant if it is not strongly relevant and there is a subset of features F such that the performance of a classifier on F ∪ {f_i} is better than on F alone.[20] Because redundant features provide the same information about the target classes as the selected features, they can be removed from the feature space. Some features carry no information about the target classes and are not redundant either; these are called noisy features, and they should also be removed, because they have a negative effect on classification accuracy (Fig. 1).[20]

Fig. 1 Illustration of feature types as a feature space.

In most intrusion-detection datasets, some of the features are practically either redundant or irrelevant (noisy) to the classification problem, and they result in a low detection rate (DR). One of the popular and effective methods for improving feature selection is based on using MI. MI is an information-theoretic metric that is used to measure the relevance of features, and it shows how much information is shared between variables. Suppose X and Y are two discrete random variables that take values from the sets A and B, respectively. The MI of these two variables is defined as follows:

I(X; Y) = \sum_{x \in A} \sum_{y \in B} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}; \quad p(x) = \Pr(X = x), \; p(y) = \Pr(Y = y)    (1)

where p(x) and p(y) are the marginal probability distribution functions of X and Y and p(x, y) is their joint probability distribution function. If X and Y are closely related to each other, the MI between them will be very high, and vice versa. The main idea in using MI for feature selection is that the features should be highly correlated with the target class without being correlated with each other.[8] Based on this idea, Battiti[21] proposed a method called mutual information feature selection (MIFS), whose evaluation function is defined as follows:

J(f_i) = I(f_i; C) - \beta \sum_{f_s \in S} I(f_i; f_s)    (2)

where f_i is the candidate feature, C is the target class, and S is the set of already selected features. β is a parameter that is determined empirically and varies between 0 and 1.[8] The aim of this method is to maximize the relevance between the input features and the target class and to minimize the redundancy between the selected features.[22] In Eq. (2), the first term measures the relevance between the new input feature and the output class, and the second term acts as a penalty on the first term, measuring the correlation between the new feature and the already selected features. Choosing an appropriate value of β is a critical task, because it strongly affects which features MIFS selects. To decrease the influence of the parameter β on MIFS, Kwak and Choi[23] proposed a new method called MIFS-U. They modified the MIFS evaluation function as follows:

J(f_i) = I(f_i; C) - \beta \sum_{f_s \in S} \frac{I(f_s; C)}{H(f_s)} I(f_i; f_s)    (3)

where H(f_s) is the entropy of f_s, which is the uncertainty or equivocation of f_s. Here again, the parameter β acts as a factor controlling the redundancy penalization. In Eq. (3), it is obvious that 0 ≤ I(f_s; C)/H(f_s) ≤ 1. This ratio also gives the proportion of the redundant information between feature f_i and feature f_s within the MI between the selected feature f_s and the output classes C.[24] Thus, Eq. (3) can be rewritten as follows:

J(f_i) = I(f_i; C) - \frac{1}{|S|} \sum_{f_s \in S} \frac{I(f_s; C)}{H(f_s)} I(f_i; f_s)    (4)

In this study, the MIFS-U evaluation function is integrated into BGSA as an intermediate measure for improving feature-selection performance.
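To make Eqs. (1)-(4) concrete, the following Python sketch estimates MI from empirical frequencies of discretized variables and evaluates the MIFS-U criterion J(f_i) for a candidate feature against a set of already selected features. It is only an illustration: the function names, the log base, and the toy data are our own assumptions, not the implementation used by the authors.

```python
import numpy as np

def mutual_information(x, y):
    """Empirical MI of Eq. (1) between two discrete variables given as 1-D integer arrays."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    for xi, yi in zip(x, y):
        joint[xi, yi] += 1
    joint /= joint.sum()                      # p(x, y)
    px = joint.sum(axis=1, keepdims=True)     # p(x)
    py = joint.sum(axis=0, keepdims=True)     # p(y)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

def entropy(x):
    """Empirical entropy H(X) of a discrete variable."""
    p = np.bincount(x) / len(x)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def mifs_u(candidate, selected, target):
    """MIFS-U score of Eq. (4): relevance minus the averaged redundancy penalty."""
    relevance = mutual_information(candidate, target)
    if not selected:
        return relevance
    penalty = sum(mutual_information(fs, target) / entropy(fs) *
                  mutual_information(candidate, fs) for fs in selected)
    return relevance - penalty / len(selected)

# Toy usage with hypothetical discretized features
rng = np.random.default_rng(0)
target = rng.integers(0, 2, 200)
informative = (target + rng.integers(0, 2, 200)) % 2   # correlated with the class
noisy = rng.integers(0, 3, 200)                        # unrelated to the class
print(mifs_u(noisy, [informative], target))            # low score, likely below the threshold
```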

Investigated Classification Methods

Supervised machine learning takes a known set of input data and known responses to those data and seeks to build a predictive model that generates reasonable predictions for the responses to new data. Classification is a fundamental task in supervised learning. In this section, we review the foundations of the three classification methods that are used to evaluate the proposed feature selection algorithm.

Support vector machine

In 1995, Cortes and Vapnik[25] introduced a learning machine based on statistical learning theory called the SVM, which has been broadly used in classification problems. SVM has been widely used on small datasets and in high-dimensional feature spaces because of its good applicability and high efficiency, especially in anomaly-based intrusion detection.[26] SVM is a binary classifier that can classify data into two groups. The basic concept is to find an optimal separating hyperplane that classifies separable data by maximizing the margin between points of the two classes. The optimal hyperplane therefore lies in the middle of the margin. A large margin leads to a lower generalization error of the SVM classifier.[27] As seen in Fig. 2, the points lying on the margin boundaries are called support vectors. Unlike classic methods (e.g., the MLP neural network) that merely minimize the empirical training error, SVM focuses on minimizing an upper bound of the generalization error by maximizing the margin between the separating marginal hyperplanes and the data.[27]

Fig. 2 A separable problem in a 2-D space.

Suppose a dataset S consists of a set of points S = \{(x_i, y_i) \mid x_i \in R^d, y_i \in \{+1, -1\}\}, i = 1, …, n, where d is the number of dimensions of the problem space and n is the number of instances in the dataset. The general equation of the hyperplane for this linear binary SVM can be written as follows:

\sum_{i=1}^{n} w_i x_i + b = 0 \;\Rightarrow\; W^T X + b = 0, \quad W = (w_1, \ldots, w_n)^T, \; X = (x_1, \ldots, x_n)^T    (5)

where W is the vector normal to the hyperplane (its perpendicular distance from the origin is determined by b). SVM is based on the optimal hyperplane that maximizes the separating margin between the two classes (optimal margin).[28] Therefore, SVM focuses on two parallel marginal hyperplanes, called supporting hyperplanes, as the borderlines whose separation is to be maximized. In binary SVM, the classifier should satisfy the following:

\forall i = 1, \ldots, n: \; \text{if } y_i = +1 \text{ then } W^T x_i + b \geq +1; \; \text{if } y_i = -1 \text{ then } W^T x_i + b \leq -1, \qquad R = \frac{2}{\lVert W \rVert}    (6)

where R is the distance between the supporting hyperplanes. According to this description, the fitness function is defined as follows:

\underset{(W, b)}{\arg\min} \; \frac{1}{2} \lVert W \rVert^2 = \underset{(W, b)}{\arg\min} \; \frac{1}{2} W^T W \quad \text{s.t.} \quad y_i (W^T x_i + b) \geq 1, \; i = 1, \ldots, n    (7)

The above primal problem is difficult to solve because b (the bias) in Eq. (7) does not affect the fitness function and affects only feasibility. For simplification, the Lagrange technique can be used to remove the constraints of Eq. (7) and append them to the fitness function. Therefore, for each of the constraints in Eq. (7), a non-negative multiplier α_i is introduced and a new fitness function is expressed as follows:

L_{Primal} = \underset{(W, b)}{\arg\min} \; \underset{\alpha \geq 0}{\max} \; \frac{1}{2} W^T W - \sum_{i=1}^{n} \alpha_i \left[ y_i (W^T x_i + b) - 1 \right]    (8)

The optimum solution under these constraints is called the saddle point. The advantage of this method is that W and b can be removed from L_{Primal} as follows:

\frac{\partial L_{Primal}}{\partial W} = 0 \Rightarrow W = \sum_{i=1}^{n} \alpha_i y_i x_i, \qquad \frac{\partial L_{Primal}}{\partial b} = 0 \Rightarrow \sum_{i=1}^{n} \alpha_i y_i = 0 \; (\text{bias constraint})    (9)

Using Eq. (9) in L_{Primal}, the problem depends only on maximizing over α_i. Therefore, L_{Primal} is transformed into L_{Dual} as follows:

L_{Dual} = -\frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j + \sum_{i=1}^{n} \alpha_i = -\frac{1}{2} \alpha^T H \alpha + f^T \alpha    (10)

where α, H, and f are defined as:

H = [h_{ij}]_{n \times n}, \; h_{ij} = y_i y_j x_i^T x_j, \qquad \alpha = (\alpha_1, \ldots, \alpha_n)^T_{n \times 1}, \qquad f = (1, \ldots, 1)^T_{n \times 1}    (11)

Therefore, the new fitness function is defined as follows (a quadratic program):

\underset{\alpha}{\arg\max} \left[ -\frac{1}{2} \alpha^T H \alpha + f^T \alpha \right] \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \quad \alpha_i \geq 0, \; \forall i = 1, \ldots, n    (12)

Furthermore, if S is the set of support vectors, the bias can be computed as follows:

\forall i \in S: \; b_i = y_i - W^T x_i \;\Rightarrow\; b = \frac{1}{|S|} \sum_{i \in S} \left( y_i - W^T x_i \right)    (13)

In the training phase, the SVM finds the optimum hyperplane, which is then used to classify instances in the testing phase. It should be noted that when a linear decision boundary is not sufficient, we can project the data onto a higher-dimensional space using a transformation function. This means that
the linear SVM can be extended to a nonlinear classifier by a nonlinear kernel function ϕ(X), which transforms the input instances into a higher-dimensional space. One of the popular kernel functions is the radial basis function (RBF), which can map the data from a low-dimensional space to a high-dimensional space. The RBF kernel is defined as follows:

K(x_i, x_j) = \exp\!\left( -\frac{\lVert x_i - x_j \rVert^2}{2\sigma^2} \right)    (14)

where \lVert x_i - x_j \rVert^2 is the squared Euclidean distance between the two feature vectors in the original space and σ is a free parameter. Therefore, in order to transform the linear optimization problem into a nonlinear one, x_i^T x_j in Eq. (11) should be replaced by K(x_i, x_j), the RBF kernel function.
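As an illustration of how an RBF-kernel SVM can serve as the wrapper's fitness evaluator for a binary feature mask, the sketch below uses scikit-learn's SVC with cross-validated accuracy. The dataset, the mask, and the parameter values are hypothetical placeholders and do not reproduce the settings used in the experiments reported later in this entry.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def subset_accuracy(X, y, mask, C=1.0, gamma="scale"):
    """Fitness of a binary feature mask: cross-validated accuracy of an RBF-kernel SVM
    (the kernel of Eq. 14) trained only on the selected columns."""
    cols = np.flatnonzero(mask)
    if cols.size == 0:
        return 0.0                      # an empty subset cannot classify anything
    clf = SVC(kernel="rbf", C=C, gamma=gamma)
    return cross_val_score(clf, X[:, cols], y, cv=3).mean()

# Hypothetical data: 300 samples, 10 features, binary labels
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + 0.5 * X[:, 3] > 0).astype(int)
mask = np.array([1, 0, 0, 1, 0, 0, 0, 0, 0, 0])   # candidate feature subset
print(subset_accuracy(X, y, mask))
```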

Classification and regression tree

Decision-tree learning is one of the most successful techniques for supervised classification. A decision tree is a structure that includes a root node, internal nodes, and leaf nodes. An internal node represents a test on a feature, a branch represents an outcome of that test, and a leaf node represents a class label.[29] A decision tree is usually built top-down by recursively selecting a feature to split on and partitioning the training samples with respect to that feature. Each new subspace of the data is then split into new subspaces iteratively until a stopping criterion is met. Setting an appropriate stopping criterion is very important, because trees that are too large can overfit and trees that are too small can underfit, the result in both cases being a loss of accuracy.[30] In a decision tree, each new instance is classified by navigating it from the root of the tree down to a leaf, according to the outcomes of the tests along the path.[30] The goal of decision-tree classification is to iteratively partition the given dataset into subsets such that all elements in each final subset belong to the same class. The various heuristic methods for the construction of decision trees can roughly be divided into four categories:[31] bottom-up, top-down, hybrid, and tree growing-pruning approaches. One of the popular growing-pruning approaches is CART,[32] which is constructed in two stages: growing and pruning. CART is a binary decision tree that is constructed by recursively partitioning the N learning samples into subsets, beginning from the root node, which contains the whole learning sample. It can work with continuous and nominal features. A visualization of this process is shown in Fig. 3.[33]

Fig. 3 The structure of CART.

When the algorithm has generated a maximal tree, it examines smaller trees obtained by pruning away branches of the maximal tree. As the splitting criterion, the Gini index[32] is used, which is essentially a measure of how well the splitting rule separates the classes contained in the parent node. The Gini index measures the inequality among the values of a frequency distribution.

The growing process of the CART algorithm is now briefly reviewed. As already mentioned, a CART tree is a binary decision tree that is constructed by splitting a node into two child nodes repeatedly, beginning with the root node, which contains the whole learning sample. The basic idea of tree growing is to choose, at each node, a split among all possible splits so that the resulting child nodes are the "purest." Each split depends on the value of only one predictor variable, and all possible splits consist of the possible splits of each predictor. If X is a nominal categorical variable with I categories, there are 2^{I-1} − 1 possible splits for this predictor, and if X is an ordinal categorical or continuous variable with K distinct values, there are K − 1 different splits on X. The basic idea in choosing a splitting criterion at an internal node is to make the data in the child nodes "purer." One way to accomplish this is to define an impurity function i(t) at every internal node t. Suppose that at node t, the best split s, which divides node t into a left child t_L and a right child t_R, is chosen to maximize the splitting criterion Δi(s, t). When the impurity measure of a node is defined, the splitting criterion corresponds to a decrease in impurity:

\Delta i(s, t) = i(t) - p_L \, i(t_L) - p_R \, i(t_R)    (15)

where p_L and p_R are the probabilities of sending a case to the left child node t_L and to the right child node t_R, respectively. They are estimated as p_L = p(t_L)/p(t) and p_R = p(t_R)/p(t). The impurity function suggested in Breiman et al.[32] is the Gini index, defined as follows:

i(t) = \sum_{i, j} C(i \mid j) \, p(i \mid t) \, p(j \mid t)    (16)

where C(i | j) is the cost of misclassifying a class-j case as a class-i case, and p(i | t) and p(j | t) are the probabilities of a random sample X belonging to class i and class j, respectively, given that it falls into node t.
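A minimal sketch of Eqs. (15) and (16), assuming unit misclassification costs (so the Gini index reduces to 1 − Σ_i p(i|t)²); the candidate threshold and the toy labels below are illustrative only.

```python
import numpy as np

def gini(labels):
    """Gini impurity of Eq. (16) with unit misclassification costs: 1 - sum_i p(i|t)^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_gain(feature, labels, threshold):
    """Decrease in impurity of Eq. (15) for the candidate split 'feature <= threshold'."""
    left = labels[feature <= threshold]
    right = labels[feature > threshold]
    if len(left) == 0 or len(right) == 0:
        return 0.0
    p_left = len(left) / len(labels)
    p_right = 1.0 - p_left
    return gini(labels) - p_left * gini(left) - p_right * gini(right)

# Illustrative data: the candidate split separates the two classes perfectly
feature = np.array([0.10, 0.40, 0.35, 0.80, 0.90, 0.75])
labels = np.array([0, 0, 0, 1, 1, 1])
print(split_gain(feature, labels, threshold=0.5))   # 0.5, the maximum possible gain here
```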

The final component of decision-tree construction is determining when the splitting should be stopped. One approach is to use stopping rules. Breiman et al.[32] suggested continuing the splitting until all the terminal nodes are pure or nearly pure (instead of using stopping rules) and then selectively pruning this large tree in order to obtain a decreasing sequence of subtrees.

Naïve Bayes

Bayesian inference computes probabilities according to Bayes' theorem:

P(A \mid B) = \frac{P(B \mid A) \, P(A)}{P(B)}    (17)

According to this theorem, we can calculate the probability of event A conditioned on data B by multiplying the probability of the data B conditioned on event A by the probability of event A and normalizing by the probability of the data B. Bayes' theorem thus suggests a straightforward process for finding the hypothesis with maximum probability that is not based on search methods.[34] The NB classifier is a supervised learning algorithm; it is a simple probabilistic classifier based on applying Bayes' theorem with a strong (naïve) independence assumption, which means that the features are assumed to be conditionally independent given the class. In other words, NB classifiers are very simple Bayesian networks composed of directed acyclic graphs with only one parent (representing the unobserved class node) and several children (corresponding to the observed nodes), with a strong assumption of independence among the child nodes in the context of their parent.[35] Fig. 4 shows an example of NB.

Fig. 4 An example of naïve Bayes.

In particular, the decoupling of the class-conditional feature distributions means that each distribution can be independently estimated as a one-dimensional distribution. The objective of the NB method is to classify a new sample using the feature values that describe it; that is, the classifier picks the hypothesis that is most probable. This is known as the maximum a posteriori (MAP) decision rule. Suppose f: x_p → C is a function that maps a training pattern x_p, described by the n-dimensional feature vector F = <f_1, …, f_n>, to C, the set of classes of the training patterns. We want to find the class of a test pattern x_test using NB:

\text{classify}(x_{test}) = C_{MAP} = \underset{c_j}{\arg\max} \left\{ P(c_j) \times \prod_{i=1}^{n} P(a_i = v \mid c_j) \right\}    (18)

As shown in Eq. (18), the NB classifier assumes that the presence (or absence) of a particular feature of a class is unrelated to the presence (or absence) of any other feature, given the class variable.[34] P(c_j) is calculated by counting the number of instances that belong to class c_j and then dividing it by the cardinality of the training set. The value of P(a_i = v | c_j) is calculated in the same way: count the number of instances of class c_j whose ith feature value equals v, and then divide by the number of instances of class c_j. In the learning phase of the NB classifier, the prior probability P(c_j) is computed for each class c_j ∈ C in the given training set. Then, for each feature value a_i = v, P(a_i = v | c_j) is calculated as the probability of that feature value given class c_j. In the classification phase, P(c_j) × \prod_{i=1}^{n} P(a_i = v | c_j) is computed for each class c_j ∈ C, and then, using Eq. (18), the most probable class is selected.
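The following sketch applies the MAP rule of Eq. (18) to discrete features by frequency counting. The tiny training set is hypothetical, and no smoothing is used, so unseen feature values receive zero probability; a practical implementation would add, for example, Laplace smoothing.

```python
import numpy as np
from collections import defaultdict

def train_nb(X, y):
    """Estimate the priors P(c_j) and likelihoods P(a_i = v | c_j) by frequency counting."""
    priors, likelihoods = {}, defaultdict(dict)
    for c in np.unique(y):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)
        for i in range(X.shape[1]):
            values, counts = np.unique(Xc[:, i], return_counts=True)
            likelihoods[c][i] = dict(zip(values, counts / len(Xc)))
    return priors, likelihoods

def classify_nb(x, priors, likelihoods):
    """MAP decision rule of Eq. (18): argmax_c P(c) * prod_i P(a_i = v | c)."""
    scores = {c: priors[c] * np.prod([likelihoods[c][i].get(v, 0.0)
                                      for i, v in enumerate(x)])
              for c in priors}
    return max(scores, key=scores.get)

# Hypothetical discrete training data: two features, two classes
X = np.array([[0, 1], [0, 1], [1, 0], [1, 1], [1, 0], [0, 0]])
y = np.array([0, 0, 1, 1, 1, 0])
priors, likelihoods = train_nb(X, y)
print(classify_nb(np.array([1, 0]), priors, likelihoods))   # expected: class 1
```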

Binary Gravitational Search Algorithm

According to the Newtonian law of gravitation, each particle attracts every other particle with a gravitational force,[18] which causes a global movement of all objects toward the objects with heavier masses.[18] This force is directly proportional to the masses of the particles and inversely proportional to the square of the distance between them:[36]

F = G \frac{M_1 M_2}{R^2}    (19)

where M_1 and M_2 are the masses of the particles and R is the distance between them. Based on Eq. (19), the gravitational force between large particles that are close to each other is large. G is the gravitational constant, and its actual value depends on the age of the universe:[36]

G(t) = G_0 \left( 1 - \frac{t}{t_{max}} \right)    (20)

where t_{max} is the total number of iterations and G_0 is the value of the gravitational constant at the beginning. Computer science uses the Newtonian law of gravitation in optimization problems as follows: some random candidate solutions, called agents, are created as the initial population for an optimization problem. Then, each agent moves the other agents according to the Newtonian law of gravitation, so the problem space is searched to find the best possible solution. According to GSA, at the beginning of the algorithm a population of n agents is created. The position of each agent (particle), which is a solution, is defined as follows:

X_i = \left( x_i^1, \ldots, x_i^d, \ldots, x_i^m \right); \quad i = 1, 2, \ldots, n    (21)

where x_i^d is the position of the ith mass in the dth dimension and m is the number of dimensions. Then, using the fitness value of each particle, the masses are calculated as follows (for a maximization problem):

M_i(t) = \frac{q_i(t)}{\sum_{j=1}^{n} q_j(t)}; \quad q_i(t) = \frac{fit_i(t) - worst(t)}{best(t) - worst(t)}    (22)

where fit_i(t) represents the fitness value of the ith agent in the tth iteration (a specific time). best(t) and worst(t) denote the best and worst fitness values in the tth iteration and are defined as follows:

best(t) = \max_{j \in \{1, \ldots, n\}} fit_j(t), \qquad worst(t) = \min_{j \in \{1, \ldots, n\}} fit_j(t)    (23)

In the next step, the force acting on the ith agent from the jth agent in the dth dimension in the tth iteration is calculated as follows:

F_{ij}^d(t) = G(t) \frac{M_i(t) \times M_j(t)}{R_{ij} + \varepsilon} \left( x_j^d(t) - x_i^d(t) \right)    (24)

where M_i and M_j are the masses of agents i and j, respectively, and ε is a small constant. R_{ij} is the Hamming distance between the two agents i and j. The total force acting on the ith agent in dimension d is calculated by summing the randomly weighted forces exerted by the other agents in the dth dimension:

F_i^d(t) = \sum_{j \in Kbest, \, j \neq i} \gamma_j F_{ij}^d(t)    (25)

where γ_j is a random value between 0 and 1. Note that only the Kbest agents with the highest fitness values are considered to attract the others,[19] because this limitation improves the performance of GSA by controlling exploration and exploitation so as to avoid trapping in local optima.[36] At the beginning, the set Kbest comprises all the agents, and its size decreases linearly to one over time. In the next step, the acceleration of the ith agent in dimension d is computed based on the law of motion:

a_i^d(t) = \frac{F_i^d(t)}{M_i(t)}    (26)

Then, the velocity of an agent in dimension d at the tth iteration is computed from its current velocity and its acceleration:

v_i^d(t + 1) = \alpha_i \, v_i^d(t) + a_i^d(t)    (27)

where α_i is a random value between 0 and 1. In BGSA, the position of an agent in each dimension can take only the value 1 or 0, indicating whether or not the equivalent feature subset includes the corresponding feature. The position update of the agents is calculated according to the velocity-based probability:

x_i^d(t + 1) = \begin{cases} \text{complement}\big(x_i^d(t)\big), & \gamma_i < S\big(v_i^d(t)\big) = \left| \tanh\big(v_i^d(t)\big) \right| \\ x_i^d(t), & \text{otherwise} \end{cases}    (28)

To reach a good convergence rate, Rashedi, Nezamabadi-pour, and Saryazdi[18] limited the velocity by |v_i^d(t)| < v_{max} = 6. The above steps continue until the allowed number of iterations is reached or an acceptable solution is found. The flowchart of BGSA is shown in Fig. 5.

Fig. 5 Flowchart of BGSA.
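A condensed sketch of one BGSA iteration over a binary population, following Eqs. (20), (22), and (24)-(28). The fitness function, the constants, and the linearly shrinking Kbest schedule are simplifying assumptions made here for illustration, not the exact implementation used by the authors.

```python
import numpy as np

rng = np.random.default_rng(42)

def bgsa_step(X, fitness, t, t_max, G0=100.0, v_max=6.0, V=None):
    """One BGSA iteration: masses (Eq. 22), forces (Eqs. 24-25), acceleration (Eq. 26),
    velocity update (Eq. 27), and probabilistic bit-flip position update (Eq. 28)."""
    n, m = X.shape
    V = np.zeros((n, m)) if V is None else V
    fit = np.array([fitness(x) for x in X])
    G = G0 * (1 - t / t_max)                               # Eq. (20)
    best, worst = fit.max(), fit.min()
    q = (fit - worst) / (best - worst + 1e-12)             # Eq. (22), per-agent quality
    M = q / (q.sum() + 1e-12)                              # Eq. (22), normalized masses
    kbest = max(1, int(round(n * (1 - t / t_max))))        # linearly shrinking Kbest
    elite = np.argsort(fit)[-kbest:]
    F = np.zeros((n, m))
    for i in range(n):
        for j in elite:
            if j == i:
                continue
            R = np.sum(X[i] != X[j])                       # Hamming distance R_ij
            F[i] += rng.random() * G * M[i] * M[j] / (R + 1e-12) * (X[j] - X[i])  # Eqs. (24)-(25)
    a = F / (M[:, None] + 1e-12)                           # Eq. (26)
    V = np.clip(rng.random((n, m)) * V + a, -v_max, v_max) # Eq. (27), |v| limited by v_max
    flip = rng.random((n, m)) < np.abs(np.tanh(V))         # Eq. (28)
    X = np.where(flip, 1 - X, X)
    return X, V

# Toy usage: 5 agents, 8 features; the fitness rewards a hypothetical set of "useful" bits
useful = np.array([1, 1, 0, 0, 1, 0, 0, 0])
fitness = lambda x: np.sum(x * useful) - 0.1 * x.sum()
X, V = rng.integers(0, 2, size=(5, 8)), None
for t in range(1, 21):
    X, V = bgsa_step(X, fitness, t, t_max=20, V=V)
print(X)
```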


Fig. 6 The proposed framework.

PROPOSED MODEL

In this section, the proposed feature selection method based on BGSA is introduced. The proposed method uses MIFS-U as an MI-based filter method to reach maximum relevance between the selected features and the target class and minimum redundancy between the selected features. Fig. 6 depicts the presented hybrid framework. According to Fig. 6, the feature selection process comprises the following steps:

Step 1 (Initialization): A number of feature subsets are generated randomly as the initial population of n agents. Each agent is a binary string of length m (the number of features in the dataset), where the value of each bit indicates the presence of the corresponding feature in that agent. The agent position is thus determined by the values of the binary string. In this step, a threshold is also computed as follows:

\delta_{MI} = \frac{\sum_{f_j \in F} I(f_j; C)}{m}    (29)

where F is the feature set with m dimensions and C is the target class. This threshold is the average MI between each feature and the target class, and it is used when updating the positions of the agents.

Step 2 (Evaluation): Each agent is evaluated by the fitness function. That is, for each agent (feature subset), a binary classifier (e.g., SVM, CART, or NB) is trained on the instances of the dataset restricted to the features of the corresponding subset and is then evaluated on the testing instances. In this step, the best agent position (best feature subset) among the current positions of the agents and the best positions up to the previous iterations is identified based on the fitness function, which is the accuracy rate.

Step 3 (Updating): The positions of the agents are updated based on the Newtonian gravitational laws. To do so, G and the best and worst values of the population are first updated using Eqs. (20) and (23), respectively. Then, the mass (M) and the acceleration (a) of the agents are calculated using Eqs. (22) and (26), respectively. After that, the velocity and the position are updated using Eqs. (27) and (28), respectively. If the maximum number of iterations is reached (the termination condition), the best agent position is returned as the optimal feature subset.

Because the position of each agent is updated only according to Eq. (28), the select/deselect decision about a particular feature is independent of the other feature selection decisions, which can increase the redundancy or statistical dependencies among the selected features. MI is therefore used in the position-updating phase in order to find an optimum feature subset and achieve higher accuracy. To use MI in the third step, as shown in Fig. 7, every dimension of each agent's position is updated using Eq. (28) until the first dimension whose value is updated to 1 is encountered. After this position index, the algorithm works sequentially: for each remaining feature (called the current candidate feature or current position dimension), we first calculate the relevance and redundancy score J(f_candidate) among the current candidate feature, the selected features (every position dimension before the current one that has the value 1), and the target class using Eq. (4). After that, if J(f_candidate) ≥ δ_MI, the current candidate feature is selected (the current position dimension is set to 1); otherwise, the current candidate feature is skipped (the current position dimension is set to 0).

Fig. 7 Updating the position's dimension of an agent.

The following pseudocode describes the proposed algorithm:

Algorithm 1. Updating phase in the proposed method

for each agent i = 1, 2, …, n
    for each dimension d = 1, 2, …, m
        repeat
            update x_i^d by using Eq. (28)
        until x_i^d = 1
        while d < m
            compute J(f_{d+1}) from the (d+1)th feature, the selected features S = {k | x_i^k = 1, k = 1, …, d}, and the target class, using Eq. (4)
            if J(f_{d+1}) ≥ δ_MI then
                set x_i^{d+1} = 1    (the new candidate feature is selected)
            else
                set x_i^{d+1} = 0    (the new candidate feature is skipped)
            end
            set d = d + 1
        end
    end
end
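A minimal Python rendering of Algorithm 1 for a single agent, under the assumption that a scorer J implementing Eq. (4) and the threshold δ_MI of Eq. (29) are supplied by the caller; the stand-in J_stub used in the usage example is a random placeholder, not the MIFS-U criterion itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def update_agent(x, v, J, delta_mi):
    """Updating phase of Algorithm 1 for one agent.
    x, v     : binary position and velocity vectors of the agent (length m)
    J        : callable J(candidate_index, selected_indices) -> MIFS-U score of Eq. (4)
    delta_mi : threshold of Eq. (29)."""
    m = len(x)
    d = 0
    # Advance with the plain BGSA rule (Eq. 28) until the first bit is set to 1.
    while d < m:
        if rng.random() < abs(np.tanh(v[d])):   # Eq. (28)
            x[d] = 1 - x[d]
        if x[d] == 1:
            break
        d += 1
    # From there on, accept a candidate feature only if its MIFS-U score passes the threshold.
    while d < m - 1:
        selected = [k for k in range(d + 1) if x[k] == 1]
        x[d + 1] = 1 if J(d + 1, selected) >= delta_mi else 0
        d += 1
    return x

# Hypothetical stand-ins: a random score in place of Eq. (4) and an arbitrary threshold
J_stub = lambda cand, selected: rng.random()
x = np.zeros(10, dtype=int)
v = rng.normal(size=10)
print(update_agent(x, v, J_stub, delta_mi=0.5))
```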

INTRUSION DATASET

To evaluate the proposed feature selection method, the NSL-KDD dataset[37] is used. The NSL-KDD dataset is a new version of KDD 99, consisting of selected records from the complete original KDD dataset. The KDD dataset has some problems, such as redundant and duplicate records,[38] which have a negative effect on evaluation results when it is used as an evaluation dataset. NSL-KDD consists of 41 continuous and discrete (nominal) features and one target class attribute (like KDD, but without its problems). The features in the NSL-KDD dataset can be classified into basic features, traffic features (including time-based and host-based traffic features), and content features.[38] Basic features can be derived from packet headers without inspecting the payload, while the content features use domain knowledge to assess the payload of the original transmission control protocol (TCP) packets.[5] Time-based traffic features are designed to capture properties that mature over a two-second temporal window,
and the host-based traffic features use a historical window estimated over the number of connections instead of time; hence, they are designed to assess attacks that span intervals longer than 2 sec.[5] The details of NSL-KDD, including the feature names and their descriptions and categories, are shown in Table 1. Attack types in the NSL-KDD dataset are categorized into four groups: probe, denial of service (DoS), user to root (U2R), and remote to user (R2L).

As shown in Fig. 6, since information-theoretic methods like MI can handle only discrete attributes, we discretized the continuous features in NSL-KDD using the equal-width interval discretization method.[39] To avoid unbalanced data, it is also essential to normalize the values of each feature. Data normalization is a major step in the preprocessing phase, which is used to scale the values of each continuous attribute into a suitable range.[40] As shown in Fig. 6, before the evaluation phase, we normalized all the features in the NSL-KDD dataset. First, categorical features such as the protocol type were replaced by integer values. Then, the values of the features were scaled by statistical normalization,[40] defined as follows:

x_i = \frac{v_i - \mu}{\sigma}; \quad i = 1, 2, \ldots, N    (30)

where N is the number of instances in the dataset, v_i is the value of the ith instance of the given feature, and μ and σ are the mean and the standard deviation of that feature, respectively.
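A short sketch of the preprocessing just described: column-wise statistical (z-score) normalization per Eq. (30) and equal-width interval discretization of continuous columns. The number of bins and the toy matrix are illustrative assumptions; the entry does not state the bin count actually used.

```python
import numpy as np

def z_score(X):
    """Statistical normalization of Eq. (30), applied column-wise."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0            # constant columns are left unchanged
    return (X - mu) / sigma

def equal_width_bins(column, n_bins=10):
    """Equal-width interval discretization of one continuous feature."""
    edges = np.linspace(column.min(), column.max(), n_bins + 1)
    return np.clip(np.digitize(column, edges[1:-1]), 0, n_bins - 1)

# Illustrative feature matrix: 5 instances, 3 continuous features
X = np.array([[1.0, 200.0, 0.1],
              [2.0, 150.0, 0.4],
              [3.0,  50.0, 0.9],
              [4.0, 300.0, 0.2],
              [5.0, 100.0, 0.7]])
Xn = z_score(X)
Xd = np.column_stack([equal_width_bins(Xn[:, j], n_bins=4) for j in range(Xn.shape[1])])
print(Xn.round(2))
print(Xd)
```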

In this study, we selected 10,000 and 5,000 instances from the KDDTrain+ and KDDTest+ datasets[37] for the training and testing datasets, respectively. The details of the datasets are shown in Table 2.

EXPERIMENTAL RESULTS

In this section, the experimental results of the proposed model are presented. The performance of the proposed method is evaluated in an anomaly detection system; that is, the proposed system performs two-class (attack or normal) classification for IDS. To evaluate the performance of the proposed algorithm, we use three well-known classifiers (namely SVM, CART, and NB, as implemented in MATLAB R2014a). The performance of the proposed method was compared with the standard BGSA and binary PSO (BPSO)[41] feature-selection methods using the SVM, CART, and NB classifiers. These classifiers were trained with the 10,000 instances selected from KDDTrain+[37] over 100 iterations and were evaluated with the 5,000 instances selected from KDDTest+.[37]

04-04-2016 15:51:39

Binary Gravitational Search Algorithm (BGSA): Improved Efficiency

11

Table 1 Description of NSL-KDD intrusion detection dataset features.

Downloaded by [Mansour Sheikhan] at 22:52 13 July 2016

Feature

Description

Duration of the connection (in sec) Type of the connection protocol 3: service Service on the destination 4: flag Status flag of the connection 5: src_bytes Number of bytes sent from source to destination 6: dst_bytes Number of bytes sent from destination to source 7: land 1 if connection is from/to the same-host/port; 0 otherwise 8: wrong_fragment wrong_fragment 9: urgent Number of urgent packets 10: hot Number of “hot” indicators 11: num_failed_logins Number of failed logins 12: logged_in 1 if successfully logged in; 0 otherwise 13: num_compromised Number of “compromised” conditions 14: root_shell 1 if root shell is obtained; 0 otherwise 15: su_attempted 1 if “su root” command attempted; 0 otherwise 16: num_root Number of “root” accesses 17: num_file_creations Number of file-creation operations 18: num_shells Number of shell prompts 19: num_access_files Number of operations on access control files 20: num_outbound_cmds Number of outbound commands in a FTP session 21: is_host_login 1 if the login belongs to the “hot” list; 0 otherwise 22: is_guest_login 1 if the login is a “guest” login; 0 otherwise 23: count Number of connections to the same host as the current connection in the past 2 s 24: srv_count Number of connections to the same service as the current connection in the past 2 s 25: serror_rate Percent of connections that have “SYN” errors (same-host connections) 26: srv_serror_rate Percent of connections that have “SYN” errors (same-service connections) 27: rerror_rate Percent of connections that have “REJ” errors (same-host connections) 28: srv_rerror_rate Percent of connections that have “REJ” errors (same-service connections) 29: same_srv_rate Percent of connections to the same service 30: diff_srv_rate Percent of connections to different services 31: srv_diff_host_rate Percent of connections to different hosts 32: dst_host_count Number of connections having the same destination host 33: dst_host_srv_count Number of connections having the same destination host and using the same service 34: dst_host_same_srv_rate Percent of connections having the same destination host and using the same service 35: dst_host_diff_srv_rate Percent of different services on the current host 36: dst_host_same_src_port_rate Percent of connections to the current host having the same src port 37: dst_host_srv_diff_host_rate Percent of connections to the same service coming from different hosts 38: dst_host_serror_rate Percent of connections to the current host that have an S0 error 39: dst_host_srv_serror_rate Percent of connections to the current host and specified service that have an S0 error 40: dst_host_rerror_rate Percent of connections to the current host that have an RST error 41: dst_host_srv_rerror_rate Percent of connections to the current host and specified service that have an RST error

Type

1: duration

Continuous

2: protocol_type

Discrete Discrete Discrete Continuous Continuous Discrete Continuous Continuous

EIA_120052597.indd 11

Continuous

Category Basic

Content

Continuous Discrete Continuous Discrete Discrete Continuous Continuous Continuous Continuous Continuous Discrete Discrete

Continuous

Time-based traffic

Continuous Continuous Continuous Continuous Continuous Continuous Continuous Continuous

Continuous

Host-based traffic

Continuous Continuous Continuous Continuous Continuous Continuous Continuous Continuous Continuous

04-04-2016 15:51:40

12

Binary Gravitational Search Algorithm (BGSA): Improved Efficiency

Table 2 Size of the training and testing datasets. Dataset

Downloaded by [Mansour Sheikhan] at 22:52 13 July 2016

Training set Test set

# Instances

# Normal instances

# Anomaly instances

10,000 5,000

5,249 2,137

4,751 2,863

The parameter settings of the BGSA part of the proposed hybrid MBGSA algorithm and of the standard BGSA are reported in Table 3. The parameter settings of the BPSO feature selection method are reported in Table 4. As shown in Tables 3 and 4, the feature selection methods use n = 5 agents as the initial population, and the maximum number of iterations is t_max = 100. The maximum velocity constraint and the gravitational constant are set to v_max = 6 and G_0 = 100, respectively. In the BPSO method, w is the inertia weight (the momentum of the particles), which controls the effect of the previous velocity on the new velocity, while c_1 and c_2 control the stochastic influence of the cognitive and social components on the overall velocity of a particle.[19] We set c_1 = 2, c_2 = 2, and w = 0.2.

Table 3 Parameter settings of the BGSA part of the proposed feature-selection method and the standard BGSA in the simulations.

Parameter | MBGSA
n | 5
t_max | 100
G_0 | 100
v_max | 6

Table 4 Parameter settings of the BPSO-based feature-selection method in the simulations.

Parameter | BPSO
n | 5
t_max | 100
v_max | 6
c_1 | 2
c_2 | 2
w | 0.2

We implemented the feature selection methods in MATLAB R2014a on a PC with an Intel(R) Core(TM) i5-4460 CPU at 3.20 GHz and 8 GB of RAM. We ran all the feature selection methods three times and report the average accuracy rate, DR, and FPR. To evaluate the performance of the proposed method, the following numerical measures were used:

- True positive rate (TPR), also known as the detection rate (DR), calculated as TPR = TP / (TP + FN), where TP is the number of positive instances (attacks) that are classified correctly and FN is the number of positive instances (attacks) that are classified incorrectly.
- False positive rate (FPR), calculated as FPR = FP / (TN + FP), where FP is the number of negative instances (normal) that are classified incorrectly and TN is the number of negative instances (normal) that are classified correctly.
- Accuracy, calculated as (TP + TN) / (TP + TN + FN + FP).
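For reference, the three measures above can be computed directly from the confusion-matrix counts; the counts in the usage line are hypothetical and do not correspond to any table in this entry.

```python
def detection_metrics(tp, fn, fp, tn):
    """TPR (detection rate), FPR, and accuracy from confusion-matrix counts."""
    tpr = tp / (tp + fn)                         # detection rate
    fpr = fp / (tn + fp)                         # false positive rate
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return tpr, fpr, accuracy

# Hypothetical counts for a 5,000-instance test set
print(detection_metrics(tp=2500, fn=363, fp=390, tn=1747))
```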

Results and Discussion

In this experiment, the optimum feature subset was sought; that is, the proposed MBGSA was used to find the best subset of features that maximizes the fitness function, which is the accuracy rate. The performance of the proposed model has been compared to BGSA and BPSO in terms of the accuracy rate, DR, and FPR. As mentioned earlier, we ran the feature selection methods three times using the SVM, CART, and NB classifiers. Figs. 8–16 report the average runtime behavior of the accuracy rate, DR, and FPR over 100 iterations for the three different feature selection methods using the various classifiers on the NSL-KDD dataset. As shown in Figs. 8–16, the proposed method performs better in terms of the accuracy rate and the DR with all classifiers. As can be seen, using information theory in the BGSA feature selection method to consider both feature–feature and feature–class MI can improve the fine-tuning capability and efficiency of standard BGSA. This search strategy, used as an inner optimization layer, can increase the relevance between the input features and the target class and decrease the redundancy among the selected features, which leads to an improvement in the feature-selection performance of BGSA.

For analyzing the experimental results, the details of the best results of the feature selection methods among all executions are used. As seen in Figs. 17 and 18, MBGSA-CART, the combination of the proposed method and the CART classifier, offers the best accuracy rate and DR. The details of the selected features and performance metrics are reported in Tables 5–7. According to Table 6, when the proposed feature selection method and the CART-based classifier are used, the accuracy rate and DR are improved by 2.70% and 8.21%, respectively, as compared to standard BGSA. These rates are improved by 0.32% and 3.88%, respectively, as compared to the BPSO feature selection method. It is notable that the proposed feature selection algorithm selects only a small portion of the original features as compared to the other feature selection methods (only 7 of the 41 features, as compared to 21 and 23 selected features for the BGSA and BPSO methods, respectively). According to the experimental results reported in Tables 5–7, the proposed feature selection algorithm can make dramatic reductions in the feature space and can consequently improve the classification performance. Unlike the accuracy rate and DR, the proposed method suffers from a relatively high FPR.

04-04-2016 15:51:41

Binary Gravitational Search Algorithm (BGSA): Improved Efficiency

13

Downloaded by [Mansour Sheikhan] at 22:52 13 July 2016

Fig. 8 Average accuracy rate when using SVM classifier and MBGSA/BGSA/BPSO feature selection methods.

Fig. 9 Detection rate when using SVM classifier and MBGSA/BGSA/BPSO feature selection methods.

Fig. 10 False positive rate when using SVM classifier and MBGSA/BGSA/BPSO feature selection methods.

Fig. 11 Average accuracy rate when using CART classifier and MBGSA/BGSA/BPSO feature selection methods.

EIA_120052597.indd 13

04-04-2016 15:51:42

Downloaded by [Mansour Sheikhan] at 22:52 13 July 2016

14

Binary Gravitational Search Algorithm (BGSA): Improved Efficiency

Fig. 12

Detection rate when using CART classifier and MBGSA/BGSA/BPSO feature selection methods.

Fig. 13

False positive rate when using CART classifier and MBGSA/BGSA/BPSO feature selection methods.

Fig. 14

Average accuracy rate when using NB classifier and MBGSA/BGSA/BPSO feature selection methods.

Fig. 15 Detection rate when using NB classifier and MBGSA/BGSA/BPSO feature selection methods.

EIA_120052597.indd 14

04-04-2016 15:51:43

Downloaded by [Mansour Sheikhan] at 22:52 13 July 2016

Binary Gravitational Search Algorithm (BGSA): Improved Efficiency

15

Fig. 16

False positive rate when using NB classifier and MBGSA/BGSA/BPSO feature selection methods.

Fig. 17

Comparison of feature selection methods using different classifiers in terms of accuracy rate.

Fig. 18 Comparison of feature selection methods using different classifiers in terms of DR.

Table 5 The best experimental results of MBGSA, BGSA, and BPSO feature-selection methods using SVM classifier.

FS method | Accuracy rate (%) | DR (%) | FPR (%) | Execution time (sec) | No. of selected features | Selected features
MBGSA | 89.260 | 87.880 | 8.890 | 697.644 | 4 | {2, 3, 5, 6}
BGSA | 87.880 | 84.736 | 7.908 | 532.810 | 15 | {3, 4, 5, 7, 9, 15, 17, 18, 20, 22, 23, 25, 26, 27, 30}
BPSO | 85.103 | 79.148 | 6.922 | 523.427 | 21 | {3, 4, 5, 6, 9, 11, 13, 14, 16, 17, 18, 20, 22, 25, 26, 27, 28, 34, 39, 40, 41}

Table 6 The best experimental results of MBGSA, BGSA, and BPSO feature-selection methods using CART classifier.

FS method | Accuracy rate (%) | DR (%) | FPR (%) | Execution time (sec) | No. of selected features | Selected features
MBGSA | 89.400 | 87.321 | 7.815 | 53.133 | 7 | {1, 2, 3, 5, 6, 7, 29}
BGSA | 86.700 | 79.113 | 3.135 | 47.194 | 21 | {1, 3, 5, 6, 7, 9, 11, 12, 14, 16, 19, 20, 21, 27, 31, 32, 33, 34, 39, 40, 41}
BPSO | 89.082 | 83.444 | 3.368 | 53.021 | 23 | {1, 3, 4, 5, 9, 11, 13, 15, 16, 17, 20, 23, 25, 28, 29, 30, 32, 33, 36, 37, 38, 39, 41}

Table 7 The best experimental results of MBGSA, BGSA, and BPSO feature-selection methods using NB classifier.

FS method | Accuracy rate (%) | DR (%) | FPR (%) | Execution time (sec) | No. of selected features | Selected features
MBGSA | 83.820 | 82.955 | 15.021 | 106.152 | 5 | {3, 4, 24, 30, 41}
BGSA | 81.680 | 71.149 | 4.216 | 23.041 | 18 | {2, 5, 7, 11, 15, 17, 18, 19, 22, 24, 27, 28, 30, 35, 36, 37, 38, 41}
BPSO | 83.440 | 76.458 | 7.206 | 23.11 | 19 | {3, 4, 5, 7, 8, 9, 11, 12, 13, 14, 19, 27, 28, 30, 35, 37, 39, 40, 41}

Fig. 19 Comparison of feature selection methods using different classifiers in terms of FPR.

Fig. 20 Number of selected features in simulated feature selection methods using different classifiers.

As seen in Fig. 19, the BGSA feature selection method with the CART classifier offers the best FPR; as reported in Table 6, its FPR is 4.68% lower than that of the proposed method. In theory, using more features should yield higher classification accuracy, but in practice this does not always hold because of data sparsity, noise, and other factors. Since building the classifier from an intrusion dataset is a major phase of an anomaly-based IDS, features that do not reflect the underlying phenomena of interest only increase model complexity and may degrade classifier performance. With the proposed method, the classification model can therefore be built faster, because MBGSA selects fewer features than the other investigated methods (Fig. 20); this significant reduction in the number of features also improves computational efficiency, shortening both training and testing time.
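As an illustration of why the reduced subsets translate into shorter training and testing times, the sketch below evaluates a candidate binary feature mask the way a wrapper does: the classifier is trained only on the selected columns and its accuracy serves as the fitness value. The helper name and the use of scikit-learn's DecisionTreeClassifier as a CART-style stand-in are assumptions for illustration, not the entry's actual implementation.

```python
# Sketch of the wrapper fitness evaluation: train a CART-style tree on the
# columns flagged by a binary mask and return its accuracy. Smaller masks mean
# smaller training matrices, which is where the runtime saving comes from.
# Function name and classifier choice are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score


def subset_fitness(mask, X_train, y_train, X_test, y_test):
    cols = np.flatnonzero(mask)      # indices of the selected features
    if cols.size == 0:
        return 0.0                   # an empty subset gets the worst fitness
    clf = DecisionTreeClassifier()   # CART-style decision tree
    clf.fit(X_train[:, cols], y_train)
    return accuracy_score(y_test, clf.predict(X_test[:, cols]))
```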

CONCLUSION
In this research, a novel feature selection method based on BGSA (a population-based heuristic search algorithm), called MBGSA, was proposed. To evaluate each selected feature subset, the accuracy rate was used as the fitness function. SVM, CART, and NB classifiers were trained and tested on the NSL-KDD dataset. Because standard BGSA relies only on Newtonian gravitational rules to search for an optimal feature subset, the MIFS-U search strategy was integrated into the BGSA wrapper as a filter-based inner layer that considers both feature–feature and feature–class MI. This approach maximizes the relevance between each candidate feature and the target class while minimizing the redundancy between the candidate feature and the already selected features. Standard BGSA and BPSO feature selection methods were used as competing methods to evaluate the proposed MBGSA. The experimental results showed that the proposed method improves the accuracy rate and DR while using fewer features than the standard BGSA and BPSO methods.
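For readers unfamiliar with how standard BGSA turns the gravitational dynamics into a binary search, the sketch below shows the usual bit-flip position update of BGSA [18] as we understand it: each velocity component is mapped through |tanh(.)| to a flip probability for the corresponding bit. The acceleration term is assumed to have been computed from the gravitational masses and forces, which are omitted here; variable names are illustrative.

```python
# Sketch of the binary position update of standard BGSA [18]: each velocity
# component is mapped to a bit-flip probability via |tanh(.)|. The acceleration
# `acc` is assumed to come from the usual GSA mass/force computation (omitted).
import numpy as np


def bgsa_position_update(positions, velocities, acc, rng=None):
    """positions: (N, d) 0/1 array; velocities, acc: (N, d) float arrays."""
    rng = np.random.default_rng() if rng is None else rng
    velocities = rng.random(velocities.shape) * velocities + acc
    flip_prob = np.abs(np.tanh(velocities))          # transfer function
    flip = rng.random(positions.shape) < flip_prob   # stochastic bit flips
    positions = np.where(flip, 1 - positions, positions)
    return positions, velocities
```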

REFERENCES
1. Le, A.; Loo, J.; Lasebae, A.; Aiash, M.; Luo, Y. 6LoWPAN: A study on QoS security threats and countermeasures using intrusion detection system approach. Int. J. Commun. Syst. 2012, 25 (9), 1189–1212.
2. Wu, S.X.; Banzhaf, W. The use of computational intelligence in intrusion detection systems: A review. Appl. Soft Comput. 2010, 10 (1), 1–35.
3. Stakhanova, N.; Basu, S.; Wong, J. On the symbiosis of specification-based and anomaly-based detection. Comput. Secur. 2010, 29 (2), 1–16.

4. Kim, G.; Lee, S.; Kim, S. A novel hybrid intrusion detection method integrating anomaly detection with misuse detection. Expert Syst. Appl. 2014, 41 (4), 1690–1700.
5. Sheikhan, M.; Jadidi, Z.; Farrokhi, A. Intrusion detection using reduced-size RNN based on feature grouping. Neural Comput. Appl. 2012, 21 (6), 1185–1190.
6. Hoque, N.; Bhattacharyya, D.K.; Kalita, J.K. MIFS-ND: A mutual information-based feature selection method. Expert Syst. Appl. 2014, 41 (14), 6371–6385.
7. Liu, H.; Wu, X.; Zhang, S. A new supervised feature selection method for pattern classification. Comput. Intell. 2012, 30 (2), 342–361.
8. Kumar, G.; Kumar, K. An information theoretic approach for feature selection. Secur. Commun. Netw. 2012, 5 (2), 178–185.
9. Deisy, C.; Baskar, S.; Ramraj, N.; Saravanan Koori, J.; Jeevanandam, P. A novel information theoretic-interact algorithm (IT-IN) for feature selection using three machine learning algorithms. Expert Syst. Appl. 2010, 37 (12), 7589–7597.
10. Ruiz, R.; Riquelme, J.C.; Aguilar-Ruiz, J.S. Heuristic search over a ranking for feature selection. Lect. Notes Comput. Sci. 2005, 3512, 742–749.
11. Kira, K.; Rendell, L.A. The feature selection problem: Traditional methods and a new algorithm. In Proceedings of National Conference on Artificial Intelligence, San Jose, California, July 12–17, 1992; 129–134.
12. Hall, M.A. Correlation-based feature selection for discrete and numeric class machine learning. In Proceedings of International Conference on Machine Learning, Stanford, CA, June 29–July 2, 2000; 359–366.
13. Zhao, Z.; Liu, H. Searching for interacting features. In Proceedings of International Joint Conference on Artificial Intelligence, Hyderabad, India, January 6–12, 2007; 1156–1167.
14. Zhang, Z.; Hancock, E.R. Hypergraph based information-theoretic feature selection. Patt. Recogn. Lett. 2012, 33 (15), 1991–1999.
15. Unler, A.; Murat, A.; Chinnam, R.B. mr2PSO: A maximum relevance minimum redundancy feature selection method based on swarm intelligence for support vector machine classification. Inform. Sci. 2011, 181 (20), 4625–4641.
16. Sheikhan, M.; Mohammadi, N. Neural-based electricity load forecasting using hybrid of GA and ACO for feature selection. Neural Comput. Appl. 2012, 21 (8), 1961–1970.
17. Sivagaminathan, R.K.; Ramakrishnan, S. A hybrid approach for feature subset selection using neural networks and ant colony optimization. Expert Syst. Appl. 2007, 33 (1), 49–60.
18. Rashedi, E.; Nezamabadi-pour, H.; Saryazdi, S. BGSA: Binary gravitational search algorithm. Nat. Comput. 2010, 9 (3), 727–745.
19. Sheikhan, M. Generation of suprasegmental information for speech using a recurrent neural network and binary gravitational search algorithm for feature selection. Appl. Intell. 2014, 40 (4), 772–790.
20. Bonev, B.I. Feature selection based on information theory. Ph.D. Thesis, Department of Computer Science and Artificial Intelligence, University of Alicante, Spain, 2010.
21. Battiti, R. Using mutual information for selecting features in supervised neural net learning. IEEE Trans. Neural Netw. 1994, 5 (4), 537–550.

22. Amiri, F.; Rezaei Yousefi, M.; Lucas, C.; Shakery, A.; Yazdani, N. Mutual information-based feature selection for intrusion detection systems. J. Netw. Comput. Appl. 2011, 34 (4), 1184–1199.
23. Kwak, N.; Choi, C. Input feature selection for classification problems. IEEE Trans. Neural Netw. 2002, 13 (1), 143–159.
24. Huang, J.; Cai, Y.; Xu, X. A hybrid genetic algorithm for feature selection wrapper based on mutual information. Patt. Recogn. Lett. 2007, 28 (13), 1825–1844.
25. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20 (3), 273–297.
26. Lei, X.; Zhou, P. An intrusion detection model based on GS-SVM classifier. Inform. Technol. J. 2012, 11 (7), 794–798.
27. Wang, Y.; Wang, S.; Lai, K.K. A new fuzzy support vector machine to evaluate credit risk. IEEE Trans. Fuzzy Syst. 2005, 13 (6), 820–831.
28. El-Naqa, I.; Yang, Y.; Wernick, M.N.; Galatsanos, N.P.; Nishikawa, R.M. A support vector machine approach for detection of microcalcifications. IEEE Trans. Med. Imag. 2002, 21 (12), 1552–1563.
29. Anuradha; Gupta, G. Self-explanatory review of decision tree classifiers. In Proceedings of International Conference on Recent Advances and Innovations in Engineering, Jaipur, India, May 9–11, 2014.
30. Poļaka, I.; Tom, I.; Borisovs, A. Decision tree classifiers in bioinformatics. Sci. J. Riga Technical University 2010, 44, 119–124.
31. Landgrebe, D. A survey of decision tree classifier methodology. IEEE Trans. Syst. Man Cybern. 1991, 21 (3), 660–674.
32. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Wadsworth: Belmont, CA, U.S.A., 1984.

33. Morgan, J. Classification and Regression Tree Analysis; Department of Health Policy & Management, Boston University: Boston, MA, U.S.A., 2014.
34. Amiri, M.; Eftekhari, M.; Keynia, F. Using naïve Bayes classifier to accelerate constructing fuzzy intrusion detection systems. Int. J. Soft Comput. Eng. 2013, 2 (6), 453–459.
35. Kotsiantis, S.B. Supervised machine learning: A review of classification techniques. Informatica 2007, 31, 249–268.
36. Rashedi, E.; Nezamabadi-pour, H.; Saryazdi, S. GSA: A gravitational search algorithm. Inform. Sci. 2009, 179 (13), 2232–2248.
37. Tavallaee, M.; Bagheri, E.; Wei, L.; Ghorbani, A.A. Available at http://nsl.cs.unb.ca/NSL-KDD (accessed November 11, 2014).
38. Tavallaee, M.; Bagheri, E.; Wei, L.; Ghorbani, A.A. Detailed analysis of the KDD CUP 99 data set. In Proceedings of IEEE Symposium on Computational Intelligence for Security and Defense Applications, Ottawa, Canada, July 8–10, 2009.
39. Dash, R.; Paramguru, R.L.; Dash, R. Comparative analysis of supervised and unsupervised discretization techniques. Int. J. Advan. Sci. Technol. 2011, 2 (3), 29–37.
40. Wang, W.; Zhang, X.; Gombault, S.; Knapskog, S.J. Attribute normalization in network intrusion detection. In Proceedings of International Symposium on Pervasive Systems, Algorithms, and Networks, Kaohsiung, Taiwan, December 14–16, 2009; 448–453.
41. Nezamabadi-pour, H.; Rostami-Shahrbabaki, M.; Maghfoori-Farsangi, M. Binary particle swarm optimization: Challenges and new solutions. J. Comput. Sci. Eng. 2008, 6 (1–A), 21–32.
