Information Sciences 177 (2007) 3060–3073 www.elsevier.com/locate/ins

Network intrusion detection: Evaluating cluster, discriminant, and logit analysis

Vasilios Katos
School of Computing, University of Portsmouth, Buckingham Building, Lion Terrace, Portsmouth PO1 3HE, UK

Received 20 November 2006; received in revised form 16 February 2007; accepted 20 February 2007

Abstract

This paper evaluates the statistical methodologies of cluster analysis, discriminant analysis, and Logit analysis in the examination of intrusion detection data. The research is based on a sample of 1200 random observations for 42 variables of the KDD-99 database, which contains 'normal' and 'bad' connections. The results indicate that Logit analysis is more effective than cluster or discriminant analysis in intrusion detection. Specifically, according to the Kappa statistic, which makes full use of all the information contained in a confusion matrix, Logit analysis (K = 0.629) ranked first, followed by discriminant analysis (K = 0.583) and cluster analysis (K = 0.460).

© 2007 Elsevier Inc. All rights reserved.

Keywords: Intrusion detection; Cluster analysis; Discriminant analysis; Logit analysis; KDD-99

1. Introduction

Over recent years, information security technologies have increased in importance due to a sharp rise in attempts at unauthorised access to, or malicious activity on, computers and information systems. The research community has shown particular interest in the areas of access control models and application and database security [17]. Although these security technologies were designed to prevent unauthorised access to systems, complete prevention seems at present to be unrealistic. To compensate for the potential failures of prevention technologies, Intrusion Detection Systems (IDS) have been developed to detect intrusion attempts, and they are the primary source of data for network forensics. The most common method of detecting intrusions is to use 'audit data' generated by the operating system, with the aim of distinguishing acceptable or normal system behaviour from behaviour that is abnormal or actively harmful. Two techniques, complementary in nature, are usually employed to cope with the intrusion detection problem:

E-mail address: [email protected]
0020-0255/$ - see front matter © 2007 Elsevier Inc. All rights reserved.
doi:10.1016/j.ins.2007.02.034


• Anomaly detection techniques attempt to model normal behaviour. Any events which violate this model are considered suspicious.
• Misuse detection techniques attempt to model abnormal behaviour, representing attacks in the form of a pattern or signature, any occurrence of which clearly indicates system abuse.

The usual drawback of anomaly detection techniques lies in building accurate and parsimonious models that reflect the complex, dynamic information system environment. In fact, models used in practice have proved to trigger thousands of alarms per day, up to 99% of which are 'false positives' (i.e. alarms mistakenly triggered by benign events), making it difficult to identify the hidden 'true positives' (i.e. alarms that correctly flag attacks) [19]. Although misuse detection is assumed to be more accurate than anomaly detection, the major drawback of these techniques lies in creating a signature that encompasses most possible variations of intrusive and non-intrusive activities.

Both the disciplines of statistics and AI claim shares of intrusion detection tools. In building models for anomaly detection, statistical approaches and predictive pattern generation, i.e. predicting future events based on historical events, have been employed. In building models for misuse detection, pattern matching approaches, i.e. models encoding known signatures as patterns that are then matched against audit data, have been widely employed. Naturally, pattern matching relates to fuzzy logic and AI techniques such as neural networks and the recently introduced fuzzy polynomial neural networks [27]. Furthermore, there are clear equivalence relations between neural networks and statistical approaches; a good comparison is developed in [13]. It should be noted that anomaly detection schemes are not limited to network intrusion detection (see for example [35] for anomaly detection in IPv6 and [23] for malicious code classification).
In [9] the research focused on financial fraud detection, and it was shown that neural networks performed better than their statistical counterparts. Yet in [29] a comparison between neural networks and logit showed no significant advantage of one approach over the other. In general, when it comes to comparing different intrusion detection techniques, the mainstream of research adopts a ''camp'' view, evaluating AI tools against statistics and vice versa. This paper focuses solely on the statistical side of intrusion detection, and its purpose is to evaluate three statistical methods used in intrusion detection: cluster analysis [19], discriminant analysis [2], and Logit analysis [5]. These methods are used in intrusion detection modelling because the dependent variable is usually dichotomous. Although these models have been used individually, this paper contributes to intrusion detection modelling by examining the effectiveness of the three competing statistical approaches on the same data.

The rest of the paper is structured as follows. Section 2 presents some major points in applying these three statistical approaches. In Section 3 the application of these approaches and the corresponding results are presented, using a random subset of the KDD-99 dataset [20]. Finally, Section 4 presents the conclusions of the paper and identifies areas for future research.

2. Methodology

In this section some major points for applying cluster analysis, discriminant analysis, and Logit analysis are presented.

2.1. Cluster analysis

Clustering is a procedure for grouping objects of a similar kind into respective categories, and is a popular technique in data mining [28,11]. 'Hierarchical clustering' is a typical clustering algorithm which clusters objects (cases or variables). The algorithm starts by finding the closest pair of objects, according to a distance measure, and combines them to form a cluster. The algorithm proceeds one step at a time, joining pairs of objects, pairs of clusters, or an object with a cluster, until all the data are in one cluster [30]. Generally, cluster analysis discovers structures in data without explaining why they exist. The following points are important in applying cluster analysis:


• Distance measures: Distance measures are employed to determine how different two objects are. The smaller the value of a distance, the more similar two objects are. Common distance measures are the following:

  Squared Euclidean distance = Σ_i (X_i − Y_i)²

  Pearson correlation coefficient = Σ_i (X_i − X̄)(Y_i − Ȳ) / √[ Σ_i (X_i − X̄)² · Σ_i (Y_i − Ȳ)² ]

  where X_i and Y_i are observations, and X̄ and Ȳ are the means of the two variables X and Y, respectively.

• Standardisation: Clustering is very sensitive to the units of measurement of the variables involved. Variables with large values contribute erroneously more to the distance calculations than variables with small values. Thus, standardisation of all the variables is needed. The most common standardisation method is the following:

  Z scores: Z_i = (X_i − X̄) / s

  where s is the standard deviation. The mean and the standard deviation of the Z's are zero and one, respectively.

• Linking method: Different methods exist for linking two subgroups at each step of the clustering algorithm. The most common methods are the following:

  Between-groups linkage method: At the first step, when each object represents its own cluster, the distances between those objects are defined by the employed distance measure. However, once several objects have been linked together, there are various possibilities (e.g. single linkage, average linkage, complete linkage) for linking two clusters together when any two objects in the two clusters are closer together than the respective linkage distance [31].

  Ward's method: This method uses an analysis of variance approach to evaluate the distances between clusters. In short, it attempts to minimise the sum of squares of any two clusters that can be formed at each step [33]. In general, this method is regarded as very efficient; however, it tends to create clusters of small size.

• Evaluation: There are no completely satisfactory methods for determining the number of clusters in the data [14]. Ordinary statistical significance tests, parametric or nonparametric, are not valid for testing differences between clusters [21]. If the clustering results from the various linking methods differ greatly, then it is rather unlikely that the data incorporate distinct clusters.
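The steps above can be sketched in code. The following is an illustration only, not the paper's procedure: the synthetic two-group data and the use of SciPy are assumptions. It standardises the variables to Z-scores and compares Ward's method with between-groups (average) linkage.

```python
# Illustrative sketch (synthetic data, SciPy assumed): z-score
# standardisation followed by hierarchical clustering, comparing
# Ward's method with average (between-groups) linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated synthetic groups of 50 cases and 4 variables each.
X = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(5, 1, (50, 4))])

# Z-score standardisation: each variable gets mean 0 and std. dev. 1.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

for method in ("ward", "average"):
    tree = linkage(Z, method=method)                    # agglomerative merge tree
    labels = fcluster(tree, t=2, criterion="maxclust")  # cut into 2 clusters
    print(method, np.bincount(labels)[1:])              # cluster sizes
```

If the different linking methods agree closely, as they do on this toy data, that agreement is evidence of genuinely distinct clusters, mirroring the evaluation point above.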

2.2. Discriminant analysis

The main purpose of discriminant function analysis (and similar methods stemming from it, such as linear discriminant analysis [34]) is to classify or predict cases into values of a categorical dependent variable, Y, usually a dichotomy, based on a linear combination of interval independent variables, X1, X2, . . . , Xk (dichotomies, dummies, and ordinal variables may be used as well). The expected value of the variable Y is given by

  E(Y|X) = b1·X1 + b2·X2 + · · · + bk·Xk + c

where the b's are discriminant coefficients, the X's are discriminating variables, and c is a constant. The discriminant coefficients are selected by maximising the distance between the means of the dependent variable, or alternatively by minimising the distance between the actual Y and the predicted Y. The following points are important in applying discriminant analysis [18,7]:

• Dependent variable: The dependent variable must be a true dichotomy (or polytomy). In contrast to cluster analysis, which classifies unknown groups, discriminant function analysis classifies known groups.
• Independent variables: No independent variable may have zero standard deviation in one or more of the groups indicated by the dependent variable.


• Cases: All cases must be independent. It is recommended that the number of cases (observations) be at least five times the number of independent variables. The group sizes of the dependent variable should not be very different.
• Discriminant coefficients: The unstandardised discriminant coefficients are used to compute the discriminant scores, whilst the standardised discriminant coefficients are used to assess the contribution of each independent variable to the discriminant function. 'Fisher's linear discriminant functions' refer to one set of discriminant function coefficients for each group of the dependent variable.
• Discriminant scores: These are the predicted values from the discriminant function after substituting the sample cases. The 'Z scores' refer to the standardised discriminant scores.
• Homogeneity of variances: For the same independent variable, the groups formed by the dependent variable should have similar variances. Lack of homogeneity of variances and the presence of outliers may be evaluated through box-plots.
• Homogeneity of covariances: Within each group formed by the dependent variable, the covariance between any two independent variables should be similar to the corresponding covariance in the other groups. Box's M, with a corresponding F statistic, is a procedure for testing the homogeneity of variances and covariances. If this assumption is violated, the Mahalanobis distance may be used [2]. However, discriminant analysis can be robust even when this assumption is violated [22,24].
• Absence of multicollinearity: The independent variables must not be linear functions of other independent variables. Low correlation coefficients may indicate low multicollinearity.
• Normality: Although discriminant analysis does not necessarily assume the independent variables to follow multivariate normal distributions, for purposes of significance testing they should do so. However, discriminant analysis conclusions can be robust even when this assumption is violated [22,24].
• Function specification: The discriminant function is assumed to be linear and correctly specified. 'Wilks' lambda', taking values from 0 to 1, is used to test the significance of the discriminant function as a whole, employing the χ² distribution. A small Wilks' lambda value, or a high χ² value, indicates significance of the discriminant function. Similarly, a high 'eigenvalue', and correspondingly a high 'canonical correlation' taking values from 0 to 1, assess the relative importance of the dimensions which classify cases of the dependent variable.
• Evaluation: The discriminant scores, employing specific 'cutoffs', are used to classify cases into groups. These groups may be validated with respect to the actual groups of the dependent variable. Furthermore, cluster analysis may be used in conjunction with discriminant function analysis: after cases have been classified into groups using cluster analysis, discriminant function analysis may be used on the resulting groups to discover the linear structure of the independent variables used in the analysis.
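A minimal sketch of the classification step described above, using scikit-learn's linear discriminant analysis on synthetic data (both the library and the data are assumptions here, not the paper's SPSS-style procedure):

```python
# Sketch (synthetic data): discriminant analysis classifying a known
# dichotomy (e.g. normal = 0, attack = 1) from a linear combination
# of independent variables.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)
n = 400
y = np.repeat([0, 1], n // 2)                 # known groups
# Group 1 is shifted on the first two variables only.
X = rng.normal(0, 1, (n, 3)) + np.outer(y, [1.5, 0.5, 0.0])

lda = LinearDiscriminantAnalysis().fit(X, y)
scores = X @ lda.scalings_                    # discriminant scores (one function)
ccr = lda.score(X, y)                         # correct classification rate
print(f"CCR = {ccr:.3f}")
```

The coefficients with the largest standardised magnitudes would then point to the most influential discriminating variables, in the spirit of the 'discriminant coefficients' point above.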

2.3. Logit analysis

Linearity, normality, and homoscedasticity of within-group variances of the independent variables are the usual assumptions relevant to discriminant analysis. Owing to violations of these assumptions, discriminant analysis has often been replaced by logit regression, which requires fewer assumptions, is more robust in its results, is easier to use in practice, and is easier to understand than discriminant analysis. Logit analysis aims at predicting the probability (p) that a variable takes the value 1 rather than 0, using a set of independent continuous or categorical variables. The logit regression model is written as

  Prob(event) = 1 / (1 + e^−(b0 + b1·X1 + · · · + bk·Xk))

or

  logit(p) = log( p / (1 − p) ) = b0 + b1·X1 + · · · + bk·Xk


where Prob(event) = p, and p/(1 − p) is the so-called 'odds ratio'. This model is estimated by the 'maximum likelihood' method, which maximises the probability of reaching the observed values with respect to the estimated regression coefficients. The following points are important in applying Logit analysis [30,6,3]:

• Significance of individual regressors: The test that a coefficient equals 0 can be based on the 'Wald statistic', which follows the χ² distribution for large sample sizes.
• Goodness of fit: Similarly to the R² in a linear regression model, the 'Cox & Snell R²' and the 'Nagelkerke R²' in a logit regression model quantify the proportion of explained variation.
• Interpretation of regression coefficients: With all other variables held constant in a Logit regression model, for every one-unit increase in an independent variable Xj there is a constant increase of bj in logit(p), i.e. in the log of the odds ratio.
• Contribution of variables for prediction: Similarly to the 'beta coefficients' in the linear regression model, the contribution of the variables to prediction may be assessed by multiplying each regression coefficient by the standard deviation of the corresponding variable. The ranking of these products reflects the relative importance of the corresponding independent variables.
• Categorical variables: Similarly to the 'dummy variables' in a linear regression model, 'categorical variables' may be used in a logit regression model.
• Diagnostic tests: The standardised residuals from the logit regression are used to test the adequacy of the resulting model. If the sample size is large, the standardised residuals should be approximately normally distributed, with a mean of 0 and a standard deviation of 1. The usual Kolmogorov–Smirnov test may be used for testing normality of the residuals.
• Evaluation: The predicted groups may be validated with respect to the actual groups of the dependent variable in a usual classification accuracy table [10].
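The points above can be illustrated with a small simulated logit fit. This is a sketch, not the paper's estimation: the synthetic data and the use of scikit-learn (with a very weak penalty to approximate plain maximum likelihood) are assumptions.

```python
# Sketch (synthetic data): fitting a logit model and reading off the
# odds ratio and the coefficient-times-std.-dev. contribution measure.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 2000
X = rng.normal(0, 1, (n, 2))
# True model: logit(p) = -1 + 2*X1 + 0*X2
p = 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * X[:, 0])))
y = (rng.random(n) < p).astype(int)

# Large C makes the fit close to an unpenalised maximum likelihood estimate.
model = LogisticRegression(C=1e6, max_iter=1000).fit(X, y)
b = model.coef_[0]
odds_ratio = np.exp(b[0])                         # multiplicative change in odds per unit X1
contribution = np.abs(b) * X.std(axis=0, ddof=1)  # coefficient x std. dev. ranking
print("coefficients:", b.round(2))
print("odds ratio for a one-unit increase in X1:", round(float(odds_ratio), 2))
print("contributions:", contribution.round(2))
```

The recovered coefficient for X1 sits near its true value of 2, and the contribution products rank X1 well above the irrelevant X2, exactly the kind of ranking described in the 'contribution of variables' point.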

3. Evaluation

The three approaches to intrusion detection are tested with a subset of the publicly available KDD-99 cup data [20]. The KDD datasets are a public collection of different types of data led by the ACM Special Interest Group on Knowledge Discovery and Data Mining. The data relevant to intrusion detection and network security were published under the KDD-99 heading (for other datasets and KDD approaches see [1] and [4], respectively). This dataset contains a wide variety of intrusions simulated in a military network environment and is a de facto standard dataset for benchmarking and evaluating IDS tools (see for example [8,26]).

3.1. The sample

To evaluate the accuracy of the proposed methodologies, 1200 random cases of the 42 features (variables) contained in KDD-99 were used. Table 1 presents some descriptive statistics of these 42 features. A feature that has zero variability (standard deviation) is not used in the analysis, as indicated by 'out' in the last column of Table 1; in the end, 30 features were used in the analysis, indicated by 'in'. It should be highlighted that the indicator of 'bad' connections, designating intrusions or attacks, versus 'good' normal connections, is represented by variable V42. The sample contains 1013 normal connections (84.4%) and 187 intrusions (15.6%).

3.2. Cluster analysis

Table 2 presents the results from the application of three different cluster analysis methods: (1) between-groups linkage cluster method with Pearson's correlation coefficient distance measure; (2) Ward's cluster method with Pearson's correlation coefficient distance measure; (3) Ward's cluster method with squared Euclidean distance measure. In all cases the Z-score standardisation method was employed. The results of these


Table 1
Listing of features in KDD Cup 1999 Data, and descriptive statistics of the 1200 cases of the dataset

Variable  Feature name                  Type        Mean        Std. dev.   Pos

Basic features of individual TCP connections
V1        duration                      Continuous  9.1642      197.4245    In
V2        protocol_type                 Discrete    1.6442      .4893       In
V3        service                       Discrete    2.5158      1.1295      In
V4        flag                          Discrete    1.0100      .1681       In
V5        src_bytes                     Continuous  622.1483    8375.7012   In
V6        dst_bytes                     Continuous  2511.6333   12425.2217  In
V7        land                          Discrete    .0000       .0000       Out
V8        wrong_fragment                Continuous  .0000       .0000       Out
V9        urgent                        Continuous  .0000       .0000       Out

Content features within a connection suggested by domain knowledge
V10       hot                           Continuous  1.917E−02   .5392       In
V11       num_failed_logins             Continuous  .0000       .0000       Out
V12       logged_in                     Discrete    .6075       .4885       In
V13       num_compromised               Continuous  .0000       .0000       Out
V14       root_shell                    Continuous  .0000       .0000       Out
V15       su_attempted                  Continuous  .0000       .0000       Out
V16       num_root                      Continuous  .0000       .0000       Out
V17       num_file_creations            Continuous  8.333E−04   2.887E−02   In
V18       num_shells                    Continuous  .0000       .0000       Out
V19       num_access_files              Continuous  5.000E−03   7.056E−02   In
V20       num_outbound_cmds             Continuous  .0000       .0000       Out
V21       is_host_login                 Discrete    .0000       .0000       Out
V22       is_guest_login                Discrete    1.667E−03   4.081E−02   In

Traffic features computed using a two-second time window
V23       count                         Continuous  6.1817      8.1503      In
V24       srv_count                     Continuous  8.8750      11.2568     In
V25       serror_rate                   Continuous  4.697E−02   .7257       In
V26       srv_serror_rate               Continuous  2.888E−02   .3929       In
V27       rerror_rate                   Continuous  1.667E−03   4.081E−02   In
V28       srv_rerror_rate               Continuous  1.667E−03   4.081E−02   In
V29       same_srv_rate                 Continuous  1.5515      6.4496      In
V30       diff_srv_rate                 Continuous  .2269       3.1943      In
V31       srv_diff_host_rate            Continuous  4.3109      11.2432     In

Traffic features computed using a two-second time window from destination to host
V32       dst_host_count                Continuous  172.4708    102.2443    In
V33       dst_host_srv_count            Continuous  230.0967    63.4418     In
V34       dst_host_same_srv_rate        Continuous  11.1301     28.0632     In
V35       dst_host_diff_srv_rate        Continuous  .9282       5.4654      In
V36       dst_host_same_src_port_rate   Continuous  1.4063      4.1837      In
V37       dst_host_srv_diff_host_rate   Continuous  .7301       1.7006      In
V38       dst_host_serror_rate          Continuous  .1724       2.6758      In
V39       dst_host_srv_serror_rate      Continuous  .1604       3.4335      In
V40       dst_host_rerror_rate          Continuous  .4273       5.1357      In
V41       dst_host_srv_rerror_rate      Continuous  .1557       2.2569      In

Connections
V42       Normal (0) or Attack (1)      Discrete    .1558       .3628       In
methods are presented in comparison to the original connection (normal, attack) supplied by the KDD-99 dataset. For each case the diagonal cells measure the number of normal and attack connections predicted by the specific clustering method that match the original connection data supplied by the dataset. The off-diagonal cells measure mismatches: the upper off-diagonal cells measure the number of attack connections that are wrongly classified as normal connections (false negatives), and the lower off-diagonal cells measure the number of normal connections that are wrongly classified as attack connections (false positives).
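Since both the cluster results and the later comparisons in the paper are read off such confusion matrices, the rates involved, including the Kappa statistic used for the final ranking, can be sketched as follows. The counts here are hypothetical, not the paper's figures.

```python
# Sketch: measures derived from a 2x2 confusion matrix with hypothetical
# counts. Rows are predicted (normal/attack), columns are actual.
import numpy as np

cm = np.array([[900, 10],     # predicted normal: 900 true normals, 10 missed attacks
               [50, 240]])    # predicted attack: 50 false alarms, 240 caught attacks
n = cm.sum()

ccr = np.trace(cm) / n                       # correct classification rate
fn_rate = cm[0, 1] / cm[:, 1].sum()          # attacks classified as normal
fp_rate = cm[1, 0] / cm[:, 0].sum()          # normals classified as attacks

# Cohen's Kappa: observed agreement corrected for the chance agreement
# expected from the row and column margins.
po = ccr
pe = (cm.sum(axis=0) * cm.sum(axis=1)).sum() / n**2
kappa = (po - pe) / (1 - pe)
print(f"CCR={ccr:.3f}  FN={fn_rate:.3f}  FP={fp_rate:.3f}  kappa={kappa:.3f}")
```

By the thresholds the paper cites for Kappa, a value above 0.75 would count as excellent agreement between predicted and actual connections.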


Table 2
Results from different cluster analysis methods

                     Original connection
                     Normal (0)     Attack (1)    Total          CCR    χ²            SC

M1    Normal (0)     746 (73.6)     2 (1.1)       748 (62.3)     77.4
      Attack (1)     267 (26.4)     185 (98.9)    452 (37.7)
      Total          1013 (100.0)   187 (100.0)   1200 (100.0)

M2    Normal (0)     841 (83.0)     15 (8.0)      856 (71.3)     84.4   430.6 [.000]  0.602 [.000]
      Attack (1)     172 (17.0)     172 (92.0)    344 (28.7)
      Total          1013 (100.0)   187 (100.0)   1200 (100.0)

M3    Normal (0)     732 (72.3)     2 (1.1)       734 (61.2)     76.4   333.8 [.000]  0.530 [.000]
      Attack (1)     281 (27.7)     185 (98.9)    466 (38.8)
      Total          1013 (100.0)   187 (100.0)   1200 (100.0)

Notes: M1 = between-groups linkage method and Pearson correlation; M2 = Ward's method and Pearson correlation; M3 = Ward's method and squared Euclidean distance; CCR = correct classification rate; SC = Spearman correlation; percentages in parentheses; significance levels in square brackets.

From the results in Table 2 the following conclusions may be derived:

1. The three different clustering methods produced similar results, indicating that the data have highly separated and distinct clusters.
2. The correct classification rate of each clustering method, measured by the percentage ratio of the total diagonal elements to the total number of elements (1200), is rather high: 77.4%, 84.4%, and 76.4% for the first, second, and third method, respectively.
3. Although the clustering results of the three methods are significantly associated with the original classification data, as measured by the χ² and/or the Spearman correlation coefficient, the correct classification rate of the second method is much higher (84.4%) than that of the other two methods (77.4% and 76.4%). However, this conclusion may be misleading if the off-diagonal characteristics are not considered. For example, if the emphasis of the stakeholder is on the 'false negative', i.e. if he believes that a lot is at stake if the IDS does not generate an alert when an intrusion is actually taking place, then the first method produced much better clustering results than the second, because the false negative rate of the first method is only 1.1%, compared with 8.0% for the second method.

3.3. Discriminant analysis

The true dichotomy original connection (normal = 0, attack = 1) variable V42, supplied by the KDD-99 dataset, is used in the analysis as the dependent variable. The independent variables are V1, V3, V5, V6, V23, V24, V32, V33, V34, V35 and V36, which do not have zero standard deviation in either of the groups indicated by the dependent variable. The sample size is rather large, although the group sizes of the dependent variable are very different. We used two types of discriminant analysis with respect to the original connection.
The first refers to the case where 'all' the independent variables enter the analysis, and the second is a 'stepwise' analysis for avoiding possible multicollinearity among the independent variables. The Box's M tests indicate that there is no homogeneity of variances, as shown in Table 3, and thus, because the

Table 3
Tests for investigating the assumptions of the discriminant analysis

Function            Box's M   F [Sig.]      Eigenvalue  % of Variance  Canonical correlation  Wilks' lambda  χ² [Sig.]

Original  All       3171.8    197.4 [.000]  .604        100.0          .614                   .624           563.288 [.000]
          Stepwise  2810.3    185.5 [.000]  .593        100.0          .610                   .628           556.591 [.000]

M1        All       15631.8   234.4 [.000]  4.663       100.0          .907                   .177           2067.711 [.000]
          Stepwise  13850.0   249.4 [.000]  4.651       100.0          .907                   .177           2066.142 [.000]


Table 4
Discriminant function coefficients

          Original connection                               M1
          All                   Stepwise                    All                   Stepwise
          Unstd.     Std.       Unstd.     Std.             Unstd.     Std.       Unstd.     Std.

V1        .001       .163       .001       .160             .001       .131       .001       .133
V3        1.148      1.078      1.215      1.140            1.711      1.145      1.707      1.142
V5        .000       .204                                   .000       .441       .000       .440
V6        .000       .164                                   .000       .384       .000       .383
V23       .003       .020                                   .023       .175       .024       .178
V24       .016       .169       .016       .176             .023       .233       .022       .228
V32       .001       .065                                   .004       .362       .005       .384
V33       .006       .353       .007       .420             .006       .380       .006       .381
V34       .009       .250       .008       .233             .020       .537       .020       .533
V35       .010       .056                                   .016       .090       .017       .095
V36       .008       .032                                   .013       .055
Constant  4.330                 4.825                       5.137                 5.065

assumption of homogeneity of variances is violated, the Mahalanobis distance is used in the analysis. For each type of discriminant analysis, only one canonical discriminant function was used, with canonical correlations equal to 0.614 and 0.610, respectively. The Wilks' lambdas and the χ² values indicated that the discriminant functions as a whole were significant, as shown in Table 3. The discriminant coefficients are presented in Table 4. The unstandardised coefficients are used for the calculation of the discriminant scores, whilst the standardised coefficients are used in assessing the contribution of each independent variable to the discriminant function. Thus, from Table 4 it is seen that variables V3 (service: network service on the destination, e.g. private, domain_u, http, smtp, ftp_data, other) and V33 (dst_host_srv_count: number of connections to the same service as the current connection in the past two seconds, from destination to host) may have an important role within the analysis. Furthermore, variables V34 (dst_host_same_srv_rate: percentage of connections to the same service, from destination to host), V24 (srv_count: number of connections to the same service as the current connection in the past two seconds) and V1 (duration: length of the connection in seconds) may also be important. Unfortunately, one weakness of discriminant analysis is that there are no significance tests for the discriminant coefficients; we simply look for the highest value(s) of the standardised coefficients to indicate the most important variable(s).

Knowing the prior classification between normal connections and intrusions, the classification of the cases according to the discriminant analysis is shown in Table 5. For the 'all' variables discriminant analysis the correct classification rate, i.e. how many of the cases categorised into a particular group actually belong to that group, is 83.8%, and for the 'stepwise' analysis it is 84.8%. Generally, a success rate of more than 80% is considered satisfactory. Furthermore, the off-diagonal percentages in Table 5 indicate that the misclassifications were at rather low levels.

The same discriminant analysis was repeated using the classification of the groups according to the cluster analysis employing the between-groups linkage method and the Pearson correlation distance (variable M1), to discover the linear structure of the independent variables used in the analysis. Tables 3, 4, and 6 present

Table 5
Evaluation of discriminant analysis predictions with respect to original connection

                        Original connection
                        Normal (0)     Attack (1)    Total          Correct classification rate

All       Normal (0)    820 (80.9)     2 (1.1)       822 (68.5)     83.8
          Attack (1)    193 (19.1)     185 (98.9)    378 (31.5)
          Total         1013 (100.0)   187 (100.0)   1200 (100.0)

Stepwise  Normal (0)    832 (82.1)     2 (1.1)       834 (69.5)     84.8
          Attack (1)    181 (17.9)     185 (98.9)    366 (30.5)
          Total         1013 (100.0)   187 (100.0)   1200 (100.0)

Notes: Percentages in parentheses.


Table 6
Evaluation of discriminant analysis predictions with respect to between-groups linkage method & Pearson correlation cluster analysis

                        M1
                        Normal (0)    Attack (1)    Total          Correct classification rate

All       Normal (0)    724 (96.8)    9 (2.0)       733 (61.1)     97.3
          Attack (1)    24 (3.2)      443 (98.2)    467 (38.9)
          Total         748 (100.0)   452 (100.0)   1200 (100.0)

Stepwise  Normal (0)    723 (96.9)    9 (2.0)       732 (61.0)     97.2
          Attack (1)    25 (3.3)      443 (98.2)    468 (39.0)
          Total         748 (100.0)   452 (100.0)   1200 (100.0)

Notes: Percentages in parentheses.

the corresponding results. Although the results of this discriminant analysis are 'too good' (as was expected), since the dependent variable does not refer to the original connections but to the connections classified according to the cluster analysis, the general conclusion is still the same: variable V3 (service: network service on the destination, e.g. private, domain_u, http, smtp, ftp_data, other) was found to be the most important variable discriminating between the two groups, 'normal' and 'attack'. Other important variables may be V34 (dst_host_same_srv_rate: percentage of connections to the same service, from destination to host), V5 (src_bytes: number of data bytes from source to destination), V32 (dst_host_count: number of connections to the same host as the current connection in the past two seconds, from destination to host), V6 (dst_bytes: number of bytes from destination to source), and V33 (dst_host_srv_count: number of connections to the same service as the current connection in the past two seconds, from destination to host).

3.4. Logit analysis

Table 7 presents the results from the application of Logit analysis to various models, in order to identify subsets of independent variables that adequately predict the dependent variable. For Models 1 to 6 the dependent variable is the original connection variable, whilst for Model 7 the dependent variable is the classification of the groups according to the cluster analysis employing the between-groups linkage method and the Pearson correlation distance (variable M1). Although for all models the correct classification rate is near 87% (except for Model 7, where it is much higher, as was expected), the decision about which of the models best predicts the original connections is a matter of further investigation. Table 8 presents the most common classification measures supporting a 'confusion matrix'.

In Table 8, although most measures are self-explanatory, 'sensitivity' refers to the conditional probability that case X is correctly classified, 'specificity' is the inverse, 'positive predictive power' assesses the probability that a case is X if the classifier classifies it as X, 'negative predictive power' assesses the probability that a case is not X if the classifier does not classify it as X, and Kappa and the Normalised Mutual Information (NMI) measure the proportion of agreement. The odds ratio and the NMI cannot be computed if one element is equal to zero, and the Kappa and NMI measures do not behave properly under conditions of excessive errors. However, for K < 0.4 the proportion of agreement is rather poor, for 0.4 < K < 0.75 it is good, and for K > 0.75 it is excellent [25]. Moreover, we must note that because the confusion matrix depends on the 'cut-off' or 'threshold' point (usually 0.50), which assigns the predicted probabilities from the Logit model to the groups 0 or 1, the measures depend on the confusion matrix, or alternatively on the cut-off point.

From the classification measures in Table 8 it is seen that (excluding Model 7) the highest Kappa coefficient is for Model 3. Considering that the Kappa coefficient (and the NMI coefficient) makes full use of all the information contained in the confusion matrix, it seems that Model 3 is preferable to the other models. However, other models may be preferable according to the classification measure employed. Although the choice of model depends on the classification measure used, an undisputable point from the estimates in Table 7 is that variable V3 (service: network service on the destination) is significant in all models. Because this variable is categorical, with six categories (private = 1, domain_u = 2, http = 3, smtp = 4, ftp_data = 5, other = 6), Table 9 presents the results for Model

V. Katos / Information Sciences 177 (2007) 3060–3073


Table 7
Logit analysis results (dependent variable original connection)

         | Model 1       | Model 2       | Model 3        | Model 4        | Model 5       | Model 6       | Model 7(a)
V1       |               | 0.003 (0.000) | 0.034 (0.000)  | 0.028 (0.000)  | 0.022 (0.000) | 0.022 (0.000) |
V3       | 2.893 (0.000) | 3.169 (0.000) | 6.831 (0.000)  | 5.627 (0.000)  | 5.941 (0.000) | 5.939 (0.000) | 3.015 (0.000) [3.405]
V5       |               |               |                |                |               |               | 0.000 (0.047) [0.837]
V6       |               |               | 0.027 (0.000)  | 0.022 (0.000)  | 0.017 (0.001) | 0.017 (0.001) | 0.001 (0.002) [12.425]
V23      |               |               |                | 0.695 (0.000)  | 0.684 (0.000) | 0.621 (0.001) | 0.555 (0.000) [4.523]
V24      |               |               |                |                |               |               |
V32      |               |               |                |                |               |               | 0.046 (0.000) [4.703]
V33      |               |               |                |                |               |               |
V34      |               |               |                |                |               |               | 0.090 (0.000) [2.526]
V35      |               |               |                |                | 0.184 (0.016) | 0.195 (0.006) |
V36      |               |               |                |                |               | 0.582 (0.065) |
Constant | 3.014 (0.000) | 3.309 (0.000) | 10.900 (0.000) | 10.140 (0.000) | 9.519 (0.000) | 9.517 (0.000) | 2.211 (0.228)

Cox and Snell R2 | 0.340 | 0.350 | 0.374 | 0.384 | 0.385 | 0.387 | 0.711
Nagelkerke R2    | 0.588 | 0.604 | 0.647 | 0.662 | 0.665 | 0.668 | 0.968

Percentage correct:
Normal (0)                  | 84.9 | 84.8 | 84.9  | 93.2 | 93.5 | 90.5 | 99.3
Attack (1)                  | 98.9 | 98.9 | 100.0 | 55.1 | 55.1 | 67.9 | 98.7
Correct classification rate | 87.1 | 87.0 | 87.3  | 87.3 | 87.5 | 87.0 | 99.1

Notes: Wald significance levels in parentheses. (a) Dependent variable M1; contribution coefficients in brackets.

3 (categorical), variable V3(1) represents the 'private' case, coded 1 if the value is 'private' and 0 otherwise, variable V3(2) represents the 'domain_u' case, coded 1 if the value is 'domain_u' and 0 otherwise, and so on up to variable V3(5). The final category, 'other', is the reference category and is represented by codes of 0 for all five variables. The coefficient of variable V3(1) in Table 9 indicates the change in log odds when a 'private' value is compared to the 'other' value, and similarly for variables V3(2) to V3(5). The coefficient for the 'other' category is necessarily 0, since it does not differ from itself. A positive coefficient like that of V3(1) means that, compared to the 'other' values, 'private' values are associated with increased log odds of intrusions. Negative coefficients mean the opposite. In our case, in Table 9, it seems that only the effect of variable V3(1) on intrusions differs significantly from the effect of the 'other' category; the effects of all other indicator variables are non-significant. Moreover, Model 7 (or the other models) could be used in assessing the contribution of the variables to be used in prediction. The ranking of the independent variables according to the contribution parameters is as follows: V5 (0.8) < V34 (2.5) < V3 (3.4) < V23 (4.5) < V32 (4.7) < V6 (12.4). In words, the contribution of variable V6 (dst_bytes: number of data bytes from destination to source) is the highest in detecting intrusions. Finally, all the Logit results above should be treated with caution because the 'standardised residuals' from Model 3 did not follow the assumption of normality. In fact, applying the Kolmogorov-Smirnov test to the


Table 8
Classification measures for Logit analysis models (Predicted: 1,0; Actual: 1,0)

                              | Calculation | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 | Model 7
Confusion matrix (1;1) = a    |             | 185     | 185     | 187     | 103     | 103     | 127     | 446
                 (1;0) = b    |             | 2       | 2       | 0       | 84      | 84      | 60      | 6
                 (0;1) = c    |             | 153     | 154     | 153     | 69      | 66      | 96      | 5
                 (0;0) = d    |             | 860     | 859     | 860     | 944     | 947     | 917     | 743
Prevalence                    | (a+c)/N     | 0.282   | 0.283   | 0.283   | 0.143   | 0.141   | 0.186   | 0.376
Overall diagnostic power      | (b+d)/N     | 0.718   | 0.718   | 0.717   | 0.857   | 0.859   | 0.814   | 0.624
Correct classification rate   | (a+d)/N     | 0.871   | 0.870   | 0.873   | 0.873   | 0.875   | 0.870   | 0.991
Sensitivity                   | a/(a+c)     | 0.547   | 0.546   | 0.550   | 0.599   | 0.609   | 0.570   | 0.989
Specificity                   | d/(b+d)     | 0.998   | 0.998   | 1.000   | 0.918   | 0.919   | 0.939   | 0.992
False positive rate           | b/(b+d)     | 0.002   | 0.002   | 0.000   | 0.082   | 0.081   | 0.061   | 0.008
False negative rate           | c/(a+c)     | 0.453   | 0.454   | 0.450   | 0.401   | 0.391   | 0.430   | 0.011
Positive predictive power     | a/(a+b)     | 0.989   | 0.989   | 1.000   | 0.551   | 0.551   | 0.679   | 0.987
Negative predictive power     | d/(c+d)     | 0.849   | 0.848   | 0.849   | 0.932   | 0.935   | 0.905   | 0.993
Misclassification rate        | (b+c)/N     | 0.129   | 0.130   | 0.128   | 0.128   | 0.125   | 0.130   | 0.009
Odds-ratio                    | (ad)/(bc)   | 519.9   | 516.0   | -       | 16.8    | 17.6    | 20.2    | 11045.9
Kappa                         | K           | 0.631   | 0.629   | 0.637   | 0.499   | 0.505   | 0.542   | 0.980
Normalised Mutual Information | NMI         | 0.382   | 0.380   | -       | 0.228   | 0.236   | 0.245   | 0.922

Notes: N = a + b + c + d;
K = [(a + d) - ((a + c)(a + b) + (b + d)(c + d))/N] / [N - ((a + c)(a + b) + (b + d)(c + d))/N];
NMI = 1 - [-a ln(a) - b ln(b) - c ln(c) - d ln(d) + (a + b) ln(a + b) + (c + d) ln(c + d)] / [N ln(N) - ((a + c) ln(a + c) + (b + d) ln(b + d))].
A dash (-) indicates a measure that cannot be computed because one element of the confusion matrix is zero.
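The measures in Table 8 follow directly from the four confusion-matrix cells. The sketch below (plain Python, written for this review rather than taken from the paper) reproduces the Kappa and NMI values reported for Model 1:

```python
import math

def classification_measures(a, b, c, d):
    """Confusion-matrix measures as defined in Table 8.
    a = (1;1), b = (1;0), c = (0;1), d = (0;0)."""
    N = a + b + c + d
    # chance agreement used by the Kappa coefficient
    chance = ((a + c) * (a + b) + (b + d) * (c + d)) / N
    kappa = ((a + d) - chance) / (N - chance)
    # NMI is undefined when any cell is zero (cf. Model 3)
    num = (-a * math.log(a) - b * math.log(b) - c * math.log(c) - d * math.log(d)
           + (a + b) * math.log(a + b) + (c + d) * math.log(c + d))
    den = N * math.log(N) - ((a + c) * math.log(a + c) + (b + d) * math.log(b + d))
    return {
        "correct classification rate": (a + d) / N,
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "kappa": kappa,
        "nmi": 1 - num / den,
    }

# Model 1 confusion matrix from Table 8
m = classification_measures(a=185, b=2, c=153, d=860)
print(round(m["kappa"], 3))  # 0.631, as reported for Model 1
print(round(m["nmi"], 3))    # 0.382, as reported for Model 1
```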

Table 9
Logit analysis results (dependent variable original connection), Model 3

                 | Model 3        | Model 3 (categorical)
V1               | 0.034 (0.000)  | 0.008 (0.098)
V3               | 6.831 (0.000)  |
V3(1)            |                | 6.366 (0.090)
V3(2)            |                | 7.565 (0.921)
V3(3)            |                | 3.826 (0.831)
V3(4)            |                | 5.527 (0.947)
V3(5)            |                | 8.096 (0.961)
V6               | 0.027 (0.000)  | 0.007 (0.092)
Constant         | 10.900 (0.000) | 5.116 (0.128)
Cox and Snell R2 | 0.374          | 0.379
Nagelkerke R2    | 0.647          | 0.654

Percentage correct:
Normal (0)       | 84.9           | 84.9
Attack (1)       | 100.0          | 100.0
Overall success  | 87.3           | 87.3

Notes: Wald significance levels in parentheses.
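Under the dummy coding described for Model 3 (categorical), a coefficient is turned into an odds ratio against the reference category by exponentiation. A minimal sketch, using the V3(1) 'private' coefficient from Table 9 and assuming, as the surrounding text indicates, that its sign is positive:

```python
import math

# Coefficient of the V3(1) ('private') dummy from Table 9, measured relative
# to the reference category 'other'. A positive sign is assumed here, since
# the text states that 'private' values are associated with increased log
# odds of intrusions.
coef_v3_1 = 6.366

# exp(coefficient) gives the odds of an attack for a 'private' connection
# relative to an 'other' connection
odds_ratio = math.exp(coef_v3_1)
print(round(odds_ratio, 1))  # 581.7
```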

standardised residuals it has been found that they do not follow the normal distribution (mean = 0.011, standard deviation = 0.581, significance = 0.000).

4. Conclusions and areas for further research

This paper utilised the publicly available dataset KDD-99 as a point of reference in order to investigate the effectiveness of three competing statistical approaches in predicting intrusion detection. Although these approaches have been used in the past for examining intrusion detection (see e.g. [2,5,21]), the methodologies followed were primarily focused on the individual methods rather than on comparisons of the various approaches employed. Thus, by examining the effectiveness of the three competing statistical approaches using the same data, this paper contributes to intrusion detection modelling. Table 10 presents


Table 10
Classification measures for the three competing approaches (Predicted: 1,0; Actual: 1,0)

                              | Cluster analysis | Discriminant analysis | Logit analysis
Confusion matrix (1;1) = a    | 185              | 185                   | 187
                 (1;0) = b    | 2                | 2                     | 0
                 (0;1) = c    | 267              | 181                   | 153
                 (0;0) = d    | 746              | 832                   | 860
Prevalence                    | 0.377            | 0.305                 | 0.283
Overall diagnostic power      | 0.623            | 0.695                 | 0.717
Correct classification rate   | 0.776            | 0.848                 | 0.873
Sensitivity                   | 0.409            | 0.505                 | 0.550
Specificity                   | 0.997            | 0.998                 | 1.000
False positive rate           | 0.003            | 0.002                 | 0.000
False negative rate           | 0.591            | 0.495                 | 0.450
Positive predictive power     | 0.989            | 0.989                 | 1.000
Negative predictive power     | 0.736            | 0.821                 | 0.849
Misclassification rate        | 0.224            | 0.153                 | 0.128
Odds-ratio                    | 258.446          | 425.193               | -
Kappa                         | 0.460            | 0.583                 | 0.637
Normalised Mutual Information | 0.251            | 0.341                 | -

A dash (-) indicates a measure that cannot be computed because one element of the confusion matrix is zero.

the classification measures for comparing these three approaches with respect to the confusion matrices derived. The values of the overall diagnostic power, the correct classification rate, sensitivity, and specificity for Logit analysis are much higher than the corresponding values for the other two approaches, indicating that Logit analysis is more effective. Furthermore, the values for the false positive rate and the false negative rate are lower, and the values for positive predictive power and negative predictive power are higher, for Logit analysis than for the other two approaches. Finally, the misclassification rate is much lower and the Kappa coefficient, which makes full use of all the information contained in the confusion matrix, is much higher for Logit analysis, indicating that this analysis is more effective than cluster analysis or discriminant analysis. Irrespective of the fact that for this application Logit analysis is preferable to the other two approaches, we still see that although the false positive rate is very low and thus the positive predictive power is very high, the false negative rate is rather high and thus the negative predictive power is rather low. This is a dangerous problem, and is far more serious than the problem of false positives [32], because it distracts the intrusion detection analyst from spotting real attacks [19]. Although the results above depend on the threshold level, which in the present analysis was set to 0.50, selection of other threshold levels may magnify the problems of false negative or false positive predictions. However, a scenario analysis may be employed in order to investigate the sensitivity of the predictive power of the three estimation methods used: for each threshold level and for each estimation method employed, the classification measures should be calculated.
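The scenario analysis proposed here can be sketched in a few lines: sweep the cut-off applied to the predicted probabilities, recompute Kappa at each level, and fit a parabola through neighbouring points to locate an optimum, a one-step version of the polynomial regression idea. The probabilities and labels below are illustrative, not the paper's data:

```python
def kappa(a, b, c, d):
    """Kappa coefficient from the Table 8 definition."""
    N = a + b + c + d
    chance = ((a + c) * (a + b) + (b + d) * (c + d)) / N
    return ((a + d) - chance) / (N - chance)

def kappa_at_threshold(probs, actual, threshold):
    """Classify at the given cut-off and return the resulting Kappa."""
    a = b = c = d = 0
    for p, y in zip(probs, actual):
        pred = 1 if p >= threshold else 0
        if pred and y: a += 1
        elif pred and not y: b += 1
        elif not pred and y: c += 1
        else: d += 1
    return kappa(a, b, c, d)

def quadratic_vertex(p1, p2, p3):
    """Threshold at the stationary point of the parabola through three
    (threshold, measure) points."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    d1 = (y2 - y1) / (x2 - x1)
    d2 = (y3 - y2) / (x3 - x2)
    a2 = (d2 - d1) / (x3 - x1)    # second-order coefficient
    a1 = d1 - a2 * (x1 + x2)      # first-order coefficient
    return -a1 / (2 * a2)

# Hypothetical predicted probabilities and actual labels
probs  = [0.10, 0.40, 0.60, 0.45, 0.80, 0.90]
actual = [0, 0, 1, 1, 1, 1]
points = [(t, kappa_at_threshold(probs, actual, t)) for t in (0.3, 0.5, 0.7)]
print(points)
print(round(quadratic_vertex(*points), 3))  # Kappa-maximising threshold estimate
```

With these illustrative data Kappa peaks near the middle threshold, and the fitted parabola places its maximum between 0.4 and 0.5.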
As the series of the calculated measures depend on the threshold levels, polynomial regression analysis could be used to obtain the optimum levels of the measures and their confidence intervals for each approach. The comparison of these optimum levels and their confidence intervals may produce confidence bands where all methods are equivalent, and this is recommended as an area for future research. 'Simplicity', 'consistency', 'accuracy' and 'predictive power' are four essential criteria used in determining the quality of a model [12]. However, all these criteria depend on the 'correct specification' of the model, in the sense that the underlying equation used in estimation must have the proper number of independent variables. The inclusion of irrelevant variables or the absence of relevant variables may produce problems in the results obtained. Two methods are usually employed in obtaining the correctly specified model [12,15,16]: (a) the 'classical method', starting from a simple model and ending with a general model, the so-called 'down-to-up' method, and (b) the 'over-parameterized method', starting from a general model and ending with a simple model, the so-called 'up-to-down' method, or the 'nested models method'. In cases where no previous knowledge exists on the proper number and context of the independent variables to be included in the model,


the second methodology is preferable [12,15,16]. Thus in this paper we followed the second methodology, starting the investigation by including almost all the available variables in the data set and ending with only three appropriate variables. In other words, considering that the Logit analysis gave much better intrusion detection results in comparison to the other two methods, and taking into account that the Logit model finally employed only three predictive variables, we could argue that the Logit model comes closer to the four criteria used in determining the quality of a model than the other two types of models employed. This property of simplicity combined with predictive power is very important, considering the significant volumes of audit data involved, and it makes Logit analysis valuable as a real-time as well as a post-mortem analysis tool. Finally, the introduction of a timestamp variable would allow an investigation of the chronological correlation of events. From a network forensics perspective, this is a very important research exercise, as the resulting model would be able to detect reconnaissance activity which is not normally detected by intrusion detection engines. We believe that this is a very important area for future research, as the additional dimension of time could improve the efficiency of the existing methods.

References

[1] ACM KDD CUP Center.
[2] M. Asaka, T. Onabura, T. Inoue, S. Goto, Remote attack detection method in IDA: MLSI-based intrusion detection using discriminant analysis, in: Proceedings of the 2002 Symposium on Applications and the Internet, IEEE, 2002.
[3] V.K. Borooah, Logit and Probit, Sage Publications, Thousand Oaks, CA, 2002.
[4] Z. Chen, Q. Zhu, Query construction for user-guided knowledge discovery in databases, Information Sciences 109 (1-4) (1998) 49-64.
[5] D. Dagon, X. Qin, G. Gu, W. Lee, HoneyStat: Local worm detection using honeypots, Georgia Institute of Technology, 2005.
[6] R.B. Darlington, Regression and Linear Models, McGraw-Hill, New York, 1990.
[7] Discriminant Function Analysis.
[8] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, S. Stolfo, A geometric framework for unsupervised anomaly detection: detecting intrusions in unlabeled data, in: Applications of Data Mining in Computer Security, Kluwer, 2002.
[9] K. Fanning, K. Cogger, Neural network detection of management fraud using published financial data, Intelligent Systems in Accounting, Finance and Management 7 (1998) 21-41.
[10] A.H. Fielding, J.F. Bell, A review of methods for the assessment of prediction errors in conservation presence/absence models, Environmental Conservation 24 (1997) 38-49.
[11] M. Friedman, M. Last, Y. Makover, A. Kandel, Anomaly detection in web documents using crisp and fuzzy-based cosine clustering methodology, Information Sciences 177 (2) (2007) 467-475.
[12] C. Gilbert, Professor Hendry's econometric methodology, Oxford Bulletin of Economics and Statistics 48 (1986) 283-307.
[13] G. Gim, T. Whalen, Logical second order models: achieving synergy between computer power and human reason, Information Sciences 114 (1-4) (1999) 81-104.
[14] J.A. Hartigan, Statistical theory in clustering, Journal of Classification 2 (1985) 63-76.
[15] D. Hendry, J. Richard, On the formulation of empirical models in dynamic econometrics, Journal of Econometrics 20 (1982) 193-220.
[16] D. Hendry, Econometric methodology, Econometric Society Fifth World Congress, MIT, 1985.
[17] S. Jajodia, D. Wijesekera, Recent advances in access control models, in: Martin S. Olivier, David L. Spooner (Eds.), Database and Application Security XV, Kluwer Academic Publishers, Boston, 2002, pp. 3-15.
[18] R.A. Johnson, D.W. Wichern, Applied Multivariate Statistical Analysis, fourth ed., Prentice Hall, New Jersey, 1998.
[19] K. Julisch, Clustering intrusion detection alarms to support root cause analysis, ACM Transactions on Information and System Security 6 (9) (2003) 443-471.
[20] KDD Cup 1999 Data, University of California, Irvine, 1999.
[21] T.D. Klastorin, Assessing cluster analysis results, Journal of Marketing Research 20 (1983) 92-98.
[22] W. Klecka, Discriminant Analysis, Sage Publications, Thousand Oaks, CA, 1980.
[23] J. Kolter, M. Maloof, Learning to detect and classify malicious executables in the wild, Journal of Machine Learning Research 7 (2006) 2721-2744.
[24] P.A. Lachenbruch, Discriminant Analysis, Hafner, NY, 1975.
[25] J.R. Landis, G. Koch, The measurement of observer agreement for categorical data, Biometrics 33 (1977) 159-174.
[26] P. Laskov, K. Rieck, C. Schäfer, K. Müller, Visualization of anomaly detection using prediction sensitivity, in: Sicherheit 2005 (Sicherheit - Schutz und Zuverlässigkeit), 2. Jahrestagung des FB Sicherheit der GI, 2005, pp. 197-208.
[27] S. Oh, W. Pedrycz, S. Roh, Genetically optimized fuzzy polynomial neural networks with fuzzy set-based polynomial neurons, Information Sciences 176 (23) (2006) 3490-3519.
[28] T.W. Ryu, C. Eick, A database clustering methodology and tool, Information Sciences 171 (1-3) (2005) 29-59.
[29] T. Sen, A. Gibbs, An evaluation of the corporate takeover model using neural networks, Intelligent Systems in Accounting, Finance and Management 3 (4) (1994) 279-292.
[30] SPSS Base 10.0, Applications Guide, SPSS Inc., Chicago, 1999.
[31] Statistica, Electronic Textbook, StatSoft Inc., 2004.


[32] A. Sundaram, An introduction to intrusion detection, 1996.
[33] J.H. Ward, Hierarchical grouping to optimize an objective function, Journal of the American Statistical Association 58 (1963) 236-244.
[34] J. Yan, B. Zhang, S. Yan, N. Liu, Q. Yang, Q. Cheng, H. Li, Z. Chen, W. Ma, A scalable supervised algorithm for dimensionality reduction on streaming data, Information Sciences 176 (14) (2006) 2042-2065.
[35] L. Yao, L. ZhiTang, L. Shuyu, A fuzzy anomaly detection algorithm for IPv6, in: Semantics, Knowledge, and Grid, SKG '06, 2006, p. 67.