
Classification and Regression Trees for Handling Missing Values in a CMDB to Reduce Malware in an Information System

Gustavo A Valencia-Zapata, Juan C Salazar-Uribe, Ph.D.
Escuela de Estadística, Universidad Nacional de Colombia-Sede Medellín

Abstract: In this paper we propose a Classification and Regression Trees (CART) model for handling missing values in a Configuration Management Database (CMDB). Once the information is completed, a statistical model to dose antivirus scans inside an information system (IS) in the banking sector is implemented. About 18.22% of the information extracted from the CMDB was incomplete; consequently, we propose a data mining modeling strategy to impute this missing information. Finally, we illustrate both the imputation methodology and the statistical dosage model using real data from an IS.

KeyWords: CART, Missing Values, Data Mining, Banking Sector, Malware.

I. INTRODUCTION

This research is intended to be a resource for improving information security levels in banking-sector information systems (IS). The research question is: how can malware incidence be decreased in an IS? Just as treatments (medicine, vaccines, therapies, etc.) must be applied in the human epidemiological context, the analogous treatment in a computer environment is scanning. On that basis, the question can be reformulated as: how can antivirus scans (medical tests) be dosed in our population (the computer network) to reduce the incidence of malware (disease) in a banking IS?

Different approaches have been taken to modeling “disease” in Information Technology (IT) environments, using analogies with the epidemiological context [1]. In this sense, exponential growth of malware has been observed in recent decades, and the outlook and limitations of epidemiological concepts for malware prevention have been examined [1]. Malware spreading and measurement models have also been developed [2], [3], [4]. Similarly, simulations of the influence of network topology on malware problems are discussed in [5], [6]. Finally, epidemiological methodologies are used to estimate the growth and propagation of worms in a network [7].

In this paper, the first stages in building the model are information extraction (IE), handling missing values, and statistical analysis. The main information source is the bank's antivirus software. Secondary information sources are web filtering, Human Capital/Resource Management (HCM; company employees), and the CMDB. The CMDB provides physiological computer information such as brand, operating system, processor type, random access memory (RAM), and so on. CMDB parameters are shown in Table I. Using data mining software (through Open Database Connectivity, ODBC), we collected information about malware attacks over a period of eleven weeks. Secondary-source information was then added, as can be seen in Figure 1. Around 18.22% of the CMDB data (infected computers) have missing values. Classification and Regression Trees (CART) are used for handling missing values (imputation) to avoid losing valuable information. This research used nonparametric statistical tests to check the quality of the imputed data. Moreover, statistical analysis is conducted to select the variables to be included in the antivirus scanning dosage model.

II. INFORMATION COLLECTION STAGES

A. Antivirus Software

In this stage, the amount and type of malware per computer are identified. Information about 8476 computers was collected over a period of eleven weeks. The antivirus software reports the user account that was active when malware was detected, along with some other technical information about the computer.

TABLE I
CMDB PARAMETERS

Variable           Meaning/value                       Type      Unit
Class              Laptop, CPU or server               Nominal   NA
Brand              Computer brand                      Nominal   NA
Computer_Age       Operating time                      Scale     Week
Processor_Type     Type of computer processor          Nominal   NA
Processor_Clock    The speed of a computer processor   Scale     GHz
Processors         Number of processors                Integer   Count
Memory (RAM)       Memory size                         Scale     GB
Operation_System   Operating system (OS)               Nominal   NA
Service_Pack       Updates to an OS                    Nominal   NA
Hard_Disk          Hard disk size                      Scale     GB

B. CMDB

Here, technical information about computers is identified, as well as its relationship with the user's account. Table II shows, for each CMDB variable, the percent complete and the number of valid and missing records. This information is used to assess the influence of these variables on detected malware levels.

TABLE II
CMDB DATA QUALITY

Variable           Percent Complete   Valid Records   Missing Records
Class              83.483             7076            1400
Brand              83.483             7076            1400
Computer_Age       81.784             6932            1544
Processor_Type     99.493             8433            43
Processor_Clock    99.493             8433            43
Processors         99.493             8433            43
Memory (RAM)       99.493             8433            43
Operation_System   99.493             8433            43
Service_Pack       99.457             8430            46
Hard_Disk          99.493             8433            43

C. Active Directory

At this point, the user's privileges are identified; for example, whether a user has been added to the Local Administrator Group, or whether a user account is allowed to use USB devices. Knowledge of these privileges is important to establish whether or not these variables have some influence on malware levels.

D. Web Filtering Software

At this step, the user account is related to web surfing by identifying variables such as the number and type of blocked websites, web surfing time, etc. The main purpose of studying this association is to establish whether web surfing behavior influences malware levels.

E. HCM Software

In this stage, employee information is identified. Collected information such as position and work area is used to assess the influence of these variables on detected malware levels.

Fig. 1. Information collection stages: anti-virus software (A), CMDB (B), Active Directory (C), web filtering software (D), and HCM software (E), together with the data mining software.

III. CMDB DATA IMPUTATION

A. Classification and Regression Trees, CART

The classification and regression trees (CART) method was suggested by Breiman et al. [8], where the model is explained in detail. According to Breiman, the decision trees produced by CART are strictly binary, containing exactly two branches for each decision node. CART recursively partitions the records into subsets with similar values for the target attribute. The CART tree is grown by conducting, for each decision node, an exhaustive search of all available variables and all possible splitting values, and selecting the optimal split according to the following criterion [9]: a measure of the “goodness” of a candidate split s at node t, as defined in (1). The split parameters are defined in Table III.

TABLE III
SPLIT PARAMETERS

Parameter   Meaning
t_L         Left child node of node t
t_R         Right child node of node t

One of the major contributions of CART was to include a fully automated and effective mechanism for handling missing values [10]. Decision trees require a missing-value handling mechanism at three levels: (a) during splitter evaluation, (b) when moving the training data through a node, and (c) when moving test data through a node for final class assignment [11].

According to [11], regarding (a), the later versions of CART (such as the one we use) offer a family of penalties that reduce the improvement measure to reflect the degree of missingness. For example, if a variable is missing in 20% of the records in a node, then its improvement score for that node might be reduced by 20%, or alternatively by half of 20%, and so on. For (b) and (c), the CART mechanism discovers “surrogate” or substitute splitters for every node of the tree, whether or not missing values occur in the training data. The surrogates are thus available should a tree trained on complete data be applied to new data that include missing values.
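As an illustration of the exhaustive split search and of the missingness penalty, the following minimal Python sketch scores every threshold of a single numeric variable. It assumes the goodness measure takes the form commonly given in data mining texts, Φ(s|t) = 2 P_L P_R Σ_j |P(j|t_L) − P(j|t_R)|, with P_L and P_R taken here as the proportions of the node's records sent to the left and right child. That form, the function names, the toy Hard_Disk values, and the linear penalty with its `weight` parameter are all assumptions made for illustration; this is not the PASW Modeler implementation, and surrogate splitters are not covered.

```python
from collections import Counter


def class_proportions(labels):
    """Return {class: proportion of records} for the labels in one node."""
    counts = Counter(labels)
    return {cls: n / len(labels) for cls, n in counts.items()}


def goodness_of_split(left_labels, right_labels):
    """Assumed form: Phi(s|t) = 2 * P_L * P_R * sum_j |P(j|t_L) - P(j|t_R)|."""
    n_left, n_right = len(left_labels), len(right_labels)
    if n_left == 0 or n_right == 0:
        return 0.0  # a split sending every record one way separates nothing
    total = n_left + n_right
    p_left, p_right = n_left / total, n_right / total
    left = class_proportions(left_labels)
    right = class_proportions(right_labels)
    spread = sum(abs(left.get(c, 0.0) - right.get(c, 0.0))
                 for c in set(left) | set(right))
    return 2.0 * p_left * p_right * spread


def best_threshold(values, labels):
    """Exhaustively score every candidate threshold of one numeric variable.

    Records whose value is missing (None) are left out of the scoring.
    """
    observed = [(v, y) for v, y in zip(values, labels) if v is not None]
    # The largest observed value cannot split the node, so it is dropped.
    candidates = sorted({v for v, _ in observed})[:-1]
    best_thr, best_score = None, 0.0
    for thr in candidates:
        left = [y for v, y in observed if v <= thr]
        right = [y for v, y in observed if v > thr]
        score = goodness_of_split(left, right)
        if score > best_score:
            best_thr, best_score = thr, score
    return best_thr, best_score


def penalized(score, missing_fraction, weight=1.0):
    """Shrink a split score in proportion to the share of missing values."""
    return score * (1.0 - weight * missing_fraction)


# Hypothetical toy node: Hard_Disk (GB) as a predictor of Class, one value missing.
hard_disk = [80, 120, None, 250, 320]
cls = ["Laptop", "Laptop", "Laptop", "CPU", "CPU"]

threshold, score = best_threshold(hard_disk, cls)
missing = sum(v is None for v in hard_disk) / len(hard_disk)

print("best split: Hard_Disk <=", threshold, "with score", score)
print("score after full penalty:", penalized(score, missing))
print("score after half penalty:", penalized(score, missing, weight=0.5))
```

In this toy node the split Hard_Disk <= 120 separates the two classes perfectly among the observed records and attains the maximum score of 1. Because one of the five records (20%) lacks a Hard_Disk value, the full penalty lowers that score to about 0.8 and the half penalty to about 0.9, mirroring the example in the text.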

B. Handling Missing Values through Classification and Regression Trees

Ten variables were imputed (Table II); that is, ten CART models were used, one for each variable, which together make up a classification and regression forest. The imputation was performed using PASW® Modeler (a data mining software package). The model was trained on complete data, and the trained model was then applied to the records with missing values. Table IV and Table V show a nonparametric statistical test, the McNemar Test for Significance of Changes [12], which was used to evaluate the model's predictions on complete data. In this case, E_Class is the imputed value and Class is the real value. For instance, 5013 (99.6%) computers with Class equal to CPU were classified correctly by CART, and 2002 (99.3%) computers with Class equal to Laptop were classified correctly by the same CART. The hypotheses formulated for the McNemar test (two-sided) were:

H0: Class is not changed after imputation.
H1: Class was changed after imputation.

TABLE IV
CONTINGENCY TABLE

             E_Class
Class        CPU     Laptop   Total
CPU          5013    20       5033
Laptop       14      2002     2016
Total        5027    2022     7049

TABLE V
CHI-SQUARE TEST

                     Value    Exact Sig. (two-sided)
McNemar Test         1.058    0.392
No. of Valid Cases   7049
Note: binomial distribution used.

According to this analysis, we cannot reject the null hypothesis; that is, CART does not change Class values after imputation (p-value = 0.396). Figure 2 shows a CART model for the Class variable, built using PASW® Modeler.

Spearman's test (a nonparametric statistic) was used to explore the correlation between real and predicted values for continuous variables, whereas McNemar's test was used to assess the performance of the CART model with complete data. In this case, E_Computer_Age is the imputed value and Computer_Age is the real value. Figure 2 shows the scatterplot for these variables. We can observe a linear trend between the imputed and real values, so we fitted a simple linear regression as a possible model. Nevertheless, validating the assumptions of the linear model is not the aim of this article. Spearman's test was used to test independence. The hypotheses formulated for this test were:

H0: Computer_Age and E_Computer_Age are mutually independent.
H1: Computer_Age and E_Computer_Age are not mutually independent.

According to the results from this test, Computer_Age and E_Computer_Age are not mutually independent (p-value