An Empirical Study of Software Metrics Selection Using Support Vector Machine

Huanjing Wang, Western Kentucky University, Bowling Green, Kentucky 42101, [email protected]

Taghi M. Khoshgoftaar, Florida Atlantic University, Boca Raton, Florida 33431, [email protected]
Amri Napolitano, Florida Atlantic University, Boca Raton, Florida 33431, [email protected]

Abstract—The objective of feature selection is to identify irrelevant and redundant features, which can then be discarded from the analysis. Reducing the number of metrics (features) in a data set can lead to faster software quality model training and improved classifier performance. In this study we focus on feature ranking using linear Support Vector Machines (SVM) as implemented in WEKA. The contribution of this study is an extensive empirical evaluation of SVM rankers built from imbalanced data. Should features be eliminated iteratively at each step? What is a recommended value for the tolerance parameter? We address these and other related issues in this work.

I. INTRODUCTION

Software quality models are trained on software measurement data (metrics) using data mining techniques in order to extract information that supports the production of higher-quality software. The characteristics of the software metrics (a.k.a. features or attributes) influence the performance and effectiveness of the quality model. Previous studies [1], [2], [3] have shown that the performance of software quality models improves when irrelevant and redundant features are removed before modeling.

In this study, we investigated feature selection by means of a feature ranking method based on linear support vector machines (the SVM ranker), as implemented in WEKA [4]. WEKA is an open source data mining and machine learning package implemented in Java at the University of Waikato. Many researchers and practitioners in the data mining and machine learning community use WEKA, but past work has provided no empirically supported recommendation for the default values of the parameters percentToEliminatePerIteration (the percentage of attributes to be removed at each iteration) and toleranceParameter (used for checking the stopping criterion) of the SVM ranker. These parameter values are critical to the performance and running time of the ranker (see Section III); indeed, a poor choice of values can prevent an experiment from completing at all. We contend that the default values built into WEKA for the SVM ranker are not reasonable for experimentation. To the best of our knowledge, this work is the first to conduct comprehensive experimentation with the SVM ranker in WEKA and to recommend empirically supported default values for the percentToEliminatePerIteration and toleranceParameter parameters.

Since the introduction of the SVM ranker [5], there has been little empirical work on the topic, and we are not aware of any related work that empirically recommends default settings for the percentToEliminatePerIteration and toleranceParameter parameters of the SVM ranker in the WEKA tool. Therefore, in this work we present a comprehensive empirical evaluation of learning from imbalanced data while varying the percentToEliminatePerIteration and toleranceParameter parameters.

In this study, we built classification models using naïve Bayes (NB), multilayer perceptron (MLP), k-nearest neighbors (KNN), support vector machines (SVM), and logistic regression (LR) on the smaller subsets of selected attributes. Each classification model is assessed with the Area Under the ROC (Receiver Operating Characteristic) curve (AUC). The empirical validation of the different models was carried out through a case study of three consecutive releases of a very large telecommunications software system (denoted LLTS), nine data sets from the Eclipse software project, and three data sets from the NASA software project KC1. Our experimentation is done in WEKA using these 15 data sets, which exhibit varying degrees of class imbalance. In the experiments, ten runs of five-fold cross-validation were performed. Statistical analysis using ANalysis Of VAriance (ANOVA) models is used to determine reasonable default values for the percentToEliminatePerIteration and toleranceParameter parameters.

By varying the percentToEliminatePerIteration and toleranceParameter parameters, this work is the first to compare the SVM ranker on imbalanced data and to recommend a default setting for each parameter. The experimental results demonstrate that the WEKA default for toleranceParameter (1.0E-10) is not an appropriate value, whereas 1.0E-03 is reasonable. The results further show that the SVM ranker with no backward elimination performs similar to or better than the ranker with full backward elimination. Thorough experimentation makes our work comprehensive and strengthens the reliability of our conclusions.

The remainder of the paper is organized as follows. Section II presents the feature selection techniques. Section III describes the data sets, the experimental design, and a discussion of the results. Section IV presents background on feature selection and the SVM ranker in different domains. Finally, we conclude the paper in Section V and provide suggestions for future work.

II. FEATURE SELECTION

Feature selection has been applied in many data mining and machine learning applications [6]. Its main goal is to select a subset of features that minimizes the prediction errors of classifiers. Feature selection is broadly classified into feature ranking and feature subset selection: feature ranking sorts the attributes according to their individual predictive power, while feature subset selection searches for subsets of attributes that collectively have good predictive power. In this study, we investigated a feature ranking method, the SVM ranker. The Support Vector Machine (SVM) classifier is one of the most commonly used classifiers. It builds a linear discrimination function using a small number of critical boundary instances (called support vectors) from each class while ensuring the maximum possible separation [7].
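As a concrete illustration of the weight-based ranking idea described next, the following minimal sketch scores each feature by the magnitude of its weight in a linear SVM decision function f(x) = w·x + b. It uses scikit-learn rather than WEKA's implementation, and the function name and the choice of LinearSVC are our own illustrative assumptions, not part of the original study.

```python
import numpy as np
from sklearn.svm import LinearSVC

def rank_features_by_svm_weight(X, y, tolerance=1e-3):
    """Return feature indices sorted from most to least important.

    A linear SVM f(x) = w.x + b is fit once; each feature j is scored by
    |w_j|, the magnitude of its weight in the decision function.  This
    corresponds to a single ranking pass with no backward elimination.
    """
    svm = LinearSVC(tol=tolerance, max_iter=10000)
    svm.fit(X, y)
    scores = np.abs(svm.coef_).ravel()   # one weight per feature (binary case)
    return np.argsort(scores)[::-1]      # indices of largest weights first

# Example use (X: modules x metrics, y: fp=1 / nfp=0, assumed already loaded):
# n_selected = int(np.ceil(np.log2(X.shape[1])))   # top log2(n) metrics, as in Section III
# selected = rank_features_by_svm_weight(X, y)[:n_selected]
```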

SVM has been extended to form an embedded feature selection method. A linear classifier is trained, and the features are ranked by the weights (calculated from the support vectors) assigned to them: the larger the weight, the more important the role the feature plays in the decision function. This ranking procedure can be applied recursively. Guyon et al. [5] introduced recursive feature elimination for support vector machines (SVM-RFE), which applies a backward elimination procedure recursively: at each iteration, one or more features with the lowest scores (weights) are eliminated, and the process is repeated until a predefined number of features remains.

The SVM ranker is implemented in WEKA as SVMAttributeEval. By default, WEKA uses percentToEliminatePerIteration = 0, attsToEliminatePerIteration = 1 (one feature is removed at each iteration), and toleranceParameter = 1.0E-10. The tolerance parameter defines the stopping criterion for the SVM model optimization; the lower the tolerance, the higher the computational cost. In our experimentation we first change the tolerance parameter to 1.0E-03 to reduce computational cost; together with the default values of percentToEliminatePerIteration and attsToEliminatePerIteration, this variant is called the SVM ranker with full backward elimination. Our experiments also set percentToEliminatePerIteration to 100 (regardless of the value of attsToEliminatePerIteration), which disables backward elimination, and consider four values for toleranceParameter: 1.0E-03, 1.0E-05, 1.0E-07, and 1.0E-10. In total, five SVM rankers are evaluated in this study (tolerance 1.0E-03 with full backward elimination, and tolerances 1.0E-03, 1.0E-05, 1.0E-07, and 1.0E-10 with no backward elimination). Our experiments demonstrate that 1.0E-10 is not an appropriate value and that 1.0E-03 is more reasonable. We also show that the SVM ranker with no backward elimination performs similar to or better than the ranker with full backward elimination, at a much lower computational cost.

The SVM ranker has been used as a feature selection method in several domains. However, for imbalanced data in the software engineering domain, a comprehensive evaluation of the ranker with varying values of percentToEliminatePerIteration and toleranceParameter has not been performed. Researchers and practitioners who use WEKA have no guidance on these settings and too often rely on the tool's default values. Our work shows that the default value of toleranceParameter is not appropriate for experimentation with imbalanced data. To our knowledge, no previous study has empirically recommended percentToEliminatePerIteration and toleranceParameter values for the SVM ranker in the WEKA tool.
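To make the recursive procedure concrete, the sketch below mimics the behavior controlled by percentToEliminatePerIteration: a value of 0 removes a fixed number of features per round (full backward elimination), while a value of 100 reduces the procedure to a single ranking pass with no elimination. This is a simplified scikit-learn analogue of WEKA's SVMAttributeEval, not the WEKA code itself, and the function and argument names are our own.

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe_ranking(X, y, percent_per_iteration=0, atts_per_iteration=1,
                    tolerance=1e-3):
    """Rank all features, most important first, by recursive elimination.

    percent_per_iteration=0 eliminates atts_per_iteration features per round
    (full backward elimination); percent_per_iteration>=100 performs a single
    fit and ranks every feature at once (no backward elimination).
    """
    remaining = list(range(X.shape[1]))
    eliminated = []                             # filled from weakest to strongest
    while remaining:
        svm = LinearSVC(tol=tolerance, max_iter=10000)
        svm.fit(X[:, remaining], y)
        weights = np.abs(svm.coef_).ravel()
        order = np.argsort(weights)             # ascending: weakest first
        if percent_per_iteration >= 100:
            n_drop = len(remaining)             # rank everything in one pass
        elif percent_per_iteration > 0:
            n_drop = max(1, int(len(remaining) * percent_per_iteration / 100))
        else:
            n_drop = min(atts_per_iteration, len(remaining))
        dropped = [remaining[i] for i in order[:n_drop]]
        eliminated.extend(dropped)
        remaining = [f for f in remaining if f not in dropped]
    return eliminated[::-1]                     # most important feature first
```

In WEKA itself, these behaviors are selected through the percentToEliminatePerIteration, attsToEliminatePerIteration, and toleranceParameter options of SVMAttributeEval, as described above.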
III. EXPERIMENTS

A. Experimental Data Sets

The experiments conducted in this study used software metrics and fault data collected from real-world software projects: a very large telecommunications software system (denoted LLTS) [1], the Eclipse project [8], and the NASA software project KC1 [9]. The LLTS software measurement data cover four consecutive releases, labeled SP1, SP2, SP3, and SP4. We only provide results for SP2, SP3, and SP4, because the run for SP1 with the tolerance parameter set to 1.0E-10 and no backward elimination could not be completed due to its computational cost. Each LLTS data set consists of 42 software metrics, including 24 product metrics, 14 process metrics, and four execution metrics [1].

TABLE I
SOFTWARE DATA SETS CHARACTERISTICS

Project   Data           #Metrics  #Modules  %fp      %nfp
LLTS      SP2            42        3981      4.75%    95.25%
LLTS      SP3            42        3541      1.33%    98.67%
LLTS      SP4            42        3978      2.31%    97.69%
Eclipse   Eclipse2.0-10  208       377       6.1%     93.9%
Eclipse   Eclipse2.0-5   208       377       13.79%   86.21%
Eclipse   Eclipse2.0-3   208       377       26.79%   73.21%
Eclipse   Eclipse2.1-5   208       434       7.83%    92.17%
Eclipse   Eclipse2.1-4   208       434       11.52%   88.48%
Eclipse   Eclipse2.1-2   208       434       28.8%    71.2%
Eclipse   Eclipse3.0-10  208       661       6.2%     93.8%
Eclipse   Eclipse3.0-5   208       661       14.83%   85.17%
Eclipse   Eclipse3.0-3   208       661       23.75%   76.25%
NASA      KC1-20         62        145       6.90%    93.10%
NASA      KC1-10         62        145       14.48%   85.52%
NASA      KC1-5          62        145       24.83%   75.17%

The dependent variable is the class of the program module: fault-prone (fp) or not fault-prone (nfp). A program module with one or more faults is considered fp, and nfp otherwise.

From the PROMISE data repository [8], we also obtained the Eclipse defect counts and complexity metrics data set. In particular, we use the metrics and defect data at the software package level. The original data for the Eclipse packages consist of three releases, denoted 2.0, 2.1, and 3.0. We transform the original data by (1) removing all non-numeric attributes, including the package names, and (2) converting the post-release defects attribute to a binary class attribute: fault-prone (fp) or not fault-prone (nfp). Membership in each class is determined by a post-release defects threshold t: packages with t or more post-release defects are labeled fp and the remaining packages nfp. In our study, we use t ∈ {10, 5, 3} for releases 2.0 and 3.0 and t ∈ {5, 4, 2} for release 2.1. These values are selected in order to obtain data sets with different levels of class imbalance. All nine derived data sets contain 208 independent attributes; releases 2.0, 2.1, and 3.0 contain 377, 434, and 661 instances, respectively.

The original NASA project KC1 [9] includes 145 instances, each containing 94 independent attributes. After removing 32 Halstead derived measures, 62 attributes remain. We used three different thresholds to define defective instances, thereby obtaining three versions of the preprocessed KC1 data set. The thresholds are 20, 10, and 5, indicating that instances with 20, 10, or 5 or more defects belong to the fp class; the three data sets are named KC1-20, KC1-10, and KC1-5, respectively.

The fifteen data sets used in this work reflect software projects of different sizes with different proportions of fp and nfp modules. Table I lists the characteristics of the 15 data sets utilized in this work. The sketch following this paragraph illustrates the thresholding step used to derive the binary class labels.
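As a minimal sketch of the class-labeling step described above (assuming the per-package or per-module defect counts are already available as an array), the snippet below derives fp/nfp labels for a chosen threshold t; the function name and data layout are our own illustration.

```python
import numpy as np

def binarize_defects(post_release_defects, t):
    """Label instances fp (1) when the defect count is >= t, else nfp (0)."""
    defects = np.asarray(post_release_defects)
    return (defects >= t).astype(int)

# Eclipse 2.0/3.0 were binarized at t = 10, 5, and 3 (2.1 at 5, 4, 2); KC1 at 20, 10, and 5.
# Example with made-up defect counts:
labels_t5 = binarize_defects([0, 2, 7, 12, 0, 4], t=5)   # -> [0, 0, 1, 1, 0, 0]
print(labels_t5)
```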

B. Classification Algorithms

The software defect prediction models were built with five commonly used classification algorithms: naïve Bayes, multilayer perceptron, K-nearest neighbors, support vector machine, and logistic regression. These five learners were selected because of their common use in software engineering and other application domains. Unless stated otherwise, we use the default parameter settings for the different learners as specified in WEKA; parameter settings are changed only when a significant improvement in performance is obtained.

1) Naïve Bayes (NB) [10] utilizes Bayes's rule of conditional probability and is termed 'naive' because it assumes conditional independence of the features.
2) Multilayer Perceptron (MLP) [11] is a neural network of simple neurons called perceptrons. Two MLP parameters were changed: 'hiddenLayers' was set to 3 to define a network with one hidden layer containing three nodes, and 'validationSetSize' was set to 10 so that the classifier leaves 10% of the training data aside as a validation set to determine when to stop the iterative training process.
3) K-Nearest Neighbors (KNN) [12], also called instance-based learning, uses distance-based comparisons, so the choice of distance metric is critical. KNN was built with changes to two parameters: 'distanceWeighting' was set to 'Weight by 1/distance' and 'kNN' was set to 5.
4) Support Vector Machine (SVM) [13], called SMO in WEKA, had two changes to the default parameters: the complexity constant 'c' was set to 5.0 and 'buildLogisticModels' was set to true. By default, a linear kernel was used.
5) Logistic Regression (LR) [14] is a statistical regression model for categorical prediction that fits the data to a logistic curve.

C. Performance Metric

The classification models are evaluated using the Area Under the ROC (Receiver Operating Characteristic) Curve (AUC). AUC is widely used and provides a general idea of the predictive potential of a classifier. The ROC curve characterizes the trade-off between the true positive rate and the false positive rate, illustrating the classifier's performance across all decision thresholds, i.e., all values between 0 and 1 that could be used to separate the fp and nfp modules. AUC is a single-value measure that ranges from 0 to 1, where a perfect classifier yields an AUC of 1. It has been shown that AUC has lower variance and is more reliable than other performance metrics (such as precision, recall, or the F-measure) [15].

D. Results Analysis

Before using a feature ranking technique, the practitioner must choose how many features to select; these selected features are then used for modeling. In this study, we choose the top ⌈log2 n⌉ features with the highest scores, where n is the number of independent features in a given data set. We select the top ⌈log2 n⌉ features because (1) no general guidance has been found in the related literature on the number of features that should be selected when using a feature ranking technique; (2) a software engineering expert with more than 20 years of experience recommended selecting ⌈log2 n⌉ metrics for software quality prediction; and (3) a recent study [16] showed that ⌈log2 n⌉ is appropriate for various learners. Thus, for the three LLTS data sets ⌈log2 42⌉ = 6; for the nine Eclipse data sets ⌈log2 208⌉ = 8; and for the three NASA KC1 data sets ⌈log2 62⌉ = 6.

Following the feature selection step (the SVM ranker in this study), the classification models are constructed on data sets containing only the selected features, and the defect prediction models are evaluated with respect to the AUC performance metric. We used WEKA for the defect prediction model building and testing process. In the experiments, ten runs of five-fold cross-validation were performed, and the results from the five folds of each run were combined to produce a single performance estimate. In total, 18,750 models were evaluated during our experiments.
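The evaluation protocol described above can be summarized in a few lines of code. The sketch below is a scikit-learn approximation of the study's WEKA-based setup and is not part of the original work: for each run it performs stratified five-fold cross-validation, selects the top ⌈log2 n⌉ features on the training folds only, and records the AUC on the held-out fold. The rank_features argument is any ranking function (for example, the SVM-weight sketch shown in Section II), and GaussianNB stands in for WEKA's naïve Bayes learner.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.naive_bayes import GaussianNB

def cross_validated_auc(X, y, rank_features, classifier_factory=GaussianNB,
                        n_runs=10, n_folds=5, seed=0):
    """Mean AUC over n_runs repetitions of stratified n_folds cross-validation.

    rank_features(X, y) must return feature indices ordered best-first;
    the top ceil(log2(n)) features, selected on the training folds only,
    are used to fit the classifier.
    """
    n_selected = int(np.ceil(np.log2(X.shape[1])))
    aucs = []
    for run in range(n_runs):
        cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed + run)
        for train_idx, test_idx in cv.split(X, y):
            top = rank_features(X[train_idx], y[train_idx])[:n_selected]
            model = classifier_factory().fit(X[train_idx][:, top], y[train_idx])
            scores = model.predict_proba(X[test_idx][:, top])[:, 1]
            aucs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aucs))
```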

E. Experimental Results

The classification performance results are reported in Tables II and III. Each value in the tables is the average over the ten runs of five-fold cross-validation. Due to space limitations, we cannot present every individual ranker's performance; we present the results of no backward elimination with tolerance parameters 1.0E-03 and 1.0E-10, and of full backward elimination. The results of these three rankers over the 15 software data sets are reported, and the last row of each table summarizes the average performance of each feature selection technique across the 15 data sets. The results demonstrate that, on average, the SVM ranker with no backward elimination and tolerance parameter 1.0E-03 outperformed the other SVM rankers when four of the classifiers (NB, MLP, SVM, and LR) are applied to the selected subset of features; for the KNN learner, the SVM ranker with full backward elimination performed best.

We also performed a two-way ANalysis Of VAriance (ANOVA) F test on the classification performance of the learners and SVM rankers, separately for the LLTS data sets and over all 15 data sets, to statistically examine the various effects on the performance of the classification models. Due to space limitations, we do not present ANOVA results for the Eclipse and KC1 data sets individually. The underlying assumptions of ANOVA were tested and validated prior to the statistical analysis. The two factors were designed as follows: Factor A represents the five classifiers and Factor B represents the five SVM rankers (no backward elimination with four different tolerance parameters, and full backward elimination). The null hypothesis for the ANOVA test is that all the group population means are the same, while the alternative hypothesis is that at least one pair of means differs.

Table IV shows the ANOVA results in two subtables, one for the three LLTS data sets and one for all 15 data sets. The p-values for Factor A in both cases and for Factor B of the LLTS data sets are less than the typical cutoff value of 0.05, so for classification performance (in terms of AUC) the alternative hypothesis is accepted: at least two group means differ significantly in the corresponding factors. For Factor B over all 15 data sets, the p-value (0.9795) is much greater than the cutoff value of 0.05, which implies that no significant difference exists between the five rankers across all 15 data sets. The p-values for the interaction term A×B are greater than 0.05, indicating that the effect of Factor A is consistent at every level of Factor B.

We further conducted multiple comparisons for the main factors and the interaction term A×B with Tukey's Honestly Significant Difference (HSD) criterion. The multiple comparison results are shown in Figures 1 and 2, where each sub-figure displays each group mean as a symbol (◦) with a 95% confidence interval around it. Two means are significantly different if their intervals are disjoint, and not significantly different if their intervals overlap. Matlab was used to perform the ANOVA and multiple comparisons presented in this work. Based on the multiple comparison results, we can conclude the following:
• For Factor A, LR performed best, regardless of which data sets are used to build the classification models. KNN, MLP, and NB produced moderate performance, and SVM performed poorly.
• For Factor B, no backward elimination with tolerance parameter 1.0E-03 performed better than full backward elimination.
• For the interaction A×B, 25 groups produced by combining the five learners with the five rankers are presented. The differences among the learners (Factor A) can be seen at every level of Factor B; for example, LR performed better than the other learners when the no-backward-elimination ranker with tolerance parameter 1.0E-03 is used to select features.
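Although the statistical analysis in this study was done in Matlab, the same two-way ANOVA and Tukey HSD comparison can be reproduced with standard libraries. The sketch below is a hedged Python/statsmodels analogue, not part of the original study; it assumes a long-format table of AUC values with 'learner' and 'ranker' columns, one row per evaluation.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def two_way_anova_with_tukey(results: pd.DataFrame):
    """Two-way ANOVA (learner x ranker) on AUC, followed by Tukey HSD.

    `results` is assumed to have columns 'auc', 'learner', and 'ranker',
    with one row per cross-validation evaluation.
    """
    # Factor A = learner, Factor B = ranker, plus their interaction A x B.
    model = smf.ols("auc ~ C(learner) * C(ranker)", data=results).fit()
    anova_table = sm.stats.anova_lm(model, typ=2)

    # Tukey HSD multiple comparisons for each main factor.
    tukey_learner = pairwise_tukeyhsd(results["auc"], results["learner"])
    tukey_ranker = pairwise_tukeyhsd(results["auc"], results["ranker"])
    return anova_table, tukey_learner, tukey_ranker
```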

TABLE II
CLASSIFICATION RESULTS: NB, MLP, AND KNN LEARNERS
(No-03 / No-10: no backward elimination with tolerance 1.0E-03 / 1.0E-10; Full: full backward elimination with tolerance 1.0E-03)

Data Set        NB/No-03  NB/No-10  NB/Full   MLP/No-03  MLP/No-10  MLP/Full  KNN/No-03  KNN/No-10  KNN/Full
SP2             0.7271    0.7412    0.7336    0.7660     0.7626     0.7645    0.7498     0.7361     0.7263
SP3             0.7227    0.7146    0.7255    0.7104     0.7035     0.7092    0.7445     0.7289     0.7273
SP4             0.6741    0.6612    0.6706    0.7520     0.7442     0.7406    0.7750     0.7497     0.7575
Eclipse2.0-10   0.8786    0.8787    0.8602    0.8646     0.8645     0.8504    0.8694     0.8696     0.8780
Eclipse2.0-5    0.8594    0.8587    0.8776    0.8713     0.8701     0.8872    0.8541     0.8538     0.8665
Eclipse2.0-3    0.8017    0.8017    0.8136    0.7970     0.7970     0.8020    0.7844     0.7844     0.8100
Eclipse2.1-5    0.8496    0.8496    0.8249    0.8105     0.8105     0.8129    0.8547     0.8547     0.8578
Eclipse2.1-4    0.8049    0.8055    0.8076    0.8252     0.8263     0.8409    0.8416     0.8418     0.8493
Eclipse2.1-2    0.8155    0.8156    0.8005    0.8542     0.8543     0.8434    0.8019     0.8010     0.8302
Eclipse3.0-10   0.8870    0.8868    0.8911    0.8535     0.8535     0.8673    0.8715     0.8710     0.8926
Eclipse3.0-5    0.8868    0.8868    0.8926    0.9094     0.9094     0.8978    0.8783     0.8783     0.9115
Eclipse3.0-3    0.8584    0.8593    0.8489    0.8913     0.8919     0.8851    0.8415     0.8412     0.8719
KC1-20          0.7285    0.7283    0.7105    0.7423     0.7415     0.7504    0.7411     0.7370     0.7734
KC1-10          0.7373    0.7363    0.7386    0.7161     0.7134     0.7186    0.7286     0.7273     0.7900
KC1-5           0.8647    0.8647    0.8711    0.7617     0.7630     0.7425    0.8136     0.8136     0.8384
Average         0.8064    0.8059    0.8045    0.8084     0.8070     0.8075    0.8100     0.8059     0.8254

TABLE III
CLASSIFICATION RESULTS: SVM AND LR LEARNERS
(column labels as in Table II)

Data Set        SVM/No-03  SVM/No-10  SVM/Full  LR/No-03  LR/No-10  LR/Full
SP2             0.6444     0.6052     0.5979    0.7803    0.7803    0.7800
SP3             0.6019     0.6128     0.5739    0.7401    0.7359    0.7139
SP4             0.6179     0.6356     0.6049    0.7599    0.7488    0.7503
Eclipse2.0-10   0.8737     0.8743     0.8582    0.8546    0.8557    0.8398
Eclipse2.0-5    0.8987     0.8982     0.9102    0.8886    0.8878    0.9017
Eclipse2.0-3    0.8363     0.8363     0.8533    0.8323    0.8323    0.8487
Eclipse2.1-5    0.8559     0.8560     0.8367    0.8504    0.8504    0.8277
Eclipse2.1-4    0.8490     0.8486     0.8499    0.8334    0.8325    0.8536
Eclipse2.1-2    0.8788     0.8787     0.8672    0.8742    0.8742    0.8595
Eclipse3.0-10   0.8903     0.8903     0.8897    0.8825    0.8825    0.8874
Eclipse3.0-5    0.9337     0.9337     0.9247    0.9292    0.9292    0.9216
Eclipse3.0-3    0.9098     0.9097     0.9002    0.9151    0.9151    0.9106
KC1-20          0.7632     0.7623     0.7673    0.7467    0.7465    0.7563
KC1-10          0.7393     0.7394     0.7417    0.7167    0.7188    0.7273
KC1-5           0.8086     0.8085     0.8109    0.7617    0.7617    0.7147
Average         0.8068     0.8060     0.7991    0.8244    0.8234    0.8195

TABLE IV
TWO-WAY ANALYSIS OF VARIANCE

(a) LLTS
Source   Sum Sq.   d.f.   Mean Sq.   F        p-value
A        2.16446   4      0.54111    499.23   0
B        0.01316   4      0.00329    3.03     0.017
A×B      0.01809   16     0.00113    1.04     0.4081
Error    0.78582   725    0.00108
Total    2.98153   749

(b) Fifteen Data Sets
Source   Sum Sq.   d.f.   Mean Sq.   F      p-value
A        0.1738    4      0.04346    6.79   0
B        0.0028    4      0.0007     0.11   0.9795
A×B      0.0486    16     0.00304    0.47   0.9598
Error    23.8446   3725   0.0064
Total    24.0699   3749

F. Threats to Validity

A typical software development project is very human intensive, which can affect many aspects of the development process, including software quality and defect occurrence. Consequently, software engineering research that uses controlled experiments to evaluate the usefulness of empirical models is not practical. Experimental research commonly includes a discussion of two types of threats to validity.

In an empirical software engineering effort, threats to external validity are conditions that limit the generalization of case study results. The analysis and conclusions presented in this article are based on the metrics and defect data obtained from 15 data sets of three software projects. The benchmark WEKA data mining tool was used for feature selection (the SVM ranker) and for all learners, and all of the parameter settings have been reported in this work so that the experiments can be repeated. The parameters for the SVM rankers were chosen to ensure good performance in many different circumstances and to be reasonable for the imbalanced data sets. Experimentation with different settings for toleranceParameter provides guidance to the research community on the recommended value for that parameter. Our comparative analysis can easily be applied to another software system. Moreover, since all our final conclusions are based on ten runs of five-fold cross-validation and statistical tests for significance, our findings rest on sound methods.

Threats to internal validity are unaccounted-for influences on the experiments that may affect case study results. Poor fault-proneness estimates can be caused by a wide variety of factors, including measurement errors while collecting and recording software metrics, modeling errors due to the unskilled use of software applications, errors in model selection during the modeling process, and the presence of outliers and noise in the training data set. Measurement errors are inherent to the data collection effort. In our experiments we use 15 real-world imbalanced software data sets, which greatly enhances the reliability of our conclusions. Performing numerous repetitions of cross-validation greatly reduces the likelihood of anomalous results due to a lucky or unlucky partition of the data. Moreover, the experiments and statistical analysis were performed by a single skilled person in order to keep modeling errors to a minimum.

Fig. 1. LLTS: Multiple Comparison in Terms of AUC. (a) Factor A; (b) Factor B; (c) Factor A×B.

Fig. 2. Fifteen Data Sets: Multiple Comparison in Terms of AUC. (a) Factor A; (b) Factor B; (c) Factor A×B.

IV. RELATED WORK

The main goal of feature selection is to select a subset of features that minimizes the prediction errors of classifiers. Feature selection has been applied in many data mining and machine learning applications. A good overview of feature selection was provided by Guyon and Elisseeff [17], who outlined the key approaches used for attribute selection, including feature construction, feature ranking, multivariate feature selection, efficient search methods, and feature validity assessment methods. Liu and Yu [18] provided a comprehensive survey of feature selection algorithms and presented an integrated approach to intelligent feature selection. In this paper, we examine a feature ranking technique, the SVM ranker.

Mladenić et al. [19] investigated three feature weighting methods and concluded that feature selection using weights from linear SVMs yields better classification performance than other feature weighting methods when combined with the three learning algorithms they explored (NB, Perceptron, and SVM).

Chang and Lin [20] showed that linear SVMs with simple feature rankings are effective on the data sets of the Causality Challenge [21], whose goal is to investigate problems in which the training and testing sets might have different class distributions. The SVM ranker has recently been used in the bioinformatics domain [22], [23]. Canul-Reich et al. [22] compared a feature perturbation method to SVM-RFE, where 50% of the features were removed at every iteration until 10% of the initial features remained; their results demonstrate that the perturbation method outperformed SVM-RFE for most sets of features. Abeel et al. [23] studied an ensemble feature selection method for biomarker identification, building an ensemble of a single feature selection method (SVM-RFE, with 20% of the features removed at each iteration). Their experimental results show that the ensemble

feature selection method improves classification performance and biomarker stability.

We also note that although feature selection has been widely applied in many application domains for many years, its applications in the software quality and reliability engineering domain are limited. Chen et al. [24] studied the application of wrapper-based feature selection in the context of software cost/effort estimation and concluded that the reduced data set improved the estimation. Rodriguez et al. [3] evaluated three filter-based and three wrapper-based models on software metrics and defect data sets, concluding that wrappers were better than filters but at a high computational cost.

V. CONCLUSION

This work presented a comprehensive experimental analysis of the performance of linear SVM feature ranking (the SVM ranker) with different percentToEliminatePerIteration and toleranceParameter settings. Much of the related work on the SVM ranker has not focused on selecting reasonable values for these two important parameters; in most previous studies, researchers and practitioners simply used particular values for percentToEliminatePerIteration and toleranceParameter without much supporting evidence. Our analysis provides guidelines to researchers and practitioners in data mining and machine learning to select a value of 100 for percentToEliminatePerIteration and 1.0E-03 for toleranceParameter when selecting a subset of features.

In the experiments, we first set percentToEliminatePerIteration to 100 (no backward elimination) and considered four different toleranceParameter values (1.0E-03, 1.0E-05, 1.0E-07, and 1.0E-10). We then set percentToEliminatePerIteration to 0, used the default value of 1 for attsToEliminatePerIteration (full backward elimination), and set toleranceParameter to 1.0E-03. In total, five SVM rankers are considered. Our experimentation on the SVM rankers was done in WEKA using 15 data sets from three real-world software projects with varying degrees of class imbalance. Classification models were built with the selected features using five commonly used learners: naïve Bayes (NB), multilayer perceptron (MLP), K-nearest neighbors (KNN), support vector machines (SVM), and logistic regression (LR). These models are used to identify faulty software modules.

The conclusions of our experimentation are that the default settings in WEKA for percentToEliminatePerIteration (0) and toleranceParameter (1.0E-10) are not appropriate values, and that 100 for percentToEliminatePerIteration and 1.0E-03 for toleranceParameter are more reasonable. This work further shows the robustness of the LR learner when compared to the four other learners. Thorough experimentation makes our work comprehensive and strengthens the reliability of our conclusions. Future work may include experiments using additional data sets from the software engineering domain as well as from other application domains. In addition, different data sampling techniques can be considered in the feature selection process for imbalanced data.

REFERENCES

[1] K. Gao, T. M. Khoshgoftaar, and H. Wang, "An empirical investigation of filter attribute selection techniques for software quality classification," in Proceedings of the 10th IEEE International Conference on Information Reuse and Integration, Las Vegas, Nevada, August 10-12, 2009, pp. 272-277.

[2] H. Wang, T. M. Khoshgoftaar, and A. Napolitano, "A comparative study of ensemble feature selection techniques for software defect prediction," in Proceedings of the Ninth International Conference on Machine Learning and Applications, Washington, DC, USA, December 12-14, 2010, pp. 135-140.
[3] D. Rodriguez, R. Ruiz, J. Cuadrado-Gallego, and J. Aguilar-Ruiz, "Detecting fault modules applying feature selection to classifiers," in Proceedings of the 8th IEEE International Conference on Information Reuse and Integration, Las Vegas, Nevada, August 13-15, 2007, pp. 667-672.
[4] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. Morgan Kaufmann, 2005.
[5] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, "Gene selection for cancer classification using support vector machines," Machine Learning, vol. 46, pp. 389-422, March 2002.
[6] H. Wang, T. M. Khoshgoftaar, K. Gao, and N. Seliya, "High-dimensional software engineering data and feature selection," in Proceedings of the 21st IEEE International Conference on Tools with Artificial Intelligence, Newark, NJ, USA, November 2-5, 2009, pp. 83-90.
[7] H. Liu, J. Li, and L. Wong, "A comparative study on feature selection and classification methods using gene expression profiles and proteomic patterns," Genome Informatics, vol. 13, pp. 51-60, 2002.
[8] T. Zimmermann, R. Premraj, and A. Zeller, "Predicting defects for Eclipse," in ICSEW '07: Proceedings of the 29th International Conference on Software Engineering Workshops. Washington, DC, USA: IEEE Computer Society, 2007, p. 76.
[9] A. G. Koru, D. Zhang, K. E. Emam, and H. Liu, "An investigation into the functional form of the size-defect relationship for software modules," IEEE Transactions on Software Engineering, vol. 35, no. 2, pp. 293-304, 2009.
[10] G. H. John and P. Langley, "Estimating continuous distributions in Bayesian classifiers," in Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, vol. 2, San Mateo, 1995, pp. 338-345.
[11] S. Haykin, Neural Networks: A Comprehensive Foundation, 2nd ed. Prentice-Hall, 1998.
[12] D. W. Aha, D. Kibler, and M. K. Albert, "Instance-based learning algorithms," Machine Learning, vol. 6, no. 1, pp. 37-66, January 1991.
[13] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.
[14] S. Le Cessie and J. C. Van Houwelingen, "Ridge estimators in logistic regression," Applied Statistics, vol. 41, no. 1, pp. 191-201, 1992.
[15] Y. Jiang, J. Lin, B. Cukic, and T. Menzies, "Variance analysis in software fault prediction models," in Proceedings of the 20th IEEE International Conference on Software Reliability Engineering, 2009, pp. 99-108.
[16] T. M. Khoshgoftaar, M. Golawala, and J. Van Hulse, "An empirical study of learning from imbalanced data using random forest," in Proceedings of the 19th IEEE International Conference on Tools with Artificial Intelligence, vol. 2, Washington, DC, USA, 2007, pp. 310-317.
[17] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," Journal of Machine Learning Research, vol. 3, pp. 1157-1182, March 2003.
[18] H. Liu and L. Yu, "Toward integrating feature selection algorithms for classification and clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 4, pp. 491-502, 2005.
[19] D. Mladenić, J. Brank, M. Grobelnik, and N. Milic-Frayling, "Feature selection using linear classifier weights: Interaction with classification models," in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, NY, USA: ACM, 2004, pp. 234-241.
[20] Y.-W. Chang and C.-J. Lin, "Feature ranking using linear SVM," in JMLR Workshop and Conference Proceedings, vol. 3, June 1-6, 2008, pp. 53-64.
[21] I. Guyon, C. Aliferis, G. Cooper, A. Elisseeff, J.-P. Pellet, P. Spirtes, and A. Statnikov, "Design and analysis of the causation and prediction challenge," in JMLR Workshop and Conference Proceedings, vol. 3, June 1-6, 2008, pp. 1-33.
[22] J. Canul-Reich, L. Hall, D. Goldgof, and S. Eschrich, "Feature selection for microarray data by AUC analysis," in IEEE International Conference on Systems, Man and Cybernetics, 2008, pp. 768-773.
[23] T. Abeel, T. Helleputte, Y. Van de Peer, P. Dupont, and Y. Saeys, "Robust biomarker identification for cancer diagnosis with ensemble feature selection methods," Bioinformatics, vol. 26, pp. 392-398, February 2010.
[24] Z. Chen, T. Menzies, D. Port, and B. Boehm, "Finding the right data for software cost modeling," IEEE Software, vol. 22, pp. 38-46, 2005.