Over-Sampling via Under-Sampling in Strongly Imbalanced Data

Rozita Jamili Oskouei (1) and Bahram Sadeghi Bigham (2, corresponding author)
Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran
[email protected] (1), [email protected] (2)

Abstract. Classification of imbalanced data sets is an important challenge in machine learning. A data set is called imbalanced whenever one of its classes is much smaller than the others. On such data sets, standard algorithms and efficiency criteria favour the majority class and ignore the minority class. The minority class, however, often contains important information, and discarding it can distort the overall results, especially in domains such as medical data where the accuracy of the results is critical. Approaches for solving this problem are therefore necessary, and one of the best is re-sampling, which usually runs as an extra preprocessing step and has two main forms: over-sampling and under-sampling. This investigation analyses the effect of the imbalance ratio and the selected classifier on several re-sampling strategies for dealing with imbalanced data sets. We applied two different classifiers (J48 and Naïve Bayes), five resampling configurations (Org, SMOTE, Borderline SMOTE, OSS and NCL) and four performance assessment measures (TPrate, TNrate, G-mean and AUC) to 13 real data sets. Our experimental results show that when data sets are strongly imbalanced, over-sampling methods are more efficient than under-sampling methods. Moreover, the results indicate that, for imbalanced data at any level, applying resampling techniques is preferable, and that the choice of classifier has very little influence on the effectiveness of the resampling strategies.

Keywords: Classification, Imbalanced Data, Minority Class, Re-Sampling, Over-Sampling, Under-Sampling.

Biographical notes: Rozita Jamili Oskouei is an Assistant Professor at the Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences (IASBS), Zanjan, Iran. Her research areas are data mining, e-learning, Intelligent Transportation Systems (ITS) and social network analysis. Bahram Sadeghi Bigham is an Assistant Professor at the same department. His research interests are data mining, robot motion planning, computational geometry and its applications, algorithms and graphs.

1 Introduction

Nowadays, handling imbalanced data is one of the most important problems in machine learning and artificial intelligence. The problem arises in the knowledge-extraction step of data mining whenever a data set contains two classes known as the majority and the minority class [1]. Usually the minority class is the more important one, and every single entity of this class is valuable. Unfortunately, the learning algorithms and evaluation criteria proposed and in current use favour the majority class [2, 3] and tend to ignore the data of the minority class. Almost all of them can reach an acceptable error rate simply by ignoring the minority class altogether. For example, assume that the sizes of the majority and minority classes are 9 and 1, respectively. An algorithm that ignores every sample of the minority class and processes only the majority class, however efficiently, still obtains 90% accuracy. The minority and majority classes are commonly referred to as the positive and negative samples, respectively.

Many papers try to handle this problem, either by developing a new algorithm or by applying re-sampling. This paper focuses on re-sampling methods for classification, in which the learning process runs on an updated, balanced version of the data set. The accuracy of a classifier on a specific test set is the fraction of test tuples that are classified correctly: for each test tuple, the class label predicted by the learned classifier is compared with the true label. An imbalanced data set is a data set whose classes differ greatly in size. In general, imbalance can occur with more than two classes, although most researchers consider only the two-class case with one positive and one negative set.
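The 90% figure in the example above can be reproduced in a few lines. This is a toy illustration of our own, not part of the paper's experiments; the class sizes follow the 9:1 example in the text.

```python
# A classifier that ignores the minority class entirely: it always
# predicts the majority (negative) label, yet scores 90% accuracy
# on a 9:1 imbalanced test set.
labels = ["neg"] * 90 + ["pos"] * 10   # 9:1 class distribution
predictions = ["neg"] * 100            # always predict the majority class

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(accuracy)  # 0.9 -- looks acceptable, yet every positive sample is missed

# Recall on the minority (positive) class exposes the failure:
tp = sum(p == "pos" and y == "pos" for p, y in zip(predictions, labels))
pos_recall = tp / labels.count("pos")
print(pos_recall)  # 0.0
```

This is why accuracy alone is a misleading criterion on imbalanced data, motivating the TPrate and TNrate measures used later in the paper.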


In this investigation, we focus on two subsets: one is very small and is taken as the positive or minority class, and the other is taken as the negative or majority class. We applied several re-sampling algorithms to 13 real, standard imbalanced data sets. J48 and Naïve Bayes are used as classifiers, and the methods are compared using the two accuracy criteria TPrate and TNrate.

The paper is organized in six sections. Section 2 presents basic concepts related to this research. Section 3 reviews related work. Section 4 introduces our data collection. Section 5 presents our experimental results and compares them with other researchers' results. Finally, Section 6 concludes the paper and introduces our future work.

2 Basic Concepts

2.1 Data Mining Concepts

Data mining is the process of extracting knowledge and interesting patterns from huge amounts of data. It is the analysis of (often large) observational datasets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to data owners [4, 5]. In other words, data mining is the main part of the Knowledge Discovery in Databases (KDD) process, which includes data selection, preprocessing, transformation, data mining or pattern recognition, and interpretation or evaluation. Various data mining techniques are available for extracting information and creating classification, clustering or association rules.

2.2 Balanced vs. Imbalanced Data

Whenever each class contains an equal number of records, we have a balanced data set. For example, if a data set contains equal numbers of female and male records, it is a balanced data set with two subsets. In an imbalanced data set, the classes contain different numbers of records: some classes have many elements, while another may have very few [5]. The imbalanced learning problem is the low performance of learning algorithms caused by under-represented data and severe class distribution skews. Because of the inherently complex characteristics of imbalanced data sets, learning from such data requires new principles, understandings, tools and algorithms to transform vast amounts of raw data efficiently into information and knowledge [6].
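The balanced/imbalanced distinction above can be made concrete with a small check of our own (a sketch, not part of the paper's method); the female/male example mirrors the one in the text.

```python
from collections import Counter

def class_distribution(y):
    """Return per-class counts and the ratio of the largest to the
    smallest class size (1.0 means perfectly balanced)."""
    counts = Counter(y)
    return counts, max(counts.values()) / min(counts.values())

# Balanced: equal numbers of records in each class.
balanced_counts, balanced_ratio = class_distribution(["f", "m", "f", "m"])
print(balanced_ratio)  # 1.0

# Imbalanced: one class dwarfs the other.
imb_counts, imb_ratio = class_distribution(["neg"] * 90 + ["pos"] * 10)
print(imb_ratio)  # 9.0
```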


2.3 Over-Sampling vs. Under-Sampling

Over-sampling is a method for increasing the size of the minority class, typically via random sampling [2]. Random over-sampling is a non-heuristic method that attempts to balance the class distribution through random replication of positive samples. Since it makes several copies of minority-class samples, it can cause over-fitting. SMOTE is used here as an over-sampling algorithm. The other type of sampling is under-sampling [15], which attempts to balance the data set by randomly removing negative samples. It is one of the most effective re-sampling methods, but its main problem is that it can remove data that are important or necessary for the classification process.
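A minimal sketch of the three ideas just described. The `smote_like` function is a simplified stand-in of our own: the published SMOTE interpolates toward one of the k nearest minority neighbours, whereas this toy version interpolates between a random pair of minority points.

```python
import random

def random_oversample(minority, target_size, rng):
    """Random over-sampling: replicate minority samples until target_size.
    Replicating identical points is what risks over-fitting."""
    extra = [rng.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

def random_undersample(majority, target_size, rng):
    """Random under-sampling: keep a random subset of the majority class.
    The discarded points may carry information the classifier needs."""
    return rng.sample(majority, target_size)

def smote_like(minority, n_new, rng):
    """Simplified SMOTE-style step: create synthetic points between a
    random pair of minority samples (real SMOTE uses k nearest neighbours)."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)
        gap = rng.random()  # position along the segment from a to b
        synthetic.append(tuple(ai + gap * (bi - ai) for ai, bi in zip(a, b)))
    return minority + synthetic

rng = random.Random(0)
minority = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
majority = [(5.0, float(i)) for i in range(9)]

over = random_oversample(minority, 9, rng)
under = random_undersample(majority, 3, rng)
aug = smote_like(minority, 6, rng)
print(len(over), len(under), len(aug))  # 9 3 9
```

Note the trade-off visible even in the sketch: over-sampling only adds points, while under-sampling irreversibly drops six of the nine majority samples.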

3 Related Works

Research in this field can be organized into three groups. The first group includes methods that focus on both the data and the algorithm [7, 8]. The second group focuses on approaches that try to improve classifier efficiency [9, 10]. Research in the third group discusses the effect of imbalance on the efficiency of the algorithms [11, 12]. Many new ideas in the first group work on the data sets and on improving the algorithms: they change the data set by re-sampling so that the two classes (negative and positive) end up with almost the same size. Hulse and Khoshgoftaar [13] have shown that the efficiency of the algorithms depends on the classifier and on the imbalance rate, and other papers have confirmed the same results [14-16]. There is also evidence that supervised classifiers can perform very badly on imbalanced data [7].

The simplest way to re-sample is to add new positive data to the data set randomly. This way no information is lost, but the time and space complexity of the algorithm increases; moreover, because existing positive samples are replicated, over-fitting can occur. The next simple method is to remove some data from the negative (majority) class [1]. Its main weak point is that removing data loses some useful information, but it is still a useful approach [17].


4 Our Data Set Collection

We used 13 standard real data sets from the UCI machine learning repository [18]. Some brief comments about these data sets are given in Table 1.

Table 1: Our Data Set Resources

Data Set      Attributes  Samples  Classes  N+   N-    IR
Pima          8           768      2        268  500   1.8
Breast        9           699      2        241  458   1.9
German        10          1000     2        300  700   2.3
Haberman      3           306      2        81   235   2.7
Vehicle       18          376      4        96   280   2.9
New-Thyroid   5           215      3        30   185   6.1
Opdigits      64          3823     10       382  3441  9.0
Pendigits     16          3498     10       335  3163  9.4
SatImage      36          4435     7        415  4020  9.6
Glass         9           214      7        17   197   11.5
Ecoli         7           336      8        20   316   15.8
Letter-A      16          229      26       10   219   21.9

It is evident from Table 1 that each data set has two classes, the majority and the minority class. Further, all the data sets are divided into two groups: strongly imbalanced and normally imbalanced. SatImage, Letter-A, Glass, Pendigits, Opdigits, Ecoli and Yeast are the strongly imbalanced data sets (IR of about 9 or more). The normally imbalanced data sets (IR less than 9) in our research are Pima, Vehicle, Haberman, German, Breast and New-Thyroid. The imbalance ratio is defined between the majority and the minority class of each data set, as shown in the following equation:

IR = N- / N+

N+: number of samples belonging to the minority class
N-: number of samples belonging to the majority class

Regarding the resources mentioned above, the following details are worth noting:

- Pima: Indian diabetic patients database
- Breast: database of patients with breast cancer detected by mammography

- German: database of German bank customers
- Haberman: Chicago hospital's database of recurrence in patients who had breast cancer and were treated by surgery
- Vehicle: database of four cars (Van, Bus, Saab and Opel)
- New-Thyroid: database of thyroid patients
- Opdigits: dataset of handwritten digits recognized from bitmaps
- Pendigits: dataset of handwritten digits written with a pen by 44 authors
- SatImage: dataset from NASA
- Glass: dataset related to criminology
- Ecoli: dataset of protein localization sites
- Letter-A: dataset for detecting letters from a set of characters
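The imbalance ratio IR = N-/N+ defined above can be checked directly against Table 1. A small sketch of our own; the (N+, N-) pairs below are copied from the table.

```python
def imbalance_ratio(n_neg, n_pos):
    """IR = N- / N+ (majority count divided by minority count)."""
    return n_neg / n_pos

# (N+, N-) pairs taken from Table 1.
datasets = {"Haberman": (81, 235), "Ecoli": (20, 316), "Letter-A": (10, 219)}
for name, (n_pos, n_neg) in datasets.items():
    print(name, round(imbalance_ratio(n_neg, n_pos), 1))
# Haberman 2.9, Ecoli 15.8, Letter-A 21.9 -- matching the IR column
```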

4.1 Efficiency Criteria for Imbalanced Data Sets

Evaluating the performance of classifiers plays an important role in machine learning: choosing good evaluation criteria is as important as choosing an efficient algorithm. Traditionally, accuracy and error are the main factors used to test the efficiency of such systems. Assume that TN is the number of negative samples that are classified correctly (True Negatives) and FP is the number of negative samples that are classified incorrectly, i.e. assigned to the positive class by mistake (False Positives). Likewise, FN is the number of positive samples assigned to the negative class by mistake (False Negatives), and TP is the number of positive samples classified correctly (True Positives). We use two further evaluation factors here, TPrate = TP / (TP + FN) and TNrate = TN / (TN + FP).

All the data we used are real data from the UCI repository. For re-sampling, we perform both over-sampling and under-sampling, and the decision tree J48 is used as the classifier. We work on data sets with two classes, so when a data set has more than two classes, we keep one minor class as the positive class and merge the others into one bigger group.

4.2 Differences between Our Work and Other Papers

Several research works have studied the effect of applying the decision tree (C4.5) algorithm to real and artificial data sets and recorded the error ratio [3, 16]. Barandela et al. [15] compared several re-sampling techniques with under-sampling based on heuristic intelligent methods. Their experiments were limited to at most 5 real data sets, and they used nearest neighbours for classification and the geometric mean as the performance evaluation measure.
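The evaluation criteria of Section 4.1 can be written out concretely. This is a sketch of our own with a hypothetical confusion matrix; the counts below are illustrative, not taken from the paper's experiments.

```python
def rates(tp, fn, tn, fp):
    """TPrate (sensitivity) and TNrate (specificity) from confusion counts,
    plus the geometric mean of the two (the G-mean measure)."""
    tp_rate = tp / (tp + fn)   # fraction of positive samples classified correctly
    tn_rate = tn / (tn + fp)   # fraction of negative samples classified correctly
    g_mean = (tp_rate * tn_rate) ** 0.5
    return tp_rate, tn_rate, g_mean

# Hypothetical confusion matrix for a 1:9 imbalanced test set of 100 samples.
tp_rate, tn_rate, g_mean = rates(tp=8, fn=2, tn=81, fp=9)
print(tp_rate)           # 0.8
print(tn_rate)           # 0.9
print(round(g_mean, 3))  # 0.849
```

Unlike plain accuracy, the G-mean collapses to zero whenever either class is entirely misclassified, which is why it is a common summary measure for imbalanced problems.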


5 Experimental Results

After dividing the data sets into two separate groups, we apply the re-sampling methods (both over-sampling and under-sampling) Org, SMOTE, Borderline SMOTE, OSS and NCL, then run the classifier on the resulting data and calculate TPrate and TNrate. Table 2 shows the TPrate for the six data sets with an almost balanced ratio; Table 3 shows the corresponding TNrate. As can be seen in Tables 2 and 3, the results yielded by the re-sampling techniques are very similar to the optimal one. In the next step, we calculate the results for the data sets with high imbalance; these results are shown in Tables 4 and 5.

Table 2: TPrate, Data Sets with Less Imbalance

                  Pima   Breast  German  Haberman  Vehicle  New-Thyroid
Optimal           1      1       1       1         1        1
Org               0.37   0.958   0.45    0         0.95     0.833
SMOTE             0.765  0.938   0.6     0.647     0.95     0.833
Borderline SMOTE  0.537  0.979   0.517   0.647     0.9      0.667
OSS               0.778  0.958   0.85    0.706     1        1
NCL               0.815  0.979   0.733   0.333     0.9      0.833

Table 3: TNrate, Data Sets with Less Imbalance

                  Pima   Breast  German  Haberman  Vehicle  New-Thyroid
Optimal           1      1       1       1         1        1
Org               0.82   0.946   0.857   1         0.911    1
SMOTE             0.704  0.938   0.6     0.647     0.95     0.833
Borderline SMOTE  0.72   0.957   0.793   0.822     0.964    1
OSS               0.6    0.913   0.621   0.489     0.768    0.838
NCL               0.65   0.46    0.636   0.706     0.911    1

Table 4: TPrate, Data Sets with Strongly Imbalanced Rate

                  Opdigits  Pendigits  SatImage  Glass  Ecoli  Letter-A  Yeast
Optimal           1         1          1         1      1      1         1
Org               0.711     0.836      0.807     0.333  0.571  0.5       0.244
SMOTE             0.855     0.925      0.87      0.667  0.714  0.5       0.674
Borderline SMOTE  0.855     0.896      0.85      0.994  0.714  0.5       0.628
OSS               0.855     0.94       0.894     1      0.714  0.5       0.779
NCL               0.697     0.836      0.889     0.333  0.429  0.5       0.779

Table 5: TNrate, Data Sets with Strongly Imbalanced Rate

                  Opdigits  Pendigits  SatImage  Glass  Ecoli  Letter-A  Yeast
Optimal           1         1          1         1      1      1         1
Org               0.97      0.994      0.944     0.925  0.984  0.977     0.957
SMOTE             0.983     0.983      0.932     0.825  0.852  0.977     0.763
Borderline SMOTE  0.975     0.997      0.929     0.994  0.918  1         0.791
OSS               0.962     0.829      0.885     0.926  0.852  0.886     0.602
NCL               0.975     1          0.916     0.9    0.852  0.977     0.578

Comparing the results of Tables 4 and 5 with those of Tables 2 and 3, we can see that over-sampling methods perform better on the strongly imbalanced data sets, while there is no meaningful difference between the two kinds of methods on the normally imbalanced data sets. All of these results were obtained with the help of KEEL, SPSS and WEKA (one of these tools was used, depending on the format of the data set). Finally, we have 11 models for each performance assessment measure. For each measure, a table with 11 rows and one column per data set has been built: entry (I, J) of this table holds the value obtained by model I on data set J.
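The bookkeeping just described, one table per measure with entry (I, J) holding the value of model I on data set J, can be sketched as a nested mapping. The TPrate numbers below are copied from a subset of Table 4; the best-model selection at the end is our own illustration, not an analysis from the paper.

```python
# TPrate values copied from Table 4 (three of the strongly imbalanced sets).
tp_rate = {
    "Org":              {"Opdigits": 0.711, "Glass": 0.333, "Yeast": 0.244},
    "SMOTE":            {"Opdigits": 0.855, "Glass": 0.667, "Yeast": 0.674},
    "Borderline SMOTE": {"Opdigits": 0.855, "Glass": 0.994, "Yeast": 0.628},
    "OSS":              {"Opdigits": 0.855, "Glass": 1.0,   "Yeast": 0.779},
    "NCL":              {"Opdigits": 0.697, "Glass": 0.333, "Yeast": 0.779},
}

# Entry (I, J): value obtained by model I on data set J.
print(tp_rate["SMOTE"]["Yeast"])  # 0.674

# Best model per data set under this measure (ties broken by table order).
for ds in ["Opdigits", "Glass", "Yeast"]:
    best = max(tp_rate, key=lambda m: tp_rate[m][ds])
    print(ds, best)
```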


6 Conclusion and Future Works

One of the main problems in machine learning is the classification of imbalanced data sets. Such data can be seen very frequently in many applications. In fact, the most important group of data in these applications is the minority group (e.g. the cancer samples among all the patients), but unfortunately the algorithms and evaluation criteria often ignore the minority class. To solve the problem, we have to balance the data set so as to reach a new data set in which the majority and minority classes have almost the same size. One can either increase the size of the positive class (over-sampling) or decrease the number of negative entities (under-sampling). In this paper, several re-sampling algorithms were run on 13 real, standard data sets, all of them imbalanced. J48 and Naïve Bayes were used as classifiers, and the methods were compared using the two accuracy criteria TPrate and TNrate. Our experimental results show that:

- J48 had a higher accuracy ratio than Naïve Bayes.
- With a strongly high imbalance ratio, selecting more samples from the total data collection (over-sampling) works better than selecting fewer samples (under-sampling). The reason is that under-sampling may discard many negative samples in order to balance the sizes of the two classes, so potentially important information for the learner may be lost.
- Whenever the imbalance rate is low, our results suggest that under-sampling and over-sampling have similar performance.

Our future work will address re-sampling in imbalanced data sets, including:

• Analyzing data sets with special attention to data complexity measures, to discover a dedicated technique for dealing with the imbalanced data issue.
• Extending the results of this paper to data sets with multiple minority classes.
• Applying a cost-sensitive learning approach to the present paper's analysis.

References

1. He, H. & Garcia, E.A. (2009), "Learning from Imbalanced Data", IEEE Transactions on Knowledge and Data Engineering, Vol. 21, No. 9, pp. 1263-1284.
2. Japkowicz, N. & Stephen, S. (2002), "The Class Imbalance Problem: A Systematic Study", Intelligent Data Analysis, Vol. 6, No. 5, pp. 429-449.


3. Liu, J., Hu, Q. & Yu, D. (2008), "A comparative study on rough set based class imbalance learning", Knowledge-Based Systems, Vol. 21, No. 8, pp. 753-763.
4. Fawcett, T. & Provost, F. (1997), "Adaptive fraud detection", Data Mining and Knowledge Discovery, Vol. 1, No. 3, pp. 291-316.
5. Fayyad, U., Piatetsky-Shapiro, G. & Smyth, P. (1996), "From Data Mining to Knowledge Discovery: An Overview", American Association for Artificial Intelligence, pp. 37-58.
6. He, H. & Garcia, E.A. (2009), "Learning from Imbalanced Data", IEEE Transactions on Knowledge and Data Engineering, Vol. 21, No. 9, pp. 1265-1284.
7. Barandela, R., Sanchez, J.S., Garcia, V. & Rangel, E. (2003), "Strategies for learning in class imbalance problems", Pattern Recognition, Vol. 36, No. 3, pp. 849-851.
8. Garcia, S., Derrac, J., Triguero, I., Carmona, C.J. & Herrera, F. (2012), "Evolutionary-based Selection of Generalized Instances for Imbalanced Classification", Knowledge-Based Systems, Vol. 25, No. 1, pp. 3-12.
9. Jin, H. & Ling, C.X. (2005), "Using AUC and Accuracy in Evaluating Learning Algorithms", IEEE Transactions on Knowledge and Data Engineering, Vol. 17, No. 3, pp. 299-310.
10. Ranawana, R. & Palade, V. (2006), "Optimized precision: a new measure for classifier performance evaluation", Proceedings of the IEEE Congress on Computational Intelligence, pp. 2254-2261.
11. Jo, T. & Japkowicz, N. (2004), "Class imbalances versus small disjuncts", ACM SIGKDD Explorations Newsletter, Vol. 6, No. 1, pp. 40-49.
12. Prati, R.C., Batista, G.E.A.P.A. & Monard, M.C. (2004), "Learning with Class Skews and Small Disjuncts", Proceedings of the 17th Brazilian Symposium on Artificial Intelligence, pp. 296-306.
13. Van Hulse, J., Khoshgoftaar, T.M. & Napolitano, A. (2007), "Experimental perspectives on learning from imbalanced data", Proceedings of the 24th International Conference on Machine Learning, pp. 935-942.
14. Batista, G.E.A.P.A., Prati, R.C. & Monard, M.C. (2004), "A study of the behavior of several methods for balancing machine learning training data", ACM SIGKDD Explorations Newsletter, Vol. 6, No. 1, pp. 20-29.
15. Barandela, R., Valdovinos, R.M., Sanchez, J.S. & Ferri, F.J. (2004), "The Imbalanced Training Sample Problem: Under or Over Sampling?", Structural, Syntactic, and Statistical Pattern Recognition, pp. 806-814.
16. Estabrooks, A. & Japkowicz, N. (2004), "A multiple resampling method for learning from imbalanced data sets", Computational Intelligence, Vol. 20, No. 1, pp. 18-36.
17. Garcia, V., Sanchez, J.S. & Mollineda, R.A. (2012), "On the Effectiveness of Preprocessing Methods when Dealing with Different Levels of Class Imbalance", Knowledge-Based Systems, Vol. 25, pp. 13-21.
18. UCI Machine Learning Repository, http://archive.ics.uci.edu/ml (accessed 30 July 2014).
