Cascade Classifiers for Multiclass Problems

Pablo M. Granitto¹, Alejandro Rébola², Ulises Cerviño², Flavia Gasperi¹, Franco Biasioli¹ and H. A. Ceccatto²

¹ Istituto Agrario di S. Michele all'Adige, Via E. Mach 1, 38010 S. Michele a/Adige, Italia
{pablo.granitto,flavia.gasperi,franco.biasioli}@iasma.it

² Instituto de Física Rosario, CONICET-UNR, Bvd. 27 de Febrero 210 Bis, 2000 Rosario, Argentina
{rebola,cervino,ceccatto}@ifir.edu.ar

Abstract. We discuss a cascade approach to multiclass classification problems that breaks the original task into smaller subproblems in a divide-and-conquer strategy. We use a splitting strategy based on the confusion matrix associated with the first (primary) classifier of the cascade, sorting test samples into independent problems according to the columns of this matrix. We test all possible combinations of three state-of-the-art classification algorithms, using them alternately in the two stages of the method. The performance of all these combined classifiers is evaluated on 7 real-world datasets.

1 Introduction

Multiclass classification problems (polychotomies) are notoriously difficult. The reason can be easily understood by noticing that, for binary problems (dichotomies), the simplest possible classification strategy, i.e., assigning all instances to the most frequent class, leads to a misclassification error smaller than 50%; the same strategy applied to a fairly balanced polychotomy with c classes would instead produce errors of the order of 100(1 - 1/c)% (i.e., approximately 90% for c = 10). This difficulty is particularly acute for problems involving a large number of classes (c equal to a few tens or even hundreds), as is frequently the case in application domains like Bioinformatics and Text Categorization, or in particular tasks like seed identification or phoneme recognition.

Methods for tackling these problems usually rely on the adaptation of classification algorithms originally devised for dichotomies [1]. Using binary strategies like 1-vs.-all or 1-vs.-1, classifiers are conveniently combined to produce the final classification (for instance, by voting or by adding posterior class-membership probabilities). Some algorithms can handle polychotomies directly, or can be extended to do so, although this requires very flexible and powerful methods and, in many cases, a large computational cost.

A different and appealing route to solving large multiclass problems, which has been considered several times in recent years in the specialized literature [2-7], is the

divide-and-conquer strategy. Setting aside sometimes important implementation details, all these works are based on the common idea of splitting polychotomies with large c into a few independent and, in principle, simpler subproblems, each involving a smaller number of classes. Most implementations of this idea make use of a cascade of classifiers, in which the first ("primary") classifier divides the original problem into smaller subproblems, which are in turn solved by specific ("secondary") classifiers. The intuitive idea behind this strategy is illustrated in Fig. 1, where a 5-class problem is decomposed into three smaller subproblems.

Figure 1. Top: original 5-class problem. Bottom: decomposition of this problem into three independent, smaller classification problems.

For this idea to be successful in practice, three main implicit conditions have to be met. First, the classes should not all overlap with each other since, otherwise, there would be no way of splitting the original problem into more or less independent subproblems. This condition is satisfied by most polychotomies, particularly those with a large number of classes. Furthermore, since no classification method can be the best for all problems, we can always restrict ourselves to considering polychotomies that fulfill this condition. Second, the primary classifier that produces the splitting into subproblems should be, for this task, substantially more accurate than the best single monolithic classifier one can apply to the original polychotomy; if possible, no misclassification errors should be committed at this stage, because they cannot be corrected afterwards. This condition is not particularly problematic, and can be fulfilled in a number of ways as discussed below. Finally, the third condition requires the classifiers that solve the independent subproblems to be (on average) more accurate, for their specific tasks, than the best monolithic classifier available for the original polychotomy. This is the most difficult condition since, as we will see empirically, flexible classification methods can treat c-class problems as efficiently as c'-class problems with c' < c. Notice also that, in case computational cost is an issue, a fourth condition has to be considered: training the primary and secondary classifiers should be no more time consuming than training the best monolithic predictor for the whole problem.

In this preliminary work we evaluate a possible way of constructing a cascade classifier, considering all pairwise combinations of three powerful classification methods

interchangeably used as primary and secondary classifiers. The resulting algorithms are tested by applying them to 7 different datasets. In Section 2 we discuss the cascade construction algorithm, introduce the classification methods considered, and describe the datasets used for evaluation. In Section 3 we present the experimental results. Finally, in Section 4 we draw some conclusions and discuss proposals for future work.

2 Cascade Classifier

There are several different ways of splitting a large multiclass problem into more or less independent subproblems. One possibility is to use a k = 1 Nearest Neighbor (1-NN) algorithm as primary classifier to split the original problem into c subproblems, determining within each subproblem the smallest number kj (j = 1,…,c) of nearest neighbors required to guarantee that at least one representative of the correct class is found. Then, the j-th subproblem will require discriminating between all the classes to which the nearest kj instances belong. According to this strategy, a new test sample whose nearest neighbor belongs to class j is first assigned to subproblem j; then, the remaining (kj - 1) nearest neighbors are determined and the final classification is performed by a classifier trained on the classes these kj neighbors belong to.

A different possibility is to use a generative model as primary classifier, and to split the problem according to the largest of the posterior class-membership probabilities Pi (i = 1,…,c) output by this classifier. Once a test sample is classified into, say, subproblem j, the largest probability Pj and the second, third, etc. largest values are added up until their sum exceeds a prescribed value (1 - θj), chosen so as to guarantee that the residual misclassification error can be neglected. Then, a final classifier is trained on the problem defined by the classes whose posterior probabilities had to be added up to reach this threshold.

Notice that in these two strategies one may have to train several different classifiers for each subproblem, although in general they will involve only a few classes. For instance, for the identification of different species of weed seeds (a problem with 236 classes) we found [8] that the class-membership probability strategy splits it into small groups of 4 species each, with 98% correct assignment of examples to these subproblems. Thus, even on-line training of secondary classifiers for these groups becomes feasible.

Another simple method to split a polychotomy is based on the confusion matrix (CM). The entries nij (i,j = 1,…,c) of this matrix give the number of instances observed to be of class i that are classified as belonging to class j. The CM can be reliably determined by k-fold cross validation or n bootstraps (see next section), and the original multiclass problem can be split into c subproblems according to its columns. That is, the j-th subproblem will require discriminating between the classes for which nij/Nj > θj, where Nj = Σi nij and θj is some small fraction of allowed misclassification errors at this stage for subproblem j. This is the method effectively implemented in this work, for simplicity, since it requires training only c secondary classifiers. Notice, however, that these classifiers will in general have to be trained on subproblems containing a larger number of classes than those generated by the two strategies discussed previously.
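To make the splitting rules concrete, the following is a minimal sketch (ours, not code from the paper) of the confusion-matrix splitting just described, together with a helper for the posterior-probability strategy. The function names, the handling of degenerate columns and the decision to always keep class j in its own subproblem are our assumptions.

```python
import numpy as np

def split_by_confusion_matrix(cm, theta=0.0):
    """One subproblem per column of the confusion matrix.

    cm[i, j] = number of instances of true class i predicted as class j.
    Subproblem j keeps every class i with cm[i, j] / N_j > theta, where
    N_j = sum_i cm[i, j].  With theta = 0 (the value used in Section 3)
    any class with at least one instance routed to column j is kept.
    """
    c = cm.shape[0]
    subproblems = []
    for j in range(c):
        n_j = cm[:, j].sum()
        classes = {i for i in range(c) if n_j > 0 and cm[i, j] / n_j > theta}
        classes.add(j)                        # assumption: keep the target class itself
        if len(classes) == c and c > 1:       # all classes present: keep only (c - 1),
            weakest = min(classes - {j}, key=lambda i: cm[i, j])
            classes.discard(weakest)          # dropping the one with the smallest n_ij
        subproblems.append(sorted(classes))
    return subproblems

def classes_by_posterior(probs, theta_j):
    """For one test sample routed to subproblem j, return the smallest set of
    classes whose posterior probabilities, taken in decreasing order, add up
    to more than 1 - theta_j (the alternative splitting strategy above)."""
    order = np.argsort(probs)[::-1]
    kept, total = [], 0.0
    for i in order:
        kept.append(int(i))
        total += probs[i]
        if total > 1.0 - theta_j:
            break
    return kept
```

With θ = 0, subproblem j simply collects every class that the primary classifier ever confuses with class j, which is the setting used in the experiments of Section 3.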

We will use three powerful classification algorithms:

1. Random Forest (RF) [9]: This method grows many classification trees on bootstrap resamples of the original dataset, using random subsets of features to increase diversity among the trees. To classify a new object, each tree gives a classification (a "vote" for that class) and the forest chooses the class having the most votes over all the trees in the forest.
2. Penalized Discriminant Analysis (PDA) [10]: This method is a regularized version of Fisher's classical LDA. It performs a linear regression onto an (explicit) basis expansion and penalizes "rough" coordinates.
3. Support Vector Machines (SVM) [11]: The implementation considered in this work is the widely used "libsvm", with a 1-vs.-1 multiclass classification method.

We have chosen these state-of-the-art classifiers in order to correctly evaluate the practical value of cascade methods by comparison with them. All these algorithms are used alternately as primary or secondary classifiers on the following multiclass datasets:

From repositories
1. Vowel: 990 examples, with 10 classification features and 11 classes. Frequently used dataset from the Carnegie Mellon University Artificial Intelligence repository, associated with speech recognition.
2. Glass: 214 examples, with 9 features and 6 classes. Dataset taken from the University of California at Irvine Machine Learning repository, corresponding to 6 types of glass defined in terms of their oxide content (e.g., Na, Fe, K).

From the agro-alimentary industry
3. Weed seeds: approximately 10,000 examples, with 12 features and 236 classes. Shape, gray-level distribution and texture attributes of seed images of weed species frequently found in Argentinian agriculture.
4. Mirop60: 60 examples, with 35 features and 6 classes. Data from sensory analysis of "Nostrani" cheeses from the Trento province.
5. Grana60: 60 examples, with 30 features and 4 classes. Data from sensory analysis of "Grana" cheeses from northern Italy.
6. MiropPTRMS: 48 examples, with 230 features and 6 classes. Subset of examples from dataset 4, analyzed by PTR-MS (Proton Transfer Reaction Mass Spectrometry). Inputs are concentrations of atomic masses in the cheese "odour".
7. FragolaPTRMS: 233 examples, with 230 features and 9 classes. PTR-MS measurements on 9 experimental varieties of strawberry.
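Before moving on to the experiments, the sketch below shows how the whole cascade could be assembled around the splitting above. It assumes scikit-learn-style estimators (fit/predict), integer class labels 0,…,c-1, and the split_by_confusion_matrix helper sketched earlier; it is our illustration rather than the authors' implementation.

```python
import numpy as np
from sklearn.base import clone

def train_cascade(primary, secondary_proto, X, y, subproblems):
    """Fit the primary on all data and one secondary classifier per column.

    `subproblems[j]` lists the classes retained for column j; columns with a
    single class need no secondary classifier.
    """
    secondaries = {}
    for j, classes in enumerate(subproblems):
        if len(classes) < 2:
            continue
        mask = np.isin(y, classes)
        secondaries[j] = clone(secondary_proto).fit(X[mask], y[mask])
    return primary.fit(X, y), secondaries

def predict_cascade(primary, secondaries, X):
    """The primary routes each sample to a subproblem; the matching
    secondary classifier (if any) gives the final label."""
    route = primary.predict(X)
    y_hat = route.copy()
    for j, clf in secondaries.items():
        mask = route == j
        if mask.any():
            y_hat[mask] = clf.predict(X[mask])
    return y_hat
```

In scikit-learn terms, RandomForestClassifier and the libsvm-based SVC (which also uses 1-vs.-1 for multiclass problems) could stand in for RF and SVM; PDA has no direct equivalent there, which is one reason this is only an illustrative sketch.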

3 Experimental Results

In all cases the CM was determined by classifying the out-of-bootstrap instances of 5 bootstrap resamples of the original datasets. That is, a different classifier was

trained on each bootstrapped dataset and later used to classify the corresponding out-of-bootstrap instances. That 5 bootstraps are enough to reliably determine the CM was established empirically from the results shown in Table 1, where we report the change in the fraction of misclassification errors for the Glass dataset, using PDA and RF as primary and secondary classifiers respectively, and for the Vowel dataset with PDA-SVM classifiers, as the number of bootstraps used to determine the CM is varied from 1 to 10. There are no significant changes beyond 5 bootstraps. All the results in this section are averages over 50 experiments (10 repetitions of 5-fold cross-validation), except for the MiropPTRMS dataset, for which only 12 repetitions of 4-fold cross-validation were performed. Average errors and their standard deviations are reported.
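A possible way to accumulate such an out-of-bootstrap confusion matrix is sketched below, again assuming scikit-learn-style estimators; the function name, the fixed random seed and the default of 5 bootstraps (the value settled on in Table 1) are our choices, not the authors' code.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import confusion_matrix

def bootstrap_confusion_matrix(proto, X, y, n_classes, n_bootstraps=5, seed=0):
    """Accumulate a confusion matrix from out-of-bootstrap predictions.

    For each bootstrap resample a fresh classifier is trained on the
    resampled data and used to classify the instances left out of it.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for _ in range(n_bootstraps):
        boot = rng.integers(0, n, size=n)        # indices drawn with replacement
        oob = np.setdiff1d(np.arange(n), boot)   # out-of-bootstrap indices
        clf = clone(proto).fit(X[boot], y[boot])
        cm += confusion_matrix(y[oob], clf.predict(X[oob]),
                               labels=np.arange(n_classes))
    return cm
```

The matrix returned here is what the split_by_confusion_matrix helper of Section 2 would consume.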

Table 1. Evolution of the fraction of misclassified examples as a function of the number of bootstraps used to determine the confusion matrix. Results correspond to the Glass and Vowel datasets; the classification methods used in the cascade are indicated.

# of bootstraps    Glass / PDA-RF     Vowel / PDA-SVM
1                  0.262 ± 0.063      0.227 ± 0.031
2                  0.253 ± 0.063      0.216 ± 0.026
3                  0.257 ± 0.063      0.212 ± 0.026
4                  0.246 ± 0.066      0.210 ± 0.024
5                  0.251 ± 0.063      0.210 ± 0.027
6                  0.251 ± 0.063      0.210 ± 0.026
7                  0.251 ± 0.060      0.209 ± 0.026
8                  0.253 ± 0.060      0.208 ± 0.026
9                  0.249 ± 0.060      0.208 ± 0.026
10                 0.250 ± 0.060      0.208 ± 0.027

Tables 2a and 2b show the results of a very preliminary investigation into the possibilities of cascade classifiers. We have implemented all possible combinations of the three classification methods considered and applied them to all the datasets listed above. In all cases we have taken θj = 0, i.e., we retain in each subproblem j all the classes for which at least one example was mistakenly classified as belonging to class j. In cases where all the classes were present in a given subproblem j, we kept only (c - 1) classes, discarding the one with the smallest nij. In gray we highlight the combinations of methods A-B that produced a smaller error than the primary classifier A; in bold we indicate the best performing method in each case. From these results we see that in 27 of the 60 cases considered (45%) the cascade classifier was as good as or better than the primary classifier alone. However, in only 2 of the 7 datasets investigated did this strategy produce the best result (smallest misclassification error). This modest performance of the cascade approach clearly indicates that it cannot simply be taken as an off-the-shelf classification method. In this regard, claims to the contrary in previous works in the literature seem to be directly tied to the use of very simple primary classifiers, which leave ample room for improvement during the second stage.

Table 2a. Fraction of misclassified examples for the two datasets obtained from repositories, using cascade classifiers built with all possible combinations of the basic classification methods. In gray we highlight successful combinations that improve on the primary method of the cascade; bold numbers indicate the best method for each dataset.

Glass
Primary    Secondary RF     Secondary PDA    Secondary SVM    Only Primary
RF         0.235 ± 0.064    0.377 ± 0.055    0.360 ± 0.056    0.235 ± 0.067
PDA        0.241 ± 0.064    0.374 ± 0.062    0.365 ± 0.063    0.376 ± 0.054
SVM        0.228 ± 0.064    0.372 ± 0.057    0.360 ± 0.064    0.361 ± 0.060

Vowel
Primary    Secondary RF     Secondary PDA    Secondary SVM    Only Primary
RF         0.064 ± 0.019    0.298 ± 0.031    0.196 ± 0.022    0.055 ± 0.020
PDA        0.062 ± 0.019    0.351 ± 0.033    0.193 ± 0.021    0.401 ± 0.030
SVM        0.066 ± 0.018    0.334 ± 0.029    0.200 ± 0.020    0.186 ± 0.021

Table 2b. Same as Table 2a, but for the 5 datasets from the agro-alimentary industry.

Mirop60
Primary    Secondary RF     Secondary PDA    Secondary SVM    Only Primary
RF         0.237 ± 0.106    0.238 ± 0.135    0.208 ± 0.114    0.237 ± 0.097
PDA        0.217 ± 0.098    0.213 ± 0.119    0.185 ± 0.096    0.245 ± 0.121
SVM        0.202 ± 0.091    0.225 ± 0.123    0.160 ± 0.092    0.172 ± 0.100

Grana60
Primary    Secondary RF     Secondary PDA    Secondary SVM    Only Primary
RF         0.292 ± 0.130    0.302 ± 0.121    0.295 ± 0.124    0.290 ± 0.128
PDA        0.282 ± 0.137    0.287 ± 0.120    0.270 ± 0.124    0.262 ± 0.112
SVM        0.305 ± 0.126    0.303 ± 0.113    0.307 ± 0.104    0.327 ± 0.109

FragPTRMS
Primary    Secondary RF     Secondary PDA    Secondary SVM    Only Primary
RF         0.126 ± 0.056    0.095 ± 0.044    0.112 ± 0.050    0.116 ± 0.052
PDA        0.121 ± 0.059    0.098 ± 0.043    0.110 ± 0.036    0.085 ± 0.034
SVM        0.132 ± 0.052    0.101 ± 0.042    0.111 ± 0.041    0.109 ± 0.043

MiropPTRMS
Primary    Secondary RF     Secondary PDA    Secondary SVM    Only Primary
RF         0.306 ± 0.105    0.375 ± 0.147    0.302 ± 0.116    0.286 ± 0.104
PDA        0.359 ± 0.139    0.443 ± 0.140    0.365 ± 0.139    0.387 ± 0.128
SVM        0.290 ± 0.118    0.377 ± 0.155    0.281 ± 0.094    0.269 ± 0.099

Weed Seeds
Primary    Secondary RF     Secondary PDA    Secondary SVM    Only Primary
RF         0.097 ± 0.009    0.135 ± 0.008    0.101 ± 0.009    0.096 ± 0.009
PDA        0.163 ± 0.009    0.245 ± 0.011    0.157 ± 0.005    0.265 ± 0.007

4 Conclusions

We have considered a cascade approach to multiclass classification problems which breaks the original task into smaller subproblems in a divide-and-conquer strategy. For this, we used a splitting strategy based on the CM associated with the first (primary) classifier, sorting test samples into c independent problems according to the columns of this matrix. In this preliminary work we have tested all possible combinations of three state-of-the-art classification algorithms, using them alternately in the two stages of the method on 7 real-world datasets.

The results obtained show a modest performance increase in only a few cases, which rules out using the cascade approach as an off-the-shelf classification method. The reasons for this poor performance need to be investigated further, in order to establish the conditions under which this strategy is advantageous. For instance, notice that when the method used as secondary classifier is more accurate on the original polychotomy than the primary one, the cascade approach improves on the results of the first classifier in 85% (11/13) of the cases. Thus, for problems where the direct application of the (supposedly) best classification method is not feasible because of the computational cost involved, breaking the problem into smaller subproblems that can be tackled one by one might be an alternative.

In future work we plan to investigate in more depth some of the outcomes of the experiments reported here, for instance the puzzling fact that, in 11 of the 20 cases (55%) in which the cascade used the same method for the primary and secondary classifiers, the results turned out to be worse than those produced by the primary classifier alone. This might be related to a deficient determination of the CM and/or to the dataset not being separable into subproblems. In any case, it would be interesting to determine, for each case, which of the conditions required for the cascade approach to succeed is not actually fulfilled. Different implementations of the splitting strategy are also worth investigating, for instance the two other strategies proposed in Section 2. Finally, an important issue not discussed here that deserves consideration is the possibility of implementing further feature selection for the secondary classifiers in the cascade.

Acknowledgements: P.M.G. is supported by the PAT project SAMPPA.

References

1. Allwein, E., Schapire, R., Singer, Y.: Reducing Multiclass to Binary: A Unified Approach for Margin Classifiers. Journal of Machine Learning Research 1 (2000) 113-141
2. Gama, J., Brazdil, P.: Cascade Generalization. Machine Learning 41 (2000) 315-343
3. Ferri, C., Flach, P., Hernandez-Orallo, J.: Delegating Classifiers. Proceedings of the 21st International Conference on Machine Learning, Banff, Canada (2004)
4. Alpaydin, E., Kaynak, C.: Cascading Classifiers. Kybernetika 34 (1998) 369-374
5. Kaynak, C., Alpaydin, E.: MultiStage Cascading of Multiple Classifiers: One Man's Noise is Another Man's Data. Proceedings of the 17th International Conference on Machine Learning, Stanford, USA (2000)
6. Bellili, A., Gilloux, M., Gallinari, P.: Reconnaissance des chiffres manuscrits par système hybride MLP-SVM. Proceedings of RFIA 2002
7. Roli, F., Giacinto, G.: Design of Multiple Classifier Systems. In: Hybrid Methods in Pattern Recognition (2002)
8. Granitto, P.M., Verdes, P.F., Ceccatto, H.A.: Large Scale Investigation of Weed Seeds Identification by Machine Vision Techniques. Computers and Electronics in Agriculture 47 (2005) 15-24
9. Breiman, L.: Random Forests. Machine Learning 45 (2001) 5-32
10. Hastie, T., Buja, A., Tibshirani, R.: Penalized Discriminant Analysis. Annals of Statistics 23 (1995) 73-102
11. Hastie, T., Friedman, J., Tibshirani, R.: The Elements of Statistical Learning. Springer-Verlag, New York (2001)
12. Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag, New York (1996)