Efficient Splitting Rules based on the Probabilities of Pre-Assigned Intervals

June-Suh Cho
IBM T.J. Watson Research Center
30 Saw Mill River Rd., Hawthorne, NY 10532
[email protected]

Nabil R. Adam
CIMIC, Rutgers University
180 University Avenue, Newark, NJ 07102
[email protected]

Abstract

This paper describes new methods for classification that find an optimal tree. Unlike current splitting rules, which are obtained by searching over all threshold values, the splitting rules proposed in this paper are based on the probabilities of pre-assigned intervals.

1. Introduction

Most classification methods repeatedly search for the best split of a subset by examining all possible threshold values for all variables [4, 1, 7, 5, 8, 6]. However, these methods do not efficiently find precise cutoff points when applied to continuous variables, and their split search is computationally expensive. We propose simple splitting rules based on minimizing the sum of variances and on maximizing the difference of probabilities of pre-assigned intervals. Interval-based methods reduce the computational complexity of finding the optimal cutoff point, since the number of intervals is smaller than the number of all possible threshold values. Our experiments demonstrate that the new splitting rules properly classify image objects, producing average recall and precision over 85% on the experimental image data. We also describe the computational complexity of our split selection approach, which is better than that of the exhaustive search methods: for a continuous variable, our approach examines $m - 1$ candidate splits and the exhaustive search approach examines $l - 1$, where $m$ is the number of intervals and $l$ is the number of all possible threshold values. Since $l > m > 1$, the number of possible splits satisfies $l - 1 > m - 1$.

2. The Proposed Splitting Rules

In our approach, we assume that there are two classes, class 0 and class 1, and we use the following definitions:

$n = n_L + n_R$
$n_L, n_R$ = the number of objects in the left and the right buckets
$n_1, n_0$ = the number of objects of class 1 and class 0 in the parent node
$n_{1L}, n_{1R}, n_{0L}, n_{0R}$ = the number of objects of class 1 and class 0 in the left and the right buckets
$p_L = n_L/n$, $p_R = n_R/n$, $p_{1L} = n_{1L}/n_L$, $p_{1R} = n_{1R}/n_R$, $p_{0L} = n_{0L}/n_L$, $p_{0R} = n_{0R}/n_R$

Given a split with left and right buckets, the class probabilities on the left and on the right satisfy $p_{0L} + p_{1L} = 1$ and $p_{0R} + p_{1R} = 1$, respectively. In this paper, we consider two kinds of splitting rule. The first selects an optimal cutoff point by minimizing the sum of variances of the intervals. Minimizing the sum of variances is equivalent to maximizing the sum of weighted probabilities in the left and the right bucket:

$\min_c \, [\, 1 - (q_L p_{1L} + q_R p_{1R}) \,] \;\equiv\; \max_c \, [\, q_L p_{1L} + q_R p_{1R} \,]$

The second criterion for splitting maximizes the difference of probabilities between the left and the right bucket. This approach searches for interesting regions of the data regardless of the non-interesting regions, and it can be expressed as follows:

$\max_c \, | p_{1L} - p_{1R} |$

In the rest of this section, we summarize the algorithm that selects the optimal cutoff point under each splitting rule and the method that combines the two rules. The hybrid splitting rule trades off the advantages and disadvantages of minimizing the variances and of maximizing the difference of probabilities between the left and the right bucket.
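As an illustration of these definitions and criteria, the sketch below computes the bucket probabilities and both statistics for a single candidate cutoff. It is a minimal Python rendering (the paper's experiments were implemented in Matlab), the function name `split_criteria` is ours, and it assumes one particular reading of $q_L, q_R$, namely the fraction of the parent node's class-1 objects that fall into each bucket:

```python
import numpy as np

def split_criteria(x, y, cutoff):
    """Return (d_I, d_II) for one candidate cutoff of feature x.

    Objects with x <= cutoff form the left bucket, the rest the right bucket.
    d_I  = q_L * p_1L + q_R * p_1R   (sum of weighted probabilities)
    d_II = |p_1L - p_1R|             (difference of probabilities)
    """
    x, y = np.asarray(x), np.asarray(y)
    left = x <= cutoff
    right = ~left
    n_1 = int((y == 1).sum())              # class-1 objects in the parent node

    # Class-1 proportion inside each bucket: p_1L = n_1L / n_L, p_1R = n_1R / n_R.
    p_1L = float((y[left] == 1).mean()) if left.any() else 0.0
    p_1R = float((y[right] == 1).mean()) if right.any() else 0.0

    # Assumed reading of q_L, q_R: share of the parent's class-1 objects per bucket.
    q_L = float((y[left] == 1).sum()) / n_1 if n_1 else 0.0
    q_R = float((y[right] == 1).sum()) / n_1 if n_1 else 0.0

    d_I = q_L * p_1L + q_R * p_1R
    d_II = abs(p_1L - p_1R)
    return d_I, d_II
```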




Splitting rule 1.

1. Split a predictor variable into intervals $I_i$, $1 \le i \le L$.

2. For each interval $I_i$, calculate $p_{1i} = P(\mathrm{class}\ 1 \mid x \in I_i)$ and $q_i = P(\mathrm{class}\ 1 \mid \mathrm{class}\ 1\ \mathrm{in\ parent\ node},\, x \in I_i)$.

3. Starting from the left interval, from the right interval, or from the interval with the highest probability, add up intervals to find a cutoff point $c$ by maximizing the sum of weighted probabilities over the resulting left and right buckets:

$d_I = \max_c \, [\, q_L p_{1L} + q_R p_{1R} \,]$
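A minimal Python sketch of this interval-based search follows (illustrative only; `best_cutoff_rule1` and the treatment of interval boundaries are our own simplification). It scans the $m - 1$ pre-assigned interval boundaries as candidate cutoffs, i.e., it only considers single-cutoff splits formed by accumulating intervals from one end; the paper's procedure also allows starting from the highest-probability interval:

```python
import numpy as np

def best_cutoff_rule1(x, y, boundaries):
    """Splitting rule 1, sketched: evaluate each pre-assigned interval
    boundary as a candidate cutoff and keep the one maximizing
    d_I = q_L * p_1L + q_R * p_1R."""
    x, y = np.asarray(x), np.asarray(y)
    n_1 = int((y == 1).sum())
    best_c, best_dI = None, -np.inf
    for c in boundaries:                      # m - 1 candidates, not l - 1
        left = x <= c
        right = ~left
        if not left.any() or not right.any() or n_1 == 0:
            continue
        p_1L = float((y[left] == 1).mean())
        p_1R = float((y[right] == 1).mean())
        q_L = float((y[left] == 1).sum()) / n_1   # assumed reading of q_L, q_R
        q_R = float((y[right] == 1).sum()) / n_1
        d_I = q_L * p_1L + q_R * p_1R
        if d_I > best_dI:
            best_c, best_dI = c, d_I
    return best_c, best_dI
```

With intervals sized at 20% of the objects, as in Section 3, `boundaries` could simply be the 20%, 40%, 60%, and 80% quantiles of the predictor, giving four candidate cutoffs.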

Classes   Minimum Error Rate
Plate     6.49%
Cup       5.32%
Pan       7.67%
Pot       10.88%

Table 1. Average Minimum Error Rate on Training Datasets.

Classes          Plate    Cup      Pan      Pot
Avg. Precision   95%      99%      94%      88%
Avg. Recall      93%      94%      92%      89%
F-score          93.9%    96.4%    92.9%    88.5%

Table 2. Performance evaluation with Precision, Recall, and F-score.

                 Hybrid   CART     S-plus   C4.5
Plate   train    94.6%    92.7%    90.3%    95.1%
        test     98.7%    96.2%    94.6%    99.1%
Cup     train    93.5%    93.4%    89.8%    94.4%
        test     96.9%    97.2%    92.9%    96.4%
Pan     train    92.3%    92.4%    91.0%    93.2%
        test     96.4%    96.0%    93.8%    96.2%
Pot     train    89.1%    93.2%    88.7%    93.4%
        test     91.8%    94.5%    91.7%    94.2%

Table 3. The classification accuracy using our method and the existing methods.

Five test sets were run through the classifier, and the recall and precision scores were averaged. We also evaluated performance using the F-score, which combines recall and precision into a single value without being biased toward either of them; when recall and precision are both high, the F-score is close to 1, indicating that the method classifies well. Table 3 shows a simple comparison between our method and existing methods such as CART, S-plus, and C4.5. It reports the classification accuracy on the training and the test data, where accuracy = (number of objects correctly classified) / (total number of objects).

Splitting rule 2.

1. For each interval $I_i$, calculate $p_{1i} = P(\mathrm{class}\ 1 \mid x \in I_i)$.

2. Starting from the left interval, from the right interval, or from the interval with the highest probability, add up intervals to select a cutoff point $c$ by maximizing the difference of probabilities between the resulting left and right buckets:

$d_{II} = \max_c \, | p_{1L} - p_{1R} |$
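Splitting rule 2 reuses the same boundary scan with the probability-difference objective; a minimal, illustrative Python sketch (the function name is ours):

```python
import numpy as np

def best_cutoff_rule2(x, y, boundaries):
    """Splitting rule 2, sketched: pick the interval boundary maximizing
    d_II = |p_1L - p_1R|, the difference of class-1 probabilities."""
    x, y = np.asarray(x), np.asarray(y)
    best_c, best_dII = None, -np.inf
    for c in boundaries:
        left = x <= c
        right = ~left
        if not left.any() or not right.any():
            continue
        d_II = abs(float((y[left] == 1).mean()) - float((y[right] == 1).mean()))
        if d_II > best_dII:
            best_c, best_dII = c, d_II
    return best_c, best_dII
```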

Hybrid splitting rule.

1. For each interval $I_i$, calculate $p_{1i} = P(\mathrm{class}\ 1 \mid x \in I_i)$ and $q_i = P(\mathrm{class}\ 1 \mid \mathrm{class}\ 1\ \mathrm{in\ parent\ node},\, x \in I_i)$.

2. Calculate the two statistics $d_I$ and $d_{II}$ and give each a rank according to the grouping of the intervals.

3. The average ranks for each grouping are obtained from the two splitting rules. When the average ranks are the same, priority is given to the first splitting rule, because there is still a chance to split on the other predictor variables later.

4. Take the cutoff point $c$ with the highest rank.

We stop splitting the tree if the sample size in some bucket is less than a user-specified value or if no variable shows a significant difference under the splitting rules.
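The ranking step can be sketched as follows. This illustrative Python code (our own simplification: it ranks candidate cutoffs directly rather than interval groupings, and the names are hypothetical) averages the ranks obtained under $d_I$ and $d_{II}$ and breaks ties in favor of splitting rule 1:

```python
import numpy as np

def best_cutoff_hybrid(x, y, boundaries):
    """Hybrid rule, sketched: rank candidate cutoffs under d_I and d_II,
    average the two ranks, and take the best average rank.
    Ties go to the d_I (splitting rule 1) ranking, as in step 3."""
    x, y = np.asarray(x), np.asarray(y)
    n_1 = int((y == 1).sum())
    cands = []                                     # (cutoff, d_I, d_II)
    for c in boundaries:
        left = x <= c
        right = ~left
        if not left.any() or not right.any() or n_1 == 0:
            continue
        p_1L = float((y[left] == 1).mean())
        p_1R = float((y[right] == 1).mean())
        q_L = float((y[left] == 1).sum()) / n_1    # assumed reading of q_L, q_R
        q_R = float((y[right] == 1).sum()) / n_1
        cands.append((c, q_L * p_1L + q_R * p_1R, abs(p_1L - p_1R)))
    if not cands:
        return None
    # Rank 1 = largest statistic under each criterion.
    rank_I = {c: r for r, (c, _, _) in enumerate(sorted(cands, key=lambda t: -t[1]), 1)}
    rank_II = {c: r for r, (c, _, _) in enumerate(sorted(cands, key=lambda t: -t[2]), 1)}
    # Lowest average rank wins; ties are broken by the rule-1 rank.
    return min((c for c, _, _ in cands),
               key=lambda c: (rank_I[c] + rank_II[c], rank_I[c]))
```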

3. Experimental Results

Experimental results are reported for 300 image objects from the image databases of electronic catalogs. The image objects are categorized by semantics, such as pan, pot, cup, and plate. We performed feature extraction [2] on the electronic catalogs to obtain the image parameters of the image objects. We assigned the size of the intervals as 20% of the total number of objects. Because of the size of the data, we used 120 objects for the training data set and 180 objects for the test data sets, generated with the Bootstrap method [3]. Table 1 shows the accuracy of the classifier for plate and cup as well as for pan and pot; it reports the average minimum error rate for plate/cup and pan/pot on the training set. The experiments were implemented in Matlab to test the splitting rules. Recall and precision, which are used to evaluate the effectiveness of the classification, are reported in Table 2.

4. Conclusion

In this paper, we have introduced new methods for classification, namely new splitting rules based on the probabilities of pre-assigned intervals, and we used binary tree splits to find the cutoff points. The new methods provide a means of finding the optimal cutoff points and yield a simple and fast classifier with reasonable accuracy for recognizing objects in a large image database.

References

[1] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984.
[2] J.-S. Cho, A. Gangopadhyay, and N. Adam. Feature Extraction for Content-based Image Search in Electronic Commerce. In MIS/OA International Conference, 2000.
[3] B. Efron. Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation. Journal of the American Statistical Association, 1983.
[4] D. M. Hawkins and G. V. Kass. Automatic Interaction Detection. Cambridge University Press, 1982.
[5] W.-Y. Loh and Y.-S. Shih. Split Selection Methods for Classification Trees. Statistica Sinica, 1997.
[6] S. K. Murthy, S. Kasif, and S. Salzberg. A System for Induction of Oblique Decision Trees. Journal of Artificial Intelligence Research, 1994.
[7] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
[8] R. Rastogi and K. Shim. PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning. In Proceedings of the 24th VLDB Conference, 1998.
