Information Fusion 5 (2004) 239–250 www.elsevier.com/locate/inffus

Fusion of multiple approximate nearest neighbor classifiers for fast and efficient classification

P. Viswanath, M. Narasimha Murty, Shalabh Bhatnagar

Department of Computer Science and Automation, Indian Institute of Science, Bangalore 560 012, India

Received 8 May 2003; received in revised form 25 February 2004; accepted 25 February 2004. Available online 20 March 2004.

Abstract

The nearest neighbor classifier (NNC) is a popular non-parametric classifier. It is a simple classifier with no design phase and shows good performance. Important factors affecting the efficiency and performance of the NNC are (i) the memory required to store the training set, (ii) the classification time required to search for the nearest neighbor of a given test pattern, and (iii) the curse of dimensionality, because of which the classifier becomes severely biased when the dimensionality of the data is high and the sample size is finite. In this paper we propose (i) a novel pattern synthesis technique that increases the density of patterns in the input feature space and thereby reduces the curse of dimensionality effect, (ii) a compact representation of the training set that reduces the memory requirement, (iii) a weak approximate nearest neighbor classifier with constant classification time, and (iv) an ensemble of the approximate nearest neighbor classifiers in which the individual classifiers' decisions are combined by majority vote. The ensemble has a constant classification time upper bound and, according to empirical results, shows good classification accuracy. A comparison based on empirical results is drawn between our approaches and other related classifiers.

© 2004 Elsevier B.V. All rights reserved.

Keywords: Multi-classifier fusion; Ensemble of classifiers; Nearest neighbor classifier; Pattern synthesis; Approximate nearest neighbor classifier; Compact representation

1. Introduction

The nearest neighbor classifier (NNC) is a very popular non-parametric classifier [1,2]. It is widely used because of its simplicity and good performance. It has no design phase; it simply stores the training set, and a test pattern is classified to the class of its nearest neighbor in the training set. The classification time of the NNC is therefore largely spent reading the entire training set to find the nearest neighbor(s) (we assume that the training set is not preprocessed, e.g., indexed, to reduce this search time). Thus two major shortcomings of the classifier are that (i) the entire training set needs to be stored and (ii) the entire training set needs to be searched. In addition, when the dimensionality of the data is high, the classifier becomes severely biased with a finite training set due to the curse of dimensionality [2].

Cover and Hart [3] show that the error of the NNC is bounded by twice the Bayes error when the available sample size is infinite. In practice, however, one never has an infinite number of training samples. With a fixed number of training samples, the error of the NNC tends to increase as the dimensionality of the data gets large. This is called the peaking phenomenon [4,5]. Jain and Chandrasekharan [6] point out that the number of training samples per class should be about 5–10 times the dimensionality of the data. The peaking phenomenon is known to be more severe for the NNC than for parametric classifiers such as Fisher's linear and quadratic classifiers [7,8]. Thus, it is widely believed that the training set size needed to achieve a given classification accuracy would be prohibitively large when the dimensionality of the data is high.


Increasing the training set size has two problems: (i) the space and time requirements increase, and (ii) it may be expensive to get training patterns from the real world. The space requirement problem can be solved to some extent by using a compact representation of the training data, such as the PC-tree [9], FP-tree [10] or CF-tree [11], or by using editing techniques [1] which reduce the training set size without affecting performance. The classification time problem can be solved by building an index over the training set, such as an R-tree [12], while the curse of dimensionality problem can be addressed by re-sampling techniques like bootstrapping [13], which are widely studied [14–18]. These remedies are orthogonal, i.e., they have to be applied one after the other and cannot be combined into a single step. This paper, however, attempts to give a unified solution.

In this paper, we propose a novel bootstrap technique for NNC design, which we call partition based pattern synthesis, that reduces the curse of dimensionality effect. The artificial training set generated by this method can be exponentially larger than the original set (by the original set we mean the given training set). As a result, the synthetic patterns cannot be stored explicitly. We propose a compact data structure called the partitioned pattern count tree (PPC-tree), which is a compact representation of the original set and is suitable for performing the synthesis.

The classification time problem is solved as follows. Finding an approximate nearest neighbor (NN) is computationally less demanding, since it avoids an exhaustive search of the training set. We propose an approximate NN classifier called PPC-aNNC whose classification time is independent of the training set size. PPC-aNNC works directly with the PPC-tree, which performs implicit pattern synthesis, and finds an approximate nearest neighbor of the given test pattern from the entire synthetic set. Thus an explicit bootstrap step to generate the artificial training set is avoided. However, PPC-aNNC is a weak classifier with a lower classification accuracy (CA) than the NNC. Fusing the classification decisions of multiple PPC-aNNC's is empirically shown to achieve better CA than the conventional NNC. This ensemble of PPC-aNNC's is based on a simple majority voting technique and is suitable for parallel implementation. The proposed ensemble is a faster and better classifier than the NNC and some of the classifiers of its kind. The PPC-tree and PPC-aNNC assume discrete valued features; for other domains, the data sets need to be discretized appropriately.

Some of the earlier attempts at combining nearest neighbor (NN) classifiers are as follows. Breiman [19] experimentally demonstrated that combining NN classifiers does not improve performance as compared to that of a single NN classifier. He attributed this behavior to the characteristic of NN classifiers that the addition or removal of a small number of training instances does not change NN classification boundaries significantly.


Decision trees and neural networks, he argued, are in this sense less stable than NN classifiers. In his experiments, the component NN classifiers stored a large number of prototypes, so the combination is computationally inefficient as well. Skalak [20] used a few selected prototypes for each component NN classifier and showed that the composite classifier outperforms the conventional NNC. Alpaydin [21] used multiple condensed sets generated by accessing the training set in various random orders; each individual NNC works with one condensed training set and the final decision is made by majority voting (either simple or weighted) over the individual classifiers. Experimental results show that this improves performance. Kubat and Chen [22] propose an ensemble of several NNCs, each independent classifier considering only one of the available features. Class assignment to new patterns is done through weighted majority voting of the individual classifiers. This does not work well in domains where the mutual inter-correlation between pairs of attributes is high. Bay [23] combined multiple NN classifiers where each component uses only a random subset of features; this also is experimentally shown to improve performance in most cases. Hamamoto et al. [18] proposed a bootstrap technique for NNC design which is experimentally shown to perform well. In their approach, each training pattern is replaced by a weighted average (which is the centroid if the weights are equal) of its r nearest neighbors in the training set.

We present experimental results in this paper with six different data sets (having both discrete and continuous valued features), and a comparison is drawn between our approaches and (i) NNC, (ii) k-NNC, (iii) the Naive-Bayes classifier, (iv) NNC based on the bootstrap technique of Hamamoto et al. [18], (v) voting over multiple condensed nearest neighbors [21] and (vi) the weighted nearest neighbor with feature projection [22].

This paper is organized as follows: partition based pattern synthesis is described in Section 2, the compact data structures in Section 3, PPC-aNNC in Section 4.2, the ensemble of PPC-aNNC's in Section 4.3, experimental results in Section 5 and conclusions in Section 6.

2. Partition based pattern synthesis

We use the following notation and definitions to describe partition based pattern synthesis and various other concepts throughout this paper.

2.1. Notation and definitions

Set of features: F = {f1, f2, ..., fd} is the set of features. Feature fi takes its value from the domain Di (1 ≤ i ≤ d).

Pattern: X = (x1, x2, ..., xd)^T is a pattern in d-dimensional vector format. X[fi] is the feature-value of pattern X for feature fi, with X[fi] ∈ Di (1 ≤ i ≤ d). Thus, X[fi] = xi for pattern X.

Set of class labels: Ω = {1, 2, ..., c} is the set of class labels. Each training pattern has a class label.

Set of training patterns: X is the set of all training patterns and Xl is the set of training patterns for the class with label l, so X = X1 ∪ X2 ∪ ... ∪ Xc.

Partition: πl = {B1, B2, ..., Bp} is a partition of F for the class with label l, i.e., Bi ⊆ F for all i, ∪i Bi = F, and Bi ∩ Bj = ∅ if i ≠ j, for all i, j. The set of partitions is P = {πl | 1 ≤ l ≤ c}.

Sub-pattern: A pattern for which zero or more feature-values are absent (missing or unknown) is called a sub-pattern. An absent feature-value is represented by H. Thus, if Y is a sub-pattern, then Y[fi] ∈ Di ∪ {H}, for 1 ≤ i ≤ d.

Scheme of a sub-pattern: A sub-pattern Y is said to be of scheme S, where S ⊆ F, if for 1 ≤ i ≤ d,
  Y[fi] ∈ Di  if fi ∈ S,
  Y[fi] = H   otherwise.

Sub-pattern of a pattern: X^S is said to be the sub-pattern of pattern X with scheme S, provided that for 1 ≤ i ≤ d,
  X^S[fi] = X[fi]  if fi ∈ S,
  X^S[fi] = H      otherwise.

Set of sub-patterns: A collection of sub-patterns whose members all have the same scheme. A collection of sub-patterns of different schemes is not a set of sub-patterns. We further define the set of sub-patterns for a set of patterns with respect to a scheme as follows: if W is a set of patterns, then W^S = {W^S | W ∈ W} is called the set of sub-patterns of W with respect to scheme S.

Merge operation (⊕): If P, Q are two sub-patterns of schemes Si, Sj respectively, then the merge of P and Q, written P ⊕ Q, is a sub-pattern of scheme Si ∪ Sj and is defined only if Si ∩ Sj = ∅. If R = P ⊕ Q, then for 1 ≤ k ≤ d,
  R[fk] = P[fk]  if fk ∈ Si,
  R[fk] = Q[fk]  if fk ∈ Sj,
  R[fk] = H      otherwise.

Join operation (⊕): If Ym, Yn are sets of sub-patterns of schemes Si, Sj respectively, then the join of Ym and Yn, written Ym ⊕ Yn, is defined only if Si ∩ Sj = ∅, and Ym ⊕ Yn = {R | R = P ⊕ Q, P ∈ Ym, Q ∈ Yn}. The join operation is commutative and associative (this follows directly from the definitions of the join and merge operations), so Ym ⊕ (Yn ⊕ Yo) = (Ym ⊕ Yn) ⊕ Yo, which is written as Ym ⊕ Yn ⊕ Yo.

2.2. Synthetic pattern generation

The method of synthetic pattern generation is as follows.
(1) Choose an appropriate set of partitions P = {πl | 1 ≤ l ≤ c}, where πl = {B1, B2, ..., Bp} is a partition of F for the class with label l.
(2) Replace the set of training patterns for the class with label l, that is Xl, by its synthetic counterpart SP(Xl), where SP(Xl) = Xl^B1 ⊕ Xl^B2 ⊕ ... ⊕ Xl^Bp.
(3) Repeat step 2 for each label l ∈ Ω.

Note 1. The partition can be different for each class. However, we assume |πl| = p, a constant, for all l ∈ Ω. This simplifies the analysis of the classification methods and of the cross-validation method described in subsequent sections.

Note 2. If each pattern is seen as an ordered tuple, then Xl ⊆ SP(Xl) ⊆ D1 × D2 × ... × Dd.

2.3. Example

This example illustrates the concept of synthetic pattern generation. Let F = {f1, f2, f3, f4}, D1 = {red, green, blue}, D2 = {2, 3, 4, 5}, D3 = {small, big} and D4 = {1.75, 2.04}. Let Xl = {(red, 3, big, 1.75)^T, (green, 2, small, 1.75)^T} be the training set for the class with label l, and let the partition for this class be πl = {B1, B2}, where B1 = {f1, f3} and B2 = {f2, f4}. Then Xl^B1 = {(red, H, big, H)^T, (green, H, small, H)^T} and Xl^B2 = {(H, 3, H, 1.75)^T, (H, 2, H, 1.75)^T}. The set of synthetic patterns for class l is

SP(Xl) = Xl^B1 ⊕ Xl^B2 = {(red, 3, big, 1.75)^T, (red, 2, big, 1.75)^T, (green, 3, small, 1.75)^T, (green, 2, small, 1.75)^T}.
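To make the synthesis operational, a small sketch follows (our own illustration, not code from the paper); it builds the block-wise sub-pattern sets and forms their join by assembling one sub-pattern per block, and the function name synthesize is ours.

from itertools import product

def synthesize(patterns, blocks):
    """Partition based pattern synthesis for one class.

    patterns: list of tuples (the original training patterns of the class)
    blocks:   a partition of the feature indices, e.g. [[0, 2], [1, 3]]
    Returns the synthetic set SP(Xl) as a set of full-length tuples.
    """
    d = len(patterns[0])
    # Set of sub-patterns of the class for each block (duplicates collapse).
    subpattern_sets = [{tuple(x[f] for f in block) for x in patterns}
                       for block in blocks]
    synthetic = set()
    # Join: pick one sub-pattern per block and merge them into a full pattern.
    for choice in product(*subpattern_sets):
        merged = [None] * d
        for block, sub in zip(blocks, choice):
            for f, value in zip(block, sub):
                merged[f] = value
        synthetic.add(tuple(merged))
    return synthetic

# Example from Section 2.3: two patterns, blocks {f1, f3} and {f2, f4}.
Xl = [("red", 3, "big", 1.75), ("green", 2, "small", 1.75)]
print(sorted(synthesize(Xl, [[0, 2], [1, 3]])))

Running this on the example of Section 2.3 reproduces the four synthetic patterns listed above.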


2.4. A partitioning method

An appropriate partition needs to be chosen for the given classification problem. We present a simple heuristic method to find a partition. The method is based on the pair-wise correlation between the features and is therefore suitable only for domains having numerical feature values. Domain knowledge can also be used to obtain an appropriate partition; the synthesis method itself can work with any domain, provided a partition is given.

The partitioning method is given below. The objective is to find a partition such that the average correlation between features within a block is high and that between features of different blocks is low. Since this objective is computationally demanding, we give a greedy method which finds only a locally optimal partition.

Find-Partition()
{
Input: (i) Set of features, F = {f1, ..., fd}.
       (ii) Pair-wise correlations between features, C = {c[fi][fj] = correlation between fi and fj | 1 ≤ i, j ≤ d}.
       (iii) p = number of blocks required in the partition, with p ≤ d.
Output: Partition, π = {B1, B2, ..., Bp}.
(1) Mark all features in F as unused.
(2) Find c[f1'][f2'], the minimum element in C such that f1' ≠ f2'.
(3) B1 = {f1'}, B2 = {f2'}.
(4) Mark f1', f2' as used.
(5) For i = 3 to p {
      (i) Choose an unused feature fi' such that (c[fi'][f1'] + ... + c[fi'][f(i-1)']) / (i - 1) is minimum, where f1', ..., f(i-1)' are the features already marked as used.
      (ii) Bi = {fi'}.
      (iii) Mark fi' as used.
    }
(6) For each unused feature f' {
      (i) For i = 1 to p, compute Ti = (Σ over fj' ∈ Bi of c[f'][fj']) / |Bi|.
      (ii) Find the maximum element of {T1, T2, ..., Tp}. Let it be Tk.
      (iii) Bk = Bk ∪ {f'}.
      (iv) Mark f' as used.
    }
(7) Output the partition, π = {B1, ..., Bp}.
}

The above method is applied separately to each class of training patterns. Experiments (Section 5) are done with the number of blocks (i.e., p) equal to 1, 2, 3 and d, where d is the total number of features.
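For concreteness, a small sketch of this greedy procedure is given below (ours, not the authors' code); it assumes the features are indexed 0, ..., d-1 and that the pair-wise correlation matrix C for one class has already been computed, and the name find_partition is ours.

import numpy as np

def find_partition(C, p):
    """Greedy Find-Partition sketch: C is a d x d matrix of pair-wise
    feature correlations for one class, p is the number of blocks."""
    d = C.shape[0]
    if p == 1:
        return [list(range(d))]          # a single block containing all features
    unused = set(range(d))

    # Steps (2)-(4): seed the first two blocks with the least correlated pair.
    i0, j0 = min(((i, j) for i in range(d) for j in range(d) if i != j),
                 key=lambda ij: C[ij])
    blocks = [[i0], [j0]]
    unused -= {i0, j0}

    # Step (5): seed the remaining blocks with features having minimum
    # average correlation to the already chosen seed features.
    while len(blocks) < p:
        seeds = [b[0] for b in blocks]
        f = min(unused, key=lambda g: np.mean([C[g, s] for s in seeds]))
        blocks.append([f])
        unused.remove(f)

    # Step (6): assign every remaining feature to the block with which
    # its average correlation is maximum.
    for f in sorted(unused):
        k = max(range(p), key=lambda k: np.mean([C[f, b] for b in blocks[k]]))
        blocks[k].append(f)

    return blocks  # the partition {B1, ..., Bp} as lists of feature indices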

3. The data structures

Partition based pattern synthesis can generate a synthetic set of size O(n^p), where n is the original set size and p is the number of blocks in the partition. Hence explicitly storing the synthetic set is very space consuming. In this section we present a compact representation of the original set which is suitable for the synthesis. For large data sets, this representation requires less storage space than the original set itself. The representation is called the partitioned pattern count tree (PPC-tree).

The PPC-tree is a generalization of the pattern count tree (PC-tree). For the sake of completeness, we first give a brief overview of the PC-tree; details can be found in [9]. These data structures are suitable when each feature takes discrete values (which can also be categorical). For continuous valued features, an appropriate discretization needs to be done first; we present later a simple discretization process which is used in our experimental studies.

3.1. PC-tree

The PC-tree is a complete and compact representation of the training patterns belonging to a class. An order is imposed on the set of features F = {f1, ..., fd}, where fi denotes the ith feature. Patterns belonging to a class are stored in a tree structure (the PC-tree), where each feature value occupies a node. Every training pattern is present in a path from the root to a leaf. Two patterns X, Y of a class can share a common node for their respective nth feature if X[fi] = Y[fi] for 1 ≤ i ≤ n. Along with the feature value, each node stores a count indicating how many patterns share that node. A compact representation of the training set is obtained because many patterns share common nodes in the tree. The given training set is represented as the set {T1, T2, ..., Tc}, where each element Ti is the PC-tree for the class of training patterns with label i.

3.1.1. Example

Let {(a, b, c, x, y, z)^T, (a, b, d, x, y, z)^T, (a, e, c, x, y, u)^T, (f, b, c, x, y, v)^T} be the original training set for a class with label i. The corresponding PC-tree Ti (the same symbol is used for the tree and its root node) is shown in Fig. 1. Each node of the tree is of the format (feature-value : count).

Fig. 1. PC-tree Ti.

3.2. PPC-tree

Let Xi be the set of original patterns which belong to the class with label i, and let πi = {B1, B2, ..., Bp} be a partition of the feature set F, where each block Bj = {fj1, ..., fj|Bj|} (for 1 ≤ j ≤ p) is an ordered set and the nth feature of block Bj is fjn. Then the PPC-tree for Xi with respect to πi is Ti = {Ti1, ..., Tip}, a set of PC-trees such that Tij is the PC-tree for the set of sub-patterns Xi^Bj, for 1 ≤ j ≤ p, where the H-valued features (see Section 2.1) are ignored. Each PC-tree Tij corresponds to a class (with label i) and to a block (Bj ∈ πi) of the partition of that class. The given training set is represented as the set {T1, T2, ..., Tc}, where each element Ti is the PPC-tree for the class of training patterns with label i, and Ti = {Ti1, ..., Tip} is a set of PC-trees.

A path from the root to a leaf of the PC-tree Tij (excluding the root node) corresponds to a unique sub-pattern with scheme Bj ∈ πi. If (x1, x2, ..., x|Bj|) is a path in Tij, then the corresponding sub-pattern is P such that P[fj1] = x1, P[fj2] = x2, ..., P[fj|Bj|] = x|Bj|, and for the remaining features f ∈ F - Bj, P[f] = H. If Qj is the sub-pattern corresponding to a path in Tij, for 1 ≤ j ≤ p, then Q = Q1 ⊕ Q2 ⊕ ... ⊕ Qp is a synthetic pattern in the class with label i. Algorithms 1 and 2 give the construction procedures.

Algorithm 1 (Build-PPC-trees())
{Input: (i) The original training set. (ii) A partition for each class, i.e., π1, π2, ..., πc.
Output: The set of PPC-trees, T = {T1, ..., Tc}, where Ti = {Ti1, ..., Tip} for 1 ≤ i ≤ c, and Tij is the PC-tree for the class with label i and block Bj ∈ πi.
Assumptions: (i) The number of blocks in each πi, 1 ≤ i ≤ c, is the same and is equal to p. (ii) Each Tij is empty (i.e., has only a root node) to start with.}
for i = 1 to c do
  for each training pattern X ∈ Xi do
    for j = 1 to p do
      Add-Pattern(Tij, X)
    end for
  end for
end for

3.2.1. Example

For the example considered in Section 3.1.1, the PPC-tree is shown in Fig. 2, where the partition is πi = {B1, B2} with B1 = {f1, f2, f3} and B2 = {f4, f5, f6}.

Fig. 2. PPC-tree Ti = {Ti1, Ti2}: Ti1 is the PC-tree for block 1 and Ti2 is the PC-tree for block 2.

The ordering of features considered for each block is the same as in Example 3.1.1. Thus the PPC-tree is the set of PC-trees {Ti1, Ti2}. Ti1 is the PC-tree for the set of sub-patterns Xi^B1 = {(a, b, c, H, H, H)^T, (a, b, d, H, H, H)^T, (a, e, c, H, H, H)^T, (f, b, c, H, H, H)^T}, where the H-valued features are ignored. Similarly, Ti2 is the PC-tree for the set of sub-patterns Xi^B2; see Fig. 2. Note that the PPC-tree is a more compact representation than the corresponding PC-tree: in this example, the number of nodes in the PPC-tree is 16, while that in the PC-tree is 22. A path from the root to a leaf of Ti1 represents a sub-pattern with scheme B1, and a path in Ti2 represents a sub-pattern with scheme B2. Merging the two sub-patterns gives a synthetic pattern according to the partition.

Algorithm 2 (Add-Pattern(PC-tree Tij, Pattern X))
X' = X^Bj, where Bj ∈ πi  {X' is the sub-pattern of X with scheme Bj ∈ πi}
current-node = root of Tij
for k = 1 to d do  {d is the dimensionality of X}
  if (X'[fk] ≠ H) then
    L = list of child nodes of current-node
    if (a node v ∈ L exists such that v.feature-value = X'[fk]) then
      v.count = v.count + 1
      current-node = v
    else
      new-node = create a new node
      new-node.feature-value = X'[fk]
      new-node.count = 1
      make new-node a child of current-node
      current-node = new-node
    end if
  end if
end for


Further, both the PC-tree and the PPC-tree can be built incrementally by scanning the database of patterns only once, and both are suitable for discrete valued features, which can also be of categorical type.
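To make the construction concrete, here is a minimal sketch of Algorithms 1 and 2 (our own illustration, not the authors' code) using nested dictionaries; names such as ppc_insert and build_ppc_trees are ours, and a block is simply a list of feature indices.

def new_node():
    # A PC-tree node: children keyed by feature value, with a shared-prefix count.
    return {"count": 0, "children": {}}

def ppc_insert(root, subpattern):
    """Algorithm 2 (Add-Pattern) sketch: insert one sub-pattern (a tuple of
    feature values for one block, in the block's feature order) into a PC-tree."""
    node = root
    for value in subpattern:
        child = node["children"].get(value)
        if child is None:                      # no child with this value yet
            child = new_node()
            node["children"][value] = child
        child["count"] += 1                    # one more pattern shares this node
        node = child

def build_ppc_trees(patterns_by_class, partitions_by_class):
    """Algorithm 1 (Build-PPC-trees) sketch: one PC-tree per (class, block)."""
    trees = {}
    for label, patterns in patterns_by_class.items():
        blocks = partitions_by_class[label]            # e.g. [[0, 1, 2], [3, 4, 5]]
        trees[label] = [new_node() for _ in blocks]
        for x in patterns:
            for j, block in enumerate(blocks):
                ppc_insert(trees[label][j], tuple(x[f] for f in block))
    return trees

# Example from Section 3.1.1 with the partition of Section 3.2.1:
Xi = [("a", "b", "c", "x", "y", "z"), ("a", "b", "d", "x", "y", "z"),
      ("a", "e", "c", "x", "y", "u"), ("f", "b", "c", "x", "y", "v")]
trees = build_ppc_trees({"i": Xi}, {"i": [[0, 1, 2], [3, 4, 5]]})

Counting all nodes in trees["i"] (including the two block roots) gives the 16 nodes quoted in the example above, against 22 for the single PC-tree.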

4. Classification methods with synthetic patterns

We present three classification methods that work with synthetic patterns, viz., NNC(SP), PPC-aNNC, and an ensemble of several PPC-aNNC's.

4.1. NNC(SP)

NNC(SP) is the nearest neighbor classifier applied to the synthetic patterns. The synthetic set is generated explicitly first and then NNC is applied. This method is computationally inefficient, as the space and classification time requirements are both O(n^p), where p is the number of blocks in the partition and n is the original training set size. It is presented for comparison with the other methods that use the synthetic set. The distance measure used by this method is Euclidean distance.

4.2. PPC-aNNC

PPC-aNNC finds an approximate nearest neighbor of a given test pattern. The distance measure used here is the Manhattan (city block) distance. The method is suitable for discrete, numeric valued features only. PPC-aNNC is described in Algorithm 3.

Let Q be the given test pattern. The quantity distij is the distance between the sub-pattern Q^Bj and its approximate nearest neighbor in the set Xi^Bj (the set of sub-patterns of Xi with respect to the scheme Bj ∈ πi), where the H-valued features are ignored. The quantity di = Σ_{j=1}^{p} distij is then the distance between Q and its approximate nearest neighbor in the class with label i.

The method progressively finds a path in each Tij, starting from the root and ending at a leaf. The ordering of the features in Q^Bj must be the same as that of Bj ∈ πi used to construct the PC-tree Tij. At each node, the method looks for a child whose feature value is nearest to the corresponding feature value in Q^Bj (based on the absolute difference between the values) and proceeds to that node. If there is more than one such child, it proceeds to the child with the maximum count value. Let the chosen child node be v and the corresponding feature value in Q^Bj be q; then distij is increased by |v.feature-value - q|.

If Q is present in the original training set, then PPC-aNNC will find it, and in this case the neighbor obtained is the exact nearest neighbor.

4.2.1. Computational requirements of PPC-aNNC

Let the number of discrete values any feature can take be at most l, let the dimensionality of each pattern be d, and let the number of classes be c. Then the time complexity of PPC-aNNC is O(cld), since it finds only one path in each of the c PPC-trees (one per class), and at any node it searches only the child list (of size at most O(l)) of that node to find the next node in the path; each path has d nodes. For a given problem, c, l and d are constants (i.e., independent of the number of training patterns) that are typically much smaller than the number of training patterns. Thus, the effective time complexity of the method is only O(1); that is, the classification time of PPC-aNNC is constant and independent of the training set size. However, since it avoids an exhaustive search of the PPC-tree, it can find only an approximate nearest neighbor.

Algorithm 3 (PPC-aNNC(Test Pattern Q))
{Assumption (i): The set of PPC-trees {T1, ..., Tc}, where Ti = {Ti1, ..., Tip} for 1 ≤ i ≤ c, is assumed to be already built.
Assumption (ii): πi = {B1, B2, ..., Bp} (1 ≤ i ≤ c) is the partition of the feature set F for the class with label i, and is the same as that used in the construction of the PPC-tree Ti, where each block Bj = {fj1, ..., fj|Bj|} (for 1 ≤ j ≤ p) is an ordered set with the nth feature of block Bj being fjn.}
for each class with label i = 1 to c do
  for each Bj ∈ πi (1 ≤ j ≤ p) do
    Q' = Q^Bj
    current-node = root of Tij
    distij = 0
    for l = 1 to |Bj| do
      L = list of child nodes of current-node
      Choose the sublist L' of nodes v ∈ L for which |Q'[fjl] - v.feature-value| is minimum
      Choose a node v ∈ L' such that v.count is maximum  {ties are broken arbitrarily}
      distij = distij + |Q'[fjl] - v.feature-value|
      current-node = v
    end for
  end for
end for
for i = 1 to c do
  di = 0
  for j = 1 to p do
    di = di + distij
  end for
end for
Find dx = the minimum element in {d1, d2, ..., dc}
Output (class label of Q = x)

The space requirement of the method is mostly due to the PPC-tree structures.


The PPC-trees require O(n) space, where n is the total number of original patterns. For medium to large data sets, empirical studies show that the space requirement is much smaller than that of the conventional vector format (i.e., each pattern represented by a list of feature values). For small data sets, however, the space required may increase because of the data structure overhead (the space needed for pointers, etc.).
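The greedy descent of Algorithm 3 can be summarized in a few lines. The sketch below is our illustration (function names are ours); it assumes numeric feature values and the dictionary-based node layout used in the PPC-tree sketch of Section 3 (each node holds a count and a children map keyed by feature value), and it returns the class with the minimum summed block distance.

def descend(root, query_block):
    """One (class, block) descent of Algorithm 3: at every level follow the
    child whose value is closest to the query value, preferring higher counts,
    and accumulate the Manhattan distance along the chosen path."""
    node, dist = root, 0.0
    for q in query_block:                       # q = query value for this level's feature
        children = node["children"]
        best_gap = min(abs(v - q) for v in children)                  # nearest feature value
        candidates = [v for v in children if abs(v - q) == best_gap]
        value = max(candidates, key=lambda v: children[v]["count"])   # break ties by count
        dist += best_gap
        node = children[value]
    return dist

def ppc_annc(trees, partitions, query):
    """Approximate NN classification: sum the per-block distances per class
    and output the class label with the minimum total distance."""
    best_label, best_d = None, float("inf")
    for label, block_trees in trees.items():
        d = sum(descend(tree, [query[f] for f in block])
                for tree, block in zip(block_trees, partitions[label]))
        if d < best_d:
            best_label, best_d = label, d
    return best_label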

4.3. The ensemble of PPC-aNNC's

PPC-aNNC is a weak classifier, since it finds only an approximate nearest neighbor of the test pattern. Partition based pattern synthesis depends on the partition chosen, and the PPC-tree, and hence PPC-aNNC, depends not only on the partition chosen for each class but also on the ordering of the features within each block of the partitions. Thus different orderings of the features in each block result in different PPC-aNNC's. An ensemble of PPC-aNNC's, in which the final decision is made by simple majority voting, is empirically shown to perform well. Let there be r component classifiers in the ensemble; each component classifier is based on a random ordering of the features in each block.

Intuitively, the functioning of PPC-aNNC can be explained as follows. While finding the approximate nearest neighbor, PPC-aNNC emphasizes the features in a block according to their order: the first feature in a block is emphasized the most and the last feature the least. Notice that if there is only one feature in each block, for the partitions of all classes, then PPC-aNNC finds the exact nearest neighbor in the entire synthetic set (generated according to this partitioning), because in this case all features are emphasized equally. Since each PPC-aNNC in the ensemble is based on a random ordering of the features, the emphasis each of them places on the features differs considerably from that of the others. Because of this, the errors made by the individual classifiers become significantly uncorrelated, causing the ensemble to perform well.

The ensemble is suitable for parallel implementation with r machines, where each machine implements a different PPC-aNNC. Communication is required only when the test pattern is sent to all the individual classifiers and when the majority vote is taken, and therefore results in a very small overhead. On the other hand, if the ensemble is implemented on a single machine, then the space and time requirements are r times those of a single PPC-aNNC, which may not be feasible for large data sets.
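A sketch of the fusion step is given below (our illustration; it reuses the hypothetical build_ppc_trees and ppc_annc helpers from the earlier sketches). Each component differs from the others only in the random order given to the features inside every block, and the final label is the simple majority vote.

import random
from collections import Counter

def make_components(patterns_by_class, blocks_by_class, r, seed=0):
    """Build r PPC-aNNC components, each with its own random within-block
    feature ordering (the partition itself is shared by all components)."""
    rng = random.Random(seed)
    components = []
    for _ in range(r):
        partitions = {label: [rng.sample(block, len(block)) for block in blocks]
                      for label, blocks in blocks_by_class.items()}
        trees = build_ppc_trees(patterns_by_class, partitions)
        components.append((trees, partitions))
    return components

def ensemble_classify(components, query):
    """Simple majority vote over the component decisions (ties are broken by
    whichever label the Counter happens to list first)."""
    votes = Counter(ppc_annc(trees, partitions, query)
                    for trees, partitions in components)
    return votes.most_common(1)[0][0]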

5. Experiments

5.1. Datasets

We performed experiments with six different datasets, viz., OCR, WINE, VOWEL, THYROID, GLASS and PENDIGITS. Except for the OCR dataset, all are from the UCI Repository [24]. The OCR dataset is also used in [9,25], while the WINE, VOWEL, THYROID and GLASS datasets are used in [21]. The properties of the datasets are given in Table 1. All the datasets have only numeric valued features. The OCR dataset has binary discrete features, while the others have continuous valued features. Except for the OCR dataset, all datasets are normalized to have zero mean and unit variance for each feature and are subsequently discretized. Let a be a feature value after normalization and a' be its discrete value. We used the following discretization procedure: if (a < -0.75) then a' = -1; else if (a < -0.25) then a' = -0.5; else if (a < 0.25) then a' = 0; else if (a < 0.75) then a' = 0.5; else a' = 1.

5.2. Classifiers for comparison

The classifiers chosen for comparison purposes are as follows.

NNC: The test pattern is assigned to the class of its nearest neighbor in the training set. The distance measure used is Euclidean distance.

k-NNC: A simple extension of NNC, where the most common class among the k nearest neighbors is chosen. The distance measure is Euclidean distance. Three-fold cross-validation is done to choose the value of k.
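A minimal sketch of the five-level discretization rule of Section 5.1 (our rendering; the symmetric thresholds 0.25 and 0.75 and the levels -1 to 1 are as stated there):

def discretize(a):
    """Five-level discretization of a normalized (zero mean, unit variance)
    feature value, as used in Section 5.1."""
    if a < -0.75:
        return -1.0
    elif a < -0.25:
        return -0.5
    elif a < 0.25:
        return 0.0
    elif a < 0.75:
        return 0.5
    else:
        return 1.0

# Example: discretize(0.3) -> 0.5, discretize(-1.2) -> -1.0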

Table 1
Properties of the datasets used

Dataset      Number of features   Number of classes   Number of training examples   Number of test examples
OCR          192                  10                  6670                          3333
WINE         13                   3                   100                           78
VOWEL        10                   11                  528                           462
THYROID      21                   3                   3772                          3428
GLASS        9                    7                   100                           114
PENDIGITS    16                   10                  7494                          3498


Naive-Bayes classifier (NBC): This is a specialization of the Bayes classifier in which the features are assumed to be statistically independent. Further, the features are assumed to be of discrete type. Let X = (x1, ..., xd)^T be a pattern and l be a class label. Then the class conditional probability is P(X | l) = P(x1 | l) × ... × P(xd | l). P(xi | l) is taken as the ratio of the number of patterns in the class with label l having feature fi equal to xi to the total number of patterns in that class. The a priori probability of each class is taken as the ratio of the number of patterns in that class to the total training set size. The given test pattern is assigned to the class for which the posterior probability is maximum. The OCR dataset is used as it is, whereas the other datasets are normalized (to have zero mean and unit variance for each feature) and discretized as done for PPC-aNNC.

NNC with bootstrapped training set (NNC(BS)): We used the bootstrap method given by Hamamoto et al. [18] to generate an artificial training set. The bootstrapping method is as follows. Let X be a training pattern and let X1, ..., Xr be its r nearest neighbors in its class. Then X' = (Σ_{i=1}^{r} Xi)/r is the artificial pattern generated for X. In this manner, an artificial pattern is generated for each training pattern, and NNC is applied to this bootstrapped training set. The value of r is chosen by three-fold cross-validation.

Voting over multiple condensed nearest neighbors (MCNNC): The condensed nearest neighbor classifier (CNNC) first finds a condensed training set, a subset of the training set such that NNC with the condensed set classifies every training pattern correctly. The condensed set is built incrementally, and changing the order in which the training patterns are considered can give a new condensed set. Alpaydin [21] proposed to train multiple such subsets and take a vote over them, thus combining predictions from a set of concept descriptions. Two voting schemes are given: simple voting, where voters have equal weight, and weighted voting, where weights depend on the classifiers' confidence in their predictions. The second scheme is empirically shown to do well, so it is taken for comparison purposes. The paper [21] proposes some additional improvements based on bootstrapping, etc., which are not considered here.

Weighted nearest neighbor with feature projection (wNNFP): This is given by Kubat and Chen [22]. If d is the number of features, then d individual nearest neighbor classifiers are considered, each taking only one feature into account. That is, d separate projected training sets are formed, each used by an individual NNC. Weighted majority voting is used to combine the decisions of the individual NNCs, with the weight of each individual classifier based on its classification accuracy. Three-fold cross-validation is done for this.
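The NNC(BS) bootstrap described above replaces every training pattern by the centroid of its r nearest neighbors within its own class; a minimal NumPy sketch follows (ours; whether the pattern itself is counted among its r neighbors is an implementation choice, and the function name is ours).

import numpy as np

def bootstrap_class(patterns, r):
    """Hamamoto-style bootstrap for one class: each pattern is replaced by the
    (unweighted) centroid of its r nearest within-class neighbors."""
    X = np.asarray(patterns, dtype=float)                  # shape (n, d)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    nn_idx = np.argsort(dists, axis=1)[:, :r]              # r nearest (the pattern itself is included here)
    return X[nn_idx].mean(axis=1)                          # shape (n, d): the artificial training set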

NNC with synthetic patterns (NNC(SP)): This method is given in Section 4.1. The parameter P, i.e., the set of partitions, is chosen using the cross-validation method given in Section 5.3.

PPC-aNNC: This method is given in Section 4.2; the cross-validation method used to choose the parameter values is given in Section 5.3.

Ensemble of PPC-aNNC's: This method is given in Section 4.3; the cross-validation method used to choose the parameter values is given in Section 5.3.

5.3. Validation method

Three-fold cross-validation is used to fix the parameter values of the various classifiers described in this paper. For the methods proposed in this paper, viz., NNC(SP), PPC-aNNC and the ensemble of PPC-aNNC's, we give a detailed cross-validation procedure below.

The training set is randomly divided into three equal non-overlapping subsets. If an equal division is not possible, one or two randomly chosen training patterns are replicated to obtain an equal division. Two of these subsets are combined to form a training set called a validation training set, and the remaining one is called a validation test set. In this way we get three different validation training sets and corresponding validation test sets, which we call val-train-set-1, val-train-set-2, val-train-set-3 and val-test-set-1, val-test-set-2, val-test-set-3, respectively. For a given set of parameter values, val-train-set-i is used as the training set for the classifier and the classification accuracy (CA) is measured over val-test-set-i; this accuracy is called val-CA-i, for i = 1, 2, 3. The average of {val-CA-1, val-CA-2, val-CA-3} is called avg-val-CA and its standard deviation val-SD. The objective of cross-validation is to find a set of parameter values for which avg-val-CA is maximum; val-SD measures the spread of the val-CA-i values around avg-val-CA. An exhaustive search over all possible sets of parameter values is computationally expensive, and hence we use a greedy approach for choosing the set of parameter values.
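As a concrete reading of this three-fold procedure, here is a small sketch (ours, not the authors' code) that builds the (val-train-set-i, val-test-set-i) pairs and computes avg-val-CA and val-SD for one parameter setting; training patterns are assumed to be (feature-vector, label) pairs, build_classifier is any hypothetical constructor returning a classification function, and the population standard deviation is used for val-SD.

import random
import statistics

def three_fold_sets(training_set, seed=0):
    """Split the training set into three equal folds (replicating one or two
    random patterns if needed) and pair each fold with the other two."""
    data = list(training_set)
    rng = random.Random(seed)
    while len(data) % 3 != 0:
        data.append(rng.choice(data))
    rng.shuffle(data)
    k = len(data) // 3
    folds = [data[0:k], data[k:2 * k], data[2 * k:]]
    # (val-train-set-i, val-test-set-i) for i = 1, 2, 3
    return [(folds[(i + 1) % 3] + folds[(i + 2) % 3], folds[i]) for i in range(3)]

def avg_val_ca(build_classifier, fold_pairs):
    """avg-val-CA and val-SD (in %) for one setting of the parameter values."""
    cas = []
    for train, test in fold_pairs:
        classify = build_classifier(train)
        correct = sum(1 for x, y in test if classify(x) == y)
        cas.append(100.0 * correct / len(test))
    return statistics.mean(cas), statistics.pstdev(cas)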


(1) NNC(SP): The parameters of NNC(SP) are the partitions of the set of features F for each class, which are used for performing partition based pattern synthesis. Let these partitions be represented as a set P = {π1, π2, ..., πc}, where πi (1 ≤ i ≤ c) is the partition used for the class with label i. Further, let Pp be the set of partitions in which each element (i.e., each partition) has exactly p blocks. The element of {P1, P2, P3, Pd} (where d = |F|) which gives the maximum avg-val-CA is chosen. Pp for a given p is obtained either from domain knowledge or by using the method given in Section 2.4.

The OCR dataset consists of handwritten-character images on a two-dimensional rectangular grid of size 16 × 12, where for each cell the presence of ink is represented by 1 and its absence by 0. It is known that, for a given class, the values in nearby cells are more highly dependent than those in far apart cells (nearness here is based on the physical closeness of the cells). This knowledge is used for obtaining the partitions. An entire image is represented as a 192-dimensional vector where the first 12 features correspond to the first row of the grid, the next 12 features to the second row, and so on. Let the set of features in this order be F = {f1, f2, ..., f192}. A partitioning of F with p blocks (where p = 1, 2, 3 or 192), i.e., {B1, B2, ..., Bp}, is obtained in the following manner: the first 192/p features go into block B1, the next 192/p features into block B2, and so on. For the other datasets, viz., WINE, VOWEL, THYROID, GLASS and PENDIGITS, partitions are obtained by using the method given in Section 2.4.

(2) PPC-aNNC: The parameters are the partitions of the set of features, as used by NNC(SP), together with the ordering of the features in each block of each partition. These are chosen as follows. Pp for p = 1, 2, 3 and d is obtained as done for NNC(SP). For each Pp (where p = 1, 2, 3 or d), the features in each block (for each partition) are randomly ordered and avg-val-CA is obtained; 100 such runs (each with a different random ordering of the features) are made for each Pp. The Pp, along with the ordering of features, for which avg-val-CA is maximum is then chosen.

(3) Ensemble of PPC-aNNC's: The parameters here are (i) the number of component classifiers (r), (ii) the set of partitions Pi = {π1, ..., πc} used by each component i (1 ≤ i ≤ r), and (iii) the ordering of the features in each block of each partition. These parameters are chosen by restricting the search space as follows. The set of partitions used by each component classifier is the same (except for the ordering of features); that is, P1 = P2 = ... = Pr, and it is chosen as done for PPC-aNNC. With 100 random orderings of the features, PPC-aNNC is run and the respective classification accuracies (CA) are obtained. Let avg-CA and max-CA be the average and the maximum CA of these 100 runs, respectively. We define a threshold classification accuracy thr-CA = (avg-CA + max-CA)/2.


Fifty component classifiers are then obtained by finding 50 random orderings of the features such that each component has a CA greater than thr-CA; this is done so that good component classifiers are chosen. For these 50 component classifiers, the respective CAs and orderings of features are stored. This is done for each pair (val-train-set-i, val-test-set-i), for i = 1, 2 and 3, so in total we get 150 (i.e., 50 × 3) orderings of features along with their respective CAs. This list of orderings is sorted by CA value and the best r orderings are chosen for the final ensemble, where r, the number of component classifiers, is chosen as described below.

For each pair (val-train-set-i, val-test-set-i), for i = 1, 2 and 3, we obtain 50 component classifiers as explained above. From these 50 components, we choose m distinct components at random to form an ensemble; the CA of this ensemble is measured and is called val-CA-i-m. This is done for m = 1 to 50 and for i = 1, 2 and 3. The quantity avg-val-CA-m is the average value of {val-CA-1-m, val-CA-2-m, val-CA-3-m}, and val-SD-m is its standard deviation. The number of components r is chosen such that avg-val-CA-r is the maximum element in {avg-val-CA-1, avg-val-CA-2, ..., avg-val-CA-50}.

5.4. Experimental results

Tables 2 and 3 give a comparison between the classifiers. They show the classification accuracy (CA) of each classifier as a percentage over the respective test sets; the parameter values are chosen by cross-validation as described in Section 5.3. Table 3 shows the CAs for the methods proposed by us. Along with the CA values, it also shows the parameter values p (the number of blocks used in the synthesis) and r (the number of components used in the ensemble). The points worth noting are as follows: (i) for large values of n (the number of original training patterns) and p, it may not be feasible to handle the entire synthetic set, which is required in the case of NNC(SP); (ii) if p = d, where d is the total number of features, each feature goes into a separate block (i.e., each block contains only one feature), so only one ordering of the features is possible; in this case each component of the ensemble of PPC-aNNC's is identical, and the CA of one component equals the CA of the ensemble; (iii) if p = 1, the synthetic and original sets are the same.

Table 2
A comparison between the classifiers (showing CA (%))

Dataset      NNC     k-NNC   NBC     NNC(BS)   MCNNC   wNNFP
OCR          91.12   92.68   81.01   92.88     91.97   10.02
WINE         94.87   96.15   91.03   97.44     95.00   92.30
VOWEL        56.28   60.17   36.80   57.36     55.97   23.38
THYROID      93.14   94.40   83.96   94.57     92.23   92.71
GLASS        71.93   71.93   60.53   71.93     71.67   53.5
PENDIGITS    96.08   97.54   83.08   97.57     97.25   45.05


Table 3
A comparison between the classifiers

Dataset      NNC(SP)          PPC-aNNC         Ensemble of PPC-aNNC's
             CA (# blocks)    CA (# blocks)    CA (# blocks) (# components)
OCR          93.01 (3)        84.91 (3)        94.15 (3) (45)
WINE         96.15 (2)        89.74 (2)        94.87 (2) (7)
VOWEL        56.28 (1)        43.51 (1)        46.32 (1) (33)
THYROID      97.23 (d)        94.16 (2)        94.66 (2) (49)
GLASS        71.93 (1)        60.53 (1)        67.54 (1) (7)
PENDIGITS    96.08 (1)        90.19 (1)        96.34 (1) (29)

The cross-validation results of the ensemble of PPC-aNNC's for four datasets (WINE, VOWEL, GLASS and PENDIGITS) are given in Tables 4–7, respectively, which show the average CA (avg-val-CA-m) and the standard deviation (val-SD-m) for various values of m (the number of components). For the remaining two datasets (OCR and THYROID), results similar to those for WINE, GLASS and PENDIGITS were observed and hence are not presented.

From the results presented, some observations are:

(1) The methods given by us (viz., NNC(SP) and the ensemble of PPC-aNNC's) outperform the other methods on the OCR and THYROID datasets. For the remaining datasets, our methods show good performance.

(2) The ensemble of PPC-aNNC's performs uniformly better than NBC, wNNFP and PPC-aNNC over all datasets.

(3) It is interesting to note that PPC-aNNC outperforms NBC and wNNFP over all datasets except the WINE dataset.

The actual space requirement of the PPC-tree is, on average, about 60% to 90% of that of the respective original sets for the OCR, THYROID and PENDIGITS datasets. For the other datasets, the actual space requirement is slightly more than that of the original set; this is because, for small datasets, the data structure overhead is larger than the space saved by the sharing of nodes in the PPC-tree.

Table 4
Cross-validation results of the ensemble of PPC-aNNC's for the WINE dataset (avg-val-CA, with val-SD in parentheses)

# component    Number of blocks
classifiers    1              2              3              d
1              95.09 (1.38)   96.08 (1.38)   95.09 (1.38)   77.45 (2.83)
7              98.03 (1.38)   99.06 (1.38)   98.03 (1.38)   77.45 (2.41)
10             98.03 (1.38)   99.02 (1.38)   98.03 (2.77)   77.45 (2.28)
20             98.03 (1.38)   98.04 (1.38)   99.01 (1.38)   77.45 (1.93)
30             98.03 (1.38)   98.04 (1.38)   99.01 (1.38)   77.45 (1.93)
40             98.03 (1.38)   98.04 (1.38)   99.01 (1.38)   77.45 (2.09)
50             99.01 (1.38)   98.04 (1.38)   99.01 (1.38)   77.45 (2.45)

Table 5
Cross-validation results of the ensemble of PPC-aNNC's for the VOWEL dataset (avg-val-CA, with val-SD in parentheses)

# component    Number of blocks
classifiers    1              2              3              d
1              82.77 (1.49)   82.57 (3.15)   79.73 (2.45)   23.86 (6.04)
10             91.86 (2.63)   88.06 (2.78)   85.22 (2.45)   23.86 (7.34)
20             91.86 (1.75)   89.58 (2.09)   85.41 (2.45)   23.86 (7.34)
30             92.23 (1.93)   90.15 (2.09)   85.98 (2.45)   23.86 (7.20)
33             93.18 (2.02)   90.72 (2.56)   85.98 (2.45)   23.86 (7.34)
40             92.42 (2.14)   89.58 (2.33)   85.98 (2.45)   23.86 (6.04)
50             92.61 (1.67)   89.96 (2.38)   85.22 (2.45)   23.86 (6.04)

Table 6
Cross-validation results of the ensemble of PPC-aNNC's for the GLASS dataset (avg-val-CA, with val-SD in parentheses)

# component    Number of blocks
classifiers    1              2              3              d
1              67.65 (2.40)   71.57 (5.00)   63.73 (7.34)   46.08 (5.00)
7              76.47 (2.40)   67.65 (6.35)   66.67 (7.34)   46.08 (5.00)
10             70.59 (2.40)   69.61 (6.04)   65.69 (7.34)   46.08 (5.00)
20             72.55 (1.39)   71.57 (5.00)   64.71 (7.34)   46.08 (5.00)
30             73.53 (0.00)   70.59 (4.80)   66.67 (7.34)   46.08 (5.00)
40             73.53 (2.40)   70.59 (4.16)   65.69 (7.34)   46.08 (5.00)
50             71.57 (1.39)   72.55 (6.04)   65.69 (7.34)   46.08 (5.00)


Table 7
Cross-validation results of the ensemble of PPC-aNNC's for the PENDIGITS dataset (avg-val-CA, with val-SD in parentheses)

# component    Number of blocks
classifiers    1              2              3              d
1              93.70 (0.32)   93.68 (0.42)   90.28 (0.94)   29.06 (5.00)
10             98.16 (0.26)   96.99 (0.37)   93.15 (1.19)   29.06 (5.00)
20             98.45 (0.18)   97.17 (0.30)   93.76 (1.03)   29.06 (5.00)
29             98.71 (0.14)   97.30 (0.30)   93.79 (1.13)   29.06 (5.00)
30             98.57 (0.27)   97.41 (0.21)   93.79 (1.13)   29.06 (5.00)
40             98.64 (0.14)   97.37 (0.31)   93.78 (0.99)   29.06 (5.00)
50             98.51 (0.19)   97.43 (0.36)   93.78 (1.04)   29.06 (5.00)

6. Conclusions

This paper presented a fusion of multiple approximate nearest neighbor classifiers having a constant (O(1)) classification time upper bound and good classification accuracy. Each individual classifier of the ensemble is a weak classifier which works with a synthetic set generated by the novel pattern synthesis technique called partition based pattern synthesis, which reduces the curse of dimensionality effect. Further, explicit generation of the synthetic set is avoided by doing implicit pattern synthesis within the classifier, which works directly with a compact representation of the original training set called the PPC-tree. The proposed ensemble of PPC-aNNC's with a parallel implementation is a fast and efficient classifier suitable for large, high dimensional data sets. Since it has a constant classification time upper bound, it is also suitable for online, real-time applications.

7. Future work

A formal explanation for the good behavior of the ensemble of PPC-aNNC's needs to be given. Next, one needs to answer questions such as 'What is a good partition for doing partition based pattern synthesis and

how to find it?' We gave a partitioning method based on pair-wise correlations between features within a class, but it takes into account only linear dependencies between the features, so it can fail to capture higher order dependencies. A general partitioning method that is also computationally efficient, and that can be used for both numerical and categorical features, still needs to be found.

Acknowledgements

Research work reported here is supported in part by AOARD Grant F62562-03-P-0318. Thanks to the three anonymous reviewers for constructive comments. Special thanks to B.V. Dasarathy for prompt feedback and many suggestions during the revision that significantly improved the content of the paper.

References

[1] B.V. Dasarathy, Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques, IEEE Computer Society Press, Los Alamitos, California, 1991.
[2] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., Wiley-Interscience, John Wiley & Sons, 2000.
[3] T. Cover, P. Hart, Nearest neighbor pattern classification, IEEE Transactions on Information Theory 13 (1) (1967) 21–27.
[4] K. Fukunaga, D. Hummels, Bias of nearest neighbor error estimates, IEEE Transactions on Pattern Analysis and Machine Intelligence 9 (1987) 103–112.
[5] G. Hughes, On the mean accuracy of statistical pattern recognizers, IEEE Transactions on Information Theory 14 (1) (1968) 55–63.
[6] A. Jain, B. Chandrasekharan, Dimensionality and sample size considerations in pattern recognition practice, in: P. Krishnaiah, L. Kanal (Eds.), Handbook of Statistics, vol. 2, North Holland, 1982, pp. 835–855.
[7] K. Fukunaga, D. Hummels, Bayes error estimation using Parzen and k-NN procedures, IEEE Transactions on Pattern Analysis and Machine Intelligence 9 (1987) 634–643.
[8] K. Fukunaga, Introduction to Statistical Pattern Recognition, second ed., Academic Press, 1990.
[9] V. Ananthanarayana, M.N. Murty, D. Subramanian, An incremental data mining algorithm for compact realization of prototypes, Pattern Recognition 34 (2001) 2249–2251.


[10] J. Han, J. Pei, Y. Yin, Mining frequent patterns without candidate generation, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, Dallas, Texas, USA, 2000.
[11] T. Zhang, R. Ramakrishnan, M. Livny, BIRCH: an efficient data clustering method for very large databases, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1996.
[12] A. Guttman, R-trees: a dynamic index structure for spatial searching, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1984, pp. 47–57.
[13] B. Efron, Bootstrap methods: another look at the jackknife, Annals of Statistics 7 (1979) 1–26.
[14] A. Jain, R. Dubes, C. Chen, Bootstrap technique for error estimation, IEEE Transactions on Pattern Analysis and Machine Intelligence 9 (1987) 628–633.
[15] M. Chernick, V. Murthy, C. Nealy, Application of bootstrap and other resampling techniques: evaluation of classifier performance, Pattern Recognition Letters 3 (1985) 167–178.
[16] S. Weiss, Small sample error rate estimation for k-NN classifiers, IEEE Transactions on Pattern Analysis and Machine Intelligence 13 (1991) 285–289.
[17] D. Hand, Recent advances in error rate estimation, Pattern Recognition Letters 4 (1986) 335–346.
[18] Y. Hamamoto, S. Uchimura, S. Tomita, A bootstrap technique for nearest neighbor classifier design, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (1) (1997) 73–79.
[19] L. Breiman, Bagging predictors, Machine Learning 24 (1996) 123–140.
[20] D.B. Skalak, Prototype Selection for Composite Nearest Neighbor Classifiers, Ph.D. Thesis, Department of Computer Science, University of Massachusetts Amherst, 1997.
[21] E. Alpaydin, Voting over multiple condensed nearest neighbors, Artificial Intelligence Review 11 (1997) 115–132.
[22] M. Kubat, W.K. Chen, Weighted projection in nearest-neighbor classifiers, in: Proceedings of the First Southern Symposium on Computing, The University of Southern Mississippi, December 4–5, 1998.
[23] S.D. Bay, Combining nearest neighbor classifiers through multiple feature subsets, Intelligent Data Analysis 3 (3) (1999) 191–209.
[24] P.M. Murphy, UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/mlearn/MLRepository.html], Department of Information and Computer Science, University of California, Irvine, CA, 1994.
[25] T.R. Babu, M.N. Murty, Comparison of genetic algorithms based prototype selection schemes, Pattern Recognition 34 (2001) 523–525.

[18] Y. Hamamoto, S. Uchimura, S. Tomita, A bootstrap technique for nearest neighbor classifier design, IEEE Transactions on Pattern Analysis and Machine Intelligence 19 (1) (1997) 73–79. [19] L. Breiman, Bagging predictors, Machine Learning 24 (1996) 123– 140. [20] D.B. Skalak, Prototype Selection for Composite Nearest Neighbor Classifiers, Ph.D. Thesis, Department of Computer Science, University of Massachusetts Amberst, 1997. [21] E. Alpaydin, Voting over multiple condensed nearest neighbors, Artificial Intelligence Review 11 (1997) 115–132. [22] M. Kubat, W.K. Chen, Weighted projection in nearest-neighbor classifiers, in: Proceedings of the First Southern Symposium on Computing, The University of Southern Mississippi, December 4– 5, 1998. [23] S.D. Bay, Combining nearest neighbor classifiers through multiple feature subsets, Intelligent Data Analysis 3 (3) (1999) 191–209. [24] P.M. Murphy, UCI Repository of Machine Learning Databases [http://www.ics.uci.edu/mlearn/MLRepository.html], Department of Information and Computer Science, University of California, Irvine, CA, 1994. [25] T.R. Babu, M.N. Murty, Comparison of genetic algorithms based prototype selection schemes, Pattern Recognition 34 (2001) 523–525.