Nearest Neighbor Classification with a Local Asymmetrically Weighted Metric

Francesco Ricci and Paolo Avesani
Istituto per la Ricerca Scientifica e Tecnologica
38050 Povo (TN) Italy
email: [email protected]

Abstract

This paper introduces a new local asymmetric weighting scheme for the nearest neighbor classification algorithm. It is shown, with both theoretical arguments and computer experiments, that good compression rates can be achieved while outperforming the accuracy of the standard nearest neighbor classification algorithm and obtaining almost the same accuracy as the k-NN algorithm with k optimised in each data set. The improvement in time performance is proportional to the compression rate and in general it depends on the data set. A comparison of the classification accuracy of the proposed algorithm with that of a local symmetrically weighted metric and of a global metric strongly indicates that the proposed scheme is to be preferred.


1 Introduction

Nearest neighbor algorithms (NN) are ubiquitous in many research areas such as pattern recognition, machine learning and case based reasoning. The k-NN algorithm, a simple generalisation of NN (which is 1-NN), maintains a set of training examples and classifies a new example as the most frequent class among the k most similar examples in the training set. The wide diffusion of NN is motivated by its good performance, which can be obtained in many situations, and by its ease of use, as no parameter needs to be tuned and very little experience is required on the part of the user. However, some limitations of the basic NN algorithm have been recognised: it has poor generalisation performance; it suffers from the existence of noisy features; it requires big training sets; it suffers from the so-called "curse of dimensionality"; and as the training set grows, run time performance deteriorates linearly. All these points have been addressed by a number of papers (see [9] for a large collection of papers on NN and its generalisations) and the proposed improvements effectively mitigate those limitations. This paper is mostly concerned with the task of reducing the number of examples stored in the training set in order to speed up query time, and with improving classification accuracy by learning feature relevance in a context-sensitive way. That goal is pursued by introducing a new local metric [19] and a procedure that progressively transforms the Euclidean metric into a set of locally defined metrics attached to a reduced set of examples in the training set. The idea of using local metrics is not new: many approaches have been considered [24, 25, 27, 8, 3, 4, 12, 11] and some of these will be reviewed in the next Section. The novelty of our proposal lies in the weighting method and in the particular process that learns the local metric. Usually a positive weight $w_i$ is associated to each feature i.
In local metrics $w_i$ is a function of the example from which the distance is taken, and the form of this dependence varies from one approach to another. For an ordered feature, a distinction is made between the "left" and "right" directions, introducing two weights for each feature. The motivation for this is related to the algorithm used for computing local weights. It is based on a feedback method [29] similar to the delta rule [21, 1, 2] or to that used in competitive learning [31] or learning vector quantization [14]. But, while in the majority of these approaches the stored examples ("weight vectors" representing cluster centres or "codebooks") are moved in the input space (see footnote 1), we instead change the metrics attached to the stored examples, and therefore directional information has to be taken into account. As a result of the asymmetrical weighting scheme, the local metric does not have an invariance property that is usually assumed, namely $d(x, x + \Delta) = d(x, x - \Delta)$, where x is a generic example, $\Delta$ is a displacement and d is a metric. This means that the distance measured in the "north" direction need not equal the distance measured in the "south" direction. A practical result is that substantial compression rates can be achieved. It turns out that it is possible to store a small fraction of the training set while achieving greater accuracy than 1-NN and almost the same accuracy as k-NN, where k is chosen as the value that maximises accuracy in a particular data set. The compression rate obviously depends on the data set, but on ten data sets we found that less than 10% of the training data still provides the same accuracy as k-NN. As a consequence query time can be reduced by approximately the same rate.

Footnote 1: In some cases it is not advisable to create these abstract cases, as the meaning that a user has attached to them can be lost, or they could result in feature values that are not possible for other reasons.

The theoretical motivation for this ability is clarified by the bounds on the minimum number of examples required for exact learning of a concept in a given class. For bidimensional concepts represented by a rectilinear polygon it is shown that a number of points equal to the number of faces of the polygon is enough to learn the concept exactly. This compares favourably with a quadratic bound proved by Salzberg et al. [22] for the Euclidean metric. Similarly, given a concept represented by a polygonal tessellation, it is shown that an $\epsilon$-similar concept can be learned exactly with fewer points than those required by NN if the Euclidean metric is used.

Our study is related to the line of research called editing experiments, that is, those techniques aimed at selecting a subset of prototypes from a given set of training examples, both for computational efficiency and for making the classification more reliable (see [9] Chapter 6). Even if the final result is similar, an edited training set, the focus of this paper is quite different. Rather than search for a subset of the original training set, we simply choose a subset at random and we adapt the metric to that choice. If both good classification accuracy and a good compression rate can be achieved in this way, then this would support the inherent ability of the metric and the learning process to adapt to data. It seems obvious that a good choice of the stored examples could further improve both performance and accuracy, but this is not the focus of this work, and the accurate selection of an edited set requires a costly search process that could prevent the system from being incremental with respect to new data. With respect to the techniques used, many editing experiments [28, 20] try to select those examples that are closer to the class boundaries, or delete those examples that reduce accuracy [30, 5].
Conversely, using a local asymmetrically weighted metric, it seems more reasonable to select those examples that lie within the training clusters [6] or even better to choose one example for each set in a rectangular decomposition of the target concept (see Section 6). Some evidence to support these ideas will be given below, but we believe that this aspect deserves more detailed study.

2 Previous Approaches

This section reviews other contributions to the concept of local metric. Local metrics have been introduced previously by a number of authors. Here attention is focused mainly on three aspects:

The feature type. Some approaches deal with discrete features, some make the additional restriction that the features must be boolean, some consider real features. Even if some methods are equally applicable to both real and categorical features, one of the two types is always more suitable.

Kind of locality. Feature weights may be invariant on regions of the input space, they may depend on the point from which the distance is taken, or they may depend on feature values. Moreover, weights may be attached to the stored examples or may be computed as a function of the query example.

How weights are computed. For categorical features, statistics on the distribution of values and the correlation between values and categories are exploited. Weights for real features have been computed both with conditional probabilities (see [29] for a discussion) and with feedback-based learning techniques.

Creecy et al. [8] consider boolean features and compute two types of weights: the per category feature importance (PCF), which is the conditional probability that an example belongs to a class given that a feature is true; and the cross category feature importance (CCF), which averages the PCF over all the categories. More formally, PCF assigns weights using $w_i(c) = P[c \mid x_i = 1]$, where c is a class and $x_i$ is the value of the i-th (boolean) feature, whereas CCF is defined as $w_i = \sum_{c_j \in C} P[c_j \mid x_i = 1]^2$. Only the first definition yields a local metric: for each training example x the weight $w_i(c)$ is stored with the i-th feature of x, where c is the class of x. The metric is local, but a feature weight, for a given example, depends only on its feature value and on the class of the example. In [8] the authors show that the per category weighting method gives better results on a particular classification problem using an error minimising metric that combines the weighted local metric and the maximisation of $P[c \mid x_i = 1]$. Interestingly, in a very similar problem they obtained better results with cross-category weighting and k nearest neighbors. Stanfill and Waltz [27] introduced the notion of a context-sensitive similarity metric early, in their Memory-Based Reasoning approach. The need for different weightings was motivated mostly by coupling the relevance of features to the task to be solved. The cross category feature importance CCF discussed in [8] was in fact introduced earlier by Stanfill and Waltz for the task of pronouncing novel words, using a database of English words and their pronunciation. In this case feature values are discrete but not restricted to being boolean.
When a test example y is matched with memory, the weight to be used for the i-th feature is given by $w_i(y) = \sqrt{\sum_{c \in C} P[c \mid y_i]^2}$, where $P[c \mid y_i]$ is the conditional probability that a training example is in class c given that its i-th feature value is $y_i$ (see footnote 2). They observed that the above weighting technique is too strict if combined with the Hamming distance (with $d_i(x_i, y_i)$ defined as the equality test on $x_i$ and $y_i$). So they introduced a different distance on each feature space, $d_i(x_i, y_i) = \sum_{c \in C} (P[c \mid y_i] - P[c \mid x_i])^2$, which was called VDM, Value Distance Metric. Both previous approaches apply only to categorical features.

Footnote 2: Stanfill and Waltz in fact compute different weights for different tasks, i.e., the weight depends on the goal feature that has to be predicted.

For real features, Short and Fukunaga [24, 25] have proposed, for the two-class classification problem, a local metric that minimises $E[\omega(x, y)]$, where $\omega(x, y)$ is the quadratic difference between the probability of misclassifying x given that the NN of x is y and the probability of misclassifying x given a hypothetically infinite set. They show that the metric $d(x, y) = |P[1 \mid x] - P[1 \mid y]|$ achieves that goal ($P[1 \mid x]$ is the probability of x being classified 1). They also give an estimate of such a metric: $d(x, y) = \left.\frac{\partial}{\partial t} P[1 \mid t]\right|_{t=x} |x - y|$. Myles and Hand [16] generalise that framework to the multi-class case, showing various possible extensions. Salzberg [23] uses a global metric on real features with an additional weight for each stored case that measures how frequently the case has been used to make a correct classification (prediction). Some cases are "generalised", which means, in this case, that the point representation is replaced by a two-point representation. Two points x, y in $[0, 1]^N$ define a hyper-rectangle, that is, the subset $H_{xy} \subset [0, 1]^N$, $H_{xy} = \{ z \in [0, 1]^N \mid z_i \in [\min(x_i, y_i), \max(x_i, y_i)] \ \forall i = 1, \ldots, N \}$. The distance between a point and a hyper-rectangle is defined as the distance between a point and a set in Euclidean geometry (minimal distance), so, for example, if a point is inside a hyper-rectangle the distance between the two is zero. In this way the metric behaves differently in different parts of the input space, depending on the size of the hyper-rectangle. In [7] Cost and Salzberg exploit a modified version of VDM, with a weighting scheme that weights examples in memory according to their performance history. Aha and Goldstone [3, 4] claim that an attribute's importance in human classification depends on its context, and they provide a computational model that is able to generalise with results similar to those shown by experiments conducted with humans. One of the computational models proposed (GCM-ISW) uses both local and global weights and applies to both categorical and real features. When the distance between an input point x and a point y in the training set is to be computed, an interpolation of the local weights (attached to y) and the global weights is used. Weights are updated with a method similar to the delta rule [21]. They show that GCM-ISW provides a better fit to the subject data than other methods with no local weights. Therefore, they provide a cognitive foundation for local metrics and for local weight update rules (see the references in [4] for other models of human concept formation with local metrics).
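The value distance metric (VDM) and the query-dependent feature weight of Stanfill and Waltz reviewed in this section can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function names, the frequency-based estimate of the conditional probabilities and the toy data are hypothetical and do not come from [27].

```python
import math
from collections import Counter, defaultdict


def class_conditionals(values, labels):
    """Estimate P[c | v] for one categorical feature by relative frequencies."""
    counts = defaultdict(Counter)
    for v, c in zip(values, labels):
        counts[v][c] += 1
    return {
        v: {c: k / sum(cnt.values()) for c, k in cnt.items()}
        for v, cnt in counts.items()
    }


def feature_weight(p, classes, yi):
    """w_i(y) = sqrt(sum_c P[c | y_i]^2) for the query value y_i."""
    return math.sqrt(sum(p[yi].get(c, 0.0) ** 2 for c in classes))


def vdm(p, classes, xi, yi):
    """d_i(x_i, y_i) = sum_c (P[c | y_i] - P[c | x_i])^2."""
    return sum((p[yi].get(c, 0.0) - p[xi].get(c, 0.0)) ** 2 for c in classes)


# Toy data (hypothetical): one feature with values "a"/"b" and two classes.
values = ["a", "a", "b", "b"]
labels = [0, 0, 0, 1]
p = class_conditionals(values, labels)
classes = sorted(set(labels))
print(vdm(p, classes, "a", "b"))
print(feature_weight(p, classes, "b"))
```

Values that never occur in the training data are not handled here; a real implementation would need a policy for them.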

3 Weights and Metrics

This section introduces the definitions of a system of asymmetric weights and of a local asymmetrically weighted metric. Let $\prod_{j=1}^{d} F_j$ be the input space and $x \in \prod_{j=1}^{d} F_j$ be a generic example. Assume that $F_j$ is the closed unit interval $[0, 1]$ (real feature) or a generic set of symbols (categorical feature). Each space $F_j$ is endowed with a feature metric $\delta_j : (F_j \cup \{?\}) \times (F_j \cup \{?\}) \to \mathbb{R}_{\geq 0}$, where $?$ is a special symbol denoting an unknown value in $F_j$:

$$\delta_j(x_j, y_j) = \begin{cases} |x_j - y_j| & \text{if } x_j, y_j \in [0, 1] \\ 0 & \text{if } F_j \text{ is a set of symbols and } x_j = y_j \\ 1 & \text{if } F_j \text{ is a set of symbols and } x_j \neq y_j \\ 0.5 & \text{if } x_j \text{ or } y_j \text{ is unknown} \end{cases}$$

Given an example $x \in \prod_{j=1}^{d} F_j$, a set of asymmetric weights $w(x)$ for x is a $2 \times d$ matrix with values in $[0, 1]$. Let $w_j^k(x)$, $k = 0, 1$ and $j = 1, \ldots, d$, be a generic element of $w(x)$. Assume that $w_j^0(x) = w_j^1(x)$ if $F_j$ is a set of symbols. Let y be another point in the input space; if $F_j = [0, 1]$ then the following notation will be adopted:

$$w_j(x)\, \delta_j(x_j, y_j)^p = \begin{cases} w_j^0(x)\, \delta_j(x_j, y_j)^p & \text{if } y_j \le x_j \\ w_j^1(x)\, \delta_j(x_j, y_j)^p & \text{otherwise} \end{cases} \qquad (1)$$

for all $p = 1, \ldots, \infty$. Conversely, when $F_j$ is a set of symbols, $w_j(x)\, \delta_j(x_j, y_j)^p = w_j^0(x)\, \delta_j(x_j, y_j)^p = w_j^1(x)\, \delta_j(x_j, y_j)^p$. Given a case base $CB \subset \prod_{j=1}^{d} F_j$ and a set of asymmetric weights for each example in CB, a local asymmetrically weighted metric (LASM) is defined as follows:

$$\delta : CB \times \prod_{j=1}^{d} F_j \to \mathbb{R}_{\geq 0}, \qquad \delta(x, y) = \Big( \sum_{j=1}^{d} w_j(x)\, \delta_j(x_j, y_j)^p \Big)^{1/p} \qquad (2)$$
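The definitions above can be made concrete with a short sketch. It is a minimal illustration for finite p, not the authors' implementation: the function names, the use of `None` for the unknown value, and the convention of using the first weight when a real value is unknown are assumptions made here.

```python
def feature_metric(xj, yj, is_real):
    """Feature metric: |x - y| on [0, 1], 0/1 on symbols, 0.5 if unknown."""
    if xj is None or yj is None:  # None plays the role of the unknown value
        return 0.5
    if is_real:
        return abs(xj - yj)
    return 0.0 if xj == yj else 1.0


def lasm(x, y, w, is_real, p=2.0):
    """Distance from the stored example x to y, for finite p >= 1.

    w[j] = (w_j^0, w_j^1) are the two weights attached to x for feature j.
    """
    total = 0.0
    for j, (xj, yj) in enumerate(zip(x, y)):
        d = feature_metric(xj, yj, is_real[j])
        known_real = is_real[j] and xj is not None and yj is not None
        # Equation (1): w_j^0 if y_j <= x_j, w_j^1 otherwise.
        # Assumption: w_j^0 is also used for symbols and unknown values.
        wj = w[j][1] if known_real and yj > xj else w[j][0]
        total += wj * d ** p
    return total ** (1.0 / p)


# Hypothetical stored example with a smaller weight to the right on feature 0.
x = (0.4, 0.82)
w = [(1.0, 0.5), (1.0, 1.0)]
real = [True, True]
print(lasm(x, (0.5, 0.82), w, real, p=1))  # displacement to the right
print(lasm(x, (0.3, 0.82), w, real, p=1))  # same displacement to the left
```

The two printed distances differ although the displacement has the same size, which is the asymmetry the definition is designed to allow.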

In general all the local and global metrics on real features obey the property $d(x, x + \Delta) = d(x, x - \Delta)$, which means that the metric is invariant with respect to the inversion of the direction of the displacement. This is no longer true for the LASM $\delta$. In fact all of the previously introduced metrics on real features are functions of the vector $|x - y|$, whereas the LASM $\delta$ is a true function of $(x - y)$. This distinction obviously cannot be made on categorical feature spaces. Moreover $\delta$, like other local metrics [27, 3, 4], is not symmetric, i.e. $\delta(x, y) \neq \delta(y, x)$ when $x, y \in CB$, because the weights stored for x may be different from those of y. The importance of an asymmetric weighting scheme is best understood when, given a target concept and a case base, a system of weights is to be found such that the nearest neighbor classifier endowed with the metric defined by that set of weights is as accurate as possible. Asymmetrically weighted local metrics adapt more flexibly to the data, as will be shown in the example in Section 5; they enable a freer choice of the examples to store in the case base (see [19, page 306]) and make high compression rates possible (see Section 6). Figure 1 shows a set of level curves for different values of p in a two dimensional space. Each curve is the set of points with distance 0.08 from the point x = (0.4, 0.82), with different values for p. It is clear that for $p \to \infty$ the level curves tend to be rectangular.

Figure 1: Level curves for different values of p (p = 1, 2, 3, 10).

4 The Learning Procedure

Let us now suppose that a target concept has to be learned, i.e., there exists a boolean function $f : \prod_{j=1}^{d} F_j \to \{0, 1\}$ and a sample $S \subset \prod_{j=1}^{d} F_j$ on whose points the value of f is known. An example $x \in \prod_{j=1}^{d} F_j$ is classified as a positive (negative) example of the concept $C_f$ if $f(x) = 1$ ($f(x) = 0$). Given a subset CB of S, our goal is to find a system of weights for CB that maximises the accuracy of the nearest neighbor algorithm, that is, such that $E[f(nn(X)) = f(X)]$ is maximum, where $nn(x) \in CB$ is the nearest neighbor of $x \in \prod_{j=1}^{d} F_j$. In the classical Nearest Neighbor algorithm the whole sample S is stored, but the idea of using a subset of S to reduce the amount of storage required and to speed up query time arose early. Different approaches, which are called edited Nearest Neighbor, have been proposed (see [9] Chapter 6 for a collection of papers on this topic). In this paper, instead of filtering the sample to maintain the original accuracy, we are interested in finding, for a given subset of the sample, a local metric that assures the same accuracy as the unedited NN or better. To attain this goal we propose a training procedure that, given an example in S but not in CB, compares the value of f on the example, which is known, with the value of f on the nearest neighbor of the example. If the two values are equal then the prediction is correct, and the distance between the nearest neighbor and the example is decreased, whereas if the two values are not equal the distance between the nearest neighbor and the example is increased (see also [14, 21, 3, 31] for other applications of this well known technique). Let $CB = \{x_1, \ldots, x_m\}$ be a subset of $\prod_{j=1}^{d} F_j$ (case base). $W = [0, 1]^{2d|CB|}$ is the space of all possible local weights for CB. A learning step is a pair of functions R and P that map a set of weights $w(x)$ for x, the example x and a testing example y into new systems of weights $R(w, x, y)$ and $P(w, x, y)$ for x.
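The feedback idea described above (decrease the distance to the nearest neighbor after a correct prediction, increase it after a wrong one) can be sketched as follows. This is only an illustration for real features: the specific update formulas, the function name and the parameter names `reinforce_rate` and `punish_rate` are assumptions made for this sketch and are not claimed to be the maps used in the paper.

```python
def learning_step(w, x, y, same_class, reinforce_rate=0.2, punish_rate=0.8,
                  g=lambda z: z):
    """Return an updated copy of the weights w of the stored example x.

    w[j] = [left weight, right weight] for real feature j, all in [0, 1].
    Only the weight on the side of x where y lies is changed.
    """
    new_w = [list(pair) for pair in w]
    for j, (xj, yj) in enumerate(zip(x, y)):
        if xj == yj:
            continue  # no displacement along this feature, nothing to adapt
        k = 0 if yj < xj else 1  # 0 = left of x, 1 = right of x
        step = g(abs(xj - yj))
        if same_class:
            # Reinforcement (assumed form): shrink the weight, so y gets closer.
            new_w[j][k] = w[j][k] - reinforce_rate * w[j][k] * step
        else:
            # Punishment (assumed form): grow the weight towards 1, so y moves away.
            new_w[j][k] = w[j][k] + punish_rate * (1.0 - w[j][k]) * step
    return new_w


w = [[0.5, 0.5]]
print(learning_step(w, (0.5,), (0.7,), same_class=True))
print(learning_step(w, (0.5,), (0.7,), same_class=False))
```

With these assumed forms the weights stay in [0, 1] as long as g returns values in [0, 1] and both rates are in [0, 1].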
R, the reinforcement step, is chosen if the value of f on y is equal to the value of f on x. If this is not the case, P, the punishment step, is used (see also [18] for more details on learning steps and learning procedures based on reinforcement). A learning procedure iterates that step, adjusting an initially Euclidean system of weights for CB, aiming to optimise classification accuracy. Many different types of reinforcement and punishment steps can be adopted; the following class of maps was tested in a number of experiments described in Section 7. Let $x_i \in CB$ be the nearest neighbor of y:

1. If $f(x_i) = f(y)$ then: $R^0_{ij}(w^0_{ij}, x_i, y_j) = w^0_{ij} - \alpha\, w^0_{ij}\, g(\delta_j(x_{ij}, y_j))$ if [...]

[...] $> a + (1 - b)$, it is impossible to have accuracy 1. In fact the first and second examples must be chosen in such a way that they are symmetric with respect to the axis x = a, and the second and third examples must be symmetric with respect to the axis x = b. We may have accuracy 1 if we use two examples for class 2. More generally, in a situation like that depicted in the right part of Figure 7, four examples (drawn as little circles in the figure) are needed for each internal rectangle and two examples for each boundary rectangle. Conversely, using an asymmetrically weighted $L_\infty$ local metric only one example is needed for each rectangle, and this seed can be placed anywhere inside that region. In fact one can always modify the metric attached to the example in such a way that the level curves at a given distance are exactly the boundaries of a given rectangle (see Figure 1). The intuitive arguments presented above are made more formal in what follows.

Figure 8: Example of a rectilinear polygon with two holes and decomposition into a set of rectangles.

Let us first consider the case of rectilinear polytopes, that is, objects in a d dimensional space in which all the faces are parallel to the axes. Figure 8 shows an example of a rectilinear polygon in a two dimensional space. In [22] it is proven that $(2n/d)^d$ examples are enough to learn exactly a rectilinear polytope, where n is the number of faces of the polytope and the Euclidean metric is used. Let us suppose that d = 2. A well studied problem in computational geometry consists of finding a procedure for the optimal decomposition of a polygonal object into simpler components. The reader can refer to [13] for a survey of the decomposition problem. Decomposing a rectilinear polygon with holes into the minimum number of rectangles has been studied by Ferrari, Sancar and Sklansky [10]. In the case of non-degenerate holes, they give an $O(n^{5/2})$ procedure for solving this task. The degenerate case was long considered NP-hard, but recently Soltan and Gorpinevich have shown [26] that this problem is solvable in $O(n^{3/2} \log n)$ time. Before giving a bound on the number of rectangles needed to decompose a connected rectilinear polygon, a definition is needed. If the interior angle at a vertex is reflex (> 180 degrees) the vertex is called a notch. In [10] the following theorem is proven (see also [26]):

Theorem 1 A rectilinear connected polygon with H non-degenerate holes can be partitioned into $N - L + 1 - H$ rectangles, where N is the number of notches and L is the maximum number of non-intersecting chords that can be drawn either vertically or horizontally between two notches.

Using the above theorem the following lemmas can be proven:

Lemma 1 A rectilinear polygon P with H holes can be optimally partitioned into $n/2 + H - 1 - L$ rectangles, where n is the number of faces of P and L is the maximum number of non-intersecting chords that can be drawn either vertically or horizontally between two notches.

Proof. If N and n are the numbers of notches and vertices of P respectively, then $N = N_0 + N_1 + \ldots + N_H$ and $n = n_0 + n_1 + \ldots + n_H$, where $N_i$ and $n_i$ are the numbers of notches and faces of the i-th connected component of the boundary of P, and indices i > 0 denote the connected components of the boundary of the holes. The following equations hold: $n_0 = 2N_0 + 4$, $n_1 = 2N_1 - 4$, ..., $n_H = 2N_H - 4$, because the total number of faces (vertices) of a rectilinear polygon is two times the number of notches plus four, while on the boundary of a hole the notches of P are the convex vertices of the hole, which gives the minus sign. Summing the previous equations we get $n = 2N + 4(1 - H)$, and using Theorem 1 we get the thesis of the lemma. □
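The substitution used in the proof can be checked numerically. The sketch below is only an arithmetic illustration; the function names and the three test shapes (a rectangle, an L-shaped hexagon, and a square frame with one square hole) are examples chosen here, not taken from the paper.

```python
def rectangles_theorem1(notches, chords, holes):
    """Theorem 1: N - L + 1 - H rectangles."""
    return notches - chords + 1 - holes


def notches_from_faces(faces, holes):
    """Invert n = 2N + 4(1 - H) to obtain the number of notches N."""
    twice_n = faces - 4 * (1 - holes)
    if twice_n < 0 or twice_n % 2:
        raise ValueError("inconsistent number of faces and holes")
    return twice_n // 2


def rectangles_lemma1(faces, chords, holes):
    """Lemma 1: n/2 + H - 1 - L rectangles."""
    if faces % 2:
        raise ValueError("a rectilinear polygon has an even number of faces")
    return faces // 2 + holes - 1 - chords


# (faces n, holes H, chords L): rectangle, L-shape, square frame.
for n, h, l in [(4, 0, 0), (6, 0, 0), (8, 1, 0)]:
    via_theorem = rectangles_theorem1(notches_from_faces(n, h), l, h)
    print(n, h, l, via_theorem, rectangles_lemma1(n, l, h))
```

For each shape the count obtained from Theorem 1 through the notch formula agrees with the count given directly by Lemma 1.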

Lemma 2 If P is a rectilinear polygon in $[0, 1] \times [0, 1]$ with H holes then the complement of P in $[0, 1] \times [0, 1]$ can be optimally partitioned into $n/2 - H + 2 - L'$ rectangles, where n is the number of faces of P and L' is the maximum number of non-intersecting chords that can be drawn either vertically or horizontally between two notches of the complement of P.

Proof. To prove this lemma we apply Lemma 1 to each connected component of the complement of P. If n is the number of vertices of P then $n = n_0 + n_1 + \ldots + n_H$, where $n_i$ is the number of faces of the i-th connected component of the boundary of P, and indices i > 0 denote the connected components of the boundary of the holes. Therefore the connected components of the complement of P have $n_0 + 4, n_1, \ldots, n_H$ faces. Applying Lemma 1, the complement of P can be decomposed into $((4 + n_0)/2 - L'_0) + (n_1/2 - 1 - L'_1) + \ldots + (n_H/2 - 1 - L'_H) = n/2 - H + 2 - L'$ rectangles. □

As an application of the above results, let us consider the polygon P shown in Figure 8. It has 32 faces, 2 holes and 1 chord between two notches, therefore it can be decomposed into 16 = (32/2 + 2 - 1 - 1) rectangles (see Figure 8). The complement of P has three connected components and can be decomposed into 16 = (32/2 - 2 + 2 - 0) rectangles. Let us now suppose that $p = \infty$, that is:

$$\delta(x, y) = \max_{j = 1, \ldots, d} \{ w_j(x)\, |x_j - y_j| \}.$$

In $L_\infty$ the loci of points equidistant from a given point are rectangular. Given a point $x \in [0, 1]^d$ and a system of asymmetric weights w for x, the ball of radius r, $B_\infty(x, r)$, is defined as follows:

$$B_\infty(x, r) = \{ y \in [0, 1]^d : \delta(x, y) \le r \} = \prod_{j=1}^{d} [x_j - r/w_j^0, \; x_j + r/w_j^1]$$
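The ball formula above can be inverted: given a rectangle and a point strictly inside it, one can solve for the two weights of each feature so that the ball of radius r coincides with the rectangle. The sketch below illustrates this; the function names and the numbers are hypothetical, and keeping the weights in [0, 1] requires r to be no larger than the smallest distance from the point to a side.

```python
def ball(x, w, r):
    """Box [x_j - r / w_j^0, x_j + r / w_j^1] for each feature j."""
    return [(xj - r / w0, xj + r / w1) for xj, (w0, w1) in zip(x, w)]


def weights_for_rectangle(x, lower, upper, r):
    """Weights for which ball(x, w, r) is the box [lower, upper]."""
    w = []
    for xj, lj, uj in zip(x, lower, upper):
        if not lj < xj < uj:
            raise ValueError("x must lie strictly inside the rectangle")
        w.append((r / (xj - lj), r / (uj - xj)))
    return w


def linf_distance(x, y, w):
    """max_j w_j(x) |x_j - y_j|, with the left weight when y_j <= x_j."""
    return max(
        (w0 if yj <= xj else w1) * abs(xj - yj)
        for xj, yj, (w0, w1) in zip(x, y, w)
    )


x, lower, upper, r = (0.4, 0.82), (0.1, 0.7), (0.9, 0.95), 0.05
w = weights_for_rectangle(x, lower, upper, r)
print(ball(x, w, r))
print(linf_distance(x, (0.5, 0.8), w) <= r)   # point inside the rectangle
print(linf_distance(x, (0.95, 0.82), w) > r)  # point outside the rectangle
```

The seed point x is not at the centre of the rectangle, so the left and right weights differ on both features.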

We can now prove the following:

Theorem 2 $n + 1 - L - L'$ examples are enough for the exact learning of a rectilinear polygon P with n faces in two dimensions, where L and L' are the maximum numbers of non-intersecting chords that can be drawn either vertically or horizontally between two notches of P and of the complement of P respectively.

Proof. Let P be a rectilinear polygon and P' its complement. By Lemma 1 and Lemma 2, P and P' can be optimally decomposed into $k = n/2 + H - 1 - L$ and $k' = n/2 - H + 2 - L'$ rectangles respectively, so that $k + k' = n + 1 - L - L'$. Let $R_1, \ldots, R_k$ and $R'_1, \ldots, R'_{k'}$ be such rectangles, let $x_i \in R_i$ and $x'_i \in R'_i$ be arbitrary points in the interior of those rectangles, and let r be a positive constant. Let $w_j^0(x_i) = \frac{r}{x_{ij} - l_{ij}}$ and $w_j^1(x_i) = \frac{r}{u_{ij} - x_{ij}}$, where $l_{ij} = \min\{y_j : y \in R_i\}$ and $u_{ij} = \max\{y_j : y \in R_i\}$. The weights $w_j^k(x'_i)$ are defined analogously. With this definition $R_i = B_\infty(x_i, r)$, and if y is a generic point in P then $y \in R_j$ for one index j and $\delta(x_j, y) \le r$, whereas $\delta(x_i, y) > r$ for $i \neq j$ and $\delta(x'_i, y) > r$ for all $i = 1, \ldots, k'$. Therefore $nn(y) = x_j \in P$. With similar arguments it is proven that if $y \in P'$ then $nn(y) \in P'$. □

Some observations are in order. The rectangles into which a polygon P and its complement can be decomposed are the Voronoi regions of a set of points internal to the rectangles, provided that distances are taken with a local asymmetrically weighted metric endowed with weights as calculated in the proof of Theorem 2. It can also be observed that if the points chosen in the proof of Theorem 2 are the barycentres of the rectangles then the local metric is symmetrically weighted. Finally we observe that the bound on the number of points needed for the exact learning of a rectilinear polygon provided by Theorem 2, i.e. n + 1, compares favourably with the quadratic bound obtained in [22] for the Euclidean metric.

Let us suppose now that the concept to be learned is not a rectilinear polygon. If the target concept can be approximated in some sense by a rectilinear polygon then the original concept can be learned with a bounded number of examples and a given error. In [22] the symmetric difference between sets is proposed as a measure of the distance between two concepts. Let C and C' be two subsets of $[0, 1]^d$; then C is said to be $\epsilon$-similar to C' if $\mu((C \setminus C') \cup (C' \setminus C)) \le \epsilon$, where $\mu(\cdot)$ is the usual (Lebesgue) measure on $[0, 1]^d$. The following Theorem holds:

Theorem 3 Let C be a polygonal tessellation with n edges. If $\epsilon > 0$ then $n^2/\epsilon + 1$ points are enough for the exact learning of a rectilinear polygon C' that is $\epsilon$-similar to C, using a local $L_\infty$ metric.

Proof. Let $E_i$ be an edge of C with slope $\alpha$ with respect to the horizontal axis; then $E_i$ can be approximated with a rectangular line as in Figure 9. This construction can be extended to all the edges of C, giving a new rectilinear polygon C'. The distance between C and C', $\mu((C \setminus C') \cup (C' \setminus C))$, can be computed as the sum of the errors on each edge, each of which is the measure of the shaded area in Figure 9. If $k_m$ is the number of faces of the m-th rectangular approximation of an edge of length e and $A_m$ is the measure of the shaded area, then the following relations hold: $k_1 = 2$, $k_m = 2k_{m-1} - 1$ and $A_m \le e^2/2^{m+1}$. The error on each edge is bounded by $e^2/2^{m+1}$ because $(e^2 \sin\alpha \cos\alpha)/2^m$ is the real error when the edge makes an angle $\alpha$ with the horizontal axis, and $A_m$ is maximum when $\alpha = 45°$. Then $\mu((C \setminus C') \cup (C' \setminus C)) \le \sum_{i=1}^{n} e_i^2/2^{m+1} \le \sum_{i=1}^{n} 2/2^{m+1} = n/2^m$, because $e_i \le \sqrt{2}$ for all $i = 1, \ldots, n$. Therefore if $\epsilon = n/2^m$ then $\mu((C \setminus C') \cup (C' \setminus C)) \le \epsilon$. It is also easy to see that $k_m \le 2^m - m(m-1)/2$ and therefore the total number of faces of the rectilinear approximation is $n k_m \le n(2^m - m(m-1)/2) \le n 2^m = n^2/\epsilon$. Finally, applying Theorem 2 we get the thesis. □
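The counting in the proof can be followed numerically. The sketch below iterates the recurrence for $k_m$ and compares the resulting number of points with the bound of Theorem 3 for the choice $\epsilon = n/2^m$ made in the proof; the function names and the values n = 8, m = 4 are illustrative assumptions.

```python
def faces_per_edge(m):
    """k_1 = 2, k_m = 2 * k_{m-1} - 1."""
    k = 2
    for _ in range(m - 1):
        k = 2 * k - 1
    return k


def bound_for_level(n_edges, m):
    """Return (eps, points used, Theorem 3 bound) for refinement level m."""
    eps = n_edges / 2 ** m              # error bound n / 2^m from the proof
    faces = n_edges * faces_per_edge(m)  # faces of the rectilinear polygon C'
    points = faces + 1                   # Theorem 2, ignoring the chords L and L'
    return eps, points, n_edges ** 2 / eps + 1


print([faces_per_edge(m) for m in (1, 2, 3)])
print(bound_for_level(8, 4))
```

The first line reproduces the values $k_1 = 2$, $k_2 = 3$, $k_3 = 5$ shown in Figure 9, and in the second the number of points used stays below the bound $n^2/\epsilon + 1$.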

Figure 9: Approximations of an edge with rectangular lines ($k_1 = 2$, $k_2 = 3$, $k_3 = 5$; $\alpha$ is the slope of the edge).

Table 1: UCI Repository - ftp://ics.uci.edu/pub/machine-learning-databases/

Dataset          Instances  Classes  Features     Unknown
Balance Scale    625        3        4  (4C)      no
Breast Cancer    286        2        9  (4C 5S)   yes
Cleveland        303        4        13 (5C 8S)   no
Echocardiogram   74         2        13 (13C)     yes
Glass            214        7        9  (9C)      no
Ionosphere       351        2        34 (34C)     no
Iris             150        3        4  (4C)      no
Liver Disorders  345        2        6  (6C)      no
Thyroid          215        3        5  (5C)      no
Wine             178        3        13 (13C)     no

7 Experimental Results

In this Section we present the results of some experiments conducted on ten data sets taken from the UCI Repository [15]. Some general information on the chosen data sets is shown in Table 1. We have chosen data sets that contain many real features, as only on this type of feature does an asymmetrically weighted local metric differ from a symmetrically weighted one. Nevertheless some of the data sets also contain symbolic features. Unknown attribute values are also present in some of them. First LASM was compared with a nearest neighbor classifier; then we tried to understand whether an asymmetrically weighted local metric would behave better than a symmetrically weighted one and than a global metric. So three new types of metrics were defined:

LSSM (Local Symmetric Similarity Metric) is a local metric with equal left and right weights, i.e., $w_{ij}^0 = w_{ij}^1$ for all $i = 1, \ldots, |CB|$ and $j = 1, \ldots, d$.

GASM (Global Asymmetric Similarity Metric) is a global metric in which all the examples in the case base CB use the same set of weights but left and right weights are in general different, i.e., $w_{ij}^k = w_{lj}^k$ for all $i, l = 1, \ldots, |CB|$.

GSSM (Global Symmetric Similarity Metric) is a global metric in which all the examples in the case base CB use the same set of weights and left and right weights are equal, i.e., $w_{ij}^k = w_{lj}^m$ for all $i, l = 1, \ldots, |CB|$ and $k, m = 0, 1$.

Table 2: Comparison of the average accuracy and standard deviation obtained on different data sets by different algorithms (bold font, marked here with *, means significantly better, i.e. at least at 0.02 level for 2-tailed test). $\alpha = 0.2$, $\beta = 1.0$, $g(z) = z$.

Data Set         LASM         LSSM         GASM         GSSM        NN
Balance Scale    86.0±2.2*    84.2±3.1     74.1±4.2     73.5±3.9    78.2±2.6
Breast Cancer    69.7±4.1*    70.0±4.3*    63.4±7.6     62.1±7.9    64.9±4.0
Cleveland        80.4±3.4*    80.3±3.2*    78.4±4.3     76.2±5.6    77.2±3.7
Echocardiogram   68.6±10.9*   67.5±12.7*   65.0±16.8*   63.1±16.2   62.6±8.1
Glass            61.1±5.6     58.6±5.4     55.2±7.6     54.8±7.2    70.0±5.5*
Ionosphere       90.0±2.8*    88.4±3.9     81.0±7.2     75.5±9.5    90.0±2.3*
Iris             94.8±3.0*    93.8±4.1     94.3±4.0*    93.1±4.7    94.0±2.8*
Liver Disorders  60.5±4.9     58.1±5.0     54.3±5.6     54.4±5.4    62.3±3.8*
Thyroid          94.8±2.8     93.9±3.5     91.0±5.3     89.5±5.1    96.0±2.1*
Wine             95.0±3.4*    92.7±4.4     92.8±3.7     90.8±4.6    96.1±2.4*

All these metrics are learned with the same learning procedure described in Section 4. Each data set is split 50 times into two parts: 2/3 for training and 1/3 for test. The LearningPath procedure runs on each training set and when it stops the accuracy is calculated on the test set. The case base is extracted from the training set by taking a fixed percentage of it. This percentage is proportional to the number of classes, and the value tried initially is two percent for each class ($|CB| = \lceil 0.02 \cdot |Classes| \cdot |TrainingSet| \rceil$). The number of examples of each class in the case base is made proportional to the number of examples of that class found in the training set. So the selection is random, but with the requirement that the probability of finding an example of a given class is the same in the training set and in the case base. Weights are all initialised to an equal value ($10^{-4}$) and the LearningPath procedure is stopped when the accuracy on the training set decreases two times consecutively. The accuracy is calculated at each pass as the proportion of reinforcements over the total number of examples in the training set minus the number of examples in the case base ($r/(|TrainingSet| - |CB|)$). Table 2 shows a first set of results. In these experiments the reinforcement parameter is $\alpha = 0.2$, the punishment parameter is $\beta = 1.0$, and $g(z) = z$. Bold font is used when the algorithm is significantly better than the others, at least at 0.02 level for a 2-tailed test comparison.
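The extraction of the case base described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, integer arithmetic is used for the ceiling to avoid rounding problems, and ties in the per-class allocation are resolved with a largest-remainder rule, which the text does not specify.

```python
import random
from collections import defaultdict


def case_base_size(n_train, n_classes, percent_per_class=2):
    """|CB| = ceil(percent/100 * |Classes| * |TrainingSet|), in integers."""
    return -(-(percent_per_class * n_classes * n_train) // 100)


def extract_case_base(labels, percent_per_class=2, seed=0):
    """Return sorted indices of a random, class-proportional case base."""
    by_class = defaultdict(list)
    for idx, c in enumerate(labels):
        by_class[c].append(idx)
    size = case_base_size(len(labels), len(by_class), percent_per_class)
    # Proportional quota per class; leftovers go to the largest remainders
    # (an assumption, the paper only asks for proportionality).
    quotas = {c: size * len(ix) / len(labels) for c, ix in by_class.items()}
    alloc = {c: int(q) for c, q in quotas.items()}
    leftover = size - sum(alloc.values())
    ranked = sorted(quotas, key=lambda c: quotas[c] - alloc[c], reverse=True)
    for c in ranked[:leftover]:
        alloc[c] += 1
    rng = random.Random(seed)
    chosen = []
    for c, ix in by_class.items():
        chosen.extend(rng.sample(ix, alloc[c]))
    return sorted(chosen)


def training_accuracy(reinforcements, n_train, n_case_base):
    """r / (|TrainingSet| - |CB|)."""
    return reinforcements / (n_train - n_case_base)


labels = [0] * 50 + [1] * 30 + [2] * 20  # hypothetical training labels
cb = extract_case_base(labels)
print(len(cb), [labels[i] for i in cb])
```

With 100 training examples and three classes the case base holds 6 examples, split 3, 2 and 1 across the classes in this toy run.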
Table 2 shows that LASM is better than (or not significantly different from) LSSM, GASM and GSSM on all the data sets. It is not significantly different from LSSM on 3 data sets (Breast Cancer, Cleveland and Echocardiogram). The first two have many symbolic features, so in these cases LASM and LSSM are effectively similar. On the third, Echocardiogram, LASM is not distinguishable from GASM, and we believe that this depends on the very small size of the data set, which would again make LASM and GASM very similar. These experiments show that LASM should always be preferred to the other weighting schemes (LSSM, GASM and GSSM). On Glass LASM is clearly inferior to NN and we guess that this is caused by the relatively greater number of categories and the small size of the data set. Experiments with a greater percentage of examples, which will be discussed later, show a visible increase in accuracy on this data set (see Table 4). On Liver Disorders and Thyroid, LASM is inferior to NN but by a very small margin, and this difference will be further reduced with a better choice of parameters (see Table 3).

The reinforcement and punishment parameters do have an effect on system accuracy. The next trials were performed to measure this contribution experimentally. We have optimised the choice of these parameters on each data set and the results are shown in Table 3. This table also shows the accuracy of the k-NN algorithm, where K is chosen in each data set as the optimal value.

Table 3: Comparison of the average accuracy and standard deviation obtained on different data sets by different algorithms. The two numbers in parentheses in the LASM column indicate the reinforcement and punishment parameters respectively. In the k-NN column the number in parentheses is the optimal K used.

Data Set          LASM                    k-NN             NN
Balance Scale     86.5±1.8 (.1 1.0)       88.0±2.3 (11)    78.2±2.6
Breast Cancer     72.2±3.8 (.4 1.0)       73.7±3.5 (7)     64.9±4.0
Cleveland         80.9±3.5 (.2 .8)        81.5±3.4 (11)    77.2±3.7
Echocardiogram    70.9±10.5 (.2 .6)       69.4±6.4 (3)     62.6±8.1
Glass             62.4±5.1 (.6 .8)        70.0±5.5 (1)     70.0±5.5
Ionosphere        92.3±2.9 (.05 1.0)      90.0±2.3 (1)     90.0±2.3
Iris              95.0±2.9 (.3 1.0)       95.2±2.4 (11)    94.0±2.8
Liver Disorders   63.4±4.5 (.3 .6)        63.8±3.8 (3)     63.3±3.8
Thyroid           95.0±2.8 (.2 .8)        96.0±2.1 (1)     96.0±2.1
Wine              96.3±2.6 (.05 .4)       97.2±2.4 (9)     96.1±2.4

The optimisation of the two parameters results in increased accuracy, so that LASM always outperforms NN (apart from Glass and Thyroid) and behaves reasonably well compared to k-NN. Note that on Glass and Thyroid the best K is 1, so it seems that k-NN represents a limit for LASM. Besides, the improvement obtained by trying different parameter values is not so great, so we claim that LASM is quite robust against suboptimal choices of the reinforcement and punishment parameters. From a practical point of view, the user should choose an initial value for the punishment parameter in the range [0.6, 1] and then optimise the reinforcement parameter, taking into account that the larger the latter gets, the less the system fits the training set and the faster the training phase converges (see example in Section 5). In general a default setting of 0.2 for the reinforcement parameter and 0.8 for the punishment parameter will provide initially acceptable results.

As a last evaluation one can try to change the percentage of compression and see how the learning procedure is able to adapt. The results of this experiment are shown in Table 4. Here, for example, LASM-1% means that for each class the case base is 1% of the training set.
So, for example, in the first data set (Balance) we have 625 examples, the training set has 437 = 70% * 625 examples and there are 3 classes, so the case base is $3 \lceil 437 \cdot 0.01 \rceil = 15$ (in this way it is also assured that at least one example for each category is inserted in the case base).

Table 4: Comparison of the average accuracy and standard deviation obtained on different data sets by LASM with different percentages of the Training Set as case base.

Data Set          LASM-1%      LASM-2%      LASM-3%      LASM-4%      k-NN
Balance Scale     86.6±1.9     86.5±1.8     85.6±2.2     84.5±2.2     88.0±2.3
Breast Cancer     69.9±4.4     72.2±3.8     71.3±4.2     71.3±4.2     73.7±3.5
Cleveland         81.1±2.9     80.9±3.5     80.7±3.6     79.7±3.5     81.5±3.4
Echocardiogram    49.0±9.3     70.9±10.5    69.8±8.1     69.8±8.1     69.8±7.2
Glass             61.9±5.8     62.4±5.1     64.2±5.4     64.2±5.4     70.7±4.9
Ionosphere        87.9±5.2     92.3±2.9     92.8±2.3     93.0±2.1     90.4±2.3
Iris              94.4±3.6     95.0±2.9     95.3±3.0     95.1±3.0     95.2±2.4
Liver Disorders   60.7±5.0     63.4±4.5     62.6±4.8     63.8±3.8     63.8±3.8
Thyroid           93.9±3.3     95.0±2.8     94.7±2.9     93.7±3.8     96.0±2.0
Wine              96.2±2.9     96.3±2.6     96.0±2.7     96.4±1.9     97.2±2.4

The best results are obtained in all the data sets with very small percentages of the training set. On Glass LASM is noticeably inferior to NN and we guess that this is due to the relatively greater number of categories and to the small size of the data set. Tests performed with the leave-one-out procedure confirm this hypothesis. In fact, using leave-one-out we have obtained an accuracy of 73.4 with LASM and 72.9 with NN, which completely changes the situation. It is also to be noted that the accuracy is quite stable with respect to that percentage, so it is quite easy to find a good accuracy value. Note that increasing the percentage of examples stored does not always yield an increase in accuracy. This can be explained because after a certain point LASM tends to be similar to NN and therefore the accuracy approaches the accuracy of NN, which is normally less than that of k-NN. The comparison with k-NN should take into account the time performance. In Table 5 the average times$^8$ for training and testing the classifier are shown. Training times are almost one order of magnitude greater than test times, and so quite acceptable.
But most notably the testing time of LASM is, as expected, about one order of magnitude less than the time needed for the nearest neighbor algorithm. Especially on the bigger data sets, a good speedup can be detected (Balance, Cleveland, Wine). Again, on the Glass data set LASM does not have a good speedup as the compression rate is too small. This discussion clearly shows that the proposed technique is more suitable for medium-to-large data sets, where both good accuracy and an appreciable speedup can be obtained.

$^8$ Times are in seconds on a Lisp implementation running on a SunSparc10.

Table 5: Comparison of the time for training and testing.

Data Set          train-LASM           test-LASM    test-k-NN    Speedup
Balance Scale     20.7±5.8 (1%)        1.4          22.5         16
Breast Cancer     8.2±2.7 (2%)         0.5          5.7          11
Cleveland         16.7±4.5 (1%)        1.1          12.9         12
Echocardiogram    0.5±0.2 (2%)         0.03         0.4          13
Glass             42.4±16.4 (3%)       2.6          4.5          2
Ionosphere        133.5±30.1 (4%)      7.9          45.7         6
Iris              3.2±1.3 (3%)         0.3          1.1          4
Liver Disorders   17.3±6.5 (2%)        0.8          8.0          10
Thyroid           6.2±2.0 (2%)         0.4          2.9          7
Wine              7.1±2.4 (1%)         0.4          4.7          12
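The Speedup column of Table 5 appears to be the ratio of the k-NN test time to the LASM test time, rounded to the nearest integer. The short sketch below recomputes it from the two test-time columns; the dictionary layout and the function name are illustrative choices, and the reading of the column as a rounded ratio is an assumption that the reported figures happen to satisfy.

```python
# (test-LASM seconds, test-k-NN seconds, reported speedup), copied from Table 5.
TABLE_5 = {
    "Balance Scale": (1.4, 22.5, 16),
    "Breast Cancer": (0.5, 5.7, 11),
    "Cleveland": (1.1, 12.9, 12),
    "Echocardiogram": (0.03, 0.4, 13),
    "Glass": (2.6, 4.5, 2),
    "Ionosphere": (7.9, 45.7, 6),
    "Iris": (0.3, 1.1, 4),
    "Liver Disorders": (0.8, 8.0, 10),
    "Thyroid": (0.4, 2.9, 7),
    "Wine": (0.4, 4.7, 12),
}


def speedup(test_lasm, test_knn):
    """Ratio of k-NN test time to LASM test time."""
    return test_knn / test_lasm


for name, (lasm, knn, reported) in TABLE_5.items():
    print(f"{name:16s} ratio = {speedup(lasm, knn):5.2f}  reported = {reported}")
```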

8 Discussion and Future Directions

This paper has presented a new weighting scheme for nearest neighbor classification. We have shown, both with theoretical arguments and computer experiments, that good compression rates can be achieved while outperforming the accuracy of the standard Nearest Neighbor classification algorithm and obtaining almost the same accuracy as the k-NN algorithm. The improvement in time performance is proportional to the compression rate and in general it depends on the data set. It is also relevant to note that the comparison of the classification accuracy of LASM with a local symmetrically weighted metric and with a global metric strongly shows that LASM is to be preferred. Many aspects of the framework introduced here deserve more attention; some of them are listed below.

The stop condition. The procedure Learning-Path stops when the accuracy on the training set decreases twice consecutively. This heuristic criterion is not free from the problem of overfitting the training data. In fact, the reinforcement parameter, i.e. a non-null reinforcement, works in opposition to that tendency. At present, we have not detected this tendency in the experiments, as an increase of accuracy in training always produces an increase on the test set. Another problem related to this naive condition is that the classifier produced two passes before the one finally returned is more accurate than the classifier used for testing. So in general, if a better optimisation of the training accuracy could be achieved, a system improvement is expected as well.

Choice of the case base. It could be guessed that an appreciable improvement may be obtained with a better choice of the case base. This is a field rich in techniques already available to experiment with, as has been argued in the Introduction. We are going to tackle this subject in a coming paper, comparing both accuracy and compression rate.

Theoretical results. The theoretical results illustrated in Section 6 provide some evidence that, in a two dimensional input space, fewer examples need to be stored to learn exactly a class of concepts than the number required by a simple NN classifier. A generalisation of the results in Section 6 would be valuable. However, it is known that the geometrical investigation of d-dimensional spaces is not simple at all when $d \geq 3$. For example, the 3-d rectangle partitioning problem is NP-complete, and therefore perhaps some other path is to be investigated.

9 Acknowledgements

We would like to thank our anonymous reviewers for their insightful suggestions and remarks and Mark Keil for having indicated relevant reference papers in the computational geometry literature. Special thanks to David Aha for helpful discussions and encouragement in pursuing this research. This paper benefited from the editing help provided by Susan Zorat. This work has been partially supported by the Esprit III projects: #6095 CHARADE (Combining Human Assessment and Reasoning Aids for Decision Making in Environmental Emergencies) and CARICA #20401 (Cases Acquisition and Replay in Fire Campaign Ambience).

References

[1] D. W. Aha. Incremental, instance-based learning of independent and graded concept description. In Proceedings of the Sixth International Workshop on Machine Learning, Ithaca, NY, 1989. Morgan Kaufmann.
[2] D. W. Aha. A study of instance-based algorithms for supervised learning tasks: Mathematical, empirical and psychological evaluations. Technical Report TR-90-42, University of California, Irvine, 1990.
[3] D. W. Aha and R. L. Goldstone. Learning attribute relevance in context in instance-based learning algorithms. In Proceedings of the Twelfth Annual Conference of the Cognitive Science Society, pages 141-148, Cambridge, MA, 1990. Lawrence Erlbaum.
[4] D. W. Aha and R. L. Goldstone. Concept learning and flexible weighting. In Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society, pages 534-539, Bloomington, IN, 1992. Lawrence Erlbaum.
[5] D. W. Aha, D. Kibler, and M. K. Albert. Instance-based learning algorithms. Machine Learning, 6:37-66, 1991.
[6] C.-L. Chang. Finding prototypes for nearest neighbour classifier. IEEE Transactions on Computers, C-23(11):1179-1184, 1974.
[7] S. Cost and S. Salzberg. A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning, 10:57-78, 1993.
[8] R. H. Creecy, B. M. Masand, S. J. Smith, and D. L. Waltz. Trading MIPS and memory for knowledge engineering. Communications of the ACM, 35:48-64, 1992.
[9] B. V. Dasarathy, editor. Nearest neighbour (NN) norms: NN pattern classification techniques. IEEE Computer Society Press, Los Alamitos, CA, 1991.
[10] L. Ferrari, P. V. Sankar, and J. Sklansky. Minimal rectilinear partitions of digitized blocks. Computer Vision Graphics and Image Processing, 28:58-71, 1984.
[11] J. H. Friedman. Flexible metric nearest neighbour classification. Unpublished manuscript available by anonymous FTP from playfair.stanford.edu.
[12] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbour classification. In U. M. Fayad and R. Uthurusamy, editors, KDD-95: Proceedings First International Conference on Knowledge Discovery and Data Mining, 1994.
[13] J. M. Keil and J. Sack. Minimum decomposition of polygonal objects. In G. Toussaint, editor, Computational Geometry. North Holland, 1985.
[14] T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464-1480, Sept. 1990.
[15] P. M. Murphy and D. W. Aha. UCI Repository of Machine Learning Databases. University of California, Department of Information and Computer Science, Irvine, CA, 1994.
[16] J. P. Myles and D. J. Hand. The multi-class metric problem in nearest neighbour discrimination rules. Pattern Recognition, 23(11):1291-1297, 1990.
[17] F. P. Preparata and M. I. Shamos. Computational Geometry. Springer, 1985.
[18] F. Ricci. Constraint reasoning with learning automata. International Journal of Intelligent Systems, 9(12):1059-1082, Dec. 1994.
[19] F. Ricci and P. Avesani. Learning a local similarity metric for case-based reasoning. In International Conference on Case-Based Reasoning (ICCBR-95), Sesimbra, Portugal, Oct. 23-26, 1995.
[20] G. L. Ritter, H. B. Woodruff, S. R. Lowry, and T. L. Isenhour. An algorithm for selective nearest neighbor decision rule. IEEE Transactions on Information Theory, IT-21(6):665-669, 1975.
[21] D. E. Rumelhart and J. L. McClelland, editors. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986.
[22] S. Salzberg, A. Delcher, D. Heath, and S. Kasif. Best-case results for nearest neighbor learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(6):599-610, 1995.
[23] S. L. Salzberg. A nearest hyperrectangle learning method. Machine Learning, 6:251-276, 1991.
[24] R. D. Short and K. Fukunaga. A new nearest neighbour distance measure. In Proceedings of the 5th IEEE International Conference on Pattern Recognition, pages 81-86, Miami Beach, FL, 1980.
[25] R. D. Short and K. Fukunaga. Optimal distance measure for nearest neighbour classification. IEEE Transactions on Information Theory, 27:622-627, 1981.
[26] V. Soltan and A. Gorpinevich. Minimum dissection of a rectilinear polygon with arbitrary holes into rectangles. Discrete and Computational Geometry, 9:57-79, 1993.
[27] C. Stanfill and D. Waltz. Toward memory-based reasoning. Communications of the ACM, 29:1213-1229, 1986.
[28] C. W. Swonger. Sample set condensation for a condensed nearest neighbor decision rule for pattern recognition. In S. Watanabe, editor, Frontiers of Pattern Recognition, pages 511-519. Academic Press, 1972.
[29] D. Wettschereck and D. Aha. Weighting features. In M. Veloso and A. Aamodt, editors, Case-Based Reasoning, Research and Development, pages 347-358. Springer, 1995.
[30] D. L. Wilson. Asymptotic properties of nearest neighbor rule using edited data. IEEE Transactions on Systems, Man, and Cybernetics, SMC-2(3):408-421, 1972.
[31] L. Xu, A. Krzyzak, and E. Oja. Rival penalized competitive learning for cluster analysis, RBF net, and curve detection. IEEE Transactions on Neural Networks, 4(4):636-649, 1993.
[32] F. F. Yao. Computational geometry. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, pages 343-389. Elsevier, 1990.



1 Introduction

Nearest neighbor algorithms (NN) are ubiquitous in many research areas such as pattern recognition, machine learning and case-based reasoning. The k-NN algorithm, a simple generalisation of NN (which is 1-NN), maintains a set of training examples and classifies a new example as the most frequent class among the k most similar examples in the training set. Basically, the motivations for such a wide diffusion of NN reside in its good performance, which can be obtained in many situations, and in its ease of use, as no parameter needs to be tuned and very little experience is required on the part of the user. However, some limitations of the basic NN algorithm have been recognised: it has poor generalisation performance; it suffers from the existence of noisy features; it requires big training sets; it suffers from the so-called "curse of dimensionality"; and as the training set grows, run time performance deteriorates linearly. All these points have been addressed positively by a number of papers (see [9] for a large collection of papers on NN and its generalisations) and the proposed improvements effectively mitigate those limitations.

This paper is mostly concerned with the task of reducing the number of examples stored in the training set in order to speed up query time, and with improving classification accuracy by learning feature relevance in a context-sensitive way. That goal is pursued by the introduction of a new local metric [19] and a procedure that progressively transforms the Euclidean metric into a set of locally defined metrics attached to a reduced set of examples in the training set. The idea of using local metrics is not new; many approaches have been considered [24, 25, 27, 8, 3, 4, 12, 11] and some of these will be reviewed in the next Section. The novelty of our proposal is related to the weighting method and the particular process that learns the local metric. Usually a positive weight $w_i$ is associated to each feature $i$.
In local metrics $w_i$ is a function of the example from which the distance is taken, and the form of this dependence varies from one approach to another. For an ordered feature, a distinction is made between the "left" and "right" directions, introducing two weights for each feature. The motivation for this is related to the algorithm used for computing local weights. It is based on a feedback method [29] similar to the delta rule [21, 1, 2] or to that used in competitive learning [31] or learning vector quantization [14]. But, while in the majority of these approaches the examples stored ("weight vectors" representing cluster centres or "codebooks") are moved in the input space$^1$, we instead change the metrics attached to the stored examples, and therefore directional information has to be taken into account. As a result of the asymmetrical weighting scheme, the local metric does not have an invariance property that is always assumed, namely $d(x, x + v) = d(x, x - v)$, where $x$ is a generic example, $v$ is a displacement and $d$ is a metric. This means that the distance in the "north" direction need not equal the distance in the "south" direction.

A practical result is that substantial compression rates can be achieved. It turns out that it is possible to store a small fraction of the training set while achieving greater accuracy than 1-NN and almost the same accuracy as k-NN, where k is chosen as the value that maximises accuracy on a particular data set. The compression rate obviously depends on the data set, but on ten data sets we discovered that less than 10% of the training data still provides the same accuracy as k-NN. As a consequence query time can be reduced by approximately the same rate. The theoretical motivation for this ability is clarified by bounds on the minimum number of examples required for exact learning of a concept in a given class. For bidimensional concepts represented by a rectilinear polygon it is shown that a number of points equal to the number of faces of the polygon is enough to learn the concept exactly. This compares favourably with a quadratic bound proved by Salzberg et al. [22] for the Euclidean metric. Similarly, given a concept represented by a polygonal tessellation, it is shown that an approximately similar concept can be learned exactly with fewer points than those required by NN if the Euclidean metric is used.

$^1$ In some cases it is not advisable to create these abstract cases, as the meaning that a user has attached to them can be lost or the result could have feature values that are not possible for other reasons.

Our study is related to the line of research called editing experiments, that is, those techniques aimed at selecting a subset of prototypes from a given set of training examples, both for computational efficiency and for making the classification more reliable (see [9] Chapter 6). Even if the final result is similar (an edited training set), the focus of this paper is quite different. Rather than search for a subset of the original training set, we simply choose a subset at random and we adapt the metric to that choice. If both good classification accuracy and a good compression rate can be achieved in this way, then this would support the inherent ability of the metric and the learning process to adapt to data. It seems obvious that a good choice of the stored examples could further improve both performance and accuracy, but this is not the focus of this work, and the accurate selection of an edited set requires a costly search process that could prevent the system from being incremental with respect to new data. With respect to the techniques used, many editing experiments [28, 20] try to select those examples that are closer to the class boundaries, or proceed by deleting those examples that reduce accuracy [30, 5]. Conversely, using a local asymmetrically weighted metric, it seems more reasonable to select those examples that lie within the training clusters [6] or, even better, to choose one example for each set in a rectangular decomposition of the target concept (see Section 6). Some evidence to support these ideas will be given below, but we believe that this aspect deserves more detailed study.

2 Previous Approaches

This section is dedicated to reviewing other contributions to the concept of local metric. Local metrics have been introduced previously by a number of authors. Here attention is focused mainly on three aspects:

The feature type. Some approaches deal with discrete features, some make the additional restriction that the features must be boolean, and some consider real features. Even if some methods are equally applicable to both real and categorical features, one of the two types is always the more suitable.

Kind of locality. Feature weights may be invariant on regions of the input space, they may depend on the point from which the distance is taken, or they may depend on feature values. Moreover, weights may be attached to the stored examples or may be computed as a function of the query example.

How weights are computed. For categorical features, statistics on the distribution of values and the correlation between values and categories are exploited. Weights for real features have been computed both with conditional probabilities (see [29] for a discussion) and with feedback-based learning techniques.

Creecy et al. [8] consider boolean features and compute two types of weights: the per category feature importance (PCF), which is the conditional probability that an example belongs to a class given that a feature is true; and the cross category feature importance (CCF), which averages the PCF over all the categories. More formally, PCF assigns weights using $w_i(c) = P[c \mid x_i = 1]$, where $c$ is a class and $x_i$ is the value of the i-th (boolean) feature, whereas CCF is defined as $w_i = \sum_{c \in C} P[c \mid x_i = 1]^2$. Only the first definition yields a local metric: for each training example $x$ the weight $w_i(c)$ is stored with the i-th feature of $x$, where $c$ is the class of $x$. The metric is local, but a feature weight, for a given example, depends only on its feature value and on the class of the example. In [8] the authors show that the per category weighting method gives better results on a particular classification problem using an error minimising metric that combines the weighted local metric and the maximisation of $P[c \mid x_i = 1]$. Interestingly, in a very similar problem they got better results with cross-category weighting and k nearest neighbors.

Stanfill and Waltz [27] introduced the notion of a context-sensitive similarity metric early, in their Memory-Based Reasoning approach. The need for different weightings was motivated mostly by coupling feature relevance to the task to be solved. The cross category feature importance CCF discussed in [8] was in fact introduced earlier by Stanfill and Waltz for the task of pronouncing novel words, using a database of English words and their pronunciation. In this case feature values are discrete but not restricted to being boolean.
When a test example $y$ is matched with memory, the weight to be used for the i-th feature is given by $w_i(y) = \sqrt{\sum_{c \in C} P[c \mid y_i]^2}$, where $P[c \mid y_i]$ is the conditional probability that a training example is in class $c$ given that its i-th feature value is $y_i$ $^2$. They observed that the above weighting technique is too strict if combined with the Hamming distance ($d_i(x_i, y_i)$ as the equality test on $x_i$ and $y_i$). So they introduced a different distance on each feature space, $d_i(x_i, y_i) = \sum_{c \in C} (P[c \mid y_i] - P[c \mid x_i])^2$, which was called VDM, Value Distance Metric.

Both previous approaches apply only to categorical features. For real features, Short and Fukunaga [24, 25] have proposed, for the two-class classification problem, a local metric that minimises $E[\omega(x, y)]$, where $\omega(x, y)$ is the quadratic difference between the probability of misclassifying $x$ given that the NN of $x$ is $y$ and the probability of misclassifying $x$ given a hypothetically infinite set. They show that the metric $d(x, y) = |P[1 \mid x] - P[1 \mid y]|$ achieves that goal ($P[1 \mid x]$ is the probability of $x$ being classified 1). They also give an estimate of such a metric: $d(x, y) = \frac{\partial}{\partial t} P[1 \mid t]\big|_{t=x}\, |x - y|$. Myles and Hand [16] generalise that framework to the multi-class case, showing various possible extensions.

Salzberg [23] uses a global metric on real features with an additional weight for each stored case that measures how frequently the case has been used to make a correct classification (prediction). Some cases are "generalised", which means, in this case, that the point representation is exchanged for a two-point representation. Two points $x, y$ in $[0, 1]^N$ define a hyperrectangle, that is, the subset $H_{xy} \subseteq [0, 1]^N$, $H_{xy} = \{ z \in [0, 1]^N \mid z_i \in [\min(x_i, y_i), \max(x_i, y_i)] \ \forall i = 1, \ldots, N \}$. The distance between a point and a hyperrectangle is defined as the distance between a point and a set in Euclidean geometry (minimal distance), so for example if a point is inside a hyperrectangle the distance between the two is zero. In this way the metric behaves differently in the input space, depending on the size of the hyperrectangle. In [7] Cost and Salzberg exploit a modified version of VDM, with a weighting scheme that weights examples in memory according to their performance history.

Aha and Goldstone [3, 4] claim that an attribute's importance in human classification depends on its context, and they provide a computational model that is able to generalise with results similar to those shown by experiments conducted with humans. One of the computational models proposed (GCM-ISW) uses both local weights and global weights and applies to categorical and real features alike. When the distance of an input point $x$ from a point $y$ in the training set is to be computed, an interpolation of the local weights (attached to $y$) and the global weights is used. Weights are updated with a method similar to the delta rule [21]. They show that GCM-ISW provides a better fit to the subject data than other methods with no local weights. Therefore, they provide a cognitive foundation for local metrics and for local weight update rules (see the references in [4] for other models of human concept formation with local metrics).

$^2$ Stanfill and Waltz in fact compute different weights for different tasks, i.e., the weight depends on the goal feature that has to be predicted.
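The point-to-hyperrectangle distance described above can be illustrated with a few lines of code. This is a minimal sketch of the plain (unweighted) Euclidean minimal distance only; it is not Salzberg's full method, which also uses feature and exemplar weights, and the function names are chosen here for illustration.

```python
import math


def hyperrectangle(x, y):
    """Per-feature (low, high) bounds of the box H_xy spanned by points x and y."""
    return [(min(a, b), max(a, b)) for a, b in zip(x, y)]


def distance_to_hyperrectangle(z, box):
    """Euclidean distance from point z to the closest point of the box.

    For each feature the gap is zero when z_i lies inside [low, high];
    otherwise it is the distance to the nearer bound.
    """
    gaps = [max(low - zi, 0.0, zi - high) for zi, (low, high) in zip(z, box)]
    return math.sqrt(sum(g * g for g in gaps))


box = hyperrectangle((0.2, 0.7), (0.6, 0.3))
print(distance_to_hyperrectangle((0.4, 0.5), box))  # a point inside the box
print(distance_to_hyperrectangle((0.9, 0.5), box))  # a point to the right of the box
```

A point inside the box gets distance zero, so a large hyperrectangle "attracts" a correspondingly large region of the input space.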

3 Weights and Metrics

This section introduces the definitions of system of asymmetric weights and local asymmetrically weighted metric. Let $\prod_{j=1}^{d} F_j$ be the input space and $x \in \prod_{j=1}^{d} F_j$ be a generic example. Assume that $F_j$ is the closed unit interval $[0, 1]$ (real feature) or a generic set of symbols (categorical feature). Each space $F_j$ is endowed with a feature metric $\delta_j : (F_j \cup \{?\}) \times (F_j \cup \{?\}) \to \mathbb{R}_{\geq 0}$, where $?$ is a special symbol denoting an unknown value in $F_j$:

$$\delta_j(x_j, y_j) = \begin{cases} |x_j - y_j| & \text{if } x_j, y_j \in [0, 1] \\ 0 & \text{if } F_j \text{ is a set of symbols and } x_j = y_j \\ 1 & \text{if } F_j \text{ is a set of symbols and } x_j \neq y_j \\ 0.5 & \text{if } x_j \text{ or } y_j \text{ is unknown} \end{cases}$$

Given an example $x \in \prod_{j=1}^{d} F_j$, a set of asymmetric weights $w(x)$ for $x$ is a $2 \times d$ matrix with values in $[0, 1]$. Let $w_j^k(x)$, $k = 0, 1$ and $j = 1, \ldots, d$, be a generic element of $w(x)$. Assume that $w_j^0(x) = w_j^1(x)$ if $F_j$ is a set of symbols. Let $y$ be another point in the input space; if $F_j = [0, 1]$ then the following notation will be adopted:

$$w_j(x)\, \delta_j(x_j, y_j)^p = \begin{cases} w_j^0(x)\, \delta_j(x_j, y_j)^p & \text{if } y_j \leq x_j \\ w_j^1(x)\, \delta_j(x_j, y_j)^p & \text{otherwise} \end{cases} \qquad (1)$$

for all $p = 1, \ldots, \infty$. Conversely, when $F_j$ is a set of symbols, $w_j(x)\, \delta_j(x_j, y_j)^p = w_j^0(x)\, \delta_j(x_j, y_j)^p = w_j^1(x)\, \delta_j(x_j, y_j)^p$. Given a case base $CB \subset \prod_{j=1}^{d} F_j$ and a set of asymmetric weights for each example in $CB$, a local asymmetrically weighted metric (LASM) is defined as follows:

$$\delta : CB \times \prod_{j=1}^{d} F_j \to \mathbb{R}_{\geq 0}, \qquad \delta(x, y) = \Big( \sum_{j=1}^{d} w_j(x)\, \delta_j(x_j, y_j)^p \Big)^{1/p} \qquad (2)$$
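As a rough illustration of the definitions above, the following sketch computes the feature metric and the local asymmetrically weighted distance of equation (2). Several details are assumptions made here rather than statements from the paper: unknown values are represented by `None`, the pair `(w0, w1)` is read as the weights used when $y_j$ lies to the left or to the right of $x_j$ respectively, and the first weight is used whenever a value is unknown.

```python
UNKNOWN = None  # stands in for the special "unknown value" symbol


def feature_metric(xj, yj, symbolic):
    """Distance on a single feature space F_j."""
    if xj is UNKNOWN or yj is UNKNOWN:
        return 0.5
    if symbolic:
        return 0.0 if xj == yj else 1.0
    return abs(xj - yj)


def lasm_distance(x, y, weights, symbolic, p=2):
    """Distance from stored example x (carrying `weights`) to point y.

    weights[j] is the pair (w0, w1) attached to feature j of x;
    symbolic[j] is True for categorical features, for which w0 == w1.
    """
    total = 0.0
    for xj, yj, (w0, w1), sym in zip(x, y, weights, symbolic):
        dist = feature_metric(xj, yj, sym)
        if sym or xj is UNKNOWN or yj is UNKNOWN:
            w = w0  # assumption: direction is undefined here, use the first weight
        else:
            w = w0 if yj <= xj else w1
        total += w * dist ** p
    return total ** (1.0 / p)


# The same displacement costs more to the left of x than to the right.
x, weights, symbolic = (0.5,), [(1.0, 0.25)], [False]
print(lasm_distance(x, (0.3,), weights, symbolic, p=1))
print(lasm_distance(x, (0.7,), weights, symbolic, p=1))
```

Because the weights belong to the stored example `x`, swapping the roles of the two arguments generally changes the result unless both examples carry the same weights.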

In general all the local and global metrics on real features obey the property $d(x, x + v) = d(x, x - v)$ for any displacement $v$, which means that the metric is invariant with respect to the inversion of the direction of the displacement. This is no longer true for $\delta$. In fact all of the previously introduced metrics on real features are functions of the $|x - y|$ vector, whereas $\delta$ is a true function of $(x - y)$. This distinction obviously could not be made on categorical feature spaces. Moreover $\delta$, like other local metrics [27, 3, 4], is not symmetric, i.e. $\delta(x, y) \neq \delta(y, x)$ when $x, y \in CB$, because the weights stored for $x$ could be different from those of $y$.

The importance of an asymmetric weighting scheme is best understood when, given a target concept and a case base, a system of weights is to be found in such a way that the nearest neighbor classifier endowed with the metric defined by that set of weights is as accurate as possible. Asymmetrically weighted local metrics adapt more flexibly to the data, as will be shown in the example in Section 5; they enable a freer choice of examples to store in the case base (see [19, page 306]) and make high compression rates possible (see Section 6).

Figure 1: Level curves for different values of p (curves for p = 1, 2, 3 and 10; both axes range from 0 to 1).

Figure 1 shows a set of level curves for different values of $p$ in a two dimensional space. Each curve is the set of points with distance 0.08 from the point $x = (.4, .82)$ with different values for $p$ $^3$. It is clear that for $p \to \infty$ the level curves tend to be rectangular $^4$.
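The near-rectangular shape of the level curves for large p can be checked numerically. The sketch below handles real features only and assumes the form of equation (2) in which the weight multiplies the p-th power of the feature distance, with the first weight of each pair used on the left of $x_j$ and the second on the right. Under that assumption the level curve at distance $r$ crosses the axis through $x$ at $x_j - r / (w_j^0)^{1/p}$ and $x_j + r / (w_j^1)^{1/p}$. The function names and the example box are hypothetical.

```python
def weighted_distance(x, y, weights, p):
    """Asymmetrically weighted distance from x to y (real features only)."""
    total = 0.0
    for xj, yj, (w_left, w_right) in zip(x, y, weights):
        w = w_left if yj <= xj else w_right
        total += w * abs(xj - yj) ** p
    return total ** (1.0 / p)


def axis_extents(x, weights, r, p):
    """Points where the level curve at distance r crosses each axis through x."""
    return [(xj - r / w_left ** (1.0 / p), xj + r / w_right ** (1.0 / p))
            for xj, (w_left, w_right) in zip(x, weights)]


def weights_for_box(x, box, r, p):
    """Weights whose level curve at distance r has the extents of `box`.

    Each side of the box must be at least r away from x so that the
    resulting weights stay within [0, 1].
    """
    return [((r / (xj - low)) ** p, (r / (high - xj)) ** p)
            for xj, (low, high) in zip(x, box)]


x, r, p = (0.4, 0.82), 0.08, 10
box = [(0.2, 0.5), (0.7, 0.9)]              # hypothetical target rectangle around x
w = weights_for_box(x, box, r, p)
print(axis_extents(x, w, r, p))              # recovers the sides of the box
print(weighted_distance(x, (0.2, 0.7), w, p) / r)  # corner: 2 ** (1 / p) times r
```

The mid-points of the sides lie exactly on the level curve, while the corners are off by a factor of $2^{1/p}$ in two dimensions, which tends to 1 as p grows. This is one way to read the statement that the level curves tend to be rectangular.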

4 The Learning Procedure

Let us now suppose that a target concept has to be learned, i.e., there exists a boolean function $f : \prod_{j=1}^{d} F_j \to \{0, 1\}$ $^5$ and a sample $S \subset \prod_{j=1}^{d} F_j$ on whose points the definition of $f$ is known. An example $x \in \prod_{j=1}^{d} F_j$ is classified as a positive (negative) example of the concept $C_f$ if $f(x) = 1$ ($f(x) = 0$). Given a subset $CB$ of $S$, our goal is to find a system of weights for $CB$ that maximises the accuracy of the nearest neighbor algorithm, that is, such that $E[f(nn(X)) = f(X)]$ is maximum, where $nn(x) \in CB$ is the nearest neighbor of $x \in \prod_{j=1}^{d} F_j$. In the classical Nearest Neighbor algorithm the whole sample $S$ is stored, but the idea of using a subset of $S$ to reduce the amount of storage required, and to speed up query time, arose early. Different approaches, which are called edited Nearest Neighbor, have been proposed (see [9] Chapter 6 for a collection of papers on this topic). In this paper, instead of filtering the sample space to maintain the original accuracy, we are interested in finding, for a given subset of the sample, a local metric that assures the same accuracy as the unedited NN or better.

To attain this goal we propose a training procedure that, given an example in $S$ but not in $CB$, compares the value of $f$ on the example, which is known, with the value of $f$ on the nearest neighbor of the example. If the two values are equal then the prediction is correct, and the distance between the nearest neighbor and the example is decreased, whereas if the two values are not equal the distance between the nearest neighbor and the example is increased (see also [14, 21, 3, 31] for other applications of this well known technique).

Let $CB = \{x_1, \ldots, x_m\}$ be a subset of $\prod_{j=1}^{d} F_j$ (the case base). $W = [0, 1]^{2d|CB|}$ is the space of all possible local weights for $CB$. A learning step is a pair of functions $R$ and $P$ that map a set of weights $w(x)$ for $x$, the example $x$ and a testing example $y$ into a new system of weights $R(w, x, y)$ and $P(w, x, y)$ for $x$.
R, the reinforcement step, is chosen if the value of f on y is equal to the value of f on x. If this is not the case, P, the punishment step, is used (see also [18] for more details on learning steps and learning procedures based on reinforcement). A learning procedure iterates that step, adjusting an initially Euclidean system of weights for CB, with the aim of optimising classification accuracy. Many different types of reinforcement and punishment steps can be adopted; the following class of maps was tested in a number of experiments described in Section 7. Let $x_i \in CB$ be the nearest neighbor of y:

1. If $f(x_i) = f(y)$ then: $R^0_{ij}(w^0_{ij}, x_i, y_j) =$ [...]

[...] $a + (1 - b)$, it is impossible to have accuracy 1. In fact the first and second examples must be chosen in such a way that they are symmetric with respect to the axis x = a, and the second and third examples must be symmetric with respect to the axis x = b. We may have accuracy 1 if we use two examples for class 2. More generally, in a situation like that depicted in the right part of Figure 7, four examples (drawn as little circles in the figure) are needed for each internal rectangle and two examples for each boundary rectangle. Conversely, using an asymmetrically weighted $L_\infty$ local metric only one example is needed for each rectangle, and this seed can be placed anywhere inside that region. In fact one can always modify the metric attached to the example in such a way that the level curves at a given distance are exactly the boundaries of a given rectangle (see Figure 1).

The intuitive arguments presented above are made more formal in what follows. Let us first consider the case of rectilinear polytopes, that is, objects in a d dimensional space

in which all the faces are parallel to the axes. Figure 8 shows an example of a rectilinear polygon in a two dimensional space. In [22] it is proven that $(2n/d)^d$ examples are enough to learn exactly a rectilinear polytope, where n is the number of faces of the polytope and the Euclidean metric is used.

Figure 8: Example of a rectilinear polygon with two holes and decomposition into a set of rectangles.

Let us suppose that d = 2. A well studied problem in computational geometry consists of finding a procedure for the optimal decomposition of a polygonal object into simpler components. The reader can refer to [13] for a survey paper on the decomposition problem. Decomposing a rectilinear polygon with holes into the minimum number of rectangles has been studied by Ferrari, Sankar and Sklansky [10]. In the case of non-degenerate holes, they give an $O(n^{5/2})$ procedure for solving this task. The degenerate case was long considered NP-hard, but recently Soltan and Gorpinevich have shown [26] that this problem is solvable in $O(n^{3/2} \log n)$ time. Before giving a bound on the number of rectangles needed to decompose a connected rectilinear polygon, a definition is needed. If the interior angle at a vertex is reflex (> 180 deg.) the vertex is called a notch. In [10] the following theorem is proven (see also [26]):

Theorem 1 A connected rectilinear polygon with H non-degenerate holes can be partitioned into $N - L + 1 - H$ rectangles, where N is the number of notches and L is the maximum number of non-intersecting chords that can be drawn either vertically or horizontally between two notches.

Using the above theorem the following lemmas can be proven:

Lemma 1 A rectilinear polygon P with H holes can be optimally partitioned into $n/2 + H - 1 - L$ rectangles, where n is the number of faces of P and L is the maximum number of non-intersecting chords that can be drawn either vertically or horizontally between two notches.

Proof. If N and n are the numbers of notches and of vertices of P respectively, then $N = N_0 + N_1 + \ldots + N_H$ and $n = n_0 + n_1 + \ldots + n_H$, where $N_i$ and $n_i$ are the numbers of notches and of faces of the i-th connected component of the boundary of P, and indices i > 0 denote the connected components of the boundary of the holes. The following equations hold: $n_0 = 2N_0 + 4$, $n_1 = 2N_1 - 4$, ..., $n_H = 2N_H - 4$, because the total number of faces (vertices) of a rectilinear polygon is two times the number of notches plus four, while on the boundary of a hole the roles of convex and reflex vertices are exchanged. Summing the previous equations we get $n = 2N + 4(1 - H)$, and using Theorem 1 we get the thesis of the lemma. □
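The step from Theorem 1 to Lemma 1 is a substitution of $N = (n - 4(1 - H))/2$ into $N - L + 1 - H$. A minimal Python check of that arithmetic follows; the function names are hypothetical, and the numeric case (32 faces, 2 holes, 1 chord) is the polygon of Figure 8 as quoted in the text.

```python
def notches_from_faces(n, holes):
    """Invert n = 2N + 4(1 - H) to obtain the number of notches N."""
    return (n - 4 * (1 - holes)) // 2


def rectangles_theorem1(notches, chords, holes):
    """Count from Theorem 1: N - L + 1 - H."""
    return notches - chords + 1 - holes


def rectangles_lemma1(n, chords, holes):
    """Count from Lemma 1: n/2 + H - 1 - L (n is even for a rectilinear polygon)."""
    return n // 2 + holes - 1 - chords


# Polygon of Figure 8: 32 faces, 2 holes, 1 chord between two notches.
n, H, L = 32, 2, 1
N = notches_from_faces(n, H)
print(N, rectangles_theorem1(N, L, H), rectangles_lemma1(n, L, H))
```

Both counts agree for any even n, which is all the proof of Lemma 1 needs.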

Lemma 2 If P is a rectilinear polygon in $[0, 1] \times [0, 1]$ with H holes then the complement of P in $[0, 1] \times [0, 1]$ can be optimally partitioned into $n/2 - H + 2 - L'$ rectangles, where n is the number of faces of P and $L'$ is the maximum number of non-intersecting chords that can be drawn either vertically or horizontally between two notches of the complement of P.

Proof. To prove this lemma we apply Lemma 1 to each connected component of the complement of P. If n is the number of vertices (equivalently, faces) of P then $n = n_0 + n_1 + \ldots + n_H$, where $n_i$ is the number of faces of the i-th connected component of the boundary of P, and indices i > 0 denote the connected components of the boundary of the holes. Therefore the connected components of the complement of P have $n_0 + 4, n_1, \ldots, n_H$ faces. Applying Lemma 1, the complement of P can be decomposed into

$$((4 + n_0)/2 - L'_0) + (n_1/2 - 1 - L'_1) + \ldots + (n_H/2 - 1 - L'_H) = n/2 - H + 2 - L'$$

rectangles. □

As an application of the above results, let us consider the polygon P shown in Figure 8. It has 32 faces, 2 holes and 1 chord between two notches, therefore it can be decomposed into 16 = (32/2 + 2 - 1 - 1) rectangles (see Figure 8). The complement of P has three connected components and can be decomposed into 16 = (32/2 - 2 + 2 - 0) rectangles.

Let us now suppose that $p = \infty$, that is:

$$\delta(x, y) = \max_{j = 1, \ldots, d} \{ w_j(x)\, |x_j - y_j| \}.$$

In $L_\infty$ the locus of points equidistant from a given point is rectangular. Given a point $x \in [0, 1]^d$ and a system of asymmetric weights w for x, the ball of radius r, $B_\infty(x, r)$, is defined as follows:

$$B_\infty(x, r) = \{ y \in [0, 1]^d : \delta(x, y) \le r \} = \prod_{j=1}^{d} [x_j - r/w_j^0,\; x_j + r/w_j^1]$$
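A small Python sketch may help to see why such a ball is an axis-parallel rectangle that need not be centred on x. It assumes that the left weight $w_j^0$ applies when $y_j < x_j$ and the right weight $w_j^1$ otherwise, which is what the interval $[x_j - r/w_j^0, x_j + r/w_j^1]$ suggests. The helper `weights_for_rectangle` uses the choice $r/(x_j - l_j)$ and $r/(u_j - x_j)$ that appears later in the proof of Theorem 2; all function names are hypothetical.

```python
def asym_linf_distance(x, w0, w1, y):
    """Max over features of (side-dependent weight) * |x_j - y_j|."""
    return max(
        (w0[j] if y[j] < x[j] else w1[j]) * abs(x[j] - y[j])
        for j in range(len(x))
    )


def ball_bounds(x, w0, w1, r):
    """Per-feature interval of the ball of radius r, clipped to [0, 1]."""
    return [
        (max(0.0, x[j] - r / w0[j]), min(1.0, x[j] + r / w1[j]))
        for j in range(len(x))
    ]


def weights_for_rectangle(x, lower, upper, r):
    """Weights that make the ball of radius r around x equal to the given
    rectangle; x must lie strictly inside the rectangle."""
    w0 = [r / (x[j] - lower[j]) for j in range(len(x))]
    w1 = [r / (upper[j] - x[j]) for j in range(len(x))]
    return w0, w1


# An off-centre seed inside the rectangle [0.1, 0.9] x [0.2, 0.6].
x, r = (0.3, 0.5), 0.1
w0, w1 = weights_for_rectangle(x, (0.1, 0.2), (0.9, 0.6), r)
print(ball_bounds(x, w0, w1, r))
print(asym_linf_distance(x, w0, w1, (0.85, 0.25)) <= r)  # point inside the rectangle
print(asym_linf_distance(x, w0, w1, (0.95, 0.50)) <= r)  # point outside the rectangle
```

With symmetric weights the same construction would force x to be the centre of the rectangle, which is the restriction the asymmetric scheme removes.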

We can now prove the following:

Theorem 2 $n + 1 - L - L'$ examples are enough for exact learning of a rectilinear polygon P with n faces in two dimensions, where $L$ and $L'$ are the maximum numbers of non-intersecting chords that can be drawn either vertically or horizontally between two notches of P and of the complement of P respectively.

Proof. Let P be a rectilinear polygon and $P'$ its complement. By Lemma 1 and Lemma 2, P and $P'$ can be optimally decomposed into $k = n/2 + H - 1 - L$ and $k' = n/2 - H + 2 - L'$ rectangles respectively, so that $k + k' = n + 1 - L - L'$. Let $R_1, \ldots, R_k$ and $R'_1, \ldots, R'_{k'}$ be such rectangles, let $x_i \in R_i$ and $x'_i \in R'_i$ be arbitrary points in the interior of those rectangles, and let r be a positive constant. Let

$$w_j^0(x_i) = \frac{r}{x_{ij} - l_{ij}} \quad \text{and} \quad w_j^1(x_i) = \frac{r}{u_{ij} - x_{ij}},$$

where $l_{ij} = \min\{y_j : y \in R_i\}$ and $u_{ij} = \max\{y_j : y \in R_i\}$. The weights $w_j^k(x'_i)$ are defined analogously. With this definition $R_i = B_\infty(x_i, r)$, and if y is a generic point in P then $y \in R_j$ for one index j and $\delta(x_j, y) \le r$, whereas $\delta(x_i, y) > r$ for $i \ne j$ and $\delta(x'_i, y) > r$ for all $i = 1, \ldots, k'$. Therefore $nn(y) = x_j \in P$. With similar arguments it is proven that if $y \in P'$ then $nn(y) \in P'$. □

Some observations are in order. The rectangles into which a polygon P and its complement can be decomposed are the Voronoi regions of a set of points internal to the rectangles, provided that distances are taken with a local asymmetrically weighted metric endowed with weights as calculated in the proof of Theorem 2. It can also be observed that if the points chosen in the proof of Theorem 2 are taken at the barycentres of the rectangles then the local metric is symmetrically weighted. Finally we observe that the bound on the number of points needed for exact learning of a rectilinear polygon provided by Theorem 2, i.e. n + 1, compares favourably with the quadratic bound obtained in [22] for the Euclidean metric.

Let us suppose now that the concept to be learned is not a rectilinear polygon. If the target concept can be approximated in some sense by a rectilinear polygon then the original concept can be learned with a bounded number of examples and a given error. In [22] the symmetric difference between sets is proposed as a measure of the distance between two concepts. Let $C$ and $C'$ be two subsets of $[0, 1]^d$; then $C$ is said to be $\epsilon$-similar to $C'$ if $\mu(C \setminus C' \cup C' \setminus C) \le \epsilon$, where $\mu(\cdot)$ is the usual (Lebesgue) measure on $[0, 1]^d$. The following Theorem holds:

Theorem 3 Let C be a polygonal tessellation with n edges. If $\epsilon > 0$ then $n^2/\epsilon + 1$ points are enough for exact learning of a rectilinear polygon $C'$ that is $\epsilon$-similar to C, using a local $L_\infty$ metric.

Proof. Let $E_i$ be an edge of C with slope $\alpha$ with respect to the horizontal axis; then $E_i$ can be approximated with a rectilinear line as in Figure 9. This construction can be extended to all the edges of C, giving a new rectilinear polygon $C'$. The distance between $C$ and $C'$, $\mu(C \setminus C' \cup C' \setminus C)$, can be computed as the sum of the errors on each edge, that is, the measure of the shaded area in Figure 9. If $k_m$ is the number of faces of the m-th rectilinear approximation of an edge of length e and $A_m$ is the measure of the shaded area, then the following relations hold: $k_1 = 2$, $k_m = 2k_{m-1} - 1$ and $A_m \le e^2/2^{m+1}$. The error on each edge is bounded by $e^2/2^{m+1}$ because $(e^2 \sin\alpha \cos\alpha)/2^m$ is the actual error when the edge makes an angle $\alpha$ with the horizontal axis, and $A_m$ is maximum when $\alpha = 45^\circ$. Then

$$\mu(C \setminus C' \cup C' \setminus C) \le \sum_{i=1}^{n} \frac{e_i^2}{2^{m+1}} \le \sum_{i=1}^{n} \frac{2}{2^{m+1}} = \frac{n}{2^m},$$

because $e_i^2 \le 2$ for all $i = 1, \ldots, n$. Therefore if $\epsilon = n/2^m$ then $\mu(C \setminus C' \cup C' \setminus C) \le \epsilon$. It is also easy to see that $k_m \le 2^m - m(m-1)/2$ and therefore the total number of faces of the rectilinear approximation is $n k_m \le n(2^m - m(m-1)/2) \le n 2^m = n^2/\epsilon$. Finally, applying Theorem 2 we get the thesis. □
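The counting in this proof can be checked numerically. The sketch below iterates the recurrence $k_1 = 2$, $k_m = 2k_{m-1} - 1$, evaluates the error bound $n/2^m$, and compares the resulting number of faces with $n^2/\epsilon$ for $\epsilon = n/2^m$. Function names are hypothetical, and the closed form $k_m = 2^{m-1} + 1$ used in the comment is simply the solution of the recurrence.

```python
def faces_per_edge(m):
    """k_m from the recurrence k_1 = 2, k_m = 2*k_{m-1} - 1 (equals 2**(m-1) + 1)."""
    k = 2
    for _ in range(m - 1):
        k = 2 * k - 1
    return k


def error_bound(n_edges, m):
    """Upper bound n / 2**m on the measure of the symmetric difference."""
    return n_edges / 2 ** m


def total_faces(n_edges, m):
    """Number of faces of the rectilinear approximation: n * k_m."""
    return n_edges * faces_per_edge(m)


n_edges, m = 8, 3
eps = error_bound(n_edges, m)             # epsilon = n / 2**m
print([faces_per_edge(i) for i in (1, 2, 3)])
print(eps, total_faces(n_edges, m), n_edges ** 2 / eps)
```

For a tessellation with 8 edges and m = 3 the approximation has 40 faces, below the $n^2/\epsilon = 64$ used in the statement of Theorem 3.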


Figure 9: Approximations of an edge with rectangular lines (the edge makes an angle α with the horizontal axis; the three panels show k1 = 2, k2 = 3 and k3 = 5).

Table 1: UCI Repository - ftp://ics.uci.edu/pub/machine-learning-databases/

Dataset          Instances  Classes  Features    Unknown
Balance Scale    625        3        4   4C      no
Breast Cancer    286        2        9   4C 5S   yes
Cleveland        303        4        13  5C 8S   no
Echocardiogram   74         2        13  13C     yes
Glass            214        7        9   9C      no
Ionosphere       351        2        34  34C     no
Iris             150        3        4   4C      no
Liver Disorders  345        2        6   6C      no
Thyroid          215        3        5   5C      no
Wine             178        3        13  13C     no

7 Experimental Results

In this Section we present the results of some experiments conducted on ten data sets taken from the UCI Repository [15]. Some general information on the chosen data sets is shown in Table 1. We have chosen data sets that contain many real-valued features, as only on this type of feature does an asymmetrically weighted local metric differ from a symmetrically weighted one. Nevertheless some of the data sets also contain symbolic features. Unknown attribute values are also present in some of them.

First LASM was compared with a nearest neighbor classifier, then we tried to understand whether an asymmetrically weighted local metric would behave better than a symmetrically weighted one and than a global metric. So three new types of metrics were defined:

LSSM (Local Symmetric Similarity Metric) is a local metric with equal left and right weights, i.e., $w^0_{ij} = w^1_{ij}$ for all $i = 1, \ldots, |CB|$ and $j = 1, \ldots, d$.

GASM (Global Asymmetric Similarity Metric) is a global metric in which all the examples in the case base CB use the same set of weights but left and right weights are in general different, i.e., $w^k_{ij} = w^k_{lj}$ for all $i, l = 1, \ldots, |CB|$.

GSSM (Global Symmetric Similarity Metric) is a global metric in which all the examples in the case base CB use the same set of weights and left and right weights are equal, i.e., $w^k_{ij} = w^m_{lj}$ for all $i, l = 1, \ldots, |CB|$ and $k, m = 0, 1$.
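The four weighting schemes differ only in the equality constraints imposed on the weights $w^k_{ij}$. As an illustration, the following Python sketch classifies a weight table by checking those constraints; the layout `w[i][j] = [left, right]` and the function names are assumptions made for the example.

```python
def is_symmetric(w, tol=1e-12):
    """True if left and right weights coincide for every example and feature."""
    return all(abs(left - right) <= tol
               for example in w for (left, right) in example)


def is_global(w, tol=1e-12):
    """True if every example in the case base carries the same weights."""
    first = w[0]
    return all(abs(example[j][k] - first[j][k]) <= tol
               for example in w
               for j in range(len(first))
               for k in (0, 1))


def scheme(w):
    """Name of the most constrained scheme that the weight table satisfies."""
    if is_global(w):
        return "GSSM" if is_symmetric(w) else "GASM"
    return "LSSM" if is_symmetric(w) else "LASM"


# Two stored examples, two features each.
print(scheme([[[0.1, 0.4], [0.3, 0.3]], [[0.2, 0.2], [0.5, 0.9]]]))  # unconstrained
print(scheme([[[0.1, 0.1], [0.3, 0.3]], [[0.2, 0.2], [0.5, 0.5]]]))  # symmetric, local
print(scheme([[[0.1, 0.4], [0.3, 0.6]], [[0.1, 0.4], [0.3, 0.6]]]))  # asymmetric, global
print(scheme([[[0.1, 0.1], [0.3, 0.3]], [[0.1, 0.1], [0.3, 0.3]]]))  # symmetric, global
```

Seen this way, LASM has $2d|CB|$ free weights, LSSM has $d|CB|$, GASM has $2d$ and GSSM has $d$.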

Table 2: Comparison of the average accuracy and standard deviation obtained on different data sets by different algorithms (* marks results that are significantly better, i.e. at least at the 0.02 level for a 2-tailed test). Reinforcement parameter = 0.2, punishment parameter = 1.0, g(z) = z.

Data Set         LASM           LSSM           GASM           GSSM          NN
Balance Scale    86.0 ± 2.2*    84.2 ± 3.1     74.1 ± 4.2     73.5 ± 3.9    78.2 ± 2.6
Breast Cancer    69.7 ± 4.1*    70.0 ± 4.3*    63.4 ± 7.6     62.1 ± 7.9    64.9 ± 4.0
Cleveland        80.4 ± 3.4*    80.3 ± 3.2*    78.4 ± 4.3     76.2 ± 5.6    77.2 ± 3.7
Echocardiogram   68.6 ± 10.9*   67.5 ± 12.7*   65.0 ± 16.8*   63.1 ± 16.2   62.6 ± 8.1
Glass            61.1 ± 5.6     58.6 ± 5.4     55.2 ± 7.6     54.8 ± 7.2    70.0 ± 5.5*
Ionosphere       90.0 ± 2.8*    88.4 ± 3.9     81.0 ± 7.2     75.5 ± 9.5    90.0 ± 2.3*
Iris             94.8 ± 3.0*    93.8 ± 4.1     94.3 ± 4.0*    93.1 ± 4.7    94.0 ± 2.8*
Liver Disorders  60.5 ± 4.9     58.1 ± 5.0     54.3 ± 5.6     54.4 ± 5.4    62.3 ± 3.8*
Thyroid          94.8 ± 2.8     93.9 ± 3.5     91.0 ± 5.3     89.5 ± 5.1    96.0 ± 2.1*
Wine             95.0 ± 3.4*    92.7 ± 4.4     92.8 ± 3.7     90.8 ± 4.6    96.1 ± 2.4*

All these metrics are learned with the same learning procedure described in Section 4. Each data set is split 50 times into two parts: 2/3 for training and 1/3 for test. The Learning-Path procedure runs on each training set and when it stops the accuracy is calculated on the test set. The case base is extracted from the training set by taking a fixed percentage of it. This percentage is proportional to the number of classes and the value tried initially is two percent for each class ($|CB| = \lceil 0.02 \cdot |Classes| \cdot |TrainingSet| \rceil$). The number of examples of each class in the case base is made proportional to the number of examples of that class found in the training set. So the selection is random, subject to the requirement that the probability of finding an example of a given class is the same in the training set and in the case base. Weights are all initialised to an equal value ($10^{-4}$) and the Learning-Path procedure is stopped when the accuracy on the training set decreases two times consecutively. The accuracy is calculated at each pass as the proportion of reinforcements over the total number of examples in the training set minus the number of examples in the case base ($r / (|TrainingSet| - |CB|)$).

Table 2 shows a first set of results. In these experiments the reinforcement parameter is 0.2, the punishment parameter is 1.0 and g(z) = z. An asterisk is used when the algorithm is significantly better than the others, at least at the 0.02 level for a 2-tailed test comparison.
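The sizing rule for the case base, the per-pass training accuracy and the stopping rule described above are simple enough to be written down directly. The following Python sketch is only an illustration of those three rules; the function names and the example numbers are hypothetical.

```python
import math


def case_base_size(n_training, n_classes, fraction_per_class=0.02):
    """|CB| = ceil(fraction * |Classes| * |TrainingSet|)."""
    return math.ceil(fraction_per_class * n_classes * n_training)


def training_accuracy(reinforcements, n_training, n_case_base):
    """Proportion of reinforcements over the training examples not in CB."""
    return reinforcements / (n_training - n_case_base)


def should_stop(accuracies):
    """Stop when the training accuracy has decreased twice in a row."""
    return (len(accuracies) >= 3
            and accuracies[-1] < accuracies[-2] < accuracies[-3])


# Hypothetical run: 437 training examples, 3 classes, 2% per class.
size = case_base_size(437, 3)
print(size, training_accuracy(300, 437, size))
print(should_stop([0.80, 0.85, 0.84, 0.83]))  # two consecutive decreases
print(should_stop([0.80, 0.85, 0.84, 0.86]))  # accuracy recovered
```

Note that with this stopping rule the weights returned are those reached after the second decrease, not those of the best pass.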
Table 2 shows that LASM is better than (or not significantly different from) LSSM, GASM and GSSM on all the data sets. It is not significantly different from LSSM on 3 data sets (Breast Cancer, Cleveland and Echocardiogram). The first two have many symbolic features, so in these cases LASM and LSSM are effectively similar. On the third, Echocardiogram, LASM is also not distinguishable from GASM, and we believe that this depends on the very small size of the data set, which would again make LASM and GASM very similar. These experiments indicate that LASM is to be preferred to the other weighting schemes (LSSM, GASM and GSSM). On Glass LASM is clearly inferior to NN and we guess that this is caused by the relatively greater number of categories and the small size of the data set. Experiments with a greater percentage of examples, which will be discussed later, show a visible increase in accuracy on this data set (see Table 4). On Liver Disorders and Thyroid, LASM is inferior to NN but with a very limited difference, and this difference will be further reduced with a better choice of parameters (see Table 3).

Table 3: Comparison of the average accuracy and standard deviation obtained on different data sets by different algorithms. The two numbers in parentheses in the LASM column indicate the reinforcement and punishment parameters respectively. In the k-NN column the number in parentheses is the optimal K used.

Data Set         LASM                     k-NN               NN
Balance Scale    86.5 ± 1.8 (.1 1.0)      88.0 ± 2.3 (11)    78.2 ± 2.6
Breast Cancer    72.2 ± 3.8 (.4 1.0)      73.7 ± 3.5 (7)     64.9 ± 4.0
Cleveland        80.9 ± 3.5 (.2 .8)       81.5 ± 3.4 (11)    77.2 ± 3.7
Echocardiogram   70.9 ± 10.5 (.2 .6)      69.4 ± 6.4 (3)     62.6 ± 8.1
Glass            62.4 ± 5.1 (.6 .8)       70.0 ± 5.5 (1)     70.0 ± 5.5
Ionosphere       92.3 ± 2.9 (.05 1.0)     90.0 ± 2.3 (1)     90.0 ± 2.3
Iris             95.0 ± 2.9 (.3 1.0)      95.2 ± 2.4 (11)    94.0 ± 2.8
Liver Disorders  63.4 ± 4.5 (.3 .6)       63.8 ± 3.8 (3)     63.3 ± 3.8
Thyroid          95.0 ± 2.8 (.2 .8)       96.0 ± 2.1 (1)     96.0 ± 2.1
Wine             96.3 ± 2.6 (.05 .4)      97.2 ± 2.4 (9)     96.1 ± 2.4

The reinforcement and punishment parameters do have an effect on system accuracy. The next trials were performed to measure this contribution experimentally. We have optimised the choice of these parameters on each data set and the results are shown in Table 3. This table also shows the accuracy of the k-NN algorithm, where K is chosen in each data set as the optimal value. The optimisation of the two parameters results in increased accuracy, so that LASM always outperforms NN (apart from Glass and Thyroid) and behaves reasonably well compared to k-NN. Note that on Glass and Thyroid the best K is 1, so it seems that k-NN represents a limit for LASM. Besides, the improvement obtained by trying different values of the two parameters is not so great, so we claim that LASM is quite robust against suboptimal choices of the reinforcement and punishment parameters. From a practical point of view, the user should choose an initial value for the punishment parameter in the range [0.6, 1] and then optimise the reinforcement parameter, taking into account that the larger it gets, the less the system fits the training set and the faster the convergence of the training phase becomes (see the example in Section 5). In general a default setting of 0.2 for the reinforcement parameter and 0.8 for the punishment parameter will provide initially acceptable results.

As a last evaluation one can try to change the percentage of compression and see how the learning procedure is able to adapt. The results of this experiment are shown in Table 4. Here, for example, LASM-1% means that for each class the case base is 1% of the training set.
So, for example, in the first data set (Balance Scale) we have 625 examples, the training set has 437 = 70% × 625 examples and there are 3 classes, so the case base has $3 \lceil 437 \times 0.01 \rceil = 15$ examples (in this way it is also assured that at least one example for each category is inserted in the case base).

Table 4: Comparison of the average accuracy and standard deviation obtained on different data sets by LASM with different percentages of the Training Set as case base.

Data Set         LASM-1%       LASM-2%        LASM-3%       LASM-4%       k-NN
Balance Scale    86.6 ± 1.9    86.5 ± 1.8     85.6 ± 2.2    84.5 ± 2.2    88.0 ± 2.3
Breast Cancer    69.9 ± 4.4    72.2 ± 3.8     71.3 ± 4.2    71.3 ± 4.2    73.7 ± 3.5
Cleveland        81.1 ± 2.9    80.9 ± 3.5     80.7 ± 3.6    79.7 ± 3.5    81.5 ± 3.4
Echocardiogram   49.0 ± 9.3    70.9 ± 10.5    69.8 ± 8.1    69.8 ± 8.1    69.8 ± 7.2
Glass            61.9 ± 5.8    62.4 ± 5.1     64.2 ± 5.4    64.2 ± 5.4    70.7 ± 4.9
Ionosphere       87.9 ± 5.2    92.3 ± 2.9     92.8 ± 2.3    93.0 ± 2.1    90.4 ± 2.3
Iris             94.4 ± 3.6    95.0 ± 2.9     95.3 ± 3.0    95.1 ± 3.0    95.2 ± 2.4
Liver Disorders  60.7 ± 5.0    63.4 ± 4.5     62.6 ± 4.8    63.8 ± 3.8    63.8 ± 3.8
Thyroid          93.9 ± 3.3    95.0 ± 2.8     94.7 ± 2.9    93.7 ± 3.8    96.0 ± 2.0
Wine             96.2 ± 2.9    96.3 ± 2.6     96.0 ± 2.7    96.4 ± 1.9    97.2 ± 2.4

The best results are obtained in all the data sets with very small percentages of the training set. On Glass LASM is noticeably inferior to NN and we guess that this is due to the relatively greater number of categories and to the small size of the data set. Tests performed with the leave-one-out procedure confirm this hypothesis. In fact using leave-one-out we have obtained accuracy 73.4 with LASM and 72.9 with NN, which changes the situation completely. It is also to be noted that the accuracy is quite stable with respect to that percentage, so it is quite easy to find a good accuracy value. Note that increasing the percentage of examples stored does not always yield an increase in accuracy. This can be explained because after a certain point LASM tends to be similar to NN and therefore its accuracy approaches the accuracy of NN, which is normally less than that of k-NN.

The comparison with k-NN should take into account the time performance. In Table 5 the average times (footnote 8) for training and testing the classifier are shown. Training times are almost one order of magnitude greater than test times, so quite acceptable.
But most notably the testing time of LASM is, as expected, roughly one order of magnitude less than the time needed for the nearest neighbor algorithm. Especially on the bigger data sets, a good speedup can be detected (Balance Scale, Cleveland, Wine). Again, on the Glass data set LASM does not have a good speedup as the compression rate is too small. This discussion shows that the proposed technique is more suitable for medium to large data sets, where both good accuracy and an appreciable speedup can be obtained.

Footnote 8: Times are in seconds on a Lisp implementation running on a SunSparc10.


Table 5: Comparison of the time for training and testing. The percentage next to each training time is the case base percentage used.

Data Set         train-LASM            test-LASM   test-k-NN   Speedup
Balance Scale    20.7 ± 5.8 (1%)       1.4         22.5        16
Breast Cancer    8.2 ± 2.7 (2%)        0.5         5.7         11
Cleveland        16.7 ± 4.5 (1%)       1.1         12.9        12
Echocardiogram   0.5 ± 0.2 (2%)        0.03        0.4         13
Glass            42.4 ± 16.4 (3%)      2.6         4.5         2
Ionosphere       133.5 ± 30.1 (4%)     7.9         45.7        6
Iris             3.2 ± 1.3 (3%)        0.3         1.1         4
Liver Disorders  17.3 ± 6.5 (2%)       0.8         8.0         10
Thyroid          6.2 ± 2.0 (2%)        0.4         2.9         7
Wine             7.1 ± 2.4 (1%)        0.4         4.7         12
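The Speedup column of Table 5 appears to be the ratio between the k-NN test time and the LASM test time, rounded to the nearest integer. The short Python check below recomputes it from the two time columns; this reading of the column is an inference from the numbers, not something stated in the text.

```python
# (data set, test-LASM seconds, test-k-NN seconds, published speedup), from Table 5
TABLE_5 = [
    ("Balance Scale", 1.4, 22.5, 16),
    ("Breast Cancer", 0.5, 5.7, 11),
    ("Cleveland", 1.1, 12.9, 12),
    ("Echocardiogram", 0.03, 0.4, 13),
    ("Glass", 2.6, 4.5, 2),
    ("Ionosphere", 7.9, 45.7, 6),
    ("Iris", 0.3, 1.1, 4),
    ("Liver Disorders", 0.8, 8.0, 10),
    ("Thyroid", 0.4, 2.9, 7),
    ("Wine", 0.4, 4.7, 12),
]


def speedup(test_lasm, test_knn):
    """Ratio of k-NN test time to LASM test time."""
    return test_knn / test_lasm


for name, lasm, knn, published in TABLE_5:
    ratio = speedup(lasm, knn)
    print(f"{name:16s} {ratio:6.2f}  published: {published}")
```

The smallest ratio is obtained on Glass, in line with the remark that the compression rate there is too small to give a good speedup.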

8 Discussion and Future Directions

This paper has presented a new weighting scheme for nearest neighbor classification. We have shown both with theoretical arguments and computer experiments that good compression rates can be achieved while outperforming the accuracy of the standard Nearest Neighbor classification algorithm and obtaining almost the same accuracy as the k-NN algorithm. The improvement in time performance is proportional to the compression rate and in general it depends on the data set. It is also relevant to note that the comparison of the classification accuracy of LASM with a local symmetrically weighted metric and with a global metric strongly indicates that LASM is to be preferred. Many aspects of the framework introduced here deserve more attention; some of them are listed below.

The stop condition. The Learning-Path procedure stops when the accuracy on the training set decreases twice consecutively. This heuristic criterion is not free from the problem of overfitting the training data. In fact, the reinforcement parameter, i.e. a non-null reinforcement, works in opposition to that tendency. At present, we have not detected this tendency in the experiments, as an increase of accuracy in training always produces an increase on the test set. Another problem related to this naive condition is that the classifier produced two passes before the one finally returned is more accurate on the training set than the classifier used for testing. So in general, if a better optimisation of the training accuracy could be achieved, a system improvement is expected as well.

Choice of the case base. It could be guessed that an appreciable improvement may be obtained with a better choice of the case base. This is a field rich in techniques already available to experiment with, as has been argued in the Introduction. We are going to tackle this subject in a coming paper, comparing both accuracy and compression rate.

Theoretical results. The theoretical results illustrated in Section 6 provide some evidence that in a two dimensional input space fewer examples need to be stored to learn exactly a class of concepts than are required by a simple NN classifier. A generalisation of the results in Section 6 would be valuable. However, it is known that the geometrical investigation of d-dimensional spaces, when $d \ge 3$, is not simple at all. For example, the 3-d rectangle partitioning problem is NP-complete, and therefore perhaps some other path is to be investigated.

9 Acknowledgements

We would like to thank our anonymous reviewers for their insightful suggestions and remarks and Mark Keil for having indicated relevant reference papers in the computational geometry literature. Special thanks to David Aha for helpful discussions and encouragement in pursuing this research. This paper benefited from the editing help provided by Susan Zorat. This work has been partially supported by the Esprit III projects: #6095 CHARADE (Combining Human Assessment and Reasoning Aids for Decision Making in Environmental Emergencies) and CARICA #20401 (Cases Acquisition and Replay in Fire Campaign Ambience).

References

[1] D. W. Aha. Incremental, instance-based learning of independent and graded concept description. In Proceedings of the Sixth International Workshop on Machine Learning, Ithaca, NY, 1989. Morgan Kaufmann.
[2] D. W. Aha. A study of instance-based algorithms for supervised learning tasks: Mathematical, empirical and psychological evaluations. Technical Report TR-90-42, University of California, Irvine, 1990.
[3] D. W. Aha and R. L. Goldstone. Learning attribute relevance in context in instance-based learning algorithms. In Proceedings of the Twelfth Annual Conference of the Cognitive Science Society, pages 141-148, Cambridge, MA, 1990. Lawrence Erlbaum.
[4] D. W. Aha and R. L. Goldstone. Concept learning and flexible weighting. In Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society, pages 534-539, Bloomington, IN, 1992. Lawrence Erlbaum.
[5] D. W. Aha, D. Kibler, and M. K. Albert. Instance-based learning algorithms. Machine Learning, 6:37-66, 1991.
[6] C.-L. Chang. Finding prototypes for nearest neighbour classifier. IEEE Transactions on Computers, C-23(11):1179-1184, 1974.
[7] S. Cost and S. Salzberg. A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning, 10:57-78, 1993.

[8] R. H. Creecy, B. M. Masand, S. J. Smith, and D. L. Waltz. Trading MIPS and memory for knowledge engineering. Communications of the ACM, 35:48-64, 1992.
[9] B. V. Dasarathy, editor. Nearest neighbour (NN) norms: NN pattern classification techniques. IEEE Computer Society Press, Los Alamitos, CA, 1991.
[10] L. Ferrari, P. V. Sankar, and J. Sklansky. Minimal rectilinear partitions of digitized blocks. Computer Vision Graphics and Image Processing, 28:58-71, 1984.
[11] J. H. Friedman. Flexible metric nearest neighbour classification. Unpublished manuscript available by anonymous FTP from playfair.stanford.edu.
[12] T. Hastie and R. Tibshirani. Discriminant adaptive nearest neighbour classification. In U. M. Fayyad and R. Uthurusamy, editors, KDD-95: Proceedings First International Conference on Knowledge Discovery and Data Mining, 1994.
[13] J. M. Keil and J. Sack. Minimum decomposition of polygonal objects. In G. Toussaint, editor, Computational Geometry. North Holland, 1985.
[14] T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464-1480, Sept. 1990.
[15] P. M. Murphy and D. W. Aha. UCI Repository of Machine Learning Databases. University of California, Department of Information and Computer Science, Irvine, CA, 1994.
[16] J. P. Myles and D. J. Hand. The multi-class metric problem in nearest neighbour discrimination rules. Pattern Recognition, 23(11):1291-1297, 1990.
[17] F. P. Preparata and M. I. Shamos. Computational Geometry. Springer, 1985.
[18] F. Ricci. Constraint reasoning with learning automata. International Journal of Intelligent Systems, 9(12):1059-1082, Dec. 1994.
[19] F. Ricci and P. Avesani. Learning a local similarity metric for case-based reasoning. In International Conference on Case-Based Reasoning (ICCBR-95), Sesimbra, Portugal, Oct. 23-26, 1995.
[20] G. L. Ritter, H. B. Woodruff, S. R. Lowry, and T. L. Isenhour. An algorithm for selective nearest neighbor decision rule. IEEE Transactions on Information Theory, IT-21(6):665-669, 1975.
[21] D. E. Rumelhart and J. L. McClelland, editors. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986.
[22] S. Salzberg, A. Delcher, D. Heath, and S. Kasif. Best-case results for nearest neighbor learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(6):599-610, 1995.
[23] S. L. Salzberg. A nearest hyperrectangle learning method. Machine Learning, 6:251-276, 1991.

[24] R. D. Short and K. Fukunaga. A new nearest neighbour distance measure. In Proceedings of the 5th IEEE International Conference on Pattern Recognition, pages 81-86, Miami Beach, FL, 1980.
[25] R. D. Short and K. Fukunaga. Optimal distance measure for nearest neighbour classification. IEEE Transactions on Information Theory, 27:622-627, 1981.
[26] V. Soltan and A. Gorpinevich. Minimum dissection of a rectilinear polygon with arbitrary holes into rectangles. Discrete and Computational Geometry, 9:57-79, 1993.
[27] C. Stanfill and D. Waltz. Toward memory-based reasoning. Communications of the ACM, 29:1213-1229, 1986.
[28] C. W. Swonger. Sample set condensation for a condensed nearest neighbor decision rule for pattern recognition. In S. Watanabe, editor, Frontiers of Pattern Recognition, pages 511-519. Academic Press, 1972.
[29] D. Wettschereck and D. Aha. Weighting features. In M. Veloso and A. Aamodt, editors, Case-Based Reasoning, Research and Development, pages 347-358. Springer, 1995.
[30] D. L. Wilson. Asymptotic properties of nearest neighbor rule using edited data. IEEE Transactions on Systems, Man, and Cybernetics, SMC-2(3):408-421, 1972.
[31] L. Xu, A. Krzyzak, and E. Oja. Rival penalized competitive learning for cluster analysis, RBF net, and curve detection. IEEE Transactions on Neural Networks, 4(4):636-649, 1993.
[32] F. F. Yao. Computational geometry. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, pages 343-389. Elsevier, 1990.
