SNN: A Supervised Clustering Algorithm

Jesús S. Aguilar, Roberto Ruiz, José C. Riquelme, and Raúl Giráldez

Department of Computer Science, University of Sevilla
Avda. Reina Mercedes s/n, 41011 Sevilla, Spain
[email protected]

Abstract. In this paper we present a new algorithm, based on the nearest neighbours method, for discovering groups and identifying interesting distributions in the underlying data of labelled databases. We introduce the theory of nearest neighbours sets on which the algorithm S-NN (Similar Nearest Neighbours) is based. Traditional clustering algorithms are very sensitive to user-defined parameters, and expert knowledge is required to choose their values. Frequently, these algorithms are fragile in the presence of outliers and only adjust well to spherical shapes. Experiments have shown that S-NN is accurate at discovering clusters of arbitrary shape and density, since it takes into account the internal features of each cluster and does not depend on a user-supplied static model. S-NN achieves this by collecting the nearest neighbours with the same label until an enemy (an example with a different label) is found. Its determinism and the results it offers to the researcher make it a valuable tool for representing the knowledge inherent in labelled databases.

Keywords: clustering, supervised learning, nearest neighbours.

1. Introduction

In the area of supervised learning there are several techniques to classify a new example from the labelled database from which the inherent knowledge has been obtained. The form in which that knowledge is stated depends on the technique (decision rules, decision trees, association rules, etc.); however, some methods do not provide that knowledge and limit themselves to carrying out the classification (neural networks, Bayesian models, nearest neighbours, etc.). From the works of [3], [5], [9], [6], [7], [4], [10], [11], or more recently [12], [8], and [1], research has mainly focused on the convergence of the method, the search for prototypes or separating surfaces, editing and condensing techniques, and the acceleration of the algorithm. However, there has been no interest in providing the nearest neighbours technique with a way to represent the knowledge inherent in the data.

Clustering, in Data Mining, is a useful technique for grouping data points such that points in a single cluster have similar characteristics (or are close to each other). Traditional clustering algorithms are applied in the area of unsupervised learning. S-NN employs a novel hierarchical clustering algorithm based on nearest neighbour techniques. S-NN starts with each input as a separate cluster and at each successive step merges the clusters with identical neighbours. We collect all the neighbours whose distances are shorter than that of the first enemy, that is to say, the first example with a different label.

The remainder of the paper is organised as follows. In Sections 2 and 3, we survey the basic concepts of the theory of nearest neighbours sets. These rigorous definitions allow us to apply the supervised clustering algorithm. The steps involved in clustering using S-NN are described in Section 4. In Section 5, we present the results of our experiments. Section 6 concludes and presents our ideas for future work.

L. Monostori, J. Váncza, and M. Ali (Eds.): IEA/AIE 2001, LNAI 2070, pp. 207-216, 2001. © Springer-Verlag Berlin Heidelberg 2001

2. Basic Concepts

Before describing the theory of nearest neighbours sets, we recall the concepts of classical set theory that are needed for its development. We will use the known operations on sets: ∈, ∪, ∩ and # (cardinality of a set). We will also use the logic operations on the set {F, T} (false and true): ∧, ∨ and ¬; and the following generalisations: ∀ (universal quantifier) and ∃ (existential quantifier), where

$$
\begin{aligned}
z &= \forall i : D(x) \cdot E \equiv E(x_1) \wedge E(x_2) \wedge \dots \wedge E(x_n) \\
z &= \exists i : D(x) \cdot E \equiv E(x_1) \vee E(x_2) \vee \dots \vee E(x_n)
\end{aligned}
\tag{1}
$$

and z is T if all the expressions (respectively, some of the expressions) E(x_i) are T in the domain D for x = (x_1, ..., x_n), when we consider the universal (respectively, existential) quantifier.

Definition 1 (Sequence): a sequence is a finite or infinite collection of elements with an inherent (sequential) order of access. Access always begins at the first element, and to reach any element i it is necessary to pass through the i−1 previous ones. Since its definition is inherited from sets, it also inherits the associated operations ∈, ∪, ∩, − and #. Likewise, we define the following operations for a sequence type S of elements of type T: ⟨⟩ : → S (empty sequence); _+_ : S × T → S (insertion of an element at the end of the sequence); [_] : S × N → T (access to the ith element of the sequence, with i ∈ {1..#S}); and

+ i : D · E (generalised concatenation of sequences), where

$$
+\, i : \{1..k\} \cdot s(i) = s(1) + s(2) + \dots + s(k)
\tag{2}
$$

with s(i) sequences. For convenience, it will be written as $\overset{k}{\underset{i=1}{+}}\, s(i)$.

Definition 2 (Ordered Sequence): a sequence s of size #s is ordered if it satisfies

$$
\forall i : \{1..(\#s) - 1\} \cdot s[i] \le_T s[i+1]
\tag{3}
$$

where $\le_T$ is an established total order relation between the elements of type T of the sequence.
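Definition 2 can be illustrated with a minimal Python sketch. It assumes that sequences are modelled as Python lists and that the total order is supplied as a function; the names `is_ordered` and `leq` are illustrative and do not come from the paper.

```python
import operator

def is_ordered(s, leq=operator.le):
    """Return True if leq(s[i], s[i+1]) holds for every consecutive pair (equation 3)."""
    return all(leq(s[i], s[i + 1]) for i in range(len(s) - 1))

# Non-decreasing integers under the usual order.
print(is_ordered([1, 2, 2, 5]))
# A sequence that violates the order.
print(is_ordered(["b", "a"]))
# (index, distance) pairs compared by their second coordinate only.
print(is_ordered([(4, 1.0), (3, 2.5)], leq=lambda a, b: a[1] <= b[1]))
```

Passing the order as a parameter mirrors the subscript T of the relation: the same check works for any element type once its total order is given.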


3. Definitions

Definition 3 (Attribute): an attribute A is defined by a set of values. The attribute can be continuous or discrete. If the attribute is continuous, the set of values is bounded by the extreme values of an interval, which form the range of values of the attribute. If the attribute is discrete, the set of values is an enumeration of the possible values of the attribute. We will name C the set of values that the label can take.

Definition 4 (Example): an example E is a tuple formed by the Cartesian product of the condition attributes and the decision attribute. Likewise, we define the following operations to access the condition attributes or the label:

$$
atr : E \times N \to A \qquad etiq : E \to C
\tag{4}
$$

Definition 5 (Universe): the universe U is a sequence of examples. From now on, a database with n examples, each with m attributes (the last one being the label), forms the particular universe. Then U = ⟨e_1, ..., e_n⟩. Since we model the database as a sequence, an example of the database is accessed through the sequence: if the sequence is s, then s[i] represents the ith example of the database. To access the jth attribute of the example, which we have modelled as a tuple, we write atr(s[i], j), and to obtain its label, etiq(s[i]).

Definition 6 (Distance): the distance between two examples is a function that fulfils the properties of a metric space, that is to say,

$$
d : E \times E \to \mathbb{R}^{+} \cup \{0\}
\tag{5}
$$

with the following properties: it is zero for identical examples (reflexivity), non-negative, symmetric, and satisfies the triangle inequality. Since the examples belong to a sequence, we can redefine the distance in terms of the positions that the examples occupy in the sequence; therefore, to compute the distance between two examples e_i and e_j, we write d(i, j).

Definition 7 (Sequence of Distances): the sequence SD(i) is formed by the distances from an example i to all the examples of the universe (including itself), and is defined by

$$
SD(i) = \overset{\#s}{\underset{j=1}{+}}\, (j, d(i,j))
\tag{6}
$$

In the universe U, the sequence associated with the first example is SD(1) = ⟨(1, d(1,1)), (2, d(1,2)), ..., (n, d(1,n))⟩, and each element of the sequence is a pair formed by the position of the example to which the distance is computed and the value of that distance. Hence, in the expression (j, d(i,j)) the first coordinate is the index of an example and the second coordinate is the distance from example i to the example whose index is given by the first coordinate. To access each of the two coordinates easily, we define two operations on the pair:

$$
ind : N \times \mathbb{R} \to N \qquad dist : N \times \mathbb{R} \to \mathbb{R}
\tag{7}
$$
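As an illustration of Definition 7, the sketch below builds SD(i) for a small universe. The Euclidean distance, the helper names `distance` and `sequence_of_distances`, and the toy data are assumptions made for this example; the paper only requires d to be a metric. Indices are 1-based to match the text.

```python
import math

def distance(a, b):
    """Euclidean distance between two attribute tuples (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def sequence_of_distances(universe, i):
    """Return SD(i) as a list of (index, distance) pairs with 1-based indices."""
    attrs_i = universe[i - 1][0]
    return [(j, distance(attrs_i, attrs_j))
            for j, (attrs_j, _label) in enumerate(universe, start=1)]

# Toy universe: each example is (condition attributes, label).
U = [((0.0, 0.0), "A"), ((3.0, 4.0), "A"), ((6.0, 8.0), "B")]
sd1 = sequence_of_distances(U, 1)
print(sd1)
```

The operations ind and dist of equation (7) correspond here to taking the first and second element of each pair.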


Definition 8 (Ordered Sequence of Distances): from a sequence P of pairs (index, distance), we can obtain a sequence Q ordered by distance if Q fulfils the following property:

$$
\forall j : \{1..\#Q-1\} \cdot dist(Q[j]) \le dist(Q[j+1]) \;\wedge\; \forall j : \{1..\#Q\} \cdot \exists k : \{1..\#P\} \cdot ind(P[k]) = ind(Q[j])
\tag{8}
$$

which we abbreviate as ordered(Q, P). An ordered sequence of distances OSD(i) fulfils ordered(OSD(i), SD(i)).

Definition 9 (Relation of Neighbourhood): two examples whose positions in the ordered sequence are i and j are neighbours with respect to a third example k if, in the ordered sequence of distances of example k, OSD(k), there is no example between i and j (or between j and i) whose label differs from that of i and j. Therefore, if examples i and j do not have the same label, they are not neighbours. The relation can be defined in the following way:

$$
R_k = \{ (u[ind(OSD(k)[i])],\, u[ind(OSD(k)[j])]) \in U^2 \mid \forall h : \{\min(i,j)+1\,..\,\max(i,j)-1\} \cdot etiq(u[ind(OSD(k)[i])]) = etiq(u[ind(OSD(k)[j])]) = etiq(u[ind(OSD(k)[h])]) \}
\tag{9}
$$

For example, if OSD(1) lists the examples in the order 1, 4, 5, 34, 3, ..., we say that examples 4 and 3 are neighbours with respect to 1 if examples 4 and 3 have the same label and all the examples between them (5 and 34) have that label too.

Definition 10 (Class of Neighbours of an Example i with respect to another Example k): the class of neighbours of an example i with respect to another example k consists of all those examples that, in the ordered sequence of distances of k, can be grouped around i in a region of examples of the same class. The class is defined from the neighbourhood relation as follows:

$$
[i]_k = \{ j \in N \mid u[i] \; R_k \; u[j] \}
$$

This class of neighbours can also be understood as a sequence, because an order relative to the distance intrinsically exists [2].

Definition 11 (Ordered Subsequence of Order k with respect to an Example i): given the neighbourhood relation and the definition of class of neighbours, we can construct the ordered sequence of distances of an example i as a concatenation of ordered subsequences, that is to say,

$$
OSD(i) = OSD(i)_1 + OSD(i)_2 + \dots + OSD(i)_k + \dots + OSD(i)_z
\tag{10}
$$

where every subsequence OSD(i)_k is built from a separate class of examples in relation to i. In this way, each OSD(i)_k is a class of examples that share the same decision attribute, which differs from the decision attribute of the classes OSD(i)_{k−1} and OSD(i)_{k+1}. Therefore, we can represent the database (together with the information about the distances) in the following way:

$$
\begin{aligned}
OSD(1) &= OSD(1)_1 + OSD(1)_2 + \dots + OSD(1)_{k_1} \\
OSD(2) &= OSD(2)_1 + OSD(2)_2 + \dots + OSD(2)_{k_2} \\
&\;\;\vdots \\
OSD(n) &= OSD(n)_1 + OSD(n)_2 + \dots + OSD(n)_{k_n}
\end{aligned}
\tag{11}
$$

where k_i ∈ {1..n}; if some k_i were 1, all the examples would belong to the same class, and the closer k_i is to n, in principle, the more homogeneously distributed the examples of the same class will be in the database.

Since each element of the sequence OSD associates an example with the distance to it, we now drop the distance and associate with each example of the database a sequence of classes whose concatenation groups all the examples of the database. Hence each example (written before the symbol →) is associated with an indefinite number of classes in a specific order that implicitly contains the information about their "proximity" to the initial example. Then,

$$
\begin{aligned}
{}[1] &\to [1]_1 + [1]_2 + \dots + [1]_{k_1} \\
{}[2] &\to [2]_1 + [2]_2 + \dots + [2]_{k_2} \\
&\;\;\vdots \\
{}[n] &\to [n]_1 + [n]_2 + \dots + [n]_{k_n}
\end{aligned}
\tag{12}
$$

indicates that example k has a class of neighbouring examples [k]_1 whose labels are the same as that of k. Next comes another class of neighbouring examples [k]_2 whose labels differ from those of the previous class [k]_1 and of the following class [k]_3, and so on. We call these classes [i]_j the classes j-neighbours of an example i. From a mathematical point of view, we have obtained the quotient set according to the neighbourhood relation R for each example of the universe:

$$
\forall i : \{1..\#U\} \cdot [i] = OSD(i) / R
\tag{13}
$$
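Definitions 8 to 11 can be sketched in a few lines of Python: sort the sequence of distances, then split it into maximal runs of consecutive examples with the same label. The function names are illustrative. The indices and the first five distances follow the running example of the text, while the label names "A" and "B" and the last three distances are made up for the illustration.

```python
from itertools import groupby

def ordered_sequence(sd):
    """OSD: sort (index, distance) pairs by distance (Definition 8)."""
    return sorted(sd, key=lambda pair: pair[1])

def neighbour_classes(osd, label_of):
    """Split OSD into maximal runs of consecutive same-label examples,
    i.e. the classes [i]_1, [i]_2, ... with the distances dropped."""
    return [[index for index, _ in run]
            for _, run in groupby(osd, key=lambda pair: label_of[pair[0]])]

# SD(1) in index order; labels and the distances 5, 6, 7 are assumptions.
sd1 = [(1, 0), (2, 6), (3, 4), (4, 1), (5, 2), (31, 7), (34, 3), (73, 5)]
label_of = {1: "A", 3: "A", 4: "A", 5: "A", 34: "A", 2: "B", 31: "B", 73: "B"}

osd1 = ordered_sequence(sd1)
classes = neighbour_classes(osd1, label_of)
print(classes)
```

`itertools.groupby` only groups consecutive items, which is exactly the behaviour needed here: a later run with the same label starts a new class rather than joining an earlier one.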

For example, if OSD(1) = ⟨(1,0), (4,1), (5,2), (34,3), (3,4)⟩ + ⟨(73, d(1,73)), (2, d(1,2)), (31, d(1,31))⟩ + ..., we are indicating that examples 1, 4, 5, 34 and 3 have the same label (the values 0, 1, 2, 3 and 4 being the distances from each of them to example 1), which differs from the label of examples 73, 2 and 31. From this we can construct the class [1] in the following form: [1] → [1, 4, 5, 34, 3] + [73, 2, 31] + ... The class 1-neighbour of 1 is [1, 4, 5, 34, 3]; the class 2-neighbour of 1 is [73, 2, 31]; and so on.

Definition 12 (Class of Order k j-Neighbour of an Example i): the class of order 0 1-neighbour of an example i, $[i]^0_1$, is defined as the class of neighbours of example i with respect to itself. That is to say, since the first example in the ordered sequence of distances of an example i, OSD(i), is always i itself (the distance to itself is 0), the class of order 0 1-neighbour of example i is the set of its neighbouring examples that have the same label or, put another way, those that belong to OSD(i)_1. Likewise, the class of order 0 j-neighbour of an example i is $[i]^0_j$. We are especially interested in the classes of order k 1-neighbours. We define the class of order 1 1-neighbour as:

$$
[i]^1_1 = [i]^0_1 \cup \bigcup_{k \in [i]^0_1} [k]^0_1
\tag{14}
$$

The interpretation of this expression is the following: since $[i]^0_1$ contains all the examples with the same label as i that are nearest to it (including i itself) until another example with a different label is found, the class $[i]^1_1$ contains the examples of $[i]^0_1$ plus their 1-neighbours. For example, in Fig. 1, example 6 does not belong to $[i]^0_1$ because it has another label.

[Figure: example i surrounded by examples 1, 2, 3, 4, 5 and 6]

Fig. 1.
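The classes of order 0 and order 1 can be sketched as follows. This is a minimal illustration under stated assumptions: examples are points on a line, the distance is the absolute difference, indices are 0-based, and the function names and toy data are hypothetical rather than taken from the paper.

```python
def class_order0(points, labels, i):
    """Order-0 class of i: i plus its nearest same-label examples
    collected before the first enemy (example with another label)."""
    others = sorted((j for j in range(len(points)) if j != i),
                    key=lambda j: abs(points[j] - points[i]))
    members = {i}
    for j in others:
        if labels[j] != labels[i]:
            break  # first enemy reached: stop collecting
        members.add(j)
    return members

def class_order1(points, labels, i):
    """Order-1 class of i: its order-0 class together with the order-0
    classes of all the members of that class."""
    base = class_order0(points, labels, i)
    result = set(base)
    for k in base:
        result |= class_order0(points, labels, k)
    return result

points = [0.0, 1.0, 2.2, -1.5]
labels = ["A", "A", "A", "B"]
print(class_order0(points, labels, 0))
print(class_order1(points, labels, 0))
```

In this toy data the enemy at -1.5 is closer to example 0 than example 2 is, so example 2 is missing from the order-0 class of example 0 and only enters through the order-0 class of example 1.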

In general, the class of order k 1-neighbour of an example i is defined as:

$$
[i]^k_1 = \bigcup_{0 \le j < k} [i]^j_1
\tag{15}
$$

And, therefore, the class of order k j-neighbour of an example i is defined as:

$$
[i]^k_j = \bigcup_{0 \le h < k} [i]^h_j
\tag{16}
$$

For convenience, we will speak of a k-class instead of a class of order k and, therefore, of a k-class j-neighbour instead of a class of order k j-neighbour. In particular, we are interested in the k-classes 1-neighbours, and when speaking about them we omit the subscript 1, that is to say, instead of $[i]^k_1$ we write $[i]^k$. We state the neighbourhood order only when it is different from 1.

Definition 13 (Equality of Classes): two classes are equal when both contain exactly the same examples, even if in a different order. Formally,

$$
[i] = [j] \Leftrightarrow \big( \forall e\,(e \in [i] \Rightarrow e \in [j]) \wedge \forall e\,(e \in [j] \Rightarrow e \in [i]) \big)
\tag{17}
$$

Definition 14 (Set of k-Classes 1-Neighbours): we define the set of k-classes 1-neighbours as the set formed by the k-class 1-neighbour of each example. For convenience, it will be called the k-set of neighbouring classes, instead of the set of k-classes 1-neighbours, and it will be written $SN^k$, where

$$
SN^k = \{ [1]^k, [2]^k, \dots, [n]^k \}
\tag{18}
$$

Definition 15 (Reduced k-Set of Neighbouring Classes): we define the reduced k-set of neighbouring classes as the set of k-classes 1-neighbours in which no two classes are equal. Formally,

$$
RSN^k = \{ [i]^k \in SN^k \mid \forall [j]^k \in SN^k \cdot i \ne j \Rightarrow [i]^k \ne [j]^k \}
\tag{19}
$$

We identify the reduced sets of neighbouring classes with the examples that have been reduced, so that no information is lost; that is to say, if the classes [i] and [j] are equal because their neighbours are the same, [w_1, ..., w_k], then [i, j] has [w_1, ..., w_k] as neighbours.


4. Algorithm "Similar Nearest Neighbour" (S-NN)

Having given all the definitions that support the theory presented in this work, we describe the details of our algorithm in Fig. 2.

    S-NN (U: Database) ret (RSN: Set of Classes)
        i ← 0
        SN_1^i ← { }
        For each example j of U
            SN_1^i ← SN_1^i ∪ { [j]_1^i }
        RSN_1^(i−1) ← { }        (by convenience RSN_1^(−1))
        While SN_1^i ≠ RSN_1^(i−1)
            RSN_1^i ← reduction(SN_1^i)
            SN_1^(i+1) ← { }
            For each [j]_1^i ∈ RSN_1^i
                For each k ∈ [j]_1^i
                    [j]_1^(i+1) ← [j]_1^(i+1) ∪ [k]_1^i
                SN_1^(i+1) ← SN_1^(i+1) ∪ { [j]_1^(i+1) }
            i ← i + 1

Fig. 2. Algorithm
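The loop of Fig. 2 can be sketched in Python as below. This is a simplified fixed-point variant, not the authors' implementation: it keeps one class per example and reduces once at the end instead of at every iteration, on the assumption that both reach the same final classes. The Euclidean distance (`math.dist`, Python 3.8 or later), 0-based indices, the function names and the toy data are all assumptions of the sketch.

```python
import math

def order0_class(points, labels, i):
    """Order-0 class: i plus its nearest same-label examples before the first enemy."""
    others = sorted((j for j in range(len(points)) if j != i),
                    key=lambda j: math.dist(points[i], points[j]))
    cls = {i}
    for j in others:
        if labels[j] != labels[i]:
            break  # first enemy: stop collecting
        cls.add(j)
    return frozenset(cls)

def snn(points, labels):
    """Return the regions as a sorted list of (members, neighbours) index lists."""
    n = len(points)
    cls = {i: order0_class(points, labels, i) for i in range(n)}
    while True:
        # Expansion: each class absorbs the current classes of its members.
        grown = {i: frozenset().union(*(cls[k] for k in cls[i])) for i in range(n)}
        if grown == cls:
            break  # nothing changed, so the classes cannot be simplified further
        cls = grown
    # Reduction: examples with identical neighbour classes form one region.
    regions = {}
    for i, neighbours in cls.items():
        regions.setdefault(neighbours, []).append(i)
    return sorted((members, sorted(neighbours))
                  for neighbours, members in regions.items())

points = [(0.0,), (1.0,), (2.2,), (-1.5,), (-2.5,)]
labels = ["A", "A", "A", "B", "B"]
regions = snn(points, labels)
print(regions)
```

Each returned pair lists the examples of a region and the neighbours they share, which corresponds to the two columns reported per region in the results.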

The input parameter is the database U, containing n examples with m attributes. As mentioned earlier, starting with the individual points as individual clusters, at each successive step the clusters with identical neighbours are merged. S-NN treats each input point as a separate cluster and, in each iteration of the while loop, computes the neighbours of each cluster member. The process is repeated until the set of clusters cannot be simplified further.

The expression RSN_1^i ← reduction(SN_1^i) invokes the algorithm shown in Fig. 3, whose task is to simplify the set of classes by eliminating those classes that have exactly the same neighbours. If two classes x and y have the same neighbours, then the examples of y are added to those of x, so that both share exactly the same neighbours.

    reduction (C: Set of Classes) ret (RSN: Set of Classes)
        RSN ← C
        For each (x, y) with x, y ∈ C
            If [x] = [y]        ([x] = [x] ∪ [y] = [x] ∩ [y])
                RSN ← RSN − {y}
                x ← x ∪ y

Fig. 3. Reduction
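The reduction step can be sketched as follows, assuming each class is represented as a pair (members, neighbours) of Python sets. That representation and the sample data are assumptions made for the illustration; only the name `reduction` comes from Fig. 3.

```python
def reduction(classes):
    """Merge the classes whose neighbour sets are exactly equal.

    classes: list of (members, neighbours) pairs of sets.
    Returns a new list in which each distinct neighbour set appears once
    and its members are the union of the merged classes.
    """
    reduced = []
    for members, neighbours in classes:
        for kept_members, kept_neighbours in reduced:
            if kept_neighbours == neighbours:
                kept_members |= members  # x <- x U y; y itself is not kept
                break
        else:
            # No equal class seen so far: keep a copy of this one.
            reduced.append((set(members), set(neighbours)))
    return reduced

classes = [({1}, {1, 4, 5}), ({4}, {5, 4, 1}), ({2}, {2, 7}),
           ({5}, {1, 4, 5}), ({7}, {7})]
print(reduction(classes))
```

Comparing Python sets with `==` ignores element order, which matches the definition of equality of classes.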

5. Results

5.1. Iris

We have used the Iris database to illustrate the complete results of the method because they are small enough to be included in the article. However, for lack of space we cannot include the intermediate results (the sets SN and RSN for each iteration). The following table lists, for each region, a header of the form [class, examples of the class, neighbours of the examples of the class]: the first value is the class or label; the second is the number of examples of the region that have that label; and the third is the number of neighbours of the region. The header is followed by the examples of the region, whose cardinality is the second value of the header, and by the neighbours of those examples, whose cardinality is the third value of the header.

[A, 50, 50]
Examples: 1, 6, 10, 18, 26, 31, 36, 37, 40, 42, 44, 47, 50, 51, 53, 54, 55, 58, 59, 60, 63, 64, 67, 68, 71, 72, 78, 79, 87, 88, 91, 95, 96, 100, 101, 106, 107, 112, 115, 124, 125, 134, 135, 138, 139, 143, 144, 145, 149, 136
Neighbours: 1, 95, 106, 55, 36, 64, 125, 88, 107, 112, 145, 134, 72, 67, 63, 37, 100, 31, 54, 135, 50, 47, 68, 6, 18, 101, 144, 78, 42, 143, 149, 139, 53, 51, 26, 115, 40, 10, 44, 96, 60, 124, 91, 58, 79, 138, 87, 59, 136, 71

[C, 47, 48]
Examples: 2, 4, 17, 21, 23, 24, 39, 41, 45, 73, 80, 89, 102, 110, 126, 7, 13, 15, 20, 35, 27, 49, 104, 56, 57, 74, 148, 77, 81, 83, 111, 122, 123, 127, 131, 132, 146, 16, 75, 32, 34, 46, 52, 62, 82, 108, 137
Neighbours: 2, 57, 122, 83, 131, 4, 15, 132, 13, 35, 81, 27, 41, 111, 123, 74, 17, 102, 23, 20, 127, 148, 34, 7, 80, 146, 46, 32, 75, 16, 5, 56, 49, 77, 126, 52, 104, 110, 73, 24, 62, 45, 108, 137, 82, 39, 21, 89

[B, 47, 49]
Examples: 3, 28, 113, 8, 11, 14, 76, 85, 86, 116, 109, 121, 129, 19, 29, 30, 43, 66, 70, 99, 33, 98, 48, 133, 38, 61, 119, 65, 69, 84, 150, 93, 92, 94, 97, 103, 105, 114, 141, 118, 120, 140, 128, 130, 142, 22, 147
Neighbours: 3, 92, 141, 113, 142, 119, 61, 29, 128, 11, 117, 103, 130, 28, 118, 147, 22, 69, 30, 105, 114, 76, 86, 66, 94, 19, 121, 43, 99, 116, 85, 65, 8, 109, 14, 33, 140, 98, 133, 70, 48, 129, 93, 38, 9, 150, 97, 84, 120

[C, 1, 1]
Examples: 5
Neighbours: 5

[B, 1, 1]
Examples: 9
Neighbours: 9

[B, 1, 1]
Examples: 12
Neighbours: 12

[C, 1, 1]
Examples: 25
Neighbours: 25

[C, 1, 1]
Examples: 90
Neighbours: 90

[B, 1, 1]
Examples: 117
Neighbours: 117

For the Iris database 4 iterations were needed, and the cardinality of the set RSN in each of them was 98, 62, 19 and finally the 9 regions shown in the table. The method offers very valuable information because it provides:

The number of regions: 9. If the examples of a region coincide with its neighbours (as happens for class A), then the region is clearly separable from the rest.

The examples that make the classification difficult (5, 9, 12, 25, 90 and 117); we could therefore remove them from the database before a later classification.

An estimation of the error rate on the training file (note that if we keep the first three regions we cover 144 of the 150 examples, around 96%, which is approximately what other good classifiers provide).

5.2. Breast Cancer

For this database (with the 683 examples without noise), 5 iterations were needed and 44 regions were obtained. The numbers of regions calculated in each iteration are 440, 308, 142, 45 and 44. The first two regions are for class A (427 examples) and for class B (210 examples), which means that these two alone cover around 93.26% of the information.

Regarding the computational cost of the algorithm, the executions were made on a 550 MHz Pentium PC; for the Iris database the algorithm takes less than 1 second, and for the Breast Cancer database it takes 2 minutes.
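The coverage figures quoted in Sections 5.1 and 5.2 follow directly from the region sizes. The short check below recomputes them; the function name `coverage` is illustrative.

```python
def coverage(region_sizes, total):
    """Percentage of the examples that fall in the given regions."""
    return 100.0 * sum(region_sizes) / total

iris = coverage([50, 47, 47], 150)    # three largest Iris regions
breast = coverage([427, 210], 683)    # two largest Breast Cancer regions
print(f"Iris: {iris:.3f}%  Breast Cancer: {breast:.3f}%")
```

For Breast Cancer the exact value is about 93.265%, which the text reports as 93.26%.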

6. Conclusions

The definitions presented in this article form the theory that supports the S-NN algorithm. The algorithm needs no parameters and is deterministic. As for the information that it provides, the Iris example shows that it is able to obtain: a geometric idea of the distribution of the examples of the database; an estimation of the number of regions (possible rules); an estimation of the difficulty of classifying the database; and the examples that make the database difficult to learn, with a view to eliminating them in the learning phase.

On the other hand, the S-NN algorithm opens some interesting directions, which we are studying, concerning the reduction criterion (see step 2.1 of the reduction algorithm). Three reduction criteria exist: a restrictive criterion (the one currently applied: two classes are reduced if they are exactly equal); a moderate criterion (two classes are reduced if one is included in the other); and, finally, a relaxed criterion (two classes are reduced if their intersection is not empty). These criteria provide different solutions: the more restrictive the reduction criterion, the more numerous the regions. The characteristics of the resulting solutions, as well as their analytical and geometric differences, will be the subject of future work. Likewise, another interesting line is the use of the set of classes of neighbours as a classifier.

Acknowledgements. This work has been supported by the Spanish Research Agency CICYT under grant TIC99-0351.


References

1. Aha, D. W. (1990). A Study of Instance-Based Algorithms for Supervised Learning Tasks: Mathematical, Empirical, and Psychological Evaluations. Ph.D. Dissertation. UCI.
2. Codrington, C. W. and Brodley, C. E. (1997). On the Qualitative Behavior of Impurity-Based Splitting Rules I: The Minima-Free Property. Technical Report, Purdue University.
3. Cover, T. M. and Hart, P. E. (1967). Nearest Neighbor Pattern Classification. NN-Pattern Classification Techniques. IEEE.
4. Chang, C. L. (1974). Finding Prototypes for Nearest Neighbor Classifiers. IEEE Transactions on Computers.
5. Hart, P. E. (1968). The Condensed Nearest Neighbor Rule. IEEE Transactions on Information Theory, IT-14.
6. Hellman, M. E. (1970). The Nearest Neighbor Classification Rule with a Reject Option. NN-Pattern Classification Techniques. IEEE.
7. Jarvis, R. A. and Patrick, E. A. (1973). Clustering Using a Similarity Measure Based on Shared Near Neighbors. IEEE Transactions on Computers.
8. Joussellin, A. and Dubuisson, B. (1987). A Link Between k-Nearest Neighbor Rules and Knowledge Based Systems by Sequence Analysis. Pattern Recognition Letters.
9. Patrick, E. A. and Fischer, F. P. (1970). A Generalized k-Nearest Neighbor Rule. NN-Pattern Classification Techniques. IEEE.
10. Ritter, G. L., Woodruff, H. B., Lowry, S. R. and Isenhour, T. L. (1975). An Algorithm for a Selective Nearest Neighbor Decision Rule. IEEE Transactions on Information Theory, 21.
11. Tomek, I. (1976). An Experiment with the Edited Nearest-Neighbor Rule. IEEE Transactions on Systems, Man, and Cybernetics, SMC-6.
12. Wilson, D. (1972). Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Transactions on Systems, Man, and Cybernetics, 2.
13. Fisher, D. (1995). Optimization and Simplification of Hierarchical Clusters. Proceedings of the International Conference on Knowledge Discovery and Data Mining.
14. Guha, S. (1998). CURE: An Efficient Clustering Algorithm for Large Databases. Proceedings of the 1998 ACM SIGMOD Conference.

Abstract. In this paper, we present a new algorithm based on the nearest neighbours method, for discovering groups and identifying interesting distributions in the underlying data in the labelled databases. We introduces the theory of nearest neighbours sets in order to base the algorithm S-NN (Similar Nearest Neighbours). Traditional clustering algorithms are very sensitive to the user-defined parameters and an expert knowledge is required to choose the values. Frequently, these algorithms are fragile in the presence of outliers and any adjust well to spherical shapes. Experiments have shown that S-NN is accurate discovering arbitrary shapes and density clusters, since it takes into account the internal features of each cluster, and it does not depend on a usersupplied static model. S-NN achieve this by collecting the nearest neighbours with the same label until the enemy is found (it has not the same label). The determinism and the results offered to the researcher turn it into a valuable tool for the representation of the inherent knowledge to the labelled databases. Keywords: clustering, supervised learning, nearest neighbours.

1. Introduction In the area of the supervised learning there are several techniques to classify a new example from the labelled database from which the inherent knowledge has been obtained. The form in which it estates the knowledge is dependent on the technique (decision rules, decision trees, association rules, etc.); however, some methods do not provide that knowledge limiting themselves to carry out the classification (neuronal networks, Bayesian model, nearest neighbours, etc.). From the works of [3], [5], [9], [6], [7], [4], [10], [11], or more recently [12], [8], and [1] the research has been mainly focused on the convergence of the method, the search of prototypes or surfaces of separation, the techniques of editing and condensing and in the acceleration of algorithm. However, there has not been any interest on providing to the technique of the nearest neighbours a form to represent the inherent knowledge to the information. Clustering, in Data Mining, is a useful technique for grouping data points such that points in a single cluster have similar characteristics (or are close to each other). Traditional clustering algorithms are applied in the area of the learning nonsupervised. S-NN employs a novel hierarchical clustering algorithm based on the nearest neighbour techniques. S-NN stars with each input as a separate cluster and at each successive step merges the clusters with identical neighbours. We collect all the L. Monostori, J. Váncza, and M. Ali (Eds.): IEA/AIE 2001, LNAI 2070, pp. 207-216, 2001. © Springer-Verlag Berlin Heidelberg 2001

208

J.S. Aguilar et al.

neighbours that their distances are shorter than the first enemy, that is to say, with not the same label. The remainder of the paper in organised as follows. In section 2 and 3, we survey basis contents of the theory of the nearest neighbours’ sets. These hard definitions allow us apply the supervised clustering algorithm. The step involved en clustering using S-NN are described in Section 4. In Section 5, we present the results of our experiments. Section 6 concludes and presents our ideas for future work.

2. Basic Concepts Before beginning to describe the near set theory, we have to mention the concepts of the classic theory of sets that are necessary for the development of that theory. We will use the operations known on sets: ³, , ¬ and # (cardinal of a set). Also we will use the logic operations on the set {F, T} (false and true): ¾, ¿, and ½; and the following generalisations: " (universal quantifier) and $ (existential quantifier), where

z = ∀i : D ( x).E ≡ E ( x1 ) ∧ E ( x2 ) ∧ ...E ( xn ) ) z = ∃i : D( x )·E ≡ E ( x1 ) ∨ E ( x2 ) ∨ K ∨ E ( xn )

(1)

and z is T if all the expressions (if some of the expressions) E(xi) are T in the domain D for x = ( x1 ,..., xn ) , if we consider the universal quantifier (existential quantifier). Definition 1 (Sequence): a sequence is a finite or infinite collection of elements with an inherent order of access (sequential). It is always begun by first and to accede to any element i, it will be necessary to pass through i-1 previous. Since its definition is inherited of sets, it also inherits the operations associated to these ³, , ¬, - and #. Likewise, we defined the following operations for sequence S of elements of T type: :S (empty sequence); _+_:ST S (insertion of an element in the end of the sequence); [_]:S N T (access to ith element of the sequence, with i ³{1... #S}); and

+i:D·E (generalised concatenation of sequences), where + i : {1..k}·s(i) = s(1) + s(2) + ... + s(k)

(2)

+ s(i) . k

with s(i) sequences. By convenience, it will be written as

i=1

Definition 2 (Ordered Sequence): a sequence s, of size #s is ordered if it satisfies: ∀i : {1..(# s ) − 1}· s [i ] ≤ T s [i + 1]

(3)

where T is an established relation of total order between the elements of T type of the sequence.

SNN: A Supervised Clustering Algorithm

209

3. Definitions Definition 3 (Attribute): attribute A is defined by a set of values. The attribute can be continuous or discrete. If the attribute is continuous, the set of values will be limited by the extreme values of an interval, forming therefore the rank of values for the attribute. If the attribute is discrete, the set of values of this one will appear like an enumeration of the possible values of the attribute. We will name C the set of values that can adopt the label. Definition 4 (Example): an example E is one row formed by the Cartesian product of the attributes of condition and decision. Likewise, we defined the following operations to get to the attributes of condition or their label. atr : E × N → A etiq : E → C

(4)

Definition 5 (Universe): the universe U is a sequence of examples. We will say that a database with n examples, each one of them with m attributes (the last one is denominated label), will form the particular universe from this moment. Then U=. Since we will model the database with a sequence, the access for an example of the database will be made by means of the access to the sequence, that is to say, the sequence is s, then s[i] represents the example ith of the database. To accede to jth attribute of the example, since we have modelled to this one with one row, one will become atr(s[i],j), and to know its label, etiq(s[i ]). Definition 6 (Distance): the distance between two examples is a function that fulfils the properties of a metric space, that is to say,

d : E × E → ℜ + ∪ {0}

(5)

with the following properties: reflective, defined nonnegative, symmetrical and transitive. Since the examples belong to a sequence, we can redefine the distance basing on the position that these examples occupy in the sequence, therefore, compute the range between two examples ei and ej, we will do d(i,j). Definition 7 (Sequence of Distances): sequence SD(i) formed by the distances of an example i to all the others, and is defined by

+( j, d (i, j)) #s

SD (i ) =

(6)

j =1

In the universe U, the sequence associated with the first example is $SD(1) = \langle (1, d(1,1)), (2, d(1,2)), \ldots, (n, d(1,n)) \rangle$, and each element of the sequence is a pair formed by the position of the example to which the distance is computed and the value of that distance. Hence, in the expression (j, d(i, j)) the first coordinate is the index of an example and the second coordinate is the distance from example i to the example whose index is given by the first coordinate. To access each of the two coordinates easily, we define two operations on the pair:

\[ ind : \mathbb{N} \times \mathbb{R} \to \mathbb{N} \qquad dist : \mathbb{N} \times \mathbb{R} \to \mathbb{R} \tag{7} \]
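A minimal sketch of Definitions 6 and 7 follows. It assumes continuous attributes and the Euclidean distance, which is only one possible metric (the text does not fix one), and it uses 0-based positions instead of the 1-based positions of the text. The function name sequence_of_distances is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Pair = Tuple[int, float]  # (position of an example, distance to it)


def d(x: Sequence[float], y: Sequence[float]) -> float:
    """Distance between two examples; Euclidean is an assumed choice of metric."""
    return math.dist(x, y)


def ind(p: Pair) -> int:
    """First coordinate of a pair: the position of the example."""
    return p[0]


def dist(p: Pair) -> float:
    """Second coordinate of a pair: the value of the distance."""
    return p[1]


def sequence_of_distances(s: Sequence[Sequence[float]], i: int) -> List[Pair]:
    """SD(i): one pair (j, d(i, j)) for every position j of the sequence s."""
    return [(j, d(s[i], s[j])) for j in range(len(s))]


# Hypothetical universe of three examples with two continuous attributes each.
s = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
sd0 = sequence_of_distances(s, 0)
print(sd0)  # [(0, 0.0), (1, 5.0), (2, 10.0)]
```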


Definition 8 (Ordered Sequence of Distances): from a sequence P of pairs (index, distance) we can obtain a sequence Q ordered by distance if Q fulfils the following property:

\[ \forall j : \{1..\#Q-1\} \cdot dist(Q[j]) \le dist(Q[j+1]) \;\wedge\; \forall j : \{1..\#Q\} \cdot \exists k : \{1..\#P\} \cdot ind(P[k]) = ind(Q[j]) \tag{8} \]

which we abbreviate as ordered(Q, P). An ordered sequence of distances OSD(i) fulfils ordered(OSD(i), SD(i)).

Definition 9 (Relation of Neighbourhood): two examples whose positions in the ordered sequence are i and j are neighbours with respect to a third example k if, in the ordered sequence of distances of k, OSD(k), there is no example between i and j (or between j and i) whose label differs from that of i and j. Therefore, if examples i and j do not have the same label, they are not neighbours. The relation can be defined as follows:

\[ R_k = \{ (u[ind(OSD(k)[i])],\, u[ind(OSD(k)[j])]) \in U^2 \mid \forall h : \{\min(i,j)+1..\max(i,j)-1\} \cdot etiq(u[ind(OSD(k)[i])]) = etiq(u[ind(OSD(k)[j])]) = etiq(u[ind(OSD(k)[h])]) \} \tag{9} \]

For example, if $OSD(1) = \langle (1,0), (4,1), (5,2), (34,3), (3,4), \ldots \rangle$, we say that examples 4 and 3 are neighbours with respect to 1 if examples 4 and 3 have the same label and all the examples between them (5 and 34) have that label too.

Definition 10 (Class of Neighbours of an Example i with respect to another Example k): the class of neighbours of an example i with respect to another example k consists of all those examples that, in the ordered sequence of distances of k, can be grouped around i in a region of examples of the same class. The class is defined from the neighbourhood relation as follows: $[i]_k = \{ j \in \mathbb{N} \mid u[i] \, R_k \, u[j] \}$. This class of neighbours can also be understood as a sequence, because an order relative to the distance intrinsically exists [2].

Definition 11 (Ordered Subsequence of Order k with respect to an Example i): given the relation of neighbourhood and the definition of class of neighbours, we can construct the ordered sequence of distances of an example i as the concatenation of ordered subsequences, that is,

\[ OSD(i) = OSD(i)_1 + OSD(i)_2 + \ldots + OSD(i)_k + \ldots + OSD(i)_z \tag{10} \]

where every subsequence $OSD(i)_k$ is built from one of the separate classes of examples relative to i. Thus each $OSD(i)_k$ is a class of examples that share the same decision attribute, which differs from the decision attribute of the classes $OSD(i)_{k-1}$ and $OSD(i)_{k+1}$. Therefore, we can represent the database (together with the information about the distances) as follows:

\[ \begin{array}{l} OSD(1) = OSD(1)_1 + OSD(1)_2 + \ldots + OSD(1)_{k_1} \\ OSD(2) = OSD(2)_1 + OSD(2)_2 + \ldots + OSD(2)_{k_2} \\ \qquad \cdots \\ OSD(n) = OSD(n)_1 + OSD(n)_2 + \ldots + OSD(n)_{k_n} \end{array} \tag{11} \]

where $k_i \in \{1..n\}$; if some $k_i$ were 1, all the examples would belong to the same class, and the closer $k_i$ is to n, in principle, the more homogeneously distributed the examples of the same class are in the database. Since the sequence OSD associates with each element an example and the distance to it, we now drop the distance and associate with each example of the database a sequence of classes whose concatenation groups all the examples of the database. Hence each example (written to the left of the symbol $\rightarrow$) has an indefinite number of classes associated with it, in a specific order that implicitly contains the information about their "proximity" to that example. Then,

\[ \begin{array}{l} [1] \rightarrow [1]_1 + [1]_2 + \ldots + [1]_{k_1} \\ [2] \rightarrow [2]_1 + [2]_2 + \ldots + [2]_{k_2} \\ \qquad \cdots \\ [n] \rightarrow [n]_1 + [n]_2 + \ldots + [n]_{k_n} \end{array} \tag{12} \]

indicates that example k has a class of neighbouring examples $[k]_1$ whose labels are the same as that of k. Next there is another class of neighbouring examples $[k]_2$ whose labels differ from those of the previous class $[k]_1$ and of the following class $[k]_3$, and so on. We call these classes $[i]_j$ the classes j-neighbours of an example i. From a mathematical point of view, we have obtained the quotient set under the relation of neighbourhood R for each example of the universe:

\[ \forall i : \{1..\#U\} \cdot [i] = OSD(i) / R \tag{13} \]
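Definitions 8 to 11 can be illustrated with a short sketch. It is not the authors' implementation: the function names are hypothetical, positions inside a sequence are 0-based, and the labels "A" and "B" and the distances 5.0, 6.0 and 7.0 of the last three examples are invented for the illustration.

```python
from itertools import groupby
from typing import Dict, List, Tuple

Pair = Tuple[int, float]  # (index of an example, distance to it)


def ordered_sequence(sd: List[Pair]) -> List[Pair]:
    """OSD(i): the pairs of SD(i) sorted by non-decreasing distance."""
    return sorted(sd, key=lambda p: p[1])


def is_ordered(q: List[Pair], p: List[Pair]) -> bool:
    """ordered(Q, P): Q is sorted by distance and every index of Q occurs in P."""
    sorted_ok = all(q[j][1] <= q[j + 1][1] for j in range(len(q) - 1))
    indices_ok = {a for a, _ in q} <= {a for a, _ in p}
    return sorted_ok and indices_ok


def are_neighbours(osd: List[Pair], labels: Dict[int, str], a: int, b: int) -> bool:
    """R_k: the examples at positions a and b of OSD(k), and all those
    between them, share one label."""
    lo, hi = min(a, b), max(a, b)
    return len({labels[idx] for idx, _ in osd[lo:hi + 1]}) == 1


def neighbour_classes(osd: List[Pair], labels: Dict[int, str]) -> List[List[int]]:
    """Split OSD(i) into maximal runs of equal label (classes j-neighbours)."""
    return [[idx for idx, _ in run]
            for _, run in groupby(osd, key=lambda p: labels[p[0]])]


# Hypothetical data: labels and the last three distances are assumptions.
labels = {1: "A", 4: "A", 5: "A", 34: "A", 3: "A", 73: "B", 2: "B", 31: "B"}
sd1 = [(1, 0.0), (2, 6.0), (3, 4.0), (4, 1.0), (5, 2.0),
       (31, 7.0), (34, 3.0), (73, 5.0)]
osd1 = ordered_sequence(sd1)
print(neighbour_classes(osd1, labels))  # [[1, 4, 5, 34, 3], [73, 2, 31]]
```

Here itertools.groupby is used because it groups only consecutive elements, which matches the idea that a class ends as soon as an example with a different label appears in the ordered sequence.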

For example, if $OSD(1) = \langle (1,0), (4,1), (5,2), (34,3), (3,4) \rangle + \langle (73, d(1,73)), (2, d(1,2)), (31, d(1,31)) \rangle + \ldots$, we are indicating that examples 1, 4, 5, 34 and 3 have the same label (the values 0, 1, 2, 3 and 4 are the distances from each of them to example 1), which in addition differs from that of examples 73, 2 and 31. From this we can construct the class [1] as follows: $[1] \rightarrow [1, 4, 5, 34, 3] + [73, 2, 31] + \ldots$ The class 1-neighbour of 1 is [1, 4, 5, 34, 3]; the class 2-neighbour of 1 is [73, 2, 31]; and so on.

Definition 12 (Class of Order k j-Neighbour of an Example i): the class of order 0 1-neighbour of an example i, $[i]_1^0$, is defined as the class of neighbours of i with respect to itself. That is, since the first example in the ordered sequence of distances of i, OSD(i), is always i itself (the distance to itself is 0), the class of order 0 1-neighbour of example i is the set of its neighbouring examples that have the same label or, put another way, those that belong to $OSD(i)_1$. Likewise, the class of order 0 j-neighbour of an example i is $[i]_j^0$. We are especially interested in the classes of order k 1-neighbours. We then define the class of order 1 1-neighbour as:

\[ [i]_1^1 = [i]_1^0 \cup \bigcup_{k \in [i]_1^0} [k]_1^0 \tag{14} \]

The interpretation of this expression is the following: since $[i]_1^0$ contains all the examples with the same label as i that are nearest to it (including i itself) until an example with a different label is found, the class $[i]_1^1$ contains those in $[i]_1^0$ plus their 1-neighbours. For example, in Fig. 1 the class $[i]_1^0$ contains i and examples 1 to 5 (example 6 does not belong to it because it has another label).

[Figure: example i together with examples 1, 2, 3, 4, 5 and 6]

Fig. 1.
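The classes of order 0 and order 1 can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the Euclidean distance, 0-based example indices, and that ties in distance are resolved by position (the text does not discuss ties). The data set is hypothetical.

```python
import math
from typing import FrozenSet, Sequence


def class_order0(i: int, points: Sequence[Sequence[float]],
                 labels: Sequence[str]) -> FrozenSet[int]:
    """[i]_1^0: the examples nearest to i that share its label, collected in
    order of distance until the first example with another label."""
    order = sorted(range(len(points)),
                   key=lambda j: math.dist(points[i], points[j]))
    members = []
    for j in order:
        if labels[j] != labels[i]:
            break  # first "enemy": stop collecting
        members.append(j)
    return frozenset(members)


def class_order1(i: int, points: Sequence[Sequence[float]],
                 labels: Sequence[str]) -> FrozenSet[int]:
    """[i]_1^1 as in (14): [i]_1^0 plus the order-0 classes of its members."""
    base = class_order0(i, points, labels)
    return base.union(*(class_order0(k, points, labels) for k in base))


# Hypothetical one-dimensional data set with two labels.
points = [(0.0,), (1.0,), (2.1,), (3.5,), (4.0,), (5.2,)]
labels = ["A", "A", "A", "B", "B", "B"]

print(sorted(class_order0(2, points, labels)))  # [1, 2]
print(sorted(class_order1(2, points, labels)))  # [0, 1, 2]
```

In this data set the nearest examples to example 2 are itself, example 1 and then example 3, which has another label, so its class of order 0 is {1, 2}; adding the order-0 class of example 1 brings in example 0.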

In general, the class of order k 1-neighbour of an example i is defined as:

\[ [i]_1^k = \bigcup_{0 \le j < k} [i]_1^j \tag{15} \]

And, therefore, the class of order k j-neighbour of an example i is defined as:

\[ [i]_j^k = \bigcup_{0 \le h < k} [i]_j^h \tag{16} \]

For convenience, we will speak of k-class instead of class of order k and, therefore, of k-class j-neighbour instead of class of order k j-neighbour. In particular, we are interested in the k-classes 1-neighbours, and when we speak about them we omit the subscript 1, that is, instead of $[i]_1^k$ we write $[i]^k$. We state the neighbourhood order explicitly only when it is different from 1.

Definition 13 (Equality of Classes): two classes are equal when both contain exactly the same examples, possibly in a different order. Formally,

\[ [i] = [j] \Leftrightarrow (\forall e \in [i] \Rightarrow e \in [j] \;\wedge\; \forall e \in [j] \Rightarrow e \in [i]) \tag{17} \]

Definition 14 (Set of k-Classes 1-Neighbours): we define the set of k-classes 1-neighbours as the set formed by the k-class 1-neighbour of each example. Also, for convenience, it will be called the k-set of neighbouring classes instead of the set of k-classes 1-neighbours, and it will be written $SN^k$, where

\[ SN^k = \{ [1]^k, [2]^k, \ldots, [n]^k \} \tag{18} \]

Definition 15 (Reduced k-Set of Neighbouring Classes): we define the reduced k-set of neighbouring classes as the set of k-classes 1-neighbours in which no two classes are equal. Formally,

\[ RSN^k = \{ [i]^k \in SN^k \mid \forall [j]^k \in SN^k \cdot i \ne j \Rightarrow [i]^k \ne [j]^k \} \tag{19} \]
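One possible way to represent Definitions 13 to 15 in code is sketched below. It assumes that a class is stored as a frozenset, so that equality ignores order as (17) requires, and that the reduction keeps one copy of each class together with the examples that share it. The function name reduced_set and the sample data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List


def reduced_set(sn: Dict[int, FrozenSet[int]]) -> Dict[FrozenSet[int], List[int]]:
    """Group the examples of SN^k by class: each distinct class is kept once,
    together with the examples whose classes were equal."""
    groups = defaultdict(list)
    for example, cls in sn.items():
        groups[cls].append(example)
    return dict(groups)


# Hypothetical SN^k: examples 1 and 2 have equal classes (order is irrelevant).
sn = {
    1: frozenset([1, 2, 3]),
    2: frozenset([3, 1, 2]),
    3: frozenset([2, 3]),
    4: frozenset([4]),
}
rsn = reduced_set(sn)
print(len(sn), len(rsn))  # 4 3
```

Using the class itself as dictionary key means that equal classes collapse automatically, while the list of examples attached to each key preserves which examples were merged.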

We will identify each class in the reduced set of neighbouring classes with the examples that have been merged, so that no information is lost: if the class [i] and the class [j] are equal because their neighbours are the same, [w1, ..., wk], then [i, j] has [w1, ..., wk] as its neighbours.


4. Algorithm "Similar Nearest Neighbour" (S-NN)

Having given all the definitions that support the theory presented in this work, we describe the details of our algorithm in Fig. 2.

S-NN (U: Database) ret (RSN: Set of Classes)
    i ← 0
    SN_1^i ← { }
    For each example j of U
        SN_1^i ← SN_1^i ∪ { [j]_1^i }
    RSN_1^{i-1} ← { }    (by convenience RSN_1^{-1})
    While SN_1^i ≠ RSN_1^{i-1}
        RSN_1^i ← reduction(SN_1^i)
        SN_1^{i+1} ← { }
        For each [j]_1^i ∈ RSN_1^i
            For each k ∈ [j]_1^i
                [j]_1^{i+1} ← [j]_1^{i+1} ∪ [k]_1^i
            SN_1^{i+1} ← SN_1^{i+1} ∪ { [j]_1^{i+1} }
        i ← i + 1

Fig. 2. Algorithm

The input parameter is the database U, containing n examples with m attributes. As mentioned earlier, starting with the individual points as individual clusters, at each successive step the clusters with identical neighbours are merged. The process is repeated until the set of clusters cannot be simplified any further. That is, S-NN initially treats each input point as a separate cluster and, in each iteration of the while-loop, computes the neighbours of each cluster member until the set of clusters cannot be simplified.

reduction (C: Set of Classes) ret (RSN: Set of Classes)
    RSN ← C
    For each (x, y) with x, y ∈ C
        If [x] = [y]    ([x] = [x] ∪ [y] = [x] ∩ [y])
            RSN ← RSN − {y}
            x ← x ∪ y

Fig. 3. Reduction

The expression RSN_1^i ← reduction(SN_1^i) invokes the algorithm shown in Fig. 3, whose task is to simplify the set of classes by eliminating those classes that have exactly the same neighbours. If two classes x and y have the same neighbours, the examples of y are added to those of x, so that the merged examples share exactly the same neighbours.
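The two procedures of Fig. 2 and Fig. 3 can be put together in a short runnable sketch. It is one reading of the pseudocode rather than the authors' implementation, under the following assumptions: Euclidean distance, 0-based example indices, ties in distance resolved by position, and classes stored as frozensets. Classes are recomputed for every example rather than only for one representative per reduced class, which yields the same sets because equal classes expand equally. The names order0_class, reduction and snn, and the data set, are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, FrozenSet, List, Sequence, Tuple

Regions = Dict[FrozenSet[int], List[int]]  # class -> examples that share it


def order0_class(i: int, points: Sequence[Sequence[float]],
                 labels: Sequence[str]) -> FrozenSet[int]:
    """Nearest examples to i with the label of i, up to the first enemy."""
    order = sorted(range(len(points)),
                   key=lambda j: math.dist(points[i], points[j]))
    members = []
    for j in order:
        if labels[j] != labels[i]:
            break
        members.append(j)
    return frozenset(members)


def reduction(classes: Dict[int, FrozenSet[int]]) -> Regions:
    """Keep each distinct class once, with the examples whose classes are equal."""
    regions = defaultdict(list)
    for example, cls in classes.items():
        regions[cls].append(example)
    return dict(regions)


def snn(points: Sequence[Sequence[float]],
        labels: Sequence[str]) -> Tuple[Regions, List[int]]:
    """Return the final regions and the number of reduced classes per iteration."""
    n = len(points)
    classes = {j: order0_class(j, points, labels) for j in range(n)}  # SN^0
    previous = set()  # RSN^{-1} is empty by convention
    sizes = []
    while set(classes.values()) != previous:  # While SN^i != RSN^{i-1}
        rsn = reduction(classes)              # RSN^i
        previous = set(rsn)
        sizes.append(len(rsn))
        # [j]^{i+1} is the union of [k]^i over every k in [j]^i
        classes = {j: frozenset().union(*(classes[k] for k in classes[j]))
                   for j in range(n)}
    return reduction(classes), sizes


# Hypothetical one-dimensional data set with two labels.
points = [(0.0,), (1.0,), (2.1,), (3.5,), (4.0,), (5.2,)]
labels = ["A", "A", "A", "B", "B", "B"]
regions, sizes = snn(points, labels)
print(sizes)  # [4, 2]
for cls, examples in regions.items():
    print(examples, sorted(cls))
```

For this toy data set the sketch needs two iterations: the four distinct order-0 classes collapse into two regions, {0, 1, 2} and {3, 4, 5}, each of which has the same examples and neighbours.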

5. Results

5.1. Iris

We have used the Iris database to illustrate the complete results of the method because they are small enough to be included in the article. However, for lack of space, we regret not being able to include the intermediate results (the sets SN and RSN for each iteration). The next table contains two types of entries for each region:

Header: [class, examples of the class, neighbours of the examples of the class]. The first value is the class or label; the second indicates how many examples belong to the class with that label; and the third is the number of neighbours of that class.

Content: the examples of the class, whose number matches the second value of the header, followed by the neighbours of the examples of the class, whose number matches the third value of the header.

[A, 50, 50]
Examples: 1, 6, 10, 18, 26, 31, 36, 37, 40, 42, 44, 47, 50, 51, 53, 54, 55, 58, 59, 60, 63, 64, 67, 68, 71, 72, 78, 79, 87, 88, 91, 95, 96, 100, 101, 106, 107, 112, 115, 124, 125, 134, 135, 138, 139, 143, 144, 145, 149, 136
Neighbours: 1, 95, 106, 55, 36, 64, 125, 88, 107, 112, 145, 134, 72, 67, 63, 37, 100, 31, 54, 135, 50, 47, 68, 6, 18, 101, 144, 78, 42, 143, 149, 139, 53, 51, 26, 115, 40, 10, 44, 96, 60, 124, 91, 58, 79, 138, 87, 59, 136, 71

[C, 47, 48]
Examples: 2, 4, 17, 21, 23, 24, 39, 41, 45, 73, 80, 89, 102, 110, 126, 7, 13, 15, 20, 35, 27, 49, 104, 56, 57, 74, 148, 77, 81, 83, 111, 122, 123, 127, 131, 132, 146, 16, 75, 32, 34, 46, 52, 62, 82, 108, 137
Neighbours: 2, 57, 122, 83, 131, 4, 15, 132, 13, 35, 81, 27, 41, 111, 123, 74, 17, 102, 23, 20, 127, 148, 34, 7, 80, 146, 46, 32, 75, 16, 5, 56, 49, 77, 126, 52, 104, 110, 73, 24, 62, 45, 108, 137, 82, 39, 21, 89

[B, 47, 49]
Examples: 3, 28, 113, 8, 11, 14, 76, 85, 86, 116, 109, 121, 129, 19, 29, 30, 43, 66, 70, 99, 33, 98, 48, 133, 38, 61, 119, 65, 69, 84, 150, 22, 93, 92, 94, 97, 103, 105, 114, 141, 118, 120, 140, 128, 130, 142, 147
Neighbours: 3, 92, 141, 113, 142, 119, 61, 29, 128, 11, 117, 103, 130, 28, 118, 147, 69, 30, 105, 114, 76, 86, 66, 94, 19, 121, 43, 99, 116, 85, 65, 8, 109, 14, 33, 140, 98, 133, 70, 48, 129, 93, 38, 9, 150, 97, 84, 120, 22

[C, 1, 1]
Examples: 5
Neighbours: 5

[B, 1, 1]
Examples: 9
Neighbours: 9

[B, 1, 1]
Examples: 12
Neighbours: 12

[C, 1, 1]
Examples: 25
Neighbours: 25

[C, 1, 1]
Examples: 90
Neighbours: 90

[B, 1, 1]
Examples: 117
Neighbours: 117

For the Iris database 4 iterations were needed, and the cardinality of the set RSN after each of them was 98, 62, 19 and, finally, the 9 shown in the table. The method offers very valuable information because it provides:

- The number of regions: 9. If the examples of a class coincide with its neighbours (as happens for class A), then the region is clearly separable from the rest.
- Which examples make the classification difficult (5, 9, 12, 25, 90 and 117); we could therefore remove them from the database for a later classification.
- An estimation of the error rate on the training file (note that if we keep the first three regions we would be at around 96%, which is approximately what other good classifiers provide).

5.2. Breast Cancer

For this database (with the 683 examples without noise) 5 iterations were needed and we obtained 44 regions. The numbers of regions computed in each iteration are 440, 308, 142, 45 and 44. The first two regions are for class A (427 examples) and for class B (210 examples), which means that these two alone account for about 93.26% of the information. Regarding the computational cost of the algorithm, the executions were run on a 550 MHz Pentium PC: for the Iris database the algorithm takes less than 1 second; for the Breast Cancer database it takes 2 minutes.
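The coverage figures quoted in Sections 5.1 and 5.2 follow directly from the region sizes reported above. As a quick check, taking the three largest Iris regions (50, 47 and 47 of 150 examples) and the two largest Breast Cancer regions (427 and 210 of 683 examples):

```latex
\[
\frac{50 + 47 + 47}{150} = \frac{144}{150} = 0.96,
\qquad
\frac{427 + 210}{683} = \frac{637}{683} \approx 0.93265 .
\]
```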

6. Conclusions

The definitions presented in this article provide the theory that supports the S-NN algorithm. The algorithm needs no parameters and, in addition, is deterministic. As for the information it provides, the Iris example shows that it is able to obtain: a geometric idea of the distribution of the examples of the database; an estimation of the number of regions (possible rules); an estimation of the difficulty of classifying the database; and the examples that make the learning of the database difficult, with a view to eliminating them in the learning phase. On the other hand, the S-NN algorithm opens some interesting directions, which we are studying, concerning the reduction criterion (see point 2.1 of the reduction algorithm). Three reduction criteria exist: a restrictive criterion (the one currently applied, that is, there is a reduction if the two classes are exactly equal); a moderate criterion (there is a reduction if one of the classes is included in the other); and, finally, a relaxed criterion (there is a reduction if the intersection of the classes is not empty). These criteria provide different solutions: the more restrictive the reduction criterion, the more regions the solution contains. The characteristics of the resulting solutions, as well as their differences, both analytical and geometrical, will be the subject of future work. Likewise, another interesting line is the use of the set of classes of neighbours as a classifier.

Acknowledgements. This work has been supported by the Spanish Research Agency CICYT under grant TIC99-0351.


References

1. Aha, D. W. (1990). A Study of Instance-Based Algorithms for Supervised Learning Tasks: Mathematical, Empirical, and Psychological Evaluations. Ph.D. Dissertation. UCI.
2. Codrington, C. W. and Brodley, C. E. (1997). On the Qualitative Behavior of Impurity-Based Splitting Rules I: The Minima-Free Property. Technical Report, Purdue University.
3. Cover, T. M. and Hart, P. E. (1967). Nearest Neighbor Pattern Classification. NN-Pattern Classification Techniques. IEEE.
4. Chang, C. L. (1974). Finding Prototypes for Nearest Neighbor Classifiers. IEEE Transactions on Computers.
5. Hart, P. E. (1968). The Condensed Nearest Neighbor Rule. IEEE Transactions on Information Theory, IT-14.
6. Hellman, M. E. (1970). The Nearest Neighbor Classification Rule with a Reject Option. NN-Pattern Classification Techniques. IEEE.
7. Jarvis, R. A. and Patrick, E. A. (1973). Clustering Using a Similarity Measure Based on Shared Near Neighbors. IEEE Transactions on Computers.
8. Joussellin, A. and Dubuisson, B. (1987). A Link Between k-Nearest Neighbor Rules and Knowledge Based Systems by Sequence Analysis. Pattern Recognition Letters.
9. Patrick, E. A. and Fischer, F. P. (1970). A Generalized k-Nearest Neighbor Rule. NN-Pattern Classification Techniques. IEEE.
10. Ritter, G. L., Woodruff, H. B., Lowry, S. R. and Isenhour, T. L. (1975). An Algorithm for a Selective Nearest Neighbor Decision Rule. IEEE Transactions on Information Theory, 21.
11. Tomek, I. (1976). An Experiment with the Edited Nearest-Neighbor Rule. IEEE Transactions on Systems, Man, and Cybernetics, SMC-6.
12. Wilson, D. (1972). Asymptotic Properties of Nearest Neighbor Rules Using Edited Data. IEEE Transactions on Systems, Man and Cybernetics, 2.
13. Fisher, D. (1995). Optimization and Simplification of Hierarchical Clusters. Proceedings of the International Conference on Knowledge Discovery and Data Mining.
14. Guha, S. (1998). CURE: An Efficient Clustering Algorithm for Large Databases. Proceedings of the 1998 ACM SIGMOD Conference.