
c-Means Clustering with the $\ell_1$ and $\ell_\infty$ Norms

Leon Bobrowski and James C. Bezdek

Manuscript received March 6, 1990; revised September 1, 1990, and November 16, 1990. This work was supported by NSF Grant IRI-9003252. J. C. Bezdek is with the Division of Computer Science, University of West Florida, Pensacola, FL 32514. L. Bobrowski is with the Institute of Biocybernetics, Polish Academy of Sciences, ul. Krajowej Rady Narodowej 55, PL-00-818 Warszawa, Poland. IEEE Log Number 9042582.

Abstract—An extension of the hard and fuzzy c-means (HCM/FCM) clustering algorithms is described. Specifically, these models are extended to admit the case where the (dis)similarity measure on pairs of numerical vectors includes two members of the Minkowski or p-norm family, viz., the $p = 1$ and $p = \infty$ (or "sup") norms. In the absence of theoretically necessary conditions to guide a numerical solution of the nonlinear constrained optimization problem associated with this case, it is shown that a basis exchange algorithm due to Bobrowski can be used to find approximate critical points of the new objective functions. This method broadens the applications horizon of the FCM family by enabling users to match "discontinuous" multidimensional numerical data structures with similarity measures that have nonhyperelliptical topologies. For example, data drawn from a mixture of uniform distributions have sharp or "boxy" edges; the $p = 1$ and $p = \infty$ norms have open and closed sets that match these shapes. The technique is illustrated with a small artificial data set, and the results are compared with the c-means clustering solution produced using the Euclidean norm.

I. INTRODUCTION

INTELLIGENT DIAGNOSTIC SYSTEMS, e.g., machine vision platforms that attempt to interpret scenes with various kinds of sensor and contextual information, almost always include (and depend upon) one or more pattern recognition algorithms for low and intermediate level data processing. The area of pattern recognition addressed by the present paper is called cluster analysis or unsupervised learning. Treatments of many classical approaches to this problem include the texts by Kohonen [1], Bezdek [2], Duda and Hart [3], Tou and Gonzalez [4], Hartigan [5], Dubes and Jain [6], and Pao [7]. More specifically, this paper concerns c-means (CM) clustering with global objective functions; interested readers may consult the books of Bezdek [2], Pal and Majumder [8], or Kandel [9] for introductions to a variety of models of this kind. A survey by Bezdek [10] provides an overview and bibliography of many recent theoretical and application articles on this and related topics. In what follows HCM and FCM are used, respectively, as acronyms for hard (or conventional) and fuzzy c-means. In this paper we focus on a generalization of the c-means clustering algorithms that enables them to use two members of the Minkowski or p-norm family, viz., the $p = 1$ and $p = \infty$ (or sup) norms. Section II prescribes our notation, provides an overview of the c-means models for inner product

norms, and defines the scope of the current work. Section III describes our algorithm for iteratively optimizing the FCM functional for the 1 and sup norms. Section IV contains a numerical example using a small artificial data set specifically chosen to illustrate the feasibility and necessity of the proposed approach. The example also makes an initial comparison with results secured using the Euclidean norm on the same data. Section V summarizes our conclusions and lists some ideas for further research.

II. THE c-MEANS PARTITIONING MODELS

The structure of partition spaces underlying clustering algorithms is generally well known, so this section is brief, following [2]. Let $c$ be an integer, $1 < c < M$, and let $X = \{x_1, x_2, \ldots, x_M\}$ denote a set of $M$ unlabeled column vectors in $R^N$; $X$ is numerical object data; the $j$th object (some physical entity such as a medical patient, airplane, seismic record, photograph, etc.) has vector $x_j$ as its numerical representation; $x_{js}$ is the $s$th characteristic (or feature) associated with object $j$, $1 \le s \le N$. Given $X$, we say that $c$ fuzzy subsets $\{u_k : X \to [0, 1]\}$ are a fuzzy c-partition of $X$ in case the $(cM)$ values $\{u_{kj} = u_k(x_j),\ 1 \le k \le c,\ 1 \le j \le M\}$ satisfy three conditions:

$$0 \le u_{kj} \le 1 \quad \forall\, k, j \tag{1a}$$

$$\sum_{k=1}^{c} u_{kj} = 1 \quad \forall\, j \tag{1b}$$

$$0 < \sum_{j=1}^{M} u_{kj} < M \quad \forall\, k. \tag{1c}$$

Each set of $(cM)$ values satisfying conditions (1) can be arrayed as a $(c \times M)$ matrix $U = [u_{kj}]$. The set of all such matrices comprises the nondegenerate fuzzy c-partitions of $X$:

$$M_{fcM} = \{U \in R^{cM} \mid u_{kj} \text{ satisfies conditions (1)}\ \forall\, k \text{ and } j\}. \tag{2}$$

And in case all the $u_{kj}$'s are either 0 or 1, we have the subset of hard (or conventional) c-partitions of $X$:

$$M_{cM} = \{U \in M_{fcM} \mid u_{kj} = 0 \text{ or } 1\ \forall\, k \text{ and } j\}. \tag{3}$$

$M_{cM}$ is a subset of $M_{fcM}$: every hard c-partition of $X$ is fuzzy, but not conversely. The reason these matrices are called partitions follows from the interpretation of $u_{kj}$ as the membership of $x_j$ in the $k$th partitioning subset (cluster) of $X$. All clustering algorithms generate solutions to the clustering problem for $X$ that are matrices in $M_{fcM}$.
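Conditions (1a)-(1c) are mechanical to check. The following sketch (ours, not from the paper; the function name is hypothetical) tests whether a candidate matrix belongs to $M_{fcM}$ in the sense of (2):

```python
import numpy as np

def is_fuzzy_c_partition(U, tol=1e-9):
    """Test conditions (1a)-(1c) for a candidate (c x M) membership matrix U."""
    c, M = U.shape
    cond_1a = np.all((U >= -tol) & (U <= 1 + tol))             # (1a): 0 <= u_kj <= 1
    cond_1b = np.allclose(U.sum(axis=0), 1.0, atol=tol)        # (1b): each column sums to 1
    row_sums = U.sum(axis=1)
    cond_1c = np.all((row_sums > tol) & (row_sums < M - tol))  # (1c): no empty or full cluster
    return bool(cond_1a and cond_1b and cond_1c)
```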


An "optimal" partition $U$ is one that groups together object data vectors (and hence the objects they represent) which share some well-defined (mathematical) similarity. Here $c$ is assumed to be known; otherwise, its value becomes a part of the clustering problem. This facet of the problem is often called the cluster validity question. Its resolution is the focus of a large and active research community (cf. Dubes [12]); this aspect of clustering will not be discussed further. The most well-known objective function for clustering in $X$ is the classical within-groups sum of squared errors (WGSS) function, which many authors refer to as $J_1$:

$$J_1(U, v; X) = \sum_{k=1}^{c} \sum_{j=1}^{M} u_{kj}\, \|x_j - v_k\|_I^2 \tag{4}$$

where $v = (v_1, v_2, \ldots, v_c)$ is a vector of (unknown) cluster centers (weights, or prototypes), $v_k \in R^N$ for $1 \le k \le c$, $U \in M_{cM}$ is a hard or conventional c-partition of $X$, and $\|x_j - v_k\|_I$ is the Euclidean norm on $R^N$. Optimal partitions $U^*$ of $X$ are taken from pairs $(U^*, v^*)$ that are "local minimizers" of $J_1$; $J_1$ was popularized as part of the ISODATA algorithm by Ball and Hall [13] in 1967. Subsequent generalizations of $J_1$ have been introduced by many authors (cf. [14]-[19]). All of these functionals can be written in the fuzzy c-prototypes form, viz.:

$$J_m(U, P; X) = \sum_{k=1}^{c} \sum_{j=1}^{M} (u_{kj})^m D_{kj} \tag{5}$$

where

$m \in [1, \infty)$ is a weighting exponent on each fuzzy membership; (6a)
$U \in M_{fcM}$ is a fuzzy c-partition of $X$; (6b)
$P = (P_1, P_2, \ldots, P_c)$ are (geometric) cluster prototypes; (6c)
$D_{kj}$ is some measure of similarity (error, proximity, etc.) between $P_k$ and $x_j$. (6d)

Three families of prototypes are known in the literature: linear varieties (points, lines, planes, ..., hyperplanes), hyperellipses, and regression models; $D_{kj}$ is usually a norm metric. For point prototypes, two families are well known on $R^N$: the inner product norms, $\|x_j - v_k\|_A^2 = (x_j - v_k)^T A\, (x_j - v_k)$, the distance from $x_j$ to $v_k$, where $A$ is any positive definite $(N \times N)$ matrix; and the Minkowski norms defined below. Equation (5) reduces to (4) when $m = 1$ and $U \in M_{cM}$ with $A = I$, the identity matrix on $R^N$.

Necessary conditions that define iterative algorithms for (approximately) minimizing $J_m$ are known for various cases of its parameters; necessary conditions for optimizing (4) and (5) are given in Section III below. In all cases to date the technique called grouped coordinate minimization (GCM) has been used to optimize $J_m$. Basically, this involves fixing each set of parameters $(U)$ or $(P)$ in turn, and deriving necessary conditions for critical points of the reduced objective function of the remaining variables. When an inner product norm is the basic measure of (dis)similarity between data points and prototypes, the function $F(v_k) = (x_j - v_k)^T A\, (x_j - v_k)$ is differentiable in $v_k$, so necessary conditions for the prototypes as explicit functions of the $\{u_{kj}\}$ can be derived using constrained optimization via differentiation. However, clusters found with inner product norms are more or less forced to match smooth, hyperellipsoidal shapes whose principal axes are given by the eigenvector structure of $A$. Data structures that are "boxy," i.e., have sharp "edges," are not well matched by the topology of inner product induced metrics. For data of this kind, inner product norms are inappropriate. In this case, the so-called p-norms may yield better results. Fig. 1, which shows the topological structure of several closed unit balls in $R^2$, lends geometric plausibility to this supposition. The limiting cases ($p = 1$ and $p = \infty$) have topologies commensurate with, e.g., clusters of samples drawn from the uniform distribution. Many real data sets have clusters shaped like this; for example, boundaries of unlabeled rectangular objects found by standard edge detection algorithms applied to digital images (e.g., see the matching algorithm [32]).

Fig. 1. Some closed unit p-balls in $R^2$.

The p-norms are incorporated in the c-prototypes criterion function framework as follows. Define

$$J_m(U, v; X) = \sum_{k=1}^{c} \sum_{j=1}^{M} (u_{kj})^m D_{kj} \tag{7}$$

where

$D_{kj} = \|x_j - v_k\|_p = \left(\sum_{s=1}^{N} |x_{js} - v_{ks}|^p\right)^{1/p}$ is the p-norm distance from $x_j$ to $v_k$, $p \ge 1$; (8a)
$m \in [1, \infty)$ is a weighting exponent on each fuzzy membership; (8b)
$U \in M_{fcM}$ is a fuzzy c-partition of $X$; (8c)
$v = (v_1, v_2, \ldots, v_c)$ are prototypical points in $R^N$. (8d)
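For concreteness, the p-norm distances in (8a) for $p \in \{1, 2, \infty\}$ can be computed as follows (a minimal NumPy sketch; the function name is ours):

```python
import numpy as np

def pnorm_distances(X, V, p):
    """D[k, j] = ||x_j - v_k||_p for M points (rows of X) and c prototypes (rows of V)."""
    diff = np.abs(V[:, None, :] - X[None, :, :])    # shape (c, M, N)
    if np.isinf(p):
        return diff.max(axis=2)                     # sup norm: largest coordinate gap
    return (diff ** p).sum(axis=2) ** (1.0 / p)     # Minkowski p-norm, p >= 1
```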

Minimization of $J_m$ at (7) with constraints (8) forms a class of constrained nonlinear optimization problems whose solution is unknown. Because this problem arises by placing the p-norm in the FCM/HCM functionals, we shall call this the pFCM (pHCM) problem, according as $m > 1$ or $m = 1$, respectively; and any algorithm that may provide approximate solutions to these problems will be called a pFCM or pHCM algorithm. In


Section III we follow the usual path towards optimization of (7) under (8), viz., partial optimization in each set of grouped coordinates, as generally described in [28]. Thus, we first fix $v$ and find necessary conditions on the $\{u_{kj}\}$ that minimize $J_m$ for $m \ge 1$ at any $p$. Then we fix $U$ and minimize $J_m$ with respect to $v$. We will describe and apply a basis exchange technique for numerically optimizing $J_m(U, v; X)$ with respect to $v$ when $U$ is fixed and $p = 1$ or $p = \infty$. Similar algorithms have been used for discriminant analysis with the perceptron criterion function by Bobrowski and Niemiro [21] and Bobrowski [22]; for regression analysis with the $L_1$ norm by Bloomfield and Steiger [23]; and for $L_1$ solutions to overdetermined linear systems by Bartels et al. [24].

III. THE c-MEANS ALGORITHMS: FCM/HCM AND pFCM/pHCM

Necessary conditions for the minimization of $J_m$ for fixed prototypes $P = v = (v_1, v_2, \ldots, v_c)$ are well known for $m \ge 1$.

A. Explicit Half-Step for U with Fixed Prototypes $P^*$

Theorem 1 (FCM/pFCM, explicit half-step for $U$ with $P^*$ fixed and $m > 1$): Let

$$K_m(U; X) = \sum_{k=1}^{c} \sum_{j=1}^{M} (u_{kj})^m \hat{D}_{kj}$$

with $P^* = (P_1^*, P_2^*, \ldots, P_c^*)$ fixed. For $m \in (1, \infty)$, assume that the values $\{\hat{D}_{kj}\}$ are any $(cM)$ nonnegative real numbers, let $X$ have at least $c < M$ distinct points, let $I = \{1, 2, \ldots, c\}$, and for $j = 1$ to $M$, let $I_j = \{i \in I \mid \hat{D}_{ij} = 0\}$. Then $U \in M_{fcM}$ may be a critical point for $K_m$ only if

$$I_j = \emptyset \ \Rightarrow\ u_{kj} = \left[\sum_{s=1}^{c} \left(\hat{D}_{kj}/\hat{D}_{sj}\right)^{1/(m-1)}\right]^{-1}, \quad 1 \le k \le c,\ 1 \le j \le M; \tag{9a}$$

or if

$$I_j \ne \emptyset \ \Rightarrow\ u_{kj} = 0 \quad \forall\, k \in (I - I_j); \quad \text{and, arbitrarily,} \quad \sum_{s \in I_j} u_{sj} = 1. \tag{9b}$$

Proof: See [2, pp. 66-68].

Equations (9a) and (9b) give explicit necessary conditions on the matrix $U$ with $P^*$ fixed for any set of nonnegative $\{\hat{D}_{kj}\}$; in particular, for any p-norm, $\hat{D}_{kj} = \|x_j - v_k\|_p$. When (9b) must be used, the partial memberships that sum to one may be arbitrarily distributed to the centers that have zero distances to data point $x_j$. We also note that computation of $U$ with these equations always produces a matrix in $M_{fcM}$; this remark is important when discussing initialization of the pFCM algorithm below. When $m = 1$, (9a)-(9b) are replaced by the well-known minimum distance formula, which is stated in Theorem 2.

Theorem 2 (HCM/pHCM, explicit half-step for $U$ with $P^*$ fixed and $m = 1$): Define $K_1(U; X) = \sum_{k=1}^{c} \sum_{j=1}^{M} u_{kj} \hat{D}_{kj}$ with $P^* = (P_1^*, P_2^*, \ldots, P_c^*)$ fixed. Assume that the values $\{\hat{D}_{kj}\}$ are any $(cM)$ nonnegative real numbers, let $X$ have at least $c < M$ distinct points, and let $I = \{1, 2, \ldots, c\}$. Then $U \in M_{cM}$ may be a critical point for $K_1$ only if

$$u_{kj} = 1 \ \text{ if } \ \hat{D}_{kj} = \min_{s \in I} \{\hat{D}_{sj}\}; \qquad u_{kj} = 0 \ \text{ otherwise.} \tag{9c}$$

Proof: See [2, p. 73].

In (9c) the minimum for some column ($j$) may be zero without necessitating the condition shown in (9b). If the minimum is not unique, then $u_{kj} = 1$ may be arbitrarily assigned to the first minimizing index ($k$), and the remaining entries of this column are put to zero (because $U$ is hard in $M_{cM}$). Thus, singularity for $m > 1$ corresponds to some $\hat{D}_{kj} = 0$, while for $m = 1$ it corresponds to nonuniqueness of the minimum at (9c).

Theorems 1 and 2 are "half" of the conditions needed to make a well-defined algorithm that may iteratively minimize $J_m$ for $m \ge 1$. In summary, the explicit functional form for the $\{u_{kj}\}$ is identical for both the p-norm and A-norm families. The situation for the prototypes $\{v_k\}$, however, is quite different. As noted previously, necessary conditions for the $\{v_k\}$ with $U^*$ fixed in case the norm function is differentiable with respect to $v_k$ are found by setting the gradient of the appropriate reduced functional to zero, and solving for $v_k$. The well-known result is contained in Theorem 3.

Theorem 3 (FCM/HCM, explicit half-step for $v$ with $U^*$ fixed, inner product norms, and $m \ge 1$): Let $U^* \in M_{fcM}$ ($M_{cM}$) be fixed for $m > 1$ ($m = 1$), let $A$ be any $(N \times N)$ positive definite matrix, and define

$$L_m(v; X) = \sum_{k=1}^{c} \sum_{j=1}^{M} (u_{kj}^*)^m \|x_j - v_k\|_A^2.$$

Assume that $X$ has at least $c < M$ distinct points. Then $v \in R^{cN}$ may be a critical point for $L_m$ only if

$$v_k = \sum_{j=1}^{M} (u_{kj}^*)^m x_j \bigg/ \sum_{j=1}^{M} (u_{kj}^*)^m, \quad 1 \le k \le c. \tag{10}$$

Proof: See [2, pp. 68-69].

The denominator in (10) is always positive because of (1c). When $m = 1$, (10) simply yields the geometric centroids of the $c$ hard clusters in $U^*$. The FCM and HCM algorithms are Picard iteration through (9) and (10), as listed below.
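The $U$ half-step of Theorems 1 and 2 translates directly into code. The sketch below is ours; it distributes the memberships in (9b) uniformly over the zero-distance centers, one admissible choice among many:

```python
import numpy as np

def update_memberships(D, m):
    """Half-step for U with prototypes fixed: (9a)-(9b) if m > 1, (9c) if m = 1.

    D: (c, M) array of nonnegative values D_kj.  Returns U in M_fcM (M_cM for m = 1).
    """
    c, M = D.shape
    U = np.zeros((c, M))
    for j in range(M):
        if m == 1:                        # (9c): nearest prototype, first index on ties
            U[np.argmin(D[:, j]), j] = 1.0
            continue
        zero = D[:, j] == 0.0
        if zero.any():                    # (9b): split membership over zero-distance centers
            U[zero, j] = 1.0 / zero.sum()
        else:                             # (9a): inverse-distance-ratio weighting
            ratios = (D[:, j][:, None] / D[:, j][None, :]) ** (1.0 / (m - 1.0))
            U[:, j] = 1.0 / ratios.sum(axis=1)
    return U
```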

Fuzzy/Hard c-Means (FCM/HCM) Algorithms [2]: Inner Product Norms Case

1) Given unlabeled data set $X = \{x_1, x_2, \ldots, x_M\}$. Fix: $1 < c < M$; $1 \le m < \infty$; a positive definite weight matrix $A$ to induce an inner product norm on $R^N$. Choose $\varepsilon$, a small positive constant.


2) Guess $U_0 \in M_{fcM}$ (or, initialize $v_0 = (v_{1,0}, v_{2,0}, \ldots, v_{c,0}) \in R^{cN}$). To avoid starting at a fixed point, the rows of $U_0$ (or the vectors $\{v_{k,0}\}$) must be distinct.

3) For $a = 1$ to $a_{\mathrm{MAX}}$:
   a) calculate $v_{k,a}$ with $U_{a-1}$ and (10), $1 \le k \le c$;
   b) update $U_{a-1}$ to $U_a$ with $\{v_{k,a}\}$ and (9a), (9b) (if $m > 1$) or (9c) (if $m = 1$);
   c) if $\max_{k,j} \{|u_{kj,a} - u_{kj,a-1}|\} \le \varepsilon$, then stop and put $(U^*, v^*) = (U_a, v_a)$; else next $a$.

This procedure converges q-linearly from any initialization to a local minimum or saddle point (local maxima are impossible) of $J_m$ [20]. Termination usually occurs within 25-100 iterations for $\varepsilon$ values in the range 0.01-0.001; this depends, of course, on the data being processed. Several algorithms have been proposed to test whether $(U^*, v^*)$ is a local minimum or saddle point [29]-[31]. However, this has to date been largely a question of testing for unrealistic pathologies, since saddle points have been observed in only very artificial circumstances; nonetheless, this now becomes an open question for the pFCM/pHCM algorithms.

Theorems 1-3 give explicit necessary conditions and algorithms for FCM/HCM for all $D_{kj}$'s that are inner product A-norms, and necessary conditions for the $U$ matrix for all $D_{kj}$'s that are p-norms. At this point we turn to the main question addressed by this paper, viz., finding the vectors $\{v_k\}$ when the p-norm is used in the FCM functional as at (7). As noted previously, we are unable to provide explicit theoretical conditions analogous to Theorem 3 in this case. Instead, we consider a numerical solution to this problem for the special cases $p = 1$ and $p = \infty$.
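Before turning to the p-norm cases, the inner product iteration just listed can be sketched compactly for the Euclidean case $A = I$ and $m > 1$, reusing update_memberships from the sketch above; here $\hat{D}_{kj}$ is the squared Euclidean distance, as in (4):

```python
import numpy as np

def fcm_euclidean(X, c, m=2.0, eps=1e-3, max_iter=100, seed=0):
    """Picard iteration through (9)-(10) for A = I; returns (U, V)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    V = X[rng.choice(M, size=c, replace=False)]      # distinct initial prototypes
    U = np.zeros((c, M))
    for _ in range(max_iter):
        D = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)  # D_kj = ||x_j - v_k||^2
        U_new = update_memberships(D, m)             # half-step (9a)-(9b)
        W = U_new ** m
        V = (W @ X) / W.sum(axis=1, keepdims=True)   # half-step (10): weighted means
        if np.abs(U_new - U).max() <= eps:           # termination test of step 3c
            return U_new, V
        U = U_new
    return U, V
```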

B. The pFCM Functionals for the $\ell_1$ and $\ell_\infty$ Norms

Assume that $U^*$ is fixed; for simplicity, we suppress the asterisk (*) in this section. First we write (7) as follows:

$$J_m(U, v; X) = \sum_{k=1}^{c} \Phi(u_k, v_k) \tag{11a}$$

where

$$\Phi(u_k, v_k) = \sum_{j=1}^{M} (u_{kj})^m D_{kj} \tag{11b}$$

is the criterion function for the $k$th cluster (or row $u_k$ of $U$) over the $M$ data points. The minimization of $J_m(U, v; X)$ with respect to $v$ in (11a) can be done term by term via (11b): each $\Phi(u_k, v_k)$ is nonnegative, so the sum can be minimized by minimizing each term separately. Let $\beta_{kj} = (u_{kj})^m$ for all $k$ and $j$. The value of $m$ does not affect the analysis of the minimization of $\Phi(u_k, v_k)$, since with $U$ fixed the numbers $\{(u_{kj})^m\}$ are fixed as well. With this notation we write $\Phi(u_k, v_k)$ in the following form:

$$\Phi(\beta_k, v_k) = \sum_{j=1}^{M} \beta_{kj}\, \rho(x_j, v_k) \tag{12}$$

where

$$\beta_k = [(u_{k1})^m, \ldots, (u_{kM})^m]^T$$

and

$$\rho(x, v) = \|x - v\|_p \quad \text{for any } x \text{ and } v \text{ in } R^N. \tag{13}$$

The only norms we consider in (13) are the $\ell_1$ and $\ell_\infty$ norms. For the $\ell_1$ norm:

$$\varphi_1(x_j, v_k) = \|x_j - v_k\|_1 = \sum_{i=1}^{N} |x_{ji} - v_{ki}| = \sum_{i=1}^{N} |\langle x_j - v_k, e_i \rangle| \tag{14}$$

where $e_i = (0, 0, \ldots, 0, 1, 0, \ldots, 0)^T$ has a one in the $i$th place ($e_i$ is the $i$th canonical orthogonal basis vector for $R^N$). Where it exists, the partial derivative of $\varphi_1(x_j, v_k)$ with respect to $v_{ki}$ is $+1$ if $v_{ki} - x_{ji} > 0$ and $-1$ if $v_{ki} - x_{ji} < 0$; it is undefined when $v_{ki} - x_{ji} = 0$. For the $\ell_\infty$ norm, see (19) and (20):

$$\varphi_\infty(x_j, v_k) = \|x_j - v_k\|_\infty = \max_s \{|v_{ks} - x_{js}|\} = \alpha \tag{19}$$

$$i = \min \{s : |v_{ks} - x_{js}| = \alpha\} \tag{20}$$

where $i$ denotes the first index at which the maximum $\alpha$ is attained.

The function $f_i(v_i) = |x_i - v_i|$ is a convex, piecewise affine function of $v_i$. The sum of any finite number of such functions is again convex and piecewise affine, so the norm functions in (14) and (19) have these two properties. Consequently, the global minimizer of (11a) can be found with linear programming. It is precisely this fact that enables us to utilize the basis exchange algorithm given below to find optimal $v$'s for $\varphi_1$ and $\varphi_\infty$.
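Because (12) separates across coordinates when $\rho = \varphi_1$, the global minimizer for $p = 1$ can also be computed directly: each coordinate of $v_k$ is a weighted median of the corresponding data coordinates. The following sketch is a cross-check of that fact (ours; it is not the basis exchange algorithm the paper develops):

```python
import numpy as np

def weighted_median(values, weights):
    """A minimizer of sum_j w_j * |values_j - t| over t (the weighted median)."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    csum = np.cumsum(w)
    return v[np.searchsorted(csum, 0.5 * csum[-1])]  # first index reaching half the total weight

def l1_prototype(X, beta):
    """Coordinatewise minimizer of Phi(beta_k, v_k) = sum_j beta_kj * ||x_j - v_k||_1."""
    return np.array([weighted_median(X[:, i], beta) for i in range(X.shape[1])])
```

Note that each returned coordinate equals some data coordinate $x_{ji}$, so the minimizer is a vertex of the hyperplane arrangement introduced next.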


Next, we introduce the mean correction vector $\chi(v_k)$. The mean correction vector is related to the gradient of $\Phi(\beta_k, v_k)$ with respect to $v_k$: on each open region where $\Phi$ is affine, $\chi(v_k)$ coincides with the (constant) gradient of $\Phi$ there.

In what follows we need a set of hyperplanes $h_{js}$, which are related to the functions $\{\varphi_1(x_j, v_k)\}$ and $\{\varphi_\infty(x_j, v_k)\}$ as follows:

$$h_{js} = \{v \in R^N : \langle v, b_s \rangle = \langle x_j, b_s \rangle\} \tag{22}$$

where $\langle \beta, v \rangle = \sum_i \beta_i v_i$ is the Euclidean inner product.

Fig. 2. Geometry of the basis exchange hyperplanes for $p = 1$, where the dots denote points in $X$ and the open circles denote vertices in $V$.

For $p = 1$ the vectors $\{b_s\}$ are required to have the following form:

$$b_s = [0, \ldots, 0, 2, 0, \ldots, 0]^T = 2e_s, \quad 1 \le s \le N \quad (p = 1). \tag{23}$$

The gradient of the function $\varphi_1(x_j, v_k)$ with respect to $v_k$ exists on the set $R^N - \bigcup_s h_{js}$. Similarly, the gradient of $\varphi_\infty(x_j, v_k)$ with respect to $v_k$ exists unless the point $v_k$ belongs to one of the two hyperplanes described by the following equations:

$$v_{ks} - x_{js} = v_{kt} - x_{jt}, \qquad v_{ks} - x_{js} = -(v_{kt} - x_{jt}), \qquad s \ne t. \tag{24}$$

These hyperplanes can also be expressed in the form (22) using the following vectors for $\{b_s\}$:

$$b = e_s - e_t \quad \text{or} \quad b = e_s + e_t, \quad s \ne t \quad (p = \infty). \tag{25}$$

Fig. 3. Geometry of the basis exchange hyperplanes for $p = \infty$, where the dots denote points in $X$ and the open circles denote vertices in $V$.

Figs. 2 and 3 illustrate the geometry of these hyperplanes for $p = 1$ and $p = \infty$, respectively. In the former instance, there are $N$ hyperplanes for each data point, and $NM$ such hyperplanes in all; in the latter case, there are $N(N-1)$ hyperplanes for each data point, and a total of $MN(N-1)$ hyperplanes for the entire data set. Every vertex $v \in V$ lies at the intersection of at least $N$ of the hyperplanes $\{h_{js}\}$ at (22); i.e., $v(n) \in V$ if and only if, for some data points $x_{j_1}, \ldots, x_{j_N}$ and vectors $b_{k_1}, \ldots, b_{k_N}$:

$$\langle v(n), b_{k_1} \rangle = \langle x_{j_1}, b_{k_1} \rangle, \quad \ldots, \quad \langle v(n), b_{k_N} \rangle = \langle x_{j_N}, b_{k_N} \rangle. \tag{26}$$

It is convenient to express these equations in matrix form:

$$B(n)\, v(n) = c(n)$$

where the rows of $B(n)$ are the basis vectors $b_{k_1}, \ldots, b_{k_N}$ currently in the basis and $c(n)$ is the vector of right-hand sides $\langle x_{j_i}, b_{k_i} \rangle$. The inverse (reciprocal) basis $R(n) = [r_1(n), \ldots, r_N(n)]$ satisfies

$$\langle r_i(n), b_{k_l} \rangle = \delta_{il} \quad \text{(Kronecker's delta).} \tag{31}$$
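As an independent check on any search over these vertices, the minimization of $\Phi(\beta_k, v_k)$ for $p = \infty$ can be posed as a small linear program in the variables $(v, t_1, \ldots, t_M)$ with constraints $t_j \ge |x_{ji} - v_i|$ for all $i, j$; an LP solver then lands on one of the vertices just described. A sketch using scipy.optimize.linprog (our choice of solver; the paper's method is the basis exchange procedure below):

```python
import numpy as np
from scipy.optimize import linprog

def linf_prototype(X, beta):
    """Minimize sum_j beta_j * ||x_j - v||_inf over v by linear programming."""
    M, N = X.shape
    c_vec = np.concatenate([np.zeros(N), beta])   # objective: sum_j beta_j * t_j
    rows, rhs = [], []
    for j in range(M):
        for i in range(N):
            r = np.zeros(N + M); r[i] = 1.0; r[N + j] = -1.0
            rows.append(r); rhs.append(X[j, i])   #  v_i - t_j <= x_ji
            r = np.zeros(N + M); r[i] = -1.0; r[N + j] = -1.0
            rows.append(r); rhs.append(-X[j, i])  # -v_i - t_j <= -x_ji
    res = linprog(c_vec, A_ub=np.vstack(rows), b_ub=np.array(rhs),
                  bounds=[(None, None)] * N + [(0, None)] * M)
    return res.x[:N]
```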

If vertex $v(n) \in V$ is situated at the intersection of exactly $N$ of the hyperplanes $\{h_{js}\}$ at (22), then $2N$ edges are connected to $v(n)$. The move from $v(n)$ to $v(n+1)$ is related to the transformation of the current basis $B(n)$ into $B(n+1)$. The basis $B(n+1)$ is obtained from $B(n)$ by replacing one of the vectors $b_{k_i}$ by some $b_k$. The exit criterion determines the vector $b_{k_i}$ leaving $B(n)$, and the entry criterion determines the vector $b_k$ entering $B(n)$. At the end of the swap, $B(n)$ becomes $B(n+1)$. The imposition of the exit and entry criteria determines the search strategy, and a stopping criterion determines when the search is complete. The exit criterion for the steepest descent search is described next.

Exit criterion: The vector $b_{k_i}$ leaves the basis $B(n)$ during the $n$th step if the steepest-descent relation holds at the perturbed points $v_\varepsilon^{+}(n) = v(n) + \varepsilon\, r_i(n)$ and $v_\varepsilon^{-}(n) = v(n) - \varepsilon\, r_i(n)$, where $\varepsilon$ is a small positive real number. The exit criterion determines not only which vector $b_{k_i}$ leaves the basis $B(n)$, but also the half-line $v_i(t)$ on which the vertex $v(n+1)$ is situated (35).

Since the direction vector $b_k$ at (25) is a linear combination of the vectors $\{e_i\}$, the relationship between the correction vectors in two neighboring regions $D_1$ and $D_2$ can be expressed in the following form:

$$\chi(v_2) = \chi(v_1) + \beta_{jk}\, b_k \tag{36a}$$

or

$$\chi(v_2) = \chi(v_1) - \beta_{jk}\, b_k \tag{36b}$$

according to the direction in which the hyperplane separating $D_1$ and $D_2$ is crossed, where $v_2 \in D_2$, $v_1 \in D_1$, and $b_{k i_1}$ is the $i_1$th component of $b_k$. Similarly, the relation between the correction vectors in two neighboring regions has the following form in case $p = 1$ as in (23):

$$\chi(v_2) = \chi(v_1) - \beta_{jk}\, b_k \quad \text{if} \quad v_{1k} \ge x_{jk};$$

$$\chi(v_2) = \chi(v_1) + \beta_{jk}\, b_k \quad \text{if} \quad v_{1k} < x_{jk}.$$
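To make the correction vector concrete for $p = 1$: away from the hyperplanes $h_{js}$, $\Phi(\beta_k, v)$ is affine and $\chi(v)$ is its gradient, $\chi(v) = \sum_j \beta_{kj} \sum_i \operatorname{sgn}(v_i - x_{ji})\, e_i$; crossing $h_{jk}$ flips one sign and changes $\chi$ by $\pm 2\beta_{kj}\, e_k = \pm \beta_{kj}\, b_k$, consistent with (23) and the $p = 1$ relations above. A minimal sketch under that reading (ours):

```python
import numpy as np

def correction_vector_l1(v, X, beta):
    """chi(v): gradient of Phi(beta, v) = sum_j beta_j * ||x_j - v||_1,
    valid off the hyperplanes h_js (no coordinate of v equal to any x_js)."""
    return (beta[:, None] * np.sign(v[None, :] - X)).sum(axis=0)
```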