An Ambiguity Measure for Pattern Recognition Problems using Triangular-Norms Combination

Carl FRÉLICOT(1), Laurent MASCARILLA(1) and Augustin FRUCHARD(2)
(1) Laboratoire d'Informatique – Image – Interaction (EA 2128), Univ. de La Rochelle, Av. Michel Crépeau, 17042 La Rochelle Cedex, FRANCE
(2) Laboratoire de Mathématiques et Applications, Univ. de Haute Alsace, rue des Frères Lumière, 68089 Mulhouse, FRANCE

Abstract: - In pattern recognition, the membership of an object to classes is often measured by labels. This article mainly deals with the mathematical foundations of label combination operators, built on t-norms, that allow one to define an ambiguity measure for objects. Mathematical properties of the new family of combination operators are established. The application of the proposed measure to three major pattern recognition problems is presented.

Keywords: - triangular norms, axiomatics, ambiguity measure, feature selection, cluster validity, classification.

1 Introduction

The aim of this article is to define an ambiguity measure within the framework of pattern recognition, and more specifically supervised and unsupervised classification. Let ω = {ω1, ω2, ..., ωc} be a set of c classes of objects and x a p-dimensional vector of characteristics (features) to be associated with a part of ω. The pattern x can be labelled using a function L:

L : ℝ^p → [0, 1]^c, x ↦ µ(x) = (µ1(x), ..., µc(x))^t    (1)

where each label µi(x) decreases with d(x, ωi), a distance between x and, for instance, the mean of the class ωi. Note that labels µi(x) are said to be possibilistic if they are in [0, 1] and fuzzy/probabilistic (depending on the underlying mathematical model) if Σ_{i=1}^{c} µi(x) = 1. In this mathematical study, we do not need any constraint on the labels except that they are in [0, 1].

This paper is composed of two parts. In the first one we define a class of combination operators of the labels µi(x), based on triangular norms, suitable for ambiguity measurement. Classical properties of such aggregation operators are checked, others are proved, and a new ambiguity measure is proposed. In the second part, this new ambiguity measure is applied to three fundamental Pattern Recognition problems: variable selection for classification [17, 8], cluster validity [1, 2] and supervised classification with reject options [11, 20]. Results allowing comparison with other approaches are provided.

2 A New Ambiguity Measure

2.1 Triangular Norms

A triangular norm (t-norm) is a function ⊤ : [0, 1] × [0, 1] → [0, 1] satisfying the following four axioms, ∀x, y, z ∈ [0, 1]:

x ⊤ y = y ⊤ x    (2)
y ≤ z ⇒ x ⊤ y ≤ x ⊤ z    (3)
x ⊤ (y ⊤ z) = (x ⊤ y) ⊤ z    (4)
x ⊤ 1 = x    (5)

The dual t-conorm ⊥ is defined as:

x ⊥ y = 1 − (1 − x) ⊤ (1 − y)    (6)

and then satisfies axioms (2), (3), (4) as well as:

x ⊥ 0 = x    (7)

Axioms (3), (5) and (7) imply:

x ⊤ y ≤ x    (8)
x ≤ x ⊥ y    (9)

It ensues that:
• 0 is an absorbing element for any t-norm,
• min is the largest t-norm,
• 1 is an absorbing element for any t-conorm,
• max is the smallest t-conorm.

Such operators are multi-valued extensions of the crisp sets' intersection ∩ and union ∪ operators, as well as of the boolean logic AND and OR connectives. Numerous t-norms and t-conorms have been defined; some of them are recalled in Table 1. In the remaining part of the paper, ⊤ is some t-norm and ⊥ the associated t-conorm.

Table 1: t-norms (⊤) and t-conorms (⊥)

Standard           ⊤  min(x, y)
                   ⊥  max(x, y)
Probabilistic      ⊤  xy
                   ⊥  x + y − xy
Hamacher (γ = 0)   ⊤  xy / (x + y − xy)
                   ⊥  (x + y − 2xy) / (1 − xy)
Łukasiewicz        ⊤  max(x + y − 1, 0)
                   ⊥  min(x + y, 1)
Yager              ⊤  max(1 − ((1 − x)^q + (1 − y)^q)^{1/q}, 0)
                   ⊥  min((x^q + y^q)^{1/q}, 1)

2.2 A Class of Combination Operators

In decision theory, ambiguity is related to the risk of classification in the class of greatest membership degree. Hence, assuming that labels are sorted in decreasing order (µ1 ≥ ... ≥ µc), the decision is generally made by comparing the greatest label µ1 to the others. A straightforward way to measure ambiguity is then to compare µ1 to µ2. Based on this idea, a first operator was proposed in [14]:

⊥²_{i=1,c} µi = ⊤_{i=1,c} ( ⊥_{j≠i} µj )    (10)

where the notation j ≠ i implies j = 1, c. It has been shown that when the selected t-norm is min, we have:

⊥²_{i=1,c} µi = µ2    (11)

It is easy to define other operators sharing this property; we denote them by the generic ⊥² symbol. In this paper we focus on the particular one (10), which has a quite simple mathematical expression – hence computational efficiency – good theoretical properties with respect to the ambiguity concept, and gives promising results in Pattern Recognition applications, as shown in section 3.

Some mathematical properties of ⊥² result from those of ⊤ and ⊥. Specifically:

• boundary conditions:

⊥²_{i=1,c} 0 = 0    (12)
⊥²_{i=1,c} 1 = 1    (13)

• monotony:

(∀i = 1, c  λi ≤ µi) ⇒ ⊥²_{i=1,c} λi ≤ ⊥²_{i=1,c} µi    (14)

• symmetry, for any permutation σ of {1, 2, ..., c}:

⊥²_{i=1,c} µσ(i) = ⊥²_{i=1,c} µi    (15)

• continuity with respect to each operand if ⊤ is continuous.

Moreover it can easily be checked that:

• 0 is a neutral element in the following meaning:

⊥²^[c] (µ1, ..., 0, ..., µc−1) = ⊥²^[c−1] (µ1, ..., µc−1)    (16)

• with more than two operands, (1, 1) is "absorbing":

⊥² (µ1, ..., 1, ..., 1, ..., µc) = 1    (17)

Finally, it satisfies the important property that we call "weak compensation":

Proposition: whatever c ≥ 2 and (µ1, ..., µc) ∈ [0, 1]^c, we have:

⊤_{i=1,c} µi ≤ ⊥²_{i=1,c} µi ≤ ⊥_{i=1,c} µi    (18)

Proof: For the right inequality, letting y = ⊤_{i=2,c} ( ⊥_{k≠i} µk ), we write ⊥²_{i=1,c} µi = ( ⊥_{j=2,c} µj ) ⊤ y. Therefore, properties (8) and (9) infer:

⊥²_{i=1,c} µi ≤ ⊥_{j=2,c} µj ≤ ⊥_{j=1,c} µj.

For the left inequality, we introduce the notation µc+1 := µ1 in order that inequality (9) leads to:

µi+1 ≤ µi+1 ⊥ ( ⊥_{j≠i,i+1} µj ) = ⊥_{j≠i} µj.

Therefore, axiom (3) infers:

⊤_{i=1,c} µi = ⊤_{i=1,c} µi+1 ≤ ⊤_{i=1,c} ( ⊥_{j≠i} µj ) = ⊥²_{i=1,c} µi.

Note that operator ⊥² (10) does satisfy this property whereas some others do not, e.g.:

⊥'²_{i=1,c} µi = ⊥_{i=1,c; j≠i} (µi ⊤ µj)    (19)

2.3 The Proposed Ambiguity Measure

Let us assume that labels are sorted in decreasing order: µ1 ≥ µ2 ≥ ... ≥ µc. In [13], it was argued that ambiguity is revealed by a high value of µ2/µ1. As the standard t-conorm (max) gives µ1 and the general ⊥² operator equals µ2 if the dual t-norm is min, we define a general ambiguity measure by:

A^s_c(x) = ⊥²_{i=1,c} µi(x) / ⊥_{i=1,c} µi(x)    (20)

where the lower script c denotes the number of labels under interest and the upper script s indicates, if needed, that the pattern x lies in an s-dimensional reduced feature space. It is worth noting that ⊥²-operators satisfying the weak compensation property (18) ensure that:

A^s_c(x) ≤ 1    (21)
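These definitions are easy to check numerically. The following Python sketch (an illustration, not code from the paper) implements a few t-norms of Table 1, builds the dual t-conorm through (6), and evaluates the operator (10) and the ambiguity measure (20); with the standard norms it recovers the second-largest label, as stated in (11).

```python
from functools import reduce

# a few t-norms from Table 1
def t_min(x, y):  return min(x, y)               # standard
def t_prod(x, y): return x * y                   # probabilistic
def t_luka(x, y): return max(0.0, x + y - 1.0)   # Łukasiewicz

def dual_conorm(tnorm):
    # duality (6): x ⊥ y = 1 - (1 - x) ⊤ (1 - y)
    return lambda x, y: 1.0 - tnorm(1.0 - x, 1.0 - y)

def fold(op, values):
    # aggregate a list of labels with a binary (associative) operator
    return reduce(op, values)

def conorm2(mu, tnorm):
    # operator (10): ⊥² µ = ⊤_{i=1,c} ( ⊥_{j≠i} µ_j )
    perp = dual_conorm(tnorm)
    inner = [fold(perp, mu[:i] + mu[i + 1:]) for i in range(len(mu))]
    return fold(tnorm, inner)

def ambiguity(mu, tnorm):
    # measure (20): A(x) = ⊥² µ(x) / ⊥ µ(x), bounded by 1 through (18)
    return conorm2(mu, tnorm) / fold(dual_conorm(tnorm), mu)

mu = [0.9, 0.7, 0.2]
# with the standard norms, ⊥² extracts the second-largest label (11)
print(conorm2(mu, t_min))    # → 0.7
print(ambiguity(mu, t_min))  # → 0.7 / 0.9 ≈ 0.778
```

The same two functions work unchanged for any dual pair of Table 1, so the weak compensation bounds (18) can be verified for each choice of ⊤.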

Let us present how this new measure (20) can be used in three pattern recognition problems: feature selection, cluster validity and classification with reject options.

3 Applications to Pattern Recognition Problems

3.1 Feature Selection

Feature selection is an important issue in classification when a large number p of features is available, some of them being irrelevant or redundant [16, 5, 4]. It aims at selecting a subset of s features (s < p) without significantly decreasing (or even while increasing) the classification accuracy. Feature selection methods are based on a selection (search) algorithm and an evaluation function assessing how effective feature subsets are. Many algorithms have been proposed and comparative studies are available, e.g. in [17, 19]. This paper does not address the search problem. We use the SFFS (Sequential Forward Floating Search) algorithm [23], which starts with an empty set of features, iteratively adds one feature at a time and adaptively allows several to be removed. It is currently considered one of the most effective suboptimal algorithms [17]. Two approaches are commonly admitted for defining the evaluation function [8]. The wrapper approach consists in using the classification accuracy of a feature subset for a particular classifier. Even if it generally gives better results, this approach is classifier dependent and generalizes poorly. Alternatively, the filter approach is not classifier dependent. It can be divided into the following categories of evaluation functions: (i) distance measures (e.g. the Mahalanobis and Bhattacharyya distances), (ii) information measures (e.g. entropy measures) or (iii) dependence measures (e.g. correlation coefficients). Another dichotomy can be based on whether supervised or unsupervised classification is considered; note that wrapper methods rely on the former case. In this paper, we propose a filter-type evaluation function, designed for supervised classification, which belongs to category (i). We assume that a feature (or a subset of features) is all the more relevant to a supervised classification problem as the classes overlap less along it. Our idea is to use the ambiguity measure (20) to define a new evaluation function to be minimized:

J(s) = Σ_x A^s_c(x)    (22)
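As an illustration of how criterion (22) can drive a search algorithm, here is a hedged Python sketch. The label function (a 1/(1+d) form) is a hypothetical choice, since the exact form of (1) is model-dependent, and the search is a plain sequential forward selection rather than the SFFS procedure [23] actually used in the experiments.

```python
import math

def labels(x, means):
    # hypothetical possibilistic label of form 1/(1+d), decreasing with the
    # distance to each class mean; the paper's Eq. (1) may use another form
    return [1.0 / (1.0 + math.dist(x, m)) for m in means]

def ambiguity(mu):
    # A(x) with the standard norms: second-largest label over largest (Eqs. 11, 20)
    s = sorted(mu, reverse=True)
    return s[1] / s[0]

def criterion(subset, X, means):
    # J(s) = Σ_x A_c^s(x), computed in the reduced feature space (Eq. 22)
    proj = lambda v: [v[f] for f in subset]
    pmeans = [proj(m) for m in means]
    return sum(ambiguity(labels(proj(x), pmeans)) for x in X)

def forward_select(X, means, p, s):
    # plain sequential forward search; the paper's SFFS also backtracks adaptively
    selected = []
    while len(selected) < s:
        best = min((f for f in range(p) if f not in selected),
                   key=lambda f: criterion(selected + [f], X, means))
        selected.append(best)
    return selected
```

On a toy two-class problem where only the first feature separates the class means, the criterion is minimal along that feature, so the search picks it first.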

We compared it, using the probabilistic norms (⊤, ⊥), to three other criteria to be minimized: the Entropy E [21], the Fuzzy Index FI [22] and the Fuzzy Distance FD [3], on well-known data sets: (i) Wisconsin Breast Cancer, n = 683 samples of p = 9 expert assessments of morphological attributes of benignant and malignant (c = 2) breast cells; (ii) Pima Indian Diabetes, consisting of p = 8 attributes observed on n = 768 females (number of previous pregnancies, body mass, glucose absorption tolerance, and so on) used to predict whether or not the subject is diabetic (c = 2). Classification error rates obtained with the k-Nearest Neighbors (k-NN) rule and the Quadratic Bayes (BQ) classifier (under Gaussian assumption) are reported in Table 2. A leave-one-out procedure was used and the number of neighbors k giving the lowest error rate was chosen. Features selected by J(s) almost always resulted in lower error rates than the others, whatever the classifier. For Pima, it even gave better results than using the entire set of features.

Table 2: Error rates for the original p and s = 3 selected features

Data                  Evaluation function   BQ (%)   k-NN (%)
Breast Cancer Wisc.   p = 9 features         4.98     2.64
                      J(3)                   5.42     3.07
                      FI(3)                  5.42     3.07
                      E(3)                   5.27     4.25
                      FD(3)                  6.44     6.00
Pima Indian Diab.     p = 8 features        26.04    23.70
                      J(3)                  24.09    22.66
                      FI(3)                 31.64    34.24
                      E(3)                  32.42    32.55
                      FD(3)                 32.16    33.07

3.2 Cluster Validity

In unsupervised classification, the class-membership of the data points x is not known. The clustering problem aims at partitioning the sample set into c clusters that are as compact and separable as possible. A lot of partitioning methods are available [1] but the number of clusters is generally user-defined. The quality of the resulting partition obviously depends on c. A common approach consists in running the clustering algorithm for different values of c (from 2 up to an upper bound cmax) and assessing a cluster validity index; the "good" number of clusters then corresponds to the optimum value of the index [2, 9, 6]. The more the clusters overlap, the less separable they are. Therefore, we propose to use the ambiguity measure (20) to define a new cluster validity index to be minimized:

J(c) = (1/n) Σ_x A^p_c(x)    (23)

Used with various (⊤, ⊥), we compared it to the following (non-geometrical) indices: the normalized entropy H [12], to be minimized; the normalized partition coefficient F [24], to be maximized; and the proportion coefficient P [25], to be minimized, defined by:

P(c) = (1/n) Σ_x µc(x)/µ1(x)    (24)

This latter index is, as far as we know, the first one based on a ratio to the maximum label µ1. It has the drawback of being monotonic in c. In our view, (23) is closer to the concept of ambiguity; however, (24) could be generalized by:

J'(c) = (1/n) Σ_x ( ⊤_{i=1,c} µi(x) ) / ( ⊥_{i=1,c} µi(x) )    (25)

[Figure 1: Data to be clustered (top) and validity indices F, H, P, J vs. the number of clusters (bottom)]

In order to demonstrate the ability of the proposed index to validate a partition, we present results obtained on a simple two-dimensional artificial dataset shown in Figure 1 (top). The Fuzzy C-Means algorithm [1] was chosen as the partitioning method. The standard norms (min, max) were used in (23). The index values obtained for a number of clusters varying from c = 2 up to cmax = 8 are plotted in Figure 1 (bottom). To make the comparison easier, the values were divided by the corresponding maximal index. All the indices suggested that c = 4 clusters is the "good" structure.
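The validation procedure above can be sketched in Python. The fcm function below is a minimal Fuzzy C-Means, not the reference implementation of [1], and the indices use the standard norms, under which (23) reduces to the mean of µ(2)/µ(1) and (24) to the mean of µ(c)/µ(1).

```python
import math, random

def fcm(X, c, m=2.0, iters=100, seed=0):
    # minimal Fuzzy C-Means sketch: alternate center and membership updates
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    U = [[rng.random() for _ in range(c)] for _ in range(n)]
    U = [[u / sum(row) for u in row] for row in U]   # rows sum to 1
    for _ in range(iters):
        centers = []
        for j in range(c):
            w = [U[i][j] ** m for i in range(n)]
            centers.append([sum(w[i] * X[i][k] for i in range(n)) / sum(w)
                            for k in range(p)])
        for i in range(n):
            d = [max(math.dist(X[i], centers[j]), 1e-12) for j in range(c)]
            for j in range(c):
                U[i][j] = 1.0 / sum((d[j] / d[k]) ** (2.0 / (m - 1.0))
                                    for k in range(c))
    return U

def validity_J(U):
    # index (23) with the standard norms: mean of µ_(2)/µ_(1) over the sample
    return sum(sorted(mu)[-2] / max(mu) for mu in U) / len(U)

def validity_P(U):
    # Windham's proportion coefficient (24): mean of µ_(c)/µ_(1)
    return sum(min(mu) / max(mu) for mu in U) / len(U)
```

In practice one would run fcm for each c from 2 up to cmax and retain the value of c minimizing validity_J, as in the Figure 1 experiment.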

3.3 Classification with Reject Option

The purpose of classification is the design of rules assigning an unknown pattern x to a part of ω. Once a labelling function is defined, the common rule consists in choosing the class of maximum label. Such an exclusive classification rule is not efficient in practice because it supposes that: (i) ω is exhaustively defined (closed-world assumption); (ii) classes do not overlap (separability assumption). In order to overcome these limits and to reduce the misclassification risk, reject options can be used. Two kinds of rejection have been defined. The first one, called distance rejection [11], is dedicated to outlying patterns and allows a vector x not to be associated with any class. The second one, called ambiguity rejection, allows x to be classified in several or all of the classes [15, 7]; it deals with inlying patterns. Formally, including reject options consists in defining a function H : [0, 1]^c → L_hc = {l(x) ∈ [0, 1]^c : li(x) ∈ {0, 1}}, the set of vertices of the unit hypercube, µ(x) ↦ l(x) = (l1(x), ..., lc(x))^t, where µi(x) represents the degree of typicality of x to class ωi, e.g. given by (1). The design of such classifiers can be made according to well-identified strategies operating in two sequential steps (H1, H2) [20]. The ambiguity measure (20) can be used to define a classifier following a "mixture-first" strategy [14] as follows:

H1: if A^p_c(x) > a, then reject x for ambiguity between the classes ωj such that µj(x) > µa
H2: elseif µi(x) = max_{j=1,c} µj(x) > d, then classify x in ωi
    else distance-reject x

where a, µa and d are user-defined thresholds. The following example aims at showing that such a classifier based on the ambiguity measure A^p_c (20) can result in well-adapted classification boundaries. A learning set of c = 3 classes of 25 artificial Gaussian p = 2-dimensional samples was generated such that two classes (◦, ?) slightly overlap while the third one (O) is well separated from the others. A contour plot of A^p_c(x) using the Hamacher norms is shown in Figure 2 (top); areas are darker where A^p_c(x) is high. Thresholds were set to a = µa = d = 0.3 so that the classification areas look satisfactory. The corresponding classification areas are plotted in Figure 2 (bottom): exclusive classification in one of the three classes in light gray, ambiguity rejection in dark gray and distance rejection in white.

[Figure 2: Contour plot (top) and Decision boundary (bottom)]
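With the standard norms, the two-step rule (H1, H2) reduces to a few lines. The sketch below assumes the labels µi(x) have already been computed and is only an illustration of the strategy, not the experiment's code:

```python
def classify_with_reject(mu, a, mu_a, d):
    # two-step "mixture-first" rule (H1, H2); standard norms, so A = µ_(2)/µ_(1)
    s = sorted(mu, reverse=True)
    A = s[1] / s[0]                       # ambiguity measure (20)
    if A > a:
        # H1: ambiguity rejection, among every class whose label exceeds mu_a
        return ("ambiguity", [i for i, m in enumerate(mu) if m > mu_a])
    if max(mu) > d:
        # H2: exclusive classification in the class of maximum label
        return ("class", mu.index(max(mu)))
    return ("distance", None)             # outlying pattern: distance rejection
```

For instance, with the thresholds of the experiment (a = µa = d = 0.3), two nearly equal high labels trigger ambiguity rejection, a single dominant label yields exclusive classification, and uniformly low labels yield distance rejection.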

4 Conclusion

In this article, we have proposed a new ambiguity measure within the context of pattern recognition. It is based on an aggregation operator, built on triangular norms, which combines labels representing the degrees of typicality of a pattern to predefined classes. Mathematical properties of this operator have been established. The resulting ambiguity measure allows one to define useful tools for classical pattern recognition problems: feature selection, cluster validity and rejection-based classification. The tests performed showed excellent performance. The influence of the properties of the combination operator on the behavior of the ambiguity measure needs further investigation. Future work will also concern the definition of other operators and their properties.

References

[1] Bezdek, J.C., Keller, J.M., Krishnapuram, R., Pal, N.R.: Fuzzy Models and Algorithms for Pattern Recognition and Image Processing. In: Dubois, D., Prade, H. (eds): The Handbooks of Fuzzy Sets Series. Kluwer Academic Publishers, Norwell, Massachusetts (1999)
[2] Bezdek, J.C., Pal, N.R.: Some new indexes of cluster validity. IEEE Trans. on Systems, Man and Cybernetics 28(3) (1998) 301–315
[3] Campos, T.E., Bloch, I., Cesar Jr., R.M.: Feature selection based on fuzzy distances between clusters: first results on simulated data. In: Singh, S., Murshed, N.A., Kropatsch, W.G. (eds): Lecture Notes in Computer Science, Vol. 2013. Springer-Verlag, Berlin Heidelberg New York (2001) 186–195
[4] Cho, Soosun: Feature selection for Web images. In: Proc. WSEAS MCBC-MCBE-ICAI-ICAMSL, Tenerife, Spain, December 19-21 (2003) 462–177
[5] Chong, A., Gedeon, T.D., Koczy, L.T.: Feature selection for clustering based fuzzy modeling. In: Proc. 8th WSEAS CSCC, Corfu Island, Greece, July 7-10 (2003) 457–456
[6] Chou, Chien-hsing, Su, Mu-chun, Lai, Eugene: Symmetry as a new measure for cluster validity. In: Proc. 7th WSEAS CSCC, Rethymno, Greece, July 7-14 (2002)
[7] Chow, C.K.: An optimum character recognition system using decision functions. IRE Trans. on Electronic Computers 6 (1957) 247–253
[8] Dash, M., Liu, H.: Feature selection for classification. Intelligent Data Analysis 1(4) (1997) 131–156
[9] Devillez, A., Billaudel, P., Villermain Lecolier, G.: Use of criteria of class validity with the Possibilistic C-Means algorithm. In: Proc. 3rd WSEAS CSCC, Athens, Greece, July 4-9 (1999)
[10] Dubois, D., Prade, H.: A review of fuzzy sets aggregation connectives. Information Sciences 36 (1985) 85–121
[11] Dubuisson, B., Masson, M-H.: A statistical decision rule with incomplete knowledge about classes. Pattern Recognition 26(1) (1993) 155–165
[12] Dunn, J.C.: Indices of partition fuzziness and the detection of clusters in large data sets. In: Gupta, M.M., Saridis, G. (eds): Fuzzy Automata and Decision Processes. Elsevier, New York (1977)
[13] Frélicot, C., Dubuisson, B.: A multi-step predictor of membership function as an ambiguity reject solver in pattern recognition. In: Proc. 4th Int. Conf. on Information Processing and Management of Uncertainty in Knowledge-based Systems (1992) 709–715
[14] Frélicot, C., Mascarilla, L.: A third way to design pattern classifiers with reject options. In: Proc. 21st Int. Conf. of the North American Fuzzy Information Processing Society (2002)
[15] Ha, T-M.: The optimum class-selective rejection rule. IEEE Trans. on Pattern Analysis and Machine Intelligence 19(6) (1997) 608–615
[16] Jain, A.K., Duin, R.P.W., Mao, J.: Statistical pattern recognition: a review. IEEE Trans. on Pattern Analysis and Machine Intelligence 22(1) (2000) 4–37
[17] Jain, A.K., Zongker, D.: Feature selection: evaluation, application and small sample performance. IEEE Trans. on Pattern Analysis and Machine Intelligence 19(2) (1997) 153–158
[18] Klir, G.J., Yuan, B.: Fuzzy Sets and Fuzzy Logic: Theory and Applications. Prentice Hall (1995)
[19] Kudo, M., Sklansky, J.: Comparison of algorithms that select features for pattern classifiers. Pattern Recognition 33(1) (2000) 25–41
[20] Mascarilla, L., Frélicot, C.: A class of reject-first possibilistic classifiers based on dual triples. In: Proc. 9th Int. Fuzzy Systems Association World Congress (2001) 743–747
[21] Mitra, P., Murthy, C.A., Pal, S.K.: Unsupervised feature selection using feature similarity. IEEE Trans. on Pattern Analysis and Machine Intelligence 24(3) (2002) 301–312
[22] Pal, S.K., De, R.K., Basak, J.: Unsupervised feature evaluation: a neuro-fuzzy approach. IEEE Trans. on Neural Networks 11 (2000) 366–376
[23] Pudil, P., Novovicova, J., Kittler, J.: Floating search methods in feature selection. Pattern Recognition Letters 15 (1994) 1119–1125
[24] Roubens, M.: Fuzzy clustering algorithms and their cluster validity. European Journal of Operational Research 10 (1982) 294–301
[25] Windham, M.P.: Cluster validity for the fuzzy c-means algorithm. IEEE Trans. on Pattern Analysis and Machine Intelligence 4(4) (1982) 357–363