Feature Selection for Handwritten Chinese Character Recognition Based on Genetic Algorithms

Daming Shi
Dept. of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong

Wenhao Shu and Haitao Liu
Dept. of Computer Science, Harbin Institute of Technology, Harbin, 150001, P.R.C.

ABSTRACT

Feature selection is of great importance in recognition system design because it directly affects the overall performance of the recognition system. Feature selection can be considered as a problem of global combinatorial optimization. Searching for the most suitable features amongst a huge number of possible feature combinations is very time-consuming, so an effective and efficient search technique is desired. In this paper, we use Genetic Algorithms (GA) to design a feature selection approach for handwritten Chinese character recognition. Four contributions are claimed. First, the general transformed divergence among classes, which is derived from Mahalanobis distances, is proposed as the fitness function in the GA-based feature selection. Second, a special crossover operator, different from the traditional one, is given. Third, a special criterion for terminating selection is inferred from the criterion of minimum error probability in a Bayes classifier. Fourth, we compare our method with feature selection based on the Branch-And-Bound algorithm (BAB), which is often used to reduce the calculation of feature selection via exhaustive search. The analysis of the experimental results shows that traditional GA is an ergodic Markov chain, while BAB is a depth-first heuristic algorithm for exhaustive search. We conclude that the GA-based method proposed in this paper is promising for solving feature selection problems in a multi-dimensional space.

Keywords: Feature Selection, Divergence, Genetic Algorithms, Branch-And-Bound Algorithm, Handwritten Chinese Character Recognition

0-7803-4778-1/98 $10.00 © 1998 IEEE

1. INTRODUCTION

The criterion measuring class separability can be assessed as follows: how far the distances among the means of the classes are maximized, and how far the variances of the classes are minimized. Provided that every initial character feature is statistically independent, it would be very easy to apply this criterion, that is, analyzing the n features within the training samples one by one, and then selecting the m features which best meet the criterion mentioned above. For example, suppose the training samples of two classes ω_i and ω_j are given, m_i and m_j are the mean vectors of these two classes respectively, m_ik and m_jk are the means in the kth dimension, and σ_ik^2 and σ_jk^2 are the variances in the kth dimension; then the criterion function measuring class separability can be defined as:

    G_k = (m_ik − m_jk)^2 / (σ_ik^2 + σ_jk^2),   k = 1, 2, ..., n        (Eq.1)

Obviously, G_k is positive. The larger G_k is, the more significant the kth feature is when classifying ω_i and ω_j. Let the set {G_k, k = 1, 2, ..., n} be ordered from the largest to the smallest; then feature selection is nothing else but selecting the m largest elements of the set as classification features.

Eq.1 is only suitable for a normal feature space. Sometimes this condition can't be satisfied, such as in handwritten Chinese character recognition.
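The per-feature scoring of Eq.1 can be sketched as follows. This is a minimal illustration, not the paper's code; the function name `select_by_gk` and the use of sample variances are assumptions.

```python
import numpy as np

def select_by_gk(samples_i, samples_j, m):
    """Rank features by the separability criterion G_k of Eq.1 and keep
    the m best. samples_i, samples_j: (N, n) arrays of training samples
    of classes w_i and w_j. Returns indices of the m largest G_k."""
    mi, mj = samples_i.mean(axis=0), samples_j.mean(axis=0)
    vi, vj = samples_i.var(axis=0), samples_j.var(axis=0)
    gk = (mi - mj) ** 2 / (vi + vj)   # Eq.1, one score per dimension k
    return np.argsort(gk)[::-1][:m]   # the m largest elements of {G_k}
```

For two toy Gaussian classes that differ only along dimension 0, `select_by_gk` picks feature 0 first, since its between-means distance dominates its within-class variance.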

The features in the experiments described in this paper, Rapid-transformed traverse features of handwritten Chinese characters, are not in a normal space. There is another method, feature selection via exhaustive search, which means selecting the optimum m features from the C(n, m) combinations covering all of the n features. According to this method, not only can the optimum feature subset be found, but all classification information among classes can be browsed completely. Nevertheless, the calculation expense of this method is rather high when n is large, so many algorithms have been proposed to reduce its calculation, such as the Branch-And-Bound algorithm (BAB) ([1], [2]).

In this paper, the methodology and implementation of feature selection based on Genetic Algorithms are described, and then the effect is analyzed in theory and experiment. In section 2, the derivation of the transformed divergence criterion from the Mahalanobis distance is given. We introduce how to construct the expression of this problem and how to apply GA to feature selection in section 3. Finally, the feature selection based on GA and that based on BAB are compared and analyzed in section 4.

2. TRANSFORMED AVERAGE DIVERGENCE CRITERION

A similarity measure must be defined to analyze the similarity among the samples of the same class and the dissimilarity among the samples of different classes. There are some definitions as follows:

Definition 1: Mahalanobis distance:

    D^2 = (x − m)' C^{-1} (x − m)        (Eq.2)

where x is the feature vector of a pattern whose mean vector is m and whose global covariance matrix is C.

Definition 2: In a multi-class (w classes) problem, the average classification information is I_ij when taking ω_i out of ω_j, and the average classification information is I_ji when taking ω_j out of ω_i; the divergence between these two classes is J_ij = I_ij + I_ji. Then the average divergence is defined as:

    J = Σ_{i=1}^{w−1} Σ_{j=i+1}^{w} J_ij        (Eq.3)

In Eq.3, J is the sum of each J_ij between a pair of classes. The average divergence can be enlarged by just one large divergence within a single pair of classes; as a result, many pairs with small divergences will be of little advantage. In this case, each divergence between a pair should be considered. To solve this problem, transformed divergence can be used.

Definition 3: Transformed divergence is defined as:

    J'_ij = 100% × [1 − exp(−J_ij / 8)]        (Eq.4)

Thus, the transformed average divergence is:

    J' = Σ_{i=1}^{w−1} Σ_{j=i+1}^{w} J'_ij        (Eq.5)

That is to say, J'_ij is an exponential saturation curve of J_ij. Even if J_ij is very large, J'_ij can never exceed the saturation line of 100%, but J'_ij is more sensitive in the area of small J_ij.

In many real applications:
(1) It is tolerable to assume that the probability density function (PDF) p(x|ω_i) is distributed normally for the class ω_i.
(2) It is also a reasonable assumption that all the covariance matrices of the classes are equivalent and equal to the global covariance matrix.
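Under these two assumptions the pairwise divergence reduces to the squared Mahalanobis distance between the class means, so the transformed average divergence of Eq.3-Eq.5 can be computed directly from the class means and the shared covariance. A minimal sketch (the function name and interface are illustrative, not from the paper):

```python
import numpy as np
from itertools import combinations

def transformed_average_divergence(means, cov):
    """Transformed average divergence of Eq.3-Eq.5 (sketch), assuming
    the two conditions above: normal classes sharing one global
    covariance C. In that case each pairwise divergence J_ij reduces
    to the squared Mahalanobis distance between class means (Eq.2)."""
    cinv = np.linalg.inv(cov)
    total = 0.0
    for i, j in combinations(range(len(means)), 2):
        d = np.asarray(means[i]) - np.asarray(means[j])
        j_ij = d @ cinv @ d                 # J_ij = D_ij^2 under the assumptions
        total += 1.0 - np.exp(-j_ij / 8.0)  # transformed divergence J'_ij (Eq.4)
    return 100.0 * total                    # J' of Eq.5, as a percentage
```

Two well-separated classes saturate the pairwise term near 100%, while coincident means contribute 0, matching the saturation behaviour described for Eq.4.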

Then we can decrease the calculation expense of the criterion function. Suppose that a pattern sample x is given, whose conditional probability densities in ω_i and ω_j are p(x|ω_i) and p(x|ω_j) respectively; the classification information when taking ω_i out of ω_j is provided by:

    I_ij = ∫ p(x|ω_i) ln[ p(x|ω_i) / p(x|ω_j) ] dx        (Eq.6)

Under the assumptions of equivalent covariances,

    I_ij = D_ij^2 / 2        (Eq.7)

then,

    J = Σ_{i=1}^{w−1} Σ_{j=i+1}^{w} tr[ C^{-1} (m_i − m_j)(m_i − m_j)' ]
      = Σ_{i=1}^{w−1} Σ_{j=i+1}^{w} (m_i − m_j)' C^{-1} (m_i − m_j)        (Eq.8)

i.e.,

    J = Σ_{i=1}^{w−1} Σ_{j=i+1}^{w} D_ij^2        (Eq.9)

3. FEATURE SELECTION BASED ON GENETIC ALGORITHMS

The essence of Genetic Algorithms is imitating the way of the evolution of organisms and human beings to solve complex optimization problems, where a problem solution can be expressed by a string with a certain length. A group of solutions remains after every iteration; it can be arranged in order of quality, and from it some solutions can be selected according to a certain principle. After the solutions selected from the last iteration are processed by genetic operators (such as crossover and mutation), another new group of solutions is generated. The procedure above is repeated until a certain goal is reached ([3]).

3.1 Expression of Problem

The problem solution is expressed as a string with a certain length over the alphabet {0, 1, *} in traditional GA, while the feature selection discussed in this paper can be regarded as a problem of global combinatorial optimization. For a problem selecting m features from n features, its solution can be expressed as an m-length string over the alphabet {1, 2, ..., n}. Here, the value of a gene refers to the feature of a certain dimension. GA is run in this way to solve the problem described in section 1:

(1) The size of the population: 300.
(2) Fitness function of individuals. Let D^S_ij express the Mahalanobis distance between ω_i and ω_j in the feature subset S; then, according to Eq.9, the transformed average divergence can be used as the fitness function of individuals. In Eq.7, Eq.8 and Eq.9, D_ij is the Mahalanobis distance between the ith pattern and the jth pattern.

(3) Crossover operator. It is known that such an optimization problem has nothing to do with the order of the genes, so the gene strings in the parents can be ordered before the crossover works. A gene which appears in both parents does not take part in crossover. Let P_cross = 0.6.
(4) Mutation operator. Let a gene's value mutate in a random position of the father; the mutated value should range from 1 to n, and should be different from the other m−1 genes. Let P_mut = 0.05.
(5) Selection. Ten strings are selected to mate through the Roulette Wheel Method each time. Firstly, arrange all individuals of the population from the largest fitness to the smallest; secondly, calculate the sum of the fitnesses of all strings in the current population, and then generate a random number r within [0, Σ f_i]; at last, add into the mating pool the ith string from the population, which satisfies

    Σ_{j=1}^{i} f_j ≥ r

where f_i is the fitness of the ith string.
(6) Stopping condition. When the number of different individuals exceeds 5000, stop GA.

Figure 1 shows the running process of GA when selecting 20-dimensional to 26-dimensional features from the original 64-dimensional features.

[Figure: fitness curves for m = 20 through m = 26 versus the number of different individuals, with a dashed baseline marking the first 24 features chosen by Eq.1.]

Figure 1 Feature Selection Based on GA
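The six-part procedure above can be sketched as follows. The population size, operator rates, mating-pool size and the 5000-distinct-individuals stop follow the paper; the replacement scheme, the `max_gens` safety cap and all names are assumptions made for this illustration.

```python
import random

def run_ga(n, m, fitness, pop_size=300, p_cross=0.6, p_mut=0.05,
           max_distinct=5000, max_gens=100, seed=0):
    """GA of section 3 (sketch). Individuals are m-length strings of
    distinct feature indices from {0, ..., n-1}; `fitness` maps a tuple
    of indices to a positive score, e.g. the transformed average
    divergence of Eq.5 restricted to that feature subset."""
    rng = random.Random(seed)
    score = lambda ind: fitness(tuple(ind))

    def roulette_pick(pop, scores):
        # Roulette Wheel: draw r in [0, sum f_i]; return the first
        # string whose cumulative fitness reaches r.
        r = rng.uniform(0.0, sum(scores))
        acc = 0.0
        for ind, s in zip(pop, scores):
            acc += s
            if acc >= r:
                return ind
        return pop[-1]

    def crossover(a, b):
        # Order-free problem: genes present in BOTH parents are kept
        # fixed and excluded from crossover, as operator (3) requires.
        shared = set(a) & set(b)
        ua = [g for g in a if g not in shared]
        ub = [g for g in b if g not in shared]
        if ua and ub and rng.random() < p_cross:
            cut = rng.randrange(1, min(len(ua), len(ub)) + 1)
            ua[:cut], ub[:cut] = ub[:cut], ua[:cut]
        return sorted(shared | set(ua)), sorted(shared | set(ub))

    def mutate(ind):
        ind = list(ind)
        if rng.random() < p_mut:
            pos = rng.randrange(m)
            unused = [g for g in range(n) if g not in ind]
            if unused:                         # mutated value stays distinct
                ind[pos] = rng.choice(unused)  # from the other m-1 genes
        return sorted(ind)

    pop = [sorted(rng.sample(range(n), m)) for _ in range(pop_size)]
    seen = {tuple(p) for p in pop}
    best = max(pop, key=score)
    for _ in range(max_gens):
        if len(seen) >= max_distinct:          # stopping condition (6)
            break
        scores = [score(ind) for ind in pop]
        children = []
        for _ in range(5):                     # ten strings mate per round
            c1, c2 = crossover(roulette_pick(pop, scores),
                               roulette_pick(pop, scores))
            children += [mutate(c1), mutate(c2)]
        seen.update(tuple(c) for c in children)
        pop.sort(key=score, reverse=True)      # replace the worst strings
        pop[-len(children):] = children
        best = max(pop + [best], key=score)
    return best
```

A call such as `run_ga(64, 24, fitness)` would search for a 24-of-64 subset as in Figure 1, with `fitness` supplied by the divergence criterion of section 2.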

3.2 Reasonable Number of Features for Selection

However, how many dimensions is reasonable for selection?

First, let us analyze the error probability of the Bayes classifier in the case of a two-class problem. Classes ω_1 and ω_2 correspond to two decision areas, α_1 and α_2, but two kinds of errors may exist: patterns belonging to ω_1 are classified into ω_2, and vice versa. Thus, the error probability is

    P(e) = P(ω_1) ∫_{α_2} p(x|ω_1) dx + P(ω_2) ∫_{α_1} p(x|ω_2) dx        (Eq.10)

From the training samples in a w-class problem, generally P(ω_1) = P(ω_2) = ... = P(ω_w) = 1/w. Let

    U_ij(x) = ln[ p(x|ω_i) / p(x|ω_j) ]        (Eq.11)

Thus, the error probability is P(U_ij(x) < 0 | ω_i) when the patterns belonging to ω_i are classified into ω_j, while the error probability is P(U_ji(x) < 0 | ω_j) in the opposite case.        (Eq.12)
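The error probabilities of Eq.10-Eq.12 can be checked numerically. The sketch below assumes two normal classes with a common covariance, in which case the log-likelihood ratio of Eq.11 is linear in x; the Monte-Carlo estimate is compared against the standard closed form for this Gaussian case.

```python
import numpy as np
from math import erf, sqrt

def u12(x, m1, m2, cinv):
    # Eq.11 evaluated for normal densities sharing one covariance C:
    # the log-likelihood ratio is linear in x.
    w = cinv @ (m1 - m2)
    return x @ w - 0.5 * (m1 + m2) @ w

rng = np.random.default_rng(1)
m1, m2, cov = np.array([0.0, 0.0]), np.array([2.0, 0.0]), np.eye(2)
cinv = np.linalg.inv(cov)
x1 = rng.multivariate_normal(m1, cov, 100_000)  # samples of class w1
x2 = rng.multivariate_normal(m2, cov, 100_000)  # samples of class w2

# Eq.10/Eq.12 with equal priors: average the two one-sided error rates.
p_err = 0.5 * ((u12(x1, m1, m2, cinv) <= 0).mean()
               + (u12(x2, m1, m2, cinv) > 0).mean())

# Closed form for the Gaussian, equal-covariance case: P(e) = Phi(-D/2),
# where D is the Mahalanobis distance between the two class means.
D = sqrt((m1 - m2) @ cinv @ (m1 - m2))
p_closed = 0.5 * (1.0 - erf(D / (2.0 * sqrt(2.0))))
```

The two values agree to Monte-Carlo accuracy, and both shrink as the means move apart, previewing the monotone link between error probability and D_ij that the rest of this subsection derives.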

For a problem with the PDF of each class distributed normally and with equivalent covariance matrices:

    U_ij(x) = −(1/2)(x − m_i)' C^{-1} (x − m_i) + (1/2)(x − m_j)' C^{-1} (x − m_j)
            = x' C^{-1} (m_i − m_j) − (1/2)(m_i + m_j)' C^{-1} (m_i − m_j)        (Eq.13)

Because U_ij(x) is a linear function of x, U_ij(x), as well as x, is distributed normally. Its mean under ω_i is

    E_i{U_ij} = ∫ p(x|ω_i) ln[ p(x|ω_i) / p(x|ω_j) ] dx = I_ij

and, because of Eq.7,

    E_i{U_ij} = (1/2) D_ij^2 = (1/2)(m_i − m_j)' C^{-1} (m_i − m_j)        (Eq.14)

while its variance under ω_i is

    Var_i{U_ij} = E_i{ (m_i − m_j)' C^{-1} (x − m_i)(x − m_i)' C^{-1} (m_i − m_j) } = D_ij^2

so that U_ij given ω_i is distributed as N(D_ij^2 / 2, D_ij^2), and the error probability of Eq.12 becomes

    P(e) = Φ(−D_ij / 2)        (Eq.15)

where Φ is the standard normal distribution function.

According to Eq.15, P(e) is monotone decreasing with increasing D_ij^2; on the other hand, the transformed divergence is a monotone increasing function of D_ij^2. Hence, the final feature subset can be fixed when the increase of the transformed divergence tends to ease up.

4. A COMPARISON OF TWO METHODS

Feature selection based on the Branch-And-Bound Algorithm is a heuristic method which can simplify the exhaustive search. It is applicable under the precondition that the feature selection criterion function satisfies the monotonicity property. Provided that an original pattern x is n-dimensional, some m-dimensional feature subset is selected from these n dimensions, and some k-dimensional subset is selected in turn from those m dimensions; in this case, the criterion function should satisfy J_n ≥ J_m ≥ J_k. BAB can cut down the search time under this condition, which can be summarized as follows: at first, generate a k-dimensional initial feature subset randomly, whose initial bound is B = J(x*); then search backwards to the father nodes. Because the criterion function is monotonic relative to the number of dimensions, as long as there is an m-dimensional subset x_i^(m) satisfying J(x_i^(m)) ≤ B, the whole branch below it can be cut off.

Suppose that the calculation is evaluated by the number of calculations of the criterion function; the calculation spent is more than 0.4·C_n^k + 0.6·C_n^{k−1} by the Branch-And-Bound Algorithm, while it is only 5000 by GA. Such predominance of GA derives from its being able to find a higher-order good building block via shorter-order ones. Third, although the calculation spent by GA is definitely less than that of the Branch-And-Bound algorithm, the time spent by GA is not necessarily less than that of the Branch-And-Bound algorithm.
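The Branch-And-Bound search summarized above can be sketched as follows. This is a simplified top-down formulation, not the paper's implementation: features are deleted one at a time from the full set, and monotonicity justifies cutting a branch whose score already fails to beat the bound.

```python
def bab_select(n, m, J):
    """Branch-And-Bound feature selection (sketch). J maps a tuple of
    feature indices to a score and must be monotone: J never increases
    when features are deleted, the precondition stated above. Returns
    the optimal m-subset of {0, ..., n-1} and the number of criterion
    evaluations spent."""
    full = tuple(range(n))
    best = full[:m]          # an arbitrary initial feasible subset
    bound = J(best)          # initial bound B = J(x*)
    evals = 1

    def search(current, start):
        nonlocal best, bound, evals
        evals += 1
        score = J(current)
        # Monotonicity: every descendant of this node scores <= score,
        # so if score <= B the whole branch can be cut off.
        if len(current) > m and score <= bound:
            return
        if len(current) == m:
            if score > bound:
                best, bound = current, score
            return
        # Delete positions in non-decreasing order so that each subset
        # is visited at most once.
        for i in range(start, len(current)):
            search(current[:i] + current[i + 1:], i)

    search(full, 0)
    return best, evals
```

With a monotone criterion such as a non-negative weighted sum, `bab_select` returns the same optimum as exhaustive enumeration while typically spending fewer criterion evaluations, since pruned branches are never expanded.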

REFERENCES

[4] ..., Theoretical Artificial Intelligence, 1990(2), pp. 101-115.
[5] M. Mitchell et al., "The Royal Road for Genetic Algorithms: Fitness Landscapes and GA Performance," Proc. of the 1st European Conf. on Artificial Life, 1992, pp. 245-254.