Feature Selection for Handwritten Chinese Character Recognition Based on Genetic Algorithms

Daming Shi, Dept. of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong
Wenhao Shu and Haitao Liu, Dept. of Computer Science, Harbin Institute of Technology, Harbin, 150001, P.R.C.
ABSTRACT

Feature selection is of great importance in recognition system design because it directly affects the overall performance of the recognition system. Feature selection can be considered as a problem of global combinatorial optimization. Searching for the most suitable features amongst a huge number of possible feature combinations is very time-consuming, so an effective and efficient search technique is desired. In this paper, we use Genetic Algorithms (GA) to design a feature selection approach for handwritten Chinese character recognition. Four contributions are claimed. First, the general transformed divergence among classes, which is derived from Mahalanobis distances, is proposed as the fitness function in the feature selection based on GA. Second, a special crossover operator other than the traditional one is given. Third, a special criterion for terminating selection is inferred from the criterion of minimum error probability in a Bayes classifier. Fourth, we compare our method with feature selection based on the Branch-And-Bound algorithm (BAB), which is often used to reduce the calculation of feature selection via exhaustive search. The analyses of the experimental results proceed from the fact that a traditional GA is an ergodic Markov chain, while BAB is a depth-first heuristic algorithm for exhaustive search. We conclude that the GA-based method proposed in this paper is promising for solving feature selection problems in a multi-dimensional space.

Keywords: Feature Selection, Divergence, Genetic Algorithms, Branch-And-Bound Algorithm, Handwritten Chinese Character Recognition

0-7803-4778-1/98 $10.00 © 1998 IEEE

1. INTRODUCTION

The criterion measuring class separability can be assessed as follows: how far the distances among the means of the classes are maximized, and how far the variances of the classes are minimized. Provided that every initial character feature is statistically independent, it would be very easy to apply this criterion: analyze the n features within the training samples one by one, and then select the m features which meet the criterion mentioned above. For example, suppose the training samples of 2 classes ω_i and ω_j are given; m_i and m_j are the mean vectors of these 2 classes respectively, m_ik and m_jk are the means in the kth dimension, and σ_ik² and σ_jk² are the variances in the kth dimension. Then the criterion function measuring class separability can be defined as:

G_k = (m_ik - m_jk)² / (σ_ik² + σ_jk²),  k = 1, 2, ..., n    (Eq.1)

Obviously, G_k is positive. The larger G_k is, the more significant the kth feature is when classifying ω_i and ω_j. Let the set {G_k, k = 1, 2, ..., n} be ordered from the largest to the smallest; then feature selection is nothing else but selecting the m largest elements of the set as classification features.

Eq.1 is only suitable for a normal feature space. Sometimes this condition cannot be satisfied, as in handwritten Chinese character recognition: the features used in the experiments described in this paper, Rapid-transformed traverse features of handwritten Chinese characters, are not in a normal space.
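As a concrete illustration, the per-feature criterion of Eq.1 and the ranking it implies can be sketched as follows (a minimal sketch; the function and variable names are ours, not from the paper):

```python
def separability_ranking(class_i, class_j, m):
    # Rank features by Eq.1, G_k = (m_ik - m_jk)^2 / (s2_ik + s2_jk),
    # and return the indices (0-based) of the m largest G_k.
    # class_i, class_j: lists of n-dimensional sample vectors, one list per class.
    n = len(class_i[0])

    def mean_var(samples, k):
        vals = [x[k] for x in samples]
        mu = sum(vals) / len(vals)
        return mu, sum((v - mu) ** 2 for v in vals) / len(vals)

    G = []
    for k in range(n):
        mi, vi = mean_var(class_i, k)
        mj, vj = mean_var(class_j, k)
        G.append((mi - mj) ** 2 / ((vi + vj) or 1e-12))  # guard against zero variance
    return sorted(range(n), key=G.__getitem__, reverse=True)[:m]
```

Because each G_k is computed independently, this ranking ignores correlations between features, which is exactly why it requires the independence assumption stated above.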
There is another method, feature selection via exhaustive search, which means selecting the optimum m features by examining all C_n^m combinations of the n features. With this method, not only can the optimum feature subset be found, but all classification information among the classes can be browsed completely. Nevertheless, the computational expense of this method is rather high when n is large, so many algorithms have been proposed to reduce its calculation, such as the Branch-And-Bound algorithm (BAB) ([1],[2]).

In this paper, the methodology and implementation of feature selection based on Genetic Algorithms are described, and then the effect is analyzed in theory and experiment. In section 2, the derivation of the transformed divergence criterion from the Mahalanobis distance is given. In section 3, we introduce how to construct the expression of this problem and how to apply GA to feature selection. Finally, feature selection based on GA and that based on BAB are compared and analyzed in section 4.

2. TRANSFORMED AVERAGE DIVERGENCE CRITERION

A similarity measure must be defined to analyze the similarity among the samples of the same class and the dissimilarity among the samples of different classes. Some definitions follow.

Definition 1: Mahalanobis distance:

D² = (x - m)ᵀ C⁻¹ (x - m)    (Eq.2)

where x is the feature vector of a pattern whose mean vector is m and whose global covariance matrix is C.

Definition 2: In a multi-class (w classes) problem, let the average classification information be I_ij when taking ω_i out of ω_j, and I_ji when taking ω_j out of ω_i; the divergence between these two classes is J_ij = I_ij + I_ji. The average divergence is then defined as:

J = Σ_{i=1}^{w-1} Σ_{j=i+1}^{w} J_ij    (Eq.3)

In Eq.3, J is the sum of the divergences J_ij over every pair of classes. The average divergence can be enlarged by a single large divergence within one pair of classes; as a result, many pairs with small divergences would be of little advantage. In this case, each divergence between a pair should be considered. To solve this problem, the transformed divergence can be used.

Definition 3: The transformed divergence is defined as:

J'_ij = 100% × [1 - exp(-J_ij / 8)]    (Eq.4)

Thus, the transformed average divergence is:

J' = Σ_{i=1}^{w-1} Σ_{j=i+1}^{w} J'_ij    (Eq.5)

That is to say, J'_ij is an exponential saturation curve of J_ij: even if J_ij is very large, J'_ij can never exceed the saturation line of 100%, while J'_ij is more sensitive in the region of small J_ij.

In many real applications: (1) it is tolerable to assume that the probability density function (PDF) p(x|ω_i) is distributed normally for each class ω_i; (2) it is also reasonable to assume that all the covariance matrices of the classes are equivalent and equal to the global covariance matrix. Then we can decrease the computational expense of the criterion function.
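Under assumptions (1) and (2), each pairwise divergence J_ij reduces to the Mahalanobis distance D²_ij between the class means (this reduction is derived in the equations that follow), so the transformed average divergence of Eq.4 and Eq.5 can be computed directly from the class means and the inverse of the shared covariance matrix. A minimal sketch (the function names are ours; `C_inv` is assumed to be the precomputed inverse of C):

```python
import math

def mahalanobis2(mi, mj, C_inv):
    # D2_ij = (mi - mj)^T C^-1 (mi - mj): Eq.2 applied to a pair of class means.
    d = [a - b for a, b in zip(mi, mj)]
    n = len(d)
    return sum(d[r] * C_inv[r][c] * d[c] for r in range(n) for c in range(n))

def transformed_average_divergence(means, C_inv):
    # J' of Eq.5: the sum of J'_ij = 100% * (1 - exp(-J_ij / 8)) (Eq.4) over all
    # class pairs, with J_ij = D2_ij under the shared-covariance assumption.
    w = len(means)
    total = 0.0
    for i in range(w - 1):
        for j in range(i + 1, w):
            Jij = mahalanobis2(means[i], means[j], C_inv)
            total += 100.0 * (1.0 - math.exp(-Jij / 8.0))
    return total
```

Each term saturates at 100%, so no single well-separated pair of classes can dominate the sum, which is the motivation for Definition 3.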
Suppose that a pattern sample x is given, whose conditional probability densities in ω_i and ω_j are p(x|ω_i) and p(x|ω_j) respectively. The classification information obtained when taking ω_i out of ω_j is provided by:

I_ij = ∫ p(x|ω_i) ln{ p(x|ω_i) / p(x|ω_j) } dx    (Eq.6)

Under assumptions (1) and (2), the covariance terms cancel and this reduces to

I_ij = (1/2) tr[ C⁻¹ (m_i - m_j)(m_i - m_j)ᵀ ] = D²_ij / 2    (Eq.7)

Since I_ji = I_ij under these assumptions, J_ij = I_ij + I_ji = D²_ij, and then

J = Σ_{i=1}^{w-1} Σ_{j=i+1}^{w} tr[ C⁻¹ (m_i - m_j)(m_i - m_j)ᵀ ] = Σ_{i=1}^{w-1} Σ_{j=i+1}^{w} (m_i - m_j)ᵀ C⁻¹ (m_i - m_j)    (Eq.8)

i.e.

J = Σ_{i=1}^{w-1} Σ_{j=i+1}^{w} D²_ij    (Eq.9)

3. FEATURE SELECTION BASED ON GENETIC ALGORITHMS

The essence of Genetic Algorithms is to imitate the evolution of organisms in order to solve complex optimization problems, where a problem solution can be expressed as a string of a certain length. A group of solutions remains after every iteration; it can be arranged in order of quality, and some solutions are selected from it according to a certain principle. After the solutions selected from the last iteration are processed by genetic operators (such as crossover and mutation), another new group of solutions is generated. The procedure is repeated until a certain goal is reached ([3]).

3.1 Expression of the Problem

In traditional GA, a problem solution is expressed as a string of a certain length over the alphabet {0, 1, *}. The feature selection discussed in this paper, however, can be regarded as a problem of global combinatorial optimization: for a problem of selecting m features from n features, a solution can be expressed as an m-length string over the alphabet {1, 2, ..., n}, where the value of a gene refers to the feature of a certain dimension. GA is run as follows to solve the problem described in section 1:

(1) Size of population: 300.

(2) Fitness function of individuals. Let D²_ij(S) express the Mahalanobis distance between ω_i and ω_j in the feature subset S; then, according to Eq.9, the transformed average divergence over S can be used as the fitness function of individuals. In Eq.7, Eq.8 and Eq.9, D²_ij is the Mahalanobis distance between the ith class and the jth class.
(3) Crossover operator. Since this optimization problem has nothing to do with the order of the genes, the gene strings of the parents can be ordered before the crossover works. A gene which appears in both parents does not take part in crossover. Let P_cross = 0.6.

(4) Mutation operator. A gene's value mutates at a random position of the father string; the mutated value should lie in the range 1 to n and should be different from the other m-1 genes. Let P_mut = 0.05.

(5) Selection. Ten strings are selected to mate through the Roulette Wheel method each time. Firstly, arrange all individuals of the population from the largest fitness to the smallest; secondly, calculate f_sum, the sum of the fitnesses of all strings in the current population, and generate a random number r within [0, f_sum]; at last, add into the mating pool the ith string of the population, where i is the smallest index satisfying Σ_{j=1}^{i} f_j ≥ r; here f_j is the fitness of the jth string.

(6) Stopping condition. When the number of different individuals evaluated exceeds 5000, stop GA.

Figure 1 shows the running process of GA when selecting 20-dimensional to 26-dimensional features from the original 64-dimensional features.
[Figure 1 here: transformed average divergence (%) versus the number of different individuals (×50), with one curve for each of m = 20 through m = 26, together with the level reached by the first 24 features selected by Eq.1.]
Figure 1 Feature Selection Based on GA
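The GA of section 3.1 — the gene-string representation, the order-insensitive crossover of condition (3), the constrained mutation of condition (4), roulette-wheel selection, and the distinct-individuals stopping condition — can be sketched as below. This is a simplified sketch: the replacement policy and the safety cap on generations are our assumptions, not specified in the paper.

```python
import random

def run_ga(n, m, fitness, pop_size=300, p_cross=0.6, p_mut=0.05,
           max_distinct=5000, max_gens=500, seed=0):
    # GA over m-length strings of distinct genes in {1..n} (section 3.1).
    rng = random.Random(seed)

    def new_individual():
        return sorted(rng.sample(range(1, n + 1), m))

    def crossover(a, b):
        # Condition (3): genes are order-independent; genes common to both
        # parents are frozen, and only the remaining genes are exchanged.
        if rng.random() >= p_cross:
            return list(a), list(b)
        common = sorted(set(a) & set(b))
        ra = [g for g in a if g not in common]
        rb = [g for g in b if g not in common]
        cut = rng.randrange(len(ra) + 1) if ra else 0
        return (sorted(common + ra[:cut] + rb[cut:]),
                sorted(common + rb[:cut] + ra[cut:]))

    def mutate(ind):
        # Condition (4): a random gene is replaced by a value in 1..n
        # that differs from the other m-1 genes.
        if rng.random() < p_mut:
            i = rng.randrange(m)
            pool = [g for g in range(1, n + 1) if g not in ind]
            ind = sorted(ind[:i] + [rng.choice(pool)] + ind[i + 1:])
        return ind

    def roulette(pop, fits, k=10):
        # Condition (5): Roulette Wheel selection of ten mates.
        total = sum(fits)
        mates = []
        for _ in range(k):
            r, acc = rng.uniform(0, total), 0.0
            for ind, f in zip(pop, fits):
                acc += f
                if acc >= r:
                    mates.append(ind)
                    break
            else:
                mates.append(pop[-1])
        return mates

    pop = [new_individual() for _ in range(pop_size)]
    seen, best, best_f = set(), None, float("-inf")
    for _ in range(max_gens):
        fits = [fitness(ind) for ind in pop]
        for ind, f in zip(pop, fits):
            seen.add(tuple(ind))
            if f > best_f:
                best, best_f = list(ind), f
        if len(seen) > max_distinct:  # condition (6)
            break
        mates = roulette(pop, fits)
        children = []
        for i in range(0, len(mates) - 1, 2):
            a, b = crossover(mates[i], mates[i + 1])
            children += [mutate(a), mutate(b)]
        # Replacement policy (ours): children overwrite the worst individuals.
        worst = sorted(range(len(pop)), key=lambda idx: fits[idx])
        for slot, child in zip(worst, children):
            pop[slot] = child
    return best, best_f

# Toy check: maximizing the sum of the selected indices favors the largest ones.
best, best_f = run_ga(n=20, m=5, fitness=sum, pop_size=60, max_gens=100)
```

In a real run, `fitness` would be the transformed average divergence of Eq.5 restricted to the dimensions named by the individual.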
3.2 Reasonable Number of Features for Selection

However, how many dimensions are reasonable for selection? Let us first analyze the error probability of a Bayes classifier in the case of a two-class problem. Classes ω1 and ω2 correspond to two decision areas, α1 and α2, but two kinds of errors may occur: patterns belonging to ω1 classified into ω2, and vice versa. From the training samples in a w-class problem we generally have P(ω1) = P(ω2) = ... = P(ωw) = 1/w. Let

u_ij(x) = ln( p(x|ω_i) / p(x|ω_j) )    (Eq.10)

Then the error probability is P(u_ij < 0 | ω_i) when patterns belonging to ω_i are classified into ω_j (Eq.11), while the error probability is P(u_ij > 0 | ω_j) when patterns belonging to ω_j are classified into ω_i (Eq.12).

For a problem whose class PDFs are distributed normally with equivalent covariance matrices:

u_ij(x) = -(1/2)(x - m_i)ᵀ C⁻¹ (x - m_i) + (1/2)(x - m_j)ᵀ C⁻¹ (x - m_j)
        = xᵀ C⁻¹ (m_i - m_j) - (1/2)(m_i + m_j)ᵀ C⁻¹ (m_i - m_j)    (Eq.13)

Because u_ij(x) is a linear function of x, u_ij(x), like x, is distributed normally. By Eq.7, its mean under class ω_i is

E_i{u_ij} = ∫ p(x|ω_i) ln( p(x|ω_i) / p(x|ω_j) ) dx = I_ij = (1/2)(m_i - m_j)ᵀ C⁻¹ (m_i - m_j) = D²_ij / 2    (Eq.14)

and its variance is

Var_i{u_ij} = E_i{ (m_i - m_j)ᵀ C⁻¹ (x - m_i)(x - m_i)ᵀ C⁻¹ (m_i - m_j) } = D²_ij    (Eq.15)

According to Eq.14 and Eq.15, the error probability P(u_ij < 0 | ω_i) = Φ(-D_ij / 2) is monotone decreasing with increasing D²_ij; on the other hand, the transformed divergence is a monotone increasing function of D²_ij. Hence, the final feature subset can be fixed when the increase of the transformed divergence tends to ease up.
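The terminating criterion above — stop enlarging the feature subset once the transformed average divergence levels off — could be implemented, for instance, as follows (the relative-gain threshold is a hypothetical parameter of ours; the paper does not give one):

```python
def reasonable_dimension(J_by_m, rel_gain=0.01):
    # Pick the smallest feature count m at which the transformed average
    # divergence J'(m) stops growing appreciably.
    # J_by_m: dict mapping m -> J'(m); rel_gain: flatness threshold (ours).
    ms = sorted(J_by_m)
    for prev, cur in zip(ms, ms[1:]):
        if J_by_m[prev] > 0 and (J_by_m[cur] - J_by_m[prev]) / J_by_m[prev] < rel_gain:
            return prev
    return ms[-1]
```

Applied to the curves of Figure 1, such a rule would return the m at which the per-dimension gain in divergence first falls below the threshold.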
4. A COMPARISON OF TWO METHODS

Feature selection based on the Branch-And-Bound Algorithm (BAB) is a heuristic method which can simplify the exhaustive search. It is applicable under the precondition that the feature selection criterion function satisfies the monotonicity property: provided that an original pattern x is n-dimensional, an m-dimensional feature subset is selected from these n dimensions, and a k-dimensional subset is in turn selected from those m dimensions, the criterion function should satisfy J_n ≥ J_m ≥ J_k. BAB can cut down the search time under this condition, which can be summarized as follows: at first, generate a k-dimensional initial feature subset randomly, whose initial bound is B = J(x*); then search backwards to father nodes. Because the criterion function is monotonic with respect to the number of dimensions, as soon as there is an m-dimensional subset x_i^(m) satisfying J(x_i^(m)) ≤ B, none of its k-dimensional descendants can exceed the bound, so the whole branch below x_i^(m) can be cut off.
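For a sense of scale, the evaluation counts compared in this section can be checked directly (n = 64 as in the paper's experiments; k = 24 is an illustrative choice of ours):

```python
import math

# n = 64 original dimensions; the BAB lower bound 0.4*C(n,k) + 0.6*C(n,k-1)
# is the figure used in the comparison of this section.
n, k = 64, 24
bab_evals = 0.4 * math.comb(n, k) + 0.6 * math.comb(n, k - 1)
ga_evals = 5000  # GA stopping condition (6) of section 3.1
print(f"BAB: > {bab_evals:.2e} criterion evaluations; GA: {ga_evals}")
```

Even as a lower bound, the BAB count is astronomically larger than the fixed 5000 evaluations at which the GA stops.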
Suppose that the calculation is evaluated by the number of criterion-function evaluations. The calculation spent by the Branch-And-Bound Algorithm is more than 0.4·C_n^k + 0.6·C_n^(k-1), while it is only 5000 by GA. Such predominance of GA derives from its being able to find a higher-order good building block via shorter-order ones. Third, although the calculation spent by GA is definitely less than that of the Branch-And-Bound algorithm, the time spent by GA is not definitely less than that of the Branch-And-Bound algorithm.

REFERENCES

[4] …, Journal of Experimental and Theoretical Artificial Intelligence, 1990(2), pp. 101-115.
[5] M. Mitchell et al., The Royal Road for Genetic Algorithms: Fitness Landscapes and GA Performance, Proc. of the 1st European Conf. on Artificial Life, 1992, pp. 245-254.