Recognition of Handwritten Indic Script Using Clonal Selection Algorithm

Utpal Garain1, Mangal P. Chakraborty1, and Dipankar Dasgupta2

1 Indian Statistical Institute, 203, B.T. Road, Kolkata 700108, India
2 The University of Memphis, Memphis, TN 38152
[email protected], [email protected]

Abstract. This work explores the potential of a clonal selection algorithm in pattern recognition (PR). In particular, a retraining scheme for the clonal selection algorithm is formulated for better recognition of handwritten numerals (a 10-class classification problem). An empirical study with two datasets (each containing about 12,000 handwritten samples of the 10 numerals) shows that the proposed approach exhibits very good generalization ability. The experiments yield an average recognition accuracy of about 96%. The effect of the control parameters on the performance of the algorithm is analyzed, and the scope for further improvement in recognition accuracy is discussed.

Keywords: Clonal selection algorithm, character recognition, Indic scripts, handwritten digits.

H. Bersini and J. Carneiro (Eds.): ICARIS 2006, LNCS 4163, pp. 256–266, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction

Several immunological metaphors are now being used (in a piecemeal fashion) for designing Artificial Immune Systems (AIS) [1]. These approaches can be broadly classified into three groups, namely immune network models [2], negative selection algorithms [3], and clonal selection algorithms [4]. This paper investigates a new training approach for the clonal selection algorithm (CSA) and its application to character recognition. Earlier, CSA was used for a 2-class problem to discriminate pairs of similar character patterns [5]; the present study extends it to an m-class classification problem.

Training in CSA has so far been modeled as a one-pass method in which each antigen undergoes a single training phase. Once the training on all antigens is over, an immune memory is produced and used for solving the classification problem (as in [5] and [6]). Our work presents a new training algorithm in which a refinement phase is used to fine-tune the initial immune memory that is built by the single-pass training. In the refinement stage, the training of an antigen depends on its recognition score: incorrect recognition of an antigen triggers further training. This process continues until the immune system suffers from negative learning, i.e., it is over-learned.

Recognition of handwritten Indic numerals is used to study the performance of the modified CSA. Because of its numerous applications in postal automation, bank check reading, etc., document image analysis researchers have been studying this problem for several years, and a number of methods have been proposed. While some of these are biologically inspired approaches such as neural networks [7] and genetic algorithms [8], AIS approaches have remained unexplored for this application, even though AIS techniques have been applied to several pattern recognition problems [9-14].

The rest of the paper is organized as follows. Section 2 describes the CSA with the proposed retraining scheme. Section 3 provides the experimental details and reports results highlighting the performance of the CSA in classifying handwritten numerals. This section also compares the performance of the new retraining scheme with the previously used single-pass approach. In addition, Section 3 discusses the effect of the CSA control parameters on its performance, and Section 4 provides some concluding remarks.

2 Classification Using Clonal Selection Algorithm

Let AG represent a set of training data (antigens) and let agi represent an individual member of this set: AG = {ag1, ag2, …, agk}. Each agi has two attributes: a class, agi.c ∈ C = {c1, c2, …, cn} (n = 10 for digit classification), and a feature vector, agi.f. Let the immune memory be IM = {m1, m2, …, mm}, where mi is a memory cell having two attributes similar to those of an individual antigen: for any mi, mi.c ∈ C is the class information and mi.f is the feature vector.

Binary images of handwritten numerals are first size-normalized into a 48x48 matrix, each element of which is binary. This matrix is used as the feature map for the experiments. The similarity between two such feature matrices, S(F1, F2), is a measure of the autocorrelation coefficient between F1 and F2, as defined below:

$$S(F_1, F_2) = \frac{1}{2} - \frac{s_{10}\, s_{01} - s_{00}\, s_{11}}{2\left[(s_{11} + s_{10})(s_{01} + s_{00})(s_{11} + s_{01})(s_{10} + s_{00})\right]^{1/2}} \tag{1}$$

where s00, s11, s01, and s10 denote the number of zero matches, one matches, zero mismatches, and one mismatches, respectively. Note that S takes values in the range [0, 1], where 1 indicates the highest and 0 the lowest similarity between two samples. We use this metric to measure similarity (affinity) during antibody-antibody and antigen-antibody interactions.

Training has two phases: Phase-I is the same as that used in [6], while Phase-II incorporates a refinement process. Phase-I involves three stages, namely initialization of the immune memory, clone generation, and selection of clones to update the immune memory. These stages are briefly discussed below.

Initialization: This stage chooses some antigens as initial memory cells to initialize the immune memory (IM). In the present study, only one antigen from each class is randomly chosen. Note that the number of initial cells has a certain effect on the system’s performance, as illustrated in [6].

Clone generation: For a given antigen agi, its closest match (say, mi) is first chosen from the existing IM as follows:

$$\mathrm{stim}(ag_i, m_i) \ge \mathrm{stim}(ag_i, m_j), \quad \text{for all } j \ne i \text{ and } m_j.c = ag_i.c \tag{2}$$
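The similarity of equation (1) and the matching rule of equation (2) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: images are assumed to be flat sequences of 0/1 values, memory cells are assumed to be `(label, features)` pairs, and the names `similarity` and `closest_match` are hypothetical. The value returned for a constant image (zero denominator) is not specified in the text, so 0.5 is used here as an assumed neutral value.

```python
import math

def similarity(f1, f2):
    """Equation (1) for two equal-length sequences of 0/1 values."""
    s00 = s11 = s01 = s10 = 0
    for a, b in zip(f1, f2):
        if a == 1 and b == 1:
            s11 += 1          # one match
        elif a == 0 and b == 0:
            s00 += 1          # zero match
        elif a == 0 and b == 1:
            s01 += 1          # mismatch (0 in f1, 1 in f2)
        else:
            s10 += 1          # mismatch (1 in f1, 0 in f2)
    denom = math.sqrt((s11 + s10) * (s01 + s00) * (s11 + s01) * (s10 + s00))
    if denom == 0:
        # Assumption: an all-zero or all-one image leaves S undefined;
        # return the midpoint of [0, 1] instead of dividing by zero.
        return 0.5
    return 0.5 - (s10 * s01 - s00 * s11) / (2.0 * denom)

def closest_match(ag_label, ag_features, memory):
    """Equation (2): the most stimulated memory cell of the antigen's class."""
    same_class = [cell for cell in memory if cell[0] == ag_label]
    return max(same_class,
               key=lambda cell: similarity(ag_features, cell[1]),
               default=None)
```

Two identical images give a similarity of 1 and an image compared with its complement gives 0, which agrees with the stated range of S.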

The function stim() measures the response of a b-cell to an antigen or to another b-cell and is directly proportional to the similarity between the feature matrices as defined in equation (1). After a memory cell mi (renamed mmatch) is selected for a training antigen agi, mmatch goes through a proliferation process (Proliferation-I), known as somatic hyper-mutation, that generates a number of clones of mmatch. The exact number of clones is determined by three parameters: (i) the hyper-mutation rate, (ii) the clonal rate, and (iii) stim(agi, mmatch). The first two parameters are user-defined. Each clone is produced through mutation (controlled by MUTATION_RATE, a user-defined parameter) at selected sites of mmatch’s feature matrix. No clone is an exact copy of mmatch. The algorithms for Proliferation-I and the generation of mutated clones are outlined in Algorithms I and II, respectively. These algorithms are similar to the ones described in [6].

On completion of hyper-mutation, a stimulation value stim(bj, agi) is computed for each element bj ∈ B. Here bj denotes an individual b-cell clone and B represents the entire cloned population. To minimize the computational cost of generating clones, a modified version of the resource limitation policy [15] is incorporated. The modified version considers only the recent clones generated for the current antigen undergoing the (maturation) training process. It does not consider clones generated for previous antigens, i.e., the present implementation devotes the entire resource to the current antigen’s class only.

The stopping criterion defined in equation (3) is used to terminate the training on an antigen agi. If this criterion is not met, further proliferation of the existing b-cells (i.e., those that survived resource limitation) is invoked. In this stage (Proliferation-II), each surviving b-cell bj is proliferated to produce a number of clones determined by the resources allocated to it. Proliferation-II is similar to Proliferation-I as outlined in Algorithm I, except for the calculation of the number of clones to be generated from each surviving b-cell bj.
This number is determined only by the CLONAL_RATE and stim(agi, bj).

$$\frac{\sum_{j=1}^{|B|} b_j.\mathrm{stim}}{|B|} > \mathrm{STIMULATION\_THRESHOLD} \tag{3}$$
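As a small worked illustration of equation (3), the check below averages the stimulation values of the surviving clones and compares the mean with the threshold. The function name `training_converged` and the handling of an empty population (treated as not converged) are assumptions made for this sketch.

```python
def training_converged(stimulations, stimulation_threshold):
    """Equation (3): mean stimulation of the clones must exceed the threshold."""
    if not stimulations:
        # Assumption: with no surviving clones the criterion cannot be met.
        return False
    return sum(stimulations) / len(stimulations) > stimulation_threshold
```

For example, with the stimulation threshold of 0.89 used in the experiments, clones with stimulations 0.90, 0.92, and 0.88 have a mean of 0.90 and training on that antigen stops, whereas clones with stimulations 0.80 and 0.90 (mean 0.85) trigger Proliferation-II.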

Algorithm I. Hyper-mutation/Proliferation-I

Let B be the set of b-cell clones to be created by somatic hyper-mutation starting from mmatch. Initially B = {mmatch}. Let Nc denote the number of clones, calculated as

Nc ← HYPER_MUTATION_RATE * CLONAL_RATE * stim(agi, mi)
While (|B| ≤ Nc) Do
    mut ← false    // mut is a Boolean variable
    Call mutate(mi, mut)
    Let bj denote a mutated clone of mi
    If (mut) Then B ← B ∪ bj
Done
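The proliferation loop of Algorithm I, together with the per-element toggling that the paper outlines as Algorithm II, might look like the sketch below. It is only an illustration under stated assumptions: feature matrices are flattened to lists of 0/1 values, `mutate` returns a mutated copy instead of modifying its argument in place, and a random generator `rng` is passed in explicitly so that runs are repeatable. The guard against a non-positive mutation rate is an addition, since the loop could otherwise never produce a mutated clone.

```python
import random

def mutate(features, mutation_rate, rng):
    """Toggle each binary element with probability mutation_rate."""
    clone = list(features)
    mutated = False
    for idx in range(len(clone)):
        if rng.random() < mutation_rate:
            clone[idx] = 1 - clone[idx]   # toggle 0 <-> 1
            mutated = True
    return clone, mutated

def proliferate(m_match, stim_value, hyper_mutation_rate, clonal_rate,
                mutation_rate, rng):
    """Algorithm I: grow B from {m_match} until |B| exceeds Nc."""
    if mutation_rate <= 0:
        raise ValueError("mutation_rate must be positive")
    n_clones = hyper_mutation_rate * clonal_rate * stim_value   # Nc
    clones = [list(m_match)]                                    # B = {m_match}
    while len(clones) <= n_clones:
        clone, mutated = mutate(m_match, mutation_rate, rng)
        if mutated:                       # unmutated copies are discarded
            clones.append(clone)
    return clones

# Usage with the hyper-mutation rate (2) and clonal rate (10) reported in
# Section 3; the 64-element cell and the 0.05 mutation rate are toy values.
population = proliferate([0] * 64, 0.5, 2, 10, 0.05, random.Random(0))
```

With a stimulation of 0.5 the clone budget Nc is 10, so the loop ends once the population holds 11 cells: the original mmatch plus 10 mutated clones.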


Algorithm II. Production of Mutated Clones

mutate(x, flag) {
    For each binary feature element (i, j) in x.f    // note that x.f is basically a matrix
    Do
        Generate a random number r in [0, 1]
        If (r < MUTATION_RATE) Then
            x.fi,j ← toggle(x.fi,j)
            flag ← true
        Endif
    Done
}

Clone selection and update of immune memory: Once the training criterion in equation (3) is met for an antigen, the most stimulated (w.r.t. the current antigen undergoing training) b-cell among the surviving ones is selected as a candidate (let bcandidate denote this cell) to be inserted into the immune memory. This process is outlined in Algorithm III, which is similar to the one in [6]. The algorithm makes use of two parameters, AS (average stimulation) and α (a scalar value). The parameter α is user-defined, whereas AS is measured from the input training antigen set as the average stimulation between all pairs of the mean values of the antigen classes.

Algorithm III. Update of immune memory

CandStim ← stim(agi, bcandidate)
MatchStim ← stim(agi, mmatch)
CellAff ← stim(mmatch, bcandidate)
If (CandStim > MatchStim)
    IM ← IM ∪ bcandidate    // insertion into the immune memory
    If (CellAff > α × AS)
        IM ← IM – mmatch    // memory replacement
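The memory update above can be sketched in a few lines. The sketch assumes that the replacement test is nested inside the insertion test (as in the AIRS scheme of [6]), that the immune memory is a plain list, and that `stim` is supplied by the caller; the function name `update_memory` and the toy stimulation function used in the example are hypothetical.

```python
def update_memory(memory, antigen, m_match, b_candidate, stim, alpha, avg_stim):
    """Insert b_candidate if it beats m_match; drop m_match if the two are close."""
    cand_stim = stim(antigen, b_candidate)
    match_stim = stim(antigen, m_match)
    cell_aff = stim(m_match, b_candidate)
    if cand_stim > match_stim:
        memory.append(b_candidate)          # insertion into the immune memory
        if cell_aff > alpha * avg_stim:
            memory.remove(m_match)          # memory replacement
    return memory

# Toy stimulation (fraction of matching elements), standing in for equation (1).
def toy_stim(x, y):
    return sum(a == b for a, b in zip(x, y)) / len(x)

memory = [(1, 1, 0, 0), (0, 0, 0, 1)]
update_memory(memory, antigen=(1, 1, 1, 1), m_match=(1, 1, 0, 0),
              b_candidate=(1, 1, 1, 0), stim=toy_stim, alpha=0.4, avg_stim=0.5)
```

In this toy run the candidate is more stimulated by the antigen than mmatch (0.75 against 0.5) and is also close to mmatch (0.75 > 0.4 × 0.5), so it replaces mmatch in the memory.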

Phase-II of the training algorithm: Note that the training in Phase-I is a one-pass method, i.e., the system is trained only once on each training antigen. At the end of this phase, the immune memory IM0 = {m1, m2, …, mm} is produced. In the present implementation, training involves a second phase, Phase-II, that employs a refinement process. In this method, recognition and training go hand in hand to obtain a better immune memory from its initial version IM0. In this phase, all the training antigens are first recognized using the immune memory (IMi, i = 0, 1, …) obtained in the previous stage (say, the i-th stage). The classification strategy outlined next is used for recognition of antigens, and the recognition accuracy is noted. Next, the antigens that were incorrectly classified act as bootstrap samples that undergo further training involving clone generation, selection, and update of the immune memory as outlined above in Phase-I of the training. This results in an updated immune memory (IMi+1), which is then used for classification of all the training antigens. This newer version is retained if it yields better recognition accuracy than that obtained using IMi. Otherwise, IMi is reloaded and Phase-II terminates. It is observed that for a few iterations of Phase-II, newer versions of the immune memory continue to produce better recognition accuracy, and then the accuracy degrades, signaling negative learning (or over-learning) in the system. In fact, instead of the training antigen set, a separate validation set could be used in this refinement phase. This modification will be considered in a future extension of the present study.

Classification strategy: Classification is implemented by a k-nearest neighbor (kNN) approach. For a target antigen ag, the k (an odd number) memory cells closest to ag are selected from the immune memory IM. Closeness is measured by the stim function, i.e., stim(ag, mi) for all mi ∈ IM. Next, the k selected cells are grouped by class label. The class of the largest group (a majority-voting strategy) identifies ag.
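The Phase-II loop and the kNN vote described above can be sketched together. This is an illustrative outline under assumptions, not the authors' code: antigens and memory cells are `(label, features)` pairs, `stim` is any similarity function, and `train_on` is a hypothetical callable that stands for Phase-I training on a single antigen and returns the updated memory. Vote ties are broken by neighbor order here, which the text does not specify.

```python
from collections import Counter

def knn_classify(features, memory, stim, k=5):
    """Majority vote among the k memory cells most stimulated by the antigen."""
    ranked = sorted(memory, key=lambda cell: stim(features, cell[1]), reverse=True)
    votes = Counter(label for label, _ in ranked[:k])
    return votes.most_common(1)[0][0]

def refine(memory, antigens, train_on, stim, k=5):
    """Phase II: retrain on misclassified antigens while accuracy improves."""
    def accuracy(mem):
        hits = sum(knn_classify(f, mem, stim, k) == label for label, f in antigens)
        return hits / len(antigens)

    best_memory, best_acc = list(memory), accuracy(memory)      # IM_0
    while True:
        wrong = [(label, f) for label, f in antigens
                 if knn_classify(f, best_memory, stim, k) != label]
        if not wrong:
            break
        candidate = list(best_memory)
        for antigen in wrong:               # bootstrap samples
            candidate = train_on(candidate, antigen)
        cand_acc = accuracy(candidate)
        if cand_acc <= best_acc:
            break                           # keep IM_i; Phase II terminates
        best_memory, best_acc = candidate, cand_acc             # accept IM_{i+1}
    return best_memory, best_acc

# Toy usage: stimulation is the fraction of equal elements, and "training"
# simply memorizes the antigen (a stand-in for clone generation and selection).
toy_stim = lambda x, y: sum(a == b for a, b in zip(x, y)) / len(x)
data = [("a", (0, 0, 0, 0)), ("a", (0, 0, 0, 1)),
        ("b", (1, 1, 1, 1)), ("b", (1, 1, 1, 0))]
final_memory, final_acc = refine([("a", (0, 0, 0, 0))], data,
                                 lambda mem, ag: mem + [ag], toy_stim, k=1)
```

Because a new memory is accepted only when accuracy strictly improves, the loop stops at the first iteration that fails to improve, which corresponds to the first local maximum of training accuracy.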

3 Experimental Details

Two different datasets (DS1 and DS2) [16] have been used to test the proposed classification approach based on the clonal selection algorithm (CSA). DS1 and DS2 contain samples of handwritten numerals in two major Indic scripts, namely Devanagari (Hindi) and Bengali, respectively. Unlike for English, Chinese, Japanese, etc., studies on handwriting recognition for Indic scripts are rare, which provides additional motivation for the present work to deal with datasets of handwriting in Indian languages. Moreover, datasets consisting of a large number of samples of handwritten digits in Indic scripts have recently become available in the public domain [16], which facilitates training and testing of an approach and comparing it with competing methods. Both datasets contain real samples collected from different kinds of handwritten documents such as postal mail, job application forms, railway ticket reservation forms, passport application forms, etc.

For our experiment, each dataset consists of 12,000 samples (an equal number of samples for each class). DS1 samples are randomly selected from a collection of 22,556 Devanagari numerals written by 1049 persons, and DS2 samples are taken from a set of 12,938 Bengali numerals written by 556 persons. Some samples for each digit class are shown in Fig. 1. The datasets are divided into six equal-sized partitions. Training is conducted on samples from five partitions and classification is tested on the sixth partition. This realizes a six-fold experiment that results in six test runs. The results reported next are averaged over these six runs.

Experiments are carried out under two different training policies, L1: single-pass training, and L2: the proposed method that employs the refinement process. Recognition accuracies under these two policies are reported in Table 1, and it is observed that L2 outperforms L1 by a significant margin (about 3 percentage points on both datasets). However, L2 generates a somewhat larger immune memory than the one produced by L1. A significant difference is observed in the time required for training. On a Pentium-IV (733 MHz, 128 MB RAM) PC, L1 takes considerably less CPU time than L2, which involves the additional refinement phase. However, there is hardly any difference in the time needed for classification by the two approaches. The system can classify about 50 characters per second. Absolute times taken during training and testing are outlined in Table 2 below.
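The six-fold protocol can be sketched as an index split. The sketch assumes a plain random partition of sample indices; the text does not say whether the partitions were stratified by class, and the function name `six_fold_splits` and the fixed seed are illustrative choices.

```python
import random

def six_fold_splits(n_samples, n_folds=6, seed=0):
    """Yield (train_indices, test_indices) for each of the n_folds runs."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into equal-sized partitions.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for j, fold in enumerate(folds) if j != held_out for i in fold]
        yield train, test

splits = list(six_fold_splits(12000))
```

With 12,000 samples per dataset, each run trains on 10,000 antigens and tests on the remaining 2,000, and every sample is used for testing exactly once across the six runs.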


Fig. 1. Hundred random samples from the dataset of Bengali handwritten numerals

The performance of the proposed refinement stage is studied to check how rapidly the system attains the maximum classification rate on the training set. In fact, training terminates at the first local maximum; at present, the system does not attempt to find the global one. The response of the additional training module is shown in Fig. 2 for the dataset DS1. Similar behaviour is obtained for the other dataset. In Fig. 2, note that the recognition accuracy gradually increases until the 8th iteration, after which the accuracy degrades and training terminates. The number of antigens undergoing training in each pass is also plotted as a line curve in Fig. 2. Note that iteration 0 represents the initial Phase-I training, in which all 10,000 antigens were trained.

Table 1. Recognition accuracies and size of immune memory with two different training algorithms

Dataset   Recognition accuracy     Size of immune memory
          L1         L2            L1       L2
DS1       93.31%     96.23%        912      1283
DS2       92.57%     95.68%        1103     1472

Table 2. CPU time for training and classification using two different training algorithms

Dataset   Time to train               Classification speed (characters per second)
          L1            L2            L1       L2
DS1       5 h 14 min    7 h 05 min    52       49
DS2       5 h 19 min    7 h 22 min    51       47


Fig. 2. Performance analysis of the bootstrap module

Next, the effects of the parameters are studied for two different measures: (i) recognition accuracy and (ii) size of the immune memory. Results are reported here for the new training algorithm. Very similar effects were observed on both datasets, and the results on DS1 are shown in Fig. 3. Finally, the effect of k in k-nearest neighbour classification is examined, and it is observed that k = 5 gives the best performance. Recognition accuracies for different values of k are shown in Fig. 4. The overall results reported in Table 1 are obtained with k = 5, stimulation threshold = 0.89, number of resources = 400, mutation rate = 0.008, affinity threshold scalar α = 0.4, hyper-mutation rate = 2, and clonal rate = 10 (the last two parameters are used in Algorithm I of Section 2).

Classification results are further grouped into three classes: correct, where a sample is properly classified; incorrect, where a sample is wrongly classified; and reject, where the system cannot classify a sample. A rejection is reported when no single class gets a majority among the k choices returned by the classifier. Table 3 presents the average classification results taking these three aspects into consideration.


Table 3. Classification results

Dataset   % correct   % incorrect   % reject
DS1       96.23       2.14          1.63
DS2       95.68       2.44          1.88
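The three-way breakdown of Table 3 can be illustrated with a vote that allows rejection. The sketch below reads "no single class gets majority" as a tie for the highest vote count among the k neighbors; this reading is an assumption, as are the function names `vote_with_reject` and `tally` and the use of `None` to mark a rejected sample.

```python
from collections import Counter

def vote_with_reject(neighbor_labels):
    """Return the winning class, or None (reject) when the top count is tied."""
    ranked = Counter(neighbor_labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]

def tally(true_labels, predictions):
    """Percentages of correct, incorrect, and rejected samples."""
    n = len(true_labels)
    reject = sum(p is None for p in predictions)
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    incorrect = n - correct - reject
    return tuple(100.0 * c / n for c in (correct, incorrect, reject))
```

For instance, the k = 5 neighbor labels 3, 3, 5, 5, 7 produce a rejection, while 3, 3, 3, 5, 7 are classified as 3. The three percentages always sum to 100, as they do in each row of Table 3.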

Fig. 3. Effect of different parameters on recognition accuracy and size of immune memory: (a) stimulation threshold (refer equation (3)), (b) number of resources used for resource limitation, (c) Mutation rate (refer Algorithm-II), and (d) Affinity threshold scalar, α as used in Algorithm-III

Fig. 5 presents the class-wise classification rates. The digit ‘0’ attains the highest recognition score in both scripts. On the other hand, samples of the digit ‘2’ in Hindi and the digit ‘9’ in Bengali result in the lowest classification rates, 89.32% and 90.52%, respectively. A study of the confusion matrix identifies several similar-shaped character pairs. For example, many samples of the digits ‘1’ and ‘2’ in the Hindi dataset and of the digits ‘1’ and ‘9’ in the Bengali dataset were confused with each other during classification. Some post-processing can be employed to discriminate such confusion pairs. In this context, a previous study [5] reported the promising ability of an AIS-based approach to discriminate similar-shaped character pairs. The same approach could also be employed here to further improve the classification accuracy. Such a multi-level recognition scheme is considered a future extension of the present study.


Comparison with other existing studies: As mentioned earlier, there are many studies on the recognition of handwritten digits in English and Oriental scripts. However, there are only a few reports on Indic scripts. A recent study [17] uses a fuzzy model based recognition scheme and reports a recognition accuracy of about 95% on a dataset containing about 3500 handwritten samples of Devanagari digits. The study in [18] used a neural network as the classifier and achieved an accuracy of 93.26% on the same dataset used here for recognition of handwritten Bengali digits.

Fig. 4. Recognition accuracies using k nearest neighbor approach with different k values

Fig. 5. Class-wise recognition accuracies


Compared to these approaches and achievements, the proposed AIS-based method can be viewed as a potential alternative. However, it should be noted that no two studies employ the same feature set. The authors of [17] use grid-based features and [18] uses wavelet coefficients as features, whereas a size-normalized binary image array has been used as the feature in the present study. The distance measure also differs from one study to another. Therefore, a direct comparison requires replication of these experiments using a uniform feature set and the same distance measure. Our future study will consider this aspect to bring out a judicious comparison between an AIS-based framework and other approaches using different learning paradigms.

4 Conclusions

This paper presents an application of a clonal selection algorithm to the recognition of handwritten Indic numerals. In particular, a 2-phase clonal selection algorithm implementing a retraining scheme is proposed, and experiments using two different datasets are performed. The reported results show that this new method outperforms the previously used single-pass method. In particular, the proposed scheme achieves a recognition accuracy of about 96%, which is comparable to that of previous approaches. This study uses a simple feature vector and a simple distance measure to explore the feasibility of an AIS-based approach as an alternative classification tool. Since encouraging results have been obtained in this experiment, future extensions of this study will examine different feature sets and distance measures to further improve the recognition accuracy.

References

1. D. Dasgupta, Z. Ji, and F. Gonzalez, “Artificial immune system (AIS) research in the last five years,” in Congress on Evolutionary Computation (CEC’03), 2003, Volume 1, pp. 123-130.
2. Zheng Tang, Koichi Tashima, and Qi P. Cao, “Pattern recognition system using a clonal selection-based immune network,” Systems and Computers in Japan, Volume 34, Issue 12, pp. 56-63, 2003.
3. Z. Ji and D. Dasgupta, “Real-valued negative selection algorithm with variable-sized detectors,” in LNCS 3102, Proceedings of GECCO, pp. 287-298, 2004.
4. L. N. d. Castro and F. J. V. Zuben, “Learning and Optimization Using the Clonal Selection Principle,” IEEE Transactions on Evolutionary Computation, Special Issue on Artificial Immune Systems, vol. 6, pp. 239-251, 2002.
5. U. Garain, M. P. Chakraborty, D. Dutta Majumder, “Improvement of OCR Accuracy by Similar Character Pair Discrimination: an Approach based on Artificial Immune System,” to be presented in the 18th Int. Conf. on Pattern Recognition (ICPR), August 2006, Hongkong.
6. A.B. Watkins, “AIRS: a resource limited artificial immune classifier,” Master’s dissertation, Dept. of Computer Science, Mississippi State University, 2001.
7. Keith Price Bibliography on use of Neural Networks for recognition of Numbers and Digits at http://iris.usc.edu/Vision-Notes/bibliography/char1019.html
8. C. de Stefano, A. Della Cioppa, and A. Marcelli, “Handwritten Numeral Recognition by Means of Evolutionary Algorithms,” in Proc. of the 5th Int. Conf. on Document Analysis and Recognition (ICDAR), Bangalore, India, pp. 804-808, 1999.
9. J. H. Carter, “The Immune System as a model for Pattern Recognition and classification,” Journal of the American Medical Informatics Association, Vol. 7, no. 3, pp. 28-41, 2000.
10. L. N. de Castro and J Timmis, “Artificial Immune Systems: A Novel Approach to Pattern Recognition,” in Artificial Neural Networks in Pattern Recognition (Eds. L Alonso J Corchado and C Fyfe), pp. 67-84, University of Paisley, January 2002.
11. S. Forrest, B. Javornik, R. E. Smith and A. S. Perelson, “Using genetic algorithms to explore pattern recognition in the immune system,” in Evolutionary Computation 1:3, pp. 191-211, 1993.
12. Jennifer A. White and Simon M. Garrett, “Improved Pattern Recognition with Artificial Clonal Selection,” in the Proc. of 2nd Int. Conf. on Artificial Immune Systems (ICARIS), September 1-3, 2003, Napier University, Edinburgh, UK.
13. Y. Cao and D. Dasgupta, “An Immunogenetic Approach in Chemical Spectrum Recognition,” Advances in Evolutionary Computing (Eds. Ghosh & Tsutsui), Chapter 36, Springer-Verlag, January 2003.
14. Tarakanov and V. Skormin, “Pattern Recognition by Immunocomputing,” in the proceedings of the special sessions on artificial immune systems in Congress on Evolutionary Computation, 2002 IEEE World Congress on Computational Intelligence, Honolulu, Hawaii, May 2002.
15. J. Timmis, “Artificial Immune Systems: a novel data analysis techniques inspired by the immune network theory,” PhD Thesis, University of Wales, Aberystwyth, 2001.
16. U. Bhattacharya and B. B. Chaudhuri, “Databases for research on recognition of handwritten characters of Indian scripts,” in Proc. of the 8th Int. Conf. on Document Analysis and Recognition (ICDAR), Seoul, Korea, vol. II, pp. 789-793, 2005.
17. M. Hanmandlu and O.V. Ramana Murthy, “Fuzzy Model Based Recognition of Handwritten Hindi Numerals,” Proc. Int. Conf. on Cognition and Recognition, Dec. 2005, pp. 490-496. http://www.studentprogress.com/appln/colleges/cogrec/
18. U. Bhattacharya, T. K. Das, A. Dutta, S. K. Parui, and B. B. Chaudhuri, “A Hybrid scheme for handwritten numeral recognition based on Self Organizing Network and MLP,” in Int. J. on Pattern Recognition and Artificial Intelligence (IJPRAI), Volume 16, pp. 845-864, 2002.