
A Novel Feature Extraction Technique for Offline Handwritten Gurmukhi Character Recognition

Munish Kumar, R. K. Sharma1 and Manish Kumar Jindal2

Department of Computer Science, Punjab University Rural Centre, Kauni, Muktsar, 1School of Mathematics and Computer Applications, Thapar University, Patiala, 2Department of Computer Science and Applications, Punjab University Regional Centre, Muktsar, Punjab, India

ABSTRACT

A novel feature extraction technique is presented in this paper for an offline handwritten Gurmukhi character recognition system. Handwritten character recognition is a complex task because of the varied writing styles of different individuals. Selecting a set of features is an important step in implementing a handwriting recognition system. In this work, we have extracted various topological features, namely, peak extent features, shadow features, and centroid features. A new feature set is also proposed by combining the horizontal peak extent features and the vertical peak extent features. For classification, we have used k-NN and Linear-SVM classifiers. In view of the learning and generalization capabilities of multilayer perceptrons (MLPs), an MLP-based pattern classifier is also used for classification. In the present work, we have taken 7,000 samples of offline handwritten Gurmukhi characters for training and testing. The proposed system achieves a maximum recognition accuracy of 95.62% using the SVM classifier with a linear kernel. Using k-NN and MLPs, maximum recognition accuracies of 95.48% and 94.74%, respectively, have been achieved with five-fold cross validation.

Keywords: Feature extraction, k-NN, Multilayer perceptrons, Peak extent features, SVM.

1. INTRODUCTION

Optical character recognition (OCR) is still an active area of research, particularly for handwritten text. OCR is an essential part of a document analysis system. Offline handwritten character recognition, usually abbreviated as OHCR, is one of the oldest problems in the history of document analysis and pattern recognition using computers. In recent times, handwritten Gurmukhi character recognition has been explored by researchers owing to its practical usage. The achievements of commercially available printed OCR systems are yet to be extended to handwritten text. It is a well-established fact that the wide variation in the writing styles of individuals makes recognition of handwritten characters complicated. A good number of researchers have already worked on the recognition problem of offline printed characters. Work on OCR of handwritten alphabets and numerals has mostly concentrated on the Roman script [1-3], a few European languages such as English, and scripts linked to several Asian languages such as Arabic [4] and Chinese [5]. Wen et al. [6] have proposed a handwritten Bangla numeral recognition system for an automatic letter sorting machine. Swethalakshmi et al. [7] have proposed a handwritten Devanagari and Telugu character recognition system using SVM. The input to their recognition system consists of features of the stroke information in each character, and the SVM-based stroke recognition module has been chosen for its generalization capability. They have obtained a maximum recognition accuracy of 97.27%. Pal et al. [8] have presented a technique for offline Bangla handwritten compound character recognition. They have used a modified quadratic discriminant function for feature extraction and achieved a maximum recognition accuracy of 96% with a tree classifier. Pal et al. [9] have used curvature features for recognizing Oriya handwritten characters. Basu et al. [10] have presented a hierarchical approach for handwritten Bangla character recognition and achieved a recognition accuracy of 72.06% with an MLP classifier. Hanmandlu et al. [11] have reported grid-based features for handwritten Hindi numeral recognition and achieved a recognition accuracy of 90.65% with a coarse classification technique. Among Indian scripts, Devanagari, Tamil, Oriya, and Bangla have started to attract attention for OCR-related research in recent years. In general, research on handwritten optical character recognition for Indian scripts is ongoing, yet no solution has been offered that addresses the problem correctly and efficiently for offline handwritten Gurmukhi script recognition. This work is an effort in the direction of proposing an offline handwritten character recognition system for this script. This paper is divided into five sections. Section 1 introduces the problem. Section 2 describes the work done on Gurmukhi script in the past, and Section 3 presents the present work. In Section 4, we illustrate the experimental results of the proposed technique. Finally, the conclusion and future scope are presented in Section 5.


2. PREVIOUS WORK

Gurmukhi is the script used for the Punjabi language; its name is derived from the Old Punjabi term "Guramukhi", which means "from the mouth of the Guru". The Gurmukhi script has three vowel bearers, thirty-two consonants, six additional consonants, nine vowel modifiers, three auxiliary signs, and three half characters. Gurmukhi is the 12th most widely used script in the world. The Gurmukhi script is written from top to bottom and left to right, and it has no case distinction. Presently, cleanly printed Gurmukhi script documents and degraded printed Gurmukhi script documents can be recognized by OCR software, but there have been very limited efforts on the recognition of complete handwritten Gurmukhi script documents. Lehal and Singh [12] have proposed a Gurmukhi script recognition system. They have developed a complete recognition system for printed Gurmukhi script, where connected components are first segmented using a thinning-based approach. An algorithm for segmentation of isolated handwritten Gurmukhi words was presented in 2006 by Sharma and Lehal [13]. Jindal et al. [14] have provided a solution for touching character segmentation of printed Gurmukhi script. They have also provided a complete recognition system for degraded printed Gurmukhi script documents [14,15]. An online handwritten Gurmukhi script recognition system was presented in 2008 by Sharma et al. [16]. They have used an elastic matching technique in which a character is recognized in two stages: in the first stage, the strokes are recognized, and in the second stage the character is constructed on the basis of the recognized strokes. Sharma and Jhajj [17] have used zoning density-based features for isolated handwritten Gurmukhi character recognition. They have used SVM and k-NN classifiers for classification and achieved a maximum recognition accuracy of 72.83% with the RBF kernel of the SVM classifier. Kumar et al. [18] have achieved a maximum recognition accuracy of 94.29% for offline handwritten Gurmukhi character recognition, using intersection and open end points as features and SVM with a polynomial kernel. As we have noticed from the literature on handwritten character recognition, not much work has been reported on feature extraction for this script. This unresolved problem has motivated the authors to attempt the present work.

3. PRESENT WORK

The handwritten character recognition system consists of four phases, namely, digitization, pre-processing, feature extraction, and classification. The performance of a handwritten character recognition system depends heavily on the features that are extracted. The extracted features should be able to characterize a character uniquely. In the present work, we have presented a powerful feature set by combining horizontal peak extent features and vertical peak extent features. We have also compared the recognition results of the proposed system with those of recently reported systems. Before the feature extraction phase, we have performed digitization and pre-processing activities on the character image. Digitization is the process of converting a paper-based handwritten Gurmukhi character into electronic form: the character image is scanned and an electronic representation of the original character image, in the form of a TIFF image, is produced. Digitization produces the digital image, which is fed to the pre-processing phase. In this phase, the gray level character image is normalized into a window of size 100 × 100 using the Nearest Neighborhood Interpolation (NNI) algorithm. After normalization, we produce a bitmap (binary) image of the normalized image. The bitmap image is then transformed into a thinned image using the parallel thinning algorithm proposed by Zhang and Suen [19]. For recognition of the patterns appearing in each such image, we have used peak extent features, shadow features, and centroid features.
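As an illustration of this pre-processing chain, the following sketch (assuming the scikit-image library; the function name preprocess_character is ours and is not part of the described system) normalizes a scanned character with nearest-neighbour interpolation, binarizes it, and thins it with the Zhang-Suen algorithm [19].

import numpy as np
from skimage.io import imread
from skimage.transform import resize
from skimage.filters import threshold_otsu
from skimage.morphology import skeletonize

def preprocess_character(path):
    """Scanned character image -> (100 x 100 bitmap, 100 x 100 thinned image)."""
    gray = imread(path, as_gray=True)
    # Normalize the gray level image into a 100 x 100 window; order=0 is
    # nearest-neighbour interpolation (NNI).
    norm = resize(gray, (100, 100), order=0, preserve_range=True,
                  anti_aliasing=False)
    # Bitmap image: foreground (ink) pixels are assumed darker than the paper.
    bitmap = norm < threshold_otsu(norm)
    # Thinned image via the Zhang-Suen parallel thinning algorithm [19].
    thinned = skeletonize(bitmap, method="zhang")
    return bitmap.astype(np.uint8), thinned.astype(np.uint8)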

3.1 Proposed Feature Extraction Technique

We have proposed a feature extraction technique, namely, the peak extent based feature, in this work. The peak extent feature is extracted by taking into consideration the sum of the lengths of the peak extents that fit successive black pixels along each zone, as shown in Figure 1a-c. While fitting an extent along a series of successive black pixels within a zone, the extent may be extended outside the boundary of the zone if it continues into the next zone. We have proposed an innovative feature set, combining the horizontal peak extent features and the vertical peak extent features, for offline handwritten Gurmukhi character recognition. For the horizontal peak extent features, we consider the sum of the lengths of the peak extents that fit successive black pixels horizontally in each row of a zone, as shown in Figure 1b, whereas for the vertical peak extent features we consider the sum of the lengths of the peak extents that fit successive black pixels vertically in each column of a zone, as depicted in Figure 1c.

Figure 1: (a) Zone of bitmap image (b) Horizontal peak extent features (c) Vertical peak extent features.

The steps that have been used to extract these features are given below (a code sketch of the procedure follows the steps).

Step I: Divide the bitmap image into n (=100) zones, each of size 10 × 10 pixels.
Step II: Find the peak extent as the sum of successive foreground pixels in each row of a zone.
Step III: Replace the values of the successive foreground pixels by the peak extent value, in each row of the zone.
Step IV: Find the largest value of the peak extent in each row. As such, each zone has 10 horizontal peak extent sub-features [Figure 1b].
Step V: Obtain the sum of these 10 peak extent sub-feature values for each zone and consider this as the feature for the corresponding zone.
Step VI: For the zones that do not have a foreground pixel, take the feature value as zero.
Step VII: Normalize the values in the feature vector by dividing each element of the feature vector by the largest value in the feature vector.

Similarly, for the vertical peak extent features, we have considered the sum of the lengths of the peak extents in each column of each zone, as shown in Figure 1c. These steps give a feature set with 2n elements.
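As a concrete reading of Steps I-VII, the sketch below computes one horizontal and one vertical peak extent feature per 10 × 10 zone of a 100 × 100 binary image. It assumes NumPy; the helper names (peak_extent_features, largest_run) are ours, and for simplicity the sketch does not extend a run across a zone boundary, which the description above allows.

import numpy as np

def largest_run(line):
    """Length of the longest run of successive foreground pixels in a row/column."""
    best = run = 0
    for pixel in line:
        run = run + 1 if pixel else 0
        best = max(best, run)
    return best

def peak_extent_features(img, zone=10):
    """Horizontal and vertical peak extent features, one value per zone each."""
    h_feats, v_feats = [], []
    for r in range(0, img.shape[0], zone):
        for c in range(0, img.shape[1], zone):
            block = img[r:r + zone, c:c + zone]
            # Steps II-V: largest run per row (column), summed over the zone;
            # an empty zone naturally yields zero (Step VI).
            h_feats.append(sum(largest_run(row) for row in block))
            v_feats.append(sum(largest_run(col) for col in block.T))
    h_feats = np.asarray(h_feats, dtype=float)
    v_feats = np.asarray(v_feats, dtype=float)
    # Step VII: normalize each vector by its largest value.
    if h_feats.max() > 0:
        h_feats /= h_feats.max()
    if v_feats.max() > 0:
        v_feats /= v_feats.max()
    return np.concatenate([h_feats, v_feats])  # 2n = 200 values for n = 100 zones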

3.2 Shadow Feature Extraction Technique

Shadow features are computed by considering the lengths of the projections of the character image, as shown in Figure 2, on the four sides of the minimal bounding box enclosing the character image [10]. Each value of the shadow feature so computed is normalized by dividing it by the maximum possible length of the projection on the respective side. The profile counts the number of pixels between the bounding box of the character image and the edge of the character. Shadow features describe the external shape of a character well and allow a number of confusing characters, such as "b" and "K", to be distinguished. The steps that have been used to extract these features are given below; a short code sketch follows the list.

Figure 2: (a) Shadow features of the Gurmukhi character (b); (b) shadow features of the Gurmukhi character (K).

Step I: Input the character image of size 100 × 100.
Step II: Calculate the lengths of the projections of the character image on the four sides, as shown in Figure 2.
Step III: Calculate the projection profile as the number of pixels between the bounding box of the character image and the edge of the character.
Step IV: Normalize the values of the feature vector by dividing each element of the feature vector by the largest value in the feature vector.

These steps yield a feature set with 400 elements.
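One way to realize these steps, under our reading that the 400 elements are the per-column gaps from the top and bottom sides plus the per-row gaps from the left and right sides of the 100 × 100 bounding box, is sketched below (NumPy assumed; the function name shadow_features is ours).

import numpy as np

def shadow_features(img):
    """Projection gaps from the four sides of the bounding box to the character."""
    h, w = img.shape
    top, bottom, left, right = [], [], [], []
    for c in range(w):
        col = np.flatnonzero(img[:, c])
        top.append(col[0] if col.size else h)              # gap from the top side
        bottom.append(h - 1 - col[-1] if col.size else h)  # gap from the bottom side
    for r in range(h):
        row = np.flatnonzero(img[r, :])
        left.append(row[0] if row.size else w)             # gap from the left side
        right.append(w - 1 - row[-1] if row.size else w)   # gap from the right side
    feats = np.asarray(top + bottom + left + right, dtype=float)  # 4 x 100 = 400
    # Step IV: normalize by the largest value in the feature vector.
    return feats / feats.max() if feats.max() > 0 else feats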

3.3 Centroid Feature Extraction Technique

The coordinates of the centroid of the ON pixels in each zone of a character image can also be considered as features [10]. Figure 3 shows the approximate locations of these centroids in each zone of a character image. The following steps have been implemented for extracting these features (a code sketch is given after the list).

Step I: Divide the bitmap image into n (=100) zones, each of size 10 × 10 pixels.
Step II: Find the coordinates of the foreground pixels in each zone.
Step III: Calculate the centroid of these foreground pixels and store its coordinates as feature values.
Step IV: For the zones that do not have a foreground pixel, take the feature value as zero.

These steps give a feature set with 2n elements.
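A minimal sketch of the zone centroid computation follows (NumPy assumed; centroid_features is our name). Each zone contributes the row and column coordinates of the centroid of its foreground pixels, or zeros when the zone is empty.

import numpy as np

def centroid_features(img, zone=10):
    """Centroid coordinates of the ON pixels in each 10 x 10 zone."""
    feats = []
    for r in range(0, img.shape[0], zone):
        for c in range(0, img.shape[1], zone):
            ys, xs = np.nonzero(img[r:r + zone, c:c + zone])
            if ys.size:                               # Steps II-III
                feats.extend([ys.mean(), xs.mean()])
            else:                                     # Step IV: empty zone
                feats.extend([0.0, 0.0])
    return np.asarray(feats)                          # 2n = 200 values for n = 100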


3.4 Classification

Classification is the decision-making stage of an OHCR engine. This stage makes use of the features extracted in the previous stage to decide class membership. In this work, we have used Linear-SVM, k-NN, and MLP classifiers, because these classifiers are among the top ranking classifiers for data classification. The SVM is an extremely helpful technique for data classification. The SVM is a learning machine that has been widely applied in pattern recognition. SVMs are based on statistical learning theory and use supervised learning. In supervised learning, a machine is trained, instead of being programmed, to perform a given task on a number of input/output pairs. In this work, the C-SVC type classifier in the Lib-SVM tool has been used for SVM classification. In the k-NN classifier, Euclidean distances from the candidate vector to the stored vectors are computed. The Euclidean distance between a candidate vector and a stored vector is given by:

d = \sqrt{\sum_{k=1}^{N} (x_k - y_k)^2}

Here, N is the total number of features in the feature set, x_k is the stored (library) feature value, and y_k is the candidate feature value. MLPs have also been used in the present work for classification. The back propagation (BP) learning algorithm, with learning rate η = 0.3 and momentum term α = 0.2, is used here for training these MLP-based classifiers. For developing a training set and a testing set for each of the classifiers employed in this work, the relevant dataset is split in a ratio of 4:1.
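As an illustration of the k-NN rule built on this distance, the short sketch below (NumPy assumed; knn_predict is our name, not part of the reported system) classifies a candidate feature vector by majority vote among its k nearest stored vectors.

import numpy as np
from collections import Counter

def knn_predict(candidate, stored_vectors, stored_labels, k=5):
    """Classify one candidate vector with the k-nearest-neighbour rule."""
    # Euclidean distance to every stored vector: d = sqrt(sum_k (x_k - y_k)^2).
    dists = np.sqrt(((stored_vectors - candidate) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]           # indices of the k closest vectors
    votes = Counter(stored_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]         # majority class label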

4. EXPERIMENTAL RESULTS AND DISCUSSION

In this section, the results of the proposed system for offline handwritten Gurmukhi character recognition are illustrated. To evaluate the performance of the proposed technique, we have collected 7,000 samples of offline handwritten Gurmukhi characters from two hundred different writers. There were no restrictions on writing. A few samples from this dataset are presented in Figure 4. The experimental results are based on different feature extraction techniques, namely, horizontal peak extent features, vertical peak extent features, the proposed peak extent features, shadow features, and centroid features. As mentioned above, we have used three different classifiers, namely, the Linear-SVM, k-NN, and MLP classifiers. Here, we have used 5-fold cross validation for obtaining the recognition accuracy. In general, r-fold cross validation divides the complete dataset of each category into r equal subsets; one subset is then taken as testing data and the remaining r-1 subsets are taken as training data. By cross validation, each sample of the dataset is predicted exactly once, and the procedure reports the percentage of correctly recognized samples in the testing subsets. A sketch of this evaluation protocol is given below. We have presented classifier-wise recognition results in the following sub-sections 4.1-4.3.
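The evaluation protocol can be expressed as in the following sketch. This is a schematic only, assuming scikit-learn (the reported experiments used the Lib-SVM tool and back propagation trained MLPs); X and y stand for the extracted feature matrix and the character labels.

from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

def five_fold_accuracy(X, y):
    """Average 5-fold cross-validation accuracy for the three classifiers."""
    classifiers = {
        "Linear-SVM": SVC(kernel="linear"),
        "k-NN": KNeighborsClassifier(n_neighbors=5),
        "MLP": MLPClassifier(solver="sgd", learning_rate_init=0.3,
                             momentum=0.2, max_iter=500),
    }
    return {name: cross_val_score(make_pipeline(StandardScaler(), clf),
                                  X, y, cv=5).mean()
            for name, clf in classifiers.items()}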

Figure 3: An illustration of zone centroid features.

Figure 4: Samples of handwritten Gurmukhi characters.

4.1 Recognition Results Based on Linear-SVM Classifier

In this sub-section, the recognition results of the Linear-SVM classifier are illustrated. Using this classifier, we have achieved an average recognition accuracy of 95.62% with the proposed feature extraction technique. As such, we have seen that our proposed feature extraction technique performs better than the other recently proposed feature extraction techniques. The recognition results of the different features considered in this work are given in Table 1. These results are also shown graphically in Figure 5.

4.2 Recognition Results Based on k-NN Classifier

In this sub-section, experimental results based on the k-NN classifier are presented.


It has been noted that the proposed peak extent features, with the k-NN classifier, achieve an average recognition accuracy of 95.48%. Recognition results based on the k-NN classifier are given in Table 2. These results are again shown graphically in Figure 6.

4.3 Recognition Results Based on MLPs Classifier

In this sub-section, we have presented the recognition results of the different features taken in this study based on the MLPs classifier. Using this classifier, we have achieved an average recognition accuracy of 94.74% with the proposed technique [Table 3]. These results are shown graphically in Figure 7.

5. CONCLUSION AND FUTURE SCOPE

A novel feature extraction technique has been presented in this work for offline handwritten Gurmukhi character recognition. The features of a character that have been considered in this work include horizontal peak extent features, vertical peak extent features, shadow features, centroid features, and the proposed peak extent features. The classifiers that have been employed in the present work are k-NN, Linear-SVM, and MLPs. We have used 7,000 samples of offline handwritten Gurmukhi characters in this study and achieved 5-fold cross validation accuracies of 95.62%, 95.48%, and 94.74% with the Linear-SVM, k-NN, and MLPs classifiers, respectively. This accuracy can possibly be increased further by considering a combination of classifiers and by using a larger dataset while training the classifiers. This work can also be extended to offline handwritten character recognition of other Indian scripts that are akin to the Gurmukhi script.

Table 1: Recognition results based on Linear-SVM classifier

                 Feature extraction techniques (%)
             Horizontal    Vertical     Shadow   Centroid   Proposed
             peak extent   peak extent                      peak extent
Fold 1       89.00         87.14        75.71    94.86      95.57
Fold 2       91.57         92.29        84.71    98.29      98.14
Fold 3       94.71         95.49        79.43    98.14      96.57
Fold 4       85.29         87.43        79.29    85.43      91.29
Fold 5       96.57         92.00        90.14    90.43      96.57
Average      91.42         90.87        81.85    93.43      95.62

SVM – Support vector machine

Figure 5: Recognition accuracy obtained by Linear-SVM for different features and cross validations.

Table 2: Recognition results based on k‑NN classifier

                 Feature extraction techniques (%)
             Horizontal    Vertical     Shadow   Centroid   Proposed
             peak extent   peak extent                      peak extent
Fold 1       98.57         94.14        71.57    90.00      96.00
Fold 2       93.71         96.14        69.43    89.00      96.29
Fold 3       95.14         95.14        76.00    94.71      96.57
Fold 4       86.29         90.43        69.86    87.43      92.00
Fold 5       89.29         95.86        73.14    89.29      96.57
Average      92.60         94.34        72.00    90.08      95.48

k-NN – k-nearest neighbor

Figure 6: Recognition accuracy obtained by k-NN for different features and cross validations.

Table 3: Recognition results based on MLPs classifier

                 Feature extraction techniques (%)
             Horizontal    Vertical     Shadow   Centroid   Proposed
             peak extent   peak extent                      peak extent
Fold 1       88.91         82.83        73.26    90.97      94.34
Fold 2       94.74         94.83        71.06    93.83      96.34
Fold 3       92.46         95.06        72.63    82.63      96.34
Fold 4       92.23         87.43        70.63    82.11      91.20
Fold 5       94.34         91.11        71.97    93.26      95.49
Average      92.53         90.25        71.91    88.56      94.74

MLPs – Multi layer perceptrons


Figure  7:  Recognition accuracy obtained by MLPs for different features and cross validations.



REFERENCES

1. W Senior, and A J Robinson, "An off-line cursive handwriting recognition system," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, pp. 309-21, 1998.
2. D Kim, and S Y Bang, "A handwritten numeral character classification using tolerant rough set," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, no. 9, pp. 923-37, 2000.
3. G Mayraz, and G E Hinton, "Recognizing handwritten digits using hierarchical products of experts," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, pp. 189-97, 2002.
4. A Amin, "Off-line Arabic character recognition: The state of the art," Pattern Recognition, Vol. 31, no. 5, pp. 517-30, 1998.
5. P K Wong, and C Chan, "Off-line handwritten Chinese character recognition as a compound Bayes decision problem," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 20, pp. 1016-23, 1998.
6. Y Wen, Y Lu, and P Shi, "Handwritten Bangla numeral recognition system and its application to postal automation," Pattern Recognition, Vol. 40, pp. 99-107, 2007.
7. H Swethalakshmi, A Jayaraman, V S Chakravarthy, and C C Sekhar, "Online handwritten character recognition of Devanagari and Telugu characters using support vector machine," in Proc. 10th International Workshop on Frontiers in Handwriting Recognition (IWFHR), pp. 367-72, 2006.
8. U Pal, and B B Chaudhuri, "OCR in Bangla: An Indo-Bangladeshi language," in Proc. 12th International Conference on Pattern Recognition (ICPR), pp. 269-73, 1994.
9. U Pal, T Wakabayashi, and F Kimura, "A system for off-line Oriya handwritten character recognition using curvature feature," in Proc. 10th International Conference on Information Technology (ICIT), pp. 227-9, 2007.
10. S Basu, N Das, R Sarkar, M Kundu, M Nasipuri, and D K Basu, "A hierarchical approach to recognition of handwritten Bangla characters," Pattern Recognition, Vol. 42, pp. 1467-84, 2009.
11. M Hanmandlu, J Grover, V K Madasu, and S Vasikarla, "Input fuzzy for the recognition of handwritten Hindi numeral," in Proc. of ITNG, pp. 208-13, 2007.
12. G S Lehal, and C Singh, "A Gurmukhi script recognition system," in Proc. 15th International Conference on Pattern Recognition (ICPR), Vol. 2, pp. 557-60, 2000.
13. D V Sharma, and G S Lehal, "An iterative algorithm for segmentation of isolated handwritten words in Gurmukhi script," in Proc. 18th International Conference on Pattern Recognition (ICPR), Vol. 2, pp. 1022-5, 2006.
14. M K Jindal, G S Lehal, and R K Sharma, "On segmentation of touching characters and overlapping lines in degraded printed Gurmukhi script," International Journal of Image and Graphics (IJIG), Vol. 9, no. 3, pp. 321-53, 2009.
15. M K Jindal, G S Lehal, and R K Sharma, "Segmentation of horizontally overlapping lines in printed Indian scripts," International Journal of Computational Intelligence Research, Vol. 3, no. 4, pp. 277-86, 2007.
16. A Sharma, R Kumar, and R K Sharma, "Online handwritten Gurmukhi character recognition using elastic matching," in Proc. Congress on Image and Signal Processing, pp. 391-6, 2008.
17. D V Sharma, and P Jhajj, "Recognition of isolated handwritten characters in Gurmukhi script," International Journal of Computer Applications, Vol. 4, no. 8, pp. 9-17, 2010.
18. M Kumar, R K Sharma, and M K Jindal, "SVM based offline handwritten Gurmukhi character recognition," in Proc. International Workshop on Soft Computing and Knowledge Discovery (SCAKD), Vol. 758, pp. 51-62, 2011.
19. T Y Zhang, and C Y Suen, "A fast parallel algorithm for thinning digital patterns," Communications of the ACM, Vol. 27, no. 3, pp. 236-9, 1984.

AUTHORS

Munish Kumar received his Master's degree in Computer Science and Engineering from Thapar University, Patiala, India, in 2008. He started his career as an Assistant Professor in Computer Applications at the Jaito centre of Punjabi University, Patiala. He is working as an Assistant Professor in the Department of Computer Science, Panjab University Rural Centre, Kauni, Muktsar, Punjab, India. He is currently pursuing his Ph.D. degree at Thapar University, Patiala, Punjab, India. His research interests include character recognition.
E-mail: [email protected]

R. K. Sharma received his Ph.D. degree in Mathematics from the University of Roorkee (now IIT Roorkee), India, in 1993. He is currently working as a Professor at Thapar University, Patiala, India, where he teaches, among other things, statistical models and their usage in computer science. He has been involved in the organization of a number of conferences and other courses at Thapar University, Patiala. His main research interests are statistical models in computer science, neural networks, and pattern recognition.
E-mail: [email protected]

Manish Kumar Jindal received his Bachelor's degree in Science in 1996 and his Postgraduate degree in Computer Applications from Punjabi University, Patiala, India, in 1999. He was a Gold Medalist in his postgraduate studies. He received his Ph.D. degree in Computer Science and Engineering from Thapar University, Patiala, India, in 2008. He is working as an Associate Professor at Panjab University Regional Centre, Muktsar, Punjab, India. His research interests include character recognition and pattern recognition.
E-mail: [email protected]

DOI: 10.4103/0377-2063.126961; Paper No JR 595_12; Copyright © 2013 by the IETE
