Template Matching Method for Kannada Handwritten Recognition Based on Correlation Analysis

Aravinda C.V
Asst. Professor, Department of Information Science & Engineering, S.J.B.I.T., B.G.S Health & Education City, Kengeri, Bengaluru, Karnataka. [email protected]

Dr. H.N. Prakash
Head, Department of Computer Science & Engineering, Rajeev Institute of Technology, Hassan, Karnataka. [email protected]

Abstract: Handwriting recognition systems have been developed out of a need to automate the process of converting data into electronic format, which would otherwise be lengthy and error prone. Building a character recognition system has been one of the major areas of research for over a decade, owing to its wide range of prospects. Many researchers have discussed techniques for recognizing handwritten characters in different languages. In this paper we adopt a correlation technique for the recognition of Kannada handwritten characters. The formation of Kannada characters into their compound form, also called Kagunita, makes recognition more complex. The digitized input image is subjected to various preprocessing techniques, and the processed image is then segmented into individual characters using a simple segmentation algorithm. Each segmented character is correlated with the stored templates. The template with the maximum correlation value is displayed in editable format.

Keywords: Optical Character Recognition, Handwritten Character Recognition, Self-Organizing Map

I. INTRODUCTION

Handwritten character recognition is an important area in the image processing and pattern recognition fields. The process of automatic computer recognition of characters in optically scanned and digitized pages of text is called Optical Character Recognition (OCR) [2]. It is one of the most fascinating and challenging areas of pattern recognition, with various practical applications. Important research in OCR includes degraded omnifont text recognition and the analysis and recognition of complex documents (including texts, images, charts, tables and video documents). Commercial OCR packages are available for English. Considerable work has been done for scripts such as Devanagari, Bengali, Kannada and Tamil. Handwritten Character Recognition (HCR) is a subfield of OCR. Handwriting has persisted as a means of communication and of recording information in day-to-day life even after the introduction of new technologies. Handwriting recognition can be defined as the task of transforming text represented in the spatial form of graphical marks into its symbolic representation.

978-1-4799-6629-5/14/$31.00 ©2014 IEEE

II. CHARACTER RECOGNITION

Two essential components of a character recognition algorithm are the feature extractor and the classifier. Feature analysis determines the descriptors, or feature set, used to describe all characters. Given a character image, the feature extractor derives the features that the character possesses. The derived features are then used as input to the character classifier [7].

Template matching, or matrix matching, is one of the most common classification methods. In template matching, individual image pixels are used as features. Classification is performed by comparing an input character image with a set of templates from each character class. Each comparison results in a similarity measure between the input character and the template. One such measure increases the similarity when a pixel in the observed character is identical to the same pixel in the template image and decreases it when the pixels differ. After all templates have been compared with the observed character image, the character's identity is assigned as the identity of the most similar template.

Structural classification methods utilize structural features and decision rules to classify characters. Structural features may be defined in terms of character strokes, character holes, or other character attributes such as concavities. For a character image input, the structural features are extracted and a rule-based system is applied to classify the character.

Template matching for character recognition is straightforward and reliable, and it is more tolerant to noise than structural analysis.

III. STUDIES ON KANNADA HANDWRITTEN CHARACTER RECOGNITION

Kannada is a Dravidian language, and its script is the second most popular in India. It is the official language of the southern Indian state of Karnataka and is also spoken in neighboring states. The Kannada script is closely related to the Telugu script. A survey of Kannada HCR methods was carried out and is summarized in Table 1.

TABLE 1: LITERATURE SURVEY

| Author and Year | Paper Title | Contribution |
|---|---|---|
| Tariq J, 2010 | α-Soft: An English language OCR | The paper mainly concentrates on business cards with fixed font and color characters. The approach taken is a very simple one: the characters are compared with those present in the database, as English has only 26 letters. |
| R. Shukla, 2012 | Object oriented framework modeling of a Kohonen network based character recognition system | Explains optical character recognition using a neural network for multi-lingual characters. |
| Liu, Y., R.H. Weisberg, and C.N.K. Mooers, 2006 | Performance evaluation of the Self Organizing Map for feature extraction | This paper evaluates the feature extraction performance of the SOM by using artificial data representative of known patterns. By adding random noise to the linear progressive wave data, it is demonstrated that the SOM extracts essential patterns from noisy data. |

IV. METHODOLOGY

To achieve a better recognition rate, we incorporated the methods discussed below. The stages involved are:
1. Preprocessing
2. Segmentation
3. Template Matching
4. Conversion to Editable Format

A. Algorithm for Preprocessing

Optical Character Recognition is the process of translating images of handwritten, typewritten, or printed text into a format understood by machines for the purpose of editing, indexing/searching, and a reduction in storage size. To achieve this, the image is subjected to the preprocessing techniques shown in the block diagram of Fig. 1.

a. Conversion to Gray Scale: This is done by eliminating the hue and saturation information while retaining the luminance. The grayscale image is then filtered to keep the essential information and to remove unwanted information such as noise and disturbance.

b. Binarization: The image after noise removal is converted to a binary image based on a threshold value. The output binary image replaces all pixels in the input image whose luminance is greater than level with the value 1 (white) and replaces all other pixels with the value 0 (black). The level is specified in the range [0, 1]. This range is relative to the signal levels possible for the image's class, so a level value of 0.5 is midway between black and white, regardless of class. To compute the level, the threshold value is used. The resulting image is sent to the segmentation module, as shown in the figures below.

Fig. 1. Preprocessing
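The two preprocessing steps described above can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation: the function names `to_grayscale` and `binarize` are hypothetical, the luminance weights are the common ITU-R BT.601 values (the paper does not state which weights it uses), and the noise filtering step is left out for brevity.

```python
# Assumed luminance weights (ITU-R BT.601); the paper does not specify them.
LUMA_WEIGHTS = (0.299, 0.587, 0.114)


def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples in 0..255 to luminance values in [0, 1].

    Hue and saturation are discarded; only the weighted luminance is kept.
    """
    wr, wg, wb = LUMA_WEIGHTS
    return [
        [(wr * r + wg * g + wb * b) / 255.0 for (r, g, b) in row]
        for row in rgb_image
    ]


def binarize(gray_image, level=0.5):
    """Threshold a grayscale image with a level in [0, 1].

    Pixels with luminance greater than `level` become 1 (white);
    all other pixels become 0 (black), as described in the text.
    """
    return [[1 if pixel > level else 0 for pixel in row] for row in gray_image]
```

With `level=0.5`, a pure white pixel maps to 1 and a pure black pixel maps to 0; a different level would normally be chosen from the image itself.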


Fig.2. Segmentation

2014 International Conference on Contemporary Computing and Informatics (IC3I)

B. Algorithm for Segmentation

Segmentation is the process of extracting objects of interest from an image. It subdivides an image into its constituent regions or objects, which here are characters. This is needed because the classifier recognizes only isolated characters. The segmentation phase is also a significant source of error because of touching characters, which the classifier cannot handle properly. Even in good quality documents, some adjacent characters touch each other due to inappropriate scanning resolution. Fig. 2 shows the flow of the segmentation module. The process of segmentation contains 3 stages: line crop, letter crop and boundary determination.

a. Line Segmentation: The lines of a text block are segmented by finding the valleys of the projection profile computed by a row-wise sum of black pixel values.

b. Letter Segmentation: The characters of a line are segmented by finding the valleys of the projection profile computed by a column-wise sum of black pixel values.

c. Boundary Detection: In this stage the connected objects are given a label and a rectangular box is plotted around each connected object. The value of each label is extracted and each rectangular box is cropped to get the isolated character.
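The line and letter segmentation stages above can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' code: the function names are hypothetical, black pixels are taken to be 0 (matching the binarization convention in the text), and a "valley" is taken to be a row or column with no black pixels at all. The connected-component labeling of the boundary detection stage is not covered.

```python
def projection_profile(binary_image, axis):
    """Count black (0) pixels per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(1 for p in row if p == 0) for row in binary_image]
    width = len(binary_image[0]) if binary_image else 0
    return [sum(1 for row in binary_image if row[c] == 0) for c in range(width)]


def split_at_valleys(profile):
    """Return (start, end) pairs, end exclusive, for runs where the profile is non-zero.

    Assumption: a valley is an index whose black-pixel count is exactly 0.
    """
    runs, start = [], None
    for i, count in enumerate(profile):
        if count > 0 and start is None:
            start = i
        elif count == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_characters(binary_image):
    """Crop text lines by row profile, then characters by column profile."""
    characters = []
    for top, bottom in split_at_valleys(projection_profile(binary_image, 0)):
        line = binary_image[top:bottom]
        for left, right in split_at_valleys(projection_profile(line, 1)):
            characters.append([row[left:right] for row in line])
    return characters
```

Touching characters produce no empty column between them, so this sketch would return them as a single crop, which is the error source noted in the text.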

C. 2-D Correlation Coefficient

$$
r = \frac{\sum_{m}\sum_{n}\left(A_{mn}-\bar{A}\right)\left(B_{mn}-\bar{B}\right)}{\sqrt{\left[\sum_{m}\sum_{n}\left(A_{mn}-\bar{A}\right)^{2}\right]\left[\sum_{m}\sum_{n}\left(B_{mn}-\bar{B}\right)^{2}\right]}}
$$

where $\bar{A}$ and $\bar{B}$ are the means of A and B.

Syntax
r = corr2(A,B)
r = corr2(gpuarrayA,gpuarrayB)

r = corr2(A,B) returns the correlation coefficient r between A and B, where A and B are matrices or vectors of the same size. r is a scalar double.

r = corr2(gpuarrayA,gpuarrayB) performs the operation on a GPU. The input images are 2-D gpuArrays of the same size. r is a scalar double gpuArray.

When localizing an object given as a template g in an image f, the problem reduces to simple searching. The domain of g on which the pattern is defined is referred to as the mask or window W. The idea of template matching here is to place the mask at all possible pixel locations of the image and compare the content of the mask with the superimposed image region. That is, the problem is to compute the degree of match between $f(r+s,\, c+t)\,|\,(r,c) \in W$ and $g(r,c)$ for all $(s,t)$. Placing this number at the center of the sub-image where the mask is placed generates a matching score matrix. The peak(s) of the matching score matrix indicate a good match. If the matching score is greater than a threshold, the pattern is said to be present at that location.

In a pattern recognition problem, on the other hand, a set of templates $g_i(r,c)$ is given. The degree of matching may be measured through the distance between f and $g_i$: the better the match, the smaller the distance between f and $g_i$. Considering the Euclidean distance, the best match is given by

$$
\min_i \left\{ \sum_{(r,c)\in W} \left( f(r,c) - g_i(r,c) \right)^2 \right\}
$$

or, expanding the square,

$$
\min_i \left\{ \sum_{(r,c)\in W} f^2(r,c) + \sum_{(r,c)\in W} g_i^2(r,c) - 2\sum_{(r,c)\in W} f(r,c)\, g_i(r,c) \right\}
$$

Hence $\sum_{(r,c)\in W} f(r,c)\, g_i(r,c)$ may be taken as a measure of similarity. When normalized with $\sum_{(r,c)\in W} f^2(r,c)$ and $\sum_{(r,c)\in W} g_i^2(r,c)$, it becomes the correlation between f and $g_i$, that is,

$$
\rho(f, g_i) = \frac{\sum_{(r,c)\in W} f(r,c)\, g_i(r,c)}{\sqrt{\sum_{(r,c)\in W} f^2(r,c)}\,\sqrt{\sum_{(r,c)\in W} g_i^2(r,c)}}
$$

We have assumed $\sum_{(r,c)\in W} f^2(r,c) = 0$.

D. Template-Matching

Optical Character Recognition using template matching is a system prototype that recognizes a character by comparing two images of it. Template matching involves determining similarities between a given template and windows of the same size in an image, and identifying the window that produces the highest similarity measure.

a. Template Generation: The offline dataset contains isolated characters and symbols. The dataset consists of 4 sets of images with samples of both printed and handwritten characters of the entire Kagunita. One set contains 430 printed characters including all Kagunita characters, end letters, digits and special symbols. The other 3 sets consist of handwritten samples of Kagunita, end letters and digits. Template generation preprocesses the collected dataset: each dataset image is converted to grayscale, noise is removed, and the image is converted to a binary image. It is then resized to a fixed size and stored.

b. Recognition: Template matching is used for classifying the image. The image is correlated with each template image, and the correlation value for each template image is stored in an array. The index corresponding to the maximum correlation value is found and returned. Fig. 3 shows the flow of this module.

Fig. 3. Template Matching

E. Conversion to Editable Format

Based on the index value (for template based matching), the Unicode value corresponding to the character stored at the obtained index is stored in a variable, letter. The letter value is stored in a word array. The recognized characters are printed to a notepad, where they can be further edited and saved. This flow is shown in Fig. 4. An overview of the complete design is shown in Fig. 5.
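The recognition and conversion steps described above can be sketched together in Python. This is a hedged illustration, not the authors' MATLAB code: `corr2` here is a pure-Python re-implementation of the 2-D correlation coefficient formula, while `recognize`, `to_text` and the `code_points` list are hypothetical names introduced for the example. Returning 0.0 for a constant image is a choice made for this sketch.

```python
import math


def corr2(a, b):
    """2-D correlation coefficient between two equally sized images (lists of rows)."""
    flat_a = [p for row in a for p in row]
    flat_b = [p for row in b for p in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("images must have the same size")
    mean_a = sum(flat_a) / len(flat_a)
    mean_b = sum(flat_b) / len(flat_b)
    numerator = sum((x - mean_a) * (y - mean_b) for x, y in zip(flat_a, flat_b))
    denominator = math.sqrt(
        sum((x - mean_a) ** 2 for x in flat_a)
        * sum((y - mean_b) ** 2 for y in flat_b)
    )
    # Assumption for this sketch: a constant image gets a score of 0.0.
    return numerator / denominator if denominator else 0.0


def recognize(char_image, templates):
    """Return the index of the template with the maximum correlation value."""
    scores = [corr2(char_image, template) for template in templates]
    return max(range(len(scores)), key=scores.__getitem__)


def to_text(char_images, templates, code_points):
    """Map each recognized template index to its Unicode code point and join the result."""
    return "".join(chr(code_points[recognize(img, templates)]) for img in char_images)
```

In use, `templates` would hold the fixed-size binary template images and `code_points` the matching Kannada code points in the same order, for example `0x0C85` for the letter ಅ and `0x0C95` for the letter ಕ, so that the returned string can be written to an editable text file.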

Fig. 4. Conversion to Editable Format

Fig. 5. Overview of the design

V. RESULTS

Experiments have been performed to test the proposed method. Matlab (R2009a) [9] is the software tool that was used for recognition of Kannada characters. The experiments were performed on many test images containing different words in one or more lines, with two writing forms of Kannada characters. Fig. 6 shows the preprocessing results, Fig. 7 shows the line segmentation results, Fig. 8 shows the letter segmentation results, Fig. 9 shows the template values, Fig. 10 shows the stored template correlation values, and Fig. 11 shows the obtained result in editable format.

Fig. 6. Preprocessing Results

Fig. 7. Line Segmentation Results

Fig. 8. Letter Segmentation Results

Fig. 9. Template Values

Fig. 10. Template Correlation Values

Fig. 11. Result

The obtained results are tabulated in Table 2.

TABLE 2. RESULTS OBTAINED (Accuracy in %)

| Character type | Printed | Handwritten |
|---|---|---|
| Vowels | 85-90 | 75-80 |
| Consonants | 80-82 | 70 |
| Consonant Conjunct | 75-78 | 65-68 |
| Votaksharas | 80 | 70-73 |

VI. CONCLUSION AND FUTURE ENHANCEMENT

HCR is the process of identifying handwritten characters. The text in an image is converted into letter codes that are usable within computer and text processing applications. Here, recognition is done for Kannada characters using a template based method. This method gave appreciable results for constrained handwritten text. Because of varying degrees of slant, skew and noise level, and varying writing styles, recognizing segmented characters is not an easy task using template matching. Thus template based matching failed to yield appreciable results for unconstrained handwriting styles. Hence a feature based extraction method, which extracts the salient characteristics that make a character distinct from others, can be employed as a future enhancement.

REFERENCES

[1] Bhardwaj Anurag, Jose Damien and Govindaraju Venu, "Script Independent Word Spotting in Multilingual Documents," in Second International Workshop on Cross Lingual Information Access, 2008.

[2] Chung Yuk Ying, Wong Man To and Bennamoun Mohammed, "Handwritten Character Recognition by Contour Sequence Moments and Neural Network," in International Conference on Systems, Man, and Cybernetics, vol. 5, 11-14 Oct. 1998, pp. 4184-4188.

[3] Kunte R Sanjeev and Samuel Sudhaker R D, "A Bilingual Machine-Interface OCR for Printed Kannada and English Text Employing Wavelet Features," in Tenth IEEE International Conference on Information Technology, pp. 202-207.

[4] Ragha Leena and M Sasikumar, "Adapting Moments for Handwritten Kannada Kagunita Recognition," in International Conference on Machine Learning and Computing (ICMLC-2010), Bangalore, India, pp. 125-129.

[5] V. N. Manjunath Aradhya, G. Hemanth Kumar and S. Noushath, "Robust Unconstrained Handwritten Digit Recognition Using Radon Transform," Proceedings of IEEE-ICSCN 2007, pp. 626-629, 2007.

[6] R. Plamondon, S.N. Srihari, "On-line and off-line handwritten recognition: a comprehensive survey," IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 62-84.

[7] U. Bhattacharya, T.K. Das, A. Datta, S.K. Parui, B.B. Chaudhuri, "A hybrid scheme for handprinted numeral recognition based on a self-organizing network and MLP classifiers," Int. J. Pattern Recognition Artif. Intell. 16 (2002) 845-864.