IJIRST –International Journal for Innovative Research in Science & Technology| Volume 1 | Issue 11 | April 2015 ISSN (online): 2349-6010

Offline Recognition of Handwritten Devanagari words using Hidden Markov Model

Mrs. Asma Shaikh
PG Student
Department of MCA
MM. College of Engineering, Pune

Mr. Rahul Dagade
Assistant Professor
Department of Computer Engineering
MM. College of Engineering, Pune

Abstract
Devanagari is the most popular script in India, used by more than 500 million people. Handwriting recognition has been researched extensively, whereas offline recognition of Devanagari script is still a developing area of research. The proposed system involves three steps: pre-processing, feature extraction and classification. In the first stage, words from the input images are normalized. Then a set of intensity features is extracted from each of the segmented words, gradient features are extracted from the image, and structure-like features are also extracted. Finally, these features are applied in a combined scheme for classification: the intensity and gradient features are used to train an HMM classifier, whose results are re-ranked using the structural features for an improved recognition rate.
Keywords: HWR, HMM, Classifier, Re-ranking

I. INTRODUCTION
Handwritten word recognition (HWR) is the task of identifying the actual words in a word image. It plays an important role in many human-computer interfaces, including mail sorting, office automation and cheque verification [4]. Handwriting recognition is the ability of a computer to receive and interpret intelligible handwritten input from sources such as paper documents, photographs, touch-screens and other devices. The image of the handwritten text may be sensed "off line" from a piece of paper by optical scanning (optical character recognition). Off-line recognition of handwritten Devanagari script is a complex task because of the similarities between letters under diverse writing styles and the deformation of letters from one writer to another.

Fig. 1: Offline Handwritten Recognition system

A. Devanagari Script:
Most Indian scripts, including Devanagari, originated from the ancient Brahmi script through various transformations [3]. The Devanagari script has 13 vowels ("Swar") and thirty-six consonants ("Vyanjan"), as shown in Fig 2. When those thirty-six consonants are combined with the vowels, complex compositions of their constituent symbols are generated [4].

Fig. 2: Modified Alphabets when Consonants are attached with vowels

All rights reserved by www.ijirst.org


Offline Recognition of Handwritten Devanagari words using Hidden Markov Model (IJIRST/ Volume 1 / Issue 11 / 091)

Fig. 3: Header line with three parts of Devanagari words

Fig. 4 Devanagari Alphabets: Vowels, Consonants & Modifiers

A vowel following a consonant may take a modified shape, which is placed before, after, above or below the consonant depending on the vowel. These shapes are called modifiers or "Matras" because they modify the consonant's meaning, as shown in Fig 3. There are twelve modifiers in total. Devanagari is a unicase script: unlike the Latin script, it has no concept of upper-case and lower-case letters. Words are written in Devanagari as they are pronounced, so it is called a phonetic script. It is also called a syllabic script because text is composed of consonants and vowels that together form syllables [4]. In addition, composite and complex characters are formed by combining more than one basic character, and consonants can take a half form when they are joined with other consonants. A key feature of the Devanagari script is the horizontal line along the top of all characters, known as the header line or "Shirorekha" [5]. The header line divides the word into three parts: the top part, above the header line, contains the top modifiers; the bottom part contains the lower modifiers; and the core part contains the consonants, as shown in Fig 4.

B. System Architecture:
Recognition of Devanagari script is more difficult because of the nature of its writing. A word-based offline recognition system using HMM is proposed.
Input: An image
Processing:
1. Pre-processing
2. Feature extraction
3. Classification
Output: Word identification
The proposed system works as follows:

Fig. 5: Proposed System


The proposed system works as follows. Efficient pre-processing algorithms are created for image enhancement, noise removal, header-line (Shirorekha) detection, slant and skew correction, and word segmentation; these use statistical analysis and knowledge-supported decision making. After the header line (Shirorekha) is found, gradient features are extracted from the image using the Discrete Fourier Transform. Several structural features are extracted, such as sub-words and the parts above the header line, within the middle (core) part and in the bottom part. A set of intensity features is then extracted from the segmented words by applying a sliding window. Finally, these features are used for word recognition by combining an HMM classifier with re-ranking.

C. Module of the System:
1) Module 1: Train Image Module
In this module the sample images are used for training; a folder of images is given as input. The module takes the images one by one and performs pre-processing, feature extraction and classification on each. For classification, the Baum-Welch algorithm is used to build the HMM models.
2) Module 2: Testing Module
In this module one image is selected for recognition. The selected image goes through pre-processing, feature extraction and classification, where the Viterbi algorithm of HMM finds the best-matching path, and the results are then re-ranked.
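The Viterbi decoding used by the testing module can be illustrated with a minimal sketch for a discrete HMM. This is not the authors' implementation (the paper mentions a modified Viterbi algorithm whose details are not given); the function name `viterbi` and the argument layout are assumptions made for illustration, and the model parameters follow the notation λ = (π, T, O) used in this paper.

```python
import math

def viterbi(obs, pi, T, O):
    """Return (best state path, log-probability of that path).

    obs: list of observation symbol indices
    pi:  initial state distribution, pi[s]
    T:   state transition probabilities, T[s][s2]
    O:   observation symbol probabilities, O[s][symbol]
    (Hypothetical helper; the names are not taken from the paper.)
    """
    n = len(pi)

    def log(p):
        # Work in log space so long sequences do not underflow.
        return math.log(p) if p > 0 else float("-inf")

    # delta[s]: best log-probability of any path that ends in state s
    delta = [log(pi[s]) + log(O[s][obs[0]]) for s in range(n)]
    back = []  # back[t][s2]: best predecessor of state s2 at step t + 1

    for symbol in obs[1:]:
        new_delta, pointers = [], []
        for s2 in range(n):
            best_prev = max(range(n), key=lambda s: delta[s] + log(T[s][s2]))
            pointers.append(best_prev)
            new_delta.append(
                delta[best_prev] + log(T[best_prev][s2]) + log(O[s2][symbol])
            )
        delta = new_delta
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda s: delta[s])
    best_score = delta[state]
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_score
```

In a word recognizer of this kind, one would typically score the observation sequence of a test image against the HMM of every lexicon word and keep the highest-scoring words as candidates; how the paper's system does this exactly is not stated.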

II. ALGORITHMS
A. Preprocessing Algorithm:
Pre-processing is an important part of offline word recognition. It removes all elements that are not useful for identification and enhances the input signal into a representation that can be processed consistently for recognition. Here the pre-processing stage entails scanning the paper document, image enhancement, removal of noise segments and header-line (Shirorekha) detection, all of which depend mainly on the quality of the paper document [2].

Fig. 6: Preprocessing technique
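The binarization step of this stage relies on Otsu's thresholding method [18]. A minimal pure-Python sketch of that method is given here for illustration; the function names `otsu_threshold` and `binarize`, the 0-255 grey-level range and the assumption that ink is darker than paper are ours, not taken from the paper.

```python
def otsu_threshold(pixels):
    """Return the grey level (0-255) that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of the grey levels of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        if w_bg == 0:
            continue  # no pixels in the lower class yet
        if w_fg == 0:
            break     # all pixels are in the lower class
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance, up to the constant factor 1 / total**2.
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image, ink_is_dark=True):
    """Map a grey image (list of rows) to 1 = ink, 0 = background."""
    t = otsu_threshold([p for row in image for p in row])
    if ink_is_dark:
        return [[1 if p <= t else 0 for p in row] for row in image]
    return [[1 if p > t else 0 for p in row] for row in image]
```

The later sketches in this paper's context assume the same convention of a binary image stored as a list of rows with 1 for ink pixels.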

1) Algorithm: Preprocessing Operation
INPUT: Scanned image
OUTPUT: Preprocessed image
STEPS:
1) Step 1: Start.
2) Step 2: Load an image.
3) Step 3: Binarize the image using Otsu's thresholding method [18].
4) Step 4: Smooth the image using a median filter [8].
5) Step 5: Normalize the image using morphological operations such as erosion and dilation [8].

Fig. 7: after Morphological Operation Erosion and Dilation

6) Step 6: Thinning is performed using a structuring element [9].


Fig. 8: after thinning operation

7) Step 7: The Hough line detection algorithm is used to find the header line of the words, which is the basis for skew correction as well as for feature extraction.
8) Step 8: For skew correction, the angle is identified from the detected header line. This angle is used to rotate the image matrix.

Fig. 9: after De-skew Operation

9) Step 9: Stop.

B. Feature Extraction Algorithm:
Feature extraction identifies the information that is useful for recognizing a word [7]. It extracts the essential information from the raw image by removing redundant data and gives a more efficient representation of the word image. These features are used to categorize words [7].
1) Algorithm: Feature Extraction
INPUT: Preprocessed image
OUTPUT: Feature file
STEPS:
1) Step 1: Start.
2) Step 2: Load the pre-processed image.
3) Step 3: To apply the sliding-window technique, form a window whose height equals the height of the word and whose width is three pixels, with one overlapping pixel.
4) Step 4: Divide each Devanagari word into 15 identical horizontal parts [2]. The number of foreground pixels is calculated by moving the sliding window from left to right.
5) Step 5: Use these 15 horizontal parts to compute a total of 30 features as follows. First, the fifteen intensity features (X1 - X15) are the average intensity of the foreground pixels in each part. The 16th feature, X16, is the average of these 15 features:

$$X_{16} = \frac{1}{15}\sum_{i=1}^{15} X_i \qquad \text{(Eq. 1)}$$

Next, the mean intensity of each consecutive pair of parts is extracted as fourteen additional features (X17 - X30):

$$X_{i+16} = \frac{X_i + X_{i+1}}{2}, \quad i \in [1, 14] \qquad \text{(Eq. 2)}$$

6) Step 6: Structure-like features are also identified, including the number of connected regions "hr" and the number of connected regions above the header line "ha". These features represent the topological form of the image to some degree [2].
7) Step 7: Fifty-two gradient features are extracted from the image.
8) Step 8: The extracted features are used in the combined scheme for recognition. The thirty intensity features and the gradient features are used by the HMM classifier to generate the HMM model, and the structural features are used to enhance the HMM classification output by re-ranking [2].
9) Step 9: Store these features in the city_feature_extraction.text file.
10) Step 10: Stop.
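The thirty intensity features can be sketched as follows. The exact formula for X1 - X15 is not legible in this copy of the paper, so the sketch assumes each of them is the fraction of foreground pixels in its horizontal part; the function names `sliding_windows` and `intensity_features` are likewise hypothetical. X16 and X17 - X30 follow Eq. 1 and Eq. 2.

```python
def sliding_windows(image, width=3, overlap=1):
    """Yield full-height windows of a binary image, moving left to right."""
    step = width - overlap
    n_cols = len(image[0])
    for left in range(0, n_cols - width + 1, step):
        yield [row[left:left + width] for row in image]


def intensity_features(window, parts=15):
    """Return the 30 intensity features of one window (1 = ink pixel)."""
    height = len(window)
    width = len(window[0])
    x = []
    for i in range(parts):
        # Rows belonging to the i-th horizontal part of the window.
        rows = window[i * height // parts:(i + 1) * height // parts]
        area = max(1, len(rows) * width)
        ink = sum(sum(row) for row in rows)
        x.append(ink / area)               # X1..X15 (assumed definition)
    x.append(sum(x[:parts]) / parts)       # X16, Eq. 1
    for i in range(parts - 1):             # X17..X30, Eq. 2
        x.append((x[i] + x[i + 1]) / 2)
    return x
```

Each window then contributes one 30-element vector, so a word image yields a sequence of vectors that can serve as the observation sequence for the HMM.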

Fig. 10: After Feature Extraction


C. Recognition Algorithm:
Several kinds of classifiers are available for the recognition of handwritten text, e.g. Hidden Markov Model, K-Nearest Neighbour, Support Vector Machine and Neural Network [9]. Hidden Markov Models have proved useful in a wide variety of applications such as ecology, handwriting recognition, gesture recognition and a wide spectrum of speech applications. HMMs have become popular for classification in recent years because their rich mathematical structure forms a theoretical as well as a practical basis for such applications. The HMM Toolkit is an openly available platform for HMM development.
Elements of HMM [19] [4]:
N = number of hidden states in the model
M = number of distinct observation symbols
Q = {q1, q2, ..., qn}, the states
V = {v1, v2, ..., vm}, the discrete set of possible observation symbols
T = state transition probability distribution
O = observation symbol probability distribution
Π = initial state distribution
S = {s1, s2, ..., st}, the observation sequence [9]
1) Algorithm: Recognition Algorithm
INPUT: Feature file
OUTPUT: Three best-matching recognition results
STEPS:

1) Step 1: Start.
2) Step 2: Take the feature file as input to the classifier.
3) Step 3: Derive the inputs from the features: n states; m observations; a transition matrix A with n rows and n columns (the state transition probabilities T listed above); an observation matrix O with n rows and m columns; and k observations in a vector Y with elements Yj for each time step j from 1 to k. If nothing is known in advance, A and O can be initialized with uniform probability values.
4) Step 4: Expectation step.
5) Step 5: Maximization step.



We need the probability P(α, β, t) of being in state α at time t and in state β at time t+1. It is the product of the probability of being in state α at time t, the transition probability from α to β, the probability of being in state β at time t+1, and the probability of the observation given that we are now in state β. These values can be stored as a 3D array. The new transition matrix A is then found by calculating each element:

$$A_{\alpha,\beta} = \frac{\sum_{t=0}^{k-1} P(\alpha, \beta, t)}{\sum_{t=0}^{k-1} P(\alpha, t)} \qquad \text{(Eq. 3)}$$

6) Step 6: Eq. 3 can be summarized as the sum of the probabilities of going from state α to state β at each time step, divided by the sum of the probabilities of being in state α (and going to any next state).
7) Step 7: Find the new observation matrix O by calculating each element as the probability of being in a state at the times when observation x occurred, divided by the probability of being in that state at any time.
8) Step 8: Go back to the expectation step and repeat until convergence.
9) Step 9: In the testing phase, a modified Viterbi algorithm is used for recognition with the optimized HMM model λ = (π, T, O).
10) Step 10: Get the recognition result.
11) Step 11: Stop.

D. Re-Ranking Algorithm:
1) Algorithm: Re-ranking Algorithm
INPUT: Three best-matching recognition results
OUTPUT: Re-ranked result
STEPS:
1) Step 1: Start.
2) Step 2: Take the pre-processed image.
3) Step 3: Calculate the threshold of the image.
4) Step 4: Find the contours of the image.
5) Step 5: Approximate the contours by polygons and get the bounding rectangle and circle.
6) Step 6: Initialize noOfUppermhatras = 0.
7) Step 7: Find the area of the image above the header line by using the roiArea function.
8) Step 8: Calculate the number of lines.
9) Step 9: Increment noOfUppermhatras by one.
10) Step 10: Update the recognition result.
11) Step 11: Stop.
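The re-ranking idea can be illustrated with a small sketch. The paper does not specify how the count of upper modifiers changes the ranking, so everything here is an assumption made for illustration: the upper modifiers are counted as 4-connected ink regions above the header-line row (instead of the contour-based procedure listed above), each candidate word is assumed to come with a known expected count, and candidates whose expected count is closest to the observed one are moved to the front. The names `count_upper_components`, `rerank` and `expected_upper` are hypothetical.

```python
def count_upper_components(binary, header_row):
    """Count 4-connected ink regions lying strictly above header_row."""
    region = [row[:] for row in binary[:header_row]]  # copy, then erase as we go
    count = 0
    for r in range(len(region)):
        for c in range(len(region[r])):
            if region[r][c] != 1:
                continue
            count += 1
            region[r][c] = 0
            stack = [(r, c)]
            while stack:  # flood fill to erase the whole region
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < len(region) and 0 <= nx < len(region[ny])
                            and region[ny][nx] == 1):
                        region[ny][nx] = 0
                        stack.append((ny, nx))
    return count


def rerank(candidates, observed_upper, expected_upper):
    """Reorder (word, hmm_score) candidates using the structural count.

    expected_upper maps each candidate word to the number of upper
    modifiers it should contain (assumed to be known from the lexicon).
    """
    def key(item):
        word, score = item
        mismatch = abs(expected_upper[word] - observed_upper)
        return (mismatch, -score)  # structural match first, then HMM score
    return sorted(candidates, key=key)
```

With three HMM candidates, a candidate ranked second by the HMM is promoted to first place when it is the only one whose expected number of upper modifiers matches the image.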


III. RESULT
A. Dataset:
Two thousand scanned images of "Hindi City Names" written in Devanagari script.

Table 1: Comparative result before and after re-ranking
Sr. No | Steps in the system
1 | Pre-processing and image enhancement operations: 1) Binarization 2) Normalization 3) Smoothing 4) Thinning 5) Header line detection 6) Skew detection and correction
2 | Feature extraction: 1) Intensity features 2) Gradient features 3) Structural features
3 | Data classification: 1) Baum-Welch algorithm 2) Viterbi algorithm 3) Re-ranking
Before re-ranking: 81.34 %
After re-ranking: 83.77 %

IV. CONCLUSION & FUTURE ENHANCEMENT
This paper proposes a Hidden Markov Model based system for the recognition of offline handwritten Devanagari words. The system uses several pre-processing steps to improve the quality of the word images, then uses structural and statistical features to classify the words. A combined scheme using an HMM classifier followed by re-ranking helps to improve the accuracy of recognizing words from handwritten Devanagari script. The system may be extended to recognize a whole document instead of a single word.

REFERENCES
[1] Mrs. Asma Shaikh, Mr. Rahul Dagade, "Recognition of Devanagari Script: A Survey", in IJATER, vol. 4, Issue 2, pp. 6-13, March 2014.
[2] Jawad AlKhateeb, Jinchang Ren, Jianmin Jiang, Husni Al-Muhtaseb, "Offline handwritten Arabic cursive text recognition using Hidden Markov Models and Re-ranking", in Pattern Recognition, vol. 32, pp. 1081-1088, 2011.
[3] Ved Prakash Agnihotri, "Offline Handwritten Devanagari Script Recognition", in MEC, pp. 37-42, 2012.
[4] U. Pal and B. B. Chaudhuri, "Indian script character recognition: A survey", Pattern Recognition, vol. 37, pp. 1887-1899, 2004.
[5] R. Jayadevan, Satish R. Kolhe, Pradeep M. Patil and Umapada Pal, "Offline Recognition of Devanagari Script: A Survey", in IEEE Transaction on Systems, Man and Cybernetics - Part C: Applications and Review, vol. 41, pp. 782-796, 2011.
[6] Bikash Shaw, Swapan Kr. Parui and Malayappan Shridhar, "Offline Handwritten Devanagari Word Recognition: A Segmentation Based Approach", 19th International Conference on Pattern Recognition (ICPR'08), December 8-11, 2008, Tampa, Florida, USA.
[7] Naveen Shankaran, Aman Neelappa and C.V. Jawahar, "Devanagari Text Recognition: A Transcription based Formulation", in ICDAR, pp. 678-68, 2013.
[8] Vedgupt Saraf, "Offline Handwritten Character Recognition of Devanagari script uses Genetic Algorithm for Improve efficiency", in ICCSE, pp. 161-164, 2013.
[9] A. Bharat and Sriganesh Madhavnath, "HMM-Based Lexicon Driven and Lexicon-Free word Recognition for Online Handwritten Indic Scripts", in IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, pp. 670-682, 2012.
[10] Lawrence Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", in IEEE, 1989.
[11] Amin A. (1998), "Off-line Arabic character recognition: the state of the art", Pattern Recognition, vol. 31, 5, pp. 517-530.
[12] L.R. Rabiner, B.H. Juang, "An Introduction to Hidden Markov Models", in IEEE ASSP Magazine, 1986.
[13] Sandhya Arora, Debotosh Bhattacharjee, M. Nasipuri, and L. Malik, "A Two Stage Classification Approach for Handwritten Devanagari Characters", International Conference on Computational Intelligence and Multimedia Application (ICCIMA07), Sivkasi, Tamil Nadu, India, 2007.
[14] Umapada Pal, T. Wakabayashi, F. Kimura, "Comparative study of Devanagari Handwritten Character Recognition using Different Features and Classifiers", in 10th ICDAR, IEEE, pp. 1111-1115, 2009.
[15] Alkhateeb, J.H., Ren, J., Ipson, S.S., Jiang, J., 2009b, "Component-based segmentation of words from handwritten Arabic text", Internet. J. Comput. Systems Sci. Eng. 5(1).
[16] Rajiv K., Deepak Bagal, T.S. Kamal, "Skew Angle detection of a cursive handwritten Devanagari Script character image", J. Indian Inst Sci, vol. 82, pp. 161175, 2002.
[17] Khorsheed M.S. (2003), "Recognizing Arabic manuscripts using a single hidden Markov model", Pattern Recognition Letters, 24, pp. 2235-2242.
[18] M.S. Khorsheed, "Off-Line Arabic Character Recognition - A Review", Pattern Analysis & Apps., vol. 5, pp. 31-45, 2002.
