
Handwritten Marathi Basic Character Recognition using Statistical Method

Parshuram M. Kamble∗ and Ravindra S. Hegadi
Department of Computer Science, Solapur University, Solapur 413 255, Maharashtra, India.

Abstract. Distance measures such as nearest neighbor, similarity, linear correlation, cross-correlation and Hamming distance have been used to find the distance between characters. In this paper we propose a statistical feature extraction approach for the recognition of handwritten Marathi characters. The features depend on the area, shape, orientation, perimeter and other variations of handwritten characters. 200 samples of each character were collected from different writers and a database was prepared. 100 samples of each character were treated as training samples, and the average eccentricity, orientation and center of mass were evaluated for these training samples. A distance-based approach is used to classify the remaining 100 testing samples of each character. The results show a satisfactory recognition rate.

Keywords: Character recognition, Classification, Perimeter, Eccentricity, Mass of character, Orientation.

Introduction

Optical character recognition (OCR) is a type of computer software designed to translate images of handwritten or typewritten text into machine-editable text by recognizing characters at high speed, one at a time. Handwritten character detection and recognition have many related applications, such as information retrieval and automatic indexing (document indexing, content-based image retrieval and postal address recognition), which open up possibilities for more advanced systems. The Marathi language uses 63 phonemic letters, divided into three groups, namely Swar (vowels: 13 letters), Vyanjan (consonants: 38 letters) and Ank (numbers: 10 digits), as well as modifiers (diacritics: 12 letters), as shown in figure 2. Developing offline and online OCR for Marathi handwritten characters and numerals is challenging because handwriting differs from person to person. People recognize handwritten characters easily, but machines have difficulty with this task.

Few works are reported in the literature on the recognition of Marathi and other Indian language characters and numerals. S. B. Patil and G. R. Sinha (2012) proposed real-time handwritten Marathi numeral recognition using a neural network; they used a multilayer back-propagation network for recognition. A handwritten Devanagari character recognition model using a neural network was proposed by Gaurav Jaiswal (2014). In that paper the author used a zone and count metric based feature extraction algorithm, with a multilayer feed-forward neural network as the classifier. Mustafa D. A. (2013) proposed an optical character recognition system for multi-font English text using DCT and wavelet transform, with wavelets for feature vector construction and a DCT-based recognition method. The reported accuracy of this method was 96%. M. M. Kodabagi et al. (2013) proposed character recognition of Kannada text in scene images using a neural network.
Their method uses zone-wise horizontal and vertical profile features and neural network classification for basic Kannada character recognition. R. S. Hegadi and P. M. Kamble (2014) proposed recognition of Marathi handwritten numerals using a multi-layer feed-forward neural network. In that experiment each numeral was segmented from the document and resized to 7 × 5 pixels using cubic interpolation. Each resized numeral was converted into a vector of 35 values, and these vectors were used as input to train the neural network. A multilayer feed-forward neural network was used to classify the handwritten Marathi numerals. The accuracy of the algorithm is 97%. A template matching approach was proposed by Hegadi R. S. (2011) for the recognition of Kannada numerals. This method uses the correlation coefficient to match numerals and achieved an accuracy of 91%. In another work by Hegadi R. S. (2012), a multilayer feed-forward neural network is used for the classification of printed Kannada numerals. The experiments covered all the existing fonts of printed Kannada numerals, and the algorithm could recognize all the numerals. Pisal and Kamble (2012) proposed probabilistic neural network classification to recognize Marathi handwritten characters. In their paper the authors used water reservoir feature extraction for each character, followed by a probabilistic neural network classifier.

∗ Corresponding author

© Elsevier Publications 2014.

Figure 1. Architecture of HOCR.

Figure 2. Sample handwritten basic Marathi characters.

1. Proposed System Design

In this paper, we propose statistical feature extraction and classification of handwritten Marathi characters. The architecture of the proposed method is shown in figure 1. In any handwritten character recognition problem, training and testing are two important stages. In our work the training stage consists of preparing a standard database of Marathi handwritten character images and extracting features from this training dataset. In the testing stage, Marathi handwritten character images are tested against the trained handwritten character images. In the pre-processing stage the input image is segmented into individual characters and each character image is converted into binary form. Pre-processing of these character images is essential before the feature extraction stage. In the feature extraction stage we extract a set of features such as centroid, eccentricity, perimeter and orientation for every character. These features are used in the recognition stage.

2. Data Collection, Pre-Processing and Segmentation

2.1 Data collection

200 sets of handwritten Marathi character samples were collected at random from people of different age groups at different times, and were used as the dataset for the proposed experiments.
These datasets were written on blank white A4 paper and scanned with a Umax Astra 5600 scanner at 300 dpi. All samples were then stored in JPEG image file format.

2.2 Pre-Processing

Pre-processing plays an important role in handwritten character recognition, as in any other pattern recognition task. Figure 3 shows the stages of pre-processing and segmentation of characters from an input document image. The main objectives of pre-processing are binarization of the input image, noise reduction, connecting small breaks in characters, edge detection, region filling, normalization and segmentation.

Binarization is done by thresholding, which converts a gray-scale image into a binary image. This process splits the image into two components, namely the object component and the background. The object component contains the characters, and the background contains the noise and other unwanted information. Handwritten characters show various undesirable effects such as small unwanted strokes, gaps or breaks that arise during binarization. Often, when a character is handwritten, the stroke is thinner at a curve than elsewhere in the character, and this point is more likely to break during binarization. Hence a 3 × 3 averaging operator is applied before binarization; it blurs the image, bridging small gaps while retaining the actual shape of the character. It also removes pepper noise.

Unwanted strokes occur most often between the pen lifting and placing points, and their occurrence depends on the writing style and the ink viscosity. These strokes may cause unwanted features to be detected after binarization. To avoid this, the binarized image is cleaned using the morphological opening operator. Morphological opening removes thin protrusions, breaks thin connections and smooths the object contour.

In the proposed work, Marathi basic characters are written on plain paper in such a way that they do not overlap. They are segmented using bounding boxes. The separated characters are then cropped and passed to normalization. The cropped characters differ in size because each writer's handwriting style is different. Normalization is applied to bring all characters to a uniform size. These characters are then passed to the feature extraction process during training and testing.

Figure 3. Pre-processing stages of handwritten character recognition system.

Figure 4. a) Sample handwritten Marathi character, b) AB (m1) is the major axis and AC (m2) is the baseline axis.

3. Feature Extraction

3.1 Orientation

The angle of orientation (in degrees, ranging from −90 to 90) is the angle between the x-axis and the major axis of the ellipse that covers the character, as shown in figure 4. The solid blue lines are the axes of the ellipse and the red dots are the foci of the ellipse covering the character region. The orientation [8] is the angle between the horizontal line and the major axis, as shown in figure 4. The angle θ is calculated using equation 1, in which m1 is the slope of the major axis and m2 is the slope of the baseline.
\[ \tan(\theta) = \frac{m_1 - m_2}{1 + m_1 m_2} \qquad (1) \]

3.2 Centroid

The two coordinates x̄ and ȳ specify the center of mass of the character region. The first and second elements of the centroid are the horizontal and vertical coordinates of the character region. Figure 5 illustrates the centroid [8] for a sample character region. The region consists of white pixels; the red dot is the centroid.

x̄ = size of M th element of character/2 and ȳ = size of N th element of character/2
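As a minimal sketch of the two features defined so far, the Python snippet below evaluates equation 1 for a pair of slopes and computes a centroid for a toy binary image. It is not the authors' Matlab code. The function names and the toy image are hypothetical, and the centroid here is taken to be the mean coordinate of the white pixels, which is one reading of "center of mass" and may differ from the half-size expression given in the text.

```python
import math


def angle_between_slopes(m1, m2):
    """Angle in degrees between two lines with slopes m1 and m2 (equation 1)."""
    denom = 1.0 + m1 * m2
    if denom == 0.0:
        # Perpendicular lines: tan(theta) is undefined, so report 90 degrees.
        return 90.0
    # atan keeps the result strictly between -90 and 90 degrees.
    return math.degrees(math.atan((m1 - m2) / denom))


def centroid(image):
    """Mean (x, y) coordinate of the white (1) pixels of a binary image.

    Assumption: the image is a list of rows holding 0 (background) or 1.
    """
    xs = [x for row in image for x, pixel in enumerate(row) if pixel]
    ys = [y for y, row in enumerate(image) for pixel in row if pixel]
    if not xs:
        raise ValueError("image contains no white pixels")
    return sum(xs) / len(xs), sum(ys) / len(ys)


# Toy 4 x 5 binary image: 1 = character (white), 0 = background.
toy = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
x_bar, y_bar = centroid(toy)

# Major axis with slope 1 against a horizontal baseline (slope 0).
theta = angle_between_slopes(1.0, 0.0)
```

Using `atan` on the ratio, rather than `atan2`, is a deliberate choice here: it keeps the angle inside the −90 to 90 degree range stated for the orientation feature.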


Figure 5. a) Sample handwritten Marathi character, b) Red dot is the centroid of the character.

Figure 6. a) Sample handwritten Marathi character, b) Border of character.

3.3 Perimeter

The perimeter is the distance around the boundary of the character region. The perimeter [8] is calculated from the distance between each adjoining pair of pixels around the border of the character region. Figure 6 shows the pixels included in the perimeter calculation for a sample character.

3.4 Eccentricity

The shape, size and orientation of Marathi characters are heterogeneous. Generally, handwritten Marathi vowels are roughly oval in shape. We used the eccentricity [8] of the character as one of the features in our work. Eccentricity is the ratio of the major axis to the minor axis of the ellipse that covers the entire character, as given by equation 2, where AB is the major axis and CD is the minor axis.

\[ \text{Eccentricity} = \frac{AB}{CD} \qquad (2) \]

Eccentricity is calculated for all characters with connected axes regions. In figure 7 the red line is the ellipse enclosing the handwritten Marathi letter, the blue line AB is the major axis and CD is the minor axis.

Figure 7. a) Sample handwritten Marathi character, b) Major axis and minor axis.

3.5 Area of character

The mass of a character is the total number of white pixels in the binarized character. The white pixels in each character are counted to obtain its mass value. After pre-processing, the character is normalized to a standard size of 50 by 70, and then the area [8] is calculated.

4. Classification
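Classification starts from the features of Sections 3.3 to 3.5, so a rough Python sketch of them may help; it is not the authors' Matlab implementation. All function names are hypothetical. Two simplifications are assumed: the perimeter is approximated by counting border pixels rather than summing distances along a traced boundary, and the major and minor axes are taken from the ellipse that has the same second central moments as the character region.

```python
import math


def area(image):
    """Mass of the character: number of white (1) pixels."""
    return sum(pixel for row in image for pixel in row)


def border_pixels(image):
    """White pixels that touch the background or the image edge (4-neighbourhood)."""
    h, w = len(image), len(image[0])
    border = []
    for y in range(h):
        for x in range(w):
            if not image[y][x]:
                continue
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if not (0 <= ny < h and 0 <= nx < w) or not image[ny][nx]:
                    border.append((x, y))
                    break
    return border


def axis_ratio(image):
    """Major/minor axis ratio (equation 2) from second central moments."""
    pts = [(x, y) for y, row in enumerate(image) for x, p in enumerate(row) if p]
    n = len(pts)
    if n == 0:
        raise ValueError("image contains no white pixels")
    mx = sum(x for x, _ in pts) / n
    my = sum(y for _, y in pts) / n
    uxx = sum((x - mx) ** 2 for x, _ in pts) / n
    uyy = sum((y - my) ** 2 for _, y in pts) / n
    uxy = sum((x - mx) * (y - my) for x, y in pts) / n
    spread = math.sqrt((uxx - uyy) ** 2 + 4 * uxy ** 2)
    # major and minor are proportional to the squared axis lengths.
    major, minor = uxx + uyy + spread, uxx + uyy - spread
    return math.inf if minor <= 0 else math.sqrt(major / minor)


# Toy example: a filled 4 x 6 rectangle inside a 6 x 8 binary image.
rect = [[1 if 1 <= y <= 4 and 1 <= x <= 6 else 0 for x in range(8)] for y in range(6)]
mass = area(rect)
perimeter_estimate = len(border_pixels(rect))
ratio = axis_ratio(rect)
```

Counting border pixels underestimates the length of diagonal runs, so a traced boundary with per-step distances would be closer to the definition in Section 3.3.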

We computed four features, namely total mass of character, centroid, eccentricity and orientation, for 100 sets of basic Marathi handwritten characters and stored them in a database. The same features were then computed for the test samples, and the distance between the features of each test character and the values stored in the database for the training dataset was computed. Figure 8 shows a GUI in which a character image is read, its features are computed, and a label is displayed for the class to which the character belongs.
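The distance-based step can be sketched in Python as follows. This is an illustration only: the source does not state which distance measure was used, so Euclidean distance is an assumption, and the function names, class labels and feature values are made up for the example.

```python
import math


def mean_features(samples):
    """Average a list of equal-length feature vectors component by component."""
    n = len(samples)
    return [sum(values) / n for values in zip(*samples)]


def train(training_set):
    """Map each class label to the mean feature vector of its training samples."""
    return {label: mean_features(samples) for label, samples in training_set.items()}


def classify(features, class_means):
    """Return the label whose mean vector is closest (assumed Euclidean distance)."""
    return min(class_means, key=lambda label: math.dist(features, class_means[label]))


# Made-up feature vectors: (eccentricity, orientation in degrees, mass in pixels).
training_set = {
    "class_a": [[1.2, 10.0, 900.0], [1.4, 14.0, 940.0]],
    "class_b": [[2.1, -35.0, 1500.0], [1.9, -25.0, 1460.0]],
}
class_means = train(training_set)
predicted = classify([1.25, 11.0, 910.0], class_means)
```

With unscaled features, the mass term dominates the Euclidean distance because its values are far larger than the eccentricity values; scaling each feature to a comparable range before computing distances is one possible remedy.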

Figure 8. GUI showing the label of a Marathi handwritten character.

Table 1. Average feature values for the 100 training samples of each character and accuracy rate for the 100 testing samples of each character.

5. Result and Discussion

The experimentation was carried out using Matlab 7.0. 200 different sets of each handwritten character of the Marathi language were used, of which 100 sets of each character served as training samples and 100 sets of each character as testing samples. From the 100 training sets of each character the four features, namely eccentricity, orientation, mass of character and perimeter, were obtained and the average value was computed for each character, as shown in table 1. These features were then computed for each character in the 100 sets of testing samples, and the distance was computed between each of these characters and the values obtained for the training samples. Classification is based on the smallest distance. Table 1 also shows the classification accuracy for each character. It can be noticed that some characters, such as [glyphs missing], were classified with very high accuracy, whereas our technique performed very poorly for characters such as [glyphs missing]. The overall FAR of the proposed algorithm was 15.52%, because a few writers wrote some characters with equivalent shapes. Hence in many cases such a character may be falsely classified as [glyph missing]. The overall accuracy of the proposed algorithm was 94.38%.

6. Conclusion

In this paper we have proposed statistical feature extraction for Marathi handwritten basic character recognition. Two-stage recognition approaches can be applied to improve the performance of the scheme. The main characteristic of Marathi characters is their shape, which is mostly formed of curves. Most recognition failures are due either to characters with sharp edges and corners, or to characters written in an inappropriate style, which makes them unknown characters. Post-processing may improve the performance, and we will undertake it in our future work.

References

[1] S. B. Patil and G. R. Sinha, “Real Time Handwritten Marathi Numerals Recognition using Neural Network”, Int. Jr. Info. Tech. Comp. Sci., pp. 76–81, (2012).
[2] D. A. Mustafa, “Optical Character Recognition (OCR) System For Multifont English Texts using DCT & Wavelet Transform”, Int. Jr. Comp. Engg, and Tech., pp. 48–61, (2013).
[3] R. S. Hegadi and P. M. Kamble, “Recognition of Marathi Handwritten Numerals Using Multi-layer Feed-Forward Neural Network”, IEEE Xplore, WCCT, pp. 21–24, (2014).
[4] M. M. Kodabagi, S. A. Angadi and C. R. Shivanagi, “Character Recognition of Kannada Text in Scene Images using Neural Network”, Int. Jr. Graphics and Multimedia, vol. 4, Issue 1, pp. 09–19, (2013).
[5] R. S. Hegadi, “Classification of Kannada Numerals using Multi-layer Neural Network”, Adv. in Int. Sy. and Comp., vol. 174, pp. 963–968, (2012).
[6] R. S. Hegadi, “Template Matching Approach for Printed Kannada Numeral Recognition”, Int. Conf. Comp. Int. Info. Tech., pp. 480–483, (2011).
[7] T. B. Pisal and P. M. Kamble, “Marathi Character Recognition by using Probabilistic Neural Network Classification”, Int. Jr. Comp. Sci. Info. Tech., pp. 66–63, (2012).
[8] http://www.mathworks.in/help/images/ref/regionprops.html.
[9] Gaurav Jaiswal, “Handwritten Devanagari Character Recognition Model using Neural Network”, Int. Jr. of. Engg. Dev. Research., vol. 2, Issue 1, pp. 901–906, (2014).
