CSIT (January 2015) 2(4):235–241 DOI 10.1007/s40012-014-0059-z

ORIGINAL RESEARCH

Support vector machine for identification of handwritten Gujarati alphabets using hybrid feature space

Apurva A. Desai

Received: 25 May 2014 / Accepted: 8 December 2014 / Published online: 13 March 2015 © CSI Publications 2015

Abstract Gujarati is used in Gujarat, a western state of India. The peculiarities of its script make optical character recognition (OCR) very difficult, and very little OCR work has been done for this language. This paper addresses OCR for handwritten Gujarati alphabets. For this work, forty handwritten alphabets were collected from one hundred and ninety nine writers. The feature space combines the aspect ratio, the extent of the alphabet, and an image subdivision approach. A support vector machine (SVM) is used for classification and gives an accuracy of 86.66 %. kNN is also used for classification, and its result is compared with that of SVM. The paper also describes the support vector machine.

Keywords Optical character recognition (OCR) · Feature extraction · Classification · Support vector machine (SVM) · Gujarati language · Indian language

1 Introduction

Gujarati belongs to the Devanagari family. It has eleven vowels and thirty four consonants. Unlike the scripts of many other languages in the Devanagari family, it does not have a "Shirolekha". Figure 1 shows the handwritten Gujarati alphabets considered in this work. Optical character recognition (OCR) of Gujarati script becomes difficult because of the lack of a "Shirolekha".

Like many Indian languages, Gujarati uses many modifiers, and sometimes these modifiers change even the shape of the basic alphabet. Gujarati also uses half alphabets, which add complexity to the recognition problem. Besides this, modifiers in Gujarati are placed in all four directions around an alphabet: left, right, upper, and lower. All these issues make Gujarati optical character recognition complex, and recognition becomes even more difficult for handwritten Gujarati script. Despite these difficulties, many researchers have attempted to recognize Gujarati script. This work presents an algorithm for recognizing handwritten Gujarati alphabets.

This paper is composed of seven sections. This introduction is followed by a survey of optical character recognition for the Gujarati language. Section three gives a short note on the classification method, the support vector machine (SVM). The next two sections discuss preprocessing and feature extraction in turn. Section six presents the results of the research, and the last section summarizes this work.

A. A. Desai, Department of Computer Science, Veer Narmad South Gujarat University, Surat, Gujarat, India e-mail: [email protected]; [email protected]

2 Related work

Optical character recognition (OCR) is a research area that attracts many researchers across the world. A great deal of work has been done for languages like English, and many researchers have contributed to Indian languages such as Bengali, Tamil, and Telugu. OCR for the Gujarati language, however, remained untouched until researchers took it up in 1999. Antani and Agnihotri [1] presented their work on similar types of printed Gujarati alphabets. Here researchers used first and higher order


Fig. 1 Set of handwritten Gujarati alphabets considered in this work

moments as features and tried out different classifiers such as kNN and a Hamming distance classifier. This work does not discuss preprocessing. Since it is a pioneering work for Gujarati optical character recognition, its importance is high even though the success rate was very low. Gujarati script uses three logical zones: upper, middle, and lower. Dholakia et al. [2] addressed the separation of these three zones in printed Gujarati text. A smearing algorithm is used for joining modifiers (matras) with the text. After separation of matras, horizontal and vertical profiles are used for separating words and lines respectively. After the individual words are separated, each word is enclosed in a box, and a line slope of this box is used to find the three zones: lower, middle, and upper. The researchers achieved 95 % accuracy in zone identification. Shah and Sharma [3] investigated the recognition of printed Gujarati characters using a template matching approach. To classify Gujarati alphabets, the researchers first separated the modifiers from the alphabets. They used Fringe Formation Wavelet Transformation coefficients as the feature set and a distance comparison classifier for template matching. Even though preprocessing was done, the proposed algorithm gave an overall accuracy of 72.3 %, and even lower accuracy for small punctuation marks and modifiers. Dholakia et al. [4] proposed to classify printed Gujarati script using a wavelet based feature set. They used Daubechies D4 wavelet coefficients as the feature set and two classifiers, kNN and a General Regression Neural Network (GRNN), which achieved high success rates of 96.71 and 97.59 % respectively. The authors correctly noted that characters that are visually similar are also confused by classifiers in the feature space.
A good review of Gujarati document processing and optical character recognition has been presented by Dholakia et al. [5]. They described various feature extractors and mainly two classifiers, kNN and the artificial neural network (ANN). The authors also described various preprocessing procedures such as segmentation and zone detection. This work mainly focuses on printed Gujarati text. They concluded that the combination of wavelets and GRNN gives the better recognition result. The first work on handwritten Gujarati character recognition appeared in 2010.


Desai [6] presented research on Gujarati handwritten numeral recognition using a structural feature extractor. During the preprocessing phase, the author performed cropping, smoothing, and thinning before resizing the alphabets to a standardized size of 16 × 16 pixels. The author proposed a new idea: training the classifier with rotated alphabets to account for skew, rather than performing skew correction on the alphabet image. A feed forward neural network (FFNN) was used as the classifier. This work achieved 81.66 % accuracy in identifying standalone handwritten Gujarati numerals. To improve the accuracy for handwritten Gujarati numerals, Desai [7] presented another algorithm in 2010, this time with a hybrid feature set. The image of the handwritten character was divided into sixteen sub images of four by four pixels, and the total number of foreground pixels in each sub image was used as a feature. The aspect ratio of the handwritten numeral was used as the seventeenth feature. In this work, Desai used the kNN classifier to achieve 96.99 % accuracy. The paper also analyzed the effect of different values of k and of the size of the training set on the result of the kNN algorithm for handwritten Gujarati numerals. Patel and Desai [8, 9] presented two papers on handwritten Gujarati script. They addressed the problem of word segmentation for Gujarati handwritten script, using the Radon transform for skew detection and correction and horizontal profiles for line segmentation. For extracting words from a line, the authors suggested vertical profiles after the morphological operation called 'closing'. Patel and Desai [18] presented work on handwritten Gujarati alphabet recognition. They proposed a hybrid feature set and suggested a multi level hybrid recognition process for recognizing alphabets.
They achieved a low accuracy rate of 63 % by using a tree classifier and a kNN classifier in different stages. They also used a hybrid feature set made of structural features and moments of the alphabets. This approach achieved 89.24 % accuracy. Patel and Desai [9] addressed the problem of zone separation for handwritten Gujarati script. They identified the central line of a string and found the three zones of the string by analyzing its profile. In this work the authors used the Euclidean distance to achieve accuracies of 75.2, 75.2, and 86.6 % for


separating upper, middle, and lower zones respectively. In 2011, Maloo and Kale [10] proposed an algorithm for identifying handwritten Gujarati numerals. They suggested four different feature sets, obtained by subdividing the image of a numeral into different sizes, and proposed to use four invariant moments of these subdivided images. With a support vector machine as the classifier, the accuracy achieved was 90.55 %.

3 Support vector machine

The proposed algorithm uses SVM to classify handwritten Gujarati alphabets. SVM is a binary classifier: it divides a group of observations into two classes. It is a supervised learning model with an associated learning procedure. SVM was introduced by V. N. Vapnik, and its standard form was given by Cortes and Vapnik [11]. A support vector machine creates a hyperplane in n-dimensional space that separates two classes. It places the hyperplane where the distance to the nearest points of both classes is the largest. This distance is known as the "margin". A support vector machine gives efficient results if all parameters are set properly, so it is important to understand the parameters that play an important role in SVM.

Most learning models use two data sets: the training set and the testing set. The training set contains feature vectors and their matching target values, whereas the test set has only feature vectors. The purpose of a model like SVM is to identify the target value of an observation in the test set from its features. Suppose that a training set includes n feature vectors $x_i$, $i = 1, \dots, n$. Each of these vectors has a matching target value $y_i$, given by Eq. 1.

$$y_i = \begin{cases} +1 & \text{if } x_i \in \text{class 1} \\ -1 & \text{if } x_i \in \text{class 2} \end{cases} \tag{1}$$

The aim is to separate the two classes, class 1 and class 2, by a hyperplane. As shown in Fig. 2, the hyperplane $W^T X + b = 0$ separates the two classes if both conditions of Eq. 2 are satisfied.

$$\begin{aligned} W^T X + b &> 0 \quad \text{if } y_i = +1 \\ W^T X + b &< 0 \quad \text{if } y_i = -1 \end{aligned} \tag{2}$$

(Here, $W^T X$ is an inner product.) The decision function $f(x)$ is given by Eq. 3.

$$f(x) = \operatorname{sign}(W^T X + b) \tag{3}$$

The hyperplane $(W, b)$ separates the classes, where $W \in \mathbb{R}^d$ is normal to the hyperplane and $b \in \mathbb{R}$ is the offset of the hyperplane.

Obviously there could be many such hyperplanes, but the one needed is the one that maximizes the margin. No observation can fall inside the margin. The distance between $W^T X + b = 1$ and $W^T X + b = -1$ is given by Eq. 4.

$$\frac{2}{\|W\|} = \frac{2}{\sqrt{W^T W}} \tag{4}$$

Therefore the optimization problem that yields the support vector machine (SVM) is defined as

$$\min_{W,b} \; \frac{1}{2}\|W\|^2 \quad \text{subject to} \quad y_i (W^T x_i + b) \ge 1, \quad i = 1, \dots, n. \tag{5}$$

An equivalent quadratic programming problem is

$$\min_{W,b} \; \frac{1}{2} W^T W \quad \text{subject to} \quad y_i (W^T x_i + b) \ge 1, \quad i = 1, \dots, n. \tag{6}$$

Thus an SVM separates two groups from one another. Forty alphabets, however, cannot be classified by using a single SVM directly. SVM can still be used in such situations through two variations: one against all and one against one. One against one compares each group with each of the remaining groups in turn, whereas one against all compares each group with a single group consisting of all the remaining groups. SVM can also be used with the kernel technique, in which different types of nonlinear or even linear kernels are used to separate the groups. Two well-known nonlinear kernels are the polynomial kernel and the Gaussian kernel.
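The quantities in Eqs. 3 to 6 can be illustrated with a minimal Python sketch. The toy samples and the hyperplane `(w, b)` below are hypothetical values chosen by hand for illustration; they are not taken from the paper, and no training is performed. The last lines count how many binary SVMs the two multi-class variations would need for forty alphabets.

```python
import math
from itertools import combinations


def score(w, b, x):
    """Return W^T x + b for one feature vector."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def decision(w, b, x):
    """Eq. 3: class label +1 or -1 from the sign of W^T x + b."""
    return 1 if score(w, b, x) > 0 else -1


def margin(w):
    """Eq. 4: distance between the planes W^T x + b = +1 and W^T x + b = -1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


def satisfies_constraints(w, b, samples):
    """Eqs. 5 and 6: every training pair must satisfy y_i (W^T x_i + b) >= 1."""
    return all(y * score(w, b, x) >= 1 for x, y in samples)


# Hypothetical, linearly separable toy data: (feature vector, target value).
samples = [((2.0, 2.0), 1), ((3.0, 3.0), 1), ((0.0, 0.0), -1), ((-1.0, 0.0), -1)]
w, b = (0.5, 0.5), -1.0  # hand-picked hyperplane, not the output of a solver

feasible = satisfies_constraints(w, b, samples)
width = margin(w)

# Multi-class variations for forty alphabets.
n_classes = 40
one_against_all = n_classes                                   # one SVM per alphabet
one_against_one = len(list(combinations(range(n_classes), 2)))  # one SVM per pair
```

With forty classes, one against all needs 40 binary classifiers while one against one needs 780, which is one practical consideration when choosing between the two variations.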

4 Preprocessing

In the preprocessing stage, images are prepared for classification. Handwritten Gujarati alphabets were collected from about 200 writers (7,960 alphabets in all) during the experiments. Because they are handwritten, these alphabets differ in size, orientation, color, thickness, etc. The quality of the paper and of the writing pens are also important factors. To make the images more useful for automated recognition, they need to be refined. This refinement is known as preprocessing.

First, the image of a standalone alphabet is converted into a binary image using a global threshold method [12]. Otsu's method finds the global threshold by Eq. 7

$$\sigma_G^2 = W_F(t)\,\sigma_F^2(t) + W_B(t)\,\sigma_B^2(t), \tag{7}$$

where $\sigma_G^2$ is the intra class (foreground and background) variance, $W_F$ and $W_B$ are the weights of the foreground and background classes respectively, $\sigma_F^2$ and $\sigma_B^2$ are the variances of the foreground and background classes respectively, and $t$ is the threshold value that minimizes the intra class variance.

Fig. 2 Support vector

From this black and white image, the alphabet is cropped to the smallest box that contains it. The height and width of this box give one of the features used in the feature set, the aspect ratio. Since the target is to analyze the basic shape of a character for building the feature set, the handwritten character is reduced to a thickness of one pixel. Before thinning, the character boundary is smoothed so that this information is preserved. Dilation, a morphological operation, is used for smoothing. Let us assume two images A and B, where A is the cropped image of an alphabet and B is a small image of 2 × 2 pixels of foreground color. The dilation of the image is then given by Eq. 8.

$$A \oplus B = \bigcup_{x \in B} A_x \tag{8}$$
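The binarization and dilation steps described above can be sketched in plain Python. This is an illustrative sketch, not the author's implementation: the function names, the convention that pixels below the threshold form one class, and the placement of the origin of B at its top-left pixel are all assumptions made here.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that minimizes the intra class variance of Eq. 7.

    `pixels` is a flat list of integer gray levels in [0, levels).
    Pixels with value < t form one class, the rest form the other.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    best_t, best_var = 0, float("inf")
    for t in range(1, levels):
        n_low = sum(hist[:t])
        n_high = total - n_low
        if n_low == 0 or n_high == 0:
            continue  # one class is empty, so Eq. 7 is undefined for this t
        mean_low = sum(g * hist[g] for g in range(t)) / n_low
        mean_high = sum(g * hist[g] for g in range(t, levels)) / n_high
        var_low = sum(hist[g] * (g - mean_low) ** 2 for g in range(t)) / n_low
        var_high = sum(hist[g] * (g - mean_high) ** 2 for g in range(t, levels)) / n_high
        within = (n_low / total) * var_low + (n_high / total) * var_high
        if within < best_var:
            best_var, best_t = within, t
    return best_t


def dilate(image, offsets=((0, 0), (0, 1), (1, 0), (1, 1))):
    """Eq. 8: union of the translates of binary image A by every offset in B.

    The default offsets describe a 2 x 2 block of foreground pixels whose
    origin is assumed to be its top-left pixel.
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if image[r][c]:
                for dr, dc in offsets:
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        out[rr][cc] = 1
    return out


# Hypothetical data: dark ink (10) on light paper (200), and a single ON pixel.
t = otsu_threshold([10] * 50 + [200] * 50)
thick = dilate([[0, 0, 0], [0, 1, 0], [0, 0, 0]])
```

For the two-valued toy image any threshold between the two gray levels gives zero intra class variance, so the sketch simply returns the first such value. Dilating the single ON pixel produces a 2 × 2 block, which is how a thin stroke is thickened before resizing and thinning.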

This stage of the alphabet image gives another feature, the extent. The next preprocessing step is to bring all these images of varying size to a uniform size of 16 × 16 pixels. An image interpolation algorithm with the 'bilinear' method is used for resizing. The 'bilinear' method produces smoother edges than other methods such as the nearest neighbor method. Thinning, also known as 'skeletonization', is then performed on the 16 × 16 pixel image using an algorithm proposed by Lam et al. [13]. The characters after thinning look as shown in Fig. 3. These images are the inputs for feature extraction.

Fig. 3 Alphabet in various preprocessing stages: original character, enlarged original alphabet, enlarged smooth alphabet, resized alphabet, and finally its skeleton

5 Feature extraction

Three types of feature extraction methods, structural [6, 7, 14], statistical [1, 3, 5, 10, 15, 16], and hybrid [6, 7, 18], are frequently used for handwritten character recognition. The shapes of many handwritten Gujarati characters are close to one another, so the complexity of recognition is high. In view of this, a hybrid method for extracting features of Gujarati characters was developed. Some Gujarati alphabets are broader than they are tall and vice versa, and some have almost equal height and width, as shown in Fig. 4. The aspect ratio, given by Eq. 9, captures this strong feature.

$$\text{Aspect Ratio} = \frac{\text{Height of alphabet}}{\text{Width of alphabet}} \tag{9}$$

Fig. 4 Gujarati alphabets with different aspect ratios

The spread of an alphabet in a fixed box is known as its extent. It is calculated as the ratio of the number of ON (foreground) pixels to the total number of pixels in the bounding box that fits the given alphabet (Eq. 10).

$$\text{Extent} = \frac{\text{Foreground pixels in bounding box}}{\text{Total pixels in bounding box}} \tag{10}$$

Besides these two statistical features, aspect ratio and extent, sixteen more features are obtained by dividing the 16 × 16 pixel image into sixteen sub images of 4 × 4 pixels each. The number of foreground pixels in each sub image gives a feature, as in [6, 7]. Thus a total of eighteen features were used in this research.
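The eighteen-element feature vector can be assembled as in the following sketch. It assumes binary images stored as lists of rows with 1 for foreground, at least one foreground pixel in the input, and an arbitrary ordering of the features (aspect ratio, extent, then the sixteen zone counts in row-major order). The function names are hypothetical, and resizing and thinning are assumed to have been done elsewhere.

```python
def bounding_box(image):
    """Return (top, bottom, left, right) of the smallest box holding all ON pixels."""
    rows = [r for r, row in enumerate(image) if any(row)]
    cols = [c for c in range(len(image[0])) if any(row[c] for row in image)]
    return rows[0], rows[-1], cols[0], cols[-1]


def aspect_ratio_and_extent(image):
    """Eq. 9 (height / width) and Eq. 10 (ON pixels / box pixels)."""
    top, bottom, left, right = bounding_box(image)
    height = bottom - top + 1
    width = right - left + 1
    on_pixels = sum(
        image[r][c] for r in range(top, bottom + 1) for c in range(left, right + 1)
    )
    return height / width, on_pixels / (height * width)


def zone_counts(image16, zone=4):
    """Count ON pixels in each 4 x 4 sub image of a 16 x 16 image, row-major."""
    counts = []
    for r0 in range(0, 16, zone):
        for c0 in range(0, 16, zone):
            counts.append(
                sum(
                    image16[r][c]
                    for r in range(r0, r0 + zone)
                    for c in range(c0, c0 + zone)
                )
            )
    return counts


def feature_vector(binary_image, thinned16):
    """Two statistical features followed by sixteen zone counts (18 in total)."""
    ratio, extent = aspect_ratio_and_extent(binary_image)
    return [ratio, extent] + zone_counts(thinned16)


# Hypothetical inputs: a tall 4 x 2 shape and a 16 x 16 diagonal stroke.
shape = [[1, 1], [1, 0], [1, 0], [1, 1]]
diagonal = [[1 if r == c else 0 for c in range(16)] for r in range(16)]
features = feature_vector(shape, diagonal)
```

For the toy inputs the aspect ratio is 2.0, the extent is 0.75, and only the four zones on the diagonal contain foreground pixels.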

6 Results and discussion

The proposed algorithm is tested on a dataset of forty alphabets collected from one hundred ninety nine writers of different age groups and genders. The dataset thus contains a total of seven thousand nine hundred and sixty handwritten Gujarati alphabets. The samples collected from the writers were scanned with a flatbed scanner at a resolution of 300 dpi. SVM [14, 17] with a polynomial kernel is suggested for this algorithm, as its success rate is the best compared with that of SVM with a Gaussian kernel and that of the kNN classifier. Table 1 shows that when the classifier is trained with four thousand alphabets, kNN gives the best classification result, 77.39 %, compared with the SVM classifier with a Gaussian kernel and with a polynomial kernel with kernel option c = 1. The same holds for a training set of five thousand alphabets. In this case the kNN classifier gives a success rate of 84.22 %, whereas the SVM classifier with Gaussian kernel and polynomial kernel gives 76.38 and 73.86 % respectively. This means that the kNN classifier performs better than the SVM classifiers with Gaussian and polynomial kernels (c = 1). Interestingly, the polynomial and JCB kernels give the same result.

The algorithm is further tested with the SVM classifier with a polynomial kernel with kernel value c = 2. In this case the algorithm gives success rates of 81.48 and 86.66 % when the classifier is trained with 4,000 and 5,000 alphabets respectively. The structural feature set alone is also tested with SVM with polynomial kernel (c = 2) and with the kNN classifier. Here too the SVM classifier with polynomial kernel (c = 2) performs better than the kNN classifier. Table 2 summarizes the results obtained with the structural feature set using the kNN and SVM classifiers.

Fig. 5 Accuracy of identification of various Gujarati alphabets using hybrid feature set and SVM classifier

Table 1 Summary of results of hybrid feature set with various classifiers (entries are counts of correctly identified alphabets, with percentages in parentheses)

Feature set   Classifier                       Size of training set   Success of seen data (%)   Success of unseen data (%)   Total success rate (%)
Hybrid        SVM (Gaussian kernel)            4000                   4000 (100)                 1375 (34.72)                 5375 (67.75)
Hybrid        SVM (Polynomial kernel) c = 1    4000                   3374 (84.35)               2391 (60.38)                 5765 (72.42)
Hybrid        SVM (JCB kernel)                 4000                   3374 (84.35)               2391 (60.38)                 5765 (72.42)
Hybrid        kNN                              4000                   4000 (100)                 2160 (54.55)                 6160 (77.39)
Hybrid        SVM (Gaussian kernel)            5000                   5000 (100)                 1080 (36.49)                 6080 (76.38)
Hybrid        SVM (Polynomial kernel) c = 1    5000                   4042 (80.84)               1855 (62.67)                 5897 (73.86)
Hybrid        SVM (JCB kernel)                 5000                   4042 (80.84)               1855 (62.67)                 5897 (73.86)
Hybrid        kNN                              5000                   5000 (100)                 1704 (57.57)                 6704 (84.22)
Hybrid        SVM (Polynomial kernel) c = 2    4000                   4000 (100)                 2486 (62.78)                 6486 (81.48)
Hybrid        SVM (Polynomial kernel) c = 2    5000                   5000 (100)                 1898 (64.12)                 6898 (86.66)
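The three percentage columns of Table 1 can be re-derived from the counts. The sketch below assumes that the unseen data are simply the alphabets not used for training (7,960 minus the training set size), which is consistent with the printed percentages. The helper name `rates` is hypothetical.

```python
TOTAL_ALPHABETS = 40 * 199  # forty alphabets from 199 writers = 7,960 samples


def rates(train_size, seen_correct, unseen_correct, total=TOTAL_ALPHABETS):
    """Return (seen %, unseen %, total %) from counts of correct identifications."""
    unseen_size = total - train_size
    seen_rate = 100 * seen_correct / train_size
    unseen_rate = 100 * unseen_correct / unseen_size
    total_rate = 100 * (seen_correct + unseen_correct) / total
    return seen_rate, unseen_rate, total_rate


# Last row of Table 1: SVM, polynomial kernel, c = 2, 5,000 training alphabets.
seen, unseen, overall = rates(5000, 5000, 1898)
print(f"{seen:.2f} {unseen:.2f} {overall:.2f}")  # prints: 100.00 64.12 86.66

# kNN row with 5,000 training alphabets, for comparison.
knn_seen, knn_unseen, knn_overall = rates(5000, 5000, 1704)
```

The calculation makes explicit that the total success rate combines seen and unseen data: for the best configuration, 86.66 % overall corresponds to 64.12 % on the 2,960 unseen alphabets.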

Table 2 Summary of results of structural feature set with various classifiers (entries are counts of correctly identified alphabets, with percentages in parentheses)

Feature set   Classifier                 Size of training set   Success of seen data (%)   Success of unseen data (%)   Total success rate (%)
Structural    SVM (polynomial) c = 2     4000                   4000 (100)                 2426 (61.26)                 6426 (80.72)
Structural    SVM (polynomial) c = 2     5000                   5000 (100)                 1838 (62.09)                 6838 (85.90)
Structural    kNN                        4000                   4000 (100)                 2193 (55.38)                 6193 (77.80)
Structural    kNN                        5000                   5000 (100)                 1684 (56.89)                 6684 (83.99)

Comparing Tables 1 and 2, it is evident that the hybrid feature set of structural and statistical features with the SVM classifier with polynomial kernel (c = 2) performs better than the other combinations studied in this paper. Figure 5 shows a graph of the success rate of identification of the various Gujarati alphabets. From Fig. 5 it is seen that the highest success rate for a single alphabet is 97.97 %, whereas the lowest is 73.89 %. Some Gujarati alphabets are confusing because they are similar in shape. Figure 6 shows these handwritten Gujarati alphabets. Table 3 shows how often each of these confusing alphabets is identified as one of the other confusing alphabets. Table 3 also shows that one of these alphabets is confused with the others more often than the rest. There are three major works on the identification of handwritten Gujarati numbers and alphabets, which are summarized in Table 4. Desai [6, 7] obtained the highest accuracy for handwritten Gujarati numbers (0–10), but the same experiment gave a lower accuracy rate than the proposed algorithm.

Fig. 6 The character that is confused the most number of times

Table 3 Identification of confusing alphabets

147    14     2     8    10     2
  3   169     5     1     5     8
  2     8   178     0     0     0
 11     2     0   173     3     1
  6     8     3     3   158     3
  6     8     0     1     6   168

Table 4 Comparison of various approaches of handwritten Gujarati OCR

Work                      Length of feature set   Classifier                   Numbers/alphabets   Accuracy
Desai AA [6]              95                      Artificial neural network    Numbers             81.66 %
Desai AA [7]              16                      kNN                          Numbers             96.99 %
Patel CN, Desai AA [9]    16                      kNN, tree classifier         Alphabets           63 %
Proposed                  18                      SVM with polynomial kernel   Alphabets           86.66 %

7 Conclusion

Gujarati alphabet recognition poses many challenges, such as the shapes of the alphabets and their number. This work presents an algorithm for handwritten Gujarati alphabet recognition. It shows that, for Gujarati handwritten alphabet identification, the hybrid feature set is more effective than the simple structural feature set. SVM with polynomial kernel (c = 2) also gives the best accuracy of Gujarati handwritten alphabet identification compared with other classifiers such as kNN and SVM with Gaussian kernel. Although this work achieved 86.66 % identification accuracy, considerable improvement is still needed.

Acknowledgments The author acknowledges the support of University Grants Commission (UGC), New Delhi, for this research work through project file no. F. 42-127/2013,

References

1. Antani S, Agnihotri L (1999) Gujarati character recognition. In: Proceedings of fifth international conference on document analysis and recognition (ICDAR'99), pp 418–421
2. Dholakia J, Negi A, Rama Mohan S (2005) Zone identification in the printed Gujarati text. In: Proceedings of the eight international conference on document analysis and recognition (ICDAR'05)
3. Shah SK, Sharma A (2006) Design and implementation of optical character recognition system to recognize Gujarati script using template matching. J Inst Eng (India) Electron Telecommun Eng Div 86:44–49
4. Dholakia J, Yajnik A, Negi A (2007) Wavelet feature based confusion character sets for Gujarati script. In: Proceeding of international conference on computational intelligence and multimedia application, ICCIMA (2007) 366–371 doi:10.1109/ICCIMA.2007/7.230
5. Dholakia J, Negi A, Rama Mohan S (2009) Progress in Gujarati document processing and character recognition, guide to OCR for Indic scripts, advances in pattern recognition. Springer-Verlag, London, pp 73–95. doi:10.1007/978-1-84800-330-9_4
6. Desai AA (2010) Gujarati handwritten numeral optical character recognition through neural network. Pattern Recognit 2582–2589:2010. doi:10.1016/j.patcog.01.008
7. Desai AA (2010) Handwritten Gujarati numeral optical character recognition. In: Proceeding of international conference on image processing, computer vision and pattern recognition (IPCV'10), Vol. II, 733–739
8. Patel CN, Desai AA (2010) Segmentation of text lines into words for Gujarati handwritten text. In: Proceedings of international conference on signal and image processing (ICSIP'10), IEEEXplore, 15–17
9. Patel CN, Desai AA (2011) Zone identification for Gujarati handwritten words. In: Proceedings of international conference on emerging applications of information technology (EAIT 2011), IEEEXplore, 19–20
10. Maloo M, Kale KV (2011) Support vector machine based Gujarati numeral recognition. Int J Comput Sci Eng 3–7:2595–2600
11. Cortes C, Vapnik V (1995) Support vector network. Mach Learn 20:273–297
12. Otsu N (1979) A threshold selection method from gray-level histograms. IEEE Trans Syst Man Cybern 9(1):62–66. doi:10.1109/TSMC.1979.4310076
13. Lam L, Lee S-W, Suen CY (1992) Thinning methodologies-a comprehensive survey. IEEE Trans Pattern Anal Mach Intell 14(9):879
14. Sheethalkumari R, Sreeranjani TR, Balachandar T (2005) Optical character recognition for printed tamil text using unicode. J Zhejiang Univ Sci 64–11:1297–1305
15. Majumdar A (2007) Bangla basic character recognition using digital curvlet transform. J Pattern Recognit Res 1:17–26
16. Abdul Rahiman M, Rajasree MS (2009) Printed malayalam character recognition using back propagation neural networks. IEEE international advance computing conference (IACC 2009) 197–201 doi:10.1109/IADCC.2009.4809006
17. Bhowmik TK, Ghanty P, Roy A, Parui SK (2009) SVM-based hierarchical architectures for handwritten Bangla character recognition. Int J Doc Anal Recognit (IJDAR) 12(2):97–108
18. Patel CN, Desai AA (2013) Gujarati handwritten character recognition using hybrid method based on binary tree-classifier and k-nearest neighbour. Int J Eng Res Technol II(6):2337–2345