REVIEW PAPER International J. of Recent Trends in Engineering and Technology, Vol. 3, No. 2, May 2010

Printed and Handwritten Character & Number Recognition of Devanagari Script using SVM and KNN

Anilkumar N. Holambe1, Dr. Ravinder C. Thool2

1 Department of Information Technology, College of Engineering, Osmanabad-413512 (M.S.), India
[email protected]
2 Professor and Head, Department of Information Technology, Shri Guru Gobind Singhji Institute of Engg & Technology, Vishnupuri, Nanded-431606 (M.S.), India
[email protected]

Abstract: Recognition of Devanagari script is a challenging problem. In Optical Character Recognition (OCR), a character or symbol to be recognized can be machine printed or handwritten. Several approaches deal with the problem of recognizing numerals and characters. In this paper we compare SVM and KNN on handwritten as well as printed character and numeral databases; for this purpose we created four different databases.

Index Terms: Devanagari, Optical Character Recognition, SVM, KNN, Database.

I. INTRODUCTION

Handwritten and printed character and numeral recognition has significant application potential. The need for such a system is increasingly felt in India. However, although significant progress has been achieved for the scripts of developed nations, little work has been reported on Devanagari script. Developing a handwritten and printed character recognition engine for any script is always a challenging problem, mainly because of the enormous variability of handwriting styles. These factors provided the motivation for the proposed research work.

For this work, we have studied off-line recognition of handwritten Devanagari numerals. Devanagari is the script of a number of Indian languages, including Hindi and Marathi. Hindi is the third most popularly used language in the world after Chinese and English. The earliest available work on recognition of hand-printed Devanagari characters is found in [1]. For recognition of handwritten Devanagari numerals, Ramakrishnan et al. [2] used an independent component analysis technique for feature extraction from numeral images. Bajaj et al. [3] considered a strategy combining decisions of multiple classifiers. In all three studies, very small sets of samples were considered. In an attempt to develop a bilingual handwritten numeral recognition system, Lehal and Bhatt [4] used a set of global and local features derived from the right and left projection profiles of the numeral images for recognition of handwritten numerals of Devanagari and Roman scripts. There are also some studies on handwritten character recognition of other Indian scripts [5,6].

The development of an efficient handwriting recognition system needs a large set of samples with ground truth. Generating such a data set is always difficult since it is time consuming and labor intensive [7]. Such standard data sets did not exist for any Indian script. Recently, however, a large database of handwritten Devanagari numeral and character images has been developed by the authors. In this paper we have implemented SVM and KNN classifiers. Both are used on the printed and handwritten databases.

II. EXTRACTION OF FEATURES

A. Database of printed and handwritten Devanagari numerals and characters

In the present work we have developed printed and handwritten databases. For the printed database we used different ISM Office fonts, and for the handwritten database we collected data from people of different age groups and professions. The data were scanned at 600 dpi using an HP flatbed scanner and stored as gray-level images. A few samples from this database are shown in Fig. 1.

© 2010 ACEEE DOI: 01.IJRTET.03.02.187


III. RECOGNITION SCHEME

Several classification approaches could be considered, but we use only SVM and KNN classification and compare their results.

Figure 1. Samples from database of handwritten Devanagari characters and numerals.

Figure 2. Samples from database of printed Devanagari characters.

The database is divided into mutually exclusive training and test sets. For the numeral data, the samples in these sets are distributed over 10 digit classes. The Devanagari script has about 11 vowels ('svar'), 33 consonants ('vyanjan'), and 11 modifiers, so we organized the character data into 55 character classes. The handwritten database was collected from Marathi speakers.
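The class organization and the disjoint training/test split described above can be illustrated with a small sketch. The paper does not state the split ratio or the sampling procedure, so the per-class (stratified) sampling, the `test_fraction` value, the `stratified_split` helper, and the toy label list are all assumptions made only for illustration.

```python
import random
from collections import defaultdict

# Class counts stated in the text.
NUM_DIGIT_CLASSES = 10
NUM_CHARACTER_CLASSES = 11 + 33 + 11  # vowels + consonants + modifiers


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split sample indices into disjoint train/test lists, class by class.

    test_fraction is a hypothetical value; the paper does not report one.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = max(1, round(len(indices) * test_fraction))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy data: 20 samples for each of the 10 digit classes.
labels = [digit for digit in range(NUM_DIGIT_CLASSES) for _ in range(20)]
train_idx, test_idx = stratified_split(labels)
```

Splitting class by class keeps every digit or character class present in both sets, which matters when some classes have fewer samples than others.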

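Figure 4. Our proposed workflow: handwritten and printed characters and numerals pass through preprocessing and feature extraction, and are then classified by SVM and by KNN, each producing its own results.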
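The preprocessing stage of this workflow is described in the Preprocessing subsection as Otsu binarization followed by median smoothing with a window of size 5. A minimal pure-Python sketch of those two steps is given here. It is not the authors' implementation: the function names, the clamped border handling, the dark-ink-on-light-paper convention, and the toy 12 x 12 image are assumptions.

```python
def otsu_threshold(gray):
    """Return the gray level t that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for row in gray:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with level <= t
    sum0 = 0  # sum of gray levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between = w0 * w1 * (mean0 - mean1) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(gray, t):
    """Mark ink pixels (assumed darker than the paper) as 1, background as 0."""
    return [[1 if value <= t else 0 for value in row] for row in gray]


def median_filter(img, size=5):
    """Median filter with a size x size window; borders clamp to the edge."""
    h, w = len(img), len(img[0])
    r = size // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = sorted(
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)
            )
            out[y][x] = window[len(window) // 2]
    return out


# Toy scan: light paper (220), a dark 6 x 6 stroke (30), one dark noise pixel.
image = [[220] * 12 for _ in range(12)]
for y in range(3, 9):
    for x in range(3, 9):
        image[y][x] = 30
image[10][1] = 30

t = otsu_threshold(image)
binary = binarize(image, t)
cleaned = median_filter(binary, size=5)
```

In this toy case the isolated noise pixel survives binarization but is removed by the median filter, while the interior of the stroke is preserved. A window of size 5 also erodes thin corners, so the window size trades noise removal against stroke detail.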

B. Preprocessing

The preprocessing steps performed in this work are rectification of distorted images, improvement of image quality to ensure better edges in the subsequent edge determination step, and size normalization. During scanning, some distortion may be introduced by pen quality, light handwriting, or poor quality of the paper on which the numerals and characters are written. Further, edges often show discontinuities, which lead to erroneous feature extraction. To clean possible noise, the input image is first binarized by Otsu's thresholding technique and then smoothed using a median filter of window size 5. The gray image is first reduced to an intensity image and approximately segmented by intensity thresholding; the segmentation is then refined using image edges. Each character or numeral is segmented into vertical, horizontal, and arc segments.

C. K-Nearest Neighbor Classification

The k-nearest neighbors classifier is used for classifying the Devanagari characters and numerals. A detailed discussion of the KNN classification technique can be found in Dasarathy [8]. A short but formal definition of KNN classification is as follows. Given a set of prototype vectors

$$T_{XY} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)\},$$

with input vectors $x_i \in X \subseteq \mathbb{R}^n$ and corresponding targets $y_i \in Y = \{1, 2, \ldots, c\}$, let

$$R^n(x) = \{x' : \|x - x'\| \le r^2\}$$

be a ball centered at the vector $x$ in which $K$ prototype vectors $x_i$, $i \in \{1, 2, \ldots, l\}$, lie, i.e. $|\{x_i : x_i \in R^n(x)\}| = K$. The k-nearest neighbor classification rule $q : X \to Y$ is defined as

$$q(x) = \arg\max_{y \in Y} v(x, y),$$

where $v(x, y)$ is the number of prototype vectors $x_i$ with targets $y_i = y$ which lie in the ball, $x_i \in R^n(x)$.
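The classification rule $q(x) = \arg\max_y v(x, y)$ can be sketched in a few lines. The value of $K$ and the distance measure are not stated in the paper, so `k=3` and the Euclidean distance are assumptions here, as are the `knn_classify` helper and the toy prototype set.

```python
import math
from collections import Counter


def knn_classify(prototypes, x, k=3):
    """Return the label with the most votes among the k nearest prototypes.

    prototypes is a list of (vector, label) pairs. Ties are resolved in
    favor of the tied label whose prototype is nearest to x.
    """
    nearest = sorted(prototypes, key=lambda p: math.dist(p[0], x))[:k]
    votes = Counter(label for _, label in nearest)  # v(x, y) for each label y
    return votes.most_common(1)[0][0]


# Toy prototype set with two classes in a 2-D feature space.
prototypes = [
    ((0.0, 0.0), 1), ((0.0, 1.0), 1), ((1.0, 0.0), 1),
    ((5.0, 5.0), 2), ((5.0, 6.0), 2), ((6.0, 5.0), 2),
]
label_a = knn_classify(prototypes, (0.5, 0.5))
label_b = knn_classify(prototypes, (5.2, 5.1))
```

Every query sorts all prototypes by distance, so the cost of a single classification grows with the number of labeled samples.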

Figure 3. Sample of preprocessed handwritten character: (a1) original image, (b1) sharpened and smoothed, (c1) filtered, (d1) edge detected, (f1) segmented, (f2) segmented numeral.

We have processed all printed characters and numerals similarly, as shown in Figure 3.

D. Support Vector Machine (SVM) Classification

Support vector machines [10,9] are pair-wise discriminating classifiers with the ability to identify the decision boundary with maximal margin. A maximal margin results in better generalization, a highly desirable property for a classifier to perform well on a novel data set. From the statistical learning theory point of view, they are less complex (smaller VC dimension) and perform better (lower actual error) with limited training data. Classifiers like KNN provide excellent results (very low empirical error) on the data set on which they are trained, while their capability to generalize to a new data set depends on the labeled samples used.

Identification of the optimal hyperplane for separation involves maximization of an appropriate objective function. The training process results in the identification of a set of important labeled support vectors $x_i$ and a set of coefficients $\alpha_i$. Support vectors are the samples near the decision boundary and the most difficult to classify. They have class labels $y_i$ (i.e., $\pm 1$). The decision is made from

$$f(x) = \operatorname{sgn}\left(\sum_{i=1}^{l} \alpha_i y_i K(x_i, x)\right).$$

The function $K$ in the previous equation is called the kernel function. It is defined as $K(x, y) = \phi(x) \cdot \phi(y)$, where $\phi : \mathbb{R}^d \to H$ maps the data points in $d$ dimensions to a higher dimensional (possibly infinite dimensional) space $H$. For a linear SVM, $K(x, y) = x \cdot y$. We do not need to know the values of the images of the data points in $H$ to solve the problem in $H$. By choosing specific kernel functions, we can arrive at neural networks or radial basis functions.
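The decision function $f(x) = \operatorname{sgn}(\sum_i \alpha_i y_i K(x_i, x))$ can be evaluated directly once the support vectors and coefficients are known. The sketch below only evaluates the function and does not train an SVM. The support vectors and $\alpha_i$ values are hand-picked for illustration, the `gamma` parameter of the radial basis kernel is an assumption, and, as in the equation above, no bias term is included.

```python
import math


def linear_kernel(a, b):
    """K(x, y) = x . y"""
    return sum(ai * bi for ai, bi in zip(a, b))


def rbf_kernel(a, b, gamma=0.5):
    """Radial basis kernel; gamma is a hypothetical width parameter."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def svm_decide(support_vectors, alphas, labels, x, kernel=linear_kernel):
    """Evaluate f(x) = sgn(sum_i alpha_i * y_i * K(x_i, x)).

    A sum of exactly zero is mapped to +1 by convention.
    """
    s = sum(
        alpha * y * kernel(sv, x)
        for sv, alpha, y in zip(support_vectors, alphas, labels)
    )
    return 1 if s >= 0 else -1


# Hand-picked values, not the output of a training run.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
alphas = [0.5, 0.5]
labels = [+1, -1]

pos = svm_decide(support_vectors, alphas, labels, (2.0, 0.5))
neg = svm_decide(support_vectors, alphas, labels, (-1.0, -3.0))
pos_rbf = svm_decide(support_vectors, alphas, labels, (2.0, 0.5), kernel=rbf_kernel)
```

Only the kernel changes between the linear and radial basis cases; the mapping $\phi$ is never computed explicitly. A decision involves only the support vectors, unlike KNN, which consults every stored prototype.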

IV. RESULT AND OBSERVATION

Data used for the present work were collected from different individuals. We considered 15000 basic characters (vowels as well as consonants) and 10000 numeral samples of Devanagari for the experiments. We also formed a printed database from ISM Office fonts, using a font size of 16 and different fonts.

TABLE I. RECOGNITION RATE OF KNN CLASSIFIER

Samples | Vowels ('svar') (%) | Consonants ('vyanjan') without modifiers (%) | Consonants ('vyanjan') with modifiers (%)
Handwritten | 79.08 | 76.02 | 73.07
Printed | 94 | 91 | 89

TABLE II. RECOGNITION RATE OF SVM CLASSIFIER

Samples | Vowels ('svar') (%) | Consonants ('vyanjan') without modifiers (%) | Consonants ('vyanjan') with modifiers (%)
Handwritten | 85 | 83 | 80
Printed | 93 | 95 | 94

TABLE III. REJECTION VERSUS ERROR RATE COMPUTATION

Samples | Printed database: Rejection (%) | Printed database: Error (%) | Handwritten database: Rejection (%) | Handwritten database: Error (%)
Numerical | 4.00 | 30.00 | 39.00 | 19.00
Vowels ('svar') | 5.00 | 20.00 | 28.00 | 18.00
Consonants ('vyanjan') without modifiers | 7.00 | 18.01 | 25.00 | 15.00
Consonants ('vyanjan') with modifiers | 15.00 | 11.00 | 30.00 | 10.00

From our experiments we observe that the recognition accuracy of the SVM classifier is higher than that of the KNN classifier, as shown in Tables I and II, and that accuracy is higher on the printed database than on the handwritten database. The rejection rate is very high on the handwritten database. Support vector machines provide better results than KNN classifiers. The computational complexity of KNN increases as more labeled samples are added.

Figure 5. Error rate of printed and handwritten database (error in % plotted against sample categories 1 to 4 for the handwritten and printed databases).
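Table III reports rejection and error rates, but the paper does not define how a sample is rejected or how the percentages are normalized. The sketch below shows one common convention, stated here as an assumption: a classifier may decline to label a sample (represented by `None`), and recognition, rejection, and error are each expressed as a percentage of all test samples. The `rejection_error_rates` helper and the toy predictions are hypothetical.

```python
def rejection_error_rates(predictions, truths, reject_label=None):
    """Return recognition, rejection, and error rates in percent.

    A prediction equal to reject_label counts as rejected; any other
    prediction that differs from the true label counts as an error.
    """
    total = len(truths)
    rejected = sum(1 for p in predictions if p is reject_label)
    wrong = sum(
        1 for p, t in zip(predictions, truths)
        if p is not reject_label and p != t
    )
    correct = total - rejected - wrong
    return {
        "recognition": 100.0 * correct / total,
        "rejection": 100.0 * rejected / total,
        "error": 100.0 * wrong / total,
    }


# Toy outcome for ten test samples; None marks a rejected sample.
predictions = [1, 2, None, 3, 3, None, 0, 1, 2, 9]
truths = [1, 2, 4, 3, 5, 6, 0, 1, 2, 8]
rates = rejection_error_rates(predictions, truths)
```

Under this convention the three rates sum to 100, and raising the rejection rate typically lowers the error rate because doubtful samples are withheld rather than misclassified.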


CONCLUSIONS

The performance of character recognition depends on the accuracy of stroke recognition. The results obtained for recognition of Devanagari show that reliable classification is possible using SVMs. The results also indicate scope for further improvement, especially in the recognition of confusing characters. The SVM classifier has an advantage over other classifiers because an Indian-language OCR system generally has a large number of classes and high-dimensional feature vectors. The variability of characters is also very high at each occurrence. SVMs are well suited for such problems since they have excellent generalization capability.

REFERENCES

[1] I.K. Sethi and B. Chatterjee, "Machine recognition of constrained handprinted Devanagari", Pattern Recognition, Vol. 9, 1977, pp. 69-75.
[2] K.R. Ramakrishnan, S.H. Srinivasan and S. Bhagavathy, "The independent components of characters are 'Strokes'", Proc. of the 5th ICDAR, 1999, pp. 414-417.
[3] R. Bajaj, L. Dey and S. Chaudhuri, "Devanagari numeral recognition by combining decision of multiple connectionist classifiers", Sadhana, Vol. 27, Part 1, 2002, pp. 59-72.
[4] G.S. Lehal and Nivedan Bhatt, "A recognition system for Devnagri and English handwritten numerals", Advances in Multimodal Interfaces, ICMI 2001, T. Tan, Y. Shi and W. Gao (Editors), LNCS, Vol. 1948, 2000, pp. 442-449.
[5] U. Bhattacharya, T.K. Das, A. Datta, S.K. Parui and B.B. Chaudhuri, "A hybrid scheme for handprinted numeral recognition based on a self-organizing network and MLP classifiers", IJPRAI, Vol. 16(7), 2002, pp. 845-864.
[6] M.B. Sukhaswami, P. Seetharamulu and A.K. Pujari, "Recognition of Telugu characters using neural networks", Int. J. Neural Syst., Vol. 6, 1995, pp. 317-357.
[7] Huanfeng Ma and David S. Doermann, "Adaptive Hindi OCR using generalized Hausdorff image comparison", ACM Trans. Asian Lang. Inf. Process., Vol. 2, 2003, pp. 193-218.
[8] B.V. Dasarathy, Nearest Neighbor Pattern Classification Techniques, IEEE Computer Society Press, New York, 1991.
[9] John C. Platt, N. Cristianini and J. Shawe-Taylor, "Large margin DAGs for multi-class classification", in Advances in Neural Information Processing Systems 12, 2000, pp. 547-553.
[10] M. Vidyasagar, A Theory of Learning and Generalization, Springer-Verlag, New York, 1997.
