Script identification in a handwritten document image using texture features

Hiremath P. S.
Department of Computer Science, Gulbarga University, Gulbarga, Karnataka, India
[email protected]

Shivashankar S.
Department of Computer Science, Karnatak Science College, Dharwad, Karnataka, India
[email protected]

Jagdeesh D. Pujari
Department of Information Science, SDM College of Engineering, Dharwad, Karnataka, India
[email protected]

V. Mouneswara
Department of Computer Science, Gulbarga University, Gulbarga, Karnataka, India
[email protected]

Abstract— Script identification for handwritten document images is an open document analysis problem. In this paper, we propose an approach to script identification for documents containing handwritten text using texture features. The texture features are extracted from the co-occurrence histograms of wavelet decomposed images, which capture information about the relationship between each high frequency subband and the low frequency subband of the transformed image at the corresponding level. The correlation between the subbands at the same resolution exhibits a strong relationship, indicating that this information is significant for characterizing a texture. The scheme is tested on seven Indian language scripts along with English. Our method is robust to the skew generated in the process of scanning a document and also to varying coverage of text. The experimental results demonstrate the effectiveness of the texture features in the identification of handwritten scripts. Experiments are also performed considering multiple writers.

Keywords— texture; wavelet; co-occurrence histogram; handwritten script identification; correlation

I. INTRODUCTION

Current research in the field of script and document analysis and recognition aims at conceiving and establishing an automatic system able to discriminate among a certain number of scripts in order to select the appropriate recognition system for a given document. In character recognition, for instance, almost all existing works require that the script and/or language of the processed document be known [1]. Most of the research work in the field of script identification is concerned only with the identification of scripts in printed documents, with the exception of Hochberg et al. [2], who have proposed a method for the identification of handwritten documents. Handwritten documents present three challenges for script identification. First, some scripts resemble each other more when handwritten than when printed. Second, handwriting styles are more diverse than printed fonts. Cultural differences, individual differences, and even differences in the way that people write at different times enlarge the inventory of possible character and word shapes seen in handwritten documents. Third, problems typically addressed in preprocessing, such as ruling lines and character fragmentation due to low contrast, are common in handwritten documents owing to the variety of papers and writing instruments used [2, 4]. Given its ubiquity in human transactions, machine recognition of handwriting has practical significance, such as in reading handwritten notes in postal document analysis, postal addresses on envelopes, amounts in bank checks, etc. [3].

All the methods proposed in the literature can be classified into three main categories according to the entity used to carry out script identification. One can distinguish methods based on the analysis of blocks of text, methods based on the analysis of lines, and methods based on the analysis of connected components. In the first case, a text block is considered as a whole. Such a method treats a text block as a single entity and thus does not resort to further analyses related to text lines or connected components; in this first approach, authors use the analysis and classification of textures of text blocks [7, 8, 9, 11, 12, 14]. In the second and third approaches, line level and word level segmentations are to be carried out, respectively, for further structural analysis [3].

In this paper, we propose an approach to the identification of handwritten scripts using texture features. The texture features are extracted from the co-occurrence histograms of wavelet decomposed images, which capture information about the relationship between each high frequency subband and the low frequency subband of the transformed image at the corresponding level. The correlation between the subbands at the same resolution exhibits a strong relationship, indicating that this information is significant for characterizing a texture. The scheme is tested on seven Indian language scripts in addition to English. Our method is robust to the skew generated in the process of scanning a document and also to varying coverage of text. The experimental results demonstrate the effectiveness of the texture features in the identification of handwritten scripts. Experiments are also performed considering multiple writers.

The paper is organized as follows: In Section 2, the texture feature extraction from the wavelet transformed image is discussed. The training and classification phases are explained in Section 3. In Section 4, the results of handwritten script identification are discussed in detail. Finally, Section 5 contains the conclusion.

II. TEXTURE FEATURE EXTRACTION

A. Discrete wavelet transform

The continuous wavelet transform of a 1-D signal $F(x)$ is defined as

$$(W_a F)(b) = \int F(x)\, \psi^{*}_{a,b}(x)\, dx$$

where the wavelet $\psi_{a,b}$ is computed from the mother wavelet $\psi$ by translation and dilation,

$$\psi_{a,b}(x) = \frac{1}{\sqrt{a}}\, \psi\!\left(\frac{x-b}{a}\right).$$

Under some mild assumptions, the mother wavelet satisfies the constraint of having zero mean. The transform can be discretized by restricting $a$ and $b$ to a discrete lattice ($a = 2^{b}$, $b \in \mathbb{Z}$). Typically, it is imposed that the transform should be non-redundant, complete, and that it constitutes a multiresolution representation of the original signal. The extension to the 2-D case is usually performed by using the product of 1-D wavelet filters. The Haar wavelet is defined as

$$\psi(t) = \begin{cases} 1, & 0 \le t < 1/2, \\ -1, & 1/2 \le t < 1, \\ 0, & \text{otherwise}, \end{cases}$$

for which it is easy to verify that $\{\psi_{m,n} = 2^{-m/2}\, \psi(2^{-m} t - n)\}_{m,n \in \mathbb{Z}}$ is an orthonormal basis for $L^{2}(\mathbb{R})$. Herein, the discretization $a = 2^{m}$ and $b = n 2^{m}$ is used, which will be pursued throughout this section. The Haar wavelet was, historically speaking, the first known wavelet.
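As a small illustration of the Haar analysis above, the following minimal Python sketch (assuming NumPy and the PyWavelets package are available; the toy signal is ours, not from the paper) applies a one-level 1-D Haar DWT, which produces scaled pairwise averages (approximation) and scaled pairwise differences (detail):

```python
import numpy as np
import pywt  # PyWavelets

# A toy 1-D signal of length 8.
signal = np.array([4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 2.0, 0.0])

# One-level Haar DWT: approximation carries (pairwise sums)/sqrt(2),
# detail carries (pairwise differences)/sqrt(2).
approx, detail = pywt.dwt(signal, 'haar')

print(approx)  # scaled pairwise averages: (4+6)/sqrt(2), (10+12)/sqrt(2), ...
print(detail)  # scaled pairwise differences (sign follows PyWavelets' convention)
```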

The simplest way to compute a 2-D discrete wavelet transform (DWT) of an image is to apply the one-dimensional transform over the image rows and columns separately and then carry out downsampling. This transform decomposes an image with an overall scale factor of four, providing at each level one low resolution subimage and three wavelet coefficient subimages [5, 10, 13]:

$$A = \left[ H_x * \left[ H_y * I \right]_{\downarrow 2,1} \right]_{\downarrow 1,2}$$
$$H = \left[ G_x * \left[ H_y * I \right]_{\downarrow 2,1} \right]_{\downarrow 1,2}$$
$$V = \left[ H_x * \left[ G_y * I \right]_{\downarrow 2,1} \right]_{\downarrow 1,2}$$
$$D = \left[ G_x * \left[ G_y * I \right]_{\downarrow 2,1} \right]_{\downarrow 1,2}$$

Here $I$ is the input image, $H_x$, $H_y$ and $G_x$, $G_y$ represent the low pass and high pass filters, respectively, $*$ denotes the convolution operator, and $\downarrow 2$ denotes the downsampling operation (by a factor of two along rows or columns, as indicated by the subscripts). The subbands labeled H, V, D correspond to the detail images, while A corresponds to the approximation image, as shown in Fig. 1. Every subimage contains the information at a specific scale and orientation.

Fig. 1. (a) Original image (b) 1-level wavelet transformed image (Haar), showing the A, H, V and D subbands

B. Feature extraction scheme

The feature extraction scheme is inspired by the observation that humans are capable of distinguishing between unfamiliar scripts just by visual inspection of a sample. We consider script identification as a texture classification process. In general, a texture is a complex visual pattern composed of subpatterns. Although the subpatterns can lack a good mathematical model, it is well established that a texture is analysed completely only if its subpatterns are well defined. We use a multiresolution approach based on the DWT for texture feature extraction and then classify the textures using the k-NN classifier. The feature extraction method is described below.

We consider an input script image X and apply a 2-D DWT on X using the Haar wavelet, which yields the approximation (A) and detail (H, V, D) subband images (Fig. 2). We consider a pair of images, say (A, H), and compute the co-occurrence histograms H1 and H2 for a given direction as described in [6]. For each histogram, we construct the normalized cumulative histogram and compute three features, namely, the slope of the regression line, the mean and the mean deviation, as shown in Fig. 2. This process is repeated for 8 directions, yielding 2 (histograms) x 3 (features) x 8 (directions) = 48 features for the pair (A, H). Similarly, features are extracted for the pairs (A, V), (A, D) and (A, abs(V-H-D)), yielding a total of 192 features for the input image X.

This procedure is repeated for the complementary image $\bar{X}$, defined as $\bar{x} = 255 - x$, $x$ being the gray value of a pixel in $X$.
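To make the pipeline above concrete, the following Python sketch (assuming NumPy and the PyWavelets package) outlines one plausible implementation of the 384-dimensional feature vector. The exact construction of the co-occurrence histograms H1 and H2 is given in [6]; here, as a stated assumption, we take them to be the two marginal histograms of the joint co-occurrence matrix of the quantized subband pair along each direction. The function names (quantize, hist_features, pair_features, script_features) and the number of quantization bins are illustrative, not from the paper.

```python
import numpy as np
import pywt  # PyWavelets

# Eight displacement directions (45 degree steps).
DIRECTIONS = [(0, 1), (-1, 1), (-1, 0), (-1, -1), (0, -1), (1, -1), (1, 0), (1, 1)]

def quantize(band, levels=64):
    """Map wavelet coefficients linearly onto integer bins 0..levels-1 (levels is an assumed parameter)."""
    lo, hi = float(band.min()), float(band.max())
    return np.floor((band - lo) / (hi - lo + 1e-12) * (levels - 1)).astype(int)

def hist_features(hist):
    """Three features of the normalized cumulative histogram (cf. Fig. 2):
    slope of the regression line, mean, and mean deviation."""
    cum = np.cumsum(hist) / (hist.sum() + 1e-12)   # normalized cumulative histogram
    t = np.arange(len(cum))
    slope = np.polyfit(t, cum, 1)[0]               # slope of the regression line
    mean = cum.mean()
    mean_dev = np.abs(cum - mean).mean()           # mean deviation
    return [slope, mean, mean_dev]

def pair_features(b1, b2, levels=64):
    """48 features for one subband pair: 2 histograms x 3 features x 8 directions."""
    q1, q2 = quantize(b1, levels), quantize(b2, levels)
    rows, cols = q1.shape
    feats = []
    for dr, dc in DIRECTIONS:
        # Overlapping windows of the two bands, offset by the displacement (dr, dc).
        r1 = q1[max(dr, 0):rows + min(dr, 0), max(dc, 0):cols + min(dc, 0)]
        r2 = q2[max(-dr, 0):rows + min(-dr, 0), max(-dc, 0):cols + min(-dc, 0)]
        # Joint co-occurrence matrix of the pair, then its two marginals as H1, H2
        # (our assumption standing in for the exact definition in [6]).
        joint = np.zeros((levels, levels))
        np.add.at(joint, (r1.ravel(), r2.ravel()), 1)
        h1, h2 = joint.sum(axis=1), joint.sum(axis=0)
        feats += hist_features(h1) + hist_features(h2)
    return feats

def script_features(img):
    """384-dimensional vector: 192 features from X plus 192 from its complement."""
    vec = []
    for x in (img.astype(float), 255.0 - img.astype(float)):
        A, (H, V, D) = pywt.dwt2(x, 'haar')          # one-level Haar DWT
        for detail in (H, V, D, np.abs(V - H - D)):  # pairs (A,H), (A,V), (A,D), (A,|V-H-D|)
            vec += pair_features(A, detail)
    return np.array(vec)                             # length 2 * 4 * 48 = 384

# Example: features for a random 256x256 grayscale block (stand-in for a script image).
features = script_features(np.random.randint(0, 256, (256, 256)))
print(features.shape)  # (384,)
```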


Fig. 2. Schematic diagram of the feature extraction algorithm (image → wavelet coefficients A, H, V, D → co-occurrence histogram → normalized cumulative histogram → slope of the regression line, mean, mean deviation)

The features extracted from X and $\bar{X}$ are combined to form a feature vector of dimensionality 384, which is used for training and classification. The schematic diagram of the feature extraction method is shown in Fig. 2 [6].

III. TEXTURE TRAINING AND CLASSIFICATION

A. Training

In the training phase, the texture features are extracted, using the feature extraction algorithm, from randomly selected samples belonging to each script. These features are stored in a feature library, which is then used for script identification.

B. Classification

In the classification phase, the texture features are extracted from the test sample X using the feature extraction algorithm and compared with the corresponding feature values stored in the feature library using the distance formula

$$D(M) = \sum_{j=1}^{N} \left[ f_j(X) - f_j(M) \right]^2$$

where N is the number of features in the feature vector f, $f_j(X)$ represents the jth texture feature of the test sample X, and $f_j(M)$ represents the jth feature of the Mth texture class in the feature library. The test script is then identified using the k-nearest neighbor (k-NN) classifier. In the k-NN classifier, a test sample is classified by a majority vote of its neighbors, the test sample being assigned the class most common among its k nearest neighbors, where k is a positive integer, typically small. If k = 1, the test sample is simply assigned the class of its nearest neighbor. It is helpful to choose k to be an odd number, as this avoids difficulties with tied votes.
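A minimal sketch of the training and classification phases, reusing the script_features sketch from Section II and assuming the feature library is an in-memory array of labeled feature vectors (build_library and classify are our names, not from the paper):

```python
import numpy as np

def build_library(samples, labels, extract=script_features):
    """Training phase: extract features from the randomly selected training
    samples of each script and store them in the feature library."""
    return np.array([extract(s) for s in samples]), np.array(labels)

def classify(test_img, library, labels, k=1, extract=script_features):
    """Classification phase: squared-difference distance D(M) to every stored
    vector, then a majority vote among the k nearest neighbors."""
    f_x = extract(test_img)
    dists = np.sum((library - f_x) ** 2, axis=1)   # D(M) = sum_j [f_j(X) - f_j(M)]^2
    nearest = labels[np.argsort(dists)[:k]]        # labels of the k nearest neighbors
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]               # majority vote (odd k avoids ties)
```

With k = 1 this reduces to assigning the class of the single nearest feature vector, matching the distance formula above.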

IV. EXPERIMENTAL RESULTS AND DISCUSSION

The problem of script identification in handwritten document image analysis, as a specific application of texture classification, is discussed herein to emphasize the effectiveness of the texture feature extraction method. Different scripts have a distinctive visual appearance; thus, a block of text may be regarded as a distinct texture pattern. This observation motivates us to utilize the texture classification algorithm for handwritten script identification. A texture based approach does not require connected component analysis; in this sense, it may be called a global approach. The experiments are performed to identify English and 7 Indian scripts (namely, Kannada, Tamil, Urdu, Telugu, Bengali, Hindi, and Malayalam) using the wavelet based texture features. The script documents are digitized at 150 dpi. The experiments were carried out on 4000 images with different text coverage and different orientations. Sample images from the experimental database are shown in Figs. 3, 4 and 5.

Fig. 3. Some sample images of the handwritten scripts used for the identification problem (Left to right, row 1: Bengali, English, Hindi, Kannada; row 2: Malayalam, Tamil, Telugu, Urdu)


Fig. 4. Some sample images of the scripts with orientations at different angles and with different text coverage (Left to right, row 1: 0, 5, 10 and 15 degree orientations; row 2: one-line, two-line, three-line and four-line text coverage)

Fig 5. Some sample images of the scripts written by three different writers Half the numbers of samples are used for training and the remaining half for testing. The image size of the script samples used for the experiments is 256x256 pixels. These images are binarized using global threshold. The results of the experiments are given in the Table 1 and 2. Table 1. Average classification accuracies(%) for different language scripts using the texture features for handwritten(single writer) script classification Sl.No

Table 1. Average classification accuracies (%) for different language scripts using the texture features for handwritten (single writer) script classification

                       Full text coverage                Partial text coverage
Sl.No  Language        0°      5°     10°     15°        1-line  2-line  3-line  4-line
1      Kannada         99.6    100    96.4    94.8       68.4    88.8    100     100
2      English         100     100    99.2    96.8       96.4    98.4    100     98.8
3      Tamil           83.6    84     81.6    82.8       72.8    84.8    86.8    83.2
4      Urdu            100     88.8   88.4    87.2       92.8    96      98.8    100
5      Telugu          96.8    100    100     100        97.6    100     100     96
6      Bengali         100     100    100     100        97.6    100     100     100
7      Hindi           100     100    98.8    97.2       63.6    85.6    100     100
8      Malayalam       100     100    100     99.2       77.2    94.8    100     100
       Average         97.5    96.6   95.55   94.75      83.3    93.55   98.2    97.25

Table 2. Average classification accuracies (%) for different scripts written by two and three writers in comparison with that for a single writer

Sl.No  Language        One writer   Two writers   Three writers
1      Kannada         99.6         86.8          72.4
2      English         100          79.6          58.4
3      Tamil           83.6         96            96
4      Urdu            100          100           90
5      Telugu          96.8         90.8          80.4
6      Bengali         100          99.6          100
7      Hindi           100          92            78
8      Malayalam       100          89.6          62.8
       Average         97.5         91.8          79.75

The average classification accuracy for the texture based method is approximately 98% (97.5% at 0 degree orientation with full text coverage) for a single writer (i.e., different scripts written by different writers), as shown in Table 1. Table 2 shows the classification of different scripts written by one, two and three writers (i.e., each script is written by different writers). For the document images with partial text coverage, the classification rate is lower than that for full text coverage, which can be attributed to the loss of texture information in the document images. The results also show that, as the number of writers increases, the classification accuracy decreases. This is because different writers have used different pens, and the writing pattern differs from writer to writer for the same script. These results demonstrate the efficacy of the wavelet based feature set for script identification.

V. CONCLUSION

In this paper, we propose a method based on texture features for script identification in a handwritten document image. The co-occurrence histogram based texture features are extracted using the correlation between subbands at the same resolution of the wavelet decomposed image, indicating that this information is significant in characterizing handwritten scripts. The average classification accuracy is 97.5% for a single writer document with full text coverage; it decreases slightly with an increase in the angle of orientation and decreases significantly with an increase in the number of writers. The experimental results demonstrate the efficacy of the proposed method and the potential of such a global approach for handwritten script identification in document image analysis.

REFERENCES

[1] S. Rice, G. Nagy, T. Nartker, Optical Character Recognition: An Illustrated Guide to the Frontier, Kluwer Academic Publishers, 1999.
[2] J. Hochberg, K. Bowers, M. Cannon, P. Kelly, Script and language identification for handwritten document images, International Journal on Document Analysis and Recognition, vol. 2, pp. 45-52, 1999.
[3] R. Plamondon and S. N. Srihari, On-line and off-line handwriting recognition: a comprehensive survey, IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 63-84, Jan. 2000.
[4] U. Pal, B. B. Chaudhuri, Automatic separation of words in multi-lingual multi-script Indian documents, Proc. Fourth Int'l Conf. on Document Analysis and Recognition, Ulm, Germany, pp. 576-579, 1997.
[5] I. Daubechies, Orthonormal bases of compactly supported wavelets, Commun. Pure Appl. Math., vol. XLI, pp. 909-996, 1988.
[6] P. S. Hiremath and S. Shivashankar, Wavelet based co-occurrence histogram features for texture classification with an application to script identification in document image, Pattern Recognition Letters, vol. 29, pp. 1182-1189, 2008.
[7] T. N. Tan, Rotation invariant texture features and their use in automatic script identification, IEEE Trans. Pattern Anal. Machine Intell., vol. 20, no. 7, pp. 751-756, 1998.
[8] A. Busch, W. W. Boles, S. Sridharan, Texture for script identification, IEEE Trans. Pattern Anal. Machine Intell., vol. 27, no. 11, pp. 1720-1732, 2005.
[9] T. Chang, C. C. J. Kuo, Texture analysis and classification with tree-structured wavelet transform, IEEE Trans. Image Process., vol. 2, pp. 429-441, 1993.
[10] I. Daubechies, Ten Lectures on Wavelets, SIAM, Philadelphia, PA, 1992.
[11] R. M. Haralick, K. Shanmugam, I. Dinstein, Texture features for image classification, IEEE Trans. Systems Man Cybernet., vol. 8, no. 6, pp. 610-621, 1973.
[12] A. Laine, J. Fan, Texture classification by wavelet packet signatures, IEEE Trans. Pattern Anal. Machine Intell., vol. 15, no. 11, pp. 1186-1190, 1993.
[13] S. Mallat, A theory of multiresolution signal decomposition: the wavelet representation, IEEE Trans. Pattern Anal. Machine Intell., vol. 11, no. 7, pp. 674-693, 1989.
[14] E. Montiel, A. S. Aguado, M. S. Nixon, Texture classification via conditional histograms, Pattern Recognition Lett., vol. 26, pp. 1740-1751, 2005.
