Word-level Handwritten Script Identification from Multi-script Documents

Mallikarjun Hangarge1,† and K.C. Santosh2,†

1 Karnatak Arts, Science and Commerce College, Bidar, Karnataka, INDIA
2 US National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
[email protected], [email protected]

† Equally contributed work. The paper is not related to NLM’s work.

Abstract. In this paper, we present directional discrete cosine transform (DCT) based rotation invariant features for word-level handwritten script identification. Our aim is twofold: first, to validate the effectiveness of the directional DCT (D-DCT) in extracting edge information from the studied word image; and second, to provide rotation invariance, since the conventional DCT (C-DCT) offers neither. For each extracted word image, we compute the DCT coefficient matrix and decompose it into different directions, namely horizontal, vertical, and left and right diagonals, and then take the mean and standard deviation of the decomposed components. These statistical features are evaluated on hundreds of word images from six different scripts using linear discriminant analysis (LDA), achieving an average accuracy of 97.35%.

Keywords: Discrete Cosine Transform, Rotation Invariant Features, Script Identification, Multi-script Document.

1 Introduction

In document image processing, multi-script handwritten character recognition has received considerable attention over the past few decades. To date, no multi-script OCR system exists that can handle real-world documents containing several different scripts, in particular Indic handwritten documents, where different scripts can appear at the word, line and sometimes paragraph level. In this context, automatic identification or separation of the different script zones in a multi-script document enhances OCR performance. Automatic script separation also facilitates multi-lingual document indexing, sorting, retrieval and further knowledge discovery. Considering the difficulties and challenges associated with Indic handwritten multi-script document processing, in this paper we attempt a generic method for script identification, which we aim to apply as a precursor to OCR. Fig. 1 shows the expected outcome for an input bi-script document sample. To handle this, we propose a technique that is simple, rotation invariant, and robust to various complexities such as unconstrained gaps between words and lines, writing styles, skew angles and sizes.

Fig. 1. An example of script identification, where (a) a bi-script sample image is expected to be separated into (b) Devanagari and (c) English script.

Existing techniques aim to classify words, lines or text blocks from documents in which a few different scripts are present, using global features, local features, or a combination of both [1]. Global approaches are primarily based on the discrete cosine transform (DCT) [12, 13], the discrete wavelet transform (DWT) and Gabor filters. They are efficient in representing large images such as text blocks. They are fast, robust to noise and improper segmentation, and script independent. For example, text blocks of eight Indic scripts are classified based on DCT and wavelet features in [13]. In contrast, local approaches employ shape features based on connected components [4, 7, 8]. They are script dependent and computationally slower. These methods may not offer rotation invariance, which can limit their usefulness. From a practical viewpoint, rotation invariant classification of scripts is highly desirable [15] to enhance the performance of OCR.

On the whole, global approaches are better suited to such problems, provided appropriate modifications are incorporated so that generic and optimized solutions [2] are possible. With this in view, in this paper we study the directional DCT (D-DCT) to address the aforementioned challenges. We primarily aim to demonstrate how D-DCT [3, 16] is effective while providing rotation invariance through the statistics of the directional energy distributions of DCT coefficients.

2 Materials and method

In short, the proposed method (similar to [5]) can be described as follows. Words are first extracted from each document using morphological operators followed by connected component analysis. The extracted words are then represented with DCT features and their variants. For classification, the well-known LDA classifier is employed.

Feature selection. Before applying the directional DCT, the primary task is to extract words from the input document image. This consists of three steps: 1) image binarization, 2) image dilation (both horizontal and vertical), and 3) word image extraction based on connected components. The length of the structuring element is adaptive to the script of the document. The complete process of word extraction from a Devanagari document is shown in Fig. 2.

Fig. 2. An example showing word segmentation of a Devanagari text block: (a) input image, (b) binarization, (c) dilation, (d) output.

We apply the 2D DCT to each word image and compute its coefficient matrix $C_{N \times N}$. $C$ is then partitioned into three bands, namely the principal diagonal ($\mu$) and the upper ($\alpha$) and lower ($\beta$) diagonals of size $N-2$. Then $\mu$ is extracted from $C$ and its standard deviation $\sigma_1$ is computed using

$$\sigma_1 = \sqrt{\frac{1}{n-1}\sum_{u=1}^{n}\left(C(u)_{\mu} - \bar{C}(u)_{\mu}\right)^{2}}, \tag{1}$$

where $u = 1, 2, \ldots, n$ and $n$ is the number of coefficients in $\mu$. The standard deviation $\sigma_1$ is a scalar value. Similarly, the $\alpha$ diagonals of $C$ are extracted and their standard deviations are computed using

$$\sigma_{\alpha} = \sqrt{\frac{1}{n-1}\sum_{u=1}^{n}\left(C(u)_{\alpha} - \bar{C}(u)_{\alpha}\right)^{2}}, \tag{2}$$

where $u = 1, \ldots, n$ and $\alpha = 1, \ldots, N-2$, and $\sigma_{\alpha}$ is a column vector of size $(N-2) \times 1$. Then, by appending the value of $\sigma_1$ and a zero to $\sigma_{\alpha}$, we get the first feature $f_1$ of dimension $N \times 1$. In the same way, the $\beta$ diagonals of $C$ are extracted and their standard deviations are computed using

$$\sigma_{\beta} = \sqrt{\frac{1}{n-1}\sum_{u=1}^{n}\left(C(u)_{\beta} - \bar{C}(u)_{\beta}\right)^{2}}, \tag{3}$$

where $\beta = 1, \ldots, N-2$ and $\sigma_{\beta}$ is a column vector of size $(N-2) \times 1$. By appending two zeros to $\sigma_{\beta}$, we get the second feature $f_2$ of dimension $N \times 1$. Similarly, features $f_3$ and $f_4$ are computed by flipping the input matrix $C$. The flipped matrix is denoted by $C^f$, and its upper, lower and principal diagonals are denoted by $\beta^f$, $\alpha^f$ and $\mu^f$, respectively. Finally, the standard deviations of the horizontal and vertical DCT coefficients of $C$ are computed to obtain features $f_5$ and $f_6$, respectively. Thus, we have an integrated feature vector $F = \{f_1, \ldots, f_6\}$ of size $N \times 6$. The dimension of the feature vector can be further reduced by taking the mean and standard deviation of each $f_i$, so that the reduced dimension is $12 \times 1$ instead of $N \times 6$ (six means and six standard deviations).

Classification. Since LDA provides class-discriminating information while reducing the dimensionality of the feature space, and maximizes the separability between classes by maximizing the ratio of inter-class variance to intra-class variance, we employ LDA and study its characteristics.
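The three word-extraction steps described in this section (binarization, horizontal and vertical dilation, connected components) can be illustrated with a minimal sketch. It is not the authors' implementation: the use of scipy.ndimage, the global threshold, the function name extract_words and the structuring-element lengths h_len and v_len are all assumptions made for illustration, since the paper only states that the length is adapted to the script.

```python
import numpy as np
from scipy import ndimage


def extract_words(gray, threshold=128, h_len=9, v_len=3):
    """Return one (row slice, column slice) bounding box per word candidate.

    gray         : 2D array with dark ink on a light background (assumed).
    threshold    : global binarization threshold (assumed; the paper does
                   not name its binarization method).
    h_len, v_len : structuring-element lengths in pixels (hypothetical
                   values; the paper adapts the length to the script).
    """
    # 1) Binarization: ink pixels become True.
    ink = np.asarray(gray) < threshold

    # 2) Horizontal, then vertical dilation, so that the characters of one
    #    word merge into a single blob while separate words stay apart.
    merged = ndimage.binary_dilation(ink, structure=np.ones((1, h_len), dtype=bool))
    merged = ndimage.binary_dilation(merged, structure=np.ones((v_len, 1), dtype=bool))

    # 3) Connected components (8-connectivity): each blob is a word candidate.
    labels, _ = ndimage.label(merged, structure=np.ones((3, 3), dtype=int))
    return ndimage.find_objects(labels)
```

Each returned box can be used to crop the original image, for example `gray[box]`. The boxes are slightly larger than the ink they contain because they are measured on the dilated image.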

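The feature computation of Eqs. (1)-(3) and the reduction to 12 values can be sketched as follows. This is a hedged illustration under several assumptions that the text does not settle: the word image is taken to be already resized to a square $N \times N$ array, "flipping" is taken to mean a left-right flip, the sample standard deviation (divisor $n-1$) is also used for the final reduction, and all function names are hypothetical. The 2D DCT is written out with an explicit orthonormal DCT-II matrix so that only NumPy is needed.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n matrix."""
    k = np.arange(n)[:, None]  # frequency index
    x = np.arange(n)[None, :]  # sample index
    d = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    d[0, :] /= np.sqrt(2.0)
    return d


def diagonal_features(c):
    """Return two length-N vectors built from the diagonals of c."""
    n = c.shape[0]
    sigma_1 = np.std(np.diagonal(c), ddof=1)                                 # Eq. (1)
    sigma_a = [np.std(np.diagonal(c, k), ddof=1) for k in range(1, n - 1)]   # Eq. (2)
    sigma_b = [np.std(np.diagonal(c, -k), ddof=1) for k in range(1, n - 1)]  # Eq. (3)
    f_upper = np.array(sigma_a + [sigma_1, 0.0])  # append sigma_1 and a zero
    f_lower = np.array(sigma_b + [0.0, 0.0])      # append two zeros
    return f_upper, f_lower


def ddct_features(word):
    """12-D descriptor of a square N x N word image (N >= 3)."""
    word = np.asarray(word, dtype=float)
    n = word.shape[0]
    if word.shape != (n, n) or n < 3:
        raise ValueError("expected a square image with N >= 3")
    d = dct_matrix(n)
    c = d @ word @ d.T                        # 2D DCT coefficient matrix C
    f1, f2 = diagonal_features(c)
    f3, f4 = diagonal_features(np.fliplr(c))  # flipped matrix (assumed left-right)
    f5 = np.std(c, axis=1, ddof=1)            # one value per row (horizontal)
    f6 = np.std(c, axis=0, ddof=1)            # one value per column (vertical)
    big_f = np.column_stack([f1, f2, f3, f4, f5, f6])  # N x 6
    # Six means followed by six standard deviations.
    return np.concatenate([big_f.mean(axis=0), big_f.std(axis=0, ddof=1)])
```

Only offsets 1 to N-2 are used for the off-diagonals because the outermost diagonal holds a single coefficient, for which the sample standard deviation is undefined; this matches the $(N-2) \times 1$ size given in the text.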
3 Experiments

Dataset and evaluation protocol. Our dataset is composed of 6000 handwritten word images from six different scripts, namely Roman (Rom.), Devanagari (Dev.), Kannada (Kan.), Telugu (Tel.), Tamil (Tam.) and Malayalam (Mal.), with 1000 words per script. Of the 6000, 3000 word images are reference text words (500 from each script); the remaining 3000 are rotated word images produced by choosing 100 word images of each script and rotating each by five angles, 30°, 60°, ..., 150°. To evaluate the performance of the method, 10-fold cross validation has been implemented, unlike traditional dichotomous classification. The classification performance for any script $s$ is measured using the precision,

$$\text{precision@}s = \frac{\text{number of correctly classified words}}{\text{total number of words}}, \quad s \in \{1, 2, \ldots, 6\}. \tag{4}$$
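The evaluation protocol can be sketched for a single bi-script pair as below. This is an illustration under stated assumptions rather than the authors' code: the classifier is a two-class Fisher discriminant (LDA with a shared covariance and equal priors), the folds are a plain random split without stratification, "total number of words" in Eq. (4) is read as the number of words of script $s$, and the function names and the small ridge term are hypothetical.

```python
import numpy as np


def fit_fisher_lda(x, y):
    """Two-class Fisher discriminant; y holds the labels 0 and 1."""
    x0, x1 = x[y == 0], x[y == 1]
    m0, m1 = x0.mean(axis=0), x1.mean(axis=0)
    # Pooled within-class scatter; the ridge term is an assumption added
    # for numerical stability.
    sw = (x0 - m0).T @ (x0 - m0) + (x1 - m1).T @ (x1 - m1)
    sw += 1e-6 * np.eye(x.shape[1])
    w = np.linalg.solve(sw, m1 - m0)
    bias = -0.5 * w @ (m0 + m1)  # threshold halfway between the class means
    return w, bias


def predict(model, x):
    w, bias = model
    return (x @ w + bias > 0).astype(int)


def cross_validate(x, y, folds=10, seed=0):
    """Return one held-out prediction per sample using k-fold splitting."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    y_pred = np.empty_like(y)
    for test_idx in np.array_split(order, folds):
        train_idx = np.setdiff1d(order, test_idx)
        model = fit_fisher_lda(x[train_idx], y[train_idx])
        y_pred[test_idx] = predict(model, x[test_idx])
    return y_pred


def precision_at(y_true, y_pred, s):
    """Eq. (4): correctly classified words of script s over all words of s."""
    mask = y_true == s
    return float(np.mean(y_pred[mask] == s))
```

With `x` holding one 12-dimensional feature vector per word and `y` the script label (0 or 1) of each word, `precision_at(y, cross_validate(x, y), s)` gives the held-out precision for script `s`.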

Results and discussion. To attest the performance of the proposed algorithm, three tests are carried out, with results provided in Table 1 and Table 2. Table 1 covers the first two: 1) the lower triangular part of the table reports results on rotated word images, i.e., 100 word images of each script at five different angles each; and 2) the upper triangular part reports results on non-rotated word images, i.e., 500 word images of each script. Table 2 addresses the third, the superiority of D-DCT over C-DCT, through tests on non-rotated, rotated and mixed word images.

The results in the lower triangle of Table 1 highlight the significance of the directional energy distribution in classifying rotated word images. For instance, 99.40% average precision is achieved in classifying Devanagari versus other scripts. However, lower average precision is observed for Roman versus other scripts and Kannada versus other scripts, at 94.54% and 93.07% respectively. This is due to the similarity in writing style and thus similar vertical energy distributions. The average precisions in the upper triangular part of Table 1 are slightly higher than those in the lower triangle. This is because of the uneven distribution of DCT coefficients in rotated images; hence, the average precision for rotated images (95.84%) is 1.51 percentage points lower than that for non-rotated images (97.35%).

To compare the performance of D-DCT with C-DCT, tests are carried out on rotated, non-rotated and mixed word images, and the results are provided in Table 2. C-DCT yields an average precision of 86.30% on non-rotated word images of Roman script in combination with the other five scripts, whereas D-DCT gives a superior result of 97.20% for the same combinations. For rotated word images of the same combinations, D-DCT yields 94.54% on average in comparison to 70.38% from C-DCT.

Table 1. Bi-script identification performance in % using D-DCT (upper triangle: non-rotated word images; lower triangle: rotated word images).

Script   Eng.    Dev.    Kan.    Tel.    Tam.    Mal.    Avg.
Eng.     -       99.70   92.70   98.40   97.80   97.40   97.20
Dev.     93.80   -       99.50   99.00   98.90   99.70   99.28
Kan.     91.40   99.30   -       94.90   91.60   93.50   93.33
Tel.     97.90   99.60   93.70   -       97.90   98.20   98.05
Tam.     91.10   99.00   91.70   93.50   -       98.90   98.90
Mal.     98.50   99.70   93.80   98.30   96.30   -       97.35
Avg.     94.54   99.40   90.03   95.90   96.30   95.84   -

Table 2. Performance comparison of D-DCT with C-DCT (average precision in %).

                      Non-rotated images   Rotated images     Mixed images
Bi-scripts            C-DCT    D-DCT       C-DCT    D-DCT     C-DCT    D-DCT
Rom./Other scripts    86.30    97.20       70.38    94.54     67.80    92.98
Dev./Other scripts    92.00    99.28       74.33    99.40     69.12    96.96
Kan./Other scripts    80.00    93.33       71.90    90.03     71.17    88.90
Tel./Other scripts    82.25    98.05       81.25    95.90     80.88    94.83
Tam./Other scripts    84.00    98.90       78.25    96.30     76.25    96.70

Comparative study. For comparison, we have extended our experiments to a dataset of 22,500 printed word images used in [12] and achieved an average identification accuracy of 97.06%, which is higher than the 93.5% obtained using LDA for multi-script identification. The results of [12] are achieved using 36 features, whereas we obtain an accuracy of 97.06% using only one-third as many features. In addition, major state-of-the-art methods are listed in Table 3 for a one-to-one comparison.

4 Conclusion

In this paper, we have studied rotation invariant features based on the directional DCT for word-level handwritten script identification and validated them on six major Indic scripts. The features used in this method are derived from the visual perception of character shapes, which are dominated by directional strokes. In future work, we will use the printed word images of eleven Indic scripts used in [11], since our preliminary results are encouraging.

Table 3. Comparative study.

Method                     Major features                                    Scripts                                  Accuracy
Sarkar et al. [14]         Horizontal, foreground and background features    Devanagari and Roman                     99.28%
                                                                             Bangla and Roman                         98.43%
Namboodiri and Jain [10]   Spatial and temporal features                     Arabic, Cyrillic, Devanagari, Han,       95.10%
                                                                             Hebrew and Roman
Hochberg et al. [6]        Relative centroid, holes, sphericity and          Arabic, Cyrillic, Devanagari, Chinese,   88.00%
                           aspect ratio                                      Japanese and Roman
Moussa et al. [9]          Fractal features                                  Arabic and Latin                         96.64%
Our method                 DCT                                               Roman, Kannada, Telugu, Devanagari,      99.70%
                                                                             Malayalam and Tamil

References

1. Belaïd, A., Santosh, K.C., D’Andecy, V.P.: Handwritten and printed text separation in real document. In: Proc. of MVA. pp. 218–221 (2013)
2. Ghosh, D., T.D., Shivaprasad, A.P.: Script recognition – a review. IEEE PAMI 32(12), 2142–2161 (2010)
3. Fu, J., Zeng, B.: Directional discrete cosine transforms: A theoretical analysis. In: Proc. of ICASSP. vol. 1, pp. I–1105–I–1108 (2007)
4. Hangarge, M., Dhandra, B.V.: Offline handwritten script identification in document images. IJCA 4(6), 1–5 (2008)
5. Hangarge, M., Santosh, K.C., P., Rajmohan: Directional discrete cosine transform for handwritten script identification. In: Proc. of ICDAR. pp. 344–348 (2013)
6. Hochberg, J., Bowers, K., Cannon, M., Kelly, P.: Script and language identification for handwritten document images. IJDAR 2(2-3), 45–52 (1999)
7. Roy, K., A.B., Pal, U.: Word-wise hand-written script separation for Indian postal automation. In: Proc. of IWFHR. pp. 521–526 (2006)
8. Zhou, L., Lu, Y., C.L.T.: Bangla/English script identification based on analysis of connected component profiles. In: Proc. of ICDAR. pp. 243–254 (2006)
9. Moussa, S.B., Zahour, A., BenAbdelhafid, A., Alimi, A.M.: Fractal-based system for Arabic/Latin, printed/handwritten script identification. In: Proc. of ICPR. pp. 1–4 (2008)
10. Namboodiri, A., Jain, A.: Online handwritten script recognition. IEEE PAMI 26(1), 124–130 (2004)
11. Pati, P.B., Ramakrishnan, A.G.: Word level multi-script identification. PRL 29(9), 1218–1229 (2008)
12. Pati, P.B., Ramakrishnan, A.G.: HVS inspired system for script identification in Indian multi-script documents. In: Proc. of ICDAR. pp. 380–389 (2006)
13. Rajput, G.G., B, A.H.: Handwritten script recognition using DCT and wavelet features at block level. In: IJCA. pp. 158–163 (2010)
14. Sarkar, R., Das, N., Basu, S., Kundu, M., Nasipuri, M., Basu, D.K.: Word level script identification from Bangla and Devanagri handwritten texts mixed with Roman script. J. of Computing 2(2), 103–108 (2010)
15. Tan, T.: Rotation invariant texture features and their use in automatic script identification. IEEE PAMI 20(7), 751–758 (1998)
16. Zeng, B., Fu, J.: Directional discrete cosine transforms: a new framework for image coding. IEEE TCSVT 18(3), 305–313 (2008)