Multilingual Word Spotting in Offline Handwritten Documents

21st International Conference on Pattern Recognition (ICPR 2012) November 11-15, 2012. Tsukuba, Japan

Safwan Wshah, Gaurav Kumar and Venu Govindaraju
Department of Computer Science and Engineering, University at Buffalo
srwshah, gauravku, [email protected]

Abstract

In this work, we propose a novel multilingual word spotting framework based on Hidden Markov Models that works on corpora of multilingual handwritten documents and on documents that contain more than one handwritten script. The system deals with large multilingual vocabularies without the need for word or character segmentation. A keyword is represented by concatenating its character models. We propose and compare two systems: a script-identifier-based (IDB) system and a script-identifier-free (IDF) system. IDB uses an HMM-based script identifier before spotting a keyword, while IDF performs the spotting without script identification. The system is evaluated on a mixed corpus of public datasets from several scripts, such as IAM for English, AMA for Arabic and LAW for Devanagari, and on a synthetic dataset generated by concatenating words and lines from different scripts in a document image.

1. Introduction

Offline handwriting recognition continues to be a challenging task due to the vast variability in writing styles and to applications that do not offer the means to constrain large vocabularies. Word spotting has been proposed as an alternative to full transcription for retrieving keywords from document images. A huge number of multilingual handwritten documents and forms are sent to companies for processing every day [5]. An efficient retrieval system for these documents can save such companies considerable time and money. Recently, libraries around the world have digitized their valuable old handwritten books, written in many scripts from ancient to modern times. Word spotting techniques can be applied to the retrieval of such digitized corpora.

Few studies have focused on multilingual word spotting systems. Srihari [10] proposed a template-matching approach for Arabic, Latin and Sanskrit using GSC features and Dynamic Time Warping for matching. Bhardwaj [2] proposed a template-based approach applying moment features and used cosine similarity for matching. The main advantage of these approaches is that minimal learning is involved. However, they need at least one keyword sample in the training dataset. Moreover, the text line images have to be segmented into words, and there are limitations in dealing with a wide variety of unknown writers [5]. Fischer [5], Rodríguez [9] and Cao [3] showed that learning-based approaches outperform template-based word spotting approaches for single and multiple writers. In [11] we presented a line-based word spotting system that removes the limitations of the line-based approach proposed by Fischer [5], which uses a lexicon-free approach and relies heavily on white space models.

In this paper, we extend our work [11] to multilingual documents by presenting techniques that spot keywords in a multilingual document corpus. The spotting is performed at both the document and word levels. In a learning-based framework using Hidden Markov Models (HMMs), keywords, even those unseen in the training corpus, are simulated in model space as sequences of trained character models. Filler models are used for a better representation of non-keyword images. As a result, the system is effectively capable of dealing with large vocabularies without the need for word or character segmentation. To the best of our knowledge, we are the first to propose a learning-based approach to multilingual word spotting for handwritten documents. The system is evaluated on a mixed corpus of public datasets: AMA [1] for Arabic, IAM [7] for English and LAW [6] for Devanagari.

The remainder of the paper is organized as follows. The baseline HMM-based word spotting system is described in Section 2. The proposed multilingual word spotting system is described in Section 3. Sections 4 and 5 cover the experimental results and the conclusion.

978-4-9906441-1-6 ©2012 IAPR

2. HMM-based Word Spotting System (HWS)

In this section, we provide a brief overview of our line-based word spotting system, which we refer to as HWS.

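The sliding-window features used by HWS (two vertical bins per frame, an 8-direction gradient histogram and four intensity ratios per bin, 24 features per frame in total) can be sketched as follows. This is a minimal illustration: the window width, overlap, and exact normalizations are assumptions, not the paper's implementation.

```python
import numpy as np

def frame_features(frame):
    """Compute a 24-dimensional feature vector for one binary frame.

    frame: 2-D array (rows x cols), 1 = ink, 0 = background.
    The frame is split into two vertical bins; each bin contributes an
    8-direction gradient histogram (8 values) and 4 ink-density ratios
    from equal horizontal strips (4 values): 2 * (8 + 4) = 24 features.
    """
    h, w = frame.shape
    feats = []
    for bin_img in (frame[:, : w // 2], frame[:, w // 2 :]):
        # 8-direction gradient histogram, magnitude-weighted and normalized.
        gy, gx = np.gradient(bin_img.astype(float))
        angles = np.arctan2(gy, gx)
        mag = np.hypot(gx, gy)
        hist, _ = np.histogram(angles, bins=8, range=(-np.pi, np.pi), weights=mag)
        total = hist.sum()
        feats.extend(hist / total if total > 0 else hist)
        # Normalized ink count in 4 equal horizontal strips.
        for strip in np.array_split(bin_img, 4, axis=0):
            feats.append(strip.sum() / strip.size)
    return np.asarray(feats)

def frames(line_img, width=8, step=3):
    """Overlapping frames over a line image (width/step are illustrative)."""
    _, w = line_img.shape
    return [line_img[:, x : x + width] for x in range(0, w - width + 1, step)]
```

The resulting sequence of 24-dimensional vectors is what the character HMMs are trained on.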

Figure 1. Feature Extraction

The input to the system is a line image and the output is the start and end segment of each keyword present in the line. At the preprocessing stage, we assume that the document lines are properly segmented. The input line images are cleaned, deskewed and slant corrected. For each document line, features are computed from a sequence of overlapping windows, also called frames. Gradient and intensity features are extracted from each frame as shown in Figure 1. We divide each frame into two vertical cells known as bins. For each bin, we calculate the gradient features from a normalized histogram of 8 directions. The intensity features are extracted by dividing each bin into four equal horizontal strips; the count of black to white pixels is normalized for each strip. As a result, 24 features are extracted from each frame. For each character, a 14-state linear-topology HMM is learnt from the extracted features, and for each state S_i the observation probability distribution is modeled by a continuous probability density function of Gaussian mixtures. The Baum-Welch algorithm is used to estimate the parameters of each model during training. For recognition, we use the Viterbi algorithm [8] to compute the probability of each model for the given observation sequence. Each line is constructed from words that are formed by concatenating the corresponding character HMM models.

The word spotting algorithm starts by detecting candidate keywords using a recognizer that searches for the keywords in a line. The keyword models are built by concatenating their HMM character models. The HMM-based recognizer uses the Viterbi beam search decoder [8] to parse the entire line, finding the maximum-probability path between the keyword and filler models. The filler models represent the non-keyword regions; we refer to them as character filler models (CFMs). Each CFM is an HMM with exactly the same implementation as the character models but trained on different classes. The clustering of these CFMs is described in Algorithm 1 and was proposed in our earlier work [11].

The extracted candidate keywords are cut from the line using their start and end positions and then pruned by normalizing their scores against word background model scores, which efficiently reduces the false positive rate. In this work, we use the lexicon-based Word Background Model (WBM) proposed in [11]. In WBM, the large non-keyword lexicon is reduced based on the Levenshtein distance between the lexicon entries and the candidate keyword text. The reduced-lexicon background models showed high accuracy and low complexity, as discussed in [11]. For more details about HWS and the filler and background models, the reader is referred to [11].

Algorithm 1 CFM Generation
INPUT: HMM character models, testing data, number of required filler models.
OUTPUT: Character filler models.
Initialization: INPUT ← HMM character models; OUTPUT ← ∅.
Step 1:
  for all character models in INPUT do
    for all other character models in INPUT do
      Merge the pair of character models.
      PAIRS[accuracy, character pair] ← testing-set accuracy after merging.
    end for
    MaxPair ← pair with maximum accuracy in PAIRS.
    Merge the corresponding pair (MaxPair) and store it in the OUTPUT array.
    Delete the pair from INPUT.
  end for
Step 2: Label the testing data according to the new models.
Step 3: if OUTPUT size == number of filler models then stop; else INPUT ← OUTPUT, OUTPUT ← ∅, go to Step 1.
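The CFM generation procedure can be sketched in code as a greedy merge loop. This is an illustrative sketch only: the character models are opaque stand-ins, and the `merge` and `accuracy` callables (merging two HMMs and scoring the test set) are hypothetical placeholders for the HMM operations described above.

```python
import itertools

def cluster_cfms(models, n_fillers, merge, accuracy):
    """Greedily cluster character models into filler models.

    models:    list of character models (opaque objects here).
    n_fillers: desired number of filler models.
    merge:     callable(model_a, model_b) -> merged model (assumption).
    accuracy:  callable(model_list) -> test-set accuracy (assumption).
    In each round, repeatedly merge the pair whose merge gives the best
    test accuracy, until only n_fillers models remain.
    """
    current = list(models)
    while len(current) > n_fillers:
        pool = list(current)
        output = []
        while len(pool) > 1 and len(output) + len(pool) > n_fillers:
            # Evaluate every remaining pair and keep the best-scoring merge.
            best = max(
                itertools.combinations(pool, 2),
                key=lambda pair: accuracy(
                    [merge(*pair)] + [m for m in pool if m not in pair]
                ),
            )
            output.append(merge(*best))
            pool = [m for m in pool if m not in best]
        current = output + pool
    return current
```

One simplification relative to Algorithm 1: this sketch stops merging as soon as the requested number of models is reached, whereas the paper's procedure completes full merge rounds and then checks the model count.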

3. Multilingual Word Spotting

In this work, we propose and evaluate two learning-based multilingual word spotting frameworks. In the first approach, we estimate the script of the input image and apply the corresponding script-specific word spotting system. In the second, more efficient approach, script identification is skipped and the spotting is performed in an identification-free framework, making it applicable to spotting words in documents that contain more than one script. We refer to the first system as Identifier Based (IDB) and to the second as Identifier Free (IDF). The two approaches are covered in detail below.
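Both variants can be viewed as assembling a parallel decoding vocabulary for the Viterbi decoder. The sketch below contrasts them using opaque stand-in HMM objects; the function and argument names are illustrative, not from the paper's implementation.

```python
def build_network(keyword_models, script_fillers, script=None):
    """Assemble the parallel model set handed to the Viterbi decoder.

    keyword_models: dict script -> {keyword: HMM} (opaque stand-ins).
    script_fillers: dict script -> list of character filler HMMs.
    IDB identifies the script first and decodes with that script's models
    only (pass script="..."); IDF connects the keyword and filler models
    of every script in parallel (script=None).
    """
    scripts = [script] if script is not None else list(keyword_models)
    network = []
    for s in scripts:
        network.extend(keyword_models[s].values())  # keyword models
        network.extend(script_fillers[s])           # non-keyword fillers
    return network
```

With `script=None`, any stretch of a line can be explained by a keyword of any script or by any script's fillers, which is what lets IDF handle mixed-script lines.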

3.1. Identifier-Based Multilingual Word Spotting System (IDB)

In a multilingual corpus where it is known beforehand that each document is written in a single script, it is useful to predict the script of the document before applying the corresponding word spotting system.
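The script decision used by IDB, described in this subsection, sums the scores of the detected filler models of each script and picks the script with the maximum total. A minimal sketch, assuming the per-script score lists come from the Viterbi decoding pass:

```python
def identify_script(filler_scores):
    """Pick the script whose detected filler models score highest.

    filler_scores: dict mapping script name -> list of scores of the
    filler models detected for that script on the line (assumed to come
    from the Viterbi beam-search pass).
    Implements S_i = sum of the detected filler scores of script i,
    followed by an argmax over scripts.
    """
    totals = {script: sum(scores) for script, scores in filler_scores.items()}
    return max(totals, key=totals.get)
```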


Our identifier-based word spotting system first predicts the script of the document. Based on the identified script, the corresponding HWS system is applied for word spotting. A novel method for script identification is used in this technique, in which script-specific character filler models (CFMs) are learnt from the script-specific training corpus as specified in Section 2 and clustered using Algorithm 1. Such CFMs learn a better representation of the script-specific text background. The script prediction is done by processing each line of the document with the Viterbi beam search decoder and finding the maximum-probability path between the filler models of each script, as shown in Figure 2. For each script, the scores of its detected filler models are summed and the decision is taken according to the maximum score:

    Script_index = arg max(S_1, S_2, ..., S_n)    (1)

where S_i is given by

    S_i = Σ f_i    (2)

Here i denotes the script index, and the sum runs over the scores f_i of the filler models of script i detected in the line. Finally, for word spotting, we use the script-specific HWS system presented in Section 2; for each script, an HWS system is trained on that script's data. The advantage of this system is a reduction in framework complexity, since the keywords and fillers belonging to the other scripts are ignored.

Figure 2. Identifier Based Multilingual Spotting (IDB)

3.2. Identifier-Free Multilingual Word Spotting System (IDF)

In line images that contain scripts from different languages, script identification relies on proper word segmentation, and for each segmented word a reliable identification is required. High accuracy for segmentation and word-level script identification is difficult to achieve, as shown in Table 1. In languages such as Arabic, a space can occur within or between words, revealing little information about the word boundaries [4]. Hence, while working with line images containing multiple scripts, the identifier is completely ignored. Instead, the filler models from all scripts are connected in parallel with the keyword models, thus representing all keyword and non-keyword regions in a line, as shown in Figure 3. The Viterbi parser finds the highest-probability path between the appropriate keyword and filler models, finding the correct keyword candidates for the input line even if the line includes more than one script. As in IDB, the false positives are reduced by comparing the candidate keyword score with the score of the lexicon-based Word Background Model, as described in Section 2. Including filler models from different scripts ensures a robust modeling of the background and creates a more robust rejection criterion. In addition, few keywords from one script are detected in lines of other scripts, owing to the robustness of the filler and keyword models. Hence, IDF performs slightly better than IDB, with the added advantage of detecting keywords from more than one script in a single line.

Figure 3. Identifier Free Multilingual Spotting (IDF)

Table 1. Script Identification Results for IDB

    Dataset    Document   Line    Word
    Accuracy   100%       95.3%   83.3%

4. Experimental Results

Both IDB and IDF are evaluated on semi-automatic datasets generated using three scripts: English, Arabic and Devanagari. The public IAM dataset [7] is used for English, the AMA dataset [1] for Arabic, and the LAW dataset [6] for Devanagari. We randomly select 250 pages from the IAM dataset, of which 200 pages are used for training and 50 for testing. For the AMA dataset, 400 documents are used: 350 for training and 50 for testing. The Devanagari lines are semi-automatically generated from word images taken from the LAW Devanagari dataset [6]. Each line is formed by randomly concatenating up to 5 words from the dataset, which contains 26,720 handwritten words. 350 documents are generated, of which 300 are used for training and 50 for testing. The character and filler models of each script are trained on that script's training set.

Three datasets are prepared using the three scripts, at the document, line and word levels. The document-level dataset contains one script per document. In the line-level dataset, each document contains a mix of lines from different scripts. In the word-level dataset, each document line contains more than one script. The mixing for the line- and word-level datasets is implemented semi-automatically by merging random lines and words a random space apart. For each script, 50 random keywords are used. The results show high performance for IDF on all of the above datasets. As expected, IDB performs poorly on the word-level dataset. The mean average precision of each system at the document, line and word level is shown in Table 2. The ROC curves for the systems are shown in Figure 4. Note that the curves for IDF at the document and line levels are the same because the documents are segmented into lines before spotting.

Table 2. Mean Average Precision for Word Spotting

    Dataset     IDB    IDF
    Document    0.57   0.59
    Line        0.55   0.59
    Word        0.24   0.55

5. Conclusion

In this paper we presented a learning-based multilingual word spotting framework that works on corpora of multilingual documents and on documents that contain more than one language. The system is effectively capable of dealing with large vocabularies without the need for word or character segmentation. Any keyword can be modeled by concatenating its character models. The IDF system shows high performance for all scripts used (English, Arabic and Devanagari) on document-, line- and word-level mixed datasets. The lower-complexity IDB system shows high performance on the document- and line-level datasets.

References

[1] Applied Media Analysis. Arabic-Handwritten-1.0, 2007.
[2] A. Bhardwaj, D. Jose, and V. Govindaraju. Script independent word spotting in multilingual documents. In 2nd Intl. Workshop on Cross Lingual Information Access, pages 48-54, 2008.
[3] H. Cao and V. Govindaraju. Template-free word spotting in low-quality manuscripts.
[4] J. Chan, C. Ziftci, and D. Forsyth. Searching off-line Arabic documents. In IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, volume 2, pages 1455-1462, 2006.
[5] A. Fischer, A. Keller, V. Frinken, and H. Bunke. HMM-based word spotting in handwritten documents using subword models. In 20th Intl. Conf. on Pattern Recognition (ICPR), pages 3416-3419, 2010.
[6] R. Jayadevan, S. R. Kolhe, P. M. Patil, and U. Pal. Database development and recognition of handwritten Devanagari legal amount words. In Intl. Conf. on Document Analysis and Recognition, pages 304-308, 2011.
[7] U.-V. Marti and H. Bunke. The IAM-database: an English sentence database for offline handwriting recognition. Intl. Journal on Document Analysis and Recognition, 5:39-46, 2002.
[8] L. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257-286, 1989.
[9] J. A. Rodríguez-Serrano and F. Perronnin. Handwritten word-spotting using hidden Markov models and universal vocabularies. Pattern Recognition, 42(9):2106-2116, 2009.
[10] S. N. Srihari, H. Srinivasan, C. Huang, and S. Shetty. Spotting words in Latin, Devanagari and Arabic scripts. Vivek: A Quarterly in Artificial Intelligence, 2006.
[11] S. Wshah, G. Kumar, and V. Govindaraju. Script independent word spotting in offline handwritten documents based on hidden Markov models. In Intl. Conf. on Frontiers in Handwriting Recognition (under review), 2012.

Figure 4. Precision Recall curve
