Recognition of Arabic Handwritten Words using Contextual Character Models

Ramy El-Hajj a,b, Chafic Mokbel a and Laurence Likforman-Sulem b

a University of Balamand, Faculty of Engineering, PoBox 100, Tripoli, LEBANON
b GET-Ecole Nationale Supérieure des Télécommunications, 46 rue Barrault, 75013 Paris, FRANCE

ABSTRACT

In this paper we present a system for the off-line recognition of cursive Arabic handwritten words. This system is an enhanced version of our reference system presented in [El-Hajj et al., 05], which is based on Hidden Markov Models (HMMs) and uses a sliding window approach. The enhanced version proposed here uses contextual character models. This approach is motivated by the fact that the set of Arabic characters includes many ascending and descending strokes which overlap with one or two neighboring characters. Additional character models are constructed according to the characters in their left or right neighborhood. Our experiments on images of the benchmark IFN/ENIT database of handwritten village/town names show that using contextual character models improves recognition. For a lexicon of 306 name classes, accuracy is increased by 0.6% in absolute value, which corresponds to a 7.8% reduction in error rate.

Keywords: Arabic words, HMMs, contextual character models, AWHR, handwriting recognition.

1. INTRODUCTION

The recognition of Arabic handwriting is an active field in the pattern recognition domain [Lorigo and Govindaraju 06]. Constructing off-line recognition systems is a challenging task because of the variability and the cursive nature of Arabic handwriting. Arabic handwriting is difficult to pre-segment, so many recognition systems are based on the HMM framework [Amin 98; Khorsheed 03; Ben Amara and Bouslama 03]. HMM-based methods follow two main strategies: holistic and analytical. The holistic strategy considers word images as a whole and does not attempt to segment words into characters or any other units. Models are thus trained from word images, and this approach is restricted to small lexicons. In HMM-based analytical strategies, words are modeled by the concatenation of their component character HMMs. When segmentation is external, the word image is first segmented into characters or smaller units and then these units are recognized by single character HMM classifiers [Arica and Yarman-Vural, 01]. When segmentation is implicit, there is no attempt to segment the word image before recognition: segmentation into characters is performed jointly with recognition. The advantage of analytical strategies is that the lexicon may be large, since new word models can easily be added by concatenating character models.


Our reference HMM system for the recognition of Arabic words is based on an analytical strategy with implicit segmentation [El-Hajj et al., 05]. A set of 28 features is extracted through a sliding window approach. These features are related to pixel density distributions and to local pixel configurations. Character overlap, slant of the writing, wrongly positioned diacritics and wrong estimation of the baseline position may lead to classification errors. We propose here an enhanced system which takes character overlap into account. It is based on the use of contextual character models: characters which may include ascending or descending strokes in a near spatial context are represented by specific HMM character models. The set of character models is thus increased. Note that these additional models differ from the models of Arabic characters whose shapes are context sensitive according to the position of the character in a word (beginning, middle, end): these various shapes are already taken into account in the reference system. In Section 2, we recall the feature set extracted within vertical frames. In Section 3, we present the HMM-based modelling, the recognition phase, and our first results obtained on the IFN/ENIT database. Section 4 draws conclusions and gives future perspectives.

2. FEATURE EXTRACTION

The system takes binary word images as input. The pre-processing phase consists of searching for the lower and upper baselines. The estimated positions of these two writing baselines are used to extract baseline-dependent features in order to emphasize the presence of ascenders and descenders. The baseline detection approach, described in [El-Hajj et al., 05], is based on the vertical projection profile of foreground pixels. The feature extraction stage extracts a sequence of feature vectors by dividing the word image into vertical overlapping frames (see Figure 1). The sliding windows are shifted in the direction of the writing (right to left in the case of Arabic). The width of these windows is equal to L pixels, and they are successively shifted by δ = L/2 pixels, so that consecutive windows overlap. The height of a window varies according to the dimensions of the word image.

Figure 1. Word images are divided into vertical frames (here without overlap). Each frame is divided into cells.
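To make the windowing concrete, the following is a minimal sketch of the frame extraction under the stated parameters (L = 8 pixels, 50% overlap). The image layout (a numpy binary array with foreground pixels set to 1) and the helper name extract_frames are illustrative assumptions, not the authors' code.

```python
import numpy as np

def extract_frames(word_image, L=8):
    """Split a binary word image (H x W array, foreground = 1) into
    vertical frames of width L pixels, shifted by L/2 pixels so that
    consecutive frames overlap by 50%, scanning right to left as for
    Arabic writing."""
    _, width = word_image.shape
    shift = L // 2
    frames = []
    for right in range(width, L - 1, -shift):      # right edge of each window
        frames.append(word_image[:, right - L:right])
    return frames

# Example: a 40 x 64 dummy image yields 15 frames of shape (40, 8)
frames = extract_frames(np.zeros((40, 64), dtype=np.uint8))
```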


Table 1. The set of features includes 17 baseline-independent and 11 baseline-dependent features.

Baseline-independent features:
  f1       density of foreground pixels
  f2       B/W transitions between cells
  f3       derivative feature (gravity center)
  f4-f11   density of foreground pixels within each frame column
  f17-f22  number of background pixels according to 6 concavity configurations (upside-down, etc.)

Baseline-dependent features:
  f12      normalized vertical distance of the gravity center of foreground pixels from the lower baseline
  f13-f14  pixel density over and under the lower baseline
  f15      B/W transitions between cells over the lower baseline
  f16      position of the gravity center: core zone, over upper baseline, under lower baseline
  f23-f28  number of background pixels in the core zone according to the 6 concavity configurations (upside-down, etc.)


Each window is divided into a fixed number n of cells. The parameters n = 17 cells and L = 8 pixels were set experimentally using a validation set [Al-Hajj, 2007]. In each window we extract a set of 28 features (see Table 1). Some of these features are extracted from specific areas of the image delimited by the word baselines. These features represent the local distributions of densities and the configurations of foreground pixels (see Figure 2), which capture the type of strokes (curved, oriented, vertical, horizontal). More details about these features can be found in [El-Hajj et al., 05].

Figure 2. The six types of local configurations for a background pixel P. Pixels marked by D (Don’t care) could be either black or white.
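As an illustration of how such per-window features can be computed, the sketch below derives three of them (f1, f2 and f12) from the descriptions in Table 1. The cell layout and normalizations are assumptions for illustration and may differ from the authors' implementation.

```python
import numpy as np

def window_features(frame, lower_baseline, n_cells=17):
    """Illustrative computation of three of the 28 features for one
    window (frame: H x L binary array, foreground = 1).  Feature
    numbering follows Table 1."""
    height, _ = frame.shape

    # f1: density of foreground pixels in the window
    f1 = frame.sum() / frame.size

    # f2: black/white transitions between the n vertical cells
    edges = np.linspace(0, height, n_cells + 1, dtype=int)
    cell_black = np.array([frame[a:b, :].any() for a, b in zip(edges[:-1], edges[1:])])
    f2 = int(np.count_nonzero(cell_black[1:] != cell_black[:-1]))

    # f12: normalized vertical distance of the gravity center of
    # foreground pixels from the lower baseline (given as a row index)
    rows = np.nonzero(frame)[0]
    gravity_y = rows.mean() if rows.size else float(lower_baseline)
    f12 = (lower_baseline - gravity_y) / height

    return f1, f2, f12
```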


3. CONTEXTUAL CHARACTER MODELLING SYSTEM

Arabic handwriting is highly cursive and difficult to segment because of the variable length of ligatures and of inter- and intra-word spaces. Arabic words include diacritical points and marks that can change their meaning and which are placed above or under the writing baselines. In addition to horizontal ligatures between letters within words or sub-words, vertical ligatures can occur between two or three Arabic characters. The modeling of Arabic characters and words thus has to be robust to such characteristics. Our approach is HMM-based and uses an analytical strategy with implicit segmentation. This allows us to avoid pre-segmentation and to model words from a large lexicon.

3.1 Character and word models

Arabic character shapes are context sensitive and vary according to their position: at the beginning, in the middle or at the end of the word. Table 2 shows sample characters which include either additional ligatures or descending strokes according to their position. We model Arabic characters and their context-sensitive shapes by HMM character models. The topology of the character models is left-right with four states (see Figure 3). The probability densities of observations in each state are modelled as a mixture of three Gaussian distributions, as described in [El-Hajj et al. 05]. This leads to 167 character models: the 29 basic shapes, the 94 context-sensitive shapes, 43 additional shapes which represent characters with additional diacritical marks such as the shadda, and one space model. Space models are added between pseudo-words to model inter- and intra-word spaces. A word model is built by the concatenation of the models of its component characters.

We now construct an enhanced system which uses contextual character models. The contextual models are models of characters which include fragments of neighboring characters: these fragments are due to the ascending or descending strokes of characters on the left or on the right of the considered character (see Figure 4). These fragments are seen by the sliding window and the extracted features are modified. Contextual modeling is used in speech recognition to model the co-articulation effect: different models are assigned to a given phoneme according to its left and right context. Such models are called 'triphones': for instance, phone 'a' is modeled differently within the words 'cat' and 'hat' because of its different left context (c or h). Such an approach leads to a large number of phone models. For handwriting, it can be applied to characters, leading to tri-character models [Fink and Plötz, 07]. In this paper, we simplify the tri-character approach and present an alternative in which we manually select the characters which are overlapped by neighboring strokes and construct the corresponding contextual models. This leads to a lower number of additional models than with the tri-character approach, because each character is modeled by at most two models. The first model corresponds to a shape in the original set (which may be basic, context-sensitive or with a diacritical mark). The second model corresponds to a shape overlapped by neighboring descenders. This leads to 44 additional models for the number of name classes considered. Figure 5 presents examples of the Arabic character "ا" (aa) with and without overlapping context.
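The relabeling of transcriptions with contextual variants can be sketched as follows. The character labels, the DESCENDER_CHARS set and the '+ctx' suffix are purely illustrative: in the actual system the overlapped characters are selected manually rather than by a rule like this one.

```python
# Characters whose descending strokes frequently overlap a neighbor;
# this set is illustrative, not the authors' manually selected list.
DESCENDER_CHARS = {"ra", "zay", "waw", "nun_end", "ya_end"}

def contextual_transcription(char_sequence):
    """Map a word transcription (list of character-shape labels) to the
    labels used for training: a character whose left or right neighbor
    bears a descender is relabeled with its contextual variant
    (suffix '+ctx'); every character thus has at most two models."""
    labels = []
    for i, ch in enumerate(char_sequence):
        left = char_sequence[i - 1] if i > 0 else None
        right = char_sequence[i + 1] if i + 1 < len(char_sequence) else None
        if left in DESCENDER_CHARS or right in DESCENDER_CHARS:
            labels.append(ch + "+ctx")   # contextual character model
        else:
            labels.append(ch)            # non-contextual model
    return labels
```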


Figure 6 shows examples of how characters can be modeled differently depending on their overlapping or non-overlapping context.

Figure 3. Left-right character HMM topology
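Below is a minimal sketch of the model structure described above (four states, three Gaussian components, 28-dimensional observations) and of word-model construction by concatenation. The initial transition values, the function names and the char_models mapping are illustrative assumptions, not the actual toolkit API.

```python
import numpy as np

N_STATES = 4     # left-right topology with four states (Figure 3)
N_MIX = 3        # Gaussian mixture components per state
FEAT_DIM = 28    # one 28-dimensional feature vector per window

def left_right_transitions(n_states=N_STATES, p_stay=0.5):
    """Transition matrix of a left-right character HMM: each state either
    stays in place or advances to the next state.  The 0.5 values are
    placeholder initial probabilities, re-estimated during training; the
    exit of the last state is connected to the following character model
    when word models are concatenated."""
    A = np.zeros((n_states, n_states))
    for s in range(n_states - 1):
        A[s, s] = p_stay
        A[s, s + 1] = 1.0 - p_stay
    A[-1, -1] = 1.0
    return A

def word_model(characters, char_models):
    """A word HMM is the concatenation of its character HMMs; char_models
    maps each character-shape label (contextual or not) to its model."""
    return [char_models[c] for c in characters]
```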

Table 2. Context-sensitive shapes according to character position in a word (Isolated, Beginning, Middle, End), shown for the characters Hha', 'Ayn, Fa, Kaf, Lam, Mim, Nun, Ha and Ya.

Figure 4. Neighboring characters may overlap due to their descending strokes.

Figure 5. Examples of the Arabic character "ا" (aa) with overlapping context ((a) and (b)) and without overlapping context ((c) and (d)).


Figure 6. Examples of character models. (b) and (d) are additional contextual models. (a) and (c) belong to the set of non-contextual models.


3.2 Training and Recognition

The generic system HCM [Mokbel et al. 02] is used for training and word recognition. Training of character models is based on word images and their transcriptions. The training approach is thus segmentation-free: character models are trained through embedded training, without pre-segmenting words into characters. Training uses the iterative EM (Expectation-Maximization) algorithm as follows: in the Expectation phase, the observations are associated with the most probable states; in the Maximization phase, the model parameters are re-estimated. In the initialization step, the observations are assigned to states by segmenting them linearly, and the initial parameters are estimated from this assignment. Note that for training contextual models, words are transcribed with all their variant shapes, contextual and non-contextual. Non-contextual character models are thus learnt from words that never include overlapping characters, and contextual models from words that may include overlapping characters.

In the recognition phase, the Viterbi algorithm is used. The sequence of extracted feature vectors is passed to a network of lexicon entries formed of character models, and the character sequence providing the maximum likelihood identifies the recognized entry. The HMM-based classifier produces a list of the best candidates (Top N), i.e. those with the highest likelihoods (a sketch of this lexicon-driven decoding is given below).

3.3 Experimental results

To evaluate the performance of our recognition system, experiments are conducted on the benchmark IFN/ENIT database [Pechwitz et al. 02]. The IFN/ENIT database includes a total of 26,459 handwritten words representing 946 Tunisian town/village names, written by different writers. The data are divided into four subsets a, b, c and d. We compare the reference system [El-Hajj et al., 05] with the system using contextual character models. These experiments are conducted on a reduced word lexicon of 306 classes. The word and character HMM models are trained on a set of 10,798 examples from subsets a, b and c. Tests are performed on a separate set of 3,648 examples from subset d. Results in Table 3 show an improvement due to contextual character modeling: accuracy is increased by 0.6% in absolute value, which corresponds to a 7.8% reduction in error rate. This shows the effectiveness of modelling overlapped characters with specific models while keeping the total number of models manageable. This improvement is similar to the one reported in [Fink and Plötz, 07] for Latin handwriting using clustered models. Our previous experiments, conducted on only 157 name classes, showed a lower reduction in error rate (4.3%). The improvement due to contextual modelling increases when more training data are used, as more accurate models are built for the overlapped characters. Higher improvement may thus be achieved using the entire IFN/ENIT database. Table 4 presents some word examples that are misclassified by the reference system and correctly recognized by the system using contextual character models. These words include one or more overlapping characters.
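The following is a minimal sketch of the lexicon-driven decoding described in Section 3.2. It assumes a hypothetical viterbi_score helper returning the Viterbi log-likelihood of a feature sequence under a concatenated word model; the actual system relies on the HCM toolkit and a network of lexicon entries rather than scoring each entry independently.

```python
import heapq

def recognize(feature_sequence, lexicon, char_models, viterbi_score, top_n=10):
    """Lexicon-driven recognition: every lexicon entry is scored by
    running Viterbi decoding on the concatenation of its character
    models, and the entries with the highest log-likelihoods form the
    Top-N candidate list."""
    scored = []
    for word, characters in lexicon:                   # e.g. ("Raoued", ["ra_beg", ...])
        models = [char_models[c] for c in characters]  # concatenated word model
        scored.append((viterbi_score(models, feature_sequence), word))
    return heapq.nlargest(top_n, scored)               # best (score, word) pairs
```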


Table 3. Results obtained by the reference system and the contextual character modeling system.

                                          Recognition Rate (%)
  System                                  Top1    Top2    Top3    Top5    Top10
  reference system                        92.31   94.72   95.96   97.14   97.89
  contextual character modeling system    92.92   94.68   96.65   97.29   98.21
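As a check on the reported relative gain, the Top-1 rates of Table 3 give:

  error (reference)  = 100 - 92.31 = 7.69 %
  error (contextual) = 100 - 92.92 = 7.08 %
  relative reduction = (7.69 - 7.08) / 7.69 ≈ 7.9 %

which is consistent with the reported 7.8% once the absolute gain is rounded to 0.6% (0.6 / 7.7 ≈ 7.8%).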

Table 4. Examples of words misclassified by the reference system and correctly classified by contextual character modeling (columns of the original table: input word image, reference system misclassification, contextual character modeling correct classification). Names appearing in the table include: الرحيش، نكريف، ساقية الزيت، رأس الذرّاع، المحارزة، العمران الأعلى، نحّال، روّاد.

4. CONCLUSION AND PERSPECTIVES

We have enhanced the HMM-based reference system [El-Hajj et al. 05] by using contextual character models. These models take into account overlapping strokes from neighboring characters, leading to a better modeling of Arabic characters, which include many descending strokes. This modeling has improved accuracy on a restricted lexicon (306 classes) of the IFN/ENIT database: accuracy is increased by 0.6% in absolute value, which corresponds to a 7.8% reduction in error rate, while keeping the total number of models manageable (44 contextual models are added to the original set of 167 character models). At present, the selection of contextual character models is performed manually. Future work will consist of automatically clustering character images to detect the overlapped shapes.


REFERENCES

Al-Hajj R., Reconnaissance hors ligne de mots manuscrits cursifs par l'utilisation de systèmes hybrides et de techniques d'apprentissage automatique, PhD thesis, Ecole Nationale Supérieure des Télécommunications, Paris, 2007.

Amin A., Off-line Arabic character recognition: the state of the art, Pattern Recognition, Vol. 31, No. 5, 1998, pp. 517-530.

Arica N., Yarman-Vural T., An overview of character recognition focused on off-line handwriting, IEEE Trans. on Systems, Man, and Cybernetics – Part C: Applications and Reviews, Vol. 31, No. 2, 2001, pp. 216-232.

Ben Amara N., Bouslama F., Classification of Arabic script using multiple sources of information: State of the art and perspectives, IJDAR, Vol. 5, No. 4, 2003, pp. 195-212.

El-Hajj R., Likforman-Sulem L., Mokbel C., Arabic handwriting recognition using baseline dependent features and Hidden Markov Modeling, Proc. of ICDAR'05, Seoul, 2005, pp. 893-897.

Fink G., Plötz T., On the use of context-dependent modeling units for HMM-based offline handwriting recognition, Proc. of ICDAR'07, 2007, pp. 729-733.

Khorsheed M.S., Recognizing Arabic manuscripts using a single Hidden Markov Model, Pattern Recognition Letters, Vol. 24, 2003, pp. 2235-2242.

Lorigo L. M., Govindaraju V., Off-line Arabic handwriting recognition: A survey, IEEE Trans. on PAMI, 2006, pp. 712-724.

Mokbel C., Abi Akl H., Greige H., Automatic speech recognition of Arabic digits over the telephone network, Proc. of Research Trends in Science and Technology RSTS'02, Beyrouth, 2002.

Pechwitz M., Maddouri S., Märgner V., Ellouze N., IFN/ENIT – Database for handwritten Arabic words, Proc. of CIFED'02, Hammamet, Tunisia, 2002, pp. 129-136.
