Handwritten Jawi Words Recognition

Handwritten Jawi Words Recognition Using Hidden Markov Models

Remon Redika Center for Artificial Intelligent Technology Fakulti Teknologi dan Sains Maklumat Universiti Kebangsaan Malaysia [email protected]

Khairuddin Omar Center for Artificial Intelligent Technology Fakulti Teknologi dan Sains Maklumat Universiti Kebangsaan Malaysia [email protected]

Abstract
Handwritten Jawi recognition is a challenging task because of the cursive nature of the script. In manuscripts, the appearance of a word depends on the writer. Recognition of Jawi manuscripts remains an open problem because of difficulties such as the variability of character shapes, overlaps and the presence of ligatures in manuscript words. This paper describes a technique for Jawi word recognition using Hidden Markov Models (HMMs). A segmentation-free method is used to transform a word image into a sequence of frames. Geometrical features are extracted from each frame of the observation sequence using a sliding window. In addition, the baseline parameters of the Jawi word are used in the calculation of black pixel density. Vector quantization clusters these features and assigns them to symbols that are used as HMM input. Experiments were conducted on 579 images from a 100-word lexicon of the Syair Rakis manuscript, and the recognition rate reached 84 percent.

Mohammad Faidzul Nasrudin Center for Artificial Intelligent Technology Fakulti Teknologi dan Sains Maklumat Universiti Kebangsaan Malaysia [email protected]

1. Introduction
Jawi is a script derived from the Arabic alphabet and adapted for writing the Malay language. Malay has been recognized as a lingua franca in South East Asia since the 15th century [25]. It is spoken by more than 200 million people in Malaysia, Indonesia, southern Thailand, Singapore, Brunei, the southern Philippines and southern Myanmar [26]. An estimated 15,000 known Jawi manuscripts are kept in libraries and museums around the world. However, most of the time, they are ignored. Moreover, Jawi character recognition has attracted less interest from scholars than Arabic, Farsi and Urdu. In manuscripts written in Jawi, the variability of handwriting styles is a major problem for a Jawi optical recognition system. Most problems are encountered during the segmentation of words into characters. The vertical overlap of characters and the presence of non-standard ligatures (as in Figure 1) make segmentation a pivotal stage of the recognition system, and different segmentation techniques may produce different characters. To date, there is no efficient solution for automatically reading medieval manuscript text with good accuracy [1].

Figure 1. Non-standard ligatures in word images: (a) ‘Asmara’, ligature; (b) ‘Berhenti’, ligature and overlap

Recognition of handwritten Arabic and its variants such as Farsi (Persian) and Urdu using HMMs has received considerable attention lately. In 2000-2001, [4][5] presented an HMM-based system to recognize 198 names of cities in Iran. In 2003, [6] proposed 160 semi-continuous HMMs representing characters or shapes to recognize 26,459 images of Tunisian city names. In the same year, [7] applied an HMM recognizer with image skeletonization to the recognition of text in an ancient manuscript. [8] applied a continuous-density variable-duration hidden Markov model to the recognition of Nastaaligh-style handwritten Persian words. HMMs have not yet been applied to Jawi, another variant of the Arabic script. In this paper, we propose a word-based HMM recognition method that does not require the word to be segmented into characters in advance.

Figure 2. Profile features: (a) frame sequence, (b) one frame, (c) eight profile features from four directions, (d) top and bottom profiles, (e) left and right features, (f) four directions of profile features

978-1-4244-2328-6/08/$25.00 © 2008 IEEE

Authorized licensed use limited to: University Kebangsaan Malaysia. Downloaded on November 14, 2008 at 02:30 from IEEE Xplore. Restrictions apply.

2. Word modeling
In this paper, we apply HMMs to Jawi word recognition. The basic task is to transform the word image into a representation suitable for an HMM. Hidden Markov modeling is suited to 1-D time-sequential signals such as speech [2]. For that reason, our segmentation-free approach converts the word image into sequential data taken from columns of predefined width, which are called ‘frames’.

Slicing a word image into sequential frames is known as a segmentation-free technique. It is commonly used in cursive word recognition [3][4][5][9][10][11]. The general purpose of this method is to improve recognition performance when the handwritten word is not easily segmented [3]. In our proposed technique, the word image is divided into five vertical frames in two steps. First, we divide the word image into five equal-sized vertical frames. Second, we adjust each frame boundary to the position of the smallest value of the horizontal histogram projection, which becomes the new x-coordinate. As a result, frames of different sizes are obtained (see Figure 2). These observation frames capture the shape primitives of the word image, which are collected by extracting the geometrical features of each frame. The resulting sequential data for a word image, later referred to as the extracted features, comprise writing styles, shape transitions and shape primitives. Most studies on cursive word recognition use a segmentation-free method that transforms the word image into equal-sized frames [4][5][9]. This study employs a different segmentation-free method. Its advantage is that it allows the word image to be sliced with adjustable estimation, so that each of the resulting frames contains an important shape primitive. Moreover, the transitions between frames capture important features, which helps the HMM estimate the sequential transition probabilities.
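The two-step slicing described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the word image is taken to be a binary list of rows with 1 for ink, the histogram projection is read as the ink count per column, and the minimum is searched only within a window around each equal-width cut. The names `column_projection` and `slice_into_frames` and the `search_fraction` parameter are hypothetical.

```python
def column_projection(image):
    """Count ink pixels (value 1) in every column of a binary image."""
    height, width = len(image), len(image[0])
    return [sum(image[y][x] for y in range(height)) for x in range(width)]


def slice_into_frames(image, n_frames=5, search_fraction=0.25):
    """Return n_frames (start, end) column ranges covering the image.

    Step 1 places cuts at equal spacing. Step 2 moves each interior cut
    to the column with the least ink inside a window around it. The
    window size (search_fraction of the nominal frame width) is an
    assumption; the image is assumed to be much wider than n_frames.
    """
    width = len(image[0])
    projection = column_projection(image)
    nominal = width / n_frames
    radius = max(1, int(nominal * search_fraction))
    cuts = [0]
    for i in range(1, n_frames):
        centre = round(i * nominal)
        lo = max(cuts[-1] + 1, centre - radius)
        hi = min(width - (n_frames - i), centre + radius)
        # Ties go to the column closest to the equal-width cut.
        best = min(range(lo, hi + 1),
                   key=lambda x: (projection[x], abs(x - centre)))
        cuts.append(best)
    cuts.append(width)
    return list(zip(cuts[:-1], cuts[1:]))
```

Restricting the search to a window keeps the five frames in order and of comparable width while still letting each cut fall on a sparse column.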

3. Features extraction
Feature extraction is the task of detecting and isolating desired attributes (features) of an object in an image. Recognition performance largely depends on the quality and relevance of the extracted features. In our proposed method, we collect the shape primitive information of the word image by applying low-level, pixel-based geometrical features to each frame. Each frame is divided into eight equal-sized strips from top to bottom and from left to right. Eight profile features are extracted from the strips in each of four directions (left, right, top and bottom), as shown in Figure 2. Hence, 32 profile features are extracted from each frame. Further, we count the ink crossings of each frame in two directions (vertical and horizontal); see Figure 3 (a) and (b). For each direction, eight ink crossing features are extracted, giving 16 crossing features per frame. We also generate eight equal-sized zones and calculate the pixel density of the eight zones horizontally and vertically, as illustrated in Figure 3 (c) and (d). For the zoning features, we also extract two pixel density features from the rows and columns of the frame. To reduce the sensitivity of the feature vector, the vertical geometrical features, namely the top and bottom profiles and the vertical ink crossings, are normalized by the width of the frame, while the horizontal geometrical features, namely the left and right profiles, are normalized by the height of the frame. The pixel density features of vertical zones are normalized by the width of each zone, and those of horizontal zones by the height of each zone. Each frame is thus represented by a 65-dimensional feature vector.
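The profile and ink crossing features described in this section might be computed along the following lines. This is a hedged sketch, not the paper's code: it assumes a binary frame (list of rows, 1 for ink), takes the profile of a strip as the smallest distance from the frame edge to the first ink pixel, takes the crossing count of a strip as the largest number of background-to-ink transitions among its scan lines, and leaves out the normalization step. All function names are hypothetical.

```python
def split_ranges(length, parts=8):
    """Split range(length) into `parts` nearly equal consecutive ranges."""
    bounds = [round(i * length / parts) for i in range(parts + 1)]
    return [range(bounds[i], bounds[i + 1]) for i in range(parts)]


def first_ink(line):
    """Distance from the start of `line` to its first ink pixel."""
    for distance, pixel in enumerate(line):
        if pixel:
            return distance
    return len(line)  # no ink on this scan line


def crossings(line):
    """Number of background-to-ink transitions along `line`."""
    return sum(1 for prev, cur in zip([0] + list(line), line)
               if cur and not prev)


def profile_and_crossing_features(frame, parts=8):
    """Return 32 profile values followed by 16 crossing counts."""
    rows = [list(r) for r in frame]
    cols = [list(c) for c in zip(*frame)]
    feats = []
    # Order: left, right (scanning rows), then top, bottom (scanning columns).
    for lines in (rows, cols):
        strips = split_ranges(len(lines), parts)
        for reverse in (False, True):
            for strip in strips:
                dists = [first_ink(lines[i][::-1] if reverse else lines[i])
                         for i in strip]
                feats.append(min(dists) if dists else 0)
    # Crossings counted along rows, then along columns.
    for lines in (rows, cols):
        for strip in split_ranges(len(lines), parts):
            counts = [crossings(lines[i]) for i in strip]
            feats.append(max(counts) if counts else 0)
    return feats
```

For a frame holding a single vertical stroke, the left and right profiles are constant across strips and every row reports exactly one crossing, which is a convenient sanity check.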

Figure 3. Crossing Features: (a) vertical crossing features, (b) two crossings onto black pixels, (c) original frame, (d) darker squares indicate higher density of zone pixels
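The zone pixel density features of Section 3 can be illustrated with a short sketch. It assumes a binary frame (list of rows, 1 for ink), splits the frame into eight horizontal bands and eight vertical bands, and reports the fraction of ink pixels in each band. Dividing by the band area is an assumption made here for simplicity and differs from the width and height normalization described in the text; the name `band_densities` is hypothetical.

```python
def band_densities(frame, parts=8):
    """Ink fraction in `parts` horizontal bands, then `parts` vertical bands."""
    height, width = len(frame), len(frame[0])
    row_ink = [sum(row) for row in frame]        # ink per row
    col_ink = [sum(col) for col in zip(*frame)]  # ink per column
    feats = []
    for ink, length, other in ((row_ink, height, width),
                               (col_ink, width, height)):
        for i in range(parts):
            start = round(i * length / parts)
            end = round((i + 1) * length / parts)
            area = (end - start) * other
            feats.append(sum(ink[start:end]) / area if area else 0.0)
    return feats
```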

4. Features vector encoding
The features extracted from the sequential frames of a word image need to be encoded into discrete symbols. We use a simple Euclidean distance to compare the 65-dimensional feature vectors of the frames. The algorithm partitions the feature space and generates a sequence of numbers, each number being the symbol of one feature vector. The resulting symbols form the observation sequences of the HMM.
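The encoding step can be sketched as nearest-codeword assignment under Euclidean distance. The codebook below is a hypothetical input (for instance, cluster centres obtained beforehand); how the codebook is built is not shown, and the function names are illustrative only.

```python
import math


def nearest_symbol(vector, codebook):
    """Index of the codeword closest to `vector` in Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda k: math.dist(vector, codebook[k]))


def encode(frame_vectors, codebook):
    """Map each frame's feature vector to a discrete symbol."""
    return [nearest_symbol(v, codebook) for v in frame_vectors]
```

With five frames per word, `encode` would return a sequence of five symbols, which is the kind of discrete observation sequence a discrete HMM expects.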

5. HMM model
The proposed method adopts the left-right Hidden Markov Model, also called the Bakis model. We chose 8 hidden states for the HMM of each word image. The Hidden Markov Model λ can be formulated as λ = (A, B, π), where A is the state transition probability distribution, B is the symbol probability distribution and π is the initial state distribution. The probabilities in A and B are estimated by training λ with a set of training examples. In the recognition process, the observation sequence O (the test sample) is scored against each trained model λ, and the model that maximizes the probability P(O|λ) is selected.
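As an illustration of the left-right structure, the sketch below builds an initial transition matrix and initial state distribution for an 8-state model. It assumes the common Bakis convention in which a state may stay, move to the next state, or skip one state (`max_jump=2`), with uniform starting probabilities; the paper does not state these details, and the function names are hypothetical.

```python
def left_right_transitions(n_states=8, max_jump=2):
    """Uniform left-right transition matrix: no backward moves."""
    matrix = []
    for i in range(n_states):
        last = min(i + max_jump, n_states - 1)
        allowed = range(i, last + 1)
        matrix.append([1.0 / len(allowed) if j in allowed else 0.0
                       for j in range(n_states)])
    return matrix


def initial_distribution(n_states=8):
    """Left-right models start in the first state."""
    return [1.0] + [0.0] * (n_states - 1)
```

Entries that start at zero stay at zero under Baum-Welch style re-estimation, so the left-right constraint is preserved during training.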

6. Training
Training an HMM means maximizing the likelihood of the training data. We trained on 400 images from the 100-word lexicon. Every word image was transformed into five observation frames, where each frame contained a 65-dimensional feature vector. All training data were processed with the K-Means algorithm from the HMMPak software. The training output was one trained HMM for every word in the lexicon. The main goal of training is to capture the different writing styles and other important information related to the variation of word image shapes. At this stage, parameters such as the number of dimensions, the observation sequence and the number of training samples for each lexicon entry must be given to HMMPak.

7. Experiment and recognition
In the recognition stage, the Viterbi algorithm is used to find the maximum likelihood under each HMM. We apply the same procedure as in training to transform the word image into five observation frames. Features are extracted from each frame and assigned to symbols by the classification algorithm. The Viterbi algorithm is then used to compute the probability of the symbol sequence under each trained HMM. The HMM that gives the maximum probability is considered the winner, and the word it represents is taken as the recognized word. Figure 4 illustrates the recognition of a word image for which the HMM of the word “Asmara” ranks among the top candidates with the highest probability.

Many researchers in the field of character recognition use core zone detection to enhance word recognition accuracy. In this respect, we conducted two separate experiments. In the first experiment, we used features computed from the whole image. In the second experiment, we detected the upper-line and lower-line parameters in order to extract the pixel density of the core zone of the image. The original technique of projection-based baseline estimation can be found in [12].

For the experiments, we used a Jawi manuscript entitled Syair Rakis. The words selected as samples are those that appear in the manuscript more than four times. The samples are separated into four sets of data; in each test, one set is used for testing and the other three for training, as shown in Table 1. The recognition result is presented for up to the top 5 candidates for each word image. Table 2 shows the results of the experiment that uses pixel density features of the whole word image. Table 3 shows the results of the experiment that employs the pixel density of the core zone of the word image. Our experiments indicate that pixel density features of the core zone increase the recognition rate. Therefore, recognition is possible without the use of vast training sets in attempting to incorporate knowledge of all handwriting styles.
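The recognition step, in which every trained HMM is scored and the candidates are ranked, can be sketched for discrete HMMs as follows. This is a minimal log-domain Viterbi under stated assumptions: each model is a tuple `(pi, A, B)` of plain Python lists, `B[state][symbol]` is the symbol probability, and `models` maps a word label to its model. The names are hypothetical and the code is not taken from HMMPak.

```python
import math


def safe_log(p):
    """Natural log that maps zero probability to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi_log_score(obs, pi, A, B):
    """Log-probability of the best state path for symbol sequence `obs`."""
    n = len(pi)
    delta = [safe_log(pi[i]) + safe_log(B[i][obs[0]]) for i in range(n)]
    for symbol in obs[1:]:
        delta = [max(delta[i] + safe_log(A[i][j]) for i in range(n))
                 + safe_log(B[j][symbol])
                 for j in range(n)]
    return max(delta)


def rank_words(obs, models, top=5):
    """Word labels ordered from highest to lowest Viterbi score."""
    scored = sorted(((viterbi_log_score(obs, *model), word)
                     for word, model in models.items()), reverse=True)
    return [word for _, word in scored[:top]]
```

Working with log-probabilities avoids numerical underflow, and the first entry returned by `rank_words` corresponds to the winning model.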

Figure 4. HMM probability ranking of word image “Asmara”

Table 1. Data Set

Test   Training Data   Data Size   Test Data Set   Data Size
1      B, D, C         410         A               169
2      A, C, D         427         B               152
3      A, B, D         443         C               136
4      A, B, C         458         D               122

The results in Table 3 show an improvement in recognition rate from the core zone features. The recognition rate increases by about 2%-9%, and the recognition percentages in Table 3 are generally higher than those in Table 2. However, in the first test, the top-ranked recognition rate with core zone features (69%) is lower than with pixel density of the whole image (71%). This is probably caused by an error in detecting the core zone of the image. Hence, a good core zone detection method is likely to be important for improving the accuracy of word recognition. Moreover, the low recognition rate in both experiments is probably due to large differences in style between word images, even though they are taken from one manuscript.
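To make the idea of core zone detection concrete, the sketch below estimates the upper and lower lines from the row-wise ink projection. It is a simplified stand-in and not the baseline estimation technique of [12]: it assumes a binary image (list of rows, 1 for ink) and marks as core zone the span between the first and last rows whose ink count reaches a chosen fraction of the densest row. The `fraction` parameter and the function name are hypothetical.

```python
def core_zone(image, fraction=0.5):
    """Return (top, bottom) row bounds of the estimated core zone.

    `bottom` is exclusive. If the image has no ink, the whole image
    is returned.
    """
    row_ink = [sum(row) for row in image]
    threshold = fraction * max(row_ink)
    dense = [y for y, ink in enumerate(row_ink)
             if ink > 0 and ink >= threshold]
    if not dense:
        return 0, len(image)
    return dense[0], dense[-1] + 1
```

A threshold set too high or too low shifts the estimated lines, which is one way a detection error of the kind mentioned above could arise.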


Table 2. Top 5 Recognition Rate

Test   Rec 1   Rec 2   Rec 3   Rec 4   Rec 5
1      71%     66%     74%     76%     79%
2      69%     70%     73%     75%     76%
3      61%     65%     67%     70%     77%
4      57%     65%     70%     74%     78%

Table 3. Top 5 Recognition Rate Using Core Zone Pixel Density

Test   Rec 1   Rec 2   Rec 3   Rec 4   Rec 5
1      69%     76%     78%     81%     83%
2      78%     78%     79%     82%     83%
3      66%     71%     73%     81%     84%
4      71%     76%     78%     80%     81%
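A top-k recognition rate of the kind reported in Tables 2 and 3 could be computed as sketched below, assuming that "Rec k" denotes the share of test images whose correct word appears among the first k ranked candidates. That reading, the function name and the toy data in the usage note are assumptions for illustration.

```python
def top_k_rates(ranked_lists, truths, max_k=5):
    """Fraction of samples whose true label is within the first k candidates.

    ranked_lists[i] is the ranked candidate list for sample i and
    truths[i] is its correct label. Returns rates for k = 1..max_k.
    """
    total = len(truths)
    rates = []
    for k in range(1, max_k + 1):
        hits = sum(1 for ranked, truth in zip(ranked_lists, truths)
                   if truth in ranked[:k])
        rates.append(hits / total)
    return rates
```

For example, with four test images whose correct word is ranked first, second, third and third respectively, `top_k_rates(..., max_k=3)` returns `[0.25, 0.5, 1.0]`.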

8. Conclusion and future works
We have proposed a Hidden Markov Model approach to Jawi word recognition. The model recognizes the word image without segmenting it into characters. This segmentation-free approach avoids recognition difficulties such as non-standard ligatures and overlaps. The results of the two experiments conducted in this study show that core zone features improve the recognition rate. We suggest further research to improve the HMM models and the slicing method for word images. Improving the accuracy of core zone detection should also be a concern of future studies. New shape models that help generalize the myriad handwriting styles found in many manuscripts should be introduced, so that recognition becomes possible without a vast training set that attempts to incorporate knowledge of all handwriting styles.

8. Acknowledgment
This research is fully funded by the Ministry of Science, Technology and Innovation (MOSTI), Malaysia, through Science Fund SF0231.

9. References
[1] Y. Laydier, F. Labourgeois and Empthoz Hubert, “Text Search for medieval manuscript images”, Science Direct, Journal of Pattern Recognition Society, 2 April 2007
[2] Chen Mou-Yen, Kundu Amlan and Zhou Jian, “Off-line handwritten word recognition using a Hidden Markov Model Type Stochastic Network”, IEEE PAMI, Vol. 16 No. 5, May 1994
[3] Mohamed Magdi and Gader Paul, “Handwritten Word Recognition Using Segmentation-Free Hidden Markov Modeling and Segmentation-Based Dynamic Programming Technique”, IEEE Transaction, Vol 18 No. 5, May 1996
[4] M. Dehghan, K. Faez, M. Ahmadi and M. Shidhar, “Holistic Handwritten Word Recognition Using Discrete HMM and Self-Organizing Features Map”, IEEE, 2000
[5] M. Dehghan, K. Faez, M. Ahmadi and M. Shidhar, “Unconstrained Farsi handwritten word recognition using fuzzy logic vector quantization and Hidden Markov Models”, Pattern Recognition Letters, 2001
[6] Pechwitz Mario and Maergner Volker, “HMM Based Approach for Handwritten Arabic Word Recognition Using IFN/ENIT – Database”, Proc. Int’l Conf. Document Analysis and Recognition, 2003
[7] M.S. Khorsheed, “Recognising Handwritten Arabic Manuscript Using a Single Hidden Markov Model”, Pattern Recognition Letters, 2003
[8] R. Safabakhsh and P. Adibi, “Nastaaligh Handwritten Word Recognition Using Continuous-Density Variable-Duration HMM”, The Arabian Journal for Science and Eng., 2005
[9] R. El-Hajj, L. Likforman-Sulem and C. Mokbel, “Arabic Handwriting Recognition Using Baseline Dependant Features and Hidden Markov Modeling”, Proceeding of International Conference, Document Analysis and Recognition, 2005
[10] R. Ball Gregory, N. Srihari Sargur and Srinivasan Harish, “Segmentation-Based And Segmentation-Free Methods for Spotting Handwritten Arabic Words”, International Workshop on Frontiers in Handwritten Recognition HAL – CCSD, 2006
[11] B. Al-Badr and R. Haralick, “A Segmentation-Free Approach to Text Recognition with Application to Arabic Text”, Int’l Journal Document Analysis and Recognition, 1998
[12] A.W. Senior and A.J. Robinson, “An Off-Line Cursive Handwriting Recognition System”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, 1998
