2011 10th International Conference on Machine Learning and Applications

Recognition of Segmented Online Arabic Handwritten Characters of the ADAB Database

Sherif Abdel Azeem, Hany Ahmed
Electronics Engineering Department, American University in Cairo (AUC), Cairo, Egypt
[email protected], [email protected]

Abstract—The aim of this work is to fill a void in the literature of Arabic handwriting recognition by studying the performance of different feature extraction methods on online segmented Arabic characters. The contribution of this paper is to introduce a large database of segmented online handwritten Arabic characters and report the performance of various feature extraction techniques on the segmented characters to serve as a benchmark for any future work on the problem of online Arabic characters recognition.

Keywords: Online character recognition; Handwritten Arabic; Arabic characters recognition

978-0-7695-4607-0/11 $26.00 © 2011 IEEE DOI 10.1109/ICMLA.2011.120

I. INTRODUCTION

Arabic is currently the sixth most widely spoken language in the world. Despite this, there has been little research on Arabic handwriting recognition compared with other languages of similar importance. Arabic is written from right to left, and most Arabic letters have four different shapes depending on their position within a word (beginning, middle, end, and isolated). A further feature of Arabic script is the presence of dots and strokes, known as delayed strokes because they are usually drawn last in a handwritten word.

Character segmentation is a necessary preprocessing step for character recognition in many online Arabic handwriting recognition systems [1, 2]. The study of segmented character recognition is therefore crucial for the success of any segmentation-based recognition system. The target of the presented work is to fill the void in the literature on this problem by introducing a large database of online segmented handwritten Arabic characters and studying the performance of various feature extraction algorithms on it. The database has been obtained by manually segmenting the well-known ADAB database [3] of handwritten Arabic words. The number of segmented online Arabic characters in sets 1, 2, and 3 of the introduced segmented ADAB database is given in Table I.

TABLE I. NUMBER OF CHARACTERS IN THE SEGMENTED ADAB DATABASE

Shape       Set 1    Set 2    Set 3    Total
Isolated     8183     8617     8451    25251
Start        9613    10185     8062    29757
Middle       7628     8062     7846    23536
End          9533    10103     9902    29538

II. PRE-PROCESSING

Pre-processing goes through five stages: interpolation, baseline extraction, manual segmentation into classes of Arabic letters, removal of the extra parts (segments) before or after the main stroke of a letter, and re-sampling.

A. Interpolation

The problem of broken strokes and large spaces between consecutive points is solved using linear interpolation [7], as shown in Fig. 1.

Figure 1. Example of solving the problem of inaccuracy of the digitization process: (a) original word, (b) word after interpolating the missing points.

B. Baseline Extraction

The baseline is detected by converting the trace of (X, Y) points into its corresponding bitmap matrix. A simple horizontal projection algorithm [8] is then applied to obtain the baseline of the whole image, as shown in Fig. 2. To enhance the algorithm, we divide the image into three equal vertical segments and find the baseline in each part, as shown in Fig. 3.

Figure 2. Extracting the baseline of the whole image.

Figure 3. Extracting baseline 1, baseline 2, and baseline 3.

Experiments showed that the best position of the baseline is the minimum of all of these values, that is, Baseline = min(baseline 1, baseline 2, baseline 3, baseline of whole image), as shown in Fig. 4.
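The baseline rule of Section II.B can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the bitmap is obtained by rounding each point to integer pixel coordinates and that the horizontal projection baseline is the row containing the most foreground pixels; the function names `projection_baseline` and `combined_baseline` are hypothetical.

```python
from collections import Counter

def projection_baseline(points):
    """Row with the largest horizontal projection (ties go to the smaller row)."""
    # Assumption: rasterize by rounding each (x, y) sample to a pixel.
    pixels = {(round(x), round(y)) for x, y in points}
    row_counts = Counter(y for _, y in pixels)
    peak = max(row_counts.values())
    return min(y for y, n in row_counts.items() if n == peak)

def combined_baseline(points):
    """min(baseline 1, baseline 2, baseline 3, baseline of whole image)."""
    xs = [x for x, _ in points]
    x_min, width = min(xs), max(xs) - min(xs)
    thirds = [[], [], []]
    for x, y in points:
        # Index of the vertical third that contains x; the right edge
        # of the image is assigned to the last third.
        k = min(2, int(3 * (x - x_min) / width)) if width else 0
        thirds[k].append((x, y))
    candidates = [projection_baseline(points)]
    candidates += [projection_baseline(part) for part in thirds if part]
    return min(candidates)

# Toy trace: a long run on row 5 and a short run on row 2 at the left.
trace = [(0, 2), (1, 2), (2, 2)] + [(x, 5) for x in range(2, 9)]
print(projection_baseline(trace), combined_baseline(trace))  # prints: 5 2
```

In the toy trace the whole-image projection picks row 5, while the left third picks row 2, so the combined rule returns 2.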

Figure 4. The baseline of the whole image after modification.

C. Manual Segmentation

Table II shows the Arabic letters divided into classes of similar shapes.

TABLE II. LETTER GROUPS OF ARABIC

Class | Letter Group | Isolated | End | Middle | Start
1 | Aleph | ا أ إ آ | ـﺎ ـﺈ ـﺄ | |
2 | Ba'a, Ta'a, Tha'a, Nun, Ya'a | ت ث ب | ـﺐ ـﺖ ـﺚ | ـﻨـ ـﺒـ ـﺜـ | ﺑـ ﺗـ ﺛـ ﻳـ ﻧـ
3 | Jeem, Ha'a, Kha'a | ج ح خ | ـﺞ ـﺢ ـﺦ | ـﺠـ ـﺤـ ـﺨـ | ﺟـ ﺣـ ﺧـ
4 | Dal, Thal | د ذ | ـﺪ ـﺬ | |
5 | Raa, Zai | ر ز | ـﺮ ـﺰ | |
6 | Seen, Sheen | س ش | ـﺲ ـﺶ | ـﺴـ ـﺸـ | ﺳـ ﺷـ
7 | Sad, Dad | ص ض | ـﺺ ـﺾ | ـﺼـ ـﻀـ | ﺻـ ﺿـ
8 | TTa, ThTha | ط ظ | ـﻂ ـﻆ | ـﻄـ ـﻈـ | ﻃـ ﻇـ
9 | Ein, Gein | ع غ | ـﻊ ـﻎ | ـﻌـ ـﻐـ | ﻋـ ﻏـ
10 | Faa, Qaf | ف | ـﻒ | ـﻔـ ـﻘـ | ﻓـ ﻗـ
11 | Qaf | ق | ـﻖ | |
12 | Kaf | ك | ـﻚ | ـﻜـ | ﻛـ
13 | Kaf 2 | | | |
14 | Lam | ل | ـﻞ | ـﻠـ | ﻟـ
15 | Meem | م | ـﻢ | ـﻤـ | ﻣـ
16 | Meem 2 | | | ـﻤــ |
17 | Meem 3* | | | o | o
18 | Nun | ن | ـﻦ | |
19 | Hah, Ta'a | ﻩ ة | ـﻪ ـﺔ | ـﻬـ |
20 | Hah 2* | | | ـ هـ | هـ
21 | Waw | و ؤ | ـﻮ ـﺆ | |
22 | Ya'a | ى ي | ـﻰ ـﻲ | |
23 | Hamza | ء | | |

* Classes 17 and 20 are described in detail in Section II.D.

We have 19 classes for isolated letters, 19 classes for end letters, 14 classes for middle letters, and 12 classes for beginning letters. Besides these classes, there are four ligatures in the ADAB database, as shown in Fig. 5.

Figure 5. The four ligatures in the ADAB database.

After adding the four ligatures to the previous classes, the isolated letters have 20 classes, the beginning letters have 15 classes, and the end letters have 20 classes.

A computer program has been developed to assist in the manual segmentation process. A human observer decides the start and end of each letter separately. The observer also marks the delayed strokes of each letter, as shown in Fig. 6.

Figure 6. Example of the manual segmentation.

The segmented ADAB database is intended to be a benchmark for research on online segmented Arabic handwritten character recognition; hence, we have made it freely available to research groups upon request.

D. Removing the Extra Parts Before or After the Main Stroke

After the manual segmentation, we found that some letters have extra parts before or after the main stroke of the letter because the writer lifted the pen during writing. Those extra parts have to be removed before the re-sampling stage; otherwise, the letter may be confused with other letters after re-sampling, as shown in Fig. 7.

Figure 7. Example of an extra part before Middle Ha'a (class 3): (a) Middle Ha'a (class 3) before re-sampling, (b) Middle Ha'a (class 3) after re-sampling is confused with Ein (class 9), (c) Ha'a (class 3) after removing the extra part.

The Middle Ha'a (class 3) after re-sampling is confused with Middle Ein (class 9). If the extra points are removed, the letter appears to be Beginning Ha'a (class 3). The same problem has been encountered with other letters. Many writers lifted the pen while writing Meem (class 15), before and/or after the circle of the Meem. After re-sampling, we found that the Beginning Meem may be confused with Beginning Hah (class 19), as shown in Fig. 8.a. After removing the extra parts, the resulting Beginning Meem looks different from the normal Beginning Meem shown in Table II (class 15). Thus, we had to create a new class for this new Beginning Meem (class 17), as shown in Fig. 8.b.

Figure 8. Example of a special case of Beginning Meem (class 15) found in the database: (a) confusion with Beginning Hah (class 19), (b) Meem after removing the extra parts (class 17).

The same problem resulted in the creation of a new class for the Middle Meem, as described in Table II (Meem 3, class 17). Our approach to removing the extra parts before or after the main body of a letter is to measure the Euclidean distance between every two consecutive points. If a distance is greater than a predetermined threshold, then the letter has extra parts: the longest stroke is considered the main body of the letter, and the other strokes are removed.
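The extra-part removal rule can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the threshold value is not given in the paper, so it is a free parameter here, and "longest stroke" is interpreted as the stroke with the greatest path length. The names `split_strokes`, `path_length`, and `main_stroke` are hypothetical.

```python
import math

def split_strokes(points, threshold):
    """Split a trace wherever two consecutive points are farther apart than threshold."""
    if not points:
        return []
    strokes = [[points[0]]]
    for prev, cur in zip(points, points[1:]):
        if math.dist(prev, cur) > threshold:
            strokes.append([])  # a large jump starts a new stroke
        strokes[-1].append(cur)
    return strokes

def path_length(stroke):
    """Sum of Euclidean distances between consecutive points of a stroke."""
    return sum(math.dist(a, b) for a, b in zip(stroke, stroke[1:]))

def main_stroke(points, threshold):
    """Keep only the longest stroke (assumption: longest by path length)."""
    strokes = split_strokes(points, threshold)
    return max(strokes, key=path_length) if strokes else []

# Toy letter: a short extra part, the main body, and a trailing extra part.
letter = [(0, 0), (1, 0), (10, 0), (11, 0), (12, 0), (13, 0), (30, 0), (31, 0)]
print(main_stroke(letter, threshold=5))
# prints: [(10, 0), (11, 0), (12, 0), (13, 0)]
```

Measuring the stroke by its number of points instead of its path length would be an equally plausible reading of the text; after re-sampling, the two usually agree.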

E. Re-sampling

The last stage of pre-processing is to adjust the number of (X, Y) points in each character to a fixed number (70, chosen empirically) for classification purposes. Linear interpolation [5] was used for the re-sampling.

III. FEATURE EXTRACTION

Several features previously used for Latin handwriting recognition can be used for Arabic handwriting recognition, such as chain codes [4], [10], directional features [5], elliptical Fourier descriptors [6], and structural features [9]. We have tried many features and found that the best feature set is the directional features. The following steps describe our feature extraction.

Let $P = (X_i, Y_i)$, $i = 1, 2, 3, \ldots, N$, where $N = 70$ is the number of points that represent the main stroke.

A. Temporal Features

• F1: Directional Features: The local writing direction at a point at instant $t$ is described by

\[ \cos(\alpha(t)) = \frac{\Delta X(t)}{\Delta S(t)}, \qquad \sin(\alpha(t)) = \frac{\Delta Y(t)}{\Delta S(t)} \]

where $\Delta X(t)$, $\Delta Y(t)$, and $\Delta S(t)$ are defined as follows:

\[ \Delta X(t) = X(t-1) - X(t+1), \qquad \Delta Y(t) = Y(t-1) - Y(t+1), \qquad \Delta S(t) = \sqrt{\Delta X^2(t) + \Delta Y^2(t)} \]

B. Spatial Features

• F2: Height and Width Features: Characterize the height and width of the bounding box surrounding the segmented character.

\[ \text{Height} = \max(Y_i) - \min(Y_i), \qquad \text{Width} = \max(X_i) - \min(X_i) \]

• F3: Aspect Ratio Feature: Characterizes the height-to-width ratio of the bounding box.

\[ \text{Aspect} = \frac{\text{Height}}{\text{Width}} \]

• F4: Baseline Feature: Characterizes the average distance above and below the baseline.

\[ D_1 = \frac{1}{M_1} \sum_{a=1}^{M_1} \left| Y(a) - Y_{\text{baseline}} \right|, \qquad D_2 = \frac{1}{M_2} \sum_{b=1}^{M_2} \left| Y(b) - Y_{\text{baseline}} \right|, \qquad D_3 = D_1 - D_2 \]

where $M_1 + M_2 = N = 70$, $M_1$ is the total number of points above the baseline, and $M_2$ is the total number of points below the baseline.

• F5: Straightness Feature: Measures the ratio of the Euclidean distance between the first and last points of the segment ($D$) to the sum of the Euclidean distances between consecutive points of the segment ($d_i$).

\[ S = \frac{D}{\sum_{i=1}^{N-1} d_i} \]

• F6: Number of Intersections Feature: The given $P = (X_i, Y_i)$ is converted to a bitmap matrix and then resized to 30 × 30 while preserving the aspect ratio of the character. Three horizontal lines, three vertical lines, one right diagonal line, and one left diagonal line are chosen, and the number of intersections of each line with the foreground is counted, as shown in Fig. 9.

Figure 9. Number of intersections feature.

IV. POST-PROCESSING

Some Arabic characters are similar in shape, and only the delayed strokes can distinguish them. Delayed strokes are used to distinguish between similar characters according to the number of delayed strokes, their position, and their shape (dots or lines).

V. RESULTS

The proposed system has an input feature vector of length 151 resulting from the concatenation of the following features: F1 (136 elements), F2 (2 elements), F3 (1 element), F4 (3 elements), F5 (1 element), and F6 (8 elements). An SVM with an RBF kernel is used as the classifier. Table III reports the sets used and the corresponding accuracies.

TABLE III. TABLE OF ACCURACIES

Training Sets    Test Set    Accuracy
1, 2             3           97.11%
1, 3             2           95.95%
2, 3             1           95.36%
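To make the composition of the feature vector concrete, the following Python sketch re-samples a stroke to 70 points and computes features F1 to F5 as defined in Section III. It is a hedged illustration, not the authors' code. It assumes that re-sampling places points at equal arc-length steps along the pen path, that the y axis points upward so that "above the baseline" means y greater than the baseline, that points lying exactly on the baseline count as below, and that the (cos, sin) pairs are stored in time order. All function names are hypothetical. With N = 70, the direction is defined only at the 68 interior points, which is consistent with the 136 elements reported for F1; F6 (8 elements) is omitted here, so the sketch yields 143 of the 151 elements.

```python
import math

N_POINTS = 70  # fixed number of points per character (Section II.E)

def resample(points, n=N_POINTS):
    """Linearly interpolate n points at equal arc-length steps (assumption)."""
    seg = [math.dist(a, b) for a, b in zip(points, points[1:])]
    total = sum(seg)
    if total == 0:
        return [points[0]] * n
    out, i, walked = [], 0, 0.0
    for k in range(n):
        target = total * k / (n - 1)
        # Advance to the segment that contains the target arc length.
        while i < len(seg) - 1 and walked + seg[i] < target:
            walked += seg[i]
            i += 1
        f = min((target - walked) / seg[i], 1.0) if seg[i] else 0.0
        (x0, y0), (x1, y1) = points[i], points[i + 1]
        out.append((x0 + f * (x1 - x0), y0 + f * (y1 - y0)))
    return out

def directional_features(p):
    """F1: cos and sin of the writing direction at every interior point."""
    feats = []
    for t in range(1, len(p) - 1):
        dx = p[t - 1][0] - p[t + 1][0]
        dy = p[t - 1][1] - p[t + 1][1]
        ds = math.hypot(dx, dy)
        feats += [dx / ds, dy / ds] if ds else [0.0, 0.0]
    return feats

def spatial_features(p, y_baseline):
    """F2 (height, width), F3 (aspect), F4 (D1, D2, D3), F5 (straightness)."""
    xs, ys = [x for x, _ in p], [y for _, y in p]
    height, width = max(ys) - min(ys), max(xs) - min(xs)
    aspect = height / width if width else 0.0
    above = [abs(y - y_baseline) for y in ys if y > y_baseline]
    below = [abs(y - y_baseline) for y in ys if y <= y_baseline]
    d1 = sum(above) / len(above) if above else 0.0
    d2 = sum(below) / len(below) if below else 0.0
    path = sum(math.dist(a, b) for a, b in zip(p, p[1:]))
    straightness = math.dist(p[0], p[-1]) / path if path else 0.0
    return [height, width, aspect, d1, d2, d1 - d2, straightness]

def feature_vector(points, y_baseline):
    """Concatenate F1 to F5 for one segmented character."""
    p = resample(points)
    return directional_features(p) + spatial_features(p, y_baseline)

vec = feature_vector([(0, 0), (3, 0), (3, 4)], y_baseline=0)
print(len(vec))  # prints: 143
```

A classifier would receive this vector with the eight intersection counts of F6 appended; the positions of the eight scan lines are not specified in the paper, so they are left out of the sketch.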

Our experiments show that the use of the temporal directional features alone results in a 96% recognition rate (on test set 3) and that some Arabic characters cannot be classified using those online features only. Adding several spatial features is crucial for the recognition of those characters. For example, the baseline features enhanced the recognition of the characters Dal, Raa, Nun, Isolated Lam, Faa, Qaf, and Ya'a, as shown in Fig. 10.a; the number of intersections feature enhanced the recognition of the characters Hah, Meem, Isolated Ein, and Isolated Lam, as shown in Fig. 10.b; and the height feature enhanced the recognition of the characters Beginning (Ba'a, Ta'a, Tha'a), Lam, and Dal, as shown in Fig. 10.c.

Figure 10. Features used to distinguish between similar characters.

The addition of spatial features along with online features does not solve all the problems. We still have confusion among some classes because of two main problems. The first is that the delayed strokes are missing in some cases in the ADAB database. This causes confusion among classes such as Beginning Lam versus Beginning (Ba'a, Ta'a, Tha'a, Nun), Isolated Nun versus Isolated Ya'a, and Middle Lam versus Middle (Ba'a, Ta'a, Nun, Tha'a), as shown in Fig. 11.a. The second problem is the similarity in the writing of different characters, such as Middle Meem and Middle Hah, as shown in Fig. 11.b.

Figure 11. Confusion between classes.

VI. CONCLUSIONS

In this paper, we have introduced a large, manually segmented database of Arabic handwritten characters generated from the ADAB database. The performance of different feature extraction methods on online segmented handwritten Arabic characters has been studied. The goal of introducing the segmented database and studying the performance of different features on it is to provide a benchmark for future research on segmentation-based Arabic handwriting recognition. Experimental results show that temporal online features are not enough to successfully recognize online segmented Arabic characters and that offline spatial features are needed to enhance the recognition of those characters.

REFERENCES

[1] H. Boubaker, A. Elbaati, M. Kherallah, A. M. Alimi, and H. Elabed, “Online Arabic Handwriting Modeling System Based on the Graphemes Segmentation,” in Proc. ICPR, pp. 2061-2064, 2010.
[2] K. Daifallah, N. Zarka, and H. Jamous, “Recognition-Based Segmentation Algorithm for On-Line Arabic Handwriting,” 10th International Conference on Document Analysis and Recognition, 2009.
[3] H. El Abed, V. Margner, M. Kherallah, and A. M. Alimi, “ICDAR 2009 Online Arabic handwriting recognition competition,” in Proceedings of the 2009 10th International Conference on Document Analysis and Recognition (ICDAR '09), 2009.
[4] H. Izakian, S. A. Monadjemi, B. Tork Ladani, and K. Zamanifar, “Multi-Font Farsi/Arabic isolated character recognition using chain codes,” World Academy of Science, Engineering and Technology 43, 2008.
[5] S. Jager, S. Manke, J. Reichert, and A. Waibel, “Online handwriting recognition: the npen++ recognizer,” IJDAR, vol. 3, no. 3, pp. 169-180, 2001.
[6] F. P. Kuhl and C. R. Giardina, “Elliptic Fourier features of a closed contour,” Computer Graphics and Image Processing, vol. 18, pp. 236-258, 1982.
[7] M. Pastor, A. Toselli, and E. Vidal, “Writing speed normalization for on-line handwritten text recognition,” in Proceedings of the 2005 Eighth International Conference on Document Analysis and Recognition (ICDAR'05).
[8] S. Snoussi Maddouri, F. Bouafif Samoud, K. Bouriel, N. Ellouze, and H. El Abed, “Baseline Extraction: Comparison of six methods on IFN/ENIT database,” The 11th International Conference on Frontiers in Handwriting Recognition, 2008.
[9] A. T. Al-Taani and S. Al-Haj, “Recognition of on-line Arabic handwritten characters,” Journal of Pattern Recognition Research 1 (2010) 23-37.
[10] H. Yuen, “A chain coding approach for real-time recognition of on-line handwritten characters,” ICASSP'96, Atlanta, USA, 1996.