Available online at www.sciencedirect.com

ScienceDirect Procedia Technology 11 (2013) 666 – 671

The 4th International Conference on Electrical Engineering and Informatics (ICEEI 2013)

Text Normalization Framework for Handwritten Cursive Languages by Detection and Straightness the Writing Baseline

Tarik Abu-Ain*, Siti Norul Huda Sheikh Abdullah, Bilal Bataineh, Waleed Abu-Ain, Khairuddin Omar

Pattern Recognition Research Group, Center for Artificial Intelligence Technology, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor, Malaysia.

Abstract

Baseline detection plays an important role in optical recognition systems and document image analysis systems. It is widely used in preprocessing stages for text normalization, including skew, slant and slope correction, writing-line straightening and character segmentation, as well as in feature extraction. In this work, a new framework for baseline detection and straightening of cursive handwritten text is proposed, based on analyzing and extracting direction features from the subwords of the text skeleton. Arabic script is chosen as a case study because it is cursive and is used to write many languages around the world, such as Arabic, Jawi, Urdu and Persian. Experimental results on a popular Arabic dataset show the efficiency of the proposed framework.

© 2013 The Authors. Published by Elsevier Ltd. Open access under CC BY-NC-ND license. Selection and peer-review under responsibility of the Faculty of Information Science & Technology, Universiti Kebangsaan Malaysia.

Keywords: Handwriting text normalization; Baseline detection; Preprocessing; Sub-word extraction; Feature extraction

* Corresponding author. Tel.: +60-3-89216708; fax: +60-3-89256732. E-mail address: [email protected]

2212-0173 © 2013 The Authors. Published by Elsevier Ltd. Open access under CC BY-NC-ND license. Selection and peer-review under responsibility of the Faculty of Information Science & Technology, Universiti Kebangsaan Malaysia. doi:10.1016/j.protcy.2013.12.243

Tarik Abu-Ain et al. / Procedia Technology 11 (2013) 666 – 671


1. Introduction

In cursive script, the baseline is defined as the imaginary straight line on which all characters are aligned and connected, at a specific part of each character [1]. The baseline provides essential information about text orientation as well as the location of ascenders, descenders, dots, diacritics and connection points between characters. For languages written horizontally, such as Arabic, the text line is virtually split into three regions: upper, middle and lower [2]. The upper region contains upper dots, upper diacritics and ascenders; the main parts of characters, loops and connection points between characters lie in the middle region (the baseline region); and the lower region contains descenders, lower dots and lower diacritics.

In printed Arabic text, the baseline can be detected accurately by finding the row that contains the largest number of foreground pixels, as shown in Fig. 1(a). In handwritten script, this approach is not suitable because of the wide variety of free writing styles and the irregular alignment of subwords, which arises when one of the six letters أ، و، ز، ر، ذ، د occurs at the beginning or in the middle of a word, as shown in Fig. 1(b). This leads to inaccurate detection of a straight baseline for the text, which conflicts with the definition of the baseline given above.

This paper consists of five sections: the introduction, the related works, the proposed framework, the experimental results, and finally the conclusions and future directions.

Fig. 1. Illustration of: (a) success in baseline detection process for a printed text, (b) failure in baseline detection process for a handwritten text.
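The printed-text approach described in the introduction (take the row with the most foreground pixels) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function names `horizontal_projection` and `printed_baseline_row`, the list-of-lists image representation (1 = foreground, 0 = background) and the toy image are all assumptions made for the example.

```python
from typing import List

# Assumption: a binary image is a list of rows, each row a list of 0/1 values,
# where 1 marks a foreground (ink) pixel.
BinaryImage = List[List[int]]


def horizontal_projection(image: BinaryImage) -> List[int]:
    """Count the foreground pixels in every row of the image."""
    return [sum(row) for row in image]


def printed_baseline_row(image: BinaryImage) -> int:
    """Return the index of the row with the most foreground pixels.

    Ties are resolved in favour of the topmost row (an arbitrary choice).
    """
    profile = horizontal_projection(image)
    return max(range(len(profile)), key=profile.__getitem__)


# Hypothetical toy "printed word": row 2 is the long horizontal stroke.
toy = [
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
]
baseline = printed_baseline_row(toy)
```

For the toy image the row counts are 1, 2, 6 and 1, so row 2 is selected. When subwords sit at different heights, as in handwriting, a single densest row no longer passes through every subword, which is the failure case of Fig. 1(b).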

2. Related works

This section gives an overview of some popular approaches to baseline detection in handwritten Arabic script. A baseline detection method based on processing a polygonally approximated skeleton was proposed by Pechwitz [3]. Farooq proposed another method, which applies a two-step linear regression after locating the local minima points of the word contour [4], and which achieves a reasonable improvement over the method in [3]. However, the results of both methods conflict with the definition of the baseline, since linear regression does not work well with unaligned text. Boukerma proposed an algorithm that uses a piece-wise painting scheme to estimate the baseline by identifying a set of points to be used in the estimation process [5]. However, the algorithm fails in the presence of large diacritics and small characters. In summary, most methods are adversely affected by diacritics, isolated characters and short words. In addition, some binding points between characters and subwords do not intersect the estimated baseline at the right points.

3. The proposed framework

Detecting the baseline location is very useful for extracting accurate information such as writing direction, ascenders, descenders, dots and diacritics. Irregularity in Arabic handwriting style leads to irregularity in the straightness of text components. To overcome this problem, a framework is proposed that estimates a baseline for each subword separately; these baselines are later used to estimate a straight baseline for the whole text (Fig. 2). The whole process of the proposed baseline detection and straightening is as follows:


For each document image:
Step 1: Binarize the image using the method proposed in [6] (Fig. 3(a)).
Step 2: Extract the subwords of the text image using the method proposed in [7] (Fig. 3(b)).
Step 3: Keep only the main body of each subword (remove noise and dots).
Step 4: For each subword,
○ Apply the general horizontal projection histogram [8] to the subword (Fig. 3(c)).
○ Calculate the threshold value "T1", which equals the mean value of the black-pixel counts in the histogram (the vertical line in Fig. 3(c)).
○ Detect the candidate baseline regions, where each region is a set of contiguous histogram rows whose foreground-pixel count exceeds the threshold value T1 (Fig. 3(d)).
○ Apply a robust text thinning algorithm to ensure that the text skeleton is only one pixel wide [9].
○ Extract a set of direction features from the skeleton to detect horizontally and vertically adjacent pixels as well as circular shapes; each of these is given a unique label: LH, LV and LC, respectively.
○ Assign landmark points to the pixels where two different labels meet (Fig. 3(e)).
○ The region with the largest number of landmarks contains the baseline, which is the row in that region with the highest number of foreground pixels (Fig. 3(f)).
Step 5: Align all subwords of the text line on one horizontal straight line on which all subword baselines lie (Fig. 3(g)).
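The region-selection part of Step 4 can be sketched as below. This is a minimal sketch under stated assumptions, not the authors' code: the names `candidate_regions` and `subword_baseline` are hypothetical, T1 is taken as the mean of the per-row counts over all rows of the subword (the paper does not say whether empty rows are included), and the landmark rows are assumed to be supplied by the skeleton-labeling stage, which is not reproduced here.

```python
from typing import List, Tuple

Region = Tuple[int, int]  # (first_row, last_row), both inclusive


def candidate_regions(profile: List[int]) -> Tuple[float, List[Region]]:
    """Return T1 and the runs of consecutive rows whose count exceeds T1.

    Assumption: T1 is the mean of all per-row foreground counts.
    """
    t1 = sum(profile) / len(profile)
    regions: List[Region] = []
    start = None
    for row, count in enumerate(profile):
        if count > t1 and start is None:
            start = row                       # a new run begins
        elif count <= t1 and start is not None:
            regions.append((start, row - 1))  # the run ended on the previous row
            start = None
    if start is not None:                     # run reaches the last row
        regions.append((start, len(profile) - 1))
    return t1, regions


def subword_baseline(profile: List[int], regions: List[Region],
                     landmark_rows: List[int]) -> int:
    """Pick the region holding the most landmarks, then its densest row."""
    if not regions:
        # Assumption: with a flat profile, fall back to the whole subword.
        regions = [(0, len(profile) - 1)]

    def landmarks_in(region: Region) -> int:
        top, bottom = region
        return sum(1 for r in landmark_rows if top <= r <= bottom)

    top, bottom = max(regions, key=landmarks_in)
    return max(range(top, bottom + 1), key=profile.__getitem__)


# Hypothetical projection histogram of one subword and landmark row indices.
profile = [1, 4, 1, 1, 5, 7, 2]
t1, regions = candidate_regions(profile)
row = subword_baseline(profile, regions, landmark_rows=[1, 4, 5, 5])
```

In this example T1 is 3.0, which yields two candidate regions, rows 1 to 1 and rows 4 to 5. The second region contains three of the four landmarks, and its densest row (row 5, with 7 pixels) is returned as the subword baseline.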

The effectiveness of the proposed framework is evident when its result is compared with the result of baseline detection by the horizontal projection profile method (Fig. 3(h)).
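The alignment in Step 5 can be illustrated with a small sketch. The paper does not state which row is used as the common straight baseline, so the choice of the (low) median of the subword baselines here is purely an assumption, as are the `Subword` record and the function name `align_subwords`.

```python
from dataclasses import dataclass
from statistics import median_low
from typing import List


@dataclass
class Subword:
    top: int       # row of the bounding-box top edge in the text-line image
    baseline: int  # detected baseline row, in text-line image coordinates


def align_subwords(subwords: List[Subword]) -> List[Subword]:
    """Shift each subword vertically so that all baselines share one row.

    Assumption: the common row is the low median of the detected baselines.
    """
    target = median_low([s.baseline for s in subwords])
    aligned = []
    for s in subwords:
        shift = target - s.baseline  # positive moves the subword down
        aligned.append(Subword(top=s.top + shift, baseline=target))
    return aligned


# Hypothetical text line with three subwords written at different heights.
line = [Subword(top=4, baseline=12), Subword(top=9, baseline=15),
        Subword(top=2, baseline=11)]
result = align_subwords(line)
```

Here the common row is 12, so the three subwords are shifted by 0, -3 and +1 rows respectively. Only positions are computed; copying the pixels of each subword to its new position is left out of the sketch.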


Fig. 2. The proposed framework of handwritten Arabic scripts baseline detection. [Flowchart; blocks shown: Document image binarization [6]; Subwords separation [7]; Noise, dots and diacritics removing (T1); Text thinning [9]; Candidate baseline regions detection (horizontal projection profile and T2) [8]; Text skeleton analysis; Pixels labeling and landmarks assigning; Baseline detection for each subword; Alignment process; A straight baseline of the text line. The blocks are grouped under the label "The proposed method".]

4. Experiments and results

Many experiments were performed on set_a of the IFN/ENIT dataset [10] to validate the proposed framework. Because handwriting is unconstrained and the topology of handwritten text differs from one writer to another, there is no ideal position for the baseline of handwritten text (Fig. 4(a-j)), in contrast to machine-printed text (Fig. 4(k-o)). The initial results are very promising in solving many important problems that arise in baseline detection for both handwritten and machine-printed cursive scripts, such as the cases of diacritics, isolated characters and short words. In addition, the binding points between characters lie accurately on the baseline, and all subwords intersect the straight baseline in the right location interval, as shown in Fig. 4(f-j).


Fig. 3. An example of the proposed framework steps: (a) original text image, (b) subword extraction, (c) calculation of T1 and T2, (d) text skeleton and candidate baseline regions, (e) pixel labeling and landmark points assigning, (f) subwords baseline, (g) the final result of text baseline straightness, (h) the result of the baseline detection process using the horizontal projection profile.

Fig. 4. Results of baseline detection using: (a-e) the horizontal projection histogram method [5], (f-j) the proposed method "handwritten text", (k-o) both the horizontal projection histogram and the proposed method "machine printed text".

5. Conclusion

In this paper, a new baseline detection method for Arabic script is proposed. The method consists of four main stages: connected component separation; calculation of the average of the horizontal projection histogram for each component; circle shape detection, pixel labeling and landmark spot selection; and finally baseline detection and straightening. The visual experiments demonstrate the high-quality performance of the proposed method on textual binary images. The IFN/ENIT dataset is used in the experiments, and the proposed method achieves superior performance compared with some other methods. Currently, we are looking for further robust baseline-relevant features to be used in both the preprocessing and feature extraction stages and to be tested in a complete recognition system. We are also developing the framework to be adjustable to different languages, and we are considering a set of baseline-relevant features that can be used to segment text into characters.

Acknowledgements

The authors would like to thank the Faculty of Information Science and Technology and the Center for Research and Instrumentation Management of the Universiti Kebangsaan Malaysia for providing facilities and financial support under Exploration Research Grant Scheme Project No. ERGS/1/2011/STG/UKM/01/18 entitled "Calligraphy Recognition in Jawi Manuscripts using Palaeography Concepts Based on Perception Based Model" and Fundamental Research Grant Scheme No. FRGS/1/2012/SG05/UKM/02/8 entitled "Generic Object Localization Algorithm for Image Segmentation".

References

[1] Gacek, A., Arabic Manuscripts: A Vademecum for Readers. 2009: BRILL. 338.
[2] Abu-Ain, T.A.H., et al. Off-line Arabic Character-Based Writer Identification – a Survey. in International Journal on Advanced Science, Engineering and Information Technology, Proceeding of the International Conference on Advanced Science, Engineering and Information Technology, Bangi, Malaysia, 2011.
[3] Pechwitz, M. and V. Margner. Baseline estimation for Arabic handwritten words. in Frontiers in Handwriting Recognition, 2002. Proceedings. Eighth International Workshop on. 2002.
[4] Farooq, F., V. Govindaraju, and M. Perrone. Pre-processing methods for handwritten Arabic documents. in Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on. 2005.
[5] Boukerma, H. and N. Farah. A Novel Arabic Baseline Estimation Algorithm Based on Sub-Words Treatment. in Frontiers in Handwriting Recognition (ICFHR), 2010 International Conference on. 2010.
[6] Bataineh, B., S.N.H.S. Abdullah, and K. Omar, An adaptive local binarization method for document images based on a novel thresholding method and dynamic windows. Pattern Recognition Letters, 2011. 32(14): p. 1805-1813.
[7] Shapiro, L.G. and G.C. Stockman, Computer Vision. 2002: Prentice Hall. 608.
[8] Parhami, B. and M. Taraghi, Automatic Recognition of Printed Farsi Texts, in Proc. Conf. Pattern Recognition. 1980: Oxford, England.
[9] Abu-Ain, W., et al., Skeletonization Algorithm for Binary Images, in International Conference on Electrical Engineering and Informatics 2013 (ICEEI 2013).
[10] IFN/ENIT - Database of Arabic Handwritten Words. 2002, Institute of Communications Technology, T.U. Braunschweig, Germany.
