A Novel Baseline Detection Method of Handwritten Arabic-Script ...

S.A. Noah et al. (Eds.): M-CAIT 2013, CCIS 378, pp. 67–77, 2013. © Springer-Verlag Berlin Heidelberg 2013.
A Novel Baseline Detection Method of Handwritten Arabic-Script Documents Based on Sub-Words

Tarik Abu-Ain 1, Siti Norul Huda Sheikh Abdullah 1, Bilal Bataineh 1, Khairuddin Omar 1, and Ashraf Abu-Ein 2

1 Pattern Recognition Research Group, Center for Artificial Intelligence Technology, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor, Malaysia
2 Computer Engineering Department, Al-Balqa' Applied University, Faculty of Engineering Technology, Amman, Jordan

[email protected], {mimi,ko}@ftsm.ukm.my, [email protected], [email protected]

Abstract. Baseline detection is an important process in document image analysis and recognition systems. It is used extensively in many preprocessing stages, such as text normalization, skew correction, character segmentation, and slant and slope correction, as well as in feature extraction. In this work, we propose a new baseline detection method for Arabic script based on the horizontal projection histogram and the direction features of the sub-word skeletons. Sub-words form the main components of the text; each consists of at least one letter, in addition to diacritics and dots. The efficiency of the proposed method is demonstrated by experimental results on the IFN/ENIT Arabic benchmark dataset.

Keywords: Preprocessing, Text normalization, Arabic handwriting, Baseline detection, Sub-word extraction.

1 Introduction

Arabic is one of the six official languages of the United Nations, and its script has been widely adopted by many other languages such as Jawi, Persian, Kurdish, Pashto, Urdu, and Hausa [1, 2]. Arabic script has three written forms: printed, handwritten, and calligraphic [3]. The baseline in Arabic script is defined as a virtual straight line on which all characters are aligned and connected at a specific part of each character [4]. The baseline carries important information about the orientation of the text and the location of the connection points between characters, as well as the locations of the ascenders, descenders, dots, and diacritics. A text line in Arabic script is split into three imaginary regions: upper, middle, and lower. The upper region contains ascenders, dots, and upper diacritics; the lower region contains descenders, dots, and lower diacritics; and the main content of the text, including the loops, lies in the middle region (the baseline region).


For printed scripts, the baseline can be detected reliably using the general horizontal projection histogram, as shown in Fig. 1(b-d). For handwritten scripts, this method is not suitable because of the wide variety of writing styles and of characteristics such as cursive writing and the large number of dots and diacritics in Arabic script, as shown in Fig. 1(e-g).
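As a rough illustration of the horizontal projection idea, the sketch below counts the ink pixels in each row of a binary image and takes the row with the highest count as the baseline. The function name `projection_baseline` and the toy image are assumptions made for this example; they are not taken from [5], whose exact procedure may differ.

```python
def projection_baseline(binary):
    """Return (baseline_row, histogram) for a binary text image.

    `binary` is a list of rows; ink pixels are 1, background pixels are 0.
    """
    # Horizontal projection: number of ink pixels in every row.
    histogram = [sum(row) for row in binary]
    # The first row holding the peak count is taken as the baseline.
    baseline_row = histogram.index(max(histogram))
    return baseline_row, histogram


# Toy 5x6 "printed word": row 3 carries the long horizontal stroke.
toy = [
    [0, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 1, 0],
    [0, 0, 1, 0, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
]
row, hist = projection_baseline(toy)
print(row, hist)
```

On clean printed text the peak row usually coincides with the connecting strokes. In handwriting, slanted sub-words and large diacritics can shift or flatten the peak, which is the failure case described above.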

Fig. 1. The horizontal projection histogram method [5]: (a) flowchart of baseline detection, (b-d) an example of successful baseline detection for a printed text, (e-g) an example of failed baseline detection for a handwritten text

In document image analysis, baseline detection is a substantial step that leads to more accurate results, especially when a set of logical, language-topology-dependent rules is applied at the same time. The aim of the presented work is baseline detection for off-line handwritten words in Arabic script. Such words may consist of one or more Parts of Arabic Word (PAWs), which are one of the main properties distinguishing Arabic script from other scripts. A PAW boundary occurs whenever one of the letters أ، و، ز، ر، ذ، د appears in the middle of a word, which divides the word into PAWs (Table 1). Because every person writes in a free style, these PAWs are distributed irregularly, which makes it difficult to detect a single straight baseline accurately for each word or line of text. These facts conflict with the definition of the baseline introduced above.

This paper consists of four main sections. First, we introduce the work and its importance. Then, an overview of related work is given. Next, the proposed method is described and the experimental results are reported. Finally, the conclusion and future directions are presented.
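To make the notion of a PAW concrete, the following sketch splits a word, given as a text string, after every non-connecting letter. It only illustrates the rule at the character level; the method in this paper works on images, not strings. The function name `split_into_paws` is hypothetical, and the plain alef and its variants are added to the letter set as an assumption, since the text lists only six letters.

```python
# Letters that do not join to the letter that follows them.
# The first group is the list given in the text; the second group
# (plain alef and two of its variants) is an assumption added here.
NON_CONNECTING = set("أوزرذد") | set("اآإ")


def split_into_paws(word):
    """Split one Arabic word (no spaces, no diacritics) into its PAWs."""
    paws = []
    current = ""
    for ch in word:
        current += ch
        if ch in NON_CONNECTING:
            # The cursive stroke ends after this letter.
            paws.append(current)
            current = ""
    if current:
        paws.append(current)
    return paws


print(split_into_paws("رادس"))  # four PAWs, one per letter
print(split_into_paws("تونس"))  # two PAWs
```

Each PAW returned here corresponds to one connected pen stroke (ignoring dots and diacritics), which is why a handwritten word can have several pieces sitting at different heights.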


Table 1. Arabic letters and their shapes depending on the position in the text ("-" marks a form that does not exist)

Beginning | Middle | End | Isolated
- | - | ـــﺄ | أ
ﺑـــ | ـــﺒـــ | ـــﺐ | ب
ﺗـــ | ـــﺘـــ | ـــﺖ | ت
ﺛـــ | ـــﺜـــ | ـــﺚ | ث
ﺟـــ | ـــﺠـــ | ـــﺞ | ج
ﺣـــ | ـــﺤـــ | ـــﺢ | ح
ﺧـــ | ـــﺨـــ | ـــﺦ | خ
- | - | ـــﺪ | د
- | - | ـــﺬ | ذ
- | - | ـــﺮ | ر
- | - | ـــﺰ | ز
ﺳـــ | ـــﺴـــ | ـــﺲ | س
ﺷـــ | ـــﺸـــ | ـــﺶ | ش
ﺻــ | ـــﺼــ | ــﺺ | ص
ﺿـــ | ـــﻀــ | ـــﺾ | ض
ﻃـــ | ـــﻄـــ | ـــﻂ | ط
ﻇـــ | ـــﻈـــ | ـــﻆ | ظ
ﻋـــ | ـــﻌـــ | ـــﻊ | ع
ﻏـــ | ـــﻐـــ | ـــﻎ | غ
ﻓـــ | ـــﻔـــ | ـــﻒ | ف
ﻗـــ | ـــﻘـــ | ـــﻖ | ق
ﻛـــ | ـــﻜـــ | ـــﻚ | ك
ﻟـــ | ـــﻠـــ | ـــﻞ | ل
ﻣـــ | ـــﻤـــ | ـــﻢ | م
ﻧـــ | ـــﻨـــ | ـــﻦ | ن
هـــ | ـــﻬـــ | ـــﻪ | ﻩ
- | - | ـــﻮ | و
ﻳـــ | ـــﻴـــ | ـــﻲ | ي

2 State of the Art

This section gives an overview of methods for baseline detection in handwritten Arabic script and reviews previous work on the topic.

2.1 Baseline Properties

Scripts are divided into two main categories according to whether the text is generated by a machine or by a human. In machine-printed text, the script lies on a straight line, because there is no human intervention and the formatting rules are dictated by text-editing programs; this fully conforms to the definition of the baseline. In human handwritten scripts, however, baseline detection becomes challenging because of the free style of writing, writing habits, environmental circumstances, the writer's psychological and physical state, and the writing tools. All of these factors make accurate baseline detection difficult. Each script has unique baseline properties that depend on the way its characters are written. Since Arabic script is cursive, the baseline is defined as a horizontal straight line on which all binding points between the characters, as well as a certain part of each character, should lie.

2.2 Arabic Script Baseline Detection Methods

Pechwitz et al. (2002) proposed a baseline detection method based on processing a polygonally approximated skeleton [6]. However, the method conflicts with the baseline definition, since it uses a linear regression algorithm that does not work well with unaligned text (Fig. 4(k-l)). Farooq et al. (2005) slightly enhanced the Pechwitz method by using a two-step linear regression after locating the local minima points of the word contour [7]. However, it still suffers from the same problems (Fig. 4(m-n)). Ziaratban et al. (2008) proposed a baseline estimation algorithm using template matching and a polynomial fitting algorithm [8]. However, the method is less effective in the presence of short words, dots, and diacritics. Bobaker [9] proposed a baseline detection method that thins the text and then finds the relation between the alignment of the text points and the directions of their trajectory neighbors. However, the algorithm is not efficient when it deals with short words that consist of isolated characters only. Boukerma et al. (2009) proposed an algorithm based on the sub-word skeleton, in which some feature points are used to estimate a horizontal band of the text using a linear interpolation algorithm [10]. However, the method fails in the case of ligatures and small diacritics. In addition, the final result is not a straight line, which conflicts with the definition of the baseline. Nagabhushan et al. (2010) proposed an algorithm that uses a piece-wise painting scheme to identify points that are then used to estimate the baseline [11]. However, the algorithm is less effective when large diacritics and small characters are present (Fig. 4(o-p)). The literature [6-11] shows that most methods are affected by factors such as diacritics, isolated characters, and long words. In addition, some binding points between characters do not lie on the baseline, and some sub-words do not intersect the straight baseline at the right points.

3 The Proposed Method

Accurate detection of the baseline location helps in extracting more accurate and meaningful information such as writing direction, ascenders, descenders, dots, and diacritics. Irregularity in Arabic handwriting style leads to irregularity in the straightness of the word or line components. Baseline detection and straightening is therefore a crucial text normalization step in the preprocessing stage. According to the literature, most methods do not detect the correct baseline for short characters or when large diacritics are present. To overcome this problem, a new method is proposed that estimates the local baseline of each PAW, which is then used to detect the global baseline of the whole word or line. Fig. 2 shows the framework of the proposed method.


Fig. 2. The Proposed Method Framework

3.1 Binarization

An adaptive thresholding technique is used, as shown in Equation 1 [12]. According to the literature, it is considered one of the best methods for both fine and degraded document images [3], as shown in Fig. 3(a).

$$T_W = m_W - \frac{m_W^2 \cdot \sigma_W}{(m_g + \sigma_W)(\sigma_{Adaptive} + \sigma_W)} \tag{1}$$

where T_W is the threshold value of the binarization window, m_W is the mean value of the pixels in the window, m_g is the mean value of all pixels in the image, σ_Adaptive is the adaptive standard deviation of the window, and σ_W is the standard deviation of the window.
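The sketch below evaluates the threshold for a single window, reading Equation 1 as T_W = m_W - (m_W^2 * σ_W) / ((m_g + σ_W)(σ_Adaptive + σ_W)). The function name `window_threshold` and the sample values are assumptions for illustration. Since the text does not say how σ_Adaptive is computed, it is passed in as a plain argument rather than derived here.

```python
def window_threshold(m_w, sigma_w, m_g, sigma_adaptive):
    """Evaluate the window threshold T_W of Eq. (1).

    m_w            -- mean grey level of the window
    sigma_w        -- standard deviation of the window
    m_g            -- mean grey level of the whole image
    sigma_adaptive -- adaptive standard deviation of the window
                      (supplied by the caller; its formula is not
                      given in the text)
    """
    numerator = (m_w ** 2) * sigma_w
    denominator = (m_g + sigma_w) * (sigma_adaptive + sigma_w)
    return m_w - numerator / denominator


# Hypothetical window statistics, chosen only to show the arithmetic:
# 100 - (100^2 * 20) / ((120 + 20) * (50 + 20)) = 100 - 200000 / 9800
t_w = window_threshold(m_w=100.0, sigma_w=20.0, m_g=120.0, sigma_adaptive=50.0)
print(round(t_w, 2))
```

A pixel in the window would then be classified by comparing its grey level with `t_w`. Note that the denominator is zero for a completely black, flat image, so a practical implementation would need to guard that case.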


[Figure 3: panels (a) to (i). Annotations in the figure: nearest connected component, overlapped connected components, Distance 1, Distance 2.]

Fig. 3. An example of a handwritten text image (رادس) showing the result after each step of the proposed method: (a) text after binarization and removal of noise, dots, and diacritics, (b) sub-word detection, (c) horizontal projection histogram of each sub-word and calculation of T2, (d) detection of candidate baseline regions, (e) thinned text, (f) detection of circle shapes, pixel labeling, and selection of landmark spots, (g) local baseline location, (h) baseline straightening process, (i) final baseline location.

3.2 Connected Component Detection

This step uses a two-pass scanning algorithm for connected-region detection [13], which works as follows. On the first pass, for every foreground pixel: (i) get the 8-neighboring pixels of the current pixel; (ii) if there are no labeled neighbors, assign a unique label to the current pixel and continue; otherwise, find the neighbor with the smallest label and assign it to the current pixel; (iii) store the equivalence between neighboring labels. On the second pass, every foreground pixel is relabeled with its lowest equivalent label, as shown in Fig. 3(b).
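The two passes can be sketched as follows. This is a minimal pure-Python version that assumes the image is a list of 0/1 rows and stores the label equivalences in a union-find structure; the function name `label_components` and the use of union-find are choices made for this example, not details confirmed by [13].

```python
def label_components(binary):
    """Two-pass, 8-connected labeling of a binary image (list of 0/1 rows)."""
    rows, cols = len(binary), len(binary[0])
    labels = [[0] * cols for _ in range(rows)]
    parent = {}  # union-find forest over provisional labels

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # smaller label stays the root

    next_label = 1
    # First pass: provisional labels plus equivalences.
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c]:
                continue
            # 8-neighbours already visited in raster order: W, NW, N, NE.
            candidates = ((r, c - 1), (r - 1, c - 1), (r - 1, c), (r - 1, c + 1))
            neighbours = [
                labels[rr][cc]
                for rr, cc in candidates
                if rr >= 0 and 0 <= cc < cols and labels[rr][cc]
            ]
            if not neighbours:
                parent[next_label] = next_label
                labels[r][c] = next_label
                next_label += 1
            else:
                smallest = min(neighbours)
                labels[r][c] = smallest
                for n in neighbours:
                    union(smallest, n)
    # Second pass: replace every label by its lowest equivalent label.
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                labels[r][c] = find(labels[r][c])
    return labels


toy = [
    [1, 0, 1, 0, 0],
    [1, 1, 1, 0, 1],
    [0, 0, 0, 0, 1],
]
labeled = label_components(toy)
for row in labeled:
    print(row)
```

In the toy image the two upper arms first receive labels 1 and 2, which are merged when the scan reaches the row joining them. The final labels are not necessarily consecutive, which is harmless when they are only used to separate sub-words.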

3.3 Noise, Dots and Diacritics Removal

By the definition of the baseline, only the main parts of the characters are aligned on it, not the dots or diacritics. Components whose size is smaller than a threshold value (T1), calculated by Equation 2, are therefore removed.

$$T_1 = \left( \frac{\sum \text{black pixels}}{\text{number of connected components}} \right) / Ve \tag{2}$$

where 3