International Review on Computers and Software (I.RE.CO.S.), Vol. 11, N. 5 ISSN 1828-6003 May 2016

Word Extraction from Arabic Handwritten Documents Based on Statistical Measures

Ayman Al-Dmour, Raed Abu Zitar

Abstract – In Arabic, word extraction is particularly challenging because words are often divided into sub-words, and a few letters do not connect to the following letter. In this paper, we present an efficient method for extracting words from Arabic handwritten documents. The proposed method is based on two groups of spatial measures (the lengths of connected components (CCs) and the gaps between these CCs), which differentiate successive CCs in text lines. Lengths are clustered into three distinct clusters to identify an optimal threshold for separating isolated letters, sub-words, and words. In addition, gaps are clustered into two clusters to indicate whether a gap occurs "between-words" or "within-a word". This clustering is implemented using the Self-Organizing Map (SOM) algorithm. The efficiency of the proposed method was tested by conducting experiments on 35 pages of handwritten Arabic text taken from the benchmarking Database for Arabic Handwritten Text Recognition Research (AHDB). Our tests produced very promising results, achieving a correct extraction rate of 86.3%. Copyright © 2016 Praise Worthy Prize S.r.l. - All rights reserved.

Keywords: Arabic Handwriting, Word Extraction, SOM Clustering, Handwriting Recognition

I. Introduction

Handwriting recognition (HWR) “is the ability of a computer to receive and interpret intelligible handwritten input from sources such as paper documents, photographs, touch-screens, and other devices” [1]. There are two types of HWR systems: character-based and word-based. In character-based systems, words are segmented into letters, and these letters are used for recognition; this is known as an analytical approach. Word-based systems do not require such segmentation, and entire words are used for recognition; this is known as a global approach. In both types, segmentation of text into words is required. The process of segmenting a handwritten text line into words is known as word extraction or separation. Furthermore, word extraction is the first and most critical preprocessing step in other word processing tasks, including i) word segmentation, which isolates word images into separate letters, ii) word recognition, which attempts to recognize words from their overall shape and considers words to be single, indivisible entities, and iii) word spotting, the process of detecting specific keywords in handwritten document images.

Extracting words from handwritten Arabic documents remains a challenge because words are often divided into sub-words and can be non-uniformly skewed, the spacing between words depends on the writer's habits, and words occasionally overlap. Moreover, some letters do not connect to the following letter, even in the middle of a word. In this paper, we focus on the segmentation of handwritten Arabic documents into words (word extraction).

Fig. 1 presents the framework of our approach. The first step is preprocessing, where the input text image is converted to binary and filtered. The second step involves text line segmentation, where a Horizontal Projection Profile (HPP) and space estimation are used. In the third step, connected components are extracted using a Columnar Projection Profile (CPP); subsequently, distance measures are applied. Afterward, a clustering step based on Self-Organizing Maps (SOM) is performed to obtain segmentation thresholds. In the final step, the proposed Word Extraction algorithm is applied.

The remainder of the paper is organized as follows. Section 2 presents previous research related to word extraction from handwritten Arabic. The preprocessing performed on the handwriting images is discussed in Section 3. In Section 4, the text line extraction method is introduced. The proposed word extraction method is presented in Section 5, and the clustering algorithm is described in Section 6. The experiments are described in Section 7, and the results obtained are discussed in Section 8. Finally, conclusions are presented in Section 9.

II. Previous Work

In this section, we introduce previous work in a manner that clearly separates the four converging and overlapping research areas concerning handwritten word processing tasks. It is worth mentioning that extracting words from Arabic handwritten text, which is a crucial step for word recognition, segmentation, and spotting systems, is a very challenging task due to the sub-word phenomenon [2], [3]. Table I categorizes the previous works according to word processing research areas. Most of the proposed techniques for segmenting Arabic handwritten text segment the text lines into sub-words or pieces of Arabic word (PAWs) [24]-[26]. Moreover, only Jawad et al. [27] and Ayman et al. [28] have focused on word extraction from handwritten Arabic text, while extensive research on word extraction from Latin scripts is available [29]–[33]. Jawad et al. [27] defined a static threshold to classify "within-a word" and "between-words" gaps by analyzing over 200 images containing more than 250 words derived from the IFN/ENIT database. The Bayesian criterion for minimum classification error was employed. The correct rate was 85 %. Ayman et al. [28] defined dynamic "within-a word" and "between-words" gaps, derived from the input document itself.

DOI: 10.15866/irecos.v11i5.9384

Experiments using different clustering algorithms were performed. The correct rate was 84.8 %. Table II summarizes the differences, the justification, and the motivation of the proposed method. In this work, the main objective is to introduce dynamic and adaptive segmentation measures. We take two notable characteristics of Arabic handwriting into consideration to define the necessary thresholds. First, Arabic words are often divided into PAWs. These PAWs may be isolated letters, particles, sub-words, or words. Second, the gaps between words and sub-words depend on the writer's habits. Therefore, we introduce two groups of measures that tolerate such cases. First, lengths are clustered to determine an optimal threshold for isolated letters, sub-words/particles, and words. Second, gaps between these segments are categorized as either "between-words" gaps or "within-a word" gaps. This clustering is implemented using a Kohonen Neural Network.

Fig. 1. Framework for word extraction of handwritten Arabic (stages, in order: Handwritten document, Preprocessing, Text lines extraction, CCs extraction, Length & gap measures, Clustering, Extracted words)


TABLE I HANDWRITTEN WORD PROCESSING AREAS

| Research area | Input | Output | Authors | Brief summary |
|---|---|---|---|---|
| Word segmentation | Handwritten word | Handwritten characters | [4]-[14] | The process of isolating a word image into a sequence of characters |
| Word recognition | Handwritten word | Printed word | [15]-[20] | The attempt to recognize words based on their overall shape; considers words to be single, indivisible entities |
| Word spotting | Handwritten document | Detect searched key word | [21]-[23] | The process of detecting specific keywords in handwritten document images |
| Word extraction | Handwritten document | Set of handwritten words | [24]-[25] | The process of segmenting a handwritten text line into sub-words or into pieces of Arabic word (PAWs) |
| | | | [27]-[28] | The process of segmenting a handwritten text line into words |

TABLE II COMPARING THE PROPOSED METHOD WITH WELL-KNOWN METHODS

| Method | Clustering | Corr. rate | Measures | Measures type |
|---|---|---|---|---|
| Jawad et al. [27] | Bayesian criteria | 85 % | Within-word and between-words gaps | Static (constant derived from analyzing over 200 images containing more than 250 words derived from the IFN/ENIT database, then used for any input document) |
| Ayman et al. [28] | K-means | 84.8 % | Within-word and between-words gaps | Dynamic (variable derived from the input document that would be segmented into words) |
| Proposed work | Kohonen Neural Network | 86.3 % | Within-word and between-words gaps; connected components lengths | Dynamic (variable derived from the input document that would be segmented into words); Adaptive (differentiates between isolated letter, sub-word, and particle segments) |

III. Handwritten Image Preprocessing

Once a handwritten image sample is acquired, preprocessing is necessary to enhance the image for better performance. Preprocessing typically includes many related techniques such as thresholding, binarization, and noise removal [34]. A sample of the preprocessing step is shown in Fig. 2.

Fig. 2. Text image after preprocessing

IV. Text Line Extraction

There are four main categories of text line extraction methods; namely, projection-based, grouping, smearing, and Hough-based [35]. In this work, text lines are extracted based on an HPP. This begins by computing the histogram of black pixels along the horizontal scan lines of the preprocessed handwritten text image. The spacing estimate method is then applied to remove the interference between lines. Fig. 3 presents the results of the text line segmentation step.

Fig. 3. Text line segmentation for Arabic handwritten documents

V. Word Extraction

In this section, we explain the proposed word extraction method, which is designed to accommodate the unique nature of the Arabic language. Well-known extraction techniques developed for other languages would require modifications before they could be used to extract Arabic handwriting. The first subsection explains the nature of Arabic handwriting. The following subsection explains the proposed method in detail.

V.1. Arabic Handwriting Text Nature

Whether printed or handwritten, Arabic writing is cursive by nature. It is written from right to left, and most of the letters are directly connected to the letter that immediately follows. A few letters do not connect to the following letter, even in the middle of a word.


Each individual letter can have up to four distinct forms, based on its position (beginning, middle, end, or isolated) within a word or a group of letters. Table III presents a sample of different Arabic letter positions, illustrating how the shape of a letter depends on its position.

TABLE III PRINTED AND HANDWRITTEN ARABIC LETTERS (columns: Handwritten Letter; Printed Letter in Beginning, Middle, and End positions)

Table IV presents a sample of general Arabic text writing characteristics, showing the characteristics that make Arabic handwriting processing somewhat difficult. These include the shape of the letter (according to its position), the fact that sub-words and particles look the same in handwritten text, and the writing habits, mood, education, health, and other conditions of the writer. As a result, fixed-width segmentation is not applicable. In Arabic, the characters have diacritics, which are positioned above or below the main parts of the character. Diacritics are usually not connected to the sub-word's body, and hence, they do not affect the proposed extraction process.

TABLE IV ARABIC TEXT WRITING CHARACTERISTICS [36]

1. Writing direction
2. Ascenders Letters
3. Descending Letters
4. Holes (loops)
5. Secondary Parts (dots or diacritics)
6. Ligatures
7. Connected Components (sub-word)

V.2. Proposed Method

Well-known word extraction techniques consider the sizes of the gaps between successive connected components and define a threshold to classify "within-a word" and "between-words" gaps. However, because of the unique nature of handwritten Arabic text (the sub-word phenomenon), applying this technique is challenging, and it has to be adapted [2], [3]. Most of the proposed techniques for offline Arabic handwriting processing segment the text lines into sub-words or pieces of Arabic word (PAWs) [24]-[26]. Based on these PAWs, previous studies on word extraction propose one group of measures, whether static or dynamic, to differentiate "within-a word" from "between-words" spaces, as shown in Fig. 4(c). Thereafter, based on these measures, they locate words and combine sub-words that have "within-a word" spaces between them before extraction.

In [23], the authors stated that "An Arabic sentence consists of words. The word may be a particle, a noun, or a verb. Particles may consist actually of more than one letter". In our work, we consider the fact that Arabic text is, by nature, often divided into PAWs. We then differentiate between handwritten PAWs as words, sub-words, particles, and isolated letters, as shown in Fig. 4(a). This categorization is necessary for the extraction phase because sub-words and particles look the same in handwritten text. However, a sub-word does not represent a word until it is joined with its other sub-word, whereas a particle is a word in itself. Therefore, we designed an adapted extraction algorithm that accommodates these properties of Arabic handwritten text. First, a particle is identified as a PAW that has "between-words" gaps on both sides, whereas a sub-word has this gap on only one side; Fig. 4(d) shows a sub-word and a particle. Second, isolated letters must be combined with the nearest PAW.

The Word Extraction algorithm, shown in Algorithm 1, begins by computing the CPP for each text line to locate connected components (CCs), Fig. 4(a). Two spatial measures (the CC lengths (L) and the gap distances (D) between these CCs) are then calculated, Fig. 4(b). Lengths are provided to a clustering algorithm, which calculates the thresholds for word, sub-word, and letter lengths (L3, L2, and L1), and gap distances are clustered to identify the "between-words" and "within-a word" gaps (D2 and D1). The complete set of measures is shown in Fig. 4(c). Then, the previously calculated segment lengths are updated according to the following steps. First, to solve the isolated letters problem, each isolated letter (L1) is combined with the segment before it (L−) or after it (L+), whichever lies across the smaller of the gap before (D−) and the gap after (D+). Second, to solve the sub-words and particles problem (L2), if there is a "within-a word" (D1) gap on one of its sides (i.e., it is a sub-word), the segment before or after it is updated, depending on where D1 lies. If the gaps on both sides are "between-words" (D2) gaps, the segment is kept without any update (i.e., it is a particle), Fig. 4(d). An example of the output produced by this algorithm is shown in Figs. 5.
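The two spatial measures used by the proposed method, CC lengths (L) and the gaps (D) between successive CCs, can be illustrated with a minimal sketch. It assumes a text line is given as a binary matrix (1 for ink, 0 for background) and approximates each CC by a run of non-empty columns in the columnar projection profile. The function names and the toy image are hypothetical and are not taken from the paper; columns are scanned by increasing index, so a right-to-left reading order would simply reverse the lists.

```python
def column_profile(line_img):
    """Columnar projection profile: number of ink pixels in each column."""
    return [sum(col) for col in zip(*line_img)]


def segments_from_profile(profile):
    """Return (start, end) column indices of each run of non-empty columns."""
    segments = []
    start = None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                       # a new segment begins
        elif count == 0 and start is not None:
            segments.append((start, x - 1))  # the segment ended at x - 1
            start = None
    if start is not None:                    # segment touching the right edge
        segments.append((start, len(profile) - 1))
    return segments


def lengths_and_gaps(segments):
    """Lengths L of the segments and gaps D between successive segments."""
    lengths = [end - start + 1 for start, end in segments]
    gaps = [segments[i + 1][0] - segments[i][1] - 1
            for i in range(len(segments) - 1)]
    return lengths, gaps


# Toy 2-row text line (hypothetical data).
line = [
    [1, 1, 0, 0, 1, 0, 1, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 1, 0],
]
segs = segments_from_profile(column_profile(line))
L, D = lengths_and_gaps(segs)
print(segs)  # [(0, 1), (4, 4), (6, 8)]
print(L, D)  # [2, 1, 3] [2, 1]
```

The resulting length and gap lists are the inputs that a clustering step would consume to derive the thresholds.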

VI. Clustering Algorithm

Most word extraction methods consider a spatial measure of the space between successive segmented connected components and identify a threshold to categorize "within-a word" and "between-words" spaces [37]. Methods that adapt to the properties of the text image are comparatively more robust. SOM clustering, which is also known as Kohonen Neural Networks, was first introduced by Kohonen [38]. The primary idea of SOM is to map the data patterns onto a d-dimensional grid of neurons (a feature map). This mapping process attempts to preserve topological relations between input and output data.


Figs. 4. Proposed method for word extraction

Algorithm 1 WORD EXTRACTION ALGORITHM

    procedure WEXTRACT
        Apply CPP to identify CCs
        for each CC do
            Compute Length (L)
            Compute Gap Spaces (D) between successive CCs
        end for
        Run clustering algorithm for L into (L1, L2, L3) then D into (D1, D2)
        for each L1 do
            if Gap before (D−) < Gap after (D+) then
                Length before (L−) = L− + D− + L1
            else
                Length after (L+) = L+ + D+ + L1
            end if
        end for
        for each L2 do
            if ((D− and D+) > D1) then
                L2 = L2
            else if D− < D2 then
                L− = L− + D− + L2
            else if D+ < D2 then
                L+ = L+ + D+ + L2
            end if
        end for
    end procedure
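The merging rules of Algorithm 1 can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: instead of updating segment lengths in place, it records which neighbouring segments should be joined and returns groups of segment indices. The parameters `letter_max`, `subword_max`, and `gap_threshold` are hypothetical names for the boundaries that the clustering step would provide (upper bounds of the L1 and L2 clusters, and the boundary between "within-a word" and "between-words" gaps).

```python
def extract_words(lengths, gaps, letter_max, subword_max, gap_threshold):
    """Group segment indices into words, following the rules of Algorithm 1.

    lengths[i] is the length of segment i; gaps[i] is the gap between
    segment i and segment i + 1 (so len(gaps) == len(lengths) - 1).
    """
    n = len(lengths)
    if n == 0:
        return []
    join = [False] * (n - 1)  # join[i]: segments i and i + 1 form one word

    for i, length in enumerate(lengths):
        before = gaps[i - 1] if i > 0 else None
        after = gaps[i] if i < n - 1 else None
        if length <= letter_max:
            # Isolated letter: attach it across the smaller neighbouring gap.
            if before is not None and (after is None or before < after):
                join[i - 1] = True
            elif after is not None:
                join[i] = True
        elif length <= subword_max:
            # Sub-word: merge across a within-word gap, if there is one.
            if before is not None and before < gap_threshold:
                join[i - 1] = True
            elif after is not None and after < gap_threshold:
                join[i] = True
            # Otherwise both gaps are between-words gaps: a particle, kept as is.

    words, current = [], [0]
    for i in range(1, n):
        if join[i - 1]:
            current.append(i)
        else:
            words.append(current)
            current = [i]
    words.append(current)
    return words


# Hypothetical measures (in pixels) for one text line.
lengths = [30, 5, 12, 12, 40]
gaps = [3, 10, 12, 2]
print(extract_words(lengths, gaps, letter_max=6, subword_max=20, gap_threshold=8))
# [[0, 1], [2], [3, 4]]
```

In this toy line, the isolated letter (segment 1) joins the word before it, segment 2 is kept as a particle because both of its gaps are large, and segment 3 is treated as a sub-word and merged with the word after it.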

(a) particles problem
(b) solved particle problems
Figs. 5. Output (a) previous methods (b) proposed method


The primary advantage of SOM is the spatial organization of the feature map that is achieved after the learning process. Fig. 6 illustrates the mapping between input and output data. Fig. 7 illustrates a basic SOM. The neighborhood function h decreases along with the distance to the winning neurons, and is responsible for the interactions between different neurons in the SOM structure. Typically, the radius is decreased in the training process, such that each neuron will become more isolated from the effects of its neighbors.
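As a rough illustration of the training loop described in this section, the sketch below trains a one-dimensional SOM on scalar measures such as gap distances. Several choices are assumptions made only for the example and are not taken from the paper: the units lie on a line rather than a p×q grid, the initial weights are spread evenly over the data range, the neighbourhood function is Gaussian, the learning rate and radius decay linearly, and a threshold is read off as the midpoint between adjacent trained weights.

```python
import math


def train_som_1d(values, n_units, epochs=100, alpha0=0.5, radius0=1.0):
    """Train a 1-D SOM on scalar data and return the sorted unit weights."""
    lo, hi = min(values), max(values)
    # Assumption: deterministic, evenly spread initial weights.
    weights = [lo + (hi - lo) * (i + 0.5) / n_units for i in range(n_units)]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs        # decays from 1 towards 0
        alpha = alpha0 * frac              # learning rate
        radius = radius0 * frac            # neighbourhood radius
        for x in values:
            # Winner: the unit whose weight is closest to the sample.
            winner = min(range(n_units), key=lambda i: abs(x - weights[i]))
            for i in range(n_units):
                # Gaussian neighbourhood, shrinking as the radius decreases.
                h = math.exp(-((i - winner) ** 2) / (2 * radius ** 2))
                weights[i] += alpha * h * (x - weights[i])
    return sorted(weights)


def thresholds_from_weights(weights):
    """Assumption: a threshold is the midpoint between adjacent unit weights."""
    return [(a + b) / 2 for a, b in zip(weights, weights[1:])]


# Hypothetical gap distances (pixels): small within-word gaps, large between-words gaps.
gaps = [3, 4, 5, 4, 3, 5, 20, 22, 21, 19]
units = train_som_1d(gaps, n_units=2)
gap_threshold = thresholds_from_weights(units)[0]
```

With two units, one weight settles near the small gaps and the other near the large gaps, so the midpoint separates the two groups. Three units would be used in the same way for the length measure.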

Fig. 6. Kohonen Neural Networks (SOM)

    Let X be the set of n training patterns x1, x2, …, xn
    W be a p×q grid of units wij, where i and j are their coordinates on that grid
    α be the learning rate, assuming values in (0,1), initialized to a given initial learning rate
    r be the radius of the neighborhood function h(wij, wmn, r), initialized to a given initial radius
    Repeat
        For k = 1 to n
            For all wij ∈ W, calculate dij = ||xk − wij||
            Select the unit that minimizes dij as the winner wwinner
            Update each unit wij ∈ W: wij = wij + α h(wwinner, wij, r) (xk − wij)
        Decrease the value of α and r
    Until α reaches 0

Fig. 7. The basic SOM clustering algorithm

VII. Experiments and Results

Several experiments were performed to demonstrate the effectiveness of the proposed method. They were implemented using Matlab 2010a. All experiments were performed on a Windows 7 machine with a 2.1 GHz CPU and 2 GB RAM.

VII.1. Database

For the purposes of this study, 35 different handwritten Arabic documents, all produced by different writers, were downloaded from the AHDB [39]. The AHDB contains 105 forms and is available to the public (http://handwriting.qu.edu.qa/dataset/).

VII.2. Error Analysis

A verification process for extracting words from offline Arabic handwritten text was carried out, and two main types of errors were identified: writer habits in terms of spacing and overlapping, and clustering validity.

VII.2.1. Writer Habits

Outliers are defined as "an observation (or subset of observations) that appears to be inconsistent with the remainder of that set of data" [40]. One of the most important problems in handwritten Arabic images is that the spacing between words depends on the writer's habits, and words occasionally overlap. Based on these characteristics, outlier identification measures for word length or "within-a word" spacing thresholds can be determined. Figs. 8 illustrate an example of outliers, where two words are connected to each other and appear to be a single word.

(a) Habitual inconsistent spacing between words and sub-words
(b) Habitual overlapping
Figs. 8. Example of inconsistent spacing and overlapping inside an Arabic handwriting text image

To address these types of problems and minimize their effects, a Grubbs test [41] was implemented on the thresholds dataset, where the tested value is the one lying furthest above or below the mean value. For each value x in the dataset, a z-score is used to determine whether x is an outlier, based on the standard two-sided form of the Grubbs criterion:

$$ z = \frac{|x - \bar{x}|}{s} > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t_{\alpha/(2N),\,N-2}^{2}}{N-2+t_{\alpha/(2N),\,N-2}^{2}}} $$

where $\bar{x}$ and $s$ are the mean and standard deviation of the dataset, $t_{\alpha/(2N),\,N-2}$ is a t-distribution value at a significance level $\alpha$, and $N$ is the number of values (lengths or gaps) in the segmented document.
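The outlier screening with the Grubbs test [41] can be sketched as below. The sketch uses the standard two-sided form of the test: the most extreme value is flagged when its z-score exceeds (N−1)/√N · √(t² / (N−2+t²)). It is only an illustration under stated assumptions: the function names are hypothetical, the critical t value is passed in by the caller (taken from a statistical table or a statistics library), and the sample data are invented.

```python
import math


def grubbs_statistic(values):
    """Return (G, index), where G is the largest z-score in the sample."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    idx = max(range(n), key=lambda i: abs(values[i] - mean))
    if std == 0:                 # all values identical: nothing to flag
        return 0.0, idx
    return abs(values[idx] - mean) / std, idx


def grubbs_critical(n, t_crit):
    """Critical value of G for sample size n and a given t critical value."""
    return (n - 1) / math.sqrt(n) * math.sqrt(t_crit ** 2 / (n - 2 + t_crit ** 2))


def find_outlier(values, t_crit):
    """Index of the most extreme value if it is an outlier, otherwise None."""
    g, idx = grubbs_statistic(values)
    return idx if g > grubbs_critical(len(values), t_crit) else None


# Hypothetical gap distances (pixels); the last one is suspiciously large.
gaps = [10, 12, 11, 13, 12, 40]
# Assumption: 4.851 is used here as an illustrative t critical value for N = 6;
# in practice it should be looked up for the chosen significance level.
outlier_index = find_outlier(gaps, t_crit=4.851)
```

Passing the t value in explicitly keeps the sketch free of third-party dependencies; a full implementation would compute it from the t-distribution for the chosen significance level and N − 2 degrees of freedom.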

VII.2.2. Clustering Validity Problem

The proposed algorithm was hybridized with a clustering algorithm.


In general, clustering presents two problems. First, the clustering algorithm that works most effectively to solve the problem must be identified. Second, the number of clusters that optimally fits the dataset must be determined. To address the first problem, we performed several preliminary experiments using common clustering techniques discussed in the literature, including distance-based, probability-based, and density-based methods. We concluded that the Self-Organizing Maps clustering algorithm was the most suitable algorithm for identifying thresholds for the Word Extraction algorithm. Regarding the second problem, it is generally accepted that good clusters will have high variance between clusters (inter-cluster) and low variance within clusters (intra-cluster). In the literature, there are a number of methods that can analyze how well separated the resulting clusters are. Therefore, we selected a weighted inter-intra (Wint) index [42] for the segment length thresholds L1, L2, and L3. In addition, we selected the Silhouette method [43] for the "within-a word" and "between-words" thresholds (D1, D2). Wint minimizes the within-cluster variance (MSE). Silhouette measures how closely related the objects in a cluster are, and simultaneously measures how distinct or well separated a cluster is from other clusters. Figs. 9(a) and 9(b) present the performance of the Silhouette and Wint indices, respectively. It is clear that three is the optimal number of clusters for segment lengths, and two clusters are optimal for gap distance measurements.
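To make the Silhouette criterion [43] concrete, the following minimal sketch computes the mean silhouette width for scalar measures and a given cluster assignment. For each value, a is its mean distance to the other members of its own cluster, b is its smallest mean distance to any other cluster, and the silhouette is (b − a) / max(a, b). The function name and the sample data are hypothetical; the sketch assumes at least two clusters and assigns 0 to singleton clusters, which is the usual convention.

```python
def silhouette_score(values, labels):
    """Mean silhouette width for 1-D data; requires at least two clusters."""
    clusters = {}
    for v, label in zip(values, labels):
        clusters.setdefault(label, []).append(v)

    scores = []
    for v, label in zip(values, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)   # convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(abs(v - u) for u in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(abs(v - u) for u in other) / len(other)
                for key, other in clusters.items() if key != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical gap distances with a sensible and a poor two-cluster labelling.
gaps = [1, 2, 10, 11]
good = silhouette_score(gaps, [0, 0, 1, 1])
poor = silhouette_score(gaps, [0, 1, 0, 1])
```

A well-separated labelling yields a score close to 1, while a poor one yields a low or negative score, so the number of clusters can be chosen by comparing the scores obtained for different cluster counts.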

Figs. 9. (a) Silhouette index for gap measure (b) Wint index for length measure

VIII. Results

Two experiments were performed. First, the SOM clustering algorithm was applied to three different handwritten Arabic texts. This demonstrated that the selected measures (lengths and gaps) depend only on the natural characteristics of the language and on personal writing habits. This experiment was carried out to establish the need for adapting the thresholds to the input: one can observe that the segmentation thresholds vary from one writer to another and from one document to another. Second, the proposed method was applied to 35 samples from the AHDB to show the results of the proposed method using SOM. Segmentation thresholds obtained for three different handwritten documents are shown in Table V. Based on the results, it is clear that the thresholds differ between samples, so the measures must be derived from each input document rather than fixed in advance. Table VI presents the results of the proposed method, where 35 Arabic handwriting images were analyzed. The best results were obtained for image number 2. In Fig. 10, the results are compared with those obtained in previously published studies. The new approach outperformed the previous methods described in the literature.

TABLE V CALCULATED MEASURES FOR THREE DIFFERENT SAMPLES (calculated measures in pixels)

| Sample | S1 | S2 | L1 | L2 | L3 |
|---|---|---|---|---|---|
| 1 | 22 | 7 | 17 | 36 | 80 |
| 2 | 13 | 4 | 10 | 24 | 61 |
| 3 | 12 | 5 | 13 | 40 | 65 |

Fig. 10. Rates of correct word extraction (compared methods: Jawad et al., Ayman et al., Proposed work)


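TABLE VI CLUSTERING ALGORITHM RESULTS

| Image No. | No. of words in the document | FCM: No. of misplaced seg. words | FCM: Correct seg. rate | SOM: No. of misplaced seg. words | SOM: Correct seg. rate |
|---|---|---|---|---|---|
| 1 | 70 | 4 | 94.3% | 5 | 92.9% |
| 2 | 90 | 5 | 94.4% | 5 | 94.4% |
| 3 | 58 | 6 | 89.7% | 5 | 91.4% |
| 4 | 92 | 17 | 81.5% | 15 | 83.7% |
| 5 | 62 | 8 | 87.1% | 8 | 87.1% |
| 6 | 29 | 5 | 82.8% | 6 | 79.3% |
| 7 | 55 | 15 | 72.7% | 13 | 76.4% |
| 8 | 58 | 9 | 84.5% | 3 | 94.8% |
| 9 | 23 | 5 | 78.3% | 2 | 91.3% |
| 10 | 51 | 4 | 92.2% | 4 | 92.2% |
| 11 | 14 | 3 | 78.6% | 2 | 85.7% |
| 12 | 45 | 6 | 86.7% | 4 | 91.1% |
| 13 | 88 | 21 | 76.1% | 20 | 77.3% |
| 14 | 34 | 5 | 85.3% | 4 | 88.2% |
| 15 | 42 | 3 | 92.9% | 3 | 92.9% |
| 16 | 38 | 4 | 89.5% | 3 | 92.1% |
| 17 | 91 | 8 | 91.2% | 5 | 94.5% |
| 18 | 80 | 11 | 86.3% | 13 | 83.8% |
| 19 | 35 | 6 | 82.9% | 1 | 97.1% |
| 20 | 101 | 9 | 91.1% | 14 | 86.1% |
| 21 | 9 | 2 | 77.8% | 0 | 100.0% |
| 22 | 24 | 3 | 87.5% | 3 | 87.5% |
| 23 | 43 | 9 | 79.1% | 7 | 83.7% |
| 24 | 82 | 23 | 72.0% | 22 | 73.2% |
| 25 | 23 | 4 | 82.6% | 5 | 78.3% |
| 26 | 33 | 7 | 78.8% | 7 | 78.8% |
| 27 | 10 | 2 | 80.0% | 2 | 80.0% |
| 28 | 14 | 3 | 78.6% | 4 | 71.4% |
| 29 | 27 | 4 | 85.2% | 3 | 88.9% |
| 30 | 62 | 8 | 87.1% | 7 | 88.7% |
| 31 | 64 | 18 | 71.9% | 17 | 73.4% |
| 32 | 67 | 17 | 74.6% | 12 | 82.1% |
| 33 | 41 | 7 | 82.9% | 4 | 90.2% |
| 34 | 27 | 7 | 74.1% | 2 | 92.6% |
| 35 | 88 | 18 | 79.5% | 12 | 86.4% |
| Total / average | 1770 | 286 | 83.8% | 242 | 86.3% |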
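The overall rates in the last row of Table VI follow directly from the totals: the correct segmentation rate is the share of words that were not misplaced. A short check, with a hypothetical helper name, is:

```python
def correct_rate(total_words, misplaced_words):
    """Correct segmentation rate in percent."""
    return 100.0 * (total_words - misplaced_words) / total_words


# Totals taken from Table VI.
fcm_rate = correct_rate(1770, 286)
som_rate = correct_rate(1770, 242)
print(round(fcm_rate, 1), round(som_rate, 1))  # 83.8 86.3
```

The same formula reproduces the per-image rates, for example 66 of 70 words for image 1 with FCM gives 94.3%.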

IX. Conclusion

This paper investigates the performance of a word extraction algorithm for handwritten Arabic text using a SOM clustering algorithm. Word segmentation is achieved by first identifying the CPP for each text line in the document. The algorithm then calculates the lengths of the CCs and the gaps between them. Subsequently, the clustering algorithm is used to locate distinct measures for determining optimal segmentation thresholds, based on the document itself. Experiments were performed using the SOM clustering algorithm. The efficiency of the proposed method was tested by conducting experiments on different handwritten Arabic text documents accessed from the AHDB benchmark dataset; a promising correct word extraction rate was achieved.

[16]

[17]

[18]

[19]

References [1] [2]

The free dictionary http://acronyms.thefreedictionary.com/hwr, June 2016. Hashem Ghaleb, P. Nagabhushan, and Umapada Pal. Article:

[20]

Copyright © 2016 Praise Worthy Prize S.r.l. - All rights reserved

Segmentation of overlapped handwritten arabic sub-words. IJCA Proceedings on National conference on Digital Image and Signal Processing, DISP 2015(2):24–29, April 2015. Full text available. Sargur N. Srihari, Gregory R. Ball, and Harish Srinivasan. Arabic and Chinese Handwriting Recognition: SACH 2006 Summit College Park, MD, USA, September 27-28, 2006 Selected Papers, chapter Versatile Search of Scanned Arabic Handwriting, pages 57–69. Springer Berlin Heidelberg, Berlin, Heidelberg, 2008. M. Zand, A.N. Nilchi, and S.A. Monadjemi. Recognition-based segmentation in persian character recognition. In Proceedings of World Academy of Science: Engineering Technolog, page 183, April 2008. A. Alaei, P. Nagabhushan, and U. Pal. A baseline dependent approach for persian handwritten character segmentation. In Pattern Recognition (ICPR), 2010 20th International Conference on, pages 1977–1980, Aug 2010. S. Mozaffari, K. Faez, F. Faradji, M. Ziaratban, and S. M. Golzan. A comprehensive isolated farsi/arabic character database handwritten ocr research. In 10th International Workshop on Frontiers in Handwriting Recognition, pages 385–389, 2006. Mario Pechwitz, Samia Snoussi Maddouri, Volker Märgner, Noureddine El-louze, and Hamid Amiri. Ifn/enit - database of handwritten arabic words. In Proc. of CIFED 2002, pages 129– 136, 2002. E. EL-Sherif and S. Abdleazeem. A two-stage system for arabic handwritten digit recognition tested on a new large database. In International Conference on Artificial Intelligence and Pattern Recognition, pages 237–242, 2007. A. Lawgali, M. Angelova, and A. Bouridane. Hacdb: Handwritten arabic characters database for automatic character recognition. In Visual Information Processing (EUVIP), 2013 4th European Workshop on, pages 255–259, June 2013. Ahmed Lawgali. A survey on Arabic character recognition. International Journal of Signal Processing, Image Processing and Pattern Recognition, 8:401–426, 2015. H. Goraine, M. Usher, and S. Al-Emami. 
Off-line arabic character recognition. Computer, 25(7):71–74, July 1992. Husni A. Al-Muhtaseb, Sabri A. Mahmoud, and Rami S. Qahwaji. Recognition of off-line printed arabic text using hidden markov models. Signal Process., 88(12):2902–2912, December 2008. Adnan Amin, Humoud Al-Sadoun, and Stephen Fischer. Handprinted arabic character recognition system using an artificial network. Pattern Recognition, 29(4):663 – 675, 1996. R. El-Hajj, L. Likforman-Sulem, and C. Mokbel. Arabic handwriting recognition using baseline dependant features and hidden markov modeling. In Document Analysis and Recognition, 2005. Proceedings. Eighth International Conference on, pages 893–897 Vol. 2, Aug 2005. Mahmoud Khalifa and Yang BingRu. Advanced Research on Electronic Commerce, Web Application, and Communication: International Conference, ECWAC 2011, Guangzhou, China, April 16-17, 2011. Proceedings, Part I, chapter A Novel Word Based Arabic Handwritten Recognition System Using SVM Classifier, pages 163–171. Springer Berlin Heidelberg, Berlin, Heidelberg, 2011. A. Benouareth, A. Ennaji, and M. Sellami. Hmms with explicit state duration applied to handwritten Arabic word recognition. In Pattern Recognition, 2006. ICPR 2006. 18th International Conference on, volume 2, pages 897– 900, 2006. J.H. AlKhateeb, Jinchang Ren, Jianmin Jiang, S.S. Ipson, and H. El Abed. Word-based handwritten Arabic scripts recognition using dct features and neural network classifier. In Systems, Signals and Devices, 2008. IEEE SSD 2008. 5th International Multi-Conference on, pages 1–5, July 2008. S. Alma’adeed, C. Higgens, and D. Elliman. Recognition of offline handwritten Arabic words using hidden markov model approach. In Pattern Recognition, 2002. Proceedings. 16th International Conference on, volume 3, pages 481–484 vol.3, 2002. M. T. El-Melegy and A. A. Abdelbaset. Global features for offline recognition of handwritten Arabic literal amounts. In Information and Communications Technology, 2007. ICICT 2007. 
ITI 5th International Conference on, pages 125–129, Dec 2007. V. Madhvanath, S. Govindaraju. The role of holistic paradigms in

International Review on Computers and Software, Vol. 11, N. 5

443

Ayman Al-Dmour, Raed Abu Zitar

[21]

[22]

[23]

[24]

[25]

[26]

[27]

[28]

[29]

[30]

[31]

[32]

[33]

[34]

[35]

handwritten word recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23:149–164, February 2001.
[21] Sargur Srihari, Harish Srinivasan, Pavithra Babu, and Chetan Bhole. Handwritten Arabic word spotting using the CEDARABIC document analysis system. In Proc. Symposium on Document Image Understanding Technology (SDIUT-05), College Park, MD, pages 123–132, 2005.
[22] Sargur Srihari, Harish Srinivasan, Pavithra Babu, and Chetan Bhole. Spotting words in handwritten Arabic documents. In Document Recognition and Retrieval XIII: Proceedings SPIE, 2006.
[23] M. Khayyat, L. Lam, and C.Y. Suen. Arabic handwritten word spotting using language models. In Frontiers in Handwriting Recognition (ICFHR), 2012 International Conference on, pages 43–48, Sept 2012.
[24] S.S. Maddouri, F. Ghazouani, and F.B. Samoud. Text lines and PAWs segmentation of handwritten Arabic document by two hybrid methods. In Advanced Technologies for Signal and Image Processing (ATSIP), 2014 1st International Conference on, pages 310–315, March 2014.
[25] Y. Osman. Segmentation algorithm for Arabic handwritten text based on contour analysis. In Computing, Electrical and Electronics Engineering (IC-CEEE), 2013 International Conference on, pages 447–452, Aug 2013.
[26] N. Aouadi, S. Amiri, and A. Kacem Echi. Segmentation of connected components in Arabic handwritten documents. Procedia Technology, 10:738–746, 2013. First International Conference on Computational Intelligence: Modeling Techniques and Applications (CIMTA) 2013.
[27] Jawad H AlKhateeb, Jianmin Jiang, Jinchang Ren, and Stan Ipson. Interactive knowledge discovery for baseline estimation and word segmentation in handwritten Arabic text. Recent Advances in Technologies, Maurizio A Strangio (Ed.), 2009.
[28] A. Al-Dmour and F. Fraij. Segmenting Arabic handwritten documents into text lines and words. International Journal of Advancements in Computing Technology (IJACT), 6(3):109–119, 2014.
[29] T. Stafylakis, V. Papavassiliou, V. Katsouros, and G. Carayannis. Robust text-line and word segmentation for handwritten documents images. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 3393–3396, 2008.
[30] U.-V. Marti and H. Bunke. Text line segmentation and word recognition in a system for general writer independent handwriting recognition. In Document Analysis and Recognition, 2001. Proceedings. Sixth International Conference on, pages 159–163, 2001.
[31] R. Manmatha and Jamie L. Rothfeder. A scale space approach for automatically segmenting words from historical handwritten documents. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 27(8):1212–1225, Aug 2005.
[32] Vassilis Papavassiliou, Themos Stafylakis, Vassilis Katsouros, and George Carayannis. Handwritten document image segmentation into text lines and words. Pattern Recognition, 43(1):369–377, 2010.
[33] G. Louloudis, B. Gatos, I. Pratikakis, and C. Halatsis. Text line and word segmentation of handwritten documents. Pattern Recognition, 42(12):3169–3183, 2009. New Frontiers in Handwriting Recognition.
[34] G.S. Peake and T.N. Tan. A general algorithm for document skew angle estimation. In IEEE International Conference on Image Processing, volume 2, pages 230–233, 1997.
[35] Z. Razak, K. Zulkiflee, M. Yamani, I. Idris, E. M. Tamil, M. Noorzaily, M. Noor, R. Salleh, M. Yaakob, Z. M. Yusof, and M. Yaacob. Off-line handwriting text line segmentation: a review. International Journal of Computer Science and Network Security, 8(7):12–20, 2008.
[36] John M. Trenkle, Steve Schlosser, and S. Gillies. An off-line Arabic recognition system for machine-printed documents. In Symposium on Document Image Understanding Technology, Annapolis, MD, pages 155–161, 1997.
[37] Giovanni Seni and Edward Cohen. External word segmentation of off-line handwritten text lines. Pattern Recognition, 27(1):41–52, 1994.

[38] T. Kohonen. The self-organizing map. Proceedings of the IEEE, 78(9):1464–1480, Sep 1990.
[39] S. Al-Ma’adeed, D. Elliman, and C.A. Higgins. A data base for Arabic handwritten text recognition research. In Frontiers in Handwriting Recognition, 2002. Proceedings. Eighth International Workshop on, pages 485–489, 2002.
[40] V. Barnett and T. Lewis. Outliers in Statistical Data. John Wiley & Sons, 3rd edition, 1994.
[41] F. E. Grubbs. Procedures for detecting outlying observations in samples. Technometrics, 11(1):1–20, 1969.
[42] Alexander Strehl. Relationship-based clustering and cluster ensembles for high-dimensional data mining. PhD thesis, The University of Texas at Austin, May 2002.
[43] Peter J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65, April 1987.

Authors’ information

1 Faculty of Information Technology, Al-Hussein Bin Talal University, Jordan. E-mail: [email protected]

2 Faculty of Information Technology, American University of Madaba, Jordan. E-mail: [email protected]

Ayman Al-Dmour is an Associate Professor of Computer Information Systems at Al-Hussein Bin Talal University (AHU) in Jordan. He received his BSc in Electronic-Communication Engineering in 1994 from Jordan University of Science and Technology, Irbid, Jordan. He earned his MSc and PhD in 2003 and 2006, respectively, both in Computer Information Systems at the Arab Academy for Banking and Financial Sciences, Amman, Jordan. His research interests are in Arabic language processing, data compression, and computer education. At AHU, he has led the Department of Computer Information Systems and the Computer and Information Technology Center. Currently, he is the Dean of the College of Information Technology.

Raed Abu Zitar, a professor, was born in Gaza in 1966. He earned his BS in electrical engineering from the University of Jordan in 1988, a master’s degree in computer engineering from North Carolina A&T State University, Greensboro, in 1989, and his PhD in computer engineering from Wayne State University in 1993. He is currently the Dean of the College of Information Technology, American University of Madaba, Mādabā, Jordan. He has more than 80 publications in international journals and conferences; his research interests are machine learning, simulations, modeling, pattern recognition, and evolutionary algorithms with applications

