Segmenting Arabic Handwritten Documents into Text lines and Words Ayman Al-Dmour, Fares Fraij

Segmenting Arabic Handwritten Documents into Text Lines and Words

Ayman Al-Dmour (1) and Fares Fraij (2)
(1) Corresponding Author, Department of Information Technology, Taif University, Taif, KSA, Email: [email protected]. Sabbatical Leave: Al-Hussein Bin Talal University, Ma'an, Jordan
(2) Department of Software Engineering, Al-Hussein Bin Talal University, Ma'an, Jordan, Email: [email protected]

Abstract
In this paper, we present a method for segmenting Arabic handwritten documents into text lines and words. Text line segmentation is addressed by a well-known technique, the horizontal projection profile, in which autocorrelation is used to enhance the self-similarity of the profile and thereby improve the estimation of text line spacing. Word extraction is based on an adaptation of a known method, gap metrics. The adaptation derives the gap thresholds from the properties of each input document, which makes the proposed method tolerant of, and robust to, the nature of Arabic handwriting. Arabic text is often divided into words, sub-words, and letters; moreover, some letters do not connect to the following letter, even in the middle of a word. The gap metric method exploits the membership values of a clustering algorithm to identify segmentation thresholds for “within word” and “between words” gaps. The proposed method is tested on the benchmark dataset for Arabic handwritten text recognition research (AHDB), and very promising results were achieved, with an 84.8% correct extraction rate.

Keywords: Arabic handwriting, Text line segmentation, Word extraction, FCM clustering

1. Introduction
Handwriting recognition (HWR) [1] “is the ability of a computer to receive and interpret intelligible handwritten input from sources such as paper documents, photographs, touch-screens and other devices.” HWR can be either off-line, by optically scanning text from a piece of paper, or on-line [2], by sensing the movements of the pen tip on a pen-based touch screen. Applications of HWR include office automation, mail sorting, banking, and natural human-computer interaction. HWR is either character-based or word-based. In the first type, words are segmented into letters, which are then used to recognize the words. In contrast, word-based HWR does not need such segmentation; it uses whole words for recognition. In both types, segmentation of text into words is required; Figure 1 illustrates the process of a recognition system.

[Diagram: Handwritten Image, Segmentation, Recognition (Character based, Word based)]

Figure 1. Handwriting recognition

Normally, segmentation involves breaking down a handwritten document into its basic entities, namely, text lines and words. However, locating text lines and words in Arabic handwritten documents remains a challenge. For text lines, the challenges include differing skew angles between lines and adjacent text lines that touch; for words, the text is often divided into letters, sub-words, and words. Scripts are non-uniformly skewed, and the spaces between words, and also between sub-words, depend on the writer's habits [3].

International Journal of Advancements in Computing Technology (IJACT), Volume 6, Number 3, May 2014

This paper handles segmentation of Arabic handwritten text into text lines and words, first by applying the horizontal projection profile (HPP), a previously published text line segmentation algorithm, and second by applying an enhanced gap metric, based on a well-known method for word extraction, in which thresholds are tailored to each input document. Figure 2 presents the framework of the proposed approach. As shown, step 1 preprocesses the input text image to enhance its quality using binarization and filtering. Step 2 applies text line segmentation based on HPP. In step 3, the word extraction process begins by extracting connected components using the columnar projection profile (CPP), followed by applying gap measures and, finally, clustering to obtain the segmentation thresholds for word extraction.

Input: Arabic handwritten image

Step 1: Preprocessing
Step 2: Text line extraction (HPP, autocorrelation, spacing estimation)
Step 3: Word extraction (CPP, gap measure, clustering)
Output: Segmented document

Figure 2. Arabic handwritten segmentation framework

This paper is organized as follows. Section 2 is dedicated to related studies on text line and word extraction with Arabic handwritten documents. The nature of Arabic text is presented in Section 3. The pre-processing of handwriting images is discussed in Section 4. Sections 5 and 6 describe the text line extraction method used and the proposed word extraction method, respectively. Clustering methods are presented in Section 7. Section 8 discusses experiments and results. Finally, the conclusion and future work are discussed in Section 9.

2. Previous Work
The literature review in this section is presented in two parts. Section 2.1 covers related work on text line extraction, and Section 2.2 covers related work on word extraction.

2.1. Text Line Extraction
Several algorithms for text line extraction have been proposed in the literature and implemented for different scripts. Text line extraction from Arabic handwritten text has received increasing attention. These efforts are summarized in Table 1.


Table 1. Overview of related work in text line segmentation

Authors | Short Summary
A. Zahour et al. [4] | The authors applied HPP analysis and then regrouped the connected components. To detect the separating lines, they used a partial contour-following method.
Jayant Kumar et al. [5] | Graph-based method for extracting handwritten text lines. The proposed method is very robust to non-uniform skew and character size variations.
Zhixin Shi et al. [6] | Based on a generalized adaptive local connectivity map (ALCM) using a steerable directional filter. The proposed method is efficient for fluctuating, touching, or crossing text lines.
N. Ouwayed et al. [7] | The authors applied morphology analysis to the terminal letters of Arabic words. The proposed scheme was evaluated on documents with overlapping and touching lines and was very efficient.
Muna Khayyat et al. [8] | Based on morphological dilation with a dynamic adaptive mask. The method was evaluated on the Arabic handwritten documents database, which contains multi-skewed and touching lines.
I.S.I. Abuhaiba et al. [9] | A shortest spanning tree algorithm was applied to text line segmentation of Arabic handwritten documents. The proposed algorithm can handle various line positions but does not handle touching and overlapping text lines.

In this study, text lines are extracted using the well-known HPP method [4]. The HPP is sometimes irregular and inaccurate because of touching and overlapping text lines. We enhance the HPP results by applying autocorrelation in order to obtain a better estimate of text line spacing.

2.2. Word Extraction
In the processing of handwritten words, four terms overlap and are easily confused: word segmentation, word recognition, word spotting, and word extraction (or separation). Word segmentation is the process of dividing handwritten words into characters for the purpose of character-based word recognition, while word recognition is the process of recognizing words from their overall shape. Word spotting is locating a query word in a dataset of document images, with the search done in the image domain without converting to text. Segmentation of a handwritten text line into words is known as word separation or word extraction.

Word segmentation of Arabic handwritten text into characters for the purpose of character-based recognition has received considerable attention in the literature [10-13]. Word recognition from Arabic handwritten text has also received increasing attention [14-17]. More recently, exploration of word spotting in documents written in Arabic has begun [18-20]. Surprisingly, AlKhateeb et al. [21] is the first and only study known to the authors on word extraction from handwritten Arabic text, in spite of the fact that word extraction from Latin scripts has received extensive interest in the literature [22-26].

Most of the proposed techniques for word extraction consider a spatial measure of the gap between successive connected components and define a threshold to classify “within” and “between” word gaps [27]. For Arabic, in AlKhateeb et al. [21], these thresholds were obtained by manually analyzing over 200 images containing more than 250 words, derived from all IFN/ENIT databases [28]. Methods that do not use any prior knowledge and adapt to the properties of the document image would be more robust [25]. Thus, in this study we introduce a method in which the segmentation thresholds are calculated from each handwritten document. This makes the method adaptive, tolerant, and robust to the nature of Arabic handwriting.

3. Arabic Handwriting Text Natures
The Arabic language has 28 basic letters; each letter changes its shape based on its position in the word: beginning, middle, end, or isolated. Thus, each character can take up to four different forms. This structure of Arabic letters gives Arabic text a cursive nature. Table 2 illustrates a sample of these positional forms.


Table 2. Printed and Handwritten Arabic Letters
Columns: Letter | Beginning (Printed, Handwritten) | Middle (Printed, Handwritten) | End (Printed, Handwritten)

Every person has an individual way of writing that depends on writing habits, the shape of the letters, and the mood, education, health, and other conditions of the writer. Figure 3 presents an example of Arabic handwritten text. Arabic has some general characteristics, such as a right-to-left writing direction and the fact that some letters are isolated and some are connected, depending on their location. Moreover, words are often divided into sub-words; the space between words depends on the writer's habits, and sometimes words overlap [3].

1. Writing direction
2. Ascender letters
3. Descender letters
4. Holes (loops)
5. Secondary parts (dots or diacritics)
6. Ligatures
7. Connected components (sub-word)

Figure 3. Arabic text writing characteristics

4. Handwritten Image Preprocessing
The first step in the framework of the proposed approach is to enhance the quality of the input image. Pre-processing usually includes several techniques such as thresholding, binarization, and noise removal [29]. A sample of pre-processing output is shown in Figure 4.

Figure 4. Text image after preprocessing
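The paper does not state which binarization or filtering technique was used, so the following is only a minimal sketch of one common choice: a global Otsu threshold followed by a 3x3 majority filter to remove isolated specks. The function names `otsu_threshold`, `binarize`, and `remove_specks` are illustrative, and both techniques are assumptions rather than the authors' procedure.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the Otsu threshold (0..255) of a uint8 grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                     # probability of the dark class up to t
    mu = np.cumsum(prob * np.arange(256))       # cumulative mean up to t
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    sigma_b = np.zeros(256)                     # between-class variance per threshold
    valid = denom > 0                           # avoid division by zero at the extremes
    sigma_b[valid] = (mu_total * omega[valid] - mu[valid]) ** 2 / denom[valid]
    return int(np.argmax(sigma_b))

def binarize(gray):
    """Map dark (ink) pixels to 1 and background pixels to 0."""
    return (gray <= otsu_threshold(gray)).astype(np.uint8)

def remove_specks(binary):
    """Keep a pixel as ink only if at least 5 pixels of its 3x3 window are ink."""
    h, w = binary.shape
    padded = np.pad(binary, 1)                  # zero border so edges have full windows
    total = sum(padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3))
    return (total >= 5).astype(np.uint8)
```

The output is a binary image with ink equal to 1, which is the convention assumed by the projection-profile sketches later in the text. The majority filter also erodes sharp corners slightly, so a real system would tune or replace it.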

5. Text Line Extraction
The method employed here identifies text lines based on HPP. It starts by computing the histogram of black pixels along the horizontal scan-lines of the preprocessed image. Figure 5 presents the output of the HPP process, where the peaks represent text baseline positions and the valleys represent the blanks between lines. The HPP is sometimes irregular and inaccurate because of writing habits and touching text lines (Figure 5 (a)). Therefore, autocorrelation is applied to enhance the self-similarity of the HPP, which improves the estimation of text line spacing (Figure 5 (b)).


(a) Locating text lines using ordinary HPP
(b) Locating text lines using enhanced HPP

Figure 5. Locating text lines
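To make the idea concrete, the sketch below computes an HPP from a binary image (ink equal to 1) and estimates the dominant line spacing as the lag of the strongest non-zero peak of the profile's autocorrelation. The paper does not specify how the autocorrelation result is turned into a spacing estimate, so the peak-picking rule and the function names here are assumptions.

```python
import numpy as np

def horizontal_projection(binary):
    """Count ink pixels (value 1) in every row of a binary image."""
    return binary.sum(axis=1).astype(float)

def estimate_line_spacing(hpp):
    """Return the lag (in rows) of the strongest autocorrelation peak, or None."""
    centered = hpp - hpp.mean()
    # Keep only the non-negative lags of the full autocorrelation.
    acf = np.correlate(centered, centered, mode="full")[len(hpp) - 1:]
    k = np.arange(1, len(acf) - 1)
    # A lag is a peak if it rises above its left neighbour and does not fall
    # below its right neighbour; lag 0 is excluded by construction.
    is_peak = (acf[k] > acf[k - 1]) & (acf[k] >= acf[k + 1])
    peaks = k[is_peak]
    if peaks.size == 0:
        return None
    return int(peaks[np.argmax(acf[peaks])])

# Synthetic page: five "text lines", each 8 rows thick, one every 20 rows.
page = np.zeros((100, 60), dtype=np.uint8)
for top in range(0, 100, 20):
    page[top:top + 8, 5:55] = 1
spacing = estimate_line_spacing(horizontal_projection(page))
```

Because the autocorrelation pools evidence from all lines at once, a single irregular or touching pair of lines has less influence on the spacing estimate than it has on the raw profile.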

6. Proposed Method for Word Extraction
In this paper we modify an existing word extraction method for Arabic handwritten images. The existing method computes a spatial measure of the gap between successive connected components and defines a threshold to classify “within” and “between” word gaps, as shown in Figure 6.a. In the proposed method, this threshold is drawn from the properties of the input document image, whose text is often divided into words, sub-words, and letters; a few of these letters do not connect to the following letter, even in the middle of a word, as shown in Figure 6.b.

Figure 6. (a) Within and between word spaces (b) Text classification to letter, word, and sub-words

The algorithm of the proposed method is presented in Figure 7. CPP is applied to each text line to define connected components (CC). Then the length of each segment and the gap spaces between segments are computed. Lengths are used to determine threshold values for letters, sub-words, and words, while gap spaces are used to identify the “within word” and “between words” thresholds. These classifications are obtained with a clustering algorithm. Finally, we propose updating the CC lengths by merging segments that are separated by spaces smaller than the “within word” threshold, on the condition that the length of the previous or next segment is less than the word length. An example of the output of the proposed word extraction method is shown in Figure 8.


Word Extraction Algorithm
1. Input: text lines
2. Apply CPP for each text line.
3. Compute:
   i. Length for each segment (CC) in each text line
   ii. Gap spaces between these segments
4. Implement clustering algorithm for:
   i. Segment lengths. Output: <letter, sub-word, and word thresholds>
   ii. Gap spaces. Output: <“within word” or “between words” gap thresholds>
5. Compare each computed gap space in 3 with the “within word” threshold:
   i. If previous or next segment length is less than word length:
         Combine previous and next segments
      Else
         Continue
   ii. Update previous\next segment length
6. Output: extracted words.

Figure 7. Pseudo code for proposed word extraction method
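Steps 2 and 3 of Figure 7 can be sketched as follows for a single binary text-line image (ink equal to 1). The function names are illustrative and the sketch treats every run of ink-bearing columns as one segment, which is an assumption about how CPP is used to delimit connected components.

```python
import numpy as np

def column_segments(line_img):
    """Return [(start, end)] column runs that contain ink; end is exclusive."""
    has_ink = line_img.sum(axis=0) > 0
    # Pad with False so that runs touching the image border are closed.
    padded = np.concatenate(([False], has_ink, [False])).astype(int)
    diff = np.diff(padded)
    starts = np.where(diff == 1)[0]             # first column of each run
    ends = np.where(diff == -1)[0]              # one past the last column of each run
    return list(zip(starts.tolist(), ends.tolist()))

def lengths_and_gaps(segments):
    """Segment widths and the blank widths between consecutive segments."""
    lengths = [end - start for start, end in segments]
    gaps = [segments[i + 1][0] - segments[i][1] for i in range(len(segments) - 1)]
    return lengths, gaps
```

Arabic is written from right to left, but the lengths and gaps are the same whichever direction the columns are scanned, so the sketch scans left to right for simplicity.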

Figure 8. Output example for proposed word extraction method
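The merging rule of step 5 in Figure 7 can be read in more than one way; the sketch below takes one plausible reading, in which two neighbouring segments are combined when the gap between them is smaller than the “within word” threshold and at least one of them is shorter than the word-length threshold. The parameter names `within_gap` and `word_len` are hypothetical and stand for the thresholds produced by clustering.

```python
def merge_segments(segments, within_gap, word_len):
    """Merge (start, end) column segments into words; end is exclusive."""
    if not segments:
        return []
    merged = [segments[0]]
    for start, end in segments[1:]:
        prev_start, prev_end = merged[-1]
        gap = start - prev_end
        prev_len = prev_end - prev_start        # length of the (possibly merged) previous segment
        next_len = end - start
        if gap < within_gap and (prev_len < word_len or next_len < word_len):
            merged[-1] = (prev_start, end)      # combine; this also updates the segment length
        else:
            merged.append((start, end))
    return merged

words = merge_segments([(0, 10), (12, 20), (40, 70), (72, 80)], within_gap=5, word_len=25)
```

Using the already merged previous segment when testing the length condition implements the "update previous\next segment length" step, so a word stops absorbing neighbours once it reaches the word-length threshold unless the neighbour itself is short.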

7. Clustering Algorithms
Clustering techniques are mostly unsupervised methods; their role is to partition a dataset into clusters (groups) so that data points within the same cluster are more closely related to each other than to those assigned to different clusters. Any clustering effort faces two main questions. First, which clustering method is suitable? Second, what is the optimal number of clusters for the dataset? To answer the first question, we conducted experiments using the most well-known types of clustering algorithms: distance-based, probability-based, and density-based. For the second question, the number of clusters generally needs to be either specified by users based on their prior knowledge or estimated in a well-defined way. In this study, the number of clusters is set from the authors' prior knowledge of the nature of the Arabic language.

7.1. Distance Based
There are different distance-based clustering algorithms. In this work we examined clustering methods that have been applied to a wide range of topics and areas, including k-means and fuzzy c-means clustering.


K-means is one of the most popular partitioning clustering methods [30]. This method uses Euclidean distance as the dissimilarity measure. The k-means algorithm can be divided into two phases: the initialization phase and the iteration phase. In the initialization phase, the algorithm randomly assigns the data points to K clusters. In the iteration phase, it computes the distance between each data point and each cluster center and assigns the data point to the nearest cluster. K-means clustering is fast, robust, and easy to understand. However, it fails for non-linear datasets and is unable to handle noisy datasets.

The fuzzy c-means (FCM) clustering algorithm was first introduced by Dunn [31] and was later modified and extended by Bezdek [32]. FCM is an iterative distance-based clustering algorithm that tries to obtain optimal clusters. Fuzzy clustering is considered soft clustering: data points can belong to more than one cluster, in contrast to hard clustering, where the data are divided into crisp clusters and each data point belongs to exactly one cluster.
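As an illustration of soft clustering on gap widths, the following is a minimal one-dimensional FCM sketch using the standard membership and center updates with fuzzifier m. The deterministic initialization, the choice of two clusters, and the `gap_threshold` helper (the midpoint of the two centers, where both memberships equal 0.5) are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def _memberships(x, centers, m):
    """Membership of every point in every cluster; each row sums to 1."""
    dist = np.abs(x[:, None] - centers[None, :]) + 1e-12   # avoid division by zero
    inv = dist ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

def fuzzy_c_means_1d(x, c=2, m=2.0, n_iter=100, tol=1e-6):
    """Cluster 1-D data with fuzzy c-means; returns (centers, memberships)."""
    x = np.asarray(x, dtype=float)
    centers = np.linspace(x.min(), x.max(), c)   # deterministic start over the data range
    for _ in range(n_iter):
        um = _memberships(x, centers, m) ** m
        new_centers = (um * x[:, None]).sum(axis=0) / um.sum(axis=0)
        shift = np.abs(new_centers - centers).max()
        centers = new_centers
        if shift < tol:
            break
    return centers, _memberships(x, centers, m)

def gap_threshold(centers):
    """Crossover point of two clusters: gaps below it lean 'within word'."""
    low, high = np.sort(centers)[:2]
    return (low + high) / 2.0

# Hypothetical gap widths in pixels: small within-word gaps, large between-word gaps.
gaps = [2, 3, 2, 4, 3, 15, 18, 16, 3, 2]
centers, u = fuzzy_c_means_1d(gaps)
threshold = gap_threshold(centers)
```

Unlike k-means, every gap keeps a degree of membership in both clusters, so ambiguous gaps near the crossover point can be identified instead of being forced into one class.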

7.2. Probability Based
The Gaussian mixture model (GMM) is the most popular probability-based clustering algorithm [33]. Mixture models have been widely used for data clustering and are effective with multidimensional datasets. Each cluster is represented by a Gaussian distribution. The clustering process estimates the parameters of the Gaussian mixture, usually with the expectation-maximization (EM) algorithm.
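A minimal sketch of EM for a one-dimensional Gaussian mixture is given below, applied to hypothetical gap widths. The initialization, the small variance floor, and the stopping rule are choices made for this illustration; a production system would more likely rely on an established library implementation.

```python
import numpy as np

def gmm_em_1d(x, k=2, n_iter=200, tol=1e-8):
    """Fit a k-component 1-D Gaussian mixture; returns (weights, means, variances, resp)."""
    x = np.asarray(x, dtype=float)
    means = np.linspace(x.min(), x.max(), k)     # deterministic start over the data range
    variances = np.full(k, x.var() + 1e-6)
    weights = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        diff = x[:, None] - means[None, :]
        pdf = np.exp(-0.5 * diff ** 2 / variances) / np.sqrt(2.0 * np.pi * variances)
        joint = weights * pdf
        total = joint.sum(axis=1, keepdims=True)
        resp = joint / total
        # M-step: re-estimate the mixture parameters from the responsibilities.
        nk = resp.sum(axis=0)
        weights = nk / len(x)
        means = (resp * x[:, None]).sum(axis=0) / nk
        variances = (resp * (x[:, None] - means) ** 2).sum(axis=0) / nk + 1e-6
        ll = np.log(total).sum()                 # log-likelihood before this M-step
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return weights, means, variances, resp

gaps = [2, 3, 2, 4, 3, 15, 18, 16, 3, 2]
weights, means, variances, resp = gmm_em_1d(gaps)
labels = resp.argmax(axis=1)                     # hard assignment from soft responsibilities
```

The responsibilities play the same role as FCM memberships, but they come from a probabilistic model in which each cluster also has its own variance and mixing weight.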

7.3. Density Based
The density-based spatial clustering of applications with noise (DBSCAN) algorithm identifies clusters on the basis of the density of the points. It is widely used because it can effectively handle noise points and deal with data of any type. The DBSCAN algorithm was originally proposed by Ester et al. in 1996 [34], and it has achieved excellent performance in image segmentation. DBSCAN uses nearest neighbor searching (NNS) to decide whether a point is a core point and to find density-connected points.
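The core-point test and the density-connection search can be sketched for one-dimensional data as follows. The brute-force neighbour search and the parameter values in the example (`eps=2`, `min_pts=3`) are assumptions chosen for illustration; the paper does not report the settings it used.

```python
import numpy as np

def dbscan_1d(x, eps, min_pts):
    """Label 1-D points with cluster ids 0, 1, ...; noise points get -1."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Brute-force neighbourhoods; each point counts as its own neighbour.
    neighbours = [np.flatnonzero(np.abs(x - x[i]) <= eps) for i in range(n)]
    labels = np.full(n, -1)
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbours[i]) < min_pts:
            continue                             # not a core point; may become a border point later
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster              # core or border point joins the cluster
            if not visited[j]:
                visited[j] = True
                if len(neighbours[j]) >= min_pts:
                    queue.extend(neighbours[j])  # expand only through core points
        cluster += 1
    return labels

gaps = [2, 3, 2, 4, 3, 15, 18, 16, 3, 2, 40]
labels = dbscan_1d(gaps, eps=2, min_pts=3)
```

In contrast to k-means, FCM, and GMM, the number of clusters is not given in advance; it follows from `eps` and `min_pts`, and an outlying gap such as 40 is reported as noise rather than being forced into a cluster.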

8. Experiments and Results
Several experiments were performed to demonstrate the effectiveness of the proposed word extraction method. They were implemented in Matlab 2010a and run on a machine with a 2.1 GHz processor, 2 GB of RAM, and the Windows 7 operating system. The dataset used in this work was taken from the AHDB [35]; 25 images from different writers were used. The dataset is available at http://handwriting.qu.edu.qa/dataset/.

Firstly, the FCM clustering algorithm was applied to three different Arabic handwritten inputs to show that the selected measures differ according to the nature of the writing (the properties of the input document). Figures 9.a and 9.b show that the thresholds obtained for the three samples differ from one input to another.

(a) Sample of thresholds to classify “within” and “between” word gaps (bar chart; series: Writer 1, Writer 2, Writer 3; categories: between-words gap, within-word gap).
(b) Sample of thresholds to classify words, sub-words, and letters (bar chart; series: Writer 1, Writer 2, Writer 3; categories: letter length, word length, sub-word length).

Figure 9. Thresholds set for three different writers

Secondly, to identify which clustering algorithm is suitable for our problem, we implemented the proposed algorithm using four clustering methods: DBSCAN, GMM, KNN, and FCM. Table 3 presents the results. The best result was obtained for image number 2, where FCM and GMM outperform KNN and DBSCAN. The best and worst percentages of misplaced words for FCM are 5.5% and 20%, respectively.

Table 3. Results of clustering algorithms

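Clustering algorithms by type: Density (DBSCAN), Probability (GMM), Distance (KNN, FCM). For each algorithm the columns give the number of misplaced words and the segmentation rate in %.

Image No. | No. of words | DBSCAN misplaced | DBSCAN rate % | GMM misplaced | GMM rate % | KNN misplaced | KNN rate % | FCM misplaced | FCM rate %
1  | 70  | 13 | 81.4 | 5  | 92.9 | 4  | 94.3 | 4  | 94.3
2  | 90  | 8  | 91.1 | 5  | 94.4 | 6  | 93.3 | 5  | 94.4
3  | 58  | 7  | 87.9 | 11 | 81.0 | 10 | 82.8 | 8  | 86.2
4  | 92  | 9  | 90.2 | 24 | 73.9 | 14 | 84.8 | 13 | 85.9
5  | 62  | 28 | 54.8 | 17 | 72.6 | 19 | 69.4 | 15 | 75.8
6  | 29  | 7  | 75.9 | 6  | 79.3 | 8  | 72.4 | 6  | 79.3
7  | 58  | 21 | 63.8 | 12 | 79.3 | 18 | 69.0 | 11 | 81.0
8  | 51  | 10 | 80.4 | 5  | 90.2 | 5  | 90.2 | 8  | 84.3
9  | 45  | 21 | 53.3 | 11 | 75.6 | 10 | 77.8 | 10 | 77.8
10 | 88  | 21 | 76.1 | 33 | 62.5 | 28 | 68.2 | 15 | 83.0
11 | 34  | 6  | 82.4 | 4  | 88.2 | 5  | 85.3 | 7  | 79.4
12 | 42  | 4  | 90.5 | 5  | 88.1 | 9  | 78.6 | 4  | 90.5
13 | 38  | 4  | 89.5 | 5  | 86.8 | 7  | 81.6 | 4  | 89.5
14 | 91  | 18 | 80.2 | 8  | 91.2 | 7  | 92.3 | 9  | 90.1
15 | 80  | 20 | 75.0 | 15 | 81.3 | 11 | 86.3 | 11 | 86.3
16 | 101 | 16 | 84.2 | 20 | 80.2 | 16 | 84.2 | 11 | 89.1
17 | 43  | 12 | 72.1 | 7  | 83.7 | 9  | 79.1 | 8  | 81.4
18 | 33  | 12 | 63.6 | 12 | 63.6 | 10 | 69.7 | 6  | 81.8
19 | 62  | 8  | 87.1 | 13 | 79.0 | 20 | 67.7 | 12 | 80.6
20 | 64  | 20 | 68.8 | 17 | 73.4 | 23 | 64.1 | 12 | 81.3
21 | 67  | 15 | 77.6 | 14 | 79.1 | 14 | 79.1 | 13 | 80.6
22 | 41  | 9  | 78.0 | 9  | 78.0 | 13 | 68.3 | 7  | 82.9
23 | 27  | 6  | 77.8 | 7  | 74.1 | 6  | 77.8 | 6  | 77.8
24 | 88  | 27 | 69.3 | 25 | 71.6 | 14 | 84.1 | 16 | 81.8
25 | 32  | 8  | 75.0 | 6  | 81.3 | 8  | 75.0 | 5  | 84.4
Average |  | 330 | 77.8 | 296 | 80.1 | 294 | 80.2 | 234 | 84.8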
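The paper does not spell out how the segmentation rate is computed, but the per-image values in Table 3 are consistent with the share of words that were not misplaced. A short sketch under that assumption, with an illustrative function name:

```python
def segmentation_rate(n_words, n_misplaced):
    """Percentage of words that were extracted correctly (assumed definition)."""
    return 100.0 * (n_words - n_misplaced) / n_words

# Rows taken from Table 3: (image no., number of words, misplaced words with DBSCAN).
rows = [(1, 70, 13), (2, 90, 8), (3, 58, 7)]
rates = [round(segmentation_rate(words, missed), 1) for _, words, missed in rows]
```

For these three rows the rounded rates reproduce the DBSCAN column of the table (81.4, 91.1, and 87.9).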


Figure 10 shows the overall correct extraction rate computed over all 25 samples for each clustering algorithm. FCM is clearly the best clustering algorithm for our segmentation problem; it outperforms the other clustering algorithms.

Figure 10. Correct Extraction Rate (bar chart; x-axis: DBSCAN, GMM, KNN, FCM; y-axis: 74% to 86%)

Finally, we compare the results obtained with those of a previously published method. The results in Table 4 show that the two approaches have nearly the same correct rate. However, the thresholds in our method are calculated for each input document; they are not fixed and are not derived from the words stored in a database, as in AlKhateeb et al. [21].

Table 4. Correct rate of word extraction
Method | Correct Rate
Proposed method | 84.8%
AlKhateeb et al. [21] | 85.0%

9. Conclusions
This paper presents an efficient method for segmenting Arabic handwritten text images into text lines and words. Text line segmentation is addressed by computing the HPP of the document; autocorrelation is then used to enhance the estimation of text line spacing. Word segmentation is based on an adaptive gap metric. Four clustering algorithms of three types (distance-based, probability-based, and density-based) were tested on the AHDB benchmark dataset, and a very promising correct extraction rate of 84.8% was achieved.

10. Acknowledgements
The work for this paper was carried out during 2013-2014 while the first author was on sabbatical leave from Al-Hussein Bin Talal University, Jordan, at Taif University, KSA. It was funded by Taif University. Special appreciation goes to both universities, which made this support possible and strongly encouraged the study.

11. References
[1] http://acronyms.thefreedictionary.com/HWR.
[2] R. Plamondon and S. N. Srihari, “On-line and off-line handwriting recognition: A comprehensive survey”, IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 22, no. 1, pp. 63–84, 2000.
[3] L. Lorigo and V. Govindaraju, “Off-line Arabic Handwriting Recognition: A Survey”, IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 28, no. 05, pp. 712–724, 2006.
[4] A. Zahour, B. Taconet, P. Mercy, S. Ramdane, “Arabic handwritten text-line extraction”, In Proceedings of the Sixth International Conference on Document Analysis and Recognition (ICDAR), vol. 37, pp. 281–285, 2001.
[5] J. Kumar, W. Abd-Almageed, L. Kang, D.S. Doermann, “Handwritten Arabic text line segmentation using affinity propagation”, In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems (DAS '10), pp. 135–142, 2010.
[6] Z. Shi, S. Setlur, V. Govindaraju, “A Steerable Directional Local Profile Technique for Extraction of Handwritten Arabic Text Lines”, ICDAR, pp. 176–180, 2009.
[7] N. Ouwayed, A. Belaïd, “Separation of overlapping and touching lines within handwritten Arabic documents”, In Proceedings of the 13th International Conference on Computer Analysis of Images and Patterns (CAIP '09), pp. 123–138, 2009.
[8] M. Khayyat, L. Lam, C. Y. Suen, F. Yin and C-L. Liu, “Arabic Handwritten Text Line Extraction by Applying an Adaptive Mask to Morphological Dilation”, In Proceedings of the 10th IAPR International Workshop on Document Analysis Systems (DAS 2012), pp. 100–104, 2012.
[9] I.S.I. Abuhaiba, S. Datta, M.J.J. Holt, “Line Extraction and Stroke Ordering of Text Pages”, In Proceedings of the Third International Conference on Document Analysis and Recognition (ICDAR '95), pp. 390–393, 1995.
[10] H. Goraine, M. Sher, S. Al-Emami, “Off-Line Arabic Character Recognition”, Computer, vol. 25, pp. 71–74, 1992.
[11] H.A. Al-Muhtaseb, S.A. Mahmoud, R.S. Qahwaji, “Recognition of offline printed Arabic text using Hidden Markov Models”, Signal Processing, vol. 88, pp. 2902–2912, 2008.
[12] A. Amin, H. Alsadon, S. Fisher, “Hand printed Arabic character recognition system using an artificial network”, Pattern Recognition, vol. 29, no. 4, pp. 663–675, 1996.
[13] R. El-Hajj, C. Mokbel, L. Likforman-Sulem, “Arabic Handwriting Recognition Using Baseline Dependant Features and Hidden Markov Modeling”, In Proceedings of the Eighth International Conference on Document Analysis and Recognition, ICDAR, 2005.
[14] M. Khalifa, Y. BingRu, “A Novel Word Based Arabic Handwritten Recognition System Using SVM Classifier”, Communications in Computer and Information Science, vol. 143, pp. 163–171, 2011.
[15] A. Benouareth, A. Ennaji, M. Sellami, “HMMs with explicit state duration applied to handwritten Arabic word recognition”, In Proceedings of the 18th International Conference on Pattern Recognition, ICPR, 2006.
[16] J.H. AlKhateeb, “Word-based Handwritten Arabic Scripts Recognition using DCT Features and Neural Network Classifier”, In Proceedings of the 5th International Multi-Conference on Systems, Signals and Devices, 2008.
[17] S. Almaadeed, C. Higgins, D. Elliman, “Recognition of off line hand written Arabic words using hidden markov model approach”, In Proceedings of the 16th International Conference on Pattern Recognition, vol. 3, pp. 481–484, 2002.
[18] S.N. Srihari, H. Srinivasan, P. Babu, C. Bhole, “Handwritten Arabic Word Spotting using the CEDARABIC Document Analysis System”, In Proceedings of the Symposium on Document Image Understanding Technology (SDIUT-05), College Park, MD, pp. 123–132, 2005.
[19] S.N. Srihari, H. Srinivasan, P. Babu, C. Bhole, “Spotting Words in Handwritten Arabic Documents”, In Proceedings of the SPIE, pp. 606702-1–606702, 2006.
[20] M. Khayyat, L. Lam, C.Y. Suen, “Arabic Handwritten Word Spotting Using Language Models”, In Proceedings of the 2012 International Conference on Frontiers in Handwriting Recognition (ICFHR '12), pp. 43–48, 2012.
[21] J.H. AlKhateeb, J. Jiang, J. Ren, S. Ipson, “Interactive Knowledge Discovery for Baseline Estimation and Word Segmentation”, in: Maurizio A Strangio (Ed.), Handwritten Arabic Text, Recent Advances in Technologies, ISBN: 978-953-307-017-9, InTech. DOI: 10.5772/7428.
[22] T. Stafylakis, V. Papavassiliou, V. Katsouros, G. Carayannis, “Robust text-line and word segmentation for handwritten documents images”, In Proceedings of the International Conference on Acoustics, Speech and Signal Processing, pp. 3393–3396, 2008.
[23] U.V. Marti, H. Bunke, “Text line segmentation and word recognition in a system for general writer independent handwriting recognition”, In Proceedings of the International Conference on Document Analysis and Recognition, pp. 159–163, 2001.
[24] R. Manmatha, J.L. Rothfeder, “A scale space approach for automatically segmenting words from historical handwritten documents”, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1212–1225, 2005.
[25] V. Papavassiliou, T. Stafylakis, V. Katsouros, G. Carayannis, “Handwritten document image segmentation into text lines and words”, Pattern Recognition, vol. 43, no. 1, pp. 369–377, 2010.
[26] G. Louloudis, B. Gatos, I. Pratikakis, C. Halatsis, “Text line and word segmentation of handwritten documents”, Pattern Recognition, vol. 42, no. 12, pp. 3169–3183, 2009.
[27] G. Seni, E. Cohen, “External word segmentation of off-line handwritten text lines”, Pattern Recognition, vol. 27, pp. 41–52, 1994.
[28] M. Pechwitz, S.S. Maddouri, V. Maergner, N. Ellouze, H. Amiri, “IFN/ENIT - database of handwritten Arabic words”, In Proceedings of CIFED, pp. 129–136, 2002.
[29] G.S. Peake, T.N. Tan, “Script and language identification from document images”, In Proceedings of the British Machine Vision Conference (BMVC97), vol. 2, pp. 169–184, 1997.
[30] J.B. MacQueen, “Some Methods for classification and Analysis of Multivariate Observations”, In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, University of California Press, vol. 1, pp. 281–297, 1967.
[31] J.C. Dunn, “A Fuzzy Relative of the ISODATA Process and its Use in Detecting Compact Well Separated Clusters”, Journal of Cybernetics, vol. 3, pp. 32–57, 1974.
[32] J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, New York: Plenum Press, 1981.
[33] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 2007.
[34] M. Ester, H-P. Kriegel, J. Sander, X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise”, In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), AAAI Press, pp. 226–231, 1996.
[35] S. Al-Ma’adeed, D. Elliman, C.A. Higgins, “A Data Base for Arabic Handwritten Text Recognition Research”, In Proceedings of the 8th International Workshop on Frontiers in Handwriting Recognition, 2002.
