2009 IEEE International Conference on Signal and Image Processing Applications

A New Approach for Off-Line Handwritten Arabic Word Recognition Using KNN Classifier

Jawad H AlKhateeb, Fouad Khelifi, Jianmin Jiang, Stan S Ipson
School of Computing, Informatics and Media, University of Bradford, BD7 1DP, UK
{J.h.y.alkhateeb, f.khelifi, j.jiang1, S.S.Ipson}@bradford.ac.uk

Abstract: Due to similarities between Arabic letters, and the various writing styles employed, recognition of Arabic handwritten text can be difficult. In this paper, an off-line Arabic handwritten word recognition system is proposed, in which technical details are presented in terms of three stages, i.e. pre-processing, feature extraction and classification. Firstly, words are segmented from input scripts and also normalized in size. Secondly, each segmented word is divided into overlapping blocks. Absolute mean values computed for each block of segmented words constitute a feature vector. Finally, the resulting feature vectors are used to classify the words using the K nearest Neighbour classifier (KNN). The proposed system has been successfully tested on the IFN/ENIT database consisting of 32492 Arabic handwritten words which are written by more than 1000 different writers. Experimental results show a good recognition rate when compared with other methods.

Keywords: Offline recognition, Document analysis, Feature Extraction, KNN classifier, Arabic OCR

I. INTRODUCTION

The mechanism of transforming handwritten text into its symbolic representation is known as handwriting recognition (HWR). HWR plays an essential role in many applications, such as office automation, cheque verification, mail sorting, and a large variety of banking and business applications, as well as natural human-computer interaction. In general, HWR systems are divided into online and offline systems. Recognition in online systems is based on the pen movement, i.e. the dynamics of writing, whereas recognition in offline systems is based on an image of the written text. The offline recognition task is the harder of the two because it cannot make use of the additional information available to online systems, such as the strength and sequential order of the strokes during writing [1]. In this paper, the focus is on the offline recognition of handwritten Arabic text.

Basically, there are two categories of systems for the recognition of Arabic script: segmentation-based and segmentation-free. The segmentation-based approach is analytical: it needs to segment words into characters or letters for recognition. The segmentation-free approach is global: it uses the whole word image in recognition and does not require segmentation steps. Although the global approach makes the recognition process simpler by avoiding the difficulty of character segmentation, it suffers more than the analytical approach from the huge vocabulary of words needed [3].

978-1-4244-5561-4/09/$26.00 ©2009

Pechwitz and Märgner [4] used pixel values in a sliding window as the main features; the window is shifted from right to left to generate feature vectors, which are reduced in dimension using the Karhunen-Loève transformation (KLT). A similar sliding window is used in [5], in which the word image is divided into vertical overlapping frames with a constant width and variable heights. For each frame, 24 features are extracted using foreground pixel densities and concavity features. In El Abed and Märgner [6], a word skeleton graph was used where each vertical frame was further split into five zones of equal height. The lengths of all lines in each zone were calculated in four directions to form a 20-dimensional feature vector. In [7-8], statistical and structural features were utilized based on an implicit word segmentation that divides images into vertical frames of constant width. Based on maxima and minima analyses of the vertical projection histogram, the morphological complexity of Arabic handwritten characters is further considered.

In this paper, we propose a system for handwritten Arabic word recognition, and its stages (pre-processing, feature extraction, and classification) are discussed in detail. Unlike methods which recognize word images character by character, this approach recognizes the whole word image at the word level. Our method is validated using the IFN/ENIT database, where the overlapping-block features of words are used to train the KNN classifier.

II. IFN/ENIT DATABASE

Any recognition system needs a large database for training and testing. Real data from banks or postal services are confidential and generally inaccessible for non-commercial research.
Work on Arabic script recognition started more than three decades ago, but it generally used small databases of the researchers' own construction or presented results on databases which were unavailable to the public. There was no well-established benchmark for comparing the results obtained by different researchers until 2002, when the IFN/ENIT database was made freely available for use as a standard test set [4, 9]. This database contains the names of Tunisian towns and villages together with their postcodes. In total, more than 1000 different people were selected to write their names and to fill in one or more forms with handwritten pre-selected names of a Tunisian town or village with the corresponding postcode. All the forms were scanned at 300 dpi and converted to binary images. The images are divided into five sets so that researchers can use some of them for training and some for testing.

III. THE PROPOSED METHOD

In this paper, a recognition system for handwritten Arabic words is proposed which comprises three main phases: pre-processing, feature extraction, and classification. In an HWR system, the recognition rate depends on a number of factors. Two very important factors are the quality of the input images and the effectiveness of pre-processing. Once the sample image is acquired, pre-processing is required to enhance the signal for better performance. After pre-processing, features are extracted using overlapping blocks for each word image. The means computed for each block constitute the feature vector of a word image, and stacking these vectors yields a feature matrix for all the word images. Finally, the KNN classifier is applied to decide to which class an unknown word belongs. Figure 1 illustrates the proposed scheme.

Fig. 1. The proposed scheme for word recognition (block diagram; recoverable labels: Training images, Pre-processing, Features Database, Class Output)

A. Preprocessing

Pre-processing may include techniques such as thresholding, skew/slant correction, noise removal, thinning, baseline estimation and segmentation of words. The main aim of pre-processing is to enhance the input signal and to represent it in a form which can be analysed consistently for robust recognition. Here the pre-processing stage includes scanning the paper document, noise removal, image binarization, line and word segmentation, and baseline estimation. These steps are strongly dependent on the quality of the paper document. In the case of the IFN/ENIT database, the words were separated and cropped out during the database development stage [9]. Therefore, the only additional operations needed were estimation of the baseline and normalization. Although the words in the database are already manually cropped, we have also investigated how to segment words and estimate the baseline in general; details can be found in [10].

Normalization is essential to remove the variations in handwritten images for consistent analysis and robust recognition. Among the many algorithms proposed for this purpose, the skeletonization technique is the one most commonly used, and in the proposed system the normalization algorithm from [7] is employed. A sample word image in binary format is shown in Fig. 2(a), with its normalized and filtered forms shown in Fig. 2(b) and Fig. 2(c) respectively.

Since the edges of a binary word image carry essentially all of its information, horizontal and vertical high-pass filters are applied to the normalized word image to extract the edges. Given the input image $I$, the horizontal high-pass filtered image $I_h$ is obtained with the horizontal Prewitt operator $G_h$ as

$$I_h = G_h(I) \tag{1}$$

Likewise, the vertical high-pass filtered image $I_v$ is obtained with the vertical Prewitt operator $G_v$ as

$$I_v = G_v(I) \tag{2}$$

Finally, the filtered image $I'$ is computed as

$$I'(i,j) = \max\big(|I_h(i,j)|, |I_v(i,j)|\big) \tag{3}$$

Fig. 2. A sample word image (a), its normalized image (b), and its filtered image (c)

B. Feature Extraction

The main goal of feature extraction is to remove redundancy from the data and to produce a more effective representation of the word image in the form of a set of numerical characteristics. These features are then passed to a classifier in order to separate the input words into classes. In this paper, the overlapping block method is applied to the normalized word image. The word image is divided into overlapping blocks, where the size of each block is 12 pixels and the overlap between adjacent blocks is 2 pixels. These sizes have been chosen empirically. The feature vector of the word image is formed by computing the absolute mean value of all the coefficients in each block of the word image.

The database consists of 32492 Arabic words representing 937 classes. The dataset has therefore been organized into 937 folders, each named with a town/village postcode, and every image has been saved in the folder corresponding to its postcode. Sets a, b, c, and d are used as the training set and set e as the testing set. All the images in the training and testing sets were read, and their features were extracted and saved.
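As a rough illustration of how the filtering of Eqs. (1)-(3) and the overlapping-block features could be computed, the following Python sketch uses plain lists and no external libraries. Several details are assumptions rather than statements from the paper: the standard 3 x 3 Prewitt kernels with zero padding at the border, square 12 x 12 blocks that advance by the block size minus the overlap in both directions, the reading of "absolute mean" as the mean of absolute values, the discarding of partial blocks at the image border, and the choice to take the features from the filtered image. All function names are hypothetical.

```python
# Standard 3x3 Prewitt kernels (assumed; the paper names the operator only).
PREWITT_H = [[-1, -1, -1], [0, 0, 0], [1, 1, 1]]
PREWITT_V = [[-1, 0, 1], [-1, 0, 1], [-1, 0, 1]]


def filter3x3(image, kernel):
    """Correlate a list-of-lists image with a 3x3 kernel (zero padding)."""
    rows, cols = len(image), len(image[0])
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    r, c = i + di, j + dj
                    if 0 <= r < rows and 0 <= c < cols:
                        acc += kernel[di + 1][dj + 1] * image[r][c]
            out[i][j] = acc
    return out


def edge_map(image):
    """Eq. (3): pixel-wise maximum of |I_h| and |I_v|."""
    i_h = filter3x3(image, PREWITT_H)
    i_v = filter3x3(image, PREWITT_V)
    return [[max(abs(h), abs(v)) for h, v in zip(row_h, row_v)]
            for row_h, row_v in zip(i_h, i_v)]


def block_mean_features(image, block=12, overlap=2):
    """Mean absolute value of each block x block tile, stepping by block - overlap."""
    rows, cols = len(image), len(image[0])
    step = block - overlap
    features = []
    for r in range(0, rows - block + 1, step):
        for c in range(0, cols - block + 1, step):
            total = sum(abs(image[i][j])
                        for i in range(r, r + block)
                        for j in range(c, c + block))
            features.append(total / (block * block))
    return features


# Toy binary "word": a filled rectangle on a 45 x 269 background.
word = [[0] * 269 for _ in range(45)]
for i in range(15, 30):
    for j in range(60, 200):
        word[i][j] = 1
features = block_mean_features(edge_map(word))
print(len(features))  # 4 rows x 26 columns of blocks = 104
```

Under these assumptions, an image of the 45 x 269 size reported in the experiments yields 104 features. The paper does not state the length of its feature vector, so this figure characterises the sketch only.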

C. Classification

Generally, any classification problem involves a sample of instances, which is divided into a training set and a testing set. Each instance is a feature vector and belongs to a known class. The classifier is trained on the training set to map input features from the feature space into the output space of class labels. The testing set is used to estimate the accuracy of the classifier from the number of correct predictions made on this set. This paper focuses on Arabic handwritten word recognition using multi-class classification by KNN. A multi-class classification problem can be defined as follows. Given an n-dimensional feature space $\Omega$ and a training data set $\Omega_{tr} \subset \Omega$, each element $x \in \Omega_{tr}$ is associated with a class label $c_j \in \{c_1, c_2, c_3, \ldots, c_K\}$, where $K > 2$. The system is trained on $\Omega_{tr}$ so that, for any given feature vector $x \in \Omega$, the classifier output $F(x)$ is a class label.

Classification methods vary and depend on the nature and type of the extracted features. The KNN method used in this paper is a fast machine learning algorithm that classifies the unlabelled testing set using a labelled training set. In order to classify a word image, its features are loaded from the testing set and compared to the training features based on their distance. The predicted class of the testing image is then found from the minimum distance, measured by the Euclidean distance, between the testing word image and the training samples. For example, given a query word image, its K nearest training instances are found and the most common class among them is assigned to the query. In other words, the KNN algorithm uses the classes of the neighbouring training samples as the prediction for each testing sample. The Euclidean distance D between two feature vectors X and Y is:

$$D = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2} \tag{4}$$

where $x_i$ and $y_i$ are the elements of X and Y, respectively, and N is the length of the feature vectors.

IV. EXPERIMENTAL RESULTS

In order to evaluate the performance of the proposed recognition system, several experiments were conducted on the IFN/ENIT database, which contains 32492 Arabic words handwritten by more than 1000 different writers and is divided into five distinct sets a, b, c, d, and e [9]. In the experiments, cross validation was used to verify the performance of the KNN classifier. Each time, 80% of the samples in the database (sets a, b, c, and d) were used for training and the remaining 20% (set e) were used for testing. Each word image in the dataset was normalized to 45 x 269 pixels and then divided into overlapping blocks. The block size was set to 12, and several experiments were carried out by varying the size of the overlap between blocks. Table I summarizes the recognition rates achieved for different overlaps.
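The decision rule of Section III-C and the recognition rate used in these experiments can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the paper does not report the value of K, so `k` is left as a parameter with a default of 1; ties between classes are broken in favour of the nearer neighbour; and the function names and toy postcode labels are hypothetical.

```python
import math
from collections import Counter


def euclidean(x, y):
    """Eq. (4): Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def knn_predict(train_x, train_y, query, k=1):
    """Return the most common label among the k training vectors nearest to query."""
    order = sorted(range(len(train_x)),
                   key=lambda i: euclidean(train_x[i], query))
    # Labels are counted nearest-first, so a tie goes to the nearer neighbour.
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def recognition_rate(train_x, train_y, test_x, test_y, k=1):
    """Percentage of test vectors whose predicted label equals the true label."""
    hits = sum(knn_predict(train_x, train_y, x, k) == y
               for x, y in zip(test_x, test_y))
    return 100.0 * hits / len(test_x)


# Toy data: two made-up classes with two training vectors each.
train_x = [[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]]
train_y = ["1000", "1000", "2000", "2000"]
print(knn_predict(train_x, train_y, [0.2, 0.1], k=3))  # 1000
```

In the setting of the paper, `train_x` would hold the feature vectors of sets a to d, the test vectors would come from set e, and the labels would be the postcodes used as class names.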

TABLE I
RECOGNITION RESULTS

Size of block overlap    Recognition rate (%)
2                        76.0421
3                        75.9600
4                        72.1415
5                        71.0650
6                        66.4388
7                        65.1703
8                        65.9236

The recognition rate was then compared with those of the recognition systems reported in the ICDAR 2005 competition on Arabic handwriting recognition [11]. Table II lists the recognition results of the ICDAR 2005 competition together with that of the proposed algorithm. It is important to note that the ICDAR 2005 competition used the same testing set as the current work. It can be seen that the proposed algorithm achieved a recognition rate of 76.04%, slightly above the best competing system (75.93%).

TABLE II
PERFORMANCE OF ICDAR 2005 SYSTEMS

System ID             Recognition rate (%)
1                     65.74
2                     35.70
3                     29.62
4                     75.93
5                     15.36
6                     74.69
Proposed algorithm    76.0421

V. CONCLUSION

We have proposed a system that uses KNN for the classification of handwritten Arabic words. The system has been applied to the well-known IFN/ENIT database, which contains handwritten words written by different writers. We have found that extracting features from the edges is effective with a KNN classifier, and promising results have been achieved in terms of recognition rate. On the IFN/ENIT test set, the proposed approach outperforms the ICDAR 2005 systems listed in Table II. In addition, the system can be applied to other pattern recognition problems with slight adaptation. Further investigations could apply more effective normalization and introduce further features such as moments, Discrete Cosine Transform (DCT) coefficients, and wavelet coefficients.

REFERENCES

[1] A. Amin, "Offline Arabic character recognition: The state of the art", Pattern Recognition, vol. 31, pp. 517-530, 1998.
[2] L. M. Lorigo and V. Govindaraju, "Offline Arabic handwriting recognition: a survey", IEEE T-PAMI, vol. 28, pp. 712-724, 2006.
[3] M. S. Khorsheed, "Off-Line Arabic Character Recognition - A Review", Pattern Analysis & Applications, vol. 5, pp. 31-45, 2002.
[4] M. Pechwitz and V. Märgner, "HMM based approach for handwritten Arabic word recognition using the IFN/ENIT database", Proc. ICDAR, pp. 890-894, 2003.
[5] R. El-Hajj, L. Likforman-Sulem, and C. Mokbel, "Arabic Handwriting Recognition Using Baseline Dependant Features and Hidden Markov Modeling", Proc. ICDAR, pp. 893-897, 2005.
[6] H. El Abed and V. Märgner, "Comparison of Different Preprocessing and Feature Extraction Methods for Offline Recognition of Handwritten Arabic Words", Proc. ICDAR, vol. 2, pp. 974-978, 2007.
[7] A. Benouareth, A. Ennaji, and M. Sellami, "HMMs with Explicit State Duration Applied to Handwritten Arabic Word Recognition", Proc. ICPR, vol. 2, pp. 897-900, 2006.
[8] A. Benouareth, A. Ennaji, and M. Sellami, "Semi-continuous HMMs with explicit state duration for unconstrained Arabic word modeling and recognition", Pattern Recognition Letters, vol. 29, no. 12, pp. 1742-1752, 2008.
[9] M. Pechwitz, S. S. Maddouri, V. Märgner, N. Ellouze, and H. Amiri, "IFN/ENIT - Database of Arabic Handwritten words", Colloque International Francophone sur l'Ecrit et le Document (CIFED), pp. 127-136, 2002.
[10] J. AlKhateeb, J. Ren, S. S. Ipson, and J. Jiang, "Knowledge-based Baseline Detection and Optimal Thresholding for Words Segmentation in Efficient Pre-processing of Handwritten Arabic Text", Proc. 5th Int. Conf. Information Technology: New Generation, pp. 1158-1159, 2008.
[11] V. Märgner, M. Pechwitz, and H. El Abed, "ICDAR 2005 Arabic Handwriting Recognition", Proc. ICDAR, pp. 70-74, 2005.