Available online at www.sciencedirect.com

ScienceDirect Procedia Computer Science 65 (2015) 556 – 561

International Conference on Communication, Management and Information Technology (ICCMIT 2015)

A Database for Arabic Handwritten Character Recognition
Jawad H AlKhateeb*
Department of Computer Science, College of Computer Science and Engineering, Taibah University, KSA

Abstract
This paper proposes an image database for Arabic handwritten character recognition (AHCR), written by multiple writers and suitable for research on Arabic handwriting recognition. The database contains digital images of the letters of the Arabic alphabet written by 100 native Arabic writers. Each writer wrote each Arabic letter 10 times on a form. All the forms were scanned using a high quality scanner, and the characters were then cropped from the forms. The database therefore contains 28000 images (28 letters × 10 repetitions × 100 writers). These images were divided into two sets: 80% for training and 20% for testing. This database will be freely available.

© 2015 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of Universal Society for Applied Research.

Keywords: Arabic Offline Handwritten Recognition, Pre-Processing, AHCR Database.

* Corresponding author. Tel.: +966569749207 E-mail address: [email protected]

1877-0509 © 2015 Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). Peer-review under responsibility of Universal Society for Applied Research doi:10.1016/j.procs.2015.09.130

Jawad H. AlKhateeb / Procedia Computer Science 65 (2015) 556 – 561

1. Introduction
A large amount of research has been conducted on the recognition of Latin, Chinese, and Japanese text. By contrast, relatively little research has been done on Arabic text, owing to the complexity of Arabic script and the limited number of Arabic databases. Recognition of Arabic text is therefore still at an early stage compared with the methods available for Latin, Chinese, and Japanese text. A major challenge for Arabic recognition systems comes from the cursive nature of the data. Recognition of Arabic handwritten text is a difficult task; the difficulty comes from many factors, such as the cursive Arabic writing mechanism, the writer's style, the pen, and other factors [1][2]. Application domains for Arabic handwriting recognition include character recognition, office automation, cheque verification, mail sorting, a large variety of banking and business tasks, and natural human-computer interaction [3].

In general, Arabic handwriting recognition is divided into two main types of system. The first is the on-line system, in which the writing process is traced by the computer, so the strength and sequential order of each segment can be recorded as it is written and used for recognition. The second is the off-line system, in which only the digital image is available. The off-line task is the more difficult of the two [3][4][5].

Work on Arabic script recognition started more than three decades ago. Al-Muallim and Yamaguchi [4] proposed a structural recognition technique for Arabic handwritten words, which were segmented into strokes. The strokes were classified and combined into characters according to their features. However, their system failed in most cases because of incorrect segmentation of words. Amin and Alsadoun [5] proposed a technique that uses a binary tree to segment printed Arabic text into characters. Amin and Alsadoun [6] proposed the recognition of hand printed Arabic characters using a neural network. Abuhaiba [7] dealt with some problems in the processing of binary images of handwritten text documents, such as extracting lines from pages, with an approach found to be powerful and suitable for variable handwriting. Abuhaiba et al. [8] introduced an offline cursive Arabic script recognition system, based on segmentation, for handwritten cursive script with high variability. In their system, single-component strokes were extracted.
Khorsheed [9] presented a method for off-line recognition of handwritten Arabic script in which segmentation into characters is not required. The method decomposes the skeleton of the word into an observation sequence, and a single hidden Markov model (HMM) with structural features is then employed for classification. HMMs are also used by Alma'adeed et al. [10] for unconstrained Arabic handwritten word recognition. In Alma'adeed [11], a complete scheme for unconstrained Arabic handwritten word recognition based on a neural network is proposed.

Any recognition system ideally needs a large database for training and testing. Real data from banks or postal services are confidential and inaccessible for non-commercial research. Although some work has been conducted on Arabic handwritten digits, the authors generally used small databases of their own or presented results on databases that were unavailable to the public. Consequently, there was no benchmark for comparing the results obtained by researchers. The freely available ADBase database is important in this context, as it has been used as a standard test set [12]. El-Sherif and Abdleazeem [12] released this Arabic handwritten digit database (ADBase), which is composed of 70,000 digits written by 700 writers. Each writer wrote each digit (from 0 to 9) ten times. To ensure that different writing styles were included, the database was gathered from different institutions: Colleges of Engineering and Law, a School of Medicine, the Open University (whose students span a wide range of ages), a high school, and a governmental institution. Forms were scanned at 300 dpi resolution, and the digits were then automatically extracted, categorized, and bounded by bounding boxes. The scanner was adjusted to produce binary images directly. Some noisy and corrupted digit images were edited manually. The database is divided into two sets: a training set and a testing set.
The training set includes 60,000 digits (6,000 images per class), and the testing set includes 10,000 digits (1,000 images per class). ADBase is available for free to researchers (http://datacenter.aucegypt.edu/shazeem/ ).

A standard database of Arabic handwritten images is required to design any recognition system. Arabic handwriting recognition lacks such standard databases, since most research on Arabic handwriting is conducted on private databases. In this paper, we propose an image database for Arabic handwritten character recognition (AHCR).

2. Arabic Writing
The Arabic alphabet consists of 28 letters, and text is written from right to left in a cursive manner. The Arabic alphabet is also used for writing other languages such as Persian, Urdu, and Jawi. Each Arabic letter has either two or four shapes, depending on its position in the text: at the start, in the middle, or at the end of a word, or alone [3][6]. Although Arabic is generally cursive, there are some non-cursive letters. There are 22 cursive letters with four different shapes and 6 non-cursive letters with only two shapes, corresponding to the alone and end positions. Table 1 shows each shape for each letter. For example, the letter Ayn (ﻉ) has the following shapes: ﻋـ at start, ـﻌـ in the middle, ـﻊ at end, and ﻉ when alone.

Moreover, in order to distinguish some characters from each other, Arabic uses different sets of dots. One letter can have up to three dots, which may be placed either above or below the main body of the letter. Because of their dots or loops, the following Arabic letters are special [3][6][9]:
- Ten Arabic letters have one dot (ﺏ، ﺝ، ﺥ، ﺫ، ﺯ، ﺽ، ﻅ، ﻍ، ﻑ، ﻥ)
- Three Arabic letters have two dots (ﺕ، ﻕ، ﻱ)
- Two Arabic letters have three dots (ﺙ، ﺵ)
- Several Arabic letters present a loop (ﺹ، ﺽ، ﻁ، ﻅ، ـﻌـ، ـﻐـ، ﻑ، ﻕ، ﻡ، ـﻤـ، ﻭ، ﺓ)

Removing any of these dots misrepresents the character. Efficient pre-processing techniques therefore have to handle the dots carefully, so that they are not discarded and the identity of the character is not changed.

Table 1. Arabic Alphabet

Name | Alone (isolated) | Start | Middle | End
Alif | ﺍ | ﺍ | ـﺎ | ـﺎ
Baa | ﺏ | ﺑــ | ـﺒـ | ـﺐ
Taa | ﺕ | ﺗـ | ـﺘـ | ـﺖ
Thaa | ﺙ | ﺛـ | ـﺜـ | ـﺚ
Jeem | ﺝ | ﺟـ | ـﺠـ | ـﺞ
Haa | ﺡ | ﺣـ | ـﺤـ | ـﺢ
Khaa | ﺥ | ﺧـ | ـﺨـ | ـﺦ
Dall | ﺩ | ﺩ | ـﺪ | ـﺪ
Dhaal | ﺫ | ﺫ | ـﺬ | ـﺬ
Raa | ﺭ | ﺭ | ـﺮ | ـﺮ
Zaay | ﺯ | ﺯ | ـﺰ | ـﺰ
Seen | ﺱ | ﺳـ | ـﺴـ | ـﺲ
Sheen | ﺵ | ﺷـ | ـﺸـ | ـﺶ
Saad | ﺹ | ﺻـ | ـﺼـ | ـﺺ
Daad | ﺽ | ﺿـ | ـﻀـ | ـﺾ
TTaa | ﻁ | ﻁ | ـﻂ | ـﻂ
Dhaa | ﻅ | ﻅ | ـﻆ | ـﻆ
Ayn | ﻉ | ﻋـ | ـﻌـ | ـﻊ
Ghyan | ﻍ | ﻏـ | ـﻐـ | ـﻎ
Faa | ﻑ | ﻓـ | ـﻔـ | ـﻒ
Qaaf | ﻕ | ﻗـ | ـﻘـ | ـﻖ
Kaaf | ﻙ | ﻛـ | ـﻜـ | ـﻚ
Laam | ﻝ | ﻟـ | ـﻠـ | ـﻞ
Meem | ﻡ | ﻣـ | ـﻤـ | ـﻢ
Noon | ﻥ | ﻧـ | ـﻨـ | ـﻦ
Haa | ﻩ | ﻫـ | ـﻬـ | ـﻪ
Waw | ﻭ | ﻭ | ـﻮ | ـﻮ
Yaa | ﻱ | ﻳـ | ـﻴـ | ـﻲ
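The positional shapes in Table 1 correspond to what Unicode encodes as Arabic presentation forms. As a minimal sketch (not part of the original paper), the snippet below uses Python's standard unicodedata module to show that the four shapes of Ayn all reduce to the same base letter, and checks the shape count implied by the text (22 letters with four shapes, 6 with two). The dictionary name AYN_SHAPES and the helper base_letter are illustrative names chosen here.

```python
import unicodedata

# The four positional shapes of Ayn as Unicode presentation forms
# (Arabic Presentation Forms-B block).
AYN_SHAPES = {
    "alone": "\uFEC9",
    "end": "\uFECA",
    "start": "\uFECB",
    "middle": "\uFECC",
}


def base_letter(shape: str) -> str:
    """Map a positional presentation form back to its base letter via NFKC."""
    return unicodedata.normalize("NFKC", shape)


for position, shape in AYN_SHAPES.items():
    # Print the Unicode name of each shape and of the letter it reduces to.
    print(position, "|", unicodedata.name(shape), "->", unicodedata.name(base_letter(shape)))

# Shape count implied by the text: 22 letters with four shapes, 6 with two.
TOTAL_SHAPES = 22 * 4 + 6 * 2
print(TOTAL_SHAPES)  # 100
```

A recognizer trained on isolated characters, as in this database, only needs the 28 base classes; the positional forms matter mainly when the same labels are later reused for characters segmented out of words.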

3. Proposed Method
Ideally, any recognition system requires a large database for training and testing. Real data from banks or postal services are confidential and inaccessible for non-commercial research. Although some work has been conducted on Arabic handwritten digits, the authors generally used small databases of their own or presented results on databases that were unavailable to the public. Consequently, there was no benchmark for comparing the results obtained by researchers. Generally, databases can be classified into two types: first, databases of Arabic words and text; second, databases of isolated characters, numerals, and symbols. The proposed database contains all the characters of the Arabic alphabet.

3.1 Data Collection
Finding a suitable source of data is the first step toward building a database [13]. Here, the main goal is to collect images of Arabic handwritten characters written by many writers, so a form was designed for this purpose. The form is shown in figure 1. It lists the 28 letters of the alphabet; each letter is printed and followed by ten empty blocks. The writers were asked to write each letter in the empty blocks. The total number of writers is 100.

Fig 1. Form Example

The form was distributed among three main groups of writers: the academic staff of the Computer Science department at Taibah University, university students, and high school students.

3.2 Form Processing
All form pages were scanned using a high quality scanner. The scanner scans 100 pages per minute at resolutions from 300 to 1800 dpi, and its output can be in pdf, jpeg, or bmp format. A resolution of 300 dpi and the bmp format were chosen. All the letters were written with a black or dark blue pen on white paper. An example of a scanned page is shown in figure 2.

Fig 2. Scanned page Example

3.3 Cropping
A cropping process was applied to each form page to crop each letter block. This was done manually using the Paint software. The cropping step is summarized in figure 3, and the saving step is summarized in figure 4.


Fig 3. Letter Cropping Example

Fig 4. Letter Saving Example

In the cropping process, each letter was cropped and saved in a separate folder. All the letter images were saved at the same size. Since each letter was written 10 times by each of the 100 writers, there are 1000 images for each letter. The database is divided into two main folders, training and testing. Each contains 28 folders, one per letter; the training folder holds 80% of the images and the testing folder holds 20%. For example, in the training folder there is a folder for the letter Alif (ﺍ), as shown in figure 5.
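The 80/20 division described above can be sketched as follows. This is only an illustration under stated assumptions: the paper does not say how images were assigned to the training and testing folders, so the seeded random shuffle, the function name split_letter_images, and the file naming pattern are all hypothetical.

```python
import random

NUM_LETTERS = 28   # letters in the Arabic alphabet
WRITERS = 100      # writers who filled in a form
REPETITIONS = 10   # times each writer wrote each letter


def split_letter_images(filenames, train_fraction=0.8, seed=0):
    """Shuffle one letter's image names and split them into train/test lists."""
    shuffled = sorted(filenames)            # fixed starting order for reproducibility
    random.Random(seed).shuffle(shuffled)   # seeded shuffle (an assumption, not from the paper)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names: the paper does not describe its naming scheme.
alif_images = [
    f"alif_w{writer:03d}_{rep:02d}.bmp"
    for writer in range(1, WRITERS + 1)
    for rep in range(1, REPETITIONS + 1)
]

train, test = split_letter_images(alif_images)
print(len(train), len(test))                    # 800 200
print(NUM_LETTERS * (len(train) + len(test)))   # 28000
```

Repeating the split for all 28 letters gives 22400 training and 5600 testing images. A random split like this one places images from the same writer in both sets; splitting by writer instead would be a different design choice that the paper neither confirms nor rules out.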

Fig 5. Letter Alif images

All the cropped letter images need to be pre-processed to remove noise.

3.4 Pre-processing
Pre-processing is an important stage in any recognition system. Its goal is to improve the quality of the images so that proper features can be extracted later. A pre-processing step was applied to the images to enhance them. First, all images of the same letter are placed in one folder, giving 28 folders. Each folder is read separately, and each image in it is resized and normalized. Then a median filter is applied to remove the salt and pepper noise caused by the scanning process. This is summarized in figure 6.
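The median filtering step can be illustrated with a small, dependency-free sketch. The paper does not state the window size or the border handling, so the 3x3 window and the edge replication used here are assumptions, and the function name median_filter_3x3 is illustrative. Resizing and normalization are not shown. In practice an image-processing library would normally be used instead of nested Python lists.

```python
from statistics import median


def median_filter_3x3(image):
    """Apply a 3x3 median filter to a grey-level image given as a list of rows."""
    height, width = len(image), len(image[0])
    filtered = []
    for r in range(height):
        row = []
        for c in range(width):
            # Collect the 3x3 neighbourhood, clamping indices at the borders
            # so that edge pixels reuse their nearest valid neighbours.
            window = [
                image[min(max(r + dr, 0), height - 1)][min(max(c + dc, 0), width - 1)]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
            ]
            row.append(median(window))
        filtered.append(row)
    return filtered


# A white 5x5 patch (255) with one isolated black "pepper" pixel (0).
noisy = [[255] * 5 for _ in range(5)]
noisy[2][2] = 0

clean = median_filter_3x3(noisy)
print(clean[2][2])  # 255
```

Because an isolated noise pixel is never the median of its neighbourhood, it is replaced by the surrounding background value, while strokes wider than one pixel are largely preserved. Small dots on letters can be sensitive to the window size, which is one reason the window is kept small in this sketch.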

Fig 6. Letter Raa images

4. Database
This paper has produced a database for Arabic handwritten character recognition (AHCR), which can support researchers in many fields. All the images are saved as grey-level images.

5. Conclusion
This paper proposed a database for Arabic handwritten character recognition (AHCR). 100 different writers filled in 100 forms, where each form contains 10 images for each of the 28 letters, resulting in 280 images per form. The variety of writers provides a wide variety of writing styles. In total, 28000 valid Arabic letters were cropped and extracted from the forms. This database makes it possible to develop and test Arabic handwritten character recognition systems.

References
1. Jawad H. AlKhateeb, Jinchang Ren, Jianmin Jiang, Husni Al-Muhtaseb, "Offline handwritten Arabic cursive text recognition using Hidden Markov Models and re-ranking", Pattern Recognition Letters, vol. 32, pp. 1081-1088, 2011.
2. Jawad H. AlKhateeb, Olivier Pauplin, Jinchang Ren, Jianmin Jiang, "Performance of hidden Markov model and dynamic Bayesian network classifiers on handwritten Arabic word recognition", Knowledge-Based Systems, vol. 24, pp. 680-688, 2011.
3. L. M. Lorigo and V. Govindaraju, "Offline Arabic handwriting recognition: a survey", IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 28, pp. 712-724, 2006.
4. H. Al-Muallim and S. Yamaguchi, "A method of recognition of Arabic cursive handwriting", IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 9, pp. 715-722, 1987.
5. A. Amin and H. Alsadoun, "A new segmentation technique of Arabic text", IEEE Trans. Pattern Recognition, vol. 2, pp. 441-445, 1992.
6. A. Amin and H. Alsadoun, "Hand printed Arabic Character Recognition System", IEEE Trans. Pattern Recognition, vol. 2, pp. 536-539, 1994.
7. I. S. I. Abuhaiba, M. J. J. Holt, and S. Datta, "Processing of binary images of handwritten text documents", Pattern Recognition, vol. 29, pp. 1161-1177, 1996.
8. I. S. I. Abuhaiba, M. J. J. Holt, and S. Datta, "Recognition of Off-Line Cursive Handwriting", Computer Vision and Image Understanding, vol. 71, pp. 19-38, 1998.
9. M. Khorsheed, "Recognising handwritten Arabic manuscripts using a single hidden Markov model", Pattern Recognition Letters, vol. 24, pp. 2235-2242, 2003.
10. S. Alma'adeed, C. Higgens, and D. Elliman, "Off-line recognition of handwritten Arabic words using multiple hidden Markov models", Knowledge-Based Systems, vol. 17, pp. 75-79, 2004.
11. Jawad H. AlKhateeb, Jinchang Ren, Jianmin Jiang, Stanley Ipson, "A Machine Learning Approach for Classifying Offline Handwritten Arabic Words", Proc. International Conference on CYBERWORLDS, pp. 219-223, 2009.
12. El-Sherif, E. & Abdleazeem, S., "A Two-Stage System for Arabic Handwritten Digit Recognition Tested on a New Large Database", International Conference on Artificial Intelligence and Pattern Recognition, AIPR-07, Orlando, Florida, USA, 2007.
13. Al-Ohali Y, Cheriet M, and Suen Ching, "Database for Recognition of Handwritten Arabic Cheques", Proceedings of the Seventh Workshop on Frontiers in Handwriting Recognition, Amsterdam, 2000.