KHATT: a Deep Learning Benchmark on Arabic Script


Riaz Ahmad∗, Saeeda Naz†, M. Zeshan Afzal∗, S. Faisal Rashid‡, Marcus Liwicki§, Andreas Dengel∗

∗ rahmad, [email protected], [email protected], DFKI & TU-Kaiserslautern, Germany
† [email protected], Govt. Girls Postgraduate College No.1, Pakistan
‡ [email protected], Al-Khwarizmi Institute of Computer Science, UET, Lahore, Pakistan
§ [email protected], University of Fribourg, Switzerland

Abstract—This work presents state-of-the-art results on KHATT, a complex dataset of Arabic handwritten text. We achieve better character recognition performance by applying the most successful deep learning approach, which is based on Long Short-Term Memory (LSTM) networks. Connectionist Temporal Classification (CTC) is used as the final layer to align the predicted labels according to the most probable path. The MDLSTM scans text-lines in all directions to capture fine variations in the horizontal and vertical directions. Further, we pre-process the text-lines to prune extra white regions and de-skew them for accurate height normalization. Together, deep learning and pre-processing improve the character recognition rate from 46.13% to 75.8%.

Keywords—KHATT dataset, MDLSTM, CTC, Arabic language, handwriting recognition.



I. INTRODUCTION

Arabic script is one of the hardest scripts to recognize in the field of Optical Character Recognition (OCR). Due to its cursive nature, Arabic script poses many challenges [1], including segmentation, character overlapping, context dependency, font style variation, and space anomalies. These complexities make recognition even more difficult in handwritten text documents [2]. Therefore, this work focuses on the problem of Arabic handwriting recognition and evaluates the proposed method on the KHATT dataset [3, 4]. Figure 1 shows a sample of the KHATT dataset.

Fig. 1. Sample of KHATT dataset, taken from [3].

The state-of-the-art result on the KHATT dataset is about 46.13% on large text-lines and 51% on limited text-lines. The motivation of this work is to apply a deep learning based MDLSTM architecture to the KHATT dataset and to present state-of-the-art results on it.

Deep Learning (DL) has proved itself as a core component of robust classification techniques. Deep networks with multiple hidden layers play a vital role in classification, recognition, and segmentation, and DL models have emerged as cutting-edge techniques in pattern and object classification. Deep learning based models such as the Convolutional Neural Network (CNN) [5], the Extreme Learning Machine (ELM) [6], and variations of the Recurrent Neural Network (RNN) such as Long Short-Term Memory (LSTM), BLSTM [7], MDLSTM [8], and Hierarchical LSTM [9, 10] have shown promising performance in optical character recognition, handwriting recognition, object recognition, medical imaging, text mining, etc. In character recognition, LSTM combined with a CTC output layer for sequence labeling has proved to be more effective than Hidden Markov Models (HMM) and other models [11]. In this work, our proposed model is also based on MDLSTM with a CTC layer as the final layer.

The main contribution of this paper is the application of a deep neural network to Arabic script such that it outperforms current state-of-the-art methods on the KHATT dataset. To the best of our knowledge, this is the first work that applies deep learning to the KHATT dataset. The rest of this paper is organized as follows. Section II reviews the literature on the KHATT dataset and RNN based systems. The details and peculiarities of the KHATT dataset are given in Section III. Section IV presents the proposed MDLSTM based Arabic character recognition on samples from the handwritten KHATT dataset, and Section V shows experimental results along with a comparison with existing systems. Finally, Section VI concludes the paper.



II. RELATED WORK

This section presents related work on the KHATT dataset and covers most of the work on Recurrent Neural Networks (RNN) in the field of Arabic OCR. Mahmoud et al. [4] presented a pilot version of the KHATT dataset at ICFHR 2012 for the research community working on handwritten Arabic character recognition. Each form in the dataset has four pages. The first page contains information about the writer; the second page contains two paragraphs (a fixed text covering all shapes of Arabic characters and a free text randomly selected from 46 different sources); the third page contains another randomly selected free-text paragraph; and the last page contains open data written on ruled lines. The pilot KHATT dataset contains text lines from the third and fourth paragraphs, collected by 1000 writers from 46 sources. A Hidden Markov Model (HMM) was trained on 1400 text lines (125180 words) written by 258 writers and tested on 233 lines (26159 words) written by another 40 writers. Three experiments were conducted separately on adaptive pixel density of the text-line images, horizontal and vertical edge derivatives with statistical features, and gradient features. The highest reported character recognition accuracy was 51.2%, obtained with gradient features. The KHATT dataset was then extended in [3] from 1400 text-lines to 9327 text-lines. The authors applied HMM and syntactic classifiers to the KHATT dataset and reported 46.13% and 41.64% character-level accuracy on the test set using adaptive pixel density, horizontal and vertical derivatives, and statistical features, respectively.

Hamdani et al. [12] used an HMM and achieved word error rates of 32.5% and 26.8% on constrained and unconstrained tasks, respectively. The error rate decreased with language modeling (based on morphological operations), and they reported an Out-of-Vocabulary (OOV) rate of 11.40% in the constrained task and 3.51% in the unconstrained task. Words carry more affixes in the unconstrained task.

Stahlberg and Vogel [13] borrowed an idea from automatic speech recognition [14]: they split the text image with a sliding window 3 pixels wide with a 1-pixel overlap and then extracted pixel based features. They also obtained centroid based features, which they called segmented features, by dividing the sliding window into six zones. Their highest accuracy was 69.5% (30.5% WER), obtained with pixel based features. Classification was performed with the Kaldi toolkit, which is based on deep neural networks [14]. BenZeghiba [15] presented language modeling based MDLSTM systems for the recognition of handwritten and printed sub-words. The systems were evaluated on the handwritten KHATT and Maurdor databases and on the printed Maurdor database. Different techniques were used to decompose words into sub-words for language modeling. They reported a 34.3% word error rate (WER) on the KHATT dataset.
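The sliding-window segmentation attributed to Stahlberg and Vogel [13] can be illustrated with a short sketch. This is not their implementation: the image representation (a list of pixel rows with 1 for ink) and the function names are assumptions made for illustration. A window width of 3 pixels with a 1-pixel overlap corresponds to a step of 2 pixels between consecutive windows.

```python
from typing import List

Image = List[List[int]]  # rows of pixels, 1 = ink, 0 = background (assumed encoding)


def sliding_windows(image: Image, width: int = 3, overlap: int = 1) -> List[Image]:
    """Cut a text-line image into vertical strips of `width` columns.

    Consecutive strips share `overlap` columns, so the step is width - overlap.
    """
    if not 0 <= overlap < width:
        raise ValueError("overlap must satisfy 0 <= overlap < width")
    step = width - overlap
    n_cols = len(image[0]) if image else 0
    windows = []
    for start in range(0, n_cols - width + 1, step):
        windows.append([row[start:start + width] for row in image])
    return windows


def pixel_features(window: Image) -> List[int]:
    """Flatten one window into a raw pixel feature vector (row-major order)."""
    return [pixel for row in window for pixel in row]
```

Each window yields one feature vector, so a text-line becomes a left-to-right sequence of vectors that a sequence classifier can consume.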

Fig. 2. Samples of original text-lines (1 to 3) showing skew, slant, and extra white space at the top, bottom, right, and left of the text-lines.

There is limited related work on handwriting recognition on the KHATT dataset. However, other efforts exist on related tasks, including writer identification, gender identification, and age detection. A competition on writer identification using the handwritten KHATT dataset was held at ICFHR 2014 [16]. The competitors submitted three systems, and the highest accuracy achieved was 73.4% and 87.6% for top-1 and top-10, respectively, using RootSIFT features, GMM supervectors, and Exemplar-SVM++. In another attempt, Djeddi et al. [17] proposed a writer identification system using feature vectors of white and black pixels based on run-length coding, edge-hinge, and edge-direction. A multi-class SVM classifier was used, and they reported 84.10% and 92.80% for top-1 and top-10, respectively. Moreover, Christlein et al. [18] presented an exemplar-SVM approach using GMM supervectors computed from RootSIFT descriptors and achieved approximately 100% accuracy. Saabni et al. [19] proposed a writer identification method and reported 94.2% accuracy using an AdaBoost classifier.

Bouadjenek et al. [20] presented systems for identifying the age, gender, and handedness of the writer of text in the handwritten KHATT dataset. A histogram of gradients is first extracted, and a support vector machine is used for identification. They reported accuracies of up to 68.89%, 83.93%, and 67.78% for gender, handedness, and age, respectively. They further refined the gradient feature into gradient local binary patterns (GLBP) for the identification of age, gender, and handedness of writers [21]. Using GLBP features with the SVM classifier, they obtained 74.44% for gender and 55.55% for age. For handedness, GLBP did not perform as well as HOG and gave 78.57% accuracy.

Another work [22] reports the generation of a printed KHATT dataset (P-KHATT). Ahmad et al. [22] presented an impressive study of mono-font and mixed-font text recognition using HMMs. The same authors also developed an online KHATT dataset, with text taken from the primary source of the offline KHATT dataset, for the ICFHR 2016 competition on online Arabic handwritten text. Other works that successfully used BLSTM or MDLSTM as core classifiers can be found in [23–26].

We conclude from the above literature that the state-of-the-art character recognition accuracy on the complete handwritten KHATT dataset is 46.13% [3]. This poor recognition rate motivates us to investigate the complexity of the KHATT dataset so that we can improve handwriting recognition for Arabic script.



Fig. 3. Architecture of MDLSTM for handwritten Arabic character recognition [11].

III. KHATT DATASET

The KFUPM Handwritten Arabic TexT (KHATT) dataset was presented at ICFHR 2012 with the intent of supporting researchers in the field of character recognition for Arabic script [4]. The extended, completed dataset is described in [3]. It is a large dataset of 2000 paragraphs written by 1000 writers. The text is collected from 46 sources and was written by writers from 18 countries. The paragraphs are automatically segmented into a total of 9000 text lines. The dataset is freely available. It is very helpful for many tasks, such as writer identification, line segmentation, age identification, personality detection, handedness detection, skew detection and correction, baseline detection, noise removal, and binarization. In the area of Arabic handwriting recognition, the KHATT dataset is a comprehensive database that provides text lines with many peculiarities, which can degrade the learning process in the models. Figure 2 shows some of the challenging issues related to text-lines in the KHATT dataset. The peculiarities we faced are:

• Many text-line images have extra white regions.

• There is no proper baseline.

• Many images need to be properly de-skewed.

• Each text-line has a different height.
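The extra white regions and the varying text-line heights listed above are the two peculiarities that pruning and height normalization address. The following is a minimal sketch of those two operations, assuming a binarized image stored as a list of pixel rows with 1 for black pixels. The function names and the nearest-neighbour resampling are assumptions made for illustration; the 48-pixel target height is the value used in this work, and skew correction [27] is not covered here.

```python
from typing import List

Image = List[List[int]]  # rows of pixels, 1 = black (ink), 0 = white (assumed encoding)


def prune_vertical_margins(image: Image) -> Image:
    """Crop the image between the first and last rows that contain a black pixel."""
    ink_rows = [i for i, row in enumerate(image) if any(row)]
    if not ink_rows:
        return image  # blank line: nothing to crop
    return image[ink_rows[0]:ink_rows[-1] + 1]


def normalize_height(image: Image, target_height: int = 48) -> Image:
    """Rescale to `target_height` rows while keeping the aspect ratio.

    Nearest-neighbour sampling is an assumption made for this sketch.
    """
    if not image or not image[0]:
        return image
    height, width = len(image), len(image[0])
    target_width = max(1, round(width * target_height / height))
    return [
        [image[r * height // target_height][c * width // target_width]
         for c in range(target_width)]
        for r in range(target_height)
    ]
```

Cropping before scaling matters: if the white margins were kept, they would take up part of the 48 rows and the ink would be scaled differently from line to line.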


IV. PROPOSED SYSTEM

Our proposed system consists of three steps: (1) a set of pre-processing tasks, (2) training of the proposed MDLSTM based classifier, and (3) evaluation of the test set on the trained model. Figure 3 shows an overview of the handwritten Arabic character recognition system based on the MDLSTM architecture and the CTC output layer.

A. Pre-processing

As mentioned, the text-lines of the KHATT dataset have high skew and extra white spaces. Figure 2 shows examples that contain skew and extra white spaces. These issues call for skew detection and correction. Discarding the additional white regions is also important for accurate height normalization. First, we prune the extra white areas by searching for black pixels from top to bottom. The first black pixel from the top gives the upper extreme point, and the black pixel at the very bottom gives the lower extreme point. The text-line images are then cropped between these two points, which eliminates the additional white regions. However, skew still exists. For skew detection and correction, we use the method reported by Ahmad et al. [27]. After de-skewing, we normalize the images to a fixed height of 48 pixels while maintaining the aspect ratio. Figure 2 shows representative images from the KHATT dataset before pre-processing, and Figure 4 illustrates the resulting text-lines after pre-processing.

Fig. 4. Sample text-lines after skew and slant correction, pruning, and normalization.

B. Proposed Model

The literature review makes clear that deep learning approaches have achieved significant results in the field of handwriting recognition. However, deep learning approaches have not yet been applied to the KHATT dataset. Therefore, we use an MDLSTM based deep learning approach for the first time, taking the KHATT dataset as a test case. Our proposed model is inspired by the model presented by Ahmad et al. [11].

We have taken the proposed model and its parameters from the work reported in [11]. However, the KHATT dataset has 116 class labels.





TABLE I. Character recognition results on the KHATT dataset.

Experiment        Character Error Rate (%)    Recognition Rate (%)
Training Set      19.70                       80.80
Validation Set    25.14                       74.86
Test Set          24.20                       75.80

Table II depicts the results obtained by our proposed method and the state-of-the-art method.

VI. CONCLUSION

In this paper, we have proposed an MDLSTM based Arabic character recognition system. We used the CTC layer to align the predicted labels according to the most probable path. We evaluated the system on the KHATT dataset, a challenging dataset of Arabic handwritten text documents. Furthermore, we introduced a pre-processing mechanism that prunes extra white spaces and de-skews the text-lines for accurate height normalization. Based on the experimental results, our proposed system outperforms the state of the art and improves accuracy by a high margin of 29%. The overall recognition rate of our system is 75.8% on the unique text-lines of the KHATT dataset.

Fig. 5. The learning behaviour of the network in terms of label error rate at each epoch.
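The per-epoch label error plotted in Fig. 5 also drives the stopping rule used in training: once the best label error has been reached, training continues for 20 more epochs. A minimal sketch of that rule follows, assuming it is applied to the validation label error. The function name and the sample values in the usage note are hypothetical.

```python
from typing import Iterable, Tuple


def best_epoch_with_patience(errors: Iterable[float], patience: int = 20) -> Tuple[int, float]:
    """Return (epoch, error) of the best epoch seen before training stops.

    Training stops once `patience` epochs have passed without improving
    on the best label error.
    """
    best_epoch, best_error = -1, float("inf")
    for epoch, error in enumerate(errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_error
```

For example, with the hypothetical error sequence 0.9, 0.5, 0.4, 0.45, 0.5, 0.41 and `patience=3`, the rule stops after the sixth value and keeps the network from the third epoch (index 2).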

The 116 class labels comprise 115 basic labels plus a blank label. Therefore, the output layer of our proposed model has 116 units. Figure 3 illustrates the architecture. The MDLSTM model has the following main parameters: three MDLSTM hidden layers of size 4, 20, and 100, respectively, with two tanh layers of size 16 and 80 stacked between the hidden layers. The Connectionist Temporal Classification (CTC) layer is used as the final layer. Its purpose is to align the predicted labels with the ground-truth labels. The CTC layer gives us an implicit segmentation of complexly written text-lines. For further detail, please see the work reported in [11].
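To illustrate how the CTC output is turned into a label sequence along the most probable path, the following is a minimal sketch of best-path decoding. It is not the implementation used in [11]. The blank index (0) and the function name are assumptions made for illustration; in the setting described above, each frame would carry 116 scores, one per class label.

```python
from itertools import groupby
from typing import List, Sequence

BLANK = 0  # assumed index of the CTC blank label


def ctc_best_path(frame_scores: Sequence[Sequence[float]], blank: int = BLANK) -> List[int]:
    """Greedy (best-path) CTC decoding.

    1. Take the most probable label at every frame.
    2. Collapse runs of repeated labels.
    3. Drop the blank label.
    """
    path = [max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores]
    return [label for label, _ in groupby(path) if label != blank]
```

The blank label is what lets the decoder keep genuine repetitions: the frame path 1, 1, blank, 1 decodes to the labels 1, 1, whereas 1, 1, 1 decodes to a single 1. This is also why no explicit character segmentation of the text-line is needed.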


REFERENCES

[1] S. Naz, K. Hayat, M. I. Razzak, M. W. Anwar, S. A. Madani, and S. U. Khan, “The optical character recognition of Urdu-like cursive scripts,” Pattern Recognition, vol. 47, no. 3, pp. 1229–1248, 2014.
[2] M. Parvez and S. Mahmoud, “Offline Arabic handwritten text recognition: a survey,” ACM Computing Surveys (CSUR), 2013.
[3] S. A. Mahmoud, I. Ahmad, W. G. Al-Khatib, M. Alshayeb, M. T. Parvez, V. Märgner, and G. A. Fink, “KHATT: An open Arabic offline handwritten text database,” Pattern Recognition, vol. 47, no. 3, pp. 1096–1112, 2014.
[4] S. A. Mahmoud, I. Ahmad, M. Alshayeb, W. G. Al-Khatib, M. T. Parvez, G. A. Fink, V. Märgner, and H. El Abed, “KHATT: Arabic offline handwritten text database,” in Frontiers in Handwriting Recognition (ICFHR), 2012 International Conference on. IEEE, 2012, pp. 449–454.
[5] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, “Large-scale video classification with convolutional neural networks,” in Proceedings of the Conference on Computer Vision and Pattern Recognition. IEEE, 2014, pp. 1725–1732.
[6] G. Huang, S. Song, J. N. D. Gupta, and C. Wu, “Semi-supervised and unsupervised extreme learning machines,” Transactions on Cybernetics, pp. 2405–2417, 2014.
[7] M. Liwicki, A. Graves, H. Bunke, and J. Schmidhuber, “A novel approach to on-line handwriting recognition based on bidirectional long short-term memory networks,” in Proceedings 9th Int. Conf. on Document Analysis and Recognition, vol. 1. IEEE, 2007, pp. 367–371.
[8] A. Graves and J. Schmidhuber, “Offline handwriting recognition with multidimensional recurrent neural networks,” in Advances in Neural Information Processing Systems, 2009, pp. 545–552.
[9] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, “Hierarchical attention networks for document


V. EXPERIMENTAL RESULTS

The pre-processed images are fed to the proposed model. Convergence became visible after the 9th epoch. The other network parameters are a learning rate of 1e−4 and a momentum of 0.9. After reaching the best label error at some epoch, we let the system train for 20 more epochs. The system validates learning on the validation set during the training process. Figure 5 plots the error at each epoch during training.

A. Results

The experiments are carried out on the unique text-lines of the KHATT dataset. We use exactly the same split as reported in [3]. According to this default split, the training set has 4,825 text-lines, the test set has 966 text-lines, and the validation set contains 937 text-line images. In this work, we report the Character Error Rate (CER), computed with the Levenshtein distance [28] between the annotations and the predicted text. Table I shows the character recognition rates on the KHATT dataset. The results show that our proposed model outperforms the state-of-the-art results on the KHATT dataset. The state-of-the-art system used gradient features with a Hidden Markov Model (HMM) as the learning classifier. We report 24.25% CER on the KHATT dataset, which is the lowest CER reported so far on this dataset (see Table II).
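A minimal sketch of how a CER of this kind can be computed from the Levenshtein distance is given below. The normalization is an assumption made for illustration: edit distances are summed over all text-lines and divided by the total number of annotated characters, with each character treated as one string element. The function names are hypothetical.

```python
from typing import Sequence


def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def character_error_rate(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """CER in percent: total edit distance over total reference length."""
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return 100.0 * total_edits / total_chars
```

Under this definition, the recognition rate reported alongside the CER is its complement, 100 minus the CER.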

TABLE II. Comparison with the state-of-the-art systems.

Systems             Approach                 Classifier        Feature Set           Recognition Rate (%)
Ahmad et al. [3]    Implicit segmentation    HMM (HTK tool)    Adaptive gradients    46.13
Proposed            Implicit segmentation    MDLSTM            Raw pixels            75.80

classification,” in Proceedings of NAACL-HLT, 2016, pp. 1480–1489.
[10] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang, “Hierarchical recurrent neural encoder for video representation with application to captioning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016, pp. 1029–1038.
[11] R. Ahmad, M. Z. Afzal, S. F. Rashid, M. Liwicki, T. Breuel, and A. Dengel, “KPTI: Katib’s Pashto text imagebase and deep learning benchmark,” in Frontiers in Handwriting Recognition (ICFHR), 2016 15th International Conference on. IEEE, 2016, pp. 453–458.
[12] M. Hamdani, A. Mousa, and H. Ney, “Open vocabulary Arabic handwriting recognition using morphological decomposition,” in 12th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2013, pp. 280–284.
[13] F. Stahlberg and S. Vogel, “The QCRI recognition system for handwritten Arabic,” in International Conference on Image Analysis and Processing. Springer, 2015, pp. 276–286.
[14] D. Povey, X. Zhang, and S. Khudanpur, “Parallel training of deep neural networks with natural gradient and parameter averaging,” in a workshop contribution at ICLR. Springer, 2015.
[15] M. F. BenZeghiba, “Arabic word decomposition techniques for offline Arabic text transcription,” in International Workshop on Arabic Script Analysis and Recognition (ASAR). IEEE, 2017.
[16] F. Slimane, S. Kanoun, S. Mahmoud, and V. Märgner, “ICFHR2014 competition on Arabic writer identification using AHTID/MW and KHATT databases,” in 14th International Conference on Frontiers in Handwriting Recognition (ICFHR). IEEE, 2014.
[17] C. Djeddi, L.-S. Meslati, I. Siddiqi, A. Ennaji, H. E. Abed, and A. Gattal, “Evaluation of texture features for offline Arabic writer identification,” in 11th IAPR International Workshop on Document Analysis Systems (DAS), 2014.
[18] V. Christlein, D. Bernecker, F. Hönig, A. Maier, and E. Angelopoulou, “Writer identification using GMM supervectors and exemplar-SVMs,” Pattern Recognition, 2017.
[19] R. Saabni, “Boosting feature based classifiers for writer identification,” in International Workshop on Arabic Script Analysis and Recognition (ASAR). IEEE, 2017.
[20] N. Bouadjenek, H. Nemmour, and Y. Chibani, “Age, gender and handedness prediction from handwriting using gradient features,” in 13th International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2015, pp. 1116–1120.
[21] ——, “Histogram of oriented gradients for writer’s gender, handedness and age prediction,” in International Symposium on Innovations in Intelligent Systems and Applications (INISTA). IEEE, 2015, pp. 1–5.
[22] I. Ahmad, S. A. Mahmoud, and G. A. Fink, “Open-vocabulary recognition of machine-printed Arabic text using hidden Markov models,” Pattern Recognition, 2016.
[23] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, “A novel connectionist system for unconstrained handwriting recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 855–868, 2009.
[24] A. Graves, “Offline Arabic handwriting recognition with multidimensional recurrent neural networks,” in Guide to OCR for Arabic Scripts, 2012, pp. 1–8.
[25] S. F. Rashid, M.-P. Schambach, J. Rottland, and S. von der Null, “Low resolution Arabic recognition with multidimensional recurrent neural networks,” in Proceedings of the 4th International Workshop on Multilingual OCR, 2013, p. 6.
[26] Y. Chherawala, P. Roy, and M. Cheriet, “Feature design for offline Arabic handwriting recognition: handcrafted vs automated?” in 12th International Conference on Document Analysis and Recognition (ICDAR), 2013.
[27] R. Ahmad et al., “A novel skew detection and correction approach for scanned documents,” in DAS 2016, 12th Intl IAPR Workshop on Document Analysis Systems, Santorini, Greece. IAPR, 2016.
[28] V. I. Levenshtein, “Binary codes capable of correcting deletions, insertions and reversals,” in Soviet Physics Doklady, vol. 10, 1966, p. 707.