International Conference on Bangla Speech and Language Processing (ICBSLP), 21-22 September, 2018
Bangla Handwritten Digit Recognition Using Deep CNN for Large and Unbiased Dataset
Ashadullah Shawon Department of Computer Science and Engineering Rajshahi University of Engineering and Technology Rajshahi, Bangladesh [email protected]
Firoz Mahmud Department of Computer Science and Engineering Rajshahi University of Engineering and Technology Rajshahi, Bangladesh [email protected]
Abstract—Bangla handwritten digit recognition is a convenient starting point for building an OCR system for the Bengali language. Because no large and unbiased dataset was available, Bangla digit recognition was not standardized previously. In this paper, a large and unbiased dataset known as NumtaDB is used for Bangla digit recognition. The main challenges of the NumtaDB dataset are its highly unprocessed and augmented images. We therefore apply several preprocessing techniques to the images and use a deep convolutional neural network (CNN) as the classification model. The deep CNN model performed well, securing the 13th position among 57 participating teams in the Bengali handwritten digit recognition challenge 2018 with 92.72% testing accuracy. A study of the network's performance on the MNIST and EMNIST datasets was also performed to bolster the analysis.
Keywords— NumtaDB, CNN, Bangla Handwritten Digit Recognition, Large Unbiased Dataset, Computer Vision
I. INTRODUCTION
Bangla handwritten digit recognition is a classical problem in computer vision. Such a system has many practical applications, including OCR, postal code recognition, license plate recognition, and bank check recognition. Recognizing Bangla digits in documents is becoming increasingly important. There are 10 unique Bangla digits, so the recognition task is to classify images into 10 classes. The critical difficulty of handwritten digit recognition is that every person has their own writing style. Our contribution addresses a more challenging task: achieving robust performance and high accuracy on the large, unbiased, unprocessed, and highly augmented NumtaDB dataset. The dataset combines six datasets gathered from different sources at different times, and it contains blurring, noise, rotation, translation, shear, zooming, height/width shift, brightness, contrast, occlusions, and superimposition. We have not handled every kind of augmentation in this dataset; we mainly process blurred and noisy images. The processed images are then classified by a deep convolutional neural network.
978-1-5386-8207-4/18/$31.00 ©2018 IEEE
Md. Jamil-Ur Rahman Department of Computer Science and Engineering Rajshahi University of Engineering and Technology Rajshahi, Bangladesh [email protected]
M.M Arefin Zaman Department of Computer Science and Engineering Rajshahi University of Engineering and Technology Rajshahi, Bangladesh [email protected]
This paper is organized as follows. Related works are presented in Section II. Our dataset and proposed approach are described in Section III. Section IV describes the image preprocessing techniques. The architecture of the deep convolutional neural network is given in Section V. Section VI presents our experiments and result analysis, and the conclusion and future work are summarized in Section VII.
II. RELATED WORKS
There are several research works on Bangla handwritten digit recognition using deep learning. Most of them used biased datasets such as CMATERDB 3.1.1 because the NumtaDB dataset was not yet available. Recently, some researchers reported an accuracy of 99.50% on the CMATERDB 3.1.1 dataset using an autoencoder and a deep convolutional neural network. The local binary pattern has also been used for Bangla digit recognition. LeNet, an early CNN architecture, has also been applied to Bangla digit recognition. The convolutional neural network was introduced for better supervised learning and accuracy, so recent work on Bangla handwritten digit recognition uses deep CNN architectures. Other classifiers, such as the Support Vector Machine (SVM) and the Neural Network (NN), have been used for handwritten digit recognition, but CNNs perform better than these classifiers. As a result, CNNs have become the recent trend in handwritten digit recognition. Modifying a CNN architecture by adding more layers or more nodes has become a common way to improve on state-of-the-art accuracy, and some research works follow this strategy. Besides Bangla, there are significant research works on English handwritten digit recognition. MNIST and EMNIST are the most popular datasets for English handwritten digit and character recognition. EMNIST covers handwritten digit recognition for both balanced and imbalanced datasets. We previously carried out research on the MNIST and EMNIST datasets.
This paper builds on our previous research on English handwritten digit recognition. In that work we achieved better results than earlier research, which motivates us to contribute to Bangla handwritten digit recognition as well.
III. DATASET AND PROPOSED APPROACH
A. Dataset
In our proposed approach, we focus on the NumtaDB dataset, which consists of 85,000+ Bangla handwritten digit images, because NumtaDB is a standard, unbiased (in terms of geographic location, age, and gender), large, unprocessed, and reviewed dataset [3, Sec. I]. The NumtaDB dataset is therefore a demanding test of our proposed approach, and we hope that the approach will reach approximately the same accuracy on real-life handwritten digit recognition. The images of the NumtaDB dataset are real-world images without any preprocessing. The dataset is assembled from six different sources. According to NumtaDB, “The sources are labeled from 'a' to 'f'. The training and testing sets have separate subsets depending on the source of the data (training-a, testing-a, etc.). All the datasets have been partitioned into training and testing sets so that handwriting from the same subject/contributor is not present in both. Dataset-f had no corresponding metadata for contributors for which all of it was added to the testing set (testing-f)” . Each image of the dataset is about 180×180 pixels. Sample images from the NumtaDB dataset are shown in Fig. 1.
The EMNIST dataset is an extended version of the MNIST dataset. We used the EMNIST digits and EMNIST balanced datasets for digit and letter recognition. Fig. 2 illustrates the visual breakdown of the EMNIST digits dataset, which consists of 280,000 samples in 10 classes, i.e. 28,000 samples per class. Fig. 3 visualizes the EMNIST balanced dataset, which has 47 classes with 2,800 examples each.
Fig. 2. Visualization of EMNIST digits dataset [10, Fig.2].
Fig. 3. Visualization of EMNIST balanced dataset [10, Fig.2].
B. Proposed Approach
We propose to classify NumtaDB handwritten digits in two major steps: preprocessing of the images, followed by classification with a deep convolutional neural network.
Fig. 1. Bangla handwritten digit images from NumtaDB.
According to NumtaDB, “Two augmented datasets (augmented from test images of dataset 'a' and 'c') are appended to the testing set which consists of the following augmentations” .
• Spatial Transformations: Rotation, Translation, Shear, Height/Width Shift, Channel Shift, Zoom.
• Brightness, Contrast, Saturation, Hue shifts, Noise.
• Occlusions.
• Superimposition (to simulate the effect of text being visible from the other side of a page).
The augmented images of the NumtaDB dataset thus make Bangla handwritten digit recognition more challenging. The MNIST dataset is one of the most popular datasets for English handwritten digits. It has a training set of 60,000 samples and a test set of 10,000 samples, which gives roughly 6,000 training examples and 1,000 testing examples per class.
The two major steps, preprocessing of images and classification with a deep convolutional neural network, are described in detail in Sections IV and V.
IV. PREPROCESSING OF IMAGES
A. Resizing and Grayscaling
The original NumtaDB images are about 180×180 pixels, which is too large to preprocess efficiently, so we reduce them to 32×32 pixels. We also convert all RGB images to grayscale, which reduces the number of color channels from 3 to 1.
B. Interpolation
Images can lose important information when they are resized. Inter-area interpolation, which resamples using the pixel area relation, is the preferred method for image decimation. We use inter-area interpolation when resizing the images.
C. Removing Blur from Images
We first blur the image with a Gaussian filter and subtract the blurred image from the original image to obtain a mask. We then add a weighted portion of the mask to the original image to get the de-blurred image.
\[ g_{mask}(x, y) = f(x, y) - f'(x, y) \]
\[ g(x, y) = f(x, y) + k \, g_{mask}(x, y) \]
Here f'(x, y) is the blurred image and k is a weight for generality.
D. Sharpening Images
There are many filters for sharpening images. In this paper, we use the Laplacian filter. Our filter is the 3×3 matrix
\[ \begin{bmatrix} -1 & -1 & -1 \\ -1 & 9 & -1 \\ -1 & -1 & -1 \end{bmatrix} \]
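The de-blurring and sharpening steps described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the 3×3 Gaussian kernel, the border handling (edge replication), the default weight k = 1, and the function names are all hypothetical choices that the paper does not specify. Only the sharpening kernel is taken from the text.

```python
# Assumed 3x3 Gaussian approximation; the paper gives no kernel size or sigma.
GAUSS_3X3 = [[1 / 16, 2 / 16, 1 / 16],
             [2 / 16, 4 / 16, 2 / 16],
             [1 / 16, 2 / 16, 1 / 16]]

# Sharpening kernel as given in the text.
SHARPEN_3X3 = [[-1, -1, -1],
               [-1, 9, -1],
               [-1, -1, -1]]


def filter3x3(img, kernel):
    """Apply a 3x3 kernel to a 2-D list image, replicating border pixels."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp row index
                    xx = min(max(x + dx, 0), w - 1)  # clamp column index
                    acc += kernel[dy + 1][dx + 1] * img[yy][xx]
            out[y][x] = acc
    return out


def clip(img, lo=0.0, hi=255.0):
    """Clamp every pixel to the displayable range."""
    return [[min(max(v, lo), hi) for v in row] for row in img]


def unsharp_mask(img, k=1.0):
    """g = f + k * (f - f'), where f' is the Gaussian-blurred image."""
    blurred = filter3x3(img, GAUSS_3X3)
    return clip([[f + k * (f - b) for f, b in zip(row_f, row_b)]
                 for row_f, row_b in zip(img, blurred)])


def sharpen(img):
    """Filter the image with the 3x3 sharpening kernel from the text."""
    return clip(filter3x3(img, SHARPEN_3X3))
```

Both kernels sum to 1, so flat regions pass through unchanged and only edges are amplified. A larger k strengthens the unsharp-masking effect.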
E. Removing Noise from Images
We remove salt-and-pepper noise from the NumtaDB images with a median filter. After preprocessing, the images are clear, sharp, and free of salt-and-pepper noise.
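A median filter replaces each pixel with the median of its neighborhood, which is why isolated salt (bright) and pepper (dark) pixels disappear. The sketch below assumes a 3×3 window and edge replication; the paper does not state the window size, and the function name is hypothetical.

```python
from statistics import median


def median_filter3x3(img):
    """Replace each pixel by the median of its 3x3 neighborhood.

    Border pixels are handled by replicating the nearest edge pixel
    (an assumption; other border modes are equally valid).
    """
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            row.append(median(window))
        out.append(row)
    return out
```

Because at most a few pixels in any 3×3 window are corrupted when the noise is sparse, the median stays at the background value, whereas a mean filter would smear the outliers into their neighbors.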
layers. That is why we have chosen deep learning for Bangla handwritten digit recognition. We have built a custom deep learning architecture, which is described in subsection A.
A. Architecture of Model
Our proposed architecture consists of 6 convolutional layers and 2 fully connected (dense) layers. The first two convolutional layers have 32 filters of size 5×5. The middle two have 128 filters of size 3×3, and the last two have 256 filters of size 3×3. The rectified linear unit (ReLU) is used as the activation function for all layers except the output layer. Batch normalization and a max-pooling layer follow every two convolutional layers. The pool size of each max-pooling layer is 2×2. Batch normalization is used to speed up learning. Dropout (20%) is added after the first dense layer to reduce overfitting. Of the two fully connected layers, the first has 64 units and the last has 10 units, one for each of the 10 digits. The final activation function is a softmax function for classification. We use the Adam optimizer to update the weights. Fig. 6 shows the design of our proposed deep CNN architecture, with every configuration marked from input to output.
Fig. 4. Preprocessed images.
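The grayscaling and inter-area resizing of Sections IV-A and IV-B can be sketched as follows. This is a simplified illustration, not the authors' code: the luma weights (0.299, 0.587, 0.114) are the common ITU-R BT.601 values and are an assumption, and `resize_area` is a hypothetical pure-Python stand-in for a library routine such as OpenCV's `cv2.resize` with the `cv2.INTER_AREA` flag. Each output pixel is the area-weighted average of the input pixels it covers.

```python
def rgb_to_gray(pixel):
    """Convert one (R, G, B) pixel to a single luma value (BT.601 weights, assumed)."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def area_weights(n_in, n_out):
    """For each output index, list the (input index, weight) pairs it covers."""
    scale = n_in / n_out
    weights = []
    for o in range(n_out):
        start, end = o * scale, (o + 1) * scale
        row = []
        i = int(start)
        while i < n_in and i < end:
            overlap = min(end, i + 1) - max(start, i)
            if overlap > 0:
                row.append((i, overlap / scale))  # fraction of the output pixel
            i += 1
        weights.append(row)
    return weights


def resize_area(img, out_h, out_w):
    """Downsample a 2-D grayscale image by area-weighted averaging."""
    wy = area_weights(len(img), out_h)
    wx = area_weights(len(img[0]), out_w)
    # Horizontal pass, then vertical pass (the weighting is separable).
    tmp = [[sum(row[i] * w for i, w in cols) for cols in wx] for row in img]
    return [[sum(tmp[i][x] * w for i, w in rows) for x in range(out_w)]
            for rows in wy]
```

For a 180×180 input and a 32×32 output, each output pixel averages a 5.625×5.625 patch, so partially covered input pixels contribute in proportion to their overlap.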
The digits can be separated from the background more clearly by applying segmentation or thresholding. We use Otsu's method for global threshold selection.
Fig. 6. Design of proposed deep CNN architecture.
Table I shows the full model summary of our proposed deep CNN architecture. TABLE I.
Fig. 5. Preprocessed images after thresholding.
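Otsu's method picks the global threshold that maximizes the between-class variance of the gray-level histogram. The following sketch assumes 8-bit pixel values given as a flat list; the function names are hypothetical, and the code is an illustration of the technique rather than the authors' implementation.

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold that maximizes between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    w_b = 0          # number of pixels at or below the candidate threshold
    sum_b = 0        # sum of their gray levels
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_b += hist[t]
        sum_b += t * hist[t]
        if w_b == 0:
            continue             # no background pixels yet
        w_f = total - w_b
        if w_f == 0:
            break                # no foreground pixels left
        mean_b = sum_b / w_b
        mean_f = (sum_all - sum_b) / w_f
        between_var = w_b * w_f * (mean_b - mean_f) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, threshold):
    """Map pixels above the threshold to 1 and all others to 0."""
    return [1 if p > threshold else 0 for p in pixels]
```

For a 2-D image, flatten the rows into one list before calling `otsu_threshold`, then apply `binarize` row by row.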
V. DEEP CONVOLUTIONAL NEURAL NETWORK
Deep learning has provided outstanding performance in handwritten digit recognition over the last few years. It is a more efficient learning technique than others because it combines feature extraction and deep classification
MODEL SUMMARY OF OUR DEEP CNN ARCHITECTURE
Layer                    Output Shape
Conv2D_1                 (None, 32, 32, 32)
Conv2D_2                 (None, 32, 32, 32)
Batch Normalization_1    (None, 32, 32, 32)
MaxPooling2D_1           (None, 16, 16, 32)
Conv2D_3                 (None, 16, 16, 128)
Conv2D_4                 (None, 16, 16, 128)
Batch Normalization_2    (None, 16, 16, 128)
MaxPooling2D_2           (None, 8, 8, 128)
Conv2D_5                 (None, 8, 8, 256)
Conv2D_6                 (None, 8, 8, 256)
Batch Normalization_3    (None, 8, 8, 256)
MaxPooling2D_3           (None, 4, 4, 256)
Flatten_1                (None, 4096)
Dense_1                  (None, 64)
Activation_1             (None, 64)
Dropout_1                (None, 64)
Dense_2                  (None, 10)
Activation_2             (None, 10)
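The output shapes in Table I can be re-derived with a small shape tracer. The sketch assumes stride-1 convolutions with "same" padding (implied by the unchanged 32×32 size after the first convolutions) and a single-channel 32×32 input; the layer list mirrors the table, while the `kind` tags and function name are hypothetical helpers.

```python
# (layer name, kind, size): "keep" layers leave the shape unchanged.
LAYERS = [
    ("Conv2D_1", "conv", 32),
    ("Conv2D_2", "conv", 32),
    ("Batch Normalization_1", "keep", None),
    ("MaxPooling2D_1", "pool", None),
    ("Conv2D_3", "conv", 128),
    ("Conv2D_4", "conv", 128),
    ("Batch Normalization_2", "keep", None),
    ("MaxPooling2D_2", "pool", None),
    ("Conv2D_5", "conv", 256),
    ("Conv2D_6", "conv", 256),
    ("Batch Normalization_3", "keep", None),
    ("MaxPooling2D_3", "pool", None),
    ("Flatten_1", "flatten", None),
    ("Dense_1", "dense", 64),
    ("Activation_1", "keep", None),
    ("Dropout_1", "keep", None),
    ("Dense_2", "dense", 10),
    ("Activation_2", "keep", None),
]


def trace_shapes(height=32, width=32, channels=1):
    """Return [(layer name, output shape)] without the batch dimension."""
    shape = (height, width, channels)
    result = []
    for name, kind, size in LAYERS:
        if kind == "conv":        # 'same' padding, stride 1: only channels change
            shape = (shape[0], shape[1], size)
        elif kind == "pool":      # 2x2 max pooling halves height and width
            shape = (shape[0] // 2, shape[1] // 2, shape[2])
        elif kind == "flatten":   # collapse height * width * channels
            shape = (shape[0] * shape[1] * shape[2],)
        elif kind == "dense":
            shape = (size,)
        result.append((name, shape))
    return result


if __name__ == "__main__":
    for layer_name, layer_shape in trace_shapes():
        print(f"{layer_name:<24}{layer_shape}")
```

The three pooling stages reduce 32 to 16, 8, and 4, so the flattened vector has 4 × 4 × 256 = 4096 entries, matching the Flatten_1 row.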
B. Training Model Parameters The training model of our architecture is dependent on some parameters. The parameters of the training model are given in Table II. TABLE II.
PARAMETERS OF TRAINING MODEL
analyzing the results. Table IV shows the precision, recall, F1 score, and WAA of the testing results. TABLE IV.
PRECISION, RECALL, F1 SCORE AND WAA
Fig. 7 shows the confusion matrix of the results for the 10 classes.
VI. EXPERIMENT AND RESULT ANALYSIS
A. Experimental Environment
Our experimental environment consists of a high-end Intel Core i9 processor, a GeForce GTX 1080 Ti GPU, and 32 GB of RAM. This configuration helped us reduce the training time of our experiment.
B. Training, Validation, and Testing
The 85,000+ images of the NumtaDB dataset are split into training and testing sets at a ratio of 85%-15%. In the experiment, we further split the training data into training and validation sets at a ratio of about 80%-20%. We use the validation data to evaluate our model's performance, and the final result is measured on the testing dataset.
C. Evaluation
According to the Bengali handwritten digit recognition challenge, the test datasets of NumtaDB come from six different sources (codenames a, b, c, d, e, f), and two additional augmented datasets were produced from test sets a and c. Un-weighted average accuracy (UAA) is used as the evaluation metric. Here
\[ UAA = \frac{1}{8} \sum_{i=1}^{8} ACC_i \]
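As a worked illustration of the metric, the sketch below assumes that UAA is the plain mean of the accuracies on the eight test sets (a to f plus the two augmented sets), so each set counts equally regardless of its size. The function name is hypothetical and the accuracy values are made-up placeholders, not results from this paper.

```python
def unweighted_average_accuracy(per_set_accuracy):
    """Mean of per-test-set accuracies; every set has the same weight."""
    if not per_set_accuracy:
        raise ValueError("at least one test set is required")
    return sum(per_set_accuracy.values()) / len(per_set_accuracy)


# Placeholder accuracies for illustration only (NOT the paper's results).
example = {
    "a": 0.9, "b": 0.9, "c": 0.9, "d": 0.9, "e": 0.9, "f": 0.9,
    "augmented-a": 0.8, "augmented-c": 0.8,
}

if __name__ == "__main__":
    print(unweighted_average_accuracy(example))
```

Because the average is un-weighted, a small but difficult test set (such as an augmented one) lowers the score as much as a large one would.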
We normalize the confusion matrix so that it is easier to read. The competition authority provided us with the testing accuracy for each separate dataset (a-f, augmented-a, and augmented-c) within the assembled dataset. Table V shows the testing accuracy for every dataset. TABLE V.
D. Result Analysis
We mainly observe three results from our experiment: training accuracy, validation accuracy, and testing accuracy. These results are obtained after 30 epochs. Table III shows our training and validation results as well as the testing result in terms of un-weighted average accuracy.
UN-WEIGHTED AVERAGE ACCURACY
The training time is also measured. Training takes 15 seconds per epoch, so the 30 epochs take 30×15=450 seconds in total. We also observe precision, recall, F1 score and weighted average accuracy (WAA) for better understanding and
TESTING ACCURACY FOR EVERY SEPARATE DATASET
ACC_i is the model accuracy of the i-th dataset.
Fig. 7. Confusion Matrix
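The normalized confusion matrix and the per-class precision, recall, and F1 score can all be computed from raw counts. The sketch below assumes the common convention that rows are true labels and columns are predicted labels; the paper does not state its convention, and the function names are hypothetical.

```python
def normalize_rows(cm):
    """Divide each row by its sum so entries become per-class rates."""
    normalized = []
    for row in cm:
        total = sum(row)
        normalized.append([v / total if total else 0.0 for v in row])
    return normalized


def per_class_metrics(cm):
    """Return precision, recall, and F1 for every class of a square matrix."""
    n = len(cm)
    metrics = []
    for c in range(n):
        tp = cm[c][c]
        predicted = sum(cm[r][c] for r in range(n))  # column sum
        actual = sum(cm[c])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return metrics
```

With row normalization, the diagonal of the matrix equals the per-class recall, which makes weak classes easy to spot in a plot such as Fig. 7.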
Table V shows that our approach achieves comparatively low accuracy on the f, augmented-a, and augmented-c testing datasets. To find the reason for the low accuracy, we inspected the misclassified digits from these datasets. Fig. 8 shows some misclassified digit images.
Fig. 8. Misclassified digits (true labels of the four examples: 7, 6, 1, 6).
The misclassified images are so heavily augmented that they are difficult even for a human to recognize correctly. The results make clear that our approach cannot handle all kinds of augmented images, because we did not preprocess the rotated, shifted, zoomed, superimposed, and occluded images. Table VI compares the testing accuracy with and without the augmented images. TABLE VI.
ACCURACY COMPARISON BETWEEN AUGMENTED AND NON-AUGMENTED DATASET
We also apply our proposed model to a dataset used by earlier research papers and compare the result with our experimental result on NumtaDB. BanglaLekha-Isolated Numerals is another independent dataset that combines the CMATERDB Numerals and ISI Numerals datasets of Bangla handwritten digits. However, this dataset contains only 19,748 digit images and has no highly augmented images like those in NumtaDB. The NumtaDB paper mentions that dataset ‘e’ of NumtaDB is in fact BanglaLekha-Isolated Numerals. Table VII compares the results on NumtaDB and BanglaLekha-Isolated. TABLE VII.
RESULT COMPARISON BETWEEN NUMTADB AND BANGLALEKHA-ISOLATED NUMERALS
Number of Images
VII. CONCLUSION
In this paper, we have presented a deep CNN based Bangla digit recognition system for a standard and challenging dataset. We achieved 92.72% testing accuracy, which is a good result for the large and unbiased NumtaDB dataset compared with other, biased datasets. We observed that a deep classification model alone cannot increase performance; preprocessing of the images before training is also very important. We use some preprocessing techniques for blurred and noisy images, but these are not enough for high performance. As future work, we think that advanced data augmentation techniques and a more advanced CNN model can overcome the problems of recognizing augmented images. We encourage researchers to continue this research in the direction of this future work.
REFERENCES
We have applied the same preprocessing and deep CNN model to both datasets. We get a better result on BanglaLekha-Isolated Numerals than on NumtaDB because NumtaDB has more complex images that are hard to recognize. That is why NumtaDB is a challenging dataset. Finally, we applied the same procedure to MNIST digit recognition and obtained an accuracy of 99.70%, whereas the state of the art on the MNIST dataset is 99.79%. We also obtained 99.79% accuracy for EMNIST digit recognition and 90.59% accuracy for EMNIST letter recognition, which has 47 balanced classes. Table VIII shows that our proposed CNN model produces better accuracy than an earlier work on the EMNIST digits and balanced datasets that used linear and OPIUM classifiers. TABLE VIII.
RESULT COMPARISON IN EMNIST DATASET
[1] B. B. Chaudhuri and U. Pal, "A complete printed Bangla OCR system," Pattern Recognition, vol. 31.5, pp. 531-549, 1998.
[2] U. Pal and B. B. Chaudhuri, "OCR in Bangla: an Indo-Bangladeshi language," Pattern Recognition 1994, Vol. 2 - Conference B: Computer Vision and Image Processing, Proceedings of the 12th IAPR International Conference on, vol. 2, pp. 269-273, 1994.
[3] S. Alam, T. Reasat, R. M. Doha, and A. I. Humayun, "NumtaDB - Assembled Bengali handwritten digits," arXiv:1806.02452 [cs.CV], 6 June, 2018.
[4] M. Shopon, N. Mohammed, and M. A. Abedin, "Bangla handwritten digit recognition using autoencoder and deep convolutional neural network," 2016 International Workshop on Computational Intelligence (IWCI), Dhaka, 2016, pp. 64-68.
[5] T. Hassan and A. H. Khan, "Handwritten Bangla numeral recognition using Local Binary Pattern," in Electrical Engineering and Information Communication Technology (ICEEICT), 2015 International Conference on, pp. 1-4, IEEE, 2015.
[6] U. Bhattacharya and B. B. Chaudhuri, "Handwritten numeral databases of Indian scripts and multistage recognition of mixed numerals," IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(3), pp. 444-457, IEEE, 2009.
[7] M.I.H.R.I. M.A.H. Akhand and M. Ahmed, "Convolutional neural network training with artificial pattern for Bangla handwritten numeral recognition," ICIEV, vol. 1, no. 1, pp. 16, 2016.
[8] M. Alom, P. Sidike, T. Taha, and V. Asari, "Handwritten Bangla Digit Recognition Using Deep Learning," arXiv:1705.02680v1 [cs.CV], 7 May, 2017.
[9] Y. Lecun and C. Cortes, "The MNIST database of handwritten digits," 1998. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[10] G. Cohen, S. Afshar, J. Tapson, and A. Schaik, "EMNIST: An extension of MNIST to handwritten letters," arXiv:1702.05373 [cs.CV], 2017.
[11] Bengali Ai, "NumtaDB: Bengali Handwritten Digits," 2018. [Online]. Available: https://www.kaggle.com/BengaliAI/numta. [Accessed: 6-08-2018].
[12] R. Gonzalez and R. E. Woods, "Digital Image Processing," Third Edition, pp. 162-163.
[13] N. Otsu, "A Threshold Selection Method from Gray Level Histograms," IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-9, no. 1, January 1979.
[14] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010.
[15] S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating deep network training by reducing internal covariate shift," arXiv:1502.03167 [cs.LG], 2 March, 2015.
[16] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv:1412.6980 [cs.LG], 2015.
[17] "Bengali handwritten digit recognition challenge," 2018. [Online]. Available: https://www.kaggle.com/c/numta. [Accessed: 8-08-2018].
[18] "Evaluation of Bengali handwritten digit recognition challenge," 2018. [Online]. Available: https://www.kaggle.com/c/numta#evaluation. [Accessed: 8-08-2018].
[19] S. Sharif and M. Mahboob, "Evil Method: A deep CNN for Bangla handwritten numeral classification," in 4th International Conference on Advances in Electrical Engineering (ICAEE), 2017.
[20] M. Biswas, R. Islam, G. K. Shom, M. Shopon, N. Mohammed, S. Momen, and A. Abedin, "Banglalekha-isolated: A multi-purpose comprehensive dataset of handwritten Bangla isolated characters," Data in Brief, vol. 12, pp. 103-107, 2017.
[21] L. Wan et al., "Regularization of neural networks using DropConnect," International Conference on Machine Learning, 2013.