

Chinese Characters Recognition from Screen-Rendered images using Inception Deep Learning Architecture

Jun Zhou1, Xin Xu1,2, Hong Zhang1,2, Xiaowei Fu1,2

1 School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan, China, 430065.
2 Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan University of Science and Technology, Wuhan, China, 430065.

Abstract. Text recognition in images can substantially facilitate a wide range of applications. However, current methods face great challenges on screen-rendered images because of their low resolution and low signal-to-noise ratio. In this paper, a method based on a deep convolutional neural network is proposed to recognize Chinese characters from screen-rendered images. Vertical projection and error correction are used to extract Chinese characters, which are then recognized by a novel convolutional neural network built on inception modules. Extensive experiments have been conducted on a number of screen-rendered images to evaluate the performance of the proposed network against state-of-the-art models.

Keywords: Optical Character Recognition; Screen-Rendered Image; Chinese Character Recognition; Inception Module; Convolutional Neural Networks.



1 Introduction

Optical Character Recognition (OCR) is a key technique in computer vision and is widely applied in scenarios such as vehicle license plate recognition [1] and receipt recognition [2]. Previous Chinese OCR models were generally designed for Handwritten Chinese Character Recognition (HCCR) [3-7]. As illustrated in Fig. 1, few efforts have been made towards screen-rendered text recognition. Although there are some OCR methods for screen-rendered English text recognition [8-12], these methods face great challenges in recognizing Chinese characters from screen-rendered images: in addition to the low resolution and low signal-to-noise ratio of such images [13, 14], the Chinese character set is large and individual characters have complicated features.

adfa, p. 1, 2011. © Springer-Verlag Berlin Heidelberg 2011

Fig. 1. Overview of the state-of-the-art OCR schemes.

Since LeCun et al. [15] put forward the Convolutional Neural Network (CNN) structure for deep learning, recent years have witnessed the emergence of numerous CNN models such as AlexNet [16], VGGNet [17], GoogLeNet [18] and ResNet [19]. These models have significantly outperformed traditional models in a wide range of applications, including image classification, object detection and object recognition. For example, Ciregan et al. [3] proposed a Multi-Column Deep Neural Network (MCDNN) for HCCR. MCDNN fed differently preprocessed versions of the input image to different deep neural networks and then averaged their outputs to obtain the final recognition result. The network proposed by the Fujitsu team [4] won the ICDAR offline HCCR competition. This model required a dictionary storage size of 2.46 GB.

Afterward, traditional feature extraction methods were gradually combined with CNNs to improve the accuracy of HCCR. Zhong et al. [5] designed a streamlined GoogLeNet-based network (HCCR-GoogLeNet) with four inception-v1 modules. They combined Gabor and gradient feature maps with the original images as the input of HCCR-GoogLeNet and obtained a recognition accuracy of 96.35% on the offline HCCR dataset of ICDAR 2013. Zhang et al. [6] used character shape normalization and direction-decomposed feature maps to enhance the performance of the CNN. So far, most OCR methods have focused on improving the accuracy of the CNN, but in practical applications the speed and storage size of the model are also crucial. To this end, Xiao et al. [7] presented Global Supervised Low-rank Expansion to accelerate the computation in convolutional layers and introduced an Adaptive Drop-weight method to remove redundant connections between layers.

However, HCCR methods mainly focus on designing a classifier that recognizes a single character. They may face great challenges in recognizing text lines or characters segmented from screen-rendered text images. Generally, screen-rendered text recognition contains two stages: text detection and text recognition. The earliest screen-rendered text recognition method was proposed by Wachenfeld et al. [8]. They used a segmentation correction strategy based on the recognition result to segment screen-rendered characters correctly, and classified each character using the Euclidean distance between the character image and the subclasses in sample space. Rashid et al. [9] constructed a classifier by modeling characters with Hidden Markov Models and achieved a higher accuracy on the database developed by Wachenfeld et al. [10]. The PhotoOCR system [11] trained a seven-layer neural network classifier on the HOG features of the character image and corrected the recognition result with a language model. Jaderberg et al. [12] took the word image as the input. Unlike the previous classification methods, they trained a word classifier directly on synthetic data. Current works mainly address English text recognition; few focus on Chinese character recognition from screen-rendered images.

On the one hand, Chinese characters are more complex than English letters in both the number and the types of strokes; on the other hand, the low resolution and low signal-to-noise ratio of screen-rendered text images may hinder the extraction and recognition processes. To address these problems, this paper proposes an inception deep learning architecture for screen-rendered Chinese character recognition. The vertical projection method is first adopted to acquire single characters from the input image. Then, a word-width fusion method is proposed to correct segmentation errors. Next, the training dataset is synthesized by a data generation engine. Finally, a CNN model with the inception-v2 [20] module is designed based on HCCR-GoogLeNet. Experimental results demonstrate that the proposed method effectively segments Chinese characters from screen-rendered images, significantly reduces the training time, and recognizes the extracted characters more accurately.


2 The Proposed Method

As shown in Fig. 2, the proposed method contains two stages: character extraction and character recognition. In the character extraction stage, initially segmented characters are obtained by a series of image processing operations: binarization, color inversion, dilation, and connected domain detection. However, the initial segmentation results contain some errors; for example, '神' is divided into '礻' and '申', and '脂' is divided into '月' and '旨'. Therefore, a word-width fusion method is presented in this paper to correct these segmentation errors. In the character recognition stage, a data generation engine is designed to generate the dataset, and a novel CNN based on HCCR-GoogLeNet is proposed for screen-rendered Chinese character recognition. After character segmentation and network training, each character image is identified by the proposed network.
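The preprocessing operations named above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the threshold of 128, the square structuring element, the 4-connectivity, and the function names `binarize_and_invert`, `dilate` and `connected_components` are all assumptions made for illustration, since the paper does not state these parameters.

```python
from collections import deque

def binarize_and_invert(gray, threshold=128):
    """Binarize a grayscale image and invert it: dark (text) pixels become 1.
    The threshold value is an assumption, not taken from the paper."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def dilate(mask, radius=1):
    """Binary dilation with a (2*radius+1) x (2*radius+1) square element."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = 1
    return out

def connected_components(mask):
    """Bounding boxes (x0, y0, x1, y1) of 4-connected foreground regions,
    listed in row-major order of their first pixel."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            queue = deque([(y, x)])
            seen[y][x] = True
            x0 = x1 = x
            y0 = y1 = y
            while queue:  # breadth-first flood fill of one region
                cy, cx = queue.popleft()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((x0, y0, x1, y1))
    return boxes

# Three isolated dark pixels; dilation joins the two that are close together.
gray = [[255] * 10 for _ in range(3)]
gray[1][1] = gray[1][3] = gray[1][8] = 0
mask = binarize_and_invert(gray)
boxes_before = connected_components(mask)
boxes_after = connected_components(dilate(mask))
```

In this toy input, dilation bridges the one-pixel gap between the first two strokes, which is the effect that lets nearby strokes be grouped into one text candidate.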

2.1 Character Extraction

As shown in Fig. 2, the character extraction step is divided into two sub-steps: initial segmentation and segmentation error correction. In the first sub-step, binarization and color inversion are performed to preprocess the input screen-rendered image. Subsequently, the initial text candidates in each line are acquired by dilation and connected domain detection. Thereafter, a connected domain confusion operation is applied to guarantee that there is only one text area per line. By obtaining the text line candidates, we divide one large character extraction task into a set of similar small tasks. Finally, single characters are segmented by the vertical projection method. In the second sub-step, the proposed word-width fusion method is adopted to deal with incorrectly segmented characters: all character candidates are traversed, and two consecutive candidates are judged to be parts of one whole character by comparing their total width with the character width threshold Tw.
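The vertical projection step can be sketched as follows. This is a minimal illustration under the assumption that a text line is given as a binary mask (1 for text, 0 for background) and that any column without text pixels separates two candidates; the function name `vertical_projection_segments` is hypothetical.

```python
def vertical_projection_segments(mask):
    """Split a binary text-line mask into column spans (start, end).

    mask[y][x] is 1 for text and 0 for background; `end` is exclusive.
    A span is a maximal run of columns whose projection is non-zero.
    """
    width = len(mask[0])
    # Vertical projection: number of text pixels in each column.
    profile = [sum(row[x] for row in mask) for x in range(width)]
    segments, start = [], None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                      # a run of text columns begins
        elif count == 0 and start is not None:
            segments.append((start, x))    # an empty column closes the run
            start = None
    if start is not None:                  # run touching the right border
        segments.append((start, width))
    return segments

line = [
    [0, 1, 1, 0, 0, 1, 0, 1, 1],
    [0, 1, 0, 0, 0, 1, 0, 0, 1],
]
spans = vertical_projection_segments(line)
```

Because every empty column ends a span, a character with a left-right structure and a blank column between its components is split into two candidates, which is why a correction step is needed afterwards.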

Fig. 2. Overview of the proposed method.

For different input images, the value of Tw is different due to different font size or different shooting methods. In this paper, Tw is calculated adaptively as follows:

$$T_w = \frac{\sum_{w \in \Omega_l} w}{\mathrm{len}(\Omega_l)}, \qquad l = \arg\max_{i} \{\mathrm{len}(\Omega_i)\}$$

where $\mathrm{len}(\cdot)$ is the number of elements in a finite set. $\Omega_i$, $i = 1, 2, \ldots, M$, represents the set of character widths $w$ satisfying $\Omega_i = \{ w \mid w \in ((i-1) \times 10,\ i \times 10] \}$. $M$ represents the total number of subintervals obtained by dividing the interval $[0, J]$ in steps of 10. $J$ is given by the following formula:

$$J = \left\lceil \frac{\mathrm{max\_}w}{10} \right\rceil \times 10 \tag{1}$$

where $\mathrm{max\_}w$ is the maximum width of all character candidates.
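A minimal sketch of the adaptive threshold and of the word-width fusion is given below. Several points are assumptions rather than details from the paper: widths are taken to be positive integers in pixels, each bin is treated as right-closed, ties between equally populated bins go to the first one, and two consecutive candidates are merged when their combined span does not exceed `tolerance * t_w`. The `tolerance` parameter and the function names are hypothetical.

```python
def width_threshold(widths, step=10):
    """T_w: mean width inside the most populated bin ((i-1)*step, i*step]."""
    j = -(-max(widths) // step) * step       # J = ceil(max_w / step) * step
    m = j // step                            # M, the number of bins
    bins = [[] for _ in range(m)]
    for w in widths:
        i = max(1, -(-w // step))            # 1-based bin index, ceil(w / step)
        bins[i - 1].append(w)
    fullest = max(bins, key=len)             # first bin wins on ties
    return sum(fullest) / len(fullest)

def fuse_by_width(segments, t_w, tolerance=1.1):
    """Merge consecutive (start, end) spans whose combined span fits one character.

    The merge rule (combined span <= tolerance * t_w) is an assumption.
    """
    fused, i = [], 0
    while i < len(segments):
        start, end = segments[i]
        if i + 1 < len(segments):
            next_end = segments[i + 1][1]
            if next_end - start <= tolerance * t_w:
                fused.append((start, next_end))   # two parts of one character
                i += 2
                continue
        fused.append((start, end))
        i += 1
    return fused

# Five candidates; the second and third are the two halves of one character.
segments = [(0, 40), (44, 58), (60, 84), (90, 130), (135, 174)]
widths = [end - start for start, end in segments]   # 40, 14, 24, 40, 39
t_w = width_threshold(widths)
fused = fuse_by_width(segments, t_w)
```

Here the fullest bin holds the widths 40, 40 and 39, so the threshold is close to the typical full-character width, and only the narrow pair (44, 58) and (60, 84) is merged.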

2.2 Character Recognition

CNNs have been widely investigated for image classification because of their weight sharing and local connectivity. In [5], GoogLeNet was applied to handwritten Chinese character recognition and achieved good performance. The works in [8, 9] addressed screen-rendered English text recognition. However, few research works focus on Chinese character recognition from screen-rendered images. Therefore, a modified GoogLeNet is presented in this paper for Chinese and English text recognition from screen-rendered images. Fig. 3(a) shows the structure of the proposed network, and the inception-v2 structure is shown in Fig. 3(b).

Fig. 3. The proposed network. (a) structure of the proposed network. (b) inception-v2 module.

Convolutions with larger spatial filters (e.g. 5×5 or 7×7) tend to be more time-consuming [20]. For example, a 5×5 convolution costs 25/9 ≈ 2.78 times as much as a 3×3 convolution. Nevertheless, as shown in Fig. 3, two consecutive 3×3 convolutional layers have the same receptive field as one 5×5 convolutional layer, so a 5×5 convolution can be replaced by two 3×3 convolutions with the same input size and output depth. Motivated by this, the 5×5 convolution in the inception-v1 module of HCCR-GoogLeNet [5] is decomposed into two 3×3 convolutional layers to improve its performance. The output of each convolutional layer passes through an activation layer, which adds non-linearity to the CNN. By stacking 3×3 convolutional layers, the CNN becomes deeper and introduces more non-linearity, which improves its feature expression ability.

The change of the feature maps should be a smooth process [20]: when the number of feature maps increases, their size should be reduced, and the change should not be too abrupt so that useful information is kept. HCCR-GoogLeNet violates this principle. Between the last average pooling layer and the subsequent convolutional layer, the number of feature maps is reduced without any change in size. Therefore, we removed the last convolutional layer (kernel size 1×1, 128 kernels) between the last AVE Pool layer (kernel size 5×5, stride 3) and the first fully connected layer (1024 outputs). In HCCR-GoogLeNet, the feature maps from the last AVE Pool layer have size 2×2×832, which becomes 2×2×128 after the convolutional layer, and the output of the first fully connected layer is 1024. In other words, the number of feature maps is first reduced by a factor of 832/128 = 6.5, and the number of features then increases by a factor of 1024/(2×2×128) = 2. Much useful information is lost during this process. In the proposed network, the size of the feature maps changes more smoothly (3×3×832 to 4096).
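The arithmetic in this subsection can be checked with a short sketch. It counts multiplications per output value only and assumes equal channel counts and stride 1 for every layer, which is a simplification; the helper names are hypothetical, and the layer sizes are the ones quoted in the text.

```python
def conv_mults_per_output(kernel, in_channels=1):
    """Multiplications needed for one output value of a kernel x kernel convolution."""
    return kernel * kernel * in_channels

def stacked_receptive_field(kernels):
    """Receptive field of stride-1 convolutions applied one after another."""
    field = 1
    for k in kernels:
        field += k - 1
    return field

# Cost of one 5x5 convolution relative to one 3x3 convolution: 25 / 9.
ratio_5x5_vs_3x3 = conv_mults_per_output(5) / conv_mults_per_output(3)

# Cost of two stacked 3x3 convolutions relative to one 5x5 convolution: 18 / 25.
ratio_two_3x3_vs_5x5 = 2 * conv_mults_per_output(3) / conv_mults_per_output(5)

# Two stacked 3x3 layers cover the same 5x5 input window.
field_two_3x3 = stacked_receptive_field([3, 3])

# Feature counts around the classifier head, using the sizes quoted in the text.
hccr_after_pool = 2 * 2 * 832        # output of the last AVE Pool layer
hccr_after_1x1 = 2 * 2 * 128         # after the 1x1 convolution (128 kernels)
hccr_fc = 1024                       # first fully connected layer
map_reduction = 832 / 128            # drop in the number of feature maps
feature_expansion = hccr_fc / hccr_after_1x1

proposed_after_pool = 3 * 3 * 832    # proposed network, 1x1 convolution removed
proposed_fc = 4096
proposed_change = proposed_after_pool / proposed_fc
```

Under these assumptions, replacing one 5×5 convolution by two 3×3 convolutions keeps the receptive field while using 18/25 of the multiplications, and the proposed head changes the feature count by a factor below 2 in a single step instead of shrinking and then expanding it.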


3 Experimental Results

In this section, the proposed network is compared with HCCR-AlexNet [5], HCCR-GoogLeNet [5] and the network in Ref. [12]. Experiments are conducted on real screen-rendered text images to demonstrate the practicability of the proposed method. Since there is no public training dataset of printed characters, we first generate three groups of character images as training samples, as described in Section 3.1. Then, the proposed character extraction method is applied to segment a screen-rendered text image into multiple character images. Lastly, all character images are identified by the proposed network. All networks are trained with the open deep learning framework Caffe on a GTX 1080 Ti graphics card, and all testing experiments are conducted on the PyCharm 2017.1 platform (Intel Pentium dual-core CPU G3420, 3.20 GHz).

3.1 Generated Dataset

Because the Chinese character set is large (there are 3755 Chinese characters, each available in different fonts) and Chinese characters have complicated features, it is very difficult to train a large Chinese character classifier without sufficient training samples. Therefore, following successful synthetic character datasets [21, 22], a synthetic character engine is designed to generate the training and validation datasets for the proposed network. The classifier has 3822 target classes: 3755 Chinese characters, 52 English letters, 10 digits and 5 punctuation marks. The whole dataset comprises three types of character images.

First, to ensure that the proposed network can identify each character in different fonts and sizes, we synthesize six clean images with different sizes (40×40, 50×50, 60×60, 70×70, 80×80) for each character in each font (24 light and 4 bold). In total, 3822 × 28 × 6 = 642,096 images are generated for this part of the dataset.

Second, to reduce the sensitivity of the proposed network to noise, we generate a group of simulated images corrupted by random noise (3822 × 28 × 5 × 2 = 1,070,160 images). Five original images are first generated with different sizes (70×70, 73×73, 76×76, 79×79, 82×82), and noise of two levels (randomly chosen from 0.1% to 0.5%) is then added to these images.

Third, in practice, screen-rendered images are binarized to avoid the effect of uneven illumination, and this operation can break character strokes. To cover this case, images are generated by applying erosion and blur operations to clean images of size 70×70 (3822 × 28 × 2 = 214,032 images).

After data cleaning, we divided the dataset into a training dataset of 1,711,277 images and a validation dataset of 123,400 images. All training images are resized to 120×120 with a 5-pixel border.
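The dataset sizes quoted above can be reproduced with a few lines of arithmetic. The sketch only restates the numbers in the text; the last quantity, the number of images dropped by data cleaning, is derived here under the assumption that the training and validation sets together contain every image kept after cleaning.

```python
# Class and font counts as stated in the text.
classes = 3755 + 52 + 10 + 5          # Chinese characters, letters, digits, punctuation
fonts = 24 + 4                        # light and bold fonts

clean_images = classes * fonts * 6          # six clean sizes per character and font
noisy_images = classes * fonts * 5 * 2      # five sizes, two noise levels
degraded_images = classes * fonts * 2       # eroded and blurred 70x70 images

generated_total = clean_images + noisy_images + degraded_images

train_images = 1_711_277
validation_images = 123_400

# Assumption: everything not in the two splits was removed by data cleaning.
removed_by_cleaning = generated_total - (train_images + validation_images)
```

With these figures the three groups sum to 1,926,288 generated images, so about 4.8% of them would have been discarded by cleaning if the assumption holds.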

Fig. 4. Character extraction results. (a) input image; (b) the detail of red box in (a); (c) ~ (f) represent the result of connected domain detection, connected domain confusion, vertical projection method and the proposed word-width fusion, respectively.


3.2 Character Extraction Result

The character extraction result is vital to subsequent character recognition. Therefore, we test the proposed character extraction method on real screen-rendered images captured with a 13.0-megapixel mobile phone camera (MEIZU M5 note). Fig. 4 shows the character extraction results of the proposed method. The input image and its local region in the red box are shown in Fig. 4(a) and Fig. 4(b). Fig. 4(c) and (d) show the results of connected domain detection and connected domain confusion on Fig. 4(b). Fig. 4(e) and (f) give the initial character segmentation result of the vertical projection method and the result after segmentation error correction, respectively. Fig. 4(e) contains some incorrectly segmented characters (marked with red boxes), such as '神', '小', '细', '肥' and '及'. With the proposed word-width fusion method, all characters are extracted correctly, as shown in Fig. 4(f). Thus, the proposed character extraction method can successfully segment screen-rendered text images into individual characters, which facilitates the subsequent character recognition.

3.3 Character Recognition Result

In this subsection, four networks are trained on the training dataset and evaluated on the validation dataset. These datasets are generated as described in Section 3.1.



Fig. 5. The training time and the accuracy curves of each network. (a) The accuracy of different networks; (b) The training times of different networks.

Fig. 6. The recognition results of different networks. (a) HCCR-AlexNet; (b) HCCR-GoogLeNet; (c) Ref. [12]; (d) the proposed network.

Table 1. Average accuracy (%) of different networks.





The validation dataset is used to compare the performance of the proposed network with that of the other networks. As the iterations increase, we record the training time at the end of each epoch, where a training epoch includes 13500 iterations. Fig. 5(a) shows the accuracy curves of the above networks on the validation dataset over the training epochs. The proposed network obtains the highest accuracy on the validation dataset, converges more quickly, and is more stable than the other networks. In the first epoch, the proposed network achieves 98.5% accuracy on the validation dataset, while the best accuracy of the other networks is 95.5%. After convergence, the accuracies of HCCR-AlexNet, HCCR-GoogLeNet and Ref. [12] on the validation dataset are 98.69%, 98.81% and 98.70%, respectively, while the proposed network achieves a better accuracy of 99.37%.

Fig. 5(b) gives the training time curves of these networks. As can be seen from Fig. 5(b), HCCR-GoogLeNet takes the longest training time. The proposed network reduces the training time significantly and has a computational complexity similar to that of HCCR-AlexNet. In other words, compared with HCCR-GoogLeNet, the proposed network improves the feature expression ability and reduces the computational complexity significantly.

After character extraction, the character images are taken as the input of the proposed network to identify the text information. Since test images may contain some punctuation marks that are not part of the training samples, these punctuation marks are considered to be identified correctly in this paper. Fig. 6 shows the recognition results obtained by HCCR-AlexNet, HCCR-GoogLeNet, Ref. [12] and the proposed network. The characters in red boxes are recognition errors. The numbers of red boxes for HCCR-AlexNet, HCCR-GoogLeNet, Ref. [12] and the proposed network are 62, 35, 48 and 21, respectively. Table 1 gives the recognition accuracy on our testing images. As seen from Fig. 6 and Table 1, the proposed network achieves the best performance on screen-rendered text images among the four networks.
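To put the reported figures side by side, the following sketch converts the validation accuracies into error rates and compares the error counts read from Fig. 6. The relative reductions are derived here by simple arithmetic from the numbers in the text; they are not values reported by the authors.

```python
# Validation accuracy after convergence (%), as reported in the text.
val_accuracy = {
    "HCCR-AlexNet": 98.69,
    "HCCR-GoogLeNet": 98.81,
    "Ref. [12]": 98.70,
    "Proposed": 99.37,
}

# Number of misrecognized characters (red boxes in Fig. 6), as reported.
test_errors = {
    "HCCR-AlexNet": 62,
    "HCCR-GoogLeNet": 35,
    "Ref. [12]": 48,
    "Proposed": 21,
}

# Error rate on the validation set, in percent.
val_error = {name: round(100 - acc, 2) for name, acc in val_accuracy.items()}

def relative_reduction(baseline, improved):
    """Fraction by which `improved` is smaller than `baseline`."""
    return (baseline - improved) / baseline

baseline = "HCCR-GoogLeNet"
val_error_reduction = relative_reduction(val_error[baseline], val_error["Proposed"])
test_error_reduction = relative_reduction(test_errors[baseline], test_errors["Proposed"])

best_on_validation = min(val_error, key=val_error.get)
best_on_test = min(test_errors, key=test_errors.get)
```

Relative to HCCR-GoogLeNet, the validation error falls from 1.19% to 0.63% (roughly 47% lower) and the number of misrecognized test characters falls from 35 to 21 (40% lower).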



4 Conclusion

In this paper, we present a novel Chinese character recognition method for screen-rendered text recognition. The vertical projection method is adopted to acquire single characters from the input image. Then, a word-width fusion method is proposed to correct segmentation errors. Finally, a network with the inception-v2 module is designed based on HCCR-GoogLeNet for single character recognition and is trained on a dataset synthesized by a data generation engine. The experimental results demonstrate that the proposed network reduces the computational complexity significantly and achieves better generalization ability than state-of-the-art models.

Acknowledgments. This work was supported by the National Natural Science Foundation of China (61602349, 61440016) and the Hubei Chengguang Talented Youth Development Foundation (2015B22).

References

1. S. Z. Masood, G. Shu, A. Dehghan, and E. G. Ortiz. License Plate Detection and Recognition Using Deeply Learned Convolutional Neural Networks. arXiv preprint arXiv:1703.07330, 2017.
2. F. Yang, L. Jin, W. Yang, Z. Feng, and S. Zhang. Handwritten/Printed Receipt Classification Using Attention-Based Convolutional Neural Network. In 15th International Conference on Frontiers in Handwriting Recognition, pages 384-389, 2016.
3. D. Ciregan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3642-3649, 2012.
4. F. Yin, Q. F. Wang, X. Y. Zhang, and C. L. Liu. ICDAR 2013 Chinese handwriting recognition competition. In 12th International Conference on Document Analysis and Recognition, pages 1464-1470, 2013.
5. Z. Zhong, L. Jin, and Z. Xie. High performance offline handwritten Chinese character recognition using GoogLeNet and directional feature maps. In 13th International Conference on Document Analysis and Recognition, pages 846-850, 2015.
6. X.-Y. Zhang, Y. Bengio, and C.-L. Liu. Online and offline handwritten Chinese character recognition: A comprehensive study and new benchmark. Pattern Recognition, 2016.
7. X. Xiao, L. Jin, Y. Yang, W. Yang, J. Sun, and T. Chang. Building Fast and Compact Convolutional Neural Networks for Offline Handwritten Chinese Character Recognition. arXiv preprint arXiv:1702.07975, 2017.
8. S. Wachenfeld, H. U. Klein, and X. Jiang. Recognition of screen-rendered text. In 18th International Conference on Pattern Recognition, pages 1086-1089, 2006.
9. S. F. Rashid, F. Shafait, and T. M. Breuel. An evaluation of HMM-based techniques for the recognition of screen rendered text. In 11th International Conference on Document Analysis and Recognition, pages 1260-1264, 2011.
10. S. Wachenfeld, H. U. Klein, and X. Jiang. Annotated databases for the recognition of screen-rendered text. In 9th International Conference on Document Analysis and Recognition, pages 272-276, 2007.
11. A. Bissacco, M. Cummins, Y. Netzer, and H. Neven. PhotoOCR: Reading text in uncontrolled conditions. In Proceedings of the IEEE International Conference on Computer Vision, pages 785-792, 2013.
12. M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Reading text in the wild with convolutional neural networks. International Journal of Computer Vision, 116(1):1-20, 2016.
13. X. Xu, N. Mu, X. Zhang, and B. Li. Covariance descriptor based convolution neural network for saliency computation in low contrast images. In International Joint Conference on Neural Networks, pages 616-623, 2016.
14. X. Xu, N. Mu, H. Zhang, and X. Fu. Salient object detection from distinctive features in low contrast images. In IEEE International Conference on Image Processing, pages 3126-3130, 2015.
15. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
16. A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106-1114, 2012.
17. K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2015.
18. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.
19. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015.
20. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
21. M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zisserman. Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint arXiv:1406.2227, 2014.
22. X. Ren, K. Chen, and J. Sun. A CNN Based Scene Chinese Text Recognition Algorithm with Synthetic Data Engine. arXiv preprint arXiv:1604.01891, 2016.