Character Recognition in Natural Scenes using Convolutional Co-occurrence HOG

Bolan Su*, Shijian Lu*, Shangxuan Tian†, Joo Hwee Lim* and Chew Lim Tan†
*Institute for Infocomm Research, Agency for Science, Technology and Research, 1 Fusionopolis Way, Singapore 138632. Email: {subl,slu,joohwee}@i2r.a-star.edu.sg
†School of Computing, National University of Singapore, 13 Computing Drive, Singapore 117417. Email: {tiansx, tancl}@comp.nus.edu.sg

Abstract—Recognition of characters in natural images is a challenging task due to complex backgrounds, variations of text size, perspective distortion, etc. Traditional optical character recognition (OCR) engines cannot perform well on such unconstrained text images. A novel technique is proposed in this paper that makes use of the convolutional co-occurrence histogram of oriented gradient (ConvCoHOG), which is more robust and discriminative than both the histogram of oriented gradient (HOG) and the co-occurrence histogram of oriented gradients (CoHOG). In the proposed technique, a more informative feature is constructed by exhaustively extracting features from every possible image patch within character images. Experiments on two public datasets, including the ICDAR 2003 Robust Reading character dataset and the Street View Text (SVT) dataset, show that our proposed character recognition technique obtains superior performance compared with state-of-the-art techniques.

Keywords-Scene Text Recognition; Histogram of Oriented Gradient; Co-occurrence HOG; Feature Extraction


I. INTRODUCTION

Optical Character Recognition (OCR) has been developed for decades to automatically transform the character information embedded in images into textual format. This technique has achieved great success and has been used in various commercial systems. On the other hand, OCR systems usually require that the scanned document texts are well formatted and of good image quality. As a result, they often fail to work properly for texts in scenes, where characters have few constraints in terms of text fonts, environmental lighting, image background, etc. As illustrated in Figure 1 and Table I, the recognition results produced by a state-of-the-art commercial OCR engine (ABBYY FineReader 10, http://www.abbyy.com/) are unsatisfactory when the complexity of texts in scenes increases. The OCR engine fails to produce satisfactory results due to illumination variation, various text fonts, perspective distortion, and so on. In this paper, we describe a novel method to recognize characters in natural scenes by utilizing the co-occurrence histogram of oriented gradient (CoHOG). The input images of our technique are assumed to be cropped and to contain only one character.


Figure 1. Four text image examples in the left column and their text segmentation ground truth in the right column, taken from the benchmarking word image dataset [1]. From top to bottom, the text images become more difficult to recognize. Their corresponding OCR results are shown in Table I.

Table I. Recognition results using ABBYY FineReader 10.0; '-' denotes no result produced.

Image   OCR result
(a)     Open
(b)     Open
(c)     r
(d)     fish
(e)     -
(f)     Draoon
(g)     -
(h)     -

We do not discuss the procedure of detecting characters in images but only focus on recognizing them. In our previous work [2], the CoHOG features are extracted for each character image, and a linear support vector machine (SVM) is then used to train a classification model based on the CoHOG features.

Different from our previous CoHOG-based technique in [2], we construct a novel feature by exhaustively extracting CoHOG for every possible patch of the character image and further applying average pooling [3] for classification. The main contribution of our work is a novel ConvCoHOG feature for character recognition, which can be used in various applications for automatic understanding of scene text. Compared with the widely used HOG features [4], [5] and our previous CoHOG feature [2], the proposed ConvCoHOG feature is more robust and discriminative, since it better captures the co-occurrence dual-edge characteristics of text strokes by exhaustively exploring every image patch within one character image. In addition, experiments conducted on several public datasets using the same configuration show that our proposed technique consistently outperforms the CoHOG method [2], which demonstrates that our method extracts more informative features and consequently improves the character recognition results. At the same time, our proposed method performs only slightly slower than the CoHOG method [2].

II. RELATED WORK

A. Scene Text Segmentation

Scene text segmentation is also referred to as image binarization, which converts a gray-level/color image into a binary version in which the text pixels and background pixels are separated via thresholding. Many image thresholding methods have been reported in the literature [6]. However, the traditional thresholding techniques were designed for scanned documents and cannot be directly applied to the segmentation of texts in scenes, which usually exhibit large variations in colour and texture. To improve the text segmentation accuracy in natural scenes, Kita et al. [7] propose a K-means based algorithm to group image pixels into different segments, and further use an SVM classifier to select the most probable segments as characters. In addition, an MRF-based model [8] has been used to treat segmentation as a labelling problem. Recently, inverse rendering [9], which decomposes an input image into basic rendering elements, has been applied to extract character regions. However, the text segmentation process may itself introduce segmentation errors, which further cause recognition failures. In many cases, the segmented texts still cannot be fed to OCR engines due to perspective distortion, as illustrated in Figure 1 (h), where the OCR engine cannot produce a recognition result even given the perfect segmentation output, as shown in Table I. Further rectification [10] is required to produce satisfactory recognition results.

B. Scene Text Recognition

Besides the text segmentation and OCR approach, state-of-the-art approaches tend to train their own classifiers to recognize characters by extracting different features from the text images. Different feature descriptors, including

Shape Contexts, the Scale Invariant Feature Transform (SIFT) and Geometric Blur, in combination with a bag-of-words model, have been evaluated on public datasets but did not produce satisfactory performance [11]. On the other hand, the HOG feature [12] is widely used in different methods [4], [5], [13] and performs well due to its robustness to illumination variation and its invariance to local geometric and photometric transformations. Recently, a part-based tree structure [14] has also been proposed along with the HOG descriptor to capture the structural information of different characters. In addition, the co-occurrence HOG feature was developed in our previous work [2] to represent the spatial relationship of neighboring pixels. Other approaches also consider combining the HOG feature with features extracted from segmentation output, such as Maximally Stable Extremal Regions [15] and the Weighted Direction Code Histogram [16]. With large amounts of training data, these techniques achieve high recognition accuracy with different machine learning algorithms, such as SVMs and multi-layer neural networks [17].

III. PROPOSED METHOD

A. Co-occurrence Histogram of Oriented Gradient

The HOG feature [12] is widely used in object detection due to its favourable characteristics, such as invariance to local geometric and photometric transformations. To extract the HOG feature, an image is first divided into smaller patches and the feature extraction procedure is applied to every patch separately. The gradient orientation of each pixel within a patch is then quantized into histogram bins, where each histogram bin represents an angle range. After that, the histograms of all patches are normalized and concatenated together to form a feature vector. Compared with other feature descriptors such as SIFT, the HOG feature is computed over the whole image and does not require the locations of suitable key points. Since character images in scenes are usually of low contrast, it is difficult to find enough key points for feature construction; HOG-like features are therefore often better choices for the scene character recognition task.

The CoHOG feature is an extension of the original HOG feature that captures the spatial information of neighboring pixels. Instead of counting the occurrence of the gradient orientation of a single pixel, the gradient orientations of two neighboring pixels are considered. For each pixel in an image block, the gradient orientations of the pixel pair formed by the pixel and its neighbour are examined. A co-occurrence matrix is constructed for each neighboring offset. Given a neighbor window size W, there are W² − 1 co-occurrence matrices. The dimension of each co-occurrence matrix is N × N, where N denotes the number of histogram bins of the gradient orientation. Each entry (i, j) of the matrix corresponds to the pixel pairs with gradient orientations (i, j).

Considering the soft assignment and weighted voting as in [2], the co-occurrence matrix is built as follows:

$$H_{x,y}(i, j) = \sum_{(p,q) \in I} T \cdot \big( w_1 \, G(p, q) + w_2 \, G(p+x, q+y) \big) \qquad (1)$$

$$T = \begin{cases} 1 & \text{if } O(p, q) = i \text{ and } O(p+x, q+y) = j \\ 0 & \text{otherwise} \end{cases}$$

where (p, q) denotes an image pixel in the image block I, (x, y) denotes the offset from the pixel (p, q) to its neighboring pixel, O(·) denotes the gradient orientation assignment of an image pixel, G(·) denotes the gradient magnitude of an image pixel, and w_1, w_2 are the weight parameters for the soft assignment, determined by Equation 2:

$$w_1 = 1 - \frac{\theta(p, q) - \theta_i}{\theta_{i'} - \theta_i} \qquad (2a)$$

$$w_2 = 1 - \frac{\theta(p+x, q+y) - \theta_j}{\theta_{j'} - \theta_j} \qquad (2b)$$

where θ denotes the gradient angle, θ_i and θ_j denote the corresponding angles of gradient orientations i and j, and (θ_i, θ_{i'}) and (θ_j, θ_{j'}) represent the angle intervals into which θ(p, q) and θ(p+x, q+y) fall, respectively. Compared to direct assignment, this strategy fills the co-occurrence matrix in a weighted manner, based on the gradient magnitudes and their distances to the gradient orientation bins, and usually performs better. Finally, the co-occurrence matrices are normalized, vectorized and concatenated to form a feature vector. The dimension of the CoHOG feature vector is (W² − 1) · H², where W denotes the neighboring window size and H denotes the number of gradient orientation bins. In text images, the contour of a character is one of the most significant features that helps humans identify the text, so text strokes usually have strong gradients at both of their side edges simultaneously, forming a clear shape. This characteristic has been used in the stroke width transform [18] for text detection in natural scenes. The CoHOG feature also benefits from this characteristic, since the strong gradients of the two side edges of a text stroke contribute significantly to the co-occurrence matrix, and different characters should have different co-occurrence distributions of the gradient orientations of text stroke edges.
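To make the construction concrete, the following Python/NumPy sketch shows one possible implementation of the CoHOG descriptor for a single image block, following Equations (1) and (2). It is a minimal sketch under stated assumptions: unsigned gradients, a linear soft-assignment weight toward the lower bin edge, non-negative offsets as the W² − 1 offset set, and per-matrix L2 normalization; the default parameter values are illustrative, not the exact configuration reported in the paper.

```python
import numpy as np

def cohog_block(block, n_bins=8, window=4):
    """Compute a CoHOG descriptor for one grayscale block (2-D float array).

    For every offset (dx, dy) in a `window` x `window` neighbourhood
    (excluding the zero offset), a co-occurrence matrix over gradient
    orientation pairs is accumulated: each pixel pair votes
    w1*G(p,q) + w2*G(p+x,q+y) into the entry indexed by its two
    quantized orientations, as in Eqs. (1)-(2).
    """
    gy, gx = np.gradient(block.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0      # unsigned angle in [0, 180)

    bin_width = 180.0 / n_bins
    bins = np.floor(ang / bin_width).astype(int) % n_bins
    # Soft-assignment weight: 1 at the lower edge of the bin, decreasing linearly.
    w_soft = 1.0 - (ang - bins * bin_width) / bin_width

    h, w = block.shape
    offsets = [(dx, dy) for dy in range(window) for dx in range(window)
               if (dx, dy) != (0, 0)]                  # W^2 - 1 offsets (one possible choice)
    feats = []
    for dx, dy in offsets:
        H = np.zeros((n_bins, n_bins))
        for q in range(h - dy):
            for p in range(w - dx):
                i, j = bins[q, p], bins[q + dy, p + dx]
                vote = (w_soft[q, p] * mag[q, p]
                        + w_soft[q + dy, p + dx] * mag[q + dy, p + dx])
                H[i, j] += vote
        H /= np.linalg.norm(H) + 1e-8                  # normalize each co-occurrence matrix
        feats.append(H.ravel())
    return np.concatenate(feats)                       # length (W^2 - 1) * n_bins^2
```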

Figure 2. Illustration of ConvCoHOG feature extraction: from left to right, CoHOG features are extracted convolutionally from every image block of the input image exhaustively; after the CoHOG features are vectorized, average pooling is performed.

B. Construction of the ConvCoHOG Feature

In our previous work [2], the input image is first divided into non-overlapping blocks, and the CoHOG feature extraction is then applied to each block separately. This procedure is fast, but some information may not be captured, e.g., the co-occurrence structure of a text stroke could be split across different blocks. To overcome this limitation, we propose a convolutional strategy in this paper that exhaustively extracts the CoHOG feature for every possible block of an image. The construction of the ConvCoHOG feature is illustrated in Figure 2. Given an image of size N × N, the CoHOG feature is extracted in a B × B block, and an M × M feature matrix can then be constructed as follows:

$$\mathbf{F}(i, j) = F\big(I(N - (i-1) \cdot B,\ N - (j-1) \cdot B,\ B)\big) \qquad (3)$$

where I(x, y, B) denotes an image block of size B × B with starting coordinate (x, y) in the original N × N image, F denotes the $\mathbb{R}^{B \times B} \rightarrow \mathbb{R}^{P}$ feature extraction function described in the last subsection, P denotes the dimension of the CoHOG feature vector, and $\mathbf{F}(i, j)$ denotes the extracted feature vector of dimension P at entry (i, j) of the M × M feature matrix. The size M is determined by the image size N and the block size B:

$$M = N - B + 1$$

After the construction of the feature matrix, the average pooling strategy is applied to incorporate information from neighboring blocks by averaging the feature vectors within a neighboring window. The averaged feature matrix is K × K, where each entry is a feature vector of the same dimension as the CoHOG feature vector, and K is determined by M and the averaging window size T as follows:

$$K = \frac{M}{T}$$

Finally, the ConvCoHOG feature vector is generated by vectorizing and concatenating the averaged feature matrix. The dimension of the ConvCoHOG feature is K × K × P.
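A rough sketch of this convolutional extraction and pooling is given below, reusing the cohog_block function from the previous sketch. The stride-1 sliding follows directly from M = N − B + 1; the non-overlapping pooling windows, the floor division for K, and the parameter names are assumptions made for illustration.

```python
import numpy as np

def conv_cohog(image, block_size=8, pool_size=5, n_bins=8, window=4):
    """ConvCoHOG sketch: CoHOG is extracted from every B x B block of the
    N x N image with stride 1 (M = N - B + 1 positions per axis), and the
    resulting M x M grid of feature vectors is average-pooled over
    non-overlapping T x T windows into a K x K grid (K = M // T here)."""
    n = image.shape[0]                                # assumes a square image
    m = n - block_size + 1
    grid = None
    for i in range(m):
        for j in range(m):
            f = cohog_block(image[i:i + block_size, j:j + block_size],
                            n_bins=n_bins, window=window)
            if grid is None:                          # allocate once P is known
                grid = np.zeros((m, m, f.size))
            grid[i, j] = f

    k = m // pool_size                                # K = M / T
    pooled = np.zeros((k, k, grid.shape[2]))
    for i in range(k):
        for j in range(k):
            win = grid[i * pool_size:(i + 1) * pool_size,
                       j * pool_size:(j + 1) * pool_size]
            pooled[i, j] = win.mean(axis=(0, 1))      # average pooling over the T x T window
    return pooled.ravel()                             # dimension K * K * P

# Hypothetical usage on a 32 x 32 character image (the size used in the experiments):
# feature = conv_cohog(character_image, block_size=8, pool_size=5)
```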

C. Character Recognition

Once the ConvCoHOG features of the character images are extracted, classifiers can be constructed from the training data. A linear SVM classifier with L2 regularization is chosen to construct the character recognizer, as it is much faster when handling a large amount of high-dimensional data.
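As a hedged illustration of this classification step, the sketch below trains an L2-regularized linear SVM on pre-extracted ConvCoHOG vectors using scikit-learn's LinearSVC, which wraps LIBLINEAR; the paper itself uses LIBLINEAR directly. The variable names, the default C value, and the top-k helper are illustrative assumptions, not the authors' exact setup.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_character_classifier(features, labels, C=1.0):
    """Train a 62-class (0-9, a-z, A-Z) linear SVM on ConvCoHOG vectors.

    `features` is an (n_samples, feature_dim) array and `labels` an array
    of class indices; LinearSVC wraps LIBLINEAR and uses L2 regularization
    with a one-vs-rest scheme by default.
    """
    clf = LinearSVC(C=C)
    clf.fit(features, labels)
    return clf

def top_k_predictions(clf, features, k=5):
    """Return the indices of the k highest-scoring classes per sample,
    the kind of top-1 to top-5 output reported in the experiments."""
    scores = clf.decision_function(features)          # (n_samples, n_classes)
    return np.argsort(scores, axis=1)[:, ::-1][:, :k]
```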


Figure 3. Character recognition results with different candidate numbers on different datasets. We compare our proposed method with the CoHOG+SVM method [2] and the PBS method [14]. It is worth noting that the accuracy of the PBS method is reported on 49 classes, while the other two methods recognize 62 classes.

The linear SVM is implemented with LIBLINEAR (http://www.csie.ntu.edu.tw/~cjlin/liblinear/). We also test other classifiers such as the Sparse Representation Classifier and non-linear SVMs (LIBSVM, http://www.csie.ntu.edu.tw/~cjlin/libsvm/), which achieve similar accuracy while being much more time-consuming. In the next section, we describe the details of the character recognition results on public datasets.

IV. EXPERIMENTS AND DISCUSSION

A. Experiment Datasets

Our proposed character recognition approach is evaluated on several public datasets, including the Char74k dataset [11], the ICDAR 2003 dataset [19] and the Street View Text (SVT) dataset [13]. We only focus on the recognition of English characters, which consist of 62 classes (0-9, A-Z, and a-z). The ICDAR 2003 robust reading competition provides data collected from various natural scenes for three different tasks: text detection, word recognition and character recognition. We only use the character recognition dataset in this paper, which consists of 6113 characters for training and 5379 characters for testing. Some special characters such as '&' are excluded in our experiments. The original SVT dataset only contains ground truth information for word-level images; Mishra et al. manually annotated the characters of the testing part of the SVT dataset [4] to form the SVTChar dataset for character recognition. In total, this dataset consists of 3796 samples. The sample images in these public datasets are taken from different sources, such as posters, book covers, road signs, business signboards, etc. The sizes and fonts of the characters change significantly, and many images suffer from low contrast and low resolution. In addition, most of the characters in the SVT dataset are directly extracted from Google Street View images; they are usually more challenging due to bad illumination, perspective distortion and special art fonts. Figure 4 shows a few example images taken from the ICDAR 2003 dataset.

B. Character Recognition Accuracy

We compare our proposed method with several state-of-the-art techniques on the ICDAR 2003 and SVTChar datasets, including our previous method [2] (CoHOG+SVM), Kai Wang et al.'s methods [5], [13] (HOG+NN, Native Ferns), the MSER based method [15] (MSER), the Part Based Tree Structure [14] (PBS), HOG [4] (HOG+SVM), the Nonlinear Colour Transform [20], and a convolutional neural network [17] (CNN). The Char74k dataset is mainly used as a training dataset. To make the evaluation fair, we adopt the same configuration for our proposed method and CoHOG+SVM [2]. All images are resized to 32 × 32 beforehand, the image patch size W is set to 8, the number of gradient orientation bins is set to 8, and we assume the gradient angle is unsigned, with an angle range of [0, 180]. Finally, the averaging window size T is set to 5. The CoHOG feature extraction of each image block is independent of the others and can be run in parallel, so the processing time of our proposed method is only slightly slower than that of the CoHOG+SVM technique [2]. We first use the Char74k dataset as training data, and test the performance of the trained models on the ICDAR 2003 training and testing datasets, as well as the SVTChar dataset. We compare the accuracy of the top-1 to top-5 classification results of our proposed method and CoHOG+SVM [2] under the same configuration with the same training data. Experimental results are shown in Figure 3: our proposed method obtains 1% to 2% higher accuracy than CoHOG+SVM [2] across the three datasets, demonstrating the robustness and superior performance of our proposed method. We also compare our method with the recent PBS method [14] on the ICDAR training and testing datasets, as shown in Figure 3 (a) and (b). The accuracies of the PBS method on these two datasets are provided in [14], which performs 49-class classification by grouping characters with similar structures together, such as 'O' and 'o'. Its task is therefore much easier than that of our proposed method, which classifies the character images into 62 classes.

Table II. Character recognition accuracy on the ICDAR 2003 testing dataset and the SVTChar dataset.

Methods                    ICDAR 2003    SVTChar
ABBYY FineReader 10.0      0.27          0.15
HOG+NN [5]                 0.52          -
Native Ferns [13]          0.64          -
MSER [15]                  0.67          -
PBS [14]                   0.78*         -
HOG + SVM [4]              -             0.62
CoHOG + SVM [2]            0.79          0.73
Human + OCR [20]           0.92          -
Proposed Method            0.81          0.75

* The accuracy is achieved on 49 classes.

However, the performance of our proposed method is still comparable to that of the PBS method [14]. Note that the accuracies of our proposed method increase to 82.1% and 79.9% on the ICDAR 2003 training and testing datasets, respectively, when considering only the same 49 classes as the PBS method [14]. Table II shows the experimental results of several different character recognition techniques on the ICDAR 2003 testing dataset and the SVTChar dataset. We combine the ICDAR 2003 training dataset and the Char74k dataset as training data for our proposed method and the CoHOG+SVM method [2]. Our proposed method achieves 81% accuracy and 75% accuracy on the ICDAR 2003 testing dataset and the SVTChar dataset, respectively, which outperforms most of the current state-of-the-art methods. ABBYY FineReader 10.0 cannot produce reasonable results on most of the test images, which indicates that commercial OCR engines designed for scanned document recognition are not suitable for scene text recognition. Only HOG+SVM [4] and CoHOG+SVM [2] provide recognition accuracies on the SVTChar dataset; our proposed method produces better results than both methods with the same classification tool, which is mainly due to the use of the ConvCoHOG feature. The recently proposed CNN method [17] achieves 84% recognition accuracy on the ICDAR 2003 testing dataset with a similar convolutional strategy. However, this method takes advantage of a huge amount of training data (about 50 thousand samples) obtained by synthetic augmentation of several public datasets, including the ICDAR 2003 training dataset and the Char74k dataset.

Figure 4. Some character image examples taken from the ICDAR 2003 dataset: (a) successfully recognized characters; (b) failure cases.

In addition, it requires more computational power, since it constructs a two-layer neural network. Furthermore, its testing dataset consists of only 5198 instead of 5379 characters, because it re-crops the characters in the original images and then removes those 'out-of-box' characters whose bounding boxes exceed the border of the original images. This procedure corrects some of the segmentation errors of the original dataset and excludes some extremely difficult test cases, which also makes the recognition easier. By ignoring the same amount of testing data based on a similar strategy, the performance of our proposed method can be increased to 84% with a much smaller training dataset and much lower computational complexity.

C. Discussion

Figure 4 shows a few character examples that our proposed method successfully recognizes in (a), as well as a number of failure cases in (b). Most of the failure cases are difficult to recognize even for humans due to low contrast, context ambiguity, broken text strokes, etc. One common ambiguity in character recognition is the text case. A few characters share similar structures in both upper and lower cases, such as ('O','o'), ('S','s') and ('C','c'), which makes them very difficult to recognize in scene text. It can be observed clearly that large false responses in the confusion matrix are due to case ambiguity, as illustrated in Figure 5. By ignoring the text case, the recognition accuracy of our proposed method can be further improved to 85.3% and 81.1% on the ICDAR 2003 testing dataset and the SVTChar dataset, respectively. In addition, a certain number of recognition failures can be explained by annotation errors in both the ICDAR 2003 testing and SVT datasets. A benchmark recognition accuracy of 92% has recently been reported [20] on the ICDAR 2003 testing dataset, achieved by human segmentation and OCR. There is still a large performance gap between machine vision and humans. However, if we look at the top-5 classification results shown in Figure 3, the proposed method achieves higher than 90% recognition accuracy, which is quite close to the human recognition rate reported in [20].


Figure 5. Confusion matrix of character recognition on the SVTChar and ICDAR 2003 datasets. The 62 classes along each axis correspond to '0-9', 'a-z' and 'A-Z', respectively; the horizontal axis gives the testing character label and the vertical axis the predicted character label.

It is difficult to improve the recognition accuracy at the character level without incorporating some high-level information, even for human experts, as a lot of information is lost when breaking a word into characters. For example, it is hard to distinguish '1' and 'l' in certain fonts without contextual information. We believe that a top-down word model would be useful for resolving such ambiguities and improving the character recognition rate.

V. CONCLUSION

Character recognition under unconstrained conditions is a difficult task and has drawn a lot of research interest in computer vision. Different methods have been proposed to recognize characters in natural scene images due to the poor performance of traditional OCR engines. In this paper, we proposed a novel technique that makes use of ConvCoHOG to improve the character recognition performance in natural scenes. Compared with our previous work [2], the proposed ConvCoHOG feature has more discriminative power, since it exhaustively examines every possible image patch within an image without significantly increasing the computational complexity. A linear support vector machine (SVM) is applied to train a 62-class model for character recognition. Experimental results on different datasets show the robustness of our proposed technique, which outperforms other state-of-the-art techniques.

REFERENCES

[1] D. Kumar, M. N. A. Prasad, and A. G. Ramakrishnan, "Benchmarking recognition results on camera captured word image data sets," in Proceedings of the Workshop on Document Analysis and Recognition, ser. DAR '12, 2012, pp. 100-107.
[2] S. Tian, S. Lu, B. Su, and C. L. Tan, "Scene text recognition using co-occurrence of histogram of oriented gradients," in ICDAR, 2013, pp. 912-916.

[3] Y.-L. Boureau, J. Ponce, and Y. LeCun, "A theoretical analysis of feature pooling in visual recognition," in ICML, 2010, pp. 111-118.
[4] A. Mishra, K. Alahari, and C. Jawahar, "Top-down and bottom-up cues for scene text recognition," in CVPR, 2012, pp. 2687-2694.
[5] K. Wang and S. Belongie, "Word spotting in the wild," in ECCV, 2010, pp. 591-604.
[6] P. Stathis, E. Kavallieratou, and N. Papamarkos, "An evaluation survey of binarization algorithms on historical documents," in ICPR, 2008, pp. 1-4.
[7] K. Kita and T. Wakahara, "Binarization of color characters in scene images using k-means clustering and support vector machines," in ICPR, 2010, pp. 3183-3186.
[8] A. Mishra, K. Alahari, and C. V. Jawahar, "An MRF model for binarization of natural scene text," in ICDAR, 2011, pp. 11-16.
[9] Y. Zhou, J. Feild, E. Learned-Miller, and R. Wang, "Scene text segmentation via inverse rendering," in ICDAR, 2013, pp. 457-461.
[10] L. Neumann and J. Matas, "A method for text localization and recognition in real-world images," in Computer Vision - ACCV 2010, ser. Lecture Notes in Computer Science, 2011, vol. 6494, pp. 770-783.
[11] T. E. de Campos, B. R. Babu, and M. Varma, "Character recognition in natural images," in Proceedings of the International Conference on Computer Vision Theory and Applications, 2009.
[12] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, vol. 1, 2005, pp. 886-893.
[13] K. Wang, B. Babenko, and S. Belongie, "End-to-end scene text recognition," in ICCV, 2011, pp. 1457-1464.
[14] C. Shi, C. Wang, B. Xiao, Y. Zhang, S. Gao, and Z. Zhang, "Scene text recognition using part-based tree-structured character detection," in CVPR, 2013, pp. 2961-2968.
[15] L. Neumann and J. Matas, "Text localization in real-world images using efficiently pruned exhaustive search," in ICDAR, 2011, pp. 687-691.
[16] A. Bissacco, M. Cummins, Y. Netzer, and H. Neven, "End-to-end scene text recognition," in ICCV, 2013, to appear.
[17] T. Wang, D. Wu, A. Coates, and A. Ng, "End-to-end text recognition with convolutional neural networks," in ICPR, 2012, pp. 3304-3308.
[18] B. Epshtein, E. Ofek, and Y. Wexler, "Detecting text in natural scenes with stroke width transform," in CVPR, 2010, pp. 2963-2970.
[19] L. P. Sosa, S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and R. Young, "ICDAR 2003 robust reading competitions," in ICDAR, 2003, pp. 682-687.
[20] D. Kumar, M. N. Anil Prasad, and A. G. Ramakrishnan, "NESP: Nonlinear enhancement and selection of plane for optimal segmentation and recognition of scene word images," in Proc. SPIE, vol. 8658, 2013.