Text Localization in Camera Captured Images Using Adaptive Stroke Filter

Shauvik Paul (MCA Department, Techno India, Salt Lake, Kolkata, India), Satadal Saha (ECE Department, MCKV Institute of Engineering, Howrah, India), Subhadip Basu and Mita Nasipuri (CSE Department, Jadavpur University, Kolkata, India)

Abstract Most text localization techniques are sensitive to text color, size, font and background clutter, because they rely on general segmentation rules or on prior knowledge of the text shape and size. Since text inherently consists of strokes of different sizes and orientations, a stroke filter is far more effective where text segmentation is concerned. The problem with the traditional stroke filter is its fixed width: it can only segment strokes of a predefined width. The proposed method uses an adaptive stroke filter that can localize text regions with varying stroke widths within camera captured images. The method is verified by experiments on a database of 600 images.

Keywords Stroke filter · Binarization · Text localization · CCL · Thickness measurement

1 Introduction

Texts within an image contain important and useful information that can help us understand its content. The extraction of text information is important because text carries high-level semantic information. That is why text information


is widely employed in the fields of automatic annotation, indexing and summarization of images, and the automatic extraction and subsequent recognition of text is a growing demand of the current time. Text extraction from scanned document images, as a part of optical character recognition (OCR), has long been an area of research, but recently there has been a growing demand to extract text information from camera captured scene images. It has enormous applications in reading the information on products moving along conveyor belts in the manufacturing industry, automatic ticketing of vehicles at toll plazas or car parking areas, recognizing vehicle license plates as part of integrated traffic management systems, etc. Very recently, the introduction of OCR systems into mobile phones has become a primary requirement for mobile phone manufacturers seeking to enlarge the scope of their applications.

A two-stage system for text detection in video images was proposed by Anthimopoulos et al. [1]: in the first stage, text lines are detected based on the edge map of the image, and in the second stage the result is refined using a sliding window and an SVM classifier trained on features obtained by a new Local Binary Pattern-based operator (eLBP) that describes the local edge distribution. Scene text detection using a graph model built upon maximally stable extremal regions (MSER) has been proposed by Shi et al. [2], and Dimitrova et al. [3] reviewed applications of video-content analysis and retrieval. A detailed survey on text information extraction in images and video has been reported by Jung et al. [4]. Lyu et al. [5] reported a comprehensive method for multilingual video text detection, localization and extraction. Jung et al. [6] proposed a stroke filter based on local region analysis, but its filter parameters are not adaptive to the actual stroke width. Epshtein et al. [7] proposed detecting text in natural scene images with the stroke width transform, an image operator that seeks to find the stroke width at each image pixel, and demonstrated its use for text detection. Pan et al. [8] proposed a method for text localization in natural scene images based on a Conditional Random Field (CRF); experimental results show that it gives promising performance compared to existing methods on the ICDAR 2003 competition dataset. An MSER-based method for text localization and recognition in real-world images has been proposed by Neumann and Matas [9]. Saha et al. [10–16] proposed numerous text extraction methods for localizing license plates in vehicular images.

In general, text consists of a number of strokes drawn over the background, so a stroke filter [6] has been used in the current work to determine the strokes within the image. Here, the width of the stroke filter is made adaptive by calculating the local thickness and orientation of the stroke using run-length computation; to determine whether a particular pixel is part of a stroke, the stroke filter is placed on the pixel and its response is measured. Section 2 discusses the stroke filter and Sect. 3 the database preparation; Sect. 4 presents the text localization method and analyzes the experimental results, and Sect. 5 concludes the paper.


2 Stroke Filter

A stroke is defined as a straight line or arc used as a segment of text, and the text in images comprises several such strokes, as shown in Fig. 1. A region within an image is identified as text if and only if several stroke-like structures exist in that region. A stroke filter is designed based on this definition using local region analysis.

2.1 Design of Stroke Filter

To design the stroke filter, we first define a local image region to be a stroke-like structure if and only if:

1. its intensity differs strongly from that of its lateral regions,
2. the intensities of its lateral regions are similar, and
3. it is nearly homogeneous with respect to its intensities.

For each pixel in the source image, the stroke filter response is computed. As shown in Fig. 2, three shaded regions constitute the stroke filter.

Fig. 1 Sample images of a a dark and b a light stroke

Fig. 2 Stroke filter PIXELS

2

1

2

7

LEFT

RIGHT

d/2

d CENTRE


The central point of the stroke filter denotes the image pixel (x, y) at which the filter response is measured. Let index 1 denote the central region and indices 2 and 3 denote the two lateral regions. The orientation and scale of these local regions are determined by the parameters α and d, where d, the width of the rectangular region 1, is determined from prior knowledge of the text obtained by experiments on text images. The distance between the central region and each lateral region is d/2, because dark or bright lines are often embedded around text to convey its meaning efficiently, and blurred edges appear around text as a result of compression. According to the definition of a stroke-like structure, we define the bright and dark stroke filter responses at the pixel (x, y) as

R_{α,d} = (|μ1 − μ2| + |μ1 − μ3| − |μ2 − μ3|) / σ    (1)

The terms on the right-hand side of Eq. (1) have clear physical meanings corresponding to the constraints in the definition of a stroke-like structure, where μi denotes the estimated mean of the intensities in region i, i = 1, 2, 3. The bright and dark stroke filter responses are proportional to |μ1 − μ2| + |μ1 − μ3|. The parameter σ denotes the standard deviation of the intensities in region 1 and measures the extent to which those intensities are spread out; it therefore reflects the homogeneity of the stroke and is inversely proportional to the response. The greater the probability that the pixel (x, y) belongs to a stroke-like structure, the higher the response.
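To make Eq. (1) concrete, the following is a minimal sketch of the response computation for a horizontal (α = 0°) filter. The exact region geometry, border handling and function names here are assumptions of this sketch, not the paper's implementation.

```python
import numpy as np

def stroke_response(gray, x, y, d, height=7):
    """Sketch of Eq. (1) for a horizontal (alpha = 0 deg) stroke filter
    centred at (x, y). Region 1 has width d; regions 2 and 3 sit d/2
    away on either side. The precise region layout is an assumption."""
    half = max(d // 2, 1)
    h = height // 2
    if (x - 4 * half < 0 or x + 4 * half >= gray.shape[1]
            or y - h < 0 or y + h >= gray.shape[0]):
        return 0.0                                  # too close to the border
    rows = slice(y - h, y + h + 1)
    centre = gray[rows, x - half:x + half + 1]      # region 1
    left = gray[rows, x - 4 * half:x - 2 * half]    # region 2, gap of d/2
    right = gray[rows, x + 2 * half:x + 4 * half]   # region 3, gap of d/2
    m1, m2, m3 = centre.mean(), left.mean(), right.mean()
    sigma = centre.std() + 1e-6                     # homogeneity of region 1
    # High contrast with both sides raises the response, dissimilar
    # lateral regions lower it, and dividing by sigma rewards a
    # homogeneous centre, per the three conditions above.
    return (abs(m1 - m2) + abs(m1 - m3) - abs(m2 - m3)) / sigma
```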

3 Database Development

The dataset for the current work has been developed with synthetic and non-synthetic images. The synthetic images were created using the IrfanView software, while the non-synthetic images were collected in different ways. The dataset is divided into three categories, as shown in Table 1.

Table 1 Types of images

  Image category                                        No. of images
  Synthetic images                                      100
  Camera captured images: book cover pages              100
  Camera captured images: posters                       100
  Camera captured images: newspapers                    100
  Camera captured images: car license plates            100
  ICDAR 2011 images                                     100
  Total                                                 600


4 Present Work

The block diagram of the current text extraction technique is shown in Fig. 3. The input image may be a color or grayscale image; if it is a color image, the preprocessing operations shown in the flowchart are applied first. The input to the algorithm is a color image, and the segmented text is the output.

4.1 Thickness Measurement

A 24-bit color image is taken as input and converted to a grayscale image; a sample image and its grayscale version are shown in Fig. 4. The main problem with the traditional stroke filter is that it works with a fixed width, so it does not always give the best result: in an image, text may appear in different sizes with different stroke widths. To solve this problem, we compute the thickness of the stroke and change the width of the stroke filter accordingly, so that the stroke filter works as an adaptive stroke filter.

Fig. 3 Block diagram of the current method: input 24-bit color image → convert to gray image → binarize the image → calculate the thickness of the stroke at each pixel → design the stroke filter according to thickness → create a binarized image from the stroke responses → run the connected component labeling algorithm → segment the text regions in the original image

Fig. 4 Sample original image and its gray image


Fig. 5 Thickness measurement based on run-length along four directions

To calculate the thickness at a point p(x, y), the run-length of white pixels is computed in four directions: horizontal, vertical and the two diagonals. The thickness is defined as the minimum of the four run-lengths. Mathematically, if the run-lengths of white pixels at a point p(x, y) along the horizontal, vertical, diagonal 1 and diagonal 2 directions are H, V, D1 and D2 respectively, then the thickness at the point p(x, y) is defined as

T(x, y) = min(H, V, D1, D2)    (2)

Figure 5 illustrates the thickness measurement for a point p within an arbitrary object; H, V, D1 and D2 are indicated by green, yellow, red and violet arrows respectively. For the position of the point p(x, y) in the figure, H = 3, V = 7, D1 = 5 and D2 = 5, so, according to Eq. (2), the thickness at p(x, y) is 3. The orientation of the stroke filter is then set according to the direction that yielded the minimum thickness, here α = 0°.
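A minimal sketch of this run-length based thickness measurement is given below, assuming `binary` is a two-dimensional NumPy-style 0/1 array indexed as [row, column]; the function names are illustrative. For the point in Fig. 5 it would return a thickness of 3 with direction 'H'.

```python
def run_length(binary, x, y, dx, dy):
    """Count consecutive white (non-zero) pixels through (x, y)
    along direction (dx, dy), scanning both ways from the point."""
    h, w = binary.shape
    n = 1                                    # the pixel itself
    for s in (1, -1):                        # forward, then backward
        i, j = x + s * dx, y + s * dy
        while 0 <= j < h and 0 <= i < w and binary[j, i]:
            n += 1
            i, j = i + s * dx, j + s * dy
    return n

def stroke_thickness(binary, x, y):
    """Eq. (2): thickness at p(x, y) is the minimum of the white
    run-lengths along the horizontal, vertical and two diagonals.
    Also returns the direction of the minimum, which this sketch
    assumes fixes the orientation of the adaptive filter."""
    dirs = {'H': (1, 0), 'V': (0, 1), 'D1': (1, 1), 'D2': (1, -1)}
    runs = {k: run_length(binary, x, y, dx, dy)
            for k, (dx, dy) in dirs.items()}
    best = min(runs, key=runs.get)
    return runs[best], best
```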

4.2 Stroke Identification

The stroke filter can be placed in four orientations [17]. To determine whether a particular pixel is part of a stroke, the stroke filter computes a response for that pixel, with the filter width set to the measured stroke thickness and the filter height fixed at 7 pixels. The computation time is reduced because the run-length computation already tells us in which direction the stroke filter should be placed to obtain the maximum response. The higher the stroke response, the more likely the pixel is part of a stroke. The stroke responses are then binarized to create a binary image, shown in Fig. 6a; a sketch of this adaptive stage follows.
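Putting the pieces together, the following sketch reuses `stroke_thickness` and `stroke_response` from the earlier sketches. Only the horizontal filter orientation is applied here, and the binarization threshold is an assumed placeholder, not a value from the paper.

```python
import numpy as np

def adaptive_stroke_map(gray, binary, threshold=4.0, min_d=2):
    """Sketch of the adaptive stage: the local thickness sets the
    filter width d and the minimum run-length direction sets the
    orientation. Only the horizontal (alpha = 0) filter from the
    earlier sketch is applied; the threshold is an assumption."""
    response = np.zeros(gray.shape, dtype=float)
    ys, xs = np.nonzero(binary)                 # candidate stroke pixels
    for y, x in zip(ys, xs):
        d, direction = stroke_thickness(binary, x, y)
        if d < min_d:                           # thinner than any real stroke
            continue
        if direction == 'H':                    # alpha = 0 degrees
            response[y, x] = stroke_response(gray, x, y, d)
    return (response > threshold).astype(np.uint8)   # binarized stroke map
```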


Fig. 6 a Stroke output (white over black) after binarization. b Segmented image

4.3 Image Segmentation

Connected component labeling (CCL) is used in computer vision to detect connected regions in binary digital images, although color images and data of higher dimensionality can also be processed [18, 19]. Blob extraction is generally performed on the binary image resulting from a thresholding step. Here, blobs are tracked and filtered by the CCL algorithm, and segmentation is done after CCL. We assume that the size of a text component should be no larger than 100 × 100 pixels and no smaller than 5 × 5 pixels; everything outside these bounds is treated as background. The segmented image is shown in Fig. 6b; a sketch of this step follows.
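As a hedged sketch of this step, SciPy's connected component labeling can be used with an 8-connected structuring element. The size bounds follow the constraint above, while the function name and the use of SciPy are assumptions of this sketch.

```python
import numpy as np
from scipy import ndimage

def segment_text_regions(stroke_map, min_size=5, max_size=100):
    """Label connected components in the binarized stroke-response map
    and keep only blobs whose bounding boxes lie between 5 x 5 and
    100 x 100 pixels, per the size constraint above."""
    labels, n = ndimage.label(stroke_map, structure=np.ones((3, 3)))
    boxes = []
    for sl in ndimage.find_objects(labels):  # one (row, col) slice pair per blob
        h = sl[0].stop - sl[0].start         # blob height
        w = sl[1].stop - sl[1].start         # blob width
        if min_size <= h <= max_size and min_size <= w <= max_size:
            boxes.append(sl)                 # candidate text region
    return boxes
```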

5 Experimental Results and Discussion

As discussed in Sect. 3, various types of camera captured images are taken as input in the current work. The minimum stroke width is taken as 2 pixels, and the area covered by a text component must be no more than 100 × 100 pixels and no less than 5 × 5 pixels. Under these constraints, the result analysis is done on 600 images, and some sample output images are presented here. Figure 7a shows an image whose text regions are perfectly localized using the stroke filter response; Fig. 7b shows the output for another image with a more complex background and a multi-resolution text pattern. Here the stroke filter is made adaptive by calculating the run-length at each pixel, which is not possible with the traditional stroke filter. In general, the performance of the proposed method can be measured by considering the following three cases:

• True positive (tp): a true text region is detected as a text region
• False positive (fp): a true non-text region is detected as a text region
• False negative (fn): a true text region is detected as a non-text region

The performance of the technique is measured in terms of three parameters: recall (rc), precision (pr) and f-measure (fm).
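Assuming the usual definitions of these parameters, they follow directly from the three counts above:

```python
def localization_metrics(tp, fp, fn):
    """Recall, precision and f-measure from the three counts above,
    assuming the standard definitions (not stated in the paper)."""
    rc = tp / (tp + fn)              # fraction of true text regions found
    pr = tp / (tp + fp)              # fraction of detections that are text
    fm = 2 * pr * rc / (pr + rc)     # harmonic mean of precision and recall
    return rc, pr, fm
```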


Fig. 7 Sample output images with true positive and false positive results

6 Conclusion

In the present work, color images are taken as input, the stroke information is extracted from them using an adaptive stroke filter, and the text regions are localized using the CCL algorithm. The main challenges of this work are the complex backgrounds of the images and the multiple resolutions of strokes within the text patterns. The current technique gives reasonable performance, with overall rc, pr and fm values of 83.27 %, 89.57 % and 85.20 % respectively. However, the technique works better for bright text on a dark background than for the reverse. Moreover, the developed method does not include any context analysis, such as character recognition or dictionary matching, so for natural scene images there are many false positives from text-like components that carry a significant stroke signature. Nevertheless, the proposed technique finds the regions of interest within camera captured complex images as far as text localization is concerned.

References

1. Anthimopoulos, M., Gatos, B., Pratikakis, I.: A two-stage scheme for text detection in video images. Image Vis. Comput. 28(9), 1413–1426 (2010)
2. Shi, C., Wang, C., Xiao, B., Zhang, Y., Gao, S.: Scene text detection using graph model built upon maximally stable extremal regions. Pattern Recogn. Lett. 34(2), 107–116 (2013)
3. Dimitrova, N., Zhang, H.-J., Shahraray, B., Sezan, I., Huang, T., Zakhor, A.: Applications of video-content analysis and retrieval. IEEE MultiMedia 9(3), 42–55 (2002)
4. Jung, K., In Kim, K., Jain, A.K.: Text information extraction in images and video: a survey. Pattern Recogn. 37(5), 977–997 (2004)
5. Lyu, M.R., Song, J., Cai, M.: A comprehensive method for multilingual video text detection, localization, and extraction. IEEE Trans. Circuits Syst. Video Technol. 15(2), 243–255 (2005)
6. Jung, C., Liu, Q., Kim, J.: A stroke filter and its application to text localization. Pattern Recogn. Lett. 30(2), 114–122 (2009)


7. Epshtein, B., Ofek, E., Wexler, Y.: Detecting text in natural scenes with stroke width transform. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2963–2970 (2010)
8. Pan, Y.-F., Hou, X., Liu, C.-L.: Text localization in natural scene images based on conditional random field. In: 10th International Conference on Document Analysis and Recognition (ICDAR'09), pp. 6–10 (2009)
9. Neumann, L., Matas, J.: A method for text localization and recognition in real-world images. In: Computer Vision – ACCV 2010, pp. 770–783. Springer, Berlin (2011)
10. Saha, S., Basu, S., Nasipuri, M., Basu, D.K.: An offline technique for localization of license plates for Indian commercial vehicles. In: Proceedings of the National Conference on Computing and Communication Systems (COCOSYS-09), pp. 206–211 (2009)
11. Saha, S., Basu, S., Nasipuri, M.: Automatic localization and recognition of license plate characters for Indian vehicles. Int. J. Comput. Sci. Emerg. Technol. 2(4), 520–533 (2011)
12. Saha, S., Basu, S., Nasipuri, M., Basu, D.K.: Localization of license plates from Indian vehicle images using iterative edge map generation technique. J. Comput. 3(6), 48–57 (2011)
13. Saha, S., Basu, S., Nasipuri, M.: License plate localization using vertical edge map and Hough transform based technique. In: Proceedings of the International Conference on Information Systems Design and Intelligent Applications (INDIA 2012), Visakhapatnam, India, pp. 649–656 (2012)
14. Saha, S., Basu, S., Nasipuri, M., Basu, D.K.: License plate localization from vehicle images: an edge based multi-stage approach. Int. J. Recent Trends Eng. (Comput. Sci.) 1(1), 284–288 (2009)
15. Saha, S., Basu, S., Nasipuri, M., Basu, D.K.: Localization of license plates from surveillance camera images: a color feature based ANN approach. Int. J. Comput. Appl. 1(23), 27–31 (2010)
16. Saha, S., Basu, S., Nasipuri, M.: iLPR: an Indian license plate recognition system. In: Multimedia Tools and Applications, pp. 1–36. Springer, Berlin (2014)
17. Emmanouilidis, C., Batsalas, C., Papamarkos, N.: Development and evaluation of text localization techniques based on structural texture features and neural classifiers. In: 10th International Conference on Document Analysis and Recognition, pp. 1270–1274 (2009)
18. Samet, H., Tamminen, M.: Efficient component labeling of images of arbitrary dimension represented by linear bintrees. IEEE Trans. Pattern Anal. Mach. Intell. 10(4), 579–586 (1988)
19. Dillencourt, M.B., Samet, H., Tamminen, M.: A general approach to connected-component labeling for arbitrary image representations. J. ACM 39(2), 253–280 (1992)