
2012 Eighth International Conference on Signal Image Technology and Internet Based Systems

Cursive Handwritten Segmentation and Recognition for Instructional Videos

Ali Shariq Imran1, Sukalpa Chanda1, Faouzi Alaya Cheikh1, Katrin Franke1, Umapada Pal2
1 Dept. of Computer Science and Media Technology, Gjøvik University College, P.O. Box 191, N-2802 Gjøvik, Norway
2 Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata-700108, India
[email protected], [email protected]

Abstract—In this paper, we address issues pertaining to the segmentation and recognition of cursive handwritten text in chalkboard lecture videos. Recognizing handwritten text in instructor-led lecture video is a challenging problem, and the task gets even tougher with varying handwriting styles and board types. Unlike handwritten text on whiteboards and electronic boards, chalkboard text presents serious challenges, such as a lack of uniform edge density, weak chalk contrast against the blackboard, and leftover chalk-dust noise from erasing, among others. Moreover, the varying color of boards and the illumination changes within a video make it impractical to use trivial thresholding techniques for content extraction. Many universities throughout the world still rely heavily on the chalkboard as a mode of instruction. Recognizing such lecture content will therefore not only aid indexing and retrieval applications but will also help in understanding high-level video semantics, useful for Multimedia Learning Objects (MLO). To address these challenges, we propose a system for the segmentation and recognition of cursive handwritten text in chalkboard lecture videos. We first create a foreground model to segment the background blackboard. We then segment the text into characters using a one-dimensional vertical histogram. Finally, we extract gradient-based features and classify the characters using an SVM classifier. We obtained an encouraging accuracy of 86.28% on 5-fold cross validation.

Figure 1. An overview of the proposed framework.

Keywords: cursive handwriting, text segmentation, text recognition, character classification, instructional videos.

I. INTRODUCTION

The use of instructional videos is becoming popular as a means of acquiring knowledge in educational institutes. These videos usually feature PowerPoint slides, blackboards, chalkboards or smart boards. In instructor-led lecture videos, the use of the traditional chalkboard and blackboard is still quite common. Present tools for modeling the infinite variations of human handwriting extracted from chalkboard instructional videos are not yet sufficient, and their classification accuracy is also not satisfactory. Handwritten text is itself difficult to segment and recognize due to arbitrary handwriting, fonts and sizes. The occlusion of instructional content on the chalkboard by the instructor adds to the difficulty of extracting text. Furthermore, instructional videos are unscripted, unedited and without any shot or transitional changes. Automatically structuring and indexing these lecture videos is therefore a challenging task.

Two strategies can be employed to segment and recognize cursive text: analytic and holistic. The analytic strategy uses a bottom-up approach, recognizing each character individually from local features and working towards a meaningful text. The holistic strategy uses a top-down approach, recognizing a whole word from global features extracted from the entire word image; it relies on a limited-size lexicon for text recognition.

The whiteboard capture system of Wienecke et al. [1] was a first step towards extracting text written on whiteboards. They employed the analytic strategy. Their method of content extraction, using feature vectors based on pixel intensity, average intensity changes, and edges, works poorly under variable lighting conditions and varying dry-erase marker quality. A semantic indexing method combining handwritten word recognition with information retrieval techniques was proposed by Tang and Kender [2]. Their method is based on the holistic strategy: they construct a table of contents (TOC) structure from course material by utilizing domain knowledge of the course. Liwicki and Bunke proposed a multiple classifier system (MCS), based on hidden Markov models (HMMs), for recognizing notes written on a whiteboard [3]. A similar middle-level-feature approach is used to summarize the visual content of instructional videos in [4]. Bunke et al. developed a system for offline recognition of unconstrained handwritten text using HMMs and statistical language models [5]; the language model improves the accuracy of the system. Many efforts have been made to find suitable features for text recognition. In [6], a feature extraction method based on the Fourier-wavelet transform is implemented and analyzed; the recognizer starts at a coarse resolution and then successively renders the same features at finer resolutions until the classification meets acceptance criteria. Correia et al. validated a recognition system that uses bi-dimensional wavelet transforms as a feature extractor and investigated the relevance of each sub-band image in the recognition process [7]. Their results show that



the information about relevant image features is evenly distributed across all sub-band images of the wavelet coefficients, a promising property for numeral recognition systems. A fast multistage algorithm for the recognition of a printed isolated character set was presented in [8]. It uses size-normalized binary images of isolated characters, from which signatures based on the number of black pixels are extracted along with a signature scaling factor. This scaling factor makes multiple projection profiles possible; the multiple projection profiles are then used to compute multiple horizontal and vertical signature values based on the locations of black pixels in the image. Most of these methods fail to accurately classify text extracted from instructional videos.

In this paper, we aim to segment handwritten text from instructional videos and to find suitable features for accurately classifying the segmented characters. We employ the analytic strategy to segment individual characters and classify them.

The rest of the paper is organized as follows. Section II describes our proposed framework. Section III describes the processing steps carried out to extract the chalkboard text and create a character database. In Section IV, we present the feature extraction process and classification. Section V gives the experimental results, followed by the conclusion in Section VI.

Figure 2. Foreground modeling: (a) original image I_Original, (b) foreground image I_Foreground, (c) background image I_Background.

II. PROPOSED METHODS

The proposed framework is shown in Figure 1. The inputs to the framework are raw instructor-led lecture videos. The pre-processing module prepares a raw video for further processing by removing noise, localizing, enhancing and extracting text portions, and removing any portion of the video with no pedagogical value. Video frames in which no significant content is detected through writing or erasing activity constitute idle frames; these frames usually have no pedagogical value. To perform text analysis, we first separate the background (the blackboard) from the foreground object (the instructor). We then extract key frames from the background for text analysis; a key frame is extracted by detecting an erasing activity. Extracted key frames are further subjected to text localization, extraction, enhancement and segregation. We then extract gradient features of the segregated text for classification. We created a cursive handwritten text database from chalkboard lecture videos for training and testing. The processing steps of the proposed framework are described in the following subsections.

A. Video Acquisition

We acquired the lecture videos using a Logitech Pro 9000 external web cam connected to a laptop. The camera was kept static, 12-15 feet from the chalkboard. The auto-focus feature of the camera was disabled to avoid illumination changes, and the camera was manually focused at the center of the chalkboard. Videos were recorded at 2 MB and 10 fps, with the instructor moving in front of the chalkboard.

B. Pre-processing

The purpose of the pre-processing module is to prepare the raw video for text analysis. Depending upon the type of input lecture video, we apply different processing steps. For a typical instructor-led lecture video with traditional handwritten text on a blackboard, we separate the background content region (the blackboard) from the foreground object (the instructor), to get a clear picture of the background text for further processing. We do this by creating a foreground model of the moving object in a video frame and then removing the foreground object from the original frame using a probabilistic approach. We use the foreground object detection model proposed by Li et al. [9], which detects and segments foreground objects from a video containing both stationary and moving background objects. A stationary object is described by its color feature and a moving object by its color co-occurrence feature [9]. For our application, we treat the blackboard text and any illumination changes as part of the moving background. The foreground object is extracted by fusing the classification results for stationary and moving pixels. The model uses the Bayes theorem to separate foreground from background; any pixel that does not fit the foreground model is deemed background. By Bayes' theorem, the posterior probability of v_t, the feature vector extracted from an image sequence I_s(x, y) at time t, can be expressed as:

P(C | v_t, I_s) = P(v_t | C, I_s) P(C | I_s) / P(v_t | I_s),   C = b or f.   (1)

Therefore, by the Bayes decision rule, a pixel can be classified as foreground or background by comparing P(f | v_t, I_s) with P(b | v_t, I_s), where the normalizing term is decomposed as

P(v_t | I_s) = P(v_t | f, I_s) P_f + P(v_t | b, I_s) P_b,   (2)

where P_f = P(f | I_s) and P_b = P(b | I_s). Thus, by learning the prior probabilities P_f and P_b and the conditional probability P(v_t | f, I_s) in advance, we can classify a feature vector v_t as associated with either the background or the foreground.
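To make the decision concrete, the following is a minimal sketch of the per-pixel Bayes decision behind Eqs. (1)-(2). It assumes the likelihoods P(v_t | f, I_s), P(v_t | b, I_s) and the prior P_f have already been learned, as in [9]; the array and function names are illustrative, not part of the original system.

```python
import numpy as np

def classify_foreground(p_v_given_f, p_v_given_b, p_f):
    """Per-pixel Bayes decision (sketch of Eqs. 1-2).

    p_v_given_f, p_v_given_b: H x W likelihood maps of each pixel's
    feature vector under the foreground / background models.
    p_f: learned prior probability of foreground; P_b = 1 - P_f.
    Returns a boolean mask, True where the foreground posterior wins.
    """
    p_b = 1.0 - p_f
    # Eq. (2): P(v|Is) = P(v|f,Is) * Pf + P(v|b,Is) * Pb
    evidence = p_v_given_f * p_f + p_v_given_b * p_b
    # Eq. (1): posterior of foreground; compare with 0.5 since the
    # two posteriors sum to one
    posterior_f = p_v_given_f * p_f / np.maximum(evidence, 1e-12)
    return posterior_f > 0.5
```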

Due to the non-static nature of the text during the writing phase, the text is often detected as part of the moving foreground, as shown in Figure 2b; as a result, the foreground frame contains some portions of the background text as well. To overcome this problem, we apply morphological operations on the foreground frame to remove the text portions.


Figure 3. SAD of gray-level histograms.

Figure 4. Labeling.

Figure 5. Vertical text profile.

Once the foreground frame contains only the moving foreground object, i.e. the instructor, we subtract the foreground object from the original frame. This gives us the background frame. The extracted background frame usually contains some leftover shadow noise, as shown in Figure 2c. To remove it, we apply a shadow detection scheme based on the deterministic non-model-based approach in HSV color space proposed by Prati et al. [10]. Under this approach, a pixel is considered part of a shadow if it has roughly the same chromaticity as, but lower brightness than, the corresponding pixel in the background model. Using a suitable threshold T, the shadow can be removed from the background frame. The decision is made based on the following equation:

S(x, y) = 1 if Br_img < Br_bg and Ch_img = Ch_bg ± T, and 0 otherwise,   (3)

where S(x, y) indicates a shadow pixel at location (x, y), and Br and Ch are the brightness and chrominance of the background model bg and the input image img, respectively. For smart-board-based instructional videos containing cursive handwritten text, we process the frames directly for text localization and extraction, as explained in Section II-C.
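The following is a minimal sketch of this HSV shadow test, assuming an OpenCV pipeline; the threshold values and the brightness ratio beta are illustrative placeholders, not the values used in the paper or in [10].

```python
import cv2
import numpy as np

def shadow_mask(frame_bgr, bg_bgr, t_hue=10, t_sat=40, beta=0.9):
    """Sketch of the shadow test in Eq. (3): a pixel is shadow if it
    keeps roughly the chromaticity of the background model (hue and
    saturation within a threshold T) but is darker (lower V).
    Hue wrap-around is ignored for brevity."""
    img = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV).astype(np.int16)
    bg = cv2.cvtColor(bg_bgr, cv2.COLOR_BGR2HSV).astype(np.int16)
    darker = img[..., 2] < beta * bg[..., 2]               # Br_img < Br_bg
    same_hue = np.abs(img[..., 0] - bg[..., 0]) <= t_hue   # Ch_img = Ch_bg +/- T
    same_sat = np.abs(img[..., 1] - bg[..., 1]) <= t_sat
    return (darker & same_hue & same_sat).astype(np.uint8) * 255
```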


C. Text Analysis Module (TAM)

The purpose of text analysis is to extract the cursive handwritten text from the extracted background blackboard. The TAM consists of the following steps (sketches of the SAD measure and of the projection-based cut search follow the list):

• Extracting the key frame from the background video. A key frame is selected at time t − 5 of an erasing activity detected at time t. An erasing activity is detected using the sum of absolute differences (SAD) of the gray-level histograms of two frames:

SAD_H(t; T) = Σ_{i=0}^{N−1} |H_t(i) − H_{t−T}(i)|,   (4)

where the temporal spacing T is set on the order of 3 frames for video encoded at 10 fps. An example of SAD is shown in Figure 3.
• Binarizing the selected key frame using the Otsu method [11].
• Localizing the text regions using a 4-connected-component labeling method, as shown in Figure 4.
• Enhancing and extracting the text lines from the connected components (blobs) obtained from localization. This is done by a morphological operation consisting of repeated openings, which connects text strokes broken by weak chalk contrast against the background and removes any leftover chalk noise.
• Segregating the characters using a vertical 1-D projection histogram. For this, we first create a vertical profile of each word by traversing the text contour from top and bottom and filling the hollow cavities, as shown in Figure 5. In this way we were able to separate most of the cursive text; the yellow bars indicate the cuts. In certain cases, however, this fails due to overlapping characters, as in the word "Test", where a portion of the character 'T' overlaps the character 'e'; as a result, 'T' and 'e' were not separated. To separate such characters, we divide the text line horizontally into two halves from the middle, indicated by a red line in Figure 5. We then mark candidate cut points by identifying peaks and valleys in the upper and lower halves of the text line, depicted as green and orange bars. By averaging the two cut points from the upper and lower halves, we are able to accurately segregate the cursive handwritten text into individual characters, as shown in Figure 6.
• Segmenting and extracting the characters into 28 × 28 blobs, as described in Section III.
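The SAD measure in Eq. (4) is straightforward to compute; the following is a minimal sketch, assuming 8-bit gray-level frames as NumPy arrays (the names are illustrative).

```python
import numpy as np

def sad_histograms(frame_t, frame_t_minus_T, bins=256):
    """Sum of absolute differences of gray-level histograms (Eq. 4).
    A spike in this value between frames t and t - T signals an
    erasing (or writing) activity; the key frame is then taken a few
    frames before the detected erasing."""
    h_t, _ = np.histogram(frame_t, bins=bins, range=(0, 256))
    h_tT, _ = np.histogram(frame_t_minus_T, bins=bins, range=(0, 256))
    return int(np.abs(h_t - h_tT).sum())
```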

Figure 6. Cursive text segregation.


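As a sketch of the projection-based segregation step (without the upper/lower-half refinement), the following finds cut candidates where the vertical ink profile of a binarized text line drops to zero; the names and the min_gap parameter are illustrative.

```python
import numpy as np

def vertical_cuts(binary_line, min_gap=1):
    """Find character cut candidates from a vertical 1-D projection
    histogram of a binarized text line (ink = 1, background = 0).
    Columns with zero ink are gaps; each sufficiently wide gap
    contributes one cut at its midpoint."""
    profile = binary_line.sum(axis=0)           # ink pixels per column
    cuts, start = [], None
    for x, is_gap in enumerate(profile == 0):
        if is_gap and start is None:
            start = x                           # gap opens
        elif not is_gap and start is not None:
            if x - start >= min_gap:
                cuts.append((start + x) // 2)   # cut mid-gap
            start = None
    return cuts
```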

III. DATASET DETAILS

Due to the lack of a standardized chalkboard text database, we created our own database of chalkboard text characters that conforms to the standard MNIST digit database. We created 400 instances of each cursive handwritten lower-case (a-z) and upper-case (A-Z) character from chalkboard images. These images contain characters written on a green glass chalkboard in yellow, pink, purple and white chalk. The processing to extract these characters and create the database is carried out in the following fashion:
• First, we segment the image into individual characters using vertical and horizontal projections after binarization.
• We then erode each character image with a disk-shaped structuring element of size 1, followed by two dilations with the same structuring element.
• The characters are then normalized to 28 × 28 pixels using the Lanczos3 interpolation kernel.
• Lastly, we clean the image by applying the 'clean' and 'bridge' morphological operations.
Each character is normalized and positioned at the center of a 28 × 28 blob to match the MNIST database; the characters are centered based on their center of mass, as shown in Figure 7 (a sketch of this centering is given below). Up to the character segregation step described in Section II-C, the processing for creating the character database is carried out in the same fashion as for the extraction of text from the recorded lecture videos.
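The MNIST-style centering can be sketched as follows, assuming a binarized character image; the 20-pixel inner box and the SciPy-based resampling are illustrative choices, not necessarily those used by the authors.

```python
import numpy as np
from scipy import ndimage

def center_in_blob(char_img, size=28, inner=20):
    """Place a binarized character in a size x size blob, centered on
    its center of mass, MNIST-style (Section III)."""
    ys, xs = np.nonzero(char_img)
    crop = char_img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    scale = inner / max(crop.shape)             # fit glyph in inner box
    glyph = ndimage.zoom(crop.astype(float), scale, order=1) > 0.5
    h, w = glyph.shape
    cy, cx = ndimage.center_of_mass(glyph)
    top = int(np.clip(round(size / 2 - cy), 0, size - h))
    left = int(np.clip(round(size / 2 - cx), 0, size - w))
    out = np.zeros((size, size), dtype=np.uint8)
    out[top:top + h, left:left + w] = glyph
    return out
```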

IV. METHODOLOGY

A. Gradient-Based Feature Extraction

Figure 7. Database samples: handwritten character 'a' and character 'e'.



400-dimensional gradient-based feature: The gray-scale local orientation histograms of the connected components are used to extract a 400-dimensional feature vector [12]. To obtain the 400-dimensional features, we apply the following steps (a sketch is given after the list):
• First, the input binary image is size-normalized to 126 × 126 pixels.
• The input binary image is then converted into a gray-scale image by applying a 2 × 2 mean filter 5 times.
• The gray-scale image is normalized so that the mean gray level becomes zero with a maximum value of 1.
• Next, the normalized image is segmented into 9 × 9 blocks.
• A Roberts filter is then applied to the image to obtain the gradient image. The arc tangent of the gradient (its direction) is quantized into 16 directions (intervals of 22.5°), and the strength of the gradient is accumulated for each quantized direction. The strength of the gradient is f(x, y) = sqrt((Δu)² + (Δv)²) and its direction is θ(x, y) = tan⁻¹(Δu/Δv), where Δu = g(x+1, y+1) − g(x, y), Δv = g(x+1, y) − g(x, y+1), and g(x, y) is the gray-scale value at point (x, y).
• Histograms of the 16 quantized direction values are computed in each of the 9 × 9 blocks.
• Finally, the 9 × 9 blocks are down-sampled to 5 × 5 with a Gaussian filter, giving a 5 × 5 × 16 = 400-dimensional feature vector.
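A compact sketch of this feature is given below; it uses the diagonal differences from the list above but approximates the Gaussian down-sampling with linear interpolation, and the helper names are illustrative rather than the authors' code.

```python
import numpy as np
from scipy import ndimage

def gradient_features(gray):
    """Sketch of the 400-d gradient feature: 16-direction strength
    histograms over 9 x 9 blocks of a 126 x 126 gray image,
    down-sampled to 5 x 5 blocks (5 * 5 * 16 = 400)."""
    g = gray.astype(float)
    # Roberts-cross differences: du = g(x+1,y+1)-g(x,y), dv = g(x+1,y)-g(x,y+1)
    du = g[1:, 1:] - g[:-1, :-1]
    dv = g[1:, :-1] - g[:-1, 1:]
    strength = np.hypot(du, dv)
    direction = np.arctan2(du, dv)                 # mirrors atan(du/dv)
    bins = ((direction + np.pi) / (2 * np.pi) * 16).astype(int) % 16
    hist = np.zeros((9, 9, 16))
    step = strength.shape[0] // 9                  # block size; edge rows
    for by in range(9):                            # beyond 9*step ignored
        for bx in range(9):
            blk = (slice(by * step, (by + 1) * step),
                   slice(bx * step, (bx + 1) * step))
            for d in range(16):
                hist[by, bx, d] = strength[blk][bins[blk] == d].sum()
    # the paper uses a Gaussian filter for 9x9 -> 5x5; plain zoom here
    return ndimage.zoom(hist, (5 / 9, 5 / 9, 1), order=1).reshape(-1)
```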




B. Classifier

In our experiments we used a Support Vector Machine (SVM) as the classifier. The SVM is defined for two-class problems; it looks for the optimal hyperplane that maximizes the distance (the margin) between the nearest examples of the two classes, called support vectors (SVs). Given a training database of M data points {x_m | m = 1, ..., M}, the linear SVM classifier is defined as:

f(x) = Σ_j α_j x_j · x + b,   (5)

where the x_j are the support vectors and the parameters α_j and b are determined by solving a quadratic programming problem [13]. The linear SVM can be extended to various non-linear forms; details can be found in [13], [14]. In our experiments the Gaussian-kernel SVM outperformed the other non-linear SVM kernels, hence we report recognition results for the Gaussian kernel only. The Gaussian kernel has the form:

k(x, y) = exp(−||x − y||² / (2σ²)).   (6)

For the gradient features, the Gaussian kernel gave the highest accuracy with its gamma parameter (1/2σ²) set to 36.00 and the penalty multiplier p set to 1. The high value of 1/(2σ²) for classification with the gradient features indicates that considerable non-linearity is involved in the classification task.
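As a hedged sketch of this setup, the snippet below trains an RBF (Gaussian) kernel SVM with gamma = 36 and penalty 1 and scores it with 5-fold cross validation using scikit-learn; random placeholder data stands in for the real 400-d gradient features and character labels.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 400))   # placeholder 400-d gradient features
y = np.arange(200) % 26           # placeholder labels, 26 balanced classes

# RBF kernel with gamma = 1/(2 sigma^2) = 36 and penalty C = 1,
# matching the parameter values reported in Section IV-B
clf = SVC(kernel="rbf", gamma=36.0, C=1.0)
scores = cross_val_score(clf, X, y, cv=5)
print("5-fold accuracy: %.2f%%" % (100 * scores.mean()))
```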

Figure 8. Confidence score distribution.

Figure 9. Segmentation error per character.

Table I
PERFORMANCE WITH RESPECT TO NUMBER OF LEARNING SAMPLES

Number of folds | Overall accuracy
5               | 86.28%
10              | 86.75%

Table II
SEGMENTATION ERROR FOR CHARACTER 'A'

Interpolation kernel | Without text enhancement | With text enhancement
Box                  | 93.1%                    | 11.9%
Triangle             | 89.3%                    | 10.0%
Cubic                | 84.5%                    | 9.3%
Lanczos2             | 83.8%                    | 7.1%
Lanczos3             | 78.1%                    | 5.8%



V. RESULTS AND DISCUSSION

A. Accuracy on N-Fold Cross Validation and Effect of Training Set Size

We performed N-fold cross validation on the character components of all segmented characters found in all video frames in our dataset. Since we did not have enough samples from all classes to make separate training and testing datasets, we had to use cross validation to evaluate our system. We also merged classes containing similarly shaped capital and small characters, as in the case of 'O', 'Z', 'X', etc. We noticed that the number of folds did not affect our system much, as is evident from Table I, which reports results for two values of N.

B. Confidence Score Distribution

We also analyzed the confidence score distribution of the top-choice class returned by the classifier. By confidence score we mean the probability estimate of the recognized class [15]. As Figure 8 shows, around 60% of the identified characters obtained a confidence score of 80% or more.
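Continuing the classifier sketch above, scikit-learn exposes such per-class probability estimates, computed by pairwise coupling [15], when probability=True; the snippet below reports the share of samples whose top-choice confidence reaches 0.8. The variables X and y are the placeholders from the earlier sketch.

```python
from sklearn.svm import SVC

# probability=True enables pairwise-coupling probability estimates [15]
clf = SVC(kernel="rbf", gamma=36.0, C=1.0, probability=True).fit(X, y)
confidence = clf.predict_proba(X).max(axis=1)   # top-choice confidence
print("share with confidence >= 0.8:", (confidence >= 0.8).mean())
```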


C. Error Analysis

When analyzing the segmentation errors, we found that most errors occurred due to text normalization and resizing during the segmentation process, because of inconsistent handwritten fonts and varying handwriting styles. For example, for the character 'a', omitting the text enhancement step explained in Section III yields a very high error rate of 93.1% with the box interpolation kernel; this high error rate is due to over-segmentation of the character. By applying the necessary text enhancement steps explained in Section III, we were able to lower the error rate significantly, as shown in Table II. We also found that Lanczos3 gives the best shape description after normalizing the text, reducing the error rate for character 'a' from 78.1% to 5.8%; we therefore used the Lanczos3 interpolation method for text resizing. The segmentation errors for the individual characters are shown in Figure 9. For most characters the overall segmentation error is less than 5%. The character 'i', however, has the highest segmentation error rate, 45.1%, because the dot of 'i' is often removed during text processing, so its shape is not preserved properly.

The segregation of the text into individual characters yields a very low error rate of only 1.01%. This is due to the fact that we find the local minima and maxima cut points in the upper and lower halves of the text, as described in Section II-C, which gives nearly perfect segregation. The classification errors were incurred due to under-segmentation of the cursive handwriting; an example of this kind of error is shown in Figure 10, where the left-hand character 'n' was not properly segmented. Moreover, due to varying handwriting styles, some characters in handwritten Roman text take on similar shapes: small letters 'c' and 'e', for instance, can look so alike that even the human eye is sometimes deceived.


Figure 10. Error due to under segmentation.

VI. CONCLUSION

We presented a cursive handwritten text analysis to extract and recognize chalkboard content. In instructor-led lecture videos, localizing and extracting text can be tricky due to content occlusion, varying handwriting styles, chalk dust noise and poor lighting conditions. By modeling the background, we removed the moving foreground object to deal with text occlusion. We further used morphological operations to localize, extract and enhance the text. Using a 1-D projection approach, we found the local minima and maxima to segregate the handwritten text. The extracted characters were then normalized to a predefined size for feature extraction. We used gradient features to train an SVM classifier and obtained an overall classification accuracy of 86.28% with 5-fold cross validation. As future work, we plan to test the framework on freely available instructional videos of chalkboard- and smart-board-based handwritten lecture notes, such as those of The Khan Academy [16].

REFERENCES

[1] M. Wienecke, G. Fink, and G. Sagerer, "Towards automatic video-based whiteboard reading," in Proceedings of the Seventh International Conference on Document Analysis and Recognition, 2003, pp. 87-91.

[2] L. Tang and J. Kender, "Semantic indexing for instructional video via combination of handwriting recognition and information retrieval," in IEEE International Conference on Multimedia and Expo, ICME, 2005, pp. 920-923.

[3] M. Liwicki and H. Bunke, "Combining on-line and off-line systems for handwriting recognition," in Ninth International Conference on Document Analysis and Recognition, ICDAR, vol. 1, 2007, pp. 372-376.

[4] C. Choudary and T. Liu, "Summarization of visual content in instructional videos," IEEE Transactions on Multimedia, vol. 9, no. 7, pp. 1443-1455, 2007.

[5] H. Bunke, S. Bengio, and A. Vinciarelli, "Offline recognition of unconstrained handwritten texts using HMMs and statistical language models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 709-720, Jun. 2004.

[6] K. Indira and S. Sethu Selvi, "An off-line cursive script recognition system using Fourier-wavelet features," in International Conference on Computational Intelligence and Multimedia Applications, vol. 2, Dec. 2007, pp. 506-510.

[7] S. Correia, J. de Carvalho, and R. Sabourin, "On the performance of wavelets for handwritten numerals recognition," in 16th International Conference on Pattern Recognition, vol. 3, 2002, pp. 127-130.

[8] R. Cruz, G. Cavalcanti, and T. Ren, "An ensemble classifier for offline cursive character recognition using multiple feature extraction techniques," in The 2010 International Joint Conference on Neural Networks (IJCNN), Jul. 2010, pp. 1-8.

[9] L. Li, W. Huang, I. Y. H. Gu, and Q. Tian, "Foreground object detection from videos containing complex background," in Proceedings of the Eleventh ACM International Conference on Multimedia, MULTIMEDIA '03, New York, NY, USA: ACM, 2003, pp. 2-10.

[10] A. Prati, I. Mikic, M. Trivedi, and R. Cucchiara, "Detecting moving shadows: algorithms and evaluation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 7, pp. 918-923, 2003.

[11] L. Jianzhuang, L. Wenqing, and T. Yupeng, "Automatic thresholding of gray-level pictures using two-dimension Otsu method," in Proceedings of the International Conference on Circuits and Systems, Jun. 1991, pp. 325-327.

[12] U. Pal, T. Wakabayashi, N. Sharma, and F. Kimura, "Handwritten numeral recognition of six popular Indian scripts," in Ninth International Conference on Document Analysis and Recognition, ICDAR, vol. 2, Sep. 2007, pp. 749-753.

[13] C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121-167, 1998.

[14] V. Vapnik, The Nature of Statistical Learning Theory. Springer-Verlag, 1995.

[15] T.-F. Wu, C.-J. Lin, and R. C. Weng, "Probability estimates for multi-class classification by pairwise coupling," Journal of Machine Learning Research, vol. 5, pp. 975-1005, 2004.

[16] The Khan Academy. [Online]. Available: http://www.khanacademy.org/
