IEEE-20180

Text line extraction from handwritten document pages based on line contour estimation

Ram Sarkar (1), Sougata Halder (2), Samir Malakar (2), Nibaran Das (1), Subhadip Basu (1), Mita Nasipuri (1)

(1) Dept. of Computer Science and Engineering, Jadavpur University, Kolkata, India

(2) Dept. of Master of Computer Application, M.C.K.V. Institute of Engineering, Liluah, Howrah, India

{raamsarkar, sougata.halder88, malakarsamir, nibaran, bsubhadip, mitanasipuri}@gmail.com

Abstract.

Extraction of text lines from handwritten/printed document images is one of the important steps in an Optical Character Recognition (OCR) system. In handwritten document images, the presence of skewed, touching or overlapping text lines makes this process a real challenge. In the present work, a new text line extraction technique based on line contour estimation is reported. Here, the digitized document image is initially partitioned into a number of vertical fragments of equal width. Then all the line segments present in these vertical fragments are detected. Finally, the neighboring line segments are analyzed to place them inside the text line boundary to which they actually belong. For experimental purposes, the developed technique is tested on the CMATERdb1.2.1 database, where it extracts 88.14% of the text lines successfully.

Keywords: Text line extraction, Handwritten document pages, CMATERdb, Multi-skewed text line, Vertical partitioning, Contour estimation, OCR

I. INTRODUCTION

OCR involves computer recognition of characters from digitized images of optically scanned document pages. The characters thus recognized are coded in the American Standard Code for Information Interchange (ASCII) or some other standard code for storage in a digital format, which can be edited using standard word processing software or a text editor. Identification/extraction of text lines is one of the important steps in an OCR system for handwritten/printed document pages. If text line identification of a digitized document page fails, then the words and characters belonging to the corresponding text lines cannot be identified properly. Such errors are not acceptable for large-scale recognition of document pages. The problem of text line identification is more difficult for handwritten document pages than for printed ones. This is because the text lines in a handwritten document may be skewed at different angles of inclination with the horizontal axis, i.e., individual text lines may not be parallel to one another. Sometimes adjacent text lines may even touch/overlap one another at single/multiple points. All such cases make text line extraction from handwritten digitized document pages a challenging research problem.

Un-skewed text lines can easily be extracted from document images by identifying only the valleys of the horizontal pixel density histogram, as shown in Fig. 1(a). But in the case of an unconstrained document page, it is not always possible to extract the text lines using the horizontal pixel density histogram alone. One such document image, with the corresponding horizontal pixel density histogram, is shown in Fig. 1(b).

II. PREVIOUS WORK

Many research works on extraction of unconstrained handwritten text lines from digitized document pages are available in the literature [1-16]. These works may be classified broadly into 3 categories, viz., i) Connected Component (CC) based analysis, ii) Statistical approaches and iii) Partitioning based analysis. The work presented in this paper falls in the third category of solutions.
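For reference, the valley-based extraction of un-skewed text lines mentioned in the Introduction can be sketched as follows. This is a minimal illustration, assuming a binary image stored as a list of rows with 1 for data pixels and 0 for background; the function names are hypothetical and are not taken from the paper.

```python
def row_histogram(image):
    """Count data (foreground) pixels in every row of a binary image."""
    return [sum(row) for row in image]


def lines_from_valleys(image):
    """Return (start_row, end_row) pairs, end inclusive, for each run of non-empty rows."""
    lines, start = [], None
    for r, count in enumerate(row_histogram(image)):
        if count > 0 and start is None:
            start = r                      # first row of a new text line
        elif count == 0 and start is not None:
            lines.append((start, r - 1))   # a valley (empty row) closes the line
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(image) - 1))
    return lines


page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
]
print(lines_from_valleys(page))  # [(1, 2), (4, 4)]
```

On a skewed page the rows of neighboring text lines overlap, so the histogram has no empty valleys and this simple rule merges the lines, which is the limitation the Introduction points out.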

ICCCNT'12, 26th-28th July 2012, Coimbatore, India

(a) Handwritten document image with unskewed text lines. (b) Handwritten document image with skewed text lines.
Fig. 1(a-b). Horizontal pixel density histograms

CC based analysis

Among the first category of solutions, text lines are extracted from handwritten document images using an iterative hypothesis-validation strategy in [2]. Here, information collected from both the Hough domain and the image is combined. At each stage of the technique, a text-line hypothesis is generated by searching for the best alignments of the CCs. In [3], some general information is first collected from the images by applying the Hough transform. Then a natural learning method, similar to the human learning procedure, is applied to cluster the CCs to generate the final text lines. A block-based Hough transform method [5] takes into account the gravity centers of parts of CCs. In [8], text lines are extracted in three steps. In the first step, image binarization, enhancement, CC extraction etc. are done. In the second step, a block-based Hough transform technique is employed for detection of text lines and, finally, the text lines which are not separated in the previous step are identified. In [14], a neighborhood CC analysis for detection and extraction of text lines is reported. In [15], text line extraction is performed in 3 steps. In the first step (preprocessing), all the CCs of a document page are classified into 3 categories depending on average character height and average character width. In the second step, Hough transform mapping is applied on a subset of the document image CCs (comprising the majority of the characters). Finally, the post-processing step includes the correction of possible false alarms, the detection of text lines that the Hough transform failed to create and the efficient separation of vertically connected characters using a novel method based on skeletonization.

Statistical approaches


Run Length Smoothing Algorithms (RLSA) [1] include the fuzzy RLSA [4], in which the value of each pixel is the sum of all pixels in the original image within a specified horizontal distance. Adaptive RLSA [7] evolves from classical RLSA and uses additional smoothing constraints with regard to the geometrical properties of neighboring CCs. The technique used in [12] is based on morphological operations and RLSA, and segments individual text lines from unconstrained handwritten document images. A Minimal Spanning Tree (MST) based clustering technique [9] with distance metric learning is used for text line segmentation of Chinese documents. Another text line segmentation technique for handwritten documents is described in [10] using the Mumford-Shah (MS) model. Here text line segmentation is achieved by minimizing the MS energy. In [11], density estimation and level-set methods are used for extraction of handwritten text lines from digitized document pages. In one of our earlier works [6], a novel technique for segmentation of multi-oriented handwritten text lines using hypothetical water flows at a specific flow angle from both sides of the document image was presented. The major drawback of the technique is its inability to split touching text lines in a convincing way.

Partition based analysis

An improvement of the technique [6], named the piece-wise water-flow technique [13], falls in the third category of solutions. In this technique, hypothetical water flows in multiple vertical partitions of the document image are considered to identify text line segments in each partition. Touching text line components in each such partition are automatically identified and partitioned further. In [16], a new painting technique that enhances the separability between the foreground and background is employed to smear the foreground portion of the document image. Their technique consists of the following stages: (a) Piece-wise Painting Algorithm (PPA), (b) dilation operation, (c) complementing the dilated image followed by a thinning operation, (d) trimming the extracted lines, (e) constructing separating lines and (f) resolving the problems of overlapping and touching components.

The techniques described in [6, 13] can extract text lines from document pages effectively. But distinguishing the partially wetted and wetted regions [6] is an overhead. Again, to find the touching text lines inside a fragment, an iterative approach is applied in [13], which is time consuming. Moreover, handwritten document images contain a large number of CCs, and processing a large number of CCs is also time consuming. To address these issues, a simple and effective line contour based algorithm to extract text lines from handwritten document pages is developed in the present work.
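As a point of reference for the smoothing-based methods surveyed above, classical horizontal run-length smoothing on a single image row can be sketched as follows. This is an illustrative sketch only, not the method of the present paper. It assumes a binary row with 1 for data pixels and fills any background run of at most `threshold` pixels that is bounded by data pixels on both sides; the function name and the inclusive threshold convention are assumptions.

```python
def rlsa_horizontal(row, threshold):
    """Fill background runs of length <= threshold lying between two data pixels."""
    out = list(row)
    last_fg = None  # column index of the most recent data pixel
    for i, pixel in enumerate(row):
        if pixel == 1:
            gap = i - last_fg - 1 if last_fg is not None else 0
            if 0 < gap <= threshold:
                for k in range(last_fg + 1, i):
                    out[k] = 1             # smear the short gap
            last_fg = i
    return out


row = [1, 0, 0, 1, 0, 0, 0, 0, 1, 0]
print(rlsa_horizontal(row, 2))  # [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
```

The gap of two pixels is filled while the gap of four is kept, so words on one line tend to fuse into a single blob while wider gaps survive.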

III. PRESENT WORK

A text line extraction technique for unconstrained handwritten document pages is reported here. To start with, each document page is partitioned vertically into a number of fragments. In each fragment, line segments (LSs) are estimated by identifying the upper and lower contours of each line segment. After that, the identified text line segments of neighboring fragments are analyzed and merged in order to estimate the correct boundary of each text line present in the document page. To evaluate the present technique, CMATERdb1.2.1 [17], a database repository for text line extraction of unconstrained handwritten document pages, is considered. Fig. 2 shows a schematic work flow diagram of the developed technique.

Fig. 2. Schematic work flow diagram of the present work

A. Preprocessing

CMATERdb1.2.1 [17] is a database repository of digitized handwritten document pages of Bangla text mixed with English words. These document pages are already in digitized form. As a preprocessing step, binarization of the digitized document pages is performed using a global thresholding technique. Then, to remove single pixel noise, a morphological opening operation [18] is applied on the binarized image to make the document pages noise free. To evaluate the present technique, the said database is used here.

B. Partitioning the document image

Let f be a real valued non-linear function defined over the interval [a, b], where a, b ∈ R. Let [a, b] be divided into n small partitions, where n ∈ N. As n becomes large, the function seems like a straight line in each of these partitions. Here, N and R are the set of natural numbers and the set of real numbers respectively. Moreover, if the function is already a straight line then it will remain a straight line in each of the partitions mentioned above. As the text lines of the handwritten document pages under consideration may be skewed or curvy, partitioning the document pages into a number of vertical fragments ensures that the LS in each fragment seems to be straight. Based on this concept, in the current work the digitized document pages are partitioned vertically into n fragments. The number of fragments is predefined and the width of each vertical fragment depends on the width of the document page. The fragment width in the present work is defined as:

W_FRAG = W_DOC / NFRAG    (1)

where W_FRAG is the fragment width (in pixel), W_DOC is the width of the document page (in pixel) and NFRAG is the number of fragments.

Now, if NFRAG is considered sufficiently large then the words present in a text line may be broken into pieces, which later makes it difficult to recognize the individual characters. Therefore, the choice of NFRAG should be made cautiously. In this regard, a survey was conducted to find out how many words on average are present in a text line of a handwritten document image. The survey revealed that a text line of a handwritten document page contains around 7-8 words on average. Based on this information, NFRAG is chosen as 8 in the developed technique. However, to validate the result, the present technique is also evaluated with other values of NFRAG. The corresponding findings are reported and discussed in Section IV.
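The vertical partitioning step can be sketched as follows. This is a minimal illustration, assuming that the fragment width is the page width divided by the number of fragments NFRAG (integer division is used here) and that the last fragment absorbs any leftover columns; both the remainder handling and the function names are assumptions, not details given in the paper.

```python
def partition_columns(page_width, n_frag):
    """Return (start, end) column ranges, end exclusive, for n_frag vertical fragments."""
    frag_width = page_width // n_frag      # assumed integer form of the width rule
    bounds = []
    for j in range(n_frag):
        start = j * frag_width
        # Assumption: the last fragment takes the columns left over by the division.
        end = page_width if j == n_frag - 1 else start + frag_width
        bounds.append((start, end))
    return bounds


def split_page(image, n_frag):
    """Cut a binary image (list of rows) into n_frag vertical fragments."""
    width = len(image[0])
    return [[row[s:e] for row in image] for s, e in partition_columns(width, n_frag)]


print(partition_columns(10, 3))  # [(0, 3), (3, 6), (6, 10)]
```

Each fragment produced by `split_page` keeps the full page height, so the row scanning of the next step can be applied to it unchanged.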

C. Detection of LSs in each fragment

After partitioning the document image into vertical fragments, the next objective is to detect all the LSs in each fragment. To perform this, the following algorithmic steps are applied.

Algorithm 1:

I. For each vertical fragment
{
1. Set [...]
2. Scan the fragment in top-down and left-right manner. Find a row with at least one data pixel and call it the starting row of an LS.
3. Scan downward from the starting row, find another row with all non-data pixels and call it the ending row of the LS.
4. Set [...]
5. Store the line height in this fragment as [...]
6. Store the line spacing as [...] when [...]
7. Trace the top and bottom contours of the text LS between the starting row and the ending row in each of the vertical fragments. The top contour is formed by searching for the first data pixel from the starting row in the downward direction along each column position. Similarly, the bottom contour is searched from the ending row in the upward direction.
8. [...]
}
II. Find the fragment-wise average line height and average line spacing within the fragments considering all the stored line heights and line spacings respectively. To estimate them, the following formulae are applied: [...]

Find the average line height of the document page considering all [...]

Estimation of the upper and lower contours of the LSs is illustrated in Fig. 3.

Fig. 3. Sample output image showing estimation of the upper and lower contours in each line segment (blue and red colors indicate upper and lower contours respectively)

D. Formation of final text line

As all the LSs of the different fragments are now detected, they need to be merged to form appropriate text line boundaries over the entire document image. The technique consists of two decisive steps: i) merging the intra-fragment LSs and ii) joining the inter-fragment LSs. The steps are discussed in the following sub-sections.

Merging the intra-fragment LSs

Sometimes, it is found that some of the CCs, which are parts of some characters, form an individual LS inside a fragment. This is depicted in Fig. 4(a). Therefore, a technique is required to put them in the proper text line. To do this, first all the LSs are classified into two categories, viz., partial LS (PLS) and complete LS (CLS). These two categories of LSs are shown in Fig. 4(a). The classification methodology of the LSs is described in Algorithm 2.

Algorithm 2:

1. Denote the ith LS of the jth partition as LSij.
2. If LINEH of LSij [...], set [...]; else set [...].
3. Set [...]
4. If (Status(LSij) is PLS)
{
If [...], merge LSij with LS(i-1)j.
If [...], merge LSij with LS(i+1)j.
If [...], keep LSij isolated, as this PLS has an equal chance of being attached to either the upper LS or the lower LS.
}
Else keep LSij isolated.

For [...], the LS is kept isolated to avoid under segmentation.

(a) Illustration of the two categories of LSs. PLSs are encircled and the rest are CLSs.
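The row scanning and contour tracing of Algorithm 1, together with a PLS/CLS split, can be sketched as follows. The scanning and tracing follow the prose description of the steps. The classification rule is only an assumption: the exact conditions of Algorithm 2 are not recoverable here, so the sketch labels an LS as partial when its height falls below a fraction `ratio` of the average LS height. All function names and the `ratio` parameter are hypothetical.

```python
def detect_line_segments(fragment):
    """Return (start_row, end_row) per LS; end_row is the first all-background row after it."""
    segments, start = [], None
    for r, row in enumerate(fragment):
        has_data = any(row)
        if has_data and start is None:
            start = r                          # starting row of an LS
        elif not has_data and start is not None:
            segments.append((start, r))        # ending row of the LS
            start = None
    if start is not None:                      # LS reaching the bottom of the fragment
        segments.append((start, len(fragment)))
    return segments


def trace_contours(fragment, start, end):
    """Per column, first data pixel from the top and from the bottom of rows [start, end)."""
    top, bottom = [], []
    for c in range(len(fragment[0])):
        rows = [r for r in range(start, end) if fragment[r][c]]
        top.append(rows[0] if rows else None)
        bottom.append(rows[-1] if rows else None)
    return top, bottom


def classify(segments, ratio=0.5):
    """Assumed rule: 'PLS' if height < ratio * average height, otherwise 'CLS'."""
    if not segments:
        return []
    heights = [e - s for s, e in segments]
    avg = sum(heights) / len(heights)
    return ["PLS" if h < ratio * avg else "CLS" for h in heights]


fragment = [
    [0, 0, 0], [0, 1, 0], [1, 1, 1], [1, 0, 1], [0, 0, 0],
    [0, 1, 0], [0, 0, 0], [1, 1, 0], [1, 1, 1], [0, 1, 1],
]
segs = detect_line_segments(fragment)
print(segs)                               # [(1, 4), (5, 6), (7, 10)]
print(trace_contours(fragment, 1, 4))     # ([2, 1, 2], [3, 2, 3])
print(classify(segs))                     # ['CLS', 'PLS', 'CLS']
```

In the example the single-row segment in the middle is flagged as partial, which corresponds to the situation where a stray component of a character forms its own LS inside a fragment.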

(b) Illustration of the modified LSs after merging the PLSs with their corresponding CLSs.

Joining the inter-fragment text LSs

It is already mentioned that the document pages are partitioned into a number of vertical fragments and that different LSs are estimated in each fragment. A sample output image of such partitioning is shown in Fig. 4. As this partitioning scheme produces different LSs, it is now required to merge them to form the actual text lines present in the document image under consideration. The technique for joining two LSs of consecutive fragments (say, the jth and (j+1)th fragments) for the ith LS is described in Algorithm 3, which is iterative in nature.

Algorithm 3:

1. Consider the ith LS of the jth fragment as LSij.
2. Calculate the taxicab distance between [...] of LSij and [...] of each LS of fragment j+1, i.e., [...], where [...].
3. Find the minimum of all these distances and set the index k for which the minimum occurs as Flag.
4. If the minimum distance is within the threshold TH, join LSij and the LS of fragment j+1 indexed by Flag.

Here, TH is a threshold value which is estimated experimentally. Variable i runs from 1 to LINENo of the jth fragment and variable j runs from 1 to [...].

IV. RESULT AND DISCUSSION

For experimentation purposes, digitized document pages from CMATERdb1.2.1 containing two different scripts (viz., Bangla and English) are selected in the present work. The detailed results of the present text line extraction technique on the said dataset are given in Table 1.

(c) Illustration of text line boundaries after joining the PLSs.
Fig. 4(a-c): Different steps applied in the present technique to form text lines
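The inter-fragment joining rule of Algorithm 3, whose outcome Fig. 4(c) illustrates, can be sketched as follows. This is a hedged illustration: it assumes each LS is summarized by one (row, column) point at the fragment boundary, and that a join is accepted when the smallest taxicab distance does not exceed the threshold TH. The choice of representative points, the direction of the comparison and the function names are assumptions rather than details stated in the paper.

```python
def taxicab(p, q):
    """Taxicab (Manhattan) distance between two (row, col) points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def join_segments(left_ends, right_starts, th):
    """For each LS of fragment j, find the nearest LS of fragment j+1.

    Entry i of the result is the matched index k (the 'Flag'), or None when
    the smallest distance exceeds the threshold th.
    """
    matches = []
    for p in left_ends:
        if not right_starts:
            matches.append(None)
            continue
        dists = [taxicab(p, q) for q in right_starts]
        flag = min(range(len(dists)), key=dists.__getitem__)
        matches.append(flag if dists[flag] <= th else None)
    return matches


left_ends = [(12, 99), (40, 99), (70, 99)]      # hypothetical end points in fragment j
right_starts = [(14, 100), (43, 100)]           # hypothetical start points in fragment j+1
print(join_segments(left_ends, right_starts, th=10))  # [0, 1, None]
```

A small threshold leaves more LSs unjoined, which splits text lines, while a large one risks joining LSs of different lines, so the value has to be tuned experimentally as the text notes.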

The experimentation is performed with different values of NFRAG, and Table 1 shows the value for which the present technique produces the best result. Out of 1240 text lines present in the said database, 1093, i.e., 88.14%, are extracted successfully. Error cases produced by the present technique are categorized into two types, viz., under-segmented text lines, i.e., two successive text lines are extracted as a single one, and over-segmented text lines, i.e., a single text line is split during extraction. The results presented in Table 1 also show that most of the error cases produced by the present technique are due to under segmentation of the text lines. Fig. 5(a) and Fig. 5(b) show under-segmented and over-segmented text lines respectively. From the result observed in Fig. 5(a), it can be said that the present technique fails due to the presence of touching text lines in the document pages. On the other hand, over segmentation of the text lines occurs mainly due to the choice of the threshold based decisive parameter for merging successive LSs in neighboring fragments, as described before. Fig. 5(c) shows an example of an output image where each text line is extracted successfully.
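The percentages reported in Table 1 follow from the line counts divided by the 1240 actual text lines. The short check below recomputes them; the variable and function names are illustrative only.

```python
TOTAL_LINES = 1240  # number of actual text lines reported in Table 1


def percent(count, total=TOTAL_LINES):
    """Share of text lines as a percentage (not rounded)."""
    return 100.0 * count / total


# One entry per column of Table 1.
extracted = [982, 1093, 1039]
under_segmented = [251, 162, 191]
over_segmented = [7, 10, 15]

for row in (extracted, under_segmented, over_segmented):
    print(["%.3f" % percent(c) for c in row])
```

For the best column, 1093 of 1240 evaluates to about 88.145%, which the table reports as 88.14.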

Table 1: Detailed description of the results obtained by the present text line extraction technique

Number of handwritten document pages in CMATERdb1.2.1: 50
Number of actual text lines: 1240

Measure                                          NFRAG = [...]   NFRAG = [...]   NFRAG = [...]
Number of text lines extracted properly                    982            1093            1039
Number of under segmented text lines                       251             162             191
Number of over segmented text lines                          7              10              15
Success rate (in %)                                      79.19           88.14           83.79
Error rate (in %) due to under segmentation              20.24           13.06           15.40
Error rate (in %) due to over segmentation                0.56            0.81            1.21

(a) An example of under-segmented text lines, marked with circle. (b) An example of over-segmented text lines, marked with circle. (c) An example of successful text line extraction.
Fig. 5(a-c): Various output images produced by the present text line extraction technique.

V. CONCLUSION

Extraction of text lines from handwritten/printed document images is one of the important associated problems of OCR systems. The presence of skewed text lines, which is common in handwritten documents, makes accurate extraction of text lines more difficult for handwritten documents than for printed ones. In this context, the present work develops a simple and effective partitioning based text line extraction technique that estimates the line contours in handwritten document images. The technique produces a reasonably good result. There is still considerable room to improve the present technique in the future. In particular, methodologies can be developed to avoid the under segmentation and over segmentation of text lines produced by the present technique. More document pages, written in different scripts, will be included in the future test data.

ACKNOWLEDGEMENT

Authors are thankful to the "Center for Microprocessor Application for Training Education and Research" (CMATER) and the "Project on Storage Retrieval and Understanding of Video for Multimedia" (SRUVM) of the Computer Science & Engineering Department, Jadavpur University, India, for providing infrastructural facilities during progress of the work. The work reported here has been partially funded by DST, Govt. of India, PURSE (Promotion of University Research and Scientific Excellence) Programme.

REFERENCES

[1]. F. M. Wahl, K. Y. Wong, and R. G. Casey, "Block segmentation and text extraction in mixed text/image documents", Computer Graphics and Image Processing, vol. 20, pp. 375-390, 1982.
[2]. L. L. Sulem, A. Hanimyan, and C. Faure, "A Hough based algorithm for extracting text lines in handwritten documents", Proc. of 3rd International Conference on Document Analysis and Recognition, Montreal, Canada, pp. 774-777, 1995.
[3]. Y. Pu and Z. Shi, "A natural learning algorithm based on Hough transform for text lines extraction in handwritten documents", Proc. of 6th IWFHR, pp. 637-646, 1998.
[4]. Z. Shi and V. Govindaraju, "Line separation for complex document images using fuzzy run-length", Proc. of 1st International Workshop on Document Image Analysis for Libraries, pp. 306, 2004.
[5]. G. Louloudis, B. Gatos, I. Pratikakis, and K. Halatsis, "A block-based Hough transform mapping for text line detection in handwritten documents", Proc. of 10th IWFHR, France, pp. 515-520, October 2006.
[6]. S. Basu, C. Chaudhuri, M. Kundu, M. Nasipuri, and D. K. Basu, "Text line extraction from multi-skewed handwritten documents", Pattern Recognition, vol. 40(6), pp. 1825-1839, June 2007.
[7]. B. Gatos, N. Stamatopoulos and G. Louloudis, "ICDAR2007 Handwriting Segmentation Contest", Proc. of 9th ICDAR, Curitiba, Brazil, pp. 1284-1288, September 2007.
[8]. G. Louloudis, B. Gatos, I. Pratikakis, and C. Halatsis, "Text line detection in Handwritten Documents", Pattern Recognition, vol. 41, pp. 3758-3772, 2008.
[9]. F. Yin and C. Liu, "Handwritten Text Line Segmentation by Clustering with Distance Metric Learning", Proc. of International Conference in Frontiers in Handwritten Recognition (ICFHR-08), Canada, pp. 229-234, August 19-21, 2008.
[10]. X. Du, W. Pan, and T. D. Bui, "Text Line Segmentation in Handwritten Documents Using Mumford-Shah Model", Proc. of International Conference in Frontiers in Handwritten Recognition (ICFHR-08), Canada, pp. 253-258, August 19-21, 2008.
[11]. Y. Li, Y. Zheng, and D. Doermann, "Script-Independent Text Line Segmentation in Freestyle Handwritten Documents", IEEE TPAMI, vol. 30(8), pp. 1313-1329, August 2008.
[12]. P. P. Roy, U. Pal, and J. Llados, "Morphology Based Handwritten Line Segmentation Using Foreground and Background Information", Proc. of International Conference in Frontiers in Handwritten Recognition (ICFHR-08), Canada, pp. 241-246, August 19-21, 2008.
[13]. R. Sarkar, S. Basu, N. Das, A. F. Mollah, M. Kundu, M. Nasipuri, "Line Extraction from Unconstraint Handwritten Document Pages using Piece-wise Water-flow Technique", Proc. of 4th Indian International Conference on Artificial Intelligence, pp. 1861-1872, 2009.
[14]. A. Khandelwal, P. Choudhury, R. Sarkar, S. Basu, M. Nasipuri, and N. Das, "Text Line Segmentation for Unconstrained Handwritten Document Images Using Neighborhood Connected Component Analysis", Proc. of PReMI-2009, LNCS 5909, pp. 369-374, 2009.
[15]. G. Louloudis, B. Gatos, I. Pratikakis, and C. Halatsis, "Text line and word segmentation of handwritten documents", Pattern Recognition, vol. 42, pp. 3169-3183, 2009.
[16]. A. Alaei, U. Pal, and P. Nagabhushan, "A new scheme for unconstrained handwritten text-line segmentation", Pattern Recognition, pp. 917-928, 2011.
[17]. R. Sarkar, N. Das, S. Basu, M. Kundu, M. Nasipuri, D. K. Basu, "CMATERdb1: a database of unconstrained handwritten Bangla and Bangla-English mixed script document image", IJDAR, Springer-Verlag, vol. 15, no. 1, pp. 71-83.
[18]. R. C. Gonzalez and R. E. Woods, Digital Image Processing, first ed., Prentice-Hall, India, 1992.