Morphology Based Handwritten Line Segmentation Using Foreground and Background Information

Partha Pratim Roy
Computer Vision Centre, Universitat Autònoma de Barcelona, 08193, Bellaterra (Barcelona), Spain.

Umapada Pal
Computer Vision and Pattern Recognition Unit, Indian Statistical Institute, Kolkata-108, India

Josep Lladós
Computer Vision Centre, Universitat Autònoma de Barcelona, 08193, Bellaterra (Barcelona), Spain.

Abstract
Text line segmentation is currently an important research topic in historical document processing. Because of variability in inter-line distance and base-line skew, line segmentation in unconstrained handwritten documents is very difficult. The task becomes more complicated when two consecutive text lines overlap or inter-penetrate. In this paper we propose a method, based mostly on morphological operations and the run-length smearing algorithm (RLSA), to segment individual text lines from unconstrained handwritten document images. First, RLSA is applied so that each word becomes a single component. Next, the foreground portion of this smoothed image is eroded to obtain seed components from the individual words of the document. Erosion is also applied to the background portions to find boundary information of the text lines. Finally, the lines are segmented using the positional information of the seed components and the boundary information. We tested our scheme on images of five different scripts and obtained encouraging results.

Keywords: Handwritten Document, Mathematical morphology, RLSA, Handwritten Line Segmentation.

1. Introduction
Text line segmentation is at present an important research topic in historical document processing, and it greatly affects the quality of word and character segmentation. Segmentation of unconstrained handwritten text lines is difficult because of variability in inter-line distance and baseline skew. In unconstrained handwritten text, components of two consecutive text lines may touch or overlap. In Indian languages this situation occurs frequently because of several modified characters. These overlapping or touching characters complicate the line segmentation task greatly.

There are many methods for text line segmentation [1,3-4,6,10,13-14,16]. Global projection analysis of black pixels is often used, but this method does not work properly when overlapping or skewed text lines occur. Some researchers have modified it using a partial projection method [16]: the input image is divided into vertical stripes, and segmentation is done based on the projection profiles of these stripes. Pal and Datta [10] proposed a modified stripe-based technique using the water reservoir concept. Some studies decompose the text into individual components [14] and group the components into individual text lines by means of a hierarchical clustering procedure. This method cannot assign a character to its correct group when text lines overlap or inter-penetrate. Techniques based on statistical modeling [4], thinning [13], linear programming [15], level sets [3], HMM [6], etc. are also used for text line segmentation.

In this paper we propose a scheme to segment unconstrained handwritten document pages of Indian scripts into individual text lines. We use both foreground and background information to segment each line. First, a horizontal Run Length Smearing Algorithm (RLSA) is applied to the input image. The threshold for RLSA is computed from the height information of the text lines, which is determined using the water reservoir concept. Next, the foreground portion of this smoothed image is eroded to obtain seed components from the individual words of the document. These seed components generally represent the central portion of the individual words of a text line. This erosion also reduces the touching effect of modified characters and makes the line segmentation task easier. Erosion is also applied to the background region to find the upper and lower boundary information of a text line. Finally, individual lines are segmented using the positional information of the seed components and the boundary information.

The rest of the paper is organized as follows. In Section 2 the properties of the different scripts used in our experiment are discussed. Estimation of text line height is described in Section 3. In Section 4 we explain our proposed line segmentation method. The experimental results are discussed in Section 5. The conclusion is given in Section 6.

2. Properties of Different Scripts Used in Our Experiment
We consider text lines of Devnagari, Bangla, Oriya, Gujarati and English scripts in our experiment. We briefly discuss the properties of the Devnagari, Bangla, Oriya and Gujarati scripts here. Devnagari is the most popular script in India, and Hindi, the most popular Indian language, is written in Devnagari script. Nepali, Sanskrit and Marathi are also written in Devnagari script. Moreover, Hindi is an official language of India and the third most popular language in the world [9]. In modern Devnagari script there are 14 vowels and 37 consonants. These characters may be called basic characters. Bangla, the second most popular language in India and the fifth most popular language in the world, is an ancient Indo-Aryan language. The Bangla script alphabet is used in texts of the Bangla, Assamese and Manipuri languages. Bangla is the national language of Bangladesh and the official language of the West Bengal State of India. The alphabet of the modern Bangla script consists of 50 basic characters (11 vowels and 39 consonants).

Figure 1. Basic characters of the (a) Bangla and (b) Devnagari alphabets. The first eleven characters are vowels and the rest are consonants in both alphabets.

In Devnagari and Bangla scripts, most of the characters have a horizontal line at the upper part. See Fig.1, where the basic characters of the Devnagari and Bangla scripts are shown. When two or more characters sit side by side to form a word, these horizontal lines touch and generate a long line called the head-line. In Devnagari/Bangla script a vowel following a consonant takes a modified shape which, depending on the vowel, is placed at the left, right (or both) or bottom of the consonant [2]. These are called modified characters. A consonant or vowel following a consonant sometimes takes a compound orthographic shape, which we call a compound character. A set of modified characters of the Bangla and Devnagari scripts is shown in Fig.2. Gujarati is a popular language spoken by about 46 million people in the Indian States of Gujarat, Maharashtra, Rajasthan, Karnataka and Madhya Pradesh. There are 46 basic characters (12 vowels and 34 consonants) in Gujarati. Oriya is another popular language and script of India. It is used mainly in the Orissa State of India, where it is the official language. The alphabet of the modern Oriya script consists of 52 basic characters (11 vowels and 41 consonants). As in the Devnagari and Bangla scripts, modified and compound characters are also present in the Oriya and Gujarati scripts. Since modified characters may sit at the top or bottom of the consonant in these scripts, words of two consecutive text lines may touch because of these modified characters. Such touching through modified characters complicates the line segmentation task.

Figure 2. Examples of Bangla and Devnagari modified characters.

3. Estimation of Text Line Height
To compute the height information of the text lines in a document page, we apply the water reservoir concept. The water reservoir principle is as follows. If water is poured from the top (bottom) of a component, the cavity regions of the component where water will be stored are considered as top (bottom) reservoirs [8]. For an illustration see Fig.3. For each reservoir we compute its height. By the height of a reservoir we mean the perpendicular distance of the base point (the deepest border point of the reservoir) from the water flow level of the reservoir. The heights of the reservoirs obtained from every component of a document page are computed, and a height histogram is built from these reservoir heights. In a handwritten document character sizes vary, and many characters may touch because of the handwriting styles of different individuals. As a result, connected component analysis cannot give proper text line height information. To get proper height information we take the average height (HL) of those reservoirs whose heights lie in the right half of the height histogram. This is done to ignore the small reservoirs in the height computation. HL gives an idea of the height of a text line, and it is used to determine the parameters of the smoothing algorithm and the size of the structuring element of the morphological operations used in our segmentation approach.
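The averaging step can be illustrated with a minimal sketch. It assumes the reservoir heights have already been measured, and it interprets "right half of the height histogram" as the heights at or above the midpoint between the smallest and largest height. Both the function name and that interpretation are assumptions of this example, not details given in the text.

```python
def estimate_line_height(reservoir_heights):
    """Average the reservoir heights that fall in the upper half of their range.

    Assumption: the "right half" of the histogram is everything at or above
    the midpoint between the minimum and maximum observed height.
    """
    if not reservoir_heights:
        raise ValueError("need at least one reservoir height")
    lo, hi = min(reservoir_heights), max(reservoir_heights)
    midpoint = (lo + hi) / 2.0
    # Small reservoirs (left half of the histogram) are ignored.
    tall = [h for h in reservoir_heights if h >= midpoint]
    return sum(tall) / len(tall)


# Hypothetical heights in pixels: a few small cavities and three large ones.
heights = [2, 3, 3, 4, 18, 20, 22]
hl = estimate_line_height(heights)
print(hl)
```

With these made-up heights the small cavities are discarded and only the three large reservoirs contribute to HL.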

Figure 3. A top water reservoir and its different features are shown. Water reservoir is marked by grey shade.

4. Proposed Line Detection Algorithm
We work on binary images. To convert the original grey-level document images into binary images we apply the algorithm due to Otsu [7]. The binary image may contain some small components, and we remove these before line segmentation. The original image and the resultant binary image are shown in Fig.4(a) and 4(b).
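For readers unfamiliar with Otsu's method, the following is a minimal sketch of the standard threshold selection (maximising the between-class variance of the grey-level histogram). The flat pixel list, the 8-bit range and the convention that dark pixels are text are assumptions of this example; the paper does not give implementation details.

```python
def otsu_threshold(pixels, levels=256):
    """Return the grey level t that maximises the between-class variance."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * hist[i] for i in range(levels))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t (dark class)
    sum0 = 0  # sum of grey levels in the dark class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (m0 - m1) ** 2
        if between_var > best_var:
            best_t, best_var = t, between_var
    return best_t


# Hypothetical image: dark ink around level 10, bright paper around level 200.
pixels = [10] * 50 + [200] * 50
t = otsu_threshold(pixels)
binary = [1 if p <= t else 0 for p in pixels]  # 1 = text, 0 = background
```

Any threshold between the two grey-level clusters separates them equally well here; the sketch returns the first such level.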

4.1. Foreground Smoothing

The RLSA algorithm links together neighboring black areas that are separated by less than a given distance (the smoothing threshold). In other words, it replaces a sequence of background pixels between two object pixels by the object pixel value in a specified direction. Normally it is used to fill background pixel-runs shorter than a pre-defined threshold in the horizontal or vertical direction. In our approach the method is applied only in the horizontal direction, i.e. row-by-row smoothing is done. The smoothing threshold is set to 2.5*HL in our experiment. The smoothed image is shown in Fig.4(c). From the figure it can be noted that the middle parts of the text lines are mostly black because of RLSA.

Sometimes, because text from two consecutive lines overlaps, two or more components touch vertically after smoothing. We detect such touching from the horizontal histogram of each smoothed component. When two smoothed components of two different text lines touch, we generally get a peak-and-valley shape. We analyze the histogram to find possible valleys between peaks. Generally, the peaks come from the parts of the different text lines that touch. If a valid valley is found between two peaks, the valley region is marked as a possible touching area. We then analyze the touching area to decide whether the touching was created by RLSA or not. To do that we consider the corresponding area of the initial image (before RLSA) and trace its contour. If, starting from the topmost point of the considered area, we can reach its bottommost point by the tracing, we conclude that the touching existed before RLSA. To segment such a touching we analyze the contour points of that area and detect the touching point from the structural shape of the contour, using the angular information of the contour points and run-length information. If the touching was created by RLSA, we replace the touching area by the corresponding area of the initial image.
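The horizontal smoothing step can be sketched as follows. The representation (a list of rows with 1 for text and 0 for background) and the function name are assumptions of this example. Only the smearing itself is shown; the touching analysis described above is not included.

```python
def horizontal_rlsa(image, threshold):
    """Fill background runs shorter than `threshold` that lie between two
    foreground pixels of the same row (1 = text, 0 = background)."""
    smoothed = []
    for row in image:
        new_row = list(row)
        last_fg = None  # column of the most recent foreground pixel
        for x, value in enumerate(row):
            if value == 1:
                if last_fg is not None:
                    gap = x - last_fg - 1
                    if 0 < gap < threshold:
                        for k in range(last_fg + 1, x):
                            new_row[k] = 1
                last_fg = x
        smoothed.append(new_row)
    return smoothed


# In the paper the threshold is 2.5*HL; a small value is used here for display.
image = [[0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]]
result = horizontal_rlsa(image, threshold=3)
print(result[0])
```

The gap of two background pixels is filled because it is shorter than the threshold, while the gap of four pixels is left untouched. Runs that touch the image border are never filled, since they are not enclosed by two foreground pixels.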

4.2. Morphological Operation for Foreground and Background Information Extraction

We have used morphological operations, mainly erosion, to extract the useful foreground and background information. Erosion is one of the two fundamental operations in morphological image processing on which all other morphological operations are based. For details see [12]. After RLSA we have a smoothed image in which the foreground consists of black text regions and the background consists of white regions. By erosion we extract information from the foreground and background portions that is very helpful for line segmentation.

Figure 4. (a) Example of Bangla handwritten document image. (b) Binary image after pre-processing of (a). (c) Horizontal RLSA result of (b).
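Erosion with a rectangular structuring element anchored at its centre, the operation used in Section 4.2, can be sketched as below. The function names, the treatment of pixels outside the image as 0, and the rounding of the HL-based sizes to whole pixels are assumptions of this example. The size factors themselves (0.5*HL by 5*HL for the background, 0.5*HL by 0.65*HL for the foreground) are the ones quoted in Sections 4.2.1 and 4.2.2.

```python
def erode(image, se_height, se_width):
    """Binary erosion with a rectangular structuring element anchored at its centre.

    `image` is a list of rows holding 1 (pixel in the set) or 0.
    Pixels outside the image are treated as 0 (an assumption of this sketch).
    """
    rows, cols = len(image), len(image[0])
    top, left = se_height // 2, se_width // 2
    out = [[0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            # A pixel survives only if the whole rectangle around it is set.
            out[y][x] = int(all(
                0 <= y + dy < rows and 0 <= x + dx < cols and image[y + dy][x + dx] == 1
                for dy in range(-top, se_height - top)
                for dx in range(-left, se_width - left)
            ))
    return out


def structuring_sizes(hl):
    """(height, width) in pixels for the two erosions, rounded to whole pixels."""
    def size(factor):
        return max(1, int(round(factor * hl)))
    return {"background": (size(0.5), size(5)), "foreground": (size(0.5), size(0.65))}


def invert(image):
    """Swap foreground and background so the background can be eroded."""
    return [[1 - v for v in row] for row in image]


smoothed = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
seeds = erode(smoothed, 3, 3)                       # foreground erosion
background_eroded = erode(invert(smoothed), 1, 3)   # background erosion
```

The toy structuring elements are tiny so the result can be checked by eye; on a real page the sizes would come from `structuring_sizes(HL)`.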

4.2.1. Background Information Extraction
The background region of the run-length smoothed image is eroded to extract some obstacle lines. These obstacle lines act as “Separator Lines” (SL) between two consecutive text lines. The structuring element for this erosion is rectangular, with height 0.5*HL and width 5*HL. These values were determined experimentally. The anchor point is set at the centre of the structuring element. The background eroded image of Fig.4(c) is shown in Fig.5(a). For each eroded component of the background region, we take the upper and lower profiles in each column and compute the mid-point between them in each column. A line is fitted to these mid-points, and the resultant line is the SL. The SLs obtained from Fig.5(a) are shown in Fig.5(b). The left and right end points of an SL are extended horizontally in both directions until the SL (i) touches a foreground part, (ii) touches the left or right profile of the smoothed foreground image, or (iii) finds another SL within a vertical distance HL. The extended SLs of Fig.5(b) are shown in Fig.5(c). Note that we have computed the left and right profiles of the smoothed image, and this profile information has been used for line extraction.

Figure 5. (a) The eroded portions of the background are shown by black regions. (b) Separator Lines (SL) obtained by joining the mid-points of the upper and lower profiles of the eroded components. (c) Extended SLs are shown along with the text image.

4.2.2. Foreground Information Extraction

Morphological erosion has also been applied to the foreground part of the smoothed image to extract foreground seed components (FSC). By an FSC we mean an isolated eroded component obtained from the smoothed foreground part. These FSCs are generally representative of the word components in the document. The structuring element for foreground erosion is rectangular, with height 0.5*HL and width 0.65*HL. The anchor point is set at the centre of the structuring element. Let R be the set of all FSCs of an image. In handwritten documents the text lines sometimes touch or overlap each other because of the modified characters of the scripts as well as the ascending and descending parts of characters. As a result the smoothed text lines also touch, and we may sometimes get a big FSC even though some touching is removed after RLSA as discussed in Section 4.1. To detect the FSC portion coming from such a touching part, we scan the FSC region column-wise. The columns where the height of the FSC is bigger than HL are removed, so that touching portions do not affect our line segmentation method. In some cases a very small FSC may be obtained. We delete these small FSCs as well for better line segmentation. Generally an FSC should lie on a text line, but because of touching and modified characters some FSCs may appear in the portions between two text lines. To take care of such situations, the FSCs which touch an SL are also removed. The remaining FSCs of R are used for line extraction, and we call them the candidate FSCs. The candidate FSCs obtained from the image given in Fig.4(a) are shown in Fig.6. We use these candidate FSCs for line segmentation.

Figure 6. Foreground seed components are shown by black regions.

4.3. Line Segmentation

The segmented lines are obtained by joining candidate FSCs. The SLs guide the FSC joining to get proper segmentation. For each candidate FSC we compute two reference points, named the “left” (LR) and “right” (RR) reference points. The LR of an FSC is the centre of gravity of the subset of its pixels that lie between the leftmost column and the column at a distance HL from the leftmost column. The RR is calculated similarly from the rightmost column of the FSC. If the width of an FSC is less than 2*HL, then the LR (RR) of the FSC is computed from the left (right) half of the FSC. The left and right reference points of the candidate FSCs of Fig.6 are shown in Fig.7. The two reference points of each FSC are marked by black points on the FSC.
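The reference-point computation can be sketched as follows. An FSC is represented here as a list of (x, y) pixel coordinates, and "up to a width HL from the leftmost column" is read as the HL leftmost columns. Both choices, and the function names, are assumptions of this example.

```python
def centre_of_gravity(pixels):
    """Mean x and mean y of a non-empty list of (x, y) pixels."""
    n = len(pixels)
    return (sum(x for x, _ in pixels) / n, sum(y for _, y in pixels) / n)


def reference_points(fsc_pixels, hl):
    """Return (LR, RR) for one FSC given as a list of (x, y) pixels."""
    xs = [x for x, _ in fsc_pixels]
    x_min, x_max = min(xs), max(xs)
    width = x_max - x_min + 1
    # Use a band of width HL at each end, or half the FSC when it is narrow.
    span = hl if width >= 2 * hl else width / 2.0
    left = [(x, y) for x, y in fsc_pixels if x < x_min + span]
    right = [(x, y) for x, y in fsc_pixels if x > x_max - span]
    return centre_of_gravity(left), centre_of_gravity(right)


# Hypothetical FSC: a 10-pixel-wide, 2-pixel-high bar, with HL = 3.
wide_fsc = [(x, y) for x in range(10) for y in range(2)]
lr, rr = reference_points(wide_fsc, hl=3)

# A narrow FSC (width 4 < 2*HL) falls back to its left and right halves.
narrow_fsc = [(x, 0) for x in range(4)]
lr_n, rr_n = reference_points(narrow_fsc, hl=3)
```

For the wide bar the LR is the centroid of columns 0 to 2 and the RR is the centroid of columns 7 to 9.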

Figure 7. Left and right reference points are shown by black dots on each FSC.

Let A and B be two candidate FSCs. To join B with A, we compute a searching zone (as shown in Fig.8) from the right reference point (RR) of A. The length (L) of the searching zone is 10*HL and its width (W) is 1.5*HL. We do not consider the full rectangular area in the beginning part of the searching zone. The beginning part extends up to a length HL from RR, and the searching zone is triangular in this part. The remaining part of the searching zone, of length 9*HL, is rectangular. This is done so that two vertically stacked candidate FSCs are not joined. See Fig.8, where the searching zone of the RR of ‘A’ is shown by hatched lines. We join B with A if the following conditions are satisfied: (i) the LR of B lies in the searching zone of the RR of A; (ii) the searching zone of the RR of A does not cross any SL.
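Condition (i) can be illustrated with a small membership test. The sketch assumes the zone is centred vertically on RR, extends to the right, and that the triangular part widens linearly from a point at RR to the full width W at distance HL. The text gives only L, W and the triangular/rectangular split, so this geometry is an assumption. Condition (ii), the SL crossing check, is not implemented here.

```python
def in_search_zone(rr, point, hl):
    """True if `point` (e.g. the LR of B) lies in the searching zone of `rr`.

    Assumed geometry: zone centred on rr's row, length 10*HL to the right,
    width 1.5*HL, triangular over the first HL and rectangular afterwards.
    """
    dx = point[0] - rr[0]
    dy = abs(point[1] - rr[1])
    length = 10.0 * hl
    half_width = 0.75 * hl  # W / 2 with W = 1.5 * HL
    if dx <= 0 or dx > length:
        return False
    if dx < hl:
        # Triangular part: the allowed vertical offset grows with distance.
        return dy <= half_width * dx / hl
    return dy <= half_width


hl = 20
rr_a = (100, 50)
same_line = in_search_zone(rr_a, (300, 60), hl)   # far right, small offset
stacked = in_search_zone(rr_a, (105, 60), hl)     # almost directly below
too_far = in_search_zone(rr_a, (310, 50), hl)     # beyond 10*HL
```

The triangular start is what rejects the nearly vertical neighbour in this example, which matches the stated purpose of not joining two vertically stacked FSCs.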

Figure 8. Searching zone of the right reference point of ‘A’ is shown here.

In the first condition we consider the foreground positional information of the FSCs, in terms of component overlapping. In the second condition we use the SLs, which are based on background information. If both conditions are satisfied, we join the two components A and B. As soon as B is joined to A, we try to join to B the nearest other FSC that satisfies the two conditions with respect to B. In this way all possible FSCs of a line are joined. After joining, we extend the LR point of the leftmost FSC of the joined group towards the left and the RR point of the rightmost FSC towards the right. If this extension reaches the border of the image, a text line is detected. In this way the FSCs of individual lines are clustered by joining them through a line segment, giving the individual text lines.

To get the characters of a text line, we collect all the RLSA components that this line passes through and assign the character portions of these RLSA components to the segmented line. Sometimes some small RLSA components do not touch the line. We assign such components to their nearest line. If two or more line segments are obtained within a single text line because of a long distance between two FSCs, we group them into a single line by checking the positional relationship of the leftmost and rightmost points of these line segments, along with their gap from the line lying above them. If the line-gap from the segmented line above is similar for these line segments, we group them into one line. By line-gap we mean the distance between two consecutive lines obtained after FSC joining. The line-gap between the first two lines of Fig.9 is shown by an arrow. Note that the joining of FSCs is done in top-to-bottom fashion, starting from the topmost line. The line joining result of Fig.7 is shown in Fig.9.

Figure 9. Result of FSC joining.

5. Experimental Results and Discussions
For the experiment, 125 handwritten document images were collected from individuals of different professions. The data cover five different languages: Bangla, Devnagari, Oriya, Gujarati and English. We noted that the dataset contains a variety of writing styles. We considered single-column document pages only. To check whether a text line is detected correctly, we draw a marker on the text line after its detection. By viewing the results on the computer display we calculate the line detection accuracy manually. From the experiment we found that on average our system detects the text line properly in 92.68% of cases. Line detection results of our proposed scheme on different scripts are shown in Fig.10. We considered 520, 195, 406, 210 and 115 text lines of Bangla, Devnagari, English, Oriya and Gujarati scripts, and we noted that 94.23%, 93.33%, 96.06%, 94.76% and 93.04% of the text lines, respectively, were correctly identified. For the English documents we also considered some images from the IAM database [5] as well as some images from examination answer sheets.

We noted that most of the errors occur when two consecutive text lines are very near to each other. Our system detects all the lines of a document accurately when a gap is present between consecutive lines. The SLs are very useful in preventing an FSC from being joined with an FSC of the next line when the two are very near. If an SL is missing and the candidate FSCs of two consecutive lines are too close, an erroneous result occurs; this is the main drawback of our approach. Our method computes a global text size, and the line detection thresholds depend on the text-line height. If text lines of several sizes exist in a single document, the method needs to be modified in how it determines the size of the structuring element. Layout understanding methodologies may help to detect different uniform text zones, and a separate text line size can then be computed for each zone. However, we noted that handwriting size variation within a document page of a single writer is very rare in our documents.

There is not much work on handwritten text line extraction for Indian languages. Recently Basu et al. [1] proposed a method for text line extraction and obtained 90.34% accuracy on Bangla script and 91.44% on English script. We obtained 92.68% overall accuracy with our proposed approach when tested on five different scripts.
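As a reading aid, the per-script percentages can be turned back into approximate counts of correctly identified lines. The counts below are reconstructed by rounding and are an assumption of this example; they are not reported in the text. The pooled rate over these reconstructed counts is about 94.6%, which differs from the 92.68% average quoted above. The text does not say how that average was computed, so the two figures are not reconciled here.

```python
# Line counts and accuracies as quoted in Section 5.
lines = {"Bangla": 520, "Devnagari": 195, "English": 406, "Oriya": 210, "Gujarati": 115}
rates = {"Bangla": 94.23, "Devnagari": 93.33, "English": 96.06, "Oriya": 94.76, "Gujarati": 93.04}

# Reconstructed (assumed) number of correctly identified lines per script.
correct = {script: round(lines[script] * rates[script] / 100.0) for script in lines}

total_lines = sum(lines.values())
total_correct = sum(correct.values())
pooled_rate = 100.0 * total_correct / total_lines

for script in lines:
    print(script, correct[script], "of", lines[script])
print("pooled: %.2f%%" % pooled_rate)
```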

6. Conclusion
In this paper a script-independent line segmentation algorithm is developed for off-line unconstrained handwritten documents. The proposed scheme is based on morphological operations on the foreground and background portions of the document. We tested our scheme on different scripts, namely Bangla, Devnagari, Oriya, Gujarati and English, and obtained encouraging results.

Figure 10. Results of line segmentation are shown on (a) Bangla (b) Gujarati (c) Oriya (d) English text.

7. References
[1] S. Basu, C. Chaudhuri, M. Kundu, M. Nasipuri and D. K. Basu, “Text line extraction from multi-skewed handwritten documents”, Pattern Recognition, vol. 40(6), June 2007, pp. 1825-1839.
[2] B. Chaudhuri and U. Pal, “Skew angle detection of digitized Indian Script documents”, IEEE PAMI, vol. 19, 1997, pp. 182-186.
[3] Y. Li, Y. Zheng, D. Doermann and S. Jaeger, “A new algorithm for detecting text line in handwritten documents”, In Proc. 10th IWFHR, 2006, pp. 35-40.
[4] J. Liang, I. Philips and R. M. Haralick, “A statistically based highly accurate text-line segmentation method”, In Proc. 5th ICDAR, 1999, pp. 551-554.
[5] U. Marti and H. Bunke, “A full English sentence database for off-line handwriting recognition”, In Proc. 5th ICDAR, 1999, pp. 705-708.
[6] S. Nicolas, Y. Kessentini, T. Paquet and L. Heutte, “Handwritten Document using Hidden Markov Random Fields”, In Proc. 8th ICDAR, 2005, pp. 212-216.
[7] N. Otsu, “A Threshold selection method from grey level histogram”, IEEE Trans. on SMC, vol. 9, 1979, pp. 62-66.
[8] U. Pal, A. Belaïd and C. Choisy, “Touching numeral segmentation using water reservoir concept”, Pattern Recognition Letters, vol. 24, 2003, pp. 261-272.
[9] U. Pal and B. B. Chaudhuri, “Indian script character recognition: A Survey”, Pattern Recognition, vol. 37, 2004, pp. 1887-1899.
[10] U. Pal and S. Datta, “Segmentation of Bangla Unconstrained Handwritten Text”, In Proc. 7th ICDAR, 2003, pp. 1128-1132.
[11] U. Pal and P. P. Roy, “Multioriented and curved text lines extraction from Indian documents”, IEEE Trans. on SMC, Part B, vol. 34, 2004, pp. 1676-1684.
[12] J. Serra, Image Analysis and Mathematical Morphology, Academic Press, London, 1982.
[13] S. Tsuruoka, Y. Adachi and T. Yoshikawa, “The Segmentation of a Text line for a Handwritten Unconstrained Document using Thinning Algorithm”, In Proc. 7th IWFHR, 2000, pp. 505-510.
[14] W. Xiaoying and C. G. Leedham, “Separating lines and words in unconstrained handwriting”, In Proc. 8th IGS, 1997, pp. 117-118.
[15] B. Yanikoglu and P. A. Sandon, “Segmentation of off-line cursive handwriting using linear programming”, Pattern Recognition, vol. 31, 1998, pp. 1825-1833.
[16] A. Zahour, B. Taconet, P. Mercy and S. Ramdane, “Arabic hand-written text-line extraction”, In Proc. 6th ICDAR, 2001, pp. 281-285.