Segmenting Sinhala Handwritten Characters

International Journal of Conceptions on Computing and Information Technology Vol. 2, Issue. 4, June’ 2014; ISSN: 2345 - 9808

Chamari Silva and Cyril Kariyawasam
Department of Electrical and Information Engineering, Faculty of Engineering, University of Ruhuna, Hapugala, Sri Lanka.
[email protected] and [email protected]

Abstract- Sinhala is a language used by the Sinhalese people who live in Sri Lanka. In contrast to English letters, Sinhala letters are round in shape and straight lines are almost nonexistent. Unlike printed characters, handwritten characters may sometimes touch each other, and they also vary with writing style. Segmentation of a document image is one of the basic and critical tasks, and it has a great impact on the character recognition process. The complications that are present in handwritten documents make the segmentation process a challenging task. The method suggested in this paper is to use Self-Organizing Feature Maps (SOFM) for the segmentation of touching character pairs.

nearly absent from the alphabet. This is because Sinhala used to be written on dried palm leaves [3]. Dried palm leaves tend to split along the veins when straight lines are written on them, and hence round shapes were preferred. The simplest letter in the Sinhala alphabet is “ර”, which is also known as the ‘mala pothe akura’. In comparison with written English, written Sinhala does not have a one-to-one mapping for the letters “v” and “w”. In transliteration, both letters are transliterated into one letter, ව.

Keywords- Sinhala handwritten documents; character segmentation; Self Organizing Feature Maps

Sinhala script is categorized as a segmental writing system, and therefore consonant-vowel sequences are identified as a unit. The vowels fall into two categories, namely independent and diacritic. A vowel is independent when it is not attached to a consonant, and diacritic when it is attached to a consonant. In the second case the vowel is denoted by one or more strokes positioned around the consonant. Depending on the vowel, the diacritic can be attached above, below, after or before the consonant. A pure vowel is generally used only at the beginning of a word, and has a distinct symbol [4]. Fig. 1 shows four examples of the two types of vowels, independent and diacritic, and their associations.

Touching

I. INTRODUCTION
The Sinhala language is used by the Sinhalese, an ethnic group native to Sri Lanka which comprises more than 70% of the population of the country. It is one of the official and national languages of Sri Lanka. Sinhala belongs to the Indo-Aryan branch of the larger Indo-European family of languages. The Brahmi [1] script is considered the origin of the Sinhala writing system, which is known to have existed since the third to second century B.C.E. [2]. The Sinhala alphabet is therefore a member of the Brahmic family of scripts. The Brahmi script had five short vowels (a, i, u, e, o), and five long vowels which are the long versions of the short vowels (â etc.). In modern Sinhala, these five short vowels, along with their long versions, are preserved together with 32 consonants.

Written Sinhala also has a cursive variety, but it has no notion of block capital letters as English does.

In some instances pairs of consonants may be combined, and they are then said to form a conjunct letter, e.g., ksha. Forming these conjunct letters is not compulsory, and has become less common in modern use.

The Sinhala language has two alphabets due to the presence of two sets of letters. Suddha Sinhala (pure Sinhalese), or eḷu hōḍiya (Eḷu alphabet), is the core set and can represent all basic native sounds in the Sinhala language. After the thirteenth century, Sinhala was strongly influenced by the Sanskrit and Pali languages. The ‘misra’ Sinhala (mixed Sinhalese), which is an extended set of the core set, was formed in order to render Sanskrit and Pali words. In later ages Sinhala was also influenced by the English language. The modern Sinhala alphabet therefore also includes the letter ‘ෆ’ for the sound of the English letter ‘f’. Sinhala letters are written from left to right. Most of the letters used in Sinhala are round in shape and straight lines are


Figure 1. Two types of vowels, (a) independent and (b) diacritic.

Sinhala characters can be categorized into three main groups according to their height: (1) characters having a normal height, (2) characters having an ascender, and (3) characters having a descender. However, this categorization becomes more complicated when one or several vowel signs are attached to a consonant. An example of the categories can be seen in Fig. 2. Only twenty three characters in the Sinhala alphabet, given in Fig. 3, were considered in this research work.

Unlike printed character recognition, in handwritten character recognition the difficulty of segmenting characters correctly is always present. Correct segmentation is a basic and critical aspect of unconstrained handwritten document recognition, because two or more characters together in a word can be identified as one character by error. There are four different touching character groups according to the way the characters touch each other: overlapping, touching, connecting and intersecting [5], as shown in Fig. 4. Among the four groups, the connecting variety comes under cursive writing. Among the other three, overlapping and touching characters occur more commonly than intersecting characters in ordinary practice.

Of the two dimensions of a letter, width and height, width is the more prominent in the context of character segmentation. The selected set of Sinhala letters can be categorized into two main categories according to their width. Table I shows this categorization.

II. PREVIOUS WORK
The challenging nature of handwritten character recognition has drawn the attention of researchers all over the world for a long time [6], [7], [8], [9]. These researchers have explored many areas of computational pattern recognition, with techniques such as artificial neural networks [10] and statistical approaches such as Hidden Markov Models [11], to recognize handwritten words or characters. The first step of any handwritten character recognition technique is the precise segmentation of text images into lines, words and characters.

Figure 3. The set of letters considered in this research

Most of the proposed techniques use the horizontal projection profile for line extraction [12], [13]. For text lines whose skew angle varies from line to line, Hough-based methods have been proposed [14]. For word segmentation, most of the proposed techniques measure the spatial gap between consecutive connected components and apply a threshold to separate “within” word gaps from “between” word gaps [15]. Most of this work makes the following assumptions: each connected component belongs to only one word, and gaps between characters are smaller than gaps between words.

An approach that uses Hidden Markov Models for Sinhala handwritten character recognition can be found in [13]. However, the method proposed in [13] does not accommodate the full Sinhala alphabet. In [13], line extraction is done using the zero values in the projection profile, which correspond to the horizontal gaps between lines. Pre-formatted paper containing reference lines was used to collect the handwriting, and these lines were eliminated during the binarization of the image. The individual words and characters were then extracted using the vertical projection profile of each text line. That work does not address the possibility of touching characters in the text or their segmentation.

The work presented by M.L.M Karunanayaka, N.D Kodikara, and G.D.S.P Wimalaratne in [5] has addressed the issue of touching characters. At the beginning, they segment the images using the vertical projection profile method, and at that level touching characters are considered as a single entity.

Figure 2. Grouping of Sinhala characters, (a) a letter having a normal height, (b) a letter which has an ascender, (c) a letter which has a descender, and (d) a letter with a vowel sign attached with both ascender and descender sections


Figure 4. Touching character groups, (a) overlapping, (b) touching, (c) intersecting, and (d) connecting

TABLE I. WIDTH CATEGORIES OF SELECTED LETTERS

Short width category: ඊ, ර, ද
Long width category: අ, ඉ, උ, එ, ක, ග, ජ, ට, ඩ, ණ, ත, න, ප, බ, ම, ය, ල, ව, ස, හ, ළ

Then the segmented character entities are further classified into two categories: touching characters and single characters. In the process, the average character width has been estimated from the width of the image and the number of characters that occur in that image, which is obtained from the vertical projection profile, as given in (1).

Then the touching characters have been distinguished using the procedure given in (2).
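Equations (1) and (2) are not reproduced in this text, so the following Python fragment is only a minimal sketch of the idea described above. It assumes that the average width is the plain ratio of image width to character count, and that a segment is flagged when it exceeds that average by a multiplicative tolerance. The function names, the `factor` value and the sample widths are hypothetical and are not taken from [5].

```python
def average_char_width(image_width, num_segments):
    """Estimate the average character width as image width / segment count."""
    if num_segments <= 0:
        raise ValueError("num_segments must be positive")
    return image_width / num_segments


def is_touching(segment_width, avg_width, factor=1.5):
    """Flag a segment as a touching-character candidate.

    `factor` is an assumed tolerance, not a value given in [5].
    """
    return segment_width > factor * avg_width


# Hypothetical segment widths (pixels) from a vertical projection profile.
widths = [20, 22, 45, 19]
avg = average_char_width(sum(widths), len(widths))
flags = [is_touching(w, avg) for w in widths]
```

With these sample values only the third segment is wider than the assumed threshold, so it alone would be passed on to the touching character stage.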

After distinguishing touching characters, the touching character group has been identified and segmented. Connected component labeling has been used to identify the presence of overlapping characters and to segment them. To distinguish between the other three groups, the Water Reservoir Concept discussed in [16] has been used.

The Self-Organizing Feature Map (SOFM), invented by Teuvo Kohonen [17], is a form of artificial neural network which is trained using unsupervised learning. SOFMs make it possible to represent multidimensional data in a space of much lower dimension. An SOFM also stores information in such a way that it preserves the topological relationships of the training samples: points that are close to each other in the input space are mapped to nearby map units in the SOFM. The SOFM can therefore work as a cluster analysis tool for high dimensional data.

No related work on Sinhala handwritten character segmentation with SOFMs could be found. The work presented by Fajri Kurniawan et al. [18] has explored handwritten character segmentation with SOFMs for the English language. The main idea behind their proposal is that a touching pair can be divided into three main regions: left, right and middle. The authors believe that these regions have unique characteristics and that, by mapping the touching characters into a feature vector space, a clustering mechanism can be used to provide the segmentation. The basic steps of the method proposed in [18] are:
(1) Estimating the core zone of the characters
(2) Extraction of feature points as input to the SOFM
(3) Determining the segmentation path

III. IDENTIFICATION OF THE PROBLEM
Character recognition by computer is a problem that has been studied for a long time. To date, machine-printed character recognition has been developed further than handwritten character recognition [19], [20]. The unsteady nature of handwritten characters makes the recognition task difficult. The accurate segmentation of text images into lines, words and characters is a key factor in the accuracy of the character recognition process. This research focuses on segmenting Sinhala handwritten documents into lines, words and characters, and on addressing touching character segmentation with the use of SOFMs.

IV. SCOPE OF THE RESEARCH
The problem of character segmentation of off-line non-cursive Sinhala handwritten characters will be addressed in the proposed work. Twenty three characters from the Sinhala alphabet, given in Fig. 3, were chosen for the proposed work. The chosen characters do not include the vowel signs or the less frequently used characters. It is assumed that the characters of two consecutive text lines are not touching or overlapping and that there is no slant in the text lines.

V. METHODOLOGY

This section explains how the samples were preprocessed, and how line segmentation, word segmentation, character segmentation and the segmentation of touching characters were performed.

A. Preprocessing
A4 size papers were used for collecting handwriting samples. Each sample document included 5 to 10 lines. The documents were scanned to obtain the images required for processing. The images were converted into binary format using Otsu's thresholding method [21].

B. Line Segmentation
It is assumed that the characters of two consecutive text lines are not touching or overlapping. Under that assumption, the image is segmented into lines using the horizontal projection profile. Fig. 5 shows how the horizontal projection profile was used for line segmentation. If the characters of two consecutive text lines are not touching or overlapping, the zero values in the horizontal projection profile correspond to the white space between the lines.

C. Word Segmentation
Each of the segmented lines goes through the word segmentation process. It was assumed that the characters of two different words do not touch each other. Under that assumption, the vertical projection profile was calculated for each line segment. An example of the vertical projection profiles of two text lines is given in Fig. 6.
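The projection profile steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation used in this work: the binary image is assumed to be a list of rows with 1 for ink and 0 for background, and the names `projection_profile`, `split_on_gaps` and `min_gap` are hypothetical. `min_gap` stands in for a gap threshold, so that gaps narrower than it do not split a run.

```python
def projection_profile(image, axis):
    """Count ink pixels per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) for row in image]
    return [sum(col) for col in zip(*image)]


def split_on_gaps(profile, min_gap=1):
    """Return (start, end) pairs of non-zero runs, with end exclusive.

    Runs separated by fewer than `min_gap` zero entries are merged.
    """
    runs = []
    for i, value in enumerate(profile):
        if value == 0:
            continue
        if runs and i - runs[-1][1] < min_gap:
            runs[-1][1] = i + 1      # gap too small: extend the previous run
        else:
            runs.append([i, i + 1])  # start a new run
    return [tuple(r) for r in runs]


# Tiny hypothetical page: two text lines separated by one blank row.
page = [
    [0, 1, 1, 0, 1, 0, 0, 0, 1, 1],
    [0, 1, 0, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
]
lines = split_on_gaps(projection_profile(page, axis=0))
first_line = page[lines[0][0]:lines[0][1]]
columns = projection_profile(first_line, axis=1)
characters = split_on_gaps(columns)          # every zero column splits
words = split_on_gaps(columns, min_gap=3)    # only gaps of 3+ columns split
```

The same run-splitting routine serves both directions: applied to row sums it yields text lines, and applied to column sums of one line it yields character or word candidates depending on the gap value chosen.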



Figure 5. Use of horizontal projection profile for line segmentation

As the gap between two characters within a word is relatively smaller than the gap between two words, a threshold was defined to identify the gap between words.

D. Basic Character Segmentation
Each segmented word goes through the character segmentation process. Again, the vertical projection profile was used for the basic segmentation of characters; at this point, any touching characters that are present are identified as one single unit. Fig. 7 gives an example of basic character segmentation that includes two touching characters.

E. Touching Character Segmentation
Touching characters must be identified before moving on to their segmentation. To identify a touching character, an approach similar to [5] has been used. After distinguishing touching characters, connected component labeling has been used to identify the presence of overlapping characters and to segment them. In the connected component labeling process, if two or more labels are present, each connected component is considered a single character. The remaining segments, which receive only one label in the connected component labeling, are recognized as touching characters.

VI. FUTURE WORK

The method proposed in [5] for identifying a touching character can be improved using the width categories of the considered Sinhala letters. An average character width can be calculated for each of the two width categories and, after the basic character segmentation, the segments that are wider than the average long width plus a tolerance can be identified as touching characters.
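A rough sketch of how this width test could be combined with the connected component labeling step of the methodology is given below. It is an illustration only: the 8-connectivity, the `tolerance` value, and the names `count_components` and `classify_segment` are assumptions, and it further assumes that each letter of the selected set forms a single connected component.

```python
from collections import deque


def count_components(segment):
    """Count 8-connected ink components in a binary segment (list of rows)."""
    height, width = len(segment), len(segment[0])
    seen = [[False] * width for _ in range(height)]
    count = 0
    for r in range(height):
        for c in range(width):
            if segment[r][c] == 0 or seen[r][c]:
                continue
            count += 1                      # new, unvisited component
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:                    # breadth-first flood fill
                y, x = queue.popleft()
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and segment[ny][nx] == 1
                                and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


def classify_segment(segment, avg_long_width, tolerance=0.2):
    """Label a segment 'single', 'overlapping' or 'touching'.

    `tolerance` is an assumed fraction of the average long width.
    """
    if len(segment[0]) <= avg_long_width * (1 + tolerance):
        return "single"
    if count_components(segment) >= 2:
        return "overlapping"   # separate strokes sharing some columns
    return "touching"          # one component, too wide for one letter


joined = [[1, 1, 0, 0, 1, 1],
          [1, 1, 1, 1, 1, 1]]
overlap = [[1, 1, 1, 1, 0, 0],
           [0, 0, 0, 0, 0, 0],
           [0, 0, 1, 1, 1, 1]]
labels = [classify_segment(s, avg_long_width=4) for s in (joined, overlap)]
```

In this toy case `joined` is one wide component and `overlap` has no empty column but two separate components, which is the situation the vertical projection profile alone cannot resolve.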

The method proposed in [18] using SOFMs will then be used to segment the touching characters.

I) Core zone estimation
The ascender and descender components of the characters will be removed in order to obtain the core zone of the characters. As suggested by Kurniawan F. et al., the core zone estimation is done in order to improve the clustering process [18].

II) Training dataset
A dataset which contains all the touching letter combinations of the considered 23 Sinhala characters will be used as the training dataset.

III) Feature extraction and using SOFM for clustering
A simple feature extraction method will be used to generate the feature vector. The maximum number of feature points will be taken as a variable parameter. The touching character segment is scanned from left to right within the core zone to generate the feature vector. According to the maximum number of feature points, some of the foreground pixels will be selected to generate the feature vector. The feature vector will then be clustered using the SOFM in order to segment the touching pair. To do that, the SOFM will be configured as a one dimensional layer having three neuron nodes. The three neuron nodes are devoted to the left, middle and right regions respectively. During training, each node moves closer to its region. After this clustering process, the segmentation of the touching pair can be performed based on the position of the middle neuron node. Under the assumption that the text lines have no slant angle, a vertical line is taken as the segmentation path.
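The clustering step can be illustrated with the following Python sketch of a one dimensional SOFM with three nodes. It is not the implementation of [18]: the feature points are assumed to be the (x, y) coordinates of sampled foreground pixels, and the initial node positions, learning rate, neighborhood width, epoch count and random seed are all hypothetical choices made for illustration.

```python
import math
import random


def feature_points(segment, max_points):
    """Scan columns left to right and keep at most about `max_points` ink pixels."""
    points = [(x, y)
              for x in range(len(segment[0]))
              for y in range(len(segment))
              if segment[y][x] == 1]
    step = max(1, math.ceil(len(points) / max_points))
    return points[::step]


def train_sofm(points, width, height, epochs=50, rate=0.5, sigma=1.0, seed=0):
    """Train a 1-D map of three nodes (left, middle, right) on 2-D points."""
    rng = random.Random(seed)
    nodes = [[width * f, height / 2] for f in (0.25, 0.5, 0.75)]
    order = list(points)
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = rate * decay                     # learning rate shrinks over time
        radius = max(sigma * decay, 1e-3)     # so does the neighborhood
        rng.shuffle(order)
        for px, py in order:
            # Best matching unit: node closest to the sample.
            best = min(range(3), key=lambda i: (nodes[i][0] - px) ** 2
                       + (nodes[i][1] - py) ** 2)
            for i in range(3):
                # Gaussian neighborhood over the node index distance.
                influence = math.exp(-((i - best) ** 2) / (2 * radius ** 2))
                nodes[i][0] += lr * influence * (px - nodes[i][0])
                nodes[i][1] += lr * influence * (py - nodes[i][1])
    return nodes


def segmentation_column(segment, max_points=50):
    """Return the column of the vertical cut, taken at the middle node."""
    points = feature_points(segment, max_points)
    nodes = train_sofm(points, len(segment[0]), len(segment))
    return round(nodes[1][0])


# Hypothetical touching pair: two 5x5 blobs joined by a thin bridge.
pair = [[1] * 5 + [0, 0] + [1] * 5 for _ in range(5)]
pair[2][5] = pair[2][6] = 1
nodes = train_sofm(feature_points(pair, 50), len(pair[0]), len(pair))
cut = segmentation_column(pair)
```

Because every update moves a node part of the way towards a sample, the nodes stay inside the bounding box of the segment, so `cut` is always a valid column index. Whether it falls on the actual touching point depends on the feature points and parameters, which is what the proposed experiments are meant to establish.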

Figure 6. Vertical projection profile for two line segments



Figure 7. Basic segmentation of characters using vertical projection profile

VII. RESULTS AND DISCUSSION
As per this study, horizontal and vertical projection profiles are an efficient way to establish the basic segmentation of a handwritten document with horizontal text lines. The problem with using the vertical projection profile for character segmentation is that it cannot separate the touching characters which are present in most handwritten documents. Different width categories of characters can be used to improve the identification of touching character pairs. The segmentation of touching characters can be improved by using SOFMs.

REFERENCES
[1] R. Salomon, “Brahmi and Kharoshthi,” in Daniels and Bright (Eds.), The World’s Writing Systems, 1996.
[2] S.T. Nandasara and Yoshiki Mikami, “History of Computing and Education 3 (HCE3),” Vol. 269, Springer US, pp. 157-165.
[3] D. A. Indrasena: Sinhala Akshara Malava, Sridevi Printers (pvt) Ltd, 2001.
[4] J. B. Disanayaka, අ හා (Letters and Strokes), Godage, 2000.
[5] M.L.M Karunanayaka, N.D Kodikara and G.D.S.P Wimalaratne, “Off Line Sinhala Handwriting Recognition with an Application for Postal City Name Recognition,” in Conference Proceedings - 6th International Information Technology Conference on From Research to Reality, Infotel Lanka Society, Colombo, Sri Lanka, 29 Nov - 01 Dec 2004, pp. 23-29.
[6] Rakesh Kumar Mandal, N. R. Manna, “Handwritten English Character Recognition using Column-wise Segmentation of Image Matrix (CSIM),” WSEAS Transactions on Computers, vol. 11, no. 5, pp. 148-158, 2012.
[7] D. Deng, K. P. Chan, Y. Yu, “Handwritten Chinese character recognition using spatial Gabor filters and self-organizing feature maps,” Proc. IEEE Inter. Confer. on Image Processing, vol. 3, pp. 940-944, June 1994.
[8] J. Cai and Z-Q Liu, “Integration of structural and statistical information for unconstrained handwritten numeral recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, pp. 263-270, March 1999.
[9] S. Hewavitharana, H. C. Fernando and N. D. Kodikara, “Off-line Sinhala Handwriting Recognition using Hidden Markov Models,” Proc. of Indian Conference on Computer Vision, Graphics & Image Processing (ICVGIP) 2002, Ahmedabad, India, pp. 266-269, 2002.
[10] Tirtharaj Dash and Tanistha Nayak, “English Character Recognition using Artificial Neural Network,” Proceedings of National Conference on Artificial Intelligence, Robotics and Embedded Systems, AIRES-2012, pp. 7-9, 29-30 June, 2012.
[11] B. Feng, X. Ding and Y. Wu, “Chinese handwriting recognition using Hidden Markov Models,” in Proceedings of the 16th International Conference on Pattern Recognition, Barcelona, Spain, 2002.
[12] N. Tripathy and U. Pal, “Handwriting segmentation of unconstrained Oriya text,” Ninth International Workshop on Frontiers in Handwriting Recognition (IWFHR-9 2004), pp. 306-311, 26-29 Oct. 2004.
[13] S. Hewavitharana, H. C. Fernando and N. D. Kodikara, “Off-line Sinhala Handwriting Recognition using Hidden Markov Models,” Proc. of Indian Conference on Computer Vision, Graphics & Image Processing (ICVGIP) 2002, Ahmedabad, India, pp. 266-269, 2002.
[14] G. Louloudis, K. Halatsis, B. Gatos and I. Pratikakis, “A Block-Based Hough Transform Mapping for Text Line Detection in Handwritten Documents,” 10th International Workshop on Frontiers in Handwriting Recognition (IWFHR 2006), La Baule, France, October 2006, pp. 515-520.
[15] G. Seni and E. Cohen, “External word segmentation of off-line handwritten text lines,” Pattern Recognition 27 (1994), pp. 41-52.
[16] U. Pal, A. Belaid, and C. Choisy, “Touching numeral segmentation using water reservoir concept,” Pattern Recognition Letters, vol. 24, pp. 261-272, 2003.
[17] T. Kohonen, “The self-organizing map,” Proc. of IEEE, vol. 78, no. 9, pp. 1468-1480, 1990.
[18] F. Kurniawan, M.S.M. Rahim, D. Daman, A. Rehman, M. Dzulkifli and S. Mariyam, “Region-based touched character segmentation in handwritten words,” Int J Innov Comput Inf Control, vol. 7, pp. 3107-3120, June 2011.
[19] H. L. Premaratne and J. Bigun, “Recognition of Printed Sinhala Characters Using Linear Symmetry,” The 5th Asian Conference on Computer Vision, Melbourne, Australia, 23-25 Jan, 2002.
[20] H. L. Premaratne and J. Bigun, “A Segmentation-free Approach to Recognise Printed Sinhala Script,” Pattern Recognition, vol. 37, pp. 2081-2089, 2004.
[21] Nobuyuki Otsu, “A Threshold Selection Method from Gray-Level Histograms,” IEEE Transactions on Systems, Man and Cybernetics, vol. 9, no. 1, pp. 62-66, Jan. 1979.