
2011 International Conference on Electrical Engineering and Informatics 17-19 July 2011, Bandung, Indonesia

Development of Indonesian Handwritten Text Database for Offline Character Recognition

Peb Ruswono Aryan, Iping Supriana, Ayu Purwarianti
School of Electrical Engineering and Informatics, Institut Teknologi Bandung
Jalan Ganesa 10 Bandung, Indonesia

Abstract— Research and development of character recognition systems has matured in recent years, with free and commercial systems publicly available, especially for isolated Latin characters in multiple languages. The character recognition problem itself can be considered mostly solved. Problems still being worked on in this community include unconstrained handwritten text and culture-specific scripts. Handwritten text is a special case when the script is used in more than one language, because the accuracy of recognition then depends on knowledge of the language in which the text is written. Indonesia has large cultural diversity, especially in languages, because of its heterogeneous geographic distribution. Handwritten character recognition for data acquisition in Indonesia still does not accommodate this cultural background to improve accuracy, since commercially available systems are developed for a foreign language (English), although it uses the same alphabet. Therefore a local (Indonesian) offline handwriting database is required. The purpose of this paper is to report the collection of an Indonesian handwritten text database. We collected handwritten text samples from college students: isolated digits, mixed digits, isolated lowercase letters, isolated uppercase letters, lowercase words, uppercase words, and cursive words. We also present the method used to select the words included in the database.

Keywords— handwriting recognition, offline character recognition, database.

978-1-4577-0751-3/11/$26.00 ©2011 IEEE

I. INTRODUCTION

A standard database is important for handwriting recognition research. It is used in the development, evaluation, and comparison of different handwriting recognition algorithms [1]. Many databases have been developed in the character recognition community, covering printed and handwritten text, isolated and cursive writing, and various scripts. Some of them are publicly available, such as IAM [1-2] and IRONOFF [3].

Handwritten character recognition is usually classified into two groups, online and offline. Online character recognition deals with information about writing dynamics as the text is being written, while offline character recognition deals with static information acquired after all the text is written. Offline character recognition usually works from another medium that carries the written text, such as paper.

One of the main issues in handwritten text recognition is that its accuracy depends, as it does for human readers, on knowledge of the language in which the text is written. The same corpus written in a different language can result in different accuracy [4]. Writing may be classified as a culture-specific artifact. Even within the same language, the motor behavior with which writing is taught and learned in early school can differ between people. Indonesian uses the same Latin script as English, but its intrinsic morphology and vocabulary are different. These differences can lead to a different distribution of adjacent characters in the vocabulary. In some cases, such as the acquisition of isolated digits and characters from a specially designed form, these differences can be ignored. To increase recognition accuracy and to be applicable to more general, unconstrained text, however, they should be accommodated. The IAM database, for example, uses an English corpus, and IRONOFF contains a mostly French and English lexicon.

The purpose of this paper is to report the continuing progress in the development of an Indonesian handwritten text database. Although the process of creating an Indonesian handwriting database is similar to that for an English database, this paper provides a framework for carefully designing the form used in the data collection step, so that it captures as many variations as possible with less effort and expense than the processes used to create other databases. IAM-db and IRONOFF are used as references for comparing the design process before the actual collection.

For a database to be usable, it has to be as complete as possible but also specific to its intended use. This application bias may guide how the database is developed. IAM is developed from text corpora, which may carry linguistic information. IRONOFF is a dual online and offline database of handwritten text; the online-offline duality is biased toward the goal of inferring online information, such as trajectories, from offline text. The Indonesian database is developed for handwritten text recognition as part of automated essay grading systems. This in turn determines the writers from whom the writing is acquired, namely university students.

The next section describes the methodology of the acquisition process. Section III is concerned with the lexicon selection for the handwritten words included in the database. Current results and some samples are presented in Section IV. Finally, conclusions are presented in Section V.

II. METHODS AND FORMS

The overall process of sample acquisition, from planning to collection, is displayed in Fig. 1. First, the lexicon is determined using the method described in the next section. The lexicon is then inserted into input forms, which are printed and filled in by the writers in the data collection step. Filled forms are then scanned and organized by writer ID.

Fig. 1 Development methodology

The designed form is filled in by undergraduate and graduate students. The form consists of 20 A4-sized pages, including an informed consent form on the first page. The content is divided into 1 page of isolated digits, 5 pages of groups of digits, 4 pages of lowercase and uppercase alphabetic characters, 3 pages of lowercase words, 3 pages of uppercase words, and 3 pages of cursive words. The last three groups use the same lexicon, which is selected with the model and method described in the next section. A sample filled and scanned form is shown in Fig. 6. The first page is scanned for internal documentation. Since it contains private information, it is not included in the public version of the database.

Every form in a group is laid out using tables in which the first column of each row contains the text to be written repeatedly in the following columns. An exception is made for the isolated digits, whose guiding text is written in the first row of the table, whereas for the other groups it is written in the first column. Isolated digits have 14 repetitions for each class. Mixed groups of digits have 6 repetitions per row. Isolated lowercase and uppercase characters have 10 repetitions per class. Lowercase, uppercase, and cursive words each have 3 repetitions per row. Each person usually takes about an hour to fill in the whole form, and the total number of samples collected from one person is 1758 items.

Form papers are then scanned and grouped into directories by student ID. Images are scanned using a Canon P-150 portable USB scanner at 150 dpi resolution and 8-bit grayscale (256 levels) via TWAIN drivers on a Windows machine. Other databases are usually scanned at 300 dpi. The resolution is selected as the minimum at which the size of the characters is in the range of 30x30 to 60x60 pixels. The size of an uncompressed image is around 3.5 megabytes (MB); images are stored in lossless PNG format, which reduces the storage requirement to 10 percent of the original size. The average size of one person's data is about 8 MB.

A table structure is used to simplify extraction, so there is no need for text line detection and segmentation. Gridlines can also be used to easily detect and deskew the scanned image. Deskewing can be done during scanning, since scanners nowadays usually provide deskew as a post-scanning process already included in the driver. IAM uses no printed guide, so that the writing is as unconstrained as possible; instead, a guide paper containing horizontal lines is put below the form paper so the writer can see these lines as a guide [1]. IRONOFF, on the other hand, uses text boxes with a specially colored border in the form to guide the writer, and the border can be dropped either during scanning or by simple thresholding afterwards [3].

III. LEXICON SELECTION METHOD

To cope with the limited number of words that fit in the sample form, the lexicon is selected from an initial vocabulary. This initial vocabulary can be prepared manually by collecting from 'random' sources or, following the same approach as IAM, from linguistic corpora [1-2]. The problem of selecting a subset is transformed into an optimization problem. An optimal subset is determined using the distribution of adjacent characters. Character adjacency is chosen considering the fact that handwritten characters are usually linked as a single connected object in the input image, especially cursive ones. Separation or segmentation of connected characters is a common issue in handwritten character recognition, usually stated as the character segmentation problem.

Usually the character distribution of a standard database is reported as a result. In this paper this is not the case, since the end distribution is the result of a desired goal prepared in the beginning and incorporated into the selection algorithm. That is why no justification is given for how the initial vocabulary is prepared, either randomly or using available corpora.

A character can occur at the beginning, at the end, and in the middle of a word. Therefore it is natural to model character adjacency as a string of 3 characters (trigram), since a character can be connected to the characters positioned on its left and right side. Every trigram can be viewed as a point in ℤ³ space. Each dimension is in the interval [1,26]. Whitespace is not included in the trigram, so that each dimension of the coordinate point lies in the same interval. Each lexicon of length c characters can be decomposed into c-2 trigram(s) (possibly with duplication). A subset of the vocabulary can then be transformed into a histogram of points in this space.

Using this model without incorporating whitespace at the beginning and the end of a trigram requires a lexicon to have a minimum length of 3 characters.

There are many ways to model the optimality of the selection. One can use a statistic of the point distribution, such as variance, as the objective function [5]. Another possibility is the similarity of the probability distribution between the original vocabulary and the selected lexicon.

The lexicon selection problem thus becomes a subset search optimization. A global optimization technique such as a genetic algorithm may be used to solve this problem [5]. The general approach to lexicon selection is shown in Fig. 2 below.

    function decompose_word( S )    { S is indexed from 0 }
        result ← empty list
        for i in [1..length(S)-2]
            trigram ← S[i-1]+S[i]+S[i+1]
            point ← convert trigram to point in ℤ³
            append point to result
        return result

    function criteria( subset )
        histogram ← empty histogram
        foreach word in subset
            foreach trigram in decompose_word( word )
                add trigram to histogram
        return eval( histogram )    { evaluation criteria e.g. variance }

    function select_lexicon_subset( Source, LenOutput )
        repeat
            selection ← subset of length LenOutput from Source
            score ← criteria( selection )
            if score > maxscore then
                update maxscore and result
        until termination criteria    { max iteration or maxscore converged }
        return result

Fig. 2 Lexicon selection algorithm

The ℤ³ space mentioned before is not a fixed choice of space. One can transform it to ℝ³ and scale it into the [0,1] interval. For example, the lexicon 'merdeka' can be visualized in this space as shown in Figure 3. The lexicon 'merdeka' is decomposed into the set of trigrams {'mer', 'erd', 'rde', 'dek', 'eka'}. Each trigram is a point displayed as a dot.

Fig. 3. Visualization of the trigram histogram from lexicon decomposition. Beginning and ending trigrams are pointed out. [5]

IV. RESULT

At the moment (June 2011) the database consists of a total of approximately two hundred filled forms. 50 of them are already scanned into image form.

The lexicon used in the database described in this paper has 72 words. This word list is selected using a genetic algorithm with the maximal variance of the trigram distribution as the objective function. The visualization of the trigram distribution of the word list is shown in Figure 4. Maximal variance is reflected in the spread radius of the points in the diagram. The empty spot at the top right is possibly caused by the nonexistence of words consisting of combinations of the last letters of the alphabet (v, w, x, y, z). This is still an ad hoc solution for the moment but can be used as a starting example.

Fig. 4. Trigram distribution selected using maximum variance criteria [5]

Compared to other databases such as IAM and IRONOFF, this database is less expensive to build and uses parameters (scanning resolution) that closely resemble practical applications. This makes it easier to scale the database or even to create a new database for a new language. IAM requires corpus splitting, form segmentation, and a labeling process. Because of the large and diverse corpus used as its lexicon source, a single writer only writes a small portion of the corpus and no repetition is done (except for repetition from the text itself, such as conjunctions or word reoccurrence), so a writer's performance is assumed to be fixed, which is not accurate in reality. As can be seen in Figure 6, there is slight variation in the written text for the same word. Assuming fixed performance is therefore a limited way to model human writing behavior, which cannot be assumed to be fixed all the time.

Fig. 5 Sample text from IAM-db [2]

Figure 5 above shows a fragment from the IAM sentence database. Trigrams in this example such as ity, the, hey, see, and eek do not exist in the Indonesian language. Incorporating them in a recognition model, for example an HMM [6], can therefore be misleading, since they will not be used frequently. On the other hand, such trigrams occur frequently in English sentences, so learning those patterns would be useful in English handwriting recognition applications.
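The trigram model and the selection loop of Fig. 2 can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the implementation used to build the database: the spread measure (sum of the per-axis population variances of the trigram points) is only one possible reading of the variance criterion, the plain random search stands in for the genetic algorithm of [5], and the names `to_unit_cube`, `max_iter`, and `seed` are hypothetical.

```python
import random
from statistics import pvariance

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def decompose_word(word):
    """Return the trigrams of a word as integer points in [1, 26]^3."""
    # Map letters to 1..26; whitespace and other symbols are skipped.
    letters = [ALPHABET.index(ch) + 1 for ch in word.lower() if ch in ALPHABET]
    # A word of c letters gives c - 2 overlapping trigrams (none if c < 3).
    return [tuple(letters[i:i + 3]) for i in range(len(letters) - 2)]


def to_unit_cube(point):
    """Rescale a point from [1, 26]^3 to [0, 1]^3 (hypothetical helper)."""
    return tuple((v - 1) / 25 for v in point)


def criteria(subset):
    """Spread of all trigram points of a subset.

    Assumption: spread = sum of per-axis population variances.
    """
    points = [p for word in subset for p in decompose_word(word)]
    if len(points) < 2:
        return 0.0
    return sum(pvariance(axis) for axis in zip(*points))


def select_lexicon_subset(source, len_output, max_iter=1000, seed=0):
    """Random-search version of the loop in Fig. 2 (not a genetic algorithm)."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(max_iter):
        selection = rng.sample(source, len_output)
        score = criteria(selection)
        if score > best_score:
            best, best_score = selection, score
    return best, best_score
```

For the lexicon 'merdeka', `decompose_word` yields five points, one for each of the trigrams mer, erd, rde, dek, and eka. A fixed iteration budget is used here as the termination criterion; a convergence test on the best score would be an equally valid choice.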

V. CONCLUSIONS

A database of Indonesian handwritten text has been described in this paper. Collection is still ongoing to enlarge the database. All data will be made publicly available and open to new contributions, especially with different lexicons for the handwritten word collection. While the repository and public site are being set up, the data can be obtained by contacting the author(s) of this paper. The methodology of database development described in this paper can be used to scale or extend the current database considerably faster and less expensively than the methods used for other handwritten text databases. The methodology may also be used as a reference framework for creating new databases, considering the vast number of local cultures in Indonesia. Future studies may also explore different criteria and search strategies for lexicon selection.
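One of the alternative criteria mentioned in Section III, the similarity between the trigram distribution of the original vocabulary and that of the selected lexicon, could be prototyped as in the sketch below. The choice of metric is an assumption made purely for illustration: similarity is taken as one minus the total variation distance between the two trigram frequency distributions, and the function names are hypothetical.

```python
from collections import Counter


def trigrams(word):
    """Return the overlapping letter trigrams of a word (lowercase a-z only)."""
    letters = [ch for ch in word.lower() if "a" <= ch <= "z"]
    return ["".join(letters[i:i + 3]) for i in range(len(letters) - 2)]


def trigram_distribution(words):
    """Relative frequency of each trigram over a list of words."""
    counts = Counter(t for w in words for t in trigrams(w))
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()} if total else {}


def similarity(vocabulary, subset):
    """Assumed metric: 1 - total variation distance, in [0, 1]."""
    p = trigram_distribution(vocabulary)
    q = trigram_distribution(subset)
    if not p or not q:
        return 0.0  # no trigrams on one side: treat as no similarity
    keys = set(p) | set(q)
    tvd = 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)
    return 1.0 - tvd
```

A score of 1 means that the subset reproduces the trigram frequencies of the vocabulary exactly, and a score of 0 means that the two share no trigrams. Such a function could replace the variance-based objective in the selection loop without changing the search itself.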

Fig. 6 Scanned image of filled form

ACKNOWLEDGMENT

The work reported in this paper is supported by a Competence Building Grant from the Department of National Education, Republic of Indonesia.

REFERENCES

[1] U. V. Marti and H. Bunke, A Full English sentence database for offline handwriting recognition, In Proc. of the 5th Int. Conf. on Document Analysis and Recognition, pages 705 - 708, 1999.
[2] U. Marti and H. Bunke, The IAM-database: An English Sentence Database for Off-line Handwriting Recognition, Int. Journal on Document Analysis and Recognition, Volume 5, pages 39 - 46, 2002.
[3] C. Viard-Gaudin, P. M. Lallican, S. Knerr, and P. Binter, The IRESTE On/Off (IRONOFF) Dual Handwriting Database, Proc. Intl. Conference of Document Analysis and Recognition (ICDAR), Bangalore, India, 1999.
[4] M. Liwicki, H. Bunke, Recognition of Whiteboard Notes online, offline and combination, World Scientific Publishing, 2008.
[5] P. R. Aryan, A. Purwarianti, I. Supriana, Pembangkitan Koleksi kata untuk Basisdata Tulisan Tangan Menggunakan Algoritma Genetika, Konferensi Nasional Sistem Informasi, Medan, Indonesia, Feb. 2011.
[6] T. Plötz, G. A. Fink, Markov Models for Offline Handwriting Recognition: A Survey, International Journal on Document Analysis and Recognition (IJDAR) 12, 269-298, 2009.