A Four-Tier Annotated Urdu Handwritten Text Image Dataset for Multidisciplinary Research on Urdu Script

PRAKASH CHOUDHARY, National Institute of Technology Manipur, Computer Science and Engineering

NEETA NAIN, National Institute of Technology Jaipur, Computer Science and Engineering

This article introduces a large handwritten text document image corpus dataset for Urdu script named CALAM (Cursive And Language Adaptive Methodologies). The database contains unconstrained handwritten sentences along with their structural annotations for the offline handwritten text images with their XML representation. Urdu is the fourth most frequently used language in the world, but due to its complex cursive writing script and low resources, it is still a thrust area for document image analysis. Here, a unified approach is applied in the development of an Urdu corpus by collecting printed texts, handwritten texts, and demographic information of writers on a single form. CALAM contains 1,200 handwritten text images, 3,403 lines, 46,664 words, and 101,181 ligatures. For capturing maximum variance among the words and handwritten styles, data collection is distributed among six categories and 14 subcategories. Handwritten forms were filled out by 725 different writers belonging to different geographical regions, ages, and genders with diverse educational backgrounds. A structure has been designed to annotate handwritten Urdu script images at line, word, and ligature levels with an XML standard to provide a ground truth of each image at different levels of annotation. This corpus would be very useful for linguistic research in benchmarking and providing a testbed for evaluation of handwritten text recognition techniques for Urdu script, signature verification, writer identification, digital forensics, classification of printed and handwritten text, categorization of texts as per use, and so on. The experimental results of some recently developed handwritten text line segmentation techniques evaluated on the proposed dataset are also presented in the article to demonstrate its viability and usability.

CCS Concepts: • Computing methodologies → Image processing; • Applied computing → Document management and text processing

Additional Key Words and Phrases: Urdu handwritten text, annotation, OCR algorithms benchmarking, corpus

ACM Reference Format: Prakash Choudhary and Neeta Nain. 2016. A four-tier annotated Urdu handwritten text image dataset for multidisciplinary research on Urdu script. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 15, 4, Article 26 (May 2016), 23 pages. DOI: http://dx.doi.org/10.1145/2857053

Authors’ addresses: P. Choudhary, Department of Computer Science and Engineering, NIT Manipur, Imphal 795001, India; email: [email protected]; N. Nain, Department of Computer Science and Engineering, MNIT Jaipur, Rajasthan 302017, India; email: [email protected].

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or [email protected].

© 2016 ACM 2375-4699/2016/05-ART26 $15.00 DOI: http://dx.doi.org/10.1145/2857053

ACM Trans. Asian Low-Resour. Lang. Inf. Process., Vol. 15, No. 4, Article 26, Publication date: May 2016.

1. INTRODUCTION

Over the past few years, many advances have been made in the field of handwritten text recognition. Linguistic resources such as annotated corpora play a significant role and are the most in-demand platform for computational linguistic research. A machine-readable corpus has more capability to explore and identify all features of natural language, including the characteristics of the desired texts such as lexical, textual, semantic, and syntactic attributes. Corpus linguistics is an approach for investigating the diversity of a language using a large collection of real-life natural language text samples. This approach has long been used in a number of research areas, such as the study of a language and its writing style and the development and benchmarking of various OCR algorithms. Researchers have developed various datasets/databases of different standards based on requirements, such as databases of isolated digits/characters, text lines/words, or paragraphs, to evaluate the performance of various OCR algorithms. In the field of handwritten text document image analysis, many highly cited algorithms exist in the literature, such as those of Alaei et al. [2011b], Gatos et al. [2009], Stamatopoulos et al. [2013], Margner and El Abed [2009], Gatos et al. [2010], Likforman-Sulem et al. [2007], Li et al. [2008], Louloudis et al. [2009], and Yin and Liu [2009], which have used standard databases for training and testing and have shown the need for them.

Two axes of database development have been identified for handwritten text recognition systems based on the input mode: online and offline. Researchers have developed online handwritten databases [Liu et al. 2011; Guyon et al. 1994; Viard-Gaudin et al. 1999; Kumar 2010; Indermühle et al. 2010; Nakagawa et al. 1997; Nethravathi et al. 2010]. Bhaskarabhatla and Madhvanath [2004] collected handwritten data for online handwriting recognition in different Indic scripts. A brief overview of the online handwritten databases for English, Chinese, Japanese, and Indic scripts is shown in Table I. Guyon et al. [1994] designed a platform for data exchange and recognizer benchmarking. This format includes various online hand-printed and cursive alphabets including Latin and Chinese, signatures, and pen gestures.

Table I. Some of the Most Widely Used Online Handwritten Databases

| Database | Scripts | Contents | Content Size |
|---|---|---|---|
| CASIA [Liu et al. 2011] | Chinese | Text pages and lines, characters, and symbols | 5,090 text pages of 52,220 lines, 3.9 million characters, and 171 symbols |
| TAUT [Nakagawa et al. 1997] | | Isolated characters | 10,000 images |
| IAM-on [Indermühle et al. 2010] | English | Text documents having words and strokes | 941 pages with 7,616 lines; 68,841 words and 355,097 strokes |
| Nakagawa and Matsumoto [2004] | Japanese | Characters | 3 million patterns |
| Nethravathi et al. [2010] | Tamil, Kannada | Isolated words | 100,000 words in each script |
| OHASD [Elanwar et al. 2010] | Arabic | Paragraphs | 154 paragraphs of 3,800 words and 19,400 characters |
| Kumar [2010] | Devanagari | Characters | 1,800 character samples |

The CASIA database (Institute of Automation of Chinese Academy of Sciences), built by NLPR (National Laboratory of Pattern Recognition) [Liu et al. 2011], provides both online and offline Chinese handwriting data, containing samples of isolated characters and handwritten texts. The dataset has 3.9 million isolated character samples produced by 1,020 writers using an Anoto pen on paper for obtaining both online trajectory data and offline images. The database TAUT [Nakagawa et al. 1997] is another online handwritten database made of 10,000 character patterns by selecting the 1,227 most frequently appearing character categories from a sequence of newspaper sentences. The dataset IRESTE [Viard-Gaudin et al. 1999] is a dual handwriting database; it has 4,086
isolated digits, 10,685 isolated lowercase letters, 10,679 isolated uppercase letters, and 410 EURO signs. It also contains 31,346 isolated words (28,657 French and 2,689 English words). Kumar [2010] introduced a database for Devanagari script, composed of 1,800 samples from 36 character classes obtained from 25 native writers. Each writer was asked to provide two samples per class. In 2010, Nethravathi et al. [2010] developed an online handwritten database of 200,000 words for two Indic scripts, Tamil and Kannada, by collecting 100,000 words for each script from 600 users to capture the variations in writing style. Bhaskarabhatla and Madhvanath [2004] collected handwritten data for online handwriting recognition in different official Indic scripts. The second category of handwritten datasets is the offline handwritten database where standard databases of isolated characters/digits or sentences have been developed and proposed in the literature. Some of the most widely used handwritten databases for some scripts such as French, English, Korean, Chinese, and Indic script are IRESTE, CEDAR, IAM, CMATER, NIST, PE92, IFN/ENIT, KHTD, FHT, HIT-MW, and PBOK. A brief overview of these databases is shown in Table II. NIST [Wilkinson 1992], MNIST [Deng 2012], IRESTE [Viard-Gaudin et al. 1999], IAM [Marti and Bunke 2002], RIMES [Grosicki et al. 2006], and CEDAR [Hull 1994] are the most frequently used standard English text databases. The NIST database is composed of handwritten characters/digits and running English texts. The data samples were extracted from 2,100 filled forms. The MNIST database is a large database of handwritten digits extracted from the NIST database. The IAM database is a collection of 1,539 handwritten text pages. Besides text pages, it also contains images of text lines and words with ground-truth labels. The CEDAR [Hull 1994] database is a collection of digital images of city names, state names, and zip codes from the postal addresses. 
Images have been segmented from the addresses by a semiautomatic process. It has been very widely used in the experimentation of a wide number of OCR techniques in ICDAR and ICFHR. The databases IRESTE [Viard-Gaudin et al. 1999] and CEMTAR [Sarkar et al. 2012] were developed for more than one language simultaneously. IRESTE is a dual handwriting database of English and French scripts. It includes images of isolated digits and letters and 410 EURO signs. The database also contains 31,346 isolated words (French: 28,657 and English: 2,689). The CEMTAR database contains 150 handwritten document pages, among which 100 pages were written purely in Bangla script and the rest of the 50 pages were written in Bangla texts mixed with English words. The database PE92 [Kim et al. 1993] is a collection of handwritten Korean character images where the authors collected 100 sets of KS (Korean Script) 2,350 handwritten Korean character images. The first 70 sets were generated by more than 500 different writers, and the same person wrote each of the remaining 30 sets. A Chinese handwriting database named HIT-MW [Su et al. 2007] was developed by including 853 handwritten forms, where forms were produced under unconstrained conditions without preprinted character boxes. ETL9 [Saito et al. 1985] is a set of hand-printed characters in JIS Chinese with its analysis in Japanese. The corpus in Sutat and Methasate [2004] is a Thai script handwritten character corpus, developed by the NECTEC (National Electronics and Computer Technology Center). The corpus consists of more than 44,000 images of online and offline handwritten characters, including isolated, touching, and cursive handwritten characters. For the Arabic language, most of the available databases are a collection of isolated characters/digits or words, while the database AHTD [Mahmoud et al. 2011] is a collection of handwritten text pages. The IFN/ENIT [Pechwitz et al. 
2002] database was developed for training and testing of Arabic handwriting recognition systems. It contains more than 2,200 binary images of handwritten forms written by 411 writers. A ground-truth file for each word in the database has been compiled. This file contains information about the word, such as the position of the word baseline, and information on the individual characters used in the word. AHDB [Al-Ma’adeed et al. 2002] is a database for offline Arabic handwriting recognition, introduced together with an analysis of the database and its associated preprocessing operations. The database contains sample images of Arabic words and free handwriting text pages. Alamri et al. [2008] introduced a database of isolated Indian digits, numerical strings, Arabic isolated letters, and a collection of 70 Arabic words. It also includes a free-format sample for Arabic dates. Al-Ohali et al. [2003] developed an Arabic cheques database for research in the recognition of handwritten Arabic cheques. It is composed of real-life Arabic legal amounts, Arabic subwords, courtesy amounts, Indian digits, and Arabic cheques. The Al ISRA database [Kharma et al. 1999] is a comprehensive database including handwritten Arabic words, numbers, and signatures. AHTD [Mahmoud et al. 2011] is a database for offline Arabic handwritten text recognition. The database is composed of images of the handwritten text at various resolutions, and it also provides ground-truth metainformation for written text at the page, paragraph, and line levels.

For the Farsi language, there exist a few databases. FHT [Ziaratban et al. 2009] is an unconstrained Farsi handwritten text database of 1,000 forms with contributions from 250 participants in different age groups and with varied education levels. These characteristics of the database make it suitable for many OCR applications. Khosravi and Kabir [2007] introduced a very large dataset of handwritten Farsi digits. The database includes binary images of digits extracted from about 12,000 registration forms of two types, filled out by BSc and senior high school students.

Table II. Some of the Most Widely Used Offline Handwritten Text Databases

| Database | Scripts | Contents | Content Size |
|---|---|---|---|
| PE92 [Kim et al. 1993] | Korean | Characters | 235,000 characters |
| HIT-MW [Su et al. 2007] | Chinese | Sentences and characters | 853 forms having 8,664 lines and 186,444 characters |
| ETL-9 [Saito et al. 1985] | Japanese | Characters | 607,200 characters |
| RIMES [Grosicki et al. 2006] | English | Mail samples | 12,723 mail samples |
| IAM [Marti and Bunke 2002] | English | Running English text pages | 5,685 paragraphs of 13,353 lines and 115,320 words |
| CEDAR [Hull 1994] | English | Words of state, city, and postal name | 5,000 city names, 5,000 state names, 10,000 postal zip codes, and 50,000 characters |
| IFN/ENIT [Pechwitz et al. 2002] | Arabic | Text forms | 2,200 binary images; 26,00 isolated words of 411 writers |
| Al ISRA [Kharma et al. 1999] | Arabic | Paragraphs, words, digits, and signatures | 500 paragraphs, 37,000 words, 10,000 digits, and 2,500 signatures |
| AHTD [Mahmoud et al. 2011] | Arabic | Text pages | 1,000 forms written by 300 writers |
| FHT [Ziaratban et al. 2009] | Farsi | Handwritten forms | 1,000 filled forms |
| IfN/Farsi [Mozaffari et al. 2008] | Farsi | City names | 7,271 images of 1,080 Iranian city names |
| HaFT [Safabaksh et al. 2013] | Farsi | Text pages | 1,800 forms |
| Khosravi and Kabir [2007] | Farsi | Digits | 102,352 digits extracted from 12,000 registration forms |
| CENPARMI [Haghighi et al. 2009] | Farsi | Words, dates, digits, and letters | 432,357 isolated samples of dates, digits, letters, and words written by 400 writers |
| KHTD [Alaei et al. 2011a] | Indic scripts | Kannada text pages, lines, words | 204 text pages of 4,298 lines and 26,115 words by 51 writers |
| Bhattacharya and Chaudhuri [2009] | Devanagari and Bangla | Numeral samples | 22,556 Devanagari and 23,392 Bangla |
| CEMTAR [Sarkar et al. 2012] | Bangla and English | Text pages | 150 document pages: 100 pages of Bangla and 50 pages of English-Bangla mix |
| PBOK [Alaei et al. 2012] | Indo-Persian scripts | Text pages of Bangla, Oriya, Kannada, and Persian | 707 text pages of four scripts, 12,565 text lines, and 104,541 words |
| CENPARMI [Sagheer et al. 2009] | Urdu | Isolated digits and words | 44 isolated characters and 57 Urdu words |
| CENIP-UCCP [Raza et al. 2012] | Urdu | Sentences | 400 text pages |

A new large-scale multipurpose CENPARMI Farsi handwritten dataset [Haghighi et al. 2009] consists of 432,357 images of dates, words, isolated letters, isolated digits, numeral strings, special symbols, and documents. The forms were collected from 400 native Farsi writers. The IfN/Farsi [Mozaffari et al. 2008] database consists of 7,271 binary images of Iranian province/city names. The HaFT [Safabaksh et al. 2013] database contains 1,800 grayscale images of unconstrained texts. The generation of corpus methodology for Indian scripts was initiated in 1991. To date, very few datasets are available for Indian scripts. Some of the notable works are as follows: The Kannada handwritten text database (KHTD) [Alaei et al. 2011a] is an unconstrained dataset, containing 204 handwritten documents of four different categories written by 51 native speakers of Kannada. The total number of text lines and words in the dataset are 4,298 and 26,115 respectively. Bhattacharya and Chaudhuri [2009] developed a mixed numeral handwritten database of Indian scripts. The database includes isolated handwritten numeral samples of real-life situations for Devanagari and Bangla scripts. CEMTAR [Sarkar et al. 2012] is a database of unconstrained Bangla−English mixed script handwritten document images. The database contains 150 handwritten document pages, among which 100 pages are written purely in Bangla script and the rest of the 50 pages are written in Bangla text mixed with English words. The standard database PBOK [Alaei et al. 2012] of four different scripts includes text pages of three Indic scripts, Kannada, Bangla, and Oriya. The Kannada part of the database has 228 text pages of four different domains written by 57 writers. The Kannada section contains a total of 4,850 handwritten text lines, 29,306 words, and 213,147 characters. It also contains 199 and 140 handwritten text pages of Bangla and Oriya, respectively. 
The database provides pixel- and content-based ground truthing for all the text pages. This database contains text pages written from both directions, and most of the samples are either overlapping or touching text lines.

For the Urdu language, so far only two handwritten databases exist. CENPARMI [Sagheer et al. 2009] is an Urdu offline handwriting database that includes isolated digits, numeral strings with/without decimal points, five special symbols, 44 isolated characters, 57 financial related words, and a collection of Urdu dates in different formats. The other available offline Urdu handwritten database is CENIP-UCCP [Raza et al. 2012], an unconstrained offline sentence database composed of 400 digitized forms produced by 200 different writers. The database has been labeled/marked up at the text-line level only. From the literature review, it can be summarized that there exists a sufficient number of standard databases for scripts like English, Chinese, and Japanese, while very few standard databases are available for Arabic and Farsi. Compared to these languages, very little attention has been given to the Urdu language. The Urdu handwritten database developed by CENPARMI [Sagheer et al. 2009] focuses only on isolated characters and digits and some selected words. Only CENIP-UCCP [Raza et al. 2012] includes 400 images of handwritten sentences. Urdu script is more complicated and elaborate compared to Arabic and Persian. The main reason for the Urdu script getting less attention and its slower development in the OCR field is the lack of a standard database for Urdu script. The availability of resources for data collection is much less for Urdu as compared to scripts like Persian and Arabic. It is difficult to use Urdu script in automation, as a single character entry needs two to three keystroke combinations. To bypass this data entry step, we need to develop machine vision systems for automatically converting handwritten Urdu characters into their transcribed counterparts. To develop such intelligent systems, we need a large corpus to train the system for recognizing handwritten Urdu characters.
These issues motivated us to develop an Urdu handwritten text database, which is a much-needed platform for training, testing, and benchmarking of handwritten text recognition systems for the Urdu script. This article describes the detailed methodology of developing an annotated corpus, CALAM, in a scientific way, including a large volume of unconstrained handwritten text images in Urdu script and their corresponding transcribed texts in a Unicode text file or in an XML file format. The corpus consists of 1,200 handwritten images written by 725 writers belonging to different geographical regions. The number of handwritten text lines varies from two to six lines in a form. The number of words varies from 20 to 80 in a text form/image. The text page also includes the demographic information of the writer like name, age, gender, education, address, and signature. The selection of texts is distributed within six categories and 14 subcategories to achieve the maximum variation in the words of the texts. The corpus is designed to support a wide range of computational linguistic research, such as identifying writing styles and grammatical information and developing machine-readable platforms. The corpus provides an aligned transcription for each image, line by line, phrase by phrase, or word by word. The corpus is completely marked up for content information to support content detection and evaluation of systems like linguistic handwriting recognition, signature verification, and writer identification. The database was used for benchmarking handwritten text recognition algorithms by generating an XML file of annotated handwritten text images. Experimental results in the form of quantitative analysis of four handwritten text-line segmentation techniques are also reported.
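To make the idea of a hierarchical, aligned markup concrete, the following minimal Python sketch builds a nested XML record for one form, assuming the four tiers are page, line, word, and ligature. The element and attribute names (`page`, `line`, `word`, `ligature`, `uid`, `id`, `text`), the helper `build_annotation`, and the sample word are hypothetical placeholders chosen for illustration; they are not the CALAM schema, which the article defines in its annotation section.

```python
import xml.etree.ElementTree as ET


def build_annotation(form_uid, lines):
    """Build a nested page > line > word > ligature tree.

    `lines` is a list of lines; each line is a list of words; each word is a
    list of ligature strings. All tag and attribute names are hypothetical.
    """
    page = ET.Element("page", {"uid": form_uid})
    for line_no, words in enumerate(lines, start=1):
        line_el = ET.SubElement(page, "line", {"id": str(line_no)})
        for word_no, ligatures in enumerate(words, start=1):
            # The word text is the concatenation of its ligatures.
            word_el = ET.SubElement(
                line_el, "word", {"id": str(word_no), "text": "".join(ligatures)}
            )
            for lig_no, ligature in enumerate(ligatures, start=1):
                lig_el = ET.SubElement(word_el, "ligature", {"id": str(lig_no)})
                lig_el.text = ligature
    return page


# One line holding one word ("Pakistan"), split here into three ligatures.
sample_lines = [[["پا", "کستا", "ن"]]]
root = build_annotation("URD-H-IH-001", sample_lines)
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

A real ground-truth record would presumably also carry image coordinates for each line, word, and ligature; those are omitted here because their representation is not specified at this point in the text.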
The article first introduces the experimental setup for the collection and distribution of data in a systematic manner, and then reports the process of information fetching and feeding in both the handwritten text image and its corresponding XML file. The article is organized as follows: Section 2 describes the characteristics of Urdu script. Section 3 introduces the process of data collection and gives an overview of the statistics of the database. Section 4 describes the functionality and annotation of the scanned handwritten image in a hierarchical manner with the generation of an XML file for ground truth. Section 5 presents a comparative analysis of the structure with the existing document annotation tools. Section 6 provides experimental evidence in terms of text-line segmentation and word-frequency distribution in the proposed corpus. Finally, conclusions are presented in Section 7.

2. CHARACTERISTICS OF URDU SCRIPT

Fig. 1. A set of the 38 most commonly used alphabets and digits of Urdu script.

Urdu script belongs to the Indo-Aryan family of scripts and has been historically associated with India since the time of the Mughal Empire. The present shape of Urdu script is significantly influenced by languages like Persian, Arabic, Turkish, Punjabi, and other indigenous languages of the Indian subcontinent. It is the national language of Pakistan and is one of the 22 scheduled languages in the Constitution of India. India has a large number of native Urdu speakers in its five states: Andhra Pradesh, Jammu and Kashmir, Bihar, Uttar Pradesh, and New Delhi. Urdu is the official language of Jammu and Kashmir state, and recently Urdu was also approved as the second official language of Uttar Pradesh. The population of Hindi-Urdu speakers is the fourth-largest community in the world after Mandarin Chinese, English, and Spanish. According to Government of India 2001 census data [Census 2001], in India, more than 50 million people speak Urdu as their native language.

The Urdu script is written from right to left and is an extension of the Persian alphabet, which is itself an extension of the Arabic alphabet. The Urdu alphabet set contains 38 characters and 10 digits, as shown in Figure 1. Urdu is associated with the Nastaleeq style of Persian calligraphy, whereas Arabic is written in the Naskh style. As shown in Figure 1, the “diamond shape” on the top of characters indicates the extended characters for Urdu from Persian. In Unicode, Arabic and its associated languages like Urdu, Punjabi, and Sindhi have been allocated 1,200 code points (0600h-06FFh and FB50h-FEFFh).

At the time of writing, individual characters are joined together according to rules for every consecutive pair of characters in order to form groups of characters called ligatures. A word consists of one or more ligatures written next to each other. Ligatures in Urdu are composed of one or more characters; Table III shows examples of seven valid ligatures formed with a combination of two to eight Urdu characters. Urdu characters typically attain different shapes according to their placement in forming a ligature. Both the meaning and shape of the characters change depending on their positions (initial, medial, and final). The problem is further aggravated by the cursive nature of the script. Thus, the shape assumed by a character in a word is context sensitive, decided by its placement. Furthermore, the use of dots and diacritics during writing makes the recognition process more complicated. Dots play a significant role in the Urdu alphabet; the placement of a single dot can change one letter into a different letter. For example, as shown in Table IV, the letter [be] has its basic shape in common with three other letters, [pe], [te], and [se], with only some dots differentiating them. One of the challenges for Urdu OCR is to characterize the differences between these very similar-looking letters. Table IV shows the differences between very similar-looking letters using the dots, and Table V shows the differences between very similar-looking letters using the diacritic.

Table III. A Sample of Valid Urdu Ligatures Formed by a Combination of Two to Eight Characters

Table IV. Differences Between Very Similar-Looking Letters Using the Dots

Table V. Differences Between Very Similar-Looking Letters Using the Diacritic
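The code point allocation and the dot-differentiated letters discussed in this section can be checked with a short script. The sketch below counts the code point positions covered by the two quoted ranges and looks up the four letters that share the base shape of [be]. The mapping of [be], [pe], [te], and [se] to U+0628, U+067E, U+062A, and U+062B, and the names `ARABIC_RANGES`, `DOT_FAMILY`, `range_size`, and `in_ranges`, are our own illustrative choices rather than anything defined in the article. Note that the count covers code point positions in the ranges, not only assigned characters.

```python
import unicodedata

# Code point ranges quoted in the text (inclusive bounds).
ARABIC_RANGES = [(0x0600, 0x06FF), (0xFB50, 0xFEFF)]

# Letters sharing the base shape of "be"; this mapping is an assumption.
DOT_FAMILY = {"be": "\u0628", "pe": "\u067E", "te": "\u062A", "se": "\u062B"}


def range_size(ranges):
    """Count code point positions covered by inclusive (low, high) ranges."""
    return sum(high - low + 1 for low, high in ranges)


def in_ranges(char, ranges):
    """Return True if the character's code point lies in any of the ranges."""
    return any(low <= ord(char) <= high for low, high in ranges)


total_positions = range_size(ARABIC_RANGES)
print("code point positions:", total_positions)

for label, char in DOT_FAMILY.items():
    print(label, f"U+{ord(char):04X}", unicodedata.name(char),
          in_ranges(char, ARABIC_RANGES))
```

Because the four letters are distinct code points, the dots matter only at the image level: a recognizer has to recover from pixels which of the four code points a shared base stroke represents.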

3. DATA COLLECTION AND DISTRIBUTION

The process of design and development of an Urdu corpus starts with raw data collection and ends with appropriate tagging and labeling of the collected texts in the database. In our methodology, we used a higher-level, sentence-based approach, which combines different units of writing in a single trial, rather than collecting lists of isolated characters, digits, and words. The data were collected mostly from news channels like BBC Urdu and ETV Urdu, Urdu blogs, historical documents, and textbooks. In some categories like history, architecture, and biography, the printed documents were entered manually in Unicode text because Urdu Unicode texts were not available for some words that were used earlier and are no longer in use. To capture the maximum number of words for the corpus, we collected data spanning a long time period, from 1901 to the present, across six different categories. In order to be representative of all the phenomena of a particular language, the corpus contains a large variety of text samples. The six categories are further divided into 14 subcategories to capture the maximum variance in word collection and to make the corpus more balanced. Although there are no specific criteria for a balanced corpus, the criteria we have chosen are the topic (category) of text selection and the time span of data collection. The advantage of a balanced corpus is that texts are selected in such a way that searching becomes more efficient than in an imbalanced corpus. It also provides additional facilities such as classification of texts as per research requirement, filtering of texts, and statistical analysis of data based on various terms like age, gender, educational qualification, region, and category.
The list of categories and their corresponding subcategories with denoted keywords for data collection is as follows:

(1) History - H
    (a) Indian History - IH
    (b) World History - WH
(2) Literature - L
    (a) Poetry/Religion - PR
    (b) Gazals/Shyari - GS
    (c) Biography - BI
(3) Science - S
    (a) Medical - ME
    (b) Physics - PH
    (c) Chemistry - CH
(4) News - N
    (a) International - IN
    (b) National - NA
    (c) Sports - SP
(5) Architecture - A
    (a) Rural Architecture - RA
    (b) Urban Architecture - UA
(6) Politics - P
    (a) Central Government - CG
    (b) State Government - SG

3.1. Design of Handwritten Form

The form layout has been designed in a specific way to collect a large amount of significant information on a single form and make the corpus available in the ACM Trans. Asian Low-Resour. Lang. Inf. Process., Vol. 15, No. 4, Article 26, Publication date: May 2016.

26:10

P. Choudhary and N. Nain

Fig. 2. A sample filled-in form.

multidisciplinary research areas of document image analysis, such as writer identification, signature verification, segmentation of printed and handwritten text, evaluation of OCR algorithms and technology, and training of a system for automatic data entry. The layout of the handwritten form is separated into four parts with a horizontal line for convenience in the segmentation of machine-printed text followed by handwritten text and demographic information of the writer. The design of the A4 size form is split into four parts as shown in Figure 2; each part is separated from each other by a horizontal line and organized as follows: (1) Part 1: This part of the form comprises the title for a language in the database and a unique identification number (UID). For example, an Urdu language, Indian History, Form 1 will have the UID as (URD-H-IH-001). The UID of the corresponding form is automatically updated or generated once a language and category/subcategory are selected. (2) Part 2: This part of the form consists of two to four lines of printed text, which are collected from various sources having around 20 to 80 words. (3) Part 3: The third part of the form is left blank where the writers replicate the printed text in their own handwriting as shown in Figure 2. ACM Trans. Asian Low-Resour. Lang. Inf. Process., Vol. 15, No. 4, Article 26, Publication date: May 2016.

An Annotated Urdu Handwritten Text Images Corpus

26:11

Fig. 3. Distribution of domain-wise data collection of handwritten forms.

(4) Part 4: The fourth part of the form is optional and collects demographic information about the writer, which aids in training a system for automatic data entry. The fields collected are name, age/gender, education, address, rural/urban, date the form was filled out, and signature.

The filled-out forms were scanned in grayscale at a resolution of 300 dpi. Each form was scanned in full, including the printed text, handwritten text, and demographic information, and the corresponding transcription of the scanned image was stored in a Unicode UTF-8 text file.

3.2. Statistics of the Database

The database contains 1,200 handwritten text forms, filled out by 725 writers from different age groups and with different educational qualifications. Text pages were written by both males and females; 65% of the writers were males and 35% were females. Information about name, age, and address was collected on each page. Seventy-five percent of the 725 writers were younger than 26 years, and 79% were graduate students. Each writer was asked to fill out forms in an unconstrained environment, in his or her natural handwriting, with different pen styles and inks. To capture the maximum variance in data collection, the domain of data collection is divided into six categories and 14 subcategories. The statistics of the data collection according to the categories are shown in Figure 3.

The database contains 3,403 Urdu handwritten text lines, 46,664 Urdu words, and 101,181 Urdu ligatures. On average, each filled-out handwritten text page comprises 2.84 text lines, 38.89 words, and 84.31 ligatures. The database also contains 33,162 unique words, which is 71.07% of the total words present in the database. Besides this, the database contains 2,353 Urdu printed text lines. The domain-wise distribution of lines, words, and ligatures in the database is shown in Figure 4. The database contains ligatures of one to six characters, and the distribution of the ligatures with various character combinations is shown in Figure 5. Statistics of the demographic distribution of the dataset are tabulated in Table VI.
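The per-page averages and the unique-word share quoted above follow directly from the reported totals. The short Python sketch below only redoes that arithmetic; the constant and function names are illustrative and not part of CALAM.

```python
# Corpus totals as reported in Section 3.2 of the article.
FORMS = 1200
LINES = 3403
WORDS = 46664
LIGATURES = 101181
UNIQUE_WORDS = 33162


def per_form(total: int, forms: int = FORMS) -> float:
    """Average count per filled-out form."""
    return total / forms


def share_percent(part: int, whole: int) -> float:
    """Share of `part` in `whole`, expressed in percent."""
    return 100.0 * part / whole


if __name__ == "__main__":
    print(f"lines per form:     {per_form(LINES):.2f}")
    print(f"words per form:     {per_form(WORDS):.2f}")
    print(f"ligatures per form: {per_form(LIGATURES):.4f}")
    print(f"unique words:       {share_percent(UNIQUE_WORDS, WORDS):.2f}%")
```

The ligature average evaluates to 84.3175, so the reported 84.31 appears to be truncated rather than rounded; the other figures match under ordinary rounding.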


Fig. 4. Domain-wise data distribution of lines, words, and ligatures.

Fig. 5. Distribution of number of ligatures with one to six characters in the form.

Table VI. Statistics of the Demographic Distribution of the Dataset

Disparate demographic information    Total number
Writer names                         667
City names                           432
Date formats                         346
Signatures                           725


4. CORPUS MARKUP AND ANNOTATION

Availability of a large annotated ground-truth database is a significant advancement for handwritten text recognition techniques. Corpus annotation makes a corpus usable across broad areas of computational linguistic research by associating it with additional information and by providing support for machine learning. It also plays a significant role in the automatic evaluation of segmentation and recognition results. Annotation is a time-consuming and error-prone task, so it requires the utmost care, as highlighted in the literature on online Indic script annotation [Kumar et al. 2006; Belhe et al. 2009; Jawahar et al. 2009; Bhaskarabhatla et al. 2004]. Messaoud and Abed [2010] designed a structure to annotate handwritten historical documents, while Alaei et al. [2012] and Alaei et al. [2011a] annotated offline handwritten document databases.

We have developed a structure (CALAM) that annotates a large volume of offline handwritten text documents in detail and in a systematic way, while reducing annotation time and taking care of data validation. Apart from pure text-page annotation, CALAM provides additional linguistic features such as aligned transcription for segmented lines and words. In addition to handwritten text-page annotation, the database also stores, in Unicode and at the form level, demographic information about the writer of each text page. This information includes the writer’s name, education, age/gender, address, and geographical information. The annotation of a handwritten text form was done in the standard Unicode encoding (UTF-8) for two reasons: (1) to ensure compatibility with non-Urdu operating systems and character sets and (2) to make the corpus language- and operating-system-independent and compatible with other Unicode-based corpus access tools.
The following subsections describe, step by step, how the corpus is built once the scanned handwritten text forms are available, covering the four levels of annotation and the generation of a standard XML ground-truth file, as shown in Figure 6.

4.1. Structural Mapping and Auto-Indexing

The structural mapping provides facilities for corpus creation and for easy navigation through the stored handwritten images, segmented lines, words, and components, along with a broad view of the input data and its Unicode transcription. It also supports insertion, modification, and searching of data for direct access to the needed attributes and their annotated information. The unique ID (UID) of each handwritten form is configured as follows:
(1) The file name is the concatenation of the language (2 bits), category (3 bits), subcategory (3 bits), and form number (8 bits). The index structure is shown in Figure 7.
(2) The index of the form ID is therefore 16 bits wide, so the maximum total number of forms is $2^{16} = 65,536$.
(3) There can be a maximum of eight categories and eight subcategories per category (3 bits each). With 8 bits for the form number, each subcategory holds up to $2^{8} = 256$ forms, and each category holds up to $8 \times 256 = 2,048$ forms.
(4) The structure reserves 2 bits for the language so that it can be extended to support other languages.

To achieve automatic consistency checking throughout the database, every handwritten text image stored in the database gets the same UID that was generated during auto-indexing. The UID of each uploaded image is indexed automatically according to the selected language, category, and subcategory, as shown in Figure 7. When a new form is inserted, the user selects the script language and the category of the handwritten text form, and the ID field is filled in accordingly. For example, for the Urdu script, the Literature category, and the Gazals/Shyari subcategory, the UID of a form will be URD-L-GS-005, as shown in Figure 8.

Automatic indexing also applies to the UIDs of the segmented lines and words of a handwritten image, which extend the form UID with a "-" separator. The line UID is generated automatically from the image UID and, similarly, the word UID is generated from the line UID. For example, the first image of the Literature category and Poetry/Religion subcategory of the database is named URD-L-PR-001. The images are stored in PNG format, so the first image file of the database is named URD-L-PR-001.PNG.

Fig. 6. Hierarchical process flow of the corpus development and its annotation.
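The 16-bit index layout and the textual UIDs described in this subsection can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names are hypothetical, the numeric codes assigned to languages and categories are not given in the article, and the two-digit width of the line and word suffixes is assumed.

```python
from dataclasses import dataclass

# Field widths from the UID configuration: 2 + 3 + 3 + 8 = 16 bits.
LANG_BITS, CAT_BITS, SUB_BITS, FORM_BITS = 2, 3, 3, 8


@dataclass(frozen=True)
class FormIndex:
    language: int
    category: int
    subcategory: int
    form_no: int

    def pack(self) -> int:
        """Pack the four fields into one 16-bit integer, language in the top bits."""
        fields = (
            (self.language, LANG_BITS),
            (self.category, CAT_BITS),
            (self.subcategory, SUB_BITS),
            (self.form_no, FORM_BITS),
        )
        packed = 0
        for value, bits in fields:
            if not 0 <= value < (1 << bits):
                raise ValueError(f"{value} does not fit in {bits} bits")
            packed = (packed << bits) | value
        return packed

    @classmethod
    def unpack(cls, packed: int) -> "FormIndex":
        """Inverse of pack(): peel the fields off from the low bits upward."""
        form_no = packed & ((1 << FORM_BITS) - 1)
        packed >>= FORM_BITS
        subcategory = packed & ((1 << SUB_BITS) - 1)
        packed >>= SUB_BITS
        category = packed & ((1 << CAT_BITS) - 1)
        packed >>= CAT_BITS
        return cls(packed, category, subcategory, form_no)


def form_uid(language: str, category: str, subcategory: str, form_no: int) -> str:
    """Textual form UID such as URD-L-GS-005."""
    return f"{language}-{category}-{subcategory}-{form_no:03d}"


def child_uid(parent: str, number: int) -> str:
    """Extend a form UID to a line UID, or a line UID to a word UID.
    The two-digit suffix width is an assumption made for this sketch."""
    return f"{parent}-{number:02d}"


if __name__ == "__main__":
    print(1 << 16, 8 * (1 << FORM_BITS))          # total forms, forms per category
    idx = FormIndex(language=0, category=1, subcategory=1, form_no=5)
    print(idx.pack(), FormIndex.unpack(idx.pack()) == idx)
    uid = form_uid("URD", "L", "GS", 5)
    print(uid, child_uid(uid, 1), child_uid(child_uid(uid, 1), 3))
```

Deriving line and word UIDs from the parent UID, rather than storing them independently, is what makes the consistency check automatic: a child record can always be traced back to its form by stripping suffixes.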


Fig. 7. Automatic unique id generation of a handwritten text form.

Fig. 8. A sample structural mark up of a handwritten text image.

4.2. Ground Truth and Validation

The structure maps the exact location of handwritten text in the corresponding scanned images at the line, word, and ligature levels. These textual region coordinates are cross-indexed in the database as well as in the XML file, which is useful for proper benchmarking of segmentation techniques for handwritten text recognition. Selected segmented images of lines and words are stored in a separate folder, while all the manually entered ground-truth transcriptions of images, lines, and words are written to the respective fields in the database. A bounding box is displayed over the selected textual region for better visibility, so that one can recognize the path of the image components. The window screen is mapped to the viewport: when the cursor points at the unique ID of a line, word, or ligature, a rectangular bounding box appears on the corresponding image region in the viewport. A sample of the structural markup of a handwritten image is shown in Figure 8.
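One simple consistency check that such hierarchical bounding-box ground truth allows is that every word box lies inside the box of its line. The sketch below illustrates the idea in Python; it is not taken from the CALAM tooling, and the `Box` type, the suffix-based UID convention, and all coordinates are assumptions made for the example.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, other: "Box") -> bool:
        """True if `other` lies entirely inside this box (edges may touch)."""
        return (self.left <= other.left and self.top <= other.top
                and other.right <= self.right and other.bottom <= self.bottom)


def parent_uid(uid: str) -> str:
    """Drop the last '-'-separated field, e.g. a word UID becomes its line UID."""
    return uid.rsplit("-", 1)[0]


def misplaced(children: dict[str, Box], parents: dict[str, Box]) -> list[str]:
    """UIDs of child regions that have no parent or extend beyond it."""
    bad = []
    for uid, box in children.items():
        parent = parents.get(parent_uid(uid))
        if parent is None or not parent.contains(box):
            bad.append(uid)
    return bad


if __name__ == "__main__":
    # Made-up coordinates, for illustration only.
    lines = {"URD-L-PR-001-01": Box(40, 900, 2400, 1010)}
    words = {
        "URD-L-PR-001-01-01": Box(2100, 910, 2380, 1000),  # inside the line
        "URD-L-PR-001-01-02": Box(1800, 905, 2090, 1020),  # 10 px below it
    }
    print(misplaced(words, lines))
```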

Table VII. Metainformation of an XML Formatted File

Level  Specification            Metainformation
1      Demographic information  Writer’s description
2      Handwritten image        Image unique ID; printed text; date of creation; writer information; number of lines; transcription of handwritten text; pixel coordinates of image
3      Segmented lines          Line unique ID; pixel coordinates for bounding box; transcription text; number of words
4      Segmented words          Word unique ID; pixel coordinates for bounding box; transcription text; synonyms, antonyms; number of ligatures
5      Ligatures                Pixel coordinates for bounding box
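To make the five levels of Table VII concrete, the following Python sketch builds a small nested XML record with the standard-library `xml.etree.ElementTree` module. The element and attribute names, the coordinates, and the sample transcription are all assumptions chosen for illustration; the article does not list the actual CALAM schema.

```python
import xml.etree.ElementTree as ET


def bbox(x: int, y: int, width: int, height: int) -> dict[str, str]:
    """Bounding box as string-valued XML attributes (pixel units)."""
    return {"x": str(x), "y": str(y), "width": str(width), "height": str(height)}


def build_form(uid: str) -> ET.Element:
    """Build an illustrative five-level record for one handwritten form."""
    form = ET.Element("form", {"uid": uid})

    # Level 1: demographic information about the writer.
    writer = ET.SubElement(form, "writer")
    ET.SubElement(writer, "description").text = "age/gender, education, address"

    # Level 2: the scanned handwritten image (A4 at 300 dpi is 2480 x 3508 px).
    image = ET.SubElement(
        form, "image", {"file": f"{uid}.png", "lines": "1", **bbox(0, 0, 2480, 3508)}
    )

    # Level 3: one segmented line with its transcription.
    line = ET.SubElement(
        image, "line", {"uid": f"{uid}-01", "words": "1", **bbox(40, 900, 2360, 110)}
    )
    ET.SubElement(line, "transcription").text = "اردو"

    # Level 4: one segmented word inside that line.
    word = ET.SubElement(
        line, "word", {"uid": f"{uid}-01-01", "ligatures": "4", **bbox(2100, 910, 280, 90)}
    )
    ET.SubElement(word, "transcription").text = "اردو"

    # Level 5: ligatures carry only a bounding box.
    for i in range(4):
        ET.SubElement(word, "ligature", bbox(2100 + 70 * i, 910, 70, 90))
    return form


if __name__ == "__main__":
    root = build_form("URD-L-PR-001")
    ET.indent(root)  # pretty-print; available from Python 3.9
    print(ET.tostring(root, encoding="unicode"))
```

Nesting words inside lines and ligatures inside words mirrors the UID scheme, so a segmentation benchmark can read the boxes for any one level with a single path query such as `./image/line/word`.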

Displaying the image and its corresponding information in the same viewport makes it easy to validate context information and to visually review the annotated data. Using this structure, we created a database in which all annotation information is stored, while the images of text pages, segmented lines, and words are stored separately in the system in PNG format, each named with its corresponding UID. Validation checks are crucial for maintaining the integrity of any database structure and help ensure that the system operates on clean, correct, and useful data. In our corpus they are implemented through auto-indexing and cross-indexing routines together with validation and data normalization rules. In a nutshell, data needs to be validated at the stage or level where it is most likely to be erroneous. The types of data validation applied are form-level validation, search criteria validation, field-level validation, and range validation for every field.

4.3. An XML Representation

An Extensible Markup Language (XML)-based set of rules was used for encoding documents in a format that is both human readable and machine readable. XML provides a standard hierarchical representation that is well suited to document analysis tasks, and it is the most commonly used file format for the ground-truth annotation of a corpus. CALAM can create an XML representation, based on the data entry description, for each handwritten text form of the database. The user can select an image to generate the corresponding XML file and then download or directly view the XML file of that image. The heart of CALAM is the image database, which includes 1,200 scanned images of handwritten sentences. Each image is accompanied by a rich XML metainformation file that is encoded at five hierarchical levels, as shown in Table VII. Each XML file holds a hierarchical record that categorizes the handwritten text image data into levels such as lines, words, and components and describes the specification of each. The XML schema also encapsulates writers’ demographic information like name, age, education, and address as other elements


Table VIII. Comparative Analysis of CALAM Structure PixLabeler GTLC

Input Type English Chinese

APTI Truthing tool MAST

Arabic English Camera image

LabelMe

Scene image

Annotation Level Image Lines Words Characters Image Image Printed text in image Object in image

CALAM

Handwritten document

Lines Words Components

Output Text, XML XML

Applications Labeling Annotation

XML XML Unicode XML XML

Transcription Retrieval of text Annotation of text

Image Unicode Auto-generated XML file

Object detection Design and annotate corpus, OCR algorithm benchmarking, Unicode transcription, statistics analysis of corpus data on various terms, NLP applications

of the corpus. The Corpus Encoding Standard (CES) [Ide 1998b], under the guidelines of the Text Encoding Initiative (TEI) [Stührenberg 2012], is used for electronic data encoding and for the metainformation of the XML file. As a result, the structure generates an XML file for each text page that includes the line, word, and ligature information of that page, based on the data entries. The XML file contains the same information as was provided during data entry (with a five-level hierarchy, suffixed with the UID). For example, the XML file obtained for the handwritten form with UID “URD-U-UA-001.png” will be “URD-U-UA-001.xml.”

5. COMPARATIVE CHARACTERISTICS OF THE PROPOSED DATABASE

Table VIII compares the proposed corpus structure CALAM with existing structures for a handwritten text image corpus, namely PixLabeler [Saund et al. 2009], GTLC [Yin et al. 2009], Truthing Tool [Elliman and Sherkat 2001], APTI [Slimane et al. 2009], MAST [Kasar et al. 2011], and LabelMe [Russell et al. 2008], and shows the functionality of these annotation tools. PixLabeler and Truthing Tool provide a way to annotate English-language documents. APTI and GTLC are available for offline handwritten document annotation in Arabic and Chinese scripts, respectively. APTI has been designed to annotate handwritten images, excluding line and word annotation. MAST and LabelMe were designed for annotation of camera-based images. LabelMe provides the functionality of object recognition in a scene image, and MAST can be used for annotation of printed text in multiscript scenic images.

Compared to the previous structures, CALAM displays the handwritten Urdu text image and the transcription of that image on the same screen in a collaborative context. CALAM offers a simple way to annotate and collect a large volume of information for Urdu script, such as digits, paragraphs, lines, words, machine-printed text, and handwritten text, on the same platform. CALAM automatically generates an XML file of annotated metainformation that serves as ground truth for the image (bounding box coordinates of lines, words, and ligatures) for benchmarking and evaluation of various OCR techniques such as segmentation and handwritten text recognition. All the structural markups are done with pixel-level precision.

Table IX. Quantitative Analysis of Four Techniques Used for Testing the Proposed Dataset for Text-Line Segmentation

Benchmarking Techniques    Number of Test Images    Average Accuracy
Godara et al. [2014]       400                      91.2%
Khanduja et al. [2013]     400                      90.6%
Panwar and Nain [2014]     400                      93.08%
Alaei et al. [2011b]       400                      94.12%
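The accuracy values in Table IX are computed with Equation (1) of Section 6.1, that is, the share of correct decisions among all decisions. The minimal Python sketch below shows that formula only; the article does not specify how true and false positives and negatives are counted for text lines, and the counts in the example are made up.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Equation (1): (T.P. + T.N.) / (T.P. + T.N. + F.P. + F.N.)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one decision is required")
    return (tp + tn) / total


if __name__ == "__main__":
    # Hypothetical counts for a single test image, for illustration only.
    print(f"{accuracy(tp=90, tn=5, fp=3, fn=2):.2%}")
```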

The corpus structure can be used for different classification criteria as required in multidisciplinary research, such as searching, filtering, statistical analysis of data, and the study of data distribution in terms of sex, name, education, region, domain, and other parameters.

6. EXPERIMENTATIONS AND RESULTS

To strengthen our claim for the applicability of the proposed dataset as an Urdu linguistic resource, we ran several handwritten text segmentation algorithms on the dataset and applied Zipf’s test [Piantadosi 2014] to observe the behavior of its word frequency distribution.

6.1. Text-Line Segmentation Results

To give other researchers a baseline for evaluating and comparing their text-line segmentation/recognition techniques on the proposed dataset, we tested four different text-line segmentation algorithms on the CALAM dataset. Each technique was tested on 400 images taken from the proposed CALAM Urdu handwritten dataset [Choudhary et al. 2015]. We selected 200 and 100 images from the News and Politics categories, respectively; the remaining 100 images combine the first 25 images from each of the other four categories.
(1) We tested the technique proposed by Alaei et al. [2011b] to segment handwritten text documents into individual text lines. The average accuracy of this algorithm, as defined by Equation (1), is 94.12%:

$$ \text{Accuracy} = \frac{\mathrm{T.P.} + \mathrm{T.N.}}{\mathrm{T.P.} + \mathrm{T.N.} + \mathrm{F.P.} + \mathrm{F.N.}}, \tag{1} $$

where T.P., T.N., F.P., and F.N. denote true positives, true negatives, false positives, and false negatives, respectively.
(2) The second technique tested was proposed by Godara et al. [2014] for handwritten Urdu script segmentation and uses the smearing method for line segmentation. The average accuracy achieved by the algorithm is 91.2%.
(3) The third technique used in our experimentation was proposed by Khanduja et al. [2013]. The average accuracy achieved on the 400 test images is 90.6%.
(4) Panwar et al. [2013] and Panwar and Nain [2014] proposed a line segmentation technique based on the Connectivity Strength Parameter (CSF). The average accuracy achieved is 93.08%.
Table IX summarizes the complete test results.

6.2. Word Frequency Distribution Using Zipf’s Rule

To strengthen our claim for the applicability of the proposed dataset as an Urdu linguistic resource, we also applied Zipf’s test [Piantadosi 2014] to the dataset to ascertain that it conforms to this universal principle of language. In 1949, Zipf [Piantadosi 2014] proposed a rule to analyze the distribution and behavior of words in a corpus that is significant in statistical linguistic analysis. According to Piantadosi [2014], every natural language follows Zipf’s rule for the frequency distribution of words. Zipf’s rule states that if f is the frequency of a word in a corpus and r is the rank of the word, then the frequency of words in a large corpus of natural language is inversely proportional to the rank of words, as shown in Equation (2):

$$ f \propto \frac{1}{r}. \tag{2} $$

In other words, if the words of the corpus are arranged in descending order of frequency $(w_1, w_2, \ldots, w_n)$, then the occurrence frequency of the second word $w_2$ is $w_1/2$, half that of the first word $w_1$, the third word $w_3$ occurs roughly $w_1/3$ times, one-third as often as the first word, and so on. It follows that multiplying the rank r of a word (rank one being the most frequent) by its frequency f (how many times the word occurred in the text) gives a product C that remains approximately the same for each word, as shown in Equation (3):

$$ w_{f_i} = \frac{C}{w_{r_i}}, \tag{3} $$

where $w_{f_i}$ and $w_{r_i}$ are the frequency and the rank of the i-th word. From Equation (3), we can derive a generalization of this rule stating that the frequency of words decreases very rapidly with rank. This can also be written as Equation (4):

$$ w_{f_i} = C\,(w_{r_i})^{k}. \tag{4} $$

By taking the log of Equation (4), we get Equation (5):

$$ \log(w_{f_i}) = \log C + k \log(w_{r_i}), \tag{5} $$

where k = −1 and C is a constant. So a graph of log(f) against log(r) for a corpus must be linear with a slope of −1. Figure 9 shows the Zipf curve for the words of the proposed Urdu corpus. The resulting log(f) versus log(r) curve validates that the proposed corpus follows Zipf’s rule for the frequency distribution of words.

Fig. 9. Zipf curve of the words’ distribution of the proposed corpus CALAM.
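Equation (5) suggests a direct numerical check: rank the words by frequency, take logarithms, and fit a straight line; the fitted slope estimates k. The Python sketch below does this with an ordinary least-squares fit on a synthetic token list whose frequencies follow C/r exactly. It is not the procedure used to produce Figure 9, and the function name and toy data are assumptions for illustration.

```python
import math
from collections import Counter


def zipf_fit(tokens: list[str]) -> tuple[float, float]:
    """Least-squares fit of log(f) = log(C) + k*log(r); returns (k, log C)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words to fit a line")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k = sxy / sxx
    return k, mean_y - k * mean_x


if __name__ == "__main__":
    # Synthetic corpus: the word of rank r occurs exactly 60 / r times.
    tokens = [f"w{r}" for r in range(1, 6) for _ in range(60 // r)]
    k, log_c = zipf_fit(tokens)
    print(f"k = {k:.3f}, C = {math.exp(log_c):.1f}")
```

On real text the fitted slope is only approximately −1, and the fit is sensitive to the many rare words at the tail of the ranking, so the plot itself remains worth inspecting alongside the number.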


7. CONCLUSION

In this article, we have presented CALAM, an Urdu handwritten text image corpus, along with its annotation structure with pixel-level precision. The uniformity of the structure provides an appropriate way to annotate handwritten text images. The balancing applied at the data collection stage lets researchers control the proportion of values according to different usages of the corpus. We described an XML-based handwritten text image corpus and an annotation methodology that has the potential to support, on a single platform, document image processing research such as writer identification; signature verification; segmentation/recognition of text pages at line, word, and ligature levels; and separation of handwritten and printed texts. The database would be helpful in the design of an automatic intelligent system for direct processing of massive numbers of handwritten forms, such as those collected for census data. It can also be widely used for language transcription and transliteration applications, acting as an information exchange center. To date, only two datasets are available for handwritten Urdu script. The aim of this work is to build a resource that provides ground-truth annotation for handwritten text images. We propose to release the dataset as an open resource on cloud storage, free for academic use, with usage permissions granted on request.

REFERENCES

S. Al-Ma’adeed, D. Elliman, and C. A. Higgins. 2002. A data base for Arabic handwritten text recognition research. In Proceedings of the 8th International Workshop on Frontiers in Handwriting Recognition. 485–489.
Y. Al-Ohali, M. Cheriet, and C. Suen. 2003. Databases for recognition of handwritten Arabic cheques. Pattern Recognition 36, 1 (2003), 111–121.
A. Alaei, P. Nagabhushan, and U. Pal. 2011a. A benchmark Kannada handwritten document dataset and its segmentation. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR’11). 141–145.
A. Alaei, U. Pal, and P. Nagabhushan. 2011b. A new scheme for unconstrained handwritten text-line segmentation. Pattern Recognition 44, 4 (April 2011), 917–928.
A. Alaei, U. Pal, and P. Nagabhushan. 2012. Dataset and ground truth for handwritten text in four different scripts. International Journal of Pattern Recognition and Artificial Intelligence 26, 04 (2012), 1–25.
H. Alamri, J. Sadri, C. Y. Suen, and N. Nobile. 2008. A novel comprehensive database for Arabic offline handwriting recognition. In Proceedings of the 11th International Conference on Frontiers in Handwriting Recognition (ICFHR’08). 664–669.
S. Belhe, S. Chakravarthy, and A. G. Ramakrishnan. 2009. XML standard for indic online handwritten database. In Proceedings of the International Workshop on Multilingual OCR (MOCR’09). ACM, New York, NY, USA, Article 19, 4 pages.
A. S. Bhaskarabhatla, S. Madhvanath, M. N. S. S. K. Pavan Kumar, A. Balasubramanian, and C. V. Jawahar. 2004. Representation and annotation of online handwritten data. In Proceedings of the 9th International Workshop on Frontiers in Handwriting Recognition (IWFHR-9’04). 136–141.
S. Bhaskarabhatla and S. Madhvanath. 2004. Experiences in collection of handwriting data for online handwriting recognition in indic scripts. In Proceedings of the 4th International Conference Linguistic Resources and Evaluation (LREC’04). 2223–2226.
U. Bhattacharya and B. Chaudhuri. 2009. Handwritten numeral databases of indian scripts and multistage recognition of mixed numerals. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 3 (March 2009), 444–457.
Census. 2001. (2001). http://www.censusindia.gov.in/2001-common/censusdataonline.html.
P. Choudhary, N. Nain, and M. Ahmed. 2015. A unified approach for development of Urdu corpus for OCR and demographic purpose. In Proceedings of the 7th International Conference on Machine Vision (ICMV’15), Vol. 9445. 1–5.
L. Deng. 2012. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine 29, 6 (Nov. 2012), 141–142.


R. I. M. Elanwar, M. A. Rashwan, and S. A. Mashali. 2010. OHASD: The first online Arabic sentence database handwritten on tablet PC. In Proceedings of the World Academy of Science, Engineering and Technology (WASET’10), International Conference on Signal and Image Processing (ICSIP’10) 4, 12 (2010), 585–590.
D. Elliman and N. Sherkat. 2001. A truthing tool for generating a database of cursive words. In Proceedings of the 6th International Conference on Document Analysis and Recognition, 2001. 1255–1262.
B. Gatos, N. Stamatopoulos, and G. Louloudis. 2009. ICDAR 2009 handwriting segmentation contest. In Proceedings of 10th International Conference on Document Analysis and Recognition (ICDAR’09). 1393–1397.
B. Gatos, N. Stamatopoulos, and G. Louloudis. 2010. ICFHR 2010 handwriting segmentation contest. In Proceedings of 2010 International Conference on Frontiers in Handwriting Recognition (ICFHR’10). 737–742.
S. Godara, N. Nain, and M. Ahamed. 2014. Handwritten Urdu script segmentation using hybrid approach. In Proceedings of the DAR 2014 Satellite Workshop of ICVGIP 2014 on Document Analysis and Recognition, 2014.
E. Grosicki, M. Carré, E. Augustin, and F. Prêteux. 2006. La campagne d’évaluation RIMES pour la reconnaissance de courriers manuscrits. In Actes 9ème Colloque International Francophone sur l’Ecrit et le Document (CIFED’06). Fribourg, Suisse, 61–66.
I. Guyon, L. Schomaker, R. Plamondon, M. Liberman, and S. Janet. 1994. UNIPEN project of online data exchange and recognizer benchmarks. In Proceedings of the 12th IAPR International Conference on Computer Vision and Image Processing, Vol. 2. 29–33.
P. J. Haghighi, N. Nobile, C. L. He, and C. Y. Suen. 2009. A new large-scale multi-purpose handwritten Farsi database. In Proceedings of the 6th International Conference on Image Analysis and Recognition (ICIAR’09). Springer-Verlag, Berlin, 278–286.
J. J. Hull. 1994. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence 16, 5 (May 1994), 550–554.
N. Ide. 1998b. Corpus encoding standard: SGML guidelines for encoding linguistic corpora. In Proceedings of the 1st International Language Resources and Evaluation Conference. 463–470.
E. Indermühle, M. Liwicki, and H. Bunke. 2010. IAMonDo-database: An online handwritten document database with non-uniform contents. In Proceedings of the International Workshop on Document Analysis Systems. 97–104.
C. V. Jawahar, A. Balasubramanian, M. Meshesha, and A. M. Namboodiri. 2009. Retrieval of online handwriting by synthesis and matching. Pattern Recognition 42, 7 (2009), 1445–1457.
T. Kasar, D. Kumar, M. N. Anil Prasad, D. Girish, and A. G. Ramakrishnan. 2011. MAST: Multi-script annotation toolkit for scenic text. In Proceedings of the 2011 Joint Workshop on Multilingual OCR and Analytics for Noisy Unstructured Text Data. ACM, New York, NY, Article 14, 8 pages.
D. Khanduja, N. Nain, and S. Panwar. 2013. A hybrid feature extraction algorithm for devanagari script. ACM Transactions on Asian Low-Resource Language Information Processing 15, 1, 105–111.
N. Kharma, M. Ahmed, and R. Ward. 1999. A new comprehensive database of handwritten arabic words, numbers, and signatures used for OCR testing. In Proceedings of the IEEE Canadian Conference on Electrical and Computer Engineering, Vol. 2. 766–768.
H. Khosravi and E. Kabir. 2007. Introducing a very large dataset of handwritten Farsi digits and a study on their varieties. Pattern Recognition Letters 28, 10 (2007), 1133–1141.
D. H. Kim, Y. S. Hwang, S. T. Park, E. J. Kim, S. H. Paek, and S. Y. Bang. 1993. Handwritten Korean character image database PE92. In Proceedings of the 2nd International Conference on Document Analysis and Recognition. 470–473.
A. Kumar, A. Balasubramanian, A. Namboodiri, and C. V. Jawahar. 2006. Model-based annotation of online handwritten datasets. In Proceedings of the International Workshop on Frontiers in Handwriting Recognition (IWFHR’06). Université de Rennes, La Baule, Centre de Congrès Atlantia, France.
S. Kumar. 2010. An analysis of irregularities in Devanagari script writing: A machine recognition perspective. International Journal of Computer Science Engineering 2, 2 (2010), 274–279.
Y. Li, Y. Zheng, D. Doermann, S. Jaeger, and Yi Li. 2008. Script-independent text line segmentation in freestyle handwritten documents. IEEE Transactions on Pattern Analysis and Machine Intelligence 30, 8 (Aug. 2008), 1313–1329.
L. Likforman-Sulem, A. Zahour, and B. Taconet. 2007. Text line segmentation of historical documents: A survey. International Journal of Document Analysis and Recognition (IJDAR) 9, 2–4 (2007), 123–138.
C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang. 2011. CASIA online and offline chinese handwriting databases. In Proceedings of the International Conference on Document Analysis and Recognition (ICDAR’11), 2011. 37–41.


G. Louloudis, B. Gatos, I. Pratikakis, and C. Halatsis. 2009. Text line and word segmentation of handwritten documents. Pattern Recognition 42, 12 (2009), 3169–3183. New Frontiers in Handwriting Recognition.
S. A. Mahmoud, I. Ahmad, M. Alshayeb, and W. G. Al-Khatib. 2011. A database for offline arabic handwritten text recognition. In Image Analysis and Recognition, Mohamed Kamel and Aurélio Campilho (Eds.). Lecture Notes in Computer Science, Vol. 6754. Springer, Berlin, 397–406.
V. Margner and H. El Abed. 2009. ICDAR 2009 Arabic handwriting recognition competition. In Proceedings of 10th International Conference on Document Analysis and Recognition, 2009. 1383–1387.
U.-V. Marti and H. Bunke. 2002. The IAM-database: An English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition 5, 1 (2002), 39–46.
I. B. Messaoud and H. E. Abed. 2010. Automatic annotation for handwritten historical documents using Markov models. In Proceedings of the International Conference on Frontiers in Handwriting Recognition (ICFHR’10). 381–386.
S. Mozaffari, H. El Abed, V. Margner, K. Faez, and A. Amirshahi. 2008. IfN/Farsi-database: A database of Farsi handwritten city names. In Proceedings of the 11th International Conference on Frontiers in Handwriting Recognition (ICFHR’08). 397–402.
M. Nakagawa, T. Higashiyama, Y. Yamanaka, S. Sawada, L. Higashigawa, and K. Akiyama. 1997. Online handwritten character pattern database sampled in a sequence of sentences without any writing instructions. In Proceedings of the 4th International Conference on Document Analysis and Recognition, 1997, Vol. 1. 376–381.
M. Nakagawa and K. Matsumoto. 2004. Collection of online handwritten Japanese character pattern databases and their analyses. Document Analysis and Recognition 7, 1 (2004), 69–81.
B. Nethravathi, C. P. Archana, K. Shashikiran, A. G. Ramakrishnan, and V. Kumar. 2010. Creation of a huge annotated database for Tamil and Kannada OHR. In Proceedings of the International Conference on Frontiers in Handwriting Recognition (ICFHR’10). 415–420.
S. Panwar and N. Nain. 2014. A novel segmentation methodology for cursive handwritten documents. IETE Journal of Research 60, 6 (2014), 432–439.
S. Panwar, N. Nain, S. Saxena, and P. C. Gupta. 2013. Language adaptive methodology for handwritten text line segmentation. In Computer Analysis of Images and Patterns, Richard Wilson, Edwin Hancock, Adrian Bors, and William Smith (Eds.). Lecture Notes in Computer Science, Vol. 8047. Springer, Berlin, 344–351.
M. Pechwitz, S. S. Maddouri, V. Märgner, N. Ellouze, and H. Amiri. 2002. IFN/ENIT - database of handwritten Arabic words. In Francophone International Conference on Writing and Document (CIFED’02). Hammamet, Tunisia, 129–136.
S. T. Piantadosi. 2014. Zipf’s word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review 21, 5 (2014), 1112–1130.
A. Raza, I. Siddiqi, A. Abidi, and F. Arif. 2012. An unconstrained benchmark Urdu handwritten sentence database with automatic line segmentation. In Proceedings of the 2012 International Conference on Frontiers in Handwriting Recognition (ICFHR’12). IEEE Computer Society, Washington, DC, 491–496.
B. C. Russell, A. Torralba, K. P. Murphy, and W. T. Freeman. 2008. LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision 77, 1–3 (2008), 157–173.
R. Safabaksh, A. R. Ghanbarian, and G. Ghiasi. 2013. HaFT: A handwritten Farsi text database. In Proceedings of the 8th Iranian Conference on Machine Vision and Image Processing (MVIP’13). 89–94.
M. W. Sagheer, C.-L. He, N. Nobile, and C. Suen. 2009. A new large urdu database for offline handwriting recognition. In Proceedings of International Conference on Image Analysis and Processing (ICIAP’09), Pasquale Foggia, Carlo Sansone, and Mario Vento (Eds.). Lecture Notes in Computer Science, Vol. 5716. Springer, Berlin, 538–546.
T. Saito, H. Yamada, and K. Yamamoto. 1985. On the data base ETL9 of handprinted characters in JIS Chinese characters and its analysis (in Japanese). Transactions of the IECE Japan J68-D(4) (1985), 757–764.
R. Sarkar, N. Das, S. Basu, M. Kundu, M. Nasipuri, and Dk. Basu. 2012. CMATERdb1: A database of unconstrained handwritten Bangla and Bangla-English mixed script document image. International Journal of Document Analysis and Recognition 15, 1 (March 2012), 71–83.
E. Saund, J. Lin, and P. Sarkar. 2009. PixLabeler: User interface for pixel-level labeling of elements in document images. In Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR’09). IEEE Computer Society, Washington, DC, 646–650.
F. Slimane, R. Ingold, S. Kanoun, A. M. Alimi, and J. Hennebert. 2009. A new arabic printed text image database and evaluation protocols. In Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR’09). 946–950.

ACM Trans. Asian Low-Resour. Lang. Inf. Process., Vol. 15, No. 4, Article 26, Publication date: May 2016.

An Annotated Urdu Handwritten Text Images Corpus

26:23

N. Stamatopoulos, B. Gatos, G. Louloudis, U. Pal, and A. Alaei. 2013. ICDAR 2013 handwriting segmentation contest. In Proceedings of the 12th International Conference on Document Analysis and Recognition (ICDAR’13). 1402–1406. M. Sthrenberg. 2012. The TEI and current standards for structuring linguistic data. Journal of the Text Encoding Initiative 3 (Nov. 2012), 1–14. T. Su, T. Zhang, and D. Guan. 2007. Corpus-based HIT-MW database for offline recognition of general-purpose Chinese handwritten text. International Journal of Document Analysis and Recognition (IJDAR) 10, 1 (2007), 27–38. S. Sutat and L. Methasate. 2004. Thai handwritten character corpus. IEEE International Symposium on Communications and Information Technology 1 (Oct 2004), 486–491. C. Viard-Gaudin, P. M. Lallican, S. Knerr, and P. Binter. 1999. The IRESTE on/off (IRONOFF) dual handwriting database. In Proceedings of the 5th International Conference on Document Analysis and Recognition, 1999 (ICDAR’99). 455–458. R. Wilkinson. 1992. The first census optical character recognition systems. In The U.S. Bureau of Census and the National Institute of Standards and Technology (Tech. Rep. NISTIR 4912, National Institute of Standards and Technology.). Gaithersburg, MD, 1–372. F. Yin and C.-L. Liu. 2009. Handwritten Chinese text line segmentation by clustering with distance metric learning. Pattern Recognition 42, 12 (2009), 3146–3157. New Frontiers in Handwriting Recognition. F. Yin, Q.-F. Wang, and C.-L. Liu. 2009. A tool for ground-truthing text lines and characters in offline handwritten Chinese documents. In Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR’09). 951–955. M. Ziaratban, K. Faez, and F. Bagheri. 2009. FHT: An unconstraint Farsi handwritten text database. In Proceedings of the 10th International Conference on Document Analysis and Recognition (ICDAR’09). IEEE Computer Society, Washington, DC, 281–285. 
Received January 2015; revised December 2015; accepted December 2015
