The 8th International Workshop on Systems, Signal Processing and their Applications 2013: Special Sessions

CREATION OF UYGHUR OFFLINE HANDWRITTEN DATABASE

Kurban Ubul1,2, Mavjuda Zunun1,2, Alim Aysa2,3, Nurbiya Yadikar1,2, Umut Yunus1

1 School of Information Science and Engineering, Xinjiang University, China
2 Xinjiang Laboratory of Multi-language Information Technology, Urumqi, China
3 Center of Modern Education Technology, Xinjiang University, China

ABSTRACT

Standard databases play an essential role in evaluating and comparing the research results achieved by different groups of researchers. Most reported handwritten databases are based on Latin, Chinese, Arabic, and other scripts, but none of them covers Uyghur handwriting. In this paper, a Uyghur handwritten text database written by multiple writers is introduced. It can be used for research on Uyghur handwriting recognition (word or text), word segmentation, document image processing (analysis or retrieval), and writer identification. The database contains 2016 handwritten document images written by 813 native Uyghur writers. Each handwritten document is scanned and stored as a .bmp file, and the noise in the handwriting image is removed. So far, this is the first Uyghur offline handwriting database that is freely available to researchers worldwide.

1. INTRODUCTION

In the past three decades, researchers have extensively studied automatic handwriting technologies such as handwriting recognition and writer identification. In this field, standard databases play an important role in evaluating and comparing the results that research groups worldwide obtain with various innovative methods. Many standardized handwritten databases have therefore been developed, and they have demonstrated their significance in research work [1]. Some of them are used extensively for research in handwriting recognition and identification, for instance CEDAR (Latin) [2], IAM (Latin) [3], CENPARMI (Latin) [4], HCL2000 (Chinese) [5], HIT-MW (Chinese) [6], Al-ISRA (Arabic) [7], IFHCDB (Farsi) [8], ISI (Indian) [9], ETL9 (Japanese) [10], and PE92 (Korean) [11]. Thus the reported handwritten databases are mostly based on Latin, Chinese, Arabic, Farsi, and other scripts, but none of them covers Uyghur handwriting.

Much work has been carried out on Uyghur handwritten character and word recognition [12-14] and writer identification [15,16]. Doubtless, some of this work relied on privately collected and privately kept databases of Uyghur printed and handwritten text. However, despite a determined and lengthy search of the literature and the Internet, no publicly accessible Uyghur database has been found so far. This must be a genuine hindrance for many researchers working on Uyghur handwriting recognition (character and word recognition), word spotting, and writer identification. Consequently, given that our own work [15,16] was on offline Uyghur handwriting based writer identification, we considered such a database to be, on its own merit, a worthwhile academic effort. If properly constructed, this database will provide researchers with a useful tool. The authors also appeal to researchers who are studying Uyghur handwriting to make their databases available to academia if possible, so that Uyghur information processing, especially research on Uyghur handwriting recognition and identification, can progress faster.

This paper introduces an offline Uyghur handwriting database built by the School of Information Science and Engineering, Xinjiang University, China. The handwritten samples were produced by 813 native Uyghur writers using an ordinary pen on writing paper, and they include handwritten Uyghur texts (continuous script) and a small number of isolated words and numbers. Each participant was asked to write 2 to 3 copies, so that the same database can be used for both training and testing.

978-1-4673-5540-7/13/$31.00 ©2013 IEEE

2. BRIEF INTRODUCTION TO THE NATURE OF UYGHUR SCRIPT

The Uyghurs are a Turkic-speaking ethnic group inhabiting Eastern and Central Asia. Nowadays, Uyghurs mainly live in the Xinjiang Uyghur Autonomous Region (hereafter: Xinjiang) in China. The Arabic-based Uyghur script is one of the official writing systems in Xinjiang, while the Cyrillic-based Uyghur script is still used by Uyghurs in the post-Soviet states, and a Latin-based Uyghur script is also in use¹. Arabic-based Uyghur (hereafter: Uyghur) handwriting, which is widely used in Xinjiang, is collected as the data for the Uyghur handwritten database in this paper.

¹ Uyghur Arabic alphabet; see http://en.wikipedia.org/wiki/Uyghur_alphabet

Uyghur differs from Arabic in several respects. Arabic has 28 letters, and normal Arabic text is composed only of series of consonants. The Uyghur alphabet has 8 vowel letters and 24 consonant letters, and each letter has 4 different forms, so the 32 Uyghur letters give more than 120 character shapes. For example, the word "life" is written as "حياة" in Arabic and as "ھايات" in Uyghur. The main characteristics of the Uyghur script are as follows.

1. Uyghur is written from right to left, unlike Latin and Chinese. Each Uyghur letter has 4 different written forms: "initial form", "intermediate form", "final form", and "isolated form". The different forms of a letter are shown in Table 1.

Table 1. Different forms of a Uyghur letter

Initial form    Intermediate form    Final form    Isolated form

2. One or several letters form a Uyghur word. According to the writing rules of Uyghur, these letters are joined by initial and final connections into one or several connected letter segments. In both print and handwriting, the letters are connected along a certain horizontal level, which is called the base line.

3. The widths of Uyghur letters are unequal. This holds not only across different letters but also across the 4 forms of the same letter. Moreover, a straight line is used to fill the spaces between letters so that a line of text is evenly distributed.

4. A Uyghur word is made up of one or several syllables, each commonly built by combining a vowel with consonants, where the vowel is considered the centre of the syllable. One or several syllables constitute a connected component. There is a blank space between two connected components or between two words, as shown in Figure 1. There are some rules for forming syllables and words in Uyghur.

Figure 1. A simple Uyghur sentence

The Uyghur sentence in Figure 1 consists of 2 words and 7 letters. The first word is formed from one syllable, 2 letters (numbered 1 and 2), and 2 connected components. The second word is made up of 2 syllables, 5 letters (numbered 3 to 7), and 4 connected components.

5. The strokes of words in Uyghur handwriting are not fixed. For the same handwritten Uyghur word, the number of strokes differs from person to person. In particular, strokes differ in many aspects, such as size, position, length, width, slant angle, and structure. For instance, Figure 2 shows the same word "Xinjiang" (in Uyghur: شىنجاڭ) written by 5 different people.

Figure 2. The Uyghur word "Xinjiang" (in Uyghur: شىنجاڭ) written by 5 different people

If the number of strokes in the first letter "ش" is considered in Figure 2 (a) to (e), it is made up of 2, 4, 3, 6, and 4 strokes respectively. This phenomenon becomes much more complicated when a handwritten Uyghur word, phrase, or even a sentence is considered.

3. DESCRIPTION OF THE DATABASE

Data collection is the first step in creating a handwritten database. Native Uyghur people were selected to write certain text in their natural handwriting on a paper sheet. All the handwriting was then scanned with an HP scanner and stored on a computer in .bmp file format at 300 dpi resolution. Thus, the Uyghur offline handwritten database was created.

The database can be used for research in several areas, such as Uyghur handwriting recognition (word or text), word segmentation, document image processing (analysis or retrieval), and writer identification. Among these research fields, writer identification is the one concerned with the similarity of handwriting, so the database is organized around it. The Uyghur handwriting database is divided into 3 sub-databases: sub-database1 and sub-database2 consist of the same handwriting text written by different people, and sub-database3 is made up of different handwriting text. Sub-database1 and sub-database2 can be used for research on text dependent writer identification, and sub-database3 is used for text independent writer identification. Sub-database1 and sub-database2 can also be used for text independent writer identification if each handwriting image is divided into several parts. The database is therefore introduced below according to the applicable type of writer identification.

3.1. Sub-database for text dependent writer identification

Sub-database1 and sub-database2 are used for research on text dependent writer identification. Because the handwritten images fall into 2 different types, they are introduced separately.
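The description of the database notes that pages from sub-database1 and sub-database2 can also serve text independent writer identification if each handwriting image is divided into several parts. The paper does not say how the pages are divided, so the following is only a minimal sketch under the assumption that a page is cut into horizontal strips of nearly equal height. The function name `split_into_strips` and the list-of-rows image representation are illustrative choices, not part of the database.

```python
from typing import List

Image = List[List[int]]  # rows of pixel values; 0 = black ink, 255 = white paper


def split_into_strips(image: Image, parts: int) -> List[Image]:
    """Divide a page image into `parts` horizontal strips of near-equal height."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    height = len(image)
    if parts > height:
        raise ValueError("cannot make more strips than there are pixel rows")
    # Spread the remainder rows over the first strips so heights differ by at most 1.
    base, extra = divmod(height, parts)
    strips, top = [], 0
    for i in range(parts):
        bottom = top + base + (1 if i < extra else 0)
        strips.append(image[top:bottom])
        top = bottom
    return strips


# Toy 10-row "page"; a real 300 dpi scan would be loaded from its .bmp file.
page = [[255] * 8 for _ in range(10)]
strips = split_into_strips(page, 3)
print([len(s) for s in strips])  # [4, 3, 3]
```

In practice a split along detected text lines would probably be preferable to fixed-height strips, since a fixed cut can pass through a line of handwriting.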

3.1.1 Sub-database1

The handwriting text of sub-database1 was written by 171 Uyghur people in their natural handwriting style. Each writer was asked to write the same content twice on writing paper, once on a sheet with red grid lines and once on a blank sheet. The writers of sub-database1 are university students aged from 19 to 22. Each writer was asked to write personal information such as name, gender, occupation, and age at the top of the page. A writer information database was also constructed from this information. A sample handwriting image (filled form) of sub-database1 is shown in Figure 3 in the annex. The content of the handwriting includes all 32 Uyghur letters. The numbers of sentences, words, syllables, connected components, and letters in each handwriting image and in the whole of sub-database1 were counted, and they are given in Table 2.

Table 2. Different units in a page and sub-database1

Unit                    page    sub-database1
sentences               20      6,840
words                   117     40,014
syllables               296     101,232
connected components    327     111,834
letters                 699     239,058

3.1.2 Sub-database2

The handwriting text of sub-database2 was written by 253 Uyghur people in their natural handwriting. Each writer was required to write the same content twice on writing paper, once on a sheet with red grid lines and once on a blank sheet. The participants of sub-database2 come from different occupations, including students, workers, nurses, and barbers. They range in age from 9 to 80; an example of handwriting written by a 76-year-old Uyghur person is shown in Figure 4 in the annex. Each participant was asked to write personal information (name, gender, occupation, age, etc.) in the page header. The participants' information was also added to the writer information database mentioned above. The content of each handwriting sample in sub-database2 also includes all 32 Uyghur letters. The numbers of sentences, words, syllables, connected components, and letters in a handwriting image and in the whole of sub-database2 are given in Table 3.

Table 3. Different units in a page and sub-database2

Unit                    page    sub-database2
sentences               21      10,626
words                   117     59,202
syllables               303     153,318
connected components    341     172,546
letters                 715     361,790

All 32 Uyghur letters, including more than 120 character forms, are covered in the 2 sub-databases. The number of occurrences of each letter was counted and is given in Table 4.

Table 4. Occurrences of the 32 Uyghur letters in the sub-databases

No.   Letter (IPA)   Sub-database1   Sub-database2
1     ئا (/a/)       21204           32890
2     ئە (/æ/)       22914           26312
3     ب (/b/)        7866            10626
4     پ (/p/)        4104            9614
5     ت (/t/)        10260           17710
6     ج (/dʒ/)       1026            3036
7     چ (/tʃ/)       2736            5566
8     خ (/x/)        1368            3036
9     د (/d/)        8892            12144
10    ر (/r/)        11286           17710
11    ز (/z/)        7524            5060
12    ژ (/ʒ/)        1368            506
13    س (/s/)        4788            7084
14    ش (/ʃ/)        5130            8096
15    غ (/ʁ/)        3762            8602
16    ف (/f/)        684             506
17    ق (/q/)        7866            11132
18    ك (/k/)        5814            8096
19    ڭ (/ŋ/)        5472            7590
20    گ (/ɡ/)        3078            3542
21    ل (/l/)        14364           24288
22    م (/m/)        6156            13662
23    ن (/n/)        10602           19228
24    ھ (/h/)        4788            3036
25    ئو (/o/)       2394            7084
26    ئۇ (/u/)       11286           15686
27    ئۆ (/ø/)       1368            2024
28    ئۈ (/y/)       4104            4554
29    ۋ (/v/)        3078            3036
30    ئې (/e/)       1710            4048
31    ئى (/i/)       37278           55660
32    ي (/j/)        4788            10626

IPA in Table 4 is the abbreviation of "International Phonetic Alphabet".
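The sub-database totals in Tables 2 and 3 can be reproduced from the per-page counts, because every writer copied the same text twice. The short check below is a sketch of that arithmetic; the per-page counts and writer numbers are taken from the text, while the dictionary layout and the function name `totals` are illustrative.

```python
# Per-page unit counts copied from Tables 2 and 3.
PER_PAGE = {
    "sub-database1": {"sentences": 20, "words": 117, "syllables": 296,
                      "connected components": 327, "letters": 699},
    "sub-database2": {"sentences": 21, "words": 117, "syllables": 303,
                      "connected components": 341, "letters": 715},
}
WRITERS = {"sub-database1": 171, "sub-database2": 253}
COPIES_PER_WRITER = 2  # one sheet with red grid lines, one blank sheet


def totals(name: str) -> dict:
    """Multiply each per-page count by the number of pages in the sub-database."""
    pages = WRITERS[name] * COPIES_PER_WRITER
    return {unit: count * pages for unit, count in PER_PAGE[name].items()}


for name in PER_PAGE:
    print(name, WRITERS[name] * COPIES_PER_WRITER, "pages", totals(name))
```

For sub-database1 this gives 342 pages and 239,058 letters, and for sub-database2 it gives 506 pages and 361,790 letters, matching the tables.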

3.2. Sub-database for text independent writer identification

Text independent writer identification requires the candidates to write handwriting text with different content. Therefore, 389 native Uyghur people were selected and asked to write anything they liked in Uyghur to fill 3 sheets of letter-size paper for the creation of sub-database3. There are no strict rules for the handwriting or its content, but each filled form includes 10 lines of text with 6 to 12 words per line. The writers were selected with respect to gender, age, and occupation in order to obtain varied samples of Uyghur handwriting. Some of them are students, ranging from elementary school pupils to postgraduates, while the rest are adults of various professions. The youngest was 9 years old and the oldest was 80. The participants' information (such as age, gender, education, and job) was also added to the writer information database.

The noise in the handwriting images in the database is removed in the same way as in [16]. The discrete noise threshold is set to 10, based on the actual condition of the handwriting samples: if the number of black points connected to an observed point is less than 10, those points are considered discrete noise and are filled with white.
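The noise removal step can be illustrated with a small sketch. The procedure of [16] is not reproduced in this paper, so the code below is one plausible reading and not the authors' implementation: it assumes a binarized image (0 for black, 255 for white), treats each 8-connected group of black pixels as one blob, and paints blobs with fewer than 10 pixels white. The function name `remove_discrete_noise` and the choice of 8-connectivity are assumptions.

```python
from collections import deque
from typing import List

BLACK, WHITE = 0, 255


def remove_discrete_noise(image: List[List[int]], threshold: int = 10) -> List[List[int]]:
    """Return a copy of a binary image with small black blobs painted white."""
    height = len(image)
    width = len(image[0]) if height else 0
    cleaned = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if seen[y][x] or image[y][x] != BLACK:
                continue
            # Breadth-first search collects one 8-connected group of black pixels.
            seen[y][x] = True
            queue = deque([(y, x)])
            blob = []
            while queue:
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny in range(max(cy - 1, 0), min(cy + 2, height)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, width)):
                        if not seen[ny][nx] and image[ny][nx] == BLACK:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            # Blobs smaller than the threshold are treated as discrete noise.
            if len(blob) < threshold:
                for by, bx in blob:
                    cleaned[by][bx] = WHITE
    return cleaned


# Toy example: a 12-pixel stroke is kept, a 3-pixel speck is removed.
page = [[WHITE] * 20 for _ in range(6)]
for x in range(2, 14):
    page[1][x] = BLACK
for x in range(16, 19):
    page[4][x] = BLACK
cleaned = remove_discrete_noise(page)
print(sum(row.count(BLACK) for row in cleaned))  # 12
```

A fixed pixel threshold depends on the scan resolution, so the value 10 should be read together with the 300 dpi setting used for this database.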

4. CONCLUSION AND FUTURE WORK

A Uyghur offline handwritten text database written by multiple writers is introduced in this paper. The database contains 2016 handwritten document images written by 813 native Uyghur writers. Each filled form (handwriting document) is scanned and stored in .bmp file format, and the noise in the handwriting image is removed. So far, this is the first Uyghur offline handwritten database that is freely available to researchers worldwide. It can be used for Uyghur word and sentence recognition, word segmentation, handwritten document image processing (document analysis or retrieval), and writer identification. In future work, the database will be further enlarged, and it will be placed on the Internet for researchers worldwide.

ACKNOWLEDGEMENTS

This paper is sponsored by the National Natural Science Foundation of China under grant number 61163028, the Special Training Plan Project of Xinjiang Minority Science and Technological Talents (No. 201323121), and the Open Project of Xinjiang Laboratory of Multi-language Information Technology (No. 049807, 2013). The authors thank everyone who contributed to the creation of the database.

REFERENCES

[1] I. Guyon, R. Haralick, J. Hull, and I. Phillips, "Database and benchmarking," In Handbook of Character Recognition and Document Image Analysis, World Scientific, 1997.

[2] J. Hull, "A database for handwritten text recognition research," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 16, no. 5, pp. 550-554, May 1994.

[3] U. Marti and H. Bunke, "The IAM-Database: an English Sentence Database for Offline Handwriting Recognition," Int. J. Doc. Anal. Recognit., vol. 5, no. 1, pp. 39-46, 2002.

[4] C. Suen, C. Nadal, R. Legault, T. Mai, and L. Lam, "Computer recognition of unconstrained handwritten numerals," Proc. of the IEEE, vol. 80, no. 7, pp. 1162-1180, 1992.

[5] H. Zhang, J. Guo, G. Chen, and C. Li, "HCL2000 - A large-scale handwritten Chinese character database for handwritten character recognition," Proc. of 10th ICDAR, pp. 286-290, 2009.

[6] T.H. Su, T.W. Zhang, and D.J. Guan, "Corpus-based HIT-MW database for offline recognition of general-purpose Chinese handwritten text," Int. J. Document Analysis and Recognition, vol. 10, no. 1, pp. 27-38, 2007.

[7] N. Kharma, M. Ahmed, and R. Ward, "A New Comprehensive Database of Hand-written Arabic Words, Numbers, and Signatures used for OCR Testing," IEEE Canadian Conference on Electrical and Computer Engineering, vol. 2, pp. 766-768, 1999.

[8] S. Mozaffari, K. Faez, F. Faradji, M. Ziaratban, and S. M. Golzan, "A Comprehensive Isolated Farsi/Arabic Character Database for Handwritten OCR Research," International Workshop on Frontiers in Handwriting Recognition, 2006.

[9] U. Bhattacharya and B.B. Chaudhuri, "Databases for research on recognition of handwritten characters of Indian scripts," Proc. 8th ICDAR, pp. 789-793, 2005.

[10] T. Saito, H. Yamada, and K. Yamamoto, "On the data base ETL 9 of hand printed characters in JIS Chinese characters and its analysis," IEICE Transactions, vol. 68, no. 4, pp. 757-764, 1985.

[11] D. Kim, Y. Hwang, S. Park, E. Kim, S. Paek, and S. Bang, "Handwritten Korean character image database PE92," In Proc. of the Second Int. Conf. on Document Analysis and Recognition, pp. 470-473, 1993.

[12] Zaydun, Y., and Tsuyoshi S., "A Development of Pattern-based Online Handwriting Uyghur Character Recognition System," Proc. of FIT2009, pp. 123-124.

[13] Ibrayim, M., and Askar H., "Design and implementation of prototype system for online handwritten Uyghur character recognition," Wuhan University Journal of Natural Sciences, vol. 17, no. 2, pp. 131-136, April 2012.

[14] Li, J., Zhaoyang L., Adili Y., and Fuxiu T., "Handwritten Uighur character segmentation and performance evaluation," Proc. of Fourth International Conference on Machine Vision (ICMV 11), pp. 83491E-83491E, International Society for Optics and Photonics, 2012.

[15] Ubul, K., Hamdulla, A., Aysa, A., et al., "Research on Uyghur Off-line Handwriting based Writer Identification," Proc. of the 9th International Conference on Signal Processing, pp. 1656-1659, Beijing, China, October 26-29, 2008.

[16] K. Ubul, A. Adler, and M. Yasin, "Multi-Stage Based Feature Extraction Methods for Uyghur Handwriting Based Writer Identification," In Genetic Algorithms in Applications, InTech, 2012.

ANNEX: EXAMPLES OF FILLED FORMS OF SUB-DATABASE1 AND SUB-DATABASE2

Figure 3. A filled form in sub-database1

Figure 4. A filled form in sub-database2
