
BENCHMARK DATASET FOR OFFLINE HANDWRITTEN CHARACTER RECOGNITION

Adeel Yousaf*, M. Jaleed Khan, M. Imran, Khurram Khurshid
Department of Electrical Engineering, Institute of Space Technology, Islamabad, Pakistan

Abstract: A benchmark dataset is the first and foremost requirement for Handwritten Character Recognition (HCR). This paper describes in detail a newly compiled dataset, the Handwritten Characters Dataset (HCD), intended for recognition of handwritten English characters (A-Z) and digits (0-9). The dataset is significant because it contains segmented characters rather than words, sentences or paragraphs. A total of 150 writers of different ages, genders and professional backgrounds contributed to it. HCD is publicly available free of cost [20]. It can be used to train classifiers such as Neural Networks (NN), Hidden Markov Models (HMM) or Support Vector Machines (SVM), which can in turn be used in a wide variety of handwritten text recognition applications. Experimental analysis of HCD shows very promising results compared to state-of-the-art datasets.

Keywords: Handwritten Character Recognition, Benchmark Database, HCD, Handwritten Digit Recognition, Neural Networks, English Alphabets Dataset, Decimal Digits Dataset

I. INTRODUCTION
Handwriting has become an essential part of our daily life [1]. Handwriting analysis covers a number of problems, such as signature verification, writer identification [2] and Handwritten Text Recognition (HTR). Most research focuses on HTR, as it covers a wide range of applications such as historic document processing [3] and automatic postal processing. Identifying and recognizing handwritten characters is easy for humans [4], but the high variability of writing styles makes it hard for machines. The performance of an HTR system depends heavily on the dataset on which it is trained. Collecting samples is a laborious task, as a benchmark dataset requires a large and diverse set of handwritten samples. A standard dataset gives researchers a platform to evaluate and compare different HTR techniques on the same ground-truth data, thus eliminating bias.

Like any other scientific domain, the document analysis and recognition (DAR) community has developed a large number of datasets for HTR [5]. Despite an extensive search, no publicly available dataset for Handwritten Character Recognition (HCR) was found. In most existing datasets, writers were asked to copy a given text in their own unconstrained cursive handwriting, so these datasets provide segmented words, sentences or paragraphs rather than isolated characters. Secondly, several of these datasets come at substantial cost and are not publicly available. These setbacks compel most researchers to compile their own limited datasets for implementing and testing their systems, and as a result they are unable to compare the performance of their systems with other state-of-the-art techniques.

Our team provides a benchmark dataset of English block letters (A-Z) and digits (0-9) written by 150 writers. In this paper we present a new dataset that offers three key gains over other state-of-the-art datasets. First, the HCD dataset is split into training, validation and testing subsets, so that it provides the same experimental conditions to researchers implementing and evaluating their systems. Second, a new dynamic resizing technique is applied to the HCD dataset, which offers higher precision and recall. Third, the HCD dataset is publicly available free of cost [20].

The rest of the paper is organized as follows. Section II presents a quick review of existing databases used for HTR, Section III describes the generation of the HCD dataset, Section IV describes its preprocessing, and Section V provides the experimental analysis of HCD.

II. EXISTING DATABASES
Development of databases started in the early 1990s, when handwritten text recognition became a significant area of research in machine learning and pattern recognition. This section presents a comparative review of existing databases, evaluating them on a number of aspects: data acquisition method (offline or online), number of writers, number of samples, and the content provided by the database, such as segmented lines, words, characters and digits.

The IAM database is one of the most commonly used databases for HTR [6-7]. It comprises English handwritten samples acquired in both online and offline modes. For the online version, 221 writers contributed and a set of 1700 forms was compiled; it is labeled at word level and comprises 86,272 segmented words. The offline IAM database consists of 1539 digitized forms. It provides word-level and sentence-level labeling, comprising 5685 sentences and 115,320 words written by 657 volunteers.

In 1995, the National Institute of Standards and Technology (NIST) published its first database for recognition of handwritten digits and characters [8-9]. Since then, NIST has periodically published revised and updated versions of its databases. NIST uses the offline method for acquiring data from writers.
NIST Special Database 19 (SD-19) is the most recent version [10]; 3600 writers contributed to its development. It provides almost 800,000 images of isolated characters and digits. NIST databases are not publicly available to researchers; they come at sizeable cost.

In 1998, subsets of NIST Special Database 3 (SD-3) and Special Database 1 (SD-1) were combined to compile a new database known as Modified National Institute of Standards and Technology (MNIST). MNIST was acquired offline and provides 70,000 isolated digits free of cost. Initially MNIST was used extensively in digit recognition systems, but over time researchers found that experiments on MNIST mostly suffer from a strong ceiling effect, i.e. the generalization error becomes very small, which leads to overfitting.

Since 1994, the Center of Excellence for Document Analysis and Recognition (CEDAR) has compiled a number of databases for HTR [11-12]. The main focus of CEDAR is to automate postal processing by recognizing the addresses on envelopes. CEDAR used the offline mode of data acquisition. Samples were collected from 1500 writers, comprising 10,570 segmented words, 27,835 isolated characters and 21,179 digits (see Table 1). The CEDAR database is further divided into training and test subsets, but it is not publicly available.

The Ecole Polytechnique of the University of Nantes (France) has developed a dual on/off database known as IRONOFF [13]. It contains handwritten samples in French and English from 700 different writers and provides 50,000 segmented words and 32,000 character images. It can be used for both online and offline handwritten text recognition. IRONOFF is not publicly available to researchers.

CVL is an alphanumeric database published in 2013 [14]. A total of 613 writers contributed 2163 digitized forms containing handwritten digit and character strings. It provides sentence-level labeling. A subset of the CVL digit database was also used in the Handwritten Digit Recognition Competition (HDRC) at ICDAR 2013.

In 2013, a new database known as the Qatar University Writer Identification dataset (QUWI) was published, containing both Arabic and English handwriting samples [15]. It provides word-level labeling. A total of 1017 writers contributed to QUWI, producing 5085 documents. These documents were scanned at 600 dpi and 100,000 words were extracted.

A comparative overview of existing datasets is given in Table 1. The next section provides a detailed description of the HCD dataset.

III. HCD DATASET GENERATION
A two-page data entry form was designed for collecting handwritten samples from writers, as shown in Figure 1. Writers were instructed to write English block letters on the first page and digits on the second page. Multiple samples of each character were taken from every writer.
The forms could be filled in any color, but writers mostly used black and blue pens. The HCD dataset was acquired at the Institute of Space Technology (IST), Pakistan. So far, 150 different writers have contributed to it. Diversity is maintained by selecting writers from different professional and educational backgrounds who also vary in age and gender. Of the 150 writers who have contributed so far, 15% were left-handed and the remaining 85% right-handed. Our team has further divided the dataset into three subsets, i.e. training, validation and testing, to ensure the same experimental conditions for researchers so that they can compare the performance of their techniques on the same ground-truth data. Statistics of the HCD dataset are provided in Table 2.

IV. PREPROCESSING OF HCD
After dataset collection, the forms were digitized with a 500 dpi scanner and saved as 24-bit RGB JPEG images. Several preprocessing steps, such as resizing, binarization and geometric filtering, are performed before character extraction. A standard size of 1400x1000 pixels was chosen, and each digitized form was resized to it. As the digitized forms containing handwriting samples were not complex, Otsu's global thresholding method was selected for binarizing the forms.
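As an illustration of the binarization step, the following is a minimal pure-Python sketch of Otsu's global thresholding. It is not the authors' MATLAB implementation. It assumes the form has already been converted to 8-bit grayscale (the conversion from RGB is not described in the text) and that ink is darker than paper; the function names `otsu_threshold` and `binarize` are hypothetical.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level t that maximizes the between-class variance.

    Class 0 holds the levels 0..t and class 1 holds the levels t+1..levels-1.
    """
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    weight_low = 0      # number of pixels in class 0
    sum_low = 0         # sum of gray levels in class 0
    best_t, best_var = 0, -1.0

    for t in range(levels):
        weight_low += hist[t]
        if weight_low == 0:
            continue                    # class 0 still empty
        weight_high = total - weight_low
        if weight_high == 0:
            break                       # class 1 empty from here on
        sum_low += t * hist[t]
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        var_between = weight_low * weight_high * (mean_low - mean_high) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray_rows):
    """Map dark (ink) pixels to 1 and light (paper) pixels to 0.

    Assumption: ink is darker than paper, so ink falls in class 0.
    """
    flat = [p for row in gray_rows for p in row]
    t = otsu_threshold(flat)
    return [[1 if p <= t else 0 for p in row] for row in gray_rows]
```

A single global threshold is adequate only when illumination is fairly uniform across the scan, which matches the remark above that the forms are not complex.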

Fig. 1. Two-page form designed for sample collection.

The aspect ratio and area of the connected components were calculated for all objects in the binarized image, and objects with an area of less than 10,000 pixels or more than 50,000 pixels were discarded to filter out non-character objects such as noise due to folding and crumpling of the forms. Some characters were broken and were therefore detected as two different connected components. To overcome this issue, the horizontal and vertical distances between the bounding boxes of the connected components were calculated, and if either distance was too small the bounding boxes were merged into one object, as shown in Figure 2. A broken character ‘H’ is shown in Figure 2(a); the system detected it as two connected components, as shown in Figure 2(b). The distance between the bounding boxes was very small, so they were merged into one connected component, as shown in Figure 2(c). Finally, morphological operations such as closing and opening were performed to fill holes in the characters and to remove unwanted objects.

A semi-automated approach was employed for segmenting characters from the binarized image. All connected components were extracted from the binarized images using 8-connectivity, and a human labeler then manually labeled them. The labeler could also ignore a character or digit if it was not properly written or was affected by noise.

Normalization of characters is one of the most essential steps in dataset generation [16-17]. Our team chose 60x40 pixels as the standard size for each character, as it does not distort the aspect ratio of most characters. However, simply resizing every character to 60x40 pixels distorts its shape, as shown in Figure 3, in which the letter ‘I’ is resized to 60x40 pixels and becomes very difficult to recognize. To overcome the drawbacks of simple resizing, our team proposes a new approach known as dynamic resizing.
In this technique, each connected component (CC) is first resized, preserving its aspect ratio, so that it matches either the width of 40 pixels or the height of 60 pixels, whichever limit is reached first. The CC is then placed at the top-left corner of the standard-sized image, i.e. 60x40 pixels. For example, an image of size 50x50 pixels is resized to 40x40 pixels and placed at the top-left corner of a 60x40 pixel image, as shown in Figure 4(a-c). Similarly, Figure 4(d-f) shows an image of size 67x16 pixels resized to 60x14 pixels and fitted into a 60x40 pixel image. The HCD database is also available in raw form, i.e. without dynamic resizing, so researchers can apply other resizing techniques according to their requirements.
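The dynamic resizing described above can be sketched as follows. This is an illustrative reading of the text, not the released implementation. Sizes are taken as height x width, as in the 60x40 standard size. Nearest-neighbour sampling and a background value of 0 are assumptions, since the interpolation method is not stated, and the function name `dynamic_resize` is hypothetical.

```python
def dynamic_resize(img, out_h=60, out_w=40, background=0):
    """Scale img (a list of pixel rows) to fit inside out_h x out_w while
    keeping its aspect ratio, then place it at the top-left corner of an
    out_h x out_w canvas.

    Returns the canvas and the (height, width) of the scaled character.
    """
    in_h, in_w = len(img), len(img[0])

    # Integer comparison of aspect ratios: which limit is reached first?
    if in_h * out_w >= in_w * out_h:
        # Relatively tall input: the height limit binds.
        new_h = out_h
        new_w = max(1, in_w * out_h // in_h)
    else:
        # Relatively wide input: the width limit binds.
        new_w = out_w
        new_h = max(1, in_h * out_w // in_w)

    canvas = [[background] * out_w for _ in range(out_h)]
    for y in range(new_h):
        src_y = y * in_h // new_h           # nearest-neighbour source row (assumed)
        for x in range(new_w):
            src_x = x * in_w // new_w       # nearest-neighbour source column (assumed)
            canvas[y][x] = img[src_y][src_x]
    return canvas, (new_h, new_w)
```

With these conventions a 50x50 input is scaled to 40x40 and a 67x16 input to 60x14, which agrees with the sizes given in Figure 4.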

Fig. 2. (a) Broken image of character ‘H’. (b) Detected as two objects. (c) Merged together as one object.
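The area filter and the bounding-box merging illustrated in Fig. 2 can be sketched as below. Several points are assumptions rather than details from the paper. Boxes are represented as `(x0, y0, x1, y1)` tuples. The threshold `max_gap` is a hypothetical value, since the text only says "too small". The sketch also merges two boxes only when both the horizontal and the vertical gap are small, a stricter reading than the text's "either", chosen here so that distant characters on the same row are not merged.

```python
def is_character_sized(area, min_area=10_000, max_area=50_000):
    """Keep components whose pixel area lies within the limits given in the text."""
    return min_area <= area <= max_area


def box_gap(a, b):
    """Horizontal and vertical gap between two boxes (x0, y0, x1, y1).

    A gap is 0 when the boxes overlap along that axis.
    """
    gap_x = max(0, max(a[0], b[0]) - min(a[2], b[2]))
    gap_y = max(0, max(a[1], b[1]) - min(a[3], b[3]))
    return gap_x, gap_y


def merge_close_boxes(boxes, max_gap=5):
    """Repeatedly merge pairs of boxes that lie within max_gap pixels of each other.

    max_gap is a hypothetical threshold; merging requires both gaps to be
    small (an assumption of this sketch).
    """
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                gap_x, gap_y = box_gap(boxes[i], boxes[j])
                if gap_x <= max_gap and gap_y <= max_gap:
                    a, b = boxes[i], boxes[j]
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes
```

For a broken ‘H’ whose two halves are 3 pixels apart, the two boxes collapse into one, while a separate character much further along the same row keeps its own box.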

Fig. 3. (a) Original image of size 67x16 pixels. (b) Effect of typical resizing to 60x40 pixels.

V. HCD DATASET EVALUATION
The main aim of this section is to provide a basis for comparing the performance of HCD with other state-of-the-art datasets. The HCD dataset can be used with any type of classifier. For the experimental analysis, classification is performed by a feed-forward back-propagation neural network (NN) [18]. Experiments were performed on a machine with an Intel Core i3 @ 2.40 GHz and 4.00 GB RAM and were implemented in a beta version of MATLAB 2015. No feature extraction technique is applied to the dataset; the only feature used is pixel intensity [19].

Two different neural networks are designed, one for English characters (A-Z) and one for numeric digits (0-9). The NN for English alphabets is trained with the feed-forward back-propagation algorithm on 13,596 samples of different alphabets written in different styles by 150 writers. It consists of 2400 input nodes, which take the vectorized 60x40 pixel character, one hidden layer with 240 neurons, and an output layer of 26 nodes representing the 26 English alphabets (A-Z). The NN outputs a vector of length 26 containing the probabilities of the English alphabets (A-Z), and the alphabet with the highest probability is selected as the recognized alphabet. Similarly, the NN for numeric digits is trained with the same algorithm on 5,404 samples of different digits written in different styles by 150 writers. It has 2400 input nodes, one hidden layer with 120 neurons, and an output layer of 10 nodes representing the 10 decimal digits. It outputs a vector of length 10 containing the probabilities of the decimal digits 0-9, and the digit with the highest probability is selected as the recognized digit. The execution time for recognizing a single character is 25 milliseconds on average. Figure 5 shows a schematic diagram of the neural network designed for the analysis of the HCD dataset.

To analyze performance on the HCD dataset, the standard measures precision and recall are computed from the confusion matrix for every English character (A-Z) and numeric digit (0-9). Class-wise precision and recall for alphabets and digits are shown in Table 3 and Table 4 respectively. The recognition rates are consistent across all classes. However, pairs of characters with low inter-class variability, such as (‘E’, ‘F’), (‘D’, ‘O’), (‘M’, ‘N’) and (‘7’, ‘9’), give relatively lower recognition rates. The average precision of the NN on the HCD dataset is 96.98% for English characters and 98.08% for numeric digits. The results in Table 3 and Table 4 are achieved without feature extraction and can be improved further by applying a suitable feature extraction technique.

To study the effect of the number of neurons in the hidden layer, five NNs with different numbers of hidden neurons were trained and tested on HCD. The experimental results for these architectures are summarized in Table 5.

Fig. 4. (a) Original image of size 50x50 pixels, (b) Dynamic resizing to 40x40 pixels, (c) Fitting in 60x40 pixels, (d) Original image of size 67x16 pixels, (e) Dynamic resizing to 60x14 pixels, (f) Fitting in 60x40 pixels.

Fig. 5. Schematic diagram of NN for recognition of alphabets and digits.
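To make the architecture in Fig. 5 concrete, the sketch below runs the forward pass of a network with one hidden layer and selects the class with the highest probability. The layer sizes (2400 inputs, 240 hidden neurons, 26 outputs) come from the paper. Everything else is an assumption for illustration: the sigmoid hidden activation, the softmax output, and the random untrained weights. The names `init_layer`, `dense` and `classify` are hypothetical, and training by back-propagation is not shown.

```python
import math
import random


def init_layer(n_in, n_out, rng):
    """Create small random weights and zero biases (untrained placeholders)."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer):
    """Affine map: one weighted sum plus bias per output neuron."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(z):
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(pixels, hidden_layer, output_layer, labels):
    """Forward pass; return the label with the highest probability and all probabilities."""
    hidden = [sigmoid(z) for z in dense(pixels, hidden_layer)]   # assumed activation
    probs = softmax(dense(hidden, output_layer))                 # assumed output function
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs
```

A usage example under the same assumptions: build the alphabet network with `init_layer(2400, 240, rng)` and `init_layer(240, 26, rng)`, flatten a 60x40 character image row by row into 2400 intensities scaled to [0, 1], and pass it to `classify` together with the list of labels ‘A’ to ‘Z’. The digit network differs only in using 120 hidden neurons and 10 outputs.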

As Table 5 shows, the NN for recognition of digits yields its highest precision with 120 hidden neurons, while the NN for recognition of alphabets yields its highest precision with 240 hidden neurons.
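Class-wise precision and recall of the kind reported in Tables 3 and 4 can be computed from a confusion matrix as in the following sketch. The convention that rows are true classes and columns are predicted classes is an assumption, as are the function names; the averaging shown is an unweighted mean over classes, and the paper does not state which averaging it used.

```python
def precision_recall(confusion):
    """Return a list of (precision, recall) pairs, one per class.

    confusion[i][j] = number of samples of true class i predicted as class j
    (assumed convention).
    """
    n = len(confusion)
    results = []
    for k in range(n):
        true_positive = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))   # column sum
        actual_k = sum(confusion[k])                           # row sum
        precision = true_positive / predicted_k if predicted_k else 0.0
        recall = true_positive / actual_k if actual_k else 0.0
        results.append((precision, recall))
    return results


def macro_average(results):
    """Unweighted mean of precision and of recall over all classes."""
    n = len(results)
    mean_precision = sum(p for p, _ in results) / n
    mean_recall = sum(r for _, r in results) / n
    return mean_precision, mean_recall
```

For a toy two-class matrix `[[8, 2], [1, 9]]` (made-up counts, not HCD results), class 0 has precision 8/9 and recall 8/10, and class 1 has precision 9/11 and recall 9/10.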

TABLE 3. Class-wise recognition results of alphabets

| Class | Precision (%) | Recall (%) | Class | Precision (%) | Recall (%) |
|---|---|---|---|---|---|
| A | 98.9 | 95.1 | N | 94.7 | 96.4 |
| B | 99.2 | 96.5 | O | 93 | 93.8 |
| C | 100 | 97.4 | P | 98.3 | 96 |
| D | 94 | 94.2 | Q | 98.1 | 96.3 |
| E | 94.7 | 94.4 | R | 96.5 | 96.5 |
| F | 93.6 | 92.2 | S | 97 | 98 |
| G | 100 | 99.1 | T | 93.9 | 99.1 |
| H | 97.2 | 97.1 | U | 98.3 | 95.5 |
| I | 96.7 | 97.6 | V | 97.6 | 100 |
| J | 97.3 | 95.9 | W | 100 | 96.6 |
| K | 99 | 98 | X | 96.8 | 94.1 |
| L | 96.4 | 97.5 | Y | 98 | 95.1 |
| M | 93.9 | 97.1 | Z | 98.3 | 99.1 |
| Avg | 96.98 | 96.48 | | | |

TABLE 4. Class-wise recognition results of digits

| Class | Precision (%) | Recall (%) | Class | Precision (%) | Recall (%) |
|---|---|---|---|---|---|
| 0 | 98 | 100 | 5 | 99.4 | 95.1 |
| 1 | 98.2 | 96.3 | 6 | 97.6 | 99.7 |
| 2 | 100 | 96.9 | 7 | 94.5 | 100 |
| 3 | 100 | 97.2 | 8 | 98.4 | 92 |
| 4 | 98.2 | 100 | 9 | 96.5 | 100 |
| Avg | 98.08 | 97.72 | | | |

TABLE 5. Experimental analysis of different neural network architectures

| Input neurons | Hidden neurons | Output neurons (Alphabet) | Output neurons (Digit) | Precision on HCD, Alphabet (%) | Precision on HCD, Digit (%) |
|---|---|---|---|---|---|
| 2400 | 60 | 26 | 10 | 95.5 | 95.46 |
| 2400 | 120 | 26 | 10 | 95.6 | 98.08 |
| 2400 | 240 | 26 | 10 | 96.98 | 97.84 |
| 2400 | 360 | 26 | 10 | 95.9 | 96.52 |
| 2400 | 480 | 26 | 10 | 95.8 | 95.08 |

VI. CONCLUSION AND FUTURE WORK
In this paper, a new dataset for handwritten character recognition of English block letters (A-Z) and decimal digits (0-9) is presented. To our knowledge, there is no other standard publicly available dataset that provides isolated characters and digits free of cost. Experimental analysis of HCD shows very promising results compared to other state-of-the-art datasets. The average precision obtained is 96.98% for English alphabets (A-Z) and 98.08% for numeric digits (0-9). In the initial phase, samples from 150 writers were collected; in future it is planned to increase the number of writers to 350 to bring more diversity into the dataset. Secondly, it is also planned to compile a dataset of lower-case English letters (a-z). Thirdly, different feature extraction techniques will be applied to HCD to improve the results. Other future tasks include the use of different classifiers and writer identification.

REFERENCES

[1] Réjean Plamondon, Sargur N. Srihari, "Online and off-line handwriting recognition: a comprehensive survey". IEEE Transactions on Pattern Analysis and Machine Intelligence, 22.1 (2000): 63-84.

[2] Réjean Plamondon, "Automatic signature verification and writer identification - the state of the art". Pattern Recognition, Volume 22, 1989, Pages 107-131.

[3] Yuan Y. Tang, "Automatic document processing: A survey". Pattern Recognition, Volume 29, Issue 12, December 1996, Pages 1931-1952.

[4] Renuka Patil, G. N. Srinivasan, "A Survey on Character Recognition". International Journal of Engineering Sciences Research (IJESR), Vol 04, Issue 02, March-April 2013, Pages 5.

[5] Raashid Hussain, Ahsen Raza, Imran Siddiqi, Khurram Khurshid, "A comprehensive survey of handwritten document benchmarks: structure, usage and evaluation". EURASIP Journal on Image and Video Processing, 2015, Pages 234.

[6] M. Liwicki, H. Bunke, "IAM-OnDB - an On-Line English Sentence Database Acquired from Handwritten Text on a Whiteboard". In Proceedings of the Eighth International Conference on Document Analysis and Recognition, ICDAR '05, pages 956-961, Washington, DC, USA, 2005. IEEE Computer Society.

[7] U.-V. Marti, H. Bunke, "A full English sentence database for offline handwriting recognition". In Proceedings of the Fifth International Conference on Document Analysis and Recognition, 1999, pp. 705-708.

[8] R. Wilkinson, J. Geist, S. Janet, P. Grother, C. Burges, R. Creecy, B. Hammond, J. Hull, N. Larsen, T. Vogl, C. Wilson, "The First Census Optical Character Recognition Systems Conference". The U.S. Bureau of Census and the National Institute of Standards and Technology, 1992.

[9] N. Fakotakis, E. Kavallieratou, "Handwritten character recognition based on structural characteristics". In Proceedings of the 16th International Conference on Pattern Recognition, vol. 3, 2002, pp. 139-142.

[10] Patrick J. Grother, "NIST special database 19 handprinted forms and characters database". National Institute of Standards and Technology, 1995.

[11] S. Srihari, S.-H. Cha, H. Arora, S. Lee, "Individuality of handwriting: a validation study". In Proceedings of the Sixth International Conference on Document Analysis and Recognition, 2001, pages 106-109.

[12] J. J. Hull, "A database for handwritten text recognition research". IEEE Trans. Pattern Anal. Mach. Intell. 16(5), 1994, pp. 550-554.

[13] C. Viard-Gaudin, P. Lallican, P. Binter, S. Knerr, "The IRESTE On/Off (IRONOFF) Dual Handwriting Database". In Proceedings of the Fifth International Conference on Document Analysis and Recognition, ICDAR '99, pages 455-458, Washington, DC, USA, 1999. IEEE Computer Society.

[14] F. Kleber, R. Sablatnig, S. Fiel, M. Diem, "CVL-database: An off-line database for writer retrieval, writer identification and word spotting". In Proceedings of the 12th International Conference on Document Analysis and Recognition, 2013, pp. 560-564.

[15] Somaya M., Wael A., Abdelaali H., "QUWI: An Arabic and English Handwriting Dataset for Offline Writer Identification". 2012 International Conference on Frontiers in Handwriting Recognition.

[16] Cheng-Lin Liu, et al., "Handwritten digit recognition: investigation of normalization and feature extraction techniques". Pattern Recognition 37.2 (2004): 265-279.

[17] Seong-Whan Lee, Jeong-Seon Park, "Nonlinear shape normalization methods for the recognition of large-set handwritten characters". Pattern Recognition 27.7 (1994): 895-902.

[18] Léon Bottou, et al., "Comparison of classifier methods: a case study in handwritten digit recognition". International Conference on Pattern Recognition. IEEE Computer Society Press, 1994.

[19] Øivind Due Trier, Anil K. Jain, Torfinn Taxt, "Feature extraction methods for character recognition - a survey". Pattern Recognition 29.4 (1996): 641-662.

[20] http://www.ist.edu.pk/downloads/hcd_dataset.rar

TABLE 2. Statistics of HCD dataset (the Training, Validation and Testing columns give the organization of the dataset)

| Data type | No. of writers | Size of the dataset | Training | Validation | Testing |
|---|---|---|---|---|---|
| Alphabets (A-Z) | 150 | 19,422 | 13,596 | 2,913 | 2,913 |
| Digits (0-9) | 150 | 7,720 | 5,404 | 1,158 | 1,158 |
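The subset sizes in Table 2 correspond to roughly 70% training, 15% validation and 15% testing. The sketch below reproduces those counts and shows one possible way to draw such a split. The rounding rule, the shuffling of individual samples and the fixed seed are all assumptions, since the paper does not say how samples were assigned to subsets (for instance, whether the split is by writer). The names `split_counts` and `split_indices` are hypothetical.

```python
import random


def split_counts(n_samples, holdout_percent=15):
    """Return (train, validation, test) sizes.

    Validation and test each take holdout_percent of the data, rounded down
    (assumed rule); the remainder goes to training.
    """
    n_holdout = n_samples * holdout_percent // 100
    return n_samples - 2 * n_holdout, n_holdout, n_holdout


def split_indices(n_samples, seed=0):
    """Shuffle sample indices and cut them into the three subsets (assumed procedure)."""
    n_train, n_val, _ = split_counts(n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test
```

Under this rule `split_counts(19422)` gives the alphabet subset sizes of Table 2 and `split_counts(7720)` gives the digit subset sizes. Since the released dataset already ships with fixed subsets, results should be reported on those rather than on a fresh random split.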

TABLE 1. Comparative overview of existing datasets

| Database | Language | Year | Content | Mode | Writers | Statistics | Availability |
|---|---|---|---|---|---|---|---|
| IAM | English | 2002 | Words, Sentences | Offline | 657 | 1539 forms, 5685 sentences, 115,320 words | Public |
| IAM | English | 2005 | Words | Online | 221 | 1700 forms, 86,272 words | Public |
| NIST | English | 1995 | Characters, Digits | Offline | 3600 | 800,000 characters and digits | Proprietary |
| MNIST | English | 1998 | Digits | Offline | - | 70,000 digits | Public |
| CEDAR | English | 1994 | Characters and Digits | Offline | 1500 | 10,570 words, 27,835 characters, 21,179 digits | Proprietary |
| IRONOFF | English and French | - | Words, Characters | Offline/Online | 700 | 50,000 words, 32,000 characters | Proprietary |
| QUWI | English and Arabic | 2013 | Words | Offline | 1017 | 100,000 words, 5085 forms | Public |
| CVL | English and German | 2013 | Sentences and Digits | Offline | 613 | 2163 forms | Public |