Recognition of Off-line Handwritten Arabic Words by Somaya A. S. Al-Ma’adeed, BSc, MSc

Thesis submitted to The University of Nottingham for the degree of Doctor of Philosophy, June 2004.

ABSTRACT

The main steps of document processing are reviewed, especially those applied to Arabic writing. The techniques used in this research, such as Vector Quantization (VQ), Hidden Markov Models (HMM), and the Induction of Decision Trees (ID3), are considered, along with a review of the pre-processing and feature extraction methods used for Arabic writing. Pattern recognition applications of this kind require large sets of data. Since there are few Arabic databases available, none of a reasonable size or scope, this research built the AHDB database to facilitate the training and testing of systems able to recognize unconstrained handwritten Arabic text [AHE02a] [AHE03a]. The approach used in this thesis for counting the most popular written Arabic words is a very useful step in the area of Arabic handwriting recognition. The recognition of Arabic characters extracted from words involves several stages, taking the Arabic words from slanted input images to segmented characters [AHE01], which are entered as inputs to HMMs, ID3, or multiple HMMs for recognition. First, an HMM is used to classify handwritten words [AHE02b]. Then a global classifier is used to recognize whole words. The last stage combines global and local classifiers to classify the Arabic words [AHE02c]. The main result is a new multi-HMM approach proposed for handwriting recognition [AHE03b] [AHE04]. Finally, possible further work is examined to consider where this approach to off-line handwriting recognition is leading. This work presents an off-line cursive Arabic word recognition system that deals with samples from many writers.


ACKNOWLEDGMENTS

My sincere and deepest gratitude to Nottingham University, Faculty of Computer Science, and to my supervisors Professor Dave Elliman and Dr. Colin Higgins for always being supportive and encouraging throughout my thesis, and for their assistance in the preparation of this manuscript. Thanks to the Arabic writers who filled out the forms that form the core of the database I developed. Thanks also to Qatar University for sponsoring my study and research. Thanks to my mother and my father for always being there for me and constantly providing me with love and encouragement; they have done so much that a simple thanks to them will not suffice. In addition, special thanks to my husband, Sultan, for all his support during the long period that the thesis took up, to my wonderful sons for their patience, to my sisters for their continuously supportive phone calls, and to my brother, who showed interest in my research, always asking when (not if) I would finish this thesis. Lastly, I acknowledge the reader, who I hope will find the contents of this thesis useful and easy to read.


TABLE OF CONTENTS

Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables

Chapter 1: INTRODUCTION
  1.1 OPTICAL CHARACTER RECOGNITION
  1.2 THE HISTORICAL BACKGROUND TO OCR RESEARCH
  1.3 BASIC MODEL FOR PROCESSING THE CONCRETE DOCUMENT
  1.4 RECOGNITION STRATEGIES
  1.5 PROBLEM DEFINITION
    1.5.1 Difficulties from Characteristics of the Arabic Writing System
    1.5.2 Difficulties in Handwritten Arabic Characters and their Differences from Latin
    1.5.3 Off-line Versus On-line
  1.6 THE OBJECTIVES OF THIS RESEARCH
  1.7 CONTRIBUTION
  1.8 THE THESIS ORGANIZATION

Chapter 2: THEORY AND LITERATURE REVIEW
  2.1 INTRODUCTION
  2.2 SURVEY OF OFF-LINE HANDWRITTEN WORDS RECOGNITION
    2.2.1 Databases
    2.2.2 Data Capture
    2.2.3 Pre-processing
      2.2.3.1 The Binarization of Scanned Images
      2.2.3.2 Skew Detection
      2.2.3.3 Segmentation
    2.2.4 Feature Extraction
    2.2.5 Classification
    2.2.6 Post-processing
  2.3 OFF-LINE HMMS FOR AN HWR SURVEY
  2.4 ARABIC OCR USING HMM
  2.5 A SURVEY OF OFF-LINE HANDWRITTEN ARABIC WORDS RECOGNITION
  2.6 CONCLUSION

Chapter 3: METHODOLOGY: USEFUL TECHNIQUES
  3.1 OFF-LINE ARABIC WORDS RECOGNITION METHODS
    3.1.1 Feature Extraction Methods
    3.1.2 Segmentation Methods
    3.1.3 Recognition Methods
  3.2 VECTOR QUANTIZATION
    3.2.1 VQ Mathematic Definition
    3.2.2 Optimality Criteria
      3.2.2.1 Nearest Neighbour Condition
      3.2.2.2 Centroid Condition
  3.3 HIDDEN MARKOV MODEL (HMM)
    3.3.1 Implementation Strategies
    3.3.2 HMM Theory
      3.3.2.1 Scoring Problem
      3.3.2.2 Training Problem
      3.3.2.3 Recognition Phase
      3.3.2.4 Post-processing
  3.4 ID3 CLASSIFIER
  3.5 CONCLUSION

Chapter 4: A DATABASE FOR ARABIC HANDWRITTEN TEXT RECOGNITION RESEARCH
  4.1 A NEW ARABIC HANDWRITTEN DATABASE (AHDB)
  4.2 ARABIC WORD COUNTING
  4.3 FORM DESIGN
  4.4 DATA STORING
  4.5 DATA RETRIEVAL
  4.6 CONCLUSION

Chapter 5: A PRE-PROCESSING SYSTEM FOR THE RECOGNITION OF OFF-LINE ARABIC HANDWRITTEN WORDS
  5.1 OVERVIEW
  5.2 PRE-PROCESSING STEPS
    5.2.1 Image Loading
    5.2.2 Slope Correction
    5.2.3 Slant Correction
    5.2.4 Thinning
    5.2.5 Normalization
  5.3 FINDING HANDWRITING FEATURES
    5.3.1 Outer Contour and Loops
    5.3.2 Locating Dots
    5.3.3 Locating Endpoints
    5.3.4 Junctions
    5.3.5 Turning Points
    5.3.6 Right and Left Disconnection
    5.3.7 Detect Strokes
    5.3.8 Pixel Distribution
    5.3.9 Moments Features
    5.3.10 Zonal Features
  5.4 SEGMENTATION STAGE
  5.5 CONCLUSION

Chapter 6: RECOGNITION OF OFF-LINE HANDWRITTEN ARABIC WORDS USING A HIDDEN MARKOV MODEL
  6.1 SYSTEM OVERVIEW
  6.2 PRE-PROCESSING
  6.3 FEATURES USED
  6.4 HMM CLASSIFIER
    6.4.1 States and Symbols for Handwritten Words
    6.4.2 The Calculation of Model Parameters
  6.5 THE SCORING PROBLEM
  6.6 THE TRAINING PROBLEM
  6.7 RECOGNITION PHASE
  6.8 CONCLUSION

Chapter 7: MULTIPLE HIDDEN MARKOV MODELS CLASSIFIER
  7.1 ID3 CLASSIFIER
    7.1.1 Training and Testing Sets
  7.2 MULTIPLE HIDDEN MARKOV MODELS
    7.2.1 Global Classifier
    7.2.2 Local Classifier
  7.3 LOCAL GRAMMAR
  7.4 CONCLUSION

Chapter 8: EXPERIMENTAL RESULTS
  8.1 EXPERIMENTAL TOOLS
  8.2 SOFTWARE USED
  8.3 EXPERIMENTAL DETAILS
    8.3.1 Forms Scanning
    8.3.2 Data Capture and Image Loading
    8.3.3 Pre-processing
    8.3.4 Baseline Detection
    8.3.5 Slant and Slope Correction
    8.3.6 Thinning
    8.3.7 Feature Extraction
    8.3.8 Segmentation
    8.3.9 Normalization
  8.4 CLASSIFICATION USING HMM
  8.5 CLASSIFICATION USING ID3
  8.6 CLASSIFICATION USING MULTIPLE HMM
  8.7 CONCLUSION OF THE EXPERIMENTAL RESULTS

Chapter 9: CONCLUSIONS AND SUGGESTIONS FOR FUTURE RESEARCH
  9.1 CONCLUDING REMARKS
  9.2 CONTRIBUTION TO ARABIC HANDWRITTEN RECOGNITION
  9.3 FUTURE WORK
    9.3.1 The Database
    9.3.2 Pre-processing
    9.3.3 Feature Extraction
    9.3.4 Classification
    9.3.5 Post-processing
  9.4 CONCLUSION

Bibliography
Appendix A
Appendix B


LIST OF FIGURES

Figure 1-1: Basic Model for Document Processing
Figure 1-2: Different shapes of the Arabic letter 'A'in' in: (a) beginning, (b) middle, (c) final and (d) isolated
Figure 1-3: Some Arabic characters that differ only by the position and number of associated dots
Figure 1-4: A handwritten word that can be problematic to segment
Figure 1-5: Three Arabic words with constituent sub-words: (a) 'flower', (b) 'Maqdess', (c) 'Cairo'
Figure 1-6: Different Arabic sentences in different styles
Figure 1-7: Arabic Ligatures
Figure 1-8: Ligatures found in the Traditional Arabic font
Figure 2-1: Steps involved in the Optical Character Recognition System
Figure 3-1: Vertical and horizontal scanning of the character: (a) character, (b) horizontal scanning, (c) vertical scanning
Figure 3-2: Major segments of a character
Figure 3-3: An example of segmentation of an Arabic word into characters: (a) Arabic word, (b) histogram, (c) word segmented into characters
Figure 3-4: An example of an Arabic word and its segmentation into characters: (a) Arabic word, (b) histogram, (c) word segmented into characters
Figure 3-5: Segmented Arabic words and the corresponding contour heights, for the words (a) Mahal and (b) Alalamy
Figure 3-6: An example of a segmented sub-word, with start point A, endpoint E, and horizontal lines 2-3 and 5-6
Figure 3-7: Example of an Arabic word and different segmentation techniques
Figure 4-1: One form filled in by one writer
Figure 4-2: Handwritten Arabic words in the AHDB written by three different writers (a, b, and c)
Figure 4-3: Examples containing sentences used in cheque writing in Arabic
Figure 4-4: Examples of free handwriting
Figure 5-1: The pre-processing operations
Figure 5-2: Different examples of pre-processing stages: (a) baseline detection, (b) slant and slope correction, (c) feature extraction, (d) width normalization
Figure 5-3: (a) The word before the operation of slope correction. (b) The word after its slope is corrected horizontally. (c) The same word after slant correction. (d) The operation of thinning
Figure 5-4: The two baselines of the word 'five': (a) the second baseline, (b) the main baseline
Figure 5-5: Two words with the features written on them
Figure 5-6: The blobs of the Arabic word "ahad"
Figure 5-7: Four turning points in different directions: (a) top, (b) down, (c) left, and (d) right
Figure 5-8: The four stroke directions detected in this research for an Arabic word: (a) horizontal, (b) vertical, (c) positive or back diagonal, (d) negative or diagonal
Figure 5-9: Arabic word 'five' after (a) contour extraction and thinning, (b) width normalization, and (c) segmentation
Figure 5-10: Horizontal histogram and segmentation of words into frames
Figure 6-1: Feature vector for HMM classifier
Figure 6-2: Training and testing phases in the HMM classifier
Figure 6-3: Examples of feature vectors in different Arabic words
Figure 7-1: The Arabic word "nine" written in different allographs and styles
Figure 7-2: The Arabic word "one" written in different allographs and styles
Figure 7-3: "eighty" written in different allographs and styles
Figure 7-4: "fifty" written in different allographs and styles
Figure 7-5: "hundred" written in different allographs and styles
Figure 7-6: "ninety" written in different allographs and styles
Figure 7-7: "no" written in different allographs and styles
Figure 7-8: ID3 classifier
Figure 7-9: Global features vector
Figure 7-10: Recognition of off-line handwritten Arabic words using Multiple Hidden Markov Models
Figure 7-11: Word recognition using local and global features
Figure 8-1: Stages of this research
Figure 8-2: Colour dropout using software: (a) scanned image, (b) after applying blue channel mode, (c) the image after stamp filter
Figure 8-3: Colour dropout using hardware
Figure 8-4: Words with touching characters
Figure 8-5: Dot above the last left character "noon" and below the real baseline
Figure 8-6: Over-segmented words
Figure 8-7: Error from file transformation
Figure 8-8: Wrong baseline for different Arabic words
Figure 8-9: Dots inside loops in the character "waw" in the word "Wahed - one"
Figure 8-10: Arabic letter "Alef" mistakenly classified as a complementary character
Figure 8-11: Complementary characters above the Arabic letter "Alef"
Figure 8-12: Example of overwritten or unwritten dots in the word "Twenty"
Figure 8-13: ID3 tree to classify words into four groups
Figure 8-14: The relation between words, groups, and the percentage of each word in each group for Table 8.2
Figure 8-15: Recognition rate decreases as the number of iterations increases for all groups (codebook = 90, and twenty states)
Figure 8-16: Recognition rate and codebook size relation for groups two to eight when the number of iterations is constant

LIST OF TABLES

Table 1-1: Arabic alphabet in all its forms
Table 1-2: Supplementary characters ('Hamza' and 'Madda') and their position with respect to the main character ('Alif', 'Waow' and 'Ya')
Table 1-3: Diacritical markings in Arabic writing
Table 1-4: Example of an Arabic word with different diacritics indicating different meanings
Table 1-5: Differences between Latin and Arabic Writing
Table 3-1: A comparison between PD-HMM and MD-HMM strategies
Table 4-1: The twenty most used words in written Arabic, with their meanings in English
Table 5-1: The curve categorization using the coordinates
Table 6-1: Arabic words without dots and other diacritical markings
Table 7-1: Group names and a list of each group
Table 8-1: Result of a series of tests using HMM
Table 8-2: Recognition rate basic statistics
Table 8-3: ID3 classifier results
Table 8-4: The relation between words, groups, and the percentage of each word in each group for some words in the dictionary
Table 8-5: The recognition rate for the global Word Feature Recognition Engine
Table 8-6: Recognition rate for each group and the total recognition rate
Table 8-7: The mean of 20 recognition rates for group six results from different states and codebook sizes
Table 8-8: The std. deviation of 20 recognition rates for group six results from different states and codebook sizes
Table 8-9: The mean of 20 recognition rates for group two results from different states and codebook sizes
Table 8-10: The std. deviation of 20 recognition rates for group two results from different states and codebook sizes
Table 8-11: The mean of 20 recognition rates for group three results from different states and codebook sizes
Table 8-12: The std. deviation of 20 recognition rates for group three results from different states and codebook sizes
Table 8-13: The mean of 20 recognition rates for group four results from different states and codebook sizes
Table 8-14: The std. deviation of 20 recognition rates for group four results from different states and codebook sizes
Table 8-15: The mean of 20 recognition rates for group five results from different states and codebook sizes
Table 8-16: The std. deviation of 20 recognition rates for group five results from different states and codebook sizes
Table 8-17: The mean of 20 recognition rates for group seven results from different states and codebook sizes
Table 8-18: The std. deviation of 20 recognition rates for group seven results from different states and codebook sizes
Table 8-19: The mean of 20 recognition rates for group eight results from different states and codebook sizes
Table 8-20: The std. deviation of 20 recognition rates for group eight results from different states and codebook sizes


Chapter 1: INTRODUCTION

The handwriting recognition problem arouses great interest among researchers, because there is a high level of ambiguity and complexity in such images, and because of the importance of Optical Character Recognition (OCR) in office automation and many other applications. Recognition of cursive handwritten text is one of the most difficult cases in the domain of OCR. However, the large number of potential applications makes it a very popular research subject. Much less research has been undertaken on the task of recognizing Arabic script, influenced perhaps by the lack of an international database in this field. The objective of this thesis is to provide a better way to recognise Arabic handwritten words.

This chapter describes the concept of OCR and its importance. It provides an overview of document structures: both the geometric structure and the logical structure. In addition, there is a discussion of the algorithms used for word recognition, which are classified into three categories, namely the holistic approach, the analytic approach, and feature sequence matching. In section 1.5, the off-line Arabic handwritten character recognition problem is defined. The particular problems of this application are a result of Arabic writing characteristics, the nature of Arabic handwriting, and the use of off-line recognition. This chapter also summarizes the thesis objective of building an off-line Arabic handwritten character recognition system. Since the proposed system involves several processing steps, it is useful to summarize the stages involved in optically handling a handwritten document, from pre-processing to post-processing. The optical character recognition system comprises five processing steps, namely data capture, pre-processing, feature extraction, classification, and post-processing. An outline of the research approach and the contribution points are discussed. Finally, there is a summary of how this thesis is organized.

1.1 Optical Character Recognition

What is Optical Character Recognition (OCR) and why do we need it? OCR is a process that attempts to turn a paper document into a fully editable form, which can be used in word processing and other applications as if it had been typed through the keyboard. The constant development of computer tools leads to the requirement for simpler interfaces between man and computer. The automatic recognition of handwritten text could be applied in many areas, for example 'form-filling' applications (including handwritten postal addresses, cheques, insurance applications, mail order forms, tax returns, credit card sales slips, customs declarations, and many others). All these applications generate handwritten script from an unconstrained population of writers and writing, which must subsequently be processed off-line by computers [ND94].


1.2 The Historical Background to OCR Research

Character recognition is an area of pattern recognition that has been the subject of considerable research during the last three decades [Na68]. Since the 1960s, much research on document processing has been carried out using OCR [AA94]. Surveys of the underlying techniques have been made by several researchers [Ma86] [IOO91] [MSY92] [Sa94]. Studies of automatic text segmentation and discrimination have been widely conducted since the early 1980s [AWS81] [WCW82]. Since then, the application of document image analysis has been growing rapidly due to developments in hardware enabling processing to be performed at a reasonable cost and speed [OK95]. Today, effective OCR packages can be bought for as little as $100 [CL96]. However, these are only able to recognize high-quality printed text documents or neatly written hand-printed text [CL96].

To date, many methods have been proposed and many document processing systems have been described. About 750 papers were presented at the International Conferences on Document Analysis and Recognition, ICDAR'97, ICDAR'99 and ICDAR'01 [ICDAR97, ICDAR99, ICDAR01]. Nine articles have been published in the special issue of the journal Machine Vision and Applications concerned with document analysis and understanding. Many papers have been published describing new achievements in research in these areas [IWF02, ICDAR03]. Several books on these topics have also been published [DI97, OK95, BWB94]. The current focus of OCR research is on systems that can handle documents that are not well recognized by current systems. As improvements in technology continue, document-processing systems will become increasingly common. The automatic acquisition of knowledge from documents such as technical reports, government files, newspapers, books, journals, magazines, letters, and bank cheques using OCR has become a commercial imperative.

1.3 Basic Model for Processing the Concrete Document

There are two types of document in the Romance or Anglo-Saxon languages: machine-printed text and handwritten text, the latter of which may be further divided into hand-printed words and cursive words. This research concentrates on the automatic recognition of handwritten Arabic text, which is most similar to cursive Latin handwriting. The objective of automatic document processing is to recognize text, graphics and digital image pictures and to extract the desired information in a format acceptable to humans [Ob94]. The following principal concepts were proposed in a basic model for processing the concrete document [MSY92]. A concrete document is considered to have two structures: a geometric (layout) structure and a logical structure. The geometric structure represents the objects of a document based on their presentation, and the connections among these objects. The logical structure represents the objects of a document, and the connections among these objects, as they would be classified by a person. Document processing is divided into two phases: document analysis, which refers to the extraction of the geometric structure from a document; and document understanding, which refers to mapping the geometric structure into a logical structure.


Once the logical structure has been captured, AI or other techniques can attempt to decode its meaning. In some cases, the boundary between the analysis and understanding phases is not clear. For example, the logical structure of bank cheques may also be found using an analysis by knowledge rules. In Figure 1-1, the relationships among the geometric structure, logical structure, document analysis and document understanding are depicted.


Figure 1-1: Basic Model for Document Processing
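To make the two structures and the two phases concrete, the following is a minimal sketch (illustrative only; the class names, fields, and the single mapping rule are assumptions, not taken from the thesis) of how a geometric region and its logical counterpart might be represented:

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Region:
    """Geometric (layout) object produced by document analysis."""
    bbox: Tuple[int, int, int, int]   # (x, y, width, height) in pixels
    kind: str                         # e.g. "text-block", "figure", "table"

@dataclass
class LogicalObject:
    """Logical object produced by document understanding."""
    role: str                         # e.g. "title", "paragraph", "amount-field"
    source: Region                    # the geometric region it was mapped from
    children: List["LogicalObject"] = field(default_factory=list)

def document_understanding(regions: List[Region]) -> List[LogicalObject]:
    """Map geometric structure to logical structure with a single toy rule."""
    return [LogicalObject(role="paragraph", source=r)
            for r in regions if r.kind == "text-block"]
```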

1.4 Recognition Strategies

Word recognition algorithms may be classified into the following categories:

• The holistic approach
• The analytic approach
• Feature sequence matching

The holistic approach generally utilizes shape features extracted from the word image in an attempt to recognize the entire word. It is usually accepted that holistic methods are feasible only when a small number of words are to be recognized. The analytic approach segments the word image into primitive components (typically characters). Character segmentation prior to recognition is called external character segmentation, while concurrent segmentation and recognition is called internal character segmentation. Feature sequence matching extracts features sequentially and derives word identity from this sequence. For a review of statistical pattern recognition see [JDM00]. The Hidden Markov Model (HMM) has been used widely for recognition based on feature sequences. It should be noted that recognition based on HMMs is often classified as a holistic approach [Na92].
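As a concrete illustration of feature-sequence matching (a simplified sketch in the spirit of frame-based approaches, not the implementation used later in this thesis; the frame width and the particular per-frame features are arbitrary assumptions): a binary word image is cut into narrow vertical frames, each frame is summarized by a few numbers, and the resulting sequence is the kind of observation stream an HMM would score.

```python
import numpy as np

def frame_feature_sequence(word_img: np.ndarray, frame_width: int = 4):
    """word_img: 2-D binary array (ink = 1). Returns one small feature vector per
    vertical frame, scanned right to left to follow the Arabic writing direction."""
    height, width = word_img.shape
    sequence = []
    for right in range(width, 0, -frame_width):
        frame = word_img[:, max(0, right - frame_width):right].astype(int)
        ink = frame.sum()
        density = ink / frame.size                           # fraction of ink pixels
        centroid = (np.nonzero(frame)[0].mean() / height) if ink else 0.0
        transitions = np.abs(np.diff(frame, axis=0)).sum()   # background/ink changes per column
        sequence.append((density, centroid, transitions))
    return sequence                                          # the observation sequence for an HMM
```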

1.5 Problem Definition

The problem of recognizing off-line Arabic handwritten words is important in office automation, as well as in many other applications. Using the analytic approach to extract the features contained in Arabic characters seems most appropriate, given the nature of Arabic handwriting. A handwritten Arabic character has no fixed pattern, but it does have fixed geometrical features: the shapes of handwritten Arabic characters differ between writers, but the geometrical features are always the same. An important difference between Arabic handwritten characters and Latin ones is the existence of dots, which differentiate between characters with the same geometry. Another difference is that there is not one baseline on which the characters are written, but two or more baselines, which makes recognition more difficult. This research deals with the recognition of off-line handwritten Arabic characters. As indicated by the title of this thesis, the difficulty of Arabic handwriting recognition is a result of many factors, which can be summarized as follows:

• The thesis studies cursive handwritten Arabic characters, which differ from the machine-printed case (section 1.5.1).
• The study also addresses Arabic writing, which differs from English writing in many ways. Readers can see the differences between English and Arabic writing in section 1.5.2.
• It deals with off-line recognition, which differs in important respects from on-line recognition (section 1.5.3).

1.5.1 Difficulties from Characteristics of the Arabic Writing System

The main characteristics of Arabic writing can be summarized as follows:

• Arabic text (machine printed or handwritten) is written cursively and, in general, from right to left. Arabic letters are normally connected to the baseline.
• Arabic writing uses an alphabet of 28 basic letters, ten Hindi numerals, punctuation marks, spaces, and special symbols.


Figure 1-2: Different shapes of the Arabic letter (‘ ’-‘A’in’) in: (a) beginning, (b) middle, (c) final and (d) isolated

• An Arabic letter might have up to four different shapes, depending on its relative position in the word, and this increases the number of classes from 28 to 100 (Table 1-1). For example, the letter 'A'in' has four different shapes: at the beginning of the word, in the middle of the word, at the end of the word, and in isolation. These four shapes of the letter 'A'in' are shown in Figure 1-2. Furthermore, there are two supplementary characters that operate on vowels to create a kind of stress (Hamza) and elongation (Madda); the latter operates only on the character Alif (Table 1-2). The character Lam-Alif is created as a combination of two characters, Lam and Alif, when the character Alif is written immediately after the character Lam. This new character, together with the combinations of Hamza and Madda, increases the number of classes to 120. This is made clear in Table 1-2.

Table 1-1: Arabic alphabet in all its forms. Each of the 28 basic letters (Alif ا, Ba ب, Ta ت, Tha ث, Jeem ج, Hha ح, Kha خ, Dal د, Thal ذ, Ra ر, Zay ز, Seen س, Sheen ش, Sad ص, Dhad ض, Tta ط, Za ظ, Ain ع, Gain غ, Fa ف, Qaf ق, Kaf ك, Lam ل, Meem م, Noon ن, Ha ه, Waow و, Ya ي) is shown in its isolated, start (initial), middle (medial), and end (final) shapes.

Table 1-2: Supplementary characters ('Hamza' and 'Madda') and their position with respect to the main character ('Alif', 'Waow' and 'Ya'): Alif with Hamza above (أ), Alif with Hamza below (إ), Alif with Madda (آ), Lam-Alif (لا) and its Hamza/Madda combinations (لأ, لإ, لآ), Waow with Hamza (ؤ), and Ya with Hamza (ئ).
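To make the class-count arithmetic quoted above concrete, here is a small illustration (not code from the thesis; treating the six letters that do not join to a following letter as having only two distinct shapes is an assumption used solely to reproduce the totals stated in the text):

```python
# Letters that do not connect to the following letter have only two distinct
# contextual shapes (isolated/start and middle/end coincide); the rest have four.
NON_JOINING = {"Alif", "Dal", "Thal", "Ra", "Zay", "Waow"}
LETTERS = ["Alif", "Ba", "Ta", "Tha", "Jeem", "Hha", "Kha", "Dal", "Thal", "Ra",
           "Zay", "Seen", "Sheen", "Sad", "Dhad", "Tta", "Za", "Ain", "Gain",
           "Fa", "Qaf", "Kaf", "Lam", "Meem", "Noon", "Ha", "Waow", "Ya"]

basic_shapes = sum(2 if name in NON_JOINING else 4 for name in LETTERS)
print(basic_shapes)        # 100 basic letter shapes from the 28 letters

# The Lam-Alif ligature plus the Hamza and Madda combinations of Table 1-2
# add roughly another 20 classes, giving the figure of about 120 quoted above.
print(basic_shapes + 20)   # 120
```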

Table 1-3: Diacritical markings in Arabic writing (single diacritics, double diacritics, Shadda, and combined diacritics, each shown with its written form).



• In the representation of vowels, Arabic uses diacritical markings (Table 1-3). The presence or absence of vowel diacritics indicates different meanings of the same word. If the word is isolated, diacritical marks are essential to distinguish between the two or more possible meanings. Table 1-4 gives an example of an Arabic word whose different diacritics indicate four different meanings. If diacritical markings occur in a sentence, contextual information inherent in the sentence can be used to infer the appropriate meaning. In this research, the issue of vowel diacritics is not addressed, since it is more common for Arabic writing not to employ these diacritics.


Table 1-4: Example of an Arabic word with different diacritics indicating different meanings: دَرَسَ (he studied), دَرْسٌ (a lesson), دَرَّسَ (he taught), دُرِسَ (it was studied).

Figure 1-3: Some Arabic characters that differ only by the position and number of associated dots



• Different Arabic characters may have exactly the same shapes, and so are distinguished from each other by the addition of complementary characters (the position and number of the associated dots). Hence, any thinning algorithm needs to deal efficiently with these dots without changing the identity of the character (Figure 1-3). In the segmentation process of a handwritten Arabic word, the characters are more difficult to segment if the dots are not placed exactly under or above the character body (Figure 1-4).

• Arabic writing is cursive and words are separated by spaces. Some Arabic characters cannot be connected to the succeeding character. Therefore, if one of these characters exists in a word, it divides that word into two sub-words. These characters appear only at the tail of a sub-word, and the succeeding character forms the head of the next sub-word (Figure 1-5).

Figure 1-4: A handwritten word that can be problematic to segment

Figure 1-5: Three Arabic words with constituent sub-words: (a) 'flower', (b) 'Maqdess', (c) 'Cairo'

• Arabic writing contains many fonts and writing styles. The letters are overlaid in some of these allographs and styles. Furthermore, characters of the same font have different sizes. Hence, segmentation based on a fixed size or width cannot be applied to Arabic [Ob94]. In Arabic writing it is sometimes difficult to separate words from each other, especially when people write calligraphically. See the example in Figure 1-6, taken from [Sa03].


Figure 1-6: Different Arabic sentences in different styles

Figure 1-7: Arabic Ligatures

• Ligatures are combinations of two, or sometimes three, characters into one shape (see Figure 1-7). Ligature selection depends not only on the characters themselves but also on the selected Arabic font. Some allographs do not use ligatures at all, and others may have as many as 200 different ligatures defined. Note also that ligatures affect the positioning of diacritical marks [AG02]. Figure 1-8 lists ligatures found in one Arabic font.

Figure 1-8: Ligatures found in the Traditional Arabic font

1.5.2 Difficulties in Handwritten Arabic Characters and their Differences from Latin

Arabic handwritten characters suffer not only from scale, location and orientation variation, but also from person-dependent deformations. These variations are neither predictable nor can they be formulated mathematically. Therefore, research on handwritten character recognition has always been challenging. However, the variation problem needs to be solved before recognition can be used to automate certain applications such as handwritten mail sorting, handwritten cheque processing, and so on. All of these applications require both high recognition rates and high reliability. In the system described in Chapter 4, some trials for solving the problem of Arabic handwriting recognition are implemented in pre-processing steps. The basic problems of handwriting recognition are common to all languages, but the special features, constraints, etc. of each language also need to be considered. It seems that Arabic and cursive connected English handwriting are similar, but researchers [AA92, AH95] have found many differences in the recognition of each handwritten language. Some of these are listed in Table 1-5.

1.5.3 Off-line Versus On-line

Handwritten character recognition systems can be divided into two broad types:

• Optical character readers (OCR): a whole page of handwritten, or handwritten and machine-printed, text (e.g. forms) is processed
• On-line character recognition (OLCR): characters are converted and recognized interactively as they are formed

Abuhaiba et al. [AHD94] mentioned that on-line recognition is less difficult than off-line recognition, since the temporal information in the script is available; pen speed and even pressure information may also be available. For a comprehensive survey of on-line and off-line handwriting recognition see [PS00].


Table 1-5: Differences between Latin and Arabic Writing

Direction
  English: written from left to right.
  Arabic: written from right to left.

Connection
  English: in general, each character is connected to the next character with diagonal strokes.
  Arabic: letters are normally connected to the baseline with horizontal strokes.

Character variations
  English: characters have few shape versions.
  Arabic: a letter might have up to four different shapes, depending on its relative position in the word.

Features
  English: writing has specific geometrical features.
  Arabic: writing has a unique feature for each character, especially curves and dots.

Segmentation
  English: any analytical segmentation approach can segment the handwriting into different letters or sub-letters.
  Arabic: the letters or segmented sub-letters are different from the segments found in English.


1.6 The Objectives of this Research

This research deals with the pre-processing steps and classification of off-line handwritten Arabic words. In this system, some of the methods applied to handwritten Arabic writing, such as HMMs applied after segmenting words into frames, have not been applied before, as can be seen from the literature survey. The feature extraction process includes locating endpoints, junctions, turning points and loops, generating frames, and detecting strokes. Further features, such as moments, are also extracted from the characters. Future work, as well as suggestions to improve the overall accuracy of the system, is discussed at the end of the thesis. Before discussing the proposed system, it is necessary to review briefly the nature of handwritten Arabic characters and, hence, the challenges that must be faced when attempting automatic recognition. The thesis objectives can be summarized as follows:

• A survey of off-line handwritten Arabic character recognition
• A review of the difficulties involved in the recognition of Arabic handwritten characters
• Since there is no well-known database containing Arabic handwritten words for researchers to test against, one of the objectives has been to build such a database, with words collected from several writers
• Building a pre-processing system for recognizing off-line handwritten words: first, the system involves a new implementation of slant correction techniques for off-line handwritten Arabic words; second, a slope correction procedure is implemented for the first time; and finally, the word is thinned into a skeleton
• Constructing a feature extraction process, implemented by extracting geometrical features from each zone of the word, which represent the characters present
• Implementing a segmentation procedure that divides any word into characters or sub-characters using a histogram calculation, and also extracts other features such as moments
• Building a suitable codebook using Vector Quantization
• Building the HMM for the body of Arabic words
• Training the system
• Testing the system
• Developing a lexicon reduction operation, through a global recognition system which uses a simple classifier
• Further training and testing of the system
• Presentation of the results and conclusions from the experiments

1.7 Contribution An important contribution of this research lies in the provision of a much needed database. This offers practical benefits for researchers on handwritten Arabic, by providing a testbed to facilitate training and testing.


This research develops a new database for the collection, storage and retrieval of Arabic handwritten text (AHDB), which supersedes previous databases both in terms of its size and the number of different writers involved. With this research, the most popular words in Arabic writing have been identified for the first time, using an associated program. A second contribution is to the field of pre-processing and feature extraction: a novel set of handwritten features is combined and tested in the classification stage. A third contribution is in the field of classification: a new HMM approach is used to train and test Arabic handwritten words taken from around 100 different writers. A fourth contribution is the use of a global approach, an inexpensive method of feature classification which avoids the problematic segmentation stage. The combination of global and local features to recognize words also improves the recognition rate and has not been used previously in Arabic word recognition.

1.8 The Thesis Organization

As previously mentioned, this chapter describes the concept of OCR and its importance in office automation and other applications, and gives a brief general background of OCR research. It summarizes the basic model for processing any document, gives an overview of the two basic types of recognition system, namely off-line optical character readers (OCR) and on-line character recognition (OLCR), and discusses the nature of handwritten Arabic characters and, hence, the problems that could be faced when automatically (optically) recognizing them. The main characteristics of the Arabic writing system and its difficulties are discussed. The chapter also summarizes the thesis objective of building an off-line Arabic handwritten character recognition system. The general approach of this research is described and the contribution of the work described in this thesis is evaluated.

Chapter 2 discusses the steps involved in an OCR system, summarized as data capture, pre-processing (binarization of scanned images, skew detection, segmentation), feature extraction, classification, and post-processing. It also surveys existing systems and research results in this field. Since this research uses HMMs, a survey of HMMs for handwriting recognition is presented first, followed by a survey of HMMs used in Arabic OCR. The chapter closes with a review of some of the previous trials in the field of off-line handwritten Arabic character recognition.

Chapter 3 reviews useful techniques used in research on the automatic recognition of off-line handwritten Arabic characters (feature extraction methods, segmentation methods, and recognition methods) and then discusses the three main techniques used in this research: vector quantization, HMM, and the ID3 classifier.

In Chapter 4 the generation of a database of off-line Arabic handwritten words, collected from the handwriting of more than 100 writers, is described. This database is the first of its kind for Arabic handwriting and a very useful stage in Arabic handwriting research. Also in Chapter 4, the most used words in Arabic writing are counted for the first time.

Chapter 5 describes the operation of the complete pre-processing system for the recognition of a single handwritten Arabic word, from the scanned document to the output of a segmented and connected word.

Chapter 6 describes the recognition of handwritten Arabic words classified by HMMs. Chapter 2 surveys some implementations of HMMs for Arabic OCR, but those trials do not include the application of HMMs to handwritten Arabic words; this chapter includes such an implementation.

Chapter 7 discusses a lexicon reduction system, and further classification using different hidden Markov models. The overall engine combining a global feature scheme with an HMM module is a more capable system.

Chapter 8 presents the experimental results, discussing the details of the experiments carried out throughout this thesis, as well as the results obtained.

Chapter 9 presents the conclusions of this research. The objective of this concluding chapter is to provide an overview of the research, to analyze some useful development opportunities arising from it, and to offer some suggestions about how future research on related topics could be carried out.

Chapter 2: THEORY AND LITERATURE REVIEW

This chapter discusses the steps involved in OCR in general, and surveys the systems and research trials in this field. The steps involved in developing an OCR system include the construction of a testing database, data capture, pre-processing (binarization of scanned images, skew detection, segmentation), feature extraction, classification, and post-processing. Training using prior data is described, followed by a description of past trials of recognition of handwritten words in general, and then the state of the art in recognition of handwritten Arabic text. The remainder of the chapter briefly reviews research that has greatly influenced the evolution of handwriting recognition, especially that using HMMs, and it then surveys individual approaches to the automatic recognition of off-line handwritten Arabic characters. Most of the published work on the recognition of off-line handwritten Arabic characters assumes that the characters are already segmented. However, this research assumes that the word is not segmented into characters, as Arabic characters cannot be written separately.


2.1 Introduction

While the early experimental OCR systems were often rule-based, by the 1980s these had been completely replaced by systems based on statistical pattern recognition. For clearly segmented printed materials, such techniques offer virtually error-free OCR for the most important alphabetic systems, including variants of the Latin, Greek, Cyrillic, and Hebrew alphabets. However, when the number of symbols is large, as with the Chinese or Korean writing systems, or the symbols are not separated from one another, as in Arabic or Devanagari text, OCR systems are still far from the error rates achieved by human readers, and the gap between the two is also evident when the image quality is compromised, for example with fax transmission. Until these problems are resolved, OCR is unable to play the central role in the transmission of cultural heritage to the digital age that it is often assumed it can. In the recognition of handprint, algorithms with successive segmentation, classification, and identification (language modelling) stages are still the most successful. For cursive handwriting, HMMs that make segmentation, classification, and identification decisions in parallel have proved to be superior. However, their performance still leaves much to be desired, because they do not necessarily synchronize the spatial and temporal aspects of the written signal (that is, discontinuous constituents arising, for example, at the crossing of 't's and the dotting of 'i's), and because the inherent variability of handwriting is far greater than that of speech, to the extent that we often see illegible handwriting but rarely hear unintelligible speech. A comprehensive reference for cursive and machine-print recognition is Bazzi et al. (1999) [BSM99]. The state of the art in handwriting recognition is closely tracked by the International Workshop on Frontiers of Handwriting Recognition (IWFHR) [IWF02]. For language modelling in OCR see Kornai (1994) [Ko94]. A good general introduction to the problems of page decomposition is offered by O'Gorman and Kasturi (1995) [OK95], and to OCR in general by Bunke and Wang (1994) [BWB94]. The contribution to document image analysis of about one hundred papers published in the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) is summarized in [Na00]. The next section gives a general review of previous work on off-line handwriting recognition, and section 2.3 reviews papers published on off-line handwriting recognition using HMMs.

2.2 Survey of Off-line Handwritten Words Recognition An OCR system consists of the following processing steps [OK95]: •

Data Capture: Grey scales level scanning at an appropriate resolution (typically 300-1000 dpi)



Pre-processing (Pixel-Level Processing): Constitutes the following: o Binarization (two level thresholding), using a global or a locally adaptive method o Determining the skew (any tilt at which the document may have been scanned) o Document layout analysis: Finding columns and paragraphs; Line, word, and character segmentation: extracting text lines, words, and characters


• Feature extraction

• Classification

• Contextual verification, or post-processing



Figure 2-1: Steps involved in the Optical Character Recognition System


In Figure 2-1, the first step, and a part of the second, may be termed “geometric structure” analysis, or document analysis. The following steps are termed document understanding, or mapping the geometrical structure into a logical structure. In the following sub-sections, each of the steps involved in the OCR system (shown in the previous figure) is briefly discussed [TJT96, TLS96]. The next sections summarize research and trials conducted in each area of handwriting recognition.

2.2.1 Databases

A standard database of images is needed to facilitate research in handwritten text recognition. A number of existing databases for English off-line handwriting recognition are summarized in [MB99-MB02], and also in [Na92-JLG78]. For machine-printed Arabic, the Environmental Research Institute of Michigan (ERIM) has created a database of machine-printed Arabic documents. These images are extracted from typewritten and typeset Arabic books and magazines [Sc02].

2.2.2 Data Capture

Data capture is usually carried out by optically scanning a paper document. The resulting data is stored in a file of picture elements (pixels) that are sampled in a grid pattern throughout the document. In general, grey-level scanning is performed at a resolution of 300-1000 dots per inch.


In this research, samples of Arabic handwritten data were used and stored in files for later off-line processing [OK95].

2.2.3 Pre-processing

Pre-processing is a step that enhances the quality of the image and thereby improves the quality of feature extraction. Pre-processing includes steps such as 1) binarization, 2) skew detection, 3) segmentation, and 4) dissection, which are discussed in the following sub-sections.

2.2.3.1 The Binarization of Scanned Images

The images resulting from the optical scanning process are usually in grey-scale format. These images need to be binarized, i.e. turned into a two-level format, to enable the subsequent processing steps. The two levels are usually black for character pixels and white for background pixels. Binary scanners, which combine digitization with thresholding, may not produce images with a clear separation between the foreground and background components. There are two ways of improving binarization. Firstly, one can empirically determine the best binarization setting each time the scanning process is to be done. Alternatively, one can start with the grey-scale images resulting from the digitization process and use methods for automatic threshold determination.
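To make the automatic-threshold alternative concrete, the following is a minimal sketch (not taken from the system described in this thesis) of Otsu-style global thresholding; it assumes a NumPy grey-scale array `img` with integer values in 0-255, and the function names are invented for illustration.

```python
import numpy as np

def otsu_threshold(img: np.ndarray) -> int:
    """Return the global threshold that maximizes between-class variance.

    `img` is assumed to be a 2-D array of grey levels in 0..255.
    """
    hist = np.bincount(img.ravel(), minlength=256).astype(float)
    prob = hist / hist.sum()
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = prob[:t].sum(), prob[t:].sum()
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (np.arange(t) * prob[:t]).sum() / w0
        mu1 = (np.arange(t, 256) * prob[t:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_t, best_var = t, var_between
    return best_t

def binarize(img: np.ndarray) -> np.ndarray:
    """Map character pixels to 1 (black) and background pixels to 0 (white)."""
    t = otsu_threshold(img)
    return (img < t).astype(np.uint8)
```

A locally adaptive method would instead compute a separate threshold for each window of the image, which copes better with uneven illumination.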

2.2.3.2 Skew Detection

Many methods and techniques have been developed to perform the skew detection of an image [OK95]. Akiyama and Hajeta [AH90] developed an automated entry system for skewed documents, but this failed with documents that consist of text blocks, photographs, figures, charts, and tables. The Hough Transform can also be applied in skew detection.


Hinds, Fisher and D'Amato [HFA90] developed a document skew detection method using run-length encoding and the Hough Transform; all skews were detected correctly for thirteen test images of five different types of documents. Nakano, Shima, and Fuzisawa [NSF+90] proposed an algorithm for skew normalization of a document image based on the Hough Transform. These methods can handle documents with limited non-text regions. Ishitani [Is93] proposed a method to detect skew for document images containing a mixture of text areas, photographs, figures, charts and tables. Yu, Tang and Suen [YTS95] developed a method using least squares to handle the multi-skew problem. Approaches based on the horizontal projection histogram, as used for Arabic text, are presented in [OM02]. The authors present a method that is based entirely on polygonally approximated skeleton processing. However, this method still does not work well with words containing isolated characters, and it was not tested on words with overlapping characters.
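As a simplified illustration of projection-based skew detection (a stand-in for, not a reproduction of, the methods cited above), the sketch below rotates a binary image over a range of candidate angles and keeps the angle that maximizes the variance of the horizontal projection; the angle range, the step size and the use of scipy.ndimage.rotate are assumptions made for this example.

```python
import numpy as np
from scipy.ndimage import rotate

def estimate_skew(binary: np.ndarray, max_angle: float = 10.0, step: float = 0.5) -> float:
    """Return the rotation angle (degrees) that best de-skews a binary image (text pixels = 1).

    When text lines are horizontal, the horizontal projection shows sharp peaks
    and valleys, so its variance is maximal at the correct correction angle.
    """
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step, step):
        rotated = rotate(binary, angle, reshape=False, order=0)
        projection = rotated.sum(axis=1)   # black pixels per row
        score = projection.var()
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

# Hypothetical usage:
# correction = estimate_skew(binary_img)
# deskewed = rotate(binary_img, correction, reshape=False, order=0)
```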

2.2.3.3 Segmentation

The initial segmentation of characters can make the difference between very good and very poor results from an OCR process. The goal of a character segmentation algorithm is to partition a word image into regions, each containing an isolated complete character. In handwritten words, it is extremely difficult to segment characters without the support of recognition algorithms. Therefore, unlike the problem of machine-printed character recognition, handwritten character segmentation and recognition are closely coupled [LS96].


A character is a pattern that resembles one of the symbols that the system is designed to recognize. To determine such a resemblance, the pattern must be segmented from the document image. Researchers in the 1960s and 1970s observed that segmentation caused more errors than shape distortions in reading unconstrained characters, whether hand- or machine-printed. Three “pure” strategies for segmentation, plus numerous hybrid approaches that are weighted combinations of the three, are mentioned in [CL96] and are outlined below:

• The classical approach, in which segments are identified based on ‘character-like’ properties. This process of cutting up the image into meaningful components is named ‘dissection’, referring to the decomposition of the image into a sequence of sub-images using general features.

• Recognition-based segmentation, in which the system searches the image for components that match classes in its alphabet.

• Holistic methods (or the global approach), in which the system seeks to recognize whole words, avoiding the need to segment them into characters.

2.2.4 Feature Extraction

Feature extraction is defined as the problem of “extracting (from raw data) the information which is most relevant for classification purposes, with the aim of minimising the within-class pattern variability whilst enhancing the between-class pattern variability” [DK82].


Feature extraction is a problematic topic, often art rather than science, as it is difficult to predict in advance which measures will be useful, and features can be expensive to calculate [Ob94]. Feature extraction methods differ from one application to another; methods that succeed in one application may not be very useful in another. Feature extraction is, however, an important step in an OCR system, although it is not independent of the other steps (see Figure 2-1). The choice of the feature extraction method limits or dictates the output of the pre-processing step. Some methods work on grey-level sub-images of single characters, whilst others work on solid four- or eight-connected symbols segmented from the binary raster image, thinned symbols or skeletons, or symbol contours. Further, the format of the extracted features must match the requirements of the chosen classifier. Graph descriptions or grammar-based descriptions of the characters are well suited to structural or syntactic classifiers. A literature survey of feature extraction methods is provided by [Al99]. A discussion of feature extraction techniques used in Arabic writing is given in Chapter 3.

2.2.5 Classification

Typical character classification systems extract several features from each character image and then, based on the similarity of the feature vector to each character class, attempt to classify it. Many well-known pattern classification methods, as well as syntactic and structural methods, have been used [MSY92, Na92]. There are different classifier structures for isolated handwritten character classification, such as simple linear classifiers (one classifier for the whole problem), two-stage hierarchical classifiers, and tree classifiers. The results of experiments on handwritten characters show that combining multiple classifiers is an effective means of producing highly reliable decision classifiers.


Intrinsically, neural networks are suitable to serve as combination functions because they have the following three valuable characteristics. They:

- can infer subtle, unknown relationships from data;

- can generalize, meaning that they can still respond correctly to patterns that are only similar to the original training data;

- are non-linear; that is, they can solve some complex problems more accurately than linear techniques do.

Efforts have been made to improve the performance of OCR by using powerful character feature extraction and classification methods. Further improvement could be obtained by exploiting contextual information [Na92]. The classifiers used in such systems frequently output several classes for each input pattern and associate a degree of confidence with each label. A final class assignment is made after analyzing the outputs from a string of characters, rather than making a decision based on a single character.

Because of the large shape variations in human handwriting, the recognition accuracy of cursive handwritten words is rarely satisfactory when a single classifier is used. In recent years some multiple classifier combination techniques have been proposed to improve handwritten character recognition performance, and they have been shown to give promising results by a number of different researchers.

[XKL02] used HMM classifiers with different architectures and different features to recognize the names of the months, giving an 85% recognition rate.


Wang et al. [WBR02] introduced a framework for combining the results of multiple classifiers and presented an intuitive run-time-weighted opinion pool (RWOP) combination approach for recognizing cursive handwritten words. Promising results have been achieved with these methods. A study of multiple expert systems for handprinted numeral recognition was presented in [YNT97], and [LBK97] discusses handprint recognition. A multiple classifier approach to recognizing handwritten characters was studied in [RF97], whilst [Go97] discusses several techniques for a variety of practical tasks. [GB02] introduced new methods for the creation of ensembles based on feature selection algorithms, which are evaluated and compared to an existing HMM approach. A review of previous trials on handwritten recognition using HMMs is given in further detail in Chapter 3.
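The general idea behind combining classifier outputs can be shown with a small weighted (linear) opinion pool; this is a generic sketch rather than the RWOP scheme of [WBR02], and the two classifier score dictionaries, their weights and the word labels are invented for illustration.

```python
from typing import Dict, List

def weighted_opinion_pool(scores: List[Dict[str, float]], weights: List[float]) -> Dict[str, float]:
    """Combine per-class scores from several classifiers by a weighted average.

    scores[i][label] is classifier i's (normalized) confidence for `label`;
    weights[i] reflects how much classifier i is trusted.
    """
    combined: Dict[str, float] = {}
    total_weight = sum(weights)
    for classifier_scores, w in zip(scores, weights):
        for label, s in classifier_scores.items():
            combined[label] = combined.get(label, 0.0) + w * s
    return {label: s / total_weight for label, s in combined.items()}

# Hypothetical example: an HMM-based and a holistic classifier scoring two words.
hmm_scores      = {"dinar": 0.7, "riyal": 0.3}
holistic_scores = {"dinar": 0.4, "riyal": 0.6}
pooled = weighted_opinion_pool([hmm_scores, holistic_scores], weights=[0.6, 0.4])
best_word = max(pooled, key=pooled.get)   # -> "dinar"
```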

2.2.6 Post-processing

Post-processing systems are designed to correct OCR errors without human intervention. The well-known application of lexical knowledge for contextual post-processing compares dictionary-based (top-down) and statistical (bottom-up) approaches. The advantage of statistical over dictionary-based methods lies in computational time and memory utilization. On the other hand, lexical knowledge is more accurate when using a dictionary. Finally, the contextual post-processing of OCR results can also take into account knowledge of the context of words. From a linguistic point of view, a technique for contextual post-processing can incorporate a multitude of different knowledge sources, for example frequencies of single words and word combinations, compounds and idioms, and linguistic structures such as phrases and sentences.


An overview of possible knowledge sources for post-processing is presented in [Na92, Sr93].
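The dictionary-based flavour of post-processing can be sketched as follows: each recognized string is replaced by the lexicon word at the smallest edit distance. The lexicon, the transliterated word labels and the misrecognized string are hypothetical; a statistical approach would instead re-rank candidates with character or word n-gram frequencies.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def correct(word: str, lexicon: list) -> str:
    """Replace an OCR output with the closest lexicon entry."""
    return min(lexicon, key=lambda entry: edit_distance(word, entry))

# Hypothetical example with Latin transliterations of Arabic amount words.
lexicon = ["dinar", "riyal", "dirham"]
print(correct("rlyal", lexicon))   # -> "riyal"
```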

2.3 Off-line HMMs for an HWR Survey

The discussions here are by no means exhaustive. There is a growing interest in applying HMMs to the problem of document analysis and recognition, and a large body of literature is being published in reputed journals and conference proceedings. Several promising research achievements have been presented at recent conferences and workshops. Results in HMM research for handwriting recognition can be grouped into the on-line and off-line cases. Work done in the field of off-line handwriting recognition is reviewed and divided into two groups: that done on segmented handwritten words, and that done on non-segmented words.

First, the Single Contextual Hidden Markov Model (SCHMM) that was introduced by [KHB98] to recognize hand-printed words, i.e. handwritten words that are naturally segmented, will be discussed. When the letters of the words are naturally segmented, and if these letters are identified as states [KHB98], there are a finite number of predetermined states, for example the 26 letters of the English alphabet. In general, handwritten words are usually not naturally segmented into letters, and a word segmentation algorithm is necessary for such a task. At present, no good segmentation algorithm exists which separates all the letters perfectly and without any spurious segmentation points. [CKZ94] use a more general framework that can be applied to cursive, non-cursive, naturally segmented or any other type of handwritten words. In this approach, a morphology-based segmentation algorithm is first used to divide the word image into a sequence of segments, which could signify a whole, partial, or joined letter.


The sequence of segments is then recognized by an HMM-type stochastic network which can deal with the problems of touching and broken characters. Since touching characters are not guaranteed or required to be split by the segmentation algorithm, the number of states, which depends on the training set, may go up to over 6,000 [CKZ94] for handwritten English words. Consequently, the state assignment for a large training set is rather complicated. Furthermore, this individual segment-based recognition system might never know how well a character is formed by combining several consecutive segments. Nevertheless, the scheme described in [CKZ94] has clearly shown that the application of HMMs to a large-vocabulary HWR problem is, indeed, much more complex than the one described in [NWF86].

To overcome the problems of the SCHMM system, a new system using a Continuous Density Variable Duration Hidden Markov Model (CDVDHMM) [CKS95] was proposed, with the help of an enhanced segmentation algorithm which splits all the touching characters (of course, this leads to more spurious segmentation points). The CDVDHMM defines the 26 letters of the alphabet as 26 different states, and this number is fixed and much smaller than in the previous system described by [CKZ94]. Consequently, the recognition speed is much improved. The implementation and experiments for the CDVDHMM system are discussed in [CKZ94]. Besides the CDVDHMM in HWR, there is the NEHMM (Non-Ergodic HMM) based system proposed by Chen and Kundu [CK94]. The NEHMM system follows the Model Discriminant HMM (MD-HMM) strategy (see section 3.3.1). However, the model parameters can be derived from the statistics of the CDVDHMM, and the NEHMM appears to perform better than the CDVDHMM strategy, albeit at a slower speed.


A combination using both the CDVDHMM and the NEHMM can be considered as a trade-off between performance and speed. One problem with the VDHMM system is ensuring reliable computation of the model probabilities given the limited number of databases available at the present time. [CKZ94] presented an interesting idea to avoid the computation of duration probabilities. By using over-segmentation, this scheme considers many different sub-sets of the segmentation points. Each sub-set leads to one distinct observation sequence. The recognition task is then to find the best segmentation; that is, to find the sub-set that contains the correct segmentation points, and the associated optimal state sequence which corresponds to the letter sequence of the word. This philosophy is similar to that of the VDHMM. The added complexity of computing the duration probability in each state is avoided in this approach by making a simple, but realistic, assumption that a character can be broken into (at most) four segments and, therefore, that there are four discrete duration probabilities for each state. Instead of assigning pre-computed duration probabilities to each state, only one duration is picked (during recognition) by matching one, two, three and four consecutive segments to the symbols in the feature space, and finding the best match and its corresponding number of segments. In this way, the computation of the duration probability in each state is avoided without sacrificing the advantage of the VDHMM. However, the structure of the Viterbi algorithm used during recognition is substantially altered. The overall performance of this scheme, as expected, is quite similar to the VDHMM-based word recognition system [CK95].


In the previous approaches (SCHMM, CDVDHMM, and NEHMM), the models are actually semi-hidden Markov models, i.e. the states of the HMMs are transparent during training. Because re-estimation algorithms, such as the Baum-Welch algorithm, do not preserve the correspondence of the states to their semantic meanings, they are not suitable for training semi-hidden Markov models. Another approach uses a Multi-Level Hidden Markov Model (MLHMM), which is a doubly embedded network of HMMs, whereby characters are modelled by an HMM and words by a higher-level HMM. The HMM follows the Model Discriminant HMM (MD-HMM) strategy at the character level. Since states are not assigned any semantic meaning at the character level, the re-estimation algorithm is applicable. For the word model, on the other hand, both the MD-HMM and Path Discriminant (PD-HMM) strategies can be used. Another major difference between this system and the previous approaches is the output-independence assumption of the HMM (see section 3.3.1). The details of this approach are described in [CKZ94].

There are many uncertainties in handwritten character recognition. Stochastic modelling is a flexible and general method for modelling such problems, and entails the use of probabilistic models to deal with uncertain or incomplete information. Cho et al. [CLK95] used another strategy for modelling and recognizing cursive words with HMMs. In the proposed method, a sequence of thin vertical frames is extracted from the image, capturing the local features of the handwriting. By quantizing the feature vectors of each frame, the input word image is represented as a Markov chain of discrete symbols. A handwritten word is regarded as a sequence of characters and optional ligatures. Hence, the ligatures are also explicitly modelled. With this view, an interconnected network of character and ligature HMMs is constructed to model words of indefinite length. This model can ideally describe any form of handwritten words, including discretely spaced words, pure cursive words and unconstrained words of mixed styles.


Experiments have been conducted with a standard database to evaluate the performance of the overall scheme. The performance of various search strategies based on the forward and backward scores has been compared. Experiments on the use of a pre-classifier based on global features show that this approach may even be useful for large-vocabulary recognition tasks.

Another method for the off-line recognition of cursive handwriting using HMMs is implemented by Bunke et al. [BR95]. The features used in their HMMs are based on the arcs of skeleton graphs of the words to be recognized. An algorithm is applied to the skeleton graph of a word to extract the edges in a particular order. Given the sequence of edges extracted from the graph, each edge is transformed into a ten-dimensional feature vector. The features represent information about the location of an edge relative to the four reference lines, its curvature and the degree of the nodes incident to the considered edge. The linear model was adopted as the basic HMM topology. Each letter of the alphabet is represented by a linear HMM. Given a dictionary of fixed size, an HMM for each dictionary word is built by sequential concatenation of the HMMs representing the individual letters of the word. Training of the HMMs is done using the Baum-Welch algorithm, while the Viterbi algorithm is used for recognition. An average correct recognition rate of over 98% at the word level has been achieved in experiments with cooperative writers using two dictionaries of 150 words each.

Park et al. [PL96] present an efficient scheme for the off-line recognition of large-set handwritten characters in the framework of stochastic models, the first-order HMMs. To facilitate the processing of unconnected patterns and patterns with isolated noise, four types of feature vectors, based on the regional projection contour transformation (RPCT), are employed.


The character recognition system consists of two phases: a training phase, where multiple HMMs corresponding to the different RPCT feature types are built, and a classification phase, where the results of the individual classifiers are integrated to produce the final recognition result, each individual HMM classifier producing one score that is the probability of generating the test observation sequence for each character model. In this paper, several methods for integrating the results of the different classifiers are considered so that better results can be obtained. In order to verify the effectiveness of the proposed scheme, the 520 most frequently used types of Hangul characters in Korea were considered in the experiments. The experimental results suggest that the proposed scheme is promising for the recognition of large-set handwritten characters with numerous variations.

Kim et al. [KP96] proposed a recognition system for constrained handwritten Hangul (Korean) and alphanumeric characters using discrete HMMs. Hangul shapes are classified into six types with fuzzy inference, and their recognition, based on quantized features, is performed by optimally ordering features according to their effectiveness in each class. Constrained alphanumeric recognition is also performed using the same features employed in Hangul recognition. The forward-backward, Viterbi and Baum-Welch re-estimation algorithms are used for the training and recognition of handwritten Hangul and alphanumeric characters. The simulation results show that the proposed method recognizes handwritten Korean and alphanumeric characters effectively.

[SK98] proposed a network-based approach to Korean handwriting analysis. The starting point of this research is a network of HMMs which models whole sets of characters. This is followed by the assertion that the HMM for on-line script can be applied not only to on-line character recognition, but also to handwriting synthesis and even to pen-trajectory recovery in off-line character images. The solutions to these problems are based on the single network of HMMs and the single principle of DP-based state-observation alignment.


Given an observation sequence, the search for the best path in the network corresponds to recognition, whereas with character models, the search for the best observation sequence corresponds to handwriting generation.

Kundu et al. [KHC98] have published work concerning the variable duration HMM (VDHMM) in handwriting recognition. They showed that if the duration statistics are computed, these can be utilized to implement an MD-HMM approach for better experimental results. They also described a PD-HMM based HWR system where the duration statistics are not explicitly computed, but the results are still comparable to a VDHMM-based HWR scheme.

In recent years, there have been several attempts to extend the one-dimensional HMM to two dimensions, for example Park and Lee [PL98]. Unfortunately, previous efforts had not achieved a truly two-dimensional (2-D) HMM because of the difficulty in establishing a suitable 2-D model and its computational complexity. Park and Lee [PL98] presented a framework for the recognition of handwritten characters using a truly 2-D model: the Hidden Markov Mesh Random Field (HMMRF). The HMMRF model is an extension of the 1-D HMM to a 2-D HMM, which provides a better description of the 2-D nature of characters. The application of the HMMRF model to character recognition necessitates two phases: a training phase and a decoding phase. Their optimization criterion for training and decoding is based on the maximum marginal posterior probabilities. They also develop a new formulation of parameter estimation for character recognition. Computational concerns in 2-D, however, necessitate certain simplifying assumptions in the model and approximations in the implementation of the estimation algorithm.


In particular, the image is represented by a third-order MMRF and the proposed estimation algorithm is applied over the look-ahead observations rather than the entire image. Thus, the formulation is derived from the extension of the look-ahead technique as devised for real-time decoding. Experimental results confirm that the proposed approach offers great potential for solving difficult handwritten character recognition problems under reasonable modelling assumptions.

El-Yacoubi et al. [EGS99] used an HMM approach to recognize off-line unconstrained handwritten words for large vocabularies. After pre-processing, a word image is segmented into letters (or pseudo-letters) and represented by two feature sequences of equal length, each consisting of an alternating sequence of shape symbols and segmentation symbols, which are both explicitly modelled. The word model is made up of the concatenation of appropriate letter models consisting of elementary HMMs, and an HMM-based interpolation technique is used to optimally combine the two feature sets. Two rejection mechanisms are considered, depending on whether or not the word image is guaranteed to belong to the lexicon. Experiments carried out on real-life data show that the proposed approach can be successfully used for handwritten word recognition.

HMM-based word recognition can also be applied to reading the amounts on cheques. Knerr et al. [KAN+98] applied an HMM-based word recognition algorithm to the recognition of legal amounts on French bank cheques. The algorithm starts from images of handwritten words which have been automatically segmented from binary cheque images. After finding the lower-case zone of the complete amount, words are slant-corrected and then segmented into graphemes. Features are then extracted from the graphemes and the feature vectors are vector quantized, resulting in a sequence of symbols for each word.


The likelihoods of all word classes are computed by a set of HMMs which have been previously trained using either the Viterbi algorithm or the Baum-Welch algorithm. The various parameters of the system have been identified and their importance evaluated. Results have been obtained on large real-life databases of French handwritten cheques. More recently, a neural network-HMM hybrid has been designed, which produces even better recognition rates.

Senior and Robinson [SR98] designed a complete system for the recognition of off-line handwriting. A recurrent neural network is used to estimate probabilities for the characters represented in the skeleton. The operation of the HMM, which calculates the most appropriate word in the lexicon, is also described.

As mentioned earlier in this chapter, segmentation-recognition schemes are primarily character-based approaches; the basic element of recognition is the character. For small lexicons, as in the bank cheque application, most approaches are global, with words considered as individual entities [GS95]. Guillevic and Suen have published papers on the recognition of legal amounts on bank cheques. The overall engine combines a global feature scheme with an HMM module. The global features encode the relative positions of the ascenders, descenders and loops within a word. The HMM uses one feature set based on the orientation of contour points and their distance from the baselines. The system is fully trainable, reducing the number of hand-set parameters to a strict minimum. The system is also modular and independent of specific languages, as they have to deal with at least two languages in Canada, namely English and French. The system can be easily adapted to read other European languages based on the Roman alphabet [GS98].

An HMM has also been used for the linguistic post-processing component of human handwriting recognition applications, by Bouchaffra et al. [BKK+96] and Hull [Hu96].


Article [BKK+96] shows that the SSS algorithm has a direct interpretation as an HMM whose states correspond to words that have been tagged with their parts of speech, and whose observations are discrete recogniser confidences. The HMM interpretation has the added advantage that it can be naturally extended to handle error recovery in the recogniser. Preliminary results indicate that the SSS model is successful in selecting the true path over alternate paths. Hull [Hu96] used an HMM to improve the performance of an algorithm for recognising digital images of handwritten or machine-printed text. A word recognition algorithm first determines a set of words (called a neighbourhood) from a lexicon that is visually similar to each input word image. Syntactic classifications for the words and the transition probabilities between those classifications are input to the Viterbi algorithm. The Viterbi algorithm determines the sequence of syntactic classes (the states of an underlying Markov process) for each sentence that has the maximum posterior probability given the observed neighbourhoods. The performance of the word recognition algorithm is improved by removing from the neighbourhoods words with classes that are not included on the estimated state sequence. An experimental application is demonstrated with a neighbourhood generation algorithm that produces a number of guesses about the identity of each word in a running text. The use of zero-, first- and second-order transition probabilities, and of different levels of noise in estimating the neighbourhood, are explored. Post-processing (probabilities between words) has also been used to improve performance.
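Since the Viterbi algorithm recurs throughout this survey, a minimal sketch of its recursion for a discrete HMM, in log space, is given below; the model sizes and variable names are illustrative only and do not correspond to any particular system discussed above.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, observations):
    """Most likely state sequence for a discrete HMM.

    log_pi[i]   : log initial probability of state i
    log_A[i, j] : log transition probability from state i to state j
    log_B[i, k] : log probability of emitting symbol k in state i
    """
    n_states = log_pi.shape[0]
    T = len(observations)
    delta = np.full((T, n_states), -np.inf)   # best log score ending in each state
    psi = np.zeros((T, n_states), dtype=int)  # back-pointers

    delta[0] = log_pi + log_B[:, observations[0]]
    for t in range(1, T):
        for j in range(n_states):
            scores = delta[t - 1] + log_A[:, j]
            psi[t, j] = np.argmax(scores)
            delta[t, j] = scores[psi[t, j]] + log_B[j, observations[t]]

    # Backtrack from the best final state.
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return list(reversed(path)), float(delta[-1].max())
```

Baum-Welch training runs over the same trellis but accumulates expected state and transition counts instead of taking maxima.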


2.4 Arabic OCR using HMM

This section discusses the application of HMMs to Arabic OCR. The following trials do not include the application of HMMs to handwritten Arabic words.

Bazzi et al. [BSM99] present an omnifont, unlimited-vocabulary OCR system for English and Arabic that is based on an HMM. They focus on two aspects of the OCR system. First, they address the issue of how to perform OCR on omnifont and multi-style data (such as plain and italic) without the need to have a separate model for each style. The amount of training data from each style that is used to train a single model becomes an important issue in the face of the conditional independence assumption inherent in the use of HMMs, and the paper demonstrates mathematically and empirically how to allocate training data among the different styles to alleviate this problem. Secondly, a method is described which enables a word-based HMM system to perform character recognition with an unlimited vocabulary. This method includes the use of a trigram language model on character sequences. Using all these techniques, they achieved character error rates of 1.1% on data from the University of Washington English Document Image Database, and 3.3% on data from the DARPA Arabic OCR Corpus.

The application of HMMs to Arabic OCR was first attempted by Amin and Mari [AM89]. They used an HMM in the post-processing stage to improve the recognition accuracy, where each word is described by an HMM. As part of a larger project for the transcription of the documents in the Ottoman Archives, Atic et al. [AM89] developed a heuristic method for the segmentation, feature extraction and recognition of the Arabic script.


They developed a geometrical and topological feature analysis method for the segmentation and feature extraction stages. A chain code transformation is applied to the main strokes of the characters, which are classified by the HMM in the recognition stage. Experimental results indicate that the performance of the proposed method is satisfactory, as long as the thinning process does not yield spurious branches.

Makhoul et al. [MLR+96] used a system that depended on the estimation of character models, a lexicon, and a grammar from the training samples. This system was identical to their speech recognition system but replaced speech, phonemes, and phonological rules with scanned images, characters, and orthographic rules, respectively. It also describes each word with a separate HMM, which limited the number of words the system could recognize.

Khorsheed and Clocksin [KC99] present a technique for the off-line recognition of cursive Arabic script based on an HMM in which it is not necessary to segment the word. After pre-processing, the thinned binary image of each word is decomposed into a number of curved edges in a certain order. Each edge is transformed into a feature vector, including features of curvature and length normalized to the stroke thickness. The observation sequence presented to the HMM consists of codes derived from a vector quantization of the feature vectors. The lexicon is represented by a single HMM, where each word is represented by a sequence of states. A modified Viterbi algorithm is used to provide an ordered list of the best paths, indicating candidate transliterations. The HMM was trained using words written in one typeface and one size, and the test samples were written in two different typefaces and in three sizes. Recognition rates ranging from 68% to 73% were achieved depending on the task performed. The system was less affected by distortion and variation than a system that uses the raw pixel data as the observation sequence.


However, it does not suit handwritten Arabic words, because dots (which are important elements of handwritten Arabic characters) are not written exactly below or above each character or edge feature as described in the paper. Also, the result (from 68% to 73%) for printed Arabic words using the Traditional Arabic font is not high.

Dehghan et al. [DF01] presented a holistic system for the recognition of handwritten Farsi/Arabic words using an HMM and a Kohonen self-organizing vector quantization. The image is divided into fixed-width frames, and each frame is divided into five zones, each with four features depending on the contour direction. In this way, each frame is represented as a 20-dimensional feature vector. Given the particular properties of handwritten Arabic writing, it is believed that these features are not enough to achieve a reasonable recognition rate. The recognition rate was 32% without smoothing and 65% with smoothing.

With the exception of Khorsheed [Kh00] and Dehghan et al. [DF01], the above experiments using the HMM approach were tested on printed Arabic text, not on handwritten words.
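The sliding-frame idea used by Dehghan et al. can be illustrated with a simplified sketch that cuts a binary word image into fixed-width frames and describes each frame by the black-pixel density of a few horizontal zones; the frame width, the number of zones, and the use of pixel densities rather than contour-direction features are simplifications introduced here, not the cited feature set.

```python
import numpy as np

def frame_features(binary: np.ndarray, frame_width: int = 8, n_zones: int = 5) -> np.ndarray:
    """Represent a binary word image (text pixels = 1) as a sequence of frame vectors.

    The image is cut into vertical frames of `frame_width` columns; each frame is
    split into `n_zones` horizontal zones and described by the black-pixel density
    of every zone, giving one n_zones-dimensional vector per frame.
    """
    height, width = binary.shape
    zone_edges = np.linspace(0, height, n_zones + 1, dtype=int)
    vectors = []
    for start in range(0, width, frame_width):
        frame = binary[:, start:start + frame_width]
        densities = [frame[zone_edges[z]:zone_edges[z + 1]].mean() for z in range(n_zones)]
        vectors.append(densities)
    return np.asarray(vectors)   # shape: (number of frames, n_zones)
```

Each frame vector can then be vector quantized into a discrete symbol, so that a word becomes the kind of observation sequence an HMM expects.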

2.5 A Survey of Off-line Handwritten Arabic Words Recognition

Research in the field of Arabic character recognition started as early as 1975, when Nazif presented his thesis [Na75]. However, due to a lack of computing power, further significant work was not performed until the 1980s [BS97]. Many papers have been published on the recognition of Latin, Chinese and Japanese characters. However, little research has been conducted on the automatic recognition of Arabic characters, which are used in several widespread languages. This is because of the strongly cursive nature of Arabic writing rules.


In fact, the techniques applied to other languages are not directly applicable to Arabic characters without fundamental modifications. Even less research in the field of handwritten Arabic characters has been published [BWB94].

Amin et al. [AAF96] propose a technique for the recognition of handprinted Arabic characters using neural networks. Firstly, their technique combines rule-based (structural) and classification tests. Secondly, it is more efficient for large complex sets, such as Arabic characters. Thirdly, feature extraction is inexpensive. Finally, the execution time is independent of both the character font and size. The paper describes the neural network method applied in the classification step, the computationally intensive earlier stages being carried out by more classical approaches.

Maddouri and Amiri [MA02] propose a recognition system based on combining global and local vision modelling of the word, developed for Latin word recognition by M. Cote. The drawback of this system is its assumption that diacritical dots are naturally separated, which is not the case with handwritten Arabic, as was shown in Chapter 1. Also, loops are not naturally written in handwritten Arabic, and this leads to a substantial difference in recognition rate. In the same study, the researchers performed one of the experiments using a manual GVM, which proposed a list of possible letters and words containing these characters.

Al-Ohali et al. [ACS02] used an HMM to classify handwritten words used in cheque-filling applications. The authors segmented the training and testing sub-words and characters manually. Geometrical features were used.


2.6 Conclusion

In this chapter, various image-processing methods commonly used in the field of document image analysis and character recognition have been presented. These methods are grouped into the processing categories of data capture, pre-processing, feature extraction, classification and contextual verification (or post-processing). This represents the processing steps used in many document image analysis systems currently in use. Image acquisition describes the process of converting a document into its numerical representation. The pre-processing step for the scanned image can be divided into three sub-sections: binarization, skew detection and segmentation. The features can be fuzzy to define or difficult to extract, so the feature extraction step varies and depends on many factors. The features that result from each image are classified using classification methods. Trials of off-line Arabic handwritten recognition systems were also discussed, and an overview of handwritten character processing has been presented.

A detailed review of handwriting recognition using HMMs was also presented, suggesting that the field of document understanding and, in particular, handwriting recognition using HMMs is undergoing an exciting phase of research and development. Indeed, the HMM has the potential to become one of the most dominant techniques in this field. Since the HMM has been used as one of the main classification techniques in this research, a further review of research done in this area has been given at the end of the chapter. This chapter has reviewed the current state of the field up to the point where the actual text classification begins and has described the trials carried out on Arabic text using HMMs.


In the next chapter, a detailed review of more specific recognition techniques (VQ, HMM and ID3) is presented, and the pre-processing methods used for Arabic writing are discussed. Much of the important research is briefly described to present the current status of off-line handwritten Arabic character recognition research. It is clear that most of this work assumes that the Arabic characters are already segmented, whilst the database and pre-processing systems in this research are built assuming that words have not been segmented into characters.

3: Methodology: Useful Techniques

This chapter discusses the techniques that have been used in the recognition of printed and handwritten Arabic text, and in particular three important tools which are used throughout this work: Vector Quantization (VQ), HMMs and the ID3 classifier. In section 3.2, Vector Quantization (VQ) is discussed, because VQ has been used to recognize segments of words, such as characters or sub-characters, in the HMM classifier of Chapter 6 and the multiple-HMM classifiers of Chapter 7. In section 3.3, HMM techniques are discussed to give an idea of the mathematics behind the systems used in Chapters 6 and 7; in this work HMM techniques have been used to classify Arabic handwritten words. The ID3 tree technique is discussed in section 3.4 and used in Chapter 7 to classify an Arabic handwritten word into a group of words or a single word.
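As a preview of the role VQ plays in the later chapters, the sketch below builds a small codebook with plain k-means and maps each feature vector to the index of its nearest codeword; the codebook size, the random example data and the function names are illustrative assumptions, not the configuration used in this thesis.

```python
import numpy as np

def train_codebook(vectors: np.ndarray, codebook_size: int, iterations: int = 20, seed: int = 0) -> np.ndarray:
    """Build a VQ codebook with plain k-means (Lloyd's algorithm)."""
    rng = np.random.default_rng(seed)
    codebook = vectors[rng.choice(len(vectors), codebook_size, replace=False)]
    for _ in range(iterations):
        # Assign every vector to its nearest codeword.
        distances = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Move each codeword to the mean of its assigned vectors.
        for k in range(codebook_size):
            members = vectors[labels == k]
            if len(members) > 0:
                codebook[k] = members.mean(axis=0)
    return codebook

def quantize(vectors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Replace each feature vector by the index of its nearest codeword."""
    distances = np.linalg.norm(vectors[:, None, :] - codebook[None, :, :], axis=2)
    return distances.argmin(axis=1)

# Illustrative use: 200 random 20-dimensional frame vectors quantized to 16 symbols.
features = np.random.default_rng(1).random((200, 20))
codebook = train_codebook(features, codebook_size=16)
symbols = quantize(features, codebook)   # discrete observation sequence for an HMM
```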


3.1 Off-line Arabic Words Recognition Methods

In the previous chapter, the steps that every OCR system (for any language) includes were mentioned. The main steps that differ when dealing with Arabic writing are segmentation and feature extraction, because of the special characteristics of Arabic writing. In the next sub-sections, some techniques used for the recognition of off-line Arabic writing (handwritten and printed) are discussed.

3.1.1 Feature Extraction Methods

It is known that features represent the smallest set of measurements that can be used for discrimination purposes and for a unique identification of each character. Features can be classified into three categories:

• geometric features (e.g. concave/convex parts, and types of junctions – intersections, T-junctions, endpoints, etc.)

• topological features (connectivity, number of connected components, number of holes, etc.)

• statistical features (Fourier transform, invariant moments, etc.)

Off-line character recognition systems typically use a scanner as the main input device. Off-line recognition can be considered as the most general case where no special device is required for writing [BWB94]. Since this research deals with the off-line recognition of handwriting, some of the field trials to automatically recognize handwritten Arabic writing are summarized below.


Abuhaiba [AHD94] produced a paper that deals with three different problems in the processing of binary images of handwritten text documents. Firstly, an integrated algorithm that finds a straight-line approximation of textual strokes is described. The distance transform of thinned binary images is used to identify the spurious bifurcation points that are unavoidable when thinning algorithms are used, remove them, and recover the original ones. Secondly, a method is presented to recover loops that become blobs due to blotting; as reported, it is not possible to recover such loops with a high rate of success. Finally, a method is developed to extract lines from pages of handwritten text by finding the shortest spanning tree of a graph formed from the set of main strokes. At the end, an ordered list of main strokes is obtained, and each combination of main and secondary strokes is the input to a subsequent recognition stage. The method can deal with variable handwriting styles. A similar stroke extraction method has therefore been used in the pre-processing system of Chapter 5, but with a different thinning algorithm (an improved Zhang and Suen method), following [MIB01], which showed that the Zhang and Suen method gives a better skeleton structure and execution time than other techniques when the resolution is less than 600 dpi.

Almuallim and Yamaguchi [AY87] proposed a structural technique for the recognition of Arabic handwritten words. Their system consists of four phases. The first phase is pre-processing, in which the word is thinned and the middle of the word is calculated. Since it is difficult to segment a cursive word into letters, words are then segmented into separate strokes and classified either as strokes with a loop, strokes without a loop, or complementary characters. These strokes are then further classified using their geometrical and topological properties.


Finally, the relative positions of the classified strokes are examined and the strokes are combined in several steps into the string of characters that represents the recognized word. The system in Almuallim and Yamaguchi's paper [AY87] is too simple to deal with complex Arabic words, since it uses simple geometrical features and a small set of testing words; loops are also difficult to extract from Arabic handwriting.

A look-up table can be used for the recognition of isolated handwritten Arabic characters. In this approach, the character is placed in a frame, which is divided into six rectangles, and a contour-tracing algorithm is used for coding the contour as a set of directional vectors using Freeman coding. However, this information is not sufficient to determine Arabic characters, so extra information related to the number of dots and their position is added. If there is no match, the system adds the feature vector to the table and considers that character as a new entry [SY85]. This method is used to recognize segmented Arabic characters, which is not a realistic case in Arabic writing, since Arabic text is written cursively and, in general, Arabic characters are difficult to segment in real handwritten Arabic words.

Saleh et al. [Sa94] describe an efficient algorithm for coding handwritten Arabic characters. Certain feature points of the skeleton, which are end, branch, and main connection points, are extracted. Primitives are then assigned according to the sequence of ordering and positioning of these points. Isolated sub-patterns (secondaries) within some Arabic characters are treated separately and then related to the principal patterns of the character. The stability and performance of the algorithm have been established by applying it to several patterns of all Arabic characters as well as in an experimental context-free recognizer. Again, this method is not realistic, since Arabic letters are not naturally separated.

A structural approach has also been adopted for recognizing printed Arabic text (Amin and Masini [AM86]).


Words and sub-words are segmented into characters using a baseline technique. Features such as vertical bars are then extracted from the character using horizontal and vertical projections (Figure 3-1). Four decision trees, chosen according to the position of the character within the word as computed in the segmentation process, have been used. The structure of the four decision trees allows a rapid search for the appropriate character. Furthermore, the trees are utilized to distinguish characters that have the same shape but appear in different positions within a word.


Figure 3-1: Vertical and horizontal scanning of the character (a) character (b) horizontal scanning (c) vertical scanning.

Amin and Mari [AM89] proposed a technique for multifont Arabic text that includes character and word recognition. A character is divided into many segments by a horizontal scan process (Figure 3-2). In this way, segments are connected within the basic shape of the character. Segments that are not connected with any other segment are considered to be complementary characters. Using the Freeman Code [Fr68], a contour detection process is applied to these segments to trace the basic shape of the character and generate a directional vector through a 2*2 window. A decision tree is then used for the recognition of the characters.


Finally, a Viterbi algorithm [Fo73] is used for Arabic word recognition to enhance the recognition rate. The main advantage of this technique is that it allows an automatic learning process to be used. The last two approaches used general features for a limited set of printed words; some of the features can be extended for use with handwritten words.

Nouh et al. [NST80] suggested a standard Arabic character set to facilitate the computer processing of Arabic characters. In that work, thirteen features, or radicals, which represent parts of the characters, are selected by inspection. The recognition is based on a decision tree and a strong correlation measurement. The disadvantage of the proposed system is the assumption that the incoming characters are generated according to specified rules.

Figure 3-2: Major segments of character

Parhami and Taraghi [PT81] presented a technique for the automatic recognition of machine-printed Farsi text (which is similar to Arabic text). The authors first segment the sub-word into characters by identifying a series of potential connection points on the baseline at which the line thickness changes from or to the thickness of the baseline. Although they also have some rules to keep characters at the end of a sub-word intact, they segment some of the wider characters (e.g. * ) into up to three segments.


They then select twenty features based on certain geometric properties of the Farsi symbols to construct a 24-bit vector, which is compared with entries in a table where an exact match is checked first. The system is heavily font dependent, and the segmentation process is expected to give incorrect results in some cases.

The study reported in [ERK90, AU92] utilizes descriptors to recognize the characters. Other techniques include a set of Fourier descriptors computed from the coordinate sequences of the outer contour, which is used for recognition [EG88]. Also, Nouh [NU87] assigns each character a logical function, where characters are re-classified into four groups depending on the existence of certain pixels in specified locations of the image. These papers examined segmented printed Arabic characters, and a similar pixel distribution feature can be used for the recognition of handwritten words, as is the case for the feature (section 5.3) used in this research.

To enhance the recognition rate of an OCR system, Taylor [Ta00] describes a family of lexical analyzers and text measurement tools. The tools are used to tag verbs, search for roots, and discover morpheme frequencies in Arabic text. The morpheme frequencies can be used to construct relative figures of merit for alternative lexical analyses of an ambiguous word.

Amin and Al-Sadoun [AA94] proposed a structural approach for recognizing handprinted Arabic characters (although handprinting is not a realistic case for Arabic writing). The binary image of the character is first thinned using a parallel thinning algorithm, and then the skeleton of the image is traced from right to left using a 3*3 window in order to build a graph representing the character. Features such as straight lines, curves and loops are then extracted from the graph. Finally, a hierarchical classification (similar to a decision tree) is used for the recognition of the characters.


Obaid [Ob94] introduced Arabic handwritten character recognition by neural networks, using the traditional Multi-Layer Perceptron (MLP) with its back-propagation learning algorithm to classify handwritten Arabic characters. Since Arabic script is cursive, he assumes that the characters are already segmented and presents them to the network. When the network is trained, the output layer, in response to a familiar input pattern or one which resembles a familiar pattern, activates the neuron corresponding to that character classification [Ob94].

Al-Badr and Haralick [AH95] proposed a system to recognize machine-printed Arabic words without prior segmentation by applying mathematical morphology operations on the whole page to find the locations where shape primitives are present. They then combine those primitives into characters and output the character identities and their locations on the page. The advantage of that work is that it optimized the recognition of the symbols with respect to the whole word, without committing itself to a particular segmentation of the word into symbols. In this work, a segmentation-free approach has also been tested to recognize handwritten Arabic words using the ID3 tree (see sections 3.4, 7.1, and 8.5).

Finally, El-Khaly and Sid-Ahmed [ES90] used moment-invariant descriptors to recognize isolated and connected printed Arabic characters. They obtained a 100% recognition rate for isolated printed Arabic characters in one font type, and a 95% recognition rate for characters from connected printed text. In this work, another set of moments was used to recognize handwritten Arabic words, as described in Chapter 5, sub-section 5.3.8.
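For the moment-based family of features mentioned here, and used again in Chapter 5, the following is a generic sketch that computes translation- and scale-normalized central moments of a binary character image; it is not the specific moment set used by [ES90] or in this research.

```python
import numpy as np

def normalized_central_moments(binary: np.ndarray, max_order: int = 3) -> dict:
    """Normalized central moments eta_pq of a binary image (object pixels = 1).

    eta_pq = mu_pq / mu_00**(1 + (p + q) / 2), which is invariant to translation
    and to scale; rotation invariants (e.g. Hu moments) can be built from these.
    """
    ys, xs = np.nonzero(binary)
    m00 = float(len(xs))                      # zeroth-order moment (object area)
    x_bar, y_bar = xs.mean(), ys.mean()       # centroid
    eta = {}
    for p in range(max_order + 1):
        for q in range(max_order + 1):
            if 2 <= p + q <= max_order:
                mu_pq = ((xs - x_bar) ** p * (ys - y_bar) ** q).sum()
                eta[(p, q)] = mu_pq / m00 ** (1 + (p + q) / 2)
    return eta
```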


3.1.2 Segmentation Methods

Two techniques have been applied for segmenting machine-printed and handwritten Arabic words into individual characters: implicit and explicit segmentation. Implicit segmentation (straight segmentation) is usually designed with rules that attempt to identify all the character segmentation points in order to segment the words directly into letters. In explicit segmentation, words are externally segmented into pseudo-letters, which are individually recognized.

In all Arabic characters, the width at a connection point is much less than the width at the beginning of the character. This property is essential in applying the baseline segmentation technique [AM86, AM89]. The baseline is a medium line in the Arabic word on which all the connections between successive characters take place. A vertical projection of the bi-level pixels is performed on the word:

v(j) = \sum_{i} w(i, j)        (Eq. 3-1)

where w(i, j) is either zero or one, and i, j index the rows and columns, respectively. The connectivity point will have a sum less than the average value (AV):

AV = \frac{1}{N_c} \sum_{j=1}^{N_c} X_j        (Eq. 3-2)

where N_c is the number of columns and X_j is the number of black pixels in the jth column.


Hence, each part with a sum value much less than AV should be a boundary between different characters. However, if the histogram produced by the vertical projection does not satisfy the condition of Eq. 3-3, the character remains unsegmented, as illustrated in Figure 3-3. By examining typewritten Arabic characters, it is found that the distance between successive peaks in the histogram (Figure 3-3) does not exceed one-third of the width of an Arabic character. That is:

|d_k| \le \frac{1}{3} W        (Eq. 3-3)

where d_k is the distance between successive peaks and W is the width of an Arabic character; a second threshold of 1.5 L_k is used in Eq. 3-4. Ambiguous cases are then resolved by means of some features, such as the existence of a maximum or minimum in either the horizontal or vertical direction of the main stroke, the ratio between length and width, the type of secondary stroke, and other features.

Al-Emami and Usher [AU90] presented a system for the on-line recognition of handwritten Arabic words. Words were entered via a graphics tablet and segmented into strokes based on the method proposed by Belaid et al. [BM83, AA92]. In the preliminary learning process, specifications of the strokes of each character are fed to the system, while in the recognition process, the parameters of each stroke are found and special rules are applied to select the collection of strokes that best matches the features of a stored character. However, few words were used in the learning and testing process, which makes the performance of the system questionable. This approach depends heavily on a predefined threshold value relating to the character width. Moreover, this approach will not work effectively on skewed images.
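A minimal sketch of the vertical-projection idea of Eqs. 3-1 and 3-2 is given below: columns whose projection falls well below the average value AV are treated as candidate connection regions, and the runs of columns between them as candidate characters. The threshold fraction of AV and the minimum segment width are assumptions added for illustration, not values used in the cited work or in this thesis.

```python
import numpy as np

def projection_segments(binary: np.ndarray, ratio: float = 0.25, min_width: int = 3) -> list:
    """Candidate character segments of a binary word image (text pixels = 1).

    v(j) is the vertical projection of column j (Eq. 3-1) and AV its average over
    all columns (Eq. 3-2); columns with v(j) < ratio * AV are treated as connection
    regions, and the runs of columns between them as segments.
    """
    v = binary.sum(axis=0)          # Eq. 3-1: column sums
    av = v.mean()                   # Eq. 3-2: average value
    is_text = v >= ratio * av       # True where a character body is present
    segments, start = [], None
    for j, text_col in enumerate(is_text):
        if text_col and start is None:
            start = j
        elif not text_col and start is not None:
            if j - start >= min_width:
                segments.append((start, j))
            start = None
    if start is not None and len(is_text) - start >= min_width:
        segments.append((start, len(is_text)))
    return segments                 # list of (first_column, last_column_exclusive)
```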


Figure 3-4: An example of the Arabic word ]i \ U and its segmentation into characters: (a) Arabic word (b) histogram (c) word segmented into characters

Segmentation is also achieved by tracing the outer contour [EG88] of a given word and calculating the distance between the extreme points of intersection of the contour with a vertical line. The segmentation is based on a horizontal scan, from right to left, of the closed contour using a window of adjustable width (w). For each position of the window, the average vertical distance (hav) is calculated across the window. At the boundary between two characters, the following conditions should be met:

• hav < T. In this case, a silence region is detected, which means that the average vertical distance over the window is less than a certain preset threshold.

• Detected boundaries should lie on the same horizontal line (the baseline).

• No complementary character should be located either above or below the baseline at a silence region.


Re-adjustment of parameters w and T, as well as backtracking, may occur if segmentation leads to a rejected character shape. Figure 3-5 illustrates some examples of this method.

Figure 3-5: Segmented Arabic word and the corresponding contour heights, for words (a) Mahal and (b) Alalamy

El-Khaly and Sid-Ahmed [ES90] segment a thinned word into characters by following the average baseline of the word and detecting when the pixels start going above or below the baseline. Abdelazim and Hashish [AH88] used the technique of an energy curve (similar to that used in speech recognition which discriminates the spoken


utterance from the silence background), to show the number of black pixels in each column of the digitized word, and hence to segment the word into characters. This curve is traversed and a threshold value is used to select significant primitives, leaving out silent zones. Shoukry [SS91] used a sequential algorithm based on the input-time tracing principle, which depends on the connectivity properties of the acquired text in the binary image domain. This algorithm bears some resemblance to an algorithm devised by Wakayama [Wa82] for the skeletonization of binary pictures. The SARAT system [Ma92] used the outer contour to segment an Arabic word into characters. The word was divided into a series of curves by determining the start and endpoints of the word. Whenever the outer contour changed sign (from a positive to negative curvature) the character was segmented, as illustrated in Figure 3-6. Kurdy and Joukhadar [KJ92] use the upper distance of the sub-word to segment printed Arabic words, which is the set of the highest points in each column. They assign each point of the function a token name by comparing the height of the point to the height and the token name of the point on its right. Using a grammar, they then parse the sequence of tokens of a subword to find the connection points.


Figure 3-6: An example of a segmented sub-word, with start point A, endpoint E, and horizontal lines 2-3 and 5-6

Finally, Amin and Al-Sadoum [AA92, AA95] adopted a new technique for segmenting Arabic text. The algorithm can be applied to any font and it permits the overlay of characters. There are two major problems with the traditional segmentation method, which refers to the baseline. Firstly, overlapping of adjacent Arabic characters occurs naturally (see Figure 3-7a), hence no baseline exists, a common phenomenon in both typed and handwritten Arabic text. Secondly, the connection between the two characters is often short. Therefore, placing the segmentation points is a difficult task. In many cases, the potential segmentation points will be placed within, rather than between, characters. The word in Figure 3-7(a) is segmented utilizing a baseline technique. Figure 3-7(b) shows the proper segmentation, and the result of the new segmentation method is shown in Figure 3-7(c). Their technique can be divided into four major steps. In the first, the original image is transformed into a binary image utilizing a scanner (300 dpi). Secondly, in the preprocessing step, the Arabic word is thinned using a parallel thinning algorithm. Then, the skeleton of the image is traced from right to left using a 3*3 window, a binary tree is constructed and the Freeman Code [Fr68] is


used to describe the skeleton shape. Finally, the binary tree is segmented into sub-trees, each tree describing a character in the image. The system was tested on a small set of words. The drawback of this recognition system is that the tree becomes too complicated when more words have to be recognized and segmented. For segmentation in this thesis, a new histogram calculation was used, which combines additional pre-processing operations with the histogram calculation to suit handwritten Arabic words.

Figure 3-7: Example of an Arabic word and different segmentation techniques


3.1.3 Recognition Methods

Surveys on Arabic recognition can be found in [Kh02, Al99, AM95]. There are three main strategies which have been applied to printed and handwritten Arabic character recognition, as described in section 1.4. These can be categorized as the holistic approach, the analysis approach and feature sequence matching. Using the holistic approach, recognition is performed globally on the whole representation of the word with no attempt to identify characters individually. With the analysis approach, recognition is not directly performed at word level but at an intermediate level dealing with units or segments. In this strategy the words are not considered as a whole, but as sequences of small-size units or segments. Feature sequence matching uses methods based on a probabilistic framework, the HMM. Chapter 2, section 2.4 contains the previous trials done on Arabic recognition using HMMs. Amin [Am00] used a global method to recognize printed Arabic words using machine learning to generate a decision tree. The algorithm resulted in a 92% recognition rate. In this thesis the holistic approach, feature sequence matching, and a combination of them were used to recognize Arabic words (see the system described in Chapters 6 and 7).


3.2 Vector Quantization

One application of distance measures that has been important in automatic object and speech recognition is known as Vector Quantization (VQ). VQ is a data reduction method, which means it seeks to reduce the number of dimensions in the input data so that the models used to match unknown characters or segments are as simple as possible. VQ reduces dimensionality quite drastically since it encodes each vector as a single number [Sc03], an index into a table of prototype vectors known as the codebook [GN90]. In earlier days, the design of a vector quantizer (VQ) was considered to be a challenging problem due to the need for multi-dimensional integration. In 1980, Linde, Buzo, and Gray (LBG) proposed a VQ design algorithm based on a training sequence. The use of a training sequence bypasses the need for multi-dimensional integration. VQs that are designed using this algorithm are referred to as LBG-VQ [Ph03].

3.2.1 VQ Mathematic Definition

The VQ design problem can be stated as follows. Given a vector source with known statistical properties, a distortion measure, and the number of codevectors, find a codebook and a partition which result in the smallest average distortion. Assume that there is a training sequence consisting of M source vectors:

T = \{ x_1, x_2, \ldots, x_M \}        Eq. 3-5

This training sequence can be obtained from some large database. For example, if the source is a speech signal, then the training sequence can be obtained by recording several long telephone conversations. M is assumed to be sufficiently large so that all the statistical properties of the source are captured by the training sequence. It is assumed that the source vectors are k-dimensional, e.g.,

x_m = (x_{m,1}, x_{m,2}, \ldots, x_{m,k}), \quad m = 1, 2, \ldots, M        Eq. 3-6

Let N be the number of codevectors and let

C = \{ c_1, c_2, \ldots, c_N \}        Eq. 3-7

represent the codebook. Each codevector is k-dimensional, e.g.,

c_n = (c_{n,1}, c_{n,2}, \ldots, c_{n,k}), \quad n = 1, 2, \ldots, N        Eq. 3-8

Let Sn be the encoding region associated with codevector cn, and let

P = \{ S_1, S_2, \ldots, S_N \}        Eq. 3-9

denote the partition of the space. If the source vector xm is in the encoding region Sn, then its approximation (denoted by Q(xm)) is cn:

Q(x_m) = c_n \quad \text{if } x_m \in S_n        Eq. 3-10

Assuming a squared-error distortion measure, the average distortion is given by:

D_{ave} = \frac{1}{Mk} \sum_{m=1}^{M} \| x_m - Q(x_m) \|^2        Eq. 3-11

where \|e\|^2 = e_1^2 + e_2^2 + \cdots + e_k^2. The design problem can be succinctly stated as follows: given T and N, find C and P such that Dave is minimized.

3.2.2 Optimality Criteria

If C and P are a solution to the above minimization problem, then they must satisfy the following two criteria.

3.2.2.1 Nearest Neighbour Condition

This condition says that the encoding region Sn should consist of all vectors that are closer to cn than to any of the other codevectors:

S_n = \{ x : \| x - c_n \|^2 \le \| x - c_{n'} \|^2 \ \ \forall n' = 1, 2, \ldots, N \}        Eq. 3-12

3.2.2.2 Centroid Condition

This condition says that the codevector cn should be the average of all the training vectors that lie in encoding region Sn:

c_n = \frac{\sum_{x_m \in S_n} x_m}{\sum_{x_m \in S_n} 1}, \quad n = 1, 2, \ldots, N        Eq. 3-13

In implementation, one should ensure that at least one training vector belongs to each encoding region, so that the denominator in the above equation is never zero.
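The two conditions above suggest the LBG-style design procedure: alternately re-assign training vectors to their nearest codevector and re-compute each codevector as the centroid of its region. The following sketch illustrates this alternation; it is an illustration only, not the VQ implementation used later in this thesis, and the random initialisation and fixed iteration count are assumptions.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch of LBG-style codebook design (not the thesis implementation):
// alternate the nearest neighbour condition (Eq. 3-12) and the centroid condition (Eq. 3-13).
public class LbgVq {

    static double dist2(double[] a, double[] b) {
        double d = 0.0;
        for (int i = 0; i < a.length; i++) { double e = a[i] - b[i]; d += e * e; }
        return d;                                    // squared-error distortion
    }

    static double[][] design(double[][] training, int n, int iterations, long seed) {
        int k = training[0].length;
        Random rnd = new Random(seed);
        double[][] codebook = new double[n][];
        for (int c = 0; c < n; c++)                  // initialise from random training vectors
            codebook[c] = training[rnd.nextInt(training.length)].clone();

        for (int it = 0; it < iterations; it++) {
            double[][] sums = new double[n][k];
            int[] counts = new int[n];
            for (double[] x : training) {            // nearest neighbour condition (Eq. 3-12)
                int best = 0;
                for (int c = 1; c < n; c++)
                    if (dist2(x, codebook[c]) < dist2(x, codebook[best])) best = c;
                counts[best]++;
                for (int d = 0; d < k; d++) sums[best][d] += x[d];
            }
            for (int c = 0; c < n; c++) {            // centroid condition (Eq. 3-13)
                if (counts[c] == 0) {                // keep every encoding region non-empty
                    codebook[c] = training[rnd.nextInt(training.length)].clone();
                } else {
                    for (int d = 0; d < k; d++) codebook[c][d] = sums[c][d] / counts[c];
                }
            }
        }
        return codebook;
    }

    public static void main(String[] args) {
        double[][] training = { {0, 0}, {0, 1}, {5, 5}, {6, 5}, {9, 9}, {10, 9} };
        System.out.println(Arrays.deepToString(design(training, 3, 20, 42L)));
    }
}
```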

3.3 Hidden Markov Model (HMM)

During the last decade, HMMs have become the predominant approach to automatic speech recognition. The success of HMMs in speech recognition has recently led many researchers to apply them to handwriting recognition by representing each word image as a sequence of observations [Ca01, BWB94]. Historically, the HMM has been used in text recognition as early

as 1980, e.g. Cave and Neuwirth [CN80], who analyzed machine-printed text using HMM. For on-line recognition of handwriting, the HMM was first used in [NWF86], where the approach followed the basic HMM scheme: each word is modelled by an HMM, like the one used in the recognition of isolated digits of speech. The success of these attempts, however, was limited by constrained experiments and the problems of a single writer, a small and fixed vocabulary, small test samples, etc. The application of HMMs to the more general problem of handwriting recognition, involving large dictionaries, off-line data, unconstrained style, etc., was introduced in [BWB94, KHB98, JSW90]. The basic problems of handwriting recognition are common to all languages, but the special features and constraints of each language, for example, need to be considered as well. For example, the large set of Chinese characters and the complicated combination of strokes make the recognition task difficult. In [JSW90], projection profiles are first obtained from each Chinese character image. The HMM is then used to model the sequence of the histogram of projection profiles. To counter the serious loss of stroke information after the projection of the character image [PL93], the Regional Projection Contour Transform (RPCT) is proposed to transform the character image into the contour of four feature maps. As the pattern transformed by RPCT has only one outer contour and does not contain any internal contour, the HMMs used for 2-D planar shape recognition, such as the one proposed in [HK91], are directly applicable. Another interesting application of HMM to the off-line HWR problem was recently introduced in [BR95]. Besides handwriting recognition, HMMs have also been used to analyze document images. Vlontzos and Kung have proposed a multilevel structure of HMMs for the recognition of machine- or hand-printed text [VK92]. Kuo and Agazzi have successfully spotted key words in a poorly printed


document image using the pseudo 2-D HMM [KA93], where the word image is modelled by a hierarchical structure composed of vertical and horizontal HMMs. A complete scheme for encoding the printed document using HMMs, from the pixel level to characters, words, and whole documents, is proposed by Kopec and Chou [KC94].

3.3.1 Implementation Strategies

In practical pattern recognition problems, there are three ways of building the HMM:

1. Model Discriminant HMM (MD-HMM): the patterns are classified by different models. For each class of pattern, one or more HMMs are built. Given an observed pattern, the path probability against each model is calculated, and the pattern is assigned to the class whose model leads to the maximum path probability. This kind of model has been highly successful for speech recognition, especially for the recognition of isolated words [Ra89].

2. One Path Discriminant HMM (PD-HMM): a single HMM is built for all the classes, and different paths through it are used to distinguish one class from the others [LHR89]. A comparison of the two strategies is shown in Table 3-1.

3. Combination of PD-HMM and MD-HMM: this composite approach has been successfully used [LHR89].

From the review of papers on HMMs used to recognize handwritten and printed words (see section 2.3), one can see that there has been little experimentation with the HMM approach to Arabic writing recognition. Until this thesis, there was no implementation of HMMs on Arabic handwritten text written by different writers.


In this research HMMs have been used in a new way. In Arabic handwriting, dots are important in character recognition, but writers do not always place dots exactly on each character. So dots will be omitted in the recognition of words using the HMM, while dots and more general features will be used in the global recognition of Arabic words for lexicon reduction, before recognition by the HMM.

Table 3-1: A comparison between PD-HMM and MD-HMM strategies

Memory and dictionary size. MD-HMM: a reasonable approach for vocabularies up to a few hundred words. PD-HMM: likely to be independent of dictionary size, as one HMM is built for the whole dictionary.

Accuracy. MD-HMM: often performs better than the PD-HMM, since many modelling constraints can be easily implemented in the MD-HMM approach. PD-HMM: less accurate recognition results [LHR89], since the states are usually transparent during training and are semantically meaningful.

Portability. MD-HMM: poor portability (the ability to adapt as the dictionaries change). PD-HMM: better portability, since the only need for changing a dictionary is to recompute the transition probabilities.

3.3.2 HMM Theory

Markov Dependency: Assuming that the occurrence of a state depends only on its immediately preceding state, as in a first-order Markov chain, the joint probability P(Q) of a sequence of states Q = {q1, ..., qT} can be defined as:

P(Q) = P(q_1) P(q_2 \mid q_1) P(q_3 \mid q_2) \cdots P(q_T \mid q_{T-1})        Eq. 3-14

Similarly, if we assume that qt depends only on the n immediately preceding states, n-th order Markov chains can also be defined.

HMM: An HMM is a doubly stochastic process with an underlying Markov process that is not observable (the states are hidden). It can only be observed through another set of stochastic processes, which are produced by the Markov process (the observations are probabilistic functions of the states). Let us assume that a sequence of observations O = (o1, ..., oT) is produced by the state sequence Q = (q1, ..., qT), where each observation ot is from the set of M observation symbols V = {vk ; 1 <= k <= M} and each state qt is from the set of N states S = {si ; 1 <= i <= N}. Thus, an HMM can be characterized by:

Π = {πi}, where πi = P(q1 = si) is the initial state probability;
Α = {aij}, where aij = P(qt+1 = sj | qt = si) is the state transition probability;
Γ = {γj}, where γj = P(qT = sj) is the last state probability;
Β = {bj(k)}, where bj(k) = P(ot = vk | qt = sj) is the symbol probability;

and they satisfy the probability constraints:

\sum_{i=1}^{N} \pi_i = 1        Eq. 3-15

\sum_{j=1}^{N} a_{ij} = 1 \quad \forall i        Eq. 3-16

\sum_{j=1}^{N} \gamma_j = 1        Eq. 3-17

\sum_{k=1}^{M} b_j(k) = 1 \quad \forall j        Eq. 3-18

We will denote the HMM by the compact notation λ = {Π, Α, Γ, Β}. Chapter 6 contains a discussion of the implementation of HMMs on handwritten Arabic words.

3.3.2.1 Scoring Problem

Given an observation sequence O = o1, ..., oT and a model λ = {Π, Α, Γ, Β}, how can one find P(O | λ)? This is the scoring problem. One can find P(O | λ) by the forward algorithm [DHP00]. The forward variable αt(j) is the probability of being in state sj at time t having generated the partial observation sequence Ot = o1, ..., ot; it can be computed iteratively as:

\alpha_t(j) = \begin{cases} \pi_j\, b_j(o_1), & t = 1 \\ \left[ \sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij} \right] b_j(o_t), & \text{otherwise} \end{cases}        Eq. 3-19

Then it can be shown that:

P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)\, \gamma_i        Eq. 3-20
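A direct implementation of Eqs. 3-19 and 3-20 is shown below as an illustrative sketch, not the code used in this thesis. The observation sequence is given as symbol indices, and the final-state probabilities γ are applied at termination as in Eq. 3-20.

```java
// Illustrative sketch of the forward algorithm (Eqs. 3-19 and 3-20), not the thesis code.
// pi, a, b and gamma are the HMM parameters; obs holds observation symbol indices.
public class ForwardAlgorithm {

    static double score(double[] pi, double[][] a, double[][] b, double[] gamma, int[] obs) {
        int n = pi.length, t = obs.length;
        double[][] alpha = new double[t][n];
        for (int j = 0; j < n; j++)                        // initialisation (t = 1)
            alpha[0][j] = pi[j] * b[j][obs[0]];
        for (int s = 1; s < t; s++) {                      // induction (Eq. 3-19)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int i = 0; i < n; i++) sum += alpha[s - 1][i] * a[i][j];
                alpha[s][j] = sum * b[j][obs[s]];
            }
        }
        double p = 0.0;                                    // termination (Eq. 3-20)
        for (int i = 0; i < n; i++) p += alpha[t - 1][i] * gamma[i];
        return p;
    }

    public static void main(String[] args) {
        double[] pi = {0.6, 0.4};
        double[][] a = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] b = {{0.5, 0.5}, {0.1, 0.9}};
        double[] gamma = {0.5, 0.5};
        System.out.println(score(pi, a, b, gamma, new int[]{0, 1, 1}));
    }
}
```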


3.3.2.2 Training Problem

Given the training sequence O = o1, ..., oT, the training problem is to adjust the model parameters λ = {Π, Α, Γ, Β} such that P(O | λ) is maximized. The Baum-Welch algorithm will be used, with Maximum Likelihood (ML) as the optimization criterion. In general, HMMs can be trained by the Baum-Welch algorithm with satisfactory performance [BWB94].

3.3.2.3 Recognition Phase

The modified Viterbi algorithm (MVA) can solve the recognition problem. Given the model parameters λ = {Π, Α, Γ, Β} and a test sequence O = o1, ..., oT, we want to find the optimal state sequence:

Q^{*} = \arg\max_{Q} P(Q \mid O, \lambda) = \arg\max_{Q} P(Q, O \mid \lambda)        Eq. 3-21
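For illustration, a plain Viterbi decoder corresponding to Eq. 3-21 is sketched below. The modified Viterbi algorithm used in this thesis is not reproduced here; this is only the standard dynamic-programming form.

```java
// Illustrative sketch of Viterbi decoding for Eq. 3-21 (not the modified Viterbi
// algorithm used in the thesis): recover the most likely state sequence.
public class ViterbiDecoder {

    static int[] decode(double[] pi, double[][] a, double[][] b, int[] obs) {
        int n = pi.length, t = obs.length;
        double[][] delta = new double[t][n];
        int[][] psi = new int[t][n];
        for (int j = 0; j < n; j++) delta[0][j] = pi[j] * b[j][obs[0]];
        for (int s = 1; s < t; s++) {
            for (int j = 0; j < n; j++) {
                int best = 0;
                double bestVal = delta[s - 1][0] * a[0][j];
                for (int i = 1; i < n; i++) {
                    double val = delta[s - 1][i] * a[i][j];
                    if (val > bestVal) { bestVal = val; best = i; }
                }
                delta[s][j] = bestVal * b[j][obs[s]];
                psi[s][j] = best;                     // remember the best predecessor
            }
        }
        int[] path = new int[t];
        for (int j = 1; j < n; j++)                   // choose the best final state
            if (delta[t - 1][j] > delta[t - 1][path[t - 1]]) path[t - 1] = j;
        for (int s = t - 1; s > 0; s--) path[s - 1] = psi[s][path[s]];
        return path;
    }

    public static void main(String[] args) {
        double[] pi = {0.6, 0.4};
        double[][] a = {{0.7, 0.3}, {0.4, 0.6}};
        double[][] b = {{0.5, 0.5}, {0.1, 0.9}};
        System.out.println(java.util.Arrays.toString(decode(pi, a, b, new int[]{0, 1, 1})));
    }
}
```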

3.3.2.4 Post-processing

The post-processing operation is used if a PD-HMM system is implemented, because the output of the PD-HMM system is not guaranteed to be a legitimate word from the given dictionary.

3.4 ID3 Classifier

A decision tree is constructed by looking for regularities in data. ID3 (Induction of Decision Trees) is particularly interesting for its representation of learned knowledge, approach to management of complexity, heuristics for selecting candidate concepts, and its potential for


handling noisy data. ID3 represents concepts as decision trees, a representation that allows us to determine the classification of an object by testing its values for certain properties [WF00]. The learnt decision tree should capture relevant relationships between attributes' values and class information. In addition, such systems typically use information-based heuristics to bias their learning towards shallower trees. ID3 was developed by Quinlan [Qui79, Qui86] and is perhaps the most commonly used ML algorithm in scientific literature and commercial systems. A quick introduction is given in [LS93, HS94]. In conclusion, ID3 is an algorithm which has high classification accuracy (even in noisy data sets), a fast learning phase, and low time complexity. ID3 must be supplied with the entire training set at once, but variations with incremental learning exist. The decision tree resulting from ID3 is not easy for humans to interpret when large amounts of data are used [So96]. To perform induction, start with a set of objects (training examples) C. Choose an attribute A as the root node. Create as many children as the number of values for A. If all the objects in C belong to the same class, stop with a leaf labelled with that class name. Otherwise, distribute the objects in C among the children nodes according to their value for A. Iterate the process for each child. The main issue is which attribute to split on at each iteration. Since many DTs exist that correctly classify the training set, a smaller one is usually preferred (Occam's Razor) [Qui79]. However, finding the smallest DT is NP-complete, so we need a heuristic. The paper by Quinlan [Qui79] advocates the use of information theory: the quantity of information I is a real number between 0 and 1 that characterizes the amount of uncertainty in a set C w.r.t. class membership. I=0 if all objects in C belong to the same


class. I=1 if they are evenly distributed between two classes. At each iteration, the heuristic minimizes I. ID3 uses all training examples at each step in the search to make statistically-based decisions regarding how to refine its current hypothesis. This contrasts with methods that make decisions incrementally, based on individual training examples (e.g. version space candidate-elimination). One advantage of using statistical properties of all the examples is that the resulting search is much less sensitive to errors in individual training examples. ID3 can be easily extended to handle noisy training data by modifying its termination criterion to accept hypotheses that imperfectly fit the training data [Ha9].
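The information measure I described above can be computed from the class counts of a set C. The following sketch is an illustration, not Quinlan's implementation: it shows the standard entropy calculation, which gives 0 for a pure set and 1 for a set split evenly between two classes.

```java
// Illustrative sketch (not Quinlan's original code): the information measure I used
// by ID3, computed from class counts in a set C; classCounts is a hypothetical input.
public class Id3Information {

    static double information(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double info = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue;                    // 0 * log(0) is taken as 0
            double p = (double) c / total;
            info -= p * (Math.log(p) / Math.log(2)); // entropy in bits
        }
        return info;
    }

    public static void main(String[] args) {
        System.out.println(information(new int[]{10, 0}));  // 0.0: a pure set
        System.out.println(information(new int[]{5, 5}));   // 1.0: evenly split between two classes
    }
}
```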

3.5 Conclusion

In this chapter some of the important techniques used in this thesis have been discussed. Since this thesis deals with Arabic handwriting, a general review of the techniques that have been used for feature extraction and pre-processing steps in Arabic writing is given. This chapter clarifies that segmentation by histogram calculation is still one of the most commonly used techniques in Arabic writing and that features such as dots are useful in Arabic writing. This chapter also presented the techniques of the VQ, HMM and ID3 classifiers. Also, there was a discussion of the mathematics used in the HMM which was implemented in the system this thesis describes. A new database for off-line Arabic handwriting recognition is discussed in the next chapter. Applications which usually include some pattern recognition require the use of large sets of data. Since there are few Arabic databases available, none of which are a reasonable size or scope, this research built


the AHDB database in order to facilitate the training and testing of systems that are able to recognize unconstrained handwritten Arabic text [AHE02a] [AHE03a].

4.1 A New Arabic Handwritten Database (AHDB)

This chapter presents a new database for off-line Arabic handwriting recognition. A new database for the collection, storage and retrieval of Arabic handwritten text (AHDB) has been developed. This supersedes previous databases both in terms of the size and the number of different writers involved. In this chapter the most popular words in Arabic writing are identified for the first time, using an associated program. An off-line handwriting character recognition system is required to perform the automatic transcription of text where only an image of the script is available. Much work has been done on the recognition of Latin characters, covering both the cases of separated (hand-printed) characters and cursive script. Much less research has been undertaken on the task of recognizing Arabic script. The results reported are also applicable to the recognition of


handwritten text in languages such as Farsi, Kurd, Persian, and Urdu, which also use Arabic characters in writing, but differ in pronunciation. Previous research in this area includes work carried out by Abuhaiba et al. [AMG94], who dealt with some problems in the processing of binary images of handwritten text documents. In this database, the stages of designing, storing, and retrieving information have been considered, as well as the pre-processing of off-line handwritten Arabic words. Successful off-line Arabic character recognition is likely to be a complex process involving many steps that are interdependent and may need to be undone using backtracking algorithms. It is crucial to have a suitable representational scheme to underpin the research. In this chapter the first organized database for Arabic handwritten text and words is described. A significant aspect of handwriting recognition in domains such as bank cheques [FEB+00] and postal address recognition [CKZ94] is that there is no control over the author, writing instrument, or writing style. For example, an arbitrary handwritten word might be produced by a felt pen and could include isolated, touching, overlapping characters, cursive fragments, or fully cursive words [CK94]. However, these difficulties are offset by the constraint that input words come from a relatively small fixed vocabulary.


Figure 4-1: One form filled in by one writer

A standard database of images is needed to facilitate research in handwritten text recognition. A number of existing databases for English off-line handwriting recognition are summarized by [Su90, Na92], and also


by [MB99, Zi02]. For machine-printed Arabic, the Environmental Research Institute of Michigan (ERIM) has created a database of machine-printed Arabic documents. These images are extracted from typewritten and typeset Arabic books and magazines [Sc02]. Applications, which usually include some pattern recognition, require the use of large sets of data. Since there are few Arabic databases available, and none of a reasonable size and scope, the AHDB database was built in order to facilitate the training and testing of systems that are able to recognize unconstrained handwritten Arabic text. In Figure 4-1 one can see an example of a form filled in by one writer. There are different approaches to form dropout, some using separated cleaning steps, whilst others use combined cleaning methods for both foreground and background [DI97]. The three most common approaches to form dropout, symbolic subtraction of an image, colour filtering, and thresholding, are described by [CD97]. The approach proposed in this chapter is dropout by colour filtering using hardware (optical filtering), which is faster than the other techniques and more accurate than dropout by symbolic subtraction. Sections 4.4 and 4.5 will discuss how the AHDB is stored and sorted into separate directories for simpler data retrieval. The database created in this research contains Arabic words and text written by 100 different writers. The following sections describe the steps involved in constructing this database. As the AHDB contains the most popular written Arabic words, the next section will discuss in detail how they were identified.

4.2 Arabic Word Counting

The aim of this step was to find the most popular words in Arabic writing. First, Arabic texts differing in context and subject matter were copied from several sites on the Internet. Then a program was written to count the


repeated words in the text files, which contained more than 30,000 different words. Finally, the words were totalled and sorted using a Microsoft Excel worksheet. From the test experiment, the twenty most used words in written Arabic were identified (for the first time), sorted, and are illustrated, along with their English meanings, in Table 4-1. From the table it can be seen that the most popular words in Arabic writing are different from those in English. For example, in English the most popular word is "the", whereas in Arabic it is the word meaning "in". The most popular words have been added to the AHDB to

be used as a testbed database for researchers as in the case for English and other languages. It should be clear that this work has never previously been done for Arabic writing.
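A word-frequency count of the kind described above can be sketched as follows. This is an illustration only; the original counting program and its corpus files are not reproduced, and the file name used here is a placeholder.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the word-counting step (not the original program; the
// corpus file name is a placeholder).
public class WordCounter {

    public static void main(String[] args) throws IOException {
        String text = new String(Files.readAllBytes(Paths.get("arabic_corpus.txt")), "UTF-8");
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.split("\\s+")) {           // split on whitespace
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);            // accumulate word frequencies
        }
        counts.entrySet().stream()
              .sorted((x, y) -> y.getValue() - x.getValue()) // most frequent first
              .limit(20)
              .forEach(e -> System.out.println(e.getKey() + "\t" + e.getValue()));
    }
}
```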


Table 4-1: The twenty most used words in written Arabic, with their meanings in English

Rank    Meaning in English
1       In
2       From
3       Is
4       On
5       To
6       That
7       That
8       About
9       With
10      Which
11      That
12      Or
13      Was
14      Finish
15      He
16      No
17      She
18      God
19      Servant
20      Before


Figure 4-2: Handwritten Arabic words in the AHDB written by three different writers (a, b, and c)


4.3 Form Design

The form was designed with six pages. The first three pages were filled with 96 words, 67 of which were handwritten words corresponding to numbers that can be used in handwritten cheque writing. The other 29 words were the most popular Arabic words, as identified in section 4.2. The fourth page was designed to contain three sentences of handwritten words representing numbers and quantities that may be written on cheques. The fifth page was lined, and designed to be completed by the writer in freehand on any subject of their choice. The colour of the forms was selected as light blue with black ink in the foreground because the scanner can mask blue, green, and red. This means one can print forms in green filled in with blue, red or black ink and get the same result as from a blue form with black writing. In the first three pages, the spaces for handwritten words are equal, so there is no pressure on the writer as to the length of the word. The forms were scanned in black and white using the blue channel as a mask (hardware mask). One hundred and five forms were scanned at 600 dpi.

4.4 Data Storing Every word image is saved with a name and number indicating its writer, for example the image for the word “in” for the first writer is saved as ‘in001.TIFF’. Common file types in bitmap format are JPEG, GIF, TIFF and WMF. The TIFF format was chosen for the AHDB because it can store complex information for the CMYK colour model and can also use JPEG compression techniques. This makes TIFF one of the most robust and well-


supported image formats available [Ho00]. For easier retrieval of handwritten images, the Arabic handwritten data was sorted and saved in five sub-directories containing:

1. Wrd_no: words used for numbers and quantities in cheque filling, used in this research for testing and training data because of its use in cheque verification.
2. Wrd_mst: contains the most popular words in Arabic writing (Table 4-1), which have been calculated in this research as described in section 4.2.
3. Chq: contains sentences used in writing cheques with Arabic words (Figure 4-3), which is useful for cheque verification applications.
4. Page: free-handwriting pages in any area of writer interest (Figure 1-3).
5. Form_Wrd: the first three pages of the forms. The first page is stored as the number of the form followed by _a, the second page by _b, and the third page by _c; for example, 001_a, 001_b, and 001_c, respectively, as shown at the Arabic handwritten DB ftp site ftp://ftp.cs.nott.ac.uk/pub/users/sxa.

4.5 Data Retrieval

As mentioned earlier in section 4.4, the data was stored in TIFF format. For image retrieval the system used lizard's TIFF library for Java [LW0]. The pre-processing operations are implemented on the


word’s images (stored in “Wrd_no” directory) as described in Chapter 5.

Figure 4-3: Examples containing sentences used in cheque writing in Arabic


Figure 4-4: Examples of free-handwriting

4.6 Conclusion

The AHDB has been built, which contains Arabic words and text written by 100 different writers. This database contains words used for numbers and quantities in cheque filling. It also contains the most popular words in Arabic writing (reported for the first time in this thesis). Also contained are sentences used in writing cheques with Arabic words. Finally, it contains free-handwriting pages in any area of writer interest. This database is meant to provide training and testing sets for Arabic text recognition research.


In the next chapters there are some useful pre-processing operations applied to the wrd_no set (containing words used for numbers and quantities in cheque filling in the AHDB). An innovative and simple, yet powerful, tagging procedure was designed for this database, which enables the bitmaps of words to be extracted easily. A pre-processing class, which contains some useful pre-processing operations, was constructed. These are discussed in the next chapter. The next chapter deals with pre-processing steps (before classification) of off-line handwritten Arabic words. In this system, some of the methods applied to handwritten Arabic writing, such as slant correction, slope correction, thinning, segmentation and feature extraction, are discussed. These methods have not been applied before, as mentioned in section 2.3. The system first attempts to remove some of the variation in the images that does not affect the identity of the handwritten word (slant correction, slope correction, and baseline estimation). Next, the system codes the skeleton of the word so that information about the lines is passed on to the recognition system (segmentation and feature extraction).

5.1 Overview

This chapter describes the operation of the complete pre-processing system for the recognition of a single Arabic word taken from the handwritten Arabic word database. Any word recognition system can be divided into modules, for example, pre-processing, recognition and post-processing. In this system the handwritten word is normalized to remove incidental differences in style which are independent of the identity of a word. Then, recognition is carried out by first estimating the likelihoods for each frame of data in the representation (using a suitable classification technique, such as a codebook and HMM in this research), and then post-processing can be carried out to reduce ambiguity. The system built in this chapter concentrates on the pre-processing operations, which are especially important in the recognition process of handwritten Arabic words. The


steps involved in pre-processing are implemented in Java code. The system has three main advantages. Firstly, it deals with non-segmented words. Secondly, it takes advantage of the position of features in the character or sub-character. Thirdly, more than 29 features are calculated and used by the VQ and HMM classifiers in the next two chapters.

Figure 5-1: The pre-processing operations. The pipeline takes a word image through image loading, calculation of the vertical histogram, baseline estimation, slope correction, slant correction, and thinning, producing a skeleton of the word image consisting of vertical letters on a horizontal baseline.

5.2 Pre-processing Steps Pre-processing of the handwritten word image is important in order to organize the information so as to simplify the task of recognition. The most crucial step in the pre-processing stage is normalization, which attempts to remove some of the variations in the images which do not affect the identity of the word [SR98]. The system incorporates normalization for stroke width, slope and height of the letters (see Figure 5-1). The normalization task reduces each word image into one consisting of vertical letters of uniform height on a horizontal baseline and made up of one-pixel-


wide strokes. In this system, the word image is loaded and cropped. Then the slant and slope of the word are corrected and the word is thinned. Features are calculated to represent the useful information contained in the image of the word [AHE01]. Then the word is segmented into frames, so the features in each frame can be found. The following sub-sections of this chapter discuss in detail the steps involved in Arabic handwritten pre-processing that have been used for this research. This system incorporates normalization for each of the following factors:

1. Stroke width: this depends on the writing instrument used, the pressure of the instrument, and its angle with respect to the tablet.
2. Slant: this is the deviation of strokes from the vertical axis, which varies between words and between writers.
3. Slope: the slope is the angle of the base of a word if it is not written horizontally.
4. Height of the letters: this varies between authors for the same document, and for a given author for different documents.

Figure 5-1 shows the processes involved in pre-processing. In this system, the word image is loaded and cropped. Then the slant and slope of the word are corrected and the word is thinned. The details of each of the processes are described in the following sections.


Figure 5-2 shows some examples of pre-processing operations done on Arabic words in the database. The implementation of the pre-processing operation consists of the following important steps:

• Image Loading
• Slope Correction
• Slant Correction
• Thinning

Figure 5-2: Different examples of pre-processing stages: (a) baseline detection, (b) slant and slope correction, (c) feature extraction, (d) width normalization


In the Java implementation, every pre-process operation has a separate class. In the following sub-sections, the steps (processes) involved in the Arabic handwritten pre-processing that are used in this research are discussed in detail.

5.2.1 Image Loading

For loading the word image, the system uses a ready-made class library for loading images of type TIFF [LW0]. Before that the images of Arabic words stored in the database were converted to TIFF format type4 using the Polyview software.

Figure 5-3: (a) The word before the operation of slope correction. (b) The word after its slope is corrected horizontally. (c) The same word after slant correction. (d) The operation of thinning


5.2.2 Slope Correction

The slope may be defined as the angle of the baseline of a word that is not written horizontally. Figure 5-3 (a and b) shows a sloped word before and after slope correction. The character height is determined by finding the important baseline. For each Arabic word there are two baselines, as can be seen in Figure 5-4, but the second baseline is difficult to determine since some handwritten Arabic words may have more than one secondary baseline for each segment. So only the main baseline was computed.

Figure 5-4: The two baselines of the word 'five': (a) the second baseline, (b) the main baseline

The heuristic used for baseline estimation consists of three main steps [RW92]:

1. Calculation of the vertical density histogram for the word image.
2. Baseline correction.
3. Slope correction.

The first step is done by counting the number of black pixels in each horizontal line in the image. Then the baseline estimation follows by


rejecting the part of the image likely to be a hooked descender, such as occurs with certain letters. Such descenders are indicated by the maximum peak in the vertical density histogram. Finally, the slope correction procedures are carried out as follows:

1. Calculate the slope:
   a. Find the lowest remaining pixel in each vertical scan line.
   b. Retain only the points around the minimum of each chain of pixels and discard the points that are too high.
   c. Find the line of best fit through these points.
2. Slope correction:
   a. The image of the word is straightened to make the baseline horizontal by application of the shear transform parallel to the y-axis.
   b. The baseline, height, and bounding rectangle of the cropped image are re-estimated, under the new assumption that the image is now horizontal.

A simplified sketch of this procedure is given below.
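The following sketch is a simplification of the slope-correction procedure above, not the thesis implementation: it estimates the baseline slope by a least-squares fit through the lowest ink pixel of each column (omitting the descender-rejection and point-filtering steps) and then shears the image parallel to the y-axis.

```java
// Illustrative, simplified sketch of slope correction by a shear parallel to the
// y-axis (not the thesis implementation).
public class SlopeCorrection {

    // Least-squares slope through the lowest ink pixel of each column (cf. step 1).
    static double estimateSlope(int[][] img) {
        int h = img.length, w = img[0].length, n = 0;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int x = 0; x < w; x++) {
            for (int y = h - 1; y >= 0; y--) {
                if (img[y][x] == 1) {
                    n++; sx += x; sy += y; sxx += x * x; sxy += x * y;
                    break;                                   // lowest ink pixel in this column
                }
            }
        }
        if (n < 2) return 0.0;
        double denom = n * sxx - sx * sx;
        return denom == 0 ? 0.0 : (n * sxy - sx * sy) / denom;
    }

    // img[y][x] is 1 for ink; slope is the fitted baseline slope (dy/dx).
    static int[][] shearVertically(int[][] img, double slope) {
        int h = img.length, w = img[0].length;
        int[][] out = new int[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (img[y][x] == 0) continue;
                int newY = (int) Math.round(y - slope * x);  // shift each column vertically
                if (newY >= 0 && newY < h) out[newY][x] = 1;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[][] img = {
            {0, 0, 0, 1},
            {0, 0, 1, 0},
            {0, 1, 0, 0},
            {1, 0, 0, 0}
        };
        double slope = estimateSlope(img);
        System.out.println("estimated slope: " + slope);
        System.out.println(java.util.Arrays.deepToString(shearVertically(img, slope)));
    }
}
```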

5.2.3 Slant Correction

The slant is the deviation of strokes from the vertical axis, which varies between words and between writers. In Figure 5-3 (c), a word is seen after correcting its slant. The slant of a word is estimated by finding the average angle of near-vertical strokes [RW92]. This is calculated by using an edge detection filter


to find the edges of strokes. This technique gives a chain of connected pixels representing the edges of strokes. The mode orientation of the edges which are close to the vertical is used to estimate the overall slant. The procedure of slant correction contains the following steps:

1. Thin the image and calculate its endpoints.
2. Find all near-vertical strokes by tracing from each endpoint above the baseline until another endpoint on the baseline is reached.
3. Calculate the average slant for all strokes.
4. Using a shear transform parallel to the x-axis, the slanted word can be corrected.
5. Bounding: the bounding box and width of the image are re-estimated.

5.2.4 Thinning

Numerous algorithms have been proposed for thinning (also called skeletonizing) the plane region. This system uses a Zhang-Suen/Stentiford/Holt combined algorithm for thinning binary regions [Ch94].

5.2.5 Normalization

Before finding the handwriting features of the word, the original word image can be normalized and encoded in a canonical form so that different images of the same word can be encoded similarly. The normalization task will reduce each word image to one consisting of vertical letters of uniform


height on a horizontal baseline, and made up of one-pixel-wide strokes. The width will be normalized to 64 bits. Also the segments that do not contain any features are removed, which improves the recognition rate, as will be seen in the experimental results. See Figure 5-9 (c) for an example of width normalization, which is useful in finding moments and pixel distribution features.

5.3 Finding Handwriting Features This section discusses the method used to represent the useful information contained in the image of the word. The choice of the feature extraction method limits or dictates the nature and output of the pre-processing step [SR98]. Since the word in this system is represented by a thinned pattern, or skeleton, most of its geometrical features are suitable for this representation. The features that capture topological and geometrical shape information, both globally and locally, are the most useful, while features that capture the spatial distribution of black pixels are also important. Some features related to the positional information of segments are also of value. A good mixture of these features was expected to perform well. After experimental investigations with classification results as the selection criterion, the following 29 features were chosen. The geometrical and the topological features are: loops with position of its intersection (up, down, left, right), four curve directions, right-, left-disconnection, four directions of long strokes and their position (above the baseline or below the baseline), endpoints, intersections, and the number and positions of endpoints and intersections. The moment features provide information about the global shape. The feature selection technique is empirical but largely reflects the structure of an Arabic handwritten word.


The system used in this thesis uses a skeleton coding scheme. The word is then segmented into frames. The horizontal density histogram is calculated and smoothed. The maxima and minima of the smoothed density histogram are found and frame boundaries are defined to be the midpoints between adjacent maximum/minimum pairs. To ensure that the frames do not exceed a certain width, more frames are added where the maxima and minima are far apart, and chosen according to character height. Every vertical frame will be segmented into regions or rectangles. For each of these rectangles, four bins are allocated to represent different line segments, angles with vertical and horizontal, and the line 45 degrees from these. The concept of frame segmentation and lines will be discussed in section 5.4.
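The frame-boundary selection can be illustrated with the following simplified sketch; it is not the thesis implementation, the smoothing window is an assumed parameter, and the maximum frame width rule is omitted. Local maxima and minima of the smoothed density histogram are located and boundaries are placed midway between adjacent extrema.

```java
// Illustrative, simplified sketch of frame boundary selection from a smoothed
// density histogram (not the thesis implementation).
public class FrameSegmentation {

    static double[] smooth(int[] hist, int window) {
        double[] out = new double[hist.length];
        for (int j = 0; j < hist.length; j++) {
            int count = 0;
            for (int k = j - window; k <= j + window; k++) {
                if (k >= 0 && k < hist.length) { out[j] += hist[k]; count++; }
            }
            out[j] /= count;                              // moving-average smoothing
        }
        return out;
    }

    // Frame boundaries are placed midway between adjacent local extrema.
    static java.util.List<Integer> boundaries(double[] h) {
        java.util.List<Integer> extrema = new java.util.ArrayList<>();
        for (int j = 1; j < h.length - 1; j++) {
            boolean max = h[j] > h[j - 1] && h[j] >= h[j + 1];
            boolean min = h[j] < h[j - 1] && h[j] <= h[j + 1];
            if (max || min) extrema.add(j);
        }
        java.util.List<Integer> cuts = new java.util.ArrayList<>();
        for (int i = 1; i < extrema.size(); i++)
            cuts.add((extrema.get(i - 1) + extrema.get(i)) / 2);
        return cuts;
    }

    public static void main(String[] args) {
        int[] hist = {0, 2, 5, 3, 1, 0, 1, 4, 6, 4, 1, 0};
        System.out.println(boundaries(smooth(hist, 1)));
    }
}
```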

Figure 5-5: Two words with the features written on them


The performance of the recognizer can be improved by passing on more information about salient features in the word. A number of useful features can easily be discerned from the processing that has already been performed on the writing: endpoints, junctions, complementary characters, loops, and turning points. These salient features are defined in more detail in sections 5.3.1 through to 5.3.6, which discuss methods for the detection of intersection points, and endpoints which all operate on a skeletonized bit map.

5.3.1 Outer Contour and Loops

This system uses a class for detecting the outer contour and inner loops, in order to determine how many segments there are in the word and the area of the loops. The location of each "blob" can be found. This procedure works very quickly since it deals with the original file's data.

Figure 5-6: The blobs of the Arabic word "ahad"

5.3.2 Locating Dots

Dots above or below the letters can be identified with a simple set of rules. Short, isolated strokes occurring on or above the half-line are marked as potential dots. The number of dots and their location relative to the main skeleton of the character have to be identified in every frame. The number of dots can be one, two, or three, and


they can be below or above the main skeleton of the character. Dots are calculated by tracing every path from every endpoint. If the tracing reaches another endpoint and the path length is less than a threshold, the procedure finds a dot. If the path is more than one pixel, the centre of the path is enrolled as a dot feature. This is then added to the dots array and the endpoint feature is erased in that point (dot). From the contour, if the width of the dot is double the height of the dot, then the line is considered to be two dots. If the “dots line” has an east or south curve, it is treated as three dots.

5.3.3 Locating Endpoints

The endpoint is the end or start of a line segment. Endpoints are points in the skeleton with only one neighbour, which marks the ends of strokes, though some are artifacts of the skeletonization algorithm. Endpoints are found by examining all individual one-pixels in the skeletonized bit map image. As a consequence of skeletonization, an endpoint will have one, and only one, of its eight contiguous neighbours as a one-pixel. Therefore, if the sum of eight neighbours is one, this is an endpoint.

5.3.4 Junctions

Junctions occur where two strokes meet or cross and are easily found in the skeleton as points with more than two neighbours. The system proposed in this thesis uses the following algorithm: each of the one-pixels in the image is examined, and the number (n) of contiguous one-pixels around the focus pixel is counted. If the count n exceeds 2 (n >= 3), then the focus pixel is considered to be an intersection.
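Both endpoint and junction detection reduce to counting the eight-connected neighbours of each ink pixel in the skeleton, as in the following sketch (an illustration, not the thesis code).

```java
// Illustrative sketch of endpoint and junction detection on a skeletonized bitmap
// (not the thesis code): count the eight neighbours of each ink pixel.
public class SkeletonPoints {

    static int neighbours(int[][] skel, int y, int x) {
        int n = 0;
        for (int dy = -1; dy <= 1; dy++) {
            for (int dx = -1; dx <= 1; dx++) {
                if (dy == 0 && dx == 0) continue;
                int yy = y + dy, xx = x + dx;
                if (yy >= 0 && yy < skel.length && xx >= 0 && xx < skel[0].length)
                    n += skel[yy][xx];
            }
        }
        return n;
    }

    public static void main(String[] args) {
        int[][] skel = {
            {1, 0, 0, 0, 0},
            {0, 1, 0, 0, 0},
            {0, 0, 1, 1, 1},
            {0, 1, 0, 0, 0},
            {1, 0, 0, 0, 0}
        };
        for (int y = 0; y < skel.length; y++) {
            for (int x = 0; x < skel[0].length; x++) {
                if (skel[y][x] != 1) continue;
                int n = neighbours(skel, y, x);
                if (n == 1) System.out.println("endpoint at (" + y + "," + x + ")");
                else if (n >= 3) System.out.println("junction at (" + y + "," + x + ")");
            }
        }
    }
}
```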


Figure 5-7: Four turning points in different directions (a) top, (b) down, (c) left, and (d) right.

5.3.5 Turning Points

Points where a skeleton segment changes direction from upward to downward are recorded as top turning points. Similarly, left, right, and bottom turning points can be found, as illustrated in Figure 5-7. Turning points are detected by multiple fixed windows to examine the variation in coordinate values of the start, mid, and endpoints of the curve. It is worth noting that in the input image 'x' increases from left to right and 'y' increases from top to bottom. The following table (Table 5-1) demonstrates the curve categorization using these coordinates.

Table 5-1: The curve categorization using the coordinates

East x-value

Start>mid

South _

West Startmid