Printed and Handwritten Digits Recognition Using Neural Networks

Daniel Cruces Álvarez, Fernando Martín Rodríguez, Xulio Fernández Hermida.
Departamento de Tecnologías de las Comunicaciones. Universidad de Vigo. E.T.S.I.T. Ciudad Universitaria S/N. 36200 Vigo. SPAIN. Phone: +34-986-812131. Fax: +34-986-812121. E-mail: [email protected], [email protected], [email protected]

ABSTRACT: In this paper, we present a scheme for the recognition of handwritten and printed numerals using a multilayer, clustered neural network trained with the backpropagation algorithm. Kirsch masks are adopted for extracting feature vectors, and a three-layer clustered neural network with five independent subnetworks is developed for classifying numerals efficiently. The neural network was trained with a handwritten numeral database of more than 9000 patterns from different writers and in different styles of type. We obtain a correct recognition rate of about 96.2%. Finally, we show how to construct a refinement stage that improves the results even further.

1 .- INTRODUCTION: In this paper, we study the automatic classification of handwritten numerals (the recognizer developed is also valid for printed numerals, since printing is simply another writing style). The field of application is very wide, for example: postal code recognition, numbers written on bank cheques... The patterns may appear shifted, scaled, distorted, skewed and even overwritten. Our method is based on a multilayer neural network [1] trained with the classical ‘Backpropagation’ algorithm [2]. The structure of the paper is:
- The creation of the database and the information representation model.
- The preprocessing used to compute the feature vectors that will be applied to the network.
- The structure and training details of our neural net.
- The results we have obtained.
- A method for improving the results by building a refinement neural net.

2 .- DATABASE AND INFORMATION REPRESENTATION: We use a database consisting of 9300 handwritten digits provided by 90 writers of different ages, with many different sizes and writing styles. Writers write their numbers inside a template, which makes the segmentation of the patterns easy. Besides, the database is balanced (the number of samples per class is the same for all 10 classes).

The choice of an adequate data representation is a key point in pattern recognition. With a low-level representation we would need a very large training set (generalization is more difficult, because the neural net tends to learn the noise). That is why it is better to use a high-level representation. The preprocessing stage, which converts the low-level data into a high-level representation, must be designed carefully. Besides, other factors influence the preprocessing design, such as the need for high computing speed or hardware limitations. That is why we choose representations with low computing costs.

Digits (both printed and handwritten) are essentially drawings made of lines, id est: one-dimensional structures in a two-dimensional space. So detecting the line segments present in the image seems a good method. For each image zone, the information about the presence of a line segment and its orientation is extracted and stored in a feature map. With this representation we achieve a structure that is not very complex, with good preprocessing speed and an adequate high-level representation.

3 .- PREPROCESSING: As said before, digits are written inside a template (a mesh of black straight lines) to make the segmentation process easier. The sheet is scanned in bi-level (black and white) mode. The resolution is 300 dpi and the result is a graphic file (TIFF format) ready to be preprocessed. We extract a 100x100 image from each box of the mesh. Then, we refine the segmentation by extracting the minimal box that contains the character. We also remove the isolated points due to noise.

The pattern obtained in this manner has an unpredictable size; that is why a normalization (or scaling) is necessary to make sure the recognizer is size invariant. The pattern is also centered, because when we scale we keep the initial aspect ratio (this condition avoids deforming the pattern). We always obtain images of size 16x16; these images are not bi-level because of the scaling process.

The problem now is to extract some features so that we move from a (low-level) pixel representation to a higher-level one. We must retain only the key features of the character and remove all redundant information. We must extract feature maps that contain information about the line segments, their position and their orientation. First-order differential edge detectors are suitable for this task. Besides, they are fast to compute. There are several edge detectors of this kind: Frei-Chen, Kirsch, Prewitt, Sobel... The most accurate at finding the four directional edges (horizontal, vertical, right diagonal and left diagonal) is the Kirsch detector (it is the only one that uses the eight-point neighbourhood of each pixel).

Kirsch [4] defines an algorithm that uses the following notation:

A_0   A_7   A_6
A_1  (i,j)  A_5
A_2   A_3   A_4

Fig. 1: Notation for the neighbours of pixel (i,j).

The equations are:

$$G(i,j) = \max\left\{1,\ \max_{k=0}^{7}\left[5S_k - 3T_k\right]\right\} \qquad (1)$$

where G(i,j) is the gradient for the pixel (i,j), and:

$$S_k = A_k + A_{k+1} + A_{k+2}$$

$$T_k = A_{k+3} + A_{k+4} + A_{k+5} + A_{k+6} + A_{k+7}$$

The subindexes of A are evaluated modulo 8.
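As an illustration of equation (1), the following minimal Python sketch evaluates the eight mask responses and the gradient for a single pixel. The neighbour offsets follow the layout of Fig. 1; the function names and the treatment of pixels outside the image as 0 are assumptions made here, not details given in the paper.

```python
# Offsets (row, column) of the neighbours A0..A7 around pixel (i, j),
# following the layout of Fig. 1:
#   A0 A7 A6
#   A1  .  A5
#   A2 A3 A4
NEIGHBOUR_OFFSETS = [(-1, -1), (0, -1), (1, -1), (1, 0),
                     (1, 1), (0, 1), (-1, 1), (-1, 0)]


def kirsch_responses(image, i, j):
    """Return the eight values 5*S_k - 3*T_k (k = 0..7) for pixel (i, j).

    Assumption: neighbours that fall outside the image count as 0.
    """
    rows, cols = len(image), len(image[0])
    a = []
    for di, dj in NEIGHBOUR_OFFSETS:
        r, c = i + di, j + dj
        a.append(image[r][c] if 0 <= r < rows and 0 <= c < cols else 0)
    responses = []
    for k in range(8):
        s_k = sum(a[(k + m) % 8] for m in range(3))      # A_k .. A_{k+2}
        t_k = sum(a[(k + m) % 8] for m in range(3, 8))   # A_{k+3} .. A_{k+7}
        responses.append(5 * s_k - 3 * t_k)
    return responses


def kirsch_gradient(image, i, j):
    """Equation (1): the largest mask response, with a floor of 1."""
    return max(1, max(kirsch_responses(image, i, j)))


# A bright column on the left of the centre pixel excites mask k = 0 most.
left_edge = [[1, 0, 0],
             [1, 0, 0],
             [1, 0, 0]]
print(kirsch_responses(left_edge, 1, 1))
print(kirsch_gradient(left_edge, 1, 1))
```

In a flat region every response is 0 (5 times three equal neighbours minus 3 times five equal neighbours), so the floor of 1 in equation (1) is what the gradient takes there.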

Based on these equations, we compute feature maps in four directions: horizontal (H), vertical (V), right diagonal (R) and left diagonal (L). In this way, we obtain four local feature maps. To obtain a global feature map, we include the original pattern with no gradient operator applied. The set of feature maps is thus made up of five patterns of size 16x16 (4 local + 1 global); see figure 2. The last preprocessing stage is a compression of the 16x16 Kirsch patterns. They are converted to size 4x4 via linear decimation. In this manner we reduce the input space dimension by a factor of 16. This is very important to make sure the neural net can learn without a very large number of training samples.
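The feature-map construction and the 16x16 to 4x4 compression can be sketched as follows. Several details are assumptions, since the text does not specify them: the grouping of the eight masks into the pairs named H, V, R and L, taking the larger response of each pair with the floor of 1 from equation (1), treating out-of-image neighbours as 0, and implementing the "linear decimation" as a plain average over 4x4 blocks.

```python
# Neighbour offsets (row, column) for A0..A7, following Fig. 1.
OFFSETS = [(-1, -1), (0, -1), (1, -1), (1, 0),
           (1, 1), (0, 1), (-1, 1), (-1, 0)]

# Hypothetical grouping: opposite masks (k, k + 4) share one orientation.
# Which diagonal is called "R" and which "L" is an assumption.
DIRECTION_MASKS = {"H": (2, 6), "V": (0, 4), "R": (1, 5), "L": (3, 7)}


def responses(image, i, j):
    """Eight values 5*S_k - 3*T_k for pixel (i, j) of a square image."""
    size = len(image)
    a = [image[i + di][j + dj]
         if 0 <= i + di < size and 0 <= j + dj < size else 0
         for di, dj in OFFSETS]
    return [5 * sum(a[(k + m) % 8] for m in range(3))
            - 3 * sum(a[(k + m) % 8] for m in range(3, 8))
            for k in range(8)]


def directional_maps(image):
    """Four local maps, one per orientation, same size as the image."""
    size = len(image)
    maps = {name: [[0] * size for _ in range(size)] for name in DIRECTION_MASKS}
    for i in range(size):
        for j in range(size):
            r = responses(image, i, j)
            for name, (k1, k2) in DIRECTION_MASKS.items():
                maps[name][i][j] = max(1, r[k1], r[k2])
    return maps


def decimate(pattern, factor=4):
    """Average non-overlapping factor x factor blocks (assumed decimation)."""
    n = len(pattern) // factor
    return [[sum(pattern[bi * factor + r][bj * factor + c]
                 for r in range(factor) for c in range(factor)) / factor ** 2
             for bj in range(n)]
            for bi in range(n)]


def feature_maps(image):
    """Five 4x4 maps: H, V, R, L (local) and G (global, the pattern itself)."""
    maps = directional_maps(image)
    maps["G"] = image
    return {name: decimate(m) for name, m in maps.items()}


# Usage: a vertical stroke in column 8 of a 16x16 pattern.
stroke = [[1 if c == 8 else 0 for c in range(16)] for _ in range(16)]
features = feature_maps(stroke)
print(features["V"][1])   # one row of the compressed vertical map
```

Each pattern thus yields 5 x 16 = 80 input values instead of the 5 x 256 values of the uncompressed maps.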

Fig. 2: Preprocessing consisting of normalization to size 16x16, feature extraction (4 Kirsch components) and compression to 4x4.

4 .- MULTI-LAYER NEURAL NET: One of the purposes of working with neural networks is the minimization of the mean square error (MSE) between the actual output and the desired one. This minimization is performed via gradient algorithms. Backpropagation [2] is an efficient and very classical method. In our network there is a trade-off regarding the number of connections (or weights). If the number of weights is too low, the net will not be able to learn. On the other hand, if that number is too large, we can have “overtraining” (the net can learn “even the noise”) [2].

4.1 .- NETWORK STRUCTURE: The network structure used is shown in figure 3. The input layer consists of five maps of 4x4 units each. Four of these maps correspond to local characteristics (one for each direction) and the fifth one retains the global characteristic of the input pattern. Each of these five input maps is fully connected to its corresponding map in the hidden layer, so we have five independent subnets. In the output layer, we have 10 units fully connected with all units in the hidden layer. Each output unit represents one class. When we introduce a pattern that belongs to class 'i', the target output is 1 for the i-th output unit and 0 for the others. This network has 170 units and 2080 links (2170 weights). This structure has the advantage of making correct pattern recognition possible even when some of the subnets find ambiguity (the other subnets can cancel the effect of that ambiguity).
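The unit, link and weight counts quoted for the network can be checked with a short calculation. The sketch below assumes that each hidden map has 4x4 units and that every hidden and output unit has one bias weight; neither is stated explicitly in the text, but both are consistent with the quoted totals.

```python
INPUT_MAPS = 5                 # four local maps plus one global map
INPUT_UNITS_PER_MAP = 4 * 4    # stated in the text
HIDDEN_UNITS_PER_MAP = 4 * 4   # assumption, inferred from the totals
OUTPUT_UNITS = 10              # one unit per digit class


def count_network():
    """Return the number of units, links and weights of the clustered net."""
    input_units = INPUT_MAPS * INPUT_UNITS_PER_MAP
    hidden_units = INPUT_MAPS * HIDDEN_UNITS_PER_MAP
    units = input_units + hidden_units + OUTPUT_UNITS

    # Each input map is fully connected only to its own hidden map.
    subnet_links = INPUT_MAPS * INPUT_UNITS_PER_MAP * HIDDEN_UNITS_PER_MAP
    # Every hidden unit is connected to every output unit.
    output_links = hidden_units * OUTPUT_UNITS
    links = subnet_links + output_links

    # Assumption: one bias per hidden unit and per output unit.
    weights = links + hidden_units + OUTPUT_UNITS
    return {"units": units, "links": links, "weights": weights}


print(count_network())
```

For comparison, a fully connected 80-80-10 network would have 80 x 80 + 80 x 10 = 7200 links, so the clustered structure uses fewer than a third of them.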

Fig. 3: Network Structure.
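The clustered structure of section 4.1 can be sketched as a forward pass in which each input map only feeds its own hidden group. This is a minimal illustration, not the authors' implementation: the sigmoid activation, the 4x4 hidden maps, the weight range and all names are assumptions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_params(rng, n_maps=5, n_in=16, n_hid=16, n_out=10):
    """Random initial weights (the range -0.5..0.5 is an arbitrary choice)."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)]
                for _ in range(rows)]
    return {
        "subnets": [{"w": matrix(n_hid, n_in), "b": [0.0] * n_hid}
                    for _ in range(n_maps)],
        "out": {"w": matrix(n_out, n_maps * n_hid), "b": [0.0] * n_out},
    }


def layer(weights, biases, inputs):
    """One fully connected sigmoid layer."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(params, maps):
    """maps: five flattened 4x4 feature maps (lists of 16 values each)."""
    hidden = []
    for subnet, feature_map in zip(params["subnets"], maps):
        # Each subnet sees only its own feature map.
        hidden.extend(layer(subnet["w"], subnet["b"], feature_map))
    outputs = layer(params["out"]["w"], params["out"]["b"], hidden)
    return hidden, outputs


# Usage with dummy input maps.
params = init_params(random.Random(0))
maps = [[0.5] * 16 for _ in range(5)]
hidden, outputs = forward(params, maps)
print(len(hidden), len(outputs))
```

Because the subnets are independent, changing one input map alters only the 16 hidden units of its own group; the output layer is the only place where the five groups are combined.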

4.2 .- NETWORK TRAINING: Each of the subnets between the input and the hidden layer is initialized with random weights and trained with a different feature map. All the connections in the net are adaptive and are trained with the "Backpropagation" algorithm. The learning coefficient, which is the only free parameter in this method, is set before training and is not changed during it. In figure 4 we see that the maximum recognition rate is reached after about 200 training iterations (the maximum rate is 96.2%). After that, the rate remains almost constant. That is why we do not use any special criterion to stop the algorithm. The rates are measured using the cross-validation method, id est: the training set and the test set have no common elements. That is why our measurements are not affected by “over-training” effects.
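A single backpropagation update with a fixed learning coefficient can be sketched as below. For brevity the sketch uses a small fully connected two-layer network without biases rather than the clustered structure, with sigmoid units and the squared error E = 1/2 * sum((y - t)^2); the layer sizes, the learning coefficient of 0.5 and all names are illustrative assumptions.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_step(w_hid, w_out, x, target, lr):
    """One gradient-descent step; returns the error measured before the update."""
    # Forward pass.
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hid]
    y = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in w_out]
    error = 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, target))

    # Backward pass: local gradients (deltas) of output and hidden units.
    delta_out = [(yk - tk) * yk * (1.0 - yk) for yk, tk in zip(y, target)]
    delta_hid = [hj * (1.0 - hj)
                 * sum(delta_out[k] * w_out[k][j] for k in range(len(w_out)))
                 for j, hj in enumerate(h)]

    # Weight updates with a constant learning coefficient lr.
    for k, row in enumerate(w_out):
        for j in range(len(row)):
            row[j] -= lr * delta_out[k] * h[j]
    for j, row in enumerate(w_hid):
        for i in range(len(row)):
            row[i] -= lr * delta_hid[j] * x[i]
    return error


# Usage: repeated presentation of one pattern to a tiny 4-3-2 network.
rng = random.Random(1)
w_hid = [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(3)]
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
errors = [train_step(w_hid, w_out, [1.0, 0.0, 0.5, 0.25], [1.0, 0.0], lr=0.5)
          for _ in range(200)]
print(errors[0], errors[-1])
```

The hidden deltas are computed before any weight is modified, so both layers are updated with gradients taken at the same point.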

Fig. 4: Correct recognition rate versus number of training epochs.

5 .- EXPERIMENTAL RESULTS: We have carried out some experiments to find the parameters that yield the best results. We used our own database (9300 patterns) for all of them. In our first experiment, we made 32 partitions of the database. We trained the network with 31 of them and then tested it with the remaining partition. We repeated this process for the 32 possibilities. This is a well-known cross-validation method; its name in the literature is 'Leave one out'. We averaged the recognition rates to obtain a global accuracy measure.

In some applications it is interesting to allow some rejection rate, id est: the network refuses to recognize the most ambiguous patterns. Those patterns should be passed to a human operator for revision (the network output in this case can be used as a hint). We can apply this idea by defining an ambiguous pattern as follows: “A pattern is ambiguous if either the maximum output of the net is less than some threshold close to 1 (t1) or any of the non-maximum outputs is greater than some threshold close to 0 (t2)”. Choosing adequate values for the thresholds, we got a rejection rate of 9% and a correct classification rate (only over the non-rejected patterns, of course) of 99%. With no rejection our correct recognition rate is 96.2%.

Another experiment was to confront our network with some “strange” (almost pathological) patterns. Table 1 shows examples of patterns that were correctly classified (this shows the generalization power of the network, as the net had never seen digits like these before). Table 2 shows patterns that were incorrectly classified. The reader can see that they are very ambiguous, even for the human eye.

Table 1: “Strange” patterns classified correctly.
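The evaluation protocol of section 5 (32 disjoint partitions, each used once as the test set) and the rejection rule with thresholds t1 and t2 can be sketched as follows. The threshold values 0.9 and 0.1 are placeholders, since the paper does not give the values used; the round-robin assignment of patterns to partitions and the function names are also assumptions.

```python
def make_partitions(n_patterns, n_partitions=32):
    """Split the indices 0..n_patterns-1 into disjoint partitions (round robin)."""
    return [list(range(p, n_patterns, n_partitions))
            for p in range(n_partitions)]


def train_test_splits(partitions):
    """Yield (train, test) index lists, holding out one partition at a time."""
    for held_out in range(len(partitions)):
        train = [i for p, part in enumerate(partitions) if p != held_out
                 for i in part]
        yield train, partitions[held_out]


def is_ambiguous(outputs, t1=0.9, t2=0.1):
    """Rejection rule: maximum output below t1, or another output above t2."""
    best = max(range(len(outputs)), key=lambda k: outputs[k])
    if outputs[best] < t1:
        return True
    return any(o > t2 for k, o in enumerate(outputs) if k != best)


def classify_with_rejection(outputs, t1=0.9, t2=0.1):
    """Return the winning class, or None when the pattern is rejected."""
    if is_ambiguous(outputs, t1, t2):
        return None
    return max(range(len(outputs)), key=lambda k: outputs[k])


# Usage: partition sizes for the 9300-pattern database, and two decisions.
partitions = make_partitions(9300)
print(sorted({len(part) for part in partitions}))

clear = [0.02] * 10
clear[3] = 0.97
doubtful = [0.02] * 10
doubtful[4], doubtful[9] = 0.6, 0.5
print(classify_with_rejection(clear), classify_with_rejection(doubtful))
```

Moving t1 towards 1 and t2 towards 0 rejects more patterns, which is the trade-off between rejection rate and error rate described in the text. In practice the patterns would normally be shuffled (and kept balanced per class) before being partitioned.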

1/9   4/9   4/1   7/3   4/9   7/2   5/1   3/5   4/9   5/6

Table 2: Patterns incorrectly classified.

6 .- REFINEMENT STAGE: We decided to improve the results even further using a second neural classifier. This “refinement network” acts only on the ambiguous patterns that were rejected by the first network. For a pattern to be considered ambiguous at this stage, the following conditions must hold: the pattern was rejected by the first network (using the previously stated criterion), two outputs clearly dominate over the others (we choose a threshold to ensure this) and the distance between those two dominant outputs is small (we also choose a threshold to ensure this). This refinement network is made up of 45 neurons. Each of them is trained to distinguish between a pair of numerals (there are 45 combinations of two different digits: $\frac{10!}{2!\,(10-2)!} = 45$). The ambiguous patterns are applied only to the neuron designed to distinguish between the two contending candidates (the two dominant ones in the first stage). Each of the 45 neurons receives as input a feature vector very similar to the one used in the first stage. The difference here is that the 4 Kirsch components are compressed to a size of 8x8 (instead of 4x4). The reason for that is to pay more attention to local details that may be very important to distinguish between the contending candidates. The global component is removed here for the same reason.
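The routing of an ambiguous pattern to one of the 45 pair neurons can be sketched as below. The two thresholds (here called `others_max` and `max_gap`) and their values are hypothetical, and the way "two outputs clearly dominate" is tested is only one possible reading of the criterion; the sketch also assumes the caller has already checked that the first network rejected the pattern.

```python
from itertools import combinations

# One neuron per unordered pair of digits: C(10, 2) = 45 pairs.
PAIRS = list(combinations(range(10), 2))
PAIR_INDEX = {pair: n for n, pair in enumerate(PAIRS)}


def select_pair(outputs, others_max=0.2, max_gap=0.3):
    """Return the pair of contending digits, or None if no pair qualifies.

    Assumed reading of the criterion: the second-largest output is above
    others_max while the third-largest is not, and the two largest outputs
    differ by at most max_gap.
    """
    ranked = sorted(range(len(outputs)), key=lambda k: outputs[k], reverse=True)
    first, second, third = ranked[0], ranked[1], ranked[2]
    dominate = outputs[second] > others_max >= outputs[third]
    close = outputs[first] - outputs[second] <= max_gap
    if dominate and close:
        return tuple(sorted((first, second)))
    return None


# Usage: first-stage outputs hesitating between 4 and 9.
outputs = [0.01] * 10
outputs[4], outputs[9] = 0.55, 0.45
pair = select_pair(outputs)
print(pair, PAIR_INDEX[pair])
```

With four Kirsch components compressed to 8x8 and no global component, each pair neuron would receive 4 x 64 = 256 input values.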

Fig 5. Structure of the refinement network.

These neurons were again trained with our database and with the ‘Backpropagation’ algorithm. Each neuron takes a binary decision, id est: a value of 0 decides for one digit and a value of 1 decides for the other. We can set thresholds to decide when the output is “in the middle”. In these cases the second network rejects the pattern (this will occur with very ambiguous patterns). The results of applying this network are shown in figure 6. With a high rejection rate (the rightmost points on the graph correspond to 9% of rejection) the error is negligible. If we accept fewer rejections, we will have more errors. The dotted lines are the curves corresponding to applying the first stage only. We see that the refinement stage improves performance significantly.

Fig 6. Results for the inclusion of a refinement stage.

7 .- FUTURE LINES: Our main future line is to obtain an optimum initialization of the net weights using genetic algorithms. That would allow better and quicker training.

8 .- CONCLUSIONS: We have designed a scheme able to recognize handwritten and printed digits. We have used a multilayer neural network trained with the Backpropagation algorithm. Our result is a classification error of 3.8% with no rejection. If we establish a rejection rate of 9%, the error rate is reduced to 1%. We have improved those results using a refinement stage based on another neural network.

9 .- REFERENCES:
[1] S.-Whan Lee. “Off-Line Recognition of Totally Unconstrained Handwritten Numerals Using Multilayer Cluster Neural Networks”. IEEE Transactions on P.A.M.I. Vol 18, Num. 6, pp 648-652, 1996.
[2] S. Haykin. “Neural Networks. A Comprehensive Foundation”. pp 179-181. 1994.
[3] W.K. Pratt. “Digital Image Processing”. John Wiley & Sons. 1978.
