RECOGNITION OF HANDWRITTEN DIGITS, 2002
Recognition of Handwritten Digits Using a Neural Network

Mathias Wellner, Jessica Luan, Caleb Sylvester

Abstract—This paper presents an experiment that determines the ability of a multi-layer neural network, implemented in Matlab, to identify handwriting samples of the single digits 0-9. Specifically, the implementation was based on a Multi-Layer Perceptron Network (MLPN) trained with backpropagation. The network performs with an overall recognition rate (accuracy) of 81.6%.
I. Introduction

Handwriting recognition has become very useful in human/computer interaction. Reliable, fast, and flexible recognition methodologies have elevated the utility, and therefore the prevalence, of PDAs and even more dynamic pocket PC models. Though not a complete solution to the broad demands of PDA-type devices, one realm in which we think our system would be useful is mail sorting: recognizing the individual numbers of a zip code is exactly what our system is geared towards.

Our interest in handwriting recognition and its practical applications compelled us to develop a neural network capable of classifying the digits 0-9. Our intent in using an MLPN was to test the level of accuracy we could achieve with a single hidden layer of neurons. We assumed that backpropagation training and a single-hidden-layer structure would yield high accuracy once we were able to provide the network with inputs containing the maximum amount of information about the digits 0-9. Since a neural network can perform only as well as its training data is accurate, optimizing the network inputs was the main focus of our work.

Writing samples are taken from a 100×100-pixel writing space in MS Paint. We selected this large tablet size for two reasons. First, it provides users ample room to write according to their individual style. Second, it gives us disposable space that can be eliminated to condense the image; starting with a small space and possibly having to enlarge it to capture some feature of a digit is much more difficult and more likely to introduce error into the enlarged sample. Samples are saved as 8-bit bitmaps, and the corresponding data matrix opened in Matlab is 100×100, so extracting from this 10,000-element data set the appropriate features of each digit (those containing maximum information) was of great importance.

M. Wellner is an exchange student from Dresden, Germany, currently with the Department of Electrical and Computer Engineering, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061 (USA), email: [email protected]
J. Luan is currently doing graduate work in the area of controls at the Virginia Polytechnic Institute in Blacksburg, VA, email: [email protected]
C. Sylvester is a first-semester MSEE student from Charlottesville, Virginia. He is currently enrolled in a controls concentration at Virginia Tech, Blacksburg, VA 24061 (USA), email: [email protected]
Fig. 1. Windowing of digit
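The windowing of Figure 1 and the gridline-distance features of Section II can be sketched as follows. This is a Python/NumPy sketch, not the authors' Matlab code, which is not given in the paper; it assumes a bitmap matrix of 255's (background) and 0's (inked pixels) as described in Section II.

```python
import numpy as np

def window_of_attention(img):
    """Crop a 0/255 bitmap (0 = inked pixel) to the digit's bounding box."""
    rows, cols = np.where(img == 0)
    return img[rows.min():rows.max() + 1, cols.min():cols.max() + 1]

def edge_distances(win, n_vert=6, n_horiz=10):
    """Normalized distances from each window edge to the nearest inked pixel,
    measured along evenly spaced vertical and horizontal gridlines."""
    h, w = win.shape
    feats = []
    # Vertical gridlines: scan each chosen column from the top and the bottom.
    for c in np.linspace(0, w - 1, n_vert).astype(int):
        inked = np.where(win[:, c] == 0)[0]
        top, bot = (inked[0], h - 1 - inked[-1]) if inked.size else (h, h)
        feats += [top / h, bot / h]
    # Horizontal gridlines: scan each chosen row from the left and the right.
    for r in np.linspace(0, h - 1, n_horiz).astype(int):
        inked = np.where(win[r, :] == 0)[0]
        left, right = (inked[0], w - 1 - inked[-1]) if inked.size else (w, w)
        feats += [left / w, right / w]
    return feats  # 2 * (6 + 10) = 32 elements
```

With the paper's six vertical and ten horizontal gridlines, this yields the 32 distance elements of the input vector.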
The remaining portions of the paper are organized as follows. Section II explains the features of interest for the digits 0-9 and the calculation of these feature parameters. Section III addresses the network architecture and training. Section IV demonstrates the results of testing the network and addresses some interesting points about its errors. Section V suggests some further improvements that should enhance the system's dependability.

II. Feature Extraction

Microsoft Paint is used as an interface between the user and our handwriting recognition program. We have specified a drawing space in Paint of 100×100 pixels. After a digit from 0-9 is drawn, the image is saved as an 8-bit, 256-color bitmap. This image is then imported into Matlab using the command "a1 = imread('name.bmp', 'bmp')" and is read as a two-dimensional matrix of 255's and 0's, with 0's being the colored pixels. Instead of feeding this entire 10,000-element matrix into the neural network, we chose to look at characteristics specific to the drawn digit and obtained a 39-element feature vector. This feature extraction process cuts down on the computation required by the neural network and allows training to be completed much more quickly. Some of the features we looked at were the distances
between the drawn digit and the edges of the drawing space. For these distances to be meaningful, we first needed to crop the image and center the digit in the bitmap. This region is called the "window of attention," seen in Figure 1. The window was then divided with gridlines. As shown in Figure 2, the distances were measured along these gridlines from each edge of the window to the closest colored pixel. In this way, we obtain a rough outline of the number within the window. We chose to have six vertical and ten horizontal gridlines, so this added a total of 32 elements to the input vector. These distances were then normalized with respect to the length and width of the window, since the window size can vary depending on how large the user chooses to draw the digit.

Fig. 2. Distance measuring

Other characteristics we thought important to include in the input vector for distinction were the numbers of intersections between the digit and the gridlines, as shown in Figure 3. This turned out to be a trickier task than we had originally thought. At first, we simply wrote a program to count the number of zeros in the matrix that fell on the gridlines. However, depending on where these gridlines fell, a perfectly straight line could potentially count as having a hundred intersections, or zero. For the case of the number one, this became a particularly crucial detail. Our solution was to redefine what we counted as an intersection: not only must the gridline cross a colored pixel, but there must also be a zero in one of the three places to the left of the pixel as well as in one of the three places to the right. This is shown in Figure 4. Three more elements were then added to the input vector: the number of vertical intersections, the number of horizontal intersections, and the total number of crossings.

Fig. 3. Definition of intersections

Fig. 4. Intersection definition by analysing neighbours

The last features we considered were the percentages of colored pixels that fell in the left and right halves of the window, as well as the percentages that appeared in the top and bottom halves of the image. This added four more elements to the input vector, bringing the total to 39.

III. Network Architecture and Training
Fig. 5. General network architecture
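The general architecture of Figure 5, with the configuration eventually chosen in Section III (39 inputs, 30 hidden neurons, 14 outputs), corresponds to the following forward pass. This is a Python/NumPy sketch rather than the original Matlab program, and tanh stands in for the paper's "arctan style" activation; both map to (-1, 1), matching the +1/-1 training targets.

```python
import numpy as np

rng = np.random.default_rng(0)

# Layer sizes from Section III: 39 features in, 30 hidden, 14 output classes.
n_in, n_hid, n_out = 39, 30, 14

# Small random initial weights, with a bias folded in as an extra input of 1.
W1 = rng.normal(0, 0.1, (n_hid, n_in + 1))
W2 = rng.normal(0, 0.1, (n_out, n_hid + 1))

def forward(x):
    """Single forward pass; outputs lie in (-1, 1)."""
    h = np.tanh(W1 @ np.append(x, 1.0))
    y = np.tanh(W2 @ np.append(h, 1.0))
    return h, y

x = rng.uniform(0, 1, n_in)   # a dummy 39-element feature vector
_, y = forward(x)
print(y.shape)                # (14,)
```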
Since the feature dimensions were not absolutely clear at the beginning, the network program was written to accommodate various numbers of input, hidden-layer, and output neurons (see Figure 5). We settled on the following network configuration:

39 inputs: one for each feature extracted per digit.
30 hidden-layer neurons: we varied this number and found that a larger hidden layer did not yield better results.
14 output neurons: originally we had 10 outputs, one for each digit to be recognized, but to improve performance we introduced separate output classes for some digits that are commonly written in different ways.

The interface between feature extraction and the network program was a CSV file, which can be easily read and written by Matlab. The weight values obtained by the backpropagation training algorithm were also saved in this format to allow testing. We had two databases, one for training (62 entries) and one for testing (212 entries). One main problem was allocating the digits to the respective rows of the testing or training matrix; we solved this by defining an allocation matrix containing the digits in the order of the matrix rows.

The network output consisted of 14 neurons, each representing one digit or digit variant. In training, the desired response was created by setting the desired output to 1 and all other outputs to -1. This made best use of the range of the arctan-style activation function we used. After several thousand steps, the network was in most cases able to separate the training patterns. To illustrate this, Figure 6 shows such an ideal network output. The X axis contains all 62 training patterns; to aid distinction between the desired responses, several boxes have been drawn. The Y axis shows the network output for each pattern.

Fig. 6. Perfect network output after training

IV. Experimental Results

One hundred and fifty new digit patterns were developed for testing. We included the original 62 training patterns, so the system was tested with 212 pattern vectors. Table I shows the results of testing with these 212 patterns. Experimental data shows the network's overall accuracy to be 81.60%.
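The training scheme of Section III can be sketched as a minimal batch backpropagation loop on a 39-30-14 tanh network, with the desired output set to +1 for the target class and -1 elsewhere. This is an illustrative Python/NumPy sketch under those assumptions, not the authors' Matlab code; the feature vectors and labels below are random stand-ins for the 62-entry training database.

```python
import numpy as np

rng = np.random.default_rng(1)

n_in, n_hid, n_out, lr = 39, 30, 14, 0.05
W1 = rng.normal(0, 0.1, (n_hid, n_in))
W2 = rng.normal(0, 0.1, (n_out, n_hid))

X = rng.uniform(0, 1, (62, n_in))     # stand-in for the 62 training vectors
labels = rng.integers(0, n_out, 62)   # stand-in class labels
T = -np.ones((62, n_out))
T[np.arange(62), labels] = 1.0        # desired output: +1 target, -1 elsewhere

def mse():
    H = np.tanh(X @ W1.T)
    Y = np.tanh(H @ W2.T)
    return np.mean((Y - T) ** 2)

err_before = mse()
for step in range(2000):
    H = np.tanh(X @ W1.T)             # hidden activations
    Y = np.tanh(H @ W2.T)             # outputs in (-1, 1)
    dY = (Y - T) * (1 - Y ** 2)       # output delta (squared-error loss)
    dH = (dY @ W2) * (1 - H ** 2)     # hidden delta, backpropagated
    W2 -= lr * dY.T @ H / len(X)
    W1 -= lr * dH.T @ X / len(X)
err_after = mse()
```

In the paper, several thousand such steps were usually enough to separate the 62 training patterns, as Figure 6 illustrates.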
At first sight, these data show us that the added training
patterns for digits 1, 2, 4, and 7 do quite well to reduce the number of original errors generated by the network. Oddly, though, the digit 3 is classified with quite low accuracy. When the network was trained with only the original 40 training patterns (not the new 22 for digits 1, 2, 4, and 7), the three was identified in training with no problem. There was some confusion with the digit eight, but the network was far more accurate than 52.63%. The addition of training patterns, however, seems to have biased the network's training for the digit 3 away from the correct classification. The problem may lie with the fact that the majority of the new patterns for the digit seven were of the type with a small cross (like the letter t). Such training may have biased the classification of a three towards that of a seven, as only the bottom portions of these numbers are significantly different.

In spite of the high error rate for the digit three, the network's overall accuracy of 81.60% is very encouraging. Nevertheless, in an effort to better understand the origin of our errors, we looked at the relative frequency with which each digit was classified incorrectly and, more importantly, at the actual classification when a misidentification did occur. Figure 7 shows these relative frequencies. Note first that both scales should be shifted by negative one so that they range from 0-9. With this change, we can see that the darkly shaded centers at coordinates ("one", "seven"), ("two", "one"), and ("three", "two") correspond to the network frequently mistaking a written "one" for a "seven", a written "two" for a "one", and a written "three" for a "two". In order to eliminate these errors, we need to develop an understanding of the correlation between three things: 1. the desired classification, 2. the style in which the writing sample was created by the user, and 3. the training for the desired digit (i.e., whether or not additional training is required to cater to the particular user's handwriting style).
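The relative frequencies of Figure 7 amount to a row-normalized confusion matrix restricted to errors. A sketch of that computation, assuming true and predicted labels are available as lists (the example labels below are hypothetical, not the paper's data):

```python
import numpy as np

def misidentification_frequencies(true, pred, n_classes=10):
    """Entry (i, j) is the fraction of digit i's misclassifications
    that the network called digit j; correct classifications are ignored."""
    conf = np.zeros((n_classes, n_classes))
    for t, p in zip(true, pred):
        if t != p:
            conf[t, p] += 1
    row_sums = conf.sum(axis=1, keepdims=True)
    # Rows with no errors stay all-zero instead of dividing by zero.
    return np.divide(conf, row_sums, out=np.zeros_like(conf),
                     where=row_sums > 0)

# Hypothetical example: two threes called "two", one three called "seven".
freq = misidentification_frequencies([3, 3, 3, 5], [2, 2, 7, 5])
```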
TABLE I
Test Results

Digit                  0      1      2      3      4      5      6       7      8      9   Overall
Test Patterns         19     23     25     19     25     19     19      25     19     19       212
Testing Errors         1      5      5     10      5      5      1       0      4      3        39
Percentage Error    5.26  21.74  20.00  52.63  20.00  26.32   5.26    0.00  21.05  15.79     18.40
Percentage Accuracy 94.74 78.26  80.00  47.37  80.00  73.68  94.74  100.00  78.95  84.21     81.60
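The overall figures in Table I follow directly from the per-digit counts; a quick arithmetic check:

```python
# Per-digit test pattern counts and error counts from Table I.
patterns = [19, 23, 25, 19, 25, 19, 19, 25, 19, 19]
errors = [1, 5, 5, 10, 5, 5, 1, 0, 4, 3]

per_digit_error = [100 * e / p for e, p in zip(errors, patterns)]
overall_error = 100 * sum(errors) / sum(patterns)

print(round(per_digit_error[3], 2))     # digit 3 -> 52.63
print(round(overall_error, 2))          # 18.4
print(round(100 - overall_error, 2))    # 81.6 overall accuracy
```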
Fig. 7. Misidentification frequencies

This representation of our error distribution is important because, for example, it already shows that the initial explanation of three's misclassification is wrong. We can see that only with minimal frequency did the network misinterpret an actual three as a seven; see coordinate (3,7) = 0.25. So, closely analyzing this relative-frequency plot should shed more light on how to cater to different writing styles with our training process. With some technical correlation analysis, it might also offer hints about certain character features we originally thought valuable that may actually not be.

V. Conclusions

Experimental data shows the network's overall accuracy to be 81.60%. In comparison, Parizeau, Lemieux, and Gagné [1] have achieved a single-digit recognition rate of 97.0% with their backpropagation MLPN. Their work revolved around the use of a very large handwriting database called Unipen. Compared to our 62 training and 212 test patterns, they used 15,953 training and 8,598 test patterns. This enormous disparity between their inclusion of so many writing styles and our exclusion makes even more obvious our need for more dynamic training. So, the first effort to improve network accuracy should be to generate a more dynamic training set.

As mentioned in Section IV, a correlation analysis of the specific errors should aid in eliminating causes of erroneous performance. Already of special interest are two digit characteristics: 1. the number of intersections with horizontal and vertical gridlines and 2. the percentage of each digit that falls in each half plane. Our current definition of an intersection may be disadvantageous, as may be the number of gridlines we use to locate each intersection. The percentages constitute four of the 39 characteristic parameters. These seemed to be valuable features when we observed our handwriting in an enlarged 100×100 input space, but there is actually little difference in their values. The lack of distinct differences may make them more a point of confusion for the network than a point of clarification. One obvious test of this theory is to eliminate these percentages altogether, retrain the network, and test for new system accuracies.

Aside from the limitations on performance, the main limitation on utility is that the system applies only to single digits. Future work might include ways to divide a set of n > 1 digits in one input space into n distinct spaces. Replicas of the network might then be used to simultaneously analyze each of the n subspaces.

References
[1] M. Parizeau, A. Lemieux, C. Gagné, "Character Recognition Experiments Using Unipen Data," Proceedings of ICDAR 2001, Seattle (Washington, USA), pp. 1-5, 2001.
Mathias Wellner was born June 26, 1979 in Karl-Marx-Stadt, today Chemnitz. After attending a school with a focus on mathematics, natural sciences, and technology, he served his country for 13 months in a hospital laboratory. In 1999, he happily started his studies of electrical engineering at the TU Dresden. But after three years he felt that something new had to be ventured. He readily applied for an exchange program in the US and was accepted, much to his surprise.
Caleb Sylvester claims his birth occurred on 13 April 1980 in Indiana, but his recent genealogical tracings have led him to a small, well-kept pack of wolves in the northern tundra of Canada. Not really surprised by his finding, he has calmly accepted the fact that he may not even be 22 years old. Fortunately, he spent nine years in Charlottesville, Virginia civilizing himself enough to attend Virginia Military Institute, at which he earned his BSEE in May 2002. Having widely broadened his horizons this past semester, Caleb is anxious for whatever post-student life brings him, whether it be teaching calculus to high school students or working only to earn enough money to splurge on extreme backpacking trips.
Jessica Luan was born June 26, 1980 in South Bend, Indiana. After first trying to attain a dual degree in music as well as electrical engineering, she gave up on music and received her B.S. degree in electrical engineering at the University of New Hampshire in Durham, NH in May 2002. She is currently doing graduate work in the area of controls at Virginia Tech.