A Segmentation-Free Neural Network Classifier for Machine-Printed Numeric Fields

Paul D. Gader, Brian D. Forester, Andrew M. Gillies, Margaret J. Ganzberger, Robert C. Vogt, and John M. Trenkle

Environmental Research Institute of Michigan, PO Box 134001, Ann Arbor, MI 48113-4001

Abstract A segmentation-free neural network classifier for reading machine-printed numeric fields is described. The numeric fields were extracted from binary images of mailpieces from the USPS mailstream. The images include a wide variety of fonts and a significant percentage of them are of very low quality. The classifier uses convolution-like features which are extracted and used as inputs to a neural network. This process is performed over a window that is passed along the image of the entire numeric field. Thus the characters in the fields are not isolated by any segmentation process. Likelihood functions are used to make final decisions concerning the locations and class memberships of the characters in the fields. Approximately 83% of the numeric fields were correctly read.

1.0 Overview Character recognition has been studied extensively for the past thirty years, and a wide variety of techniques have been proposed and described in the literature. Furthermore, there are many Optical Character Recognition (OCR) programs available for desktop scanners and personal computers. These programs perform quite well on certain sets of fonts, and some can be trained to recognize new fonts. Desktop scanners can also provide high-quality images if one is willing to wait many seconds to many minutes for a scan. In a high-volume, high-speed scanning environment such as that encountered in the mail sorting operations of the United States Postal Service, however, the huge variety of fonts and the low image quality obtained by current image acquisition devices create problems for OCR algorithms. A sample of the numeric fields encountered in this environment is shown in Figure 1.

FIGURE 1. Some sample numeric fields.

Most OCR algorithms rely on isolating the individual characters in a word and then applying a character recognition algorithm to the individual digits. This process of isolating the individual characters is referred to as character segmentation. The approach has two main drawbacks. First, the computational structure of the segmentation algorithm does not usually match that of the recognition algorithm. Second, arguments can be made that the only true way to isolate the characters in a word is by reading them. One method of attacking the second problem is to generate multiple character segmentation possibilities and then rank the segmentations by performing character recognition on each of them using a recognizer that assigns a confidence value. Another approach is to perform recognition without segmentation. In this paper, we describe experiments in the latter vein. In particular, we describe a neural network classifier that is applied in a “sliding mode” to images of numeric fields with unknown numbers of characters. That is, the neural network makes a classification hypothesis at each location in the numeric field. A likelihood function is then used to evaluate the likelihood that a character is centered at each point. These likelihoods are then used to generate strings representing the possible class memberships of the characters in the numeric field. The classifier described herein is often referred to as a “sliding window classifier” because of the manner in which it is implemented; an image of a numeric field is used as input to the classifier. In the first section, we describe the feature set used as input to the neural network. In the next section, we describe the recognition experiments that were performed, including the data used in the experiments and how it was collected. Following that, we present some results obtained with a 4-layer network trained using the backpropagation algorithm. Finally, we discuss some implications of the research.

2.0 Feature Set In this section, we discuss the feature set. We first discuss the convolution masks that were used. We then discuss how the convolution masks were used to generate feature vectors.

2.1 Convolution Features The convolution shape features consist of a set of 150 convolution masks of size 8 x 8. These masks were generated from a training set and are shown in Figure 2. The weights in the masks vary from -1 to 1. A filled-in square represents a positive weight and an open square represents a negative weight; the size of the square represents the magnitude of the weight, and no square represents a zero weight. The features were all derived using 24 x 16 normalized, isolated digit images. Two hundred digit samples from each class were used. The initial features were derived from

FIGURE 2. Example of convolution shape features developed for the sliding window neural network classifier. A filled-in square represents a positive weight and an open square represents a negative weight. The size of the square represents the magnitude of a weight. No square represents a zero weight.

these normalized digit images using the 15 overlapping 8 x 8 zones depicted in Figure 3. Zone 1 is the upper left-hand corner, zone 5 the upper middle region, etc. Currently, there is one feature per zone per digit class. Each feature consists of an 8 x 8 mask. The initial masks are generated by counting the number of times the corresponding pixel is on across the training samples and mapping that count to a value between -1 and 1. The results are then smoothed to generate the final masks.

FIGURE 3. Overlapping zones. (The braces in the original figure group the zones by column — 1,4,7; 2,5,8; 3,6,9 — and by row — 1,2,3; 4,5,6; 7,8,9.)
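The mask-generation step above (count how often each pixel is on across a class's training samples for one zone, map that count into [-1, 1], then smooth) might be sketched as follows. The function name, the replication padding, and the 3 x 3 box-filter smoothing are our own assumptions; the paper does not specify which smoothing was actually used.

```python
import numpy as np

def build_mask(zone_samples):
    """Build one 8 x 8 convolution mask from binary zone samples.

    zone_samples: array of shape (n_samples, 8, 8) with values 0/1,
    all cut from the same zone of normalized 24 x 16 digit images
    of one class.
    """
    # Fraction of samples in which each pixel is on, mapped to [-1, 1].
    freq = zone_samples.mean(axis=0)       # values in [0, 1]
    mask = 2.0 * freq - 1.0                # values in [-1, 1]

    # Smooth with a 3 x 3 box filter (edges padded by replication);
    # this particular filter is an illustrative assumption.
    padded = np.pad(mask, 1, mode="edge")
    smoothed = np.zeros_like(mask)
    for i in range(8):
        for j in range(8):
            smoothed[i, j] = padded[i:i + 3, j:j + 3].mean()
    return smoothed

# Toy class: 200 samples whose top half is always on.
samples = np.zeros((200, 8, 8))
samples[:, :4, :] = 1
mask = build_mask(samples)
```

On this toy input the smoothed mask is positive where pixels were usually on and negative where they were usually off, mirroring the filled and open squares of Figure 2.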

2.2 Feature Vector Generation Once the masks are generated, they can be used to generate feature vectors for input to a feature-vector-based classifier. In the ideal case that a character image is 24 x 16, the character is divided into 15 overlapping zones of size 8 x 8. The inner product of each mask with the associated zone in the character is computed and used to generate a 150-element (15 zones x 10 masks/zone) feature vector. However, a sliding window classifier must process character images of a variety of widths. We assume the line or numeric field image has been normalized by height; the width of the characters within the line or field is not known and may be variable. Thus, our feature extraction must adapt to variation in width. Furthermore, if the sliding window is the same size across the line or field, then there may be pieces of other characters inside a window centered on a character, and the feature extraction process must be able to tolerate these extraneous character pieces in the window. We computed the distribution of widths of characters in line images that were normalized to a height of 24 pixels and found that approximately 98% of the characters had widths of 24 pixels or less. Thus, we chose a window size of 24 x 24. The features are computed on the 24 x 24 window as follows: The window is divided into 15 overlapping zones, each of size 8 x 12. Each feature mask fits inside the corresponding zone at five locations, the first when the left edge of the mask is aligned with the left edge of the zone, and the others each shifted one pixel to the right until the right edge of the mask is aligned with the right edge of the zone. Thus, five values are produced for each mask, and the maximum of these values is used as the feature value for that mask. In this way, feature vectors of 150 elements are produced. This method is applicable to any window size larger than 24 x 16.

FIGURE 4. Feature extraction process for one mask on one zone of the 24 x 24 window. The 8 x 8 mask is placed at five horizontal positions within the 8 x 12 zone, producing inner-product values v1, v2, v3, v4, v5, and the feature value is v = max {v1, v2, v3, v4, v5}. This process is repeated for 10 masks for each of 15 zones, resulting in a feature vector of length 150.
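The extraction in Figure 4 can be sketched as follows. We assume the 15 zones of the 24 x 24 window are laid out as 5 rows x 3 columns of 8 x 12 regions overlapping by half in each direction, which the paper does not spell out; the function name and toy mask array are likewise illustrative.

```python
import numpy as np

def extract_features(window, masks):
    """Extract the 150-element feature vector from one 24 x 24 window.

    window: binary array of shape (24, 24).
    masks:  array of shape (15, 10, 8, 8) -- 10 class masks per zone.
    """
    features = np.zeros(150)
    k = 0
    for r in range(5):            # zone rows at vertical offsets 0, 4, ..., 16
        for c in range(3):        # zone columns at horizontal offsets 0, 6, 12
            zone = window[4 * r: 4 * r + 8, 6 * c: 6 * c + 12]   # 8 x 12 zone
            for m in range(10):
                mask = masks[3 * r + c, m]
                # Inner product at the five horizontal offsets; keep the max.
                vals = [np.sum(mask * zone[:, s:s + 8]) for s in range(5)]
                features[k] = max(vals)
                k += 1
    return features

masks = np.ones((15, 10, 8, 8))
fv_blank = extract_features(np.zeros((24, 24)), masks)   # empty window
fv_full = extract_features(np.ones((24, 24)), masks)     # all-on window
```

Taking the maximum over the five shifts is what gives the feature some tolerance to horizontal misalignment of the character within the window.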

3.0 Recognition Experiments In this section, we discuss the recognition experiments.

3.1 Data Set Generation A data collection effort was undertaken to generate training and testing sets suitable for a sliding window classifier. A truth-based segmentation technique was used to identify characters in a numeric field. Approximately 4000 numeric fields were processed to generate about 20,000 distinct numerals. These were then divided into training, statistical-analysis, and test sets. The truth-based segmentation takes as input an image of a word and a string representing the class membership of each of the characters in the image, and generates as output the boundaries of the characters in the image. The boundaries are represented as x-locations; thus the output for a word image with three characters would be of the form ((x1 x2) (x3 x4) (x5 x6)).

The technique works quite well in identifying the boundaries of the characters but is not perfect. Thus, some of the data needed to be manually cleaned before being used to develop training sets of feature vectors. Those images that were not correctly recognized by an existing numeral reader, described in [1], were manually cleaned. The technique uses dynamic programming to match representations of characters to images. For example, given an image of the word “Timbers”, representations of the characters “T”, “i”, “m”, “b”, “e”, “r”, “s” are appended to create a representation of the word, which is then matched against the image using dynamic programming. The representation used is the vertical histogram: the function of horizontal position that associates with each point x between 0 and the width of the word image the number of pixels with value 1 in the column of the word image associated with x. The results of applying the truth-based segmentation technique to images of the words “Timbers” and “KELLWAY” are shown in Figures 5 and 6. As can be seen, images of very low quality can be segmented, and therefore assigned truth, correctly using this technique. The truth-based segmentation technique was developed and implemented by Gillies and Ganzberger. Once cleaned data sets with boundaries were generated, we began to use them to generate training sets. The method of assigning truth for a sliding window classifier is open to debate. We take the middle of the boundaries to be the center of the character. Certainly, if a 24 x 24 window is centered directly on a character, then the truth at that point should be that character's class. Is this also true for a window slightly off-center? If so, how far off-center? Should the desired output of the neural network decrease as a function of the distance off-center? If so, what function? These questions all arise in assigning truth.
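The vertical-histogram representation used by the truth-based segmentation is simply a per-column count of on pixels; a minimal sketch (function name ours):

```python
import numpy as np

def vertical_histogram(word_image):
    """Number of pixels with value 1 in each column of a binary word image."""
    return word_image.sum(axis=0)

# Toy 4 x 6 binary word image.
img = np.array([
    [1, 0, 1, 0, 0, 1],
    [1, 0, 1, 0, 0, 1],
    [1, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0],
])
hist = vertical_histogram(img)   # one count per column
```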
We assign truth to a column x of an image of a numeric field by specifying the desired output of the neural network at that point. The neural networks we use are all class-coded; that is, we have one output node per class. In addition, we have one output node for non-characters. Thus, our numeric classification neural networks have eleven output nodes. We chose to assign truth as follows: The desired outputs at the center of the character and at the two locations immediately adjacent to the center are 0.9 on the corresponding character class node and low (-0.9 for the perceptron and 0.1 for the backpropagation network) elsewhere. The desired output is not defined for locations x having the property that 3 < |x - c| < 6 for some character center c. For all other locations, the desired output on all character output nodes is low. The question of training the classifier on gaps between characters also arises. If the classifier is not trained on gaps, then its behavior between characters is not predictable; if it is trained using too many gaps, then it may not be very good at classification. We are therefore generating some training samples of gaps and performing experiments with different percentages of gap images in the training set. We randomly choose gaps between characters using a uniform distribution. That is, we collect feature

FIGURE 5. Truth-based segmentation of an image of the word “Timbers”. (The figure shows the word image, its vertical histogram, the matching cost, the matched templates, and the templates used.)

FIGURE 6. Truth-based segmentation of an image of the word “KELLWAY”. (The figure shows the word image, its vertical histogram, the matching cost, the matched templates, and the templates used.)

values at some percentage of the locations between characters, where the percentage and the locations are determined by a uniform probability distribution. Gaps are also not allowed to be too close to character centers. The data collection algorithm thus performs the following:

For each numeric field with character centers {c1, c2, ..., cn}:
    collect feature vectors from the 24 x 24 windows centered at
    ci - 2, ci - 1, ci, ci + 1, and ci + 2 for each character,
    and at n% of the gap locations between the characters.

A dataset of 25,856 samples was formed in this fashion. We collected the number of samples per class shown in Table 1. Examples of characters generated by this process are shown in Figure 7.

TABLE 1. Numbers of Samples Per Class.

Class    Samples
0        3406
1        2553
2        1901
3        1298
4        1076
5        3214
6        1399
7        2843
8        1076
9        914
gaps     6000
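The data collection loop above can be sketched as follows; the gap-sampling percentage, the minimum distance of a gap sample from any character center, and the function name are illustrative assumptions rather than the values used in the paper:

```python
import random

def sample_locations(centers, field_width, gap_percent=0.2, min_dist=6):
    """Choose window-center columns for one numeric field.

    Five locations around each character center (c-2 .. c+2), plus a
    uniformly random subset of the gap columns that lie at least
    min_dist pixels from every character center.
    """
    char_locs = [c + d for c in centers for d in (-2, -1, 0, 1, 2)]
    gap_cols = [x for x in range(field_width)
                if all(abs(x - c) >= min_dist for c in centers)]
    gap_locs = random.sample(gap_cols, int(gap_percent * len(gap_cols)))
    return char_locs, gap_locs

random.seed(0)
chars, gaps = sample_locations([12, 40, 68], field_width=80)
```

A 24 x 24 feature-vector window would then be extracted at each returned column.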

4.0 Sliding Neural Network A backpropagation neural network was trained using samples of characters collected in the process described above. The network was trained with 600 samples per class, taking three vectors from each digit, for a total of 19,800 vectors used in training. The network has four layers: 150 inputs, 11 outputs (10 digit classes plus one gap output), and 64 and 32 nodes in the two hidden layers. The network trained to a classification rate of 95%. To perform numeric field recognition, the network is run on each 24 x 24 window in the image of a numeric field. Thus, at each point in the interior of the image, there are

FIGURE 7. Sample of training data used for sliding window classifiers. Each window is 24 x 24 with center near the center of the character. As can be seen, the digits are a variety of widths and parts of neighboring characters are included in the windows.

eleven scores, one for each digit class and one for gaps. These scores can be thought of as generating eleven signals to be evaluated. Examples are shown in Figures 8-10. We refer to this process as running the neural network in sliding mode. In order to “read” the numeric field, these eleven signals must be interpreted. First of all, to put the strengths of the signals on similar scales, the outputs of the neural net are passed through likelihood functions. Thus the meaning of the value of a particular signal

FIGURE 8. Results of applying the sliding neural network to an image of “75261”. The signals represent, from left to right, 0 - 9 and then gaps.

at a point now becomes the likelihood that an instance of a digit from the class associated with that signal is centered at the point. The likelihood values were thresholded, so that for each signal we have a small, discrete set of locations at which there is significant evidence of the existence of a digit. We then form hypotheses for the identity of the numeric field by creating sequences of digit classes from these discrete sets of locations. The neural net was run in sliding mode on a set of 800 digit field images. Statistics were generated on these images and used to create likelihood functions. Using these likelihood functions, 83% of the images were correctly read; that is, 83% of the images had only one hypothesis for the identity of the numeric field, and it was correct. Figures 11 and 12 contain examples of digit fields recognized incorrectly and correctly.
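As a rough sketch of this interpretation step, the following thresholds each digit signal, keeps local maxima as candidate character centers, and reads the field left to right. This is a greedy, single-hypothesis simplification; the threshold value, the peak-picking rule, and the names are our own, the gap signal is omitted, and the original system generated multiple hypothesis strings rather than one reading.

```python
import numpy as np

def read_field(signals, threshold=0.5):
    """Greedy reading of a numeric field from digit likelihood signals.

    signals: array of shape (10, width); signals[d, x] is the likelihood
    that a digit of class d is centered at column x.
    """
    hits = []
    for d in range(10):
        s = signals[d]
        for x in range(1, len(s) - 1):
            # A local maximum above threshold is a candidate digit center.
            if s[x] > threshold and s[x] >= s[x - 1] and s[x] >= s[x + 1]:
                hits.append((x, d))
    hits.sort()                        # left-to-right order
    return "".join(str(d) for _, d in hits)

# Toy likelihood signals for a field reading "752".
signals = np.zeros((10, 30))
signals[7, 5] = 0.9
signals[5, 15] = 0.8
signals[2, 25] = 0.95
result = read_field(signals)
```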

FIGURE 9. Results of applying the sliding neural network to an image of “1515”. The signals represent, from left to right, 0 - 9 and then gaps.

FIGURE 10. Results of applying the sliding neural network to an image of “5400”. The signals represent, from left to right, 0 - 9 and then gaps.

FIGURE 11. A random selection of incorrectly read digit fields.

FIGURE 12. A random selection of correctly read digit fields.

5.0 Conclusion A sliding neural network approach to reading badly degraded images of machine-printed numeric fields has been presented. The results are encouraging and suggest that segmentation-free approaches are feasible in the sense of achieving good recognition performance. The algorithm requires further development and testing. There is also a need for a high-speed architecture to implement the many arithmetic operations the algorithm requires.

6.0 Acknowledgments This work was sponsored by the Office of Advanced Technology (OAT) of the United States Postal Service under contract number 104230-86-H-0042. The authors would like to thank Joe Yacyk and Carl O’Connor of OAT, as well as Gilles Houle of Arthur D. Little for their valuable contributions to this work.

7.0 References

[1] Paul D. Gader, “Pipelined Systems for Recognition of Handwritten Digits in USPS Zip Codes,” Proceedings of the Fourth Advanced Technology Conference, United States Postal Service, 1990.