Handwritten Digit Recognition using Convolutional Neural Networks and Gabor filters. Andrés Calderón, Sergio Roa and Jorge Victorino. Grupo Simbiosis, Bogotá, Colombia. [email protected], [email protected], jorge [email protected]. http://gruposimbiosis.com. To appear in the Proceedings of the International Congress on Computational Intelligence CIIC 2003 (http://www.unalmed.edu.co/~ciic)

Abstract. In this article, the task of classifying handwritten digits using a class of multilayer feedforward network called Convolutional Network is considered. A convolutional network has the advantage of extracting and using feature information, improving the recognition of 2D shapes with a high degree of invariance to translation, scaling and other distortions. In this work, a novel type of convolutional network was implemented using Gabor filters as feature extractors in the first layer. A backpropagation algorithm specifically adapted to the problem was used in the training phase for the remaining layers. The training and test sets were taken from the MNIST database. A boosting method was applied to improve the results, using experts that learn different distributions of the training set and combining their results.

1 Introduction

Multilayer feedforward networks are commonly used in handwritten character recognition tasks. However, it has been observed that feature extraction and feature mapping are common issues which have an important influence on the results of the classifier. In most cases, a hand-crafted feature extractor is designed, specifically adapted to the problem. This is a hard task, which must be redone for each new problem. Therefore, in this work a type of multilayer perceptron called Convolutional Network (CNN), first described in LeCun et al. [9], is presented. It is specifically designed to recognize two-dimensional shapes with a high degree of invariance to translation, scaling, skewing, and other forms of distortion [5]. This network includes in its structure some forms of constraints: feature extraction, feature mapping and subsampling. In these experiments, the network architecture is composed of an input layer, five hidden layers and an output layer. The topology of a typical CNN contains two types of hidden layers. Some of them perform convolution, i.e., each such layer is composed of several feature maps which are used for local feature extraction. This is achieved by an operation equivalent to convolution, followed by an additive bias and a squashing function. The others perform a local averaging and sub-sampling, reducing the resolution of the feature map and the sensitivity of the output to shifts and distortions [10]. In this work, experimental results were improved by using Gabor filters as feature maps for the first layer of the network, instead of a usual convolutional layer. Therefore, this modified network topology will be called GCNN. Multiresolution analysis was used in order to obtain different feature maps for the first layer. A gradient-based algorithm was used to adapt the weights of the other layers. Finally, it is shown that the use of committee machines improves the results. In this case, a boosting machine was trained. This article is organized as follows. First, the input and output data are described. Then, the GCNN network, the training algorithm and the boosting technique are explained. Finally, some experimental results and conclusions are presented.

2 Data

In this problem, the task of classifying individual handwritten characters is studied. Training examples were taken from the MNIST database, which is available at http://yann.lecun.com/exdb/mnist/index.html. LeCun and others designed this database [10], which is composed of 60000 training examples and 10000 test examples. Data were collected among Census Bureau employees and high school students. It has been shown that the larger the training set, the better the results obtained [10]. The original images have a normalized size of 28x28 and contain gray levels as a result of anti-aliasing. In figure 1 some training data are shown.

Fig. 1. Some training examples from the MNIST database.

The values of the pixels are normalized so that the background level (white) corresponds to a value of −0.1 and the foreground (black) corresponds to 1.175. This makes the mean input roughly 0 and the variance roughly 1, which accelerates learning [12]. The target values are 10 gray-scale images of hand-designed digits of size 12x7. However, in this case, only two values were used, the background (1) and the foreground (−1) colors, resulting in binary images. These images were designed so that each digit contains enough differentiable features to facilitate discrimination. The classifier system and its topology will be described in section 3.
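As a sketch, the pixel scaling described above is a single affine transform of the raw MNIST bytes; the function name below is ours, not from the paper:

```python
import numpy as np

def normalize_mnist(images):
    # Map uint8 pixels so background (0) -> -0.1 and foreground (255) -> 1.175,
    # giving inputs with mean roughly 0 and variance roughly 1 [12].
    return images.astype(np.float32) / 255.0 * (1.175 + 0.1) - 0.1
```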


3 Convolutional networks using Gabor filters

Convolutional networks combine three architectural ideas to ensure some degree of shift, scale and distortion invariance: local receptive fields, shared weights (or weight replication), and spatial sub-sampling. With local receptive fields, neurons can extract elementary visual features such as oriented edges, end-points, corners, etc. In addition, elementary feature detectors that are useful on one part of the image may also be useful across the entire image. This knowledge can be applied by forcing different receptive fields to use identical weight vectors [4]. Successive layers of convolution and sub-sampling are typically alternated, resulting in a “bi-pyramidal” effect: at each convolutional or subsampling layer, the number of feature maps (richness of the representation) is increased while the spatial resolution is reduced, compared to the corresponding previous layer, obtaining a large degree of invariance to geometric transformations [5,10]. This process is inspired by the notion of “simple” cells followed by “complex” cells that was first described by Hubel and Wiesel [6]. The weight sharing technique has the interesting side effect of reducing the number of free parameters, thereby reducing the “capacity” of the machine and the gap between test error and training error [8]. Gabor filters are used for multiresolution analysis. They describe an image at different levels of frequency. Such representations permit a hierarchical search for objects in a given image. Therefore, different features are extracted, depending on the response of each filter and its frequency. A Gabor filter representation might also have a biological justification. In experiments with cats [7], it was shown that neurons in the visual cortex have receptive fields similar to the real and imaginary parts of Gabor filters. Daugman [2] encouraged the use of Gabor filters as a representation for receptive fields and their use in recognition systems.
A family Ψ of Gabor filters with N frequency levels and M orientations is defined by:

Ψ_j(x) = \frac{k_j^2}{\sigma^2} \, e^{-\frac{k_j^2 x^2}{2\sigma^2}} \left( e^{i k_j \cdot x} - e^{-\frac{\sigma^2}{2}} \right)   (1)

where

k_j = \begin{pmatrix} k_v \cos\varphi_\mu \\ k_v \sin\varphi_\mu \end{pmatrix}, \qquad k_v = 2^{-\frac{v+2}{2}}\,\pi, \qquad \varphi_\mu = \mu\,\frac{\pi}{M}   (2)

and k_j is the wave vector, v = 0, ..., N−1 is the frequency index, µ = 0, ..., M−1 is the orientation index, j = µ + Mv, and σ is the standard deviation [1].
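Equations (1)-(2) can be sketched directly in code; the spatial grid size and the function name below are our own illustrative choices, not from the paper:

```python
import numpy as np

def gabor_kernel(v, mu, M=6, sigma=2.4, size=9):
    # Wave vector, eq. (2): k_v = 2^{-(v+2)/2} * pi, phi_mu = mu * pi / M
    kv = 2.0 ** (-(v + 2) / 2.0) * np.pi
    phi = mu * np.pi / M
    kj = np.array([kv * np.cos(phi), kv * np.sin(phi)])
    k2 = kj @ kj
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    x2 = xs ** 2 + ys ** 2
    # Filter, eq. (1): (k^2/sigma^2) e^{-k^2 x^2/(2 sigma^2)} (e^{i k.x} - e^{-sigma^2/2})
    envelope = (k2 / sigma ** 2) * np.exp(-k2 * x2 / (2 * sigma ** 2))
    carrier = np.exp(1j * (kj[0] * xs + kj[1] * ys)) - np.exp(-sigma ** 2 / 2)
    return envelope * carrier

# The G1 layer uses N = 2 frequencies and M = 6 orientations: 12 sublayers.
bank = [gabor_kernel(v, mu) for v in range(2) for mu in range(6)]
```

Each kernel is complex-valued; the real and imaginary parts correspond to the even and odd receptive-field profiles mentioned above.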

3.1 Topology

As described in section 1, a GCNN is composed of an input layer and six further layers of four types, called G1, S2, C3, S4, C5 and F6. The input layer, made up of 28x28 sensory nodes, receives the images of different characters that have been normalized in size. Thereafter, the computational architecture is as follows (see figure 2): G1 is a layer of Gabor filters connected to the input layer. It is the only layer without trainable parameters. Its function is

Fig. 2. Topology of the GCNN: input image (28x28) → G1: Gabor layer, 12 sublayers (28x28) → S2: subsampling layer, 12 sublayers (14x14) → C3: convolutional layer, 16 sublayers (10x10) → S4: subsampling layer, 16 sublayers (5x5) → C5: convolutional layer, 120 sublayers (1x1) → F6: output layer, 1 sublayer (1x84).

similar to that of a convolutional layer, acting as a feature extractor. This layer is composed of 12 sublayers, which are the Gabor filter responses at 2 different frequencies and 6 orientations, with σ = 2.4. These parameter values were found experimentally to work well. The size of each sublayer is 28x28. S2 and S4 are subsampling layers. In general, this type of layer contains the same number of sublayers as its input layer. Let Cx be the layer which is input to a subsampling layer Sy. Each neuron in Sy has a receptive field of size 2x2, i.e., a local averaging of the associated neurons in the respective feature map in Cx is performed. Therefore, each sublayer has 1/4 the size of the corresponding feature map in Cx. Finally, the neuron value is computed by multiplying this average by a weight (trainable coefficient), adding a bias and passing the result through the sigmoid activation function. The subsampling units perform something like a “noisy OR” or a “noisy AND” if the weight is large; if it is small, a simple blurring occurs [10]. C3 and C5 are convolutional layers. A convolutional layer Cx is made up of a number of sublayers, depending on the existence of connections between each previous feature map in a sub-sampling layer Sy and the neurons of a feature map in Cx. The size of the convolution mask used is 5x5, reducing the rows and columns of the sublayers by 4 units with respect to the associated previous sublayers. Convolution masks are represented by their trainable parameters and a corresponding bias. Let |Cx|, |Sy| and M be the number of sublayers (and hence biases), the number of previous sub-sampling sublayers, and the size of the convolution mask, respectively. It can be verified that the total number of free parameters T in Cx is T = |Cx| × |Sy| × M + |Cx|. Afterwards, the result of the convolution is passed through the activation function. The output layer F6 was designed in this work as a perceptron.
This layer is totally connected to its previous layer C5. It is composed of 84 neurons, representing the output of the net as a gray-scale image. Such a representation is not particularly useful for recognizing isolated characters, but it helps, e.g., when recognizing characters from the ASCII set, because similar characters will have similar output codes. This is useful when a linguistic processor is used to correct such confusions [10]. It is expected to use this approach in future experiments with other training sets. The use of the values −1 and +1 for the desired responses keeps the sigmoids from saturating. The activation function in the GCNN is a hyperbolic tangent, which is used for faster convergence [10].
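A minimal sketch of the two trainable computations described in this section, assuming the helper names below (they are ours): a 2x2 averaging subsampling unit with one trainable coefficient and bias, and the free-parameter count T for a convolutional layer.

```python
import numpy as np

def subsample(feature_map, weight, bias):
    # 2x2 local averaging, then trainable coefficient, bias and squashing (tanh)
    h, w = feature_map.shape
    avg = feature_map.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
    return np.tanh(weight * avg + bias)

def conv_free_params(n_sublayers, n_inputs, mask_size=25):
    # T = |Cx| * |Sy| * M + |Cx|: one 5x5 mask (M = 25 weights) per
    # (input, output) sublayer pair, plus one bias per output sublayer
    return n_sublayers * n_inputs * mask_size + n_sublayers
```

For example, a hypothetical fully connected C3 with 16 sublayers fed by 12 input sublayers would have conv_free_params(16, 12) = 4816 free parameters; the actual count is lower when some of the connections are omitted.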

3.2 Training algorithm

Let y_j(n) be the output of the j-th neuron in F6 and d_j(n) its desired response at time n. The error signal is defined by:

e_j(n) = d_j(n) − y_j(n).   (3)

Then, the instantaneous error is defined as follows:

E(n) = \frac{1}{2} \sum_j e_j^2(n).   (4)

The learning objective is the minimization of the averaged squared error over the whole training set (N patterns):

E_avg = \frac{1}{N} \sum_{n=1}^{N} E(n).   (5)

Therefore, the gradient information ∂E(n)/∂w_ji is used to minimize eq. 5, i.e., to adjust the trainable parameters, where w_ji(n) is a parameter connecting neuron i to neuron j. The forward propagation is performed in the following manner:
– Weights are initialized randomly using a normal distribution with mean zero and a standard deviation σ_w chosen so that the weighted sum at each unit has standard deviation roughly 1, keeping the units in the sigmoid's linear region. The advantages are that gradients are large enough for efficient learning and the network learns the linear part of the mapping before the more difficult nonlinear part [11]. This can be achieved by using σ_w = m^{−1/2}, where m is the number of inputs to the unit.
– The learning rate η is adjusted using a search-then-converge schedule of the form η(n) = η0 / (1 + n/τ), where η0 is the initial learning rate, τ is a constant and n is the current training epoch. This schedule is used to avoid parameter fluctuation through a training cycle, preventing a decrease in the network training performance [5].
– Stochastic (online) updating was used and, in addition, the training examples were presented in random order, helping to avoid convergence to poor local minima [5].
– Gabor filters are calculated and their output is fed to the first subsampling layer. The forward propagation step is then performed layer by layer, as explained above.
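The initialization rule and the learning-rate schedule above can be sketched as follows; the function names and the default η0, τ values are illustrative assumptions, not values reported in the paper:

```python
import numpy as np

def init_weights(m, rng=None):
    # Mean-zero normal with sigma_w = m^{-1/2}, m = number of inputs to the
    # unit, so the unit's weighted sum has standard deviation roughly 1 [11]
    rng = rng or np.random.default_rng(0)
    return rng.normal(0.0, m ** -0.5, size=m)

def learning_rate(n, eta0=0.1, tau=10.0):
    # Search-then-converge schedule: eta(n) = eta0 / (1 + n / tau)
    return eta0 / (1.0 + n / tau)
```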


The backpropagation step is performed when the error signal corresponding to a specific pattern is calculated. The backpropagation algorithm is derived from the well-known standard backpropagation method. In general, the backpropagation (bprop) method of a function F is a multiplication by the Jacobian of F . For example, the bprop of a “Y” connection is a sum and vice-versa. The bprop method of a multiplication by a coefficient is the multiplication by the same coefficient. The bprop method of a multiplication by a matrix is a multiplication by the transpose of that matrix. The bprop method of an addition with a constant is the identity [10].
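For instance, the matrix-multiplication rule can be checked numerically; the values below are purely illustrative:

```python
import numpy as np

# Forward: y = W x. The bprop method of a multiplication by a matrix is a
# multiplication by the transpose of that matrix: grad_x = W^T grad_y.
W = np.array([[1.0, 2.0],
              [3.0, 4.0]])
x = np.array([0.5, -1.0])
grad_y = np.array([1.0, 1.0])  # error signal arriving from the layer above
grad_x = W.T @ grad_y          # -> [4.0, 6.0]
```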

4 Boosting method

The boosting method was initially described in Schapire [13]. Boosting is a class of committee machine, which combines the decisions of different experts to come up with a superior overall decision, distributing the learning task among the experts. The primary idea of boosting is to produce an accurate prediction rule by combining rough and moderately inaccurate weak subhypotheses. In a boosting machine the experts are trained on data sets with entirely different distributions [5]. Suppose that three experts (subhypotheses) each have an error rate of ε < 1/2 with respect to the distributions on which they were trained. In Schapire [13], it is proved that the overall error rate is bounded by a function of ε that is significantly smaller than ε. Drucker et al. [3] developed a specific boosting method for the LeNet-4 architecture of Convolutional networks, reaching better results. The boosting method developed in this work is a type of boosting by filtering. Let (x1, y1), ..., (xN1, yN1) be the complete training set of inputs xi and associated labels yi, where N1 = 60000. A first expert is trained on this complete set, as described in section 3. This first expert is used to filter another set of examples in the following manner:
– The set of N2 misclassified examples is saved, in order to be used as half of the training set for the second expert.
– The remaining 50% of the new training examples are obtained from the correctly classified examples. However, it was found that choosing a fixed set of correctly classified examples reduced the performance of the machine. Randomly resampling these 60000 − N2 examples during the remaining 50% of the training time was shown to be more accurate. This might be explained because the second expert tries to generalize over the whole training set, while specializing on the set of examples which are misclassified by the first net.
Once the second expert has been trained as usual, a third training set is formed for the third expert:
– The whole set of examples is passed through the first and second experts. If either of them misclassifies a specific pattern, it is added to the new training set. This set of N3 examples is saved.


– As for the second expert, the remaining 60000 − N3 examples were randomly resampled during half of the training time, learning in this case a different distribution.
In these experiments the output of the boosting machine was obtained by simply adding the results of the three experts. Experimentally, this was shown to be more effective than voting.
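The filtering scheme for experts two and three, and the additive combination, can be sketched as follows; the helper names are ours and the experts are stand-in callables, not trained GCNNs:

```python
import random
import numpy as np

def filtered_indices(experts, data, labels, rng=None):
    # Misclassified examples (by any expert) form half of the new training
    # set; the other half is sampled from the correctly classified ones.
    rng = rng or random.Random(0)
    wrong = [i for i, (x, y) in enumerate(zip(data, labels))
             if any(e(x) != y for e in experts)]
    right = [i for i in range(len(data)) if i not in set(wrong)]
    return wrong + rng.sample(right, min(len(wrong), len(right)))

def boosted_scores(expert_outputs):
    # The boosting machine simply adds the experts' output vectors;
    # the predicted class is the argmax of the summed scores.
    return np.sum(expert_outputs, axis=0)
```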

5 Experimental results

The first GCNN was trained using the 60000 training patterns from the MNIST database over 12 epochs of training. In previous experiments a usual convolutional layer was used as the first layer; using Gabor filters as feature extractors improved the results. Then, the boosting method was applied and two additional experts were trained: 582 misclassified examples and 600 epochs for the second net, and 1595 patterns and 300 epochs for the third net, for which the response of at least one of the first two experts was incorrect. The test set consists of 10000 examples, which were used to assess the performance of the net. The classification results can be observed in figure 3.

Fig. 3. Comparative results over the MNIST test set (percentage of misclassification):

Classifier            Test error (%)
K-NN Euclidean             5.0
40 PCA + quadratic         3.3
1000 RBF + linear          3.6
SVM poly 4                 1.1
RS-SVM poly 5              1.0
LeNet-4 (CNN)              1.1
LeNet-5 (CNN)              0.95
First GCNN                 0.84
Second GCNN                1.39
Third GCNN                 1.31
Boosted GCNN               0.68

It can be observed that the use of the boosting method improves the results. Though the second and third experts are individually less accurate, the combination of their results is effective, because they recognize some patterns which are more difficult for the first expert to learn. The experts improve their generalization ability through the modified sampling policy described in section 4.


In general, the efficiency of Convolutional networks in pattern recognition and image processing tasks was demonstrated. Figure 3 also shows comparative results against other classifiers.

6 Conclusions

In this work, convolutional neural networks were applied to handwritten digit recognition. The goal was the recognition of patterns taken from the MNIST database. CNNs were modified by the use of Gabor filters, which are known to be good feature extractors. The results demonstrated that CNNs perform pattern recognition effectively, incorporating in their structure feature extraction and feature mapping characteristics which are well adapted to the invariances usually found in pattern recognition problems. Likewise, it was shown that Gabor filters can be appropriately incorporated into a CNN architecture, because both methods rest on similar principles. The effectiveness of the boosting method in improving classification was also confirmed. The backpropagation algorithm was likewise shown to be efficient for the training task. In future work it is expected to improve the learning and classification tasks through techniques such as network pruning and variations in connection and feature mapping policies. Furthermore, modification of the Gabor filter parameters, improvements to the committee machines and the development of better learning algorithms and network topologies can be studied. The use of other training sets, such as alphanumeric character sets, is an important challenge.

References

1. L. Cortés and A. Calderón. Reconocimiento de Patrones con Invarianza bajo Transformaciones Geométricas usando Redes Neuronales Artificiales. Proyecto de grado, Departamento de Ing. de Sistemas, Universidad Nacional de Colombia, Bogotá, 1998.
2. J. Daugman. Complete discrete 2D Gabor transforms by neural networks for image analysis and compression. IEEE Transactions on Acoustics, Speech and Signal Proc., 36(7):1169–1179, 1988.
3. H. Drucker, R. Schapire, and P. Simard. Improving performance in neural networks using a boosting algorithm. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 42–49. Morgan Kaufmann, San Mateo, Calif., 1993.
4. K. Fukushima and S. Miyake. Neocognitron: a new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 15:455–469, 1982.
5. Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, Upper Saddle River, N.J., 1999.
6. D.H. Hubel and T.N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160:106–154, 1962.
7. J. Jones and L. Palmer. An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58(6):1233–1258, 1987.


8. Y. LeCun. Generalization and network design strategies. In R. Pfeifer, Z. Schreter, F. Fogelman, and L. Steels, editors, Connectionism in Perspective, Zurich, Switzerland, 1989. Elsevier. An extended version was published as a technical report of the University of Toronto.
9. Y. LeCun and Y. Bengio. Convolutional Networks for Images, Speech, and Time-Series. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks. MIT Press, 1995.
10. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-Based Learning Applied to Document Recognition. In Intelligent Signal Processing, pages 306–351. IEEE Press, 2001.
11. Y. LeCun, L. Bottou, G. Orr, and K. Müller. Efficient backprop. In G. Orr and K. Müller, editors, Neural Networks: Tricks of the Trade. Springer, 1998.
12. Y. LeCun, I. Kanter, and S. Solla. Eigenvalues of covariance matrices: application to neural-network learning. Physical Review Letters, 66(18):2396–2399, May 1991.
13. Robert E. Schapire. The Strength of Weak Learnability. Machine Learning, 5:197–227, 1990.

Abstract. In this article, the task of classifying handwritten digits using a class of multilayer feedforward network called Convolutional Network is considered. A convolutional network has the advantage of extracting and using features information, improving the recognition of 2D shapes with a high degree of invariance to translation, scaling and other distortions. In this work, a novel type of convolutional network was implemented using Gabor filters as feature extractors at the first layer. A backpropagation algorithm specifically adapted to the problem was used in the training phase for the rest of layers. The training and test sets were taken from the MNIST database. A boosting method was applied to improve the results by using experts that learn different distributions of the training set and combining its results.

1

Introduction

Multilayer feedforward networks are commonly used in handwritten character recognition tasks. However, it has been observed that feature extraction and feature mapping are common issues which have an important influence in the results of the classifier. In most cases, a hand-crafted feature extractor is often designed, specifically adapted to the problem. This is a hard task, which must be redone for each new problem. Therefore, in this work a type of multilayer perceptron called Convolutional Network (CNN), firstly described in LeCun et al.[9], is presented. It is specifically designed to recognize two-dimensional shapes with a high degree of invariance to translation, scaling, skewing, and other forms of distortion [5]. This network includes in its structure some forms of constraints: feature extraction, feature mapping and subsampling. In these experiments, the network architecture is composed by an input layer, five hidden layers and an output layer. The topology of a typical CNN contains two types of hidden layers. Some of them perform convolution, i.e., each layer is composed by some feature maps which are used for local feature extraction. This is achieved by an operation

2

Andr´es Calder´ on, Sergio Roa and Jorge Victorino

equivalent to convolution, followed by an additive bias and squashing function. The others perform a local averaging and a sub-sampling, reducing the resolution of the feature map, and reducing the sensitivity of the output to shifts and distortions [10]. In this work, experimental results were improved using Gabor filters as feature maps for the first layer of the network, instead of using an usual Convolutional layer. Therefore, this modified network topology will be called GCNN. Multiresolution analysis was used in order to obtain different feature maps for the first layer. A gradient-based algorithm to adapt the weights of the other layers was used. Finally, it is shown that the use of committee machines improve the results. In this case, a boosting machine was trained. This article is organized as follows. At first, the input and output data are described. Then, the GCNN network, the training algorithm used and the boosting technique are explained. Finally, some experimental results and conclusions are presented.

2

Data

In this problem, the task of classifying individual handwritten characters is studied. Training examples were taken from the MNIST database, which is available at http://yann.lecun.com/exdb/mnist/index.html. LeCun and others designed this database [10], which is composed by 60000 training examples and 10000 test examples. Data were collected among Census Bureau employees and high school students. It has been demonstrated that the higher the training set size the better results are obtained [10]. The original images have a normalized size of 28x28 and contain gray levels as a result of the anti-aliasing. In figure 1 some training data are shown.

Fig. 1. Some training examples from the MNIST database. The values of the pixels are normalized so that the background level (white) corresponds to a value of −0.1 and the foreground (black) corresponds to 1.175. This makes the mean input roughly 0, and the variance roughly 1 which accelerates learning [12]. The target values are 10 gray-scale images of hand-designed digits of size 12x7. However, in this case, only two values were used, the background (1) and the foreground (−1) colors, resulting in binary images. These images were designed so that each digit contains enough differentiable features to facilitate the discrimination. A description of the classifier system and its topology will be explained in section 3.

Handwritten Digit Recognition using GCNNs

3

3

Convolutional networks using Gabor filters

Convolutional networks combine three architectural ideas to ensure some degree of shift, scale and distortion invariance: local receptive fields, shared weights (or weight replication), and spatial sub-sampling. With local receptive fields, neurons can extract elementary visual features such as oriented edges, end-points, corners, etc. In addition, elementary feature detectors that are useful on one part of the image may also be useful across the entire image. This knowledge can be applied by forcing the use of identical weight vectors by different receptive fields [4]. Successive layers of convolution and sub-sampling are typically alternated, resulting in a “bi-pyramidal” effect. That is, at each convolutional or subsampling layer, the number of feature maps (richness of the representation) is increased while the spatial resolution is reduced, compared to the corresponding previous layer, obtaining a large degree of invariance to geometric transformations [5,10]. This process is inspired by the notion of “simple” cells followed by “complex” cells that was first described in Hubel and Wiesel [6]. The weight sharing technique has the interesting side effect of reducing the number of free parameters, thereby reducing the “capacity” of the machine and reducing the gap between test error and training error [8]. Gabor filters are used for multiresolution analysis. They describe an image in different levels of frequency. Such representations permit a hierarchical search of objects in a given image. Therefore, different features are extracted, depending on the response of each filter and its frequency. A Gabor filter representation might have a biological justification. In experimental results with cats [7], it was shown that neurons in visual cortex have similar receptive fields to the real and imaginary part of Gabor filters. Daugman [2] encouraged the use of Gabor filters as a representation for receptive fields and their use in recognition systems. 
A family Ψ of Gabor filters of N frequency levels and M orientations is defined by: k2j − k2j x2 ikj x σ2 − e− 2 e 2σ2 e 2 σ v+2 π kv cos ϕµ kj = , kv = 2− 2 π, ϕµ = µ kv sin ϕµ M Ψ=

where

(1)

(2)

where kj is the wave vector, v = 0, ..., N −1 is the frequency index, µ = 0, ..., M − 1 is the orientation index, j = µ + M v and σ the standard deviation [1]. 3.1

Topology

As described in section 1, a GCNN is composed by an input layer and six layers of four types called G1 , S2 , C3 , S4 , C5 and F6 . The input layer, made up of 28x28 sensory nodes, receives the images of different characters that have been normalized in size. Thereafter, the computational architecture is as follows (see figure 2): G1 is a layer which is represented by Gabor filters connected to the input layer. This is the unique layer which lacks training parameters. Its function is

4

Andr´es Calder´ on, Sergio Roa and Jorge Victorino Input

Input image (28 x 28)

G1

Gabor layer 12 sublayers (28x28)

C3

S2

Subsampling layer 12 sublayers (14x14)

Convolutional layer 16 sublayers (10x10)

Subsampling layer 16 sublayers (5x5)

S4

Convolutional layer 120 sublayers (1x1)

C5

F6

Output layer 1 sublayer (1x84)

Fig. 2. Topology of the GCNN.

similar to a convolutional layer one, acting as a feature extractor. This layer is composed by 12 sublayers, which are in particular the Gabor filter responses to 2 different frequencies and 6 orientations, with σ = 2.4. The selection of these parameters was found to be experimentally efficient. The size of each sublayer is 28x28. S2 and S4 are subsampling layers. In general, this type of layer is divided by the same amount of sublayers received from its inputs. Let be Cx the layer which is input to a subsampling layer Sy . Each neuron in Sy has a receptive field of size 2x2, i.e., a local averaging of the associated neurons in the respective feature map in Cx is performed. Therefore, each sublayer have 1/4 the size as the corresponding feature map in Cx . Finally, the neuron value is computed by multiplying this average by the weight (trainable coefficient), adding a bias and passing the result through the sigmoid activation function. A “noisy OR” or a “noisy AND” is performed by the subsampling units if the weight is large. If it is small, a simple blurring occurs [10]. C3 and C5 are convolutional layers. A convolutional layer Cx is made up a determined number of sublayers, depending on the existence of connections between each previous feature map in a sub-sampling layer Sy and a neuron in a feature map in Cx . The size of the convolution mask used is 5x5, reducing the rows and columns of the sublayers by 4 units with respect to the associated previous sublayers. Convolution masks are represented by its training parameters and corresponding bias. Let be |Cx |, |Sy | and M the amount of sublayers or biases, the amount of previous sub-sampling layers and the size of the convolution mask, respectively. It can be verified that the total number of free parameters T in Cx is defined by T = |Cx | × |Sy | × M + |Cx |. Afterwards, the result of convolution is passed through the activation function. The output layer F6 was designed in this work as a perceptron. 
This layer is totally connected to its previous layer C5 . It is composed by 84 neurons, representing the output of the net as a gray-scale image. Such representation is not particularly useful for recognition of isolated characters, but, e.g., for recognizing characters from the ASCII set, because some of them are similar and they will have similar output codes. This is useful when a linguistic processor is

Handwritten Digit Recognition using GCNNs

5

used to correct such confusions [10]. It is expected that this approach will be used in future experiments with other training sets. Using the values −1 and +1 for the desired responses prevents the sigmoids from saturating. The activation function in the GCNN is a hyperbolic tangent, which gives faster convergence [10].

3.2 Training algorithm

Let y_j be the output of the j-th neuron in F6 and d_j its desired response at time n. The error signal is defined by

e_j(n) = d_j(n) − y_j(n).   (3)

Then, the instantaneous error is defined as

E(n) = (1/2) \sum_j e_j^2(n).   (4)

The learning objective is the minimization of the averaged squared error over the whole training set of N patterns:

E_avg = (1/N) \sum_{n=1}^{N} E(n).   (5)
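Numerically, eqs. (3)–(5) amount to the following (the targets and outputs below are toy values, for illustration only):

```python
import numpy as np

def instantaneous_error(d, y):
    """E(n) = 1/2 * sum_j e_j(n)**2, with e_j(n) = d_j(n) - y_j(n)."""
    e = d - y                       # eq. (3), componentwise
    return 0.5 * np.sum(e ** 2)     # eq. (4)

# averaged squared error over a toy "training set" of N = 2 patterns, eq. (5)
D = np.array([[+1.0, -1.0],         # desired responses (targets are +/-1)
              [-1.0, +1.0]])
Y = np.array([[+0.5, -0.5],         # network outputs
              [-1.0,  0.0]])
E_avg = np.mean([instantaneous_error(d, y) for d, y in zip(D, Y)])
```

Here the two instantaneous errors are 0.25 and 0.5, so E_avg = 0.375.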

Therefore, the gradient information ∂E(n)/∂w_ji is used to minimize eq. 5, i.e., to adjust the trainable parameters, where w_ji(n) is a parameter connecting neuron j to neuron i. The forward propagation is performed in the following manner:
– Weights are initialized randomly using a normal distribution with mean zero and a standard deviation σ_w chosen so that each unit's weighted sum has standard deviation roughly 1, allowing the sigmoids to operate over their linear region. The advantages are that the gradients are large enough for efficient learning and that the network learns the linear part of the mapping before the more difficult nonlinear part [11]. This can be achieved by using σ_w = m^(−1/2), where m is the number of inputs to the unit.
– The learning rate η is adjusted using a search-then-converge schedule of the form η(n) = η_0/(1 + n/τ), where η_0 is the initial learning rate, τ is a constant and n is the current training epoch. This schedule avoids parameter fluctuations through a training cycle, which would degrade the network's training performance [5].
– Stochastic updating was used and, in addition, the training examples were presented in random order, which helps to avoid convergence to local minima [5].
– The Gabor filters are computed and their outputs are fed to the first subsampling layer. The forward propagation step is then performed layer by layer, as explained above.
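The initialization and learning-rate schedule above can be sketched as follows; the values of η_0 and τ are hypothetical, since the paper does not report the ones it used.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_weights(m, n_units):
    """Zero-mean normal weights with sigma_w = m**(-1/2), where m is the
    number of inputs to the unit, keeping the sigmoids in their linear region."""
    return rng.normal(0.0, m ** -0.5, size=(n_units, m))

def learning_rate(n, eta0=0.05, tau=10.0):
    """Search-then-converge schedule: eta(n) = eta0 / (1 + n / tau)."""
    return eta0 / (1.0 + n / tau)

W = init_weights(m=25, n_units=12)   # e.g. units with a 5x5 receptive field
```

learning_rate(0) returns η_0, and the rate halves by epoch n = τ.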


The backpropagation step is performed once the error signal corresponding to a specific pattern has been calculated. The algorithm is derived from the well-known standard backpropagation method. In general, the backpropagation (bprop) method of a function F is a multiplication by the Jacobian of F. For example, the bprop of a "Y" (fan-out) connection is a sum, and vice versa. The bprop method of a multiplication by a coefficient is a multiplication by the same coefficient. The bprop method of a multiplication by a matrix is a multiplication by the transpose of that matrix. The bprop method of an addition with a constant is the identity [10].
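For instance, for the linear map y = W x, the bprop step multiplies the upstream gradient by the transpose of W; a finite-difference check confirms it (the shapes below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
g_y = rng.normal(size=3)            # upstream gradient dL/dy

# bprop of y = W @ x: multiplication by the transpose of W
g_x = W.T @ g_y                     # dL/dx

# check one component against a finite difference of L(x) = g_y . (W x)
eps = 1e-6
x2 = x.copy()
x2[0] += eps
numeric = (g_y @ (W @ x2) - g_y @ (W @ x)) / eps
assert abs(numeric - g_x[0]) < 1e-4

# a "Y" (fan-out) connection copies x to two branches on the forward pass,
# so its bprop is the sum of the two incoming gradients:
g_a, g_b = rng.normal(size=4), rng.normal(size=4)
g_fanout = g_a + g_b
```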

4 Boosting method

The boosting method was initially described by Schapire [13]. Boosting belongs to the class of committee machines, which combine the decisions of different experts to arrive at a superior overall decision, distributing the learning task among the experts. The basic idea of boosting is to produce an accurate prediction rule by combining rough and moderately inaccurate weak subhypotheses. In a boosting machine the experts are trained on data sets with entirely different distributions [5]. Suppose that three experts (subhypotheses) each have an error rate ε < 1/2 with respect to the distributions on which they were trained. Schapire [13] proves that the overall error rate is then bounded by a function of ε that is significantly smaller than ε itself. Drucker et al. [3] developed a specific boosting method for the LeNet-4 convolutional network architecture, obtaining better results.

The boosting method developed in this work is a type of boosting by filtering. Let (x_1, y_1), ..., (x_N1, y_N1) be the complete training set of inputs x_i and associated labels y_i, where N1 = 60000. A first expert is trained on this complete set, as described in section 3. This first expert is then used to filter a new set of examples in the following manner:
– The set of N2 misclassified examples is saved and used as half of the training set for the second expert.
– The remaining 50% of the training examples are drawn from the correctly classified examples. However, it was found that choosing a fixed subset of correctly classified examples reduced the performance of the machine. Randomly resampling these 60000 − N2 examples during the remaining 50% of the training time proved more accurate. This may be because the second expert then generalizes better over the whole training set while still specializing on the examples misclassified by the first net.
Once the second expert has been trained as usual, a third training set is formed for the third expert:
– The whole training set is passed through the first and second experts. If either of them misclassifies a pattern, that pattern is added to the new training set. This set of N3 examples is saved.


– As for the second expert, the remaining 60000 − N3 examples were sampled randomly during half of the training time, so that the third expert also learns a different distribution.
In these experiments the output of the boosting machine was obtained by simply adding the results of the three experts. Experimentally, this proved more effective than voting.
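A minimal sketch of the final combination step, summing the experts' output vectors and taking the best class; the three-class scores below are hypothetical:

```python
import numpy as np

def boosted_decision(expert_outputs):
    """Add the output vectors of the experts and pick the largest entry
    (found experimentally in this work to beat plain voting)."""
    return int(np.argmax(np.sum(expert_outputs, axis=0)))

# hypothetical class scores of the three experts for a 3-class toy problem
e1 = np.array([0.1, 0.8, 0.1])
e2 = np.array([0.4, 0.3, 0.3])        # this expert alone would pick class 0
e3 = np.array([0.2, 0.5, 0.3])
pred = boosted_decision([e1, e2, e3])  # summed scores [0.7, 1.6, 0.7] -> class 1
```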

5 Experimental results

The first GCNN was trained on the 60000 training patterns from the MNIST database for 12 epochs. In previous experiments a usual convolutional layer was used as the first layer; replacing it with Gabor filters as feature extractors improved the results. The boosting method was then applied and two additional experts were trained: the second net used the 582 misclassified examples and 600 epochs, while the third net used 1595 patterns, on which at least one of the first two experts was incorrect, and 300 epochs. The test set consists of 10000 examples, which were used to assess the performance of the net. The classification results can be observed in figure 3.

K-NN Euclidean       5.0
40 PCA + quadratic   3.3
1000 RBF + linear    3.6
SVM poly 4           1.1
RS-SVM poly 5        1.0
LeNet-4 (CNN)        1.1
LeNet-5 (CNN)        0.95
First GCNN           0.84
Second GCNN          1.39
Third GCNN           1.31
Boosted GCNN         0.68

Fig. 3. Comparative results over the MNIST test set (percentage of misclassification).

It can be observed that the use of the boosting method improves the results. Although the second and third experts are individually less accurate, the combination of their results is effective, because they recognize some patterns that are harder for the first expert to learn. The experts improve their generalization ability through the modified sampling policy described in section 4.


In general, the efficiency of convolutional networks in pattern recognition and image processing tasks was demonstrated. Figure 3 also shows the comparative results against other classifiers.

6 Conclusions

In this work, convolutional neural networks were applied to handwritten digit recognition. The goal was the recognition of patterns taken from the MNIST database. The CNNs were modified by the use of Gabor filters, which are known to be good feature extractors. The results demonstrated that CNNs perform pattern recognition effectively, incorporating in their structure feature extraction and feature mapping mechanisms that are well adapted to the invariances usually found in pattern recognition problems. Likewise, it was shown that Gabor filters can be naturally incorporated into a CNN architecture, because both methods rest on similar principles. The effectiveness of the boosting method in improving classification was also confirmed, and the backpropagation algorithm was shown to be efficient for the training task.

In future work it is expected to improve the learning and classification tasks through techniques such as network pruning and variations in the connection and feature mapping policies. Furthermore, modifications of the Gabor filter parameters, improvements of the committee machines and the development of better learning algorithms and network topologies can be studied. The use of other training sets, such as alphanumeric character sets, is an important challenge.

References

1. L. Cortés and A. Calderón. Reconocimiento de Patrones con Invarianza bajo Transformaciones Geométricas usando Redes Neuronales Artificiales. Proyecto de grado, Departamento de Ing. de Sistemas, Universidad Nacional de Colombia, Bogotá, 1998.
2. J. Daugman. Complete discrete 2D Gabor transforms by neural networks for image analysis and compression. IEEE Transactions on Acoustics, Speech and Signal Processing, 36(7):1169–1179, 1988.
3. H. Drucker, R. Schapire, and P. Simard. Improving performance in neural networks using a boosting algorithm. In S. J. Hanson, J. D. Cowan, and C. L. Giles, editors, Advances in Neural Information Processing Systems 5, pages 42–49. Morgan Kaufmann, San Mateo, Calif., 1993.
4. K. Fukushima and S. Miyake. Neocognitron: a new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 15:455–469, 1982.
5. Simon Haykin. Neural Networks: A Comprehensive Foundation. Prentice Hall, Upper Saddle River, N.J., 1999.
6. D. H. Hubel and T. N. Wiesel. Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160:106–154, 1962.
7. J. Jones and L. Palmer. An evaluation of the two-dimensional Gabor filter model of simple receptive fields in cat striate cortex. Journal of Neurophysiology, 58(6):1233–1258, 1987.


8. Y. LeCun. Generalization and network design strategies. In R. Pfeifer, Z. Schreter, F. Fogelman, and L. Steels, editors, Connectionism in Perspective. Elsevier, Zurich, Switzerland, 1989. An extended version was published as a technical report of the University of Toronto.
9. Y. LeCun and Y. Bengio. Convolutional networks for images, speech, and time-series. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks. MIT Press, 1995.
10. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Intelligent Signal Processing, pages 306–351. IEEE Press, 2001.
11. Y. LeCun, L. Bottou, G. Orr, and K. Müller. Efficient backprop. In G. Orr and K. Müller, editors, Neural Networks: Tricks of the Trade. Springer, 1998.
12. Y. LeCun, I. Kanter, and S. Solla. Eigenvalues of covariance matrices: application to neural-network learning. Physical Review Letters, 66(18):2396–2399, May 1991.
13. Robert E. Schapire. The strength of weak learnability. Machine Learning, 5:197–227, 1990.