Numeral Recognition Based on Hierarchical Overlapped Networks

Ajantha S. Atukorale†
Department of Computer Science and Electrical Engineering
University of Queensland, QLD 4072, Australia

P. N. Suganthan
School of Electrical and Electronic Engineering
Nanyang Technological University
Republic of Singapore, 639798

† Corresponding author: [email protected]

Abstract

This paper describes our investigation into the neural gas (NG) algorithm and the hierarchical overlapped network architecture (HONG), which is built by retaining the essence of the original NG algorithm. By defining an implicit ranking scheme, the NG algorithm is made to run faster in its sequential implementation. The HONG network generates multiple classifications, expressed as confidence values, for every sample presented. These confidence values are combined to obtain the final classification of the HONG architecture. The proposed architecture is tested on the NIST SD3 database, which contains real-world handwritten numerals with high variations, and an excellent recognition rate is obtained.

1. Introduction

Neural network models have been intensively studied for many years in an effort to obtain superior performance compared to classical approaches. The self-organizing feature map (SOFM) proposed by Kohonen [11] is one of the network paradigms widely used for solving complex problems such as vector quantization, speech recognition, combinatorial optimization, pattern recognition and modeling the structure of the visual cortex. Kohonen's feature map is a special way of conserving the topological relationships in input data, but it also has some limitations. In this model the neighborhood relations between neural units have to be defined in advance, and the topology of the input space has to match the topology of the output space which is to be represented. In addition, the dynamics of the SOFM algorithm cannot be described as a stochastic gradient descent on any energy function; a set of energy functions, one for each weight vector, seems to be the best description of the dynamics of the algorithm [5].

Martinetz et al. [12, 13] proposed the neural gas (NG) algorithm for vector quantization, prediction and topology representation in the early nineties. The NG network model: 1) converges quickly to low distortion errors; 2) reaches a distortion error lower than that resulting from K-means clustering, maximum-entropy clustering and Kohonen's SOFM; and 3) obeys a gradient descent on an energy surface. Similar to the SOFM algorithm, the NG algorithm uses a soft-max adaptation rule (i.e., it adjusts not only the winning reference vector but also the other cluster centers, depending on their proximity to the input signal). This is mainly to generate a topographic map and also to avoid confinement to local minima during the adaptation procedure.

Despite all those advantages, the NG algorithm suffers from a high time complexity in its sequential implementation [4]. In this paper, we discuss how the time complexity associated with the NG algorithm can be reduced efficiently. In addition, by defining a hierarchical overlapped structure [14] on top of the standard NG network, a hierarchical overlapped neural gas (HONG) network model is constructed for the classification of real-world handwritten numerals with high variations.

The paper is organized as follows. After a brief review of the NG algorithm in Section 2, the proposed speed-up technique is presented in Section 3. In Section 4, the functionality of the HONG network architecture is discussed, and in Section 5, the experimental results are presented. The paper is concluded with a brief discussion in Section 6.

2. The Neural Gas Algorithm

In the neural gas algorithm, the synaptic weights $w_i$ are adapted without any fixed topological arrangement of the neural units within the network. Instead, it utilizes a neighborhood-ranking of the synaptic weights $w_i$ for a given data vector $v$. The synaptic weight changes $\Delta w_i$ are not determined by the relative distances between the neural units within a topologically prestructured lattice, but by the relative distances between the neural units within the input space: hence the name neural gas network.

Information about the arrangement of the receptive fields within the input space is implicitly given by a set of distortions, $D_v = \{\lVert v - w_i \rVert,\ i = 1, \ldots, N\}$, associated with each $v$, where $N$ is the number of units in the network [13]. Each time an input signal $v$ is presented, an ordering of the elements of the set $D_v$ is necessary (because of the ranking) to determine the adjustment of the synaptic weights $w_i$. This ordering has a time complexity of $O(N \log N)$ in its sequential implementation. The resulting adaptation rule can be described as a winner-take-most instead of a winner-take-all rule.

A presented input signal $v$ is received by each neural unit $i$ and induces excitations $f_i(D_v)$ which depend on the set of distortions $D_v$. Assuming a Hebb-like rule, the adaptation step for $w_i$ is given by

$$\Delta w_i = \epsilon \cdot f_i(D_v) \cdot (v - w_i), \qquad i = 1, \ldots, N. \tag{1}$$

The step size $\epsilon \in [0,1]$ describes the overall extent of the modification (learning rate) and $f_i(D_v) \in [0,1]$ accounts for the topological arrangement of the $w_i$ within the input space. Martinetz et al. [13] reported that an exponential function $\exp(-k_i/\lambda)$ gives the best overall result for the excitation function $f_i(D_v)$, compared to other choices such as Gaussians. Here $\lambda$ determines the number of neural units significantly changing their synaptic weights with the adaptation step (1). The rank index $k_i = 0, \ldots, N-1$ describes the neighborhood-ranking of the neural units, with $k_i = 0$ for the closest synaptic weight ($w_{i_0}$) to the input signal $v$, $k_i = 1$ for the second closest synaptic weight ($w_{i_1}$) to $v$, and so on. That is, the set $\{w_{i_0}, w_{i_1}, \ldots, w_{i_{N-1}}\}$ is the neighbourhood-ranking of the $w_i$ relative to the given input vector $v$. The neighbourhood-ranking index $k_i$ depends on $v$ and on the whole set of synaptic weights $W = \{w_1, w_2, \ldots, w_N\}$, and we denote this as $k_i(v, W)$. The original NG algorithm is summarised below.

NG1: Initialise the synaptic weights $w_i$ randomly and the training parameters $(\lambda_i, \lambda_f, \epsilon_i, \epsilon_f)$, where $\lambda_i, \epsilon_i$ are the initial values of $\lambda(t), \epsilon(t)$ and $\lambda_f, \epsilon_f$ are the final values of $\lambda(t), \epsilon(t)$.

NG2: Present an input vector $v$ and compute the distortions $D_v$.

NG3: Sort the distortion set $D_v$ into ascending order.

NG4: Adapt the weight vectors according to

$$\Delta w_i = \epsilon(t) \cdot h(k_i(v, W)) \cdot (v - w_i), \qquad i = 1, \ldots, N \tag{2}$$

where the parameters have the following time dependencies: $\lambda(t) = \lambda_i (\lambda_f/\lambda_i)^{t/t_{\max}}$, $\epsilon(t) = \epsilon_i (\epsilon_f/\epsilon_i)^{t/t_{\max}}$, and $h(k_i) = \exp(-k_i/\lambda(t))$.

NG5: Increment the time parameter (or adaptation step) $t$ by 1.

NG6: Repeat NG2–NG5 until the maximum number of iterations $t_{\max}$ is reached.
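For concreteness, the following NumPy sketch implements one run of NG1–NG6 with the explicit ranking of NG3. It is an illustrative transcription only: the data, the network size and the parameter defaults are placeholders, not the settings used in the experiments of Section 5.

```python
import numpy as np

def train_ng(data, n_units=50, t_max=10000,
             lambda_i=10.0, lambda_f=0.01, eps_i=0.5, eps_f=0.005, seed=0):
    """Sketch of the original NG loop (NG1-NG6); data is an (n_samples, dim) array."""
    rng = np.random.default_rng(seed)
    # NG1: initialise the synaptic weights (here: randomly chosen training samples)
    w = data[rng.integers(0, len(data), size=n_units)].astype(float)
    for t in range(t_max):
        # annealed schedules lambda(t) and epsilon(t)
        lam = lambda_i * (lambda_f / lambda_i) ** (t / t_max)
        eps = eps_i * (eps_f / eps_i) ** (t / t_max)
        # NG2: present an input vector v and compute the distortions D_v
        v = data[rng.integers(0, len(data))]
        dists = np.linalg.norm(w - v, axis=1)
        # NG3: explicit sort -- k_i is the rank of unit i (0 = winner)
        ranks = np.argsort(np.argsort(dists))
        # NG4: soft-max adaptation, eq. (2)
        w += eps * np.exp(-ranks / lam)[:, None] * (v - w)
        # NG5/NG6: t advances with the loop until t_max is reached
    return w
```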

3. Implicit Ranking Scheme

The NG algorithm suffers from a high time complexity in its sequential implementation [13, 4]. In the original neural gas network, an explicit ordering of all distances between the synaptic weights and the training sample is necessary. This has a time complexity of $O(N \log N)$ in a sequential implementation. Recently some work has been done on speed-up procedures for the sequential implementation of the NG algorithm. Ancona et al. [1] discussed the questions of sorting accuracy and sorting completeness. With theoretical analysis and experimental evidence, they concluded that partial sorting is a better candidate for the NG learning algorithm, and that even the first few units in partial sorting are sufficient to attain a final distortion equivalent to that attained by the original NG algorithm. Moreover, they concluded that correct identification of the best-matching unit becomes more and more important as training proceeds. This is true because, as training proceeds, the adaptation step (1) becomes equivalent to the K-means adaptation rule. Choy et al. [4] have also applied a partial distance elimination method to speed up the NG algorithm in the above context.

In our investigations, we eliminate the explicit ordering (NG3 in the above summary) by employing the following implicit ordering metric:

$$m_i = \frac{d_i - d_{\min}}{d_{\max} - d_{\min}} \tag{3}$$

where $d_{\min}$ and $d_{\max}$ are, respectively, the minimum and maximum distances between the training sample and all synaptic weights in the network, and $d_i \in D_v,\ i = 1, \ldots, N$. The best matching unit (winner) will then have an index of 0, the worst matching unit will have an index of 1, and the other units will take values between 0 and 1 (i.e., $m_i \in [0, 1]$).
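In code, the metric in (3) is a one-liner; the helper below (ours, for illustration only) replaces the sort of NG3 with two reductions over the distortion set.

```python
import numpy as np

def implicit_rank(dists):
    """Implicit ranking metric of eq. (3): 0 for the winner, 1 for the worst unit."""
    d_min, d_max = dists.min(), dists.max()
    # assumes d_max > d_min; a guard would be needed in the degenerate case
    return (dists - d_min) / (d_max - d_min)
```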

By employing the above modification to the original NG algorithm discussed earlier, the two entries NG3 and NG4 are modified as follows.

NG3: Find $d_{\min}$ and $d_{\max}$ from the distortion set $D_v$.

NG4: Adapt the weight vectors according to

$$\Delta w_i = \epsilon(t) \cdot h'(m_i(v, W)) \cdot (v - w_i), \qquad i = 1, \ldots, N \tag{4}$$

where $h'(m_i) = \exp(-m_i/\lambda'(t))$ and $\lambda'(t) = \lambda(t)/(N - 1)$.

The modification in (3) speeds up the running time of the sequential implementation of the NG algorithm by a factor of five (see Table 1). A further modification of the weight update rule in (4) is obtained by rewriting it as

$$\Delta w_i = \alpha(t) \cdot (v - w_i)$$

where $\alpha(t) = \epsilon(t) \cdot \exp(-m_i/\lambda'(t))$.

Only those neurons with a non-negligible effective overall learning rate $\alpha(t)$ are updated, as in [13, 4]. Given a threshold for $\alpha(t)$ (say $10^{-5}$), the weight update rule in (4) is modified by updating neurons only if

$$\alpha(t) > 10^{-5} \;\Rightarrow\; \exp(-m_i/\lambda'(t)) > 10^{-5}/\epsilon(t).$$

Thus $m_i/\lambda'(t) \leq 5\log(10) + \log(\epsilon(t))$. Let $r(t) = 5\log(10) + \log(\epsilon(t))$. Since $\epsilon(t) = \epsilon_i(\epsilon_f/\epsilon_i)^{t/t_{\max}}$, it follows that

$$r(t) = 5\log(10) + \log(\epsilon_i) + \log(\epsilon_f/\epsilon_i)\, t/t_{\max}. \tag{5}$$

That is, the weight vectors are updated according to the following truncated weight update rule:

$$\Delta w_i = \begin{cases} \epsilon(t) \cdot \exp(-m_i/\lambda'(t)) \cdot (v - w_i) & \text{if } m_i \leq r(t)\,\lambda'(t) \\ 0 & \text{otherwise} \end{cases} \tag{6}$$

where $r(t)$ is a parameter which depends only on $t$. Because of this truncation, the weight update rule (4) updates only those weights with non-zero values of $\Delta w_i$. These modifications enable us to eliminate the explicit ranking mechanism completely and reduce the number of weight updates by about 99% on average (see Table 1). Both modifications speed up the sequential implementation of the NG algorithm significantly and enable us to build more complicated structures on top of it.
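Putting (3), (4) and (6) together, one adaptation step of the modified algorithm might look as follows. This is a sketch under the assumptions stated in the text (threshold $10^{-5}$, $\lambda'(t) = \lambda(t)/(N-1)$); the function and variable names are ours, and the annealing schedules for $\epsilon(t)$ and $\lambda(t)$ are assumed to be those of Section 2.

```python
import numpy as np

def ng_step_truncated(w, v, eps_t, lam_t):
    """One adaptation step of the modified NG algorithm, eqs. (3), (4), (6)."""
    n = len(w)
    dists = np.linalg.norm(w - v, axis=1)
    # eq. (3): implicit ranking metric, no sorting required
    m = (dists - dists.min()) / (dists.max() - dists.min())
    lam_prime = lam_t / (n - 1)
    # eq. (6): update only units whose effective learning rate
    # eps(t) * exp(-m_i / lam'(t)) exceeds 1e-5, i.e. m_i <= r(t) * lam'(t)
    r_t = 5.0 * np.log(10.0) + np.log(eps_t)
    active = m <= r_t * lam_prime
    # eq. (4), restricted to the active units
    h = np.exp(-m[active] / lam_prime)
    w[active] += eps_t * h[:, None] * (v - w[active])
    return w
```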

4. Hierarchical Overlapped Neural Gas Architecture

By retaining the essence of the original NG algorithm and our modifications, we have been able to develop a hierarchical overlapped neural gas (HONG) network architecture for labeled pattern recognition. The structure of the HONG network architecture is an adaptation of the hierarchical overlapped architecture developed for SOMs by Suganthan [14].

First, the network is initialized with just one layer, which is called the base layer. The number of neurons in the base layer has to be chosen appropriately; in labeled pattern recognition applications, the number of distinct classes and the number of training samples may be considered in the selection of the initial size of the network. As in the SOM architecture, every neuron in the HONG network has an associated synaptic weight vector of the same dimension as the input feature vector. Once the number of neurons in the base layer is selected, we apply our modified version of the NG algorithm to adapt the synaptic weights of the neurons in the base network. Having completed the unsupervised NG learning, the neurons in the base layer are labeled using a simple voting mechanism. In order to fine tune the labeled network, the supervised learning vector quantization (LVQ) algorithm [11] is applied.

The overlaps are then obtained, as described below, for each neuron in the base layer. For instance, if we had 100 neurons in the base layer network, then we have 100 separate second layer NG networks, one grown from each neuron in the base layer network (see Figure 1). The overlapping is achieved by duplicating every training sample to train several second layer NG networks. That is, the winning neuron as well as a number of runner-up neurons will make use of the same training sample to train the second layer NG networks (hereafter referred to as overlapped NG networks) grown from those neurons in the base layer NG network. For example, in Figure 1, the overlapped NG network grown from neuron A is trained on the samples for which neuron A is either the winner or one of the first few runners-up among all the training samples presented to the base layer network. Figure 1 also shows, conceptually, the overlap in the feature space of two overlapped NG networks, assuming that the nodes A and B are adjacent to each other in the feature space. Once the partially overlapping training samples are obtained for each of the overlapped NG networks, we train each of them as we trained the base layer NG network earlier. The overlapped NG networks are then labeled as in the base layer.

The testing samples are also duplicated, but to a lesser degree (e.g., 3 times). Hence the testing samples fit well inside the feature space spanned by the winner and several runners-up in the training data. In addition, this duplication of the samples allows the HONG architecture to generate 5 independent classifications for every training sample and 3 independent classifications for every testing sample. (Since we train up to 5 second layer NG networks using the same training sample, there is a partial overlap between several second layer NG networks.)


Figure 1. Hierarchical overlapped architecture showing three second layer NG networks grown from units A, B and C of the base NG network.
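To make the overlap construction concrete, the sketch below routes every training sample to the second layer networks grown from its winning base layer unit and the next few runner-up units (5 overlaps for training and 3 for testing, as described above). The data structures and names are ours, not the authors' implementation.

```python
import numpy as np

def assign_overlaps(base_w, samples, n_overlaps=5):
    """Duplicate each sample into the subsets used to train the overlapped
    NG networks grown from the winner and the first runner-up base units."""
    subsets = {i: [] for i in range(len(base_w))}   # one subset per base unit
    for x in samples:
        dists = np.linalg.norm(base_w - x, axis=1)
        nearest = np.argsort(dists)[:n_overlaps]    # winner + runners-up
        for unit in nearest:
            subsets[unit].append(x)
    # only units that actually received samples grow a second layer network
    return {u: np.array(s) for u, s in subsets.items() if s}
```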

In order to combine the outputs of the overlapped NG networks, the idea of confidence values is employed. That is, for a given sample, we can calculate its class membership in an overlapped NG network using

$$c_j = 1 - \frac{d_j}{\sum_{j=0}^{9} d_j} \tag{7}$$

where $d_j$ is the minimum distance for class $j$, with $j = 0, \ldots, 9$ for numeral classification. This defines a confidence value ($c_j$) for the input pattern belonging to the $j$th class of an overlapped NG network. If the class $j$ is not represented in the network considered, we simply ignore that class and set $c_j = 0$. The class which has the global minimum distance yields the confidence value closest to one (in the case of a perfect match, i.e., $d_j = 0$, the confidence value for that class becomes one). That is, the higher the confidence value for a class, the more likely it is that the sample belongs to that class. We can also consider the above function as a basic probability assignment, because $0 \leq c_j \leq 1$. We can define the confidence vector $C_i$, the collection of all ten confidence values for a given sample of an overlapped NG network, as

$$C_i = \{c_j \mid j = 0, 1, \ldots, 9\} \tag{8}$$

where $i = 1, 2, \ldots, n$ and $n$ is the number of overlaps considered.

For example, let us assume that three overlaps are considered for testing data. This will produce three confidence vectors from the corresponding overlapped NG networks. Given the individual confidence vectors, the overall confidence vector ($\bar{C}$) of the HONG architecture can be calculated by adding the individual confidence values according to their class label, as

$$\bar{C}_j = \sum_{i=1}^{n} C_{ij}, \qquad j = 0, 1, \ldots, 9 \ \text{ and } \ n = 3. \tag{9}$$

The class label is assigned to the given test data according to the overall confidence vector (i.e., select the index of the maximum confidence value from that vector):

$$k = \arg\max_{j} \{\bar{C}_j\} \tag{10}$$
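A minimal sketch of (7)–(10): each labelled second layer network contributes a confidence vector computed from its minimum per-class distances, the $n$ vectors are summed class-wise, and the arg-max gives the final label. The choice $n = 3$ for testing follows the text; the function names and data layout are ours.

```python
import numpy as np

def confidence_vector(w, labels, x, n_classes=10):
    """Eq. (7): confidences from the minimum per-class distances of one network."""
    dists = np.linalg.norm(w - x, axis=1)
    d = np.array([dists[labels == j].min() if np.any(labels == j) else np.nan
                  for j in range(n_classes)])
    c = np.zeros(n_classes)                    # absent classes keep c_j = 0
    present = ~np.isnan(d)
    c[present] = 1.0 - d[present] / d[present].sum()
    return c

def classify(overlapped_nets, x):
    """Eqs. (9)-(10): sum the confidence vectors of the n networks the test
    sample is routed to and pick the class with the largest total confidence."""
    total = sum(confidence_vector(w, labels, x) for w, labels in overlapped_nets)
    return int(np.argmax(total))
```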

An evidence fusion technique based on the fuzzy integral can also be used to combine the individual confidence values; the results are presented in [3]. A summary of the HONG network architecture is given below.

HONG1: Initialize the synaptic weights and training parameters of the base NG network.

HONG2: Train the base NG network using the modified NG algorithm, i.e., the weight update rule (4) with the truncation defined in (6).

HONG3: Label the base NG network using a simple voting scheme.

HONG4: Fine tune the base network with the supervised LVQ algorithm.

HONG5: Obtain the overlaps for each unit in the base layer.

HONG6: Initialize the synaptic weights of the second level NG networks around their base layer unit's (i.e., root unit's) value.

HONG7: Train each second layer NG network as in HONG2 using the overlapped samples obtained in HONG5.

HONG8: Label each of the second layer NG networks as in HONG3, and fine tune them as in HONG4.

HONG9: Obtain the final recognition rates by combining the confidence values generated by each of the second layer overlapped NG networks.

5. Experimental Results

5.1. Implicit Ranking

Table 1 compares the processing times, numbers of updates and recognition rates of our proposed implicit ranking metric with those obtained with the original NG algorithm. The simulations were performed using the parameters given in Section 5.2 on a 350 MHz Pentium II personal computer.

The overall processing time of the NG algorithm comprises two distinct phases: the distance calculation phase and the sorting (or ranking) phase. Note that the distance calculation time is common to both the explicit sorting and the implicit ranking algorithms.

Figure 2. Processing time of the NG algorithm (distance calculation time $t_d$, followed by either the sorting time $t_s$ or the implicit ranking metric time $t_m$).

In Table 1, the processing time "Original NG" refers to the time taken by the sorting procedure (using qsort(), the C library routine with time complexity $O(N \log N)$), denoted by $t_s$ in Figure 2. The processing time "Implicit Ranking" refers to the time taken by the implicit ranking metric defined in equation (3), denoted by $t_m$ in Figure 2. These times do not include the weight update times. The common distance calculation time ($t_d$) for the given parameters is 154.51 seconds.

Table 1. Comparison of the processing time, number of updates and recognition rate for the original NG algorithm (2), the implicit ranking metric (4), and the truncated weight update rule (6).

                     Processing Time (sec)   No. of Updates   Recognition Rate
Original NG          152.93                  106,152,000      96.83%
Implicit Ranking     27.72                   106,152,000      96.81%
Truncated Update     N/A                     707,957          96.98%

In Table 1, the "No. of Updates" refers to the total number of updates performed in the training phase. This is given by the total number of training samples multiplied by the total number of neurons in the base network multiplied by the total number of iterations. The "Recognition Rate" refers to the percentage of training samples correctly classified by the base network using the given samples. The percentage improvement in time (or the speed up) of the implicit ranking metric over the sorting algorithm is given by

$$\text{speed up} = \left( \frac{t_s - t_m}{t_s} \right) \times 100\% \tag{11}$$

where $t_s$ and $t_m$ are as shown in Figure 2.

The results in the second and third columns of Table 1 compare the original NG algorithm against the implicit ranking metric defined in (3). This is a comparison between equations (2) and (4).

A speed up of 81.87% is achieved with the proposed implicit ranking metric. The results in the second and fourth columns of Table 1 compare the original NG algorithm against the truncated weight update rule, i.e., a comparison between equations (2) and (6). The truncated weight update rule reduces the number of weight updates by more than 99%. Since the processing time in this case includes the weight update time, we do not report it. This modification also increases the recognition rate slightly; very small weight updates are generally noisy, and eliminating them improves recognition accuracy.
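For reference, substituting the Table 1 timings into (11) reproduces the quoted figure:

$$\text{speed up} = \frac{152.93 - 27.72}{152.93} \times 100\% \approx 81.87\%.$$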

5.2. Character Recognition

We performed experiments on handwritten numerals to test the proposed HONG classifier. The handwritten numeral samples are extracted from the NIST SD3 database, which contains 223,124 isolated handwritten numerals written by 2,100 writers and scanned at 300 dots per inch, provided by the U.S. National Institute of Standards and Technology (NIST) [7]. We partition the NIST SD3 database into the three non-overlapping sets shown in Table 2. The test set comprises samples from 600 writers not used in the training and validation sets.

Table 2. Partitions of the SD3 data set used in our experiments.

Partition(s)    Size      Usage
hsf {0,1,2}     106,152   Training
hsf {0,1,2}     53,076    Validation
hsf 3           63,896    Testing

We restricted the number of upper layers of the overlapped NG networks to two. The base layer consisted of 250 units. The number of units for each overlapped NG network is determined empirically by considering the training samples available to it; we found experimentally that clamping a fixed fraction of the available training samples to the range $[35, 300]$, i.e. $\min\{300, \max\{35, \cdot\}\}$, is a good estimate for the number of neurons in a second layer network. We used 5 overlaps for the training set and 3 overlaps for the testing set. Through trial and error, we found empirically that $\epsilon_i = 0.7$, $\epsilon_f = 0.05$ and $t_{\max} = 4 \times$ (number of training samples), together with suitably annealed $\lambda_i$ and $\lambda_f$, gave the best results for the proposed network. Using the above parameters in equation (5), we calculated the parameter $r(t) = 11.156 - 6.2 \times 10^{-6}\, t$, which is used to truncate the weight update rule as described in (6).

The feature extraction method is briefly summarized below.



• Prior to the feature extraction operation, pre-processing operations are performed on the isolated numerals. This involves removing isolated blobs from the binary image based on a ratio test.

• The pre-processed digit is then centered and only the numeral part is extracted from the 128×128 binary image.

• The extracted binary image is rescaled to an 88×72 pixel resolution.

• Finally, each such binary image is sub-sampled into 8×8 blocks and the result is an 11×9 grey-scale image with pixel values in the range [0, 64] (a sketch of this sub-sampling step is given below).
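The sub-sampling step above amounts to a block sum; the sketch below assumes an 88×72 binary (0/1) NumPy array and is our illustration, not the authors' code.

```python
import numpy as np

def subsample_blocks(img, block=8):
    """Sum each non-overlapping block x block patch of a binary image.
    For an 88x72 input this yields an 11x9 grey-scale image in [0, 64]."""
    h, w = img.shape
    blocks = img.reshape(h // block, block, w // block, block)
    return blocks.sum(axis=(1, 3))

# feature = subsample_blocks(binary_digit).ravel()   # 99-element feature vector
```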

As a result of the above feature extraction method, we are left with a feature vector of 99 elements. The recognition rates obtained using the above parameters are shown in Table 3. As can be seen, the HONG architecture improves further on the high classification rate provided by the base NG network.

Table 3. Recognition rates.

              Base Network   HONG
Training      99.31%         99.90%
Validation    98.60%         99.30%
Testing       98.84%         99.30%

To the best of our knowledge, the most successful results obtained for the NIST SD3 database are those of Ha et al. [8]. They used the same set of 223,124 samples, but partitioned it into 40,000 samples for training, 10,000 for validation and 173,124 for testing, and obtained a recognition rate of 99.54%. They designed two recognition systems based on two distinct feature extraction methods and used a fully connected feed-forward three-layer perceptron as the classifier for both. In addition, if the best score of the combined classifier was less than a fixed predefined threshold, they replaced the normalization operation prior to feature extraction by a set of perturbation processes which modeled writing habits and instruments.

6. Conclusions

In this paper, we have proposed an implicit ranking scheme to speed up the sequential implementation of the original NG algorithm. In contrast to Kohonen's SOFM algorithm, the NG algorithm takes a smaller number of learning steps to converge, does not require any prior knowledge about the structure of the network, and its dynamics can be characterized by a global cost function.

We have also developed the HONG network architecture to obtain a better classification on conflicting data. This is particularly important with totally unconstrained handwritten data, since such data contain conflicting information within the same class due to the various writing styles and instruments used. The HONG network architecture systematically partitions the input space to overcome such situations by projecting the input data to different upper level NG networks (see Figure 1). Since the training and testing samples are duplicated in the upper layers, we obtain multiple classification decisions for every sample. The final classification is obtained by combining the individual classifications generated by the second level networks, using the idea of confidence values. The proposed architecture was tested on handwritten numerals extracted from the NIST SD3 database, and an excellent recognition rate was obtained.

Compared to the number of applications of Kohonen's SOFM, there are relatively few applications of NG in the literature [2, 6, 9, 10, 15, 16]. We hope that, thanks to the speed-up we have introduced for the sequential implementation, there will be more applications of the NG algorithm in the future.

Acknowledgments

The authors would like to thank Marcus Gallagher, Ian Wood and Hugo Navone of the Neural Network Laboratory, University of Queensland, Australia, for their invaluable support and comments.

References

[1] Fabio Ancona, Sandro Ridella, Stefano Rovetta, and Rodolfo Zunino. On the Importance of Sorting in "Neural Gas" Training of Vector Quantizers. In Proceedings of the IEEE International Conference on Neural Networks, pages 1804–1808, 1997.

[2] E. Ardizzone, A. Chella, and R. Rizzo. Color Image Segmentation Based on a Neural Gas Network. In Maria Marinaro et al., editors, International Conference on Artificial Neural Networks (ICANN '94), Sorrento, Italy, pages 1161–1164, May 1994.

[3] Ajantha S. Atukorale and P. N. Suganthan. Multiple HONG Network Fusion by Fuzzy Integral. In Proceedings of the Sixth International Conference on Neural Information Processing (ICONIP'99), pages 718–723, Perth, Australia, November 1999.

[4] Clifford Sze-Tsan Choy and Wan-Chi Siu. Fast Sequential Implementation of "Neural Gas" Network for Vector Quantization. IEEE Transactions on Communications, 46(3):301–304, March 1998.

[5] E. Erwin, K. Obermayer, and Klaus Schulten. Self-organizing maps: ordering, convergence properties and energy functions. Biological Cybernetics, 67:47–55, 1992.

[6] M. Fontana, N. A. Borghese, and S. Ferrari. Image Reconstruction Using Improved "Neural Gas". In Maria Marinaro et al., editors, Proceedings of the 7th Italian Workshop on Neural Nets, Vietri sul Mare, Italy, pages 260–265, 1996.

[7] Michael D. Garris. Design, Collection and Analysis of Handwriting Sample Image Databases. The Encyclopedia of Computer Science and Technology, 31(16):189–213, 1994.

[8] Thien M. Ha and Horst Bunke. Design, Implementation, and Testing of Perturbation Method for Handwritten Numeral Recognition. Technical Report IAM-96-014, Institute of Computer Science and Applied Mathematics, University of Berne, Switzerland, October 1996. Anonymous ftp: iamftp.unibe.ch/pub/TechReports/1996/.

[9] Thomas Hofmann and Joachim M. Buhmann. An Annealed 'Neural Gas' Network for Robust Vector Quantization. In C. von der Malsburg, W. von Seelen, J. C. Vorbruggen, and B. Sendhoff, editors, Artificial Neural Networks (ICANN '96), volume 7, pages 151–156. Springer, Bochum, Germany, July 1996.

[10] Kazuya Kishida, Hiromi Miyajima, and Michiharu Maeda. Destructive Fuzzy Modeling Using Neural Gas Network. IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences, E80-A(9):1578–1584, September 1997.

[11] Teuvo Kohonen. The Self-Organizing Map. Proceedings of the IEEE, 78(9):1464–1480, September 1990.

[12] Thomas M. Martinetz, Stanislav G. Berkovich, and Klaus J. Schulten. "Neural Gas" Network for Vector Quantization and its Application to Time-Series Prediction. IEEE Transactions on Neural Networks, 4(4):558–569, July 1993.

[13] Thomas M. Martinetz and Klaus Schulten. A "Neural-Gas" Network Learns Topologies. In T. Kohonen, K. Mäkisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 397–402. North-Holland, Amsterdam, 1991.

[14] P. N. Suganthan. Hierarchical Overlapped SOMs for Pattern Classification. IEEE Transactions on Neural Networks, 10(1):193–196, January 1999.

[15] Bai-ling Zhang, Min-yue Fu, and Hong Yan. Application of Neural 'Gas' Model in Image Compression. In Proceedings of the International Joint Conference on Neural Networks (IJCNN'98), pages 918–921, Anchorage, Alaska, USA, May 1998.

[16] Bai-ling Zhang, Min-yue Fu, and Hong Yan. Handwritten Digit Recognition by Neural 'Gas' Model and Population Decoding. In Proceedings of the International Joint Conference on Neural Networks (IJCNN'98), pages 1727–1731, Anchorage, Alaska, USA, May 1998.