
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 7, NO. 3, MAY 1996

Evolutionary Learning of Nearest Neighbor MLP

Qiangfu Zhao and Tatsuo Higuchi

Abstract — The nearest neighbor multilayer perceptron (NN-MLP) is a single-hidden-layer network suitable for pattern recognition. To design an NN-MLP efficiently, this paper proposes a new evolutionary algorithm consisting of four basic operations: recognition, remembrance, reduction, and review. Experimental results show that this algorithm can produce the smallest or nearly smallest networks from random initial ones.

I. Introduction

The nearest neighbor multilayer perceptron (NN-MLP) is a neural network realization of the nearest neighbor classifier (NNC). It is a single-hidden-layer network with each hidden neuron corresponding to a training sample and each output neuron corresponding to a pattern class. The hidden neurons are realized using the model shown in Fig. 1. In this model, the distance (any distance measure can be used) between the input sample and the weighting vector of the neuron is first computed, and the result is then put into a count-down counter. The neuron fires (outputs a 1) when its counter becomes 0, and its output is then used to inhibit firing of all neurons with different class labels. In the NN-MLP, all hidden neurons compute the distances and count down simultaneously. An output neuron fires if at least one of its inputs is 1. From pattern recognition theory, an NN-MLP can be used for any complex decision making if the number of hidden neurons is sufficiently large [1]. Further, since patterns are represented locally, an NN-MLP is very suitable for self-organization and real-time learning. This is the main reason that NN-MLPs have been studied by many authors in different forms [2]-[11].

Fig. 1. A conceptual realization of the nearest neighbor neuron (small circles express inverse operations)

Q. F. Zhao is with the Multimedia Device Laboratory, The University of Aizu, Aizu-Wakamatsu City, Japan 965-80. E-mail: [email protected]. T. Higuchi is with the Graduate School of Information Sciences, Tohoku University, Sendai, Japan 980-77. This research is supported in part by the Telecommunications Advancement Foundation (TAF) of Japan.

However, if the hidden neurons correspond directly to all training samples, the network will be extremely large. The basic idea for reducing the network size is to find a small number of prototypes and use them as the connection weights of the hidden neurons. Suppose that the set of prototypes is P; then P must be found such that

  ∪_{x_ij ∈ P} D(x_ij) = Ω
  D(x_ij) ∩ D(x_kl) = ∅,  ∀ x_ij, x_kl ∈ P, i ≠ k    (1)
  M(P) = min

where ∪ and ∩ represent the union and intersection of sets, respectively, Ω is the training set, ∅ is the empty set, x_ij ∈ P is the j-th prototype (or hidden neuron) of the i-th pattern class, D(x) is the set of samples belonging to the decision region of x, and M(P) is the cardinality of P (or the size of the corresponding NN-MLP).

Two approaches can be used to solve the above problem. In the first approach, the prototypes are selected directly from the training set Ω. For example, the condensed nearest neighbor (CNN) rule [13], the reduced nearest neighbor (RNN) rule [14], and the restricted coulomb energy (RCE) algorithm [11] all belong to this approach. The main problem here is that the prototypes are not optimal in any sense, and thus M(P) is usually larger than necessary. In the second approach, prototypes that are in some sense optimal can be found iteratively using the training samples. Algorithms of this approach include the vector quantization (VQ) algorithm [12] and the competitive learning algorithms [6]-[9]. A common problem with these algorithms is that there is no efficient way to determine M(P). For example, in the self-organizing feature map (SOFM) and the learning vector quantization (LVQ) algorithms of Kohonen, a large enough M(P) must be assumed, and all prototypes are actually used after learning. In the adaptive resonance theory (ART) of Grossberg, a prototype is used to represent a certain subspace of the feature space, and a new prototype is added when a sample is "far" away from any existing prototype. Thus, M(P) can be determined automatically. However, in practice, it is difficult to know the effective region of each prototype.
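The nearest-prototype decision rule and the coverage part of condition (1) can be sketched in a few lines. This is an illustrative sketch only; the function names and the Euclidean distance are our choices, not the paper's:

```python
from math import dist

def classify(x, prototypes, labels):
    """Nearest-prototype decision: the hidden neuron whose weight vector is
    nearest to x fires first, and its class label is the network output."""
    i = min(range(len(prototypes)), key=lambda k: dist(x, prototypes[k]))
    return labels[i]

def consistent(prototypes, labels, samples, targets):
    """Coverage part of condition (1): the decision regions of the prototypes
    must together classify every training sample correctly."""
    return all(classify(x, prototypes, labels) == t
               for x, t in zip(samples, targets))

# Toy XOR-like data: four samples, two classes, samples used as prototypes.
X = [(0.25, 0.75), (0.75, 0.25), (0.75, 0.75), (0.25, 0.25)]
y = [1, 1, 0, 0]
print(consistent(X, y, X, y))   # → True (each sample is its own nearest prototype)
```

The minimality condition M(P) = min is what makes the problem hard: among all consistent prototype sets, the smallest one is wanted.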

ZHAO AND HIGUCHI: EVOLUTIONARY LEARNING OF NN-MLP

A small predetermined effective region may result in too many prototypes, while a large effective region may produce a poor recognition rate. Therefore, it is difficult to get a prototype set that satisfies (1) using existing algorithms. To solve the problem more efficiently, this paper proposes a new evolutionary algorithm. Some general considerations about this algorithm are given in the next section. The detailed algorithm is provided in Section III. The performance of the new algorithm is shown by experimental results in Section IV.

Fig. 2. R4-rule for evolutionary learning of NN-MLP

II. A Rule for Evolutionary Learning: The R4-Rule

The purpose of this paper is to develop an evolutionary algorithm which can produce the smallest or nearly smallest NN-MLPs from initial networks whose connection weights and sizes are all given at random. For this purpose, let us first consider the human brain, which is a natural model for evolutionary learning. There are four basic operations in the learning process of a human brain: practice, remembrance, oblivion, and review. Practice is the process of observation. If the observed data are unknown, the brain will remember them. Often-observed data will be remembered strongly, and rarely observed ones will be gradually forgotten. By review, learned knowledge is rearranged and abstracted, so that something can be forgotten without losing learned knowledge, or many things can be learned by remembering only a small portion of them.

Inspired by the above observation, an NN-MLP can also be learned in an evolutionary way by successively performing four operations: recognition, remembrance, reduction, and review. The operation recognition tests the ability of the network (to see what is unknown) and the importance of each hidden neuron (to see what is often observed). If the network functions too poorly, some hidden neurons are added to the network by the operation remembrance. On the other hand, if the network functions very well, some unimportant hidden neurons might be removed from the network by the operation reduction. When some neurons are added or removed, the network is readjusted by the operation review to achieve better performance. One learning cycle thus can be defined as recognition ∧ (remembrance ∨ reduction) ∧ review, where ∨ and ∧ represent the logic "or" and "and" operations, respectively. The learning is performed cycle after cycle until some criterion is satisfied. This learning rule will be called the R4-rule throughout this paper. Using this learning rule, the performance of a network is expected to become better and better, and will finally be good enough to classify all samples under consideration, using the smallest number of hidden neurons. To realize the R4-rule, some general considerations will be given in this section.

A. Recognition

In the R4-rule, each operation is performed by a process or a subroutine. The operation recognition is a process to

test the ability of the current network and the importance of each hidden neuron using training samples. For these purposes, two special parameters λ and γ are used: λ is the class label of a hidden neuron, and γ is a measure of the importance of that neuron. Both of them are assigned to the hidden neurons automatically in the learning process. The ability, or the recognition rate, of the current network can be tested as usual by using the nearest neighbor rule. The key point in the R4-rule is to test the importance of the hidden neurons. This is performed by competition, but in a different way. Conventionally, a hidden neuron is a winner whenever its weighting vector is nearest to the input sample, or in other words, if it fires first. In our algorithm, however, a hidden neuron is a winner for a given sample x only if it satisfies three conditions: 1) it belongs to the same pattern class as x; 2) it fires before any neuron of a different class fires; 3) its γ is the largest of all fired neurons.

A hidden neuron is a loser if it satisfies the first two conditions and has the smallest γ value of all fired neurons. That is, if two or more hidden neurons fire for a given sample, the most important one will be the winner, and the most unimportant one will be the loser. The parameter γ of the winner should be increased to make it more important, and that of the loser should be decreased. Clearly, if a hidden neuron frequently wins in the competition, its importance γ will become larger and larger, and this neuron will become a permanent neuron in the network.

B. Remembrance

After recognition, we know how the present network

performs and how important each hidden neuron is. If there are too many recognition errors, some new hidden neurons should be added. On the other hand, if the recognition rate is very high, some of the unimportant neurons could be removed to make the network more efficient.
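The winner/loser competition just described can be sketched as follows. This is a minimal illustration with hypothetical names; firing order is simulated by comparing distances, and at least one rival-class prototype is assumed to exist:

```python
from math import dist

def competition_update(x, x_label, protos, labels, gamma, delta=0.001):
    """One competition step of the recognition operation (a sketch).
    Among hidden neurons of x's class that fire before any rival-class
    neuron (i.e., are closer to x), the one with the largest importance
    gamma wins and the one with the smallest loses; gamma moves by delta."""
    d = [dist(x, p) for p in protos]
    rival = min(di for di, l in zip(d, labels) if l != x_label)
    fired = [i for i, (di, l) in enumerate(zip(d, labels))
             if l == x_label and di < rival]
    if not fired:
        return False                      # x would be misclassified
    winner = max(fired, key=lambda i: gamma[i])
    loser = min(fired, key=lambda i: gamma[i])
    gamma[winner] += delta                # strengthen the most important neuron
    gamma[loser] -= delta                 # weaken the least important one
    return True

protos = [(0.0, 0.0), (1.0, 1.0), (0.1, 0.1)]
labels = [0, 1, 0]
gamma = [0.3, 0.5, 0.1]
competition_update((0.05, 0.05), 0, protos, labels, gamma)
```

In this toy call both class-0 neurons fire before the class-1 rival; the first (γ = 0.3) wins and the third (γ = 0.1) loses, so repeated wins gradually make a neuron permanent while repeated losses drive its γ negative.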


We might add a hidden neuron whenever a sample is misclassified (as in ART and RCE), but this is not an efficient method because the network may soon become too large. In our algorithm, only one hidden neuron is added for each misclassified class in the operation remembrance, and this is done only when the recognition rate is smaller than a certain threshold r0. A neuron is added by recovering an unused neuron, or by creating a new neuron. Thus, it is not necessary to specify the number of hidden neurons before learning.

C. Reduction

We can remove a neuron whenever its γ becomes very small. However, to make the learning process more stable, reduction of hidden neurons should be performed carefully. In our algorithm, we do not remove any neurons in the operation recognition, but do it in the operation reduction. In reduction, only one neuron with negative γ is removed, and this neuron is selected randomly. If reduction of a neuron results in too many recognition errors, this neuron will be returned to the network. Removal of a neuron is performed by making it unused, or by deleting it physically from the network. Thus, the number of hidden neurons can be kept the same during the learning process, as in human brains, or change constantly.

D. Review

Review is necessary when some hidden neurons are removed or added. It is the process of rearranging the knowledge learned up to now, and making it more abstract and simpler. In review, network parameters are readjusted so that the network can achieve the highest ability using the current neurons. For this purpose, any supervised competitive learning method can be used, as long as it is simple and efficient. In the evolutionary learning of this paper, the DSM (decision surface mapping) algorithm given in [10] is adopted due to its desired convergence properties.

III. The Algorithm for Evolutionary Learning

The evolutionary learning algorithm consists of four subroutines corresponding to the four basic operations described in the last section. The relation between these subroutines is shown in Fig. 2. Before learning, a network with random connection weights and random size is given as the initial network. Each learning cycle can be briefly described as follows. First, present all training samples to the subroutine recognition. If the recognition rate is higher than the desired value r0, call subroutine reduction to remove an unimportant hidden neuron; otherwise, call subroutine remembrance to add some hidden neurons. When some hidden neurons are removed or added, call review to achieve higher network performance. After reviewing, another learning cycle starts.

A. Subroutine 1: Recognition

The subroutine recognition can be described as follows:

Step 1: For any x ∈ Ω, find n + 1 hidden neurons y0, y1, ..., yn such that

  λ(yi) = λ(x), d(yi, x) ≤ d(yi+1, x), i = 0, 1, ..., n − 1
  λ(yn) ≠ λ(x)    (2)

where d(y, x) is the distance between y and x. If n > 0, x is recognizable, go to Step 2; if n = 0, x cannot be recognized, go to Step 4;

Step 2: Find the winner yw and the loser yl such that

  γ(yw) = max[γ(yi)]
  γ(yl) = min[γ(yi)],  i = 0, 1, ..., n − 1.    (3)

Step 3: To make the winner more important and the loser more unimportant, their γ values are changed as follows:

  γ(yw) = γ(yw) + δ
  γ(yl) = γ(yl) − δ    (4)

where δ is a sufficiently small positive number. In practical applications, it may be necessary to normalize the value of γ to prevent it from becoming too large or too small. In our experiments, however, the above simple rule is used directly because a small number of learning cycles has been sufficient. After modifying γ, go to Step 5;

Step 4: Decrease the recognition rate r accordingly, and record λ(x) for later use;

Step 5: If all training samples have been presented, call reduction in case r ≥ r0, or call remembrance in case r < r0; otherwise, return to Step 1.

B. Subroutine 2: Remembrance

There are three steps in remembrance, given as follows:

Step 1: For a misclassified pattern class, select a misclassified sample x at random;

Step 2: Find an unused neuron or create a new neuron y, and add it to the network using the following equations:

  y = x;  λ(y) = λ(x);  γ(y) = random > 0;    (5)

Step 3: Call review if a new hidden neuron has been added for each misclassified pattern class; otherwise, return to Step 1.

Note that for a new hidden neuron, its importance is unknown, and should be determined through evolution. Therefore, it is more reasonable to use a random number as the initial value of the importance, rather than a constant value.

C. Subroutine 3: Reduction

Reduction of an unimportant neuron is performed as follows:

Step 1: Randomly select a hidden neuron y with negative importance γ, and keep it in a temporary memory;

Step 2: Remove y from the network. We can remove y physically from the network, or just do the following:

  γ(y) = undefined;  y = random vector    (6)

Step 3: Test the ability of the network using training samples. If the recognition rate is less than a threshold r1, return y to the network; otherwise, call review.

In the last step, we can also test the network ability after calling review. If the network still functions too poorly after reviewing, y should be returned. In this case, the threshold r1 can take the same value as r0, which is used in recognition. Actually, this method has been adopted in our algorithm.

TABLE I
Experimental results for the generalized XOR problem

        M(P)    Error rate (%)
NNC     6400    0.91
CNN      231    1.11
RNN      162    1.27
RCE      183    0.69
New        4    0.13

Fig. 3. The generalized XOR problem

D. Subroutine 4: Review

The algorithm used in the process review is adapted from [10], and is given here for the convenience of the readers.

Step 1: ∀x ∈ Ω, test if x can be recognized; if yes, no change; otherwise, update the connection weights as follows:

  y0 = y0 − α(x − y0)
  y1 = y1 + α(x − y1)    (7)

where y0 is the nearest neuron of a different pattern class, y1 is the nearest neuron of the same pattern class, and α is the convergence ratio;

Step 2: Call recognition if the number of iterations reaches the given value, or all samples have been correctly recognized; otherwise, return to Step 1.

IV. Experimental Results

To demonstrate the performance of the new algorithm, we have conducted experiments with three pattern recognition problems. The first two are artificial problems adapted from [10], and the third is a practical handwritten digit recognition problem. For the artificial problems, both the training set and the test set consist of 6400 samples taken at random. For the handwritten digit recognition problem, 80 samples were written by one of the authors in an ordinary manner for each pattern class. The training set consists of the first 40 samples of each class, and the test set consists of the remaining samples. In the experiment, every digit was written using a mouse in a 256 × 256 frame. No normalization of any kind was performed. Part of the samples are shown in Fig. 5. The digit 9 is not considered here because the features employed in the experiment are rotation invariant.

A. The Generalized XOR Problem

The generalized XOR problem is a two-class problem as depicted in Fig. 3. There are four perfect prototypes corresponding to the centers of the four subregions. Table I gives the experimental results, where M(P) is the number of prototypes, and the error rate is obtained using the test set. As shown in the table, the NNC uses all 6400 training samples as prototypes (or hidden neurons), the CNN, RNN and RCE use about 200, and the NN-MLP obtained by the new algorithm uses only four. Further, the error rate of the four-hidden-neuron NN-MLP is much smaller even than that of the NNC. Thus, the generalization ability has also been increased for this problem.

In the experiment, the threshold r0 used in recognition is 99%, δ in eq. (4) is 0.001, and the initial value of α in eq. (7) is 0.1. α is decreased linearly in the review process. Further, each training sample is presented once in recognition, and 20 times in review. The initial number of hidden neurons is a random integer between 0 and 20, and the number of learning cycles is 25. The four prototypes obtained after the 25th learning cycle are as follows:

(0.267457, 0.739306, 1), (0.734149, 0.262727, 1)
(0.733432, 0.738035, 0), (0.266190, 0.262267, 0)

where the first and second numbers are the feature values, and the third is the class label. Clearly, they are very close to the perfect prototypes.

B. The Straight Line Class Boundaries Problem

The straight line class boundaries problem is also a two-class problem, and is depicted in Fig. 4. Similar to the generalized XOR problem, we can easily find out that there are ten perfect prototypes for this problem. Using the same parameters as in the generalized XOR problem, the prototypes obtained by the proposed algorithm are as follows:

(0.847078, 0.356892, 1), (0.551071, 0.427418, 1)
(0.048296, 0.827414, 0), (0.551475, 0.827253, 1)
(0.844268, 0.778659, 1), (0.228646, 0.248717, 0)
(0.430515, 0.183015, 1), (0.550853, 0.771827, 0)
(0.429797, 0.220361, 0), (0.554753, 0.364483, 0)

The experimental results are summarized in Table II. Again, the smallest prototype set with the best performance has been obtained by the new method.

Fig. 4. The straight line class boundaries problem

TABLE II
Experimental results for the straight line class boundaries problem

        M(P)    Error rate (%)
NNC     6400    0.97
CNN      307    1.17
RNN      216    1.14
RCE      243    0.56
New       10    0.48

C. The Handwritten Digit Recognition Problem

As stated previously, the handwritten digit recognition problem considered here is a nine-class problem. Features used in the recognition are crossing numbers on some concentric circles, with the center of the concentric circles being the center of gravity of the image. For comparison, the number of concentric circles (features) has been changed from 12 to 32, with an increment of four. All training samples are presented once in recognition, and 10 times in review. Again, the desired recognition rate r0 is 99%, the initial value of α in eq. (7) is 0.1, the number of hidden neurons is given initially as a random number between 0 and 20, and the number of learning cycles is 25. The parameter δ in eq. (4) is 0.01 here, so that unimportant hidden neurons can be detected more quickly.

Fig. 5. Part of the samples for handwritten digit recognition

Table III gives the experimental results. Clearly, for each case, the new algorithm always produces the smallest or nearly smallest prototype set, and the error rate is comparable with that of the NNC.

TABLE III
Experimental results for the handwritten digit recognition problem

Number of features       12    16    20    24    28    32
NNC  Error rate (%)     6.9   3.9   3.6   3.6   5.0   4.4
     M(P)               360   360   360   360   360   360
CNN  Error rate (%)     6.7   8.3   6.4   3.1  10.3   7.2
     M(P)                50    45    49    45    53    45
RNN  Error rate (%)     5.3   8.3   6.9   3.9  10.8   8.1
     M(P)                41    37    41    35    45    39
RCE  Error rate (%)    11.1   5.8   6.7   5.3   6.7   1.7
     M(P)                80    63    62    61    54    55
New  Error rate (%)     8.1   3.3   5.3   5.6   6.7   4.7
     M(P)                15    13    11    11    12    10

D. Recommendations on Parameter Selection

(1) The desired recognition rate r0: r0 should be neither too large nor too small. A large r0 may result in too many prototypes, while a small r0 may result in over-reduction. In our experiments, the first value chosen for r0 was 99%. This value is better than 98%, and is therefore used for all cases. Thus, a few trial-and-error attempts might be necessary to obtain a good r0 for different problems.

(2) The initial value of α: Usually, α should be large at the beginning of review to achieve fast adjustment, and gradually become small for fine-tuning. As pointed out in [10], a value smaller than 0.3 would be a very good choice for the initial value of α. In all of our experiments, 0.1 has been used.

(3) The parameter δ: Generally, δ should be inversely proportional to the number of training samples. For example, δ = 0.001 for the generalized XOR problem and δ = 0.01 for the digit recognition problem are both good choices.

(4) The number of iterations in review: As pointed out in [10], the DSM algorithm often converges after 10 iterations. Therefore, in our experiments, all training samples are presented 10 or 20 times. If other algorithms are to be used, the number of iterations should be determined according to the convergence properties of the algorithm.

V. Comparison with the Genetic Algorithm

Interestingly, there are also four operations in the well-known genetic algorithm (GA): competition, reproduction, selection and mutation (see [15] and the references therein). Although these operations are similar to those in the R4-rule, the R4-rule is a different approach to evolutionary learning, and has different properties as compared with GAs. The main differences are as follows:

(1) The importance of individuals in the R4-rule is determined by evolution, while the fitness of individuals in GAs is often given in one step by calculating a fitness function;

(2) Only one generation of individuals is considered in the R4-rule, and each individual is trained in an evolutionary manner to become an expert at representing a certain class of samples. In GAs, however, new individuals are reproduced from existing ones, and the learning ability of individuals is totally ignored;

(3) The review operation in the R4-rule is "purpose-controlled", so that all individuals become more useful for the whole population. However, in GAs, mutation is performed randomly, and there is no assurance that an individual will become better after mutation;

(4) In the R4-rule, many hidden neurons (individuals) compete for classifying many samples (tasks), and the result is an efficient network (population). In GAs, however, many individuals compete to fulfill one task, and the result is one of the fittest individuals.
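The interplay of the four subroutines can be condensed into a short sketch. This is an illustrative reconstruction, not the authors' implementation: all names are hypothetical, the distance is Euclidean, firing order is simulated by comparing distances, review runs a fixed number of passes, and prototypes of more than one class are assumed once remembrance has run:

```python
import random
from math import dist

random.seed(0)  # deterministic toy run

def nearest(x, W, lab, same_as=None, not_same_as=None):
    """Index of the nearest prototype, optionally restricted by class label."""
    idx = [i for i, l in enumerate(lab)
           if (same_as is None or l == same_as)
           and (not_same_as is None or l != not_same_as)]
    return min(idx, key=lambda i: dist(x, W[i]))

def recognize(X, y, W, lab, gamma, delta):
    """Recognition: recognition rate plus winner/loser importance updates."""
    correct = 0
    for x, c in zip(X, y):
        if lab[nearest(x, W, lab)] != c:
            continue                                   # x is misclassified
        correct += 1
        others = [i for i, l in enumerate(lab) if l != c]
        rival = min(dist(x, W[i]) for i in others) if others else float("inf")
        fired = [i for i, l in enumerate(lab)
                 if l == c and dist(x, W[i]) < rival]
        gamma[max(fired, key=lambda i: gamma[i])] += delta   # winner, eq. (4)
        gamma[min(fired, key=lambda i: gamma[i])] -= delta   # loser
    return correct / len(X)

def remember(X, y, W, lab, gamma):
    """Remembrance: add one new neuron per misclassified class, eq. (5)."""
    for c in sorted(set(y)):
        errs = [x for x, t in zip(X, y)
                if t == c and lab[nearest(x, W, lab)] != c]
        if errs:
            W.append(random.choice(errs))      # y = x
            lab.append(c)                      # label(y) = label(x)
            gamma.append(random.random())      # importance = random > 0

def reduce_net(W, lab, gamma):
    """Reduction: remove one random neuron with negative importance."""
    neg = [i for i, g in enumerate(gamma) if g < 0]
    if neg:
        k = random.choice(neg)
        del W[k], lab[k], gamma[k]

def review(X, y, W, lab, alpha=0.1, passes=10):
    """Review: DSM-style update of eq. (7) for each misclassified sample."""
    for _ in range(passes):
        for x, c in zip(X, y):
            if lab[nearest(x, W, lab)] == c:
                continue
            y0 = nearest(x, W, lab, not_same_as=c)   # nearest wrong-class
            y1 = nearest(x, W, lab, same_as=c)       # nearest same-class
            W[y0] = tuple(w - alpha * (xi - w) for xi, w in zip(x, W[y0]))
            W[y1] = tuple(w + alpha * (xi - w) for xi, w in zip(x, W[y1]))

def r4_cycle(X, y, W, lab, gamma, r0=0.99, delta=0.01):
    """One cycle: recognition, then remembrance or reduction, then review."""
    if recognize(X, y, W, lab, gamma, delta) < r0:
        remember(X, y, W, lab, gamma)
    else:
        reduce_net(W, lab, gamma)
    review(X, y, W, lab)

# Toy run: start with one class-0 neuron; the cycle grows the net as needed.
W, lab, gamma = [(0.2, 0.2)], [0], [0.5]
X = [(0.1, 0.1), (0.9, 0.9), (0.15, 0.2), (0.85, 0.9)]
y = [0, 1, 0, 1]
for _ in range(3):
    r4_cycle(X, y, W, lab, gamma)
print(len(W), sorted(set(lab)))   # → 2 [0, 1]
```

In the toy run, the first cycle adds a single class-1 neuron through remembrance, after which the net classifies all four samples and later cycles leave it unchanged; this mirrors the growth-then-pruning behavior observed in the experiments above.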

VI. Concluding Remarks

In this paper, a new evolutionary learning algorithm has been proposed for designing NN-MLPs. By this algorithm, the smallest or nearly smallest networks can be obtained from given initial random networks by successively performing four operations: recognition, remembrance, reduction, and review. This algorithm is very simple and suitable for parallel realization. Its efficiency has been verified by experimental results.

Many topics remain for future studies. First, in this paper, the training set Ω is unchanged during the learning process. To make the new algorithm more useful for practical applications, real-time learning with a changing Ω should be studied in the future. For real-time learning, it is also necessary to implement the new algorithm in hardware. Another interesting topic is to combine the proposed R4-rule with the GA, and perhaps obtain a more powerful evolutionary algorithm.

References

[1] T. M. Cover and P. E. Hart, "Nearest neighbor pattern classification," IEEE Trans. on Information Theory, Vol. IT-13, No. 1, pp. 21-27, Jan. 1967.
[2] Q. F. Zhao and T. Higuchi, "A study on the determination of MLP structures," Proc. IEICE Karuizawa Workshop, pp. 121-126, Karuizawa, Japan, April 1994.
[3] Q. F. Zhao and T. Higuchi, "Supervised organization of nearest neighbor MLP," Proc. International Conference on Neural Information Processing, pp. 1398-1403, Seoul, Korea, Oct. 1994.
[4] O. J. Murphy, "Nearest neighbor pattern classification perceptrons," Proc. IEEE, Vol. 78, No. 10, pp. 1595-1598, Oct. 1990.
[5] N. K. Bose and A. K. Garga, "Neural network design using Voronoi diagrams," IEEE Trans. on Neural Networks, Vol. 4, No. 5, pp. 778-787, Sept. 1993.
[6] T. Kohonen, "Self-organized formation of topologically correct feature maps," Biological Cybernetics, Vol. 43, pp. 59-69, 1982.
[7] T. Kohonen, "The self-organizing map," Proc. IEEE, Vol. 78, No. 9, pp. 1464-1480, Sept. 1990.
[8] G. A. Carpenter and S. Grossberg, "ART 2: self-organization of stable category recognition codes for analog input patterns," Applied Optics, Vol. 26, No. 23, pp. 4919-4930, Dec. 1987.
[9] G. A. Carpenter and S. Grossberg, "The ART of adaptive pattern recognition by a self-organizing neural network," IEEE Computer, Vol. 21, No. 3, pp. 77-88, Mar. 1988.
[10] S. Geva and J. Sitte, "Adaptive nearest neighbor pattern classification," IEEE Trans. on Neural Networks, Vol. 2, No. 2, pp. 318-322, Mar. 1991.
[11] D. L. Reilly, L. N. Cooper and C. Elbaum, "A neural model for category learning," Biological Cybernetics, Vol. 45, pp. 35-41, 1982.
[12] Y. Linde, A. Buzo and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. on Communication, Vol. COM-28, No. 1, pp. 84-95, Jan. 1980.
[13] P. E. Hart, "The condensed nearest neighbor rule," IEEE Trans. on Information Theory, Vol. 14, No. 5, pp. 515-516, May 1968.
[14] G. W. Gates, "The reduced nearest neighbor rule," IEEE Trans. on Information Theory, Vol. 18, No. 5, pp. 431-433, May 1972.
[15] D. B. Fogel, "An introduction to simulated evolutionary optimization," IEEE Trans. on Neural Networks, Vol. 5, No. 1, pp. 3-14, Jan. 1994.