Ensemble Based Extreme Learning Machine


IEEE SIGNAL PROCESSING LETTERS, VOL. 17, NO. 8, AUGUST 2010

Nan Liu and Han Wang, Senior Member, IEEE

Abstract—Extreme learning machine (ELM) was proposed as a new class of learning algorithm for the single-hidden layer feedforward neural network (SLFN). To achieve good generalization performance, ELM minimizes the training error on the entire training data set; it may therefore suffer from overfitting, because the learning model approximates all training samples closely. In this letter, an ensemble based ELM (EN-ELM) algorithm is proposed in which ensemble learning and cross-validation are embedded into the training phase so as to alleviate the overtraining problem and enhance the predictive stability. Experimental results on several benchmark databases demonstrate that EN-ELM is robust and efficient for classification.

Index Terms—Cross-validation, ensemble learning, extreme learning machine, neural network.

I. INTRODUCTION

EXTREME learning machine (ELM) was proposed recently as an efficient learning algorithm for the single-hidden layer feedforward neural network (SLFN) [1]. It increases the learning speed by randomly generating the weights and biases of the hidden nodes rather than iteratively adjusting the network parameters, as is commonly done in gradient-based methods. Although ELM is fast and gives good generalization performance, there is still much room for improvement. Zhu et al. [2] noted that random assignment of parameters introduces non-optimal input weights and hidden biases. As a result, the evolutionary extreme learning machine (E-ELM) was proposed, taking advantage of both ELM and differential evolution (DE) [3] to remove redundancy among hidden nodes and achieve satisfactory performance with more compact networks. Furthermore, the pruned ELM (P-ELM) was presented by Rong et al. [4]; their idea is to initialize a large network and prune it during learning. Apart from numerous other improvements [5], [6], ELM has also been applied to microarray data classification [7], where it showed its superiority to support vector machines. Neural network classifiers usually suffer from overtraining, which can degrade the generalization performance. During the training phase, all training samples are categorized into several classes by the classifier, and the learning error is used to evaluate the efficiency of training. A minimum training error is expected, but it cannot guarantee good classification results on unseen data.

It has been shown that combining a number of neural networks can solve the above problem [8], [9]. As a result, Lan et al. [10] presented an extended ELM method that simply averages the outputs of individual classifiers. In this letter, we propose an ensemble based ELM (EN-ELM) algorithm which uses a cross-validation scheme to create an ensemble of ELM classifiers for decision making. The main mechanism behind our proposal is as follows. First, the original training set is partitioned into $K$ subsets using a cross-validation scheme, and $K$ pairs of training and validation sets are obtained so that each training set consists of $K-1$ subsets; in the training procedure, each of the $K$ learners is constructed using $K-1$ subsets and validated with the remaining subset, and the cross-validation process is repeated $K$ times, with each subset used exactly once for validation. Second, a set of parameters (weights and biases) is initialized and updated adaptively through iterations on the same pairs of data sets according to particular criteria. Finally, a decision ensemble is constructed on the testing data with the multiple sets of predictors generated from these iterations. This learning procedure helps to avoid overfitting, because a validation set, rather than the entire training set, is used to evaluate the learning error of each classifier. Moreover, a number of diversified ELM classifiers can improve the prediction performance. In this letter, we propose the EN-ELM algorithm and evaluate its effectiveness for pattern classification.

The remainder of this letter is organized as follows. Section II briefly presents the ELM algorithm. The ensemble based ELM is described in Section III. Experimental results on several benchmark databases are reported in Section IV. Section V concludes the letter.

II. PRELIMINARIES

As a learning algorithm for the SLFN, ELM randomly selects the weights and biases of the hidden nodes and analytically determines the output weights by finding the least-squares solution. Given a training set consisting of $N$ samples $\{(\mathbf{x}_j, \mathbf{t}_j)\}_{j=1}^{N}$, where $\mathbf{x}_j \in \mathbb{R}^n$ is an input vector and $\mathbf{t}_j \in \mathbb{R}^m$ is a target vector, an SLFN with $\tilde{N}$ hidden nodes is formulated as

$$\sum_{i=1}^{\tilde{N}} \boldsymbol{\beta}_i\, g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i) = \mathbf{o}_j, \qquad j = 1, \ldots, N \qquad (1)$$

Manuscript received April 02, 2010; revised May 28, 2010; accepted May 30, 2010. Date of publication June 21, 2010; date of current version July 01, 2010. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Nikolaos V. Boulgaris. N. Liu is with the Department of Emergency Medicine, Singapore General Hospital, Singapore 169608 (e-mail: [email protected]). H. Wang is with the School of Electrical and Electronic Engineering, College of Engineering, Nanyang Technological University, Singapore 639798 (e-mail: [email protected]). Digital Object Identifier 10.1109/LSP.2010.2053356

where an additive hidden node with activation function $g(\cdot)$ is employed, $\mathbf{w}_i$ is the $n$-dimensional weight vector connecting the $i$th hidden node and the input neurons, and $b_i$ is the bias of the $i$th hidden node. In approximating the $N$ samples using $\tilde{N}$ hidden nodes, parameters $\boldsymbol{\beta}_i$, $\mathbf{w}_i$, and $b_i$ are supposed to exist such that zero error is obtained. Consequently, (1) can be written in the more compact form $\mathbf{H}\boldsymbol{\beta} = \mathbf{T}$, where $\mathbf{H}$ is the hidden layer output matrix of the network, whose entry $h_{ji} = g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i)$ is the output of the $i$th hidden neuron with respect to $\mathbf{x}_j$, and $\boldsymbol{\beta}$ and $\mathbf{T}$ are the output weight matrix and the target matrix, respectively.



Huang et al. [1] pointed out that in real applications the training error cannot be made exactly zero, because the number of hidden nodes $\tilde{N}$ will always be less than the number of training samples $N$. To obtain a small nonzero training error, Huang et al. [1] proposed randomly assigning values for the parameters $\mathbf{w}_i$ and $b_i$; the system then becomes linear, so that the output weights can be estimated as $\hat{\boldsymbol{\beta}} = \mathbf{H}^{\dagger}\mathbf{T}$, where $\mathbf{H}^{\dagger}$ is the Moore–Penrose generalized inverse [11] of the hidden layer output matrix $\mathbf{H}$. Given a training set, an activation function $g$, and a hidden node number $\tilde{N}$, the ELM algorithm can be summarized in three steps as follows: 1) generate parameters $\mathbf{w}_i$ and $b_i$ for $i = 1, \ldots, \tilde{N}$; 2) calculate the hidden layer output matrix $\mathbf{H}$; 3) calculate the output weights using $\hat{\boldsymbol{\beta}} = \mathbf{H}^{\dagger}\mathbf{T}$.
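As a concrete illustration of the three-step procedure above, the following NumPy sketch trains and applies an ELM with a sigmoid activation. The function names (elm_fit, elm_predict), the uniform random initialization, and the use of one-hot encoded targets T are illustrative assumptions rather than details taken from the letter.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_fit(X, T, n_hidden, rng=None):
    """ELM training: 1) random input weights and biases, 2) hidden output matrix H,
    3) output weights beta = pinv(H) @ T (Moore-Penrose generalized inverse)."""
    rng = np.random.default_rng() if rng is None else rng
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))   # input weights w_i
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                  # hidden biases b_i
    H = sigmoid(X @ W + b)                                      # hidden layer output matrix
    beta = np.linalg.pinv(H) @ T                                # least-squares output weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Predict class indices by taking the largest network output."""
    return np.argmax(sigmoid(X @ W + b) @ beta, axis=1)
```

With one-hot targets, taking the argmax over the network outputs recovers the predicted class label.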

III. ENSEMBLE BASED EXTREME LEARNING MACHINE

Owing to the random selection of the weights and biases of the hidden nodes, ELM decreases the learning time dramatically. However, these parameters cannot incorporate prior knowledge of the inputs and may be non-optimal, so the generalization performance might be degraded. Consequently, we propose constructing an ensemble of several predictors on the training set using various sets of random parameters, where the parameters of each predictor are updated according to particular criteria, and then making decisions for testing samples through majority voting with the ensemble. The proposed ensemble based ELM (EN-ELM) algorithm is illustrated in Fig. 1. Cross-validation is applied throughout the classification task: on the one hand, the cross-validation scheme prohibits overfitting; on the other hand, it extends the number of predictors in the ensemble so as to guarantee stable and accurate decision making.

Fig. 1. Proposed EN-ELM algorithm.

A. Initialization

The proposed EN-ELM algorithm starts its learning by partitioning the entire training set into $K$ subsets and initializing the weights and biases. Each of the $K$ learners is then trained using $K-1$ subsets and validated with the remaining subset. Afterward, the best-so-far parameters are assigned the values of these initial weights and biases, and the results on the validation set across the $K$ trials, including the mean accuracy and the mean norm of the output weights $\|\boldsymbol{\beta}\|$, are calculated. As it is commonly agreed that a smaller $\|\boldsymbol{\beta}\|$ could lead to better generalization performance [1], [2], $\|\boldsymbol{\beta}\|$ is considered as one criterion for creating the decision ensemble. The best-so-far accuracy and norm are used to store the "best" performance (higher accuracy or smaller norm) reached so far at each learning stage, and meanwhile the corresponding weights and biases are saved as well.
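The cross-validation pairing described above can be sketched as a small helper; the name kfold_pairs and the shuffling of sample indices are assumptions for illustration.

```python
import numpy as np

def kfold_pairs(n_samples, n_folds, seed=0):
    """Partition sample indices into n_folds subsets and return the K (train, validation)
    index pairs: each training set consists of K-1 subsets and every subset is used
    exactly once for validation."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    pairs = []
    for k in range(n_folds):
        train_idx = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        pairs.append((train_idx, folds[k]))
    return pairs
```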

B. Weights Updating and Ensemble Construction

In the following iterations, if the parameters of the current iteration result in a higher classification accuracy, or a smaller norm of $\boldsymbol{\beta}$, than the stored best values, the current weights and biases are assigned to the best-so-far parameters; otherwise, the weights and biases of the current iteration are replaced with the best-so-far values. These operations enhance the individual predictors in the ensemble by utilizing parameters that are able to achieve better generalization performance. The best-so-far variables keep the optimal values (in terms of "best" performance) found over the previous iterations, and therefore they play an essential role in parameter updating. After all training samples have been learned, multiple sets of weights and biases are obtained, one for each predictor, and a hypothesis ensemble of predictors is constructed from these parameter sets for any testing instance.
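Putting the two previous sketches together, a hedged sketch of the EN-ELM training loop is given below: for every fold, new random parameters are drawn at each iteration, kept if they give a higher validation accuracy or a smaller output-weight norm than the best found so far, and each resulting predictor is added to the ensemble. The acceptance rule and the names en_elm_train and n_iters reflect our reading of Sections III-A and III-B, not code from the letter; y holds integer class labels and T the one-hot targets.

```python
import numpy as np
# Reuses sigmoid, elm_fit, elm_predict, and kfold_pairs from the earlier sketches.

def en_elm_train(X, T, y, n_hidden, n_folds=10, n_iters=50, seed=0):
    """Build an ensemble of ELM predictors using cross-validation and best-so-far updating."""
    rng = np.random.default_rng(seed)
    ensemble = []                                   # entries: (W, b, beta, ||beta||)
    for train_idx, val_idx in kfold_pairs(len(X), n_folds, seed):
        best = None                                 # (accuracy, norm, W, b)
        for _ in range(n_iters):
            W, b, beta = elm_fit(X[train_idx], T[train_idx], n_hidden, rng)
            acc = np.mean(elm_predict(X[val_idx], W, b, beta) == y[val_idx])
            norm = np.linalg.norm(beta)
            if best is None or acc > best[0] or norm < best[1]:
                best = (acc, norm, W, b)            # keep the improved parameters
            else:
                W, b = best[2], best[3]             # otherwise fall back to the best so far
                beta = np.linalg.pinv(sigmoid(X[train_idx] @ W + b)) @ T[train_idx]
                norm = np.linalg.norm(beta)
            ensemble.append((W, b, beta, norm))
    return ensemble
```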


TABLE II COMPARISONS OF THE EXPERIMENTAL RESULTS ON THREE FACE DATABASES (COMBO, GTFD, AND ORL), AND THREE UCI DATA SETS (DIABETES, LANDSAT SATELLITE, AND IMAGE SEGMENTATION)

Fig. 2. Four face databases used in the experiments: (a) GTFD, (b) ORL, (c) UMIST, and (d) Yale.

TABLE I DATA SETS USED IN THE EXPERIMENTS

C. Decision Making on the Ensemble

As suggested by Zhou et al. [9], many instead of all predictors in the ensemble could be used for decision making. Since $\boldsymbol{\beta}$ is available for each individual ELM classifier after training, all predictors in the ensemble are sorted according to the norm of $\boldsymbol{\beta}$ in increasing order, and the first half of the ensemble is used to make decisions through majority voting. For a pattern $\mathbf{x}$, the class that receives the highest vote is considered the predicted label, and the total vote $V_c$ received by each class $c$ is calculated as

$$V_c = \sum_{p} I\big(h_p(\mathbf{x}) = c\big) \qquad (2)$$

where the sum runs over the selected predictors $h_p$, and the indicator $I(\cdot)$ is set to one if $h_p$ predicts $\mathbf{x}$ as class $c$, and to zero otherwise.
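A sketch of the voting rule in (2): the trained predictors are sorted by the norm of their output weights, the first half is retained, and the class collecting the most indicator votes is returned. It assumes elm_predict and the (W, b, beta, norm) tuples produced by the training sketch above.

```python
import numpy as np
# Assumes elm_predict and an ensemble of (W, b, beta, norm) tuples from the earlier sketches.

def en_elm_predict(X, ensemble, n_classes):
    """Majority voting over the half of the ensemble with the smallest ||beta||, as in (2)."""
    ranked = sorted(ensemble, key=lambda p: p[3])            # increasing norm of beta
    selected = ranked[: max(1, len(ranked) // 2)]            # first half of the ensemble
    votes = np.zeros((len(X), n_classes), dtype=int)
    for W, b, beta, _ in selected:
        pred = elm_predict(X, W, b, beta)                    # h_p(x)
        votes[np.arange(len(X)), pred] += 1                  # indicator I(h_p(x) = c)
    return np.argmax(votes, axis=1)                          # class with the highest vote
```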

IV. EXPERIMENTS

The experiments are carried out on three face databases (Combo, GTFD [12], and ORL [13]) (Fig. 2), where the Combo data set encompasses the ORL, UMIST [14], and Yale [15] databases. Moreover, the EN-ELM algorithm is also validated on three benchmark data sets (Diabetes, Landsat Satellite, and Image Segmentation) from the UCI Machine Learning Repository [16]. The characteristics of these data sets are summarized in Table I. For the face databases, all images are resized to 112 × 92 and pre-processed by the discrete cosine transform (DCT) to reduce the dimensionality to 81. All of the simulations are run in MATLAB on a workstation equipped with an Intel Pentium 4 3.2 GHz CPU and 1 GB of RAM. The learning and testing processes are repeated 50 times, and the sigmoid function is selected as the activation function.
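The DCT preprocessing mentioned above can be sketched as follows, assuming the images have already been resized to 112 × 92 and that the 81 retained coefficients are the low-frequency 9 × 9 block of the 2-D DCT; the exact coefficient-selection scheme is an assumption here, as the letter does not state it.

```python
import numpy as np
from scipy.fft import dctn

def dct_features(image, block=9):
    """Map a 112 x 92 grayscale face image to 81 low-frequency 2-D DCT coefficients."""
    coeffs = dctn(image.astype(float), norm="ortho")   # 2-D discrete cosine transform
    return coeffs[:block, :block].ravel()              # top-left 9 x 9 block -> 81 features
```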

In this letter, tenfold cross-validation is adopted for training, and EN-ELM uses 50 iterations to create the decision ensemble. In the E-ELM algorithm, the DE control parameters are set to 50, 1, and 0.8, respectively, and the number of generations is heuristically determined as 20. Moreover, EN-ELM is also compared with a $k$-nearest neighbor ($k$-NN) classifier and a back-propagation (BP) neural network in which the Levenberg–Marquardt algorithm and the log-sigmoid transfer function are implemented. In the experiments, all approaches use the original training and testing sets, except for the E-ELM and BP algorithms, which divide the testing data into two equal groups and use one part as a validation set to avoid overtraining.

A. Experimental Results

Table II presents the comparison results, where both the training time and the testing time are averaged over the 50 repeats of the learning process. ELM is the fastest learner but performs only fairly in classification. The proposed EN-ELM algorithm outperforms ELM, E-ELM, and the other classification methods by achieving higher testing accuracies on all data sets. Overall, EN-ELM is a stable and efficient extension of ELM, as it provides satisfactory generalization performance with short training time when compared with the E-ELM algorithm and conventional gradient-based methods such as the BP algorithm. Referring to Table II and the results reported in Zhu et al. [2], the BP classifier requires much more training time, whereas its validation results are far from satisfactory.


Although EN-ELM requires more training time than ELM to create the decision ensemble, in which the parameters are adaptively updated, its performance is superior to E-ELM and BP in terms of both testing accuracy and learning time. Nevertheless, the efficiency of the EN-ELM algorithm could be further improved by choosing a smaller ensemble size or using fewer hidden nodes.

B. Study of the Effects of Parameter Selection

The effects of the parameters in EN-ELM are depicted in Fig. 3. It is observed from Fig. 3(a) that the classification accuracy increases monotonically until the number of folds is larger than eight. A small number of folds results in poor generalization performance because fewer training samples are involved in the learning process. Fig. 3(b) shows that a large number of hidden nodes may give higher testing accuracies, but a complex network could also overfit the training data; for example, the generalization performance decreases when the number of hidden nodes is larger than 80. In general, these parameters should be selected empirically for particular applications.

Fig. 3. Results on the ORL database using the EN-ELM method. (a) Classification results with different numbers of folds for cross-validation, where the number of hidden nodes $\tilde{N}$ is 100; (b) testing results with different numbers of hidden nodes, where the number of folds $K$ is 10.
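One way to select such parameters empirically, as Section IV-B suggests, is a simple validation sweep; the candidate grids and the held-out validation split below are illustrative, and en_elm_train / en_elm_predict are the helpers from the sketches in Section III.

```python
import numpy as np
# Assumes en_elm_train and en_elm_predict from the sketches in Section III.

def select_parameters(X_tr, T_tr, y_tr, X_val, y_val, n_classes,
                      fold_grid=(4, 6, 8, 10), hidden_grid=(20, 40, 80, 120)):
    """Return the (n_folds, n_hidden) pair with the highest validation accuracy."""
    best_acc, best_pair = -1.0, None
    for n_folds in fold_grid:
        for n_hidden in hidden_grid:
            ens = en_elm_train(X_tr, T_tr, y_tr, n_hidden, n_folds=n_folds, n_iters=10)
            acc = np.mean(en_elm_predict(X_val, ens, n_classes) == y_val)
            if acc > best_acc:
                best_acc, best_pair = acc, (n_folds, n_hidden)
    return best_pair
```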


V. CONCLUSION

In this letter, the EN-ELM algorithm is proposed, in which ensemble learning and a cross-validation strategy are introduced into the training process to alleviate overfitting and improve the generalization ability. The experimental results demonstrate that EN-ELM outperforms the original ELM algorithm and other popular classifiers, in terms of accuracy, for face recognition as well as several other classification tasks. Although the proposed method spends more time on training than ELM, it remains efficient when compared with E-ELM and conventional gradient-based learning algorithms. Moreover, it is possible to alleviate the computational burden by selecting appropriate parameters.

ACKNOWLEDGMENT

The authors would like to thank the editors and reviewers for their constructive comments and suggestions.

REFERENCES

[1] G. B. Huang, Q. Y. Zhu, and C. K. Siew, "Extreme learning machine: Theory and applications," Neurocomputing, vol. 70, pp. 489–501, 2006.
[2] Q. Y. Zhu, A. K. Qin, P. N. Suganthan, and G. B. Huang, "Evolutionary extreme learning machine," Pattern Recognit., vol. 38, pp. 1759–1763, 2005.
[3] R. Storn and K. Price, "Differential evolution – A simple and efficient heuristic for global optimization over continuous spaces," J. Global Optim., vol. 11, pp. 341–359, 1997.
[4] H. J. Rong, Y. S. Ong, A. H. Tan, and Z. X. Zhu, "A fast pruned-extreme learning machine for classification problem," Neurocomputing, vol. 72, pp. 359–366, 2008.
[5] G. B. Huang, L. Chen, and C. K. Siew, "Universal approximation using incremental constructive feedforward networks with random hidden nodes," IEEE Trans. Neural Netw., vol. 17, pp. 879–892, 2006.
[6] G. B. Huang, M. B. Li, L. Chen, and C. K. Siew, "Incremental extreme learning machine with fully complex hidden nodes," Neurocomputing, vol. 71, pp. 576–583, 2008.
[7] R. X. Zhang, G. B. Huang, N. Sundararajan, and P. Saratchandran, "Multicategory classification using extreme learning machine for microarray gene expression cancer diagnosis," IEEE/ACM Trans. Comput. Biol. Bioinform., vol. 4, pp. 485–495, 2007.
[8] L. K. Hansen and P. Salamon, "Neural network ensembles," IEEE Trans. Pattern Anal. Machine Intell., vol. 12, pp. 993–1001, Oct. 1990.
[9] Z. H. Zhou, J. Wu, and W. Tang, "Ensembling neural networks: Many could be better than all," Artif. Intell., vol. 137, pp. 239–263, 2002.
[10] Y. Lan, Y. C. Soh, and G. B. Huang, "Ensemble of online sequential extreme learning machine," Neurocomputing, vol. 72, pp. 3391–3395, 2009.
[11] D. Serre, Matrices: Theory and Applications. New York: Springer, 2002.
[12] L. Chen, H. Man, and A. V. Nefian, "Face recognition based on multi-class mapping of Fisher scores," Pattern Recognit., vol. 38, pp. 799–811, 2005.
[13] F. S. Samaria and A. C. Harter, "Parameterization of a stochastic model for human face identification," in Proc. IEEE Workshop on Applications of Computer Vision, Sarasota, FL, Dec. 1994, pp. 138–142.
[14] D. B. Graham and N. M. Allinson, "Characterizing virtual eigensignatures for general purpose face recognition," in Face Recognition: From Theory to Applications, NATO ASI Series F, Computer and Systems Sciences, vol. 163, H. Wechsler, P. J. Phillips, V. Bruce, F. Fogelman-Soulie, and T. S. Huang, Eds. New York: Springer, 1998, pp. 446–456.
[15] P. N. Belhumeur, J. P. Hespanha, and D. J. Kriegman, "Eigenfaces vs. fisherfaces: Recognition using class specific linear projection," IEEE Trans. Pattern Anal. Machine Intell., vol. 19, pp. 711–720, Jul. 1997.
[16] A. Frank and A. Asuncion, UCI Machine Learning Repository, Univ. California, Sch. Inform. Comput. Sci., Irvine, CA, 2010. [Online]. Available: http://archive.ics.uci.edu/ml
