COMBINING SUPPORT VECTOR MACHINES FOR ACCURATE FACE DETECTION

I. Buciu, C. Kotropoulos and I. Pitas
Department of Informatics, Aristotle University of Thessaloniki
Box 451, Thessaloniki 540 06, Greece
{costas,pitas}@zeus.csd.auth.gr

This work was supported by the European Union Research Training Network "Multi-modal Human-Computer Interaction" (HPRN-CT-2000-00111). I. Buciu is on leave from the Applied Electronics Department, University of Oradea, 5 Armatei Romane str., Oradea, 3700, Romania.

ABSTRACT

The paper proposes the application of majority voting to the output of several support vector machines in order to select the most suitable learning machine for frontal face detection. The first experimental results indicate a significant reduction of the rate of false positive patterns.

1. INTRODUCTION

Face detection is a prerequisite task in many applications, including face recognition, teleconferencing, and face gesture recognition. The human face plays a central role in intelligent human-computer interaction. The goal of face detection is to determine whether there are any human faces in a test image. If a face exists, the objective is to locate it in the test image regardless of the actual position, orientation, scale and pose of the head, as well as the lighting variations. Due to the many aforementioned variable factors, developing a robust human face detector is a hard task. A comprehensive survey on face detection methods can be found in [1].

A probabilistic method based on density estimation in a high-dimensional space using an eigenspace decomposition is proposed in [2]. A closely related work is the example-based approach in [3] for locating vertically oriented and unoccluded frontal face views at different scales, which uses a number of Gaussian clusters to model the distributions of face and non-face patterns. In [4, 5], the color distribution of the face region pixels is described by a Gaussian mixture model with two Gaussian functions, so that the face and hair colors are adequately represented. In this approach, a single Gaussian function was used to model the color of the non-face region pixels. A mixture of linear subspaces has been used to model the latter distributions in [6], where a mixture of factor analyzers is employed to detect faces with wide variations. A detection algorithm that combines template matching and feature-based detection using hierarchical Markov random fields and maximum a posteriori probability estimation is developed in [7]. Kullback relative information is used in [8] for maximal discrimination between positive and negative examples of faces, whose densities are modeled by discrete Markov processes. Other techniques include neural networks [9], and algorithms where feature points are detected using spatial filters and then grouped into face candidates using geometric and grey-level constraints [10]. Fast face detection using multilayer perceptrons and the fast Fourier transform is described in [11].

For the detection of upright, frontal views of faces in grayscale images, another neural network-based algorithm that applies one or more neural networks directly to portions of the input image and arbitrates their results is presented in [12]. Recently, a new neural network model, the constrained generative model, was proposed in [13]. Its learning process aims at evaluating the probability that the face model has generated the input data, using counterexamples to increase the quality of the estimation. The application of support vector machines (SVMs) to frontal face detection in images was first proposed in [14, 15]. A sparse network of Winnows (SNoW), that is, a sparse network of linear functions that utilizes the Winnow update rule, has been employed to build a face detector in [16]. In several papers, false positive patterns are collected and fed to the learning machine at the next iteration of the training procedure, a procedure that resembles the bootstrap [3].

An alternative approach is proposed in this paper. More specifically, we propose to rank an ensemble of SVMs trained on the same set by combining their outputs with majority voting in the decision-making process. By doing so, we can identify the most efficient SVM, i.e., the one whose outputs appear most frequently in the set of outputs produced by the ensemble of SVMs. We apply this technique to frontal face detection and report a significant reduction of the rate of false positive patterns. We also apply a bagging (bootstrap aggregating) technique to each SVM and compare the results with those obtained by the proposed method.

The outline of the paper is as follows. A brief description of SVMs is presented in Section 2. The proposed method is explained in Section 3, followed by a brief presentation of the bagging SVMs in Section 4. Experimental results are reported in Section 5, and conclusions are drawn in Section 6.

2. SUPPORT VECTOR MACHINES

SVMs constitute a state-of-the-art pattern recognition technique whose foundations stem from statistical learning theory [19]. However, the scope of SVMs goes beyond pattern recognition, because they can also handle two other learning problems, i.e., regression estimation and density estimation. In the context of pattern recognition, the main objective is to find the optimal separating hyperplane, that is, the hyperplane that separates the positive and negative examples with maximal margin. We briefly describe the linearly separable case, followed by the nonseparable and the nonlinear ones.

Consider the training data set S = {(x_i, y_i)}_{i=1}^{l} of labeled training patterns, x_i ∈ R^d, where d denotes the dimensionality of the training patterns, and y_i ∈ {-1, +1}. We say that S is linearly separable if for some w ∈ R^d and b ∈ R,

  y_i (w^T x_i + b) ≥ 1,  for i = 1, 2, ..., l.                    (1)

Then w is the normal vector to the separating hyperplane w^T x + b = 0 and b is a bias (or offset) term [19]. The optimal separating hyperplane is the solution of the following quadratic problem [19, 17]:

  minimize    (1/2) w^T w
  subject to  y_i (w^T x_i + b) ≥ 1,  i = 1, 2, ..., l.            (2)

The optimal w is given by

  w* = Σ_{i=1}^{l} λ_i y_i x_i                                     (3)

where λ is the vector of Lagrange multipliers obtained as the solution of the so-called Wolfe dual problem [18]:

  maximize    λ^T 1 - (1/2) λ^T D λ
  subject to  Σ_{i=1}^{l} y_i λ_i = 0  and  λ_i ≥ 0                (4)

where 1 is the l × 1 vector of ones and D is an l × l matrix having elements D_ij = y_i y_j x_i^T x_j. Thus w* is a linear combination of the training patterns x_i for which λ_i > 0. These training patterns are called support vectors. Given a pair of support vectors (x(1), x(-1)) that belong to the positive and the negative patterns, respectively, the bias term b* is found by [19]

  b* = -(1/2) [w*^T x(1) + w*^T x(-1)].                            (5)

Accordingly, the decision rule implemented by the SVM is simply

  f(x) = sign(w*^T x + b*).                                        (6)
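To make (2)-(6) concrete, the following minimal sketch (our illustration; the paper itself used Gunn's Matlab toolbox [21]) trains a linear SVM with scikit-learn on toy data, recovers w* and b*, and checks the decision rule (6). The toy data and the large C, which makes the soft-margin problem (7) below approximate the separable problem (2), are assumptions.

    import numpy as np
    from sklearn.svm import SVC

    # Toy linearly separable data (assumed for illustration).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(+2.0, size=(20, 2)),    # positive patterns
                   rng.normal(-2.0, size=(20, 2))])   # negative patterns
    y = np.array([+1] * 20 + [-1] * 20)

    # A large C approximates the hard-margin (separable) problem (2).
    svm = SVC(kernel="linear", C=1e6).fit(X, y)

    # Eq. (3): w* = sum_i lambda_i y_i x_i. dual_coef_ stores lambda_i y_i
    # for the support vectors only, so the sum runs over them.
    w_star = (svm.dual_coef_ @ svm.support_vectors_).ravel()
    b_star = svm.intercept_[0]

    # Eq. (6): f(x) = sign(w*^T x + b*) must agree with the library output.
    assert np.array_equal(np.sign(X @ w_star + b_star), svm.predict(X))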

If the training set S is nonseparable, the optimization problem (2) is generalized to

  minimize    (1/2) w^T w + C Σ_{i=1}^{l} ξ_i
  subject to  y_i (w^T x_i + b) ≥ 1 - ξ_i,  ξ_i ≥ 0,  i = 1, 2, ..., l   (7)

where the ξ_i are positive slack variables [18] and C is a parameter which penalizes the errors. The Lagrange multipliers now satisfy the inequalities 0 ≤ λ_i ≤ C. The main difference is that the support vectors do not necessarily lie on the margin.

Finally, SVMs can also provide nonlinear separating surfaces by projecting the data to a high-dimensional feature space through a mapping Φ: R^d → H. A linear hyperplane is then sought that separates all the projected data Φ(x) in the high-dimensional feature space. If the inner product in this space has an equivalent kernel in the input space R^d, i.e., Φ(x_i)^T Φ(x_j) = K(x_i, x_j), the inner product need not be evaluated in the feature space, thus avoiding the curse of dimensionality. In such a case, D_ij = y_i y_j K(x_i, x_j) and the decision rule implemented by the nonlinear SVM is given by

  f(x) = sign( Σ_{i=1}^{l} λ_i y_i K(x, x_i) + b* ).               (8)
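The decision rule (8) can be sketched directly. The GRBF kernel and the value σ = 10 anticipate Table 1 in the next section; the array arguments are hypothetical placeholders.

    import numpy as np

    def grbf_kernel(x, xi, sigma=10.0):
        """GRBF kernel of Table 1: K(x, y) = exp(-||x - y||_2^2 / (2 sigma^2))."""
        return np.exp(-np.sum((x - xi) ** 2) / (2.0 * sigma ** 2))

    def nonlinear_svm_decision(x, support_vectors, lambdas, labels, b_star):
        """Eq. (8): f(x) = sign(sum_i lambda_i y_i K(x, x_i) + b*)."""
        s = sum(lam * y * grbf_kernel(x, xi)
                for lam, y, xi in zip(lambdas, labels, support_vectors))
        return int(np.sign(s + b_star))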

3. APPLICATION OF MAJORITY VOTING TO THE OUTPUT OF SEVERAL SVMS

Let us consider five different SVMs defined by the kernels indicated in Table 1. The following kernels have been used: (1) the linear kernel; (2) the polynomial kernel with q equal to 2; (3) the Gaussian radial basis function (GRBF) with σ = 10; (4) the sigmoid kernel with κ = 0.2 and ϑ = 0.5; and (5) the exponential radial basis function (ERBF) with σ equal to 10. The penalty C in (7) was set to 500. In Table 1, ||·||_p denotes the vector p-norm, p = 1, 2.

Table 1. Kernel functions used in SVMs.

  SVM type     Kernel function K(x, y)
  Linear       x^T y
  Polynomial   (x^T y + 1)^q
  GRBF         exp(-||x - y||_2^2 / (2 σ^2))
  Sigmoid      tanh(κ x^T y + ϑ)
  ERBF         exp(-||x - y||_1 / (2 σ^2))

For brevity, we index each SVM by k, k = 1, 2, ..., 5. To distinguish between training and test patterns, the latter are denoted by z_j. Let Z be the test set. We define the two values of the histogram of the labels assigned to each z_j ∈ Z as

  h_1(z_j)  = #{f_k(z_j) = 1,  k = 1, 2, ..., 5}
  h_-1(z_j) = #{f_k(z_j) = -1, k = 1, 2, ..., 5}                   (9)

where # denotes the set cardinality. We combine the decisions taken separately by the SVMs indexed by k = 1, 2, ..., 5 as follows:

  g(z_j) = 1 if h_1(z_j) > h_-1(z_j), and g(z_j) = -1 otherwise.   (10)

Let us define the quantities

  F_k = #{f_k(z_j) = 1, z_j ∈ Z}
  G_k = #{g(z_j) = 1 and f_k(z_j) = 1, z_j ∈ Z}.                   (11)

To determine the best SVM, we simply choose

  m = arg max_k {G_k / F_k}.                                       (12)

If the set m contains more than one element, we can define

  match_k = #{f_k(z_j) = g(z_j), z_j ∈ Z}.                         (13)

Then the condition (12) becomes

  k* = {k | k ∈ m and k = arg max_k match_k}.                      (14)
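A sketch of the selection scheme (9)-(14); collecting the labels f_k(z_j) in a 5 × n integer array is our own convention.

    import numpy as np

    def select_best_svm(labels):
        """labels: (5, n) array of f_k(z_j) in {-1, +1}, one row per SVM.
        Returns the ensemble decisions g(z_j) of eq. (10) and the index
        of the best machine according to eqs. (11)-(14)."""
        # Eqs. (9)-(10): majority vote over the five machines.
        h1 = (labels == 1).sum(axis=0)
        h_minus1 = (labels == -1).sum(axis=0)
        g = np.where(h1 > h_minus1, 1, -1)
        # Eq. (11); the ratio assumes F_k > 0 for every machine.
        F = (labels == 1).sum(axis=1)
        G = ((labels == 1) & (g == 1)).sum(axis=1)
        ratio = G / F
        # Eq. (12): machines attaining the maximal ratio.
        m = np.flatnonzero(ratio == ratio.max())
        if len(m) == 1:
            return g, int(m[0])
        # Eqs. (13)-(14): break ties by agreement with the ensemble.
        match = (labels == g).sum(axis=1)
        return g, int(m[np.argmax(match[m])])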

4. BAGGING SUPPORT VECTOR MACHINES

Bagging is a method for improving the prediction error of a classifier learning system by generating replica bootstrap samples of the original training set [20]. Given a training set S, a bootstrap replica S* of it is built by taking l samples with replacement from the original training set S. Let f(x; S_1*), ..., f(x; S_B*) be B SVMs trained on S_1*, ..., S_B* that employ the same kernel function. For each test sample z_j, ŷ_i = f(z_j; S_i*), i = 1, 2, ..., B, is the class label estimated by each of the B SVMs. Let

  h_1(z_j)  = #{i : f(z_j; S_i*) = 1}
  h_-1(z_j) = #{i : f(z_j; S_i*) = -1}                             (15)

be the histogram values of the estimated class labels. The bagging SVM decides according to (10).
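A bagging sketch under the same conventions. The use of scikit-learn and of the GRBF kernel with σ = 10 and C = 500 mirrors Section 3 but is otherwise our assumption; each bootstrap replica is assumed to contain both classes.

    import numpy as np
    from sklearn.svm import SVC

    def bagging_svm_predict(X_train, y_train, Z, B=21, seed=0):
        """Train B SVMs on bootstrap replicas S_1*, ..., S_B* and combine
        their labels by the majority vote of eqs. (15) and (10)."""
        rng = np.random.default_rng(seed)
        l = len(X_train)
        votes = np.empty((B, len(Z)), dtype=int)
        for i in range(B):
            idx = rng.integers(0, l, size=l)   # l samples with replacement
            svm = SVC(kernel="rbf", gamma=1.0 / (2 * 10.0**2), C=500.0)
            svm.fit(X_train[idx], y_train[idx])
            votes[i] = svm.predict(Z)
        h1 = (votes == 1).sum(axis=0)          # eq. (15)
        h_minus1 = (votes == -1).sum(axis=0)
        return np.where(h1 > h_minus1, 1, -1)  # decision rule (10)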

5. EXPERIMENTAL RESULTS

For all experiments, the Matlab SVM toolbox developed by Steve Gunn was used [21]. For a complete test, several auxiliary routines have been added to the original toolbox.

5.1. Data set and pattern extraction

A training data set of 96 images, 48 images containing a face and another 48 images with non-face patterns, was used. The images that contain face patterns have been derived from the face database of IBERMATICA collected within the framework of the M2VTS project, where several sources of degradation were modeled, such as varying face size and position and changes in illumination. All images in this database were recorded in 256 grey levels and are of dimensions 320 × 240. They correspond to 12 different persons. For each person, four different frontal images were collected. The procedure for collecting face patterns was as follows. From each image, a bounding rectangle of dimensions 160 × 128 pixels that includes the actual face has been manually determined. The face region included within the bounding rectangle has been subsampled four times. Accordingly, training patterns x_i of dimensions 10 × 8 were built. The ground truth, that is, the true class label y_i = 1, was appended to each pattern. Similarly, 48 non-face patterns have been collected from images depicting trees, wheels, bubbles, and so on, by subsampling randomly selected regions of dimensions 160 × 128 four times. The latter patterns were annotated by y_i = -1.
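A sketch of this pattern extraction, reading "subsampled four times" as four successive decimations by two (our interpretation; the paper does not state the subsampling filter):

    import numpy as np

    def extract_pattern(region):
        """Reduce a 160 x 128 grey-level region to a 10 x 8 pattern by
        four decimations, flattened to an 80-dimensional vector."""
        assert region.shape == (160, 128)
        for _ in range(4):
            region = region[::2, ::2]   # keep every other row and column
        assert region.shape == (10, 8)
        return region.ravel().astype(float)

    # Usage on a hypothetical grey-level region:
    pattern = extract_pattern(np.zeros((160, 128), dtype=np.uint8))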

5.2. Performance assessment

We have trained the five different SVMs indicated in Table 1. The trained SVMs have been applied to six test images from the IBERMATICA database that were not included in the training set. Each test image corresponds to a different person. The resolution of each test image has been reduced four times, yielding a final image of dimensions 15 × 20. Scanning the reduced-resolution image row by row with a rectangular window of 10 × 8 pixels, test patterns are classified either as non-face ones (i.e., f(z) = -1) or as face patterns (i.e., f(z) = 1). When a face pattern is found by the machine, a rectangle is drawn, locating the face in the image, as sketched below.
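The scanning step can be outlined as follows; the unit stride and the detector interface f are our assumptions.

    import numpy as np

    def scan_image(reduced_image, f, win=(10, 8)):
        """Slide a 10 x 8 window row by row over the reduced-resolution
        image and collect the top-left corners where f(z) = 1."""
        rows, cols = reduced_image.shape
        detections = []
        for r in range(rows - win[0] + 1):       # unit stride (assumed)
            for c in range(cols - win[1] + 1):
                z = reduced_image[r:r + win[0], c:c + win[1]].ravel()
                if f(z) == 1:
                    detections.append((r, c))    # draw a rectangle here
        return detections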

We have tabulated the ratio G_k/F_k in Table 2. From Table 2, it can be seen that the ERBF machine maximizes the ratio in (12) for five of the six test images. On the contrary, the machine built with the sigmoid kernel attains the worst performance with respect to (12).

Table 2. Ratio G_k/F_k achieved by the various SVMs. In parentheses are the match_k values.

  SVM type k     Test image number
                 1      2      3      4      5        6
  1              0.83   0.20   0.57   0.66   1 (70)   0.74
  2              0.52   0.28   0.57   0.44   1 (72)   0.71
  3              0.67   0.25   0.44   0.44   0.80     0.83
  4              0.64   0.14   0.15   0.11   0.22     0.13
  5              1      0.50   0.80   0.80   0.80     1

In the case of equal ratios, the match_k value is written in parentheses; therefore, the best SVM for test image 5 is the polynomial SVM. Interestingly, the ERBF machine experimentally yields the greatest number of support vectors, as can be seen in Table 3.

Table 3. Number of support vectors found in the training of the several SVMs studied.

  SVM type k     Test image number
                 1    2    3    4    5    6
  1              11   11   11   11   10   11
  2              14   13   14   14   14   13
  3              12   10   12   16   12   12
  4              13   11   11   11   11   11
  5              39   41   41   40   39   40

To assess the performance of the majority voting procedure, we have manually annotated each test pattern z_i with the ground truth, denoted as z_{i,81}, i.e., the true label appended as the 81st element of the 80-dimensional pattern. Two quantitative measures have been used for the assessment of the performance of each SVM, namely the false acceptance rate (FAR), that is, the rate of false positives, and the false rejection rate (FRR), that is, the rate of false negatives, during the test phase. We have measured FAR and FRR for each SVM individually as well as after majority voting. We have found that FRR is always zero, while FAR varies. The values of FAR attained by each SVM individually and after applying majority voting are shown in Table 4; the values obtained with bagging are reported in Table 5.

Table 4. False acceptance rates (in %) achieved by the various SVMs individually and after applying the majority vote.

  SVM type k     Test image number
                 1     2      3      4      5      6
  1              3.9   10.5   6.5    5.2    2.6    6.5
  2              6.5   6.5    6.5    9.2    2.6    6.5
  3              5.2   7.8    9.2    9.2    3.9    5.2
  4              7.8   17.1   31.5   44.7   21.0   47.3
  5              2.6   2.6    3.9    3.9    3.9    3.9
  combining      2.6   1.3    2.6    2.6    2.6    3.9

It is seen that the application of majority voting reduces the number of false positives in all cases, particularly when F_k ≠ G_k. Leaving out the SVM that employs the sigmoid kernel function, we developed bagging SVMs for each of the remaining four kernels. The number of bootstrap replicas was B = 21. Unfortunately, we found that bagging does not yield a lower FAR for SVMs, as can be seen in Table 5.

Table 5. False acceptance rates (in %) of the bagging SVMs.

  SVM type k     Test image number
                 1      2      3      4      5     6
  1              4.7    12.1   7.6    6.5    3.5   7.8
  2              10.1   9.3    7.6    9.2    3.5   10.8
  3              7.7    10.1   10.6   13.5   4.5   8.8
  5              2.6    3.1    6.5    6.5    4.5   4.8
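For reference, a sketch of how FAR and FRR could be computed from the annotated ground truth; normalizing over the true non-face (respectively face) patterns is the standard definition and our assumption here.

    import numpy as np

    def far_frr(predicted, truth):
        """FAR: false positives among true non-faces; FRR: false
        negatives among true faces. Both returned in percent."""
        predicted, truth = np.asarray(predicted), np.asarray(truth)
        far = 100.0 * np.mean(predicted[truth == -1] == 1)
        frr = 100.0 * np.mean(predicted[truth == 1] == -1)
        return far, frr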

Figure 1 depicts two extreme cases observed during a test. It is seen that majority voting helps to discard many of the candidate face regions returned by a single SVM (Fig. 1(b)), yielding the best face localization (Fig. 1(a)).

Fig. 1. (a) Best and (b) worst face location determined during a test.

6. CONCLUSIONS AND DISCUSSION

In this paper, we have attempted to improve the accuracy of SVMs by applying a majority vote to the output of an ensemble of different machines. We have tested the aforementioned technique on frontal face detection. We have also used bagging in order to reduce the false acceptance rate and compared the rates obtained with those achieved by the proposed technique. Note that in the case of bagging, the majority vote is applied to the labels derived by each SVM trained on a bootstrap replica, while in the case of the ensemble it is applied to the labels derived by SVMs employing different kernel functions. Ensembling different kernel machines turns out to be more accurate than bagging for face detection.

7. REFERENCES

[1] M.-H. Yang, N. Ahuja, and D. Kriegman, "A survey on face detection methods," IEEE Trans. on Pattern Analysis and Machine Intelligence, to appear, 2001.

[2] B. Moghaddam and A. Pentland, "Probabilistic visual learning for object recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, pp. 696–710, July 1997.

[3] K.-K. Sung and T. Poggio, "Example-based learning for view-based human face detection," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 39–51, January 1998.

[4] R. Rao and R. Mersereau, "On merging hidden Markov models with deformable templates," in Proc. 1995 Int. Conf. on Image Processing, Washington, D.C., 1995.

[5] T. Chen, "Audio-visual speech processing," IEEE Signal Processing Magazine, vol. 18, no. 1, pp. 9–21, January 2001.

[6] M.-H. Yang, N. Ahuja, and D. Kriegman, "Face detection using a mixture of factor analyzers," in Proc. 1999 IEEE Int. Conf. on Image Processing, vol. 3, pp. 612–616, 1999.

[7] R.J. Qian and T.S. Huang, "Object detection using hierarchical MRF and MAP estimation," in Proc. 1997 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, pp. 186–192, 1997.

[8] A.J. Colmenarez and T.S. Huang, "Face detection with information-based maximum discrimination," in Proc. 1997 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, pp. 782–787, 1997.

[9] R. Vaillant, C. Monrocq, and Y. Le Cun, "Original approach for the localisation of objects in images," IEE Proc. Vision, Image and Signal Processing, vol. 141, no. 4, August 1994.

[10] K.-C. Yow and R. Cipolla, "Feature-based human face detection," Image and Vision Computing, vol. 15, no. 9, pp. 713–735, 1999.

[11] S. Ben-Yacoub, B. Fasel, and J. Luettin, "Fast face detection using MLP and FFT," in Proc. Second Int. Conf. on Audio- and Video-based Biometric Person Authentication, pp. 31–36, 1999.

[12] H.A. Rowley, S. Baluja, and T. Kanade, "Neural network-based face detection," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, no. 1, pp. 23–37, January 1998.

[13] R. Féraud, O.J. Bernier, J.-E. Viallet, and M. Collobert, "A fast and accurate face detector based on neural networks," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 23, no. 1, pp. 42–53, January 2001.

[14] E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: An application to face detection," in Proc. IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, pp. 130–136, 1997.

[15] C. Papageorgiou, M. Oren, and F. Girosi, "A general framework for object detection," in Proc. Fifth Int. Conf. on Computer Vision, pp. 555–562, 1998.

[16] M.-H. Yang, D. Roth, and N. Ahuja, "A SNoW-based face detector," in S.A. Solla, T.K. Leen, and K.-R. Müller, Eds., Advances in Neural Information Processing Systems, vol. 12, pp. 855–861, MIT Press, 2000.

[17] C.J.C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998.

[18] R. Fletcher, Practical Methods of Optimization, 2nd ed. Chichester, U.K.: J. Wiley & Sons, 1987.

[19] V.N. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.

[20] L. Breiman, "Bagging predictors," Machine Learning, vol. 24, pp. 123–140, 1996.

[21] S. Gunn, "Support Vector Machines for Classification and Regression," ISIS Technical Report ISIS-1-98, Image Speech & Intelligent Systems Research Group, University of Southampton, May 1998.