Automatic Video Based Face Verification and Recognition by Support Vector Machines

Gang Song*, Haizhou Ai, Guangyou Xu, Li Zhuang
Dept. of Computer Science & Technology, Tsinghua University, Beijing, China

ABSTRACT

This paper presents an automatic video-based face verification and recognition system using Support Vector Machines (SVMs). Faces used as training samples are automatically extracted from input video sequences in real time by LUT-based Adaboost and are normalized both in geometry and in gray-level distribution after facial landmark localization via a Simple Direct Appearance Model (SDAM). Two different strategies for multi-class face verification and recognition problems with SVMs, "one-vs-all" and "one-vs-another", are discussed and compared in detail. Experimental results over 100 clients are reported to demonstrate the effectiveness of SVMs on video sequences.

Keywords: Support Vector Machine, Face Recognition, Face Verification

1. INTRODUCTION

Many present face verification and recognition methods are based on one or a few views of a client's face to build the face model, and almost all publicly available face databases, such as FERET [12] and ORL [13], are set up for this kind of research, for which many different approaches have been developed ([3], [15]). Famous ones include eigenface methods [11], the dynamic link architecture [9], deformable models [10], and elastic graph matching [14]. Great progress has been achieved in tolerating lighting and pose changes by building various compact face representation models. But by nature it is difficult to achieve good generalization over the broad diversity of appearance, which cannot be captured by only a few views of each person. Many existing methods also depend on pose estimation of the input image, which unfortunately is still not an easy task.

Recently, Support Vector Machines have been applied to such data sets and achieve highly competitive results [6][8] without complex face representations. This statistical learning method shows a new way to approach face verification and recognition problems. Furthermore, videos provide abundant samples compared with static images. In this paper we discuss the performance of SVMs in exploring the potential of video sequences for face verification and recognition. We present a fully automatic system, in which face samples are extracted through real-time face detection and facial landmark localization. Two different strategies ("one-vs-all" and "one-vs-another") for multi-class face verification and recognition problems with SVMs are discussed.

The paper is organized as follows. A system overview is given in Sec. 2. The extraction of face samples from videos is described in Sec. 3. The two strategies for face verification and recognition are given in Sec. 4. Sec. 5 shows the experimental results. The conclusion is given in Sec. 6.

2. SYSTEM OVERVIEW

As shown in Fig. 1, the system consists of four modules: face detection, facial landmark localization, face sample normalization, and the SVM training and verification/recognition procedure. First, faces are detected by the LUT-based Adaboost algorithm [1] in the input video sequences. Then facial landmarks are localized within the face area to serve as reference points for face sample normalization. The system works at a speed of about 15 Hz on a PIII-800MHz machine, producing face samples for the SVMs. Two different SVM strategies ("one-vs-all" and "one-vs-another") are applied to the multi-class face verification and recognition problems. In practice, the SMO algorithm developed by J. C. Platt [18] is used for SVM training.

* [email protected]

Figure 1: System overview. Video input → LUT-based Adaboost cascade detector for face location → facial landmark localization (eye centers, mouth corners) → face sample normalization in geometry and lighting → face samples of persons 1 to n → SVM classifiers trained in the one-vs-all or one-vs-another strategy → person verification/recognition.

3. FACE SAMPLE EXTRACTION FROM VIDEO

3.1 Face Region Detection

A variant of Viola's Adaboost [17], called the LUT method [1], is used to detect face regions. The LUT method uses look-up tables as weak classifiers, which approximate a multimodal distribution, instead of the binary threshold classifiers in [17]. This results in a more efficient training process and a cascade detector with fewer layers. Suppose a Haar-like feature f(x) is normalized to the domain [0, 1] and the size of the LUT is n; the k-th LUT item then corresponds to the equally sized range Rk = [(k-1)/n, k/n]. The LUT weak classifier is defined as follows:

    h_LUT(x) = 1, if f(x) ∈ Rk and P1(k) > P2(k); 0, otherwise    (1)

where P1(k) = P(f(x) ∈ Rk | w1), P2(k) = P(f(x) ∈ Rk | w2), and w1, w2 are the two classes. More details can be found in [1].
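Concretely, the LUT weak classifier in (1) compares weighted class histograms bin by bin. The following is a minimal sketch under stated assumptions: the function names, toy data, and choice of n_bins are ours, not from [1], and the Haar-like feature values are assumed already normalized to [0, 1].

```python
import numpy as np

def train_lut_weak_classifier(features, labels, weights, n_bins=8):
    """Train one LUT weak classifier for a single Haar-like feature.

    features: feature values f(x) in [0, 1], shape (m,)
    labels:   +1 (class w1, faces) or -1 (class w2, non-faces)
    weights:  AdaBoost sample weights, shape (m,)
    Returns a 0/1 table of length n_bins: bin k outputs 1 iff the
    weighted mass P1(k) of class w1 exceeds P2(k) of class w2.
    """
    bins = np.minimum((features * n_bins).astype(int), n_bins - 1)
    p1 = np.zeros(n_bins)  # weighted mass of w1 per bin
    p2 = np.zeros(n_bins)  # weighted mass of w2 per bin
    for b, y, w in zip(bins, labels, weights):
        if y > 0:
            p1[b] += w
        else:
            p2[b] += w
    return (p1 > p2).astype(int)

def lut_predict(lut, feature_value):
    # Map f(x) to its bin Rk and look up the stored decision.
    k = min(int(feature_value * len(lut)), len(lut) - 1)
    return lut[k]
```

Because the table is precomputed, evaluating the weak classifier at detection time is a single array lookup, which is what makes the cascade fast.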

3.2 Facial Landmark Localization

After the face region is detected, three facial landmarks are located for face normalization: the two eye centers and the mouth center. We use a Simple Direct Appearance Model (SDAM), similar to DAM [16], to calculate the facial landmark positions directly from the gray-level face image. Suppose the texture t and the shape s of face images can be modeled by a linear function:

    s = R * t + ε    (2)

where t is the gray-level image vector, s is the vector of facial landmark coordinates, and ε is the error. The coefficient matrix R can be learned statistically. A brief description of the SDAM algorithm:

1. Use the detected face region image t0 as the initial texture, t ← t0. Go to 2.
2. Get the facial landmark position vector s from (2). If the current shape s is near the average shape s̄ (obtained in the training procedure), calculate the corresponding positions in the original image and stop. Otherwise, go to 3.
3. Apply an affine transform to the current image with the current s to obtain a new face image for the new texture t. Go to 2.

3.3 Face Sample Normalization

Geometric normalization is applied to the face samples cut from the input video frames. Given three pairs of facial landmarks, the affine transform parameters are completely determined. As shown in Fig. 2, the located facial landmarks are aligned to the standard positions by the affine transform. A linear lighting adjustment is then applied to the gray levels of each aligned face (Fig. 2b) by subtracting an interpolated intensity plane (Fig. 2c), for simple photometric normalization. A mask is laid over the face to exclude the corner points (Fig. 2d). A masked, normalized sample is thus extracted from each frame of the video sequence and concatenated into a feature vector for the SVM classifiers.
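Since three non-collinear landmark pairs yield six linear equations, the six affine parameters are fully determined and can be obtained by solving a small linear system. Below is a sketch of this step; the standard landmark coordinates are hypothetical placeholders (only the 48 px cutout size comes from Fig. 2), not the paper's actual values.

```python
import numpy as np

# Hypothetical standard positions (pixels) inside a 48x48 cutout:
STD_POINTS = np.array([[14.0, 17.0],   # left eye center
                       [34.0, 17.0],   # right eye center
                       [24.0, 38.0]])  # mouth center

def affine_from_landmarks(src_points, dst_points=STD_POINTS):
    """Solve the 6 affine parameters mapping src -> dst.

    Each point pair (x, y) -> (x', y') contributes two equations:
        x' = a*x + b*y + tx
        y' = c*x + d*y + ty
    so three non-collinear pairs determine (a, b, tx, c, d, ty).
    """
    A, b = [], []
    for (x, y), (xd, yd) in zip(src_points, dst_points):
        A.append([x, y, 1, 0, 0, 0]); b.append(xd)
        A.append([0, 0, 0, x, y, 1]); b.append(yd)
    p = np.linalg.solve(np.array(A), np.array(b))
    # 2x3 matrix in the form accepted by e.g. cv2.warpAffine
    return np.array([[p[0], p[1], p[2]],
                     [p[3], p[4], p[5]]])
```

Applying the resulting 2x3 matrix to the frame (e.g. with an image-warping routine) produces the geometrically normalized cutout.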

Figure 2: Face normalization procedures. (a) Facial features located in the detected face region; (b) geometrically normalized cutout (standard face alignment, 48 x 48 px); (c) photometrically normalized cutout; (d) cutout with mask.

4. TRAINING SVMS FOR FACE VERIFICATION AND RECOGNITION

4.1 Support Vector Machines

SVM [5] shows excellent generalization performance for high-dimensional data and small training sets. Given labeled samples (y_i, x_i), x_i ∈ R^n, y_i ∈ {-1, +1}, i = 1, ..., l, and a kernel function K(x_i, x_j), training an SVM amounts to solving the quadratic programming (QP) problem:

    α* = argmin_α  (1/2) Σ_{i=1..l} Σ_{j=1..l} α_i α_j y_i y_j K(x_i, x_j) − Σ_{i=1..l} α_i    (3)

    subject to  Σ_{i=1..l} α_i y_i = 0  and  0 ≤ α_i ≤ C,  i = 1, ..., l

All the x_i corresponding to non-zero α_i are the support vectors (SVs). The predicted class label of a sample x ∈ R^n is:

    f(x) = sign( Σ_{x_i ∈ SVs} α_i y_i K(x_i, x) + b )    (4)
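As a rough illustration of how one such per-client classifier could be trained, here is a sketch using scikit-learn's SVC with an RBF kernel; the paper itself uses Platt's SMO implementation [18], and the Gaussian toy data below merely stands in for flattened 48x48 normalized face vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in for normalized face samples: 48x48 gray cutouts
# flattened to 2304-dim vectors (values purely illustrative).
rng = np.random.default_rng(0)
client_i = rng.normal(0.6, 0.1, size=(50, 2304))  # samples of client i
others   = rng.normal(0.4, 0.1, size=(50, 2304))  # other clients pooled

X = np.vstack([client_i, others])
y = np.array([1] * 50 + [-1] * 50)

# RBF kernel and soft-margin constant C as in the dual QP above;
# C bounds each alpha_i in [0, C].
clf = SVC(kernel="rbf", C=10.0, gamma="scale")
clf.fit(X, y)

# Support vectors are the training samples with non-zero alpha_i;
# the decision function is the kernel expansion of Eq. (4).
print(clf.n_support_)
```

In the "one-vs-all" strategy this binary training run is repeated n times, once per client; in "one-vs-another" it is repeated for each of the n(n-1)/2 client pairs.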

Two training strategies for the multi-class problem, "one-vs-all" and "one-vs-another" [7], are described below for video-based face verification and recognition. Suppose the database contains n clients. In the "one-vs-all" case, the i-th SVM classifier is trained between client i and all the other clients, so there are n classifiers in total. In the "one-vs-another" case, a classifier is trained between each pair of clients, which results in n(n-1)/2 classifiers in total.

4.2 Face Verification

• "One-vs-All" Strategy. Suppose the person claims to be "client i". Each video frame is fed into the i-th SVM classifier to get a verification result at the frame level. At the video level, the claim is accepted only when a large ratio (for example, 70%) of all the frames in the video pass the verification. This ratio is referred to as the video accept ratio in the following. Fig. 3 shows the procedure at the frame level.

• "One-vs-Another" Vote Strategy. Suppose the person claims to be "client i". At the frame level, a frame is accepted when it passes all of the corresponding n-1 one-vs-another SVM classifiers. At the video level, the claim is accepted only when a large ratio (the video accept ratio) of all the frames pass the n-1 verification tests. Fig. 4 shows the procedure at the frame level.
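The frame-level and video-level decision rules above can be sketched as follows; the 70% threshold follows the paper's example, and the function names are ours.

```python
def verify_video(frame_decisions, video_accept_ratio=0.7):
    """Video-level decision from per-frame verification results.

    frame_decisions: booleans, one per extracted face sample
    (True = the frame passed verification for the claimed client).
    The claim is accepted iff the fraction of passing frames
    reaches the video accept ratio (70% in the experiments).
    """
    if not frame_decisions:
        return False
    return sum(frame_decisions) / len(frame_decisions) >= video_accept_ratio

def frame_passes_one_vs_another(pairwise_results):
    """One-vs-another frame rule: the frame is accepted only if the
    claimed client wins all n-1 pairwise SVM tests for that frame."""
    return all(pairwise_results)
```

Under "one-vs-all" each frame decision comes from a single classifier; under "one-vs-another" it is the conjunction of n-1 pairwise tests, after which the same video-level ratio test applies.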

Figure 3: Verification procedure of the "one-vs-all" strategy (frame level)

Figure 4: Verification procedure of the "one-vs-another" strategy (frame level)

4.3 Face Recognition

• "One-vs-All" Vote Strategy. The recognition procedure is shown in Fig. 5. Face samples extracted from all frames are sent to the n clients' SVM classifiers. The i-th classifier determines whether the frame is "client i" or not; for a positive hypothesis, a vote is added to that client. Each frame thus gives one vote for one of the n clients. The final recognition result for a video sequence is the client who receives the majority of the votes.
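The voting scheme can be sketched as below. The classifier interface is a hypothetical stand-in for the trained one-vs-all SVMs, and if several classifiers fire on one frame this sketch simply credits each of them, a simplification of the paper's one-vote-per-frame description.

```python
from collections import Counter

def recognize_one_vs_all(frames, classifiers):
    """Majority-vote recognition over a video sequence.

    frames:      iterable of feature vectors, one per face sample
    classifiers: dict {client_id: callable}, where classifiers[i](x)
                 returns True iff the i-th one-vs-all SVM accepts x.
    Each accepting classifier adds a vote for its client; the client
    with the most votes over all frames is the recognized person.
    """
    votes = Counter()
    for x in frames:
        for client, accepts in classifiers.items():
            if accepts(x):
                votes[client] += 1
    return votes.most_common(1)[0][0] if votes else None
```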

Figure 5: Recognition procedure of the "one-vs-all" strategy

Figure 6: Recognition procedure of the "one-vs-another" strategy

• "One-vs-Another" Competitive Strategy. The recognition procedure is shown in Fig. 6, using a competitive strategy. At the frame level, given a test face, all the clients are first divided into n/2 pairs. The winners of each pair continue in this elimination scheme until only one champion remains. At the video level, the champion client of each frame receives one vote, and the client with the maximum votes over all the frames is the recognized person.
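The per-frame elimination can be sketched as below. The pairwise_svm interface is hypothetical, and handling an odd bracket by giving the last client a bye is our assumption; the paper only states that clients are divided into n/2 pairs.

```python
def recognize_frame_tournament(x, clients, pairwise_svm):
    """Single-elimination recognition of one frame ("one-vs-another").

    clients:      list of client ids in the current bracket
    pairwise_svm: callable (i, j, x) -> winner id, standing in for
                  the SVM trained between clients i and j.
    Clients are paired off; winners advance round by round until
    a single champion remains.
    """
    remaining = list(clients)
    while len(remaining) > 1:
        nxt = []
        for k in range(0, len(remaining) - 1, 2):
            nxt.append(pairwise_svm(remaining[k], remaining[k + 1], x))
        if len(remaining) % 2 == 1:      # odd client advances on a bye
            nxt.append(remaining[-1])
        remaining = nxt
    return remaining[0]
```

Note that each frame needs only about n-1 pairwise evaluations in total (one per eliminated client), rather than all n(n-1)/2.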

5. EXPERIMENTAL RESULTS

Experiments are carried out on a database of 100 persons. Each person has 1 to 11 video sequences, captured over a period of 5 months. In each video, the person moves his head around before the camera through various poses, for a duration of about 4 seconds. From each video sequence, 30 to 50 face shots are extracted automatically, excluding some frames due to facial landmark localization failures. Each face shot is transformed into 6 samples by mirroring and by translating the mouth center 2 pixels left and right, to compensate for errors in feature localization. Thus 180 to 300 samples are extracted from each video sequence. Only one video sequence of each person is used to train the SVMs. The testing video sequences are independent of the training set and vary in illumination and pose. The experimental results below use the RBF kernel for the SVMs; similar results are obtained with other kernels such as the polynomial kernel.
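The 6-fold augmentation of each face shot can be sketched at the landmark level; the 48 px face-region width and the exact mirroring convention below are our assumptions, since the paper only specifies mirroring plus a ±2 px mouth translation.

```python
def augment_face_shot(landmarks, width=48):
    """Expand one face shot into 6 training samples, as in Sec. 5:
    the mouth center is shifted 2 px left, kept, and shifted 2 px
    right, and each of the 3 variants is also mirrored, giving
    3 x 2 = 6 landmark sets that each drive an affine normalization.

    landmarks: dict with 'left_eye', 'right_eye', 'mouth' as (x, y).
    Mirroring flips x about the face-region width and swaps the eyes.
    """
    variants = []
    for dx in (-2, 0, 2):
        mx, my = landmarks['mouth']
        shifted = dict(landmarks, mouth=(mx + dx, my))
        variants.append(shifted)
        mirrored = {
            'left_eye':  (width - 1 - shifted['right_eye'][0],
                          shifted['right_eye'][1]),
            'right_eye': (width - 1 - shifted['left_eye'][0],
                          shifted['left_eye'][1]),
            'mouth':     (width - 1 - (mx + dx), my),
        }
        variants.append(mirrored)
    return variants  # 6 landmark sets per face shot
```

With 30 to 50 shots per video, this augmentation yields the 180 to 300 samples per sequence quoted above.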

Table 1 gives the system performance on the 100-client database with the two strategies. In the FRR and RR tests, there are in total 50 testing sequences of 10 clients with 14790 detected face shots; in the FAR test, there are in total 536 testing sequences of 11 clients with 158782 face shots. The performance at the video level is very encouraging. In each test, we compared the performance of the "one-vs-another" and "one-vs-all" strategies at the frame level while varying the size of the face database. The curves shown in Fig. 7, Fig. 8 and Fig. 9 indicate that "one-vs-another" outperforms "one-vs-all" as the size of the database increases. The explanation may lie in the different SVM margins trained in the two strategies.

Table 1. System Performance (database size = 100, video accept ratio = 70%)

Strategy          Level    FRR (%)   FAR (%)   RR (%)
one-vs-all        Frame    7.74      0.176     92.18
one-vs-all        Video    6.00      0.00      94.00
one-vs-another    Frame    5.04      0.00      98.02
one-vs-another    Video    0.00      0.00      100

Figure 7: FRR with varying database size (frame level)

Figure 8: FAR with varying database size (frame level)

Figure 9: Recognition Rate with varying database size (frame level)

The "one-vs-another" SVM classifiers find the maximum margin between each pair of clients, which is relatively insensitive to the size of the database. The "one-vs-all" strategy, however, considers the margin between one client and all the others in the database, so as the database grows it binds more and more unrelated clients into one class for training and classification. The imbalance between the sample numbers of the two classes (about 1:100, for example) leads to more support vectors, a more complex margin, and degraded generalization ability.

For practical applications, "one-vs-another" has another advantage. When the (n+1)-th client is added to a database of n clients, "one-vs-all" needs to retrain all n+1 SVM classifiers. For "one-vs-another", only the n new SVM classifiers between this client and the other n clients need to be trained, leaving the existing n(n-1)/2 SVMs unchanged.
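The retraining-cost comparison above amounts to simple counting, sketched here for concreteness (function name is ours):

```python
def classifiers_to_retrain(n):
    """Classifiers that must be (re)trained when client n+1 joins
    a database of n clients, under each multi-class strategy."""
    one_vs_all = n + 1               # every one-vs-all boundary changes
    one_vs_another = n               # only the new client's pairwise SVMs
    unchanged_pairwise = n * (n - 1) // 2   # existing pairs stay intact
    return one_vs_all, one_vs_another, unchanged_pairwise
```

For n = 100, adding one client means retraining 101 one-vs-all classifiers, versus training only 100 new pairwise classifiers while the 4950 existing pairwise SVMs stay untouched.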

6. CONCLUSION AND DISCUSSION

In this paper we present a video-based face verification and recognition system built on support vector machines. Face samples are automatically extracted from input video sequences by the LUT-based Adaboost algorithm and are normalized both in geometry and in gray-level distribution. Two different strategies for the multi-class face verification and recognition problems with SVMs are discussed. Experimental results on a database of 100 clients show that the "one-vs-another" strategy performs better than the "one-vs-all" strategy, especially as the database size increases.

REFERENCES

1. Bo Wu, Haizhou Ai, Chang Huang, LUT-Based Adaboost for Gender Classification, Proc. of AVBPA'03, (2003) 104-110.
2. Brunelli, R., Poggio, T., Face Recognition: Features Versus Templates, IEEE Trans. Patt. Anal. Mach. Intell., 15(10) (1993) 1042-1052.
3. Chellappa, R., Wilson, C. L., Sirohey, S., Human and Machine Recognition of Faces: A Survey, Proc. of the IEEE, 83(5) (1995) 705-740.
4. Darrell, T., Gordon, G., et al., Integrated Person Tracking Using Stereo, Color, and Pattern Detection, Proc. of CVPR'98, (1998) 601-609.
5. Gunn, S. R., Support Vector Machines for Classification and Regression, Technical Report, Image Speech and Intelligent Systems Research Group, University of Southampton (1997).
6. Guo, G., Li, S. Z., Chan, K., Face Recognition by Support Vector Machines, Proc. of AFG'00, (2000) 196-201.
7. Heisele, B., Ho, P., Poggio, T., Face Recognition with Support Vector Machines: Global versus Component-based Approach, Proc. of ICCV'01, Vol. II, (2001) 688-694.
8. Jonsson, K., Matas, J., et al., Learning Support Vectors for Face Verification and Recognition, Proc. of AFG'00, (2000) 208-213.
9. Lades, M., Vorbruggen, J. C., et al., Distortion Invariant Object Recognition in the Dynamic Link Architecture, IEEE Trans. on Computers, 42(3) (1993) 300-311.
10. Lanitis, A., Taylor, C. J., Cootes, T. F., Automatic Face Identification System Using Flexible Appearance Models, Image and Vision Computing, 13(5) (1995) 393-401.
11. Moghaddam, B., Pentland, A., Beyond Linear Eigenspaces: Bayesian Matching for Face Recognition, in: Wechsler, H., et al. (eds.), Face Recognition: From Theory to Applications, Springer (1998) 230-243.
12. Phillips, P. J., Moon, H., et al., The FERET Evaluation Methodology for Face-Recognition Algorithms, IEEE Trans. Patt. Anal. Mach. Intell., 22(10) (2000) 1090-1104.
13. The ORL Face Database, AT&T Laboratories Cambridge, http://www.uk.research.att.com/facedatabase.html
14. Wiskott, L., Fellous, J. M., et al., Face Recognition by Elastic Bunch Graph Matching, IEEE Trans. Patt. Anal. Mach. Intell., 19(7) (1997) 775-779.
15. Zhao, W., Chellappa, R., Rosenfeld, A., Phillips, P. J., Face Recognition: A Literature Survey, UMD CS-TR-4167, 2000.
16. Li, S. Z., Yan, S., Zhang, H., Cheng, Q., Multi-View Face Alignment Using Direct Appearance Models, Proc. of AFG'02, (2002) 324-329.
17. Viola, P., Jones, M., Rapid Object Detection Using a Boosted Cascade of Simple Features, Proc. of CVPR'01 (2001).
18. Platt, J. C., Sequential Minimal Optimization: A Fast Algorithm for Training Support Vector Machines, Technical Report MSR-TR-98-14, Microsoft Research (1998).