Head Pose Estimation in Seminar Room using Multi View Face Detectors

Zhenqiu Zhang, Yuxiao Hu, Ming Liu and Thomas Huang
Beckman Institute, University of Illinois at Urbana-Champaign, 405 N Mathews Ave, Urbana, IL 61801, USA
zzhang6,hu3,[email protected] [email protected]

March 30, 2006

Abstract

Head pose estimation at low resolution is a challenging problem. Traditional pose estimation algorithms, which assume faces have been well aligned beforehand, face much difficulty in this situation, since face alignment itself does not work well at low resolution. In this paper, we propose to estimate head pose directly with view-based multi-view face detectors. A naive Bayesian classifier is then applied to fuse the head pose information from multiple camera views. To model the temporal change of head pose, a Hidden Markov Model is used to obtain the head pose sequence with greatest likelihood.

1

Introduction

Most previous work on pose estimation (PE) [1] [2] assumes the face has been well aligned before PE. However, in many real applications, face alignment is itself a quite difficult problem, as seen in Fig. 1. Much error is therefore introduced at the face alignment stage, which degrades PE performance dramatically. View-based multi-view face detection is one of the successful applications of statistical learning of the past several years [3] [4]. It detects faces of different poses in an input image and, at the same time, provides an estimate of head pose based on which channel of the view-based face detectors produces the output. In scenarios where face alignment is difficult, a view-based face detector can therefore be used directly as a pose estimator. For the task of head pose estimation in the seminar room of the CHIL evaluation, a five-channel multi-view face detector is applied to each of four camera views. Head pose with respect to the seminar room can then be estimated using a naive

Figure 1: Four camera view images of CHIL data.

Bayesian network [5]. The temporal change of head pose is modeled with a Hidden Markov Model [6]. The remainder of this paper is organized as follows: Section 2 briefly discusses FloatBoost, the face detection algorithm used in this work. Pose estimation with local multi-view face detection is described in Section 3. Section 4 presents in detail the proposed framework for head pose estimation using a naive Bayesian network. The Hidden Markov Model used to model the temporal change of head pose is presented in Section 5. Experiments on the CHIL dataset are described in Section 6, and a brief summary is given in Section 7.

2

FloatBoost Multi-View Face Detection

Various algorithms have been proposed in the literature for face detection. Among these, we use the FloatBoost approach [3]. FloatBoost is a variant of AdaBoost [7], introduced to amend some of its limitations. AdaBoost is a sequential forward search procedure using a greedy selection strategy; its heuristic assumption is monotonicity, and the guarantees of the sequential procedure can break down when this assumption is violated. FloatBoost instead incorporates the idea of floating search [8] into AdaBoost to overcome the non-monotonicity problems associated with the latter. The sequential floating search (SFS) method [8] allows the number of backtracking steps to be controlled instead of being fixed beforehand. Specifically, it adds or deletes l = 1 feature and then backtracks r steps, where r depends on the current status. As a result, the quality of the selected features improves at the cost of increased computation due to the extended search. These feature selection methods, however, do not address the problem of (sub)optimal classifier design based on the selected features. FloatBoost combines

Figure 2: Framework of local face detection.

them into AdaBoost for both effective feature selection and classifier design. Briefly, FloatBoost is an iterative procedure with two main steps. In the forward inclusion step, the currently most significant weak classifier is added, one at a time, identically to AdaBoost. In the conditional exclusion step, FloatBoost removes the least significant weak classifier from the current ensemble, subject to the condition that the removal leads to a lower cost than the one incurred at the previous iteration; the classifiers following the removed one then need to be re-trained. These steps are repeated until no more removals can be performed. In the scenario of interest in this paper, face detection needs to accommodate the lecturer's varying head pose, as captured by the fixed camera views inside the smart room. Therefore, a multi-view FloatBoost approach is used, where three face detectors are trained: one for the frontal view, one for the left half-profile view and one for the left profile view, with the right-side face detectors obtained by mirroring the left ones. All detectors are trained with the FloatBoost technique.
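The forward-inclusion/conditional-exclusion loop can be sketched as follows. As stand-ins for the paper's weak classifiers, this toy uses decision stumps on scalar data and plain training error as the cost; the real detectors learn Haar-like features with a different cost, so everything here beyond the loop structure is an illustrative assumption.

```python
def ensemble_cost(ensemble, X, y):
    """Training error of a sum-of-stumps ensemble (toy cost function)."""
    if not ensemble:
        return 1.0
    errors = 0
    for x, label in zip(X, y):
        score = sum(1 if x > t else -1 for t in ensemble)
        errors += (1 if score >= 0 else -1) != label
    return errors / len(y)

def floatboost_select(candidates, X, y, max_size=5):
    ensemble = []
    best_cost = {0: 1.0}          # best cost seen for each ensemble size
    while len(ensemble) < max_size:
        remaining = [c for c in candidates if c not in ensemble]
        if not remaining:
            break
        # Forward inclusion: add the currently most significant stump.
        add = min(remaining, key=lambda c: ensemble_cost(ensemble + [c], X, y))
        ensemble.append(add)
        cost = ensemble_cost(ensemble, X, y)
        best_cost[len(ensemble)] = min(best_cost.get(len(ensemble), 1.0), cost)
        # Conditional exclusion: drop the least significant member while
        # removal beats the best cost previously seen at the reduced size.
        while len(ensemble) > 1:
            worst = min(ensemble, key=lambda c: ensemble_cost(
                [e for e in ensemble if e != c], X, y))
            reduced = [e for e in ensemble if e != worst]
            reduced_cost = ensemble_cost(reduced, X, y)
            if reduced_cost < best_cost.get(len(reduced), 1.0):
                ensemble = reduced
                best_cost[len(reduced)] = reduced_cost
            else:
                break
    return ensemble
```

The strict inequality in the exclusion step is what guarantees termination: each removal must improve on the best cost previously recorded at that ensemble size, so the same add/remove pair cannot cycle.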

3

Pose Estimation with Local Multi-view Face Detection

As mentioned in the introduction, we estimate head pose with multi-view face detectors directly. As illustrated in Fig. 2, a sequence of face detectors is applied around the face region, given the bounding box of the face in each camera view. In the flow chart there are five view-based face detectors: frontal view, left half-profile, right half-profile, left profile and right profile. First, the frontal face detector is applied in the region around the bounding box. If a frontal face is detected, we stop the detection sequence and estimate the head pose as frontal view. If no frontal face is found in the local region, we continue the multi-view face detection process with the left half-profile face detector. Similarly, if a left half-profile face is detected, we stop the detection sequence and estimate the head pose as left half-profile view. The process continues, following the flow chart shown in Fig. 2. If none of the five detectors detects a face around the bounding box region, non-face is assigned to this local region, which most likely means the presenter's head is seen from the back in this view (this can also happen when a face detection is missed). With this local multi-view face detection, the head pose in each camera view is obtained. Let v1, v2, v3, v4 be the variables representing the head pose in camera

Figure 3: Naive Bayesian network for head pose estimation.

view 1, camera view 2, camera view 3 and camera view 4, respectively. For example, v1 has six possible values 1, 2, 3, 4, 5, 6, corresponding to frontal view, left half-profile, right half-profile, left profile, right profile and non-face. The basic assumptions of this framework of sequential local multi-view face detection are: (1) the frontal face detector is more robust than the half-profile face detectors; (2) the half-profile face detectors are more robust than the profile face detectors; (3) the right- and left-side face detectors are independent, and it is only with low probability that a left-side and a right-side face would be detected simultaneously.
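The sequential local detection described above can be sketched as follows. Each detector is a stand-in callable returning True on a detection; in the actual system these are the FloatBoost view detectors run on the bounding-box region.

```python
# Pose labels in the order the detectors are applied (Fig. 2).
POSES = ["frontal", "left half-profile", "right half-profile",
         "left profile", "right profile"]

def local_pose(region, detectors):
    """Apply the five view-based detectors in order; the first hit wins."""
    for pose, detect in zip(POSES, detectors):
        if detect(region):
            return pose
    # No detector fired: likely the back of the head, or a missed face.
    return "non-face"
```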

4

Naive Bayesian Network for Head Pose Estimation

Recent work in supervised learning has shown that a surprisingly simple Bayesian classifier with strong independence assumptions among features, called naive Bayes [5], is competitive with state-of-the-art classifiers such as C4.5. This raises the question of whether a classifier with less restrictive assumptions can perform even better. The naive Bayes classifier learns from training data the conditional probability of each attribute Ai given the class label C. Classification is then done by applying Bayes' rule to compute the probability of C given the particular instance of A1, ..., An, and predicting the class with the highest posterior probability. This computation is rendered feasible by a strong independence assumption: all attributes Ai are conditionally independent given the value of the class C. For the scenario of head pose estimation in the seminar room, let v1, v2, v3, v4 denote the head pose in each camera view and θ the head pose with respect to the seminar room, which has eight possible values (compass directions such as east, west, northwest). The relation between v1, v2, v3, v4 and θ is modeled with naive Bayes, as shown in Fig. 3. Given

Figure 4: Overall framework of pose estimation using Naive Bayesian network.

an observation of a particular instance of v1, v2, v3, v4, θ can be estimated as the value with the highest posterior probability. The overall flow chart of head pose estimation with naive Bayes is illustrated in Fig. 4.
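The fusion step amounts to maximizing p(θ) ∏i p(vi|θ) over the eight room poses, and can be sketched as below. The probability tables here are illustrative placeholders, not values learned from the CHIL development data.

```python
def map_pose(views, prior, cond):
    """MAP estimate of the room pose theta under the naive Bayes model.
    views: per-camera pose labels (v1..v4); prior: p(theta);
    cond[i]: p(v_i | theta) table for camera i."""
    best, best_score = None, 0.0
    for theta, p in prior.items():
        score = p
        for i, v in enumerate(views):
            # Small floor stands in for proper smoothing of unseen pairs.
            score *= cond[i].get(theta, {}).get(v, 1e-6)
        if score > best_score:
            best, best_score = theta, score
    return best
```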

5

Hidden Markov Model for Pose Estimation

Temporal patterns are common tasks in computational pattern recognition. In order to model the inherent temporal ordering of a time sequence, some simplifying assumptions are made. The Markov assumption is one of the most common: the temporal dependence is reduced to a first-order approximation, so the random variable vt at time t depends only on the previous time instant vt−1. A Hidden Markov Model is a finite set of states, each of which is associated with a (generally multidimensional) observation probability distribution. Transitions among the states are governed by a set of probabilities called transition probabilities. To specify a Hidden Markov Model (HMM) completely, three types of parameters are needed: the initial probabilities, the transition probabilities and the observation probability distribution of each state. In this paper, an HMM is applied to model the temporal change of head pose in the video sequence. In particular, the HMM is specified as:

– Initial probability p(θ)
– Transition probability p(θt|θt−1)
– Observation probability distribution of each state p(V̄t|θt) = p(vt1 vt2 vt3 vt4|θt)

Given observations (V̄1, V̄2, ..., V̄T), the Viterbi algorithm is used to find the state sequence (θ1, θ2, ..., θT) with greatest likelihood.
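The Viterbi decoding above can be sketched as follows. The observation term p(V̄t|θt) factorizes over the four camera views exactly as in the naive Bayes model; here it is assumed precomputed per frame, and all tables are toy values rather than the learned CHIL parameters.

```python
def viterbi(obs_ll, prior, trans):
    """obs_ll[t][s] = p(V_t | theta_t = s); returns the most likely
    state sequence under initial probs `prior` and transitions `trans`."""
    states = list(prior)
    # delta[s]: likelihood of the best path ending in state s at time t.
    delta = {s: prior[s] * obs_ll[0][s] for s in states}
    back = []
    for t in range(1, len(obs_ll)):
        new_delta, ptr = {}, {}
        for s in states:
            prev, score = max(((r, delta[r] * trans[r][s]) for r in states),
                              key=lambda x: x[1])
            new_delta[s] = score * obs_ll[t][s]
            ptr[s] = prev
        delta, back = new_delta, back + [ptr]
    # Backtrack from the best final state.
    path = [max(delta, key=delta.get)]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```

In practice one would work in log-space to avoid underflow over long sequences; the plain products are kept here for readability.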

6

Experimental Results

The meeting room considered in this paper corresponds to the smart room located at one of the CHIL project partners [9]. A number of sensors are installed

Table 1: Average pose estimation accuracy with leave-one-out strategy.

  S0    S1    H0    H1
  54%   85%   58%   92%

in the room, including the four fixed cameras that provide the data used in this paper. The cameras capture color data at 640×480 pixel resolution and 15 frames per second, and are synchronized. For multi-view face detection, three face detectors are trained on the development data: one for the frontal view, one for the left half-profile view and one for the left profile view (the right-side face detectors are obtained by mirroring the left ones). A number of frontal, left half-profile and left profile view face images are cropped from selected images in the development set for this purpose. In addition, non-face training samples are cropped from an image database that does not include faces. Estimates of p(θ), p(θt|θt−1) and p(v1 v2 v3 v4|θt) are learned from the development set. Experimental results for head pose estimation on the development set, with a leave-one-out strategy, are shown in Table 1. There, S0 denotes pose estimation using the naive Bayesian classifier without temporal information and without tolerance of neighbouring poses; S1 denotes pose estimation using the naive Bayesian classifier without temporal information but with tolerance of neighbouring poses; H0 is pose estimation using the Hidden Markov Model with temporal information but without tolerance of neighbouring poses; and H1 is pose estimation using the Hidden Markov Model with temporal information and with tolerance of neighbouring poses. Tested on the evaluation set, we obtained 87% correct classification within the range of neighbouring pose classes using the Hidden Markov Model described in Section 5, and a mean absolute error of 33.56°.
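The "tolerance of neighbouring pose" scoring can be sketched as below: with eight compass classes 45° apart, a prediction counts as correct if it lands on the true class or either circular neighbour. The class ordering is an assumption for illustration, not taken from the paper.

```python
# Eight room-pose classes, 45 degrees apart (assumed ordering).
CLASSES = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]

def tolerant_accuracy(pred, truth):
    """Fraction of predictions within one circular class of the truth."""
    ok = 0
    for p, t in zip(pred, truth):
        d = abs(CLASSES.index(p) - CLASSES.index(t))
        if min(d, 8 - d) <= 1:      # circular distance of 0 or 1 class
            ok += 1
    return ok / len(truth)
```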

7

Summary

In this paper, multi-view face detectors are applied to estimate head pose in the seminar room scenario. A naive Bayesian classifier is used to fuse the head pose estimates from the four camera views, and an HMM is used to model the temporal change of head pose in the video sequence.

References

[1] S. Gong, S. McKenna, and J. Collins, "An investigation into face pose distributions," FG 1996.

[2] S. Li, X. Peng, X. Hou, H. Zhang and Q. Cheng, "Multi-view face pose estimation based on supervised ISA learning," FG 2002.

[3] S. Li and Z. Zhang, "FloatBoost learning and statistical face detection," IEEE Trans. Pattern Anal. Machine Intell., 26(9), 2004.

[4] H. Schneiderman and T. Kanade, "A statistical method for 3D object detection applied to faces and cars," Proc. Conf. Computer Vision Pattern Recog., 2000.

[5] N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian network classifiers," Machine Learning 29:131–163, 1997.

[6] L. Rabiner, "A tutorial on Hidden Markov Models and selected applications in speech recognition," Proc. IEEE 77(2):257–286, 1989.

[7] P. Viola and M. Jones, "Robust real time object detection," Proc. IEEE ICCV Workshop on Statistical and Computational Theories of Vision, 2001.

[8] P. Pudil, J. Novovicova, and J. Kittler, "Floating search methods in feature selection," Pattern Recog. Lett., 15:1119–1125, 1994.

[9] CHIL "Computers in the Human Interaction Loop" project web-site: http://chil.server.de