HIERARCHICAL GAZE ESTIMATION BASED ON ADAPTIVE FEATURE LEARNING

Xiying Wang¹, Kang Xue¹, Dongkyung Nam², Jaejoon Han², Haitao Wang¹
¹ Samsung R&D Institute China, Beijing, China
² Samsung Advanced Institute of Technology, Yongin, Korea
xiying.wang, kang.xue, max.nam, jaejoon.han, [email protected]

ABSTRACT

Existing appearance-based gaze estimation methods suffer from tedious calibration and from appearance variation caused by head movement. To handle these problems, we propose a novel appearance-based gaze estimation method that introduces supervised adaptive feature extraction and a hierarchical mapping model. First, an adaptive feature learning method is proposed to extract a topology-preserving (TOP) feature for each individual. Then, a hierarchical mapping method is proposed to localize the gaze position with a coarse-to-fine strategy, and an appearance synthesis approach is used to increase the density of reference samples. Experiments show that, under sparse calibration, the proposed method is more accurate than existing methods under a fixed head pose without a chinrest. Moreover, our method can easily be extended to head pose-varying gaze estimation.

Index Terms— gaze estimation, feature learning, appearance synthesis, hierarchical mapping

1. INTRODUCTION

Since gaze intuitively plays an important role in representing human attention, feeling and desire, gaze estimation has attracted much attention in recent years [1]. Although many papers have been published, it is still hard to estimate the gaze position with only a single web camera and no extra accessories, even under a fixed head pose. Existing techniques for a fixed head pose can be classified as intrusive or non-intrusive. Intrusive techniques [2] require attachments around the eye to determine the gaze. Among non-intrusive techniques, feature-based methods were introduced first; the most commonly extracted features include corneal infrared reflections, the pupil center and the iris contour. However, feature-based methods are not robust enough because they depend on feature detection and image quality. Appearance-based methods map the entire eye image, as a high-dimensional feature, to a low-dimensional target space. Their advantage is that, since no precise eye features need to be localized, they can be robust to image noise, and a single VGA-resolution camera is sufficient. However, the tedious calibration process remains an open problem for appearance-based gaze estimation.


Fig. 1. Overview of the proposed gaze estimation method.

Baluja and Pomerleau [3] proposed a neural-network-based regression method and collected 2000 samples for training. The method by Tan et al. [4] used an appearance manifold for gaze estimation and interpolated unknown gaze points using 252 calibration samples, but such a tedious calibration process becomes a bottleneck for actual use. Recently, Feng Lu et al. [5] proposed an adaptive linear regression method for accurate mapping that collects training samples from 9 ∼ 33 calibration points. In this method, the linear weights in the appearance feature space and in the point-of-gaze coordinate space are assumed to be identical; however, due to insufficient training data and head pose noise, this assumption cannot be guaranteed. Y. Sugano et al. [6] suggested utilizing automatically computed saliency maps to generate training samples. This method is novel but has relatively low accuracy and a long training process. Considering the problems of existing appearance-based techniques, we propose a novel gaze estimation method that uses only a sparse 5-point calibration (as shown in Figure 1). The main contributions of this paper are:

1. Based on sparse calibration, a hierarchical mapping method that finds a direct correlation between the feature space and the Point-of-Gaze (PoG) space is proposed for gaze estimation with a coarse-to-fine strategy. Support Vector Regression (SVR) is used in the global scope for coarse localization of the PoG; then a 2-norm regularized least-squares canonical correlation analysis (LS-CCA2) method is applied in the local scope for the precise PoG. To handle the lack of calibration samples, we introduce an appearance synthesis method to increase the density of reference samples.

2. A supervised learning method is proposed to extract an individual topology-preserving (TOP) feature.


The topological structure of the 2D PoG locations is kept well on the surface of the TOP feature manifold; thereby, compared with existing methods, the TOP feature provides a more efficient way to represent eye appearance for accurate gaze estimation.

The paper is organized as follows: Section 2 presents the feature learning method. Section 3 describes our hierarchical mapping method, including system initialization and the online gaze estimation method. Experimental results are given in Section 4. Finally, Section 5 concludes the paper.

Notations: The number of calibration training samples, the data dimensionality and the feature dimensionality are denoted by N, D and K respectively. e_i ∈ R^K denotes the appearance feature of the i-th eye image, with corresponding 2D PoG position p_i = (x_i, y_i) ∈ R^2. The training dataset is Γ = {E, P} = {[e_i, p_i]}_{i=1}^{N}. A testing sample and its corresponding PoG position are denoted ê and p̂.

2. TOPOLOGY-PRESERVING FEATURE LEARNING

Appearance-based gaze estimation depends heavily on the feature's ability to represent appearance variation. Since appearance varies from person to person, it is not appropriate to extract features in the same way for different users. To solve this problem of generic feature extraction, we put forward a supervised learning process that learns a novel feature, named the TOP feature, for each user. The method consists of the following two steps.

Step 1: Dense local feature construction. Since high dimensionality can lead to high performance [7], as shown in Figure 2(a) we first extract dense local features from each eye image as a high-dimensional representation of eye appearance, by densely sampling image patches at multiple scales. In our method, for efficiency, we simply use the average intensity value v of a patch as its local feature. Given the eye image set IM obtained from sparse calibration, z_i = {v_1^(i), ..., v_D^(i)} denotes the high-dimensional representation of the i-th image, with corresponding 2D PoG label p_i = (x_i, y_i). For the whole set IM of size N, we obtain the dense local feature set Z = {z_i} ∈ R^{D×N} and the corresponding label set P = {p_i} ∈ R^{2×N}.
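As an illustration of Step 1, the following Python sketch builds the dense local feature z by stacking patch-average intensities over a small image pyramid. It is illustrative only: the pyramid scales, patch size, block-averaging pyramid and function name are our assumptions rather than the paper's exact settings.

```python
import numpy as np

def dense_local_feature(eye_img, scales=(1, 2, 4), patch=4):
    # eye_img: 2D grayscale array. Returns z in R^D: the average intensity of
    # every densely sampled patch at every pyramid scale, concatenated.
    feats = []
    for s in scales:
        h, w = eye_img.shape[0] // s, eye_img.shape[1] // s
        # Crude pyramid level via block averaging (stand-in for smoothing + subsampling).
        level = eye_img[:h * s, :w * s].reshape(h, s, w, s).mean(axis=(1, 3))
        for y in range(0, level.shape[0] - patch + 1, patch):
            for x in range(0, level.shape[1] - patch + 1, patch):
                feats.append(level[y:y + patch, x:x + patch].mean())
    return np.asarray(feats)

# Dense local feature set Z in R^{D x N}: one column per calibration eye image.
# Z = np.stack([dense_local_feature(img) for img in IM], axis=1)
```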

Fig. 2. (a) Patch-based local features in the image pyramid; (b) feature-selection patterns at different scales in the horizontal and vertical directions.

Step 2: Sparse feature-selection pattern learning. Based on the dense local features, we aim to learn a discriminative sparse subset that strictly preserves the topological structure of the PoG positions. Spectral feature selection methods are widely used to learn features with a high power of preserving sample similarity [8]. Since eyeball movement can be decomposed into horizontal and vertical movements, two feature selection processes are needed; due to space limits, we only describe the feature learning process for the horizontal direction. Given the horizontal label information {x_i}_{i=1}^{N}, the supervised similarity matrix S of IM is defined by

S_ij = 1/n_l  if x_i = x_j = l,  and 0 otherwise,    (1)

where n_l denotes the number of training samples whose horizontal label is l. Since S specifies the topological similarity among all samples, spectral feature selection can exploit S to select those features that strictly preserve this similarity. Let G denote the undirected graph constructed from S, with adjacency matrix W and degree matrix Q [8]; the Laplacian matrix is L = Q − W. The D sub-features can then be ranked in ascending order of the score function

φ(f_m) = (f_m^T L f_m) / (f_m^T Q f_m),    (2)

where f_m = [v_m^(1), ..., v_m^(N)], m ∈ {1, ..., D}. The score quantifies how well the m-th sub-feature f_m groups the data Z according to the target labels {x_i}. Figure 2(b) shows the horizontal feature-selection pattern SFX and the vertical feature-selection pattern SFY at different scales. Based on these two patterns, for a new input z, the top K/2 sub-features for X and the top K/2 sub-features for Y are chosen to compose the K-dimensional TOP feature: E = {e_i} = {[z_SFX, z_SFY]}.
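The selection itself can be sketched as follows (again illustrative: the function names are ours, and we assume the adjacency matrix of the graph G is taken to be S itself). It builds the supervised similarity matrix of Eq. (1), ranks sub-features by the score of Eq. (2), and keeps the K/2 best-ranked indices per direction to form the selection patterns SFX and SFY.

```python
import numpy as np

def supervised_similarity(labels):
    # Eq. (1): S_ij = 1/n_l if samples i and j share label l, and 0 otherwise.
    labels = np.asarray(labels)
    S = np.zeros((len(labels), len(labels)))
    for l in np.unique(labels):
        idx = np.where(labels == l)[0]
        S[np.ix_(idx, idx)] = 1.0 / len(idx)
    return S

def laplacian_scores(Z, S):
    # Eq. (2): phi(f_m) = f_m^T L f_m / f_m^T Q f_m, with Q the degree matrix
    # and L = Q - W the graph Laplacian; a lower score preserves topology better.
    Q = np.diag(S.sum(axis=1))
    L = Q - S
    num = np.einsum('dn,nm,dm->d', Z, L, Z)
    den = np.einsum('dn,nm,dm->d', Z, Q, Z) + 1e-12
    return num / den

def top_feature_patterns(Z, x_labels, y_labels, K):
    # Z is D x N (each row is one sub-feature f_m over the N training images).
    SFX = np.argsort(laplacian_scores(Z, supervised_similarity(x_labels)))[:K // 2]
    SFY = np.argsort(laplacian_scores(Z, supervised_similarity(y_labels)))[:K // 2]
    return SFX, SFY

# For a new dense feature z, the TOP feature is e = np.concatenate([z[SFX], z[SFY]]).
```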



3. HIERARCHICAL MAPPING METHOD

Considering head pose noise and the sparse training samples, a single non-linear or linear mapping function can hardly localize the PoG from the eye appearance feature. In this section, we propose a hierarchical mapping method for accurate PoG localization. It contains two parts: appearance synthesis, which increases the density of reference samples, and coarse-to-fine gaze estimation, which localizes the PoG.

3.1. Appearance synthesis

To handle the sparsity of the training set, an appearance synthesis method is proposed to increase the sample density. Using the five eye images corresponding to the five calibration points (red circles in Figure 3(a)), we synthesize further eye images to construct a dense, e.g. 4 × 8, appearance model array, which simulates the user fixating on 32 (4 rows, 8 columns) calibration points on the screen. The synthesis is based on optical flow computation and interpolation. Unlike the synthesis process in [9], which uses 1D flows to simulate the appearance distortion caused by head movement, our method introduces a 2D interpolation to synthesize the eye appearance variation caused by eyeball movement. Since eyeball movement can be divided into horizontal and vertical parts independently, we simulate the eye optical flow in the two directions separately. Although it is hard to simulate the whole eye with one linear model, we can assume that each pixel of the eye image moves in an individual 2D linear way. Thus, unknown intermediate appearances can be simulated by dense optical flow interpolation. The detailed procedure is given in Algorithm 1.

Algorithm 1: Appearance synthesis.
Input: images from the 5 marker positions {IM_i}. Output: 4 × 8 synthesized images SI_m.
1: Set each of the five marker images as the basic image in turn;
2: for i = 1 to 5 do
3:   Compute the dense optical flow in the X/Y-axis directions between the basic image and all five images: OF_ij, j ∈ {1, ..., 5}, with OF_ii = 0;
4:   Anchor the five flows OF_ij at the 2D grid positions (1,1), (1,8), (2.5,4.5), (4,1) and (4,8);
5:   Use 1D linear interpolation to obtain flow vectors at the positions (1,4.5), (4,4.5), (2.5,1) and (2.5,8);
6:   Use 2D bilinear interpolation to obtain OF^{X/Y}_{im}, m ∈ {1, 2, ..., 32}, at all other intersection points of the 4 × 8 grid;
7:   Based on the intensity values of the basic image IM_i, obtain the 32 synthesized images SI_im at the 32 marker positions; for each pixel p of SI_im: Int_im(p + OF^X_im(p) + OF^Y_im(p)) = Int_i(p), where Int(p) is the intensity of pixel p;
8: end for
9: Set a weight w_im for each synthesized image SI_im, proportional to the distance between marker_m and marker_i;
10: The synthesized image at marker_m, m ∈ {1, 2, ..., 32}, is SI_m = Σ_{i=1}^{5} w_im · SI_im.

Interpolation of the dense optical flow [10] at intermediate positions is performed in the horizontal (X) and vertical (Y) directions respectively. According to the distribution of the four corner markers and the central marker, we first perform four 1D linear interpolations (blue crosses in Figure 3(a)) in Steps 4 and 5 of Algorithm 1. Then 2D bilinear interpolations are carried out in each sub-region R1, R2, R3 and R4 to obtain optical flow values at all 28 unknown marker positions. An example of the interpolated optical flow in the X- and Y-axis directions is shown in Figure 3(b) and 3(c): red/green indicates the positive/negative direction of the optical flow along the X- or Y-axis, and the brightness indicates its absolute value. Figure 3(d) shows the final 4 × 8 appearance models.

Fig. 3. Optical flow interpolation. (a) 1D interpolation; (b) horizontal interpolation; (c) vertical interpolation; (d) synthesized eye appearance.

The synthesized dataset with M samples is denoted Γ_syn = {E_syn, P_syn}, where E_syn = {e_i^syn} ∈ R^{K×M} contains the corresponding TOP features and P_syn = {p_i^syn} ∈ R^{2×M} the PoG positions.
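The grid interpolation and warping steps of Algorithm 1 can be sketched as follows. This is a simplified illustration: it blends only the four corner flow fields with plain bilinear weights and uses naive forward warping, whereas the paper additionally uses the central marker and 1D interpolations before the bilinear step; the function names are ours.

```python
import numpy as np

def bilinear_flow_grid(corner_flows, rows=4, cols=8):
    # corner_flows: dict keyed by (0, 0), (0, cols-1), (rows-1, 0), (rows-1, cols-1),
    # each an H x W x 2 dense optical flow field estimated at a corner marker.
    H, W, _ = corner_flows[(0, 0)].shape
    grid = np.zeros((rows, cols, H, W, 2))
    for r in range(rows):
        for c in range(cols):
            a = r / (rows - 1.0)          # vertical blending weight
            b = c / (cols - 1.0)          # horizontal blending weight
            grid[r, c] = ((1 - a) * (1 - b) * corner_flows[(0, 0)]
                          + (1 - a) * b * corner_flows[(0, cols - 1)]
                          + a * (1 - b) * corner_flows[(rows - 1, 0)]
                          + a * b * corner_flows[(rows - 1, cols - 1)])
    return grid

def forward_warp(image, flow):
    # Move each pixel of the basic image along its interpolated flow vector
    # (cf. Algorithm 1, step 7); holes left by forward warping are ignored here.
    out = np.zeros_like(image)
    H, W = image.shape
    for y in range(H):
        for x in range(W):
            ny = int(round(np.clip(y + flow[y, x, 1], 0, H - 1)))
            nx = int(round(np.clip(x + flow[y, x, 0], 0, W - 1)))
            out[ny, nx] = image[y, x]
    return out
```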

3.2. Coarse-to-fine gaze estimation

After increasing the density of reference samples during training, coarse-to-fine gaze estimation is introduced to localize the PoG online. The method consists of two steps: coarse localization in the global scope of the PoG space, and refined estimation in the local scope. In the global scope, we introduce non-linear SVR to localize a coarse PoG p̃. The TOP features and corresponding PoG locations of the 5 calibration training samples are used to train the regression function, with an RBF kernel. However, since the coarse estimate is computed over the global scope and the reference samples are sparse, it is not precise enough. To solve this problem, a local estimation step is added to improve accuracy. In contrast to conventional linear interpolation-based methods [4, 5, 11], which assume that the linear weights in the feature space and in the PoG space are the same, our approach finds a direct correlation between the feature space and the PoG coordinate space based on CCA. We extend the 2-norm regularized least-squares CCA (LS-CCA2) algorithm [12] to the regression setting to find a local mapping matrix B ∈ R^{K×2} that projects an appearance feature e to the 2D PoG location p. The nearest 9 neighbors of p̃, denoted Γ_L, are used as the training set to learn B; E_L denotes their TOP features and P_L the corresponding PoG locations. The precise local mapping matrix B is determined by minimizing the cost function

f(B, λ) = Σ_{j=1}^{2} Σ_{i=1}^{9} w_i ‖H_{i,j} − ⟨b_j, e_{i,L}⟩‖² + λ‖B‖²,    (3)

where H = (P_L P_L^T)^{-1/2} P_L and λ is the regularization parameter. w_i is the weight of the i-th local sample: samples from the calibration set Γ_cal are given weight w1 and samples from the synthesized set Γ_syn are given weight w2, with w1 > w2 and Σ_i w_i = 1. In B = {b1, b2}, each vector b_j ∈ R^{K×1}, and ⟨b1, e_i⟩ = x_i, ⟨b2, e_i⟩ = y_i, where ⟨·,·⟩ denotes the inner product. Finally, the precise PoG location of a test sample is obtained as p̂ = B^T ê.
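A minimal sketch of the coarse-to-fine estimator is given below, assuming scikit-learn for the SVR stage. The local solve is a weighted, 2-norm-regularized least squares that regresses the PoG coordinates directly; it is a simplification of the LS-CCA2 objective in Eq. (3) (the whitened target H is omitted), and the function name and default parameters are ours.

```python
import numpy as np
from sklearn.svm import SVR

def coarse_to_fine_pog(E_train, P_train, e_test, k_local=9, lam=1e-2, weights=None):
    # E_train: N x K TOP features; P_train: N x 2 PoG positions; e_test: K-vector.
    # Coarse stage: one RBF-kernel SVR per screen coordinate gives a rough PoG.
    svr_x = SVR(kernel='rbf').fit(E_train, P_train[:, 0])
    svr_y = SVR(kernel='rbf').fit(E_train, P_train[:, 1])
    p_coarse = np.array([svr_x.predict(e_test[None, :])[0],
                         svr_y.predict(e_test[None, :])[0]])

    # Fine stage: take the k_local reference samples whose PoG lies closest to
    # the coarse estimate and fit a local linear map B (K x 2) on them.
    idx = np.argsort(np.linalg.norm(P_train - p_coarse, axis=1))[:k_local]
    E_L, P_L = E_train[idx], P_train[idx]
    w = np.full(k_local, 1.0 / k_local) if weights is None else weights[idx]

    # Weighted ridge solve: (E_L^T W E_L + lam*I) B = E_L^T W P_L.
    Wm = np.diag(w)
    B = np.linalg.solve(E_L.T @ Wm @ E_L + lam * np.eye(E_L.shape[1]),
                        E_L.T @ Wm @ P_L)
    return B.T @ e_test   # precise PoG, p_hat = B^T e_hat
```

In practice E_train and P_train would hold both the real calibration samples and the synthesized samples from Γ_syn, with larger weights w_i assigned to the real samples, as described above.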



4. EXPERIMENTS

We implemented our system on a desktop PC with a 22-inch LCD monitor and a webcam of VGA resolution. The user sits about 65 cm in front of the monitor and keeps his/her head as stable as possible without any supporting equipment. The system captures the user's frontal appearance while the gaze is focused on one marker shown on the screen. The marker pattern used in our experiments is shown in Figure 4(a): the red markers are used for calibration, and the other 20 blue crosses are used for evaluation. In our experiments, testing data of 25 persons (16 males, 9 females) were collected. For each person, 5 × 15 frames are used for calibration and 20 × 15 frames for testing. Figure 4(b) shows some example images from our database.

Fig. 4. (a) Calibration marker pattern. (b) Sample images of five different users (one per row) fixating on the five calibration markers in our database. (c) Comparison with three existing methods on the same dataset (average gaze estimation errors and standard deviations).

Table 1 compares our hierarchical coarse-to-fine method with the linear interpolation and nonlinear SVR methods, using training data from the sparse 5-marker calibration. The evaluation uses four kinds of features: the Grid feature [5], the HOG feature [13], the 2D Gabor feature [14] and the proposed TOP feature. All feature dimensions are 200.

Feature | Linear Interpolation   | SVR                    | Coarse-to-Fine
Grid    | 3.53/2.31 ±1.22/1.17   | 3.05/1.98 ±1.07/0.86   | 1.92/1.52 ±0.87/0.92
HOG     | 2.54/2.21 ±0.78/0.92   | 2.27/2.15 ±0.78/0.52   | 2.03/1.51 ±0.73/0.52
Gabor   | 2.83/2.29 ±0.79/0.91   | 2.09/1.98 ±0.55/0.83   | 2.01/1.82 ±0.89/0.91
TOP     | 2.31/2.14 ±0.75/1.02   | 1.82/1.78 ±0.38/0.56   | 1.63/1.27 ±0.41/0.43

Table 1. Estimation errors of the X/Y PoG position and standard deviations (in degrees) for different mapping methods and features.

Several observations can be drawn from Table 1 regarding the performance of our mapping method and the TOP feature. First, our coarse-to-fine mapping method obtains the best estimation results under sparse calibration for all four kinds of features. Compared with the linear interpolation-based method and the nonlinear SVR, the average improvements in X/Y errors are 0.91°/0.71° and 0.41°/0.44°, respectively. Second, Table 1 also illustrates the performance of the TOP feature: compared with the other three local features, the TOP feature obtains the best estimation with every mapping model. In online estimation, the average processing time per frame is about 47 ms (i.e., 21 fps) on a PC with a 3.2 GHz Intel Core i3 CPU and 4 GB memory. In Table 2, we compare our proposed system with existing appearance-based methods based on their reported estimation errors and numbers of training samples (calibration points).

Method        | Approach                    | Calibration points           | Accuracy (in degree)
Tan [4]       | Local-region-based linear   | 252                          | 0.5
Lu [5]        | l1-based adaptive linear    | 33 / 18 / 9                  | 0.64 / 0.69 / 0.99
Ryan [15]     | Polynomial regression       | 100                          | 1.40/1.69
Williams [16] | Gaussian processing         | 16 (labeled) + 75 (unlabeled)| 1.32
Martinez [13] | SVR/RVR                     | 200                          | 1.85/1.34
Noris [17]    | PCA+SVR                     | 800                          | 1.64/1.26
Proposed      | SVR + LS-CCA2               | 5                            | 1.63/1.27

Table 2. Comparison of estimation errors with existing methods based on their reported results.

Note that this comparison is not direct, because the different systems use different training datasets, testing datasets and hardware accessories, such as a chinrest [5] or an extra illuminator [4]. Nevertheless, Table 2 shows that our method achieves very high accuracy even with the fewest calibration points. We also compare our approach with the three most related works on our testing database: Lu's method [5], Tan's method [4] and Martinez's nonlinear method [13]. Considering the properties of our dataset, Lu's method uses an 8D (2 × 4) Grid feature with 9-point calibration. Tan's method uses a 200D Grid feature and linear interpolation in a local region of the feature space containing the 3 nearest neighbors of the testing sample's feature. Martinez's method uses the SVR model rather than RVR because, under sparse conditions, they have almost the same performance. The comparison results (Figure 4(c)) show that our proposed method obtains the smallest estimation error and standard deviation in both the horizontal and vertical directions. Our method can easily be transferred to head pose-varying gaze estimation by simply adjusting the calibration process. We capture a short video clip while the user fixates on each calibration point (5 in total) while rotating his/her head for several seconds. By combining face and eye appearance features, the PoG can be estimated in real time over a head pose range (yaw/pitch ≈ 20°) without a head pose tracker. The average error is 2.81° over 10 persons' data.

5. CONCLUSION

In this paper, we propose a novel appearance-based gaze estimation method under sparse calibration by introducing the TOP feature and a hierarchical mapping method. Experiments show that the proposed method achieves high accuracy under a fixed head pose with only 5 calibration points. Moreover, our framework can easily be extended to head pose-varying gaze estimation.


6. REFERENCES

[1] D. Hansen and Q. Ji, "In the eye of the beholder: A survey of models for eyes and gaze," PAMI, vol. 32, no. 3, pp. 478–500, 2010.
[2] B. R. Pires, M. Hwangbo, M. Devyver, and T. Kanade, "Visible-spectrum gaze tracking for sports," CVPR Workshops, 2013.
[3] S. Baluja and D. Pomerleau, "Non-intrusive gaze tracking using artificial neural networks," NIPS, 1994.
[4] K. Tan, D. Kriegman, and N. Ahuja, "Appearance based eye gaze estimation," WACV, pp. 191–195, 2002.
[5] F. Lu, Y. Sugano, T. Okabe, and Y. Sato, "Inferring human gaze from appearance via adaptive linear regression," ICCV, 2011.
[6] Y. Sugano, Y. Matsushita, and Y. Sato, "Calibration-free gaze sensing using saliency maps," CVPR, 2010.
[7] D. Chen, X. Cao, F. Wen, and J. Sun, "Blessing of dimensionality: High-dimensional feature and its efficient compression for face verification," CVPR, 2013.
[8] Z. Zhao and H. Liu, "Spectral feature selection for supervised and unsupervised learning," ICML, 2007.
[9] F. Lu, Y. Sugano, T. Okabe, and Y. Sato, "Head pose-free appearance-based gaze sensing via eye image synthesis," ICPR, pp. 1008–1011, 2012.
[10] T. Brox and A. Bruhn, "High accuracy optical flow estimation based on a theory for warping," ECCV, 2004.
[11] Y. Sugano, Y. Matsushita, Y. Sato, and H. Koike, "An incremental learning method for unconstrained gaze estimation," ECCV, 2008.
[12] L. Sun, S. Ji, and J. Ye, "Canonical correlation analysis for multi-label classification: A least-squares formulation, extensions, and analysis," PAMI, 2011.
[13] F. Martinez, A. Carbone, and E. Pissaloux, "Gaze estimation using local features and non-linear regression," ICIP, 2012.
[14] C. Liu and H. Wechsler, "Gabor feature based classification using the enhanced Fisher linear discriminant model for face recognition," TIP, vol. 11, no. 4, pp. 467–476, 2002.
[15] W. J. Ryan, "Limbus/pupil switching for wearable eye tracking under variable lighting conditions," ETRA, 2008.
[16] O. M. C. Williams, A. Blake, and R. Cipolla, "Sparse and semi-supervised visual mapping with the S3GP," CVPR, 2006.
[17] B. Noris, J. B. Keller, and A. Billard, "A wearable gaze tracking system for children in unconstrained environments," CVIU, 2011.
