3D Face Reconstruction from Stereo Images

Stefano Arca, Elena Casiraghi, Giuseppe Lipori, Gabriele Lombardi
Università degli Studi di Milano, Dipartimento di Scienze dell'Informazione,
Via Comelico 39/41, Milano
{arca, casiraghi, lipori, lombardi}@dsi.unimi.it

Abstract

In this paper we propose a system for 3D face reconstruction from 2D stereo images. Two different stereo matching algorithms are presented. The first is a local method that finds the correspondences of the 15 most characteristic facial fiducial points, whose 3D reconstruction allows us to deform a parameterized sparse model. The second method employs a snake grid algorithm that minimizes a multi-resolution energy function; it defines a dense disparity map and determines a dense 3D face model. We also investigate how 3D features affect the recognition performance of a face recognition system working on 2D images.

1. Introduction

In the last few decades a great deal of research work has been devoted to the development of algorithms for image rectification, 3D modelling and reconstruction. In [1] we presented a face recognition system working on 2D images, whose recognition performance might be improved by the highly informative biometric measures that can be extracted from a 3D face representation; based on this consideration, we present in this paper a 3D face reconstruction system working on 2D images.

The main problem of 3D reconstruction from 2D stereo images is to find the correspondences, that is, the points in the images that are projections of the same point in the scene. Finding correspondences in stereo images has long been the focus of much research in computer vision and is still an open problem; it is addressed in the survey presented by Brown et al. [4], which classifies the employed techniques as local and global. Local methods [12, 7, 11] find correspondences by looking at the local neighborhoods of the points, while global methods [3, 6, 15] exploit constraints on global features in order to reduce sensitivity to those local regions in the stereo images that fail to match due to occlusion or uniform texture.

In this paper we describe two 3D reconstruction techniques: the first creates a sparse model, while the second determines a dense model. In section 3 we present a local method to find the correspondences of the 15 most characteristic face fiducial points¹, extracted as described in [1]. The stereo matching is based on the analysis of the grey level intensities in the neighborhoods of the points, by means of the SSD and a Gabor filter bank. A 3D face model is then obtained by using the 3D reconstructed face fiducial points as control points to deform the parameterized 3D face model (Candide model) defined in [13]. Section 4 presents a different stereo matching method that combines global and local techniques to define a dense disparity map; it employs a multi-resolution energy minimizing snake grid algorithm [10]. The grid ensures uniqueness, continuity and ordering, while the minimum of the global energy function leads to good correspondence quality and smoothness.

2. Image preprocessing

The first and fundamental step towards a good 3D reconstruction is a precise camera calibration, which provides the intrinsic, extrinsic and distortion parameters of the cameras used for the stereo image acquisition. Good calibration results can be obtained using the well known algorithm presented by Zhang [16]; the weakness of this method is that it needs an accurate initialization based on the precise corner positions of a calibration pattern (e.g. a chessboard). To initialize the camera calibration algorithm we use a chessboard corner detector [2] which finds the chessboard corners with sub-pixel precision, making no assumption about the scale and the orientation of the chessboard, and working under very different illumination conditions.

The camera parameters, determined by the camera calibration algorithm, are used by the image rectification algorithm presented by Fusiello et al. [9]. In the rectified images, the relative rotation among the original images is removed to simplify the task of the following 3D reconstruction steps. Image rectification enables efficient and reliable stereo matching, since the search for conjugate points (or matching patterns in general) can be executed along the same horizontal lines of the rectified images. Our stereo matching algorithm works on sets of three stereo images (a frontal and two lateral views) of the same subject, and defines the stereo correspondences by matching the points in the lateral views to those in the frontal view.

¹The face fiducial points are: the eyebrow vertexes, the nose tip, the eye and the lip corners, and the upper and the lower middle points of the lips and the eyes.
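Because, after rectification, conjugate points lie on the same horizontal line, the correspondence search reduces to a one-dimensional scan of a matching cost along a row. A minimal sketch of such a scanline search using an SSD cost (grayscale numpy arrays; the function names and window sizes are illustrative, not the paper's code):

```python
import numpy as np

def ssd(patch_a, patch_b):
    # Sum of Squared Differences between two equally sized grey-level patches.
    d = patch_a.astype(float) - patch_b.astype(float)
    return float(np.sum(d * d))

def best_match_along_row(template, lateral, row, x_center, half_width=10):
    # Scan a horizontal search segment (the epipolar line of a rectified pair)
    # centred at column x_center, returning the column that minimises the SSD.
    th, tw = template.shape
    best_x, best_cost = None, np.inf
    for x in range(x_center - half_width, x_center + half_width + 1):
        top, left = row - th // 2, x - tw // 2
        if top < 0 or left < 0 or top + th > lateral.shape[0] or left + tw > lateral.shape[1]:
            continue  # candidate window falls outside the image
        cost = ssd(template, lateral[top:top + th, left:left + tw])
        if cost < best_cost:
            best_x, best_cost = x, cost
    return best_x, best_cost
```

With a template cut from the lateral image itself, the scan recovers the original column exactly, which is a convenient sanity check of the search logic.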

3. 3D reconstruction of face fiducial points

In this section we present our method for 3D facial fiducial point reconstruction, using three stereo views (left, right and frontal). For each view, the 15 2D fiducial points are first identified as described in [1]; we define P̃L(i), PF(i) and P̃R(i) (i = 1, .., 15) as the sets of fiducial points located respectively in the left, frontal and right view (figure 1). Based on the assumption that the fiducial points in the frontal view are detected more precisely than those in the lateral views, we take the PF(i) as the elements that must be matched to determine the conjugate pairs. To find the points corresponding to the PF(i) we employ a stereo matching procedure in the neighborhoods of P̃L(i) and P̃R(i), since these are very close to the conjugates of the PF(i). The rectified images are scaled so that the distance between the fiducial points located on the external eye corners in the frontal view is 150 pixels. The conjugate point corresponding to a fiducial point PF(i) on the left (right) side of the face is searched in the left (right) view; the nose tip and the upper and lower lip middle points are searched in both views.

For each point PF(i) we search for its correspondent, PL(i), in a rectangular area SA of 21 × 7 pixels, vertically centered on the epipolar line corresponding to PF(i)² and horizontally centered on the point P̃L(i). Our method uses the multiple window approach proposed in [8]; the distance metric used to find the matching is the Sum of Squared Differences (SSD). For each fiducial point PF(i) we extract 9 template sub-images from the frontal view. The template models are shown in figure 2: they differ in the position of the pixel PF(i) within the template. The algorithm starts by scanning the search area SA to find, for each template, its best matching point, i.e. the point that minimizes the SSD of the grey level intensities. In this way we find nine candidate points Ck (k = 1, .., 9).

To select the best match we characterize the point PF(i) and the nine candidates Ck in terms of the grey level portion of image surrounding them. A vector (Jet) J(x⃗) of 40 coefficients is calculated by convolving the portion of grey level image around the point x⃗ with a bank of 40 Gabor kernels (5 frequencies and 8 orientations):

$$\psi_j(\vec{x}) = \frac{k_j^2}{\sigma^2} \exp\left(-\frac{k_j^2 \vec{x}^2}{2\sigma^2}\right) \left[\exp(i \vec{k}_j \cdot \vec{x}) - \exp\left(-\frac{\sigma^2}{2}\right)\right]$$

The best candidate is the one that maximizes a similarity measure between Jets:

$$S(J(P_F(i)), J(C_k)) = \frac{\sum_z J_z(P_F(i)) \, J_z(C_k)}{\sqrt{\sum_z \left(J_z(P_F(i))\right)^2 \sum_z \left(J_z(C_k)\right)^2}}$$

where z = 0, ..., 39. To improve the robustness of the method this procedure is repeated three times, varying the dimension of the template windows (11×11, 21×21 and 31×31). We choose as the correspondent of PF(i) the candidate with the highest similarity measure.

²This is a reasonable choice since we work on rectified images.
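The jet construction and comparison can be sketched as follows. The kernel support, the frequency ladder and the use of response magnitudes are standard Gabor-jet choices assumed here, since the paper does not specify them:

```python
import numpy as np

def gabor_kernel(size, k, theta, sigma=2 * np.pi):
    # One kernel of the bank: frequency magnitude k, orientation theta.
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    kx, ky = k * np.cos(theta), k * np.sin(theta)
    ksq = k * k
    envelope = (ksq / sigma ** 2) * np.exp(-ksq * (x * x + y * y) / (2 * sigma ** 2))
    # Complex wave minus the DC-compensation term exp(-sigma^2 / 2).
    carrier = np.exp(1j * (kx * x + ky * y)) - np.exp(-sigma ** 2 / 2)
    return envelope * carrier

def jet(image, px, py, size=17, freqs=5, orients=8):
    # 40-coefficient jet: response magnitudes of 5 frequencies x 8 orientations.
    half = size // 2
    patch = image[py - half:py + half + 1, px - half:px + half + 1].astype(float)
    coeffs = []
    for f in range(freqs):
        k = (np.pi / 2) / (np.sqrt(2) ** f)  # assumed frequency ladder
        for o in range(orients):
            kern = gabor_kernel(size, k, o * np.pi / orients)
            coeffs.append(abs(np.sum(patch * kern)))
    return np.array(coeffs)

def jet_similarity(ja, jb):
    # Normalised dot product between two jets, as in the S(.,.) measure.
    return float(np.dot(ja, jb) / (np.linalg.norm(ja) * np.linalg.norm(jb)))
```

Since the coefficients are magnitudes, the similarity lies in [0, 1] and equals 1 when a jet is compared with itself.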

Given the set of conjugate pairs, the algorithm for 3D point reconstruction can be applied. First, in order to correct the effect of lens distortion, the undistorted positions of the fiducial points are computed. For each pair of conjugate points, their optical rays can be computed and expressed in the same reference frame [14]. Their intersection defines the 3D reconstruction of the point represented in the different views³. Due to calibration and stereo matching errors the rays may not intersect; for this reason the reconstructed 3D point is taken as the midpoint of the segment minimizing the distance between the two rays [14]. In this way we obtain the set of 15 3D face fiducial points. These points are used as control points to deform the parameterized 3D face model (see figure 3-a), which is composed of 64 vertices and 114 triangles. The model deformation is obtained in two steps: a global rigid alignment scales and aligns the Candide model to the control points, while a local deformation brings its vertices closer to the 3D fiducial points of the specific face. This second step replaces the 3D coordinates of the model fiducial points with those of the reconstructed 3D fiducial points, and determines the positions of the other vertices according to these displacements. Although this model is quite simple and can be deformed in a low computational time, its simplicity does not allow

³Note that for the nose tip and the upper and lower lip middle points we have two conjugate pairs; their final 3D reconstruction is the midpoint of the segment connecting the two 3D points obtained from each pair.
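The midpoint reconstruction from two non-intersecting optical rays is a small linear-algebra computation; a sketch, assuming ray origins and directions are already expressed in a common reference frame (e.g. via the camera parameters, as in [14]):

```python
import numpy as np

def triangulate_midpoint(o1, d1, o2, d2):
    # Closest points on two rays r_i(t) = o_i + t * d_i; the reconstructed
    # 3D point is the midpoint of the segment joining them.
    o1, d1 = np.asarray(o1, float), np.asarray(d1, float)
    o2, d2 = np.asarray(o2, float), np.asarray(d2, float)
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    w = o2 - o1
    # Normal equations of min ||(o1 + t1*d1) - (o2 + t2*d2)||^2:
    #   a*t1 - b*t2 = w.d1,   b*t1 - c*t2 = w.d2
    denom = a * c - b * b  # zero iff the rays are parallel
    t1 = (c * (w @ d1) - b * (w @ d2)) / denom
    t2 = (b * (w @ d1) - a * (w @ d2)) / denom
    p1, p2 = o1 + t1 * d1, o2 + t2 * d2
    return (p1 + p2) / 2.0
```

For rays that actually intersect, the midpoint coincides with the intersection; for skew rays it falls halfway along the shortest connecting segment.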

Figure 1. A stereo image set: left, frontal and right view; the fiducial points are highlighted.

Figure 2. The nine template windows. The pixel PF(i) is highlighted.

significant deformation, and the method creates very similar models for different faces. To obtain a model that can better represent different faces, we infer from the 15 face fiducial points 72 additional points on the images (see figure 3-b,c). A stereo matching algorithm is then used to find the correspondences of these points in the three images, and their 3D reconstructions allow us to build a 3D model with 87 (72+15) vertices and 156 triangles (see figure 3-d). Since the 72 inferred points are often located in low-textured or occluded regions, the conjugate pairs found are often inaccurate, leading to an imprecise 3D model (see figure 4). Therefore we try to obtain a more precise model by using a dense stereo matching technique, described in the following section.

4. Stereo matching by a multi-resolution snake grid model

The stereo matching technique described in this section defines a dense disparity map by employing a significantly modified version of the energy minimizing snake grid algorithm presented in [10]. It is applied separately to two pairs of stereo images (left-frontal, right-frontal), and the dense models obtained are integrated into a unique face model. First, the eye centers are localized in the stereo images with the algorithm described in [5]. Their positions allow us to define the bounding boxes strictly including the face. Since we work on rectified images, the bounding box in the lateral (left or right) image is translated and scaled to have the same vertical position and the same dimensions as the bounding box detected in the frontal view.

To obtain a more precise reconstruction, the snake grid is applied at different scales in a coarse-to-fine fashion. To create a multi-scale pyramid, the two stereo sub-images (referred to as images in the following) are iteratively halved in size; the first level of the pyramid, L0, contains the original stereo images, and the lowest level, Lh, contains two images whose biggest dimension is less than 2⁷. In this way we obtain a number of levels h = ⌈log₂ X − 7⌉, where X is the maximum dimension of the original images.

The snake grid algorithm is first applied at the lowest level of the pyramid and initialized by means of two grids, GF and GL, composed of N × M grid points and created on the frontal and the lateral views; we choose N = 8 and M = 12. GF is kept fixed and GL is iteratively distorted to


Figure 3. a: Candide face model. b,c: 72 inferred points. d: New 3D model.

Figure 4. Example of 3D reconstruction with the new model.

find the best matches of the grid points in GF, by minimizing a global energy function (see figure 5). When the snake grid, applied to the generic level Lk of the pyramid, has stabilized to the minimum global energy, the two grids are scaled to initialize the grids in the stereo images at the level above, Lk−1. Using bilinear interpolation, additional points are added to those grids, and the energy minimizing algorithm is applied to find more and more correspondences. This iterative process is repeated until the first level, L0, is reached. At the generic level Lk, the number of grid points, ♯GL and ♯GF, is given by:

$$\sharp G_L(k) = \sharp G_F(k) = \left(2^{(h-k)}(N-1)+1\right)\left(2^{(h-k)}(M-1)+1\right)$$

For each level Lk, 0 ≤ k ≤ h, the energy minimizing snake grid algorithm can be described as follows. Let PF(i, j)^(x,y) and PL(i, j)^(x,y) be the grid points in position (i, j) in GF and GL, where the superscript (x, y) refers to the horizontal and vertical image coordinates of these points, and let En be the global energy of the grid at the nth iteration.
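The level count and the grid sizes can be checked numerically; a small sketch (the function names are mine):

```python
import math

def pyramid_levels(max_dim):
    # Number of halving steps h until the biggest image dimension
    # drops below 2^7 = 128; levels are L0 (original) .. Lh (coarsest).
    return max(0, math.ceil(math.log2(max_dim) - 7))

def grid_points(k, h, n=8, m=12):
    # Number of grid points at pyramid level k, per the formula above.
    f = 2 ** (h - k)
    return (f * (n - 1) + 1) * (f * (m - 1) + 1)
```

With X = 644 (the sub-image size quoted in the experiments) this gives h = 3, i.e. four levels L0..L3, a coarsest grid of 8 × 12 = 96 points, and 5073 points at the finest level, consistent with the final 10146 = 5073 · 2 point model reported for the two integrated views.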

Figure 5. Distortion of the grid stabilized to the minimum global energy.

• En−1 := ∞; En := 0;
• while |En − En−1| > ε(k) do begin
   – En−1 := En;
   – For each point PF(i, j)⁴ in GF do begin
      ∗ Define the neighborhood A of PL(i, j) where the best match of PF(i, j) must be searched: it is the horizontal segment⁵ composed of the points p̃L^(x,y) having vertical coordinate p̃L^y = PL(i, j)^y and horizontal coordinate PL(i − 1, j)^x < p̃L^x < PL(i + 1, j)^x;
      ∗ For each point p̃L ∈ A compute the energy function E(p̃L, PF(i, j)) (see equation 1);
      ∗ Choose as best matching point the p̃L with the minimum energy;
      ∗ Set the new position of PL(i, j)^(x,y) := p̃L^(x,y);
      ∗ Define the energy of the grid point (i, j) as Eij := E(p̃L, PF(i, j));
   end;
   – Compute the global energy of the grid as $E_n := \sum_{i,j} E_{ij}$;
• end;

⁴The PF(i, j) are chosen in random order.
⁵A horizontal segment is chosen according to the fact that we work on rectified images.

The energy function E(p̃L, PF(i, j)) is used to evaluate the match between the two points p̃L and PF(i, j); it is computed as a linear combination of an external energy, Eext, and an internal energy, Eint. More precisely:

$$E(\tilde{p}_L, P_F(i,j)) = E_{int} - \alpha E_{ext} \qquad (1)$$

where α is an energy parameter. Based on the assumption that a portion of image, IF, around PF(i, j)^(x,y) remains almost unchanged in the lateral image, we choose as a measure of the external energy, Eext, the median of the cross correlation value (CC) calculated over the nine templates described in [8]. CC is defined by the following equation, which includes the correlation of all three color channels in the same formula:

$$CC = \frac{\displaystyle\sum_{C=R,G,B} \sum_{u,v=1}^{t,l} \left(I_F^C(u,v) - \bar{I}_F^C\right)\left(\tilde{I}_L^C(u,v) - \bar{I}_L^C\right)}{\displaystyle\sum_{C=R,G,B} \sqrt{\sum_{u,v=1}^{t,l} \left(I_F^C(u,v) - \bar{I}_F^C\right)^2} \, \sqrt{\sum_{u,v=1}^{t,l} \left(\tilde{I}_L^C(u,v) - \bar{I}_L^C\right)^2}}$$

where

$$\bar{I}_F^C = \frac{1}{t \cdot l} \sum_{u=1}^{t} \sum_{v=1}^{l} I_F^C(u,v), \qquad \bar{I}_L^C = \frac{1}{t \cdot l} \sum_{u=1}^{t} \sum_{v=1}^{l} \tilde{I}_L^C(u,v),$$

the ĨL(u, v) are the pixel grey levels in the portion of image around p̃L, and (t, l) are the dimensions of the portion of image used for the matching. The energy parameter α is proportional to the gradient in the portion of image IF:

$$\alpha = \sum_{u=1}^{t} \sum_{v=1}^{l} \left( \left|\frac{\partial I_F(u,v)}{\partial u}\right| + \left|\frac{\partial I_F(u,v)}{\partial v}\right| \right)$$

thus, in image regions with high gradient, which correlate well and unambiguously, the grid points are moved mostly relying on the external energy. On the other hand, in low gradient regions, where false matches may occur, the most significant term is the internal energy. It penalizes those points p̃L that are far away from the centroid of the eight surrounding grid points, P̄L (see figure 6), assuming that the best match must be located somewhere close to it; it also ensures the smoothness of the disparity map.

Figure 6. The point p̃L and its 8-neighbors.

More precisely:

$$\bar{P}_L^{(x,y)} = \frac{1}{8} \sum_{\substack{t,l=-1,0,1 \\ (t,l)\neq(0,0)}} P_L(i+t,\, j+l)^{(x,y)}$$

and the internal energy is composed of two terms as follows:

$$E_{int} = Pexp(a;\, p_1, s_1) + Pexp(b;\, p_2, s_2) \qquad (2)$$

$$Pexp(z;\, p, s) = \begin{cases} s \cdot |z| & \text{if } |z| < p \\ s \cdot p \cdot e^{\frac{|z|}{p}-1} & \text{otherwise} \end{cases} \qquad (3)$$

where p1 = 0.5, s1 = 1, p2 = 0.2, s2 = 2 are experimentally set parameters and the function Pexp(z; p, s) is defined for −1 < z < 1 (see figure 7). The first term in equation 2 is a function of the distance, a, between the horizontal position of p̃L and the horizontal

Figure 7. The functions Pexp(a; 0.5, 1) and Pexp(b; 0.2, 2).

Figure 8. The 3D reconstruction obtained with the snake grid algorithm.

position of P̄L; a is defined as:

$$a = \frac{\tilde{p}_L^x - \bar{P}_L^x}{\max\left(\left|P_L(i+1,j)^x - \bar{P}_L^x\right|,\ \left|P_L(i-1,j)^x - \bar{P}_L^x\right|\right)}$$
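The pieces of the internal energy can be sketched directly from equations 2 and 3; the parameter defaults follow the text, and `centroid_x` is an illustrative helper for the horizontal coordinate of the centroid P̄L:

```python
import math

def pexp(z, p, s):
    # Equation 3: linear penalty up to the knee p, exponential beyond it.
    # Both branches agree at |z| = p, where the value is s * p.
    if abs(z) < p:
        return s * abs(z)
    return s * p * math.exp(abs(z) / p - 1.0)

def centroid_x(grid, i, j):
    # Horizontal coordinate of the centroid of the eight grid points
    # surrounding (i, j); grid[i][j] is an (x, y) pair.
    xs = [grid[i + t][j + l][0]
          for t in (-1, 0, 1) for l in (-1, 0, 1) if (t, l) != (0, 0)]
    return sum(xs) / 8.0

def internal_energy(a, b, p1=0.5, s1=1.0, p2=0.2, s2=2.0):
    # Equation 2 with the experimentally set parameters quoted in the text.
    return pexp(a, p1, s1) + pexp(b, p2, s2)
```

The piecewise shape keeps small deviations cheap (linear) while sharply penalizing candidates far from the centroid, which is what enforces the smoothness of the disparity map.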

In the second term, b is a function of the slopes of the segments connecting p̃L to PL(i, j − 1) and PL(i, j + 1); it is defined as follows, where the quantities Δx and Δy are shown in figure 6:

$$b = \frac{\left|\frac{\Delta x_1}{\Delta y_1} - \frac{\Delta x_2}{\Delta y_2}\right|}{\frac{1}{2}\left(\Delta y_1 + \Delta y_2\right)}$$

The iterative snake algorithm running on the grids at level Lk of the pyramid is stopped when the difference between the global energies of two consecutive iterations is |En − En−1| < ε(k). We set $\varepsilon(k) = \frac{\sqrt{\sharp G_L(k)}}{1000}$.

Once the dense disparity maps are computed on the two pairs of stereo images, two 3D models are built by a ray intersection algorithm. An example of the 3D model obtained is shown in figure 8. The original face sub-image has a dimension of 644×423, the multi-scale pyramid has 4 levels, and the integration of the two 3D models provides a final 3D face model composed of 10146 = 5073 · 2 points.

5. Conclusions and future works

In this paper we present two methods for 3D face reconstruction from 2D stereo images. We introduce two stereo matching algorithms used to create a sparse and a dense model. The sparse 3D model is mainly based on the reconstruction of the 15 most characteristic facial fiducial points. In order to verify the usefulness of 3D information for face recognition, we have computed 3D biometric measures of the face such as the inter-ocular distance, the width of the eyes and the mouth, and the distance between the eye corners and the nose tip. Preliminary experiments show that these measures do not increase the performance of the face recognition system [1] working on 2D images only. This is probably due to the small number of features that can be extracted from the 15 3D fiducial points. We believe that a greater number of points may help define a set of more discriminative features and, based on this assumption, we developed a more detailed dense model. Future work is focused on the definition of a feature set extracted on the dense model; the features may be obtained by applying 3D filters or by inferring measures on characteristic face curves.
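The biometric measures mentioned in the conclusions are plain Euclidean distances between reconstructed 3D fiducial points; a sketch with hypothetical point labels (the dictionary keys are mine, not the paper's naming):

```python
import numpy as np

def distance_3d(p, q):
    # Euclidean distance between two reconstructed 3D fiducial points.
    return float(np.linalg.norm(np.asarray(p, float) - np.asarray(q, float)))

def biometric_measures(points):
    # points: dict mapping fiducial-point names (hypothetical labels)
    # to their reconstructed 3D coordinates.
    return {
        "inter_ocular": distance_3d(points["left_eye_center"], points["right_eye_center"]),
        "mouth_width": distance_3d(points["left_lip_corner"], points["right_lip_corner"]),
        "eye_nose": distance_3d(points["left_eye_external_corner"], points["nose_tip"]),
    }
```

Such distances are pose-invariant once the points are in 3D, which is what motivates extracting them from the reconstruction rather than from the 2D images.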

References

[1] S. Arca, P. Campadelli, and R. Lanzarotti. A face recognition system based on automatically determined face fiducial points. Pattern Recognition, 39:432–443, March 2006.
[2] S. Arca, E. Casiraghi, P. Campadelli, and G. Lombardi. A fully automatic system for chessboard corner localization. Submitted to Pattern Recognition Letters, August 2005.
[3] A. Benshair, P. Miche, and R. Debrie. Fast and automatic stereo vision matching algorithm based on dynamic programming method. Pattern Recognition Letters, 17:457–466, 1996.
[4] M. Brown, D. Burschka, and G. Hager. Advances in computational stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25:993–1008, 2003.
[5] P. Campadelli, R. Lanzarotti, and G. Lipori. Eye localization and face recognition. RAIRO, to be published, 2006.
[6] I. Cox, S. Hingorani, S. Rao, and B. Maggs. A maximum likelihood stereo algorithm. Computer Vision and Image Understanding, 63:542–567, 1996.
[7] F. et al. Real-time correlation-based stereo: algorithm, implementation and applications. Technical Report 2013, 1993.
[8] A. Fusiello and V. Roberto. Efficient stereo with multiple windowing. IEEE, pages 1063–69, 1997.
[9] A. Fusiello, E. Trucco, and A. Verri. A compact algorithm for rectification of stereo pairs. Machine Vision and Applications, pages 16–22, 2000.
[10] S. Huq, B. Abidi, A. Goshtasby, and M. Abidi. Stereo matching with energy minimizing snake grid for 3D face modeling. Proceedings of SPIE, Conference on Biometric Technology for Human Identification, 5404:339–350, 2004.
[11] S. S. Intille and A. F. Bobick. Disparity-space images and large occlusion stereo. European Conference on Computer Vision, pages 179–186, 1994.
[12] Y. Ohta and T. Kanade. Stereo by intra- and inter-scanline search using dynamic programming. IEEE Transactions on Pattern Analysis and Machine Intelligence, 7(2):139–154, 1985.
[13] M. Rydfalk. Candide, a parameterized face. Report No. LiTH-ISY-I-866, Dept. of Electrical Engineering, Linköping University, Sweden, 1987.
[14] E. Trucco and A. Verri. Introductory Techniques for 3-D Computer Vision. Prentice Hall, 1998.
[15] H. Zao. Global optimal surface from stereo. In Proc. of International Conference on Pattern Recognition, 1:101–104, 2000.
[16] Z. Zhang. Flexible camera calibration by viewing a plane from unknown orientations. In Proc. of International Conference on Computer Vision (ICCV'99), Corfu, Greece, pages 666–673, 1999.