3D View-Invariant Face Recognition Using a Hierarchical Pose-Normalization Strategy

Authors: Martin D. Levine and Ajit Rajwade
Center of Intelligent Machines, Room 410, 3480 University Street, McConnell Engineering Building, McGill University, Montreal, H3A 2A7, Canada.
Address: Same as above.
Email addresses: [email protected], [email protected]
Corresponding Author: Martin D. Levine


Abstract
Face recognition from 3D shape data has been proposed as a method of biometric identification, either supplementing or reinforcing a 2D approach. This paper presents a 3D face recognition system capable of recognizing the identity of an individual from a 3D facial scan in any pose across the view-sphere, by suitably comparing it with a set of models (all in frontal pose) stored in a database. The system makes use of only 3D shape data, ignoring textural information completely. Firstly, we propose a generic learning strategy using support vector regression [2] to estimate the approximate pose of a 3D head. The support vector machine (SVM) is trained on range images in several poses belonging to only a small set of individuals and is able to coarsely estimate the pose of any unseen facial scan. Secondly, we propose a hierarchical two-step strategy to normalize a facial scan to a nearly frontal pose before performing any recognition. The first step consists of either a coarse normalization making use of facial features or the generic learning algorithm using the SVM. This is followed by an iterative technique to refine the alignment to the frontal pose, which is essentially an improved form of the Iterated Closest Point algorithm [8]. The latter step produces a residual error value, which can be used as a metric to gauge the similarity between two faces. Our two-step approach is experimentally shown to outperform both of the individual normalization methods in terms of recognition rates, over a very wide range of facial poses. Our strategy has been tested on a large database of 3D facial scans in which the training and test images of each individual were acquired at significantly different times, unlike all except one of the existing 3D face recognition methods.

Keywords: 3D face recognition; support vector regression; iterated closest point; 3D pose estimation; pose normalization; residual error

Acknowledgements: The authors would like to thank the Natural Sciences and Engineering Research Council (NSERC), Canada, for its financial assistance.


List of Figures:
Figure 1: Sample Faces from the Freiburg Database
Figure 2: Faces from the Notre Dame Database
Figure 3: A Few Cropped Models from the Notre Dame Database
Figure 4: Typical corresponding cropped probe images from the Notre Dame Database
Figure 5: Residual error between different images of the same persons
Figure 6: Residual error values between different images of different persons
Figure 7: Residual error histogram for images of the same (red) and different (green) people shown together for comparison
Figure 8: Recognition rate versus image size
Figure 9: Recognition rate versus number of training images
Figure 10: Two scans of the same person with different facial expressions
Figure 11: Removal of non-rigid regions of the face (region below the four dark lines)

List of Tables
Table 1: Results of Pose Estimation
Table 2: Recognition rates with ICP, ICP-variant and LMICP
Table 3: Recognition rates with ICP and the ICP-variants after applying the feature-based method as an initial step
Table 4: Recognition rates with ICP and the ICP-variant after applying SVR as the initial step

1. Introduction

Owing to heightened security concerns during the past decade, face recognition has emerged as an important method of biometric authentication. Traditional methods of recognizing the identity of individuals use 2D facial images. However, 2D images of an object are known to be inherently sensitive to the incident illumination. Thus, images of one and the same individual under different lighting conditions tend to differ more than images of different individuals under the same lighting conditions [1]. A possible way to circumvent this difficulty is to employ a sensor that gathers data in a manner insensitive to visible light. One such sensor is a 3D range scanner, which measures the actual 3D shape of the face and is therefore inherently invariant to changes in illumination.

Face recognition systems are also very sensitive to variations in facial pose. A robust system should be able to recognize individuals across a very large range of poses, ideally the entire view-sphere. Geometric normalization for pose variations is certainly simpler given the 3D shape of a person’s head than given only the 2D texture image. Nonetheless, robustly dealing with pose variations remains a major issue even in face recognition systems that use 3D data.

In this paper, we outline a hierarchical strategy to normalize the 3D scan of a person’s face (in any pose across the entire view-sphere) to a nearly frontal view. Our hierarchical strategy consists of an approximate normalization using support vector regression (SVR) [2], followed by a refined normalization using an improved variant of an existing surface (or curve) registration method, the “iterated closest point” (ICP) algorithm [8]. To test the efficacy of the combined strategy, we take an ensemble of 3D facial scans belonging to several individuals, originally in several different views, and normalize them using our technique. Thereafter, we compare these scans with a database containing frontal 3D scans of the same individuals and ascertain the recognition rates. We report a recognition rate of 91.5% on the Notre Dame database for the case of a single 3D training image. We also quantify the variation in recognition rates with respect to increasing pose difference between training and test images. To the best of our knowledge, this is the first attempt to systematically study the effect of pose variation on the 3D face recognition rate. Our two-step pose-normalization method is shown to be stable over a large range of views across the entire view-sphere (inclusive of profile views), unlike existing studies, which have reported recognition rates only for near-frontal head poses ([10], [11], [12], [14], [16], [17], [18], [20], [22], [23]).

Both the work in [17] and this paper represent the largest studies on 3D face recognition done so far. Additionally, unlike most existing systems (with the sole exception of [17]), our approach has been tested on a database in which the time interval between the acquisition of training and test images is significant. In comparison to [17], we report a slightly lower recognition rate on the same database. However, it should be noted that the approach in [17] requires considerable manual intervention, whereas the method described in this paper is fully automated.


This paper is organized as follows. Section (2) briefly reviews the existing literature in 3D face recognition. Section (3) presents a bird’s eye view of the approach discussed in the paper. Thereafter, in section (4), we proceed to describe our strategy in greater detail, beginning with a brief discussion of the theory of support vector machines and their application to facial pose estimation from range data. In section (5), we explain the need for further refinement of the pose estimate and give a detailed description of ICP and its merits and demerits. We also describe a variant of ICP that uses simple heuristics to improve the registration performance (in terms of recognition accuracy). Several experimental results on a large face database (obtained from Notre Dame University [17]) are presented in section (6) along with detailed discussions. Finally, section (7) presents the conclusions along with pointers to possible future work.

2. Background

Heretofore, not many papers have been published on 3D face recognition. The existing literature can be divided into different categories based on the approach followed. These include 3D face recognition methods employing principal components analysis (PCA), methods that represent faces as vectors of specific features, methods that use point signatures, and methods that use the iterated closest point (ICP) algorithm. This section briefly reviews these approaches and their reported results, followed by a short critique.

(2.1) Methods Using PCA

Principal Components Analysis (PCA) was first used for the purpose of 2D face recognition in the paper by Turk and Pentland [9]. The technique has been applied to 3D face recognition by Hesher and Srivastava [10]. Their database consisted of 222 range images of 37 different people, with the different images of any one person covering 6 different facial expressions. The range images are normalized for pose changes by first detecting the nasal bridge and then aligning it with the Y-axis. An eigenspace is then created from the “normalized” range images and used to project the images onto a lower-dimensional space. Using exactly one gallery image per person, a face recognition rate of 83% was obtained.

PCA has also been used by Tsalakanidou et al [11] on a set of 295 frontal 3D images, each belonging to a different person. They chose one range image each of 40 different people to build an eigenspace for training purposes. Their test set consisted of artificially rotated range images of all 295 people in the database, with the angle of rotation around the Y-axis varying from 2 to 10 degrees. For the 2-degree rotation case, they claim a recognition rate of 93%, but this rate drops to 85% for rotations larger than 10 degrees.

Yet another study using PCA on 3D data has been reported by Achermann et al [12]. They used the PCA technique to build an eigenspace out of 5 poses each of 24 different people. Their method was tested on 5 different poses of each of the same people, but the poses of the selected test images appear to lie between the different training poses (no specific data are provided in this paper). The authors report a recognition rate of 100% on their data set using PCA with 5 training images per person. They also applied the method of Hidden Markov Models to exactly the same data set and report a recognition rate of 89.17%.

Chang et al [17] report the largest study on 3D face recognition to date, based on a total of 951 range images of 277 different people. Using a single gallery image per person, and multiple probes, each acquired at a different time from the gallery, they obtained a face recognition rate of 92.8% by performing PCA using just the shape information. They also examined the effect of spatial resolution (in the X, Y and Z directions) on the accuracy of recognition. However, they perform manual facial pose normalization by aligning the line joining the centers of the eyes with the X-axis, and the line joining the base of the nose and the chin with the Y-axis. Obviously, manual normalization is not feasible in a practical system, besides being prone to human error in marking feature points.

(2.2) Feature-based Methods

Surface properties such as the maximum and minimum principal curvatures allow segmentation of a surface into regions of concavity, convexity and saddle points, and thus offer good discriminatory information for object recognition purposes. Tanaka et al [14] calculate the maximum and minimum principal curvature maps from the depth maps of faces. From these curvature maps, they extract the facial ridge and valley lines. The former are a set of vectors that correspond to local maxima in the values of the minimum principal curvature; the latter are a set of vectors that correspond to local minima in the values of the maximum principal curvature. From the knowledge of the ridge and valley lines, they construct extended Gaussian images (EGI) for the face by mapping each of the principal curvature vectors onto two different unit spheres, one for the ridge and the other for the valley lines. Matching between model and test range images is performed using Fisher’s spherical correlation [15], a rotation-invariant similarity measure, between the respective ridge and valley EGIs. This algorithm has been tested on only 37 range images, with each image belonging to a different person, and 100% accuracy has been reported (see Section 2.5 for a detailed discussion of results reported in other publications). However, extraction of the ridge and valley lines requires the curvature maps to be thresholded. This is a clear disadvantage because there is no explicit rule for obtaining an ideal threshold, and the locations of the ridge and valley lines are very sensitive to the chosen value.

Lee and Milios [20] obtain convex regions from the facial surface using curvature relationships to represent distinct facial regions. Each convex region is represented by an EGI obtained by performing a one-to-one mapping between points in those regions and points on the unit sphere that have the same surface normal. The similarity between two convex regions is evaluated by correlating their extended Gaussian images. To establish the correspondence between two faces, a graph-matching algorithm is employed to correlate the set of only the convex regions in the two faces (ignoring the non-convex regions).

It is assumed that the convex regions of the face are less sensitive to changes in facial expression than the non-convex regions; hence their method has some degree of expression invariance. However, they tested their algorithm on range images of only 6 people, and no results have been explicitly reported.

Feature-based methods aim to locate salient facial features such as the eyes, nose and mouth using geometrical or statistical techniques. Commonly, surface properties such as curvature are used to localize facial features by segmenting the facial surface into concave and convex regions and making use of prior knowledge of facial morphology [16], [18]. For instance, the eyes are detected as concavities (which correspond to positive values of both the mean and Gaussian curvature) near the base of the nose. Alternatively, the eyebrows can be detected as distinct ridge lines near the nasal base. The mouth corners can also be detected as symmetrical concavities near the bottom of the nose. After locating salient facial landmarks, feature vectors are created based on the spatial relationships between these landmarks. These spatial relationships may take the form of distances between two or more points, areas of certain regions, or the values of the angles between three or more salient feature points. Gordon [16] creates a feature vector of 10 different distance values to represent a face, whereas Moreno et al [18] create an 86-valued feature vector. Moreno et al [18] segment the face into 8 different regions and two distinct lines, and their feature vector includes the area of each region and the distances between the centers of mass of the different regions, as well as angular measures. In both [16] and [18], each feature is given an importance value or weight, which is obtained from its discriminatory value as determined by Fisher’s criterion [19]. The similarity between gallery and probe images is calculated as the similarity between the corresponding weighted feature vectors. Gordon [16] reports a recognition rate of 91.7% on a dataset of 25 people, whereas Moreno et al [18] report a rate of 78% on a dataset of 420 range images of 60 individuals in two different poses (looking up and down) and with five different expressions.

A major disadvantage of these methods is that the localization of accurate feature points (as well as points such as the centroids of facial regions) is highly susceptible to noise, especially because curvature is a second derivative. This leads to errors in the localization of facial features. The errors are further increased by even small pose changes that can cause partial occlusion of some features, for instance downward facial tilts that partially conceal the eyes. Hence the feature-based methods described in [16] and [18] lack robustness.

(2.3) Methods Using Point Signatures

The concept of point signatures was introduced by Chua and Jarvis and was used for the purpose of robust object recognition in [21]. It was extended to expression-invariant 3D face recognition by Chua, Han and Ho [22]. In the latter, the facial surface is treated as a non-rigid surface. A heuristic function is first used to identify and eliminate the non-rigid regions on the two facial surfaces (further details in [22]). Correspondence is established between the rigid regions of the two facial surfaces by means of correlation between the respective point-signature vectors and other criteria such as distance, and finally the optimal transformation between the surfaces is estimated in an iterative manner.
Despite its advantages, this method has been tested on images of only six different people, with four range images of different facial expressions for each of the six persons. Another disadvantage is that the registration achieved by this method is not very accurate (as reported in [22]; no exact results are specified in [23]) and requires a further refinement step such as ICP. This two-step procedure contains two iterative steps and is computationally very expensive.

The concept of point signatures has also been used for face recognition in recent work by Wang, Chua and Ho [23]. They manually select four fiducial points on the facial surface from a set of training images and calculate the point signatures over 3-by-3 neighborhoods surrounding those fiducial points (i.e., 9 point-signature vectors). These signature vectors are then concatenated to yield a single feature vector. The selected fiducial points include the nasal tip, the nasal base and the two outer eye corners. A separate eigenspace is built from the point signatures in the 3-by-3 neighborhood surrounding each fiducial point in each range image; thus, four different eigenspaces are constructed in total. To locate the corresponding four fiducial points in a test range image, point signatures are calculated for the 3-by-3 neighborhoods surrounding each pixel in the image and represented by a single vector. The distance-from-feature-space (DFFS) value is calculated between the vector at each pixel on the test face and the four eigenspaces created during training. The fiducial points correspond to those pixels at which the DFFS value with respect to the appropriate eigenspace is minimal. For face matching, classification is performed using support vector machines [2], with the input consisting of the pixel signature vectors in the 3-by-3 neighborhoods surrounding the four fiducial points. The maximum recognition rate with three training images and three test images per person is reported to be around 85%. The different images collected for each person show some variation in facial expression. The authors do not specify the effect of important parameters such as the radius of the sphere required for calculating the point signatures. Furthermore, their approach takes into account data at only four fiducial points on the surface of the face. This seems inadequate from the point of view of robust facial discrimination, as it ignores several important regions of the face. They also do not provide any statistical analysis of errors in the localization of the facial feature points and their effect on recognition accuracy.

(2.4) Methods Using ICP

Lu, Colbry and Jain have used an ICP-based method for facial surface registration in [25] and [26]. They employed a feature-based method followed by a hybrid ICP algorithm alternating in successive iterations between the method proposed by Besl and McKay [8] and that proposed by Chen and Medioni [24]. In this way they are able to exploit the advantages of both algorithms: the greater speed of the Besl and McKay [8] technique and the greater accuracy of the method of Chen and Medioni [24]. Their hybrid ICP algorithm has been tested on a database of 18 individuals with frontal gallery images and probe images involving pose and expression variations. A probe image is registered with each of the 18 gallery images, and the gallery image giving the lowest residual error is considered to be the best match. Using the residual error alone, they obtain a recognition rate of 79.3%. This result is improved to 84% by incorporating additional data such as a shape index and texture. (The recognition rate referred to here is for the case of the automated hybrid ICP algorithm applied to range images without considering texture; see Table (1) in [25], where the column entitled “ICP hybrid, automatic, range-image only” gives a recognition rate of 79% [(63-13)/63].)

(2.5) Critique

Two important shortcomings of most existing 3D face recognition systems are worth mentioning. Firstly, apart from [17], none of the experiments described in this section specify the time-span between the collection of the training and testing images for the same person. It must therefore be assumed that the scans for each person were acquired at the same sitting. The inclusion of sufficient time gaps between the collection of training and testing images is a vital component of the well-known FERET protocol for face recognition [13]. Likewise, none of the aforementioned papers has presented a detailed study of the effect of an increase in the pose difference between gallery and probe images on the recognition rate reported by the respective algorithms. Additionally, the techniques described in the literature are designed to work only for those facial poses in which both eyes are clearly visible. In this study, we use the same database as [17] and report a recognition result of 91.5% (as against their 92.8%). However, ours is a fully automated system, unlike the one in [17], which requires manual intervention for detection of salient feature points. Additionally, we show results that quantify the effect of pose variation between probe and gallery images, and report results that are fairly stable across a wide range of poses covering the entire view-sphere. To the best of our knowledge, ours is the first attempt to perform such an investigation.
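Because the ICP-based methods above (and the refinement step described later in this paper) rank gallery faces by the residual registration error, a minimal sketch of a basic Besl-and-McKay-style ICP loop and its residual is given below for concreteness. This is an illustrative implementation only, not the hybrid algorithm of [25] nor the variant proposed in Section 5; the function names, the simple point-to-point correspondence rule and the RMS residual are our own choices.

# --- illustrative sketch (Python / NumPy, SciPy) ---
import numpy as np
from scipy.spatial import cKDTree

def best_rigid_transform(src, dst):
    """Least-squares rotation R and translation t mapping src onto dst (Kabsch/SVD)."""
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # guard against a reflection solution
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = dst_c - R @ src_c
    return R, t

def icp_residual(probe, gallery, n_iter=50, tol=1e-6):
    """Align probe (Nx3 points) to gallery (Mx3 points); return the final RMS residual."""
    tree = cKDTree(gallery)
    P = probe.copy()
    prev = np.inf
    for _ in range(n_iter):
        dist, idx = tree.query(P)                 # closest gallery point for each probe point
        R, t = best_rigid_transform(P, gallery[idx])
        P = P @ R.T + t                           # apply the incremental rigid transform
        rms = np.sqrt(np.mean(dist ** 2))         # residual of the current correspondences
        if abs(prev - rms) < tol:                 # stop when the residual no longer improves
            break
        prev = rms
    return rms

# Identification: the gallery face yielding the smallest residual is declared the match, e.g.
# scores = {name: icp_residual(probe_pts, pts) for name, pts in gallery_faces.items()}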

3. Overview

The aim of our system is to normalize for facial pose changes in a completely automated manner. Intuitively, feature-based methods that can reliably detect three distinct fiducial points on the face can be utilized for the purpose of pose alignment. Most of the prior research on 3D facial pose-normalization employs feature detectors to locate the two inner eye corners and the nasal tip ([16], [17]). The actual alignment is performed using prior knowledge of the spatial relationships between selected fiducial points on a canonical frontal face. For instance, the face is rotated in such a way that the line joining the two inner eye corners is aligned with the horizontal or X-axis, and the line that joins the nasal tip to the center of the line connecting the two inner eye corners always has a fixed orientation with respect to the Y-axis. If the eye corners and nasal tip are detected accurately, the face can easily be normalized to an exact frontal view. These methods have been employed in [16] with automated feature detection and in [17] with manual techniques. Given a set of normalized faces, a simple Euclidean distance comparison can act as a reliable metric for recognition. However, these techniques cannot be employed for large pose variations across the view-sphere. For instance, in facial views with a large yaw rotation, several distinctive facial features, such as one of the eyes (and both the inner eye corners), are obscured. Furthermore, feature-based methods lack robustness against noise, which is commonly found in range data, and therefore the exact locations of the individual points can be erroneous [18].

In this paper, we follow a different approach to pose normalization, which stems from an interesting observation regarding facial pose variations: the faces of different individuals in similar poses show a marked similarity to one another, which can be expressed in terms of the apparent 3D shape characteristics (or facial outlines) of a particular pose. This is the primary motivation for employing machine learning to recognize the pose of an individual’s face. The specific technique used here is support vector regression. A mathematical relationship is deduced between prototypical 3D shapes in different poses and the poses themselves, the latter being quantified in terms of the angles of rotation around the Y- and X-axes (further details in [31]). Such a learning approach has been used before by Gong et al [28] for pose estimation from 2D facial images. However, the learning algorithm cannot predict the exact angle from a 3D scan. Instead, we treat the angles found as coarse estimates and rotate the 3D facial scan to a roughly frontal view. Thereafter, we make use of the assumption of facial symmetry on either side of the nasal ridge to correct for missing facial points. Finally, we utilize a robust variant of ICP [8] to refine the normalization by aligning the 3D scan to a pose as close as possible to a perfectly frontal pose. The ICP algorithm (or its variant described in section (5) below) yields residual error values upon convergence, which can act as a reliable metric for face recognition.
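To make the coarse normalization step concrete, the sketch below rotates a facial point cloud back towards a frontal view using the yaw and pitch angles predicted by the regression stage, and then exploits the assumed left-right symmetry of the face about the plane through the nasal ridge to fill in points lost to self-occlusion. This is only a sketch of the idea: the choice of the x = 0 plane as the symmetry plane, the angle convention and the function names are our assumptions for illustration, and the refined alignment is still left to the ICP step.

# --- illustrative sketch (Python / NumPy) ---
import numpy as np

def rotation_y(yaw):
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[ c, 0., s],
                     [0., 1., 0.],
                     [-s, 0., c]])

def rotation_x(pitch):
    c, s = np.cos(pitch), np.sin(pitch)
    return np.array([[1., 0., 0.],
                     [0.,  c, -s],
                     [0.,  s,  c]])

def coarse_normalize(points, yaw_deg, pitch_deg):
    """Undo the estimated yaw/pitch so the centred scan faces roughly along +Z."""
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    R = rotation_x(-pitch) @ rotation_y(-yaw)      # inverse of the estimated head rotation
    centred = points - points.mean(axis=0)
    return centred @ R.T

def mirror_fill(points):
    """Assume facial symmetry about x = 0 and add mirrored points to
    compensate for data missing on the self-occluded half of the face."""
    mirrored = points * np.array([-1., 1., 1.])
    return np.vstack([points, mirrored])

# frontal = mirror_fill(coarse_normalize(scan_xyz, yaw_hat, pitch_hat))
# 'frontal' would then be passed to the ICP refinement described in Section 5.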

4. Pose Estimation Using Support Vector Machines

In this section, we briefly describe the application of SVMs to pose estimation from range data, beginning with a note on the theory of support vector regression, followed by an overview of the experimental method and a summary of the results. The reader is referred to [31] for detailed experimental results on facial pose estimation from 3D shape information.

(4.1) Theory of Support Vector Regression

Support Vector Machines have emerged in recent times as a very popular data-mining technique for applications such as classification [2], regression [4] and unsupervised outlier removal [5]. Consider a set of l input patterns denoted as x, with their corresponding labels denoted by the vector y. A support vector machine obtains a functional approximation given as f(x, α) = w·Φ(x) + b, where Φ is a mapping function from the original space of samples onto a higher-dimensional space, b is a threshold, and α represents a set of parameters of the SVM. If y is restricted to the values –1 and +1, the approximation is called support vector classification (SVC). If y can assume any valid real values, it is called support vector regression (SVR). Using a kernel function given as K(x, y) = Φ(x)·Φ(y), the problem of support vector regression can be stated as the following optimization problem [4]:

Maximize over α and α*:

W(\alpha^*, \alpha) = -\frac{1}{2} \sum_{i,j=1}^{l} (\alpha_i^* - \alpha_i)(\alpha_j^* - \alpha_j) K(x_i, x_j) - \varepsilon \sum_{i=1}^{l} (\alpha_i^* + \alpha_i) + \sum_{i=1}^{l} y_i (\alpha_i^* - \alpha_i)     Equation (2)

subject to the following conditions:

\sum_{i=1}^{l} (\alpha_i^* - \alpha_i) = 0 and 0 \le \alpha_i, \alpha_i^* \le C for i = 1, \dots, l.
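In practice, the regression above can be set up with any standard SVR implementation. The sketch below is illustrative only and is not the configuration used in our experiments: it assumes the range images have already been rendered at known poses, trains one RBF-kernel regressor per rotation angle on flattened depth maps, and uses the predictions as the coarse pose estimate fed to the normalization stage. The hyperparameter values and function names are our assumptions.

# --- illustrative sketch (Python / scikit-learn) ---
import numpy as np
from sklearn.svm import SVR

def train_pose_estimators(range_images, yaws, pitches, C=100.0, epsilon=1.0):
    """range_images: (n, h, w) array of depth maps acquired at known poses;
    yaws, pitches: rotation angles (degrees) about the Y- and X-axes."""
    X = range_images.reshape(len(range_images), -1)       # one flattened depth map per row
    yaw_svr = SVR(kernel="rbf", C=C, epsilon=epsilon).fit(X, yaws)
    pitch_svr = SVR(kernel="rbf", C=C, epsilon=epsilon).fit(X, pitches)
    return yaw_svr, pitch_svr

def estimate_pose(yaw_svr, pitch_svr, range_image):
    """Return the coarse (yaw, pitch) estimate for a single unseen range image."""
    x = range_image.reshape(1, -1)
    return float(yaw_svr.predict(x)[0]), float(pitch_svr.predict(x)[0])

# yaw_svr, pitch_svr = train_pose_estimators(train_depths, train_yaws, train_pitches)
# yaw_hat, pitch_hat = estimate_pose(yaw_svr, pitch_svr, probe_depth)   # coarse estimate only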