Face Recognition at-a-Distance using Texture and Sparse-Stereo Reconstruction

Ham Rara, Aly Farag, Shireen Elhabian, Asem Ali, William Miller, Thomas Starr and Todd Davis

Abstract— We propose a framework for face recognition at a distance based on texture and sparse-stereo reconstruction. We develop a 3D acquisition system that consists of two CCD stereo cameras mounted on pan-tilt units with an adjustable baseline. We first detect the facial region and extract its landmark points, which are used to initialize the face alignment algorithm. The fitted mesh vertices, generated by the face alignment process, provide point correspondences between the left and right images of a stereo pair; stereo-based reconstruction is then used to infer the 3D information of the mesh vertices. We perform experiments on the use of different features extracted from these vertices for face recognition. The local patches around the landmark points are also well-suited to Gabor-based and LBP-based recognition. The cumulative match characteristic (CMC) curves generated using the proposed framework confirm the feasibility of the proposed work for long-distance recognition of human faces.

I. INTRODUCTION

There has been a considerable increase in biometrics applications recently, due to advances in technology and higher security demands. Biometric recognition systems use unique human signatures (face, iris, fingerprint, gait, etc.) to automatically identify or verify individuals [1]. Biometric sensing modalities can be classified into three categories: contact, contactless, and at-a-distance. The distance between the user and the sensor required to obtain an acceptable biometric sample determines the classification [1]. Contact devices (e.g., fingerprint, palm) need active cooperation from the user, since they require physical contact with the sensor. Contactless devices do not require physical contact; however, sensors in this category operate at short range, roughly 1 cm to 1 m, entailing some degree of cooperation from the user. Iris-capture devices and touchless fingerprint sensors are examples of this category. Biometric systems capable of acquiring data at-a-distance are well-suited to integrated surveillance/identity tasks, since active cooperation from the target may not be required. Gait recognition and newly developed iris recognition systems fall under this category. The main focus of this paper is to develop an at-a-distance biometric modality based on face recognition, since the human face (aside from gait) is the most accessible trait for these applications.

Face recognition is a challenging problem that has been an attractive area of research for the past three decades [2].

H. Rara, A. Farag, S. Elhabian, A. Ali, W. Miller and T. Starr are with the CVIP Lab, University of Louisville, Louisville, KY, USA

{hmrara01,aafara01}@louisville.edu
T. Davis is with EWA Government Systems, Inc., Bowling Green, Kentucky, USA [email protected]


Fig. 1. Camera-lens-tripod portable unit with rechargeable batteries. Two such portable units comprise our stereo system and are connected wirelessly to a server, where processing and analysis are done.

The face recognition problem is formulated as follows: given a still image, identify or verify one or more persons in the scene using a stored database of face images. The main theme of the solutions offered by different researchers involves detecting one or more faces in the given image, followed by extraction of facial features that can be used for recognition. Depending on the distance between the target user and the camera, face recognition systems can be categorized into three types: near-distance, middle-distance, and far-distance. The last case is referred to as face recognition at-a-distance (FRAD) [3]. Cameras in near-distance face recognition systems can easily capture high-resolution, stable images; image quality becomes a major issue in FRAD systems. This is in addition to the fact that the performance of many state-of-the-art face recognition algorithms suffers in the presence of lighting, pose, and many other variations [4].

Recently, there has been growing interest in face recognition at-a-distance. Yao et al. [5] created a face video database acquired at long distances and high magnifications, both indoors and outdoors, under uncontrolled surveillance conditions. Medioni et al. [6] presented an approach to identify non-cooperative individuals at a distance by inferring 3D shape from a sequence of images.

Due to the current lack of existing FRAD databases (as well as facial stereo datasets), we built our own passive stereo acquisition setup in [7]. We initially captured images from 30 subjects at various distances (3, 15, and 33 meters) indoors. Shape from sparse-stereo reconstruction was used to identify subjects, with acceptable results. We increased the number of subjects to 60 in [8] and captured images at 30 m and 50 m, outdoors. In addition to sparse-stereo reconstruction, we performed dense reconstruction to aid recognition. The face alignment approach in [7][8] is the simultaneous Active Appearance Model (AAM) fitting of [9][10]. For this work, we increased the number of subjects to 97 and captured images at 80 m. Recently developed face alignment methods (e.g., Constrained Local Models (CLM)) are used to improve on our AAM implementation. We also make use of texture from the stereo-pair images for recognition. This work is part of our long-term project to create a real FRAD system capable of identification at distances as far as 1000 meters in an unconstrained setting.

The paper is organized as follows: Section II describes the system setup, Section III discusses face detection and alignment, Section IV covers the features used for recognition, and later sections deal with the experimental results, discussion, and conclusions.

II. SYSTEM SETUP

To achieve our goal, and due to the lack of facial stereo databases, we built our own passive stereo acquisition setup to acquire a stereo database. Our setup consists of a stereo pair of high-resolution cameras (with telephoto lenses) in an adjustable wide-baseline configuration. Specifically, the system consists of two CCD cameras mounted on pan-tilt units (PTUs) on two separate tripods. Each camera (Canon EOS 50D) is fitted with a telephoto lens (Canon EF 800 mm) and produces an image of 4752 × 3168 resolution. The camera-lens-tripod combination is packaged compactly into a single portable unit, as shown in Fig. 1. Two such portable units comprise our stereo system and are connected wirelessly to a server, where processing and analysis are done. Flexible baselines for stereo reconstruction can be achieved by positioning the two units strategically relative to each other.

Since the system parameters (i.e., baseline B (meters), focal length f (mm), pan angle φ (degrees), and camera scale factor kα (pixels/mm)) are known, a scene point (x, y, z) can be reconstructed from its projections p and q in the left and right images, respectively, by triangulation. The necessary equations and sample values of the system parameters can be found in [7].
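For illustration, a minimal Python sketch of two-view triangulation follows. It assumes an idealized rectified (parallel-axis) geometry; the actual verged, pan-angle-aware equations of our system are those given in [7], so this is an approximation for intuition only.

    import numpy as np

    def triangulate(p, q, B, f, k_alpha):
        # Idealized parallel-axis triangulation (illustration only; the
        # verged, pan-angle-aware equations of the real system are in [7]).
        # p, q: (x, y) pixel coordinates of the same scene point in the left
        # and right images, measured from each image center.
        # B: baseline (m), f: focal length (mm), k_alpha: scale (pixel/mm).
        xl, yl = p[0] / k_alpha, p[1] / k_alpha   # left sensor coords (mm)
        xr = q[0] / k_alpha                       # right sensor x coord (mm)
        d = xl - xr                               # disparity on sensor (mm)
        z = B * f / d                             # depth (m)
        return np.array([xl * z / f, yl * z / f, z])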

Fig. 2 shows sample images captured by our setup at various distances.

III. FACE DETECTION AND ALIGNMENT

Face recognition systems typically follow the detection-alignment-recognition pipeline: detecting faces, aligning them by locating facial feature points, and finally performing recognition based on the alignment results [2].

Fig. 2. Illustration of captured images: (a) 15-meter, (b) 30-meter, (c) 50-meter, and (d) 80-meter.

A. Face Detection

Face detection, in this work, involves detecting the facial region, as well as the eye and mouth locations (e.g., Fig. 3(a)). It starts by identifying possible facial regions in the input image, using a combination of the Viola-Jones detector [11] and the skin detector of [12]. The face is then divided into four equal parts to establish a geometrical constraint on the face. The facial landmarks are then identified using variants of the Viola-Jones detector, i.e., the face detector is replaced with the corresponding facial landmark detector (e.g., an eye detector) [13]. False detections are then removed by taking into account the geometrical structure of the face (i.e., expected facial feature locations).
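The following Python sketch illustrates this cascade-within-region strategy using OpenCV's stock Haar cascades; the skin-detector stage of [12] and the full geometric filtering are omitted, and the cascade file names are OpenCV defaults rather than the exact detectors of [13].

    import cv2

    face_cc = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    eye_cc = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_eye.xml')

    def detect_face_and_eyes(gray):
        # Stage 1: Viola-Jones detector [11] proposes candidate face regions.
        faces = face_cc.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        results = []
        for (x, y, w, h) in faces:
            # Stage 2: run the landmark cascade only inside the upper half of
            # the face, a simple geometric constraint in the spirit of the text.
            roi = gray[y:y + h // 2, x:x + w]
            eyes = eye_cc.detectMultiScale(roi, scaleFactor=1.1, minNeighbors=5)
            results.append(((x, y, w, h),
                            [(x + ex, y + ey, ew, eh) for (ex, ey, ew, eh) in eyes]))
        return results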


Fig. 3. (a) Sample of face and facial feature detection, (b) 68-vertex mesh used in face alignment.

B. Face Alignment

Our previous work in [7][8] uses the simultaneous Active Appearance Model (AAM) version of [10] to perform face alignment, with a 68-vertex facial mesh similar to that in Fig. 3(b). AAMs belong to the class of methods employing a generative model together with image templates of object appearance. However, such algorithms suffer when the target object has a large appearance variation from the training set, due to changes caused by illumination conditions and non-rigid motion.

Recently, Constrained Local Models (CLM) [14] and their variants [15][16] have been found to outperform leading holistic approaches such as AAMs. CLMs retain the shape model of AAMs but, instead of using the whole facial region, use only patches (local regions) around the facial feature points as image templates. For this paper, the Exhaustive Local Search (ELS) variant [17] of CLM is used. We extend the framework of [17] to include the Local Binary Pattern (LBP) method [20] for local feature extraction. We also constrain the fitting aspect of the classical Elastic Bunch Graph Matching (EBGM) approach [18] using the framework of [17]. These two approaches are compared to the previously used AAMs and to the Laplacian version of [17] in face alignment experiments on our acquired database.

C. Exhaustive Local Search (ELS)

This section discusses the ELS variant [17] of the Constrained Local Model (CLM) approach to face alignment. To better understand the ELS approach, we first discuss the holistic gradient-descent alignment approach. Given a template T(z) and a source image Y(z'), the goal is to find the best alignment between them, such that z' is the warped image pixel position of z, i.e., z' = W(z; p), where p refers to the warp parameters and z is the concatenation of the individual pixel coordinates x_i = (x_i, y_i), i.e., z = [x_1, ..., x_n]. The warp function for ELS is defined as W(z; p) = J p + z_0. This linear model is also known as the Point Distribution Model (PDM) [19]. Procrustes analysis is applied to all shape training samples to estimate the base shape z_0, and principal component analysis (PCA) is used to obtain the shape eigenvectors J. The first four eigenvectors of J are forced to correspond to similarity variation (translation, scale, rotation) [9]. The warp update for the holistic alignment approach is of the form

∆p = R [Y(z') − T(z)]   (1)

where R is the update matrix. Optimizing for the holistic warp update ∆p can lead to divergence if the target image contains a large appearance variation from the training set. The ELS approach instead searches within N local neighborhoods to find the N best translation updates ∆x_i, and then constrains all the local updates with the Jacobian matrix J = ∂W(z; p)/∂p. Once the local pixel updates ∆z = [∆x_1, ..., ∆x_n] are obtained, the global warp update ∆p can be estimated by a weighted least-squares optimization,

∆p = (J^T W J)^{−1} J^T W ∆z   (2)

where W is a diagonal weighting matrix. To find the optimal local update ∆x_i, an exhaustive local search, using various distance measures (depending on the local feature), is performed within a local neighborhood.

D. Local Features

1) Gabor Wavelets: This work follows the Elastic Bunch Graph Matching (EBGM) framework of [18], which makes use of the concept of a jet. A jet describes a small patch of gray-level values in an image I(x) around a given pixel x. It is based on a wavelet transform, defined as a convolution with a family of Gabor kernels. There are a total of 40 kernels, derived from five distinct frequencies and eight orientations; the resulting jet is therefore a set of 40 complex coefficients obtained for one image point.

2) Local Binary Patterns (LBP): The LBP operator, first introduced in [20], is a powerful texture descriptor reputed to be robust to illumination variations. The original operator labels the pixels of an image by thresholding the 3 × 3 neighborhood of each pixel with the center value and treating the result as a binary number. At a given pixel position (x_c, y_c), the decimal form of the resulting 8-bit word is

LBP(x_c, y_c) = Σ_{n=0}^{7} s(i_n − i_c) 2^n   (3)

where i_c corresponds to the gray level of the center pixel (x_c, y_c), i_n to the gray levels of the eight surrounding pixels, and s(x) is the unit-step function. The operator is invariant to monotonic changes in grayscale and can resist illumination variations as long as the absolute gray-level differences are not badly affected. The authors in [21] extended the LBP operator to circular neighborhoods of varying radius and introduced the concept of uniform patterns. The notation LBP^{u2}_{P,R} used in this paper refers to the extended LBP operator in a (P, R) neighborhood, with only uniform patterns considered.
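As a concrete illustration of (3), a minimal Python sketch of the basic 3 × 3 LBP code at a single pixel follows. The neighbor ordering is one possible convention; this is the original operator of [20], not the extended LBP^{u2}_{P,R} operator of [21].

    def lbp_3x3(img, xc, yc):
        # Basic LBP code of (3): threshold the 8 neighbors of (xc, yc)
        # against the center gray level i_c and pack into an 8-bit word.
        # img: 2D numpy array of gray levels.
        offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                   (1, 1), (1, 0), (1, -1), (0, -1)]  # one possible ordering
        ic = int(img[yc, xc])
        code = 0
        for n, (dy, dx) in enumerate(offsets):
            s = 1 if int(img[yc + dy, xc + dx]) >= ic else 0  # s(i_n - i_c)
            code += s << n                                    # s(.) * 2^n
        return code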

E. Exhaustive Local Search using Gabor Jets

Measuring the similarity between two jets J and J' is crucial to ELS using Gabor jets. The similarity measure is

S_D(J, J', d) = Σ_{j=0}^{N} a_j a'_j cos(φ_j − (φ'_j + d · k_j)) / sqrt(Σ_{j=0}^{N} a_j² · Σ_{j=0}^{N} a'_j²)   (4)

where a_j refers to the magnitude, φ_j denotes the phase, k_j is a vector defined in [18], and d is the unknown displacement. To estimate the displacement d, we use the method in [18], where d has a closed-form solution based on maximizing the similarity function S_D in (4) through its Taylor expansion. This approach is otherwise known as DEPredictiveIter in [22]. Once the displacement vector d is computed, we can treat it as the local update ∆z in (2), and the global warp update ∆p is determined easily.

F. Exhaustive Local Search using LBPs

The authors in [23] noticed that LBP alone cannot describe the velocity of local variation. To solve this problem, they extended the LBP approach to include the gradient magnitude image in addition to the original image. Furthermore, in order to retain spatial information, the local region around each landmark point is divided into small regions, from which LBP histograms are extracted and concatenated into a single feature histogram representing the local features. To get this histogram, we extract a rectangular region of size 21 × 21, centered at the landmark point. A lowpass Gaussian filter is applied to reduce noise. The gradient magnitude image is then generated using Sobel filters (h_x and h_y), i.e., |∇I| = sqrt((h_x ⊗ I)² + (h_y ⊗ I)²). The original image, as well as the gradient magnitude image, is divided into four regions. We then build five histograms, corresponding to the whole image and the four regions, and concatenate the results for both images to get the final LBP histogram. This process is illustrated in Fig. 4.

Fig. 4. LBP histogram generation. No averaging is done with the histograms; they are concatenated to form the feature vector.
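A Python sketch of this feature construction follows, under stated assumptions: scikit-image's local_binary_pattern supplies the per-pixel codes, the Gaussian sigma is illustrative, and plain 256-bin histograms are used rather than uniform-pattern binning.

    import numpy as np
    from scipy.ndimage import gaussian_filter, sobel
    from skimage.feature import local_binary_pattern

    def lbp_histogram_feature(img, x, y, half=10):
        # 21 x 21 patch centered at the landmark (x, y), as in the text.
        patch = img[y - half:y + half + 1, x - half:x + half + 1].astype(float)
        patch = gaussian_filter(patch, sigma=1.0)     # lowpass; sigma assumed
        grad = np.hypot(sobel(patch, axis=0), sobel(patch, axis=1))  # |grad I|
        feats = []
        for im in (patch, grad):                      # original + gradient
            codes = local_binary_pattern(im, P=8, R=1, method='default')
            h, w = codes.shape
            regions = [codes,                         # whole patch ...
                       codes[:h // 2, :w // 2], codes[:h // 2, w // 2:],
                       codes[h // 2:, :w // 2], codes[h // 2:, w // 2:]]
            for r in regions:                         # ... plus four quadrants
                hist, _ = np.histogram(r, bins=256, range=(0, 256))
                feats.append(hist / max(hist.sum(), 1))
        return np.concatenate(feats)                  # ten stacked histograms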

During search, the LBP histogram corresponding to each point in the 9 × 9 search region is built according to the above process. The similarity between the test-point histogram H and the template histogram H' is calculated using the chi-square statistic

χ²(H, H') = Σ_i (H_i − H'_i)² / (H_i + H'_i)   (5)

The closer the two histograms, the smaller the value of χ². The landmark point is moved to the search point whose LBP histogram is closest to the template histogram. This automatically generates the local update ∆z in (2), and the global warp update ∆p can be easily determined afterwards.

There are differences between the implementation of [23] and this work. In this paper, instead of using a circular region around the landmark point, we utilize a rectangular region. Unlike [23], we do not average histograms at any step of the process (e.g., feature extraction and search), since averaging is likely to destroy local information. The search region is also rectangular, instead of a normal profile. Lastly, we use the global warp update computed in (2) to constrain the local displacements.

Fig. 5. Cumulative distribution of the point-to-point measure (m_e) using various face alignment methods.

G. Experimental Results

The criterion for the effectiveness of automated face alignment methods is the distance between the resulting points and the manually labeled ground truth. The distance metric [14] used in this work is

m_e = (1 / (n s)) Σ_{i=1}^{n} d_i   (6)

where d_i is the Euclidean point-to-point error for each individual feature location and s is the ground-truth inter-ocular distance between the left and right eye pupils. For this work, n = 53 (out of the 68 vertices), since facial features located on the boundary of the face are ignored.
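A minimal Python sketch of the error measure in (6), assuming the aligned and ground-truth landmarks are given as n × 2 arrays:

    import numpy as np

    def alignment_error(pts, gt, pupil_left, pupil_right):
        # m_e of (6): mean Euclidean landmark error d_i, normalized by the
        # ground-truth inter-ocular distance s.
        d = np.linalg.norm(pts - gt, axis=1)    # d_i for each landmark
        s = np.linalg.norm(np.asarray(pupil_left) - np.asarray(pupil_right))
        return d.mean() / s                     # (1 / (n s)) * sum of d_i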

Fig. 5 shows the cumulative distribution of the point-to-point measure (m_e) for the various face alignment methods. It can be seen that (CLM+Gabor) and (CLM+LBP) both outperform our previous AAM implementation. The (CLM+Laplacian) variant of [17] does not perform as well as the former two. Between (CLM+Gabor) and (CLM+LBP), (CLM+LBP) reaches 1.0 much earlier than (CLM+Gabor), leading to the conclusion that (CLM+LBP) has the best performance. However, this comes at a cost: face alignment using (CLM+LBP) is considerably slower than (CLM+Gabor), owing to its exhaustive-search nature (whereas the Gabor version has the DEPredictiveIter [22] component, which predicts the best solution in closed form).

IV. RECOGNITION

A. Features

1) Sparse-Stereo Reconstruction: If the automated facial alignment methods discussed in Sec. III-B are applied to both images of the stereo pair, we obtain sparse correspondences that can be used to reconstruct the 3D positions of the facial landmarks. Our previous work in [7][8] has shown that 2D projections (onto the x-y plane) of the 3D sparse reconstructions, after frontal pose normalization (i.e., rigidly registering the unknown-pose sparse 3D points to a 3D model that is known to be frontal), can provide decent classification via a 2D version of the Procrustes distance. Figure 6 shows sparse-stereo reconstruction results for three subjects, visualized with the x-y, x-z, and y-z projections, after rigid alignment to one of the subjects. Notice that in the x-y projections, the similarity (or difference) of 2D shapes coming from the same (or different) subjects is visually enhanced. In particular, Subject 1 (probe) is more similar to Subject 1 (gallery) than to Subject 2 (gallery) in the x-y projection. This similarity (or difference) is not obvious in the other projections. This is the main reason behind the use of x-y projections as features.
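A minimal Python sketch of this matching step follows; it assumes scipy.spatial.procrustes as the 2D Procrustes distance, which normalizes translation, scale, and rotation before computing the residual disparity.

    import numpy as np
    from scipy.spatial import procrustes

    def sparse_shape_distance(probe_xy, gallery_xy):
        # 2D Procrustes disparity between x-y projected mesh vertices
        # (n x 2 arrays); a smaller value means a more similar shape.
        _, _, disparity = procrustes(gallery_xy, probe_xy)
        return disparity

    def match_subject(probe_xy, gallery):
        # gallery: dict mapping subject id -> n x 2 array; nearest shape wins.
        return min(gallery, key=lambda sid: sparse_shape_distance(probe_xy, gallery[sid]))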

Fig. 7. Cumulative match characteristic (CMC) curves for the (a) 3-meter and (b) 15-meter indoor probe sets.

Fig. 8. Cumulative match characteristic (CMC) curves of the (a) single-classifier and (b) multi-classifier architectures for the 30-meter probe set.

Fig. 6. Visualization of sparse-stereo reconstruction results.

2) Texture: Although we only perform sparse reconstruction in this work, the holistic texture from the left image, as well as the local patches around the detected landmark points, are still available. For this work, the local patches around the landmark points are well-suited to EBGM-based and LBP-based recognition using the local features discussed in Sec. III-B; the similarity measures for EBGM (4) and LBP (5) are now used for recognition between gallery and probe. For comparison purposes, the classical Principal Component Analysis (PCA) approach is used to classify the holistic texture; this serves as a benchmark for the more advanced texture methods.

B. Experimental Results

1) Experimental Setup: Our current database consists of 97 subjects, with a gallery at 3 meters and five different probe sets: 3-meter and 15-meter indoors, together with 30-meter, 50-meter, and 80-meter outdoors. Fig. 2 illustrates the captured images (left image of the stereo pair) at the different ranges. The features discussed in Sec. III-B are now used for recognition. No training is required for recognition using the sparse-stereo reconstruction, Gabor, and LBP features (i.e., the similarity distance is computed directly between the probe and each gallery instance, and the pair with the smallest distance is chosen as the match). For PCA, the face space is determined by the gallery of 97 subjects at the 3-meter range, and the similarity function used is the L2 norm.

Three quick observations can be garnered from Figs. 7 and 8(a)-10(a). Texture methods perform well indoors (3 m, 15 m), as expected. However, they drop slightly in the outdoor settings (30 m, 50 m, 80 m), due to illumination variations and the greater distances. In the outdoor settings, Gabor- and LBP-based recognition outperform the PCA benchmark. Recognition using sparse reconstructions is relatively quick to compute, and its accuracy is still acceptable, as shown by the results.

2) Multiple-Classifier Architecture: These three features are good candidates for a recognition system with a multi-classifier architecture. Fig. 11 shows the multi-classifier design, sketched in code below. In this architecture, the top n (e.g., n = 6) candidates from the sparse-reconstruction classifier are submitted to both the Gabor and LBP classifiers; the reason is that the similarity function (Procrustes) for sparse reconstructions is relatively quick to compute. The final result uses the sum rule of decision fusion [24], weighted by the rank-1 accuracy of each method at the given range. Notice that the multi-classifier approach reaches 100% for every distance range in Figs. 8(b)-10(b).
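A Python sketch of this two-stage design follows; the classifier callables, score normalization, and weight vector are illustrative assumptions (all scores are assumed to be similarities on a comparable scale, so a Procrustes distance would first be negated or otherwise normalized).

    import numpy as np

    def fuse(probe, gallery_ids, sparse_sim, gabor_sim, lbp_sim, weights, n_top=6):
        # Stage 1: rank the gallery by the cheap sparse-reconstruction
        # similarity and keep the top n candidates (n = 6 in the text).
        sparse = {g: sparse_sim(probe, g) for g in gallery_ids}
        top = sorted(sparse, key=sparse.get, reverse=True)[:n_top]
        # Stage 2: weighted sum rule [24]; weights hold each classifier's
        # rank-1 accuracy at the current range, as described in the text.
        def fused_score(g):
            scores = np.array([sparse[g], gabor_sim(probe, g), lbp_sim(probe, g)])
            return float(np.dot(weights, scores))
        return max(top, key=fused_score)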

Fig. 9. Cumulative match characteristic (CMC) curves of the (a) single-classifier and (b) multi-classifier architectures for the 50-meter probe set.

Fig. 10. Cumulative match characteristic (CMC) curves of the (a) single-classifier and (b) multi-classifier architectures for the 80-meter probe set.

Fig. 11. Schematic diagram of the multi-classifier architecture.

V. CONCLUSIONS AND FUTURE WORK

We have studied the use of texture and sparse-stereo reconstruction in the context of long-distance face recognition. Experiments using sophisticated facial feature localization methods were performed, and the same local features were later used for recognition. With our database of images, we have illustrated the effectiveness of relatively straightforward algorithms, especially when combined in a multi-classifier manner. A continuing goal of this project is to further increase the database size and include images from as far as 1000 meters under challenging imaging conditions. As more challenging scenarios are encountered, additional novel and existing approaches will be utilized.

REFERENCES

[1] S. Z. Li, B. Schouten, and M. Tistarelli, "Biometrics at a Distance: Issues, Challenges, and Prospects", in Handbook of Remote Biometrics: for Surveillance and Security (eds: M. Tistarelli, S. Z. Li, and R. Chellappa), Springer, 1st edition, 2009, pp. 3-22.

[2] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, "Face recognition: A literature survey", ACM Comput. Surv., 35(4), 2003, pp. 399-458.
[3] M. Ao, D. Yi, Z. Lei, and S. Z. Li, "Face Recognition at a Distance: System Issues", in Handbook of Remote Biometrics: for Surveillance and Security (eds: M. Tistarelli, S. Z. Li, and R. Chellappa), Springer, 1st edition, 2009, pp. 155-168.
[4] P. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and W. Worek, "Overview of the face recognition grand challenge", in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2005.
[5] Y. Yao, B. R. Abidi, N. D. Kalka, N. A. Schmid, and M. A. Abidi, "Improving long range and high magnification face recognition: Database acquisition, evaluation, and enhancement", Comput. Vis. Image Underst., 111(2), 2008, pp. 111-125.
[6] G. Medioni, J. Choi, C. H. Kuo, and D. Fidaleo, "Identifying noncooperative subjects at a distance using face images and inferred three-dimensional face models", IEEE Trans. Sys. Man Cyber. Part A, 39(1), 2009, pp. 12-24.
[7] H. Rara, S. Elhabian, A. Ali, M. Miller, T. Starr, and A. Farag, "Face recognition at-a-distance based on sparse-stereo reconstruction", in IEEE CVPR Biometrics Workshop, 2009, pp. 27-32.
[8] H. Rara, S. Elhabian, A. Ali, T. Gault, M. Miller, T. Starr, and A. Farag, "A framework for long distance face recognition using dense- and sparse-stereo reconstruction", in ISVC '09: Proc. 5th International Symposium on Advances in Visual Computing, Springer-Verlag, 2009, pp. 774-783.
[9] I. Matthews and S. Baker, "Active Appearance Models Revisited", International Journal of Computer Vision (IJCV), 60(2), 2004, pp. 135-164.
[10] R. Gross, I. Matthews, and S. Baker, "Generic vs. Person Specific Active Appearance Models", Image and Vision Computing, 23(11), 2005, pp. 1080-1093.
[11] P. Viola and M. J. Jones, "Robust real-time face detection", International Journal of Computer Vision (IJCV), 2004, pp. 151-173.
[12] M. Jones and J. Rehg, "Statistical color models with application to skin detection", International Journal of Computer Vision (IJCV), 2002, pp. 81-96.
[13] M. Castrillon-Santana, O. Deniz-Suarez, L. Anton-Canals, and J. Lorenzo-Navarro, "Face and facial feature detection evaluation: Performance evaluation of public domain Haar detectors for face and facial feature detection", in VISAPP, 2008.
[14] D. Cristinacce and T. Cootes, "Feature Detection and Tracking with Constrained Local Models", in 17th British Machine Vision Conference, 2006, pp. 929-938.
[15] U. Paquet, "Convexity and Bayesian constrained local models", in IEEE Computer Vision and Pattern Recognition, 2009, pp. 1193-1199.
[16] Y. Wang, S. Lucey, and J. Cohn, "Enforcing Convexity for Improved Alignment with Constrained Local Models", in IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
[17] Y. Wang, S. Lucey, and J. Cohn, "Non-Rigid Object Alignment with a Mismatch Template Based on Exhaustive Local Search", in IEEE Workshop on Non-rigid Registration and Tracking through Learning (NRTL), 2007.
[18] L. Wiskott, J. Fellous, and N. Kruger, "Face Recognition by Elastic Bunch Graph Matching", IEEE Transactions on Pattern Analysis and Machine Intelligence, 1997, pp. 775-779.
[19] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, "Active shape models - their training and application", Computer Vision and Image Understanding, 61, 1995.
[20] T. Ojala, M. Pietikainen, and D. Harwood, "A comparative study of texture measures with classification based on feature distributions", Pattern Recognition, 29, 1996, pp. 51-59.
[21] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns", IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 2002, pp. 971-987.
[22] D. Bolme, "Elastic Bunch Graph Matching", Master's Thesis, CSU Technical Report, 2003.
[23] X. Huang, S. Li, and Y. Wang, "Shape localization based on statistical method using extended local binary pattern", in Third International Conference on Image and Graphics (ICIG), 2004, pp. 184-187.
[24] A. Ross and A. Jain, "Information fusion in biometrics", Pattern Recognition Letters, 24(13), 2003, pp. 2115-2125.