Adaptive Active Appearance Models

Aziz Umit Batur and Monson H. Hayes, Fellow, IEEE

Abstract—The active appearance model (AAM) is a powerful tool for modeling images of deformable objects and has been successfully used in a variety of alignment, tracking, and recognition applications. AAM uses subspace-based deformable models to represent the images of a certain object class. In general, fitting such complicated models to previously unseen images using standard optimization techniques is a computationally complex task because the gradient matrix has to be numerically computed at every iteration. The critical feature of AAM is a fast convergence scheme which assumes that the gradient matrix is fixed around the optimal coefficients for all images. Our work in this paper starts with the observation that such a fixed gradient matrix inevitably specializes to a certain region in the texture space and is not a good estimate of the actual gradient as the target texture moves away from this region. Hence, we propose an adaptive AAM algorithm that linearly adapts the gradient matrix according to the composition of the target image's texture to obtain a better estimate for the actual gradient. We show that the adaptive AAM significantly outperforms the basic AAM, especially on images that are particularly challenging for the basic algorithm. In terms of speed and accuracy, the idea of a linearly adaptive gradient matrix presented in this paper provides an interesting compromise between a standard optimization technique that recomputes the gradient at every iteration and the fixed gradient matrix approach of the basic AAM.

Index Terms—Active appearance models (AAMs), appearance models, facial feature detection, model matching.

I. INTRODUCTION

The active appearance model (AAM) is a powerful appearance-based representation for images of deformable objects. AAM uses principal component analysis (PCA) based linear subspaces to model the two-dimensional (2-D) shapes and textures of the images of a certain object class. Such a representation allows AAM to describe an image with a small number of parameters. Automatically fitting such a model to a previously unseen image is the fundamental problem that has to be solved before the model can be used in various alignment and recognition tasks. In general, this is a computationally complicated procedure if a standard optimization technique is used, since the gradient matrix must be numerically recomputed at every iteration, which dominates the amount of computation. AAM avoids this complicated procedure by assuming that the gradient matrix is fixed around the optimal coefficients for all images. A fixed gradient matrix is found by estimating the gradient matrices for

Manuscript received March 19, 2003; revised November 4, 2004. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Michel Schmitt. A. U. Batur is with Texas Instruments, Dallas, TX 75251 USA (e-mail: [email protected]). M. H. Hayes is with the School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0250 USA (e-mail: [email protected]). Digital Object Identifier 10.1109/TIP.2005.854473

a set of training images around their optimal parameters and averaging the results.

AAM was originally introduced by Edwards et al. in [1], and was later expanded by Cootes et al. in [2] and [3]. Since its introduction, AAM has found many applications in a variety of areas where one desires to align, track, or interpret images of deformable objects, and many variations of the basic AAM algorithm have been developed. See [4] for an experimental comparison of some important AAM variations. After the introduction of the original algorithm, Edwards et al. introduced an extension to the basic algorithm to handle color images, and also provided an enhanced search algorithm that is more robust against occlusions [5]. In [6], Cootes et al. showed that multiple AAMs can be used to model human faces from any viewpoint, and that these models can be used to track faces and estimate head poses. Cootes et al. also showed that prior information, such as initial estimates of the locations of the eyes, can be used to constrain the AAM search to obtain more reliable results [7]. In [8], Cootes et al. demonstrated that using a linearly transformed representation of the edge structure at each pixel instead of the pixel values provides more accurate matching of the model to the target image, and, in [9], they proposed subsampling techniques to speed up AAM convergence at the expense of losing some accuracy.

Baker et al. have proposed an inverse compositional approach to match the AAM to the target image, instead of the additive approach used in the basic AAM algorithm [10]. They consider an appearance model where the shape and texture coefficients are not combined into appearance coefficients, and show that their approach provides increased efficiency during the matching procedure. Their formulation of the matching algorithm requires the warp functions to satisfy certain properties that are not satisfied by the warps used in AAM; therefore, they use first-order approximations of the inversion and composition operators.

AAMs have been extensively applied to medical image processing. Mitchell et al. developed a multistage hybrid AAM to automatically segment the left and right cardiac ventricles in magnetic resonance (MR) images [11]. Their hybrid approach combines AAMs with active shape models (ASMs) to achieve more robustness against the risk of being trapped in local minima. Bosch et al. extended AAMs to active appearance motion models (AAMMs), which enhance AAMs by including time-dependent information [12]. Their goal is to automatically segment echocardiographic image sequences in a time-continuous manner. Therefore, they consider the whole image sequence as a single shape-intensity pattern, and construct a single AAM that can describe both the appearance of the heart at a certain time and its dynamics throughout the cardiac cycle. This provides a time-continuous segmentation,



and eliminates the need for constructing different AAMs to segment the heart at different phases of the cardiac cycle. They also propose a nonlinear intensity normalization technique to deal with the non-Gaussian distribution of the intensity values in ultrasound images, which is shown to provide a significant improvement in performance. Mitchell et al. have also proposed a three-dimensional (3-D) AAM to segment volumetric cardiac MR images and echocardiographic image sequences [13]. This 3-D extension of AAM provides a successful segmentation of 3-D images in a spatially and temporally consistent manner.

Motivated by the AAM algorithm, Hou et al. developed direct appearance models (DAMs), which use the texture to directly predict the shape during the iterations of the parameter updates [14]. This approach no longer combines the shape and texture parameters into appearance parameters as AAM does. Li et al. later extended this approach to multiview face alignment by training multiple models for different poses of the human face [15]. In relation to this work, Yan et al. developed texture-constrained ASMs (TC-ASMs), where the shape update predicted by an ASM is combined with the shape constraint provided by a global texture model like the one in AAM [16]. They show that such an approach performs better than ASM or AAM alone. Stegmann and Larsen propose to generalize the concept of texture in AAM to include any measurement over the target image, selected according to the particular class that is being modeled [17]. For the specific case of human faces, they propose a representation that includes the intensity value, the hue, and the edge strength, and show that an AAM that uses this representation provides better results than an AAM that works only on gray-scale intensities. In [18], Stegmann et al. propose a few extensions to AAM that include an enhancement to the shape modeling, an initialization scheme to make the system fully automated, and a simulated annealing approach to fine-tune the AAM parameters after the basic algorithm has converged. In [19], Edwards et al. use AAMs for face recognition; in [20], Ahlberg uses AAMs to track faces and to automatically extract MPEG-4 facial animation parameters; and, in [21], Stegmann demonstrates that AAMs can be used for general object tracking.

In this paper, we propose an adaptive AAM in which we abandon the fixed gradient matrix approach of the basic AAM and replace it with a linearly adaptive matrix that is updated according to the composition of the target texture [22]. Our approach starts with the observation that a fixed gradient matrix inevitably specializes to a certain region of the texture space and does not work well as the target image's texture moves away from this region. In general, the gradient matrix depends on both the shape and the texture of the target image. Since the gradient is computed in a normalized frame where the shape is normalized to the mean shape, one fixed matrix is a good estimate for the gradient matrices at different shapes. However, the same desirable property does not hold for different textures, since textures are not normalized like shapes in the normalized frame. Therefore, the fixed gradient matrix specializes for a certain region of the texture space around the mean texture, and it does not work well as the target texture moves away from


this region. Our solution to this problem is based on finding the contributions of the texture eigenvectors of AAM to the gradient matrix. Once we have these contributions, we can update the fixed gradient matrix according to the strength of each texture eigenvector in the composition of the target texture. One problem with this strategy, however, is that the composition of the target texture is unknown during the iteration, since the goal of the algorithm itself is to find the shape and texture of the target image. We solve this problem by realizing that the AAM texture parameters provide a current estimate of the target texture's composition, and this estimate improves as the iterative algorithm progresses. We use this information to find the linear adaptation coefficients, and the success of our adaptation increases with every iteration.

We performed a large number of experiments to demonstrate that this new adaptive approach outperforms the basic AAM algorithm. The performance difference is especially large in challenging images that have large illumination changes. The linearly adaptive gradient matrix approach presented in this paper is a useful compromise between a standard optimization technique that recomputes the gradient matrix at every iteration and the fixed gradient matrix approach of the basic AAM.

In Section II, we first review the basic AAM algorithm as presented in [3]. For more details, please refer to [2], [3], and [23]. In Section III, we describe the limitations of a fixed gradient matrix. Then, in Section IV, we introduce our adaptive approach, and, finally, in Section V, we present our experimental results.

II. REVIEW OF THE BASIC AAM

Assume that we have a collection of training images for a certain object class in which the locations of feature points have been determined manually or in a semiautomatic manner. The 2-D coordinates of the feature points in each image define a shape vector, $\mathbf{X}$, which is called the shape in the image frame. Each of these shape vectors is normalized to a common reference frame to obtain $\mathbf{x}$, which is called the shape in the normalized frame. This normalization typically consists of a translation, rotation, and scaling. A shape model can now be obtained by applying PCA to the vectors $\mathbf{x}$:

$$\mathbf{x} = \bar{\mathbf{x}} + \mathbf{P}_s \mathbf{b}_s$$

where $\mathbf{x}$ is the synthesized shape in the normalized frame, $\bar{\mathbf{x}}$ is the mean shape in the normalized frame, $\mathbf{P}_s$ is a matrix that has the shape eigenvectors as its columns, and $\mathbf{b}_s$ is a $k_s \times 1$ vector of shape coefficients, with $k_s$ being the number of shape coefficients. Given a synthesized shape $\mathbf{x}$ in the normalized frame, a synthesized shape $\mathbf{X}$ in the image frame can be obtained by applying a transformation to $\mathbf{x}$:

$$\mathbf{X} = T_{\mathbf{t}}(\mathbf{x}) \tag{1}$$

where $T_{\mathbf{t}}$ is a transformation that involves scaling, rotation, and translation, and $\mathbf{t}$ is a $4 \times 1$ vector that contains the transformation parameters for $T_{\mathbf{t}}$. We define $\mathbf{t}$ as $[s, \theta, t_x, t_y]^T$, where $s$ is the scale, $\theta$ is the rotation angle, and $t_x$ and $t_y$ are the shifts in the $x$ and $y$ directions, respectively.
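For concreteness, the following Python sketch shows how such a PCA model could be built from the normalized shape vectors (the texture model described later in this section is built the same way); the array layout, names, and variance cutoff are illustrative assumptions, not the implementation used in this paper.

```python
import numpy as np

def build_pca_model(data, variance_kept=0.98):
    # data: (n_images x d) array; each row is one normalized shape vector
    # (or, for the texture model, one normalized texture vector).
    mean = data.mean(axis=0)
    X = data - mean
    # Eigenvectors of the sample covariance via SVD of the centered data.
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    var = s**2 / X.shape[0]                      # per-mode variance
    k = int(np.searchsorted(np.cumsum(var) / var.sum(), variance_kept)) + 1
    return mean, Vt[:k].T                        # mean and eigenvector columns

# Synthesis: x = x_bar + P_s @ b_s; encoding: b_s = P_s.T @ (x - x_bar).
```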


After obtaining the shape model, all of the training images are warped to the mean shape and scanned into a vector to obtain

$$\mathbf{g}_{im} = W(I) \tag{2}$$

where $\mathbf{g}_{im}$ is the image in the normalized frame, $I$ is one of the training images, and $W$ represents the warping operation from the image frame to the normalized frame followed by the scanning of the warped image into a vector. The next step is to normalize $\mathbf{g}_{im}$ to obtain $\mathbf{g}$, which we call the normalized texture. This normalization typically consists of subtracting the mean pixel value and dividing by the pixel standard deviation so that each normalized texture has zero mean and unit variance. A texture model can now be obtained by applying PCA to the normalized textures

$$\mathbf{g} = \bar{\mathbf{g}} + \mathbf{P}_g \mathbf{b}_g \tag{3}$$

where $\mathbf{g}$ is the synthesized texture, $\bar{\mathbf{g}}$ is the mean texture, $\mathbf{P}_g$ is a matrix that contains the texture eigenvectors as its columns, and $\mathbf{b}_g$ is a vector of texture coefficients, with $k_g$ being the number of texture coefficients. Given a synthesized texture $\mathbf{g}$, a synthesized image in the normalized frame $\mathbf{g}_{im}$ can be obtained by applying a transformation to $\mathbf{g}$:

$$\mathbf{g}_{im} = T_{\mathbf{u}}(\mathbf{g}) = (1 + u_1)\,\mathbf{g} + u_2 \mathbf{1} \tag{4}$$

where $T_{\mathbf{u}}$ is a transformation that involves scaling and dc shift, $\mathbf{u} = [u_1, u_2]^T$ provides the transformation parameters for $T_{\mathbf{u}}$, and $\mathbf{1}$ represents a vector of ones. Here, $1 + u_1$ is the scale and $u_2$ is the dc shift. Note that $\mathbf{u} = \mathbf{0}$ gives the identity transformation. After obtaining the shape and texture models, a combined appearance model is obtained by applying PCA to the shape and texture coefficients

$$\mathbf{b} = \begin{bmatrix} \mathbf{W}_s \mathbf{b}_s \\ \mathbf{b}_g \end{bmatrix} \tag{5}$$

where $\mathbf{b}$ is the combined vector of shape and texture coefficients, and $\mathbf{W}_s$ is a diagonal scaling matrix to correct for the difference in the units of the shape and texture coefficients. A simple choice for the diagonal entries of $\mathbf{W}_s$ is the square root of the ratio of the total intensity variation to the total shape variation [14]. As a result of PCA, we now have an appearance model

$$\mathbf{b} = \mathbf{Q}\mathbf{c} \tag{6}$$

where $\mathbf{b}$ is the vector of synthesized shape and texture coefficients, $\mathbf{Q}$ is a matrix that has the appearance eigenvectors as its columns, and $\mathbf{c}$ is a $k_a \times 1$ vector of appearance coefficients, with $k_a$ being the number of appearance coefficients. Note that we do not use a mean vector in this model since the means of the shape and texture coefficients are close to zero. Because of linearity, the shape and texture vectors can be represented in terms of the appearance coefficients as

$$\mathbf{x} = \bar{\mathbf{x}} + \mathbf{P}_s \mathbf{W}_s^{-1} \mathbf{Q}_s \mathbf{c} \tag{7}$$

$$\mathbf{g} = \bar{\mathbf{g}} + \mathbf{P}_g \mathbf{Q}_g \mathbf{c} \tag{8}$$

where $\mathbf{Q}_s$ and $\mathbf{Q}_g$ are given as

$$\mathbf{Q}_s = \begin{bmatrix} \mathbf{I} & \mathbf{0} \end{bmatrix} \mathbf{Q} \tag{9}$$

$$\mathbf{Q}_g = \begin{bmatrix} \mathbf{0} & \mathbf{I} \end{bmatrix} \mathbf{Q} \tag{10}$$

where $\mathbf{0}$ is a matrix of zeros and $\mathbf{I}$ is an identity matrix. Using all of the coefficients that have been introduced above, an image can now be represented by a vector $\mathbf{p}$ that is given as

$$\mathbf{p} = \begin{bmatrix} \mathbf{c}^T & \mathbf{t}^T & \mathbf{u}^T \end{bmatrix}^T \tag{11}$$

By changing $\mathbf{p}$, we can synthesize a variety of different images that belong to the object class for which we have constructed the model. Mathematically, using (2)–(4), the synthesized image, $\hat{I}$, can be represented as follows:

$$\hat{I} = W^{-1}\!\left(T_{\mathbf{u}}\!\left(\bar{\mathbf{g}} + \mathbf{P}_g \mathbf{Q}_g \mathbf{c}\right)\right) \tag{12}$$

where $W^{-1}$ represents the warping operation from the normalized frame to the image frame. To use the AAM, we need a way to fit the model automatically to a previously unseen image. In other words, we need an automatic way to find the vector $\mathbf{p}$ that optimally reconstructs a given target image $I$. In [3], Cootes et al. propose a fast and robust iterative scheme to find $\mathbf{p}$. The method starts with an initial value for $\mathbf{p}$, and converges to the optimal value by minimizing the difference between the target image and the image synthesized by the model. This difference is computed in the normalized frame. During an iteration, using the current synthesized shape $\mathbf{X}$, the target image is warped into the normalized frame to obtain the normalized texture

$$\mathbf{g}_s = T_{\mathbf{u}}^{-1}\!\left(W(I)\right) \tag{13}$$

The difference between this normalized texture and the current synthesized texture

$$\mathbf{r}(\mathbf{p}) = \mathbf{g}_s - \mathbf{g}_m \tag{14}$$

is a measure of how close $\mathbf{p}$ is to the optimal value, where $\mathbf{g}_m = \bar{\mathbf{g}} + \mathbf{P}_g \mathbf{Q}_g \mathbf{c}$ is the texture currently synthesized by the model. In [3], this difference is minimized by minimizing

$$E(\mathbf{p}) = \|\mathbf{r}(\mathbf{p})\|^2$$

A first-order Taylor expansion of $\mathbf{r}(\mathbf{p} + \delta\mathbf{p})$ is given as

$$\mathbf{r}(\mathbf{p} + \delta\mathbf{p}) \approx \mathbf{r}(\mathbf{p}) + \frac{\partial \mathbf{r}}{\partial \mathbf{p}}\,\delta\mathbf{p}$$

where the $(i,j)$th element of the gradient matrix $\partial\mathbf{r}/\partial\mathbf{p}$ is $\partial r_i / \partial p_j$. To minimize $E$, $\delta\mathbf{p}$ should be chosen as the least squares solution of

$$\frac{\partial \mathbf{r}}{\partial \mathbf{p}}\,\delta\mathbf{p} = -\mathbf{r}(\mathbf{p}) \tag{15}$$


Solving this linear system of equations for $\delta\mathbf{p}$ provides the update direction for the current iteration. Ideally, the gradient matrix that appears in (15) should be recomputed at every iteration. Since this is a complicated procedure, AAM assumes that this matrix is fixed. To find a fixed gradient matrix, $\partial\mathbf{r}/\partial\mathbf{p}$ is first estimated numerically at the optimal $\mathbf{p}$ for each training image, and then all the estimates are averaged to obtain one fixed matrix. To estimate $\partial\mathbf{r}/\partial\mathbf{p}$ numerically for a given training image, each parameter inside $\mathbf{p}$ is systematically displaced from its optimal value by a certain amount, $\delta p_j$, to obtain $\mathbf{p}_d$, and $\mathbf{r}(\mathbf{p}_d)$ is computed. Then, $\left(\mathbf{r}(\mathbf{p}_d) - \mathbf{r}(\mathbf{p})\right)/\delta p_j$ gives an estimate of the partial derivative of $\mathbf{r}$ with respect to the displaced parameter. Such estimates are computed at a few such displacements for each parameter, and the results are averaged. During an iteration, given the current difference vector, $\mathbf{r}$, the update direction is computed by solving (15). Since the gradient matrix is fixed, its pseudoinverse can be precomputed for computational efficiency. Then, the update direction is obtained by the following matrix multiplication:

$$\delta\mathbf{p} = -\mathbf{R}\,\mathbf{r} \tag{16}$$

where $\mathbf{R}$ is the precomputed pseudoinverse of the fixed gradient matrix.

Once the update direction is found, $\mathbf{p}$ is updated to $\mathbf{p} + \delta\mathbf{p}$, and a new error is computed. If the error has decreased, the update is accepted. Otherwise, smaller updates in the same direction are tried, such as $0.5\,\delta\mathbf{p}$, $0.25\,\delta\mathbf{p}$, etc. If all of them fail to improve the error, convergence is declared. In practice, a multiresolution implementation is used so that the system converges for larger parameter displacements [3].
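For concreteness, the following Python sketch shows one iteration of this fixed-gradient scheme; `residual` is a hypothetical callback that warps the target image with parameters `p` and returns $\mathbf{r}(\mathbf{p})$, and the particular step scales are illustrative.

```python
import numpy as np

def basic_aam_step(p, R, residual, step_scales=(1.0, 0.5, 0.25, 0.125)):
    # R is the precomputed pseudoinverse of the fixed gradient matrix.
    r = residual(p)
    err = r @ r
    dp = -R @ r                        # update direction, as in (16)
    for k in step_scales:              # full step first, then smaller steps
        p_new = p + k * dp
        r_new = residual(p_new)
        if r_new @ r_new < err:        # accept the first improving step
            return p_new, False        # not yet converged
    return p, True                     # nothing improved: declare convergence
```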

III. LIMITATIONS OF THE FIXED GRADIENT MATRIX

In this paper, we propose a way of linearly adapting the gradient matrix to the target image. To demonstrate the need for such a strategy, let us first make a few observations about the fixed gradient matrix of the basic AAM. Using (14), we can write

$$\frac{\partial \mathbf{r}}{\partial \mathbf{p}} = \frac{\partial \mathbf{g}_s}{\partial \mathbf{p}} - \frac{\partial \mathbf{g}_m}{\partial \mathbf{p}} \tag{17}$$

Note that $\partial\mathbf{g}_m/\partial\mathbf{p}$ is constant, and is equal to $\begin{bmatrix}\mathbf{P}_g\mathbf{Q}_g & \mathbf{0}\end{bmatrix}$, where $\mathbf{0}$ is an $N \times 6$ matrix of zeros, with $N$ being the number of pixels in the normalized frame. Therefore, the only variable component in the gradient matrix is $\partial\mathbf{g}_s/\partial\mathbf{p}$, which is the derivative of the normalized texture of the target image with respect to $\mathbf{p}$. In the estimation of a fixed gradient matrix, this derivative is numerically estimated around the optimal $\mathbf{p}$ for each training image, and the results are averaged.

Now, consider a scenario where we have a set of training images with the same shape and with different normalized textures. In the case of human faces, such images could be obtained by taking images of a person with a fixed pose and facial expression under varying illumination. Estimating the gradient matrix using these training images would be equivalent to estimating the gradient matrix using the average of these images. In other words, instead of training with different images, we could average the training images beforehand, and then compute the gradient matrix for the average image. Clearly, this is not a desirable situation, because it means that adding new training images under different illuminations would not necessarily teach AAM how to converge for the new illumination conditions, since most of the changes in those images would be cancelled out during the averaging process.

Obviously, a perfect cancellation would not occur if we estimate the gradient matrix using images that have different shapes as well as textures. This is because derivatives of different textures would now be evaluated at different shapes. However, even in this case, we can expect the cancellation to occur to a great extent, since the derivatives at different shapes are likely to be similar to each other, as the derivatives are computed in the normalized frame. In fact, this similarity is one of the reasons why AAM successfully uses a fixed gradient matrix to converge for many different shapes.

In summary, although AAM has a powerful texture subspace that can generate very different textures, the fixed gradient matrix specializes for a certain region of this texture space that lies around the average texture of the training images, and it does not work well as the target texture moves away from this region. The average texture of the training images is probably very close to the mean texture $\bar{\mathbf{g}}$ that is used in AAM's texture model. Note that the global texture normalization stage of AAM (13) is intelligently designed to normalize the target texture so that the normalized texture becomes as close as possible to $\bar{\mathbf{g}}$. This gives the fixed gradient matrix the best chance to predict the correct update direction. The texture normalization schemes that have been described in [3] or [2] are, therefore, critical for the successful operation of AAM.

IV. ADAPTIVE GRADIENT MATRIX

We now propose a way to adapt the gradient matrix to the target texture so that it has a wider range over which it successfully predicts the correct update direction. We perform this adaptation by adding linear modes of change to the gradient matrix. To find these modes of change, we first derive an expression for the gradient matrix as a function of the texture eigenvectors of AAM. The critical idea in this derivation is to use the fact that AAM can synthesize an image that is very close to the target image. In other words, AAM can represent the target image as a function of its internal parameters and subspaces. So, substituting this expression of the synthesized image into the gradient matrix calculation allows us to represent the gradient matrix as a function of AAM parameters and subspaces. Then, we can identify how the gradient matrix depends on the texture eigenvectors that exist in the composition of the target texture. Once we identify these dependencies, we can change their strengths in the gradient matrix during convergence according to the strengths of the texture eigenvectors in the composition of the current target image. This helps us to obtain a better estimate for the gradient matrix of that particular target image.

Let us start our derivation by considering a specific target image $I$, and let us assume that we know the correct shape


for this image. We can now warp $I$ into the normalized frame and determine its optimal AAM parameters. In the following derivation, we will use a subscript of "$o$" to denote that the value of a specific parameter is optimal for this particular image. Once the optimal parameters have been determined, AAM can synthesize an image that is very close to the original image. Using (12), the synthesized image, $\hat{I}$, can be expressed as

$$\hat{I} = W_o^{-1}\!\left(T_{\mathbf{u}_o}\!\left(\bar{\mathbf{g}} + \mathbf{P}_g \mathbf{b}_{g,o}\right)\right) \tag{18}$$

where $W_o^{-1}$ represents the warping operation from the normalized frame to the image frame using the optimal synthesized shape, $\mathbf{b}_{g,o}$ is the vector of optimal texture coefficients, and $\mathbf{u}_o$ are the optimal texture transformation parameters. Clearly, AAM's synthesis is not perfect, and the original and synthesized images are related as

$$I = \hat{I} + \mathbf{e} \tag{19}$$

where $\mathbf{e}$ is a small error vector. Now, let us consider the estimation of the gradient matrix of $\mathbf{r}$ at $\mathbf{p}_o$

$$\left.\frac{\partial \mathbf{r}}{\partial \mathbf{p}}\right|_{\mathbf{p}_o} = \left.\frac{\partial \mathbf{g}_s}{\partial \mathbf{p}}\right|_{\mathbf{p}_o} - \frac{\partial \mathbf{g}_m}{\partial \mathbf{p}} \tag{20}$$

Recall that $\partial\mathbf{g}_m/\partial\mathbf{p}$ is constant, so the variable part of the gradient depends only on $\mathbf{g}_s$. We now substitute the expression for $\mathbf{g}_s$ from (13) into (20) to obtain

$$\left.\frac{\partial \mathbf{r}}{\partial \mathbf{p}}\right|_{\mathbf{p}_o} = \left.\frac{\partial}{\partial \mathbf{p}}\, T_{\mathbf{u}}^{-1}\!\left(W(I)\right)\right|_{\mathbf{p}_o} - \frac{\partial \mathbf{g}_m}{\partial \mathbf{p}} \tag{21}$$

By substituting (18) and (19) into (21), and after some straightforward algebraic manipulations, we arrive at the following expression for the gradient matrix:

$$\left.\frac{\partial \mathbf{r}}{\partial \mathbf{p}}\right|_{\mathbf{p}_o} = \mathbf{F}(\bar{\mathbf{g}}) + \sum_{i=1}^{k_g} b_{o,i}\,\mathbf{F}(\mathbf{g}_i) + \mathbf{F}(\mathbf{e}') + \begin{bmatrix}\mathbf{0} & \cdots & -\mathbf{1} & \cdots & \mathbf{0}\end{bmatrix} - \frac{\partial \mathbf{g}_m}{\partial \mathbf{p}} \tag{22}$$

where $\mathbf{g}_i$ is the $i$th texture eigenvector, $b_{o,i}$ is the $i$th optimal texture coefficient, $\mathbf{e}'$ is the reconstruction error carried into the normalized frame, and we have replaced the derivative operator $\left.\partial\!\left(T_{\mathbf{u}}^{-1}(W(\cdot))\right)/\partial\mathbf{p}\right|_{\mathbf{p}_o}$ with $\mathbf{F}(\cdot)$ for simplicity of notation.

Let us now make a few comments about the significance of each term on the right-hand side of (22). The last term is constant, as we have mentioned previously, and it helps AAM to adjust its texture coefficients. The fourth term is a matrix of zeros where only one column is nonzero. This column corresponds to $u_2$, and it is a fixed negative vector $-\mathbf{1}$. It helps AAM to adjust the dc shift coefficient $u_2$. The first three terms are the contributions of different components of the target image's texture to the gradient matrix. The third term is the contribution of the reconstruction error, and is probably quite small. The first term is the contribution of the mean texture, and the second term contains the contributions of the texture eigenvectors, each scaled by the corresponding optimal texture coefficient for the target image. To simplify the notation for the rest of the paper, let us rewrite (22) as follows:

$$\left.\frac{\partial \mathbf{r}}{\partial \mathbf{p}}\right|_{\mathbf{p}_o} = \mathbf{G}_0 + \sum_{i=1}^{k_g} b_{o,i}\,\mathbf{G}_i + \mathbf{C} \tag{23}$$

where $\mathbf{G}_0 = \mathbf{F}(\bar{\mathbf{g}})$, $\mathbf{G}_i = \mathbf{F}(\mathbf{g}_i)$ for $i = 1, \ldots, k_g$, and $\mathbf{C}$ includes the last three terms on the right-hand side of (22).

The basic AAM estimates a fixed gradient matrix by averaging $\left.\partial\mathbf{r}/\partial\mathbf{p}\right|_{\mathbf{p}_o}$ over the set of training images. Let $E[\cdot]$ denote the averaging operation over the training set. Then, the fixed gradient matrix of the basic AAM, $\mathbf{G}_{fix}$, is given as

$$\mathbf{G}_{fix} = E\!\left[\left.\frac{\partial \mathbf{r}}{\partial \mathbf{p}}\right|_{\mathbf{p}_o}\right] \tag{24}$$

Substituting (23) into (24), we obtain

$$\mathbf{G}_{fix} = E[\mathbf{G}_0] + \sum_{i=1}^{k_g} E\!\left[b_{o,i}\,\mathbf{G}_i\right] + E[\mathbf{C}] \tag{25}$$

Now, let us replace $\mathbf{G}_i$ with its average over the training images, $\bar{\mathbf{G}}_i$. Such an approximation is justified since the only averaging operation for these terms is performed over different target shapes, and the derivatives are computed in the normalized frame where the shapes have been normalized. After this step, we obtain the following expression for the gradient matrix:

$$\mathbf{G}_{fix} = \bar{\mathbf{G}}_0 + \sum_{i=1}^{k_g} \bar{b}_{o,i}\,\bar{\mathbf{G}}_i + \bar{\mathbf{C}} \tag{26}$$

where $\bar{b}_{o,i}$ stands for the average of the $i$th optimal texture coefficient over the training images. Equation (26) shows two different types of averaging that take place during the estimation of $\mathbf{G}_{fix}$. The first is the averaging of the derivatives at different shapes. Inside the terms $\bar{\mathbf{G}}_0$ and $\bar{\mathbf{G}}_i$, the derivatives of the mean texture and the texture eigenvectors are computed at different shapes, and the results are averaged. This shape averaging is a good approximation, and the shape-averaged derivatives are good estimates of the derivatives at different shapes. The critical idea here is that the derivatives are computed in the normalized frame, where all of the target images are warped into the mean shape, which essentially normalizes the shape information. The second averaging operation in (26)


occurs over different target textures. This is performed by averaging the texture coefficients $b_{o,i}$ in the second term of (26). Unfortunately, texture averaging is not as easy to justify as the shape averaging, since the normalized frame does not normalize the textures in the way it normalizes the shapes. The texture averaging results in specializing the gradient matrix to a certain region around the mean texture in the texture space, and the gradient does not work well as the target texture moves away from this region. Note that the terms $\bar{b}_{o,i}$ are likely to be very close to zero. In fact, they would be exactly zero if the averaging were performed over all of the images that were used during PCA. However, the averages here are being computed over the set of training images that are used during the estimation of the gradient matrix, which is usually a subset of the training images used for PCA. Therefore, the averages are not exactly zero, but quite close to zero. Hence, the gradient matrix given in (26) looks very similar to the gradient matrix of the mean texture.

To address this texture averaging problem, we propose to compute the terms $\bar{\mathbf{G}}_i$. These are the modes of change of the gradient matrix according to the texture eigenvectors. Once we have these modes, we can adapt the gradient matrix to a specific target image by replacing the average texture coefficients given in (26) with the optimal coefficients of the current target image. According to this approach, we define an adaptive gradient matrix as

$$\mathbf{G}_{ad} = \bar{\mathbf{G}}_0 + \sum_{i=1}^{k_g} b_{o,i}\,\bar{\mathbf{G}}_i + \bar{\mathbf{C}} \tag{27}$$

where $b_{o,i}$ are the optimal texture coefficients for the target image that AAM is trying to converge to. However, there is one problem with this strategy: we do not know these optimal texture coefficients. In fact, obtaining these coefficients is one of the goals of the convergence algorithm. But, at each step of the iteration, we have an estimate for them, and the quality of our estimate increases as the iteration progresses. Hence, we can replace the optimal values in (27) with the current estimates that we have. As the system converges, we obtain better estimates for the texture coefficients, and these provide us with increasingly better estimates for the gradient matrix. So, the definition of the adaptive gradient matrix becomes

$$\mathbf{G}_{ad} = \bar{\mathbf{G}}_0 + \sum_{i=1}^{k_g} b_i\,\bar{\mathbf{G}}_i + \bar{\mathbf{C}} \tag{28}$$

where $b_i$ are the current texture coefficients during an iteration of AAM. Combining (26) and (28), the adaptive gradient matrix and the fixed gradient matrix are related as

$$\mathbf{G}_{ad} = \mathbf{G}_{fix} + \sum_{i=1}^{k_g} \Delta b_i\,\bar{\mathbf{G}}_i \tag{29}$$

where $\Delta b_i = b_i - \bar{b}_{o,i}$. So, our approach essentially updates the gradient matrix of the basic AAM to make it a better approximation for the current image. We estimate the update terms, $\bar{\mathbf{G}}_i$, during the training procedure of the basic AAM, and we add them to $\mathbf{G}_{fix}$

during convergence according to (29) to obtain $\mathbf{G}_{ad}$. Estimation of $\bar{\mathbf{G}}_0$ and $\bar{\mathbf{G}}_i$ is performed as follows. For each training image, we systematically displace each parameter of $\mathbf{p}$ from its optimal value to estimate

$$\mathbf{G}_0 = \mathbf{F}(\bar{\mathbf{g}}) \quad\text{and}\quad \mathbf{G}_i = \mathbf{F}(\mathbf{g}_i) \;\;\text{for } i = 1, \ldots, k_g \tag{30}$$

Then, we average all such estimates from the training images to obtain $\bar{\mathbf{G}}_0$ and $\bar{\mathbf{G}}_i$ for $i = 1, \ldots, k_g$. Note that, since we know the exact shape of each training image, during training we used the exact shape instead of the synthesized shape to obtain a slight increase in accuracy.

In AAM's texture model, the texture eigenvectors are ordered according to their eigenvalues, so the importance of the texture coefficients decreases quickly as the index increases. Therefore, updating $\bar{\mathbf{G}}_i$ for high-indexed texture coefficients may not help much, and this can be avoided to decrease computational complexity. A similar argument can be made about whether or not to update all of the columns of $\mathbf{G}_{fix}$. The importance of the appearance coefficients decreases as the index increases, and one may choose not to update the columns of $\mathbf{G}_{fix}$ that correspond to the high-indexed appearance coefficients in order to decrease computational complexity.

Note that since $\mathbf{G}_{ad}$ is not fixed, it is not possible to precompute its pseudoinverse. So, to find the update direction for $\mathbf{p}$, we must now find the least squares solution to $\mathbf{G}_{ad}\,\delta\mathbf{p} = -\mathbf{r}$. We find $\delta\mathbf{p}$ by solving the following linear system of equations:

$$\mathbf{G}_{ad}^T\,\mathbf{G}_{ad}\,\delta\mathbf{p} = -\mathbf{G}_{ad}^T\,\mathbf{r} \tag{31}$$
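For concreteness, a minimal sketch of the finite-difference estimation used for these gradient matrices; `sample_residual` is a hypothetical callback returning $\mathbf{r}(\mathbf{p})$ for a synthesized target whose texture is the mean texture or one texture eigenvector, and a full implementation would average over several displacement sizes and all training images, as described above.

```python
import numpy as np

def estimate_gradient(p_o, sample_residual, deltas):
    # One finite-difference estimate of the gradient matrix at p_o.
    # deltas: one displacement per AAM parameter.
    r0 = sample_residual(p_o)
    G = np.zeros((r0.size, p_o.size))
    for j, d in enumerate(deltas):
        p_d = p_o.copy()
        p_d[j] += d                      # displace the jth parameter
        G[:, j] = (sample_residual(p_d) - r0) / d
    return G
```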

$\mathbf{G}_{ad}$ is a very thin and tall matrix, where the number of rows is equal to the number of pixels and the number of columns is equal to the number of AAM parameters. Hence, the dominant computation in (31) is the matrix product $\mathbf{G}_{ad}^T\mathbf{G}_{ad}$, which produces a small matrix but performs long vector inner products to compute each element of that small matrix. Computational efficiency can be significantly increased if these inner products are computed offline during the training stage so that the same computations are not performed at every iteration. Using (29), $\mathbf{G}_{ad}^T\mathbf{G}_{ad}$ can be expressed as follows:

$$\mathbf{G}_{ad}^T\mathbf{G}_{ad} = \mathbf{G}_{fix}^T\mathbf{G}_{fix} + \sum_{i=1}^{k_g} \Delta b_i \left(\mathbf{G}_{fix}^T\bar{\mathbf{G}}_i + \bar{\mathbf{G}}_i^T\mathbf{G}_{fix}\right) + \sum_{i=1}^{k_g}\sum_{j=1}^{k_g} \Delta b_i\,\Delta b_j\,\bar{\mathbf{G}}_i^T\bar{\mathbf{G}}_j \tag{32}$$

where $\Delta b_i = b_i - \bar{b}_{o,i}$. The terms $\mathbf{G}_{fix}^T\mathbf{G}_{fix}$, $\mathbf{G}_{fix}^T\bar{\mathbf{G}}_i$, $\bar{\mathbf{G}}_i^T\mathbf{G}_{fix}$, and $\bar{\mathbf{G}}_i^T\bar{\mathbf{G}}_j$ can be precomputed. Hence, $\mathbf{G}_{ad}^T\mathbf{G}_{ad}$ can be obtained by a simple weighted summation. The number of multiplies and additions this weighted summation requires depends on $n_p$, the number of AAM parameters, $n_a$, the number of adaptive modes that are used to update the gradient matrix, and $n_c$, the number of columns of the gradient matrix that are updated by the adaptive modes. Note that the computational complexity here is not a function of the number of pixels, which is usually a large number. After $\mathbf{G}_{ad}^T\mathbf{G}_{ad}$ is computed, we solve (31), which requires on the order of $N n_p$ multiplications and additions to form the right-hand side, and the solution of a small linear system of equations, where $N$ is the number of pixels.

We can summarize the adaptive AAM algorithm as follows.

1. Using the current AAM parameters, warp the target image into the normalized frame to obtain the normalized texture $\mathbf{g}_s$.
2. Compute the synthesized texture in the normalized frame, $\mathbf{g}_m$.
3. Compute the error $\mathbf{r} = \mathbf{g}_s - \mathbf{g}_m$.
4. Compute the texture coefficients using (5) and (6).
5. Compute the adaptive gradient matrix $\mathbf{G}_{ad}$ using (29).
6. Solve (31) to find the update, $\delta\mathbf{p}$. Use (32) to compute $\mathbf{G}_{ad}^T\mathbf{G}_{ad}$ efficiently.
7. Update the AAM parameters from $\mathbf{p}$ to $\mathbf{p} + \delta\mathbf{p}$ and compute the new error $\mathbf{r}'$.
8. If $\|\mathbf{r}'\| < \|\mathbf{r}\|$, accept the update. Otherwise, try smaller updates in the same direction, such as $0.5\,\delta\mathbf{p}$, $0.25\,\delta\mathbf{p}$, etc. If none of the updates are accepted, declare convergence.
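For concreteness, the following sketch assembles the adaptive update of (29)–(32); `G_fix`, the list `G_bar` of adaptive modes, and `b_bar` (the average training coefficients) are assumed to come from training, all names are illustrative, and the truncation of high-indexed modes and columns is omitted.

```python
import numpy as np

def precompute(G_fix, G_bar):
    # Offline inner products used by the weighted summation in (32).
    A00 = G_fix.T @ G_fix
    A0i = [G_fix.T @ Gi for Gi in G_bar]
    Aij = [[Gi.T @ Gj for Gj in G_bar] for Gi in G_bar]
    return A00, A0i, Aij

def adaptive_step(r, b, b_bar, G_fix, G_bar, pre):
    A00, A0i, Aij = pre
    db = b - b_bar                            # adaptation weights from (29)
    M = A00.copy()                            # build G_ad^T G_ad via (32)
    for i in range(len(G_bar)):
        M += db[i] * (A0i[i] + A0i[i].T)
        for j in range(len(G_bar)):
            M += db[i] * db[j] * Aij[i][j]
    # Right-hand side -G_ad^T r; this is the only pixel-sized product.
    rhs = -(G_fix.T @ r + sum(db[i] * (G_bar[i].T @ r)
                              for i in range(len(G_bar))))
    return np.linalg.solve(M, rhs)            # update direction from (31)
```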

Fig. 1. Feature locations: 63 points on the face were manually located to define the shape of the AAM.

Fig. 2. Example images from each subset of the testing images from the Yale Face Database B. (a) Subset 1: Illumination. (b) Subset 2: Illumination. (c) Subset 3: Illumination. (d) Subset 4: Illumination.

V. EXPERIMENTS

To test the performance of the adaptive AAM and to compare it to that of the basic AAM, we performed a large number of experiments using 450 images from the Yale Face Database B and 1944 images from the CMU PIE face database [24], [25]. The image set we used from the Yale Face Database B contains images of ten people with a frontal head pose, captured under varying illumination. The images from the CMU PIE Database contain near-frontal faces of 66 individuals with changing facial expressions, captured under varying illumination. We divided the image set from the CMU Database into two large groups. The first group contained the images of the first 30 people and was used to train the AAMs. The second group contained the images of the remaining 36 people and was used to test the AAMs.

We used 360 images from the training group to construct an AAM for human faces. We trained the AAM for three resolution levels. We manually selected the locations of 63 features on each face to define the shape model of AAM. An example image with the manually located features is shown in Fig. 1. We constructed a 30-dimensional shape subspace to represent 98% of the variation in shapes. Then, we warped all images into the mean shape using triangulation. At the highest resolution, the normalized frame had around 11000 pixels.

Fig. 3. Example images from each subset of the testing images from the CMU PIE Database. (a) Subset 1: Illumination. (b) Subset 2: Illumination. (c) Subset 3: Illumination. (d) Subset 4: Facial Expression.

Using the normalized textures, we constructed a 150-dimensional texture subspace to represent 97% of the variation in textures. Finally, we constructed a 130-dimensional appearance subspace to represent 99% of the total variation in the shape and texture coefficients.

To estimate $\mathbf{G}_{fix}$, we used a 48-image subset of the training images. When we were choosing this subset, we made sure that we included images of different people with different facial expressions and illuminations. As we have described previously, we estimated $\mathbf{G}_{fix}$ by numerically estimating it for each image separately and then averaging the results. During the numerical


Fig. 4. Histograms of the shape and texture errors for each subset of the testing images from the Yale Face Database B. (a) Subset 1. (b) Subset 2. (c) Subset 3. (d) Subset 4.

estimation of the derivatives, we computed eight displacements for each parameter, with the maximum displacement being one half of the standard deviation of that parameter. During the estimation of $\mathbf{G}_{fix}$, we also estimated the update matrices $\bar{\mathbf{G}}_i$.

Numerical estimation of these update matrices is very similar to the estimation of $\mathbf{G}_{fix}$. For each training image and for each texture eigenvector, we synthesized a target image whose shape was equal to the shape of the training image, and whose texture


TABLE I MEAN, MEDIAN, AND MINIMUM SHAPE AND TEXTURE ERRORS FOR EACH SUBSET OF THE TESTING IMAGES FROM THE YALE FACE DATABASE B

was equal to the texture eigenvector. Then, we numerically estimated the expression given in (30). Hence, we obtained one matrix for each texture eigenvector and for each training image. We averaged the matrices that belong to the same texture eigenvector to obtain the final estimates for the $\bar{\mathbf{G}}_i$ matrices. To decrease the computational complexity, we estimated these update matrices only for the first 20 texture eigenvectors. We also did not calculate the updates for the columns of the gradient matrix that correspond to the appearance coefficients with indexes higher than 40. During all of these estimations, we excluded the background pixels from our calculations so that the model does not learn the background.

We implemented the complete AAM in MATLAB. In addition, we implemented the iterative procedure of AAM as a C++ MATLAB MEX file to increase the speed, but we did not perform a rigorous optimization to get the best possible speed. We ran experiments with two different systems: the first is the basic AAM, which uses the fixed gradient matrix iterations, and the second is the adaptive AAM, which uses the adaptive gradient matrix iterations. For both systems, we allowed at most 40 iterations at each resolution level. On average, the first system took 1.4 s, and the second system took 5.1 s, to converge for one image. Note that since the adaptive iterations are fully compatible with the iterations of the basic AAM, one can use a mixture of fixed and adaptive iterations to achieve the best performance-complexity tradeoff for the target application. Figs. 2 and 3 show some sample test images from the Yale and CMU face databases used in our experiments.

A. Experiments on the Yale Face Database B

We first tested the AAMs using the image set from the Yale Face Database B. For each image, we ran the AAM algorithm 16 different times, each time starting from a different initial point by perturbing the scale, rotating the face, and displacing the model in the horizontal and vertical directions. After each convergence, in order to measure how successfully the shape and texture subspaces of AAM converged, we recorded two kinds of information. For the shapes, we computed the mean distance of the shape points to their manually determined ground truth locations, and we divided this average distance by the face width to normalize it for different face sizes. We call this the shape error. For the textures, we computed the mean texture error per pixel in the image frame. We call this the texture error.
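For concreteness, a minimal sketch of the two error measures just defined; the face-width computation from the ground-truth landmark extent and the use of a mean absolute difference for the texture error are illustrative assumptions about details not spelled out above.

```python
import numpy as np

def shape_error(pts, pts_true):
    # Mean landmark-to-ground-truth distance, normalized by face width.
    # pts, pts_true: (k x 2) arrays of feature point coordinates.
    d = np.linalg.norm(pts - pts_true, axis=1).mean()
    face_width = pts_true[:, 0].max() - pts_true[:, 0].min()  # assumed proxy
    return d / face_width

def texture_error(g_fit, g_target):
    # Mean texture error per pixel in the image frame (absolute difference
    # assumed; an RMS error would be an equally plausible reading).
    return np.abs(g_fit - g_target).mean()
```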

Fig. 5. Median shape errors for each subset of the testing images from the Yale Face Database B.

Fig. 6. Median texture errors for each subset of the testing images from the Yale Face Database B.

In Fig. 4, we show the histograms of the shape and texture errors for the basic and adaptive AAMs. The mean, median, and minimum values of these shape and texture error distributions are given in Table I, and the median values are plotted in Figs. 5 and 6. To see how the performance of AAM changes with illumination, we present our results by dividing the testing images into four subsets, which contain 70, 120, 120, and 140 images, respectively. Fig. 2 shows example images from each subset. As we go from subset 1 to subset 4, the lighting angle becomes larger; hence, shadows become more significant, and convergence becomes more difficult.

Figs. 5 and 6 reveal that the adaptive AAM outperforms the basic AAM in all of these subsets in terms of both the shape and texture errors. We can also see from Fig. 5 that the performance difference between the basic and adaptive AAMs increases as we go from subset 1 to subset 4. In other words, adaptive AAM performs much better than the


Fig. 8. Average shape errors of adaptive AAM on the images selected from the Yale Face Database B for varying number of adaptive modes.

Fig. 7. Example results. Column 1: Original image. Column 2: Basic AAM. Column 3: Adaptive AAM.

basic AAM if there are large illumination changes in the testing images. This result meets our expectations, because changing illumination moves the target image far away from the mean face, and this is where the modes of change we add to $\mathbf{G}_{fix}$ make the most significant contribution. Note that when we were training the basic AAM, we carefully added training images from different illumination conditions. Nevertheless, the fixed matrix did not successfully learn how to converge for these images, because of the averaging process we have explained. Adaptive AAM, on the other hand, learned the linear modes of change induced in the gradient matrix by the texture eigenmodes, and updated the gradient matrix according to those modes so that it could predict the correct update direction in a much larger region of the texture space.

In Fig. 7, we show example results for the final faces synthesized by the two algorithms when the iterations have stopped. The first column contains the original images, and the second and third columns contain the final faces synthesized by the basic and adaptive AAMs, respectively. We can clearly see that adaptive AAM outperforms the basic AAM, especially for faces that have atypical textures as a result of illumination variations.

We performed another experiment on the Yale Face Database B, this time to see how the number of adaptive modes used in the adaptive AAM affects the performance. We systematically

Fig. 9. Average texture errors of adaptive AAM on the images selected from the Yale Face Database B for varying number of adaptive modes.

changed the number of adaptive modes, and for each case, we ran the adaptive AAM once on the image set selected from the Yale Face Database B. We averaged the results for each experiment to find the average shape and texture errors. We ran experiments with 0, 5, 10, 15, and 20 adaptive modes, and plotted the results in Figs. 8 and 9. We observe a behavior that looks like an exponential decay, where the gain obtained by adding a new adaptive mode decreases as the number of modes increases. This is a very typical behavior when a scheme based on PCA is used. It means that the user can choose to use only a small number of adaptive modes and still get a significant performance improvement.

B. Experiments on the CMU Database

We also performed experiments on the testing image group we selected from the CMU PIE Database. We again divided the images into four subsets, which contain 504, 504, 504, and 72 images, respectively. Fig. 3 shows example images


Fig. 10. Histograms of the shape and texture errors for each subset of the testing images from the CMU PIE Database. (a) Subset 1. (b) Subset 2. (c) Subset 3. (d) Subset 4.

from each subset. The first three subsets contain face images captured under varying illuminations with a frontal head pose. As we go from subset 1 to subset 3, the lighting angle becomes larger, and shadows become more significant. The fourth subset

contains frontal face images with varying facial expressions. Unlike the other three subsets, many faces in this subset have glasses. Note that these testing images are quite challenging for AAM search because of the varying illuminations, facial


TABLE II MEAN, MEDIAN, AND MINIMUM SHAPE AND TEXTURE ERRORS FOR EACH SUBSET OF THE TESTING IMAGES FROM THE CMU PIE DATABASE

Fig. 12. Median texture errors for each subset of the testing images from the CMU PIE Database.

Fig. 11. Median shape errors for each subset of the testing images from the CMU PIE Database.

expressions, glasses, facial hair, etc. In Fig. 10, we show the histograms of the shape and texture errors for the basic and adaptive AAMs. The mean, median, and minimum values of these shape and texture error distributions are given in Table II. The median values, which are plotted in Figs. 11 and 12, reveal that the adaptive AAM outperforms the basic AAM in all four subsets. The results on the first three subsets again show that the performance difference between the basic and adaptive AAMs increases if there are large illumination changes in the testing images.

C. Forced Iterations

In [4], Cootes et al. propose forced iterations to increase the performance of the basic AAM. Forced iterations are iterations where the system accepts an update even though the error increases after the update. Cootes et al. suggest that a few such iterations help AAM escape local minima [4]. We performed similar experiments on the testing images from the CMU PIE Database to evaluate the impact of forced iterations on the adaptive AAM and to compare the performance increase to that of the basic AAM. We let both the basic and

adaptive AAMs have five forced iterations at the highest resolution. The histograms of the shape and texture errors in this case are given in Fig. 13. The mean, median, and minimum values of the errors are shown in Table III. The plots of the median values presented in Figs. 14 and 15 reveal that using forced iterations increases the performance of both the basic and adaptive AAMs, and that the adaptive AAM still outperforms the basic AAM.

An interesting observation we made during this experiment reveals further insight into the difference between the basic and adaptive AAMs. To demonstrate this observation, we show in Table IV the percentage of the runs for which the final shape error was larger than 0.30. Such an unusually large error means that the system has completely diverged in those cases as a result of forced iterations. The percentages shown in Table IV reveal that there are quite a few cases where the basic AAM has diverged, while there are no such cases for the adaptive AAM. This difference is probably due to the fact that the update direction estimate of the adaptive AAM is more accurate than that of the basic AAM; hence, forced iterations are more likely to move the adaptive AAM in the correct direction. Therefore, the adaptive AAM appears to be less likely to diverge as a result of forced iterations than the basic AAM.

One disadvantage of forced iterations is that they increase the convergence time by increasing the total number of iterations. A way of decreasing the impact of this disadvantage is to use forced iterations only when they are needed. In most cases, AAM converges successfully, and forced iterations are not useful. These cases can be identified by choosing an empirical threshold for the texture error, and forced iterations can be used only when the final texture error is larger than this threshold. Such an approach avoids unnecessary forced iterations, and it is what we have done in the above experiments: we chose an empirical threshold for the texture error in the normalized frame, and used forced iterations only if AAM had converged with a texture error larger than this threshold. The adjustment of this threshold value, however, is not critical. Using a lower threshold,


Fig. 13. Histograms of the shape and texture errors for each subset of the testing images from the CMU PIE Database (with five forced iterations). (a) Subset 1. (b) Subset 2. (c) Subset 3. (d) Subset 4.

or abandoning the threshold mechanism completely would not change the results much, but would just increase the convergence time by adding a few more forced iterations.
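For concreteness, a minimal sketch of how forced iterations might be arranged around such a threshold test; `run_aam`, `aam_step`, and the control flow are illustrative assumptions rather than the exact procedure used in our experiments.

```python
# Illustrative arrangement of forced iterations (assumed, not verbatim).
# `run_aam(p)` is a hypothetical routine that runs the ordinary search to
# convergence and returns (p, texture_error); `aam_step(p)` performs one
# iteration and returns (p_new, texture_error) even if the error increased.
def search_with_forced_iterations(p, run_aam, aam_step,
                                  err_threshold, n_forced=5):
    p, err = run_aam(p)                  # ordinary search first
    if err <= err_threshold:
        return p, err                    # good fit: skip forced iterations
    for _ in range(n_forced):            # accept updates despite rising error
        p, err = aam_step(p)
    return run_aam(p)                    # then resume the ordinary search
```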

VI. CONCLUSION

We proposed an adaptive AAM that linearly adapts the gradient matrix to the target image's texture during convergence.


TABLE III MEAN, MEDIAN, AND MINIMUM SHAPE AND TEXTURE ERRORS FOR EACH SUBSET OF THE TESTING IMAGES FROM THE CMU PIE DATABASE (WITH FIVE FORCED ITERATIONS)

Fig. 14. Median shape errors for each subset of the testing images from the CMU PIE Database (with five forced iterations).

TABLE IV PERCENTAGE OF AAM RUNS FOR WHICH THE SYSTEM HAS COMPLETELY DIVERGED WITH FIVE FORCED ITERATIONS FOR EACH SUBSET OF THE TESTING IMAGES FROM THE CMU PIE DATABASE (THE FINAL SHAPE ERROR WAS LARGER THAN 0.30)

Such an approach was motivated by the observation that the fixed gradient matrix of the basic AAM was a good estimate of the actual gradient matrices at different shapes, but not an equally good estimate of the actual gradient matrices at different textures. This was due to the fact that textures are not normalized like shapes in the normalized frame. Hence, we proposed to obtain a better estimate of the actual gradient matrix by computing the contributions of the texture eigenvectors to the gradient and by adding them to the fixed gradient matrix according to the composition of the current target image. A critical idea in this adaptation was the realization that the current texture coefficients of AAM provide an estimate of the composition of the target texture, and that this estimate can be used to find the adaptation coefficients.

We experimentally showed that the iterations with the adaptive gradient matrix are more successful than the basic AAM iterations. We also showed that forced iterations are less likely to make the adaptive AAM diverge than the basic AAM, probably because the update direction estimate of the adaptive gradient is more accurate. Adaptive AAM provides a significant performance increase over the fixed gradient matrix approach, at the expense of an increase in computational complexity. Therefore, it would be useful for applications where the user can spend a few more seconds to obtain a more accurate result. In general, the performance increase provided by the adaptive gradient matrix is much larger when the target texture has a large range of variability. Remember that the fixed gradient matrix specializes around the mean texture; if the mean texture is not a good representation of the target texture variability, the performance difference between the fixed and adaptive approaches grows. Therefore, adaptive AAM would be most useful in applications where the target texture can go through significant variations. In this paper, we specifically concentrated on the problem of modeling human faces, and our results demonstrated that texture variations due to illumination, in particular, can be handled effectively with the adaptive approach. This result would naturally extend to the illumination variations of any deformable object modeled by AAM.

Fig. 15. Median texture errors for each subset of the testing images from the CMU PIE Database (with five forced iterations).

REFERENCES

[1] G. Edwards, C. J. Taylor, and T. F. Cootes, "Interpreting face images using active appearance models," in Proc. IEEE Int. Conf. Automatic Face and Gesture Recognition, 1998, pp. 300–305.
[2] T. F. Cootes, G. J. Edwards, C. J. Taylor, H. Burkhardt, and B. Neuman, "Active appearance models," in Proc. Eur. Conf. Computer Vision, vol. 2, 1998, pp. 484–498.
[3] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 681–685, Jun. 2001.
[4] T. F. Cootes and P. Kittipanya-ngam, "Comparing variations on the active appearance model algorithm," in Proc. Brit. Machine Vision Conf., vol. 2, 2002, pp. 837–846.
[5] G. J. Edwards, T. F. Cootes, and C. J. Taylor, "Advances in active appearance models," in Proc. IEEE Conf. Computer Vision, vol. 1, 1999, pp. 137–142.


[6] T. F. Cootes, K. Walker, and C. J. Taylor, "View-based active appearance models," in Proc. IEEE Conf. Automatic Face and Gesture Recognition, 2000, pp. 227–232.
[7] T. F. Cootes and C. J. Taylor, "Constrained active appearance models," in Proc. IEEE Int. Conf. Computer Vision, vol. 1, 2001, pp. 748–754.
[8] ——, "On representing edge structure for model matching," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, 2001, pp. 114–119.
[9] T. F. Cootes, G. Edwards, and C. J. Taylor, "A comparative evaluation of active appearance model algorithms," in Proc. Brit. Machine Vision Conf., vol. 2, 1998, pp. 680–689.
[10] S. Baker and I. Matthews, "Equivalence and efficiency of image alignment algorithms," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, 2001, pp. 1090–1097.
[11] S. C. Mitchell, B. P. F. Lelieveldt, R. J. Geest, J. G. Bosch, J. H. C. Reiber, and M. Sonka, "Multistage hybrid active appearance model matching: Segmentation of left and right ventricles in cardiac MR images," IEEE Trans. Med. Imag., vol. 20, no. 5, pp. 415–423, May 2001.
[12] J. G. Bosch, S. C. Mitchell, B. P. F. Lelieveldt, F. Nijland, O. Kamp, M. Sonka, and J. H. C. Reiber, "Automatic segmentation of echocardiographic sequences by active appearance motion models," IEEE Trans. Med. Imag., vol. 21, no. 11, pp. 1374–1383, Nov. 2002.
[13] S. C. Mitchell, J. G. Bosch, B. P. F. Lelieveldt, R. J. Geest, J. H. C. Reiber, and M. Sonka, "3-D active appearance models: Segmentation of cardiac MR and ultrasound images," IEEE Trans. Med. Imag., vol. 21, no. 9, pp. 1167–1178, Sep. 2002.
[14] X. Hou, S. Z. Li, H. Zhang, and Q. Cheng, "Direct appearance models," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, 2001, pp. 828–833.
[15] S. Z. Li, Y. S. Cheng, H. J. Zhang, and Q. S. Cheng, "Multi-view face alignment using direct appearance models," in Proc. IEEE Conf. Automatic Face and Gesture Recognition, 2002, pp. 309–314.
[16] S. Yan, C. Liu, S. Z. Li, H. Zhang, H. Y. Shum, and Q. Cheng, "Face alignment using texture-constrained active shape models," Image Vis. Comput., vol. 21, no. 1, pp. 69–75, Jan. 2003.
[17] M. B. Stegmann and R. Larsen, "Multi-band modeling of appearance," Image Vis. Comput., vol. 21, no. 1, pp. 61–67, Jan. 2003.
[18] M. B. Stegmann, R. Fisker, and B. K. Ersbøll, "Extending and applying active appearance models for automated, high precision segmentation in different image modalities," in Proc. Scandinavian Conf. Image Analysis, 2001, pp. 90–97.
[19] G. J. Edwards, T. F. Cootes, C. J. Taylor, H. Burkhardt, and B. Neumann, "Face recognition using active appearance models," in Proc. Eur. Conf. Computer Vision, vol. 2, 1998, pp. 581–595.
[20] J. Ahlberg, "Using the active appearance algorithm for face and facial feature tracking," in Proc. IEEE Int. Conf. Computer Vision Workshop on Recognition, Analysis, and Tracking of Faces and Gestures in Real-Time Systems, 2001, pp. 68–72.
[21] M. B. Stegmann, "Object tracking using active appearance models," in Proc. Danish Conf. Pattern Recognition and Image Analysis, vol. 1, 2001, pp. 54–60.
[22] A. U. Batur and M. H. Hayes, "A novel convergence scheme for active appearance models," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, vol. 1, 2003, pp. 359–366.


[23] T. F. Cootes and C. J. Taylor. (1998) Statistical models of appearance for computer vision. [Online]. Available: http://www.isbe.man.ac.uk/bim/Models/app_model.ps.gz
[24] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman, "From few to many: Illumination cone models for face recognition under variable lighting and pose," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 6, pp. 643–660, Jun. 2001.
[25] T. Sim, S. Baker, and M. Bsat, "The CMU pose, illumination, and expression (PIE) database," in Proc. IEEE Conf. Automatic Face and Gesture Recognition, 2002, pp. 53–58.

Aziz Umit Batur received the B.S. degree in electrical engineering from Bilkent University, Turkey, in 1998, and the M.S. and Ph.D. degrees in electrical engineering from the Georgia Institute of Technology (Georgia Tech), Atlanta, in 2000 and 2003, respectively. After his graduation, he joined the DSP R&D Center of Texas Instruments, Dallas, TX. In 2003, he received the Graduate Research Assistant Excellence Award from the School of Electrical and Computer Engineering, Georgia Tech, and the Outstanding Research Award from the Center for Signal and Image Processing, Georgia Tech. His research interests include image processing and computer vision.

Monson H. Hayes (F’92) received the D.Sc. degree in electrical engineering from the Massachusetts Institute of Technology, Cambridge, in 1981. He joined the faculty at the Georgia Institute of Technology (Georgia Tech), Atlanta, where he is currently a Professor of electrical and computer engineering. He has published over 100 papers and is the author of two textbooks. His research interests are in the areas of face recognition, multimedia signal processing, image and video processing, adaptive signal processing, and DSP education. Dr. Hayes was a recipient of the Presidential Young Investigator Award and the recipient of the IEEE Senior Award. He has been a member of the DSP Technical Committee (1984–1989) and its Chairman (1995–1997). He was Associate Editor for the IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING (1984–1988), Secretary-Treasurer of the ASSP Publications Board (1986–1988), Member of the ASSP Administrative Committee (1987–1989), General Chairman of the 1988 DSP Workshop, Member of the SP Society Standing Committee on Constitution and Bylaws (1988–1994), Chairman of the ASSP Publications Board (1988–1994), Member of the Technical Directions Committee (1992–1994), and General Chairman of ICASSP 1996. Currently, he is Associate Editor for the IEEE TRANSACTIONS ON EDUCATION, a member of the Signal Processing Magazine Editorial Board, a member of the MMSP Technical Committee, and General Chair for ICIP 2006 in Atlanta.