Face Recognition using Sparse Representations and Manifold Learning

Grigorios Tsagkatakis¹ and Andreas Savakis²

¹ Center for Imaging Science, ² Department of Computer Engineering
Rochester Institute of Technology, NY 14623

Abstract. Manifold learning is a novel approach to non-linear dimensionality reduction that has shown great potential in numerous applications and has gained ground compared to linear techniques. In addition, sparse representations have recently been applied to computer vision problems with success, demonstrating promising results with respect to robustness in challenging scenarios. A key concept shared by both approaches is the notion of sparsity. In this paper we investigate how the framework of sparse representations can be applied at various stages of manifold learning. We explore the use of sparse representations in two major components of manifold learning: construction of the weight matrix and classification of test data. In addition, we investigate the benefits offered by introducing a weighting scheme into the sparse representations framework via the weighted LASSO algorithm. The underlying manifold learning approach is based on the recently proposed spectral regression framework, which offers significant benefits compared to previously proposed manifold learning techniques. We present experimental results for these techniques on three challenging face recognition datasets.

Keywords: Face recognition, manifold learning, sparse representations.

1 Introduction

Dimensionality reduction is an important initial step when extracting high level information from images and video. An inherent assumption is that data from a high dimensional space can be reliably modeled using low dimensional structures. Traditional techniques such as Principal Components Analysis (PCA) and Linear Discriminant Analysis (LDA) use linear subspaces as the low dimensional structures for signal modeling [12]. However, certain limitations of linear subspace methods suggest the investigation of non-linear dimensionality reduction techniques. Manifolds are non-linear structures that have gained considerable attention in recent years. In manifold learning, the objective is to identify the intrinsic dimensionality of the data and project it into a low dimensional space that can reliably capture important characteristics while preserving various geometric properties, such as the geodesic distances or the local neighborhood structure. In general, we may think of a manifold as a low dimensional surface embedded in an ambient higher dimensional space. Manifolds arise when there is a smooth variation of some key parameters that define the system's degrees of freedom; these parameters are usually far fewer in number than the dimension of the space where the signal is initially described. In other words, the parameters that characterize the generation of the manifold are sparse with respect to the support set of the signal [1]. Manifold learning has been successfully applied to various computer vision problems such as face recognition [14], activity recognition [18], pedestrian detection [19] and structure-from-motion [20], among others.

Sparse representations (SRs) are another type of signal representation that has received considerable attention in the past few years. SRs were initially applied to signal reconstruction, where it was shown that signals such as images and audio can be naturally represented in fixed bases using a small number of coefficients [15], [16]. As far as computer vision applications are concerned, SRs have been very successful in extracting useful information based on the assumption that, although individual images are high dimensional by nature, collections of images, e.g. human faces in various poses, can be reliably represented by a small number of training examples [11].

From the above discussion, it becomes evident that manifold learning and sparse representations share similar motivation in the context of computer vision, although they differ in terms of signal reconstruction. It is therefore natural to ask whether these two approaches can be combined in order to obtain better results. The contribution of this paper lies in the investigation of the connections between sparse representations and manifold learning in terms of face recognition accuracy. More specifically, we examine the application of the sparse representation framework both for the generation of the weight matrix and for the subsequent classification. Although these two approaches have been explored independently in [11] and [9], they have not been examined in combination. In addition, we examine a weighting scheme that introduces a distance-based parameter into the sparse representation framework. The sparse representation framework is evaluated using the spectral regression framework, which offers better modeling capabilities for manifold learning compared to linearized techniques [9].

The rest of the paper is organized as follows. The concepts of manifold learning and sparse representations are presented in Sections 2 and 3, respectively. The notion of sparsity and how it is applied to the weight matrix construction is explored in Section 4. Experimental results and discussion are provided in Section 5. The paper is concluded in Section 6.

2 Manifold Learning

Manifold learning techniques can be divided into two categories: non-linear and linear approximations. Non-linear dimensionality reduction techniques, including Isometric Feature Mapping (Isomap) [3], Local Linear Embedding (LLE) [4] and the graph Laplacian Eigenmap (LE) [5], reduce the dimensionality of the data while preserving various geometric properties such as the geodesic distances or the local neighborhood structure. Most algorithms of this nature represent the training data through a distance matrix (based on the adjacency graph) and achieve dimensionality reduction by applying eigen-analysis on this distance matrix. A critical issue related to non-linear embedding methods such as Isomap, LLE and LE is the out-of-sample extension problem: the lack of a straightforward extension of the mapping to new (testing) data limits the applicability of these methods.

More recent methods try to overcome the out-of-sample problem by constructing a linear approximation of the manifold. The benefits of linear approximations are mostly expressed in terms of savings in computational time, although in some cases very promising results have been reported. Methods in this category include locality preserving projections (LPP) [7] and neighborhood preserving embedding (NPE) [8]. A major limitation of these methods is the requirement for solving a large scale eigen-decomposition, which makes them impractical for high dimensional data modeling. In the spectral regression (SR) framework [23], Cai et al. proposed an efficient approach that tackles this problem by recasting the learning of the projection function into a regression framework.

Formally, given a data set $X = [x_1, \dots, x_n]$, $x_i \in \mathbb{R}^m$, the first step of SR is the construction of a weight matrix $W \in \mathbb{R}^{n \times n}$, where each data sample is connected (given a non-zero weight) to another data sample if it is one of its nearest neighbors or lies within the surrounding $\varepsilon$-ball. In the second step, the diagonal degree matrix $D$, with $D_{ii} = \sum_j W_{ij}$, and the Laplacian $L = D - W$ are calculated. The optimal low dimensional projection $y$ of the high dimensional training data points is given by the maximization of the following eigen-problem:

$$ y^{*} = \arg\max_{y} \frac{y^{T} W y}{y^{T} D y}, \quad \text{i.e.} \quad W y = \lambda D y . \qquad (1) $$

Once the low dimensional representation $y$ of the input data is found, new testing data are embedded by identifying a linear function $a$ such that

$$ y_i = f(x_i) = a^{T} x_i , \qquad (2) $$

where the linear projection function $a$ is a matrix whose columns are the eigenvectors obtained by solving the following eigen-problem:

$$ X W X^{T} a = \lambda X D X^{T} a . \qquad (3) $$

In SR, however, this last step, which is the most expensive, is replaced by the solution of a regression problem given by

$$ a^{*} = \arg\min_{a} \sum_{i=1}^{n} \left( a^{T} x_i - y_i \right)^{2} + \alpha \, \| a \|^{2} . \qquad (4) $$

In SR the optimal solution to Eq. (4) is given by the regularized estimator

$$ a^{*} = \left( X X^{T} + \alpha I \right)^{-1} X y . \qquad (5) $$

Once the embedding function has been learned, classification of new test points is usually performed using the k-nearest neighbors (kNN) classifier. In addition to the desirable general properties of the kNN classifier, such as the guarantees on its error rate [2], kNN has been widely adopted by the manifold learning community for two main reasons. First, it is closely related to the local linearity assumption used in the generation of the manifold embedding and, second, kNN is an instance-based classifier that does not require training and can thus be used in unsupervised settings. Furthermore, approximate kNN schemes have been proposed [13] that can deal with large datasets with moderate requirements in terms of classification speed and memory. Despite these benefits, kNN presents a number of drawbacks that may compromise classification accuracy. The major limitation stems from the choice of k, which is typically selected via cross-validation. Nevertheless, even if a globally optimal value for k is found, the lack of an adaptive neighborhood selection mechanism may result in a poor representation of the neighborhood structure.
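To make the pipeline above concrete, the following Python sketch assembles the main steps of spectral regression: a k-nearest neighbor weight matrix, the generalized eigen-problem of Eq. (1), the regression of Eqs. (4)-(5) via ridge regression, and kNN classification in the embedded space. It is a minimal illustration under our own assumptions (binary edge weights, scikit-learn utilities, the parameter values shown), not the authors' implementation.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.neighbors import kneighbors_graph
from sklearn.linear_model import Ridge

def spectral_regression_embedding(X, n_neighbors=2, n_dims=10, alpha=0.01):
    """Sketch of the spectral regression pipeline of Section 2.

    X: (n_samples, n_features) training data.
    Returns a projection matrix A so that a sample x embeds as x @ A.
    """
    # Step 1: symmetric k-NN weight matrix W (binary weights for simplicity).
    W = kneighbors_graph(X, n_neighbors=n_neighbors, mode='connectivity')
    W = 0.5 * (W + W.T).toarray()
    # Step 2: degree matrix D and the generalized eigen-problem W y = lambda D y (Eq. 1).
    D = np.diag(W.sum(axis=1))
    _, eigvecs = eigh(W, D)                     # eigenvalues returned in ascending order
    Y = eigvecs[:, -n_dims:]                    # keep eigenvectors with the largest eigenvalues
    # Step 3: ridge regression instead of the dense eigen-decomposition of Eq. (3),
    # i.e. the regularized least squares problem of Eqs. (4)-(5).
    A = np.column_stack([
        Ridge(alpha=alpha, fit_intercept=False).fit(X, Y[:, k]).coef_
        for k in range(n_dims)
    ])
    return A

# Classification of new points in the embedded space, e.g. with a 1-NN classifier:
# A = spectral_regression_embedding(X_train)
# clf = sklearn.neighbors.KNeighborsClassifier(n_neighbors=1).fit(X_train @ A, y_train)
# predictions = clf.predict(X_test @ A)
```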

3 Sparse Representations

Sparse representations (SRs) were recently investigated in the context of signal processing and reconstruction in the emerging field of compressed sensing [15], [16]. Formally, assume a signal $x \in \mathbb{R}^m$ is known to be sparsely represented in a dictionary $A \in \mathbb{R}^{m \times n}$, i.e. $x = A\alpha$, where $\alpha \in \mathbb{R}^n$ and $\|\alpha\|_0 \ll n$. The goal is to identify the non-zero elements of $\alpha$ that participate in the representation of $x$. An example could be the case where $x$ represents the face of a particular individual and $A$ is the dictionary containing the representations of all faces in the database, as in [11]. The individual may be identified by locating the elements of the dictionary that are used for reconstruction, or equivalently the non-zero elements of $\alpha$. The solution is obtained by solving the following optimization problem:

$$ \alpha^{*} = \arg\min_{\alpha} \|\alpha\|_0 \quad \text{subject to} \quad x = A\alpha . \qquad (6) $$

This problem is NP-hard and therefore difficult to solve in practice. In the pioneering work of Donoho [16] and Candès [15], it was shown that if the solution satisfies certain constraints, such as the sparsity of the representation, the solution of the problem in Eq. (6) is equivalent to the solution of the following problem, known as LASSO in statistics:

$$ \alpha^{*} = \arg\min_{\alpha} \|\alpha\|_1 \quad \text{subject to} \quad x = A\alpha . \qquad (7) $$

This sparse representations (SRs) approach has recently been applied to various computer vision problems including face recognition [10] and image classification [17]. When noise is present in the signal, a perfect reconstruction using Eq. (7) may not be feasible. Therefore, we require that the reconstruction is within an error constant $\varepsilon$, and Eq. (7) is reformulated as

$$ \alpha^{*} = \arg\min_{\alpha} \|\alpha\|_1 \quad \text{subject to} \quad \|x - A\alpha\|_2 \leq \varepsilon . \qquad (8) $$

The LASSO algorithm can be efficiently applied for solving the problem in Eq. (8), but it assumes that all elements are equally weighted. This approach, called first order sparsity, only deals with the question of how to sparsely represent a signal given a dictionary. However, other sources of information could also be used in order to increase the performance of LASSO or adjust it towards more desirable solutions. One such case is the weighted LASSO [25], in which each coefficient is weighted differently according to its desired contribution. Formally, the optimization of the weighted LASSO is similar to Eq. (8) and is given by

$$ \alpha^{*} = \arg\min_{\alpha} \sum_{j=1}^{n} w_j |\alpha_j| \quad \text{subject to} \quad \|x - A\alpha\|_2 \leq \varepsilon , \qquad (9) $$

where $w = [w_1, \dots, w_n]$ is a vector of weights. In this paper, we examine the effects of the solution of Eq. (9) when $w_j$ corresponds to the distance between the sample $x$ and the individual dictionary element $a_j$.

Regarding the use of SRs for classification, the notion of sparsity was proposed in [11] for face recognition based on the well known assumption that the training images of an individual's face span a linear subspace [12]. Although this approach achieved excellent results, it required that the class labels be available during classification. In an unsupervised setting, the class label information is not available and therefore an alternative approach has to be applied. In this paper we use the coefficient with the maximum value as the indicator of the training sample that is most similar to the test example, analogous to a nearest neighbor approach. Formally, the class of an unknown sample $x$ is given by the class of the dictionary element $a_{j^{*}}$ associated with the largest coefficient found in Eq. (9), i.e. $j^{*} = \arg\max_{j} \alpha_j$.
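To illustrate how Eqs. (8)-(9) and the maximum-coefficient rule can be combined for classification, the sketch below solves a penalized LASSO (a common surrogate for the constrained formulations above) with scikit-learn and returns the label of the dictionary column with the largest coefficient. The way the weights are folded into the dictionary columns, the function name and the regularization value are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def src_predict(x, A, labels, weights=None, lam=0.01):
    """Sketch of (weighted) sparse representation classification.

    x:       test sample, shape (m,)
    A:       dictionary with one training sample per column, shape (m, n)
    labels:  class label of each dictionary column, shape (n,)
    weights: optional per-coefficient weights w_j (e.g. distances from x to the
             training samples) for the weighted LASSO of Eq. (9); None reduces
             to the unweighted case of Eq. (8).
    """
    if weights is None:
        weights = np.ones(A.shape[1])
    # Penalized surrogate: min ||x - A a||^2 + lam * sum_j w_j |a_j|.
    # Dividing column a_j by w_j and rescaling the coefficients afterwards is
    # equivalent to penalizing w_j |a_j| instead of |a_j|.
    A_scaled = A / weights
    lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
    lasso.fit(A_scaled, x)
    coeffs = lasso.coef_ / weights
    # Unsupervised rule used in the paper: the test sample takes the class of
    # the dictionary element with the largest coefficient.
    return labels[np.argmax(coeffs)]

# Example with distance-based weights (wSRC):
# w = np.linalg.norm(A - x[:, None], axis=0)
# y_hat = src_predict(x, A, labels, weights=w)
```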

4 Sparse Graphs

As discussed in Section 2, the first step in manifold learning is the generation of the adjacency graph. Recently, the ℓ1-graph was proposed, which employs the concept of sparse representations during graph construction. The objective in the ℓ1-graph is to connect a node with the nodes associated with the data points that offer the sparsest representation. One could select the weights of the edges connecting $x_i$ to the other vertices by solving the following ℓ1 minimization problem:

$$ w_i^{*} = \arg\min_{w_i} \| w_i \|_1 \quad \text{subject to} \quad x_i = B_i w_i , \qquad (10) $$

where $B_i = [x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_n]$ is the dictionary containing all training samples except $x_i$. In this case, the weights of the graph correspond to the coefficients of the linear approximation of each data point with respect to the rest of the training set. ℓ1-graphs offer significant advantages compared to typical nearest neighbor graphs, the most important of which is that there is no need to specify k, the number of neighbors. The adaptive selection of the neighborhood size can represent the neighborhood structure more accurately than nearest neighbors. In addition, the ℓ1-graph is more robust to noise and outliers, since it is based on linear representations that have shown promising results under difficult scenarios such as illumination variation. Furthermore, ℓ1-graphs encode more discriminative information, especially in the case where class label information is not available.

In [9], Qiao et al. applied the ℓ1-graph construction approach in a modified version of NPE and reported higher recognition accuracy compared to state-of-the-art manifold learning algorithms that use typical weight graphs for face recognition. The approach was later extended to semi-supervised manifold learning in [6]. A similar approach was also presented in [10], where the authors applied the ℓ1-graph to subspace learning and semi-supervised learning. In all three works, the nearest neighbor classifier was used for classification of the test data.

In addition to the simple sparsity constraint, which only deals with the cardinality of the solution, we propose the application of the weighted LASSO in order to take distances into account. We therefore propose to replace Eq. (10) by Eq. (9), where each weighting coefficient $w_j$ is given by the distance between $x_i$ and the dictionary element (training example) associated with that coefficient, i.e. $w_j = d(x_i, x_j)$. We investigated different choices for the distance metric $d$ and report results using the Euclidean distance. A sketch of this graph construction is given below.
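The following sketch shows one way to build the ℓ1-graph and its weighted variant, again using a penalized LASSO as a stand-in for the constrained problems of Eqs. (9)-(10). The leave-one-out dictionary, the regularization value and the final symmetrization (a common step before forming the graph Laplacian) are our assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def l1_graph(X, lam=0.01, distance_weights=False):
    """Sketch of (weighted) l1-graph construction.

    X: training data with one sample per row, shape (n, m).
    Returns an n x n matrix whose i-th row holds the sparse reconstruction
    coefficients of x_i over the remaining training samples.
    """
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        idx = [j for j in range(n) if j != i]              # leave x_i out of its own dictionary
        B = X[idx].T                                       # dictionary B_i, shape (m, n-1)
        if distance_weights:
            # Euclidean distances used as weights, as in the weighted l1-graph.
            w = np.linalg.norm(X[idx] - X[i], axis=1) + 1e-12
            B = B / w                                      # fold the weights into the columns
        lasso = Lasso(alpha=lam, fit_intercept=False, max_iter=5000)
        lasso.fit(B, X[i])
        coef = lasso.coef_ / w if distance_weights else lasso.coef_
        W[i, idx] = coef
    # Symmetrize before computing the degree matrix and Laplacian (assumption).
    return 0.5 * (np.abs(W) + np.abs(W).T)
```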

5 Experimental Results

The goal of the experimental section is to investigate how the SRs framework can be used in conjunction with manifold learning for face recognition, as discussed in the previous sections. In order to evaluate the classification accuracy achieved by this combination, we performed a series of experiments on three publicly available face recognition datasets: Yale, AT&T and Yale-B. The Yale dataset [22] contains 165 face images of 15 individuals, 11 images per individual. These images contain faces in frontal poses with significant variation in appearance (expressions, glasses, etc.). The second one is the AT&T dataset [21], which contains 400 images of 40 individuals. The images included in the AT&T dataset exhibit variation in expression, facial details and head pose (20 degrees of variation). The YaleB dataset [24] contains 21888 images of 38 persons under various pose and illumination conditions; we used a subset of 2432 nearly frontal face images in this experiment. We note that all images were resized to 32x32 pixels and the pixel values were rescaled to [0,1].

The results are presented in Tables 1-11 and correspond to the average classification accuracy after 10-fold cross validation. In these tables, rows indicate the method used for the generation of the weight matrix used in Eq. (1): the typical nearest neighbor graph (NN-Graph) using 2 neighbors (as used in [9]), the sparse representation technique for weight matrix construction (ℓ1-Graph) and the weighted sparse representation variant (wℓ1-Graph). Columns indicate the method used for classification: the 1-nearest neighbor (NN), the sparse representation classifier (SRC) based on Eq. (8) and the weighted sparse representation classifier (wSRC) based on the weighted LASSO of Eq. (9). A sketch of the evaluation protocol is given below.
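As a concrete illustration of this protocol, the sketch below averages accuracy over ten random splits with a fixed number of training images per class; the interpretation of the ten folds as random draws, the function names and the preprocessing already applied to the features are assumptions made for illustration.

```python
import numpy as np

def evaluate(features, labels, n_train_per_class, fit_and_predict, n_splits=10, seed=0):
    """Sketch of the averaged-accuracy protocol of Section 5.

    features:        (n_samples, n_features) array, e.g. 32x32 images flattened
                     and rescaled to [0, 1].
    fit_and_predict: callable (X_tr, y_tr, X_te) -> predicted labels; plug in any
                     graph-construction / classifier combination from Tables 1-11.
    """
    rng = np.random.default_rng(seed)
    accuracies = []
    for _ in range(n_splits):
        # Draw n_train_per_class training images from every class; test on the rest.
        train_idx = []
        for c in np.unique(labels):
            members = np.flatnonzero(labels == c)
            train_idx.extend(rng.choice(members, size=n_train_per_class, replace=False))
        train_idx = np.array(train_idx)
        test_mask = np.ones(len(labels), dtype=bool)
        test_mask[train_idx] = False
        y_pred = fit_and_predict(features[train_idx], labels[train_idx], features[test_mask])
        accuracies.append(np.mean(y_pred == labels[test_mask]))
    return np.mean(accuracies)
```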

Tables 1-4 present the classification results on the Yale dataset. Based on these results, a number of observations can be made. First, regarding the method used for graph construction, the ℓ1-Graph and the wℓ1-Graph achieve significantly higher accuracy compared to the NN-Graph, especially when NN is used as the classifier. The increase in accuracy observed using NN classification ranges from 13% to 22%, depending on the number of training examples available. This indicates that using either the ℓ1-Graph or the wℓ1-Graph can provide significant benefits, especially when computational constraints prohibit the application of the SRC or wSRC classifiers during testing. Regarding classification, we observe that SRC and wSRC achieve much higher recognition accuracy compared to the NN classifier, particularly when the NN-Graph is used for the weight matrix generation. As for the weighting extension of the sparse representations, we observe that the results are similar to the ones obtained without the weighting scheme.

Table 1: Classification results on Yale with 2 training examples/class

YALE - 2      NN      SRC     wSRC
NN-Graph      39.39   47.68   47.14
ℓ1-Graph      44.90   47.46   47.00
wℓ1-Graph     44.80   47.43   46.97

Table 2: Classification results on Yale with 3 training examples/class

YALE - 3      NN      SRC     wSRC
NN-Graph      42.83   53.15   52.65
ℓ1-Graph      50.45   52.80   52.83
wℓ1-Graph     50.53   52.68   52.56

Table 3: Classification results on Yale with 4 training examples/class

YALE - 4      NN      SRC     wSRC
NN-Graph      44.78   57.13   57.52
ℓ1-Graph      54.53   57.90   57.77
wℓ1-Graph     54.64   57.67   57.39

Table 4: Classification results on Yale with 5 training examples/class

YALE - 5      NN      SRC     wSRC
NN-Graph      46.13   60.73   60.68
ℓ1-Graph      56.68   60.86   60.86
wℓ1-Graph     56.60   60.71   60.88

The results for the AT&T dataset are presented in Tables 5-8 for the cases of 2, 4, 6 and 8 training examples per individual. We observe that, similarly to the Yale dataset, using either the ℓ1-Graph or the wℓ1-Graph can provide a significant increase in accuracy, especially for the case of the NN classifier. We further notice that the wℓ1-Graph performs better than the ℓ1-Graph, although the increase in accuracy is minimal.

Table 5: Classification results on AT&T with 2 training examples/class

AT&T - 2      NN      SRC     wSRC
NN-Graph      57.90   77.14   75.00
ℓ1-Graph      69.93   76.12   74.76
wℓ1-Graph     68.45   76.34   74.76

Table 6: Classification results on AT&T with 4 training examples/class

AT&T - 4      NN      SRC     wSRC
NN-Graph      74.06   89.54   89.10
ℓ1-Graph      82.20   89.31   89.00
wℓ1-Graph     82.77   89.55   89.06

Table 7: Classification results on AT&T with 6 training examples/class

AT&T - 6      NN      SRC     wSRC
NN-Graph      84.84   92.43   92.96
ℓ1-Graph      89.34   93.25   93.37
wℓ1-Graph     89.40   93.40   93.56

Table 8: Classification results on AT&T with 8 training examples/class

AT&T - 8      NN      SRC     wSRC
NN-Graph      91.43   95.31   95.41
ℓ1-Graph      93.62   95.93   95.93
wℓ1-Graph     93.81   95.87   96.00

The classification results for the YaleB dataset are shown in Tables 9-11 for 10, 20 and 30 training examples per individual. We note that the YaleB dataset is more demanding due to its larger size. Regarding the performance, we again observe the superiority of the ℓ1-Graph and the wℓ1-Graph for the weight matrix construction, and of the SRC and wSRC for the classification. However, we notice that there is a significant increase in accuracy when the weighted sparse representation is used. We can justify this increase by the fact that the larger number of training examples offers better sampling of the underlying manifold, in which case the use of distances provides more reliable embedding and classification.

Table 9: Classification results on YaleB with 10 training examples/class

YALEB - 10    NN      SRC     wSRC
NN-Graph      74.92   82.10   82.39
ℓ1-Graph      84.90   85.84   86.23
wℓ1-Graph     84.95   86.18   86.33

Table 10: Classification results on YaleB with 20 training examples/class

YALEB - 20    NN      SRC     wSRC
NN-Graph      84.76   87.06   87.90
ℓ1-Graph      87.60   89.17   90.62
wℓ1-Graph     87.96   90.10   91.05

Table 11: Classification results on YaleB with 30 training examples/class

YALEB - 30    NN      SRC     wSRC
NN-Graph      87.83   89.01   90.50
ℓ1-Graph      89.56   90.26   91.99
wℓ1-Graph     90.03   90.50   91.75

6 Conclusions

In this paper, we investigated the use of manifold learning for face recognition when the sparse representations framework is utilized in two key steps of manifold learning: weight matrix construction and classification. Regarding the weight matrix construction, we examined the benefits of the sparse representation framework over the traditional nearest neighbor approach. With respect to classification, we compared the recognition accuracy of a typical classification scheme, the k-nearest neighbors, with the accuracy achieved by the sparse representation classifier. In addition, we investigated the effects of introducing a distance-based weighting term in the sparse representation optimization and examined its impact on both the weight matrix construction and the classification.

Based on the experimental results, we can make the following suggestions regarding the design of a manifold-based face recognition system. When sufficient computational resources are available during the training stage, using the sparse representation framework for graph construction leads to significantly better results, and is especially valuable when resource limitations during the testing phase prohibit the application of the more computationally demanding sparse representation classifier. On the other hand, when the available processing power during testing is adequate, the sparse representation classifier outperforms the nearest neighbor classifier, while the method used for graph construction plays a less significant role.

References

1. Baraniuk, R.G., Cevher, V., Wakin, M.B.: Low-dimensional models for dimensionality reduction and signal recovery: A geometric perspective. To appear in Proceedings of the IEEE (2010)
2. Cover, T.M.: Estimation by the nearest neighbor rule. IEEE Trans. on Information Theory 14(1): 50-55 (1968)
3. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290: 2319-2323 (2000)
4. Saul, L.K., Roweis, S.T.: Think globally, fit locally: unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research 4: 119-155 (2003)
5. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15(6): 1373-1396 (2003)
6. Qiao, L., Chen, S., Tan, X.: Sparsity preserving discriminant analysis for single training image face recognition. Pattern Recognition Letters (2009)
7. He, X., Niyogi, P.: Locality preserving projections. Advances in Neural Information Processing Systems 16: 153-160 (2003)
8. He, X., Cai, D., Yan, S., Zhang, H.J.: Neighborhood preserving embedding. In: IEEE Int. Conf. on Computer Vision, pp. 1208-1213 (2005)
9. Qiao, L., Chen, S., Tan, X.: Sparsity preserving projections with applications to face recognition. Pattern Recognition 43(1): 331-341 (2010)
10. Cheng, B., Yang, J., Yan, S., Fu, Y., Huang, T.: Learning with ℓ1-graph for image analysis. IEEE Transactions on Image Processing, accepted for publication (2010)
11. Wright, J., Yang, A.Y., Ganesh, A., Sastry, S.S., Ma, Y.: Robust face recognition via sparse representation. IEEE Trans. on Pattern Analysis and Machine Intelligence 31(2): 210-227 (2009)
12. Belhumeur, P., Hespanha, J., Kriegman, D.: Eigenfaces versus Fisherfaces: Recognition using class specific linear projection. IEEE Trans. on Pattern Analysis and Machine Intelligence 19(7): 711-720 (1997)
13. Andoni, A., Indyk, P.: Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM 51(1): 117-122 (2008)
14. He, X., Yan, S., Hu, Y., Niyogi, P., Zhang, H.J.: Face recognition using Laplacianfaces. IEEE Trans. on Pattern Analysis and Machine Intelligence 27(3): 328-340 (2005)
15. Candès, E., Romberg, J., Tao, T.: Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. on Information Theory 52(2): 489-509 (2006)
16. Donoho, D.: Compressed sensing. IEEE Trans. on Information Theory 52(4): 1289-1306 (2006)
17. Mairal, J., Bach, F., Ponce, J., Sapiro, G., Zisserman, A.: Supervised dictionary learning. Advances in Neural Information Processing Systems 21 (2009)
18. Elgammal, A., Lee, C.S.: Inferring 3D body pose from silhouettes using activity manifold learning. In: IEEE Conf. on Computer Vision and Pattern Recognition, vol. 2 (2004)
19. Tuzel, O., Porikli, F., Meer, P.: Pedestrian detection via classification on Riemannian manifolds. IEEE Trans. on Pattern Analysis and Machine Intelligence 30(10): 1713-1727 (2008)
20. Rabaud, V., Belongie, S.: Linear embeddings in non-rigid structure from motion. In: IEEE Conf. on Computer Vision and Pattern Recognition (2008)
21. AT&T face database, http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
22. Yale Univ. face database, http://cvc.yale.edu/projects/yalefaces/yalefaces.html
23. Cai, D., He, X., Han, J.: Spectral regression for efficient regularized subspace learning. In: Proc. Int. Conf. on Computer Vision, pp. 1-8 (2007)
24. Georghiades, A.S., Belhumeur, P.N., Kriegman, D.J.: From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Trans. on Pattern Analysis and Machine Intelligence 23(6): 643-660 (2001)
25. Zou, H.: The adaptive Lasso and its oracle properties. Journal of the American Statistical Association 101(476): 1418-1429 (2006)