Novel Approaches for Face Recognition: Template-Matching using Dynamic Time Warping and LSTM Neural Network Supervised Classification

1 Alexandre L. M. Levada, 2 Débora C. Correa, 2 Denis H. P. Salvadeo, 2 José H. Saito and 2 Nelson D. A. Mascarenhas

1 Physics Institute of São Carlos, University of São Paulo, São Carlos, SP, Brazil
Trabalhador Sãocarlense Avenue, 400, Postal Code 369, Zip Code 13560-970

2 Computer Department, Federal University of São Carlos, São Carlos, SP, Brazil
Washington Luís Highway, Km 235, Postal Code 676, Zip Code 13565-905
Phone: +55 (16) 3351-8579  Fax: +55 (16) 3351-8233  E-mail: [email protected]

Keywords: Face Recognition, Dynamic Time Warping, LSTM Neural Network, Learning Algorithm, PCA

Abstract – This paper presents novel methodologies for face recognition: template matching using Dynamic Time Warping (DTW) and Long Short-Term Memory (LSTM) neural network supervised classification. The advantage of the DTW algorithm is that it requires only one prototype (sample) per class; that is, a single representative template is enough for classification purposes. The LSTM network is a novel recurrent network architecture that implements an appropriate gradient-based learning algorithm, overcoming the vanishing-gradient problem. Experiments with images from the MIT-CBCL face recognition database provided good results for both approaches. For DTW, the obtained results indicate that the proposed method is robust against the presence of random noise in observations and templates, since it is capable of dealing with unpredictable variations. LSTM training achieved good performance even with small feature sets.

1. INTRODUCTION

In the last decades, the human face has been explored in a variety of neural network, computer vision and pattern recognition applications. One of the greatest challenges among them is the problem of face recognition. Face recognition has several advantages over other biometric technologies: it is a natural and nonintrusive approach [1]. Basically, a face recognition system can operate in two distinct modes: face verification (authentication) or face identification (recognition), the latter being the main focus of this work. Recognition involves one-to-many matching, comparing a query face against all the template images in the database [1].

Face recognition methods can be divided into the following categories: holistic matching methods, feature-based matching methods and hybrid methods [2]. Holistic methods use the whole face as input. Statistical techniques based on principal component analysis (Eigenfaces) [3], as well as other feature extraction methods such as Linear Discriminant Analysis (Fisherfaces) [4] and Independent Component Analysis (ICA) [5], belong to this class of methods. The proposed DTW approach is also an example of a holistic method. Feature-based methods rely on the extraction of local features such as statistical and geometrical measures. Finally, hybrid methods combine local features and the global face representation during the classification stage. One of the most widely used neural network approaches for face recognition tasks is the Neocognitron network [6-7]. Our motivation for using LSTM is to reduce the number of training samples.

Many technical challenges exist in face recognition. According to [1], the key ones are: large variability, highly complex nonlinear manifolds, high dimensionality and small sample size. In this work, our objective is to apply DTW for template-matching face recognition and LSTM for classification as a way to address some of these problems.

The remainder of the paper is organized as follows: Section 2 describes DTW. The LSTM network is presented in Section 3. The methodology, experiments and results are presented in Section 4. Finally, Section 5 presents the final remarks and conclusions.

2. DYNAMIC TIME WARPING

Dynamic Time Warping (DTW) [8-9] is a pattern-matching technique used to compare signals, not necessarily of the same size, based on their characteristic shapes. It is widely used in speech processing and recognition. The basic idea is to compare the samples of an unknown input signal with the samples of a set of template signals; the unknown signal is classified as the most similar pattern found. Our motivation here is that, in face recognition, each pattern can be characterized by a specific 1-D waveform obtained by scanning the image in lexicographic order (a sketch of this descriptor follows the list below). We also believe that small variations in intensity (illumination), expression or pose do not modify this lexicographic 1-D signal (or face descriptor) in a drastic way. The application of DTW to face recognition is a novel approach, since its use has so far been restricted to inherently 1-D signals such as speech. The main advantages of using DTW in pattern recognition classification tasks can be summarized as follows:

• It does not require complex mathematical models, resulting in a computationally simple alternative.
• It requires only one prototype per class; that is, a single representative template for each class is enough for classification purposes (most statistical and neural network based approaches require a training set with many samples per class, which is not always available).
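To make the face descriptor concrete, the following minimal MATLAB sketch flattens a 19 x 19 face into a 361-element 1-D signal by lexicographic (row-by-row) scanning. The variable names and the stand-in image are illustrative assumptions, not the authors' code.

    % Build the lexicographic face descriptor of a 19 x 19 face image.
    face = rand(19, 19);                 % stand-in for a database image
    descriptor = reshape(face', 1, []);  % row-by-row scan; MATLAB stores
                                         % columns first, hence the transpose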

However, DTW also has some complications:

• The high computational cost due to the large size of the samples (1-D or 2-D signals) may make the method "slow".
• It may be unviable for real-time processing.
• It is unviable for extremely large databases.

As DTW is commonly used for speech recognition, where signals are sampled at 22050 or 44100 samples/sec, these complications are not critical in face recognition, where each descriptor has only 361 samples. The complete algorithm for pattern-matching classification using Dynamic Time Warping is detailed below (a MATLAB sketch follows the algorithm).

DTW ALGORITHM

1. Read the input signal (size M).
2. Repeat for each template signal k (size Tk):
   a. Allocate an M x Tk DIST matrix.
   b. Allocate an M x Tk ACCDIST matrix.
   c. Compute the absolute difference between each input sample and each template sample:

      DIST(i, j) = |input(i) - template_k(j)|

   d. Fill the first column of the ACCDIST matrix:

      ACCDIST(0, 0) = DIST(0, 0)
      ACCDIST(i, 0) = DIST(i, 0) + ACCDIST(i-1, 0)

   e. Fill the first line of the ACCDIST matrix:

      ACCDIST(0, j) = DIST(0, j) + ACCDIST(0, j-1)

   f. For each remaining element (i, j) of ACCDIST, do:

      ACCDIST(i, j) = DIST(i, j) + min{ ACCDIST(k, l) : (k, l) in V }

      where V = {(i-1, j); (i, j-1); (i-1, j-1)} is the set of the 3 previous neighbors. The preference is always for the diagonal direction.

   g. Calculate the size of the path between the last and initial points of the ACCDIST matrix:

      i = M-1; j = Tk-1; path_size = 0;
      while (i + j != 0)
          direction = neighbor in V with the smallest ACCDIST (diagonal preferred)
          switch (direction)
              case 'down':     j = j - 1;
              case 'back':     i = i - 1;
              case 'diagonal': i = i - 1; j = j - 1;
          end
          path_size = path_size + 1;
      end

3. Select the template with the smallest distance (path size).
4. Repeat steps 1 to 3 for all input signals.
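The authors state that the experiments were run in MATLAB but do not publish source code; the following is a minimal MATLAB sketch of the algorithm above, to be saved as dtw_path_size.m. The function and variable names are assumptions.

    function path_size = dtw_path_size(sig, tmpl)
    % Minimal sketch of the DTW path-size computation described above.
    M = numel(sig); T = numel(tmpl);
    DIST = abs(bsxfun(@minus, sig(:), tmpl(:)'));   % M x T local distances
    ACC  = zeros(M, T);                             % accumulated distances
    ACC(1, 1) = DIST(1, 1);
    for i = 2:M, ACC(i, 1) = DIST(i, 1) + ACC(i-1, 1); end   % first column
    for j = 2:T, ACC(1, j) = DIST(1, j) + ACC(1, j-1); end   % first line
    for i = 2:M
        for j = 2:T
            ACC(i, j) = DIST(i, j) + min([ACC(i-1, j-1), ACC(i-1, j), ACC(i, j-1)]);
        end
    end
    % Backtrack from (M, T) to (1, 1), counting the steps; min returns
    % the first index on ties, so the diagonal move is preferred.
    i = M; j = T; path_size = 0;
    while i > 1 || j > 1
        if i > 1 && j > 1
            [~, move] = min([ACC(i-1, j-1), ACC(i-1, j), ACC(i, j-1)]);
        elseif i > 1
            move = 2;                       % only 'back' remains
        else
            move = 3;                       % only 'down' remains
        end
        switch move
            case 1, i = i - 1; j = j - 1;   % diagonal
            case 2, i = i - 1;              % back
            case 3, j = j - 1;              % down
        end
        path_size = path_size + 1;
    end
    end

Applied to two face descriptors, the returned path size plays the role of the distance in step 3.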

3. TRADITIONAL LSTM NETWORK

Previous gradient-based methods, such as BPTT (Back-Propagation Through Time), share a problem: over time, as gradient information is passed backward to update weights whose values affect later outputs, the error/gradient information is continually decreased or increased by weight-update scalar values [10-12]. This means that the temporal evolution of the path integral over all error signals "flowing back in time" depends exponentially on the magnitude of the weights. For this reason, standard recurrent neural networks fail to learn in the presence of long time lags between relevant input and target events [10-12]. The LSTM algorithm minimizes this problem by enforcing constant error flow through "constant error carrousels" (CECs) within special units, thus permitting non-decaying error flow "back into time" [10-12]. This enables the network to learn important data and store them without degradation over long periods of time.

The memory block is the basic unit in the hidden layer of an LSTM network, and it replaces the hidden unit of a standard recurrent neural network (RNN) (Figure 1 (a)) [10]. A memory block is formed by one or more memory cells and a pair of adaptive, multiplicative gating units that gate the input and output of all cells in the block (Figure 1 (b)). All cells in a memory block share the same gates [10]. Figure 1 (c) shows a detailed memory block with one memory cell. Each memory cell has a recurrently self-connected linear unit called the "Constant Error Carrousel" (CEC) [10-11], representing the cell state. The CEC helps to solve the vanishing-error problem.

Fig. 2. A memory block with one cell [10, p. 12]
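As a rough illustration of the gating described above, the following MATLAB sketch performs one forward step of a single traditional LSTM memory cell (input and output gates around a CEC, no forget gate). All dimensions, weight names and random initial values are assumptions for illustration only, not the authors' network.

    % One forward step of a single LSTM memory cell (traditional LSTM).
    n_in = 10; n_cell = 1;                      % e.g. 10 PCA features, one cell
    W_in  = randn(n_cell, n_in); R_in  = randn(n_cell, n_cell);
    W_out = randn(n_cell, n_in); R_out = randn(n_cell, n_cell);
    W_z   = randn(n_cell, n_in); R_z   = randn(n_cell, n_cell);
    x = randn(n_in, 1);                         % current input pattern
    y_prev = zeros(n_cell, 1);                  % previous block output
    s_prev = zeros(n_cell, 1);                  % previous cell state
    sigm  = @(v) 1 ./ (1 + exp(-v));
    g_in  = sigm(W_in  * x + R_in  * y_prev);   % input gate activation
    g_out = sigm(W_out * x + R_out * y_prev);   % output gate activation
    z     = tanh(W_z * x + R_z * y_prev);       % squashed cell input
    s     = s_prev + g_in .* z;                 % CEC: self-connection weight 1.0
    y     = g_out .* tanh(s);                   % gated output of the cell

The key line is the cell-state update: because the CEC's self-connection has a fixed weight of 1.0, the error flowing back through s neither vanishes nor explodes.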


For a detailed explanation of the LSTM network forward and backward passes, see [10] and [11].

4. METHODOLOGY

To test and evaluate the proposed methodology for face classification, we used images from the MIT-CBCL (Center for Biological and Computational Learning) face recognition database #1, available at [13]. The CBCL FACE DATABASE #1 consists of a training set of 2429 face images and a test set of 472 face images, with spatial dimensions of 19 x 19 pixels.

4.1 Experiments and Results

In the experiments, we selected 50 faces of different individuals from the training set to represent the template images. Then, we chose another 50 faces of the same persons from the test set. Each 19 x 19 image was transformed into a 1-D signal of 361 elements; we call this representation the face descriptor. Figure 1 shows some template faces of the training set, and Figure 2 shows face descriptors corresponding to 4 images in the template set.

For DTW, we present results from 3 experiments comparing the performance of face recognition in distinct situations (a sketch of the degradation step follows the list):

a) Both template patterns and test faces without noise.
b) Template patterns without noise and test faces degraded by Gaussian noise (σ² = 0.1).
c) Both template patterns and test faces degraded by Gaussian noise.
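The paper specifies only the noise variance; a plausible MATLAB sketch of the degradation step, assuming intensities normalized to [0, 1] and clipping after corruption (both assumptions), is:

    % Degrade a face with zero-mean Gaussian noise of variance 0.1.
    face  = rand(19, 19);                         % stand-in for a database image in [0, 1]
    noisy = face + sqrt(0.1) * randn(size(face)); % additive noise, sigma^2 = 0.1
    noisy = min(max(noisy, 0), 1);                % clip back to the valid range (assumption)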

Fig. 1. MIT-CBCL DATABASE #1 example faces


Fig. 3. Matrix of path sizes between the test faces and the template patterns for the case of noise-free test images

Fig. 4. Degraded MIT-CBCL DATABASE #1 example faces


Fig. 2. Examples of face descriptors for template patterns: a) Template 3; b) Template 4

All experiments were executed using MATLAB. In the first DTW experiment, both templates and test images are free of noise. The obtained result was 100% correct answers. A plot of the 50 x 50 matrix of path sizes between each pair of faces is shown in Figure 3; note that the template sequence and the test face sequence are aligned. Basically, this matrix associates a distance between each possible face in the test set and all the patterns from the template set. Note that, in this case, the smallest paths are clearly on the diagonal, indicating optimal performance and no confusion in the classification.

In the second DTW experiment, we added independent Gaussian noise to the test faces only, simulating images captured from a degraded video (a security camera, for instance); the template patterns remain the same. The resulting noisy test images and face descriptors are shown in Figures 4 and 5. As expected, the performance in this case is degraded, since the difference between the template patterns and the input test images is significantly higher. The obtained results show a correct classification rate of 94%, that is, 47 correct answers out of 50 possible cases. Although the disparity between the template and test sets is increased, the classification performance is almost unaffected, indicating that the proposed method is robust against the presence of random noise. (A sketch of the evaluation loop follows.)
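The following MATLAB sketch shows how such a 50 x 50 path-size matrix can be produced with the dtw_path_size function from Section 2 and used for classification. The descriptor matrices, their stand-in values and the identity alignment of the two sets are assumptions.

    % Build the matrix of DTW path sizes and classify each test face by
    % the template with the smallest path size.
    template_desc = rand(50, 361);             % stand-ins: one descriptor per row
    test_desc     = rand(50, 361);
    n = 50;
    P = zeros(n, n);
    for t = 1:n
        for k = 1:n
            P(t, k) = dtw_path_size(test_desc(t, :), template_desc(k, :));
        end
    end
    [~, labels] = min(P, [], 2);               % nearest template per test face
    accuracy = 100 * mean(labels == (1:n)');   % test face t matches template t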

Fig. 5. Face descriptors for degraded images of the template set: a) Template 3; b) Template 4

The 50 x 50 matrix of path sizes between each pair of faces for this case is shown in Figure 6. Note that the presence of small path sizes between pairs of faces outside the diagonal indicates that the classification is a complex task, since there is a significant amount of confusion.

Finally, in the last experiment, both template patterns and test images were degraded, simulating a sub-optimal condition for the creation of the image database, as in most real problems, where it is not always possible to have an ideal prototype for each person/class. In this case, the obtained results show a correct classification rate of 70% (35 correct out of 50), which is still a good performance. The matrix of path sizes between each pair of faces for this case is shown in Figure 7; note the high concentration of small path sizes outside the main diagonal.

Fig. 6. Matrix of path sizes between the test faces and the template patterns for the case of noisy test images

All the experiments were repeated using the minimum mean squared error (MSE) as the similarity measure (a sketch follows). The obtained results showed 92%, 72% and 34% accuracy for the first, second and third situations, respectively, indicating that DTW is more robust against the presence of random noise and improves classification performance.
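For reference, the MSE baseline can be computed over the same descriptor matrices used in the evaluation-loop sketch above; a minimal version, with the same assumed variable names:

    % Minimum-MSE baseline: mean squared intensity difference between
    % aligned descriptors, minimized over the templates.
    D = zeros(n, n);
    for t = 1:n
        for k = 1:n
            D(t, k) = mean((test_desc(t, :) - template_desc(k, :)).^2);
        end
    end
    [~, labels_mse] = min(D, [], 2);           % nearest template under MSE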

Fig. 7. Matrix of path sizes between the test faces and the template patterns for the case of noisy templates and test images

For the LSTM network, we also selected 50 faces of different individuals to represent the classes to be classified. We applied PCA (Principal Component Analysis) to reduce the dimensionality of the input patterns (361-D) and trained two LSTM neural networks, using 10 and 20 principal components, respectively (a sketch of the PCA step follows the table). We verified that LSTM can properly learn the classes even with one sample per class and a reduced feature set. Table 1 shows the results obtained for correct classification (CC), incorrect classification (IC) and the correct classification rate (CCR). The training settings for this experiment were: 10 and 20 hidden memory blocks for the 10-D and 20-D cases, respectively, and a learning rate of α = 0.02. The obtained mean squared error after 50000 epochs was 0.05.

            10 principal comp.        20 principal comp.
            CC     IC     CCR         CC     IC     CCR
    LSTM    48     2      96%         44     6      88%

Table 1. Results obtained by the LSTM neural network
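The paper does not detail the PCA step; one possible implementation, via the singular value decomposition in MATLAB with a stand-in data matrix, is sketched below. All variable names are assumptions.

    % Project 361-D face descriptors onto the first n_pc principal components.
    X  = rand(50, 361);                   % stand-in: one descriptor per row
    mu = mean(X, 1);
    Xc = X - repmat(mu, size(X, 1), 1);   % mean-centered data
    [~, ~, V] = svd(Xc, 'econ');          % columns of V: principal directions
    n_pc = 10;                            % 10 or 20 components in the paper
    features = Xc * V(:, 1:n_pc);         % reduced feature vectors for the LSTM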

5. CONCLUSIONS

In this paper, we presented novel methodologies for template-matching face recognition using Dynamic Time Warping and LSTM networks. The main motivation was the possibility of performing face recognition in critical situations where there is only a single sample for each class. Experiments with images from the MIT-CBCL face recognition database indicated that the proposed DTW method is robust to the presence of random noise in the observations, a very desirable property in the recognition of human faces. Also, LSTM could learn even with a reduced feature set (10-D and 20-D). Future work may include the use of other face databases, the study of the behavior of DTW under different illuminations and facial expressions, age and gender classification, and even the use of other types of degradation on the face images.

REFERENCES

[1] A. K. Jain and S. Z. Li, Handbook of Face Recognition, Springer-Verlag, New York, 2005.
[2] W. Zhao, R. Chellappa, P. J. Phillips and A. Rosenfeld, "Face recognition: A literature survey", ACM Computing Surveys, vol. 35, pp. 399-459, 2003.
[3] M. A. Turk and A. P. Pentland, "Eigenfaces for recognition", Journal of Cognitive Neuroscience, vol. 3, no. 1, pp. 71-86, 1991.
[4] K. Etemad and R. Chellappa, "Discriminant analysis for recognition of human face images", Journal of the Optical Society of America A, vol. 14, pp. 1727-1733, 1997.
[5] B. A. Draper, K. Baek, M. S. Bartlett and J. R. Beveridge, "Recognizing faces with PCA and ICA", Computer Vision and Image Understanding, vol. 91, no. 1, pp. 115-137, 2003.
[6] K. Fukushima, "A neural network for visual pattern recognition", Computer, vol. 21, no. 3, pp. 65-75, 1988.
[7] C. O. Santana and J. H. Saito, "Reconhecimento Facial utilizando a Rede Neural Neocognitron" (Face recognition using the Neocognitron neural network), in Proceedings of the III Workshop de Visão Computacional, 2007 (in Portuguese).
[8] H. Sakoe and S. Chiba, "Dynamic programming algorithm optimization for spoken word recognition", IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 26, pp. 43-49, 1978.
[9] J. Coleman, Introducing Speech and Language Processing, Cambridge University Press, New York, 2005.
[10] F. Gers, "Long Short-Term Memory in Recurrent Neural Networks", Ph.D. thesis, 2001.
[11] S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory", Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[12] S. Hochreiter, Y. Bengio, P. Frasconi and J. Schmidhuber, "Gradient flow in recurrent nets: the difficulty of learning long-term dependencies", in A Field Guide to Dynamical Recurrent Networks, IEEE Press, New York, 2001.
[13] CBCL Face Database #1, MIT Center for Biological and Computational Learning.