Emotion Recognition in the Wild with Feature Fusion and Multiple Kernel Learning

Junkai Chen (1), Zenghai Chen (1), Zheru Chi (1) and Hong Fu (1,2)

(1) Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong; (2) Department of Computer Science, Chu Hai College of Higher Education, Hong Kong. Email: [email protected], [email protected], [email protected], [email protected]

ABSTRACT
This paper presents our proposed approach for the second Emotion Recognition in the Wild Challenge. We propose a new feature descriptor, Histogram of Oriented Gradients from Three Orthogonal Planes (HOG_TOP), to represent facial expressions. We also explore the properties of visual and audio features, and adopt Multiple Kernel Learning (MKL) to find an optimal feature fusion. An SVM with multiple kernels is trained for facial expression classification. Experimental results demonstrate that our method achieves a promising performance: the overall classification accuracies on the validation set and test set are 40.21% and 45.21%, respectively.

Categories and Subject Descriptors
I.4.9 [Image Processing and Computer Vision]: Applications; I.5.4 [Pattern Recognition]: Applications.

General Terms
Algorithms; Performance; Experimentation

Keywords
Emotion Recognition; Feature Fusion; HOG_TOP; Multiple Kernel Learning; Support Vector Machine

1. INTRODUCTION
Facial expression analysis has been an active field of research for over two decades. In the early stage, most work focused on static images or posed facial expressions captured under controlled laboratory conditions. Researchers have made much progress on several popular public datasets, such as JAFFE [1], CMU-PIE [2], MMI [3], and Multi-PIE [4]. The facial expressions in these datasets are lab-controlled and differ from facial expressions in the real world. Recently, more attention has been paid to emotion recognition in the wild. Compared with lab-controlled expressions, expressions in the wild are more natural and spontaneous, and can more truly reveal the emotions and intentions behind them. To this end, the first step is to build a database of reasonable size. This turns out to be a hard task, because expressions in the wild are transient and delicate. Some researchers have found an alternative but useful way to collect facial expressions: they extract short video sequences from movies.

Although movies are shot in more or less controlled environments, they are close to real-world environments, and the expressions in them are more realistic than lab-controlled expressions [5]. AFEW [6] is such a dataset. With the AFEW dataset, researchers can evaluate their methods for emotion recognition in the real world. Dhall et al. organized the first Emotion Recognition in the Wild Challenge in 2013 with the AFEW database [7]. In 2014, they updated the AFEW dataset and organized the second Emotion Recognition in the Wild Challenge. This paper presents the approach that we proposed for the 2014 challenge. We propose a new feature descriptor called Histogram of Oriented Gradients from Three Orthogonal Planes (HOG_TOP) to represent facial expressions, and use Multiple Kernel Learning (MKL) [8] to optimally combine the visual and audio features.

The rest of the paper is organized as follows. Section 2 gives an overview of the related work. Our proposed emotion recognition approach is presented in Section 3. The description and discussion of the experiments are presented in Section 4. The paper is concluded in Section 5.

2. RELATED WORK
There has been significant progress in automatic facial expression recognition owing to the development of computer vision and machine learning techniques. Two mainstream types of methods, geometry-based and appearance-based [9], are widely used for this purpose.

Geometry-based methods capture the shapes and configurations of the face and facial components such as the mouth, eyes, and brows. Facial expressions are caused by facial muscle movements, which result in deformations of the facial components, and these deformations can be used to distinguish among different facial expressions. To this end, facial fiducial points are extracted to form a vector that represents the shape or configuration of the face and facial components. The difficulty lies in detecting and locating the facial fiducial points precisely. Several algorithms have been proposed to solve this problem, such as Active Appearance Models (AAM) [10], [11], Constrained Local Models (CLM) [12], and Deformable Parts Models (DPM) [13].


Appearance-based methods, which represent expressions with texture, are another primary approach to facial expression recognition. Compared with geometry-based methods, appearance-based methods are easier to implement, since it is not necessary to locate the facial landmarks. In general, appearance features are obtained by applying hand-designed feature descriptors over the whole face. There are two widely accepted types of feature descriptors. One type is used for single, static images, such as Gabor filters [14], Local Binary Patterns (LBP) [15], and Local Phase Quantization (LPQ) [16]. The other type, such as LBP_TOP [17], LPQ_TOP [18], and PHOG [19], can represent the dynamic features of image sequences (videos).

Recently, methods that utilize multimodal data, including facial expressions, sound, and scene information, have achieved promising results for emotion analysis. Kim et al. [20] proposed a suite of Deep Belief Network models that explicitly capture complex non-linear feature interactions in multimodal data for emotion recognition. A study reported in [21] showed that visual features coupled with audio features obtain better performance than using a single modality alone. In [22], Sikka et al. utilized different feature descriptors to represent multimodal data and achieved good accuracy for emotion recognition in the wild. Kahou et al. [23] combined modality-specific deep neural networks for emotion recognition in the wild and won the 2013 emotion recognition challenge.

Feature fusion and Multiple Kernel Learning (MKL) [24] have received increasing attention from the computer vision community with the development of multi-modality emotion recognition. A psychological study [25] showed that using multiple modalities is helpful for emotion analysis. Multimodal methods make use of both visual and audio features; the key is to find an optimal way to integrate these modalities, and MKL provides an effective means to do so. MKL has been successfully applied to emotion and facial action unit recognition. For example, Zhang et al. [26] applied MKL to facial expression recognition; they extended the Hessian MKL algorithm to a multiclass SVM with the one-against-one rule, and their framework learns one kernel weight vector for each binary classifier. Zhang et al. [27] considered the underlying relations among action units and proposed lp-norm multi-task multiple kernel learning (MTMKL), which jointly trains classifiers for detecting the absence and presence of multiple action units.

3. THE PROPOSED APPROACH
This section describes in detail the methods that we adopt for the Emotion Recognition in the Wild Challenge. The flowchart of our proposed approach is shown in Figure 1. We propose a new feature descriptor called HOG_TOP to represent the facial expressions. We also explore the properties of the HOG_TOP features and the audio features, and apply Multiple Kernel Learning to find an optimal combination of the visual and audio features. Finally, an SVM with multiple kernels is trained for emotion classification.

Figure 1: The flowchart of our proposed system.

3.1 Histogram of Oriented Gradients from Three Orthogonal Planes
We propose a new feature descriptor called Histogram of Oriented Gradients from Three Orthogonal Planes (HOG_TOP) to represent the dynamic features of image sequences. HOG_TOP is an extension of the Histogram of Oriented Gradients (HOG) [28], which deals with 2D spatial textures only; HOG_TOP extends the computation to three dimensions and captures dynamic spatial-temporal textures.

Each position in a sequence has a 3-D neighborhood formed by three orthogonal planes (X-Y, X-T and Y-T), as shown in Figure 2. To obtain HOG_TOP, we first calculate the gradients along the X, Y and T directions with a 3x3 Sobel mask. Three gradient orientations are defined as

\theta_{XY} = \arctan\left(\frac{G_Y}{G_X}\right), \quad \theta_{XT} = \arctan\left(\frac{G_T}{G_X}\right), \quad \theta_{YT} = \arctan\left(\frac{G_T}{G_Y}\right),    (1)

where G_X, G_Y and G_T are the gradients along the X, Y and T directions, respectively. These angles are quantized into K orientation bins over a range of 0-180 degrees or 0-360 degrees. We obtain one histogram in each plane, and the three histograms are concatenated to form a global descriptor combining the spatial and temporal features.

Figure 2: Three orthogonal neighborhoods of a point in an image sequence.
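For illustration, a minimal NumPy sketch of the descriptor is given below. It pools orientations over whole planes rather than over the per-block grid used in our experiments, uses numerical gradients in place of the 3x3 Sobel mask, and weights the orientation votes by gradient magnitude following common HOG practice; it is a simplified illustration rather than our exact implementation.

```python
import numpy as np

def hog_top(volume, num_bins=9):
    """Simplified HOG_TOP descriptor for a grayscale sequence of shape (T, H, W).

    This sketch pools over the whole volume; a per-block variant repeats the
    same computation for each block and concatenates the results.
    """
    vol = np.asarray(volume, dtype=np.float64)
    g_t, g_y, g_x = np.gradient(vol)  # gradients along T, Y and X

    def plane_histogram(num, den):
        # Unsigned orientation in [0, 180), as in Eq. (1), weighted by magnitude.
        angles = np.degrees(np.arctan2(num, den)) % 180.0
        weights = np.sqrt(num ** 2 + den ** 2)
        hist, _ = np.histogram(angles, bins=num_bins, range=(0.0, 180.0),
                               weights=weights)
        return hist / (hist.sum() + 1e-12)

    # One histogram per orthogonal plane: XY (spatial), XT and YT (temporal).
    return np.concatenate([plane_histogram(g_y, g_x),   # theta_XY
                           plane_histogram(g_t, g_x),   # theta_XT
                           plane_histogram(g_t, g_y)])  # theta_YT
```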

3.2 Audio Features
The other type of features employed in our study is extracted from the provided sound data. The organizers of the emotion recognition challenge supplied the acoustic features, which are extracted with the open-source Emotion and Affect Recognition toolkit openEAR [29], built on top of openSMILE [30]. The features include energy/spectral Low-Level Descriptors (LLD) and voicing-related LLD. More details about the features are given in [31] and [32].

3.3 Feature Fusion and Multiple Kernel Learning for Classification
Once we have acquired the different types of features, we need to find an optimal way to combine them to improve the discriminatory ability of the classifier. A traditional Support Vector Machine (SVM) [33] simply concatenates the features and builds a single kernel for the combined feature vector. Multiple Kernel Learning (MKL) [34] instead constructs a kernel for each feature type and finds an optimal linear combination of the kernels. Given a training set with labeled samples

D = \{(x_i, y_i) \mid x_i \in \mathbb{R}^n, \; y_i \in \{-1, 1\}\}_{i=1}^{N},

the dual formulation of the traditional single-kernel SVM optimization problem is given by

\max_{\alpha} \; \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j k(x_i, x_j)    (2)

subject to \sum_{i=1}^{N} y_i \alpha_i = 0 and 0 \le \alpha_i \le C.

MKL applies a linear combination of multiple kernels in place of the single kernel; the combination is defined in Eq. (3). In our study, we adopt the formulation proposed in [8], in which the kernel is a convex combination of basis kernels:

K(x_i, x_j) = \sum_{m=1}^{M} \beta_m k_m(x_i, x_j), \quad \beta_m \ge 0, \; \sum_{m=1}^{M} \beta_m = 1.    (3)

The basis kernels can be linear kernels, radial basis function (RBF) kernels, polynomial kernels, and so on. MKL learns the kernel weights beta together with the alpha coefficients. We employ the SimpleMKL algorithm proposed in [8] to solve this problem.
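To illustrate how the fused kernel of Eq. (3) is used for classification, the sketch below fixes the kernel weights to a given convex combination and trains a standard SVM on the resulting precomputed kernel (scikit-learn's SVC with kernel='precomputed'). SimpleMKL additionally optimizes the weights jointly with the SVM coefficients; that step is omitted here, and the variable names in the commented usage are illustrative.

```python
import numpy as np
from sklearn.svm import SVC

def linear_kernel(A, B):
    # k(x_i, x_j) = <x_i, x_j> for all pairs of rows of A and B.
    return A @ B.T

def fused_kernel(feats_a, feats_b, betas):
    """Convex combination of base kernels, K = sum_m beta_m * k_m (Eq. (3))."""
    assert np.all(np.asarray(betas) >= 0) and np.isclose(sum(betas), 1.0)
    return sum(b * linear_kernel(Fa, Fb)
               for b, Fa, Fb in zip(betas, feats_a, feats_b))

# Illustrative usage with visual (HOG_TOP) and audio feature matrices:
#   betas = [0.73, 0.27]  # e.g. weights found on the validation set
#   K_train = fused_kernel([X_vis_tr, X_aud_tr], [X_vis_tr, X_aud_tr], betas)
#   clf = SVC(kernel='precomputed', C=1.0).fit(K_train, y_tr)
#   K_test = fused_kernel([X_vis_te, X_aud_te], [X_vis_tr, X_aud_tr], betas)
#   y_pred = clf.predict(K_test)
```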

4. EXPERIMENTS AND DISCUSSIONS
To evaluate our method, we conducted experiments on the Acted Facial Expressions in the Wild (AFEW) 4.0 dataset [6], which was collected from movies and shows close-to-real-world conditions. Some facial images extracted from the dataset are shown in Figure 3. From these images, we can see that variations in illumination, pose and background contribute to the complexity of emotion analysis. For the emotion recognition challenge, the dataset is divided into three parts: training, validation and test. There are 578 video clips in the training set; the validation and test sets contain 383 and 407 video clips, respectively. The task is to classify an audio-video clip into one of seven categories: Anger, Disgust, Fear, Happiness, Neutral, Sadness and Surprise.

Figure 3: Some facial images from the dataset. The variations in illumination, pose and background increase the complexity of emotion analysis.

Face extraction in the wild is a challenging problem. Variations in pose, illumination, and background contribute to its complexity. The classic Viola-Jones face detector [35] is effective for frontal faces, but it is not robust enough for faces in the wild. Recently, researchers have paid attention to deformable parts models. Zhu and Ramanan's model [13], which is based on a mixture of trees with a shared pool of parts, has been widely used for face detection in the wild. In their method, every facial landmark is modeled as a part, and global mixtures are used to capture topological changes in viewpoint. The model is able to estimate the pose of a face and detect its facial landmarks, and face alignment can be done once the pose estimate and landmarks are obtained. The organizers of the challenge adopted this model to extract faces and provided the aligned faces for the challenge.

As described in the previous section, we extract the HOG_TOP features from the aligned faces. The size of each aligned face is 128x128. We divide the face image into 8x8 non-overlapping blocks, each of size 16x16, set the number of orientation bins to 9 and the orientation range to 0-180 degrees. The final HOG_TOP feature is therefore a 1x1728 vector. For the acoustic features, we use the features supplied by the organizers; each audio feature is a 1x1584 vector.
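As a quick check, the descriptor length follows directly from these settings:

```python
# 128x128 face, 16x16 blocks -> 8x8 = 64 blocks; 9 bins per plane; 3 planes.
num_blocks = (128 // 16) * (128 // 16)   # 64
hog_top_dim = num_blocks * 9 * 3         # 1728, matching the 1x1728 vector
```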

We first evaluate the performance of our method on the validation set. Three feature combinations are explored: audio features only, visual features only, and the combined audio-visual features. The audio features achieve a classification accuracy of 32.89% on the validation set, and the visual features achieve 35.77%. The confusion matrices of the individual feature sets are shown in Table 1 and Table 2, respectively.

Table 1: The confusion matrix showing the performance of the audio features on the validation set (%)
      AN     DI     FE     HA     NE     SA     SU
AN   64.06   3.13   4.69  12.50  10.94   1.56   3.13
DI   17.50  15.00   0.00  17.50  35.00  10.00   5.00
FE   32.61   2.17  26.09  13.04  15.22   2.17   8.70
HA   23.81   3.17  11.11  34.92  26.98   0.00   0.00
NE    7.94   9.52   6.35  14.29  57.14   3.17   1.59
SA    1.64  14.75  11.48  22.95  29.51  14.75   4.92
SU   17.39   8.70  13.04  17.39  34.78   8.70   0.00

Table 2: The confusion matrix showing the performance of the visual features on the validation set (%)
      AN     DI     FE     HA     NE     SA     SU
AN   76.56   6.25   1.56   1.56  10.94   3.13   0.00
DI   27.50  17.50   0.00  15.00  32.50   7.50   0.00
FE   39.13   8.70  15.22  15.22  19.57   2.17   0.00
HA    7.94   4.76   6.35  63.49  15.87   1.59   0.00
NE    9.52   7.94   0.00   7.94  69.84   4.76   0.00
SA   21.31  18.03   0.00   6.56  44.26   9.84   0.00
SU   21.74   4.35   6.52  15.22  50.00   0.00   2.17

From the experimental results, we can see that the emotions "angry", "happy" and "neutral" obtain higher accuracies than the other emotions, while the accuracy for "surprise" is the lowest: nearly all of its predictions are incorrect. In this dataset, "surprise" is easily confused with the other emotions, in particular with "neutral".

We further integrate the two feature types to achieve better performance on the validation set. We build a linear kernel for each feature set and apply the SimpleMKL algorithm to optimize the combination weights, so two kernels are used: one for the audio features and one for the visual features. In our experiments the best performance is achieved when the kernel weights of the visual and audio features are 0.73 and 0.27, respectively. The overall classification accuracy is 40.21%, higher than that of the audio or visual features used alone. Table 3 shows the confusion matrix of the combined audio-visual features on the validation set. From Table 3, we can see that the classification accuracies of "angry", "happy" and "neutral" improve, but the accuracy of "surprise" remains the lowest.

Table 3: The confusion matrix showing the performance of the audio-visual features on the validation set (%)
      AN     DI     FE     HA     NE     SA     SU
AN   73.44   4.69   6.25   1.56  14.06   0.00   0.00
DI   22.50  22.50   7.50  12.50  32.50   2.50   0.00
FE   52.17  10.87   4.35  10.87  17.39   4.35   0.00
HA    4.76   6.35   7.94  60.32  19.05   1.59   0.00
NE   15.87   3.17   4.76   9.52  58.73   7.94   0.00
SA   31.15  13.11   0.00   8.20  42.62   4.92   0.00
SU   26.09   2.17   6.52   8.70  50.00   4.35   2.17

We also compare our results with the baseline method [36] on the validation set, as shown in Figure 4. The baseline method applies LBP_TOP [17] to extract visual features from the aligned faces and learns a non-linear SVM for classification, obtaining an accuracy of 34.4%. Its audio features are obtained with the openSMILE toolkit [30] and classified with a linear SVM, giving an accuracy of 26.2%. Finally, a feature-level fusion is performed, in which the audio and video features are concatenated and a non-linear SVM is trained for classification, achieving an accuracy of 28.2%. From Figure 4 we can see that our method outperforms the baseline: with audio features alone, our classification accuracy is about 6% higher than the baseline; with visual features, it is also higher (35.77% versus 34.4%); and with the two features combined, it is about 12% higher.

Figure 4: Experimental results on the validation set.

We also evaluated our method on the test set, which consists of 407 video clips whose ground truth is not released. We made predictions for all the video clips and sent the predicted labels to the organizers, who compared them with the ground truth and calculated the overall classification accuracy. For the test set, we tried several different settings when extracting the HOG_TOP features from the aligned faces. The best performance on the test set is 45.21%, obtained when the face size is 64x64 and the face is divided into 8x8 non-overlapping blocks, each of size 8x8. The corresponding confusion matrix is shown in Table 4, and the experimental results on the test set are shown in Figure 5. For the baseline method on the test set, video only achieves a classification accuracy of 33.7%, audio only achieves 26.8%, and audio-video feature fusion achieves 24.6%. Our method is therefore about 11% higher than the best performance of the baseline method.

Table 4: The confusion matrix showing the performance of the audio-visual features on the test set (%)
      AN     DI     FE     HA     NE     SA     SU
AN   68.97   5.17   6.90   3.45   6.90   8.62   0.00
DI    3.85  30.77   7.69   3.85  19.23  26.92   7.69
FE   13.04   8.70  41.30   4.35   6.52  17.39   8.70
HA   11.11   8.64   4.94  33.33  16.05  20.99   4.94
NE    5.13  12.82   4.27   4.27  52.14  13.68   7.69
SA    7.55   3.77   7.55  13.21  22.64  35.85   9.43
SU    7.69   7.69  11.54   3.85  15.38  15.38  38.46

Figure 5: Experimental results on the test set.
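The accuracies and row-normalized confusion matrices reported above can be reproduced from predicted and true labels in the usual way; a brief sketch is given below, where the label names and variables are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

LABELS = ["AN", "DI", "FE", "HA", "NE", "SA", "SU"]

def evaluate(y_true, y_pred):
    """Overall accuracy plus a row-normalized confusion matrix in percent."""
    acc = accuracy_score(y_true, y_pred)
    cm = confusion_matrix(y_true, y_pred, labels=LABELS).astype(float)
    cm = 100.0 * cm / cm.sum(axis=1, keepdims=True)
    return acc, np.round(cm, 2)
```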

5. CONCLUSION
Emotion recognition in the wild is a challenging problem: factors such as illumination, head pose and complex backgrounds increase its difficulty. In this paper, we propose an approach for emotion analysis under unconstrained, near real-world conditions. We employ multiple modalities, including visual and acoustic information, for emotion recognition. We propose a new feature descriptor, HOG_TOP, to extract dynamic visual features from video sequences, and use Multiple Kernel Learning (MKL) to find an optimal combination of the visual and audio features. Our approach achieves a promising performance on both the validation and test sets, with overall classification accuracies of 40.21% and 45.21%, respectively, which are significantly better than the baseline results.

6. ACKNOWLEDGMENT
This work was partially supported by a research grant from The Hong Kong Polytechnic University (Project Code: G-YL77).

7. REFERENCES
[1] M. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba, "Coding facial expressions with Gabor wavelets," in Proc. Third IEEE International Conference on Automatic Face and Gesture Recognition, 1998, pp. 200-205.
[2] T. Sim, S. Baker, and M. Bsat, "The CMU pose, illumination, and expression database," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, pp. 1615-1618, 2003.
[3] M. Pantic, M. Valstar, R. Rademaker, and L. Maat, "Web-based database for facial expression analysis," in Proc. IEEE International Conference on Multimedia and Expo (ICME), 2005.
[4] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker, "Multi-PIE," Image and Vision Computing, vol. 28, pp. 807-813, 2010.
[5] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, "Static facial expression analysis in tough conditions," in Proc. IEEE International Conference on Computer Vision Workshops (ICCV Workshops), 2011, pp. 2106-2112.
[6] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon, "A semi-automatic method for collecting richly labelled large facial expression databases from movies," IEEE MultiMedia, 2012.
[7] A. Dhall, R. Goecke, J. Joshi, M. Wagner, and T. Gedeon, "Emotion recognition in the wild challenge 2013," in Proc. 15th ACM International Conference on Multimodal Interaction, 2013, pp. 509-516.
[8] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet, "SimpleMKL," Journal of Machine Learning Research, vol. 9, pp. 2491-2521, 2008.
[9] S. Z. Li and A. K. Jain, Handbook of Face Recognition. Springer, 2011.
[10] T. F. Cootes, G. J. Edwards, and C. J. Taylor, "Active appearance models," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, pp. 681-685, 2001.
[11] I. Matthews and S. Baker, "Active appearance models revisited," International Journal of Computer Vision, vol. 60, pp. 135-164, 2004.
[12] D. Cristinacce and T. F. Cootes, "Feature detection and tracking with constrained local models," in Proc. British Machine Vision Conference (BMVC), 2006, pp. 929-938.
[13] X. Zhu and D. Ramanan, "Face detection, pose estimation and landmark localization in the wild," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2879-2886.
[14] H. G. Feichtinger and T. Strohmer, Gabor Analysis and Algorithms: Theory and Applications. Springer, 1998.
[15] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, pp. 971-987, 2002.
[16] V. Ojansivu and J. Heikkilä, "Blur insensitive texture classification using local phase quantization," in Proc. Image and Signal Processing, 2008, pp. 236-243.
[17] G. Zhao and M. Pietikainen, "Dynamic texture recognition using local binary patterns with an application to facial expressions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 29, pp. 915-928, 2007.
[18] J. Päivärinta, E. Rahtu, and J. Heikkilä, "Volume local phase quantization for blur-insensitive dynamic texture classification," in Proc. 17th Scandinavian Conference on Image Analysis, 2011, pp. 360-369.
[19] A. Dhall, A. Asthana, R. Goecke, and T. Gedeon, "Emotion recognition using PHOG and LPQ features," in Proc. IEEE International Conference on Automatic Face & Gesture Recognition and Workshops, 2011, pp. 878-883.
[20] Y. Kim, H. Lee, and E. M. Provost, "Deep learning for robust feature generation in audiovisual emotion recognition," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 3687-3691.
[21] M. Liu, R. Wang, Z. Huang, S. Shan, and X. Chen, "Partial least squares regression on Grassmannian manifold for emotion recognition," in Proc. 15th ACM International Conference on Multimodal Interaction, 2013, pp. 525-530.
[22] K. Sikka, K. Dykstra, S. Sathyanarayana, G. Littlewort, and M. Bartlett, "Multiple kernel learning for emotion recognition in the wild," in Proc. 15th ACM International Conference on Multimodal Interaction, 2013, pp. 517-524.
[23] S. E. Kahou, C. Pal, X. Bouthillier, P. Froumenty, Ç. Gülçehre, R. Memisevic, et al., "Combining modality specific deep neural networks for emotion recognition in video," in Proc. 15th ACM International Conference on Multimodal Interaction, 2013, pp. 543-550.
[24] M. Gönen and E. Alpaydın, "Multiple kernel learning algorithms," Journal of Machine Learning Research, vol. 12, pp. 2211-2268, 2011.
[25] J. A. Russell, J. A. Bachorowski, and J. M. Fernandez-Dols, "Facial and vocal expressions of emotion," Annual Review of Psychology, vol. 54, pp. 329-349, 2003.
[26] X. Zhang, M. H. Mahoor, and R. M. Voyles, "Facial expression recognition using HessianMKL based multiclass-SVM," in Proc. 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), 2013, pp. 1-6.
[27] X. Zhang, M. H. Mahoor, S. M. Mavadati, and J. F. Cohn, "A lp-norm MTMKL framework for simultaneous detection of multiple facial action units," in Proc. IEEE Winter Conference on Applications of Computer Vision (WACV), 2014, pp. 1104-1111.
[28] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2005, pp. 886-893.
[29] F. Eyben, M. Wöllmer, and B. Schuller, "OpenEAR: introducing the Munich open-source emotion and affect recognition toolkit," in Proc. 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops (ACII), 2009, pp. 1-6.
[30] F. Eyben, M. Wöllmer, and B. Schuller, "openSMILE: the Munich versatile and fast open-source audio feature extractor," in Proc. ACM International Conference on Multimedia, 2010, pp. 1459-1462.
[31] B. Schuller, S. Steidl, A. Batliner, F. Burkhardt, L. Devillers, C. A. Müller, et al., "The INTERSPEECH 2010 paralinguistic challenge," in Proc. INTERSPEECH, 2010, pp. 2794-2797.
[32] B. Schuller, M. Valstar, F. Eyben, G. McKeown, R. Cowie, and M. Pantic, "AVEC 2011: the first international audio/visual emotion challenge," in Affective Computing and Intelligent Interaction. Springer, 2011, pp. 415-424.
[33] C. J. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery, vol. 2, pp. 121-167, 1998.
[34] G. R. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan, "Learning the kernel matrix with semidefinite programming," Journal of Machine Learning Research, vol. 5, pp. 27-72, 2004.
[35] P. Viola and M. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, pp. 137-154, 2004.
[36] A. Dhall, R. Goecke, J. Joshi, K. Sikka, and T. Gedeon, "Emotion recognition in the wild challenge 2014: baseline, data and protocol," in Proc. ACM International Conference on Multimodal Interaction, 2014.