Emotion Recognition with Multimodal Features and Temporal Models

Shuai Wang, Wenxuan Wang, Jinming Zhao, Shizhe Chen, Qin Jin
School of Information, Renmin University of China
[email protected]

Shilei Zhang, Yong Qin
IBM Research Lab, Beijing, China
[email protected]

ABSTRACT


This paper presents our methods for the Audio-Video Based Emotion Recognition subtask in the 2017 Emotion Recognition in the Wild (EmotiW) Challenge. The task aims to predict one of seven basic emotions for short video segments. We extract different features from the audio and facial expression modalities. We also explore a temporal LSTM model that takes frame-level facial features as input, which improves over the non-temporal models. The fusion of the different modality features and the temporal model leads us to a 58.5% accuracy on the testing set, which demonstrates the effectiveness of our methods.

CCS CONCEPTS
• Computing methodologies → Artificial intelligence; Computer vision; Machine learning;

KEYWORDS
Emotion Recognition; Multimodal Features; CNN; LSTM

ACM Reference Format:
Shuai Wang, Wenxuan Wang, Jinming Zhao, Shizhe Chen, Qin Jin, Shilei Zhang, and Yong Qin. 2017. Emotion Recognition with Multimodal Features and Temporal Models. In Proceedings of the 19th ACM International Conference on Multimodal Interaction (ICMI'17). ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3136755.3143016

1 INTRODUCTION

Automatic emotion recognition is a challenging task that has attracted much research interest in recent years, with wide applications in human-computer interaction and human behavior analysis. The Emotion Recognition in the Wild (EmotiW) Challenges [6, 7] provide a public benchmark for promoting automatic emotion recognition in spontaneous environments. The Acted Facial Expressions in the Wild (AFEW) database, the dataset of this challenge, consists of short emotional video clips extracted from movies and TV shows, which simulate emotional behaviors in real life. The task is to assign each video segment a single label from seven basic emotions (Anger, Disgust, Fear, Happiness, Neutral, Sadness and Surprise). The wide variety in the videos, such as context scenes, subjects, poses, illumination and occlusions, makes the task very challenging.

Facial expression features play the key role in emotion recognition. The winner of EmotiW 2015 [25] automatically learns AU-aware features for each pair of emotions and encodes the latent relations of the learned AU patches as the facial feature. For expression recognition, CNN-based features [26] have shown state-of-the-art performance. Yan et al. [24] use the trajectories of 68 facial landmarks to capture the dynamic clues of facial images. Hand-crafted visual features, such as HOG, LBP-TOP and LPQ-TOP, have also proven effective in previous work [20]. Text features extracted from the speech content of the videos, however, are less useful, as shown in [4], because of inaccurate transcriptions from the speech recognizer and insufficient training data. Acoustic features generally cannot outperform visual features, but combining the two can further improve emotion recognition performance due to the complementary information between the audio and visual modalities. Previous work [20, 22, 24, 25] in the challenges shows the great benefit of multimodal fusion.

The fusion strategies for different modalities can be classified into three categories: feature-level (early) fusion, decision-level (late) fusion and model-level fusion [21]. Early fusion concatenates features from different modalities into a single input feature before classification. It has been widely used in the literature and successfully improves performance [18], but it suffers from the curse of dimensionality. Late fusion trains a high-level model such as an RVM [12] on the classification results of the individual modalities and thus avoids the high dimensionality of concatenated features, but it ignores the interactions and correlations between features from different modalities. Model-level fusion is a compromise between these two extremes; typical model-level fusion methods are kernel fusion [3] and concatenating the hidden layers of different neural networks.

Temporal information is also helpful for video emotion recognition. Kahou et al. [14] propose to use recurrent neural networks to encode sequential information and report good performance. The winner of EmotiW 2016 [15] trains CNN-RNN and C3D hybrid networks to learn discriminative features that contain temporal information.

In this task, we explore different features from the audio and facial expression modalities. For the acoustic features, two kinds of frame-level features, the IS10 features and the MFCC features, are extracted and encoded into video-level features via Bag-of-Audio-Words, Fisher Vectors and VLAD. For the facial features, we train different convolutional neural networks, the VGG network and the DenseNet network, from scratch on the FER+ facial expression dataset. We apply multiple classifiers, including SVM, Random Forest, Logistic Regression and Neural Network, to the video-level encoded features. To capture the dynamics of facial movements, we further explore a temporal LSTM model that takes frame-level facial features as input, which performs better than the non-temporal models. The fusion of features from different modalities and the temporal model leads us to a 58.5% accuracy on the testing set, which demonstrates the effectiveness of our methods.

The paper is organized as follows. Section 2 describes the features we extract from the different modalities. Section 3 details our non-temporal and temporal models. Section 4 presents our experimental results, and Section 5 concludes the paper and outlines future work.

2 MULTIMODAL FEATURES

Features play an essential role in classification tasks, and emotions are expressed through multiple modalities. In this section, we describe the features extracted from the audio and the facial images to represent, from different aspects, the emotion conveyed in a video.


2.1 Acoustic Features

For the acoustic features, we first extract low-level frame-level descriptors and then apply different encoding strategies to transform them into video-level features. The details are described as follows.

2.1.1 Frame-level Acoustic Features. We extract two kinds of expert-knowledge based frame-level acoustic features: the IS10 features and the MFCC features. The IS10 feature set, which includes energy, pitch, jitter, shimmer and related descriptors, is extracted with the open-source toolkit OpenSMILE [9] using the INTERSPEECH 2010 Paralinguistic Challenge configuration [19]. MFCCs [5] are the most commonly used low-level features in many audio tasks, such as speech recognition and audio event detection, so we extract MFCCs as a second type of frame-level feature, with a window of 25 ms and a shift of 10 ms.

2.1.2 Video-level Feature Encoding. Our goal is to recognize the emotion state of each video rather than each frame, so it is necessary to encode the frame-level features into video-level features. We apply three different encoding methods: Bag-of-Audio-Words (BoAW) [16], Fisher Vectors (FV) [17] and VLAD [13].
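For illustration, the sketch below shows one way to realize this pipeline, assuming librosa for MFCC extraction and scikit-learn for the GMM underlying the Fisher Vector; the file paths, number of Gaussian components and other settings are illustrative assumptions, not our exact configuration.

```python
# Sketch: MFCC frame-level features encoded into a video-level Fisher Vector.
# Assumes librosa and scikit-learn; paths and parameters are placeholders.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def extract_mfcc(wav_path, sr=16000, n_mfcc=13):
    """Frame-level MFCCs with a 25 ms window and 10 ms shift."""
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=int(0.025 * sr),
                                hop_length=int(0.010 * sr))
    return mfcc.T                                  # (num_frames, n_mfcc)

def fisher_vector(frames, gmm):
    """First-order Fisher Vector of frame features w.r.t. a diagonal GMM."""
    post = gmm.predict_proba(frames)               # (T, K) posteriors
    T, _ = post.shape
    diff = frames[:, None, :] - gmm.means_[None, :, :]          # (T, K, D)
    diff /= np.sqrt(gmm.covariances_)[None, :, :]
    fv = (post[:, :, None] * diff).sum(axis=0)                  # (K, D)
    fv /= (T * np.sqrt(gmm.weights_)[:, None])
    fv = fv.ravel()
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                      # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                    # L2 normalization

# Fit the GMM codebook on frames pooled from training clips (hypothetical paths),
# then encode each clip into a fixed-length video-level feature.
train_frames = np.vstack([extract_mfcc(p) for p in ["train1.wav", "train2.wav"]])
gmm = GaussianMixture(n_components=32, covariance_type="diag").fit(train_frames)
video_feature = fisher_vector(extract_mfcc("clip.wav"), gmm)
```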

2.2 Facial Features

Facial expression features [15, 25] show strong discriminative ability in emotion recognition tasks, especially when a video contains no speech. We capture frames from the videos and use the open-source toolkit SeetaFace [23] to detect and align faces and generate facial images, which are used for network finetuning and feature extraction. We train CNN-based models on the FER+ facial expression recognition dataset [1], which contains 35,887 images with seven emotion classes. To make the models more suitable for our task, we finetune the trained networks on the AFEW dataset; during finetuning, we use the predictions of the trained networks as the frame-level face labels. Salient facial features are then extracted from the finetuned networks.

2.2.1 VGGFace Features. We use VGGFace [2], a state-of-the-art model for the facial expression recognition task, to learn salient facial features. Different layers capture features at different levels of specificity, so we extract the "pool4", "fc5", "fc6" and "prob" layer activations from the finetuned model as facial expression features.

2.2.2 DenseFace Features. The recently proposed Densely Connected Convolutional Network (DenseNet) is one of the state-of-the-art CNN structures for image recognition. It connects all preceding layers as the input of each layer, which strengthens feature propagation and alleviates the vanishing gradient problem. Moreover, due to feature reuse, DenseNet only needs to learn a small set of new feature maps in each layer and thus requires fewer parameters than traditional CNNs, making it more suitable for small datasets. We follow the DenseNet-BC structure proposed by Huang et al. [11] with a growth rate of 12 and a depth of 100, which uses about 86% fewer parameters than our VGGFace model and achieves 2% higher accuracy on the FER+ testing set. Activations of the last mean pooling layer are extracted as the DenseFace feature.
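As an illustration of how per-frame features can be pulled from an intermediate layer of such a network, the sketch below uses a forward hook in PyTorch; a torchvision VGG stands in for our finetuned VGGFace/DenseFace models, and the layer index, image path and preprocessing are assumptions rather than our actual setup.

```python
# Sketch: extracting per-frame facial features from an intermediate CNN layer.
# A torchvision VGG is a stand-in for the finetuned VGGFace/DenseFace models.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

model = models.vgg16(num_classes=7)   # stand-in; load finetuned weights in practice
model.eval()

features = {}
def hook(name):
    def _hook(module, inputs, output):
        features[name] = output.detach()
    return _hook

# Register a hook on the layer whose activations we want (here the last pooling layer).
model.features[30].register_forward_hook(hook("pool5"))

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def frame_feature(face_img_path):
    """Return a flattened feature vector for one aligned face image."""
    img = preprocess(Image.open(face_img_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        model(img)
    return features["pool5"].flatten(1).squeeze(0)   # 512*7*7 for VGG16 pool5

feat = frame_feature("aligned_face_0001.jpg")        # hypothetical path
```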

3 MODELS

Our models include non-temporal models, such as the Support Vector Machine (SVM), Random Forest (RF), Logistic Regression (LR) and Neural Network (NN), and a temporal model, the Long Short-Term Memory (LSTM) recurrent neural network [8, 15].

3.1 Non-Temporal Models

We choose SVM, RF, LR and NN as our non-temporal models for both the acoustic and the visual features. The hyperparameters of these models are selected by grid search on our local validation set using the video-level features. To fuse multimodal information, the acoustic features, the deep facial features and the hand-crafted features are concatenated as the input to the classifiers.
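A minimal sketch of this early-fusion setup with scikit-learn is shown below; the feature matrices, their dimensions and the classifier settings are placeholders rather than our trained models.

```python
# Sketch: early fusion of video-level features fed to the non-temporal classifiers.
# X_audio, X_face and X_lbp are placeholder (num_videos x dim) matrices.
import numpy as np
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
n = 200
X_audio = rng.normal(size=(n, 128))     # e.g. an IS10.FV-style feature (dim assumed)
X_face = rng.normal(size=(n, 342))      # e.g. a DenseFace-style feature (dim assumed)
X_lbp = rng.normal(size=(n, 96))        # e.g. an LBP-TOP-style feature (dim assumed)
y = rng.integers(0, 7, size=n)          # seven emotion classes

# Feature-level (early) fusion: simple concatenation.
X = np.hstack([X_audio, X_face, X_lbp])

classifiers = {
    "SVM": SVC(kernel="rbf", C=1.0, class_weight="balanced"),
    "RF": RandomForestClassifier(n_estimators=300, max_depth=10,
                                 class_weight="balanced"),
    "LR": LogisticRegression(multi_class="ovr", class_weight="balanced",
                             max_iter=1000),
    "NN": MLPClassifier(hidden_layer_sizes=(128,), max_iter=500),
}
for name, clf in classifiers.items():
    clf.fit(X, y)
    print(name, "train accuracy:", clf.score(X, y))
```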

3.2 Temporal Models

For facial expression features, temporal information is useful, so we train an LSTM as a temporal model to exploit it. The LSTM is a state-of-the-art model for capturing temporal information in videos, since it uses gates and memory cells to store useful information and can therefore exploit long-range dependencies [10]. For the LSTM variant used in this paper [10], the gate and cell functions are defined as follows:

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)
f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_{t-1} + b_o)        (1)
c_t = f_t \cdot c_{t-1} + i_t \cdot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)
h_t = o_t \cdot \tanh(c_t)

where i, f, o and c denote the input gate, forget gate, output gate and cell activation vectors, respectively, all of the same size as the hidden state; \sigma is the logistic sigmoid function and \cdot denotes element-wise multiplication; h_{t-1} and x_t denote the previous output state and the current input; W and b are learned parameters.

The overview of our temporal model system is shown in Figure 1. The hybrid network is based on an encoder-decoder structure. In the encoding stage, the input layer feeds the frame-level CNN features in temporal order to the LSTM layer, which yields an output at each timestep. A mean pooling layer averages these outputs into a video-level facial feature. We then concatenate the video-level facial features with the hand-crafted features as the input to the decoding stage, which consists of a fully-connected layer and a softmax layer. The fully-connected layer performs dimension reduction and the softmax yields a 7-dimensional probability vector over the emotion classes.
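A minimal PyTorch sketch of this encoder-decoder structure is given below, assuming the layer sizes of Section 4.2; the feature dimensions and the way the hand-crafted features are injected are illustrative assumptions, not the exact implementation.

```python
# Sketch: LSTM encoder with mean pooling over time, concatenated with
# hand-crafted video-level features, followed by a FC + softmax decoder.
import torch
import torch.nn as nn

class TemporalEmotionModel(nn.Module):
    def __init__(self, cnn_dim=342, handcrafted_dim=96,
                 hidden_dim=128, num_classes=7):
        super().__init__()
        self.lstm = nn.LSTM(cnn_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim + handcrafted_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frame_feats, handcrafted):
        # frame_feats: (batch, timesteps, cnn_dim); handcrafted: (batch, handcrafted_dim)
        outputs, _ = self.lstm(frame_feats)      # output at every timestep
        video_feat = outputs.mean(dim=1)         # mean pooling over time
        fused = torch.cat([video_feat, handcrafted], dim=1)
        logits = self.classifier(torch.relu(self.fc(fused)))
        return logits                            # softmax is applied inside the loss

model = TemporalEmotionModel()
frame_feats = torch.randn(64, 64, 342)           # batch of 64 clips, 64 timesteps
handcrafted = torch.randn(64, 96)
loss = nn.CrossEntropyLoss()(model(frame_feats, handcrafted),
                             torch.randint(0, 7, (64,)))
loss.backward()
```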


4 EXPERIMENTS

4.1 Dataset
The AFEW dataset contains 1809 video clips extracted from movies and TV shows, which are divided into three parts: 773 for training, 383 for validation and 653 for testing. We randomly split the official validation set into a local validation set and a local testing set of 200 and 183 clips, respectively.
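A minimal sketch of this split is shown below, assuming a list of clip identifiers for the official validation set; the clip names and the random seed are hypothetical.

```python
# Sketch: randomly splitting the official validation set (383 clips) into a
# local validation set (200 clips) and a local testing set (183 clips).
import random

val_clips = [f"val_{i:03d}" for i in range(383)]   # placeholder clip IDs
random.seed(0)                                     # hypothetical seed
random.shuffle(val_clips)
local_val, local_test = val_clips[:200], val_clips[200:]
assert len(local_val) == 200 and len(local_test) == 183
```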

4.2 Experimental Setup

For the non-temporal models, SVM, RF, LR and NN are trained as basic classifiers. The hyperparameters are selected according to the accuracy on the local validation set. For SVM, linear and RBF kernels are applied and we search the cost parameter from 2^-3 to 2^15. For random forest, the number of trees is selected from 100 to 1000 with a step of 100, and the depth of the trees is searched from 2 to 20. For logistic regression, we use the one-vs-rest scheme to train multiple binary classifiers without further hyperparameter tuning. For NN, we use a one-layer network with 128 hidden units. Because of the unbalanced distribution over emotion categories, we use the "balanced" class-weight mode in our experiments. For the temporal model shown in Figure 1, we use a one-layer LSTM and average the timestep outputs as the video representation in the encoder, followed by a fully-connected layer in the decoder. The number of neurons in the LSTM layer and the fully-connected layer is set to 128. We use mini-batches of size 64, a timestep length of 64 and a learning rate of 0.01.
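For illustration, the grid search described above could be set up with scikit-learn roughly as follows; the fixed local validation split is emulated with a PredefinedSplit, and the data arrays and feature dimensions are placeholders.

```python
# Sketch: hyperparameter grids matching the ranges above, evaluated on a fixed
# local validation split via PredefinedSplit. Data arrays are placeholders.
import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(973, 64))                  # train (773) + local val (200), dim assumed
y = rng.integers(0, 7, size=973)
test_fold = np.r_[np.full(773, -1), np.zeros(200)]   # -1 = always train, 0 = validation
cv = PredefinedSplit(test_fold)

svm_grid = {
    "kernel": ["linear", "rbf"],
    "C": [2.0 ** p for p in range(-3, 16)],     # 2^-3 ... 2^15
}
rf_grid = {
    "n_estimators": list(range(100, 1001, 100)),
    "max_depth": list(range(2, 21)),
}

svm_search = GridSearchCV(SVC(class_weight="balanced"), svm_grid, cv=cv)
rf_search = GridSearchCV(RandomForestClassifier(class_weight="balanced"),
                         rf_grid, cv=cv)
svm_search.fit(X, y)
rf_search.fit(X, y)
print(svm_search.best_params_, rf_search.best_params_)
```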

4.3 Non-temporal Results

To identify the most effective acoustic features, we explore the BoAW, FV and VLAD strategies for encoding the low-level features; the detailed results are presented in Table 1. The FV encoding performs better than the others, and the MFCC.FV and IS10.FV features achieve the best performance among the MFCC-based and IS10-based features, respectively.

The results of the different facial features on the local testing set are presented in Table 2. Two important observations can be made. First, the deep learned features (VGGFace and DenseFace) are more effective than the hand-crafted LBP-TOP features. Second, the CNN structure is important for learning discriminative features: compared to the VGGFace features, the DenseFace features significantly improve the performance.

We hypothesize that features from different modalities are complementary and that fusing them can improve the performance. The multimodal results of the non-temporal models on the local testing set are shown in Table 3. The concatenation of the DenseFace, IS10.FV and LBP-TOP features further boosts the performance, so we use this configuration in the temporal experiments.


Figure 1: Overview of the temporal model system.

Table 1: Acoustic feature results (accuracy, %) on the local testing set.

Feature      SVM    RF     LR     NN
IS10.BoAW    34.0   35.0   32.0   23.5
IS10.FV      35.5   38.0   36.5   37.5
IS10.VLAD    35.5   35.0   33.5   35.0
MFCC.BoAW    32.5   35.0   27.5   17.0
MFCC.FV      39.5   30.5   33.5   36.0
MFCC.VLAD    34.5   33.0   32.0   31.0

Table 2: Facial feature results (accuracy, %) on the local testing set.

Feature        SVM    RF     LR     NN
LBP-TOP        39.0   33.0   35.0   34.5
VGGFace.fc5    41.5   41.0   41.0   41.5
VGGFace.fc6    40.0   43.0   42.5   41.0
DenseFace      46.0   40.0   44.5   45.5

Table 3: Multimodal results (accuracy, %) of the non-temporal models on the local testing set.

Feature                       Non-temporal
DenseFace+MFCC.FV             47.5
DenseFace+IS10.FV             47.0
DenseFace+MFCC.FV+LBP-TOP     48.75
DenseFace+IS10.FV+LBP-TOP     49.0

Table 4: Temporal model results (accuracy, %) on the local testing set.

Feature                       Non-temporal   Temporal
VGGFace                       43.0           43.5
DenseFace                     46.0           47.0
DenseFace+IS10.FV             47.0           47.5
DenseFace+IS10.FV+LBP-TOP     49.0           49.0

4.4 Temporal Model Results

We compare the accuracy of the non-temporal model with that of the temporal model on the multimodal features. As shown in Table 4, several observations can be made. First, the DenseFace features yield a significant improvement over the VGGFace features with both the temporal and the non-temporal model, which also confirms the conclusion drawn from Table 2. Second, the temporal model obtains a slight improvement over the non-temporal model, which indicates that temporal information is helpful for emotion recognition. Third, the IS10.FV and LBP-TOP features provide complementary information and are helpful for the emotion recognition task. Accordingly, we adopt the three modality features and the temporal model in our multimodal fusion system.

Inspired by [15], we combine the training set and the validation set to train the temporal model in the final training phase. The results of our submissions are shown in Table 5. Compared to the DenseFace features alone, we gain 2.1% when fusing the IS10.FV features in the decoder and 3.8% when additionally fusing the LBP-TOP features.

Table 5: The multimodal results (accuracy, %) on the testing set.

Sub   Feature                          Test
3     DenseFace                        54.7
4     DenseFace + IS10.FV              56.9
6     DenseFace + IS10.FV + LBP-TOP    58.5

Figure 2: Confusion Matrix of the Best Submission.

The fusion of multimodal features and the temporal model leads us to achieve our best result of 58.5% accuracy on the testing set. The confusion matrix of our best submission on the testing set is shown in Figure 2.

5 CONCLUSIONS

In this paper, we investigate multimodal feature representations and an LSTM temporal model for the EmotiW 2017 Audio-Video Emotion Recognition Challenge. Various features from the acoustic and visual modalities are extracted and different fusion strategies are explored. The Fisher Vector encoding of the MFCC features performs most effectively among the acoustic features, and the DenseFace feature shows the best performance among the visual features. Early fusion across the audio and video modalities improves the recognition results, and the temporal model with the fused features further improves the accuracy, achieving strong results on the testing set. In future work, we will explore more discriminative features, make better use of temporal information, and investigate more fusion methods for video emotion recognition.

ACKNOWLEDGMENTS
This work is supported by the National Key Research and Development Plan under Grant No. 2016YFB1001202. We also appreciate the support of the National Demonstration Center for Experimental Education of Information Technology and Management (Renmin University of China).


REFERENCES
[1] Emad Barsoum, Cha Zhang, Cristian Canton Ferrer, and Zhengyou Zhang. 2016. Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution. In ACM International Conference on Multimodal Interaction (ICMI).
[2] Emad Barsoum, Cha Zhang, Cristian Canton Ferrer, and Zhengyou Zhang. 2016. Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution. In Proceedings of the 18th ACM International Conference on Multimodal Interaction. ACM, 279–283.
[3] Jun Kai Chen, Zenghai Chen, Zheru Chi, and Hong Fu. 2014. Emotion Recognition in the Wild with Feature Fusion and Multiple Kernel Learning. In ACM International Conference on Multimodal Interaction. 508–513.
[4] Albert C. Cruz. 2015. Quantification of Cinematography Semiotics for Video-based Facial Emotion Recognition in the EmotiW 2015 Grand Challenge. In ACM International Conference on Multimodal Interaction. 511–518.
[5] Steven Davis and Paul Mermelstein. 1980. Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing 28, 4 (1980), 357–366.
[6] Abhinav Dhall, Roland Goecke, Shreya Ghosh, Jyoti Joshi, Jesse Hoey, and Tom Gedeon. 2017. From Individual to Group-level Emotion Recognition: EmotiW 5.0. In ACM International Conference on Multimodal Interaction.
[7] Abhinav Dhall, Roland Goecke, Simon Lucey, and Tom Gedeon. 2012. Collecting Large, Richly Annotated Facial-Expression Databases from Movies. IEEE MultiMedia 19, 3 (2012), 34–41.
[8] Samira Ebrahimi Kahou, Vincent Michalski, Kishore Konda, Roland Memisevic, and Christopher Pal. 2015. Recurrent Neural Networks for Emotion Recognition in Video. In ACM International Conference on Multimodal Interaction. 467–474.
[9] Florian Eyben, Martin Wöllmer, and Björn Schuller. 2010. openSMILE: The Munich Versatile and Fast Open-Source Audio Feature Extractor. In ACM International Conference on Multimedia. 1459–1462.
[10] Alex Graves. 2013. Generating Sequences with Recurrent Neural Networks. arXiv preprint arXiv:1308.0850 (2013).
[11] Gao Huang, Zhuang Liu, Kilian Q. Weinberger, and Laurens van der Maaten. 2016. Densely Connected Convolutional Networks. arXiv preprint arXiv:1608.06993 (2016).
[12] Zhaocheng Huang, Ting Dang, Nicholas Cummins, Brian Stasak, Phu Le, Vidhyasaharan Sethu, and Julien Epps. 2015. An Investigation of Annotation Delay Compensation and Output-Associative Fusion for Multimodal Continuous Emotion Prediction. In International Workshop on Audio/Visual Emotion Challenge. 41–48.
[13] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. 2010. Aggregating Local Descriptors into a Compact Image Representation. In IEEE Conference on Computer Vision and Pattern Recognition. 3304–3311.
[14] Samira Ebrahimi Kahou, Vincent Michalski, Kishore Konda, Roland Memisevic, and Christopher Pal. 2015. Recurrent Neural Networks for Emotion Recognition in Video. In ACM International Conference on Multimodal Interaction. 467–474.
[15] Yin Fan, Xiangju Lu, Dian Li, and Yuanliu Liu. 2016. Video-based Emotion Recognition Using CNN-RNN and C3D Hybrid Networks. In ACM International Conference on Multimodal Interaction. 445–450.
[16] Stephanie Pancoast and Murat Akbacak. 2014. Softening Quantization in Bag-of-Audio-Words. In IEEE International Conference on Acoustics, Speech and Signal Processing. 1370–1374.
[17] Florent Perronnin, Thomas Mensink, and Jakob Verbeek. 2013. Image Classification with the Fisher Vector: Theory and Practice. International Journal of Computer Vision 105, 3 (2013), 222–245.
[18] Viktor Rozgic, Sankaranarayanan Ananthakrishnan, Shirin Saleem, Rohit Kumar, Aravind Namandi Vembu, and Rohit Prasad. 2012. Emotion Recognition Using Acoustic and Lexical Features. In Conference of the International Speech Communication Association (INTERSPEECH).
[19] Björn Schuller, Anton Batliner, Stefan Steidl, and Dino Seppi. 2011. Recognising Realistic Emotions and Affect in Speech: State of the Art and Lessons Learnt from the First Challenge. Speech Communication 53, 9–10 (2011), 1062–1087.
[20] Bo Sun, Liandong Li, Tian Zuo, Ying Chen, Guoyan Zhou, and Xuewen Wu. 2014. Combining Multimodal Features with Hierarchical Classifier Fusion for Emotion Recognition in the Wild. In ACM International Conference on Multimodal Interaction. 481–486.
[21] Chung Hsien Wu, Jen Chun Lin, and Wen Li Wei. 2014. Survey on Audiovisual Emotion Recognition: Databases, Features, and Data Fusion Strategies. APSIPA Transactions on Signal and Information Processing 3 (2014).
[22] Jianlong Wu, Zhouchen Lin, and Hongbin Zha. 2015. Multiple Models Fusion for Emotion Recognition in the Wild. In ACM International Conference on Multimodal Interaction. 475–481.
[23] Shuzhe Wu, Meina Kan, Zhenliang He, Shiguang Shan, and Xilin Chen. 2016. Funnel-Structured Cascade for Multi-View Face Detection with Alignment-Awareness. Neurocomputing (under review) (2016).
[24] Jingwei Yan, Wenming Zheng, Zhen Cui, Chuangao Tang, Tong Zhang, Yuan Zong, and Ning Sun. 2016. Multi-clue Fusion for Emotion Recognition in the Wild. In ACM International Conference on Multimodal Interaction. 458–463.
[25] Anbang Yao, Junchao Shao, Ningning Ma, and Yurong Chen. 2015. Capturing AU-Aware Facial Features and Their Latent Relations for Emotion Recognition in the Wild. In ACM International Conference on Multimodal Interaction. 451–458.
[26] Zhiding Yu and Cha Zhang. 2015. Image Based Static Facial Expression Recognition with Multiple Deep Network Learning. In ACM International Conference on Multimodal Interaction. 435–442.