Comparison of Different Classification Methods for Emotion Recognition

MIPRO 2010, May 24-28, 2010, Opatija, Croatia

T. Justin, R. Gajšek, V. Štruc, S. Dobrišek
University of Ljubljana, Faculty of Electrical Engineering, Tržaška cesta 25, SI-1000 Ljubljana, Slovenia
Phone: +386 1 4768 839, Fax: +386 1 4768 316, E-mail: [email protected]

Abstract - The paper presents a comparison of different classification techniques for the task of classifying a speaker's emotional state into one of two classes: aroused and normal. The comparison was conducted using WEKA (the Waikato Environment for Knowledge Analysis), an open-source software package that provides a collection of machine learning algorithms for data mining. The aim of this paper is to investigate the efficiency of different classification methods in recognizing the emotional state of a speaker from features obtained with a constrained version of Maximum Likelihood Linear Regression (CMLLR). For our experiments we adopted the multi-modal AvID database of emotions, which comprises 1708 samples of utterances, each lasting at least 15 seconds. The database was randomly divided into a training set and a testing set in a ratio of 4:1. Since the database contains many more samples of the neutral class than of the aroused class, the latter was over-sampled to ensure that both classes contained equal numbers of samples in the training set. The built-in WEKA classifiers were divided into five groups based on their theoretical foundations, i.e., the group of classifiers related to Bayes' theorem, the group of distance-based classifiers, the group of discriminant classifiers, the group of neural networks, and finally the group of decision tree classifiers. From each group we present the results of the best-performing algorithm with respect to the unweighted average recall.

I. INTRODUCTION

Automatic recognition of emotional states aims at automatically identifying the emotional or physical state of a human being from his or her voice. The hardest tasks in the field of automatic emotion recognition fall into three main parts, namely, finding appropriate datasets, feature extraction, and classification [1]. For the successful implementation of an emotional state classifier, good labeling of each emotional state is essential: it allows us to obtain sample features which carry sufficient information about the speaker's emotional state. Only a carefully selected set of voice recordings can yield a group of attributes rich enough in appropriate information to effectively train classifiers capable of discriminating between different emotional states of the speaker. Another important issue to be considered is the method used for feature extraction. A constrained version of the Maximum Likelihood Linear Regression (CMLLR) [2] transformation of the monophone acoustic model is one of the promising new features for the task of emotion recognition [3]. The estimated CMLLRs, when turned into vector form, can act as stand-alone feature vectors for training and evaluating different classification techniques.

Different learning techniques differ in their representations, search heuristics, evaluation functions and search spaces. It is commonly accepted that each algorithm has its own selective superiority: each one is best for some tasks, but none is best for all [4]. It is therefore common practice in the field of pattern recognition to assess the efficiency of different learning techniques for a given problem on a given dataset. The work in this paper focuses on experimentally evaluating different classification methods for classifying aroused and normal emotional states based on CMLLR features. The results of the evaluation are presented in the form of receiver operating characteristic (ROC) graphs [5] and with statistical parameters such as per-class recall, unweighted average recall and mean absolute error. Finally, the results of the different classification methods are discussed.

II. METHOD

A. Emotional database and feature extraction

In the field of automatic emotion recognition, probably the most often used features are based on prosody (pitch, energy, voicing, etc.). They are normally combined with a set of cepstrum-based coefficients, and the combination represents the standard feature set for the classification of emotions. In this paper we focus instead on CMLLR transform estimates for emotion recognition. The CMLLR estimates are obtained by adapting Hidden Markov Models (HMMs) [6]. The CMLLR transformation is a constrained version of the Maximum Likelihood Linear Regression (MLLR) transformation, in which the same transformation matrix A is used for both the means μ and the covariance matrices Σ of the HMMs. The transformation is presented in (1) and (2).

μ̂ = Aμ − b                                  (1)

Σ̂ = AΣAᵀ                                    (2)

The matrix A and the vector b are determined by maximizing the likelihood of the acoustic model on the feature vectors available for adaptation. The estimation of A and b is carried out with the Expectation-Maximisation (EM) algorithm [7]. In [3], the authors suggest that the matrix A and the vector b themselves, rather than the adapted means and covariance matrices of the model, can be used as features for emotion recognition. The full transformation is commonly written as a single matrix W:

W = [A b]                                    (3)
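As an illustration, the adaptation in (1)-(3) can be sketched with a few lines of numpy. This is a toy example with a made-up dimensionality and random values, purely illustrative; the paper itself works with 39-dimensional acoustic features:

```python
import numpy as np

dim = 3  # toy size; the acoustic features in the paper are 39-dimensional
rng = np.random.default_rng(0)

A = np.eye(dim) + 0.1 * rng.standard_normal((dim, dim))  # transform matrix A
b = 0.1 * rng.standard_normal(dim)                       # bias vector b

mu = rng.standard_normal(dim)                # mean of one Gaussian component
Sigma = np.diag(rng.uniform(0.5, 1.5, dim))  # its (diagonal) covariance

mu_hat = A @ mu - b          # eq. (1): adapted mean
Sigma_hat = A @ Sigma @ A.T  # eq. (2): adapted covariance

W = np.hstack([A, b[:, None]])  # eq. (3): the combined transform W = [A b]
print(W.shape)  # (3, 4)
```

Note that the adapted covariance in (2) stays symmetric by construction, since AΣAᵀ is a congruence transform of a symmetric matrix.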

In order to estimate the transformation given by (3), the acoustic model must first be trained. In our experiments, the database used for building the acoustic models was the Voicetran database [8]. The idea is similar to speaker adaptation, where the aim is to "move" the speaker-specific characteristics from the acoustic model to the MLLR transform and consequently estimate a speaker-independent acoustic model. Instead of "moving" the speaker-specific information to the MLLRs, the aim in emotion recognition is to shift the emotional information from the acoustic model to the transform matrices. The acoustic model training procedure is fully described in [3].

For the evaluation of the classification techniques, the AvID emotional database was used [9]. This database was collected as part of the interdisciplinary project "AvID: Audiovisual speaker identification and emotion detection for secure communications". The database contains recordings of spontaneous speech of Slovenian speakers in normal (relaxed) or non-normal psychophysical conditions (states of excitement and arousal with different valence, both positive and negative). The audio data, transcribed and labeled as emotionally neutral or aroused, forms the basis of the evaluation task. The utterances were grouped by speaker and emotion label into chunks at least 15 seconds long, in order to provide enough data to estimate the MLLR matrices. To assess the performance of the different classification methods, 1708 CMLLR transformations were estimated. The matrices A from (3) were used as features for emotion recognition. Each matrix A, with dimensions of 39 × 39, was turned into a 1 × 1521-dimensional feature vector, which represented either a training or a testing sample. The obtained data set was then divided into a training set (80% of the entire data) and a testing set (20% of the entire data).
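The feature preparation described above, together with the over-sampling of the minority class discussed in Section II-B, can be sketched as follows. The sample count and label proportions are synthetic placeholders, not the paper's actual data:

```python
import numpy as np

rng = np.random.default_rng(42)

n_samples = 100                    # placeholder; the paper estimates 1708 CMLLRs
A_matrices = rng.standard_normal((n_samples, 39, 39))  # one matrix A per chunk
labels = (rng.random(n_samples) < 0.12).astype(int)    # 1 = aroused (minority)

# Flatten each 39 x 39 matrix A into a 1 x 1521 feature vector.
features = A_matrices.reshape(n_samples, 39 * 39)
assert features.shape[1] == 1521

# Random 80/20 split into training and testing sets.
perm = rng.permutation(n_samples)
n_train = int(0.8 * n_samples)
train_idx, test_idx = perm[:n_train], perm[n_train:]
X_train, y_train = features[train_idx], labels[train_idx]
X_test, y_test = features[test_idx], labels[test_idx]

# Over-sample the minority (aroused) class in the training set by randomly
# duplicating its samples until the two classes are balanced.
minority = np.flatnonzero(y_train == 1)
deficit = (y_train == 0).sum() - minority.size
extra = rng.choice(minority, size=deficit, replace=True)
X_train = np.vstack([X_train, X_train[extra]])
y_train = np.concatenate([y_train, y_train[extra]])
assert (y_train == 0).sum() == (y_train == 1).sum()
```

Duplicating samples rather than generating synthetic ones keeps the feature distribution unchanged, which matters for the nearest-neighbor result discussed in Section IV.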

C. Experimental setup

To investigate the performance of the assessed classifiers, an experimental-setup script was developed. It is divided into three main parts:
- automated launching of WEKA's classification algorithms,
- parsing of the results,
- sorting of the evaluated classification methods by unweighted average recall.

The classification methods were divided into five groups: discriminant, distance-based, Bayes-based, decision trees, and Multilayer Perceptron. Each classifier was evaluated separately, but only the group winner with the best unweighted average recall is presented here. In total, 129 different classification models were built with each training set and tested with the same testing set.

D. WEKA

The Waikato Environment for Knowledge Analysis (WEKA) is a unified workbench that allows researchers easy access to state-of-the-art techniques in machine learning [10]. It was developed at the University of Waikato in New Zealand and implements data mining algorithms in the Java language. The WEKA project provides a comprehensive collection of machine learning algorithms and data preprocessing tools. The workbench includes algorithms for regression, classification, clustering, association rule mining and attribute selection, as well as tools for data visualization. Data is usually imported from the ARFF file format, which uses special tags to indicate attribute names, attribute types, attribute values and the data itself. Implementations of almost all popular classification algorithms are included. Bayesian methods include naive Bayes, complement naive Bayes, multinomial naive Bayes, Bayesian networks, and AODE. There are many decision tree learners: decision stumps, ID3, a C4.5 clone called "J48", trees generated by reduced-error pruning, alternating decision trees, random trees and random forests. There are several separating-hyperplane approaches, such as support vector machines with a variety of kernels, logistic regression, voted perceptrons, Winnow and the multilayer perceptron. Also included are lazy learning methods such as IB1, IBk, lazy Bayesian rules, KStar, and locally weighted learning.
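A minimal example of the ARFF structure described above (the relation and attribute names are illustrative, not the ones used in the experiments):

```
% Toy ARFF file with two numeric CMLLR attributes and a class attribute.
@relation emotion-toy

@attribute cmllr_1 numeric
@attribute cmllr_2 numeric
@attribute class {aroused,normal}

@data
0.91,-0.12,aroused
0.05,0.33,normal
```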

B. Observation of class quantity distribution

Many classification algorithms have difficulty efficiently determining the decision boundaries between classes when one of the classes in the training set contains many more samples than the other. Our preliminary analysis of the AvID database showed that the training-set class corresponding to the normal emotional state contained more than seven times as many samples as the class corresponding to the aroused emotional state (shown in Table I). To ensure an equal number of samples in each of the two classes, we multiplied randomly chosen samples of the aroused class. The performance of the classification methods with unequal numbers of samples in the two classes was observed separately and is tabulated in Section III.

TABLE I
DISTRIBUTION OF SAMPLES IN THE TEST AND TRAINING DATA SETS

Sample data set              Excited emotional state   Normal emotional state    Sum
Testing set                             40                      299              339
Training set - initial                 166                     1203             1369
Training set - multiplied             1162                     1203             2365

III. RESULTS

Using a fixed and predefined experimental setup, several state-of-the-art classifiers were trained and evaluated. This section presents the evaluations of the classifiers that achieved the best unweighted average recall in each of the representative groups of classification techniques.

TABLE II
EVALUATION OF CLASSIFYING ALGORITHMS WITH THE INITIAL TRAINING SET

Algorithm                Recall aroused    Recall normal    UW average    Mean absolute
                         emotional state   emotional state    recall          error
Nearest neighbor              0.775             0.632          0.704          0.351
Multilayer Perceptron         0.450             0.890          0.670          0.162
SVM                           0.400             0.883          0.641          0.174
Naive Bayes                   0.775             0.442          0.608          0.519
FT tree                       0.225             0.936          0.581          0.148

TABLE III
EVALUATION OF CLASSIFYING ALGORITHMS WITH THE MULTIPLIED TRAINING SET

Algorithm                Recall aroused    Recall normal    UW average    Mean absolute
                         emotional state   emotional state    recall          error
Nearest neighbor              0.775             0.632          0.704          0.351
SVM                           0.773             0.625          0.699          0.245
Multilayer Perceptron         0.425             0.893          0.659          0.162
Bayesian L. Regression        0.425             0.846          0.636          0.204
FT tree                       0.350             0.876          0.613          0.186

Fig. 1. ROC curves of the implemented classification methods with the initial training set (normal emotional state)

Fig. 2. ROC curves of the implemented classification methods with the multiplied training set (normal emotional state)
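The unweighted average (UW average) recall reported in Tables II and III is simply the arithmetic mean of the per-class recalls, ignoring class sizes, which makes the rows easy to sanity-check; for example, for the SVM row of Table II:

```python
def unweighted_average_recall(per_class_recalls):
    """Mean of the per-class recalls, irrespective of class sizes."""
    return sum(per_class_recalls) / len(per_class_recalls)

# SVM row of Table II: aroused recall 0.400, normal recall 0.883.
uar = unweighted_average_recall([0.400, 0.883])  # = 0.6415, reported as 0.641
```

Unlike plain accuracy, this metric is not dominated by the majority (normal) class, which is why it was chosen as the ranking criterion.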

Table II summarizes the results of the models built with the initial training set, while Table III summarizes the results achieved with models built with similar numbers of samples in each class of the training set. For a more detailed evaluation, both tables list several statistical parameters: the recall of the aroused emotional state, the recall of the normal emotional state, the unweighted average recall and the mean absolute error. The classification methods are sorted by unweighted average recall. For better insight into the performance of the classifiers, ROC graphs are presented in Fig. 1 for the classification techniques built with the initial training set and in Fig. 2 for those built with the multiplied training set.
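ROC graphs such as those in Figs. 1 and 2 are obtained by sweeping a decision threshold over the classifier's scores and plotting the true positive rate against the false positive rate; a minimal sketch of how such points are computed (toy scores, not the paper's data):

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs for thresholds swept over the scores.

    `labels` are 0/1 with 1 the positive class; a higher score means
    the classifier considers the sample more likely to be positive.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sort by score, descending, and lower the threshold one sample at a time.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for i in order:
        if labels[i] == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

# A classifier that ranks both positives above both negatives traces the
# ideal path (0,0) -> (0,1) -> (1,1).
print(roc_points([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
# [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
```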

IV. DISCUSSION

Comparing Table II and Table III, it is evident that the Nearest Neighbor classifier achieves the best unweighted average recall. The results for the Nearest Neighbor classifier in Table II and Table III are the same. The explanation lies in the nature of the algorithm: it uses the normalized Euclidean distance to find the instance in the training set closest to the given test sample, and the label of that closest training instance is assigned to the test sample. If multiple training instances have the same (smallest) distance to the test sample, the first one found is used. The only difference between the initial and the multiplied training sets is that the latter contains the aroused samples replicated seven times; as a result, the Nearest Neighbor classifier produces the same confusion matrix for both training sets.

The second best classification technique in Table II is the Multilayer Perceptron. In its case the multiplied training set did not yield better evaluation results: the unweighted average recall is lower than that of the second best classifier in Table III. The best results achieved with the Multilayer Perceptron on the two training sets differ in the fine-tuning parameters. In Table II the model was trained with a learning rate of 0.1 and a momentum of 0.2, whereas in Table III a learning rate of 0.3 and a momentum of 0.4 gave the best result.

The Support Vector Machine (SVM) classifier built with the multiplied training set achieved a better unweighted average recall (Table III) than the model built with the initial training set. The LibSVM library [11] was used for building the SVM models. The optimal hyperplanes in the two tables were obtained with two different SVM types: the model trained with the initial training set uses the C-SVM error minimization function with a linear kernel, while the model trained with the multiplied training set uses the nu-SVM error minimization function with a radial basis kernel.

The next evaluated group consists of classifiers based on the Bayes classification technique. Naive Bayes achieved the best unweighted average recall, but closer examination of the ROC graph in Fig. 1 shows that its prediction of the two-class CMLLR features is only just above random guessing. The best Bayesian model built with the multiplied training set is the Bayesian Logistic Regression classifier (Table III). The same classification technique trained with the initial training data barely discriminated between the two classes, achieving an unweighted average recall of 0.538.

The last evaluated group comprises the decision tree classifiers. As shown in Tables II and III, the functional tree [12] is the best classifier in this group. Note that with functional trees, multivariate decisions usually appear in the inner nodes [12]. Since both models, i.e., the one built with the initial training set and the one built with the multiplied training set, require a minimum of 15 training samples in an inner node before it is considered for splitting, the latter model is more efficient (due to more data), as shown in Tables II and III.
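The Nearest Neighbor behaviour described above can be sketched in plain Python as a 1-NN rule with per-attribute min-max normalization. WEKA's IB1 follows this general scheme, but the sketch below is a simplified assumption, not its exact implementation:

```python
def nearest_neighbor_label(train_X, train_y, x):
    """1-NN with per-attribute min-max normalization.

    Ties in distance are resolved in favour of the first training
    instance encountered, mirroring the behaviour described above.
    """
    dim = len(x)
    lo = [min(row[d] for row in train_X) for d in range(dim)]
    hi = [max(row[d] for row in train_X) for d in range(dim)]

    def norm(v, d):
        span = hi[d] - lo[d]
        return (v - lo[d]) / span if span else 0.0

    def dist2(row):
        return sum((norm(row[d], d) - norm(x[d], d)) ** 2 for d in range(dim))

    # min() returns the first index with the smallest distance.
    best = min(range(len(train_X)), key=lambda i: dist2(train_X[i]))
    return train_y[best]

# Duplicating minority-class samples does not change which instance is
# nearest, which is why the 1-NN rows of Tables II and III are identical.
X = [[0.0, 0.0], [1.0, 1.0], [0.0, 0.0]]
y = ["normal", "aroused", "normal"]
print(nearest_neighbor_label(X, y, [0.1, 0.1]))  # normal
```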

V. CONCLUSION

Our assessment of different groups of classification techniques suggests that the Nearest Neighbor algorithm is the best classification technique for determining the emotional state of a speaker (either aroused or normal) based on CMLLR features. The negative aspect of choosing the Nearest Neighbor classifier is its time consumption. However, classification with the Nearest Neighbor classifier is fast enough if the data is not too high-dimensional; if that is the case, the SVM classifier should be used instead.

REFERENCES

[1] T. Vogt, E. Andre, and J. Wagner, "Automatic Recognition of Emotions from Speech: A Review of the Literature and Recommendations for Practical Realisation", Affect and Emotion in HCI, LNCS 4868, pp. 75-91, 2008.
[2] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition", Computer Speech and Language, vol. 12, pp. 75-98, 1998.
[3] R. Gajšek, V. Štruc, S. Dobrišek, F. Mihelič, "Combining Audio and Video for Detection of Spontaneous Emotions", Interspeech 2009, Brighton, UK, ISCA, pp. 1967-1970, 2009.
[4] C. E. Brodley, "Recursive automatic bias selection for classifier construction", Machine Learning, vol. 20, pp. 63-94, 1995.
[5] T. Fawcett, "An introduction to ROC analysis", Pattern Recognition Letters, vol. 27, pp. 861-874, 2006.
[6] M. J. F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition", Computer Speech and Language, vol. 12, pp. 75-98, 1998.
[7] G. McLachlan and T. Krishnan, The EM Algorithm and Extensions, John Wiley & Sons, New York, 1996.
[8] F. Mihelič et al., "Spoken language resources at LUKS of the University of Ljubljana", International Journal of Speech Technology, vol. 6, no. 3, pp. 221-232, 2006.
[9] R. Gajšek et al., "Multi-Modal Emotional Database: AvID", Informatica, vol. 33, pp. 101-106, 2009.
[10] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I. H. Witten, "The WEKA Data Mining Software: An Update", SIGKDD Explorations, vol. 11, no. 1, 2009.
[11] C. C. Chang and C. J. Lin, "LIBSVM - A Library for Support Vector Machines", http://www.csie.ntu.edu.tw/~cjlin/libsvm/, 2001.
[12] J. Gama, "Functional trees", Machine Learning, vol. 55, pp. 219-250, 2004.