CogInfoCom 2012 • 3rd IEEE International Conference on Cognitive Infocommunications • December 2-5, 2012, Kosice, Slovakia

Emotion Recognition from Spontaneous Slavic Speech

1Hicham Atassi, 1Zdenek Smékal, 2Anna Esposito

1Brno University of Technology, Department of Telecommunications, Czech Republic
2Second University of Naples, Department of Psychology and IIASS, Italy
[email protected], [email protected], [email protected]

Abstract— In this paper, we present a new approach for the automatic recognition of emotional expressions from spontaneous Slavic speech. The speech corpus used in the experiments is built from speech extracts obtained from real call center recordings involving four Slavic languages: Czech, Polish, Russian and Slovak. Five different emotional states, namely anger, happiness, sadness, surprise and neutral, are represented in the speech corpus. The proposed approach is based on mapping discrete emotions into a two-dimensional space defined by the valence and activation of emotions. This is carried out using support vector regression combined with forward feature selection.

I. INTRODUCTION

The analysis and recognition of emotional vocal expressions has become a very active field of research due to the increasing need for systems capable of recognizing speakers' emotional states. Such systems can find application in several fields, such as media retrieval, robotics, e-learning, entertainment, or call center monitoring [1], where the analysis of users' emotional states can give call center supervisors a valuable picture of client satisfaction. The last application is gaining importance because of the worldwide boom of call centers and of companies built around phone marketing. The performance of operators in such companies is crucial, as is the clients' feedback. However, it is very hard to manually evaluate the quality of the services provided or to assess the agents' performance. For example, if a company has 20 operators working 7 hours per day, 5 days per week, the phone calls recorded throughout one month amount to about 2800 hours. It is clearly impossible to check all these calls manually in order to form a reliable picture of the agents' performance or of the quality of services. In light of the above, some kind of automatic analysis of speakers' emotional states appears indispensable for phone marketing and for any business that includes customer support services.

The majority of research addressing emotion recognition from speech has focused on the classical approach to the task [2, 3]: the input speech signal is assigned to one class ("emotion") according to a certain classification criterion, for example the logarithmic likelihood. This approach has limitations, because the output of such systems is restricted to a set of discrete emotions, whereas human emotional states are known to exhibit a high level of variability [4]. Moreover, it is impossible to build a speech database that covers all possible human emotional states. Some studies have suggested approaches that estimate the activation and valence values of speech emotions. Such systems are typically trained on emotional corpora labeled for this purpose, where listeners are asked to place each stimulus under examination in the two-dimensional emotional space defined by the activation and valence axes [5]. However, this kind of labeling is more time consuming and requires experts or specially trained listeners. In fact, most speech emotion databases to date consider only a limited number of emotions. For example, the Berlin Database of Emotional Speech (BDES) [6] contains speech stimuli of seven emotional states: anger, boredom, fear, happiness, sadness, disgust and neutral. In this light, an approach in which discrete classes (emotions) are used to estimate activation and valence values can be very useful, because it combines the simplicity of constructing emotional databases of discrete emotions with the advantages of a continuous representation of emotions.

The authors believe that the work presented in this paper contributes to the field of cognitive infocommunications, as it involves, from the cognitive-science point of view, decision making, machine learning, emotion perception and related disciplines. At the same time, the proposed emotion recognition system is intended for call center monitoring, which is clearly a relevant domain of telecommunications.


II. DESCRIPTION OF SPEECH DATABASE

The collected data consist of extracts (short utterances) from dual-channel agent-client phone call recordings obtained from real call centers, mostly focused on customer support services. The collaborating call centers, based in four countries (the Czech Republic, Slovakia, Poland and Russia), provided us with raw speech recordings in the corresponding languages. For each language we selected speech extracts, which were subsequently labeled according to the format em_g_o_sp_la_abcdef.wav, where em is the emotion identifier, g is the speaker's gender, o is the speaker's age, sp is the speaker's ID, la is the language ID and abcdef encodes the results of the subjective evaluation.
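As a quick illustration of this naming convention, the following Python sketch parses the metadata fields from a SLADES file name; the helper and the field names are our own hypothetical additions, not part of the original toolchain, and the language codes other than "ru" are assumptions.

```python
import os

def parse_slades_filename(path):
    """Split a SLADES file name of the form em_g_o_sp_la_abcdef.wav
    into its metadata fields (field naming is ours)."""
    stem = os.path.splitext(os.path.basename(path))[0]
    em, g, o, sp, la, evaluation = stem.split("_")
    return {
        "emotion": em,            # e.g. "an" for anger
        "gender": g,              # "m" / "f"
        "age": int(o),
        "speaker_id": sp,
        "language": la,           # "ru" appears in the paper; other codes assumed
        "evaluation": evaluation, # concatenated listener judgments
    }

print(parse_slades_filename("an_m_30_09_ru_7001002.wav"))
```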


For example, the speech record an_m_30_09_ru_7001002.wav was labeled as anger; the speaker is a Russian male aged 30, identified in the database by the number 09. Seven listeners out of ten agreed that the speaker expressed anger, one listener marked the utterance as neutral and two listeners marked it as another emotion. All speech records are stored in PCM *.wav format with a sample rate of 8 kHz and 16-bit quantization. The number of utterances included in the Slavic Database of Emotional Speech (SLADES) for each language is illustrated in Figure 2. The average length of the utterances is 3.28±1.38 s and the total duration of the SLADES material is 151 minutes. The emotional speech records were subsequently judged by native listeners in order to evaluate the emotional state of each stimulus. Several native listeners were involved in the SLADES evaluation process for each language; their number and average age are reported in Table I.

Figure 2. The number of utterances available for each language

TABLE I. THE NUMBER AND AGE OF THE LISTENERS INVOLVED IN THE EVALUATION PROCESS

Language    Male   Female   Age average   Age std.
Czech         6       4        28.8          7.4
Polish        3       7        31.8         10.2
Russian       5       5        23.9          1.7
Slovak        3       2        31.6          5.5

III. FEATURE EXTRACTION

Speech features used for emotion recognition can be categorized into three main types: prosodic features, voice quality features and spectral features. In the present work a combination of them is used. Descriptions of well-known features such as MFCC and F0 can easily be found in the literature; details are given below only for features that are not commonly employed for emotion recognition.

The HFCC and LFCC are computed analogously to the well-known MFCC, but a different type of filter bank is used for each feature. The MELBS, HFBS and LFBS are obtained by multiplying the Discrete Fourier Transform (DFT) magnitude of a speech segment by the corresponding filter bank: the mel frequency filter bank, the human-factor filter bank [7] and the linear frequency filter bank, respectively.

The Perceptual Linear Predictive (PLP) coefficients were proposed in [8]. In this work, four intermediate results of the PLP extraction process are considered as independent features [9]. The first one, PLP1, is simply the Bark spectrum of a speech segment. PLP2 is the post-processed Bark spectrum, in which PLP1 is multiplied by the equal-loudness curve and the magnitude is compressed. The third PLP type, PLP3, is the LPC-smoothed spectrum of PLP2, and the last type, PLP4, is the smoothed LPC cepstrum of PLP2. The spectral features are extracted from the DFT magnitude and include the spectral centroid, spread, skewness, kurtosis and slope. The complete list of features considered in this paper is reported in Table II; these features are part of a larger set reported in our previous work [9].

The feature extraction process is carried out as follows (a sketch is given after Table III):
1. The speech signal is segmented into frames of 32 ms with 50% overlap.
2. The features from Table II are extracted from each frame.
3. High-level characteristics are computed from the segmental features obtained in the previous step.
4. The high-level characteristics are concatenated into the final feature vector used for training.

Besides the features listed in Table II, the first and second differences (Δ, ΔΔ) of these features were also considered.

TABLE II. LIST OF FEATURES CONSIDERED

Feature                                        Abbrev.   Number of coefficients
Mel Frequency Cepstral Coefficients            MFCC      20
Human Factor Cepstral Coefficients             HFCC      20
Linear Frequency Cepstral Coefficients         LFCC      20
Mel Bank Spectral Coefficients                 MELBS     20
Human Factor Bank Spectral Coefficients        HFBS      20
Linear Frequency Bank Spectral Coefficients    LFBS      20
Perceptual Linear Predictive 1                 PLP1      21
Perceptual Linear Predictive 2                 PLP2      21
Perceptual Linear Predictive 3                 PLP3      21
Perceptual Linear Predictive 4                 PLP4      11
4 Hz Modulation Energy                         4HzME      1
Mel Spectrum Modulation Energy [10]            MSME       1
Fundamental frequency                          F0         1
Formant frequencies                            Fx         5
Formant bandwidths                             Bx         5
Harmonicity                                    H          1
Temporal energy                                TE         1
Teager Energy Operator [11]                    TEO        1
Zero crossing ratio                            ZCR        1

The high-level features computed in our experiments are reported in Table III.

TABLE III. LIST OF HIGH-LEVEL FEATURES

Basic characteristics        mean, median, standard deviation, maximum, minimum, range, slope
Positional characteristics   position of maximum, position of minimum
Relative characteristics     relative standard deviation, relative range, relative maximum, relative minimum, relative position of maximum, relative position of minimum
Moments                      kurtosis, skewness, Pearson's skewness coefficient, 5th moment, 6th moment
Regression characteristics   linear regression coefficient, linear regression error
Percentiles                  1%, 5%, 10%, 20%, 30%, 40%, 60%, 70%, 80%, 90%, 95% and 99% percentile
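To make the four extraction steps concrete, the following Python sketch (our own illustration, not the authors' code; all function names are ours) segments a signal into 32 ms frames with 50% overlap, computes one simple frame-level feature (the zero crossing ratio), and derives a few of the high-level characteristics of Table III from its contour. The full system would repeat this for every feature in Table II and for its Δ and ΔΔ tracks.

```python
import numpy as np

def frame_signal(x, fs, frame_ms=32, overlap=0.5):
    """Split signal x into overlapping frames (32 ms, 50% overlap)."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(frame_len * (1 - overlap))
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def zero_crossing_ratio(frames):
    """Frame-level zero crossing ratio (one value per frame)."""
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

def high_level_stats(contour):
    """A small subset of the high-level characteristics of Table III."""
    t = np.arange(len(contour))
    slope = np.polyfit(t, contour, 1)[0]   # linear regression coefficient
    return {
        "mean": np.mean(contour),
        "median": np.median(contour),
        "std": np.std(contour),
        "range": np.ptp(contour),
        "pos_of_max": int(np.argmax(contour)),
        "regression_coef": slope,
        "p10": np.percentile(contour, 10),
        "p90": np.percentile(contour, 90),
    }

fs = 8000                              # SLADES sample rate
x = np.random.randn(fs * 3)            # stand-in for a 3 s utterance
zcr_contour = zero_crossing_ratio(frame_signal(x, fs))
print(high_level_stats(zcr_contour))
```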


The total number of features extracted from each utterance is computed as

N_{total} = N_{coef} \cdot N_{hl} \cdot N_{\Delta} = 211 \cdot 34 \cdot 3 = 21522,   (1)

where N_{coef} = 211 is the total number of feature coefficients (Table II), N_{hl} = 34 is the number of high-level characteristics (Table III) and N_{\Delta} = 3 is the number of feature waveforms (one original and two differences).

IV. FEATURE SELECTION

The feature selection process depends on the regression algorithm used. However, for all regression techniques the minimum Redundancy Maximum Relevance (mRMR) [12] algorithm is employed first in order to reduce the number of features from 21522 to 200. The mRMR algorithm shows very good performance compared to similar filtering techniques. It is usually applied before wrapper methods for feature selection, so that the vast number of irrelevant and redundant features is filtered out first; this results in a small feature set that can easily be handled by wrapper methods. As shown in the next section, one of the regression techniques is combined with forward feature selection.

V. REGRESSION

Several regression algorithms were tested in order to identify the best one among them:
1. Feedforward neural network with one input layer, two hidden layers and one output layer (ANN).
2. Support vector regression with a linear kernel (SVR-LK).
3. Support vector regression with a radial basis function kernel (SVR-RBF).
4. Support vector regression with a radial basis function kernel combined with forward feature selection (SVR-RBF-FS).

The speech corpus was split into three parts: 80% was used for training, 10% for validation and 10% for testing. The mean absolute errors for all regression techniques are reported in Table IV. The maximum possible error for each emotion is 10, which corresponds to the number of listeners involved in the subjective evaluation process for Czech, Polish and Russian; the subjective evaluation results for Slovak were doubled in order to have all emotions on the same scale.

TABLE IV. MEAN ABSOLUTE ERRORS FOR THE ALGORITHMS UNDER EXAMINATION

Algorithm     anger   happiness   neutral   sadness   surprise
ANN           4.35      3.84       3.53      2.86       3.14
SVR-LK        2.01      1.78       2.42      1.77       1.94
SVR-RBF       1.93      1.69       2.30      1.69       1.81
SVR-RBF-FS    1.10      0.87       1.79      0.86       1.05

The results suggest that support vector regression with a radial basis function kernel combined with forward feature selection (SVR-RBF-FS) gives the best results among the examined algorithms. Hence, the following text is devoted to the details of the SVR-RBF-FS method. The MAEs for all emotions using this method are shown in Figure 3.

The regression is carried out as follows. First, as stated in the feature selection section, the mRMR algorithm is applied, resulting in 200 features. These features are subsequently filtered using forward selection, where the selection criterion is the Mean Absolute Error (MAE), defined as

MAE = \frac{1}{M N} \sum_{m=1}^{M} \sum_{i=1}^{N} | y_i^{(m)} - t_i^{(m)} |,   (2)

where:
- M is the number of classes (emotions), M = 5;
- N is the number of patterns (speech stimuli) used for validation, N = 280;
- y_i = (y_i^{(1)}, ..., y_i^{(M)}) is the output of the SVR for the ith stimulus;
- t_i is the vector of subjective evaluation results for the ith stimulus.

Figure 3. Mean absolute errors for the SVR-RBF-FS algorithm

The forward feature selection is described by the following pseudocode; a runnable sketch of the same procedure is given after it.

SET Z = {}     // output feature group
SET j = 0      // feature index
SET J = 0      // mean absolute error function initialization
SET I = 20     // number of iterations
SET F = 200    // number of features considered
SET N = 280    // number of feature vectors
SET M = 5      // number of classes (emotions)

FOR m = 0 TO I - 1                    // the main cycle of the FS algorithm
    FOR n = 1 TO F                    //   adding candidate features
        FOR i = 1 TO 10               //     10-fold cross validation
            REGRESSION                //     SVR-RBF regression with the candidate feature n added to Z
            J(n) = J(n) + |y - t|     //     accumulate the mean absolute error (2)
        ENDFOR                        //   end of validation cycle
    ENDFOR                            // end of feature adding cycle
    j = argmin_n J(n)                 // find the feature with minimal MAE
    Z = Z ∪ {j}                       // add the selected feature to group Z
    J = 0                             // update J
ENDFOR                                // end of the main cycle of the FS algorithm
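A minimal, runnable illustration of this wrapper step is sketched below in Python with scikit-learn. It is our own approximation (greedy forward selection of SVR-RBF input columns scored by cross-validated MAE), not the authors' implementation; it assumes the feature matrix has already been reduced by mRMR, and all names are ours.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

def forward_select(X, T, n_select=43, cv=10):
    """Greedy forward selection of feature columns for SVR with an RBF kernel.

    X: (n_samples, n_features) mRMR-reduced features (e.g. 200 columns)
    T: (n_samples,) target values (listener scores for one emotion)
    Returns the list of selected column indices.
    """
    selected, remaining = [], list(range(X.shape[1]))
    best_mae = np.inf
    for _ in range(n_select):
        scores = {}
        for f in remaining:
            cols = selected + [f]
            # cross-validated mean absolute error for this candidate feature set
            mae = -cross_val_score(SVR(kernel="rbf"), X[:, cols], T,
                                   scoring="neg_mean_absolute_error", cv=cv).mean()
            scores[f] = mae
        best_f = min(scores, key=scores.get)
        if scores[best_f] >= best_mae:      # stop when the MAE no longer improves
            break
        best_mae = scores[best_f]
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

# toy usage with random data standing in for the mRMR-selected features
rng = np.random.default_rng(0)
X = rng.standard_normal((280, 30))
T = rng.uniform(0, 10, size=280)
print(forward_select(X, T, n_select=5, cv=5))
```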


The forward selection resulted in 43 features for which the MAE was minimal; these features are reported in Table V. The first column contains the coefficient number from which the high-level feature was extracted, the second column gives the feature group and the third column the high-level feature. The last column indicates whether the high-level feature was extracted from the original feature contour (O) or from its first (Δ) or second (ΔΔ) difference.

TABLE V. LIST OF SELECTED FEATURES USING SVR-RBF-FS

Coef.   Feature   High-level feature              P
8       MFCC      slope                           O
13      PLP3      relative minimum                Δ
11      PLP4      10% percentile                  Δ
9       MELBS     30% percentile                  O
12      HFCC      regression coefficient          O
8       PLP3      minimum                         ΔΔ
1       HFCC      70% percentile                  Δ
9       MFCC      minimum                         O
1       PLP2      maximum                         Δ
1       HFCC      range                           ΔΔ
16      PLP1      80% percentile                  Δ
19      LFCC      maximum                         ΔΔ
9       MFCC      70% percentile                  Δ
9       MELBS     20% percentile                  O
11      PLP4      5% percentile                   Δ
10      PLP1      slope                           O
19      LFCC      slope                           O
9       PLP1      10% percentile                  ΔΔ
8       PLP3      maximum                         Δ
16      PLP1      60% percentile                  Δ
20      LFCC      regression coefficient          ΔΔ
16      LFCC      80% percentile                  Δ
3       MELBS     mean                            O
19      HFCC      minimum                         Δ
18      LFCC      slope                           Δ
1       HFCC      80% percentile                  Δ
16      PLP3      70% percentile                  O
8       MFCC      80% percentile                  O
4       PLP3      maximum                         Δ
9       PLP1      slope                           O
15      PLP3      20% percentile                  ΔΔ
15      PLP2      regression coefficient          O
16      PLP1      mean                            Δ
18      LFCC      minimum                         O
1       HFCC      1% percentile                   Δ
1       PLP1      regression coefficient          Δ
19      LFCC      range                           ΔΔ
8       MFCC      relative standard deviation     Δ
8       MFCC      5% percentile                   O
1       HFCC      range                           O
16      PLPT      90% percentile                  Δ
12      PLP1      slope                           Δ
14      PLP1      regression coefficient          O

VI. TWO-DIMENSIONAL MAPPING

After identifying the optimal features, the regression is carried out using the regression function C as follows: the feature vector F is extracted from the input speech signal, and only the optimal features selected by the forward selection algorithm are retained, giving the output

y = C(F).   (3)

Ratios d_{ij}, defined as

d_{ij} = \frac{y_i}{y_i + y_j},   (4)

are computed for all possible pairs of emotional states; these ratios determine the positions of points in the two-dimensional emotional space. These points are always located on the lines connecting what we call "fixed points". For example, b_{13} is located between anger and neutral and b_{25} is located between happiness and surprise. In case y_i + y_j = 0, the ratio d_{ij} is not considered.

The fixed points represent the positions of the emotions considered in our experiment; these positions were set according to the activation-valence theory proposed in [4]. The fixed point coordinates are shown in Table VI and in Figure 4.

Figure 4. Two-dimensional emotional space with the fixed points and the b points

TABLE VI. THE COORDINATES OF THE FIXED POINTS

Emotion      Valence (V)   Activation (A)
Anger            -1              1
Happiness         1              1
Neutral           0              0
Sadness          -1             -1
Surprise          0              1

For emotions i, j the coordinates of b_{ij} are computed as

b_{ij} = d_{ij} (V_i, A_i) + (1 - d_{ij}) (V_j, A_j),   (5)

where (V_i, A_i) and (V_j, A_j) are the valence and activation values of the fixed points of emotions i and j. Finally, the centroid of the points b_{ij} is computed; this centroid determines the position of the input speech emotion in the two-dimensional space.
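The mapping of the regression outputs to a valence-activation point can be written compactly. The Python sketch below is our own reading of Eqs. (4)-(5) and of Table VI (function and variable names are ours, not the authors'); on the example vector y = (6, 0, 2, 2, 0) used in the worked example that follows, it reproduces the reported centroid (-0.694, 0.083).

```python
from itertools import combinations

# Fixed points (valence, activation) from Table VI
FIXED = {
    "anger":     (-1.0,  1.0),
    "happiness": ( 1.0,  1.0),
    "neutral":   ( 0.0,  0.0),
    "sadness":   (-1.0, -1.0),
    "surprise":  ( 0.0,  1.0),
}
EMOTIONS = list(FIXED)   # order: anger, happiness, neutral, sadness, surprise

def map_to_valence_activation(y):
    """Map per-emotion regression outputs y to a (valence, activation) point
    by averaging the b_ij points of Eq. (5)."""
    points = []
    for i, j in combinations(range(len(EMOTIONS)), 2):
        if y[i] + y[j] == 0:          # Eq. (4) undefined, ratio not considered
            continue
        d = y[i] / (y[i] + y[j])      # Eq. (4)
        vi, ai = FIXED[EMOTIONS[i]]
        vj, aj = FIXED[EMOTIONS[j]]
        points.append((d * vi + (1 - d) * vj,    # Eq. (5), valence
                       d * ai + (1 - d) * aj))   # Eq. (5), activation
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

print(map_to_valence_activation([6, 0, 2, 2, 0]))   # ~ (-0.694, 0.083)
```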

As an example, suppose that the regression algorithm returned the vector y = (6, 0, 2, 2, 0). As mentioned before, this vector contains the outputs for each emotion in the order anger, happiness, neutral, sadness and surprise.


The ratio d_{13} between anger and neutral is computed as

d_{13} = \frac{6}{6 + 2} = 0.75.   (6)

The remaining ratios are computed analogously: d_{12} = d_{15} = d_{35} = d_{45} = 1, d_{14} = 0.75, d_{23} = d_{24} = 0, d_{34} = 0.5; d_{25} is not considered. The position of b_{13} then follows from (5):

b_{13} = 0.75 \cdot (-1, 1) + (1 - 0.75) \cdot (0, 0) = (-0.75, 0.75).   (7)

The remaining points are obtained analogously: b_{12} = (-1, 1), b_{14} = (-1, 0.5), b_{15} = (-1, 1), b_{23} = (0, 0), b_{24} = (-1, -1), b_{34} = (-0.5, -0.5), b_{35} = (0, 0), b_{45} = (-1, -1). The emotion position is then determined as the centroid of the points b_{ij}, which is (-0.694, 0.083). This means that the input speech has a high negative valence and a low positive activation.

A secondary evaluation was carried out by 5 listeners, who labeled a small corpus of 50 emotional speech utterances according to an activation-valence protocol; these listeners were asked to estimate the position of the speakers' emotional states for each of the 50 utterances. The same utterances were analyzed by our algorithm, and the results of the subjective evaluation and the automatic recognition were compared in terms of MAE. The mean absolute errors for valence and activation obtained by the two-dimensional approach are reported in Table VII. The maximum possible error here is 2, which is the difference between the minimum (-1) and maximum (+1) values of valence and activation.

TABLE VII. MEAN ABSOLUTE ERRORS FOR VALENCE AND ACTIVATION USING THE TWO-DIMENSIONAL APPROACH

Valence       0.37
Activation    0.39

As a practical example, Figure 5 illustrates the results of emotional analysis of a client-operator phone call; the first column shows the two-dimensional emotional space of each channel, whereas the second column shows the activation and valence values over time. The big advantage of the two-dimensional representation is that it gives a very good overview of the distribution of the speaker's emotions within the utterance as well as of their intensities. However, the basic form of this interpretation does not contain any temporal information, that is, it is not possible to tell when a particular emotion occurs. This limitation can be avoided by displaying activation and valence over time (as in the right column of Figure 5) or by using a three-dimensional interpretation in which the third dimension is time.

Figure 5. Results of emotional analysis of a client-operator phone call

VII. CONCLUSIONS

In this paper, we proposed a new approach for the automatic recognition of emotional expressions from spontaneous Slavic speech; the approach is based on support vector regression with a radial basis function kernel combined with forward feature selection. Moreover, a new method for mapping discrete emotions into a continuous two-dimensional space was presented. The results of the experiments are promising; the SVR-RBF-FS method yielded remarkably good performance for all emotions, with an MAE of 1.134±0.381. Future effort will be devoted mainly to investigating new approaches for mapping discrete emotions into the two-dimensional space.

ACKNOWLEDGMENT

This research is part of the project reg. no. CZ.1.07/2.3.00/20.0094 "Support for incorporating R&D teams in international cooperation in the area of image and audio signal processing" and is co-financed by the European Social Fund and the state budget of the Czech Republic.

REFERENCES

[1] Petrushin, V.: Emotion in Speech: Recognition and Application to Call Centers. Proceedings of the Conference on Artificial Neural Networks in Engineering, pp. 7-10, 1999.
[2] Atassi, H., Esposito, A.: A Speaker Independent Approach to the Classification of Emotional Vocal Expressions. In: Proc. 20th Int. Conf. on Tools with Artificial Intelligence (ICTAI 2008), Dayton, Ohio, USA, IEEE Computer Society, pp. 147-151, 2008.
[3] Atassi, H., Riviello, M. T., Smekal, Z., Hussain, A., Esposito, A.: Emotional Vocal Expressions Recognition Using the COST 2102 Italian Database of Emotional Speech. Lecture Notes in Computer Science, 2010.
[4] Mehrabian, A., Russell, J.: An Approach to Environmental Psychology. Cambridge: MIT Press, 1974.
[5] Osgood, C. E., Suci, G. J., Tannenbaum, P. H.: The Measurement of Meaning. University of Illinois Press, Urbana, 1957.


[6] Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., Weiss, B.: A Database of German Emotional Speech. Proceedings of Interspeech, pp. 1517-1520, 2005.
[7] Skowronski, M., Harris, J.: Improving the Filter Bank of a Classic Speech Feature Extraction Algorithm. IEEE Int. Symposium on Circuits and Systems, Bangkok, pp. 281-284, 2003.
[8] Hermansky, H.: Perceptual Linear Predictive (PLP) Analysis of Speech. Journal of the Acoustical Society of America, no. 4, pp. 1738-1753, 1990.
[9] Atassi, H., Esposito, A., Smekal, Z.: Analysis of High-Level Features for Vocal Emotion Recognition. 34th International Conference on Telecommunications and Signal Processing (TSP), pp. 361-366, 2011.
[10] Bong-Wan, K., Dae-Lim, C., Yong-Ju, L.: Speech/Music Discrimination Using Mel-Cepstrum Modulation Energy. Springer Berlin/Heidelberg, pp. 406-414, 2007.
[11] Rahurkar, M., Hansen, J. H. L.: Frequency Band Analysis for Stress Detection Using a Teager Energy Operator Based Feature. In: Proc. Int. Conf. Spoken Language Processing (ICSLP '02), vol. 3, pp. 2021-2024, 2002.
[12] Peng, H., Long, F., Ding, C.: Feature Selection Based on Mutual Information: Criteria of Max-Dependency, Max-Relevance, and Min-Redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226-1238, 2005.
