Speech-based Emotion and Emotion Change in Continuous Automatic Systems

Zhaocheng Huang

A thesis submitted in fulfilment of the requirements for the degree of Doctor of Philosophy

School of Electrical Engineering and Telecommunications
Faculty of Engineering

August 2018

ORIGINALITY STATEMENT

‘I hereby declare that this submission is my own work and to the best of my knowledge it contains no materials previously published or written by another person, or substantial proportions of material which have been accepted for the award of any other degree or diploma at UNSW or any other educational institution, except where due acknowledgement is made in the thesis. Any contribution made to the research by others, with whom I have worked at UNSW or elsewhere, is explicitly acknowledged in the thesis. I also declare that the intellectual content of this thesis is the product of my own work, except to the extent that assistance from others in the project's design and conception or in style, presentation and linguistic expression is acknowledged.’

Signed ……………………………………………..............
Date ……………………………………………..............

COPYRIGHT STATEMENT

‘I hereby grant the University of New South Wales or its agents the right to archive and to make available my thesis or dissertation in whole or part in the University libraries in all forms of media, now or here after known, subject to the provisions of the Copyright Act 1968. I retain all proprietary rights, such as patent rights. I also retain the right to use in future works (such as articles or books) all or part of this thesis or dissertation. I also authorise University Microfilms to use the 350 word abstract of my thesis in Dissertation Abstract International (this is applicable to doctoral theses only). I have either used no substantial portions of copyright material in my thesis or I have obtained permission to use copyright material; where permission has not been granted I have applied/will apply for a partial restriction of the digital copy of my thesis or dissertation.’

Signed ……………………………………………...........................
Date ……………………………………………...........................

AUTHENTICITY STATEMENT

‘I certify that the Library deposit digital copy is a direct equivalent of the final officially approved version of my thesis. No emendation of content has occurred and if there are any minor variations in formatting, they are the result of the conversion to digital format.’

Signed ……………………………………………...........................
Date ……………………………………………...........................

Abstract

An intelligent human-computer interface should not only provide automatic inference of emotion, but also be informative about emotion change, since emotion is dynamic in nature. Investigations into the former area have gained promising results, whilst the latter remains understudied. Accordingly, this thesis presents investigations to accomplish both from speech, i.e. improving existing emotion prediction systems and exploring the detection and prediction of emotion change.

Features which partition the acoustic space of speech are investigated, systematically demonstrating the significance of partitioning information for emotion prediction. Phoneme-specific examination of phonetic features suggests that phonetic content is more favourable for arousal than valence. Further, a novel set of phonetically-aware acoustic features is proposed to incorporate useful phonetic information into acoustic features, yielding up to 7.0% and 79.7% relative improvements for arousal and valence prediction respectively across three emotion corpora, while enabling emotional speech to be analysed on a per-feature-phoneme basis.

Another problem investigated is the localisation of emotion change points in time. A likelihood-based and a statistical Martingale-based framework are proposed, both of which achieve promising detection performances and underscore the importance of incorporating prior emotion information. Regression (smoothed) deltas are proposed as ground truth for emotion change, which yield considerably higher inter-rater reliability than the first-order difference deltas used in previous research, and represent a more appropriate way to derive annotations for emotion change research, with applications beyond speech-based systems. The first system design for continuous emotion change prediction from speech is explored, and comparison with systems for prediction of absolute emotion indicates that emotion change may be better predicted than absolute emotion ratings.

Finally, investigations are undertaken into incorporating emotion change information to improve emotion prediction, using data selection and emotion dynamic modelling. Systems trained on selected speech segments with emotion change achieve 5.4% and 24.5% relative improvements in prediction accuracy for arousal and valence over those trained on all data. Moreover, Kalman filtering is improved by incorporating processed emotion dynamics, which significantly correlate with emotion, via joint modelling and probabilistic fusion, producing relative improvements of 1.8% and 7.3% for arousal and valence over baseline systems.


Dedication

To my beloved xiaoniu.


Acknowledgements

I would like to express my sincerest gratitude to my supervisor, Associate Professor Julien Epps, who has always been patient, supportive, inspiring and encouraging throughout my PhD journey. I feel very privileged for the time and support that A/Prof Epps has generously offered, and for the tremendous amount that I have learnt from his top-notch guidance and meticulous proofreading and feedback. Without his unceasing guidance and advice, this thesis would not have been possible. I would also like to thank my co-supervisor, Professor Eliathamby Ambikairajah, for his encouragement and guidance.

I am grateful to many colleagues for their warm friendship, inspiring discussions, entertaining activities and relaxing chats: Dr Vidhyasaharan Sethu, Dr Siyuan Chen, Dr Shanglin Ye, Dr Zhan Shi, Dr Phu Le, Dr Sanat Biswas, Dr Marc Piggott, Dr Nicholas Cummins, Dr Dominic Rüfenacht, Brian Stasak, Hoe Kin Malcolm Wong, Jianbo Ma, Sarith Fernando, Saad Irtza, Ting Dang, Kalani Wataraka Gamage, Kaavya Sriskandaraja, Donna Kocherry, and Stefanie Brown; special thanks to Tom Millet for ensuring us a happy workplace.

I acknowledge the following sources of funding, which have enabled me to wholeheartedly pursue my PhD: the Faculty of Engineering, UNSW Sydney for the Tuition Fee Scholarship, and NICTA (now Data61, CSIRO) for the Research Project Award. Thanks are extended to the School of Electrical Engineering and Telecommunications for providing great administrative support during my study.

I would like to gratefully thank my family for their unconditional love and support: my parents (Fu Huang and Hongfang Wang), my sister (Qiongyao Huang) and my parents-in-law. Most importantly, my utmost appreciation goes to my beloved xiaoniu (Yiran Tai), whose love, companionship and support have been the pillar of my strength and happiness. She has always believed in me and been kind in her own nice way, ‘distracting me from research’ (to quote her) while protecting me from trivial distractions. Thanks to that, I could always take a step back to think creatively and enjoy our life with passion for my research.


Contents

Abstract .......... i
Dedication .......... iii
Acknowledgements .......... v
Contents .......... vii
List of Abbreviations .......... xi
List of Figures .......... xiii
List of Tables .......... xxi

Chapter 1  Introduction .......... 1
  1.1  Research Overview .......... 1
  1.2  Thesis Objectives .......... 5
  1.3  Thesis Organisation .......... 6
  1.4  Major Contributions .......... 7
  1.5  List of Publications .......... 8

Chapter 2  Automated Prediction of Emotions: A Review .......... 11
  2.1  Emotion: Definition and Representation .......... 11
  2.2  Emotion in Speech .......... 13
    2.2.1  Acoustic Low-Level Descriptors .......... 13
    2.2.2  High-Level Representations .......... 17
  2.3  Typical Automatic Modelling and Prediction Techniques .......... 22
    2.3.1  Gaussian Mixture Models .......... 23
    2.3.2  Support Vector Regression .......... 25
    2.3.3  Relevance Vector Machines .......... 27
    2.3.4  Relevance Vector Machine Staircase Regression .......... 31
    2.3.5  Output-Associative Relevance Vector Machine .......... 33
    2.3.6  Kalman Filtering .......... 34
  2.4  Concluding remarks .......... 36

Chapter 3  Emotion Change in Speech: A Review .......... 39
  3.1  Emotion change in affective science .......... 40
  3.2  Emotion change in automated systems .......... 42
    3.2.1  Computational Models using Emotion Change .......... 43
    3.2.2  Emotion Timing and Segmentation .......... 45
  3.3  Automatic Methods for Emotion Timing and Segmentation .......... 48
    3.3.1  Speaker Change Detection .......... 48
    3.3.2  General Change Detection – a Martingale Framework .......... 52
  3.4  Concluding remarks .......... 56

Chapter 4  Datasets .......... 59
  4.1  Characteristics of key emotional corpora .......... 59
    4.1.1  Description of datasets .......... 59
    4.1.2  Description of selected partitions .......... 61
  4.2  Constructed datasets for emotion change research .......... 64
    4.2.1  Emotion change detection .......... 64
    4.2.2  Emotion change prediction .......... 66
  4.3  Concluding remarks .......... 67

Chapter 5  Emotion Change Detection .......... 69
  5.1  Introduction .......... 69
  5.2  The dual-sliding window framework .......... 70
    5.2.1  Generalised Likelihood Ratio (GLR) .......... 71
    5.2.2  Emotion Pair Likelihood Ratio (EPLR) .......... 71
    5.2.3  Normalisation for EPLR .......... 72
  5.3  The Modified Martingale framework .......... 73
  5.4  Experimental outline and key settings .......... 76
    5.4.1  Datasets, Features and Detection Tasks .......... 76
    5.4.2  Evaluation measures for emotion change detection .......... 77
  5.5  Experimental results and analysis .......... 78
    5.5.1  Results – the sliding window framework .......... 78
    5.5.2  Results – the Modified Martingale framework .......... 83
    5.5.3  Extended Comparisons of the two frameworks for ECD-A and ECD-V .......... 87
  5.6  Concluding remarks .......... 89

Chapter 6  Absolute Emotion Prediction .......... 91
  6.1  Introduction .......... 91
  6.2  On the Use of Phonetic Features in a Staircase Regression Framework .......... 92
    6.2.1  Phone Log Likelihood Ratio (PLLR) Features .......... 92
    6.2.2  RVM Staircase Regression (RVM-SR) Framework .......... 93
    6.2.3  Preliminary experimental settings .......... 96
    6.2.4  Preliminary experimental results and analysis .......... 97
    6.2.5  Questions raised for further investigations .......... 101
  6.3  Partition-based Information .......... 102
    6.3.1  Partition-based Features .......... 102
    6.3.2  Experimental settings - Partition-based Features .......... 104
    6.3.3  Experimental results and analysis .......... 107
    6.3.4  Investigation of numbers of partitions .......... 110
  6.4  Phonetic Content .......... 115
    6.4.1  Phoneme Relevance using RVM Relevance Factor .......... 115
    6.4.2  Relevance of individual phonemes .......... 116
    6.4.3  Emotion Prediction using Selected Phonemes .......... 118
  6.5  Phonetically-aware acoustic features .......... 121
    6.5.1  Phonetically-aware eGeMAPS features .......... 121
    6.5.2  Experimental settings – Phonetically-aware features .......... 122
    6.5.3  Adjusting phonetic information via alpha values .......... 122
    6.5.4  Visualisation of PA-eGeMAPS features .......... 124
  6.6  Concluding remarks .......... 129

Chapter 7  Emotion Change Prediction .......... 131
  7.1  Introduction .......... 131
  7.2  Delta Emotion Ground Truth .......... 132
    7.2.1  Proposed Regression Delta Emotion Ground Truth .......... 134
    7.2.2  Evaluation of Window Size for Regression Deltas: A Trade-off between Inter-rater Agreement and Information Loss .......... 135
    7.2.3  Comparisons between Proposed Regression Deltas and Conventional First-order Differences .......... 138
  7.3  System Design for ECP .......... 142
    7.3.1  Experimental outlines and key settings .......... 142
    7.3.2  Feature Window .......... 144
    7.3.3  Delay Compensation .......... 145
  7.4  Emotion Change Prediction (ECP) vs Absolute Emotion Prediction (AEP) .......... 147
    7.4.1  Initial Comparison of Baseline Performance using SVR and RVM .......... 147
    7.4.2  Effect of non-change frames .......... 149
    7.4.3  Improvement of Emotion Change Prediction Systems using Dynamic Features .......... 153
    7.4.4  Comparison of Optimal Systems under OA-RVM Framework .......... 154
  7.5  Limitations .......... 155
  7.6  Concluding remarks .......... 156

Chapter 8  Exploring Emotion Change for Absolute Emotion Prediction .......... 157
  8.1  Introduction .......... 157
  8.2  Data Selection based on Emotion Change .......... 158
    8.2.1  Defining partitions based on emotion change .......... 158
    8.2.2  System Overview .......... 159
    8.2.3  Experimental settings – data selection .......... 160
    8.2.4  Emotion Prediction Results - Partition B, C, A .......... 161
    8.2.5  Emotion Prediction Results – Partition C .......... 163
  8.3  Exploring Emotion Dynamics with a Kalman Filter .......... 167
    8.3.1  Correlation between static and dynamic emotions .......... 167
    8.3.2  Kalman Filter .......... 171
    8.3.3  Experimental settings - Kalman Filtering .......... 174
    8.3.4  Results and analysis .......... 175
  8.4  Concluding remarks .......... 180

Chapter 9  Conclusions and Future Work .......... 183
  9.1  Conclusions .......... 183
    9.1.1  Emotion Change Detection .......... 183
    9.1.2  Understanding Phonetic Related Features for Absolute Emotion Prediction .......... 184
    9.1.3  Emotion Change Prediction .......... 185
    9.1.4  On the Use of Emotion Change for Absolute Emotion Prediction .......... 186
    9.1.5  Summary .......... 186
  9.2  Future work .......... 187

Appendix A  Absolute Emotion Prediction - Additional Investigations .......... 191
  A.1  Delay Compensation .......... 191
  A.2  Multimodal Fusion .......... 194
  A.3  Concluding remarks .......... 199

Appendix B  Multi-stage Fusion using RVM-SR .......... 201
Appendix C  Visualisation of Phoneme Groups (Standard Deviation) .......... 203
Appendix D  Visualisation of PA-eGeMAPS features (Standard Deviation) .......... 204

Bibliography .......... 207

List of Abbreviations

AEP – Absolute Emotion Prediction
ASR – Automatic Speech Recognition
AVEC – Audio-Visual Emotion Challenge and Workshop
BDI – Beck Depression Index
BIC – Bayesian Information Criterion
BNFs – BottleNeck Features
BoAW – Bag-of-Audio-Word
CCC – Concordance Correlation Coefficients
CNNs – Convolutional Neural Networks
ComParE – Computational Paralinguistics ChallengE
DCT – Discrete Cosine Transform
DET – Detection Error Trade-off
DNNs – Deep Neural Networks
DTW – Dynamic Time Warping
ECD – Emotion Change Detection
ECG – Electrocardiogram
ECP – Emotion Change Prediction
EDA – Electro-Dermal Activity
EER – Equal Error Rate
eGeMAPS – extended Geneva Minimalistic Acoustic Parameter Set
EN PLLR – English PLLR
EPLR – Emotion Pair Likelihood Ratio
EM – Expectation Maximisation
FA – False Alarm
GeMAPS – Geneva Minimalistic Acoustic Parameter Set
GLR – Generalised Likelihood Ratio
GMM – Gaussian Mixture Models
GSR – Gaussian Staircase Regression
HCI – Human Computer Interaction
HMM – Hidden Markov Models
HNR – Harmonic-to-Noise Ratio
HU PLLR – Hungarian PLLR
KF – Kalman Filtering
KL – Kullback-Leibler
LFPC – Log Frequency Power Coefficients
LLDs – Low-Level Descriptors
LMLR – Log Mean Likelihood Ratio
LOSO – Leave-One-Speaker-Out
LPCCs – Linear Prediction Cepstral Coefficients
MD – Miss Detection
MFCCs – Mel-Frequency Cepstral Coefficients
MP – Most Probable
mRMR – minimal Redundancy Maximal Relevance
MSSD – Mean Squared Successive Difference
OA-RVM – Output-Associative Relevance Vector Machine
PA-eGeMAPS – Phonetically-Aware eGeMAPS features
PLLR – Phone Log-Likelihood Ratio
RF – RVM Relevance Factor
RLR – Regularised Linear Regression
RVM – Relevance Vector Machine
RVM-SR – RVM-Staircase Regression
RRMSE – Relative Root Mean Square Error
SAL – Sensitive Artificial Listener
SFS – Sequential Forward Selection
SVM – Support Vector Machines
SVR – Support Vector Regression
UBM – Universal Background Model
VB – Variational Bayesian

List of Figures

Figure 1.1: An example of speech signals with synchronised arousal and valence continuous annotations. In the example, given a speech file of 5 seconds, a continuous emotion prediction system aims to continuously predict arousal and valence ratings, which are commonly annotated every 20 – 100 milliseconds. The original continuous arousal and valence ratings provided in emotional datasets are referred to as ‘absolute’ emotions (i.e. dashed lines), whereas changes in these ratings are referred to as ‘delta’ emotions in this thesis. ‘Delta’ emotions will be investigated in depth in Chapter 7. .......... 2
Figure 1.2: An overview of research conducted in this thesis. Concerning continuous absolute emotion prediction, the main objectives are two-fold, i.e. investigating phonetic-related features (Chapter 6) and exploiting emotion change information (Chapter 8). As for emotion change research, the investigations are undertaken to build the first systems to detect emotion change points in time (Chapter 5) and to predict the extent of emotion change (Chapter 7) from speech. There have been many studies regarding AEP and Feature Extraction, while ECD and ECP have been rarely investigated to date. .......... 5
Figure 2.1: The activation-evaluation space, alternatively known as the arousal-valence space (reproduced from [1]). .......... 12
Figure 2.2: A general system architecture for absolute emotion prediction, showing examples of specific methods for each block. Features are extracted from speech files to train a regression model, based on which arousal or valence values of a new speech file can be predicted. The features could be acoustic LLDs or a high-level representation of speech, whereas the most common regression method for emotion prediction is support vector regression. .......... 23
Figure 2.3: Overview of RVM-SR approach, showing how pairs of low-high classifiers are built upon intervals of the BDI rating of depression severity, after [75]. .......... 32
Figure 2.4: Overview diagram for Kalman filtering. t_n follows a Markov property that the current state only depends on the previous state. .......... 35
Figure 3.1: One sliding window for speaker change detection. In this approach, the window Z comprises two windows X and Y of variable lengths. The question to answer is whether there is a speaker change at a particular point t, which is resolved by independently modelling the three windows to compare the likelihood of Z and the likelihood of X+Y, and a change occurrence is preferred if the latter is larger than the former. .......... 49
Figure 3.2: One sliding dual-window of a certain length for speaker change detection. The window Z consists of two windows X and Y of equal length. .......... 50
Figure 3.3: A simple example demonstrating the concept of exchangeability. (a) Consider that each urn represents a distribution and there are three cases for selecting balls from the urns. In the first case, only blue balls are selected, so P(X|θ1) remains invariant regardless of the order of balls being selected. (b) In the second case, one starts to select red balls at a certain time n, which results in a minor change in the joint distribution, i.e. P(X|θ2). (c) As more red balls are selected, in the third case, the joint distribution P(X|θ3) alters significantly. In this regard, lack of exchangeability implies a change in the distribution or model. .......... 53
Figure 3.4: An overview diagram for the Martingale framework for change detection. For each incoming data point x_n, the strangeness of the data point with respect to the model λ is firstly calculated and used to compare with strangeness values from other previously observed data points. Comparison of the strangeness values then provides evidence of the exchangeability in terms of a p value, which is further used to calculate a Martingale value M_n. M_n can then be compared with a threshold to make decisions in terms of the presence of emotion change. .......... 55
Figure 3.5: Demonstration of how the Martingale framework is used for change detection via exchangeability testing. In this simplified example, samples are randomly generated from Gaussian distributions of two classes with a mean shift of 1. The strangeness measure is the negative log likelihood of a single Gaussian distribution trained using the most recent 50 samples after a change is detected. Note that there is a delay between the ground truth change points and the peaks of Martingale values. .......... 55
Figure 4.1: Concatenation of small utterances to form ground truth for emotion change points by merging the same emotions for one speaker. Note that silence was not considered, but was discarded based on voicing probability. .......... 65
Figure 4.2: Concatenation of original dataset segments per speaker, followed by omission of small utterances to form final datasets for emotion change detection. The shaded areas were omitted. .......... 65
Figure 5.1: A conceptual example of how the two proposed frameworks operate on speech signals. The two frameworks operate in two different manners: the dual-sliding window framework compares the similarity/diversity of two adjacent windows, drawing on analogous approaches from speaker change detection (Section 3.3.1), whilst the Martingale framework observes data points one-by-one and in turn performs statistical testing on the fly based on exchangeability (Section 3.3.2). Each dot or bar represents one frame-level feature vector, with frame boundaries shown as dashed vertical lines. .......... 69
Figure 5.2: A general system diagram for emotion change detection based on a sliding window and likelihood framework. .......... 70
Figure 5.3: Dual windowing process applied on speech signals for emotion change detection. The window centre represents a ground truth change point, around which a tolerance region is assigned. During each window, i.e. the previous, the current and the entire dual window, acoustic features are extracted on a per-frame basis. .......... 71
Figure 5.4: Problems arising from direct application of the Martingale framework can be resolved by several modifications, which are marked in red in this figure. Firstly, we introduce a global emotion model θ_emo, based on which a global strangeness value S can be calculated. S determines the switch between a Supermartingale in which M_n^ε increases, and a Submartingale in which M_n^ε declines. This results in a number of peaks and troughs as indicators of emotion change points, for which a two-pass linear regression method is used to detect changes in slopes. Note the logarithm of M_n^ε is used to prevent overflow and underflow issues. .......... 74
Figure 5.5: Comparison of (a) the original Martingale, already shown in Figure 3.5, and (b) the proposed Martingale. In this simplified example, samples are randomly generated from normally distributed data representing two classes, with a mean shift of 1 between them. The strangeness measure is the negative log likelihood of a single Gaussian distribution trained using the most recent 50 samples after a change is detected for (a) and only the first 50 samples for (b). .......... 76
Figure 5.6: Example of miss detection, false alarm, detection offset and tolerance region for evaluating emotion change detection performance. Green arrows represent correct detections, the dashed purple arrow represents a false alarm, and the dashed red arrow represents a miss detection. .......... 78
Figure 5.7: Equal error rates for GLR-based emotion change detection systems, comparing window length and window shift parameter choices, based on the IEMOCAP dataset for the ‘EMO-4’ task. .......... 79
Figure 5.8: A visual comparison of the GLR and EPLR scores with ground truth emotion change points, which are annotated as red dashed lines. A 3.5-second and a 0.7-second window size were used for GLR and EPLR according to results in Figure 5.7. .......... 80
Figure 5.9: DET curve comparison of proposed methods for EMO-4 on the IEMOCAP dataset of GLR, EPLR and normalised EPLR, as well as their fusion. .......... 81
Figure 5.10: EERs vs tolerance region durations for the general emotion change detection task (EMO-4). .......... 82
Figure 5.11: DET curves for the proposed Martingale method and the baseline GLR method using MFCCs for detecting changes EMO-2, ECD-A and ECD-V. Q in Eq. (5.9) was set to 50 and 70, which leads to two operating points for the Martingale method. Numbers in brackets represent the exact miss detection and false alarm probabilities for the proposed Martingale approach. .......... 84
Figure 5.12: An example application of the proposed Martingale framework to detect large changes in arousal on the IEMOCAP dataset (operating on speech from the first speaker). The dashed grey lines are ground truth emotion change points. As stated in Eq. (5.11), log(M) is the main Martingale score, on which emotion change detection operates. More precisely, a two-pass linear regression was applied on the log(M) curve to detect its peaks and troughs, which are indicators of emotion change points (dashed lines). .......... 86
Figure 5.13: Miss Detection Probability (thicker lines) and False Alarm Probability (thinner lines) vs tolerance region lengths for the three tasks. .......... 86
Figure 5.14: EERs (or FA and MD probability) vs tolerance regions for emotion change detection of arousal and valence using the GLR, EPLR and Martingale methods, evaluated on both the IEMOCAP and SEMAINE corpora. The GLR and EPLR methods generally had comparable performances for the two tasks. The Martingale method provided significant improvements over the other two methods for both arousal and valence across both databases. .......... 88
Figure 6.1: System diagram for the Staircase Regression Framework. Staircase 1 was originally proposed for depression in [126]. Three additional variants of staircases were proposed herein to consider different aspects of low-high pairs. .......... 94
Figure 6.2: Comparison of performance in CCC for the smoothed PLLR (solid lines) and 5 functionals (dashed lines) calculated from PLLR under different window sizes for mean and functional calculation. In both cases of mean-only and functionals, CCC dramatically increases as the window size enlarges, especially for valence. .......... 98
Figure 6.3: Distributions (around 7000 frames in total) of (a)(b) local PLLR and (c)(d) global PLLR for low and high arousal and valence, across all 9 speakers on the training data of RECOLA. The phoneme example selected for arousal is /O/, whilst the phoneme selected for valence is /s/. With smoothing using a 7-second moving average filter, the global PLLR features become considerably more emotionally discriminative than the original local PLLR features, which is consistent across both arousal and valence. .......... 99
Figure 6.4: A fully connected DNN architecture trained for tri-phone recognition. The bottleneck layer, which is situated at the second to last hidden layer, provides a low-dimensional summary of the output posteriors as well as maintaining information from the input features. .......... 104
Figure 6.5: Prediction accuracies in CCC for (a) arousal and (b) valence under different feature window sizes using both LLD-based and partition-based features, on the RECOLA dataset. The LLD-based features are shown as dashed black lines, whilst the partition-based features are shown as coloured lines. Higher CCC values indicate better emotion prediction performances. .......... 107
Figure 6.6: Comparisons of performances in CCC between the LLD-based features and the partition-based features for (a) arousal and (b) valence for the SEMAINE and CreativeIT corpora (RECOLA results are shown in Figure 6.5). The mean CCCs and 95% confidence intervals from all folds are shown. .......... 109
Figure 6.7: Effect of the number of partitions (phoneme groups) for English phoneme posteriors on (a) arousal and (b) valence prediction, evaluated on RECOLA, SEMAINE and CreativeIT. Performances were evaluated in averaged CCC across all LOSO folds, along with their 95% confidence intervals. .......... 112
Figure 6.8: Correlations between phoneme or phoneme group posteriors and the ground truth of arousal (a)(c)(e) and valence (b)(d)(f) for RECOLA, SEMAINE and CreativeIT. CC was first generated per speaker prior to averaging across all the speakers. .......... 113
Figure 6.9: Normalised relevance factors (RFs) for all 38-dimensional English phonemes (PLLR features) for (a) arousal and (b) valence on the RECOLA, SEMAINE and CreativeIT corpora. The higher the RF is, the more relevant the presence of a phoneme is to arousal or valence. .......... 117
Figure 6.10: CCC performances for (a) RECOLA-arousal, (b) RECOLA-valence, (c) SEMAINE-arousal, (d) SEMAINE-valence, (e) CreativeIT-arousal and (f) CreativeIT-valence at different iterations. At each iteration one PLLR feature was appended during RVM training. “most” means starting with and appending the most relevant phonemes. “least” means starting with and appending the least relevant phonemes. “random” means appending phonemes in different random orders. The CCC performances converged when all phonemes were included. .......... 119
Figure 6.11: Phonetically-aware eGeMAPS (PA-eGeMAPS) features evaluated for three datasets using RVM. α = 0 is equivalent to the smoothed eGeMAPS LLDs, whereas α = 1 is equivalent to all phonetic information being imposed into the eGeMAPS features. The 95% confidence intervals were also reported. .......... 122
Figure 6.12: Visualisation of the phonetically-aware eGeMAPS features for arousal and valence across (a)(b) RECOLA, (c)(d) SEMAINE and (e)(f) CreativeIT with alpha set to 1. The left column is for arousal, while the right column is for valence. In the figure, each block represents the Pearson’s correlation coefficients (CC) between a single phoneme-dependent acoustic feature and the emotion ground truth. CCs were calculated for each single speaker before being averaged to form a final CC, shown in each single pixel. .......... 128
Figure 7.1: Demonstration of the absolute emotion ratings and their first-order differences of 80 seconds on the RECOLA dataset. The y-axis represents values of the ratings. It can be seen that the first-order differences may be unsuitable to represent sub-utterance emotion changes due to the potential problems mentioned above, because emotions are unlikely to change with such high frequency. The possible range for arousal and valence is from -1 to 1, as stated in Section 4.1.2. .......... 133
Figure 7.2: Mean Pearson’s correlation and Cronbach’s alpha α of regression delta emotion ground truth and absolute ground truth for (a) RECOLA and (b) SEMAINE, as a function of the regression delta window sizes N_S = 2K + 1: longer regression delta windows for calculating delta ratings from absolute ratings resulted in increased inter-rater reliability. Information loss I_c for the regression delta emotion ground truth calculated under various regression delta window sizes is also shown: longer window sizes result in more information loss. .......... 137
Figure 7.3: Comparisons between the first-order differences and the proposed regression deltas as emotion change ground truth for arousal and valence on both RECOLA and SEMAINE datasets. The upper figures are absolute emotion ratings G_A (blue) and the reconstructed absolute emotion ratings G_r (orange) obtained by integrating the regression deltas. The reconstructed absolute emotion ratings are a smooth version of the original absolute emotion ratings. The middle row figures represent the first-order differences from the absolute emotion ratings, whilst the lower row figures are the regression deltas calculated from the absolute emotion ratings using Eq. (7.1) with the ‘optimal’ N_S in Table 7.1, i.e. 4 seconds for arousal and valence on RECOLA, and 6 seconds for arousal and valence on SEMAINE. The time period shown for all figures is 80 seconds. Note that the difference between the G_A (blue) and G_r (orange) in the upper row is quantified by information loss, which is the reduction in concordance correlation coefficients. .......... 140
Figure 7.4: Histogram of delta emotion ratings for arousal and valence on the RECOLA and SEMAINE datasets. The upper figures represent first-order differences of absolute emotion ratings, which have a significantly larger proportion of zeros for both arousal and valence. The bottom figures represent regression delta emotion ground truth, which offers reasonable distributions for emotion change. .......... 141
Figure 7.5: Concordance correlation coefficients were calculated and compared under different window lengths used to extract the eGeMAPS functionals between the (a) RECOLA and (b) SEMAINE corpora. CCC represents system performance for prediction. Consistent trends of declining performances for delta arousal and delta valence were observed for both databases as the feature window increased. By contrast, this was not the case for absolute emotions, where, with a larger feature window, performances degraded for (a) RECOLA and improved for (b) SEMAINE. .......... 145
Figure 7.6: Annotation delays were evaluated and compensated for both absolute and delta emotions for the (a) RECOLA and (b) SEMAINE databases via temporal shift and smoothing. CCC represents system performance for prediction. The optimal delays for delta arousal and delta valence were smaller than those for absolute emotions, around 1.2 s across both databases, whilst the best delays for absolute emotions vary for arousal and valence as well as for different corpora. .......... 146
Figure 7.7: Demonstration of frame-dropping applied on arousal regression deltas of RECOLA. Frames corresponding to the area between two dashed red lines were dropped, since they potentially represent non-change frames. The remaining areas were used for training and testing. .......... 151
Figure 7.8: CCC results evaluated for frame dropping experiments under different thresholds on the (a)(c) RECOLA and (b)(d) SEMAINE corpus. Larger thresholds lead to more training and testing data being discarded. The performances for predicting delta emotions are slightly lower than those for predicting absolute emotions on SEMAINE. However, the same trend can be seen on both RECOLA and SEMAINE that the gap in performances narrowed as only data with large emotion changes were used for training and testing. This confirms our hypothesis that it is the zero values in the first-order differences that lead to the considerable discrepancy between prediction of absolute and delta emotion on SEMAINE seen in Table 7.5. .......... 152
Figure 8.1: All speech data were divided into B, C, and A based on the first-order differences of ground truth emotion ratings, where zeros mean no emotion change, while non-zeros mean emotion change. The two demonstrated figures show 6.6 seconds of ratings. .......... 159
Figure 8.2: Proposed emotion prediction system architecture for investigating data selection based on emotion change. It should be noted that data selection is only applied on the training data to ensure comparability of the test results. That is, the selected training data were used to train regression models, based on which all the testing data were tested without data selection. .......... 160
Figure 8.3: Delay compensation for different data partitions (A, B and C), selected based on the first-order differences of (a) arousal and (b) valence ratings. .......... 162
Figure 8.4: Demonstration of how the window size W of a moving average filter impacts the selection of partition C. The left column includes (a) ground truth arousal ratings as well as smoothed arousal ratings using a filter of (b) 10 frames, (c) 25 frames and (d) 40 frames, while the right column contains the corresponding first-order differences of the (smoothed) arousal ratings. A threshold of 0.001 was applied to select the “change” frames, which is marked by red dashed lines. Note that only the red regions were used for training OA-RVM models. .......... 164
Figure 8.5: The purpose of this experiment was to evaluate the effectiveness of smoothed differences, based on which data selection was performed, as compared with the first-order differences. The baselines are marked in green: “ALL” means a baseline system trained on B+C+A, whereas “C_F” means a baseline system trained on only C (selected from the first-order differences). Other systems shown used partition C only, based on smoothed differences with different W values above a fixed threshold of 0.001 (i.e. the red dashed lines in Figure 8.4). The left axis denotes CCC performance represented by bars, while the right axis denotes the percentage of training data used for training, represented by a dashed line with triangle markers. The larger W is, the less training data were used for training. .......... 165
Figure 8.6: Performances of systems trained on speech segments with only large emotion changes, compared against two baselines “ALL” and “C_F” (in green). The training data, i.e. partition C, were selected based on different combinations of (T, R). For instance, (T, R) = [0.0012, 5] means selecting all emotion changes larger than T = 0.0012 and considering the nearest 5 frames adjacent to large changes. The left axis denotes CCC performance, while the right axis denotes the percentage of training data used for training, represented by a dashed line with triangle markers, as per Figure 8.5. From left to right, both T and R are increased. .......... 166
Figure 8.7: Examination of the relationship between static, ∆ and ∆∆ arousal/valence, via (a) first-order differences and (b) smoothed differences respectively. The first row of (a) and (b) shows ground truth arousal and valence ratings of 80 seconds, i.e. 2000 frames from the 1st speaker on the training data of RECOLA, from which either the first-order differences (a) or the smoothed differences (b) were used. A moving average filter of empirically selected length W = 6 seconds was applied for smoothing, and the smoothed arousal and valence ratings are marked as orange dashed lines in (b). The ‘position’, ‘velocity’, and ‘acceleration’ of a purely sinusoidal position signal are shown in (c) as an illustrative comparison. (c) conveys an important message that the ‘position’, ‘velocity’ and ‘acceleration’ are highly correlated in the ideal sinusoidal signals, but firstly the shifts need to be compensated. This may hold true for emotions, ∆emotions and ∆∆emotions. .......... 169
Figure 8.8: Effect of temporal shifts (DTS) on Pearson’s correlation coefficients (CC) between static emotions and shifted emotion dynamics, computed on all data, where ‘Ar’ – arousal; ‘Vl’ – valence; ‘smo’ – smoothing; ‘orig’ – no smoothing. Correlations between unsmoothed ratings and their emotion dynamics were significantly weaker than those with smoothing and shifts on RECOLA. .......... 170
Figure 8.9: Overview of Kalman filtering for arousal (x_t) prediction, adapted from [255], showing prediction and update stages; G_t, the Kalman Gain, controls the update stage (Section 2.3.6). .......... 172
Figure 8.10: Fusion of probabilistic outputs from individual KFs. For a single feature vector at time t, regression models trained on either emotions or emotion dynamics produce noisy measurements y_t^k, which update the predictions from KFs that are explicitly trained for arousal or valence. Each KF outputs a probabilistic prediction of arousal or valence, and the final prediction is attained via the product of all Gaussians. .......... 174
Figure 8.11: Performances of the Kalman filter under different internal delay values, evaluated in CCC against the RVM baseline (blue dashed line) and an implementation of [131] (red dashed line), for (a) arousal and (b) valence. .......... 177
Figure A.1: Effect of annotation delay compensation on a set of predicted arousal ratings. (a) Predictions without delay and smoothing are noisy and not well matched with the ground truth labels. (b) Applying temporal shifts to the training data improves system performance but results in predictions which are advanced in time by N frames compared to their ground truth. (c) Applying a binomial filter to these predictions not only smooths the output but resolves the synchronisation issue. .......... 192
Figure A.2: Delay compensation using frame shift and smoothing. The best delay for arousal was 4 seconds and the best delay value for valence was 2 seconds. .......... 193
Figure A.3: Block diagram showing the feature-level fusion strategy used to combine information from different modalities for the task of absolute emotion prediction. .......... 194
Figure A.4: Block diagram showing the decision-level fusion strategy used to combine information from different modalities for the task of absolute emotion prediction. .......... 195
Figure A.5: Block diagram showing the formation of an (M + N × K) output associative vector from a set of multimodal predictions for use in an OA fusion or OA regression system. .......... 196
Figure A.6: Block diagram showing the output-associative fusion strategy used to combine information from different modalities for the task of absolute emotion prediction. Note y_a and y_v represent the predicted arousal and valence scores respectively. .......... 197
Figure A.7: Block diagram showing the output-associative regression strategy used to combine information from different modalities for the task of absolute emotion prediction. Note y_a and y_v represent the predicted arousal and valence scores respectively. .......... 198

List of Tables Table 2.1: A summary of low-level descriptors in selected emotional speech processing literature. It can be seen that the most common features in literature are prosodic features, followed by spectral features, which tend to carry rich paralinguistic information. ................... 16 Table 2.2: Examples of commonly used functionals, which can be grouped into static and dynamic functionals, summarised from [27] and [79]. ............................................................... 17 Table 2.3: The GeMAPS and the extended GeMAPS (eGeMAPS) features. Abbreviations: M – Mean, STD – Standard deviation, 20P – 20% percentile, M-Slope – Mean of rising/falling slope, M-(un)voiced – Mean in (un)voiced region only, ESL – Equivalent Sound Level, ML(un)voiced – Mean Length of continuously (un)voiced regions. A detailed explanation and calculation of eGeMAPS feature set can be found in [71]. The underlined features are only included in the eGeMAPS LLDs/Functionals. ........................................................................... 19 Table 2.4: A summary of high-level representations for emotion recognition. .......................... 22 Table 2.5: Implementation of Kalman filtering for predicting arousal and valence (𝒕𝑛 )𝑁 𝑛=1 , given trained regression models, which output noisy measurements (𝒚𝑛 )𝑁 and a trained 𝑛=1 Kalman filter model 𝝀𝐾𝐹 = {𝒕𝟎, 𝑺𝟎, 𝑭, 𝑸, 𝑯, 𝑹} based on [129]. ............................................... 36 Table 4.1: Summary of emotional databases employed for investigation in this thesis. Abbreviations: M – Number of males, F – Number of females, ECD – Emotion Change Detection, ECP – Emotion Change Prediction, AEP – Absolute Emotion Prediction, FPS – Number of arousal/valence labels (frames) per second. ‘’ represents the data partition selected and ‘’ represents the selected data partition used for certain tasks in different chapters in this thesis. .......................................................................................................................................... 63 Table 4.2: Summary of constructed datasets for detection of emotion change of four different types, including resultant number of turns per class, the minimal distance and the final number of change points. ......................................................................................................................... 66 Table 6.1: Performance in CCC for eGeMAPS low-level descriptors and short-term PLLR features on the RECOLA dataset. ............................................................................................... 97 Table 6.2: Comparison of eGeMAPS and PLLR features using SVR, RVM and RVM-SR in CCC. The CCC normally ranges from 0 to 1 (although it is possible to be negative), with higher CCC values indicating better emotion prediction performance. ............................................... 100 Table 6.3: Empirically determined delays to compensate for annotation lag ........................... 106 Table 6.4: Hierarchical structure of English phonemes, adapted from [222]. An exemplar word was also included after each phoneme. “5, 9, 11, 13 partitions” mean 5, 9, 11, 13 phoneme groups, whereas “39 partitions” mean 39 individual phonemes. .............................................. 111 xxi

Table 6.5: Comparisons of the 25-dimensional smoothed eGeMAPS LLDs, 88-dimensional eGeMAPS Functionals, and the 975-dimensional phonetically-aware eGeMAPS features in CCC performance.
Table 7.1: Summary of regression delta window sizes for calculating regression delta emotion ground truth. Note that these were the ‘optimal’ settings chosen to trade off between inter-rater reliability and information loss, according to Figure 7.2.
Table 7.2: RMS value of absolute and delta ground truth emotion ratings for the RECOLA and SEMAINE datasets. The scale for absolute arousal and valence values is from -1 to 1 on both RECOLA and SEMAINE datasets, whereas the scales for delta arousal and valence values depend on different dynamic characteristics, and are therefore unknown.
Table 7.3: Summary of feature window size and annotation delay compensation for AEP and ECP. The smaller delay values for delta emotions indicate that annotators tend to respond more quickly to changes in emotion than to absolute emotions.
Table 7.4: Comparison of AEP and ECP using SVR and RVM on RECOLA and SEMAINE datasets.
Table 7.5: Percentages of non-change frames for RECOLA and SEMAINE datasets, for individual raters. “Average” means an average of the non-change percentages across all raters. “Final” means taking the mean of absolute emotion ratings before calculating the non-change percentage. The “Final” under the “Regression deltas ground truth” is the final delta emotion ground truth adopted for comparisons in the following experiment.
Table 7.6: Emotion change prediction using two sets of dynamic features: RD_LLDs and RD_FUN. Abbreviations: REF* – ECP systems using RVM and the appended eGeMAPS functionals whose dimensionality is 176; RD_LLDs – regression deltas of 25-dimensional eGeMAPS LLDs; RD_FUN – regression deltas of 6 functionals of the 25-dimensional eGeMAPS LLDs.
Table 7.7: Comparisons of optimised systems for both AEP and ECP using the OA-RVM framework. While 88-dimensional eGeMAPS features were used for absolute emotion prediction, concatenation of adjacent eGeMAPS functionals (176 dimensions) was used for ECP.
Table 8.1: Comparisons of performances in CCC using either all data (B+C+A) or subsets of training data (B, C, or A) with RVM used as the regression model. Compared with all training data, partition C achieved quite comparable CCC for both arousal and valence, whilst partitions B and A performed very poorly.
Table 8.2: Pearson’s Correlations (CC) between predicted static, Δ, ΔΔ arousal/valence and the ground truth of arousal and valence on the development set. Different sets of emotion dynamics ground truth were used for regression training to generate predicted emotion dynamics Δ𝒂, Δ𝒗, ΔΔ𝒂, ΔΔ𝒗, whose correlations with arousal and valence ground truth were further evaluated. For Δ𝒂, Δ𝒗, four conditions were considered: 1) with both smoothing and shifting, 2) smoothing only, 3) shifting only, and 4) without smoothing and shifting, whilst for ΔΔ𝒂, ΔΔ𝒗, no shifting was applied given that ΔΔ emotions are already most strongly negatively correlated with static emotions when no shift is applied (Figure 8.8(b)).
Table 8.3: CCC performances of single RVM system and RVM-Kalman Filter with joint modelling and probabilistic fusion.
Table A.1: Performance in CCC when combining delay compensation with either an SVR or RVM feature-level fusion system.
Table A.2: Performance in CCC when performing decision-level fusion using either an RVM or RLR to learn the fusion weights.
Table A.3: Performances in CCC when performing OA fusion using either an RVM or RLR to learn the fusion weights.
Table A.4: Performances in CCC of different OA-Regression systems using an OA-RVM set-up to learn the fusion weights. Different front-end systems were used to generate arousal and valence predictions, which were further used to train an OA-RVM system. “1-RVM” represents a single RVM model trained on four modalities, “4-SVR’s” represents training an SVR model for each modality, “4-RVM’s” represents training an RVM model for each modality, and “4-SVR’s + 4-RVM’s” represents concatenation of predictions of the ‘4-SVR’s’ and the ‘4-RVM’s’.



Chapter 1

Introduction

1.1 Research Overview

Nowadays, humans are surrounded by numerous emerging technologies that bring about revolutionary changes to their lives: mobile phones can understand the human voice and react to commands; smart home systems automatically tailor a comfortable living environment and recreational aids to users; automobiles can even drive themselves. What has really changed with the introduction of these technologies is the way humans and machines interact: mobile phones can now listen and speak to us; smart home systems act like a housekeeper to satisfy our needs; and automobiles accompany us and offer help during long journeys. What has driven these changes is the goal of developing intelligent machines for better Human Computer Interaction (HCI).

However, machines are not yet intelligent enough, in the sense that they neither have emotions of their own nor understand human emotions. This reduces the interaction between humans and machines to nothing but passive actions unilaterally taken by machines in response to human requests to assist in task completion. For example, a query for a weather update with either a happy or sad tone makes no difference to a smartphone assistive system, which can accurately return a weather forecast but fails to provide subsequent feedback to adapt to the happy mood or regulate the sad mood. The significance of emotion lies in the fact that it is ubiquitous and constantly affects nearly every aspect of human life, from family bonding to social interactions and decision-making. Hence, it would be short-sighted not to consider emotion as imperative in the development of “intelligent” machines. Recently, there has been emerging interest in equipping machines with emotional intelligence, in preparation for its important role in HCI, and a major area of research to this end is automatic recognition or prediction of a person’s emotional state using behavioural signals [1].

Automatic recognition of emotion is challenging, since emotion is subjective and difficult to define. However, many types of behavioural signals have been found to be associated with emotion, such as heart rate [2], blood pressure [3], muscle tension [4], facial expression [5] and speech [6]. Of all modalities, speech is advantageous because it is non-intrusive and readily accessible. More importantly, speech is one of the most natural means of communication between humans and between humans and machines, and offers rich emotional information. This has motivated the development of automatic speech-based emotion recognition systems during the last decade.

A system of this kind takes speech cues as input and produces a prediction of emotion. More precisely, there are two major components in the system, i.e. feature extraction, in which emotionally informative features are extracted from speech, and classification/regression analysis, which learns from these features via machine learning algorithms to classify or predict emotions given unknown speech data. The divergence in the second component is because emotion can be represented by not only emotion categories such as anger or happiness, but also numerical emotion dimensions such as arousal (i.e. how activated a person feels) and valence (i.e. how pleasant a person feels). Recently, the focus of automated recognition of emotion from speech has shifted gradually from recognising emotion categories on a per speech file basis towards continuous prediction of numerical emotion ratings, driven by the challenge of continuously tracking emotions in more naturalistic environments and the fact that ratings along emotion dimensions are more capable of capturing subtle variations in users’ emotional states [7].

Figure 1.1: An example of speech signals with synchronised arousal and valence continuous annotations. In the example, given a speech file of 5 seconds, a continuous emotion prediction system aims to continuously predict arousal and valence ratings, which are commonly annotated every 20 – 100 milliseconds. The original continuous arousal and valence ratings provided in emotional datasets are referred to as ‘absolute’ emotions (i.e. dashed lines), whereas changes in these ratings are referred to as ‘delta’ emotions in this thesis. ‘Delta’ emotions will be investigated in depth in Chapter 7.

Ideally, the feature extraction stage should maximise emotion characterisation while minimising unwanted variability, so that emotions or emotional intensities (dimensions) can be most accurately differentiated. However, there exist many types of variability which undermine the accuracy, reliability and robustness of emotion recognition systems, such as lexical variability [8], speaker variability [9], and channel and handset mismatch [10]. Among these, phonetic

variability, which arises from the variability of short-term acoustic features extracted from one phoneme to the next, has long been considered a confounding factor for emotion recognition [11], [12]. Accordingly, many studies have attempted to mitigate this variability for improved system performance, but they exhibit conflicting findings. On the one hand, elimination of phonetic variability via per-phoneme analysis of acoustic features [13] or models [14] was found helpful. On the other hand, the literature has shown that certain phonemes are themselves emotionally informative; for instance, acoustic features extracted within vowels [12] or phoneme sequences [15] are conducive to emotion classification, suggesting that preservation rather than elimination of phonetic information may be helpful. These conflicting findings imply that the effects of phonemes in emotional speech remain relatively less well understood, and that further examination of phonetic aspects of speech is needed from a feature engineering perspective.

Phonetic-related investigations are just one of numerous ways to improve upon existing speech-based emotion recognition systems. Another way to seek improvements upon existing emotion prediction systems is the incorporation of emotion change information, which is motivated by the increasing consensus within the research community that emotion change information is important for a better understanding of emotion, given the dynamic nature of emotion. For instance, in affective science, research into emotion dynamics has become a key topic in understanding emotions as well as their effects on social interaction [16]–[19]. Also, for automatic analysis of emotion, many previous systems incorporating emotion change have reported improved emotion recognition performances. Common approaches to accomplish this include utilising dynamic features [20], [21], applying dynamic models [22], [23], and analysing dynamic patterns from classifiers [24], [25] or regressors [26]. Despite this, improvements from the inclusion of emotion change were not always significant, and our understanding of whether or not dynamics captured in features or models can reflect changes in emotion remains incomplete. Thus, there is surely a need for a better understanding of emotion dynamics and effective frameworks to incorporate emotion dynamics into automatic systems.

Overall, investigations into speech-based emotion recognition or prediction systems have matured and their performances have plateaued over the last decade. However, stepping back to view the research field at a higher level, most studies to date have focused on classifying or predicting emotions individually from pre-segmented speech signals (e.g. on a file-by-file basis) [27], which lacks realism and often ignores the fact that emotion is dynamic in nature and evolves across time. For instance, a system may recognise a speech file as angry well, but it may fail to interpret when the anger appears and whether the speaker is becoming angrier or less angry, which may be more valuable than recognition of an emotion file. Hence, a system

that can interpret emotion and emotion change is expected to be more intelligent than systems that can only recognise emotion. However, although within-utterance information about emotion changes has recently been gaining more attention, there remain unresolved open questions for automatic systems in this regard, such as detecting when changes occur and estimating the extent of the changes. This is a common yet significant drawback of current systems, and there is a need to empower systems to understand a person’s emotion change. Nevertheless, what aspects of emotion change one should investigate remains an open question.

One important aspect of emotion change is the time instant at which it occurs, which has significant research potential and application possibilities. First, knowing the timing of emotional transitions is important for emotion regulation, so that an intervention can be made to change the time course of the emotion [28]. As an example, if a person is detected to be increasingly sad, people or machines may apply a deliberate intervention such as telling a joke to please them [29]. Interaction of this kind could make a big difference in preventing users from being dominated by negative emotion, which, if accumulated, may lead to depression. Second, it can be beneficial to detect the time at which emotion change occurs in a real-time emotion recognition system, where emotion recognition algorithms can be triggered once a change in emotion is detected, in place of ‘always-on’ recognition of emotions, which may be more applicable and computationally efficient in HCI [30]. These advantages are more pronounced in naturalistic, spontaneous speech, where the majority of emotions tend to be neutral. In addition, emotion change points in time can serve as boundaries of different salient emotional segments, obviating the need for manual segmentation prior to emotional signal processing. The need for detecting emotion change in time also arises for detecting outbursts of emotion changes within a larger group of people for security applications, as well as monitoring emotional changes in patients for medical purposes [31]. Nevertheless, to the best of our knowledge, few researchers have focused specifically on localising the point in time at which emotions transition from one category to the next. This problem is referred to as Emotion Change Detection (ECD) in this thesis.

Apart from the timing of emotion change, it is essential to assess how emotion changes over time, such as the extent of emotion change. One can easily envisage a system that detects emotion change points and then, for each emotion change, tries to predict how much emotion has changed. Such a system may be well suited to speech that is predominantly neutral, but occasionally becomes emotional. Emotion changes may reflect important changes in the external environment, such as events that trigger emotions; e.g. changes in task difficulty have been found to be associated with changes in arousal [32]. Evaluating the extent of emotion

change would signal how strongly the task difficulty or other events impact a person’s emotion, which would be valuable for cognitive research. The problem of predicting the extent of emotion change (in terms of changes in arousal and/or valence rating values) is referred to as Emotion Change Prediction (ECP) in this thesis. By comparison with ECP, conventional systems for continuously predicting absolute arousal and/or valence rating values (as shown in Figure 1.1) are referred to as Absolute Emotion Prediction (AEP).

[Figure 1.2 block diagram: Speech signal → Feature Extraction → {Absolute Emotion Prediction → emotional intensity (absolute); Emotion Change Detection → emotion change points in time; Emotion Change Prediction → change in emotional intensity (delta)}, with continuous emotion annotations, emotion change points and emotion change annotations drawn from the database.]

Figure 1.2: An overview of research conducted in this thesis. Concerning continuous absolute emotion prediction, the main objectives are two-fold, i.e. investigating phonetic-related features (Chapter 6) and exploiting emotion change information (Chapter 8). As for emotion change research, the investigations are undertaken to build the first systems to detect emotion change points in time (Chapter 5) and to predict the extent of emotion change (Chapter 7) from speech. There have been many studies regarding AEP and Feature Extraction, while ECD and ECP have been rarely investigated to date.

1.2 Thesis Objectives

Given the limitations discussed in the previous section, the main objectives of this thesis can be framed in terms of four main research questions.

• How can emotion change points in time be detected from speech? (Chapter 5)
• How can existing emotion prediction systems be improved via the exploitation of phonetic-related features? (Chapter 6)
• What aspects are important for the development of automatic systems that predict the extent of emotion change, and what are the possibilities for system performance? (Chapter 7)
• In what ways can emotion change information be exploited to aid emotion prediction systems? (Chapter 8)


1.3 Thesis Organisation

The remainder of this thesis is organised as follows:

Chapter 2 presents an overview of literature regarding speech-based emotion prediction systems with respect to emotion theory, front-end feature representation, and a wide range of automatic prediction and modelling techniques. For emotion-related features, a range of acoustic low-level descriptors and high-level representations of speech are discussed. In addition, a range of automatic prediction and modelling techniques including Gaussian Mixture Models (GMMs), Support Vector Regression (SVR), Relevance Vector Machine (RVM), RVM-Staircase Regression (RVM-SR), Output-associative RVM (OA-RVM), and Kalman Filtering (KF) are discussed.

Chapter 3 reviews existing literature that investigated emotion change from both psychological and engineering perspectives. In particular, computational models using emotion change and the timing of emotion change are discussed. Moreover, potential change point detection techniques are described in detail.

Chapter 4 introduces the four emotional corpora upon which investigations in this thesis are conducted. Due to different problem formulations involved in emotion prediction systems and emotion change related research, this chapter also describes how existing datasets are modified to produce datasets that suit the needs of emotion change research.

Chapter 5 presents investigations into the task of Emotion Change Detection, which aims to localise the change points in time for both emotion categories and dimensions. Two frameworks are proposed to resolve this problem, including a sliding dual-window framework and a statistical Martingale framework.

Chapter 6 investigates the use of partition-based features for speech-based continuous absolute emotion prediction. A wide range of modelling techniques, which partition the acoustic space of speech in various ways, are compared to understand the effectiveness of partitioning approaches in general. In addition, phone posterior features are used to understand the contribution of phonetic content to emotion prediction. Moreover, a novel set of phonetically-aware features is proposed to incorporate useful phonetic information into conventional acoustic features.

Chapter 7 initiates an area of research named Emotion Change Prediction, which aims to predict changes in dimensional emotion ratings on a per-frame basis, i.e. a regression problem. A novel way to derive emotion change ground truth is presented, and analysed from the perspectives of inter-rater reliability and information loss. A range of system designs are

explored and compared against state-of-the-art systems that predict absolute emotion dimensions.

Chapter 8 explores how to effectively exploit emotion dynamics for improving speech-based emotion prediction systems from two major perspectives: (i) data selection based on emotion change, and (ii) emotion dynamic modelling based on an understanding of the relationship between emotion and emotion dynamics and Kalman filtering.

Chapter 9 concludes the thesis by summarising the research contributions and future research potential.

1.4 Major Contributions

The major contributions of this thesis include:

• The first comprehensive review of emotion change research concerning both affective science and automated systems, highlighting areas of research that are understudied with potential to be explored for automated system development.

• The first investigations into automatic detection of the instant of emotion change points in time (Emotion Change Detection) using two proposed frameworks: a dual-sliding window framework and a statistical Martingale framework. Results from both frameworks indicate reasonable detection performances and underscore the importance of incorporating prior emotion information for this task.

• Evaluation, for the first time, of purely phonetic features for continuous absolute emotion prediction, attaining state-of-the-art performance. This finding yields both great promise and surprise, since phonetic variability has long been considered detrimental for emotional speech processing. Moreover, the discriminative nature of phonetic features can be exploited using a Staircase Regression (SR) framework for further gains.

• In-depth investigations into phonetic features to understand their success in emotion prediction from two major perspectives: partitioning information and phonetic content. Results indicate that partitioning information is helpful for valence prediction and phonetic content can be associated with improved arousal prediction. These findings extend the current understanding of partitioning-related and phonetic-related emotion research within the affective computing community.

• A novel set of phonetically-aware acoustic features, which incorporate useful phonetic information into conventional acoustic features. This feature set allows a trade-off between useful phonetic information and unwanted phonetic variability. Results indicate that incorporation of phonetic information improves prediction accuracy for both arousal and valence (by a large margin) across three widely used emotional corpora. Moreover, the proposed feature set extends previous research, where emotion is investigated on a per-feature basis or (to a lesser extent) on a per-phoneme basis, to a per-phoneme-feature basis.

• The first system to continuously predict the extent of emotion change from speech (Emotion Change Prediction). Regression deltas are proposed as the ground truth for emotion change based on careful examination of inter-rater reliability and information loss, offering a viable approach to render a meaningful representation of emotion change from existing continuously annotated emotional datasets. Furthermore, the first system design for predicting changes in emotion is explored. Correlation results compared with emotion prediction systems demonstrate that emotion change can be better predicted than absolute emotion ratings.

• Investigation of emotion change exploitation in absolute emotion prediction systems.
  o Data Selection based on emotion change: Results indicate that speech segments with emotion change are more emotionally informative than those without.
  o Kalman Filtering for emotion change modelling: It is found helpful to jointly model emotion and emotion change in the KF framework, which also allows probabilistic fusion. Of particular interest is the finding that emotion ratings are for the first time found to be highly correlated with a delayed version of emotion dynamics, which offers a new insight concerning the relationship between emotion and emotion dynamics.

1.5 List of Publications

Journal Publications:

Huang, Z., and Epps, J., “Prediction of Emotion Changes from Speech,” Frontiers in ICT, to appear, 2018. (Chapter 7)

Huang, Z., and Epps, J., “An Investigation of Partition-based and Phonetically-aware Acoustic Features for Continuous Emotion Prediction from Speech,” IEEE Transactions on Affective Computing, to appear, 2018. (Chapter 6)

Conference Publications:

Huang, Z., and Epps, J., “An Investigation of Emotion Dynamics and Kalman Filtering for Speech-based Emotion Prediction,” in INTERSPEECH, Stockholm, Sweden, 2017. (Chapter 8)

Huang, Z., and Epps, J., “A PLLR and multi-stage Staircase Regression framework for speech-based emotion prediction,” 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, 2017, pp. 5145-5149. (Chapter 6)

Huang, Z., and Epps, J., “Time to Embrace Emotion Change: Selecting Emotionally Salient Segments for Speech-based Emotion Prediction,” in 16th Australasian International Conference on Speech Science and Technology, Sydney, Australia, 2016, pp. 281-284. (Chapter 8)

Huang, Z., Stasak, B., Dang, T., Wataraka Gamage, K., Le, P., Sethu, V., and Epps, J., “Staircase Regression in OA RVM, Data Selection and Gender Dependency in AVEC 2016,” in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge (AVEC ’16). ACM, Amsterdam, Netherlands, 2016, pp. 19-26.

Huang, Z., and Epps, J., “Detecting the instant of emotion change from speech using a martingale framework,” 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 2016, pp. 5195-5199. (Chapter 5)

Huang, Z., Dang, T., Cummins, N., Stasak, B., Le, P., Sethu, V., and Epps, J., “An investigation of annotation delay compensation and output-associative fusion for multimodal continuous emotion prediction,” in Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge (AVEC ’15). ACM, Brisbane, Australia, 2015, pp. 41-48. (Appendix A)

Huang, Z., “An investigation of emotion changes from speech,” 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), Xi'an, China, 2015, pp. 733-736. (Chapter 3)

Huang, Z., Epps, J., and Ambikairajah, E., “An Investigation of Emotion Change Detection from Speech,” in INTERSPEECH, Dresden, Germany, 2015, pp. 1329-1333. (Chapter 5)

Dang, T., Stasak, B., Huang, Z., Jayawardena, S., Atcheson, M., Hayat, M., Le, P., Sethu, V., Goecke, R., and Epps, J., “Investigating Word Affect Features and Fusion of Probabilistic Predictions Incorporating Uncertainty in AVEC 2017,” in Proceedings of the 7th International Workshop on Audio/Visual Emotion Challenge (AVEC ’17). ACM, Mountain View, CA, USA, 2017.


Chapter 2

Automated Prediction of Emotions: A Review

2.1 Emotion: Definition and Representation

Emotion is ubiquitous in our daily life and has long been a standing research area [33]. However, there has never been a consensus regarding the definition of emotion within the research community because of potential inter-discipline, inter-cultural, inter-language and inter-individual differences when defining emotion [34]. Despite this, an emotion theory which has gained increasing popularity and consensus is the component process model [35], also known as ‘Appraisal theory’ [36], [37]. The theory regards emotion as a continuous process under the collective influence of multiple components, including a cognitive component, neurophysiological component, motivational component, motor expression component and subjective feeling component [34]. The changes in these components are triggered by internal or external stimulus events, and together form a physiological response called “emotion”. Examples of external stimuli could be a change in a person's pain or surrounding events, while internal stimuli could be memories and sensations.

The above-mentioned emotion theory provides theoretical support to system development in this thesis in two aspects. Firstly, among all the components, the motor expression component represents changes in facial and vocal expression that underlie emotion and how emotion evolves across time, which hints at the possibility of inferring emotion and emotion change via behavioural signals. Secondly, this theory explains the need to treat emotion as a dynamical state. According to [34], “while we are in the habit of talking about ‘emotional states’ these are rarely steady states. Rather, emotion processes are undergoing constant modification allowing rapid readjustment to changing circumstances or evaluations”.

Another important aspect of emotion is how we perceive and represent emotion. There are two major types of descriptions in both affective science (psychological research) and affective computing systems (engineering-oriented research), i.e. categories and dimensions. The categorical representation assigns emotion to different emotional categories such as happiness, anger and sadness. One assumption behind this is that some basic (prototypical) emotions can represent the majority of emotions that occur on a daily basis [6], [38], which is similar to the theory of colour where every single colour can be represented by a combination of several primary colours. For an in-depth analysis of basic emotions, the reader may refer to [38]. The most widely used basic emotional categories are the "Big Six", namely happiness, sadness,

fear, disgust, anger and surprise, due to their popularity, dominance and occurrence in our daily life [1], [5]. From an engineering point of view, this categorical approach is straightforward and consistent with the human experience. Nevertheless, it also has several drawbacks. Emotion is ambiguous in nature and has no clear-cut boundaries, so it is difficult and inappropriate to ensure precise representation. Moreover, since there are various components simultaneously contributing to production of emotion, according to the component process model, each component might evolve differently across time, which results in subtle changes in emotion. Since the categorical representation imposes one global category for a whole speech segment or file, it is considered less capable of representing complex and ambiguous emotions as well as capturing subtle changes in emotion, especially for naturalistic data. In spite of some work done towards resolving these problems in emotion recognition systems [39]–[41], recent years have marked an increasing number of studies adopting a dimensional representation, e.g. [7], [42], [43].

Figure 2.1: The activation-evaluation space, alternatively known as the arousal-valence space (reproduced from [1]).

The dimensional representation maps emotion into a few continuous dimensions [1], [44]–[46]. Among the most widely used emotional dimensions are arousal (i.e. activated vs deactivated) and valence (i.e. positive vs negative), constituting a so-called arousal-valence space, as exemplified in Figure 2.1. Arousal measures how activated a person feels, while valence measures how pleasant a person feels. Other additional dimensions such as potency and

unpredictability have been proposed to complement the two-dimensional space for a better and more complete representation of emotion [46].

The dimensional representation method is advantageous in various ways compared with the categorical representation of emotion. First, it can be used to describe emotional intensity, which is measured in numeric dimensions. Taking this further, it also allows a quantitative measure of how much emotion has changed through time by simply examining changes in the numeric emotion dimensions. Second, as seen in the increasing volume of continuously annotated corpora [47]–[49], emotion dimensions that are annotated on a per-frame basis (normally every 20 to 100 milliseconds) can capture the evolution of emotion, for instance, a general trend or direction. Although the dimensional representation is not intrinsically straightforward, it can somewhat alleviate the issues in the categorical representation, such as ambiguity, complexity and limited capacity for capturing subtle changes in emotion, thereby being more suitable in a naturalistic context. However, it should be noted that mapping emotion into just a few dimensions inevitably causes a loss of information. For instance, some emotions, such as fear and anger, are less distinguishable in the arousal-valence space [50].

Since emotion is subjective and perceived differently by different individuals, emotion annotations, in either categories or dimensions, are commonly obtained from multiple annotators who provide their subjective ratings based on audio and video signals. This raises the question of how reliable the ratings are, and a key measure for evaluating this is inter-rater agreement/reliability [51]. Inter-rater agreement refers to agreement on categorical or numeric annotations among annotators, which is normally measured by Cronbach’s α [47]–[49].
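To make this reliability measure concrete, the sketch below computes Cronbach’s α over a frames × raters matrix of continuous ratings. The data here are synthetic and purely illustrative; they are not drawn from any corpus used in this thesis, and the rater count and frame count are arbitrary assumptions.

```python
import numpy as np

def cronbach_alpha(ratings: np.ndarray) -> float:
    """Cronbach's alpha for an array of shape (n_frames, n_raters)."""
    n_frames, n_raters = ratings.shape
    rater_variances = ratings.var(axis=0, ddof=1)      # variance of each rater's rating track
    total_variance = ratings.sum(axis=1).var(ddof=1)   # variance of the summed ratings
    return (n_raters / (n_raters - 1.0)) * (1.0 - rater_variances.sum() / total_variance)

# Example: 6 hypothetical raters annotating arousal for 7500 frames
rng = np.random.default_rng(0)
true_arousal = rng.normal(size=7500)
ratings = np.stack([true_arousal + 0.3 * rng.normal(size=7500) for _ in range(6)], axis=1)
print(round(cronbach_alpha(ratings), 3))   # values near 1 indicate high inter-rater reliability
```

In practice, α is typically reported per dimension (arousal, valence) and per recording or corpus, with higher values indicating more consistent annotation.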

2.2 Emotion in Speech

The association between emotion and speech has long attracted interest within the research community, and attempts to extract emotion-related features from speech are ongoing. While emotion-related acoustic features, which are also known as acoustic low-level descriptors (LLDs), have been extensively studied and reviewed [1], [6], [27], [52]–[54], approaches that derive high-level representations from the low-level descriptors have mainly been restricted to calculating supra-segmental functionals, and by contrast other types of high-level representations have been relatively less discussed within the affective computing community. This section reviews commonly used acoustic features as well as different high-level representations which are useful for inferring emotion.

2.2.1 Acoustic Low-Level Descriptors

Extracting meaningful and informative sets of features from speech is crucial for developing emotion recognition systems. An initial step in processing speech commonly involves segmentation of speech signals into frames of tens of milliseconds, since over this period the speech articulators are approximately unchanged, relative to the frequency of vocal fold and vocal tract vibrations [55]. In other words, characteristics of speech remain approximately stationary, and consequently can be captured via numerous feature extraction techniques. Commonly adopted acoustic LLDs include prosodic, spectral, and voice quality features.

Prosodic features

Prosody refers to the rhythm, stress and intonation characteristics of speech, which are intuitively associated with emotion. Indeed, prosodic features, which commonly include pitch, intensity, speaking rate and intonation, have long been found to contain rich emotion-related information and yield good performance for recognising emotion [1], [6]. From a speech production point of view, pitch is produced from vocal fold vibrations, whose frequency is referred to as the fundamental frequency F0. This single feature is significant for emotion inference. For instance, when a person gets angry, his or her vocal folds become tense and their closing and opening rate increases, which leads to an increase in pitch. To date, pitch has been the most prevalent acoustic parameter studied for emotion or emotion recognition systems, e.g. [1], [6], [52], [54], [56]–[61]. Despite this, it is worth noting that frame-level pitch has rarely been used; instead, a number of parameters describing the pitch contour, such as the mean, median and range, have been evaluated. This is partly due to the fact that early emotional corpora tend to have only utterance-level emotion labels, for which global information was found necessary and useful [57]. However, despite an emerging number of datasets with frame-level annotation of emotion, supra-segmental pitch features are still preferred over frame-level pitch, which tends to be less informative and exhibits more local variation.

Another important prosodic feature is intensity. It represents the loudness (or energy) of speech, which tends to vary between emotional speech, for instance, angry and sad speech. Similarly to pitch, the use of intensity has proliferated in emotion recognition systems [23], [27], [52], [56], [60]–[63]. However, this acoustic parameter can occasionally be unreliable in the sense that it is vulnerable to changes in the distance between a microphone and a speaker.

Spectral features

Spectral features are usually sets of parameters that describe the details of short-term speech in the frequency domain and carry information regarding the timbre of sound. A set of frequently used and straightforward spectral features are formants, which represent peaks of the spectral envelope, associated with resonances of the vocal tract. It has been found that the shape of the vocal tract alters under different emotional states, which leads to changes in formants. Therefore, formants are emotionally informative and have been extensively used in the emotion literature [27], [52], [56]. Formants can be characterised by their frequency, bandwidth, and amplitude. Moreover, additional features related to formants have also been applied, such as band energy, roll-off, centroid or flux [56].

Another set of emotion-related spectral features is Mel-Frequency Cepstral Coefficients (MFCCs), which remain the most widely used for speech-related classification tasks, ranging from speech recognition [64] and speaker recognition [65] to emotion recognition [13], [27], [56], [62], [66]–[68]. MFCCs are essentially a set of filterbank-based spectral parameters that describe a smoothed envelope of the spectrum of interest. They are calculated by first filtering a speech frame with a number of linear bandpass filters, which are distributed in frequency based on the non-linear mel frequency scale, a psychoacoustically motivated scale designed to simulate the human auditory system [69]. Then, the log energy is calculated within each band to approximate the amplitude of the spectrum of the speech signal within different frequency ranges, followed by application of the Discrete Cosine Transform (DCT) to decorrelate the energies. Apart from the MFCCs, delta (Δ) and delta-delta (ΔΔ) MFCCs are often calculated to capture changes in the spectrum of speech signals [55]. For emotion recognition, MFCCs, as well as other spectral features such as Linear Prediction Cepstral Coefficients (LPCCs) and Log Frequency Power Coefficients (LFPC) [62], have been found to perform reasonably well, highlighting the significance of spectral information for emotion. Somewhat unlike prosodic features, which are calculated at the supra-segmental level, frame-level MFCCs tend to be frequently used as direct inputs into back-end classification models¹, which are commonly Gaussian Mixture Models (GMMs) [66], [68], [70] or Hidden Markov Models (HMMs) [62]. Although the combination of MFCCs and GMMs/HMMs has been dominant in speech fields such as speech recognition and speaker recognition, it has not yielded significantly better performance than supra-segmental acoustic features combined with static models such as the Support Vector Machine (SVM) [68]. MFCCs, without question, remain a key feature set to be considered for speech-based emotion recognition systems [71], where they offer many advantages. MFCCs are decorrelated via the DCT and in turn offer stability and robustness in the feature space. Despite the extensive use of MFCCs in emotion recognition systems, it should be noted that MFCCs, which are sensitive to changes in phonemes and speakers, will incur phonetic and speaker variability and thereby degradation in emotion recognition performance.
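As a concrete illustration of the MFCC (and Δ/ΔΔ) computation described above, the following minimal sketch uses the librosa toolkit; the toolkit choice, file name and frame settings are illustrative assumptions, not the configuration used in this thesis.

```python
import librosa
import numpy as np

# Load speech at 16 kHz (the file path is hypothetical).
y, sr = librosa.load("speech.wav", sr=16000)

# 13 MFCCs per ~25 ms frame with a 10 ms hop: mel filterbank -> log band energies -> DCT.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr), hop_length=int(0.010 * sr))

# Delta and delta-delta coefficients capture local changes in the spectral envelope.
d_mfcc = librosa.feature.delta(mfcc)
dd_mfcc = librosa.feature.delta(mfcc, order=2)

frame_features = np.vstack([mfcc, d_mfcc, dd_mfcc])  # shape: (39, n_frames)
print(frame_features.shape)
```

Such frame-level vectors can then be fed directly to GMM/HMM back-ends, or summarised by functionals (Section 2.2.2) for static classifiers and regressors.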

¹ This approach can be viewed as a parameterised version of functionals. Also, it is not uncommon to calculate functionals of frame-level MFCCs, like prosodic features, in emotion recognition or prediction systems.

The use of MFCCs is especially interesting when it comes to modelling and representing the acoustic space of speech at a higher level with the aid of modelling techniques. This is commonly achieved via GMMs, which cluster unique linguistic and paralinguistic information in the acoustic space of speech that helps differentiate phonemes [72], speakers [73], gender [74], and depression [75], for example. However, what aspects of the acoustic space derived from MFCCs are helpful for differentiating emotions remains relatively less explored in emotion recognition systems.

Voice quality features

The speech production mechanism is commonly treated as a source-filter model, where excitation (source) signals from vocal fold vibrations are filtered by the vocal tract to produce speech. Voice quality features are parameters that describe glottal flow waveforms, which can help quantify harsh, breathy, or tense sounding speech. It is straightforward that voice quality features are affected by different emotional states. The literature has shown a strong correlation between emotional states and voice quality features, which have therefore been increasingly adopted in emotion recognition systems [1], [6], [58], [76], [77]. In extracting voice quality features, an inverse-filtering technique can be used to isolate the excitation source signals. However, a more commonly adopted approach is to directly numerically calculate a subset of voice quality-related features such as jitter (a cycle-to-cycle variation of the fundamental frequency F0), shimmer (a cycle-to-cycle variation in the energy), and the harmonic-to-noise ratio (HNR). These features can reflect breathiness, creakiness and harshness in speech. In-depth discussions of voice quality features can be found in [9], [54].

Table 2.1: A summary of low-level descriptors in selected emotional speech processing literature. It can be seen that the most common features in the literature are prosodic features, followed by spectral features, which tend to carry rich paralinguistic information.

| Acoustic LLDs | Example features | Example references |
| --- | --- | --- |
| Prosodic features | Pitch, intensity, speaking rate, volume, and intonation | [1], [52], [54], [56], [58], [59], [61], [62], [63], [78], [79] |
| Spectral features | Formants, MFCCs | [1], [13], [50], [56], [62], [66], [67], [78] |
| Voice quality features | Jitter, shimmer, and harmonic-to-noise ratio (HNR) | [1], [76], [77], [58] |

2.2.2 High-Level Representations

Functionals

Since acoustic LLDs (Section 2.2.1) tend to suffer from local variations, functionals have dominated their use in emotion recognition to date. Functionals are statistics of various kinds calculated from acoustic LLDs across a longer time span, which is in part justified by the supra-segmental nature of phenomena found in emotional speech [27]. A detailed example can be found in Table 2.2, where all functionals can be roughly categorised into two classes: static and dynamic (a minimal computational sketch of a few such functionals is given after Table 2.2). The former capture information regarding the feature distribution such as location, dispersion, and shape, whereas the latter capture time-related variations in feature contours such as regression coefficients.

Table 2.2: Examples of commonly used functionals, which can be grouped into static and dynamic functionals, summarised from [27] and [80].

| Type of functionals | Examples |
| --- | --- |
| Static | Extremes (min, max, range, …), arithmetic mean, centroid, quartiles, inter-quartile ranges, higher moments (standard deviation, kurtosis, skewness), etc. |
| Dynamic | Regression coefficients (linear or quadratic), rising/falling slope of feature contour, rate of peaks, DCT coefficients, linear prediction coefficients, etc. |
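The sketch below computes a handful of static and dynamic functionals from a single LLD contour using NumPy. The synthetic pitch contour and the particular set of statistics are illustrative assumptions only; they do not correspond to any standardised feature set.

```python
import numpy as np

def functionals(lld_contour: np.ndarray) -> dict:
    """A few static and dynamic functionals of one LLD contour (e.g. frame-level pitch)."""
    frames = np.arange(len(lld_contour))
    slope, _intercept = np.polyfit(frames, lld_contour, deg=1)   # linear regression coefficient
    return {
        "mean": lld_contour.mean(),
        "std": lld_contour.std(ddof=1),
        "min": lld_contour.min(),
        "max": lld_contour.max(),
        "range": lld_contour.max() - lld_contour.min(),
        "p20": np.percentile(lld_contour, 20),
        "p80": np.percentile(lld_contour, 80),
        "slope": slope,   # dynamic functional: overall rising/falling trend of the contour
    }

# Hypothetical pitch contour (Hz) over one utterance
rng = np.random.default_rng(1)
f0 = 120 + 15 * np.sin(np.linspace(0, 3 * np.pi, 300)) + rng.normal(0, 2, 300)
print(functionals(f0))
```

Applying such statistics to every LLD, and concatenating the results, yields a single fixed-dimensional vector per utterance (or per analysis window), which is the form consumed by the regression models discussed later in this chapter.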

Functionals offer many advantages compared with acoustic LLDs. First, the histogram of LLDs can be characterised by a set of brute-forced parameters, and details of histograms were found informative for emotion recognition [56]. However, the resultant features can be high-dimensional and contain redundant and irrelevant information, e.g. the 6373-dimensional Computational Paralinguistics ChallengE (ComParE) feature set [81], so it is commonplace to apply feature selection approaches such as Sequential Forward Selection (SFS) [61], [82], [83]. Second, functionals, which are calculated at the supra-segmental level, tend to be less sensitive to local variations of LLDs and more capable of capturing temporal information. Third, functionals enable mapping from LLDs of different lengths to a single fixed-dimensional feature vector, i.e. length normalisation [27]. This is well suited to the case where speech utterances have different lengths, especially in earlier times when the majority of emotional corpora contained only utterance-level emotion labels. However, length variability could be detrimental, since feature vectors derived from long utterances may carry more information than those derived from shorter utterances, which seems to be problematic for emotion recognition [68] as well as other tasks such as speaker recognition [84].

One open question regarding LLDs and functionals is which ones are best for emotion recognition. Back in the nineties, hand-picked and expert-driven features (e.g. pitch average, range, and variability) were preferred [1]. At that time, statistical tests performed on one-dimensional features such as pitch, and comparisons of distributions, were common [1]. This was followed by a trend towards using large-scale sets of functionals (a natural extension of the former approach) [56] in the last decade, and it was shown in the literature that functionals consistently produce state-of-the-art performance [27]. More recently, particular interest within the research community has turned to proposing compact knowledge-driven feature sets [71], [85]. This is exemplified by the Geneva Minimalistic Acoustic Parameter Set (GeMAPS), which has been increasingly popular and has become a reference feature set for speech emotion research. It is noted in [71] that the 62-dimensional GeMAPS and its extended version (the 88-dimensional eGeMAPS) yielded comparable performance to five other sets of brute-forced functionals (dimensionality from 384 to 6373) for binary classification of arousal and valence across six commonly used emotion datasets. A detailed description of eGeMAPS features can be found in Table 2.3 (a brief feature extraction sketch follows the table). Similarly, in [85], three single-dimension features, i.e. median pitch, median vocal intensity and HF 500, were shown to achieve approximately state-of-the-art performance for arousal classification and regression. These studies suggest a necessity to understand acoustic features, and that emotion information can be primarily captured by a few statistics. Compact knowledge-driven feature sets also offer the advantage of good cross-corpus generalisation, since high-dimensional functionals², which suffer from the curse of dimensionality more often than low-dimensional ones, may be dataset-specific in general [86], [87].

² Overfitting is not necessarily associated with the use of high-dimensional functionals; for example, regression models such as SVR and RVM can enforce sparsity on either data instances or feature weights to prevent overfitting in high-dimensional functionals. Overfitting could also arise from classification or regression models.

Table 2.3: The GeMAPS and the extended GeMAPS (eGeMAPS) features. Abbreviations: M – Mean, STD – Standard deviation, 20P – 20% percentile, M-Slope – Mean of rising/falling slope, M-(un)voiced – Mean in (un)voiced region only, ESL – Equivalent Sound Level, ML-(un)voiced – Mean Length of continuously (un)voiced regions. A detailed explanation and calculation of the eGeMAPS feature set can be found in [71]. The underlined features are only included in the eGeMAPS LLDs/Functionals.

| Group | Low-level descriptors | Functionals |
| --- | --- | --- |
| Frequency-related features | Pitch F0 | M, STD, 20P, 50P, 80P, 20-80P, M-slope, STD-slope |
| | Jitter | M, STD |
| | Formant frequency: F1, F2, F3 | M, STD |
| | Formant bandwidth: F1, F2, F3 | M, STD |
| Energy-related features | Shimmer | M, STD |
| | Loudness | M, STD, 20P, 50P, 80P, 20-80P, M-slope, STD-slope |
| | HNR | M, STD |
| Spectral features | Alpha ratio | M, STD, M-unvoiced |
| | Hammarberg index | M, STD, M-unvoiced |
| | Spectral slope: 0-500 Hz, 500-1500 Hz | M, STD, M-unvoiced |
| | Relative energy: F1, F2, F3 | M, STD |
| | Harmonic difference: H1-H2, H1-A3 | M, STD |
| | Spectral flux | M, STD, M-unvoiced, M-voiced, STD-voiced |
| | MFCC 1-4 | M, STD, M-voiced, STD-voiced |
| Additional features | | ESL, Rate of loudness peaks, ML-voiced, STDL-voiced, ML-unvoiced, STDL-unvoiced, N-voiced |
| Total dimensionality | GeMAPS LLDs: 18-Dim; eGeMAPS LLDs: 25-Dim | GeMAPS functionals: 62-Dim; eGeMAPS functionals: 88-Dim |
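For readers wishing to reproduce eGeMAPS-style features, the openSMILE toolkit associated with [71] also ships a Python interface (opensmile-python); the snippet below is a usage sketch under the assumption that this package and its eGeMAPSv02 configuration are available, and the file name is illustrative.

```python
import opensmile

# Utterance-level functionals (88 dimensions in the eGeMAPSv02 configuration)
smile_func = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)

# Frame-level LLDs (25 dimensions in the eGeMAPSv02 configuration)
smile_lld = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)

functionals = smile_func.process_file("speech.wav")   # one row per file
llds = smile_lld.process_file("speech.wav")           # one row per frame
print(functionals.shape, llds.shape)
```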

Another open question is over what time scale functionals should be calculated. Many studies have investigated features extracted at different levels, such as frame level [68], phoneme level [13], [67], [88], syllable level [88], [89], word level [89], and voiced, sentence and utterance level [57], [68]. Despite this, utterance-level analysis of features remains a dominant approach, which often yields state-of-the-art performance [90]. Furthermore, features extracted at different levels have been found to complement each other, and their fusion can achieve improved performance [90]. Of particular interest are phoneme-level features. With the intuition that different phonemes have diverse discriminating power for emotion, a few studies have focused on phoneme-level features [12], [13], [67]. In [44], spectral features extracted from vowel sounds gave the best classification performance compared with those from other phoneme groups. However, a conflicting result was reported in [45], where spectral properties in the

consonant regions contained more emotion-specific information than those in the vowel region. Nevertheless, systems that extract features at the phoneme level remain unusual. This is partly because some phonemes occur less often than others in existing emotional datasets, which are commonly small in size.

Linguistic features

Another high-level representation of speech is linguistic features, which represent the lexical content in people's voice. It seems reasonable to link what people say to how they feel; people speak differently under various emotional states. The literature has shown that linguistic features are effective in emotion recognition, and further that combining them with acoustic features yields improvements over systems that rely simply on either acoustic or linguistic features [56], [61], [91], [92], [93], [94]. For instance, in [93], lexical features were investigated by analysing transcripts on the IEMOCAP dataset [95], where the top K words were selected and weighted in terms of emotional salience for four emotions. Combining acoustic features and lexical features yielded relative improvements of up to 16% in classification accuracy over systems using only acoustic features. In [94], disfluencies such as filler words and non-verbal vocalisations such as laughter were found to be effective for predicting emotion dimensions at utterance level. However, linguistic features tend to have the drawbacks of being more complicated, error-prone, language-dependent [96] and less meaningful in acted emotional corpora [78], [27].

Model-based features

Model-based features are generated with the aid of modelling approaches. The most common modelling approach is GMMs, which cluster and map acoustic LLDs into a higher-level representation of speech. This approach is regarded as the de-facto standard in many speech-related fields such as speaker recognition [73]. For instance, GMM supervectors, which are a set of features that stack the mean parameters of each Gaussian component, were proposed to represent speaker identity [73]. The GMM supervector derives from a GMM model trained on acoustic LLDs, whilst functionals are calculated based on a number of knowledge-driven statistics of acoustic LLDs. It has been shown in the literature that the GMM supervector, as well as its variations (e.g. the i-vector), are effective for emotion recognition [70], [90], [88], [9], [97], [98], [99], [100]. For instance, Hu et al. reported significantly better performances when comparing GMM supervector based Support Vector Machine (SVM) classification with conventional GMM-based emotion recognition, achieving 82.5% accuracy in recognising five emotions on a privately collected dataset with 1309 Chinese utterances [90].

Apart from the GMM-based features, linguistic features (e.g. bag-of-words, part-of-speech), to some extent, can be regarded as model-based features, since they are generated from Automatic

Speech Recognition (ASR) transcripts [91], [92]. In [91], it is suggested that there is only a marginal loss between using manual transcriptions and outputs from an HMM-based ASR engine for recognising four emotions on the FAU AIBO emotion dataset. A comparison between the GMM-based features and the ASR-based linguistic features suggests different possibilities for representing characteristics of speech based on various types of models: while the GMM performs implicit phonetic modelling of speech with MFCCs, the ASR model explicitly outputs linguistic content. These two models, i.e. GMM and ASR, are dominant in emotional speech processing to date, and other alternative ways to produce model-based features that are effective for inferring emotion remain relatively less known to the affective computing community.

Recently, Trigeorgis et al. proposed a new set of model-based features, which are learnt directly from the raw speech signal via Convolutional Neural Networks (CNNs) [101]. The proposed CNN-based features significantly outperformed eGeMAPS features in continuous prediction of arousal and valence, achieving Concordance Correlation Coefficients (CCC) of 0.686 and 0.261 respectively, as compared with 0.316 and 0.195. Of particular interest in [101] is that there are strong correlations between the outputs of activation cells in the CNN and several acoustic and prosodic features such as pitch and loudness. However, an in-depth analysis of how the CNN functions well for emotion prediction is needed. A well-known advantage of CNNs is their insensitivity to local variations in both image and speech processing. In particular, CNNs were recently found to effectively reduce temporal variation in both the time and frequency domains [102]. Moreover, a recent study investigating the use of CNNs in speech recognition suggests that a CNN potentially captures the phoneme-specific spectral envelope when it is applied directly to raw speech [103], which is helpful for differentiating phonemes. However, whether this phoneme-specific information contributes to the promising performance reported in [101] remains unknown.

It is worth noting that literature investigating model-based features is much less common in comparison to the prevalent use of functionals. Moreover, most of the above-mentioned studies focused on utterance-level emotion classification, and given the increasing popularity of continuous prediction of emotion dimensions such as arousal and valence, the effectiveness of model-based features for this task (a regression problem) remains relatively less investigated.
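To make the idea of end-to-end feature learning from raw speech concrete, the sketch below shows a toy 1-D CNN feature extractor in PyTorch. It is not the architecture of [101]; the layer sizes, kernel widths and input length are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class RawSpeechCNN(nn.Module):
    """Toy 1-D CNN mapping raw waveform chunks to fixed-dimensional feature vectors."""
    def __init__(self, n_features: int = 40):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=80, stride=16), nn.ReLU(),   # ~5 ms kernels at 16 kHz
            nn.MaxPool1d(4),
            nn.Conv1d(32, n_features, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                                   # one vector per chunk
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, samples); Conv1d expects an explicit channel dimension
        return self.net(x.unsqueeze(1)).squeeze(-1)                    # (batch, n_features)

# Example: a batch of 0.5 s chunks of 16 kHz audio (random stand-in data)
chunks = torch.randn(8, 8000)
features = RawSpeechCNN()(chunks)
print(features.shape)   # torch.Size([8, 40])
```

In an end-to-end system, such a feature extractor would be trained jointly with a recurrent or regression output layer against the continuous emotion labels, rather than being hand-designed as in the functional-based approaches above.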


Table 2.4: A summary of high-level representations for emotion recognition.

| High-level representation of LLDs | Example features | Example references |
| --- | --- | --- |
| Functionals | As seen in Table 2.2 | [27], [56], [60], [63], [104] |
| Linguistic features | Phonemes, words, laughter, bag-of-words, part-of-speech, higher semantics | [27], [56], [61], [91], [92], [93], [94] |
| Model-based features | GMM-based features, ASR-based features, end-to-end CNN-based features | [9], [70], [90], [91], [93], [97], [98], [99], [100], [101], [105], [106], [107] |

2.3 Typical Automatic Modelling and Prediction Techniques

The diversity of emotional representation approaches (Section 2.1) leads to various problem settings for speech-based emotion recognition. There are three main types of tasks: 1) classification of emotion categories or dimensions at utterance level (i.e. a classification problem), e.g. [108]; 2) prediction of emotion dimensions at utterance level (i.e. a regression problem), e.g. [85]; and 3) prediction of emotion dimensions at frame level (i.e. a regression problem), e.g. [109]. Although research into speech-based emotion recognition has mainly focused on emotion classification, frame-level continuous prediction of emotion has recently gained increasing popularity because of the advantages offered by the dimensional representation of emotion (Section 2.1).

A continuous absolute emotion prediction system contains two stages, feature extraction and regression modelling (Figure 2.2), and constitutes a supervised machine learning problem. Firstly, informative features are extracted from speech files, which are continuously annotated with arousal and valence ratings. Regression models are then trained on these features and used to predict arousal and valence for unknown speech files. Thus, the aim of developing an emotion prediction system is to search for appropriate, emotionally informative features and to advance regression modelling techniques for improved emotion prediction performance. While Section 2.2 presented common emotion-related features, this section presents several common modelling techniques for emotion prediction, including Support Vector Regression (SVR), Relevance Vector Machine (RVM) and its variations, i.e. RVM Staircase Regression (RVM-SR) and Output-Associative RVM (OA-RVM), as well as Kalman filtering (KF). Moreover, Gaussian Mixture Models (GMMs), which function as an acoustic space modelling technique, will also be discussed.


[Figure 2.2: block diagram — Speech Signal → Feature Extraction (prosodic, spectral and voice quality acoustic LLDs; functionals, linguistic features, model-based features, etc.) → Regression Modelling (Support Vector Regression (SVR), Relevance Vector Machine (RVM), RVM Staircase Regression (SR), Output-Associative (OA) RVM, Kalman Filtering (KF)) → Predicted Arousal or Valence]

Figure 2.2: A general system architecture for absolute emotion prediction, showing examples of specific methods for each block. Features are extracted from speech files to train a regression model, based on which arousal or valence values of a new speech file can be predicted. The features could be acoustic LLDs or a high-level representation of speech, whereas the most common regression method for emotion prediction is support vector regression.

2.3.1 Gaussian Mixture Models

Gaussian Mixture Models (GMMs) are a stochastic modelling technique which has long been prevalent in speech modelling, for instance in speaker recognition [73]. For emotion recognition or prediction, the literature has also shown the effectiveness of GMMs, especially when they are used for generating model-based features which characterise the acoustic feature space of speech (as partly discussed in Section 2.2.2 – model-based features). Therefore, GMMs are discussed herein to understand how characterisation of the acoustic space can be achieved via this modelling technique. From a modelling perspective, given data X = {x_1, ..., x_n, ..., x_N}, where x_n is a D-dimensional feature vector at the n-th frame, the aim of modelling is to find a model λ which maximises the posterior probability of λ given X:

p(\lambda \mid X) = \frac{p(X \mid \lambda)\, p(\lambda)}{p(X)}    (2.1)

The posterior probability can be decomposed into a likelihood term p(X|λ), a prior term p(λ) and a normalising factor p(X) via Bayes' rule. Since p(X) and p(λ) remain unchanged with respect to the choice of λ, the problem can be reformulated as maximisation of p(X|λ), which is the product of the individual probability density functions p(x_n|λ). Normally, the log-likelihood is used instead of the likelihood, since likelihood values tend to be very small and risk numerical underflow; the log-likelihood also converts the product into a summation. Accordingly, the overall log-likelihood of data X given a model λ is:

\log p(X \mid \lambda) = \sum_{n=1}^{N} \log p(x_n \mid \lambda)    (2.2)

If λ is a GMM, the probability density function can therefore be represented using Eq. (2.3):

p(x_n \mid \lambda) = \sum_{i=1}^{M} w_i\, p(x_n \mid \mu_i, \Sigma_i) = \sum_{i=1}^{M} w_i \frac{1}{(2\pi)^{D/2} |\Sigma_i|^{1/2}} \exp\!\left( -\tfrac{1}{2} (x_n - \mu_i)^T \Sigma_i^{-1} (x_n - \mu_i) \right)    (2.3)

Overall, the probability density function p(x_n|λ) can be seen as a linear combination of M Gaussian components N(μ_i, Σ_i), each of which has a weight w_i, a D×1 mean vector μ_i and a D×D covariance matrix Σ_i, i.e. λ = {w_i, μ_i, Σ_i | i = 1, ..., M}. Given data X = {x_1, ..., x_n, ..., x_N}, GMMs are trained using the iterative Expectation-Maximization (EM) algorithm, which updates the parameters w_i, μ_i, Σ_i while ensuring increases in the log-likelihood of the model given the training data at each iteration. The update equations can be found in Eq. (2.4) - (2.7). More specifically, the posterior probability of each Gaussian mixture component given each feature vector x_n is firstly calculated (Eq. (2.4)).

p(i \mid x_n) = \frac{w_i\, p_i(x_n \mid \mu_i, \Sigma_i)}{\sum_{j=1}^{M} w_j\, p_j(x_n \mid \mu_j, \Sigma_j)}    (2.4)

where \sum_{i=1}^{M} p(i \mid x_n) = 1 and i denotes the i-th Gaussian mixture component. Then all posterior probabilities are accumulated across the N feature vectors to form utterance-level GMM posteriors \check{p}(i) for the i-th Gaussian component (Eq. (2.5)), which are also known as the "occupancy count". Based on \check{p}(i), a mean vector and a covariance matrix for each Gaussian component are updated using Eq. (2.6) and Eq. (2.7) respectively. For a detailed description of the EM algorithm, the reader may refer to [110], [111].

\check{p}(i) = \sum_{n=1}^{N} p(i \mid x_n)    (2.5)

E_i(x) = \frac{1}{\check{p}(i)} \sum_{n=1}^{N} p(i \mid x_n)\, x_n    (2.6)

E_i(x^2) = \frac{1}{\check{p}(i)} \sum_{n=1}^{N} p(i \mid x_n)\, x_n^2    (2.7)

Basically, the GMM posteriors \check{p}(i), mean vectors E_i(x) and covariance statistics E_i(x²) characterise the acoustic space of speech in terms of partitioning, position and variation. This representation of the acoustic space, especially the so-called GMM supervector, was found to be very useful for speaker recognition [73] and language identification [112]. In the field of speaker recognition, despite the fact that GMM supervectors have been a de-facto standard feature set, there are also a few interesting studies demonstrating the effectiveness of the GMM posteriors \check{p}(i) [113] and of the covariance information [114]. In [113], an interesting finding was that utterance-level GMM posteriors alone (named "phonotactic features" in [113]), directly calculated from a Universal Background Model (UBM), achieved an Equal Error Rate (EER) of 9.04% using an SVM for speaker recognition on the NIST 01, 02, 04 & 05 corpora, compared with the standard GMM-UBM (9.01%) and GMM supervector + SVM (7.37%). It was noted that the same speakers share similar utterance-level GMM posteriors across all GMM components, and a further combination of the three individual systems yielded an EER of 6.47%. Recently, it has also been found that GMM posteriors (using the same technique as [113]) are effective for classifying different age and gender groups [74], as well as for depression detection [115]. As discussed in Section 2.2.2, GMM supervectors have been extensively investigated in the context of emotion classification, but whether the GMM posteriors with respect to Gaussian mixture components, which provide a soft partitioning of the acoustic space, can be emotionally informative remains relatively unexplored within the affective computing research community. More commonly, GMM posterior probabilities with respect to emotion classes have been used (e.g. [116]).
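To make the above concrete, the following is a minimal sketch (not the implementation used in this thesis) of deriving utterance-level GMM posteriors and a simple mean supervector from frame-level acoustic LLDs, using scikit-learn's GaussianMixture as a stand-in for a background GMM. The MFCC-like input data, component count and utterance length are placeholder assumptions, and the supervector here is a simplified posterior-weighted mean rather than the MAP-adapted supervector of [73].

```python
# Sketch: utterance-level GMM posteriors ("occupancy counts") and a mean supervector.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_background_gmm(train_frames, n_components=64, seed=0):
    """train_frames: (N, D) acoustic LLDs (e.g. MFCC-like features) pooled over training speech."""
    gmm = GaussianMixture(n_components=n_components, covariance_type='diag', random_state=seed)
    gmm.fit(train_frames)
    return gmm

def utterance_gmm_features(gmm, utt_frames):
    """utt_frames: (T, D) frames of one utterance.
    Returns normalised utterance-level posteriors (cf. Eq. 2.5) and a mean supervector (cf. Eq. 2.6)."""
    post = gmm.predict_proba(utt_frames)                 # (T, M): p(i | x_n) per frame
    occupancy = post.sum(axis=0)                         # accumulated posteriors per component
    posteriors = occupancy / occupancy.sum()             # soft partitioning of the acoustic space
    means = (post.T @ utt_frames) / occupancy[:, None]   # posterior-weighted means E_i(x), (M, D)
    supervector = means.reshape(-1)                      # stacked component means
    return posteriors, supervector

# Example usage with random data standing in for real frames
rng = np.random.default_rng(0)
ubm = train_background_gmm(rng.normal(size=(5000, 13)), n_components=8)
post, sv = utterance_gmm_features(ubm, rng.normal(size=(300, 13)))
print(post.shape, sv.shape)   # (8,), (104,)
```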

2.3.2 Support Vector Regression

Support Vector Regression (SVR) is a well-established regression approach [117] and has been widely used for predicting arousal and valence [83], [118], [119]. SVR's solid theoretical framework ensures global solutions, sparsity and good generalisation ability. For the emotion prediction task, it has been found to generalise very well across different feature sets as well as datasets. Consider a set of training data and corresponding target ratings:

\mathrm{Data} = \{ (x_n, t_n)_{n=1}^{N} \mid x_n \in \mathbb{R}^D,\ t_n \in \{ t_n^a, t_n^v \} \}    (2.8)

where x_n is a D-dimensional feature vector, t_n^a and t_n^v are arousal and valence ground truth ratings at the n-th frame, and N is the total number of data instances. The aim of a regression problem is to search for a function (model) which can predict numeric scores given unknown data x_n^*:

y(w, x_n^*) = w^T \phi(x_n^*) + b    (2.9)

where \phi(x_n^*) is a transformation performed on the D-dimensional feature vector x_n^*. This is commonly achieved by formulating a loss function that describes the error between predicted scores y(w, x_n) and target scores t_n, which is minimised through the training process. In SVR, a common loss function is the ε-insensitive error function:

E_\epsilon\big(y(w, x_n) - t_n\big) = \begin{cases} 0, & \text{if } |y(w, x_n) - t_n| < \epsilon \\ |y(w, x_n) - t_n| - \epsilon, & \text{otherwise} \end{cases}    (2.10)

where E_ε(·) is the error function for a given choice of ε. The condition |y(w, x_n) − t_n| < ε defines a tube-like hyperplane (also known as a "hypertube"), within which a data instance yields no error. On the other hand, if a data instance lies outside the hypertube, the error is |y(w, x_n) − t_n| − ε. In practice, not all data points will be located inside the hypertube; to tolerate some of these errors, two slack variables are introduced, denoted ξ_n and ξ_n^*. While ξ_n quantifies the error when t_n > y(w, x_n) + ε, ξ_n^* quantifies the error when t_n < y(w, x_n) − ε. In short, introducing the two slack variables allows data instances that lie outside the hypertube to be tolerated during training, for generalisation purposes. Accordingly, the objective (loss) function for SVR can be formulated as in Eq. (2.11), which is a soft-margin convex optimisation problem [117]:

\text{minimise} \quad \frac{1}{2}\|w\|^2 + C \sum_{n=1}^{N} (\xi_n + \xi_n^*)

\text{subject to} \quad t_n - y(w, x_n) \le \epsilon + \xi_n, \quad y(w, x_n) - t_n \le \epsilon + \xi_n^*, \quad \xi_n, \xi_n^* \ge 0, \quad n = 1, \ldots, N    (2.11)

where C is a cost regularisation term, (ξ_n + ξ_n^*) quantifies the error for x_n, and ‖w‖ is a geometric margin, which should be minimised to achieve flatness of the hyperplane [117]. Optimisation of Eq. (2.11) can be achieved via dual Lagrange multipliers a_n and a_n^* [117], which gives:

w = \sum_{n=1}^{N} (a_n - a_n^*)\, \phi(x_n)    (2.12)

Substituting the weights w in Eq. (2.9) using Eq. (2.12) gives:

y(w, x) = \sum_{n=1}^{N} (a_n - a_n^*)\, \phi(x_n)^T \phi(x) + b = \sum_{n=1}^{N} (a_n - a_n^*)\, \mathcal{K}(x_n, x) + b    (2.13)

where a_n and a_n^* are SVR model parameters, which weight all training data instances, and K(·, ·) is a kernel function, which allows non-linear transformation of x_n. All data points within the hypertube have a_n = a_n^* = 0 and contribute nothing, whereas the remaining data instances, which have either a_n ≠ 0 or a_n^* ≠ 0, are known as the "support vectors". Thus, the hypertube itself is maintained by only a small number of support vectors, thereby achieving sparsity of the SVR model. Given the trained SVR parameters a_n and a_n^* and unknown test data X^* = (x_1^*, ..., x_n^*, ..., x_L^*), predictions become

y(w, x_n^*) = \sum_{n=1}^{N} (a_n - a_n^*)\, \mathcal{K}(x_n, x_n^*) + b    (2.14)
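As an illustration only, the following minimal sketch shows how ε-SVR, as formulated in Eq. (2.10)–(2.14), might be applied to frame-level emotion prediction using scikit-learn. The feature dimensionality, rating values and hyperparameter settings are placeholder assumptions rather than configurations used in this thesis.

```python
# Sketch: epsilon-SVR for continuous arousal prediction on placeholder data.
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X_train = rng.normal(size=(2000, 88))      # frame-level acoustic features (assumed dimensionality)
t_train = rng.uniform(-1, 1, size=2000)    # arousal ground truth ratings (assumed)
X_test = rng.normal(size=(500, 88))

model = make_pipeline(
    StandardScaler(),                      # feature normalisation
    SVR(kernel='rbf', C=1.0, epsilon=0.1)  # epsilon sets the width of the "hypertube" in Eq. (2.10)
)
model.fit(X_train, t_train)
arousal_pred = model.predict(X_test)       # y(w, x*) as in Eq. (2.14)
print(arousal_pred[:5])
```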

2.3.3 Relevance Vector Machines

Relevance Vector Machine (RVM) is a relatively new approach to multi-dimensional regression which is gaining popularity in continuous (absolute) emotion prediction [120]–[122]. RVM can be considered a sparse Bayesian method analogous to SVR [123] while offering a range of additional advantages [120]–[122]. Firstly, RVM eases the restriction of the Mercer condition [117] on SVR's choice of kernel function, allowing any arbitrary kernel function to be used in conjunction with an RVM [124]. Secondly, RVM offers probabilistic outputs representing uncertainty for each prediction and classification, which can be exploited for absolute emotion prediction (Section 2.3.4). Thirdly, RVM allows not only the mapping of contextual temporal information (Section 2.3.5), but also a convenient multimodal fusion technique (Appendix A), which negates the need to train and heuristically combine multiple predictors [121]. Moreover, RVM presents the learnt regression model as the most relevant set of extracted feature dimensions, meaning the technique explicitly performs both dimensionality reduction and feature selection without the need for a held-out validation data subset, which potentially helps to minimise the chance of overfitting during system development.


Regression

A general form for regression can be found below; RVM searches for a weight for each feature dimension [125]:

y(t_n \mid x_n, w) = w^T \phi(x_n) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)    (2.15)

in which w = [w_1, ..., w_D]^T is an estimated set of sparse regression parameters, \phi = [\phi_1(x_n), ..., \phi_D(x_n)]^T is a set of (potentially non-linear) transforms performed on x_n, and \epsilon = [\epsilon_1, \epsilon_2, ..., \epsilon_N]^T are the training noise terms. RVM learns a sparse representation of w in which the majority of the w_i are zero. This is achieved by giving w a zero-mean Gaussian prior which encourages sparsity by declaring smaller weights more probable [125]:

p(w \mid \alpha) = \prod_{i=1}^{D} \mathcal{N}(w_i \mid 0, \alpha_i^{-1})    (2.16)

where α is the inverse variance hyperparameter, analogous to the regularisation term in SVR. In the training phase, an RVM regression model searches for α_MP, σ²_MP, the most probable (MP) values of α and σ², using an iterative Bayesian inference procedure. In the testing phase, α_MP and σ²_MP are used both to make a prediction and to estimate a level of uncertainty associated with that prediction. The training and testing phases of RVM are elaborated below.

RVM Regression Training Phase

Given training data and ground truth ratings (Eq. (2.8)), RVM aims to maximise the posterior probability of all parameters given the training data:

p(w, \alpha, \sigma^2 \mid t) = p(w \mid t, \alpha, \sigma^2)\, p(\alpha, \sigma^2 \mid t)    (2.17)

The first term on the right-hand side specifies a normal distribution over the weights w, controlled by α and σ² [125]:

p(w \mid t, \alpha, \sigma^2) \sim \mathcal{N}(\mu, \Sigma)    (2.18)

where

\Sigma = (\sigma^{-2} \Phi^T \Phi + \mathrm{diag}(\alpha))^{-1}, \qquad \mu = \sigma^{-2} \Sigma \Phi^T t    (2.19)

where Φ = [\phi(x_1), \phi(x_2), ..., \phi(x_N)]^T. Accordingly, the aim is to find the most probable (MP) α_MP, σ²_MP that estimate accurate distributions of w. To achieve this, the posterior probability p(α, σ²|t) should be maximised, which however is intractable. Applying Bayes' rule gives

p(\alpha, \sigma^2 \mid t) \propto p(t \mid \alpha, \sigma^2)\, p(\alpha)\, p(\sigma^2)    (2.20)

where p(α) and p(σ²) are considered uniformly distributed. The problem therefore becomes a type-II maximum likelihood problem, and the most probable α_MP, σ²_MP can be found analytically and re-estimated by setting the derivatives of the likelihood L(α, σ²) to zero with respect to α and σ²:

(\alpha_{MP}, \sigma^2_{MP}) = \operatorname*{argmax}_{\alpha, \sigma^2} \mathcal{L}(\alpha, \sigma^2) = \operatorname*{argmax}_{\alpha, \sigma^2} p(t \mid \alpha, \sigma^2)    (2.21)

More specifically, the updates of α and σ² can be done using Eq. (2.22) and (2.23):

\alpha_i^{new} = \frac{\gamma_i}{\mu_i^2}    (2.22)

(\sigma^2)^{new} = \frac{\| t - \Phi \mu \|^2}{N - \sum_i \gamma_i}    (2.23)

where γ_i = 1 − α_i Σ_ii ∈ [0, 1] is a measure of the 'well-determinedness' of the weight w_i, which has mean μ_i and variance Σ_ii (Eq. (2.25)). If α_i is large (or infinite), Σ_ii ≈ α_i^{-1} and μ_i = 0 (Eq. (2.19)), which gives γ_i = 0, meaning the weight w_i contributes nothing to inferring t, i.e. arousal and valence. On the other hand, if α_i is small, w_i fits the data and γ_i ≈ 1. This explains how α_i is used to enforce sparsity in RVM. After a certain number of iterations, the majority of the α_i become effectively infinite and in turn their corresponding weights w become zero. Thus, the iteration number, which is normally used as the stopping criterion during training, is an important parameter to be tuned in RVM, although other convergence criteria have been proposed [125].

RVM Regression Prediction Phase

After a certain number of iterations of RVM training, the distributions of the sparse weights w can be fully characterised by α_MP, σ²_MP (Eq. (2.24)):

p(w \mid t, \alpha_{MP}, \sigma^2_{MP}) \sim \mathcal{N}(\mu, \Sigma)    (2.24)

where

\Sigma = (\sigma_{MP}^{-2} \Phi^T \Phi + \mathrm{diag}(\alpha_{MP}))^{-1}, \qquad \mu = \sigma_{MP}^{-2} \Sigma \Phi^T t    (2.25)

Given unknown test data X^* = (x_1^*, ..., x_n^*, ..., x_L^*), predictions become

y(t_n^* \mid x_n^*, w) = \mu^T \phi(x_n^*)    (2.26)
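The following is a minimal numpy sketch of the RVM regression updates in Eq. (2.19), (2.22) and (2.23) and the prediction of Eq. (2.26). It assumes a simple linear basis and synthetic data, and omits the pruning and convergence refinements of [125], so it should be read as an illustration of the evidence-approximation loop rather than a faithful reimplementation.

```python
# Sketch: RVM regression via iterative evidence approximation (Eq. 2.19, 2.22, 2.23, 2.26).
import numpy as np

def rvm_regression_fit(Phi, t, n_iter=100, alpha_cap=1e9):
    N, D = Phi.shape
    alpha = np.ones(D)           # inverse variance hyperparameters (Eq. 2.16)
    sigma2 = np.var(t) + 1e-6    # noise variance initialisation
    for _ in range(n_iter):
        Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(alpha))       # Eq. (2.19)
        mu = Sigma @ Phi.T @ t / sigma2                                    # Eq. (2.19)
        gamma = 1.0 - alpha * np.diag(Sigma)                               # 'well-determinedness'
        alpha = gamma / np.maximum(mu ** 2, 1e-12)                         # Eq. (2.22)
        alpha = np.minimum(alpha, alpha_cap)                               # huge alpha ~ pruned weight
        sigma2 = np.sum((t - Phi @ mu) ** 2) / max(N - gamma.sum(), 1e-6)  # Eq. (2.23)
    return mu, Sigma, sigma2

def rvm_regression_predict(Phi_test, mu):
    return Phi_test @ mu                                                   # Eq. (2.26)

# Example with synthetic data in which only a few basis functions are relevant
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))
w_true = np.zeros(20)
w_true[[2, 7]] = [1.5, -2.0]
t = X @ w_true + 0.1 * rng.normal(size=300)
Phi = np.hstack([np.ones((300, 1)), X])          # bias + linear basis (an assumption)
mu, _, _ = rvm_regression_fit(Phi, t)
print(np.round(mu, 2))                            # most weights are driven towards zero (sparsity)
```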

Classification

In addition to RVM regression modelling, RVM can be reformulated as a classifier to differentiate emotions such as high and low arousal/valence, which may contain valuable information for the prediction of arousal and valence. This is partly motivated by a recent depression-related study showing that classifying low and high depression severity is helpful for predicting depression severity scores [75]. The RVM classifier searches for the most relevant weights w for the transformed features \phi(x_n):

p(c_n \mid x_n, w) = \mathcal{S}\big( y(\phi(x_n), w) \big)    (2.27)

where c_n ∈ {0, 1}, S(y) = 1/(1 + e^{-y}), and w = [w_1, ..., w_D]^T. Note that, unlike regression (Eq. (2.15)), there is no noise term ε and hence no parameter σ². As in RVM regression, sparse weights w are encouraged via zero-mean Gaussian priors [125]:

p(w \mid \alpha) = \prod_{i=1}^{D} \mathcal{N}(w_i \mid 0, \alpha_i^{-1})    (2.28)

The introduction of the sigmoid function S(·) leads to a Bernoulli-distributed likelihood function [125]:

p(c \mid w) = \prod_{n=1}^{N} \mathcal{S}\{ y(\phi(x_n); w) \}^{c_n} \big[ 1 - \mathcal{S}\{ y(\phi(x_n); w) \} \big]^{1 - c_n}    (2.29)

Training an RVM classifier involves optimisation of Eq. (2.29), but the iterative optimisation approach from the regression case, i.e. Eq. (2.22) and (2.23), is not directly applicable. This is because neither p(c|w) nor its decomposition into p(w|c, α) and p(c|α) is Gaussian-distributed or has an analytical solution (Eq. (2.30)):

p(c \mid w) \propto p(w \mid c, \alpha)\, p(c \mid \alpha)    (2.30)

The training and classification phases of the RVM classifier are elaborated below.

RVM Classification Training Phase

An iterative approximation approach based on Laplace's method was proposed for training an RVM classifier in [125], where the most probable (MP) weights w_MP are iteratively approximated as the mode of the posterior distribution p(w|c, α). By this means, the posterior probability is forced to follow a Gaussian distribution:

p(w \mid c, \alpha) \sim \mathcal{N}(w_{MP}, \Sigma)    (2.31)

where

\Sigma = (\Phi^T B \Phi + A)^{-1}, \qquad w_{MP} = \Sigma \Phi^T B c    (2.32)

with B = diag(β_1, β_2, ..., β_N), β_n = S{y(\phi(x_n))}[1 − S{y(\phi(x_n))}], A = diag(α_1, α_2, ..., α_D) and Φ = [\phi(x_1), \phi(x_2), ..., \phi(x_N)]^T. At each iteration, α_i can be updated using Eq. (2.22) as in the regression case. Again, the iteration number is the parameter that needs to be tuned during RVM classification training [125].

RVM Classification Phase

After a certain number of iterations of RVM classification training, the most probable weights w_MP are attained. The probabilistic classification, given unknown test data X^* = (x_1^*, ..., x_n^*, ..., x_L^*), then becomes

p(c_n^* \mid x_n^*, w_{MP}) = \mathcal{S}\big( y(\phi(x_n^*), w_{MP}) \big) = \mathcal{S}\big( (w_{MP})^T \phi(x_n^*) \big)    (2.33)

The term p(c_n^*|x_n^*, w_MP) ∈ [0, 1] represents a probabilistic output from the sigmoid function S(·) for the classes c_n^* ∈ {0, 1}. Accordingly, classification can be done using a threshold of 0.5 (Eq. (2.34)):

c_n^* = \begin{cases} \text{class } 0, & p(c_n^* \mid x_n^*, w_{MP}) \le 0.5 \\ \text{class } 1, & p(c_n^* \mid x_n^*, w_{MP}) > 0.5 \end{cases}    (2.34)

2.3.4 Relevance Vector Machine Staircase Regression

The idea behind RVM-SR [75] is to make pairwise comparisons between different pairs of low and high arousal/valence partitions, and to incorporate this information into regression modelling. This approach was motivated by Gaussian Staircase Regression (GSR), which was first proposed for depression prediction [126]. In the GSR approach, data corresponding to intervals of the rating scale are grouped into several pairs of low-high classes, and the log mean likelihood ratio (LMLR) between each pair of low and high partitions is calculated. The LMLR from each low-high class pair was then used in regression modelling to predict Beck Depression Inventory (BDI) scores, which measure depression severity. The pairwise comparisons were based on models estimated from data grouped according to low and high BDI ratings, which were obtained by applying different thresholds (e.g. 5, 10, 15, etc.) to partition the data into low and high BDI scores. For example, when a threshold of 5 is used, the low BDI scores will be 0-5, whereas the high BDI scores will be 6-45. Based on the same framework, RVM-SR used an ensemble of RVM classifiers to model the class boundaries associated with the low-high class pairs. The probabilistic outputs from each of the RVM classifiers were used as features for training a regression model. The application of the RVM-SR framework in a score prediction context can be seen in Figure 2.3.

[Figure 2.3: schematic — input features are fed to P−1 RVM classifiers, each associated with a low-high pair of BDI intervals on a scale from 0 (Low BDI) to 45 (High BDI, thresholds at e.g. 5, 10, ...); their probabilistic outputs are combined by a regression model that predicts the BDI rating]

Figure 2.3: Overview of the RVM-SR approach, showing how pairs of low-high classifiers are built upon intervals of the BDI rating of depression severity, after [75].

The training of RVM-SR involves two stages. In the first stage, training data X and ground truth ratings t (Eq. (2.8)) are partitioned into P partitions of balanced sizes. This is achieved by calculating percentiles of t, which function as thresholds to separate the P partitions of the training data. Data corresponding to the P partitions are grouped into P−1 adjacent low-high pairs to train different RVM classifiers, producing a set of P−1 classifiers for arousal and another set of P−1 classifiers for valence (Figure 2.3):

c_{p,n}^{a} = \mathcal{S}\big( (\theta_p^{a})^T \phi(x_n) \big)    (2.35)

c_{p,n}^{v} = \mathcal{S}\big( (\theta_p^{v})^T \phi(x_n) \big)    (2.36)

where c_{p,n}^{a} denotes the probabilistic output from the p-th RVM classifier for arousal (i.e. θ_p^a) at the n-th frame. In the second stage, the probabilistic outputs from the RVM classifiers are employed to train two independent RVM regression models (i.e. w_2^a, ε_2^a for arousal and w_2^v, ε_2^v for valence) to produce the final arousal and valence predictions \check{t}_n^a and \check{t}_n^v respectively:

\check{t}_n^{a} = (w_2^{a})^T \phi(c_n^{a}) + \epsilon_2^{a}    (2.37)

\check{t}_n^{v} = (w_2^{v})^T \phi(c_n^{v}) + \epsilon_2^{v}    (2.38)

where c_n^{a} = (c_{1,n}^{a}, c_{2,n}^{a}, ..., c_{P-1,n}^{a})^T and c_n^{v} = (c_{1,n}^{v}, c_{2,n}^{v}, ..., c_{P-1,n}^{v})^T.

In the testing (or prediction) phase of RVM-SR, input features are fed into all the trained classifiers θ_p^a and θ_p^v to generate probabilistic outputs for all the low-high pairs, after which the trained regression models w_2^a and w_2^v predict arousal and valence based on the probabilistic outputs, i.e. \phi(c_n^a) and \phi(c_n^v) (Figure 2.3). The assumption made in the previous application of RVM-SR to depression score prediction [75] was that the probabilistic outputs of the RVM classifiers reflect beliefs about how strongly an utterance corresponds to a certain region of depression BDI scores, which helps account for the ordinal nature of depression scores. This may also hold true for emotion ratings, which suggests the use of RVM-SR for emotion prediction tasks.
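A minimal sketch of the staircase idea is given below. For brevity, scikit-learn's logistic regression and Bayesian ridge regression stand in for the RVM classifiers and the second-stage RVM regression, and each classifier is trained on all data split at one percentile boundary rather than only on the adjacent pair of partitions as in [75]; the data are synthetic placeholders.

```python
# Sketch: staircase regression with probabilistic low/high classifiers feeding a regressor.
import numpy as np
from sklearn.linear_model import LogisticRegression, BayesianRidge

def staircase_features(classifiers, X):
    """Stack probabilistic outputs c_{p,n} from each low/high classifier (cf. Eq. 2.35)."""
    return np.column_stack([clf.predict_proba(X)[:, 1] for clf in classifiers])

def fit_staircase(X, t, P=5):
    """Build P-1 low/high classifiers at percentile boundaries of the ratings t."""
    thresholds = np.percentile(t, np.linspace(0, 100, P + 1)[1:-1])   # P-1 boundaries
    classifiers = [LogisticRegression(max_iter=1000).fit(X, (t > thr).astype(int))
                   for thr in thresholds]
    probs = staircase_features(classifiers, X)
    regressor = BayesianRidge().fit(probs, t)                         # second-stage regression
    return classifiers, regressor

def predict_staircase(classifiers, regressor, X):
    return regressor.predict(staircase_features(classifiers, X))

# Example with synthetic arousal-like ratings
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 30))
t = np.tanh(X[:, 0] - 0.5 * X[:, 1]) + 0.1 * rng.normal(size=1000)
clfs, reg = fit_staircase(X, t, P=5)
print(predict_staircase(clfs, reg, X[:5]))
```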

2.3.5 Output-Associative Relevance Vector Machine

Output-Associative Relevance Vector Machine (OA-RVM) is an effective framework for absolute emotion prediction [121]. The idea behind OA-RVM is to learn temporal dependencies of predictions, dependencies between input features and output predictions, as well as dependencies between arousal and valence, via a two-stage regression framework based on RVM. Given training data and emotion ground truth ratings (Eq. (2.8)), the first stage trains independent RVM regression models to produce a set of arousal predictions \tilde{t}^a = (\tilde{t}_1^a, \tilde{t}_2^a, ..., \tilde{t}_N^a)^T and valence predictions \tilde{t}^v = (\tilde{t}_1^v, \tilde{t}_2^v, ..., \tilde{t}_N^v)^T respectively:

\tilde{t}_n^{a} = (w_1^{a})^T \phi(x_n) + \epsilon_1^{a}    (2.39)

\tilde{t}_n^{v} = (w_1^{v})^T \phi(x_n) + \epsilon_1^{v}    (2.40)

where w_1^a and ε_1^a are the trained RVM model parameters for arousal at the first stage; similarly, w_1^v and ε_1^v characterise a trained RVM model for valence at the first stage, and \tilde{t}_n^a and \tilde{t}_n^v denote the initial arousal and valence predictions at the n-th frame. At the second stage, each frame is associated with a temporal set of arousal and valence prediction values \tilde{t}_{n \pm i}^a and \tilde{t}_{n \pm i}^v, instead of \tilde{t}_n^a and \tilde{t}_n^v:

\tilde{t}_{n \pm i}^{a} = (\tilde{t}_{n-i}^{a}, ..., \tilde{t}_{n-1}^{a}, \tilde{t}_n^{a}, \tilde{t}_{n+1}^{a}, ..., \tilde{t}_{n+i}^{a})^T    (2.41)

\tilde{t}_{n \pm i}^{v} = (\tilde{t}_{n-i}^{v}, ..., \tilde{t}_{n-1}^{v}, \tilde{t}_n^{v}, \tilde{t}_{n+1}^{v}, ..., \tilde{t}_{n+i}^{v})^T    (2.42)

Then the second stage trains two independent RVM regression models to produce the final arousal and valence predictions \check{t}_n^a and \check{t}_n^v respectively, both learnt from a combination of input features \phi(x_n), temporal arousal predictions \tilde{t}_{n \pm i}^a, and temporal valence predictions \tilde{t}_{n \pm i}^v:

\check{t}_n^{a} = (w_2^{a})^T \phi(x_n) + (\varphi_2^{a})^T \tilde{t}_{n \pm i}^{a} + (\psi_2^{a})^T \tilde{t}_{n \pm i}^{v} + \epsilon_2^{a}    (2.43)

\check{t}_n^{v} = (w_2^{v})^T \phi(x_n) + (\varphi_2^{v})^T \tilde{t}_{n \pm i}^{a} + (\psi_2^{v})^T \tilde{t}_{n \pm i}^{v} + \epsilon_2^{v}    (2.44)

OA-RVM, therefore, uses the past, current and future prediction context associated with input feature frames, as well as the input features themselves, to update a prediction result. Prediction using this non-causal relationship has been shown to be superior to RVM and SVR when performing AEP [121].
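The following minimal sketch illustrates how the output-associative second stage of Eq. (2.41)–(2.44) can be assembled: first-stage arousal and valence predictions are stacked over a ±i frame window and concatenated with the input features before training the second-stage regressors. Ridge regression stands in for RVM here, and all data are synthetic placeholders.

```python
# Sketch: building the output-associative (OA) design matrix and second-stage regressors.
import numpy as np
from sklearn.linear_model import Ridge

def oa_context(pred, i):
    """Stack predictions t_{n-i}, ..., t_{n+i} for each frame n (edges are padded)."""
    padded = np.pad(pred, (i, i), mode='edge')
    return np.column_stack([padded[k:k + len(pred)] for k in range(2 * i + 1)])

def fit_oa_stage2(X, arousal_pred, valence_pred, t_arousal, t_valence, i=5):
    Z = np.hstack([X, oa_context(arousal_pred, i), oa_context(valence_pred, i)])
    model_a = Ridge().fit(Z, t_arousal)    # plays the role of Eq. (2.43)
    model_v = Ridge().fit(Z, t_valence)    # plays the role of Eq. (2.44)
    return model_a, model_v

# Example with random stand-ins for features, first-stage predictions and ground truth
rng = np.random.default_rng(0)
N, D = 500, 40
X = rng.normal(size=(N, D))
a1, v1 = rng.normal(size=N), rng.normal(size=N)    # first-stage predictions (assumed)
ta, tv = rng.normal(size=N), rng.normal(size=N)    # ground truth ratings (assumed)
m_a, m_v = fit_oa_stage2(X, a1, v1, ta, tv, i=5)
```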

2.3.6 Kalman Filtering

A Kalman Filter (KF) is a linear dynamical model that has long been used in control systems [127]. Recently, KF has become appealing in continuous absolute emotion prediction, e.g. for multi-modal fusion [128], [129], for musical emotion prediction [130], and for capturing temporal evolution of emotion [131]. The formulation of KF can be seen in Figure 2.4, where 𝒕𝑛 are hidden state(s) and 𝒚𝑛 are observation(s) at the 𝑛th frame. In the emotion prediction context, arousal and valence ratings are represented as hidden states 𝒕𝑛 ∈ {𝑡𝑛𝑎 , 𝑡𝑛𝑣 }, 𝑛 ∈ [1, … , 𝑁], whereas the observations 𝒚𝑛 are a set of noisy predictions from regression models.


Figure 2.4: Overview diagram for Kalman filtering. t_n follows the Markov property that the current state only depends on the previous state.

The aim of KF is to estimate the hidden state(s) t_n, i.e. arousal and valence ratings, given the observation(s) (measurement(s)) \mathcal{Y}_{1:n} = (y_1, ..., y_n) up to the n-th frame (Eq. (2.45)):

p(t_n \mid \mathcal{Y}_{1:n}) \propto p(t_n, \mathcal{Y}_{1:n}) = p(y_n \mid t_n, \mathcal{Y}_{1:n-1})\, p(t_n \mid \mathcal{Y}_{1:n-1}) = \underbrace{p(y_n \mid t_n)}_{\text{update}} \underbrace{\int p(t_n \mid t_{n-1})\, p(t_{n-1} \mid \mathcal{Y}_{1:n-1})\, dt_{n-1}}_{\text{prediction}}    (2.45)

According to Figure 2.4 and Eq. (2.45), a KF model can be fully characterised by its transition probabilities p(t_n|t_{n-1}), measurement probabilities p(y_n|t_n), an initial state t_0 and a covariance matrix S_0, i.e. λ_KF = {t_0, S_0, F, Q, H, R}. The KF assumes that state transitions and measurements are continuous and Gaussian distributed:

p(t_n \mid t_{n-1}) \sim \mathcal{N}(F t_{n-1} + \varepsilon, Q_n)    (2.46)

p(y_n \mid t_n) \sim \mathcal{N}(H t_n + \beta, R_n)    (2.47)

where 𝑭 and 𝑯 are the state and measurement transition matrices, 𝑸 and 𝑹 are noise covariance matrices, and 𝜀 and 𝛽 are bias terms in states and measurements respectively. As seen in Eq. (2.45), there are two stages for Kalman Filtering, i.e. prediction and update, which are elaborated in Table 2.5.


Table 2.5: Implementation of Kalman filtering for predicting arousal and valence (t_n)_{n=1}^N, given trained regression models, which output noisy measurements (y_n)_{n=1}^N, and a trained Kalman filter model λ_KF = {t_0, S_0, F, Q, H, R}, based on [131].

Kalman filtering implementation:
  Given a trained KF model λ_KF = {t_0, S_0, F, Q, H, R}
  For each time frame n ∈ [1, ..., N]:
    y_n                                             // initial predictions from a trained regression model
    t_{n|n-1} = F t_{n-1}                           // prediction
    S_{n|n-1} = F S_{n-1} F^T + Q                   // predicted covariance
    G_n = S_{n|n-1} H^T [H S_{n|n-1} H^T + R]^{-1}  // Kalman gain
    t_n = t_{n|n-1} + G_n [y_n − H t_{n|n-1}]       // updated state
    S_n = S_{n|n-1} − G_n H S_{n|n-1}               // updated covariance
  end

According to Table 2.5 and Eq. (2.45), there are two stages for each time frame, i.e. prediction and update. In the prediction stage, the n-th state t_{n|n-1} is roughly estimated from the previous state t_{n-1}, alongside its covariance S_{n|n-1}, which measures how confident the estimate t_{n|n-1} is. Afterwards, a Kalman gain G_n is calculated to update the predicted t_{n|n-1} based on the uncertainties associated with the prediction t_{n|n-1} and the measurement y_n. More specifically, the uncertainty of t_{n|n-1} is represented by S_{n|n-1}, whereas the measurement uncertainty can be roughly represented by [H S_{n|n-1} H^T + R]. The comparison determines how the state t_n will be updated: it moves towards either the prediction t_{n|n-1} or the measurement y_n, whichever has the lower uncertainty.
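The steps in Table 2.5 can be written compactly in numpy as below; this is an illustrative sketch in which the model parameters {t_0, S_0, F, Q, H, R} and the noisy measurements are placeholder assumptions rather than values estimated as in [131].

```python
# Sketch: Kalman filtering of noisy per-frame arousal/valence regression outputs (Table 2.5).
import numpy as np

def kalman_filter(Y, t0, S0, F, Q, H, R):
    """Y: (N, d_y) noisy measurements; returns (N, d_t) filtered state estimates."""
    t, S = t0.copy(), S0.copy()
    filtered = []
    for y in Y:
        t_pred = F @ t                                           # prediction
        S_pred = F @ S @ F.T + Q                                 # predicted covariance
        G = S_pred @ H.T @ np.linalg.inv(H @ S_pred @ H.T + R)   # Kalman gain
        t = t_pred + G @ (y - H @ t_pred)                        # updated state
        S = S_pred - G @ H @ S_pred                              # updated covariance
        filtered.append(t)
    return np.array(filtered)

# Example: a 2-D state (arousal, valence) observed directly with measurement noise
rng = np.random.default_rng(0)
N = 200
truth = np.cumsum(0.01 * rng.normal(size=(N, 2)), axis=0)        # slowly varying ratings (assumed)
Y = truth + 0.1 * rng.normal(size=(N, 2))                        # noisy regression outputs (assumed)
est = kalman_filter(Y, t0=np.zeros(2), S0=np.eye(2),
                    F=np.eye(2), Q=1e-4 * np.eye(2),
                    H=np.eye(2), R=1e-2 * np.eye(2))
print(est.shape)   # (200, 2)
```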

2.4 Concluding remarks

This chapter has presented a review of speech-based automated emotion prediction in three aspects, i.e. emotion theory, emotional speech features, and common techniques for acoustic space modelling and regression modelling. While many previous studies have focused on classifying emotion at utterance level, research into continuous prediction of emotion dimensions such as arousal and valence is gaining interest, because the dimensional representation of emotion offers a range of advantages such as the capacity for capturing potentially subtle changes in emotion and suitability in a naturalistic context.

There is a wide spectrum of speech features which are associated with emotion. While acoustic low-level descriptors such as prosodic, spectral, and voice quality features have been extensively investigated in the literature, the high-level characterisation of speech has mainly focused on the use of functionals, a brute-force set of parameters that describe the distributions of low-level descriptors. Although it is true that functionals of relatively 'standard' feature sets often achieve state-of-the-art performance for emotion recognition, more work is needed to extend these by deriving high-level representations (at either window level or utterance level) that are emotionally informative, given that representations of this kind can capture rich paralinguistic cues, including emotion [27].

Moreover, this chapter discussed techniques for speech modelling as well as regression modelling adopted in later investigations in this thesis, including Gaussian Mixture Models (GMMs), Support Vector Regression (SVR), Relevance Vector Machine (RVM) for regression and classification, Relevance Vector Machine Staircase Regression (RVM-SR), Output-Associative Relevance Vector Machine (OA-RVM), and Kalman filtering. SVR is a well-established regression method and has been considered a standard method for emotion prediction, while RVM, OA-RVM, and Kalman filtering have gained more recent interest. Also, RVM-SR, which was originally proposed for exploring depression score prediction in an ordinal context, may be effective for emotion prediction.

The field of affective computing has matured since the term was first introduced in 1997 [132], [133], especially with respect to research into developing automated emotion recognition and prediction systems. However, as discussed in Section 2.1, where the Appraisal theory hinted at the dynamic nature of emotion under the simultaneous collective influence of different major components, it may be worth viewing emotion as a dynamic process, i.e. considering further how emotion evolves over time. For this reason, the next chapter reviews recent progress in research into emotion change, mainly from an engineering perspective.


Chapter 3

Emotion Change in Speech: A Review

According to the psychological mechanism of emotion production (Section 2.1), emotion is dynamic in nature and continuously evolves across time [16], [17], [34]. However, this has received less attention among studies that aim to recognise and predict emotion from behavioural signals within the affective computing community. Most studies, particularly until the past few years, investigated emotion recognition on a per-speech segment basis, i.e. speech sequences were pre-segmented into emotional utterances with one global category or dimension label assigned to each. This per-utterance labelling implicitly assumes that emotion is in a stable and steady state across the whole utterance, and overlooks the dynamical nature of emotion. This issue is more pronounced for longer utterances, in which emotion evolution is more likely to occur [152]. However, although it is argued in psychology that emotion response unfolds from milliseconds to minutes, the time scale remains an open question within the affective computing community. Recent years have marked increasing awareness of this issue, and there have been more research groups investigating the time course of emotion by employing continuously annotated corpora [7]. Examples of this kind of corpus are SEMAINE [48], CreativeIT [49], RECOLA [47] and Belfast Naturalistic dataset [134], where emotional ratings (e.g. arousal and valence) were evaluated continuously using real-time annotation tools such as Feeltrace [135], Gtrace [136] and ANNEMO [47], based on audio and video signals. Based on the continuous emotion annotation, a number of systems have been built with the intention of predicting the ratings at a fine temporal granularity, for example, the Audio-Visual Emotion Challenge (AVEC) [137], [109], but overall performances can be improved. Moreover, patterns and trajectories of how emotion evolves over time, which are known as affect (or emotion) dynamics in affective science, may not be sufficiently accounted for and remain relatively less investigated, certainly from an automatic system perspective. Given the motivations, this chapter reviews research into emotion change from both psychology and engineering points of view, with an emphasis on automated systems. The remainder of this chapter is laid out as follows: Section 3.1 discusses psychological literature about emotion change, Section 3.2 presents an overview of existing automatic systems, including two main aspects: computational models incorporating emotion change and the time course of emotion change, Section 3.3 discusses potential change-point detection approaches applicable to research relating to emotion timings, and concluding remarks are given in Section 3.4.


3.1 Emotion change in affective science

In affective science, emotion change (more often known as emotion dynamics) has attracted increasing interest. It has been shown in the literature that emotion transitions carry a great deal of valuable information about social interactions [138], [139], [140], marital relationships, mother-child bonding [141], emotional intelligence [142], [28] and psychological well-being [143], [144], [31]. For instance, emotional transitions during conversations have a great impact on conversational outcomes [138]. Recently, emotion regulation and emotion dynamics were associated using several parameters describing trajectories of emotion change [142]. Regulating emotions is more often than not associated with knowing the timings of emotional transitions, given that people react to change in the course of emotion [28]. Moreover, it has been found that emotion instability [143], which refers to emotion changes between previous and current emotional states, carries a great deal of information about psychological well-being. A measure of emotional instability is the Mean Squared Successive Difference (MSSD), and this relates to the likelihood that a patient is suffering from disease, anxiety and depression [143]. Taken together, these findings suggest that research into emotion dynamics plays a crucial role in understanding emotions, as well as contributing to interactions, emotional intelligence and psychological health.

One root question for emotion dynamics is how to represent them. Back in 1998, Davidson introduced the term "affective chronometry" to describe the temporal dynamics of emotion [145]. Affective chronometry includes the rise time to peak, amplitudes and recovery time of affective responses. After about 20 years, Davidson re-emphasised the crucial role of dynamics in understanding emotion [146], in a special issue of the journal Emotion Review on advancing research into affect dynamics [17]. A number of interesting open questions were raised again, such as the time scales on which emotion evolution should be investigated, the relation between the time course of positive and negative affect, and the parameters of affective chronometry that are of most importance for psychological well-being. Of course, there is by no means only one way to characterise emotion change. Many studies have proposed different ways to describe the patterns and regularities of emotion change, for example: 1) emotion intensity, ramp-up rate and decay rates in [147]; 2) affective home base, variability, and attractor strength in [19]; 3) time dependency and (co)variation of emotion, including emotional variability, covariation, inertia and cross-lags, in [142]. Readers may refer to [147], [19], and [142] for more detailed explanations.

Moreover, there are an increasing number of studies investigating the computational modelling of emotion dynamics in affective science [148], [149]. Computational models rooted in psychological theories of emotion dynamics commonly contain an internal process that underlies emotion dynamics, and an external process that influences how emotion evolves [18], [19]. Taking [19] as an example, the internal process (i.e. the latent true emotion dynamics of arousal/valence) was modelled using a transition equation (Eq. (3.2)), whereas the external process was modelled using a measurement equation (Eq. (3.1)), treating the observed emotions as an error-corrupted version of the true emotions. Measurement equation:

Y_p(t) = \theta_p(t) + \varepsilon_p(t)    (3.1)

Transition equation:

d\theta_p(t)/dt = \beta_p\,(\mu_p - \theta_p(t)) + \xi_p(t)    (3.2)

where Y_p(t) is the measured/observed arousal or valence value for person p at time t, and θ_p(t) is a latent true arousal/valence score, corrupted by error ε_p(t). The transition equation contains a random error term ξ_p(t) and a fading term β_p (μ_p − θ_p(t)), in which the true arousal/valence moves towards an equilibrium value μ_p at a speed of β_p. In the context of social interaction, arousal/valence dynamics were found to be co-influenced by an internal process that relates to personality and emotional dispositions, as well as an external process that is driven by the content of discussion and in turn determines how the discussion proceeds [18]. That investigation departed from a formulation of arousal and valence dynamics given by Eq. (3.3) and Eq. (3.4).

Arousal dynamics:

\delta a_i(t) = -\gamma_{a_i}\,(a_i(t) - d) + \mathcal{F}_a(h, a_i(t)) + A_{a_i}\,\xi_a(t)\,\delta(t)    (3.3)

Valence dynamics:

\delta v_i(t) = -\gamma_{v_i}\,(v_i(t) - b) + \mathcal{F}_v(h, v_i(t)) + A_{v_i}\,\xi_v(t)\,\delta(t)    (3.4)

where the arousal/valence dynamics are influenced by an internal relaxation γ_{a_i}(a_i(t) − d) (or γ_{v_i}(v_i(t) − b)), a stimulus F_a(h, a_i(t)) (or F_v(h, v_i(t))) and a stochastic error A_{a_i} ξ_a(t) (or A_{v_i} ξ_v(t)). Note that these models [18] and [19] are recent (2016, 2010) scientific attempts to characterise emotion change mathematically. They have not been used in automatic systems, but are certainly a motivation for computational approaches. However, in both studies [19] and [18], first-order differences (Δ) were considered to represent emotion dynamics, which raises the question of whether or not first-order differences are suitable for characterising emotion dynamics.

Despite the various potential characterisations of emotion change, it remains very challenging to develop automated systems for emotion change based on the aforementioned psychological theories, for two major reasons. First, characterising emotion change involves a high degree of complexity and subjectivity, which makes it difficult to render the emotion change ground truth necessary for machine learning problems. Second, there is no emotional corpus that allows direct investigations into emotion change, since consideration and explicit annotation of emotion change has not been a priority during data collection for most existing emotional corpora. Hence, how to derive appropriate emotion change ground truth for the formulation of automated systems for emotion change is one of the biggest challenges facing the affective computing community with respect to emotion change research.

One additional note is that, despite the different approaches proposed to characterise emotion change, one unchanged major theme in affective science is to understand individual differences in the temporal dynamics of the emotional process via these approaches [17], [19], [34], [142], [147]. This is very different from the common purpose of automated emotion recognition systems, in which speaker dependency is not necessarily an important factor to be considered. However, it may hint at the need for individualised modelling in the foreseeable future for automated systems that target emotion and emotion change.

3.2 Emotion change in automated systems

Compared with the increasing popularity of emotion dynamics in affective science, automatic systems for emotion dynamics in speech have been explored less. A recent experiment designed to analyse emotion dynamics computationally from facial expression using a statistical model was reported by Hakim et al. [26]. In [26], emotion categories were mapped frame-by-frame into the arousal-valence space to visualise trajectories of emotion dynamics, and emotion changes were observed to follow certain smooth common paths. That is, emotion transitions between two uncorrelated or negatively correlated emotions (e.g. excitement and frustration) tend to pass through the neutral state, whilst those between two positively correlated emotions (e.g. excitement and happiness) do not. These smooth paths are reasonable because they are frame-level (25 frames per second) emotion dynamics observed without considering external stimuli, i.e. no abrupt events (e.g. pain) were applied to intervene or elicit certain emotions from subjects.

It is also worth noting that studies on emotion dynamics are often placed in context, where emotion transitions take place and are influenced by events [150], [151] as well as other speakers [152]. During conversations, an agreed trajectory of emotion dynamics is the onset-apex-offset path [145], [153]. There exists an emotion-arousing process between events (or situations) and people's emotional responses [154], and these responses tend to fade over time [155]. Based on this, Danieli et al. [156] proposed an affective scene framework to investigate emotion unfolding across time during call-centre conversations. In [156], continuous emotional unfolding was converted into several discrete emotional episodes (e.g. one of the episodes is that an agent first expresses emotion, and the customer ends up experiencing positive emotions). The fading phenomenon was also observed in [157], in which it was shown that emotional evolution in speech is detectable using an emotion recognition system. While the above studies related to emotion dynamics involve a broad spectrum of topics, there is a need for more systematic investigations into emotion dynamics that can be used to facilitate applications of affective computing and HCI. Two important aspects concerning emotion change in automatic systems that are in need of investigation will be discussed, i.e. computational models for emotion change and the time course of emotion change.

3.2.1 Computational Models using Emotion Change

In affective computing, computational models that consider emotion change are herein roughly divided into two categories, based on whether systems are built to explicitly infer characteristics of emotion change. The first category implicitly models emotion change for emotion recognition/prediction, whilst the second category explicitly models emotion change in order to characterise and learn about emotion change, for example its direction and extent. Without question, this taxonomy reveals an extremely unbalanced view of the research area; while nearly all studies within the affective computing community fall within the first category, only a few studies have attempted emotion change related tasks (the second category). However, this helps highlight the fact that emotion change research is a relatively understudied field, and points to potential opportunities along this path. This subsection reviews emotion recognition systems that take advantage of modelling emotion dynamics using a range of approaches, as well as studies designed to tackle emotion change related topics.

At a higher level, emotion recognition systems considering emotion dynamics [20]–[26], [158]–[162] share the same view: it is not appropriate to assume that emotional states are static. Their approaches to this end commonly involve 1) utilising dynamic speech features such as delta MFCCs [20], [21], 2) applying dynamic models [22], [23], and 3) analysing dynamic patterns from classifiers [24], [25] or regressors [26]. For instance, Kim et al. [24] exploited utterance-level emotion dynamics for classifying four emotions, based on the hypothesis that there exist emotion-specific dynamical patterns that may repeat within the same emotion class. A different way to capture emotion dynamics is the dynamic influence model proposed by Stolar et al. [152], where Markov models were used to capture long-term conditional dependencies of emotion. Another study adopted an HMM-based language model to capture the temporal patterns of emotional states via a predefined grammar, achieving an accuracy of 87.5% for classifying the four quadrants of the arousal-valence space on the SEMAINE dataset [22]. An interesting technique was proposed by Han et al. [162], in which an utterance-level emotion prediction system trades off the conventional mean square error against a proposed rank-based trend loss to successfully preserve the overall dynamic trend of emotion dimensions.

Computational modelling of emotion change has proved effective for emotion recognition, but a common problem is that emotion dynamics derived from dynamic speech features and dynamic models do not necessarily reflect the true characteristics of emotion change, because emotion change information from the ground truth emotion ratings is overlooked. Interestingly, there is some recent work on modelling emotion dynamics derived from ground truth emotion ratings, e.g. [21], [131]. In these two studies, first-order (Δ) and second-order (ΔΔ) differences of continuous ground truth emotion ratings were calculated as emotion dynamics and used for modelling. In [21], a GMM-based regression method was proposed to jointly model dynamic features and the emotion dynamics for predicting arousal, valence and dominance on a per-frame basis. The other study [131] adopted Kalman filtering in facial emotion prediction; despite significant improvements obtained using Kalman filtering, modelling the emotion dynamics (i.e. Δ and ΔΔ) yielded no further gains. Nevertheless, one fundamental question, which was addressed in neither study nor in the existing literature, is the suitability of first-order (Δ) and second-order (ΔΔ) differences of ground truth emotion ratings for characterising emotion dynamics (Section 7.2.3). Furthermore, how to effectively exploit emotion dynamics to facilitate emotion recognition/prediction remains an open question, and more studies are needed (this is investigated in Chapter 8).

Compared with emotion recognition studies involving emotion dynamic modelling, systems that are designed to interpret and understand emotion change via computational modelling of emotion change are rare. Among the most relevant work is [163], where systems were built to qualitatively recognise the direction of emotion change as well as to quantitatively predict the change in emotional intensity for 16 emotions from speech, showing comparable or superior performances compared with human predictions. In [163], given two speech recordings, the problem of direction classification was to determine which recording has higher emotional intensity, whilst the problem of change prediction was to predict the change in emotional intensity. Both tasks were achieved using differences between 207-dimensional utterance-level functionals as features and random forests as the classification or regression model. However, their dataset was small in size, only 225 utterances in total. Apart from [163], there is some recent work that attempts to rank emotions rather than recognising or predicting them [164]–[166]. Similarly to [163], the focus in [164]–[166] is to conduct pair-wise comparisons between every two utterances or frames, which can be used for detection of emotion change direction. In fact, the two previously mentioned studies [21], [131], where emotion dynamics calculated from ground truth were used for modelling, have the potential to build a system that can predict changes in emotion dimensions on a per-frame basis; however, this was not noted by the authors.

Indeed, there are few systems that aim to understand and interpret emotion dynamics within the field of affective computing. This is in part due to difficulties in describing emotion dynamics (Section 3.1) and a lack of databases for emotion dynamics (i.e. no explicit annotations regarding emotion dynamics such as duration, ramp-up rate and decay rate) [147]. Continuously annotated corpora such as RECOLA, SEMAINE and CreativeIT (introduced in detail in [47], [48], [49]) can be helpful in investigating emotion dynamics, but still have the limitation that they were annotated in an absolute manner, rather than annotators being explicitly directed to rate emotion changes.
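For illustration, the sketch below shows how such frame-level dynamics targets can be derived from a continuous ground-truth rating trace as first-order (Δ) and second-order (ΔΔ) differences, as in [21], [131]; the rating trace and frame rate are synthetic assumptions, and the noisiness of the resulting differences hints at the suitability question raised above.

```python
# Sketch: first- and second-order difference "dynamics" targets from a rating trace.
import numpy as np

def emotion_deltas(ratings):
    """ratings: (N,) continuous arousal or valence annotations at a fixed frame rate."""
    delta = np.diff(ratings, n=1, prepend=ratings[0])     # first-order difference (Delta)
    delta_delta = np.diff(delta, n=1, prepend=delta[0])   # second-order difference (Delta-Delta)
    return delta, delta_delta

rng = np.random.default_rng(0)
t = np.linspace(0, 60, 1500)                              # 60 s annotated at 25 Hz (assumed)
arousal = 0.5 * np.sin(0.2 * t) + 0.05 * rng.normal(size=t.size)   # synthetic annotation trace
d, dd = emotion_deltas(arousal)
print(d[:5], dd[:5])                                      # raw differences amplify annotation noise
```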

3.2.2 Emotion Timing and Segmentation

Another sensible way to begin looking at emotion dynamics is to consider the timing at which emotion change occurs. This is not only crucial for understanding emotion dynamics according to the psychological studies mentioned in Section 3.1, but also valuable for exploring different possibilities in the development of automatic systems, which could be expected to operate in real time. To the best of our knowledge, there is only a small amount of previous work related specifically to the timing of emotion changes, which can be roughly categorised into three groups of studies: 1) emotion change related studies using non-speech signals, 2) emotion change related studies using speech signals, covering both emotion categories and dimensions, and 3) studies investigating emotion segmentation.

In facial emotion research, it is commonly believed that facial changes can mirror changes in emotions, referring to the onset and offset of facial expressions [12], [5]. In 2000 and 2001, Niedenthal et al. investigated changes in facial expression to study emotion transitions between happiness, sadness and neutral, with the aim of understanding why smiles drop and of studying emotion congruence [150], [167]. More specifically, their subjects were instructed to note the time at which an emotional face in a movie starts to change, which did not involve any design of automatic systems for emotion change. A similar idea regarding the onset, apex and offset of emotion has been extended to speech, e.g. [13], [163], although these studies target emotion recognition. Using several psychological measures, Leon et al. [168] considered that changes from neutral emotion are associated with large residuals between measured physiological signals (from actual sensors) and estimated physiological signals (from a model trained on neutral emotion data). The larger the residuals are, the more likely there is a change from neutral emotion, which is tested using the Sequential Probability Ratio Test (SPRT). However, the experimental details were not clearly stated, and the dataset, which contains 160 samples for each emotion per day (8 emotions and 20 days in total), may be unrealistic. Moreover, the approach can be considered similar to novelty detection via significance testing on two distributions, which could be vulnerable to noisy physiological signals from sensors, since they potentially produce large residuals.

Based on speech, Böck et al. [157] recently found that both inter- and intra-speaker emotional evolution in conversations is detectable using an MFCC-GMM based emotion recognition system. The 'emotion evolution' in [157] was formed by considering the first five minutes of speech and the last five minutes of speech in 32 recordings of 25 minutes on average, under the assumption that the participant's emotional state differs between the two segments. An important note from [157], though, is that speaker variability significantly degraded system performance for detecting emotion evolution, highlighting the individual differences in terms of emotion evolution (which is a focus in affective science studies [17], [19], [34], [142], [147]), as well as the need to consider speaker dependency when investigating emotion change. Furthermore, there have been some studies hinting at the possibility of localising the time at which emotion changes occur from speech [169], [170], [171]. For instance, Xu et al. [169] attempted to detect emotion evolution similarly to [157], using intersections of the two most prominent smoothed emotional scores within a sliding window framework. The 'emotion evolution' was formed by concatenating two utterances of different emotions into one utterance of 6-8 seconds. The emotional scores were generated from GMM-supervector based SVM emotion models, with one model trained for each emotion. Fan et al. [170] tried to continuously detect boundaries of different emotions (neutral-anger and neutral-happiness) using multi-timescale sliding windows, where decisions of emotion recognition from each scale collectively contributed to a final decision every 0.1 seconds (i.e. each slide). These decisions were compared with the ground truth for each slide to calculate accuracy, which in essence evaluates recognition of emotion rather than detection of the boundaries around which emotions change.

Different from the aforementioned studies that aimed to detect changes among emotion categories, there are some studies investigating changes in emotion dimensions. Lade et al. [30] were among the first to propose an adaptive temporal topic model to detect large changes in arousal and valence on a per-frame basis using audio-visual features on the AVEC 2012 dataset [172]. Latent Dirichlet Allocation (LDA) based topic models [173] have been extensively studied in text mining, where a document, which contains a number of words, is represented in terms of probabilities of different topics. A topic is associated with a set of words that co-occur in documents. In the emotion context, each feature vector is quantised into a

document that is associated with a certain number of latent topics. In their work, 30 topics were empirically used and 1242-dimensional audio features were quantised into a document with 62100 words using 50 cluster bins for the emotion change detection task. These topics change when the emotion changes, and significant changes can be detected. Despite lack of ground truth for changes and clarity of the time step they used, this method is useful for capturing dynamic information of emotion and has the potential to be applied to emotional categories. There has been some recent work regarding emotion segmentation [174], [175], [116]. Different segmentation schemes, normally fixed-length or variable-length segmentation, were employed to effectively exploit segment-level features [174] or models [175] for more emotionally salient segments and improved emotion recognition performance. Nevertheless, this type of system typically produces little explicit information about the timing of emotion boundaries. Despite this, some studies have suggested the association between emotional segmentation and the timing of emotion change, e.g. [22], [176], where dynamic programming approaches such as the Viterbi decoding can be used to refine boundaries (or segments) of different emotions across time and therefore the timings of emotion change can be known. Having mentioned emotion segmentation, an important note is that a common motivation to segment emotional speech is to find emotionally salient segments, which in general make greater contributions to utterance-level emotion recognition, where there is one label per utterance to be predicted or recognised, than other less informative segments. To this end, many criteria have been proposed such as mutual information [177] and classifier agreement [178]. However, a few studies have investigated this problem from the viewpoint of emotion change, and a question raised accordingly is whether emotion change plays a role in identifying emotionally salient segments from speech. Kim speculated that low variation in emotions, which implies clear expression of emotions, may favour emotion recognition [179]. However, this has not been experimentally investigated. Even though many studies discussed in this subsection more or less relate to emotion timing, investigations of this kind are an understudied area. Furthermore, they diverge significantly in terms of problem formulation, and most systems are still somewhat limited to emotion recognition problems [116], [157], [169]–[171], [174], [175], and have limited systematic insight into the timing of emotion change in many aspects such as definition of emotion change in time, temporal precision of emotion change, and real-time applicability. In addressing these problems, the problem for emotion timing is to detect the instant of emotion change in time (i.e. ECD in Figure 1.2), which offers many benefits in practice such as emotion regulation,


computational efficiency in real-time HCI, task transitions, and security and medical monitoring (as motivated in Section 1.1). Nevertheless, investigations into ECD remain rare. However, learning the time point of change-related properties has been a long-standing problem in other research fields such as speaker change detection, concept drift detection and video shot change detection, all of which aim to localise the instant of change events, such as a change of speaker or image, in time. This may offer opportunities to develop emotion change systems that are able to detect the instant of emotion change along the time course of emotion. Noting this possibility, we review common change point detection approaches, including speaker change detection and statistical change detection, to identify potentially applicable approaches for developing systems for emotion timing.

3.3 Automatic Methods for Emotion Timing and Segmentation

3.3.1 Speaker Change Detection

Speaker Change Detection, as the name suggests, aims to localise the time at which speaker identity changes, for example during a recorded conversation. It is an integral part of speaker diarisation, which aims to resolve the problem of "who speaks when". Speaker diarisation comprises two stages: the first stage is speaker change detection, which coarsely segregates speech from different speakers, and the second stage is speaker clustering, which groups each individual speaker's speech into one cluster. Speaker diarisation is commonly applied to conversational speech or broadcast audio, and serves as an important front-end processing step for other speech-related tasks such as speech recognition.

Speaker change detection can be formulated as a hypothesis testing problem, where the null hypothesis claims no change in speaker at time 𝑡, 0 ≤ 𝑡 ≤ 𝑇 in Figure 3.1, whilst the alternative hypothesis claims a change in speaker at time 𝑡. 𝑇 is the length of the window 𝑍.
- Null hypothesis: No speaker change at time 𝑡
- Alternative hypothesis: Speaker change point at time 𝑡
The aim is to find sufficient evidence to reject the null hypothesis, in which case a change at time 𝑡 is detected.



Figure 3.1: One sliding window for speaker change detection. In this approach, the window Z comprises two windows X and Y of variable lengths. The question to answer is whether there is a speaker change at a particular point 𝒕, which is resolved by independently modelling the three windows to compare the likelihood of Z with the likelihood of X+Y; a change is preferred if the latter is larger than the former.

Bayesian Information Criterion

The seminal Bayesian Information Criterion (BIC) based approach was proposed by Chen et al. [180] in 1998 to detect a change in speaker using the sliding window framework. BIC is a likelihood criterion penalised by the model complexity. Adapting the BIC to the framework gives:

$$D_{BIC} = \log p(X|\lambda_X) + \log p(Y|\lambda_Y) - \log p(Z|\lambda_Z) - \xi P \qquad (3.5)$$

where $X$, $Y$, $Z$ represent the data within each window, and $\lambda_X \sim \mathcal{N}(\mu_X, \sigma_X^2)$, $\lambda_Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$, $\lambda_Z \sim \mathcal{N}(\mu_Z, \sigma_Z^2)$ are Gaussian distributions fitted to the respective data within each window. In addition, $\xi$ is a penalty weight, which is typically 1, and $P$ is the penalty term defined as below:

$$P = \frac{1}{2}\left(\#\lambda_X \log N_X + \#\lambda_Y \log N_Y - \#\lambda_Z \log N_Z\right) \qquad (3.6)$$

where $\#\lambda_X$, $\#\lambda_Y$, $\#\lambda_Z$ denote the number of parameters in each model, and $N_X$, $N_Y$, $N_Z$ denote the length of each window, i.e. the number of frames. Note that the Gaussian models $\lambda_X$, $\lambda_Y$, $\lambda_Z$ are independently trained, and accordingly the log-likelihoods for each window given the model are independently evaluated. The score $D_{BIC}$ is directly compared with a defined threshold to detect speaker change. In Eq. (3.5), it is straightforward that if there is a change point at time 𝑡, windows 𝑋 and 𝑌 tend to have different speech characteristics, so the two windows tend to be better modelled by two separate Gaussian models, leading to a higher score for the term $\log p(X|\lambda_X) + \log p(Y|\lambda_Y)$ and consequently a higher $D_{BIC}$. If there is no change, the entire window 𝑍 tends to be better modelled via one Gaussian, i.e. $\lambda_Z$, which produces a higher $\log p(Z|\lambda_Z)$ and in turn a lower $D_{BIC}$. Speaker change detection is conducted by searching for an optimal time $\hat{t}$ that gives the highest $D_{BIC}$ within the window 𝑍. If $D_{BIC}(\hat{t})$ is higher than a predefined threshold, a speaker change point is detected. Afterwards, the whole window 𝑍 slides across time to initiate subsequent searches for $\hat{t}$:

$$\hat{t} = \underset{t}{\arg\max}\; D_{BIC}(t) \qquad (3.7)$$

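To make the search in Eqs. (3.5)–(3.7) concrete, the following is a minimal sketch of BIC-based change scoring over one window of frame-level features, assuming diagonal Gaussian models; the feature dimensionality, penalty weight and decision threshold are illustrative assumptions rather than values taken from this thesis.

```python
import numpy as np

def gauss_loglik(frames):
    """Total log likelihood of frames under a diagonal Gaussian fitted to them (ML estimate)."""
    mu = frames.mean(axis=0)
    var = frames.var(axis=0) + 1e-8              # variance floor to avoid log(0)
    ll = -0.5 * (np.log(2 * np.pi * var) + (frames - mu) ** 2 / var)
    return ll.sum()

def bic_scores(Z, xi=1.0):
    """D_BIC (Eq. 3.5) for every candidate change point t inside the window Z (frames x dims)."""
    N, D = Z.shape
    n_params = 2 * D                             # diagonal Gaussian: D means + D variances
    ll_z = gauss_loglik(Z)
    scores = np.full(N, -np.inf)
    for t in range(5, N - 5):                    # keep a few frames on each side of t
        penalty = 0.5 * (n_params * np.log(t) + n_params * np.log(N - t)
                         - n_params * np.log(N))          # Eq. (3.6)
        scores[t] = gauss_loglik(Z[:t]) + gauss_loglik(Z[t:]) - ll_z - xi * penalty
    return scores

# toy usage: the feature distribution shifts halfway through the window
rng = np.random.default_rng(0)
Z = np.vstack([rng.normal(0.0, 1.0, (100, 13)),          # e.g. 13-dimensional MFCC-like frames
               rng.normal(1.5, 1.0, (100, 13))])
scores = bic_scores(Z)
t_hat = int(np.argmax(scores))                           # Eq. (3.7)
print(t_hat, scores[t_hat] > 0.0)                        # candidate point and a thresholded decision
```

The same scoring without the penalty term corresponds to the GLR score of Eq. (3.8) introduced below.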
Note that the BIC criterion is widely used not only for segmentation [181]–[184] but also for clustering in speaker diarisation [185]–[188]. However, the BIC method potentially has a high miss detection rate [180]. Accordingly, a number of methods have been proposed to tackle this problem, e.g. [182], [185], [189].

Generalised Likelihood Ratio

Apart from the BIC method, distance-based methods have more often been applied for detecting changes in speaker identity, for instance, the Generalised Likelihood Ratio (GLR) [190], Kullback-Leibler (KL) divergence [191], and Cross Log Likelihood Ratio (CLLR) [192]. A sliding fixed-length dual window (Figure 3.2) can also replace the one sliding window framework (Figure 3.1) for speaker change detection.


Figure 3.2: One sliding dual-window of a certain length for speaker change detection. The window 𝒁 consists of two windows 𝑿 and 𝒀 of equal length.

The GLR method can be seen as a special case of the BIC method (Eq. (3.5)) when 𝑋 and 𝑌 are of equal length, as below. However, unlike BIC, which searches for the optimal time within the window 𝑍 (Eq. (3.7)), only one score $D_{GLR}$ is generated before the window 𝑍 slides across time.

$$D_{GLR} = \log p(X|\lambda_X) + \log p(Y|\lambda_Y) - \log p(Z|\lambda_Z) \qquad (3.8)$$

Kullback-Leibler (KL) Divergence

In addition to GLR, the KL divergence, which measures the dissimilarity of two distributions, has also been popular for speaker change detection [191]. Precisely, the KL divergence can be represented as below:


$$KL(\lambda_X, \lambda_Y) = \int_X p(X) \log \frac{p(X)}{p(Y)}\, dX \qquad (3.9)$$

where $p(X)$ and $p(Y)$ represent the probability density functions for 𝑋 and 𝑌, i.e. $\lambda_X \sim \mathcal{N}(\mu_X, \sigma_X^2)$ and $\lambda_Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$. Basically, $KL(\lambda_X, \lambda_Y)$ calculates the log-likelihood ratio between $\lambda_X$ and $\lambda_Y$ with respect to $\lambda_X$. A large $KL(\lambda_X, \lambda_Y)$ suggests a potentially large difference between $p(X)$ and $p(Y)$, whereas $KL(\lambda_X, \lambda_Y) = 0$ if $p(X)$ and $p(Y)$ are identical, according to Eq. (3.9). Since KL divergence is an asymmetric measure, more often a symmetric KL divergence $D_{KL2}$ is used:

$$D_{KL2}(\lambda_X; \lambda_Y) = KL(\lambda_X, \lambda_Y) + KL(\lambda_Y, \lambda_X) \qquad (3.10)$$

Further, the detailed equation for calculating $D_{KL2}$ given two Gaussians $\lambda_X$ and $\lambda_Y$ is:

$$D_{KL2}(\lambda_X; \lambda_Y) = \frac{\sigma_Y^2}{\sigma_X^2} + \frac{\sigma_X^2}{\sigma_Y^2} + (\mu_X - \mu_Y)^2\left(\frac{1}{\sigma_X^2} + \frac{1}{\sigma_Y^2}\right) \qquad (3.11)$$
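As an illustration of Eqs. (3.9)–(3.11), the sketch below fits univariate Gaussians to one feature dimension of two adjacent windows and computes the symmetric divergence; the window contents and the decision threshold are placeholders, not values used in this thesis.

```python
import numpy as np

def d_kl2(x, y, eps=1e-8):
    """Symmetric KL divergence (Eq. 3.11) between univariate Gaussians fitted to two windows."""
    mu_x, var_x = x.mean(), x.var() + eps
    mu_y, var_y = y.mean(), y.var() + eps
    return (var_y / var_x + var_x / var_y
            + (mu_x - mu_y) ** 2 * (1.0 / var_x + 1.0 / var_y))

# toy usage: one acoustic feature over the previous and current windows
rng = np.random.default_rng(1)
prev_window = rng.normal(0.0, 1.0, 200)
curr_window = rng.normal(0.8, 1.2, 200)
score = d_kl2(prev_window, curr_window)
print(score, score > 2.0)          # compare against an assumed threshold to flag a change
```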

Hence $D_{KL2}$, which measures the dissimilarity between 𝑋 and 𝑌 at time 𝑡, is directly used for detecting speaker change points by comparison with a predefined threshold. Afterwards, the window 𝑍 slides across time to generate subsequent $D_{KL2}$ values.

While the aforementioned BIC, GLR, and KL divergence approaches have been used for speaker change detection for decades, factor analysis [187], [193], [194], Variational Bayesian (VB) methods [195], [196], and one-class SVMs [197] have gained in popularity. The former was motivated by the recent success of the factor analysis framework for speaker recognition [198], where speaker components can be effectively modelled and differentiated via i-vectors, which provide a low-dimensional summary of the distribution of acoustic features with respect to a UBM. However, the i-vector may be less than effective in the emotion recognition context [70], because emotional corpora are commonly much smaller in size relative to speaker diarisation databases and significant amounts of balanced training data are needed. Also, the i-vector approach is based on MFCCs, which are not necessarily the best choice for emotion, given that a variety of acoustic features (e.g. prosodic features, as discussed in Section 2.2.1) and high-level representations of different acoustic features (Section 2.2.2) have become popular [27].

The review of the basic speaker change detection approaches offers potential possibilities for investigating the timing of emotion change. However, it is worth noting that directly applying these methods in the context of emotion change remains potentially problematic due to the phonetic and speaker variability embedded in emotional speech and the complex nature of emotion (e.g. a person might experience more than one emotion at a time). In other words, the approaches that are found effective for speaker modelling may not necessarily work well for emotion, which motivates consideration of other categories of change detection approaches.

3.3.2 General Change Detection – a Martingale Framework

Apart from change detection approaches based in speech processing, there are many statistical change detection methods proposed in other fields such as concept drift detection [199] and video shot change detection [200]. These kinds of methods might potentially alleviate the problems of speaker change detection such as variability and the unavailability of large datasets. Recently, a statistical Martingale method based on exchangeability testing was successfully applied to these two change-detection problems, i.e. detection of concept drift and video shot change [199], [200]. It has also been widely used in image processing [201]. However, there has been very little work done in speech apart from [202], in which the Martingale method was used for detecting changes in speech rate. Despite the difference in context, their work showed that testing exchangeability could be effectively used in speech processing. The idea of exchangeability was introduced by [203] and applied for change detection by [200]. Unlike most change detection methods using large sliding windows, Martingales have been proposed for detecting changes in streaming data and making decisions on-the-fly by testing exchangeability of sequentially observed data points [200], [203]. This opens a possibility for an alternative framework for investigating the timing of emotion change with higher temporal resolution than using large sliding windows [199].

Exchangeability

By definition [203], a sequence of random variables $\{x_1, x_2, \dots, x_n\}$ is exchangeable if their joint distribution remains unchanged regardless of any permutation $\pi$ of $\{1, \dots, n\}$, namely:

$$p(x_1, x_2, \dots, x_n) = p(x_{\pi(1)}, x_{\pi(2)}, \dots, x_{\pi(n)}) \qquad (3.12)$$

An example of exchangeability is the selection of balls in sequence without replacement from an urn in which there are only uniquely numbered blue balls (i.e. Case 1 in Figure 3.3(a)). In this case, the joint probability of choosing blue balls remains invariant no matter how we change the order of observation. Consider the case where a number of blue balls are selected up to time n, after which one starts to select red balls from another urn for which the overall probability of a blue ball is p(blue) < 1 (i.e. Case 2 and Case 3 in Figure 3.3). Then the selected ball sequence becomes less exchangeable, as the joint probability of selecting only blue balls is no longer 1. Such changes in model or distribution undermine exchangeability. A more detailed illustration of the concept of exchangeability can be seen in Figure 3.3.

Given the exchangeability, the change detection problem can be formulated as a problem of testing exchangeability to learn if there exists an observation number 𝑛, after which the distributions of the observed data sequence change.
- Null hypothesis (Exchangeable): $\theta_1 = \theta_2 = \dots = \theta_{N-1} = \theta_N$
- Alternative hypothesis (Non-exchangeable): $\exists n$ with $\theta_1 = \dots = \theta_n \neq \theta_{n+1} = \dots = \theta_N$

Figure 3.3: A simple example demonstrating the concept of exchangeability. (a) Consider that each urn represents a distribution and there are three cases for selecting balls from the urns. In the first case, only blue balls are selected, so P(X|θ1) remains invariant regardless of the order of balls being selected. (b) In the second case, one starts to select red balls at a certain time n, which results in a minor change in the joint distribution, i.e. P(X|θ2). (c) As more red balls are selected, in the third case, the joint distribution P(X|θ3) alters significantly. In this regard, lack of exchangeability implies a change in the distribution or model.

Martingales

Given a sequence of random variables $\boldsymbol{X}_i: \{x_1, x_2, \dots, x_i\}$, where $\boldsymbol{X}_i$ denotes all random variables from 1 to $i$, if $M_i$ is a measurable function of $\boldsymbol{X}_i$ and $E(|M_i|) < \infty$, then $\{M_i: 0 \leq i \leq \infty\}$ is a Martingale process once it satisfies [200], [204]:

$$E(M_{n+1}|\boldsymbol{X}_n) = M_n \qquad (3.13)$$

Eq. (3.13) defines a Martingale process, where the expectation of the next Martingale value $M_{n+1}$ equals the current Martingale value $M_n$ given all the previously observed random variables $\boldsymbol{X}_n$. Further, the terms Submartingale and Supermartingale can be defined respectively as:

$$E(M_{n+1}|\boldsymbol{X}_n) \geq M_n, \qquad E(M_{n+1}|\boldsymbol{X}_n) \leq M_n \qquad (3.14)$$

For change detection, the Martingale value $M_i$ measures the confidence of rejecting the null hypothesis of exchangeability. Combining exchangeability and Martingales, a family of Randomized Power Martingales, with initial value $M_0 = 1$, was proposed by [203]:

$$M_n^{(\varepsilon)} = \prod_{i=1}^{n} \varepsilon p_i^{\varepsilon-1} \qquad (3.15)$$

where $p_i$ can be seen as a measure of the exchangeability, and $\varepsilon \in [0,1]$ controls the threshold for transitions between the Supermartingale and Submartingale. From Eq. (3.13) and (3.15), $M_n^{(\varepsilon)}$ is a Martingale process once $\varepsilon p_i^{\varepsilon-1} = 1$, i.e.:

$$p_i = e^{\frac{\ln(\varepsilon)}{1-\varepsilon}} \qquad (3.16)$$

Also, according to Eq. (3.14) and (3.16), $M_n^{(\varepsilon)}$ becomes a Supermartingale when $p_i > e^{\frac{\ln(\varepsilon)}{1-\varepsilon}}$, whereas $M_n^{(\varepsilon)}$ becomes a Submartingale when $p_i < e^{\frac{\ln(\varepsilon)}{1-\varepsilon}}$. A Submartingale occurs when the data points observed are no longer exchangeable and $M_n^{(\varepsilon)}$ starts increasing. Once $M_n^{(\varepsilon)}$ is larger than a defined threshold, the null hypothesis of no change is rejected.

Martingale framework for general change detection

Based on exchangeability testing, it is crucial to calculate p values that are representative of exchangeability. There are two main steps: strangeness and p value calculation. Strangeness measures how different a data point is from others with respect to a model $\lambda$, which can be expressed for a data point $\boldsymbol{x}_n$ as:

$$s_n = f(\boldsymbol{x}_n, \lambda) \qquad (3.17)$$

where $f(\cdot)$ represents any transformation of the input $\boldsymbol{x}_n$ given a model $\lambda$, for instance, the Euclidean distance in the k-means algorithm. The larger $s_n$ is, the less likely the data point $\boldsymbol{x}_n$ comes from the model $\lambda$. Then the corresponding p value of $s_n$ can be calculated as follows [200]:

$$p_n(\boldsymbol{X}_n, \theta_n) = \frac{\#\{i: s_i > s_n\} + \theta_n \#\{i: s_i = s_n\}}{n} \qquad (3.18)$$

where $\#\{\cdot\}$ is the number of elements satisfying the bracketed condition and $\theta_n \in [0,1]$ is a random number.
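The sketch below strings Eqs. (3.15), (3.17) and (3.18) together on a synthetic one-dimensional stream with a clear mean shift; the Gaussian strangeness model, the 50-sample reference window, ε and the decision threshold are illustrative assumptions rather than settings used in this thesis.

```python
import numpy as np

def martingale_change_detection(stream, eps=0.92, threshold=10.0, ref_size=50):
    """Detect change points in a 1-D stream with the randomized power Martingale (Eq. 3.15)."""
    rng = np.random.default_rng(2)
    detections, strangeness, log_m = [], [], 0.0
    ref = stream[:ref_size]                              # samples used to fit the reference model
    for n in range(ref_size, len(stream)):
        x = stream[n]
        mu, var = np.mean(ref), np.var(ref) + 1e-8
        s = 0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)  # negative log likelihood (Eq. 3.17)
        strangeness.append(s)
        s_arr = np.asarray(strangeness)
        theta = rng.uniform()
        p = ((s_arr > s).sum() + theta * (s_arr == s).sum()) / len(s_arr)   # Eq. (3.18)
        p = max(p, 1e-6)                                 # guard against log(0)
        log_m += np.log(eps) + (eps - 1.0) * np.log(p)   # log of the product in Eq. (3.15)
        if log_m > np.log(threshold):                    # exchangeability (null hypothesis) rejected
            detections.append(n)
            log_m, strangeness = 0.0, []                 # reset the Martingale
            ref = stream[max(0, n - ref_size):n]         # retrain the model on recent samples
    return detections

# toy usage: a clear mean shift after sample 300
rng = np.random.default_rng(3)
stream = np.concatenate([rng.normal(0, 1, 300), rng.normal(3, 1, 300)])
print(martingale_change_detection(stream))
```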



Figure 3.4: An overview diagram of the Martingale framework for change detection. For each incoming data point x_n, the strangeness of the data point with respect to the model λ is firstly calculated and compared with the strangeness values of previously observed data points. This comparison then provides evidence about exchangeability in terms of a p value, which is further used to calculate a Martingale value M_n. M_n can then be compared with a threshold to make decisions about the presence of emotion change.

If there is no change, the observed data points are all from the same model $\lambda$, which implies similarity and exchangeability of $s_1, \dots, s_n$. Accordingly, the p values are uniformly distributed on [0, 1], and their expectation $E(p_n) = 0.5 > e^{\frac{\ln(\varepsilon)}{1-\varepsilon}} \in [0, 0.3679]$, depending on $\varepsilon$, which is typically selected from [0.90, 0.96] [200]. According to Eq. (3.15), $M_n^{(\varepsilon)}$ is preferably a Supermartingale and decreases with some fluctuations due to the fact that $p_i$ is random. Once observed data points are not exchangeable, the strangeness $s_n$ becomes larger and in turn $p_n$ in Eq. (3.18) has a smaller value. Accordingly, $M_n^{(\varepsilon)}$ increases until a predefined threshold, above which a change is detected. Once a change is detected, $M_{n+1}^{(\varepsilon)}$ is reset to 1 and the detection restarts based on a new model trained using recent samples (50 samples in Figure 3.5).

Figure 3.5: Demonstration of how the Martingale framework is used for change detection via exchangeability testing. In this simplified example, samples are randomly generated from Gaussian distributions of two classes with a mean shift of 1. The strangeness measure is the negative log likelihood of a single Gaussian distribution trained using the most recent 50 samples after a change is detected. Note that there is a delay between the ground truth change points and the peaks of the Martingale values.

3.4 Concluding remarks

This chapter has presented a review of existing literature that investigates emotion change from both psychology and engineering perspectives. Motivated by the fact that emotion is a continuous process and dynamic in nature, there have been a number of emerging studies regarding emotion dynamics in affective science, whilst, by contrast, automatic systems concerning emotion change have received far less attention. On the other hand, the fact that emotion change research is an understudied yet important area opens up a wide range of potential further investigations. The review of emotion change research focused on two aspects, i.e. computational modelling of emotion change and the timing of emotion change.

First, this chapter reviewed existing studies investigating computational models which consider emotion change. The majority of these studies concentrate on the use of emotion change to facilitate recognition or prediction of emotion, which remains an open question to date. Moreover, many interesting aspects of emotion change have been overlooked during the development of automatic systems. For example, a few initial investigations suggest the possibility of developing systems that can identify the direction of emotion change or predict change in emotional intensity via explicit modelling of emotion dynamics. There could be many benefits in resolving these problems. One benefit is that automatic systems could provide meaningful interpretation (e.g. direction and extent) of emotion change to facilitate HCI (e.g. being informed of changes in order to react) and studies associated with the effects of emotion change (e.g. task transition and psychological studies of emotion dynamics); this is information that conventional emotion recognition or prediction systems are unable to provide. Another benefit is that explicit modelling of emotion dynamics may allow more accurate characterisation of emotion dynamics, which helps interpret emotion in psychology and facilitates recognition or prediction in automatic systems.

Second, another sensible starting point for research is to consider the time instant of emotion change. This initiative would benefit a variety of areas, ranging from psychological studies (e.g. emotion regulation and the investigation of individual differences in emotion change) to real-life applicability (e.g. efficient HCI, task transition, security and medical monitoring). Even though a few studies have focused on the timing of emotion changes and emotion segmentation, the main limitations of most of these studies lie in the fact that they are mainly based on emotion recognition, and in their problem setting, which mostly lacks a clear definition of emotion change points in time and of the temporal precision of emotion change. Accordingly, by drawing on other fields that have long focused on changes in time series, such as speaker change detection and more general change detection problems, this chapter also highlighted a statistical Martingale-based change point detection method and several common approaches

including Bayesian Information Criterion, Generalised Likelihood Ratio and KL divergence for their potential application to emotional speech.


Chapter 4

Datasets

The experiments in this thesis were performed on four distinct corpora. They are the USC Interactive Emotion Dyadic Motion Capture (IEMOCAP) corpus [95], the Remote Collaboration and Affective Interaction (RECOLA) corpus [47], the Sustained Emotionally coloured Machine-human Interaction using Nonverbal Expression (SEMAINE) corpus [48], and the USC CreativeIT corpus [49], which are discussed in Section 4.1. This chapter also demonstrates how suitable datasets can be built from existing datasets to satisfy the need for emotion change research.

4.1 Characteristics of key emotional corpora

4.1.1 Description of datasets

The USC IEMOCAP corpus3 was collected in scenarios where two American English speakers (a male and a female) engaged in scripted (acted) or spontaneous (naturalistic) conversations. The scenarios were deliberately designed to elicit specific emotions. The database contains 5 sessions of audio-visual data from 10 speakers, 12 hours in total, which makes it reasonably large among publicly available emotional corpora. In each session, a VICON motion capture system with eight cameras was used to collect signals from markers attached to one participant's face, head and hands, while two high quality microphones were used to record the speech signals from the two speakers in a clean environment. The audio sampling rate was 48 kHz, with unknown quantization. In IEMOCAP, recordings were pre-segmented into short turns, and each turn (i.e. utterance) was then annotated with one of ten discrete emotion categories (e.g. happiness, anger, etc.) as well as three continuous emotion dimensions (i.e. arousal, valence and dominance) by three annotators based on audio and video signals. More precisely, annotators manually assigned each turn one of ten emotional labels and rated its emotion dimensions from 1 to 5 on a 5-point scale.

The RECOLA corpus4 is a spontaneous (naturalistic) multimodal database collected in settings where two French speakers remotely collaborate to complete a survival task via a video conference. During the collaborative interactions, multimodal signals, including audio, video and physiological signals such as electrocardiogram (ECG) and electro-dermal activity (EDA), were collected from 46 participants (data from 23 participants are publicly available). For each

3 https://sail.usc.edu/iemocap/
4 http://diuf.unifr.ch/diva/recola


participant, each recording is 5 minutes long and continuously annotated for arousal and valence between [-1, 1] with a step of 0.01 by six annotators based on audio and video. There are 25 frames of ratings per second. The audio was recorded using a unidirectional headset microphone (AKG C520L) with a sampling rate of 44.1 kHz and 16-bit quantization, whereas the video was recorded using a Logitech C270 camera.

The SEMAINE corpus5 is an English spontaneous (naturalistic) audio-visual database collected by Queen's University Belfast, Northern Ireland, in the Sensitive Artificial Listener (SAL) scenario, where a person engages in emotionally coloured interactions with one of four emotional operators: angry Spike, happy Poppy, gloomy Obadiah and sensible Prudence, each of whom tries to make the user experience the same emotion. Unlike IEMOCAP, recordings in the database were continuously evaluated by 2–8 annotators in terms of five core affect dimensions (i.e. arousal, valence, power, expectation and intensity) between [-1, 1], as well as other optional descriptors (e.g. happiness, interest, etc.), via Feeltrace. The annotations were based on audio and video signals. There are 50 frames of ratings per second. Audio signals were recorded mainly using a wearable microphone (AKG HC-577-L) with a sampling rate of 48 kHz and 24-bit quantization, whereas video was recorded at 49.979 frames per second and at a spatial resolution of 780 × 580 pixels using AVT Stingray cameras. The audio was stored without compression.

5 https://semaine-db.eu/
6 https://sail.usc.edu/CreativeIT/ImprovRelease.htm


arousal), while others from video (e.g. happiness or valence) [7], [205]. During the annotation process, people conducted annotations based on what they were seeing and hearing. Also, due to the subjective nature of emotion evaluation, annotations from multiple annotators are commonly available, which allows calculation of “consensus” as the ground truth. 4.1.2

Description of selected partitions

These four corpora were employed to fulfil various purposes of investigation, including Emotion Change Detection (ECD), Absolute Emotion Prediction (AEP) and Emotion Change Prediction (ECP). Table 4.1 describes a summary of the specific data partitions used, including number of subjects, total duration, available emotional annotation as well as the tasks and chapters where the databases were used. It should be noted that in all the four corpora, only audio recordings were considered in this thesis, while annotators perceived both audio and video during rating. For ECD, which aims to detect emotion change points in time from speech, the IEMOCAP and SEMAINE datasets were chosen. The availability of categorical annotation on IEMOCAP enables two possible investigations of emotion changes: 1) change among different pairs of arbitrary emotion categories; and 2) change between only neutral speech and emotional speech. To investigate these two possibilities, we selected a widely used data partition [70], [93] of the IEMOCAP which contains four emotions, i.e. neutral (1708 turns), anger (1103 turns), happiness7 (1636 turns), and sadness (1084 turns), 5531 turns in total. Anger, happiness, and sadness were further be merged into ‘emotional speech’, as opposed to neutral speech, for selected experiments. The mean length for the 5531 turns is 4.52 seconds. Another possibility is to detect emotion change between positive vs negative arousal/valence, given the availability of turn level annotation of dimensional emotions on IEMOCAP. Although the SEMAINE dataset contains only frame level annotation of arousal and valence, deriving turn level annotations by averaging them per turn is not uncommon [206]. For emotion dimensions, we consider all utterances with arousal and valence ratings on IEMOCAP dataset, 10039 turns in total (the mean length of turns is 4.44 seconds), whereas within SEMAINE, we considered only Solid SAL sessions with transcriptions and annotations from R1, R2, R3, R4, R5 and R6 for consistency, 1989 turns in total after segmentation based on the transcriptions (the mean length of turns is 2.43 seconds). All turn level dimensional ratings were preprocessed with z-normalisation and thresholding per corpus. Thresholds of ±0.3 and ±0.7 were

7

The two emotions happiness and excitation were merged into the happiness class

61

For AEP, which aims to continuously predict arousal and valence from speech, three widely used corpora – RECOLA, SEMAINE and CreativeIT – were selected. All of them provide continuous frame-level annotation of arousal and valence, but at different frame rates, as shown in Table 4.1. For RECOLA, we adopted the same partition as the Audio/Visual Emotion Challenge and Workshop (AVEC) 2016 challenge [118]. For SEMAINE, we used the same criteria to select 56 sessions from 14 speakers, that is, the availability of transcription and the consistency of annotators in the SAL-only sessions. Within each session, since the lengths of annotations from the six annotators may vary, all ratings were shortened to the minimum length among the six annotators by removing the trailing redundant annotations. For CreativeIT, 90 sessions from 15 actors were adopted as per [177].

For ECP, which aims to predict how much arousal or valence has changed across time, we selected the same data subsets of the RECOLA and SEMAINE datasets used for AEP, i.e. speech data from 18 sessions on RECOLA and from 56 sessions on SEMAINE. This was done to ensure a fair comparison between ECP and AEP experiments, and to gain insights into the differences in system design between ECP and AEP, and into how well one can predict change in arousal and valence compared with prediction of absolute arousal and absolute valence.


Table 4.1: Summary of emotional databases employed for investigation in this thesis. Abbreviations: M – Number of males, F – Number of females, ECD – Emotion Change Detection, ECP – Emotion Change Prediction, AEP – Absolute Emotion Prediction, FPS – Number of arousal/valence labels (frames) per second.

Corpus | Subjects | Total duration | Turn-level annotation | Frame-level annotation | Tasks
IEMOCAP [95] | 10 subjects (5M, 5F) | Categorical: 6.96 hours; Dimensional: 12.38 hours | Categorical: 5531 turns (NEU, SAD, ANG, HAP); Dimensional: 10039 turns (Arousal, Valence) | --- | ECD
RECOLA [47] | 18 subjects (8M, 10F) | 1.5 hours | --- | Arousal, valence (25 FPS) | AEP, ECP
SEMAINE [48] | 14 subjects8 (6M, 8F) | 4.43 hours | Dimensional: 1989 turns (Arousal, Valence) | Arousal, valence (50 FPS) | ECD, AEP, ECP
CreativeIT [49] | 15 subjects (7M, 8F) | 4.34 hours | --- | Arousal, valence (60 FPS) | AEP

8 The subject (user) IDs on SEMAINE are: 2, 3, 4, 5, 7, 8, 9, 10, 11, 12, 14, 15, 16, 17
4.2 Constructed datasets for emotion change research

All the corpora mentioned above, as well as the majority of emotion corpora available within the affective computing community, contain only annotations rated in an absolute manner. Although the RECOLA, SEMAINE and CreativeIT datasets, where emotions are continuously annotated, can be helpful in investigating emotion change, they still have the limitation that they are annotated in an absolute manner, rather than annotators being explicitly directed to rate emotion change, such as when an emotion change occurs and how much emotion changes across time. It would be difficult to commence investigations into emotion change detection and emotion change prediction without a suitable dataset that provides ground truth regarding emotion change points as well as the extent of emotion change. However, collecting a new dataset from scratch would be time-consuming and demanding, and a compromise is to construct datasets from the existing ones. To this end, a few datasets were tailored to be suitable for emotion change research.

4.2.1 Emotion change detection

This subsection elaborates on the process used to construct datasets with ground truth emotion change points in time from IEMOCAP and SEMAINE. The recording conditions within the IEMOCAP and SEMAINE datasets were consistent and in clean environments. The speech utterances in IEMOCAP were manually segmented, while the speech in SEMAINE was segmented based on transcripts. For each utterance, the audio signals had silences at the beginning and end of the utterance, ranging from about 2.5 s (SEMAINE) to around 5 s (IEMOCAP), as shown in Figure 4.1. To allow detection of emotion change at turn level, frame-level ratings were averaged across all annotators and across the whole turn to form turn-level annotations of arousal and valence, as per [206].

To construct the dataset for ECD, two main steps were adopted. First, speech turns uttered by the same speaker were concatenated in the same order as the recordings were collected, and ground truth change points were then identified wherever the emotions in two contiguous turns differ (Figure 4.1). The assumption behind this is that each boundary between two utterances of different emotions can be treated as an emotion change point (as shown in Figure 4.1). This assumption is reasonable for two reasons: 1) across the entire datasets, the recording environment/equipment/channel was kept constant and utterances were only concatenated within the same speaker, so the differences between two utterances tend to be less affected by these factors and better reflect the actual emotion change; 2) the concatenation occurred during silence, and not during speech (which would create phase discontinuities at the

concatenation points), as shown in Figure 4.1. It is worth noting that this approach is not uncommon for generating ground truth emotion change points [157], [169], [170]. Given the unavailability of datasets specifically built for ECD within the affective computing community, this concatenation scheme might be the best practice to date. However, because of the very short duration of many utterances, concatenation of this kind can result in emotion change points occurring very frequently, sometimes less than 1 second apart, which seems to lack realism and motivates the introduction of a minimal distance 𝐷𝑚𝑖𝑛 to ensure reasonable distances between any two ground truth change points. More precisely, in the second step, short turns with duration less than 𝐷𝑚𝑖𝑛 were discarded, after which the first step was repeated once (Figure 4.2).
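A minimal sketch of the two-step construction described above is shown below; the turn representation as (emotion label, duration) pairs and the expression of change points as cumulative times are assumptions made for illustration rather than the exact bookkeeping used in this thesis.

```python
def build_ecd_ground_truth(turns, d_min):
    """Construct ground-truth emotion change points from one speaker's ordered turns.

    turns: list of (emotion_label, duration_in_seconds) in recording order.
    Step 1: merge contiguous turns with the same label; step 2: drop merged segments
    shorter than d_min and merge again; change points are the remaining boundaries.
    """
    def merge(segments):
        merged = []
        for label, dur in segments:
            if merged and merged[-1][0] == label:
                merged[-1][1] += dur                      # extend the previous same-emotion segment
            else:
                merged.append([label, dur])
        return merged

    segments = merge(turns)                               # step 1
    segments = merge([s for s in segments if s[1] >= d_min])   # step 2
    change_points, t = [], 0.0
    for label, dur in segments[:-1]:
        t += dur
        change_points.append(t)                           # boundary between differing emotions
    return segments, change_points

# toy usage: one speaker's turns as (emotion, seconds)
turns = [("neu", 5.0), ("neu", 4.0), ("ang", 2.0), ("hap", 9.0), ("hap", 3.0), ("sad", 8.0)]
print(build_ecd_ground_truth(turns, d_min=3.0))
```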

Figure 4.1: Concatenation of small utterances to form ground truth for emotion change points by merging the same emotions for one speaker. Note that silence was not considered, but was discarded based on voicing probability.


Figure 4.2: Concatenation of original dataset segments per speaker, followed by omission of small utterances to form final datasets for emotion change detection. The shaded areas were omitted.

Although both Figure 4.1 and Figure 4.2 demonstrate the process for constructing a dataset for detection of change points among four emotions (referred to herein as "EMO-4"), the same technique is equally applicable to emotion change between two emotions, i.e. emotion change between neutral and emotional speech ("EMO-2"), and between positive and negative arousal ("ECD-A") or valence ("ECD-V"). The constructed datasets from IEMOCAP and SEMAINE for these four emotion change detection tasks are summarised in Table 4.2.

Table 4.2: Summary of constructed datasets for detection of emotion change of four different types, including the resultant number of turns per class, the minimal distance 𝐷𝑚𝑖𝑛 and the final number of change points.

ECD task | IEMOCAP: # turns | 𝐷𝑚𝑖𝑛 | # change points | SEMAINE: # turns | 𝐷𝑚𝑖𝑛 | # change points
EMO-4 | Neutral (1708), Anger (1103), Sadness (1084), Happiness (1636) | 8 s | 199 | --- | --- | ---
EMO-2 | Emotions (3802), Neutral (1701) | 3 s | 224 | --- | --- | ---
ECD-A | Positive (1844), Negative (3089) | 7 s | 207 | Positive (821), Negative (737) | 0 s | 90
ECD-V | Positive (2807), Negative (3196) | 7 s | 196 | Positive (1027), Negative (473) | 0 s | 62

The choices of 𝐷𝑚𝑖𝑛 were empirically determined and are summarised in Table 4.2. More specifically, on IEMOCAP, a choice of 7 or 8 seconds for 𝐷𝑚𝑖𝑛 was selected to ensure sufficient distance between two emotion change points. However, using the same 𝐷𝑚𝑖𝑛 for EMO-2 resulted in fewer change points, so 𝐷𝑚𝑖𝑛 was reduced to 3 seconds. For SEMAINE, since the total number of utterances is too small to impose a 𝐷𝑚𝑖𝑛 while still attaining sufficient change points, 𝐷𝑚𝑖𝑛 was set to 0.

4.2.2 Emotion change prediction

The construction of the dataset tailored for ECP essentially comes down to the construction of a "delta" emotion ground truth, derived from the original absolute emotion ratings, which can represent emotion dynamics, i.e. the extent of emotion change across time. A suitable ground truth for emotion change is crucial not only for meaningful system design, such as modelling, but also for a better understanding of the underlying mechanisms behind emotion change. The process used to construct the "delta" emotion ground truth is elaborated in Section 7.2.


4.3 Concluding remarks

This chapter has described the four widely used emotion corpora adopted for the investigations in this thesis, namely IEMOCAP, SEMAINE, RECOLA and USC CreativeIT, in terms of corpus size, number of speakers, available emotion annotations and collection conditions. Moreover, to meet the needs of emotion change research, i.e. emotion change detection and emotion change prediction, this chapter has also explained how new datasets can be constructed from the existing datasets to provide ground truth for emotion change points in time for the ECD tasks.


Chapter 5

Emotion Change Detection

5.1 Introduction

This chapter aims to answer the first research question: how can emotion change points in time be detected from speech (Section 1.2)? Emotion Change Detection (ECD) is motivated by the need to explore the time course of emotion, which is crucial in many respects, such as emotion regulation, HCI, task transition, security and medical applications (Section 1.1). Moreover, as previously discussed in Chapter 3, most existing speech-based emotion recognition has focused on classifying or predicting emotions individually from pre-segmented speech signals (e.g. on a file-by-file basis) [27], [52], [85], [207], which lacks realism for real-time processing, and ECD can potentially bridge this gap. Furthermore, the review of previous literature considering emotion timing in Section 3.3 indicates that ECD remains a relatively understudied field to date and requires more investigation to advance further. The work described in this chapter was partly published in two conference papers [208] and [209].

The aim of this chapter is to investigate the possibility of detecting emotion changes in speech for both emotion categories and emotion dimensions. In meeting this objective, this chapter presents two possible frameworks:
(i) A dual-sliding window framework (Section 5.2)
(ii) A Martingale framework (Section 5.3)
The operation of the two frameworks on speech signals is illustrated in Figure 5.1 below.


Figure 5.1: A conceptual example of how the two proposed frameworks operate on speech signals. The two frameworks operate in two different manners: the dual-sliding window framework compares the similarity/diversity of two adjacent windows, drawing on analogous approaches from speaker change detection (Section 3.3.1), whilst the Martingale framework observes data points one by one and in turn performs statistical testing on the fly based on exchangeability (Section 3.3.2). Each dot or bar represents one frame-level feature vector, with frame boundaries shown as dashed vertical lines.

In the dual-sliding window framework, apart from existing speaker change detection approaches, we propose a new approach that takes account of prior emotion information. In the Martingale framework, we propose modifications to the conventional Martingale framework [200] in order to resolve potential issues such as phonetic variability and detection delays. The remainder of this chapter is laid out as follows: Section 5.2 presents explanations of the Generalised Likelihood Ratio (GLR) and the proposed Emotion Pair Likelihood Ratio (EPLR) methods in the dual-sliding window framework, Section 5.3 explains the proposed modifications applied to a statistical Martingale framework in the ECD context, Section 5.4 explains key experimental settings, Section 5.5 presents experimental results using the two proposed frameworks for a range of ECD tasks, and lastly a brief set of concluding remarks is given in Section 5.6.

5.2 The dual-sliding window framework

In the sliding window framework, the general system proposed for emotion change detection, shown in Figure 5.2, begins with a sliding dual window comprising previous and current fixed-length windows. Within these two windows, spanning multiple frames, features are extracted on a per-frame basis and can be used to calculate likelihoods. Scores, which comprise a linear combination of log likelihoods, are then calculated and compared with a threshold during the detection stage to produce a decision, i.e. change or non-change.


Figure 5.2: A general system diagram for emotion change detection based on a sliding window and likelihood framework. For a detection problem of this kind, there is a design question of how close to a ground truth change point a detected change point should be, in order to be considered correct. For this reason, a tolerance region can be introduced. A tolerance region is essentially a temporal window around each ground truth change point, within which any detected changes are regarded as a correct detection. If a score above the threshold for change is located in the tolerance region where the true change point centres, a change is correctly detected (as seen in Figure 5.3). Other scores above the threshold, outside tolerance regions, are considered false


alarms. To employ this paradigm for detecting changes in emotions, two parameters are required: (a) window size, and (b) window shift.
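To make the tolerance-region evaluation concrete, the sketch below counts a detection as correct if it falls within a symmetric tolerance of an unmatched ground-truth change point and as a false alarm otherwise; the time units, the tolerance width and the one-detection-per-ground-truth-point rule are assumptions of this sketch.

```python
def evaluate_detections(detected, ground_truth, tolerance):
    """Score detected change points against ground truth using a symmetric tolerance region."""
    matched = set()
    hits, false_alarms = 0, 0
    for d in sorted(detected):
        # find an unmatched ground-truth point whose tolerance region contains d
        candidates = [g for g in ground_truth if abs(d - g) <= tolerance and g not in matched]
        if candidates:
            matched.add(min(candidates, key=lambda g: abs(d - g)))
            hits += 1
        else:
            false_alarms += 1
    misses = len(ground_truth) - len(matched)
    return hits, false_alarms, misses

# toy usage: times in seconds, tolerance of 1.0 s
print(evaluate_detections(detected=[4.2, 9.6, 15.0], ground_truth=[4.0, 10.5, 20.0], tolerance=1.0))
```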

Figure 5.3: Dual windowing process applied on speech signals for emotion change detection. The window centre represents a ground truth change point, around which a tolerance region is assigned. During each window, i.e. the previous, the current and the entire dual window, acoustic features are extracted on a per-frame basis.

5.2.1 Generalised Likelihood Ratio (GLR)

A detailed description of the Generalised Likelihood Ratio [190] has been provided in Section 3.3.1, along with other common approaches such as the Bayesian Information Criterion (BIC) and Kullback-Leibler (KL) divergence. In the emotion change detection context, the GLR is used as a baseline approach to differentiate emotion changes from non-changes via likelihood ratios, as shown in Eq. (5.1). Normally, one multivariate Gaussian model is used to separately fit the previous window, the current window and the entire dual (previous and current) window respectively, and their log-likelihoods are calculated and combined as a score $D_{GLR}$:

$$D_{GLR} = LL_{prev} + LL_{curr} - LL_{prev+curr} \qquad (5.1)$$

where $LL_{prev}$, $LL_{curr}$ and $LL_{prev+curr}$ are the log-likelihoods of the three independent models. Log-likelihoods are each calculated across the respective window, i.e. one log-likelihood per window. GLR can be used to detect emotion change of any kind. Note that the GLR contains no prior information regarding emotions themselves, and therefore considering emotional information may provide further gains in detection accuracy. This motivates the additional investigation into systems that incorporate emotion models.

5.2.2 Emotion Pair Likelihood Ratio (EPLR)

The second method proposed herein applies likelihood ratios between different pairs of GMM-based emotion models for emotion change detection. These likelihood ratios are referred to as Emotion Pair Likelihood Ratios (EPLRs). One assumption behind this is that if prior information about emotions is available, the performance might be improved over methods with no

prior information. Another assumption is that likelihood is a key measure in likelihood-based emotion recognition systems; if there is a change in emotion, this might be reflected in variations in likelihoods. Assuming that emotion-specific models $e_i$ are available for a set $E$ of emotions of interest, let $LL_i^{prev}$ denote the log likelihood of emotion $i$ in the previous window. Then the log likelihood ratio for the transition from emotion $e_i$ to $e_j$ during the dual window can be expressed as:

$$LLR_{e_i \rightarrow e_j} = LL_j^{curr} - LL_i^{prev} = \sum_{\boldsymbol{y}_t \in \boldsymbol{X}_{curr}} \log p(\boldsymbol{y}_t|e_j) - \sum_{\boldsymbol{x}_t \in \boldsymbol{X}_{prev}} \log p(\boldsymbol{x}_t|e_i) \qquad (5.2)$$

where $\boldsymbol{x}_t$ and $\boldsymbol{y}_t$ are frame-level features in the previous ($\boldsymbol{X}_{prev}$) and current ($\boldsymbol{X}_{curr}$) windows respectively. Since for any pair of emotions $e_i$ and $e_j$ the transition can occur in either direction, a simple way to combine the log likelihood ratios into a single score $D_{EPLR}$ is:

$$D_{EPLR} = LLR_{e_i \leftrightarrow e_j} = \left|LLR_{e_i \rightarrow e_j} + LLR_{e_j \rightarrow e_i}\right| = \left|\left(LL_j^{curr} - LL_i^{prev}\right) + \left(LL_i^{curr} - LL_j^{prev}\right)\right| \qquad (5.3)$$

This bi-directional measure potentially captures changes in emotion log likelihood scores in both directions, and is thereby more general than measures operating in only a single direction. Moreover, the emotion models adopted in this framework may help reduce the effects of phonetic variability compared with methods that require no prior knowledge of emotions. This is because short-term acoustic features (especially spectral features) tend to be sensitive to changes in phonemes. Mapping the features into the likelihood space using emotion models could avoid capturing changes in phonemes, since the likelihoods over the duration of the windows with respect to each emotion are less sensitive to phoneme changes.

5.2.3 Normalisation for EPLR

According to Eq. (5.3), EPLRs are essentially a measure of how much the log likelihood of two emotions ($LL_i + LL_j$) changes from the previous window to the current window. Nevertheless, likelihoods might vary dramatically between windows due to the phonetic and speaker content as well as emotion. Indeed, upon closer examination of the curve of EPLRs vs ground truths, we found that both emotion-specific likelihoods and EPLRs change fairly consistently across all previous/current window boundaries, which is presumably due to phonetic variations between particular windows and the global similarity between emotion models, i.e. the movements of the log likelihoods tend to be synchronised over time across all emotions. As a result, the between-emotion differences in log likelihood levels are more salient than changes in ($LL_i + LL_j$) of

emotion pairs between two windows. To minimise this issue of log likelihood synchronisation across emotions, we performed window-based normalisation to remove the global similarity between the log likelihoods from different emotions. Given a dual window that has $N$ frames in total, $LL_{n,i}$ is used to represent the log likelihood of emotion $i$ in the $n$-th frame. The window-based normalisation can be formulated as:

$$\overline{LL}_i^{prev} = LL_i^{prev} - \mu_{prev}, \qquad \overline{LL}_i^{curr} = LL_i^{curr} - \mu_{curr} \qquad (5.4)$$

$$\mu_{prev} = \frac{\sum_{n=1}^{N/2} \sum_{e_i \in E} LL_{n,i}}{|E|}, \qquad \mu_{curr} = \frac{\sum_{n=N/2+1}^{N} \sum_{e_i \in E} LL_{n,i}}{|E|} \qquad (5.5)$$

$$\overline{D}_{EPLR} = \overline{LLR}_{e_i \leftrightarrow e_j} = \left|\left(\overline{LL}_j^{curr} - \overline{LL}_i^{prev}\right) + \left(\overline{LL}_i^{curr} - \overline{LL}_j^{prev}\right)\right| \qquad (5.6)$$

where $|E|$ denotes the number of elements in $E$. The normalisation operates at window level. First, the log likelihood $LL_{n,i}$ is averaged across the whole window and across all emotions (Eq. (5.5)), attaining $\mu_{prev}$ and $\mu_{curr}$ for the previous and the current window respectively. Then $\mu_{prev}$ and $\mu_{curr}$ are used to normalise the window-level log likelihoods $LL_i^{prev}$ and $LL_i^{curr}$ (Eq. (5.4)), which are used to calculate the normalised EPLR score $\overline{D}_{EPLR}$ (Eq. (5.6)).

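Putting Eqs. (5.2)–(5.6) together, the following sketch computes normalised EPLR scores for all emotion pairs in one dual window from a matrix of per-frame, per-emotion log likelihoods; the log-likelihood matrix is a random placeholder (in practice it would come from GMM emotion models), and the even split of the dual window is assumed.

```python
import numpy as np
from itertools import combinations

def normalised_eplr(frame_ll):
    """Normalised EPLR scores (Eq. 5.6) for all emotion pairs in one dual window.

    frame_ll: array of shape (N, E) holding the per-frame log likelihood of each of the
    E emotion models over the N frames of the dual window (first half = previous window).
    """
    N, E = frame_ll.shape
    prev, curr = frame_ll[: N // 2], frame_ll[N // 2 :]
    ll_prev = prev.sum(axis=0)                    # window-level log likelihood per emotion (Eq. 5.2)
    ll_curr = curr.sum(axis=0)
    mu_prev = prev.sum() / E                      # Eq. (5.5): pooled over frames, averaged over emotions
    mu_curr = curr.sum() / E
    ll_prev_bar = ll_prev - mu_prev               # Eq. (5.4)
    ll_curr_bar = ll_curr - mu_curr
    scores = {}
    for i, j in combinations(range(E), 2):        # Eq. (5.6) for each emotion pair
        scores[(i, j)] = abs((ll_curr_bar[j] - ll_prev_bar[i])
                             + (ll_curr_bar[i] - ll_prev_bar[j]))
    return scores

# toy usage: random per-frame log likelihoods for 4 emotion models over a 200-frame dual window
rng = np.random.default_rng(4)
frame_ll = rng.normal(-40.0, 3.0, (200, 4))
print(normalised_eplr(frame_ll))
```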
5.3 The Modified Martingale framework

An alternative method to achieve ECD is the statistical Martingale framework, which has been successfully applied to concept drift detection [199], video shot change detection [200], and more recently to the detection of changes in speech rate [202]. Unlike most change detection methods using large sliding windows, Martingales have been proposed for detecting changes in streaming data and making decisions on-the-fly by testing the exchangeability (elaborated in Section 3.3.2) of sequentially observed data points [200], [203]. This opens the possibility of an alternative framework for emotion change detection with improved temporal resolution [199]. Hence, in this section, the problem of localising emotion change points in time was investigated from the perspective of testing exchangeability using a Martingale framework, where data points (frame-based features) from speech are observed one by one.

However, it was observed from preliminary experiments that direct application of the Martingale framework to ECD potentially suffers from several problems: 1) phonetic variability from the use of acoustic features; 2) inability to handle long-term non-change, which is common in emotion, especially in spontaneous speech; and 3) large delay times. The first problem arises from the sensitivity of short-term acoustic features (especially spectral features) to changes in phonemes. This is because the original Martingale framework does not account for a

priori knowledge of emotion; rather, it constructs a reference model based on a few data points (typically 50 feature vectors, i.e. 0.5 seconds of speech), as shown in Figure 3.5. A model constructed over this short period is likely to capture phonetic content instead of emotion, and consequently the model becomes sensitive to changes in phonemes. The last two problems can be partially visualised in Figure 5.5(a). To resolve these problems, a global emotion model is introduced to test 'global' exchangeability with respect to the trained emotion model, so as to be less sensitive to the potential variability in acoustic features (as discussed in Section 5.2.2). Then a p value thresholding method is proposed which enforces floor and ceiling p values, and poses emotion change detection as a Martingale turning point detection problem, in which peaks and troughs are the indicators of emotion changes. Figure 5.4 illustrates the proposed Martingale framework.


Figure 5.4: Problems arising from direct application of the Martingale framework can be resolved by several modifications, which are marked in red in this figure. Firstly, we introduce a global emotion model 𝜽𝒆𝒎𝒐, based on which a global strangeness value S can be calculated. S determines the switch between a Supermartingale, in which $M_n^{(\varepsilon)}$ decreases, and a Submartingale, in which $M_n^{(\varepsilon)}$ increases. This results in a number of peaks and troughs as indicators of emotion change points, for which a two-pass linear regression method is used to detect changes in slope. Note that the logarithm of $M_n^{(\varepsilon)}$ is used to prevent overflow and underflow issues.

The modified Martingale method has five distinct steps, according to Figure 5.4:

(i) Extract frame-level D-dimensional acoustic features $\boldsymbol{x}_n$ from speech with a frame shift of 10 milliseconds.

(ii) Calculate a strangeness value $s_n$, which is the negative of the log likelihood given the feature vector $\boldsymbol{x}_n$ and a Gaussian mixture model $\lambda(\boldsymbol{\omega}, \boldsymbol{\mu}, \boldsymbol{C})$:

$$s_n = f(\boldsymbol{x}_n, \lambda(\boldsymbol{\omega}, \boldsymbol{\mu}, \boldsymbol{C})) = -\log\left(\sum_{i=1}^{m} \omega_i \frac{1}{(2\pi)^{D/2}\,|\boldsymbol{C}_i|^{1/2}}\, e^{-\frac{1}{2}(\boldsymbol{x}_n-\boldsymbol{\mu}_i)^{T}\boldsymbol{C}_i^{-1}(\boldsymbol{x}_n-\boldsymbol{\mu}_i)}\right) \qquad (5.7)$$

(iii) Specify p values based on $s_n$:

$$p_n = \begin{cases} p^{sub}, & s_n \geq S \\ p^{super}, & s_n < S \end{cases} \qquad (5.8)$$

where $p^{sub} \in [0, e^{\frac{\ln(\varepsilon)}{1-\varepsilon}})$ and $p^{super} \in (e^{\frac{\ln(\varepsilon)}{1-\varepsilon}}, 1]$ are the parameters that activate the Randomized Power Martingale into a Submartingale and a Supermartingale respectively. S, a threshold for exchangeability, is an important parameter in the sense that a high S tolerates some data points that are less likely to be from model $\lambda(\boldsymbol{\omega}, \boldsymbol{\mu}, \boldsymbol{C})$, whereas a small S rejects some data points that are likely to be from model $\lambda(\boldsymbol{\omega}, \boldsymbol{\mu}, \boldsymbol{C})$. Both cases lead to unreliable transitions between Submartingale and Supermartingale using Eq. (5.8), which in turn gives rise to log(M) characteristics with incorrect turning points. To address this problem, S is calculated as a trade-off between the distributions of the two classes:

$$S = \frac{\boldsymbol{S}_Q^1 + \boldsymbol{S}_{100-Q}^2}{2} \qquad (5.9)$$

where $\boldsymbol{S}_Q^1$ denotes the $Q\%$ percentile of all the strangeness values of class 1, estimated using ground truth from held-out training data (e.g. from other speakers). One advantage of estimating S using Eq. (5.9) is that this can offer good separation between the two classes, avoiding relatively large and small S so as to make sure that a transition between Submartingale and Supermartingale occurs when there is a change.

(iv) Calculate the log randomised power Martingale $\log(M_n)$ using Eq. (3.15) with $p^{sub}$ or $p^{super}$:

$$\log M_n^{(\varepsilon)} = \sum_{i=1}^{n} \log\left(\varepsilon p_i^{\varepsilon-1}\right) \qquad (5.10)$$

where the direction of $\log(M_n)$ will be determined by $p^{sub}$, which leads to an increase in $\log(M_n)$, and $p^{super}$, which leads to a decrease in $\log(M_n)$, as seen in Figure 5.5(b).

(v) Detect turning points using two-pass linear regression on $\log M_n^{(\varepsilon)}$ using Eq. (5.11):

$$D_{Mar} = \begin{cases} 1, & \text{if } b_{prev}^{N_1} \ast b_{curr}^{N_1} < 0 \ \text{and} \ b_{prev}^{N_2} \ast b_{curr}^{N_2} \\ 0, & \end{cases}$$