Using Ensemble Classifier Systems for Handling Missing Data in Emotion Recognition from Physiology: One Step Towards a Practical System

Cornelia Setz, Johannes Schumm, Claudia Lorenz, Bert Arnrich, Gerhard Tröster
ETH Zurich, Electronics Institute
Gloriastrasse 35, 8092 Zurich
[email protected]

Abstract

Previous work on emotion recognition from physiology has rarely addressed the problem of missing data. However, data loss due to artifacts is a frequent phenomenon in practical applications. Discarding the whole data instance if only a part is corrupted results in a substantial loss of data. To address this problem, two methods for handling missing data (imputation and reduced-feature models) in combination with two classifier fusion approaches (majority and confidence voting) are investigated in this work. The five emotions amusement, anger, contentment, neutral and sadness were elicited in 20 subjects by films while six physiological signals (ECG, EMG, EOG, EDA, respiration and finger temperature) were recorded. Results show that classifier fusion significantly increases the recognition accuracy in comparison to single classifiers, by up to 16.3%. Regarding the methods for handling missing data, reduced-feature models are competitive with or even slightly better than models which employ imputation. This is beneficial for practical applications where computational complexity is critical.

1. Introduction and motivation

Applications for emotion recognition are predominantly found in the field of Human-Computer Interaction (HCI). By including emotions, HCI shall become more natural, i.e. more similar to human-human interaction, where information is not only transmitted by the semantic content of words but also by emotional signaling in prosody, facial expression and gesture. One such application is the Affective DJ [1], a PDA which chooses music automatically based on the user's emotional state inferred from physiological signals. Kapoor et al. [2] used multiple channels of affect-related information to detect frustration of a learner. Intelligent Tutoring Systems could use this information for determining when to initiate interaction. Our work originates from a project which aims at increasing comfort for airplane passengers by developing a smart seat.

Since the subjective appraisal of comfort is very important, emotion recognition can help in determining appropriate adjustments of the seat environment, the entertainment system or the attendance of the stewardess for each passenger individually. For example, the entertainment system could automatically choose calming music if a passenger feels irritated, or the stewardess could distract a passenger from his fearful thoughts. This work, however, focuses on the recognition of emotions, which represents a prerequisite for determining appropriate adjustments of the seat environment. In order to infer the emotional state of the passenger, physiological sensors were integrated into an airplane seat. When choosing the sensors, we aimed at minimizing the discomfort induced by the sensor attachment. Such unobtrusive sensors are generally more prone to artifacts and therefore lead to a lower signal quality. This trade-off is not restricted to our airplane seat but is relevant for all applications where unobtrusive sensors are preferred. Many artifacts can be detected automatically by plausibility analyses. Often, the artifacts do not occur in all physiological signals simultaneously. If a single modality fails, the entire data instance, i.e. all the remaining signal modalities as well, is usually discarded. This results in a substantial amount of unusable data for classifier training. Moreover, predicting emotions at run-time becomes impossible if only a single signal modality which was used to train the classifier fails. Since data loss due to artifacts is frequently encountered in practical applications, missing values represent a serious problem that needs to be addressed but has so far gained little attention in emotion recognition from physiology. We therefore tested two methods for handling missing data in combination with two classifier fusion approaches.

2. Related work

Modalities which have been used to automatically detect emotions include facial expression [3–5], speech [6–8] and physiological signals.

Since physiological signals are used in this work, this section will focus on the latter. Ekman et al. [9] already showed in 1983 that anger, fear and sadness increase the heart rate more than happiness and surprise, while disgust decreases the heart rate. Anger increases the finger temperature more than happiness and sadness do, while fear, surprise and disgust decrease it. Newer studies employ pattern recognition methods in order to distinguish between different emotional states. In [10], a single actor tried to elicit and experience each of eight emotional states, daily, over multiple weeks. The states no emotion, anger, hate, grief, platonic love, romantic love, joy, and reverence could be recognized with 81% accuracy by using four physiological sensors: EMG along the masseter, blood volume pressure, EDA and respiration. Sequential forward floating selection followed by Fisher projection was employed before clustering the data with a maximum-a-posteriori classifier. A total of 29 subjects participated in the study of Lisetti and Nasoz [11]. Emotional film clips in combination with difficult mathematical questions were used to induce sadness, anger, surprise, fear, amusement and frustration. The physiological signals included electrodermal activity, heart rate and skin temperature. Among three classifiers, a neural network (Marquardt backpropagation algorithm) yielded the best recognition rate of 84.1%. Kreibig et al. [12] employed film clips to elicit the three states fear, neutral and sadness. They conducted an experiment with 34 subjects and measured the three signals ECG, GSR and respiration. An LDA classifier reached an accuracy of 69%. The four emotional states joy, pleasure, sadness and anger were investigated in [13]. Only a single subject was considered. The emotional states were induced by music selected by the test subject himself. The recorded signals included EDA, EMG measured at the neck, respiration and ECG. Various classifiers and dimension reduction techniques were applied. Linear discriminant function classification with sequential forward selection resulted in a recognition rate of 92%. Classifications of different emotion groups resulted in 88.6% for positive/negative valence and in 96.6% for low/high arousal. A recent work of the same group analyzed the data of three subjects [14]. The emotion induction technique and the measured signals were the same as in [13]. An extended linear discriminant analysis (LDA) with sequential backward selection was compared with a novel emotion-specific multilevel dichotomous classification scheme. With the latter, an accuracy of 95% was reached for subject-dependent and 70% for subject-independent classification. For more studies, the reader is referred to the list of collected results presented in [11]. It is current practice to discard the corresponding episodes when artifacts are encountered in the data.

In [11], 20% of the data was partly corrupt and had to be discarded. Kapoor and Picard present an approach for handling missing channels in a multimodal scenario [15]. Information from facial expression, postural shifts and game parameters was combined to classify interest in children trying to solve a puzzle on a computer. Using a Mixture of Gaussian Processes resulted in 86% recognition accuracy. Our work using physiological signals is similar in that each modality generates a separate class label and the decisions are combined by classifier fusion. While in [15] missing values were replaced by -1, we used mean value imputation. Furthermore, we compared fused classifiers using imputed features with fused classifiers employing reduced-feature models.

3. Methods

This section describes the methods used for emotion induction and classification. The emotions to be recognized were chosen according to the well-known 2-dimensional emotion model of arousal and valence often used in emotion recognition studies [16]. One emotion in each quadrant plus neutral was selected, as shown in Figure 1: amusement (high arousal, pos. valence), anger (high arousal, neg. valence), contentment (low arousal, pos. valence), sadness (low arousal, neg. valence) and neutral (medium arousal, zero valence).

3.1. Emotion elicitation and pre-study

Film clips have been chosen for emotion elicitation for several reasons [17]: Films are capable of eliciting strong emotional responses under highly standardized conditions, which enables replication studies. Furthermore, a film clip generates a rather long stimulation (1-10 minutes), unlike e.g. pictures. For investigating certain physiological characteristics, e.g. Heart Rate Variability (HRV), a long stimulation is necessary. For each of the five emotions, two potential film clips were selected. The film clips were evaluated in a pre-study and the film that was better suited for eliciting the corresponding emotion was chosen for the main experiment. The potential film clips are shown in Table 1. They are either suggested in [17] (marked with ∗ in Table 1), proposed in another paper ("John Q": [12]) or self-chosen. We aimed at selecting at least one clip per emotion from the extensively evaluated film set of [17], which includes editing information. However, the recommended film for contentment ("Alaska's Wild Denali") was not purchasable and was thus replaced by a similar film ("Alaska: Spirit of the Wild"). The suggested neutral film ("Sticks" = ScreenPeace screen saver) exhibited a low recording quality and was therefore substituted by a regularly flickering fire in a fireplace. Furthermore, the scene of "John Q" proposed in [12] was shortened from 10 to approximately 4 minutes. The scenes of the remaining films were self-chosen.¹

¹ Editing instructions are provided by the authors upon request.

Figure 1. Emotions to be recognized in arousal-valence space.

The potential film clips were divided into two sets, as shown in Table 1, such that the added duration of the films was approximately equal for both sets. Each film set was presented to a different group of subjects during the pre-study, and for each emotion the film which better elicited the corresponding emotion was chosen for the main experiment. The success of emotion elicitation was tested by the Self-Assessment Manikin (SAM) questionnaire [18], which directly assesses the arousal, valence and dominance dimensions of emotion. The dominance dimension (control in a situation) was disregarded for this study since the test subjects were not directly involved in a situation they could control. The SAM questionnaire as it was used in the study is shown in Figure 2. The first row measures valence and the second assesses arousal. The questionnaire can easily be filled in with two mouse clicks.

Figure 2. Self-assessment manikin (SAM) used as digital questionnaire. First row: valence, second row: arousal. For evaluating the film clips, a scale ranging from -2 to 2 was assigned to arousal and valence (e.g. arousal = 2 means highest arousal).

The film clips for the main experiment were chosen based on the evaluation of the SAM questionnaires from the pre-study. For every film clip, the mean values of the arousal and valence scores were calculated.

Emotion        Film set 1               Film set 2
Sadness        Lion King∗               John Q
Amusement      When Harry met Sally∗    Madagascar
Neutral        Fireplace                Fireplace
Anger          The Magdalena Sisters    Cry Freedom∗
Contentment    Winged Migration         Alaska: Spirit of the Wild

Table 1. Film sets presented to two groups of subjects during the pre-study. Films marked with ∗ originate from [17]. "John Q" was suggested in [12]. Films set in italics were selected for the main experiment.

Then, the distance between this mean vector and the target value (e.g. [−2, 2] for anger) was computed in the 2-dimensional valence-arousal space. The film clip which exhibited the smaller distance to the target of the corresponding emotion was selected for the main experiment.
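As an illustration only, the following is a minimal sketch of this selection rule. It assumes SAM ratings stored per clip as (valence, arousal) pairs on the -2 to 2 scale of Figure 2; the anger target [−2, 2] is taken from the example above, while the other target coordinates are our own reading of Figure 1 and are therefore assumptions.

```python
import numpy as np

# Target points in (valence, arousal) on the -2..2 SAM scale.
# Only the anger target is given explicitly in the text; the rest are assumed.
TARGETS = {
    "amusement":   ( 2.0,  2.0),
    "anger":       (-2.0,  2.0),
    "contentment": ( 2.0, -2.0),
    "sadness":     (-2.0, -2.0),
    "neutral":     ( 0.0,  0.0),
}

def select_clip(emotion, clip_ratings):
    """clip_ratings maps a clip name to a list of (valence, arousal) SAM scores.
    Returns the clip whose mean rating has the smallest Euclidean distance
    to the target point of the given emotion."""
    target = np.array(TARGETS[emotion])
    best_clip, best_dist = None, np.inf
    for clip, scores in clip_ratings.items():
        mean_vec = np.mean(np.asarray(scores, dtype=float), axis=0)
        dist = np.linalg.norm(mean_vec - target)
        if dist < best_dist:
            best_clip, best_dist = clip, dist
    return best_clip

# Hypothetical usage with made-up ratings:
# select_clip("sadness", {"Lion King": [(-1, -1), (-2, 0)], "John Q": [(-2, -1)]})
```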

3.2. Signals and features

The following physiological signals were recorded during the main experiment: the Electrocardiogram (ECG), the vertical component of the Electrooculogram (EOG), the Electromyogram (EMG) of the Musculus Zygomaticus (muscle between mouth and eye, contracted when smiling), the tonic and phasic part of the Electrodermal Activity (EDA), the respiration and the finger temperature. The vertical component of the EOG was used to detect the blinking rate. For detecting eye movements, the horizontal component of the EOG would be necessary too. However, since the eye movements were expected to depend on the movement patterns of the film clips and not on the emotions, only the vertical component was recorded. Prior to feature extraction, a peak detection algorithm was implemented and applied to the phasic part of the EDA and to the respiration signal. For the ECG, the R-peaks were detected and the Lomb-Scargle Periodogram of the RR intervals was calculated. The Lomb-Scargle Periodogram is a method for spectral estimation [19] and was used to derive Heart Rate Variability (HRV) parameters [20]. Corresponding C and Matlab software is freely available under the GNU public license at [21]. A total of 53 features was calculated from the six signals:

• ECG: From heart rate: minimum, maximum and mean value, mean derivative, parameters of a linear fit. HRV parameters (according to the definitions in [20]): sdnn (standard deviation of all RR intervals), rmssd (rms of differences between adjacent RR intervals), pnn50 (number of pairs of adjacent RR intervals differing by more than 50 ms divided by the total number of intervals), triangular index (total number of all RR intervals divided by the height of the histogram of all intervals measured on a discrete scale with bins of 1/sampling rate), lf (power in the low frequency range (0.04-0.15 Hz)), hf (power in the high frequency range (0.15-0.4 Hz)), lf/hf.

• EDA: Tonic part: minimum, maximum and mean value, mean derivative, parameters of a linear fit. Phasic part: mean peak rate, mean peak height, sum of the peak heights, quantile thresholds at 25%, 50%, 75%, 85% and 95% for the peak height and the peak-to-peak intervals.

• Respiration: From peak-to-peak intervals: mean, standard deviation (std), minimum and maximum respiration rate, maximum/minimum. From peak heights (higher peaks mean deeper breathing): mean, median and std.

• Temperature: Mean, std, minimum and maximum value, maximum/minimum, parameters of a linear fit.

• EMG: Energy.

• EOG: Blinking rate.

All time-dependent features were divided by the duration of the corresponding experiment phase. The features were calculated for the periods of the film clips (emotion phases) and for the preceding recovery periods. The features calculated during the recovery periods served as baseline values for calculating relative features by division or subtraction. Relative features were considered for two reasons: First, the physiology of the body could possibly change in relation to the time elapsed. This would be detected by the classifier because the order of film clips was the same for all subjects. However, we want the classifier to detect the emotion rather than a certain point in time (e.g. the third film clip). Second, relative features reduce the effect of inter-individual differences in physiology.

Detection of invalid features: The recorded data set contained a rather large number of corrupted segments due to technical problems and artifacts. The artifacts included: (1) the EDA signal reached saturation of the amplifier during the course of the experiment; (2) R-peaks in the ECG signal were erroneously detected because of a high T-wave or due to motion artifacts; (3) blinking was not visible in the EOG signal due to dry skin. The problematic data sets were identified and marked by visual inspection for (1) and (3) and automatically for (2): After R-peak detection, any RR interval was removed if it deviated by more than 20% from the previous interval accepted as being true (refer to [21] for Matlab code). If more than 5 RR intervals were removed during an experiment phase, all the corresponding ECG features were declared as invalid. The described procedure resulted in 96% correct data for EOG, 78% for EDA, 64% for ECG and 100% for the remaining modalities. The percentage of data containing valid features of all modalities amounted to 47%.
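A minimal sketch of the RR-interval plausibility rule and of the time-domain HRV features named above (sdnn, rmssd, pnn50, the latter following the paper's definition) is given below, assuming RR intervals in milliseconds and that the first interval is valid; the frequency-domain parameters (lf, hf, lf/hf) would additionally require the Lomb-Scargle periodogram of the RR series and are omitted here.

```python
import numpy as np

def clean_rr(rr_ms, tol=0.20, max_removed=5):
    """Plausibility rule from the text: drop any RR interval deviating by more
    than 20% from the previous interval accepted as true; if more than
    max_removed intervals are dropped, the ECG features of this experiment
    phase are declared invalid (signalled here by returning None).
    Assumption: the first interval is taken as valid."""
    accepted, removed = [float(rr_ms[0])], 0
    for rr in rr_ms[1:]:
        if abs(rr - accepted[-1]) <= tol * accepted[-1]:
            accepted.append(float(rr))
        else:
            removed += 1
    if removed > max_removed:
        return None
    return np.asarray(accepted)

def hrv_time_features(rr_ms):
    """Time-domain HRV parameters as defined in the feature list."""
    rr = np.asarray(rr_ms, dtype=float)
    diffs = np.diff(rr)
    return {
        "sdnn": float(np.std(rr, ddof=1)),                       # std of all RR intervals
        "rmssd": float(np.sqrt(np.mean(diffs ** 2))),            # rms of successive differences
        "pnn50": float(np.sum(np.abs(diffs) > 50.0) / len(rr)),  # pairs differing >50 ms / total intervals
    }
```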

3.3. Classification

Most of the work presented in Section 2 used a feature selection technique. However, for feature selection the data set needs to be partitioned into 3 subsets: a training and a validation set for performing the feature selection and a test set to assess the performance on unseen data [22]. Since the amount of available data is small and the amount of correct data even smaller, we decided not to employ any feature selection and therefore also expected lower recognition accuracies than presented in previous work. All the classifiers presented in the following were evaluated in a leave-one-person-out cross-validation.
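As a sketch of the evaluation protocol (not the authors' code), leave-one-person-out cross-validation can be written with scikit-learn's LeaveOneGroupOut, using the subject ID of each instance as the group label. GaussianNB serves here only as a stand-in for a discriminant classifier with diagonal covariance estimates.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.naive_bayes import GaussianNB  # stand-in for a diagonal-covariance discriminant classifier

def lopo_accuracy(X, y, subject_ids, make_clf=GaussianNB):
    """Leave-one-person-out CV: each fold holds out all instances of one
    subject and trains on the instances of the remaining subjects."""
    correct, total = 0, 0
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=subject_ids):
        clf = make_clf()
        clf.fit(X[train_idx], y[train_idx])
        correct += int(np.sum(clf.predict(X[test_idx]) == y[test_idx]))
        total += len(test_idx)
    return correct / total
```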

3.3.1 Classifier fusion

Polikar [23] gives an excellent review of ensemble classifier systems. The idea behind them is to consult "several experts". It can be compared to the natural behavior of humans to seek a second (or third) opinion before making an important decision. However, according to [23], the extensive benefits of consulting "several experts" are a rather recent discovery in computational intelligence. We are convinced that this technique can be highly advantageous for the field of emotion recognition, as it is particularly suitable for handling missing features [24] as well as small data sets [23]. Since low signal quality of a particular modality (e.g. ECG) usually renders all the corresponding features invalid, the classifier ensemble considered in this work consists of one classifier per signal modality. Numerous ways to combine the decisions of several classifiers exist. We have chosen the following two simple approaches:

1. Majority voting: The class that receives the highest number of votes is chosen. If several classes receive an equal number of votes, the corresponding classifiers are identified as candidate classifiers and the classifier with the highest confidence among them provides the deciding vote.

2. Confidence voting: The class of the classifier with the highest confidence is chosen.

For the two described classifier fusion schemes, a suitable underlying classifier is needed. It has to be able to handle unequal class distributions and to generate a probabilistic output which can be used as a confidence value. Since little data is available in our case, we opted for a simple classifier with few parameters to be estimated. Linear and Quadratic Discriminant Analysis (LDA and QDA) with diagonal covariance matrix estimation fulfill all the mentioned conditions: Discriminant Analysis generates an estimate of the posterior probability for each potential class and selects the class which exhibits the largest posterior probability. The estimated posterior probability of the selected class is taken as the confidence measure.
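A minimal sketch of the two fusion rules, assuming each modality classifier has already produced a class label and a confidence value (the posterior probability of its chosen class):

```python
import numpy as np

def confidence_vote(labels, confidences):
    """Confidence voting: take the label of the most confident modality classifier."""
    return labels[int(np.argmax(confidences))]

def majority_vote(labels, confidences):
    """Majority voting; ties between classes with an equal number of votes are
    broken by the most confident classifier among those voting for a tied class."""
    labels = np.asarray(labels)
    confidences = np.asarray(confidences)
    classes, counts = np.unique(labels, return_counts=True)
    candidates = classes[counts == counts.max()]
    if len(candidates) == 1:
        return candidates[0]
    tied = np.flatnonzero(np.isin(labels, candidates))
    return labels[tied[int(np.argmax(confidences[tied]))]]
```

For example, majority_vote(["anger", "sadness", "anger"], [0.4, 0.9, 0.5]) returns "anger", whereas confidence_vote on the same inputs returns "sadness".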

3.3.2 Handling missing values

In [24], several courses of action for handling missing data are summarized: discarding instances (this would mean discarding more than half of our data), acquiring missing values (infeasible in our case), imputation, or employing reduced-feature models. The latter two are investigated in the following. Imputation uses an estimate of the missing feature or of its distribution to generate predictions from a given classifier model. A simple possibility used in practice is to substitute the missing value in the test sample with the mean value of the training feature vector. Imputation is needed if the applied classification model employs a feature whose value is missing in the test sample. The reduced-feature models technique represents an alternative approach: instead of imputation, a new model is generated which employs only the available features. The following combinations of missing feature handling and ensemble classification are investigated in this work:

1. Single classifier with imputed features: Missing feature values in the training data are imputed by the mean value of the available training instances belonging to the same class. Missing features in the test data are imputed by the mean value of the entire training feature vector (i.e. independent of the class membership of the test sample).

2. Fusion of classifiers with imputed features: Missing feature values are imputed in the same way as for the single classifier. A classifier is trained for each signal modality and the decisions are fused according to majority or confidence voting.

3. Fusion of classifiers with reduced-feature models: Missing feature values are not imputed. In case a missing feature is encountered in a certain signal modality, no classifier is trained for that modality. The decisions of all the available classifiers are fused according to majority or confidence voting.
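The sketch below illustrates the imputation scheme described above (class-conditional means for training data, class-independent training means for test data) and a reduced-feature ensemble prediction that simply skips modalities with missing features. It assumes missing values are encoded as NaN and that per-modality classifiers are scikit-learn-style estimators with predict_proba; the helper names are ours, not the authors'.

```python
import numpy as np

def impute_train(X, y):
    """Training data: replace a missing feature value (NaN) by the mean of the
    available training instances of the same class."""
    X = X.astype(float).copy()
    for c in np.unique(y):
        rows = (y == c)
        class_means = np.nanmean(X[rows], axis=0)
        X[rows] = np.where(np.isnan(X[rows]), class_means, X[rows])
    return X

def impute_test(X_test, X_train):
    """Test data: replace a missing feature value by the mean of the entire
    training feature vector, independent of class membership."""
    return np.where(np.isnan(X_test), np.nanmean(X_train, axis=0), X_test)

def reduced_feature_predict(x, modality_classifiers, modality_columns, fuse):
    """Reduced-feature ensemble: if any feature of a modality is missing in the
    test sample x, no classifier is consulted for that modality; the remaining
    per-modality decisions are fused (e.g. with majority_vote or confidence_vote)."""
    labels, confidences = [], []
    for name, clf in modality_classifiers.items():
        feats = x[modality_columns[name]]
        if np.any(np.isnan(feats)):
            continue  # skip this modality entirely
        posterior = clf.predict_proba(feats.reshape(1, -1))[0]
        labels.append(clf.classes_[int(np.argmax(posterior))])
        confidences.append(float(np.max(posterior)))
    if not labels:
        return None  # no modality available for this sample
    return fuse(labels, confidences)
```

In the reduced-feature setting, each modality classifier would also be trained only on those training instances in which that modality is valid.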

4. Experiments

This section describes the procedures for the pre-study and for the main experiment.

4.1. Pre-study

The 14 subjects (13 male, 1 female)², who participated in the pre-study, were divided into two groups. Each group watched one set of film clips (see Table 1). The order of film clips was chosen such that negative and positive emotions always alternated, and the order of emotions was the same for both groups.

² The gender distribution for selecting the films in the pre-study is different from the gender distribution of the subjects participating in the main study. A balanced gender distribution in the pre-study might have resulted in a different film clip selection. This will be considered in future work.

The subjects were sitting alone in an airplane seat and the light in the laboratory was dimmed. In order to enhance the subjects' attention to the film clips, video glasses were used for presenting the clips (see Figure 3). Since the questionnaires were filled in with the mouse, there was no need to take off the glasses during the whole experiment. A two-minute recovery period was introduced before each film clip. The screen went black and the subjects were requested to clear their minds and calm down. In case it was necessary to know the story of the movie in order to understand the next film clip, a short introduction text was shown before the clip started.

Figure 3. Participant wearing video glasses

After each film clip, the subjects were asked to fill in the SAM questionnaire. They were instructed to answer the questions according to the emotion they experienced themselves and not according to the actors' emotions in the film scene. Based on the evaluation of the SAM questionnaires, one film clip was selected for each target emotion to be used in the main experiment. The chosen film clips are typeset in italics in Table 1.

4.2. Main experiment

Twenty participants (12 male, 8 female) were recruited for the main experiment. The mean age of all subjects was 28.6 years (std = 10.90); for the men it was 24.58 (std = 1.38) and for the women 34.63 (std = 15.83). The mean age of the women is higher than that of the men because two women were about 60 and all other subjects were about 25. The ethnic background of all the participants is Western European. Before the experiment, the participants received instructions and signed a consent form. They were paid 25 CHF for their participation of about 1.25 hours. The five film clips determined during the pre-study (see Table 1) were presented to the subjects. The order of the clips was the same as in the pre-study and constant for all subjects. After each film clip, the subjects had to fill in a SAM questionnaire. The procedure for the main experiment was essentially the same as for the pre-study, with a few slight adjustments: The starting recovery time was increased to 10 minutes to give the body enough time to calm down.

Potential agitation due to previous activities or due to the attachment of the sensors should not influence the data recorded during the first film clip. The recovery time between two film clips was increased from 2 to 3 minutes. The subjects were again asked to clear their minds and not think about anything emotionally activating during these recovery times while the screen of the video glasses was dark. The subjects were allowed to close their eyes. To indicate that the next film clip was about to start, three short, gentle beeps were played when the recovery period was over. To keep the procedure the same for each film clip, an introduction text was included for each of the clips. This has the additional benefit that a potential startle effect due to the beeps would have subsided before the film clip starts. The subjects reported that they were not startled by the beeps, or only briefly. Respiration and EDA were recorded with custom electronics integrated into the airplane seat. The respiration was measured by a stretch sensor attached to the seat belt (see Figure 4). The electronic circuits for the EDA directly filter and split the signal into a tonic and a phasic part. The ECG, the vertical EOG, the EMG of the Musculus Zygomaticus (muscle between mouth and eye, contracted when smiling) and the finger temperature were measured by the commercial Mobi device of TMS International [25] (see Figure 5). Data from both hardware devices was sent to the computer via Bluetooth and recorded and synchronized by the Daphnet toolbox [26] at 128 Hz.

Figure 4. EDA measurement at the left hand, respiration sensor in the seat belt. The additional electrode at the right hand can be applied to measure the ECG. It could not be used due to technical problems encountered shortly before the experiment began.

Figure 5. Mobi device from TMS International [25]

Classifier fusion method   Classification method   Accuracy (5 classes)
Majority voting            LDA                     44.7%
Confidence                 LDA                     53.2%
none                       LDA                     53.2%
Majority voting            QDA                     51.1%
Confidence                 QDA                     51.1%
none                       QDA                     51.1%

Table 2. Comparison of classifier fusion methods with single classifiers for LDA and QDA (without imputation). Single classifiers consider all the features simultaneously. Except for majority voting with LDA, the performance of the classifier fusion methods equals the performance of the single classifiers.

5. Results and discussion

In a first step, the two classifier fusion approaches were compared with a single classifier which makes use of all features simultaneously and does not employ imputation. We therefore chose the largest possible data set containing only valid features. The clean data set comprises 13 records for sadness, 10 for amusement and 8 for each of the remaining emotions neutral, anger and contentment. The results are presented in Table 2. Except for majority voting with LDA, the classifier fusion methods perform as well as the single classifiers. This is a surprising result, since in classifier fusion the final class decision only depends on the classifiers of the "strongest" modalities.³ A problem of our majority voting scheme was identified during this analysis: Since the majority voting is initially performed without considering the confidence values, it can happen that several "weak" classifiers agree on a wrong decision and thereby dominate another classifier which might exhibit a high confidence value. A possibility to circumvent this problem would be to use the confidence values as weights already during the majority voting process. In a second step, we compared the accuracy of the fused classifiers with the accuracy of single-modality classifiers. Table 3 shows the results for discriminating 5 classes by single-modality classifiers using LDA and QDA. To allow a fair comparison, we again used the clean data set containing only valid features. The accuracies are presented in ascending order. When ranking the modalities according to the accuracies, the same sequence results for both LDA and QDA: Electrodermal Activity clearly represents the best single modality, followed by finger temperature. For most modalities, QDA performed worse than LDA. This could be due to the low amount of available experiment data (47 instances in total), since more parameters need to be estimated for QDA than for LDA. Furthermore, some modalities operated below chance, in particular with QDA, which might again be due to the low amount of available data.

³ In the cases where the classifier fusion methods yield the same accuracy as the corresponding single classifiers, one could suspect that the two methods are inherently equivalent. This objection can be refuted by the fact that the confusion matrices are different for fused and single classifiers, respectively.

        EOG     Resp    HR&HRV   EMG     Temp    EDA
LDA     23.4%   29.8%   31.9%    36%     38.3%   42.6%
QDA     12.8%   23.4%   25.5%    27.7%   36.2%   49.0%

Table 3. Accuracy of single-modality classifiers for LDA and QDA. The results are presented in ascending order. The underlying data set is the same as for the results presented in Table 2 (only valid features considered).

Fusion method   Classifier   Imputation   Accuracy (5 classes)   Accuracy (4 classes)
none            LDA          Yes          45.0%                  52.5%
Maj. voting     LDA          Yes          41.0%                  55.0%
Confidence      LDA          Yes          50.0%                  58.8%
Maj. voting     LDA          No           41.0%                  51.3%
Confidence      LDA          No           47.0%                  58.8%
none            QDA          Yes          35%                    40.0%
Maj. voting     QDA          Yes          49.0%                  53.8%
Confidence      QDA          Yes          45.0%                  47.5%
Maj. voting     QDA          No           49.0%                  56.3%
Confidence      QDA          No           48.0%                  56.3%

Table 4. Comparison of classifier fusion methods and single classifiers with and without imputation for LDA and QDA. (Single classifiers consider all the features simultaneously.) The results are presented for 5 and for 4 classes (neutral excluded). Classifier fusion yields considerable benefits.

However, a comparison of the results of the single modalities with the ones of the fused classifiers (compare Table 2 with Table 3) shows that no single-modality classifier is able to outperform the fused classifiers. We will now analyze the results when considering all the available data and see how imputation techniques and classifier fusion methods influence classification accuracy in the presence of missing feature values. All six signal modalities were used for this analysis. The results for all 5 classes and for 4 classes (all except neutral) are presented in Table 4. Clearly, classifier fusion always yields a considerable benefit in comparison to single classifiers using all the features (with imputation). In the most extreme case (4 classes, QDA) the increase in accuracy amounts to 16.3%. When comparing the ensemble classifiers using imputation with the corresponding ones using reduced-feature models, the reduced-feature model ensembles perform 3 times better, 3 times equally and twice worse. Based on these results, it is difficult to decide whether to use imputation or not. The two strategies seem to be competitive. However, using no imputation is computationally less expensive and might be preferred in practical applications. Another interesting observation is that confidence voting always performs better than majority voting for LDA, while the tendency is reversed for QDA. The best classifier fusion technique thus seems to depend on the underlying classifiers. The results for arousal and valence classification are shown in Table 5. The neutral emotion was omitted for this analysis.

Fusion method   Classifier   Imputation   Accuracy (arousal)   Accuracy (valence)
none            LDA          Yes          62.5%                71.3%
Maj. voting     LDA          Yes          62.5%                68.8%
Confidence      LDA          Yes          63.8%                72.5%
Maj. voting     LDA          No           67.5%                70.0%
Confidence      LDA          No           61.3%                73.8%
none            QDA          Yes          57.5%                60.0%
Maj. voting     QDA          Yes          58.8%                63.8%
Confidence      QDA          Yes          60.0%                63.8%
Maj. voting     QDA          No           58.8%                68.8%
Confidence      QDA          No           60.0%                68.8%

Table 5. Comparison of classifier fusion methods and single classifiers with and without imputation for LDA and QDA. Arousal and valence classification (2 classes). Single classifiers consider all the features simultaneously.

The results again indicate that classifier fusion outperforms single classifiers which use all the features (with imputation). When comparing the ensemble classifiers using imputation with the corresponding ones employing reduced-feature models, the reduced-feature model ensembles perform once better, once worse and twice equally for arousal. Considering valence, reduced-feature models yield better results in all four cases. When comparing the two classifier fusion methods, confidence voting performs as well as or better than majority voting, except for one case, i.e. arousal classification with LDA and reduced-feature models. Unlike most other work, e.g. [13], a higher accuracy was reached for valence than for arousal classification. This is probably due to the EMG measurement of the Musculus Zygomaticus, which is a reliable smiling detector. In the presented analyses all six signal modalities were considered. It is worth mentioning that for the 2- and 4-class problems, some classifier ensembles composed of fewer modalities performed better than the ones using all the modalities. However, these numbers and the particular combination of modalities strongly depend on the kind of underlying data set and are therefore not further evaluated.

5.1. Conclusion

In practical applications, data loss due to artifacts occurs frequently. Many of these artifacts can be detected automatically by plausibility analyses (e.g. unrealistic RR intervals). Often the artifacts do not occur in all physiological signals simultaneously, and discarding all the data instances containing invalid features results in a substantial amount of unusable data. In our case, more than half of the data would have been lost. We therefore tested two methods for handling missing features in combination with two classifier fusion approaches. Classifier fusion has been shown to significantly increase the recognition accuracies. A maximum increase in accuracy of 16.3% was observed when comparing an ensemble classifier to a single classifier.

Whether majority or confidence voting performs better depends on the classification task (e.g. 5 classes or arousal) and on the underlying classifier. Reduced-feature model ensemble classifiers are competitive with or even slightly better than ensemble classifiers using imputation. In practical applications where computational complexity is critical, imputation can therefore be omitted. Due to the low amount of available clean data, no feature selection could be employed. The maximum accuracies achieved (50.0% for 5 classes, 58.8% for 4 classes, 67.5% for arousal and 73.8% for valence) are therefore lower than those reported in comparable studies. In the future, other classifier fusion approaches (e.g. weighting the decisions by the confidence), other classifiers (kNN, SVMs) or the combination of heterogeneous classifiers could be investigated. Furthermore, the proposed classifier fusion methods could be applied in different research fields, e.g. for recognizing activities using a large number of wearable, unobtrusive but rather imprecise sensors.

Acknowledgment

This project is funded by the EU research project SEAT (www.seatproject.org), contract number 030958. All views expressed here reflect the authors' opinion and not that of the Commission.

References

[1] J. Healey, F. Dabek, R. W. Picard. A new affect-perceiving interface and its application to personalized music selection. Proc. of the 1998 Workshop on Perceptual User Interfaces, 1998.
[2] A. Kapoor, W. Burleson, R. W. Picard. Automatic prediction of frustration. Int. J. Hum.-Comput. Stud., 65(8):724–736, 2007.
[3] P. Ekman, W. V. Friesen. Unmasking the Face: A Guide to Recognizing Emotions from Facial Clues. Englewood Cliffs, New Jersey: Prentice-Hall, 1975.
[4] L. C. De Silva, T. Miyasato, R. Nakatsu. Facial emotion recognition using multi-modal information. In ICICS '97: Int. Conf. on Information, Communications and Signal Processing, pages 397–401, 1997.
[5] C. Busso, Z. Deng, S. Yildirim, M. Bulut, C. Min Lee, A. Kazemzadeh, S. Lee, U. Neumann, S. Narayanan. Analysis of emotion recognition using facial expressions, speech and multimodal information. In ICMI '04: Proc. of the 6th Int. Conf. on Multimodal Interfaces, pages 205–211, 2004.
[6] C. Yu, P. M. Aoki, A. Woodruff. Detecting user engagement in everyday conversations. In ICSLP '04: 8th Int. Conf. on Spoken Language Processing, volume 2, pages 1329–1332, 2004.
[7] D. Neiberg, K. Elenius, K. Laskowski. Emotion recognition in spontaneous speech using GMMs. In ICSLP '06: 9th Int. Conf. on Spoken Language Processing, pages 809–812, 2006.
[8] A. Madan, A. S. Pentland. Vibefones: Socially aware mobile phones. In Tenth IEEE Int. Symposium on Wearable Computers, pages 109–112, 2006.
[9] P. Ekman, R. W. Levenson, W. V. Friesen. Autonomic nervous system activity distinguishes among emotions. Science, 221:1208–1210, 1983.
[10] R. W. Picard, E. Vyzas, J. Healey. Toward machine emotional intelligence: Analysis of affective physiological state. IEEE Trans. Pattern Anal. Mach. Intell., 23(10):1175–1191, 2001.
[11] C. L. Lisetti, F. Nasoz. Using noninvasive wearable computers to recognize human emotions from physiological signals. EURASIP J. Appl. Signal Process., 2004(1):1672–1687, 2004.
[12] S. D. Kreibig, F. H. Wilhelm, W. T. Roth, J. J. Gross. Cardiovascular, electrodermal, and respiratory response patterns to fear- and sadness-inducing films. Psychophysiology, 44:787–806, 2007.
[13] J. Wagner, J. Kim, E. André. From physiological signals to emotions: Implementing and comparing selected methods for feature extraction and classification. In ICME 2005, Amsterdam, The Netherlands, 2005.
[14] J. Kim and E. André. Emotion recognition based on physiological changes in music listening. IEEE Trans. Pattern Anal. Mach. Intell., 30(12):2067–2083, 2008.
[15] A. Kapoor and R. W. Picard. Multimodal affect recognition in learning environments. In MULTIMEDIA '05: Proc. of the 13th Annual ACM Int. Conf. on Multimedia, pages 677–682, New York, USA, 2005. ACM.
[16] R. Cowie, E. Douglas-Cowie, N. Tsapatsoulis, G. Votsis, S. Kollias, W. Fellenz, and J. Taylor. Emotion recognition in human-computer interaction. IEEE Signal Processing Magazine, 18(1):32–80, January 2001.
[17] J. Rottenberg, R. R. Ray, J. J. Gross. Emotion elicitation using films. In J. A. Coan and J. J. B. Allen (Eds.), The Handbook of Emotion Elicitation and Assessment, pages 9–28. New York: Oxford University Press, 2007.
[18] M. M. Bradley, P. J. Lang. Measuring emotion: The self-assessment manikin and the semantic differential. Journal of Behavior Therapy and Experimental Psychiatry, 25:49–59, 1994.
[19] G. D. Clifford. Advanced Methods & Tools for ECG Data Analysis, chapter ECG Statistics, Noise, Artifacts, and Missing Data, pages 55–99. Artech House, 2006.
[20] Task Force of the European Society of Cardiology and the North American Society of Pacing and Electrophysiology. Heart rate variability: Standards of measurement, physiological interpretation, and clinical use. European Heart Journal, 17:354–381, 1996.
[21] G. D. Clifford. Open source code: http://www.mit.edu/~gari/CODE/ECGtools/.
[22] G. Dreyfus and E. Guyon. Assessment methods. In Feature Extraction: Foundations and Applications. Springer, 2006.
[23] R. Polikar. Ensemble based systems in decision making. IEEE Circuits and Systems Magazine, 6(3):21–45, 2006.
[24] M. Saar-Tsechansky and F. Provost. Handling missing values when applying classification models. J. Mach. Learn. Res., 8:1623–1657, 2007.
[25] TMS International: http://www.tmsi.com.
[26] M. Bächlin, D. Roggen, and G. Tröster. Context-aware platform for long-term life style management and medical signal analysis. In Proc. of the 2nd SENSATION Int. Conf., 2007.