iSocioBot: A Multimodal Interactive Social Robot


Zheng-Hua Tan¹ · Nicolai Bæk Thomsen¹ · Xiaodong Duan¹ · Evgenios Vlachos¹ · Sven Ewan Shepstone² · Morten Højfeldt Rasmussen¹ · Jesper Lisby Højvang³

International Journal of Social Robotics, ISSN 1875-4791, DOI 10.1007/s12369-017-0426-7


Accepted: 20 August 2017 © Springer Science+Business Media B.V. 2017

Abstract We present one way of constructing a social robot such that it is able to interact with humans using multiple modalities. The robotic system is able to direct attention towards the dominant speaker using sound source localization and face detection; it is capable of identifying persons using face recognition and speaker identification; and it is able to communicate and engage in a dialog with humans using speech recognition, speech synthesis and different facial expressions. The software is built upon the open-source Robot Operating System framework and is made publicly available. Furthermore, the electrical parts (sensors, laptop, base platform, etc.) are standard components, thus allowing for replication of the system. The design of the robot is unique, and we justify why this design is suitable for our robot and the intended use. By making software, hardware and design accessible to everyone, we make research in social robotics available to a broader audience. To evaluate the properties and the appearance of the robot, we invited users to interact with it in pairs (active interaction partner/observer) and collected their responses via an extended version of the Godspeed Questionnaire. Results suggest an overall positive impression of the robot and interaction experience, as well as significant differences in responses based on type of interaction and gender.

Keywords Social robot · Human robot interaction · Speech processing · Image processing

This work is supported by the Danish Council for Independent Research | Technology and Production Sciences under Grant No. 1335-00162 (iSocioBot).

Corresponding author: Zheng-Hua Tan, [email protected]

1 Department of Electronic Systems, Aalborg University, Fredrik Bajers Vej 7B, 9220 Aalborg, Denmark
2 Bang and Olufsen A/S, Peter Bangs Vej 15, 7600 Struer, Denmark
3 MV-Nordic, Lucernemarken 17, 5260 Odense, Denmark


1 Introduction

Social robots are physical or digital entities capable of socially interacting and communicating with humans, and with other robots, in diverse and dynamic environments [6]. In this work, we focus on robots with a physical embodiment only. Social robots have many applications ranging across social care, companionship, wellness, entertainment and education [10,16,25,41,42]. For example, social robots can be used to improve the self-management of elderly people and reduce the need for hospitalisation, with the extra advantage of being deployable in areas where pets are inappropriate [7,28]. For children with autism spectrum disorders, robots are useful for enhancing their communication skills, including emotions and physical contact [18]. The ultimate goal is to increase the quality of life while decreasing the expense.

Social interaction with humans is one of the primary functions of a social robot, not necessarily tied to achieving well-defined tasks as industrial or even service robots do. A typical interaction scenario between a human and a robot is illustrated in Fig. 1. The woman would like to engage the robot's attention; however, the robot is initially only able to hear her and not see her. In order to see her, it turns towards the direction of the sound and afterwards moves towards the woman to hear her better. Finally, the robot is close enough to the woman that it can both hear and see her clearly, and is now ready to engage in a conversation.


Fig. 1 Illustration of an interaction scenario

A robot capable of interacting with humans in the aforementioned way needs to have certain basic functionalities. The robot must first be able to find the direction or location of the woman using both vision and sound. After having approached the woman, the robot must identify her, which is achieved using speaker identification and face recognition. To take part in a conversation it needs to know what she is saying, which is a task for automatic speech recognition (ASR). Furthermore, it must be able to reply in terms of speech, which is achieved through speech synthesis.

In order to behave in a human-like way under various social setups, robots should be able to interact with users in a friendly and competent manner, have a pleasant appearance, move and rotate at an appropriate speed, e.g., one close to that of humans, have a proper height to sense users while they stand or sit, and make users feel calm and safe around them. They should also be able to express themselves both verbally and non-verbally, including the embodiment of facial emotional expressions with respect to the fundamental rules of human affect expression. Since humans have arranged their space according to their morphology and abilities, the characteristics of a social robot should fit within these predefined "borders" for efficient performance without modifying the environment. The environments for social robots include homes, where a robot and a person might be out of sight from each other or even in different rooms, and places (e.g., care facilities) where the acoustic and lighting environments are adverse and multiple persons may be present at the same time. Furthermore, cross-platform compatibility is a desirable feature, just as we are witnessing today in the worlds of computers and mobile phones. Therefore, the software system should be easily deployable on different hardware platforms.

There are existing social robots, both commercial and proprietary.

1 http://www.ald.softbankrobotics.com/en/cool-robots/nao.


Nao¹ is one of the most widely used commercial social robots (e.g., [2,31]). It has a number of different kinds of sensors and many degrees of freedom. On the other hand, it is small, 58 cm tall, and moves slowly. Another alternative is Pepper,² which is taller than Nao and has a wheel-base instead of legs. It is 1.2 m tall, which we believe is still too short for interacting with people who are standing. Furthermore, it has limited capabilities regarding facial expression, which is important for social interaction. The PR2³ robot is an advanced human-size robot, which is rather expensive and thus in many ways not economically feasible or necessary. Furthermore, with a bulky body and strong arms, PR2 is more of a service robot. Then there are proprietary robots developed by companies or research institutes and not for sale, which, among others, include Robovie [26], Maggie [43], and Rubi [17]. There is also the iCub [33], which is a fully open-source child-like robot; however, it is designed specifically for interacting with children and is inaccessible and unnecessarily complex for many cases. Okuno et al. [36,37] presented a robot system for parties with two roles, receptionist and companion robot. It consists of active audition, face identification, auditory and visual stream formation and association, focus-of-attention, and dialog control modules. In [46], Stiefelhagen et al. proposed a Human Robot Interaction (HRI) system in a kitchen scenario based on speech recognition and synthesis, head pose and gesture recognition, running on two laptops. In [45], they extended the system with person localization, person tracking and face identification to make the interaction more natural. A camera- and laser-scanner-based HRI and cooperation system is proposed in [39]. This robot can provide several services based on the user's gestures, such as guiding a person, laser-scanner-based leg-following and load transportation.

Due to the lack of existing robots able to fulfill our requirements, we decided to build our own social robot, called iSocioBot, using off-the-shelf components and to base it on the open-source Robot Operating System (ROS). We furthermore make all additional software freely available, which allows for replication of the system. In this paper, we present the design of iSocioBot in terms of appearance, hardware and software, together with our novel algorithms. Furthermore, we present evaluation results for key software modules and discuss experience gained from an in-house study.

The contribution of this work is multifold. First, we describe a complete system, with detailed implementations, for an interactive social robot which is capable of interacting with humans through person tracking, person identification, automatic speech recognition and speech synthesis. The electronic parts of the system, including robot platforms, sensors, speakers, and computers, are off-the-shelf components.

2 http://www.ald.softbankrobotics.com/en/cool-robots/pepper.

3 http://www.willowgarage.com/pages/pr2/overview.


The software is written in the ROS framework and made publicly available,⁴ allowing for a complete replication of the robot system. The strength of the design is that the presented hardware is made entirely of off-the-shelf components, the developed software is open-source and works in the relevant scenarios, and the system has been tested both in public demonstrations, i.e., in the wild, and in an in-house evaluation. Integrating various techniques from computer vision and speech processing into a coherent working system requires substantial effort, and the platform opens the doors of social robotics research to a broader audience. Secondly, we describe algorithms for robust multiple-person tracking in noisy environments and robust face recognition dealing with the well-known pose problem, and integrate these into the complete system. The software design further focuses on sensor fusion, for example combining sound source localisation and face detection results to direct attention towards a human speaking to the robot and engaged in a dialog. The software modules are evaluated through an in-house study. Parts of this work, either individually or in the way they are integrated, have the potential to be applied to areas beyond social robots. Thirdly, we present innovative ideas for hardware design and implementation. For example, the face construction is unique: it uses a colour LED array with an acoustic cloth cover, so that it is visible from a distance and under low light, and it is programmable to show different facial expressions.

The outline of the paper is as follows. In Sect. 2 we describe the robot system used in this work and how all functionalities are integrated into one complete system. In Sects. 3–7 the individual functionalities are described along with specific implementation details, and Sect. 8 describes the interaction management. In Sect. 9 we present the procedure for testing the system and report the results, analysis and discussion of the findings. Finally, the work is concluded in Sect. 10.

2 System Implementation

This section describes the implementation of the robotic system in terms of hardware and software. We have two generations of iSocioBot, as shown in Fig. 2. The differences between the two generations mainly concern the body design, the addition of an iPad, and the wheel-base. More details about the first generation iSocioBot can be found in [50].

2.1 Design of Appearance

The preliminary design of the second generation iSocioBot is shown in Fig. 3.

4 The source code is available at http://kom.aau.dk/~zt/online/iSocioBot/ except for the speech synthesis used for evaluation, which is proprietary. However, a publicly available one is supplied with the source code.

Fig. 2 Picture of the first (left) and second (right) generation iSocioBot. The height of both robots is approximately 150 cm

We have decided on a similar appearance, with certain differences, compared to the first generation, to show that they come from the same family. Therefore, both robots have the same color and height, while the second generation is equipped with a square face and a set of round ears made of an LED array and LED strips. Based on visual cues only, people can easily tell that they come from the same "family", but can still discriminate between them. In order to have a more stable body, we made the body of the second generation from styrofoam. We cut the styrofoam according to the design specifications shown in Fig. 3 and glued the pieces together to assemble the body. The head of the robot is rigidly connected to the torso and has no degree of freedom. Therefore, the direction of the robot is aligned with the orientation of its head. Head orientation is perhaps the most reliable cue implying attention direction, and a deictic signal indicating the current focus of interest [54]. According to [47], the accuracy of focus-of-attention estimation based on head orientation data alone is 88.7%. Lastly, as can be seen in Fig. 3, beneath the head there is a place for an iPad or other tablet device for additional input and output. The actual implementation of our iSocioBot design can be seen on the right side of Fig. 2.

2.2 Hardware

The limited moving capability of the first generation is addressed in the second generation by building the robot upon a Pioneer P3-DX⁵ from Adept MobileRobots.

5 http://www.mobilerobots.com/ResearchRobots/PioneerP3DX.aspx.


Fig. 3 The appearance design of the second generation iSocioBot

Due to design considerations, the robot is equipped with both a Microsoft Kinect, of which only the microphone array is used, and a high-definition camera (Logitech HD Pro Webcam C920), which is used for face detection and face recognition. An iPad mounted on the body serves as additional input and output. For speech recognition we use a wireless handheld microphone to mitigate the problems of reverberation and noise. A laser scanner for localization and navigation, a Hokuyo URG-04LX-UG01, is mounted on the base of the robot. The robot is controlled by a built-in computer in the wheel-base, and we use a Dell Latitude E6540 laptop as a server to run all the software. The laptop has an Intel Core i7-4800MQ processor, 16 GB of RAM and a 256 GB SSD.

Fig. 4 The structure of the software: interaction management, person tracking (sound source localization and face detection), person identification (speaker identification and face recognition), speech recognition, speech synthesis and visual feedback

2.3 Software

As a basis for the robot system we have ROS [19] running on top of Ubuntu 14.04 on the laptop. This allows the modules (e.g., speech recognition, person tracking, etc.) to communicate and share results and data. As shown in Fig. 4, the software consists of six modules: interaction management, person tracking, person identification, speech recognition, speech synthesis and facial expression. The philosophy behind this structure is to equip the robot with basic capabilities, which can then be used to achieve complex interaction scenarios. This means that all modules are running all the time, but the interaction management node is determined by the interaction scenario. This also makes it very easy to write an interaction management module for a new application, since all the basic functionalities are already implemented.
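As an illustration of this modular structure, the following is a minimal sketch of an interaction management node written with rospy. The topic names (/ssl/direction, /asr/transcript, /tts/say) and message types are assumptions made for the example; they are not the actual iSocioBot interfaces, which can be found in the published source code.

```python
# Minimal sketch of a ROS node in the spirit of the modular structure in Fig. 4.
# Topic names and message types are illustrative assumptions only.
import rospy
from std_msgs.msg import Float32, String

class InteractionManager(object):
    def __init__(self):
        rospy.init_node('interaction_manager')
        # Results from the other modules arrive on their own topics.
        rospy.Subscriber('/ssl/direction', Float32, self.on_direction)
        rospy.Subscriber('/asr/transcript', String, self.on_transcript)
        # Commands to the speech synthesis module are published here.
        self.tts_pub = rospy.Publisher('/tts/say', String, queue_size=1)

    def on_direction(self, msg):
        rospy.loginfo('Dominant sound source at %.1f degrees', msg.data)

    def on_transcript(self, msg):
        # A new application only needs to change how transcripts are handled.
        self.tts_pub.publish(String(data='You said: ' + msg.data))

if __name__ == '__main__':
    InteractionManager()
    rospy.spin()
```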

3 Person Tracking

An important aspect of HRI is how the robot directs its attention in a reliable way. In this robot system, the goal is to direct attention towards the dominant speaker in a dialogue scenario with multiple participants, while taking into account acoustically interfering sources, e.g., a radio, a phone ringing, a door slamming, etc. The implemented algorithm for directing attention assumes that the persons are stationary and that the robot can only rotate to direct its attention towards someone or something. Figure 5 shows a block diagram of the algorithm.


Fig. 5 Block diagram of person tracking algorithm


The first block is a voice activity detector [49] used to detect speech segments with a duration above 1.5 s, ensuring that no spurious sounds will claim the attention. Whenever a segment is detected, both pitch estimation [4] and sound source localization (SSL) [12] are performed on the detected segment. The result is used to decide whether to turn towards the acoustic source or not, based on the following hypothesis test:

p(H_0 | X, W) / p(H_1 | X, W) = [p(X | H_0) p(H_0 | W)] / [p(X | H_1) p(H_1 | W)]    (1)

where H = H_0 is the hypothesis that the sound was generated by a human speaker and H = H_1 is the hypothesis that the sound was generated by a noise/interfering source. X is the average pitch confidence over the segment and W denotes the estimate of the direction of the sound source given by SSL. We model p(X | H_0) and p(X | H_1) as Beta distributions. To simplify matters, we discretize the range of directions from SSL into regions of 5°, such that p(H | W) is a probability mass function. If Eq. (1) evaluates to a value above 1, the robot turns towards the region estimated by SSL and then uses face detection to confirm whether there is a person or not. Here we use the method of [53] from OpenCV to detect faces. If a face is detected within the region, p(H_0 | W) is increased according to an update rule, and p(H_1 | W) = 1 − p(H_0 | W) is thus decreased. In this way, constant noise sources such as a TV or radio are, after a while (depending on the update rule), given lower weight and thereby "ignored" by the robot. More details can be found in [51,52].
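A minimal sketch of this decision rule is given below. The Beta parameters, the additive update rule and the step size are illustrative assumptions; the actual models and update rule used on the robot are described in [51,52].

```python
# Sketch of the attention decision rule in Eq. (1), with illustrative Beta
# parameters and a simple additive update of p(H0|W) per 5-degree region.
import numpy as np
from scipy.stats import beta

N_REGIONS = 72                             # 360 degrees discretized into 5-degree regions
p_h0_given_w = np.full(N_REGIONS, 0.5)     # prior that a region contains a human speaker

def region(direction_deg):
    return int(direction_deg % 360) // 5

def should_turn(pitch_confidence, direction_deg):
    """Evaluate Eq. (1) for a detected speech segment."""
    r = region(direction_deg)
    likelihood_ratio = (beta.pdf(pitch_confidence, 5, 2) /    # p(X|H0): voiced speech
                        beta.pdf(pitch_confidence, 2, 5))     # p(X|H1): noise source
    prior_ratio = p_h0_given_w[r] / (1.0 - p_h0_given_w[r])
    return likelihood_ratio * prior_ratio > 1.0

def update_after_turn(direction_deg, face_detected, step=0.1):
    """Reinforce or attenuate the belief that a region contains a person."""
    r = region(direction_deg)
    if face_detected:
        p_h0_given_w[r] = min(p_h0_given_w[r] + step, 0.99)
    else:
        p_h0_given_w[r] = max(p_h0_given_w[r] - step, 0.01)
```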

4 Person Identification

To identify the person who is interacting with the robot, two methods are used: face recognition and speaker identification. The person is then identified through a fusion of the outputs of these two parts. The person identification module requires that the system has previously been trained on the persons who will interact with the robot.

4.1 Face Recognition

Local features, which encode the image structure in spatial neighborhoods, are widely used for face recognition. Among these, Local Phase Quantization (LPQ) [35] is applied as the local feature in this paper, since it has shown good performance in face recognition [5,40]. However, the face pose of a user varies considerably when interacting with the robot from different positions. Pose variations change the local features extracted from face images of the same person; in other words, the local features capture the variation in pose in addition to the variation between persons. To overcome this, we use a learning method to extract the subject-related part of the local feature and remove the pose-related part.

Here, we assume the local feature l is composed of two parts, the subject-related part d and the pose-related part s:

l = d + s    (2)

We use d instead of l for face recognition to overcome the pose variation problem. More details about how to obtain d, along with quantitative results, can be found in [13].

4.2 Speaker Identification

For speaker recognition we use the i-vector framework, which is currently the state of the art in this field [11]. For a given speech utterance, the supervector extracted from the Universal Background Model (UBM) can be modeled as

M = µ + Th    (3)

where µ is the speaker-independent UBM supervector, T is a rectangular matrix called the total variability (TV) matrix, and h is a hidden variable with a standard normal distribution. The TV matrix contains the largest eigenvectors of the total variability space, where no distinction is made between the separate channel and speaker spaces. The process for learning T is similar to that for eigenvoice training [27], where incremental updates are made using the Expectation-Maximization (EM) algorithm, but where utterances belonging to the same speaker are treated as if belonging to separate speakers [11]. Once T has been determined, an i-vector can be extracted for each utterance (whether it represents the target or the test speaker). This is achieved by deriving the posterior distribution of h, conditioned on the statistics of the given utterance. The i-vector is then simply taken to be the mean of this distribution (and is thus a MAP estimate). Since the rank of T is low (usually between 10 and 300), this allows further modeling and comparison of i-vectors to take place in a low-dimensional space. To determine the similarity between a target i-vector and a test i-vector, we use the fast cosine distance scoring technique, which does not require a separate enrollment stage:

Score = (h_test · h_target) / (‖h_test‖ ‖h_target‖)    (4)
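For illustration, the cosine scoring of Eq. (4) amounts to a few lines of numpy; the i-vectors below are random placeholders rather than vectors extracted with the ALIZE toolkit.

```python
# Minimal numpy sketch of the cosine scoring in Eq. (4).
import numpy as np

def cosine_score(h_test, h_target):
    """Cosine similarity between a test and a target i-vector."""
    return float(np.dot(h_test, h_target) /
                 (np.linalg.norm(h_test) * np.linalg.norm(h_target)))

# Example with random 200-dimensional i-vectors.
rng = np.random.default_rng(0)
target, test = rng.standard_normal(200), rng.standard_normal(200)
print(cosine_score(test, target))
```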

Background training, extraction of i-vectors and scoring were all carried out using the ALIZE 3.0 framework [30]. The background models were trained using the TIMIT corpus of read speech, which contains 630 speakers of eight different dialects of American English [20]. The entire corpus was re-recorded in a studio setting using three Kinect microphone arrays, two directly in front of the sound source but at different heights, and the third at 45 degrees to the source. Although


each Kinect microphone array contains four microphones, we used only the first microphone's data for modeling. The speech data from all three locations was pooled together to form a single training set.

When the robot is operating, speaker identification is performed whenever a speech segment is detected by the voice activity detection module. An i-vector is then extracted from the speech segment, and the score for each speaker model is made available to the fusion module.

4.3 Fusion of Face Recognition and Speaker Identification

To fuse the outputs from face recognition and speaker identification, we first transform the result from each module into a probability distribution, and then compute a fused probability for each person. Suppose there are M subjects in the models. For the face recognition part, the probability f_i of subject i is obtained by

f_i = n_i / K,   i = 1, …, M    (5)

where n_i is the number of training examples for subject i among the K nearest neighbors. We transform the scores from speaker identification for all subjects in the model into a probability distribution through

s_i = exp(α · Score_i²) / Σ_{j=1}^{M} exp(α · Score_j²)    (6)

through which a subject with a higher score is assigned a higher probability. The parameter α ∈ R+ is used to increase the probability of a subject with a high score from speaker identification. Finally, we obtain the overall probability for each subject through a linear combination of the probabilities from face recognition and speaker identification,

o_i = β f_i + (1 − β) s_i,   i = 1, …, M    (7)

where β ∈ [0, 1] is used to weight face recognition against speaker identification.
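The fusion of Eqs. (5)–(7) is sketched below; the kNN counts, speaker scores and the values of α and β are illustrative, not the values used on the robot.

```python
# Sketch of the fusion in Eqs. (5)-(7) with made-up inputs.
import numpy as np

def fuse(knn_counts, K, speaker_scores, alpha=5.0, beta=0.5):
    """Combine face recognition (kNN counts) and speaker identification scores."""
    f = np.asarray(knn_counts, dtype=float) / K             # Eq. (5)
    e = np.exp(alpha * np.asarray(speaker_scores) ** 2)
    s = e / e.sum()                                          # Eq. (6)
    return beta * f + (1.0 - beta) * s                       # Eq. (7)

# Three enrolled subjects, K = 10 nearest neighbours.
print(fuse(knn_counts=[6, 3, 1], K=10, speaker_scores=[0.7, 0.2, 0.1]))
```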

5 Speech Recognition

The most important modality for HRI is speech. Communication between a robot and a human is ideally done via speech, hence it is important for the robot to understand what the human is saying. This requires a so-called speech recognizer, which transforms a speech signal into a text string. The type of interaction determines how accurate, and thereby how complex, the speech recognizer must be in order for the interaction to be meaningful; e.g., if the robot is only required to understand a few commands, the speech recognizer need not be very complex. In a social robot context, however, the robot should engage in dialogues with different people, which means that many different topics can be encountered, placing heavy demands on the speech recognizer. To meet this requirement, the robot uses the open-source interface to Google's state-of-the-art speech recognizer, which is fast, accurate and available for many different languages. The only limitations are the constant need for an internet connection and the lack of direct access to the speech recognizer itself. For details on the Python API for Google speech recognition, see [38].
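A minimal usage sketch with that package is shown below; microphone access additionally requires PyAudio, and the language code is an assumption for the example rather than the setting used on the robot.

```python
# Illustrative use of the SpeechRecognition package [38] with Google's recognizer.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:            # requires PyAudio for microphone access
    recognizer.adjust_for_ambient_noise(source)
    audio = recognizer.listen(source)

try:
    # Sends the audio to Google's web speech API (internet connection required).
    print(recognizer.recognize_google(audio, language="en-US"))
except sr.UnknownValueError:
    print("Speech was not understood")
except sr.RequestError as err:
    print("Could not reach the recognition service:", err)
```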

6 Speech Synthesis

A proprietary speech synthesis module was used for iSocioBot, as it was of high quality and accessible to us. The process of synthesising speech consists of these steps:

1. Text normalization
2. Conversion of the text to a phonetic transcription
3. Selection of speech units from pre-recorded sentences
4. Concatenation of the selected units

An example is given in Fig. 6. During text normalization, abbreviations, symbols and numbers are written out as words. The phonetic transcription of most words is found in a pronunciation lexicon. If an unknown word can be identified as a compound word whose parts are in the lexicon, the transcription is created from the transcriptions of the parts. Otherwise, a reasonable guess of the phonetic transcription is made from the letters in the word. Several thousand sentences have been recorded and annotated so that it is known exactly when and which phonetic sounds occur in each recording.
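Since the synthesis module itself is proprietary, the toy sketch below only illustrates steps 1 and 2 (text normalization and lexicon lookup with a compound-word fallback) using a made-up mini-lexicon; it is not the module used on the robot.

```python
# Toy illustration of steps 1-2 above; lexicon entries and transcriptions are made up.
LEXICON = {"two": "t uw", "robots": "r ow b ao t s", "social": "s ow sh ax l"}
NUMBERS = {"2": "two"}

def normalize(text):
    """Write out numbers and strip punctuation (step 1)."""
    words = text.lower().replace(",", "").replace(".", "").split()
    return [NUMBERS.get(w, w) for w in words]

def transcribe(word):
    """Look up a word; fall back to splitting compounds whose parts are known (step 2)."""
    if word in LEXICON:
        return LEXICON[word]
    for i in range(2, len(word) - 1):
        head, tail = word[:i], word[i:]
        if head in LEXICON and tail in LEXICON:
            return LEXICON[head] + " " + LEXICON[tail]
    return "<guess from letters>"

print([transcribe(w) for w in normalize("2 social robots.")])
```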


Fig. 6 Example of speech synthesis—steps 1 to 3


Fig. 7 Facial expression samples (happiness, anger, fear, sadness, disgust, surprise) of the robot

Fig. 8 Pixel boards for the modes of the robot when talking (top), listening (center) and thinking (bottom)

7 Visual Feedback

7.1 Face

We implement the facial expression feedback module using an LED array and an Arduino embedded system. The LED array is covered with an acoustic cloth, so that it is visible from a distance and under low light. The robot can display different facial expressions based on the results from the speech recognition and person identification modules. Figure 7 shows the six basic facial expressions (happiness, anger, fear, sadness, disgust, and surprise) that can be displayed by this system. As in many previous studies concerned with the representation of robotic facial expressions [1,23,32,55], we decided to follow the approach to emotions proposed by Paul Ekman [14]. Furthermore, by switching between two facial expressions at a low frequency, we are able to make the robot communicate its intentions by appearing to be talking (opening and closing the mouth), to be listening (eyes looking left and right), and to be thinking of a reply (eyes looking upwards), as shown in Fig. 8. This makes the interaction with people more natural, and makes the robot appear responsive and "lively". For the design of the facial expressions of the robot we use simple 32 × 32 pixel boards (see Fig. 8 again).

The facial expressions of iSocioBot were validated in an in-house study with a group of 53 students and university personnel of mean age 27.3 years (age range 20–53), consisting of 38 males (72%) and 15 females (28%). Every participant was invited to a robot session that lasted approximately 5 min, during which iSocioBot portrayed the six robotic facial expressions in a random order. Participants were asked to select, from a predetermined list of the six available emotions, the one they felt was the best match. The high recognition rate for almost all the emotions, combined with an overall accuracy of 83.3% and an almost perfect level of

agreement (Cohen's Kappa of 0.8), indicates that the emotions were well designed and understood by the participants.

7.2 Ears

Two LED strips, which are controlled by the same Arduino embedded system as the facial expressions, are attached to each side of the iSocioBot head. They blink when iSocioBot is capturing speech from the microphone and therefore serve as indicators that iSocioBot is listening.
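For illustration, the sketch below shows one way the host computer could drive such an Arduino-controlled LED face over a serial link, toggling between two mouth frames at a low frequency to produce the talking animation. The serial port, baud rate and one-byte expression codes are assumptions made for the example; the actual firmware protocol of iSocioBot may differ.

```python
# Hypothetical sketch of driving the LED face from the host computer; the serial
# protocol (one-byte expression codes) is an assumption for illustration only.
import time
import serial  # pyserial

EXPRESSIONS = {'neutral': 0, 'talk_open': 1, 'talk_closed': 2,
               'look_left': 3, 'look_right': 4, 'look_up': 5}

def show(port, name):
    port.write(bytes([EXPRESSIONS[name]]))

def talk_animation(port, duration_s=3.0, rate_hz=4.0):
    """Toggle between two mouth frames at a low frequency to appear talking."""
    end = time.time() + duration_s
    frames = ['talk_open', 'talk_closed']
    i = 0
    while time.time() < end:
        show(port, frames[i % 2])
        i += 1
        time.sleep(1.0 / rate_hz)

if __name__ == '__main__':
    with serial.Serial('/dev/ttyACM0', 115200, timeout=1) as arduino:
        talk_animation(arduino)
```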

8 Interaction Management

The interaction management system is responsible for controlling and communicating with every other module to enable interaction. To achieve a specific interaction, only the interaction management needs to be reprogrammed. A subtask of the interaction management is to handle how the robot talks with humans, which is called dialogue management. Dialogue management is the task of choosing or forming a suitable response to an input; e.g., if a person asks "How old are you?", the robot should reply with the correct age. The general flow of the dialogue management is given in Fig. 9, along with the facial expression displayed for each particular state and the action that is carried out. When the robot is idle and listening, the eyes move from side to side to indicate availability and curiosity. After speech has been detected, the robot indicates that it is thinking while running ASR to decode the utterance. Finally, the robot responds by playing the generated answer through the loudspeaker and at the same time activating the speaking animation for the facial expression.


Fig. 9 The typical flow of the dialogue interaction with the robot with respect to conversation and facial modes: listening (facial mode: listening; action: wait for speech input), thinking (facial mode: thinking; action: decode speech) and speaking (facial mode: speaking; actions: generate answer, run speech synthesis). It should be noted that if decoding of speech is unsuccessful, the robot will not say anything

The robot only looks straight ahead with fixed eyes while talking. Our robot is currently equipped with three different dialogue management systems: one for traditional questions and answers (Q/A), one for open dialogue, and a combination of the two.

8.1 Q/A

In this mode, the robot is pre-programmed with a list of questions and answers, with the possibility of multiple answers for each question. Given an input string (the decoded speech), the robot decides which question was asked by computing the Levenshtein distance between the input string and all the questions. The question yielding the highest score is chosen, provided that the score is above a pre-defined threshold. The answer is then chosen at random from the list of answers for that particular question; if no match is found, nothing is replied.
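A minimal sketch of this matching strategy is shown below; the question list, answers, threshold and score normalization are illustrative assumptions.

```python
# Sketch of the Q/A matching: Levenshtein distance turned into a similarity score.
import random

def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similarity(a, b):
    return 1.0 - levenshtein(a.lower(), b.lower()) / max(len(a), len(b), 1)

QA = {"how old are you": ["I am two years old.", "Old enough!"],
      "what is your name": ["My name is iSocioBot."]}

def answer(utterance, threshold=0.6):
    question, score = max(((q, similarity(utterance, q)) for q in QA),
                          key=lambda t: t[1])
    return random.choice(QA[question]) if score >= threshold else None

print(answer("How old are you?"))
```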

8.2 Open Dialogue

In order to engage in an open dialogue, the robot must be equipped with knowledge on many topics, be able to remember or take into account the most recent questions and replies, and be able to come up with meaningful answers when the exact answer is not known to the robot. This is handled by using the ALICE chatbot, which is an open-source chatbot based on the Artificial Intelligence Markup Language (AIML).
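A minimal sketch using PyAIML, one open-source Python interpreter for AIML, is shown below; the AIML file name is a placeholder and not the actual knowledge base used on the robot.

```python
# Hypothetical sketch of querying an AIML-based chatbot such as ALICE.
import aiml  # PyAIML package

kernel = aiml.Kernel()
kernel.learn("alice.aiml")      # load AIML categories from a placeholder file

def open_dialogue_reply(utterance):
    reply = kernel.respond(utterance)
    return reply if reply else "Tell me more."   # fall back when no category matches

print(open_dialogue_reply("What is your favourite colour?"))
```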

8.3 Combination

In this mode, we give priority to the Q/A mode, i.e., if a match is found, the Q/A module is in charge of supplying the answer; otherwise the open dialogue module is used.

9 Evaluation

The iSocioBot was demonstrated at four public events:

– Research Day of Denmark 2014: The first version of iSocioBot participated in the official opening of the event by first taking part in a scripted, on-stage interaction with a human presenter, and afterwards in a Q/A session off the stage, where the robot could turn towards and talk to the person speaking to it.
– Nibe Safe 7 2014: This was a small, local event to promote traffic safety for children riding their bikes. The first version of iSocioBot took part in a scripted interaction with a presenter and also a Q/A session with person tracking, where mainly kids would come and ask a question from a set of pre-defined questions. Since the event was local, all relevant modules were set to the Danish language. The kids seemed to like the interaction and the appearance of the robot.
– Copenhagen Culture Night 2014: This was a big event which took place at the Ministry of Higher Education and Science, Denmark. The first version of iSocioBot was programmed with a set of questions and possible replies, which would then be combined to form a conversation with any volunteer. More than 100 kids interacted with the robot and seemed sincerely impressed and enthusiastic about the interaction. Some adults also tried interacting with iSocioBot, but were easily able to see through the rather shallow way of replying from the robot.
– People's Meeting in Denmark 2015: The theme of this event was politics, thus the first and second versions of iSocioBot were programmed to carry out a small political debate with each other. During the two days of participation in the event, iSocioBot interacted with people in the form of open dialogue with person tracking.


Due to the nature of these events and their audience, the collection of adequate data, needed to thoroughly evaluate the performance of the robots, was too demanding. Therefore, we decided to organise a controlled in-house lab study to formally evaluate the robot. Since the robot consists of many sub-modules, e.g., face detection, person tracking, etc., there are many ways and levels of conducting the evaluation. In this work it was decided to evaluate the relevant functionalities only via an interaction scenario and not separately. The motivation for doing so is that the robot is supposed to interact with humans, who will perceive the robot as a full system operating in a context, and thus it should also be evaluated in this way. The performance and evaluation of person tracking and face recognition are presented in [51,52] and [13], respectively.


Fig. 10 Trial setup

Fig. 11 Trial with iSocioBot, with the active interaction partner (left) and the observer (right)

9.1 Methodology

Many methods have been developed for the evaluation of social robotic avatars [29,34,56], but for this specific evaluation we considered the Godspeed Questionnaire (measuring 24 attributes of anthropomorphism, animacy, likability, perceived intelligence, and perceived safety on a 5-point scale) to be the most appropriate [3]. In addition to the Godspeed indices, we introduce for this study the index of Attentiveness, comprising two attributes: Negligent/Attentive, and No eye contact when user speaks/Eye contact when user speaks. This addition contributes to a more complete assessment of the technical features of iSocioBot, such as directing attention towards the interlocutor via the person tracking algorithm. We also asked participants to state their belief about the gender of iSocioBot (1 for female/5 for male). Even though the robot's appearance does not reveal any strong information on gender (one could characterize it as gender neutral), the dialogue module uses a female voice. Previous studies have indicated that the gender of a robot (based on appearance) plays a crucial role in the categorization process that takes place in the minds of people when they come across new kinds of technology such as humanoid robots [15,44,54]. Such studies influenced us to examine whether the voice of iSocioBot can be another factor that affects robot gender categorization and therefore affects the HRI experience. Lastly, the questionnaire included a field where participants could voluntarily write their general comments about the robot or their experience.

9.2 Procedure

Participants were recruited via online invitations in social media networks, and were both university and community members. No participants were paid, but they instead received

coffee and cookies afterwards. They were welcomed and then escorted to a meeting room inside the university, where they were briefed about the experiment. The actual HRI took place in another room where the robot was located. Participants were invited to take part in the study in pairs; one would engage in an active interaction, and the other would only observe the interaction. This condition was necessary because the scenarios where iSocioBot can be applied include situations where more people (apart from the active interaction partner) could be co-located, observing the HRI as bystanders. Bystanders' opinions matter, as they could be the next active interaction partners of the robot, and the impression they form by viewing others interacting with the robot might discourage them from engaging in HRI [8,22]. The participant doing the active interaction would hold a microphone, address all questions through it, and remain standing throughout the duration of the experiment. Furthermore, the active interaction partner had to change position at least three times during the HRI trial, following the numbered markings on the floor as shown in Fig. 10, in order for us to examine whether the robot would turn and direct its attention towards the interlocutor, thus testing the ability of the robot to attend to the speaker. The participant acting as the observer was seated within the same room and simply observed the HRI throughout the whole experiment, as shown in Fig. 11. The participants were randomly assigned to one of the two conditions. The active interaction partners were encouraged to speak freely in English to iSocioBot about any topic of interest.


Table 1 Analysis of the questionnaire

Questionnaire                                                    Mean  SD    Cronbach's alpha
I. Anthropomorphism                                              2.76  1.00  0.78
  Fake/natural                                                   2.75  0.91  -
  Machinelike/humanlike                                          2.46  1.16  -
  Unconscious/conscious                                          3.09  1.01  -
  Artificial/lifelike                                            2.50  0.98  -
  Moving rigidly/moving elegantly                                2.71  1.11  -
II. Animacy                                                      3.11  0.88  0.79
  Dead/alive                                                     3.21  0.85  -
  Stagnant/lively                                                3.03  0.93  -
  Mechanical/organic                                             2.37  0.89  -
  Artificial/lifelike                                            2.48  0.97  -
  Inert/interactive                                              3.65  0.77  -
  Apathetic/responsive                                           3.90  0.87  -
III. Likeability                                                 4.03  0.80  0.72
  Dislike/like                                                   4.15  0.57  -
  Unfriendly/friendly                                            4.09  0.39  -
  Unkind/kind                                                    3.87  0.54  -
  Unpleasant/pleasant                                            4.00  0.87  -
  Awful/nice                                                     4.06  0.80  -
IV. Perceived intelligence                                       3.38  0.87  0.88
  Incompetent/competent                                          3.31  0.84  -
  Ignorant/knowledgeable                                         3.53  0.90  -
  Irresponsible/responsible                                      3.31  0.76  -
  Unintelligent/intelligent                                      3.40  0.89  -
  Foolish/sensible                                               3.34  0.95  -
V. Perceived safety                                              3.48  1.04  0.43
  Anxious/relaxed                                                4.06  1.05  -
  Agitated/calm                                                  3.65  1.18  -
  Surprised/quiescent                                            2.75  0.86  -
VI. Attentiveness                                                3.34  0.86  0.52
  Negligent/attentive                                            3.43  0.78  -
  No eye contact when user speaks/Eye contact when user speaks   3.25  0.93  -

A list of eight assistive questions was available in case the participant needed guidance ('What's your name?', 'Do you think you are smart?', 'Do you have any friends?', 'What's your birthday?', 'Do you ever catch an illness?', 'Do you think you are pretty?', 'What do you want to talk about?', 'Can you speak in another language?'). The dialogue management system of iSocioBot was set to open dialogue, and interactions lasted approximately 5 min. Afterwards, both participants filled in the extended Godspeed questionnaire and were debriefed.

9.3 Results

In total, 32 participants (16 unique pairs) with a mean age of 28.2 years (max 35/min 20), of which 22 were male (68.75%) and 10 female (31.25%), took part in the evaluation. All received questionnaires were anonymous. Table 1 presents an analysis of all the attributes of our questionnaire, including mean values, standard deviations, group means for each index, and Cronbach's alpha values indicating the internal consistency reliability of the data. Table 2 lists the comments of the participants relevant to the trial and the robot. Furthermore, the question regarding the gender of the robot yielded a mean value of 2 and a standard deviation (SD) of 0.93.

9.4 Analysis

The Cronbach’s alpha values for the indices of anthropomorphism, animacy, likeability, and perceived intelligence are above the 0.70 threshold meaning that there is a high degree

Table 2 General comments

1. In sense of being able to answer questions it was good, but the relevance of the answers in some cases were low
2. Some of the robot's replies seemed to be cut short
3. Nice experience. Difficulties to answer to some questions. It was not facing me all the time
4. Doesn't understand the most of the questions. Very rigid when it speaks. Not very nice outlooking
6. Sometimes it doesn't understand my accent:)
7. I think the robot is mainly machine-like and I felt like its answers are based on googling
8. Needs to learn better jokes
9. Interacting well with the environment!
11. It has been more interactive after some minutes
12. The robot would almost always turn to the direction where the speaker was standing. It's answers were usually well related to the questions and it would also ask back. Sometimes its answers were even funny!
13. All in all it was a nice experience. The robot did not respond to some of the questions and it wasn't clear if it was because it didn't get the question or because it didn't know the answer
14. As a general impression I can say it has been a nice experience and it seemed as a friendly and calm discussion between the robot and the interlocutor. A good overall experience!
15. Does not have preferences, seems it likes everything. Would be interesting to see more personal opinions. Very impressive!
16. Very clever social answers. Very kind. Does not have preferences. Very expressive, friendly, funny, limited encyclopedia answers. It's the best robot I have ever seen

Fig. 12 Boxplot with mean values for type of interaction

The low Cronbach's alpha value for the perceived safety index is consistent with previous research [24,48], while the low Cronbach's alpha value for attentiveness, reported here for the first time, might be attributed to the low number of questions. We conducted an analysis of variance (ANOVA), and the statistical analysis of the results indicated:

– A significant effect of type of interaction (active interaction/observation) on the extended Godspeed questionnaire, F(1, 725) = 7.590, p = 0.00602. Figure 12 shows

the boxplot for type of interaction against the questionnaire indices.
– A significant effect of gender on the extended Godspeed questionnaire, F(1, 725) = 4.119, p = 0.04278. Figure 13 shows the boxplot for gender against the questionnaire indices.
– A significant effect of the extended Godspeed questionnaire itself, F(25, 725) = 11.874, p < 2e−16 (meaning that participants responded differently on the 26 scales of the questionnaire).

Fig. 13 Boxplot with mean values for gender

– A significant interaction effect between the factors type of interaction and gender for the Perceived Safety index, F(1, 84) = 4.279, p = 0.0417.

Results that were not statistically significant were left out of the analysis.

9.5 Discussion of the Evaluation

The results of the statistical analysis suggest that the type of interaction plays a significant role in the way interaction partners of the robot experience HRI. Active interaction partners rated the robot better on the indices of anthropomorphism, animacy, and perceived intelligence; observers rated the robot better on perceived safety and attentiveness; and both considered the robot to be almost equally highly likeable. A surprising finding was that observers felt safer around iSocioBot than the actual interaction partners, even though both were located in the same room and in very close proximity to the robot. Additionally, the analysis suggests that the gender of the participants plays a significant role in HRI, as female participants showed a tendency to rate the robot higher on almost all the indices apart from perceived safety. Taking into consideration that the gender of the robot was perceived as female by the majority of the participants due to its female voice, we have a clear indication of a same-gender preference (e.g., female participants preferring female robots), which is consistent with previous research [9,21], and that the voice of the robot is as important as its external appearance in the categorization process that takes place in the mind of humans. Since the speech synthesiser used for this evaluation is proprietary, using another one may result in different ratings of the robot for some of the attributes. This is, however, beyond the scope of this work.


In the general comments field of the questionnaire, where participants could freely write their comments, some reported issues with the way the robot replied to their questions and with the relevance of the answers, while others reported their overall impression of the robot. Certainly, in cases where the speech recognition system did not understand the strong accent of the speaker (as reported in the comments), or did not know the answer to the question, the reply from the open dialogue module might not have been very relevant. Another reported limitation was that the robot was not facing the interlocutor "all the time". Indeed, the robot would occasionally recognize noise coming from the wheel-base as input and turn, even though the interlocutor had not changed position. Other than these two reported limitations, the results, the analysis of the data, and the comments of the participants suggest an overall very good impression of both the robot and the HRI experience.

10 Conclusion

In this work we have presented our approach to building a multimodal interactive social robotic system in terms of the design of its appearance, the choice of hardware and the structure of the software. The technology and algorithms for the key components of the system have been described and made available online for other researchers to download. We believe that our approach of using open-source software along with off-the-shelf hardware will benefit other research institutions with a desire to enter the field of social robotics. To quantify the success of our approach we conducted an in-house experiment using a modified version of the Godspeed questionnaire. The appearance of the robot was found to give an overall positive impression and the subjects also provided positive comments


about the interaction. The main criticism from the test subjects was that the robot sometimes did not understand them and that the robot sometimes turned away from them during interaction. The first item needs further investigation to determine whether it is a speech recognition issue or a dialogue management issue. The second issue has already been identified as being caused by the internal fan of the wheel-base generating too much noise. Based on this evaluation we believe that this system is a good first iteration, but there are still important issues to consider. The main issues to improve for the next version of iSocioBot are identified to be:


– The dialogue management must take the identity of the interacting person into consideration in order to make the dialogue more personal and thus more relevant.
– The speech recognition system must be able to work at an acceptable accuracy using the on-board microphone(s), as opposed to the handheld wireless microphone used in this evaluation.


Another issue which was not directly evaluated in this study is the ability of the robot to sense the environment in a more structured way, as opposed to the current version where the modalities (e.g., audio and vision) are processed almost independently of each other. For this we propose a fusion concept we call reinforcement fusion, which combines sensor signals in an interactive way: e.g., when a robot detects a sound direction, it turns towards the direction to see better and moves towards it to hear better. Reinforcement fusion is analogous to reinforcement learning [29], a well-known term in machine learning. We believe that our complete system, or parts of it, can be used in many different applications of social robotics, which can benefit society. For instance, different aspects of iSocioBot are currently being evaluated at several nursing homes in Denmark on people with varying degrees of dementia, to determine if it can be used as a conversational partner in this setting. In line with previous studies by other researchers, we also plan to test the communication properties of iSocioBot on children with autism to examine if the children can benefit from this.

Acknowledgements We thank the reviewers for their valuable suggestions and comments, which have greatly improved the present work. We acknowledge Ben Krøyer and Peter Boie Jensen for constructing the iSocioBot, Trine Skjødt Axelgaard for designing the appearance of the first generation iSocioBot and making the scenario drawings (Fig. 1), and Søren Emil for designing the appearance of the second generation iSocioBot.


References

1. Ahn HS, Lee DW, Choi D, Lee DY, Hur M, Lee H (2012) Difference of efficiency in human–robot interaction according to condition of experimental environment. In: Social robotics. Springer, Berlin, pp 219–227
2. Baddoura R, Venture G (2013) Social vs. useful HRI: experiencing the familiar, perceiving the robot as a sociable partner and responding to its actions. Int J Soc Robot 5(4):529–547
3. Bartneck C, Kulić D, Croft E, Zoghbi S (2009) Measurement instruments for the anthropomorphism, animacy, likeability, perceived intelligence, and perceived safety of robots. Int J Soc Robot 1(1):71–81
4. Boersma P (1993) Accurate short-term analysis of the fundamental frequency and the harmonics-to-noise ratio of a sampled sound. IFA Proc 17:97–110
5. Brahnam S, Jain LC, Nanni L, Lumini A (2014) Local binary patterns: new variants and applications, vol 506. Springer, Berlin
6. Breazeal C (2004) Social interactions in HRI: the robot view. IEEE Trans Syst Man Cybern C Appl Rev 34(2):181–186
7. Broekens J, Heerink M, Rosendal H (2009) Assistive social robots in elderly care: a review. Gerontechnology 8(2):94–103
8. Chang WL, Šabanović S (2015) Interaction expands function: social shaping of the therapeutic robot Paro in a nursing home. In: Proceedings of the tenth annual ACM/IEEE international conference on human–robot interaction. ACM, pp 343–350
9. Cialdini RB (1993) Influence: the psychology of persuasion. Morrow, New York
10. Cooney MD, Kanda T, Alissandrakis A, Ishiguro H (2011) Interaction design for an enjoyable play interaction with a small humanoid robot. In: 2011 11th IEEE-RAS international conference on humanoid robots (Humanoids). IEEE, pp 112–119
11. Dehak N, Kenny PJ, Dehak R, Dumouchel P, Ouellet P (2011) Front-end factor analysis for speaker verification. IEEE Trans Audio Speech Lang Process 19(4):788–798
12. Dibiase JH (2000) A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays. Ph.D. thesis, Brown University
13. Duan X, Tan ZH (2015) Local feature learning for face recognition under varying poses. In: 2015 IEEE international conference on image processing (ICIP). IEEE, pp 2905–2909
14. Ekman P (1992) An argument for basic emotions. Cognit Emot 6(3–4):169–200
15. Eyssel F, Kuchenbrandt D (2012) Social categorization of social robots: anthropomorphism as a function of robot group membership. Br J Soc Psychol 51(4):724–731
16. Feil-Seifer D, Matarić MJ (2005) Defining socially assistive robotics. In: 9th international conference on rehabilitation robotics, 2005 (ICORR 2005). IEEE, pp 465–468
17. Fortenberry B, Chenu J, Movellan J (2004) Rubi: a robotic platform for real-time social interaction. In: Proceedings of the international conference on development and learning (ICDL04), The Salk Institute, San Diego
18. François D, Powell S, Dautenhahn K (2009) A long-term study of children with autism playing with a robotic pet: taking inspirations from non-directive play therapy to encourage children's proactivity and initiative-taking. Interact Stud 10(3):324–373
19. Garage W: Robot Operating System (ROS). http://www.ros.org/. Accessed 29 Sept 2014
20. Garofolo JS, Consortium LD, et al (1993) TIMIT: acoustic-phonetic continuous speech corpus. Linguistic Data Consortium
21. Gass RH, Seiter JS (2015) Persuasion: social influence and compliance gaining. Routledge, London
22. Haring KS, Silvera-Tawil D, Takahashi T, Velonaki M, Watanabe K (2015) Perception of a humanoid robot: a cross-cultural comparison. In: 2015 24th IEEE international symposium on robot and human interactive communication (RO-MAN). IEEE, pp 821–826
23. Hashimoto T, Hitramatsu S, Tsuji T, Kobayashi H (2006) Development of the face robot saya for rich facial expressions. In: International joint conference on SICE-ICASE, 2006. IEEE, pp 5423–5428


24. Ho CC, MacDorman KF (2010) Revisiting the uncanny valley theory: developing and validating an alternative to the Godspeed indices. Comput Hum Behav 26(6):1508–1518
25. Jochum E, Vlachos E, Christoffersen A, Nielsen SG, Hameed IA, Tan ZH (2016) Using theatre to study interaction with care robots. Int J Soc Robot 1–14
26. Kanda T, Ishiguro H, Imai M, Ono T (2004) Development and evaluation of interactive humanoid robots. Proc IEEE 92(11):1839–1850
27. Kenny P, Boulianne G, Dumouchel P (2005) Eigenvoice modeling with sparse training data. IEEE Trans Speech Audio Process 13(3):345–354
28. Kim ES, Berkovits LD, Bernier EP, Leyzberg D, Shic F, Paul R, Scassellati B (2013) Social robots as embedded reinforcers of social behavior in children with autism. J Autism Dev Disord 43(5):1038–1049
29. Kober J, Peters J (2012) Reinforcement learning in robotics: a survey. In: Reinforcement learning. Springer, Berlin, pp 579–610
30. Larcher A, Bonastre JF, Fauve B, Lee KA, Lévy C, Li H, Mason J, Parfait JY, U ValidSoft Ltd (2013) ALIZE 3.0—open source toolkit for state-of-the-art speaker recognition. In: Annual conference of the international speech communication association (Interspeech), pp 1–5
31. Manohar V, Crandall J (2014) Programming robots to express emotions: interaction paradigms, communication modalities, and context. IEEE Trans Hum–Mach Syst 44(3):362–373
32. Mazzei D, Lazzeri N, Hanson D, De Rossi D (2012) HEFES: a hybrid engine for facial expressions synthesis to control human-like androids and avatars. In: 2012 4th IEEE RAS & EMBS international conference on biomedical robotics and biomechatronics (BioRob). IEEE, pp 195–200
33. Metta G, Sandini G, Vernon D, Natale L, Nori F (2008) The iCub humanoid robot: an open platform for research in embodied cognition. In: Proceedings of the 8th workshop on performance metrics for intelligent systems. ACM, pp 50–56
34. Michalowski MP, Sabanovic S, Simmons R (2006) A spatial model of engagement for a social robot. In: 9th IEEE international workshop on advanced motion control, 2006. IEEE, pp 762–767
35. Ojansivu V, Heikkilä J (2008) Blur insensitive texture classification using local phase quantization. In: Image and signal processing. Springer, Berlin, pp 236–243
36. Okuno HG, Nakadai K, Hidai Ki, Mizoguchi H, Kitano H (2001) Human–robot interaction through real-time auditory and visual multiple-talker tracking. In: Proceedings of the 2001 IEEE/RSJ international conference on intelligent robots and systems, vol 3. IEEE, pp 1402–1409
37. Okuno HG, Nakadai K, Kitano H (2002) Social interaction of humanoid robot based on audio-visual tracking. In: Developments in applied artificial intelligence. Springer, Berlin, pp 725–735
38. Python API for Google speech recognition. http://pypi.python.org/pypi/SpeechRecognition/2.1.3. Accessed 15 July 2016
39. Pereira FG, Vassallo RF, Salles EOT (2013) Human–robot interaction and cooperation through people detection and gesture recognition. J Control Autom Electr Syst 24(3):187–198
40. Pietikäinen M, Hadid A, Zhao G, Ahonen T (2011) Computer vision using local binary patterns, vol 40. Springer, Berlin
41. Reich-Stiebert N, Eyssel F (2015) Learning with educational companion robots? Toward attitudes on education robots, predictors of attitudes, and application potentials for education robots. Int J Soc Robot 7(5):875–888
42. Robinson H, MacDonald B, Broadbent E (2014) The role of healthcare robots for older people at home: a review. Int J Soc Robot 6(4):575–591


43. Salichs MA, Barber R, Khamis AM, Malfaz M, Gorostiza JF, Pacheco R, Rivas R, Corrales A, Delgado E, García D (2006) Maggie: a robotic platform for human–robot social interaction. In: 2006 IEEE conference on robotics, automation and mechatronics. IEEE, pp 1–7
44. Siegel M, Breazeal C, Norton MI (2009) Persuasive robotics: the influence of robot gender on human behavior. In: IEEE/RSJ international conference on intelligent robots and systems, 2009 (IROS 2009). IEEE, pp 2563–2568
45. Stiefelhagen R, Ekenel HK, Fugen C, Gieselmann P, Holzapfel H, Kraft F, Nickel K, Voit M, Waibel A (2007) Enabling multimodal human–robot interaction for the Karlsruhe humanoid robot. IEEE Trans Robot 23(5):840–851
46. Stiefelhagen R, Fugen C, Gieselmann R, Holzapfel H, Nickel K, Waibel A (2004) Natural human–robot interaction using speech, head pose and gestures. In: Proceedings of the 2004 IEEE/RSJ international conference on intelligent robots and systems (IROS 2004), vol 3. IEEE, pp 2422–2427
47. Stiefelhagen R, Zhu J (2002) Head orientation and gaze direction in meetings. In: CHI'02 extended abstracts on human factors in computing systems. ACM, pp 858–859
48. Tahir Y, Rasheed U, Dauwels S, Dauwels J (2014) Perception of humanoid social mediator in two-person dialogs. In: Proceedings of the 2014 ACM/IEEE international conference on human–robot interaction. ACM, pp 300–301
49. Tan ZH, Lindberg B (2010) Low-complexity variable frame rate analysis for speech recognition and voice activity detection. IEEE J Sel Top Signal Process 4(5):798–807. doi:10.1109/JSTSP.2010.2057192
50. Tan ZH, Thomsen NB, Duan X (2015) Designing and implementing an interactive social robot from off-the-shelf components. In: Recent advances in mechanism design for robotics. Springer, Berlin, pp 113–121
51. Thomsen NB, Tan ZH, Lindberg B, Jensen SH (2014) Improving robustness against environmental sounds for directing attention of social robots. In: Multimodal analyses enabling artificial agents in human–machine interaction. Springer, Berlin, pp 25–34
52. Thomsen NB, Tan ZH, Lindberg B, Jensen SH (2015) Learning direction of attention for a social robot in noisy environments. In: 3rd AAU workshop on robotics. Aalborg Universitetsforlag
53. Viola P, Jones M (2001) Robust real-time object detection. In: International journal of computer vision
54. Vlachos E, Jochum EA, Schärfe H (2016) Head orientation behavior of users and durations in playful open-ended interactions with an android robot. In: Cultural robotics, LNAI 9549. Springer, Berlin
55. Vlachos E, Schärfe H (2012) Android emotions revealed. In: Social robotics. Springer, Berlin, pp 56–65
56. Vlachos E, Schärfe H (2015) An open-ended approach to evaluating android faces. In: 2015 24th IEEE international symposium on robot and human interactive communication (RO-MAN). IEEE, pp 746–751

Zheng-Hua Tan is a Professor in the Department of Electronic Systems at Aalborg University, Denmark. He was a Visiting Scientist at the Computer Science and Artificial Intelligence Laboratory, MIT, USA, an Associate Professor in the Department of Electronic Engineering at Shanghai Jiao Tong University, China, and a Postdoc in the Department of Computer Science at KAIST, Korea. He received the B.Sc. and M.Sc. degrees from Hunan University, China, and the Ph.D. degree from Shanghai Jiao Tong University, China. His research interests include machine learning, deep learning, speech and speaker recognition, noise-robust speech processing, multimodal signal processing, and social robotics, in which he has over 170 publications. He has been an Editorial Board Member for several journals.

Nicolai Bæk Thomsen received his B.Sc. in electrical engineering and M.Sc. in signal processing and computing from Aalborg University, Denmark, in 2011 and 2013, respectively. He is currently pursuing a Ph.D. degree in signal processing at the Department of Electronic Systems, Aalborg University, where he is doing research in the project entitled “Durable Interaction with Socially Intelligent Robots” supported by the Danish Council for Independent Research. His main research interests are statistical signal processing, speech processing, robotics and machine learning.

Sven Ewan Shepstone received the B.S. and M.S. degrees in Electrical Engineering from the University of Cape Town in 1999 and 2002, respectively. From 2005 to 2010, he worked as a Systems Engineer in the field of Ethernet and broadband communications for Ericsson A/S in Denmark, and he has been employed at Bang and Olufsen A/S in Denmark since 2010. In 2015, he received the Ph.D. degree from Aalborg University. His research interests include the application of artificial intelligence to consumer electronics. He was the recipient of the IEEE Ganesh N. Ramaswamy Memorial Student Grant at ICASSP 2015.

Xiaodong Duan is currently a Ph.D. student in the Department of Electronic Systems, Aalborg University, Denmark, under the supervision of Zheng-Hua Tan. His research interests include computer vision and machine learning. He received his B.Sc. in optical information science & technology from South China University of Technology and his M.Eng. in software engineering from Peking University, China, in 2006 and 2009, respectively. From 2009 to 2013, he worked on vision-based fire detection and statistical experimental data analysis at the China Academy of Building Research.

Morten Højfeldt Rasmussen received the M.Sc. degree in electrical engineering from Aalborg University, Denmark, in 2005. In 2012, he founded SpeechMiners, a company working in the area of automatic speech recognition. In 2016, he joined First Agenda, Denmark. His interests include speech processing, natural language processing and machine learning.

Evgenios Vlachos received the M.S. degree in electronic automation from the National and Kapodistrian University of Athens, Greece, in 2009, and the M.A. and Ph.D. degrees in human centered communication and informatics from Aalborg University, Denmark, in 2012 and 2016, respectively. From 2011 to 2015, he participated in the Geminoid-DK project. Since 2016, he has been a Postdoctoral Researcher with the Department of Electronic Systems, Aalborg University, Denmark. Currently, he is working on the project “Durable Interaction with Socially Intelligent Robots” supported by the Danish Council for Independent Research. His research interests include human–robot interaction, social robotics, assessment studies, user experience design, user behavior modeling and nonverbal behavior understanding.

Jesper Lisby Højvang received the M.Sc. and Ph.D. degrees in electrical engineering from Aalborg University, Denmark, in 2005 and 2009, respectively. In 2008, he joined MV-Nordic, Denmark, where he worked on commercial text-to-speech systems. In 2017, he joined First Agenda, Denmark. His interests include speech processing, natural language processing, music information retrieval and machine learning.
