
Usability of a Speech Centric Multimodal Directory Assistance Service

Els den Os*, Nicole de Koning, Hans Jongebloed
KPN Research, Leidschendam, The Netherlands
+31 24 3521333
[email protected]

Lou Boves
Department for Language and Speech, University of Nijmegen, The Netherlands
+31 24 3612902
[email protected]

*Presently at Max Planck Institute for Psycholinguistics, Nijmegen

1 Abstract

We present two user evaluations of a speech centric multimodal Directory Assistance (DA) service that has been implemented on an iPAQ. The service combines speech and pen at the input side with text and speech at the output side. The first evaluation was an expert review: five usability experts judged the initial design of the multimodal DA service. Based on their comments a new design was implemented, which was tested by six naïve users. The results show that naïve users are able to combine pen input and speech recognition in an intuitive way. Much attention must be paid to the graphical interface, which should guide the intuitive use of the multimodal service.

1.1 Keywords

Usability, speech centric multimodal interaction, Directory Assistance, pen/speech/text

2 Introduction

The emerging mobile Internet-capable terminals promise to enable a large number of appealing services. However, the present generation of PDAs poses substantial usability problems, because these devices have small keyboards, or even no keyboard at all. In addition, the size and resolution of the screen are much smaller than those of the usual desktop terminal. For services that require form filling, and for services that are map/location based, the combination of speech recognition and pen input might provide a solution for these usability problems. The use of speech is especially intuitive for entering items from long implicit lists, such as names of cities, stations, airports, etc., and for asking questions related to certain points on maps, like "what is this?" or "what is the shortest route from here to there?". In this paper the focus is on our experiments with a multimodal Directory Assistance service, as an example of a service that includes form filling. This work is part of the IST project SMADA [1]. In the framework of the EURESCOM project MUST we are working on a map-based application [2, 11]. The experience from the MUST project was used in the design of the experiments reported here, and in the interpretation of the results.


DA services are widely used and they are conceptually simple. However, typing names of cities, businesses, etc. on small keyboards is a bottleneck for DA with small terminals. In the present stage of development of telecommunication services, PDAs compete with GSM handsets as the terminal of choice to access information services. When using a GSM handset as the terminal, automatic speech recognition (ASR) is the major input 'device', while pre-recorded or synthetic speech is the major way of presenting prompts and information. Experience has shown that speech-driven services suffer from their own usability problems. ASR is error-prone, and recognition errors appear to cause major usability problems: users become confused, and they may lose track of the status and progress of the dialogue. We have also found that users find it difficult to build and maintain an appropriate mental model of a dialogue with a machine [3]. Multimodal interfaces hold the promise of solving the usability problems with PDAs on the one hand, and with speech-driven interaction on the other, by combining the strengths of several modes (without being impaired by their weaknesses). When speech recognition is combined with a textual display of the recognition result, one of the most important problems with speech-only services can be solved: immediate visual feedback of recognition results guarantees that users can maintain a correct model of the progress of the dialogue. In addition, this feedback facilitates error detection and repair considerably. At the same time, ASR can help to solve the usability problem that PDAs have because of the lack of a suitable keyboard. However, multimodal interaction as the solution for usability problems does not come for free. Novel dialogue management strategies must be developed, which (often implicitly) determine the ways in which input and output modes can be combined. Presently a number of multimodal research systems (such as SmartKom [5]), but also commercial ones (e.g. the Lobby7 system [6]), are being developed. Recently attempts have been made (for example in the EURESCOM project MUST) to build multimodal systems on the basis of the Galaxy Communicator Software Infrastructure [4].


GUI interfaces that come with desktop access to Internet services tend to implement user-driven interaction; speech-based interfaces, on the other hand, tend to be system-driven (or to follow a mixed-initiative dialogue strategy). Moreover, users have grown accustomed to GUI interfaces, and (be it to a lesser extent) to spoken language systems. They seldom (if at all) have experience with applications that feature a combination of speech and keyboard (or pen) for entering information. Therefore, we must anticipate that users need to learn how to deal with this type of interaction, a learning effort that can neutralise the putative advantage of being able to select between, or to combine, modes.


2.1 Multimodal Interaction Strategies

Although that need not be clear to users, form-filling applications combine two (or perhaps three) types of input actions: entering information in a field and initiating commands. The third type of input action that may be distinguished (but that can also be subsumed under the heading 'commands') is the selection of a field to be filled. When both ASR and pen input are available, the interface designer must decide how the input modes can (or must) be combined for the different types of input actions. When more than one input mode can be used simultaneously, decisions must be made with respect to the ways in which the information is fused. W3C [7] distinguishes three different ways of combining multimodal input (and output) in a session: sequential, uncoordinated simultaneous, and coordinated simultaneous multimodal input/output. A fourth type of using input modes can be distinguished, where the user must select the mode that will be active at the start of each session. In this paper we will not address mode alternation between sessions. Of the three types of multimodal input distinguished by W3C, coordinated simultaneous multimodal interaction is the most complicated to implement, since all input from the different modalities within a certain time frame must be interpreted and fused. This type of multimodality is not necessary for form-filling applications like DA (although it probably is for map-based services). For the DA service we first experimented with a combination of sequential multimodal input (in a particular dialogue state, only one input mode is available, but in the whole dialogue more than one input mode can be used) and uncoordinated simultaneous multimodal input (the user may choose the input mode (s)he likes most at a particular stage in the dialogue). Since the iPAQ seems to promote pen input, the design of the interface was essentially pen-centric: the pen is active all the time, and the pen must be used to click buttons (to initiate commands) and to select fields in the form to be filled. Once a field was selected, the user could choose the input mode (pen or ASR). In a second version of the interface only sequential use of input modes was allowed, i.e. the user indicated whether he wanted to speak (by pressing a speech button) or to type (by selecting the text field).
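To make the distinction between these interaction strategies concrete, the sketch below shows how a dialogue manager could declare, per dialogue state, which input modes are open: a state offering a single mode corresponds to sequential input, a state offering several modes to uncoordinated simultaneous input. This is a minimal illustration only; the state names, the mapping and the code are assumptions, not the actual implementation.

```python
from enum import Enum, auto

class Mode(Enum):
    PEN = auto()
    SPEECH = auto()

# Hypothetical mapping from dialogue states to the input modes that are open
# in that state. States with one mode model sequential multimodal input;
# states with several modes model uncoordinated simultaneous input (the user
# freely picks a mode, but inputs are not fused within a time window).
STATE_MODES = {
    "select_field":  {Mode.PEN},               # pen-centric field selection
    "enter_city":    {Mode.PEN, Mode.SPEECH},  # type or say the city name
    "enter_listing": {Mode.PEN, Mode.SPEECH},
    "correct_entry": {Mode.PEN},               # first design: typing only
    "issue_command": {Mode.PEN},               # e.g. the Search button
}

def mode_allowed(state: str, mode: Mode) -> bool:
    """Return True if the given input mode may be used in this dialogue state."""
    return mode in STATE_MODES.get(state, set())

if __name__ == "__main__":
    print(mode_allowed("enter_city", Mode.SPEECH))     # True
    print(mode_allowed("correct_entry", Mode.SPEECH))  # False in the first design
```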

The WLAN interface of the iPAQ was used to transfer the input speech to a workstation running the SpeechPearl ASR software. SpeechPearl is used in isolated word mode; therefore, the interface design must prompt for single-word responses.

In the versions of the multimodal DA service under investigation, ASR can be used to enter the name of the city and the business/person. For city name recognition the ASR was optimised for the 2500 cities in the Netherlands. Business/person name recognition was not realistic, since we used a small lexicon containing only the names to be recognised in the scenarios that were used for the evaluation.
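As an illustration of this client-server setup, the following sketch sends one recorded utterance from the handheld to a recognition workstation and reads back the recognised word. The host name, port and wire format are assumptions made for the example; the actual interface to SpeechPearl is not shown here.

```python
import socket
import struct

# Hypothetical address of the workstation running the recogniser.
ASR_SERVER = ("asr-workstation.example.net", 9000)

def _recv_exact(conn: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from the connection."""
    data = b""
    while len(data) < n:
        chunk = conn.recv(n - len(data))
        if not chunk:
            raise ConnectionError("server closed the connection")
        data += chunk
    return data

def recognise_utterance(audio_bytes: bytes) -> str:
    """Send one recorded utterance and return the recognised (single) word."""
    with socket.create_connection(ASR_SERVER) as conn:
        # Length-prefixed framing: 4-byte big-endian length, then the payload.
        conn.sendall(struct.pack("!I", len(audio_bytes)) + audio_bytes)
        reply_len = struct.unpack("!I", _recv_exact(conn, 4))[0]
        return _recv_exact(conn, reply_len).decode("utf-8")
```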

2.2 Aims of the research

The aim of the experiments reported in this paper was to obtain basic knowledge about the usability of multimodal interaction that combines ASR and pen input, and speech, graphics and text for output. In addition, we hoped to discover good-practice guidelines for the design of multimodal interfaces that can be generalised beyond DA services, and hopefully also beyond form-filling applications. In this paper we describe the first two iterations of a user-centred procedure to design a multimodal DA service: an expert review and an evaluation by naive subjects. We wanted to determine the best way in which the service can guide users through the interaction. To this end we designed combinations of audio and text prompts and graphical cues. An important goal of the research was to investigate user preferences for specific input actions, and specifically for error correction. We wanted to use real PDAs as the terminal devices in the experiments with naïve users instead of PCs, to ensure that the results of our experiments can be generalised to realistic environments. We used the Compaq iPAQ, because it offers the best platform for program development. The iPAQ comes with standard software for a soft keyboard and for the recording of speech. For the expert review we used a simulated iPAQ environment on a laptop PC (combining ASR with typing and mouse clicking on the PC). For the evaluations by the naive subjects we used a real iPAQ, with pen and speech input.

3 First cycle

3.1 Design focus

The major aim of this cycle was to obtain a basic understanding of the requirements that must be fulfilled for users to understand the ways in which they can interact with a service that provides simultaneous speech and pen input, but that is essentially controlled through the pen (in accordance with the default use of the iPAQ).

Because of the lack of guidelines for the design of multimodal services we decided to start with a straightforward adaptation of the existing Internet DA service in the Netherlands. The graphical user interface in the ‘simple query’ mode (i.e., a mode in which there is no field for street name) from this Internet service was adapted to the screen of an iPAQ. Speech could only be used to input city names and listings (cf. Fig. 1). The pen was available for all actions: selection of text boxes, entering text by means of the soft keyboard, issuing commands (e.g. search), and correction of ASR errors. Speech recognition for entering the names was solicited by visual cues (displaying text in the light green box at the top of the screen) in addition to audio prompts (“Type or say the name”). The ASR engine was activated by clicking in the text box that said “click here to enter a name”. Then the user could speak or type the name or listing. The recogniser output was displayed in the same text box, after some delay due to the fact that ASR needs some time to establish the end of the input utterance. When a recognition error occurred, the user could tap on the correction button (shown to the right of the listing field in Fig. 1). ASR errors could only be corrected by typing. This design choice was motivated by the finding that mode switching in these cases facilitates error handling [8,9]. There was a separate 'visual cue' window at the top of the screen. During the 'dialogue' this window contained messages (i.e. instructions) to the users.
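As a rough illustration of the interaction just described (recognition results appear in the tapped text box, and repairs are made by typing only), consider the sketch below. The class and method names are hypothetical and only meant to make the forced mode switch explicit; they do not reflect the actual code.

```python
# Hypothetical sketch of the first-cycle behaviour for one field: tapping the
# text box opens the recogniser and shows its output in the same box, while the
# Correct button forces a switch to the soft keyboard (speech could not be used
# for repairs in this design).

class FirstCycleField:
    def __init__(self, name: str):
        self.name = name
        self.value = ""                 # text shown in the box
        self.keyboard_active = False    # soft keyboard open for corrections?

    def on_text_box_tap(self, recognised: str) -> None:
        # "Click here to enter a name": activates ASR; the (possibly wrong)
        # recognition result is displayed in the same text box after a delay.
        self.value = recognised
        self.keyboard_active = False

    def on_correct_button_tap(self) -> None:
        # Errors could only be corrected by typing (mode switch, cf. [8, 9]).
        self.keyboard_active = True

    def on_soft_keyboard_input(self, text: str) -> None:
        if self.keyboard_active:
            self.value = text

field = FirstCycleField("listing")
field.on_text_box_tap("congress centre")   # ASR result appears in the box
field.on_correct_button_tap()              # switch to typing for repair
field.on_soft_keyboard_input("conference centre")
print(field.value)
```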


3.2 The expert review


We presented the first design to five user interface design experts, in the form of an expert review. By doing so we expected to be able to discover the most important design issues that needed improvement before meaningful experiments with naive subjects could be conducted. The test set-up was very simple. First the service was explained, and then the experts completed 10 scenarios (e.g. find the number of a congress centre in Amsterdam). Afterwards an interview took place in which the experts could express their reactions.


All experts were aware of the eventual goal of the research. Specifically, they knew that one of the main goals of the research was to design a good multimodal interface that combines speech recognition and pen input. Moreover, the audio prompts explicitly invited them to use speech recognition to fill the two fields of the form.

3.2.1 The scenarios

In the user interface no distinction was made between searching for private names or business names. In the scenarios the experts were asked to look for the numbers of two persons (e.g., find the number of the Bakker family in Barendrecht) and eight businesses (e.g., find the number of a congress centre in Amsterdam). In order to see how users would behave when they did not know the city name, one scenario only mentioned the business name without the corresponding city name (i.e., find the number of the Tax Office).

Fig. 1 The first version of the design (cue window, city name and listing fields, and the Correct and Search buttons)

3.3 Results

The experts praised the user-friendliness of speech compared to typing or handwriting. However, it became clear that the first design, which basically added speech to an Internet-based DA design, was not appropriate. The main problems mentioned by the experts relate to the fact that it was not clear which actions could be executed by speech and which by pen. Specifically, it was not clear that they could not use speech for correcting recognition errors. In general, it was not clear when the speech recogniser was open. Moreover, the auditory prompts masked the visual cues at the top of the screen. According to the experts, users should be able to choose for themselves whether to correct a wrongly recognised item by pen or by speech. The main conclusion was that much more attention should be paid to the graphical design of the interface to guide the users through the interaction. It is reasonable to assume that the problems encountered by the experts are caused by the 'hidden' switch between sequential and uncoordinated simultaneous use of the two input modes. Apparently, such a switch in the options available to the users can only be tolerated if the output of the system makes it unambiguously clear which operation mode is selected.

4 Second cycle

4.1 Design focus

We tried to solve the design problems observed in the first cycle. First of all, we decided to revert to purely sequential operation of the input modes. Two microphone buttons, one for each name field to be filled in, were added to activate the speech recogniser (see Fig. 2). Visual feedback was provided only in the active text box. No auditory prompts were used. Speech input could also be used for correcting errors, again by clicking the microphone button to the left of the field. Corrections could also be made by pen. To that end the user had to activate the soft keyboard by clicking the icon on the status bar. To correct text they had to click in the field, after which conventional editing facilities (deleting and adding characters) were available. It was sufficient to type the first letters of a city name to get the full name on the screen (auto-completion). A pull-down menu could be invoked using a drop-down button to see the n-best results of the recogniser (for city names), or a list of different spellings of names (for proper names). Users had to indicate by pen clicking whether they searched for a private or a business listing. They also had to use the pen to initiate a search, after they had completed the form.

Fig. 2 The second version of the design
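The two pen-based aids mentioned above, auto-completion of city names and the N-best drop-down menu, can be sketched as follows. The tiny lexicon, the function names and the N-best format are illustrative assumptions; the real service used the full lexicon of Dutch city names.

```python
# Illustrative sketch of the two pen-based aids in the second design:
# auto-completion of a typed city-name prefix and the N-best drop-down menu.

CITY_LEXICON = ["Amsterdam", "Amstelveen", "Barendrecht", "Moordrecht", "Nijmegen"]

def autocomplete(prefix: str, lexicon: list[str] = CITY_LEXICON) -> str | None:
    """Return the full city name as soon as the typed prefix is unambiguous."""
    matches = [city for city in lexicon if city.lower().startswith(prefix.lower())]
    return matches[0] if len(matches) == 1 else None

def nbest_menu(nbest: list[tuple[str, float]], top_n: int = 5) -> list[str]:
    """Entries for the drop-down menu: the top-N recogniser hypotheses."""
    return [word for word, _score in sorted(nbest, key=lambda x: -x[1])[:top_n]]

print(autocomplete("Baren"))   # 'Barendrecht'
print(autocomplete("Am"))      # None: still ambiguous between two cities
print(nbest_menu([("Moordrecht", 0.62), ("Barendrecht", 0.31)]))
```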

4.2 The user test

Six users participated in the test: two women and four men. They were novice users who had never used the multimodal Directory Assistance service before. All subjects had at least some experience with the Internet. None of the subjects had previous experience with an iPAQ (but one subject owned and used a small electronic organiser). The test leader did not explain or demonstrate the service. Most importantly, (s)he did not explain that information could be entered via speech recognition. The subjects were asked to execute two exercise tasks to learn how to operate the pen, the soft keyboard and the microphone of the iPAQ. Each user executed ten scenarios. When a subject did not use speech to enter an item in the first three scenarios, the test leader asked for their ideas about the meaning of the microphone buttons. (S)he also encouraged subjects to use these buttons in the following scenario. The test leader conducted a semi-structured interview with each subject at the end of the test.

4.2.1 The scenarios

Subjects had to indicate in the user interface whether they were looking for a private name or a business name. The scenarios included six private names (e.g., find the number of the Essink family in Moordrecht) and four business names (e.g., find the number of a congress centre in Amsterdam). Two scenarios contained names that were not included in the ASR lexicon. This forced users to enter these names via the pen. Three scenarios contained private names that could be spelled in multiple ways (for example, Essink and Essinck). The spelling shown by default in the DA service did not match the spelling indicated in the scenario. This forced users to correct the spelling via the pull-down menu or via the soft keyboard.

4.3 Results

The overall conclusion of the test is that the second version of the interface was quite transparent. None of the subjects encountered significant usability problems. Only one of the subjects used speech from the beginning of the test. The other five had to be encouraged after the third scenario. Once asked, they all understood the meaning of the microphone button. One subject did not use speech input at all, because she was already accustomed to pen input (she had a small organiser). The remaining subjects used ASR after they became aware of this functionality.


Subjects were able to switch between the modalities in this application and most of them considered speech input more user-friendly than text input. However, when recognition errors had to be corrected, subjects showed a preference for the pen. After making a correction in the city name field, subjects usually reverted to speech input for filling the name field. Only one of the subjects understood that the drop-down button opened the menu containing the N-best results when the output of the recogniser was not correct (in the case of city names), or a list of alternative spellings of a name (in the case of residential listings). When subjects typed a city name, they did not notice the auto-completion.

5 Discussion and Conclusions

From our experiments it appears that mixing sequential and uncoordinated simultaneous operation of ASR and pen input should be avoided. Subjects get confused if their input options change without clear notice, and giving that notice in an unambiguous way may be quite difficult. On the other hand, our subjects had no problems with a Tap-'n-Talk style interaction [10], where they had to use the pen to indicate a field and to activate ASR. On the contrary, selecting a field and activating ASR are most probably experienced as a single action. Our design differs from the interface described in [10] in that users need just a single click, which combines the selection of the field and the activation of the ASR system. Although this single user action corresponds to two information items for the system, it never confused the subjects. It should be clear that this combination of actions is only possible as long as the number of fields on the screen is small; otherwise, the graphical layout of the screen would become cluttered with icons that would probably become too small to be useful. In our experiment naive users did not discover spontaneously that ASR could be used to enter names in a multimodal DA service, despite the fact that the screen shows a microphone icon. However, once alerted to the buttons, they understood their meaning, and most subjects then started using ASR instead of the soft keyboard. Neither did naïve subjects understand the function of a drop-down button to display N-best lists. Therefore, it appears that the use of ASR in these services requires some form of explicit instruction, or, alternatively, some kind of familiarisation with ASR as an input device. This shows the need for standardised interface designs that are similar for all services on a given family of terminals, so that users have to learn the functions (and icons) only once. Once subjects understood that they could use ASR, the majority preferred speaking over the use of the soft keyboard. It also appeared that they had no difficulty in using the pen to activate ASR, or in switching between ASR and pen input to correct ASR errors. Concluding, for form-filling applications (like train timetable or flight information), and especially for those that require a fixed order of filling the forms (seen from a speech recognition point of view, like DA or movie information), sequential multimodal interaction using a Tap-'n-Talk like approach works very well for the user. If much care is paid to the graphical interface, providing relevant feedback at the right location on the screen, the user is able to combine speech and pen in an intuitive way.
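As a concrete illustration of the single-click Tap-'n-Talk behaviour discussed above, the sketch below lets one tap on a field's microphone button both select that field and open the recogniser for it. The dummy recogniser and the names used here are assumptions made for the example; they do not reflect the actual iPAQ implementation.

```python
# Illustrative sketch (not the actual implementation) of the single-click
# Tap-'n-Talk behaviour: one tap on the microphone button both selects the
# field and opens the recogniser for it, so a single user action carries two
# information items for the system.

class DummyRecogniser:
    """Stand-in for the remote recogniser; the real system used SpeechPearl."""
    def recognise(self, field_name: str) -> str:
        # A fixed answer per field keeps the example self-contained.
        return {"city": "Amsterdam", "listing": "congress centre"}.get(field_name, "")

class TapNTalkField:
    def __init__(self, name: str, recogniser: DummyRecogniser):
        self.name = name
        self.value = ""
        self.recogniser = recogniser

    def on_mic_button_tap(self) -> None:
        # Selecting the field and activating ASR are one and the same action;
        # the result is shown immediately in the field as visual feedback.
        self.value = self.recogniser.recognise(self.name)

city = TapNTalkField("city", DummyRecogniser())
city.on_mic_button_tap()
print(city.value)   # 'Amsterdam'
```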

6 Acknowledgements

This work has been done within the framework of the IST project SMADA.

REFERENCES

1. Béchet, F., den Os, E., Boves, L., Sienel, J. "Introduction to the IST-HLT project Speech-driven Multimodal Automatic Directory Assistance (SMADA)". Proc. ICSLP-2000, Beijing.
2. http://www.eurescom.de
3. Sturm, J., den Os, E., Boves, L. Issues in Spoken Dialogue Systems: Experiences with the Dutch ARISE system. Proc. ESCA Workshop on Interactive Dialogue in Multi-Modal Systems, Kloster Irsee, June 22-25, 1999, pp. 1-4.
4. GALAXY Communicator (http://fofoca.mitre.org)
5. Wahlster, W., Reithinger, N., Blocher, A. SmartKom: Multimodal Communication with a Life-Like Character. Proc. EUROSPEECH, Aalborg, Denmark, 2001, pp. 1547-1550.
6. Lobby7 (http://www.Lobby7.com)
7. W3C, Multimodal Requirements for Voice Markup Languages, http://www.w3.org/TR/multimodal-reqs.
8. Oviatt, S. Taming recognition errors with a multimodal interface. Communications of the ACM, vol. 43 (9), 2000.
9. Suhm, B., Myers, B., and Waibel, A. Multimodal Error Correction for Speech User Interfaces. Interactions, January + February, 2001.
10. Huang, X. et al. MIPAD: A multimodal interaction prototype. Proc. ICASSP-2001.
11. Kvale, K., Narada, D.W., Knudsen, J.E. Speech-Centric Multimodal Interaction with Small Mobile Terminals. Proc. NORSIG-2001, 18-20 October 2001, Trondheim.