Evaluating Spoken Language Systems

Candace Kamm, Marilyn Walker, and Diane Litman
AT&T Labs—Research
180 Park Avenue
Florham Park, NJ 07932-0971 USA

Abstract

Spoken language systems (SLSs) for accessing information sources or services through the telephone network and the Internet are currently being trialed and deployed for a variety of tasks. Evaluating the usability of different interface designs requires a method for comparing performance of different versions of the SLS. Recently, Walker et al. (1997) proposed PARADISE (PARAdigm for DIalogue System Evaluation) as a general methodology for evaluating SLSs. The PARADISE framework models user satisfaction with an SLS as a linear combination of measures reflecting both task success and dialogue costs. As a test of this methodology, we applied PARADISE to dialogues collected with three SLSs. This paper describes the salient measures identified using PARADISE within and across the three SLSs, and discusses the generalizability of PARADISE performance models.

1 Introduction

There has recently been a great deal of interest in developing spoken language systems (SLSs) for accessing information sources or services through the telephone network or the Internet. Information sources already available in electronic form include personal calendars and files, stock market quotes, business information, product catalogues, weather reports, restaurant information, movie schedules and reviews, cultural events, classified advertisements, train and airline schedules, and personal banking and customer care services. Given the ubiquity of the phone, the potential benefits of SLSs are remote access from any phone, ease of use, efficiency, and constant availability.

A number of government-funded research projects in the early 1990s focused on creating SLSs for accessing travel information in both the United States and Europe. This led to the development of a number of Air Travel Information Systems (ATIS) at various research labs in the United States [6, 12], and Train Timetable systems at various research labs in Europe [4, 5, 1, 21]. Research labs have also developed prototype SLSs for accessing personal calendars and personal information [25]. Other prototype SLSs focus on information accessible from the web, such as stock market quotes [25], weather reports [18], restaurant information [19], and classified advertisements [15]. Some of these SLSs are already products and others are close to deployment; field studies have recently been carried out to compare a version of a Train Timetable SLS that restricts the user to single-word utterances with a version that allows users more freedom in what they say to the system [5, 3]. Given the rapid increase in the number and variety of online information sources, we expect many new SLS applications to appear over the next few years.

Given the large set of possible users for these systems, however, we must make them widely usable by supporting different styles of interaction and different expectations of the interface. Previous work has shown that an SLS's usability is highly dependent on its interface design, perhaps in part because the interface must compensate for limitations in the underlying technologies. For example, for most applications, interface designers must attempt to constrain what the user will say, and must decide whether to prompt the user with directive prompts [9] or to allow the user to take the initiative [12]. In order to evaluate the usability effects of different interface designs, we must be able to compare the performance of different versions of the SLS embodying different design choices. Recently, Walker et al. proposed PARADISE (PARAdigm for DIalogue System Evaluation) as a potential general methodology for evaluating SLSs [23]. As a test of this evaluation methodology, we have applied PARADISE to dialogues collected with different versions of three SLSs developed in our lab, representing 544 experimental dialogues and over 40 hours of speech. In the remainder of the paper we first summarize the key aspects of PARADISE and then present our experimental results. Our experimental results provide support for PARADISE as an evaluation methodology, but also highlight issues with generalizing the performance models derived via PARADISE. We conclude with a discussion of outstanding problems and issues for future research.

2 PARADISE

Figure 1: PARADISE's structure of objectives for spoken dialogue performance. [The figure shows the top-level objective, maximize user satisfaction, decomposing into maximize task success (measured by kappa) and minimize costs; costs comprise efficiency measures (number of utterances, dialogue time, etc.) and qualitative measures (agent response delay, inappropriate utterance ratio, repair ratio, etc.).]

The state of the art in evaluating SLSs, prior to the proposal of the PARADISE framework, was to evaluate the SLS in terms of a battery of both subjective and objective metrics. Some of these metrics focused on task completion or transaction success. Others were based on the performance of the SLS's component technologies, such as speech recognizer performance [22, 7, 11, 17, 16]. Subjective metrics included measures of user satisfaction [20], or ratings generated by dialogue experts as to how cooperative the system's utterances were [2].

The motivation for PARADISE is that problems arise when a battery of metrics is used. First, several different metrics may contradict one another. For example, Danieli and Gerbino compared two train timetable agents [5], and found that one version of their SLS had a higher transaction success rate and produced fewer inappropriate and repair utterances, but that the other version produced dialogues that were approximately half as long. However, they could not report whether the higher transaction success or the higher efficiency was more critical to performance. Second, in order to make generalizations across different systems performing different tasks, it is important to know how multiple factors impact performance and how users' perceptions of system performance depend on the dialogue strategy and on tradeoffs among other factors such as efficiency, usability, and accuracy.

The PARADISE framework derives a combined performance metric for a dialogue system as a weighted linear combination of a task-based success measure and dialogue costs. In order to specify what factors should go into this combined performance metric, PARADISE posits a particular model of performance, illustrated in Figure 1. The model proposes that the system's primary objective is to maximize user satisfaction. Task success and the various costs associated with the interaction both contribute to user satisfaction. The PARADISE performance function is derived by using multivariate linear regression with user satisfaction as the dependent variable and task success, dialogue quality, and dialogue efficiency measures as independent variables.

Applying PARADISE to dialogue data requires that dialogue corpora be collected via controlled experiments during which users subjectively rate their satisfaction. In addition, a number of other variables having to do with the costs of the interaction must either be automatically logged by the system or be hand-labelled. In the next section we discuss in detail the metrics that we log and hand-label for all three systems. Walker et al. did not make specific recommendations about which metrics to use, but suggested that the metrics for task success, dialogue efficiency, and dialogue quality previously proposed in the literature can easily be incorporated into their framework.

Modeling user satisfaction as a function of task success and dialogue cost metrics is intended to lead to predictive performance models for SLSs, so that values for user satisfaction could be predicted on the basis of a number of simpler metrics measured directly from the system logs, without the need for extensive experiments with users to assess user satisfaction. In order to make this predictive use of PARADISE a reality, the models derived from experiments with one set of systems or user populations should generalize to other systems and user populations. By applying PARADISE to three different systems, we can show what generalizations can be made across systems and user populations.
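To make the regression step concrete, the following is a minimal sketch (not the authors' code) of how a PARADISE-style performance function could be derived: user satisfaction is regressed on normalized task-success and cost measures, and the fitted weights form the performance function. The metric names and toy per-dialogue values below are illustrative assumptions.

```python
# A minimal sketch, assuming toy data: derive a PARADISE-style performance
# function by regressing user satisfaction on normalized measures.
import numpy as np

def zscore(x):
    """Normalize a measure so that fitted weights are comparable across metrics."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def fit_performance_function(user_sat, measures):
    """Least-squares fit of user_sat ~ w1*m1 + w2*m2 + ... over normalized
    measures; returns one weight per measure name."""
    names = sorted(measures)
    X = np.column_stack([zscore(measures[n]) for n in names])
    y = zscore(user_sat)
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return dict(zip(names, weights))

# Hypothetical per-dialogue data (one entry per dialogue).
measures = {
    "Comp": [1, 1, 0, 1, 0, 1],              # perceived task completion
    "MRS":  [.90, .80, .50, .95, .40, .70],  # mean recognition score
    "ET":   [250, 300, 420, 200, 500, 330],  # elapsed time in seconds
}
user_sat = [34, 30, 18, 36, 15, 27]
print(fit_performance_function(user_sat, measures))
```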

3 Experimental Methods

All of our experiments applying PARADISE used a similar experimental setup. In each experiment, human subjects carried out a dialogue using one of three dialogue systems: ANNIE, an agent for voice dialing and messaging; ELVIS, an agent for accessing email; and TOOT, an agent for accessing online train schedules. Each agent was implemented using a general-purpose platform for phone-based spoken dialogue systems [8]. The platform consisted of a speech recognizer that supports barge-in so that the user can interrupt the agent when it is speaking. It also provided an audio server for both voice recordings and text-to-speech (TTS), an interface between the computer running the system and the telephone network, a module for application-specific functions, and modules for specifying the application grammars and the dialogue manager.
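Purely as an illustration of this decomposition, the sketch below writes the platform's components as interfaces; the class and method names are our own assumptions, not the platform's actual API.

```python
# Illustrative interfaces for the platform components described above; the
# names and signatures are assumptions, not the platform's actual API.
from typing import Protocol

class Recognizer(Protocol):
    def recognize(self, audio: bytes) -> str: ...   # recognition result, logged per utterance
    def barge_in_enabled(self) -> bool: ...         # user may interrupt system prompts

class AudioServer(Protocol):
    def play_recording(self, path: str) -> None: ...  # pre-recorded prompts
    def speak(self, text: str) -> None: ...            # text-to-speech (TTS)

class Application(Protocol):
    def grammar(self) -> str: ...                    # application-specific grammar
    def execute(self, request: str) -> str: ...      # e.g. fetch email or a train schedule

class DialogueManager(Protocol):
    def next_prompt(self, recognition_result: str) -> str: ...  # choose the agent's next utterance
```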

The dialogues were obtained in controlled experiments. The TOOT 1 experiments [14] were designed to evaluate the effects of (1) task difficulty and (2) cooperative vs. literal response strategies. The TOOT 2 experiments studied the effect of user-adapted interaction [13]. The ELVIS experiments [24] were designed to evaluate the effect of different dialogue strategies for managing initiative and presentation of information. The ANNIE experiments [10] were designed to evaluate (1) the effects of novice vs. expert user populations and (2) the impact of a short tutorial on the novice user population. All of the experiments required users to complete a set of application tasks in conversations with a particular version of the agent. Following PARADISE, we characterized these tasks in terms of particular items of information that the user must find out from the agent. Each set of tasks was also designed to be representative of typical tasks in that domain. For TOOT, each user performed 4 tasks in sequence. Each task was represented by a scenario in which the user had to find a train satisfying certain constraints, by using the agent to retrieve and process online train schedules. A sample task scenario is as follows:



Try to find a train going to Boston from New York City on Saturday at 6:00 pm. If you cannot find an exact match, find the one with the closest departure time. Please write down the exact departure time of the train you found as well as the total travel time.

Experiments with ANNIE were designed to be comparable with ELVIS, so the users in the ANNIE experiments performed the same tasks as those for ELVIS. Each user had to perform three tasks in sequence in three different conversations with the system. In the ANNIE experiments, some users also performed an additional task as part of a tutorial interaction before starting the experimental tasks. The example email access task below was used for both ANNIE and ELVIS:



You are working at home in the morning and plan to go directly to a meeting when you go into work. Kim said she would send you a message telling you where and when the meeting is. Find out the Meeting Time and the Meeting Place.

These experiments resulted in 268 dialogues with the ELVIS system, 108 dialogues with the ANNIE system, and 168 dialogues with the TOOT system for a total of 544 dialogues, consisting of almost 42 hours of speech.

4 Data Collection

We noted above that the PARADISE framework does not specify which measures to collect. We decided to use a combination of dialogue quality and efficiency measures, mainly focusing on those measures that could be automatically logged or computed. We used three different methods to collect our data: (1) all of the dialogues are recorded; (2) the dialogue manager logs the agent's dialogue behavior and a number of other measures discussed below; and (3) users fill out web page forms after each task (task success and user satisfaction measures). Measure names are given in parentheses below.

The dialogue recordings are used to transcribe the user's utterances in order to derive performance measures for speech recognition, to check the timing of the interaction, to check whether users barged in on agent utterances (Barge In), and to calculate the elapsed time of the interaction (ET).

A number of measures related to each agent's dialogue behavior are also logged, since several loggable agent behaviors affect the quality of the resulting dialogue. We can automatically log the number of timeout prompts (Timeout Prompts) played when the user did not respond as quickly as expected, the number of Recognizer Rejections, where the system's confidence in its understanding was low and it said something like "I'm sorry, I didn't understand you", and the number of times the system played one of its context-specific help messages because it believed that the user had said Help (Help Requests). The number of System Turns and the number of User Turns are calculated on the basis of this data. In addition, the recognition result for the user's utterance is extracted from the recognizer and logged.

The transcriptions are used in combination with the logged recognition results to calculate by hand a concept accuracy measure for each utterance. Concept accuracy is a measure of semantic understanding by the system, rather than word-for-word understanding. For example, the utterance "Read my messages from Kim" contains two concepts: the read function and the sender:kim selection criterion. If the system understood only that the user said Read, then concept accuracy would be .5. Mean concept accuracy is then calculated over the whole dialogue and used as a Mean Recognition Score (MRS) for the dialogue.

The web page forms are the basis for calculating Task Success and User Satisfaction measures. Users reported their perceptions as to whether they had completed the task (Comp; Yes/No responses are converted to 1/0). They also had to provide objective evidence that they had in fact completed the task by filling in a form with the information that they had acquired from the agent. In order to calculate User Satisfaction, users were asked to evaluate the agent's performance with a user satisfaction survey. A sample survey is below:


- Was the system easy to understand in this conversation? (TTS Performance)
- In this conversation, did the system understand what you said? (ASR Performance)
- In this conversation, was it easy to find the message you wanted? (Task Ease)
- Was the pace of interaction with the system appropriate in this conversation? (Interaction Pace)
- In this conversation, did you know what you could say at each point of the dialogue? (User Expertise)
- How often was the system sluggish and slow to reply to you in this conversation? (System Response)
- Did the system work the way you expected him to in this conversation? (Expected Behavior)
- In this conversation, how did the system's voice interface compare to the touch-tone interface to voice mail? (Comparable Interface)
- From your current experience with using the system to get your email, do you think you'd use the system regularly to access your mail when you are away from your desk? (Future Use)

In order to focus the user on the task of rating the system, the survey probed a number of different aspects of the users’ perceptions of their interaction with the SLS. The surveys used for the three SLSs were identical except that the Comparable Interface question was eliminated from the TOOT survey. The surveys were multiple choice and each survey response was mapped into the range of 1 to 5. Then the values for all the responses were summed, resulting in a User Satisfaction measure for each dialogue ranging from 8 to 40.
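As a small illustration of the two derived scores just described (mean concept accuracy per dialogue and the summed survey score), the sketch below uses hypothetical utterance annotations and survey encodings; it is not the original logging code.

```python
# A small sketch of the two per-dialogue scores described above; the utterance
# annotations and survey encoding are illustrative assumptions.

def mean_recognition_score(utterances):
    """MRS: mean concept accuracy over a dialogue, where each utterance is a
    (concepts_in_utterance, concepts_correctly_understood) pair."""
    accuracies = [understood / total for total, understood in utterances if total]
    return sum(accuracies) / len(accuracies) if accuracies else 0.0

def user_satisfaction(survey_responses):
    """Sum of multiple-choice survey responses, each already mapped to 1..5."""
    assert all(1 <= r <= 5 for r in survey_responses)
    return sum(survey_responses)

# "Read my messages from Kim" has two concepts; only "read" understood -> 0.5.
dialogue = [(2, 1), (1, 1), (3, 3)]
print(mean_recognition_score(dialogue))             # (0.5 + 1.0 + 1.0) / 3 = 0.83...
print(user_satisfaction([4, 3, 5, 4, 4, 3, 4, 5]))  # eight questions -> 32
```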

5 Results of Applying PARADISE

This section presents the results of applying PARADISE to the experimental data described above. First, we present the results of applying PARADISE to derive performance functions for each system we tested.

Then we present the results of applying PARADISE to the combined data from all three experiments. Then we discuss the application of PARADISE to different user populations (subsets of our users). Finally, we discuss what these results suggest would be required in order to make generalizations.

Measure              | ELVIS System Init | ELVIS Mixed Init | ANNIE Experts | ANNIE Novices | TOOT 1 | TOOT 2
---------------------|-------------------|------------------|---------------|---------------|--------|-------
Comp                 | .83               | .78              | 1             | .73           | .81    | .78
User Turns           | 25.94             | 17.59            | 10.8          | 21.4          | 13.7   | 14.4
System Turns         | 28.18             | 21.74            | 11.7          | 25.5          | 14.3   | 14.7
Elapsed Time (ET), s | 328.59            | 289.43           | 156.5         | 280.3         | 267.2  | 225.0
MeanRecog (MRS)      | .88               | .72              | .80           | .67           | .85    | .74
Time Outs            | 2.24              | 4.15             | .64           | 2.4           | .67    | .15
Help Requests        | .70               | .94              | .08           | 3.3           | .44    | .08
Barge Ins            | 5.2               | 3.5              | 4.5           | 5.8           | .79    | 1.4
Recognizer Rejects   | .98               | 1.67             | 1.8           | 6.6           | .81    | 1.2
User Satisfaction    | 28.9              | 25.0             | 32.8          | 23.0          | 27.2   | 24.4

Table 1: Performance measure means per dialogue for different SLSs

Table 1 summarizes the measures that we collected over several different versions of the SLSs we tested. These measures show that, in the main, users were able to complete the tasks, and that experts who had learned the system could always complete the tasks despite the limitations of the interface. Not surprisingly, the experts also had the highest mean user satisfaction.

Applying PARADISE modeling using stepwise linear regression to ELVIS using the measures in Table 1 suggests that Task Completion (Comp), Mean Recognition Score (MRS), and Elapsed Time (ET) are the only significant contributors to User Satisfaction. This regression yields the following equation:

UserSat = .21 · Comp + .47 · MRS - .15 · ET

with Comp, MRS, and ET significant predictors, accounting for 38% of the variance in user satisfaction. The magnitude of the coefficients in this equation demonstrates that the performance of the speech recognizer (MRS) is the most important predictor, followed by users' perception of Task Success (Comp) and efficiency (ET).

The application of PARADISE to the TOOT 1 data shows that the most significant contributors to User Satisfaction are Comp, MRS, and BI (Barge Ins). The performance function below provides the best fit to the TOOT 1 data, accounting for 47% of the variance in User Satisfaction:

UserSat = .45 · Comp + .35 · MRS - .42 · BI

The application of PARADISE to the TOOT 2 data shows that the most significant contributors to User Satisfaction are Comp, MRS, and ET. The performance function below provides the best fit to the TOOT 2 data, accounting for 55% of the variance in User Satisfaction:

UserSat = .33 · Comp + .45 · MRS - .14 · ET

To determine which of the task success and cost factors are most predictive of user satisfaction for the ANNIE dialogues, a stepwise regression over all the measures showed that only Comp, MRS, and Help Requests were significant. These three factors accounted for 41.3% of the variance in the data and yielded the following equation:

UserSat = .25 · MRS + .33 · Comp - .33 · Helps

The finding that Mean Recognition Score (MRS) and Task Completion (Comp) are significant factors is consistent across all three models. The influence of help requests in ANNIE probably reflects the fact that the ANNIE experiments sampled subjects with different levels of expertise with the system. The influence of barge-ins in TOOT 1 reflects the fact that subjects in TOOT 1 tended to use barge-in to shorten the system responses in that experiment. All of the functions are a weighted combination of objective measures of dialogue quality, task success, and dialogue efficiency. These functions suggest that, in all three systems, more accurate speech recognition and more success in achieving task goals are important, and that shorter dialogues may also contribute to increasing user satisfaction.

Because we have collected exactly the same metrics for all three systems, we can also combine the data and model the combined data. The result of performing a regression on the combined data yields the performance function in Figure 2:

UserSat = .49 · MRS + .31 · Comp - .17 · ET + .20 · Rejs - .13 · BargeIns + .13 · Helps

Figure 2: PARADISE performance function for data from all three systems

The fact that the factors MRS and Comp are significant performance predictors suggests that the combined data generalizes from the individual results. However, the individual results also affect the combined data, since factors that were significant only in some of the experiments (e.g., ET, BI, and Help Requests) are significant in the combined data. Furthermore, using the larger combined data set, one additional feature, Recognizer Rejections, reached statistical significance.

Another reason that a function may fail to generalize to another situation is that properties of one SLS may not even occur in another SLS. One way to examine this is to look at the subset of our user data where the users were particularly successful with the SLS. We can examine this subset of the data by simply extracting those users who had very good recognition performance. We found that users who had very good recognition performance were more attuned to efficiency than our user population as a whole. This makes sense because recognition performance was effectively removed as a source of variance. It was also the case that the expert population of users in the ANNIE experiments typically had good recognition performance and thus was also more attuned to efficiency factors.
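As an illustration of the model-derivation step just described, the sketch below implements a simplified forward stepwise selection over logged measures. The published analysis used significance-based stepwise regression; here, purely as an assumption, a predictor is kept only if it improves R² by a fixed threshold, so this is not the authors' exact procedure.

```python
# A simplified sketch of stepwise predictor selection for user satisfaction.
# A predictor is kept only if it raises R^2 by at least `min_gain` (an assumption).
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit with an intercept term."""
    X = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

def forward_select(measures, user_sat, min_gain=0.02):
    """Greedily add the measure that most improves R^2, stopping when the
    best remaining candidate adds less than min_gain."""
    y = np.asarray(user_sat, dtype=float)
    chosen, best_r2 = [], 0.0
    remaining = set(measures)
    while remaining:
        scores = {name: r_squared(np.column_stack([measures[m] for m in chosen + [name]]), y)
                  for name in remaining}
        name, r2 = max(scores.items(), key=lambda kv: kv[1])
        if r2 - best_r2 < min_gain:
            break
        chosen.append(name)
        remaining.remove(name)
        best_r2 = r2
    return chosen, best_r2

# Usage (hypothetical data): forward_select({"Comp": comp, "MRS": mrs, "ET": et}, user_sat)
```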

6 Discussion and Future Work

In order to determine which are the best among multiple possible designs for SLSs, we need common metrics and tasks with which to carry out our evaluations. This paper has demonstrated progress in defining both tasks and metrics. We have presented results showing that the same method can be applied to multiple systems. To a first approximation, we show that the same set of factors are significant predictors of user satisfaction. These factors are consistent despite differences in system, interface, task, and subject population, suggesting that they are critical factors for any SLS.

It is interesting to note that previous work has often assumed that user satisfaction is primarily determined by dialogue efficiency. Our experiments suggest that the quality of the dialogue, which correlates with Mean Recognition Score, in fact has a larger effect on user satisfaction than efficiency. None of our experiments showed efficiency to be as important as dialogue quality (or task completion). This demonstrates the benefit of using multivariate analysis.

One issue that arises when attempting to compare models and make generalizations is which of the factors used as input to the performance modeling actually lead to variations in performance. For example, although mean recognition score is always a significant predictor in our models, models developed for SLSs that have consistently good recognition performance are unlikely to have mean recognition score as an important factor. Our conclusion is that the ability to generalize performance models to systems other than the ones they were developed on will depend on being able to index the system components, system domain complexity, or expected user populations as similar on a number of abstract features, such as recognizer performance or expertise of the user population.

A related issue is that in order to compare models developed for different SLSs, the same metrics must be used and user satisfaction must be calculated in a similar way. In this study we collected identical measures for all three systems. Some of these measures should be universally available in different SLSs, such as dialogue length and mean recognition score. Others, such as timeout prompts, might be specific to particular applications, since some platforms do not provide the capability to detect when the user is silent. Help messages may also be implemented in different ways. In future work we hope to identify a core set of metrics which can be shared across different sites in order to facilitate comparisons among models.
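One hypothetical way such a shared core of metrics could be recorded is sketched below as a per-dialogue logging record; the field names follow the measures used in this study, but the schema itself is our assumption, not an existing standard.

```python
# A hypothetical per-dialogue logging record for the measures used in this
# study; the schema is an illustrative assumption, not an existing standard.
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class DialogueLog:
    comp: int                        # user-perceived task completion (1/0)
    mean_recognition_score: float    # mean concept accuracy (MRS)
    elapsed_time_s: float            # elapsed time (ET), seconds
    user_turns: int
    system_turns: int
    timeout_prompts: int
    recognizer_rejects: int
    barge_ins: int
    help_requests: int
    user_satisfaction: Optional[int] = None  # summed survey score, if collected

log = DialogueLog(comp=1, mean_recognition_score=0.88, elapsed_time_s=328.6,
                  user_turns=26, system_turns=28, timeout_prompts=2,
                  recognizer_rejects=1, barge_ins=5, help_requests=1,
                  user_satisfaction=29)
print(asdict(log))
```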

References

[1] S. Bennacef, L. Devillers, S. Rosset, and L. Lamel. Dialog in the railtel telephone based system. In Proceedings of ISSD, pages 173–176, 1996.
[2] Niels Ole Bernsen, Hans Dybkjaer, and Laila Dybkjaer. Principles for the design of cooperative spoken human-machine dialogue. In International Conference on Spoken Language Processing, ICSLP 96, pages 729–732, 1996.
[3] R. Billi, G. Castagneri, and M. Danieli. Field trial evaluations of two different information inquiry systems. In 1996 IEEE Third Workshop: Interactive Voice Technology for Telecommunications Applications, IVTTA, pages 129–135. IEEE, 1996.
[4] M. Danieli, W. Eckert, N. Fraser, N. Gilbert, M. Guyomard, P. Heisterkamp, M. Kharoune, J. Magadur, S. McGlashan, D. Sadek, J. Siroux, and N. Youd. Dialogue manager design evaluation. Technical Report Project Esprit 2218 SUNDIAL, WP6000-D3, 1992.
[5] M. Danieli and E. Gerbino. Metrics for evaluating dialogue strategies in a spoken language system. In Proceedings of the 1995 AAAI Spring Symposium on Empirical Methods in Discourse Interpretation and Generation, pages 34–39, 1995.
[6] L. Hirschman, M. Bates, D. Dahl, W. Fisher, J. Garofolo, D. Pallett, K. Hunicke-Smith, P. Price, A. Rudnicky, and E. Tzoukermann. Multi-site data collection and evaluation in spoken language understanding. In Proceedings of the Human Language Technology Workshop, pages 19–24, 1993.

[7] Lynette Hirschman, Deborah A. Dahl, Donald P. McKay, Lewis M. Norton, and Marcia C. Linebarger. Beyond class A: A proposal for automatic evaluation of discourse. In Proceedings of the Speech and Natural Language Workshop, pages 109–113, 1990.
[8] C. Kamm, S. Narayanan, D. Dutton, and R. Ritenour. Evaluating spoken dialog systems for telecommunication services. In 5th European Conference on Speech Communication and Technology, EUROSPEECH 97, 1997.
[9] Candace Kamm. User interfaces for voice applications. In David Roe and Jay Wilpon, editors, Voice Communication between Humans and Machines, pages 422–442. National Academy Press, 1995.
[10] Candace Kamm, Diane Litman, and Marilyn A. Walker. From novice to expert: The effect of tutorials on user expertise with spoken dialogue systems. In Proceedings of the International Conference on Spoken Language Processing, ICSLP 98, 1998.
[11] Kai-Fu Lee. Large Vocabulary Speaker-Independent Continuous Speech Recognizer: The Sphinx System. PhD thesis, 1988.
[12] E. Levin and R. Pieraccini. Chronus, the next generation. In Proceedings of the 1995 ARPA Spoken Language Systems Technology Workshop, Austin, Texas, 1995.
[13] Diane J. Litman and Shimei Pan. Empirically evaluating an adaptable spoken dialogue system. In Proceedings of the 7th International Conference on User Modeling, 1999.
[14] Diane J. Litman, Shimei Pan, and Marilyn A. Walker. Evaluating response strategies in a web-based spoken dialogue agent. In Proceedings of ACL/COLING 98: 36th Annual Meeting of the Association for Computational Linguistics, pages 780–787, 1998.
[15] Helen Meng, Senis Busayapongchai, James Glass, Dave Goddeau, Lee Hetherington, Ed Hurley, Christine Pao, Joe Polifroni, Stephanie Seneff, and Victor Zue. Wheels: A conversational system in the automobile classifieds domain. In Proceedings of the Fourth International Conference on Spoken Language Processing, 1996.
[16] David S. Pallett. Performance assessment of automatic speech recognizers. J. Res. Natl. Bureau of Standards, 90:371–387, 1985.
[17] J. V. Ralston, D. B. Pisoni, and John W. Mullennix. Perception and comprehension of speech. In Syrdal, Bennet, and Greenspan, editors, Applied Speech Technology, pages 233–287. CRC Press, 1995.
[18] M. D. Sadek, A. Ferrieux, A. Cosannet, P. Bretier, F. Panaget, and J. Simonin. Effective human-computer cooperative spoken dialogue: The AGS demonstrator. In Proceedings of the 1996 International Symposium on Spoken Dialogue, pages 169–173, 1996.
[19] Stephanie Seneff, Victor Zue, Joseph Polifroni, Christine Pao, Lee Hetherington, David Goddeau, and James Glass. The preliminary development of a displayless PEGASUS system. In ARPA Spoken Language Technology Workshop, 1995.
[20] Elizabeth Shriberg, Elizabeth Wade, and Patti Price. Human-machine problem solving using spoken language systems (SLS): Factors affecting performance and user satisfaction. In Proceedings of the DARPA Speech and Natural Language Workshop, pages 49–54, 1992.
[21] A. Simpson and N. A. Fraser. Black box and glass box evaluation of the SUNDIAL system. In Proceedings of the Third European Conference on Speech Communication and Technology, pages 1423–1426, 1993.

[22] Karen Sparck-Jones and Julia R. Galliers. Evaluating Natural Language Processing Systems. Springer, 1996.
[23] M. A. Walker, D. Litman, C. A. Kamm, and A. Abella. PARADISE: A general framework for evaluating spoken dialogue agents. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, ACL/EACL 97, pages 271–280, 1997.
[24] Marilyn Walker, Donald Hindle, Jeanne Fromer, Giuseppe Di Fabbrizio, and Craig Mestel. Evaluating competing agent strategies for a voice email agent. In Proceedings of the European Conference on Speech Communication and Technology, EUROSPEECH 97, 1997.
[25] Nicole Yankelovich, Gina-Anne Levow, and Matt Marx. Designing SpeechActs: Issues in speech user interfaces. In Conference on Human Factors in Computing Systems (CHI), 1995.