Technology-Driven Design of Speech Recognition Systems

Catalina Danis and John Karat
IBM T. J. Watson Research Center
PO Box 704
Yorktown Heights, NY 10598
[email protected] and [email protected]

ABSTRACT

End-users and application developers are increasingly considering use of large vocabulary automatic speech recognition (ASR) technology for tasks that involve entering large volumes of text into a computer. Interest is in part fueled by the overwhelmingly positive reviews the technology is receiving in the trade press and at major trade shows. While acknowledging the impressive advances in ASR technology in recent years, critics nevertheless point out that problems with ASR-enabled applications currently preclude them from being broadly considered viable alternatives to keyboard input. In this paper, we argue that to become a generally viable alternative to keyboard input, ASR needs to undergo a transformation from a laboratory technology into a human-computer interaction (HCI) technique. That is, we must discover how the technology should be used to support users engaged in productive work. We propose that to bring this about, designers must engage in building applications grounded in real work contexts now, even though the technology is still at an immature stage of development. We call this approach technology-driven design to emphasize our goal of advancing the technology in our design activities. Not as apparent in this label, but of great importance to our approach, is a commitment to the involvement of users in every aspect of system design.

KEYWORDS: Speech recognition, speech user interface, design, dictation, technology-driven design.

INTRODUCTION

ASR technology is capturing the imagination of the end-user and software developer communities alike. In the past two years, most major trade publications have prominently featured at least one story on this technology, which has the potential of redefining the way people interact with computers. One frequent theme is that the breakthroughs that have been expected each decade since the sixties are finally taking place in the nineties (Rash, 1994). Two types of events are fueling this optimism. First, ASR is increasingly being thought of as tenable, solid technology by the public because of the variety of "best technology" awards it has recently received (e.g., Discover magazine's award for technological innovation in 1994; "Best of Comdex", Spring and Fall, 1993). Such awards tend to draw people's attention to the technology and serve to get them excited about its potential. Second, the price of the technology has dropped significantly in the past year, to the point where people place it in the "expensive toy" category (software and hardware for a 20K vocabulary system that runs on a personal computer is now available for under $1000). The result is that as a much larger group of people get access to the technology, more creative ideas are generated about its potential uses.

As one moves to evaluate the impact of large-vocabulary ASR technology, it is impossible to ignore the fact that there are relatively few applications available on the market and that these tend to be aimed at niche markets. Generally very positive reviews are given by users whose task requirements or physical constraints (e.g., people who cannot type because of a permanent or temporary disability) lead them to try speech-enabled applications (e.g., Danis et al., 1994). However, active skilled typists tend to find writing with a speech-enabled tool frustrating because on the whole it slows them down relative to typing (Danis, 1995). In our experience, as HCI specialists working with one large vocabulary ASR system for almost a combined decade, we agree that a general purpose large vocabulary ASR interface (i.e., one with a vocabulary of twenty thousand words or more which is suitable for document creation tasks) that can compete successfully with keyboard-based applications has not yet been developed. We believe this is true even though user communities report considerable satisfaction with the technology and one of us has been using a speech recognition system regularly for the creation of text for over a year. It is the goal of our work in speech-enabled interfaces to bring about a better integrated, more effective human-computer interaction technique for some well-chosen applications. The nature of the technology, specifically its potential for redefining the work roles of individuals, makes it necessary for investigations of ASR technology to be done with users in productive work contexts.

WORKING WITH EVOLVING TECHNOLOGY

If we contend that ASR technology is not yet mature and that we are committed to meeting users' needs through our development efforts, how can we argue that the real-world task context is the proper "laboratory" for application development? First, we firmly believe that there are currently a number of niche markets, for example, special needs and medicine, where speech-enabled dictation applications based on current ASR technology are viable. By carefully selecting our customers from these areas, we can develop applications that add value to users. Second, we share a belief in the tenets of user-centered design (e.g., see Norman & Draper, 1986) and participatory design (see Muller & Kuhn, 1993 for a discussion) with many of our colleagues in HCI. Happily, the importance of involving users from user groups targeted by the application is by now part of mainstream thought, as there is ample evidence of its benefits (e.g., Greenbaum & Kyng, 1995; Keil & Carmel, 1995). Third, the potential of ASR technology to transform the work roles of its users requires that the individual, social and task contexts be addressed and brought into balance in the course of application development. Since it is impossible to replicate this multi-faceted environment within a laboratory, the design work must be situated in the context of actual work environments.

While in many ways the approach we take can be viewed as just another example of user-centered design in practice, we do feel that our dual focus of advancing the technology while developing usable systems provides a different focus to our approach than we would have if we were completely "user-centered." The domains we select to work in and the interaction techniques we are willing to consider are colored by our specific interests. This does not radically change our general approach to design (Karat, 1995), but it does influence things we watch for and design decisions we make.

WHY IS TECHNOLOGY-DRIVEN DESIGN NECESSARY?

We want to begin by offering a general description of "maturity" applied to HCI techniques. Mature HCI techniques can be thought of as those that have well-understood technology bases and some general understanding concerning their domain of applicability. When described in this way, it should be clear that maturity is a relative rather than a binary term when describing HCI techniques. Without making too much of exactly where on a maturity scale various techniques would be, we think it would be generally agreed that keyboards and mice are relatively mature compared to ASR or gesture recognition. We know a fair amount about keyboard design, and have developed many guidelines for use of this device.

It is important to distinguish between ASR as a technology and as a human-computer interaction technique. A focus on ASR as an HCI technique requires that the technology be considered in the broader context of its ability to support task completion under production conditions (e.g., deadlines, teamwork, etc.). In order for ASR to function as a mature HCI technique it must address such issues as recognition errors, variability in users' composition styles, the affordances of speech and the integration of multi-modal input. The solutions to these and other issues related to speech input will constitute a speech user interface (SUI) which will necessarily be strikingly different from today's graphical user interfaces (GUIs).¹

Towards a more mature ASR technology

Part of the reason that ASR is not a mature HCI method derives from the fact that the technology itself is not yet mature. One significant indicator is the lack of general agreement about the capabilities of the technology. Performance characteristics for a mature technology can be expected to be well understood and its design specifications should be well established. For example, the QWERTY keyboard design is an example of a mature technology.² Its performance characteristics are stable under a variety of describable situations and the same basic key organization has been stable within type of device (e.g., electric typewriter, PC keyboard).

Performance characteristics are not well understood for ASR, the claims found in marketing literature of across-the-board 95% accuracy notwithstanding. Marketing literature is relevant because it shapes the expectations of typical administrators when they first consider including ASR in a system being built by their IS group for internal use. In reality, characterization of ASR performance is much more complex and cannot be captured by a single number. The fact is that many factors affect recognition performance in ways that are not well understood. Research indicates that ASR performance varies as a function of task domain, match between task domain and dictated text, and speaker (e.g., Brown and Vosburgh, 1989). The complexity of language usage (a language model) as reflected in the variety of word sequencing (a measure called perplexity) indicates that dictation in a domain such as radiology is much easier for an ASR system than dictation in a domain such as journalism. The source of the difference derives from the fact that the recognition decision in any large vocabulary system in part depends on the ability of the language model (LM) to predict the sequence of words that the user will produce. Because the language use in radiology tends to be relatively stereotyped, it is easy to create an LM based on an analysis of radiology corpora that accurately reflects the language use of radiologists. In journalism, the goal is often the creation of images in the readers' minds rather than simply communicating a set of facts. It is correspondingly more difficult to capture this type of language use in an LM using the techniques currently available.

A related factor that affects recognition accuracy is the closeness of the fit between the dictated material and the LM used. As an extreme example, consider a journalist using a radiology LM. The sequences of dictated words would not correspond well and therefore the ability of the recognizer to predict the journalist's language use would be poor.
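To make the role of the language model concrete, the sketch below (our own illustration, not the recognizer discussed in this paper) trains a tiny add-one-smoothed bigram model on invented "radiology-like" sentences and reports perplexity on matching and mismatching test text; lower perplexity corresponds to word sequences the model predicts well.

import math
from collections import Counter

def train_bigram(sentences):
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    vocab_size = len(set(unigrams))
    def prob(prev, word):
        # Add-one smoothing so unseen word pairs keep a small, nonzero probability.
        return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
    return prob

def perplexity(prob, sentences):
    log_sum, n = 0.0, 0
    for words in sentences:
        padded = ["<s>"] + words
        for prev, word in zip(padded, padded[1:]):
            log_sum += math.log(prob(prev, word))
            n += 1
    return math.exp(-log_sum / n)

# Invented, stereotyped "radiology-like" training sentences.
reports = ["the chest x-ray is normal".split(),
           "the chest x-ray shows no acute disease".split()]
prob = train_bigram(reports)
print(perplexity(prob, ["the chest x-ray is normal".split()]))                   # low: text fits the LM
print(perplexity(prob, ["the mayor promised sweeping reforms today".split()]))   # much higher: poor fit

Run on this invented data, the stereotyped test sentence yields a far lower perplexity than the out-of-domain one, which is the numerical counterpart of the radiology-versus-journalism contrast described above.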

¹ For a discussion of speech user interfaces for tasks with a heavy emphasis on command and control, see Yankelovich, et al., 1995.
² The fact that the desirability of the current design is coming into question (e.g., Elmer-DeWitt, 1992) due to its implication in repetitive stress injuries is not relevant here.

This closeness-of-fit factor operates widely, in more subtle ways, under conditions of real use. For example, the language use of journalists writing a straight news story which answers the "five wh-questions" is much more constrained than their writing in a feature story. Because this will affect recognition performance negatively, researchers are working on language model adaptation techniques aimed at developing LMs which respond dynamically to such changes in style (Jelinek, et al., 1991).

Another factor that influences accuracy is covered by the rubric of "individual differences". It is a fact that even when using speaker statistics that capture the individual's style of speaking (called speaker-dependent statistics), some speakers are better recognized than others. Furthermore, these differences do not seem to disappear with practice (Danis, 1989). As a contrast, imagine keyboards which had different recognition characteristics for male and female typists. Clearly, much more work has to be done on the modeling of speakers' voices within the context of ASR technology.

Another fact that indicates the lack of maturity of ASR is that currently there are no viable continuous speech systems on the market capable of supporting free dictation. Research prototypes exist in several institutions, but none is capable of working accurately enough or fast enough on a small enough machine to make it commercially viable for interactive use. Systems which require short pauses between words (called discrete word input systems) represent a compromise solution which makes the recognition task easier because it eliminates the need for segmentation of speech into words. While systems built on discrete speech engines are being sold and are accepted by niche segments in the population, the bulk of the users we have talked with contend that users who have other options (e.g., dictation followed by human transcription, typing) will not use ASR for dictation in significant numbers until continuous speech input is possible.

WHAT NEEDS TO CHANGE?

The immature state of ASR should not be viewed as an impediment to using it in the field, nor should it be used to conclude that ASR is not a good interaction technique. It is a central part of our argument that it is unwise to wait until ASR technology is mature before applications are built based on it. The problem with mature technology is that it is difficult and expensive to change (consider various attempts to change keyboard layouts after we found possibly "better" ways to arrange the keys). The technology does, however, need to function reliably (i.e., it does not crash regularly, delivers the specified functionality, and is applicable to the chosen domain) in order for development to begin. In addition, the technology has to be in a state where the basic problems have been solved, leaving the developers of the technology free to extend it by incorporating the functionality identified in development efforts and to modify those aspects that do not work.

Towards ASR as an HCI technique

Based on our experience we have identified several major issues that need to be addressed before we are comfortable that we have really identified the characteristics of a speech user interface. Currently available dictation interfaces that use ASR as input are GUIs retrofitted with speech.

First, there is a general lack of appreciation of what it means to work with a recognition technology such as ASR. For speech recognition there is a class of errors (called misrecognitions) in which the user speaks the intended word, but the system recognizes it as something else. People consider the existence of misrecognitions to be an indication that the technology is immature, rather than accepting it as a defining characteristic of the technology. Error rates will decrease as basic ASR technology issues, including some of the ones we mentioned when discussing the immaturity of the technology, are successfully addressed. However, recognition errors will always exist, if only because in a dictation task users will always use words which are not in the recognizer's vocabulary.

The distinction between deterministic input technologies, such as a mouse and a keyboard, and technologies which have a probabilistic step interposed between the user action and the system response, such as ASR and handwriting recognition, is important. The appearance of errors in deterministic technologies is much easier for a user to understand and to avoid than are the errors that result in ASR. For example, if I type "ans" where I intended to type "and", I quickly come to the hypothesis that my finger accidentally hit a neighboring key and I can resolve to be more precise in the future. However, the source of an error in speech recognition is much harder to diagnose. This is in part due to there being several factors that affect the output of the recognizer. Did I misspeak? Did I use a sequence of words that are unfamiliar to the system? Or was it an interaction of the two factors?

The trouble users of ASR systems have in understanding the source of recognition errors (and therefore, how to avoid them in the future) can be illustrated by an example of the behavior of users who were explicitly charged with learning to be better recognized by one large vocabulary ASR system (Danis, 1989). A user who was having problems running words together in a discrete word system hypothesized that he could indicate where words ended by hyper-articulating the final consonant in words. This led to further mistakes where, for example, uninflected words were recognized in various inflected forms (e.g., heart => hearts, detail => detailed).


Recognition errors can also do a great deal of damage to a user's documents; considerably more than can be caused with a keyboard. If, for example, a text word is misrecognized as a command, and this is not detected immediately by the user, a document can suffer serious damage. Often this is enough for users to discard entire paragraphs (Danis, 1995). Thus, we consider designing for errors one of the most important tasks for designers of speech recognition systems. Part of designing for errors includes developing methods of error correction which (1) help the system learn (i.e., eliminate repeated machine-caused mistakes of that sort), (2) are well integrated into the task so that the user's attention is not consumed by the use of the technology but can remain on the task, and (3) are easy and fool-proof to execute (Rhyne & Wolf, 1993). We would argue that such techniques have evolved for keyboard and mouse interfaces, but do not immediately extend to a SUI.
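As a deliberately simplified illustration of point (1), the sketch below is our own example rather than the correction mechanism of any system discussed here: it remembers the corrections a user makes and uses them to re-rank the recognizer's alternative-word list, so a repeated machine-caused mistake becomes less likely to resurface. A real recognizer would adapt its acoustic and language models; this toy version only re-orders candidates.

from collections import defaultdict

class CorrectionMemory:
    def __init__(self):
        # (misrecognized word -> corrected word) counts for this user.
        self.fixes = defaultdict(lambda: defaultdict(int))

    def record(self, recognized, corrected):
        if recognized != corrected:
            self.fixes[recognized][corrected] += 1

    def rerank(self, candidates):
        # `candidates` is a list of (word, score) pairs, best first.
        # Promote candidates the user has previously chosen as corrections.
        top_word = candidates[0][0]
        boosts = self.fixes.get(top_word, {})
        return sorted(candidates,
                      key=lambda ws: (-boosts.get(ws[0], 0), -ws[1]))

memory = CorrectionMemory()
memory.record("heart", "hearts")           # the user fixed an earlier misrecognition
candidates = [("heart", 0.61), ("hearts", 0.35), ("hart", 0.04)]
print(memory.rerank(candidates)[0][0])     # now prefers "hearts"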

Equally important is the development of alternative methods for manipulating the interface. While this is a generally agreed-upon guideline in many interface systems (e.g., IBM, 1992), it is particularly important when working with HCI input techniques which are fallible.

Speech as a viable HCI technique must also be able to accommodate the major text creation styles that people use. Broadly, one can distinguish between writing in draft mode and writing by tinkering. In draft mode, writers enter a complete piece into the computer and begin editing only after all of the main thoughts have been captured. In contrast, a tinkering style combines what are separate text entry and editing stages in draft mode, so that the text is complete (except for fine tuning) as the writer moves down the piece. Reporters who participated in the StoryWriter project (Danis, 1995) tended to use a tinkering style. Such a style is very demanding for a speech-only interface because the user needs to move between a large vocabulary dictation task and a small vocabulary command and control task rapidly and frequently. Several problems will need to be solved to make this a smooth process, including the communication of mode switching information to the system with an easy and fail-safe method.
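One possible shape for such a mode-switching mechanism is sketched below; the reserved keywords and the command vocabulary are assumptions of ours for illustration, not a design from the StoryWriter project.

from enum import Enum

class Mode(Enum):
    DICTATION = "dictation"
    COMMAND = "command"

COMMAND_VOCAB = {"delete-word", "move-right", "cursor-up", "stop-command"}

class ModeSwitcher:
    def __init__(self):
        self.mode = Mode.DICTATION

    def handle(self, word):
        # Route a recognized word to text entry or to command execution.
        if self.mode is Mode.DICTATION:
            if word == "begin-command":           # reserved, fail-safe keyword (assumed)
                self.mode = Mode.COMMAND
                return ("mode", Mode.COMMAND)
            return ("text", word)                 # everything else is dictated text
        else:
            if word == "stop-command":
                self.mode = Mode.DICTATION
                return ("mode", Mode.DICTATION)
            if word in COMMAND_VOCAB:
                return ("command", word)
            # Unknown word while in command mode: reject rather than damage the document.
            return ("reject", word)

sw = ModeSwitcher()
for w in ["the", "report", "begin-command", "delete-word", "stop-command", "is"]:
    print(sw.handle(w))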

Another critical component of an ASR-enabled interface is an understanding of the affordances of speech. It is a sign of growing confidence among researchers and developers of speech-based interfaces that they are willing to tell those controlling development funding that there are some aspects of tasks that are better handled by modalities other than speech. For example, in the StoryWriter project, which produced a low function editor for newspaper reporters suffering from repetitive stress injuries (RSI), we recommended use of a mouse for item selection combined with speech for specification of the action (Danis, et al., 1994). We did include an all-speech procedure for selection (e.g., cursor-up five, move-right three, delete-word) with the intention that only those users whose injury prevented them from even moving a mouse (our design had eliminated the requirement for clicking the mouse, inferring selection by measuring elapsed time) would use it. Subsequent evaluations confirmed our expectations (Danis, 1995).
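The sketch below illustrates the click-free selection idea in simplified form; the dwell threshold and event names are our own assumptions, not the StoryWriter implementation.

import time

DWELL_SECONDS = 0.8   # assumed threshold: rest the pointer this long to select

class DwellSelector:
    def __init__(self, dwell_seconds=DWELL_SECONDS):
        self.dwell = dwell_seconds
        self.current_target = None
        self.entered_at = None

    def on_pointer_move(self, target, now=None):
        # Call whenever the pointer is over `target` (or None for empty space).
        # Returns the target once the pointer has rested on it long enough.
        now = time.monotonic() if now is None else now
        if target != self.current_target:
            self.current_target, self.entered_at = target, now
            return None
        if target is not None and now - self.entered_at >= self.dwell:
            self.entered_at = now              # avoid re-selecting on every update
            return target
        return None

sel = DwellSelector()
print(sel.on_pointer_move("word:headline", now=0.0))   # None (pointer just arrived)
print(sel.on_pointer_move("word:headline", now=0.9))   # "word:headline" is selected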

The immaturity of ASR as an HCI technique is further reflected in the fact that most current operating system software is not designed to accept speech input. Rather, typical current speech applications trick the operating system into treating speech input as if it were coming from the keyboard. This sort of hack works only within certain limitations. It begins to break down when one tries to integrate multiple modalities in a way which allows the overlapping of events, as can happen when a user points to targets with a mouse and specifies the action with a speech command. The tight integration of a time-delayed input method (e.g., ASR) with an immediate input method (e.g., a mouse click) is problematic when the operating system expects input from only one source. There are currently efforts underway to develop operating systems that are multimodal as well as pen- and speech-centric alone.

The growing maturity of ASR as an interaction technique will be indicated by the narrowing of the significant gap that exists between the impressive performance of demoers of ASR technology today and the experience of users under unconstrained usage conditions. Skilled demoers can and do impress prospective clients with their manipulation of the technology. However, when the technology is used to complete tasks under typical job constraints, the shortcomings become apparent. Solving the problems noted in our discussion above should bring us considerably closer to closing this gap.
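The sketch below illustrates one way the timing mismatch could be handled in principle (our own illustration; the time window and event names are assumptions): a delayed speech result is paired with the pointer selection that was current when the utterance began, rather than when recognition finished.

from collections import deque

class MultimodalFuser:
    def __init__(self, max_skew=1.5):
        self.max_skew = max_skew            # assumed tolerance, in seconds
        self.selections = deque(maxlen=32)  # recent (timestamp, target) pairs

    def on_selection(self, timestamp, target):
        self.selections.append((timestamp, target))

    def on_speech(self, spoken_at, command):
        # Pair a (delayed) speech command with the selection closest in time
        # to when the utterance started, not to when recognition finished.
        best = None
        for ts, target in self.selections:
            if abs(ts - spoken_at) <= self.max_skew:
                if best is None or abs(ts - spoken_at) < abs(best[0] - spoken_at):
                    best = (ts, target)
        return (command, best[1]) if best else (command, None)

fuser = MultimodalFuser()
fuser.on_selection(10.0, "paragraph 2")
fuser.on_selection(12.0, "headline")
# Recognition result delivered at t=13.5 for speech that began at t=10.2:
print(fuser.on_speech(spoken_at=10.2, command="delete"))   # pairs with "paragraph 2"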

INDIVIDUAL AND ORGANIZATIONAL IMPACTS OF ASR

The introduction of ASR-enabled applications for document creation into an environment where users previously depended on transcription services or their own keyboard use produces profound changes for individual end-users and for the organization in which they work. To some extent such changes occur whenever any new system is introduced into an organization. For example, consider the introduction of word processors into office environments. This redefined the expectations between a document originator (e.g., an insurance adjustor, a mid-level executive) and the clerical worker responsible for producing the document. For example, typographical errors corrected with "white-out" and ragged margins were no longer acceptable. Similarly, a secretary had to come to expect multiple drafts of the document before it was finally sent out. However, we maintain that the impact of the introduction of speech-enabled applications is much larger because the costs and benefits that accrue from this technology affect the individual and the organization in potentially opposite ways. The source of this conflict derives from the power that ASR-enabled dictation tools give the document originator. While the ability to produce one's documents from start to finish can, on the one hand, be viewed as empowering, there is a corresponding negative effect as well. ASR can potentially broaden the job description of the lawyer, doctor, and executive to include that of secretary and clerical worker.

There are multiple impacts on the individual of this work-role redefinition. First is the issue of perceived status. In the status hierarchy that operates in American workplaces, the knowledge worker holds a higher position than a clerical worker. ASR technology as presented in marketing sound-bites ("you speak and the machine types it out") is very attractive to administrators and end-users alike because the work of keying is done by means of an exciting new technology. Doctors and other professionals we have worked with over the years have commented that they feel that making corrections, especially by keyboard, lowers their status (in fact, the loss of status associated with keyboard use is what causes medical administrators to turn to ASR instead of keyboarding for text entry in the first place).

Not only can professionals feel a loss of status as a result of using an ASR-enabled system, they typically begin their redefined job at the level of an inexperienced clerical worker because they often do not have the skills required to produce a fully formatted document. It is when they take on these additional responsibilities that they realize how much knowledge the clerical worker brings to the document production task (Danis, 1991). Automation can be brought to bear on many of these problems, but currently the word processor is seriously deficient without an operator knowledgeable about domain-specific formatting conventions.

An additional impact on most users of ASR-enabled tools is the need to learn to dictate. Many users report an initial high cost to learning to compose by speaking. Reporters who have used an ASR-enabled editor for a year describe the change as similar to moving from composing by handwriting to composing on a typewriter (Danis, 1995). They were initially afraid of committing their thoughts to voice and felt that they needed to have a thought well formulated in their head before they could start speaking. Some resorted to creating an outline first. The effects of the additional task and the unfamiliarity of the new composition process do impact the time that the individual devotes to document production. This can obscure the benefits for the individual, particularly if the organization does not make allowances for the fact that the individual has taken on additional responsibilities. While our experiences with the technology give us solid reason to believe that it can provide enhanced text entry capabilities (e.g., even using current technology, one of us has experienced productivity improvements for producing email correspondence using ASR), these may take some time to become evident.

It is part of the job of developers of speech-enabled applications to educate administrators to the fact that the document originator has additional work which requires an adjustment to their work schedules. For example, physicians who typically see twenty patients in the course of a day may have to decrease their load to nineteen if they have to do significant charting using ASR. On the other hand, significant improvements in medical documentation would be possible if such records were kept electronically rather than in current paper formats. Evaluation of such tradeoffs is a critical part of technology-driven design.

The sales job that the designer needs to take on is easier than it might seem at first. Currently, there are many hidden costs in processes that depend on human transcription. In the medical field these include: incomplete or greatly truncated reports, delays in availability of the physician's report to others involved in patient care, selective availability of documents online because of the cost of transcription, and repeated correction cycles, some requiring reaccessing original supporting documents (e.g., X-rays) for confirmation of report content. Part of the education process therefore involves getting people to consider the costs associated with the current non-speech-recognition solution in order to put into perspective the costs of an ASR solution. Additionally, it is important to point out the benefits the technology brings to both end-users and their administrative unit. Using the example of physicians, the main benefit they see to themselves as users of a speech-enabled dictation tool is the ability to dictate, correct and sign in a single episode. This eliminates the need to keep in mind the details of patient encounters for a day or longer until the transcribed report is returned for signature (usually several days). As a user of information, the physician greatly benefits from access to large amounts of online, codable information for purposes of outcome analysis and disease management. The chief advantage to administrators is the elimination of transcription pools. These are a source of concern in many areas because of their high costs and relative scarcity. The importance of doing technology-driven design with ASR technology is that one needs to address the conflicting needs of the organization and the individual much in advance of system delivery. Using the methodologies of participatory design it is possible to understand and possibly modify expectations and prepare the environment to address the changed roles that the technology engenders.

GUIDANCE FOR TECHNOLOGY-DRIVEN DESIGN

The activity that we refer to here as technology-driven design shares many of the ideas put forward under the heading of participatory design (see Muller and Kuhn, 1993 for a discussion of this topic). We generally consider technology-driven design as involving the dual mission of advancing technology and supporting user work, while methods for participatory or user-centered design have focused primarily on the latter element. Our intent in looking for applications to work on is not just to look for ways in which information technology can provide value in the work context, but specifically to look for applications which might be well suited to the particular hammer that we are wielding (ASR in this case). Good candidates for technology-driven design, then, are those in which the technology can really provide useful benefits and that also offer challenges to the techniques under consideration. We acknowledge the limitations we place ourselves under in doing this, but feel that advancing ASR as an HCI technique is a valuable goal for future information technology systems. We do not hesitate in letting those we work with know that working with our particular technology is part of our project goal.

What is our experience?

Attempts to introduce new technology into the workplace can fail for so many reasons that we feel it is particularly important to take considerable care with projects in which the technology itself is not mature. There is a tendency to blame the technology as the reason for such failures and prematurely dismiss it, rather than trying to build on early experiences. We believe that our explicit admission that we want to make use of immature ASR technology in a project helps keep the expectations of all parties realistic.


We do, however, think that it is important to select applications in which we have a strong reason to expect that the technology can provide users with real benefits. Involvement with the development of ASR technology has given us the opportunity to develop an understanding of some of the things to look for in identifying good candidate domains for early use of speech recognition interfaces. Careful screening of potential customers, by attending to the domain and the individuals and organization that we will be working with, is necessary, since development of a system with ASR will require considerable creativity in places where development with mature techniques can more easily rely on known solutions. Again, the case of error correction is a good example. We do not yet have generally accepted techniques for handling the various consequences of the fact that user input can be misrecognized by ASR systems.

Technology-driven design projects involve fairly high risk. Being a part of the maturation of new techniques and technologies is not for everyone or every context. For many reasons, some organizations and contexts will be more receptive as settings for relatively early technology projects. There are a number of factors that we have found to be important in the process of targeting an application area and establishing a relationship with a customer. There are some general factors, which include:

• Development for a specific customer with their involvement. Part of the targeting decision is to ensure that both the customer and the members of the development team are willing to view the project as a joint effort. This requires an effort on the part of the technology provider to resist the temptation to limit communication with the customer once a project has been started, and an effort on the part of the customer to resist thinking that the technology developer can produce the desired result without continually trying to communicate what that result might be. Another way of viewing this is to say that we make sure that all parties are willing to take part in a participatory design project in which both acknowledge the interests of the other. We need to be clear in such projects that our interests go beyond the traditional "payments for services" that usually cover the interests of the development organization.

• Commitment from the partner to use systems as they become available, even if they are not stable. In most relationships where there is a customer and a provider, the customer would prefer to know when the finished (defect-free) system will be delivered and does not want to be a part of evaluating the preliminary (buggy) systems. For immature HCI techniques, both parties must be willing to tolerate the expectations of the other. At the same time, we need to be able to provide reasonable assurances to the customer that real work carried out with preliminary systems will not be wasted (i.e., we need to keep from requiring the user to duplicate work).

• Education of the partner about the technology and how it affects them. Everyone, both on the technology provider side and on the customer side, will underestimate the impacts of new technology. Techniques such as ASR will redefine many tasks and jobs. For example, introduction of speech recognition technology might impact how offices are laid out or how someone "makes notes" while in front of a client.

Organizational and individual factors for success

In our experience we have found that it takes both a receptive organizational climate and the presence of enthusiastic individuals to create an environment in which technology has a chance to mature. We have no real formula for creating the right environment, but these are some of the guidelines we can identify for recognizing one:

• The existence of a champion for the project. We have found that there needs to be someone who shares the vision (including some awareness of possible difficulties in developing the system) and will consequently be willing to do what it takes to make it happen. While our own enthusiasm for the technology carries some weight, a strong supporter within the customer organization is more powerful.

• Willingness to accept limited or special purpose solutions. For most organizations, there is a desire to approach development of new technology as a way to change the workplace in big ways. Solutions which work for some people or some departments, but which differ from the way most of the work is carried out (i.e., which do not integrate well with the overall context), are not viewed favorably. While we believe that the use of SUIs will be pervasive in the future, we need to carefully manage expectations by "thinking small" at the current time. There are situations in which speech recognition has sufficient merit (e.g., for hands-busy tasks or for individuals who have difficulty typing) for limited-scope approaches to be seen as appropriate.

Work-task related factors for success

Techniques for HCI develop and survive because they are appropriate for some set of tasks. Keyboards are not well suited for pointing (though cursor-key techniques have been refined to make them acceptable at times), but they work well for entering text. Pointer devices are generally clumsy for entering text (though techniques such as soft keyboards are being developed for doing so), but they work well for selecting and manipulating displayed items. Other techniques, like ASR, are likely to emerge with similar characteristics: being well suited for some tasks, and not so well for others. We are still developing the ASR technique, though, and have not solidified all aspects of it (such as how to switch between text and command entry with voice). Just as the BACKSPACE key refines the keyboard and makes keyboard interaction a more mature interface technique, similar refinement will inevitably contribute to the quality of ASR as an interface technique. Judging how well the technique fits the task will be difficult to evaluate during the maturing process. We have had some success in work contexts which led to successful applications in spite of current technology limitations:

• StoryWriter: For this project there was a requirement for a system for entering and editing text in which key-press actions, including pressing mouse buttons, were to be eliminated (users were journalists with RSI). The technique did not have to be more effective than the current technique (typing), since for these users typing was impossible.

• Speech recognition for medical reporting (radiology): For this project there was a requirement for hands-free entry of information with rapid turn-around time for access. We felt that the current system (dictation followed by transcription) could easily be improved upon with the current level of ASR technology. The domain had certain characteristics (e.g., a vocabulary with words that could be reliably recognized with high accuracy, users who were already used to dictating as an interaction technique) which made it favorable.

Issues in the ongoing development process

Selection of an appropriate client and domain is only the first part of the effort. What remains is the hard work of participatory design with "use of ASR" as a constraint. Since all designs are done within a context of constraints, the adding of this one does not change the remaining process in any fundamental way. The immaturity of the interface technique has served to raise our awareness of the importance of some of the techniques of usability engineering and participatory design, and we find that we rely particularly on:

• Task analysis and early user input. Since potential users generally have no experience with the technique, we cannot rely on the accumulated experience of HCI studies to inform design for the given context. We must become domain-knowledgeable and continually review the design direction with the users.

• Use of early prototypes. The naturalness of speech is both a feature and a curse in that users think (we think incorrectly) that they know what the look and feel of interacting through voice will be like, but are fairly blind to the subtle ways in which communicating through voice will be different with immature speech interaction techniques. Prototypes of all levels of fidelity, from paper mock-ups to early releases of fully-enabled systems, must be examined in simulated use.

• Anticipating work context impacts. We work to prepare the organization for the use of an ASR-enabled application. This includes education of the individual and the organization about what to expect from the system and what not to expect. As much as possible we try to keep from creating extra work for the early prototype users (e.g., by having the work they do with the system converted into real work output).


• Designing for errors. Errors with recognition technologies are more varied and complex than errors with direct response technologies such as keyboards and pointing devices. Because we do not have a well-developed understanding of when these will occur and how to accommodate them, this aspect of design requires special attention throughout development.

CONCLUSIONS

Will the maturing of the technology lessen the need for technology-driven design? In some ways we would expect that it would. Right now there is a need for ASR technologists to carefully "push" their immature technology into domains that have not yet recognized a need for such a new HCI technique. This is necessary to help guide both the technology and the interface technique development. What we believe will develop is a mature ASR technique which can be used by information technology developers in much the same way as they use other interface techniques. What will evolve is an understanding of where the technique is applicable and how it should be used in interface design -- hopefully at a level of specification that systems designers find useful in making design decisions. The push necessary to mature a technology into a technique will become less necessary -- it will rise in use or be applicable in limited contexts depending on how well it serves in facilitating productive human-computer interaction.

If one sign of a mature technique is the existence of guidelines for its use, how long is this process of maturation likely to take for ASR? Part of the answer rests with the maturation of the technology, and questions of how long it will be before continuous speech recognition and speaker-independent recognition reach practical levels. Progress in these areas will likely continue for some time yet, but we believe that levels of performance which can feed into real applications are close to reality. The other factor in the "how long" question -- the development of a more mature ASR technique -- is also progressing and will almost certainly be well formed in the current decade. The key question to be resolved in the evolution of ASR in HCI is how well and how broadly any recognition technique (one which includes different kinds of errors than we have considered in interaction techniques in the past) will be suited to user needs. We believe that this is not the sort of question that can be answered without fielding systems that make use of the still-developing technology. Technology-driven design is trying to provide such answers.

REFERENCES

Danis, C. (1989). Developing successful speakers for an automatic speech recognition system. In Proceedings of the Human Factors Society 33rd Annual Meeting. HFS: Santa Monica, CA, 300-304.

Danis, C. (1991). Methods for formatting text produced with a speech-based editor. In Proceedings of the Human Factors Society 35th Annual Meeting. HFS: Santa Monica, CA, 364-368.

Danis, C. (1995). Feedback on long-term usage of the StoryWriter system by writers with RSI. Unpublished manuscript.

Danis, C., Comerford, L., Janke, E., Davies, K., DeVries, J. and Bertrand, A. (1994). StoryWriter: A speech oriented editor. In C. Plaisant (Ed.) Human Factors in Computing Systems - CHI '94 Conference Companion. ACM: New York, 277-278.

Elmer-DeWitt, P. (1992). Building a better keyboard. Time, October 12.

Greenbaum, J. & Kyng, M. (1995). The design challenge: Creating a mosaic out of chaos. In I. Katz, R. Mack, & L. Marks (Eds.) Human Factors in Computing Systems - CHI '95 Conference Proceedings. ACM: New York, 195-196.

IBM (1992). Object-oriented interface design: IBM Common User Access Guidelines. Que: Carmel, IN.

Jelinek, F., Merialdo, B., Roukos, S., & Strauss, M. (1991). A dynamic language model for speech recognition. In Proceedings of the Speech and Natural Language DARPA Workshop, 293-295.

Karat, J. (1995). Scenario use in the design of a speech recognition system. In J. Carroll (Ed.) Scenario-Based Design. Wiley: New York.

Keil, M., and Carmel, E. (1995). Customer-developer links in software development. Communications of the ACM, 38, 5, 33-44.

Muller, M., and Kuhn, S. (1993). Participatory design (special issue). Communications of the ACM, 36, 4.

Norman, D. A. & Draper, S. W. (1986). User Centered System Design. LEA: Hillsdale, New Jersey.

Rash, W. Jr. (1994). Talk Show. PC Magazine, December 20.

Rhyne, J. R. & Wolf, C. G. (1993). Recognition-based user interfaces. In R. Hartson & D. Hix (Eds.) Advances in Human-Computer Interaction, Vol. 4. Ablex.

Yankelovich, N., Levow, G., & Marx, M. (1995). Designing SpeechActs: Issues in speech user interfaces. In I. Katz, R. Mack, & L. Marks (Eds.) Human Factors in Computing Systems - CHI '95 Conference Proceedings. ACM: New York, 275-276.
