Dynamic speech interfaces

Stefan Rapp, Sunna Torge, Silke Goronzy and Ralf Kompe

Sony International (Europe) GmbH, Advanced Technology Center Stuttgart,
Home Network Company - Europe R&D, Man-Machine Interfaces
{rapp, torge, kompe}@sony.de

Abstract

In this paper we argue for a paradigm shift towards a more dynamic construction of speech interfaces in the field, rather than detailed in-depth tuning and engineering of speech interfaces in the lab, which in general results in interfaces that are fixed in their functionality. We believe dynamic speech interfaces to be an attractive paradigm for user interfaces of mobile systems, as these encounter frequent changes in their environment, especially with respect to the emerging network technologies that are ready for plug and play and ad hoc networking. We illustrate our approach with a research prototype we call the speech unit, an example of a dynamic speech interface that is able to concurrently control IEEE 1394 devices or PC applications multilingually in German and English.

1 Introduction

Speech interfaces are ideally suited for mobile systems. In order to be mobile, these systems should be small, and hence there is not much space to equip them with a keyboard. In addition, there are situations where a mobile system cannot be put in a convenient position for typing, so it would sometimes be cumbersome to use one. Mobile systems also often have limited display capabilities, which favours speech output. They are often worn near the body, which eases speech input and output. Also, they are normally used by a single speaker, allowing improved recognition accuracy through speaker adaptation.

There are, however, also challenges involved in the interface development for mobile systems. One aspect we are particularly interested in is that, for mobile systems, changes in the environment are an essential factor. Today's technology allows us to consider ad hoc networking as well as situation and location awareness. There are several networks allowing plug and play, like IEEE 1394 (also known as i.LINK or FireWire) or USB. There is a variety of wireless network technologies with different bandwidths, ranging from line-of-sight connections such as IrDA, over personal area networks such as Bluetooth, to wireless local area networks such as IEEE 802.11b, and finally to networks spanning countries or continents (and in the future maybe even the world) such as GSM or UMTS. Especially the networks with a lower range experience frequent changes in the availability of other communication partners. If we take location awareness seriously, we will want to do something with these communication partners when they come into reach. For instance, if we enter our living room we would like to adjust our entertainment system or change programs on the TV set; if we go to a printer in a conference area, we might want to print out some notes; or, finally, if we go to a ticket machine on a railway platform, we would like to buy a ticket, to know how long we have to wait, which trains to take to reach a certain destination, etc. For all these applications it is essential that a change in the availability of a communication partner offers or withdraws functionality that must be appropriately reflected in the user interface, to give the user the possibility to make use of the offered services.

We think it is about time to consider a paradigm shift in the way user interfaces are developed. We would like to argue that interface development should be taken out of the lab, towards a dynamic construction of user interfaces in the field. This step means that we must come to the point where we do not engineer the user interfaces themselves but instead engineer the process of engineering user interfaces. It means that we should try to encapsulate the expertise that is needed to construct a user interface into the UI device itself. Of course, we will only partly be able to encode the knowledge stemming from all the different areas such as human factors, linguistics, design, etc., so the interfaces will fall somewhat behind what would be possible through careful in-lab engineering. We still believe the approach to be worthwhile, as we think that the deficiencies caused by the absence of careful engineering can be balanced against the increased utility of a more widespread usage. As an indication that this strategy might work out, we can consider the web.
We would like to acknowledge the work of our dear colleagues Daniela Raddino, Franck Giron and Jürgen Schimanowski, who helped in developing and implementing SIDL and the speech unit prototype system. This work was partly funded by the German research ministry under the SmartKom (01 IL 905 I 7) and Embassi (01 IL 904 S 8) projects.


Figure 1: Example of a basic HMM representing the acoustic properties of one phoneme. S1, ..., S4 are the HMM states, the a_ij denote transition probabilities, and the b_ij denote the probability distributions for producing an acoustic feature vector.

Compared to a traditionally developed application using some widget set, the possibilities to design a user interface with HTML forms are severely limited. However, there are many applications built on HTML forms on the web, such as on-line banking, access to databases, train schedule information, ticket reservation for the cinema, etc. Of course, for most of these applications it would also be possible to design an application that has to be installed on the local PC, with a carefully designed user interface. But since it is not more complex to design a web interface (and advantageous for additional reasons), web applications will exceed, or have already exceeded, the traditional applications in number. Every user connected to the internet can easily display many more user interfaces using a web browser than he will ever be able to install together with an application. Of course we do not say that web user interfaces are not engineered. Nevertheless, large parts of the appearance of the user interface are left to the browser. The user might be able to specify, e.g., the size of fonts. With some effort, it is possible (although not too easy) to design a browser that takes these forms and maps them to a speech interface or a tactile interface for the visually impaired. Of course the user interface aspect is just a tiny facet of the success of the web, so we should not overemphasize it and should be careful in our judgement. But the essential point here is that the web is simple to use and that everybody can easily write applications that just use a browser, a program that was developed long before.

2 Dynamic speech interfaces

Speech interfaces obviously consist of two parts: speech input and speech output. Speech synthesis systems implementing speech output have a long history of dealing with unknown text, which is to be produced in a maximally intelligible and natural way. Recent research concentrates more on naturalness, improving the overall quality of the output to foster a broader acceptance. We think that, albeit not yet perfect, speech synthesis is already usable for dynamic speech interfaces. For reasons explained in the next paragraphs, speech input through automatic speech recognition, on the other hand, is not.

Traditionally, speech recognizers are developed in a rigid process in which they are specifically tuned towards one application. Their functionality is generally fixed before the recognizer is shipped. State-of-the-art speech recognizers are based on a statistical approach, cf. e.g. [1]. Every 10 msec a feature vector encoding spectral information about the speech signal is computed. This forms the basis for the actual word recognition, for which Hidden Markov Models (HMMs) are used. These are state-transition automata which define doubly embedded stochastic processes. It is assumed that with each state transition one of the above mentioned feature vectors is produced according to a certain probability distribution. Alternative state transitions can be followed according to transition probabilities. Given a feature vector sequence, the probability that a particular HMM has generated this sequence can be computed. However, it is unknown (hidden) which state sequence actually generated it. This approach allows for the flexibility in the models that is needed to cope with acoustic and pronunciation variability in speech. Basic HMMs correspond to phonemes. Usually several models are used per phoneme, depending on the phonemic context. Word models are built by concatenating these basic HMMs according to a pronunciation lexicon, which contains one or more alternative phoneme sequences per word. A typical speech recognizer contains tens of thousands up to millions of statistical parameters. Figure 1 shows a basic HMM for one phoneme.

The parameters, i.e. the probability distributions of the HMMs, have to be estimated using large amounts of training data, typically a collection of utterances from hundreds of speakers annotated with the spoken words. For initialization purposes some of the utterances have to be manually segmented into phonemes. In isolated word recognition (for example, command&control tasks) the task of speech recognition is to find the best matching word-level HMM given a feature vector sequence computed from a speech signal. In continuous speech recognition the optimal sequence of word HMMs has to be determined. This cannot be done without a grammar restricting the possible word sequences or sentences. In tasks with a limited amount of variation in the wording, hand-coded finite state grammars (FSGs) are often used. In larger tasks stochastic language models are used, which contain so-called n-gram probabilities, i.e. probabilities for n-tuples of words. These have to be optimized on large text corpora. Note that all speech recognizers can only recognize words contained in their lexicon.
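As a minimal illustration of the statistical machinery sketched above, the following toy example (our own illustration, not the recognizer described in this paper) scores a feature sequence against a small discrete-observation HMM with the forward algorithm; real recognizers use continuous densities over acoustic feature vectors, but the principle is the same.

```python
# Toy forward algorithm for a discrete-observation HMM (illustration only).

def forward_probability(observations, start, trans, emit):
    """P(observation sequence | HMM), summed over all hidden state paths."""
    n = len(start)
    # alpha[i]: probability of the observed prefix so far, ending in state i
    alpha = [start[i] * emit[i][observations[0]] for i in range(n)]
    for o in observations[1:]:
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][o]
                 for j in range(n)]
    return sum(alpha)

# A 3-state left-to-right phoneme-like model over 2 observation symbols.
start = [1.0, 0.0, 0.0]                  # always start in the first state
trans = [[0.6, 0.4, 0.0],                # a_ij: transition probabilities
         [0.0, 0.7, 0.3],
         [0.0, 0.0, 1.0]]
emit = [[0.9, 0.1],                      # b_i(o): observation probabilities
        [0.2, 0.8],
        [0.5, 0.5]]

print(forward_probability([0, 0, 1, 1], start, trans, emit))
```

In isolated word recognition, such a score would be computed for every word model built from the lexicon and the best-scoring model wins; continuous recognition additionally searches over the word sequences permitted by the grammar or language model.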


Figure 2: Different possibilities to transfer a self description.

Generic recognizers able to recognize arbitrary words are not possible with the current state of the art, and it cannot be foreseen when, or whether at all, this will become possible. In present applications or products, speech recognizers are tuned a priori. Tuning here means that the statistical parameters are trained on a task-specific (with regard to vocabulary and environment) speech database, and that lexicon and grammar are developed before the recognizer is released. Afterwards everything is frozen, i.e. while the application is being used, no change of, in particular, the lexicon or grammar is possible anymore. However, as pointed out in the introduction, many applications need a dynamic configuration of the recognizer. A prerequisite for dynamic configuration is that the statistical parameters of the speech recognizer have been optimized on task-independent speech databases. Even more importantly, there has to be the possibility of dynamically changing the pronunciation lexicon and the grammar of the recognizer while it is in use.

In the case of a dynamic speech interface, lexicon and grammar are usually not developed by the team or company that developed the speech recognizer; instead, the designer of the device which is to be speech enabled needs to specify lexicon and grammar. This needs to be done in a standardized format. In the easiest case, just a command list is specified; then no grammar is required. The lexicon can be built automatically by a tool which maps the letter sequence of a word to a phoneme sequence. However, this mapping is (depending on the language) not trivial and therefore not error-free. In some applications, e.g. music title selection, multilinguality is required. In that case the speech recognizer needs to contain statistical parameters optimized on speech databases of several languages, and the lexicon, word list, and grammar specification should contain a language identifier. It is also possible, again without 100% correctness, to automatically identify which language a word or word sequence belongs to.
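The automatic language identification mentioned above can be approximated with stochastic letter sequence models. The following sketch is purely illustrative (the tiny training lists and the crude smoothing are our own invention, not the models used in the prototype): it scores a word under letter-bigram models for German and English and picks the more likely language.

```python
import math
from collections import defaultdict

def train_letter_bigrams(words):
    """Estimate letter-bigram probabilities from a word list (toy training data)."""
    counts = defaultdict(lambda: defaultdict(int))
    for w in words:
        padded = "#" + w.lower() + "#"          # '#' marks the word boundary
        for a, b in zip(padded, padded[1:]):
            counts[a][b] += 1
    def prob(a, b):
        total = sum(counts[a].values())
        return (counts[a][b] + 1) / (total + 30)   # crude add-one smoothing
    return prob

def log_score(word, prob):
    padded = "#" + word.lower() + "#"
    return sum(math.log(prob(a, b)) for a, b in zip(padded, padded[1:]))

# Tiny illustrative word lists; a real system would train on large text corpora.
german = train_letter_bigrams(["wiedergabe", "vorspulen", "zurueckspulen", "einzelbild"])
english = train_letter_bigrams(["playback", "fast", "forward", "rewind", "single"])

def identify_language(word):
    return "de" if log_score(word, german) > log_score(word, english) else "en"

print(identify_language("vorspulen"), identify_language("forward"))
```

Letter sequence statistics of this kind are also what the speech unit described in section 5 uses to fix the 'unknown language' parts of an interface description.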

3 Self description

The fundamental prerequisite for enabling a mobile system to communicate with the local outside world is that the mobile system knows about the outside world. There needs to be a flow of information from the environment to the mobile system. Traditionally, for interfaces engineered in the lab, this information flow follows a long chain, including the developer's understanding of the environment, probably fixed in a specification, until it is finally put into the interface. In such a scenario, of course, it is not possible to apply the interface to an unforeseen environment (i.e. one with a different functionality), as the functionality is already fixed by the developers at development time.

A solution gaining popularity in the area of plug and play networks is self description. The idea of self description is that the flow of information from the environment to the mobile device is made explicit at the time the mobile system is exposed to the environment. In HAVi, a proposed standard for Home Audio/Video interoperability [2], for example, it is possible that a device, say a CD jukebox, offers Java code that can be used by a controller, say a TV set, to render a user interface of the target device on a display. By supplying the interface executable in Java code, the device describes an essential aspect of itself to the controller. As only the Java engine has to be implemented in the controller (and of course a means to transfer the Java code from the target device), the controller is also able to render interfaces that were developed after the development of the controller was finished.

There are different possibilities for passing the self description from the environment to the mobile system. We show three possibilities, depicted in fig. 2, and discuss their advantages and disadvantages:

1. The self description is sent directly from the environment to the mobile system.

2. An identification key is sent from the environment to the mobile system, and the description associated with the key is fetched from memory.

3. An identification key is sent from the environment to the mobile system, the key is forwarded to a network server, and the description associated with the key is returned to the mobile system.
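A rough sketch of how a mobile system might handle the three transfer possibilities listed above; the message format, the local cache, and the description server URL are invented for illustration and are not part of any of the standards mentioned in this paper.

```python
import json
import urllib.request

# Hypothetical local memory mapping identity keys to stored descriptions (case 2).
LOCAL_DESCRIPTIONS = {
    "VENDOR-VCR-0001": {"device": "vcr", "commands": ["play", "stop", "record"]},
}

# Hypothetical network server consulted when the key is not known locally (case 3).
DESCRIPTION_SERVER = "http://descriptions.example.org/lookup?id="

def resolve_description(announcement):
    """Return a device description for an announcement received from the environment."""
    if "description" in announcement:      # case 1: the self description is sent directly
        return announcement["description"]
    key = announcement["identity"]
    if key in LOCAL_DESCRIPTIONS:          # case 2: identity key resolved from memory
        return LOCAL_DESCRIPTIONS[key]
    with urllib.request.urlopen(DESCRIPTION_SERVER + key) as reply:  # case 3: ask a server
        return json.load(reply)

# Example: an off-the-shelf device that only announces its identity (case 2).
print(resolve_description({"identity": "VENDOR-VCR-0001"}))
```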

Obviously, there is no really strict distinction between the three possibilities, and one can imagine further possible settings. We want to point out two things. First, a self description can sometimes be reduced to just a description of the identity itself, given that there is a way to add description information later (for example, by delivering updates to the memory in cases 2 and 3, or by tuning to a different server in case 3). If the description is abstract enough, and there is a certain interest by the user, it could also be possible to let the user add to or modify these descriptions according to his or her needs. Second, we think that this description must be declarative and not procedural in nature. We consider this important because we believe that a communication partner of a mobile system cannot assume that it is the only device relevant to the mobile system. Instead, the existence of several devices in parallel should be the default situation, and all their interfaces must be integrated into a common user interface of the mobile system. This might result in the need to modify the delivered self description appropriately, which is severely hampered by a procedural description such as Java byte code.

4 Understanding the world

Understanding the world as a whole has turned out to be too ambitious, as the history of AI has shown. Modelling world knowledge and resolving ambiguities pose severe problems, e.g. concerning complexity. But in some cases it is not necessary to understand the world as a whole. This observation led to a promising approach for facing the above mentioned problems, namely not to model the world as a whole but only the relevant subparts of it. More concretely, applied to a network consisting of different devices, this means the following: Instead of modelling the network as a whole, every device brings along a description of its own functionality. These descriptions include the possible states of each device, e.g. "play" or "single picture" in the case of a VCR. They also include the possible transitions from one state to another, the features of each state, and the manner in which the device is to be represented in the user interface. For mobile systems this might be a speech-only interface, but it may also be, e.g., a graphical or a multimodal one.

First, this approach provides the possibility to add or retract devices to or from the network dynamically, that is, during run-time, without the need to reconfigure the whole system. This is possible since each of the involved devices provides a self description and brings it into the network. Using this information, the network gains knowledge not only about each of the devices but also about the functionality that the network as a whole offers because different devices interact. To model these more complex correlations between different devices or services, a single description of each of the involved devices is obviously not sufficient. In this case the idea is to obtain a model of these correlations by merging the descriptions of the involved devices. Merging in this context may have different meanings. Considering the lists of recognizable words of two different devices, merging in some simple cases may simply mean the union of the two lists. Combining functionalities of the devices also needs some knowledge about compatible data types, given in a taxonomy or ontology, about preconditions that must be fulfilled or states that must be reached, and also about resource management and side effects of actions. So, merging the function models of two devices will preferably include reasoning on the models of the devices. The obtained model then includes not only the functionalities of each of the involved devices but also those functionalities that arise from using both devices in combination. With this approach the services provided by the network do not have to be fixed and modelled in advance but can be inferred dynamically depending on the involved devices. In particular, this allows devices with functionalities that are still unknown today to be added to a network in the future.

Second, the proposed approach solves the following problem: Consider a camcorder which might be in the "play" state or the "single picture" state. In each of these states it makes sense to say "forward" or "backward" to the system. But depending on the state, "forward" and "backward" have different meanings. Therefore the device would always have to be asked about its state to know which command is the appropriate one, which in general is too expensive in terms of run-time.
With the approach where the devices provide a self description, the model of a device not only includes the possible states but also the features of those states, e.g. the meaning of "forward" and "backward", which avoids the need to ask the device itself about its state.
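To make the camcorder example concrete, the following sketch shows what such a self description might contain: states, state-dependent meanings of the same command word, and the transitions the commands trigger. The structure and the names are invented for illustration; they are not the SIDL format or the SmartKom models discussed later.

```python
# Hypothetical function model of a camcorder: the meaning of "forward" and
# "backward" depends on the current state, and some commands switch the state.
CAMCORDER = {
    "initial": "stop",
    "states": {
        "stop":           {"play": "play", "single picture": "single picture"},
        "play":           {"forward": "wind tape forward",
                           "backward": "rewind tape",
                           "stop": "stop"},
        "single picture": {"forward": "step to next frame",
                           "backward": "step to previous frame",
                           "stop": "stop"},
    },
}

class DeviceModel:
    """Tracks the device state on the interface side, using only the self description."""

    def __init__(self, description):
        self.description = description
        self.state = description["initial"]

    def interpret(self, command):
        meaning = self.description["states"][self.state].get(command)
        if meaning is None:
            return f"'{command}' is not available in state '{self.state}'"
        if meaning in self.description["states"]:
            self.state = meaning              # the command also switches the state
        return meaning

cam = DeviceModel(CAMCORDER)
print(cam.interpret("play"))       # -> 'play'
print(cam.interpret("forward"))    # -> 'wind tape forward'
```

Because the model lives in the interface, the question of what "forward" currently means can be answered without a round trip to the device.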

5 A case study

5.1 Speech unit

We mentioned different kinds of networks in the introduction. In the near future more and more consumer devices will be connected with each other by such networks. Some i.LINK devices are already on the market. The first Bluetooth devices have been announced for this year. In cars, low bit-rate buses (CAN, MOST) are already frequently used; audio, video, and navigation devices are connected to these bus systems. This "networked area" will evoke the need for user interfaces for a whole network of devices rather than for individual devices, because connecting devices to each other creates new functionality. For example, copying video streams from one VCR to another or combining VCR programming with a personal calendar application available somewhere in the network is functionality that is not available when just focusing on the usage of a single device.


Other advantages of having a single user interface for a whole network are the reduction of costs and the chance of having a uniform user interface for several devices. However, a user interface for a network of devices needs to be dynamic, because these networks often have plug and play functionality, and moreover, when such a user interface device is put on the market, it is not known which devices will be available a few years later. As a first step towards this kind of user interface we propose the speech unit, a dynamic speech interface to a network of devices. Several realizations of such a device are possible:

1. A single unit in the home, connected to other devices via i.LINK and Bluetooth. Microphones could be installed in every room or attached to the clothes of the users. Several users would use the same speech unit. Devices to be controlled could be the TV, VCR, audio equipment such as a CD or MD jukebox, lights, heating and so forth.

2. A similar unit in the car, controlling car audio, navigation system, air conditioning, mirrors, and so forth.

3. A personalized speech unit running on a wearable computer. It would store a detailed user profile including speech recognizer parameters adapted to the user. The user would have the same user profile available everywhere. The personalized speech unit could connect automatically to whatever network is available wherever the user currently stays. Therefore a network in a home or in a car could both be controlled by this speech unit. The user profile would be independent of the location. The EMBASSI consortium [3] even proposes to use a similar kind of personalized user interface to control devices in public places, such as ticket machines. An interactive dialogue could then help the user to find the right ticket, and he would become independent of the language spoken in the country where he needs the ticket.

So far we have implemented a first prototype, which runs on a standard notebook equipped with an i.LINK connector and a CD player. The speech unit is used to control a VCR and the CD player, and to select music titles from the CD currently inserted. The speech interface is dynamic in the sense that the VCR can be plugged in or unplugged at any time and CDs can be changed at any time. The technical implementation is described below.

5.2 Concurrency

From a user perspective, there are three things that can be accessed from the speech interface: the VCR, the CD player, and the track titles of the CD currently inserted. The speech interfaces of all of them are modelled by finite state grammars stating the possible command words (in the case of the VCR or the CD player) or utterances (the track titles in the case of the CD medium) a user can say. As mentioned above, we believe it is an essential factor that there is more than one device, service or application to be controlled by a speech unit. We chose a method to address this issue which in our opinion is a very natural solution: you simply say the command to any of the devices, and whichever device can react to it will react. For the CD medium, the user just has to utter the track title, and it will start to play. If the user wants to address a specific device, the command can be prefixed by the name of the device. Of course, prefixing is necessary for ambiguous commands such as "play", which is understood by both the VCR and the CD player. In our current application, if a user utters an ambiguous command, it is sent to either one of the devices. In the future we plan to start a clarification dialogue upon recognition of an ambiguous command. We did not want an explicit switch of focus, in which the user would first have to bring the speech unit's attention to a specific device and all following commands would then relate to it. We believe that our concurrent approach is more natural because it is compatible with human-human communication. Of course, it is a rather straightforward method, and in the future a more intelligent treatment (allowing both concurrent activation as well as explicit or implicit shifts of focus) is needed, taking into account also intonational cues and dialogue history.
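A rough sketch of the concurrent dispatch described above, with invented command lists: an unprefixed command goes to whichever device understands it, an explicit device-name prefix forces the choice, and a command understood by several devices is flagged here for a clarification dialogue (the current prototype simply sends it to one of the devices).

```python
# Command words registered by the currently connected clients (illustrative lists).
DEVICES = {
    "vcr":       {"play", "stop", "record", "rewind"},
    "cd player": {"play", "stop", "next track"},
}

def dispatch(utterance):
    """Route a recognized utterance, honouring an optional device-name prefix."""
    for name in DEVICES:
        if utterance.startswith(name + " "):        # explicit prefix, e.g. "vcr play"
            return name, utterance[len(name) + 1:]
    matches = [name for name, commands in DEVICES.items() if utterance in commands]
    if len(matches) == 1:
        return matches[0], utterance                # exactly one device understands it
    if len(matches) > 1:
        return "clarify", utterance                 # ambiguous, e.g. "play"
    return "reject", utterance                      # nobody understands it

print(dispatch("rewind"))      # -> ('vcr', 'rewind')
print(dispatch("play"))        # -> ('clarify', 'play')
print(dispatch("vcr play"))    # -> ('vcr', 'play')
```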

5.3 Technical implementation

The implementation is strictly separated into the (speech) user interface component (the speech unit) and the three clients (the VCR, the CD player, and the possibly inserted CD). On an abstract level, all three cases of figure 2 are present in our prototype: As the VCR is a standard, off-the-shelf product, it is only able to send its identity (defined as the globally unique ID from the device's IEEE 1394 accessible config ROM) to the speech unit, which fetches an associated description from memory (case 2 in figure 2). The CD player is a standard application driving the notebook's internal CD drive, which we extended slightly to communicate with the speech unit by generating and sending a self description to it (case 1 in figure 2). The CD can be viewed as case 3 in figure 2 (here, the CD player application performs the negotiation with the server rather than the speech unit): from the number and lengths of the CD tracks, a generally unique identifier for the CD is derived and sent to www.cddb.com, from where a list of track titles is retrieved, which can be seen as an interface description for the CD. The communication between a client and the speech unit takes place via an explicit description of the speech interface, expressed as an XML document. The speech unit is able to receive any interface description conforming to the DTD of our speech interface definition language, called SIDL.



Currently, the implementation considers neither synthesis nor dialogue, and hence some restrictions have to be observed when authoring such a document. Our speech unit is able to receive documents with German, English, and unknown-language speech recognition parts. When the speech unit receives a document, it first fixes the parts containing 'unknown language' sections to be either German or English. This is needed for the CD titles and illustrates nicely what we mean by putting knowledge into the interface rather than into the application. Our prototype is able to process both German and English CD titles, but the CD player, which includes the track list in its interface description, does not know anything about languages, so it just writes 'unknown' in the language specification for the track list. The component deciding about the language of the track titles is inside the speech unit, in this case a language identification based on stochastic letter sequence models of German and English. Postponing the decision is advantageous because the same or a similar problem of deciding between languages might of course occur for another device. Also, if we updated our speech unit to also support French, no changes would be required for the CD player. Of course, if the CD player knows the language, it can specify it directly.

Next, the speech interface definitions are merged together in a purely syntactic process, by prepending the recognition parts with an optional name of the device and by putting all possible paths of all devices in parallel. Finally, the description is converted into the appropriate structures for the recognizer, that is, the finite state network for the merged document is established, pronunciations (if any are supplied by a client) are taken into the dictionary, and all remaining words included in the clients' descriptions are automatically converted into phoneme sequences. This process of converting between the letters and the sounds of speech is another example of (in this case phonetic or language-engineering) knowledge that can be integrated into the interface itself. One might argue that this conversion could also be done offline by whoever writes the interface description, by simply specifying the pronunciations along with it; but then we would lose a lot of potential applications where the orthography is known but the pronunciation is not: at least for the track titles, we cannot expect to get phonetic transcriptions from the clients at all. By putting the knowledge of how to convert (even huge quantities of) text into phonetic transcriptions, which is in practical use in many labs dealing with speech technology, into the interface itself, we also allow people who do not have that background or the resources for development to make their applications speech aware.

When the user utters one of the commands, the speech recognizer selects a specific path through the finite state grammar and issues the action that is associated with that path and also contained in the SIDL document. In our current implementation this can be an arbitrary AV/C command on the IEEE 1394 bus (used for the VCR) or an arbitrary message to the GUI.
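To illustrate the processing steps above, here is a hedged sketch of merging two interface descriptions into a single word list for the recognizer. The XML element and attribute names are invented (the actual SIDL DTD is not reproduced in this paper), and the letter-to-phoneme fallback is a toy stand-in for a real grapheme-to-phoneme converter.

```python
import xml.etree.ElementTree as ET

# Invented, SIDL-like interface descriptions for two clients.
VCR_SIDL = """<interface device="vcr" lang="en">
  <command action="AVC_PLAY">play</command>
  <command action="AVC_STOP">stop</command>
</interface>"""

CD_SIDL = """<interface device="cd player" lang="unknown">
  <command action="PLAY_TRACK_1">let it be</command>
</interface>"""

def naive_phonemes(words):
    """Toy letter-to-sound fallback; a real system uses trained G2P rules per language."""
    return " ".join(c for c in words.lower() if c.isalpha())

def merge(documents, identify_language):
    """Merge the descriptions into one word list with optional device-name prefixes."""
    entries = []
    for doc in documents:
        root = ET.fromstring(doc)
        device, lang = root.get("device"), root.get("lang")
        for cmd in root.findall("command"):
            words = cmd.text
            entries.append({
                "utterance": f"[{device}] {words}",   # the prefix is optional at run-time
                "language": identify_language(words) if lang == "unknown" else lang,
                "phonemes": naive_phonemes(words),
                "action": cmd.get("action"),
            })
    return entries

for entry in merge([VCR_SIDL, CD_SIDL], identify_language=lambda w: "en"):
    print(entry["utterance"], "->", entry["action"], f"({entry['language']})")
```

In the real system, the merged result is compiled into a finite state network for the recognizer, and the associated actions are issued as AV/C commands or GUI messages as described above.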

6 Conclusion and future work

As described in this paper, the goal of this research is to obtain speech interfaces for dynamic networks, in the sense that it is possible to include and retract devices during run-time without having to reconfigure the system as a whole. To reach this aim, the information about the functionality and the manner of presentation of each device cannot be stored statically in the network, since devices might be added to or retracted from the network; it needs to be brought along with the devices themselves. We presented the speech unit, in which this idea was realized with respect to the speech interface. However, the speech unit does not yet address the question of how to combine functionalities of the devices (for example, you cannot record the sound of the CD on the VCR). We feel that for this to work smoothly, a limited understanding of the observed world is needed. We have started to investigate these problems in the course of the SmartKom project [4], where we want to develop and implement models that can describe these functional dependencies between devices. The goal here is that, from the functional descriptions of the devices, their interplay can be inferred and accordingly reflected in the user interface. We also feel that current state-of-the-art recognition systems must be made more dynamic in other aspects besides pronunciation; this holds, e.g., for stochastic word sequence models. Ultimately, if we are to move from a purely surface-oriented description such as SIDL to one which includes more understanding, then all the modules involved in the understanding process must be constructed dynamically from the individual definitions of the relevant devices, services and applications. This is definitely a long-term goal, and we are curious to see how far we can get with it in the course of the SmartKom project. Another aspect that we pursue in this project and in the above mentioned EMBASSI project is multimodality, namely the question whether it is possible to link interface descriptions for different, also non-speech, modalities together, be it on a surface level or on a deeper understanding level.

References

[1] L. Rabiner and B.-H. Juang. Fundamentals of Speech Recognition. Prentice Hall, New Jersey, 1993.
[2] www.havi.org.
[3] www.embassi.de.
[4] www.smartkom.com.
