A Speech-to-Speech Translation based Interface for Tourism

M. Cettolo, A. Corazza, G. Lazzari, F. Pianesi, E. Pianta, L.M. Tovena
ITC-irst, Trento, Italy

Abstract. This paper presents a speech-to-speech translation system for tourism applications developed in the context of the C-STAR consortium. Potential users can communicate by speech, using their own language, with a travel agent in order to organize their travel. The system uses an interchange format representation of the semantic contents of utterances, which is flexible and simplifies the porting of the system to new languages. A demonstration prototype, developed at ITC-irst, is now working for the Italian modules and has been integrated with the English counterpart developed at the Interactive Systems Laboratory at CMU.

1 Introduction

In the field of tourist information, users from every part of the world may want to access an information system to get information for organizing their travels. However, potential users are not computer scientists, nor are they keen on using artificial languages for interacting with the system. They would rather exploit their own language. This is why multi-lingual and translation-based interfaces are so important in tourist information retrieval applications. Also, traditional computer-based interfaces severely limit the number of potential users to those who are familiar with keyboards. Instead, the use of tourism information systems can be greatly enhanced by allowing a speech-based interaction, say by means of a telephone.

At present, general purpose speech-to-speech translation is too ambitious a target. The effort of the research community is directed towards making available domain-dependent speech-to-speech translation systems, which make it possible to limit the vocabulary size of the speech recognizer and the set of possible expressions to be considered in the analysis and generation steps.

With the goal of exploring the use of speech-to-speech translation for tourism applications, the C-STAR consortium designed a possible scenario so that partners from different countries could cooperate by sharing data and experiences on a common task to build an actual system. At the moment six laboratories are participating in the system implementation: ATR (Japan), CLIPS++ (France), ETRI (Korea), ITC-irst (Italy), ISL at CMU (USA), ISL at UKA (Germany).

The different partners of the consortium use a common formal language, called the interchange format (henceforth, IF), for encoding the information content of each utterance. This permits each partner to focus on the development of modules mapping its language onto IF representations, and from the latter back into its language. The authors of this paper are involved in the development of the analysis and generation modules for Italian.¹

The following section describes the application scenario considered. Sections 3 and 4 consider the two main characteristics of the system: the speech-based interface, with its problems and advantages, and the interchange format approach. In Section 5, the system is described, in its general architecture and in the particular methodologies applied for the single modules. Finally, general conclusions and future work are discussed in the final section.

¹ The idea of using an interchange format can be profitably extended to other application domains, e.g. to dialogue-based human-machine systems. In these cases, the interchange format can be translated into SQL queries for accessing a database [5].

2 The scenario

In the C-STAR application scenario, a customer is planning a trip to several countries around the world. He/she relies on a travel agency with agents in each of the considered countries that arranges for flights, trains, hotels, sight-seeing tours, tickets for cultural events, etc. The scenario is designed in such a way that the conversation does not involve more than two people at a time. A typical conversation consists of a sequence of two-party dialogues and takes place over the telephone, so that visual or gestural communication is excluded.

The contact point for the client is an agent, A1 (the primary agent), in the first city of the trip. The conversation, a phone call between A1 and the traveler, concerns the transportation arrangements for the whole trip, and hotel booking and tourist arrangements for the country of A1. When A1 has finished his/her task, he/she transfers the line to another agent, A2 (a secondary agent), in one of the other countries involved in the trip. A2 will only take care of room reservations and tourist information for his/her own country. Once the client has finished with A2, the latter transfers him/her back to A1, who might then dispatch the client to another secondary agent in another country, and so on.

Three main subdomains characterize the scenario: travel information, hotel reservation and tourist information. In addition, there are minor subdomains related to call transfers.

3 Speech-based interfaces

The system interface is very important for system usability. Speech-based interfaces are very effective. Indeed, they leave the user's hands free for performing other tasks (like handling documents or taking notes) and permit the whole exchange to be accomplished through a telephone line. Moreover, when the dialogue is among people, speech is the most natural way of communication.


The choice of a speech-based interface, however, conditions the design of all the modules of the system. The idea of considering speech modules as parts which can simply be connected to a pre-existing (or an otherwise already specified) system has definitely been abandoned in recent years. This is due to the fact that both the speech signal (input) and the spoken language (output) are extremely variable and difficult to model. Acoustically, spontaneous speech is characterized by highly variable speaking rates and by a series of extra-linguistic phenomena, like hesitations, pauses, coughs, and so on, which have to be taken into account. Table 1 reports the occurrences of such extra-linguistic phenomena in the corpus collected for the development and assessment of the system we are going to present.

phenomenon           | eee  | inhalation | exhalation | mouth | vowel lengthen. | mmm | ehm | ah
# occurrences        | 1974 | 1497       | 1360       | 496   | 490             | 303 | 222 | 172
# affected sentences | 1389 | 1117       | 1176       | 449   | 432             | 258 | 204 | 167
% affected sentences | 25.3 | 20.4       | 21.4       | 8.2   | 7.9             | 4.7 | 3.7 | 3.0

Table 1. Most frequent human noises occurring in the corpus (# stands for "number of", % for "rate").

To model these and other phenomena, and obtain acceptable recognition accuracy with spontaneous speech, accurate acoustic models are required. In this respect, only some of the relevant phenomena can be handled by current techniques. For example, modeling such frequent phenomena as breaths, filled and empty pauses, coughs, etc. is a feasible task, but modeling speaking rate variations is more difficult. Finally, handling out-of-vocabulary words or word fragments is still a research issue. Another important point is that spoken language is often very different from written language (refer to [9] for a deeper discussion of this issue). For example, it is commonplace to find utterances consisting of sentence fragments, sequences of fragments, or fragments combined with complete sentences. Building models capable of dealing with all possible variations is a difficult task. Finally, recognition errors must be carefully controlled and minimised. They cause, in fact, the insertion of unuttered words in the recognized word sequence, the omission of uttered words, or the substitution of uttered with unuttered words. This, in turn, causes problems for the modules devoted to language processing, which are therefore expected to be very robust with respect to recognition errors. All these considerations are all the more important when dealing with spontaneous speech, as compared with read or dictated speech.

4 The interchange format approach

The adoption of an interchange format based approach has several advantages and potentialities. The most obvious advantage is the reduction of the number of different systems which have to be implemented. Given n different languages, an analysis chain (starting from the spoken input and delivering an interchange format representation) and a synthesis chain (taking the interchange format representation and providing a linguistic form for it) for each language suffice to yield a system capable of dealing with speech-to-speech translation between all of the possible language pairs. That is, the resulting system requires n separate analysis and synthesis chains, instead of the otherwise required quadratic number of modules. Furthermore, given that each module involves only one language, its development can be done by native speakers of that language. Another important advantage concerns portability to a new language; given the described configuration, less effort is necessary to make an existing system capable of dealing with a new language. This strikingly contrasts with the case of a direct translation system, where the addition of a new language to a set of n pre-existing languages requires the construction of n new complete modules to link each old language to the new one.²

In addition to this, the techniques developed to build and process a formal representation of the information content of utterances can be exploited to meet the demands of many other application scenarios. For example, in a speech-based information retrieval system, the interchange format can be used to build the formal query. Both in this case and in the speech-to-speech scenario, the interchange format representations can provide the means to produce summaries of the transactions that occurred between the human and the machine, and among the different parties, respectively.

² This short discussion is somehow reminiscent of the long debate between interlingua- and transfer-based approaches in traditional Machine Translation. This is not by chance. Actually, we believe that the IF-based philosophy described in the text can make the best of both approaches. Obviously, this requires that some constraints be relaxed; e.g., that the domain-independence pursued by interlinguas be abandoned. Yet, as we argue in the text, it is possible to maintain a high degree of language-independence.

4.1 The interchange format design

For the interchange format based approach to work properly, the design of the interchange format is crucial. This is a difficult problem because many aspects need to be taken into account. In the first place, the interchange format must be (as) language independent (as possible). That is, we want it to focus on the information contained in utterances rather than on the way in which such information is expressed. On the other hand, in order to control the complexity of the task, and often to make the task feasible at all, the interchange format must be allowed to be both application- and domain-sensitive. Indeed, the application scenario provides strong indications as to the type of information which must be extracted, while the domain defines the actual data to be identified in the utterance. For example, the application might request the identification of the topic of the utterance, and the domain might restrict the choice to a finite set [11].

To exemplify, suppose that the application is a system retrieving data from a relational database. In this case the information which has to be extracted from the input utterance is that relevant to building the SQL query; that is, the topic, which usually corresponds to the table to be searched, the data to be retrieved, which correspond to the columns, and the constraints on the data. Therefore, in this case the database structure gives the structure of the interchange format. [5] describes a system working according to these principles, which exploited a frame-based language as interchange format. An interchange format example is reproduced below; it corresponds to the Italian sentence vorrei i voli da verona a pittsburgh in partenza prima delle dieci (i'd like to know about the flights from verona to pittsburgh which leave before ten o'clock).

( TABLE = flight )
( COLUMN = flight_id )
( ORIGIN = 'VERONA' )
( DEST = 'PITTSBURGH' )
( DEP_TIME IN 00:00 10:00 )
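For illustration, the following is a minimal sketch of how such a frame could be turned into an SQL query. The dictionary encoding of the frame, the field names, and the range convention are our assumptions for the example, not the actual data structures of the system described in [5].

# Sketch: turning a frame-based IF representation into an SQL query.
# Frame encoding and field names are illustrative assumptions.
def if_to_sql(frame):
    table = frame["TABLE"]          # topic -> table to be searched
    column = frame["COLUMN"]        # requested data -> column
    constraints = []
    for slot, value in frame.items():
        if slot in ("TABLE", "COLUMN"):
            continue
        if isinstance(value, tuple):                 # range constraint
            low, high = value
            constraints.append(f"{slot} BETWEEN '{low}' AND '{high}'")
        else:                                        # equality constraint
            constraints.append(f"{slot} = '{value}'")
    return f"SELECT {column} FROM {table} WHERE " + " AND ".join(constraints)

frame = {"TABLE": "flight", "COLUMN": "flight_id",
         "ORIGIN": "VERONA", "DEST": "PITTSBURGH",
         "DEP_TIME": ("00:00", "10:00")}
print(if_to_sql(frame))
# SELECT flight_id FROM flight WHERE ORIGIN = 'VERONA'
#   AND DEST = 'PITTSBURGH' AND DEP_TIME BETWEEN '00:00' AND '10:00'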

Speech-to-speech translation scenarios feature dialogues with a richer structure than that of speech-based information retrieval, and the relevant interchange format must be able to capture it. For instance, the number of parties active in the conversation is always greater than one, and track must be kept of who is speaking at each time. To this end, the C-STAR IF provides a label to encode the speaker: a: for the agent and c: for the traveller. Furthermore, the users do not only ask for data or information. They also perform, and request the other party to perform, a number of different actions. Users greet the other speaker, seek and give information and clarifications, accept and reject proposals and suggestions, and so on. All these actions are represented in the speech-act level of the C-STAR IF. Moreover, each action may concern and involve a number of objects in the world and properties thereof. Such objects and properties are encoded by means of concepts, usually ordered by importance and (decreasing) generality. Examples of the relevant concepts are availability, price, hotel, room, trip, etc. Finally, there is an argument level, consisting of attribute-value pairs such as time=sunday, location=downtown, etc. Each concept admits one or more attribute-value pairs.

In summary, the C-STAR IF consists of four levels: 1) the speaker label, 2) the speech-act part, 3) a (sequence of) concept(s), and 4) the arguments. If the utterance considered in the example above had been uttered by the traveller in the C-STAR scenario, the IF representation would have been the following one:

c:request-information+features+flight (origin=verona, destination=pittsburgh, time=(before=10:00))

Here, the four levels are: 1) c:, indicating that the speaker is the traveller, who 2) asked for information (request-information), concerning 3) flight features (features+flight), 4) from Verona to Pittsburgh with departure time before ten a.m. (origin=verona, destination=pittsburgh, time=(before=10:00)). The architecture of the IF makes it possible to clearly distinguish the domain-dependent part (concepts and arguments) from the domain-independent one (speaker and speech-act). This facilitates porting from one domain to another. For example, moving from the hotel reservation domain to the travel information one will only require the addition of new concepts and new arguments.
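To make the four levels concrete, the following sketch decomposes the example IF string above. The parsing code and the surface syntax it assumes are ours, written for illustration; they are not part of the C-STAR IF specification.

# Sketch: decomposing a C-STAR IF string into its four levels
# (speaker, speech act, concepts, arguments), under an assumed
# simplified surface syntax.
import re

def parse_if(expression):
    speaker, _, rest = expression.partition(":")      # level 1
    head, _, args = rest.partition("(")
    parts = head.strip().split("+")
    speech_act, concepts = parts[0], parts[1:]        # levels 2 and 3
    # level 4: top-level attribute=value pairs; a parenthesized value
    # such as time=(before=10:00) is kept as a single token
    arguments = re.findall(r"(\w+)=(\([^)]*\)|[^,()\s]+)", args)
    return {"speaker": speaker, "speech_act": speech_act,
            "concepts": concepts, "arguments": dict(arguments)}

print(parse_if("c:request-information+features+flight "
               "(origin=verona, destination=pittsburgh, time=(before=10:00))"))
# {'speaker': 'c', 'speech_act': 'request-information',
#  'concepts': ['features', 'flight'],
#  'arguments': {'origin': 'verona', 'destination': 'pittsburgh',
#                'time': '(before=10:00)'}}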

5 The system

As mentioned, the C-STAR consortium consists of several partners from many different countries. Each partner is required to develop system modules for its own language; the interface between each pair of partners/languages is provided by IF representations, which are sent and received through the geographical network. Partners can organize their modules the way they like, provided that the IF interface works properly.

[Figure 1 (block diagram): the analysis chain takes spontaneous speech through the recognizer, the segment detector and the understanding module (SCT), producing an interchange format sequence; the synthesis chain takes an interchange format sequence through the natural language generator and the text-to-speech synthesizer, producing a synthesized sentence; both chains connect to the other-language system.]

Fig. 1. System architecture.

At ITC-irst, two main processing chains have been implemented for Italian: the analysis chain and the synthesis chain (cf. Figure 1). The analysis chain converts the input speech signal into a (sequence of) IF representation(s) by going through: (1) the recognizer, which produces a sequence of word hypotheses for the input signal; (2) the segment detector, which detects semantic boundaries inside the utterance on the basis of acoustic-prosodic features and simple statistics on words in correspondence of such boundaries [8]; (3) the understanding module, which exploits a robust parser and a statistical classifier to deliver IF representations [1].

The synthesis chain starts from an IF expression and produces a synthesized natural language audio message expressing that content. It consists of two modules. The generator first converts the IF representation into a more language-oriented representation and then integrates it with domain knowledge to produce sentences in Italian [12]. Such sentences feed a speech synthesizer, namely the text-to-speech system Eloquens® developed by CSELT.
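The overall data flow can be summarized as in the following toy sketch. Every module body here is a trivial stand-in; only the composition mirrors the architecture of Figure 1. In the running system, the IF sequence travels over the network to and from the other-language side.

# Sketch of how the two ITC-irst chains compose around IF.
# All module bodies are placeholder stubs, for illustration only.
def recognize(signal):                 # acoustic recognizer (Sec. 5.1)
    return "vorrei i voli da verona a pittsburgh".split()

def detect_segments(words):            # semantic-boundary detector [8]
    return [words]                     # here: a single semantic unit

def understand(segment):               # understanding module (Sec. 5.2)
    return ("c:request-information+features+flight "
            "(origin=verona, destination=pittsburgh)")

def generate_italian(if_expr):         # generator (Sec. 5.3); its output
    return "vorrei i voli da verona a pittsburgh"   # feeds the TTS

def analysis_chain(signal):
    return [understand(s) for s in detect_segments(recognize(signal))]

def synthesis_chain(if_sequence):
    return [generate_italian(e) for e in if_sequence]

print(synthesis_chain(analysis_chain(None)))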

5.1 The Acoustic Recognizer

The goal of the speech recognizer is to find the word sequence that most likely caused the acoustic signal. For each sentence hypothesis, the likelihood can be computed on the basis of two probabilistic models: the Acoustic Model (AM), concerning the relation between each word and the acoustic signal, and the Language Model (LM), which considers how words are concatenated in a sentence. Speech sound is first converted into electrical signals by a microphone; the analog signal is then digitized through sampling and quantizing; the digital signal is finally processed to estimate significant parameters [10].

The recognizer used in our system is based on the platform developed at ITC-irst [7] for Italian large vocabulary dictation tasks (20K words). It uses an AM based on hidden Markov models (HMMs) [13] of phonemes. Each vocabulary word is modeled by concatenating the sequence of phoneme HMMs corresponding to its pronunciation. HMMs were initialized on a phonetically rich database, APASCI, collected at ITC-irst [1], while a re-training phase was carried out on the training set of the spontaneous speech corpus considered here. Specific models were trained for the most frequent extra-linguistic phenomena.

The adopted LM is a shift-β trigram [7], including extra-linguistic models as if they were real words. Such a model gives the probability of each three-word segment of a sentence; these probabilities can then be combined to find the LM probability of each possible sentence.

Intra-word acoustic constraints (phonetic transcriptions of words) and inter-word linguistic constraints (trigram language model) are compiled into a sharing-tail tree-based network that defines the search space, at the phoneme level, for the decoding algorithm [3]. For each word, only a single canonical pronunciation is considered. Extra-linguistic models are inserted into the search space by using their trigram distributions as given by the LM.
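As an illustration of how trigram probabilities combine, here is a minimal scoring sketch in which extra-linguistic events are treated as ordinary vocabulary items, as in the system. The probability values, the <breath> token, and the unseen-event floor are invented for the example and do not come from the actual models.

# Sketch: log P(w_1..w_n) = sum_i log P(w_i | w_{i-2}, w_{i-1}),
# with an extra-linguistic event modeled like a regular word.
import math

trigram_logprob = {                    # toy, made-up estimates
    ("<s>", "<s>", "vorrei"): math.log(0.05),
    ("<s>", "vorrei", "<breath>"): math.log(0.10),
    ("vorrei", "<breath>", "i"): math.log(0.20),
    ("<breath>", "i", "voli"): math.log(0.30),
}

def sentence_logprob(words):
    history = ["<s>", "<s>"]
    total = 0.0
    for w in words:
        total += trigram_logprob.get(tuple(history) + (w,),
                                     math.log(1e-6))   # unseen floor
        history = [history[1], w]
    return total

print(sentence_logprob(["vorrei", "<breath>", "i", "voli"]))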

5.2 The Understanding Module

The input to the understanding module is a sequence of words and speech phenomena, like pauses and hesitations, corresponding to a semantic unit, that is, a part of the utterance identified by two semantic boundaries. The role of the understanding module is to construct the corresponding IF representation. The first level of the IF representation, i.e. the speaker, is known. The rest of the IF representation is built in two successive phases, each devoted to a single task.

In the first phase, the utterance is processed by a local parser which tries to extract information concerning argument-value pairs. In the second phase, a statistical classifier identifies the speech act and the concept sequence. Therefore, our system conceives the first task (identifying argument-value pairs) as requiring consideration of the syntactic structure of portions of the utterance. The second task (building the speech act and the concept sequence) is seen as a classification task, in which the word sequence output by the acoustic recognizer (and processed by the local preprocessor) is the object to be classified by means of IF labels.

Among all possible approaches to the classification task, the extreme variability or total absence of a global structure in the input utterance suggests the application of a statistical approach. In fact, if a corpus of example utterances is available, it is necessary to find all and only the elements of the input utterance which have the highest correlation with the IF label. Therefore, we use a preprocessor to extract the phrases which are relevant for the IF construction. The classification proper is left to a statistical classifier based on Semantic Classification Trees (SCTs) [6], which are derived from CARTs (Classification and Regression Trees) [2]. SCTs can be automatically built from a corpus of labelled examples (training set). In the current implementation [5], a variant of the procedure described in [6], the classification is based on the presence of particular keywords. In a way, keywords are the simplest and most straightforward elements which can be identified in a sentence, while in principle statistical classifiers could be applied to any structure built on the input. On the other hand, in the system considered here, the application of SCTs is preceded by the application of the local parser which, in addition to extracting the arguments, preprocesses the input data by substituting important phrases with labels which can be used as keywords. In this way, the statistical relevance of the data improves, because equivalent phenomena are clustered in the same event.
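The following toy sketch illustrates the two phases on the example sentence of Section 4.1. The phrase patterns, class labels, and the single keyword test are invented stand-ins for the local parser and for the trained SCTs, which ask learned sequences of such questions rather than one hand-written rule.

# Sketch: phase 1 substitutes domain phrases with class labels;
# phase 2 classifies the result by keyword presence (SCT-like).
import re

PHRASE_CLASSES = [
    (r"\bda \w+ a \w+\b", "ORIGIN_DEST"),    # "da verona a pittsburgh"
    (r"\bprima delle \w+\b", "DEP_TIME"),    # "prima delle dieci"
]

def preprocess(utterance):
    for pattern, label in PHRASE_CLASSES:
        utterance = re.sub(pattern, label, utterance)
    return utterance

def classify(utterance):
    words = set(preprocess(utterance).split())
    # a single SCT-style question on the presence of a keyword
    if "vorrei" in words:                    # "I would like ..."
        return "request-information"
    return "give-information"

print(classify("vorrei i voli da verona a pittsburgh prima delle dieci"))
# request-information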

5.3 The Generator

The goal of the generator is to produce the Italian translation of the source sentences encoded in IF representations. Since IF is not a linguistically oriented semantic representation, the generator first uses a sentence planning module mapping IF into a functional representation similar to the f-structure of Lexical Functional Grammar. This step usually does not take place in current machine translation systems. The sentence planning algorithm allows for three strategies, which can be mixed. General heuristics are used to map parts of the IF representation into linguistically relevant functions. These rules give medium quality results, but can be applied in a large number of cases. However, before resorting to these general rules, the system can consider more specific ones, which produce a more accurate output but can be applied in a smaller number of cases. Should both strategies fail, a backup strategy applies, which merely translates the components of the IF representation; see [12].

The actual generation step makes wide use of flexible templates, which guarantee high efficiency and at the same time allow an elegant treatment of linguistic phenomena such as phonological adjustment, morphological agreement and syntactic constituency.³

³ Our templates are expressed in the Hyper Text Planning Language (HTPL) [4]. An HTPL expression can include different kinds of objects: linguistic items at different degrees of abstraction (from frozen strings to semantic representations), calls to template definitions, slot specifications to be filled at run time, conditional and disjunctive expressions, operators allowing to specify character formatting, and speech synthesis annotations.
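As an illustration of template-based generation with slot filling and morphological agreement, here is a toy sketch in the spirit of, but far simpler than, the HTPL templates actually used; the lexicon and the agreement table are invented for the example.

# Sketch: a template with slots and article-noun agreement.
ARTICLES = {("m", "sg"): "il", ("m", "pl"): "i",
            ("f", "sg"): "la", ("f", "pl"): "le"}

def np(noun, gender, number):
    """Article-noun agreement, e.g. ('voli','m','pl') -> 'i voli'."""
    return f"{ARTICLES[(gender, number)]} {noun}"

def flight_request(origin, dest, before):
    return (f"vorrei {np('voli', 'm', 'pl')} da {origin} a {dest} "
            f"in partenza prima delle {before}")

print(flight_request("verona", "pittsburgh", "dieci"))
# vorrei i voli da verona a pittsburgh in partenza prima delle dieci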

6 Conclusions and future work

At present, a demo version of the Italian modules is available. All the analysis chain modules were developed in C/C++; some simple filters are written in Perl. The synthesis chain modules are implemented in Prolog. The user interface is written in Tcl/Tk, but a new version in Java is currently under development. All these choices guarantee a high degree of portability of the system across hardware platforms and operating systems (OSs). The current demo runs on PCs and laptops equipped with the Linux OS; other versions are available for Unix-based workstations (Sun and HP). Currently, the recompilation of all the modules under the Windows NT/95 OSs is being completed. Figure 2 shows a screen dump of the Italian-English system, which integrates our modules with those developed and implemented at the Interactive Systems Laboratory at CMU (Pittsburgh, USA).

Fig. 2. Interface of the Italian-English demo.

References

1. B. Angelini, M. Cettolo, A. Corazza, D. Falavigna, and G. Lazzari. Multilingual Person to Person Communication at IRST. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Munich, Germany, April 21-24 1997.
2. L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth Inc., 1984.
3. F. Brugnara and M. Cettolo. Improvements in Tree-based Language Model Representation. In Proceedings of the 4th European Conference on Speech Communication and Technology, pages 1797-1800, Madrid, Spain, 1995.
4. N. Cancedda, G. Kamstrup, E. Pianta, and E. Pietrosanti. SAX: Generating Hypertext from SADT Models. In Third Workshop on Applications of Natural Language to Information Systems, Vancouver, 1997.
5. M. Cettolo, A. Corazza, and R. De Mori. Language Portability of a Speech Understanding System. Computer Speech and Language, 12:1-21, 1998.
6. R. De Mori and R. Kuhn. The Application of Semantic Classification Trees to Natural Language Understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(5):449-460, May 1995.


7. M. Federico, M. Cettolo, F. Brugnara, and G. Antoniol. Language Modelling for Efficient Beam-Search. Computer Speech and Language, 9(4):353-379, 1995.
8. M. Cettolo and D. Falavigna. Automatic Detection of Semantic Boundaries based on Acoustic and Lexical Knowledge. In Proceedings of the International Conference on Spoken Language Processing, Sydney, Australia, 1998. To appear.
9. R. C. Moore. Integration of Speech with Natural Language Understanding. In D. B. Roe and J. G. Wilpon, editors, Voice Communication between Humans and Machines, pages 254-271. National Academy Press, Washington D.C., USA, 1994.
10. R. De Mori, editor. Spoken Dialogues with Computers. Academic Press, San Diego, CA, 1998.
11. F. Pianesi and L. M. Tovena. Using the Interchange Format for Encoding Spoken Dialogues. In Proceedings of the AMTA SIG-IL Second Workshop on Interlinguas and Interlingual Approaches, 1998.
12. E. Pianta and L. M. Tovena. Generating with Flexible Templates from C-STAR Interchange Format. Technical Report 9808-04, Istituto per la Ricerca Scientifica e Tecnologica, ITC-irst, 1998.
13. L. R. Rabiner. A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. In A. Waibel and K. F. Lee, editors, Readings in Speech Recognition, pages 267-296. Morgan Kaufmann Publishers, San Mateo, CA, 1990.