A Multi-lingual Speech Translation System for Hotel Reservation

Woosung Kim, Du-Seong Chang, Jae-In Kim and Myoung-Wan Koo
Multimedia Technology Research Laboratory
Korea Telecom Research & Development Group
17 Umyon-dong, Seocho-gu, Seoul, 137-792, Korea
E-mail: {sung, cds, jikim, [email protected]}

Abstract

In this paper, we present a multi-lingual speech translation system for hotel reservation. This is an experimental speech-to-speech translation system which translates a spoken utterance in Korean into one in Japanese or English. The system can understand dialogues for hotel reservation between a Korean customer and a hotel reservation desk in Japan or in English-speaking countries. It consists of a Korean speech recognizer, Korean-to-Japanese and Korean-to-English machine translators, and three kinds of speech synthesizers. The Korean speech recognizer is an HMM (Hidden Markov Model)-based, speaker-independent, continuous speech recognizer with a vocabulary of about 300 words. We have achieved a word recognition rate of 94.68% and a sentence recognition rate of 82.42%. In machine translation, the direct transfer method is used for Korean to Japanese, and the example-based translation method is used for Korean to English. Experimental results of machine translation gave a success rate of 100% for both Korean to Japanese and Korean to English translation. There are three speech synthesizers in the system, for Korean, Japanese, and English, respectively. The system runs in nearly real time on a SUN SPARC20 workstation with one TMS320C30 DSP (Digital Signal Processing) board.

1 Introduction

As spoken language processing technology has advanced, the dream of speech translation has become more feasible. ATR's Interpreting Telephony Research Laboratories carried out an international joint experiment on speech translation with Carnegie Mellon University in the U.S. and Siemens Corporation/Karlsruhe University in Germany in 1993 [1, 2]. In Germany, Verbmobil, a long-term project on the translation of face-to-face dialogues, was launched in 1991; this project will continue until 2001 [3]. Korea Telecom held an international joint experiment with KDD (Kokusai Denshin Denwa) in Japan on speech translation between Korean and Japanese in 1995 [4]. C-STAR II is one of the noteworthy consortia and is scheduled to hold an international joint experiment in 1999 [5].

In this paper, we present a multi-lingual speech translation system consisting of a Korean speech recognizer, Korean-to-Japanese and Korean-to-English machine translators, and Korean, Japanese, and English speech synthesizers. In section 2, a system overview is presented. In section 3, we describe the Korean speech recognizer. The machine translators are presented in section 4, and section 5 describes the speech synthesizers. Section 6 gives the evaluation results of the system. Finally, we draw a conclusion in section 7.

2 System overview

Figure 1 shows the system overview. The dotted box designates the modules which we have developed. When a customer speaks a Korean sentence after pressing the menu button labeled "say", the system automatically detects the endpoints of the speech and finds the top N sentence hypotheses, one of which is selected to be translated into a Japanese or English sentence. Two translators, Korean to Japanese and Korean to English, are currently available in the system, and the customer can toggle a menu button to select the desired language. The system also has three TTS (Text-To-Speech) systems, which can generate speech in Korean, Japanese, and English. The translated Korean speech is delivered to the Korean customer, and the Japanese and English speech is delivered to the hotel reservation desk in Japan or in the U.S., respectively.
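The forward data flow can be illustrated with a small sketch. The Python below is only a schematic stand-in for the modules of sections 3-5; the function names, stub outputs, and romanized strings are illustrative assumptions, not the system's actual interfaces.

    # Schematic forward path: Korean speech -> N-best recognition ->
    # translation -> target-language speech.  All three stages are toy
    # stand-ins for the real modules described in sections 3-5.

    def recognize_korean(samples, n_best=5):
        # stand-in for the HMM recognizer of section 3 (returns N-best sentences)
        return ["hotel bang-eul yeyakhago sipseumnida"][:n_best]

    def translate(sentence, target_lang):
        # stand-in for the direct-transfer / EBMT translators of section 4
        return {"japanese": "(translated Japanese sentence)",
                "english": "(translated English sentence)"}[target_lang]

    def synthesize(sentence, target_lang):
        # stand-in for the TTS systems of section 5
        return "<%s speech for: %s>" % (target_lang, sentence)

    def speech_translate(samples, target_lang="japanese"):
        hypotheses = recognize_korean(samples, n_best=5)  # top N = 5 hypotheses
        selected = hypotheses[0]                          # one is selected
        return synthesize(translate(selected, target_lang), target_lang)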
Figure 1: System overview. The Korean customer's speech passes through the Korean speech recognizer to the Korean-to-Japanese or Korean-to-English translator and then to the Japanese or English TTS, which speaks to the Japanese or English reservation officer; in the reverse direction, Japanese and English speech recognizers would feed the Japanese-to-Korean and English-to-Korean translators and the Korean TTS.

3 Speech Recognition

3.1 Feature Extraction

The speech signal is sampled at an 8 kHz rate with a μ-law, 8-bit codec and pre-emphasized with a filter whose transfer function is $1 - 0.95z^{-1}$. The pre-emphasized speech is then divided into frames; each frame spans 20 msec and is overlapped by 10 msec. Linear Predictive Coding (LPC) analysis is performed, and a set of LPC-derived cepstral coefficients is computed from the LPC coefficients. The LPC-derived cepstral coefficients are weighted by a window $W_c(m)$ of the form

$$W_c(m) = 1 + \frac{Q}{2} \sin\left(\frac{\pi m}{Q}\right), \quad 1 \le m \le Q,$$

where Q is the order of the LPC analysis. We used four kinds of VQ codebooks for the speech recognition: (1) weighted LPC cepstral coefficients, (2) their differences, (3) their second-order differences, and (4) differenced log power and its second-order difference.

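This front end can be sketched compactly. In the following Python (NumPy) sketch, the pre-emphasis filter, frame sizes, liftering window, and difference features follow the description above, while the LPC order Q = 12 and the LPC-to-cepstrum recursion are illustrative choices, since the text does not fix them.

    import numpy as np

    FRAME, SHIFT = 160, 80     # 20 ms frames, 10 ms shift at 8 kHz
    Q = 12                     # LPC order (illustrative; not specified above)

    def preemphasize(x, a=0.95):
        # H(z) = 1 - 0.95 z^{-1}
        return np.append(x[0], x[1:] - a * x[:-1])

    def split_frames(x):
        n = 1 + max(0, (len(x) - FRAME) // SHIFT)
        return np.stack([x[i * SHIFT : i * SHIFT + FRAME] for i in range(n)])

    def lifter(q=Q):
        # W_c(m) = 1 + (Q/2) sin(pi m / Q), 1 <= m <= Q
        m = np.arange(1, q + 1)
        return 1.0 + (q / 2.0) * np.sin(np.pi * m / q)

    def lpc_to_cepstrum(a, q=Q):
        # standard LPC-to-cepstrum recursion; a[1..q] are LPC coefficients
        c = np.zeros(q + 1)
        for m in range(1, q + 1):
            c[m] = a[m] + sum(k / m * c[k] * a[m - k] for k in range(1, m))
        return c[1:] * lifter(q)   # weighted cepstral coefficients

    def delta(c):
        # first-order frame-to-frame differences (apply twice for second order)
        d = np.zeros_like(c)
        d[1:] = c[1:] - c[:-1]
        return d

The four VQ codebooks would then be trained on the weighted cepstra, their deltas, their second-order deltas, and the power features, respectively.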
3.2 Phonetic Models

It is necessary to choose basic units for an HMM-based speech recognition system. We chose phoneme-like phones as basic units: we first selected 60 phones and then expanded them into 300 context-dependent units, considering coarticulation phenomena. Our phoneme model has 7 states and 12 transitions [6]. The 12 transitions are tied into 3 groups, and transitions in the same group share the same output probabilities. We considered three kinds of coarticulation phenomena for the phonetic models [7]:

1. Silence: Depending on the speaker, inter-word silence may or may not be present, so optional silence is needed. We adopted a null transition for skipping the silence.

2. Intra-word model: We used triphones for modeling intra-word coarticulation. In deciding the number of triphones, the unit reduction rule was used [8].

3. Inter-word model: Inter-word coarticulation occurs before and after each word in continuous speech. In the training stage, partial connection is allowed in representing a sentence with triphones; only the most detailed connection is allowed. In the recognition stage we allow full connection, in which all triphones that can be connected are allowed [9].

Figure 2 shows the phonetic transcription of the Korean word "ne" using triphones.

Figure 2: Phonetic transcription of the word "ne"

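The transition tying can be made concrete with a small data structure. In this Python sketch the exact arc layout is schematic (the real 7-state, 12-transition topology follows [6]); what matters is that every transition carries a group label and all transitions in a group share one output distribution.

    # Schematic 7-state, 12-transition phone HMM with tied transitions.
    # The arc layout below is illustrative; the real topology follows [6].
    N_STATES = 7
    TRANSITIONS = [                      # (from_state, to_state, tied_group)
        (0, 1, "B"), (1, 1, "B"), (1, 2, "B"), (1, 3, "B"),
        (2, 2, "M"), (2, 3, "M"), (3, 3, "M"), (3, 4, "M"),
        (4, 4, "E"), (4, 5, "E"), (5, 5, "E"), (5, 6, "E"),
    ]

    # One shared output distribution per group: all 12 transitions emit
    # through only 3 distributions (here, toy probabilities over VQ symbols).
    OUTPUT_PMF = {g: {} for g in ("B", "M", "E")}

    def emission_prob(transition, vq_symbol):
        _, _, group = transition
        return OUTPUT_PMF[group].get(vq_symbol, 1e-6)  # floor for unseen symbols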
3.3 Language Processing

Two kinds of language models were used for language processing. We adopted a bigram as the forward language model and a right-to-left chart parser with dependency grammar as the backward language model. Figure 3 shows the continuous speech recognition system. The perplexity of the class-based bigram language model is around 10. The bigram that we used is a statistical first-order class grammar in which the probability of a word $W_1$ being followed by another word $W_2$ is given by

$$P(W_2 \mid W_1) = \sum_{\text{path}} P(C_1 \mid W_1)\, P(C_2 \mid C_1)\, P(W_2 \mid C_2),$$

where $C_1$ ranges over the classes to which $W_1$ belongs and $C_2$ over the classes to which $W_2$ belongs [10]. The grammar uses 44 classes determined by heuristics.

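The formula translates directly into code. In the Python sketch below, the three probability tables are toy stand-ins for the model estimated over the 44 heuristic classes:

    # Class-based bigram: P(w2 | w1) = sum over class paths of
    # P(c1 | w1) * P(c2 | c1) * P(w2 | c2).

    def class_bigram(w1, w2, p_cw, p_cc, p_wc):
        return sum(
            p_cw[w1].get(c1, 0.0)
            * p_cc.get(c1, {}).get(c2, 0.0)
            * p_wc.get(c2, {}).get(w2, 0.0)
            for c1 in p_cw[w1]                 # classes of w1
            for c2 in p_cc.get(c1, {})         # classes reachable from c1
        )

    # Toy tables: two classes, deterministic class membership.
    p_cw = {"single": {"NUM": 1.0}, "room": {"NOUN": 1.0}}
    p_cc = {"NUM": {"NOUN": 0.8}}
    p_wc = {"NOUN": {"room": 0.3}}
    print(class_bigram("single", "room", p_cw, p_cc, p_wc))  # 1.0*0.8*0.3 = 0.24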
3.4 Search Algorithm

In order to provide an efficient interface between the speech and language processing components of a spoken language system, we chose the N-best algorithm, a time-synchronous Viterbi beam search that can be made to find the N most likely whole-sentence hypotheses within a given beam threshold of the most likely sentence [11]. N is fixed at 5 in the system. There are several methods for finding multiple sentence hypotheses [12]. We use the word-dependent N-best algorithm, a compromise between the exact sentence-dependent algorithm and the lattice algorithm. At each state within a word, the total probability is preserved separately for each of n different preceding words, and at the end of each word we record the score for each preceding-word hypothesis along with the name of that word. We then proceed with a single theory, labeled with the name of the word that just ended. At the end of the sentence, we perform a recursive traceback to derive the list of the most likely sentences, using the backward language model. The number n is set to 2.

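The word-dependent record keeping can be illustrated at the word level. The Python sketch below is a toy dynamic program over a word lattice rather than the frame-synchronous Viterbi beam search itself; the lattice format and the zero-valued language model stub are assumptions for illustration.

    import heapq

    def word_dependent_nbest(word_lattice, n_prev=2, n_best=5):
        # word_lattice: one dict per word slot, {word: acoustic log score}.
        def lm(prev_word, word):
            return 0.0            # stub for the class bigram of section 3.3

        # records[t][w] = up to n_prev (score, previous_word) pairs kept at
        # the end of word w in slot t: the word-dependent part of [12].
        records = [{w: [(s, None)] for w, s in word_lattice[0].items()}]
        for slot in word_lattice[1:]:
            layer = {}
            for w, acoustic in slot.items():
                cands = [(recs[0][0] + lm(p, w) + acoustic, p)
                         for p, recs in records[-1].items()]
                layer[w] = heapq.nlargest(n_prev, cands)
            records.append(layer)

        def traceback(t, w):      # recursive traceback into whole sentences
            if t == 0:
                return [([w], records[0][w][0][0])]
            return [(sent + [w], score)
                    for score, prev in records[t][w]
                    for sent, _ in traceback(t - 1, prev)]

        finals = [h for w in records[-1] for h in traceback(len(records) - 1, w)]
        return heapq.nlargest(n_best, finals, key=lambda h: h[1])

    # e.g. word_dependent_nbest([{"I": -1.0}, {"want": -2.0, "went": -2.5}])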
Table 1: Characteristics of the speakers who participated in the database collection

                Number of talkers
    Ages        Male    Female
    20s         15      5
    30s         15      5
    40s         10      4
    50s         4       0

3.5 Database

We collected dialogues between a customer and a hotel reservation desk, and the dialogues were modeled as 13 sentence templates encompassing 300 words. A total of 58 men and women, whose ages range from the 20s to the 50s, read 5,901 sentences; each speaker read about 100 sentences naturally over a headset microphone. 54 of the 58 persons were designated as training speakers and the remaining 4 as test speakers. Table 1 shows the characteristics, with regard to sex and age, of the people who participated in the database collection.

4 Machine Translation

4.1 Korean to Japanese Translation

The Korean to Japanese machine translation process is made up of analysis, transfer, and generation. Since the grammatical structure of Korean is similar to that of Japanese and the task domain is restricted to dialogues for hotel reservation, the machine translation could be implemented with ease. We do not use a separate morphological analyzer, since the speech recognizer works as one, and the right-to-left chart parser with dependency grammar is used for both speech recognition and machine translation, as shown in Figure 4. Since Korean has free word order, it is difficult to analyze a Korean sentence with a phrase structure grammar, which is suitable for languages with rigid word order. Dependency grammar provides a good solution, since it focuses on the relationship between modifier words and the modified word [13]. We adopted the right-to-left parsing technique rather than the conventional left-to-right technique, since the main predicate of most Korean sentences appears at the end. In the transfer stage, the direct transfer method is used, and simple generation rules are applied in the generation stage.

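As a simplified illustration of right-to-left analysis for a head-final language (the real system uses a chart parser; the greedy attachment loop and the toy part-of-speech test below are only stand-ins):

    # Right-to-left dependency attachment: scanning from the sentence end,
    # each word is attached to the nearest later word that can govern it.

    def parse_right_to_left(words, can_depend_on):
        """Return a list of (modifier, head) arcs; the last word is the root."""
        arcs = []
        for i in range(len(words) - 2, -1, -1):    # right to left
            for j in range(i + 1, len(words)):     # candidate heads to the right
                if can_depend_on(words[i], words[j]):
                    arcs.append((words[i], words[j]))
                    break
        return arcs

    # Toy grammar: nouns depend on the verb, determiners on the nearest noun.
    def toy_rel(mod, head):
        pos = {"I": "N", "a": "DET", "single": "DET", "room": "N", "want": "V"}
        return (pos[mod], pos[head]) in {("N", "V"), ("DET", "N")}

    print(parse_right_to_left(["I", "a", "single", "room", "want"], toy_rel))
    # [('room', 'want'), ('single', 'room'), ('a', 'room'), ('I', 'want')]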
Figure 3: The continuous speech recognition system. Speech passes through feature extraction into a Viterbi search that uses the bigram as the forward language model; the right-to-left chart parser with dependency grammar serves as the backward language model, producing the N-best sentences.

Table 2: Performance of the Korean continuous speech recognition system

            Recognition rate (%)
        Word                Sentence
    Top 1    Top 5      Top 1    Top 5
    94.68    98.27      82.42    95.07

4.2 Korean to English Translation

The Korean to English machine translator consists of a morphological analyzer, a syntactic analyzer, a translator, and a generator. The morphological analyzer extracts the morphological structure from an input Korean sentence and then loads the corresponding linguistic information from a dictionary; it also resolves ambiguities using context information. The syntactic analyzer, which uses the results of the morphological analyzer, generates the syntactic relations among the constituent words as a tree structure; dependency grammar and the right-to-left analysis method are also used here. The English translator translates the Korean syntactic structure into an English one using EBMT (Example-Based Machine Translation) [14]. This method uses an example database that has already been constructed as pairs of Korean and English structures: given a Korean structure, the most similar Korean example is found in the database and its paired English structure is chosen. Finally, the English generator, which produces the surface English sentence from the English structure, consists of a syntactic generator and a morphological generator. The syntactic generator arranges the words so as to fit English grammar, and the morphological generator deals with phenomena such as phonological contraction and irregular conjugation.

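The retrieval step of EBMT can be sketched as follows. In this Python sketch, flat romanized strings and a word-overlap (Dice) measure stand in for the real syntactic structures and whatever structural similarity measure the translator actually employs:

    # EBMT retrieval sketch: the example database stores (Korean structure,
    # English structure) pairs; the entry most similar to the input Korean
    # structure is selected and its English side returned.

    EXAMPLES = [  # toy pairs; real entries are syntactic structures
        ("bang yeyak hada",  "reserve a room"),
        ("yogeum mureoboda", "ask about the rate"),
    ]

    def similarity(a, b):
        # stand-in measure: word overlap (Dice coefficient)
        sa, sb = set(a.split()), set(b.split())
        return 2 * len(sa & sb) / (len(sa) + len(sb))

    def ebmt_translate(korean):
        best = max(EXAMPLES, key=lambda pair: similarity(korean, pair[0]))
        return best[1]

    print(ebmt_translate("bang yeyak hago sipda"))   # -> "reserve a room"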
5 Speech Synthesis

5.1 Korean TTS System

We developed a Korean speech synthesizer which can convert arbitrary text into speech. It is composed of phonetic preprocessing, prosody generation, and the concatenation of synthesis units. A Korean sentence entered into the speech synthesizer is first analyzed for use in prosody generation and unit concatenation. We then generate the prosody, which consists of intonation, duration, and gain. In the concatenation of synthesis units, prosody generation and inter-phoneme interpolation are done simultaneously: each segment of speech is connected by an interpolation technique, and the prosody is inserted at the same time. As the basic unit of synthesis, we used the demiphone [15]; it reduces the memory size while still keeping quite a good quality of speech.

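The demiphone scheme can be illustrated schematically: each phone contributes a left half (coarticulated with the preceding phone) and a right half (coarticulated with the following one), and consecutive units are joined by interpolation. The Python (NumPy) sketch below is an illustrative simplification; the overlap length, crossfade, and unit naming are assumptions, and the real synthesizer also imposes the generated prosody.

    import numpy as np

    def demiphone_units(phones):
        # each phone -> (prev, phone, 'L') and (phone, next, 'R'); '#' = silence
        ctx = ["#"] + phones + ["#"]
        units = []
        for i in range(1, len(ctx) - 1):
            units.append((ctx[i - 1], ctx[i], "L"))   # left half of ctx[i]
            units.append((ctx[i], ctx[i + 1], "R"))   # right half of ctx[i]
        return units

    def concatenate(units, inventory, overlap=40):
        # join stored waveforms with a linear crossfade over `overlap` samples
        out = np.zeros(0)
        for u in units:
            seg = np.asarray(inventory[u], dtype=float)
            if len(out) >= overlap and len(seg) >= overlap:
                ramp = np.linspace(0.0, 1.0, overlap)
                out[-overlap:] = (1 - ramp) * out[-overlap:] + ramp * seg[:overlap]
                seg = seg[overlap:]
            out = np.concatenate([out, seg])
        return out

    # demiphone_units(["n", "e"]) ->
    # [('#','n','L'), ('n','e','R'), ('n','e','L'), ('e','#','R')]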
5.2 Japanese and English TTS System

Since our system is a speech-to-speech translation system, Japanese and English speech synthesizers that generate speech from the translated text are indispensable. As the Japanese TTS system, we used NewWave version 2.04 with the permission of KDD. For the English TTS system, we used a commercial product, ORATOR release 10.2, which was developed by Bellcore. (ORATOR is a registered trademark of Bellcore.)

6 Evaluation of the performance

As the test speech data, 644 utterances of four persons were used. The results are shown in Table 2. Our system has a word recognition rate of 94.68% for the first candidate (designated as top 1) and 98.27% for the candidates from the first through the fifth (designated as top 5); the sentence recognition rate is 82.42% for top 1 and 95.07% for top 5. Korean to Japanese and English translation shows a success rate of 100%. This is due to the fact that only outputs of the speech recognizer are entered into the translators; in other words, all inputs to the machine translators were generated from the predetermined vocabularies and grammar which the speech recognizer used. Because we have already considered all possible outputs of the speech recognizer and designed the machine translators under this assumption, all translation results should be correct. Since all translations were done correctly, the overall speech-to-speech translation success rate, including speech recognition, equals the sentence recognition rate. We used one TMS320C30 DSP board, which performs real-time feature extraction. The system runs in real time, generating a Japanese or English sentence from a spoken Korean sentence.

Figure 4: The machine translation system combined with the speech recognition system. Korean speech enters the speech recognition system, whose forward search is followed by the dependency-grammar backward search; the backward search is shared with the machine translation system, which performs transfer and generation to produce the translated Japanese.

7 Conclusion

In this paper, we described a multi-lingual speech-to-speech translation system which translates a Korean utterance into a Japanese or English one. The task domain of the system is restricted to dialogues for hotel reservation. The system consists of a Korean speech recognizer, Korean-to-Japanese and Korean-to-English machine translators, and three kinds of speech synthesizers. The speech recognition system is an HMM-based continuous speech recognizer with a vocabulary of about 300 words. We have achieved a word recognition rate of 94.68% and a sentence recognition rate of 82.42%. The machine translator employs the direct transfer method for Korean to Japanese translation and EBMT for Korean to English translation. Since the translation success rate was 100%, the overall speech-to-speech translation success rate equals the sentence recognition rate. Three speech synthesizers were used: for Korean TTS we developed our own system, which uses the demiphone as the basic synthesis unit, while NewWave, developed by KDD, and ORATOR are installed in the system for generating Japanese and English speech, respectively. In the future, we expect to expand the task domain so that people can use the system more conveniently.

Acknowledgment

The authors would like to acknowledge the contributions of all members of the spoken language research team of the Multimedia Technology Research Laboratory in developing the system.

References

[1] T. Morimoto, et al., "ATR's speech translation system: ASURA," Proc. 3rd European Conf. on Speech Communication and Technology, pp. 1291-1294, Sep. 1993.

[2] A. Waibel, et al., "JANUS: a speech-to-speech translation system using connectionist and symbolic processing strategies," Proc. 1991 International Conf. on Acoustics, Speech, and Signal Processing, pp. 793-796, May 1991.

[3] W. Wahlster, et al., "Verbmobil: translation of face-to-face dialogs," Proc. 3rd European Conf. on Speech Communication and Technology, pp. 29-38, Sep. 1993.

[4] Myoung-Wan Koo, et al., "KT-STS: a speech translation system for hotel reservation and a continuous speech recognition system for speech translation," Proc. 4th European Conf. on Speech Communication and Technology, pp. 1227-1230, Madrid, Sep. 1995.

[5] C-STAR II: Consortium for Speech Translation Advanced Research, http://www.is.cmu.edu/cstar/CSTAR-II.html.

[6] K.-F. Lee, Automatic Speech Recognition: The Development of the SPHINX System. Kluwer Academic Publishers, Norwell, Mass., 1989.

[7] Myoung-Wan Koo, et al., "A Korean continuous speech recognition system for finding N-best sentence hypotheses," Proceedings on Speech Communication and Signal Processing Workshop, pp. 48-51, Oct. 1994.

[8] C. H. Lee, et al., "Acoustic modeling of subword units for speech recognition," Proc. 1990 IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pp. 721-724, April 1990.

[9] E. P. Giachin, et al., "On the use of inter-word context-dependent units for word juncture modeling," Computer Speech and Language, vol. 6, pp. 197-213, 1992.

[10] A. Derr, et al., "A simple statistical class grammar for measuring speech recognition performance," Proc. Speech and Natural Language Workshop, pp. 147-149, Oct. 1989.

[11] Y. Chow, et al., "The N-best algorithm: an efficient procedure for finding top N sentence hypotheses," Proc. Speech and Natural Language Workshop, Oct. 1989.

[12] R. Schwartz, et al., "A comparison of several approximate algorithms for finding multiple (N-best) sentence hypotheses," Proc. 1991 International Conf. on Acoustics, Speech, and Signal Processing, pp. 701-704, May 1991.

[13] I. A. Mel'cuk, Dependency Syntax: Theory and Practice, The State University of New York Press, 1988.

[14] O. Furuse and H. Iida, "An example-based method for transfer-driven machine translation," Proc. 13th International Conf. on Computational Linguistics, pp. 645-651, 1992.

[15] E. I. Kim and J. I. Kim, "HanSoRi: an unlimited synthesis system," Proceedings on Speech Communication and Signal Processing Workshop, pp. 342-345, Oct. 1994.