EXPERIMENTS FOR AN APPROACH TO LANGUAGE IDENTIFICATION WITH CONVERSATIONAL TELEPHONE SPEECH

Yonghong Yan and Etienne Barnard

Center for Spoken Language Understanding
Oregon Graduate Institute of Science and Technology
20000 N.W. Walker Road, Portland, OR 97291-1000
[email protected]

ABSTRACT

This paper presents our recent work on language identification research using conversational speech (the LDC Conversational Telephone Speech Database). The baseline system used in this study was developed recently ([4, 5]). It is based on language-dependent phone recognition and phonotactic constraints. The system was trained on monologue data and obtained an error rate of around 9% on a commonly used nine-language monologue test set. When the system was used to process conversational speech from the same nine-language task, a dramatic performance degradation (an error rate of 40%) was observed. Based on our analysis of conversational speech, two methods are proposed: (1) pre-processing and (2) post-processing. Without any training data from a conversational speech database, the final system (the baseline system enhanced by the two proposed methods) obtained an error rate of 24%, a substantial improvement (a 41% error reduction) over the baseline system.

1. Introduction

Recent years have seen tremendous advances in the accuracy of the best language-identification (LID) systems. The best systems in the 1993 evaluation by the National Institute of Standards and Technology (NIST) performed with an error rate of 40% to 50% when recognizing one of ten languages from a ten-second segment ([1]). In contrast, error rates of around 25% were obtained on comparable tasks during the most recent (1995) evaluation ([6]). In light of these improvements, it is important to ask how well these systems generalize to different application scenarios. Since one of the potential commercial applications of LID is international long-distance telephone service, we have examined the performance of our system when applied to


the identification of conversational speech. (Until now, our research had focused exclusively on elicited monologues, which are part of the OGI TS corpus.) The platform for this study is a system developed based on our recent approach ([4, 5, 3, 2]). The LID algorithm is designed to exploit phonotactic constraints and duration information based on language-dependent phone recognition. The system developed from this algorithm was amongst the best systems in the NIST'95 evaluation ([6]), and has been further refined since then. When the system was used to identify conversational speech without any modification, severe performance degradation was found; the error rate on a particular nine-language task increased from 9% to around 40%. Since there is no publicly available multi-language conversational speech database suitable for our training purposes, two methods (pre-processing and post-processing) are proposed in this paper for improving system performance without using training data containing conversational speech. The general idea of these methods is to eliminate some of the most severe distortions introduced by the differences between these two types of input speech data. An error-rate reduction of 41% was achieved by the proposed methods, and our work on the processing of conversational speech is continuing.

2. Overview of the System

Speech data are parameterized using 25.6 ms frames with 12.8 ms overlap between contiguous frames. A 26-dimensional feature vector is calculated: 12 LPC cepstral coefficients plus normalized energy, together with their delta counterparts (delta cepstra plus delta energy). Cepstral mean subtraction is used to perform channel normalization. The general architecture of our system is given in Figure 1. The system is composed of three parts.
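As an illustration of the channel-normalization step, the following sketch (not taken from the original system; the array layout and function name are assumptions) applies per-utterance cepstral mean subtraction to a matrix of feature vectors:

```python
import numpy as np

def cepstral_mean_subtraction(features):
    """Subtract the per-utterance mean from every feature dimension.

    features: array of shape (num_frames, num_dims), e.g. the 26-dimensional
    vectors described above. Returns the channel-normalized features.
    """
    return features - features.mean(axis=0, keepdims=True)

# Hypothetical usage with random data standing in for real LPC-cepstral features.
feats = np.random.randn(500, 26)
norm_feats = cepstral_mean_subtraction(feats)
```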

Figure 1: General structure of the LID system. The input speech signal is decoded by M parallel recognizers; each recognizer feeds a score generator containing LID models 1, ..., N, and all scores are passed to the final classifier, which produces the LID result.

2.1. Front End

Six (M = 6) language-dependent phone recognizers were implemented, for English, German, Hindi, Japanese, Mandarin and Spanish. The phone accuracies of these six recognizers are approximately 45% to 55%; details can be found in [4, 3]. The recognizers run in parallel and independently decode the input speech vectors into phone strings. The output of each recognizer is a time-aligned phone string with an acoustic probability attached to each phone.
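The recognizer output described above can be pictured as a list of time-aligned phone tokens. The following sketch is only an illustrative data structure; the field names and example values are assumptions, not taken from the original system:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PhoneToken:
    phone: str        # decoded phone label
    start_frame: int  # first frame of the phone
    end_frame: int    # last frame of the phone
    log_prob: float   # acoustic log-probability attached to the phone

# One hypothetical decoding of an utterance by a single language-dependent recognizer.
decoded: List[PhoneToken] = [
    PhoneToken("sil", 0, 14, -2.8),
    PhoneToken("hh", 14, 21, -5.1),
    PhoneToken("ah", 21, 33, -4.0),
]
```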

2.2. Score Generator

The score generator takes the outputs (the decoded phone strings) from each language-dependent phone recognizer, calculates various LID feature scores, and provides them to the final classifier. Each score generator contains three sets of LID models:

- Forward language model

      P_{LF} = \prod_{i=1}^{T} ( \alpha P(O_i | O_{i-1}) + \beta P(O_i) )        (1)

  where O_i is the ith phone in the decoded best path, P(O_i) is the unigram term and P(O_i | O_{i-1}) is the bigram term; \alpha and \beta are the weight coefficients. T is the total number of phones in the decoded utterance.

- Backward language model

      P_{LB} = \prod_{i=1}^{T} ( \alpha P(O_i | O_{i+1}) + \beta P(O_i) )        (2)

- Duration model

      P_{D} = \prod_{i=1}^{T} ( (1 - \gamma) P(d_i | O_i, O_{i-1} \in S) + \gamma P(d_i | O_i) )        (3)

  where d_i is the duration of the ith decoded phone, P(d_i | O_i, O_{i-1} \in S) is the context-dependent model, and S is one of the six broad categories: vowel, fricative, stop, nasal, affricate or glide. P(d_i | O_i) is the original context-independent monophone duration model, which is used here as a smoothing factor with weight \gamma.

The language models were optimized by the method proposed in [5]; a sketch of the forward-model computation is given below.
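For concreteness, the forward score of Eq. (1) can be accumulated in the log domain as in the sketch below. The dictionary-based model representation, the handling of the first phone, and the small flooring constant are implementation assumptions rather than details of the original system; the backward model of Eq. (2) is computed analogously over the reversed phone order.

```python
import math

def forward_lm_score(phones, bigram, unigram, alpha, beta):
    """Log of P_LF in Eq. (1): product over the decoded phone string of
    alpha * P(O_i | O_{i-1}) + beta * P(O_i).

    bigram:  dict mapping (prev_phone, phone) -> probability
    unigram: dict mapping phone -> probability
    """
    log_score = 0.0
    prev = None
    for phone in phones:
        p_bigram = bigram.get((prev, phone), 0.0) if prev is not None else 0.0
        p_unigram = unigram.get(phone, 0.0)
        # Interpolate the bigram and unigram terms, flooring to avoid log(0).
        log_score += math.log(alpha * p_bigram + beta * p_unigram + 1e-12)
        prev = phone
    return log_score
```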

2.3. Final Classifier

A feed-forward neural network with one hidden layer and full connections between successive layers is used to learn the relations among these scores. The output of the neural network is the final LID result for the input utterance.
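The final classifier can be sketched as a single-hidden-layer network over the concatenated scores. The layer sizes, the tanh/softmax choices and the weight variables below are assumptions made for illustration, since the paper does not specify them:

```python
import numpy as np

def lid_classifier(scores, W1, b1, W2, b2):
    """One-hidden-layer feed-forward network over the concatenated LID scores.

    scores: 1-D vector collecting the scores from all score generators.
    Returns one normalized output per candidate language.
    """
    hidden = np.tanh(scores @ W1 + b1)     # fully connected hidden layer
    logits = hidden @ W2 + b2              # fully connected output layer
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                 # normalized outputs over languages

# Hypothetical sizes: 54 input scores, 30 hidden units, 9 languages.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((54, 30)), np.zeros(30)
W2, b2 = rng.standard_normal((30, 9)), np.zeros(9)
outputs = lid_classifier(rng.standard_normal(54), W1, b1, W2, b2)
```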

3. Two Proposed Methods

Conversational speech is quite different from elicited monologues (whether read or spontaneous). Conversations contain frequent filled pauses (such as "uh-huh"), repetitions, hesitations, excitations, false starts, and particularly poor articulation. As a result, conversational speech has much higher variability in phone quality and duration. This causes problems when recognizers trained on monologue speech data are used to recognize the underlying phones of a conversational utterance.

3.1. Analysis of Conversational Speech

By listening to some of the data from the LDC CallFriend sample release, we found that the conversational data contain:

- more filled-pause segments,
- more emotional speech and larger prosodic variation,
- long silence segments.

Using the English recognizer developed for the system described in Section 2 as an automatic labeler, the differences between the monologue and conversational speech data used in this study are illustrated in Figures 2 and 3.

Figure 2: Phone distributions (unigram frequencies over the English phone set) decoded from the OGI monologue data and the LDC conversational data by the English phone recognizer.

Figure 3: Distributions of phone durations decoded from the OGI monologue data and the LDC conversational data by the English phone recognizer.

3.2. Pre-processing

One major difference between the training data (monologue) and the testing data (conversational speech) is the presence of long silence segments in the latter. This presents a problem for the front end. In our implementation, channel normalization is performed by cepstral mean subtraction, which does not consider whether the frame being processed is speech or not. In our training data, the silence segments are generally only a small part of each data file; after subtraction, not only the channels but also the speakers in the training data are thus normalized. With the long silence segments in conversational speech, the mean of a conversational utterance contains more information about the channel, so the speakers are not well normalized. This causes a mismatch between the training and testing environments.

To overcome this, we propose a simple energy-based algorithm to delete the long silence segments. Within a segment of T frames, the energy E_i of frame i is calculated and compared with the threshold T_E given in (4):

    T_E = \alpha \frac{1}{T} \sum_{i=1}^{T} E_i + \beta E_{\min}        (4)

where

    E_{\min} = \min_{1 \le i \le T} E_i        (5)

The energy of each frame (with N samples) is calculated as

    E_i = \sum_{j=1}^{N} x_j^2        (6)



where x_j is a sample in the frame. Frame i is classified as speech if E_i \ge T_E, and as silence if E_i < T_E. In our implementation, the weights \alpha and \beta are set to 1000.0 and 6.0, respectively.
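A minimal sketch of the silence-deletion step is given below, assuming the threshold of Eq. (4) as reconstructed above (a weighted combination of the segment's mean energy and its minimum energy). The function name, array layout and frame handling are assumptions, not the original implementation:

```python
import numpy as np

def remove_long_silence(frames, alpha, beta):
    """Energy-based silence deletion following Eqs. (4)-(6).

    frames: array of shape (T, N), i.e. T frames of N samples each.
    Returns only the frames classified as speech.
    """
    energies = (frames ** 2).sum(axis=1)                  # Eq. (6): E_i = sum_j x_j^2
    e_min = energies.min()                                # Eq. (5)
    threshold = alpha * energies.mean() + beta * e_min    # Eq. (4)
    speech = energies >= threshold                        # speech if E_i >= T_E
    return frames[speech]
```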


3.3. Post-processing



By examining the outputs (decoded phone strings), we find some obvious error patterns in the recognition results produced by the phone recognizers. A post-processing algorithm is proposed to correct some of these obvious mistakes. It applies the following rules to the outputs of the phone recognizers; the rules were generated from the analysis of the differences in the unigrams and bigrams between the monologue and conversational speech (a sketch of this rule-based correction is given after the list):

- single consonants surrounded by silences are not allowed;
- segments containing more than 4 contiguous consonants are not allowed;
- single (or repeated) nasal/(z,ah,s) pairs are not allowed;
- contiguous silence segments are merged.

The processed phone strings are sent to the score generators.
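The sketch below illustrates two of the four rules (deleting single consonants surrounded by silences, and merging contiguous silence segments). The silence label "sil", the broad-class dictionary and the function name are assumptions made for illustration:

```python
CONSONANT_CLASSES = {"stop", "fricative", "nasal", "affricate", "glide"}  # assumed class names

def postprocess(phones, phone_class):
    """Rule-based clean-up of a decoded phone string (illustrative sketch).

    phones: list of phone labels, with "sil" marking silence segments.
    phone_class: dict mapping each phone label to a broad phonetic class.
    """
    # Rule: a single consonant surrounded by silences is not allowed; drop it.
    kept = []
    for i, p in enumerate(phones):
        prev_p = phones[i - 1] if i > 0 else "sil"
        next_p = phones[i + 1] if i + 1 < len(phones) else "sil"
        if phone_class.get(p) in CONSONANT_CLASSES and prev_p == "sil" and next_p == "sil":
            continue
        kept.append(p)

    # Rule: contiguous silence segments are merged into one.
    merged = []
    for p in kept:
        if p == "sil" and merged and merged[-1] == "sil":
            continue
        merged.append(p)
    return merged

# Hypothetical usage:
# postprocess(["sil", "t", "sil", "ah", "m", "sil", "sil"],
#             {"t": "stop", "ah": "vowel", "m": "nasal"})
```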

4. Databases

The database used for conversational speech processing is the data sample distribution from the CallFriend project, collected by the Linguistic Data Consortium (LDC). These data were recorded over switchboards; the two channels (local and remote) were recorded simultaneously and stored in two different files. The sample release, issued in May 1995, contains a total of 6 hours of speech data: real telephone conversations in nine languages (English, Farsi, German, Hindi, Japanese, Mandarin, Spanish, Tamil and Vietnamese). The Oregon Graduate Institute Multi-Language Telephone Speech Corpus (OGI TS), a monologue speech database, was also used; data from the above nine languages were used to develop our system. The NIST'94 evaluation test data for these nine languages were used to provide a benchmark result for our system on monologue data. The developed system achieved an error rate of 9% on this data set.

5. Experiments and Results

Four experiments were conducted on the CallFriend data. Error rates are given in Table 1, together with the error reduction achieved relative to the monologue (baseline) system.

    Approach               Error rate   Error reduction
    Monologue system       40.3%        N/A
    Pre-process system     30.6%        24.1%
    Post-process system    29.2%        27.5%
    Combined system        23.6%        41.4%

    Table 1: LID results on conversational speech

Both pre-processing and post-processing cut the system error rate by approximately a quarter. The final combined system achieves an error reduction of more than 40%. The significant aspect of these methods is that the improvement is achieved without model re-estimation.

6. Discussion

Although the combined system achieves an error reduction of more than 40%, its performance is still inferior to that achieved on the same task with monologue speech as input. This may suggest that the task for conversational speech is inherently more difficult; however, training the system on conversational speech data may be sufficient to recover the results obtained with monologues. (This requires the availability of conversational speech databases.) We intend to investigate this issue in detail, using additional data from the CallFriend corpus when it becomes available.

7. References

[1] Y. K. Muthusamy, E. Barnard, and R. A. Cole. Reviewing automatic language identification. IEEE Signal Processing Magazine, 11(4):33-41, October 1994.

[2] Y. Yan and E. Barnard. A comparison of neural net and linear classifiers as the pattern recognizer in automatic language identification. In International Conference on Neural Networks and Signal Processing (ICNNSP95), to appear, Nanjing, P.R. China, December 1995.

[3] Y. Yan and E. Barnard. Recent improvements to a phonotactic approach to language identification. In Proceedings of the Fifteenth Annual Speech Research Symposium, pages 212-219, Baltimore, Maryland, June 1995.

[4] Y. Yan and E. Barnard. An approach to automatic language identification based on language-dependent phone recognition. In Proceedings of the 1995 IEEE International Conference on Acoustics, Speech, and Signal Processing, pages V-3511 - V-3514, Detroit, Michigan, USA, May 1995.

[5] Y. Yan and E. Barnard. An approach to automatic language identification with enhanced language model. In Eurospeech Proceedings, pages 1351-1354, Madrid, Spain, September 1995.

[6] M. A. Zissman and A. Martin. Language identification overview. In Proceedings of the Fifteenth Annual Speech Research Symposium, pages 2-14, Baltimore, USA, June 1995.