CONTEXT-DEPENDENT MODELLING IN THE ABBOT LVCSR SYSTEM

D.J. Kershaw, M.M. Hochberg and A.J. Robinson

Cambridge University Engineering Department, Cambridge CB2 1PZ, England.
Tel: [+44] 1223 332754, Fax: [+44] 1223 332662
E-mail: {djk, mmh, [email protected]

ABSTRACT

Abbot is the hybrid connectionist-hidden Markov model (HMM) large vocabulary continuous speech recognition system developed at Cambridge University. This system uses a recurrent neural network to estimate the acoustic observation probabilities within an HMM framework. This paper presents a recent enhancement to the Abbot system: limited modelling of phonetic context, which has resulted in a significant improvement in performance. This summary describes the modifications and reports results from the 1994 ARPA and 1995 SQALE evaluation and development test sets.

1. THE CONTEXT-INDEPENDENT SYSTEM

As in HMMs, the hybrid approach uses an underlying hidden Markov process to model the time-varying nature of the speech signal. The Markov process is determined in a hierarchical fashion, e.g., the language model is a Markov process on the words and the words are a Markov process on the phones. In the basic system as described in [1], a recurrent network is used as the acoustic model within the HMM framework. At each 16 msec frame, the input acoustic vector, $u(t)$, is mapped by the network to an output vector, $y(t)$. The output vector represents an estimate of the posterior probability of each of the phone classes, i.e., $y_i(t) \simeq \Pr(q_i(t) \mid u_1^{t+4})$, where $q_i(t)$ is phone $i$ at time $t$ and $u_1^t$ is the input from time 1 to $t$. Because the employed Viterbi decoding process makes use of the likelihood of the acoustic data, the network outputs are mapped to scaled likelihoods by $\Pr(u(t) \mid q_i(t)) \propto y_i(t)/\Pr(q_i)$. Here, $\Pr(q_i)$ is estimated by the relative frequency of the phone in the training data. The recurrent network provides a number of benefits over standard HMM acoustic models (e.g., mixture Gaussian densities). The internal recurrent nodes are able to model acoustic context and the use of the network structure results in a nonparametric model of the acoustic features. These properties enable the system to achieve good performance using context-independent phone models, with only a single state per phone. The compact representation of the acoustic model has many desirable properties, such as fast decoding [4].
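
As a concrete illustration of the posterior-to-likelihood mapping described above, the following Python sketch (hypothetical array names and shapes; not the Abbot implementation) divides the per-frame network outputs by the phone priors to obtain the scaled likelihoods consumed by the Viterbi decoder.

    import numpy as np

    def scaled_likelihoods(posteriors, phone_priors, floor=1e-8):
        """Map network posteriors y_i(t) to scaled likelihoods y_i(t) / Pr(q_i).

        posteriors   : (T, N) array of per-frame phone posteriors from the network
        phone_priors : (N,) array of phone priors Pr(q_i), estimated as the relative
                       frequency of each phone in the training data
        """
        priors = np.maximum(np.asarray(phone_priors, dtype=float), floor)  # avoid divide-by-zero
        return np.asarray(posteriors, dtype=float) / priors                # broadcasts over frames

    # Toy usage: 3 frames, 4 phone classes (illustrative numbers only).
    rng = np.random.default_rng(0)
    post = rng.dirichlet(np.ones(4), size=3)        # each row sums to one
    priors = np.array([0.4, 0.3, 0.2, 0.1])
    print(scaled_likelihoods(post, priors))

Since only ratios matter in decoding, the unknown factor $\Pr(u(t))$ can be dropped; dividing by $\Pr(q_i)$ is what turns the discriminatively trained posteriors into quantities that can stand in for HMM emission likelihoods.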

2. THE CONTEXT-DEPENDENT SYSTEM

A recent improvement to the Abbot system augmented the context-independent recurrent network with a set of phonetic context-dependent modules [2]. These modules are single-layer networks for estimating the conditional context-class posterior, $y_{j|i}(t) \simeq \Pr(c_j(t) \mid u_1^{t+4}, q_i(t))$, where $c_j(t)$ is context class $j$ for phone class $q_i(t)$. The input to the modules is the internal state (similar to the hidden layer of an MLP) of the recurrent network [2]. The joint posterior probability of context class and phone class is simply the product, $y_{ij}(t) \simeq y_i(t)\,y_{j|i}(t)$. These phonetic context-dependent outputs are mapped to scaled likelihoods to be used in decoding by $p(u(t) \mid c_j(t), q_i(t)) \propto y_i(t)\,y_{j|i}(t) / (\Pr(c_j \mid q_i)\,\Pr(q_i))$. Embedded training is used to estimate the parameters of the context modules. Viterbi segmentation was used to align the training data. Each context network was trained on a non-overlapping subset of the state vectors generated from all the Viterbi-aligned training data. The context networks were trained using a gradient-based procedure. The context classes were determined using a decision-tree based approach. This allows for sufficient statistics for training and keeps the system compact (allowing fast context training). The decision trees are also used to relabel the lexicon.
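
To make the combination concrete, here is a small Python sketch (hypothetical shapes and names; not the Abbot module code) that forms the joint posterior $y_{ij}(t) = y_i(t)\,y_{j|i}(t)$ and divides by $\Pr(c_j \mid q_i)\,\Pr(q_i)$ to obtain the context-dependent scaled likelihoods.

    import numpy as np

    def cd_scaled_likelihoods(ci_post, ctx_cond_post, phone_priors, ctx_priors, floor=1e-8):
        """Context-dependent scaled likelihoods, proportional to p(u(t) | c_j(t), q_i(t)).

        ci_post       : (T, N) context-independent posteriors y_i(t)
        ctx_cond_post : (T, N, C) conditional context posteriors y_{j|i}(t)
        phone_priors  : (N,) phone priors Pr(q_i)
        ctx_priors    : (N, C) conditional context priors Pr(c_j | q_i)
        """
        joint_post = ci_post[:, :, None] * ctx_cond_post               # y_i(t) * y_{j|i}(t)
        denom = np.maximum(ctx_priors * phone_priors[:, None], floor)  # Pr(c_j | q_i) Pr(q_i)
        return joint_post / denom                                      # broadcasts over frames

    # Toy usage: 2 frames, 3 phones, 2 context classes per phone (illustrative only).
    rng = np.random.default_rng(1)
    y_i = rng.dirichlet(np.ones(3), size=2)                # (2, 3)
    y_j_given_i = rng.dirichlet(np.ones(2), size=(2, 3))   # (2, 3, 2)
    pr_q = np.array([0.5, 0.3, 0.2])
    pr_c_given_q = np.full((3, 2), 0.5)
    print(cd_scaled_likelihoods(y_i, y_j_given_i, pr_q, pr_c_given_q).shape)  # (2, 3, 2)

In practice each phone has its own set of context classes chosen by the decision trees, so the modules are ragged rather than a dense (N, C) array; the dense layout above is purely for brevity.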

3. RESULTS

The context-dependent system increased the number of phones from 54 context-independent phones to 527 word-internal context-dependent phones. Since small networks are used to model the context classes, the order-of-magnitude increase in phones only doubled the number of acoustic modelling parameters (350k to 620k). This system has been evaluated on the 1995 SQALE (for American and British English) [5] and 1994 ARPA North American Business News [3] development and evaluation tests. These tests utilized either 20,000-word or 64,000-word vocabularies and trigram language models, without any speaker adaptation. The acoustic models have been trained on the Wall Street Journal (WSJ) SI-84 corpus (or the British English equivalent for SQALE). Table 1 shows the results of the context-dependent system (denoted CD) on these tests. For comparison, the results from the context-independent system (i.e., the same acoustic model without the context class modules) and the 1994 Abbot system [1] are also presented. As can be seen in the table, context-dependent modelling significantly reduces the word error rate relative to the context-independent system (in the order of 15%). Note that the US and UK context-dependent systems achieved the lowest reported word error rates for the SQALE evaluation tests shown. Table 1 also shows that the context-dependent system achieves consistently better performance than the 1994 Abbot system [1], which was trained on the WSJ SI-284 corpus. The context-dependent system gives similar performance to many of the state-of-the-art HMM systems that use SI-284 training and cross-word triphone modelling. An additional benefit of the context-dependent modelling is the reduction in decoding time (between 33% and 50% relative to the context-independent system). The average decode time on an HP735 for the context-dependent system is 40 seconds per utterance for the 20k lexicon and 68 seconds per utterance for the 64k lexicon. This improvement is a result of the improved modelling capabilities with only a limited increase in the number of parameters.

Test Set             Lexicon   Eval '94   CI      CD      % Redn
                               WER %      WER %   WER %   CI → CD
SQALE US Dev Test    20k       --         12.8    11.3    12.2
SQALE UK Dev Test    20k       --         15.6    12.7    18.9
SQALE US Eval Test   20k       --         14.3    12.9#   9.8
SQALE UK Eval Test   20k       --         16.5    13.8#   16.3
1994 H1 Dev Test     20k       15.2       15.5    13.9    10.3
1994 H1 Dev Test     64k       11.2       11.5    10.0    13.0
1994 H1 Eval Test    20k       14.1†      15.3    13.1    14.4
1994 H1 Eval Test    64k       12.4‡      13.3    11.2    15.6

Table 1: Official results for the English language portions of the SQALE development and evaluation tests and for the 1994 ARPA development and evaluation tests. The † and ‡ symbols indicate official 1994 Abbot system results for the H1-C1 and H1-P0 tests. The # symbol indicates official Abbot system results for the 1995 SQALE evaluations. CI and CD trained on SI-84 and Eval '94 trained on SI-284. All ARPA tests have been adjudicated with phone-mediated NIST scoring software.

4. SUMMARY

Although this is still preliminary work, context-dependent modelling gives an average 14% reduction in word error rate over the context-independent system and faster decoding, without a large increase in the number of parameters. The system achieves very good performance on the 1994 ARPA tests, especially considering that the acoustic models are trained only on the SI-84 corpus. It is expected that investigation into areas such as discriminative decision-tree clustering, state-based clustering and cross-word modelling will provide further improvement.

5. ACKNOWLEDGEMENTS

This work is funded, in part, by ESPRIT Project 6487 (WERNICKE).

6. REFERENCES

[1] M.M. Hochberg, G.D. Cook, S.J. Renals, A.J. Robinson, and R.S. Schechtman. The 1994 ABBOT Hybrid Connectionist-HMM Large-Vocabulary Recognition System. In Spoken Language Systems Technology Workshop, pages 170-6. ARPA, January 1995.

[2] D.J. Kershaw, M.M. Hochberg, and A.J. Robinson. Incorporating Context-Dependent Classes in a Hybrid Recurrent Network-HMM Speech Recognition System. Technical Report F-INFENG TR217, Cambridge University Engineering Department, May 1995.

[3] F. Kubala. Design of the 1994 CSR Benchmark Tests. In Spoken Language Systems Technology Workshop, pages 41-6. ARPA, January 1995.

[4] S. Renals and M.M. Hochberg. Efficient Search Using Posterior Phone Probability Estimates. In ICASSP, volume 1, pages 596-9, April 1995.

[5] H.J.M. Steeneken and D.A. Van Leeuwen. Speech-Recognition Systems: The SQALE Project (Speech Recognition Quality Assessment for Language Engineering). To appear in Eurospeech, September 1995.