BOSTON UNIVERSITY GRADUATE SCHOOL

Dissertation

SEGMENT-BASED STOCHASTIC MODELS OF SPECTRAL DYNAMICS
FOR CONTINUOUS SPEECH RECOGNITION

by

VASSILIOS V. DIGALAKIS
M.S., Northeastern University, 1988
Diploma, National Technical University of Athens, 1986

Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy
1992

© Copyright by

VASSILIOS V. DIGALAKIS 1992

Acknowledgements

First and foremost I would like to thank my advisor Mari Ostendorf and my good luck for taking her pattern-recognition class in spring 1989. She urged me to get involved in speech recognition, provided the initial model and guided me throughout the thesis work. She set high standards for all of our work and went over all the manuscripts countless times. The support and the freedom she gave me these years were invaluable. I would also like to thank my "second advisor", Robin Rohlicek from BBN. Our numerous discussions and meetings with Mari worked for me as the seed for new ideas. They both, undoubtedly, helped me keep the research on the right track and deserve credit for this work. I also want to express my gratitude to the readers for their comments, which significantly improved the quality of this dissertation.

The long and continuous research effort involved in a PhD requires the appropriate environment. I would like to thank all my officemates in the Signal Processing and Interpretation laboratory at Boston University for providing this environment, and especially Owen Kimball, our "Emacs guru", John Butzberger and Nanette Veilleux for their patience all the times I interrupted them. Thanks also go to Becky Bates for all the paperwork and photocopying she did for me.

I have been very fortunate to be surrounded in my personal life by people who helped me go all this way. I want to thank Vanna for being here and understanding. I feel guilty for the sacrifices she has made these years; but then, I made them too. Hopefully, we have many years in front of us to make up. Korinna, Aria and Katerina in Greece saw me very little the past 5 years, but they were always very careful not to make me feel nostalgia for home. And, of course, I want to acknowledge the person that has influenced me more than anyone else, my father. He taught me to be persistent and to work hard and he gave me his "unconventional" thinking. Unfortunately, he missed the end by a few months, which took away a lot of the joy. To his memory I dedicate this dissertation.




SEGMENT-BASED STOCHASTIC MODELS OF SPECTRAL DYNAMICS
FOR CONTINUOUS SPEECH RECOGNITION
(Order No.        )

VASSILIOS V. DIGALAKIS
Boston University Graduate School, 1992

Major Professor: Mari Ostendorf, Assistant Professor of Electrical,
Computer and Systems Engineering

Abstract

This dissertation addresses the problem of modeling the joint time-spectral structure of speech for recognition. Four areas are covered in this work: segment modeling, estimation, recognition search algorithms, and extension to a more general class of models.

A unified view of the acoustic models that are currently used in speech recognition is presented; the research is then focused on segment-based models, which provide a better framework for modeling the intrasegmental statistical dependencies than the conventional hidden Markov models (HMMs). The validity of a linearity assumption for modeling the intrasegmental statistical dependencies is first checked, and it is shown that the basic assumption of conditionally independent observations given the underlying state sequence that is inherent to HMMs is inaccurate. Based on these results, linear models are chosen for the distribution of the observations within a segment of speech.

Motivated by the original work of the stochastic segment model, a dynamical system segment model is proposed for continuous speech recognition. Training of this model is equivalent to the maximum likelihood identification of a stochastic linear system, and a simple alternative to the traditional approach is developed. This procedure is based on the Expectation-Maximization algorithm and is analogous to the Baum-Welch algorithm for HMMs, since the dynamical system segment model can be thought of as a continuous-state HMM. Recognition involves computing the probability of the innovations given by Kalman filtering.

The large computational complexity of segment-based models is dealt with by the introduction of fast recognition search algorithms as alternatives to the typical Dynamic Programming search. A Split-and-Merge segmentation algorithm is developed that achieves a significant computation reduction with no loss in recognition performance. Finally, the models are extended to the family of embedded segment models, which are better suited for capturing the hierarchical structure of speech and modeling intersegmental statistical dependencies.

Experimental results are based on speaker-independent phoneme recognition using the TIMIT database, and represent the best context-independent phoneme recognition performance reported on this task. In addition, the proposed dynamical system segment model is the first that removes the output independence assumption.


Contents

1 Introduction ..... 1

2 Statistical Models in Speech Recognition ..... 9
  2.1 Acoustic Models: A Unified View ..... 10
    2.1.1 Hidden Markov Models ..... 13
    2.1.2 Segment-based Models ..... 16
  2.2 Evaluation Methods and Corpus ..... 22
    2.2.1 The TIMIT Corpus ..... 23
    2.2.2 Current Phonetic Recognition Performance ..... 24
  2.A Test set ..... 26
  2.B TIMIT phone set and allowable confusions ..... 27

3 Feature Space and Statistical Modeling ..... 29
  3.1 Feature Selection ..... 30
  3.2 Cepstra and Interpolation ..... 32
  3.3 Cepstra and Linear Models ..... 35
  3.4 Discussion ..... 44
  3.A Signal Processing ..... 46

4 Linear Segment-based Models ..... 47
  4.1 A Gauss-Markov Segment Model ..... 48
  4.2 A Dynamical System Segment Model ..... 51
    4.2.1 Correlation Assumptions ..... 53
    4.2.2 Structure of the Model ..... 58
  4.3 Classification rule ..... 61
  4.4 Experimental Results ..... 62
  4.5 Discussion ..... 70

5 Identification of a Stochastic Linear System ..... 71
  5.1 Linear System Parameter Estimation ..... 74
    5.1.1 The Classical Solution ..... 75
    5.1.2 Estimation with the EM Algorithm ..... 78
    5.1.3 Estimation with Missing Observations ..... 82
  5.2 General Markov-state sources ..... 83
  5.3 Training of the Dynamical System Model for CSR ..... 87
    5.3.1 Training from Sentence Transcriptions ..... 87
    5.3.2 Training without transcriptions ..... 89
  5.4 Experimental Results ..... 91
  5.5 Discussion ..... 94
  5.A Maximization of the log-likelihood function ..... 95
  5.B E-step recursions for DS estimation ..... 97
  5.C The Forward-Backward Algorithm ..... 100

6 Search Algorithms for Segment-based Models ..... 103
  6.1 Fast Segment Classification ..... 106
  6.2 Joint Segmentation/Recognition ..... 109
  6.3 Local Search Algorithms: Theory ..... 112
    6.3.1 Analysis of local search for the JSR ..... 112
    6.3.2 Local search strategies ..... 116
  6.4 A Split-and-Merge Search Algorithm ..... 118
    6.4.1 Basic Algorithm ..... 118
    6.4.2 Improvements to the basic algorithm ..... 121
    6.4.3 Constrained searches ..... 123
    6.4.4 Complexity ..... 126
  6.5 Experimental Results ..... 127
  6.6 Word recognition with the Split-and-Merge ..... 132
  6.7 Discussion ..... 134
  6.A Exact Neighborhood for the JSR problem ..... 136

7 Hierarchical Models ..... 139
  7.1 Hierarchical-Mode Process ..... 140
  7.2 Embedded Segment Models ..... 142
    7.2.1 General Case ..... 143
    7.2.2 Linear Embedded-Segment Models ..... 147
    7.2.3 A Two-level Segment Model ..... 149
  7.3 Using Hierarchical Models in CSR ..... 152
    7.3.1 Fine-level Unit and Grammar Selection ..... 153
    7.3.2 Classification ..... 155
    7.3.3 Training ..... 155
    7.3.4 N-level Models ..... 157
  7.4 Experimental Results ..... 158
  7.5 Discussion ..... 163

8 Conclusions ..... 165
  8.1 Thesis Contributions ..... 165
  8.2 Suggestions for Further Research ..... 168
  8.3 Epilogue ..... 169

List of Tables

2.1 Designated test speakers for 1st release of TIMIT database. Entries give the dialect region number (1 through 8), the sex of the speaker (m/f) and the speaker initials. ..... 26
2.2 Groups of allowable confusions between the 61 TIMIT phones. ..... 27
3.1 Multiple R² values when predicting different cepstral coefficients at the middle of a phoneme from the cepstral coefficients of the previous frame. ..... 43
3.2 Multiple R² values when predicting different cepstral coefficients at the sixth frame of a phoneme from the cepstral coefficients of the second frame. ..... 43
3.3 Multiple R² values when predicting different cepstral coefficients at the middle frame of a phoneme from the cepstral coefficients of the middle frame in the previous phoneme for five frequently occurring pairs. ..... 44
4.1 Number of parameters per phone model for an SSM with fixed length M and d-dimensional feature vector. ..... 49
4.2 Transformation sequences of the correlation-invariance assumption for different segment lengths and M = 4 invariant regions. ..... 56
4.3 Phone classification results for context-independent models based on different assumptions about the statistical dependence between features within a segment. ..... 68
5.1 Analogies between HMMs and stochastic linear systems. ..... 85
5.2 Estimation of F, P, R from artificial Gaussian data. ..... 92
6.1 Comparison of different search algorithms using a test set with average sentence length N = 300. The SSM used maximum phone duration D = 50, |A| = 61 models, M = 5 distributions per segment model. ..... 129
6.2 Phone recognition results for 18 cepstral coefficients (no derivatives) with Split-and-Merge for the independent-frame and the dynamical system segment models. 1) Independent-1: uniform initial segmentation; 2) DS model-1: initial segmentation the output of 1); 3) DS model-2: uniform initial segmentation; 4) Independent-2, DS model-3: initial segmentation from sentence transcription. ..... 130
7.1 Comparison of microsegment models for different average number of models per phone region using 18 cepstral coefficients. ..... 161
7.2 Comparison of the microsegment model to the linear models and the baseline SSM. Entries are phone classification rates for 18 cepstral coefficients and approximate numbers of free parameters. ..... 162

List of Figures

2.1 Three-mode, left-to-right hidden Markov Model. ..... 15
2.2 Mapping of variable-length segments to fixed length in the Stochastic Segment Model using a linear sampling transformation. ..... 21
3.1 Spectrogram for the utterance "They own a big house in the remote ..." ..... 34
3.2 Scatter plots for various pairs of cepstral coefficients extracted from the middle (k) and the immediately previous (k' = k - 1) frame of the phoneme "ae". ..... 38
3.3 Scatter plots for various pairs of cepstral coefficients extracted from the middle (k) and the immediately previous (k' = k - 1) frame of the phoneme "m". ..... 39
3.4 Scatter plots for various pairs of cepstral coefficients extracted from the frames k' = 2, k = 6 of the phoneme "eh". ..... 40
3.5 Scatter plots for various pairs of cepstral coefficients extracted from the middle frames of two consecutive phonemes for four frequently occurring phoneme-pairs. ..... 41
4.1 Constant-value ellipses of estimated densities from the training set under various assumptions (solid lines) and the density corresponding to the test set (dotted line). Top left: full covariance (correlated features), no smoothing. Top right: diagonal (independent features). Bottom: full covariance, smoothing. ..... 52
4.2 Sampling of state-space trajectories under Trajectory and Correlation Invariance. ..... 57
4.3 Classification rates for a full-covariance, Gauss-Markov and block-diagonal model versus dimension of the feature vector. ..... 64
4.4 Classification performance as a function of observation length for Correlation Invariance (CI) and Trajectory Invariance (TI) with unobserved trajectory of length 15. ..... 65
4.5 Classification rates for various types of correlation modeling and numbers of cepstral coefficients under Correlation Invariance. In the solid-line experiments, the features used are the indicated number of cepstral coefficients and their time derivatives. The derivative of log-power was also used in the experiments with 18 cepstra and their derivatives. ..... 67
4.6 Original spectrogram (top) and spectrograms created using the predicted estimate of the 18 cepstral coefficients for the most likely candidate phone using the independent-frame model (middle) and the dynamical system model (bottom). ..... 69
5.1 Classification performance of test data vs. number of iterations and log-likelihood ratio of each iteration relative to the convergent value for the training data. Results use the correlation invariance assumption, and 10 cepstral coefficients. ..... 93
6.1 Split-and-Merge segmentation neighbors. ..... 120
7.1 Example of the model components of a hierarchical mode for a 3-level hierarchy. ..... 141
7.2 Representation of an embedded segmentation. ..... 144

Chapter 1

Introduction

In the era of multimedia communication systems, building a computer that is able to "hear" and "understand" human voice is of major significance. This task falls in the area that is known today as speech understanding. Research in this area is subdivided into two categories that correspond to the two main components of a speech understanding system: speech recognition deals with the decoding of the speech signal to a discrete symbol string, such as words, phonemes or other language units, and natural language processing is related to the interpretation of this symbol string. The interaction between the two parts is not necessarily sequential. We are interested in the first of these two areas, speech recognition.

There are many factors that determine the difficulty of a speech recognition task. Recognition of words in isolation is much easier than Continuous Speech Recognition (CSR). A speech recognizer that will work for any speaker without additional training or adaptation is much more difficult to build than a speaker-dependent one. Viewing the speech production mechanism as a coding process [40], the uncertainty of the message source which produced the message that we are trying to decode is a critical factor: in a speech recognition problem it is determined by the vocabulary size and the grammatical constraints.


For a given task, no matter how difficult it is, there are two main problems that any speech recognition system must solve. The first is the construction of the prototypes or the statistical models (depending on the approach one follows) that will be used for the symbols in the alphabet; this is known in speech recognition as training. The second is the decoding of the message, and will generally be solved by the specification of the decoding rule and some form of a search.

The most popular approaches to the speech recognition problem today are based on statistical methods. In these, one tries to model mathematically the uncertainty that is inherent in a speech recognition problem, due to differences among speakers and pronunciations, changes in the speaker's psychological condition, corruption of the speech signal by environmental noise, and other sources of variability. Under this statistical framework, the two main speech recognition problems are estimation (training) and Maximum-Likelihood decoding (recognition).

There are two significant advantages in a statistical approach to speech recognition. First, it is likely that the decoding of the spoken message will be based on many different forms of knowledge, at least for interesting recognition problems. Statistical methods give us an excellent and theoretically sound framework for combining the scores from all the different knowledge sources, the main issue in the modeling problem. Acoustic information may be represented by many inhomogeneous sets of features that are not equally important in making the recognition decision. There may be many different pronunciations for the same word, but some are more common than others. Second, a statistical setting can solve problems related to training. We know, for example, roughly how much training data we need in order to construct robust models, and we can use automatic training algorithms to do this with little or no human supervision.

Speech recognizers based on statistical methods have been effective in much more difficult tasks than earlier approaches. On speaker-independent, large-vocabulary, continuous word recognition tasks, today's top-of-the-line systems achieve 80% to 95% correct word recognition, depending on the type of grammatical constraints that are used. Even though other approaches to continuous speech recognition, like neural networks, have recently appeared, they have not yet proved as successful and are not as efficient in combining scores and in training.

However, regardless of the progress that has been made in today's state-of-the-art speech recognition systems, the problem is far from solved. Performance degrades rapidly in more difficult test conditions, such as spontaneous speech, noisy environments and non-trained or non-cooperative speakers. In addition, for the word recognition rates that we mentioned earlier, the corresponding sentence recognition rates are only 25% and 75% respectively, since a single word error causes the whole sentence to be incorrect. There is, therefore, considerable potential for improvement in continuous speech recognition.

There are many possible directions for research in statistical approaches to continuous speech recognition. We can identify the ones that promise the most through an elimination process. The dominant family of statistical models in CSR since the 1970s is that of hidden Markov models (HMMs), first used by Baker [10] and Jelinek [40]. For this family, there exist efficient algorithms to obtain Maximum-Likelihood estimates [12]. Other parameter-estimation methods, like Maximum-Mutual-Information estimation [19], offer little or no improvement and their usefulness in difficult tasks has been questioned [71]. Moreover, it has recently been shown [64] that for HMMs the decision rule that is used performs as well as the optimal Bayesian decision rule (asymptotically with the number of observations). Finally, there exist efficient search algorithms that implement this decoding rule [55]. Thus, we can only expect advances in the field today by improving the models themselves.

We can try to improve the accuracy of a speech recognizer by adding more knowledge to the system. A spoken utterance contains many levels of information. In addition to the word sequence, it reveals the identity of the speaker, the psychological and many times the physical condition, the acoustic environment, and so on. When a human listener interprets a spoken message, a complicated decoding process takes place. Knowledge of the speaker identity limits the uncertainty of the acoustic realization of the various phonemes. Within a sentence, changes to the speaking rate, local and global stress and other prosodic cues enable us to make more accurate predictions on how the various words will be pronounced in the particular context. More locally, the context in which a phoneme appears affects its realization (the so-called coarticulation effect) and the phones¹ are very orderly sounds. Incorporation of this knowledge in a speech recognizer can significantly improve its performance, as demonstrated in the use of context-dependent models that take into account the coarticulation effect [84, 51]. However, very little of this information is actually used in speech recognition systems today.

Modeling all these complex interactions can lead to highly parameterized models that are almost impossible to estimate. The alternatives are either to try to impose some structure and reduce the degrees of freedom that the model has, or to make simplifying assumptions and ignore some aspects of the process. The latter approach, however, leads to inaccurate models and subsequent recognition errors.

The subject of this dissertation is to investigate and propose methods that will improve the acoustic modeling component of a speech recognizer by removing some of the simplifying assumptions about the statistical dependencies of the acoustic representation. The problem that we are trying to solve does not have a straightforward solution. For example, previous attempts to drop the independence assumption of consecutive observations have not been successful [19, 45, 83].

In the hidden-Markov-modeling framework the acoustic observations are modeled as noisy observations of a discrete process, an approach that leaves little flexibility for modeling more complex dependencies. Thus, we are going to investigate different families of models, and in particular segment-based models.

¹ A phoneme is a linguistic unit, whereas the term phone is used to denote a particular acoustic realization of the phoneme. For example, the phone "flap" is a particular realization of the phoneme "t".

There are many issues that will arise beyond the development of the models. We must first solve the recognition and training problems. Since the models will have fewer simplifying assumptions than the traditional ones, they will of course be more general and accurate, but also very often more complex. This can be costly in recognition, where we are interested in the decoding time, since we want our models to be used eventually in real applications. Thus it will be important to develop efficient recognition algorithms. In addition, when building more accurate models, we must be more careful with differences that possibly exist between the data that we shall use to train the models and those that will be used for testing, and with the related problem of over-training.

The remainder of this dissertation is organized as follows. Chapter 2 provides the background to this research. We give the general definition of an acoustic model, based on which we are able to present in a unified way the standard statistical modeling approaches in speech recognition today, specifically hidden Markov models and segment-based models. We elaborate more on a particular segment-based model, the stochastic segment model, introduced by Ostendorf and Roukos in [67], since it plays an important role in the dissertation.

In Chapter 3 we deal with the representation of the speech signal that is used for recognition. We look at the properties of the representation that are related to its modeling, and investigate the adequacy of a linearity assumption.

Chapter 4, modeling, is very important in the sense that it defines the path that will be followed in the remaining chapters. There, we search for methods that will enable us to model the statistical dependency between observations from the same phoneme, and we introduce our "dynamical system model". We then investigate its relationship to other existing models, and present results comparing the performance to previous methods.


The next two chapters deal with the two main problems that must be addressed for using the proposed model in CSR. In Chapter 5 we show how the training of the dynamical system model can be performed efficiently. This is actually a well-known problem, that of Maximum-Likelihood estimation of the parameters of a stochastic linear system. We follow a nontraditional and conceptually simple approach, and the algorithm presented is of more general interest, since it is applicable to stochastic linear systems used in any application.

The segmental framework that we follow, which enables us to obtain more accurate models, requires a computationally expensive recognition search. This is the subject of Chapter 6, where we introduce fast recognition search algorithms for segment-based models as alternative solutions to the typical dynamic-programming search.

Having shown that modeling the dependencies within a single phoneme can improve recognition performance, in Chapter 7 we present an extension of the segment-based models to a family of multi-level models for speech recognition that, we believe, can better represent phenomena in speech that occur at different time scales. Finally, we discuss in Chapter 8 our conclusions and summarize the contributions of this dissertation to the fields of pattern recognition and estimation theory.

We have chosen to present experimental results throughout the thesis, since in this way some chapters become self-contained and the results can be used to motivate the material that follows. There are many paths that one can follow in reading this dissertation, depending on his or her specific interests. The reader who is interested in acoustic modeling and wants to evaluate the models that we propose without going into the algorithmic details can read Chapters 2, 4 and parts of Chapter 7. The reader interested in search algorithms for segment-based models can read Chapters 2 (for notation) and 6. The reader interested in training algorithms, especially for segment models, can look at Chapters 2 and 5. Someone who is interested in the dynamical system estimation algorithm only can just look at the first half of Chapter 5, since the presentation there is self-contained.


Chapter 2

Statistical Models in Speech Recognition

Speech can be viewed as a discrete message source with a hierarchical structure: phonemes are joined to form syllables, then words, phrases and finally continuous discourse. In modern speech-recognition systems, stochastic models are postulated for certain units of speech, such as words or phonemes. The basic acoustic models are then combined to form models for larger units with the aid of dictionaries and/or probabilistic grammars.

In designing a speech recognizer, an important decision to be made concerns the set of events that will be modeled. The question "what is a good unit of speech for recognition?" can be translated, under the hierarchical-structure perspective, to "what is a good level (or levels) of the hierarchy to model?". Given unlimited amounts of training data¹, one would benefit by using models for words, or even groups of words, because by modeling larger speech units one can eliminate the variability due to context. However, given the practical limitations, one is forced to use units from lower levels of the hierarchy, phonemes for example, that are easier to train.

¹ The set of speech utterances that will be used for the estimation of the models' parameters.


The phone is a natural and popular choice as the basic modeling unit. The approaches that we present in this chapter, namely hidden Markov models and segment-based models, have the phone as a basic building block. For a large part of this research we also choose to model the phone. Furthermore, phonetic recognition experiments will be the method of evaluation of the acoustic models developed here.

In this chapter we shall describe the general form of the conventional acoustic model used in speech recognition today. We shall introduce the notation followed throughout the thesis, give the general definition of an acoustic model, and present HMMs and segment-based models as special cases of it. There are, however, significant differences between these two methodologies, and we shall comment on the advantages and disadvantages of each. HMMs will not be used in this thesis, but we feel that a brief description is useful for the reader, since they dominate the speech recognition field today. Segment-based models, and in particular the stochastic segment model (SSM), will play a major role for the models that we introduce in the following chapters and will be described in more detail. We shall also describe in this chapter the corpus on which our new methods will be tested and give results of current approaches on this domain.

2.1 Acoustic Models: A Uni ed View The de nition of the acoustic statistical model that we give in this section is independent of the modeling-unit choice. We assume that the speech signal of a particular utterance is represented by a sequence of feature vectors denoted collectively by Z = [zi; i = 1; : : : ; N ]. We shall use A to denote the spoken message, which can be a phoneme, or word or even sentence sequence, depending on the type of recognition that we are interested in. The objective in speech recognition is to de ne a mapping from the signal space Z to the set of all messages, M,  : Z ! M chosen in such a way that a certain criterion is satis ed. In statistical

Chapter 2. Background


speech recognition, we can define this mapping by first introducing the acoustic model. We first assume that the sequence of feature vectors can be partitioned as Z = [z_i, i = 1, ..., n], where each z_i = [z_{k_i}, ..., z_{k_i + N_i − 1}] is treated as a variable-length random vector. In addition, since speech is a nonhomogeneous message source, we must introduce a discrete state component to the model to account for time variability.

Definition 2.1 (Acoustic Model) An acoustic model for the sequence of observations Z is the quadruple (Q, B, Γ, Ψ), where

- Q is the finite and discrete set of modes of the acoustic model. The particular realization of the mode sequence Q = [q_i, i = 1, ..., n], where each q_i is a random variable taking values in Q, implies a partitioning Z = [z_i, i = 1, ..., n].

- B = { p_{z_t | z_1,...,z_{t−1}, q_t}(· | ·), q_t ∈ Q } is a collection of probability measures, and we assume that the distribution of z_t given q_i, i = 1, ..., n depends only on the current mode q_t.

- Γ is a deterministic or stochastic grammar that describes the mode dynamics.

- Ψ, the decoding function, is a mapping from the set of possible mode sequences to the message set, Ψ : Q^n → M. (A more accurate name for Ψ would be the inverse-mode-generating function, since the true decoding is from observation sequences Z to messages A.)

We have chosen the term mode here, instead of the typically used state, to emphasize its discrete nature and distinguish it from the state of processes with continuous variables. Based on this definition, the joint likelihood of a sequence of


observations and a corresponding mode sequence Q can then be written

p(Z, Q) = p(Q) p(Z | Q) = p(Q) ∏_{t=1}^{n} p_{z_t | z_1,...,z_{t−1}, q_t}(z_t | z_1, ..., z_{t−1}, q_t)
        = ∏_{t=1}^{n} p(q_t | q_1, ..., q_{t−1}) p_{z_t | z_1,...,z_{t−1}, q_t}(z_t | z_1, ..., z_{t−1}, q_t).     (2.1)
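To make the factorization in (2.1) concrete, the sketch below evaluates log p(Z, Q) for a toy model that already makes the HMM simplification discussed later in this chapter (each observation depends only on the current mode). The two-mode chain, the scalar Gaussian output parameters and all probabilities are invented for illustration; this is not a model from the thesis.

```python
import math

def log_gauss(x, mean, var):
    # log N(x; mean, var) for a scalar observation
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def log_joint(Z, Q, log_trans, log_init, out_params):
    """log p(Z, Q) = log p(Q) + sum_t log p(z_t | q_t),
    the conditionally independent special case of equation (2.1)."""
    lp = log_init[Q[0]] + log_gauss(Z[0], *out_params[Q[0]])
    for t in range(1, len(Z)):
        lp += log_trans[(Q[t - 1], Q[t])]          # mode-sequence term
        lp += log_gauss(Z[t], *out_params[Q[t]])   # output-distribution term
    return lp

# toy two-mode model (all numbers invented)
log_init = {0: math.log(1.0), 1: float("-inf")}
log_trans = {(0, 0): math.log(0.6), (0, 1): math.log(0.4),
             (1, 1): math.log(1.0)}
out_params = {0: (0.0, 1.0), 1: (3.0, 1.0)}  # (mean, variance) per mode
print(log_joint([0.1, 0.2, 2.9], [0, 0, 1], log_trans, log_init, out_params))
```

Summing such joint scores over all mode sequences mapping to the same message gives the decoding rule derived next.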

If the grammar used in the model is not a probabilistic but a syntactic one, then the "probability" of the mode sequence p(Q) in the expression above can take the form of an acceptance function, with p(Q) = 1 if Q is a valid mode sequence and p(Q) = 0 otherwise. Having defined the acoustic model, we can then define the recognition mapping Φ. Since the recognition problem can also be viewed as a decoding problem, and in the more general case all recognition errors are equally important, the criterion we use to specify this mapping is to minimize the probability of error, i.e., the MAP rule:

Φ : Z → M,    Φ(Z) = argmax_A p(A | Z),     (2.2)

so that

Φ(Z) = argmax_A Σ_{Q : Ψ(Q) = A} p(A, Q | Z) = argmax_A Σ_{Q : Ψ(Q) = A} p(Q | Z)
     = argmax_A Σ_{Q : Ψ(Q) = A} p(Z, Q).     (2.3)

From Definition 2.1, we can see that the components of the model that distinguish different approaches are the selection of the mode space (and therefore the way the feature sequence is partitioned), the choice of the probability distributions and the form of the grammar. In the following sections, we shall see how HMMs and segment-based models address these issues. (Throughout this thesis we adopt a common convention and use p(x_0) to represent the probability that the underlying random variable x assumes the value x_0, p(x = x_0). We shall also use p(·) indistinguishably for probabilities, probability mass functions and probability density functions, leaving to the reader the task of identifying the correct interpretation from context.)
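The MAP decoding rule (2.2)-(2.3) can be spelled out by brute force on a toy model: enumerate every mode sequence Q, pool the (unnormalized) joint scores p(Z, Q) by message Ψ(Q), and pick the best message. The score function and the decoding map below are hypothetical stand-ins; real recognizers replace the enumeration with dynamic-programming search.

```python
import itertools
from collections import defaultdict

def map_decode(modes, T, joint_score, decode_fn):
    """Brute-force version of decoding rule (2.3): score every mode
    sequence Q of length T, pool p(Z, Q) by the message decode_fn(Q),
    and return the most probable message."""
    scores = defaultdict(float)
    for Q in itertools.product(modes, repeat=T):
        scores[decode_fn(Q)] += joint_score(Q)
    return max(scores, key=scores.get)

# toy example: the "message" is the set of distinct modes visited
# (a hypothetical decoding function, chosen only for illustration)
joint = lambda Q: 0.5 ** len(Q) * (2.0 if Q[0] == 'a' else 1.0)
msg = lambda Q: ''.join(sorted(set(Q)))
print(map_decode(['a', 'b'], 2, joint, msg))
```

Note that the sum over mode sequences inside each message class is exactly the pooling that distinguishes (2.3) from simply picking the single best mode sequence.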




2.1.1 Hidden Markov Models

Hidden Markov models [10, 40, 9], currently the most popular approach to statistical speech recognition, are frame-based recognizers. A frame is a fixed-duration window of the speech signal from which a feature vector is extracted. In HMMs, the mode space is typically referred to as the hidden-state space, and the observation sequence is modeled directly, i.e., there is no partitioning of the feature vectors. The distributions in B can be either discrete or continuous densities, depending on the feature space, and the usual assumption is that the observed vectors are conditionally independent given the underlying mode sequence, even though there have been attempts to relax this constraint [19]. The name HMM comes from the assumption on the mode dynamics, since the underlying mode sequence is modeled as a Markov chain. The grammar Γ is stochastic and consists of the set of transition probabilities, and the initial-state probabilities if the observation sequence does not extend to infinity in the past. Thus, in HMMs the observed process is modeled as a "noisy" version of the mode sequence, with each sample drawn from the distribution indexed by the current mode, which is also known as the output distribution.

It remains to define the decoding function Ψ. This function is directly related to the particular design of the HMM, i.e., the constraints imposed on the transition probabilities (the grammar Γ), and also to the type of recognition that we are interested in. To be more specific, we must see how HMMs are typically used in CSR today. The observation vectors usually represent the state of the articulatory system. Therefore, the mode space can also be viewed as a quantized representation of all possible modes of the articulators. With HMMs, we could assume that all transitions between any two articulatory modes are possible, thus start with a fully ergodic model, and use training data to estimate these probabilities. However, given the complexity of the speech process, and the fact that we are interested in speaker-independent recognition, the number of modes


is usually large, and estimating all possible transition probabilities would require an enormous amount of training data. Thus, in practice, linguistic knowledge is utilized to limit the number of possible transitions by building models for some basic unit of speech, like the phone, that consist of a network of modes and the corresponding output distributions. Usually, the transition probability matrix for the modes of a single unit is constrained to be triangular for some permutation of the modes, which means that we can associate the modes with their relative position in the modeled unit; that is, we have a so-called left-to-right HMM. If, for example, the HMM for a phone has three modes, they correspond to the beginning, middle and ending parts of the phone (see Figure 2.1). Based on this partitioning of the set of modes into phonetic classes, we can answer our original question, the definition of the mapping Ψ from sequences of modes to strings of phones. If the objective is a different form of recognition, e.g., word recognition, then we can build pronunciation networks to define the second part of this mapping, from phones to word sequences. Recapitulating, we can give the formal definition of an HMM in terms of the general acoustic model:


Figure 2.1: Three-mode, left-to-right hidden Markov Model.

Definition 2.2 (Hidden Markov Model) Hidden Markov models are the class of acoustic models for which:

- The mode space Q is discrete and finite.

- The observations are assumed conditionally independent given the mode sequence, so that

  B = { p_{z_t | z_1,...,z_{t−1}, q_t}(· | ·), q_t ∈ Q } = { p_{z_t | q_t}(· | ·), q_t ∈ Q }.

- The mode sequence is modeled as a first-order Markov chain, and thus the grammar Γ consists of a set of transition probabilities and a set of initial probabilities.

- The decoding function Ψ depends on the specific topology of the HMM.

The generation of the output observations in hidden Markov modeling involves two steps. First, the string of alphabet symbols (e.g., phones) is converted to a


sequence of modes, modeled as a Markov chain. This part of the modeling deals with the time and duration variability of the units, which is modeled by having a stochastic, one-to-many mapping from alphabet symbols to mode sequences. The second step is the generation of "noisy" observations of the underlying mode sequence. Each mode is associated with a particular output distribution, and the feature vectors for consecutive frames are assumed conditionally independent given the underlying mode sequence.

This methodology imposes, however, some significant limitations on the accuracy of the model. First, it is difficult to model relative durations within a phone segment. Since the mode sequence in HMMs is a Markov chain, it is possible to have some parts of a segment stretched and others compressed, and this does not agree with speech knowledge. The biggest deficiency of HMMs comes, however, from the second step and the conditional-independence assumption on the observations. The modeling of inter-frame time correlation is done through the statistics of the mode sequence only. In particular, if we use continuous output distributions, then one can show [63] that the joint likelihood of a particular mode and observation sequence is dominated by the terms of the output distributions as the dimension of the feature vector increases. In other words, an HMM with continuous output distributions converges asymptotically to an independent process as the dimension of the feature vector grows, and this, we believe, is not a realistic model for speech.
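The duration behavior just criticized is easy to see in simulation. The sketch below samples mode sequences from a hypothetical three-mode, left-to-right phone HMM like the one in Figure 2.1: the upper-triangular transition matrix keeps the modes ordered, but because the chain is memoryless, the time spent in each mode is implicitly geometric, regardless of how long the phone has already lasted. All transition probabilities are invented for illustration.

```python
import random

# A hypothetical three-mode, left-to-right phone HMM (cf. Figure 2.1):
# transitions only go forward, so the matrix is upper triangular.
trans = [
    [0.6, 0.4, 0.0],   # begin  -> begin/middle
    [0.0, 0.7, 0.3],   # middle -> middle/end
    [0.0, 0.0, 1.0],   # end    -> end (exit handled outside this sketch)
]

def sample_mode_sequence(trans, n_frames, rng):
    """Draw a mode sequence from the Markov chain; per-mode durations
    come out geometric because only the self-loop probability matters."""
    q, seq = 0, []
    for _ in range(n_frames):
        seq.append(q)
        r, acc = rng.random(), 0.0
        for nxt, p in enumerate(trans[q]):
            acc += p
            if r < acc:
                q = nxt
                break
    return seq

seq = sample_mode_sequence(trans, 10, random.Random(0))
assert seq == sorted(seq)  # left-to-right: modes never decrease
print(seq)
```

The stretching/compression problem is visible here: nothing couples the duration spent in "begin" to the duration spent in "end", so implausible combinations are as likely as plausible ones.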

2.1.2 Segment-based Models

A good acoustic model for speech should be able to represent higher-order phenomena and utilize features extracted from longer time-scale processing than the typical 10-20 ms analysis window, as well as the statistical dependency between observations. In order to deal with the inadequacies of the frame-based HMM approach, other methods, such as neural networks [88, 81] and segment-based models [21, 67, 92, 15, 60], have been proposed as alternatives. A segment is a variable-duration part of the speech waveform that usually corresponds to a language unit. Segment-based models offer a better framework for modeling higher-order phenomena and the evolution of the spectral dynamics of speech.

We shall now see the form that the components of a general acoustic model take for segment-based models. We shall first define the general segment-based model, and then concentrate on the stochastic segment model. The mode of a segment-based model consists of a segment component and a model component. A segment s in an utterance of N frames is defined by

s ∈ S = { (τ_a, τ_b) : 1 ≤ τ_a ≤ τ_b ≤ N },     (2.4)

with τ_a, τ_b representing the begin and end times. We could choose s as the segment component of the mode. This would, however, imply that the segmental distributions depend on the relative position within an utterance, which is not practical. Instead, we select the duration of the segment l = τ_b − τ_a + 1 as the segment component of the mode, and we assume that it takes values from the set of allowable phone durations, l ∈ L. The model component is the identity of the language unit and takes values from a set A. The grammar in a segmental model is a mixture of deterministic and stochastic constituent parts. We first have the constraint that segmentations, defined as sequences of segments across a sentence, S = [s_1, ..., s_n], must span the entire sentence and have no gaps, i.e., an acceptable segmentation has the form

S = [(1, τ_1), (τ_1 + 1, τ_2), ..., (τ_{n−1} + 1, τ_n)],     (2.5)

where

0 = τ_0 < τ_1 < τ_2 < ... < τ_{n−1} < τ_n = N.

We shall denote by S(τ, t) ⊆ S the segmentation for the part of the sentence with starting and ending frames τ and t, respectively. In terms of durations, the previous


constraint is equivalent to

l_1 + l_2 + ... + l_n = N,     (2.6)

with l_i = τ_i − τ_{i−1}. If we denote the corresponding model sequence by A = [α_1, α_2, ..., α_n] and the sequence of segment durations by L = [l_1, l_2, ..., l_n], and treat each α_i and l_i as a random variable, then the probability of the composite mode sequence can be written

p(Q) = p(L, A) = p(L | A) p(A) = ∏_{i=1}^{n} p(l_i | α_i) p(α_i | α_1, ..., α_{i−1}),     (2.7)

where we assumed that the segment durations are conditionally independent given the model sequence and that each segment duration l_i depends only on the corresponding label α_i. In a segment-based model there is no limitation on the form of the duration distributions, as was the case with HMMs. The joint probability of a segmentation S and a model sequence A can also be computed using (2.7). If we use s_i to denote the event (τ_{i−1} + 1, τ_i) ∈ S, then

p(S, A) = p(A) ∏_{i=1}^{n} p(s_i | S(1, τ_{i−1}), α_i) = p(A) ∏_{i=1}^{n} p(l_i | α_i) = p(L, A),     (2.8)

if we make the Markov assumption p(s_i | S(1, τ_{i−1}), α_i) = p(s_i | s_{i−1}, α_i) and also assume that s_i is conditionally independent of the future and past phone labels given α_i. Therefore, the segmentation probability is determined by phone-dependent duration probabilities. In segment-based models, the partitioning of the feature sequence Z is done in accordance with the segmentation S (or, equivalently, the mode sequence Q = (L, A)), with

Z = [z(1, τ_1), z(τ_1 + 1, τ_2), ..., z(τ_{n−1} + 1, τ_n)],     (2.9)

z(τ, t) = [z_τ, z_{τ+1}, ..., z_t].     (2.10)
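A short sketch of the bookkeeping in (2.6)-(2.9): partition a feature sequence according to the boundary frames τ_i and score the resulting mode sequence with phone-dependent duration probabilities. The duration and label probabilities below are invented, and the unigram label model is a simplification of the general term p(α_i | α_1, ..., α_{i−1}) in (2.7).

```python
import math

def segment_features(Z, taus):
    """Partition Z as in (2.9): taus = [tau_1, ..., tau_n] with
    tau_n == len(Z); segment i covers frames tau_{i-1}+1 .. tau_i
    (1-based), i.e. Z[tau_{i-1}:tau_i] in Python indexing."""
    assert taus[-1] == len(Z) and taus == sorted(taus)
    bounds = [0] + list(taus)
    return [Z[a:b] for a, b in zip(bounds[:-1], bounds[1:])]

def log_p_mode_sequence(labels, taus, log_dur, log_lm):
    """Equation (2.7) in the log domain: sum_i [log p(l_i | a_i)
    + log p(a_i)], with a hypothetical unigram label model."""
    durs = [b - a for a, b in zip([0] + list(taus[:-1]), taus)]
    return sum(log_dur[a][l] + log_lm[a] for a, l in zip(labels, durs))

Z = list(range(7))                          # 7 dummy frames
segs = segment_features(Z, [3, 5, 7])
assert [len(s) for s in segs] == [3, 2, 2]  # durations sum to N = 7

log_dur = {'aa': {3: math.log(0.5)}, 't': {2: math.log(0.4)}}
log_lm = {'aa': math.log(0.1), 't': math.log(0.2)}
print(log_p_mode_sequence(['aa', 't', 't'], [3, 5, 7], log_dur, log_lm))
```

The constraint (2.6) shows up as the assertion that the segment lengths sum to the number of frames N.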

Summarizing, we can give the formal definition of a general segment-based model:


Definition 2.3 (Segment-based Model) Segment-based models are the class of acoustic models for which:

- The mode is the ordered tuple q = (α, l) ∈ A × L, where A is the set of language units and L is the set of allowable segment durations.

- The output distribution is a joint distribution for the observations that correspond to a segment:

  p( z(τ_{i−1} + 1, τ_i) | q_i ).

- The grammar for a segment model consists of the constraint (2.6) and the stochastic components that appear in (2.7).

- Finally, the decoding function is defined by Ψ(Q) = Ψ′(A, L) = A.

The decoding rule (2.3) for the recognition of a phone string with a segment model can be written

Φ(Z) = argmax_{n, A} Σ_{S : ||S|| = n} p(Z | S, A) p(S, A),     (2.11)

where ||S|| is the cardinality of S; we sum over all possible segmentations with n segments, since the phone string itself is a component of the mode of the model. Similarly to HMMs, the generation of the output observations with a segment-based model also involves two steps. First, the segment duration l for each label α in the phone string is drawn from a distribution specific to that phone. Then, an l-long sequence of observations is generated using the segmental distribution characterized by the duration l and the phone label α. The choice of the distribution for the sequence of observations within a segment s given the phone model α is not straightforward, since we deal with a vector of


variable dimension, and it is the differentiating factor of the various segment-based approaches. We now concentrate on the SSM, and we shall simplify the notation and represent an observed segment by an l-long sequence of d-dimensional feature vectors z = [z_1 z_2 ... z_l].

Definition 2.4 (Stochastic Segment Model) The stochastic segment model is a segment-based model with a distribution for z that has two components [83]: i) a time-warping transformation T_l that models the variable-length observed segment in terms of a fixed-length unobserved sequence, z = y T_l, where y = [y_1 y_2 ... y_M]; and ii) a probabilistic representation of the unobserved feature sequence y. The conditional density of the observed segment z given phone α and duration l can then be obtained as the marginal distribution

p(z | α, l) = ∫_{y : z = y T_l} p(y | α) dy.     (2.12)

Thus, the SSM breaks the second step in the generation of the variable-length sequence of observations within each segment into two sub-steps. First, a fixed-length sequence y of length M is generated using the distribution p(y | α), where M ≥ l for all durations l ∈ L. Then, the actual observations are created by downsampling y using the time-warping transformation T_l. This transformation can be linear or nonlinear, fixed according to phonetic knowledge or arbitrary and reestimated from data. The original SSM used a linear time transformation (see Figure 2.2). Given the transformations T_l for all lengths l, the densities p(z | α, l) are all marginal distributions of the same "mother" distribution p(y | α).

There are many alternatives for the form of the distribution p(y | α). One possibility is to model the density of the segment y given the phone α using an Md-dimensional multivariate Gaussian distribution. If no assumptions are made on the structure of the covariance matrix, as was the case with the original SSM, we refer to this model as a full-covariance Gaussian model. For reasonable choices of


Figure 2.2: Mapping of variable-length segments to fixed length in the stochastic segment model using a linear sampling transformation.

M and d, the full-covariance model has a large number of free parameters, for which it is difficult to obtain good estimates. In addition, this model is computationally expensive in recognition. Another possibility is to assume that the observation vectors within a segment are conditionally independent given the observed length l and the phone label α. In this case we can write the density of the unobserved sequence as

p(y | α) = p_1(y_1 | α) p_2(y_2 | α) ... p_M(y_M | α),     (2.13)

where the subscripts distinguish the distributions used at different frames. This corresponds to a block-diagonal covariance matrix for the distribution of y. The latter approach has a smaller number of free parameters but does not capture the inter-frame time correlation. Under the independence assumption, and assuming that T_l is a simple downsampling transformation, the density of the l-length observed sequence z is simply

p(z | l, α) = p_{k_1}(z_1 | α) p_{k_2}(z_2 | α) ... p_{k_l}(z_l | α),     (2.14)

where the distribution indices k_1, ..., k_l are determined by the transformation T_l.
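The parameter-count contrast between the two covariance structures is simple to make explicit. The sketch below counts the free parameters of a single phone's Gaussian segment distribution; the example values of M and d are illustrative, not the thesis settings.

```python
def gaussian_params(M, d, structure):
    """Free parameters of the segment distribution p(y | alpha) for an
    M-frame, d-dimensional segment: mean entries plus covariance entries."""
    mean = M * d
    if structure == 'full':        # full (Md x Md) symmetric covariance
        cov = M * d * (M * d + 1) // 2
    elif structure == 'block':     # block-diagonal: M independent (d x d) blocks
        cov = M * d * (d + 1) // 2
    else:
        raise ValueError(structure)
    return mean + cov

# e.g. M = 5 frames of d = 14 features (illustrative sizes only)
print(gaussian_params(5, 14, 'full'), gaussian_params(5, 14, 'block'))
```

The full-covariance count grows quadratically in Md, which is why good estimates are hard to obtain per phone, while the block-diagonal count grows only linearly in M.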


Most prior work on the SSM has been based on the independent-frame distribution (2.14) and a linear sampling transformation T_l.
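The linear sampling transformation and the independent-frame likelihood (2.14) can be sketched as follows. The index map below is one plausible linear downsampling T_l (the exact map used in the original SSM is depicted in Figure 2.2 rather than specified here), and the scalar Gaussian "mother" distributions are invented for illustration.

```python
import math

def sample_indices(l, M):
    """A linear time-warping map T_l in the spirit of Figure 2.2: pick
    which of the M 'mother' distributions scores each of the l observed
    frames (a hypothetical choice of linear downsampling)."""
    return [int(t * M / l) for t in range(l)]

def log_p_segment(z, mother_logpdfs):
    """Independent-frame SSM likelihood (2.14) in the log domain:
    log p(z | l, alpha) = sum_t log p_{k_t}(z_t | alpha)."""
    ks = sample_indices(len(z), len(mother_logpdfs))
    return sum(mother_logpdfs[k](x) for k, x in zip(ks, z))

# toy mother sequence of M = 5 unit-variance scalar Gaussians (means invented)
def gauss(mean):
    return lambda x: -0.5 * (math.log(2 * math.pi) + (x - mean) ** 2)

mother = [gauss(m) for m in (0.0, 0.5, 1.0, 0.5, 0.0)]
print(sample_indices(3, 5))             # mother indices a 3-frame segment uses
print(log_p_segment([0.0, 1.0, 0.1], mother))
```

Because every duration l scores against the same length-M mother sequence, all the densities p(z | α, l) are marginals of one underlying distribution, exactly as (2.12) requires.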

2.2 Evaluation Methods and Corpus

Among the different evaluation methods in speech recognition, phonetic recognition has traditionally been used for testing the accuracy of acoustic models. Since the phone is the language unit that we have chosen to model, phonetic recognition is a very natural choice for the evaluation of our models. Moreover, benchmark results of the traditional modeling approaches, like HMMs, are available on public databases, so that we can make an objective evaluation. A task that is used more often in CSR is word recognition. However, because of limitations in time and computer resources, and since phonetic recognition serves our purpose well, we shall not extend our work in this thesis to word recognition.

More specifically, we are going to use two forms of evaluation: phonetic classification, in which we are given the boundaries of the phonetic segments in a sentence and want to assign the label of the "best" class to each one of them; and phonetic recognition, in which we know nothing about the segmentation, the phonetic string or the actual number of phones in a given sentence, and we try to determine everything from the observed data. Even though phonetic classification might seem less interesting, it is very useful to us because we can evaluate the modeling accuracy independently of search errors. Search errors are introduced in phonetic recognition, since the search over all possible segmentations and phone sequences is a combinatorial optimization problem with many local optima, for whose solution we usually rely on suboptimal methods. In addition, classification is a fast evaluation method, a significant advantage for models that use continuous densities and are computationally expensive, like the ones that we introduce in this thesis.


Before we present the database that we shall use, let us also mention that, whereas in classification the computation of the error rate is direct, in recognition we have the complication of deletions and insertions, since the actual number of phones in the sentence is unknown. In this case we shall follow the standard approach [52, 81] and use a Dynamic Programming [13] match to align the true and recognized phone strings. The actual scoring software that we shall use is provided by the National Institute of Standards and Technology and is publicly available. The recognition performance can then be measured in terms of percent correct, defined as

Percent Correct = 100 × (Number Correct / True sentence length),

and also percent accuracy,

Percent Accuracy = 100 − 100 × ((Substitutions + Deletions + Insertions) / True sentence length).
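The sketch below is not the NIST scoring software, just a minimal dynamic-programming alignment that produces the substitution, deletion and insertion counts these two measures need. The example phone strings are made up.

```python
def align_counts(ref, hyp):
    """Minimal DP string alignment (edit distance) returning
    (errors, substitutions, deletions, insertions) for ref vs hyp."""
    n, m = len(ref), len(hyp)
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):                      # ref-only prefix: deletions
        e, s, d, ins = cost[i - 1][0]
        cost[i][0] = (e + 1, s, d + 1, ins)
    for j in range(1, m + 1):                      # hyp-only prefix: insertions
        e, s, d, ins = cost[0][j - 1]
        cost[0][j] = (e + 1, s, d, ins + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, ins = cost[i - 1][j - 1]
            match = (e, s, d, ins) if ref[i - 1] == hyp[j - 1] \
                else (e + 1, s + 1, d, ins)
            e, s, d, ins = cost[i - 1][j]
            dele = (e + 1, s, d + 1, ins)
            e, s, d, ins = cost[i][j - 1]
            inse = (e + 1, s, d, ins + 1)
            cost[i][j] = min(match, dele, inse)    # fewest total errors wins
    return cost[n][m]

ref = ['sil', 'k', 'ae', 't', 'sil']
hyp = ['sil', 'k', 'eh', 't', 's', 'sil']
err, subs, dels, ins = align_counts(ref, hyp)
correct = len(ref) - subs - dels
print(100.0 * correct / len(ref))                       # percent correct
print(100.0 - 100.0 * (subs + dels + ins) / len(ref))   # percent accuracy
```

The example makes the difference between the two measures visible: the inserted phone lowers percent accuracy but not percent correct.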

2.2.1 The TIMIT Corpus

The TIMIT acoustic-phonetic continuous speech corpus [49] is specifically designed to train and test recognizers at the phoneme level. It is phonetically balanced, and time-aligned phonetic transcriptions are available for all the sentences in the database. The portion of the database that we had available at the beginning of this research consisted of 420 speakers and 10 sentences per speaker, 2 of which, the "sa" sentences, are the same across all speakers and were not used in either recognition or training because they would otherwise lead to optimistic results. We designated 219 male and 98 female speakers for training (a total of 2536 training sentences) and a second set of 71 male and 32 female speakers for testing (a total of 824 test sentences with 31,990 phones). A list of the test speakers can be found in Appendix 2.A. We deliberately selected a large number of test speakers to increase the confidence of our performance estimates. Baseline results are reported using all of the available training sentences (from both male


and female speakers) and testing over the entire test set. In algorithm development, however, we often used a smaller test set consisting of the Western-dialect male speakers (a total of 12 test speakers), a subset that gives us good estimates of the overall performance. Before the completion of our research, the second release of the TIMIT database became available, with a new set of 1344 utterances for testing. We shall use the training and test sets of the second release in the experiments of Chapter 7, but the other chapters will be based on the training and test sets described above. The phonetic transcriptions of the TIMIT database use 61 phonetic symbols, and in most of the thesis we shall use 61 phonetic models corresponding to the TIMIT symbols. However, in counting errors, different phones representing the same English phoneme can be substituted for each other without causing an error. The allowable substitutions can be found in Appendix 2.B. The set of 39 English phonemes that we effectively use is the same as the one on which the CMU SPHINX and MIT SUMMIT systems reported phonetic recognition results [52, 92].

2.2.2 Current Phonetic Recognition Performance

The TIMIT database has been used in the past by other researchers in phonetic recognition experiments. For example, Lee and Hon [52] reported 64% correct/53% accuracy using discrete-density context-independent HMMs in Sphinx. In context-independent modeling, the model used for a phone does not depend on the context (e.g., the phonetic labels of its neighbors) in which it appears. Zue et al. [93] reported 55% accuracy for the MIT SUMMIT system. The MIT SUMMIT system has also reported a 70% classification accuracy on the same database, and recently Leung and Zue [54] have reported a classification rate of 75% using segmental neural networks. Sphinx's recognition rate increases to 74% correct/66% accuracy with right-context-dependent phone modeling. Finally, Robinson and Fallside [81] reported


75% correct/69% accuracy using connectionist techniques. Their system also uses context information through a state feedback mechanism (left context) and a delay in the decision (right context).

Prior to the beginning of this research, there were no phonetic recognition results available on the TIMIT database for the stochastic segment model. Evaluation of the SSM had only been performed on a speaker-dependent task, where it achieved 80% accuracy in classification and 76% correct with 11% insertions in phonetic recognition [67]. In that study, the SSM used a block-diagonal covariance structure, i.e., it did not model time correlation. Attempts to use a full-covariance model in [83] were not successful, and the performance degradation relative to the independent-frame model was attributed to insufficient training data.

At the preliminary stages of this research, we performed a comparative study of the SSM and a continuous-density phonetic HMM with the same number of distributions (five). Using a small feature set (10 cepstral coefficients and their time derivatives, see Chapter 3), we found that both approaches resulted in the same classification rate of 69%. The SSM in this experiment also used a block-diagonal structure. Thus, if we are successful in using the segmental framework to model the correlations within a phone, we will achieve improved performance over the most successful approach in acoustic modeling today, HMMs.


2.A Test set

dr1-m-rws0 dr2-m-rjm0 dr2-m-wew0 dr3-m-ree0 dr3-m-tlb0 dr4-m-r 0 dr4-m-tas0 dr5-m-sas0 dr5-m-wem0 dr6-m-tju0 dr7-m-ter0 dr7-m-pfu0 dr1-f-sma0 dr2-f-srh0 dr4-f-rew0 dr5-f-smm0 dr6-f-taj0 dr8-f-nkl0

dr1-m-sjs1 dr2-m-rlj0 dr2-m-zmb0 dr3-m-rtj0 dr3-m-tpg0 dr4-m-rgm0 dr4-m-teb0 dr5-m-sfh1 dr5-m-wsh0 dr7-m-rms1 dr7-m-tkd0 dr8-m-slb0 dr1-f-tbr0 dr3-f-ntb0 dr4-f-kdw0 dr5-f-tbw0 dr7-f-spm0

dr1-m-tjs0 dr2-m-rlr0 dr2-m-bjv0 dr3-m-rtk0 dr3-m-tpp0 dr4-m-roa0 dr4-m-fwk0 dr5-m-srr0 dr6-m-rxb0 dr7-m-rpc0 dr7-m-tlc0 dr8-m-rre0 dr1-f-vmh0 dr3-f-sjs0 dr4-f-sak0 dr5-f-utb0 dr7-f-sxa0

dr1-m-tpf0 dr2-m-tbc0 dr2-m-dem0 dr3-m-sfv0 dr3-m-wdk0 dr4-m-sfh0 dr4-m-jee0 dr5-m-tat0 dr6-m-sjk0 dr7-m-rpc1 dr7-m-twh0 dr8-m-ejs0 dr2-f-rll0 dr3-f-sjw0 dr4-f-sem0 dr5-f-dmy0 dr7-f-tlh0

dr1-m-trr0 dr2-m-tdb0 dr2-m-dss0 dr3-m-taa0 dr3-m-wjg0 dr4-m-smc0 dr5-m-rvg0 dr5-m-tmt0 dr6-m-smr0 dr7-m-ses0 dr7-m-twh1 dr8-m-pam0 dr2-f-scn0 dr3-f-jlr0 dr5-f-sjg0 dr6-f-sdj0 dr7-f-vkb0

dr1-m-wbt0 dr2-m-tmr0 dr3-m-rds0 dr3-m-tjm0 dr4-m-rab1 dr4-m-srg0 dr5-m-rws1 dr5-m-vlo0 dr6-m-svs0 dr7-m-tab0 dr7-m-dlm0 dr1-f-sah0 dr2-f-slb1 dr4-f-mcm0 dr5-f-skp0 dr6-f-sgf0 dr8-f-mbg0

Table 2.1: Designated test speakers for 1st release of TIMIT database. Entries give the dialect region number (1 through 8), the sex of the speaker (m-f) and the speaker initials.


2.B TIMIT phone set and allowable confusions

1. p
2. pcl tcl kcl bcl dcl gcl q epi h# pau
3. d
4. m em
5. s
6. th
7. jh
8. r
9. w
10. ao aa
11. ay
12. ax ah ax-h
13. uh
14. t
15. dx
16. g
17. n en nx
18. z
19. f
20. v
21. y
22. eh
23. uw ux
24. ey
25. ix ih
26. oy
27. k
28. b
29. dh
30. ng eng
31. ch
32. sh zh
33. l el
34. hh hv
35. ow
36. er axr
37. aw
38. ae
39. iy

Table 2.2: Groups of allowable confusions between the 61 TIMIT phones.


Chapter 3

Feature Space and Statistical Modeling

In the previous chapter we presented some of the most common approaches to the speech recognition problem. We mentioned that the advantage of segment-based models is that they offer a better framework for modeling the statistical dependencies between observations from the same segment of speech. We also assumed that the feature-vector sequence representing speech, which ideally contains as much information as possible for discriminating between different sounds, is given to us by some preprocessor. It is, however, useful to look at the properties of the features extracted by the particular preprocessor before we attempt to determine the classes of models that can be used. The ideal preprocessor should compress all the discriminative information of the speech signal into a small number of coefficients, so that we can avoid dimensionality problems in the modeling part. In addition, since we are dealing with a dynamic phenomenon, if the vectors extracted at different times are uncorrelated and represent only new information, their modeling will be simplified. This, of course, would make this thesis and all the previous research in acoustic modeling unnecessary. Fortunately, building this ideal preprocessor is not an easy problem; we can


Chapter 3. Linearity

philosophically argue that it would require the same effort as when, given a "decent" representation, we try to improve the acoustic models for that representation. Hence, in this chapter we shall first briefly present the features that we shall use in this research, cepstral coefficients, and comment on the reasons that make this representation useful. We shall then investigate some aspects of sequences of these features relevant to the design of a good statistical model, showing that linear models are adequate for representing intrasegmental dependencies.

3.1 Feature Selection

In the past three decades of development in speech recognition, different research centers [24, 50, 25] have converged to a more or less similar feature-extraction component in their recognizers: mel-warped cepstral coefficients. For the sake of completeness, let us see what useful characteristics of the speech signal are preserved in this representation. Speech is a quasi-stationary signal. Many types of phones (vowels, fricatives, nasals, liquids) are almost stationary over intervals of a typical duration of 50 ms or more. The optimum transformation for stationary signals is the Fourier transform, in the sense that it whitens the input signal (it happens to be the Karhunen-Loeve transform in that case [69]). However, since speech is non-stationary, what is often used is the windowed Fourier transform, or Short-Time Fourier Transform (STFT), defined by

X_k(ω) = ∫ e^{−jωt} w(t − kT) x(t) dt,     (3.1)

where x(t) is the speech signal, w(t) is a window function and T is the interval at which the analysis is repeated. The window is used to capture the local characteristics of the signal and obtain a frequency representation localized in time. Time-frequency (t-f) representations are the most common approach to the feature-selection problem in speech recognition. There are many good reasons for


this. Biological experiments have shown that the function of the ear resembles a spectral analyzer. Expert spectrogram1 readers can decode the spoken message from the spectrogram with remarkable success [91]. Speech is also characterized by fast transients that occur in intervals even smaller than the pitch period, and are important in identifying the associated events, as in plosives and transitions from nasals to vowels. Such events represent short-time-scale information in the speech signal, that also has information at longer time scales, like pitch variation. Even though with the STFT it is not easy to obtain good time and frequency resolution at the same time, and lately other promising t-f analysis tools have been introduced, like the wavelet transform [26, 61, 48, 86] or the Wigner-Ville representation [87, 78], the STFT is used extensively for the reasons mentioned above and its low computational cost. In the window Fourier transform, we obtain local information about the frequency contents of the signal by multiplying by a window function. In order to have good frequency resolution, since multiplication in the time domain corresponds to convolution in frequency, we are forced to have a \suciently" long window function - we do not want the smearing e ect in frequency to be too severe. In practice, this translates to typical window functions of 20 ms length - an interval that corresponds to 2-3 pitch periods. This xed-duration window is usually referred to as a frame. The analysis is repeated at short steps, e.g. every 10 ms, in order to capture the short-scale events. Moreover, many speech sounds - vowels, nasals - can be modeled well as the convolution of a periodic excitation with the response of the vocal tract. The period of the excitation signal - its inverse is called fundamental frequency - is not constant but changes signi cantly even within the same sentence. 
[1] The spectrogram is a two-dimensional representation of the STFT, with frequency on the vertical axis, time on the horizontal axis, and amplitude represented on a gray or color scale.

Even though this information, as well as other prosodic cues, may be useful in improving the accuracy


of a recognizer, it is very difficult to model, and only lately have there been attempts to incorporate prosody in recognition. In typical feature extraction approaches, the trend is to separate the response of the vocal tract from the periodic excitation, and this can be achieved by cepstral analysis.

To summarize, the classical preprocessor in a speech recognition system involves a short-time frequency analysis over a window that extends from two to three typical pitch periods (about 20 ms), repeated at a fixed rate[2] (typically every 10 ms). The cepstral coefficients are then computed from the STFT, for example as described in [75]. Alternatively, one can compute the cepstral coefficients from the linear-predictive coefficients [75], but these approaches make little difference in practice. The cepstral coefficients are usually computed on a mel-warped scale [70]; the specific algorithm used in this thesis is described in Appendix 3.A.

In the following paragraphs we shall try to "predict" what types of models can be used for this purpose. Towards the end of the chapter we shall also give experimental evidence to motivate the types of models that we shall introduce later on. Of course, the final judge will be the recognition performance.
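The classical front end described above can be sketched in a few lines. The sketch below is our own illustration, not taken from the thesis: it frames a waveform with a 20 ms Hamming window every 10 ms and takes the STFT magnitude of each frame, assuming a 16 kHz sampling rate (all function and variable names are our own).

```python
import numpy as np

def stft_frames(x, fs=16000, win_ms=20, hop_ms=10):
    """Split a waveform into overlapping frames and return the
    magnitude of the discrete STFT of each frame."""
    win_len = int(fs * win_ms / 1000)   # 20 ms -> 320 samples at 16 kHz
    hop = int(fs * hop_ms / 1000)       # 10 ms -> 160 samples
    window = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[k * hop : k * hop + win_len] * window
                       for k in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

# one second of a 1 kHz tone: the peak of every frame should sit at
# bin 1000 / (16000 / 320) = 20
t = np.arange(16000) / 16000.0
spec = stft_frames(np.sin(2 * np.pi * 1000 * t))
```

With these settings each frame has a frequency resolution of 50 Hz, which is why a long enough window is needed before any mel warping or cepstral computation is applied.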

3.2 Cepstra and Interpolation

The cepstral coefficients of a single frame can be modeled effectively in recognition by a multivariate Gaussian distribution. This has been demonstrated for HMMs in [19, 76, 38], and for the SSM with a block-diagonal distribution in [67] (i.e., under the assumption that feature vectors from different times are independent). We, however, are interested in modeling the dynamic behavior of the cepstral coefficients. The validity of the independence assumption, in both HMMs and the SSM, has been questioned by many researchers, and we shall also see in this chapter that it is far from true. Unfortunately, despite the intuition that the spectral

[2] There are also examples of preprocessors with variable-rate analysis [73].


dynamics are useful for recognition, previous attempts to relax the independence assumption have not been successful [83, 19, 45].

Building a good model for the statistical dependency of the different cepstral vectors is, in essence, equivalent to knowing how to predict, or interpolate, in time. A good predictor, either linear or nonlinear, should convert the observation process to an innovation process. The idea is that, if we are able to build good predictors for the different classes, then we shall be able to distinguish between the classes by selecting the predictor that matches the data best at any particular time. This is an important point, and we shall return to this issue in the following chapter.

A clue to how we can interpolate efficiently is given by the spectrogram. Within vocalized sounds, and also at the transitions between them, spectrograms are characterized by "smooth" formant[3] trajectories. This is depicted in Figure 3.1 and suggests that, if we assume that the transfer function of the vocal tract can be adequately represented by a rational function, then we should be able to approximate well the dynamics of its poles and zeros by piecewise-linear functions, since the formant trajectories correspond to the movement of the vocal-tract poles. Thus, we postulate that linear interpolation in time should be effective for the poles and zeros of the vocal tract, at least for vocalized sounds.

An immediate suggestion could be to replace our feature set with the poles and zeros of the vocal tract. This, however, would not be easy, and probably not effective. The poles and zeros can be extracted from LPC analysis, but a numerical root-finding algorithm would have to be used, and accuracy problems could be encountered. We would also have to determine the order of the polynomials used, and in the interpolation step we would have to match pairs of poles and zeros between the two different times. None of these problems has a straightforward solution.
Instead, we choose an alternative approach. We try to

[3] Formants are the frequencies corresponding to the poles of the vocal-tract system response.


Figure 3.1: Spectrogram for the utterance "They own a big house in the remote ..."

see what the hypothesis of "good" behavior of the poles implies for the interpolation in time of the cepstral coefficients. Let us assume that the transfer function of the vocal tract has the following rational form:

X(z) = A z^r \frac{\prod_{i=1}^{K} (1 - a_i z^{-1}) \prod_{i=1}^{L} (1 - b_i z)}{\prod_{i=1}^{M} (1 - c_i z^{-1}) \prod_{i=1}^{N} (1 - d_i z)}    (3.2)

where all the coefficients a_i, b_i, c_i, d_i have magnitude less than 1; K, M and L, N are the numbers of minimum-phase and maximum-phase zeros and poles, respectively; and r is the order of the zero at 0. Then the coefficients of the complex cepstrum can be expressed in terms of the poles and zeros as [75]:

\hat{x}(n) = \begin{cases} \log A, & n = 0, \\ \sum_{i=1}^{M} c_i^n / n \; - \; \sum_{i=1}^{K} a_i^n / n, & n > 0, \\ \sum_{i=1}^{L} b_i^{-n} / n \; - \; \sum_{i=1}^{N} d_i^{-n} / n, & n < 0. \end{cases}    (3.3)

In practice, it is assumed that the transfer function of the vocal tract is minimum phase, and the cepstrum, rather than the complex cepstrum, is used; but for the discussion here it suffices to look at equation (3.3).


We can see from this equation that the cepstral coefficients are polynomials in the poles and zeros of the system function that are inside the unit circle (or polynomials in negative powers for poles and zeros outside the unit circle), with degree equal to the order of the particular coefficient. Thus, especially for the higher cepstral coefficients, there exists a strong nonlinear relationship between them and the poles/zeros of the transfer function. The conclusion is that, even if we can use linear functions to interpolate the poles and zeros of the system function extracted at two different time instants, we cannot do the same for the cepstral coefficients. We can hope, however, that linear approximations will work for the cepstral coefficients as well, if the sets of coefficients that we try to interpolate are close in space, so that a first-degree approximation to equation (3.3) applies. If this is true, then we can benefit from all the advantages of linear models when we design our acoustic models. In the following section we shall investigate the extent to which linear approximations can be used to model the statistical dependency of distinct observations.
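The pole terms of equation (3.3) are easy to verify numerically. The sketch below is our own check, not part of the thesis: for a minimum-phase all-pole system, the complex cepstrum equals twice the real cepstrum for n > 0, so the cepstrum obtained from the log magnitude spectrum should match (c_1^n + c_2^n)/n.

```python
import numpy as np

# Minimum-phase all-pole system X(z) = 1 / ((1 - c1 z^-1)(1 - c2 z^-1));
# by (3.3) its complex cepstrum is (c1^n + c2^n)/n for n > 0.
c1, c2 = 0.9, -0.5
N = 4096
w = 2 * np.pi * np.arange(N) / N
X = 1.0 / ((1 - c1 * np.exp(-1j * w)) * (1 - c2 * np.exp(-1j * w)))

# real cepstrum = inverse DFT of the log magnitude spectrum; for a
# minimum-phase system the complex cepstrum is 2x the real cepstrum, n > 0
real_cep = np.fft.ifft(np.log(np.abs(X))).real

measured = [2 * real_cep[n] for n in range(1, 6)]
predicted = [(c1 ** n + c2 ** n) / n for n in range(1, 6)]
```

The agreement is essentially exact for a long enough DFT, and the c^n/n decay also makes concrete why higher-order cepstral coefficients depend so nonlinearly on the pole locations.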

3.3 Cepstra and Linear Models

In this section we try to answer the question of what types of models can be used to model the correlation of feature vectors extracted at different times. The dependencies that we are interested in modeling fall into two categories, depending on the time scale that we are looking at. This division is also consistent with the segmental framework: the first type is the mutual influence of observations within a segment of speech, or intrasegmental; the second concerns observations across segment boundaries, or intersegmental. Modeling intrasegmental correlations reflects our knowledge that the language unit within a segment (e.g., the phoneme) corresponds to a structured and homogeneous rather than disordered sound. Modeling intersegmental correlations reflects our


belief that the way a particular speaker in a particular context pronounces certain phonemes should provide information on the realization of neighboring phonemes. The reason for making this distinction, in addition to speech knowledge, is related to our intuition that linear interpolators can be used for cepstra that are close in space, as we discussed in the previous section. Feature vectors extracted within a segment's boundaries are very similar, due to the homogeneity of the phonemes. Thus, we will now try to evaluate the extent to which linear models can be used to model the dependency of observations within a segment. We shall do the same for the second type of dependencies, i.e. the intersegmental ones.

In order to carry out this evaluation, we shall try to predict the observations for a particular phoneme at a fixed position k within a segment. We shall use as independent variables the observations extracted from

1. the same segment, at another fixed position k', and

2. the previous segment, at a position k' fixed relative to that previous segment.

In the second case, the identity of the previous phoneme is held constant, i.e. we will be trying to predict a particular frame of the second phoneme based on a frame of the previous one, for a fixed pair of phonemes. We shall compare the performance of a linear regression, with the cepstral coefficients at k' as independent variables and the coefficients at k as the dependent variables, to that of a nonlinear regression method. For the nonlinear regression we use the alternating conditional expectation (ACE) method [17], which estimates the nonlinear, nonparametric functions in the following regression

\theta(Y) = \phi_1(X_1) + \phi_2(X_2) + \cdots + \phi_d(X_d)    (3.4)


so that the fraction of the variance not explained by the regression,

e = \frac{E\left\{ \left[ \theta(Y) - \sum_{i=1}^{d} \phi_i(X_i) \right]^2 \right\}}{E\{ \theta^2(Y) \}},    (3.5)

is minimized. In our case, the dependent variable Y will be the n-th cepstral coefficient at frame k, x_k(n), and the independent variables will be x_{k'}(1), ..., x_{k'}(d). We choose as our figure of merit the multiple R^2 coefficient, the ratio of the variance explained by the regression to the total variance.

Before we present the R^2 values comparing the two regression methods for several cases, it is useful to examine some scatter plots of pairs of cepstral coefficients extracted at different time instants. In Figures 3.2-3.4 we present various scatter plots of pairs of coefficients from the same speech segment. There are examples of strong positive correlation, when we look at the same coefficient at consecutive frames in the middle of a segment. We can also see examples of zero correlation and even negative correlation, as in Figure 3.2 between the 8th cepstral coefficient of the middle frame of phoneme "ae" and the 9th coefficient of the previous frame. The correlation is weaker when we look at frames that are more than 10 milliseconds apart, as in Figure 3.4, where we see scatter plots for pairs extracted from the 2nd and 6th frames respectively. Nevertheless, the linear trend in the relationship is still evident. Furthermore, we should note the normality of the plots, since the points form clusters without any significant outliers or curious shapes.

Looking at scatter plots of pairs of coefficients extracted from the middle frames of the two segments in phoneme pairs is not as revealing, since the number of points in these plots is smaller. However, in some frequently occurring pairs, it seems that the plots are not as normal and the points are not clustered together in the same nice manner as in the intrasegmental ones (see Figure 3.5).

In order to verify these observations quantitatively, we present in Tables 3.1-3.3 the values of the R^2 coefficient for the linear regression and the ACE method. In


Figure 3.2: Scatter plots for various pairs of cepstral coefficients extracted from the middle (k) and the immediately previous (k' = k - 1) frame of the phoneme "ae".


Figure 3.3: Scatter plots for various pairs of cepstral coefficients extracted from the middle (k) and the immediately previous (k' = k - 1) frame of the phoneme "m".


Figure 3.4: Scatter plots for various pairs of cepstral coefficients extracted from frames k' = 2 and k = 6 of the phoneme "eh".


Figure 3.5: Scatter plots for various pairs of cepstral coefficients extracted from the middle frames of two consecutive phonemes, for four frequently occurring phoneme pairs.


Table 3.1, for the regression on two consecutive frames at the middle of the same segment, we see that not only is the percentage of the variance explained by the linear regression high, but it is also very close to that of the more general ACE method. Even though the R^2 values are a little lower when doing the regression between frames 40 milliseconds apart (Table 3.2), the linear hypothesis still seems adequate.

We can draw two important conclusions from these results. First, the assumption of conditional independence of the observations given the underlying state sequence, used so far in both HMMs and the SSM, is clearly not valid. At reasonable levels of significance, the R^2 values in Tables 3.1-3.2 suggest that the feature vectors within a segment are correlated, and therefore not independent under a Gaussian assumption. The second, very important conclusion is that linear models are adequate for modeling intrasegmental dependencies. Thus, despite the recent trend in speech recognition towards nonlinear methods like neural networks [88, 81], we shall mainly be concerned in this thesis with linear modeling of the intrasegmental statistical dependencies. Nonlinear segment-based models, like the ones introduced in [54, 5], carry an additional computational burden and are also less tractable to analysis. Based on our findings, the additional complexity is not justified.

The second form of correlations that we are interested in modeling consists, as we have mentioned, of the intersegmental ones. Our intuition that linear models may not be adequate for modeling this type of dependency is verified by the R^2 values in Table 3.3, where we try to predict the middle-frame coefficients of the second segment from those of the first segment. The values are of course much lower than in the previous tables. It is, moreover, obvious that the linear regression is not as effective here as the ACE method.
Given the number of observations that we had at our disposal, as for example in the "ix-n" pair, we can conclude that linear models will not be adequate for modeling the intersegmental statistical


Phoneme / Coefficient   R^2, linear   R^2, ACE   Observations
eh / c1                 0.81          0.82       3273
eh / c7                 0.85          0.86       3273
ae / c1                 0.88          0.89       2292
ae / c7                 0.86          0.87       2292
m / c1                  0.69          0.73       3322
m / c7                  0.83          0.84       3322
s / c1                  0.72          0.74       6176
s / c7                  0.51          0.54       6176

Table 3.1: Multiple R^2 values when predicting different cepstral coefficients at the middle of a phoneme from the cepstral coefficients of the previous frame.

Phoneme / Coefficient   R^2, linear   R^2, ACE   Observations
eh / c1                 0.52          0.56       3015
eh / c7                 0.46          0.50       3015
ae / c1                 0.64          0.68       2275
ae / c7                 0.58          0.60       2275
m / c1                  0.35          0.44       2213
m / c7                  0.57          0.60       2213
s / c1                  0.13          0.20       5960
s / c7                  0.31          0.33       5960

Table 3.2: Multiple R^2 values when predicting different cepstral coefficients at the sixth frame of a phoneme from the cepstral coefficients of the second frame.


Phoneme pair / Coefficient   R^2, linear   R^2, ACE   Observations
dh-ax / c1                   0.06          0.34       622
dh-ax / c7                   0.19          0.38       622
ix-n / c1                    0.22          0.35       1659
ix-n / c7                    0.15          0.22       1659
r-iy / c1                    0.43          0.59       618
r-iy / c7                    0.35          0.47       618
ao-r / c1                    0.38          0.54       705
ao-r / c7                    0.40          0.54       705
ix-s / c1                    0.09          0.33       692
ix-s / c7                    0.18          0.41       692

Table 3.3: Multiple R^2 values when predicting different cepstral coefficients at the middle frame of a phoneme from the cepstral coefficients of the middle frame in the previous phoneme, for five frequently occurring pairs.

dependencies.
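The multiple R^2 figure of merit used in these tables can be computed with ordinary least squares. The sketch below is our own illustration on synthetic data (all names are our own; the thesis uses real cepstral frames and the full ACE procedure): a linear dependence yields R^2 near one, mimicking the intrasegmental case, while a strongly nonlinear dependence yields a low linear R^2, mimicking the intersegmental case.

```python
import numpy as np

def multiple_r2(X, y):
    """Fraction of the variance of y explained by a linear
    regression on the columns of X (with an intercept)."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(0)
X = rng.standard_normal((5000, 2))

# linear relationship: most of the variance is explained linearly
y_lin = 2.0 * X[:, 0] - X[:, 1] + 0.1 * rng.standard_normal(5000)
# strongly nonlinear relationship: linear R^2 collapses
y_nonlin = np.sin(3.0 * X[:, 0]) + 0.1 * rng.standard_normal(5000)

r2_lin = multiple_r2(X, y_lin)
r2_nonlin = multiple_r2(X, y_nonlin)
```

A method like ACE would recover a high R^2 in the second case as well, which is exactly the gap visible between the linear and ACE columns of Table 3.3.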

3.4 Discussion

To summarize, in this chapter we first presented our motivation, based on theory and speech knowledge, for the claim that linear models can be used for time interpolation of the cepstral coefficients only if the time between the two observations is small, so that these observations are close in space. We then found, by comparing a linear regression to a nonlinear one, that typical segment durations and the homogeneity of the phonemes result in observations that are sufficiently close in space, so that linear models can be used for modeling the intrasegmental correlations. Unfortunately, this does not seem to be the case for intersegmental correlations. We now


believe that we are in a position to introduce the basic model that we shall propose in this thesis. It is a linear model for representing the first type of statistical dependencies, and we shall present it in the following chapter. In Chapter 7 we shall investigate methods to extend the model to include the modeling of intersegmental correlations.


3.A Signal Processing

The features used in this thesis are the mel cepstra. If we use X(\omega) to denote the Fourier transform of the speech signal, then the mel-warped cepstra are obtained by

c(n) = \frac{1}{\pi} \int_0^{\pi} \cos(n\omega) \, \log |X(\tilde{m}^{-1}(\omega))| \, d\omega,    (3.6)

where \tilde{m}(\omega) is the mel transform, defined by

m(\omega) = \omega_0 \log\left[ 1 + (\alpha - 1) \frac{\omega}{\omega_0} \right]    (3.7)

\tilde{m}(\omega) = \pi \, \frac{m(\omega) - m(\omega_{\min})}{m(\omega_{\max}) - m(\omega_{\min})} = \pi \, \frac{\log \frac{1 + (\alpha - 1)\omega/\omega_0}{1 + (\alpha - 1)\omega_{\min}/\omega_0}}{\log \frac{1 + (\alpha - 1)\omega_{\max}/\omega_0}{1 + (\alpha - 1)\omega_{\min}/\omega_0}}.    (3.8)

The parameter values that we have used for all the experiments in this thesis are \alpha = 2, \omega_0 = 1000 Hz, \omega_{\min} = 80 Hz and \omega_{\max} = 8000 Hz.
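A direct transcription of (3.7)-(3.8) with these parameter values can be sketched as follows. This is our own illustration, assuming (consistently with the integration limits in (3.6)) that the warp maps [\omega_{\min}, \omega_{\max}] onto [0, \pi]; all names are our own.

```python
import math

ALPHA, W0, WMIN, WMAX = 2.0, 1000.0, 80.0, 8000.0

def m(w):
    # equation (3.7): log-compressive warp with knee near W0
    return W0 * math.log(1.0 + (ALPHA - 1.0) * w / W0)

def mel_warp(w):
    # equation (3.8): normalize so that [WMIN, WMAX] maps onto [0, pi]
    return math.pi * (m(w) - m(WMIN)) / (m(WMAX) - m(WMIN))
```

With these values mel_warp(80.0) is 0 and mel_warp(8000.0) is \pi, and frequencies below \omega_0 occupy a disproportionately large share of the warped axis, which is the intended perceptual emphasis on low frequencies.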

Chapter 4

Linear Segment-based Models

We verified experimentally in the last chapter that the output independence assumption is not valid, by showing that the observation vectors within a segment of speech are highly correlated. However, a full-covariance SSM in [83] performed significantly worse than a block-diagonal model, and this result was attributed to insufficient training data, since the model has a large number of free parameters.

In Section 2.1.2, we saw that the distribution component of the SSM models the variable-length sequence of observations within a segment as a sample of the marginal distribution of an "unobserved" fixed-length sequence. Thus, the fixed-length model must be longer than all observations of a particular phone in the training/test data. In [83], a 10-frame-long fixed-length model was used for data analyzed at 10 ms intervals, and observations longer than 10 frames were downsampled to this fixed length. Given the feature-vector dimensions that are used today in CSR (15 or more), a covariance matrix of dimension at least 150 must be estimated for this model. The amount of data required to obtain robust estimates would probably be prohibitive for most practical applications.

In this chapter we shall investigate methods to overcome this limitation of


the SSM and propose what is, to our knowledge, the first model to achieve improved recognition performance through intrasegmental correlation modeling. We shall emphasize the differences between this model and existing methods, and also present classification results on the TIMIT database. The algorithms for using the model in CSR, i.e., training algorithms for estimating the parameters of the model and search algorithms for connected-phone recognition, will be presented in the following chapters. The training and search algorithms are more general and can be used for other acoustic models and for applications other than speech.

4.1 A Gauss-Markov Segment Model

The basic problem that we aimed to solve at the beginning of this research was the training problem of the full-covariance Gaussian SSM. Thus, the model that we are looking for must have a moderate number of free parameters without sacrificing its capability to model time correlation efficiently. Two more observations that can focus our search on a smaller family of models come from Chapter 3. We saw there that linear models can represent well the dependency of feature vectors from the same segment of speech. Thus, jointly Gaussian models must be further investigated. Moreover, we saw that a large percentage of the variance could be explained by the last observation alone. This observation is particularly significant, since it motivates us to make a Markov assumption in modeling the density of the feature vectors.

A natural candidate model that satisfies all three requirements above is a Gauss-Markov structure for the covariance of the SSM. Thus, in terms of the general definition of an acoustic model, the Gauss-Markov (GM) model is a variation of the SSM with the same model set, grammatical constraints and decoding function (see Section 2.1.2). The density of the unobserved feature sequence y


Block-Diagonal    (Md^2 + 3Md)/2
Gauss-Markov      ((3M - 2)d^2 + 3Md)/2
Full-Covariance   (M^2 d^2 + 3Md)/2

Table 4.1: Number of parameters per phone model for an SSM with fixed length M and d-dimensional feature vector.

for a given phone α is characterized by a Markov assumption:

p(y \mid \alpha) = p(y_1 \mid \alpha) \, p(y_2 \mid y_1, \alpha) \, p(y_3 \mid y_2, \alpha) \cdots p(y_M \mid y_{M-1}, \alpha)    (4.1)

where p(y_{k+1} \mid y_k, \alpha) are the conditional Gaussian densities of the jointly normal random vectors y_k, y_{k+1}. This model can be considered a compromise between the full-covariance and the block-diagonal SSM, and has the advantage that the number of parameters that must be estimated increases by less than a factor of 3 over the block-diagonal case (see Table 4.1), while it still models time correlation. Even though the output independence assumption is usually adopted for simplicity in HMMs, it is not necessary for the development of the training and recognition algorithms (see also Definition 2.1 and Chapter 5).

A Markov assumption was adopted for the output distribution of HMMs in the past by Brown [19]. There, an explicit modeling of the correlation between consecutive observation vectors was investigated by removing the usual conditional independence assumption of the observations given the state and conditioning the output distribution on the previous observation vector. This effectively enlarged the state of the modeled process by augmenting it with a continuous component. A simple extension of this work appeared in [45], where the output distributions were conditioned on observations in a longer time window. Since conditional Gaussians were also used in Brown's work for the output distributions, our Gauss-Markov model is similar to his approach. When the segmentation is given, the difference between the conditional Gaussian HMM and the Gauss-Markov SSM is that in the former there is


a discrete random component in the state, whereas the state of the latter consists of only the continuous component, since the warping is done deterministically.

In experiments that we present in Section 4.4 we shall see that the Gauss-Markov model does not achieve any significant improvement in performance over the independent-frame model. On the contrary, when we include in our feature set time derivatives[1] of the cepstral coefficients, computed by fitting a linear regression over a window of 5 frames, the independent-frame model performs significantly better than the GM model. Our results reconfirm the findings of Brown and Kenny et al. with conditional Gaussian HMMs.

There are many possible explanations for the poor performance of the conditional Gaussian HMMs and our Gauss-Markov segment model. The negative derivative result can be attributed in part to the fact that the derivatives at frames near the segment boundaries are computed using feature vectors from the neighboring phones; we observed in Chapter 3 that the relationship between features from different segments is nonlinear and therefore not well modeled by a Gaussian distribution. Incorporating derivatives in the feature set can also make the cross-covariance matrices of consecutive frames nearly singular, since the derivatives represent no additional information and can be approximated by differences of cepstral coefficients. Finally, even though the GM model belongs to a more general class that includes the block-diagonal one and can model the training data better, it is nevertheless a less robust model. We shall see in the following section how this lack of robustness can lead to deterioration in performance, even in cases where sufficient data exists to obtain good statistical estimates for the parameters of our models. We shall also propose a solution to this problem.
[1] It is common practice in many CSR systems to use derivatives of the cepstral coefficients as a simple way to incorporate some form of correlation modeling.
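The Gauss-Markov density (4.1) can be evaluated as a chain of conditional Gaussians. The sketch below is our own illustration in our own notation (the thesis estimates these statistics from data): it scores a fixed-length sequence given per-frame means, covariances, and cross-covariances between adjacent frames, and reduces to the independent-frame (block-diagonal) log-likelihood when the cross-covariances are zero.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian."""
    e = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet
                   + e @ np.linalg.solve(cov, e))

def gm_segment_loglik(y, mu, S, C):
    """log p(y | alpha) under (4.1), built from conditional Gaussians.
    mu[k], S[k]: per-frame mean and covariance; C[k]: cross-covariance
    E{(y_k - mu_k)(y_{k+1} - mu_{k+1})^T} between adjacent frames."""
    ll = gaussian_logpdf(y[0], mu[0], S[0])
    for k in range(len(y) - 1):
        G = np.linalg.solve(S[k], C[k]).T            # = C[k]^T S[k]^{-1}
        cond_mean = mu[k + 1] + G @ (y[k] - mu[k])   # regression on y_k
        cond_cov = S[k + 1] - G @ C[k]
        ll += gaussian_logpdf(y[k + 1], cond_mean, cond_cov)
    return ll

# toy segment: 3 frames of 2-dimensional features, zero means, unit
# marginal covariances
d, M = 2, 3
mu = [np.zeros(d) for _ in range(M)]
S = [np.eye(d) for _ in range(M)]
y = [np.array([0.5, -0.2]), np.array([0.4, -0.1]), np.array([0.3, 0.0])]

ll_indep = gm_segment_loglik(y, mu, S, [np.zeros((d, d))] * (M - 1))
ll_markov = gm_segment_loglik(y, mu, S, [0.5 * np.eye(d)] * (M - 1))
```

On this slowly varying toy segment the correlated model scores higher than the independent-frame model, which is exactly the behavior the GM structure is meant to capture.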


4.2 A Dynamical System Segment Model

Let us review our situation: we have a model that is presumably more general and accurate than our previous approach, with sufficient training data for estimating the model's parameters. Even in this case, however, we are still not guaranteed that our "improved" model will perform better on the test data. The reason is that there may be differences between the training and test data, and a more accurate model may be more susceptible to these mismatches.

In order to elucidate this "paradox", we can consider a simple example. Let us assume that we are given a training set of two-dimensional observations, and that we estimate means and covariances under both the assumption of uncorrelated and that of correlated vector components. This is illustrated in Figure 4.1, where we have plotted the constant-value ellipses of the estimated Gaussian distributions under the correlated assumption (top left) and the independence assumption (top right). Let us also assume that there is a mismatch between the training and test conditions, so that the mean in the test data is shifted from the point estimated on the training data, but the rest of the distribution parameters remain the same. The constant-value locus of the test-data distribution is plotted in the same figure with a dotted line. Because of the mismatch between training and test data, it is possible that the independence assumption, though less accurate on the training set, describes the different test set better. The reason is that a diagonal covariance matrix represents a smoother distribution than the full-covariance one. Despite the simplicity of the example, this situation is quite plausible in speech recognition, where we have large differences between speakers, dialects and recording equipment. Thus, we can get significant improvements if we foresee possible mismatches and smooth the estimated distributions.
In the previous example, we can achieve a better representation of the test data under the assumption of correlated components with sufficient smoothing (see bottom of Figure 4.1). In the Gauss-Markov formalism, we can achieve this smoothing by adding an observation


Figure 4.1: Constant-value ellipses of densities estimated from the training set under various assumptions (solid lines) and the density corresponding to the test set (dotted line). Top left: full covariance (correlated features), no smoothing. Top right: diagonal covariance (independent features). Bottom: full covariance, with smoothing.


noise component to the model. The addition of this component can also account for nonlinearities and modeling error. The more general family of models that we can then use to parameterize the density of a segment of speech is that of linear state-space dynamical systems (DS). We can model the fixed-length sequence y with the following Markov representation for each phone α:

x_{k+1} = F_k(\alpha)\, x_k + w_k    (4.2a)
y_k = H_k(\alpha)\, x_k + \mu_k(\alpha) + v_k,    (4.2b)

where w_k, v_k are uncorrelated zero-mean Gaussian vectors with covariances

E\{ w_k w_l^T \mid \alpha \} = P_k(\alpha)\, \delta_{kl}    (4.3a)
E\{ v_k v_l^T \mid \alpha \} = R_k(\alpha)\, \delta_{kl}    (4.3b)
and \delta_{kl} is the Kronecker delta. We further assume that the initial state x_0 is zero-mean Gaussian with covariance \Sigma_0(\alpha). The Gauss-Markov model is of course a special case of the dynamical system, since it can be represented by the state equation (4.2a) alone, with the identity mapping in place of the observation equation (4.2b).

The class of stochastic linear systems is quite general, and there are many issues that must be resolved before the dynamical system can be used for recognition. In (4.2) we have a time-varying system, and this formulation may introduce too many parameters. The choice of the state-vector dimension in (4.2a) is important, as is the particular structure of the system matrices, so that the model is identifiable [57, 22]. These are the subjects that we will address in the following sections, before we present classification results using this model.

4.2.1 Correlation Assumptions

In (4.2) we modeled the density of the hidden sequence y using the dynamical-system formalism. Thus, we solve the problem of variable segment length as in


the original SSM formulation, by assuming that the observed vector process z is an incomplete version of the underlying process y. Equivalently, we assume that there exist unobserved trajectories in phase space that the basic units of speech follow, and we sample these trajectories at a number of points equal to the observed length. Under the dynamical-system formalism, this assumption translates to a fixed sequence of state-transition matrices for each segment length. Successive observed frames of speech have stronger correlation for longer observations, since the underlying trajectory is sampled at shorter intervals. We shall refer to this assumption as trajectory invariance.

The system parameters in (4.2) are time-varying on the fixed-length trajectory. This approach will introduce problems in the parameter-estimation phase, since the underlying trajectory must be chosen sufficiently long. An additional problem is that many phones have observations that never exceed a small number of frames, so we cannot estimate the time-varying parameters on a long trajectory. Of course, this can be solved by fixed-length trajectories with phone-dependent lengths. An alternative solution, however, which also addresses the estimability problem, is to assume that the system parameters are time-invariant over a number of regions of the underlying trajectory.

Definition 4.1 [TI-DS Segment Model] The Trajectory-Invariance Dynamical System segment model is a special case of the SSM for which the conditional density of the unobserved fixed-length sequence y is given by (4.2).

Let us now introduce a different approach to the modeling of the observed segment. Here we deviate from the SSM framework, in the sense that we do not assume that the densities of short observation lengths are obtained as marginals of a "mother" distribution. Thus, we do not have the complication of missing observations. We redefine the model for the segment distribution as follows. As opposed to trajectory invariance, an alternative assumption, correlation invariance, is that


the correlation between the state vectors at adjacent frames is invariant under time-warping transformations and depends only on the relative position within a segment. The underlying trajectory in state space varies with the observation length, and the sequence of state-transition matrices for a particular observation of a phone depends on the phone's length. The specific rule used to determine the sequence of system matrices for a given segment length can be chosen to be linear, as in previous SSM work. The approach that worked best in practice was to apply the first and last transformations only once, at the beginning and end portions of the phone respectively, and to warp linearly in the middle. Since coarticulation phenomena are more prominent near segment boundaries, this approach minimizes their effect on the estimates of the state-transition matrices, something desirable when the same model is used for a phone in all contexts. The specific sequences of transformation matrices that were used in our experiments with a segment model with 4 invariant regions are presented in Table 4.2 for different segment lengths.

De nition 4.2 [CI-DS Segment Model] The Correlation-Invariance Dynamical System segment model for phone is a segment-based model that parameterizes the density of the observed sequence z = [z1 z2 : : : zl ] with the recursions:

x_{k+1} = F_{I_l(k)}(α) x_k + w_k                          (4.4a)
z_k = H_{I_l(k)}(α) x_k + μ_{I_l(k)}(α) + v_k,             (4.4b)

where w_k, v_k are uncorrelated zero-mean Gaussian vectors with covariances P_{I_l(k)}(α), R_{I_l(k)}(α) respectively, and the initial state x_0 is zero-mean Gaussian with covariance Σ_0(α). I_l(·) is a family of monotonically non-decreasing indexing functions that map the time index k ∈ {1, 2, ..., l} to {1, 2, ..., M}, where M is the number of invariant regions in the model.
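To make the definition concrete, a minimal sketch of the generative recursions (4.4) in NumPy. The function name, the region-indexed parameter lists and the indexing function passed in are illustrative placeholders, not quantities from the thesis:

```python
import numpy as np

def sample_ci_ds_segment(F, H, mu, P, R, Sigma0, index_fn, l, seed=0):
    """Draw one length-l observation sequence z_1..z_l from the CI-DS
    recursions (4.4).  F, H, mu, P, R are lists indexed by invariant
    region; index_fn(k, l) plays the role of I_l(k) for frame k."""
    rng = np.random.default_rng(seed)
    n = Sigma0.shape[0]
    x = rng.multivariate_normal(np.zeros(n), Sigma0)  # x_0 ~ N(0, Sigma_0)
    z = []
    for k in range(1, l + 1):
        i = index_fn(k, l)
        # observation equation (4.4b): z_k = H_i x_k + mu_i + v_k
        v = rng.multivariate_normal(np.zeros(R[i].shape[0]), R[i])
        z.append(H[i] @ x + mu[i] + v)
        # state equation (4.4a): x_{k+1} = F_i x_k + w_k
        x = F[i] @ x + rng.multivariate_normal(np.zeros(n), P[i])
    return np.array(z)
```

A segment of a different length simply reuses the same region-indexed parameters through a different indexing function, which is the essence of the correlation-invariance assumption.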

In Figure 4.2 we show the sampling of the state-space trajectories and the corresponding sequences of state-transition matrices under both the correlation- and trajectory-invariance assumptions.

    Length   Transformation sequence
      2      F1
      3      F0 F3
      4      F0 F1 F3
      5      F0 F1 F2 F3
      6      F0 F1 F2 F2 F3
      7      F0 F1 F1 F2 F2 F3

Table 4.2: Transformation sequences of the correlation-invariance assumption for different segment lengths and M = 4 invariant regions.

Under the trajectory-invariance assumption, a fixed-length, hidden-state process is generated by repeating a fixed sequence of transformation matrices, independent of the segment length. This state process is then corrupted by observation noise and downsampled in order to generate the output process. In speech terms, the trajectory-invariance assumption implies that the vocal-tract movement for a particular phone is always the same, independent of the phone length: short phones are "fast-forward" versions of longer ones. Under the correlation-invariance assumption, the sequence of transformations depends on the observed length, and there is no downsampling. If the time warping in Table 4.2 is analogous to downsampling, then the evolution of the means is similar for the TI and CI models; CI, however, is not limited to this warping. In any case, the dynamical behavior of the second-order statistics under the correlation-invariance assumption describes a vocal-tract movement that is length-dependent; some regions of the model are not used in short phones. In this case, and since the state process is zero-mean, the term covariance invariance would be more appropriate for this assumption.
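One way to realize the warping rule just described — first and last transformations used once, a linear warp over the interior regions in between — is sketched below. The function name is ours, and the tie-breaking for lengths where the middle frames do not divide evenly among the interior regions may differ from Table 4.2:

```python
def transformation_sequence(l, M=4):
    """Region indices for the l-1 state transitions of a length-l segment
    with M invariant regions: region 0 for the first transition, region
    M-1 for the last, and a linear (floor) warp over regions 1..M-2."""
    t = l - 1                      # number of state transitions
    if t == 1:
        return [1]                 # degenerate length-2 case, as in Table 4.2
    seq = [0]
    for j in range(1, t - 1):      # middle transitions, linearly warped
        seq.append(1 + (j - 1) * (M - 2) // max(t - 2, 1))
    seq.append(M - 1)
    return seq
```

The sequence grows with the segment length while always starting in the first region and ending in the last, so long segments spend proportionally more frames in the interior regions.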


Figure 4.2: Sampling of state-space trajectories under Trajectory and Correlation Invariance.


4.2.2 Structure of the Model

Smoothed Gauss-Markov Realization

The first issue in implementing the dynamical system segment model is the selection of the dimensionality of the state space of the model. Without any physical motivation, we could use stochastic realization theory [22] to answer this question. However, since feature vectors extracted from consecutive 10 ms frames of speech are highly correlated, we can choose the dimension of the state to be equal to the size of the observation vector and constrain the observation matrix to H = I, the identity matrix, for all models. Therefore, we effectively assume that the observed process is a noisy version of an underlying Gauss-Markov process. In the remainder of the thesis we shall use the term dynamical-system (DS) segment model to refer to this particular realization.

For this specific structure, a complication might arise regarding the observation-noise distribution. If the observation-noise covariance R(α) is model-specific and a Gauss-Markov model can accurately represent the training data, as we discussed in the last section, then our estimates for the observation noise will have very small values. This was a problem in our initial experiments. If we want to be able to handle possible mismatches between training and test data, we must not allow the observation-noise covariance to attain small values. We can achieve this in many ways. One is to use an observation-noise covariance R(α) = R that is common over all phone models and is not estimated from the training data. A possible selection for R is the lumped estimated covariance of the observations from all classes, multiplied by a small positive constant. The magnitude of the constant can be fine-tuned for the particular application, according to the expected degree of mismatch and the difficulty of the problem.
A second method is to do jackknifing: estimate the parameters of the state equation from one set of data and those of the observation equation from a different set.
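The first smoothing option — an observation-noise covariance tied across all phone models — can be sketched as follows. The constant c, the function name and the dictionary layout are illustrative assumptions, not values from the thesis:

```python
import numpy as np

def common_observation_noise(frames_by_class, c=0.01):
    """R = c * (lumped covariance of the observations from all classes),
    shared by every phone model instead of being estimated per model."""
    all_frames = np.vstack(list(frames_by_class.values()))  # pool all classes
    lumped = np.cov(all_frames, rowvar=False)               # lumped covariance
    return c * np.atleast_2d(lumped)
```

Because R is fixed rather than trained, it cannot collapse toward zero even when a Gauss-Markov model fits the training data almost exactly.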


The structure that we have presented thus far was designed for a feature vector that includes the cepstral coefficients only. If the cepstral derivatives are also used then, since they are computed directly from the cepstra, the state of the process should still consist of cepstral coefficients alone and the observation equation should be modified accordingly. However, since the derivatives are computed over long time windows, the state must be expanded to include the cepstra from multiple frames, so that the assumption of an uncorrelated observation-noise process remains valid. Of course, in such a case there is no need to use derivatives at all, with the possible exception of frames near the segment boundaries. We may also encounter training problems, since the number of parameters increases significantly. A compromise is to incorporate derivatives and use the noisy Gauss-Markov process realization for the whole feature set.

Using an independent-frame model with derivatives in the feature set is not a correct approach either. It suffices to make the simple remark that, since the derivatives are a function of the cepstral coefficients from multiple frames, the observations are not independent, and the coefficients of a particular frame contribute to the likelihood score in more than one position. This problem is less pronounced when derivatives are computed from long time windows. If differences of cepstral coefficients at consecutive frames are used instead of derivatives, the situation can become worse: it is simply wrong to include differences in the feature set and treat the observations as independent. With the DS formalism, the first-derivative information carried by the cepstral differences is incorporated correctly in the likelihood computation. Including differences in the feature set of the DS segment model allows us to capture second-derivative information, provided that the structure of the model is modified accordingly.
In our experiments we captured this information using derivatives, but we did not investigate the use of differences.


Target-state Realization

A variation of the DS model formalism that we shall also examine is related to the hierarchical nature of speech. The realization of a phone is related to a particular point in feature space, the target state, whose exact location depends on the particular speaker, prosodic information and context. A possible model under this framework is to have a prior distribution for the target state, and then model the time dynamics with local distributions conditioned on that state. Since we are interested in Gaussian models, we can represent this model as a special case of the dynamical system segment model with a constant state:

x_{k+1} = x_k = x_0(α)                         (4.5a)
y_k = H_k(α) x_k + μ_k(α) + v_k,               (4.5b)

and the initial state is zero-mean with covariance Σ_0(α). In the target-state model, the observations within a segment are not independent. However, the dependence is not modeled directly, but in a hierarchical fashion, through a random vector that represents the common variation of the different observations. This idea will be particularly useful in a later chapter, where we try to extend our methods to model statistical dependencies at higher hierarchy levels. The target-state model can be considered a compromise between the independent-frame model and the DS segment model, in the sense that, given the target state, the observations are conditionally independent. The distribution of the target state x_0 represents the more global forms of variability. In the DS segment model, the distribution of the initial state x_0 plays this role, but the observations are not conditionally independent given x_0. In speech-recognition terms, for an easy application (a speaker-dependent task, for example) where the target points of the different phonetic classes are well separated, an independent-frame model is very likely to perform well. If, in a more difficult task with multiple speakers, we can represent the additional sources of uncertainty in a prior distribution for the target state, then we can still use a simpler, conditionally independent model within the phonetic segments.

4.3 Classification rule

The decoding rule that we can use is the maximum a posteriori (MAP) rule. We need to calculate the likelihood of the observed sequence z given phone model α. For all the variations of the dynamical system segment model, this likelihood is given by the innovations representation {e_k, k = 1, ..., l} [57], as

log p(z|α) = − Σ_{k=1}^{l} { log |Σ_{e_k}(α)| + e_k^T(α) [Σ_{e_k}(α)]^{−1} e_k(α) } + constant        (4.6)

where e_k can be obtained using the Kalman filter [42] (see also equations (5.5) and (5.6) in Chapter 5) and Σ_{e_k}(α) is the prediction-error covariance given phone model α. In the trajectory-invariance case, innovations are computed only at the points where the output of the system is observed, and the predicted state estimate for these times can be obtained by the m-step-ahead prediction form of the Kalman filter, where m is the length of the last "black-out" interval, i.e., the number of missing observations immediately before the last observed frame.

Using the Kalman filter to compute the quantities needed in (4.6) and perform classification implies that for every phone we must compute the prediction-error covariance Σ_{e_k} and the Kalman gain K_k(α) for k = 1, ..., l. Since this represents a significant computation that we would like to avoid during the actual recognition phase, an alternative strategy is to precompute and store these matrices. This strategy, however, is not efficient in terms of memory requirements: the previous quantities must be computed and stored for all possible lengths l, all models and all k = 1, ..., l. A solution to this problem is to assume that the Kalman parameters are piecewise constant. A single covariance Σ_e and a single Kalman gain K can then be used for each time-invariant region in the DS model, computed by repeating a few iterations of the matrix Riccati equation of the Kalman filter. We found experimentally that the introduction of these piecewise-constant parameters did not cause any loss in performance.

We shall leave the problem of parameter estimation until the following chapter. For the remainder of this chapter we shall assume that, given a set of training data, we are able to obtain the Maximum Likelihood (ML) estimates of the model's parameters.
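As an illustrative, single-region sketch of this scoring strategy: iterate the Riccati recursion to an (approximately) constant prediction-error covariance and gain, then accumulate the innovations log-likelihood (4.6) with those constants. The function names and the fixed iteration count are ours; a multi-region model would switch parameter sets according to I_l(k):

```python
import numpy as np

def steady_state_kalman(F, H, P, R, Sigma0, n_iter=50):
    """A few iterations of the Riccati recursion to obtain the
    piecewise-constant prediction-error covariance Sigma_e and gain K."""
    Sig = Sigma0
    for _ in range(n_iter):
        Sigma_e = H @ Sig @ H.T + R
        K = Sig @ H.T @ np.linalg.inv(Sigma_e)
        Sig = F @ (Sig - K @ Sigma_e @ K.T) @ F.T + P
    return Sigma_e, K

def segment_log_likelihood(z, F, H, mu, Sigma_e, K):
    """Innovations form of (4.6) with constant Sigma_e, K: accumulate
    -(log|Sigma_e| + e_k^T Sigma_e^{-1} e_k) over the frames of z."""
    inv_Se = np.linalg.inv(Sigma_e)
    logdet = np.linalg.slogdet(Sigma_e)[1]
    x_pred = np.zeros(F.shape[0])          # zero-mean initial state
    ll = 0.0
    for zk in z:
        e = zk - (H @ x_pred + mu)         # innovation e_k
        ll -= logdet + e @ inv_Se @ e
        x_pred = F @ (x_pred + K @ e)      # filter + predict
    return ll
```

MAP classification then amounts to evaluating segment_log_likelihood once per phone model and picking the largest score, with no per-frame matrix inversions at recognition time.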

4.4 Experimental Results

A series of phone-classification experiments is presented in this section in order to validate the dynamical system segment model for speech recognition. Specifically, we shall compare the independent-frame SSM, the Gauss-Markov SSM, the correlation- and trajectory-invariance DS segment models and the target-state model.

Previous Attempts in Correlation Modeling

Before we present our main results, it is useful to see how the Gauss-Markov assumption compares to previous approaches to modeling time correlation with segment-based models. In Figure 4.3 we compare the performance of a full-covariance, a Gauss-Markov and an independent-frame model on a phone-classification experiment. This experiment is from an early phase of our work [30], and many of the conditions are different from the rest of the experiments presented in this section. For example, the segment models there always used resampling to a fixed length, and the features were obtained using linear discriminant analysis [34]. Nevertheless, it is clear that the training problem for the Gauss-Markov model is not as acute as for the full-covariance one. In addition, for small numbers of features, where the number of parameters that we estimate is moderate and there is no training issue, we can see that correlation modeling is important and that the Gauss-Markov assumption represents the structure of the covariance matrix well. For large numbers of features, the decrease in performance of the models with correlation modeling can be attributed to many factors, such as training problems, warping to a fixed length and training/test data mismatch. The effect of all these factors can be better understood in the experiments that we present below.

Correlation Assumptions

In Figure 4.4 we compare the two alternative assumptions for the evolution of the spectral dynamics of the model, i.e. we compare the CI-DS segment model (Definition 4.2) to the TI-DS segment model (Definition 4.1). The number of invariant regions used for each phone model was M ≤ 5, depending on the average length of the particular phone. We have plotted the classification rate for different lengths of the test phones, up to length 15. The length of the hypothesized underlying trajectory in the trajectory-invariance assumption was set to 15 for this experiment. When the length of the observed phone is much shorter than the length of the unobserved trajectory, correlation invariance clearly outperforms the trajectory-invariance assumption. Under the latter assumption, consecutive observations are less correlated for shorter phones; from our experimental results, we can conclude that this is not a good assumption. We conjecture that the trajectory-invariance assumption, together with training limitations, was the reason that earlier attempts to incorporate correlation modeling in the stochastic segment model with full-covariance distributions [83] did not succeed. For longer observations, the two assumptions are equivalent and we obtain similar performance. The correlation-invariance assumption was adopted in all the remaining experiments in this thesis, and all subsequent uses of DS will refer to the CI case.


Figure 4.3: Classification rates for a full-covariance, Gauss-Markov and block-diagonal model versus dimension of the feature vector.


Figure 4.4: Classification performance as a function of observation length for Correlation Invariance (CI) and Trajectory Invariance (TI) with an unobserved trajectory of length 15.


Correlation Modeling Results

In the next series of experiments, presented in Figure 4.5, we show the classification rates for the independent-frame (block-diagonal) SSM, the Gauss-Markov SSM and the DS segment model for different numbers of cepstral coefficients in the feature set. We also include in the same plot the classification rates when the derivatives of the cepstra are included in the feature set, so that some form of correlation modeling is included in the independent-frame model. The main result is that the dynamical system model clearly outperforms the block-diagonal model: we have succeeded in improving classification performance by modeling the intrasegmental correlations.

We can draw additional conclusions from Figure 4.5. We notice the significance of incorporating observation noise in the model by comparing the performance of the dynamical system to the model based on the Gauss-Markov assumption. When derivatives are included in the feature set, the Gauss-Markov model performs worse than the block-diagonal model even for small numbers of features, so that training-set size is not an issue. As we have extensively argued, we believe that this low performance is the combined effect of nonlinearities near the segment boundaries and the lack of smoothing. The improvement that observation-noise modeling and the associated distribution smoothing bring in this case is more impressive than in the cepstra-only case.

In order to visualize the success of the dynamical system model in capturing the joint time-spectral structure of speech, we have plotted in Figure 4.6 the original spectrogram of a sentence, together with the "predicted" spectrogram estimates for this sentence using the independent-frame and the dynamical-system models. These are created using the predicted estimates of the cepstral coefficients for each frame, based on observations from all the previous frames and the parameters of the most likely candidate within each segment.
The DS spectrogram captures more details than the independent-frame one and better represents the dynamics of the formant trajectories.

Figure 4.5: Classification rates for various types of correlation modeling and numbers of cepstral coefficients under Correlation Invariance. In the solid-line experiments, the features used are the indicated number of cepstral coefficients and their time derivatives. The derivative of log-power was also used in the experiments with 18 cepstra and their derivatives.

    Features                  Model                    % correct
                                                       61 classes   39 classes
    18 Cepstra                Independent frame        50.6         63.5
                              Target state             52.0         63.0
                              Gauss-Markov             53.9         64.5
                              Dynamical System (CI)    57.9         68.0
    18 Cepstra + 18 Deriv.    Independent frame        61.5         72.1
    + Deriv. log power        Dynamical System (CI)    64.6         73.9

Table 4.3: Phone classification results for context-independent models based on different assumptions about the statistical dependence between features within a segment.

The last series of experiments in this chapter compares the two basic structures that we introduced in Section 4.2.2 to the independent-frame assumption. In this experiment our feature set included 18 cepstral coefficients only. We can see from the results in Table 4.3 that, even though the target-state structure outperforms the independent-frame model, its classification rate is significantly lower than that of the DS segment model. We recall that with the target-state model we use a form of prior distribution for the point in feature space associated with a particular realization of a phone. Thus, we can draw from this result the significant conclusion that, even if some form of "equalization" is performed on the observations along a segment, the correlation among the different samples is still important and contributes significantly to recognition.


Figure 4.6: Original spectrogram (top) and spectrograms created using the predicted estimate of the 18 cepstral coefficients for the most likely candidate phone using the independent-frame model (middle) and the dynamical system model (bottom).


When cepstral derivatives are included in the feature set, the performance of the target-state model degrades significantly, and it performs worse than the block-diagonal model. We believe that the reasons are the same as those for the poor performance of the Gauss-Markov model with derivatives, namely nonlinearities near the segment boundaries and the lack of smoothing of the model distributions to allow for training/test-set mismatches.

4.5 Discussion

Since we believe that linear models can be used to represent the intrasegmental statistical dependencies, we introduced in this chapter a dynamical system segment model and investigated different realizations of this structure. We evaluated the model in classification experiments on the TIMIT database, and we showed that it is the first model to successfully incorporate time correlation. We demonstrated the significance of incorporating an observation-noise term in the model, so that we can (i) anticipate possible mismatches between training and test data and (ii) take into account modeling-error effects. We have not yet shown how to estimate the parameters of the dynamical system model from training data, nor how it can be used efficiently in phonetic recognition; these are the subjects of the following chapters on training and recognition. Another issue that we have not addressed is the modeling of intersegmental correlations. For this, we must deviate from the segmental framework, and it will be the subject of a later chapter.

Chapter 5: Training

Identification of a Stochastic Linear System

Automatic training in a speech recognition system is the mechanism for estimating the model parameters from speech and knowledge of the "encoded message" only, without requiring a hand-aligned transcription of the utterance. The existence of automatic training algorithms is one of the biggest advantages of the statistical methods over other approaches to speech recognition. In statistical approaches, the parameter-estimation method that is used most widely is Maximum Likelihood (ML). The set of parameters in an acoustic model (see Definition 2.1) consists of the parameters in the stochastic components of the grammar and the parameters of the distributions. In order to perform ML estimation, we need to evaluate the likelihood of the data based on the hypothesized model. This task is usually facilitated significantly if the available data include not only the sequence of observations along the uttered sentence, but also the corresponding mode sequence. In this case, since our data consist of both the observations and the mode sequence, we need to maximize the likelihood (2.1), conditioned on the label sequence, with respect to the model's parameters.

Furthermore, we can usually separate the maximization of this likelihood function with respect to the grammar parameters and the distribution parameters by maximizing the logarithm of (2.1), assuming that the two sets have no common parameters and there are no additional constraints. A hand-aligned transcription of an utterance in terms of the acoustic-model definition is either the corresponding mode sequence, or some other constraint that reduces the number of possible mode sequences. If such a transcription, and consequently the mode sequence, is not available, then in order to obtain ML estimates we must maximize the sum of (2.1) over all possible mode sequences. More specifically, we must sum over all mode sequences in the inverse image of the decoding function Φ for the uttered message A:

p(Z|A) = Σ_{Q ∈ Φ^{−1}[A]} p(Z, Q|A).        (5.1)

The maximization of (5.1) is performed efficiently by automatic training algorithms for acoustic models, such as the Baum-Welch algorithm [12]. In segment-based models, assuming that the uttered message consists of a phonetic string, the sum over all possible mode sequences in (5.1) reduces to the sum over all possible segmentations with a given number of segments. If we know how to obtain ML estimates of the segment-model parameters when the segmentation is given, then there exist efficient algorithms for automatic training. Thus, our basic theme in this chapter is to show how training can be performed for the models that we introduced in the previous chapter when the segment boundaries are given. For completeness, we shall also describe how our algorithms can be extended to the case where segment boundaries are unknown. The estimation for the Gauss-Markov model is simpler (see [30]); here, however, we consider it as a special case of the dynamical system model with no observation noise, and we shall only give the solution for the general case.

Estimating the parameters of the dynamical system segment model from training data for a given segmentation is actually the problem of maximum likelihood (ML) identification of a general linear dynamical system. This problem gained much attention soon after the introduction of the Kalman filter [42]. The classical method for obtaining maximum likelihood estimates of the parameters of a linear state-space system involves the construction of a time-varying Kalman predictor and the expression of the likelihood function in terms of the prediction error [4, 37, 43]. For simplicity, one usually assumes that the prediction-error variance is time-invariant [89]. Even under this assumption, which is equivalent to the convergence of the Kalman predictor for a sufficiently large number of observations, the minimization of the log-likelihood function is a nonlinear programming problem. In the traditional approach, iterative optimization methods are used that require the gradient of the log-likelihood function with respect to the system parameters, and some additional computation equivalent to the evaluation of the Hessian, or the second derivative itself if Newton methods are used [37]. The computation of these derivatives becomes complicated in the multiple-output case.

In this chapter, motivated by the popularity of HMMs and the Baum-Welch algorithm [12] that is used for estimating their parameters, we view the problem of parameter estimation of a hidden-state system in a unified way, regardless of the continuous or discrete nature of the state, and present an alternative iterative method for the ML identification of a stochastic linear system. The underlying idea is based on the observation that the identification problem would be simple if the state of the system were observable. This observation can be combined with the Expectation-Maximization (EM) algorithm [27] to provide a conceptually simple approach to the ML identification of dynamical systems. Among other benefits, the simplifying assumption of convergence of the Kalman predictor is no longer necessary, and we can solve more complicated cases, like a partially observable output process, which appears under the trajectory-invariance assumption.
In the remainder of this chapter, we first define the problem of ML estimation of a linear stochastic system and outline the traditional methods used for its solution. We then describe the alternative approach, based on the EM algorithm. We also give the solution for the case of a partially observable process and include a discussion of the relationship of this algorithm to the estimation algorithms that are used for HMMs. We solve the problem without reference to speech recognition, so that Sections 5.1 and 5.2 are self-contained and the algorithms presented there are applicable to other problems as well. In Section 5.3 we apply the estimation algorithm to the training of the dynamical system segment model, and we present algorithms for training when the segmentation is not known.

5.1 Linear System Parameter Estimation

In this section we formulate and solve the problem of ML identification of a time-invariant linear dynamical system driven by white Gaussian noise from a single run of output observations. Alternatively, as is the case with the dynamical system segment model, we could choose to estimate the parameters of a time-varying system from multiple output runs. The extension to the latter case is not difficult and will be dealt with later in the chapter. We assume that a sequence of observations Y = [y_0 y_1 ... y_N] is generated by the finite-dimensional linear dynamical system

x_{k+1} = F x_k + w_k        (5.2a)
y_k = H x_k + v_k            (5.2b)

where the state x is an (n × 1) vector, the observation y is (m × 1), and w_k, v_k are uncorrelated, zero-mean Gaussian vectors with covariances

E{w_k w_l^T} = P δ_{kl}      (5.3a)
E{v_k v_l^T} = R δ_{kl}.     (5.3b)

We further assume that the initial state x_0 is Gaussian with known mean and covariance μ_0, Σ_0.


Maximum likelihood estimates of the unknown parameters θ in F, H, P, R can be obtained by minimizing the negative log-likelihood, or equivalently the quantity [37]

J(Y, θ) = −L(Y, θ) = Σ_{k=0}^{N} { log |Σ_{e_k}(θ)| + e_k^T(θ) Σ_{e_k}^{−1}(θ) e_k(θ) } + constant        (5.4)

where e_k(θ), Σ_{e_k}(θ) are the prediction error and its covariance, and can be obtained from the Kalman filter equations [42]

x̂_{k|k} = x̂_{k|k−1} + K_k e_k        (5.5a)
x̂_{k+1|k} = F x̂_{k|k}                (5.5b)
e_k = y_k − H x̂_{k|k−1}              (5.5c)

where x̂_{0|−1} = μ_0, we have suppressed the parameterization on θ, and

K_k = Σ_{k|k−1} H^T Σ_{e_k}^{−1}           (5.6a)
Σ_{e_k} = H Σ_{k|k−1} H^T + R              (5.6b)
Σ_{k|k} = Σ_{k|k−1} − K_k Σ_{e_k} K_k^T    (5.6c)
Σ_{k+1|k} = F Σ_{k|k} F^T + P              (5.6d)

where Σ_{0|−1} = Σ_0.
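Recursions (5.5)-(5.6) translate directly into code; a minimal NumPy sketch (function and variable names are ours):

```python
import numpy as np

def kalman_innovations(Y, F, H, P, R, mu0, Sigma0):
    """Run the Kalman filter (5.5)-(5.6) over rows y_0..y_N of Y and
    return the innovations e_k and their covariances Sigma_{e_k}."""
    x_pred, Sig_pred = mu0, Sigma0                    # x_{0|-1}, Sigma_{0|-1}
    innovations, inn_covs = [], []
    for y in Y:
        e = y - H @ x_pred                            # (5.5c)
        Sigma_e = H @ Sig_pred @ H.T + R              # (5.6b)
        K = Sig_pred @ H.T @ np.linalg.inv(Sigma_e)   # (5.6a)
        x_filt = x_pred + K @ e                       # (5.5a)
        Sig_filt = Sig_pred - K @ Sigma_e @ K.T       # (5.6c)
        x_pred = F @ x_filt                           # (5.5b)
        Sig_pred = F @ Sig_filt @ F.T + P             # (5.6d)
        innovations.append(e)
        inn_covs.append(Sigma_e)
    return np.array(innovations), inn_covs
```

The negative log-likelihood (5.4) then follows by summing log|Σ_{e_k}| + e_k^T Σ_{e_k}^{−1} e_k over the returned quantities.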

5.1.1 The Classical Solution

In classical approaches to the problem of ML estimation of the system parameters when N is large enough, it is usually assumed that the Kalman predictor has converged, specifically that Σ_{e_k} = Σ_e, K_k = K, Σ_{k|k−1} = Σ. In this sense, the estimates obtained are only approximate ML estimates (in prediction-error identification methods [57] this assumption is not necessary). If the objective is to use the estimates for prediction or filtering, as is usually the case, one can parameterize Σ_e and K directly, instead of P and R, and simplify the calculations [43]. The estimation problem is then the minimization of (5.4) subject to the constraints (5.5) with respect to the free parameters θ in F, H, Σ_e, K.

If, however, the ML estimates of P and R are desired, the solution is more complicated, since we have the additional constraints (5.6). Minimizing (5.4) with respect to θ requires the computation of the gradient and perhaps the Hessian. The quantities that must be computed for this purpose are the state sensitivities with respect to each of the system parameters. One can organize the computation of these sensitivities recursively, and it might appear that a separate recursive filter of the form (5.5) is needed for the calculation of the state sensitivities with respect to a particular parameter. This computation is cumbersome, and an attempt was made to use a reduced-order gradient filter by exploiting the particular symmetries that the gradient filters possess [37]. The situation is actually simpler, though. By attacking the problem as one of constrained minimization, Kashyap [43] showed that a single recursion is needed in addition to the predictor (5.5), corresponding to the computation of Lagrange multipliers. Other researchers [90, 89] came to the same conclusion later using properties of adjoint equations. Akaike [1] had also obtained simple solutions by Fourier-transform methods for the stationary case. It can be shown (see e.g. [89]) that the computation of the necessary derivatives during each iteration of a gradient-type algorithm for the minimization of (5.4) involves the following operations:

1. Solve the matrix Riccati equation (5.6) for a constant Σ_e, and the predictor equation (5.5). Compute the statistics

Γ_{ee} = (1/(N+1)) Σ_{k=0}^{N} e_k e_k^T
Γ_{x̂e} = (1/(N+1)) Σ_{k=0}^{N} x̂_{k|k−1} e_k^T.


2. Solve the adjoint equation backwards for k = N, N−1, ..., 0:

λ_k = F̃^T λ_{k+1} + H^T Σ_e^{−1} e_k        (5.7a)
λ_N = H^T Σ_e^{−1} e_N                       (5.7b)

where F̃ = F(I − KH), and compute the statistics

Γ_{x̂λ} = (1/(N+1)) Σ_{k=1}^{N} x̂_{k−1|k−2} λ_k^T
Γ_{yλ} = (1/(N+1)) Σ_{k=1}^{N} y_{k−1} λ_k^T
Γ_{eλ} = (1/(N+1)) Σ_{k=1}^{N} e_{k−1} λ_k^T.

3. Solve the linear matrix equation for W:

F̃^T W F̃ + (S + S^T)/2 = W

where

S = H^T M H − 2 H^T Σ_e^{−1} Γ_{eλ} F̃
M = Σ_e^{−1} − Σ_e^{−1} Γ_{ee} Σ_e^{−1}.

4. Compute the derivatives of (5.4) from

∂J/∂θ = 2 trace{(∂F/∂θ)(F̃^T W − Γ_{x̂λ} − K Γ_{eλ})}
      + 2 trace{(∂H/∂θ)[−Γ_{x̂e} Σ_e^{−1} − F̃^T Γ_{eλ}^T Σ_e^{−1} + (Γ_{x̂λ} + K Γ_{eλ}) F K + K (I − Γ_{ee} Σ_e^{−1}) − F̃^T W F K]}
      + trace{(∂P/∂θ) W}
      + trace{(∂R/∂θ)(2 Σ_e^{−1} Γ_{eλ} F K + Σ_e^{−1}(I − Γ_{ee} Σ_e^{−1}) + K^T F^T W F K)}.

If the innovations representation is chosen then, as we already mentioned, the problem becomes simpler: it is not necessary to solve the matrix Riccati equation and the matrix equation in the first and third steps above, and the formula for the derivative is somewhat simplified. However, one must still solve the predictor equation (5.5) forwards and the adjoint equation (5.7) backwards and collect all the statistics at each iteration of the minimization algorithm.

5.1.2 Estimation with the EM Algorithm

We now present an alternative approach to the identification problem. Consider for the moment the following slightly modified estimation problem, where we assume that the state of the system (5.2) is observable, and we want to find the ML estimates of the system parameters θ given Y = [y_0 y_1 ... y_N] and X = [x_0 x_1 ... x_N]. It is not hard to see that for this problem the ML estimates of θ are obtained by maximizing

L(X, Y, θ) = − Σ_{k=1}^{N} { log |P| + (x_k − F x_{k−1})^T P^{−1} (x_k − F x_{k−1}) }
             − Σ_{k=0}^{N} { log |R| + (y_k − H x_k)^T R^{−1} (y_k − H x_k) } + constant        (5.8)

since, without loss of generality, w_k and v_k were assumed to be uncorrelated white Gaussian noise sources. Assuming that there are no constraints on the structure of the system matrices, the estimates are (see Appendix 5.A):

F̂ = Γ_4 Γ_3^{-1}    (5.9a)
Ĥ = Γ_6 Γ_1^{-1}    (5.9b)
P̂ = Γ_2 − Γ_4 Γ_3^{-1} Γ_4^T = Γ_2 − F̂ Γ_4^T    (5.9c)
R̂ = Γ_5 − Γ_6 Γ_1^{-1} Γ_6^T = Γ_5 − Ĥ Γ_6^T    (5.9d)

where the sufficient statistics are

Γ_1 = (1/(N+1)) ∑_{k=0}^{N} x_k x_k^T    (5.10a)
Γ_2 = (1/N) ∑_{k=1}^{N} x_k x_k^T    (5.10b)


Γ_3 = (1/N) ∑_{k=1}^{N} x_{k−1} x_{k−1}^T    (5.10c)
Γ_4 = (1/N) ∑_{k=1}^{N} x_k x_{k−1}^T    (5.10d)
Γ_5 = (1/(N+1)) ∑_{k=0}^{N} y_k y_k^T    (5.10e)
Γ_6 = (1/(N+1)) ∑_{k=0}^{N} y_k x_k^T .    (5.10f)

A critical observation, which leads to the estimation algorithm that we present here, is that the original problem can be treated as one with incomplete data, with the state vector playing the role of missing observations. The Expectation-Maximization (EM) algorithm [27] is a two-step iterative algorithm that can be used for parameter estimation in problems with missing data. During the first step, the expectation (E) step, the expected log-likelihood of the complete data (by complete we mean both the observed and the missing components of the data) is calculated based on the observed data and the current parameter estimates. This quantity is then maximized at the second, maximization (M) step with respect to the new parameter estimates. The basic EM iteration for the parameter vector θ can be described as

θ_new = argmax_θ E_{θ_old}{ log p_θ(M, O) | O },    (5.11)

where M and O are the missing and the observed data, respectively. It can be shown [27] that with every iteration of the EM algorithm the likelihood of the observed data p_θ(O) increases. Thus, assuming that the algorithm does not converge to a local optimum, the estimates obtained with the EM algorithm are ML estimates. In our case, one can use the EM algorithm to maximize the following quantity at iteration p:

Q(θ^{(p+1)} | θ^{(p)}) = E_{θ^{(p)}}{ L(X, Y; θ^{(p+1)}) | Y }    (5.12)

with L(X, Y; θ) defined in equation (5.8) and the subscript θ denoting the parameter vector that is used in calculating the expectation. It can also be shown [27] that


the EM algorithm for the exponential family, as is our case under the Gaussian assumption, reduces to computing the conditional expectations of the complete-data sufficient statistics during the E-step and using these in place of the complete-data sufficient statistics in the M-step. Therefore, when the EM algorithm is used to obtain ML estimates of the parameters of (5.2), the statistics (5.10) require the following quantities at iteration p:

E_{θ^{(p)}}{ y_k x_k^T | Y } = y_k E_{θ^{(p)}}{ x_k^T | Y }    (5.13a)
E_{θ^{(p)}}{ y_k y_k^T | Y } = y_k y_k^T    (5.13b)
E_{θ^{(p)}}{ x_k x_{k−1}^T | Y }    (5.13c)
E_{θ^{(p)}}{ x_k x_k^T | Y } .    (5.13d)

Now, since the input processes are Gaussian, the state process will also be Gaussian, and furthermore the conditional distribution of the state of the system given the observations on a fixed interval is Gaussian [22]:

p(x_k | Y) ∼ N( x̂_{k|N}, Σ_{k|N} ) .

It follows that the statistics in (5.13) can be computed from

E_{θ^{(p)}}{ x_k | Y } = x̂_{k|N}    (5.14a)
E_{θ^{(p)}}{ x_k x_k^T | Y } = Σ_{k|N} + x̂_{k|N} x̂_{k|N}^T    (5.14b)
E_{θ^{(p)}}{ x_k x_{k−1}^T | Y } = E_{θ^{(p)}}{ (x_k − x̂_{k|N})(x_{k−1} − x̂_{k−1|N})^T | Y } + x̂_{k|N} x̂_{k−1|N}^T
                               = Σ_{k,k−1|N} + x̂_{k|N} x̂_{k−1|N}^T .    (5.14c)

We can use the fixed-interval smoothing form of the Kalman filter (the RTS smoother) [79] to compute the required statistics. It consists of a backward pass that follows the standard Kalman filter forward recursions (5.5), (5.6). In addition, in both the forward and the backward pass, we need some extra recursions for the computation of the cross-covariances Σ_{k,k−1|N}, the derivation of which is in Appendix 5.B. All the necessary recursions are summarized below.


Forward recursions

x̂_{k|k} = x̂_{k|k−1} + K_k e_k    (5.15a)
x̂_{k+1|k} = F x̂_{k|k}    (5.15b)
e_k = y_k − H x̂_{k|k−1}    (5.15c)
K_k = Σ_{k|k−1} H^T Σ_{e_k}^{-1}    (5.15d)
Σ_{e_k} = H Σ_{k|k−1} H^T + R    (5.15e)
Σ_{k|k} = Σ_{k|k−1} − K_k Σ_{e_k} K_k^T    (5.15f)
Σ_{k,k−1|k} = (I − K_k H) F Σ_{k−1|k−1}    (5.15g)
Σ_{k+1|k} = F Σ_{k|k} F^T + P    (5.15h)

Backward recursions

x̂_{k−1|N} = x̂_{k−1|k−1} + A_k [x̂_{k|N} − x̂_{k|k−1}]    (5.16a)
Σ_{k−1|N} = Σ_{k−1|k−1} + A_k [Σ_{k|N} − Σ_{k|k−1}] A_k^T    (5.16b)
A_k = Σ_{k−1|k−1} F^T Σ_{k|k−1}^{-1}    (5.16c)
Σ_{k,k−1|N} = Σ_{k,k−1|k} + [Σ_{k|N} − Σ_{k|k}] Σ_{k|k}^{-1} Σ_{k,k−1|k}    (5.16d)

To summarize, each iteration of the EM algorithm involves the computation of the sufficient statistics described previously, using the recursions above and the old estimates of the model parameters (E-step). The new estimates of the system parameters can then be obtained from these statistics as the simple multivariate regression coefficients given in (5.9) (M-step). In addition, the structure of the system matrices can be constrained in order to satisfy identifiability conditions [22].
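The E- and M-steps above can be sketched numerically. The following is a minimal numpy sketch of one EM iteration for a time-invariant system (our own illustrative code, not the thesis implementation): the forward pass implements (5.15), the backward pass (5.16), and the M-step reestimates F and P through the regression formulas (5.9a) and (5.9c), with H and R held fixed for simplicity.

```python
import numpy as np

def em_iteration(y, F, H, P, R, x0, S0):
    """One EM iteration for x_k = F x_{k-1} + w_k, y_k = H x_k + v_k.
    E-step: Kalman filter (5.15) + RTS smoother (5.16); M-step: (5.9a), (5.9c).
    H and R are held fixed here for simplicity."""
    N, n = y.shape[0], F.shape[0]
    xp = np.zeros((N, n)); Sp = np.zeros((N, n, n))       # predicted
    xf = np.zeros((N, n)); Sf = np.zeros((N, n, n))       # filtered
    Cf = np.zeros((N, n, n))                              # filtered cross-covariance (5.15g)
    x_pr, S_pr = x0.copy(), S0.copy()
    for k in range(N):
        xp[k], Sp[k] = x_pr, S_pr
        Se = H @ S_pr @ H.T + R                           # innovation covariance (5.15e)
        K = S_pr @ H.T @ np.linalg.inv(Se)                # Kalman gain (5.15d)
        xf[k] = x_pr + K @ (y[k] - H @ x_pr)              # (5.15a), (5.15c)
        Sf[k] = S_pr - K @ Se @ K.T                       # (5.15f)
        if k > 0:
            Cf[k] = (np.eye(n) - K @ H) @ F @ Sf[k - 1]   # (5.15g)
        x_pr, S_pr = F @ xf[k], F @ Sf[k] @ F.T + P       # (5.15b), (5.15h)
    xs, Ss, Cs = xf.copy(), Sf.copy(), Cf.copy()          # smoothed quantities
    for k in range(N - 1, 0, -1):
        A = Sf[k - 1] @ F.T @ np.linalg.inv(Sp[k])        # (5.16c)
        xs[k - 1] = xf[k - 1] + A @ (xs[k] - xp[k])       # (5.16a)
        Ss[k - 1] = Sf[k - 1] + A @ (Ss[k] - Sp[k]) @ A.T # (5.16b)
        Cs[k] = Cf[k] + (Ss[k] - Sf[k]) @ np.linalg.inv(Sf[k]) @ Cf[k]  # (5.16d)
    # M-step: conditional expectations (5.14) feed the regressions (5.9)
    Exx = Ss + np.einsum('ki,kj->kij', xs, xs)            # E{x_k x_k^T | Y}
    Exx1 = Cs + np.einsum('ki,kj->kij', xs, np.roll(xs, 1, axis=0))  # E{x_k x_{k-1}^T | Y}
    G2, G3, G4 = Exx[1:].mean(0), Exx[:-1].mean(0), Exx1[1:].mean(0)
    F_new = G4 @ np.linalg.inv(G3)                        # (5.9a)
    P_new = G2 - F_new @ G4.T                             # (5.9c)
    return F_new, P_new, xs, Ss
```

On simulated stable scalar data, repeating this iteration should drive the estimate of F toward the true value to within sampling error.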


5.1.3 Estimation with Missing Observations

The algorithm presented in the previous section was motivated in our case by the estimation of the dynamical system segment model under the trajectory-invariance assumption, where some of the output observations are missing. We refer to this problem as dynamical system estimation with blackouts (intervals where the observations are missing). Extension of the classical methods to this problem is very complicated. For example, the prediction error variance cannot be time-invariant: it depends on the length of the last blackout interval. On the other hand, extension of the EM-based estimation algorithm is straightforward, by observing that the missing observations and the imperfect state information can be jointly treated as missing data. Analytically, let us now use Z to denote collectively the observed outputs of the system, obtained by downsampling the process Y; for example,

Z = [ I 0 0 0 … 0 ;
      0 0 0 I … 0 ;
      … ;
      0 0 0 0 … I ] [y_0 y_1 y_2 … y_N]^T .    (5.17)

In this case, the computation of the conditional expectations in (5.13) should be modified as follows, to account for the missing observations:

E_{θ^{(p)}}{ y_k | Z } = y_k, if observed;  H E_{θ^{(p)}}{ x_k | Z }, if missing.    (5.18a)
E_{θ^{(p)}}{ y_k y_k^T | Z } = y_k y_k^T, if observed;  R + H E_{θ^{(p)}}{ x_k x_k^T | Z } H^T, if missing.    (5.18b)
E_{θ^{(p)}}{ y_k x_k^T | Z } = y_k E_{θ^{(p)}}{ x_k^T | Z }, if observed;  H E_{θ^{(p)}}{ x_k x_k^T | Z }, if missing,    (5.18c)


where we have used the law of iterated expectation

E_{θ^{(p)}}{ g(y_k) | Z } = E_{θ^{(p)}}{ E_{θ^{(p)}}{ g(y_k) | Z, x_k } | Z }
                         = E_{θ^{(p)}}{ E_{θ^{(p)}}{ g(y_k) | x_k } | Z }    (5.19)

for any vector-valued function g of the observation y_k. In order to compute the smoothed state estimates and their variances in (5.18), the recursions of the forward pass of the E-step (5.15) have to be slightly modified, as follows. The innovations (5.15c), the Kalman gain (5.15d) and the variance (5.15e) are only computed when we have an observation. In this case, the updates of the filtered state estimates and the corresponding variance are as in (5.15). If, however, an observation y_k is missing, the filtered estimate x̂_{k|k} and its variance Σ_{k|k} are equal to the corresponding predicted estimates x̂_{k|k−1}, Σ_{k|k−1}. Finally, the recursions of the backward pass (5.16) remain unchanged, since those computations do not involve the output process Z.
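The modified forward pass for blackouts can be sketched as follows; this is an illustrative numpy fragment with our own naming, not the thesis code. When y_k is missing, the filtered quantities fall back to the predicted ones, and the missing output is imputed by (5.18a).

```python
import numpy as np

def kalman_filter_blackouts(y, observed, F, H, P, R, x0, S0):
    """Forward pass with missing observations: when observed[k] is False,
    the filtered estimate equals the predicted one and y_k is imputed as H x_{k|k}."""
    N, n = y.shape[0], F.shape[0]
    xf = np.zeros((N, n)); Sf = np.zeros((N, n, n)); y_imp = y.copy()
    x_pr, S_pr = x0.copy(), S0.copy()
    for k in range(N):
        if observed[k]:
            Se = H @ S_pr @ H.T + R                 # innovation covariance
            K = S_pr @ H.T @ np.linalg.inv(Se)      # Kalman gain
            xf[k] = x_pr + K @ (y[k] - H @ x_pr)
            Sf[k] = S_pr - K @ Se @ K.T
        else:                                       # blackout: no update, only prediction
            xf[k], Sf[k] = x_pr, S_pr
            y_imp[k] = H @ xf[k]                    # E{y_k | Z}, eq. (5.18a)
        x_pr, S_pr = F @ xf[k], F @ Sf[k] @ F.T + P
    return xf, Sf, y_imp
```

During a blackout the filtered variance simply propagates through the state dynamics, so the uncertainty grows until the next observation arrives.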

5.2 General Markov-state sources

The underlying (hidden) state process of a stochastic linear system is continuous, modeled as a Gauss-Markov process, and the observed process is a transformed and noisy version of it. In contrast, hidden Markov models are characterized by a finite, discrete state space, where the state process, which we referred to in Chapter 2 as the mode process, is modeled as a Markov chain. The observed process is then again modeled as a "noisy" version of the mode process, with each sample drawn from a distribution indexed by the current state, the output distribution. There exists a strong analogy between the estimation algorithms for HMMs and for linear state-space stochastic systems. The parameter estimation method for HMMs, known as the Baum-Welch algorithm [12], is also a version of the EM algorithm, where the missing data is the unobserved discrete state. In the


algorithm that we presented in this chapter, the continuous state of the system was also treated as missing data. Furthermore, in both the dynamical system and HMMs, the E-step of the EM algorithm consists of a forward and a backward pass through the data: the Forward-Backward algorithm for HMMs (see Appendix 5.C) or the RTS smoothing algorithm for the dynamical system. In the HMM case, we have a discrete state space, and the Forward-Backward algorithm estimates a discrete probability mass function at each point in the interval that we are looking at. In the dynamical system case, the state of the system follows a Gaussian distribution, and we use a variation of the RTS smoother to obtain the required first- and second-order statistics of the state process. Some of the analogies between the stochastic linear system and HMMs are summarized in Table 5.1, where Q, |Q| < ∞, is the mode space, the set of transition probabilities {a_pq | p, q ∈ Q} together with the initial state probabilities forms one parameter group, and B is the collection of probability measures {b_q(y) | q ∈ Q}. The more general way to model a time-varying signal with a Markov state process would be to assume that there are both continuous and discrete unobserved components of a composite state (q, x) ∈ Q × R^n, the former referred to as the mode and the latter as the state of the process. Under the HMM framework, this is equivalent to using the basic dynamical system equations to describe the observation distribution. From a dynamical system perspective, the notion of a composite state with discrete and continuous components can be interpreted as having a time-varying system, with system parameters indexed by a discrete Markov random variable. If we denote the mode sequence over the observation interval by Q = [q_1 q_2 … q_N], and we try to apply the EM algorithm for estimation, we can rewrite the quantity that is maximized at each iteration as

E{ log p_θ'(X, Q, Y) | Z } = E{ E{ log p_θ'(X, Q, Y) | Z, Q } | Z }
                           = ∑_{Q∈Q^N} p_θ(Q|Z) E{ log p_θ'(X, Q, Y) | Z, Q }


                          Hidden Markov Models                 Dynamical System
State                     Discrete q_k ∈ Q, |Q| < ∞            Continuous x_k ∈ R^n
                          Markov chain                         Gauss-Markov
Noisy observation         Output distribution b_{q_k}(y_k)     Observation equation y_k = H x_k + v_k
Estimation: EM algorithm
  E-step                  Forward-Backward algorithm           RTS forward-backward smoother
                          p(q_k | y_1, …, y_N)                 E{x_k | y_1, …, y_N}
                          p(q_k, q_{k−1} | y_1, …, y_N)        E{x_k x_{k−1}^T | y_1, …, y_N}
  M-step                  Relative frequencies                 Multivariate regression

Table 5.1: Analogies between HMMs and stochastic linear systems.

 = ∑_{Q∈Q^N} p_θ(Q|Z) [ log p_θ'(Q) + E{ log p_θ'(X, Y | Q) | Z, Q } ] .    (5.20)

From the last term in (5.20), which we can identify as the quantity maximized in (5.12) when the discrete state sequence is given, we can see that the EM algorithm in this case would involve a separate forward-backward Kalman smoother for each possible discrete state sequence. In addition, it is not possible to derive simple forward recursions for the joint likelihood of the observations and the discrete state, as in the regular HMM case, the reason being that the Kalman predictor for the continuous state depends on the whole history of the discrete state component. Thus, maximum-likelihood estimation with the EM algorithm becomes complicated for the general case. However, we can follow a different strategy here and use an iterative procedure that alternates between finding the best mode sequence given the current parameter estimates and then applying the continuous forward-backward smoother to reestimate the linear-system parameters. This algorithm belongs to the general family of k-means, or generalized Lloyd, clustering algorithms used in vector quantization.² In the speech recognition problem, k-means training procedures have appeared for both HMMs [77] and segment-based models [67]. We conclude this section by noting that a similar approach, applied to time-series modeling and forecasting, was first used by Shumway and Stoffer in [85]. The differences in our case are that: 1) our reestimation formulas are simpler, 2) we have addressed the issue of blackout intervals, 3) motivated by our application in speech recognition, we viewed the problem as the continuous-state analog of the Baum-Welch training algorithm for HMMs, and 4) we have described estimation methods for the general Markov case. The idea of using the expected value of the state vector to obtain estimates for the system parameters was also used by Gibson in [36], where a state-space model was used for speech enhancement, and

² For a discussion of the name and the origin of the algorithm, see [59].


can actually be traced back to Lim [56]. However, the work presented there is not an EM procedure and does not provide true ML estimates, since only the first-order statistics are used.

5.3 Training of the Dynamical System Model for CSR

5.3.1 Training from Sentence Transcriptions

When the segment boundaries are given, training the dynamical system segment model for continuous speech recognition is essentially the same problem as estimating the parameters of a stochastic linear system. The difference from the general case that we developed in the previous sections is that we are estimating the parameters of a time-varying dynamical system from multiple sets of segment observations (multiple runs), rather than those of a time-invariant system from a single run. We have a fixed number m of different sets of system parameters for each phone model (the time-invariant regions), and within the boundaries of each run, the region index to be used at time k is determined by the correlation assumption that we have adopted (e.g., correlation versus trajectory invariance). Thus, the parameter set for each model α consists of

Θ(α) = { F_i(α), H_i(α), P_i(α), R_i(α), μ_i(α), i = 1, …, m; Σ_0(α) },    (5.21)

and includes a region- and phone-dependent mean μ_i(α) of the observation noise term. To see how the reestimation formulas must be transformed for the time-varying case, let us use I_l(k) to denote the time-warping function that returns the region index that must be used at time k for an observation of length l, as we did in the definition of the dynamical system segment model. Let us also define the


operator

⟨ · ⟩_i = (1/constant) ∑_{l} ∑_{all occurrences of length l} ∑_{k: I_l(k)=i} ( · )    (5.22)

to denote the summation over all allowable phone durations l, over all occurrences of length l for a particular phone, and over all times within each occurrence at which the parameters of the i-th region are used. "Constant" is the number of terms in this summation. The reestimation formulas of the EM-based estimation algorithm for the system parameters of the i-th region, i = 1, …, m, appear below.

Reestimation Formulas for the Dynamical System Model

F̂_i = ⟨ E{x_{k+1} x_k^T | z} ⟩_i ( ⟨ E{x_k x_k^T | z} ⟩_i )^{-1}    (5.23a)
P̂_i = ⟨ E{x_{k+1} x_{k+1}^T | z} ⟩_i − F̂_i ⟨ E{x_k x_{k+1}^T | z} ⟩_i    (5.23b)
[ Ĥ_i  μ̂_i ] = [ ⟨ E{y_k x_k^T | z} ⟩_i  ⟨ E{y_k | z} ⟩_i ] ( [ ⟨ E{x_k x_k^T | z} ⟩_i  ⟨ E{x_k | z} ⟩_i ; ⟨ E{x_k^T | z} ⟩_i  1 ] )^{-1}    (5.23c)
R̂_i = ⟨ E{y_k y_k^T | z} ⟩_i − [ ⟨ E{y_k x_k^T | z} ⟩_i  ⟨ E{y_k | z} ⟩_i ] [ Ĥ_i  μ̂_i ]^T    (5.23d)

We have used, as usual, z to denote the set of observations along a segment of speech. We have assumed that the means μ_i(α) appear in the observation equation, and the reestimation formulas for H_i, R_i, μ_i can be derived similarly to those of the state equation by augmenting the state with the constant 1 (see Appendix 5.A). For the general time-varying case, when there is no tying of parameters over time-invariant regions, we can equivalently incorporate the means μ_i(α) in the state equation rather than in the observation equation. With parameter tying the two approaches are not equivalent in principle, but in the experimental evaluation they
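The role of the warping function I_l(k) and of the pooling operator (5.22) can be illustrated with a small example. The sketch below is our own construction, assuming a linear warping into m = 3 regions: it pools frames from segments of different lengths into region accumulators and estimates a per-region output mean, the simplest of the tied statistics entering (5.23).

```python
import numpy as np

def region_index(k, l, m):
    """Linear time-warping I_l(k): map frame k of an l-frame segment to one of m regions."""
    return min(m - 1, (k * m) // l)

def region_means(segments, m):
    """Pool frames over all segments and all times by region (the <.>_i operator
    of eq. 5.22) and return the per-region sample mean of the observations."""
    sums = np.zeros(m)
    counts = np.zeros(m)
    for seg in segments:
        l = len(seg)
        for k, frame in enumerate(seg):
            i = region_index(k, l, m)   # which tied region this frame belongs to
            sums[i] += frame
            counts[i] += 1
    return sums / counts
```

The same pooling applies unchanged to the outer-product statistics in (5.23); only the accumulated quantity differs.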


performed similarly. The conditional expectations above must use the variation of the smoother for the blackout case under the trajectory-invariance assumption, since length variability is treated with subsampling of a hidden trajectory. Finally, in the parameter set (5.21), in addition to F, P, H and R, we have included Σ_0(α). The initial-state covariance can also be estimated, since we have multiple observations of any particular phone in the training data, and the reestimation formula is easily derived:

Σ̂_0 = (1/constant) ∑_{all occurrences} E{ x_0 x_0^T | z } .    (5.24)

Since the output mean of the first region is not specific to the first frame, but is shared among many frames and different segment lengths, we could also have included in the parameter set a non-zero mean μ_0 for the initial state. For simplicity, however, and since incorporation of μ_0 made no difference in our experiments, we have omitted it here.

5.3.2 Training without Transcriptions

When a sentence transcription is not available, in order to obtain ML estimates we must maximize the likelihood (5.1), as we discussed in the introduction of this chapter. For HMMs, this training can be done using the Baum-Welch algorithm, the E-step of which is the forward-backward algorithm. In Appendix 5.C we present the forward-backward algorithm for the general case of an acoustic model, under a Markov assumption on the mode dynamics. Thus, it can also be applied at the E-step of a general EM procedure for estimating the parameters of either the SSM or the dynamical system segment model from unknown segmentations. The general form of this procedure can be summarized:


EM automatic training of segment-based models

Step 1 Run the forward-backward algorithm described in Appendix 5.C to compute p(segment | data) for all possible segments in a sentence.

Step 2 Use the usual known-transcription reestimation formulas, appropriately weighting the sufficient statistics of each segment by the probabilities above.

Step 3 If not converged, go to Step 1.

For the dynamical system case, however, such an approach would require that the RTS smoother be used for all segments in all possible segmentations with a fixed number of segments, which is computationally very expensive. Moreover, a necessary condition for the derivation of the forward-backward recursions is that the probability of the observations associated with the current mode given the mode sequence depends only on the current mode (see Appendix 5.C). For the dynamical system, this implies that the state of the system must be reinitialized at the beginning of each segment. This requirement is a consequence of the fact that the computation of the likelihood involves the Kalman predictor for all observations since the first time the state was chosen randomly. Thus, if the dynamics are not reinitialized across segment boundaries, the likelihood computation will involve observations from previous segments. We have made this assumption so far by including an initial-state distribution in the parameter set of each segment model. If we decide to drop it, the EM procedure outlined here cannot be used. This is consistent with the observation at the end of Section 5.2 that an EM procedure for a Markov source with discrete mode and continuous state components is computationally infeasible. A simple alternative solution to the automatic training problem for segment-based models was first introduced by Ostendorf and Roukos in [67]. That method is a "coordinate-descent" scheme: at each iteration one alternates between finding the best segmentation given the current parameter estimates and reestimating the parameters based on the new segmentation, similar to the segmental k-means algorithm and the procedure outlined for the general Markov-state case.
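The alternation can be sketched on a toy problem. The fragment below is our own illustration, not the algorithm of [67]: two single-Gaussian "phone" models with unit variance, a brute-force search for the best single boundary given the current means, and reestimation of the means given the segmentation.

```python
import numpy as np

def best_boundary(z, mu):
    """Given current model means mu = (mu0, mu1), find the boundary b that
    maximizes the likelihood of z[:b] under model 0 and z[b:] under model 1
    (equivalently, minimizes total squared error for unit-variance Gaussians)."""
    costs = [np.sum((z[:b] - mu[0]) ** 2) + np.sum((z[b:] - mu[1]) ** 2)
             for b in range(1, len(z))]
    return 1 + int(np.argmin(costs))

def segmental_kmeans(z, mu, iters=10):
    """Alternate between segmentation and mean reestimation (coordinate descent)."""
    for _ in range(iters):
        b = best_boundary(z, mu)                 # segmentation step
        mu = (z[:b].mean(), z[b:].mean())        # reestimation step
    return mu, b
```

Each step can only increase the joint likelihood of the data and the segmentation, so the procedure converges, though not necessarily to a global optimum.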

5.4 Experimental Results

We first applied the estimation algorithm to an artificially generated Gaussian random sequence, with the dimension of the output vector equal to the dimension of the state vector, 3. For identifiability purposes, the observation transformation matrix was restricted to be equal to the identity matrix. In Table 5.2 we see the resulting estimates at successive iterations. The convergence of the estimates follows the known pattern for the EM algorithm, with a rate that is initially fast and slows down significantly after the first few iterations. For the speech recognition application, however, we want to avoid over-fitting the models to the training data, and we usually limit the estimation procedure to only a few iterations. The training algorithm for the dynamical system model is very effective, as we can see in Figure 5.1, where we present the normalized log-likelihood of the training data and the classification rate of the test data versus the number of iterations. The plot is for the correlation invariance assumption, but the convergence properties were similar under the trajectory invariance assumption. We used 10 cepstra for this experiment, and the initial parameters for the models were uniform across all classes, except for the class-dependent means. We can see the fast initial convergence of the EM algorithm, and that the best performance is achieved after only 4 iterations.


[Table 5.2 lists the actual 3×3 matrices F, P and R, the initial estimates, and the estimates at successive iterations (1, 2, 4, 8, …).]

Table 5.2: Estimation of F, P, R from artificial Gaussian data.


Figure 5.1: Classification performance of test data vs. number of iterations, and log-likelihood ratio of each iteration relative to the convergent value for the training data. Results use the correlation invariance assumption and 10 cepstral coefficients.


5.5 Discussion

In this chapter, we presented a non-traditional approach to the system identification problem in order to apply it to the training of the dynamical system segment model. Classical solutions to the system identification problem require the computation of the first derivatives and some form of evaluation of the second derivatives. The complexity of these algorithms has resulted in the limited use of state-space stochastic systems in modeling random processes, and smaller classes, like autoregressive models, have been used more extensively because of the existence of efficient algorithms for the parameter estimation problem. Our approach to the problem using the EM algorithm, however, has simple reestimation formulas and good convergence properties, is applicable to time-varying problems, and can also solve the more difficult problem where the output of the system is only partially observed. It can, therefore, revive interest in these types of models. For our problem, the EM-based estimation made the training of the dynamical system segment model feasible, especially in the trajectory-invariance case. Furthermore, there is a theoretically appealing analogy between this work and the parameter estimation problem in hidden Markov models.


5.A Maximization of the log-likelihood function

In this appendix we derive the maximum-likelihood estimates for the general multivariate regression; i.e., we want the estimates of F, P that maximize

L(X, Y) = −(N/2) log|P| − (1/2) ∑_{k=1}^{N} (y_k − F x_k)^T P^{-1} (y_k − F x_k) + constant.    (5.25)

This result is used to derive the reestimation formulas (5.9) and (5.23). A more "classical" proof of this result can be found elsewhere (see, e.g., [3]); below we shall give a simpler derivation. We first give the solution to the following more general problem:

Proposition 5.1 If [e_1(θ) e_2(θ) … e_N(θ)] has full rank, and θ, P have no common parameters, then the quantity

L = −(N/2) log|P| − (1/2) ∑_{k=1}^{N} e_k^T(θ) P^{-1} e_k(θ) + constant    (5.26)

is maximized by

θ̂ = argmin_θ det[ (1/N) ∑_{k=1}^{N} e_k(θ) e_k^T(θ) ]    (5.27a)
P̂ = (1/N) ∑_{k=1}^{N} e_k(θ̂) e_k^T(θ̂)    (5.27b)

In order to prove this, we use the following two formulas for derivatives with respect to symmetric matrices:

∂ log det P / ∂P = P^{-1}    (5.28a)
∂ (e^T P e) / ∂P = e e^T ,    (5.28b)

where ∂/∂P denotes taking partial derivatives with respect to each element of the matrix and arranging the results in a matrix.


Now, maximizing (5.26) with respect to the elements of P is equivalent to maximizing with respect to the elements of its inverse. Taking the partial derivative of (5.26) with respect to P^{-1} and using (5.28a), (5.28b), we can write

∂L/∂P^{-1} = (N/2) ∂ log det P^{-1} / ∂P^{-1} − (1/2) ∑_{k=1}^{N} ∂ [ e_k^T(θ) P^{-1} e_k(θ) ] / ∂P^{-1}
           = (N/2) P − (1/2) ∑_{k=1}^{N} e_k(θ) e_k^T(θ)    (5.29)

and setting this to zero we obtain

P̂(θ) = (1/N) ∑_{k=1}^{N} e_k(θ) e_k^T(θ) .    (5.30)

To find the estimate for θ, we substitute the above in (5.26). Using the matrix identity a^T A a = trace{A a a^T}, we can see that

∑_j e_j^T(θ) P̂^{-1}(θ) e_j(θ) = ∑_j e_j^T(θ) [ (1/N) ∑_k e_k(θ) e_k^T(θ) ]^{-1} e_j(θ)
 = trace{ [ (1/N) ∑_k e_k(θ) e_k^T(θ) ]^{-1} ∑_j e_j(θ) e_j^T(θ) } = N trace{I}

is a constant. Thus, by taking the partial derivative with respect to θ we obtain (5.27a). □

To obtain the estimates for F, P in the general multivariate regression problem, we note that by setting

θ = F    (5.32a)
e_k(θ) = y_k − F x_k    (5.32b)

and using some elementary matrix algebra and the proposition, we can find

F̂ = [ ∑_{k=1}^{N} y_k x_k^T ] [ ∑_{k=1}^{N} x_k x_k^T ]^{-1}    (5.33a)
P̂ = (1/N) ∑_{k=1}^{N} y_k y_k^T − [ (1/N) ∑_{k=1}^{N} y_k x_k^T ] F̂^T .    (5.33b)
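As a quick numerical check of (5.33), the closed-form estimates can be compared against a generic least-squares solver on simulated data; the data and names below are our own.

```python
import numpy as np

def ml_regression(X, Y):
    """ML estimates (5.33) for y_k = F x_k + e_k, e_k ~ N(0, P):
    F_hat = (sum y x^T)(sum x x^T)^{-1},
    P_hat = (1/N) sum y y^T - (1/N)(sum y x^T) F_hat^T."""
    Syx = Y.T @ X                      # sum_k y_k x_k^T
    Sxx = X.T @ X                      # sum_k x_k x_k^T
    F_hat = Syx @ np.linalg.inv(Sxx)   # (5.33a)
    N = X.shape[0]
    P_hat = (Y.T @ Y) / N - (Syx / N) @ F_hat.T   # (5.33b): residual covariance
    return F_hat, P_hat

rng = np.random.default_rng(0)
F_true = np.array([[0.8, 0.1], [0.0, 0.6]])
X = rng.normal(size=(500, 2))
Y = X @ F_true.T + 0.1 * rng.normal(size=(500, 2))
F_hat, P_hat = ml_regression(X, Y)
```

The ML estimate of F coincides with the ordinary least-squares solution, and P_hat is the (positive semidefinite) sample covariance of the residuals.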


5.B E-step recursions for DS estimation

We now derive the recursions required for the E-step of the dynamical system estimation algorithm, in addition to the RTS smoother recursions. Let us first define

Σ_{i,j|k} = E{ (x_i − x̂_{i|k})(x_j − x̂_{j|k})^T }    (5.34)

where x̂_{i|k} is the projection of x_i on the Hilbert space spanned by y_1, y_2, …, y_k. We shall also use the simplified notation Σ_{i|k} for Σ_{i,i|k}.

Proposition 5.2 The following relations hold:

Σ_{k,k−1|k} = (I − K_k H) F Σ_{k−1|k−1}    (5.35)
Σ_{k,k−1|N} = Σ_{k,k−1|k} + [Σ_{k|N} − Σ_{k|k}] Σ_{k|k}^{-1} Σ_{k,k−1|k}    (5.36)

Proof: To prove relation (5.35) for the filtered cross-covariance, we first note that

Σ_{k+1,k|k} = E{ (x_{k+1} − x̂_{k+1|k})(x_k − x̂_{k|k})^T }
            = E{ [F(x_k − x̂_{k|k}) + w_k](x_k − x̂_{k|k})^T } = F Σ_{k|k} .    (5.37)

Now, we can use the RTS recursion (5.16a) for the smoothed estimate at k−1 given observations up to time k:

x̂_{k−1|k} = x̂_{k−1|k−1} + A_k [x̂_{k|k} − x̂_{k|k−1}]    (5.38)

so that we get

Σ_{k,k−1|k} = E{ (x_k − x̂_{k|k})(x_{k−1} − x̂_{k−1|k−1} − A_k [x̂_{k|k} − x̂_{k|k−1}])^T }
            = E{ (x_k − x̂_{k|k})(x_{k−1} − x̂_{k−1|k−1})^T }
            = E{ (x_k − x̂_{k|k−1})(x_{k−1} − x̂_{k−1|k−1})^T } − K_k E{ e_k (x_{k−1} − x̂_{k−1|k−1})^T } ,    (5.39)


where in the second equality we used the fact that the term in the brackets is orthogonal to the first factor, and thus we have

Σ_{k,k−1|k} = Σ_{k,k−1|k−1} − K_k H Σ_{k,k−1|k−1} ,    (5.40)

which, combined with (5.37), gives us (5.35). In order to prove (5.36), we first use the RTS recursion (5.16a) twice, for the smoothed estimate at k−1 given observations up to times N and k respectively:

x_{k−1} − x̂_{k−1|N} = x_{k−1} − x̂_{k−1|k−1} − Σ_{k,k−1|k−1}^T Σ_{k|k−1}^{-1} [x̂_{k|N} − x̂_{k|k−1}]    (5.41)
x̂_{k−1|k} = x̂_{k−1|k−1} + Σ_{k,k−1|k−1}^T Σ_{k|k−1}^{-1} [x̂_{k|k} − x̂_{k|k−1}]
          = x̂_{k−1|k−1} + Σ_{k,k−1|k−1}^T H^T Σ_{e_k}^{-1} e_k .    (5.42)

Now, combining (5.41) and (5.42), we can get

x_{k−1} − x̂_{k−1|N} = x_{k−1} − x̂_{k−1|k} + Σ_{k,k−1|k−1}^T H^T Σ_{e_k}^{-1} e_k − Σ_{k,k−1|k−1}^T Σ_{k|k−1}^{-1} [x̂_{k|N} − x̂_{k|k−1}]
 = x_{k−1} − x̂_{k−1|k} + Σ_{k,k−1|k−1}^T Σ_{k|k−1}^{-1} [x̂_{k|k−1} + Σ_{k|k−1} H^T Σ_{e_k}^{-1} e_k − x̂_{k|N}]
 = (x_{k−1} − x̂_{k−1|k}) + Σ_{k,k−1|k−1}^T Σ_{k|k−1}^{-1} (x̂_{k|k} − x̂_{k|N}) .    (5.43)

Also, notice that we can write

x_k − x̂_{k|N} = (x_k − x̂_{k|k}) − (x̂_{k|N} − x̂_{k|k})    (5.44)

and, since the second term on the right depends only on the innovations at k+1, …, N, we can deduce that

E{ (x_k − x̂_{k|N})(x_{k−1} − x̂_{k−1|N})^T } = E{ (x_k − x̂_{k|k})(x_{k−1} − x̂_{k−1|N})^T } − E{ (x̂_{k|N} − x̂_{k|k})(x_{k−1} − x̂_{k−1|N})^T }
 = E{ (x_k − x̂_{k|k})(x_{k−1} − x̂_{k−1|N})^T } .    (5.45)


Inserting (5.43) in (5.45), and observing that the second term on the right-hand side of (5.44) is also orthogonal to the left-hand side, we can get

Σ_{k,k−1|N} = Σ_{k,k−1|k} + [Σ_{k|N} − Σ_{k|k}] Σ_{k|k−1}^{-1} Σ_{k,k−1|k−1} .    (5.46)

Finally, using the Kalman filter identities K_k = Σ_{k|k} H^T R^{-1} and Σ_{k|k}^{-1} = Σ_{k|k−1}^{-1} + H^T R^{-1} H together with (5.40), we get

Σ_{k|k−1}^{-1} Σ_{k,k−1|k−1} = [Σ_{k|k}^{-1} − H^T R^{-1} H][Σ_{k,k−1|k} + K_k H Σ_{k,k−1|k−1}]
 = Σ_{k|k}^{-1} Σ_{k,k−1|k} + Σ_{k|k}^{-1} K_k H Σ_{k,k−1|k−1} − H^T R^{-1} H Σ_{k,k−1|k−1}
 = Σ_{k|k}^{-1} Σ_{k,k−1|k}    (5.47)

and replacing in (5.46) we obtain (5.36). □
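Proposition 5.2 can also be checked numerically in the scalar case by conditioning the joint Gaussian of the states on all observations directly and comparing with the recursions. The fragment below is our own sanity check, with arbitrary parameter values.

```python
import numpy as np

# Scalar check of (5.35)-(5.36): compare the filter/smoother recursions against
# the cross-covariance obtained by exact conditioning of the joint Gaussian.
F, H, P, R, S0 = 0.8, 1.0, 1.0, 0.5, 1.0
N = 5                                       # states x_0 ... x_{N-1}

# Exact joint covariance of the states: C[i,i] propagates through the dynamics,
# and C[i,j] = F^{j-i} C[i,i] for j >= i.
C = np.zeros((N, N))
C[0, 0] = S0
for k in range(1, N):
    C[k, k] = F * C[k - 1, k - 1] * F + P
for i in range(N):
    for j in range(i + 1, N):
        C[i, j] = C[j, i] = F ** (j - i) * C[i, i]
Cy = H * C * H + R * np.eye(N)              # Cov(y), with y_k = H x_k + v_k
Cxy = C * H                                 # Cov(x, y)
Cpost = C - Cxy @ np.linalg.inv(Cy) @ Cxy.T # Cov(x | y_0 ... y_{N-1})

# Filter + smoother recursions (5.15)-(5.16) for the same scalar model
Sp = np.zeros(N); Sf = np.zeros(N); Cf = np.zeros(N)
S_pr = S0
for k in range(N):
    Sp[k] = S_pr
    Se = H * S_pr * H + R
    K = S_pr * H / Se
    Sf[k] = S_pr - K * Se * K
    if k > 0:
        Cf[k] = (1.0 - K * H) * F * Sf[k - 1]   # (5.35) via (5.15g)
    S_pr = F * Sf[k] * F + P
Ss = Sf.copy(); Cs = Cf.copy()
for k in range(N - 1, 0, -1):
    A = Sf[k - 1] * F / Sp[k]                   # (5.16c)
    Ss[k - 1] = Sf[k - 1] + A * (Ss[k] - Sp[k]) * A
    Cs[k] = Cf[k] + (Ss[k] - Sf[k]) / Sf[k] * Cf[k]   # (5.36)
```

The smoothed variances should match the diagonal of Cpost, and the lag-one cross-covariances (5.36) its first subdiagonal.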


5.C The Forward-Backward Algorithm

In this appendix we shall develop the forward-backward algorithm recursions for the general acoustic model that we defined in Chapter 2. The derivation here is similar to the one used for HMMs [9], but since our notation is more abstract and Definition 2.1 includes segment-based models, the recursions presented here can be used for the training of these models as well. In the general training problem for an acoustic model, we want to obtain maximum-likelihood estimates of the parameters θ of the model based on the set of observations Z. If Q denotes the underlying mode sequence, the EM algorithm can be applied by treating Q as missing data and maximizing at each iteration

E{ log p_θ'(Z, Q) | Z } = ∑_Q p_θ(Q|Z) log p_θ'(Z, Q) .    (5.48)

If the mode sequence has the Markov property, then in order to calculate the summation above we must show how the probabilities p(q_k | Z) and p(q_k, q_{k−1} | Z) can be computed. The forward-backward algorithm computes these probabilities efficiently with a set of forward recursions followed by a set of backward recursions. We shall give the derivation here for p(q_k | Z); the recursions for the second probability can be derived similarly. We first define

α_k(q_k) = p(q_k, z_1, …, z_k) ,    (5.49)

so that we can write the following set of forward recursions for the α's:

α_k(q_k) = ∑_{q_{k−1}} p(q_k, q_{k−1}, z_1, …, z_k)
 = ∑_{q_{k−1}} p(q_{k−1}, z_1, …, z_{k−1}) p(q_k | q_{k−1}) p(z_k | q_k, z_1, …, z_{k−1})
 = [ ∑_{q_{k−1}} α_{k−1}(q_{k−1}) p(q_k | q_{k−1}) ] p(z_k | q_k, z_1, …, z_{k−1}) ,    (5.50)


where in the second step we used the identity

p(q_k | q_{k−1}, z_1, …, z_{k−1}) = p(q_k | q_{k−1}) ,

which can be derived using the Markov property of the mode sequence and the fundamental property of an acoustic model that the output distribution is conditionally independent of the future and past mode sequences given the current mode. Similarly, we can define

β_k(q_k) = p(z_{k+1}, …, z_N | q_k, z_1, …, z_k) ,    (5.51)

and write the following set of backward recursions for the β's:

β_k(q_k) = ∑_{q_{k+1}} p(q_{k+1}, z_{k+1}, …, z_N | q_k, z_1, …, z_k)
 = ∑_{q_{k+1}} p(q_{k+1} | q_k) p(z_{k+1} | q_{k+1}, z_1, …, z_k) p(z_{k+2}, …, z_N | q_{k+1}, z_1, …, z_{k+1})
 = ∑_{q_{k+1}} p(q_{k+1} | q_k) p(z_{k+1} | q_{k+1}, z_1, …, z_k) β_{k+1}(q_{k+1}) ,    (5.52)

where again in the second step we used the same identity for k = k + 1. Equations (5.50) and (5.52) were developed as generally as possible. In the models examined in this thesis we assume a segment is conditionally independent of other segments given the mode of the process:

$$p(z_{k+1} \mid q_{k+1}, z_1, \ldots, z_k) = p(z_{k+1} \mid q_{k+1}). \qquad (5.53)$$

Under this assumption, equations (5.50) and (5.52) become, respectively

$$\begin{aligned}
\alpha_k(q_k) &= \Big[ \sum_{q_{k-1}} \alpha_{k-1}(q_{k-1})\, p(q_k \mid q_{k-1}) \Big]\, p(z_k \mid q_k), \qquad (5.54) \\
\beta_k(q_k) &= \sum_{q_{k+1}} p(q_{k+1} \mid q_k)\, p(z_{k+1} \mid q_{k+1})\, \beta_{k+1}(q_{k+1}). \qquad (5.55)
\end{aligned}$$

In either case, we can combine the forward and backward recursions and compute the desired probability as

$$p(q_k \mid Z) = \frac{p(q_k, z_1, \ldots, z_N)}{p(z_1, \ldots, z_N)} = \frac{\alpha_k(q_k)\, \beta_k(q_k)}{\sum_{q_k} \alpha_k(q_k)\, \beta_k(q_k)}. \qquad (5.56)$$
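As a concrete illustration, the recursions (5.54)-(5.56) can be sketched for the special case of a finite mode space with time-invariant transition probabilities; the array layout and the log-space arrangement are our own choices, not the thesis implementation, which works at the level of the abstract densities:

```python
import numpy as np

def _logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along the given axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def forward_backward(log_A, log_B, log_pi):
    """Posterior mode probabilities p(q_k | Z) via (5.54)-(5.56).

    log_A[i, j] : log p(q_k = j | q_{k-1} = i)  (mode transitions)
    log_B[k, j] : log p(z_k | q_k = j)          (outputs, assumption (5.53))
    log_pi[j]   : log p(q_1 = j)                (initial mode probabilities)
    """
    N, M = log_B.shape
    log_alpha = np.empty((N, M))
    log_beta = np.empty((N, M))

    # Forward recursion (5.54): alpha_k(q) = [sum_q' alpha_{k-1}(q') p(q|q')] p(z_k|q)
    log_alpha[0] = log_pi + log_B[0]
    for k in range(1, N):
        log_alpha[k] = _logsumexp(log_alpha[k - 1][:, None] + log_A, axis=0) + log_B[k]

    # Backward recursion (5.55): beta_k(q) = sum_q' p(q'|q) p(z_{k+1}|q') beta_{k+1}(q')
    log_beta[N - 1] = 0.0
    for k in range(N - 2, -1, -1):
        log_beta[k] = _logsumexp(log_A + log_B[k + 1] + log_beta[k + 1], axis=1)

    # Combination (5.56): normalize alpha_k(q) beta_k(q) over the modes q.
    log_gamma = log_alpha + log_beta
    return np.exp(log_gamma - _logsumexp(log_gamma, axis=1)[:, None])
```

The same pair of passes also yields the pairwise posteriors $p(q_k, q_{k-1} \mid Z)$ needed by the EM update, by combining $\alpha_{k-1}$, the transition term, the output term, and $\beta_k$ before normalizing.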


Chapter 6

Search Algorithms for Segment-based Models

The introduction of an acoustic model serves as the vehicle for the definition of the recognition mapping. This mapping, from observation sequences to legal symbol strings, was defined in Chapter 2 for the general case of an acoustic model by

$$\phi : Z \to \mathcal{M}, \qquad \phi(Z) = \arg\max_{A} \sum_{Q : \ell(Q) = A} p(Z, Q). \qquad (6.1)$$

This maximization is the recognition problem in continuous speech recognition. Algorithms for solving (6.1) are usually referred to as search algorithms. The forward set of recursions of the forward-backward algorithm can be used to compute the summation in (6.1). However, since this computation must be repeated for every legal symbol string $A$ to solve the maximization, a different rule is often used for recognition. Instead of searching for the symbol sequence that maximizes the likelihood above, one can jointly search for the pair of symbol and mode sequences that maximizes the likelihood of the observations and the mode sequence. In this case, one effectively uses the mapping

$$\phi' : Z \to \mathcal{M}, \qquad \phi'(Z) = \arg\max_{A} \max_{Q : \ell(Q) = A} p(Z, Q), \qquad (6.2)$$


which can be considered as the composition of first finding the most likely mode sequence and then using the decoding function to determine the symbol sequence:

$$\phi' = \ell \circ \psi, \qquad \psi : Z \to \mathcal{Q}^N, \quad \psi(Z) = \arg\max_{Q} p(Z, Q). \qquad (6.3)$$

This modified recognition problem is usually solved efficiently with a Dynamic Programming (DP) [13] recursion, under appropriate assumptions on the mode dynamics. This type of search is usually referred to in CSR as Viterbi search, because the problem can also be interpreted as a decoding problem in a communication-theory setting. Depending on the type of recognition performed and the specific structure of the models that are used, it is also possible to use hybrid search schemes. These schemes effectively decompose the recognition mapping into more levels, and combine various types of searches at the different levels. For example, the BBN BYBLOS recognition system [24] uses a Viterbi search at the word level and a full search within words.

When Viterbi decoding is used to solve the modified recognition problem (6.3), the complexity of the search is directly related to:

1. the size of the mode space, and
2. the cost-function evaluation at each iteration of the DP recursion.

Under the HMM methodology, Viterbi decoding is an efficient search method. The mode space in HMM-based recognition systems has a moderate size (usually a few thousand states) and, if discrete output distributions are used, the evaluation of the cost function is a simple table look-up. The recognition problem for segment-based models is much more expensive to solve using Viterbi decoding. First, the mode space is much larger than for HMMs, since it consists of both a segmentation component and a model component, as we saw in Section 2.1.2. Second, segment-based models like the SSM and the DS segment model are explicit models of the intrasegmental statistical dependencies and use continuous densities, which makes the cost-function evaluation particularly expensive.

Computational complexity has been considered the largest deficiency of segment-based models. For the independent-frame SSM, as we shall see in this chapter, the complexity in terms of Gaussian likelihood evaluations is proportional to the length of the observation sequence and the number of different distributions in the models. However, if correlation modeling is included, as in the DS segment model, the complexity becomes proportional to the square of the length of the observed sequence. A DP search is not feasible using the currently available hardware. It is therefore important to investigate alternatives to the typical Viterbi decoding.

In this chapter we present methods for reducing the computational requirements of recognition using segment-based models. Our approach to the problem is twofold. First, in Section 6.1, we present a fast segment classification method for Gaussian segment models that reduces computation by a factor of two to four, depending on the confidence of choosing the most probable model. Second, in Section 6.2, we address the issue of joint segmentation and recognition. After we examine the computational complexity of the typical Viterbi decoding as applied to recognition with segment-based models, we look at an alternative approach to this problem based on local search algorithms. In Section 6.3 we present an analysis of local search algorithms for the joint segmentation and recognition problem, and then propose a different search strategy that reduces the computation significantly. This new algorithm, presented in Section 6.4, is based on the split-and-merge [72, 39] segmentation algorithm and can be used in conjunction with the classification algorithm.
Although in theory the new algorithm may not find the global optimum, our experimental results, presented in Section 6.5, show that the recognition performance of the fast algorithm is very close to that of the optimal search. In Section 6.6 we discuss how the algorithm can be extended to word recognition,


and in Section 6.7 we summarize the results of this chapter.

6.1 Fast Segment Classification

In this section we present a fast algorithm for classifying the observations $z = [z_1\, z_2\, \ldots\, z_l]$ over an observed segment $s$ of speech. We assume that the segment density $p(z \mid \alpha, l)$ is Gaussian, and therefore the method can be used for both the SSM and the DS segment model. The MAP detection rule for segment classification reduces to

$$\alpha^* = \arg\max_{\alpha}\, p(z \mid \alpha, l)\, p(l \mid \alpha)\, p(\alpha), \qquad (6.4)$$

where the $p(l \mid \alpha)$ are the duration probabilities. Using a multivariate Gaussian distribution for the term $p(z \mid \alpha, l)$ makes the calculation of the log-likelihoods

$$L(\alpha) := \ln p(\alpha) + \ln p(l \mid \alpha) + \ln p(z \mid \alpha, l)$$

computationally expensive. The fast segment classification algorithm presented here is based on obtaining an upper bound $\bar{L}_k(\alpha)$ on $L(\alpha)$ and eliminating candidate phones based on this bound. A similar idea for HMMs has been used by Bahl et al. [8]. In our scheme the computation can be done recursively. We first notice that we can bound the value of $L(\alpha)$ using a sequence $\bar{z}_k$ defined as

$$\bar{z}_k = [z_1\, z_2\, \ldots\, z_k\, \hat{z}^{(\alpha)}_{k+1|k}\, \ldots\, \hat{z}^{(\alpha)}_{l|k}], \qquad (6.5)$$

where

$$\hat{z}^{(\alpha)}_{j|k} = E\{z_j \mid \alpha, l, z_1, \ldots, z_k\}, \quad j > k, \qquad (6.6)$$

is the conditional mean at time $j$ given the observations $z_1, \ldots, z_k$ and the segment length $l$, for each phone $\alpha$. Then, because of the Gaussian assumption,

$$\begin{aligned}
L(\alpha) &= \ln p(\alpha) + \ln p(l \mid \alpha) + \ln p(z \mid \alpha, l) \\
&\le \ln p(\alpha) + \ln p(l \mid \alpha) + \ln p(\bar{z}_k \mid \alpha, l) =: \bar{L}_k(\alpha), \qquad k = 1, 2, \ldots, l. \qquad (6.7)
\end{aligned}$$

Chapter 6. Recognition

107

Notice that the bounds get tighter as $k$ increases: $\bar{L}_{k+1}(\alpha) \le \bar{L}_k(\alpha)$. Using this bound on $L(\alpha)$, we can construct a fast classification scheme for a Gaussian segment-based model. In the following algorithm, let $C$ denote the set of remaining candidate phones and $A$ the set of all phones.

Fast Segment Classification

Initialization: Set $C = A$, $k = 1$.
Step 1: For all $\alpha \in C$, compute $\bar{L}_k(\alpha)$.
Step 2: Pick $\alpha^* = \arg\max_{\alpha \in C} \bar{L}_k(\alpha)$.
Step 3: Compute the actual likelihood $L(\alpha^*)$.
Step 4: For all $\alpha \in C$, if $\bar{L}_k(\alpha) < L(\alpha^*)$, remove $\alpha$ from $C$.
Step 5: Increment $k$. If $|C| > 1$ and $k \le l$, go to Step 1; else classify to $\alpha^*$.

The computation of the bounds $\bar{L}_k(\alpha)$ in Step 1 of iteration $k$ can be done recursively:

$$\bar{L}_k(\alpha) = \bar{L}_{k-1}(\alpha) + \ln p(\bar{z}_k \mid \alpha, l) - \ln p(\bar{z}_{k-1} \mid \alpha, l) = \bar{L}_{k-1}(\alpha) - \tfrac{1}{2}\, d\big(z_k, \hat{z}^{(\alpha)}_{k|k-1}\big), \qquad (6.8)$$

where $d(z_k, \hat{z}^{(\alpha)}_{k|k-1})$ denotes the Mahalanobis distance with the conditional covariance of $z_k$ given the observations $z_1, \ldots, z_{k-1}$ and the segment length $l$. This computation is equivalent to the calculation of the conditional mean $\hat{z}^{(\alpha)}_{k|k-1}$ and a single $d$-dimensional distance evaluation for every remaining class at every iteration of the algorithm. For the independent-frame SSM (block-diagonal covariance matrix) the computation at each iteration is simpler and reduces to a single $d$-dimensional Mahalanobis score, where $d$ is the dimension of the feature vector. If no candidates are eliminated at any iteration, then the computation is equivalent to an $ld$-dimensional Gaussian score evaluation per model, the same amount of computation required by a "brute-force" scheme, in this case a full search. The savings in computation depend directly on the discriminant power of the feature vector, and can be increased further if the frame scores are computed in a permuted order, so that frames with the largest discriminant information (for example, the middle portion of a phone) are scored first. Specifically, we could replace the observed segment $z$ in the segment classification algorithm with a re-ordered version of it,

$$z' = zP = [z'_1\, z'_2\, \ldots\, z'_l], \qquad (6.9)$$

where $P$ is a block permutation matrix such that a measure of the ratio of between-class to within-class scatter is monotonically decreasing in the order in which frame scores are computed:

$$\frac{|S_B^{(k)}|}{|S_W^{(k)}|} \ge \frac{|S_B^{(k+1)}|}{|S_W^{(k+1)}|}, \qquad k = 1, \ldots, l - 1, \qquad (6.10)$$

where $S_B^{(k)}$ and $S_W^{(k)}$ are the between-class and within-class scatter matrices, respectively, obtained from the observations that map to the $k$-th segment frame in the training data.

The fast segment classification algorithm has the desirable property of being exact: at the end, the best candidate according to the MAP rule is obtained. On the other hand, the computational savings can be increased if, instead of pruning candidates based on the strict bound, we use the bound at each iteration to obtain a biased estimate of the true likelihood, $\hat{L}_k(\alpha)$, and use it in place of $\bar{L}_k(\alpha)$ in Step 4 of the algorithm. In the choice of the estimate there is a trade-off between computational savings and the confidence level of choosing the true best candidate. In our phonetic classification experiments, using a segment model with length $M = 5$ and assuming independent frames, we achieved a 50% computation reduction with the exact method. The savings increased to 75%, without any effect on classification accuracy, when pruning was based on the estimates of the likelihoods.
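For the independent-frame case, the pruning scheme above can be sketched as follows. The per-frame diagonal-covariance Gaussians, the dictionary layout, and the helper names are our own simplifications, not the thesis implementation; the bound is initialized with every unexplored frame scored at its density peak and is then tightened one frame at a time via (6.8):

```python
import numpy as np

def exact_ll(z, mean, var, log_prior):
    """Full log-likelihood L(a): prior and duration term plus all frame scores."""
    return log_prior - 0.5 * (np.log(2 * np.pi * var) + (z - mean) ** 2 / var).sum()

def fast_classify(z, models, log_prior):
    """MAP segment classification with frame-by-frame candidate pruning.

    z         : (l, d) observed segment
    models    : phone -> (mean, var), each (l, d); one Gaussian per frame position
    log_prior : phone -> log p(phone) + log p(l | phone)
    """
    l, _ = z.shape
    # Initial upper bound: each unexplored frame contributes the value of its
    # Gaussian density at its own mean (its peak), which can only shrink.
    bound = {a: log_prior[a] - 0.5 * np.log(2 * np.pi * models[a][1]).sum()
             for a in models}
    candidates = set(models)
    for k in range(l):
        for a in candidates:                       # Step 1: tighten bounds, eq. (6.8)
            mean, var = models[a]
            bound[a] -= 0.5 * (((z[k] - mean[k]) ** 2) / var[k]).sum()
        a_star = max(candidates, key=bound.get)    # Step 2: best current bound
        mean, var = models[a_star]
        L_star = exact_ll(z, mean, var, log_prior[a_star])      # Step 3
        candidates = {a for a in candidates if bound[a] >= L_star}  # Step 4
        if len(candidates) == 1:                   # Step 5
            break
    return max(candidates, key=bound.get)
```

Once all $l$ frames have been scored, the bound coincides with the exact log-likelihood, so the surviving argmax is the exact MAP answer; the savings come from the candidates discarded before that point.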

6.2 Joint Segmentation/Recognition

In connected-phone recognition we must determine the most likely phone sequence when the phone boundaries are unknown. The classification problem of the previous section is a part of this more general automatic recognition problem, and the algorithm presented there will prove useful for improving the speed of the recognizer. In the modified recognition problem, we jointly select the most likely mode and symbol sequences given the observations. For segment-based models, the MAP rule is to jointly select the segmentation $S$ and the phone sequence $A$ that maximize the a posteriori likelihood

$$(A, S, n)^* = \arg\max_{(A, S, n)} \big\{ p(Z, S \mid A)\, p(A) \big\}. \qquad (6.11)$$

We shall thus refer to the automatic recognition problem as the Joint Segmentation and Recognition (JSR) problem. In Section 2.1.2 we represented a segmentation as a set of "connected" ordered pairs that span the entire sentence. A segmentation could also be represented as a binary vector in an $(N - 1)$-dimensional space; assuming that the beginning of a segment is marked with a 1, the weight of this vector would correspond to the number of segments. Such a representation would be useful in developing simulated-annealing types of algorithms [47] or other forms of stochastic iteration for solving the JSR problem. It is also more convenient for answering questions about the computational complexity of the search algorithms that we shall examine. However, we chose the former representation to facilitate the analysis in the following sections.

Before we present solutions to the combinatorial optimization problem at hand, it is useful to briefly examine its size. If we assume that there is no upper bound on the range of allowable phone durations, there are $2^{N-1}$ possible segmentations, where $N$ is the length of the observed sequence. If the range of phone durations is restricted to $1 \le l \le D$, the number of configurations among which we optimize drops to $2^{N-1} - 2^{N-D} + 1$.

A Dynamic Programming solution for the JSR problem was originally given in [67]. We shall assume here that the mode sequence consists of independent random variables. This is equivalent to assuming that (i) the duration of a particular segment does not depend on the durations or the phone labels of the previous segments, and (ii) the phone labels are independent, i.e., we essentially use unigram phone probabilities. We can easily remove either assumption, and in following sections we give the solution for the case of modeling the phone sequence as first-order Markov (e.g., using bigram probabilities). For simplicity, we give the solution for the independent case here. We can show that JSR is a shortest-path problem, and therefore has a solution using the following DP recursion [67]:

$$J_0 = 0, \qquad J_t = \max_{\tau, \alpha} \Big\{ J_{\tau - 1} + \ln p\big(z(\tau, t) \mid \alpha, l\big) + \ln p(l \mid \alpha) + \ln p(\alpha) + C \Big\}, \qquad (6.12)$$

where $l = t - \tau + 1$ and $C$ can be considered a bias on the duration probabilities that effectively controls the rate of insertion errors. At the last frame $N$ we obtain the solution

$$J_N = \max_{(A, S, n)} \Big\{ \ln\big[ p(Z, S \mid A)\, p(A) \big] + nC \Big\}, \qquad (6.13)$$

which for $C = 0$ is the solution to (6.11).

In measuring the complexity of a joint segmentation and recognition search algorithm, two quantities are important. The first factor is the number of segment classifications:

$$c_{\tau, t} = \max_{\alpha} \Big\{ \ln p\big(z(\tau, t) \mid \alpha, l\big) + \ln p(l \mid \alpha) + \ln p(\alpha) \Big\}. \qquad (6.14)$$

Clearly, a segment classification is dominated by the first term; its computational complexity was discussed in Section 6.1. The second factor is the number of $d$-dimensional Gaussian score evaluations on feature vectors. The DP solution is efficient, with a complexity of $O(N^2)$ segment classifications, which drops to $O(D \cdot N)$ if the segment length is restricted to be less than $D$. However, in terms of Gaussian score evaluations this approach is computationally expensive. If we assume that feature vectors are independent in time within a phone model, then for each frame in the sentence the scores of all models and all possible positions within a segment will effectively be computed and stored. Using caching, a DP search has a complexity of $O(M \cdot N \cdot |A|)$ $d$-dimensional Gaussian evaluations, or simply $M$ Gaussian evaluations per frame per model, where $M$ is the model length; this is the same complexity as a Gaussian HMM with $M$ states per model. The complexity increases when the independence assumption is dropped, in which case the number of Gaussian scores that must be computed grows drastically. For the dynamical system segment model, for every observation $z_t$ in the sentence and every phone model, we must compute one Gaussian score for every possible segment $(t_1, t_2)$ that includes the frame at time $t$, where $t_1 \le t \le t_2$. The reason is that the actual covariance used for the innovation at this frame depends on the whole sequence of system matrices used since the beginning of the segment, which in turn is a function of the begin and end times of the segment. Thus, for the DS or other models that include correlation modeling, the DP search complexity is $O(D^2 \cdot N \cdot |A|)$ $d$-dimensional Gaussian evaluations. For large $d$, the DP solution is impractical. We therefore need to investigate alternative search algorithms for segment-based models.
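The recursion (6.12) and its backtrace can be written down directly. In the sketch below, `seg_score(tau, t)` stands for the precomputed weight $c_{\tau,t}$ of (6.14) (in practice itself the output of a segment classification) and `D` is the maximum allowed duration; the function and argument names are ours:

```python
def jsr_dp(N, seg_score, D, C=0.0):
    """DP solution (6.12)-(6.13) of joint segmentation/recognition with
    independent phones.  Frames are numbered 1..N.

    seg_score(tau, t) : max over phones of
                        ln p(z(tau,t)|a,l) + ln p(l|a) + ln p(a), eq. (6.14)
    D                 : maximum allowed segment duration
    C                 : insertion bias on the duration probabilities
    Returns (J_N, optimal segmentation as a list of (tau, t) pairs)."""
    NEG = float("-inf")
    J = [NEG] * (N + 1)
    J[0] = 0.0
    back = [0] * (N + 1)          # start frame of the best last segment ending at t
    for t in range(1, N + 1):
        for tau in range(max(1, t - D + 1), t + 1):
            s = J[tau - 1] + seg_score(tau, t) + C
            if s > J[t]:
                J[t], back[t] = s, tau
    # Trace back the optimal segmentation.
    segs, t = [], N
    while t > 0:
        segs.append((back[t], t))
        t = back[t] - 1
    return J[N], segs[::-1]
```

The double loop makes the $O(D \cdot N)$ segment-classification count explicit; every call to `seg_score` is one entry $c_{\tau,t}$, which is exactly where the Gaussian evaluations discussed above are spent.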


6.3 Local Search Algorithms: Theory

In this section we present an analysis of local search algorithms to provide a perspective on our approach. A local search [68] is an iterative optimization scheme: at each iteration, a subset of all possible configurations that depends on the currently hypothesized optimum is searched, and if a configuration is found that improves the score under the weight function, it becomes the hypothesis for the next iteration. The main result of this section is the necessary and sufficient conditions for a local search algorithm to find the global optimum of the JSR problem. We describe the exact neighborhood that must be searched at each iteration of a local search algorithm so that convergence to the global optimum is assured. This result provides insight into how to design effective search algorithms; the reader interested only in the algorithm, and not in the theory behind it, may proceed directly to Section 6.4.

6.3.1 Analysis of local search for the JSR problem

Here we follow closely the analysis of general local search algorithms in [68]. In the past, local search algorithms have been used as heuristic solutions to difficult combinatorial optimization problems that were either NP-complete or had large search spaces. For the joint segmentation and recognition problem there exists a polynomial-time solution. However, in our case the critical factor is the cost calculation (6.14), and we would like to perform as few of these calculations as possible. Since local search algorithms are heuristics that involve a small portion of the search space at each iteration, we can use such an algorithm for the JSR problem and avoid a large number of cost evaluations. We first show that the JSR problem, being a shortest-path problem, belongs to the broader class of discrete linear subset (DLS) problems. A DLS is a combinatorial optimization problem defined as follows:


Definition 6.1 (Discrete Linear Subset problem) [68] Let $\mathcal{N} = \{1, \ldots, \nu\}$ and let $T$ be a set of subsets of $\mathcal{N}$, $T \subseteq 2^{\mathcal{N}}$ (where $2^{\mathcal{N}}$ is the power set of $\mathcal{N}$), with the property that no set in $T$ is properly contained in another. A DLS problem is the combinatorial optimization problem with feasible set $T$ and score

$$c(S) = \sum_{i \in S} c_i, \qquad S \in T, \qquad (6.15)$$

with $[c_1, \ldots, c_\nu]$ the weight vector.

Proposition 6.2 The joint segmentation and recognition problem is a discrete linear subset problem.

The proof of this claim is by construction. Let $\mathcal{N}$ be the set of allowable segments:

$$\mathcal{N} = \big\{ (1,1), (1,2), \ldots, (1,N), (2,2), \ldots, (2,N), \ldots, (N,N) \big\}, \qquad (6.16)$$

where $\|\mathcal{N}\| = \frac{N(N+1)}{2} = \nu$. A feasible segmentation was defined in (2.5) as an unbroken sequence of allowable segments that spans the entire sentence. The set of all feasible segmentations is then a subset of the power set of $\mathcal{N}$, $T \subseteq 2^{\mathcal{N}}$:

$$T = \Big\{ S \,\Big|\, S = \{(1, \tau_1), (\tau_1 + 1, \tau_2), \ldots, (\tau_{n-1} + 1, N)\}, \; 0 < n \le N \Big\}. \qquad (6.17)$$

Clearly, no set (segmentation) in $T$ is properly contained in another, since we constrain the first segment to begin at time 1 and the last segment to end at time $N$. Therefore, the joint segmentation and recognition (JSR) problem is a DLS problem, since we maximize

$$L(S) = \sum_{(\tau, t) \in S} c_{\tau, t} \qquad (6.18)$$


over all sets $S$ in $T$, where the weights $c_{\tau,t}$ are defined in (6.14). $\Box$

Let us also define the neighborhood of a segmentation $S$ through a mapping from $T$ to the power set of $T$:

$$\mathcal{N} : T \to 2^T. \qquad (6.19)$$

A local search algorithm for the JSR problem is then defined as follows. Given any segmentation $S$, the neighborhood $\mathcal{N}(S)$ is searched for a segmentation $S'$ with $L(S') > L(S)$. If such an $S'$ exists in $\mathcal{N}(S)$, it replaces $S$ and the search continues; otherwise, a local optimum with respect to the given neighborhood has been found. The choice of the neighborhood is critical, as discussed in [68]: a large $\mathcal{N}(S)$ helps avoid local optima, but a small $\mathcal{N}(S)$ can be searched more efficiently.

An important question for any local search algorithm is the size of the minimal exact local search neighborhood. That is, given any configuration $S$, what is the smallest possible neighborhood that must be searched at that step so that convergence to the global optimum is guaranteed after a finite number of steps? For the JSR problem, we shall show that a segmentation $S'$ is in the minimal exact search neighborhood of $S$ if the two segmentations have no common boundaries within a single part of the sentence, except the first and last frames of that part, and the sets of boundaries before and after those two frames are the same. To give a formal definition of the minimal exact neighborhood and a proof of this statement, we use the fact that JSR is a DLS problem. In [68] it is shown that any DLS problem is equivalent to a linear programming problem, and therefore the minimal exact neighborhood is the set of all adjacent vertices on the corresponding polytope of the linear program. Two vertices $S, S'$ are adjacent if there exists a set of costs $\{c_i\}$ such that $S$ is the unique optimum and $S'$ is the second-best configuration. To better understand the notions of adjacency and minimal exact search neighborhood, let us assume that $S^*$ is the true optimum, $S$ is our current configuration, and $S, S^*$ are not adjacent. Then, even if at the current step we do not evaluate the cost at $S^*$ (we do not include $S^*$ in our search space), we are guaranteed not to falsely decide that $S$ is the best configuration as long as we search the minimal exact search neighborhood of $S$: since $S$ and $S^*$ are not adjacent, there must be a configuration $S'$ with a cost better than that of $S$ and worse than that of $S^*$.

Definition 6.3 Two segmentations $S, S'$ form exactly one loop if

$$S(\tau_1, \tau_2) = \big\{ (\tau_1, t_1), (t_1 + 1, t_2), \ldots, (t_\mu + 1, \tau_2) \big\} \subseteq S, \qquad (6.20)$$

$$S'(\tau_1, \tau_2) = \big\{ (\tau_1, t'_1), (t'_1 + 1, t'_2), \ldots, (t'_{\mu'} + 1, \tau_2) \big\} \subseteq S', \qquad (6.21)$$

and

$$t_i \ne t'_j \;\; \forall\, i, j, \qquad S - S(\tau_1, \tau_2) = S' - S'(\tau_1, \tau_2). \qquad (6.22)$$

Proposition 6.4 The minimal exact neighborhood of $S$, $\mathcal{N}_{exact}(S)$, consists of all segmentations $S'$ that form exactly one loop with $S$.

A proof of this result is given in Appendix 6.A. The notion of the minimal exact neighborhood is useful in determining the effectiveness of a local search algorithm. The greater the portion of $\mathcal{N}_{exact}(S)$ that a local search algorithm searches at each step, the more chances it has to escape local optima. In addition, since the minimal exact neighborhood is different at every point $S$, determining the size of $\mathcal{N}_{exact}(S)$ is important and helps in designing more effective search algorithms through intelligent choice of the starting configuration and the neighborhood $\mathcal{N}(S)$. Since we have described the minimal exact search neighborhood for the JSR problem, we can now calculate its size. Let us assume that a given segmentation $S$ consists of $n$ segments, with lengths

$$\lambda_1 + \lambda_2 + \cdots + \lambda_n = N. \qquad (6.23)$$


Then a different segmentation $S'$ can form exactly one loop with $S$ over a single segment, over two consecutive segments, and so forth, up to the whole sentence. We shall refer to $S'$ as a $k$-segment neighbor of $S$ if the loop is formed over $k$ segments. Using the binary-vector representation of a segmentation, it is easy to see that the number of single-segment neighbors of $S$ is

$$\nu_1(S) = 2^{\lambda_1 - 1} + 2^{\lambda_2 - 1} + \cdots + 2^{\lambda_n - 1} - n. \qquad (6.24)$$

Similarly, the number of two-segment neighbors is given by

$$\nu_2(S) = 2^{\lambda_1 + \lambda_2 - 2} + 2^{\lambda_2 + \lambda_3 - 2} + \cdots + 2^{\lambda_{n-1} + \lambda_n - 2}, \qquad (6.25)$$

and finally the number of $n$-segment neighbors is

$$\nu_n(S) = 2^{\lambda_1 + \lambda_2 + \cdots + \lambda_n - n} = 2^{N - n}. \qquad (6.26)$$

The total size of the minimal exact neighborhood is then

$$\nu(S) = \|\mathcal{N}_{exact}(S)\| = \nu_1(S) + \nu_2(S) + \cdots + \nu_n(S). \qquad (6.27)$$

We see that the exact neighborhood size is in the worst case exponential in $N$. It depends on the number of segments, and is larger for segmentations with longer segments. An explicit search of this neighborhood is of course infeasible, and furthermore we already have a solution with a complexity of $O(N^2)$ segment classifications. In Section 6.4 we propose a strategy that searches over a subset of $\mathcal{N}_{exact}(S)$ whose size is proportional to the number of segments only.
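The counts (6.24)-(6.27) can be cross-checked numerically. The small routine below (the function name is ours) enumerates every window of $k$ consecutive segments and counts the re-segmentations of that stretch that avoid all of its interior boundaries, which is exactly the one-loop condition of Definition 6.3:

```python
def exact_neighborhood_size(lengths):
    """Size (6.27) of the minimal exact neighborhood N_exact(S) for a
    segmentation S with the given segment lengths (eq. 6.23).

    A k-segment neighbor re-segments a window of k consecutive segments
    while avoiding all k-1 interior boundaries of that window, giving
    2^(L-k) choices over L frames; for k = 1 the unchanged segmentation
    itself must also be excluded (eqs. 6.24-6.26)."""
    n = len(lengths)
    total = 0
    for i in range(n):                   # window start (in segments)
        for k in range(1, n - i + 1):    # window length (in segments)
            L = sum(lengths[i:i + k])
            if k == 1:
                total += 2 ** (L - 1) - 1
            else:
                total += 2 ** (L - k)
    return total
```

For a segmentation of all single-frame segments every window contributes exactly one neighbor, so the total is the $O(N^2)$ mentioned in the text, while a single-segment segmentation yields $2^{N-1} - 1$, i.e., every other segmentation.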

6.3.2 Local search strategies

Two common local search strategies are splitting and merging schemes. A splitting scheme starts with an initial segmentation that consists of a single segment, $S = \{(1, N)\}$, and moves towards a finer segmentation by searching at each point a subset of the exact neighborhood that consists of all possible splits of all


segments $(i, j)$ of the current segmentation. We can define the corresponding split neighborhood as

$$\mathcal{N}_s(S) = \Big\{ S' \,\Big|\, S' = \big(S - \{(i,j)\}\big) \cup \{(i,k), (k+1,j)\}, \;\; \forall\, i \le k < j, \; (i,j) \in S \Big\}. \qquad (6.28)$$

The size of this search neighborhood is $O(N)$. We can also restrict the splits to a fixed position within each segment, giving a search neighborhood whose size is $O(n)$, the number of segments in the sentence:

$$\mathcal{N}_s(S) = \Big\{ S' \,\Big|\, S' = \big(S - \{(i,j)\}\big) \cup \{(i,k), (k+1,j)\}, \;\; k = f(i,j), \; \forall\, (i,j) \in S \Big\}. \qquad (6.29)$$

The opposite strategy, merging, starts with an initial segmentation of one-frame-long segments and searches over all configurations in which a pair of consecutive segments of the current segmentation is merged into a single segment. The corresponding merge neighborhood is defined as

$$\mathcal{N}_m(S) = \Big\{ S' \,\Big|\, S' = \big(S - \{(i,j), (j+1,k)\}\big) \cup \{(i,k)\}, \;\; \forall\, (i,j), (j+1,k) \in S \Big\}. \qquad (6.30)$$

The latter scheme has in general a smaller search neighborhood, $O(n)$. We can argue, though, that this type of search is more effective for the JSR problem: the size of the minimal exact neighborhood of a segmentation that consists of single-frame segments is $O(N^2)$, whereas the minimal exact neighborhood of a segmentation with a single segment contains all other segmentations and has size $2^{N-1} - 1$. Since the search neighborhoods of both the splitting and merging schemes have approximately the same size, it is much easier to fall into local optima using the splitting method, because we are searching a smaller portion of the exact neighborhood. As an example from speech recognition, the search strategy followed in the MIT SUMMIT system for finding a dendrogram [92] is of the merging form; that method is a merging scheme that tries to minimize a measure of spectral dissimilarity and constrain the segmentation space for the final search. Similar search methods


have also been used for image segmentation and are referred to as region growing techniques [74].

6.4 A Split-and-Merge Search Algorithm

A split-and-merge method originally appeared in the segmentation of plane curves [72] and also in image segmentation [39], as an iterative scheme that tried to satisfy a certain deterministic criterion. Here, however, we introduce a variation of the method as a special case of local search algorithms and use it to solve a combinatorial optimization problem; the deterministic criterion is replaced by maximization of the likelihood score (6.18). Convergence of the algorithm to the global optimum depends on the selection of the basic unit of speech and the accuracy of the statistical representation of this unit. In particular, segmental features should prove extremely helpful to the split-and-merge algorithm. We present the algorithm as a local search for segment-based models. However, the only "segmental" characteristic of the algorithm is that the basic cost evaluation is a segment classification. Thus, the algorithm can also be used with HMMs, combined with a full or Viterbi search within each phonetic segment.

6.4.1 Basic Algorithm

The split-and-merge approach has a search neighborhood that is the union of the neighborhoods of the splitting and merging methods discussed previously. The advantage of this method is that the larger neighborhood size makes it harder to fall into local optima. Specifically, the basic split-and-merge algorithm consists of a local search over all segmentations in the union

$$\mathcal{N}_{sm}(S) = \mathcal{N}_s(S) \cup \mathcal{N}_m(S), \qquad (6.31)$$


where $\mathcal{N}_s(S)$ and $\mathcal{N}_m(S)$ are defined in (6.29) and (6.30), respectively, and the splitting frame $f(i, j)$ is the midpoint between frames $i$ and $j$. At each iteration we can either split a segment in half (at the middle frame) or merge it with an adjacent segment. The basic split-and-merge neighborhood is depicted graphically in Figure 6.1 and consists of neighbors 1 and 2. In a similar fashion, we can extend the search to neighbors over three or more consecutive segments whose boundaries lie in a restricted set. As we shall see, though, we found that extending the search beyond two-segment neighbors was unnecessary. Furthermore, we choose to follow a "steepest ascent" search strategy rather than a "first improvement" one. Specifically, given that the segmentation at iteration $k$ is $S_k$, we replace it with the segmentation $S_{k+1}$,

$$S_{k+1} = \arg\max_{S \in \mathcal{N}_{sm}(S_k)} L(S), \qquad (6.32)$$

if it improves the current score, L(Sk+1) > L(Sk ). An overview of the algorithm is presented below:


Figure 6.1: Split-and-Merge segmentation neighbors


Split-and-Merge Algorithm

Initialization: Let $S_0$ be the initial segmentation. Set $k = 0$, oldscore $= L(S_0)$.
Step 1: (Iteration $k$) Set improvement-found = FALSE.
Step 2: $S^* = \arg\max_{S \in \mathcal{N}_{sm}(S_k)} L(S)$; newscore $= L(S^*)$.
Step 3: If newscore $\ge$ oldscore, set $S_{k+1} = S^*$, oldscore = newscore, improvement-found = TRUE.
Step 4: If improvement-found, increment $k$ and go to Step 1; else stop.

Convergence in a finite number of steps is guaranteed, because at each step the likelihood can only increase and there is only a finite number of possible steps.
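A minimal steepest-ascent sketch of the baseline algorithm follows, with midpoint splits plus pairwise merges per (6.31)-(6.32). The score function plays the role of $c_{\tau,t}$ and all names are ours; the boundary-adjustment and combined-action refinements of Section 6.4.2 are omitted:

```python
def split_and_merge(score, S0):
    """Baseline steepest-ascent split-and-merge search (Section 6.4.1).

    score(tau, t) : segment weight c_{tau,t} of eq. (6.14)
    S0            : initial segmentation, a list of (tau, t) frame pairs
    At each iteration, the best single midpoint split or pairwise merge
    replaces the current segmentation if it improves the score (6.32)."""
    def L(S):
        return sum(score(a, b) for a, b in S)

    S, best = list(S0), L(S0)
    while True:
        moves = []
        for i, (a, b) in enumerate(S):
            if b > a:                            # split (a, b) at its midpoint
                m = (a + b) // 2
                moves.append(S[:i] + [(a, m), (m + 1, b)] + S[i + 1:])
            if i + 1 < len(S):                   # merge with the next segment
                moves.append(S[:i] + [(a, S[i + 1][1])] + S[i + 2:])
        if not moves:
            return S, best
        cand = max(moves, key=L)
        if L(cand) <= best:                      # no strict improvement: stop
            return S, best
        S, best = cand, L(cand)
```

Each iteration evaluates only $O(n)$ candidate segmentations, and each candidate differs from the current one in at most two segment scores, so in practice the per-iteration cost is a handful of segment classifications rather than the $O(N^2)$ of the full DP search.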

6.4.2 Improvements to the basic algorithm

An important factor in the performance of the split-and-merge algorithm is the choice of the initial segmentation. In experiments with an independent-frame model, when the initial segmentation was uniform with the average segment length, the performance of the algorithm was below that of the optimal DP search, with a 14% increase in the error rate. Starting from non-uniform initial segmentations based on spectral dissimilarities did not improve performance; the only significant gain came when the initial segmentation had very short segments, in accordance


with the discussion in Section 6.3.2.

To improve the performance of the baseline split-and-merge algorithm, we also considered several other modifications designed to obtain a more powerful search neighborhood and avoid local optima. We found it useful to change the splitting scheme to a two-step process: first, the decision for a split at the middle point is taken, and then the boundary between the two children segments is adjusted by examining the likelihood ratio of the last and first distributions of the corresponding models. The boundary is finally moved to the new frame if this action increases the likelihood. This method does not actually enlarge the search neighborhood, because the boundary adjustment is performed only if the decision for a split was originally taken; the increase in computation is therefore small. A second method that helped avoid local optima was the use of combined actions. In addition to single splits or merges, we search over segmentations produced by splitting a segment and merging the first or second half with the previous or next segment, respectively. In essence, this expands the search neighborhood by including neighbors 3 and 4 in Figure 6.1.

A different approach that can serve the same purpose of escaping local optima is to include random or negative steps at the end. Stochastic optimization methods have lately been used in many diverse fields [47]. Motivated by this, we considered the following approach. After the baseline algorithm converges to a segmentation $S$, every $S'$ in the search neighborhood of $S$ has a score $L(S') < L(S)$. In the stochastic form of the algorithm we do not stop at this point, but instead allow the possibility of taking negative steps with some probability, using an annealing schedule [47]. Of course, a sequence of such randomly taken negative steps should be followed by another run of the baseline split-and-merge algorithm.
It is argued in [16] that taking these negative steps randomly for a combinatorial (or discrete-state) problem may not be as useful as taking them deterministically, and we verified this claim in our problem. A similar strategy of allowing "bad" moves


in order to avoid local optima is referred to in [68] as "variable-depth" search. In our implementation, after convergence of the baseline algorithm to S, we continue for a certain number of additional iterations by taking the best action. The best action is the one that gives the biggest increase in probability or, when the segmentation is locally optimal, the least decrease. In this way, we obtain a sequence of segmentations, from which we select the configuration with the largest likelihood; if it differs from S, we repeat the whole procedure. We found that, for variable-depth search to work effectively, a large number of steps was required, so that all possible negative actions are taken at some point. This additional computation was not justified, since the performance of the basic algorithm with the improvements of boundary adjustment and combined actions was already very close to the optimum (see Section 6.5).
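The variable-depth idea can be sketched as a generic local search: a greedy ascent to a local optimum, followed by a fixed number of best-available (possibly negative) steps, keeping the best configuration seen. The toy score landscape and neighbor function below are invented for illustration, and the outer loop that restarts the procedure when a better configuration is found is omitted.

```python
def variable_depth_search(start, neighbors, score, extra_steps=20):
    """Greedy local search followed by a variable-depth phase: after reaching
    a local optimum, keep taking the best available action even when it
    decreases the score, and return the best configuration seen."""
    current = start
    improved = True
    while improved:                      # Phase 1: ordinary greedy ascent
        improved = False
        for cand in neighbors(current):
            if score(cand) > score(current):
                current, improved = cand, True
                break
    best = current
    for _ in range(extra_steps):         # Phase 2: take the least-bad action
        cands = neighbors(current)
        if not cands:
            break
        current = max(cands, key=score)
        if score(current) > score(best):
            best = current
    return best

# Toy landscape with a local maximum at x = 1 and the global one at x = 4.
values = [1, 3, 2, 4, 10]
result = variable_depth_search(0,
                               lambda x: [y for y in (x - 1, x + 1) if 0 <= y < 5],
                               lambda x: values[x])
print(result)
```

Greedy ascent alone stops at the local maximum; the variable-depth phase walks through the intervening dip and finds the global one.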

6.4.3 Constrained searches

In this section, we describe how the Split-and-Merge search can be constrained when the phones are not independent. So far, the phone-independence assumption allowed us to associate the most likely candidate with each segment and then perform a search over all possible segmentations. When this assumption is not valid, the cost c_{τ,t} associated with the segment (τ, t) is not unique over all possible segmentations that include this segment, so the search must be performed jointly over all possible segmentations and phone sequences. A sequence of phones is not independent when, for example, a grammar is imposed on the sequence of phones, as in word recognition, or when a Markov structure is assumed (e.g., bigram probabilities). In the latter case, the likelihood of a phone sequence A is given by

p(A) = p(α_1) ∏_{i=2}^{n} p(α_i | α_{i−1}).    (6.33)

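As a concrete illustration of (6.33), the sketch below evaluates a phone-sequence likelihood in the log domain (sums of logs avoid underflow on long sentences). The phone inventory and probability tables are invented, not taken from the thesis.

```python
import math

# Hypothetical unigram and bigram tables over a tiny phone inventory.
unigram = {"d": 0.4, "aa": 0.3, "k": 0.3}
bigram = {("d", "aa"): 0.7, ("aa", "k"): 0.5}

def sequence_log_likelihood(phones):
    """log p(A) = log p(a_1) + sum_i log p(a_i | a_{i-1}), as in (6.33)."""
    ll = math.log(unigram[phones[0]])
    for prev, cur in zip(phones, phones[1:]):
        ll += math.log(bigram[(prev, cur)])
    return ll

p = math.exp(sequence_log_likelihood(["d", "aa", "k"]))   # 0.4 * 0.7 * 0.5
print(round(p, 3))
```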
Then, the costs in (6.14) should be modified to account for the transition probabilities. If a DP search is used in this case, it should be performed over all allowable


phone durations and all phone labels at each time. The state in (6.12) is no longer simply the ending time τ of a segment: the state space is augmented by the label of the previous phone, which means that the size of the search space is multiplied by the number of phone models. Similarly, to carry over the local search analysis of Section 6.3 to the dependent-phone case, we must redefine the set of all segments in (6.16) to be the Cartesian product of the set of all segments and the set of phone labels. The set of all segmentations (6.17) can then be redefined accordingly, and the analysis proceeds in a similar fashion.

Because the search space is much larger in the dependent-phone case, it is very important to develop a fast algorithm. However, it is also much harder to design an effective local search algorithm when phones are dependent. Because of practical limitations in computational resources, the size of the search neighborhood remains the same, whereas the minimal exact search neighborhood is much larger; therefore, it is much easier for the local search algorithm to get trapped in local optima.

For the dependent-phone case we have developed a search algorithm based on the Split-and-Merge segmentation algorithm. At each iteration, the raw acoustic scores of all the candidate phones for all segments and all segmentations in the search neighborhood are stored, without using the constraint probabilities (we assume probabilistic constraints, in the form of a statistical grammar as above). We only need to compute the scores for segments created by splits or merges that had not been examined in previous iterations of the algorithm. Next, an action (split or merge) is taken if, maximizing only over the phone labels of the segments that change under this action, the global likelihood (including the constraint probabilities) is increased.
Analytically, let us assume that the segmentation at iteration k is

S = S_b ∪ S_m ∪ S_e := S(1, τ_0) ∪ S(τ_0 + 1, τ_1) ∪ S(τ_1 + 1, N)    (6.34)

and the corresponding optimum phone-label sequences for the three parts of the


sentence are, respectively, A_b, A_m, A_e. Now, let S′ be a neighbor of S obtained by changing S_m to S′_m := S′(τ_0 + 1, τ_1). Then, S′ can be chosen as the new segmentation only if

max_{A′_m} { p(z(τ_0 + 1, τ_1) | A′_m, S′_m) p(S′_m | A′_m) p(A′_m | A_b) p(A_e | A′_m) }
    > p(z(τ_0 + 1, τ_1) | A_m, S_m) p(S_m | A_m) p(A_m | A_b) p(A_e | A_m).    (6.35)

This guarantees that the likelihood of the new segmentation is greater than that of the previous one. If the decision is to take the action, we fix the segmentation and then perform a search over the phone labels only. For example, if bigram probabilities are used, this can be done with a dynamic-programming search over the phonetic labels using the acoustic scores of the current segmentation. This last step is necessary because, once an action is taken and a local change is made, the constraints mean that the current label sequence is no longer guaranteed to be globally optimal for the given segmentation. As in the independent-phone case, convergence of the algorithm is ensured, because after each iteration a new configuration is adopted only if it increases the likelihood.

Because of the difficulty of the search in the dependent-phone case, it is important that the above algorithm start with a good initial segmentation. In practice, we obtained a good starting segmentation by first running an unconstrained search using only unigram probabilities, and incorporating the constraints as described above in a second pass. This two-phase algorithm is summarized below.
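The label-only DP step (fix the segmentation, then run a bigram-constrained Viterbi pass over the phone labels) can be sketched as follows. The log-domain score tables below are invented stand-ins for the acoustic scores cached in the Split-and-Merge lattice.

```python
import math

def viterbi_labels(acoustic, unigram, bigram):
    """Bigram-constrained Viterbi search over phone labels for a FIXED
    segmentation. acoustic: one dict per segment, phone -> log score."""
    phones = list(unigram)
    # delta[a] = best log score of a label sequence ending in phone a
    delta = {a: unigram[a] + acoustic[0].get(a, -math.inf) for a in phones}
    back = []
    for seg in acoustic[1:]:
        new_delta, ptr = {}, {}
        for a in phones:
            prev = max(phones, key=lambda b: delta[b] + bigram.get((b, a), -math.inf))
            new_delta[a] = delta[prev] + bigram.get((prev, a), -math.inf) + seg.get(a, -math.inf)
            ptr[a] = prev
        delta = new_delta
        back.append(ptr)
    seq = [max(delta, key=delta.get)]        # trace back the best sequence
    for ptr in reversed(back):
        seq.append(ptr[seq[-1]])
    return seq[::-1]

# Invented two-segment example with two phone labels.
acoustic = [{"a": -0.1, "b": -3.0}, {"a": -3.0, "b": -0.1}]
unigram = {"a": -0.7, "b": -0.7}
bigram = {("a", "a"): -0.2, ("a", "b"): -1.6, ("b", "a"): -1.6, ("b", "b"): -0.2}
print(viterbi_labels(acoustic, unigram, bigram))
```

Here the acoustic evidence for a label change outweighs the bigram penalty, so the best sequence switches labels between the two segments.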


Constrained Split-and-Merge

Phase I: Run unconstrained Split-and-Merge until convergence. Let S_I be the final segmentation.

Phase II: Using S_I as the initial segmentation, replace each iteration of the unconstrained Split-and-Merge with the following steps, and iterate until convergence:
  1) Search over possible actions that increase the score locally, based on the criterion (6.35).
  2) Fix the segmentation and perform a DP search over the phone labels only.

There are also other possible variations of the basic Split-and-Merge algorithm. For example, with more complex segment-based models, like the dynamical system segment model, we can use as an initial segmentation the one obtained with an independent-frame model. If bigram probabilities are used with the dynamical system segment model, then Phase I of the constrained Split-and-Merge can be replaced by running both phases with an independent-frame model first. Such schemes significantly reduce the number of Split-and-Merge iterations with the more expensive DS segment model.

6.4.4 Complexity

The size of the search neighborhood at each iteration of the Split-and-Merge algorithm is O(n), with n the number of current segments. However, all that is actually needed at each iteration is to compute the likelihood scores for those actions (splits or merges) that involve segments created at the last iteration,


since all other scores have been computed previously. Therefore, the number of segment classifications (6.14) required by the Split-and-Merge algorithm is of the order of the number of iterations, with an additional initial cost proportional to the number of initial segments. For comparison, we saw in Section 6.2 that the DP search requires O(DN) segment classifications when the maximum phone duration is D. In the case of constrained searches, one can argue that there may exist pathological situations in which all possible configurations are visited and the number of iterations is exponential in N. As we shall see, though, we found experimentally that the number of iterations both without and with the bigram constraints was roughly equal to n, and the number of segment classifications was 2.5N and 3.2N, respectively.
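Plugging in the representative figures quoted in Section 6.5 (N = 300 frames, D = 50, and the measured 2.5N segment classifications for the unconstrained search) makes the gap concrete:

```python
N, D = 300, 50
dp = D * N                 # O(DN) segment classifications for the one-frame DP search
sm = int(2.5 * N)          # measured count for unconstrained Split-and-Merge
print(dp, sm, dp // sm)    # the ratio is the factor-of-20 overhead noted later
```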

6.5 Experimental Results

In this section, we present experimental results on phone recognition. There are two basic sets of experiments in this chapter. In the first set, we show that the fast search algorithms save significant computation with effectively no loss in recognition performance. In the second, we give results that compare the recognition performance of the DS segment model to that of an independent-frame model. The test set used in these experiments was our small development set.

Search Algorithm Comparison

The acoustic model used in the experiments comparing the Split-and-Merge to the DP search was a 5-frame, block-diagonal version of the SSM. We used 61 phonetic models, but in counting errors we ignored confusions between similar allophones, as shown in Appendix 2.B. However, we did not remove the glottal stop "q" in either training or recognition, and we allowed substitutions of a closure-stop pair by the corresponding single closure or stop as acceptable errors. Those


two differences from the conditions used in [52, 81] have effects on the performance estimates that offset one another. The dimension of the feature vector in these experiments was 39, and the features were obtained by linear discriminant analysis from an initial feature set that included cepstral coefficients, their derivatives, the total energy and its derivative [30].

The average sentence length in the small test set was 3 seconds, or N = 300 frames when the cepstral analysis is done every 10 ms. Using D = 50, |A| = 61, M = 5, and unigram phone-sequence probabilities (unconstrained Split-and-Merge), we measured the average number of segment classifications and Gaussian scores for the dynamic-programming and Split-and-Merge search algorithms. Both algorithms also used the fast segment-classification method presented in Section 6.1, although it primarily benefited the Split-and-Merge search. We found that the average number of iterations for the non-constrained Split-and-Merge search over these sentences was 114, with 2.5N segment classifications and 133N Gaussian score evaluations, versus 50N segment classifications and 305N Gaussian score evaluations for the DP algorithm. The difference in Gaussian score evaluations suggests that the Split-and-Merge algorithm should be 2.3 times faster than the DP search. In fact, we observed that the Split-and-Merge search was on average 5 times faster than the single-frame-resolution DP search, because of the overhead associated with the factor of 20 times as many segment classifications for the DP algorithm. The complexity of the SSM in terms of Gaussian computations is the same as that of a continuous-density HMM with M states per model, but the Split-and-Merge algorithm used with HMMs would probably be closer to 2-3 times faster than Viterbi decoding (without pruning), because HMMs would not have the overhead of evaluating segment scores.

The Split-and-Merge search was on average 3 times faster than a two-frame-accurate DP search under these conditions, which has a factor of 4 times fewer segment classifications than the one-frame-accurate DP search. All three algorithms gave equivalent recognition performance (64.6-64.9% accuracy). These results are summarized in Table 6.1. The savings also increase with the length of the model: we verified experimentally that doubling the length of the model had little effect on the Split-and-Merge search, whereas it doubled the recognition time of the DP search.

  Grammar   Search Algorithm   Segment Classifications   Gaussian Computations   Recognition Accuracy
  Unigram   1-Frame DP                15,000                    90,000                  64.8
            2-Frame DP                 3,750                    80,000                  64.9
            Split & Merge                750                    40,000                  64.6
  Bigram    2-Frame DP                 3,750                    80,000                  66.7
            Split & Merge                950                    60,000                  66.7

Table 6.1: Comparison of different search algorithms using a test set with average sentence length N = 300. The SSM used maximum phone duration D = 50, |A| = 61 models, and M = 5 distributions per segment model.

The average number of segment classifications and Gaussian computations was also measured for the case of constrained searches, and the computation savings are similar. In this case, there are more segment (3.2N) and Gaussian (200N) score evaluations because of the second phase of the Split-and-Merge. However, we again observe greater computation savings than predicted by the number of Gaussian score computations, because of the additional DP overhead associated with the increased search space. The different search algorithms had equivalent recognition performance, 66.7% accuracy.

Correlation Models in Recognition

In the second set of experiments we evaluated the connected-phone recognition performance of the DS segment model. Efficient search algorithms are of greater importance if correlation modeling is included in the SSM and the frames are not independent, since caching is no longer an option. In that case, the number of Gaussian computations in the dynamic-programming search increases significantly, making the segment models much more complex than comparable HMMs. Connected-phone recognition using the dynamical system segment model was feasible using the Split-and-Merge algorithm, and the number of segment classifications was similar to that with the independent-frame SSM. The DP search was not implemented for the DS segment model because of limitations in time and computer resources. However, since the number of segment classifications for the Split-and-Merge remained the same as for the independent-frame SSM, and the DP search requires the same number of segment classifications as in the independent-frame case, the Split-and-Merge in this case is 20 times faster than dynamic programming.

  Model           % Correct   % Subst.   % Delet.   % Insert.   % Accuracy
  Independent-1       62         22         16          7           55
  DS model-1          64         22         14          7           57
  DS model-2          62         23         15          4           58
  Independent-2       68         23          9          5           63
  DS model-3          69         21         10          2           67

Table 6.2: Phone recognition results for 18 cepstral coefficients (no derivatives) with Split-and-Merge for the independent-frame and the dynamical system segment models. 1) Independent-1: uniform initial segmentation; 2) DS model-1: initial segmentation from the output of 1); 3) DS model-2: uniform initial segmentation; 4) Independent-2, DS model-3: initial segmentation from the sentence transcription.

There is a subtle issue in using the DS segment model for connected-phone recognition. Since the dynamics of the model that we implemented were reinitialized at the segment boundaries, the first innovation (obtained from the Kalman predictor) in each segment usually has the largest amount of variation. The distribution associated with the first frame is therefore much smoother than those of the remaining frames, and the likelihood scores for the first frame are correspondingly lower. This was a problem in our initial implementation of the DS segment model in recognition, and the recognizer suffered from a large number of deletions. Our solution to this problem was to approximate the state at the last frame of the previous segment by the corresponding observation, and to predict the first frame of the current segment from that observation. This, however, puts too much emphasis on the last observation of the previous segment (since we assume that it is noiseless), and somewhat reduces the performance of the model.

In this set of experiments, closure-stop pair substitutions were not allowed. Results using the 18 cepstral coefficients are presented in Table 6.2 for the independent-frame model and the DS segment models. Since only cepstra were used, the models are less accurate and it is much easier for the Split-and-Merge search to fall into local optima. For this reason, we have included the recognition results for different initial segmentations, including the "true" segmentation obtained from the TIMIT transcriptions. Even though the latter is probably a slightly optimistic result (since it may correspond to a local optimum with a higher score than the true optimum), it gives an indication of the relative recognition performance of the DS and independent-frame models when less affected by search errors. Using the improvements to the basic Split-and-Merge algorithm, which were not used in this set of experiments, it should be possible to bring the fully automatic performance close to the optimum.
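The boundary fix can be illustrated with a scalar one-step Kalman predictor. The model x' = F x + w, y = H x + v and all parameter values below are invented for illustration; they are not the thesis parameters.

```python
F, H, Q, R = 0.9, 1.0, 0.1, 0.2   # invented scalar system parameters

def innovation(y, x_pred, p_pred):
    """One step of the Kalman predictor: innovation e = y - H*x_pred and its
    variance s = H*p_pred*H + R for the predicted state (x_pred, p_pred)."""
    return y - H * x_pred, H * p_pred * H + R

y_first = 0.88          # first observation of the current segment
y_prev_last = 0.95      # last observation of the previous segment

# Reinitializing at the boundary with a vague prior inflates the first
# innovation variance, so the first-frame likelihood score is depressed:
e1, s1 = innovation(y_first, x_pred=0.0, p_pred=10.0)

# The fix sketched above: treat the previous segment's last observation as a
# noiseless state estimate and predict across the boundary, so only the
# process-noise variance Q remains in the prediction uncertainty:
e2, s2 = innovation(y_first, x_pred=F * y_prev_last, p_pred=Q)

print(s1, s2)
```

The second innovation variance is far smaller, which is why the fix removes the systematic first-frame likelihood penalty, at the cost of treating that last observation as noiseless.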
With cepstral derivatives and the derivative of log energy included in the feature set, both models gave similar recognition performance: 64% and 63% accuracy for the independent-frame and DS segment models, respectively, when starting from a uniform segmentation. The corresponding numbers when the sentence transcription was used as the initial segmentation of the Split-and-Merge were 68% and 69% for the independent-frame and DS segment models, respectively. The results without derivatives demonstrate that the performance of the DS model is similar to that of the independent-frame model using both cepstra and derivatives. This is very encouraging, since the latter system uses additional information obtained across segment boundaries: the derivatives are computed from 5-frame-long windows. We conjecture that by removing the reinitialization assumption at the boundaries, incorporating derivatives in the feature set will become unnecessary for the DS segment model. This approach would also make the likelihood scores more homogeneous across the whole sentence, and should improve the overall recognition performance. Even though this may be difficult to do optimally (see the discussion in Section 5.3), suboptimal solutions are possible by assuming piecewise-constant Kalman gains and prediction-error variances, either independent of or dependent on the previous model (for context modeling). Such an approach can also model the dynamics at the segment transitions, which are important in connected recognition, but it requires context modeling, since the transition dynamics depend on both phones.

6.6 Word recognition with the Split-and-Merge

So far in this chapter we have presented search algorithms for phonetic recognition and classification using segment-based models, since we have chosen the phone as the basic modeling unit and phone recognition experiments as our evaluation method. However, the algorithms can be extended to word recognition. For completeness, we present in this section a possible extension of the Split-and-Merge algorithm to word recognition, even though it is beyond the scope of this thesis to perform word recognition experiments.

We shall assume in this section that the purpose of the recognition mapping is


to reveal the spoken word sequence W, which we consider random. The MAP rule for decoding the sequence of observations Z can be rewritten in this case as

W* = argmax_W p(W, Z) = argmax_W Σ_{A,S,n: φ(A,S)=W} p(W, A, S, Z)    (6.36)

where the decoding function φ(·) now maps mode sequences to words, since the objective is word recognition. The approach that we can follow to obtain an approximate solution to (6.36) uses phonetic recognition, and specifically the Split-and-Merge algorithm, as its initial building block. After the Split-and-Merge algorithm converges to a segmentation S*, we have essentially evaluated the likelihood scores for all segmentations in the neighborhood N_sm(S*) and all phone sequences that can correspond to segmentations in this neighborhood. It is, therefore, computationally inexpensive to approximate the summation in (6.36) by

Ŵ = argmax_W Σ_{S ∈ N_sm(S*)} Σ_{A: φ(A,S)=W} p(W, A, S, Z).    (6.37)

We conjecture that this approximation should be good, because (i) the high-probability phone strings probably dominate the summation in (6.36), and (ii) the family of segmentations in N_sm(S*) is large, since it includes splits, merges, and combined split-merge actions (left or right) for all segments in the final segmentation. In order to compute the probabilities in the sum (6.37), we can write

p(W, A, S, Z) = p(W | A) p(A, S, Z),    (6.38)

where we have assumed that p(W | A, S, Z) = p(W | A), and the second term is obtained from the final lattice of the Split-and-Merge algorithm. There are many choices for the probability of a word string given a phone string. It is essential to use a rich dictionary, with multiple pronunciations for each word, since the word search will be limited to the Split-and-Merge lattice. One such example


is the dictionary used in the SRI DECIPHER recognition system [25], with its "bushy" pronunciation networks. With multiple pronunciations, the probability of a particular word given a sequence of phones will be nonzero only if the word/phone-sequence pair is in the dictionary. The relative weights for all words that can produce the same phone string can be estimated from training data. An alternative approach would be to use knowledge from phonetic recognition experiments to determine the probability of misrecognized phone strings. More specifically, we can estimate from phonetic recognition the probabilities of substitution, deletion, and insertion given a particular phonetic context. These probabilities, used in conjunction with a dictionary, can determine the probability of a phone string given a particular word.

There are also other possible extensions of the search algorithms we presented that could be useful in word recognition. The fast segment-classification algorithm could also be used for fast word classification, to eliminate unlikely word hypotheses prior to a more detailed classification step. The constrained Split-and-Merge algorithm could be extended to consider word-pronunciation grammars rather than bigram phone probabilities. Lastly, local search algorithms of the form presented here can be a part of more general, constrained searches at the word level.
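A miniature version of the word-scoring rule implied by (6.37) and (6.38) can be sketched as follows: sum p(W | A) p(A, S, Z) over the (phone string, segmentation) pairs retained in the Split-and-Merge lattice. The lattice entries, probabilities, and pronunciations below are all invented for illustration.

```python
from collections import defaultdict

# Each hypothetical lattice entry: (phone string, segmentation id, p(A, S, Z)).
lattice = [
    (("d", "aa", "k", "t", "axr"), "S1", 0.006),
    (("d", "aa", "k", "t", "er"), "S1", 0.003),
    (("d", "aa", "k", "t", "axr"), "S2", 0.002),
]

# p(word | phone string): nonzero only for pronunciations in the dictionary,
# which is why a rich multiple-pronunciation dictionary is essential.
dictionary = {("d", "aa", "k", "t", "axr"): {"doctor": 1.0},
              ("d", "aa", "k", "t", "er"): {"doctor": 1.0}}

def word_scores(lattice, dictionary):
    """Accumulate p(W|A) * p(A,S,Z) over all lattice entries, as in (6.37)-(6.38)."""
    scores = defaultdict(float)
    for phones, seg_id, p_asz in lattice:
        for word, p_w in dictionary.get(phones, {}).items():
            scores[word] += p_w * p_asz
    return dict(scores)

print(word_scores(lattice, dictionary))
```

All three lattice hypotheses support the same word through two pronunciations, so their probability mass pools onto it.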

6.7 Discussion

To summarize, we presented in this chapter fast algorithms for segment classification and for connected-phone recognition. The recognition algorithm, based on a local Split-and-Merge search, is applicable to any recognition system but is particularly relevant to segment-based models. We included an analysis of local search algorithms that motivates the Split-and-Merge search algorithm: the neighborhood of combined splits and merges is larger than the neighborhood of either splits


or merges alone, and therefore the search algorithm is less sensitive to local optima. In addition, the analysis suggests the use of a fine initial segmentation, since the minimal exact neighborhood is smallest at this point. In experiments with an independent-frame SSM, the Split-and-Merge search algorithm achieved a significant computation reduction relative to the optimal dynamic-programming algorithm, with no loss in recognition performance. In this case, the SSM complexity was similar to that of an HMM with Gaussian observation distributions. In addition, we showed that greater savings are possible with more complex segment models. For the dynamical system segment model, a DP search has a much larger complexity than a comparable continuous-density HMM. However, the incorporation of time correlation has no effect on the Split-and-Merge search, making our models more feasible for continuous speech recognition.


6.A Exact Neighborhood for the JSR problem

In this appendix we prove Proposition 6.4. Let us define the neighborhood

N_1(S) = {S′ | S, S′ form exactly one loop}.    (6.39)

We first show that N_1(S) ⊆ N_exact(S). To do so, we show that if S′ forms exactly one loop with S, then it is adjacent to S and therefore S′ ∈ N_exact(S). To prove that S′ is adjacent to S, we show that there exists a set of weights such that S′ is the best segmentation and S is the second best. Let the segmentations S, S′ form the loop over the interval [τ_1, τ_2], denote the corresponding parts by S(τ_1, τ_2) and S′(τ_1, τ_2), and let ν and ν′ be their respective numbers of segments in this interval. Then, we can select weights

c_{t_1,t_2} = c′,           for all (t_1, t_2) ∈ S′(τ_1, τ_2)    (6.40a)
c_{t_1,t_2} = c < c′ν′/ν,   for all (t_1, t_2) ∈ S(τ_1, τ_2)     (6.40b)
c_{t_1,t_2} = −∞,           otherwise,                            (6.40c)

thereby guaranteeing that L(S′) > L(S). The only complete segmentations spanning [1, N] that can be constructed from the segments of S and S′ are S and S′ themselves, since they have no common boundaries in the interval between τ_1 and τ_2 and are identical outside that interval. If all other segments have cost −∞, then any other segmentation S″ has cost −∞, satisfying L(S″) < L(S) < L(S′).

To complete the proof we must show that N_exact(S) ⊆ N_1(S). We do this by showing that, for T the set of all feasible segmentations,

(T − N_1(S)) ∩ N_exact(S) = ∅,    (6.41)

i.e., we show that if S′, S form more than one loop, then S′ ∉ N_exact(S). This is rather straightforward, since then S, S′ could be decomposed as

S  = S(1, τ_1 − 1) ∪ S(τ_1, τ_2 − 1) ∪ S(τ_2, τ_3 − 1) ∪ S(τ_3, N)      (6.42)
S′ = S′(1, τ_1 − 1) ∪ S′(τ_1, τ_2 − 1) ∪ S′(τ_2, τ_3 − 1) ∪ S′(τ_3, N).  (6.43)


Then, we cannot find weights such that S′ and S are the best and second-best configurations, because in any case we can construct other segmentations that consist of parts of these two and have scores between theirs.


Chapter 7

Extension: Hierarchical Models

Let us review our approach to the acoustic modeling problem in CSR. We initially chose the phoneme as the modeling unit and segment-based models for the intrasegmental statistical dependencies. Based on our intuition and the experimental findings that we presented in Chapter 3, we chose to constrain our segmental distributions to the family of linear models. In Chapter 4 we introduced a linear segment-based model and demonstrated that such an approach can improve today's state-of-the-art phonetic recognition performance.

As significant as the modeling of intrasegmental correlations is, it is also important to model the intersegmental statistical dependencies, which segment-based models alone do not address. Modeling intersegmental dependencies is a more complicated problem, and we have argued, and seen experimental evidence, that linear models may not be adequate for representing them. In this chapter we propose a framework for modeling them, specifically an embedded-segment model. Although this model can be applied to model longer-scale dependencies, in this work we shall only use it for modeling intrasegmental correlations.


7.1 Hierarchical-Mode Process

As we discussed in a previous chapter, speech possesses a hierarchical structure. All the acoustic models currently used in CSR, including the ones we have used so far in this thesis, explicitly model units at only a single level of the hierarchy, usually the phoneme. As a consequence, the mode sequence in the formal definition of an acoustic model (see Chapter 2) is a single-rate process, and the family of acoustic models that our definition covers does not include models for hierarchical processes. Modeling of more global effects can only be done indirectly, by expanding the mode space of the single-level events so that there exist distinct representations of the same event for different contexts. This is the approach followed in most recognition systems today for modeling coarticulation effects in speech [84, 51], that is, to have different models for different contexts. Such an approach, however, can become an inefficient parameterization.

Instead, we are interested in providing a formalism that better represents the hierarchical structure of speech. The mode of the speech process is not only the current state of the articulators: it is also the phoneme, syllable, word, sentence, speaker, and recording conditions in which this state appears; it therefore belongs to a set that can be defined hierarchically. We shall attempt to generalize the family of acoustic models so that it includes models with a hierarchical mode, with the family of acoustic models described in Definition 2.1 as a subclass of the new family. This approach will facilitate the design of hierarchical models by allowing us to borrow methods from our single-level models. The time variability of speech will be represented by sequences of random variables, which we again call the mode process, that take values from a hierarchically defined set.
The time resolution of the mode process implies a partitioning of the observed sequence, as was the case with the acoustic models. As an example, let us assume that we want to model a hierarchy with 3 levels, representing words, syllables and phones. The model components of the mode that correspond to the three levels of this example are, respectively, α^0 = {doctor}, α^1 = {/d aa k/, /t axr/}, α^2 = {/d/, /aa/, /k/, /t/, /axr/}, where we use superscripts to denote the level; they are depicted graphically in Figure 7.1.

[Figure 7.1: Example of the model components of a hierarchical mode for a 3-level hierarchy.]

In order to construct an acoustic model for speech with a hierarchical mode, we need to specify the joint distribution of all observations associated with the current mode of the process. If the top-level component of the mode corresponds to a speech unit from the high levels of the hierarchy, like the word in the previous example, then estimating a global distribution may not be feasible. Word-based models were used in the early days of speech recognition, and were soon abandoned because of training problems. As the recognition tasks become more difficult and the vocabulary size increases, we should expect to find fewer occurrences of the high-level units in our training data, simply because there are too many of them. The introduction of low-level units, like the phone, made estimation much easier, at the expense of capturing less speech-signal variability in the models.
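The word-syllable-phone example can be made concrete by attaching frame intervals to each unit; the frame indices below are invented. A defining property of such a hierarchy, formalized later in this chapter, is that the boundaries at a coarser level are a subset of the boundaries one level down, which is easy to check:

```python
# Frame-level rendering of the 3-level "doctor" example; indices are invented.
word_level = [((1, 60), "doctor")]
syllable_level = [((1, 32), "/d aa k/"), ((33, 60), "/t axr/")]
phone_level = [((1, 10), "/d/"), ((11, 24), "/aa/"), ((25, 32), "/k/"),
               ((33, 42), "/t/"), ((43, 60), "/axr/")]

def boundaries(level):
    """Segment end-frames at one level of the hierarchy."""
    return {end for (start, end), label in level}

# Coarse boundaries must be a subset of the boundaries one level down, since
# each coarse unit is built from whole finer units:
nested = (boundaries(word_level) <= boundaries(syllable_level)
          <= boundaries(phone_level))
print(nested)
```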


One should not conclude, however, that hierarchical, or more global, models cannot be used in CSR. The problem is actually as posed at the end of the last paragraph: to attain the best possible balance between the variability the model allows and its trainability. We cannot hope that a "straightforward" global distribution will be efficient in modeling global effects. However, by imposing constraints on the structure of the distribution we can improve the trainability properties of the model without, hopefully, losing too much of the capability to model global events. This was also the approach that we followed in modeling the intrasegmental dependencies, by constraining the full-covariance model to the class of first-order Markov ones.

In order to build a hierarchical model, we could choose any family of models, like HMMs, as the basic building block. In that case we would have a hierarchical model with different HMMs modeling the sequences of speech units at the different levels. Among the different single-level acoustic models, we have chosen to work with the segment-based ones. In our extended models we still want to be able to capture the interaction of observations at any particular level, and we want to exploit our experience with segment-based models in the new approach. We shall therefore generalize our segment models to the family of embedded segment models, described below.

7.2 Embedded Segment Models

Instead of giving a rigorous definition of the embedded-segment model, we shall describe the general model and two possible assumptions that simplify its observation distribution.


7.2.1 General Case

We saw in Chapter 2 that the basic components of an acoustic model are the mode space, the observation distribution, and the grammar that describes the mode process. Each of these objects is described below for a general embedded-segment model.

Mode Space

One basic component of the mode of a segment-based model is the segmentation. The boundaries between units at higher levels of the hierarchy, like syllables or words, are much better defined than those between phones. In addition, the boundaries at higher levels are subsets of the boundaries between lower-level units, since the higher-level units consist of several low-level ones. This motivates the definition of an embedded segmentation for an h-level hierarchy as

S = \{S^1, S^2, \ldots, S^h\}    (7.1)

where the segmentation at the j-th level, j = 1, \ldots, h, is defined as usual,

S^j \triangleq \{s_1^j, s_2^j, \ldots, s_{n_j}^j\} = \{(1, \tau_1^j), (\tau_1^j + 1, \tau_2^j), \ldots, (\tau_{n_j - 1}^j + 1, \tau_{n_j}^j)\},    (7.2)

and each segment s_k^j = (\tau_{k-1}^j + 1, \tau_k^j) is a member of the set \mathcal{S} = \{(\tau_1, \tau_2) : 1 \le \tau_1 \le \tau_2 \le N\}, where N is the length of the sentence at the finest level. The segments at level j represent a finer segmentation than the immediately coarser level j - 1; i.e., each segment at level j is embedded within a segment at level j - 1. Thus, the syntactic components of the mode-grammar that the embedded segmentation implies are 1) the top-level segmentation must span the entire sentence and have no gaps, as in the one-level case, and 2) the subdivision of a segment at any level must begin and end at the boundaries of that segment and have no gaps. The entities involved in a hierarchical model span multiple levels, which can make the notation cumbersome. To facilitate the reading of the remainder of this


chapter, let us outline the notation conventions that we adopted above. Capitals are used, as usual, to denote collections of random variables. We use a superscript to denote the level, whereas a subscript is used for the time index. For example, S_k^j includes all segments at level j that are embedded within the k-th segment of level j - 1,

S_k^j = \{s^j_{n_j(k-1)+1}, \ldots, s^j_{n_j(k)}\},    (7.3)

where n_j(k) is the index of the last segment at level j embedded within the k-th segment at level j - 1. If the superscript is missing and the subscript is k, e.g. S_k, then the random variables from all levels associated with s_k^1 are included in the collection. We will refer to S_k as a generalized segment. If the time index is missing, then we reference all the random variables for the given level. The embedded segmentation and the notation are depicted in Figure 7.2.

Figure 7.2: Representation of an embedded segmentation; in the depicted example, S^2 = [s_1^2, s_2^2, s_3^2, s_4^2, s_5^2], S_1^2 = [s_1^2, s_2^2, s_3^2], S_2^2 = [s_4^2, s_5^2], S_4^3 = [s_6^3, s_7^3, s_8^3], and S_2 = [s_2^1, s_4^2, s_5^2, s_6^3, s_7^3, s_8^3, s_9^3, s_{10}^3].

In analogy to the single-level segment models, the second mode component of the embedded segment models is the class identity of a particular segment. The


sequences of language units for all h levels of the hierarchy will be represented by

A = \{A^1, A^2, \ldots, A^h\}.    (7.4)

Hence, the complete mode of an embedded segment model is q_k = (L_k, A_k), where A_k is the collection of language units from all levels within the k-th generalized segment and L_k represents the durations of all segments included in S_k. We can use the hierarchical framework to model observations extracted from the same single-level feature processing that we have used thus far, as well as to model observations extracted at different time scales. Segmental features, such as those used in the MIT system [62], multiscale approaches to time-frequency analysis, like wavelets [26, 61], and more global features can be incorporated more easily using a hierarchical approach. The mechanism for including multiple-rate features in the model will become clearer when we specialize our definition to the family of embedded segment models. In the following discussion, we shall assume that we can partition the collection of observations within a generalized segment into h subsets, Z = \{Z^1, Z^2, \ldots, Z^h\}, and that the subset Z^j is associated with the j-th level. We also allow the observation space to differ according to level. Presumably, this partition is such that features characterizing a particular unit of speech are associated with the corresponding level: phone duration should be modeled at the phonetic level, whereas pitch variation should be modeled at the syllable level.
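The syntactic constraints that an embedded segmentation must satisfy can be stated operationally. The checker below is our own sketch, not part of the thesis: it verifies that every level tiles the sentence [1, N] with no gaps and that the boundaries of each coarser level reappear at the next finer level, as required by the definition above.

```python
# Hypothetical helper (not from the thesis): check that an embedded
# segmentation S = {S^1, ..., S^h} satisfies the syntactic constraints of
# Section 7.2.1. Segments are 1-indexed (start, end) frame pairs.

def is_valid_embedded_segmentation(levels, N):
    """levels[j-1] is the list of (start, end) pairs for level j."""
    for j, segs in enumerate(levels):
        if segs[0][0] != 1 or segs[-1][1] != N:
            return False                      # must span the whole sentence
        for start, end in segs:
            if end < start:
                return False                  # degenerate segment
        for (_, e1), (s2, _) in zip(segs, segs[1:]):
            if s2 != e1 + 1:
                return False                  # gap or overlap between segments
        if j > 0:                             # finer level must respect coarser boundaries
            coarse_ends = {e for _, e in levels[j - 1]}
            fine_ends = {e for _, e in segs}
            if not coarse_ends <= fine_ends:
                return False
    return True
```

For instance, a two-level segmentation whose fine level crosses a coarse boundary is rejected, because the coarse boundary does not reappear among the fine-level boundaries.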

Observation Distribution

The core component of an embedded segment model is the joint distribution of the observations within a generalized segment. According to the acoustic model definition, these observations depend only on the current mode. In addition, as we did in the single-level case, we shall make the simplifying assumption that observations from different generalized segments are conditionally independent given the mode:

p(Z_k \mid Z_0, \ldots, Z_{k-1}, L_k, A_k) = p(Z_k \mid L_k, A_k)    (7.5)


where Z_k denotes the observations from all levels within the k-th segment of the coarser level. In order to simplify the observation distribution of an embedded segment model, we can make the following assumptions: i) the observations at level j are conditionally independent of the mode components at levels other than j and j - 1, given the segment and label sequences at levels j and j - 1; ii) the observations at level j are conditionally independent of those at levels 1, \ldots, j - 2, given the observations and mode of the comprising segment at level j - 1 and the mode at level j, i.e., we make a Markov assumption for the dependence of observations across levels; and iii) the observations of different segments at the same level are conditionally independent given the observations at the coarser level and the mode. Under these assumptions, the distribution of the observations in the k-th generalized segment can be written

p(Z_k \mid S_k, A_k) = p(z_k^1 \mid q_k^1) \prod_{i_2 = n_2(k-1)+1}^{n_2(k)} p(z_{i_2}^2 \mid z_k^1, q_k^1, q_{i_2}^2) \prod_{i_3 = n_3(i_2 - 1)+1}^{n_3(i_2)} \cdots \prod_{i_h = n_h(i_{h-1} - 1)+1}^{n_h(i_{h-1})} p(z_{i_h}^h \mid z_{i_{h-1}}^{h-1}, q_{i_{h-1}}^{h-1}, q_{i_h}^h)    (7.6)

where q_k^j = (l_k^j, \alpha_k^j). In the following sections we shall investigate alternative assumptions for the conditional distributions appearing in (7.6).
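As a numeric toy illustrating the factorization in (7.6) for h = 2: the log-likelihood of a generalized segment is the top-level term plus one cross-level conditional term per embedded segment. The scalar Gaussian densities and the additive coupling to z_k^1 below are our own illustrative stand-ins, not the thesis's segment distributions.

```python
import math

# Toy sketch of the two-level factorization in (7.6). Each factor here is a
# unit-variance scalar Gaussian; the cross-level dependence is modeled (for
# illustration only) as a shift of each child's mean toward the top-level
# observation z_k^1.

def log_gauss(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def loglik_generalized_segment(z1, z2_list, top_mean, child_means,
                               var=1.0, coupling=0.5):
    # log p(z_k^1 | q_k^1)
    ll = log_gauss(z1, top_mean, var)
    # sum of log p(z_i^2 | z_k^1, q_k^1, q_i^2) over the embedded segments
    for z2, m in zip(z2_list, child_means):
        ll += log_gauss(z2, m + coupling * (z1 - top_mean), var)
    return ll
```

Because the factors multiply, the log-likelihood is a plain sum over levels and embedded segments, which is what makes the level-recursive searches of Section 7.3 possible.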

Grammars

For an embedded segment model, the grammar consists of the syntactic and stochastic components that compose the score p(A, S) of the observed mode sequence. A first syntactic component consists of the constraints imposed by the embedded segmentation that we saw in the previous section. In order to see what other components can be included in the score above, we can rewrite it as

p(A, S) = p(A^1, S^1)\, p(A^2, S^2 \mid A^1, S^1) \cdots p(A^h, S^h \mid A^1, S^1, \ldots, A^{h-1}, S^{h-1})    (7.7)


where, as usual, A^j, S^j denote the label and segment sequences at level j, and, as for the single-level model, p(A, S) = p(A, L). In the observation distribution, we used a similar decomposition going down the levels in order to incorporate in our model the effects of more global context on the acoustic information of the speech signal. In the grammar component, a decomposition going down the levels reflects changes in pronunciation that are implied by knowledge of the more global context. In the single-level models we used segment duration probabilities to compose the score of a segmentation. This, however, may not be the best approach, since the duration of a phone is clearly affected by the context in which the phone appears. These more global effects were reflected in our single-level models by bimodal empirical distributions for the segment duration probabilities. Thus, we can think of the "top-down" interaction across levels that we introduced here as adjusting the acoustic and grammar components of the models at lower levels based on the observations and/or the mode at the higher levels.

7.2.2 Linear Embedded-Segment Models

One alternative for modeling the interaction across levels is to assume a linear relationship between observations at different levels. In such an approach, portions of the variation of observations at a particular level are explained by regressions on observations at higher levels. Thus, the correlation between observations from the same level is partially determined by the higher mode components under which both observations appear: the information of one observation about another is transmitted upwards, to their common parent nodes in the hierarchical tree, and then downwards. This form of correlation modeling is not actually totally different from some of the single-level models that we examined. In Chapter 4, we introduced a variation of the dynamical system segment model, the target-state model. In this model,


the dependence between the observations from the same segment was modeled through the "target state" of the phone. We can think of this model as a very simple hierarchical model with two levels, the top corresponding to the phone and the bottom level to the part of phone. The top-level component of the mode is the phonetic segment, whereas the low-level segments are the invariant regions within the segment. The distribution of the high-level "observation", the target state (for which we employ a familiar trick and treat it as unobserved), is the Gaussian

p(z_k^1 \mid \alpha_k^1, l_k^1) = p(x_0 \mid \alpha_k^1, l_k^1) \sim N\left(\mu_0(\alpha_k^1), \Sigma_0(\alpha_k^1)\right),    (7.8)

whereas the observations at the lower level given x_0 are assumed conditionally independent and modeled with Gaussians described by the observation equation,

p(z_i^2 \mid z_k^1, \alpha_k^1, l_k^1) \sim N\left(H_{I(i)}(\alpha_k^1) x_0 + \mu_{I(i)}(\alpha_k^1),\ H_{I(i)}(\alpha_k^1) \Sigma_0(\alpha_k^1) H_{I(i)}^T(\alpha_k^1) + R_{I(i)}(\alpha_k^1)\right)    (7.9)

where I(i), the low-level model, is the index of the time-invariant region. The grammar component of this model is trivial, since the segmentation and label sequence at the top level uniquely determine the mode components of the lower level. We can easily generalize this model to an arbitrary number of levels, by defining the conditional distributions in the definition of the general embedded segment model to be conditional Gaussians. One special case would be to assume that the i-th observation z_i^j at level j is obtained as a linear transformation of its "parent" observation z_{\pi(i)}^{j-1} at level j - 1,

z_i^j = H_i^j z_{\pi(i)}^{j-1} + v_i^j,    (7.10)

where v_i^j is a non-zero-mean Gaussian random variable and \pi(i) returns the index of the parent observation. This model resembles a process defined on a different topology, a hierarchical tree. Processes on trees have gained much interest lately [11, 23], particularly after


the introduction of wavelets, to which they are closely related. Moreover, in [23] Chou et al. study the family of linear state-space models defined on trees, and develop estimation algorithms analogous to the RTS smoothing algorithm. It is fairly simple to extend our estimation results of Chapter 5 to the models studied in [23] and therefore solve the training problem for linear embedded segment models. However, we believe that a linearity assumption is not appropriate for modeling the longer time-scale dependencies. We saw in Chapter 3 that the linear assumption is not as effective for observations across phonetic segment boundaries as it is within them. This is consistent with the results in Chapter 4, where we showed that the target-state model barely outperformed the independent-frame model, and that it had very poor performance when derivatives were included in the feature set. Since the derivatives are computed from a window that spans several frames, a linear model for the dependency of observations that are several frames apart is probably not a good approximation. As a consequence, we will not evaluate this model experimentally in this thesis.
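Although this model is not evaluated experimentally in the thesis, the cross-level linear dependence in (7.10) is easy to visualize by sampling on a small tree: each node's value is H times its parent's value plus noise. The scalar form, tree shape, and parameter values below are purely illustrative choices of ours.

```python
import random

# Toy sampler for the linear tree process of (7.10): z_i^j = H z_parent + v,
# with v a non-zero-mean Gaussian. Scalar observations; H, bias, and sigma
# are arbitrary illustrative values, not estimated parameters.

def sample_tree(node_value, children, H=0.8, bias=0.1, sigma=0.3, rng=random):
    """children is a nested list giving the tree shape below this node."""
    out = []
    for sub in children:
        v = H * node_value + bias + rng.gauss(0.0, sigma)
        out.append((v, sample_tree(v, sub, H, bias, sigma, rng)))
    return out
```

Sampling the whole hierarchy is one recursive pass from the root, which mirrors how information propagates through common ancestors in this family of models.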

7.2.3 A Two-level Segment Model

The experimental work in this thesis will be confined to a simple, two-level model, and the units that we select to model at the two levels are the phones at the coarser level and the parts of phone at the finer level. We shall refer to the models of the finer level as stochastic microsegment models. Even though these experiments do not investigate intersegmental dependencies, as the embedded-segment framework is capable of doing, the results are directly comparable to our previous models and allow us to test the feasibility of the general framework. In order to further simplify the observation distribution (7.6), we assume that, given the current mode of the process, the observations across different levels are conditionally independent. As a consequence of this assumption, in conjunction with the distribution assumptions of the general embedded-segment models, the


correlation between observations at the finer level is characterized only by the mode. Thus, the joint distribution (7.6) of the observations at the finer level j = 2 within the k-th generalized segment (the k-th phoneme for the two-level model) can be written under the conditional-independence assumption as

p(Z_k^2 \mid z_k^1, q_k) = p(Z_k^2 \mid q_k) = \prod_{i = n_2(k-1)+1}^{n_2(k)} p(z_i^2 \mid q_k^1, q_i^2).    (7.11)

We can limit the number of distributions in (7.11) that need to be estimated by eliminating the dependence on the mode components of the coarser level. We thereby effectively use a smaller unit than the phone to capture short-term dependencies and attain the largest possible sharing of parameters among different phone models. Thus, the distribution (7.11) becomes

p(Z_k^2 \mid z_k^1, q_k) = \prod_{i = n_2(k-1)+1}^{n_2(k)} p(z_i^2 \mid q_i^2).    (7.12)

This philosophy is also apparent lately in some of the more advanced recognition systems. The IBM Tangora system incorporates context modeling by using a pool of models for a smaller unit than the phone, the so-called "fenone" [7]. Microsegment models have also been used by Deng et al. [28]. Under the previous assumption, we can write the conditional distribution of the fine-level observations given the coarse-level observations and mode component as

p(z_i^2 \mid q_k^1, z_k^1) = \sum_{q_i^2} p(z_i^2, q_i^2 \mid q_k^1, z_k^1) = \sum_{\alpha_i^2, l_i^2} p(\alpha_i^2, l_i^2 \mid \alpha_k^1, l_k^1)\, p(z_i^2 \mid \alpha_i^2, l_i^2)    (7.13)

where any of the previously described segment-based models can be used for the distribution of observations within a microsegment, p(z_i^2 \mid \alpha_i^2, l_i^2), and p(\alpha_i^2, l_i^2 \mid \alpha_k^1, l_k^1) is defined as part of the mode grammar. If no coarse-level features are used, then (7.13) is the basic model for a single-rate process. The following steps are involved in the generation of the output observations with the two-level model. First, the segment duration l_k^1 for the phone \alpha_k^1 is drawn


from a phone-specific distribution, as was the case with the segment-based models. Any coarse-level observations are generated using p(z_k^1 \mid \alpha_k^1, s_k^1). Then, the finer-level segmentation S_k^2 and microsegment sequence A_k^2 are obtained from p(A_k^2, S_k^2 \mid \alpha_k^1, s_k^1), and finally the fine-level observations are drawn using the microsegment distributions. Equation (7.13) clarifies the relationship of the two-level embedded-segment model to the previous single-level segment models that we examined in this thesis for single-rate observation processes. In all previous cases, the mapping T_l deterministically defines regions that are analogous to the mode components q^2. Therefore, the fundamental difference between the explicit two-level model and previous segment models is a larger mode space with a phone-dependent stochastic grammar. The larger mode space and stochastic grammar effectively introduce Gaussian mixture distributions at the microsegment level in place of the plain Gaussians used for the single-level models. A more classical approach would be to introduce mixture densities at each individual frame of the SSM. The two-level model is different in the sense that we have mixture densities for groups of observations, which allows us to also model the short-term correlation. Since the modeling of the longer-term dependencies is mainly done through the statistics of the mode process, we can also think of the two-level model that we introduce here as a "blend" of the two different families of models, HMMs and segment-based models. HMMs represent the correlation between different observations through the statistics of the mode sequence only, whereas segment-based models are explicit models for the correlation of observations within the same segment. The "microsegment" models explicitly represent the correlation for observations that are close in time, whereas the detailed mode space captures the dependencies for observations that are further apart.
If, however, we did not make the assumptions that lead to (7.12), the two-level model would be more complex than an HMM-SSM blend.
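The generation steps described above can be sketched with a toy example. Everything concrete here (the duration table, the microsegment inventory, the Gaussian means, the linear fine-level subdivision) is invented for illustration; only the order of the sampling steps follows the text.

```python
import random

# Toy generative sketch of the two-level model: 1) draw the phone duration
# from a phone-specific distribution, 2) draw the microsegment label
# sequence, 3) subdivide the duration linearly over the microsegments,
# 4) draw frame observations from each microsegment's (scalar) Gaussian.
# All tables below are hypothetical examples.

def generate_phone(phone, rng):
    durations = {'aa': [3, 4, 5], 'k': [2, 3]}
    microseqs = {'aa': [['aa_b', 'aa_m', 'aa_e']], 'k': [['k_b', 'k_e']]}
    micro_mean = {'aa_b': 0.2, 'aa_m': 0.5, 'aa_e': 0.1, 'k_b': -0.3, 'k_e': -0.1}

    l1 = rng.choice(durations[phone])              # step 1: segment duration
    a2 = rng.choice(microseqs[phone])              # step 2: microsegment labels
    bounds = [round(i * l1 / len(a2)) for i in range(len(a2) + 1)]  # step 3
    frames = []
    for label, (b, e) in zip(a2, zip(bounds, bounds[1:])):          # step 4
        frames += [rng.gauss(micro_mean[label], 1.0) for _ in range(e - b)]
    return l1, a2, frames
```

The linear subdivision in step 3 anticipates the deterministic fine-level segmentation that is found adequate in the experiments of Section 7.4.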


The two-level model also resembles a discrete-distribution model with a dynamic, variable-rate quantizer. The microsegment scores give the probabilities of the different articulatory states. If a Markov assumption is adopted for the grammar probabilities and there are no high-level observations, then our model is actually a variable-rate HMM. The advantage over the usual fixed-rate frame-based HMMs is, we believe, that by looking at a longer window than the usual 10 milliseconds we should be able to identify the articulatory state much more accurately. Before we present experimental results with this model, there are many practical issues that we must resolve. Among these are the selection of the model set for the finer level (we do not have linguistically motivated phonemes in this case), as well as the design of algorithms for using the model in CSR. We deal with these issues in the following section.

7.3 Using Hierarchical Models in CSR

In this section, we answer design questions and present training and recognition algorithms for using embedded-segment models in CSR. First, we give the solutions for the simple model that we introduced in Section 7.2.3. The two-level model is a segment-based model; thus, it is only necessary here to develop classification and labeled-training algorithms for it. We can use the algorithms introduced in Chapters 5 and 6 to obtain methods for recognition and for training without transcriptions with this model. We then discuss how these algorithms can be generalized to the h-level embedded-segment models using level-recursive schemes in both recognition and training.


7.3.1 Fine-level Unit and Grammar Selection

The first step in designing the microsegment models is the selection of the number and the type of classes at the finer level. The choice of the number of models is related to the difficulty of the task and the amount of available training data. In our experiments, we shall select the number of models so that the number of free parameters is equal to that used in the dynamical system segment model. In general, it is possible to determine an optimum number of models for the available training data by applying, for example, Akaike's AIC criterion [2] or Rissanen's minimum description length principle [80]. The second issue, the choice of the type of fine-level models, is related to the training procedure. A subdivision of a phoneme is not an established linguistic unit. Since we shall give iterative automatic training algorithms, the important issue is the choice of the initial estimates, especially since we are not usually guaranteed convergence to a global optimum. One alternative is to rely on speech knowledge and obtain initial estimates based on the acoustic similarity of parts of various phonemes appearing in similar contexts. We are going to follow a different approach, and use automatic clustering methods [34] to determine the initial estimates for the microsegment models, as described in Section 7.3.3. The first component of the grammar for the two-level model is the top-level phonetic grammar that computes the score p(A^1, S^1), and in which we can include the bigram phonetic probabilities and the phonetic duration distributions, as we have done so far. For the second-level component, we shall assume for simplicity that the microsegment model sequences from different phonetic segments are conditionally independent given the phone sequence. This is clearly not a good assumption if we are interested in connected-phone recognition or context modeling, but it is convenient to use in classification.


Under this assumption, we can write the grammar score of the second level as

p(A^2, S^2 \mid A^1, S^1) = \prod_{k=1}^{n} p(A_k^2, S_k^2 \mid \alpha_k^1, s_k^1)    (7.14)

where A_k^2, S_k^2 are the sequences of fine-level units and microsegments within the k-th phone and n is the number of phonetic segments. The terms in the above product can be expanded as

p(A_k^2, S_k^2 \mid \alpha_k^1, s_k^1) = p(A_k^2 \mid \alpha_k^1) \prod_{i = n_2(k-1)+1}^{n_2(k)} p(s_i^2 \mid \alpha_i^2, s_k^1)    (7.15)

where we have assumed that the label grammar of the fine level is independent of the phonetic segmentation, and that the durations of the microsegments depend only on the corresponding label and the global duration. An important term in the grammar score, in the absence of segmental features, is p(A_k^2 \mid \alpha_k^1). Since these probabilities are conditioned on the phone label, and the number of fine-level models will be on the order of a few hundred, obtaining good estimates is a difficult problem. There is a vast amount of research in language modeling on the estimation of grammar probabilities from sparse data (see for example [41]). The method that we choose to follow here is the one described by Katz in [44], which uses a general m-gram language model. It uses the high-order m-grams if there are sufficient occurrences of the corresponding context in the training data, and backs off to lower-order grammars for rare contexts. The computation of the grammar score p(A_k^2 \mid \alpha_k^1) in both training and recognition can be simplified if we impose an upper bound on the number of microsegments within each phone. We can further simplify things by assuming that the number of microsegments is always equal to this bound. Sequences with fewer microsegments can be allowed by including in our model set a "null" model, for "observations" with zero length.
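For intuition only, here is a drastically simplified back-off estimator in the spirit of Katz's method. It is not Katz's actual estimator, which uses Good-Turing discounting and properly normalized back-off weights; it only illustrates the use-high-order-counts-when-available idea on bigrams, with an arbitrary back-off weight.

```python
from collections import Counter

# Simplified back-off in the spirit of Katz [44]: use the bigram relative
# frequency when the context count exceeds a threshold k, otherwise back off
# to a scaled unigram estimate. The constant alpha stands in for the
# normalized back-off weight of the real method.

def backoff_bigram_prob(w_prev, w, bigrams, unigrams, total, k=1, alpha=0.4):
    if bigrams[(w_prev, w)] > k:
        return bigrams[(w_prev, w)] / unigrams[w_prev]   # well-observed context
    return alpha * unigrams[w] / total                   # back off to unigram

# Tiny example corpus of fine-level "labels"
corpus = ['a', 'b', 'a', 'b', 'a', 'c']
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
```

The same recipe extends to m-grams by backing off one order at a time until a sufficiently observed context is found.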


7.3.2 Classification

In phonetic classification we are given the coarse-level boundaries, and we can use the MAP rule, as usual, to select the best candidate phone

\hat{\alpha}_k^1 = \arg\max_{\alpha_k^1} p(\alpha_k^1 \mid Z_k, s_k^1) = \arg\max_{\alpha_k^1} p(\alpha_k^1, Z_k, s_k^1)    (7.16)

which we can rewrite, using the total probability theorem and the simplifying assumptions of the model, as

\hat{\alpha}_k^1 = \arg\max_{\alpha_k^1} \sum_{A_k^2, S_k^2} p(\alpha_k^1, A_k^2, Z_k, s_k^1, S_k^2) = \arg\max_{\alpha_k^1} \sum_{A_k^2, S_k^2} p(z_k^1, \alpha_k^1, s_k^1)\, p(Z_k^2 \mid A_k^2, S_k^2)\, p(A_k^2 \mid \alpha_k^1)\, p(S_k^2 \mid A_k^2, s_k^1).    (7.17)

Maximizing the summation above is actually a recognition problem, similar to (6.1), where we search over all possible mode sequences at the finer level. We can either use a set of forward recursions or, in analogy to what we did in Chapter 6, we can jointly select the best phone and fine-level components of the mode:

(\hat{\alpha}_k^1, \hat{A}_k^2, \hat{S}_k^2) = \arg\max_{\alpha_k^1, A_k^2, S_k^2} p(z_k^1, \alpha_k^1, s_k^1)\, p(Z_k^2 \mid A_k^2, S_k^2)\, p(A_k^2 \mid \alpha_k^1)\, p(S_k^2 \mid A_k^2, s_k^1).    (7.18)

In this case, we effectively use only the most dominant component of the mixture distribution (7.13). This search problem can be solved using either a DP search or the split-and-merge algorithm, and must be repeated for all phoneme candidates. An alternative and simpler approach is to assume that the fine-level segmentation is deterministic, e.g., a linear subdivision of the coarse-level segmentation. In this case, the search in (7.18) is over the class labels \alpha_k = \{\alpha_k^1, A_k^2\} only.
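To make the simplified search concrete, the toy sketch below implements the deterministic-segmentation variant of (7.18): each allowed microsegment label sequence is scored over a linear subdivision of the segment, and the best-scoring phone is returned. The unit-variance Gaussian frame scores and the model inventory are invented for illustration, and the grammar terms are omitted.

```python
import math

# Toy classifier for the linear-segmentation special case: search over
# phones and their allowed microsegment mean sequences only, with the
# fine-level boundaries fixed by a linear subdivision of the frames.

def log_gauss(x, mean, var=1.0):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def classify(frames, phone_models):
    """phone_models: phone -> list of allowed microsegment-mean sequences."""
    best = None
    for phone, seqs in phone_models.items():
        for means in seqs:                         # candidate A_k^2 for this phone
            n = len(means)
            bounds = [round(i * len(frames) / n) for i in range(n + 1)]
            score = sum(log_gauss(x, m)
                        for m, (b, e) in zip(means, zip(bounds, bounds[1:]))
                        for x in frames[b:e])
            if best is None or score > best[0]:
                best = (score, phone)
    return best[1]
```

The cost of this search is linear in the total number of allowed microsegment sequences, which is why bounding the number of microsegments per phone keeps classification cheap.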

7.3.3 Training

Training the microsegment models when the phonetic transcriptions are given is analogous to training a segment-based model without transcriptions, except that the "hidden" variables now include the microsegment labels, in addition to the


fine-level segmentation. Thus, it can be solved using either an EM or a segmental k-means procedure, as we described in Chapter 5. Initial estimates for the labels can be obtained using the k-means¹ clustering algorithm, with initial estimates obtained from binary-tree (divisive) clustering, as described in [59]. For completeness, we outline the segmental k-means procedure below, since it is more convenient to extend to an arbitrary number of levels.

Segmental k-means Training of Microsegment Models

Step 1: For all segments (\alpha_k^1, s_k^1), use the current parameter estimates \theta to find (\hat{A}_k^2, \hat{S}_k^2) = \arg\max_{A_k^2, S_k^2} p(Z_k, \alpha_k^1, A_k^2, s_k^1, S_k^2 \mid \theta).

Step 2: Reestimate the parameters using \theta' = \arg\max_{\theta} p(Z_k, \alpha_k^1, \hat{A}_k^2, s_k^1, \hat{S}_k^2 \mid \theta).

Step 3: If not converged, set \theta \leftarrow \theta' and go to Step 1.
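The alternate-assign/reestimate structure of the procedure above can be illustrated with a scalar toy version (ours, not the thesis implementation): Step 1 assigns each microsegment, here a fixed chunk of frames, to the best label under the current means; Step 2 reestimates the means from the assignments; iteration stops when the assignments no longer change.

```python
# Scalar toy version of segmental k-means: chunks play the role of
# microsegments, means play the role of the model parameters theta.

def segmental_kmeans(chunks, means, max_iter=20):
    assign = None
    for _ in range(max_iter):
        # Step 1: best label for each chunk under the current parameters
        new_assign = [min(range(len(means)),
                          key=lambda j: sum((x - means[j]) ** 2 for x in chunk))
                      for chunk in chunks]
        if new_assign == assign:               # Step 3: convergence test
            break
        assign = new_assign
        # Step 2: reestimate each mean from the frames assigned to it
        for j in range(len(means)):
            pts = [x for chunk, a in zip(chunks, assign) if a == j for x in chunk]
            if pts:
                means[j] = sum(pts) / len(pts)
    return means, assign
```

Each iteration cannot decrease the joint likelihood being maximized (here, cannot increase the total squared error), which is what guarantees convergence of the procedure.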

It is rather straightforward, as we have already mentioned, to extend the algorithm above to the case when phonetic transcriptions are not given. A global segmental k-means scheme would involve searching at Step 1 over segmentations at the coarser level also, and using the estimated phonetic boundaries in Step 2 of the previous algorithm. Thus, the main difference is in the joint search

(\hat{S}^1, \hat{A}^2, \hat{S}^2) = \arg\max_{S^1, A^2, S^2} p(Z, A^1, A^2, S^1, S^2 \mid \theta),    (7.19)

which we discuss further for the general case in the following section.

which we discuss further for the general case in the following section. The k-means algorithm that we refer to here is the one used in clustering or vector quantizer design, and should not be confused with the segmental k-means algorithm we use for the training of microsegment models. 1


7.3.4 N-level Models

For the two-level case, both training (via the segmental k-means algorithm) and recognition actually reduce to maximizing (7.19) with respect to some of the discrete components of the mode. The difference between the recognition problem and (7.19) is that we search over all mode components (A^1, S^1, A^2, S^2) rather than (S^1, A^2, S^2) alone. At the coarser, phonetic level, the search can in both cases be performed using a DP search, a split-and-merge scheme, or any other search method. In any case, as we saw in Chapter 6, the main component of the search is the segment classification, which in the two-level model is described by (7.18). This classification, as we saw, itself involves a search that can be solved using one of the search algorithms that we examined in Chapter 6. Under the general assumptions of the embedded segment models, we can easily see that this scheme generalizes to an arbitrary number of levels. The magic word here is, of course, recursion. If a DP search is adopted for all levels, the cost evaluation of the k-th segment at level j will involve a DP search over the label sequence and segmentation within its boundaries at level j + 1. Such recursive DP schemes can be used to solve the more general shortest-path problem on hierarchically defined graphs; see for example [53]. A recursive DP scheme can, however, become computationally expensive as the number of levels increases, and other approaches, like multigrid methods [18], can prove more efficient in solving the recognition problem in the general multi-level case. An alternative solution would be to use a level-recursive version of the split-and-merge algorithm.


7.4 Experimental Results

The experimental results in this section are only intended to provide a feasibility study of the embedded-segment models that we have introduced. The two-level model does not capture the intersegmental statistical dependencies, since the largest modeling unit is the phone. Nevertheless, we can use the familiar phone classification task in order to evaluate the performance of the proposed framework at the finer scale. For the experiments of this chapter we had available the second release of the TIMIT database, with a designated set of utterances for testing and new and improved phonetic transcriptions. In model development we used the "core" test set, consisting of one female and two male speakers from each of the eight dialect regions, whereas we shall report our best result on the full, 168-speaker test set. In most of the experiments we used the same single-rate signal processing that we used in the experiments of previous chapters, and we did not include any coarse-level features. We shall compare the results of these experiments to the correlation models of Chapter 4. We also performed some limited experiments with coarse-level features, on which we shall comment at the end of this section.

Microsegment-Model Development

The first series of experiments is related to the development of the two-level model. In the implementation of the two-level model, the number of regions in each phone was constrained to be less than or equal to three. In other words, all allowable microsegment sequences consisted of at most three models. The model sets for each of the three regions (begin, middle, and end) were either disjoint, or the models from all regions were lumped together, in which case we have an increased cost in recognition. The type of segment-based model used for the microsegments in all experiments was an independent-frame segment model. In our first experiments we tried to initialize a fixed number of microsegment


models using a binary clustering scheme without making any phonetic distinctions. For each region, the corresponding part (begin, middle, or end) of every phone occurrence in the database was included in a single cluster, after linear warping to a fixed length. Then the desired number of clusters was obtained using binary-tree, or divisive, clustering. This approach, however, did not give good results: a large number of microsegments would be assigned to phones with a large variance, whereas phones with small variance would share the same microsegments, and the recognizer could not distinguish between them. Sharing the same microsegments among different phone models puts too much emphasis for discrimination on the grammar probabilities. The score for large numbers of features is dominated by the Gaussian terms, and relying solely on the grammar scores for discrimination leads to degraded performance. Using 18 cepstral coefficients and the same set of models for all three regions, this two-level model achieved a classification accuracy of 57.3%, whereas the independent-frame model with a comparable number of parameters had a 59.3% classification rate. Our solution to this problem was to start with the samples of each phone in a different cluster, i.e., to start with a tree with 61 nodes and one microsegment model for each phone and region. From this point, the number of clusters was increased using divisive clustering. The criterion used at each iteration for the selection of the cluster to split was the total cluster distortion [34], and the features were initially weighted by their inverse standard deviation. After the desired number of clusters was attained, a k-means clustering procedure was performed on the samples of each phone separately, using as initial estimates the final nodes of the binary clustering. The procedure was repeated for each of the three regions, and we had different models for each region.
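The initialization just described can be caricatured on scalar data: repeatedly split the cluster with the largest total distortion by perturbing its centroid and reassigning its members. The data and perturbation size below are illustrative; the thesis's procedure operates on linearly warped, variance-weighted feature vectors.

```python
# Toy divisive (binary-tree) clustering: at each step, split the cluster
# with the largest total squared distortion into two, using perturbed
# copies of its centroid as the two new seeds.

def centroid(points):
    return sum(points) / len(points)

def distortion(points, c):
    return sum((x - c) ** 2 for x in points)

def divisive_clustering(clusters, target_k, eps=0.01):
    """clusters: list of lists of scalar samples (e.g., one list per phone)."""
    while len(clusters) < target_k:
        i = max(range(len(clusters)),
                key=lambda j: distortion(clusters[j], centroid(clusters[j])))
        c = centroid(clusters[i])
        lo, hi = [], []
        for x in clusters[i]:                  # assign to perturbed centroids
            (lo if abs(x - (c - eps)) <= abs(x - (c + eps)) else hi).append(x)
        if lo and hi:
            clusters[i:i + 1] = [lo, hi]       # replace the cluster by its halves
        else:
            break                              # cluster cannot be split further
    return clusters
```

In the thesis's version, each split would be followed by a k-means refinement pass over the affected samples.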
This clustering scheme creates an asymmetric model, in the sense that we have different numbers of microsegment models associated with each phone. Having the same number of models associated with each phone performed significantly worse. With 3 times as many parameters


as for the independent-frame model (i.e., 3 microsegments per phone model and region on average), the asymmetric model achieved 62.8% classification, and the symmetric model achieved 59.9%. Since different phones had different numbers of models associated with them, the grammar probabilities introduce a penalty for phones that have many alternative pronunciations. In order to avoid this, we used the grammar probabilities only for finding the best microsegment sequence for each phone. The final score for each phone then consisted of the raw acoustic scores of the microsegments on this best path, without the grammar probabilities. Of course, this would not be necessary if, instead of maximization, we performed a summation over all possible microsegment sequences in recognition, i.e., if we used (7.17) instead of (7.18). We also found that there was no decrease in performance when the fine-level segmentation was constrained to be linear. This significantly reduces the complexity of the recognition search, since we must only search over microsegment label sequences within each phonetic segment. This result supports the use of linear, rather than dynamic, time warping in the single-level SSM. Although the segmentation is deterministic, the two-level model still differs from the previous SSM cases through the use of a stochastic grammar. Next, we evaluated the performance of the two-level model for different average numbers of microsegments per phone region. The results are summarized in Table 7.1. The first model has one microsegment model per phone and region, and is, therefore, identical to the independent-frame SSM.

Comparison to Other Models

In this section, we shall compare the performance of the two-level model to some more traditional mixture models and to the linear models we introduced in Chapter 4. For the mixture-model comparison, we implemented a Gaussian-mixture SSM,


Number of models   % correct, 61 classes   % correct, 39 classes
       1                   47.8                    59.3
       2                   51.8                    61.9
       3                   52.8                    62.8

Table 7.1: Comparison of microsegment models for different average numbers of models per phone region, using 18 cepstral coefficients.

where the mixture distribution of each frame consisted of 3 Gaussians. For compatibility with the microsegment results, we used only the most dominant component of each mixture in recognition. The classification rate of this model was 60.5%, as opposed to the 62.8% of the microsegment model. We anticipate, however, that the difference between these two numbers would be significantly larger if models with correlation modeling were used for the microsegments in the two-level model. Since the advantage of the two-level model is that it is a more structured model than the mixture SSM, we also considered the other extreme of having a single region for each phone model. In this case, the model is equivalent to a mixture of segment distributions for each phone. The performance of this model was 62.0%, which is close to the performance of the model with multiple regions. Next, we compare the performance of the two-level model to the models with direct correlation modeling, using only cepstral coefficients in the feature set. Our results are summarized in Table 7.2. There, we compare the classification performance with 18 cepstral coefficients for the baseline SSM, the target-state (or linear hierarchical) segment model, the correlation-invariance DS segment model and the two-level microsegment model. The performance of the DS segment model with cepstral coefficients only is by far the best, representing a 13% reduction in error rate for 39 classes over the basic SSM, whereas the two-level model with the


Model assumptions        Number of parameters   % correct, 61 classes   % correct, 39 classes
Independent frame                 N                     47.8                    59.3
Target state                      3N                    51.6                    61.4
Dynamical System (CI)             3N                    56.0                    64.9
Microsegment                      3N                    52.8                    62.8

Table 7.2: Comparison of the microsegment model to the linear models and the baseline SSM. Entries are phone classification rates for 18 cepstral coefficients and approximate numbers of free parameters.

same number of parameters achieves a 9% reduction. Thus, we can conclude that correlation modeling is more important than having mixture distributions for the regions of a segment. The two-level model can also benefit from short-term correlation modeling. The DS segment model captures the short-term dependencies, whereas the two-level model can capture longer-term dependencies through the statistics of the mode sequence. Since our experiments showed that short-term correlation modeling within a segment is more significant, we investigated the simple extension of using derivatives in the feature set. With 18 cepstra, 18 derivative cepstra and the derivative of log-energy in the feature set, the two-level model achieved 65.5% and 74.1% classification accuracy for 61 and 39 classes, respectively, on the full test set of the new release of the TIMIT database. For comparison, on the old test set, where the basic SSM and the DS segment model achieved 72.1% and 73.9% respectively, the two-level model had a 74.7% classification rate. Thus, when some form of correlation modeling is included in the two-level model, it outperforms the DS segment model and the baseline SSM. It is possible that the performance will improve if the two approaches are combined by using explicit short-term correlation modeling in the two-level model.
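Derivative features of the kind used above are conventionally computed by a linear regression over a short window of frames. A sketch, under the assumption of a +/-2 frame regression window with edge replication (the window length actually used is not stated here):

```python
import numpy as np

def delta(features, M=2):
    """Regression-based derivative of a (T, d) feature track over a
    window of +/-M frames, with edge frames replicated."""
    T = len(features)
    pad = np.concatenate([np.repeat(features[:1], M, axis=0),
                          features,
                          np.repeat(features[-1:], M, axis=0)])
    # numerator: sum_m m * (c[t+m] - c[t-m]); denominator: 2 * sum_m m^2
    num = sum(m * (pad[M + m:M + m + T] - pad[M - m:M - m + T])
              for m in range(1, M + 1))
    return num / (2 * sum(m * m for m in range(1, M + 1)))
```

The 37-dimensional feature set of the experiment would then be assembled as, e.g., `np.hstack([cep, delta(cep), delta(log_energy[:, None])])`.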

Coarse-Level Features

We also tried to incorporate coarse-level features in our models. Specifically, we used at the coarse level cepstral coefficients obtained from a spectrum averaged over the whole duration of each phone (i.e., we used a variable-rate analysis). Including these features in the two-level model improved the performance of the model with a single pool of microsegments shared among all phones from 57.3% to 62.5%. However, the coarse-level features made little difference in our best-case two-level model with cepstra only (63.2% versus 62.8%), and no difference with cepstra and derivatives in the fine-level feature set. Nevertheless, our limited experimentation does not diminish the significance of incorporating more global features, which remains an area of ongoing research.
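One plausible reading of this variable-rate analysis is sketched below: average the frame-level log spectra over the whole segment and take low-order DCT coefficients as a single coarse cepstral vector per phone. Whether the averaging is done in the log or linear spectral domain, and the unnormalized DCT-II form of the cepstra, are assumptions of this sketch, not details taken from the thesis.

```python
import numpy as np

def dct_ii(x, n_out):
    """First n_out DCT-II coefficients of a 1-D array (no normalization)."""
    N = len(x)
    k = np.arange(n_out)[:, None]
    n = np.arange(N)[None, :]
    return np.cos(np.pi * k * (2 * n + 1) / (2 * N)) @ x

def coarse_cepstra(frame_log_spectra, n_cep=18):
    """One coarse cepstral vector per phone segment: cepstra of the
    log spectrum averaged over the whole segment duration."""
    avg = np.asarray(frame_log_spectra).mean(axis=0)
    return dct_ii(avg, n_cep)
```

Because the average is over all frames of the segment regardless of its length, this yields exactly one coarse observation per phone, matching the variable-rate analysis described in the text.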

7.5 Discussion

In this chapter we proposed an extension of the single-level segment models and introduced a framework that can be used to model longer-term dependencies. We described alternative assumptions for the output distribution, as well as recognition and training algorithms for the general N-level hierarchy. We used this framework in a two-level phonetic model that we compared in phonetic classification experiments to our previous models, demonstrating the feasibility of the approach. We found that explicit correlation modeling is more important than having mixture distributions for the regions of a segment, and we also saw that by having multiple microsegments per phone we can obtain improved performance over our previous single-level models. This suggests that the dynamical system segment model can also benefit from using mixture distributions or context-dependent modeling, so that the observations within a single phonetic segment can be better represented by a single Gaussian. However, these models will be of greater value when they are extended to higher levels, and when features from higher scales are included in the representation of speech.

Chapter 8

Conclusions

Instead of using the standard HMM approach to CSR, which has been thoroughly investigated, we have chosen in this dissertation to follow a less common, segment-based approach. We believe that this research contributes significantly to the speech recognition field, and impacts other areas in pattern recognition as well. In this chapter we first summarize the contributions of this work, which are both theoretical and experimental, and then highlight the many alternatives that it offers for extension.

8.1 Thesis Contributions

In Chapter 3 we showed that the output-independence assumption inherent in HMM modeling is not valid. We also demonstrated that linear approximations are sufficient for modeling the statistical dependency of observations extracted from the same segment of speech. Therefore, if one is willing to make specific assumptions about the segmental distribution (the approach that we followed in this work), then linear models will probably work as well as any other class of models in representing the intrasegmental dependencies.

Based on these findings, we introduced a dynamical-system model for speech segments. In order to implement the model efficiently, it was necessary to make specific assumptions about the evolution of the spectral dynamics within a segment of speech. This part of the research showed that the previous assumption of the SSM, that all realizations of a phone are undersampled versions of a fixed trajectory in phase space, was not accurate. Instead, we found that the correlation between consecutive states stays invariant when the duration of the segment changes. This part of our work is useful to the entire speech recognition community, regardless of the particular approach.

Using the dynamical-system segment model, we demonstrated the importance of incorporating time correlation in segment-based models through phone classification experiments on a well-established speech database. Since the baseline SSM has similar performance to a continuous-density Gaussian HMM, our improvement in performance through correlation modeling represents an improvement over HMMs. Our results represent the state of the art in phonetic recognition using context-independent models, and are comparable to those of more complex systems that do use context information. Improved acoustic modeling helps reduce the error rate in word recognition. The SSM is definitely a more accurate acoustic model than discrete-density HMMs, as was shown in [67], and, in a recently developed hybrid system [66], it has helped improve the accuracy of one of the top-of-the-line HMM word recognition systems.

In Chapter 5 we developed the training algorithm for this model. The parameter estimation problem in this case was solved with a non-traditional approach to the classical system identification problem, based on the EM algorithm. This algorithm is clearly applicable to more general problems.
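For a scalar special case, an EM iteration of this kind can be sketched as below, following the Shumway and Stoffer formulation [85]: the E-step runs a Kalman filter and RTS smoother (including lag-one covariances), and the M-step updates the parameters in closed form. The thesis treats the vector-valued case; this scalar sketch with a fixed unit observation matrix is only an illustration of the structure of the algorithm.

```python
import numpy as np

def kalman_em_step(y, F, Q, R, mu0, P0):
    """One EM iteration for the scalar model x[t] = F x[t-1] + w, y[t] = x[t] + v.
    Returns updated (F, Q, R, mu0, P0) and the log-likelihood of y under the
    parameters passed in (so successive calls yield a nondecreasing sequence)."""
    T = len(y)
    xp, Pp = np.empty(T), np.empty(T)   # one-step predictions
    xf, Pf = np.empty(T), np.empty(T)   # filtered estimates
    K = np.empty(T)
    ll = 0.0
    for t in range(T):                  # E-step, part 1: forward Kalman filter
        xp[t] = mu0 if t == 0 else F * xf[t - 1]
        Pp[t] = P0 if t == 0 else F * F * Pf[t - 1] + Q
        S = Pp[t] + R
        e = y[t] - xp[t]
        ll += -0.5 * (np.log(2 * np.pi * S) + e * e / S)
        K[t] = Pp[t] / S
        xf[t] = xp[t] + K[t] * e
        Pf[t] = (1 - K[t]) * Pp[t]
    xs, Ps = xf.copy(), Pf.copy()       # E-step, part 2: RTS smoother
    J = np.zeros(T)
    for t in range(T - 2, -1, -1):
        J[t] = Pf[t] * F / Pp[t + 1]
        xs[t] = xf[t] + J[t] * (xs[t + 1] - xp[t + 1])
        Ps[t] = Pf[t] + J[t] ** 2 * (Ps[t + 1] - Pp[t + 1])
    Plag = np.zeros(T)                  # lag-one smoothed covariances
    Plag[T - 1] = (1 - K[T - 1]) * F * Pf[T - 2]
    for t in range(T - 2, 0, -1):
        Plag[t] = Pf[t] * J[t - 1] + J[t] * (Plag[t + 1] - F * Pf[t]) * J[t - 1]
    S00 = np.sum(Ps[:-1] + xs[:-1] ** 2)    # M-step: closed-form updates
    S11 = np.sum(Ps[1:] + xs[1:] ** 2)
    S10 = np.sum(Plag[1:] + xs[1:] * xs[:-1])
    F_new = S10 / S00
    Q_new = (S11 - F_new * S10) / (T - 1)
    R_new = float(np.mean((y - xs) ** 2 + Ps))
    return F_new, Q_new, R_new, xs[0], Ps[0], ll
```

The simplicity of the iteration, compared with gradient-based likelihood maximization for state-space models, is the point made in the text: each step is a smoothing pass plus a few sums.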
The complexity of the classical solutions to the system identification problem has resulted in the limited use of state-space stochastic systems in modeling random processes; other, smaller classes of models (e.g., autoregressive) have been used more extensively because of the existence of efficient algorithms for the parameter estimation problem. Our approach to the problem, however, is much simpler and has good convergence properties. It can, therefore, revive interest in these types of models. Moreover, there is a theoretically appealing analogy between this work and the parameter estimation problem in hidden Markov models.

In Chapter 6 we developed fast search algorithms to address the computational problems. In particular, we introduced a new local search algorithm for the joint segmentation and recognition problem, namely the Split-and-Merge search algorithm, and a fast classification algorithm. We also presented in the same chapter a theoretical analysis of local search algorithms for the joint segmentation and recognition problem. The performance of the suboptimal search algorithm in terms of percent accuracy was similar to that obtained by the optimal Dynamic-Programming search in phone recognition experiments. In terms of computation savings, the Split-and-Merge search was from 3 times (for simple, independent-frame models) to 20 times (for the more complex dynamical system model) faster than the DP search. The Split-and-Merge search algorithm is also applicable to other recognition systems, such as HMMs.

In Chapter 7 we proposed a framework for the extension of the single-level segment-based models to multiple levels. This generalization will prove useful when features from higher levels are included in the speech representation. The multi-level model can also capture more global sources of variability. In order to evaluate the usefulness of the approach, we implemented a two-level phonetic model. Our results showed that further improvements are possible if, in addition to the short-term correlation modeling that our dynamical-system model provides, longer-term statistical dependencies are modeled through the statistics of a multi-level mode process.
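As a caricature of the Split-and-Merge idea summarized above (not the algorithm of Chapter 6, which scores segments with the actual acoustic models and grammar and constrains the candidate boundaries), a local search over segment boundaries might look like the following, where `score(a, b)` is an assumed stand-in for the best model score of frames [a, b):

```python
def split_and_merge(score, bounds):
    """Local search over a segmentation: alternately try merging adjacent
    segments and splitting segments at their best internal boundary,
    accepting any move that increases the total score.  `bounds` is a
    sorted list of boundary frames, e.g. [0, ..., T]."""
    bounds = list(bounds)
    improved = True
    while improved:
        improved = False
        i = 0
        while i + 2 < len(bounds):      # merge pass
            a, b, c = bounds[i], bounds[i + 1], bounds[i + 2]
            if score(a, c) > score(a, b) + score(b, c):
                del bounds[i + 1]
                improved = True
            else:
                i += 1
        i = 0
        while i + 1 < len(bounds):      # split pass
            a, c = bounds[i], bounds[i + 1]
            gains = [(score(a, b) + score(b, c) - score(a, c), b)
                     for b in range(a + 1, c)]
            best = max(gains, default=(0.0, None))
            if best[1] is not None and best[0] > 1e-12:
                bounds.insert(i + 1, best[1])
                improved = True
            i += 1
    return bounds
```

The procedure converges to a local optimum of the segmentation score, which is why the text characterizes it as a suboptimal but much faster alternative to the full dynamic-programming search.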


8.2 Suggestions for Further Research

There are many directions in which the work presented in this dissertation could be continued. The comparison of correlation assumptions showed that it is better to consider the correlation between consecutive frames as invariant across different segment lengths. We modeled this assumption by repeating the transition matrices according to a mapping that was linear in time and common for all phones. This is probably not the best mapping: the evolution of the dynamics is strongly dependent on the particular phone, and the part of the trajectory that is followed for short occurrences depends on the current, as well as the surrounding, phone labels. Thus, phone-dependent warpings should be investigated, and they could be determined either with the aid of speech knowledge or automatically.

In implementing the DS segment model for recognition, we observed a degradation in performance because of the reinitialization of the dynamics at the segment boundaries. Moreover, the phone-transition dynamics carry additional information that is useful in a continuous recognition environment. However, modeling these transitions can only be done if the models are context-dependent. This is, perhaps, the most promising direction in which to extend the dynamical system segment model. Coarticulation modeling has been the biggest gain in speech recognition in the past few years, and lately a significant improvement in the performance of an independent-frame SSM has been achieved using context-dependent models by Kimball et al. [46].

In terms of the search algorithms, we used the Split-and-Merge segmentation algorithm for phone recognition. However, most real-world applications use word recognition. The complexity of segment-based models makes it necessary to significantly limit the search space before they can be used in recognition, an approach followed by Ostendorf et al. in [66].
An alternative approach would be to use the phone-segment lattice obtained after convergence of the Split-and-Merge algorithm and perform word recognition based on this lattice. This is also the approach followed in the MIT SUMMIT system [92]. We believe that the phone-segment lattice obtained by the Split-and-Merge algorithm is sufficiently rich that there are few unrecoverable errors in the phonetic recognition part, particularly if multiple word pronunciations are used.

The good performance of the two-level microsegment model in Chapter 7 suggests that segment-level mixture distributions can better represent the distribution of a phonetic segment than a single Gaussian. This observation can be combined with short-term correlation modeling. In terms of the multi-level model, the linear segment models could replace the independent-frame models that we used at the lower level in future work. In terms of the dynamical-system model, a possible extension in this direction is to replace the initial distribution at each segment by a Gaussian mixture.

In this thesis, we only performed a small number of experiments with multi-level models. Nevertheless, we have provided the theoretical framework, and there are enormous possibilities for using this model in recognition. Multi-rate features can be investigated and modeled using embedded segments. Intersegmental dependencies can be represented by extending the model to levels above the phone. Different distribution assumptions for the segmental distributions at the different levels can also be the subject of a research project. And, finally, further development and implementation of the various search algorithms that we outlined in Chapter 7 for the general multi-level case is needed.

8.3 Epilogue

Recently, many different approaches to the acoustic modeling problem in speech recognition, such as HMMs, neural networks, and segment-based models, including the ones that we examined in this thesis, have reported phonetic recognition results that all share a common characteristic: there seems to be an upper bound in performance of about 75% for recognition of the English phonemes when there are no constraints on the allowable phone sequences. This represents a significant improvement over the 60% phonetic accuracy that was the state of the art in the 1970s. However, the performance of expert spectrogram readers in phonetic recognition, based on the information provided by the spectrogram alone, ranges between 80% and 90% [91]. The advantage of the human experts in these experiments was that they had a global view of the spectrogram over a whole utterance. Thus, we believe that in order to obtain machine performance close to that of the human experts, it is necessary to eliminate the intersegmental variability. This, we believe, will be the most promising area of research in acoustic modeling in the following years, and the area from which the breakthroughs in performance will come.

References

[1] H. Akaike, "Maximum Likelihood Identification of Gaussian Autoregressive Moving Average Models," in Biometrika, Vol. 60, No. 2, pp. 255-265, 1973.

[2] H. Akaike, "A New Look at the Statistical Model Identification," in IEEE Trans. Automatic Control, Vol. AC-19(6), pp. 716-723, December 1974.

[3] T. W. Anderson, An Introduction to Multivariate Statistical Analysis, 2nd Edition, Wiley, New York, 1984.

[4] K. J. Astrom and T. Bohlin, "Numerical Identification of Linear Dynamic Systems from Normal Operating Records," in Proc. 2nd IFAC Symposium on the Theory of Self-Adaptive Control Systems, NPL Teddington, England, Plenum Press, New York, pp. 96-111, 1965.

[5] S. Austin, J. Makhoul and R. Schwartz, "Continuous Speech Recognition Using Segmental Neural Nets," in Proc. of the DARPA Workshop on Speech and Natural Language, pp. 249-252, February 1991.

[6] L. R. Bahl, R. Bakis, P. S. Cohen, A. G. Cole, F. Jelinek, B. L. Lewis and R. L. Mercer, "Continuous Parameter Acoustic Processing for Recognition of a Natural Speech Corpus," in IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 1149-1152, Atlanta, Georgia, April 1981.

[7] L. Bahl, J. R. Bellegarda, P. V. de Souza, P. S. Gopalakrishnan, D. Nahamoo and M. A. Picheny, "A New Class of Fenonic Markov Word Models for Large Vocabulary Continuous Speech Recognition," in IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 177-180, Toronto, Canada, May 1991.

[8] L. Bahl, P. S. Gopalakrishnan, D. Kanevsky and D. Nahamoo, "Matrix Fast Match: A Fast Method for Identifying a Short List of Candidate Words for Decoding," in IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 345-348, Glasgow, Scotland, May 1989.

[9] L. R. Bahl, F. Jelinek and R. L. Mercer, "A Maximum Likelihood Approach to Continuous Speech Recognition," in IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. PAMI-5(2), pp. 179-190, March 1983.

[10] J. K. Baker, "The Dragon System - An Overview," in IEEE Trans. on Acoust., Speech and Signal Proc., Vol. ASSP-23(1), pp. 24-29, February 1975.

[11] M. Basseville, A. Benveniste, K. C. Chou, S. A. Golden, R. Nikoukhah and A. S. Willsky, "Modeling and Estimation of Multiresolution Stochastic Processes," submitted manuscript.

[12] L. E. Baum, T. Petrie, G. Soules and N. Weiss, "A Maximization Technique in the Statistical Analysis of Probabilistic Functions of Finite State Markov Chains," in Ann. Math. Stat., Vol. 41, pp. 164-171, 1970.

[13] R. Bellman, Dynamic Programming, Princeton University Press, 1957.

[14] D. P. Bertsekas, Dynamic Programming: Deterministic and Stochastic Models, Prentice-Hall, New Jersey, 1987.

[15] E. Bocchieri and G. Doddington, "Frame-specific statistical features for speaker-independent speech recognition," in IEEE Trans. Acoust., Speech and Signal Proc., Vol. ASSP-34(4), pp. 755-764, August 1986.


[16] A. Brandt, D. Ron and D. Amit, "Multi-level Approaches to Discrete-state and Stochastic Problems," in Proceedings of the 2nd European Conference on Multigrid Methods, Cologne, October 1985, Springer-Verlag, Berlin, 1985.

[17] L. Breiman and J. Friedman, "Estimating Optimal Transformations for Multiple Regression and Correlation," in J. Amer. Stat. Assoc., Vol. 80, pp. 580-607, September 1985.

[18] W. L. Briggs, A Multigrid Tutorial, SIAM, Philadelphia, PA, 1987.

[19] P. F. Brown, The Acoustic-Modeling Problem in Automatic Speech Recognition, Ph.D. Thesis, Computer Science Department, CMU, May 1987.

[20] N. G. De Bruijn, "Uncertainty Principles in Fourier Analysis," in Inequalities, O. Shisha, Ed., Academic Press, NY-London, 1967.

[21] M. A. Bush and G. E. Kopec, "Network-based Connected Digit Recognition," in IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-35(10), pp. 1401-1413, October 1987.

[22] P. E. Caines, Linear Stochastic Systems, John Wiley & Sons, 1988.

[23] K. C. Chou and A. S. Willsky, "Kalman Filtering and Riccati Equations for Multiscale Processes," in IEEE Intern. Conf. on Decision and Control, pp. 841-846, Honolulu, Hawaii, December 1990.

[24] Y. L. Chow, M. O. Dunham, O. Kimball, M. Krasner, F. Kubala, J. Makhoul, S. Roucos and R. M. Schwartz, "BYBLOS: The BBN Continuous Speech Recognition System," in IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 89-92, Dallas, TX, April 1987.

[25] M. Cohen, H. Murveit, J. Bernstein, P. Price and M. Weintraub, "The Decipher Speech Recognition System," in IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 77-80, Albuquerque, New Mexico, April 1990.


[26] I. Daubechies, "The Wavelet Transform, Time-Frequency Localization and Signal Analysis," in IEEE Trans. Information Theory, Vol. 36(5), pp. 961-1005, September 1990.

[27] A. P. Dempster, N. M. Laird and D. B. Rubin, "Maximum Likelihood Estimation from Incomplete Data," in Journal of the Royal Statistical Society (B), Vol. 39, No. 1, pp. 1-38, 1977.

[28] L. Deng, L. Lennig and P. Mermelstein, "Modeling Microsegments of Stop Consonants in a Hidden Markov Model Based Word Recognizer," in Journal of the Acoustical Society of America, Vol. 87, pp. 2738-2747, 1990.

[29] V. Digalakis, M. Ostendorf and J. R. Rohlicek, "Fast Algorithms for Phone Classification and Recognition Using Segment-Based Models," to appear in IEEE Trans. Signal Processing, December 1992. A shorter version also appeared in Proceedings of the Third DARPA Workshop on Speech and Natural Language, pp. 173-178, June 1990.

[30] V. Digalakis, M. Ostendorf and J. R. Rohlicek, "Improvements in the Stochastic Segment Model for Phoneme Recognition," in Proceedings of the Second DARPA Workshop on Speech and Natural Language, pp. 332-338, October 1989.

[31] V. Digalakis, J. R. Rohlicek and M. Ostendorf, "A Dynamical System Approach to Continuous Speech Recognition," in IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 289-292, Toronto, Canada, May 1991.

[32] V. Digalakis, "Maximum Likelihood Identification of a Dynamical System Model for Speech Using the EM Algorithm," in IEEE International Symposium on Information Theory, Budapest, Hungary, June 1991.

[33] J. L. Doob, Stochastic Processes, Wiley, New York, 1953.


[34] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis, Wiley, New York, 1973.

[35] F. Fallside et al., "Continuous Speech Recognition for the TIMIT Database Using Neural Networks," in IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 445-448, Albuquerque, New Mexico, April 1990.

[36] J. D. Gibson, B. Koo and S. D. Gray, "Filtering of Colored Noise for Speech Enhancement and Coding," in IEEE Trans. on Signal Processing, Vol. 39(8), pp. 1732-1742, August 1991.

[37] N. K. Gupta and R. K. Mehra, "Computational Aspects of Maximum Likelihood Estimation and Reduction in Sensitivity Function Calculations," in IEEE Trans. Automatic Control, Vol. AC-19, No. 6, pp. 774-783, December 1974.

[38] V. N. Gupta, M. Lennig and P. Mermelstein, "Integration of Acoustic Information in a Large Vocabulary Word Recognizer," in IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 697-700, Dallas, TX, April 1987.

[39] S. L. Horowitz and T. Pavlidis, "Picture Segmentation by a Tree Traversal Algorithm," in Journal Assoc. Comput. Mach., Vol. 23, No. 2, pp. 368-388, April 1976.

[40] F. Jelinek, "Continuous Speech Recognition by Statistical Methods," in IEEE Proceedings, Vol. 64, No. 4, pp. 532-556, April 1976.

[41] F. Jelinek and R. L. Mercer, "Interpolated Estimation of Markov Parameters from Sparse Data," in Pattern Recognition in Practice, E. S. Gelsema and L. N. Kanal, Eds., Amsterdam: North-Holland, 1980.

[42] R. E. Kalman, "A New Approach to Linear Filtering and Prediction Problems," in Trans. ASME, Series D, J. Basic Eng., Vol. 82, pp. 35-45, March 1960.


[43] R. L. Kashyap, "Maximum Likelihood Identification of Stochastic Linear Systems," in IEEE Trans. Automatic Control, Vol. AC-15(1), pp. 25-34, February 1970.

[44] S. M. Katz, "Estimation of Probabilities from Sparse Data for the Language Model Component of a Speech Recognizer," in IEEE Trans. Acoust., Speech, Signal Processing, Vol. ASSP-35(3), pp. 400-401, March 1987.

[45] P. Kenny, M. Lennig and P. Mermelstein, "A Linear Predictive HMM for Vector-Valued Observations with Applications to Speech Recognition," in IEEE Trans. on Acoust., Speech and Signal Proc., Vol. ASSP-38(2), pp. 220-225, February 1990.

[46] O. Kimball, M. Ostendorf and I. Bechwati, "Context Modeling with the Stochastic Segment Model," manuscript submitted to IEEE Trans. Signal Processing.

[47] S. Kirkpatrick, C. D. Gelatt Jr. and M. P. Vecchi, "Optimization by Simulated Annealing," in Science, Vol. 220, pp. 671-680, 1983.

[48] R. Kronland-Martinet, J. Morlet and A. Grossman, "Analysis of Sound Patterns through Wavelet Transforms," in Intern. Journ. of Patt. Recog. and Artif. Intel., Vol. 1, No. 2, pp. 97-126, 1987.

[49] L. F. Lamel, R. H. Kassel and S. Seneff, "Speech Database Development: Design and Analysis of the Acoustic-Phonetic Corpus," in Proc. DARPA Speech Recognition Workshop, Report No. SAIC-86/1546, pp. 100-109, February 1986.

[50] K. F. Lee, Large Vocabulary Speaker-independent Continuous Speech Recognition: The Development of the SPHINX System, Ph.D. Thesis, Computer Science Department, CMU, April 1988.


[51] K. F. Lee, "Context-Dependent Phonetic Hidden Markov Models for Speaker-Independent Continuous Speech Recognition," in IEEE Trans. on Acoust., Speech and Signal Proc., Vol. ASSP-38(4), pp. 599-609, April 1990.

[52] K. F. Lee and H. W. Hon, "Speaker-independent Phone Recognition Using Hidden Markov Models," in IEEE Trans. on Acoust., Speech and Signal Proc., Vol. ASSP-37(11), pp. 1641-1648, November 1989.

[53] T. Lengauer, Combinatorial Algorithms for Integrated Circuit Layout, Wiley, New York, 1990.

[54] H. C. Leung and V. Zue, "Phonetic Classification Using Multi-Layer Perceptrons," in IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 525-528, Albuquerque, New Mexico, April 1990.

[55] S. E. Levinson, "Structural Methods in Automatic Speech Recognition," in Proceedings of the IEEE, Vol. 73(11), pp. 1625-1650, November 1985.

[56] J. S. Lim and A. V. Oppenheim, "All-pole modeling of degraded speech," in IEEE Trans. Acoustic Speech and Signal Processing, Vol. ASSP-26(6), pp. 197-210, June 1978.

[57] L. Ljung, System Identification: Theory for the User, Prentice-Hall, Englewood Cliffs, NJ, 1987.

[58] D. G. Luenberger, Linear and Nonlinear Programming, 2nd Edition, Addison-Wesley, Reading, MA, 1984.

[59] J. Makhoul, S. Roucos and H. Gish, "Vector Quantization in Speech Coding," in Proceedings of the IEEE, Vol. 73(11), pp. 1551-1588, November 1985.

[60] S. Makino and K. Kido, "Recognition of Phonemes Using Time-Spectrum Pattern," in Speech Communication, Vol. 5, No. 2, pp. 225-238, June 1986.


[61] S. G. Mallat, "A Theory for Multiresolution Signal Decomposition: the Wavelet Representation," in IEEE Trans. Pattern Anal. Mach. Intel., Vol. PAMI-11(7), pp. 674-693, July 1989.

[62] H. Meng and V. Zue, "Signal Representation Comparison for Phonetic Classification," in IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 285-288, Toronto, Canada, May 1991.

[63] N. Merhav and Y. Ephraim, "Hidden Markov Modeling Using the Most Likely State Sequence," in IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 469-472, Toronto, Canada, May 1991.

[64] N. Merhav and Y. Ephraim, "A Bayesian Classification Approach with Application to Speech Recognition," in IEEE Trans. Signal Processing, Vol. 39(10), pp. 2157-2166, October 1991.

[65] M. Ostendorf and V. Digalakis, "The Stochastic Segment Model for Continuous Speech Recognition," in Proceedings of the Twenty-fifth Asilomar Conference on Signals, Systems and Computers, Asilomar, California, November 1991.

[66] M. Ostendorf, A. Kannan, O. Kimball, R. Schwartz, S. Austin and R. Rohlicek, "Integration of Diverse Recognition Methodologies Through Reevaluation of N-Best Sentence Hypotheses," in Proceedings of the DARPA Workshop on Speech and Natural Language, Monterey, pp. 83-87, February 1991.

[67] M. Ostendorf and S. Roukos, "A Stochastic Segment Model for Phoneme-based Continuous Speech Recognition," in IEEE Trans. Acoustic Speech and Signal Processing, Vol. ASSP-37(12), pp. 1857-1869, December 1989.

[68] C. H. Papadimitriou and K. Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Prentice-Hall, New Jersey, 1982.


[69] A. Papoulis, Probability, Random Variables, and Stochastic Processes, 2nd ed., McGraw-Hill, New York, 1984.

[70] T. Parsons, Voice and Speech Processing, McGraw-Hill, New York, 1986.

[71] D. B. Paul, J. K. Baker and J. M. Baker, "On the Interaction Between True Source, Training and Testing Language Models," in IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 569-572, Toronto, Canada, May 1991.

[72] T. Pavlidis and S. L. Horowitz, "Segmentation of Plane Curves," in IEEE Trans. on Computers, Vol. C-23(8), pp. 860-870, August 1974.

[73] K. M. Ponting and S. M. Peeling, "The Use of Variable Frame Rate Analysis in Speech Recognition," in Computer Speech and Language, pp. 169-179, May 1991.

[74] W. K. Pratt, Digital Image Processing, Wiley, New York, 1978.

[75] L. R. Rabiner and R. W. Schafer, Digital Processing of Speech Signals, Prentice-Hall, New Jersey, 1978.

[76] L. R. Rabiner, B.-H. Juang, S. E. Levinson and M. M. Sondhi, "Recognition of Isolated Digits Using Hidden Markov Models with Continuous Mixture Densities," in AT&T Technical Journal, Vol. 63(7), pp. 1211-1233, July-August 1985.

[77] L. R. Rabiner, J. G. Wilpon and B.-H. Juang, "A Segmental k-means Training Procedure for Connected Word Recognition," in AT&T Technical Journal, pp. 21-40, May-June 1986.

[78] D. Rainton, "Speech Recognition - A Time-Frequency Subspace Filtering Based Approach," in IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 461-464, Toronto, Canada, 1991.


[79] H. E. Rauch, F. Tung and C. T. Striebel, "Maximum Likelihood Estimates of Linear Dynamic Systems," in AIAA Journal, Vol. 3, No. 8, pp. 1445-1450, August 1965.

[80] J. Rissanen, "Universal Coding, Information, Prediction and Estimation," in IEEE Trans. Information Theory, Vol. IT-30(4), pp. 629-636, July 1984.

[81] T. Robinson and F. Fallside, "Phoneme Recognition from the TIMIT Database Using Recurrent Error Propagation Networks," Cambridge University Technical Report No. CUED/F-INFENG/TR. 42, March 1990.

[82] S. Roucos and M. Ostendorf Dunham, "A Stochastic Segment Model for Phoneme-based Continuous Speech Recognition," in IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 73-89, Dallas, Texas, April 1987.

[83] S. Roucos, M. Ostendorf, H. Gish and A. Derr, "Stochastic Segment Modeling Using the Estimate-Maximize Algorithm," in IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 127-130, New York, New York, April 1988.

[84] R. Schwartz, Y. Chow, O. Kimball, S. Roucos, M. Krasner and J. Makhoul, "Context-Dependent Modeling for Acoustic-Phonetic Recognition of Continuous Speech," in IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 1205-1208, Tampa, Florida, March 1985.

[85] R. H. Shumway and D. S. Stoffer, "An Approach to Time Series Smoothing and Forecasting Using the EM Algorithm," in Journal of Time Series Analysis, Vol. 3, No. 4, pp. 253-264, 1982.

[86] G. Strang, "Wavelets and Dilation Equations: A Brief Introduction," in SIAM Review, Vol. 31, No. 4, pp. 614-627, December 1989.

[87] E. F. Velez and R. G. Absher, "Transient Analysis of Speech Signals Using the Wigner Time-Frequency Representation," in IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 2242-2245, Glasgow, Scotland, May 1989.

[88] A. Waibel, T. Hanazawa, G. Hinton, K. Shikano and K. Lang, "Phoneme Recognition Using Time-Delay Neural Networks," in IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 107-110, New York, New York, April 1988.

[89] D. A. Wilson and A. Kumar, "Derivative Computations for the Log Likelihood Function," in IEEE Trans. Automatic Control, Vol. AC-27, No. 1, pp. 230-232, February 1982.

[90] N. R. Sandell, Jr. and K. I. Yared, "Maximum Likelihood Identification of State Space Models for Linear Dynamic Systems," Electron. Syst. Lab., M.I.T., Cambridge, MA, Rep. ESL-R-814, 1978.

[91] V. Zue, "The Use of Speech Knowledge in Automatic Speech Recognition," in Proceedings of the IEEE, Vol. 73(11), pp. 1602-1615, November 1985.

[92] V. Zue, J. Glass, M. Phillips and S. Seneff, "Acoustic Segmentation and Phonetic Classification in the SUMMIT System," in IEEE Int. Conf. Acoust., Speech, Signal Processing, Glasgow, Scotland, May 1989.

[93] V. Zue et al., "Recent Progress on the SUMMIT System," in Proceedings of the Third DARPA Workshop on Speech and Natural Language, June 1990.