IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 7, NO. 2, MARCH 1999


Maximum-Likelihood Stochastic-Transformation Adaptation of Hidden Markov Models

Vassilis D. Diakoloukas and Vassilios V. Digalakis, Member, IEEE

Abstract—The recognition accuracy of recent large-vocabulary automatic speech recognition (ASR) systems is highly sensitive to any mismatch between the training and testing sets. For example, dialect differences between the training and testing speakers result in a significant degradation in recognition performance. Some popular adaptation approaches improve the recognition performance of speech recognizers based on hidden Markov models with continuous mixture densities by using linear transformations to adapt the means, and possibly the covariances, of the mixture Gaussians. The linear assumption, however, is too restrictive, and in this paper we propose a novel adaptation technique that adapts the means and, optionally, the covariances of the mixture Gaussians by using multiple stochastic transformations. We perform both speaker- and dialect-adaptation experiments, and we show that our method significantly improves the recognition accuracy and the robustness of our system. The experiments are carried out with SRI's DECIPHER(TM) speech recognition system.

Index Terms—Speaker adaptation, speech recognition, robust recognition.

I. INTRODUCTION

THE MISMATCH that frequently occurs between the training and testing conditions of an automatic speech recognizer (ASR) can be efficiently reduced by adapting the parameters of the recognizer to the testing conditions. Recently, a family of adaptation algorithms for large-vocabulary, continuous-density hidden Markov model (HMM) based speech recognizers has appeared that is based on constrained reestimation of the distribution parameters [1]-[3]. In these approaches, all the Gaussians in a single mixture, or in a group of mixtures if transformations are tied, are transformed using the same linear transformation. These transformation-based approaches adapt the Gaussians with no samples in the adaptation data by extrapolation, using the samples of other Gaussians that are close in the observation space. Hence, transformation-based adaptation is a fast adaptation scheme, since small amounts of adaptation data are typically required.

Manuscript received June 5, 1997; revised April 10, 1998. This work was performed under contract to SRI International and Telia Research. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Mazin Rahim. The authors are with the Technical University of Crete, Kounoupidiana, Chania 73100, Greece (e-mail: [email protected]). Publisher Item Identifier S 1063-6676(99)01626-0.

The linear assumption, however, may be too restrictive and inadequate in modeling the characteristics of the testing conditions, since in many cases the inconsistency between the training and testing phases is large, and the mapping from the training observation space to the mismatched testing space is too complicated to be modeled by a simple linear transformation. In this paper, we introduce a new adaptation method for continuous-density HMM's: maximum-likelihood stochastic-transformation (MLST) adaptation. MLST adaptation is based on a more complex, piecewise-linear, stochastic transformation of the Gaussians, which consists of a collection of component linear transformations that are shared among all the Gaussians in each mixture. The component transformation applied to each Gaussian is selected probabilistically, based on weight probabilities that are trained from the adaptation data. For the estimation of the transformation parameters and weight probabilities we use the expectation-maximization (EM) algorithm [4]. We evaluate our new method using SRI's DECIPHER(TM) speech recognition system on dialect- and speaker-adaptation experiments, and we find that the new method significantly outperforms previous methods based on a single linear transformation.

A recent literature survey of statistical techniques for robust ASR appeared in [5]. In this paper we focus on transformation-based adaptation. Transformation techniques for mismatch compensation and model adaptation can be used in both the feature and the model spaces. Probabilistic optimum filtering [6] is a technique applied in the feature space that requires parallel recordings of the clean and noisy speech. Maximum-likelihood (ML) techniques can be used to avoid the need for parallel recordings. In the feature space, a simple transformation that consists of a single offset is estimated using ML estimation in [7] and [8], and can be regarded as a generalization of cepstral mean normalization. Stochastic matching [3] treats the offset as a random bias, and utilizes multiple shifts. Multiple transformations are typically used in the model space, where the model identity is used to determine the transformation index.
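The per-sentence cepstral mean normalization that the single-offset transform generalizes can be sketched in a few lines. This is a minimal illustration with function names of our own choosing, not code from [7] or [8]:

```python
# Sketch: a single-offset feature-space transform.  Per-sentence cepstral
# mean normalization is the simplest special case: the bias is the sample
# mean of the cepstral frames, which is then subtracted from every frame.

def estimate_offset(frames):
    """Estimate an additive bias as the per-dimension mean over frames."""
    dim, n = len(frames[0]), len(frames)
    return [sum(f[d] for f in frames) / n for d in range(dim)]

def apply_offset(frames, bias):
    """Normalize the frames: y_t -> y_t - bias."""
    return [[x - b for x, b in zip(f, bias)] for f in frames]

frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # toy 2-D "cepstral" frames
bias = estimate_offset(frames)                   # [3.0, 4.0]
normalized = apply_offset(frames, bias)          # zero mean per dimension
```

An ML estimate of a single additive bias under a fixed Gaussian model reduces to a weighted version of the same sample mean.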
ML transformation-based adaptation of continuous-density HMM’s appeared concurrently in [1] and [2], and became known as maximum-likelihood linear regression (MLLR). The transformation-based approach is combined with MAP adaptation in [9] and [10], and is extended to include biases in both the mel-cepstral and linear spectral domains in [11]. More related to the work presented here is the work of Gales [12], where he deals with the issue of optimal component assignment to a set of transformations, and the nonlinear model-space transformations presented in [13]. This paper is structured as follows. In Section II, we review the linear transformation-based adaptation methods for continuous mixture-density HMM’s which were introduced in [1] and [2]. Section III describes the multiple stochastic transformation-based adaptation scheme. We first show that

1063–6676/99$10.00 © 1999 IEEE


the EM algorithm can be used for the parameter estimation of the transformation in the case of a simple, static Gaussian mixture model, and we then extend the method to time-varying HMM's. We also deal in this section with some implementation issues. In Section IV, we present experimental results comparing the new method to the previous linear transformation-based approaches. Finally, in Section V, we give our conclusions.
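The core mechanism of the method, pushing each mixture Gaussian through several linear transforms chosen with trained weights, can be previewed with a one-dimensional sketch of the adapted mixture density. Names and signatures here are our own illustration, not the paper's implementation:

```python
import math

# One-dimensional sketch of an MLST-style adapted mixture density: every
# mixture Gaussian is pushed through N_c component transforms (a_j, b_j),
# and the transformed densities are mixed with per-Gaussian transformation
# weights lam[j][i].

def gauss(y, mean, var):
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def adapted_density(y, priors, means, variances, a, b, lam):
    """p(y) = sum_i p(w_i) sum_j lam[j][i] N(y; a_j*mu_i + b_j, a_j^2*var_i)."""
    total = 0.0
    for i, (w, mu, v) in enumerate(zip(priors, means, variances)):
        for j, (aj, bj) in enumerate(zip(a, b)):
            total += w * lam[j][i] * gauss(y, aj * mu + bj, aj * aj * v)
    return total
```

With a single component transform and unit weight (`a=[1.0]`, `b=[0.0]`, `lam=[[1.0, ...]]`), the expression reduces to the unadapted mixture density; with general `a`, `b` it reduces to a single linear transformation of the Gaussians.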

II. ADAPTATION USING A LINEAR TRANSFORMATION

Recently developed fast adaptation algorithms [1]-[3] for continuous mixture-density HMM's are based on linearly constrained reestimation of the mixture Gaussians. Maximum-likelihood reestimation of the Gaussians in all these adaptation schemes is performed using the expectation-maximization (EM) algorithm.

The observation densities of the speaker-independent (SI) speech recognition system in continuous mixture-density HMM's have the following form:

b_s(x_t) = \sum_{i=1}^{N_g} p(\omega_{is}) \mathcal{N}(x_t; \mu_{is}, \Sigma_{is})    (1)

where \mathcal{N}(x; \mu, \Sigma) denotes the multivariate normal density with mean vector \mu and covariance matrix \Sigma, N_g is the number of mixture components, \omega_{is} denotes the event that the ith component of state s is used, and p(\omega_{is}) is the mixture weight for the ith component of state s. These models need large amounts of training data for robust estimation of their parameters. However, given a small amount of training data that match the testing conditions, the initial SI system can be adapted to new conditions, like the speaker, channel, or dialect.

In [1], it is assumed that the vector process y_t of the mismatched testing condition can be obtained through a sequence of linear transformations from a corresponding process x_t that matches the training population:

y_t = A_{s_t} x_t + b_{s_t}.    (2)

The transformation used at each time depends on the underlying HMM state s_t, in which case the observation densities of the adapted models can be written

\tilde b_s(y_t) = \sum_{i=1}^{N_g} p(\omega_{is}) \mathcal{N}(y_t; A_s \mu_{is} + b_s, A_s \Sigma_{is} A_s^T)    (3)

where A^T denotes the transpose of a matrix. In [2], the linear constraint is only applied to the means of the adapted observation densities, which become

\tilde b_s(y_t) = \sum_{i=1}^{N_g} p(\omega_{is}) \mathcal{N}(y_t; A_s \mu_{is} + b_s, \Sigma_{is}).    (4)

Closed-form solutions for the reestimation formulae of method (3) can be derived in the case of diagonal transformation and covariance matrices. The reestimation formulae for (4) are simpler, can be used for full transformation matrices, but do not adapt the covariances of the Gaussians. A comparative study of these two approaches was done in [14].

For method (3), where the affine constraint is applied to both the means and covariances of the mixture Gaussians, the following first- and second-order statistics for all multivariate normal densities of all the HMM states must be computed during the kth iteration of the EM algorithm:

S_{is} = \sum_t \gamma_t(s) p(\omega_{is} | y_t, \theta_k) y_t    (5)

V_{is} = \sum_t \gamma_t(s) p(\omega_{is} | y_t, \theta_k) y_t y_t^T    (6)

n_{is} = \sum_t \gamma_t(s) p(\omega_{is} | y_t, \theta_k).    (7)

The quantity \gamma_t(s) = P(s_t = s | Y, \theta_k) is the probability of being at state s at time t given the training data Y and the parameter estimates \theta_k = \{A_s(k), b_s(k)\} of the current iteration k, and is computed using the forward-backward algorithm. The posterior probability p(\omega_{is} | y_t, \theta_k), i = 1, ..., N_g, can be computed using Bayes' rule:

p(\omega_{is} | y_t, \theta_k) = p(\omega_{is}) \mathcal{N}(y_t; A_s(k) \mu_{is} + b_s(k), A_s(k) \Sigma_{is} A_s(k)^T) / \sum_{i'=1}^{N_g} p(\omega_{i's}) \mathcal{N}(y_t; A_s(k) \mu_{i's} + b_s(k), A_s(k) \Sigma_{i's} A_s(k)^T).    (8)

Given these sufficient statistics, the transformation parameters can be reestimated by solving the following system of equations [1], written here element by element for the diagonal case, with sample mean \bar\mu_{is} = S_{is} / n_{is} and sample variance \bar\sigma^2_{is} obtained from (5)-(7):

\sum_{i=1}^{N_g} (n_{is} / \sigma^2_{is}) (\bar\mu_{is} - a_s \mu_{is} - b_s) = 0    (9)

\sum_{i=1}^{N_g} n_{is} [ (\bar\sigma^2_{is} + (\bar\mu_{is} - a_s \mu_{is} - b_s)^2) / \sigma^2_{is} + a_s \mu_{is} (\bar\mu_{is} - a_s \mu_{is} - b_s) / \sigma^2_{is} - a_s^2 ] = 0.    (10)

This is a system of quadratic equations that are decoupled and easy to solve under the assumption of diagonal covariance matrices.

For the second method (4), where the affine constraint is applied only to the means of the Gaussians, only the first-order sufficient statistics given in (5) and (7) must be computed. This should be expected, since this method reestimates only the means, and not the covariances, of the Gaussians. The reestimation equations (M-step) for the transformation parameters


can be written in this case (adapted from [2]):

\sum_{i=1}^{N_g} \Sigma_{is}^{-1} S_{is} \xi_{is}^T = \sum_{i=1}^{N_g} n_{is} \Sigma_{is}^{-1} W_s \xi_{is} \xi_{is}^T    (11)

where W_s = [A_s  b_s] represents the extended matrix of the affine transformation and \xi_{is} = [\mu_{is}^T  1]^T is the extended mean vector.

III. ADAPTATION USING MULTIPLE STOCHASTIC TRANSFORMATIONS

The linear assumption may not adequately model the dependency between the training and testing conditions. For example, it may be too simplistic to assume that the mapping between the observation spaces of a new speaker and the speakers in the training population is linear, even when we are looking only at the observations of a particular state (or group of HMM states, in the case of transformation tying). An alternative to the deterministic linear transformation described in (2) is to use, for an observation drawn from the ith Gaussian of a particular HMM state s, a probabilistic, piecewise-linear transformation of the form

y_t = A_s^{(1)} x_t + b_s^{(1)}   with probability \lambda_s(1|i)
...
y_t = A_s^{(N_c)} x_t + b_s^{(N_c)}   with probability \lambda_s(N_c|i)    (12)

where N_c is the number of component transformations used by each HMM state, and \eta_j, j = 1, ..., N_c, denotes the event that the jth component transformation (A_s^{(j)}, b_s^{(j)}) is used. The component transformations are shared by all the Gaussians of state s. The probabilities \lambda_s(j|i) that select the jth transformation at time t for the ith Gaussian of HMM state s, however, are specific to each Gaussian in the mixture.

Let us consider adaptation using a complex transformation consisting of N_c component transformations. The adapted observation densities of the HMM-based speech recognizer will then have the following form:

\tilde b_s(y_t) = \sum_{i=1}^{N_g} p(\omega_{is}) \sum_{j=1}^{N_c} \lambda_s(j|i) \mathcal{N}(y_t; A_s^{(j)} \mu_{is} + b_s^{(j)}, A_s^{(j)} \Sigma_{is} A_s^{(j)T}).    (13)

Alternatively, if we choose to apply the transformations only to the means of the Gaussians, as in (4), then the adapted observation densities will be

\tilde b_s(y_t) = \sum_{i=1}^{N_g} p(\omega_{is}) \sum_{j=1}^{N_c} \lambda_s(j|i) \mathcal{N}(y_t; A_s^{(j)} \mu_{is} + b_s^{(j)}, \Sigma_{is}).    (14)

In either case, the MLST parameters that must be estimated from the adaptation data for each HMM state include the transformation probabilities \lambda_s(j|i), i = 1, ..., N_g, j = 1, ..., N_c, for each Gaussian in the mixture, and the transformation parameters A_s^{(j)}, b_s^{(j)}, j = 1, ..., N_c. The estimation of the MLST parameters can be done using the EM algorithm. We first derive the reestimation formulae for the case of a single Gaussian mixture density, and we then extend the results to HMM's with Gaussian mixture observation densities.

A. MLST Adaptation of a Gaussian Mixture Density

Let us first consider a simpler problem, where the random vector x follows a known Gaussian-mixture density with fixed mean vectors \mu_i, as well as fixed and positive-definite covariance matrices \Sigma_i:

p(x) = \sum_{i=1}^{N_g} p(\omega_i) \mathcal{N}(x; \mu_i, \Sigma_i)

where N_g is the number of mixture components. Instead of x, we observe samples of a random vector y obtained from x via a complex transformation consisting of N_c component transformations of the form

y = A^{(j)} x + b^{(j)}   with probability \lambda(j|i),  j = 1, ..., N_c    (15)

where we assumed that the index i of the Gaussian used to generate the sample is known, and we have the constraints

\sum_{j=1}^{N_c} \lambda(j|i) = 1,   \lambda(j|i) \geq 0.

Under these assumptions, the distribution of the random vector y will obviously have the following form:

p(y) = \sum_{i=1}^{N_g} p(\omega_i) \sum_{j=1}^{N_c} \lambda(j|i) \mathcal{N}(y; A^{(j)} \mu_i + b^{(j)}, A^{(j)} \Sigma_i A^{(j)T})

where the unknown model parameters \theta that must be estimated from the observed data are A^{(j)}, b^{(j)}, \lambda(j|i), i = 1, ..., N_g, j = 1, ..., N_c. ML adaptation of the single Gaussian mixture density is equivalent to the estimation of the parameters \theta based on the a priori knowledge of the unobserved distribution p(x) and the observations of y.

The EM algorithm is widely used to obtain ML estimates for problems with hidden variables. The estimation of the unknown parameters from observations of the random vector y can be achieved with the EM algorithm, if we consider the mixture index \omega and the transformation index \eta as hidden


variables. In this case, the auxiliary function that has to be maximized at each EM iteration has the form

\theta_{k+1} = argmax_\theta Q(\theta | \theta_k),   Q(\theta | \theta_k) = E[\log p(Y, U | \theta) | Y, \theta_k]

where \theta_k = \{A^{(j)}(k), b^{(j)}(k), \lambda_k(j|i)\} are the model parameters of the previous iteration, Y = \{y_1, ..., y_N\} is the set of the observation data samples, and U denotes the set of the corresponding unobserved data, which consists of the set of mixture indices \Omega and the set of the component transformations' indices H.

In general, the EM algorithm consists of an expectation and a maximization step. In the Appendix, we show that the expectation step involves the computation of the following sufficient statistics:

n_{ij} = \sum_{t=1}^{N} P(\omega_i, \eta_j | y_t, \theta_k)    (16)

\bar y_{ij} = (1 / n_{ij}) \sum_{t=1}^{N} P(\omega_i, \eta_j | y_t, \theta_k) y_t    (17)

\bar\Sigma_{ij} = (1 / n_{ij}) \sum_{t=1}^{N} P(\omega_i, \eta_j | y_t, \theta_k) (y_t - \bar y_{ij})(y_t - \bar y_{ij})^T    (18)

for i = 1, ..., N_g, j = 1, ..., N_c, where N is the size of the sample set. The posterior probabilities P(\omega_i, \eta_j | y_t, \theta_k) and P(\omega_i | y_t, \theta_k), i = 1, ..., N_g, j = 1, ..., N_c, can be computed using Bayes' rule as in (19) and (20):

P(\omega_i, \eta_j | y_t, \theta_k) = p(\omega_i) \lambda_k(j|i) \mathcal{N}(y_t; A^{(j)}(k) \mu_i + b^{(j)}(k), A^{(j)}(k) \Sigma_i A^{(j)T}(k)) / \sum_{i'} \sum_{j'} p(\omega_{i'}) \lambda_k(j'|i') \mathcal{N}(y_t; A^{(j')}(k) \mu_{i'} + b^{(j')}(k), A^{(j')}(k) \Sigma_{i'} A^{(j')T}(k))    (19)

P(\omega_i | y_t, \theta_k) = \sum_{j=1}^{N_c} P(\omega_i, \eta_j | y_t, \theta_k).    (20)

In the Appendix, we prove that part of the maximization step is to calculate the new value for the weight probability \lambda(j|i) from the quantity

\lambda(j|i) = n_{ij} / \sum_{j'=1}^{N_c} n_{ij'}.    (21)

When we consider diagonal covariances \Sigma_i = diag(\sigma^2_{i,1}, ..., \sigma^2_{i,D}) and diagonal scaling matrices A^{(j)}, that is

A^{(j)} = diag(a^{(j)}_1, ..., a^{(j)}_D)    (22)

the maximization step, in addition to calculating the weights from (21), involves solving the following set of equations for a^{(j)}_d, d = 1, ..., D, j = 1, ..., N_c, where D is the dimension of the above matrices and vectors:

\sum_{i=1}^{N_g} n_{ij} [ (\bar\sigma^2_{ij,d} + (\bar y_{ij,d} - a^{(j)}_d \mu_{i,d} - b^{(j)}_d)^2) / \sigma^2_{i,d} + a^{(j)}_d \mu_{i,d} (\bar y_{ij,d} - a^{(j)}_d \mu_{i,d} - b^{(j)}_d) / \sigma^2_{i,d} - (a^{(j)}_d)^2 ] = 0    (23)

where the offset b^{(j)}_d is given by

b^{(j)}_d = \sum_{i} (n_{ij} / \sigma^2_{i,d}) (\bar y_{ij,d} - a^{(j)}_d \mu_{i,d}) / \sum_{i} (n_{ij} / \sigma^2_{i,d}).    (24)

The above quadratic equation has real roots. In the general case, when the covariances and scaling matrices are full matrices, we can use iterative schemes to solve a system of second-order equations.

When the transformation is applied only to the means of the Gaussians, the computation of the component-transformation probabilities \lambda(j|i) in the maximization step remains the same as in (21). The computation of the transformation parameters is now equivalent to solving the following system of equations for j = 1, ..., N_c (see the


Appendix):

\sum_{i=1}^{N_g} (n_{ij} / \sigma^2_{i,d}) \mu_{i,d} (\bar y_{ij,d} - a^{(j)}_d \mu_{i,d} - b^{(j)}_d) = 0    (25)

\sum_{i=1}^{N_g} (n_{ij} / \sigma^2_{i,d}) (\bar y_{ij,d} - a^{(j)}_d \mu_{i,d} - b^{(j)}_d) = 0.    (26)

Equations (25) and (26) can be combined using the extended matrix representation to form an equation similar to (11) that can be solved as described in [2].

B. MLST Adaptation of Gaussian Mixture Densities in HMM's

We can now easily extend the MLST estimation method to hidden Markov models with Gaussian mixture densities as observation distributions. Let s_t, t = 1, 2, ..., be the hidden, finite-state process of an HMM. This state process generates an observed process y_t through a stochastic mapping of the form (12).

In (13) and (14) we saw that the adapted observation densities of an HMM using MLST adaptation with N_c component transformations can be written

\tilde b_s(y_t) = \sum_{i=1}^{N_g} p(\omega_{is}) \sum_{j=1}^{N_c} \lambda_s(j|i) \mathcal{N}(y_t; A_s^{(j)} \mu_{is} + b_s^{(j)}, A_s^{(j)} \Sigma_{is} A_s^{(j)T})    (27)

when the transformations are applied to both the means and covariances, or, alternatively

\tilde b_s(y_t) = \sum_{i=1}^{N_g} p(\omega_{is}) \sum_{j=1}^{N_c} \lambda_s(j|i) \mathcal{N}(y_t; A_s^{(j)} \mu_{is} + b_s^{(j)}, \Sigma_{is})    (28)

if we choose to apply the transformations only to the means of the Gaussians.

In either case, the parameters that must be estimated from the adaptation data for each HMM state include the transformation parameters A_s^{(j)}, b_s^{(j)}, j = 1, ..., N_c, and the transformation probabilities \lambda_s(j|i) for each Gaussian in the mixture. These parameters can be estimated using the EM algorithm in a similar manner as shown in the case of a single Gaussian mixture density. However, when HMM's are considered, the unobserved data include the HMM state together with the mixture class and the transformation indices. Therefore, the EM algorithm in this case takes the form of the iterative Baum-Welch algorithm, and the training procedure can be summarized as follows.

• Initialization: Initialize all transformation parameters A_s^{(j)}, b_s^{(j)} and weights \lambda_s(j|i), j = 1, ..., N_c. One possible choice of the initialization conditions is given in the following section.

• E-Step: Perform one iteration of the forward-backward algorithm on the speech data, using the adapted Gaussians with the current value of the transformation parameters A_s^{(j)}(k), b_s^{(j)}(k), \lambda_{s,k}(j|i), where k is the iteration index. Collect the sufficient statistics as follows:

n_{ijs} = \sum_t \gamma_t(s) P(\omega_i, \eta_j | y_t, s, \theta_k)    (29)

\bar y_{ijs} = (1 / n_{ijs}) \sum_t \gamma_t(s) P(\omega_i, \eta_j | y_t, s, \theta_k) y_t    (30)

\bar\Sigma_{ijs} = (1 / n_{ijs}) \sum_t \gamma_t(s) P(\omega_i, \eta_j | y_t, s, \theta_k) (y_t - \bar y_{ijs})(y_t - \bar y_{ijs})^T    (31)

where \gamma_t(s) = P(s_t = s | Y, \theta_k) is the probability of being at state s at time t given the adaptation data and the current estimates of the parameters \theta_k, and is computed by the forward-backward recursions. The posterior probabilities

P(\omega_i, \eta_j | y_t, s, \theta_k)    (32)

P(\omega_i | y_t, s, \theta_k)    (33)

can be computed using Bayes' rule, in analogy to (19) and (20).

• M-Step: Compute the new transformation parameters A_s^{(j)}, b_s^{(j)} and component-transformation probabilities \lambda_s(j|i) using (21), (23), and (24).

• If another iteration is needed, go to the E-step.

When the transformation is applied only to the means of the Gaussians, then only the first-order statistics given in (29) and (30) have to be computed in the expectation step, since the covariance remains the same through the iterations. In addition, the maximization step involves the computation of the new transformation parameters using (21), (25), and (26).

C. Implementation Issues

The adapted mixture densities in (13) and (14) using MLST adaptation consist of N_g N_c Gaussians, whereas the original SI observation densities and the adapted densities using the conventional methods of (3) and (4) consist of N_g Gaussians. Hence, the adapted system will require additional computation for the output likelihoods in both recognition and adaptation, during the forward-backward recursions. This can be avoided if we constrain the transformation probabilities:

\lambda_s(j|i) = 1 for the transformation with the highest probability, and 0 elsewhere    (34)

which means that we only apply the transformation with the highest probability to each Gaussian. In this case, the adapted system still has the same number of Gaussians, N_g, as the SI system or the adapted system with the conventional methods, and there will be no additional computation involved in the calculation of the output likelihoods, the dominant part of the overall computation. We will refer to this approximation as the MLST-HPT approach. Alternatively, we can calculate the linear


combination of all the component transformations, using the estimated weights of the ith mixture component:

\bar A_{is} = \sum_{j=1}^{N_c} \lambda_s(j|i) A_s^{(j)}    (35)

TABLE I NUMBER OF ADAPTATION PARAMETERS PER MIXTURE AND WORD ERROR RATES FOR LINEAR TRANSFORMATIONS (METHODS I AND II) AND MULTIPLE STOCHASTIC TRANSFORMATIONS (HPT AND LCT) WITH TWO TO SIX COMPONENT TRANSFORMATIONS IN DIALECT ADAPTATION EXPERIMENTS

\bar b_{is} = \sum_{j=1}^{N_c} \lambda_s(j|i) b_s^{(j)}    (36)

and apply the averaged transform to that mixture. We refer to this as the MLST-LCT approach. In our experiments we consider both the MLST-HPT and MLST-LCT approaches.

Another issue for the implementation of the algorithm is the choice of the initialization conditions of the transformation parameters. This is very important, since different initial values may lead the algorithm to converge toward different local optima, which significantly influences its performance. One can obtain initial values of the transformation parameters either by clustering methods, or by slightly perturbing the identity transformation. In our experiments, we used a systematic method to obtain multiple, slightly perturbed variants of the identity transformation, and used these as initial values of the different component transformations. We set, for all states s and for j = 1, ..., N_c, A_s^{(j)} = I and b_s^{(j)} = \epsilon (\sigma_s \odot h_j), where I is the identity matrix, \odot represents the element-wise product of two vectors, \sigma_s is a vector whose elements are the standard deviations of the observation vector for state s, h_j is the jth column of a Hadamard matrix, \epsilon is a small scaling constant, and D is the dimension of the offset vector b_s^{(j)}. Finally, we initialize the weight probabilities with \lambda_s(j|i) = 1/N_c, i = 1, ..., N_g, j = 1, ..., N_c.

IV. EXPERIMENTS

We have tested our new algorithm in dialect-adaptation experiments, where we try to develop a multidialect SI speech recognition system for the Swedish language that will require only a small amount of dialect-dependent data. We use the Swedish-language corpus collected by Telia, and the recognizer used in a bidirectional speech-translation system between English and Swedish that has been developed under the SRI-Telia Research Spoken Language Translator project [15]. We have also evaluated our algorithm in speaker-adaptation experiments based on the "spoke 3" task of the large-vocabulary Wall Street Journal (WSJ) corpus [16]. The goal of this task is to improve recognition performance for nonnative speakers of American English.
A. Dialect Adaptation Experiments

For our dialect-adaptation experiments, we used data from the Stockholm and Scanian dialects, which were, respectively, the seed and target dialects. There is a total of 40 speakers of the Scanian dialect, both male and female, and each of them recorded more than 40 sentences. We selected eight of the speakers (half of them male) to serve as testing data, and the rest composed the adaptation/training data, with a total of 3814 sentences. Experiments were carried out using SRI's DECIPHER(TM) system [17]. The system's front end was configured to output 12 cepstral coefficients, cepstral energy,

and their first and second derivatives. The cepstral features are computed with a fast Fourier transform (FFT) filterbank, and subsequent cepstral-mean normalization is performed on a sentence basis. The SI continuous HMM system, which served as the seed model for our adaptation scheme, was a phonetically tied mixture (PTM) system [17] trained on approximately 21 000 sentences of the Stockholm dialect. The system's recognition performance on an air-travel information task similar to the English ATIS task was benchmarked at an 8.9% word error rate using a bigram language model when tested on Stockholm speakers. On the other hand, its performance degraded significantly when tested on the Scanian-dialect testing set, reaching a word error rate of 25.08%.

In previous work [18], we adapted the Stockholm-dialect system using (3) with diagonal transformations (method I) and (4) with structured transformations (method II). The transformation matrices in method II are block-diagonal matrices, with three blocks that perform a separate transformation on every basic feature subvector (cepstrum, and its first and second derivatives). Hence, method II allows rotation of each subvector, in addition to the scaling and shifting of method I. The results are summarized in Table I for 198 and 520 adaptation sentences, and we found that method II outperformed method I because of its more complex transformations. These results were consistent with similar findings in speaker-adaptation experiments reported in [14].

The results of our new method, considering both the MLST-HPT and MLST-LCT approaches, are also summarized in Table I for different numbers of component transformations. We used multiple diagonal transformations applied to both


the means and covariances, as in (13). The results show that the MLST-HPT approach outperforms MLST-LCT under most of the experimental conditions, especially when only two component transformations are used and when more training sentences are available. Compared with the results of previous work, we see that even with as few as two component transformations, we get a significant performance improvement over both methods I and II. When more component transformations are used, the new MLST method gives even better results than the previous approaches, with the best performance achieved with five component transformations for MLST-HPT and six for MLST-LCT. The word error rate for 198 adaptation sentences is reduced by 21% and 10% over methods I and II, respectively. The reductions for MLST-HPT when 520 adaptation sentences are used are even larger: the word error rate is reduced by 25% and 21% over methods I and II, respectively, although the number of adaptation parameters is still smaller than that used in method II.

B. Speaker Adaptation Experiments

For the speaker-adaptation experiments, we used the DECIPHER(TM) system on the "spoke 3" task of the large-vocabulary WSJ corpus [16]. The speaker-independent, continuous HMM systems that were used as seed models for adaptation were gender dependent, trained on 140 speakers and 17 000 sentences for each gender. Each of the two systems was phonetically tied, having 12 000 context-dependent phonetic models that shared 100 Gaussians specific to each center phone. We used the 5000-word closed-vocabulary bigram language model provided by the MIT Lincoln Laboratory, and the 1994 development set, which consists of six female and five male speakers, each speaking 40 phonetically rich adaptation sentences. The test set consisted of 11 speakers and 20 sentences per speaker. The speaker-independent word error rate for this test set is 29.06%.
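The word error rates reported throughout these experiments follow the standard edit-distance definition; a minimal sketch, with a helper of our own (assuming whitespace-tokenized transcripts):

```python
# Sketch: word error rate (WER), computed as edit distance
# (substitutions + insertions + deletions) over the number of
# reference words.

def word_error_rate(ref, hyp):
    r, h = ref.split(), hyp.split()
    # dp[i][j]: edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution/match
    return dp[len(r)][len(h)] / len(r)

wer = word_error_rate("a b c d", "a x c")   # one sub + one del -> 0.5
```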
We evaluated our new method for 10, 20, and 40 stochastic transformations. Each stochastic transformation was used to adapt the Gaussians of all the allophone states clustered in one of 10, 20, or 40 groups, respectively. We used diagonal component transformations applied to both the means and the covariances, as in (13), and the number of component transformations in each stochastic transformation varied from one (in which case our new method simply reduces to method I) to eight. We used the MLST-HPT approach, which in our dialect-adaptation experiments performed significantly better than the MLST-LCT method. The results are summarized in Table II. We see that with as few as two component transformations, there is a significant improvement in recognition performance compared to method I. The improvement becomes more pronounced as we use more component transformations. The best performance is achieved with six component transformations, and the speaker-independent word error rate is reduced by 38.8%, 40.8%, and 42.2% when 10, 20, and 40 transformations are used, respectively. The improvement in performance of the MLST-HPT method over method I is 23.3%, 16.9%, and 13.8% for 10, 20, and 40 transformations, respectively.
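The MLST-HPT constraint (34) used in these experiments keeps, for each Gaussian, only the component transformation with the highest trained weight; a minimal sketch with a function name of our own choosing:

```python
# Sketch of the MLST-HPT approximation: after training, keep only the
# component transformation with the highest weight for each Gaussian,
# so the adapted system retains one Gaussian per original Gaussian.

def hpt_collapse(lam):
    """lam[j][i]: weight of component transform j for Gaussian i.
    Returns the index of the winning transform per Gaussian and the
    collapsed 0/1 weight matrix."""
    n_trans, n_gauss = len(lam), len(lam[0])
    winners = [max(range(n_trans), key=lambda j: lam[j][i])
               for i in range(n_gauss)]
    hard = [[1.0 if winners[i] == j else 0.0 for i in range(n_gauss)]
            for j in range(n_trans)]
    return winners, hard

winners, hard = hpt_collapse([[0.7, 0.2], [0.3, 0.8]])
```

After collapsing, each mixture Gaussian is adapted by exactly one component transform, so output-likelihood computation costs the same as in the single-transformation methods.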


TABLE II SPEAKER-ADAPTED WORD ERROR RATES OF A SYSTEM WITH 100 GAUSSIAN COMPONENTS PER MIXTURE USING THE MLST-HPT METHOD FOR SEVERAL NUMBERS OF STOCHASTIC TRANSFORMATIONS WITH 1 (METHOD I) UP TO EIGHT COMPONENT TRANSFORMATIONS

TABLE III SPEAKER-ADAPTED WORD ERROR RATES OF A SYSTEM WITH 50 GAUSSIAN COMPONENTS PER MIXTURE USING THE MLST-HPT METHOD FOR SEVERAL NUMBERS OF STOCHASTIC TRANSFORMATIONS WITH ONE (METHOD I) UP TO EIGHT COMPONENT TRANSFORMATIONS

To investigate the tradeoff between the number of mixture components and the number of component transformations of the MLST transform, we repeated the experiments of Table II for a phonetically tied system with 50 Gaussians per mixture. The speaker-independent word error rate for this system is 30.96%. The results are summarized in Table III. We can see that the behavior of the MLST-HPT method for the system with 50 Gaussians per mixture is similar to that for the system with 100 Gaussians per mixture, and that in both cases the best adaptation performance is achieved with five through eight component transformations. The performance improvement of the MLST-HPT method over method I is, in the case of the 50-Gaussian system, 23.9%, 18.3%, and 11.1% for 10, 20, and 40 transformations, respectively.

It is also interesting to compare the adapted performance of the 100-Gaussian system using method I (one component transformation) with the adaptation performance of the


TABLE IV AVERAGE ADAPTATION TIME (FORWARD–BACKWARD ITERATION FOR A SINGLE SENTENCE), ADAPTATION-TRAINER PROCESS SIZE AND AVERAGE RECOGNITION TIME PER SENTENCE OF THE ADAPTED SYSTEM USING THE MLST-HPT METHOD FOR A SYSTEM WITH 50 GAUSSIAN COMPONENTS PER MIXTURE, 20 STOCHASTIC TRANSFORMATIONS, AND ONE (METHOD I) UP TO EIGHT COMPONENT TRANSFORMATIONS

50-Gaussian system using the MLST-HPT method with two component transformations. Although the MLST-HPT method selects one of the transformations for each mixture component of each HMM state, different transformations may be selected for the same component in different HMM states. As a result, and because of the sharing of Gaussians across different states, both of the systems mentioned above have 100 different adapted Gaussians per Gaussian codebook. By comparing Tables II and III, we can see that the adapted word error rates of the 100-Gaussian system with method I (23.2%, 20.7%, and 19.5% for 10, 20, and 40 transformations, respectively) are worse than the word error rates of the 50-Gaussian system with the MLST-HPT method and two component transformations (22.2%, 19.3%, and 18.6% for 10, 20, and 40 transformations, respectively).

C. Complexity

According to our discussion in Section III-C, the exact MLST algorithm would require significantly more computation during both adaptation and recognition. In all of our experiments, however, we used the MLST-HPT and MLST-LCT approximations, which have the same number of Gaussians per mixture in the adapted observation densities as the conventional methods I and II, and we claimed in Section III-C that these approximations do not require any additional computation in adaptation or recognition over the single-transformation methods I and II. In this section we verify our claim experimentally.

Table IV shows the average time in seconds per sentence required for a forward-backward iteration, by far the most demanding adaptation step. The times were measured in our speaker-adaptation experiments for the system with 50 Gaussians per mixture and several numbers of component transforms. We can see that the average forward-backward time per sentence for the MLST-HPT method is comparable to that of

method I, since the number of Gaussians per mixture is the same. In fact, the forward-backward time initially decreases as the number of component transformations increases from one to five, and then the computation time increases slightly. There are two conflicting reasons for this behavior. As the number of component transformations increases, the larger number of adaptation parameters results in a larger process size for the adaptation trainer (see the third column of Table IV), and this may slow down the process. At the same time, however, the adapted systems with more component transforms exhibit better recognition performance, as we saw in the previous section, and the better acoustic match speeds up the late forward-backward iterations as well as the Viterbi algorithm (see the last column of Table IV).

In terms of memory requirements, the MLST adaptation method requires N_c times as many adaptation parameters as the conventional method I when diagonal transformations are used, as we indicated in the dialect-adaptation experiments (see Table I). Since the number of adaptation parameters is small, however, compared to the overall number of parameters in the HMM recognizer, the relative increase in the process size of the adaptation trainer is also small, as we can see in Table IV.

V. CONCLUSIONS

We have developed a new, transformation-based adaptation algorithm for HMM's that is based on multiple stochastic transformations, namely the MLST algorithm. We have evaluated the MLST algorithm in dialect-adaptation experiments, and found that it outperforms other widely used adaptation methods. We also performed speaker-adaptation experiments on the "spoke 3" task of the WSJ corpus (adaptation to nonnative speakers of American English). We showed that our new method significantly improves recognition performance over the well-known deterministic, linear transformation.

APPENDIX
EM-ALGORITHM STEPS

Let Y = \{y_1, ..., y_N\} denote the observed set of data samples.
In addition, we assume that the unobserved data Y = {y_1, …, y_N} consist of the set of mixture classes ω_{i_t} and the set of the component transformations' indices λ_{j_t}, where i_t ∈ {1, …, N_ω} and j_t ∈ {1, …, N_λ}.

The auxiliary function of the EM algorithm can then be defined as

    Q(θ; θ̄) = E[ log p(X, Y | θ) | X, θ̄ ].

Since p(X, Y | θ) factors into the mixture probabilities p(ω_i), the transformation weights p(λ_j | ω_i), and the adapted Gaussian terms N(x_t; μ_{ij}, Σ_{ij}), where μ_{ij} and Σ_{ij} denote the adapted mean and covariance of mixture component ω_i under component transformation λ_j, we can rewrite the auxiliary function in the following form:

    Q(θ; θ̄) = Σ_{t=1}^{N} Σ_{i=1}^{N_ω} Σ_{j=1}^{N_λ} γ̄_t(i, j) [ log p(ω_i) + log p(λ_j | ω_i) + log N(x_t; μ_{ij}, Σ_{ij}) ]    (37)

where the posteriors γ̄_t(i, j) = p(ω_i, λ_j | x_t, θ̄) can be computed using Bayes' rule. When multiple stochastic transformations are considered, the parameters θ consist of the transformation parameters A_j, b_j, j = 1, …, N_λ, as well as the weight probabilities p(λ_j | ω_i), i = 1, …, N_ω, j = 1, …, N_λ. Therefore, the first term of (37) does not depend on θ and hence, at each EM iteration, we only need to maximize the sum of the following two terms:

    Q_g(θ; θ̄) = Σ_t Σ_i Σ_j γ̄_t(i, j) log N(x_t; μ_{ij}, Σ_{ij})    (38)

    Q_w(θ; θ̄) = Σ_t Σ_i Σ_j γ̄_t(i, j) log p(λ_j | ω_i).    (39)

To maximize the Q_g term, we have to consider its partial derivatives with respect to the transformation parameters A_j and b_j, since there is no functional dependence of Q_g on the p(λ_j | ω_i) probabilities. Similarly, the Q_w term is maximized by taking its partial derivatives with respect to the p(λ_j | ω_i), since this term is not functionally dependent on the A_j and b_j parameters.
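For the diagonal-covariance case used in our experiments, the E-step posteriors γ̄_t(i, j) of (37) can be sketched as follows. This is a minimal illustration for a single Gaussian codebook, outside the HMM forward–backward recursions; all function and variable names (estep_posteriors, mix_w, trans_w, and so on) are our own illustrative choices, not part of the DECIPHER implementation.

```python
import numpy as np

def estep_posteriors(X, means, variances, mix_w, A, b, trans_w):
    """Joint posteriors gamma_t(i, j) = p(omega_i, lambda_j | x_t) for one
    Gaussian codebook under multiple diagonal component transformations.

    X: (T, d) observations; means, variances: (I, d) codebook Gaussians;
    mix_w: (I,) mixture weights; A, b: (J, d) diagonal scales and offsets;
    trans_w: (I, J) transformation weights p(lambda_j | omega_i)."""
    I, J = trans_w.shape
    log_post = np.empty((X.shape[0], I, J))
    for i in range(I):
        for j in range(J):
            m = A[j] * means[i] + b[j]        # constrained mean, cf. (41)
            v = A[j] ** 2 * variances[i]      # constrained variance, cf. (42)
            ll = -0.5 * np.sum(np.log(2 * np.pi * v) + (X - m) ** 2 / v, axis=1)
            log_post[:, i, j] = np.log(mix_w[i]) + np.log(trans_w[i, j]) + ll
    log_post -= log_post.max(axis=(1, 2), keepdims=True)   # numerical stability
    post = np.exp(log_post)
    return post / post.sum(axis=(1, 2), keepdims=True)     # Bayes' rule over (i, j)
```

In a full HMM implementation these posteriors would additionally be weighted by the state occupancies produced by the forward–backward recursions.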

A. Maximizing Q_g

The maximization of the Q_g term is similar to the one described in [1]. In [19], the joint log-likelihood of a collection of samples drawn independently from a multivariate normal distribution with mean μ and covariance Σ is described as

    log L = −(n/2) log |2πΣ| − (n/2) trace(Σ⁻¹ S) − (n/2) (x̄ − μ)ᵀ Σ⁻¹ (x̄ − μ)    (40)

where n denotes the number of samples, and x̄ and S the sample mean and covariance, respectively. In our case, the means and the covariances of the adapted Gaussians are constrained:

    μ_{ij} = A_j μ_i + b_j    (41)

    Σ_{ij} = A_j Σ_i A_jᵀ    (42)

for i = 1, …, N_ω and j = 1, …, N_λ. By expanding (38) we can obtain the following expression:

    Q_g = −(1/2) Σ_t Σ_i Σ_j γ̄_t(i, j) [ log |2πΣ_{ij}| + trace(Σ_{ij}⁻¹ (x_t − μ_{ij})(x_t − μ_{ij})ᵀ) ]    (43)

where we used the matrix identity xᵀ M x = trace(M x xᵀ) for a matrix M and a vector x. Furthermore, we can define the sufficient statistics:

    n_{ij} = Σ_t γ̄_t(i, j)    (44)

    x̄_{ij} = (1/n_{ij}) Σ_t γ̄_t(i, j) x_t    (45)

    S_{ij} = (1/n_{ij}) Σ_t γ̄_t(i, j) (x_t − x̄_{ij})(x_t − x̄_{ij})ᵀ.    (46)

Based on the definitions of the statistics n_{ij}, x̄_{ij}, and S_{ij} given in (44)–(46), and on the expression (40), we can rewrite (43) in the following compact form:

    Q_g = −(1/2) Σ_i Σ_j n_{ij} [ log |2πΣ_{ij}| + trace(Σ_{ij}⁻¹ S_{ij}) + (x̄_{ij} − μ_{ij})ᵀ Σ_{ij}⁻¹ (x̄_{ij} − μ_{ij}) ]    (49)
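The sufficient statistics (44)–(46) are weighted averages of the adaptation data and can be accumulated in a few lines. The sketch below keeps only diagonal second-order terms, matching the diagonal case used later; the function and variable names (sufficient_statistics, gamma) are illustrative.

```python
import numpy as np

def sufficient_statistics(X, gamma):
    """n_ij (44), xbar_ij (45), and diagonal s2_ij (46) from observations
    X with shape (T, d) and joint posteriors gamma with shape (T, I, J)."""
    n = gamma.sum(axis=0)                                   # counts n_ij, (I, J)
    xbar = np.einsum('tij,td->ijd', gamma, X) / n[..., None]
    dev2 = (X[:, None, None, :] - xbar[None]) ** 2          # squared deviations, (T, I, J, d)
    s2 = np.einsum('tij,tijd->ijd', gamma, dev2) / n[..., None]
    return n, xbar, s2
```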

We now substitute the constrained means and covariances from (41) and (42), to insert the transformation parameters into the expression that we maximize:

    Q_g = −(1/2) Σ_i Σ_j n_{ij} [ log |2π A_j Σ_i A_jᵀ| + trace((A_j Σ_i A_jᵀ)⁻¹ S_{ij}) + (x̄_{ij} − A_j μ_i − b_j)ᵀ (A_j Σ_i A_jᵀ)⁻¹ (x̄_{ij} − A_j μ_i − b_j) ].    (50)

In the above equations we assumed that the covariance matrices Σ_i and the A_j matrices have full rank. However, we can simplify the problem by considering diagonal matrices [see (22)]. In this case, assuming that the feature-vector dimension is d, and denoting by a_{jl}, b_{jl}, μ_{il}, σ_{il}², x̄_{ijl}, and s_{ijl}² the lth elements of A_j, b_j, μ_i, Σ_i, x̄_{ij}, and S_{ij}, respectively, the auxiliary function takes the form:

    Q_g = −(1/2) Σ_i Σ_j n_{ij} Σ_{l=1}^{d} [ log(2π a_{jl}² σ_{il}²) + (s_{ijl}² + (x̄_{ijl} − a_{jl} μ_{il} − b_{jl})²) / (a_{jl}² σ_{il}²) ].    (53)

Taking the gradient of the above with respect to the transformation parameters a_{jl} and b_{jl}, we find the following system of equations for each j and l:

    a_{jl}² Σ_i n_{ij} = Σ_i (n_{ij}/σ_{il}²) [ s_{ijl}² + (x̄_{ijl} − a_{jl} μ_{il} − b_{jl})² + a_{jl} μ_{il} (x̄_{ijl} − a_{jl} μ_{il} − b_{jl}) ]    (54)

where the offset b_{jl} is given by

    b_{jl} = [ Σ_i (n_{ij}/σ_{il}²)(x̄_{ijl} − a_{jl} μ_{il}) ] / [ Σ_i (n_{ij}/σ_{il}²) ].    (55)

By substituting the expression (55) for the offset b_{jl} in the first equation, we get a quadratic equation in a_{jl} with real roots which maximize the Q_g quantity [1].
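As a concrete illustration of the diagonal-case update, the following sketch maximizes the per-dimension objective of (53) over the scale parameter, eliminating the offset through its closed form (55). For simplicity it uses a grid search over the scale in place of solving the closed-form quadratic of [1]; all names (fit_diagonal_transform_1d, a_grid) are illustrative.

```python
import numpy as np

def fit_diagonal_transform_1d(n, mu, var, xbar, s2, a_grid=None):
    """Maximize the per-dimension Q_g term over the scale a for one
    transformation and one feature dimension; the offset b is eliminated
    through its closed form, and a grid search over a stands in for the
    closed-form quadratic. n, mu, var, xbar, s2: per-mixture statistics."""
    if a_grid is None:
        a_grid = np.linspace(0.2, 5.0, 2001)
    w = n / var                                            # precision weights
    best = (-np.inf, None, None)
    for a in a_grid:
        b = np.sum(w * (xbar - a * mu)) / np.sum(w)        # offset, cf. (55)
        resid = xbar - a * mu - b
        q = -0.5 * np.sum(n * (np.log(2 * np.pi * a * a * var)
                               + (s2 + resid ** 2) / (a * a * var)))
        if q > best[0]:
            best = (q, a, b)
    return best[1], best[2]
```

When the adaptation statistics are generated by an exact diagonal transformation of the codebook Gaussians, the maximizer recovers that transformation.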

B. Maximizing Q_w

Using the definition of n_{ij} in (44), we can rewrite (39) as follows:

    Q_w = Σ_i Σ_j n_{ij} log p(λ_j | ω_i).    (51)

To maximize the Q_w term, we perform equality-constrained optimization using multiple Lagrange multipliers. The constraints in our case are

    Σ_{j=1}^{N_λ} p(λ_j | ω_i) = 1,  i = 1, …, N_ω.    (57)

Based on the above constraints, we extend the Q_w term to the following augmented term:

    Q̃_w = Σ_i Σ_j n_{ij} log p(λ_j | ω_i) + Σ_i c_i [ Σ_j p(λ_j | ω_i) − 1 ]    (58)

where the c_i parameters are the Lagrange multipliers. Taking the gradient of Q̃_w with respect to the p(λ_j | ω_i) probabilities and setting each element of the gradient vector to zero, we find

    p(λ_j | ω_i) = − n_{ij} / c_i.    (59)

Using (57) and (59) we can show

    c_i = − Σ_{j=1}^{N_λ} n_{ij}.    (60)

Finally, by substituting (60) in (59), we derive the value of p(λ_j | ω_i) that maximizes the Q_w quantity as

    p(λ_j | ω_i) = n_{ij} / Σ_{j′=1}^{N_λ} n_{ij′}.    (61)
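The closed-form weight reestimate (61) amounts to normalizing the counts n_{ij} over the transformation index; a minimal sketch, with illustrative names:

```python
import numpy as np

def update_transformation_weights(gamma):
    """M-step reestimate of p(lambda_j | omega_i) from joint posteriors
    gamma with shape (T, I, J): n_ij / sum over j' of n_ij'."""
    n = gamma.sum(axis=0)                     # counts n_ij, cf. (44)
    return n / n.sum(axis=1, keepdims=True)   # normalize over j, cf. (61)
```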

C. Adapting Only the Means of the Gaussians

If we choose to adapt only the means of the Gaussians, the above analysis for maximizing the Q_g term has to be slightly changed. In this case, we only use the constraint on the means described in (41), and we set the covariances of the adapted Gaussians to Σ_{ij} = Σ_i. Equation (50) then takes the following form:

    Q_g = −(1/2) Σ_i Σ_j n_{ij} [ log |2πΣ_i| + trace(Σ_i⁻¹ S_{ij}) + (x̄_{ij} − A_j μ_i − b_j)ᵀ Σ_i⁻¹ (x̄_{ij} − A_j μ_i − b_j) ].    (62)

Taking the gradient of the above with respect to the transformation parameters A_j and b_j as we did before, we derive the following set of equations for j = 1, …, N_λ:

    Σ_i n_{ij} Σ_i⁻¹ (x̄_{ij} − A_j μ_i − b_j) μ_iᵀ = 0    (63)

    Σ_i n_{ij} Σ_i⁻¹ (x̄_{ij} − A_j μ_i − b_j) = 0    (64)

which are linear in A_j and b_j.

REFERENCES

[1] V. Digalakis, D. Rtischev, and L. Neumeyer, "Speaker adaptation using constrained reestimation of Gaussian mixtures," IEEE Trans. Speech Audio Processing, vol. 3, pp. 357–366, Sept. 1995.
[2] C. J. Leggetter and P. C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Comput. Speech Lang., pp. 171–185, 1995.
[3] A. Sankar and C.-H. Lee, "A maximum likelihood approach to stochastic matching for robust speech recognition," IEEE Trans. Speech Audio Processing, vol. 4, pp. 190–202, May 1996.
[4] A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum likelihood estimation from incomplete data," J. R. Stat. Soc. B, vol. 39, pp. 1–38, 1977.
[5] J. R. Bellegarda, "Statistical techniques for robust ASR: Review and perspectives," in Proc. Europ. Conf. Speech Communication and Technology, 1997, pp. KN-33–KN-36.
[6] L. Neumeyer and M. Weintraub, "Probabilistic optimum filtering for robust speech recognition," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, 1994, pp. 417–420.
[7] L. Neumeyer, V. Digalakis, and M. Weintraub, "Training issues and channel equalization techniques for the construction of telephone acoustic models using a high-quality speech corpus," IEEE Trans. Speech Audio Processing, vol. 2, pp. 590–597, Oct. 1994.
[8] M. Rahim and B. H. Juang, "Signal bias removal by maximum likelihood estimation for robust telephone speech recognition," IEEE Trans. Speech Audio Processing, vol. 4, pp. 19–30, Jan. 1996.
[9] V. Digalakis and L. Neumeyer, "Speaker adaptation using combined transformation and Bayesian methods," IEEE Trans. Speech Audio Processing, vol. 4, pp. 294–300, July 1996.
[10] J. T. Chien, H.-C. Wang, and C.-H. Lee, "Improved Bayesian learning of hidden Markov models for speaker adaptation," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, 1997, pp. II-1027–II-1030.
[11] M. Afify, Y. Gong, and J.-P. Haton, "A unified maximum-likelihood approach to acoustic mismatch compensation: Application to noisy Lombard speech recognition," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, 1997, pp. II-839–II-842.
[12] M. Gales, "Transformation smoothing for speaker and environmental adaptation," in Proc. Europ. Conf. Speech Communication and Technology, 1997, pp. 2067–2070.
[13] V. Abrash, A. Sankar, H. Franco, and M. Cohen, "Acoustic adaptation using nonlinear transformations of HMM parameters," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, 1995, pp. 729–732.
[14] L. Neumeyer, A. Sankar, and V. Digalakis, "A comparative study of speaker adaptation techniques," in Proc. Europ. Conf. Speech Communication and Technology, 1995, pp. 1127–1130.
[15] M. Rayner et al., "Spoken language translation with mid-90's technology: A case study," in Proc. Europ. Conf. Speech Communication and Technology, 1993.
[16] F. Kubala et al., "The hub and spoke paradigm for CSR evaluation," in Proc. ARPA Workshop on Human Language Technology, 1994, pp. 37–42.
[17] V. Digalakis, P. Monaco, and H. Murveit, "Genones: Generalized mixture tying in continuous hidden Markov model-based speech recognizers," IEEE Trans. Speech Audio Processing, pp. 281–289, July 1996.
[18] V. Diakoloukas, V. Digalakis, L. Neumeyer, and J. Kaja, "Development of dialect-specific speech recognizers using adaptation methods," in Proc. Int. Conf. Acoustics, Speech, and Signal Processing, 1997, pp. II-1455–II-1458.
[19] T. W. Anderson, An Introduction to Multivariate Statistical Analysis, 2nd ed. New York: Wiley, 1984.

Vassilis D. Diakoloukas was born in Rhodes, Greece, on November 27, 1971. He received the Diploma degree in physics from the University of Crete, Greece, in 1994, and the M.Sc. degree in electronics and electrical engineering from the University of Durham, U.K., in 1995. During his last year of studies he participated in the European Union ERASMUS student exchange program. Since April 1996, he has been with the Telecommunications Laboratory, Technical University of Crete, working toward the Ph.D. degree in speech recognition. His research mainly focuses on speaker and dialect adaptation. From January 1994 to August 1994, he was with the Computing Science Department, University of Groningen, The Netherlands, where he carried out his senior project concerning speech recognition using neural networks.

Vassilios V. Digalakis (M’98) was born in Chania, Greece, on February 2, 1963. He received the Diploma in electrical engineering from the National Technical University of Athens, Greece, in 1986, the M.S. degree in electrical engineering from Northeastern University, Boston, MA, in 1988, and the Ph.D. degree in electrical and systems engineering from Boston University in 1992. From 1986 to 1988, he was a Teaching and Research Assistant at Northeastern University. From 1988 to 1991, he served as a Research Assistant at Boston University. From January 1992 to February 1995, he was with the Speech Technology and Research Laboratory of the Stanford Research Institute (SRI International), Menlo Park, CA. At SRI, he was a Principal Investigator for the United States Advanced Research Projects Agency research contracts, and he developed new speech recognition and speaker adaptation algorithms for the DECIPHER speech recognition system. He is currently an Assistant Professor with the Department of Electronic and Computer Engineering, Technical University of Crete, Chania, Greece. He teaches undergraduate and graduate courses on speech processing and on digital and analog communications. His research interests are in pattern and speech recognition, information theory, and digital communications. He leads a research project funded by the Swedish Telephone Company (Telia) for the development of speech recognition systems for the various dialects of the Swedish language. He also developed language education algorithms using speech recognition techniques. He is the author of numerous articles on speech recognition in journals and conference proceedings. He has filed for three patents.