COMPARING SOLUTION METHODOLOGIES FOR SPEECH ENHANCEMENT WITHIN A BAYESIAN FRAMEWORK

J Vermaak, M Niranjan and SJ Godsill

CUED/F-INFENG/TR.329

August, 1998

Cambridge University Engineering Department
Trumpington Street
Cambridge CB2 1PZ
England
E-mail: [email protected]

SIGNAL PROCESSING GROUP
Cambridge University Engineering Department

Technical Report CUED/F-INFENG/TR.329

Comparing Solution Methodologies for Speech Enhancement within a Bayesian Framework

Jaco Vermaak, Mahesan Niranjan and Simon J. Godsill

August, 1998

Speech enhancement is the area of speech processing concerned with the reconstruction of speech corrupted by some noise process. This report states the problem of enhancing speech corrupted by additive white Gaussian noise within a model-based Bayesian framework, and proceeds to develop and compare three strategies to estimate the model parameters and the clean speech. Each of these methods estimates the additive noise variance from the observed noise corrupted data, alongside the speech model parameters, eliminating the need for off-line estimation of this value. The first two methods are the Evidence Maximisation and Expectation Maximisation (EM) algorithms, which essentially perform maximum a posteriori (MAP) estimation of the model parameters, and, based on this estimate, reconstruct the clean speech. The difference between these two methods lies in the fact that the former performs the maximisation directly in the parameter space, whereas the latter iteratively maximises for the parameters by conditioning on the current estimate of the clean speech. The third method entertained is the Gibbs sampler, which is a special case of the general class of Markov chain Monte Carlo (MCMC) algorithms. In contrast with the first two methods, which are deterministic, the Gibbs sampler is essentially stochastic in nature, and sets up a Markov chain that generates samples from the joint posterior distribution for the model parameters and clean speech. These samples are subsequently used to calculate Monte Carlo estimates for the desired quantities. The methods are evaluated and compared on real speech corrupted by both stationary and non-stationary noise processes.

Keywords: Speech enhancement, Bayesian analysis, Evidence maximisation, Expectation Maximisation algorithm, Markov chain Monte Carlo methods, Gibbs sampler

1 INTRODUCTION

Speech enhancement is the area of speech processing concerned with the reconstruction of speech corrupted by some noise process. What is considered as noise is very much dependent on the application at hand. For example, in telecommunications the received signal may be corrupted by many kinds of background noise at the transmitter and receiver, as well as convolutional channel distortions and different path length effects. All these are unwanted artifacts and need to be removed from the received signal to improve its perceived quality. At a press conference, on the other hand, the speech of competing speakers may form part of the corrupting process. These variations in what is considered as noise strongly determine the characteristics of the required reconstruction algorithms.

A large class of speech enhancement problems involves the separation of speech from a background noise component. Such noise processes are present to some degree in nearly all speech processing applications, and arise due to factors such as environmental noise, intrinsic system noise, ambient noise, and many more. Most of these can adequately be modelled as a white Gaussian noise process that combines additively with the clean speech signal. The variance of the noise process typically changes slowly with time, but can normally be assumed constant over a short segment of the signal.

Traditional enhancement strategies approach this problem from both the time and frequency domains, and include methods as diverse as spectral subtraction and all its variants, Wiener filtering, model-based approaches, signal subspace decomposition, and many more (see (Deller et al., 1993) for a summary of some of the existing techniques). Although these methods achieve some degree of success, their performance is restricted by the fact that most of them are deterministic, and hence cannot adequately account for the uncertainty introduced by the random nature of the noise processes and the variations in the speech production system. Furthermore, almost all these methods assume that the variance of the corrupting noise process is known. This value is normally estimated off-line on a speech-free segment of the signal, and leads to satisfactory reconstruction performance if the noise process is stationary. For non-stationary noise processes, however, the performance deteriorates quickly, since it is not always possible to re-estimate the noise variance at the rate of change of the noise process characteristics.

It is these limitations that inspired the drive to develop statistical speech enhancement algorithms. This important movement started in the late 1970s, headed by the work of Lim and Oppenheim (Lim and Oppenheim, 1978; Lim and Oppenheim, 1979), and has since developed into a vast field of research. Particularly successful within this field are model-based Bayesian speech enhancement approaches. Within this framework explicit parametric models are constructed for the speech and noise processes, leading to likelihood functions for the observed and uncorrupted speech data. If suitable priors are chosen, the parameters of these

models, together with the uncorrupted speech data, can be viewed as random variables for which a posterior density can be formed according to Bayes' rule, after observing the corrupted speech signal. Estimates for the speech and noise model parameters and the clean speech data can then be obtained by performing inferences from this density.

Despite its conceptual simplicity, for any model of realistic complexity the Bayesian inference procedure results in expressions containing integrals that are analytically intractable. Furthermore, since the problem spaces are generally of high dimension, standard numerical integration techniques cannot be applied. These difficulties spawned the development of a host of solution methodologies. It is the purpose of this report to present and compare three of the most successful of these strategies on the problem of separating speech from an additive white Gaussian noise component. The first two methods are the Evidence Maximisation (Gull, 1989; MacKay, 1992) and Expectation Maximisation (EM) (Dempster et al., 1977) algorithms, which essentially perform maximum a posteriori (MAP) estimation of the model parameters, and, based on this estimate, reconstruct the clean speech. The difference between these two methods lies in the fact that the former performs the maximisation directly in the parameter space, whereas the latter iteratively maximises for the parameters by conditioning on the current estimate of the clean speech. The third method entertained is the Gibbs sampler (Geman and Geman, 1984; Gelfand and Smith, 1990), which is a special case of the general class of Markov chain Monte Carlo (MCMC) algorithms (see (Gilks et al., 1996; Tanner, 1993) for good introductions). In contrast with the first two methods, which are deterministic, the Gibbs sampler is essentially stochastic in nature, and sets up a Markov chain that generates samples from the joint posterior distribution for the model parameters and clean speech. These samples are subsequently used to calculate Monte Carlo estimates for the desired quantities.

The rest of the report is organised as follows. In Section 2 the details of the speech and noise models are laid out and discussed. Section 3 states the enhancement problem within a Bayesian framework, and proceeds to derive the required likelihood functions and posterior distribution. Section 4 presents and discusses the three solution methodologies for the Bayesian inference problem. In Section 5 these methods are evaluated and compared on real speech data corrupted by both stationary and non-stationary additive white Gaussian noise processes. Finally, in Section 6, the salient points in the results are discussed, and the merits and shortcomings of the methods are summarised. The presentation aims to be tutorial in nature, and special emphasis is placed on the assumptions, limitations, computational expense, implementation issues, extendibility and applicability of each of the methods.

2 THE MODEL

A widely used and popular model for the speech production system is the autoregressive (AR) process (Makhoul, 1975; Rabiner and Schafer, 1978; Deller et al., 1993). This model exploits the local correlations in a time series by forming the prediction for the current sample as a linear combination of the immediately preceding samples. For a p-th order autoregressive process (abbreviated as AR(p)), the governing equation is given by:

$$x_t = \sum_{i=1}^{p} a_i x_{t-i} + e_t, \qquad (1)$$

where $a_i$, $i = 1, \dots, p$ are the autoregressive coefficients, and $\{e_t\}$ is a serially uncorrelated (i.e. $\operatorname{cov}(e_s, e_t) = 0$ for $s \neq t$) disturbance sequence, with $p(e_t | \sigma_e^2) = \mathcal{N}(0, \sigma_e^2)$. This sequence can be interpreted as either the sequence of modelling errors or the excitation process. For the purposes of speech processing the latter interpretation seems more natural, and is subsequently adopted here.

Although the popularity of this model stems primarily from its simplicity and the analytical analyses it facilitates in a wide range of applications, it has some justification from a physical point of view, in that it is a simplification of the system that results when the vocal tract is modelled as a lossless acoustic tube (Rabiner and Schafer, 1978; Deller et al., 1993). It provides a fairly accurate representation for vowels and other voiced sounds, but performs relatively poorly on nasals and fricatives, mostly due to the presence of zeros (anti-resonances) in the magnitude spectra for these sounds. However, even for these classes of speech sounds, increasing the modelling order $p$ often leads to models that yield satisfactory representations.

Finally, the corrupting signal is assumed to be a white Gaussian noise process, which additively combines with the clean speech, i.e.:

$$y_t = x_t + v_t, \qquad (2)$$

where $p(v_t | \sigma_v^2) = \mathcal{N}(0, \sigma_v^2)$ and $\operatorname{cov}(v_s, v_t) = 0$ for $s \neq t$. The noise process is assumed to be uncorrelated with the speech signal, an assumption which is valid for most background noise processes.
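As a concrete illustration of the model in (1) and (2), the following sketch simulates an AR(p) signal and its noisy observation; the coefficients, variances and block length are illustrative values, not taken from the report.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ar(a, sigma_e, T, x0=None):
    """Simulate T samples of the AR(p) process (1): x_t = sum_i a_i x_{t-i} + e_t."""
    p = len(a)
    x = np.zeros(T + p)
    if x0 is not None:
        x[:p] = x0                                  # initial samples x_{-p+1}, ..., x_0
    for t in range(p, T + p):
        x[t] = a @ x[t - p:t][::-1] + sigma_e * rng.standard_normal()
    return x[p:]

# Illustrative AR(2) "speech" model and noise level.
a = np.array([1.5, -0.7])
sigma_e, sigma_v, T = 1.0, 0.5, 200
x = simulate_ar(a, sigma_e, T)
y = x + sigma_v * rng.standard_normal(T)            # observation model (2)
```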

2.1 Block Processing

In reality the speech, and most noise processes, are non-stationary. Changes in the statistics of the speech process are due to configuration changes of the vocal tract and variations in the characteristics of the excitation source, and occur on the 10-100 msec scale. Although the exact scale of change of the noise process statistics may vary significantly across different applications, for almost all problems of practical importance it is larger than that of the speech process. Typical values range on the 1-10 sec scale, and may be even larger. To deal accurately with these variations requires the use of non-stationary models for the speech and noise processes, which hugely complicates the parameter estimation and reconstruction procedures.

A satisfactory compromise is found in processing the speech in blocks of T samples, in each of which the speech and noise processes are assumed to be stationary. To eliminate transient effects and hence ensure continuity, the blocks are normally chosen to overlap to some degree. This methodology is quite common in speech processing, and yields satisfactory results if the quasi-stationary assumption holds true. In this case it is convenient to represent (1) and (2) in vector-matrix form for a block of T contiguous samples as:

$$e = x_1 - Xa = Ax, \qquad (3)$$

$$y = x_1 + v, \qquad (4)$$

where $e = [e_1, \dots, e_T]^T$ is the vector of excitations, $v = [v_1, \dots, v_T]^T$ is the vector of noise samples, $y = [y_1, \dots, y_T]^T$ is the vector of observations, and $a = [a_1, \dots, a_p]^T$ is the coefficient vector for the AR(p) speech production model. The vector of uncorrupted speech samples is partitioned as $x = [x_0^T, x_1^T]^T$, where $x_0 = [x_{-p+1}, \dots, x_0]^T$ contains the $p$ initial samples, and $x_1 = [x_1, \dots, x_T]^T$ the succeeding $T$ samples. The matrices $X$ and $A$ are constructed so as to generate the elements of $e$ and $x$ in the correct order. More specifically:

$$X = \begin{bmatrix} x_0 & x_{-1} & \cdots & x_{-p+1} \\ x_1 & x_0 & \cdots & x_{-p+2} \\ \vdots & \vdots & & \vdots \\ x_{T-1} & x_{T-2} & \cdots & x_{T-p} \end{bmatrix} \in \mathbb{R}^{T \times p} \qquad (5)$$

$$A = \begin{bmatrix} -a_p & \cdots & -a_1 & 1 & 0 & \cdots & 0 \\ 0 & -a_p & \cdots & -a_1 & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & -a_p & \cdots & -a_1 & 1 \end{bmatrix} \in \mathbb{R}^{T \times (T+p)}. \qquad (6)$$
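For small blocks the matrices in (5) and (6) can be materialised directly; the sketch below is a minimal illustration (in practice the report works with the state-space form instead, precisely to avoid such large matrices). The sign convention follows $e = x_1 - Xa = Ax$.

```python
import numpy as np

def build_X(x_full, p):
    """Lag matrix X of (5). x_full = [x_{-p+1}, ..., x_0, x_1, ..., x_T]."""
    T = len(x_full) - p
    # column c holds x_{t-1-c} for t = 1, ..., T
    return np.column_stack([x_full[p - 1 - c: p - 1 - c + T] for c in range(p)])

def build_A(a, T):
    """Banded matrix A of (6), satisfying e = A x for the stacked vector x."""
    p = len(a)
    A = np.zeros((T, T + p))
    for r in range(T):
        A[r, r:r + p] = -a[::-1]     # -a_p, ..., -a_1
        A[r, r + p] = 1.0
    return A

# Consistency check: x_1 - X a equals A x_full, e.g.
# assert np.allclose(x_full[p:] - build_X(x_full, p) @ a, build_A(a, T) @ x_full)
```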

2.2 State-space Form

Denoting $\theta$ as the vector comprising the unknown model parameters, the speech and noise models described above are readily represented in the state-space form of Appendix B.1, repeated here for convenience as:

$$y_t = Z_t(\theta)\, \alpha_t + G_t(\theta)\, u_t, \qquad (7)$$

$$\alpha_{t+1} = T_t(\theta)\, \alpha_t + H_t(\theta)\, u_t, \qquad (8)$$


where the state, observation and disturbance vectors are defined as $\alpha_t = [x_t, \dots, x_{t-p+1}]^T$, $y_t = [y_t]$ and $u_t = [e_{t+1}, v_t]^T$, respectively. The system matrices then follow readily as:

$$Z_t(\theta) = Z = \begin{bmatrix} 1 & 0_{1 \times (p-1)} \end{bmatrix}, \qquad G_t(\theta) = G = \begin{bmatrix} 0 & 1 \end{bmatrix}, \qquad (9)$$

$$T_t(\theta) = T(a) = \begin{bmatrix} a^T \\ I_{p-1} \quad 0_{(p-1) \times 1} \end{bmatrix}, \qquad H_t(\theta) = H = \begin{bmatrix} 1 & 0 \\ 0_{(p-1) \times 2} \end{bmatrix}, \qquad (10)$$

with the covariance matrix for the disturbance process $\{u_t\}$ given by:

$$Q_t(\theta) = Q(\sigma_e^2, \sigma_v^2) = \begin{bmatrix} \sigma_e^2 & 0 \\ 0 & \sigma_v^2 \end{bmatrix}. \qquad (11)$$

In the above, $0_{m \times n}$ is the $m \times n$ matrix of zeros, and $I_n$ is the $n \times n$ identity matrix. The utility of the state-space representation lies in the fast and efficient Kalman filter-based algorithms it facilitates. Some of these will form integral parts of the enhancement algorithms presented in Section 4.
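The system matrices of (9) to (11) are simple to assemble in code. The sketch below mirrors those definitions for the state vector $\alpha_t = [x_t, \dots, x_{t-p+1}]^T$; it is a minimal illustration, not the report's implementation.

```python
import numpy as np

def state_space_matrices(a, sigma_e2, sigma_v2):
    """Assemble Z, T, G, H and Q of (9)-(11) for an AR(p) signal in white noise."""
    p = len(a)
    Z = np.zeros((1, p)); Z[0, 0] = 1.0      # observation picks out x_t
    G = np.array([[0.0, 1.0]])               # observation noise: v_t from u_t
    T = np.zeros((p, p))
    T[0, :] = a                              # first row: AR coefficients
    T[1:, :-1] = np.eye(p - 1)               # shift the delayed samples down
    H = np.zeros((p, 2)); H[0, 0] = 1.0      # state noise: e_{t+1} from u_t
    Q = np.diag([sigma_e2, sigma_v2])        # cov of u_t = [e_{t+1}, v_t]^T
    return Z, T, G, H, Q
```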

3 BAYESIAN SPEECH ENHANCEMENT

Recovering the uncorrupted speech and determining the unknown model parameters can be cast as a statistical estimation problem. This section shows how this problem can be solved within a fully Bayesian setting.

3.1 The Posterior Distribution

Denoting, as above, $\theta$ as the vector comprising all the unknown model parameters, and treating it as a random vector along with the uncorrupted speech $x$, Bayes' rule can be employed to yield a posterior distribution for these quantities, after the noisy speech data $y$ has been observed, as:

$$p(x, \theta | y) = \frac{p(y | x, \theta)\, p(x_1 | x_0, \theta)\, p(x_0 | \theta)\, p(\theta)}{p(y)}, \qquad (12)$$

where $p(\theta)$ is the prior distribution of the unknown model parameters, $p(x_0 | \theta)$ is the likelihood of the initial samples, $p(x_1 | x_0, \theta)$ is the conditional likelihood of the speech production model (so called since it is conditional on the initial samples $x_0$), and $p(y | x, \theta)$ is the likelihood of the observed data. The term $p(y)$ is sometimes referred to as the evidence for the model structure, since it gives the probability of observing the noisy data, taking into account all possible values for the model parameters and the uncorrupted speech. It can be expressed as:

$$p(y) = \int p(y | x, \theta)\, p(x_1 | x_0, \theta)\, p(x_0 | \theta)\, p(\theta)\, dx\, d\theta. \qquad (13)$$

It is a constant normalising factor, independent of the uncorrupted speech data $x$ and the model parameters $\theta$, and can usually be omitted in applications where issues of model selection and model comparison are not considered.

It should be noted that performing Bayesian inference with respect to $p(x, \theta | y)$ (or its marginals) is equivalent to employing the state-space form of the model, presented in Section 2.2, and performing inference with respect to $p(\alpha_{1:T}, \theta | y_{1:T})$ (or its corresponding marginals). If $\hat{\alpha}_{1:T}$ is an estimate for the stacked state vector, computed by some algorithm, the corresponding estimate for the speech vector $\hat{x}$ can straightforwardly be recovered by setting $\hat{x}_t = \hat{\alpha}_t^{(1)}$, $t = 1, \dots, T$, where the notation $v^{(i)}$ denotes the $i$-th element of the vector $v$. This utility of the state-space form is extensively employed in the development of the speech enhancement algorithms in Section 4.

The subsequent sections develop expressions for the important quantities in the joint posterior distribution of (12).

3.2 The Prior $p(\theta)$

The purpose of a prior distribution is to express initial beliefs about the parameters before the data is seen. These initial beliefs are then refined through Bayes' rule as the data becomes available, to yield the posterior parameter distribution. The priors should be chosen to reflect either the true prior distribution (where this is available, or can be accurately measured), or a more subjective belief of what the parameters should be. The exact form of the prior distribution depends on the extent of the initial information available. If nothing or very little is known a priori about the parameters, a vague (also called uniform, diffuse or non-informative) prior should be chosen. This kind of prior expresses no initial preference for one set of parameter values over any other. If, on the other hand, an approximate estimate for the parameter values is available beforehand, it may be appropriate to choose a prior with its mean or mode centred on this estimate, and its variance chosen to reflect the confidence in this estimate. Some other, more subtle, issues pertaining to the choice of a prior distribution are discussed in the points below (see (Box and Tiao, 1973; Bernardo and Smith, 1994) for more detail).



- For any specific set of prior beliefs there is some arbitrariness in the precise choice of the mathematical representation of the corresponding prior densities. Given this flexibility, choices for priors should be made to aid the analytic tractability of the overall problem. This should, however, not be done at the expense of expressing the prior beliefs accurately. One way of ensuring tractability is by the method of conjugate analysis. This technique is based on the fact that, since Bayes' rule can be expressed in the form:

  $$p(\theta | x) \propto p(x | \theta)\, p(\theta), \qquad (14)$$

  both $p(\theta | x)$ and $p(\theta)$ can be guaranteed to belong to the same general family of mathematical functions by choosing $p(\theta)$ to have the same structure as $p(x | \theta)$, when the latter is viewed as a function of $\theta$ (a worked instance is sketched after this list).

- As the number of data observations $T$ increases, the posterior density $p(\theta | x)$ in (14) is more strongly dominated by the likelihood function $p(x | \theta)$. Under these conditions a precise mathematical representation of $p(\theta)$ is unnecessary, since the parameters $\theta$ will normally be well-determined by the data.

- A desirable property for a prior distribution on a scalar parameter $\theta$ is invariance to transformations of $\theta$, i.e. if $\omega = f(\theta)$ for some $f(\cdot)$, then the resulting prior $p(\omega)$ should express the same information about $\omega$ as $p(\theta)$ did about $\theta$. In this way any arbitrariness in parameterising a variable will make no difference to the final results.
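To make the conjugacy point concrete (see the first bullet above), here is the standard calculation for a variance parameter with an inverted-gamma prior and a zero-mean Gaussian likelihood; the notation is chosen to match the $\mathcal{IG}(\cdot\,; \alpha, \beta)$ priors used below, and the result is a standard one rather than a derivation from the report.

```latex
% Conjugacy of the inverted-gamma prior with a Gaussian likelihood:
\begin{align*}
p(\sigma^2) &\propto (\sigma^2)^{-(\alpha+1)} \exp\!\big(-\beta/\sigma^2\big), \\
p(e_{1:T} \mid \sigma^2) &\propto (\sigma^2)^{-T/2}
    \exp\!\Big(-\tfrac{1}{2\sigma^2}\textstyle\sum_{t=1}^{T} e_t^2\Big), \\
p(\sigma^2 \mid e_{1:T}) &\propto (\sigma^2)^{-(\alpha + T/2 + 1)}
    \exp\!\Big(-\big(\beta + \tfrac{1}{2}\textstyle\sum_{t=1}^{T} e_t^2\big)\big/\sigma^2\Big).
\end{align*}
% The posterior is again inverted-gamma, with parameters
% (alpha + T/2, beta + (1/2) sum_t e_t^2).
```

The posterior is thus again inverted-gamma, which is exactly the structure that reappears in the Gibbs sampling conditionals of Section 4.3.2.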

For the speech enhancement problem considered here, the unknown parameter vector can be denoted as $\theta = [a^T, \sigma_e^2, \sigma_v^2, \sigma_a^2]^T$. Assuming the parameters to be a priori independent and employing the guidelines presented above, the following prior distribution is imposed:

$$p(\theta) = \underbrace{p(a | \sigma_a^2)\, p(\sigma_e^2)\, p(\sigma_v^2)}_{\text{priors}}\; \underbrace{p(\sigma_a^2)}_{\text{hyperprior}} = \mathcal{N}\!\left(a; \mu_a, \sigma_a^2 I_p\right) \mathcal{IG}\!\left(\sigma_e^2; \alpha_e, \beta_e\right) \mathcal{IG}\!\left(\sigma_v^2; \alpha_v, \beta_v\right) \mathcal{IG}\!\left(\sigma_a^2; \alpha_a, \beta_a\right). \qquad (15)$$

Here $\mathcal{N}(\cdot\,; \mu, \Sigma)$ denotes the multivariate Gaussian distribution with mean vector $\mu$ and covariance matrix $\Sigma$, and $\mathcal{IG}(\cdot\,; \alpha, \beta)$ the inverted-gamma distribution with parameters $\alpha$ and $\beta$ (see Appendix A for more detail). In (15), $\sigma_a^2$ is a hyperparameter controlling the covariance matrix of the prior on the AR coefficients; hence the prior on this quantity is referred to as a hyperprior. Note that the parameters of the chosen prior distributions can easily be adjusted to reflect both vague and informative prior beliefs.
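As an illustration of how (15) might be instantiated, the snippet below draws one parameter vector from the prior using deliberately vague hyperparameters; all numeric values are illustrative assumptions, not settings from the report.

```python
import numpy as np
from scipy.stats import invgamma, multivariate_normal

p = 10                                         # AR model order (illustrative)
alpha_e = beta_e = alpha_v = beta_v = 0.01     # small values give broad IG priors
alpha_a = beta_a = 0.01
mu_a = np.zeros(p)

# One draw of theta = [a, sigma_e^2, sigma_v^2, sigma_a^2] from the prior (15):
sigma_a2 = invgamma.rvs(alpha_a, scale=beta_a)               # hyperprior
a = multivariate_normal.rvs(mean=mu_a, cov=sigma_a2 * np.eye(p))
sigma_e2 = invgamma.rvs(alpha_e, scale=beta_e)
sigma_v2 = invgamma.rvs(alpha_v, scale=beta_v)
```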

3.3 The Likelihood of the Initial Samples $p(x_0 | \theta)$

The distribution $p(x_0 | \theta)$ is the likelihood of the $p$ initial uncorrupted speech samples, and, since the data is assumed to be Gaussian, can be written as:

$$p(x_0 | \theta) = p(x_0 | a, \sigma_e^2) = \mathcal{N}\!\left(x_0; \mu_{x_0}, \Sigma_{x_0}\right). \qquad (16)$$

Several options are available for specifying the mean $\mu_{x_0}$ and covariance $\Sigma_{x_0}$ in (16). If nothing is known a priori about $x_0$, then $\mu_{x_0} = 0$ and $\Sigma_{x_0}$ is the autocovariance matrix for $p$ samples drawn from an AR(p) process with coefficients $a$ and excitation variance $\sigma_e^2$ (Box et al., 1994; Priestley, 1981). In this case the elements of $\Sigma_{x_0}$ are dependent on $a$ and $\sigma_e^2$. Alternatively, if an initial estimate for $x_0$ is available, $\mu_{x_0}$ and $\Sigma_{x_0}$ can be chosen to encourage continuity of $x$ with the section of data immediately preceding the current block. Taking the available initial estimate as the mean and letting $\Sigma_{x_0} \to 0$ ensures strict continuity. In this case $p(x_0 | \theta)$ is independent of the parameters of the AR(p) process (i.e. $p(x_0 | \theta) = p(x_0)$), and is fully specified by some user-defined constants. Finally, it should be noted that, even in cases where no prior information about $x_0$ is available, the likelihood $p(x_0 | \theta)$ can often be dropped from (12) without significant loss in accuracy. This is due to the fact that, in typical applications (certainly those considered here), $T \gg p$, so that the contribution of $p(x_0 | \theta)$ to the overall probability mass is negligible compared to that of the conditional likelihood $p(x_1 | x_0, \theta)$. A sketch of the autocovariance computation for the first option is given below.
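For the first option, the autocovariance matrix $\Sigma_{x_0}$ can be computed from the state-space form of Section 2.2 by solving a discrete Lyapunov equation, $\Sigma = T \Sigma T^T + H Q H^T$; the sketch below assumes a stable AR process (all poles inside the unit circle) and is one possible implementation rather than the report's.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def ar_autocovariance(a, sigma_e2):
    """Stationary covariance of p consecutive samples of a stable AR(p) process."""
    p = len(a)
    T = np.zeros((p, p))
    T[0, :] = a                    # transition matrix of Section 2.2
    T[1:, :-1] = np.eye(p - 1)
    HQH = np.zeros((p, p))
    HQH[0, 0] = sigma_e2           # only the excitation e_t drives the state
    return solve_discrete_lyapunov(T, HQH)
```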

3.4 The Model Likelihood $p(x_1 | x_0, \theta)$

The conditional likelihood function $p(x_1 | x_0, \theta)$ gives the probability of the AR(p) speech production model generating the data $x_1$, given values for the model parameters $\theta$ and initial samples $x_0$. It can be factorised according to the probability chain rule as (conditioning on $\theta$ is implicit in what follows, and dropped for the sake of notational clarity):

$$p(x_1 | x_0) = p(x_1, \dots, x_T | x_{-p+1}, \dots, x_0) = \prod_{t=1}^{T} p(x_t | x_{t-1}, \dots, x_{-p+1}) = \prod_{t=1}^{T} p(x_t | x_{t-1}, \dots, x_{t-p}), \qquad (17)$$

where, in the last step, use has been made of the fact that the distribution of the current sample $x_t$, conditional on all the past data samples, is equal to its distribution conditional only on the last $p$ samples. Knowing that $p(e_t | \sigma_e^2) = \mathcal{N}(0, \sigma_e^2)$ and employing a straightforward random variable transformation from $e_t$ to $x_t$, the predictive density $p(x_t | x_{t-1}, \dots, x_{t-p})$ is obtained as:

$$p(x_t | x_{t-1}, \dots, x_{t-p}) = \mathcal{N}\!\left(x_t; \sum_{i=1}^{p} a_i x_{t-i}, \sigma_e^2\right). \qquad (18)$$

Substituting (18) into (17) and writing the result in terms of the vector-matrix form of (3) leads to the following expression for the conditional likelihood function:

$$p(x_1 | x_0, \theta) = p(x_1 | x_0, a, \sigma_e^2) = \mathcal{N}\!\left(x_1; Xa, \sigma_e^2 I_T\right). \qquad (19)$$

3.5 The Likelihood $p(y | x, \theta)$

Given the model parameters $\theta$ and the reconstructed speech samples $x$, the sequence $\{y_t - x_t\}$ is characterised by the same statistics as the noise process $\{v_t\}$, i.e. $p(y_t | x_t, \sigma_v^2) = \mathcal{N}(y_t; x_t, \sigma_v^2)$. Since, given $x$ and $\theta$, the observations are assumed to be independent, the likelihood function of the observed data can straightforwardly be derived as:

$$p(y | x, \theta) = p(y | x_1, \sigma_v^2) = \prod_{t=1}^{T} p(y_t | x_t, \sigma_v^2) = \mathcal{N}\!\left(y; x_1, \sigma_v^2 I_T\right). \qquad (20)$$

3.6 Bayesian Inference

Now that the joint posterior distribution of (12) has been fully specified, it can be employed for Bayesian inference, such as obtaining estimates for the unknown model parameters and the reconstructed speech data, here grouped together as $\psi = [x^T, \theta^T]^T$. The desired quantities of inference are normally defined in terms of some cost function $C(\hat{\psi}, \psi)$, which expresses objectively a measure of the cost associated with a particular estimate $\hat{\psi}$, compared to any other value of $\psi$ (Duda and Hart, 1973). The form of the cost function is dependent on the requirements of the problem under consideration. A cost of zero indicates that the estimate is perfect for the required purposes, while increasing positive values indicate worse estimates. The risk $R(\hat{\psi})$ is defined as the expected cost associated with an estimate $\hat{\psi}$, i.e.:

$$R(\hat{\psi}) = E\left[ C(\hat{\psi}, \psi) \right] = \int C(\hat{\psi}, \psi)\, p(\psi | y)\, p(y)\, d\psi\, dy. \qquad (21)$$


The desired estimate $\hat{\psi}$ is the one that minimises the risk. This minimum risk is known as the Bayes risk. For non-negative cost functions it is sufficient to minimise the inner integral:

$$I(\hat{\psi}) = \int C(\hat{\psi}, \psi)\, p(\psi | y)\, d\psi \qquad (22)$$

for all $\hat{\psi}$. Typical cost functions are the quadratic cost function, defined as:

$$C_q(\hat{\psi}, \psi) = |\hat{\psi} - \psi|^2, \qquad (23)$$

and the uniform cost function, defined for an arbitrarily small $\epsilon$ as:

$$C_u(\hat{\psi}, \psi) = \begin{cases} 1 & \text{if } |\hat{\psi} - \psi| > \epsilon \\ 0 & \text{otherwise.} \end{cases} \qquad (24)$$

The quadratic cost function leads to the minimum mean square error (MMSE) estimate, which can be shown to be equal to the mean of the posterior distribution, i.e.:

$$\hat{\psi}_{\text{MMSE}} = E[\psi | y] = \int \psi\, p(\psi | y)\, d\psi. \qquad (25)$$

The uniform cost function is employed in situations where the correct parameter estimate is required at all costs, and any other estimate is of no use. For cases where $\epsilon \to 0$, the uniform cost function leads to the maximum a posteriori (MAP) estimate, i.e. the value of $\hat{\psi}$ that maximises the posterior distribution $p(\psi | y)$:

$$\hat{\psi}_{\text{MAP}} = \arg\max_{\psi} \{p(\psi | y)\}. \qquad (26)$$

Note that for any distribution symmetric about its mean with its maximum at the mean, the MMSE and MAP solutions coincide. Depending on the application, other types of estimates may also be defined, most of which can be expressed as posterior expectations of functions of the desired variables, i.e.:

$$\hat{\psi} = E[f(\psi) | y] = \int f(\psi)\, p(\psi | y)\, d\psi. \qquad (27)$$

Even for simple models, such as the one presented here, the integrals and maximisation procedures implied by (25) to (27) are generally analytically intractable. Note that these estimates also assume the availability of the normalised posterior distribution $p(\psi | y)$, which, in turn, requires the evaluation of the integral in (13), which is also analytically intractable. These difficulties led to the development of a host of iterative numerical solution methodologies. Three of the most successful of these strategies are presented and evaluated in the subsequent sections.

4 SOLUTION METHODS

This section presents three popular numerical techniques to solve the Bayesian inference problem presented in Section 3.6. These techniques originated from various disciplines within the fields of statistical analysis and statistical physics, and are adapted here to fit within the speech enhancement framework.

4.1 Evidence Maximisation

The evidence maximisation framework is due to Gull and Skilling (see, for example, (Gull, 1989)), and is an approximate method to solve the Bayesian inference problem. It involves finding the value of the parameters that maximises their evidence in the light of the observed data, defined as $p(y | \theta)$ (the term 'evidence', as used here, is interchangeable with the term 'likelihood' from standard statistical nomenclature), and, based on these values, determining the most probable value for the unobserved data $x^{\text{MP}}$. The former step is nothing other than maximum likelihood (ML) (or maximum a posteriori (MAP), if the posterior distribution $p(\theta | y)$ is employed instead) estimation of the model parameters. For a subsequent treatment of this method in the context of linear and nonlinear interpolation, see, for example, (MacKay, 1992). This framework was first applied to the speech enhancement problem in (Saleh and Niranjan, 1998) (see also (Saleh, 1996)). There, however, an inexact iterative algorithm was employed to solve the equations that result from equating the derivatives of the log-evidence with respect to the parameters to zero. The convergence properties and bias introduced by this algorithm are not quantified, and the nature of the resulting solutions may differ substantially from those obtained by an exact method. In the exposition below, the evidence maximisation framework is presented in more detail, and subsequently applied to the speech enhancement problem. The resulting equations are solved exactly by an algorithm based on the Kalman filter.

4.1.1 General Description

The posterior distribution for the vector x can be obtained by marginalising (12) over the parameter space, i.e.:

$$p(x | y) = \int p(x | \theta, y)\, p(\theta | y)\, d\theta. \qquad (28)$$

If the posterior distribution for the parameters $p(\theta | y)$ is strongly peaked at the most probable parameter values $\theta^{\text{MP}}$, compared to variations in $p(x | \theta, y)$, the integral in (28) can be approximated as:

$$p(x | y) \approx p(x | \theta^{\text{MP}}, y). \qquad (29)$$

Thus, if $\theta^{\text{MP}}$ is assumed to be available, an estimate for the clean speech can be obtained as:

$$x^{\text{MP}}(\theta^{\text{MP}}) = \arg\max_{x} \{p(x | \theta^{\text{MP}}, y)\}. \qquad (30)$$

The most probable parameter values are taken as those that maximise their posterior distribution, i.e.:

$$\theta^{\text{MP}} = \arg\max_{\theta} \{p(\theta | y)\}, \qquad (31)$$

thus implying maximum a posteriori (MAP) estimation. Following from Bayes’ rule, the posterior distribution for the parameters can be expanded as:

$$p(\theta | y) \propto p(y | \theta)\, p(\theta), \qquad (32)$$

where $p(y | \theta)$ is defined as the evidence for the model parameters $\theta$ in the light of the observed data $y$. Within the original evidence maximisation framework, the prior $p(\theta)$ is assumed to be uniform over the parameter space, so that the maximisation of (32) reduces to the maximisation of the evidence, hence the name. Such an assumption is not required by the subsequent analysis, and hence, for the sake of generality, the prior is retained in the exposition. For the exponential-family distributions considered here, the MAP estimation problem of (31) is often more conveniently solved in the log-domain. Thus, defining the target function $L(y, \theta)$ as:

$$L(y, \theta) = \log p(y | \theta) + \log p(\theta), \qquad (33)$$

the equivalent maximisation problem becomes:

$$\theta^{\text{MP}} = \arg\max_{\theta} \{L(y, \theta)\}. \qquad (34)$$

In most problems of practical interest, (34) cannot be solved directly, and some numerical optimisation strategy needs to be employed. The method adopted here is a second-order gradient-based algorithm, and hence requires the first (gradient) and second (Hessian) partial derivatives of the target function with respect to the model parameters. The details of the algorithm are presented in Appendix C.

4.1.2 Application to Speech Enhancement

This section develops expressions for $\theta^{\text{MP}}$ and $x^{\text{MP}}(\theta^{\text{MP}})$ for the speech enhancement problem stated in Section 3. It is first shown how these quantities can, in principle, be calculated using the standard parameterisation of the model, as given in Section 2. These computations turn out to be numerically inefficient, and analytically intractable for the gradient and Hessian of the target function $L(y, \theta)$. Subsequently, the exposition proceeds to develop efficient numerical techniques, based on the Kalman filter, to evaluate the target function $L(y, \theta)$ and calculate its gradient and an approximation to its Hessian with respect to the model parameters.

Using Bayes' rule, the posterior conditional for the clean speech can be expanded as:

$$p(x | \theta, y) = \frac{p(y | x, \theta)\, p(x | \theta)}{p(y | \theta)}, \qquad (35)$$

where the terms in the numerator are the likelihood functions derived in Section 3, and $p(y | \theta)$ can again be recognised as the evidence for the model parameters in the light of the observed data, acting here as a normalising constant independent of $x$. Substituting (16), (19) and (20) into (35), and rearranging, leads to the following expression for the posterior conditional distribution of the clean speech:

$$p(x | \theta, y) = \mathcal{N}\!\left(x; \mu_x^{\text{MP}}, \Sigma_x^{\text{MP}}\right), \qquad (36)$$

where the mean vector and covariance matrix are given by:

$$\mu_x^{\text{MP}} = \Sigma_x^{\text{MP}} \begin{bmatrix} \Sigma_{x_0}^{-1} \mu_{x_0} \\ \frac{1}{\sigma_v^2}\, y \end{bmatrix} \qquad (37)$$

$$\Sigma_x^{\text{MP}} = \left( \frac{1}{\sigma_e^2}\, A^T A + \begin{bmatrix} \Sigma_{x_0}^{-1} & 0_{p \times T} \\ 0_{T \times p} & \frac{1}{\sigma_v^2}\, I_T \end{bmatrix} \right)^{-1}. \qquad (38)$$

Thus, since $p(x | \theta, y)$ is Gaussian, it is maximised by its mean, and this value can be taken as the most probable estimate for $x$, given any value of the parameters $\theta$, i.e. $x^{\text{MP}}(\theta) = \mu_x^{\text{MP}}$. Computing this estimate directly involves inverting a $(T+p) \times (T+p)$ matrix. This operation is of $O((T+p)^3)$ complexity, and can become computationally prohibitive for typical values of $T$. Fortunately, this estimate can be obtained very efficiently by an algorithm based on the Kalman filter, and results as a natural by-product of the evidence maximisation strategy discussed later in this section.

Returning now to the problem of finding the most probable parameter estimate $\theta^{\text{MP}}$, it is necessary, first, to specify the terms on the RHS of (33). The second term is simply the logarithm of the prior distribution given by (15), and follows straightforwardly as:

$$\log p(\theta) \propto -\left[ \frac{p}{2} \log 2\pi\sigma_a^2 + \frac{1}{2\sigma_a^2}(a - \mu_a)^T(a - \mu_a) + (\alpha_e + 1)\log \sigma_e^2 + \frac{\beta_e}{\sigma_e^2} + (\alpha_v + 1)\log \sigma_v^2 + \frac{\beta_v}{\sigma_v^2} + (\alpha_a + 1)\log \sigma_a^2 + \frac{\beta_a}{\sigma_a^2} \right], \qquad (39)$$

where the normalising constants of the inverted-gamma distributions have been discarded. The first term on the RHS of (33) involves the evidence $p(y | \theta)$, which may be obtained by marginalising the numerator of (35) over the supporting space of $x$, i.e.:

$$p(y | \theta) = \int p(y | x, \theta)\, p(x | \theta)\, dx. \qquad (40)$$

Since the integrand in (40), as a function of $x$, is proportional to a Gaussian distribution, the solution of the integral is simply the normalising constant of this Gaussian distribution. Taking the logarithm of this value leads to the following expression for the log-evidence:

$$\log p(y | \theta) = -\frac{T}{2}\log 2\pi\sigma_e^2\sigma_v^2 + \frac{1}{2}\log\left|\Sigma_x^{\text{MP}}\right| - \frac{1}{2}\log\left|\Sigma_{x_0}\right| - \frac{1}{2}\left( \frac{1}{\sigma_v^2}\, y^T y + \mu_{x_0}^T \Sigma_{x_0}^{-1}\mu_{x_0} - (\mu_x^{\text{MP}})^T (\Sigma_x^{\text{MP}})^{-1} \mu_x^{\text{MP}} \right). \qquad (41)$$

Unfortunately, determining the gradient and Hessian of the target function $L(y, \theta)$ with respect to the model parameters, given the expression for the log-evidence in (41), is not easily done, since its terms are complex functions of the components of the parameter vector. However, if the model is represented in state-space form (see Section 2.2), these quantities can be computed very efficiently by an algorithm based on the Kalman filter, discussed next. Since $y$ is trivially equivalent to the stacked observation vector $y_{1:T}$, the target function can be rewritten as:

$$L(y_{1:T}, \theta) = \log p(y_{1:T} | \theta) + \log p(\theta) = l(y_{1:T}, \theta) + \log p(\theta), \qquad (42)$$

where $l(y_{1:T}, \theta)$ can be recognised as the log-likelihood function given by (103) in Appendix B.3. Thus, (42) is very efficiently evaluated by a single forwards run of the Kalman filter. Furthermore, the gradient and an approximation to the Hessian of $l(y_{1:T}, \theta)$ with respect to the model parameters can be obtained by an application of the Fisher identity (Fisher, 1925), given by:

$$\frac{\partial}{\partial\theta} \log p(y_{1:T} | \theta) = E\left[ \frac{\partial}{\partial\theta} \log p(y_{1:T}, \alpha_{1:T} | \theta) \,\Big|\, \theta, y_{1:T} \right], \qquad (43)$$

and the Kalman smoother (see Appendix B.2). The conditional expectation in (43) is taken relative to the distribution $p(\alpha_{1:T} | \theta, y_{1:T})$. Details of these computations are presented in Appendix B.3.1 for the gradient, and Appendix B.3.2 for the approximate Hessian. Combining the expressions for the components of the gradient of $l(y_{1:T}, \theta)$ in (121) to (123) of Appendix B.3.1 with the first partial derivatives of the log-prior in (39) leads to the following expressions for the components of the gradient of the target function $L(y_{1:T}, \theta)$:

$$\frac{\partial}{\partial a} L(y_{1:T}, \theta) = \frac{1}{\sigma_e^2}\, M \Phi \tilde{a} - \frac{1}{\sigma_a^2}(a - \mu_a) \qquad (44)$$

$$\frac{\partial}{\partial \sigma_e^2} L(y_{1:T}, \theta) = \frac{1}{(\sigma_e^2)^2}\left( \frac{1}{2}\, \tilde{a}^T \Phi \tilde{a} + \beta_e \right) - \frac{1}{\sigma_e^2}\left( \frac{T}{2} + \alpha_e + 1 \right) \qquad (45)$$

$$\frac{\partial}{\partial \sigma_v^2} L(y_{1:T}, \theta) = \frac{1}{(\sigma_v^2)^2}\left( \frac{1}{2}\left( Z \Phi Z^T + \Psi \right) + \beta_v \right) - \frac{1}{\sigma_v^2}\left( \frac{T}{2} + \alpha_v + 1 \right) \qquad (46)$$

$$\frac{\partial}{\partial \sigma_a^2} L(y_{1:T}, \theta) = \frac{1}{(\sigma_a^2)^2}\left( \frac{1}{2}(a - \mu_a)^T(a - \mu_a) + \beta_a \right) - \frac{1}{\sigma_a^2}\left( \frac{p}{2} + \alpha_a + 1 \right), \qquad (47)$$

with

$$M = \begin{bmatrix} 0_{p \times 1} & I_p \end{bmatrix} \qquad (48)$$

$$\Phi = \sum_{t=1}^{T} \left( P_{t|T} + a_{t|T}\, a_{t|T}^T \right) \qquad (49)$$

$$\Psi = \sum_{t=1}^{T} \left( y_t^2 - 2\, y_t\, Z a_{t|T} \right). \qquad (50)$$

In the above, $\tilde{a} = [1, -a^T]^T$, and the state estimates and their covariances $a_{t|T}$, $P_{t|T}$, $t = 1, \dots, T$ are obtained by running the Kalman smoother with $\theta$ as the parameter vector. It should be noted that, in order to simplify the computations, the state-space formulation of the model in Section 2.2 is redefined by extending the state vector to include one more delayed element, i.e. $\alpha_t = [x_t, \dots, x_{t-p}]^T$ (see Appendix B.3.1 for more detail). Similarly, the non-zero elements of the approximate Hessian of $L(y_{1:T}, \theta)$ are obtained by combining the expressions for the elements of the approximate Hessian of $l(y_{1:T}, \theta)$ in (126) to (129) of Appendix B.3.2 with the second partial derivatives of the log-prior, and follow as:

$$\frac{\partial^2}{\partial a\, \partial a^T} L(y_{1:T}, \theta) = -\left( \frac{1}{\sigma_e^2}\, \tilde{\Phi} + \frac{1}{\sigma_a^2}\, I_p \right) \qquad (51)$$

$$\frac{\partial^2}{\partial a\, \partial \sigma_e^2} L(y_{1:T}, \theta) = -\frac{1}{(\sigma_e^2)^2}\, M \Phi \tilde{a} \qquad (52)$$

$$\frac{\partial^2}{\partial a\, \partial \sigma_a^2} L(y_{1:T}, \theta) = \frac{1}{(\sigma_a^2)^2}\, (a - \mu_a) \qquad (53)$$

$$\frac{\partial^2}{\partial (\sigma_e^2)^2} L(y_{1:T}, \theta) = -\frac{1}{(\sigma_e^2)^2}\left[ \frac{1}{\sigma_e^2}\left( \tilde{a}^T \Phi \tilde{a} + 2\beta_e \right) - \left( \frac{T}{2} + \alpha_e + 1 \right) \right] \qquad (54)$$

$$\frac{\partial^2}{\partial (\sigma_v^2)^2} L(y_{1:T}, \theta) = -\frac{1}{(\sigma_v^2)^2}\left[ \frac{1}{\sigma_v^2}\left( Z \Phi Z^T + \Psi + 2\beta_v \right) - \left( \frac{T}{2} + \alpha_v + 1 \right) \right] \qquad (55)$$

$$\frac{\partial^2}{\partial (\sigma_a^2)^2} L(y_{1:T}, \theta) = -\frac{1}{(\sigma_a^2)^2}\left[ \frac{1}{\sigma_a^2}\left( (a - \mu_a)^T(a - \mu_a) + 2\beta_a \right) - \left( \frac{p}{2} + \alpha_a + 1 \right) \right], \qquad (56)$$

where $\tilde{\Phi}$ is $\Phi$ with the first row and column removed.

Apart from being computationally efficient, this method also has the added advantage that it computes the optimal smoothed state estimates $a_{t|T}$ (and their covariances $P_{t|T}$), $t = 1, \dots, T$, as a natural by-product. Thus, once the optimisation algorithm has converged to a maximum, these estimates contain all the information necessary to recover the optimal estimate for the clean speech $x^{\text{MP}}(\theta^{\text{MP}})$, as discussed in Section 3.1. The complete algorithm is summarised below.

Algorithm 1: Evidence Maximisation for Speech Enhancement

1. Choose an initial value for the unknown model parameters $\theta^{(0)}$.

2. Set the constant parameters of the prior distributions to reflect the available prior information.

3. Use the gradient-based optimisation algorithm of Appendix C to find the value of the parameters $\theta^{\text{MP}}$ that maximises the target function $L(y_{1:T}, \theta)$, given by (42). Each evaluation of the gradient and approximate Hessian requires the following steps:

   (a) Run the Kalman smoother with the current estimate for the parameter vector (see Section B.2) to compute $a_{t|T}$ and $P_{t|T}$, $t = 1, \dots, T$.

   (b) Use these estimates in (44) to (47) and (51) to (56) to obtain the required gradient and approximate Hessian, respectively.

4. Run the Kalman smoother with the optimal parameter estimate $\theta^{\text{MP}}$ to obtain the corresponding optimal state estimates, and from these recover the optimal estimate for the speech vector $x^{\text{MP}}(\theta^{\text{MP}})$, as described in Section 3.1.
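The Kalman recursions that evaluate (42) live in Appendices B and C and are not reproduced in this section. As a rough, self-contained sketch of the same idea, the log-evidence $l(y_{1:T}, \theta)$ can be computed by the prediction error decomposition and handed to a generic quasi-Newton optimiser; the prior terms and exact derivatives are omitted for brevity, and all names below are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def kalman_loglik(y, a, sigma_e2, sigma_v2):
    """log p(y_{1:T}|theta) by the prediction error decomposition (Section 2.2 model)."""
    p = len(a)
    T_mat = np.zeros((p, p)); T_mat[0, :] = a; T_mat[1:, :-1] = np.eye(p - 1)
    Z = np.zeros(p); Z[0] = 1.0
    Qs = np.zeros((p, p)); Qs[0, 0] = sigma_e2     # H Q H^T
    alpha, P, ll = np.zeros(p), 1e4 * np.eye(p), 0.0
    for yt in y:
        f = Z @ P @ Z + sigma_v2                   # innovation variance
        v = yt - Z @ alpha                         # innovation
        ll -= 0.5 * (np.log(2 * np.pi * f) + v * v / f)
        K = P @ Z / f                              # Kalman gain
        alpha, P = alpha + K * v, P - np.outer(K, Z @ P)
        alpha, P = T_mat @ alpha, T_mat @ P @ T_mat.T + Qs
    return ll

def neg_log_evidence(phi, y, p):
    # phi packs [a, log sigma_e^2, log sigma_v^2]; logs keep the variances positive
    return -kalman_loglik(y, phi[:p], np.exp(phi[p]), np.exp(phi[p + 1]))

# res = minimize(neg_log_evidence, phi0, args=(y, p), method="BFGS")
```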

4.1.3 Remarks

1. Evidence maximisation is essentially an optimisation problem, and consequently suffers from all the shortcomings of methods within this framework. For example, if a gradient-based algorithm is employed, as is the case here, convergence is to a local optimum only, resulting in the quality of the final solution being very much dependent on the parameter estimates used to initialise the algorithm. Fortunately, in processing speech, it is often possible to obtain good initial estimates (see Section 5.1.1 for more detail), and convergence to an acceptable solution is almost always guaranteed.

2. Since the maximisation of the evidence is performed in the marginal space that results when integrating over the support of $x$, hence eliminating dependencies on these variables, this method is expected to perform better than, for example, the EM algorithm, where the mutual dependencies between the parameters $\theta$ and the clean speech vector $x$ may dramatically slow down the convergence to a solution.

3. From a theoretical perspective, the quality of the final solution $x^{\text{MP}}(\theta^{\text{MP}})$ is crucially dependent on how accurate the approximation of (29) is. If the posterior uncertainty about the parameters is not negligible, the delta distribution $\delta(\theta - \theta^{\text{MP}})$ will be a poor representation of $p(\theta | y)$, so that inferences with respect to $p(x | \theta^{\text{MP}}, y)$ may lead to substantially worse solutions compared to those resulting from inferences with respect to the true marginal distribution $p(x | y)$.

4. It is not immediately obvious how to extend this framework to accommodate discrete variables as components of the unknown parameter vector. Such variables are widely used in statistical models, and have, for example, been employed to represent outliers in noise models (Godsill and Rayner, 1995; Godsill and Rayner, 1996; Godsill, 1997; Godsill and Rayner, 1998), model orders for model selection purposes (Troughton and Godsill, 1997; Troughton and Godsill, 1998), time delays in pitch estimation (Vermaak et al., 1998a), and many more. This limitation severely restricts the class of statistical estimation problems this method can be applied to.

5. Constraints on the parameters, such as the variances being positive, and the poles of the AR process lying strictly within the unit circle, are not explicitly enforced within this framework. If these constraints can be expressed in terms of quantitative limits (which is rather difficult for the parameterisation employed here for the AR coefficients), a constrained optimisation algorithm may be employed. The hope, however, is that parameter values falling outside the admissible regions will have low probability compared to those inside, so that convergence to such solutions is unlikely to occur, even in the unconstrained case.

6. The evidence maximisation framework does allow for a form of model comparison, in that the evidence for a number of candidate models can be evaluated and utilised to rank the models according to some criterion function. However, the more interesting problem of model averaging, where each candidate model contributes to the final solution in the proportion of its posterior model probability, is not elegantly solvable within this framework.

4.2 The Expectation Maximisation (EM) Algorithm

The Expectation Maximisation (EM) algorithm (Dempster et al., 1977) is an iterative procedure for finding the mode of a posterior distribution $p(\theta | y)$ or a likelihood function $p(y | \theta)$, in cases where it is difficult to maximise the given distribution directly. It augments the observed data $y$ with a set of latent variables $x$, so that the corresponding MAP or ML estimation problem, based on the complete data, defined as $z = [x^T, y^T]^T$, is more tractable. Since the latent variables are unknown, they are represented by their distribution, conditional on the observed data and the current estimate for the model parameters, $p(x | \theta, y)$. The algorithm starts from some initial parameter estimate, and then proceeds to cycle through the following two steps until some convergence criterion is met. The first step, or E-step, finds the conditional expectation of the log-posterior (for MAP estimation) or log-likelihood (for ML estimation) based on the complete data $z$, with respect to the distribution $p(x | \theta, y)$, where $\theta$ is the current estimate for the model parameters. The second step, or M-step, then maximises this expectation with respect to the model parameters, to yield an improved parameter estimate. This estimate is then employed for the conditioning in the next E-step, and so the cycle is repeated. The most remarkable attribute of this algorithm is that it is guaranteed, at each iteration, to increase the original log-posterior or log-likelihood function, or leave it unchanged when a maximum has been reached.

The EM algorithm has found application in a vast array of speech processing problems, including, for example, the training of HMM-based speech recognition systems, mixture density estimation for feature analysis, and many more. It is especially well-suited for the speech enhancement problem presented here, if the unobserved clean speech is taken to be the latent variables. The subsequent discussion presents the EM algorithm in more detail, and proceeds to apply it to the speech enhancement problem at hand. Again, the state-space parameterisation of the model is employed, leading to efficient Kalman filter-based algorithms to estimate the model parameters and the clean speech. This approach is similar to the one adopted in (Gannot et al., 1998).

4.2.1 General Description

Focusing here on the posterior distribution $p(\theta | y)$, the problem is one of maximum a posteriori (MAP) estimation, i.e. finding the value for the model parameter vector such that:

$$\theta^{\text{MP}} = \arg\max_{\theta} \{L(y, \theta)\}, \qquad (57)$$

with $L(y, \theta)$ defined as:

$$L(y, \theta) = \log p(y | \theta) + \log p(\theta). \qquad (58)$$

Note that this estimation objective is exactly the same as that for the evidence maximisation framework, presented in Section 4.1.1. In many situations of practical interest, maximising the target function $L(y, \theta)$ directly is either difficult, impossible or computationally expensive. However, augmenting the observed, or incomplete, data $y$ with a set of unobserved or latent variables $x$, and conditioning on the complete data $z = [x^T, y^T]^T$, often leads to a more tractable problem. Since the latent variables $x$ are unknown, they are represented by their posterior distribution conditional on the observed data and the current estimate of the model parameters, $p(x | \theta, y)$. This distribution is refined by successively conditioning on parameter estimates with a monotonically increasing posterior probability. The observed data $y$, on the other hand, is known, and remains fixed throughout the computations.

The EM algorithm starts with some initial guess at the most probable parameter estimate, $\theta^{(0)}$, and then proceeds to iteratively generate successive estimates $\theta^{(1)}, \theta^{(2)}, \dots$. More specifically, the estimate at iteration $i$ is obtained from that at iteration $i-1$ by applying the following two steps:

E-step: Compute:

$$Q(\theta, \theta^{(i-1)}) = E\left[ \log p(\theta | z) \,\big|\, \theta^{(i-1)}, y \right], \qquad (59)$$

where the conditional expectation is taken relative to the distribution $p(x | \theta^{(i-1)}, y)$.

M-step: Set:

$$\theta^{(i)} = \arg\max_{\theta} \{Q(\theta, \theta^{(i-1)})\}. \qquad (60)$$

The E-step can be seen as computing the expectation of the target function based on the complete data $z$, conditional on the current estimate of the model parameters. The M-step then employs this expectation to perform MAP estimation of the model parameters, an operation that is assumed to be feasible. The algorithm is run until convergence is achieved. In practical implementations the algorithm is normally said to have converged if $\|\theta^{(i)} - \theta^{(i-1)}\| < \epsilon_\theta$ or $|Q(\theta^{(i)}, \theta^{(i)}) - Q(\theta^{(i)}, \theta^{(i-1)})| < \epsilon_Q$, where $\epsilon_\theta$ and $\epsilon_Q$ are user-defined tolerance values, and $\|\cdot\|$ denotes the Euclidean norm. More sophisticated convergence criteria, employing, for example, different distance metrics, do, however, exist.

Each iteration of the EM algorithm increases the value of the target function $L(y, \theta)$, or leaves it unchanged. An outline of the proof is given in Appendix D. Indeed, in most applications the algorithm will converge to a local maximum of $L(y, \theta)$, though exceptions do exist. Such monotonic improvement in $L(y, \theta)$ is also guaranteed for any Generalised EM (GEM) algorithm, in which only a partial maximisation is performed in the M-step, with $\theta^{(i)}$ simply set to some value such that:

$$Q(\theta^{(i)}, \theta^{(i-1)}) \geq Q(\theta^{(i-1)}, \theta^{(i-1)}), \qquad (61)$$


with equality holding when convergence has been achieved. The price paid for the simplicity and stability of the EM algorithm is slow convergence. This problem is compounded by an increase in the amount of information contained in the latent variables $x$, relative to that in the observed data $y$. Although several extensions have been proposed to mitigate the convergence problem, they often jeopardise the simplicity and stability of the algorithm (Meng and Rubin, 1992).
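In outline, the iteration just described is a few lines of code; the sketch below is generic, with the problem-specific E- and M-steps left as hypothetical callables.

```python
import numpy as np

def em(theta0, e_step, m_step, max_iter=100, tol=1e-6):
    """Generic EM/GEM loop: alternate E- and M-steps until the parameter
    change falls below a tolerance. e_step and m_step are placeholders."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        stats = e_step(theta)            # sufficient statistics of p(x|theta, y)
        theta_new = m_step(stats)        # maximise Q(., theta) given those stats
        if np.linalg.norm(theta_new - theta) < tol:
            break
        theta = theta_new
    return theta
```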

4.2.2 Application to Speech Enhancement

This section shows how $\theta^{\text{MP}}$ and $x^{\text{MP}}(\theta^{\text{MP}})$ can be computed for the speech enhancement problem by an application of the EM algorithm. For this purpose, the same state-space representation of the model introduced earlier within the evidence maximisation framework (see Section 4.1.2) is employed, taking the stacked state vector $\alpha_{1:T}$ as the latent variables. This will again facilitate the application of fast and efficient Kalman filter-based algorithms for the estimation. Within this context, the function to be maximised in the M-step of each iteration $i$ of the algorithm can be expanded from (59) as:

$$\begin{aligned} Q(\theta, \theta^{(i-1)}) &= E\left[ \log p(\theta | y_{1:T}, \alpha_{1:T}) \,\big|\, \theta^{(i-1)}, y_{1:T} \right] \\ &\propto E\left[ \log p(y_{1:T}, \alpha_{1:T} | \theta) + \log p(\theta) \,\big|\, \theta^{(i-1)}, y_{1:T} \right] \\ &= E\left[ \log p(y_{1:T}, \alpha_{1:T} | \theta) \,\big|\, \theta^{(i-1)}, y_{1:T} \right] + \log p(\theta), \end{aligned} \qquad (62)$$

where the conditional expectations are taken relative to the distribution $p(\alpha_{1:T} | \theta^{(i-1)}, y_{1:T})$. Employing the results of Appendix B.3 and Section 4.1.2, this function can be obtained as:

$$Q(\theta, \theta^{(i-1)}) = -\frac{1}{2}\left[ \frac{1}{\sigma_e^2}\, \tilde{a}^T \Phi \tilde{a} + \frac{1}{\sigma_v^2}\left( Z \Phi Z^T + \Psi \right) + T\left( \log 2\pi\sigma_e^2 + \log 2\pi\sigma_v^2 \right) \right] + \log p(\theta), \qquad (63)$$

where terms independent of the model parameters have been discarded, and the logarithm of the prior distribution is given by (39). This function has exactly the same derivatives as those given in (44) to (47). This is to be expected, since taking the derivative of (62) with respect to the model parameters $\theta$, and employing the Fisher identity (43), results in:

(64)

22

Solution Methods

which can be recognised as the derivative of L(y;  ) in (42) with respect to the model parameters. However, as opposed to the evidence maximisation scheme, where the state estimates and their covariances atjT , PtjT , t = 1; : : : ; T are re-computed at each evaluation of the gradient, they, here, remain fixed throughout the maximisation performed in the M-step, to the values that result when running the Kalman smoother with  (i 1) as the parameter vector. In the current context the latter operation constitutes the E-step of the algorithm, since it computes the sufficient statistics for the distribution p( 1:T j (i 1) ; y1:T ). The fact that the state estimates and their covariances remain fixed, leads to the utility that the maximum of Q( ;  (i 1) ) can be found by solving the set of simultaneous equations that results when equating the derivatives in (44) to (47) to zero. Details of these computations and the corresponding results are given in Appendix E. When the algorithm has finally converged to a maximum  MP , this parameter vector can be employed in a Kalman smoothing run to calculate the optimal state estimates. As was done in Section 4.1.2 for the evidence maximisation algorithm, these can then be utilised to recover the optimal estimate for the clean speech vector xMP ( MP ). The complete algorithm is summarised below.

Algorithm 2: EM for Speech Enhancement

1. Choose an initial value for the unknown model parameters $\theta^{(0)}$.

2. Set the constant parameters of the prior distributions to reflect the available prior information.

3. Repeat for $i = 1, 2, \dots$:

   E-step: Run the Kalman smoother (see Section B.2) with $\theta^{(i-1)}$ as the parameter vector to compute $a_{t|T}$ and $P_{t|T}$, $t = 1, \dots, T$.

   M-step: Use these estimates in the equations of Appendix E to find the parameter vector $\theta^{(i)}$ that maximises $Q(\theta, \theta^{(i-1)})$.

   until the specified convergence criterion is achieved. Take $\theta^{\text{MP}}$ to be the parameter estimate at the last iteration.

4. Run the Kalman smoother with the optimal parameter estimate $\theta^{\text{MP}}$ to obtain the corresponding optimal state estimates, and from these recover the optimal estimate for the speech vector $x^{\text{MP}}(\theta^{\text{MP}})$, as described in Section 3.1.

4.2.3 Remarks

1. The estimation objective for the EM algorithm is exactly the same as that for the evidence maximisation strategy, in that it performs MAP estimation to find the optimal parameter vector $\theta^{\text{MP}}$, and, based on this estimate, computes the optimal point estimate for the clean speech vector $x^{\text{MP}}(\theta^{\text{MP}})$. This, together with the fact that the EM algorithm also utilises gradient information in the M-step, means that points 3 to 6 of Section 4.1.3 apply equally well to the EM algorithm.

2. Since, for the application considered here, the target function $L(y, \theta)$ can be maximised directly within the evidence maximisation framework, the trade-off between the two algorithms is one of convergence performance and computational efficiency. Whereas the direct maximisation of the target function $L(y, \theta)$ is performed in the marginal space with the unobserved variables $x$ integrated out, the EM algorithm conditions on these variables, and is hence likely to converge more slowly if the former strategy is implemented accurately enough. Thus, in this instance, the application of the EM algorithm can be justified only if the computational burden to arrive at a solution is notably less than that required by a direct maximisation strategy. Note, however, that this is not true in general, and many problems may be encountered for which direct maximisation strategies do not exist, or are difficult to implement accurately and computationally expensive if they do.

4.3 Markov Chain Monte Carlo (MCMC) Methods

The general Bayesian inference problem requires the evaluation of integrals that are analytically intractable. Furthermore, since the problem spaces are generally high-dimensional, standard numerical integration techniques cannot be applied. In these circumstances, Monte Carlo integration methods have been applied with success. Provided that independent samples can be drawn from the distribution forming part of the integrand, these methods approximate the integral to an accuracy that depends only on the number of samples drawn, and not on the dimensionality of the problem space. However, drawing independent samples from arbitrary distributions is a difficult problem. Fortunately, Monte Carlo integration does not necessarily require the samples to be independent, as long as they are representative of the distribution throughout its region of support. Markov chain Monte Carlo (MCMC) methods set up an ergodic Markov chain for which the invariant distribution is the target distribution of interest. The chain is then simulated from some arbitrary starting point, and convergence to the invariant distribution is guaranteed under some mild conditions as the number of iterations approaches infinity. Once convergence is achieved, subsequent samples from the chain form a set of dependent samples from the target distribution, from which Monte Carlo estimates may be calculated.

The subsequent discussion presents the MCMC methodology in more detail. It also introduces the Gibbs sampler as a special case for drawing samples from a joint distribution by repeatedly simulating from its conditionals. Because of its simplicity and efficiency, this method is adopted here to solve the speech enhancement problem.

4.3.1 General Description

Consider the general estimation problem, given in (27), and repeated here for convenience as:

$$\hat{\psi} = E[f(\psi) | y] = \int f(\psi)\, p(\psi | y)\, d\psi. \qquad (65)$$

The evaluation of (65) is generally analytically intractable, even for simple models, and requires either numerical methods or crude analytical approximations for its solution. A popular numerical technique is Monte Carlo integration. It states that if a sequence of samples $\{\psi^{(i)}\}$, $i = 1, \dots, N$ can be drawn from $p(\psi | y)$, then $E[f(\psi) | y]$ can be approximated as:

$$\hat{f} = \frac{1}{N} \sum_{i=1}^{N} f(\psi^{(i)}). \qquad (66)$$

Clearly, if the samples are generated from $p(\psi | y)$, then $E[\hat{f} | y] = E[f(\psi) | y]$, resulting in $\hat{f}$ being an unbiased estimate of $f(\psi)$. Furthermore, if the samples are independent, laws of large numbers ensure that the estimate can be made as accurate as desired by increasing the number of samples $N$. More specifically, assuming that $f(\psi)$ is scalar, the variance of the estimate $\hat{f}$ decreases as $\sigma^2/N$, where $\sigma^2$ is the variance of $f(\psi)$. Thus, the accuracy of the Monte Carlo estimate is independent of the dimensionality of the space sampled, and is influenced only by the number of independent samples $N$.

In general, drawing samples independently from $p(\psi | y)$ is made difficult or impossible by its non-standard form and the high dimensionality of its supporting space. However, strict independence of the samples is not necessarily required. They can be generated by any process which, loosely speaking, draws samples throughout the region of support of $p(\psi | y)$ in the correct proportions. One way of doing this is by setting up a Markov chain having $p(\psi | y)$ as its stationary distribution. The combined method is then known as Markov chain Monte Carlo (MCMC). A full discussion of the theory of Markov chains falls outside the scope of this report, and is hence omitted. For excellent texts on their application to sampling problems, the interested reader is referred to (Tierney, 1994; Tierney, 1996; Roberts, 1996; Neal, 1993).
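A toy illustration of (66): with $\psi \sim \mathcal{N}(0, 1)$ standing in for the posterior and $f(\psi) = \psi^2$ (true expectation 1), the estimate tightens as $N$ grows; the numbers are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
for N in (100, 10_000, 1_000_000):
    psi = rng.standard_normal(N)      # stand-in for samples from p(psi|y)
    print(N, np.mean(psi**2))         # Monte Carlo estimate (66); error ~ sigma/sqrt(N)
```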


The Gibbs sampler (Geman and Geman, 1984; Gelfand and Smith, 1990) is perhaps the most popular, and conceptually the simplest, of the MCMC sampling methods, and is widely applicable to problems where the variables take on values from a small finite set, or have conditional distributions of a parametric form that can easily be sampled from. It can be derived as a special case of the more general Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970), and sets up an irreducible aperiodic Markov chain that converges to the target distribution of interest. It proceeds to generate samples from the target distribution $p(\psi | y)$ by repeatedly drawing a sample for each component of the joint random vector from its corresponding conditional distribution, conditional on the current values for the other components. (A 'component' here refers to either a single element of the random vector, or a group of related elements; the components are mutually disjoint, and together span the random vector.) More specifically, assuming that $\psi$ comprises $K$ components $\psi_{(1)}, \dots, \psi_{(K)}$, the procedure for generating the sample at iteration $i+1$, $\psi^{(i+1)}$, from that at iteration $i$, $\psi^{(i)}$, can be expressed more formally as:

$$\begin{aligned}
\psi_{(1)}^{(i+1)} &\sim p\big(\psi_{(1)} \,\big|\, \psi_{(2)}^{(i)}, \psi_{(3)}^{(i)}, \dots, \psi_{(K)}^{(i)}\big) \\
\psi_{(2)}^{(i+1)} &\sim p\big(\psi_{(2)} \,\big|\, \psi_{(1)}^{(i+1)}, \psi_{(3)}^{(i)}, \dots, \psi_{(K)}^{(i)}\big) \\
&\;\;\vdots \\
\psi_{(k)}^{(i+1)} &\sim p\big(\psi_{(k)} \,\big|\, \psi_{(1)}^{(i+1)}, \dots, \psi_{(k-1)}^{(i+1)}, \psi_{(k+1)}^{(i)}, \dots, \psi_{(K)}^{(i)}\big) \\
&\;\;\vdots \\
\psi_{(K)}^{(i+1)} &\sim p\big(\psi_{(K)} \,\big|\, \psi_{(1)}^{(i+1)}, \psi_{(2)}^{(i+1)}, \dots, \psi_{(K-1)}^{(i+1)}\big),
\end{aligned} \qquad (67)$$

where the notation '$\sim$' denotes that the variable on the left is drawn as an independent random sample from the distribution on the right. Note that the new value for $\psi_{(k)}^{(i+1)}$ is used immediately when sampling the new value for $\psi_{(k+1)}^{(i+1)}$.

’Component’ here refers to either a single element of the random vector, or a group of related elements. The components are mutually disjoint, and together span the random vector. 6 Mixing refers to the number of iterations required for the chain to move from one sample to another, nearly independent sample.

26

4.3.2

Solution Methods

Application to Speech Enhancement

Applying the Gibbs sampler to the speech enhancement problem, reduces to finding the posterior conditional distributions for the clean speech vector x and each of the components of the unknown parameter vector  , and specifying strategies to generate samples from each of these distributions. The required conditional distributions are readily obtained from the joint posterior distribution in (12) by grouping, for each component, together terms in the joint posterior that depend on this component, and finding the appropriate normalising constant to form a proper density. For all the cases considered here, these conditionals are of known parametric form (an utility of employing conditionally conjugate priors) for which standard sampling techniques exist (Devroye, 1986; Ripley, 1987). Employing the technique stated above, the posterior conditionals for the components of the parameter vector  follow readily as: 



1 p p(a ja) = IG a ; a + ; a + (a a )T (a a ) 2 2   MP MP 2 2 p(aja ; e ; x) = N a; a ; a N x0 ; x0 ; x0    1 T T 2 2 p(e ja; x) = IG e ; e + ; e + e e N x0 ; x0 ; x0 2 2   1 T T 2 2 p(v jx1 ; y) = IG v ; v + ; v + v v ; 2 2 2

with

2



a = a 12 XT x1 + 12 a e a   1 1 T 1 MP a =  2 X X +  2 Ip : e a MP

(68) (69) (70) (71)



MP

(72) (73)

It should be noted that rejection sampling is employed in the sampling step for the AR coefficients in (69), in that samples that lead to unstable filters (i.e. having poles that lie outside the unit circle) are discarded. Such samples are, however, unlikely to be encountered, since the non-stationary regions should contribute very little to the total probability mass if the observed data is assumed to be stationary. 

If the likelihood function of the initial samples N x0 ; x0 ; x0 is independent of the model parameters  , this term disappears from (69) and (70), greatly simplifying the sampling procedure. This is the case when x0 and x0 are fixed and known. When x0 depends on the model parameters , sampling may proceed by proposing candi- dates from the corresponding conditional distribution with the term N x0 ; x0 ; x0 removed, and then either accepting or rejecting these candidates in a MetropolisHastings acceptance step (Chib and Greenberg,  1994). This will hardly ever be necessary, since the contribution of N x0 ; x0 ; x0 to the probability mass will be small if T  p.

4.3

Markov Chain Monte Carlo (MCMC) Methods

27

The posterior conditional for the clean speech vector x follows in a similar fashion, and is given by (36), repeated here for convenience as:

p(xja; e2 ; v2 ; y) = N with

x = x

MP

MP





x01 x0 1 v2 y



MP MP x ; x ;

(74)





 1 0 x = 12 AT A + 0 x0 1pIT T p v2 T e

(75) 

1

MP

:

(76)

Direct sampling from this distribution requires the inversion of a (T + p)  (T + p) matrix, which is of O ((T + p)3 ) complexity. However, casting the model into state-space form, and subsequently sampling from the distribution p( 1:T j ; y1:T ) can be done very efficiently by a method based on the Kalman filter, reducing the computational complexity to O (p2 T ), which is linear in the number of samples T . A general outline of this method is presented in Appendix B.4. The complete algorithm is summarised below.

Algorithm 3: The Gibbs Sampler for Speech Enhancement 1. Choose initial values for all the unknown parameter vector  (0) and the uncorrupted speech samples x(0) . 2. Set the constant parameters of the prior distributions to reflect the available prior information. 3. For each sampling iteration i = 0; : : : ; Nmax 1, obtain  (i+1) and x(i+1) from  (i) and x(i) by performing the following Gibbs sampling steps: (a) Sample for the hyperparameter:

a2 (i+1)  p(a2 ja(i) )

see (68)

(b) Sample for the AR(p) process parameters:

a(i+1)  p(aja2 (i+1) ; e2 (i) ; x(i) ) e2 (i+1)  p(e2 ja(i+1) ; x(i) )

see (69) see (70)

The sampling step for a(i+1) is repeated until a set of stationary filter coefficients results. (c) Sample for the additive noise variance:

v2 (i+1)  p(v2 jx(1i) ; y)

see (71)

28

Solution Methods (e) Sample for the reconstructed speech samples:

x(i+1)  p(xja(i+1) ; e2(i+1) ; v2(i+1) ; y)

see (74)

This is done by casting the model into state-space form, and employing the strategy based on the Kalman filtering and simulation smoothing algorithms, as discussed in Appendix B.4. 4. Discard the first N0 samples in the calculation of any Monte Carlo estimates.

4.3.3

Remarks

1. As opposed to the evidence maximisation and the EM algorithms, which compute deterministic point estimates (ML or MAP) of the desired quantities, MCMC methods are truly stochastic in nature, in that they form discrete representations of the posterior distributions of interest. 2. A direct consequence of the previous point is that MCMC methods can be utilised to obtain an estimate of the unknown quantities, based on arbitrary criterion functions, such as those presented in Section 3.6. 3. Since the joint posterior distribution can be factorised as p(x;  jy) = p( jx; y) p(xjy), it is clear that samples for x generated from the joint distribution p(x; jy) are equivalent to samples generated from the true marginal distribution p(xjy). Thus, estimates for the clean speech vector x are formed by implicitly integrating over the parameter space, without making any assumptions about the form of the posterior distribution of the parameters p( jy), as was required for the evidence maximisation and EM algorithms. 4. Since each component of the parameter vector is sampled conditional on the current values for the other components and the vector of clean speech, convergence and mixing of the sampler may be particularly slow in cases where strong mutual dependencies exist between the various components and the clean speech vector. In these situations it may be advantageous to devise schemes to sample jointly for all the components of the parameter vector and the clean speech vector. Such strategies may be based on the Metropolis-Hastings algorithm (Metropolis et al., 1953; Hastings, 1970), with a suitably defined proposal density, or the Hybrid Monte Carlo (HMC) algorithm (Duane et al., 1987; Neal, 1993). For a full treatment of these issues, see (Roberts and Sahu, 1997). 5. The general MCMC methodology is equally applicable to discrete variables, as long as strategies can be formulated to sample from the resulting conditional posterior distributions. See, for example, (Godsill and Rayner, 1995; Godsill

29 and Rayner, 1996; Godsill, 1997; Godsill and Rayner, 1998) where discrete variables are employed as switching indicators for a non-Gaussian noise process, and (Vermaak et al., 1998a) where, in a glottal excitation estimation application, the pitch period is represented as a discrete quantity. 6. Constraints on the parameters are elegantly accommodated by employing rejection sampling, as was done for the AR process coefficients in Section 4.3.2, or by redefining the structure of the prior distribution so that inadmissible parameter values have zero probability mass. 7. The general MCMC estimation framework can be extended in a variety of ways to account for model uncertainty, subsuming both model selection and model averaging issues. One popular strategy is the reversible jump MCMC sampler (Green, 1995), which has successfully been applied to the problems of AR model order selection (Troughton and Godsill, 1997; Troughton and Godsill, 1998) and model averaging for speech signal reconstruction (Vermaak et al., 1998b).

5 5.1

C OMPARATIVE E XPERIMENTS

Experimental Protocol

This section describes the experimental protocol adopted for the Bayesian speech enhancement problem. First, Section 5.1.1 presents the issues that are common to all the strategies, after which Sections 5.1.2 to 5.1.4 lay out the details specific to each of the strategies individually.

5.1.1

General Methodology

Block processing. The data was processed in blocks of T samples, with T chosen so that the quasi-stationary assumption holds. For all blocks but the first, the required initial samples x0 were obtained from the previously processed block. Initial samples for the first block were obtained by advancing the start of this block, and employing the preceding samples from the observed noise corrupted sequence. Thus, in all cases, initial samples were assumed to be available. To ensure strict continuity, the likelihood of the initial samples p(x0 j ) in (16) was taken to be Gaussian with a zero covariance matrix, centred on the initial values. This distribution is independent of the model parameters  , so that it disappears from the expressions for the posterior conditionals in (69) and (70). When processing longer sequences, the blocks were chosen to overlap to some degree. The degree of overlap was so as to allow for the initial transients of

30

Comparative Experiments the Kalman filter-based algorithms to die out within the overlapping section. The final reconstruction step then involved concatenating the samples from each processed block subsequent to the overlapping section.

Initialisation. The initialisation strategy for the model parameters and the reconstructed signal plays an important role in the performance of the enhancement algorithms. Initial estimates that are close to the true values for these variables are likely to lead to faster algorithm convergence, than would otherwise be the case. Though not optimal for all blocks, the following initialisation strategy for the model parameters was found to work well in general. The AR coefficients a were initialised with their ML estimates based on the observed noisy data. This was found to lead to better results than using, for example, the final estimates from the previous frame, since the characteristics of the signal may change significantly over subsequent frames, especially in the presence of transitions. The initial estimate for the AR excitation variance e2 was set to a small constant value (0.01 was used in most of the experiments). This facilitated the convergence of this parameter to even smaller values in cases where speech is not present in the observed signal. Larger initial values often led to overestimates for this parameter, and comparatively high levels of residual noise in the enhanced signal. The additive noise variance v2 was randomly initialised to be approximately two orders of magnitude larger than the initial estimate for the AR excitation variance. Finally, the hyperparameter controlling the variance of the prior on the AR coefficients a2 was set to initially reflect vague prior information over the range of admissible parameter values. For all the blocks in a sequence of samples, the vector of reconstructed samples x1 was initialised with the corresponding observed noisy samples y. Prior distributions. The parameters of the prior distributions were chosen such that these densities were diffuse over the range of interest. In this way, none of the other admissible parameter values are disregarded in the case of bad initial estimates. Kalman filter-based algorithms. Since the speech model under consideration is assumed to be stationary over a block of T samples, all the Kalman filter-based routines are monitored for convergence of the relevant state covariance matrices. Convergence is defined in terms of the Frobenius norm of the difference between successive covariance matrices, and once it is achieved these matrices are no longer updated, resulting in substantial computational savings.

5.1.2

Evidence Maximisation Specifics

Optimisation strategy. The optimisation strategy employed is outlined in Appendix C. For the specific problem addressed here, an added constraint was placed on

5.1

Experimental Protocol

31

the adaptation step size Æk , in that it was limited to the maximum value that ensured stability of the resulting AR filter, and positiveness of all the variances. Convergence. The optimisation algorithm was run for a pre-specified maximum number of iterations, or until either of the following convergence criteria was satisfied at iteration i:

k i  i k <  jL(y;  i ) L(y;  i )j <  ; L L(y;  i ) ( )

(

1)

( )

(

(

1)

(77)

1)

(78)

whichever occurred first. In the above,  and L are user-specified tolerance values determining the accuracy of the solution.

5.1.3

EM Algorithm Specifics

Convergence. The exact same criteria of (77) and (78) were employed to monitor the convergence of the EM algorithm. The convergence criterion in (77) was also used in the iterative algorithm to maximise Q( ;  (i 1) ), as discussed in Appendix E.

5.1.4

MCMC Estimation Specifics

Convergence. Convergence of the induced Markov chain was diagnosed by monitoring the evolution of the additive noise variance and a value proportional to the negative logarithm of the joint posterior probability of the model parameters and the reconstructed data sequence p(x;  jy), as given by (12). However, due to the random nature of the sampler, assessment of convergence is difficult in practice. Thus, for all the experiments reported here, the number of burn in iterations was conservatively fixed to a value higher than the slowest empirically found convergence rate. Computing estimates. Only the samples obtained after the underlying Markov chain has converged to the desired invariant distribution were employed in the calculation of the Monte Carlo estimates. Estimates for the model parameters  and the reconstructed speech signal x were taken to be their MMSE values, based on all the samples generated after the burn in period, with the number of samples empirically set to yield low variance estimates.

32 5.2

Comparative Experiments Data Description

The data for the experiments was obtained from the Speech Enhancement Assessment Resource (SpEAR) Database (http://ee.ogi.edu/NSEL/. Beta Release v1.0. CSLU, Oregon Graduate Institute of Science and Technology. E. Wan, A. Nelson, and Rick Peterson). This database contains carefully selected samples of noise corrupted speech with clean speech references. All speech and noise sources have been acoustically combined and re-recorded at a sampling frequency of 48 kHz. The recordings were subsequently down-sampled to 16 kHz and stored as 16 bit Microsoft wave files (“.wav”). The complete technical specification and composition of the database can be found at the URL given above. Two sentences from this database, the first uttered by a male, and the second, by a female speaker, were employed in the subsequent experiments. These utterances will be referred to as S1 and S2 , respectively, and are given below. S1 : “Good service should be rewarded by big tips.” S2 : “Draw every outer line first, then fill in the interior.” The clean waveforms for these utterances were each corrupted by two additive white Gaussian noise processes. For the first process the noise variance was held fixed throughout the utterance, whereas it was varied with time for the second process, causing it to be non-stationary. The corrupted versions of S1 will be referred to as S1 N1 for the stationary noise process N1 , and S1 N2 for the non-stationary noise process N2 , respectively. Similarly, S2 N3 will be used to denote the version of S2 corrupted by the stationary noise process N3 , and S2 N4 , the version of S2 corrupted by the non-stationary noise process N4 . The waveforms for the clean and noisy utterances are given in Figure 1 for S1 and Figure 2 for S2 , respectively. All the utterances, including those resulting from applying the enhancement algorithms, are also available as Microsoft wave files. A guide to the location and names of these files is given in Appendix F.

5.3

Results

Each of the enhancement methods discussed in Section 4 was applied to noise corrupted utterances presented above. The block size was chosen to be T = 500 samples (31.25 msec.), with an overlap of 100 samples between adjacent blocks. The AR model order was fixed at p = 10. For the evidence maximisation and EM algorithms the convergence tolerance values were set to  = L = 1  10 3 . On the other hand, the Gibbs sampler was run for a total number of 100 iterations, with the first 50 taken as the burn in period.

5.3

Results

33

Clean speech (S1) 20 0 −20

Noise sequence (N1)

10 0 −10

Noise corrupted speech (S1N1)

20 0 −20

Noise sequence (N2)

20 0 −20

Noise corrupted speech (S1N2)

20 0 −20

Figure 1: Clean and noise corrupted utterances for S1 .

34

Comparative Experiments

Clean speech (S2) 20 0 −20

Noise sequence (N3)

10 0 −10

Noise corrupted speech (S2N3)

20 0 −20

Noise sequence (N4)

20 0 −20

Noise corrupted speech (S2N4)

20 0 −20

Figure 2: Clean and noise corrupted utterances for S2 .

5.3

Results

35

The overall reconstruction performance and noise estimation results are given in Figures 3 to 6 and the third and fourth columns of Table 1. From the graphs on the RHS of these figures it is evident that both the evidence maximisation and EM algorithms estimate the variance of the additive noise process to a satisfactory degree of accuracy under both stationary and non-stationary circumstances. This is particularly clear from the graphs for the stationary noise processes in Figures 3 and 5, where the mean of the block estimates is close to the true value, and most of these estimates fall within a 2 range of the mean. From the same figures, the graphs for the Gibbs sampler indicate that this method consistently underestimates the noise variance. This may be ascribed to the Markov chain converging to a suboptimal mode of the posterior distribution. Using an informative prior for the noise variance, centred on the true value, removed this bias, but such an solution is not always possible in practice. As a consequence of the Gibbs sampler underestimating the noise variance, the corresponding reconstruction performance is poorest for this algorithm. Similar reconstruction performance results are exhibited by the evidence maximisation and EM algorithms, except at low input SNR levels, where the latter outperforms the former. As intuitively expected, the reconstruction performance decreases for all methods with an increase in the input SNR.

Input Utterance S1 N1

S1 N2

S2 N3

S2 N4

Method EVM EM Gibbs EVM EM Gibbs EVM EM Gibbs EVM EM Gibbs

Input SNR 2.40

0.16

4.11

1.42

 SNR avg. # iter. avg. MFlops avg. MFlops 6.34 6.72 5.15 7.98 8.73 7.07 6.70 7.03 5.20 8.61 9.53 7.75

per block 18 10 100 14 10 100 13 8 100 13 9 100

per iter. 10.96 8.06 1.52 11.29 6.80 1.42 11.98 7.86 1.54 12.94 7.15 1.53

per block 197.22 80.64 151.65 158.02 67.95 141.99 155.72 62.89 153.98 168.22 64.37 153.00

Table 1: Comparative computational expense and overall reconstruction performance of the three methods on four noise corrupted utterances. The computational cost figures are for Matlab implementations of the algorithms, executed on a Sun Ultra Sparc, and are not precise, but meant to give a feel for the relative performance of each algorithm.

Typical convergence curves for the algorithms are depicted in the graphs of Figure 7. As expected, the Gibbs sampler is the slowest to converge, confirming the existence of

36

Comparative Experiments

SNR improvement vs. input SNR

Noise variance vs. block number

30

4

20

3 2

10

1 0 −60

−40 −20 0 SNR improvement vs. input SNR

20

30

4

20

3

0

20 40 60 80 100 Noise variance vs. block number

0

20 40 60 80 100 Noise variance vs. block number

2

10

1 0 −60

−40 −20 0 SNR improvement vs. input SNR

20

30

4

20

3 2

10

1 0 −60

−40

−20

0

20

0

20

40

60

80

100

Figure 3: Enhancement (left) and noise variance estimation (right) results for the corrupted utterance S1 N1 . From top to bottom the graphs are for the evidence maximisation, EM and Gibbs sampling algorithms, respectively. The solid line in the graphs on the right indicates the true noise variance, whereas the mean of the estimates over all the blocks is given by the dashed line. The dash-dotted lines indicate the 2 deviations of the estimates from the mean value.

5.3

Results

37

SNR improvement vs. input SNR

Noise variance vs. block number

30 15

20

10

10 0

5

−10 −20 −40

0 −20 0 20 SNR improvement vs. input SNR

40

0

20 40 60 80 100 Noise variance vs. block number

0

20 40 60 80 100 Noise variance vs. block number

30 15

20

10

10 0

5

−10 −20 −40

0 −20 0 20 SNR improvement vs. input SNR

40

30 15

20

10

10 0

5

−10 −20 −40

0 −20

0

20

40

0

20

40

60

80

100

Figure 4: Enhancement (left) and noise variance estimation (right) results for the corrupted utterance S1 N2 . From top to bottom the graphs are for the evidence maximisation, EM and Gibbs sampling algorithms, respectively. The solid line in the graphs on the right indicates the true noise variance.

38

Comparative Experiments

SNR improvement vs. input SNR

Noise variance vs. block number

30

6

20

4

10

2

0 −40

−20 0 SNR improvement vs. input SNR

20

0

30

6

20

4

10

2

0 −40

−20 0 SNR improvement vs. input SNR

20

0

30

6

20

4

10

2

0 −40

−20

0

20

0

0

50 100 Noise variance vs. block number

0

50 100 Noise variance vs. block number

0

50

100

Figure 5: Enhancement (left) and noise variance estimation (right) results for the corrupted utterance S2 N3 . From top to bottom the graphs are for the evidence maximisation, EM and Gibbs sampling algorithms, respectively. The solid line in the graphs on the right indicates the true noise variance, whereas the mean of the estimates over all the blocks is given by the dashed line. The dash-dotted lines indicate the 2 deviations of the estimates from the mean value.

5.3

Results

39

SNR improvement vs. input SNR

Noise variance vs. block number

40

30

20

20

0

10

−20

0

−40

−20 0 20 40 SNR improvement vs. input SNR

40

30

20

20

0

10

−20

0

−40

−20 0 20 40 SNR improvement vs. input SNR

40

30

20

20

0

10

−20

0

−40

−20

0

20

40

0

50 100 Noise variance vs. block number

0

50 100 Noise variance vs. block number

0

50

100

Figure 6: Enhancement (left) and noise variance estimation (right) results for the corrupted utterance S2 N4 . From top to bottom the graphs are for the evidence maximisation, EM and Gibbs sampling algorithms, respectively. The solid line in the graphs on the right indicates the true noise variance.

40

Comparative Experiments

strong mutual dependencies between the components of the parameter vector and the clean speech signal. Of the remaining methods the evidence maximisation algorithm is the slower to converge, as is confirmed by the average convergence rates in the fifth column of Table 1. This result is counter-intuitive, since the evidence maximisation algorithm performs the maximisation directly in the marginal parameter space, with the clean speech integrated out, whereas the EM algorithm conditions on the current estimate of the clean speech at each iteration of the maximisation procedure. However, contrary to first appearances, this result does not necessarily contradict the second remark in Section 4.2.3, but should rather be ascribed to the inaccurate line search strategy employed in the maximisation procedure (see Appendix C.2 for details), leading to suboptimal behaviour at points far from the solution. The stochastic nature of the Gibbs sampler, as opposed to that of the other algorithms, which is deterministic, is also clearly illustrated by the relevant graphs in Figure 7. As is evident from the second last column of Table 1, the evidence maximisation algorithm is computationally the most expensive per iteration of the methods, with the Gibbs sampler being, by far, the most efficient. The average algorithm cost per block is given by the last column of the table, and is obtained by multiplying the average cost per iteration by the average convergence rate. Here the evidence maximisation algorithm still ranks the most expensive, with the Gibbs sampler now being less efficient than the EM algorithm. The latter result is, however, misleading, since the Gibbs sampler normally converges in a lot fewer than 100 iterations. Hence, the computational cost for this algorithm can be greatly reduced if a scheme can be devised to automatically detect convergence in practical implementations. Informal listening tests revealed the following about the enhanced utterances:





Musical artifacts were present in sections of the utterances enhanced by the evidence maximisation and EM algorithms, corresponding to sections in the corrupted signal characterised by low input SNR values, the artifacts being more severe for the evidence maximisation algorithm. The presence of these artifacts may be due to the fact that these algorithms do not integrate over the parameter space, as required by (28), but reconstruct the clean speech based on a single point estimate of the parameters. Another contributing factor may be the systematic modelling of some of the noise due to an inappropriate choice for the AR model order. This may explain why these artifacts are also present, though to a lesser degree, in utterances enhanced by the Gibbs sampler. Utterances enhanced by the EM algorithm were generally more muffled than those resulting from the other algorithms. Under closer investigation, this was found to be due to bad reconstruction of fricatives and other unvoiced sounds exhibiting noise-like characteristics. This is a less serious artifact, though, since it seems to be related more to whether the algorithm has converged or not (apparent algorithm convergence was found to be deceptive under these circumstances), than to a serious shortcoming of the model.

5.3

Results

41

Noise variance evolution

Target function evolution

3

1000

2.5

900

2

800

1.5

700

1

0

10 20 30 40 Noise variance evolution

50

8

600

0

10 20 30 40 Target function evolution

50

0

10 20 30 40 Target function evolution

50

0

10

50

750

6

700

4 650

2 0

0

10 20 30 40 Noise variance evolution

50

6

600 1500 1400

4

1300 2 0

1200 0

10

20

30

40

50

1100

20

30

40

Figure 7: Typical convergence graphs for the additive noise variance (left) and the value of the target function (right). From top to bottom the graphs are for the evidence maximisation, EM and Gibbs sampling algorithms, respectively. The dashed line in the graphs on the left indicates the true noise variance. The target function for both the evidence maximisation and EM algorithms was taken to be the negative of L(y1:T ; ), as defined by (42), whereas for the Gibbs sampler, it was given by the negative of the logarithm of the joint posterior distribution p(x; jy), as defined by (12).

42

Discussion

 

Compared to the other algorithms, more residual noise was present in utterances enhanced by the Gibbs sampler. This is a direct result of the algorithm consistently underestimating the additive noise variance. Other block artifacts were present in the enhanced utterances for all the algorithms, becoming more severe as the input SNR of the noise corrupted utterance decreases. These are mostly due to the fact that continuity between adjacent blocks was not strictly enforced, and the algorithms were free to converge to the best local solution. For adjacent blocks these solutions do not necessarily correspond, causing the quality of reconstruction to fluctuate quite significantly over the enhanced utterance. This is especially true for low input SNRs where an increasing number of parameter values can adequately explain the data, reflecting an increase in the number of modes in the posterior distribution of the parameters.

6

D ISCUSSION

This report presented and compared three methods to perform enhancement of speech corrupted by additive white Gaussian noise within a Bayesian framework. These methods do not required the variance of the noise process to be available, but estimate it from the observed noise corrupted signal alongside the other parameters, at the cost of an increase in the severity of the block artifacts in the enhanced waveform, especially at high noise levels. The deterministic methods, i.e. the evidence maximisation and EM algorithms, were found to be superior to the Gibbs sampler in terms of reconstruction performance, but introduced more unwanted musical artifacts in the enhanced signal. The Gibbs sampler, on the other hand, consistently underestimated the noise variance, leading to more residual noise being present in the reconstructed signal, but did not distort the speech as much as the other algorithms. The evidence maximisation algorithm was computationally the most expensive, with the EM algorithm being the most efficient. On a per-iteration basis, the Gibbs sampler was the least costly, but the difficulty to automatically detect convergence of the chain, requires it to be run for many more iterations that the deterministic methods, increasing the overall cost. The following issues will be addressed in future research:

 

Developing truly sequential enhancement strategies to eliminate block artifacts from the reconstructed waveforms. Extending the algorithms to account for model uncertainty by model selection or model mixing strategies.

43

  

Extending the algorithms to account for other real-life noise processes, such as coloured and impulsive noise. Developing MCMC strategies to sample jointly for the parameters and the clean speech, so as to improve the convergence and mixing properties of the induced Markov chain. Extending the models to better account for important speech characteristics. To this end some progress has been made in (Vermaak et al., 1998b), where the AR vocal tract model is combined with a seasonal AR glottal excitation model to lead to an improved speech production model for voiced sounds.

44

Kalman Filter-Based Algorithms

A A.1

P ROBABILITY D ENSITY F UNCTIONS

The Inverted-Gamma Distribution

A continuous random variable X has an inverted-gamma distribution with parameters and ( ; > 0) if and only if its probability density function is of the form:

p(x) = IG (x; ; ) = x



( +1)

exp( =x);

x > 0;

(79)

R1

where = = ( ) and ( + 1) = 0 y exp( y )dy , > 1 is the gamma function. This distribution is so-called because it is derived as the distribution of X = 1=Y , where Y is gamma distributed. It has a mean value =( 1), > 1, variance 2 =(( 1)2 ( 2)), > 2, and a unique mode at =( + 1).

A.2

The Multivariate Gaussian Distribution

A d-dimensional random vector X is Gaussian or normally distributed if and only if its probability density function is of the form:

p(x) =

1

(2 )d=2 jj1=2

exp



1 (x 2



)T 

1

(x



) :

(80)

This distribution is fully determined by its mean vector and covariance matrix:

E [X℄ = ;



E (X



)(X )T = ;

(81)

and is normally written as p(x) = N (; ) for the sake of conciseness. An alternative notation, p(x) = N (x; ; ), will be used in contexts where ambiguities may arise.

B

K ALMAN F ILTER -B ASED A LGORITHMS

This appendix presents a number of important algorithms based on the Kalman filter (Anderson and Moore, 1979; Harvey, 1989). These algorithms all pertain to the statespace formulation of a model parameterised by  , as introduced in Section B.1, and lead to fast and efficient ways for state and parameter estimation under a variety of circumstances.

B.1

The State-Space Form

B.1

The State-Space Form

45

The state-space form of a model parameterised by  is given by:

yt = Zt () t + Gt ()ut t+1 = Tt() t + Ht()ut ;

(82) (83)

where t 2 R p , yt 2 R d and ut 2 R q are respectively the state, observation and disturbance vectors. The state evolves according to the first-order state transition equation (83), and can normally only be measured indirectly, as indicated by the measurement equation (82). Due to the measurement uncertainty, the initial state 0 is represented by a probability distribution function, rather than a fixed value. Here it is assumed to be Gaussian distributed, i.e. p( 0 j ) = N (a0 ; P0 ), with a0 and P0 assumed to be known. Furthermore, the disturbances are assumed to be i.i.d. and Gaussian distributed with zero mean and covariance matrix Qt ( ), i.e. p(ut j ) = N (0; Qt ( )), with cov(us ; ut ) = 0 for s 6= t. The set fZt ( ); Gt ( ); Tt ( ); Ht ( ); Qt ( )jt = 1; : : : ; T g is often referred to as the system matrices. Under these assumptions the following properties hold true of the state-space system:



The state t is a Gaussian random vector. This follows from the solution of the recursive equation (83), which can be written as:

t = t; + 0

 

0

t 1 X s=0

t;s+1Gsus ;

(84)

with t;s = Tt 1 Tt 2    Ts , t > s and t;t = Ip . Thus, (84) expresses t as a linear combination of the jointly Gaussian7 random vectors 0 and us , s = 0; : : : ; t 1. Since linear transformations of Gaussian random vectors preserve their Gaussian character, it follows that t is a Gaussian random vector. The state process f t g is a Gaussian random process. Thus, any arbitrary subset of state vectors is jointly Gaussian.

The state process f t g is a first-order Markov process, resulting in p( t j t ; : : : ; ; ) = p( t j t ; ). This property is the consequence of the whiteness of the 1

0

1

disturbances ut and the form of the state transition equation (83).



The observation process fyt g is also a Gaussian random process, for essentially the same reasons as f t g. More specifically, f t g and fyt g are jointly Gaussian.

For the purposes of the subsequent derivations, the stacked state and observation vectors are respectively denoted by s:t = [ Ts ; : : : ; Tt ℄T and ys:t = [ysT ; : : : ; ytT ℄T , with 7

These variables are jointly Gaussian as a result of their being individually Gaussian and independent.

46

s

Kalman Filter-Based Algorithms

 t.

For the sake of notational clarity, in the exposition dependence on the system parameters  is assumed to be implicit, except in equations that require specific references to these variables.

B.2

The Kalman Filter, One-Step-Ahead Predictor and Fixed-Interval Smoother

The Kalman filter, one-step-ahead predictor and fixed-interval smoother all compute the distributions for the states t , t = 1; : : : ; T , based on differing amounts of information. In filtering the distribution for the state at time t is conditioned on all the observations up to and including time t, whereas one-step-ahead prediction employs the same information to compute the distribution for the state at time t + 1. On the other hand, fixed-interval smoothing yields the distribution for the state at time t, based on all the observations available. General recursive formulations for these respective distributions of the state vector are given below.



Filtering:

p(yt j t ; y1:t 1 )p( t jy1:t 1 ) p(yt jy1:t 1 ) p(yt j t )p( t jy1:t 1 ) = p(yt jRy1:t 1 ) p(yt j t ) p( t j t 1 )p( t 1 jy1:t 1 )d t 1 : = p(yt jy1:t 1 )

p( t jy1:t ) =



One-step-ahead prediction:

p( t+1 jy1:t ) = = =



(85)

Z Z R

p( t+1 ; t jy1:t )d t p( t+1 j t )p( t jy1:t )d t

(86)

p(yt j t )p( t+1 j t )p( t jy1:t 1 )d t : p(yt jy1:t 1 )

Fixed-interval smoothing:

p( t jy1:T ) = =

Z Z

p( t ; t+1 jy1:T )d t+1 p( t j t+1 ; y1:T )p( t+1 jy1:T )d t+1

= p( t jy1:t )

Z

p( t+1 jy1:T )p( t+1 j t ) d t+1 : p( t+1 jy1:t )

(87)

B.2

The Kalman Filter, One-Step-Ahead Predictor and Fixed-Interval Smoother47

In (85) to (87), p(yt j t ) and p( t+1 j t ) follow from the measurement (82) and state transition (83) equations, respectively. Since these are Gaussian by definition, and a Gaussian distribution is also assumed for the initial state, (85) to (87) are also Gaussian, and can be solved analytically, leading to sets of recursive equations to compute the unknown mean vectors and covariance matrices. For the one-step-ahead predictor, the distribution for the state can be written as p( t+1 j y1:t ) = N at+1jt ; Pt+1jt , where the mean vector and covariance matrix can be computed recursively, forwards in time for t = 1; : : : ; T 1, as:

at+1jt = Tt atjt 1 + Ktet Pt+1jt = Tt Ptjt 1LTt + HtQt JTt ;

(88) (89)

with

et = yt Zt atjt 1 Dt = Zt Ptjt 1ZTt + Gt Qt GTt Kt = (TtPtjt 1 ZTt + Ht Qt GTt )Dt 1 Lt = Tt Kt Zt Jt = Ht KtGt :

(90) (91) (92) (93) (94)

In the above, Kt is referred to as the Kalman filter gain, and fet g, as the sequence of innovations or prediction errors, with et = yt E [yt jy1:t 1 ℄. Thus, et consists of that part of yt containing new information not carried in y1:t 1 . Furthermore, fet g is an independently distributed Gaussian process with p(et ) = N (0; Dt ). The filter is initialised with the mean and covariance of the initial state distribution, i.e. a0j 1 = a0 and P0j 1 = P0 . For the Kalman filter, on the other hand, the state distribution can be written as p( t jy1:t ) = N (at ; Pt ), where the mean vector and covariance matrix can be defined concisely in terms of those of the one-step-ahead predictor as:

at = atjt 1 + Ptjt 1 ZTt Dt 1 et Pt = Ptjt 1 Ip ZTt Dt 1 Zt Ptjt

 1

(95)

:

(96)

If the state-space system under considerations is time-invariant and asymptotically stable, i.e. the characteristic roots of the state transition matrix T lie within the unit circle (ji (T)j < 1, 8i), and the disturbance process fut g is stationary, then the state f t g and observation fyt g processes are stationary as well. Under these circumstances the Kalman filter is guaranteed to be stable, and the covariance matrices Ptjt 1 and Pt will converge to steady-state solutions exponentially fast, irrespective of the initial conditions. This fact may lead to substantial computational savings in Kalman filter-based algorithms since, once converged, the need to update the state covariance matrices is eliminated.

48

Kalman Filter-Based Algorithms

Finally, for the fixed-interval  smoother, the distribution for the state can be denoted as p( t jy1:T ) = N atjT ; PtjT . The required mean vector and covariance matrix can be defined in terms of those of the Kalman filter and the one-step-ahead predictor, and are computed recursively, backwards in time for t = T 1; : : : ; 1, according to:

atjT = at + St (at+1jT Tt+1 at ) PtjT = Pt + St (Pt+1jT Pt+1jt )STt ;

(97) (98)

with

St = PtTTt+1 Pt+11 jt : In the above, aT jT

B.3

(99)

= aT and PT jT = PT .

The Log-likelihood Function and its Derivatives

The likelihood function for the state-space model presented in Section 2.2 is the probability of observing the data y1:T , given the model parameters  , i.e. p(y1:T j ). This function can be factorised according to the probability chain rule as:

p(y1:T j) =

T Y t=1

p(yt jy1:t 1 ; ):

(100)

In (100), the term p(yt jy1:t 1 ;  ) can be recognised as the normalising constant in the denominators of (85) and (86), and can consequently be expanded as:

p(yt jy1:t 1 ; ) = =

Z Z

p(yt ; t jy1:t 1 ; )d t p(yt j t ; )p( t jy1:t 1 ; )d t :

(101)

In the Gaussian  case, this integral can be solved analytically to yield p(yt jy1:t 1 ;  ) = N ytjt 1 ; Dt , with ytjt 1 = Zt atjt 1 . Using the fact that et = yt Zt atjt 1 , the distribution p(yt jy1:t 1 ;  ) can be rewritten as:

p(yt jy1:t 1 ; ) =

1

(2 )d=2 jDt j1=2

exp





1 T 1 e D e : 2 t t t

(102)

Many system identification applications require the value of the system parameters  that maximises the likelihood function. Instead of maximising the likelihood function directly, it is often easier to maximise the corresponding log-likelihood function. As the name implies, this function is defined by taking the logarithm of (100) to yield: T 1X l; l(y1:T ; ) = log p(y1:T j) = 2 t=1 t

(103)

B.3

The Log-likelihood Function and its Derivatives

49

with

lt = d log 2

log jDt j

eTt Dt 1et :

(104)

This expression can be recognised as the prediction error decomposition (Harvey, 1989) form of the likelihood function. Often gradient-based algorithms are employed to maximise the log-likelihood function. The most basic of these require expressions for the derivatives of the log-likelihood function with respect to model parameters, whereas more sophisticated algorithms, such as the one presented in Appendix C require, in addition to the gradients, the matrix of the second partial derivatives of the log-likelihood function to the model parameters (or an approximation to it), or the Hessian. The subsequent sections develop expressions for these derivatives, based on the Fisher identity, for the speech model considered here.

B.3.1

The Gradient of the Log-likelihood Function

The Fisher identity, first introduced in (Fisher, 1925), has been applied in the context of state-space model identification in, for example, (Segal and Weinstein, 1989), and leads to a method for computing the derivatives of the log-likelihood function that requires only a single forwards filtering and backwards smoothing run of the Kalman filter. Adapted for the state-space models under consideration here, this identity is given by: 



  log p(y1:T j) = E log p(y1:T ; 1:T j)jy1:T ;  ;  

(105)

where the conditional expectation is taken relative to the distribution p( 1:T jy1:T ;  ). As pointed out in (Dempster et al., 1977), the Fisher identity is closely related to the Expectation Maximisation (EM) algorithm for finding maximum likelihood (ML) parameter estimates from incomplete data. From this perspective, fy1:T ; 1:T g constitutes the complete data, whereas the observed, or incomplete, data is given by y1:T . Employing the properties of the state-space model, (105) can be further expanded as: 



  log p(y1:T j) = E (log p(y1:T j 1:T ; ) + log p( 1:T j)) jy1:T ;    ! # " T T 1 X  X log p(yt j t ; ) + log p( t+1 j t ; ) jy1:T ;  : =E   t=1 t=0

(106)

Knowing, from the measurement (82) and state transition (83) equations, that p(yt j t ;   T T ) = N Zt t ; GtQt Gt and p( t+1 j t; ) = N Tt t ; HtQt Ht , (106) can easily be expanded to yield the derivatives of the log-likelihood function for general Gaussian state-space models, as given in (Segal and Weinstein, 1989). Instead of repeating those here, the remainder of this section develops efficient expressions for computing the

50

Kalman Filter-Based Algorithms

derivatives of the log-likelihood function specific to the speech model presented in Section 2. First of all, for the sake of algorithm simplicity, the state-space model of Section 2.2 is redefined by extending the state vector to include one more delayed element, i.e. t = [xt ; : : : ; xt p℄T . This is necessary to avoid computing expectations of the form E t Tt+1 jy1:T ;  , as is required by the algorithm in (Segal and Weinstein, 1989). The system matrices then become: 



Zt () = Z = 1 01p ;  T  a Tt () = T(a) = I 0p+11 ;









Gt () = G = 0 1

Ht() = H = 01 0 : p2

p

(107) (108)

The matrix Qt ( ) remains unaltered. Following from the nature of the AR(p) process, the state transition density p( t+1 j t ; ) is degenerate, having only a single random component, and can be expressed as:

p( t+1 j t ; ) = p(xt+1 jxt ; : : : ; xt

p+1 ;

)

tY p+1 s=t

Æ (x xs );

(109)

where Æ () is the Kronecker delta function, defined to be one if its argument is zero, ~= and zero otherwise. Using (18) and defining the augmented coefficient vector as a [1; aT ℄T , this expression can be rewritten as:

1 exp p( t+1 j t ; ) = (2e2 )1=2





1 T a~ t+1 Tt+1a~ : 2e2

(110)

The distribution for the measurement equation is univariate, and follows straightforwardly as:

1 p(yt j t ; ) = exp (2v2 )1=2



1 2 y 2v2 t

  T T 2yt Zt t + Zt t t Zt :





(111)

Using (110) and (111), the log-likelihood function of the parameters with respect to the complete data can be written as: T   1X 1 log p(y1:T ; 1:T j) = log 2v2 + 2 yt2 2yt Zt t + Zt t Tt ZTt 2 t=1 v  1 + log 2e2 + 2 a~T t+1 Tt+1 a~ : e

(112)

Taking the derivatives of (112) with respect to the elements of the parameter vector 

B.3

The Log-likelihood Function and its Derivatives

51

results in: T  1 X log p(y1:T ; 1:T j) = 2 M t Tt a~ a e t=1



T  1 X log p(y1:T ; 1:T j) = 1 e2 2e2 t=1



T 1 X  log p(y1:T ; 1:T j) = 1 v2 2v2 t=1





(113)

1 T a~ t Tt a~ 2 e 1 2 y v2 t



(114)

  T T 2yt Zt t + Zt t t Zt ;





(115)

with M = 0p1 Ip . Taking the conditional expectations of (113) to (115) with respect to the distribution p( 1:T jy1:T ;  ), leads to the derivatives of the log-likelihood function, given by: T   1 X  log p(y1:T j) = 2 M E t Tt jy1:T ;  a~ a e t=1



T  1 X log p(y1:T j) = 1 e2 2e2 t=1



T 1 X  log p ( y j  ) = 1 1:T v2 2v2 t=1

+ tr ZTt Zt E

(116)

  1 ~a~T E t Tt jy1:T ;  tr a 2 e

1 2 y v2 t 



(117)

2yt Zt E [ t jy1:T ; ℄

t Tt jy T ; 

 

1:

(118)

:

The conditional expectations in the above expressions follow readily from the Kalman smoothing equations (97) to (99), and are given by:

E [ t jy1:T ; ℄ = atjT  E t Tt jy1:T ;  = PtjT + atjT aTtjT :

(119)



(120)

Finally, substituting (119) and (120) into (116) to (118) and using the fact that Zt constant, the derivatives can be expressed as:

 log p(y1:T j) = a  log p(y1:T j) = e2  log p(y1:T j) = v2

1 M a~ e2   1 1 T a~ a~ T 2e2 e2   1 1 Z ZT + 2 2 2v v

= Z is (121) (122)



T ;

(123)

with

=

=

T X t=1 T X t=1

PtjT + atjT aTtjT yt2





2yt ZatjT :

(124) (125)

52

B.3.2

Kalman Filter-Based Algorithms

The Approximate Hessian of the Log-likelihood Function

An exact evaluation of the Hessian requires the derivatives of and with respect to the parameters. The details of these calculations are described in (Segal and Weinstein, 1989). The resulting computations require at least k + 4 Kalman smoothing runs at each iteration, which is computationally very expensive. However, the exact value of the Hessian is not strictly required for the algorithm to work correctly. Any approximation will suffice, as long as it is positive definite, the specific choice only influencing the rate of convergence. Here the components of the Hessian are approximated by assuming and to be constant, resulting in

2 1~ log p ( y j  ) = 1: T  a aT e2 2  1 2 ~ log p(y1:T j) = M a = 2 T log p(y1:T j)  ae2 (e2 )2 e  a   2  1 T 1 T a~ a~ log p(y1:T j) = 2 2  (e2 )2 (e ) 2 e2    2 1 T 1 T Z Z + ; log p(y1:T j) = 2 2  (v2 )2 (v ) 2 v2

(126) (127) (128) (129)

where ~ is with the first row and column removed. This approximation gives a satisfactory trade-off between accuracy and computational complexity (it requires no additional Kalman smoothing runs), and was found to work well in practice.

B.4

Sampling from the Distribution p( 1:T j ; y1:T )

Conditional on  , the posterior distribution p( 1:T j ; y1:T ) is Gaussian and sampling for the stacked state vector 1:T can proceed as discussed in (Carter and Kohn, 1994; Shephard, 1994; Fruhwirth-Schnatter, 1994). For some models the sampling scheme based on the disturbance smoother proposed in (de Jong and Shephard, 1995) may be more efficient. Both these methods are outlined below. In the exposition conditioning on  is assumed to be implicit, and omitted for the sake of notational clarity.

B.4

B.4.1

Sampling from the Distribution p( 1:T j ; y1:T )

53

State-Based Sampling

The posterior distribution for the stacked state vector 1:T can be factorised according to the probability chain rule as:

p( 1:T jy1:T ) = p( 1 ; : : : ; T jy1:T ) = p( T jy1:T )

TY1 t=1

p( t j t+1 ; : : : ; T ; y1:T ):

(130)

Using Bayes’ rule, each of the factors within the product of (130) can be further decomposed as:

p( t j t+1 ; : : : ; T ; y1:T ) / p( t+1 j t ; t+2 ; : : : ; T ; y1:T )p( t j t+2 ; : : : ; T ; y1:T ):

(131)

Following from the first-order Markovian nature of the state process, p( t+1 j t ; t+2 ; : : : ; T ; y1:T ) = p( t+1 j t ), whereas the causality of the system allows for the simplification p( t j t+2 ; : : : ; T ; y1:T ) = p( t jy1:t ) to be made. Thus, p( t j t+1 ; : : : ; T ; y1:T ) = p( t j t+1 ; y1:t ), and (130) can be rewritten as:

p( 1:T jy1:T ) = p( T jy1:T )

/ p( T jy

T)

1:

TY1 t=1 TY1 t=1

p( t j t+1 ; y1:t ) p( t+1 j t )p( t jy1:t ):

(132)

Thus, a draw from p( 1:T jy1:T ) can be constructed using the method of composition (Tanner, 1993) and recursing backwards in time for t = T; : : : ; 1, provided that subdraws from the densities on the RHS of (132) are practical. To this end, the distribution  p( t+1 j t ) = N Tt t ; Ht Qt HTt follows straightforwardly from the state transition equation (83), whereas, for the filtering density p( t jy1:t ) = N (at ; Pt ), the mean vector at and covariance matrix Pt are given by the Kalman filter equations (95) and (96). Thus, the smoothing density p( t j t+1 ; y1:t ) / p( t+1 j t )p( t jy1:t ) is easily shown to be Gaussian, i.e. p( t j t+1 ; y1:t ) = N atjt+1 ; Ptjt+1 , with atjt+1 and Ptjt+1 computed by recursing backwards in time for T = T 1; : : : ; 1, according to:

atjt+1 = Rt (at + Bt t+1 ) Ptjt+1 = RtPt;

(133) (134)

with

Rt = (Ip + Bt Tt) 1  Bt = PtTTt HtQt HTt 1 :

(135) (136)

Sampling from p( 1:T jy1:T ) in this way is of O (p3 T ) complexity, which is linear in the number of samples T .

54

Kalman Filter-Based Algorithms

Degeneracies Generally, the components of the stacked state vector 1:T are not independent, and many identities link the comprising state variables t , t = 1; : : : ; T . These identities are problem-dependent and a direct consequence of forcing the model into state-space form. They lead to the conditional densities p( t j t+1 ; y1:t ), t = 1; : : : ; T 1, being degenerate, i.e. having less degrees of freedom, say l, than the dimensionality of the state vector p. For a sample from p( 1:T jy1:T ) to be consistent with the state-space model under consideration, these degeneracies must be kept track of in the recursive construction of the draw. This can be efficiently done by linearly transforming the state vector t to another variable Æ t that has only l random components, given the state vector t+1 . Since the transformation is linear, the distribution for the random components of Æ t is also Gaussian, sampling from which requires the inversion of only a l(< p)-dimensional matrix. After sampling these components, the inverse linear transformation can be applied, yielding a sample from p( t j t+1 ; y1:t ) consistent with the rest of the draw. For more detail see (Fruhwirth-Schnatter, 1994).

B.4.2

Disturbance-Based Sampling

Disturbance-based sampling, first introduced in (de Jong and Shephard, 1995), is a generalisation of the analytical disturbance smoother proposed in (Koopman, 1993). This technique samples from the distribution p( 1:T jy1:T ), with  1:T = [ T1 ; : : : ;  TT ℄T the stacked disturbance vector, and  t = F t ut 2 R r , with Ft an arbitrary matrix. All random variables in the state-space model are linear combinations of the disturbances, and can be constructed, as desired, once the disturbances are available. Indirectly constructing the draw from the stacked state vector 1:T by sampling from the disturbances, is not subject to automatic degeneracies (see Section B.4.1), and is typically simpler. The posterior distribution of the disturbances can be decomposed in a way similar to what was done for the states in (130) as:

p(1:T jy1:T ) = p(1 ; : : : ; T jy1:T ) = p(T jy1:T )

TY1 t=1

p(t jt+1 ; : : : ;  T ; y1:T ):

(137)

If sub-draws from the densities on the RHS of (137) are feasible, the draw from p( 1:T j y1:T ) can be constructed using the method of composition, and recursing backwards in time for t = T; : : : ; 1, as was done in Section B.4.1 for the stacked state vector 1:T . The jointly Gaussian nature of the disturbance f t g = fFt ut g and observation fyt g

55 processes implies that these densities are all Gaussian, and hence fully specified by their corresponding mean vectors and covariance matrices. These can be expressed as8 : 





t jt ; : : : ; T ; y T = Ft Qt GTt Dt et + JTt rt  cov( t j t ; : : : ;  T ; y T ) = Ft Qt Ir GTt Dt Gt Qt JTt Ut Jt Qt FTt = Ct : E

+1

1:

+1

1:

1

1

(138) (139)

In the above, rt and Ut are evaluated by the following backwards recursions for t

T; : : : ; 1:

rt 1 = LTt rt + ZTt Dt 1 et VtT Ct 1 "t Ut 1 = LTt Ut Lt + ZTt Dt 1Zt + VtT Ct 1Vt; with

=

(140) (141)



Vt = Ft Qt GTt Dt 1Zt + JTt Ut Lt : (142) The recursions are initialised with rT = 0 and UT = 0. In (140), "t denotes the distur bance estimation error, defined as "t =  t E  t j t+1 ; : : : ;  T ; y1:T . To reconstruct the system states 1:T from the sampled disturbances  1:T requires that Ft = Ht. The disturbance sample is then drawn from p(fHtut gjy1:T ), and the states can be reconstructed recursively for t = 0; : : : ; T 1 according to:

t = Tt t + t ; +1

starting with 0

(143)

= 0.

Sampling from p( 1:T jy1:T ) indirectly in this way does not require the inversion of a matrix with dimensions proportional to the system state p, and subsequently the computational complexity is reduced from O (p3 T ) for the state-based sampler to O (p2 T ), which is of the same order as that for the Kalman filter. It also requires less storage space than the state-based sampler, and, provided a suitable choice is made for Ft , is not subject to automatic degeneracies.

C

S ECOND -O RDER O PTIMISATION

Consider the minimisation9 problem that involves finding x such that:

x = arg minff (x)g; x

8

(144)

Only the main results are presented here. Details of the derivation can be found in (de Jong and Shephard, 1995). 9 Any maximisation problem can be represented as a minimisation problem by negating the function of interest.

56

Second-Order Optimisation

where f is some real-valued function of the vector variable x. For the purposes of the specific optimisation algorithm developed here, the added constraint that the first and second order partial derivatives of f with respect to the components of x exist and are smooth, is imposed. To solve (144), a general class of iterative algorithms of the form:

xk+1 = xk + Æk dk

(145)

is considered here. In (145), xk is the value of the vector variable at iteration k , dk is a search direction vector, and Æk is an adaptation step size, computed such that:

Æk = arg minfg (Æ )g; Æ

(146)

where the function g is defined as g (Æ ) = f (xk + Æ dk ). The algorithm is initialised with some initial estimate for the variables x0 , and iterated until convergence is achieved. Typical conditions for convergence are:

kxk jfk

xk k <  x

+1 +1

fk j <  f ;

where k  k denotes the Euclidean norm, fk tolerances.

(147) (148)

= f (xk ), and x and f are user-specified

Given this formulation, the optimisation problem reduces to finding the search direction vector dk and adaptation step size Æk at each iteration k . The subsequent sections present methods to compute these parameters.

C.1

The Search Direction dk

The main requirement for the search direction vector dk is that it should be a direction of descent, i.e. for small values of the adaptation step size Æ , the value of g (Æ ) decreases as Æ increases from zero. The simplest choice for such a vector is to set:

dk = g k ;

(149)

where gk = df (xk )=dx is the gradient of f with respect to x, measured at the point xk . Employing this choice for the search direction vector, the method becomes the well-known steepest descent (SD) algorithm. Convergence of this algorithm is linear, and occurs at a rate of: 

    R = \left( \frac{r - 1}{r + 1} \right)^2,   (150)

where r = \lambda_{max} / \lambda_{min}, with \lambda_{min} and \lambda_{max} the minimum and maximum eigenvalues of the Hessian matrix of f with respect to x, respectively (Luenberger, 1984).


The value r is often referred to as the condition number of the corresponding matrix. Thus, convergence can be particularly slow for high values of the condition number r. This commonly occurs when the components of x are inter-dependent, as is often the case in many problems of practical importance.
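As a numerical illustration of this point (using a hypothetical diagonal Hessian), the rate (150) is easily evaluated:

    import numpy as np

    H = np.diag([10.0, 0.1])          # hypothetical ill-conditioned Hessian
    lam = np.linalg.eigvalsh(H)
    r = lam.max() / lam.min()         # condition number r = 100
    R = ((r - 1) / (r + 1)) ** 2      # rate (150): approximately 0.96

A rate this close to unity means the error shrinks by only about four percent per iteration, which is why ill-conditioned problems motivate the Newton-type corrections considered next.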

Newton’s method addresses this problem by assuming that the function to be minimised is locally well approximated by a quadratic function, and by minimising this approximate function exactly. Thus, in the region near x_k, f is approximated by the truncated Taylor series expansion:

    f(x) \approx f_k + g_k^T (x - x_k) + \frac{1}{2} (x - x_k)^T H_k (x - x_k),   (151)

where H_k is the matrix of second partial derivatives of f with respect to the components of x, or the Hessian. This function is minimised at:

    x_{k+1} = x_k - H_k^{-1} g_k.   (152)

At points far from the solution the function f may be locally very different from a quadratic function, and (152) needs to be modified. This is usually done by introducing an adaptation step size \delta_k, serving the same purpose as before, so that (152) becomes:

    x_{k+1} = x_k - \delta_k H_k^{-1} g_k.   (153)

This update equation is of the same general form as (145) if the search direction vector is defined as:

    d_k = -H_k^{-1} g_k.   (154)

This method leads to second-order convergence, and is not plagued by correlations between the components of x, as is the case for the steepest descent algorithm. It effectively transforms the search space so that the components of the vector variable x are independent in the transformed space. For the direction vector of (154) to be a direction of descent, the Hessian H_k is required to be positive definite. If f is a quadratic function with a minimum, this will always be the case, but for general functions and points far from the solution, this condition is often violated. A common approach to restore positive definiteness in cases where it is violated, or nearly violated, is to set:

    \tilde{H}_k = \gamma_k I + H_k,   (155)

for some non-negative value of \gamma_k, and to employ this modified Hessian in (154) instead of H_k. This can be regarded as a compromise between steepest descent (\gamma_k very large) and Newton’s method (\gamma_k = 0). There is always a \gamma_k that makes \tilde{H}_k positive definite. One way to ensure positive definiteness is to set \gamma_k to the smallest non-negative value


for which the matrix \tilde{H}_k has eigenvalues greater than or equal to some \epsilon > 0. To find this value, \tilde{H}_k is expanded as:

    \tilde{H}_k = \gamma_k I + H_k
                = \gamma_k I + V_k \Lambda_k V_k^T
                = V_k ( \gamma_k I + \Lambda_k ) V_k^T
                = V_k \tilde{\Lambda}_k V_k^T,   (156)

where H_k = V_k \Lambda_k V_k^T is the eigenvalue decomposition of H_k, and \tilde{\Lambda}_k is the diagonal eigenvalue matrix of \tilde{H}_k. It follows straightforwardly that:

    \gamma_k = \epsilon - \lambda_{min}(H_k),   (157)

where \lambda_{min}(H_k) is the minimum eigenvalue of H_k.
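In code, the shift (157) amounts to a single symmetric eigenvalue computation; a minimal sketch (names hypothetical):

    import numpy as np

    def modified_newton_direction(H, g, eps=1e-6):
        # Shift the Hessian so its smallest eigenvalue is at least eps,
        # as in (155)-(157), then return d_k = -H~^{-1} g_k from (154).
        gamma = max(0.0, eps - np.linalg.eigvalsh(H).min())
        return -np.linalg.solve(H + gamma * np.eye(len(g)), g)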

C.2  The Adaptation Step Size \delta_k

Once the search direction at iteration k has been specified, the adaptation step size \delta_k is chosen so that (146) is satisfied. This is normally done by performing a line search in the direction of d_k. Numerous line search strategies exist (see (Luenberger, 1984) for a summary), differing in convergence rate and computational complexity. If the evaluation of f is computationally expensive, performing a complete line search at each iteration may lead to an optimisation algorithm that is computationally prohibitive. In this case it might be beneficial to relax the requirement of (146), with \delta_k chosen only such that g(\delta_k) < g(0). Due to the definition of d_k in (154), it is known that, at points near the solution x^*, the optimum value for the adaptation step size is approximately one. This is also a good initial value to employ for general points. If it does not lead to an immediate reduction in the value of f, an alternative search strategy is to successively halve the value of \delta_k until a point of decrease is found. Since d_k is a direction of descent, such a point is guaranteed to exist. In this way the number of function evaluations is kept within an acceptable bound without significant loss in accuracy, since the optimum value for the adaptation step size will seldom differ significantly from unity.
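The step-halving strategy just described takes only a few lines; again, this is an illustrative sketch rather than a prescribed implementation:

    def halving_step_size(f, x, d, max_halvings=30):
        # Start at the near-optimal Newton step delta = 1 and halve until a
        # point of decrease is found; since d is a descent direction, one exists.
        f0, delta = f(x), 1.0
        for _ in range(max_halvings):
            if f(x + delta * d) < f0:
                break
            delta *= 0.5
        return delta

Together with modified_newton_direction above, this routine supplies the direction and step_size arguments of the generic loop sketched at the beginning of this appendix.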

D  CONVERGENCE PROOF OUTLINE FOR THE EM ALGORITHM

In this appendix it is shown that the EM algorithm increases the value of the target function L(y, \theta) at each iteration i, or leaves it unchanged when convergence has been

achieved. Using (33), the target function can be expressed as:

    L(y, \theta) = \log p(y \mid \theta) p(\theta) = \log \int p(x, y \mid \theta) p(\theta) \, dx.   (158)

For any distribution q(x) that is strictly positive over the support of x, (158) may be further expanded as:

    L(y, \theta) = \log \int q(x) \frac{p(x, y \mid \theta) p(\theta)}{q(x)} \, dx
                 \geq \int q(x) \log \frac{p(\theta \mid z)}{q(x)} \, dx
                 = \int q(x) \log p(\theta \mid z) \, dx - \int q(x) \log q(x) \, dx,   (159)

where, in the second step, use has been made of Jensen’s inequality. The lower bound can be recognised as the negative of the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951), which is a measure of the distance between two distributions. Thus, by minimising the KL divergence (increasing the lower bound), the target function is increased. If q(x) = p(x \mid \theta^{(i-1)}, y), the inequality becomes an equality, and (159) can be rewritten as:

    L(y, \theta) = \mathrm{E}\left[ \log p(\theta \mid z) \mid \theta^{(i-1)}, y \right] - \mathrm{E}\left[ \log p(x \mid \theta, y) \mid \theta^{(i-1)}, y \right]
                 = Q(\theta, \theta^{(i-1)}) - \mathrm{E}\left[ \log p(x \mid \theta, y) \mid \theta^{(i-1)}, y \right],   (160)

where the conditional expectation is taken relative to the distribution p(x \mid \theta^{(i-1)}, y). The second term in the above is minimised¹⁰ for \theta = \theta^{(i-1)}, i.e.:

    \mathrm{E}\left[ \log p(x \mid \theta^{(i-1)}, y) \mid \theta^{(i-1)}, y \right] \geq \mathrm{E}\left[ \log p(x \mid \theta, y) \mid \theta^{(i-1)}, y \right], \quad \forall \theta.   (161)

Thus, to increase the target function L(y, \theta), it suffices to increase the first term in (160), i.e. to find \theta^{(i)} so that:

    Q(\theta^{(i)}, \theta^{(i-1)}) \geq Q(\theta^{(i-1)}, \theta^{(i-1)}).   (162)

This is done by the M-step.

¹⁰ This result is a well-known consequence of Jensen’s inequality. See (Rao, 1965) for a proof.

E  MAXIMISING Q(\theta, \theta^{(i-1)})

The derivatives of Q(\theta, \theta^{(i-1)}), as defined in (63), are given by (44) to (47), repeated here for convenience as:

    \frac{\partial}{\partial a} Q(\theta, \theta^{(i-1)}) = \frac{1}{\sigma_e^2} M \Phi \tilde{a} - \frac{1}{\sigma_a^2} (a - \mu_a)   (163)

    \frac{\partial}{\partial \sigma_e^2} Q(\theta, \theta^{(i-1)}) = -\left( \frac{T}{2} + \alpha_e + 1 \right) \frac{1}{\sigma_e^2} + \frac{1}{(\sigma_e^2)^2} \left( \frac{1}{2} \tilde{a}^T \Phi \tilde{a} + \beta_e \right)   (164)

    \frac{\partial}{\partial \sigma_v^2} Q(\theta, \theta^{(i-1)}) = -\left( \frac{T}{2} + \alpha_v + 1 \right) \frac{1}{\sigma_v^2} + \frac{1}{(\sigma_v^2)^2} \left( \frac{1}{2} \left( Z \Phi Z^T + \Psi \right) + \beta_v \right)   (165)

    \frac{\partial}{\partial \sigma_a^2} Q(\theta, \theta^{(i-1)}) = -\left( \frac{p}{2} + \alpha_a + 1 \right) \frac{1}{\sigma_a^2} + \frac{1}{(\sigma_a^2)^2} \left( \frac{1}{2} (a - \mu_a)^T (a - \mu_a) + \beta_a \right),   (166)

with

    M = [ 0_{p \times 1} \;\; I_p ]   (167)

    \Phi = \sum_{t=1}^{T} \left( P_{t|T} + a_{t|T} a_{t|T}^T \right)   (168)

    \Psi = \sum_{t=1}^{T} \left( y_t^2 - 2 y_t Z a_{t|T} \right).   (169)

In the above, \tilde{a} = [1, -a^T]^T, and the state estimates and their covariances a_{t|T}, P_{t|T}, t = 1, \ldots, T, are obtained by running the Kalman smoother (see Appendix B.2) with \theta^{(i-1)} as the parameter vector. Equating these expressions to zero leads to the following set of simultaneous equations, the solutions of which give the optimal parameter estimate for \theta^{(i)}:

    \frac{1}{\sigma_e^2} M ( \Phi_0 - \Phi_1 a ) - \frac{1}{\sigma_a^2} ( a - \mu_a ) = 0   (170)

    -\left( \frac{T}{2} + \alpha_e + 1 \right) \sigma_e^2 + \frac{1}{2} \tilde{a}^T \Phi \tilde{a} + \beta_e = 0   (171)

    -\left( \frac{T}{2} + \alpha_v + 1 \right) \sigma_v^2 + \frac{1}{2} \left( Z \Phi Z^T + \Psi \right) + \beta_v = 0   (172)

    -\left( \frac{p}{2} + \alpha_a + 1 \right) \sigma_a^2 + \frac{1}{2} (a - \mu_a)^T (a - \mu_a) + \beta_a = 0,   (173)

where \Phi = [ \Phi_0 \;\; \Phi_1 ] is a column-wise partition of \Phi, with \Phi_0 containing the first column, and \Phi_1 the remaining p columns. The equation for the additive noise variance (172) is independent of the other parameters, and can be solved uniquely, yielding:

    \sigma_v^{2*} = \left( \frac{T}{2} + \alpha_v + 1 \right)^{-1} \left( \frac{1}{2} \left( Z \Phi Z^T + \Psi \right) + \beta_v \right).   (174)

The solutions for the other parameters are not easily obtained directly, since substituting (171) and (173) into (170) leads to an expression that is cubic in the AR coefficient vector a. However, by cycling over the following two-step procedure, the solutions may be found iteratively:

1. Fix the estimates of \sigma_e^2 and \sigma_a^2 to their current values, and maximise for a to yield:

    a^* = \left( \frac{1}{\sigma_e^2} M \Phi_1 + \frac{1}{\sigma_a^2} I_p \right)^{-1} \left( \frac{1}{\sigma_e^2} M \Phi_0 + \frac{1}{\sigma_a^2} \mu_a \right).   (175)

2. Now, fixing the value of a to the one computed above, maximise for \sigma_e^2 and \sigma_a^2 to yield:

    \sigma_e^{2*} = \left( \frac{T}{2} + \alpha_e + 1 \right)^{-1} \left( \frac{1}{2} \tilde{a}^T \Phi \tilde{a} + \beta_e \right)   (176)

    \sigma_a^{2*} = \left( \frac{p}{2} + \alpha_a + 1 \right)^{-1} \left( \frac{1}{2} (a - \mu_a)^T (a - \mu_a) + \beta_a \right).   (177)

This procedure is guaranteed to converge to a locally optimal solution, and typically does so within very few iterations, since strong dependencies between the two variances on the one hand and the AR coefficient vector on the other are unlikely. In the case where a flat prior is assumed, the term \log p(\theta) disappears from (63). In this case the set of simultaneous equations that results when equating the derivatives to zero can be solved directly to yield:

    a^* = ( M \Phi_1 )^{-1} M \Phi_0   (178)

    \sigma_e^{2*} = \frac{1}{T} \tilde{a}^T \Phi \tilde{a}   (179)

    \sigma_v^{2*} = \frac{1}{T} \left( Z \Phi Z^T + \Psi \right).   (180)

Note that, since the prior is assumed to be flat, the hyperparameter \sigma_a^2 is no longer unknown (in fact, \sigma_a^2 \to \infty), and hence disappears from the parameter vector \theta.
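To summarise this appendix, the updates (174)-(177) can be collected into a single routine. The sketch below is illustrative only: it assumes the smoothed statistics \Phi and \Psi of (168)-(169) have already been computed, that Z is stored as a length-(p+1) array, and that the prior hyperparameters are held in dictionaries alpha and beta; all names are hypothetical.

    import numpy as np

    def m_step(Phi, Psi, Z, mu_a, alpha, beta, T, p, a, sig_e2, sig_a2, n_cycles=5):
        # Closed-form additive noise variance update (174):
        sig_v2 = ((Z @ Phi @ Z + Psi) / 2 + beta['v']) / (T / 2 + alpha['v'] + 1)
        M = np.hstack([np.zeros((p, 1)), np.eye(p)])        # (167)
        Phi0, Phi1 = Phi[:, 0], Phi[:, 1:]                  # column partition of Phi
        for _ in range(n_cycles):                           # two-step cycle
            A = M @ Phi1 / sig_e2 + np.eye(p) / sig_a2
            b = M @ Phi0 / sig_e2 + mu_a / sig_a2
            a = np.linalg.solve(A, b)                       # (175)
            a_tilde = np.concatenate(([1.0], -a))           # a~ = [1, -a^T]^T
            sig_e2 = (a_tilde @ Phi @ a_tilde / 2
                      + beta['e']) / (T / 2 + alpha['e'] + 1)    # (176)
            sig_a2 = ((a - mu_a) @ (a - mu_a) / 2
                      + beta['a']) / (p / 2 + alpha['a'] + 1)    # (177)
        return a, sig_e2, sig_v2, sig_a2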

F  AUDIO DATA FILES

The Microsoft wave (“.wav”) audio files for the clean, corrupted and enhanced utterances of Section 5 are located at the URL:

http://www-svr.eng.cam.ac.uk/~jv211/audio/methods/


Tables 2 to 5 below give the names¹¹ and descriptions of the files that can be found at this URL. As before, S1 and S2 refer to the utterances of the sentences below, by respectively a male and a female speaker.

S1: “Good service should be rewarded by big tips.”
S2: “Draw every outer line first, then fill in the interior.”

Utterance   File name
S1          x_bigtips
S2          x_draw

Table 2: Clean speech file names.

Noise process   Description       File name
N1              stationary        v_bigtips_white
N2              non-stationary    v_bigtips_burst
N3              stationary        v_draw_white
N4              non-stationary    v_draw_burst

Table 3: Noise process file names and descriptions.

Utterance   File name
S1 N1       y_bigtips_white
S1 N2       y_bigtips_burst
S2 N3       y_draw_white
S2 N4       y_draw_burst

Table 4: Corrupted speech file names.

¹¹ The extension “.wav” is implied, and omitted for the sake of convenience.


Input utterance   Enhancement method       File name
S1 N1             Evidence maximisation    xe_bigtips_white_EVM
S1 N1             EM                       xe_bigtips_white_EM
S1 N1             Gibbs sampler            xe_bigtips_white_MCMC
S1 N2             Evidence maximisation    xe_bigtips_burst_EVM
S1 N2             EM                       xe_bigtips_burst_EM
S1 N2             Gibbs sampler            xe_bigtips_burst_MCMC
S2 N3             Evidence maximisation    xe_draw_white_EVM
S2 N3             EM                       xe_draw_white_EM
S2 N3             Gibbs sampler            xe_draw_white_MCMC
S2 N4             Evidence maximisation    xe_draw_burst_EVM
S2 N4             EM                       xe_draw_burst_EM
S2 N4             Gibbs sampler            xe_draw_burst_MCMC

Table 5: Enhanced speech file names.


REFERENCES

Anderson, B. D. O. and Moore, J. B. (1979). Optimal Filtering. Prentice-Hall, Englewood Cliffs.

Bernardo, J. M. and Smith, A. F. M. (1994). Bayesian Theory. John Wiley and Sons.

Box, G. E. P., Jenkins, G. M., and Reinsel, G. C. (1994). Time Series Analysis, Forecasting and Control, Third Edition. Prentice-Hall, Englewood Cliffs.

Box, G. E. P. and Tiao, G. C. (1973). Bayesian Inference in Statistical Analysis. Addison-Wesley Publishing Company.

Carter, C. K. and Kohn, R. (1994). On Gibbs sampling for state space models. Biometrika, 81(3):541–553.

Chib, S. and Greenberg, E. (1994). Bayes inference in regression models with ARMA(p, q) errors. Journal of Econometrics, 64:183–206.

de Jong, P. and Shephard, N. (1995). The simulation smoother for time series models. Biometrika, 82(2):339–350.

Deller, J. R., Proakis, J. G., and Hansen, J. H. L. (1993). Discrete-Time Processing of Speech Signals. Macmillan Publishing Company, Englewood Cliffs.

Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–38.

Devroye, L. (1986). Non-Uniform Random Variate Generation. Springer-Verlag, New York.

Duane, S., Kennedy, A. D., Pendleton, B. J., and Roweth, D. (1987). Hybrid Monte Carlo. Physics Letters B, 195(2):216–222.

Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. John Wiley and Sons.

Fisher, R. A. (1925). Theory of statistical estimation. In Proceedings of the Cambridge Philosophical Society, volume 22, pages 700–725.

Frühwirth-Schnatter, S. (1994). Data augmentation and dynamic linear models. Journal of Time Series Analysis, 15(2):183–202.

Gannot, S., Burshtein, D., and Weinstein, E. (1998). Iterative and sequential Kalman filter-based speech enhancement algorithms. IEEE Transactions on Speech and Audio Processing, 6(4):373–385.

Gelfand, A. E. and Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85(410):398–409.


Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(6):721–741.

Gilks, W. R., Richardson, S., and Spiegelhalter, D. J. (1996). Markov Chain Monte Carlo in Practice. Chapman and Hall.

Godsill, S. J. (1997). Bayesian enhancement of speech and audio signals which can be modelled as ARMA processes. International Statistical Review, 65(1):1–21.

Godsill, S. J. and Rayner, P. J. W. (1995). A Bayesian approach to the restoration of degraded audio signals. IEEE Transactions on Speech and Audio Processing, 3(4):267–278.

Godsill, S. J. and Rayner, P. J. W. (1996). Robust reconstruction and analysis of autoregressive signals in impulsive noise using the Gibbs sampler. Technical Report CUED/F-INFENG/TR.233, Signal Processing Group, Cambridge University Engineering Department.

Godsill, S. J. and Rayner, P. J. W. (1998). Statistical reconstruction and analysis of autoregressive signals in impulsive noise using the Gibbs sampler. IEEE Transactions on Speech and Audio Processing, 6(4):352–372.

Green, P. J. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4):711–732.

Gull, S. F. (1989). Developments in maximum entropy data analysis. In Skilling, J., editor, Maximum Entropy and Bayesian Methods, pages 53–71. Kluwer Academic Publishers.

Harvey, A. C. (1989). Forecasting, Structural Time Series Models and the Kalman Filter. Cambridge University Press.

Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109.

Koopman, S. J. (1993). Disturbance smoother for state space models. Biometrika, 80(1):117–126.

Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22:79–86.

Lim, J. S. and Oppenheim, A. V. (1978). All-pole modeling of degraded speech. IEEE Transactions on Acoustics, Speech, and Signal Processing, ASSP-26(3):197–210.

Lim, J. S. and Oppenheim, A. V. (1979). Enhancement and bandwidth compression of noisy speech. Proceedings of the IEEE, 67(12):1586–1604.

Luenberger, D. G. (1984). Linear and Nonlinear Programming, Second Edition. Addison-Wesley Publishing Company.


MacKay, D. J. C. (1992). Bayesian interpolation. Neural Computation, 4(3):415–447.

Makhoul, J. (1975). Linear prediction: A tutorial review. Proceedings of the IEEE, 63(4):561–580.

Meng, X. L. and Rubin, D. B. (1992). Recent extensions to the EM algorithm. In Bernardo, J. M., Berger, J. O., Dawid, A. P., and Smith, A. F., editors, Bayesian Statistics, volume 4, pages 307–320. Oxford University Press.

Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1091.

Neal, R. M. (1993). Probabilistic inference using Markov chain Monte Carlo methods. Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto.

Priestley, M. B. (1981). Spectral Analysis and Time Series. Academic Press.

Rabiner, L. R. and Schafer, R. W. (1978). Digital Processing of Speech Signals. Prentice-Hall, Englewood Cliffs.

Rao, C. R. (1965). Linear Statistical Inference and its Applications. Wiley, New York.

Ripley, B. D. (1987). Stochastic Simulation. John Wiley, New York.

Roberts, G. O. (1996). Markov chain concepts related to sampling algorithms. In Gilks, W. R., Richardson, S., and Spiegelhalter, D. J., editors, Markov Chain Monte Carlo in Practice, pages 45–57. Chapman and Hall.

Roberts, G. O. and Sahu, S. (1997). Updating schemes, correlation structure, blocking and parameterisation for the Gibbs sampler. Journal of the Royal Statistical Society, Series B, 59:291–317.

Saleh, G. M. K. (1996). Bayesian Inference in Speech Processing. PhD thesis, Cambridge University Engineering Department, Cambridge, England.

Saleh, G. M. K. and Niranjan, M. (1998). Speech enhancement in a Bayesian framework. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 1, pages 389–392.

Segal, M. and Weinstein, E. (1989). A new method for evaluating the log-likelihood gradient, the Hessian, and the Fisher information matrix for linear dynamic systems. IEEE Transactions on Information Theory, 35(3):682–687.

Shephard, N. (1994). Partial non-Gaussian state space. Biometrika, 81(1):115–131.

Tanner, M. A. (1993). Tools for Statistical Inference, Second Edition. Springer-Verlag, New York.


Tierney, L. (1994). Markov chains for exploring posterior distributions (with discussion). The Annals of Statistics, 22(4):1701–1762.

Tierney, L. (1996). Introduction to general state-space Markov chain theory. In Gilks, W. R., Richardson, S., and Spiegelhalter, D. J., editors, Markov Chain Monte Carlo in Practice, pages 59–74. Chapman and Hall.

Troughton, P. T. and Godsill, S. J. (1997). A reversible jump sampler for autoregressive time series, employing full conditionals to achieve efficient model space moves. Technical Report CUED/F-INFENG/TR.304, Signal Processing Group, Cambridge University Engineering Department.

Troughton, P. T. and Godsill, S. J. (1998). A reversible jump sampler for autoregressive time series. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, volume 4, pages 2257–2260.

Vermaak, J., Niranjan, M., and Godsill, S. J. (1998a). Markov chain Monte Carlo estimation for the seasonal autoregressive process with application to pitch modelling. Technical Report CUED/F-INFENG/TR.312, Speech, Vision and Robotics Group, Cambridge University Engineering Department.

Vermaak, J., Niranjan, M., and Godsill, S. J. (1998b). Markov chain Monte Carlo estimation of a speech production model for voiced sounds. Technical Report CUED/F-INFENG/TR.325, Speech, Vision and Robotics Group, Cambridge University Engineering Department.