Speech Communication 49 (2007) 501–513 www.elsevier.com/locate/specom

Robust distant speaker recognition based on position-dependent CMN by combining speaker-specific GMM with speaker-adapted HMM

Longbiao Wang *, Norihide Kitaoka, Seiichi Nakagawa
Department of Information and Computer Sciences, Toyohashi University of Technology, 1-1, Hibarigaoka, Tempaku-cho, Toyohashi, Aichi 441-8580, Japan

Received 26 December 2005; received in revised form 7 February 2007; accepted 11 April 2007

* Corresponding author. Tel.: +81 532 44 6777; fax: +81 532 44 6757. E-mail address: [email protected] (L. Wang).

Abstract

In this paper, we propose a robust speaker recognition method based on position-dependent Cepstral Mean Normalization (CMN) to compensate for the channel distortion depending on the speaker position. In the training stage, the system measures the transmission characteristics according to the speaker positions from some grid points to the microphone in the room and estimates the compensation parameters a priori. In the recognition stage, the system estimates the speaker position, adopts the compensation parameters corresponding to the estimated position, applies the CMN to the speech, and performs speaker recognition. In our past study, we proposed a new text-independent speaker recognition method by combining speaker-specific Gaussian mixture models (GMMs) with syllable-based HMMs adapted to the speakers by MAP [Nakagawa, S., Zhang, W., Takahashi, M., 2004. Text-independent speaker recognition by combining speaker-specific GMM with speaker-adapted syllable-based HMM. Proc. ICASSP-2004 1, 81–84]. The robustness of this speaker recognition method to changes of speaking style in a close-talking environment was evaluated in (Nakagawa et al., 2004). In this paper, we extend this combination method to distant speaker recognition and integrate it with the proposed position-dependent CMN. Our experiments showed that the proposed method improved the speaker recognition performance remarkably in a distant environment.
© 2007 Elsevier B.V. All rights reserved.

Keywords: Distant speaker recognition; GMM; HMM; Position-dependent CMN; Sound source estimation

1. Introduction

Hands-free speech communication (Juang and Soong, 2001; Hughes et al., 1999; Seltzer et al., 2004) has become more and more popular in environments such as an office or the cabin of a car. In a distant environment, however, channel distortion may drastically degrade speaker recognition performance. This is mostly caused by the mismatch between the practical environment and the training environment.


Over the past few decades, several approaches have been proposed to compensate for the adverse effects of a mismatch between training and testing conditions. One of them is the feature-based compensation technique, which compensates the noisy features to match the training environment before they are fed to the recognizer. The other is the model-based adaptation technique, which adapts/trains the speaker models using data recorded in the noisy condition. In this paper, we used both the feature-based compensation technique and the model-based adaptation technique to obtain robust distant speaker recognition performance.

A feature-based technique that compensates the input features is the main way to reduce such a mismatch. Cepstral mean normalization (CMN) has been used to reduce channel distortion as a simple and effective way of normalizing the feature vectors (Furui, 1981; Liu et al., 1993).


CMN is sometimes supplemented with variance normalization. Recently, other efficient feature normalization approaches have been proposed for improving speaker recognition performance, mainly feature warping (Pelecanos and Sridharan, 2001) and short-time Gaussianization (Xiang et al., 2002). Feature warping consists of mapping the observed cepstral feature distribution to a normal distribution over a sliding window, the various cepstral coefficients being processed in parallel streams. Short-time Gaussianization is similar but applies a linear transformation to the features before mapping them to a normal distribution. This linear transformation, which can be estimated by the EM algorithm, makes the resulting features better suited to diagonal covariance GMMs. Barras and Gauvain (2003) evaluated those feature normalization methods on cellular data. The essence of our study is how the transmission characteristics are affected by the speaker position. Since CMN is very easy to implement and has been adopted in many current systems, we used it, without loss of generality, to verify our idea. The other feature normalization methods were not evaluated in this paper.

In conventional CMN, the cepstral mean is estimated by averaging over the entire current utterance and is kept constant during the normalization. However, this off-line estimation involves a long delay that is likely unacceptable when the utterance is long, and if the utterance is short, an accurate cepstral mean cannot be estimated. Alternatively, when the conditions (environment, channel, speaker, etc.) do not change for a period of time, the cepstral mean can be estimated from a given set of previous utterances, and the delay is thus avoided (Kitaoka et al., 2001). Nevertheless, in certain applications the conditions do change, for example when the speaker or the speaking position changes. In a distant environment, the transmission characteristics from different speaker positions are very different, so the Kitaoka method cannot track the rapid change of the transmission characteristics. Various windowed CMN methods have been used to normalize the feature vectors in an on-line fashion (Viikki and Laurila, 1998; Pujol et al., 2006). However, there exists a tradeoff between delay and recognition error (Pujol et al., 2006). Thus, the usual CMN cannot obtain an excellent recognition performance with a short delay.

In this paper, we propose a robust speaker recognition method using a new real-time CMN based on speaker position, which we call position-dependent CMN. We measured the transmission characteristics (the compensation parameters for position-dependent CMN) from some grid points in the room a priori. The system estimates the speaker position in a 3D space based on microphone arrays. Four microphones were arranged in a T-shape on a plane, and the sound source position was estimated from the Time Delay Of Arrival (TDOA) among the microphones (Knapp and Carter, 1976; Omologo and Svaizer, 1996; Doclo and Moonen, 2003).

The system then adopts the compensation parameter corresponding to the estimated position, applies the channel distortion compensation method (that is, position-dependent CMN) to the speech, and performs speaker recognition.

How to normalize the intra-speaker variation of likelihood (similarity) values is one of the most difficult problems in speaker verification. Recently, several score normalization techniques have been used for speaker verification, although they are not applicable to speaker identification. The most frequently used among them are the Z-norm and the T-norm (Auckenthaler et al., 2000; Barras and Gauvain, 2003). These two score normalization methods, which normalize the distribution of the scores, have been proven to be quite efficient (Auckenthaler et al., 2000). Barras and Gauvain (2003) indicated that the combination of feature normalization and score normalization improved the verification performance more than either one individually, since their effects are cumulative. In this paper, feature normalization dependent on the speaker position compensated the mismatch between test and training conditions efficiently. We did not conduct speaker verification experiments using such a combination in this paper, but we believe that the combination would achieve good performance on speaker verification.

For speaker recognition, various types of speaker models have long been studied. Hidden Markov models (HMMs) have become the most popular statistical tool for the text-dependent task. The best results have been obtained using continuous density HMMs (CHMMs) for modeling speaker characteristics (Savic and Gupta, 1990). For the text-independent task, the temporal sequence modeling capability of the HMM is not required. Therefore, a one-state CHMM, also called a Gaussian mixture model (GMM), has been widely used as a speaker model (Tseng et al., 1992). In GMM modeling techniques, feature vectors are assumed to be statistically independent. Although this is not true, it allows one to simplify the mathematical formulations. To overcome this assumption, models based on segments of feature frames were proposed (Liu et al., 1995). One of the disadvantages of the GMM is that the acoustic variability dependent on phonetic events is not directly taken into account. In other words, this modeling is not sufficiently constrained by the phonetic temporal pattern. Therefore, speech recognition techniques have been used for text-dependent speaker identification (Matsui and Furui, 1995). This approach is also used for text-independent speaker identification. Nakagawa et al. (2004, 2006) proposed a new speaker recognition method by combining speaker-specific GMMs with speaker-adapted syllable-based HMMs, which showed robustness to changes of speaking style in a close-talking environment. In this paper, we extend this combination method to distant speaker recognition and integrate it with the proposed position-dependent CMN. The MAP-based adaptation of the speaker-independent GMM (speaker-adapted GMM) is indeed an effective speaker identification method (Reynolds et al., 2000), but Nakagawa et al. (2004, 2006) indicated that the speaker-specific GMM obtained slightly better performance than the speaker-adapted GMM.


We therefore used the speaker-specific GMMs in this paper. Since the training data were not sufficient for speaker-specific HMMs, we used speaker-independent HMMs adapted to the speakers by MAP (speaker-adapted HMMs) instead of speaker-specific HMMs. Reynolds and Rose (1995) also proposed effective Gaussian mixture speaker models for robust text-independent speaker identification.

The essence of this paper is that the transmission characteristics are distorted depending on the speaker position and that the proposed position-dependent CMN can address this problem effectively. Furthermore, the position-dependent CMN could also be integrated with various speaker models. The consideration of state-of-the-art speaker models is beyond the scope of this paper. Thus, in this paper, syllable-based HMMs adapted to the speakers by MAP and speaker-specific GMMs, whose parameters were estimated by the EM algorithm using training data uttered by the corresponding speaker, were used for speaker identification.

2. Speaker position estimation

Speaker localization based on the Time Delay Of Arrival (TDOA) between distinct microphone pairs has been shown to be effectively implementable and to provide good performance even in a moderately reverberant environment and in noisy conditions (Brandstein, 1995; Omologo and Svaizer, 1996; DiBiase et al., 2001; Huang et al., 2001; Doclo and Moonen, 2003). Speaker localization in an acoustical environment involves two steps. The first step is the estimation of time delays between pairs of microphones. The second step is to use these delays to estimate the speaker location. The accuracy of the TDOA estimation is very important to the speaker localization accuracy. The prevalent technique for TDOA estimation is based upon the Generalized Cross-Correlation (GCC), in which the delay estimate is obtained as the time lag that maximizes the cross-correlation between filtered versions of the received signals (Knapp and Carter, 1976). Various investigators (DiBiase et al., 2001; Doclo and Moonen, 2003; Raykar et al., 2005) have proposed more effective TDOA estimation methods for noisy and reverberant acoustic environments.

On the other hand, it is necessary to find the speaker position using the estimated delays. The Maximum Likelihood (ML) location estimate is one of the common methods because of its proven asymptotic consistency. It does not have a closed-form solution for the speaker position because of the nonlinearity of the hyperbolic equations. The Newton–Raphson iterative method (Bard, 1974), the Gauss–Newton method (Foy, 1976), and the Least Mean Squares (LMS) algorithm are among the possible choices to find the solution. However, for these iterative approaches, selecting a good initial guess to avoid a local minimum is difficult, the convergence consumes much computational time, and the optimal solution cannot be guaranteed.


Therefore, it is our opinion that an ML location estimate is not suitable for a real-time implementation of a speaker localization system. We therefore proposed a method to estimate the speaker position using a closed-form solution. Using this method, the speaker position can be estimated in real time from the TDOAs.

We estimate the speaker position as follows. It is assumed that N microphones are located at positions (x_i, y_i, z_i), i = 1, ..., N, and the sound source is located at (x_s, y_s, z_s). The distance between the sound source and the ith microphone is denoted by:

$$D_i = \sqrt{(x_i - x_s)^2 + (y_i - y_s)^2 + (z_i - z_s)^2}. \qquad (1)$$

The difference in the distances from the sound source between microphone i and microphone j is given by:

$$d_{ij} = D_i - D_j = c\,\hat{\tau}_{ij}, \qquad (2)$$

where c is the velocity of sound and $\hat{\tau}_{ij}$ is the Time Delay Of Arrival (TDOA). The TDOA can be estimated by the Crosspower Spectrum Phase (CSP) method (Omologo and Svaizer, 1996, 1997). CSP is defined as follows:

$$\Phi(\omega, t) = \frac{X_i(\omega, t)\,X_j^{*}(\omega, t)}{|X_i(\omega, t)|\,|X_j(\omega, t)|}, \qquad (3)$$

$$C(\tau_{ij}, t) = \int_{-\infty}^{+\infty} \Phi(\omega, t)\, e^{j 2\pi \omega \tau_{ij}}\, d\omega, \qquad (4)$$

where $X_i$ and $X_j$ are the spectra of the sound signals received by a microphone pair, and t indicates the speech frame index. Then the $C(\tau_{ij}, t)$ are summed along t:

$$C(\tau_{ij}) = \sum_{t=0}^{T-1} C(\tau_{ij}, t). \qquad (5)$$

Finally, the TDOA $\hat{\tau}_{ij}$ is given by:

$$\hat{\tau}_{ij} = \arg\max_{\tau_{ij}} C(\tau_{ij}). \qquad (6)$$
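As an illustration of Eqs. (3)–(6), the following Python sketch estimates the TDOA of one microphone pair by CSP. It assumes numpy, a 2048-point frame with an 80-point shift and 20 frames (the analysis setting quoted in Section 5.3.1), and a Hanning window; the window choice, sign convention and lag search range are assumptions rather than details given in the paper.

```python
import numpy as np

def csp_tdoa(x_i, x_j, fs, frame_len=2048, frame_shift=80, n_frames=20, max_delay=0.001):
    """Estimate the TDOA between two microphone signals by CSP (Eqs. (3)-(6)).

    x_i, x_j : 1-D numpy arrays holding the two microphone signals.
    fs       : sampling frequency in Hz.
    Returns the estimated delay tau_ij in seconds.
    """
    win = np.hanning(frame_len)
    acc = np.zeros(frame_len)                       # accumulated C(tau), Eq. (5)
    for t in range(n_frames):
        s = t * frame_shift
        X_i = np.fft.rfft(x_i[s:s + frame_len] * win)
        X_j = np.fft.rfft(x_j[s:s + frame_len] * win)
        phi = X_i * np.conj(X_j)                    # crosspower spectrum (numerator of Eq. (3))
        phi /= np.abs(phi) + 1e-12                  # phase transform, Eq. (3)
        acc += np.fft.irfft(phi, n=frame_len)       # correlation over lags, Eqs. (4)-(5)
    # Restrict the search to physically possible lags and take the maximum, Eq. (6).
    max_lag = int(max_delay * fs)
    lags = np.concatenate([np.arange(0, max_lag + 1), np.arange(-max_lag, 0)])
    best = lags[np.argmax(acc[lags])]
    return best / fs
```

For the T-shaped array of Fig. 1 below, this function would be called for the pairs (M1, M2), (M1, M3) and (M1, M4).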

In order to estimate the speaker position in a 3D space, four microphones are required theoretically. The microphones are set on a plane as indicated in Fig. 1. We can estimate the speaker position by using the three microphone pairs (M1, M2), (M1, M3) and (M1, M4).

Fig. 1. Microphones arranged for speaker position estimation (d = 20 cm): M1 = (0, 0, 0), M2 = (0, d, 0), M3 = (0, 0, d), M4 = (0, -d, 0).


The first microphone (M1) is regarded as the reference and is placed at the origin of the coordinate system. The other three microphones are placed on the plane at the same distance d from the first microphone (M1). We can derive three equations from Eqs. (1) and (2):

$$\sqrt{(x_1 - x_s)^2 + (y_1 - y_s)^2 + (z_1 - z_s)^2} - \sqrt{(x_2 - x_s)^2 + (y_2 - y_s)^2 + (z_2 - z_s)^2} = c\,\hat{\tau}_{12}, \qquad (7)$$

$$\sqrt{(x_1 - x_s)^2 + (y_1 - y_s)^2 + (z_1 - z_s)^2} - \sqrt{(x_3 - x_s)^2 + (y_3 - y_s)^2 + (z_3 - z_s)^2} = c\,\hat{\tau}_{13}, \qquad (8)$$

$$\sqrt{(x_1 - x_s)^2 + (y_1 - y_s)^2 + (z_1 - z_s)^2} - \sqrt{(x_4 - x_s)^2 + (y_4 - y_s)^2 + (z_4 - z_s)^2} = c\,\hat{\tau}_{14}. \qquad (9)$$

Because of the symmetry of the three microphone pairs, these square-root equations can be solved in closed form as:

$$y_s = \frac{-b_3 + \sqrt{b_3^2 - 4 a_3 c_3}}{2 a_3}, \qquad (10)$$

then

$$z_s = \frac{-b_2 + \sqrt{b_2^2 - 4 a_2 c_2}}{2 a_2}, \qquad (14)$$

and finally

$$x_s = \sqrt{y_s^2 + a_1 z_s^2 + b_1 z_s + c_1}, \qquad (18)$$

where the coefficients a_3, b_3, c_3 (Eqs. (11)-(13)), a_2, b_2, c_2 (Eqs. (15)-(17)) and a_1, b_1, c_1 (Eqs. (19)-(21)) are closed-form expressions in the microphone spacing d and the measured range differences c\hat{\tau}_{12}, c\hat{\tau}_{13} and c\hat{\tau}_{14}.

This method involves a relatively low computational cost, and there is no position estimation error if the TDOA estimates are correct, because no assumption is needed about the relative position between the microphones and the sound source. Of course, the approach still suffers from estimation errors caused by errors in measuring the TDOAs. If there are more than four microphones, we can also estimate the location using other combinations of four microphones, and we can then take the average of the estimated locations at only a small additional computational cost. As will be mentioned in Section 5.1, we did not use position estimation in the recognition experiments but assumed that we could estimate the position accurately, because various previous works revealed that TDOA-based methods are sufficiently accurate for our purpose.
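For readers who want to experiment with the localization step, the sketch below is a minimal numerical alternative that solves the range-difference equations (7)-(9) directly with a few Gauss-Newton iterations for the Fig. 1 geometry. It is not the paper's closed-form solution, and the sound speed, initial guess and source position in the example are assumptions.

```python
import numpy as np

C_SOUND = 340.0  # assumed speed of sound (m/s)

def solve_position(mics, taus, p0=(1.0, 0.0, 0.0), n_iter=20):
    """Solve Eqs. (7)-(9) for the source position by Gauss-Newton iteration.

    mics : (4, 3) array of microphone coordinates, M1 first.
    taus : (tau_12, tau_13, tau_14) measured TDOAs in seconds.
    """
    mics = np.asarray(mics, dtype=float)
    d_meas = C_SOUND * np.asarray(taus, dtype=float)   # c * tau_1k for k = 2, 3, 4
    p = np.asarray(p0, dtype=float)
    for _ in range(n_iter):
        dists = np.linalg.norm(p - mics, axis=1)        # D_1 ... D_4
        r = (dists[0] - dists[1:]) - d_meas             # residuals of Eqs. (7)-(9)
        # Jacobian of the residuals with respect to p = (x_s, y_s, z_s)
        J = (p - mics[0]) / dists[0] - (p - mics[1:]) / dists[1:, None]
        dp, *_ = np.linalg.lstsq(J, -r, rcond=None)
        p = p + dp
        if np.linalg.norm(dp) < 1e-6:
            break
    return p

# T-shaped geometry of Fig. 1 (d = 0.2 m) and TDOAs synthesised from a hypothetical source.
d = 0.2
mics = np.array([(0, 0, 0), (0, d, 0), (0, 0, d), (0, -d, 0)], dtype=float)
src = np.array([1.1, 0.6, 0.0])                        # hypothetical source position (m)
D = np.linalg.norm(src - mics, axis=1)
taus = (D[0] - D[1:]) / C_SOUND                        # ideal delays for this source
print(solve_position(mics, taus))                      # recovers approximately (1.1, 0.6, 0.0)
```

Because all four microphones of Fig. 1 lie in the x = 0 plane, the front-back ambiguity in the sign of x_s is resolved here simply by the positive initial guess.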

3. Position-dependent CMN

3.1. Conventional CMN and real-time CMN

A simple and effective way of channel normalization is to subtract the mean of each cepstrum coefficient (CMN) (Furui, 1981; Liu et al., 1993), which removes time-invariant distortions caused by the transmission channel and the recording device. When speech s is corrupted by convolutional noise h, the observed speech x becomes:

$$x = h * s. \qquad (22)$$

The cepstrum is obtained by applying the DCT to the logarithm of the power spectrum of the signal, and Eq. (22) then becomes:

$$C^x = C^h + C^s, \qquad (23)$$

where C^x, C^h and C^s express the cepstra of the observed speech x, the transmission characteristics h, and the clean speech s, respectively. Based on this, the convolutional noise is considered as an additive bias in the cepstral domain, so the noise (transmission characteristics or channel distortion) can be approximately compensated by subtracting the mean of each utterance:

$$\tilde{C}_t = C_t - \bar{C}_t + \bar{C}_{\mathrm{train}} \quad (t = 0, \ldots, T), \qquad (24)$$

or, in a mean-biased version, as:

$$\tilde{C}_t = C_t - \Delta C \quad (t = 0, \ldots, T), \qquad (25)$$

where the compensation parameter \Delta C is approximated by:

$$\Delta C \approx \bar{C}_t - \bar{C}_{\mathrm{train}}, \qquad (26)$$

where \bar{C}_t and \bar{C}_{\mathrm{train}} are the cepstral mean of C_t (the utterance to be recognized) and that of the training data for the original (speaker-independent) HMMs,¹ respectively; this is the conventional CMN.
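As a minimal sketch (assuming numpy and a (T, D) array of cepstral vectors), Eqs. (24)-(26) amount to the following:

```python
import numpy as np

def conventional_cmn(C, C_train_mean):
    """Conventional CMN, Eq. (24): subtract the utterance's own cepstral mean
    and add back the training-data mean.  C is a (T, D) array."""
    return C - C.mean(axis=0) + C_train_mean

def biased_cmn(C, delta_C):
    """Mean-biased version, Eq. (25): subtract a pre-computed compensation
    parameter delta_C, which approximates mean(C) - C_train_mean (Eq. (26))."""
    return C - delta_C
```

The point exploited later is that delta_C need not be computed from the current utterance: any estimate of the bias between the test condition and the training condition can be subtracted from the very first frame.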

Table 1
Euclidean cepstrum distance

Speaker 1 and speaker 2 (vowel /a/, close-talk)    0.227
Speaker 1 and speaker 2 (vowel /i/, close-talk)    0.357
Vowel /a/ and vowel /i/ (close-talk, speaker 1)    0.673
Vowel /a/ and vowel /i/ (close-talk, speaker 2)    0.673
Area 5 and area 10 (vowel /a/, speaker 1)          0.059
Area 5 and area 10 (vowel /i/, speaker 1)          0.125
Area 5 and area 10 (vowel /a/, speaker 2)          0.081
Area 5 and area 10 (vowel /i/, speaker 2)          0.168

Thus, when using conventional CMN, the compensation parameter \Delta C can only be calculated at the end of the input speech. This prevents real-time processing in speech recognition. Another problem of conventional CMN is that an accurate cepstral mean cannot be estimated, especially when the utterance is short. The third problem is that conventional CMN may have a negative effect on speaker recognition. Various studies (Furui et al., 1972; Markel et al., 1977) have indicated that long-term feature averaging can be used in text-independent speaker recognition. These results indicate that the average of the feature vectors (that is, the cepstral mean) contains some speaker characteristics and that conventional CMN may remove them; it may therefore not be very effective for speaker recognition in some cases, especially when the channel distortion is not large compared to the speaker variation, because the relative negative effect of the loss of speaker characteristics is considerably large in those cases.

We compared Euclidean cepstrum distances between different speakers, different vowels and different positions in Table 1. Japanese vowels /a/ and /i/ uttered by two Japanese male students were recorded in our experimental environment. Whereas the desired property of acoustic features for speaker recognition is to maximize the inter-speaker variation in the feature space, in Table 1 the inter-position variation (that is, the channel distortion) was smaller than the inter-speaker variation. As widely acknowledged, the speech generation process is well modeled by excitation and filtering, and cepstral analysis, including MFCC extraction, captures the characteristics of the latter. So if the speaker-specific features can be well represented by the average of the MFCCs (Furui et al., 1972; Markel et al., 1977), the loss of speaker characteristics may be larger than the gain from normalizing the channel distortion when the features are normalized by conventional CMN (which removes speaker characteristics along with the channel). Thus, conventional CMN may not be effective for speaker recognition.

¹ We used speaker-independent HMMs as the original models of the speaker-adapted HMMs, as described in Section 4.1.2. Conventional CMN, which subtracts the cepstral mean of the current utterance, compensates the channel distortion, but it also removes the speaker characteristics that are very important for identifying the speaker. Thus, for the original speaker-independent HMMs, the training data were not normalized by the cepstral mean of each utterance. So when adapting the original speaker-independent HMMs to the speakers by MAP, the common bias \bar{C}_{\mathrm{train}} was used (that is, Eqs. (25) and (26) were used). When \bar{C}_{\mathrm{train}} is used commonly for both adaptation and test, Eqs. (24) and (25) are theoretically identical for any arbitrary \bar{C}_{\mathrm{train}}. On the other hand, the speaker-specific GMM parameters were estimated by the EM algorithm using training data uttered by the corresponding speaker. In other words, no original model exists in the case of GMMs, so \bar{C}_{\mathrm{train}} was set to zero for the GMMs in this paper.

We solve these problems under the assumption that the channel distortion does not change drastically. In our method, the compensation parameter is calculated from utterances recorded a priori. The new compensation parameter is defined by:

$$\Delta C = \bar{C}_{\mathrm{environment}} - \bar{C}_{\mathrm{train}}, \qquad (27)$$

where \bar{C}_{\mathrm{environment}} is the cepstral mean of utterances recorded in the practical environment a priori. Using this method, the compensation parameter can be applied from the beginning of the recognition of the current utterance. Moreover, because the compensation parameter is estimated from a sufficient number of cepstral coefficients, it can compensate the distortion better than conventional CMN. We call this method real-time CMN.

In Kitaoka et al. (2001), the compensation parameter is calculated from past recognized utterances, i.e., the compensation parameter for the nth utterance is calculated as:

$$\Delta C^{(n)} = (1 - a)\,\Delta C^{(n-1)} - a\left(\bar{C}_{\mathrm{train}} - \bar{C}^{(n-1)}\right), \qquad (28)$$

where \Delta C^{(n)} and \Delta C^{(n-1)} are the compensation parameters for the nth and (n-1)th utterances, respectively, and \bar{C}^{(n-1)} is the mean of the cepstra of the (n-1)th utterance. Using this method, the compensation parameter can be calculated before the recognition of the nth utterance. This method can indeed track a slow change of the transmission characteristics, but characteristic changes caused by a change of the speaker position or of the speaker are beyond its tracking ability.

3.2. Incorporating speaker position information into real-time CMN

In a real distant environment, the transmission characteristics of different speaker positions are very different because of the distance between the speaker and the microphone and the reverberation of the room. Hence, the performance of a speaker recognition system based on real-time CMN will be drastically degraded by the drastic change of the channel distortion. In this paper, we incorporate speaker position information into real-time CMN (Wang et al., 2004; Wang et al., 2005b). We call this method position-dependent CMN. Instead of calculating the compensation parameter from past recognized utterances, the compensation parameters were measured from some grid points in the room a priori.
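A small sketch of how the two strategies differ in practice, assuming numpy, a per-area dictionary of a-priori cepstral means (one per area of Fig. 2 below) and an externally estimated area index; the smoothing constant and the random data are illustrative only:

```python
import numpy as np

def update_delta_c(delta_c_prev, C_prev_utt_mean, C_train_mean, a=0.2):
    """Real-time CMN of Kitaoka et al. (2001), Eq. (28): recursively track the
    compensation parameter from the previous utterance's cepstral mean."""
    return (1.0 - a) * delta_c_prev - a * (C_train_mean - C_prev_utt_mean)

def position_dependent_cmn(C, area, area_means, C_train_mean):
    """Position-dependent CMN: the compensation parameter of the estimated area
    is Delta_C = mean(area) - mean(train) (Eq. (29) below) and can be applied
    from the first frame of the utterance."""
    delta_c = area_means[area] - C_train_mean
    return C - delta_c

# Illustrative usage with random data standing in for real cepstra.
rng = np.random.default_rng(0)
C_train_mean = rng.normal(size=10)
area_means = {a: rng.normal(size=10) for a in range(1, 13)}   # one mean per area of Fig. 2
C_utt = rng.normal(size=(150, 10))                             # 150 frames, 10 cepstral coeffs
C_norm = position_dependent_cmn(C_utt, area=5, area_means=area_means, C_train_mean=C_train_mean)
```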

Fig. 2. Configuration of the room (room size: (W) 3 m × (D) 3.45 m × (H) 2.6 m), showing the microphone array and the 12 areas (each 0.6 m × 0.6 m) numbered 1-12.

The new compensation parameter for position-dependent CMN is defined by:

$$\Delta C = \bar{C}_{\mathrm{position}} - \bar{C}_{\mathrm{train}}, \qquad (29)$$

where \bar{C}_{\mathrm{position}} is the cepstral mean of utterances affected by the transmission characteristics between a certain position and the microphone. In our experiments in Section 5, we divide the room into 12 areas as shown in Fig. 2 and measure the \bar{C}_{\mathrm{position}} corresponding to each area.

4. Speaker recognition method

4.1. Speaker modeling

4.1.1. Gaussian mixture model (GMM)

A GMM is a weighted sum of M component densities and is given by the form:

$$p(x\,|\,\lambda) = \sum_{i=1}^{M} c_i\, b_i(x), \qquad (30)$$

where x is a d-dimensional random vector, the b_i(x), i = 1, ..., M, are the component densities and the c_i, i = 1, ..., M, are the mixture weights. Each component density is a d-variate Gaussian function of the form:

$$b_i(x) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}} \exp\!\left(-\frac{1}{2}(x - \mu_i)^{T} \Sigma_i^{-1} (x - \mu_i)\right), \qquad (31)$$

with mean vector \mu_i and covariance matrix \Sigma_i. The mixture weights satisfy the constraint that:

$$\sum_{i=1}^{M} c_i = 1. \qquad (32)$$

The complete Gaussian mixture model is parameterized by the mean vectors, covariance matrices and mixture weights of all component densities. These parameters are collectively represented by the notation:

$$\lambda = \{c_i, \mu_i, \Sigma_i\}, \quad i = 1, \ldots, M. \qquad (33)$$

In our speaker recognition system, each speaker is represented by such a GMM and is referred to by his model \lambda. For a sequence of T test vectors X = x_1, x_2, ..., x_T, the standard approach is to calculate the GMM likelihood in the log domain as:

$$L(X\,|\,\lambda) = \log p(X\,|\,\lambda) = \sum_{t=1}^{T} \log p(x_t\,|\,\lambda). \qquad (34)$$

The speaker-specific GMM parameters are estimated by the EM algorithm using training data uttered by the corresponding speaker, using the HTK toolkit (Young et al., 2000).

4.1.2. Speaker-adapted HMM

A parameter set of an HMM is given by \lambda = \{A, B, \pi\}, where A, B and \pi denote the set of state transition probabilities, the set of output probability density functions, and the set of initial state probabilities, respectively. We used context-independent syllable-based HMMs as acoustic models, each of which has a left-to-right topology and consists of five states, four with pdfs (probability density functions) of output probability. Each pdf consists of four Gaussians with full-covariance matrices. The number of syllables is 114. Speaker adaptation is performed for B. We briefly describe the adaptation method for a Gaussian distribution. Speaker adaptation by Maximum A Posteriori probability estimation (MAP) (Gauvain and Lee, 1994; Tsurumi and Nakagawa, 1994) is given by:

$$\hat{\mu}_N = \frac{c\,\mu_0 + \sum_{i=1}^{N} X_i}{c + N} = \frac{(c + N - 1)\,\hat{\mu}_{N-1} + X_N}{c + N}, \qquad (35)$$

where \{X_1, X_2, \ldots, X_N\} denotes the training sample vectors and c corresponds to the reliability of the prior mean vector \mu_0 of the speaker-independent distribution (Nilsson, 1966). We set c to 15 (Tsurumi and Nakagawa, 1994). N(\hat{\mu}_N, \hat{\Sigma}_N) denotes the estimated Gaussian model adapted by the training samples.

4.2. Speaker identification procedure

Fig. 3. Speaker identification by combining speaker-specific GMM with speaker-adapted syllable-based HMMs.

Fig. 3 shows the procedure of our speaker identification system. In this system, the input speech is analyzed and transformed into a feature vector sequence by the front-end analysis block, and then each test vector is fed in parallel to all the reference speaker models, i.e., the speaker-specific GMMs and the speaker-adapted syllable-based HMMs.
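As a compact illustration of Eqs. (30)-(34) and of the MAP mean update of Eq. (35) above, the following numpy sketch scores a diagonal-covariance instance of the GMM and adapts a Gaussian mean; the model sizes are placeholders, while c = 15 follows the setting quoted above.

```python
import numpy as np

def gmm_log_likelihood(X, weights, means, variances):
    """L(X | lambda) of Eq. (34) for a diagonal-covariance GMM.

    X         : (T, D) feature vectors.
    weights   : (M,) mixture weights, summing to one (Eq. (32)).
    means     : (M, D) mean vectors.
    variances : (M, D) diagonal covariances.
    """
    diff = X[:, None, :] - means[None, :, :]                     # (T, M, D)
    log_comp = (-0.5 * np.sum(diff ** 2 / variances, axis=2)
                - 0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1))  # log b_i(x_t), Eq. (31)
    log_weighted = np.log(weights) + log_comp                    # log c_i b_i(x_t)
    # log-sum-exp over mixtures gives log p(x_t | lambda) (Eq. (30)); sum over frames (Eq. (34)).
    m = log_weighted.max(axis=1, keepdims=True)
    log_frame = m[:, 0] + np.log(np.exp(log_weighted - m).sum(axis=1))
    return log_frame.sum()

def map_mean_update(mu0, X, c=15.0):
    """MAP adaptation of a Gaussian mean vector, Eq. (35)."""
    N = len(X)
    return (c * mu0 + X.sum(axis=0)) / (c + N)
```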

Stated mathematically, the role of the GMMs is to find the most probable speaker S given the observation sequences O, and the role of the HMMs for speaker identification is to find the most probable unit (syllable) sequence U given the reference speaker S and the observation sequences O. When the HMMs are combined with the GMMs, the combination method finds the jointly most probable unit (syllable) sequence U and speaker S given the observation sequences O. Thus, the speaker can be identified accurately because the combination method makes use of the uttered context. Our proposed combination method is formulated as:

$$\{\hat{U}, \hat{S}\} = \arg\max_{U, S} P(U\,|\,S, O)\, P(S\,|\,O) \qquad (36)$$
$$\phantom{\{\hat{U}, \hat{S}\}} = \arg\max_{U, S} P(U, S\,|\,O), \qquad (37)$$

where the first term on the right-hand side of Eq. (36) is calculated by the pdfs (probability density functions) of the HMMs for speaker recognition and the second by the pdfs of the GMMs. In the logarithmic domain, the multiplication becomes an addition. In a real case, combining the GMMs and the HMMs with a weighting coefficient may be a better scheme because of differences in the training/adaptation methods, etc. The ith speaker-dependent GMM produces a likelihood L^{i}_{GMM}(X), i = 1, 2, ..., N, where N is the number of registered speakers. The ith speaker-adapted syllable-based HMMs also produce a likelihood L^{i}_{HMM}(X) by using a continuous syllable recognizer. All these likelihood values are passed to the so-called likelihood decision block, where they are transformed into the new score L^{i}(X):

$$L^{i}(X) = (1 - \alpha)\, L^{i}_{\mathrm{GMM}}(X) + \alpha\, L^{i}_{\mathrm{HMM}}(X), \qquad (38)$$

where α denotes a weighting coefficient.
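A minimal sketch of the decision rule implied by Eq. (38), assuming the per-speaker log-likelihoods have already been produced by the GMMs and the HMM-based syllable recognizer; the numbers in the usage example are hypothetical, and α = 0.7 is the value used later in the experiments.

```python
import numpy as np

def identify_speaker(L_gmm, L_hmm, alpha=0.7):
    """Combine per-speaker GMM and HMM log-likelihoods with Eq. (38) and
    return the index of the identified speaker plus the fused scores."""
    L_gmm = np.asarray(L_gmm, dtype=float)
    L_hmm = np.asarray(L_hmm, dtype=float)
    scores = (1.0 - alpha) * L_gmm + alpha * L_hmm
    return int(np.argmax(scores)), scores

# Hypothetical log-likelihoods for three registered speakers.
speaker, scores = identify_speaker(L_gmm=[-1210.4, -1185.2, -1201.9],
                                   L_hmm=[-980.7, -969.3, -975.0])
print(speaker, scores)
```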

5. Experiments

5.1. Experimental setup

We performed the experiment in a room measuring 3.45 m × 3 m × 2.6 m. The room was divided into 12 (3 × 4) rectangular areas as shown in Fig. 2, where the area size is 60 cm × 60 cm. We measured the transmission characteristics (that is, the mean cepstra of utterances recorded a priori) from the center of each area. The purpose of this experiment was to verify the validity of our proposed position-dependent CMN and combination method, so we evaluated our method using only the speech recorded by one microphone, M1, as shown in Fig. 1. In our related research (Wang et al., 2005a; Wang et al., 2006), multiple-microphone processing based on position-dependent CMN achieved a significant improvement in speech recognition performance over both conventional microphone array processing and single-microphone processing. Therefore, our multiple-microphone processing method should also be effective for speaker recognition, but the evaluation of this method is left for future work.

In our method, the estimated speaker position should be used to determine the area (60 cm × 60 cm) in which the speaker should be. Omologo and Svaizer (1997) found that an average location error of less than 10 cm could be achieved using only four microphones in a room measuring 6 m × 10 m × 3 m, in which the source positions were uniformly distributed over an area of 6 m × 6 m. Wang et al. (2004) also revealed that the speaker position could be estimated with errors of 20-25 cm by the T-shaped four-microphone system shown in Fig. 1, without interpolation between consecutive samples. In Section 5.2, therefore, we assumed that the position area was accurately estimated and evaluated only our proposed speaker recognition methods.

Twenty male speakers uttered 200 isolated words each with a close-talking microphone. The average duration of the utterances is about 0.6 s. For the utterances of each speaker, the first 100 words were used as test data and the rest for the estimation of the cepstral means \bar{C}_{\mathrm{position}} in Eq. (29). The same last 100 training utterances were also used to adapt the syllable-based HMMs and to train the speaker-specific GMM for each speaker. All the utterances were emitted from a loudspeaker located in the center of each area and recorded, both for testing and for the estimation of \bar{C}_{\mathrm{position}}, to simulate utterances spoken at various positions. The sampling frequency was 12 kHz. The frame length was 21.3 ms and the frame shift was 8 ms with a 256-point Hamming window. Then, 114 Japanese speaker-independent syllable-based HMMs (strictly speaking, mora-unit HMMs; Nakagawa et al., 1999) were trained using 27,992 utterances read by 175 male speakers (JNAS corpus). Each continuous density HMM had five states, four with pdfs of output probability. Each pdf consisted of four Gaussians with full-covariance matrices. The feature space for the syllable-based HMMs comprised 10 mel-frequency LPC cepstral coefficients; first- and second-order derivatives of the cepstra plus first and second derivatives of the power component were also included. The feature space for the speaker-specific GMMs with diagonal covariance matrices comprised 10 MFCCs; first-order derivatives of the cepstra plus the first derivative of the power component were also included. The mixture number of the GMMs is 32.

Table 2
Four methods for speaker recognition

Method       Feature        Model (HMM/GMM)
W/o CMN      W/o CMN        W/o CMN
Conv. CMN    Conv. CMN      Conv. CMN
PICMN        PICMN          A priori CMN
PDCMN        PDCMN          A priori CMN

Speaker-independent HMMs were adapted by MAP using 100 isolated words (about 60 s) recorded by a close-talking microphone. All the utterances were emitted from a loudspeaker. The GMMs were trained by the ML criterion using the same data. Three types of adaptation/training data were used depending on the experimental conditions. The three types of models obtained were: (1) the W/o CMN Model, using the features without CMN; (2) the Conv. CMN Model, using the features compensated by conventional CMN; and (3) the A priori CMN Model, using the features compensated by CMN with a cepstral mean measured a priori. The compensation parameter of the features for the A priori CMN Model is defined by:

$$\Delta C = \bar{C}_{\mathrm{close\text{-}loudspeaker}} - \bar{C}_{\mathrm{train}}, \qquad (39)$$

where \bar{C}_{\mathrm{close\text{-}loudspeaker}} is the cepstral mean of utterances recorded by the close-talking microphone.² In our experiments, the four methods shown in Table 2 were compared. These four methods were defined by combining the corresponding features and models. PICMN (position-independent CMN) compensated the features using the compensation parameters averaged over the 12 areas, and the same compensation parameters were used in all areas (position-independent). In PICMN and PDCMN, the compensation parameters \Delta C were independent of the speakers.

² That is, only the difference in recording equipment between the training and test environments was compensated.

5.2. Speaker recognition results

In this section, we assumed that the position area was accurately estimated and evaluated only our proposed speaker recognition methods. It is difficult to identify the correct speaker using only one utterance because the average duration (about 0.6 s) of the utterances is too short. Therefore, the connective likelihood of three utterances (about 1.8 s) was used to identify the speaker. As described in Section 5.1, 100 isolated words were used for the test, so we had 33 test samples for each speaker, that is, a total of 660 samples. In this paper, the standard CMS was applied to a concatenation of the three words.

5.2.1. Speaker recognition by GMM

We evaluated our proposed feature-based compensation technique for speaker identification by GMM.

Fig. 4. Speaker identification by GMM (isolated word). Speaker identification error rates (%): W/o CMN 7.94, Conv. CMN 8.71, CM of area 2 5.34, CM of area 5 4.71, CM of area 12 5.27, PICMN 4.10, PDCMN 2.86.

The proposed method is referred to as PDCMN (position-dependent CMN). The results are shown in Fig. 4, where PDCMN is compared with the baseline (recognition without CMN), conventional CMN, "CM of area 2", "CM of area 5", "CM of area 12" and PICMN (position-independent CMN). Area 2 was the area nearest the microphones, and "CM of area 2" means that a fixed cepstral mean (CM) of this nearest area was used to compensate the input features in all 12 areas. Area 5 was at the center of the 12 areas, and area 12 was the farthest from the microphones. PICMN means the method in which the compensation parameters averaged over the 12 areas were used.

Without CMN, the recognition rate degraded drastically with the distance between the sound source and the microphone. For the feature-based compensation technique, since conventional CMN removed the speaker characteristics and the utterances were too short (about 1.8 s), the result was much worse than that without CMN. The transmission characteristics from different speaker positions are very different, so the cepstral mean (CM) of each area is considerably different. In our experiment, area 5 is at the center of the whole area, whereas areas 2 and 12 are the nearest to and the farthest from the microphones, respectively. We compared the "CM of area 5", "CM of area 2" and "CM of area 12" with the average CM over all areas (that is, the position-independent cepstral mean, PICM). The cepstral distance of the "CM of area 5" from the PICM was much smaller than those of the "CM of area 2" and the "CM of area 12". This means that the variation of the distances of the "CM of area 5" from the CMs of the other areas was much smaller than for the other two. So the recognition error rate of "CM of area 5" (4.71%) was significantly smaller than those of "CM of area 2" (5.34%) and "CM of area 12" (5.27%). The difference between the variation of "CM of area 2" from the CMs of the other areas and that of "CM of area 12" was small, so the performance of "CM of area 2" and that of "CM of area 12" were very similar.


PICMN worked better than the CM of any one fixed area because it averages the compensation parameters over all areas and thus obtains a more robust compensation parameter than that of a fixed area. Furthermore, the proposed PDCMN compensates for the transmission characteristics according to the speaker position, and it worked better than PICMN and the other methods. The proposed PDCMN achieved relative error reduction rates of 64.0% from W/o CMN, 67.2% from Conv. CMN, 46.4% from "CM of area 2", 39.3% from "CM of area 5", 45.7% from "CM of area 12" and 30.2% from PICMN, respectively.

In light of our previous study on speaker localization, we assumed that the position area was accurately estimated. We also report the impact of localization errors on the recognition rate. We simulated this impact under two localization error situations: in one, the true "area 1" was incorrectly estimated as "area 2"; in the other, the true "area 4" was incorrectly estimated as "area 5". The cepstral mean of the incorrectly estimated area was used to compensate the speech uttered from the true area. We compared the speaker identification error rates for speech uttered from area 1. The rates were 1.37% by "CM of area 2", 2.74% by W/o CMN, 4.57% by conventional CMN, 1.86% by PICMN and 0.76% by PDCMN, respectively. The speaker identification error rates for speech uttered from area 4 were also compared. The rates were 2.59% by "CM of area 5", 6.40% by W/o CMN, 7.16% by conventional CMN, 3.20% by PICMN and 1.98% by PDCMN, respectively. The performance with localization errors was obviously worse than the ideal case with position-dependent CMN. However, although the corresponding area was estimated incorrectly, the result obtained with the CM of the incorrectly estimated neighboring area was remarkably better than those of W/o CMN and conventional CMN, and it was even significantly better than that of position-independent CMN. Thus, even if the true area is estimated incorrectly as a neighboring area, the proposed method works much better than the other methods.

NTT database. To verify the robustness of the proposed method for distant speaker recognition, we also evaluated our methods on the NTT (Nippon Telegraph and Telephone Corp.) database, which is a standard database for Japanese speaker identification. The NTT database consists of recordings of 22 male speakers collected in five sessions over 10 months (1990.8, 1990.9, 1990.12, 1991.3 and 1991.6) in a soundproof room. For training the models, the same five sentences, from one session (1990.8), were used for all speakers. Five other sentences uttered at normal speed, the same for each of the speakers, from the other four sessions were used as test data (text-independent speaker identification). The average duration of the sentences was about 4 s. The input speech was sampled at 16 kHz. Twelve MFCCs, their derivatives, and the delta log-power were calculated every 10 ms with a Hamming window of 25 ms.

We also verified the robustness of the proposed method in relation to changes in the environment.

Fig. 5. Speaker identification by GMM (NTT database). Speaker identification error rates (%): W/o CMN 15.63, Conv. CMN 9.43, PICMN 9.92, PDCMN 6.50.

The simulated environment of Section 5.1 was modified to a more realistic environment in this section. The difference between the simulated and the realistic environment may be described as follows: in the simulated environment, there was nothing in the room except the microphones and a sound source; in the realistic environment, the room was set up as a seminar room with a whiteboard beside the left wall, one table and some chairs arranged in the center of the room, one TV and some other tables, etc. The results averaged over all 12 areas are shown in Fig. 5. The proposed PDCMN achieved relative error reduction rates of 58.4% from W/o CMN, 31.1% from conventional CMN, and 34.5% from PICMN, respectively. The proposed PDCMN also worked better than the other methods on the new database in a more realistic environment. In this case, conventional CMN was better than W/o CMN and PICMN because the average duration of the sentences (about 4 s) was sufficient to estimate the cepstral mean accurately. This indicates that the proposed method is robust with respect to the evaluation data and the experimental environment.

5.2.2. Speaker recognition by combination method

Nakagawa et al. (2004, 2006) indicated that the integration of similar methods is less effective than that of different methods; two speaker models using different features may obtain better results than two using the same features. Using position-dependent CMN, the speaker identification error rates are shown in Table 3. Since the combination of LPC-based HMMs and MFCC-based GMMs improved the speaker identification performance more than the other combinations, we used this setting in this paper.

The speaker recognition results based on position-dependent CMN obtained by combining the GMMs with the HMMs are shown in Fig. 6. The combination method improved the speaker recognition result remarkably and produced the best result when the weighting coefficient was α = 0.7.


Table 3
Speaker identification error rate by combination method using position-dependent CMN (three utterances)

Combination of             Error rate (%)
HMM        GMM
MFCC       MFCC            1.01
MFCC       LPC             1.33
LPC        MFCC            0.69
LPC        LPC             1.26
MFCC       –               3.30
LPC        –               3.10
–          MFCC            2.86
–          LPC             4.69

Fig. 6. Speaker recognition error rates based on PDCMN by combining speaker-specific GMMs with speaker-adapted syllable-based HMMs for every three test utterances, as a function of the weighting coefficient α (α = 0 corresponds to the GMM alone and α = 1 to the HMM alone; cf. Eq. (38)).

The essence of the improvement brought by the combination method was not the adjustment of the weighting coefficient α, but its simultaneous maximization of the joint probability over all possible unit sequences and speakers. Even when α was set to 0.5, the relative error reduction rate was more than 70%, and a significant improvement was achieved over the whole α range from 0.1 to 0.9. In the following experiments, the weighting coefficient α was set to 0.7 for the combination method.

The experimental results for all 12 areas and the average result are shown in Table 4.

The proposed combination method improved the speaker recognition performance significantly in all areas. Using position-dependent CMN, the combination method achieved relative error reduction rates of 75.9% from the GMM and 77.7% from the HMM. There are three reasons for this: (1) the combination method maximizes the joint probability over all possible unit (syllable) sequences and speakers, while the GMM-based technique maximizes only over the possible speakers and the HMM-based technique maximizes only over the unit (syllable) sequences given the observation sequences for a given speaker; (2) using the same close-talking data, the speaker-specific GMMs with diagonal covariance matrices were trained by the ML criterion while the speaker-independent HMMs with full-covariance matrices were adapted by MAP, so the combination method may obtain robust output probability density functions; (3) the GMMs cannot express temporal sequences (one state) while the HMMs can express the temporal sequences (that is, syllable sequences) dynamically. Incorporating the combination method with the proposed PDCMN, the proposed PDCMN achieved relative error reduction rates of 66.7% from W/o CMN, 91.4% from conventional CMN, and 28.9% from PICMN.

So far, the weighting coefficient α of the combination method was set to 0.7 for all areas. A variable weighting coefficient depending on the speaker position is evaluated here. The speaker recognition results of the combination method with a variable weight α are shown in Table 5. A relative error reduction rate of 11.6% from the fixed weight of 0.7 was achieved. The results also show that the greater the distance between the sound source and the microphone, the greater the contribution of the GMMs (i.e., the smaller the weighting coefficient α).

5.3. Integrating speaker position estimation with speaker recognition

In this section, speaker position estimation is evaluated and then integrated with our proposed position-dependent CMN.

Table 4
Speaker recognition error rate (%) (three utterances)

          GMM                                    HMM                                    Combination method (α = 0.7)
Area      No CMN  Conv. CMN  PI-CMN  PDCMN       No CMN  Conv. CMN  PI-CMN  PDCMN       No CMN  Conv. CMN  PI-CMN  PDCMN
1         2.90    4.57       2.29    1.83        1.37    4.73       1.07    1.07        0.46    1.07       0.00    0.00
2         2.13    3.96       0.91    0.76        1.37    4.12       0.91    0.74        0.30    1.22       0.30    0.30
3         2.74    5.49       1.83    1.22        1.83    5.49       1.98    1.37        0.46    1.52       0.46    0.15
4         6.40    7.16       3.20    2.90        3.04    10.21      2.59    2.74        1.37    4.27       1.37    0.91
5         6.25    8.38       4.12    2.44        2.59    7.32       2.13    1.68        1.68    3.51       0.61    0.61
6         7.16    9.30       3.66    3.51        3.51    10.82      3.81    3.81        2.13    4.73       0.61    0.61
7         9.91    8.38       4.73    2.13        4.88    13.11      3.05    3.66        1.98    5.18       1.37    0.61
8         5.49    10.21      4.12    4.27        3.20    12.20      3.20    3.20        1.37    5.49       0.91    0.91
9         9.60    9.60       4.27    3.05        5.79    13.87      3.81    2.74        1.83    3.53       0.76    0.46
10        12.80   9.30       4.73    4.42        7.62    14.18      6.10    5.64        3.66    6.40       1.22    1.07
11        10.98   14.63      6.55    3.51        10.98   23.17      7.32    6.25        4.42    8.69       2.13    1.22
12        18.90   13.57      8.84    4.27        9.30    19.51      14.63   4.12        5.18    9.91       6.86    1.37
Average   7.94    8.71       4.10    2.86        4.64    11.15      3.53    3.09        2.07    4.36       0.97    0.69


Table 5
Speaker recognition error rate by optimal α (%)

Area     1     2     3     4     5     6     7     8     9     10    11    12    Average
α        0.7   0.7   0.8   0.6   0.6   0.7   0.7   0.7   0.7   0.6   0.6   0.5   –
Result   0.00  0.30  0.00  0.76  0.46  0.61  0.61  0.91  0.46  0.91  1.07  1.22  0.61

For this experiment, we rerecorded isolated word utterances with four microphones as shown in Fig. 1. The same 200 isolated words uttered by the 20 male speakers were emitted from a loudspeaker located in the center of areas 1, 5 and 9. The first 100 utterances were used as test samples and the others for parameter estimation. The recording conditions were the same as in Section 5.1.

5.3.1. Speaker position estimation results

We performed the speaker position estimation experiments with the method proposed in Section 2. The system segmented the speech with a 2048-point window and an 80-point shift. Twenty frames were used for each utterance to estimate the speaker position; that is to say, the first 3568 points (about 300 ms) were used for TDOA estimation by CSP. Two hundred and fifty interpolations between consecutive samples were used to obtain a more accurate speaker position estimate. To evaluate the speaker position estimation, two measures were defined as follows:

• Relative precision of the estimated position (%):

$$\left(1 - \frac{\sqrt{(x_r - x_e)^2 + (y_r - y_e)^2}}{\sqrt{x_r^2 + y_r^2}}\right) \times 100, \qquad (40)$$

where (x_r, y_r) and (x_e, y_e) are the real position and the estimated position, respectively. Although the z-axis position z_e was also estimated using the four microphones, it was not used to calculate the relative precision of the estimated position, because the position area can be determined from the x-axis and y-axis positions (x_e, y_e) alone.

• Area estimation rate (%):

$$\frac{\sum I(x_e, y_e)}{\#\,\mathrm{utterances}} \times 100, \qquad (41)$$

$$I(x_e, y_e) = \begin{cases} 1 & \text{if the estimated speaker position } (x_e, y_e) \text{ is within the correct area,} \\ 0 & \text{otherwise,} \end{cases}$$

where I(x_e, y_e) denotes an indicator.
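The two evaluation measures of Eqs. (40) and (41) reduce to a few lines of Python; the positions in the usage example are made up.

```python
import numpy as np

def relative_precision(real_xy, est_xy):
    """Relative precision of the estimated position (%), Eq. (40)."""
    real_xy = np.asarray(real_xy, dtype=float)
    est_xy = np.asarray(est_xy, dtype=float)
    return (1.0 - np.linalg.norm(real_xy - est_xy) / np.linalg.norm(real_xy)) * 100.0

def area_estimation_rate(estimated_areas, correct_area):
    """Area estimation rate (%), Eq. (41): fraction of utterances whose
    estimated position falls within the correct area."""
    hits = sum(1 for a in estimated_areas if a == correct_area)
    return 100.0 * hits / len(estimated_areas)

print(relative_precision(real_xy=(110, 0), est_xy=(104, 7)))   # hypothetical positions in cm
print(area_estimation_rate([5, 5, 4, 5], correct_area=5))
```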


Table 6
Speaker position estimation result

Area    Relative precision of estimated position (%)    Area estimation rate (%)    Real position (x, y, z) (cm)
1       94.3                                             100                         (50, 60, 0)
5       92.1                                             100                         (110, 0, 0)
9       92.0                                             99.4                        (170, 60, 0)

The speaker position estimation results are shown in Table 6. Good speaker position estimation performance was obtained: more than 92% relative precision of the estimated position was achieved. At the center of the 12 areas (area 5), the average estimation error was less than 10 cm, while the distance between the sound source and the microphone was 110 cm. In our PDCMN method, the estimated speaker position is used only to determine the area (60 cm × 60 cm) in which the speaker should be. Thus, an area estimation rate of almost 100% was achieved, and even when the estimated area was incorrect, one of the neighboring areas was always selected in this experiment.

5.3.2. Position-dependent CMN by automatic speaker position estimation

The connective likelihood of three utterances (about 1.8 s) was used to identify the speaker. To simulate the three utterances as one sentence, the first 300 ms of speech of the first utterance was used to estimate the common speaker position of the three utterances. The system then adopted the compensation parameter corresponding to the automatically estimated position. The speaker recognition results shown in Table 7 are slightly different from those in Table 4 because identical recording conditions could not be guaranteed at the different recording times (June 2004 and December 2006); e.g., the loudspeaker volume and the quality of the microphones and recording facilities may have differed. "Ideal PDCMN" means that the compensation parameter for the correct area is adopted (that is, the ideal condition).

Table 7
Comparison of speaker recognition error rate of ideal PDCMN with realistic PDCMN (%)

          GMM                                       HMM                                       Combination method
Area      PI-CMN  Ideal PDCMN  Realistic PDCMN      PI-CMN  Ideal PDCMN  Realistic PDCMN      PI-CMN  Ideal PDCMN  Realistic PDCMN
1         2.29    1.07         1.07                 1.37    1.07         1.07                 0.15    0.00         0.00
5         4.42    2.74         2.74                 2.44    2.29         2.29                 0.91    0.30         0.30
9         3.96    2.90         2.90                 5.18    3.81         3.81                 1.22    0.76         0.76
Average   3.56    2.34         2.34                 3.00    2.39         2.39                 0.76    0.35         0.35


This method assumes that the position area is accurately estimated. "Realistic PDCMN" means that the compensation parameter for the estimated position area is used (that is, the realistic condition). Ideal PDCMN improved the speaker recognition performance remarkably compared with PICMN. Realistic PDCMN obtained the same recognition performance as ideal PDCMN because the estimated position areas were almost always correct, and the errors of position estimation affected the performance little, since in all cases with area estimation errors the compensation parameters of neighboring areas, which approximate the correct one, were used.

6. Conclusion and future work

In a distant environment, speaker recognition performance may drastically degrade because of the mismatch between the training and testing environments. We addressed this problem with a feature-based compensation technique and proposed a robust distant speaker recognition method based on position-dependent CMN. Speaker recognition experiments with twenty male speakers were conducted. The speaker identification results by GMM showed that the proposed position-dependent CMN achieved relative error reduction rates of 64.0% from W/o CMN and 30.2% from position-independent CMN. We also integrated the position-dependent CMN into the combined use of speaker-specific GMMs and speaker-adapted syllable-based HMMs. The combination method improved the speaker recognition performance more than the individual use of either speaker-specific GMMs or speaker-adapted syllable-based HMMs. In the combination method, the position-dependent CMN also improved speaker recognition performance significantly more than the other feature-based methods. Speaker position estimation was also evaluated and then integrated with our proposed position-dependent CMN. Good speaker position estimation performance was achieved, and the same speaker recognition performance as with ideal PDCMN was achieved when speaker position estimation was integrated with our proposed PDCMN. In our future work, we will try to track a moving speaker and expand our speaker recognition method to accommodate adverse acoustic environments.

References

Auckenthaler, R., Carey, M., Lloyd-Thomas, H., 2000. Score normalization for text-independent speaker verification systems. Dig. Signal Process. 10, 42–54.
Bard, Y., 1974. Nonlinear Parameter Estimation. Academic, New York.
Barras, C., Gauvain, J., 2003. Feature and score normalization for speaker verification of cellular data. Proc. ICASSP-2003 2, 49–52.
Brandstein, M., 1995. A framework for speech source localization using sensor arrays. Ph.D. Thesis. Brown University, Providence, RI.
DiBiase, J., Silverman, H., Brandstein, M., 2001. Robust localization in reverberant rooms. In: Microphone Arrays – Signal Processing Techniques and Applications. Springer-Verlag, Berlin, Germany, pp. 157–180 (Chapter 8).

Doclo, S., Moonen, M., 2003. Robust adaptive time delay estimation for speaker localisation in noisy and reverberant acoustic environments. EURASIP J. Appl. Signal Process. 2003 (11), 1110–1124.
Foy, W., 1976. Position-location solutions by Taylor-series estimation. IEEE Trans. Aerosp. Electron. Syst. AES-12, 187–194.
Furui, S., Itakura, F., Saito, S., 1972. Talker recognition by long-time averaged speech spectrum. Electron. Commun. Jpn. 55-A, 54–61.
Furui, S., 1981. Cepstral analysis technique for automatic speaker verification. IEEE Trans. Acoust. Speech Signal Process. 29 (2), 254–272.
Gauvain, J., Lee, C., 1994. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Speech Audio Process. 2 (2), 291–298.
Huang, Y., Benesty, J., Elko, G., Mersereau, R., 2001. Real-time passive source localization: a practical linear-correction least-squares approach. IEEE Trans. Speech Audio Process. 9, 943–956.
Hughes, T., Kim, H., DiBiase, J., Silverman, H., 1999. Performance of an HMM speech recognizer using a real-time tracking microphone array as input. IEEE Trans. Speech Audio Process. 7 (3), 346–349.
Juang, B., Soong, F., 2001. Hands-free telecommunications. Proc. Workshop Hands-Free Speech Commun. (HSC-2001), 5–10.
Kitaoka, N., Akahori, I., Nakagawa, S., 2001. Speech recognition under noisy environments using spectral subtraction with smoothing of time direction and real-time cepstral mean normalization. Proc. Workshop Hands-Free Speech Commun. (HSC-2001), 159–162.
Knapp, C., Carter, G., 1976. The generalized correlation method for estimation of time delay. IEEE Trans. Acoust. Speech Signal Process. 24 (4), 320–327.
Liu, C., Wang, H., Soong, F., Huang, C., 1995. An orthogonal polynomial representation of speech signals and its probabilistic model for text independent speaker verification. Proc. ICASSP-1995 I, 345–348.
Liu, F., Stern, R., Huang, X., Acero, A., 1993. Efficient cepstral normalization for robust speech recognition. Proc. ARPA Speech Nat. Lang. Workshop, 69–74.
Markel, J.D., Oshika, B.T., Gray, A.H., 1977. Long-term feature averaging for speaker recognition. IEEE Trans. Acoust. Speech Signal Process. ASSP-25 (4), 330–337.
Matsui, T., Furui, S., 1995. Concatenated phoneme models for text-variable speaker recognition. Proc. ICASSP-1993 II, 391–394.
Nakagawa, S., Hanai, K., Yamamoto, K., Minematsu, N., 1999. Comparison of syllable-based HMMs and triphone-based HMMs in Japanese speech recognition. Proc. Int. Workshop Autom. Speech Recognit. Understanding, 393–396.
Nakagawa, S., Zhang, W., Takahashi, M., 2004. Text-independent speaker recognition by combining speaker-specific GMM with speaker-adapted syllable-based HMM. Proc. ICASSP-2004 1, 81–84.
Nakagawa, S., Zhang, W., Takahashi, M., 2006. Text-independent/text-prompted speaker recognition by combining speaker-specific GMM with speaker-adapted syllable-based HMM. IEICE Trans. Inform. Syst. 89-D (3), 1058–1064.
Nilsson, N., 1966. Learning Machines. McGraw-Hill.
Omologo, M., Svaizer, P., 1996. Acoustic event location in noisy and reverberant environment using CSP analysis. Proc. ICASSP-1996, 921–924.
Omologo, M., Svaizer, P., 1997. Use of the crosspower-spectrum phase in acoustic event location. IEEE Trans. Speech Audio Process. 5, 288–292.
Pelecanos, J., Sridharan, S., 2001. Feature warping for robust speaker verification. Proc. ISCA Workshop Speaker Recognit.-2001: A Speaker Odyssey, 213–218.
Pujol, P., Macho, D., Nadeu, C., 2006. On real-time mean-and-variance normalization of speech recognition features. In: Proc. ICASSP-2006, pp. 773–776.
Raykar, V., Yegnanarayana, B., Prasanna, S., Duraiswami, R., 2005. Speaker localization using excitation source information in speech. IEEE Trans. Speech Audio Process. 13, 751–761.
Reynolds, D.A., Quatieri, T.F., Dunn, R., 2000. Speaker verification using adapted Gaussian mixture models. Dig. Signal Process. 10 (1–3), 19–41.
Reynolds, D.A., Rose, R.C., 1995. Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Trans. Speech Audio Process. 3 (1), 72–83.
Savic, M., Gupta, S., 1990. Variable parameter speaker verification system based on hidden Markov modeling. Proc. ICASSP-1990, 281–284.
Seltzer, M., Raj, B., Stern, R., 2004. Likelihood-maximizing beamforming for robust hands-free speech recognition. IEEE Trans. Speech Audio Process. 12 (5), 489–498.
Tseng, B., Soong, F., Rosenberg, A., 1992. Continuous probabilistic acoustic map for speaker recognition. Proc. ICASSP-1992 II, 161–164.
Tsurumi, Y., Nakagawa, S., 1994. An unsupervised speaker adaptation method for continuous parameter HMM by maximum a posteriori probability estimation. Proc. ICSLP-1994, 431–434.
Viikki, A., Laurila, K., 1998. Cepstral domain segmental feature vector normalization for noise robust speech recognition. Speech Commun. 25, 133–147.
Wang, L., Kitaoka, N., Nakagawa, S., 2004. Robust distant speech recognition based on position dependent CMN. Proc. ICSLP-2004, 2049–2052.
Wang, L., Kitaoka, N., Nakagawa, S., 2005a. Robust distant speech recognition based on position dependent CMN using a novel multiple microphone processing technique. Proc. EUROSPEECH-2005, 2661–2664.
Wang, L., Kitaoka, N., Nakagawa, S., 2005b. Robust distant speaker recognition based on position dependent cepstral mean normalization. Proc. EUROSPEECH-2005, 1977–1980.
Wang, L., Kitaoka, N., Nakagawa, S., 2006. Robust distant speech recognition by combining multiple microphone-array processing with position-dependent CMN. EURASIP J. Appl. Signal Process., Vol. 2006, Article ID 95491, 1–11.
Xiang, B., Chaudhari, U., Navratil, J., Ramaswamy, G., Gopinath, R., 2002. Short-time Gaussianization for robust speaker verification. Proc. ICASSP-2002 1, 681–684.
Young, S., Kershaw, D., Odell, J., Ollason, D., Valtchev, V., Woodland, P., 2000. The HTK Book. Cambridge University (for HTK version 3.0).