2002 NIST SPEAKER RECOGNITION EVALUATION WORKSHOP MAY 19-22, 2002, VIENNA, VIRGINIA, USA

SPEAKER VERIFICATION SYSTEM BASED ON PROBABILISTIC NEURAL NETWORKS

Todor Ganchev, Nikos Fakotakis, George Kokkinakis
Wire Communication Laboratory, University of Patras, Patra-Rio 26500, Greece
{tganchev, fakotaki, gkokkin}@wcl.ee.upatras.gr

1. INTRODUCTION

Because of their good generalization properties, Probabilistic Neural Networks (PNNs) were chosen as the classifiers for the Speaker Verification system presented here. Their design is straightforward, does not require iterative training, and they can be built in only a fraction of the time needed to train back-propagation ANNs [1]. PNNs do need many more neurons than back-propagation ANNs, which leads to increased complexity and higher computational and memory requirements. Nevertheless, the Speaker Verification system presented here runs in real time on common personal computers.

2. SYSTEM CONCEPT

The building blocks of our Speaker Verification system are described briefly in the following paragraphs.

2.1. Feature extraction

Saturation by level is a common phenomenon in telephone speech signals. To reduce the spectral distortions it causes, band-pass filtering of the speech is performed as the first step of the feature extraction process. A fifth-order Butterworth filter with a pass-band from 80 Hz to 3800 Hz is used for both training and testing. Although typical telephone channels have a bandwidth from 300 to 3400 Hz, it was found that filtering with a lower cut-off frequency above 80 Hz is not appropriate, because it degrades the pitch extractor and hence the performance of the speaker verification system. After the band-pass filtering, the speech signal is processed in 40 ms frames, overlapped by 10 ms. A voiced/unvoiced decision is obtained using pitch estimation based on the "modified autocorrelation method with clipping (AUTOC)" [2]. The feature vector consists of 33 filter-bank MFCCs, computed only from the voiced speech frames. Pre-emphasis with factor α = 0.97 is also applied.

2.2. Construction of the codebooks

Due to the nature of PNNs, their complexity strongly depends on the number and dimensionality of the training vectors. A Vector Quantization technique is used to reduce the amount of training data [3]. It was found experimentally that a codebook of 128 vectors is large enough to maintain a good representation of the speaker's peculiarities. With a 256-vector codebook the performance of the system improves slightly, but the memory and computational requirements increase considerably. Therefore, a codebook of 128 vectors was chosen as a trade-off. For the background speakers, however, a codebook of 256 vectors is necessary. Both the speaker and the background codebooks are constructed with the well-known k-means clustering algorithm [4]. The gender-dependent universal background codebook (UBgCB) was built using all the available speakers from the NIST Development database.

2.3. The PNN and the PNN classifier

A two-layer PNN is used: the input Radial Basis layer (1) is followed by the Competitive layer (2).

    a1_i = exp( -( || iIW1,1 - p || · b1_i )^2 )        (1)

    A = compet( LW2,1 · a1 )                            (2)

where IW1,1 and LW2,1 are the weights of the first and second layers respectively, the index i denotes the i-th element of a1 or b1 and the i-th row of IW1,1, and p is the input feature vector. The bias b1 of the Radial Basis layer is defined as:

    b1 = -log(0.5) / SPREAD                             (3)

By || · || the Euclidean distance is denoted, while A is the binary output of the second layer of the PNN. By compet, the transfer function of the Competitive layer is denoted, which employs the winner-take-all rule: the largest weighted sum of probabilities from the first layer is granted a '1', while all the others receive zeroes.
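Equations (1)-(3) can be illustrated with a minimal NumPy sketch of the forward pass. The variable names mirror the paper's notation; the matrix shapes and the SPREAD value follow the text, but the function itself is an illustration, not the authors' implementation:

```python
import numpy as np

SPREAD = 0.35  # sensitivity factor from the paper

def pnn_forward(IW, LW, p, spread=SPREAD):
    """Two-layer PNN forward pass, following eqs. (1)-(3).

    IW : (Q, R) array - first-layer weights, one codebook vector per row
    LW : (K, Q) array - second-layer weights assigning pattern units to classes
    p  : (R,)   array - input feature vector
    Returns a binary one-hot vector A with a '1' for the winning class.
    """
    b1 = -np.log(0.5) / spread                 # Radial Basis bias, eq. (3)
    dist = np.linalg.norm(IW - p, axis=1)      # || iIW1,1 - p || per row
    a1 = np.exp(-(dist * b1) ** 2)             # Radial Basis layer, eq. (1)
    n2 = LW @ a1                               # weighted sums per class
    A = (n2 == n2.max()).astype(int)           # compet: winner-take-all, eq. (2)
    return A
```

With a speaker codebook of 128 vectors and a background codebook of 256 vectors stacked into IW, LW would be a 2 x 384 one-hot membership matrix, and A indicates whether the input frame is closer to the speaker or the background class.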

In the process of PNN design, the sensitivity factor SPREAD was set to 0.35; information about the neighbors of each training vector is therefore also exploited. For each enrolled speaker, a personal PNN is designed to distinguish him/her from an unlimited number of other speakers. Because both the speaker and the background models are represented not by a single feature vector but by codebooks, the problem is reduced to classifying one input vector into one of these two classes. The closeness of the input feature parameters to the speaker's training data and to the background speakers' model is estimated by computing the distance to each. The modular structure was selected for its inherent flexibility and ease of updating. When retraining of any enrolled speaker is necessary, only his/her PNN is replaced by a new one, without affecting the others. Enrolling a new speaker is performed simply by designing a personal PNN for him/her and adding it to the PNN classifier. The PNNs of already enrolled speakers are not affected, so new speakers can be added at any point during the system's exploitation. In the test phase, the PNN classifier decides whether the input sentence belongs to the claimed speaker. To do this, the claimed speaker's PNN is tested with the feature vectors extracted from the input speech. The outputs of the PNN are used to compute the probability that the input sentence belongs to the claimed speaker. A speaker-independent threshold is then applied and the final decision is made: when the score is above the threshold, the claimant is accepted; otherwise the sentence is considered to belong to an impostor.
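The paper does not spell out how the per-frame PNN outputs are turned into a sentence-level score. One plausible minimal sketch, under the assumption that the score is the fraction of frames won by the claimed speaker's class, is the following; the function name and the threshold value are illustrative, not from the paper:

```python
import numpy as np

def verify(frame_decisions, threshold=0.5):
    """Assumed sentence-level decision rule (illustrative).

    frame_decisions : per-frame winning classes from the claimed speaker's
                      PNN (0 = claimed speaker, 1 = background).
    Returns True (accept) when the fraction of frames won by the claimed
    speaker exceeds the speaker-independent threshold.
    """
    decisions = np.asarray(frame_decisions)
    score = np.mean(decisions == 0)   # fraction of frames won by the speaker
    return bool(score > threshold)
```

For example, if 3 of 4 voiced frames are classified as the claimed speaker, the score is 0.75 and the claimant is accepted at a threshold of 0.5.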

3. MEMORY REQUIREMENTS AND PROCESSING TIME

A common IBM-PC compatible personal computer was used to perform the training and the testing of our system. The configuration includes a single Pentium 4 CPU at 1.6 GHz and 512 MB RAM. The following tables show the time spent in the training and test phases. The total time spent to build all speaker models for the experiments with males and females is shown in Table 1:

    Speaker models    Time spent
    139 males         5¾ h
    191 females       9 h

Table 1. Time spent to build the speaker models.

The gender-dependent background models were built using the NIST Development data. Table 2 shows the time spent.

    Background model          Time spent
    Built from 74 males       2¾ h
    Built from 100 females    4¼ h

Table 2. Time spent to build the background models.

The total time spent constructing the PNN-based Speaker Verification system is shown in Table 3:

    SRE system    Time spent
    Males         8½ h
    Females       13¼ h

Table 3. Total time spent to construct the Speaker Verification system.

In the test phase, all the Detect{1,2,3}.ndx files were processed. Table 4 shows the time spent to process the test trials.

    Test file      Time spent
    Detect1.ndx    11 h
    Detect2.ndx    20 h
    Detect3.ndx    1¾ h

Table 4. Test trials processing time.

4. REFERENCES

[1] D. F. Specht, "Probabilistic Neural Networks", Neural Networks, Vol. 3, pp. 109-118, 1990.
[2] L. R. Rabiner, M. J. Cheng, A. E. Rosenberg and C. A. McGonegal, "A Comparative Performance Study of Several Pitch Detection Algorithms", IEEE Transactions on ASSP, Vol. ASSP-24, No. 5, October 1976.
[3] R. M. Gray, "Vector Quantization", IEEE Acoustics, Speech and Signal Processing Magazine, pp. 4-29, April 1984.
[4] J. A. Hartigan and M. A. Wong, "A k-means clustering algorithm", Applied Statistics, No. 28, pp. 100-108, 1979.