EE559: Pattern Recognition Speaker Identification Performance Evaluation

Vivek Kumar Rangarajan Sridhar May 3, 2004

Abstract

Speaker Identification has been a subject of active research for many years and has many potential applications where the privacy of information is a concern. In this project, a speaker identification system that automatically classifies different speakers based on features extracted from speech waveforms has been implemented. The features used are the power in different frequency subbands of a dyadic filterbank, the LPC coefficients, and power spectral density coefficients. Each of these high-dimensional feature sets is reduced to one dimension, and the resulting 3-D vector is used as the feature for classification. Speaker identification has been performed for three cases: two different male speakers, a male and a female speaker, and three speakers. The performance of the system is summarized and compared using various parametric and non-parametric techniques.

Introduction

Speaker Identification has been an active research topic for many years, and a variety of methods have been proposed for it. Some are based on auditory and acoustic models, while others are based on simple measurements from the speaker's voice. In this project, both parametric and non-parametric classification techniques are used to achieve speaker identification. The dataset used is the Otago Speech Corpus [3], which has raw speech data of 21 speakers (male and female) uttering the digits 0-9 (obtained from a reference in the UCI Machine Learning Database). I have used 3 speakers from this database, 2 male speakers and 1 female speaker, and set up 2-class problems of male vs. male and male vs. female, as well as a 3-speaker case. The following are the features used for the classification:

• Power Spectral Density (narrowband)
• Linear Prediction Coefficients (LPC)
• Energies in various subbands of a dyadic filter bank

A more generic approach is used for the classification rather than a rigorous acoustic model accounting for formant frequencies and pitch factors. The rationale behind the selection of these features is explained in the following sections.

Speaker Identification

Preprocessing of speech data

Although the sound files are claimed to be recorded under the same conditions, their magnitudes need to be normalized. Normalization of the speech data has been done based on the maximum amplitude. Normalization by total energy is another approach that was tried, but it yielded similar results.
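The appendix calls a helper function Mypreprocess for this step. Its body is not included in the listing; a peak-amplitude normalization consistent with the description above might look like the following minimal sketch (an assumed implementation, not the author's actual code):

function y = Mypreprocess(y)
% Normalize a speech waveform by its maximum absolute amplitude
% (assumed implementation; the original helper is not listed)
y = y / max(abs(y));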

Feature Selection

Feature selection in this project has been performed so that each selected feature set carries information not contained in the others. The features used are ones that can be easily calculated with simple routines in MATLAB:

• Temporal information of the time series data (LPC)
• Energies in different subbands of the signal (Energy)
• Information in the frequency domain (PSD)

The first feature set is the LPC coefficients. LPC provides a robust, reliable, and accurate method for estimating the parameters of a linear time-varying system and has been used in system estimation and identification [1]. LPC coefficients can in turn be used in a wide range of speech applications (synthesis, identification, formant analysis, etc.). The second set of features is obtained by measuring the energies in different frequency subbands. Filterbanks are used to separate the signal into frequency subbands; paraunitary filterbanks in particular are power complementary and preserve power. I have used a four-level signal decomposition for this feature. A diagram of the filterbank used in the project is shown in Figure 1, followed by a sketch of one decomposition level.

Figure 1: Filterbank structure. The speech signal is split into frequency subbands by repeated lowpass/highpass (LPF/HPF) filtering, each stage followed by downsampling by 2, and the energy in each subband is measured.
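For concreteness, one level of the decomposition in Figure 1 can be sketched as follows, using the same two-tap filter pair that appears in the appendix code. The energy line is an assumption, since the mypower helper used in the appendix is not listed:

% One level of the dyadic filterbank (sketch); y holds the
% normalized speech waveform
LPf = [1, 1]/sqrt(2);                    % lowpass half of the paraunitary pair
HPf = [1, -1]/sqrt(2);                   % power-complementary highpass
lp = resample(filter(LPf, 1, y), 1, 2);  % lowpass branch, downsampled by 2
hp = resample(filter(HPf, 1, y), 1, 2);  % highpass branch, downsampled by 2
E_hp = mean(hp.^2);                      % subband energy (assumed definition of mypower)

Subsequent levels apply the same filter pair to the lowpass output lp, yielding the four highpass subband energies and the final lowpass energy that make up the five-element feature.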

The third set of features, which captures local relations between samples, is obtained from a 16-point power spectral density: the FFT of 16 consecutive samples is computed, the squared magnitudes of the resulting FFTs are summed, and the result is averaged (as sketched below). Each of these 3 feature sets is reduced to one dimension using Fisher's Linear Discriminant, and the resulting three features are concatenated to form a 3-D feature vector which is used for classification. It was verified that these three features represent different information and are not redundant: the 3-D features after this processing are plotted in Fig. 3, and if the features were redundant they would align on a surface or a curve. Since they appear well separated, they do represent different information. Finally, the concatenated feature vectors are used in a training phase and a testing phase with various parametric and non-parametric classifiers. Speakers are identified for the male1 vs. male2, male vs. female, and 3 speaker cases.
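A minimal sketch of the 16-point averaged PSD just described is given below. It is an assumed reimplementation of the verbal description; the appendix itself relies on MATLAB's psd routine:

% 16-point averaged PSD (sketch); y holds the normalized waveform
N = 16;                            % samples per frame
nFrames = floor(length(y)/N);      % number of complete frames
P = zeros(N,1);
for k = 1:nFrames
    frame = y((k-1)*N+1 : k*N);    % 16 consecutive samples
    P = P + abs(fft(frame)).^2;    % squared-magnitude FFT, accumulated
end
P = P/nFrames;                     % average over all frames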

Feature Reduction

Feature reduction can be performed by two methods: Principal Component Analysis (PCA) [2] or the Fisher Linear Discriminant [2]. PCA finds the components that have the maximum power, but in this project those components are not necessarily useful for discriminating between the classes. So the Fisher Linear Discriminant method is applied to reduce each feature set to one dimension. The resulting feature vectors are concatenated to form a 3-D feature vector, which in turn is used for classification.
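A stand-alone sketch of the two-class Fisher projection is given below for illustration; the project itself uses the fisherm mapping of PRTOOLS [4] (see the appendix), and the variable names here are assumptions:

% Two-class Fisher Linear Discriminant: reduce d-dimensional features
% to one dimension (illustrative sketch)
% X1: n1-by-d samples of class 1; X2: n2-by-d samples of class 2
m1 = mean(X1)';  m2 = mean(X2)';     % class means (d-by-1)
S1 = (size(X1,1)-1)*cov(X1);         % within-class scatter, class 1
S2 = (size(X2,1)-1)*cov(X2);         % within-class scatter, class 2
w  = (S1 + S2) \ (m1 - m2);          % Fisher direction w = Sw^-1 (m1 - m2)
z1 = X1*w;  z2 = X2*w;               % one-dimensional projected features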


Figure 2: Each feature set (powers in frequency subbands, LPC coefficients, short-window PSD) is reduced to one dimension using Fisher's Linear Discriminant and the resulting features are concatenated to form a 3-D vector

A plot of the concatenated feature vector for the male1 vs. male2 speaker identification is shown below in Fig. 3. The features cluster well, and the classification results are tabulated in the next section.

Figure 3: Concatenated feature vector for the male1 vs. male2 speaker identification

Fig. 4 shows a similar plot for the male vs. female speakers. The separation of the features can be clearly seen, so any of the classification methods used should yield good results.

Classification

Training

I have used 20 labeled prototypes for the training phase. The leave-one-out algorithm was used for training and for the evaluation of the features. Performance evaluation was done using the cross-validation technique (available in PRTOOLS [4] as crossval). Finally, various parametric and non-parametric techniques were applied to evaluate the performance of different classifiers. A complete summary of classifier performance for both cases is given in the following section. Performance was evaluated in terms of stability and classification error. The instability is defined as the average fraction of classification differences between the classifier based on the entire training set and a disturbed version (e.g. a leave-one-out version of the classifier).
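Written as a formula (one way of expressing the verbal definition above, with $C(x_i)$ the decision of the classifier trained on all $n$ prototypes and $C_{-i}(x_i)$ the decision of the disturbed version):

$$\text{instability} = \frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\left[\,C(x_i) \neq C_{-i}(x_i)\,\right]$$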

Figure 4: Concatenated feature vector for the male vs. female speaker identification

Parametric & Non-parametric Techniques for Classification

Parametric techniques assume that the functional form of the class-conditional density p(x|S_i) is known. A classifier that uses this kind of assumption is the Bayes minimum-error classifier, assuming normal densities and estimating the means and unconstrained covariance matrices from the data (qdc in PRTOOLS [4]). A distribution-free classifier is Fisher's linear least-squares classifier (fisherc in PRTOOLS [4]). Non-parametric techniques for density estimation and classification include the Parzen window classifier (parzenc in PRTOOLS) and the k-nearest neighbor classifier (knnc in PRTOOLS). The basic difference between the two families is that the non-parametric techniques estimate the density functions themselves rather than assuming a parametric form.

The performance results for the male1 vs. male2 speakers are shown below. As expected from the clear separation of the clusters, the classifiers are quite accurate. The same is the case for the male vs. female speakers. The error would increase when more speakers are identified and the problem is set up as a multi-class problem.

Table 1: Classifier Results for male1 vs. male2 speaker

    Classifier           Error (%)   Instability (%)
    qdc                  12.5        7.5
    fisherc              7.5         0
    Parzenc              7.5         0
    knnc (2 neighbors)   12.5        5

The results for the test samples are shown in Tables 3 and 4 below. The classification for this instance was done using the Parzen window classifier.
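Since the test-sample classification above relies on the Parzen window classifier, its standard density estimate is worth recalling (the textbook form from [2], not a detail specific to this report): for $n$ training samples $x_1,\dots,x_n$ in $d$ dimensions, window function $\varphi$ and width $h$,

$$\hat{p}(x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h^{d}}\,\varphi\!\left(\frac{x-x_i}{h}\right)$$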

Table 2: Classifier Results for male vs. female speaker

    Classifier           Error (%)   Instability (%)
    qdc                  0           0
    fisherc              0           0
    Parzenc              0           0
    knnc (2 neighbors)   0           0

Table 3: Classifier Results for male1 vs. male2 speaker test samples

    Test Sample   Class       Test Sample   Class
    1             2           11            2
    2             1           12            2
    3             1           13            2
    4             1           14            2
    5             1           15            2
    6             2           16            2
    7             2           17            2
    8             2           18            2
    9             2           19            1
    10            2           20            2

Table 4: Classifier Results for male vs. female speaker test samples

    Test Sample   Class       Test Sample   Class
    1             1           11            2
    2             1           12            1
    3             1           13            1
    4             2           14            2
    5             1           15            2
    6             1           16            2
    7             2           17            2
    8             1           18            2
    9             1           19            2
    10            2           20            2

3 speaker identification

The classification problem becomes much more difficult with three speakers, and the classification error increases considerably. The plot of the 3-D feature vector for the male1, male2 and female speakers is shown below.


Figure 5: Concatenated feature vector for the 3 speaker identification

Classifier Results

Table 5: Classifier Results for 3 speaker identification

    Classifier           Error (%)   Instability (%)
    qdc                  11.67       1.67
    fisherc              13.33       3.33
    Parzenc              13.33       6.67
    knnc (2 neighbors)   11.67       1.67

Table 6: Classifier Results for 3 speaker identification test samples

    Test Sample   Class    Test Sample   Class    Test Sample   Class
    1             2        11            2        21            3
    2             1        12            2        22            3
    3             1        13            2        23            3
    4             1        14            2        24            3
    5             1        15            2        25            1
    6             2        16            2        26            3
    7             2        17            2        27            2
    8             2        18            2        28            3
    9             2        19            2        29            3
    10            2        20            2        30            2

Interpretation of results

The male1 vs. male2 speaker identification results are summarized in the previous section. The clusters formed after deriving the 3-D feature vector have some overlap; hence there are errors in the classification. For the male vs. female speaker identification, however, the clusters are very well separated and the classification is accurate.

The problem becomes more difficult when it is scaled up to include more speakers. The 3 speaker identification problem still gives reasonable results with the scheme implemented; the results are tabulated in the previous section.

Conclusion

Speaker identification has been implemented as a 2-class and a 3-class classification problem, and the performance has been evaluated using various parametric and non-parametric techniques. The feature sets used are the LP coefficients, the energies in different subbands, and the narrowband power spectral density. Each feature set is reduced to one dimension using the Fisher linear discriminant, and the resulting 3-D feature vector is used for classification. The resulting feature vectors are well separated and provide good classification results, as summarized above. Future work might include more speakers in the dataset for identification. Also, the use of acoustic and linguistic features might yield better classification results when the problem is scaled up.


Bibliography

[1] L. R. Rabiner and R. W. Schafer. Digital Processing of Speech Signals. Pearson Education, 1993.

[2] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley & Sons, 2001.

[3] S. Sinclair and C. Watson. The development of the Otago speech database. In Proceedings of ANNES '95. ANNES, 1995.

[4] PRTOOLS toolbox for MATLAB. http://www.ph.tn.tudelft.nl/~bob/PRTOOLS.html


Appendix - MATLAB CODE

% EE559 Project: Speaker classification system
% Submitted by: Vivek Kumar Rangarajan Sridhar
% Student ID 886805854
% Author: Vivek Kumar Rangarajan Sridhar
% Course EE559
% Speaker Classification System

clear all
close all

NumF = 16;
NUMFFTPTS = 2*(NumF-1);
NumProto = 20;
Numtest = 20;
mydatamat = zeros(NumProto*2,NumF);
mydatawavlet = zeros(NumProto*2,5);
mydatalpc = zeros(NumProto*2,5);
mytestmat = zeros(Numtest,NumF);
mytestwavlet = zeros(Numtest,5);
mytestlpc = zeros(Numtest,5);
LPf = [1,1]/sqrt(2);
HPf = [1,-1]/sqrt(2);

% Location where S1 prototypes are located
path = 'C:\MATLAB6p1\work\dataset\S1\S1_';
extn = '.wav';
figure
for index = 1:NumProto
    if (index < 10)
        complete_path = [path '0' num2str(index) extn];
    else
        complete_path = [path num2str(index) extn];
    end
    y = wavread(complete_path);
    % Preprocessing is normalization by maximum value
    y = Mypreprocess(y);
    % Short-time window PSD and analysis filter bank decomposition
    y1 = psd(y,NUMFFTPTS);
    lp1 = filter(LPf',[1],y);   lp1 = resample(lp1,1,2);
    hp1 = filter(HPf',[1],y);   hp1 = resample(hp1,1,2);


    lp2 = filter(LPf',[1],lp1); lp2 = resample(lp2,1,2);
    hp2 = filter(HPf',[1],lp1); hp2 = resample(hp2,1,2);
    lp3 = filter(LPf',[1],lp2); lp3 = resample(lp3,1,2);
    hp3 = filter(HPf',[1],lp2); hp3 = resample(hp3,1,2);
    lp4 = filter(LPf',[1],lp3); lp4 = resample(lp4,1,2);
    hp4 = filter(HPf',[1],lp3); hp4 = resample(hp4,1,2);
    wavletfeatures = [mypower(hp1),mypower(hp2),mypower(hp3),...
                      mypower(hp4),mypower(lp4)];
    mydatamat(index,:) = y1';
    mydatawavlet(index,:) = wavletfeatures;
    mydatalpc(index,:) = real(lpc(y,4));  % Calculate LPC coefficients
    hold on;
    plot(y1,'b');
end

% Location where S2 prototypes are located
path = 'C:\MATLAB6p1\work\dataset\S2\S2_';
extn = '.wav';

for index = 1:NumProto
    % Sequentially read all the prototypes from class S2
    if (index < 10)
        complete_path = [path '0' num2str(index) extn];
    else
        complete_path = [path num2str(index) extn];
    end
    y = wavread(complete_path);
    % Preprocessing is normalization by maximum value
    y = Mypreprocess(y);
    y1 = psd(y,NUMFFTPTS);


    % Analysis filter bank decomposition
    lp1 = filter(LPf',[1],y);   lp1 = resample(lp1,1,2);
    hp1 = filter(HPf',[1],y);   hp1 = resample(hp1,1,2);
    lp2 = filter(LPf',[1],lp1); lp2 = resample(lp2,1,2);
    hp2 = filter(HPf',[1],lp1); hp2 = resample(hp2,1,2);
    lp3 = filter(LPf',[1],lp2); lp3 = resample(lp3,1,2);
    hp3 = filter(HPf',[1],lp2); hp3 = resample(hp3,1,2);
    lp4 = filter(LPf',[1],lp3); lp4 = resample(lp4,1,2);
    hp4 = filter(HPf',[1],lp3); hp4 = resample(hp4,1,2);
    wavletfeatures = [mypower(hp1),mypower(hp2),mypower(hp3),...
                      mypower(hp4),mypower(lp4)];

    mydatamat(NumProto+index,:) = y1';
    mydatawavlet(NumProto+index,:) = wavletfeatures;
    mydatalpc(NumProto+index,:) = real(lpc(y,4));  % LPC coefficients calculation
    hold on
    plot(y1,'r');
end

%labels = genlab([20 20],['w1';'w2']);
labels = [zeros(NumProto,1); ones(NumProto,1)];
f_list = zeros(NumF,1);
for i = 1:NumF
    f_list(i) = i;
end
prior_prob = [0.5;0.5];
A = dataset(mydatamat,labels,f_list);
WA = dataset(mydatawavlet,labels,f_list);
LPC_Co = dataset(mydatalpc,labels,f_list);


% Feature reduction is done using the Fisher Linear Discriminant.
% All three feature sets are reduced to 1 dimension and are concatenated to
% obtain a 3-D feature vector which is used for classification
AP = fisherm(A,1)
WAP = fisherm(WA,1)
LAP = fisherm(LPC_Co,1)
A = AP*A;
WA = WAP*WA;
LPC_Co = LAP*LPC_Co;
A = [A,WA,LPC_Co];  % Concatenation of the 3 reduced features

figure
plot3(double(A(1:20,1)),double(A(1:20,2)),double(A(1:20,3)),'ro','MarkerFaceColor','r');
hold on;
plot3(double(A(21:40,1)),double(A(21:40,2)),double(A(21:40,3)),'gx','MarkerFaceColor','g');

% Location where test prototypes are located
path = 'C:\MATLAB6p1\work\dataset\test1\test_';
extn = '.wav';
figure
for index = 1:Numtest
    if (index < 10)
        complete_path = [path '0' num2str(index) extn];
    else
        complete_path = [path num2str(index) extn];
    end
    y = wavread(complete_path);
    % Preprocessing is normalization by maximum value
    y = Mypreprocess(y);
    % Short time window PSD calculation
    y1 = psd(y,NUMFFTPTS);
    % Analysis filter bank decomposition
    lp1 = filter(LPf',[1],y);   lp1 = resample(lp1,1,2);
    hp1 = filter(HPf',[1],y);   hp1 = resample(hp1,1,2);

    lp2 = filter(LPf',[1],lp1); lp2 = resample(lp2,1,2);
    hp2 = filter(HPf',[1],lp1); hp2 = resample(hp2,1,2);
    lp3 = filter(LPf',[1],lp2); lp3 = resample(lp3,1,2);
    hp3 = filter(HPf',[1],lp2); hp3 = resample(hp3,1,2);
    lp4 = filter(LPf',[1],lp3); lp4 = resample(lp4,1,2);
    hp4 = filter(HPf',[1],lp3); hp4 = resample(hp4,1,2);
    wavletfeatures = [mypower(hp1),mypower(hp2),mypower(hp3),...
                      mypower(hp4),mypower(lp4)];

    mytestmat(index,:) = y1';
    mytestwavlet(index,:) = wavletfeatures;
    mytestlpc(index,:) = real(lpc(y,4));
    hold on
    plot(y1,'g');  % Plot PSD of test prototypes
end

prior_prob = [0.5;0.5];
% 1 is added to match conventions
% The mapping objects created from the training prototypes are reused here:
% final classification of the test prototypes uses the mappings from training
Classfr1 = parzenc(A);
classdec = classd(Classfr1*([AP*mytestmat,WAP*mytestwavlet,LAP*mytestlpc]))+1
%test = [AP*mytestmat,WAP*mytestwavlet,LAP*mytestlpc];
%test = dataset(test,labels,f_list);

% Plot S1, S2 and test prototypes in the 3-D feature space
figure
plot3(double(A(1:20,1)),double(A(1:20,2)),double(A(1:20,3)),'ro','MarkerFaceColor','r');
hold on;
plot3(double(A(21:40,1)),double(A(21:40,2)),double(A(21:40,3)),'bo');
hold on;
plot3(double(AP*mytestmat),double(WAP*mytestwavlet),double(LAP*mytestlpc),'gx','MarkerFaceColor','g');
hold on;

% Classifier Performance
% fisherc
w_ps = fisherc(A);
figure
scatterd(A,4);
plotm(w_ps,[],0.1);
[e_ps,s_ps] = crossval(fisherc,A);
% qdc
w_bayes = qdc(A);
[e_bayes,s_bayes] = crossval(qdc,A);
figure
scatterd(A,4);
plotm(w_bayes,[],.1);
% knnc
w_knnc2 = knnc(A,2);
[e_knnc2 s_knnc2] = crossval(knnc,A,2);
% parzenc
Classfr1 = parzenc(A);  % Parzen classifier is used
% Cross-validation to assess classifier performance
sprintf('%s','Crossvalidating using parzenc');
[e,s] = crossval(parzenc,A,1)  % Leave-one-out
