SUPPORT VECTOR MACHINES FOR SPEAKER VERIFICATION AND IDENTIFICATION

Vincent Wan
Department of Computer Science, University of Sheffield,
211 Portobello Street, Sheffield S1 4DP, UK.
[email protected]
(This work was undertaken while the author was with the Motorola Human Interface Lab at Palo Alto, California, USA.)

William M. Campbell
Motorola Human Interface Lab,
2100 East Elliot Rd., Tempe, AZ 85284, USA.
william [email protected]

Abstract. In this paper the performance of the support vector machine (SVM) on a speaker verification task is assessed. Since speaker verification requires binary decisions, support vector machines seem to be promising candidates for the task. A new technique for normalising the polynomial kernel is developed and used to achieve performance comparable to other classifiers on the YOHO database. We also present results on a speaker identification task.

INTRODUCTION

Support vector machines (SVMs) have gained much attention since their inception [1, 2, 4]. SVMs are classifiers based on the principle of structural risk minimisation. Experimental results indicate that SVMs can achieve generalisation performance that matches or exceeds that of other classifiers, while requiring significantly less training data to do so. In this work we are concerned with using SVMs for speaker verification. Speaker verification is a task that seems especially suited to SVMs due to the binary nature of the decisions: the classifier must decide whether or not a speaker is indeed who he claims to be. Some early work on the similar task of speaker identification was reported by Schmidt and Gish [12], who achieved mixed results on Switchboard, a very noisy speech database. In this paper we restrict ourselves to text-independent speaker verification using the YOHO database [7, 11], which provides high quality recordings of speech from co-operative speakers. We also report initial results for speaker identification, a simple extension of verification. The task was not straightforward and required the application of a new normalisation technique before performance comparable to other classifiers could be achieved. The normalisation scheme is applicable to the polynomial kernel and imposes upon it some properties of the RBF kernel. With the new normalisation technique, we were able to achieve excellent performance from SVMs on this database.

OVERVIEW OF SUPPORT VECTOR MACHINES

This section provides a brief theoretical overview of support vector machines. For a fuller explanation of SVMs, the reader is directed to the introductory tutorial by Burges [2] or, for a more detailed treatment, Vapnik's book [13].

An SVM is a binary classifier that makes its decisions by constructing a linear decision boundary or hyperplane that optimally separates the two classes. The hyperplane is defined by $x \cdot w + b = 0$, where $w$ is the normal to the plane. For linearly separable data labelled $\{x_i, y_i\}$, $x_i \in \mathbb{R}^d$, $y_i \in \{-1, +1\}$, $i = 1 \ldots N$, the optimal hyperplane is chosen according to the maximum margin criterion, i.e. by choosing the separating plane that maximises the Euclidean distance to the nearest data points on each side of that plane. This is achieved by minimising the square of the L2-norm of $w$, $\|w\|_2^2$, subject to the inequalities $(x_i \cdot w + b) y_i \geq 1$ for all $i$. The solution for the optimal hyperplane, $w_0$, is a linear combination of a small subset of the data, $x_s$, $s \in \{1 \ldots N\}$, known as the support vectors. These support vectors also satisfy the equality $(x_s \cdot w_0 + b) y_s = 1$.

When the data is not linearly separable, no hyperplane exists for which all points satisfy the inequality above. To overcome this problem, slack variables, $\xi_i$, are introduced into the inequalities, relaxing them slightly so that some points are allowed to lie within the margin or even be misclassified completely. The resulting problem is then to minimise

    \frac{1}{2} \|w\|_2^2 + C \sum_i L(\xi_i)                                (1)

subject to $(x_i \cdot w + b) y_i \geq 1 - \xi_i$, where the second term is the empirical risk associated with the marginal or misclassified points, $L$ is the loss function and $C$ is a hyper-parameter that balances the effect of minimising the empirical risk against that of maximising the margin. The first term can be thought of as a regularisation term. The most commonly used loss is the linear-error cost function, $L(\xi_i) = \xi_i$, since it is robust to outliers. The dual formulation of (1) with $L(\xi_i) = \xi_i$, which is more conveniently solved, is

    \max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \, x_i \cdot x_j                (2)

subject to

    0 \leq \alpha_i \leq C, \qquad \sum_i \alpha_i y_i = 0                   (3)

in which $\alpha_i$ is the Lagrange multiplier of the $i$th constraint in the primal optimisation problem. The dual can be solved using standard quadratic programming techniques.


The orientation of the optimal plane, $w_0$, is given by

    w_0 = \sum_i \alpha_i y_i x_i                                            (4)

and is a linear combination of all points in feature space that have $\xi_i > 0$ as well as those that lie on the margin (i.e. all points with $\alpha_i \neq 0$).

The extension to non-linear boundaries is achieved through the use of kernels that satisfy Mercer's condition [5]. In essence, each data point is mapped onto a manifold embedded in some feature space (the space defined implicitly by the kernel) of higher dimension than the input space (the space occupied by the data). The hyperplane is constructed in the feature space and intersects the manifold, creating a non-linear boundary in the input space. In practice, the mapping is achieved by replacing the value of dot products between two data points in input space with the value that results when the same dot product is carried out in the feature space. The dot product in the feature space is expressed conveniently by the kernel as some function of the two data points in input space. In common use are polynomial kernels and radial basis function (RBF) kernels, which take the forms

    K(x_i, x_j) = (x_i \cdot x_j + 1)^n                                      (5)

and

    K(x_i, x_j) = \exp\left[ -\frac{1}{2} \left( \frac{\|x_i - x_j\|}{\sigma} \right)^2 \right]                        (6)

respectively, where $n$ is the order of the polynomial and $\sigma$ is the width of the radial basis function. The dual for the non-linear case is thus

    \max_{\alpha} \; \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j K(x_i, x_j)                     (7)

subject to

    0 \leq \alpha_i \leq C, \qquad \sum_i \alpha_i y_i = 0                   (8)

The use of kernels means that an explicit transformation of the data into the feature space is not required.
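As a concrete illustration of (5) and (6) (not part of the original paper), the following sketch computes the corresponding Gram matrices with NumPy; the function names and example dimensions are our own.

```python
import numpy as np

def polynomial_kernel(X, Y, n=3):
    """Polynomial kernel of order n, K(x, y) = (x . y + 1)^n, cf. eq. (5)."""
    return (X @ Y.T + 1.0) ** n

def rbf_kernel(X, Y, sigma=0.3):
    """RBF kernel K(x, y) = exp(-0.5 * (||x - y|| / sigma)^2), cf. eq. (6)."""
    # Pairwise squared Euclidean distances between rows of X and rows of Y.
    sq_dists = (np.sum(X ** 2, axis=1)[:, None]
                + np.sum(Y ** 2, axis=1)[None, :]
                - 2.0 * (X @ Y.T))
    sq_dists = np.maximum(sq_dists, 0.0)   # guard against tiny negative values
    return np.exp(-0.5 * sq_dists / sigma ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 24))          # five 24-dimensional feature vectors
    print(polynomial_kernel(X, X).shape)  # (5, 5) Gram matrix
    print(rbf_kernel(X, X).shape)         # (5, 5) Gram matrix
```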

NORMALISED POLYNOMIAL KERNELS

SVMs rely heavily on quadratic programming (QP) optimisers. Unfortunately QP does not scale easily to large problems. The memory requirements of QP may be largely overcome by chunking [9]. However, when a near singular Hessian is encountered, due to badly scaled inputs for example, an optimiser may find a suboptimal solution, sometimes even failing to find any satisfactory solution. The "no satisfactory solution" symptom seems to occur more often when using a polynomial kernel, but very rarely when using the RBF kernel.

The elements of the Hessian consist (up to a sign) of $K(x_i, x_j)$ evaluated for all possible combinations of $i$ and $j$. In the case of the polynomial kernel, $K(x_i, x_j)$ can vary from very small to very large depending upon the relative values of $x_i$ and $x_j$, sometimes leading to a badly conditioned or near singular Hessian that breaks the QP optimiser. The effect is particularly noticeable for higher orders $n$. The usual method of overcoming this problem is to normalise the input $x_i$ so that the absolute value of the base that is raised to the power $n$ is less than one, preventing $K$ from becoming large. This can be achieved by applying some linear or affine transformation to the input, but it is sometimes inadequate. It has been observed that the RBF kernel is less susceptible to such scaling problems, and this has been attributed to the fact that $K(x_i, x_i) > K(x_i, x_j) > 0$ for $x_i \neq x_j$ [10], so that if $K(x_i, x_i) = 1$ is also satisfied then the scaling problem cannot manifest itself. The inequality is a property of locality: it means that the basis functions decay away, so that changes in one basis function do not propagate to many others. The polynomial kernel, as it stands, does not possess this property of locality; indeed, a polynomial's influence is greatest at its "tails". Furthermore, the locality property ensures that the Hessian is diagonally dominant, giving rise to better convergence [8].

A simple way to incorporate locality into the polynomial kernel is to normalise the L2 norm of each vector, $\|x_i\|_2 = 1$, so that any dot product, $x_i \cdot x_j$, in input space is exactly the cosine of the angle between the two vectors. This is particularly desirable since the cosine takes its maximum when $x_i = x_j$ and decays away as the angle between the two vectors increases. However, a direct normalisation of the lengths would lead to a loss of information: for example, two distinct points represented by the vectors $x$ and $2x$ become identical after normalisation by length, resulting in increased classification uncertainty. A general solution that loses no information is to embed the data into the surface of a hemisphere in a space one dimension larger than the input space. Defining the new origin to be the centre of the sphere means that the L2 norms of the vectors are all equal to the radius of the sphere and thus are normalised.

Mapping a plane onto the surface of a sphere can be achieved by many different projections. Cartographers use various projections for creating maps of the world; in our problem we wish to apply the inverses of those projections. A projection can be chosen to preserve a particular property: for example, conformal projections preserve angles, while equal-area and equidistant projections preserve area and distance respectively. We consider the orthographic projection, and one other projection similar to the stereographic projection, both of which can be applied as simple modifications to the polynomial kernel. Of course, any conceivable projection can be applied directly to the data as preprocessing prior to classification using the polynomial kernel.

The orthographic projection, perhaps the simplest, maps all points from the hemisphere onto a plane by projecting in the direction perpendicular to that plane.
The projection can be achieved by augmenting the vectors with a new positive component whose value is chosen such that the L2 norm of the new vector equals the radius, $r$, of the sphere. It is important that a sufficiently large $r$ be chosen in order to fit all of the data onto the sphere. Incorporating the modification into the polynomial kernel gives

    K(x, y) = \frac{1}{2^n} \left( \frac{x \cdot y + \sqrt{(r^2 - \|x\|^2)(r^2 - \|y\|^2)}}{r^2} + 1 \right)^n          (9)
where the factor of 2 has been included so that $K$ is normalised between 0 and 1.

Another projection maps points on a plane, placed at a distance $d$ from the origin, onto a unit sphere by projecting along the radius of the sphere. The projection is achieved by augmenting the vectors with a non-zero constant component, $d$, and normalising the L2 norm of the resulting vector. The modified kernel is

    K(x, y) = \frac{1}{2^n} \left( \frac{x \cdot y + d^2}{\sqrt{(\|x\|^2 + d^2)(\|y\|^2 + d^2)}} + 1 \right)^n          (10)
This method has an advantage over the orthographic projection in that the entire input space can be embedded onto the surface of the hemisphere independently of the choice of $d$.

In normalised form, the polynomial kernel contains the dot product between two unit vectors, which is exactly the cosine of the angle between the two vectors. The angle between two vectors drawn from the centre of a sphere to its surface is proportional to the length of the arc between the two points measured along the surface of that sphere. The cosine is a unimodal bell-shaped function for angles in the range $-\pi$ to $+\pi$, with output lying between $-1$ and $+1$. A simple affine transformation forces the output to lie between 0 and 1, as applied in (9) and (10). Raising the transformed cosine function to the power $n$ decreases its width at half height. Thus the normalised kernel is a measure of distance between two points on the surface of the sphere controlled by the parameter $n$, analogous to the Mahalanobis distance, with $n$ playing a role analogous to $1/\sigma$ in the RBF kernel. This interpretation is particularly appealing when used with the equidistant projection, in which the arc length between two points on the sphere is the same as their separation in the plane measured by the Euclidean metric.

It is simple to verify that the new normalised kernels are indeed valid kernels to use with SVMs. The criterion for a valid kernel is that it must represent the dot product between two vectors in some other space,

    K(x, y) = \Phi(x) \cdot \Phi(y)                                          (11)

where $\Phi$ is a transformation to some feature space, which can equally be expressed as two transformations applied one after the other:

    K(x, y) = \phi_2(\phi_1(x)) \cdot \phi_2(\phi_1(y))                      (12)

The normalised polynomial kernel combines two such transformations. The first transformation, $\phi_1$, is the mapping onto the surface of the sphere and the second, $\phi_2$, is the standard polynomial expansion. Thus the normalised polynomial kernel represents a dot product in some other space.


Furthermore, a valid kernel must result in a Hessian that is positive semidefinite for all possible input arguments, i.e. it must satisfy Mercer's conditions [13]. It is known that the polynomial kernel, $K_p(x_i, x_j)$, is positive semidefinite. Applying a nonlinear transformation to the input features, $x \to x'$, does not change this. Thus, after mapping the data from the plane to the sphere, the matrix with elements $K_p(x'_i, x'_j)$ must also be positive semidefinite, because the polynomial kernel function yields a positive semidefinite matrix for any selection of input arguments.
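The following sketch (ours, not the authors') implements the normalised kernels as reconstructed in (9) and (10) using NumPy, and checks numerically that K(x, x) = 1 and that all kernel values lie in [0, 1]; the function names and parameter defaults are illustrative.

```python
import numpy as np

def orthographic_poly_kernel(X, Y, n=10, r=10.0):
    """Normalised polynomial kernel of eq. (9): each vector is augmented with
    sqrt(r^2 - ||x||^2) so that it lies on a sphere of radius r (r must be
    large enough that r^2 >= ||x||^2 for every input vector)."""
    cross = np.sqrt((r ** 2 - np.sum(X ** 2, axis=1))[:, None]
                    * (r ** 2 - np.sum(Y ** 2, axis=1))[None, :])
    cos_theta = (X @ Y.T + cross) / r ** 2     # cosine of the angle on the sphere
    return ((cos_theta + 1.0) / 2.0) ** n      # affine map to [0, 1], then power n

def spherical_poly_kernel(X, Y, n=10, d=1.0):
    """Normalised polynomial kernel of eq. (10): append a constant component d
    and normalise to unit length, so the dot product becomes a cosine."""
    norms = np.sqrt((np.sum(X ** 2, axis=1) + d ** 2)[:, None]
                    * (np.sum(Y ** 2, axis=1) + d ** 2)[None, :])
    cos_theta = (X @ Y.T + d ** 2) / norms
    return ((cos_theta + 1.0) / 2.0) ** n

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(6, 24))
    K = spherical_poly_kernel(X, X, n=10, d=1.0)
    assert np.allclose(np.diag(K), 1.0)             # K(x, x) = 1, as required
    assert np.all((K >= 0.0) & (K <= 1.0 + 1e-12))  # values lie in [0, 1]
```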

TEXT INDEPENDENT SPEAKER VERIFICATION USING SVMS

The goal of speaker verification is to validate the identity of a person by his/her voice alone. Prior to verification, the user must be enrolled on the system so that a model of his/her voice can be created. Classification methods fall into two groups: statistical methods, which include Gaussian Mixture Models [11], and discriminant methods, which include multilayer perceptrons [6] and polynomial classifiers [3]. During verification, an individual (the claimant) claims a certain identity and is prompted to read a line of text. In a text independent speaker verification task the transcription of the text is ignored. The model stored in the system is then used to determine whether the utterance was indeed made by the user: a score for the utterance is computed from the model and compared to a threshold to determine the claimant's validity.

SVMs are discriminant classifiers and require both positive and negative examples for training. To train an SVM to perform speaker verification, not only are examples of the user's speech required, but sufficient samples from imposter speakers must also be available, so that the classifier does not under-train by assigning a region of input space that is devoid of training data to the user. Under-training means that an imposter who was not seen by the classifier during training could be misclassified. Increasing the number of imposters in the training set is the simplest way to prevent this from occurring. However, SVMs do not scale well to very large training sets: when the data is inseparable, the number of support vectors that parameterise the solution grows with the size of the training set, since points that lie inside the margin or are incorrectly classified are included as support vectors in the solution. A larger solution requires more storage space and significantly increases the amount of computation required during both training and classification. Thus it is desirable to train an SVM on relatively small amounts of data, so one must be especially careful when selecting training data to avoid under-training. Vector quantisation provides a simple way to obtain a small training set that is still representative of the full training set, with only a small effect on overall performance.

It is interesting to note that an SVM trained with the polynomial kernel is very similar to Campbell's polynomial classifier method [3]. In that method, the acoustic vectors are mapped explicitly into a feature space of high dimension according to the coefficients of a polynomial expansion. A linear discriminant that minimises the sum-squared error on the training set classifies the vectors in the feature space.


The expansion is the same as that implied by the unnormalised polynomial kernel in SVM classifiers, although the polynomial classifier method is invariant to linear transformations of the acoustic features whereas SVMs are not. This provides us with baseline results against which we can make comparisons.

The score for an utterance is computed simply as the arithmetic mean of the activations of the SVM over the acoustic feature vectors. The classification of the SVM is given by

    \mathrm{sign}(w \cdot x + b)                                             (13)

while the activation is

    w \cdot x + b                                                            (14)

Suppose that the SVM is trained so that the classification is positive for the user and negative for imposters. The score of an utterance of length $N$ is

    S = \frac{1}{N} \sum_{i=1}^{N} (w \cdot x_i + b)                         (15)

Expressing $w$ as the sum over the support vectors, $x_s$, and including the kernel function,

    S = \frac{1}{N} \sum_{i=1}^{N} \left( \sum_s \alpha_s y_s K(x_s, x_i) + b \right)                                  (16)

where $\alpha_s$ is the Lagrange multiplier associated with the $s$th support vector and $y_s$ is the corresponding classification label: $y_s = +1$ if $x_s$ belongs to the user and $y_s = -1$ if it belongs to an imposter.

Computing the utterance score as the mean of the activations has some benefits over the mean of the classifications. If the SVM classifies a feature vector strongly as belonging to one class or the other (large positive or large negative activation), then that feature vector contributes more significantly to the mean. Equally, the reverse is true: if the activation is close to zero (i.e. the point is close to the decision boundary), applying the sign operator would force the classifier to give a decisive classification to a point about which it is less certain. Using the activation means that points with low classification certainty contribute less to the utterance score.

Once the utterance score has been computed it is compared to a threshold, $T$, and a decision is made according to the rule:

    if $S > T$, accept the speaker (the person is who he claims to be);
    if $S \leq T$, reject the speaker (the person is an imposter).

There are two types of error that can be made: a false acceptance, where an imposter is incorrectly authenticated, and a false rejection, where the user is incorrectly identified as an imposter. The proportion of each type of error depends upon the value of the threshold, which can be speaker specific or risk adaptive depending upon the nature of the environment in which the system is used. For the purpose of evaluation we define the equal error rate (EER) as the error rate obtained when $T$ is set such that the percentages of the two types of error are the same.
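To make the scoring and decision rules concrete, a minimal sketch follows, assuming a trained classifier that exposes scikit-learn's decision_function (which returns the activation w · x + b); the function names, and the simple threshold sweep used to approximate the EER, are our own illustration rather than the authors' implementation.

```python
import numpy as np

def utterance_score(svm, frames):
    """Mean SVM activation over the N acoustic feature vectors of one
    utterance, i.e. the score S of eqs. (15)-(16)."""
    # decision_function returns w . x + b (in kernelised form,
    # sum_s alpha_s y_s K(x_s, x) + b) for each frame.
    return float(np.mean(svm.decision_function(frames)))

def verify(svm, frames, threshold):
    """Accept the claimed identity if the utterance score exceeds T."""
    return utterance_score(svm, frames) > threshold

def equal_error_rate(user_scores, imposter_scores):
    """Sweep the threshold over all observed scores and return the error rate
    at the point where false acceptance and false rejection are closest."""
    user_scores = np.asarray(user_scores)
    imposter_scores = np.asarray(imposter_scores)
    best = None
    for t in np.sort(np.concatenate([user_scores, imposter_scores])):
        far = np.mean(imposter_scores > t)   # false acceptance rate
        frr = np.mean(user_scores <= t)      # false rejection rate
        if best is None or abs(far - frr) < best[0]:
            best = (abs(far - frr), (far + frr) / 2.0)
    return best[1]
```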


Extension to speaker identification

In the speaker identification task, we are interested in determining the identity of a speaker from a group of speakers. The simplest method constructs classifiers to separate each speaker from all of the others; if there are $n$ speakers then $n$ classifiers must be constructed. The identity of the speaker is determined by the classifier that yields the largest utterance score,

    \arg\max_j \; \frac{1}{N} \sum_{i=1}^{N} \left( \sum_s \alpha_{s_j} y_{s_j} K(x_{s_j}, x_i) + b \right)             (17)

where $x_{s_j}$ are the support vectors of the $j$th classifier and $\alpha_{s_j}$ and $y_{s_j}$ are the corresponding Lagrange multipliers and class labels.
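A corresponding sketch of the identification rule (17), again assuming scikit-learn-style classifiers; this is our illustration, not the authors' code.

```python
import numpy as np

def identify_speaker(svms, frames):
    """Identification rule of eq. (17): return the index of the one-versus-rest
    classifier whose mean activation over the utterance frames is largest."""
    scores = [float(np.mean(svm.decision_function(frames))) for svm in svms]
    return int(np.argmax(scores))
```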

EXPERIMENTS

Experiments were performed on the YOHO database. This database consists of 138 speakers prompted to read combination lock phrases, for example "67 34 85". The features were derived from the audio recordings using 12th order LPC analysis, with delta features appended, making up a twenty-four dimensional feature vector. Of the 138 speakers in the database, 69 speakers, labelled 101 to 174, were used for training and testing, while the remaining speakers, labelled 175 to 277, were used for testing only. Frames of data corresponding to silence were removed from the utterances.

The YOHO database has forty utterances per speaker set aside for testing. Testing is performed in two batches using single phrase tests. The test utterances of speakers 101 to 174 (which we call the seen imposters), on which the SVMs were trained, are separated from those of speakers 175 to 277 (which we call the unseen imposters), and results are reported separately for each batch. Testing on speakers not seen during training gives a more realistic assessment of overall performance than using the set of imposters seen during training.

In order to construct a small data set for training SVMs, the training data for each speaker was quantised to one hundred centroids using the k-means clustering algorithm. In each training run, 69 SVMs were trained, each discriminating one speaker (100 centroids) from the other 68 speakers (6800 centroids).

Table 1 shows the average performance of the SVMs trained using the unnormalised polynomial kernel and the RBF kernel. It was not possible to train using unnormalised polynomial kernels of higher order: the optimiser was unable to converge to a solution, which we attribute to a nearly singular Hessian. Table 2 shows the average performance when using the new normalised polynomial kernel of (10) with d = 1. It can be seen that the normalised polynomial kernel gives a marked improvement in average performance over unnormalised polynomial kernels of the same order. Higher orders could also be trained, and these exhibited further improvements in the average equal error rate. It has been observed that the errors reported in Table 2 are due mostly to just two speakers.
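The training-set construction and one-versus-rest training described above might be sketched roughly as follows, assuming scikit-learn's KMeans and SVC with a callable kernel; the number of centroids, the kernel order and d follow the text, while the remaining choices (and the reconstruction of the kernel itself) are our own.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def spherical_poly_kernel(X, Y, n=10, d=1.0):
    """Normalised polynomial kernel of eq. (10) (see the earlier sketch)."""
    norms = np.sqrt((np.sum(X ** 2, axis=1) + d ** 2)[:, None]
                    * (np.sum(Y ** 2, axis=1) + d ** 2)[None, :])
    return (((X @ Y.T + d ** 2) / norms + 1.0) / 2.0) ** n

def train_speaker_svms(frames_per_speaker, n_centroids=100, n=10, C=1.0):
    """frames_per_speaker: list of (num_frames, dim) arrays, one per speaker,
    each with at least n_centroids frames. Each speaker's frames are quantised
    to n_centroids k-means centroids, then one SVM per speaker is trained to
    separate that speaker's centroids (+1) from everyone else's (-1)."""
    centroids = [KMeans(n_clusters=n_centroids, n_init=10).fit(f).cluster_centers_
                 for f in frames_per_speaker]
    kernel = lambda A, B: spherical_poly_kernel(A, B, n=n)
    svms = []
    for j, own in enumerate(centroids):
        others = np.vstack([c for k, c in enumerate(centroids) if k != j])
        X = np.vstack([own, others])
        y = np.concatenate([np.ones(len(own)), -np.ones(len(others))])
        svms.append(SVC(C=C, kernel=kernel).fit(X, y))
    return svms
```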


Table 1: Average performance and size of SVMs trained using unnormalised polynomial and RBF kernels on the YOHO speaker verification task.

Kernel       n or σ   % EER (seen)   % EER (unseen)   mean #SVs   std. dev. #SVs
polynomial   2        5.00           5.60             266         11
polynomial   4        1.45           1.73             377         107
RBF          0.3      1.47           1.86             789         81

Table 2: Average performance and size of SVMs trained with the normalised polynomial kernel on the YOHO speaker verification task.

Polynomial order   % EER (seen)   % EER (unseen)   mean #SVs   std. dev. #SVs
2                  2.13           2.50             742         58
4                  0.70           1.06             582         63
6                  0.49           0.74             643         54
8                  0.39           0.64             711         45
10                 0.34           0.59             748         29

For the 10th order polynomial experiment, speakers 162 and 169 have average EERs of 5% and 10% respectively, while the EERs for almost all other speakers are less than 0.1%. Excluding these two speakers and averaging over the remaining 67 yields an average EER of 0.13% on the seen imposters and 0.32% on the unseen imposters for the 10th order normalised polynomial kernel. To put these figures into perspective, the results obtained with the polynomial classifier method are shown in Table 3 (which includes results of further experiments not reported in [3]). It can be seen that the performance of the SVMs is close to, but not quite as good as, that of the polynomial classifier technique. In [7] and [11], Gaussian Mixture Model systems yield EERs of between 0.5% and 0.6%; however, only loose comparisons can be made with these systems since the training and testing paradigms differ.

The same SVMs trained for speaker verification were also used for speaker identification in a one-from-others classification scheme. Table 4 shows the performance of SVMs trained using the normalised polynomial kernel on this task.

SUMMARY

A new method for normalising the polynomial kernel was presented. The technique yielded performance gains of as much as a factor of two over the unnormalised polynomial kernel and also enabled higher orders to be used. The performance on the speaker verification task is close to, but not as good as, that obtained using another method employing polynomial classifiers. We also presented some results on a speaker identification task.


Table 3: Speaker verification results on the YOHO database using polynomial classifiers.

Polynomial order   % EER (seen)   % EER (unseen)   Number of parameters
2                  1.30           1.65             325
3                  0.18           0.31             2925

Table 4: Speaker identification results using SVMs with the normalised polynomial kernel.

Polynomial order   I.D. error rate %
2                  23.5
4                  7.3
6                  5.2
8                  4.4
10                 4.5

ACKNOWLEDGEMENTS

We gratefully acknowledge the ideas volunteered by John Platt of Microsoft.

REFERENCES

[1] B. E. Boser, I. M. Guyon and V. Vapnik, "A Training Algorithm for Optimal Margin Classifiers," in Fifth Annual Workshop on Computational Learning Theory, ACM, Pittsburgh, 1992.
[2] C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 1–47, 1998.
[3] W. M. Campbell and K. T. Assaleh, "Polynomial Classifier Techniques for Speaker Verification," in Proc. ICASSP, 1999, vol. 1, pp. 321–324.
[4] C. Cortes and V. Vapnik, "Support Vector Networks," Machine Learning, vol. 20, pp. 273–297, 1995.
[5] R. Courant and D. Hilbert, Methods of Mathematical Physics, Interscience, 1953.
[6] K. R. Farrell, R. J. Mammone and K. T. Assaleh, "Speaker Recognition using Neural Networks and Conventional Classifiers," IEEE Trans. Speech and Audio Process., vol. 2, pp. 194–205, Jan 1994.
[7] J. P. Campbell, Jr., "Testing with the YOHO CD-ROM voice verification corpus," in Proc. ICASSP, 1995, vol. 1, pp. 341–344.
[8] D. G. Luenberger, Linear and Nonlinear Programming, Addison-Wesley, 1984.
[9] E. E. Osuna, R. Freund and F. Girosi, "Support Vector Machines: Training and Applications," Technical report, Massachusetts Institute of Technology, March 1997.
[10] J. Platt, private communication.
[11] D. A. Reynolds, "Speaker Identification and Verification using Gaussian Mixture Speaker Models," Speech Communication, vol. 17, pp. 91–108, 1995.
[12] M. Schmidt and H. Gish, "Speaker Identification via Support Vector Machines," in Proc. ICASSP, 1996, pp. 105–108.
[13] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995.
