2013 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 22–25, 2013, SOUTHAMPTON, UK

ROBUST ARABIC SPEAKER VERIFICATION SYSTEM USING LSF EXTRACTED FROM THE G.729 BITSTREAM

Kawthar Yasmine ZERGAT, Abderrahmane AMROUCHE, Meriem FEDILA and Mohamed DEBYECHE
Speech Com. & Signal Proc. Lab.-LCPTS, Faculty of Electronics and Computer Sciences, USTHB, Bab Ezzouar, 16 111, Algeria.
E-mail: [email protected], [email protected], [email protected], [email protected]

ABSTRACT

This paper deals with an Arabic text-independent speaker verification system over the Internet Protocol (VoIP). The system, using the ARADIGIT database and based on Support Vector Machines (SVM), was designed to use the information extracted directly from the coded parameters embedded in the ITU-T G.729 bitstream. Experiments evaluated the robustness of the system under different noisy conditions. The results show that Line Spectral Frequency (LSF) features extracted directly from the G.729 encoded bitstream significantly improve the recognition performance compared with Mel Frequency Cepstral Coefficient (MFCC) features extracted from the decoded speech.

Index Terms— G.729, SVM, Speaker Verification, LSF, MFCC, VoIP

1. INTRODUCTION

Speaker recognition can be divided into speaker identification and speaker verification [1]. In speaker identification, the goal is to determine which one of a group of known voices best matches the input voice sample, whereas in verification, given a test utterance and a claim of identity, the system determines whether the claim is true or false using the corresponding speaker model. Speaker recognition systems can also be classified into text-dependent and text-independent systems [1]. Text-dependent systems require the recitation of a predetermined text, whereas text-independent systems accept speech utterances of unrestricted text. Today, speech communication technologies based on Voice over the Internet Protocol (VoIP) have become more and more pervasive, which makes it possible to use the speaker's voice to control access to restricted services,

such as voice mail, phone access to banking, and access to secure equipment. The ITU-T G.729 [2] speech codec is one of the codecs used to encode and decode voice transmitted over the Internet Protocol (VoIP). In [3] and [4], studies were carried out on techniques requiring knowledge of the coder's internal structure to improve the recognition accuracy of the system. However, the performance is still poorer than that achieved using resynthesized speech. In this paper, we investigate, using the G.729 coder, the effect of speech coding on a speaker verification task over IP networks. Focus is given particularly to the recognition performance obtained with the encoded bitstream using LSF features. Experiments were performed using Support Vector Machines (SVM) [5] for an Arabic text-independent speaker verification system in both clean and noisy environments. Section 2 explains the G.729 speech coder, while in Section 3 the automatic speaker verification (ASV) system using the SVM classifier is presented. The experiments conducted on the ARADIGIT database and the performance evaluations of this work are described in Sections 4 and 5. Finally, the paper is concluded in Section 6.

2. G.729 SPEECH CODING

G.729 is a toll-quality speech coding standard specified by the ITU (International Telecommunication Union). It is officially described as Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear prediction (CS-ACELP). In G.729, forward adaptation is used to determine the synthesis filter parameters every 10 ms. These filter coefficients are then converted to line spectral frequencies (LSFs) and quantized using predictive two-stage vector quantization. Each 10 ms frame is split into two 5 ms subframes, and the excitation for the synthesis filter is calculated for each subframe [9]. The long-term correlations in the speech signal are modeled using an adaptive codebook with fractional delay.

An algebraic codebook with an efficient search procedure is used as the fixed codebook. The adaptive and fixed-codebook gains are vector quantized using a two-stage conjugate-structure codebook. The entries from the fixed, adaptive, and gain codebooks are chosen every subframe using an analysis-by-synthesis search. In G.729 decoders, the received bitstream is decoded to obtain the synthesis filter coefficients. Entries from the fixed, adaptive, and gain codebooks are also determined to form an excitation signal for the synthesis filter [9].
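As a concrete illustration of the LSF representation used throughout this paper, the following sketch converts a set of LPC synthesis-filter coefficients into line spectral frequencies via the standard sum/difference polynomial decomposition. This is a minimal NumPy sketch, valid for even prediction orders; the function name and the test polynomial are our own and not part of the G.729 specification.

```python
import numpy as np

def lpc_to_lsf(a):
    """Convert LPC coefficients a = [1, a1, ..., ap] (p even) into
    line spectral frequencies in radians, sorted in (0, pi)."""
    a = np.asarray(a, dtype=float)
    # Sum and difference polynomials:
    #   P(z) = A(z) + z^-(p+1) A(z^-1),  Q(z) = A(z) - z^-(p+1) A(z^-1)
    a_ext = np.concatenate([a, [0.0]])
    P = a_ext + a_ext[::-1]
    Q = a_ext - a_ext[::-1]
    # Remove the trivial roots at z = -1 (for P) and z = +1 (for Q)
    P = np.polydiv(P, [1.0, 1.0])[0]
    Q = np.polydiv(Q, [1.0, -1.0])[0]
    # The remaining roots lie on the unit circle; the LSFs are their
    # angles in the upper half plane, and those of P and Q interleave
    roots = np.concatenate([np.roots(P), np.roots(Q)])
    lsf = np.angle(roots)
    return np.sort(lsf[lsf > 0])
```

For a stable A(z) the LSFs form a strictly increasing sequence in (0, π), which is why codecs such as G.729 can quantize them efficiently with ordered vector quantizers.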

3. SPEAKER VERIFICATION

In speaker verification, when an unknown speaker requests access, the system can be seen as a one-to-one problem: it accepts or rejects the claimed identity on the basis of the individual information contained in the speech signal, so the system can be represented as a binary decision process. A typical speaker verification system is built from three modules: feature extraction, speaker modeling, and a decision module.

3.1. Feature Extraction

A feature extraction module converts the speech waveform to some type of parametric representation with a lower information rate [6]. Short-time spectral analysis is the most common way to characterize the speech signal, using features such as Linear Prediction Coding (LPC), Perceptual Linear Prediction (PLP) coefficients, and Mel-Frequency Cepstral Coefficients (MFCC). The MFCC coefficients are the best known and most popular features used in speaker recognition tasks, because they characterize the envelope of the spectrum well and allow the separation of the vocal tract and source contributions. In this work, the effect of synthesized speech on a speaker verification task was studied. The system was therefore trained using three parallel feature streams for recognition, summarized in the following processes:

• The baseline experimental system, "Process-A", consists of a conventional speaker verification system applied to the original ARADIGIT speech files. In all processes, the feature vectors were extracted every 10 ms from the speech signal with silent segments removed. In this process, feature vectors consisting of 12 MFCCs and their first and second derivatives were used. Cepstral Mean Subtraction (CMS) was applied to each feature vector in order to center the data around their average.

• The next experiment, "Process-B", illustrated in Figure 1, uses the synthesized speech version of the ARADIGIT database. The main idea is to show the impact of using a G.729 coder on the SVM system performance, for an Arabic speaker verification task in VoIP applications.

Figure 1. Feature extraction method, Process-B.

• For "Process-C", the goal is to reduce the degradation caused by the synthesis process. We propose a method that consists of extracting 10 LSF feature vectors directly from the G.729 bitstream (without decoding), and using these feature vectors as input to the SVM system for scoring, as shown in Figure 2.

Figure 2. Feature extraction method, Process-C.

3.2. Speaker Modeling

During enrollment, a feature extraction process converts the speech samples into speech feature vectors. A speaker model, such as a Gaussian Mixture Model (GMM) [7] or a Support Vector Machine (SVM) [5], is then created to characterize a speaker from the speech features. In this work, the obtained SVM speaker models are kept in a speaker database for future verification tests.
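The Process-A front end described in Section 3.1 (12 static MFCCs plus their first and second derivatives, followed by cepstral mean subtraction) can be sketched as follows. The static MFCC matrix is assumed to come from any standard front end; the helper name is ours, and the derivative estimator is a simple frame-wise gradient rather than the exact regression formula a given toolkit may use.

```python
import numpy as np

def add_deltas_and_cms(mfcc):
    """mfcc: (n_frames, 12) array of static MFCCs from any front end.
    Returns a (n_frames, 36) array: statics + delta + delta-delta,
    with cepstral mean subtraction (CMS) applied per coefficient."""
    delta = np.gradient(mfcc, axis=0)    # first derivative across frames
    ddelta = np.gradient(delta, axis=0)  # second derivative
    feats = np.hstack([mfcc, delta, ddelta])
    return feats - feats.mean(axis=0)    # CMS: center each dimension
```

After CMS every feature dimension has zero mean over the utterance, which removes stationary channel effects before SVM modeling.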

3.2.1. Support Vector Machine (SVM)

The Support Vector Machine (SVM) [5] is designed to separate vectors in a two-class problem, also called binary classification. The SVM projects the input vector x, belonging to the input space, into a kernel feature space of high dimensionality, and classifies it as follows:

f(x) = class(x) = sign( ∑_{i=1}^{N} α_i y_i K(x, x_i) + b )        (1)

where the x_i are the support vectors, the α_i their corresponding weights, and b is the bias term. Note that ∑_{i=1}^{N} α_i y_i = 0 and α_i ≥ 0. N is the number of support vectors, and K(·,·) is the kernel function [8], constrained to satisfy certain properties (the Mercer condition) so that it can be expressed as

K(x, y) = φ(x)^T φ(y).

Figure 3 illustrates the principle of SVM. Suppose that the blue circles belong to the training vectors of the client speakers (Class 1), and black circles represent the training vectors of other speakers (Class 2), also called background speakers. Using the labeled training vectors, the focus of the SVM training process is to model the boundary by maximizing the margin of separation between these two classes [8].
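The decision rule of Eq. (1) can be made concrete with a small sketch. The support vectors, labels, weights, and bias below are made-up illustrative values satisfying the constraints of Eq. (1), not values trained on the paper's data.

```python
import numpy as np

# Hypothetical support vectors x_i, labels y_i and weights alpha_i
sv = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 2.0]])
y = np.array([1.0, 1.0, -1.0])
alpha = np.array([0.5, 0.5, 1.0])          # alpha_i >= 0
b = -0.2                                   # bias term
assert np.isclose(np.sum(alpha * y), 0.0)  # constraint: sum_i alpha_i y_i = 0

def rbf(x, xi, gamma=1.0):
    # RBF kernel, the kernel family also used in Section 4.2
    return np.exp(-gamma * np.sum((np.asarray(x) - xi) ** 2))

def classify(x):
    # Eq. (1): f(x) = sign( sum_i alpha_i y_i K(x, x_i) + b )
    score = sum(a * yi * rbf(x, xi) for a, yi, xi in zip(alpha, y, sv)) + b
    return np.sign(score)
```

A point close to the two positive support vectors scores above the threshold and is assigned to Class 1, while a point sitting on the negative support vector is assigned to Class 2.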

Figure 3. Principle of the support vector machine (SVM): the optimal separating hyperplane (OSH), its margin, and the support vectors of the two classes.

3.3. Decision Module

The final module makes the decision based on the training and testing phases. The system outputs a binary decision (accept or reject). Success in speaker verification depends on extracting and modeling the speaker-dependent characteristics of the speech signal which can effectively distinguish one speaker from another. The decision is made by comparing the score with a decision threshold, which is estimated from an impostor database.

4. EXPERIMENTAL PROTOCOL

4.1. Description of the Database

Arabic is currently one of the most widely spoken languages in the world, with an estimated 350 million speakers covering a large geographical area. Spelling the ten digits in Arabic already produces an interesting number of Arabic phonemes, so the digits can be considered relatively representative elements of this language, which has several specificities, such as gemination, emphasis, and sound duration. Furthermore, it presents some phonetic and morpho-syntactic particularities. The morpho-syntactic structure, built around pattern roots and generative rules, makes the Arabic language well suited to automatic natural language processing. The database used for training and testing the recognition system is a local Arabic speech database collected from 110 Algerian natives aged from 18 to 50. This database, named ARADIGIT, was recorded in a large and very quiet auditorium (1,800 seats) at 22.05 kHz and downsampled to 16 kHz. In this work, 62 speakers of both sexes (equally distributed) were selected for the training and testing phases. The experiments were conducted in text-independent mode; therefore, the data in the testing set do not intersect those in the training set. We concatenated sequences of eight digits, from zero to seven, for the training set, and used sequences of two digits, eight and nine, for the testing set, with three repetitions of each sequence. The average length of the training and testing data is a few seconds per speaker, for modeling the speaker and carrying out the recognition process. To simulate impostors, 40 unknown speakers (20 female and 20 male) are used, with five utterances spoken by each unknown speaker.

4.2. Modeling Phase

In the front-end part, we extract the features representing the speakers in the three processes detailed in Section 3.1; these parameters are then used as input to the SVM system for the modeling phase. The SVM classifier uses the Radial Basis Function (RBF) kernel with two parameters, Gamma and C, which represent the radius of the RBF and the penalty factor, respectively. Another focus of this paper is the evaluation of the robustness of the proposed system under different noisy conditions. For the noisy environments, two types of

additive noise, produced by babble speech and a factory production vehicle, at SNR levels of {0, 5, 10, 15} dB and derived from the NOISEX-92 database (NATO: AC 243/RSG 10), are added to the test speech signals from the Arabic ARADIGIT database.

4.3. Classification Phase

In this work, the performances of the speaker verification system are evaluated using the following criteria:
• The Detection Error Tradeoff (DET) curve, which is a popular way of graphically representing the performance of speaker verification systems. It is a plot of the false acceptance (FA) rate against the false rejection (FR) rate.
• The Equal Error Rate (EER), which corresponds to the point where the FR rate is equal to the FA rate.
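The EER criterion above can be computed from client and impostor scores by sweeping the decision threshold over all observed scores; a minimal sketch follows (the function name and the scores in the usage test are illustrative, not from the paper's experiments):

```python
import numpy as np

def equal_error_rate(client_scores, impostor_scores):
    """Sweep the decision threshold over all observed scores and return
    the operating point where the false-rejection rate (clients scoring
    below the threshold) is closest to the false-acceptance rate
    (impostors scoring at or above it)."""
    client_scores = np.asarray(client_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    best_gap, best_eer = np.inf, None
    for t in np.sort(np.concatenate([client_scores, impostor_scores])):
        fr = np.mean(client_scores < t)       # false rejections
        fa = np.mean(impostor_scores >= t)    # false acceptances
        if abs(fr - fa) < best_gap:
            best_gap, best_eer = abs(fr - fa), (fr + fa) / 2.0
    return best_eer
```

On finite score sets the FR and FA curves may not cross exactly, so the midpoint of the closest pair is returned, a common practical convention.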

5. EXPERIMENTAL RESULTS

5.1. SVM Speaker Verification using G.729

To show the effectiveness of the proposed approach, some experimental results are presented. The experiments were carried out by testing the system in different configurations and then comparing the performances.

Figure 4. DET curves (false rejection rate vs. false acceptance rate, in %) for Process-A (EER = 0.26%), Process-B (EER = 29.7%), and Process-C (EER = 1.8%).

Figure 4 shows the performance of the automatic SVM speaker verification (ASV) system over the Internet Protocol (VoIP). The system uses 36-dimensional MFCC feature vectors extracted from the Arabic database speech files. The best performance is obtained with the uncoded ARADIGIT data, with an EER below 0.27%. There is a clear drop in accuracy when using the database decoded by G.729 at 8 kbit/s, with an EER of 29.7%. In order to improve the system performance when the G.729 codec is used for speech transmission over IP networks, we propose another method, which consists of using LSF parameters extracted directly from the G.729 bitstream. This method brings an excellent improvement in the SVM system performance: the EER decreases from 29.7% to 1.8% compared with the decoded speech.

5.2. SVM Speaker Verification using G.729 in noisy conditions

The experiments in this section evaluate the robustness of the proposed system under different noisy conditions. The speech data are corrupted by two noisy environments, babble speech and factory noise. Figures 5 and 6 present the experimental results in terms of EER. Clearly, the proposed method using LSF features derived directly from the G.729 bitstream is more robust than the one based on decoded speech. For example, for babble speech noise at SNR = 0 dB, the EER of the proposed method is below 24.8%, whereas it is above 32% for the decoded speech.

Figure 5. Performance evaluation of the SVM system in a noisy environment corrupted by babble speech noise: EER (%) vs. SNR = {0, 5, 10, 15} dB, for LSF (coded bitstream) and MFCC (decoded speech).
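The noisy test material described in Section 4 (NOISEX-92 noise added at fixed SNRs) can be generated with a sketch like the following. This is a generic mixing routine under the assumption that the speech and noise arrays are already loaded and of equal length; the function name is ours.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals
    `snr_db` decibels, then add it to the speech signal."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Required noise gain g: SNR_dB = 10*log10(p_speech / (g^2 * p_noise))
    g = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + g * noise
```

At snr_db = 0 the scaled noise carries exactly as much average power as the speech, matching the most severe condition reported in Figures 5 and 6.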

Figure 6. Performance evaluation of the SVM system in a noisy environment corrupted by factory noise: EER (%) vs. SNR = {0, 5, 10, 15} dB, for LSF (coded bitstream) and MFCC (decoded speech).

6. CONCLUSION

This work presented experimental results for an Automatic Speaker Verification (ASV) system operating on voice transmitted over the Internet (VoIP), based on the G.729 coder. For this purpose, an SVM system was evaluated on an Arabic database, ARADIGIT, built from the ten digits of the Arabic language. Speaker models were trained using 36 MFCC coefficients obtained from the speech synthesized by the G.729 coder, and using 10 LSF features extracted directly from the G.729 bitstream. The obtained results show that the use of LSF parameters extracted directly from the G.729 bitstream provides significantly better performance, in clean environments and under severely degraded noisy conditions, than the conventional MFCC features extracted from the decoded speech. As future work, the system could be optimized using a dimensionality reduction method in the front-end processing.

7. REFERENCES

[1] Z. Jian-wei, S. Shui-fa, L. Xiao-li, and L. Bang-jun, "Pitch in speaker recognition," Ninth International Conference on Hybrid Intelligent Systems, 2009.
[2] ITU-T Recommendation G.729, Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear prediction (CS-ACELP), 1996.
[3] J. P. Campbell, "Speaker and language recognition using speech codec parameters," in Proceedings of Eurospeech, vol. 2, pp. 787–790, 1999.
[4] T. Quatieri, R. Dunn, D. Reynolds, J. Campbell, and E. Singer, "Speaker recognition using G.729 codec parameters," in Proceedings of ICASSP, pp. 89–92, 2000.
[5] W. Campbell, J. Campbell, D. Reynolds, and E. Singer, "Support vector machines for speaker and language recognition," Computer Speech and Language, pp. 210–229, 2006.

[6] K. Prasad, P. Lotia, and M. Khan, "A review on text-independent speaker identification using Gaussian supervector SVM," International Journal of u- and e-Service, Science and Technology, vol. 5, 2012.
[7] D. Reynolds, T. Quatieri, and R. Dunn, "Speaker verification using adapted Gaussian mixture models," Digital Signal Processing, vol. 10, 2000.
[8] V. Wan, "Speaker Verification using Support Vector Machines," PhD thesis, University of Sheffield, United Kingdom, 2003.
[9] E. Yu, M. Mak, C. Sit, and S. Kung, "Speaker Verification Based on G.729 and G.723.1 Coder Parameters and Handset Mismatch Compensation."
[10] A. Amrouche, "Automatic speech recognition using connectionist models (Reconnaissance automatique de la parole par les méthodes connexionnistes)," Doctoral thesis, Faculty of Electronics and Computer Science, USTHB, 2007.