Audio-Visual Multimodal Fusion for Biometric Person Authentication and Liveness Verification

Girija Chetty and Michael Wagner
Human Computer Communication Laboratory
School of Information Sciences and Engineering
University of Canberra, Australia
[email protected]

Abstract

In this paper we propose a multimodal fusion framework based on novel face-voice fusion techniques for biometric person authentication and liveness verification. Checking liveness guards the system against spoof/replay attacks by ensuring that the biometric data is captured from an authorised live person. The proposed framework, based on bi-modal feature fusion, cross-modal fusion and 3D shape and texture fusion techniques, allows a significant improvement in system performance against impostor attacks, type-1 replay attacks (still photo and pre-recorded audio) and the more challenging type-2 replay attacks (CG-animated video created from a still photo and pre-recorded audio), as well as robustness to pose and illumination variations.

Keywords: multimodal fusion, biometric authentication, liveness verification.

1 Introduction

Due to increased security threats, biometric technology is evolving at an enormous pace, and many countries have started using biometrics for border control and national ID cards. Of late, biometric technology is not limited to national security scenarios, but is also being used in a wide range of application domains such as forensics, for criminal identification and prison security, and a number of other civilian applications such as preventing unauthorized access to ATMs, cellular phones, smart cards, desktop PCs, workstations and computer networks (Ross, Prabhakar and Jain, 2003). In addition, there has been a recent surge in the use of biometric technology for conducting transactions via telephone and Internet (electronic commerce and electronic banking), and in automobiles for key-less entry and key-less ignition. Biometric authentication ("Am I whom I claim I am?") involves confirming or denying a person's claimed identity based on his/her physiological or behavioural characteristics (Kittler, Matas and Sanchez, 1997). This method of identity verification is preferred over traditional methods involving passwords and PINs for various reasons: (i) the person to be authenticated is required to be physically present at the point of verification; and (ii) verification based on biometric techniques removes the need to remember a password or carry a token.

Due to the increased use of computers and the Internet for information access, sensitive and personal data of individuals are easily available, and it is necessary to restrict unauthorized access to such private information. Moreover, remembering several PINs and passwords is difficult, and token-based methods of identity verification such as passports and driver's licenses can be forged, stolen or lost. Replacement of PINs and passwords with biometric techniques is hence a more efficient way of preventing unauthorized or fraudulent use of ATMs, cellular phones, smart cards, desktop PCs, workstations and computer networks. Various types of biometric traits can be used for person authentication, such as face, iris, fingerprints, DNA, retinal scan, speech, signature and hand geometry. However, unlike border-control and security applications, civilian, e-commerce and transaction-control applications require several human factors to be taken into consideration, and it is necessary to make use of less intrusive biometric traits. Face and voice biometric systems rate highly in terms of user acceptance and deployment cost, owing to their low intrusiveness and the ready availability of low-cost off-the-shelf system components (Poh and Korczak, 2001). A biometric system is essentially a pattern recognition system which verifies the identity of a person by determining the authenticity of a specific physiological or behavioural characteristic possessed by the user. An important issue in designing a practical biometric system is to determine how an individual can be reliably discriminated from another individual based on these characteristics in the presence of various environmental degradations, and whether these characteristics can be easily faked or spoofed. Various studies (Ross, Prabhakar and Jain, 2003; Kittler, Matas and Sanchez, 1997; Poh and Korczak, 2001) have indicated that no single modality can provide an adequate solution against impostor or spoof attacks. Single-mode systems are in general limited in performance due to unacceptable error rates, sensitivity to noisy biometric data, failure-to-enrol rates and reduced flexibility to offer alternative biometric traits. In order to cope with the limitations of single-mode biometrics, researchers have proposed using multiple biometric traits concurrently for verification. Such systems are commonly known as multimodal person authentication systems (Poh and Korczak, 2001).

By fusing multiple biometric traits, systems gain more immunity to intruder attacks. For an audio-visual person authentication system, for example, it is more difficult for an impostor to impersonate another person using both audio and visual information simultaneously (Cheung, Mak and Kung, 2004). In addition, the fusion of multiple cues, such as those from face and voice, can improve system reliability and robustness. For instance, while background noise has a detrimental effect on the performance of voice biometrics, it does not have any influence on face biometrics; conversely, while the performance of face recognition systems depends heavily on lighting conditions, lighting does not have any effect on voice quality. However, current audio-visual multimodal biometric systems mostly verify a person's face statically, and hence, although these systems may have acceptable performance against impostor attacks, they remain vulnerable to spoof and replay attacks, where a fake biometric is presented by the intruder to access the facility. To resist such attacks, person authentication should include verification of the "liveness" of the biometric data presented to the system (Chetty and Wagner, 2004a). Liveness verification in a biometric system means the capability to detect and verify, during the training/enrolment and testing phases, whether or not the biometric sample presented is alive. The system must be designed to protect against attacks with artificial or synthesized audio and/or video recordings, with checks that ensure that the presented biometric sample belongs to the live human being who was originally enrolled in the system, and not just to any live human being. Until now, although there has been much published research on liveness, for example of fingerprints, research on liveness verification in audio-visual person authentication systems has been very limited. Liveness verification for face-voice person authentication systems should be possible due to the ready availability of synchronous multimodal face-voice data from speaking faces. We propose that novel feature extraction and fusion techniques that uncover the static and dynamic relationships between face and voice biometric information will allow liveness verification to be carried out in person authentication systems. In this paper, the details of the proposed multimodal fusion framework for person authentication and liveness verification, based on novel techniques for the fusion of face and voice features, are described. The techniques, based on static and dynamic bimodal feature fusion, cross-modal face-voice fusion and intra-modal shape and texture fusion using 3D face models, form the core multimodal fusion approaches of the framework and allow a significant enhancement in system performance against impostor and replay attacks. The performance of the proposed feature extraction and multimodal fusion techniques, in terms of equal error rates (EERs) and detection error trade-off (DET) curves, was examined by conducting experiments with three different speaking-face data corpora, described in the next section.

The details of some of the feature extraction and multimodal fusion techniques developed are given in Section 3. The details of the impostor and replay attack experiments, with results, are given in Section 4, followed by conclusions in Section 5.

2 Speaking face data corpus

The speaking-face data from three different corpora, VidTIMIT, UCBN and AVOZES, was used for conducting the impostor and replay attack experiments. The VidTIMIT multimodal person authentication database (Sanderson and Paliwal, 2003) consists of video and corresponding audio recordings of 43 people (19 female and 24 male). The mean duration of each sentence is around 4 seconds, or approximately 100 video frames. A broadcast-quality digital video camera in a noisy office environment was used to record the data. The video of each person is stored as a sequence of JPEG images with a resolution of 512 × 384 pixels, with the corresponding audio provided as a 16-bit 32-kHz mono PCM file. The second type of data used is the UCBN database, a free-to-air broadcast news database. Broadcast news is a continuous source of video sequences, which can be easily obtained or recorded, and has optimal illumination, colour and sound recording conditions. However, some attributes of a broadcast news database, such as near-frontal images, smaller facial regions, multiple faces and complex backgrounds, require an efficient face detection and tracking scheme to be used. The database consists of 20-40 second video clips of anchor persons and newsreaders with frontal/near-frontal shots of 10 different faces (5 female and 5 male). Each video sample is a 25 frames per second MPEG2-encoded stream with a resolution of 720 × 576 pixels, with corresponding 16-bit, 48-kHz PCM audio.

Figure 1: Faces from (a) VidTIMIT, (b) UCBN, (c) AVOZES

The third database used is the AVOZES database, an audio-visual corpus developed for automatic speech recognition research (Goecke and Millar, 2004). The corpus consists of 20 native speakers of Australian English (10 female and 10 male speakers), and the audio-visual data was recorded with a stereo camera system to achieve more accurate 3D measurements on the face. The recordings were made at a 30 Hz video frame rate and a 16-bit 48-kHz mono audio rate in a controlled acoustic environment with no external noise but some background computer and air-conditioning noise.

For each speaker there were 3 spoken utterances, 10 digit sequences, 18 phoneme sequences (CVC words in a carrier phrase) and 22 VCV phoneme sequences (VCV words in a carrier phrase). Figures 1a, 1b and 1c show sample data from the VidTIMIT, UCBN and AVOZES corpora. The three databases represent very different types of speaking-face data: VidTIMIT with original audio recorded in a noisy environment and a clean visual environment, UCBN with clean audio and visual environments but complex visual backgrounds, and AVOZES with stereo face data for better 3D face modeling.

3 Multimodal Fusion Framework

The proposed multimodal fusion framework is based on three core fusion approaches: bimodal feature fusion (BMF), cross-modal fusion (CMF) and 3D multi-modal fusion (3MF). Audio-visual fusion in these three approaches is performed at different levels and with different features, in order to uncover the face-voice relationship needed for checking liveness and establishing the identity of the person. A brief description of the three fusion approaches is given in the next sections.

3.1 Bimodal Feature Fusion (BMF)

The classical approaches to audio-visual multimodal fusion are based on late fusion and its variants, and have been investigated in great depth (Kittler, Matas and Sanchez, 1997; Poh and Korczak, 2001). Late fusion, or fusion at the score level, involves combining the scores of different classifiers, each of which has made an independent decision. This means, however, that many of the correlation properties of the joint audio-video data are lost. Fusion at the feature level (BMF), on the other hand, can substantially improve the performance of multimodal systems, as the feature sets provide a richer source of information than the matching scores, and because in this mode the features are extracted from the raw data and subsequently combined. In addition, feature-level fusion allows the synchronization between closely coupled modalities of a speaking face, such as voice and lip movements, to be preserved throughout the various stages of authentication, facilitating liveness verification in systems that would otherwise be more vulnerable to replay attacks.

3.1.1 Acoustic features

The audio and visual features in the BMF approach were extracted from each frame of the speaking-face video clip, and the joint audio-visual feature vector was formed by direct concatenation of the acoustic and visual features from the lip region. The acoustic features used were Mel-frequency cepstral coefficients (MFCCs) derived from the cepstrum. The pre-emphasized audio signal was processed using a 30 ms Hamming window with one-third overlap, yielding a frame rate of 50 Hz. An acoustic feature vector was determined for each frame by warping 512 spectral bands into 30 Mel-spaced bands and computing 8 MFCCs.

Cepstral mean normalization was performed on all MFCCs before they were used for training, testing and evaluation. Before extracting the MFCCs, the audio files from the two databases, VidTIMIT and UCBN, were mixed with acoustic noise at a signal-to-noise ratio of 6 dB. Channel effects with a telephone-line filter were then added to the noisy PCM files to simulate channel mismatch.
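As an illustration of the acoustic front-end described above, the following Python sketch computes 8 MFCCs over 30 ms Hamming windows at a 50 Hz frame rate and applies cepstral mean normalisation. It is a minimal sketch only: the use of librosa, the pre-emphasis call and the exact filterbank settings are assumptions rather than the original implementation.

```python
import numpy as np
import librosa

def extract_mfcc(wav_path, sr=32000, n_mfcc=8, n_mels=30):
    # Load the (VidTIMIT-style) 32 kHz mono PCM file and apply pre-emphasis.
    y, sr = librosa.load(wav_path, sr=sr)
    y = librosa.effects.preemphasis(y)
    win = int(0.030 * sr)              # 30 ms Hamming window
    hop = int(0.020 * sr)              # one-third overlap -> 20 ms hop, i.e. 50 Hz frame rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=win,
                                win_length=win, hop_length=hop,
                                window='hamming', n_mels=n_mels)
    # Cepstral mean normalisation: remove the per-utterance mean of each coefficient.
    mfcc -= mfcc.mean(axis=1, keepdims=True)
    return mfcc.T                      # shape (frames, 8)
```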

3.1.2 Visual features

The visual features used were geometric and intensity features extracted from the lip region of each face in the video frame. Before the lip-region features can be extracted, faces need to be detected and tracked. Face detection in video was based on skin-colour analysis in the red-blue chrominance colour space, followed by deformable template matching with an average face, and finally verification with rules derived from the spatial and geometric relationships of the facial components. The lip region was determined using derivatives of hue and saturation functions, combined with geometric constraints. Figures 2(a) to 2(c) show some of the results of the face detection and lip feature extraction stages; the scheme is described in more detail in (Chetty and Wagner, 2004b). Similarly to the audio files, the video data in both databases were mixed with artificial visual artefacts, namely Gaussian blur and Gaussian noise, using a visual editing tool (Adobe Photoshop). The "Gaussian Blur" setting was 1.2 and the "Gaussian Noise" setting was 1.6.
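The chrominance-based skin detection step can be sketched as follows. The Cr/Cb thresholds shown are commonly used values and are an assumption, and the subsequent template-matching, rule-based verification and hue/saturation lip segmentation stages are not reproduced here.

```python
import cv2
import numpy as np

def skin_mask(frame_bgr, cr_range=(133, 173), cb_range=(77, 127)):
    # Skin-colour analysis in the red-blue chrominance (Cr/Cb) plane.
    ycrcb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2YCrCb)
    _, cr, cb = cv2.split(ycrcb)
    mask = ((cr >= cr_range[0]) & (cr <= cr_range[1]) &
            (cb >= cb_range[0]) & (cb <= cb_range[1])).astype(np.uint8) * 255
    # Morphological opening removes small speckles before the face blob is located.
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
```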


To evaluate the power of the feature-level BMF approach in preserving audio-visual synchrony, and hence verifying liveness, experiments were conducted with both BMF and late fusion of the audio-visual features. For bimodal feature fusion, the audio-visual fusion involved a concatenation of the audio features (8 MFCCs) and visual features (10 eigen-lip projections + 6 lip dimensions), and the combined feature vector was then fed to a GMM classifier. The audio features, acquired at 50 Hz, and the visual features, acquired at 25 Hz, were appropriately rate-interpolated to obtain synchronized joint audio-visual feature vectors.

For late fusion, the audio and visual features were fed to independent GMM classifiers, and the weighted scores, with weight α (Sanderson and Paliwal, 2004), from each stage were fed to a weighted-sum fusion unit. Figure 3 shows the various sections of the bimodal feature fusion (BMF) module.
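The two fusion modes compared above can be sketched as follows, assuming the per-frame features have already been extracted and 10-component Gaussian mixture models have been trained per client and per modality; the linear interpolation used to upsample the 25 Hz visual stream, and the convention that α weights the visual score, are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def upsample_visual(visual_25hz, n_audio_frames):
    # Interpolate 25 Hz visual features to the 50 Hz audio frame rate.
    t_src = np.linspace(0.0, 1.0, len(visual_25hz))
    t_dst = np.linspace(0.0, 1.0, n_audio_frames)
    return np.column_stack([np.interp(t_dst, t_src, visual_25hz[:, d])
                            for d in range(visual_25hz.shape[1])])

def feature_fusion_score(mfcc, visual, gmm_av):
    # BMF: concatenate 8 MFCCs with 16 visual features (10 eigen-lips + 6 lip dimensions).
    joint = np.hstack([mfcc, upsample_visual(visual, len(mfcc))])
    return gmm_av.score(joint)                      # mean log-likelihood per frame

def late_fusion_score(mfcc, visual, gmm_a, gmm_v, alpha=0.75):
    # Late fusion: weighted sum of independent audio and visual GMM scores.
    return alpha * gmm_v.score(visual) + (1.0 - alpha) * gmm_a.score(mfcc)
```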

Figure 3: Bi-modal Feature Fusion module

3.2 Cross-Modal Fusion (CMF)

In the cross-modal fusion approach, the proposed features detect the liveness of the biometric information by extracting face-voice synchrony information in a cross-modal space. The cross-modal features proposed are based on latent semantic analysis (LSA), involving singular value decomposition of the joint face-voice feature space, and canonical correlation analysis (CCA), based on optimising cross-correlations in a rotated audio-visual subspace. Latent semantic analysis is a powerful tool used in text information retrieval to discover underlying semantic relationships between different textual units (Deerwester, Dumais, Furnas, Landauer and Harshman, 2001). The LSA technique achieves three goals: dimension reduction, noise removal and the uncovering of the hidden semantic relations between different objects such as keywords and documents. In our current context, we use LSA to uncover the synchrony between the image and audio features in a video sequence. The method consists of four steps: construction of a joint multimodal feature space, normalisation, singular value decomposition and semantic association measurement. Canonical correlation analysis, an equally powerful multivariate statistical technique, attempts to find a linear mapping that maximizes the cross-correlation between two feature sets (Borga and Knutsson, 1998). It finds the transformation that can best represent (or identify) the coupled patterns between the features of two different subsets. A set of linear basis functions, having a direct relation to maximum mutual information, is obtained in each signal space, such that the correlation matrix between the signals described in the new bases is diagonal. The basis vectors can be ordered such that the first pair of vectors wx1 and wy1 maximizes the correlation between the projections (xTwx1, yTwy1) of the signals x and y onto the two vectors. A subset containing the first k pairs of vectors defines a linear rank-k relation between the sets that is optimal in a correlation sense. In other words, it gives the linear combination of one set of variables that is the best predictor and, at the same time, the linear combination of the other set which is most predictable. It has been shown that finding the canonical correlations is equivalent to maximizing the mutual information between the sets if the underlying distributions are elliptically symmetric (Borga and Knutsson, 1998).

Figure 4 shows the processing stages for cross-modal feature extraction. The cross-modal feature extractor computes LSA and CCA feature vectors from low-level visual and audio features. The visual features are 20 PCA (eigenface) coefficients, and the audio features are 12 MFCC coefficients. Based on preliminary experiments, fewer than 10 LSA and CCA features are normally found to be sufficient to achieve good performance. This is a significant reduction in feature dimension compared with the 32-dimensional audio-visual feature vector formed by concatenated bimodal feature fusion of the 20 PCA and 12 MFCC vectors.
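A hedged sketch of the two cross-modal feature extractors is given below: LSA as a truncated singular value decomposition of the normalised joint audio-visual matrix, and CCA between the two streams using scikit-learn. The feature dimensions (20 eigenface + 12 MFCC coefficients, reduced to 8 cross-modal features) follow the text; the normalisation and library choices are assumptions.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def lsa_features(pca_visual, mfcc_audio, k=8):
    # Rows are synchronised frames; columns are the joint 20+12 audio-visual features.
    joint = np.hstack([pca_visual, mfcc_audio])
    joint = (joint - joint.mean(0)) / (joint.std(0) + 1e-8)     # normalisation step
    u, s, _ = np.linalg.svd(joint, full_matrices=False)         # singular value decomposition
    return u[:, :k] * s[:k]                                     # rank-k latent representation

def cca_features(pca_visual, mfcc_audio, k=8):
    # Canonical correlation analysis: projections that maximise audio-visual correlation.
    cca = CCA(n_components=k)
    vis_c, aud_c = cca.fit_transform(pca_visual, mfcc_audio)
    return np.hstack([vis_c, aud_c])
```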

Figure 4: Cross-modal Fusion module

3.3 3D Multi-Modal Fusion (3MF)

In this approach, shape and texture features from 3D face models are extracted and fused with acoustic features. Before three-dimensional features can be extracted, a 3D face model needs to be developed using an appropriate modelling technique based on the facial information available in the data corpus. The VidTIMIT database, for example, consists of frontal- and profile-view images of the faces, while the AVOZES data comprises left and right images of the faces. The 3D face modeling was based on the approaches proposed by Gordon (1995) and Hsu and Jain (2001). The 3D face modeling algorithm starts by computing the 3D coordinates of automatically extracted facial feature points. Correspondence between the feature points in the two images is established using epipolar constraints, and depth information is then computed using perspective projection, from the front and profile views for VidTIMIT faces and from the left and right views for AVOZES faces. The 3D coordinates of the selected feature points are then used to deform a generic 3D face model to obtain a person-specific 3D face model. Figure 5 shows a sample frontal and profile face from the VidTIMIT database and the 3D face model developed.
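The feature-point triangulation step can be sketched as follows, assuming calibrated projection matrices for the two views (frontal/profile or left/right) and already-matched 2D feature points; the deformation of the generic model is not shown, and the use of OpenCV is an assumption.

```python
import numpy as np
import cv2

def triangulate_feature_points(P1, P2, pts1, pts2):
    # P1, P2: 3x4 projection matrices of the two views; pts1, pts2: (N, 2) matched points.
    X_h = cv2.triangulatePoints(P1, P2,
                                pts1.T.astype(np.float64),
                                pts2.T.astype(np.float64))   # homogeneous 4xN result
    return (X_h[:3] / X_h[3]).T                              # (N, 3) 3D feature coordinates
```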

The techniques proposed to date for the processing and integration of shape and texture features lie in the 3D face recognition domain and have evolved under the assumption that there is no correlation between the shape and texture features of a 3D face. This might be true for static 3D faces, and most research efforts so far have mainly addressed recognition of still 3D faces (Hsu and Jain, 2001). But a speaking face is a kinematic-acoustic system in motion, and the shape, texture and acoustic features during speech production must be correlated in some way. Studies carried out by Yehia, Rubin and Vatikiotis-Bateson (1998) and Yehia, Kuratate and Vatikiotis-Bateson (2002) have demonstrated this correlation on the basis of the anatomical fact that a single neuromotor source controlling the vocal tract behaviour is responsible for both the acoustic and the visible attributes of speech production. Hence, for a speaking face, not only are the facial motion and speech acoustics correlated, but the head motion and the fundamental frequency (F0) produced during speech are also related. Though there is no clear and distinct neuromotor coupling between head motion and speech acoustics, there is an indirect anatomical coupling created by the complex of strap muscles running between the floor of the mouth, through the thyroid bone, and attaching to the outer edge of the cricothyroid cartilage. Due to this indirect coupling, a speaker tends to raise the pitch when the head goes up while talking. The head motion can be modeled by tracking 3D face shapes together with the complementary and synchronous 2D facial feature variation and 1D acoustic variation. This unique and rich information is normally person-specific and cannot easily be spoofed, either by a real impostor or by CG-animated speaking faces. Hence a multimodal fusion of 3D shape, texture and acoustic features can enhance the performance of face-voice authentication systems and check the liveness of the biometric data presented to the system.

Figure 5: 3D face model for a VidTIMIT face

The major deformations of a speaking face occur in the lower part of the face compared to the rest of the face; hence the lower half of the face was used for 3D multimodal fusion. The lower part of the face was modeled using 128 vertices and 200 surfaces. This implies the fusion of the acoustic vector with a 128-dimensional shape (X, Y, Z) vector and a texture feature vector of similar size, which is too large a dimension for reasonable performance to be achieved. However, after principal component analysis (PCA) of the shape vector and the texture vector separately, we found that about 6-8 principal components of the shape vector and 3-4 components of the texture vector explain more than 95% of the variation in lip shapes and appearances during spoken utterances of most English-language sentences.
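The component selection described above can be reproduced with a standard PCA, keeping just enough eigen-shape or eigen-texture components to reach the 95% variance figure quoted in the text; the data layout assumed here is one flattened lower-face vector per video frame.

```python
import numpy as np
from sklearn.decomposition import PCA

def reduce_components(vectors, variance_target=0.95):
    # vectors: (frames, D) flattened shape or texture vectors of the lower face.
    pca = PCA(n_components=variance_target, svd_solver='full')
    coeffs = pca.fit_transform(vectors)       # typically 6-8 (shape) or 3-4 (texture) columns
    return coeffs, pca
```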

Figure 6: Principal visemes during English speaking

The 8 eigenvalues of the shape vector correspond to jaw opening/closing, lip protrusion/retraction, lip opening/closing and jaw protrusion/retraction, as shown in Figure 6. Similarly, the 3-4 eigenvalues of the texture vector describe most of the appearance variations, mainly those corresponding to a rounded viseme with closed lips (e.g. ['u']), a rounded viseme with open lips, and a spread viseme with spread lips (e.g. ['i']). The 18-dimensional audio-visual feature vector for the 3D multimodal fusion module was constructed by concatenating 8 MFCCs + 1 F0 feature, 6 eigen-shape features and 3 eigen-texture features. The fundamental frequency F0 was computed by the autocorrelation method.
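A minimal sketch of the autocorrelation-based F0 estimate and of assembling the 18-dimensional fused vector is shown below; the 60-400 Hz search range and the unvoiced-frame handling are assumptions.

```python
import numpy as np

def f0_autocorrelation(frame, sr, fmin=60.0, fmax=400.0):
    # Pick the autocorrelation peak within the assumed pitch range.
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag if ac[lag] > 0 else 0.0    # report 0 for weak/unvoiced frames

def fuse_3mf(mfcc8, f0, eigen_shape6, eigen_texture3):
    # 8 MFCCs + 1 F0 + 6 eigen-shape + 3 eigen-texture = 18-dimensional vector.
    return np.concatenate([mfcc8, [f0], eigen_shape6, eigen_texture3])
```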

4 Liveness Experiments

To investigate the potential of the proposed fusion approaches, that is, bimodal feature fusion, cross-modal fusion and 3D multimodal fusion, different sets of experiments were conducted. In the training phase, a 10-Gaussian mixture model of each client's feature vectors was built for each fusion approach by constructing a gender-specific universal background model (UBM) and then adapting the UBM to each client by MAP adaptation (Reynolds and Dunn, 2000). In the test phase, the clients' live test recordings were evaluated against a client's model λ by determining the log-likelihoods log p(X|λ) of the time sequences X of audio-visual feature vectors. A Z-norm based approach (Auckenthaler and Carey, 1999) was used for score normalization. For testing replay attacks, two types of replay attack experiments were conducted. For type-1 replay attacks, a number of "fake" recordings were constructed by combining the sequence of audio feature vectors from each test utterance with ONE visual feature vector chosen from the sequence of visual feature vectors. Such a fake sequence represents an attack on the authentication system carried out by replaying an audio recording of the client's utterance while presenting a still photograph to the camera. Four such fake audio-visual sequences were constructed from different still frames of each client test recording. Log-likelihoods log p(X'|λ) were computed for the fake sequences X' of audio-visual feature vectors against the client model λ. For type-2 replay attacks, a synthetic video clip was constructed from a still photo of each speaker. This represents the scenario of a replay attack in which an impostor presents a fake video clip constructed from pre-recorded audio and a still photo of the client, animated with facial movements and voice-synchronous lip movements. The still photo of each client was voice-synched with the speech signal of each speaker using a set of commercial software tools (Adobe Photoshop Elements, Discreet 3DSMax and Adobe After Effects). We constructed several fake video clips by extracting ONE face (the first face) from the video sequence to act as a key frame, animating the lip region of the key frame by phoneme-to-viseme mapping, adding random deformations and movements to the face, and finally rendering the lip and face movements together with the speech as a new video clip.

The synthesized fake clip visually emulates a normal talking head with facial and head movements in three-dimensional space, in synchronism with the spoken utterance. Performance in terms of DET curves and EERs was examined for text-dependent and text-independent experiments. For all experiments, the threshold was set using data from the test set. The results obtained for each of the fusion approaches are described next.
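The type-1 trial construction and the score normalisation can be sketched as follows, assuming rate-matched audio and visual feature sequences and an already-trained client GMM; this is an illustration of the protocol, not the original implementation.

```python
import numpy as np

def make_type1_fake(mfcc, visual, still_frame_index=0):
    # Keep the replayed audio frames but freeze the visual stream on one still frame.
    frozen = np.tile(visual[still_frame_index], (len(mfcc), 1))
    return np.hstack([mfcc, frozen])            # fake joint audio-visual sequence X'

def znorm(raw_score, impostor_scores):
    # Z-norm: normalise a log-likelihood score by impostor score statistics.
    return (raw_score - np.mean(impostor_scores)) / (np.std(impostor_scores) + 1e-8)
```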

4.1.1 Results for BMF approach

For the bimodal feature fusion approach, only type-1 replay attacks were studied, and training was done with the VidTIMIT and UCBN corpora, with pose and illumination normalization of the faces. For VidTIMIT, 24 male and 19 female clients were used to create separate gender-specific universal background models. The first two utterances, which are common to all speakers in the corpus, were used for text-dependent experiments, and 6 different utterances for each speaker allowed text-independent verification experiments to be conducted. For the text-independent experiments, four utterances from session 1 were used for training and four utterances from sessions 2 and 3 were used for testing.


Table 1: Number of Client and Replay attack trials

For the UCBN database, the training data for both text-dependent and text-independent experiments contained 15 utterances from 5 male and 5 female speakers, with 5 utterances for testing, each recorded in a different session. The utterances were of 20-second duration for the text-dependent experiments and of 40-second duration in the text-independent mode. Similarly to VidTIMIT, separate UBMs for the male and female cohorts were created for the UCBN data. Table 1 shows the number of client trials and replay attack trials conducted for examining the performance of the bimodal feature fusion module. The first row of Table 1 refers to experiments with the VidTIMIT database in text-dependent mode for a male-only cohort, comprising a total of 48 client trials (24 clients × 2 utterances per client) and 192 replay attack trials (24 clients × 2 utterances × 4 fake sequences per client).

As a baseline performance measure, both late fusion and feature fusion experiments were conducted with the concatenated audio-visual feature vector described in Section 3.1. The results for the DB1TIMO (VidTIMIT database, text-independent, male-only cohort) and DB2TDFO (UCBN database, text-dependent, female-only cohort) experiments are reported here. All late fusion experiments used varying combination weights α for combining the audio and visual scores, with α varied from 0 to 1 and higher α giving more weight to the visual scores. As shown in Table 2, the baseline EER achieved with feature fusion is 3.65% for DB1TIMO and 2.55% for DB2TDFO, as compared to 8.1% (DB1TIMO) and 6.8% (DB2TDFO) achieved with late fusion at α = 0.75. Table 2 also shows the behaviour of the system when subjected to different types of environmental degradation, as well as the EER sensitivity to variations in training data size. Once again, feature-level fusion outperforms late fusion under both acoustic and visual degradations. When mixed with acoustic noise (factory noise at 6 dB SNR plus channel effects), feature fusion allows a performance improvement of the order of 38% compared to late fusion with α = 0.25, and 18% compared to late fusion with α = 0.75. When mixed with visual artefacts, the improvement achieved with feature fusion is about 30.4% compared to late fusion with α = 0.25, and 18.9% compared to late fusion with α = 0.75. Table 2 lists the baseline EERs and the EERs obtained with the inclusion of visual artefacts, acoustic noise and shorter training data, together with the corresponding drop in performance for both late fusion and feature fusion.
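The EER figures quoted here can be computed from lists of genuine and attack scores as in the following sketch, which uses scikit-learn's ROC utilities; the interpolation-free operating-point search is a simplification.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(genuine_scores, attack_scores):
    labels = np.concatenate([np.ones(len(genuine_scores)), np.zeros(len(attack_scores))])
    scores = np.concatenate([genuine_scores, attack_scores])
    far, tar, _ = roc_curve(labels, scores)     # false-accept and true-accept rates
    frr = 1.0 - tar                             # false-reject rate
    idx = np.argmin(np.abs(far - frr))          # point where FAR and FRR are closest
    return (far[idx] + frr[idx]) / 2.0
```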

Table 2: Relative performance of BMF with acoustic noise, visual artefacts and variation in training data size

The influence of training-utterance length variation on system performance is quite remarkable and differs from the other effects: the system is more sensitive to utterance-length variation in feature fusion mode than in late fusion mode (Table 2). For DB1TIMO, the drop in performance is smaller for late fusion (9.46% at α = 0.75 and 26.57% at α = 0.25) than the 42.32% drop for feature fusion; likewise, for DB2TDFO the drops are 12.15% and 24.53% for late fusion as compared to a 40.96% drop for feature fusion. The utterance length was varied from 4 seconds to 1 second for DB1TIMO and from 20 seconds to 5 seconds for DB2TDFO. This drop in performance is due to the larger dimensionality of the joint audio-visual feature vectors used (8 MFCCs + 10 eigen-lips + 6 lip dimensions), combined with the shorter utterance length, which does not appear to be sufficient to establish the audio-visual synchrony.

4.1.2 Results for CMF approach

For the cross-modal fusion approach, both type-1 and type-2 replay attacks were studied, and training was done with the VidTIMIT and UCBN corpora, with pose and illumination normalization of the faces.

The performance of the LSA and CCA features of the CMF module was compared with the concatenated BMF features (20 PCA + 12 MFCC) as a baseline. The EER results in Table 3 show the potential of the LSA and CCA features of the CMF module over the BMF module for type-1 replay attacks: an improvement of 80% with 8-dimensional LSA features and 60% with 8-dimensional CCA features is achieved over the concatenated 32-dimensional BMF fusion approach.

Table 3: EERs for type-1 replay attacks

Table 5: EERs for type-2 replay attacks

Table 5 shows the improvement in error rates achieved for type-2 replay attacks: approximately 43% improvement in EER with 8-dimensional LSA features and 22% with 8-dimensional CCA features. This is a remarkable improvement, due to the ability of the LSA and CCA features to detect the mismatch in synchrony in video replay attacks.

4.1.3 Results for 3MF approach

For the 3D multimodal fusion approach, impostor attacks and type-1 and type-2 replay attacks were studied, and training was done with the VidTIMIT and AVOZES corpora. The results for two types of data only, DB1TIMO (VidTIMIT database, text-independent, male-only cohort) and DB2TDFO (AVOZES database, text-dependent, female-only cohort), are reported here. For both types of data, both late fusion and feature-level fusion of the shape and texture features were examined; for late fusion, equal weights for the shape and texture scores were used. Table 6 shows the number of client, impostor and replay attack trials for the 3MF module.

Table 6: Number of Client, Impostor and Replay attack trials for 3MF module

The DET curve and EER results in Table 7 and Figure 7 show the potential of the proposed fusion of 3D eigen-shape and eigen-texture features with acoustic features (MFCC + F0) in the 3MF approach to thwart impostor and replay attacks for the VidTIMIT and AVOZES data without pose and illumination normalization. For the VidTIMIT corpus, less than 1% EER was achieved, with 0.92% for late fusion and 0.64% for feature fusion.

Figure 7: DET curves for Type-1 TD tests, (a) male subjects in VidTIMIT, (b) female subjects in UCBN

Table 7: EERs for type-2 replay attacks

Feature fusion performs better, with a 30% improvement compared to late fusion, due to the synchronous processing of the eigen-shape, eigen-texture and acoustic features. For the AVOZES corpus, the EER achieved is 1.24% with feature fusion as compared to 1.53% with late fusion, an improvement of about 20%. For type-1 replay attacks, less than 1% EER is achieved for both VidTIMIT and AVOZES, with feature fusion performing better than late fusion (48% improvement for the VidTIMIT data vs. 38% for the AVOZES data). Less than 7% EER is achieved for type-2 replay attacks for both the VidTIMIT and AVOZES data, with the best EER of 1.9% for the VidTIMIT TIMO data and the worst EER of 6.45% for the AVOZES TDFO data. The fusion of acoustic features with three-dimensional shape and texture features allowed significantly better performance and robustness to pose and illumination variations, even though type-2 replay attacks are the more complex replay attacks to detect.

5 Conclusions

In this paper we have shown the potential of a multimodal fusion framework with several new features and different fusion techniques for biometric person authentication and liveness verification. For the BMF module, feature-level fusion of the audio-visual feature vectors substantially improves the performance of a face-voice authentication system for checking liveness and thwarting replay attacks; however, the sensitivity of the BMF module to variations in the size of the training data has also been recognized. For the CMF module, the two new cross-modal features, LSA and CCA, have the power to thwart type-2 replay attacks: about 42% overall improvement in error rate with CCA features and 61% improvement with LSA features is achieved as compared to feature-level fusion of image-PCA and MFCC face-voice feature vectors. For the 3MF module, features based on three-dimensional face modeling perform better against impostor and replay attacks. The multimodal feature fusion of acoustic, 3D shape and texture features allowed an improvement of 25-40% over the CMF features, with less than 1% EER for type-1 replay attacks and less than 7% EER for the more difficult type-2 replay attacks, a significantly better performance.

6 References

Auckenthaler, R., Parris, E. and Carey, M., "Improving a GMM Speaker Verification System by Phonetic Weighting", Proceedings ICASSP'99, pp. 1440-1444, 1999.
Borga, M. and Knutsson, H., "An Adaptive Stereo Algorithm Based on Canonical Correlation Analysis", Proceedings of the Second IEEE International Conference on Intelligent Processing Systems, pp. 177-182, August 1998.
Chetty, G. and Wagner, M., "'Liveness' Verification in Audio-Video Authentication", Proceedings Int. Conf. on Spoken Language Processing ICSLP-04, Jeju, Korea, pp. 2509-2512, 2004a.
Chetty, G. and Wagner, M., "Automated Lip Feature Extraction for Liveness Verification in Audio-Video Authentication", Proceedings Image and Vision Computing New Zealand 2004, pp. 17-22, 2004b.
Cheung, M.C., Yiu, K.K., Mak, M.W. and Kung, S.Y., "Multi-sample Fusion with Constrained Feature Transformation for Robust Speaker Verification", Proceedings Odyssey'04 Conference, 2004.
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K. and Harshman, R., "Indexing by Latent Semantic Analysis", Journal of the American Society for Information Science, 41(6), pp. 391-407, 2001.
Goecke, R. and Millar, J.B., "The Audio-Video Australian English Speech Data Corpus AVOZES", Proceedings of the 8th International Conference on Spoken Language Processing INTERSPEECH 2004 - ICSLP, Volume III, pp. 2525-2528, 4-8 October 2004.
Gordon, G., "Face Recognition from Frontal and Profile Views", Proceedings Int'l Workshop on Face and Gesture Recognition, Zurich, pp. 47-52, 1995.
Hsu, R.L. and Jain, A.K., "Face Modeling for Recognition", Proceedings Int'l Conf. Image Processing, ICIP, Greece, 7-10 October 2001.
Kittler, J., Matas, G., Jonsson, K. and Sanchez, M., "Combining Evidence in Personal Identity Verification Systems", Pattern Recognition Letters, 18(9), pp. 845-852, September 1997.
Poh, N. and Korczak, J., "Hybrid Biometric Person Authentication Using Face and Voice Features", Proceedings Int. Conf. on Audio- and Video-Based Biometric Person Authentication, Halmstad, Sweden, pp. 348-353, June 2001.
Reynolds, D., Quatieri, T. and Dunn, R., "Speaker Verification Using Adapted Gaussian Mixture Models", Digital Signal Processing, 10(1-3), pp. 19-41, 2000.
Ross, A. and Jain, A.K., "Information Fusion in Biometrics", Pattern Recognition Letters, 24(13), pp. 2115-2125, September 2003.
Sanderson, C. and Paliwal, K.K., "Fast Features for Face Authentication under Illumination Direction Changes", Pattern Recognition Letters, 24, pp. 2409-2419, 2003.
Yehia, H., Rubin, P. and Vatikiotis-Bateson, E., "Quantitative Association of Vocal Tract and Facial Behavior", Speech Communication, 26(1-2), pp. 23-43, 1998.
Yehia, H., Kuratate, T. and Vatikiotis-Bateson, E., "Linking Facial Animation, Head Motion and Speech Acoustics", Journal of Phonetics, 30(3), 2002.