A Review of Speech-Based Bimodal Recognition

Claude C. Chibelushi, Farzin Deravi, Member, IEEE, and John S. D. Mason

Abstract—Speech recognition and speaker recognition by machine are crucial ingredients for many important applications such as natural and flexible human-machine interfaces. Most developments in speech-based automatic recognition have relied on acoustic speech as the sole input signal, disregarding its visual counterpart. However, recognition based on acoustic speech alone can be afflicted with deficiencies that preclude its use in many real-world applications, particularly under adverse conditions. The combination of auditory and visual modalities promises higher recognition accuracy and robustness than can be obtained with a single modality. Multimodal recognition is therefore acknowledged as a vital component of the next generation of spoken language systems. This paper reviews the components of bimodal recognizers, discusses the accuracy of bimodal recognition, and highlights some outstanding research issues as well as possible application domains. Index Terms—Audio-visual fusion, joint media processing, multimodal recognition, speaker recognition, speech recognition.

I. INTRODUCTION

A MIXTURE of verbal and nonverbal cues is used in human social interactions, where such cues are often exploited together. Audible and visible speech cues carry linguistic and para-linguistic information, which can be extracted and used in machine-based recognition. Typical para-linguistic information includes the identity, gender, age, and physical and emotional state of the speaker. The extraction and analysis of such information is the target of research on audio-visual signal processing, which is gaining momentum in areas such as recognition, synthesis, and compression. In particular, automatic speech recognition or speaker recognition, involving the use of acoustic speech and its visible correlates, has become a mainstream research area.

This paper uses the following terminology. Visual speech stands for the configuration of visible articulators—lips, teeth, jaw, and part of the tongue—during speech; it is also known as visible speech [67], [88]. Medium refers to an information carrier such as sound, text, and pictures. Modality is used in the context of the sensory channels of hearing, vision, touch, taste, and smell. Bimodal recognition means recognition which is based on two modalities.

Manuscript received August 5, 1999; revised December 13, 2000. The associate editor coordinating the review of this paper and approving it for publication was Dr. Hong-Yuan Mark Liao. C. C. Chibelushi is with the School of Computing, Staffordshire University, Beaconside, Stafford ST18 0DG, U.K. (e-mail: [email protected]). F. Deravi is with the Electronic Engineering Laboratory, University of Kent at Canterbury, Canterbury, Kent CT2 7NT, U.K. (e-mail: [email protected]). J. S. D. Mason is with the Department of Electrical and Electronic Engineering, University of Wales Swansea, Swansea SA2 8PP, U.K. (e-mail: [email protected]). Publisher Item Identifier S 1520-9210(02)01393-7.

In this review, bimodal recognition is narrowed down to automatic speech or speaker recognition based on acoustic and visual speech.

A. Motivation for Bimodal Recognition

Speech recognition can be used wherever speech-based man-machine communication is appropriate. Speaker recognition has potential application wherever the identity of a person needs to be determined (identification task) or an identity claim needs to be validated (identity verification task). Possible applications of bimodal recognition are: speech transcription; adaptive human-computer interfaces in multimedia computer environments; voice control of office or entertainment equipment; and access control for buildings, computer resources, or information sources.

Bimodal recognition tries to emulate the multimodality of human perception. It is known that all sighted people rely, to a varying extent, on lipreading to enhance speech perception or to compensate for the deficiencies of audition [116]. Lipreading is particularly beneficial when the listener suffers from impaired hearing or when the acoustic signal is degraded [115], [117].

Sensitivity to speech variability, inadequate recognition accuracy for many potential applications, and susceptibility to impersonation are among the main technical hurdles preventing a widespread adoption of speech-based recognition systems. The rationale for bimodal recognition is to improve recognition performance in terms of accuracy and robustness against speech variability and impersonation. Compared to speech or speaker recognition that uses only one primary source, recognition based on information extracted from two primary sources can be made more robust to impersonation and to speech variability, which has a different effect on each modality. Automatic person recognition based on still two-dimensional (2-D) facial images is vulnerable to impersonation attempts using photographs or by professional mimics wearing appropriate disguise. In contrast to static personal characteristics, dynamic characteristics such as visual speech are difficult to mimic or reproduce artificially. Hence, dynamic characteristics offer a higher potential for protection against impersonation than static characteristics.

Given the potential gains promised by the combination of modalities, multimodal systems have been identified, by many experts in spoken language systems [28], as a key area which requires basic research in order to catalyze a widespread deployment of spoken language systems in the “real world.”

B. Outline of the Review

This review complements earlier surveys on related themes [20], [62], [86], [114]. The history of automatic lipreading research is outlined in [86], which does not cover audio processing, sensor fusion, or bimodal speaker recognition.

Fig. 1. Simplified general architecture for bimodal recognition. (a) Feature fusion. (b) Decision fusion. In practice, similar speech processing techniques are used for speech recognition and speaker recognition. Front-end processing converts raw speech into a high-level representation, which ideally retains only essential information for pattern categorization. The latter is performed by a classifier, which often consists of models of pattern distribution, coupled to a decision procedure. The block generically labeled “constraints” typically represents domain knowledge, such as syntactic or semantic knowledge, which may be applied during the recognition. Sequential or tree configurations of modality-specific classifiers are possible alternatives to the decision fusion of parallel classifiers shown in (b). Audio-visual fusion can also occur at a level between feature and decision levels.
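To make the two configurations of Fig. 1 concrete, the following minimal sketch (in Python; the classifier objects and variable names are hypothetical, not taken from any reviewed system) contrasts feature fusion, where acoustic and visual feature vectors are concatenated and scored by a single bimodal classifier, with decision fusion, where modality-specific classifier scores are combined afterwards.

```python
import numpy as np

def feature_fusion(acoustic_feat, visual_feat, joint_classifier):
    """Feature (low-level) fusion: concatenate the per-frame feature vectors
    and let a single bimodal classifier score the joint vector."""
    joint_feat = np.concatenate([acoustic_feat, visual_feat])
    return joint_classifier(joint_feat)

def decision_fusion(acoustic_feat, visual_feat,
                    audio_classifier, video_classifier, weight=0.5):
    """Decision (high-level) fusion: each modality is classified separately
    and the per-class scores are combined, here by a weighted sum."""
    audio_scores = audio_classifier(acoustic_feat)   # e.g., per-class log-likelihoods
    video_scores = video_classifier(visual_feat)
    return weight * audio_scores + (1.0 - weight) * video_scores
```

Fusion at an intermediate level, or hybrid schemes, would sit between these two extremes, as discussed in Section IV.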

The overview given in [114] covers speechreading (it pays particular attention to visual speech processing), but it does not cover audio processing and bimodal speaker recognition. Reference [62] centers on the main attributes of neural networks as a core technology for multimedia applications which require automatic extraction, recognition, interpretation, and interactions of multimedia signals. Reference [20] covers the wider topic of audio-visual integration in multimodal communication, encompassing recognition, synthesis, and compression. This paper focuses on bimodal speech and speaker recognition. Given the multidisciplinary nature of bimodal recognition, the review is broad-based. It is intended to act as a shop-window for techniques that can be used in bimodal recognition. However, paper length restrictions preclude an exhaustive coverage of the field.

Fig. 1 shows a simplified architecture commonly used for bimodal recognition. The structure of the review is a direct mapping from the building blocks of this architecture. The paper is organized as follows. First, the processing techniques are discussed. Thereafter, bimodal recognition performance is reviewed. Sample applications and avenues for further work are then suggested. Finally, concluding remarks are given.

II. FRONT-END PROCESSING

The common front-end processes for speech-based recognition are signal conditioning, segmentation, and feature extraction. Signal conditioning typically takes the form of noise removal. Segmentation is concerned with the demarcation of signal portions conveying relevant acoustic or visual speech. Feature extraction generally acts as a dimensionality reduction procedure which, ideally, retains information possessing high discrimination power, high stability, and, for speaker recognition, good resistance to mimicry. Dimensionality reduction may mitigate the “curse of dimensionality.” The latter refers to the relationship between the dimension of the input pattern space

and the number of classifier parameters, which influences the amount of data required for classifier training [55]. To obtain reliable estimates of classifier parameters, training data volume should increase with the dimension of the input space. The reliability of parameter estimates may affect classification accuracy. Segmentation and feature extraction can have an adverse effect on recognition. They may retain unwanted information or inadvertently discard important information for recognition. Also, the extracted features may fail to match the assumptions incorporated in the classifier. For example, some classifiers minimize their parameter-estimation requirements by assuming that features are uncorrelated. Accurate segmentation and optimal feature extraction are challenging. A. Acoustic Speech Processing 1) Segmentation: Separation of speech from nonspeech material often employs energy thresholding [124], zero-crossing rate, and periodicity measures [95], [119]. Often, several information sources are used jointly [46], [95]. In addition to heuristic decision procedures, conventional pattern recognition techniques have also been used for speech segmentation. This is typified by classification of speech events, based on vector-quantization (VQ) [46] or hidden Markov models (HMMs) [76]. 2) Feature Extraction: Many speech feature extraction techniques aim at obtaining a parametric representation of speech, based on models which often embed knowledge about speech production or perception by humans [94]. a) Speech production model: The human vocal apparatus is often modeled as a time-varying filter excited by a wide-band signal; this model is known as a source-filter or excitation-modulation model [108]. The time-varying filter represents the acoustic transmission characteristics of the vocal tract and nasal cavity, together with the spectral characteristics of glottal pulse shape and lip radiation. Most acoustic speech models assume that the excitation emanates from the lower end

of the vocal tract. Such models may be unsuitable for speech sounds, such as fricatives, which result from excitation that occurs somewhere else in the vocal tract [18].

b) Basic features: Often, acoustic speech features are short-term spectral representations. For recognition tasks, parameterizations of the vocal tract transfer function are invariably preferred to excitation characteristics, such as pitch and intensity. However, these discarded excitation parameters may contain valuable information. Cepstral features are very widely used. The cepstrum is the discrete cosine transform (DCT) of the logarithm of the short-term spectrum. The DCT yields virtually uncorrelated features, and this may allow a reduction of the parameter count for the classifier. For example, diagonal covariance matrices may be used instead of full matrices. The DCT also packs most of the acoustic information into the low-order features, hence allowing a reduction of the input space dimension. The cepstrum can be obtained through linear predictive coding (LPC) analysis [2], [96] or Fourier transformation [43]. Variants of the standard cepstrum include the popular mel-warped cepstrum or mel frequency cepstral coefficients (MFCCs) [31] (see Fig. 2) and the perceptual linear predictive (PLP) cepstrum [51].

In short-term spectral estimation, the speech signal is first divided into blocks of samples called frames. A windowing function, such as the Hamming window, is usually applied to the speech frame before the short-term log-power spectrum is computed. In the case of MFCCs, the spectrum is typically smoothed by a bank of triangular filters, the passbands of which are laid out on a frequency scale known as the mel scale. The latter is approximately linear below 1 kHz and logarithmic above 1 kHz; the mel scale effectively reduces the contribution of higher frequencies to the recognition. Finally, a DCT yields the MFCCs. By removing the cepstral mean, MFCCs can be made fairly insensitive to time-invariant distortion introduced by the communication channel. In addition, given the low cross correlation of MFCCs, their covariance can be modeled with a diagonal matrix. MFCCs are notable for their good performance in both speech and speaker recognition.

c) Derived features: High-level features may be obtained from a temporal sequence of basic feature vectors or from the statistical distribution of the pattern space spanned by basic features. These derived features are typified by first-order or higher order dynamic (also known as transitional) features, such as delta or delta-delta cepstral coefficients [40], [112], and statistical dynamic features represented by the multivariate autoregressive features proposed in [74]. The delta cepstrum is usually computed by applying a linear regression over the neighborhood of the current cepstral vector; the regression typically spans approximately 100 ms. The use of delta features mitigates the unsuitability of the assumption of temporal statistical independence often made in classifiers. Other high-level features are the long-term spectral mean, variance or standard deviation, and the covariance of basic features [9], [39]. Some high-level features aim at reducing dimensionality through a transformation that produces statistically orthogonal features and packs most of the variance into few features.

Fig. 2. MFCC feature extraction.
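To illustrate the MFCC pipeline summarized in Fig. 2, here is a simplified sketch (an assumption-laden outline rather than a reference implementation: the mel filterbank construction is coarse, the frame length, hop, and filter count are merely typical values, and NumPy/SciPy are assumed). Delta features computed by linear regression over neighbouring frames, as described above, are also included.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(num_filters, fft_size, sample_rate):
    # Triangular filters with centres spaced evenly on the mel scale.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), num_filters + 2)
    bin_idx = np.floor((fft_size + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    fbank = np.zeros((num_filters, fft_size // 2 + 1))
    for i in range(1, num_filters + 1):
        left, centre, right = bin_idx[i - 1], bin_idx[i], bin_idx[i + 1]
        for k in range(left, centre):
            fbank[i - 1, k] = (k - left) / max(centre - left, 1)
        for k in range(centre, right):
            fbank[i - 1, k] = (right - k) / max(right - centre, 1)
    return fbank

def mfcc(signal, sample_rate=16000, frame_len=400, hop=160,
         num_filters=26, num_ceps=13):
    """Frame -> Hamming window -> power spectrum -> mel filterbank -> log ->
    DCT -> keep low-order coefficients -> cepstral mean subtraction."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    window = np.hamming(frame_len)
    fbank = mel_filterbank(num_filters, frame_len, sample_rate)
    feats = []
    for frame in frames:
        spectrum = np.abs(np.fft.rfft(frame * window)) ** 2
        mel_energies = np.maximum(fbank @ spectrum, 1e-10)   # avoid log(0)
        feats.append(dct(np.log(mel_energies), type=2, norm='ortho')[:num_ceps])
    feats = np.array(feats)
    return feats - feats.mean(axis=0)    # cepstral mean subtraction

def delta(feats, width=4):
    """First-order dynamic (delta) features via linear regression over
    +/- `width` neighbouring frames (roughly 100 ms at a 10 ms frame hop)."""
    padded = np.pad(feats, ((width, width), (0, 0)), mode='edge')
    denom = 2 * sum(t * t for t in range(1, width + 1))
    return np.array([sum(t * (padded[i + width + t] - padded[i + width - t])
                         for t in range(1, width + 1)) / denom
                     for i in range(len(feats))])
```

The parameter choices above (25 ms frames with a 10 ms hop, 26 filters, 13 retained coefficients) are common defaults only; published systems vary them.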

Common transforms are based on principal component analysis (PCA) [100], statistical discriminant analysis optimizing the F-ratio, such as linear discriminant analysis (LDA) [1], and integrated mel-scale representation with LDA (IMELDA) [53]. The latter is LDA applied to static spectral information, possibly combined with dynamic spectral information, output by a mel-scale filter bank. Composite features are sometimes generated by a simple concatenation of different types of features [69].

B. Visual Speech Processing

1) Segmentation: Visual speech requires both spatial and temporal segmentation. Temporal endpoints may be derived from the acoustic signal endpoints, or computed after spatial segmentation in the visual domain. Many spatial segmentation techniques impose restrictive assumptions, or rely on segmentation parameters tuned for a specific data set. As a result, robust location of the face or its constituents in unconstrained scenes is beyond the capability of most current techniques. At times, the spatial segmentation task is eased artificially through the use of lipstick or special reflective markers, or by discarding most facial information and capturing images of the mouth only.

Face segmentation relies on image attributes related to facial surface properties, such as brightness, texture, and color [26], [87], [122], possibly accompanied by their dynamic characteristics. Face segmentation techniques can be grouped into the broad categories of intra-image analysis, inter-image analysis, or a combination of the two. Intra-image approaches may be subdivided into conventional, connectionist, symbolic, and hybrid methods. Conventional methods include template-based techniques [131], signature-based techniques [104], edge or contour following [78], and symmetry detection [97]. Connectionist methods are built around artificial neural networks such as radial basis functions, self-organizing neural networks, and (most frequently) multilayer perceptrons (MLPs). Symbolic methods are often
based on a knowledge-based system [128]. Hybrid methods combining the above techniques are also available [34]. Conventional and symbolic methods tend to perform poorly in the presence of facial image variation. In comparison, when sufficient representative training data is available, connectionist methods may display superior robustness to changes in illumination and to geometric transformations. This is due to the ability of neural networks to learn without relying on explicit assumptions about underlying data models. Face segmentation often exploits heuristics about facial shape, configuration, and photometric characteristics; implicit or explicit models of the head or facial components are generally used [17], [83]. The mouth is commonly modeled by deformable templates [50], dynamic contour models [26], [32], [60], or statistical models of shape and brightness [65]. The segmentation then takes the form of optimization of the fit between the model and the image, typically using numerical optimization techniques such as steepest descent, simulated annealing, or genetic algorithms. A downside of approaches based on iterative search, is that speedy and accurate segmentation requires initialization of model position and shape relatively close to the target mouth. In addition, segmentation based on such models is usually sensitive to facial hair, facial pose, illumination, and visibility of the tongue or teeth. Some approaches for enhancing the robustness of lip tracking are proposed in [83] and [10]. The approach described in [83] incorporates adaptive modeling of image characteristics, which is anchored on Gaussian mixture models (GMMs) of the color and geometry of the mouth and face. [10] enhances the robustness of the Kanade-Lucas-Tomasi tracker by embedding heuristics about facial characteristics. 2) Feature Extraction: Although raw pixel data may be used directly by a classifier [12], [35], [126], [130], feature extraction is often applied. Despite assertions that much of lipreading information is conveyed dynamically [13], [65], features relating to the static configuration of visible articulators are fairly popular. Depending on the adjacency and area coverage of pixels used during feature extraction, approaches for visual speech feature extraction may be grouped into mouth-window methods and landmark methods. These two approaches are sometimes used conjunctively [26]. a) Mouth-window methods: These methods extract features from all pixels in a window covering the mouth region. Examples of such methods are: binarization of image pixel intensities [88], aggregation of pixel intensity differences between image frames [80], computation of the mean pixel luminance of the oral area [84], 2-D Fourier analysis [12], [35], DCT [90], discrete wavelet transform (DWT) [90], PCA (“eigenlips”) [13], [15], [35], LDA [35], and nonlinear image decomposition based on the “sieve” algorithm [70]. b) Landmark methods: In landmark methods, a group of key points is identified in the oral area. Features extracted from these key points may be grouped into three main subgroups: 1) kinematic features; 2) photometric features; and 3) geometric features. Examples of kinematic features are velocity of key points [125]. Photometric features may be in the form of intensity and temporal intensity gradient of key points [125]. Typical geometric features are the width, height, area, and perimeter of

the oral cavity [14]; and distances or angles between key points located on the lip margins, mouth corners, or jaw [7], [125]. Spectral encoding of lip shape is used in [3], where it is shown to yield compact feature vectors, which are fairly insensitive to the reduction of video frame rate. Geometric and photometric features are sometimes used together [36]. This is because, although shape features are less sensitive to lighting conditions than photometric features, they discard most of the information conveyed by the visibility of the tongue and teeth.

c) Evaluation of feature types: Investigations reported in [35] show no significant difference in speech classification accuracy obtained from raw pixel intensities, PCA features, and LDA features. A study of mouth-window dynamic features for visual speech recognition found that optical-flow features are outperformed by the difference between successive image frames [45]. It was also noted that local low-pass filtering of images yields better accuracy than PCA [45]. In [90], no significant difference in recognition accuracy was observed between DCT, PCA, and DWT features. Compared to landmark features, mouth-window features may display higher sensitivity to changes of lighting and to the spatial or optical settings of the camera. Moreover, pixel-based features may be afflicted by the curse of dimensionality. Additionally, although pixel intensities capture more information than contour-based features, pixel data may contain many irrelevant details.

III. SPEECH OR SPEAKER MODELLING

Speech or speaker classifiers often consist of stochastic models of voice pattern distribution, which provide similarity measures followed by a decision process [94]. Modeling the distribution of patterns is difficult, particularly in the cases of distributions having several modes or clusters, pattern classes having complex decision boundaries, the presence of outliers, and sparse or nonrepresentative training data. Moreover, devising a decision procedure well suited to all pattern distributions is impossible. Speech or speaker classifier models can be broadly grouped into parametric and nonparametric models. Parametric models generally assume a normal distribution, or a mixture of normal distributions. The group of nonparametric models comprises reference-pattern models and connectionist models.

A. Parametric Models

The static characteristics of voice are sometimes modeled by single-mode Gaussian probability density functions (pdfs) [43]. However, the most popular static models are multimode mixtures of multivariate Gaussians [98], commonly known as Gaussian mixture models (GMMs). HMMs are widely used as models of both static and dynamic characteristics of voice [36], [93], [94].

1) Gaussian Mixture Models (GMMs): A GMM represents a probability distribution as a weighted aggregation of Gaussians, $p(\mathbf{x}) = \sum_{i=1}^{M} w_i \, \mathcal{N}(\mathbf{x}; \mu_i, \Sigma_i)$, where $\mathbf{x}$ is the “observation” (usually corresponding to a feature vector) and the GMM parameters are the mixture weights ($w_i$), the number of mixture components ($M$), the mean ($\mu_i$), and the covariance matrix ($\Sigma_i$) of each component.
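The GMM just defined can be scored as follows. This is a minimal sketch (NumPy/SciPy assumed; diagonal covariances, as commonly paired with MFCCs; illustrative parameter values; training by EM is omitted) of evaluating the log-likelihood of a sequence of feature vectors under one model, as would be done when scoring a speaker model.

```python
import numpy as np
from scipy.special import logsumexp

def gmm_log_likelihood(X, weights, means, variances):
    """Log-likelihood of feature vectors X (frames x dims) under a GMM with
    diagonal covariances: log p(x) = logsum_i [log w_i + log N(x; mu_i, diag(var_i))]."""
    X = np.atleast_2d(X)
    log_norm = -0.5 * (np.log(2 * np.pi * variances).sum(axis=1)                 # (M,)
                       + (((X[:, None, :] - means) ** 2) / variances).sum(axis=2))  # (T, M)
    frame_ll = logsumexp(np.log(weights) + log_norm, axis=1)                     # (T,)
    return frame_ll.sum()

# Illustrative 2-component model in a 13-dimensional feature space
rng = np.random.default_rng(0)
weights = np.array([0.4, 0.6])
means = rng.normal(size=(2, 13))
variances = np.ones((2, 13))
frames = rng.normal(size=(50, 13))        # e.g., 50 MFCC vectors
print(gmm_log_likelihood(frames, weights, means, variances))
```

In speaker identification, one such model per enrolled speaker would be scored and the highest-scoring model selected.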

Fig. 3. Three-state left-right HMM depicted as an emitter of feature vectors by a sequence of hidden states. State transitions and vector emissions are probabilistic.

Diagonal covariance matrices are often used for features, such as MFCCs, which are characterized by low cross correlation. GMM parameters are often estimated using the Expectation-Maximization (EM) algorithm [98]. Being iterative, this algorithm is sensitive to initial conditions. It may also fail to converge if the norm of a covariance matrix approaches zero.

2) Hidden Markov Models (HMMs): HMMs are generative data models, which are well suited for the statistical modeling and recognition of sequential data, such as speech. An HMM embeds two stochastic components (see Fig. 3). One component is a Markov chain of hidden states, which models the sequential evolution of observations. The hidden states are not directly observable. The other component is a set of probability distributions of observations. Each state has one distribution, which can be represented by a discrete or a continuous function. This divides HMMs into discrete-density HMMs (DHMMs) and continuous-density HMMs (CHMMs), respectively. In early recognition systems, continuous-valued speech features were vector quantized and each resulting VQ codebook index was then input to a DHMM. A key limitation of this approach is the quantization noise introduced by the vector quantizer and the coarseness of the similarity measures. Most modern systems use CHMMs, where each state is modeled as a GMM. In other words, a GMM is equivalent to a single-state HMM. Although, in theory, Gaussian mixtures can represent complex pdfs, this may not be so in practice. Hence, HMMs sometimes incorporate MLPs, which estimate state observation probabilities [36], [75].

The most common HMM learning rule is the Baum-Welch algorithm, which is an iterative maximum likelihood estimation of the state and state-transition parameters [93]. Due to the iterative nature of the learning, the estimated parameters depend on their initial settings. HMMs are often trained as generative models of within-class data. Such HMMs do not capture discriminating information explicitly and hence may give suboptimal recognition accuracy. This has spurred research into discriminative training of HMMs and other generative models [63], [102]. Viterbi decoding [93] is typically used for efficient exploration of possible state sequences during recognition; it calculates the likelihood that the observed sequence was generated by the HMM. The Viterbi algorithm is essentially a dynamic programming method, which identifies the state sequence that maximizes the probability of occurrence of an observation sequence.

In practice, to minimize the number of parameters, HMMs are confined to relatively small or constrained state spaces. Practical HMMs are typically first-order Markov models. Such models may be ill-suited for higher order dynamics; in the case of speech, the required sequential dependence may extend across several states. HMMs for speech or speaker recognition are typically configured as left-right models (see Fig. 3). The use of state-specific GMMs increases the number of classifier parameters to estimate. To reduce the parameter count, sharing (commonly referred to as “tying”) of parameters is often used. Typically, state parameters are tied across HMMs which possess some states deemed similar.
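The Viterbi procedure described above can be sketched as a log-domain dynamic program. The example below assumes a small left-right HMM, like the one in Fig. 3, with precomputed per-state observation log-probabilities; it is only an outline, not the decoder of any cited system.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely hidden-state path and its log-probability.
    log_pi: (N,) initial state log-probs; log_A: (N, N) transition log-probs;
    log_B: (T, N) observation log-probs of each frame under each state."""
    T, N = log_B.shape
    delta = log_pi + log_B[0]                  # best score ending in each state
    psi = np.zeros((T, N), dtype=int)          # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + log_A        # scores[from_state, to_state]
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1], float(delta.max())

# Illustrative 3-state left-right model (states can only stay or move right)
log_pi = np.log(np.array([1.0, 1e-10, 1e-10]))
A = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 1.0]])
log_A = np.log(A + 1e-12)
log_B = np.log(np.random.default_rng(1).dirichlet(np.ones(3), size=8))  # 8 frames
print(viterbi(log_pi, log_A, log_B))
```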

B. Nonparametric Models

1) Reference-Pattern Models: These models take the form of a store of reference patterns representing the voice-pattern space. To counter misalignments arising, for example, from changes in speaking rate, temporal alignment using dynamic time warping (DTW) is often applied during pattern matching. The reference patterns may be taken directly from the original pattern space; this approach is used in k-nearest-neighbor (kNN) classifiers [4]. Alternatively, the reference patterns may represent a compressed pattern space, typically obtained through vector averaging. Compressed-pattern-space approaches aim to reduce the storage and computational costs associated with an uncompressed space. They include VQ models [112], [129] and the template models used in minimum distance classifiers [77]. A conventional VQ model consists of a collection (codebook) of feature-vector centroids. In effect, VQ uses multiple static templates, and hence it discards potentially useful temporal information. The extension of such memoryless VQ models into models which possess inherent memory has been proposed in the form of matrix quantization and trellis VQ models [58].

2) Connectionist Models: These consist of one or several neural networks. The most popular models are the memoryless type, such as MLPs [82], radial basis functions [82], neural tree networks [37], Kohonen’s self-organizing maps [52], and learning vector quantization [6]. The main connectionist models capable of capturing temporal information are time-delay neural networks [5] and recurrent neural networks [103]. However, compared to HMMs, artificial neural networks are generally worse at modeling sequential data. Most neural network models are trained as discriminative models. Predictive models, within a single medium or across acoustic and visual media, are rare [21], [44], [49]. A key strength of neural networks is that their training is generally implemented as a nonparametric, nonlinear estimation, which does not make assumptions about underlying data models or probability distributions.

C. Decision Procedure

The classifier decision procedure sometimes involves a sequence of consecutive recognition trials [77]. At times, it is implemented as a decision tree [37], [113]. Some similarity measures are tightly coupled to particular feature types. For speaker verification or open-set identification, a normalization of similarity scores may be necessitated by speech variability
[63], [101]. Examples of common similarity measures are: the Euclidean distance (often inverse-variance weighted, or reduced to a city-block distance) [1], [64], the Mahalanobis distance [2], the likelihood ratio [107], and the arithmetic-harmonic sphericity measure [9]. The Mahalanobis distance measure takes account of feature covariance and de-emphasizes features with high variance; however, reliable estimation of the covariance matrix may require a large amount of training data.

IV. AUDIO-VISUAL FUSION

A. Foundations

Speech perception by humans is a bimodal phenomenon characterized by high recognition accuracy and graceful performance degradation in the presence of distortion. As confirmed empirically [21], [84], acoustic and visual speech are correlated. They display complementarity and supplementarity for speech recognition, particularly under adverse conditions [117]. Complementarity entails the presence, in one modality, of information that is largely absent in the other modality. Supplementarity relates to information redundancy across the modalities. Speaker recognition by humans is also a multimodal process. Synergy exists between speaker information conveyed by acoustic speech and its visual correlates. However, the factors that affect the complementarity and supplementarity of acoustic speech and visual speech, for speaker recognition, are not clearly understood.

B. General Fusion Hierarchy

Sensor fusion deals with the combination of information produced by several sources [30], [47]. It has borrowed mathematical and heuristic techniques from a wide array of fields, such as statistics, artificial intelligence, decision theory, and digital signal processing. Theoretical frameworks for sensor fusion have also been proposed [61]. In pattern recognition, sensor fusion can be performed at the data level, feature level, or decision level (see Fig. 1); hybrid fusion methods are also available [48]. Low-level fusion can occur at the data level or feature level. Intermediate-level and high-level fusion typically involves the combination of recognition scores or labels produced as intermediate or final output of classifiers.

Hall [48] argues that, owing to the information loss which occurs during the transformation of raw data into features and eventually into classifier outputs, classification accuracy is expected to be the lowest for decision fusion. However, it is also known that corruption of information due to noise is potentially highest, and requirements for data registration most stringent, at the lower levels of the fusion hierarchy [30], [109]. In addition, low-level fusion is less robust to sensor failure than high-level fusion [30]. Moreover, low-level fusion generally requires more training data because it usually involves more free parameters than high-level fusion. It is also easier to upgrade a single-sensor system into a multisensor system based on decision fusion; sensors can be added without having to retrain any legacy single-sensor classifiers. Additionally, the frequently-used simplifying assumption of independence between sensor-specific data holds
better at the decision level, particularly if the classifiers are not of the same type. However, decision fusion might consequently fail to exploit the potentially beneficial correlation present at the lower levels.

C. Low-Level Audio-Visual Fusion

To the best of the authors’ knowledge, data-level fusion of acoustic and visual speech has not been attempted, possibly due to range registration difficulties. Low-level fusion is usually based on input space transformation into a space with less cross correlation and where most of the information is captured in fewer dimensions than the original space. Feature fusion is commonly implemented as a concatenation of acoustic and visual speech feature vectors [15], [24], [70], [91]. This typically gives an input space of higher dimensionality than each unimodal pattern space, and hence raises the specter of the curse of dimensionality. Consequently, linear or nonlinear transformations, coupled to dimensionality reduction, are often applied to feature vector pairs. Nonlinear transformation is often implemented by a neural network layer [29], [35], [72], [130], connected to the primary features or outputs of subnets downstream of the integration layer. PCA and LDA are frequently used for linear transformation of vector pairs [24]. Although the Kalman filter can be used for feature fusion [109], it has not found much use in bimodal recognition. Transformations, such as PCA and LDA, allow dimensionality reduction. However, PCA and LDA may require high volumes of training data for a reliable estimation of the covariance matrices on which they are anchored. LDA often outperforms PCA in recognition tasks. This is because, unlike LDA, PCA does not use discriminative information during parameter estimation. However, the class information embedded in LDA is a set of class means; hence, LDA is ill suited for classes with multiple distribution modes or with confusable means.

D. Intermediate-Level and High-Level Audio-Visual Fusion

Intermediate-level or high-level fusion of acoustic and visual speech is usually based on conventional information combination and pattern classification techniques, or on stochastic modeling of coupled time series.

1) Information Combination: A common technique for post-categorical fusion is the linear combination of the scores output by the single-modality classifiers [12], [110]. Geometric averaging is also applied at times; a popular approach for combining HMMs is decision fusion implemented as a product of the likelihoods of a pair of uncoupled audio and visual HMMs. DTW is sometimes used, along the sequence of feature vectors, to optimize the path through class hypotheses [35]. The combination of classifiers possessing localized expertise can give a better estimate of the decision boundary between classes. This has motivated the development of the mixture of experts (MOE) approach [54]. An MOE consists of a parallel configuration of experts, whose outputs are dynamically integrated by the outputs of a trainable gating network. The integration is a weighted sum, whose weights are learned estimates of the correctness of each expert, given the current input. MOEs can be incorporated into a tree-like architecture known as hierarchical mixture of experts (HME) [56].
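A minimal sketch of the MOE-style combination just described is given below (the gating network is reduced to a linear map followed by a softmax, expert and gate parameters are hypothetical, and joint training of gate and experts is omitted).

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def moe_combine(x, experts, gate_weights):
    """Mixture-of-experts output: a weighted sum of expert outputs, where the
    weights are the softmax of a (here linear) gating function of the input x."""
    gate = softmax(gate_weights @ x)                        # one weight per expert
    outputs = np.stack([expert(x) for expert in experts])   # (n_experts, n_classes)
    return gate @ outputs                                   # (n_classes,)

# e.g., two "experts" returning per-class scores for a 2-class problem
experts = [lambda x: np.array([0.8, 0.2]), lambda x: np.array([0.3, 0.7])]
print(moe_combine(np.array([1.0, -0.5]), experts, gate_weights=np.ones((2, 2))))
```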

One difficulty with using HMEs is that the selection of appropriate model parameters (number of levels, branching factor of the tree, architecture of experts) may require a good insight into the data or problem space under consideration.

Neural networks, Bayesian inference, Dempster-Shafer theory, and possibility theory have also provided frameworks for decision fusion [22], [68]. Integration by neural network is typically implemented by neurons whose weighted inputs are connected to the outputs of single-modality classifiers. Bayesian inference uses Bayes’ rule to calculate a posteriori bimodal class probabilities from the a priori class probabilities and the class conditional probabilities of the observed unimodal classifier outputs. Dempster-Shafer theory of evidence is a generalization of Bayesian probability theory. The bimodal belief in each possible class is computed by applying Dempster’s rule of combination to the basic probability assignment in support of each class. Possibility theory is based on fuzzy sets. The bimodal possibility for each class is computed by combining the possibility distributions of classifier outputs. Although possibility theory and Dempster-Shafer theory of evidence are meant to provide more robust frameworks than Bayesian inference for combining uncertain or imprecise information, comparative assessment on bimodal recognition has shown that this may not be the case [22].

2) Classification: Decision fusion can be formulated as a classification of the pattern of unimodal classifier outputs. The latter are grouped into a vector that is input to another classifier, which yields a classification decision representing the consensus among the unimodal classifiers. A variety of classifiers, acting as a decision fusion mechanism, have been evaluated. It is suggested in [120] that, compared to kNN and decision trees, logistic regression offers the best accuracy and the lowest computational cost during recognition. It is shown in [19] that a median radial basis function network outperforms clustering based on fuzzy k-means or fuzzy VQ; the superiority of fuzzy clustering over conventional clustering is also shown. Comparison of the support vector machine (SVM), minimum cost Bayesian classifier, Fisher’s linear discriminant, decision trees, and MLP showed that the MLP gives the worst accuracy [8]. The comparison also showed that the SVM and Bayesian classifiers have similar performance and that they outperform the other classifiers. An SVM is a binary classifier which is based on statistical learning theory. SVMs try to maximize generalization capability, even for pattern spaces of high dimensionality. A downside of SVMs is that inappropriate kernel functions can result in poor recognition accuracy; hence the need for careful selection of kernel functions.

3) Stochastic Modeling of Coupled Time Series: The fusion of acoustic and visual speech can be cast as a probabilistic modeling of coupled time series. Such modeling may capture the potentially useful coupling or conditional dependence between the two modalities. The level of synchronization between acoustic and visual speech varies along an utterance; hence, a flexible framework for modeling the asynchrony is required. Factorial HMMs [41], [106], Boltzmann chains [105] and their variants (multistream HMMs [36], [79] and coupled HMMs [11]) are possible stochastic models for the combination of time-coupled modalities (see Fig. 4).
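As an illustration of decision fusion cast as classification of unimodal classifier outputs (Section IV-D2 above), the sketch below stacks the two score vectors and trains a logistic-regression combiner, one of the fusers compared in [120]; scikit-learn is assumed, and the development data and score shapes are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_fusion_classifier(audio_scores, video_scores, labels):
    """Train a combiner on the pattern of unimodal classifier outputs.
    audio_scores, video_scores: (n_trials, n_classes) score matrices produced by the
    modality-specific classifiers on a development set; labels: (n_trials,)."""
    fusion_input = np.hstack([audio_scores, video_scores])
    return LogisticRegression(max_iter=1000).fit(fusion_input, labels)

def fuse(combiner, audio_score_vec, video_score_vec):
    """Consensus decision for one recognition trial."""
    x = np.hstack([audio_score_vec, video_score_vec]).reshape(1, -1)
    return combiner.predict(x)[0]
```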

Fig. 4. Left-right multistream HMM for acoustic and visual speech [36]. (a) Two-stream model of a temporal speech segment. Each tri-state component HMM generates a sequence of observations. Stream-segment likelihoods are combined at the synchronization points. (b) Factorial representation of the multistream model shown in (a). Paired audio and visual states represent the Cartesian product of stream states [106]. The diagonal path corresponds to synchronous streams.

Factorial HMMs explicitly model intra-process state structure and inter-process coupling; this makes them suitable for bimodal recognition, where each process could correspond to a modality. The state space of a factorial HMM is a Cartesian product of the states of its component HMMs. The modeling of inter-process coupling has the potential of reducing the sensitivity to unwanted intra-process variation during a recognition trial and hence it may enhance recognition robustness. Variants of factorial HMMs have been shown to be superior to conventional HMMs for modeling interacting processes, such as two-handed gestures [11] or acoustic and visual speech [36], [79]. A simpler pre-categorical fusion approach for HMM-based classifiers is described in [110]. In this approach, the weighted product of the emission probabilities of acoustic and visual speech feature vectors is used during Viterbi decoding for a bimodal discrete HMM.
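The weighted-product combination of acoustic and visual likelihoods mentioned above can be written compactly in the log domain. The sketch below is a simplified outline with a free stream weight, not the exact scheme of [110].

```python
import numpy as np

def fuse_stream_log_likelihoods(audio_log_lik, video_log_lik, lam=0.7):
    """Weighted product of per-class stream likelihoods, computed in the log
    domain: log p = lam * log p_audio + (1 - lam) * log p_video.
    Inputs are arrays of per-class (or per-hypothesis) log-likelihoods."""
    return lam * np.asarray(audio_log_lik) + (1.0 - lam) * np.asarray(video_log_lik)

def decide(audio_log_lik, video_log_lik, lam=0.7):
    """Pick the class with the highest fused score."""
    return int(np.argmax(fuse_stream_log_likelihoods(audio_log_lik, video_log_lik, lam)))
```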

TABLE I SAMPLE BIMODAL SPEECH RECOGNIZERS BASED ON HMM OR REFERENCE-PATTERN SPEECH MODELS

E. Adaptive Fusion

In most fusion approaches to pattern recognition, fusion parameters are determined at training time and remain frozen for all subsequent recognition trials. However, optimal fusion requires a good match between the fusion parameters and the factors that affect the input patterns. Nonadaptive data fusion does not guarantee such a match and, hence, pattern variation may lead to suboptimal fusion, which may even result in worse accuracy than unimodal recognition [47]. Fusion parameters should preferably adapt to changes in recognition conditions. Such dynamic parameters can be based on estimates of signal-to-noise ratio [36], [72], entropy measures [71], [72], degree of voicing in the acoustic speech [79], or measures relating to the perceived quality of the unimodal classifier output scores [19], [23], [27], [118], [123].
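One simple way to realize such adaptive weighting is to map an estimated acoustic signal-to-noise ratio to the audio stream weight used in the fusion rule. The sigmoid mapping below is purely illustrative (its midpoint and slope are hypothetical), not a recipe taken from the cited work.

```python
import numpy as np

def audio_stream_weight(snr_db, midpoint_db=10.0, slope=0.3):
    """Map an estimated SNR (dB) to an audio stream weight in (0, 1):
    clean audio -> weight near 1 (trust the acoustic stream), noisy audio ->
    weight near 0 (lean on the visual stream). Midpoint and slope are illustrative."""
    return 1.0 / (1.0 + np.exp(-slope * (snr_db - midpoint_db)))

# e.g., at 20 dB the acoustic stream dominates; at 0 dB the visual stream does
for snr in (20.0, 10.0, 0.0):
    print(snr, round(float(audio_stream_weight(snr)), 2))
```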

V. PERFORMANCE OF BIMODAL RECOGNIZERS

A. Synopsis of Recognizers

Tables I–III present the salient characteristics of a representative sample of bimodal recognizers. The tables show that speech input for bimodal recognition tasks is usually in the form of isolated utterances. This is possibly due to the high storage and computational requirements of continuous speech processing, particularly in the visual domain. However, investigations into large-vocabulary continuous speech recognition have also been reported [79]. MFCCs are prevalent acoustic speech features, whereas mouth-window features are more popular than landmark features for visual speech. The models commonly incorporated in classifiers are neural networks and HMMs. The latter are gaining increasing popularity. The spread of fusion techniques across the levels of the fusion hierarchy displays a slight bias toward post-categorical fusion. Most fusion techniques are nonadaptive.

TABLE II SAMPLE BIMODAL SPEECH RECOGNIZERS BASED ON CONNECTIONIST SPEECH MODELS

B. Recognition Accuracy

Bimodal sensor fusion can yield (see Tables I–III):
1) better classification accuracy than either modality (“enhancing fusion,” which is the ultimate target of sensor fusion);
2) classification accuracy bounded by the accuracy of each modality (“compromising fusion”);
3) lower classification accuracy than the least accurate modality (“attenuating fusion”).
“Enhancing,” “compromising,” and “attenuating” fusion is a terminology adapted from [67].

When audio-visual fusion results in improved accuracy, it is often observed that intermediate unimodal accuracy gives a higher relative improvement in accuracy than low or high unimodal accuracy [21], [72]. In particular, the “law of diminishing returns” seems to apply when unimodal accuracy changes from intermediate to high. Most findings show that audio-visual fusion can counteract a degradation of acoustic speech [34], [36]. Audio-visual fusion is therefore a viable alternative, or complement, to signal processing techniques which try to minimize the effect of acoustic noise degradation on recognition accuracy [36].

Although it is sometimes contended that the level at which fusion is performed determines recognition accuracy (see Section IV-B), published results reveal that none of the levels is consistently superior to others. It is very likely that recognition accuracy is not determined solely by the level at which the fusion is applied, but also by the particular fusion technique and training or test regime used [21]. For example, [110] shows nearly equal improvement in speech recognition accuracy accruing from either pre-categorical or post-categorical audio-visual fusion.

TABLE III SAMPLE BIMODAL SPEAKER RECOGNIZERS

It is also observed in [118] that feature fusion and decision fusion yield the same speech recognition accuracy. However, [35] shows that post-categorical (high-level) audio-visual fusion yields better speech recognition accuracy than pre-categorical (low-level) fusion; the worst accuracy is obtained with intermediate-level fusion.

C. Performance Assessment Issues

It is difficult to generalize some findings reported in the bimodal recognition literature and to establish a meaningful comparison of recognition techniques with respect to published recognition accuracy figures. Notably, not all systems are fully
automatic; and there are no universally accepted test databases, or performance assessment methodologies. In addition, the majority of reported bimodal recognition figures are for relatively small tasks in terms of vocabulary, grammar, or distribution of speakers. Another problem with most published findings is the lack of rigor in performance assessment methodology. Most results quote empirically determined error rates as point estimates, and findings are often based on inferences made without reference to the confidence intervals of estimates or to the statistical significance of any observed differences. To permit the drawing of

objective conclusions from empirical investigations, statistical decision theory should guide the interpretation of results. Most of the reported results are based on data captured under controlled laboratory environments. Most techniques have not been tested in real-world environments. Performance degradation is expected in such environments, particularly if the modeling and fusion techniques are not adaptive. Real-world environments are characterized by a comparatively limited level of control over operational factors such as acoustic and electromagnetic noise, illumination and overall image quality, ruggedness of data capture equipment; as well as head pose, facial appearance, and physiological or emotional state of the speaker. These sources of variability could lead to a mismatch between test and training conditions, and hence, potentially result in degraded recognition accuracy. Although the need for widely accepted benchmark databases has been asserted, there is a paucity of databases for the breadth and depth of research areas in bimodal recognition. Typical limitations of most existing bimodal recognition databases are: small population; narrow phonetic coverage; isolated words; lack of synchronization between audio and video streams; or absence of certain visual cues. There is a pressing need for the development of readily available good benchmark databases. The “DAVID” [25], “M2VTS” [89], “XM2VTSDB” [73], and “ViaVoice Audio-Visual” [79] databases represent positive efforts toward the fulfilment of this need.
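To make the earlier point about statistical rigor concrete, the sketch below computes an approximate 95% confidence interval for an empirically measured error rate using the standard Wilson score interval (a textbook formula, not one taken from the reviewed papers); reporting such intervals helps judge whether an observed difference between recognizers is meaningful.

```python
import math

def wilson_interval(errors, trials, z=1.96):
    """Wilson score interval for an error rate estimated from `errors`
    misclassifications observed in `trials` independent recognition trials."""
    p = errors / trials
    denom = 1.0 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return centre - half, centre + half

# e.g., 12 errors out of 200 trials: a 6% error rate, but with a wide interval
print(wilson_interval(12, 200))   # roughly (0.035, 0.102)
```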

VI. SAMPLE APPLICATIONS AND RESEARCH AVENUES

A. Applications

Many application domains would benefit from bimodal recognition, particularly in situations where acoustic or visual speech alone is deficient [114]. In some applications, bimodal recognition can serve the dual role of speech and speaker recognition. Other person recognition approaches, such as iris or fingerprint recognition, do not cater to such dual recognition modes. Microphones, cameras, and digitizing cards are now mass-market products. The input equipment for bimodal recognition is thus readily available at reasonably low costs for many applications. A brief illustrative list of possible applications is given below.

1) Computer Interfaces: Untethered human-computer interaction can be built around bimodal recognition, possibly extended to incorporate other input channels, such as touch.

2) Equipment Operation: In unconstrained environments, bimodal recognition can facilitate hands-off operation of equipment such as photocopiers, video game consoles, machines on the factory floor, and vehicles or military aircraft. Other interaction channels (such as hand gestures and eye gaze) could be added to speech input.

3) Video Conferencing: The combination of bimodal recognition with synthesis can be used for countering the degradation which occurs during transmission, or which arises from poor room acoustics or illumination. This combination can also be used as a platform for compressing audio-visual data, thereby reducing transmission costs.

4) Content-Based Retrieval: Bimodal recognition can be incorporated into search engines for databases containing audiovisual material such as newscasts.

5) Speech Transcription: Transcription based on bimodal recognition can be used in TV broadcasts for hearing-impaired viewers. It can also be particularly useful in adverse scenarios such as mobile phones or nomadic computers operated on the stock trading floor, outdoor TV news reports, and dictation in a noisy office.

6) Access Control: Bimodal recognition in unconstrained environments can be used in access control for computers, buildings, rooms, and privileged data (e.g., in e-commerce transactions). A bimodal speaker recognition product named BioId is reported in [38].

B. Research Avenues

There is a need for research into bimodal recognition capable of adapting its pattern modeling and fusion knowledge to the prevailing recognition conditions. Further research into the nesting of fusion modules (an approach called meta-fusion in [120]) also promises improved recognition accuracy and easier handling of complexity. The modeling of the asynchrony between the two channels is also an important research issue. Furthermore, the challenges of speaker adaptation and of recognizing spontaneous and continuous speech require further investigation within the framework of bimodal recognition. Also, to combat data variability, the symbiotic combination of sensor fusion with mature techniques developed for robust unimodal recognition is a worthwhile research avenue. In addition, further advances in the synergetic combination of speech with other channels—such as hand gestures and facial expressions—to reduce possible semantic conflicts in spoken communication are required.

Despite the potential gains in accuracy and robustness afforded by bimodal recognition, the latter invariably results in higher storage and computational costs than unimodal recognition. To make the implementation of real-world applications tractable, the development of optimized and robust visual-speech segmentation, feature extraction, or modeling techniques is a worthwhile research avenue. Comparative studies of techniques should accompany such developments.

Research efforts should also be directed at the joint use of visual and acoustic speech for estimating vocal tract shape, a difficult problem often known as the inversion task [99]. This could also be coupled with studies of the joint modeling of the two modalities. The close relationship between the articulatory and phonetic domains suggests that an articulatory representation of speech might be better suited for speech recognition, synthesis, and coding than the conventional spectral acoustic features [99]. Previous approaches to the inversion task have relied on acoustic speech alone. The multimodal nature of speech perception suggests that visual speech offers additional information for the acquisition of a physical model of the vocal tract. An investigation relevant to the inversion task within a bimodal framework is presented in [42].

The effect of visual speech variability on bimodal recognition accuracy has not been investigated as much as its acoustic
counterpart [92]. As a result, it is difficult to vouch strongly for the benefit of using visual speech for bimodal recognition in unconstrained visual environments. Studies of the effect of the following factors are called for, particularly in large-scale recognition tasks closely resembling typical real-world tasks: segmentation accuracy; video compression; image noise; occlusion; illumination; speaker pose; and facial expression or paraphernalia (such as facial hair, hats, makeup). A study into the effects of some of these factors is given in [92]. Although the multimodal character of spoken language has long been formally recognized and exploited, multimodal speaker recognition has not received the same attention. The surprisingly high speaker recognition accuracy obtained with visual speech [21], [66], [121], warrants extensive research on visual speech, either alone or combined with acoustic speech, for speaker recognition. Research is also needed on the potential of bimodal recognition for alleviating the problem of speaker impersonation. The study of how humans integrate audio-visual information could also be beneficial as a basis for developing robust and computationally efficient mechanisms or strategies for bimodal recognition by machine, particularly with regard to feature extraction, classification, and fusion.

VII. CONCLUDING REMARKS This paper has reviewed the main components of a bimodal recognition system and techniques used therein. It has also discussed bimodal recognition accuracy and highlighted sample applications and research avenues. Bimodal recognition exploits the synergy between acoustic speech and visual speech, particularly under adverse conditions. It is motivated by the need—in many potential applications of speech-based recognition—for robustness to speech variability, high recognition accuracy, and protection against impersonation. For most potential applications, the hardware for capturing, storing, and processing audio-visual data is readily available at reasonably low costs. Moreover, commendable advances in bimodal recognition techniques already allow some useful applications, if the recognition conditions are constrained to some extent. However, despite these advances, bimodal recognition research is not mature yet. Techniques which guarantee that bimodal recognition will yield better recognition accuracy and robustness than unimodal recognition, over a wide range of recognition conditions, are still being sought. Further research is required on accurate and robust segmentation, optimal feature extraction, reliable speech or speaker modeling, and more importantly, optimal and preferably adaptive fusion of acoustic and visual sensory modalities. Progress in these areas will allow bimodal recognition to play a key role in intelligent computer interfaces, video-conferencing, multimedia databases, speech transcription, access control, voice-control of household and office equipment, and many other applications of speech-based recognition.

REFERENCES
[1] B. S. Atal, “Automatic speaker recognition based on pitch contours,” J. Acoust. Soc. Amer., vol. 52, no. 6, pp. 1687–1697, 1972.
[2] B. S. Atal, “Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification,” J. Acoust. Soc. Amer., vol. 55, no. 6, pp. 1304–1312, 1974.
[3] R. Auckenthaler, J. Brand, J. S. Mason, F. Deravi, and C. C. Chibelushi, “Lip signatures for automatic person recognition,” in Proc. 2nd Int. Conf. Audio- and Video-Based Biometric Person Authentication, 1999, pp. 142–147.
[4] L. G. Bahler, J. E. Porter, and A. L. Higgins, “Improved voice identification using a nearest-neighbor distance measure,” in Proc. IEEE ICASSP, vol. 1, 1994, pp. 321–323.
[5] Y. Bennani, “Probabilistic cooperation of connectionist expert modules: Validation on a speaker identification task,” in Proc. IEEE ICASSP, vol. 1, 1993, pp. 541–544.
[6] Y. Bennani, F. F. Soulie, and P. Gallinari, “A connectionist approach for automatic speaker identification,” in Proc. IEEE ICASSP, vol. 1, 1990, pp. 265–268.
[7] C. Benoît, “On the production and the perception of audio-visual speech by man and machine,” in Proc. Symp. Multimedia Communications and Video Coding, 1995, pp. 277–284.
[8] S. Ben-Yacoub, Y. Abdeljaoued, and E. Mayoraz, “Fusion of face and speech data for person identity verification,” IEEE Trans. Neural Networks, vol. 10, pp. 1065–1074, Sept. 1999.
[9] F. Bimbot and L. Mathan, “Text-free speaker recognition using an arithmetic harmonic sphericity measure,” in Proc. 3rd Eurospeech Conf., vol. 1, 1993, pp. 169–172.
[10] F. Bourel, C. C. Chibelushi, and A. A. Low, “Robust facial feature tracking,” in Proc. 11th British Machine Vision Conf., vol. 1, 2000, pp. 232–241.
[11] M. Brand, N. Oliver, and A. Pentland, “Coupled hidden Markov models for complex action recognition,” in Proc. IEEE CCVPR, 1997, pp. 994–999.
[12] C. Bregler, H. Hild, S. Manke, and A. Waibel, “Improved connected letter recognition by lipreading,” in Proc. IEEE ICASSP, vol. 1, 1993, pp. 557–560.
[13] C. Bregler and Y. Konig, “Eigenlips for robust speech recognition,” in Proc. IEEE ICASSP, vol. 2, 1994, pp. 669–672.
[14] N. M. Brooke and E. D. Petajan, “Seeing speech: Investigations into the synthesis and recognition of visible speech movements using automatic image processing and computer graphics,” in Proc. Int. Conf. Speech Input/Output: Techniques and Applications, 1986, pp. 104–109.
[15] N. M. Brooke, M. J. Tomlinson, and R. K. Moore, “Automatic speech recognition that includes visual speech cues,” in Proc. Inst. Acoustics Autumn Conf. (Speech and Hearing), vol. 16, 1994, pp. 15–22.
[16] R. Brunelli and D. Falavigna, “Person identification using multiple cues,” IEEE Trans. Pattern Anal. Machine Intell., vol. 17, pp. 955–966, Oct. 1995.
[17] R. Brunelli and T. Poggio, “Face recognition: Features versus templates,” IEEE Trans. Pattern Anal. Machine Intell., vol. 15, pp. 1042–1052, Oct. 1993.
[18] J. P. Campbell, Jr., “Speaker recognition: A tutorial,” Proc. IEEE, vol. 85, pp. 1437–1462, Sept. 1997.
[19] V. Chatzis, A. G. Bors, and I. Pitas, “Multimodal decision-level fusion for person authentication,” IEEE Trans. Syst., Man, Cybern. A, vol. 29, pp. 674–680, Nov. 1999.
[20] T. Chen and R. R. Rao, “Audio-visual integration in multimodal communication,” Proc. IEEE, vol. 86, pp. 837–852, May 1998.
[21] C. C. Chibelushi, “Automatic audio-visual person recognition,” Ph.D. dissertation, Univ. of Wales, Swansea, U.K., 1997.
[22] C. C. Chibelushi, F. Deravi, and J. S. Mason, “Audio-visual person recognition: An evaluation of data fusion strategies,” in Proc. 2nd Eur. Conf. Security and Detection, 1997, pp. 26–30.
[23] C. C. Chibelushi, F. Deravi, and J. S. D. Mason, “Adaptive classifier integration for robust pattern recognition,” IEEE Trans. Syst., Man, Cybern. B, vol. 29, pp. 902–907, Dec. 1999.
[24] C. C. Chibelushi, J. S. D. Mason, and F. Deravi, “Feature-level data fusion for bimodal person recognition,” in Proc. 6th IEE Int. Conf. Image Processing and its Applications, 1997, pp. 399–403.
[25] C. C. Chibelushi, S. Gandon, J. S. D. Mason, F. Deravi, and R. D. Johnston, “Design issues for a digital audio-visual integrated database,” Dig. IEE Colloq. Integrated Audio-Visual Processing for Recognition, Synthesis, and Communication, no. 1996/213, pp. 7/1–7/7, 1996.
[26] G. I. Chiou and J.-N. Hwang, “Lipreading from color video,” IEEE Trans. Image Processing, vol. 6, pp. 1192–1195, Aug. 1997.

[27] T. Choudhury, B. Clarkson, T. Jebara, and A. Pentland, “Multimodal person recognition using unconstrained audio and video,” in Proc. 2nd Int. Conf. Audio- and Video-Based Biometric Person Authentication, 1999, pp. 176–181.
[28] R. Cole et al., “The challenge of spoken language systems: Research directions for the nineties,” IEEE Trans. Speech Audio Processing, vol. 3, pp. 1–21, Jan. 1995.
[29] P. Cosi, E. M. Caldognetto, K. Vagges, G. A. Mian, and M. Contolini, “Bimodal recognition experiments with recurrent neural networks,” in Proc. IEEE ICASSP, vol. 2, 1994, pp. 553–556.
[30] B. V. Dasarathy, “Sensor fusion potential exploitation: Innovative architectures and illustrative applications,” Proc. IEEE, vol. 85, pp. 24–38, Jan. 1997.
[31] S. B. Davis and P. Mermelstein, “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-28, pp. 357–366, 1980.
[32] P. Delmas, P. Y. Coulon, and V. Fristot, “Automatic snakes for robust lip boundaries extraction,” in Proc. IEEE ICASSP, vol. 6, 1999, pp. 3069–3072.
[33] B. Duc, E. S. Bigun, J. Bigun, G. Maitre, and S. Fischer, “Fusion of audio and video information for multimodal person authentication,” Pattern Recognit. Lett., vol. 18, no. 9, pp. 835–843, 1997.
[34] P. Duchnowski, M. Hunke, D. Busching, U. Meier, and A. Waibel, “Toward movement-invariant automatic lip-reading and speech recognition,” in Proc. IEEE ICASSP, vol. 1, 1995, pp. 109–112.
[35] P. Duchnowski, U. Meier, and A. Waibel, “See me, hear me: Integrating automatic speech recognition and lip-reading,” in Proc. ICSLP, vol. 2, 1994, pp. 547–550.
[36] S. Dupont and J. Luettin, “Audio-visual speech modeling for continuous speech recognition,” IEEE Trans. Multimedia, vol. 2, pp. 141–151, Sept. 2000.
[37] K. R. Farrell, R. J. Mammone, and K. T. Assaleh, “Speaker recognition using neural networks and conventional classifiers,” IEEE Trans. Speech Audio Processing, pt. II, vol. 2, no. 1, pp. 194–205, 1994.
[38] B. Fröba, C. Küblbeck, C. Rothe, and P. Plankensteiner, “Multisensor biometric person recognition in an access control system,” in Proc. 2nd Int. Conf. Audio- and Video-Based Biometric Person Authentication, 1999, pp. 55–59.
[39] S. Furui, “An overview of speaker recognition technology,” in Proc. ESCA Workshop on Automatic Speaker Recognition, Identification, and Verification, 1994, pp. 1–9.
[40] S. Furui, “Cepstral analysis technique for automatic speaker verification,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-29, no. 2, pp. 254–272, 1981.
[41] Z. Ghahramani and M. I. Jordan, “Factorial hidden Markov models,” Mach. Learn., vol. 29, no. 2–3, pp. 245–273, 1997.
[42] L. Girin, G. Feng, and J. L. Schwartz, “Fusion of auditory and visual information for noisy speech enhancement: A preliminary study of vowel transitions,” in Proc. IEEE ICASSP, 1998, pp. 1005–1008.
[43] H. Gish and M. Schmidt, “Text-independent speaker identification,” IEEE Signal Processing Mag., vol. 11, no. 4, pp. 18–32, 1994.
[44] Y. Gong and J. P. Haton, “Nonlinear vectorial interpolation for speaker recognition,” in Proc. IEEE ICASSP, vol. 2, 1992, pp. 173–176.
[45] M. S. Gray, J. R. Movellan, and T. J. Sejnowski, “Dynamic features for visual speechreading: A systematic comparison,” in Proc. 10th Annu. Conf. Neural Information Processing Systems, 1996, pp. 751–757.
[46] J. A. Haigh, “Voice activity detection for conversational analysis,” M.Phil. thesis, Univ. of Wales, Swansea, U.K., 1994.
[47] D. L. Hall and J. Llinas, “An introduction to multisensor data fusion,” Proc. IEEE, vol. 85, no. 1, pp. 6–23, 1997.
[48] D. L. Hall, Mathematical Techniques in Multisensor Data Fusion. Norwood, MA: Artech House, 1992.
[49] H. Hattori, “Text-independent speaker recognition using neural networks,” in Proc. IEEE ICASSP, vol. 2, 1992, pp. 153–156.
[50] M. E. Hennecke, K. V. Prasad, and D. G. Stork, “Using deformable templates to infer visual speech dynamics,” in Conf. Rec. 28th Asilomar Conf. Signals, Systems, and Computers, 1994, pp. 578–582.
[51] H. Hermansky, B. A. Hanson, and H. Wakita, “Perceptually-based linear predictive analysis of speech,” in Proc. IEEE ICASSP, vol. 2, 1985, pp. 509–512.
[52] M. M. Homayounpour and G. Chollet, “Neural net approaches to speaker verification: Comparison with second-order statistic measures,” in Proc. IEEE ICASSP, vol. 1, 1995, pp. 353–356.
[53] M. J. Hunt, S. M. Richardson, D. C. Bateman, and A. Piau, “An investigation of PLP and IMELDA acoustic representations and of their potential for combination,” in Proc. IEEE ICASSP, vol. 2, 1991, pp. 881–884.

[54] R. A. Jacobs, M. I. Jordan, S. J. Nowlan, and G. E. Hinton, “Adaptive mixtures of local experts,” Neural Comput., vol. 3, pp. 79–87, 1991.
[55] A. K. Jain, R. P. W. Duin, and J. Mao, “Statistical pattern recognition: A review,” IEEE Trans. Pattern Anal. Machine Intell., vol. 22, pp. 4–37, Jan. 2000.
[56] M. I. Jordan and R. A. Jacobs, “Hierarchical mixtures of experts and the EM algorithm,” Neural Comput., vol. 6, no. 2, pp. 181–214, 1994.
[57] P. Jourlin, J. Luettin, D. Genoud, and H. Wassner, “Acoustic-labial speaker verification,” Pattern Recognit. Lett., vol. 18, no. 9, pp. 853–858, 1997.
[58] B. H. Juang and F. K. Soong, “Speaker recognition based on source coding approaches,” in Proc. IEEE ICASSP, 1990, pp. 613–616.
[59] H. Kabré, “Audiovisual speech recognition using the fuzzy shape filters model,” in Proc. 4th Eurospeech, vol. 1, 1995, pp. 307–310.
[60] R. Kaucic, B. Dalton, and A. Blake, “Real-time lip tracking for audio-visual speech recognition applications,” in Proc. 4th Eur. Conf. Computer Vision, 1996, pp. 376–387.
[61] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, “On combining classifiers,” IEEE Trans. Pattern Anal. Machine Intell., vol. 20, pp. 226–239, Mar. 1998.
[62] S. Kung and J. Hwang, “Neural networks for intelligent multimedia processing,” Proc. IEEE, vol. 86, no. 6, pp. 1244–1272, 1998.
[63] C. S. Liu, C. H. Lee, B. H. Juang, and A. E. Rosenberg, “Speaker recognition based on minimum error discriminative training,” in Proc. IEEE ICASSP, vol. 1, 1994, pp. 325–328.
[64] C. S. Liu, C. S. Huang, M. T. Lin, and H. C. Wang, “Automatic speaker recognition based upon various distances of LSP frequencies,” in Proc. 25th Annu. IEEE Int. Carnahan Conf. Security Technology, 1991, pp. 104–109.
[65] J. Luettin and N. A. Thacker, “Speechreading using probabilistic models,” Comput. Vis. Image Understand., vol. 65, no. 2, pp. 163–178, 1997.
[66] J. Luettin, N. A. Thacker, and S. W. Beet, “Speaker identification by lipreading,” in Proc. ICSLP, vol. 1, 1996, pp. 62–65.
[67] D. W. Massaro, “Speech perception by ear and eye,” in Hearing by Eye: The Psychology of Lip-Reading, B. Dodd and R. Campbell, Eds. Mahwah, NJ: Lawrence Erlbaum, 1987, pp. 53–83.
[68] D. W. Massaro and D. G. Stork, “Speech recognition and sensory integration,” Amer. Sci., vol. 86, no. 3, pp. 236–244, 1998.
[69] T. Matsui and S. Furui, “Text-independent speaker recognition using vocal tract and pitch information,” in Proc. ICSLP, 1990, pp. 137–140.
[70] I. Matthews, J. A. Bangham, and S. Cox, “Audiovisual speech recognition using multiscale nonlinear image decomposition,” in Proc. 4th ICSLP, vol. 1, 1996, pp. 38–41.
[71] U. Meier, R. Stiefelhagen, J. Yang, and A. Waibel, “Toward unrestricted lipreading,” Int. J. Pattern Recognit. Artif. Intell., vol. 14, no. 5, pp. 571–785, 2000.
[72] U. Meier, W. Hurst, and P. Duchnowski, “Adaptive bimodal sensor fusion for automatic speechreading,” in Proc. IEEE ICASSP, vol. 2, 1996, pp. 833–836.
[73] K. Messer, J. Matas, J. Kittler, J. Luettin, and G. Maitre, “XM2VTSDB: The extended M2VTS database,” in Proc. 2nd Int. Conf. Audio- and Video-Based Biometric Person Authentication, 1999, pp. 72–77.
[74] C. Montacié, P. Deleglise, F. Bimbot, and M. J. Caraty, “Cinematic techniques for speech processing: Temporal decomposition and multivariate linear prediction,” in Proc. IEEE ICASSP, vol. 1, 1992, pp. 153–156.
[75] N. Morgan and H. A. Bourlard, “Neural networks for statistical recognition of continuous speech,” Proc. IEEE, vol. 83, pp. 742–770, May 1995.
[76] J. M. Naik and D. M. Lubensky, “A hybrid HMM-MLP speaker verification algorithm for telephone speech,” in Proc. IEEE ICASSP, vol. 1, 1994, pp. 153–156.
[77] J. M. Naik, L. P. Netsch, and G. R. Doddington, “Speaker verification over long distance telephone lines,” in Proc. IEEE ICASSP, vol. 1, 1989, pp. 524–527.
[78] O. Nakamura, S. Mathur, and T. Minami, “Identification of human faces based on isodensity maps,” Pattern Recognit., vol. 24, no. 3, pp. 263–272, 1991.
[79] C. Neti et al., “Audio-visual speech recognition,” Center for Language and Speech Processing, Johns Hopkins University, Baltimore, MD, 2000.
[80] S. Nishida, “Speech recognition enhancement by lip information,” in Proc. CHI, 1986, pp. 198–204.
[81] A. Ogihara and S. Asao, “An isolated word speech recognition based on fusion of visual and auditory information using 30-frame/s and 24-bit color image,” IEICE Trans. Fund. Electron., Commun. Comput. Sci., vol. E80A, no. 8, pp. 1417–1422, 1997.

[82] J. Oglesby, “Neural models for speaker recognition,” Ph.D. dissertation, Univ. College of Swansea, Swansea, U.K., 1991.
[83] N. Oliver, A. Pentland, and F. Berard, “LAFTER: Lips and face real-time tracker,” in Proc. IEEE CCVPR, 1997, pp. 123–129.
[84] K. Otani and T. Hasegawa, “The image input microphone—A new nonacoustic speech communication system by media conversion from oral motion images to speech,” IEEE J. Select. Areas Commun., vol. 13, no. 1, pp. 42–48, 1995.
[85] H. Pan, Z.-P. Liang, and T. S. Huang, “A new approach to integrate audio and visual features of speech,” in Proc. 1st IEEE Int. Conf. Multimedia and Expo, vol. 2, 2000, pp. 1093–1096.
[86] E. Petajan and H. P. Graf, “Automatic lipreading research—Historic overview and current work,” in Proc. Symp. Multimedia Communications and Video Coding, 1995, pp. 265–275.
[87] E. Petajan and H. P. Graf, “Robust face feature analysis for automatic speechreading and character animation,” in Proc. Int. Conf. Automatic Face and Gesture Recognition, 1996, pp. 357–362.
[88] E. D. Petajan, N. M. Brooke, B. J. Bischoff, and D. A. Boddoff, “Experiments in automatic visual speech recognition,” in Proc. 7th FASE Symp., Book 4, 1988, pp. 1163–1170.
[89] S. Pigeon and L. Vandendorpe, “The M2VTS multimodal face database (Release 1.00),” Lecture Notes in Comput. Sci., vol. 1206, pp. 403–409, 1997.
[90] G. Potamianos, A. Verma, C. Neti, G. Iyengar, and S. Basu, “A cascade image transform for speaker independent automatic speechreading,” in Proc. 1st IEEE Int. Conf. Multimedia and Expo, vol. 2, 2000, pp. 1097–1100.
[91] G. Potamianos and H. P. Graf, “Discriminative training of HMM stream exponents for audio-visual speech recognition,” in Proc. IEEE ICASSP, vol. 6, 1998, pp. 3733–3736.
[92] G. Potamianos, H. P. Graf, and E. Cosatto, “An image transform for HMM-based automatic lipreading,” in Proc. IEEE ICIP, vol. 3, 1998, pp. 173–177.
[93] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[94] L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition. Englewood Cliffs, NJ: Prentice-Hall, 1993.
[95] L. R. Rabiner and M. R. Sambur, “An algorithm for determining the endpoints of isolated utterances,” Bell Syst. Tech. J., vol. 54, no. 2, pp. 297–315, 1975.
[96] R. P. Ramachandran, M. S. Zilovic, and R. J. Mammone, “A comparative study of robust linear predictive analysis methods with applications to speaker identification,” IEEE Trans. Speech Audio Processing, vol. 3, no. 2, pp. 117–125, 1995.
[97] D. Reisfeld and Y. Yeshurun, “Robust detection of facial features by generalized symmetry,” in Proc. 11th ICPR, vol. 1, 1992, pp. 117–120.
[98] D. A. Reynolds and R. C. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. Speech Audio Processing, vol. 3, no. 1, pp. 72–83, 1995.
[99] H. B. Richards, J. S. Bridle, M. J. Hunt, and J. S. Mason, “Vocal tract shape trajectory estimation using MLP analysis-by-synthesis,” in Proc. IEEE ICASSP, 1997, pp. 1287–1290.
[100] E. A. Rohwer, J. A. Lear, and R. J. Rohwer, “A large speaker identification and verification experiment,” in Proc. Inst. Acoustics Autumn Conf. (Speech and Hearing), vol. 16, 1994, pp. 517–524.
[101] A. E. Rosenberg, J. Delong, C. H. Lee, B. H. Juang, and F. K. Soong, “The use of cohort normalized scores for speaker verification,” in Proc. ICSLP, vol. 1, 1992, pp. 599–602.
[102] A. E. Rosenberg, O. Siohan, and S. Parthasarathy, “Speaker verification using minimum verification error training,” in Proc. IEEE ICASSP, 1998, pp. 105–108.
[103] L. Rudasi and S. A. Zahorian, “Text-independent talker identification using neural networks,” J. Acoust. Soc. Amer., pt. Suppl. 1, vol. 87, no. S104, 1990.
[104] H. Sako, M. Whitehouse, A. Smith, and A. Sutherland, “Real-time facial-feature tracking based on matching techniques and its applications,” in Proc. 12th ICPR (Conf. B), 1994, pp. 320–324.
[105] L. K. Saul and M. I. Jordan, “Boltzmann chains and hidden Markov models,” in Advances in Neural Information Processing Systems 7, G. Tesauro, D. Touretzky, and T. Leen, Eds. Cambridge, MA: MIT Press, 1995.
[106] L. K. Saul and M. I. Jordan, “Mixed memory Markov models: Decomposing complex stochastic processes as mixtures of simpler ones,” Mach. Learn., vol. 37, no. 1, pp. 75–87, 1999.
[107] M. Savic and J. Sorensen, “Phoneme-based speaker verification,” in Proc. IEEE ICASSP, vol. 2, 1992, pp. 165–168.

[108] R. W. Schafer and L. R. Rabiner, “Digital representations of speech signals,” Proc. IEEE, vol. 63, no. 4, pp. 662–677, 1975.
[109] R. Sharma, V. I. Pavlovic, and T. S. Huang, “Toward multimodal human-computer interface,” Proc. IEEE, vol. 86, no. 5, pp. 853–869, 1998.
[110] P. L. Silsbee, “Sensory integration in audiovisual automatic speech recognition,” in Conf. Rec. 28th Asilomar Conf. Signals, Systems, and Computers, 1994, pp. 561–565.
[111] P. L. Silsbee and A. C. Bovik, “Computer lipreading for improved accuracy in automatic speech recognition,” IEEE Trans. Speech Audio Processing, vol. 4, pp. 337–351, Sept. 1996.
[112] F. K. Soong and A. E. Rosenberg, “On the use of instantaneous and transitional spectral information in speaker recognition,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 36, no. 6, pp. 871–879, 1988.
[113] J. Sorensen and M. Savic, “Hierarchical pattern classification for high-performance text-independent speaker verification systems,” in Proc. IEEE ICASSP, vol. 1, 1994, pp. 157–160.
[114] D. G. Stork and M. E. Hennecke, “Speechreading: An overview of image processing, feature extraction, sensory integration and pattern recognition techniques,” in Proc. 2nd Int. Conf. Automatic Face and Gesture Recognition, 1996, pp. XVI–XXVI.
[115] W. H. Sumby and I. Pollack, “Visual contribution to speech intelligibility in noise,” J. Acoust. Soc. Amer., vol. 26, no. 2, pp. 212–215, 1954.
[116] Q. Summerfield, “Lipreading and audio-visual speech perception,” Philos. Trans. R. Soc. London B, vol. 335, no. 1273, pp. 71–78, 1992.
[117] Q. Summerfield, “Some preliminaries to a comprehensive account of audio-visual speech perception,” in Hearing by Eye: The Psychology of Lip-Reading, B. Dodd and R. Campbell, Eds. Mahwah, NJ: Lawrence Erlbaum, 1987, pp. 3–51.
[118] P. Teissier, J. Robert-Ribes, J. L. Schwartz, and A. Guérin-Dugué, “Comparing models for audiovisual fusion in a noisy-vowel recognition task,” IEEE Trans. Speech Audio Processing, vol. 7, pp. 629–642, Nov. 1999.
[119] R. Tucker, “Voice activity detection using a periodicity measure,” Proc. Inst. Electr. Eng. I, vol. 139, no. 4, pp. 377–380, 1992.
[120] P. Verlinde and G. Chollet, “Comparing decision fusion paradigms using k-NN-based classifiers, decision trees, and logistic regression in a multimodal identity verification application,” in Proc. 2nd Int. Conf. Audio- and Video-Based Biometric Person Authentication, 1999, pp. 188–193.
[121] T. Wagner and U. Dieckmann, “Multisensorial inputs for the identification of persons with synergetic computers,” in Proc. 1st IEEE ICIP, vol. 2, 1994, pp. 287–291.
[122] T. Wark and S. Sridharan, “A syntactic approach to automatic lip feature extraction for speaker identification,” in Proc. IEEE ICASSP, vol. 6, 1998, pp. 3693–3696.
[123] T. J. Wark, S. Sridharan, and V. Chandran, “Robust speaker verification via asynchronous fusion of speech and lip information,” in Proc. 2nd Int. Conf. Audio- and Video-Based Biometric Person Authentication, 1999, pp. 37–42.
[124] J. G. Wilpon, L. R. Rabiner, and T. Martin, “An improved word-detection algorithm for telephone-quality speech incorporating both syntactic and semantic constraints,” Bell Labs. Tech. J., vol. 63, no. 3, pp. 479–498, 1984.
[125] G. J. Wolff, K. V. Prasad, D. G. Stork, and M. Hennecke, “Lipreading by neural networks: Visual preprocessing, learning, and sensory integration,” in Advances in Neural Information Processing Systems 6, J. Cowan, G. Tesauro, and J. Alspector, Eds. San Mateo, CA: Morgan Kaufmann, 1994, pp. 1027–1034.
[126] J. T. Wu, S. Tamura, H. Mitsumoto, H. Kawai, K. Kurosu, and K. Okazaki, “Neural network vowel-recognition jointly using voice features and mouth shape image,” Pattern Recognit., vol. 24, no. 10, pp. 921–927, 1991.
[127] L. Wu, S. L. Oviatt, and P. R. Cohen, “Multimodal integration—A statistical view,” IEEE Trans. Multimedia, vol. 1, pp. 334–341, Dec. 1999.
[128] G. Yang and T. S. Huang, “Human face detection in a complex background,” Pattern Recognit., vol. 27, no. 1, pp. 53–63, 1994.
[129] K. Yu, J. Mason, and J. Oglesby, “Speaker recognition using hidden Markov models, dynamic time warping, and vector quantization,” Proc. Inst. Elect. Eng., Vis., Image, Signal Process., vol. 142, no. 5, pp. 313–318, 1995.
[130] B. P. Yuhas, M. H. Goldstein, Jr., T. J. Sejnowski, and R. E. Jenkins, “Neural network models of sensory integration for improved vowel recognition,” Proc. IEEE, vol. 78, pp. 1658–1668, Oct. 1990.
[131] A. L. Yuille, D. S. Cohen, and P. W. Hallinan, “Feature extraction from faces using deformable templates,” in Proc. IEEE CCVPR, 1989, pp. 104–109.

Claude C. Chibelushi received the B.Eng. degree in electronics and telecommunications from the University of Zambia, Lusaka, Zambia, in 1987, the M.Sc. degree in microelectronics and computer engineering from the University of Surrey, Surrey, U.K., in 1989, and the Ph.D. degree in electronic engineering from the University of Wales Swansea, Swansea, U.K., in 1997. In 1997, he joined the School of Computing at Staffordshire University, Stafford, U.K., where he is currently a Senior Lecturer. Before joining Staffordshire University, he was a Senior Research Assistant at the University of Wales Swansea from 1995 to 1996. He also worked as a Lecturer in the Department of Electrical and Electronic Engineering at the University of Zambia from 1989 to 1991. His current research interests include multimodal recognition, robust pattern recognition, content-based image retrieval, and medical image analysis. Dr. Chibelushi was a Beit Fellow from 1991 to 1995. He is currently a member of the Institution of Electrical Engineers (IEE).

Farzin Deravi (S’84–M’87) received the B.S. degree in engineering and economics from the University of Oxford, Oxford, U.K., in 1981, the M.Sc. degree in electronic engineering from Imperial College, University of London, London, U.K., in 1982, and the Ph.D. degree from the University of Wales Swansea, Swansea, U.K., in 1987. From 1983 to 1987, he was a Research Assistant at the University of Wales Swansea. In 1987, he joined the academic staff at Swansea, where he was active in teaching and research in the Department of Electrical and Electronic Engineering. In 1998, he joined the Electronic Engineering Laboratory at the University of Kent at Canterbury, Canterbury, U.K., as a Senior Lecturer. His current research interests include texture recognition, fractal coding, integrated audio-visual processing, biometric systems, and multimedia. Dr. Deravi is a member of the Institution of Electrical Engineers (IEE) and the British Machine Vision Association. He is currently the Chairman of the IEE Professional Network on Visual Information Engineering and an Editor of the International Journal of Engineering Applications of Artificial Intelligence.

John S. D. Mason received the M.Sc. and Ph.D. degrees in the areas of control and digital signal processing from the University of Surrey, Surrey, U.K., in 1973. He is currently a Senior Lecturer in the Department of Electrical and Electronic Engineering, University of Wales Swansea, Swansea, U.K. In 1983, after spending a year as a Senior Research Engineer at the Hewlett-Packard Laboratories, Edinburgh, U.K., he formed the Speech Research Group at Swansea (now expanded to Speech and Image Processing). In 1994, he spent six weeks as a Visiting Fellow at the ANU, Canberra, Australia, working on an industrially sponsored speaker recognition project. His current research interests include multimodal biometrics and multimedia signal processing.