IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 36, No. 9, September 2014

Bayesian Estimation of the von-Mises Fisher Mixture Model with Variational Inference

Jalil Taghia, Zhanyu Ma, and Arne Leijon

Abstract—This paper addresses the Bayesian estimation of the von-Mises Fisher (vMF) mixture model with variational inference (VI). The learning task in VI consists of optimization of the variational posterior distribution. However, the exact solution by VI does not lead to an analytically tractable solution, due to the evaluation of intractable moments involving functional forms of the Bessel function in their arguments. To derive a closed-form solution, we further lower bound the evidence lower bound, where the bound is tight at one point in the parameter distribution. With the value of the bound guaranteed to increase during maximization, we derive an analytically tractable approximation to the posterior distribution which has the same functional form as the assigned prior distribution. The proposed algorithm requires no iterative numerical calculation in the re-estimation procedure, and it can potentially determine the model complexity and avoid the over-fitting problem associated with conventional approaches based on expectation maximization. Moreover, we derive an analytically tractable approximation to the predictive density of the Bayesian mixture model of vMF distributions. The performance of the proposed approach is verified by experiments with both synthetic and real data.

Index Terms—Bayesian estimation, von-Mises Fisher distribution, mixture model, variational inference, directional distribution, predictive density, gene expressions, speaker identification

1 INTRODUCTION

Directional data arise naturally in many areas [1, chapter 1.4]. To analyze directional data, directional distributions are of great importance (e.g., [2], [1, chapter 9]), whereas popular parametric clustering algorithms, which employ mixtures of multivariate Gaussians in their generative models, yield poor performance (e.g., [3]). The von-Mises Fisher (vMF) distribution is a popular member of the family of directional distributions [1]. In information retrieval applications, it has been shown that cosine similarity is an efficient measure of similarity for analyzing and clustering text documents [4], [5]; hence, modeling such data by the vMF distribution is well motivated. Modeling gene expression data, which has been shown to have directional characteristics, is another application domain for directional distributions [6]. The vMF distribution has also been applied successfully in many other domains (e.g., [1], [7], [8], [9], [10], [11]). Moreover, the vMF distribution has potential applications in blind source separation, localization, and speaker identification (SI) (e.g., [12], [13]). In the analysis of directional distributions, the Bayesian literature is less extensive, partly due to difficulties in working with probability distributions commonly

 

J. Taghia and A. Leijon are with the School of Electrical Engineering, Communication Theory Lab., KTH Royal Institute of Technology, 10044 Stockholm, Sweden. E-mail: {taghia, leijon}@kth.se. Z. Ma is with the Pattern Recognition and Intelligent System Lab., Beijing University of Posts and Telecommunications, Beijing 100876, China. E-mail: [email protected], [email protected].

Manuscript received 30 Aug. 2012; revised 29 Sept. 2013; accepted 17 Jan. 2014. Date of publication 13 Feb. 2014; date of current version 29 July 2014. Recommended for acceptance by Y.W. Teh. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TPAMI.2014.2306426

associated with directional data, in particular with the vMF distribution [14]. An early attempt at Bayesian inference for the von Mises distribution is given by [15], in which the concentration parameter is assumed to be known (the vMF distribution for circular data is known as the von Mises distribution [1, chapter 3]). In [16], the authors considered a case where both the mean direction and the concentration parameter are unknown, but they were forced to use the posterior maximum likelihood point estimate for the concentration parameter. Bagchi and Kadane [17] developed Laplace approximations to posterior distributions involving the von Mises distribution. However, they only provided Bayes estimates for the cosine of the directional parameter under the assumption that the concentration parameter is known. In [18], a full Bayesian analysis of the von Mises distribution is proposed where both parameters are unknown; the authors developed a Gibbs sampler based on auxiliary variables to derive the posterior distribution. Nunez and Pena [14] pointed out that the autocorrelation of samples from the posterior distribution is fairly high for relatively large values of the concentration parameter, which makes the convergence of the Gibbs sampler [18] slow. Instead, they presented a Bayesian analysis of the vMF distribution in which samples from the posterior distribution are obtained using a sampling-importance-resampling (SIR) method. Although sampling-based approaches (e.g., [14], [18]) can provide accurate estimates of the posterior distribution, they can be computationally demanding, which in practice often limits their use to small-scale problems. The aim of this paper is to provide a full Bayesian treatment for the vMF model by approximating the posterior distribution with variational inference (VI) (e.g., [19], [20, chapter 10]). VI is an alternative to sampling-based methods which converts the intractable inference


problems into optimization [19]. We define prior distributions over the model parameters and employ a variational treatment for approximating the posterior distributions. Since the normalization factor of the vMF distribution involves the modified Bessel function of the first kind, there is no analytically tractable solution to the optimization of the variational posterior distribution under VI. Thus, we introduce a lower bound to the evidence lower bound through a Taylor series expansion, in order to avoid intractable moment computations. With the value of the bound guaranteed to increase during the maximization procedure, we derive a tractable approximation to the posterior distribution with the same functional form as the assigned prior distributions. It is notable that, in the literature, the Taylor approximation has been successfully employed in VI in order to avoid intractable moment computations (e.g., [21], [22]).

The mixture model is a probabilistic tool for analyzing both univariate and multivariate data [23], [24]. Piater [25] employed a numerical gradient-descent-based approach to derive the parameter estimates for a two-dimensional vMF mixture model, because of the difficulty in deriving a closed-form solution for estimation of the concentration parameter. In [26], a mixture of two circular von Mises distributions is employed and the parameters are estimated using a quasi-Newton procedure. Banerjee et al. [11] estimated the mean and concentration parameters of the mixture in an expectation maximization (EM) framework. However, they mentioned that it is not trivial to obtain a closed-form expression for the estimate of the concentration parameter, since it involves functional inversion of ratios of Bessel functions, and ways of approximating those ratios are inadequate for high-dimensional data. In general, the EM algorithm for the vMF mixture model has some disadvantages. It may potentially lead to over-fitting when the mixture models are excessively complex, and it needs some prior information on the true number of components. Moreover, the iterative numerical calculation in the maximization step (e.g., with the Newton-Raphson method) causes high computational cost.

Here, we provide a full Bayesian treatment for the vMF mixture model by generalizing our approach for a single vMF distribution to a mixture of vMF distributions. Moreover, we derive an approximation to the predictive density, as the exact density is not available in a closed-form expression. The approximate solution is analytically tractable and is shown to be quite accurate. In the experimental results, we first empirically evaluate the presented approach through experiments on synthetic data. Next, we evaluate the method on real data sets; clustering of gene expression data and modeling the underlying distributions of the line spectral frequency (LSF) parameters of speech are considered for this purpose.

The rest of the paper is organized as follows. In Section 2, we describe Bayesian estimation of a single vMF component by variational inference, and in Section 3 we extend our method to the case of a mixture of vMF distributions. In Section 4, we derive the predictive density of the Bayesian mixture model of vMF distributions. The experimental results are presented in Section 5. Finally, Section 6 concludes the paper.


Notation. Vectors are symbolized by bold-face lower-case letters (e.g., $\mathbf{x}$), matrices by bold-face upper-case letters (e.g., $\mathbf{X}$), and sets by underlined letters (e.g., $\underline{X}$); scalars are represented by either lower-case or upper-case letters (e.g., $n$, $N$). $\mathbb{E}[f(x)]$ and $\langle f(x)\rangle_x$ denote the expectation of $f(x)$ with respect to the true variable distribution and the variational variable distribution, respectively; $\|\cdot\|$ denotes the $L_2$ norm, and $\|\cdot\|_1$ denotes the $L_1$ norm. The derivative of $f(x)$ with respect to $x$ is written as $\frac{\partial f(x)}{\partial x}$, $f'(x)$, or $(f(x))'$.

2 THE VON-MISES FISHER MODEL

A $D$-dimensional unit random vector $\mathbf{x}\in\mathbb{S}^{D-1}$, where $\mathbb{S}^{D-1} = \{\mathbf{x}\mid\mathbf{x}\in\mathbb{R}^D,\ \|\mathbf{x}\|=1\}$, is said to have a $D$-variate vMF distribution if its probability density function is given by

$$F(\mathbf{x}\mid\boldsymbol\mu,\kappa) = c_D(\kappa)\, e^{\kappa\,\boldsymbol\mu^T\mathbf{x}}, \qquad (1)$$

where $\|\boldsymbol\mu\| = 1$, $\kappa \ge 0$, and $D \ge 2$ [1, chapter 9]; the superscript $T$ indicates the transpose operator. The normalizing constant $c_D(\kappa)$ is given by

$$c_D(\kappa) = \frac{\kappa^{\frac{D}{2}-1}}{(2\pi)^{\frac{D}{2}}\, I_{\frac{D}{2}-1}(\kappa)}, \qquad (2)$$

where $I_\nu(\cdot)$ represents the modified Bessel function of the first kind of order $\nu$ [27, chapter 10]. The density function $F(\mathbf{x}\mid\boldsymbol\mu,\kappa)$ is characterized by the mean direction $\boldsymbol\mu$ and the concentration parameter $\kappa$. Larger values of $\kappa$ imply a stronger concentration about the mean direction. In particular, when $\kappa \to 0$, $F(\mathbf{x}\mid\boldsymbol\mu,\kappa)$ reduces to the uniform density on $\mathbb{S}^{D-1}$, and as $\kappa \to \infty$, $F(\mathbf{x}\mid\boldsymbol\mu,\kappa)$ tends to a point density [28]. For a complex unit vector $\mathbf{x}\in\mathbb{CS}^{D-1}$, where $\mathbb{CS}^{D-1} = \{\mathbf{x}\mid\mathbf{x}\in\mathbb{C}^D,\ \|\mathbf{x}\|=1\}$, the vMF distribution can be defined as

$$F(\mathbf{x}\mid\boldsymbol\mu,\kappa) = c_{2D}(\kappa)\, e^{\kappa\,\mathrm{Re}(\boldsymbol\mu^{\dagger}\mathbf{x})}, \qquad (3)$$

where $\mathrm{Re}(\cdot)$ takes the real part of its argument, and the superscript $\dagger$ indicates the conjugate transpose operator.
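As a concrete illustration of (1)-(2), the following minimal Python sketch (not from the paper; the function and variable names are our own) evaluates the vMF log-density, using the exponentially scaled Bessel function for numerical stability:

```python
import numpy as np
from scipy.special import ive  # ive(nu, k) = I_nu(k) * exp(-k)

def vmf_log_pdf(x, mu, kappa):
    """log F(x | mu, kappa) in Eq. (1) for unit vectors x (N x D or D,) and mu (D,)."""
    D = mu.shape[0]
    nu = D / 2.0 - 1.0
    # log c_D(kappa) from Eq. (2); ln I_nu(kappa) = ln ive(nu, kappa) + kappa
    log_c = nu * np.log(kappa) - 0.5 * D * np.log(2.0 * np.pi) \
            - (np.log(ive(nu, kappa)) + kappa)
    return log_c + kappa * (x @ mu)
```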

2.1 Conjugate Prior and Posterior

Let $X = \{\mathbf{x}_1,\ldots,\mathbf{x}_N\}$ denote a set of independent and identically distributed (i.i.d.) observations of size $N$ on the unit hypersphere, $\mathbf{x}_n\in\mathbb{S}^{D-1}$, drawn from $F(\mathbf{x}_n\mid\boldsymbol\mu,\kappa)$. The conjugate prior for $(\boldsymbol\mu,\kappa)$ can be expressed as

$$p(\boldsymbol\mu,\kappa) = \frac{1}{Z_0}\,c_D^{n_0}(\kappa)\,e^{b_0\kappa\,\mathbf{m}_0^T\boldsymbol\mu}, \qquad (4)$$

where $Z_0$ is an unknown normalization factor, $\mathbf{m}_0\in\mathbb{S}^{D-1}$, and $n_0, b_0 \ge 0$. Given $X$ and the assigned conjugate prior (4), the posterior distribution of $(\boldsymbol\mu,\kappa)$ takes the form

$$p(\boldsymbol\mu,\kappa\mid X) = \underbrace{\frac{1}{Z}\,\frac{c_D^{n}(\kappa)}{c_D(b\kappa)}}_{p(\kappa\mid X)}\;\underbrace{c_D(b\kappa)\,e^{b\kappa\,\mathbf{m}^T\boldsymbol\mu}}_{p(\boldsymbol\mu\mid\kappa,X)}, \qquad (5)$$

where $b = \|b_0\mathbf{m}_0 + \sum_{n=1}^{N}\mathbf{x}_n\|$, $\mathbf{m} = (b_0\mathbf{m}_0 + \sum_{n=1}^{N}\mathbf{x}_n)\,b^{-1}$, and $n = n_0 + N$. As $Z$ is not available in a closed form, moments and other characteristics of (5) have to be


Fig. 1. Example of an exact posterior density function of $\kappa$, i.e., $p(\kappa\mid X) = \frac{1}{Z}\,\frac{c_D^{n}(\kappa)}{c_D(b\kappa)}$, and a fitted Gamma approximation, i.e., $p(\kappa\mid X) \simeq \mathcal{G}(\kappa\mid a, b)$.

computed numerically from samples of the posterior distribution. In general, sampling methods can be slow to converge, and their convergence can be difficult to diagnose [20, chapter 11]. In the following, we approximate the posterior distribution with a class of deterministic methods known as variational inference [19].

2.2 Variational Inference

Variational inference is an alternative to computationally costly sampling-based approaches; it replaces sampling with optimization and provides a deterministic approximation to the posterior distribution [20, chapter 10], [19]. This section describes the Bayesian estimation of a single vMF distribution with VI.

2.2.1 Model Description

The goal is to infer the posterior distribution for the mean direction $\boldsymbol\mu$ and the concentration parameter $\kappa$ given the data set $X$. The generative model is given by the joint distribution of all random variables,

$$p(X,\boldsymbol\mu,\kappa) = p(X\mid\boldsymbol\mu,\kappa)\,p(\boldsymbol\mu,\kappa), \qquad (6)$$

where $p(X\mid\boldsymbol\mu,\kappa) = \prod_{n=1}^{N}F(\mathbf{x}_n\mid\boldsymbol\mu,\kappa)$. While the conjugate prior given by (4) exists, it is difficult to use because of the dependence on the unknown normalizing constant, and hence it is not used here. Instead, we assume a vMF-Gamma distribution as the prior over $\boldsymbol\mu$ and $\kappa$,

$$p(\boldsymbol\mu,\kappa) = p(\boldsymbol\mu\mid\kappa)\,p(\kappa) = F(\boldsymbol\mu\mid\mathbf{m}_0,\beta_0\kappa)\,\mathcal{G}(\kappa\mid a_0, b_0), \qquad (7)$$

where $\mathcal{G}(\kappa\mid a_0,b_0)$ is a Gamma density function with shape parameter $a_0$ and inverse scale parameter $b_0$, given by $\mathcal{G}(\kappa\mid a_0,b_0) = \frac{b_0^{a_0}}{\Gamma(a_0)}\,\kappa^{a_0-1}e^{-b_0\kappa}$, where $\Gamma(\cdot)$ denotes the Gamma function ($a_0$ and $b_0$ are regarded as prior parameters). The choice of the vMF distribution for $p(\boldsymbol\mu\mid\kappa)$ is motivated by the fact that, by (4) and (5), the conjugate conditional distribution of $\boldsymbol\mu$ given $\kappa$ is a vMF distribution $F(\boldsymbol\mu\mid\mathbf{m}_0,\beta_0\kappa)$. The choice of the Gamma distribution for $p(\kappa)$ seems reasonable since, first, the concentration parameter $\kappa$ is scalar and positive, and second, for large values of $\kappa$, $p(\kappa)$ tends to take the form of a Gamma density. This is because, for large values of $\kappa$, $I_{\frac{D}{2}-1}(\kappa) \simeq \frac{e^{\kappa}}{\sqrt{2\pi\kappa}}$, which leads to

$$p(\kappa) = \frac{1}{Z}\,\frac{c_D^{n}(\kappa)}{c_D(b\kappa)} \simeq \frac{1}{Q}\,\kappa^{(n-1)\left(\frac{D}{2}-\frac{1}{2}\right)}\,e^{-(n-b)\kappa} = \mathcal{G}(\kappa\mid\bar a,\bar b), \qquad (8)$$

where $Q$ is the normalization constant, $\bar b = n - b$, and $\bar a = (n-1)\left(\frac{D}{2}-\frac{1}{2}\right) + 1$. In Section 2.2.3, we will also enforce a similar Gamma approximation for the posterior $p(\kappa)$. Fig. 1 illustrates an example of a fitted Gamma distribution to the true distribution of $\kappa$. This example further motivates our choice of the Gamma prior in (7).

2.2.2 Variational Distribution

In the fully factorized approximation (known as the mean-field approximation), all variables are regarded as mutually independent; in other words, the variational posterior $q(\boldsymbol\mu,\kappa)$ is expressed as $q(\boldsymbol\mu,\kappa) = q(\boldsymbol\mu)q(\kappa)$. Here, however, we consider a structural factorization of the variational posterior,

$$q(\boldsymbol\mu,\kappa) = q(\boldsymbol\mu\mid\kappa)\,q(\kappa). \qquad (9)$$

As a result of this particular factorization, we can retain some information about uncertainty, in a similar way as in the mean-field approximation. Contrary to the mean-field approximation, we allow this information to propagate between the factors $\boldsymbol\mu$ and $\kappa$. Thus, unlike the mean-field approximation, which violates the correlation between $\boldsymbol\mu$ and $\kappa$, the structural factorization can possibly preserve the actual correlation to some extent. Optimization of the variational factors $q(\boldsymbol\mu\mid\kappa)$ and $q(\kappa)$ is performed by maximizing the lower bound on the log marginal likelihood (evidence). This lower bound is called the evidence lower bound.1 However, in our case, the evidence lower bound is still not available in a closed form, since it involves evaluation of intractable moments. To avoid the intractable moment computation, with extended variational methods, we further lower bound the evidence lower bound, while the bound is tight at one point in the parameter distribution [19]. We proceed by deriving the sequential updates for optimizing the variational factors.

2.2.3 Optimization of the Variational Distribution

Regarding the described model, the evidence lower bound is given by

$$\mathcal{L}(q) = \left\langle \ln\frac{p(X\mid\boldsymbol\mu,\kappa)\,p(\boldsymbol\mu\mid\kappa)\,p(\kappa)}{q(\boldsymbol\mu\mid\kappa)\,q(\kappa)}\right\rangle_{\boldsymbol\mu,\kappa}. \qquad (10)$$

Including all terms which are independent of $\boldsymbol\mu$ and $\kappa$ in a constant term cst, (10) can be rewritten as

$$\mathcal{L}(q) = \mathrm{cst} + \langle f(\boldsymbol\mu,\kappa)\rangle_{\boldsymbol\mu,\kappa} - \langle\ln q(\boldsymbol\mu,\kappa)\rangle_{\boldsymbol\mu,\kappa}, \qquad (11)$$

$$f(\boldsymbol\mu,\kappa) = \sum_{n=1}^{N}\left[\left(\tfrac{D}{2}-1\right)\ln\kappa - \ln I_{\frac{D}{2}-1}(\kappa) + \kappa\,\boldsymbol\mu^T\mathbf{x}_n\right] + \left(\tfrac{D}{2}-1\right)\ln\kappa - \ln I_{\frac{D}{2}-1}(\beta_0\kappa) + \beta_0\kappa\,\mathbf{m}_0^T\boldsymbol\mu + (a_0-1)\ln\kappa - b_0\kappa. \qquad (12)$$

Inspecting (12) and reading off those terms which involve only $\boldsymbol\mu$, we can express $\ln q^*(\boldsymbol\mu\mid\kappa)$ as

$$\ln q^*(\boldsymbol\mu\mid\kappa) = \mathrm{cst} + \beta_0\kappa\,\mathbf{m}_0^T\boldsymbol\mu + \sum_{n=1}^{N}\kappa\,\boldsymbol\mu^T\mathbf{x}_n, \qquad (13)$$

where $q^*(\cdot)$ denotes the posterior distribution after optimization. By taking the exponential of both sides of this expression, $q^*(\boldsymbol\mu\mid\kappa)$ is recognized to be a vMF density,

$$q^*(\boldsymbol\mu\mid\kappa) = F(\boldsymbol\mu\mid\mathbf{m},\beta\kappa), \qquad (14)$$
$$\beta = \left\|\beta_0\mathbf{m}_0 + \sum_{n=1}^{N}\mathbf{x}_n\right\|, \qquad (15)$$
$$\mathbf{m} = \beta^{-1}\left(\beta_0\mathbf{m}_0 + \sum_{n=1}^{N}\mathbf{x}_n\right), \qquad (16)$$

where $\beta$ and $\mathbf{m}$ are posterior parameters and $\beta_0$ and $\mathbf{m}_0$ are the prior parameters. Now we can determine $\ln q^*(\kappa)$ simply as $\ln q^*(\kappa) = \ln q^*(\boldsymbol\mu,\kappa) - \ln q^*(\boldsymbol\mu\mid\kappa)$, where on the right-hand side of this equality we substitute for $\ln q^*(\boldsymbol\mu\mid\kappa)$ using (13) and for $\ln q^*(\boldsymbol\mu,\kappa) = f(\boldsymbol\mu,\kappa)$ using (12). Retaining those terms that have some functional dependence on $\kappa$ and absorbing other terms into a constant term, (10) can be expressed as

$$\mathcal{L}(q) = \mathrm{cst} + \langle g(\kappa)\rangle_\kappa - \langle\ln q(\kappa)\rangle_\kappa, \qquad (17)$$

$$g(\kappa) = \left[N\left(\tfrac{D}{2}-1\right) + a_0 - 1\right]\ln\kappa - b_0\kappa - N\ln I_{\frac{D}{2}-1}(\kappa) - \ln I_{\frac{D}{2}-1}(\beta_0\kappa) + \ln I_{\frac{D}{2}-1}(\beta\kappa). \qquad (18)$$

We notice that (17) cannot be computed analytically, as it involves the intractable moments $\langle\ln I_{\frac{D}{2}-1}(\kappa)\rangle_\kappa$, $\langle\ln I_{\frac{D}{2}-1}(\beta_0\kappa)\rangle_\kappa$, and $\langle\ln I_{\frac{D}{2}-1}(\beta\kappa)\rangle_\kappa$. The goal is to find $\tilde g(\kappa)$ such that $\tilde g(\kappa) \le g(\kappa)$, from which we can further bound the evidence lower bound. For this purpose, we state the following lemmas (proofs are available in the Appendix of the supplementary material, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPAMI.2014.2306426).

Lemma 2.1. The function $\ln I_\nu(x)$ is strictly concave with respect to $x$ for all $\nu > 0$ and $x\in\mathbb{R}$.

Lemma 2.2. The function $\ln I_\nu(x)$ is strictly convex relative to $\ln x$ for all $\nu > 0$ and $x > 0$ ($x\in\mathbb{R}$).

As a result of Lemma 2.1 and Lemma 2.2, the following inequalities hold true:2

$$\ln I_{\frac{D}{2}-1}(\kappa) \le \ln I_{\frac{D}{2}-1}(\bar\kappa) + \frac{\partial\ln I_{\frac{D}{2}-1}(\bar\kappa)}{\partial\bar\kappa}\,(\kappa-\bar\kappa), \qquad (19)$$

$$\ln I_{\frac{D}{2}-1}(\beta_0\kappa) \le \ln I_{\frac{D}{2}-1}(\beta_0\bar\kappa) + \frac{\partial\ln I_{\frac{D}{2}-1}(\beta_0\bar\kappa)}{\partial(\beta_0\bar\kappa)}\,(\beta_0\kappa-\beta_0\bar\kappa), \qquad (20)$$

$$\ln I_{\frac{D}{2}-1}(\beta\kappa) \ge \ln I_{\frac{D}{2}-1}(\beta\bar\kappa) + \frac{\partial\ln I_{\frac{D}{2}-1}(\beta\bar\kappa)}{\partial(\beta\bar\kappa)}\,\beta\bar\kappa\,(\ln\kappa-\ln\bar\kappa). \qquad (21)$$

In (19)-(21), equality holds where $\kappa = \bar\kappa$, which means the bounds are tight at the point $\kappa = \bar\kappa$ in the parameter distribution. In a general form, $\frac{\partial}{\partial x}\ln I_\nu(x) = \frac{I'_\nu(x)}{I_\nu(x)}$ can be computed by using the recurrence relation [27], that is, $I'_\nu(x) = I_{\nu+1}(x) + \frac{\nu}{x}I_\nu(x)$. By substituting (19)-(21) in (18), $g(\kappa)$ is bounded as

$$g(\kappa) \ge \tilde g(\kappa) = \left[N\left(\tfrac{D}{2}-1\right) + a_0 - 1\right]\ln\kappa - b_0\kappa - N\ln I_{\frac{D}{2}-1}(\bar\kappa) - \ln I_{\frac{D}{2}-1}(\beta_0\bar\kappa) + \ln I_{\frac{D}{2}-1}(\beta\bar\kappa) - N\,\frac{\partial\ln I_{\frac{D}{2}-1}(\bar\kappa)}{\partial\bar\kappa}\,(\kappa-\bar\kappa) - \frac{\partial\ln I_{\frac{D}{2}-1}(\beta_0\bar\kappa)}{\partial(\beta_0\bar\kappa)}\,(\beta_0\kappa-\beta_0\bar\kappa) + \frac{\partial\ln I_{\frac{D}{2}-1}(\beta\bar\kappa)}{\partial(\beta\bar\kappa)}\,\beta\bar\kappa\,(\ln\kappa-\ln\bar\kappa). \qquad (22)$$

Let $\tilde{\mathcal{L}}(q)$ denote the lower bound to the evidence lower bound $\mathcal{L}(q)$; then we have

$$\mathcal{L}(q) \ge \tilde{\mathcal{L}}(q) = \mathrm{cst} + \langle\tilde g(\kappa)\rangle_\kappa - \langle\ln q(\kappa)\rangle_\kappa. \qquad (23)$$

We note that $\tilde{\mathcal{L}}(q)$ is tight at the point $\kappa = \bar\kappa$. Thus, the approximate solution $\ln q^*(\kappa)$ can be expressed in terms of $\kappa$ and $\ln\kappa$ as

$$\ln q^*(\kappa) = \mathrm{cst} + (a-1)\ln\kappa - b\kappa, \qquad (24)$$

where we have defined

$$a = a_0 + N\left(\tfrac{D}{2}-1\right) + \beta\bar\kappa\,\frac{\partial\ln I_{\frac{D}{2}-1}(\beta\bar\kappa)}{\partial(\beta\bar\kappa)}, \qquad (25)$$

$$b = b_0 + N\,\frac{\partial\ln I_{\frac{D}{2}-1}(\bar\kappa)}{\partial\bar\kappa} + \beta_0\,\frac{\partial\ln I_{\frac{D}{2}-1}(\beta_0\bar\kappa)}{\partial(\beta_0\bar\kappa)}. \qquad (26)$$

By taking the exponential of both sides of (24), we recognize $q^*(\kappa)$ to be a Gamma density,

$$q^*(\kappa) = \mathcal{G}(\kappa\mid a, b), \qquad (27)$$

where $a$ and $b$ are the posterior parameters. In the above expressions,

$$\bar\kappa = \begin{cases}\frac{a-1}{b} & \text{if } a > 1,\\ \frac{a}{b} & \text{otherwise},\end{cases} \qquad (28)$$

which is obtained from the previous iteration.

1. The evidence lower bound is the negative free energy.
2. To keep the notation uncluttered, $\frac{\partial}{\partial x}f(x)\big|_{x=\bar x}$ is denoted simply by $\frac{\partial}{\partial\bar x}f(\bar x)$.
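To make the update cycle above concrete, the following Python sketch (our own illustration, not the authors' implementation; the function names and the heuristic initialization of the expansion point are our own choices) iterates (15)-(16) and (25)-(28) for a single vMF component:

```python
import numpy as np
from scipy.special import ive

def dlog_bessel(nu, t):
    """d/dt ln I_nu(t) = I_{nu+1}(t)/I_nu(t) + nu/t, using the recurrence relation."""
    return ive(nu + 1.0, t) / ive(nu, t) + nu / t

def vb_single_vmf(X, m0, beta0=1e-2, a0=1e-2, b0=1e-2, n_iter=50):
    """Posterior q(mu | kappa) = F(mu | m, beta*kappa), q(kappa) = Gamma(a, b)."""
    N, D = X.shape
    nu = D / 2.0 - 1.0
    s = beta0 * m0 + X.sum(axis=0)
    beta = np.linalg.norm(s)                                   # Eq. (15)
    m = s / beta                                               # Eq. (16)
    kbar = float(D)                                            # heuristic starting expansion point
    for _ in range(n_iter):
        a = a0 + N * nu + beta * kbar * dlog_bessel(nu, beta * kbar)                # Eq. (25)
        b = b0 + N * dlog_bessel(nu, kbar) + beta0 * dlog_bessel(nu, beta0 * kbar)  # Eq. (26)
        kbar = (a - 1.0) / b if a > 1.0 else a / b                                  # Eq. (28)
    return m, beta, a, b
```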

3 THE VMF MIXTURE MODEL

We will now generalize the single vMF model to the vMF mixture model. As before, we denote the observed data set by $X = \{\mathbf{x}_1,\ldots,\mathbf{x}_N\}$. With $K$ mixture components, the probability density function of the vMF mixture model is represented as

$$p(X\mid\boldsymbol\mu,\boldsymbol\kappa,\boldsymbol\pi) = \prod_{n=1}^{N}\sum_{k=1}^{K}\pi_k\,F(\mathbf{x}_n\mid\boldsymbol\mu_k,\kappa_k), \qquad (29)$$

where $\boldsymbol\pi = \{\pi_k\}_{k=1}^{K}$ denotes a set of mixture weights, $\boldsymbol\mu = \{\boldsymbol\mu_k\}_{k=1}^{K}$ denotes a set of mean directions, and $\boldsymbol\kappa = \{\kappa_k\}_{k=1}^{K}$ denotes a set of concentration parameters.

Fig. 2. Directed acyclic graph representing the Bayesian mixture model of von-Mises Fisher distributions.

In a Bayesian framework, as shown in the graphical representation of the mixture model in Fig. 2, for each $D$-dimensional observation vector $\mathbf{x}_n$ there is a corresponding latent switch variable $\mathbf{z}_n = [z_{n1},\ldots,z_{nK}]^T$, a 1-of-$K$ binary vector with elements $z_{nk}$ for $k=1,\ldots,K$. Denoting the set of switch variables by $Z = \{\mathbf{z}_1,\ldots,\mathbf{z}_N\}$, the conditional distribution of $Z$ given $\boldsymbol\pi$ is categorically distributed as $p(Z\mid\boldsymbol\pi) = \prod_{n=1}^{N}\prod_{k=1}^{K}\pi_k^{z_{nk}}$. The conditional distribution of $X$ given $Z$, $\boldsymbol\mu$, and $\boldsymbol\kappa$ can be expressed as $p(X\mid Z,\boldsymbol\mu,\boldsymbol\kappa) = \prod_{n=1}^{N}\prod_{k=1}^{K}F(\mathbf{x}_n\mid\boldsymbol\mu_k,\kappa_k)^{z_{nk}}$. Thus, the generative model can be expressed by the joint distribution of all random variables,

$$p(X,Z,\boldsymbol\pi,\boldsymbol\mu,\boldsymbol\kappa) = p(X\mid Z,\boldsymbol\mu,\boldsymbol\kappa)\,p(\boldsymbol\mu,\boldsymbol\kappa)\,p(Z\mid\boldsymbol\pi)\,p(\boldsymbol\pi). \qquad (30)$$

Next we introduce prior distributions over the model parameters $\boldsymbol\mu$, $\boldsymbol\kappa$, and $\boldsymbol\pi$. A Dirichlet prior distribution is introduced over the mixing coefficients $\boldsymbol\pi$ as

$$p(\boldsymbol\pi) = \mathrm{Dir}(\boldsymbol\pi\mid\boldsymbol\alpha_0) = C(\boldsymbol\alpha_0)\prod_{k=1}^{K}\pi_k^{\alpha_{0,k}-1}, \qquad (31)$$

where $C(\boldsymbol\alpha_0) = \frac{\Gamma\left(\sum_{k=1}^{K}\alpha_{0,k}\right)}{\prod_{k=1}^{K}\Gamma(\alpha_{0,k})}$ and $\boldsymbol\alpha_0 = (\alpha_{0,1},\ldots,\alpha_{0,K})$. Similarly as before, we introduce a vMF-Gamma prior over the parameters $\boldsymbol\mu$ and $\boldsymbol\kappa$ as

$$p(\boldsymbol\mu,\boldsymbol\kappa) = p(\boldsymbol\mu\mid\boldsymbol\kappa)\,p(\boldsymbol\kappa) = \prod_{k=1}^{K}F(\boldsymbol\mu_k\mid\mathbf{m}_{0,k},\beta_{0,k}\kappa_k)\,\mathcal{G}(\kappa_k\mid a_{0,k},b_{0,k}). \qquad (32)$$

We recall that $\alpha_{0,k}$, $\mathbf{m}_{0,k}$, $\beta_{0,k}$, $a_{0,k}$, and $b_{0,k}$ are regarded as the prior parameters of the $k$th mixture component.

3.1 Variational Inference

Having $p(X,Z,\boldsymbol\pi,\boldsymbol\mu,\boldsymbol\kappa)$ defined by (30), we now consider a variational distribution which factorizes between the latent switch variables and the component parameters as

$$q(Z,\boldsymbol\pi,\boldsymbol\mu,\boldsymbol\kappa) = q(Z)\,q(\boldsymbol\pi,\boldsymbol\mu,\boldsymbol\kappa) = q(Z)\,q(\boldsymbol\pi)\,q(\boldsymbol\mu\mid\boldsymbol\kappa)\,q(\boldsymbol\kappa). \qquad (33)$$

The functional form of the factors $q(Z)$ and $q(\boldsymbol\pi,\boldsymbol\mu,\boldsymbol\kappa)$ is determined by optimization of the variational distribution. The main focus here is on the optimization of the posterior distribution of the latent switch variables $q(Z)$, as it involves evaluation of intractable moments. We notice that optimization of the second factor $q(\boldsymbol\pi,\boldsymbol\mu,\boldsymbol\kappa)$ can be carried out in a similar way as described in Section 2, by extending the single vMF component to a mixture of vMF components. Regarding our model, the evidence lower bound can be expressed by

$$\mathcal{L}(q) = \left\langle\ln\frac{p(X\mid Z,\boldsymbol\mu,\boldsymbol\kappa)\,p(\boldsymbol\mu,\boldsymbol\kappa)\,p(Z\mid\boldsymbol\pi)\,p(\boldsymbol\pi)}{q(Z)\,q(\boldsymbol\mu\mid\boldsymbol\kappa)\,q(\boldsymbol\kappa)\,q(\boldsymbol\pi)}\right\rangle_{Z,\boldsymbol\mu,\boldsymbol\kappa,\boldsymbol\pi}. \qquad (34)$$

Let (34) be rewritten by including all terms which do not involve $Z$ in a constant term, as

$$\mathcal{L}(q) = \mathrm{cst} + \left\langle\ln\frac{p(X\mid Z,\boldsymbol\mu,\boldsymbol\kappa)\,p(Z\mid\boldsymbol\pi)}{q(Z)}\right\rangle_{Z,\boldsymbol\mu,\boldsymbol\kappa,\boldsymbol\pi} = \mathrm{cst} + \sum_{k=1}^{K}\sum_{n=1}^{N}\langle z_{nk}\ln\rho_{nk}\rangle_Z - \langle\ln q(Z)\rangle_Z, \qquad (35)$$

where $\ln\rho_{nk}$ is given by

$$\ln\rho_{nk} = \langle\ln\pi_k\rangle_{\boldsymbol\pi} - \frac{D}{2}\ln 2\pi + \left(\frac{D}{2}-1\right)\langle\ln\kappa_k\rangle_{\kappa} - \left\langle\ln I_{\frac{D}{2}-1}(\kappa_k)\right\rangle_{\kappa} + \left\langle\kappa_k\boldsymbol\mu_k^T\mathbf{x}_n\right\rangle_{\boldsymbol\mu,\kappa}. \qquad (36)$$

We observe that (35) is not available in a closed form, as (36) involves the intractable moment $\langle\ln I_{\frac{D}{2}-1}(\kappa_k)\rangle_\kappa$. The goal is to bound $\ln\rho_{nk}$ such that $\ln\tilde\rho_{nk} \le \ln\rho_{nk}$, by which we can further lower bound the evidence lower bound as

$$\mathcal{L}(q) \ge \tilde{\mathcal{L}}(q) = \mathrm{cst} + \sum_{k=1}^{K}\sum_{n=1}^{N}\left\langle z_{nk}\ln\tilde\rho_{nk}\right\rangle_Z - \langle\ln q(Z)\rangle_Z, \qquad (37)$$

where the bound is tight at the point $\kappa_k = \bar\kappa_k$ in the parameter distribution. Taking into account (19), $\ln\rho_{nk}$ is bounded as

$$\ln\rho_{nk} \ge \ln\tilde\rho_{nk} = \langle\ln\pi_k\rangle_{\boldsymbol\pi} - \frac{D}{2}\ln 2\pi + \left(\frac{D}{2}-1\right)\langle\ln\kappa_k\rangle_\kappa + \left\langle\kappa_k\boldsymbol\mu_k^T\mathbf{x}_n\right\rangle_{\boldsymbol\mu,\kappa} - \ln I_{\frac{D}{2}-1}(\bar\kappa_k) - \frac{\partial\ln I_{\frac{D}{2}-1}(\bar\kappa_k)}{\partial\bar\kappa_k}\left(\langle\kappa_k\rangle_\kappa - \bar\kappa_k\right), \qquad (38)$$

where this inequality validates inequality (37). Thus, the variational posterior distribution of the latent switch variables can be expressed as

$$\ln q^*(Z) = \mathrm{cst} + \sum_{n=1}^{N}\sum_{k=1}^{K}z_{nk}\ln\tilde\rho_{nk}. \qquad (39)$$

Hence, we get $q^*(Z) = \prod_{n=1}^{N}\prod_{k=1}^{K}r_{nk}^{z_{nk}}$, which is a multinomial distribution with

$$r_{nk} = \frac{e^{\ln\tilde\rho_{nk}}}{\sum_{j=1}^{K}e^{\ln\tilde\rho_{nj}}}. \qquad (40)$$

Obviously, the quantities $r_{nk}$ are nonnegative and sum to one. For $q^*(Z)$, we have $\langle z_{nk}\rangle_Z = r_{nk}$. Now we consider the factor $q(\boldsymbol\pi,\boldsymbol\mu,\boldsymbol\kappa)$ in the posterior distribution (33). As shown in Fig. 2, $\boldsymbol\pi$ is independent of $\boldsymbol\kappa$ and $\boldsymbol\mu$ given $X$; hence, $q(\boldsymbol\pi,\boldsymbol\mu,\boldsymbol\kappa)$ is factorized as $q(\boldsymbol\pi,\boldsymbol\mu,\boldsymbol\kappa) = q(\boldsymbol\pi)\,q(\boldsymbol\mu,\boldsymbol\kappa)$. The optimization of $q(\boldsymbol\pi)$ is straightforward, which leads to

$$\ln q^*(\boldsymbol\pi) = \mathrm{cst} + \sum_{k=1}^{K}\left[(\alpha_{0,k}-1) + \sum_{n=1}^{N}r_{nk}\right]\ln\pi_k. \qquad (41)$$

Taking the exponential of both sides of the above expression, $q^*(\boldsymbol\pi)$ is recognized to be a Dirichlet distribution with posterior parameters $\alpha_k$,

$$q^*(\boldsymbol\pi) = \mathrm{Dir}(\boldsymbol\pi\mid\boldsymbol\alpha), \qquad \alpha_k = \alpha_{0,k} + \sum_{n=1}^{N}r_{nk}. \qquad (42)$$

Next, we consider optimization of the second factor $q(\boldsymbol\mu,\boldsymbol\kappa)$, which comprises a sum over $k$ terms involving $\boldsymbol\mu_k$ and $\kappa_k$, leading to the further factorization $q(\boldsymbol\mu,\boldsymbol\kappa) = \prod_{k=1}^{K}q(\boldsymbol\mu_k,\kappa_k)$. The variational posterior distribution $q(\boldsymbol\mu_k,\kappa_k)$ can be written as $q(\boldsymbol\mu_k,\kappa_k) = q(\boldsymbol\mu_k\mid\kappa_k)\,q(\kappa_k)$. Following a similar approach as described for the case of a single vMF component (Section 2), we can show that the variational posterior distribution of $\boldsymbol\mu_k$ and $\kappa_k$ can be expressed by a vMF-Gamma distribution. More specifically, the posterior distribution $q(\boldsymbol\mu_k\mid\kappa_k)$ is recognized to be a vMF density with posterior parameters $\beta_k$ and $\mathbf{m}_k$,

$$q^*(\boldsymbol\mu_k\mid\kappa_k) = F(\boldsymbol\mu_k\mid\mathbf{m}_k,\beta_k\kappa_k), \qquad (43)$$
$$\beta_k = \left\|\beta_{0,k}\mathbf{m}_{0,k} + \sum_{n=1}^{N}r_{nk}\mathbf{x}_n\right\|, \qquad (44)$$
$$\mathbf{m}_k = \left(\beta_{0,k}\mathbf{m}_{0,k} + \sum_{n=1}^{N}r_{nk}\mathbf{x}_n\right)\beta_k^{-1}. \qquad (45)$$

Similarly, the posterior distribution $q^*(\kappa_k)$ is recognized to be a Gamma density with posterior parameters $a_k$ and $b_k$,

$$q^*(\kappa_k) = \mathcal{G}(\kappa_k\mid a_k,b_k), \qquad (46)$$
$$a_k = a_{0,k} + \left(\frac{D}{2}-1\right)\sum_{n=1}^{N}r_{nk} + \frac{\partial\ln I_{\frac{D}{2}-1}(\beta_k\bar\kappa_k)}{\partial(\beta_k\bar\kappa_k)}\,\beta_k\bar\kappa_k, \qquad (47)$$
$$b_k = b_{0,k} + \sum_{n=1}^{N}r_{nk}\,\frac{\partial\ln I_{\frac{D}{2}-1}(\bar\kappa_k)}{\partial\bar\kappa_k} + \beta_{0,k}\,\frac{\partial\ln I_{\frac{D}{2}-1}(\beta_{0,k}\bar\kappa_k)}{\partial(\beta_{0,k}\bar\kappa_k)}. \qquad (48)$$

It is worth mentioning that the other expectations involved in (38) are evaluated as $\langle\ln\kappa_k\rangle = \Psi(a_k) - \ln b_k$ (where $\Psi(y) = \frac{d}{dy}\ln\Gamma(y)$ is the digamma function), $\langle\kappa_k\rangle = \frac{a_k}{b_k}$, $\langle\ln\pi_k\rangle_{\boldsymbol\pi} = \Psi(\alpha_k) - \Psi(\hat\alpha)$ (where $\hat\alpha = \sum_{k=1}^{K}\alpha_k$), and $\langle\kappa_k\boldsymbol\mu_k^T\mathbf{x}_n\rangle_{\boldsymbol\mu,\kappa} = \frac{a_k}{b_k}\,\mathbf{m}_k^T\mathbf{x}_n$.

A summary of the algorithm is presented in Algorithm 1. The optimization procedure is guaranteed to converge, as the lower bound is convex in each of the factors [29], [20, chapter 10].

TAGHIA ET AL.: BAYESIAN ESTIMATION OF THE VON-MISES FISHER MIXTURE MODEL WITH VARIATIONAL INFERENCE

4 PREDICTIVE DENSITY

In applications of the Bayesian mixture model of vMF distributions, we might be interested in the predictive density for a new observed variable. Let $\hat{\mathbf{x}}$ denote the new value of the observed variable and $\hat{\mathbf{z}}$ denote the corresponding latent switch variable associated with $\hat{\mathbf{x}}$. The exact predictive density is given by

$$p(\hat{\mathbf{x}}\mid X) = \sum_{\hat{\mathbf{z}}}\iiint p(\hat{\mathbf{x}}\mid\hat{\mathbf{z}},\boldsymbol\mu,\boldsymbol\kappa)\,p(\hat{\mathbf{z}}\mid\boldsymbol\pi)\,p(\boldsymbol\pi,\boldsymbol\mu,\boldsymbol\kappa\mid X)\,d\boldsymbol\pi\,d\boldsymbol\mu\,d\boldsymbol\kappa, \qquad (49)$$

where $p(\boldsymbol\pi,\boldsymbol\mu,\boldsymbol\kappa\mid X)$ is the true posterior distribution of the parameters. The exact predictive density is analytically intractable; hence, in this section we provide an approximate solution. Starting by performing the summation over $\hat{\mathbf{z}}$ and replacing the true posterior distribution with its variational approximation, we get

$$p(\hat{\mathbf{x}}\mid X) \simeq q(\hat{\mathbf{x}}\mid X) = \sum_{k=1}^{K}\iiint \pi_k\,F(\hat{\mathbf{x}}\mid\boldsymbol\mu_k,\kappa_k)\,q(\boldsymbol\pi)\,q(\boldsymbol\mu_k,\kappa_k)\,d\boldsymbol\pi\,d\boldsymbol\mu_k\,d\kappa_k. \qquad (50)$$

By integrating over $\boldsymbol\pi$ and making use of the result $\langle\pi_k\rangle_{\boldsymbol\pi} = \frac{\alpha_k}{\hat\alpha}$, (50) can be written as

$$q(\hat{\mathbf{x}}\mid X) = \sum_{k=1}^{K}\frac{\alpha_k}{\hat\alpha}\iint F(\hat{\mathbf{x}}\mid\boldsymbol\mu_k,\kappa_k)\,q(\boldsymbol\mu_k,\kappa_k)\,d\boldsymbol\mu_k\,d\kappa_k. \qquad (51)$$

Using the result $q(\boldsymbol\mu_k,\kappa_k) = F(\boldsymbol\mu_k\mid\mathbf{m}_k,\beta_k\kappa_k)\,\mathcal{G}(\kappa_k\mid a_k,b_k)$ in (51), we next perform the integration over $\boldsymbol\mu_k$,

$$\int F(\hat{\mathbf{x}}\mid\boldsymbol\mu_k,\kappa_k)\,F(\boldsymbol\mu_k\mid\mathbf{m}_k,\beta_k\kappa_k)\,d\boldsymbol\mu_k = c_D(\kappa_k)\,c_D(\beta_k\kappa_k)\int e^{\kappa_k\boldsymbol\mu_k^T\hat{\mathbf{x}} + \beta_k\kappa_k\boldsymbol\mu_k^T\mathbf{m}_k}\,d\boldsymbol\mu_k = w_k(\hat{\mathbf{x}})\,l(\kappa_k)\,\kappa_k^{\frac{D}{2}-1}, \qquad (52)$$

where we have defined

$$h_k(\hat{\mathbf{x}}) = \|\hat{\mathbf{x}} + \beta_k\mathbf{m}_k\|, \qquad (53)$$
$$w_k(\hat{\mathbf{x}}) = \frac{1}{(2\pi)^{\frac{D}{2}}}\left(\frac{\beta_k}{h_k(\hat{\mathbf{x}})}\right)^{\frac{D}{2}-1}, \qquad (54)$$
$$l(\kappa_k) = \frac{I_{\frac{D}{2}-1}\big(h_k(\hat{\mathbf{x}})\kappa_k\big)}{I_{\frac{D}{2}-1}(\kappa_k)\,I_{\frac{D}{2}-1}(\beta_k\kappa_k)}. \qquad (55)$$

The integrand on the second line of (52) is an un-normalized vMF distribution with mean direction $(\hat{\mathbf{x}} + \beta_k\mathbf{m}_k)\,h_k^{-1}(\hat{\mathbf{x}})$ and concentration parameter $h_k(\hat{\mathbf{x}})\kappa_k$; hence, we can write down the result of this integration by using the known normalization coefficient of the vMF distribution (third line of (52)). Thus we have

$$q(\hat{\mathbf{x}}\mid X) = \sum_{k=1}^{K}\frac{\alpha_k}{\hat\alpha}\,w_k(\hat{\mathbf{x}})\int l(\kappa_k)\,e^{-b_k\kappa_k}\,\kappa_k^{a_k+\frac{D}{2}-2}\,d\kappa_k. \qquad (56)$$

Since the factor $l(\kappa_k)$ involves the Bessel function, we cannot evaluate (56) analytically. The logarithm of $l(\kappa_k)$ can be linearized in terms of both $\kappa_k$ and $\ln\kappa_k$ as

$$\ln l(\kappa_k) \simeq \ln\tilde l(\kappa_k) = \ln l(\bar\kappa_k) - r_k(\hat{\mathbf{x}})(\kappa_k - \bar\kappa_k) + s_k(\hat{\mathbf{x}})(\ln\kappa_k - \ln\bar\kappa_k), \qquad (57)$$

where $r_k(\hat{\mathbf{x}})$ and $s_k(\hat{\mathbf{x}})$ can be calculated by setting the first and the second derivatives of $\ln l(\kappa_k)$ and $\ln\tilde l(\kappa_k)$ equal at the current expansion point $\kappa_k = \bar\kappa_k$, which gives

$$s_k(\hat{\mathbf{x}}) = -h_k^2(\hat{\mathbf{x}})\,\bar\kappa_k^2\left(\frac{I'_{\frac{D}{2}-1}\big(h_k(\hat{\mathbf{x}})\bar\kappa_k\big)}{I_{\frac{D}{2}-1}\big(h_k(\hat{\mathbf{x}})\bar\kappa_k\big)}\right)' + \beta_k^2\,\bar\kappa_k^2\left(\frac{I'_{\frac{D}{2}-1}(\beta_k\bar\kappa_k)}{I_{\frac{D}{2}-1}(\beta_k\bar\kappa_k)}\right)' + \bar\kappa_k^2\left(\frac{I'_{\frac{D}{2}-1}(\bar\kappa_k)}{I_{\frac{D}{2}-1}(\bar\kappa_k)}\right)', \qquad (58)$$

$$r_k(\hat{\mathbf{x}}) = \frac{s_k(\hat{\mathbf{x}})}{\bar\kappa_k} - h_k(\hat{\mathbf{x}})\,\frac{I'_{\frac{D}{2}-1}\big(h_k(\hat{\mathbf{x}})\bar\kappa_k\big)}{I_{\frac{D}{2}-1}\big(h_k(\hat{\mathbf{x}})\bar\kappa_k\big)} + \beta_k\,\frac{I'_{\frac{D}{2}-1}(\beta_k\bar\kappa_k)}{I_{\frac{D}{2}-1}(\beta_k\bar\kappa_k)} + \frac{I'_{\frac{D}{2}-1}(\bar\kappa_k)}{I_{\frac{D}{2}-1}(\bar\kappa_k)}, \qquad (59)$$

where, in general form, $\frac{I'_\nu(t)}{I_\nu(t)}$ and $\big(\frac{I'_\nu(t)}{I_\nu(t)}\big)'$ can be computed by using the recurrence relation [27]. It can be confirmed that $s_k(\hat{\mathbf{x}})$ and $r_k(\hat{\mathbf{x}})$ are positive for all $t > 0$ and $D \ge 2$. Hence, we have

$$\int l(\kappa_k)\,e^{-b_k\kappa_k}\,\kappa_k^{a_k+\frac{D}{2}-2}\,d\kappa_k \simeq l(\bar\kappa_k)\,e^{r_k(\hat{\mathbf{x}})\bar\kappa_k}\,\bar\kappa_k^{-s_k(\hat{\mathbf{x}})}\int e^{-\kappa_k(r_k(\hat{\mathbf{x}})+b_k)}\,\kappa_k^{a_k+s_k(\hat{\mathbf{x}})+\frac{D}{2}-2}\,d\kappa_k = l(\bar\kappa_k)\,e^{r_k(\hat{\mathbf{x}})\bar\kappa_k}\,\bar\kappa_k^{-s_k(\hat{\mathbf{x}})}\,\Gamma\!\left(a_k+s_k(\hat{\mathbf{x}})+\frac{D}{2}-1\right)\big(r_k(\hat{\mathbf{x}})+b_k\big)^{-\left(a_k+s_k(\hat{\mathbf{x}})+\frac{D}{2}-1\right)}, \qquad (60)$$

where we have taken the integration in the second line of (60) by the known normalization factor of the Gamma distribution. Finally, considering (60) in (56), the approximate predictive density is given by

$$q(\hat{\mathbf{x}}\mid X) \simeq q^{\mathrm{VB}}_{\mathrm{app}}(\hat{\mathbf{x}}\mid X) = \sum_{k=1}^{K}\frac{\alpha_k}{\hat\alpha}\,w_k(\hat{\mathbf{x}})\,l(\bar\kappa_k)\,e^{r_k(\hat{\mathbf{x}})\bar\kappa_k}\,\bar\kappa_k^{-s_k(\hat{\mathbf{x}})}\,\Gamma\!\left(a_k+s_k(\hat{\mathbf{x}})+\frac{D}{2}-1\right)\big(r_k(\hat{\mathbf{x}})+b_k\big)^{-\left(a_k+s_k(\hat{\mathbf{x}})+\frac{D}{2}-1\right)}. \qquad (61)$$
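For illustration, a Python sketch of evaluating (61) in the log domain might look as follows (our own code, not the authors'; it computes the mixture sum exactly as reconstructed in (53)-(61) above, and the finite-difference derivative of the Bessel ratio is a simplification we introduce):

```python
import numpy as np
from scipy.special import ive, gammaln

def bessel_ratio(nu, t):
    """I'_nu(t) / I_nu(t) via the recurrence relation."""
    return ive(nu + 1.0, t) / ive(nu, t) + nu / t

def bessel_ratio_deriv(nu, t, eps=1e-6):
    """(I'_nu/I_nu)'(t); a simple finite difference keeps the sketch short."""
    return (bessel_ratio(nu, t + eps) - bessel_ratio(nu, t - eps)) / (2.0 * eps)

def log_bessel(nu, t):
    return np.log(ive(nu, t)) + t

def log_predictive(x_new, alpha, beta, m, a, b):
    """ln q_app(x_new | X) by summing Eq. (61) over components in the log domain."""
    D = x_new.shape[0]
    nu = D / 2.0 - 1.0
    kbar = np.where(a > 1.0, (a - 1.0) / b, a / b)
    h = np.linalg.norm(x_new[None, :] + beta[:, None] * m, axis=1)            # Eq. (53)
    log_w = -0.5 * D * np.log(2.0 * np.pi) + nu * (np.log(beta) - np.log(h))  # Eq. (54)
    log_l = log_bessel(nu, h * kbar) - log_bessel(nu, kbar) - log_bessel(nu, beta * kbar)  # Eq. (55)
    s = (-h**2 * kbar**2 * bessel_ratio_deriv(nu, h * kbar)
         + beta**2 * kbar**2 * bessel_ratio_deriv(nu, beta * kbar)
         + kbar**2 * bessel_ratio_deriv(nu, kbar))                            # Eq. (58)
    r = (s / kbar - h * bessel_ratio(nu, h * kbar)
         + beta * bessel_ratio(nu, beta * kbar) + bessel_ratio(nu, kbar))     # Eq. (59)
    shape = a + s + D / 2.0 - 1.0
    log_terms = (np.log(alpha) - np.log(alpha.sum()) + log_w + log_l
                 + r * kbar - s * np.log(kbar)
                 + gammaln(shape) - shape * np.log(r + b))                    # Eq. (61)
    return np.logaddexp.reduce(log_terms)
```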

5 EXPERIMENTAL RESULTS

In this section, we first empirically evaluate our variational approach through toy experiments. Next, we evaluate it by experiments on both synthetic data and real data sets.


Fig. 3. Comparison of the true posterior distribution (first row), obtained via SIR inference, with the approximate posterior distribution obtained via the proposed variational approach (second row) in Example 5.1: (left) samples from the posterior distribution of the mean direction, $p(\boldsymbol\mu)$ and $q^*(\boldsymbol\mu)$; (middle) samples from the posterior distribution of the concentration parameter, $p(\kappa)$ and $q^*(\kappa)$; (right) samples from the joint posterior distribution of the mean direction and the concentration parameter, $p(\boldsymbol\mu,\kappa)$ and $q^*(\boldsymbol\mu,\kappa)$. The comparison is illustrated for data sizes $N = 10$ and $N = 100$.

5.1 Empirical Evaluations

There are some advantages of our variational approach over ML-EM based approaches for parameter estimation of the vMF mixture model. First, the singularities arising in maximum likelihood, that is, when a vMF component collapses onto a specific data point, are absent in the Bayesian treatment. Furthermore, in our approach there is no over-fitting even if we choose a large number of model components in the mixture. Finally, our approach can potentially determine the model complexity by pruning unnecessary mixture components during optimization, without resorting to techniques such as cross validation. The presented variational approach also has some potential advantages over Markov chain Monte Carlo (MCMC) methods: it is deterministic, and it offers an optimization criterion which can be used to determine convergence. On the other hand, there are some disadvantages of the variational approach. First, the optimization procedure can lead to local maxima in the variational parameter space. Second, it yields only an approximation to the posterior and incurs an unknown bias. In this section, we empirically examine some of these aspects.

5.1.1 Posterior Distribution

It is well known that variational factorized approximations can result in compact posterior distributions [20, chapter 10]. It has also been shown that using a lower bound to the VI criterion function, as in our (23) and (37), may lead to a bias in the parameter estimation [30]. In the following, through an illustrative example, we show how well the variational approach performs compared to a costly sampling-based approach.

Example 5.1. For this illustration, we simulate a sample of size $N \in \{10, 100\}$ from a single vMF distribution, with a mean vector arbitrarily chosen as $\boldsymbol\mu_{\mathrm{true}} = [0, 1]^T$ and a concentration parameter $\kappa_{\mathrm{true}} = 40$. We employ the algorithm described in [31] to generate a set of data points from a given vMF distribution. We estimate the posterior both with our variational approach and with a sampling-importance-resampling approach as described in [14]. The posterior

resulting from the SIR method is regarded as the true posterior, and the one obtained by the variational approach as the approximate posterior. Fig. 3 compares the approximate posterior with the true one. We observe that the approximate posterior $q^*(\boldsymbol\mu,\kappa)$ is more compact compared to the true posterior $p(\boldsymbol\mu,\kappa)$. Furthermore, we notice that our choice of the structural factorization of the variational posterior has preserved, to some extent, the actual correlation between $\boldsymbol\mu$ and $\kappa$ by allowing some uncertainty to propagate between $\boldsymbol\mu$ and $\kappa$. Finally, we notice that by increasing the amount of observed data $N$, the point estimates of the true source parameters become more accurate. However, the systematic bias might not vanish totally, as variational methods may be asymptotically biased [30].

5.1.2 Model Pruning The VI procedure always converges to a single local optimum, even when the true posterior distribution of model parameters has multiple modes in the parameter space [20, chapter 10]. This means that, in assigning each data cluster to its own mixture component, the VI solution includes only one of all the possible equivalent index assignments, whereas the exact Bayesian solution includes all the possible equivalent index assignments, while each permutation results in an equivalent description of the training data set. In this sense, the VI solution shows symmetry-breaking. In [32], through toy experiments, it is suggested that the symmetry-breaking can cause inappropriate pruning (overpruning) of the mixture components. It has been also suggested that the over-pruning might be an instance of the ”biasedness” property of VI [30], that is the variational approximation is biased away from peaks in the marginal likelihood towards regions where the evidence lower bound is tightest. The toy experiments in [32] are re-analyzed and re-interpreted in [33], suggesting that the symmetry-breaking should not be interpreted as a cause for any inappropriate pruning (over-pruning) of the mixture components, in the sense that too few mixture components are used. If we see the main purpose of the Bayesian learning as to estimate the actual parameter values, which is the case in some


applications, then we should agree with [30], [32] that the approximate posterior parameter distribution is biased and breaks the symmetry, compared to the exact parameter density. However, for many practical applications the goal is to obtain a model that describes the training data reasonably well. In such cases, the VI approach leads to a solution that gives just a slightly lower estimate of the data likelihood, and hence the actual parameter values and the arbitrary permutation of mixture component indices have less importance [33]. The model pruning in VI can also be seen as a desirable effect of Occam's razor in Bayesian learning, as the training procedure automatically determines the necessary model complexity and avoids over-fitting [20, chapter 10], [32]. In Bayesian learning of mixture models, this desirable pruning effect is achieved by assigning small values to the parameters $\alpha_{0,k}$ (i.e., $0 < \alpha_{0,k} < 1$) of the prior Dirichlet distribution over the mixture weights $\boldsymbol\pi$. In other words, the model favors solutions where $\sum_{n=1}^{N}r_{nk}$ for some components approaches zero, thus effectively eliminating mixture components that have very small responsibility in describing the observed data. It has been suggested that if the model uses too few mixture components, this is probably caused by an inappropriate selection of the prior model parameter $\alpha_{0,k}$ and not by the VI approach [33]. In the following toy experiment, we examine the effect of the prior parameter $\alpha_{0,k}$ on the model pruning.

Example 5.2. For this experiment, we generate synthetic data from a known vMF mixture model composed of $J = 10$ components with dimension $D = 5$ and with 10 samples per mixture component on average. The mean direction of each component is a random vector on the unit hypersphere. The concentration parameters are random values sampled from a uniform distribution, $\mathrm{unif}(14.8, 15.2)$. The variational algorithm is initialized with two different prior parameters $\alpha_{0,k} \in \{0.001, 0.5\}$, while the other prior parameters are kept unchanged. Next, we train models with different numbers of components $K \in \{8, 10, 12, 14\}$. Finally, for each trained model, we compute the lower bound value after convergence. The experiment is repeated 100 times for the same data set but with different (random) initializations. It is notable that in this experiment, component $k$ is considered pruned away from the mixture if $\sum_{n=1}^{N}r_{nk} \le 10^{-3}$. The result of this experiment is shown in Fig. 4. We notice that, when the algorithm is initialized with too few model components ($K < 10$), the trained model uses all the available model complexity, but the result is sub-optimal because the initial model complexity is not sufficient. When it is initialized with the exact known number of components ($K = 10$), the lower bound is at its highest value, whereas for $K > 10$ the lower bound is slightly lower. This reduction is mainly a penalty related to the number of equivalent arbitrary index permutations among the components, as discussed in [20, chapter 10.2] and [33]. We also note that for both $K = 10$ and $K > 10$, the training procedure sometimes (rather infrequently) converges to a sub-optimal local optimum, perhaps due to the random initialization. This indicates that one should perhaps try several different random initializations for the same data set when there is uncertainty whether the assigned model complexity is sufficient. Finally, we observe similar pruning effects for


Fig. 4. Model pruning of the mixture components for data generated from exactly 10 components. The left-hand plots illustrate the effective number of remaining components after training versus the number of model components. The right-hand plots show the lower bound values after convergence versus the number of model components. The result with the prior parameter $\alpha_{0,k} = 0.5$ is shown on the top (first row) and with $\alpha_{0,k} = 0.001$ on the bottom (second row). The figure shows results of 100 trials. Plotted points are randomly perturbed horizontally for clarity.

$\alpha_{0,k} = 0.001$ and $\alpha_{0,k} = 0.5$. The experiments suggest that, as long as $0 < \alpha_{0,k} < 1$, the choice of the prior value $\alpha_{0,k}$ is not critical. There appears to be only a very small risk of over-pruning, in the sense that too few model components remain active after training.

5.1.3 Computational Complexity

Here, we evaluate the computational complexity of our variational approach with respect to the number of components and the dimensionality through a toy experiment.

Example 5.3. We generate synthetic data from a known vMF mixture model composed of $J = 10$ components with 100 samples per mixture component on average. We vary the dimensionality of each data point from $D = 10$ to $D = 50$ with a step size of 10. Next, we train different models with the variational approach and the EM-based method [11]. Both algorithms are initialized with $K \in \{J, 2J, 3J\}$. The algorithms are terminated after either 2,000 iterations, or when the change in the lower bound (or the log likelihood) becomes less than $10^{-5}$ per datum. Then, for each trained model, we compute the convergence time. Fig. 5 illustrates the average convergence time per dimension over 10 data sets. We notice that the convergence time per dimension decreases with increasing dimensionality. We also notice that the variational approach leads to a slight computational overhead compared to the EM-based approach, but it exhibits rather less variance in its convergence time. This extra computational complexity arises mainly from the evaluation of the responsibilities. Compared to MCMC sampling methods, the variational approach requires considerably less computational time, because MCMC samplers take a long time to converge to


Fig. 5. Average convergence time and standard error per dimension across 10 data sets for the variational approach (VB) and the EM-based approach (EM) in Example 5.3.

their stationary distribution, whereas our method converges faster. For instance, in Example 5.1, our method was about 90-100 times faster than the SIR approach [14]. One possible extension of the work is to numerically approximate the intractable term $\langle\ln I_{\frac{D}{2}-1}(\kappa)\rangle_\kappa$ arising from variational Bayes during optimization. This would require defining a special function $\zeta(D, a, b)$, where $D$ is the dimension and $a$, $b$ are the parameters of the Gamma distribution. This function would be responsible for the numerical approximation of $\langle\ln I_{\frac{D}{2}-1}(\kappa)\rangle_\kappa$. The computational complexity would depend on the required precision in the numerical approximation.
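A rough sketch of such a special function (our own illustration with an arbitrary grid-based quadrature; the function name, grid size, and integration range are not prescribed by the paper) could be:

```python
import numpy as np
from scipy.special import ive, gammaln

def zeta(D, a, b, n_grid=2000):
    """Numerical approximation of <ln I_{D/2-1}(kappa)> under Gamma(kappa | a, b)."""
    nu = D / 2.0 - 1.0
    upper = (a + 10.0 * np.sqrt(a)) / b          # covers the bulk of the Gamma mass
    kappa = np.linspace(1e-6, upper, n_grid)
    log_pdf = a * np.log(b) - gammaln(a) + (a - 1.0) * np.log(kappa) - b * kappa
    log_I = np.log(ive(nu, kappa)) + kappa       # stable ln I_nu(kappa)
    return np.trapz(np.exp(log_pdf) * log_I, kappa)
```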

5.1.4 Approximate Predictive Density

Here, we illustrate the accuracy of the approximate predictive density (61) through a toy experiment.

Example 5.4. For this experiment, we generate synthetic data from a known vMF component with a mean direction vector arbitrarily chosen as $\boldsymbol\mu_{\mathrm{true}} = [0.3814, 0.8581, 0.2860, 0.1907]^T$ and a concentration parameter $\kappa_{\mathrm{true}} = 6$. We calculate the Kullback-Leibler divergence (KLD) of the true predictive density (49) from the approximate predictive density (61). The true predictive distribution is calculated by integrating (49) numerically, using also the true posterior distribution. Here, we employ the SIR approach [14] to estimate the true posterior distribution; the SIR approach [14] is proposed for a single vMF component. We also consider another approximation obtained by plugging the parameters from the ML estimation [11] into the model (denoted by $p^{\mathrm{ML}}_{\mathrm{app}}(\hat{\mathbf{x}}\mid X)$). More specifically, we calculate $\mathrm{KL}\big[p(\hat{\mathbf{x}}\mid X)\,\|\,p_{\mathrm{app}}(\hat{\mathbf{x}}\mid X)\big] = \mathbb{E}_{p(\hat{\mathbf{x}}\mid X)}\Big[\ln\frac{p(\hat{\mathbf{x}}\mid X)}{p_{\mathrm{app}}(\hat{\mathbf{x}}\mid X)}\Big]$, where $p_{\mathrm{app}}(\hat{\mathbf{x}}\mid X) \in \{q^{\mathrm{VB}}_{\mathrm{app}}(\hat{\mathbf{x}}\mid X),\ p^{\mathrm{ML}}_{\mathrm{app}}(\hat{\mathbf{x}}\mid X)\}$. The experiment is performed for different random data sizes $N$. The result of the experiment is shown in Fig. 6 on a log-scale plot. We notice that for small amounts of training data, the KLD obtained by the approximate solution $q^{\mathrm{VB}}_{\mathrm{app}}(\hat{\mathbf{x}}\mid X)$ is significantly less than that of the ML-based approximation $p^{\mathrm{ML}}_{\mathrm{app}}(\hat{\mathbf{x}}\mid X)$. This difference becomes smaller with increasing amounts of training data, as expected. We also notice that the variational approach has rather less standard error in the approximation.


Fig. 6. The KLD between the exact predictive density $p(\hat{\mathbf{x}}\mid X)$ (49) and the approximate predictive densities obtained by VI, $q^{\mathrm{VB}}_{\mathrm{app}}(\hat{\mathbf{x}}\mid X)$ (61), and by ML, $p^{\mathrm{ML}}_{\mathrm{app}}(\hat{\mathbf{x}}\mid X)$, in Example 5.4. The average and standard error results of 10 trials are reported (the y-axis is on a log scale).

5.2 Synthetic Data

The synthetic data sets are designed such that we can evaluate our approach with respect to various data dimensions and various concentrations about the mean direction. Specifically, we address the handling of small values of the concentration parameters in the experiments. This is important since difficulty in clustering arises when the concentration about the mean direction decreases.

5.2.1 Effect of the Concentration Parameter

For the following experiment, the synthetic data are generated from a given vMF mixture model composed of $J = 10$ components with dimension $D = 50$. The mean direction of each component is a random unit vector. The mixture weights are random values sampled from a uniform distribution. To examine only the effect of data size, we choose three different subsets with data sizes of $N \in \{100, 500, 1000\}$ while all the other parameters are kept unchanged. Furthermore, to investigate the effect of the concentration parameters, we change only the value of the concentration parameters while other model parameters are fixed, and again examine the aforementioned data with different data sizes. We compare the presented variational approach with an approach based on maximum-likelihood expectation maximization (EM-based approach). For this purpose, we follow the EM derivation introduced in [11], which has been shown to provide satisfactory results in their experiments. The main difference between the EM-based approach that we consider and the one in [11] is that, for deriving a more accurate estimate of the concentration parameter, as suggested by [11], we perform Newton-Raphson iterations instead of the empirical approximation employed in [11]. The EM-based approach provides point estimates of the model parameters, whereas in the variational approach, the means of the posterior distributions are regarded as point estimates of the model parameters. For the evaluation, we calculate the KLD between the true model and its estimated model, in which the model parameters are the point estimates derived by either the variational approach or the EM-based approach. A smaller value of KLD implies higher similarity between the true distribution and the estimated distribution. The result of this comparison is shown in Fig. 7. It is noticeable that the number of model components for the variational approach is set to $K = 15$, while the EM-based approach benefits from


Fig. 7. The KLD between the true model and its estimate versus the average value of the concentration parameters $\kappa_{\mathrm{ave}}$ for different data sizes $N$ (from top to bottom, $N \in \{100, 500, 1000\}$) with dimension $D = 50$. The estimated model is given by either the variational approach (VB-based) or the ML-EM approach (EM-based). The number of model components in the EM-based approach is set to the true number of components, $K = J = 10$, and in the VB-based approach it is set to $K = 15$. The red solid line is considered as the baseline where KLD = 0. Each point is the average of 20 trials.

knowing the true number of components as prior knowledge, i.e., $K = J = 10$. To smooth out the effect of random initializations, every single experiment has been repeated 20 times and the average result is reported. From this comparison, the following observations can be made: 1) for small values of the concentration parameter, the variational approach leads to lower KLD values compared to the EM-based approach; 2) by increasing the value of the concentration parameter, both algorithms converge to fairly similar results; 3) by increasing the data size, as expected, both approaches provide more accurate estimates.

5.2.2 Effect of the Data Dimension

Here, we evaluate the variational approach when the data dimension $D$ is allowed to vary. We consider synthetic data with two different data sizes, $N = 500$ and $N = 1000$, in which random samples are drawn from simulated data generated from a given vMF mixture model composed of $J = 10$ components ($\kappa_{\mathrm{ave}} \simeq 18$). The mixture model parameters are held fixed while the dimension of the data $D$ varies from 5 to 200 with a step size of 5. The number of model components for the variational approach is set to $K = 15$, while for the EM-based approach $K = J = 10$. Every experiment is repeated 20 times and the average results are shown. Fig. 8 shows the result of this experiment. We notice that the variational approach, compared to the EM-based approach, is less susceptible to error as $D$ increases.

5.3 Real Data

In this section, we first use the presented variational approach as a clustering method and apply it to a data set of yeast gene expressions. Next, we employ our approach in


Fig. 8. The KLD between the true model and its estimated model versus the data dimension $D$ for data sizes of $N = 500$ (top plot) and $N = 1000$ (bottom plot). The mixture model components are fixed while the dimension varies from 5 to 200 with a step size of 5. The estimated model is given by either the variational approach (VB-based) or the ML-EM approach (EM-based). The red solid line is considered as the baseline where KLD = 0. The number of model components in the EM-based approach is set to the true number of components, $K = J = 10$, and in the VB-based approach it is set to $K = 15$. Each point is the average of 20 trials.

modeling the line spectral frequency parameters of speech and evaluate the performance in text-independent speaker identification.

5.3.1 Yeast Gene Expressions

Clustering of gene expression data is considered a challenging domain because of difficulties in clustering validation due to the absence of true cluster labels. Thus, the use of the variational approach in this domain is well motivated by its capability of determining the required model complexity. The Pearson correlation is commonly used for the analysis of gene expression data. Let

$$r(\mathbf{x},\mathbf{y}) = \frac{\sum_{i=1}^{D}(x_i-\bar x)(y_i-\bar y)}{\sqrt{\sum_{i=1}^{D}(x_i-\bar x)^2}\,\sqrt{\sum_{i=1}^{D}(y_i-\bar y)^2}}$$

denote the Pearson product moment correlation between $\mathbf{x},\mathbf{y}\in\mathbb{R}^D$, where $\bar x = \frac{1}{D}\sum_{i=1}^{D}x_i$ and $\bar y = \frac{1}{D}\sum_{i=1}^{D}y_i$. Now consider the mapping $\mathbf{x}\mapsto\tilde{\mathbf{x}}$ such that

$$\tilde x_i = \frac{x_i-\bar x}{\sqrt{\sum_{i=1}^{D}(x_i-\bar x)^2}},$$

and a similar mapping for $\mathbf{y}$. Then we have $r(\mathbf{x},\mathbf{y}) = \tilde{\mathbf{x}}^T\tilde{\mathbf{y}}$, where $\|\tilde{\mathbf{x}}\| = \|\tilde{\mathbf{y}}\| = 1$. This implies that analysis and clustering of data using Pearson correlations is essentially a clustering problem for directional data [11]. Here, we consider a curated version of the CDC15 yeast gene expression data set provided by [34]. The data set consists of 23 experiments measuring the expression of 4,382 yeast genes.3 Out of these, we consider a subset of 1,000 randomly

3. The data set is available through: www.exploredata.net/Downloads/Gene-Expression-Data-Set.


selected genes. For each gene, the 23-element expression vector is normalized to have unit norm by applying the mapping $\mathbf{x}\mapsto\tilde{\mathbf{x}}$. Since the true cluster labels are not available, the clustering performance is evaluated by computing certain internal figures of merit. Measuring the homogeneity and the separation of the clusters are two commonly used figures of merit (e.g., [11], [35]). The homogeneity determines how similar the individual members of each cluster are to their cluster representative, providing a measure of intra-cluster similarity (higher values are favorable). The separation determines how disjoint the clusters are from each other, providing a measure of inter-cluster similarity (lower values are favorable). Let $X = \{\mathbf{x}_1,\ldots,\mathbf{x}_N\}$ denote a set of observed data and $V_1,\ldots,V_J$ denote disjoint clusters, where $V_j$ includes the $\mathbf{x}_n$ which belong to the $j$th cluster. The homogeneity measures ($H_{\mathrm{ave}}$ and $H_{\min}$) and the separation measures ($S_{\mathrm{ave}}$ and $S_{\max}$) can be evaluated as [11]

$$H_{\mathrm{ave}} = \frac{1}{N}\sum_{j=1}^{J}\sum_{\mathbf{x}\in V_j}\frac{\mathbf{x}^T\boldsymbol\mu_j}{\|\mathbf{x}\|\,\|\boldsymbol\mu_j\|}, \qquad H_{\min} = \min_{\mathbf{x}\in V_j}\frac{\mathbf{x}^T\boldsymbol\mu_j}{\|\mathbf{x}\|\,\|\boldsymbol\mu_j\|}, \qquad (62)$$

$$S_{\mathrm{ave}} = \frac{\sum_{i\ne j}|V_i|\,|V_j|\,\frac{\boldsymbol\mu_i^T\boldsymbol\mu_j}{\|\boldsymbol\mu_i\|\,\|\boldsymbol\mu_j\|}}{\sum_{i\ne j}|V_i|\,|V_j|}, \qquad S_{\max} = \max_{i\ne j}\frac{\boldsymbol\mu_i^T\boldsymbol\mu_j}{\|\boldsymbol\mu_i\|\,\|\boldsymbol\mu_j\|}, \qquad (63)$$

where $|V_i|$ is the number of members of the $i$th cluster. Note that $J$ is the number of clusters, which in the case of the variational approach is obtained automatically and in the case of the EM-based approach has to be set manually. We compare the variational approach with the EM-based clustering approach known as "soft-movMF" [11]. In [11], it has been shown that soft-movMF clearly results in better clustering performance in comparison to the frequency-sensitive spherical K-means [36] and the spherical K-means [37]; hence, we avoid comparison with these two clustering approaches. Since soft-movMF is derived in an expectation maximization framework, the number of clusters should be known in advance; thus, we evaluate the performance of soft-movMF for different numbers of model components (clusters). For the variational approach, however, the number of model components is set to a large value (in this case $K = 30$). The various cluster quality figures of merit, as computed for clusters of the yeast gene expression data, are shown in Fig. 9. This figure shows that the variational approach (denoted in the figure by VB-based) performs slightly better clustering compared to the EM-based approach (soft-movMF), both in terms of the intra-cluster similarity ($H_{\mathrm{ave}}$ and $H_{\min}$) and the inter-cluster similarity ($S_{\mathrm{ave}}$ and $S_{\max}$). More importantly, however, we notice that for the EM-based approach, by increasing the number of clusters, the intra-cluster similarity improves but the inter-cluster similarity degrades. Hence, there is a compromise involved, so that one may need several runs of the EM-based approach in order to determine the right number of clusters satisfying this compromise. The variational approach can determine the required model complexity over a single run, and hence is preferable in practice. In this experiment, those components for which $\sum_{n=1}^{N}r_{nk} \le 10^{-3}$ are considered pruned away from the mixture ($J = 22$ clusters are left after convergence).
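A hedged Python sketch of the normalization mapping and the cluster-quality measures in (62)-(63) follows (our own helper code; the function names and the choice of the mean resultant vector as cluster representative are our assumptions):

```python
import numpy as np

def pearson_normalize(X):
    """Map each row x to x_tilde so that r(x, y) = x_tilde . y_tilde (unit-norm rows)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    return Xc / np.linalg.norm(Xc, axis=1, keepdims=True)

def cluster_quality(X, labels):
    """Homogeneity (Have, Hmin) and separation (Save, Smax) as in (62)-(63)."""
    clusters = np.unique(labels)
    mus = np.stack([X[labels == j].mean(axis=0) for j in clusters])   # cluster representatives
    mus = mus / np.linalg.norm(mus, axis=1, keepdims=True)
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    cos = np.array([Xn[i] @ mus[np.where(clusters == labels[i])[0][0]]
                    for i in range(len(X))])
    Have, Hmin = cos.mean(), cos.min()
    sizes = np.array([(labels == j).sum() for j in clusters])
    sim = mus @ mus.T
    w = np.outer(sizes, sizes)
    off = ~np.eye(len(clusters), dtype=bool)
    Save = (w[off] * sim[off]).sum() / w[off].sum()
    Smax = sim[off].max()
    return Have, Hmin, Save, Smax
```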

Fig. 9. Measures of cluster quality for the gene data set consisting of 23 experiments measuring the expression of 1,000 randomly selected yeast genes. (a)-(b) illustrate the average and maximum values of the inter-cluster similarity, respectively, i.e., $S_{\mathrm{ave}}$ and $S_{\max}$; (c)-(d) illustrate the average and minimum values of the intra-cluster similarity, respectively, i.e., $H_{\mathrm{ave}}$ and $H_{\min}$. For the EM-based approach, the number of model components $K$ is equal to the number of clusters $J$, varying from 4 to 40 with a step size of 2, while for the variational approach (VB-based) the number of clusters is set to a large value, $K = 30$ ($J = 22$ clusters are left after convergence).

5.3.2 LSF Parameter Modeling

As an efficient representation of the linear predictive coding (LPC)4 coefficients, the line spectral frequency (LSF) parameters are widely used in speech coding [39], speaker recognition [40], and other speech processing applications. The LSF parameters were first introduced in [41] and have interesting properties which make them superior to direct quantization of the LPC coefficients. How to properly model the underlying distribution of the LSF parameters is an essential part of LSF parameter based applications. For a given order $D$, LPC analysis leads to an inverse filter $G(z) = 1 + \sum_{d=1}^{D} g_d z^{-d}$ minimizing the residual energy, where the $g_d$ are the LPC coefficients [42]. Let $P(z) = G(z) + z^{-(D+1)} G(z^{-1})$ and $Q(z) = G(z) - z^{-(D+1)} G(z^{-1})$ denote the symmetric and anti-symmetric polynomials, respectively. The zeros of $P(z)$ and $Q(z)$ are interleaved as $0 = \omega_{q_0} < \omega_{p_1} < \omega_{q_1} < \cdots < \omega_{q_{D/2}} < \omega_{p_{D/2+1}} = \pi$. The LSF parameters are then defined as [43]

$$\mathbf{s} = [s_1, s_2, \ldots, s_D]^{T} = [\omega_{p_1}, \omega_{q_1}, \ldots, \omega_{p_{D/2}}, \omega_{q_{D/2}}]^{T}. \qquad (64)$$
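For reference, the LSF parameters in (64) can be obtained numerically from the LPC coefficients by rooting $P(z)$ and $Q(z)$ and taking the angles of their unit-circle roots in $(0, \pi)$. The sketch below is our own illustration (not the procedure of [43]); it assumes a stable LPC inverse filter and an even order $D$, and the function name is a hypothetical convention.

```python
import numpy as np

def lpc_to_lsf(g):
    """LSF parameters from LPC coefficients g = [g_1, ..., g_D], D even.

    G(z) = 1 + sum_d g_d z^{-d}; P(z) = G(z) + z^{-(D+1)} G(1/z) and
    Q(z) = G(z) - z^{-(D+1)} G(1/z). The LSFs of (64) are the angles in
    (0, pi) of the unit-circle roots of P and Q, sorted in increasing order.
    """
    a = np.concatenate(([1.0], np.asarray(g, dtype=float)))
    # Coefficients of P and Q in powers of z^{-1}; P is palindromic and Q
    # anti-palindromic, so the ordering passed to np.roots does not change
    # the root locations.
    p = np.concatenate((a, [0.0])) + np.concatenate(([0.0], a[::-1]))
    q = np.concatenate((a, [0.0])) - np.concatenate(([0.0], a[::-1]))
    angles = []
    for poly in (p, q):
        w = np.angle(np.roots(poly))
        # keep one angle per conjugate pair; drop the trivial roots at z = +/-1
        angles.extend(w[(w > 1e-6) & (w < np.pi - 1e-6)])
    return np.sort(np.array(angles))
```

For a 16th-order analysis, `lpc_to_lsf` returns the 16 ordered angles that form $\mathbf{s}$ in (64).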

Fig. 10 illustrates marginal histograms for the elements of a 16-dimensional LSF parameter vector (for more detail, refer to [44]). Considering that the LSF parameters are bounded and strictly ordered in the interval $(0, \pi)$, we introduce here a representation of these parameters that has unit $L_2$ norm, obtained as follows.
4. LPC is mostly used for representing the spectral envelope of a digital speech signal using the information of a linear predictive model (e.g., [38]).



TABLE 1 Comparison of Speaker Identification Rate (in Percent)

Fig. 10. Illustration of the boundedness and ordering properties of the LSF parameters. Plots from left to right correspond to the histograms of the first to the 16th dimension [44].

First, we calculate the intervals between adjacent LSF parameters (including 0 and $\pi$) as [44]

$$\mathbf{y} = [y_1, y_2, \ldots, y_{D+1}]^{T} = \frac{1}{\pi}\,[s_1,\; s_2 - s_1,\; \ldots,\; \pi - s_D]^{T}, \qquad (65)$$

where $\|\mathbf{y}\|_1 = 1$. Next, the $L_2$ unit-normalized LSF parameters, named $L_2$-LSFs, are obtained by

$$\mathbf{x} = [x_1, x_2, \ldots, x_{D+1}]^{T} = [\sqrt{y_1}, \sqrt{y_2}, \ldots, \sqrt{y_{D+1}}]^{T}. \qquad (66)$$
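A minimal sketch of the mapping (65)-(66) from an LSF vector $\mathbf{s}$ to the $L_2$-LSF feature $\mathbf{x}$ is given below; the function name is our own convention.

```python
import numpy as np

def lsf_to_l2lsf(s):
    """Map strictly ordered LSFs s in (0, pi) to an L2 unit-norm feature.

    y collects the normalized intervals between adjacent LSFs (including the
    endpoints 0 and pi), so ||y||_1 = 1; x = sqrt(y) then satisfies ||x||_2 = 1
    and can be modeled by a vMF mixture.
    """
    s = np.asarray(s, dtype=float)
    bounded = np.concatenate(([0.0], s, [np.pi]))
    y = np.diff(bounded) / np.pi           # (65): intervals, summing to one
    return np.sqrt(y)                       # (66): L2-LSF feature, unit L2 norm
```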

The $L_2$-LSFs feature vector $\mathbf{x}$ has directional characteristics, satisfying $\|\mathbf{x}\| = 1$. Thus, we can model the underlying distribution of this feature vector by a mixture of vMF distributions. The performance of the modeling is verified in a text-independent speaker identification task [45], using the well-known TIMIT speech database [46]. The TIMIT database contains 630 male and female speakers, and each speaker speaks ten sentences. During each round of evaluation, we randomly select 25 speakers from the database. In the training phase, seven sentences are randomly selected from each speaker to form the training speech data. The speech is segmented into frames with a 25 ms frame length and a 10 ms step size. Silent frames are removed. A Hann window is applied to each frame. We obtain the LSF parameters from 16 LPC coefficients extracted from each frame. Next, the LSF parameters are transformed to $L_2$-LSFs by (65) and (66). Finally, the $L_2$-LSFs feature vectors from each specific speaker are modeled by a vMF mixture model. In the test phase, the remaining three sentences are used as test data. We randomly select $T$ consecutive frames from the test feature set. These selected frames are used as the feature set for identifying a speaker. Test feature sets, each containing $T$ frames ($T \in \{50, 100, 150, 200\}$), are obtained for each of the 25 speakers. Next, we compute the log-likelihood values of an anonymous speaker in the test set given the trained models of all speakers. Since we have enough data, we use point estimates of the posterior distributions in the evaluation of the log-likelihood (a sketch of this scoring step is given at the end of this section). We assign the test speaker the identity of the trained model yielding the highest log-likelihood value. To smooth out the effect of randomness, for every $T$, we randomly select the starting point and repeat this 10 times; thus we have $25 \times 10 = 250$ test sets to identify. The identification score is calculated as the percentage of correctly identified test sets out of the total number of test sets. The above procedure is repeated 10 times and the average values are

reported in Table 1. Since the duration of the test speech has an effect on the identification result, we choose different speech durations of $\{0.5, 1, 1.5, 2\}$ seconds,5 and evaluate the performance of the proposed system with the above procedure. For comparison, we also model the $L_2$ unit-normalized LSF parameters by a Gaussian mixture model (GMM) in a Bayesian framework [20, chapter 10], and perform a similar procedure to evaluate the identification rate. Furthermore, to see how close our SI system performs compared to a benchmark method, we consider the commonly used SI system based on mel-frequency cepstral coefficients (MFCCs)6 [45]. The combination of MFCCs and GMM is often regarded as the benchmark [45]. We compare the following methods:
- L2-LSFs + vMFMM: the presented SI system based on L2-LSFs features and modeling with a mixture of vMF distributions;
- L2-LSFs + GMM: an SI system based on L2-LSFs features and modeling with a mixture of Gaussian distributions;
- MFCCs + GMM: an SI system based on MFCCs features and modeling with a mixture of Gaussian distributions (regarded as the benchmark).
Results of this experiment are shown in Table 1. We first notice that the proposed L2-LSFs + vMFMM system provides competitive results compared to the benchmark MFCCs + GMM for different speech durations. We also notice that the L2-LSFs + vMFMM based SI system performs significantly better than the L2-LSFs + GMM based SI system. This suggests that the vMF distribution might be a better candidate in practice for LSF parameter modeling. Finally, we observed in the experiments that the Bayesian vMF mixture model keeps fewer active mixture components than the Bayesian GMM, which implies lower model complexity.
5. For a test speech with $T$ frames, the duration is about $0.01T$ seconds, since the segment step size is 0.01 s.
6. MFCCs are a representation of the short-term power spectrum of a sound, based on a linear cosine transform of a log-power spectrum on a nonlinear mel-scale of frequency [45].
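As an illustration of the identification step described above, the following sketch evaluates the per-frame log-density of a vMF mixture at point estimates of the mixing weights, mean directions, and concentrations, and assigns a test segment to the speaker with the largest total log-likelihood. This is our own sketch, not code from the paper; the parameter-tuple interface for each trained speaker model is a hypothetical convention.

```python
import numpy as np
from scipy.special import ive  # exponentially scaled modified Bessel function

def log_vmf_mixture(X, weights, mus, kappas):
    """Per-frame log-density of a vMF mixture on the unit sphere in R^D.

    X       : (T, D) unit-norm feature vectors.
    weights : (K,) mixing weights (point estimates), summing to one.
    mus     : (K, D) unit-norm mean directions.
    kappas  : (K,) concentration parameters.
    """
    T, D = X.shape
    nu = D / 2.0 - 1.0
    # log normalizing constant log C_D(kappa), using ive = I_nu(kappa) * exp(-kappa)
    log_c = nu * np.log(kappas) - (D / 2.0) * np.log(2.0 * np.pi) \
            - (np.log(ive(nu, kappas)) + kappas)
    # log of each weighted component density at each frame, then log-sum-exp over k
    log_comp = np.log(weights) + log_c + (X @ mus.T) * kappas   # shape (T, K)
    m = log_comp.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_comp - m).sum(axis=1, keepdims=True))).ravel()

def identify_speaker(test_frames, speaker_params):
    """Pick the speaker whose trained mixture gives the largest total log-likelihood.

    speaker_params : list of (weights, mus, kappas) tuples, one per trained speaker.
    """
    scores = [np.sum(log_vmf_mixture(test_frames, *p)) for p in speaker_params]
    return int(np.argmax(scores))
```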

6 CONCLUSION

We derived Bayesian estimation of the von-Mises Fisher mixture model with variational inference. The proposed variational approach is deterministic and offers an optimization criterion which can be used to determine convergence. Moreover, we showed empirically that our approach can determine the model complexity by pruning unnecessary mixture components while avoiding over-fitting. Next, we derived an analytically tractable approximation for the predictive density of the Bayesian mixture model of vMF distributions.



We evaluated our approach by experiments with both synthetic data and real data. First, we employed the variational approach as a clustering method for gene expression data. The various cluster-quality figures of merit verified the effectiveness of our approach in this application domain. Next, we employed the presented approach for modeling line spectral frequency parameters of speech and evaluated the performance in text-independent speaker identification. The evaluations showed potential applications of the vMF mixture model in speaker identification, as it provided competitive results compared to the benchmark approach.

ACKNOWLEDGMENTS

This work was funded by the European Commission within the Marie Curie ITN AUDIS, grant PITNGA-2008-214699. The authors would like to thank Gustav Henter for helpful discussions.

REFERENCES

[1] K. V. Mardia and P. E. Jupp, Directional Statistics. New York, NY, USA: Wiley, 2000.
[2] N. I. Fisher, T. Lewis, and B. J. J. Embleton, Statistical Analysis of Spherical Data. Cambridge, U.K.: Cambridge Univ. Press, 1987.
[3] A. Strehl, J. Ghosh, and R. Mooney, "Impact of similarity measures on web-page clustering," in Proc. Workshop Artif. Intell. Web Search, 2000, pp. 58–64.
[4] G. Salton and M. J. McGill, Introduction to Modern Retrieval. New York, NY, USA: McGraw-Hill, 1983.
[5] G. Salton and C. Buckley, "Term-weighting approaches in automatic text retrieval," Inform. Process. Manage., vol. 24, pp. 513–523, 1988.
[6] I. S. Dhillon, E. M. Marcotte, and U. Roshan, "Diametrical clustering for identifying anti-correlated gene clusters," Bioinformatics, vol. 19, pp. 1612–1619, 2003.
[7] P. Lopez Cruz, C. Bielza, and P. Larranaga, "The von Mises Naive Bayes classifier for angular data," in Proc. 14th Int. Conf. Advances Artif. Intell., 2011, vol. 7023, pp. 145–154.
[8] M. Bangert, P. Hennig, and U. Oelfke, "Using an infinite von Mises-Fisher mixture model to cluster treatment beam directions in external radiation therapy," in Proc. Ninth Int'l Conf. Mach. Learning Appl., 2010, pp. 746–751.
[9] J. Sinkkonen and S. Kaski, "Clustering based on conditional distributions in an auxiliary space," Neural Comput., vol. 14, pp. 217–239, 2001.
[10] S. Zhong and J. Ghosh, "A unified framework for model-based clustering," J. Mach. Learning Res., vol. 4, pp. 1001–1037, 2003.
[11] A. Banerjee, I. S. Dhillon, J. Ghosh, and S. Sra, "Clustering on the unit hypersphere using von Mises-Fisher distributions," J. Mach. Learning Res., vol. 6, pp. 1345–1382, 2005.
[12] D. H. T. Vu and R. Haeb-Umbach, "Blind speech separation employing directional statistics in an expectation maximization framework," in Proc. IEEE Int'l Conf. Acoust., Speech, Signal Process., 2010, pp. 241–244.
[13] H. Tang, S. Chu, and T. Huang, "Generative model-based speaker clustering via mixture of von Mises-Fisher distributions," in Proc. IEEE Int'l Conf. Acoust., Speech, Signal Process., Apr. 2009, pp. 4101–4104.
[14] A. G. Nunez and E. G. Pena, "A Bayesian analysis of directional data using the von Mises-Fisher distribution," Commun. Statist. Simul. Comput., vol. 34, no. 402, pp. 989–999, 2005.
[15] K. V. Mardia and S. A. M. El-Atoum, "Bayesian inference for the von Mises-Fisher distribution," Biometrika, vol. 63, no. 1, pp. 203–206, 1976.
[16] P. Guttorp and R. A. Lockhart, "Finding the location of a signal: A Bayesian analysis," J. Am. Statistical Assoc., vol. 83, no. 402, pp. 322–330, 1988.
[17] P. Bagchi and J. B. Kadane, "Laplace approximations to posterior moments and marginal distributions on circles, spheres, and cylinders," Can. J. Statist., vol. 19, no. 1, pp. 67–77, 1991.


[18] P. Damien and S. Walker, "A full Bayesian analysis of circular data using the von Mises distribution," Can. J. Statist., vol. 27, no. 2, pp. 291–298, 1999.
[19] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, "An introduction to variational methods for graphical models," Mach. Learning, vol. 37, no. 2, pp. 183–233, 1999.
[20] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer, 2006.
[21] D. M. Blei and J. D. Lafferty, "A correlated topic model of science," Ann. Appl. Statist., vol. 1, pp. 17–35, 2007.
[22] Z. Ma and A. Leijon, "Bayesian estimation of Beta mixture models with variational inference," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 11, pp. 2160–2173, Nov. 2011.
[23] G. McLachlan and D. Peel, Finite Mixture Models (Wiley Series in Probability and Statistics), 1st ed. New York, NY, USA: Wiley-Interscience, 2000.
[24] M. A. T. Figueiredo and A. K. Jain, "Unsupervised learning of finite mixture models," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 3, pp. 381–396, Mar. 2002.
[25] J. H. Piater, "Visual feature learning," Ph.D. dissertation, Department of Computer Science, Univ. Massachusetts, Jun. 2001.
[26] J. A. Mooney, P. J. Helms, and I. T. Jolliffe, "Fitting mixtures of von Mises distributions: A case study involving sudden infant death syndrome," Comput. Stat. Data Anal., vol. 41, nos. 3/4, pp. 505–513, 2003.
[27] M. Abramowitz and I. A. Stegun, Handbook of Mathematical Functions. New York, NY, USA: Dover Publications, 1965.
[28] N. I. Fisher, Statistical Analysis of Circular Data. Cambridge, U.K.: Cambridge Univ. Press, 1996.
[29] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge, U.K.: Cambridge Univ. Press, 2004.
[30] R. E. Turner and M. Sahani, "Two problems with variational expectation maximisation for time-series models," in Bayesian Time Series Models, D. Barber, T. Cemgil, and S. Chiappa, Eds. Cambridge, U.K.: Cambridge Univ. Press, 2011, ch. 5, pp. 109–130.
[31] A. T. Wood, "Simulation of the von Mises Fisher distribution," Commun. Stat.—Simul. Comput., vol. 23, no. 1, pp. 157–164, 1994.
[32] D. J. C. MacKay, "Local minima, symmetry-breaking, and model pruning in variational free energy minimization," Univ. Cambridge, Cambridge, U.K., Tech. Rep., 2001.
[33] A. Leijon, "Bayesian learning of Gaussian mixtures: Variational 'over-pruning' revisited," KTH School of Electrical Engineering, Stockholm, Sweden, Tech. Rep. TRITA-EE 2013:032, Aug. 2013.
[34] P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. K. Anders, M. B. Eisen, P. O. Brown, B. Futcher, and G. R. Fink, "Comprehensive identification of cell cycle-regulated genes of the Yeast Saccharomyces Cerevisiae by microarray hybridization," Molecular Biol. Cell, vol. 9, pp. 3273–3297, 1998.
[35] R. Sharan and R. Shamir, "CLICK: A clustering algorithm with applications to gene expression analysis," in Proc. Intell. Syst. Molecular Biol., 2000, pp. 307–316.
[36] A. Banerjee and J. Ghosh, "Frequency sensitive competitive learning for clustering on high-dimensional hyperspheres," in Proc. Int. Joint Conf. Neural Netw., May 2002, pp. 1590–1595.
[37] I. S. Dhillon, J. Fan, and Y. Guan, Efficient Clustering of Very Large Document Collections. Norwell, MA, USA: Kluwer, 2001, pp. 357–381.
[38] P. Vary and R. Martin, Digital Speech Transmission: Enhancement, Coding and Error Concealment, 1st ed. New York, NY, USA: Wiley, 2006.
[39] R. V. Cox, "Speech coding standards," in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds. Amsterdam, The Netherlands: Elsevier, 1995, pp. 49–78.
[40] J. P. Campbell, Jr., "Speaker recognition: A tutorial," Proc. IEEE, vol. 85, no. 9, pp. 1437–1462, Sep. 1997.
[41] F. Itakura, "Line spectrum representation of linear predictive coefficients of speech signals," J. Acoust. Soc. Amer., vol. 57, p. S35, Apr. 1975.
[42] J. Markel and A. Gray, Linear Prediction of Speech. New York, NY, USA: Springer-Verlag, 1976.
[43] F. Soong and B. Juang, "Line spectrum pair (LSP) and speech data compression," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 1984, vol. 9, pp. 37–40.
[44] Z. Ma, A. Leijon, and W. B. Kleijn, "Vector quantization of LSF parameters with a mixture of Dirichlet distributions," IEEE Trans. Audio, Speech, Language Process., vol. 21, no. 9, pp. 1777–1790, Sep. 2013.


[45] Springer Handbook of Speech Processing, J. Benesty, M. M. Sondhi, and Y. Huang, Eds. New York, NY, USA: Springer, 2008.
[46] "DARPA-TIMIT," Acoustic-Phonetic Continuous Speech Corpus, NIST Speech Disc 1.1-1, 1990.

Jalil Taghia received the MS degree in electrical engineering from Shahed University, Tehran, Iran, in 2009. He has been working toward the PhD degree at the Communication Theory Lab, School of Electrical Engineering (EES), KTH Royal Institute of Technology, Stockholm, Sweden, since 2009. His research interests include statistical signal processing and machine learning, including Bayesian inference, variational approximations, and latent variable models, in particular with audio applications.

Zhanyu Ma received the MEng degree in signal and information processing from the Beijing University of Posts and Telecommunications (BUPT), China, and the PhD degree in electrical engineering from KTH (Royal Institute of Technology), Sweden, in 2007 and 2011, respectively. From 2012 to 2013, he was a postdoctoral research fellow in the School of Electrical Engineering, KTH, Sweden. He has been an assistant professor at the Beijing University of Posts and Telecommunications, China, since 2013. His research interests include statistical modeling and machine learning related topics, with a focus on applications in speech processing, image processing, biomedical signal processing, and bioinformatics.


Arne Leijon received the MS degree in engineering physics in 1971, and the PhD degree in information theory in 1989, both from the Chalmers University of Technology, Gothenburg, Sweden. He has been a professor in hearing technology at the KTH (Royal Institute of Technology) School of Electrical Engineering, Stockholm, Sweden, since 1994. His main research interests include applied signal processing in aids for people with hearing impairment, and methods for individual fitting of these instruments, based on modeling of sensory information transmission and subjective sound quality.