Efficient Model Selection for Mixtures of Probabilistic PCA via Hierarchical BIC

Jianhua Zhao, Member, IEEE

Abstract—This paper concerns model selection for mixtures of probabilistic principal component analyzers (MPCA). The well-known Bayesian information criterion (BIC) is frequently used for this purpose. However, it is found that BIC penalizes each analyzer implausibly using the whole sample size. In this paper, we present a new criterion for MPCA called hierarchical BIC, in which each analyzer is penalized using its own effective sample size only. Theoretically, hierarchical BIC is a large sample approximation of the variational Bayesian lower bound and BIC is a further approximation of hierarchical BIC. To learn hierarchical-BIC-based MPCA, we propose two efficient algorithms: two-stage and one-stage variants. The two-stage algorithm integrates model selection with respect to the subspace dimensions into parameter estimation, and the one-stage variant further integrates the selection of the number of mixture components into a single algorithm. Experiments on a number of synthetic and real-world data sets show that: 1) hierarchical BIC is more accurate than BIC and several related competitors and 2) the two proposed algorithms are not only effective but also much more efficient than the classical two-stage procedure commonly used for BIC.

Index Terms—BIC, clustering, expectation maximization, mixture model, model selection, principal component analysis.

I. Introduction

PRINCIPAL component analysis (PCA) [1] is one of the most classical and popular linear data reduction techniques. A combination of local linear PCA models [2]–[4] further widens the applicability of PCA to nonlinear or multimodal data. The mixture of probabilistic PCA (MPCA) proposed in [5] is an important development since it provides a principled way to combine local PCAs. In MPCA, clustering and dimension reduction are performed simultaneously, and the parameter estimation can easily be done by maximum likelihood (ML)/maximum a posteriori (MAP) methods via the popular expectation maximization (EM) algorithm [6]. Moreover, in density modeling for high-dimensional data, MPCA offers a significant advantage over the Gaussian mixture model (GMM) with a full/diagonal/spherical covariance structure, because of its capability of providing an appropriate trade-off between the overfitting full GMM and the underfitting diagonal/spherical GMM.

Manuscript received February 1, 2013; revised October 18, 2013 and January 1, 2014; accepted January 3, 2014. This work was supported by the National Natural Science Foundation of China under Grant 11361071. This paper was recommended by Associate Editor R. Lynch. The author is with the School of Statistics and Mathematics, Yunnan University of Finance and Economics, Kunming 650221, China (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCYB.2014.2298401

Despite the attractiveness of the MPCA model, it is still a challenging problem to find this trade-off, namely, to determine the number of mixture components m and the local subspace dimensions kj's. To this end, a two-stage procedure is commonly adopted, where Stage 1 implements parameter estimation for a set of candidate models and Stage 2 chooses the best according to a model selection criterion. A common way is to assume a common-k MPCA model (which restricts the local subspace dimension kj = k for every component j), and then perform exhaustive enumeration of all possible models in the set {m, k} [7]. Without this common-k assumption, it would be very time consuming to search over the much larger set {m, kj's}. Intuitively, it seems more plausible for different components to enjoy their own different local subspace dimensions kj's. In fact, in image processing [8], the quality of image compression can be significantly improved by careful assignment of the kj's. In handwritten digit recognition [9], MPCA with different subspace dimensions kj's can obtain higher test likelihood and a lower classification error rate than that with a common k.

Attempts have been made toward solutions with different local subspace dimensions kj's. For example, in [9], the noise variances of different components are restricted to be the same, namely σj² = σ², and the dimension kj is determined as the total number of eigenvalues of the local covariance matrices larger than σ². Apart from this restriction, the optimal σ² has to be determined with validation data. The classical method in PCA that retains a variance ratio larger than a threshold is also employed for MPCA in [10] and [11]. A drawback is that the determination of the kj's and the parameter estimation are performed via separate cost functions.

To learn GMM efficiently, Figueiredo and Jain [12] propose a one-stage algorithm, denoted as FJ's algorithm for clarity. Unlike two-stage methods, this algorithm integrates parameter estimation and model selection into a single algorithm. In addition, it is less sensitive to the initialization than the standard EM. However, applying FJ's algorithm to the MPCA model suffers from two limitations. First, FJ's algorithm is only applicable to the MPCA model with the local subspace dimensions kj's given. Second, the constraint used for the component annihilation step is too strong to be used for several benchmark real-world data sets in our experiments.

Variational Bayesian (VB) methods have been suggested for fitting MPCA [13].



Unlike ML/MAP methods, VB maintains distributions over the model parameters, and its lower bound can naturally be adopted as a model selection criterion. In the VB treatment of MPCA (VBMPCA) proposed in [14], the different subspace dimensions kj's can be determined automatically in a continuous manner via automatic relevance determination (ARD). To further determine the number of mixture components m automatically, a birth/death operation is incorporated into VBMPCA in [15] and a nonparametric treatment is suggested in [16]. The resulting method implements model estimation and model selection in a single algorithm. However, the determination of the local subspace dimensions kj's in these VB-based methods purely depends on ARD. As pointed out in [17], its result is sensitive to the value of the maximum subspace dimension, which has to be prespecified.

This paper concerns learning MPCA in an effective and efficient manner. To learn MPCA effectively, we present a new criterion called hierarchical BIC (H-BIC) for model selection in MPCA. Theoretically, the proposed H-BIC is a large-sample limit of the VB lower bound and BIC is a further approximation of H-BIC. To learn H-BIC-based MPCA efficiently, we propose two new algorithms: two-stage and one-stage variants. Both algorithms eliminate the common-k restriction and perform parameter estimation and model selection simultaneously. Moreover, the one-stage algorithm overcomes the two limitations suffered by FJ's algorithm.

The remainder of the paper is organized as follows. Sections II and III review the MPCA model and existing model selection methods, respectively. Section IV presents the new H-BIC criterion and Section V proposes two efficient algorithms to implement H-BIC-based MPCA. Section VI conducts experiments to compare our proposed H-BIC with competing criteria/algorithms. Section VII concludes the paper.

II. Review of Mixtures of Probabilistic PCA

A. Mixtures of Probabilistic PCA (MPCA)

Under the mixtures of probabilistic PCA (MPCA) model proposed in [5], each d-dimensional data vector x_n in the i.i.d. sample X = {x_n}_{n=1}^N is generated in two steps. First, a natural number j is generated according to the distribution p(j) = \pi_j, j = 1, ..., m, under the constraint \sum_{j=1}^m \pi_j = 1. Second, given this j, x_n is generated from the restricted factor analysis model

x_n | j = A_j y_{nj} + \mu_j + \epsilon_{nj};   y_{nj} \sim N(0, I),   \epsilon_{nj} \sim N(0, \sigma_j^2 I)   (1)

where I denotes an identity matrix whose dimension should be apparent from the context, \mu_j is a d-dimensional mean vector, A_j is a d \times k_j factor loading matrix, y_{nj} is an independent k_j-dimensional latent factor vector, and \sigma_j^2 is the noise variance associated with component j. Clearly, this is a mixture of m probabilistic PCA sub-models with mixture proportions \pi_j's. Unlike the traditional factor analysis model, which assumes that \epsilon_{nj} has a diagonal covariance, MPCA assumes a scalar covariance. Let \theta \equiv \{\pi_j, \theta_j; j = 1, ..., m\}, \theta_j = (A_j, \mu_j, \sigma_j^2), and \Sigma_j = A_j A_j' + \sigma_j^2 I.

Under the MPCA model, the log likelihood of observing the data X is

L(X|\theta) = \sum_{n=1}^N \log \Big[ \sum_{j=1}^m \pi_j\, p(x_n|\theta_j) \Big]

where

p(x_n|\theta_j) = (2\pi)^{-d/2} |\Sigma_j|^{-1/2} \exp\big\{ -\tfrac{1}{2} (x_n - \mu_j)' \Sigma_j^{-1} (x_n - \mu_j) \big\}.

The ML estimate \hat\theta is defined as \hat\theta = \arg\max_\theta \{L(X|\theta)\}. If a prior p(\theta) for the model parameter \theta is available, then the MAP estimate \hat\theta can be obtained as \hat\theta = \arg\max_\theta \{L(X|\theta) + \log p(\theta)\}.

B. EM Algorithm

Given the number of components m and the subspace dimensions k = (k_1, k_2, ..., k_m), the parameter \theta in the MPCA model can be estimated by the well-known EM algorithm [5]. This algorithm is an iterative procedure and each iteration consists of an E-step, which obtains the expected complete data log likelihood, i.e., the so-called Q function, and an M-step, which maximizes Q with respect to \theta [6].

Consider the complete data (X, Z) = \{x_n, z_n\}_{n=1}^N, where z_n = (z_{n1}, ..., z_{nm}) and z_{nj} is an indicator variable taking the value 1 if x_n comes from component j, and 0 otherwise. Then the complete data log likelihood can be expressed as

L_c(X, Z|\theta) = \sum_{n=1}^N \sum_{j=1}^m z_{nj} \log(\pi_j\, p(x_n|\theta_j)).

E-step: Compute the expected L_c given X and \theta^{(t)}

Q(\theta|\theta^{(t)}) = \sum_{j=1}^m \ell_j(\theta_j|\theta^{(t)})   (2)

where

\ell_j(\theta_j|\theta^{(t)}) = \sum_{n=1}^N E(z_{nj}) \log(\pi_j\, p(x_n|\theta_j))   (3)

depends only on (\pi_j, \theta_j) of component j. The expectation R_{nj}(\theta^{(t)}) \equiv E(z_{nj}) is the posterior probability of data point x_n belonging to component j, given by

R_{nj}(\theta^{(t)}) = P(z_{nj} = 1|\theta^{(t)}) = \frac{\pi_j^{(t)} p(x_n|\theta_j^{(t)})}{\sum_{k=1}^m \pi_k^{(t)} p(x_n|\theta_k^{(t)})}.   (4)

In addition, we define the local sample covariance matrix as

S_j^{(t+1)} = \frac{1}{N\pi_j^{(t+1)}} \sum_{n=1}^N R_{nj}(\theta^{(t)}) (x_n - \mu_j^{(t+1)})(x_n - \mu_j^{(t+1)})'.   (5)
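To make the E-step concrete, the following is a minimal NumPy/SciPy sketch of the responsibilities (4) and the local weighted covariances (5), together with the proportion and mean updates (6)–(7) on which S_j depends. It is an illustration only, not the author's code; the container names and shapes (X, pi, mu, A, sigma2) are assumptions made for the example.

```python
import numpy as np
from scipy.stats import multivariate_normal
from scipy.special import logsumexp

def e_step(X, pi, mu, A, sigma2):
    """E-step quantities for MPCA: responsibilities (4) and local covariances (5).

    X: (N, d) data; pi: (m,) proportions; mu: list of (d,) means;
    A: list of (d, k_j) loadings; sigma2: (m,) noise variances.
    """
    N, d = X.shape
    m = len(pi)
    # log pi_j + log N(x_n | mu_j, Sigma_j), with Sigma_j = A_j A_j' + sigma_j^2 I
    log_joint = np.empty((N, m))
    for j in range(m):
        Sigma_j = A[j] @ A[j].T + sigma2[j] * np.eye(d)
        log_joint[:, j] = np.log(pi[j]) + multivariate_normal(mu[j], Sigma_j).logpdf(X)
    # responsibilities R_nj of Eq. (4), computed stably in the log domain
    R = np.exp(log_joint - logsumexp(log_joint, axis=1, keepdims=True))

    # updated proportions (6), means (7), and the weighted local covariances (5)
    Npi = R.sum(axis=0)                    # effective sample sizes N * pi_j
    pi_new = Npi / N
    mu_new = (R.T @ X) / Npi[:, None]
    S = []
    for j in range(m):
        Xc = X - mu_new[j]
        S.append((R[:, j][:, None] * Xc).T @ Xc / Npi[j])
    return R, pi_new, mu_new, S
```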


M-step: For the ML estimate, maximizing Q with respect to \theta under the restriction \sum_{j=1}^m \pi_j = 1 yields the updating equations

\pi_j^{(t+1)} = \frac{1}{N} \sum_{n=1}^N R_{nj}(\theta^{(t)})   (6)

\mu_j^{(t+1)} = \frac{1}{N\pi_j^{(t+1)}} \sum_{n=1}^N R_{nj}(\theta^{(t)})\, x_n   (7)

\sigma_j^{2(t+1)} = (d - k_j)^{-1} \sum_{\ell=k_j+1}^{d} \lambda_{j\ell}   (8)

A_j^{(t+1)} = U_j (\Lambda_j - \sigma_j^{2(t+1)} I)^{1/2}   (9)

where U_j = (u_{j1}, ..., u_{jk_j}) and \Lambda_j = diag(\lambda_{j1}, ..., \lambda_{jk_j}), with \{\lambda_{j\ell}\}_{\ell=1}^d and \{u_{j\ell}\}_{\ell=1}^d being the descending ordered eigenvalues and eigenvectors of S_j^{(t+1)} given by (5).

For the MAP estimate, \theta^{(t+1)} is obtained by

\theta^{(t+1)} = \arg\max_\theta \{Q(\theta|\theta^{(t)}) + \log p(\theta)\}.   (10)

The updating formula for \theta^{(t+1)} hence depends on the choice of the prior p(\theta).

III. Previous Works

It can be seen that the MPCA model with mixture number m and subspace dimensions kj's is nested within those with larger values of m and/or kj. Thus, the log likelihood L is a nondecreasing function of m and the kj's and cannot be adopted as a model complexity criterion [12]. Several model selection methods have been proposed to deal with this problem. Deterministic methods are preferred in pattern recognition applications due to their good performance and computational efficiency. From a computational viewpoint, these methods can be roughly classified into two classes: two-stage and one-stage.

A. Two-Stage Methods

Two-stage methods perform parameter estimation and model selection separately. In detail, given a range of values of (m, k) (with m from m_min to m_max and every kj from k_min to k_max), which is assumed to include the optimal one, two-stage methods typically first obtain the ML/MAP estimate \hat\theta(m, k) for each model (m, k) and then employ a model selection criterion L^*(m, k, \hat\theta(m, k)) to choose the value as

(\hat{m}, \hat{k}) = \arg\max_{(m,k)} \{L^*(m, k, \hat\theta(m, k))\}.   (11)

Many criteria have been proposed, for instance, Akaike's information criterion (AIC), the Bayesian information criterion (BIC) [18], the Laplace-empirical criterion (LEC), the integrated classification likelihood (ICL) criterion, the minimum description length (MDL) criterion, and so on; see [19] for a review. All these criteria can be written as

L^*(m, k, \hat\theta(m, k)) = L(X|\hat\theta(m, k)) - P(\hat\theta(m, k))   (12)

where P(\hat\theta(m, k)) is a penalty term that penalizes higher values of (m, k). BIC is the most popular one for mixture models due to its theoretical consistency and satisfactory practical performance in a variety of applications [20], [21]. Note that MDL has the same form as BIC, but its motivation is from information/coding theory. For m-component MPCA, the penalty term of BIC in (12) is given by

P_{bic}(\hat\theta) = \frac{1}{2} \Big[ \sum_{j=1}^m (D_j + 1) - 1 \Big] \log N   (13)

where D_j = d(k_j + 1) - k_j(k_j - 1)/2 + 1 is the number of free parameters of \theta_j. However, it can be seen that BIC penalizes each analyzer j implausibly using the whole sample size N. As the number of candidate models in the set {m, k} is often very large, to save time, common-k MPCA models that restrict kj = k, j = 1, ..., m, are usually assumed and the two-stage procedure is then performed in the smaller set {m, k} [7].
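For concreteness, a small sketch of the parameter count D_j and the BIC penalty (13) follows; it is illustrative only, and the function and argument names are assumptions made for the example.

```python
import numpy as np

def n_free_params(d, k):
    # D_j = d(k_j + 1) - k_j(k_j - 1)/2 + 1: loadings (up to rotation), mean, noise variance
    return d * (k + 1) - k * (k - 1) // 2 + 1

def bic_penalty(N, d, ks):
    # Eq. (13): P_bic = 0.5 * [sum_j (D_j + 1) - 1] * log N
    D = [n_free_params(d, k) for k in ks]
    return 0.5 * (sum(Dj + 1 for Dj in D) - 1) * np.log(N)

# example: a 3-component MPCA on 10-D data with k = (5, 5, 5) and N = 900
print(bic_penalty(900, 10, [5, 5, 5]))
```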

B. One-Stage Methods

Unlike two-stage methods, one-stage methods integrate parameter estimation and the selection of the number of mixture components m into a single algorithm, and thus are computationally more efficient. A representative example is due to Figueiredo and Jain [12]. The authors propose an MML criterion, given by

\hat\theta = \arg\max_\theta \{L(X|\theta) - P_{mml}(\theta)\}   (14)

where

P_{mml}(\theta) = \sum_{j=1}^m \frac{D_j}{2} \log\frac{N\pi_j}{12} + \frac{m}{2} \log\frac{N}{12} + \sum_{j=1}^m \frac{D_j + 1}{2}.   (15)

The noticeable features of (14) include: 1) the penalty term involves the model parameters \pi_j's, which are jointly optimized with those in the first term L(X|\theta), and 2) the maximization in (14) is performed with respect to \theta and the D_j's simultaneously. To meet 1) and 2), the authors present a one-stage algorithm (namely FJ's algorithm) implemented by a simple modification of the EM algorithm in Section II-B. Despite its success in a number of applications, as illustrated in [12], FJ's algorithm suffers from two limitations. First, FJ's algorithm is only applicable to the MPCA model with the local subspace dimensions kj's given. Second, in FJ's algorithm, the updating equations for maximizing (14) with respect to \theta are the same as those of the EM algorithm detailed in Section II-B except that those for the mixing proportions \pi_j's are changed to

\pi_j^{(t+1)} = \frac{\max\{0,\ \sum_{n=1}^N R_{nj}(\theta^{(t)}) - D_j/2\}}{\sum_{j=1}^m \max\{0,\ \sum_{n=1}^N R_{nj}(\theta^{(t)}) - D_j/2\}}   (16)

where j = 1, ..., m. Equation (16) automatically annihilates any component j that is not well supported by the data, i.e., one with \sum_{n=1}^N R_{nj}(\theta^{(t)}) \le D_j/2.

Since 0 < \hat\pi_j < 1, BIC penalizes the model more heavily than H-BIC (as the term \sum_{j=1}^m (D_j/2) \log\hat\pi_j is negative). However, there exist close relationships between them.
1) BIC is a further approximation of H-BIC in the large sample limit (obtained by dropping the order-1 term \sum_{j=1}^m (D_j/2) \log\hat\pi_j).
2) In H-BIC, BIC is performed in a hierarchical manner.
   a) BIC for the \pi_j's: Since the constraint \sum_j \pi_j = 1 holds and all N data points are devoted to estimating the \pi_j's, the corresponding BIC penalty term for the \pi_j's is ((m - 1)/2) \log N.
   b) Local BIC for each component: N\hat\pi_j can be viewed as the effective sample size based on which the parameter \theta_j is estimated, and hence the corresponding BIC penalty term for \theta_j is (D_j/2) \log(N\hat\pi_j).
Recall that the data points under an MPCA model are generated in two steps: the first step generates the component label j according to the \pi_j's and, given this j, the second step generates the data point x_n based on the component parameter \theta_j. Therefore, criterion (17) can be regarded as a hierarchical BIC, where each hierarchy corresponds to a step in generating the data points.

2) Relationship/Difference Between H-BIC and MML: From (17), it can be observed that P_{hbic}(\hat\theta) includes the ML/MAP estimate \hat\pi_j. However, from (15), the penalty term P_{mml}(\theta) includes the parameters \pi_j's to be optimized, and thus the solution in (16) is not an ML/MAP estimate. Furthermore, using 12/e \approx 4.4146, P_{mml}(\theta) in (15) can be rewritten as

P_{mml}(\theta) \approx \sum_{j=1}^m \frac{D_j}{2} \log\frac{N\pi_j}{4.4146} + \frac{m}{2} \log\frac{N}{4.4146}.   (26)

From (26) and (17), it can be seen that MML generally penalizes the model more lightly than H-BIC. Due to the strong constraint suffered by FJ's algorithm, as detailed in Section III-B, in this paper we use the proposed Algorithm 2 in Section V-D to implement the MML criterion, which allows us to compare the performance of P_{mml}(\hat\theta) and P_{hbic}(\hat\theta) fairly. In our experiments in Section VI, even with the component annihilation step detailed in Section V-C, MML often chooses a model with too many spurious components (consisting of a small number of data points). This is because the penalty term of MML in (26) is not sensitive to small values of N\pi_j; in particular, when N\pi_j < 4.4, the penalty is in fact negative and thus may encourage insignificant small components.

V. Two Algorithms

Obviously, the classical two-stage procedure used for BIC in Section III-A can be employed to implement H-BIC for MPCA. However, this would be time consuming even with the common-k restriction. In this section, we shall propose two efficient algorithms to learn H-BIC-based MPCA: two-stage and one-stage variants, both of which are based on the modified EM algorithm detailed in Section V-A.

A. Modified EM Algorithm with m Given

In this section, we propose a simple modification to the conventional EM algorithm in Section II-B, in which the dimensions kj's are treated as parameters and an additional M-step is introduced to maximize H-BIC over the kj's. As a result, parameter estimation and model selection with respect to the dimensions kj's can be solved simultaneously by means of EM optimization. Given the component number m, the objective function is

L^*(\theta, k) = L(\theta) - \sum_{j=1}^m \frac{D_j}{2} \log(N\pi_j) - \frac{m-1}{2} \log N   (27)

and the objective is to find (\hat{k}, \hat\theta(\hat{k})), where \hat{k} is the model whose ML/MAP estimate \hat\theta(\hat{k}) has the highest H-BIC value of L^*. Note that maximizing L^* in (27) with respect to \theta (as done for MML in [12]) does not generally yield the ML/MAP estimate \hat\theta and is not consistent with the objective. Thus, we make use of the EM algorithm, which obtains \hat\theta, as detailed below. Given the (\pi_j^{(t+1)}, \mu_j^{(t+1)})'s, the penalized Q function corresponding to L^* in (27) (denoted by Q^*) is given by

Q^*(\theta|\theta^{(t)}, \pi^{(t+1)}) = \sum_{j=1}^m \ell^*_j(\theta|\theta^{(t)}, \pi^{(t+1)}) - \frac{m-1}{2} \log N   (28)

where

\ell^*_j(\theta|\theta^{(t)}, \pi^{(t+1)}) = \ell_j(\theta|\theta^{(t)}) - \frac{D_j}{2} \log(N\pi_j^{(t+1)})   (29)

with \ell_j(\theta_j|\theta^{(t)}) given by (3). To find (\hat\theta(\hat{k}), \hat{k}), the modified EM iterates the following two steps.
Step 1: Maximize Q in (2) with respect to \theta to obtain \theta^{(t+1)}.
Step 2: Maximize Q^* in (28) with respect to k to obtain k^{(t+1)}.
Step 1 is performed by the EM algorithm detailed in Section II-B while Step 2 is treated in Section V-A1.

1) Maximization of Step 2: The first term in (29) is the expected complete data log likelihood of component j and the second term is a BIC penalty term with effective sample size N\pi_j^{(t+1)}; thus (29) performs a local BIC for component j. In other words, we can use m separate local BICs to determine the dimensions kj's. The closed-form expression for kj is obtained as follows.


Substituting (8) and (9) into (29), noting the fact that tr((\Sigma_j^{(t+1)})^{-1} S_j^{(t+1)}) = d, and dropping the terms irrelevant to kj, we obtain

k_j^{(t+1)} = \arg\min_k \Big\{ N\pi_j^{(t+1)} \Big[ \sum_{l=1}^{k} \log\lambda_{jl} + (d - k) \log\frac{\sum_{l=k+1}^{d}\lambda_{jl}}{d - k} \Big] + D_j \log(N\pi_j^{(t+1)}) \Big\}   (30)

where \{\lambda_{jl}\}_{l=1}^d are the descending ordered eigenvalues of S_j^{(t+1)} in (5). After determining k_j^{(t+1)} by (30), we can easily use the existing eigen-decomposition of S_j^{(t+1)} to estimate \sigma_j^2 and A_j via (8) and (9). This step enables our proposed algorithms in Section V-D to eliminate the first limitation of FJ's algorithm. Since the local BIC selects the dimensions kj's for linear sub-models, it has theoretical consistency [18]. Moreover, for the one-component MPCA (namely probabilistic PCA) model, extensive empirical studies [22] have shown that the local BIC is comparable with or more accurate than many criteria such as AIC, consistent AIC, cross-validation (CV), and so on.

2) On Convergence of Modified EM: Since Steps 1 and 2 involve two different objective functions L and L^*, we check whether convergence of L^* can be guaranteed. Using (27), we write the difference of L^* between two successive iterations of Steps 1 and 2 as L^*(\theta^{(t+1)}, k^{(t+1)}) - L^*(\theta^{(t)}, k^{(t)}) = L^*_1 + L^*_2, where

L^*_1 = \Big[ L(\theta^{(t+1)}) - \sum_{j=1}^m \frac{D_j^{(t+1)}}{2} \log(N\pi_j^{(t+1)}) \Big] - \Big[ L(\theta^{(t)}) - \sum_{j=1}^m \frac{D_j^{(t)}}{2} \log(N\pi_j^{(t+1)}) \Big]

L^*_2 = \sum_{j=1}^m \frac{D_j^{(t)}}{2} \big[ \log(N\pi_j^{(t)}) - \log(N\pi_j^{(t+1)}) \big].

Since maximizing Q^* with respect to \theta is equivalent to maximizing Q with respect to \theta, Steps 1 and 2 guarantee that L^*_1 does not decrease. At first look, L^*_2 could decrease and thus L^* cannot be guaranteed to increase. However, in our experience, L^*_2 is generally negligible compared with L^*_1 and the evolution of L^* is dominated by L^*_1. This is particularly true in the early iterations of the modified EM, since L^*_1 involves the update of all parameters in \theta, usually enjoying a large increase in this stage, while L^*_2 only involves the difference \sum_{j=1}^m (D_j^{(t)}/2) [\log(\pi_j^{(t)}) - \log(\pi_j^{(t+1)})].

If the algorithm fails to converge after a fixed number of iterations (say, 50), there are two ways to enforce convergence of L^*. One way is to fix the \pi_j's in the penalty term P_{hbic}(\theta) as \{\pi_j^{(t+1)} = \pi_j^{(t)}\}_{j=1}^m; in this case, L^* is just L^*_1, whose convergence is guaranteed. Another way is to fix k; this case is the standard EM algorithm for fixed k and thus convergence of L^* is guaranteed.

However, we do not perform these modifications in our experiments, since L^* converges for all the data sets used.

B. Maximum A Posteriori (MAP) Estimate

It is well known that the ML estimate may approach the boundary of the parameter space and suffer from the problem of singularities. In this case, its H-BIC value cannot be computed. When the number of components m and the subspace dimensions kj's are larger than the optimal ones, this tends to happen frequently. To tackle this problem, we use the MAP estimate because it is more stable than the ML estimate and its H-BIC value can always be computed. In addition, it can be seen that the convergence analysis for the ML estimate in Section V-A2 is also applicable to the MAP estimate.

Inspired by the prior used in [23], we use the prior

p(\theta) = p(\pi) \prod_j p(\mu_j) p(\Sigma_j) \propto \prod_j |\Sigma_j|^{-g/2} \exp\big\{ -\tfrac{1}{2} tr(\Sigma_j^{-1} B) \big\}   (31)

where B is a positive definite matrix. The prior (31) means that we use an inverted Wishart prior with g degrees of freedom over \Sigma_j in (1) and noninformative priors for the component probabilities \pi_j's and means \mu_j's.

For fixed kj's, substituting (31) into (10), we have that the updating equations for \theta in the M-step remain the same as (6)–(9), except that, instead of S_j^{(t+1)} in (5), the eigen-decomposition in (8) and (9) is now taken with respect to its regularized version

\tilde{S}_j^{(t+1)} = \frac{1}{N\pi_j^{(t+1)} + g} \big( N\pi_j^{(t+1)} S_j^{(t+1)} + B \big) = (1 - \gamma) S_j^{(t+1)} + \gamma \frac{B}{g}   (32)

where \gamma = g/(N\pi_j^{(t+1)} + g). If we set B to a scalar covariance, \tilde{S}_j^{(t+1)} in (32) performs a regularization similar to that used in regularized discriminant analysis [24], where the covariance matrix is shrunk toward a scalar covariance. Nevertheless, the regularization parameter \gamma in (32) adapts as the local sample size N\pi_j varies.

C. Component Annihilation

Inspired by the component annihilation step used in FJ's algorithm, which can alleviate the problem of singularities, we add a similar step to the intermediate estimates of our implementations. Instead of N\pi_j \le D_j/2, we remove the components with N\pi_j \le n_0, where n_0 is a user-defined parameter (e.g., n_0 = 20 in our experiments). In other words, any component with effective sample size less than n_0 is regarded as insignificant and simply discarded. When a component is removed, we immediately redistribute its probability mass to all the other components, as done in FJ's algorithm. Like that in FJ's algorithm, this step can alleviate the problem of singularities substantially. However, our constraint is less restrictive than that in FJ's algorithm, making our proposed algorithms in Section V-D applicable to all the real-world data sets in our experiments. The second limitation of FJ's algorithm is thus overcome.


For example, our one-stage algorithm correctly chooses two components for the 100-D synthetic data used in Section III-B, with a better test log likelihood of -4.10 x 10^5, in strong contrast with the failure of FJ's algorithm. A possible disadvantage of this step is that, when some component j is of interest but its local sample size N\hat\pi_j < n_0, this component will not survive and will be forced to merge with other ones.

D. Two Proposed Learning Algorithms

To learn an MPCA model given a range of values of m (from m_min to m_max), the objective is to find the optimal model (\hat{m}, \hat{k}) whose ML/MAP estimate \hat\theta(\hat{m}, \hat{k}) gives the highest H-BIC value.

1) New Two-Stage Algorithm: We first find (\hat{k}, \hat\theta(\hat{k})) for each fixed m using the modified EM algorithm in Section V-A, and then choose the \hat{m} with the best H-BIC value. Clearly, this algorithm only requires choosing the optimal one from the set {m} and thus is computationally much more efficient than the traditional one in Section III-A, which involves the much larger set {m, k}. Importantly, this algorithm allows different kj's in an efficient manner.

2) New One-Stage Algorithm: Motivated by FJ's algorithm, we extend the above two-stage algorithm to a one-stage algorithm that further integrates the selection of m into a single algorithm, which we call automated learning of MPCA (A-MPCA). Specifically, A-MPCA is a backward learning algorithm that starts from m_max and then recursively decreases the value of m to m_min by annihilating the least probable component. Consequently, we obtain a one-stage algorithm that performs parameter estimation and the selection of m and the kj's simultaneously.

3) Implementations: Obviously, the MAP estimate alone may include insignificant components (with local effective sample size smaller than n_0). On the other hand, the component annihilation step can alleviate but not eliminate the problem of singularities, which means that this step alone cannot always guarantee a computable H-BIC value. Therefore, both strategies are adopted in our implementations, and the obtained estimate excludes insignificant components and guarantees that the H-BIC value can always be computed. The pseudocode for A-MPCA is summarized in Algorithm 1 and the pseudocode of the two-stage algorithm is given in Algorithm 2, both shown below after a brief illustrative sketch of the per-component update they share.
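Before the pseudocode of the two algorithms, here is a minimal Python sketch of the per-component update they share: the regularized covariance (32), the local-BIC choice of k_j in (30), the updates (8)–(9), and the H-BIC objective (27). It is a sketch under assumed inputs (S_j, the effective sample size N*pi_j, the prior parameters B and g), not the paper's implementation.

```python
import numpy as np

def n_free_params(d, k):
    # D_j = d(k_j + 1) - k_j(k_j - 1)/2 + 1
    return d * (k + 1) - k * (k - 1) // 2 + 1

def update_component(S_j, N_pi_j, B, g, k_max):
    """Regularize S_j via (32), pick k_j by the local BIC (30), update sigma_j^2 and A_j via (8)-(9)."""
    d = S_j.shape[0]
    gamma = g / (N_pi_j + g)
    S_tilde = (1.0 - gamma) * S_j + gamma * B / g           # Eq. (32)
    lam, U = np.linalg.eigh(S_tilde)
    lam, U = lam[::-1], U[:, ::-1]                          # descending eigenvalues/vectors

    best_k, best_cost = None, np.inf
    for k in range(1, min(k_max, d - 1) + 1):
        tail = lam[k:].sum() / (d - k)
        cost = N_pi_j * (np.sum(np.log(lam[:k])) + (d - k) * np.log(tail)) \
               + n_free_params(d, k) * np.log(N_pi_j)       # objective inside Eq. (30)
        if cost < best_cost:
            best_k, best_cost = k, cost

    sigma2 = lam[best_k:].sum() / (d - best_k)              # Eq. (8)
    A = U[:, :best_k] * np.sqrt(np.maximum(lam[:best_k] - sigma2, 0.0))  # Eq. (9)
    return best_k, sigma2, A

def hbic(loglik, N, pi, ks, d):
    # Eq. (27): L* = L - sum_j (D_j/2) log(N pi_j) - ((m - 1)/2) log N
    m = len(pi)
    pen = sum(n_free_params(d, k) / 2.0 * np.log(N * p) for k, p in zip(ks, pi))
    return loglik - pen - (m - 1) / 2.0 * np.log(N)
```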

Algorithm 1 Automated Learning of H-BIC-Based MPCA (A-MPCA)
Input: Data X, m_max, m_min, initialization of (\theta^{(0)}, k^{(0)}), L^*_max <- -infinity.
1: while m >= m_min do
2:   repeat  // modified EM with m fixed
3:     repeat  // component annihilation
4:       Compute the conditional expectations Rnj's via (4).
5:       Update \pi_j's via (6).
6:       j* <- arg min_j {\pi_j}.
7:       if N\pi_{j*} <= n_0 then
8:         \pi_{j*} <- 0, m <- m - 1.
9:       end if
10:    until N\pi_{j*} > n_0
11:    Compute the local regularized covariance matrices \tilde{S}_j via (32); update \mu_j's via (7).
12:    Using \tilde{S}_j, update kj's via (30) and (A_j, \sigma_j^2)'s via (9) and (8).
13:  until the change of L^* is smaller than a threshold.
14:  if L^*(\hat\theta(m, \hat{k}(m)), m, \hat{k}(m)) >= L^*_max then
15:    L^*_max <- L^*(\hat\theta(m, \hat{k}(m)), m, \hat{k}(m)).
16:    \theta_best <- \hat\theta(m, \hat{k}(m)), m_best <- m, k_best <- \hat{k}.
17:  end if
18:  j* <- arg min_j {\pi_j}, \pi_{j*} <- 0, m <- m - 1.
19: end while
Output: (\theta_best, m_best, k_best).

Algorithm 2 Two-Stage Algorithm of H-BIC-Based MPCA
Input: Data X, m_max, m_min and L^*_max <- -infinity.
1: for m = m_min to m_max do
2:   Initialize (\theta^{(0)}, k^{(0)}).
3:   Perform Algorithm 1, lines 2–17.
4: end for
Output: (\theta_best, m_best, k_best).

VI. Experiments

In this section, we perform experiments on artificial and real-world data sets. We compare the proposed H-BIC with several competing model selection criteria, including BIC [20], ICL [25], and MML [12]. As detailed in Section IV, H-BIC is an approximation of the VB lower bound F when the sample size N is large. Thus, we also examine the performance of VBMPCA using the code (available from http://www.cse.buffalo.edu/faculty/mbeal/software/vbmfa/vbmfa.tar.gz) that implements VBMPCA as a special case. To reduce the computation burden, the common-k MPCA is used for BIC and ICL, and the optimal solution (\hat{m}, \hat{k}) is found from the set {(m, k), m = m_min, ..., m_max, k = k_min, ..., k_max}. For H-BIC, the two-stage implementation using Algorithm 2 is denoted by H-BIC and the one-stage implementation using Algorithm 1 by A-MPCA. Algorithm 2 is also used for MML and BIC, and the results are denoted by MML and BIC (Algorithm 2), respectively. This allows a fair comparison among BIC, MML, and H-BIC. Additionally, BIC (Algorithm 2) can be used to examine whether the new algorithm can improve BIC under the common-k assumption.

To tackle the initialization issue, for all methods except VBMPCA we choose the best out of ten runs of different K-means starts. We set g = d and B = 0.1 · tr(\sum_{j=1}^m \pi_j^{(0)} S_j^{(0)}) in (32) and use this setting for all methods except VBMPCA. This means that we simply employ a small amount of regularization to achieve a more stable MAP solution. As the value of m may vary in BIC (Algorithm 2), MML, H-BIC, and A-MPCA, (\pi_j^{(0)}, S_j^{(0)}) herein is understood as the initial value for a given m. For H-BIC, BIC (Algorithm 2), MML, and A-MPCA, unless stated otherwise, we set n_0 = 20 in the component annihilation step.


TABLE I: Successful Rate (%) in Choosing the Number of Components

TABLE II: Computational Time in Seconds Used by Different Methods

We stop all the EM-like algorithms when the relative change in the objective function is smaller than a threshold of 10^{-6} or the number of iterations exceeds 500. We also run VBMPCA ten times with the default setting.

A. Synthetic Data

In this section, we use 10-D synthetic data generated from the three-component common-k MPCA model (k = 5) with

(\pi_1, \pi_2, \pi_3) = (0.2, 0.5, 0.3),  \mu_1 = (0, ..., 0)',  \mu_{2i} = 3,  \mu_{3i} = -3,
\Sigma_1 = 0.1 \cdot I_{10} with \Sigma_1(i, i) = 1, i = 1, ..., k,  \Sigma_2 = \Sigma_1,  \Sigma_3 = \Sigma_1.

1) Illustration of A-MPCA: The objective of this experiment is to see how A-MPCA works. We use the data with size N = 900. Fig. 2(a) shows the true mixture. The algorithm is then initialized with m_max = 20 and the resulting configuration is shown in Fig. 2(b). Fig. 2(c) and (d) show two intermediate configurations, and Fig. 2(e) shows the obtained correct estimate with \hat{m} = 3 and \hat{k}_j = 5, j = 1, 2, 3. Fig. 2(f) shows the evolution of L^*, from which it can be seen that the mixtures with m = 1 and m = 2 have lower values of L^* and hence are discarded. We repeat this experiment 100 times with 100 percent success. Therefore, for this mixture, A-MPCA is able to perform parameter estimation and model selection simultaneously and successfully.

2) Sensitivity to the Component Annihilation Parameter n_0: In this experiment, we investigate the effect of n_0 on the performance of BIC (Algorithm 2), MML, H-BIC, and A-MPCA. We choose a moderately large sample size of 600. In this case it is expected that Theorem 1 would hold approximately and thus BIC (Algorithm 2), MML, H-BIC, and A-MPCA should perform similarly. Table I shows the successful rates in choosing the number of components over 100 repetitions. From Table I, it can be observed that BIC (Algorithm 2) is not sensitive to the value of n_0, while MML is the most sensitive. A-MPCA and H-BIC perform reliably when n_0 >= 10 and n_0 >= 15, respectively, while MML is not satisfactory even when n_0 = 40. In practice, it might not be good to take a large value of n_0, since the number of data points in the class of interest may be smaller than n_0. Therefore, we choose a moderate one, n_0 = 20 or n_0 = 15, in our experiments.

3) Comparison with Competing Methods: In this experiment, we investigate the performance of different methods on data with varying sample size N. To reduce variability, for each N we generate 100 training data sets. We set m_min = 1, m_max = 8, k_min = 1, and k_max = 9, except that A-MPCA is initialized with m_max = 20.

As mentioned in Section I, VBMPCA is sensitive to the value of the maximum subspace dimension, hence we use two values for VBMPCA: the true dimension 5 (denoted as VBMPCA:5) and the largest one 9 (denoted as VBMPCA:9). To compare the estimates obtained by different methods, we use two measures: 1) the average estimated mixture number and 2) the average test log likelihood, where 1) is used to assess the accuracy in estimating the number of components and 2) to evaluate the generalization performance of the chosen models. An independent test data set with N = 5000 is generated to compute 2). Fig. 3 shows the results and Table II lists several computational times used by different methods. The main observations from Fig. 3 and Table II are summarized as follows.
1) When N is large, all methods are similar in terms of both accuracy and generalization performance, except that MML suffers from overestimation, choosing numbers close to five and even bigger ones as N increases.
2) When N is not large, for H-BIC/A-MPCA versus BIC, ICL, and MML:
   a) Accuracy: A-MPCA and H-BIC perform similarly and are the most accurate in estimating the number of components. Since the three components of the true model almost do not overlap, BIC and ICL perform similarly [25]. However, they tend to underestimate the true number three until N = 600. By contrast, MML always overestimates the true number three.
   b) Generalization: H-BIC and A-MPCA are the best in generalization, followed by BIC/ICL, and MML is the worst.
   c) Efficiency: H-BIC (resp. A-MPCA) is computationally more efficient than BIC/ICL (resp. H-BIC).
3) A-MPCA versus VBMPCA: A-MPCA substantially outperforms VBMPCA. Moreover, A-MPCA is more efficient than VBMPCA.
4) VBMPCA:5 versus VBMPCA:9: VBMPCA using the true subspace dimension five outperforms that using the maximum nine. This observation confirms the comment in [17].
5) BIC versus BIC (Algorithm 2): They perform similarly but BIC (Algorithm 2) is computationally much more efficient.


Fig. 2. Results by A-MPCA shown in the space of the leading two principal components. (a) True mixture. (b) Initialization with m = 20. (c), (d) Two intermediate estimates with m = 7 and m = 4. (e) The final estimate with m = 3. (f) Typical evolution of L^* versus the number of iterations, where the vertical solid lines signal that an insignificant component with N\pi_j <= 20 is removed and the dashed lines signal that the least probable component is forced to be removed.

Fig. 3. Results of different methods on artificial data sets. (a) Average estimated number of components versus training sample size N. (b) Test log likelihood (divided by 10^4) versus training sample size N.

B. Performance on Wine and Iris Data Sets

In this section, we perform experiments on two benchmark real-world data sets available from the UCI Machine Learning Repository [26]:
1) The Wine data set contains 178 observations of three types of Italian wines, Barolo, Grignolino, and Barbera, measured on 13 chemical and physical variables. The Barolo, Grignolino, and Barbera wines have 59, 71, and 48 observations, respectively.
2) The Iris data set contains 150 observations, measured on four variables, of three classes of Iris plants. Each class has 50 observations.
For the Wine (resp. Iris) data, 48, 58, and 38 observations from Barolo, Grignolino, and Barbera (resp. 45 observations from each class) are randomly chosen for training and the remaining ones are used for testing.

For both data sets, we set m_min = 1 and m_max = 10 and initialize A-MPCA with m_max = 10. Since the training sample size is small, we set n_0 = 15 in the component annihilation step for H-BIC, BIC (Algorithm 2), MML, and A-MPCA. We fit VBMPCA with k_max = 10 (resp. k_max = 3) for the Wine (resp. Iris) data.

To compare the estimates obtained by different methods, we use two performance measures: 1) the adjusted Rand index [27] and 2) the classification error rate. The adjusted Rand index uses a value between zero and one to quantify the clustering performance on the training data, where a value of one indicates perfect agreement and a small value means poor agreement between the true classification and the clustering results.
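For illustration, a hedged sketch of the two evaluation measures follows: the adjusted Rand index (via scikit-learn's adjusted_rand_score) and the error rate obtained by majority-vote labeling of components, whose exact procedure is described in the next paragraph. The responsibility matrices and integer class labels are assumed inputs.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def evaluate(R_train, y_train, R_test, y_test):
    """R_*: (N, m) responsibilities from the fitted MPCA; y_*: true class labels (0-based ints)."""
    # 1) adjusted Rand index between the true classes and the MAP clustering on training data
    cluster_train = R_train.argmax(axis=1)
    ari = adjusted_rand_score(y_train, cluster_train)

    # 2) error rate: label each component by the majority class of its training points,
    #    then MAP-classify the test data and compare with the true labels
    comp_label = np.array([np.bincount(y_train[cluster_train == j]).argmax()
                           if np.any(cluster_train == j) else -1
                           for j in range(R_train.shape[1])])
    y_pred = comp_label[R_test.argmax(axis=1)]
    error_rate = np.mean(y_pred != y_test)
    return ari, error_rate
```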


TABLE III: Results on Wine Data and Iris Data Over 50 Random Splittings. The CPU Time is Shown in Seconds and the Frequency of the Chosen Number of Components is Shown as (m̂ = 2, m̂ = 3, ..., m̂ >= 6). The Best Method is in Boldface and the Method Marked * is Statistically Significantly Worse than the Best One (With a p-Value of 0.05 Using the Two-Sample One-Tailed t-Test)

TABLE IV: Results of Several Classifiers on Ionosphere Data. The Best Method is in Boldface

TABLE V: Results on Ionosphere Data. The CPU Time is Shown in Seconds. The Chosen Number of Components and Subspace Dimensions are Shown as m̂(q_1, ..., q_m̂). The Best Method is in Boldface

To compute 2), we perform a MAP classification on the training data with the Rnj's, and label each component with the majority vote of its data. Then another MAP classification is performed on the test data and the classification error rate is computed by measuring the agreement between the estimated class labels and the true ones.

Table III summarizes the average results across 50 random splittings. From Table III, it can be seen that A-MPCA and H-BIC perform similarly and are the best, whether judged by the adjusted Rand index or the error rate. BIC and ICL yield the same result, choosing two-component solutions 45 times. Consequently, they have a significantly lower adjusted Rand index and higher error rate than A-MPCA. BIC (Algorithm 2) improves BIC on the Wine data but not on the Iris data. In contrast, VBMPCA always underestimates the true number of classes, three, and performs the worst, while MML tends to overestimate and has a slightly higher error rate but significantly lower adjusted Rand index. It is also seen that A-MPCA is computationally the most efficient, and H-BIC and BIC (Algorithm 2) are more efficient than BIC.

C. Performance on Ionosphere Data Set

The Ionosphere data set available from [26] contains 351 observations of two classes of signals, "good" and "bad," measured on 34 continuous attributes. Table IV lists several previous results by various classifiers. All these results are obtained from one fixed splitting, where the training set consists of the first 100 "good" and 100 "bad" observations, and the testing set comprises the remaining 125 "good" and 26 "bad" ones. To make our results comparable with those in Table IV, we use the same training and testing sets. The difference is that we perform an unsupervised clustering experiment on the training set. We use an empirical setting similar to that in Section VI-B except that k_max = 20. Table V summarizes the results.

From Table V, it can be observed that H-BIC and A-MPCA perform the best. BIC (Algorithm 2) improves BIC substantially on this data and achieves performance comparable with A-MPCA. In contrast, MML chooses a model with large subspace dimensions and performs unsatisfactorily. The respective error rates of 2.6% and 4.0% by H-BIC and A-MPCA are comparable with the best result of 3.3% shown in Table IV. Musa et al. [11] have conducted a classification experiment by fitting each class with a separate MPCA model and obtained an average result of 3.4±1.4% across five random splittings. The results by H-BIC and A-MPCA are roughly comparable with theirs, but our results are obtained from a more difficult clustering experiment, in which both classes are fitted with a common MPCA model. In addition, their method [11] requires a three-component mixture for the "good" class and a six-component mixture for the "bad" class, while H-BIC (resp. A-MPCA) only requires a four-component (resp. three-component) mixture for both classes.

D. Performance on Handwritten Digit Recognition

In this section, we perform an unsupervised clustering experiment on a handwritten digit database available from [26]. This database contains 5620 images of ten digits (0–9) of size 8 x 8, written by 43 people. The training and testing sets contain 3823 and 1797 images, respectively. In our experiment, we preprocess the data by a normalization transformation. Some examples of the data are shown in Fig. 4(a).

For BIC, ICL, H-BIC, BIC (Algorithm 2), and MML, we set m_min = 7 and m_max = 25. For BIC and ICL, we find the optimal k from k_min = 2 to k_max = 20 with increments of two. We initialize A-MPCA with m_max = 40 and fit VBMPCA with k_max = 20. For fair comparison, we set k_max = 20 for H-BIC, BIC (Algorithm 2), MML, and A-MPCA. To compare different methods, we use two performance measures: 1) test log likelihood and 2) classification error rate. Both 1) and 2) are evaluated on the test data set.


Fig. 4. (a) Examples of digits in the database. (b) Means \mu_j's of the chosen 20-component estimate by A-MPCA, together with \pi_j (k_j), where \pi_j and k_j are the estimated prior probability and subspace dimension of component j, respectively.

TABLE VI: Results on the Handwritten Digit Data. The Best Method is in Boldface

The results are summarized in Table VI. Fig. 4(b) shows the optimal 20-component estimate by A-MPCA together with the chosen kj's. From Table VI, it can be seen that H-BIC and A-MPCA are again the top performers, followed by MML and BIC (Algorithm 2). BIC/ICL is the worst, choosing a ten-component solution; note that the number ten is much smaller than those chosen by the other methods. Similar to that reported in [17], the result of VBMPCA could be slightly improved by increasing the value of k, since VBMPCA is sensitive to this value as observed in Section VI-A3. However, this would cost more time. In addition, it is observed that H-BIC, A-MPCA, and BIC (Algorithm 2) are all much more efficient than BIC.

VII. Conclusion

In this paper, we have presented a novel hierarchical BIC (H-BIC) for model selection in MPCA. Moreover, to learn H-BIC-based MPCA, we have developed two efficient algorithms: two-stage and one-stage variants. The novelty is that both algorithms seamlessly integrate the selection of the subspace dimensions into parameter estimation, and the one-stage variant further integrates the selection of the number of components into a single algorithm. Empirical results show that our proposed approaches are more advantageous than alternative methods.

The EM algorithm for mixture models is known to converge to a local optimum and is thus sensitive to the initialization [12]. The two-stage algorithm tackles this issue based on multiple starts for every candidate number of components m, while the one-stage algorithm relies on backward elimination from multiple starts for the maximum m_max considered. As can be seen in our experiments, the one-stage algorithm is generally computationally more efficient even with a larger value of m_max. For the well-separated synthetic data in Section VI-A and data with moderate sample size, such as the digit data in Section VI-D, the one-stage algorithm can achieve slightly better performance in a shorter time. However, for data with smaller sample size, such as the Iris and Ionosphere data, the two-stage algorithm can be competitive.

In our experiments, we also examined the performance of the MML criterion using an extension of FJ's algorithm to MPCA, which updated the mixing proportions \pi_j's by (16) and chose the subspace dimensions based on (26). Since N\pi_j^{(t+1)} in this case does not equal \sum_{n=1}^N R_{nj}(\theta^{(t)}) in general, the choices of subspace dimensions are not local BICs. This algorithm failed completely, always choosing two-component solutions on the Wine data, a one-component mixture on the Ionosphere data, and a three-component mixture on the ten-class digit data. This failure motivated us to develop the two new algorithms.

Theorem 1 in Section IV-B shows that the proposed H-BIC is simply an approximation of the VB lower bound in the large sample limit. However, our empirical results show that VBMPCA suffers from underestimation and performs worse than H-BIC. There are two possible reasons for this. One reason is that the factor loading matrices in MFA can only be determined up to rotations, while the implementation of VBMFA in [15] fails to reduce the degrees of freedom due to the rotations; as a result, the VBMPCA lower bound may penalize the model more heavily than H-BIC [28]. The other reason is that H-BIC and VBMPCA use different approaches to determine the local subspace dimensions, that is, H-BIC utilizes local BIC while VBMPCA purely relies on ARD. Given the good performance of H-BIC, we believe that there should be room to improve VBMPCA.


In addition, it might be worthwhile to further explore other possible approximations of the VB lower bound and compare them with H-BIC under the MPCA model or in problems beyond MPCA.

The key to the selection of subspace dimensions in A-MPCA relies on the eigen-decomposition of the local covariance matrices. However, the eigen-decomposition is not available in another EM algorithm for MPCA that includes the latent factor vectors [5]. Since that algorithm scales better for large data sets, it would be interesting to investigate how to implement H-BIC efficiently for it. Extending A-MPCA to a closely related model, namely mixtures of factor analyzers [29], [30], is also worthwhile and currently being investigated.

For high-dimensional data, where the number of data points in each class is typically smaller than the data dimension, the number of parameters to be estimated could be very large compared with the available local sample size. In this case, it would be better to use more parsimonious covariance models than the MPCA model, such as mixtures of common-factor analyzers (MCFA) [31]. In fitting MCFA with the ML/MAP method, an important issue is how to solve the model selection problem in an effective and efficient manner. Since the MCFA model has constraints across the parameters of different components (e.g., a common factor loading matrix), it seems that the H-BIC criterion cannot be used directly. However, motivated by our current work, it would be interesting to develop a variant of H-BIC for MCFA to solve this issue.

Su and Dy [10] propose an automated hierarchical MPCA method, in which the ICL criterion is used for splitting clusters and the retained variance ratio is used for determining the local subspace dimensions. Inspired by our current work, it would be interesting to further develop a hierarchical ICL criterion, which would allow simultaneously splitting clusters and determining local subspace dimensions with a common cost function and could lead to a fully automated hierarchical MPCA.

Appendix A: Proofs

A. Limit of F1

From p(\pi) = Dir(\pi|\alpha_0) and q(\pi) = Dir(\pi|\alpha), we get

E[\log p(\pi)] = \log C(\alpha_0) + (\alpha_0 - 1) \sum_j E[\log \pi_j]

E[\log q(\pi)] = \log C(\alpha) + \sum_j (\alpha_j - 1) E[\log \pi_j]

where C(\alpha) = \Gamma(\sum_j \alpha_j) / \prod_j \Gamma(\alpha_j) is the normalization coefficient. When N is large, we have the following approximation (see Appendix A-D for the proof).

Proposition 1: As N \to \infty,

-\log C(\alpha) = \sum_j N\hat\pi_j \log\hat\pi_j - \frac{m-1}{2} \log N + O(1)

E[\log \pi_j] = \log\hat\pi_j + O(1)

where the \hat\pi_j's are the MAP estimators of q(\pi). Using Proposition 1, we obtain

F_1 = E[\log p(\pi)] - E[\log q(\pi)] = \log p(\hat\pi) - \frac{m-1}{2} \log N + O(1).

B. Limit of F2

Consider E[\log q(\theta_j)]. It has been proved in [32] that for exponential family models the variational posterior distribution q(\theta) is approximately normal and hence q(\theta_j) is also approximately normal. Thus we have that, as N \to \infty,

E[\log q(\theta_j)] \approx \frac{1}{2} \log |2\pi H(\hat\theta_j)|

where H(\hat\theta_j) is the negative Hessian matrix of q(\theta_j) evaluated at the MAP estimator \hat\theta_j. Using the fact that H scales linearly with \sum_n R_{nj} [see (24)], and \sum_n R_{nj}/(N\hat\pi_j) \to 1, we have that, as N \to \infty, H/(N\hat\pi_j) converges to a constant matrix, denoted by H_0, and thus

E[\log q(\theta_j)] = \frac{D_j}{2} \log(N\hat\pi_j) + \frac{1}{2} \log|2\pi H_0| + O(1) = \frac{D_j}{2} \log(N\hat\pi_j) + O(1).

Therefore, as N \to \infty, we obtain

F_2 = \sum_j \big\{ E[\log q(\theta_j)] - E[\log p(\theta_j)] \big\} = \sum_j \Big\{ \frac{D_j}{2} \log(N\hat\pi_j) - \log p(\hat\theta_j) \Big\} + O(1).

C. Limit of FD

Since q(\pi) and q(\theta_j) will be strongly peaked at \hat\pi and \hat\theta_j in the large-N limit [33], we have

q(\pi) = \delta(\pi - \hat\pi) + O(1),   q(\theta_j) = \delta(\theta_j - \hat\theta_j) + O(1)

where \delta(\cdot) is the Dirac delta function. Combining this result with (21) yields

q(z_{nj}) = p(z_{nj}|x_n, \hat\theta) + O(1), as N \to \infty.

Using these results, we obtain that

F_D = L(X|\hat\theta) + O(1).
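As a quick numerical sanity check of Proposition 1 (its proof is given below), the exact -log C(α) can be computed with scipy's gammaln and compared with the stated approximation; the counts N_j are simulated here purely for illustration, and the function name and defaults are assumptions made for the example.

```python
import numpy as np
from scipy.special import gammaln

def check_proposition1(N, pi_true=(0.2, 0.5, 0.3), alpha0=0.5, seed=0):
    rng = np.random.default_rng(seed)
    m = len(pi_true)
    Nj = rng.multinomial(N, pi_true)             # component counts
    alpha = Nj + alpha0                          # Dirichlet posterior parameters of q(pi)
    # exact: -log C(alpha) = sum_j log Gamma(alpha_j) - log Gamma(sum_j alpha_j)
    exact = gammaln(alpha).sum() - gammaln(alpha.sum())
    # MAP of Dir(alpha) and the approximation stated in Proposition 1
    pi_hat = (alpha - 1.0) / (alpha.sum() - m)
    approx = np.sum(Nj * np.log(pi_hat)) - (m - 1) / 2.0 * np.log(N)
    return exact, approx, exact - approx         # the gap should remain O(1) as N grows

for N in (10**3, 10**4, 10**5):
    print(N, check_proposition1(N))
```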



log (Nj + α0 ) − (N + mα0 ).

(33)

j

Using Stirling’s approximation log (x) = (x − 1/2) log x − x + O(1),

x → ∞.

(34)

This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination. ZHAO : EFFICIENT MODEL SELECTION FOR MIXTURES OF PROBABILISTIC PCA VIA HIERARCHICAL BIC

Substituting (34) into (33) and taking α0 = 1/2 gives  Nj + α0 − log C(α) = (Nj + α0 ) log N + mα0 j 1 Nm + α0 1 log (Nj + α0 ) − log ( ) + O(1) 2 j=1 2 N + mα0  1  Nj log πˆ j + (α0 − ) log πˆ j = 2 j j m−1



m−1 log N + O(1) − 2  m−1 = Nj log πˆ j − log N + O(1) 2 j where πˆ j s are the MAP estimator of q(π). It can be seen that α0 = 1/2 will only yield an additional order-1 term and can be absorbed into the O(1) term. Thus,  the proof holds for α0 > 0. As E[log πj ] = ψ(πj ) − ψ( j πj ), where ψ(x) is known as the digamma function, using Stirling’s approximation ψ(x) = log x + O(1), x → ∞  we obtain E[log πj ] = log (πj / j πj ) + O(1) = log πˆ j + O(1), as N → ∞. Acknowledgment The author would like to thank the Associate Editor and all the four anonymous reviewers for their valuable comments and suggestions that strengthened this paper.

References [1] I. T. Jolliffe, Principal Component Analysis, 2nd ed. New York, NY, USA: Springer, 2002. [2] N. Kambhatla and T. K. Leen, “Dimension reduction by local principal component analysis,” Neural Comput., vol. 9, no. 7, pp. 1493–1516, Jan. 1997. [3] G. Hinton, P. Dayan, and M. Revow, “Modelling the manifolds of handwritten digits,” IEEE Trans. Neural Netw., vol. 8, no. 1, pp. 65–74, Jan. 1997. [4] G. Hinton, M. Revow, and P. Dayan, “Recognizing handwritten digits using mixtures of linear models,” in Proc. Adv. Neural Inf. Process. Syst., 1995, pp. 1015–1022. [5] M. E. Tipping and C. M. Bishop, “Mixtures of probabilistic principal component analysers,” Neural Comput., vol. 11, no. 2, pp. 443–482, 1999. [6] A. P. Dempster, N. M. Laird, and D. B. Rubin, “Maximum likelihood from incomplete data using the EM algorithm (with discussion),” J. Roy. Stat. Soc. Ser. B, Stat. Methodol., vol. 39, no. 1, pp. 1–38, 1977. [7] P. D. McNicholas and T. B. Murphy, “Parsimonious Gaussian mixture models,” Statist. Comput., vol. 18, no. 3, pp. 285–296, Sep. 2008. [8] C. Archer and T. K. Leen, “Optimal dimension reduction and transform coding with mixture principal components,” in Proc. IEEE Int. Joint Conf. Neural Netw., vol. 2, 1996, pp. 916–920. [9] P. Meinicke and H. Ritter, “Resolution-based complexity control for Gaussian mixture,” Neural Comput., vol. 13, no. 2, pp. 453–475, 2001. [10] T. Su and J. G. Dy, “Automated hierarchical mixtures of probabilistic principal component analyzers,” in Proc. 21th Int. Conf. Mach. Learn., 2004, pp. 775–782. [11] M. E. M. Musa, D. de Ridder, R. P. W. Duin, and V. Atalay, “Almost autonomous training of mixtures of principal component analyzers,” Pattern Recognit. Lett., vol. 25, no. 9, pp. 1085–1095, 2004.


[12] M. A. T. Figueiredo and A. K. Jain, “Unsupervised learning of finite mixture models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 3, pp. 381–396, Mar. 2002. [13] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY, USA: Springer, 2006. [14] C. M. Bishop, J. M. Winn, and C. C. Nh, “Non-linear Bayesian image modelling,” in Proc. 6th Eur. Conf. Comput. Vis, 2000, pp. 3–17. [15] Z. Ghahramani and M. Beal, “Variational inference for Bayesian mixture of factor analysers,” in Proc. Adv. Neural Inf. Process. Syst., 2000, pp. 449–455. [16] X. Wei and Z. Yang, “The infinite student’s t-factor mixture analyzer for robust clustering and classification,” Pattern Recognit., vol. 45, no. 12, pp. 4346–4357, 2012. [17] C. Constantinopoulos and A. Likas, “Unsupervised learning of Gaussian mixtures based on variational component splitting,” IEEE Trans. Neural Netw., vol. 18, no. 3, pp. 745–755, May 2007. [18] G. Schwarz, “Estimating the dimension of a model,” Ann. Statist., vol. 6, no. 2, pp. 461–464, 1978. [19] G. Mclachlan and D. Peel, Finite Mixture Models. New York, NY, USA: Wiley, 1997. [20] C. Fraley and A. E. Raftery, “Model-based clustering, discriminant analysis, and density estimation,” J. Amer. Statist. Assoc., vol. 97, no. 458, pp. 611–631, 2002. [21] J. L. Andrews and P. D. McNicholas, “Extending mixtures of multivariate t-factor analyzers.” Statist. Comput., vol. 21, no. 3, pp. 361–373, 2011. [22] X. L. Hu and L. Xu, “A comparative investigation on subspace dimension determination,” Neural Netw., vol. 17, nos. 8–9, pp. 1051–1059, 2004. [23] S. Srivastava, M. R. Gupta, and B. A. Frigyik, “Bayesian quadratic discriminant analysis,” J. Mach. Learn. Res., vol. 8, no. 6, pp. 1277–1305, 2007. [24] J. H. Friedman, “Regularized discriminant analysis,” J. Amer. Statist. Assoc., vol. 84, no. 405, pp. 165–175, 1989. [25] C. Biernacki, G. Celeux, and G. Govaert, “Assessing a mixture model for clustering with the integrated completed likelihood,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 7, pp. 719–725, Jul. 2000. [26] K. Bache and M. Lichman. (2013). UCI Machine Learning Repository [Online]. Available: http://archive.ics.uci.edu/ml [27] L. Hubert and P. Arabie, “Comparing partitions,” J. Classif., vol. 2, no. 1, pp. 193–218, 1985. [28] J. Zhao and P. L. H. Yu, “A note on variational Bayesian factor analysis,” Neural Netw., vol. 22, no. 7, pp. 988–997, 2009. [29] G. J. McLachlan, D. Peel, and R. W. Bean, “Modelling high-dimensional data by mixtures of factor analyzers,” Comput. Statist. Data Anal., vol. 41, nos. 3–4, pp. 379–388, 2003. [30] J. Zhao and P. L. H. Yu, “Fast ML estimation for the mixture of factor analyzers via an ECM algorithm,” IEEE Trans. Neural Netw., vol. 19, no. 11, pp. 1956–1961, Nov. 2008. [31] J. Baek, G. J. McLachlan, and L. K. Flack, “Mixtures of factor analyzers with common factor loadings: Applications to the clustering and visualization of high-dimensional data,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 32, no. 7, pp. 1298–1309, Jul. 2010. [32] B. Wang and D. M. Titterington, “Convergence and asymptotic normality of variational bayesian approximations for exponential family models with missing values,” in Proc. Uncertain. Artif. Intell., 2004, pp. 577–584. [33] H. Attias, “Inferring parameters and structure of latent variable models by variational Bayes,” in Proc. 15th Uncertain. Artif. Intell., 1999, pp. 21–30.

Jianhua Zhao received the B.S. degree in mathematics from Yunnan University, Kunming, China, and the M.S. and Ph.D. degrees in statistics from Southeast University, Nanjing, China, and the University of Hong Kong, Hong Kong, China, in 2000, 2003, and 2009, respectively. He is currently an Associate Professor with the School of Statistics and Mathematics, Yunnan University of Finance and Economics, Kunming. His current research interests include statistical machine learning and pattern recognition.