Maximum Penalized Likelihood Estimation of Finite Mixtures with a Structural Model

Shinto Eguchi
Institute of Statistical Mathematics and Department of Statistical Science, The Graduate University for Advanced Studies, Minato-ku, Tokyo 106-8569, Japan
email: [email protected]

and

Koichi Yoshioka
Department of Biochemistry and Biophysics, Tokyo Medical and Dental University, Bunkyo-ku, Tokyo 113-8519, and CREST, JST, Tokyo, Japan
email: [email protected]

Summary. This paper discusses a problem of inference in finite mixture models. Likelihood inference often performs poorly when the finite mixture model is full, without any structure, as in a normal mixture model with unspecified means and variances. Imposing a structural constraint on the full model avoids the problem, but such a structural model may suffer from inflexibility. To overcome this dilemma we propose a maximum penalized likelihood method with a penalty function defined by the Kullback-Leibler divergence from the distribution estimated in the structural model. Good performance of this method, with the penalty parameter selected by a cross-validatory criterion, is demonstrated. As a specific application, we apply our method to the neurophysiological analysis of evoked synaptic responses with amplitude fluctuations.

Key words: Cross validation, EM algorithm, Infinite likelihood, Mixture of distributions, Model identification, Penalized likelihood.


1. Introduction

We discuss statistical inference on a finite mixture model whose distribution is defined by

f(x, \theta) = \sum_{r=0}^{R} \pi_r\, h(x, \omega_r)    (1)

with the parameter vector θ = (π_r, ω_r)_{r=0,...,R}, where π_r and ω_r are the mixing proportion of the label r and the parameter vector of the r-th component distribution, respectively. We focus on the case where the component distributions have a common form and differ only in the value of ω. Typically, h(x, ω) is a normal density with mean μ and variance σ², so ω = (μ, σ²). See Titterington et al. (1987) and McLachlan and Peel (2000) for extensive discussions. A mixture of distributions adaptively gives flexible density shapes covering multi-modality and non-normality. However, the fact that the component label r is not observed is known to lead to essential difficulties in inference for the mixture model. If the label r were observed, then inference for the parameters (π_r, ω_r) would be separable: the maximum likelihood estimator (MLE) of π_r would be the relative frequency of the r-th stratum, and the MLE of ω_r would be obtained only from the observations stratified into the r-th component. In effect, the EM algorithm for obtaining the MLE of θ exploits this structure with the label r treated as missing. See Dempster et al. (1977) and McLachlan and Krishnan (1998). The E-step simply computes the conditional probability of the label r given x, and the M-step maximizes the complete-data likelihood function estimated by that conditional probability, cf. Orchard and Woodbury (1972). Given data {x_i : i = 1, ..., n} from (1), the log-likelihood function is

\ell^{(M)}(\theta) = \sum_{i=1}^{n} \log f(x_i, \theta).    (2)

Let us consider a "full" model in (1) in which π_r and ω_r are all unspecified. There are two difficulties in likelihood inference for such a full model based on ℓ^(M)(θ). One is the problem of model identification, which is guaranteed only when all the parameter vectors ω_r are distinct and all π_r are positive. In practice we sometimes encounter problematic situations where some mixing proportions π_r are nearly negligible, or some component parameters ω_r are nearly equal to each other. The other difficulty is that the log-likelihood function (2) over the full model (1) often becomes unbounded above on the boundary set, so that the MLE of θ does not exist. It is known that the full normal mixture model has infinite likelihood for any set of observations. This is reflected in the fact that the EM algorithm applied to the full model is often trapped in local maxima, and the estimated density function often has unreasonable spike modes with nearly zero variances. Thus, in these two respects, the full mixture model fails to satisfy the usual regularity conditions for a parametric model. A condition under which the infinite likelihood occurs will be given in a subsequent section. A simple way to avoid these difficulties of full likelihood inference is to impose a structural constraint on the model so that such a "structural model" keeps away from the set of unidentifiable parameters and from the boundary set giving infinite likelihood. The constraint on θ defining a structural model can be given in two ways. One is implicit, Θ_1 = {θ : f(θ) = 0} with an inducing function f of θ; the other is explicit,

\Theta_1 = \{\theta(\xi) : \xi \in \Xi\}    (3)

with a structural parameter vector ξ of lower dimension than that of θ. The MLE for the structural model is given by

\hat{\theta}_1 = \arg\max\{\ell^{(M)}(\theta) : \theta \in \Theta_1\},    (4)

which can be obtained as θ(ξ̂) with the MLE ξ̂ of the structural parameter ξ if the model is given by (3). Under such a structural model we can expect the standard performance of likelihood inference. As a compensation, however, the performance of the inference will be poor if we select a misspecified structure. In practice it is often difficult to build a structure that guarantees both the regularity conditions for the model and flexible adaptability to the data.

Consequently we are confronted with two conflicting situations: the full mixture model is flexible enough to adapt to a wide variety of density shapes, but it suffers from the problems of model identification and infinite likelihood; a structural model retains identifiability and a bounded likelihood, but it may suffer from inflexibility. Our approach in this paper is to strike an efficient balance between these two model specifications. We observe that the full likelihood function ℓ^(M)(θ) defined at (2) has superfluous degrees of freedom for θ in the unspecified space. Our proposal is to use the Kullback-Leibler (KL) divergence between the joint distributions of the label r and the variable x as a penalty function for ℓ^(M)(θ). Although the label r is missing, the KL divergence between the complete-data distributions can be obtained as

D^{(J)}(\theta^*, \theta) = \sum_{r=0}^{R} \pi_r^* \log\frac{\pi_r^*}{\pi_r} + \sum_{r=0}^{R} \pi_r^*\, D(\omega_r^*, \omega_r),    (5)

where θ* = (π*_r, ω*_r)_{r=0,...,R}, θ = (π_r, ω_r)_{r=0,...,R} and D(ω*_r, ω_r) is the KL divergence from the r-th component distribution h(x, ω*_r) to h(x, ω_r). We note that D^(J) properly discriminates any two distinct points θ* and θ.

The estimation proceeds in two steps: we first obtain the structural MLE θ̂_1 defined in (4); we then maximize the penalized likelihood function given by

\ell_\lambda^{(M)}(\theta) = (1-\lambda)\,\ell^{(M)}(\theta) - \lambda\, D^{(J)}(\hat{\theta}_1, \theta),

where λ is a scalar with 0 ≤ λ ≤ 1. Thus we propose the maximum penalized likelihood estimator (MPLE) θ̂_λ. We will show that for a mixture of exponential-type distributions the MPLE is the maximum a posteriori estimator with respect to the conjugate prior whose hyperparameters are estimated by the structural MLE θ̂_1. See Ormoneit and Tresp (1998) for the normal mixture with a fixed conjugate prior. The one-parameter family of penalized likelihood functions ℓ_λ(θ) reduces to the full likelihood function ℓ^(M)(θ) and to −D^(J)(θ̂_1, θ) at the endpoints λ = 0 and λ = 1, respectively. Thus the tuning parameter λ determines the balance between model flexibility and regularity. To optimize the tuning parameter λ we propose a criterion that approximates cross-validation and is closely related to the generalized information criterion of Konishi and Kitagawa (1996). See also Stone (1977). This criterion may also provide a self-contained way of finding the proper number of components on the basis of the maximum penalized likelihood, cf. Richardson and Green (1997) for Bayesian arguments.

This paper is organized as follows. In Section 2 we investigate difficulties involved in the full mixture model and full likelihood inference. Section 3 discusses structural models for the normal mixture, which will be incorporated into the proposed method. In particular we consider a structural model, called the quantal model, which is derived from a hypothesis on the mechanism of synaptic transmission between neurons. In Section 4 we propose the maximum penalized likelihood method using the joint KL divergence. Section 5 discusses a method of selecting the parameter tuning the penalty function. Finally we present results of applying our method with the quantal model to a real data set.

2. Infinite likelihood in a finite mixture model

We consider a finite mixture model f(x, θ) defined by (1), in which the parameter vector θ to be estimated has no restriction, i.e., all components π_r and ω_r are unspecified except for the condition Σ_r π_r = 1. Let x_1, ..., x_n be data from a distribution in the finite mixture model. We focus on the behavior of the full log-likelihood function ℓ^(M)(θ) defined at (2) when the parameter space Ω of the density function h(x, ω) is open. The parameter space can be written as

\Theta = S_R \times \overbrace{\Omega \times \cdots \times \Omega}^{R+1},

where S_R = {(π_0, ..., π_R)} denotes the simplex of dimension R. Then there often occurs a situation where the log-likelihood function (2) becomes +∞ on a subset of the boundary of Θ. We present a simple sufficient condition under which ℓ^(M)(θ) is unbounded above, stated as a condition on the component model h(x, ω) as follows.

Theorem 1. The log-likelihood function ℓ^(M)(θ) is unbounded above if the component distribution in (1) meets the following condition (C): given any single observation x there exists a point ω*(x) on the boundary of Ω such that the density h(x, ω) increases to +∞ as ω goes to ω*(x).

Proof. We get

\ell^{(M)}(\theta) \ge \log \pi_r h(x_i, \omega_r) + \sum_{j \ne i} \log\Big\{\sum_{s \ne r} \pi_s h(x_j, \omega_s)\Big\}    (6)

since

\log f(x_i, \theta) \ge \log \pi_r h(x_i, \omega_r)
and
\log f(x_j, \theta) \ge \log\Big\{\sum_{s \ne r} \pi_s h(x_j, \omega_s)\Big\}.

Hence from (6) we obtain ℓ^(M)(θ) ≥ log h(x_i, ω_r) + c_{ir} for any i, 1 ≤ i ≤ n, where c_{ir} is independent of ω_r. It follows from Condition (C) for h(x_i, ω_r) that ℓ^(M)(θ) increases to +∞ as the r-th component vector ω_r approaches ω*(x_i). This completes the proof.

We remark that under Condition (C) the MLE of θ does not exist for any sample size n, which implies that the usual asymptotic theory cannot be applied to the model. On the other hand, the MLE of ω in the component model h(x, ω) usually exists for any n > n_0 with some n_0. For example, in the p-variate normal distribution model the MLE of the covariance matrix exists with probability one if n ≥ p + 1. Condition (C) often arises when the dimension of ω is larger than that of x. Assume that the component distribution h(x, ω) belongs to a regular exponential family given by

h(x, \omega) = \exp\{\omega^{\top} t(x) - \kappa(\omega)\}    (7)

with sufficient statistic t(x) and Ω = {ω : κ(ω) < ∞}. Let Γ be the convex hull of the range of t(x). If t(x) lies on the boundary of Γ, then h(x, ω) goes to +∞ as ω goes to ω*(x), where ω*(x) is a solution of the equation t(x) = (∂/∂ω)κ(ω) with respect to ω. See Section 6.4 in Barndorff-Nielsen and Cox (1989). The normal mixture model with unconstrained means and variances satisfies Condition (C), since the sufficient statistic t(x) = (x, x²)ᵀ has convex hull Γ = {(γ_1, γ_2) : γ_2 ≥ γ_1²} and t(x) lies on the boundary of Γ for every x. On the other hand, if we impose the equal-variance structure σ_0² = ··· = σ_R² = σ², then the log-likelihood function is bounded above with respect to σ². This discussion also holds for multivariate cases; see McLachlan and Krishnan (1998). We will discuss the penalized likelihood for this simple structure. It has been noted that the EM algorithm is quite unstable in its convergence when the likelihood equation has multiple roots. In fact the normal mixture model does not meet a sufficient condition for the EM algorithm to converge to the MLE, namely that the level set of the log-likelihood function

\{\theta : \ell^{(M)}(\theta) \ge c\}    (8)

is compact for any possible c (Wu, 1983; McLachlan and Krishnan, 1998), as observed in the above discussion. Accordingly, the EM algorithm sometimes fails to converge in our simulation studies. This is because the variance of a component shrinks to 0 when, during the EM iterations, its mean estimate happens to come close to one of the observations.
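As a concrete numerical illustration of this phenomenon, the following Python sketch (an illustration added here; the sample and the two-component setting are arbitrary and not taken from the paper's data) places one component mean at an observation and shrinks its variance, so that the full log-likelihood (2) grows without bound.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=50)   # arbitrary sample

def mixture_loglik(x, pi, mu, sigma2):
    """Full-model log-likelihood (2) of a normal mixture."""
    dens = sum(p * norm.pdf(x, m, np.sqrt(s2)) for p, m, s2 in zip(pi, mu, sigma2))
    return np.log(dens).sum()

pi = [0.5, 0.5]
mu = [x[0], 5.0]          # put one component mean exactly on an observation
for s2 in [1.0, 1e-2, 1e-4, 1e-8]:
    print(s2, mixture_loglik(x, pi, mu, [s2, 1.0]))
# the log-likelihood increases without bound as the first variance tends to 0
```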

3. Structural model of finite mixture

We discuss a structural model within the finite mixture model, in which the mixing proportions π_r and the parameter vectors ω_r of the component distributions in (1) are constrained as in (3). If the structural model is so defined that the parameter vector is kept outside the region causing unidentifiability or infinite likelihood, then the standard behavior of likelihood inference, including the EM algorithm, can be expected. We write such a structure as {θ(ξ) = (π_r(ξ), ω_r(ξ))_{r=0,...,R} : ξ ∈ Ξ}, described by the parameter ξ within the full parameter θ of the (R + 1)-component mixture, where Ξ is an open subset of R^k. To ensure model identification we assume that k ≤ 3R + 2 and that the set {θ(ξ) : ξ ∈ Ξ} is disjoint from the irregular region.

3.1. Quantal model for synaptic transmission

As an illustrative example we introduce a model of synaptic transmission in the nervous system. When synaptic activities are recorded from a single neuron and one of the inputs to the neuron is stimulated, the amplitude of the evoked responses fluctuates randomly, cf. Stricker and Redman (1994). The quantal hypothesis assumes that the mean amplitudes of the responses are separated by equal increments (the quantal amplitude) and that the component variances increase by the variance of the quantal unit. The number of quanta is assumed to obey a binomial or Poisson distribution; in this paper the latter is assumed. Thus, under the normality assumption, the quantal hypothesis leads us to a normal mixture model in which the parent parameter θ is modeled as θ(ξ) = (π_r(ξ_1), μ_r(ξ_2), σ_r²(ξ_3, ξ_4))_{0≤r≤R}:

\pi_r(\xi_1) = \frac{\xi_1^r/r!}{\sum_{s=0}^{R} \xi_1^s/s!}, \qquad \mu_r(\xi_2) = r\xi_2, \qquad \sigma_r^2(\xi_3, \xi_4) = \xi_3 + r\xi_4    (9)

for r = 0, ..., R, with the structural parameter ξ = (ξ_1, ξ_2, ξ_3, ξ_4). The parameter ξ_1 describes the mixing proportions through a truncated Poisson distribution, ξ_2 denotes the quantal amplitude, and ξ_3 and ξ_4 are the variances of the background noise and of the quantal unit noise, respectively. The Jacobian matrix B(ξ) of size (3R + 2) × 4 has transpose

B(\xi)^{\top} =
\begin{pmatrix}
\pi_1\kappa(1,\xi_1), \ldots, \pi_R\kappa(R,\xi_1) & 0, \ldots, 0 & 0, \ldots, 0 \\
0, \ldots, 0 & 0, 1, \ldots, R & 0, \ldots, 0 \\
0, \ldots, 0 & 0, \ldots, 0 & 1, 1, \ldots, 1 \\
0, \ldots, 0 & 0, \ldots, 0 & 0, 1, \ldots, R
\end{pmatrix},

where the column blocks correspond to (π_1, ..., π_R), (μ_0, ..., μ_R) and (σ_0², ..., σ_R²), and

\kappa(r, \xi_1) = \frac{r - \xi_1}{\xi_1} + \frac{\xi_1^R/R!}{\sum_{s=0}^{R} \xi_1^s/s!}.
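For concreteness, a minimal Python sketch of the structural map (9) is given below; the function name quantal_theta is ours, and the example call pairs the structural MLE reported in the caption of Figure 1 with R = 9 (the number of components used in the later analysis) purely for illustration.

```python
import numpy as np
from math import factorial

def quantal_theta(xi1, xi2, xi3, xi4, R):
    """Structural map (9): truncated-Poisson weights, equally spaced means,
    linearly increasing variances."""
    w = np.array([xi1 ** r / factorial(r) for r in range(R + 1)])
    pi = w / w.sum()                       # pi_r(xi1)
    mu = xi2 * np.arange(R + 1)            # mu_r(xi2) = r * xi2
    sigma2 = xi3 + xi4 * np.arange(R + 1)  # sigma_r^2(xi3, xi4) = xi3 + r * xi4
    return pi, mu, sigma2

# example call with the structural MLE reported in the caption of Figure 1
pi, mu, sigma2 = quantal_theta(3.3, -53.1, 103.0, 54.5, R=9)
```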

The MLE ξ̂ is obtained by an EM algorithm that defines a mapping from ξ_0 to ξ_1, where ξ_1 is the solution of the following four equations given the conditional probabilities p^{(0)}_{r|i} = p(r | x_i, θ(ξ_0)) of the label r given x_i:

\sum_{i=1}^{n}\sum_{r=0}^{R} \kappa(r, \xi_1)\, p^{(0)}_{r|i} = 0,
\qquad
\sum_{i=1}^{n}\sum_{r=0}^{R} \frac{r(x_i - r\xi_2)}{\xi_3 + r\xi_4}\, p^{(0)}_{r|i} = 0,

\sum_{i=1}^{n}\sum_{r=0}^{R} \frac{(x_i - r\xi_2)^2 - (\xi_3 + r\xi_4)}{(\xi_3 + r\xi_4)^2}\, p^{(0)}_{r|i} = 0,
\qquad
\sum_{i=1}^{n}\sum_{r=0}^{R} r\,\frac{(x_i - r\xi_2)^2 - (\xi_3 + r\xi_4)}{(\xi_3 + r\xi_4)^2}\, p^{(0)}_{r|i} = 0.

The four equations form a polynomial system in four variables; see Stricker and Redman (1994). A closed-form solution of this system is unavailable, but a numerical solution is easily obtained with a standard iterative optimization algorithm. This structural model satisfies the Wu condition, and the EM algorithm given above has proved effective in our experience. Figure 1 gives a result of quantal analysis applied to a dataset from an experiment on the rat cerebellum. The dataset consists of the amplitudes of inhibitory postsynaptic currents evoked by field stimulation and recorded from a basket cell of a rat cerebellar slice preparation using slice patch recording techniques (n = 174). The kernel density estimate shows multimodal peaks with approximately equal separations, suggesting the quantal nature of the response. The dotted line is the ML density estimate under the quantal model (9). Comparison with the kernel density estimate suggests that the fit of the MLE is good for several components of smaller amplitude. However, some discrepancies are apparent: the mixing proportion of the zero-th component and the mean and variance structures of the components of larger amplitude.

[Figure 1 about here]
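One standard way of carrying out the E-step and the numerical M-step described above is sketched below in Python. This is an illustrative sketch, not the authors' code: the M-step simply maximizes the expected complete-data log-likelihood over ξ with a generic optimizer, whose stationarity conditions are the four estimating equations above; quantal_theta refers to the sketch following (9), and the bounds ξ_1, ξ_3, ξ_4 > 0 are our own assumptions.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def e_step(x, pi, mu, sigma2):
    """Posterior probabilities p_{r|i} of the hidden label r given x_i."""
    comp = pi * norm.pdf(x[:, None], mu, np.sqrt(sigma2))   # shape (n, R+1)
    return comp / comp.sum(axis=1, keepdims=True)

def m_step(x, p, xi_start, R):
    """Maximize the expected complete-data log-likelihood over xi;
    its stationarity conditions are the four estimating equations above."""
    def neg_q(xi):
        pi, mu, sigma2 = quantal_theta(*xi, R)               # structural map (9)
        logc = np.log(pi) + norm.logpdf(x[:, None], mu, np.sqrt(sigma2))
        return -(p * logc).sum()
    bounds = [(1e-6, None), (None, None), (1e-6, None), (1e-6, None)]
    return minimize(neg_q, xi_start, bounds=bounds).x
```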


4. Maximum penalized likelihood estimation

In this section we propose the penalized likelihood method and the MPLE. As outlined in the Introduction, we define the penalized likelihood function

\ell_\lambda^{(M)}(\theta) = (1-\lambda)\,\ell^{(M)}(\theta) - \lambda\, D^{(J)}(\hat{\theta}_1, \theta)

for a fixed λ, 0 ≤ λ ≤ 1, with the structural MLE θ̂_1 = θ(ξ̂). Here ξ̂ is the MLE of ξ in the structural model (3). Thus the MPLE is defined by

\hat{\theta}_\lambda = \arg\max_{\theta \in \Theta} \ell_\lambda^{(M)}(\theta).

Assume that the MLE θ̂_0 for the full model (1) exists. Then our estimate θ̂_λ occupies a position balancing between the two MLEs θ̂_0 and θ̂_1.

The key idea in our method is to use the KL divergence D^(J)(θ*, θ) defined by (5), which is defined over the complete-data distribution of (r, x), rather than the KL divergence over the actual distribution (1) of x, which is defined by

D^{(M)}(\theta^*, \theta) = \int f(x, \theta^*) \log\frac{f(x, \theta^*)}{f(x, \theta)}\, dx.

Then D^(J) carries more information than D^(M), as seen in the decomposition D^(J)(θ*, θ) = D^(M)(θ*, θ) + D^(C)(θ*, θ), where D^(C)(θ*, θ) is the mean divergence of the conditional probabilities of the components,

D^{(C)}(\theta^*, \theta) = \int \Big\{\sum_{r=0}^{R} p(r \mid x, \theta^*) \log\frac{p(r \mid x, \theta^*)}{p(r \mid x, \theta)}\Big\} f(x, \theta^*)\, dx,

with p(r | x, θ*) the conditional distribution of r given x. We observe that the contours of D^(J) are well approximated by concentric ellipsoids, whereas D^(M) behaves badly around unidentifiable points. For example, consider the two normal mixture distributions

g(x) = .5\, h(x, 0, 1) + .5\, h(x, 1, 2)
and
f_{\pi,\tau}(x) = \pi\, h(x, 0, 1) + (1-\pi)\, h(x, \tau, \tau^2 + 1),

where h(x, μ, σ²) is the normal density function with mean μ and variance σ². Note that the model f_{π,τ}(x) loses identifiability at τ = 0. The contour C^(M) = {(τ, π) : D^(M)(g, f_{π,τ}) = const} is far from any ellipsoid, whereas the contour C^(J) of D^(J) is well approximated by a suitable ellipsoid. See Figure 2.

[Figure 2 about here]
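To make the comparison concrete, the following sketch (ours; the integration range and the particular values of π and τ are arbitrary choices for illustration) evaluates D^(J) through the closed form (5) and D^(M) by numerical integration for the pair g and f_{π,τ} above.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def kl_normal(m1, v1, m2, v2):
    """KL divergence D(N(m1, v1) || N(m2, v2))."""
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def d_joint(pi1, mu1, s1, pi2, mu2, s2):
    """Joint divergence (5) between two normal mixtures with matched labels."""
    pi1, pi2 = np.asarray(pi1), np.asarray(pi2)
    mix = np.sum(pi1 * np.log(pi1 / pi2))
    comp = np.sum([p * kl_normal(m1, v1, m2, v2)
                   for p, m1, v1, m2, v2 in zip(pi1, mu1, s1, mu2, s2)])
    return mix + comp

def d_marginal(pi1, mu1, s1, pi2, mu2, s2):
    """Marginal KL divergence D^(M) between the two mixture densities."""
    def f(x, pi, mu, s):
        return np.sum([p * norm.pdf(x, m, np.sqrt(v)) for p, m, v in zip(pi, mu, s)])
    integrand = lambda x: f(x, pi1, mu1, s1) * np.log(f(x, pi1, mu1, s1) / f(x, pi2, mu2, s2))
    return quad(integrand, -20, 30, limit=200)[0]

# g(x) = .5 N(0,1) + .5 N(1,2) versus f_{pi,tau} with, say, pi = .4 and tau = .5
g = ([.5, .5], [0.0, 1.0], [1.0, 2.0])
f = ([.4, .6], [0.0, 0.5], [1.0, 0.5 ** 2 + 1.0])
print(d_joint(*g, *f), d_marginal(*g, *f))
```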

We now focus on the behavior of the penalized likelihood function ℓ_λ^(M)(θ) when the full MLE does not exist, that is, when ℓ^(M)(θ) is unbounded above as discussed in Section 2. The following theorem shows another favorable property of D^(J): it controls the ill behavior of the log-likelihood function ℓ^(M)(θ).

Theorem 2. Assume that the component density h(x, ω) belongs to an exponential family as defined at (7). Even if h(x, ω) satisfies Condition (C) given in Theorem 1, the penalized log-likelihood function ℓ_λ^(M)(θ) is bounded above for any λ with 0 < λ ≤ 1.

The proof, which follows from a convexity property of the exponential family, is given in Appendix 1. Under Condition (C) the sufficient statistic t(x) of (7) lies on the boundary of its convex hull Γ for any x. The MLE does not exist because the MLE is a function of t(x) that is not well defined on the boundary. In contrast, the MPLE does exist because the MPLE is a function of a statistic that shrinks t(x) into the interior of Γ. The EM algorithm remains applicable for obtaining the estimator: the E-step is the same as in the case of the MLE, and the M-step obtains a solution of

(1-\lambda) \sum_{i=1}^{n}\sum_{r=0}^{R} p^{(0)}_{r|i}\, U(\pi_r, \omega_r, x_i) = \lambda\, b(\theta, \hat{\xi}),

say θ_1, where p^{(0)}_{r|i} = p(r | x_i, θ_0) and

b(\theta, \hat{\xi}) = \frac{\partial}{\partial\theta} D^{(J)}(\theta(\hat{\xi}), \theta).

For the normal mixture model we have the reweighted form θ_1 = (π_r^{(1)}, μ_r^{(1)}, σ_r^{2(1)})_{r=0,...,R} defined by

\pi_r^{(1)} = \frac{(1-\lambda)\sum_{i=1}^{n} p^{(0)}_{r|i} + \lambda\,\pi_r(\hat{\xi})}{(1-\lambda)n + \lambda},
\qquad
\mu_r^{(1)} = \frac{(1-\lambda)\sum_{i=1}^{n} p^{(0)}_{r|i}\, x_i + \lambda\,\mu_r(\hat{\xi})}{(1-\lambda)\sum_{i=1}^{n} p^{(0)}_{r|i} + \lambda},

\sigma_r^{2(1)} = \frac{(1-\lambda)\sum_{i=1}^{n} p^{(0)}_{r|i}\,(x_i - \mu_r^{(1)})^2 + \lambda\{\sigma_r^2(\hat{\xi}) + (\mu_r^{(1)} - \mu_r(\hat{\xi}))^2\}}{(1-\lambda)\sum_{i=1}^{n} p^{(0)}_{r|i} + \lambda},

since we get

b(\theta, \hat{\xi}) = -\Big(\frac{\pi_r(\hat{\xi})}{\pi_r},\; \frac{\mu_r(\hat{\xi}) - \mu_r}{\sigma_r^2},\; \frac{\sigma_r^2(\hat{\xi}) - \sigma_r^2 + \{\mu_r - \mu_r(\hat{\xi})\}^2}{2\sigma_r^4}\Big)_{r=0,\ldots,R}.
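A minimal sketch of this reweighted M-step for the normal mixture is given below (an illustration of the three update formulas above; the array layout and variable names are ours). It can be paired with the E-step sketched in Section 3.1 to iterate the penalized EM algorithm.

```python
import numpy as np

def penalized_m_step(x, p, lam, pi_s, mu_s, s2_s):
    """One penalized M-step: p[i, r] = p^{(0)}_{r|i}; (pi_s, mu_s, s2_s) is the
    structural estimate theta(xi_hat) appearing in the penalty."""
    n = len(x)
    w = (1 - lam) * p.sum(axis=0)                         # (1-lam) * sum_i p_{r|i}
    pi_new = (w + lam * pi_s) / ((1 - lam) * n + lam)
    mu_new = ((1 - lam) * (p * x[:, None]).sum(axis=0) + lam * mu_s) / (w + lam)
    resid2 = (p * (x[:, None] - mu_new) ** 2).sum(axis=0)
    s2_new = ((1 - lam) * resid2 + lam * (s2_s + (mu_new - mu_s) ** 2)) / (w + lam)
    return pi_new, mu_new, s2_new
```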

We observe that the level set {θ : ℓ_λ(θ) ≥ c} of the function ℓ_λ(θ) is compact because ℓ_λ(θ) is bounded above, as shown in the following. We get

\ell_\lambda(\theta) \le \max_{0\le r\le R}\ \sup_{\sigma^2>0}\Big[\, n(1-\lambda)\log\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{\lambda\pi_r(\hat{\xi})}{2}\Big\{\frac{\sigma_r^2(\hat{\xi})}{\sigma^2} - 1 + \log\frac{\sigma^2}{\sigma_r^2(\hat{\xi})}\Big\}\Big],

and evaluating the supremum, which is attained at σ² = λπ_r(ξ̂)σ_r²(ξ̂)/{n(1−λ) + λπ_r(ξ̂)}, gives the explicit bound

\max_{0\le r\le R}\Big[\, n(1-\lambda)\log\frac{1}{\sqrt{2\pi\sigma_r^2(\hat{\xi})}} + \frac{n(1-\lambda)+\lambda\pi_r(\hat{\xi})}{2}\log\frac{n(1-\lambda)+\lambda\pi_r(\hat{\xi})}{\lambda\pi_r(\hat{\xi})} - \frac{n(1-\lambda)}{2}\Big],

which is independent of θ. This establishes the boundedness of the penalized likelihood function.

The conjugate prior distribution of (π, μ, σ²) with hyperparameter χ = (χ_1, χ_2) is

p(\pi, \mu, \sigma^2, \chi) = \prod_{r=0}^{R} \exp\Big\{-\frac{(\chi_2 - \chi_1)_r}{2\sigma_r^2} - \frac{(\mu - \chi_1)_r^2}{2\sigma_r^2} - \frac{1}{2}\log\sigma_r^2\Big\},

which is proportional to exp{−D^(J)(θ(ξ̂), θ)} if we take χ = (μ(ξ̂), σ²(ξ̂)). Hence we see that the MPLE is equivalent to the maximum a posteriori estimator with the conjugate prior, except that the hyperparameter is estimated from the structural model, cf. Ormoneit and Tresp (1998).

5. Selection of the tuning parameter

We have proposed the MPLE θ̂_λ with an arbitrarily fixed λ, 0 < λ ≤ 1, as discussed in Section 4. The tuning parameter λ plays a central role in the sense that it controls the balance between the full and structural models in the finite mixture model. We now discuss the optimal selection of λ based on the data. For this purpose we define a loss function by

L(\hat{\theta}, g) = -\int \log f(x, \hat{\theta})\, g(x)\, dx,

where g is the density of the underlying distribution. We do not assume that g(x) necessarily belongs to the finite mixture model under consideration. This loss is closely connected with the full likelihood, because ℓ^(M)(θ)/n converges almost surely to −L(θ, g); it is known, however, that the MLE does not minimize the expected loss. Let us assess the optimality of estimation on the basis of the expected loss, or risk function, defined by

R(\hat{\theta}, g) = E\{L(\hat{\theta}, g)\},

where E denotes expectation over the distribution of the dataset. In this framework our objective is to estimate

\lambda^*(g) = \mathop{\rm argmin}_{0 \le \lambda \le 1} R(\hat{\theta}_\lambda, g).

The idea of leave-one-out cross-validation leads to the following estimate of the risk function:

{\rm CV}(\lambda) = -\frac{1}{n}\sum_{i=1}^{n} \log f(x_i, \hat{\theta}_\lambda^{[-i]}),

where θ̂_λ^{[-i]} is the minimizer of the penalized loss with the i-th observation left out, that is,

\hat{\theta}_\lambda^{[-i]} = \mathop{\rm argmin}_{\theta}\big\{-(1-\lambda)\,\ell^{[-i]}(\theta) + \lambda\, D^{(J)}(\theta(\hat{\xi}^{[-i]}), \theta)\big\},

where

\ell^{[-i]}(\theta) = \frac{1}{n-1}\sum_{j \ne i} \log f(x_j, \theta)
and
\hat{\xi}^{[-i]} = \mathop{\rm argmax}_{\xi \in \Xi}\ \ell^{[-i]}(\theta(\xi)).
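Written out directly, CV(λ) requires refitting both the structural MLE and the MPLE n times. The sketch below makes this cost explicit; fit_structural, fit_mple and log_f are placeholders for the user's own routines computing ξ̂, θ̂_λ and log f(x, θ), not functions defined in this paper.

```python
import numpy as np

def cv(lam, x, fit_structural, fit_mple, log_f):
    """Exact leave-one-out estimate CV(lambda) of the risk.

    fit_structural(x)         -> structural MLE  xi_hat
    fit_mple(x, lam, xi_hat)  -> MPLE            theta_hat_lambda
    log_f(x_i, theta)         -> log f(x_i, theta)
    (all three are placeholders for the user's own fitting code)
    """
    n = len(x)
    total = 0.0
    for i in range(n):
        x_minus_i = np.delete(x, i)
        xi_hat = fit_structural(x_minus_i)
        theta_hat = fit_mple(x_minus_i, lam, xi_hat)
        total += log_f(x[i], theta_hat)
    return -total / n
```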

Note that we need local assessment of both MLEs, under the structural and the full model. Since θ̂_λ^{[-i]} has to be evaluated for each i, exact computation of CV as defined above may become time-consuming when the sample size is large. As a more feasible form we propose an approximation to the cross-validation as follows. Let

B(\xi) = \frac{\partial\theta(\xi)}{\partial\xi^{\top}}
and
J_\lambda(\theta, \xi) = -(1-\lambda)\frac{\partial^2 \ell^{(M)}(\theta)}{\partial\theta\,\partial\theta^{\top}} + \lambda\,\frac{\partial^2 D^{(J)}(\theta(\xi), \theta)}{\partial\theta\,\partial\theta^{\top}}.

An approximate version of CV is given by

{\rm ACV}(\lambda) = -\frac{1}{n}\sum_{i=1}^{n}\log f(x_i, \hat{\theta}_\lambda) + \frac{1-\lambda}{n}\,{\rm tr}(\hat{J}_\lambda^{-1}\hat{K}_0) + \frac{\lambda}{n}\,{\rm tr}(\hat{J}_\lambda^{-1}\hat{K}_1\hat{P}\hat{K}_1),    (10)

where

\hat{J}_\lambda = J_\lambda(\hat{\theta}_\lambda, \hat{\xi}), \qquad \hat{K}_0 = J_0(\hat{\theta}_\lambda, \hat{\xi}), \qquad \hat{K}_1 = J_1(\hat{\theta}_\lambda, \hat{\xi})
and
\hat{P} = B(\hat{\xi})\{B(\hat{\xi})^{\top}\hat{K}_1 B(\hat{\xi})\}^{-1}B(\hat{\xi})^{\top}.

The derivation is given in Appendix 2. We observe that the first term of ACV is minimized at λ = 0, that is, by the maximum likelihood estimator, whereas the correction terms are minimized at λ = 1, that is, by the structural estimator ξ̂. If the minimum of ACV over λ occurs at an interior point of the interval [0, 1], then the minimizer is expected to give the best choice for overcoming the bias-variance dilemma.
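Once the matrices Ĵ_λ, K̂_0, K̂_1 and B(ξ̂) are available, the correction terms in (10) reduce to two traces. A sketch of the computation (ours; the matrices are assumed to be supplied as NumPy arrays) is:

```python
import numpy as np

def acv(loglik_bar, lam, n, J_lam, K0, K1, B):
    """Approximate cross-validation score (10).

    loglik_bar : (1/n) * sum_i log f(x_i, theta_hat_lambda)
    J_lam, K0, K1 : J_hat_lambda, K_hat_0, K_hat_1 evaluated at (theta_hat_lambda, xi_hat)
    B : Jacobian d theta(xi) / d xi^T evaluated at xi_hat
    """
    P = B @ np.linalg.solve(B.T @ K1 @ B, B.T)            # P_hat
    J_inv = np.linalg.inv(J_lam)
    term0 = (1 - lam) / n * np.trace(J_inv @ K0)
    term1 = lam / n * np.trace(J_inv @ K1 @ P @ K1)
    return -loglik_bar + term0 + term1
```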

A further simplified version of ACV is obtained by using the approximation Ĵ_λ ≈ K̂_0 ≈ K̂_1:

{\rm ACV}_0(\lambda) = -\frac{1}{n}\sum_{i=1}^{n}\log f(x_i, \hat{\theta}_\lambda) + \frac{1-\lambda}{n}\dim(\theta) + \frac{\lambda}{n}\dim(\xi).

We will compare ACV and ACV_0 in a normal mixture model below. ACV for the quantal model of synaptic transmission discussed in Section 3.1 can be calculated as follows. In the normal mixture model the score vector is

S^{(M)}(\theta, x) = \Big(\Big(\frac{p(r|x)}{\pi_r} - \frac{p(0|x)}{\pi_0}\Big)_{1\le r\le R},\ \Big(p(r|x)\,\frac{x-\mu_r}{\sigma_r^2}\Big)_{0\le r\le R},\ \Big(p(r|x)\,\frac{(x-\mu_r)^2-\sigma_r^2}{2\sigma_r^4}\Big)_{0\le r\le R}\Big),

where θ = {(π_r)_{1≤r≤R}, (μ_r)_{0≤r≤R}, (σ_r²)_{0≤r≤R}} and p(r|x) is the conditional probability of r given x. The matrices Ĵ_λ, K̂_0 and K̂_1 are then given by the following Hessian matrices. First, the Hessian matrix of the log-likelihood function satisfies

-\frac{1}{n}\frac{\partial^2 \ell^{(M)}(\theta)}{\partial\theta\,\partial\theta^{\top}} = \frac{1}{n}\sum_{i=1}^{n} S^{(M)}(\theta, x_i)\,S^{(M)}(\theta, x_i)^{\top} + \frac{1}{n}\sum_{i=1}^{n}\Delta(\theta, x_i),

where

\Delta(\theta, x) = \begin{pmatrix} O & O & O \\ O & \Delta_{22} & \Delta_{23} \\ O & \Delta_{23} & \Delta_{33} \end{pmatrix},

with

\Delta_{22} = {\rm diag}\Big(p(r|x)\,\frac{(x-\mu_r)^2-\sigma_r^2}{\sigma_r^4}\Big)_{0\le r\le R},

\Delta_{23} = {\rm diag}\Big(p(r|x)\,\frac{(x-\mu_r)^3+(x-\mu_r)\sigma_r^2}{\sigma_r^6}\Big)_{0\le r\le R},

\Delta_{33} = {\rm diag}\Big(p(r|x)\Big[\frac{\{(x-\mu_r)^2-\sigma_r^2\}^2-2\sigma_r^4}{4\sigma_r^8} + \frac{(x-\mu_r)^2-\sigma_r^2}{2\sigma_r^6}\Big]\Big)_{0\le r\le R}.

Second, the Hessian matrix of the KL divergence is

\frac{\partial^2 D^{(J)}(\theta(\xi), \theta)}{\partial\theta\,\partial\theta^{\top}} = \begin{pmatrix} D_1 & 0 & 0 \\ 0 & D_2 & D_3 \\ 0 & D_3 & D_4 \end{pmatrix},

where

D_1 = \Big(\frac{\pi_r^*\,\delta_{rs}}{\pi_r\pi_s} + \frac{\pi_0^*}{\pi_0^2}\Big)_{r,s=1,\ldots,R},
\qquad
D_2 = {\rm diag}\Big(\frac{\pi_r^*}{\sigma_r^2}\Big)_{r=0,\ldots,R},

D_3 = {\rm diag}\Big(\pi_r^*\,\frac{\mu_r^* - \mu_r}{\sigma_r^4}\Big)_{r=0,\ldots,R},
\qquad
D_4 = {\rm diag}\Big(\pi_r^*\,\frac{2\sigma_r^{*2} - \sigma_r^2 + 2(\mu_r^* - \mu_r)^2}{2\sigma_r^6}\Big)_{r=0,\ldots,R},

with θ(ξ) = {(π*_r)_{r=1,...,R}, (μ*_r)_{r=0,...,R}, (σ*_r²)_{r=0,...,R}}.

We now introduce a simple structural model for checking the performance of ACV and ACV_0. The structural model is defined within the normal mixture with R + 1 components by constraining only the variances, σ_r² = σ², as discussed in Section 2. This model gives fairly flexible density shapes unless the component variances are extremely heterogeneous, and implementation of the EM algorithm is straightforward. Assuming several settings for the underlying distribution g, we investigated the performance of ACV and ACV_0. In this setting we know the generalization error

{\rm gerror}(\lambda) = -\int \log f(x, \hat{\theta}_\lambda)\, g(x)\, dx.

The Jacobian matrix is the constant matrix

B = \begin{pmatrix} I_R & 0 & 0 \\ 0 & I_{R+1} & 0 \\ 0 & 0 & (1, 1, \ldots, 1)^{\top} \end{pmatrix}

of size (3R + 2) × (2R + 2), where I_R is the identity matrix of order R and the last column is a vector of ones of length R + 1. Thus we can easily implement the formula for ACV. Figure 3 illustrates two representative results of our simulation study. Random data were generated according to the distributions

g_1(x) = .25\, h(x, 0, .25) + .25\, h(x, 5, 1) + .25\, h(x, 10, 5) + .25\, h(x, 15, 1)
and
g_2(x) = .1\, h(x, 0, 1) + .1\, h(x, 5, 1) + .7\, h(x, 10, 5) + .1\, h(x, 15, 1).

Both are mixtures of four equally separated components. The departure of g_1 from the structural model is quite small, i.e., the variance of only one component differs from the others, whereas that of g_2 is much larger. Accordingly, the structural MLE fails to give a proper density estimate in the case of g_2. In both cases, however, the MPLEs give good density estimates, quite close to the nonparametric density estimates. The lower panels in Figure 3 show plots of gerror, ACV and ACV_0 against λ. As expected, the optimal λ selected by ACV is close to 1 (λ_opt = .96) for g_1, and remote from 1 (λ_opt = .55) for g_2. In these examples and in other simulation studies the plots of ACV_0 against λ are close to those of ACV and give similar optimal values of λ. This suggests that the use of ACV_0 in place of ACV can be justified in practice, particularly when the calculation of ACV is difficult.

[Figure 3 about here]

We note that the common-variance structural model has a bounded likelihood function but imposes no constraint on the mean parameters, so the model is in fact not identifiable; it may suffer from composite local maxima when the mean parameters of some components are very close. The quantal structural model introduced in Section 3.1 and given in (9) does constrain the mean parameters, and therefore does not suffer from this unidentifiability. Figure 4 shows the result of applying the MPLE to the synaptic response data of Figure 1, where the optimal λ (λ_opt = .87) is selected by ACV. As can be seen from the figure, the estimate is considerably improved by the penalized likelihood method, particularly at the zero-th component.

[Figure 4 about here]

We also need to determine the proper number of components when estimating finite mixture models. For our method, selecting the R that minimizes ACV(λ) provides a self-consistent criterion:

{\rm MACV}(R) = \min_{0 \le \lambda \le 1} {\rm ACV}(\lambda).

Note that ACV(λ) depends on R. Figure 5 plots the minimum ACV against R for the synaptic response data, showing that R = 9 is selected by this criterion.

[Figure 5 about here]
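A sketch of this selection rule is given below; acv_curve is a placeholder for the user's own routine returning ACV(λ) for an (R + 1)-component model, and the λ grid is an arbitrary choice for illustration.

```python
import numpy as np

def select_R(x, candidate_R, acv_curve, lam_grid=np.linspace(0.0, 1.0, 21)):
    """Choose the number of components by MACV(R) = min_lambda ACV(lambda).

    acv_curve(x, R, lam) -> ACV(lam) for the (R+1)-component model
    (a placeholder for the user's own implementation of (10))."""
    macv = {R: min(acv_curve(x, R, lam) for lam in lam_grid) for R in candidate_R}
    best_R = min(macv, key=macv.get)
    return best_R, macv
```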

6. Discussion

We have discussed the incompatible aspects of maximum likelihood methods for the full model and for a structural model in mixtures of distributions. As a remedy for this dilemma we have proposed the maximum penalized likelihood estimator. Furthermore, an adaptive method for selecting the penalty tuning parameter has been proposed and applied to synthetic and real data. The choice of the structural model depends on the individual situation. If we have no information about the structure of the mixture, the simple common-variance structure may be the choice; in other cases we may conceive a structure from background knowledge of the data, as in the quantal model.

As a post-analysis of the MPLE procedure it is important to assess the degree of lack-of-fit, which quantifies the departure from the structural model. Let θ(ξ̂) and θ̂_λ̂ be the structural MLE and the MPLE tuned by ACV, respectively. In the penalized likelihood method the global measure of lack-of-fit is naturally D = D^(J)(θ(ξ̂), θ̂_λ̂); the scaled statistic nD asymptotically follows the chi-square distribution with f = 3R + 2 − k degrees of freedom, where k is the dimension of the structural model. The global measure decomposes as in (5), so that for the r-th mixing proportion π_r the departure is measured by

D_\pi(r) = \hat{\pi}_r^* \log\frac{\hat{\pi}_r^*}{\hat{\pi}_r} + \hat{\pi}_r - \hat{\pi}_r^*,

and for the parameter ω_r of the r-th component distribution the departure is D_ω(r) = π̂*_r D(ω̂*_r, ω̂_r), where θ(ξ̂) = (π̂*_r, ω̂*_r)_{0≤r≤R} and θ̂_λ̂ = (π̂_r, ω̂_r)_{0≤r≤R}. The sum D(r) = D_π(r) + D_ω(r) represents the lack-of-fit of the r-th component, so the global measure can be decomposed into contributions of the individual mixture components. Table 1 gives the post-analysis for the synaptic response data of Figure 4. The global measure nD is 20.375, so the upper tail probability is .728 according to the chi-square approximation with 25 degrees of freedom; the hypothesis of the quantal model is therefore not rejected. In the table of deviance statistics for the quantal components, the zero-th component is the most remote from the quantal model, with a large deviance in the mixing proportion parameter. The 3rd component also shows a large deviation, mostly due to an overestimated variance, as seen in Figure 4.

[Table 1 about here]
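The decomposition can be computed directly once the structural MLE and the MPLE are available. The sketch below (ours; it reuses the kl_normal function from the sketch in Section 4, and chi2.sf gives the upper tail probability) returns the componentwise deviances and the global measure nD.

```python
import numpy as np
from scipy.stats import chi2

def lack_of_fit(n, pi_s, mu_s, s2_s, pi_h, mu_h, s2_h, k):
    """Componentwise deviances D_pi(r), D_omega(r) and the global measure n*D."""
    d_pi = pi_s * np.log(pi_s / pi_h) + pi_h - pi_s
    d_om = pi_s * np.array([kl_normal(m1, v1, m2, v2)
                            for m1, v1, m2, v2 in zip(mu_s, s2_s, mu_h, s2_h)])
    nD = n * (d_pi + d_om).sum()
    R = len(pi_s) - 1
    p_value = chi2.sf(nD, df=3 * R + 2 - k)               # upper tail probability
    return n * d_pi, n * d_om, nD, p_value
```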

Acknowledgements

We wish to thank Dr Fumihito Saitoh for kindly providing his unpublished data on synaptic responses obtained from the rat cerebellum. This work was supported in part by grants from the Ministry of Education, Culture, Sports, Science and Technology.

References

Barndorff-Nielsen, O. E. and Cox, D. R. (1989). Asymptotic Techniques for Use in Statistics. London: Chapman and Hall.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1-22.

Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association 69, 383-393.

Konishi, S. and Kitagawa, G. (1996). Generalised information criteria in model selection. Biometrika 83, 875-890.

McLachlan, G. J. and Krishnan, T. (1998). The EM Algorithm and Extensions. New York: Wiley.

McLachlan, G. J. and Peel, D. (2000). Finite Mixture Models. New York: Wiley.

Orchard, T. and Woodbury, M. A. (1972). A missing information principle: theory and applications. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, 697-715. Berkeley: University of California Press.

Ormoneit, D. and Tresp, V. (1998). Averaging, maximum penalized likelihood and Bayesian estimation for improving normal mixture probability density estimates. IEEE Transactions on Neural Networks 9, 639-650.

Richardson, S. and Green, P. J. (1997). On Bayesian analysis of mixtures with an unknown number of components. Journal of the Royal Statistical Society, Series B 59, 731-792.

Stone, M. (1977). An asymptotic equivalence of choice of model by cross-validation and Akaike's criterion. Journal of the Royal Statistical Society, Series B 39, 44-47.

Stricker, C. and Redman, S. (1994). Statistical models of synaptic transmission evaluated using the expectation-maximization algorithm. Biophysical Journal 67, 656-670.

Titterington, D. M., Smith, A. F. M. and Makov, U. E. (1987). Statistical Analysis of Finite Mixture Distributions. London: Wiley.

Wu, C. F. J. (1983). On the convergence properties of the EM algorithm. Annals of Statistics 11, 95-103.

Appendix 1

Proof of Theorem 2. From Condition (C) and (7) it follows that

\lim_{\omega \to \omega^*(x)} \{\omega^{\top} t(x) - \kappa(\omega)\} = \infty

for the boundary point ω*(x) of Ω. By the definition of the penalized likelihood we get

\ell_\lambda^{(M)}(\theta) = (1-\lambda)\sum_{i=1}^{n}\log f(x_i, \theta) - \lambda\Big\{\sum_{r=0}^{R}\pi_r(\hat{\xi})\log\frac{\pi_r(\hat{\xi})}{\pi_r} + \sum_{r=0}^{R}\pi_r(\hat{\xi})\, D(\omega_r(\hat{\xi}), \omega_r)\Big\}.

Since log f(x, θ) ≤ max_{0≤r≤R} log h(x, ω_r), and since Σ_r π_r(ξ̂) log{π_r(ξ̂)/π_r} and D(ω_r(ξ̂), ω_r) are non-negative, we obtain

\ell^{(M)}_\lambda(\theta) \le \sum_{i=1}^{n}\max_{0\le r\le R}\Big\{(1-\lambda)\log h(x_i, \omega_r) - \frac{\lambda}{n}\pi_r(\hat{\xi})\, D(\omega_r(\hat{\xi}), \omega_r)\Big\}.

For arbitrarily fixed r and i, r = 0, ..., R, i = 1, ..., n, we get

(1-\lambda)\log h(x_i, \omega) - \frac{\lambda}{n}\pi_r(\hat{\xi})\, D(\omega_r(\hat{\xi}), \omega)
= \Big(1-\lambda+\frac{\lambda}{n}\pi_r(\hat{\xi})\Big)\{\eta_r^{(\lambda)}(x_i, \hat{\xi})^{\top}\omega - \kappa(\omega)\} - \frac{\lambda}{n}\pi_r(\hat{\xi})\{\eta_r(\hat{\xi})^{\top}\omega_r(\hat{\xi}) - \kappa(\omega_r(\hat{\xi}))\},    (11)

where η_r(ξ̂) = (∂/∂ω)κ(ω) evaluated at ω = ω_r(ξ̂) and

\eta_r^{(\lambda)}(x_i, \hat{\xi}) = \frac{(1-\lambda)\,t(x_i) + \frac{\lambda}{n}\pi_r(\hat{\xi})\,\eta_r(\hat{\xi})}{1-\lambda+\frac{\lambda}{n}\pi_r(\hat{\xi})}.

Since (λ/n)π_r(ξ̂) > 0 and η_r(ξ̂) lies in the interior of the mean-parameter space, the statistic η_r^(λ)(x_i, ξ̂) is an interior point of the space {(∂/∂ω)κ(ω) : ω ∈ Ω}. Therefore the maximum of

\eta_r^{(\lambda)}(x_i, \hat{\xi})^{\top}\omega - \kappa(\omega)

is attained at the MLE of ω given the sufficient statistic η_r^(λ)(x_i, ξ̂), since that MLE is the solution ω_r^(λ)(x_i, ξ̂) of

\frac{\partial}{\partial\omega}\kappa(\omega) = \frac{(1-\lambda)\,t(x_i) + \frac{\lambda}{n}\pi_r(\hat{\xi})\,\eta_r(\hat{\xi})}{1-\lambda+\frac{\lambda}{n}\pi_r(\hat{\xi})}

with respect to ω. If we let λ tend to 0, then ω_r^(λ)(x_i, ξ̂) approaches the boundary of Ω, but for any fixed λ > 0, ω_r^(λ)(x_i, ξ̂) remains an interior point, so the expression in (11) is bounded over Ω. This completes the proof.

Appendix 2

We now give a derivation of the ACV formula (10) by the leave-one-out method in terms of the influence function. Let

F_n = \frac{1}{n}\sum_{j=1}^{n}\delta_{x_j}, \qquad F_{n-1}^{[-i]} = \frac{1}{n-1}\sum_{j\ne i}\delta_{x_j}.

If we write the minimizer θ̂_λ as a functional of F_n, say θ_λ(F_n), then θ̂_λ^{[-i]} is expressed as θ_λ(F_{n-1}^{[-i]}). We will show that

\hat{\theta}_\lambda^{[-i]} - \hat{\theta}_\lambda \approx -\frac{1}{n-1}\,{\rm IF}_\lambda(x_i),    (12)

where IF_λ(x) is the influence function of the functional θ_λ(·) in the direction of x, that is,

{\rm IF}_\lambda(x) = \frac{\partial}{\partial\varepsilon}\,\theta_\lambda((1-\varepsilon)F_n + \varepsilon\delta_x)\Big|_{\varepsilon=0};

cf. Hampel (1974) for the interpretation. The key relation (12) follows from the Taylor approximation

\hat{\theta}_\lambda^{[-i]} - \hat{\theta}_\lambda = \theta_\lambda\Big(\Big(1+\tfrac{1}{n-1}\Big)F_n - \tfrac{1}{n-1}\delta_{x_i}\Big) - \theta_\lambda(F_n)
\approx \frac{1}{n-1}\,\frac{\partial}{\partial\varepsilon}\,\theta_\lambda((1+\varepsilon)F_n - \varepsilon\delta_{x_i})\Big|_{\varepsilon=0} = -\frac{1}{n-1}\,{\rm IF}_\lambda(x_i),

since F_n = (1 - 1/n)F_{n-1}^{[-i]} + (1/n)\delta_{x_i}. Similarly,

\hat{\xi}^{[-i]} - \hat{\xi} \approx -\frac{1}{n-1}\,\tilde{J}(\hat{\xi})^{-1} S(\hat{\xi}, x_i).

Thus we get an approximation of the expected loss,

-\frac{1}{n}\sum_{i=1}^{n}\log f\Big(x_i,\ \hat{\theta}_\lambda - \frac{1}{n-1}{\rm IF}_\lambda(x_i)\Big),

which is further approximated by

-\frac{1}{n}\sum_{i=1}^{n}\log f(x_i, \hat{\theta}_\lambda) + \frac{1}{n(n-1)}\sum_{i=1}^{n} S(\hat{\theta}_\lambda, x_i)^{\top}\,{\rm IF}_\lambda(x_i).    (13)

We now obtain the influence function of θ̂_λ. Writing θ̂_ε = θ_λ((1-ε)F_n + εδ_x) and ξ̂_ε = ξ((1-ε)F_n + εδ_x), we have the estimating equation

(1-\lambda)\int S(\hat{\theta}_\varepsilon, y)\, d\{(1-\varepsilon)F_n + \varepsilon\delta_x\}(y) = \lambda\,\frac{\partial D^{(J)}(\theta(\hat{\xi}_\varepsilon), \hat{\theta}_\varepsilon)}{\partial\theta}.

Differentiating both sides with respect to ε gives

(1-\lambda)\int \frac{\partial S(\hat{\theta}_\varepsilon, y)}{\partial\theta^{\top}}\, d\{(1-\varepsilon)F_n + \varepsilon\delta_x\}(y)\,\frac{\partial\hat{\theta}_\varepsilon}{\partial\varepsilon} + (1-\lambda)\int S(\hat{\theta}_\varepsilon, y)\, d\{\delta_x - F_n\}(y)
= \lambda\,\frac{\partial^2 D^{(J)}(\theta(\hat{\xi}_\varepsilon), \hat{\theta}_\varepsilon)}{\partial\theta\,\partial\theta^{\top}}\,\frac{\partial\hat{\theta}_\varepsilon}{\partial\varepsilon} + \lambda\,\frac{\partial^2 D^{(J)}(\theta(\hat{\xi}_\varepsilon), \hat{\theta}_\varepsilon)}{\partial\theta\,\partial\xi^{\top}}\,\frac{\partial\hat{\xi}_\varepsilon}{\partial\varepsilon}.

Hence we get

\frac{\partial\hat{\theta}_\varepsilon}{\partial\varepsilon}\Big|_{\varepsilon=0} = \hat{J}_\lambda^{-1}\Big\{(1-\lambda)S(\hat{\theta}_\lambda, x) - \lambda\,\frac{\partial^2 D^{(J)}(\theta(\hat{\xi}), \hat{\theta}_\lambda)}{\partial\theta\,\partial\xi^{\top}}\,\frac{\partial\hat{\xi}_\varepsilon}{\partial\varepsilon}\Big|_{\varepsilon=0}\Big\},

which implies

{\rm IF}_\lambda(x) = \hat{J}_\lambda^{-1}\Big[(1-\lambda)S(\hat{\theta}_\lambda, x) - \lambda\,\hat{K}_1 B(\hat{\xi})\{B(\hat{\xi})^{\top}\hat{K}_1 B(\hat{\xi})\}^{-1}B(\hat{\xi})^{\top} S(\theta(\hat{\xi}), x)\Big],

since the structural MLE satisfies

\frac{\partial\hat{\xi}_\varepsilon}{\partial\varepsilon}\Big|_{\varepsilon=0} = \{B(\hat{\xi})^{\top}\hat{K}_1 B(\hat{\xi})\}^{-1}B(\hat{\xi})^{\top} S(\theta(\hat{\xi}), x).

Therefore

\frac{1}{n}\sum_{i=1}^{n} S(\hat{\theta}_\lambda, x_i)^{\top}\,{\rm IF}_\lambda(x_i) = (1-\lambda)\,{\rm tr}(\hat{J}_\lambda^{-1}\hat{K}_0) + \lambda\,{\rm tr}(\hat{J}_\lambda^{-1}\hat{K}_1\hat{P}\hat{K}_1),

which leads to the ACV formula (10) from (13).



Captions for Figures

Fig. 1. The uniform kernel density function and the density function estimated by the structural MLE for the synaptic response data. The MLE of the structural parameter is (ξ_1, ξ_2, ξ_3, ξ_4) = (3.3, −53.1, 103.0, 54.5). The dotted line and the shaded area denote the ML density function and the uniform kernel density function with bandwidth 10 pA.

Fig. 2. The contour plots of the KL divergences D^(J) and D^(M).

Fig. 3. The density functions estimated by the structural MLE and the MPLE for synthetic data. (A) The case of the true density g_1. (B) The case of the true density g_2. The upper panels show the density functions given by the structural MLE (dotted lines), the MPLE (solid lines) and the uniform kernel (shaded areas). The lower panels are the plots of gerror, ACV and ACV_0 against λ.

Fig. 4. The density functions estimated by the structural MLE and the MPLE for the synaptic response data. The solid line is the MPL density function; the dotted line is the ML density function. The inset shows the plot of ACV against λ, whose minimum is attained at λ = .87.

Fig. 5. The plot of the minimum ACV against R for the synaptic response data.

Table 1
The post-analysis of the synaptic response data.

r       0      1      2      3      4      5      6      7      8      9      Sum
nD      6.011  1.422  .930   3.092  .952   1.422  1.741  .625   3.108  1.072  20.375
nD_pi   5.642  .718   .714   .428   .001   .242   .771   .579   3.108  1.060  13.263
nD_omega .369  .704   .216   2.664  .952   1.180  .970   .045   .000   .012   7.112

[Figure 1: Density (y-axis) versus Amplitude in pA (x-axis). The uniform kernel density function and the density function estimated by the structural MLE for the synaptic response data.]

[Figure 2: π (y-axis) versus τ (x-axis). The contour plots of the KL divergences D^(J) and D^(M).]

[Figure 3: Panels A and B, Density versus x; lower panels, gerror, ACV and ACV_0 versus λ. Density functions estimated by the structural MLE and MPLE for synthetic data.]

[Figure 4: Density versus Amplitude in pA; inset, ACV versus λ. Density functions estimated by the structural MLE and MPLE for the synaptic response data.]

[Figure 5: ACV versus R. The plot of the minimum ACV against R for the synaptic response data.]