Supplementary materials


An Expectation Propagation Approach for Generalized Linear Models with Hierarchical Priors

This section describes a novel approach for hierarchical linear models that enables flexible definition of various non-conjugate observation models and coefficient priors using hyperparameters with a multivariate Gaussian prior. Approximate posterior inference is carried out with expectation propagation for both the linear coefficients and the hyperparameters.

1.1 Model Description

We consider generalized linear models with multiple output variables, where the probability density of the observed d_y × 1 output vectors y_i depends on the d_x × 1 input vectors x_i through a linear transformation W^T x_i. We assume that, given n input-output pairs denoted by D = {x_i, y_i}_{i=1}^n, the observation model can be written as

    p(Y | X, W, θ) = \prod_{i=1}^{n} p(y_i | W^T x_i, V_i^T θ),        (1)

where Y = [y_1, ..., y_n]^T is an n × d_y output matrix, X = [x_1, ..., x_n]^T an n × d_x input matrix, and W = [w_1, ..., w_{d_y}] a d_x × d_y coefficient matrix. Each of the n likelihood terms depends on the d_θ × 1 hyperparameter vector θ via a known transformation matrix V_i. For the inference algorithm to remain computationally feasible, we assume that in typical applications each likelihood term depends only on a low-dimensional transformed variable z_i = V_i^T θ. We form a hierarchical prior family for the linear coefficients by combining them with the hyperparameters using known linear transformations. We assume that the prior for the coefficients W can be factored into m terms according to

    p(W | θ) = \prod_{j=n+1}^{n+m} t_j(U_j^T w, V_j^T θ),        (2)

where w = vec(W) = [w_1^T, ..., w_{d_y}^T]^T, and the transformation matrices U_j and V_j are known. We choose a multivariate Gaussian prior distribution for the hyperparameters θ:

    p(θ) = N(µ_{θ,0}, Σ_{θ,0}),        (3)

where µ_{θ,0} is the prior mean vector and Σ_{θ,0} the prior covariance matrix. Combining the likelihood (1) with the priors (2) and (3) gives the posterior distribution

    p(w, θ | D) = Z^{-1} \prod_{i=1}^{n} p(y_i | W^T x_i, V_i^T θ) \prod_{j=n+1}^{n+m} t_j(U_j^T w, V_j^T θ) p(θ),        (4)

where Z = p(Y|X) = \int p(Y | W, θ, X) p(W | θ) p(θ) dW dθ is the marginal likelihood, or model evidence. The general model definition of equations (1)–(3) enables implementation of various linear models via the choice of the transformations {U_i}_{i=n+1}^{n+m} and {V_i}_{i=1}^{n+m}; in the following we summarize some common model specifications.
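As a concrete illustration of these transformations, the following sketch builds U_i = I_{d_y} ⊗ x_i, which maps vec(W) to the linear predictor W^T x_i (the construction used later in Section 1.2), together with a V_i that ties all outputs to a single noise hyperparameter. This is our own minimal sketch assuming NumPy; the helper names are illustrative and not part of the model specification.

```python
import numpy as np

def build_U(x_i, d_y):
    """U_i = I_{d_y} kron x_i, so that U_i^T vec(W) = W^T x_i (cf. Section 1.2)."""
    return np.kron(np.eye(d_y), x_i[:, None])        # shape (d_x * d_y, d_y)

def build_V_shared_noise(d_y):
    """V_i = 1_{d_y}^T: a single noise hyperparameter shared by all outputs."""
    return np.ones((1, d_y))

# check that U_i^T vec(W) reproduces the linear predictor W^T x_i
rng = np.random.default_rng(0)
d_x, d_y = 4, 3
x_i = rng.normal(size=d_x)
W = rng.normal(size=(d_x, d_y))
w = W.T.reshape(-1)                                  # vec(W) = [w_1^T, ..., w_{d_y}^T]^T
assert np.allclose(build_U(x_i, d_y).T @ w, W.T @ x_i)
```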

1.1.1 Regression Models

A Gaussian observation model with an unknown noise variance for each output k = 1, ..., d_y can be formed as

    p(y_i | W^T x_i, V_i^T θ) = N(y_i | W^T x_i, diag(\exp(V_i^T θ))) = \prod_{k=1}^{d_y} N(y_{i,k} | w_k^T x_i, \exp(v_{i,k}^T θ)),        (5)

where the exponentiation is taken component-wise and V_i = [v_{i,1}, ..., v_{i,d_y}] is a d_θ × d_y matrix that encodes the noise model. For example, a common noise level for all outputs is obtained by setting V_i = 1_{d_y}^T, whereas independent noise levels for the outputs are obtained by choosing V_i = I_{d_y}, which results in d_y independent regression problems unless the coefficients are coupled through the prior p(W|θ). An outlier-robust observation model could be formed, e.g., by choosing V_i = e_i 1_{d_y}^T in equation (5), where e_i is the i:th n-dimensional unit vector. This model assumes that a potential outlier corrupts all components of y_i, and it corresponds to a normal scale-mixture model with a log-normal mixing distribution. A heavy-tailed model could also be obtained by replacing the Gaussian distribution in (5) with a Laplace or Student-t distribution and using the same noise level hyperparameter for all observations.
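For concreteness, a minimal sketch of evaluating one likelihood term of the form (5) is given below. It assumes NumPy/SciPy; the function name and the shared-noise choice V_i = 1_{d_y}^T with a single log-variance hyperparameter are illustrative.

```python
import numpy as np
from scipy.stats import norm

def gaussian_loglik_term(y_i, x_i, W, theta, V_i):
    """Log of p(y_i | W^T x_i, V_i^T theta) in eq. (5): independent Gaussians per
    output, with log-variances given by the linear transformation V_i^T theta."""
    mean = W.T @ x_i                       # d_y predicted outputs
    log_var = V_i.T @ theta                # d_y log-variances, exp() taken elementwise
    return np.sum(norm.logpdf(y_i, loc=mean, scale=np.exp(0.5 * log_var)))

# shared noise level across outputs: V_i = 1_{d_y}^T with a single hyperparameter
d_x, d_y = 4, 3
rng = np.random.default_rng(1)
W = rng.normal(size=(d_x, d_y))
x_i = rng.normal(size=d_x)
y_i = W.T @ x_i + 0.1 * rng.normal(size=d_y)
theta = np.array([2 * np.log(0.1)])        # log-variance hyperparameter
V_i = np.ones((1, d_y))
print(gaussian_loglik_term(y_i, x_i, W, theta, V_i))
```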

1.1.2 Classification Models

A binary classifier can be formed by choosing

    p(y_i | W^T x_i, V_i^T θ) = \prod_{k=1}^{d_y} Φ(y_{i,k} w_k^T x_i),        (6)

where y_{i,k} ∈ {−1, 1} is the i:th class label for output k and Φ(x) is the logistic or the Gaussian cumulative distribution function. In this case there are no hyperparameters θ associated with the likelihood terms. This observation model results in d_y independent classification problems unless the coefficients are coupled through the prior p(W|θ). A multi-class classifier can be formed by using the multinomial probit observation model, which can be written as

    p(y_i | W^T x_i, V_i^T θ) = \int N(u_i | 0, 1) \prod_{k=1, k≠y_i}^{d_y} Φ(u_i + w_{y_i}^T x_i − w_k^T x_i) du_i,        (7)

where y_i ∈ {1, ..., d_y} is the class label and the integration over the nuisance variable u_i ensures that the probabilities of all class assignments of y_i sum to one. The EP inference approach for Gaussian process classifiers described by Riihimäki et al. [9] can be readily applied to likelihood terms of this type.
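The one-dimensional integral in (7) generally has no closed form. The following sketch approximates it with Gauss-Hermite quadrature, which is one possible choice on our part rather than a method prescribed here; it assumes NumPy/SciPy and uses 0-based class labels.

```python
import numpy as np
from scipy.stats import norm

def multinomial_probit_lik(f, y, n_quad=32):
    """Multinomial probit likelihood (eq. 7) for one observation, where
    f = W^T x_i is the d_y vector of latent values and y in {0, ..., d_y-1}
    is the observed class (0-based here, 1-based in the text)."""
    u, w = np.polynomial.hermite_e.hermegauss(n_quad)   # nodes/weights for weight exp(-u^2/2)
    diffs = f[y] - np.delete(f, y)                       # f_{y_i} - f_k for k != y_i
    # product over k != y_i of Phi(u + f_{y_i} - f_k), at every quadrature node
    prods = np.prod(norm.cdf(u[:, None] + diffs[None, :]), axis=1)
    return np.sum(w * prods) / np.sqrt(2 * np.pi)        # integral against N(u | 0, 1)

f = np.array([1.2, -0.3, 0.5])
print([multinomial_probit_lik(f, y) for y in range(3)])  # probabilities sum to ~1
```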

1.1.3 Scale-Mixture Coefficient Priors

Various types of scale-mixture coefficient priors can be formed by using Gaussian distributions in (2):

    p(W | θ) = Z(θ)^{-1} \prod_{j=n+1}^{n+m} \exp(−\frac{1}{2} \exp(−V_j^T θ)(w^T U_j)^2) = \prod_{j=n+1}^{n+m} N(U_j^T w | 0, \exp(V_j^T θ)),        (8)

where the transformations U_j^T w and V_j^T θ are assumed to yield scalar random variables, and the normalization term is given by

    \log Z(θ) = \frac{d_w}{2} \log(2π) − \frac{1}{2} \log|Q|,    where    Q = \sum_{j=n+1}^{n+m} U_j \exp(−V_j^T θ) U_j^T

and d_w = d_x d_y is the dimension of w = vec(W).

To keep the prior factorizable, we assume that the vectors U_{n+1}, ..., U_{n+m} are orthonormal, because then λ_j = \exp(−V_j^T θ) are the eigenvalues of Q and the normalization term reduces to \log Z(θ) = \frac{1}{2}(d_w \log(2π) + \sum_j V_j^T θ). In the standard setting where U_j = e_j, the components of w are conditionally independent given θ. With general transformation vectors we could either use an un-normalized prior with \log Z(θ) ≈ \frac{1}{2}(m \log(2π) + \sum_j V_j^T θ), or constrain the number of distinct hyperparameter transformations V_j to a small number so that |Q| can be evaluated efficiently.
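The reduction of the normalization term under orthonormal transformation vectors is easy to check numerically. The sketch below is a small check of our own, assuming NumPy and the standard setting U_j = e_j with m = d_w prior sites, comparing the determinant-based expression with the shortcut.

```python
import numpy as np

rng = np.random.default_rng(2)
d_w = 6                                    # dimension of w = vec(W)
U = np.eye(d_w)                            # column j is U_j (orthonormal, standard setting)
V_theta = rng.normal(size=d_w)             # the scalars V_j^T theta, one per prior site

# generic normalizer: log Z = d_w/2 log(2*pi) - 1/2 log|Q|
Q = sum(np.outer(U[:, j], U[:, j]) * np.exp(-V_theta[j]) for j in range(d_w))
logZ_det = 0.5 * d_w * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(Q)[1]

# orthonormal shortcut: log Z = 1/2 (d_w log(2*pi) + sum_j V_j^T theta)
logZ_shortcut = 0.5 * (d_w * np.log(2 * np.pi) + V_theta.sum())
assert np.allclose(logZ_det, logZ_shortcut)
```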


Using (8) we can implement various types of commonly used coefficient priors:

• A uniform Gaussian prior with an unknown scalar prior variance (similar to ridge regression) can be obtained by choosing U_j = e_j (d_w × 1) and V_j = 1 for j = n + 1, ..., n + d_w. This prior assumes that all d_y outputs share the same prior variance hyperparameter. We could also assign a separate variance hyperparameter to each output by choosing V_j = e_k (d_y × 1) for j = (k − 1)d_x + n + 1, ..., (k − 1)d_x + n + d_x and k = 1, ..., d_y. Note that this leads to d_y separate inference problems if the coefficients related to different outputs are not coupled through the observation model.

• An automatic relevance determination (ARD) prior can be formed by assigning an individual scale hyperparameter to the coefficients related to each input variable. One option is to set U_j = e_j (d_w × 1) and V_j = e_l (d_x × 1) with j = (k − 1)d_x + l + n for l = 1, ..., d_x and k = 1, ..., d_y (see the sketch after this list). This construction assumes that all outputs share the same sparsity structure, i.e., a coefficient gets a large prior variance if the corresponding coefficients of the other outputs are also large. Alternatively, individual scale parameters could be assigned to all coefficients, but this would result in a much larger number of hyperparameters and no information sharing between the outputs.

• Group sparsity priors can be constructed by defining possibly overlapping groups as U_j = e_j (d_w × 1) and V_j = [1, 0, 1, 0, 0, ..., 0]^T (n_g × 1), where V_j is a binary vector indicating the group memberships of each coefficient and n_g is the number of groups. The groups could be defined either so that they combine coefficients from different outputs into the same groups, or completely separately for each output.

The previous prior constructions could also be implemented with other types of univariate distributions in place of the Gaussian in (8). For example, we could use heavier-tailed alternatives such as the Laplace distribution, but then it would be more sensible to use only uniform scale parameters that control the overall magnitude.
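As a concrete example of the bookkeeping involved, the sketch below builds the ARD transformation vectors using the index mapping j = (k − 1)d_x + l + n described above. It assumes NumPy; the function name is ours and purely illustrative.

```python
import numpy as np

def ard_prior_transforms(d_x, d_y, n):
    """ARD prior transforms: prior site j = (k-1)*d_x + l + n picks coefficient
    (l, k) out of vec(W) via U_j, and the shared scale of input l via V_j,
    so all outputs share the scale hyperparameter of input variable l."""
    d_w = d_x * d_y
    U, V = [], []
    for k in range(1, d_y + 1):
        for l in range(1, d_x + 1):
            j = (k - 1) * d_x + l + n
            u = np.zeros(d_w); u[j - n - 1] = 1.0   # unit vector selecting w_{l,k}
            v = np.zeros(d_x); v[l - 1] = 1.0       # unit vector selecting scale theta_l
            U.append(u); V.append(v)
    return U, V

U, V = ard_prior_transforms(d_x=3, d_y=2, n=10)
print(len(U), U[0], V[0])   # 6 prior sites; site n+1 touches w_{1,1} and scale 1
```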

1.1.4 Gaussian Process Priors

Instead of sparsity-promoting priors we can also use Gaussian process (GP) priors in equation (8), and there are at least two distinct approaches to this [for an introduction to GP models, see 8]. One option is to place a GP prior directly on the coefficients w:

    p(W | θ) = t_{n+1}(U_{n+1}^T w, V_{n+1}^T θ) = N(w | 0, K(θ_{GP}, r, t)),        (9)

where K(θ_{GP}, r, t) is the prior covariance matrix that encodes, e.g., smoothness assumptions between the coefficients associated with different spatial and temporal coordinates, denoted by r and t respectively. Inference on the GP hyperparameters θ_{GP}, which control, e.g., the smoothness properties induced by the prior, is challenging due to the complex nonlinear dependency between w and θ_{GP} through the chosen covariance function [8]. Therefore, following the usual practice with GP models, maximum marginal posterior (MAP) estimates could be obtained via gradient-based optimization using an approximation to the marginal likelihood

    Z(θ_{GP}) = \int p(Y | W, θ_L, X) p(W | θ_{GP}) p(θ_L) dW dθ_L,

where θ_L denotes the likelihood hyperparameters. The other approach for utilizing Gaussian processes is to place a GP prior on the hyperparameters θ by redefining (3) as

    p(θ) = N(0, K(θ_{GP}, r, t)),        (10)

where K(θ_{GP}, r, t) is a prior covariance matrix that encodes, e.g., smoothness assumptions between the scale hyperparameters associated with different spatial and temporal coordinates r and t. The main conceptual difference from the GP prior on w is that the signs of the coefficients can still vary between nearby locations, while the sparsity structure is assumed to be smooth. In this setting, inference on w and θ can be done using EP, whereas MAP estimates could be determined for the GP hyperparameters θ_{GP}. Similar approaches based on GPs and Gaussian Markov random fields in combination with spike-and-slab priors have been proposed by Andersen et al. [1] and Hernández-Lobato et al. [4].
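The covariance function K(θ_{GP}, r, t) is left unspecified above. As one possible choice of ours, the following sketch builds a separable squared-exponential covariance over spatial and temporal coordinates and draws a sample of θ from (10); the kernel form and parameter names are illustrative assumptions, not fixed by the model.

```python
import numpy as np

def spatio_temporal_se_cov(r, t, sigma2, ell_r, ell_t, jitter=1e-6):
    """Example covariance K(theta_GP, r, t) for the hyperparameter prior (10):
    a separable squared-exponential kernel over spatial coordinates r (n x d)
    and temporal coordinates t (n,), with theta_GP = (sigma2, ell_r, ell_t)."""
    dr2 = np.sum((r[:, None, :] - r[None, :, :]) ** 2, axis=-1)
    dt2 = (t[:, None] - t[None, :]) ** 2
    K = sigma2 * np.exp(-0.5 * dr2 / ell_r**2 - 0.5 * dt2 / ell_t**2)
    return K + jitter * np.eye(len(t))            # jitter for numerical stability

rng = np.random.default_rng(3)
r, t = rng.uniform(size=(5, 2)), np.arange(5.0)
K = spatio_temporal_se_cov(r, t, sigma2=1.0, ell_r=0.5, ell_t=2.0)
theta_sample = rng.multivariate_normal(np.zeros(5), K)   # one draw from p(theta) in (10)
```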

1.2 Approximate Inference

This section describes an EP algorithm that is used to form a deterministic Gaussian approximation to the posterior distribution (4).

1.2.1 Posterior Approximation

To simplify the notation, we first denote W^T x_i = U_i^T w for i = 1, ..., n, where U_i = I_{d_y} ⊗ x_i, and then write the posterior distribution (4) as

    p(w, θ | D) = Z^{-1} \prod_{i=1}^{n+m} t_i(U_i^T w, V_i^T θ) p(θ),        (11)

where the t_i(·) are known site functions. We use expectation propagation [7] to approximate the posterior distribution (11) with

    p(w, θ | D) ≈ Z_{EP}^{-1} \prod_{i=1}^{n+m} [ \tilde{Z}_i \tilde{t}_{w,i}(w) \tilde{t}_{θ,i}(θ) ] p(θ) = q(w) q(θ),        (12)

where each analytically intractable site function is approximated with a tractable site approximation that factors between w and θ:

    t_i(U_i^T w, V_i^T θ) ≈ \tilde{t}_i(w, θ) = \tilde{Z}_i \tilde{t}_{w,i}(w) \tilde{t}_{θ,i}(θ),        (13)

and \tilde{Z}_i is a scalar scaling parameter, which is needed to form an EP approximation of the marginal likelihood of the observed data, Z = p(Y|X) ≈ Z_{EP}. We assume Gaussian site approximations of the form

    \tilde{t}_{w,i}(w) = ψ(U_i^T w, \tilde{ν}_{w,i}, \tilde{T}_{w,i}),        \tilde{t}_{θ,i}(θ) = ψ(V_i^T θ, \tilde{ν}_{θ,i}, \tilde{T}_{θ,i}),        (14)

where \tilde{ν}_{w,i} and \tilde{T}_{w,i} are the site location and precision parameters for w, \tilde{ν}_{θ,i} and \tilde{T}_{θ,i} are the corresponding site parameters for θ, and the un-normalized Gaussian site term approximations are defined using

    ψ(w, h, Q) = \exp(−\frac{1}{2} w^T Q w + h^T w).        (15)

The dimensions of the site parameters are determined by the corresponding transformation matrices, e.g., \tilde{T}_{w,i} is a d_i × d_i matrix and \tilde{ν}_{w,i} a d_i × 1 vector if U_i has d_i columns. Multiplying the site approximations (14) together according to (12) gives the following Gaussian approximations for w and θ:

    q(w) = Z(h_w, Q_w)^{-1} ψ(w, h_w, Q_w) = N(w | µ_w, Σ_w),
    q(θ) = Z(h_θ, Q_θ)^{-1} ψ(θ, h_θ, Q_θ) = N(θ | µ_θ, Σ_θ),        (16)

where the moment and natural parameters are related by µ_w = Q_w^{-1} h_w and Σ_w = Q_w^{-1} (and analogously for θ), and the natural parameters are given by

    h_w = \sum_{i=1}^{n+m} U_i \tilde{ν}_{w,i},                        Q_w = \sum_{i=1}^{n+m} U_i \tilde{T}_{w,i} U_i^T,
    h_θ = \sum_{i=1}^{n+m} V_i \tilde{ν}_{θ,i} + Σ_{θ,0}^{-1} µ_{θ,0},        Q_θ = \sum_{i=1}^{n+m} V_i \tilde{T}_{θ,i} V_i^T + Σ_{θ,0}^{-1},        (17)

and the normalization factor (or partition function) is defined as

    \log Z(h, Q) = \log \int ψ(w, h, Q) dw = \frac{d}{2} \log 2π − \frac{1}{2} \log|Q| + \frac{1}{2} h^T Q^{-1} h.        (18)
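Equations (17) and (18) translate directly into code. The sketch below is a plain NumPy illustration of ours using dense arrays, whereas a practical implementation would exploit the structure of U_i and V_i; it accumulates the natural parameters of q(w) and evaluates the Gaussian log partition function.

```python
import numpy as np

def natural_params_w(U_list, nu_list, T_list):
    """Accumulate the natural parameters of q(w) in eq. (17):
    h_w = sum_i U_i nu_{w,i},  Q_w = sum_i U_i T_{w,i} U_i^T."""
    d_w = U_list[0].shape[0]
    h = np.zeros(d_w)
    Q = np.zeros((d_w, d_w))
    for U_i, nu_i, T_i in zip(U_list, nu_list, T_list):
        h += U_i @ nu_i
        Q += U_i @ T_i @ U_i.T
    return h, Q

def log_partition(h, Q):
    """Gaussian log partition function of eq. (18)."""
    d = len(h)
    logdet = np.linalg.slogdet(Q)[1]
    return 0.5 * d * np.log(2 * np.pi) - 0.5 * logdet + 0.5 * h @ np.linalg.solve(Q, h)

# e.g., two scalar prior sites on a 2-D w; moments follow as mu_w = Q_w^{-1} h_w, Sigma_w = Q_w^{-1}
h_w, Q_w = natural_params_w([np.eye(2)[:, [0]], np.eye(2)[:, [1]]],
                            [np.zeros(1), np.zeros(1)],
                            [np.eye(1), np.eye(1)])
print(log_partition(h_w, Q_w))
```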

1.2.2 Expectation Propagation

The standard EP algorithm [7] updates the parameters of the site approximations and the posterior approximation q(w, θ) sequentially. At each iteration, first a proportion η of the i:th site term is removed from the posterior approximation to obtain a cavity distribution

    q_{−i}(w, θ) = q_{−i}(w) q_{−i}(θ) ∝ q(w) q(θ) \tilde{t}_{w,i}(w)^{−η} \tilde{t}_{θ,i}(θ)^{−η},        (19)

where η ∈ (0, 1] is a fraction parameter that can be adjusted to implement fractional (or power) EP updates [5, 6]. Then, the i:th site approximation is replaced with the exact site term to form a tilted distribution

    \hat{p}_i(w, θ) = \hat{Z}_i^{-1} q_{−i}(w, θ) t_i(U_i^T w, V_i^T θ)^{η},        (20)

where \hat{Z}_i = \int q_{−i}(w, θ) t_i(U_i^T w, V_i^T θ)^{η} dw dθ is a normalization factor which, in case i ≤ n, can also be interpreted as an approximation to the leave-one-out (LOO) predictive density of the excluded data point y_i. The tilted distribution can be regarded as a more refined approximation to the true posterior distribution. Next, the algorithm attempts to match the approximate posterior distribution q(w, θ) with \hat{p}_i(w, θ) by finding a member \hat{q}_i(w, θ) of the chosen approximate family that satisfies

    \hat{q}_i(w, θ) = \arg\min_{q_i} KL(\hat{p}_i(w, θ) || q_i(w, θ)),        (21)

where KL denotes the Kullback-Leibler divergence and q_i factors as q_i(w, θ) = q_i(w) q_i(θ). Then, the parameters of the i:th site terms are updated so that the moments of q(w, θ) match those of \hat{q}_i(w, θ):

    \hat{q}_i(w, θ) ≡ q(w, θ) ∝ q_{−i}(w) q_{−i}(θ) \tilde{t}_{w,i}(w)^{η} \tilde{t}_{θ,i}(θ)^{η}.        (22)

Finally, the posterior approximation q(w, θ) is updated according to the changes in the site parameters. These steps are repeated for all sites in a suitable order until convergence. From now on, we refer to the update scheme described above, where the posterior approximations q(w) and q(θ) are refreshed after each of the n + m site updates, as sequential EP. If the posterior approximations are updated only after new site parameter values have been determined for all of the n likelihood sites or all of the m prior sites, we refer to this as parallel EP [see, e.g., 2].
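The moment-matching step of equations (20)–(21) is easiest to see in one dimension. The sketch below computes the tilted normalizer, mean, and variance for a univariate Gaussian cavity; since KL(\hat{p}_i || q_i) is minimized within the Gaussian family by matching moments, these are exactly the parameters of \hat{q}_i. The probit site Φ(z) and the Gauss-Hermite quadrature rule are illustrative choices of ours; for many sites these moments are available analytically.

```python
import numpy as np
from scipy.stats import norm

def tilted_moments_1d(site, m_cav, v_cav, eta=1.0, n_quad=61):
    """Normalizer, mean and variance of p_hat(z) ∝ q_cav(z) * site(z)**eta,
    with Gaussian cavity N(m_cav, v_cav), via Gauss-Hermite quadrature."""
    t, w = np.polynomial.hermite_e.hermegauss(n_quad)    # weight function exp(-t^2/2)
    z = m_cav + np.sqrt(v_cav) * t                       # quadrature nodes under the cavity
    tz = site(z) ** eta
    Z_hat = np.sum(w * tz) / np.sqrt(2 * np.pi)          # normalizer Z_hat_i
    mean = np.sum(w * tz * z) / np.sqrt(2 * np.pi) / Z_hat
    var = np.sum(w * tz * z**2) / np.sqrt(2 * np.pi) / Z_hat - mean**2
    return Z_hat, mean, var                              # moments of the KL-minimizing Gaussian

Z_hat, m_hat, v_hat = tilted_moments_1d(norm.cdf, m_cav=0.3, v_cav=2.0)
print(Z_hat, m_hat, v_hat)
```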

1.2.3 Algorithm Description

With the chosen model structure and approximate family, the EP algorithm can be implemented as follows. First, initialize the site parameters, for example, as \tilde{ν}_{w,i} = 0, {\tilde{T}_{w,i} = 0}_{i=1}^{n}, {\tilde{T}_{w,i} = αI}_{i=n+1}^{n+m}, \tilde{ν}_{θ,i} = 0, and \tilde{T}_{θ,i} = 0, where α is an initial uniform prior variance for the coefficients, and compute the approximations q(w) and q(θ). Then iterate the following steps in a chosen order for each i = 1, ..., n + m (a sketch of a single site update is given after this list):

1. Determine the parameters of the cavity distribution (19). First compute the mean µ_i and covariance Σ_i of the approximate marginal distribution of the transformed variable z_i = U_i^T w: q(z_i) = N(µ_i, Σ_i), where µ_i = U_i^T µ_w and Σ_i = U_i^T Σ_w U_i. The cavity distribution of z_i is then q_{−i}(z_i) = N(µ_{−i}, Σ_{−i}), where

       µ_{−i} = Σ_{−i}(Σ_i^{-1} µ_i − η \tilde{ν}_{w,i}),
       Σ_{−i} = (Σ_i^{-1} − η \tilde{T}_{w,i})^{-1}.        (23)

   Compute the cavity distribution q_{−i}(φ_i) of the transformed variable φ_i = V_i^T θ analogously.

2. Compute the sufficient statistics with respect to the i:th tilted distribution (20). For the transformed variables z_i and φ_i, determine (or approximate) the marginal means \hat{µ}_i = E_{\hat{p}_i}(z_i) and \hat{µ}_{φ,i} = E_{\hat{p}_i}(φ_i) and covariances \hat{Σ}_i = E_{\hat{p}_i}(z_i z_i^T) − \hat{µ}_i \hat{µ}_i^T and \hat{Σ}_{φ,i} = E_{\hat{p}_i}(φ_i φ_i^T) − \hat{µ}_{φ,i} \hat{µ}_{φ,i}^T with respect to the marginal tilted distribution

       \hat{p}_i(z_i, φ_i) = \hat{Z}_i^{-1} t_i(z_i, φ_i)^{η} q_{−i}(z_i) q_{−i}(φ_i).        (24)

   If the normalization term \hat{Z}_i can be computed analytically, the sufficient statistics can be determined by differentiating \log \hat{Z}_i with respect to the cavity parameters, e.g., with respect to Σ_{−i}^{-1} µ_{−i} and Σ_{−i}^{-1} for z_i. Otherwise, the necessary moments have to be approximated with a suitable method such as numerical quadrature.

3. Update the site parameters according to (22), damping the updates with δ ∈ (0, 1]. For the site approximations \tilde{t}_{w,i}(w) this results in the following updates:

       \tilde{T}_{w,i}^{new} = (1 − δ) \tilde{T}_{w,i} + δ η^{-1} (\hat{Σ}_i^{-1} − Σ_{−i}^{-1}) = \tilde{T}_{w,i} + δ η^{-1} (\hat{Σ}_i^{-1} − Σ_i^{-1}),        (25)
       \tilde{ν}_{w,i}^{new} = (1 − δ) \tilde{ν}_{w,i} + δ η^{-1} (\hat{Σ}_i^{-1} \hat{µ}_i − Σ_{−i}^{-1} µ_{−i}) = \tilde{ν}_{w,i} + δ η^{-1} (\hat{Σ}_i^{-1} \hat{µ}_i − Σ_i^{-1} µ_i).        (26)

   The site parameters \tilde{ν}_{θ,i} and \tilde{T}_{θ,i} are updated similarly using the marginal tilted moments \hat{µ}_{φ,i} and \hat{Σ}_{φ,i}.

4. If sequential EP is used, apply rank-one posterior updates to q(w) = N(µ_w, Σ_w) and q(θ) = N(µ_θ, Σ_θ). If parallel EP is used, the posterior approximations q(w) and q(θ) are refreshed after each sweep over the sites.

In practice we first run a few sweeps over the likelihood sites i = 1, ..., n only, to get a good data fit. Then we run a few sweeps over the prior sites i = n + 1, ..., n + m only. Finally we alternate between likelihood and prior site sweeps until convergence. Practical tips for implementing the EP updates in a numerically stable manner can be found, e.g., in the appendices of [3].
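The sketch below implements a single damped update of a w-site following steps 1–3 above. It is our own illustration assuming NumPy, treats the tilted-moment computation of step 2 as a user-supplied function, and is written for clarity rather than numerical robustness (see [3] for stable implementations); the θ-sites are handled analogously.

```python
import numpy as np

def ep_site_update_w(U_i, nu_w_i, T_w_i, mu_w, Sigma_w,
                     tilted_moments, eta=1.0, delta=0.8):
    """One damped EP update of a w-site. `tilted_moments(m_cav, S_cav)` must
    return the tilted mean and covariance of z_i = U_i^T w (analytically or
    by quadrature); it depends on the chosen site function."""
    # step 1: marginal of z_i and the cavity distribution, eq. (23)
    mu_i = U_i.T @ mu_w
    Sigma_i = U_i.T @ Sigma_w @ U_i
    P_i = np.linalg.inv(Sigma_i)                     # marginal precision
    Sigma_cav = np.linalg.inv(P_i - eta * T_w_i)
    mu_cav = Sigma_cav @ (P_i @ mu_i - eta * nu_w_i)

    # step 2: moments of the tilted distribution, eq. (24)
    mu_hat, Sigma_hat = tilted_moments(mu_cav, Sigma_cav)
    P_hat = np.linalg.inv(Sigma_hat)

    # step 3: damped site update, eqs. (25)-(26)
    T_new = T_w_i + delta / eta * (P_hat - P_i)
    nu_new = nu_w_i + delta / eta * (P_hat @ mu_hat - P_i @ mu_i)
    return nu_new, T_new                             # then refresh q(w) via eq. (17)
```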

References

[1] Michael R. Andersen, Ole Winther, and Lars K. Hansen. Bayesian inference for structured spike and slab priors. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 1745–1753. Curran Associates, Inc., 2014.

[2] Marcel van Gerven, Botond Cseke, Robert Oostenveld, and Tom Heskes. Bayesian source localization with the multivariate Laplace prior. In Y. Bengio, D. Schuurmans, J. Lafferty, C. K. I. Williams, and A. Culotta, editors, Advances in Neural Information Processing Systems 22, pages 1901–1909. Curran Associates, Inc., 2009.

[3] Pasi Jylänki, Aapo Nummenmaa, and Aki Vehtari. Expectation propagation for neural networks with sparsity-promoting priors, March 2013. URL http://arxiv.org/abs/1303.6938.

[4] Jose Miguel Hernández-Lobato, Daniel Hernández-Lobato, and Alberto Suárez. Network-based sparse Bayesian classification. Pattern Recognition, 44(4):886–900, April 2011. ISSN 0031-3203. doi: 10.1016/j.patcog.2010.10.016. URL http://dx.doi.org/10.1016/j.patcog.2010.10.016.

[5] Thomas Minka. Power EP. Technical Report MSR-TR-2004-149, Microsoft Research, 2004.

[6] Thomas Minka. Divergence measures and message passing. Technical report, Microsoft Research, 2005. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.314.3445.

[7] Thomas P. Minka. Expectation propagation for approximate Bayesian inference. In Proceedings of the 17th Conference in Uncertainty in Artificial Intelligence, UAI '01, pages 362–369, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc. ISBN 1-55860-800-1. URL http://portal.acm.org/citation.cfm?id=647235.720257.

[8] Carl E. Rasmussen. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, 2006. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.86.3414.

[9] Jaakko Riihimäki, Pasi Jylänki, and Aki Vehtari. Nested expectation propagation for Gaussian process classification with a multinomial probit likelihood, July 2012. URL http://arxiv.org/abs/1207.3649.
