Appendix

1. The DPM Model

Here we present a mathematically rigorous specification of the DPM model. Most of the groundwork in this area has already been laid down by O'Leary (2009) and Neal (2000), and we follow their notation closely. For the sake of generality we will consider a discrete-space specification of the problem, although the results presented here are equally applicable to continuous space.

Let us assume that every observed data point and every source location occupies a single cell in a finite grid of cells, which we will call $\Omega$. Every $\omega \in \Omega$ is a vector with two components, $\omega = (\omega^{(1)}, \omega^{(2)})$, giving the x- and y-coordinates of the location in two-dimensional space. We presume that we are working with a series of $n$ observations, labelled $x_1, x_2, \ldots, x_n$. Each $x_i$ is modelled as a draw from the random vector $X_i$, defined on the sample space $\Omega$. We will also assume a strictly infinite series of potential source locations $z_1$ to $z_\infty$ (denoted $z_{1,\ldots,\infty}$), where each $z_i$ is modelled as a draw from the random vector $Z_i$ defined on $\Omega$. The prior on source locations (also the base distribution of the Dirichlet process) is given by the distribution of $Z_i$, which we will call $G_0$.

Each observation $x_i$ is linked to a particular source location via an associated categorical variable $c_i$. The value of $c_i$ indexes one of the infinitely many potential source locations, meaning the observation $x_i$ can be said to have originated from source location $z_{c_i}$. Specifically, we assume that the observation $x_i$ is a draw from the distribution $F(z_{c_i})$, which could, for example, be a normal distribution centred on the point $z_{c_i}$. The values of the categorical variables $c_i$ are drawn from a Chinese restaurant process (CRP) with concentration parameter $\alpha$. Finally, we model $\alpha$ as a draw from the hyper-prior distribution $H$. The complete DPM model can be written as follows:

$$x_i \mid z_{c_i} \sim F(z_{c_i}), \qquad z_{1,\ldots,\infty} \sim G_0, \qquad c_i \sim \mathrm{CRP}(\alpha), \qquad \alpha \sim H. \tag{1}$$
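To make the generative process in (1) concrete, the sketch below draws a synthetic dataset from the model. The grid size, the uniform choice of $G_0$, the normal choice of $F$, and the fixed value of $\alpha$ (rather than a draw from $H$) are all illustrative assumptions, not part of the model specification.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative assumptions: a 100 x 100 grid, uniform base distribution G0,
# and F(z) = an isotropic normal centred on z, rounded back onto the grid.
GRID = 100
SIGMA = 3.0  # spread of observations around their source

def sample_dpm(n, alpha):
    """Draw n observations from the DPM model in equation (1)."""
    c, counts, sources, xs = [], [], [], []
    for _ in range(n):
        # Chinese restaurant process: join group k with probability
        # n_k / (i + alpha), or open a new group with probability alpha / (i + alpha)
        probs = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(counts):                                # open a new group
            counts.append(0)
            sources.append(rng.integers(0, GRID, size=2))   # z_k ~ G0 (uniform)
        counts[k] += 1
        c.append(k)
        # x_i ~ F(z_{c_i}): normal around the source, clipped back to the grid
        x = rng.normal(sources[k], SIGMA)
        xs.append(np.clip(np.round(x), 0, GRID - 1).astype(int))
    return np.array(xs), np.array(c), np.array(sources)

x, c, z = sample_dpm(n=20, alpha=1.0)
print(len(z), "sources underlie", len(x), "observations")
```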

It is important to emphasize that uppercase italicised letters such as $G_0$, $F$ and $H$ describe types of distribution (e.g. normal, exponential), with the associated probability mass/density functions being written $g_0(z_i)$, $f(x_i \mid z_{c_i})$ and $h(\alpha)$ respectively. Specific forms of these distributions will be considered below.

2. The Analytical Solution to the DPM Model

Here we present expressions for some posterior quantities of interest under the DPM model. As mentioned in the main text, these analytical solutions are only practically useful when the number of observations is small (in the order of 10 or less). To start with, let us condition on a particular partition of the data, as specified by the vector of group indices $c = c_1 \ldots c_n$ described above. Let there be $n_j$ elements in group $j$ (in other words $n_j = \#\{c_i : c_i = j\}$), and $u$ groups in total, so that $\sum_{j=1}^{u} n_j = n$. The prior probability of this (or any) partition under the CRP can be written as follows:

$$\Pr(c \mid \alpha) = \frac{\Gamma(\alpha)}{\Gamma(n + \alpha)}\, \alpha^u \prod_{j=1}^{u} \Gamma(n_j)\,.$$

Throughout this paper we assume the following hyper-prior on $\alpha$:

$$h(\alpha) = \frac{1}{(1 + \alpha)^2}\,.$$

Rather than integrate over this hyper-prior separately for each partition, we can define the function $t(u)$ as follows:

$$t(u) = \int_0^\infty \frac{\Gamma(\alpha)}{\Gamma(n + \alpha)}\, \alpha^u\, h(\alpha)\, d\alpha\,.$$


This function can be pre-computed for $u = 1, \ldots, n$. Then the probability of any partition, integrated over the hyper-prior on $\alpha$, can be written

$$\Pr(c) = t(u) \prod_{j=1}^{u} \Gamma(n_j)\,. \tag{2}$$
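As a worked example, the sketch below pre-computes $t(u)$ by numerical quadrature under the hyper-prior $h(\alpha) = 1/(1+\alpha)^2$ and evaluates the log of (2) for a hypothetical partition vector. Working in log space via gammaln is our own numerical-stability choice; the partition is illustrative.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gammaln

def t_table(n):
    """Pre-compute t(u) for u = 1..n by quadrature over the hyper-prior."""
    def integrand(a, u):
        # Gamma(a) / Gamma(n + a) * a^u * 1/(1 + a)^2, evaluated in log space
        return np.exp(gammaln(a) - gammaln(n + a) + u * np.log(a) - 2 * np.log1p(a))
    return {u: quad(integrand, 0, np.inf, args=(u,))[0] for u in range(1, n + 1)}

def log_partition_prior(c, t):
    """log Pr(c) from equation (2), for a partition given as group labels."""
    counts = np.unique(c, return_counts=True)[1]
    return np.log(t[len(counts)]) + gammaln(counts).sum()

c = np.array([0, 0, 1, 1, 1, 2])   # a hypothetical partition of n = 6 points
t = t_table(n=len(c))
print(log_partition_prior(c, t))
```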

Before deriving the posterior distribution of a partition we must calculate the marginal probability of the data, summed (or integrated) over all possible locations of all possible sources. This can be written

$$\Pr(x \mid c) = \prod_{j=1}^{\infty} \sum_{\omega \in \Omega} g_0(\omega) \prod_{i : c_i = j} f(x_i \mid \omega)\,. \tag{3}$$

Although the first product is over an infinite number of potential source locations, only a finite subset of these source locations (those for which $\exists\, c_i = j$) have observed data associated with them. For all other values of $j$ the summation is over the space of the prior only, returning a value of one. In fact, while this expression appears complex, in many common situations (such as normal prior and likelihood) a closed-form solution can be obtained.
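A minimal sketch of (3) on a discrete grid, again assuming a uniform $g_0$ and an isotropic normal $f$ (using the continuous normal density as an approximation to its discretised counterpart); the data are illustrative. Only groups with observations need to be visited, since empty groups contribute a factor of one.

```python
import numpy as np
from scipy.stats import norm

GRID, SIGMA = 100, 3.0  # illustrative grid size and observation spread

def log_marginal(x, c):
    """log Pr(x | c) from equation (3): for each non-empty group, sum over the
    grid of g0(omega) times the product of f(x_i | omega)."""
    ax = np.arange(GRID)
    gx, gy = np.meshgrid(ax, ax, indexing="ij")   # every omega in Omega
    log_g0 = -2 * np.log(GRID)                    # uniform prior over the grid
    total = 0.0
    for j in np.unique(c):
        pts = x[c == j]                           # observations in group j
        log_f = sum(norm.logpdf(p[0], gx, SIGMA) + norm.logpdf(p[1], gy, SIGMA)
                    for p in pts)
        total += np.logaddexp.reduce((log_g0 + log_f).ravel())
    return total

x = np.array([[10, 12], [11, 9], [50, 52], [49, 48]])
c = np.array([0, 0, 1, 1])
print(log_marginal(x, c))
```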

Combining (2) with (3) we arrive at the posterior probability of a partition via Bayes' rule:

$$\Pr(c \mid x) = \frac{\Pr(x \mid c)\Pr(c)}{\sum_{c \in \mathcal{P}} \Pr(x \mid c)\Pr(c)}\,, \tag{4}$$

where the sum is over the set (denoted $\mathcal{P}$) of all possible partitions of $n$ observations into up to $n$ groups. As well as being useful in its own right, the information in (4) is required when computing the posterior distribution of all unknown source locations (i.e. the Bayesian analogue of the traditional jeopardy surface). First, the posterior distribution of the $j$th source location, conditional on a particular partition, can be obtained as follows:

$$\Pr(z_j \mid c, x) = \frac{g_0(z_j) \prod_{i : c_i = j} f(x_i \mid z_j)}{\sum_{\omega \in \Omega} g_0(\omega) \prod_{i : c_i = j} f(x_i \mid \omega)}\,. \tag{5}$$

We are interested in the probability of finding any source location at the point $z$, which is given by a Boolean or operation (union) over all $z_j$ for $j \in 1, \ldots, u$. To a close approximation this is equal to the mean of $\Pr(z_j \mid c, x)$ for $j \in 1, \ldots, u$ (when dealing with probability density rather than probability mass this result is exact). Thus, we can define a jeopardy surface, conditional on a particular partition, as follows:

$$J(z \mid c, x) = \frac{1}{u} \sum_{j=1}^{u} \Pr(z_j \mid c, x)\,.$$
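The conditional jeopardy surface can be sketched under the same illustrative assumptions: equation (5) is evaluated in log space for each group, and the resulting mass functions are averaged.

```python
import numpy as np
from scipy.stats import norm

GRID, SIGMA = 100, 3.0  # same illustrative assumptions as before

def jeopardy_given_partition(x, c):
    """J(z | c, x): the mean over groups of the source posteriors in equation (5)."""
    ax = np.arange(GRID)
    gx, gy = np.meshgrid(ax, ax, indexing="ij")
    groups = np.unique(c)
    surface = np.zeros((GRID, GRID))
    for j in groups:
        pts = x[c == j]
        # log of g0(omega) * prod_i f(x_i | omega); the uniform g0 is constant
        # over the grid and cancels in the normalisation, so it is dropped here
        log_post = sum(norm.logpdf(p[0], gx, SIGMA) + norm.logpdf(p[1], gy, SIGMA)
                       for p in pts)
        log_post -= np.logaddexp.reduce(log_post.ravel())  # normalise as in (5)
        surface += np.exp(log_post)
    return surface / len(groups)

x = np.array([[10, 12], [11, 9], [50, 52], [49, 48]])
c = np.array([0, 0, 1, 1])
J = jeopardy_given_partition(x, c)
print(J.sum())   # equals 1: a mean of probability mass functions
```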

Finally, making use of (4), we can define an overall jeopardy surface as follows:

$$J(z \mid x) = \sum_{c \in \mathcal{P}} J(z \mid c, x)\, \Pr(c \mid x)\,.$$

Most of the time this will be the main posterior quantity that we are interested in.
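For a small number of observations the sum over $\mathcal{P}$ can be carried out exactly by enumerating all set partitions. The sketch below does so, assuming the helper functions t_table, log_partition_prior, log_marginal and jeopardy_given_partition from the preceding sketches are already in scope; the restricted-growth enumeration is our own scaffolding.

```python
import numpy as np
from scipy.special import logsumexp

def partitions(n):
    """Yield every partition of n items as a restricted-growth label vector."""
    c = [0] * n
    def rec(i, u):
        if i == n:
            yield np.array(c)
            return
        for k in range(u + 1):        # join an existing group, or open group u
            c[i] = k
            yield from rec(i + 1, u + (k == u))
    yield from rec(1, 1)

def overall_jeopardy(x):
    """J(z | x): J(z | c, x) averaged over partitions, weighted as in (4)."""
    n = len(x)
    t = t_table(n)
    cs = list(partitions(n))          # Bell(n) partitions; feasible for small n
    log_w = np.array([log_marginal(x, c) + log_partition_prior(c, t) for c in cs])
    w = np.exp(log_w - logsumexp(log_w))              # normalised Pr(c | x)
    return sum(wi * jeopardy_given_partition(x, c) for wi, c in zip(w, cs))

x = np.array([[10, 12], [11, 9], [50, 52], [49, 48]])
J = overall_jeopardy(x)               # enumerates Bell(4) = 15 partitions
print(J.sum())                        # still sums to 1
```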

3. Details of the MCMC Algorithm

For large data problems (in this context meaning more than 10 points) we must turn to MCMC methods for a solution to the DPM model. The core of our MCMC algorithm is centred around algorithm 2 in Neal (2000). At its heart the algorithm is a Gibbs sampler, which works by alternately drawing new categories and new source locations from their respective conditional distributions.

We start by drawing a new value $k$ for category $c_i$ from its conditional distribution given our most up-to-date values of the source locations. Thus, in agreement with Neal's original notation, let $c_{-i}$ be the subset of elements $c_j$ for which $j \neq i$, and let $n_{-i,k}$ be the number of $c_{-i}$ that are equal to $k$. Then, adapting Neal's formula 3.6 to our specific needs, we can write the following.


If $k \in c_{-i}$,

$$\Pr(c_i = k \mid c_{-i}, x_i, z_{1,\ldots,\infty}, \alpha) = b\, \frac{n_{-i,k}}{n - 1 + \alpha}\, f(x_i \mid z_k)\,;$$

If $k \notin c_{-i}$,

$$\Pr(c_i = k \mid c_{-i}, x_i, z_{1,\ldots,\infty}, \alpha) = b\, \frac{\alpha}{n - 1 + \alpha} \sum_{\omega \in \Omega} f(x_i \mid \omega)\, g_0(\omega)\,, \tag{6}$$

where $b$ is a normalizing constant that ensures that the above probabilities sum to unity. This sampling scheme is repeated for $i = 1, \ldots, n$. Then, for each $j \in 1, \ldots, u$, we can draw a new value of the source location $z_j$ from its conditional posterior distribution given our most up-to-date values of the categorical variables (i.e. we draw $z_j$ from (5) given the new value of $c$).

Our main extension to this methodology has been to integrate over the hyper-prior on $\alpha$. Making use of the function $t(u)$ defined previously, we can re-define (6) with $\alpha$ integrated out as follows. If $k \in c_{-i}$,

$$\Pr(c_i = k \mid c_{-i}, x_i, z_{1,\ldots,\infty}) = b'\, t(u_{-i})\, n_{-i,k}\, f(x_i \mid z_k)\,;$$

If $k \notin c_{-i}$,

$$\Pr(c_i = k \mid c_{-i}, x_i, z_{1,\ldots,\infty}) = b'\, t(u_{-i} + 1) \sum_{\omega \in \Omega} f(x_i \mid \omega)\, g_0(\omega)\,, \tag{7}$$

where $u_{-i}$ denotes the total number of unique values in $c_{-i}$ (i.e. the total number of groups defined by the categorical variables once element $i$ is removed). The notation $b'$ has been used to indicate that this constant of proportionality is not equivalent to $b$ in (6).
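A sketch of a single category update under (7), in the discrete-space setting with the illustrative uniform $g_0$ and normal $f$ used above; it assumes a t dictionary pre-computed as in section 2 and a sources mapping from group labels to locations. The function name and data layout are our own scaffolding, not part of the algorithm specification.

```python
import numpy as np
from scipy.stats import norm

GRID, SIGMA = 100, 3.0
rng = np.random.default_rng(2)

def update_category(i, c, x, sources, t):
    """Resample c[i] from equation (7), with alpha integrated out."""
    labels = [ck for j, ck in enumerate(c) if j != i]      # c_{-i}
    groups, counts = np.unique(labels, return_counts=True)
    u = len(groups)                                        # u_{-i}
    log_w = []
    for k, n_k in zip(groups, counts):                     # existing groups
        log_w.append(np.log(t[u]) + np.log(n_k)
                     + norm.logpdf(x[i], sources[k], SIGMA).sum())
    # new group: sum over the grid of f(x_i | omega) g0(omega)
    ax = np.arange(GRID)
    gx, gy = np.meshgrid(ax, ax, indexing="ij")
    log_pred = np.logaddexp.reduce((norm.logpdf(x[i][0], gx, SIGMA)
                                    + norm.logpdf(x[i][1], gy, SIGMA)
                                    - 2 * np.log(GRID)).ravel())
    log_w.append(np.log(t[u + 1]) + log_pred)
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())                        # b' cancels here
    choice = rng.choice(len(w), p=w / w.sum())
    # a fresh label means a new group; its source z would then be drawn from (5)
    return groups[choice] if choice < u else groups.max() + 1
```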

A more powerful algorithm can be used under the assumption of continuous space and conjugate prior and likelihood (as in the bivariate normal example). In this case $f(\cdot)$ and $g_0(\cdot)$ represent conjugate probability density functions, and the following result obtains. If $k \in c_{-i}$,

$$\Pr(c_i = k \mid c_{-i}, x_i) = b''\, t(u_{-i})\, n_{-i,k} \int_\Omega f(x_i \mid \omega)\, dy_{-i,k}(\omega)\,;$$

If $k \notin c_{-i}$,

$$\Pr(c_i = k \mid c_{-i}, x_i) = b''\, t(u_{-i} + 1) \int_\Omega f(x_i \mid \omega)\, dg_0(\omega)\,, \tag{8}$$



where $y_{-i,k}(\omega)$ denotes the posterior distribution of $\omega$ based on the prior $G_0$ and all observations $x_j$ for which $j \neq i$ and $c_j = k$. The notation $b''$ indicates that this is yet another constant of proportionality (in reality none of these $b$-values need to be evaluated explicitly, as it is only necessary that we can sample from the requisite distribution). Formula (8) can be substituted directly for formula 3.7 in Neal's algorithm 3.
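For the normal-normal conjugate pair, both integrals in (8) have closed forms: the predictive densities below follow the standard result for an isotropic bivariate normal prior and likelihood. The specific parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Conjugate normal-normal predictives for equation (8), assuming an isotropic
# bivariate normal prior g0 = N(mu0, tau0^2 I) and likelihood
# f(x | omega) = N(omega, sigma^2 I). Parameter values are illustrative.
MU0, TAU0, SIGMA = np.zeros(2), 50.0, 3.0

def log_pred_new(x_i):
    """log of the integral of f(x_i | omega) dg0(omega):
    the predictive is N(mu0, (tau0^2 + sigma^2) I)."""
    s = np.sqrt(TAU0**2 + SIGMA**2)
    return norm.logpdf(x_i, MU0, s).sum()

def log_pred_existing(x_i, x_group):
    """log of the integral of f(x_i | omega) dy(omega), where y is the
    posterior of omega given the m observations already in the group."""
    m = len(x_group)
    post_var = 1.0 / (1.0 / TAU0**2 + m / SIGMA**2)
    post_mean = post_var * (MU0 / TAU0**2 + x_group.sum(axis=0) / SIGMA**2)
    s = np.sqrt(post_var + SIGMA**2)   # predictive adds observation noise
    return norm.logpdf(x_i, post_mean, s).sum()

x_group = np.array([[10.0, 12.0], [11.0, 9.0]])
print(log_pred_existing(np.array([10.5, 11.0]), x_group),
      log_pred_new(np.array([10.5, 11.0])))
```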
