Coupling of Particle Filters∗

Pierre E. Jacob†
Department of Statistics, Harvard University


Fredrik Lindsten and Thomas B. Schön
Department of Information Technology, Uppsala University

June 6, 2016

Abstract

Particle filters provide Monte Carlo approximations of intractable quantities such as pointwise evaluations of the likelihood in state space models. In many scenarios, the interest lies in the comparison of these quantities as some parameter or input varies. To facilitate such comparisons, we introduce and study methods to couple two particle filters in such a way that the correlation between the two underlying particle systems is increased. The motivation stems from the classic variance reduction technique of positively correlating two estimators. The key challenge in constructing such a coupling stems from the discontinuity of the resampling step of the particle filter. As our first contribution we consider two coupled resampling algorithms. They improve the precision of finite-difference estimators of the score vector and boost the performance of particle marginal Metropolis–Hastings algorithms for parameter inference. The second contribution arises from the use of these coupled resampling schemes within conditional particle filters, allowing for unbiased estimators of smoothing functionals. The result is a new smoothing strategy that operates by averaging a number of independent and unbiased estimators, which allows for 1) straightforward parallelization and 2) the construction of accurate error estimates. Neither of the above is possible with existing particle smoothers.

Keywords: state space models, particle filter, particle smoother, resampling algorithms, common random numbers, couplings, optimal transport.

∗ This research is financially supported by the Swedish Foundation for Strategic Research (SSF) via the project ASSEMBLE and the Swedish Research Council (VR) via the projects Learning of complex dynamical systems (Contract number: 637-2014-466) and Probabilistic modeling of dynamical systems (Contract number: 621-2013-5524).
† Corresponding author: [email protected]. Code available at: github.com/pierrejacob/.


Contents

1 Introduction

2 Coupled resampling
  2.1 Common random numbers
  2.2 Coupled resampling and coupled particle filter
  2.3 Transport resampling
  2.4 Index-coupled resampling
  2.5 Existing approaches

3 Correlated likelihood estimators and applications
  3.1 Joint or conditional sampling
  3.2 Numerical illustration: correlated likelihood estimators
  3.3 Application: finite difference estimators
  3.4 Application: correlated particle marginal MH

4 A new smoothing algorithm
  4.1 Smoothing and debiasing techniques
  4.2 Proposed smoother
  4.3 Theoretical properties
  4.4 Practical considerations
  4.5 Variance reduction
  4.6 Numerical results

5 Discussion

A Approximate transport

B Validity of correlated particle marginal MH

C Properties of the Rhee–Glynn smoothing estimator
  C.1 Proof of Lemma 4.1
  C.2 Proof of Theorem 4.1

D Experiments with the proposed smoother
  D.1 Effect of the resampling scheme
  D.2 Effect of the number of particles
  D.3 Effect of the truncation variable
  D.4 Effect of ancestor sampling

E Pseudo-code for particle filters

1 Introduction

In the context of nonlinear state space models, particle filters provide efficient approximations of the distribution of a latent process $(x_t)_{t\geq 0}$, given noisy and partial observations $(y_t)_{t\geq 1}$ (Doucet et al., 2001; Cappé et al., 2005; Doucet and Johansen, 2011). We assume that the latent process takes values in $\mathbb{X} \subset \mathbb{R}^{d_x}$, and that the observations are in $\mathbb{Y} \subset \mathbb{R}^{d_y}$ for some $d_x, d_y \in \mathbb{N}$. The model specifies an initial distribution $m_0(dx_0|\theta)$ and a transition kernel $f(dx_t|x_{t-1},\theta)$ for the Markovian latent process. Conditionally upon the latent process, the observations are independent and their distribution is given by a measurement kernel $g(dy_t|x_t,\theta)$. The model is parameterized by $\theta \in \Theta \subset \mathbb{R}^{d_\theta}$. Filtering consists in approximating the distribution $p(dx_t|y_{1:t},\theta)$ for all times $t \geq 1$, whereas smoothing consists in approximating the distribution $p(dx_{0:T}|y_{1:T},\theta)$ for a fixed time horizon $T$, where for $s,t \in \mathbb{N}$, we write $s{:}t$ for the set $\{s,\ldots,t\}$, and $v_{s:t}$ for the vector $(v_s,\ldots,v_t)$.

The bootstrap particle filter (Gordon et al., 1993) generates weighted samples $(w_t^k, x_t^k)_{k=1}^N$ for all $t \in \mathbb{N}$, where the particle locations $(x_t^k)_{k=1}^N$ are samples in $\mathbb{X}$ and the weights $(w_t^k)_{k=1}^N$ are non-negative and sum to one. The number $N \in \mathbb{N}$ of particles is specified by the user: the computational cost of the algorithm is linear in $N$, while the approximation of the filtering distribution $p(dx_t|y_{1:t},\theta)$ by the empirical distribution $\sum_{k=1}^N w_t^k \delta_{x_t^k}(dx_t)$ becomes more precise as $N$ increases. Various theoretical aspects of this approximation have been studied; see e.g. Del Moral (2004); Whiteley (2013) and references therein. An important by-product of the particle filter for statistical inference is the likelihood estimator, defined as $\hat{p}^N(y_{1:t}|\theta) := \prod_{s=1}^{t} N^{-1} \sum_{k=1}^{N} g(y_s|x_s^k,\theta)$. The likelihood estimator is known to have expectation equal to the likelihood $p(y_{1:t}|\theta)$, and its variance has been extensively studied (Del Moral, 2004; Cérou et al., 2011; Bérard et al., 2014). The estimator is at the core of the particle marginal Metropolis–Hastings (MH) algorithm (Andrieu et al., 2010; Doucet et al., 2015), in which particle filters are run within an MH scheme, enabling inference in a large class of state space models.

We consider methods to couple particle filters. A coupling of particle filters, given two parameter values $\theta$ and $\tilde\theta$, refers to a pair of particle systems, denoted by $(w_t^k, x_t^k)_{k=1}^N$ and $(\tilde{w}_t^k, \tilde{x}_t^k)_{k=1}^N$, such that: 1) marginally, each system has the same distribution as if it was generated by a particle filter, respectively given $\theta$ and $\tilde\theta$, and 2) the two systems are in some sense correlated. We study ways of introducing positive correlations between likelihood estimators $\hat{p}^N(y_{1:t}|\theta)$ and $\hat{p}^N(y_{1:t}|\tilde\theta)$, in order to estimate the gradient of the log-likelihood or to improve the performance of MH schemes. Importantly, we will also show how a similar coupling idea leads to a new algorithm for smoothing. The proposed algorithm is trivial to parallelize, provides unbiased estimators of smoothing functionals, and accurate estimates of the associated Monte Carlo error.

Correlating estimators is a classic Monte Carlo technique for variance reduction, and can often be achieved by using common random numbers (Kahn and Marshall, 1953; Asmussen and Glynn, 2007), which is provably advantageous in some contexts (Glasserman and Yao, 1992). Particle filters are randomized algorithms which can be written as a deterministic function of some random variables and a parameter value. Thus, it is possible to use the same random variables for two different parameter values, in an attempt to correlate the resulting particle systems. However, particle filters are discontinuous functions of their inputs, due to their inherent resampling steps. This discontinuity renders theoretical guarantees such as Proposition 2.2 in Glasserman and Yao (1992) inapplicable. Therefore, and despite various attempts (Pitt, 2002; DeJong et al., 2007; Lee, 2008; Coquelin et al., 2009; Malik and Pitt, 2011), there are no standard ways of coupling particle filters. Meanwhile, positive correlations between likelihood estimators have been shown to yield drastic improvements in pseudo-marginal methods (Deligiannidis et al., 2015; Dahlin et al., 2015), which strengthens the motivation for finding such couplings. Our proposed strategy is to use common random numbers for the initialization and propagation steps, while the resampling step can be performed jointly for both systems, using couplings inspired by optimal transport ideas.

In this article, we present coupled resampling schemes designed to increase the correlation between pairs of particle systems. The schemes are described in Section 2, in the simple context of bootstrap particle filters, given two parameters θ and θ̃. In Section 3, we investigate the performance gain for finite difference estimators and correlated pseudo-marginal algorithms. In Section 4, we investigate a different use of coupled resampling schemes, namely to couple two conditional particle filters (Andrieu et al., 2010) using different reference trajectories (but a common parameter value). Combining this with a debiasing method due to Glynn and Rhee (2014), we propose a new smoothing algorithm, the Rhee–Glynn smoother. Section 5 discusses the proposed methodology and future avenues of research.

2 Coupled resampling

2.1 Common random numbers

Within particle filters, random numbers are used to initialize, to resample and to propagate the particles. We describe the bootstrap particle filter given a parameter value θ, emphasizing the sources of randomness. Initially, we sample $x_0^k \sim m_0(dx_0|\theta)$ for all $k \in 1{:}N$, which is equivalent to computing $x_0^k = M(U_0^k, \theta)$, where $M$ is a function and $U_0^{1:N}$ are random variables. The initial weights $w_0^k$ are set to $N^{-1}$. Consider now step $t \geq 0$ of the algorithm. In the resampling step, a vector of ancestor variables $a_t^{1:N} \in \{1,\ldots,N\}^N$ is sampled; $a_t^k$ represents the index of the particle of generation $t$ that becomes the parent of particle $x_{t+1}^k$. The resampling step can be written $a_t^{1:N} \sim r(da_t^{1:N}|w_t^{1:N})$, for some distribution $r$. The propagation step consists in drawing $x_{t+1}^k \sim f(dx_{t+1}|x_t^{a_t^k}, \theta)$, or equivalently computing $x_{t+1}^k = F(x_t^{a_t^k}, U_{t+1}^k, \theta)$, where $F$ is a function and $U_{t+1}^{1:N}$ are some random variables. The next weights are computed deterministically as $w_{t+1}^k \propto g(y_{t+1}|x_{t+1}^k, \theta)$, then normalized to sum to one, and the algorithm proceeds. The distribution of the latent process typically leads to a natural implementation of the functions $M$ and $F$, and a choice of distributions for $U_t^{1:N}$ for all $t$. We refer to $U_t^{1:N}$ for all $t$ as the process-generating variables. On the other hand, the resampling distribution $r$ is an algorithmic choice. A standard condition for the validity of the resampling scheme is that, under $r$, $\mathbb{P}(a_t^k = j) = w_t^j$; various schemes satisfying this condition exist (Douc and Cappé, 2005; Murray et al., 2015).

Consider a pair of particle filters given θ and θ̃, producing particle systems $(w_t^k, x_t^k)_{k=1}^N$ and $(\tilde{w}_t^k, \tilde{x}_t^k)_{k=1}^N$, that we want to make as correlated as possible. Assume that the state space is one-dimensional, that $u \mapsto M(u,\theta)$ and $u \mapsto M(u,\tilde\theta)$ are increasing and right-continuous, and that $\mathbb{E}[M^2(U_0,\theta)] < \infty$ and $\mathbb{E}[M^2(U_0,\tilde\theta)] < \infty$. Then Proposition 2.2 of Glasserman and Yao (1992) states that the covariance between $M(U_0,\theta)$ and $M(V_0,\tilde\theta)$ is maximized, among all choices of joint distributions for $(U_0, V_0)$ that have the same marginal distributions, by choosing $U_0 = V_0$ almost surely. This justifies the use of common random numbers for the initialization step. Likewise, if the propagation function $F: (x_t, U_{t+1}, \theta) \mapsto x_{t+1}$ is continuous in its first and third arguments, and if the particle locations $x_t$ and $\tilde{x}_t$ are similar, then, intuitively, $x_{t+1} = F(x_t, U_{t+1}, \theta)$ and $\tilde{x}_{t+1} = F(\tilde{x}_t, U_{t+1}, \tilde\theta)$ should be similar as well.

Random numbers are also used in the resampling step: we can write $a_t^{1:N} = R(w_t^{1:N}, U_{R,t})$, where $U_{R,t}$ are random variables, typically uniformly distributed. The resampling function $(w_t^{1:N}, U_{R,t}) \mapsto a_t^{1:N} = R(w_t^{1:N}, U_{R,t})$ takes values in the discrete space of indices $\{1,\ldots,N\}^N$, and thus cannot be a continuous function of its arguments. In other words, even if we use the same random numbers for $U_{R,t}$, a small difference between the weight vectors $w_t^{1:N}$ and $\tilde{w}_t^{1:N}$ might lead to sampling, for instance, $a_t^k = i$ in the first system and $\tilde{a}_t^k = i+1$ in the second system; and $x_t^i$ and $\tilde{x}_t^{i+1}$ have no reason to be similar, since they have been produced using different random numbers. This creates discontinuities in the likelihood estimator $\hat{p}^N(y_{1:T}|\theta)$ as a function of θ, for fixed common random numbers. For this reason, we separate the randomness in the process-generating variables from the resampling step. We consider resampling algorithms explicitly designed to make the particles in both systems as similar as possible.
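To make this separation concrete, the following is a minimal Python sketch, not taken from the paper's implementation, of a bootstrap particle filter written as a deterministic function of the parameter, the process-generating variables and a pluggable resampling routine. The function names rinit (standing for M), rtransition (for F) and dmeasurement (for log g) are placeholders to be supplied by the user, and particles are assumed to be stored as arrays of shape (N, d_x).

```python
import numpy as np

def bootstrap_pf(theta, U, y, rinit, rtransition, dmeasurement, resampling, rng):
    """Bootstrap particle filter written as a deterministic function of theta and
    of the process-generating variables U[0], ..., U[T], so that two runs can
    share the same U. Returns the log-likelihood estimator and final particles."""
    T = len(y)
    N = U[0].shape[0]
    x = rinit(U[0], theta)                     # x_0^k = M(U_0^k, theta)
    w = np.full(N, 1.0 / N)                    # initial weights
    log_likelihood = 0.0
    for t in range(T):
        ancestors = resampling(w, rng)         # a_t^{1:N} ~ r(da | w_t)
        x = rtransition(x[ancestors], U[t + 1], theta)   # x_{t+1}^k = F(x_t^{a_t^k}, U_{t+1}^k, theta)
        logw = dmeasurement(y[t], x, theta)    # log g(y_{t+1} | x_{t+1}^k, theta)
        maxlw = np.max(logw)
        unnormalized = np.exp(logw - maxlw)
        log_likelihood += maxlw + np.log(np.mean(unnormalized))   # log of N^{-1} sum_k g(...)
        w = unnormalized / np.sum(unnormalized)
    return log_likelihood, x, w

def multinomial_resampling(w, rng):
    # satisfies the validity condition P(a_t^k = j) = w_t^j
    return rng.choice(len(w), size=len(w), p=w)
```

Running this function twice with the same U and two different parameter values gives the common-random-numbers baseline; the coupled resampling schemes of the following sections replace the call to resampling by a joint draw of ancestors for the two systems.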

2.2 Coupled resampling and coupled particle filter

We use bold fonts to denote vectors of objects indexed by $k \in 1{:}N$, for instance $(\mathbf{w}_t, \mathbf{x}_t) = (w_t^k, x_t^k)_{k=1}^N$ or $\mathbf{U}_t = U_t^{1:N}$, and we drop the temporal notation. We consider the problem of jointly resampling $(\mathbf{w}, \mathbf{x})$ and $(\tilde{\mathbf{w}}, \tilde{\mathbf{x}})$. A joint distribution on $\{1,\ldots,N\}^2$ is characterized by a matrix $P$ with non-negative entries $P^{ij}$, for $i,j \in \{1,\ldots,N\}$, that sum to one. The value $P^{ij}$ represents the probability of sampling the pair $(i,j)$. We consider the set $\mathcal{J}(\mathbf{w}, \tilde{\mathbf{w}})$ of matrices $P$ such that $P\mathbb{1} = \mathbf{w}$ and $P^T\mathbb{1} = \tilde{\mathbf{w}}$, where $\mathbb{1}$ denotes a column vector of $N$ ones. Pairs $(a, \tilde{a})$ distributed according to $P \in \mathcal{J}(\mathbf{w}, \tilde{\mathbf{w}})$ are such that $\mathbb{P}(a^k = j) = w^j$ and $\mathbb{P}(\tilde{a}^k = j) = \tilde{w}^j$ for all $k$ and $j$. The choice $P = \mathbf{w}\tilde{\mathbf{w}}^T$ corresponds to an independent coupling of $\mathbf{w}$ and $\tilde{\mathbf{w}}$. Sampling from this matrix $P$ is done by sampling $a$ with probabilities $\mathbf{w}$ and $\tilde{a}$ with probabilities $\tilde{\mathbf{w}}$, independently.

Any choice of probability matrix $P \in \mathcal{J}(\mathbf{w}, \tilde{\mathbf{w}})$ leads to a coupled resampling scheme, and to a coupled particle filter which proceeds as follows. The initialization and propagation steps are performed as in the standard particle filter, using common process-generating variables $U_{0:T}$ and the parameter values θ and θ̃ respectively. At each step $t \geq 0$, the resampling step involves computing a matrix $P_t$ in $\mathcal{J}(\mathbf{w}_t, \tilde{\mathbf{w}}_t)$, possibly using all the variables generated thus far. Then the pairs of ancestors $(\mathbf{a}_t, \tilde{\mathbf{a}}_t)$ are sampled from $P_t$. Appendix E contains the algorithmic description. We now investigate particular choices of matrices $P \in \mathcal{J}(\mathbf{w}_t, \tilde{\mathbf{w}}_t)$ with the aim of correlating a pair of particle systems.

2.3 Transport resampling

Intuitively, we want to choose $P \in \mathcal{J}(\mathbf{w}, \tilde{\mathbf{w}})$ such that, upon sampling ancestors from $P$, the resampled locations $\mathbf{x}^{\mathbf{a}}$ and $\tilde{\mathbf{x}}^{\tilde{\mathbf{a}}}$ are as similar as possible. Similarity between locations can be encoded by a distance $d: \mathbb{X} \times \mathbb{X} \to \mathbb{R}_+$, for instance the Euclidean distance in $\mathbb{X} \subset \mathbb{R}^{d_x}$. The expected distance between the resampled particles $x^a$ and $\tilde{x}^{\tilde{a}}$, conditional upon $(\mathbf{w}, \mathbf{x})$ and $(\tilde{\mathbf{w}}, \tilde{\mathbf{x}})$, is given by $\sum_{i=1}^N \sum_{j=1}^N P^{ij} d(x^i, \tilde{x}^j)$. Denote by $D$ the distance matrix with entries $D^{ij} = d(x^i, \tilde{x}^j)$. The optimal transport problem considers, when it exists, the matrix $P^\star \in \mathcal{J}(\mathbf{w}, \tilde{\mathbf{w}})$ that minimizes the expected distance:

$$P^\star = \underset{P \in \mathcal{J}(\mathbf{w}, \tilde{\mathbf{w}})}{\operatorname{argmin}} \; \sum_{i=1}^N \sum_{j=1}^N P^{ij} D^{ij}. \qquad (1)$$

Computing $P^\star$, either exactly or approximately, is the topic of a rich literature. Exact algorithms compute $P^\star$ in order $N^3 \log N$ operations, while recent methods yield accurate approximations $\hat{P}$ in order $N^2$ operations (Cuturi, 2013; Benamou et al., 2015). Computing the distance matrix $D$ and sampling from a generic probability matrix $P$ already cost $N^2$ operations, thus the overall cost is in $N^2$ operations. We denote by $\hat{P}$ the matrix obtained by Cuturi's approximation (Cuturi, 2013), which is described in Appendix A.

Unlike the exact solution $P^\star$, an approximate solution $\hat{P}$ might not belong to $\mathcal{J}(\mathbf{w}, \tilde{\mathbf{w}})$. Directly using such a $\hat{P}$ in a coupled particle filter would result in a bias, for instance in the likelihood estimator. However, we can easily construct a matrix $P \in \mathcal{J}(\mathbf{w}, \tilde{\mathbf{w}})$ that is close to $\hat{P}$. Introduce $u = \hat{P}\mathbb{1}$ and $\tilde{u} = \hat{P}^T\mathbb{1}$, the marginals of $\hat{P}$. We compute a new matrix $P$ as $P = \alpha\hat{P} + (1-\alpha)\,r\tilde{r}^T$ for some $\alpha \in [0,1]$ and some probability vectors $r$ and $\tilde{r}$. To obtain the correct marginals, we solve the system

$$P\mathbb{1} = \alpha u + (1-\alpha)r = \mathbf{w} \;\Leftrightarrow\; r = \frac{\mathbf{w} - \alpha u}{1-\alpha}, \qquad P^T\mathbb{1} = \alpha\tilde{u} + (1-\alpha)\tilde{r} = \tilde{\mathbf{w}} \;\Leftrightarrow\; \tilde{r} = \frac{\tilde{\mathbf{w}} - \alpha\tilde{u}}{1-\alpha}.$$

The requirement that $r$ and $\tilde{r}$ are probability vectors puts a constraint on $\alpha$, namely

$$0 \leq \alpha \leq \min_{i \in 1:N} \left\{ \min\left( \frac{w^i}{u^i}, \frac{\tilde{w}^i}{\tilde{u}^i} \right) \right\}.$$

To make the best use of the transport matrix $\hat{P}$, we select $\alpha$ to attain the upper bound; see Appendix A for discussions on the tuning parameters of this resampling scheme. The computational cost in $N^2$ of transport resampling is potentially prohibitive. However, this cost is independent of the model, and scales linearly with the dimension $d_x$ of the state space if one uses the Euclidean distance to compute $D$. Thus, for complex dynamical systems, the resampling step might still be negligible compared to the propagation and weighting steps, in the overall cost of the particle filter.
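The construction above can be sketched in a few lines; in the following illustration, which is not the paper's implementation, the approximate plan is obtained with plain Sinkhorn iterations in the spirit of Cuturi (2013), the regularization is a fraction of the median distance as in the experiments of Section 3, and the numerical safeguards discussed in Appendix A are omitted. Particle locations are assumed to be arrays of shape (N, d_x).

```python
import numpy as np

def transport_resampling_matrix(w1, w2, x1, x2, eps_frac=0.05, n_iter=100):
    """Entropy-regularized approximation of the transport coupling between
    (w1, x1) and (w2, x2), corrected so that the returned matrix P has
    marginals exactly w1 and w2 (Section 2.3)."""
    D = np.linalg.norm(x1[:, None, :] - x2[None, :, :], axis=-1)   # distance matrix
    eps = eps_frac * np.median(D)             # regularization as a fraction of median(D)
    K = np.exp(-D / eps)
    u = np.ones(len(w1))
    for _ in range(n_iter):                   # Sinkhorn iterations (Cuturi, 2013)
        v = w2 / (K.T @ u)
        u = w1 / (K @ v)
    P_hat = u[:, None] * K * v[None, :]       # approximate plan, marginals close to (w1, w2)
    mu1, mu2 = P_hat.sum(axis=1), P_hat.sum(axis=0)
    # mix with a product of residual distributions so that the marginals become exact
    alpha = min(np.min(w1 / mu1), np.min(w2 / mu2), 1.0)
    if alpha >= 1.0:
        return P_hat
    r1 = (w1 - alpha * mu1) / (1.0 - alpha)
    r2 = (w2 - alpha * mu2) / (1.0 - alpha)
    return alpha * P_hat + (1.0 - alpha) * np.outer(r1, r2)
```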

2.4 Index-coupled resampling

We now introduce another coupled resampling scheme, which we call index-coupled resampling. Unlike transport resampling, its computational cost is linear in $N$. The idea of index-coupled resampling is to maximize the probability of sampling pairs $(a, \tilde{a})$ such that $a = \tilde{a}$, by computing the matrix $P \in \mathcal{J}(\mathbf{w}, \tilde{\mathbf{w}})$ with maximum entries on its diagonal. We first describe how to compute such a matrix, before giving more motivation for this choice of $P$.

The index-coupled matrix $P$ can be computed at a linear cost in $N$ in the following way, which is similar to the construction of maximal couplings (Lindvall, 2002), and which was proposed by Jasra et al. (2015) in the context of multilevel Monte Carlo. First, for all $i \in 1{:}N$, $P$ has to satisfy $P^{ii} \leq \min(w^i, \tilde{w}^i)$, otherwise one of the marginal constraints would be violated. We tentatively write $P = \alpha\,\mathrm{diag}(\mu) + (1-\alpha)R$, where $\nu = \min(\mathbf{w}, \tilde{\mathbf{w}})$ (element-wise), $\alpha = \sum_{i=1}^N \nu^i$, $\mu = \nu/\alpha$, and $R$ is a residual matrix with zeros on the diagonal. Matrices $P$ of this form have maximum trace among all matrices in $\mathcal{J}(\mathbf{w}, \tilde{\mathbf{w}})$. We now look for $R$ such that $P \in \mathcal{J}(\mathbf{w}, \tilde{\mathbf{w}})$ and such that sampling from $P$ can be done linearly in $N$. From the marginal constraints, the matrix $R$ needs to satisfy, for all $i \in 1{:}N$,

$$\nu^i + (1-\alpha)\sum_{j=1}^N R^{ij} = w^i \qquad \text{and} \qquad \nu^i + (1-\alpha)\sum_{j=1}^N R^{ji} = \tilde{w}^i,$$

that is, $r = R\mathbb{1} = (\mathbf{w} - \nu)/(1-\alpha)$ and $\tilde{r} = R^T\mathbb{1} = (\tilde{\mathbf{w}} - \nu)/(1-\alpha)$. Among all the matrices $R$ that satisfy these constraints, the choice $R = r\tilde{r}^T$ is such that we can sample pairs of indices from $R$ by sampling from $r$ and $\tilde{r}$ independently. Thus we define the index-coupled matrix $P$ as

$$P = \alpha\,\mathrm{diag}(\mu) + (1-\alpha)\, r\tilde{r}^T. \qquad (2)$$

We now provide intuition for this choice of matrix $P$. In the coupled particle filter, the same random numbers are used to initialize both particle systems, thus we expect $x_0^k = M(U_0^k, \theta)$ and $\tilde{x}_0^k = M(U_0^k, \tilde\theta)$ to be similar for all $k \in 1{:}N$. Then, the same random number $U_1^k$ is used to compute $x_1^k = F(x_0^{a_0^k}, U_1^k, \theta)$ and $\tilde{x}_1^k = F(\tilde{x}_0^{\tilde{a}_0^k}, U_1^k, \tilde\theta)$. For $x_1^k$ and $\tilde{x}_1^k$ to be similar, we wish to select $x_0^{a_0^k}$ and $\tilde{x}_0^{\tilde{a}_0^k}$ that are similar. This is the case if we sample $a_0^k$ and $\tilde{a}_0^k$ such that $a_0^k = \tilde{a}_0^k$, and motivates the use of a matrix $P$ with maximum diagonal entries, at least for the first resampling step.

At later steps, it is not guaranteed that sampling $a_t^k$ and $\tilde{a}_t^k$ such that $a_t^k = \tilde{a}_t^k$ will result in selecting similar particle locations. We appeal to another phenomenon to justify the use of index-coupled resampling at all steps $t$. Consider two particle locations $x_t$ and $\tilde{x}_t$ that are not similar. For typical functions $F$, if we propagate the pair of particles using common random numbers $U_{t+1}, U_{t+2}, \ldots, U_{t+k}$, then the two particles $x_{t+k}$ and $\tilde{x}_{t+k}$ will become similar as $k$ increases. This can be easily checked for auto-regressive processes, where $F$ can be defined as $(x, U, \theta) \mapsto \theta x + U$, and θ and θ̃ are taken to be similar and less than 1. This contraction property of $F$ is expected to hold in some generality; see Diaconis and Freedman (1999) for a way of analyzing Markov chains based on certain Lipschitz conditions on $F$. Thus, compared to transport resampling, where particles are more likely to be jointly selected if they are similar, index-coupled resampling results in common random numbers being given to pairs of particles for enough consecutive steps to make them similar. The index-coupled resampling scheme, proposed in Jasra et al. (2015), has not been used outside of the context of multilevel Monte Carlo, to the best of our knowledge.
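Sampling from the matrix of Eq. (2) never requires forming $P$ explicitly: with probability $\nu^i/\alpha$ a pair keeps a common index, and otherwise the two indices are drawn independently from the residual distributions $r$ and $\tilde{r}$. The following is a minimal sketch of this procedure (the function name and array conventions are not from the paper).

```python
import numpy as np

def index_coupled_resampling(w1, w2, rng):
    """Sample N pairs of ancestor indices from the index-coupled matrix
    P = alpha * diag(mu) + (1 - alpha) * r1 r2^T of Eq. (2)."""
    N = len(w1)
    nu = np.minimum(w1, w2)                   # element-wise minimum of the weights
    alpha = np.sum(nu)
    if alpha >= 1.0:                          # w1 == w2: every pair keeps a common index
        common = rng.choice(N, size=N, p=w1)
        return common, common.copy()
    mu = nu / alpha
    r1 = (w1 - nu) / (1.0 - alpha)            # residual distributions
    r2 = (w2 - nu) / (1.0 - alpha)
    ancestors1 = np.empty(N, dtype=int)
    ancestors2 = np.empty(N, dtype=int)
    coupled = rng.uniform(size=N) < alpha     # with prob. alpha, the pair shares one index
    n_coupled = int(np.sum(coupled))
    common = rng.choice(N, size=n_coupled, p=mu)
    ancestors1[coupled] = common
    ancestors2[coupled] = common
    ancestors1[~coupled] = rng.choice(N, size=N - n_coupled, p=r1)
    ancestors2[~coupled] = rng.choice(N, size=N - n_coupled, p=r2)
    return ancestors1, ancestors2
```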

2.5 Existing approaches

Various attempts have been made to modify particle filters so that they produce correlated likelihood estimators for different parameter values θ. A detailed review is given in Lee (2008). Some methods, such as Pitt (2002) and DeJong et al. (2007), aim at constructing a continuous likelihood estimator, as a function of θ and for fixed random numbers. We note that neither of our proposed schemes achieves such continuity, as we aim at increasing the correlation between the particle systems in a looser sense. As explained in Lee (2008), some of the proposed methods have a cost in $N^2$ (DeJong et al., 2007), some are not expected to work for large time horizons $T$ (Coquelin et al., 2009), and some are not expected to scale well with the dimension $d_x$ of the latent process (Pitt, 2002). The approach of Lee (2008) is based on tree-based algorithms to tackle multivariate state spaces for a cost in $N \log N$.

We describe the method proposed in Pitt (2002); Malik and Pitt (2011); Deligiannidis et al. (2015), which is based on sorting the particles. Consider first the univariate case, $d_x = 1$. We can sort both particle systems in increasing order of $x^{1:N}$ and $\tilde{x}^{1:N}$ respectively, yielding $(w^{(k)}, x^{(k)})_{k=1}^N$ and $(\tilde{w}^{(k)}, \tilde{x}^{(k)})_{k=1}^N$, where the parentheses indicate that the samples are sorted. Then, we can draw $a^{1:N}$ and $\tilde{a}^{1:N}$ by inverting the empirical cumulative distribution function associated with these sorted samples, using common random numbers. That is, we draw $U_R^{1:N}$ uniformly in $[0,1]$, and compute $a^k = \hat{F}^{-1}(U_R^k)$, where $\hat{F}^{-1}(u) = \min\{i \in 1{:}N : \hat{F}(x^{(i)}) \geq u\}$ and $\hat{F}(x) = \sum_{x^{(k)} \leq x} w^{(k)}$, and likewise for the second system. We might sample $a^k$ and $\tilde{a}^k$ such that $a^k \neq \tilde{a}^k$, but it is still likely that $a^k$ and $\tilde{a}^k$ will be close, and thus that $x^{(a^k)}$ and $\tilde{x}^{(\tilde{a}^k)}$ will be similar.

The method can be extended to multivariate spaces using the Hilbert space-filling curve as mentioned in Deligiannidis et al. (2015), following Gerber and Chopin (2015). That is, we use the pseudo-inverse of the Hilbert curve to map the $d_x$-dimensional particles to the interval $[0,1]$, where they can be sorted in increasing order. We refer to this approach as sorted resampling, and use the implementation provided by the function hilbert_sort in The CGAL Project (2016). The cost of sorted resampling is of order $N \log N$. In the univariate setting, there is a connection between transport and sorted resampling. Indeed, if the samples were not weighted, then sorting the locations $x^{1:N}$ and $\tilde{x}^{1:N}$ and associating each $x^{(k)}$ to $\tilde{x}^{(k)}$ would constitute a solution to the optimal transport problem. Thus, if θ and θ̃ are similar, and if the space of the latent process is low-dimensional, we expect sorted resampling to perform well.
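In the univariate case, the sorted scheme can be sketched as follows, assuming unsorted weight and location arrays; the multivariate extension would replace the argsort by a sort along the Hilbert curve (e.g. CGAL's hilbert_sort), which is not reproduced here.

```python
import numpy as np

def sorted_resampling_1d(w1, x1, w2, x2, rng):
    """Sorted resampling for univariate particles: sort both systems and invert
    the two empirical CDFs with common uniform random numbers."""
    N = len(w1)
    order1, order2 = np.argsort(x1), np.argsort(x2)
    cdf1 = np.cumsum(w1[order1])              # empirical CDF of the sorted first system
    cdf2 = np.cumsum(w2[order2])
    u = rng.uniform(size=N)                   # common random numbers U_R^{1:N}
    # smallest sorted index i with cdf[i] >= u, mapped back to the original labels
    ancestors1 = order1[np.minimum(np.searchsorted(cdf1, u), N - 1)]
    ancestors2 = order2[np.minimum(np.searchsorted(cdf2, u), N - 1)]
    return ancestors1, ancestors2
```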

3 Correlated likelihood estimators and applications

In this section, we explore applications of coupled resampling schemes in situations where positive correlations between the likelihood estimators are beneficial.

3.1 Joint or conditional sampling

˜ for known values of θ If we are interested in estimating two likelihoods p(y1:T |θ) and p(y1:T |θ), ˜ we can run a coupled particle filter as described in Section 2.2. The memory cost of the and θ, overall algorithm is thus in N , and the computational cost is T N , T N log N or T N 2 depending on the coupled resampling scheme. However, as in Section 3.4 below, we might need to correlate ˜ for various values of θ˜ which are not known in advance. pˆN (y1:T |θ) with pˆN (y1:T |θ) To address this situation, we can run a first particle filter given θ, and store all the particles (wt , xt ), ancestors at and random numbers Ut for all t. This yields the estimator pˆN (y1:T |θ). To ˜ of p(y1:T |θ) ˜ that is correlated with the first one, we can run a produce an estimator pˆN (y1:T |θ) second particle filter conditionally on θ˜ and on the variables generated by the first filter, possibly using common process-generating variables. At each resampling step, a probability matrix Pt is ˜ t is then sampled according computed given the variables generated thus far. The ancestry vector a to Pt , conditionally upon the ancestors at in the first filter. Conditional sampling from Pt can be done in order N operations for index-coupled resampling, in order N log N for sorted resampling and in order N 2 for transport resampling. For the index-coupled scheme, the conditional sampling procedure goes as follows, in the notation of Section 2.4. First, sample V 1:N uniformly distributed k ak in [0, 1]. For each k ∈ 1 : N , if V k ≤ ν at /wt t , a ˜kt is set to akt . For all the other indices `, draw a ˜`t independently from r˜t using multinomial resampling. The conditional sampling of a particle filter requires the variables generated by a first filter. A naive implementation has a memory cost in the order of N × T . By carefully storing and resetting the state of the random number generator, one can in principle re-compute the first particle filter


The conditional sampling of a particle filter requires the variables generated by a first filter. A naive implementation has a memory cost of order $N \times T$. By carefully storing and resetting the state of the random number generator, one can in principle re-compute the first particle filter when needed, and thus the memory cost can be reduced to $N$, in exchange for an increase in computational cost and a more sophisticated implementation, as in Jun et al. (2012).

Figure 1: Log-likelihood estimators for various values of θ in the hidden auto-regressive model with dx = 1, T = 100 and N = 128. Left panel: 5 independent estimators obtained by bootstrap particle filters. Right panel: 5 estimators obtained with common random numbers, connected by lines; systematic resampling is used independently for each particle filter. The thicker red line indicates the exact log-likelihood computed using Kalman filters.

3.2 Numerical illustration: correlated likelihood estimators

We first illustrate coupled particle filters for the following hidden auto-regressive model. The process starts as $x_0 \sim \mathcal{N}(0, I_{d_x})$, where $I_{d_x}$ is the identity matrix of dimension $d_x \times d_x$. The transition kernel is defined by $x_t = A x_{t-1} + \mathcal{N}(0, I_{d_x})$, where $A_{ij} = \theta^{|i-j|+1}$, as in Guarniero et al. (2015). Finally, the measurement kernel is defined as $y_t = x_t + \mathcal{N}(0, I_{d_x})$. We consider a sequence of parameter values $\theta_1, \ldots, \theta_K$. We run a standard particle filter given $\theta_1$, and then for each $k \in 2{:}K$, we run a particle filter given $\theta_k$ conditionally upon the variables generated by the previous particle filter given $\theta_{k-1}$.

We start with $d_x = 1$, and generate $T = 100$ observations with θ = 0.95. We then consider a sequence of 20 values of θ between 0.3 and 0.4, and consider estimators of the likelihood at these values with N = 128. Figure 1 shows log-likelihood estimators obtained by independent particle filters, and by particle filters using common random numbers and systematic resampling for each particle system, using a common uniform random variable (CRN+systematic). First note that the log-likelihood is under-estimated, as expected since only the likelihood estimators are unbiased. The use of common random numbers reduces the variations in the log-likelihood estimators, compared to independent estimators, and thus helps in comparing $\log p(y_{1:T}|\theta)$ and $\log p(y_{1:T}|\tilde\theta)$ for nearby values of θ and θ̃. Figure 2 shows the benefit of using coupled resampling schemes on top of common random numbers. There are still discontinuities in the likelihood estimators, but the comparison of $\log p(y_{1:T}|\theta)$ and $\log p(y_{1:T}|\tilde\theta)$ for different values of θ and θ̃ appears to be greatly facilitated.
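As a concrete example, the model ingredients of this illustration can be written in the form expected by the bootstrap filter sketch of Section 2.1; the function and factory names below are placeholders and not part of the paper.

```python
import numpy as np

def make_hidden_ar_model(dx):
    """Ingredients of the hidden auto-regressive model of this section."""
    i, j = np.indices((dx, dx))
    expo = np.abs(i - j) + 1.0
    def rinit(U0, theta):
        return U0                                   # x_0 ~ N(0, I): U0 holds N(0,1) draws
    def rtransition(x, U, theta):
        A = theta ** expo                           # A_{ij} = theta^{|i-j|+1}
        return x @ A.T + U                          # x_t = A x_{t-1} + N(0, I) noise
    def dmeasurement(y, x, theta):
        # log density of y_t = x_t + N(0, I)
        return -0.5 * np.sum((y - x) ** 2, axis=1) - 0.5 * dx * np.log(2.0 * np.pi)
    return rinit, rtransition, dmeasurement
```

With these functions, two calls to the bootstrap filter sketch sharing the same process-generating variables but different values of θ reproduce the common-random-numbers comparison of Figure 1, up to the choice of resampling scheme.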


Figure 2: Log-likelihood estimators for various values of θ in the hidden auto-regressive model with dx = 1, T = 100, and N = 128. Left panel: 5 estimators obtained with common random numbers and sorted resampling; middle panel: index-coupled resampling; right panel: transport resampling. The thicker red line indicates the exact log-likelihood computed using Kalman filters.


Figure 3: Log-likelihood estimators for various values of θ in the hidden auto-regressive model with dx = 5, T = 1,000 and N = 128. Left panel: 5 estimators obtained with common random numbers and sorted resampling; middle panel: index-coupled resampling; right panel: transport resampling. The thicker red line indicates the exact log-likelihood computed using Kalman filters.

Next, we consider the hidden auto-regressive model with $d_x = 5$ and $T = 1{,}000$. We generate data using θ = 0.4, and we still use N = 128 particles. We see in Figure 3 that all the estimators under-estimate the log-likelihood by a significant amount, indicating that more particles would be necessary to obtain precise point-wise estimates. However, the variations of the log-likelihood are still approximately recovered when using common random numbers and coupled resampling schemes. A more quantitative comparison of these three schemes will be given in the following section.

We stress that the proposed coupled resampling schemes are not guaranteed to induce positive correlations between the likelihood estimators at θ and θ̃, and that discontinuities still exist. In an attempt to guarantee positive correlations, we could consider the design of probability matrices $P_t \in \mathcal{J}(\mathbf{w}_t, \tilde{\mathbf{w}}_t)$ at each step $t$ that directly try to maximize the correlation between the likelihood estimators, or at least between the incremental likelihood estimators, $\hat{p}^N(y_t|y_{1:t-1},\theta) = N^{-1}\sum_{k=1}^N g(y_t|x_t^k,\theta)$ and $\hat{p}^N(y_t|y_{1:t-1},\tilde\theta) = N^{-1}\sum_{k=1}^N g(y_t|\tilde{x}_t^k,\tilde\theta)$. This appears to be challenging in general, and we leave it for future work.

3.3 Application: finite difference estimators

To present a more quantitative comparison of the proposed resampling schemes, we consider the estimation of the log-likelihood gradient, also called the score and denoted by $\nabla_\theta \log p(y_{1:T}|\theta)$. We focus on univariate parameters for simplicity. A finite difference estimator of the score at the value θ is given by (Asmussen and Glynn, 2007)

$$D_h^N(\theta) = \frac{\log \hat{p}^N(y_{1:T}|\theta+h) - \log \hat{p}^N(y_{1:T}|\theta-h)}{2h},$$

where $h > 0$ is a perturbation parameter. If $h$ is small, the variances of the two log-likelihood estimators can be assumed approximately equal, and we obtain

$$\mathbb{V}\left[D_h^N(\theta)\right] \approx (2h^2)^{-1} \, \mathbb{V}\left[\log \hat{p}^N(y_{1:T}|\theta)\right] \, \left(1 - \rho_h^N(\theta)\right),$$

where $\rho_h^N(\theta)$ denotes the correlation between $\log \hat{p}^N(y_{1:T}|\theta+h)$ and $\log \hat{p}^N(y_{1:T}|\theta-h)$. Thus, compared to using independent estimators with the same variance, the variance of the finite difference estimator is divided by the factor $1/(1 - \rho_h^N(\theta))$. We refer to this number as the variance reduction factor. It corresponds to how many times more particles would be needed to attain the same accuracy using independent particle filters. The bias of the gradient estimator is unchanged, since the proposed resampling schemes do not change the marginal distributions of each particle filter.

We consider the hidden auto-regressive model of the previous section, with $d_x = 5$ and $T = 1{,}000$ observations generated with θ = 0.4, and the finite difference estimator of the score at θ = 0.3. We run $R = 1{,}000$ independent experiments, with various values of $h$ and resampling schemes, common random numbers and $N = 128$ particles. The variance reduction factors are shown in Figure 4. First, we see that common random numbers help in general, especially for small values of $h$. Secondly, the gains can be orders of magnitude higher when using transport and index-coupled resampling. These added gains, compared to systematic or sorted resampling, disappear when $h$ increases. We note that sorted resampling does not seem to add much compared to systematic resampling in this five-dimensional setting. Index-coupled resampling appears to perform better than transport resampling for small values of $h$, perhaps due to the approximation introduced for computational reasons in the transport problem. The tuning parameters of transport resampling were set to ε = 5% × median(D) and α = 99% (see Appendix A).



Figure 4: Variance reduction factor when using coupled resampling schemes to estimate the score at θ = 0.3. Results for R = 1,000 experiments in the hidden auto-regressive model with dx = 5 and T = 1,000 observations generated with θ = 0.4. The variance reduction factor is defined as $1/(1 - \rho_h^N(\theta))$, where $\rho_h^N(\theta)$ is the correlation between the log-likelihood estimators at θ + h and θ − h, computed over the R experiments.
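The quantities reported in Figure 4 can be computed from R pairs of correlated log-likelihood estimates, however they were produced (for instance by the coupled particle filters above); the following sketch, with placeholder names, shows the corresponding finite difference estimators and the empirical variance reduction factor.

```python
import numpy as np

def finite_difference_score(ll_plus, ll_minus, h):
    """Finite difference estimators of the score and the empirical variance
    reduction factor 1 / (1 - rho), from R pairs of correlated log-likelihood
    estimates at theta + h (ll_plus) and theta - h (ll_minus)."""
    ll_plus = np.asarray(ll_plus, dtype=float)
    ll_minus = np.asarray(ll_minus, dtype=float)
    score_estimates = (ll_plus - ll_minus) / (2.0 * h)
    rho = np.corrcoef(ll_plus, ll_minus)[0, 1]     # correlation across the R pairs
    return score_estimates, 1.0 / (1.0 - rho)
```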

3.4 Application: correlated particle marginal MH

3.4.1 Algorithm

As a second application we now turn our attention to the particle marginal MH algorithm (Andrieu et al., 2010), a variant of the pseudo-marginal approach (Andrieu and Roberts, 2009) which can be used for parameter inference in state space models. Denoting the prior parameter distribution by $p(d\theta)$, the algorithm generates a Markov chain $(\theta^{(i)})_{i\geq 1}$ targeting the posterior distribution $p(d\theta|y_{1:T}) \propto p(d\theta)\, p(y_{1:T}|\theta)$. At iteration $i \geq 1$, a parameter θ̃ is proposed from a Markov kernel $q(d\theta|\theta^{(i-1)})$, and accepted as the next state of the chain $\theta^{(i)}$ with probability

$$\min\left(1, \; \frac{\hat{p}^N(y_{1:T}|\tilde\theta)\, p(\tilde\theta)\, q(\theta^{(i-1)}|\tilde\theta)}{\hat{p}^N(y_{1:T}|\theta^{(i-1)})\, p(\theta^{(i-1)})\, q(\tilde\theta|\theta^{(i-1)})}\right), \qquad (3)$$

where $\hat{p}^N(y_{1:T}|\tilde\theta)$ is the likelihood estimator produced by a filter given θ̃. In order for this algorithm to mimic the ideal underlying MH algorithm, the ratio of likelihood estimators must be an accurate approximation of the ratio of exact likelihood evaluations (Andrieu and Vihola, 2015). By taking the logarithm of the ratio, this setting appears very similar to that of Section 3.3, so that we can expect important gains from coupled resampling schemes, particularly if $q(d\theta|\theta^{(i-1)})$ is a local proposal kernel. The benefit of correlated likelihood estimators within pseudo-marginal algorithms is the topic of recent works (Deligiannidis et al., 2015; Dahlin et al., 2015), following Lee and Holmes

(2010) in the discussion of Andrieu et al. (2010). We describe a correlated particle marginal MH algorithm, and associated conditions on the coupled resampling scheme.

The algorithm works on the joint space of the parameter θ, the process-generating variables $U_t$ for all $t \in 0{:}T$, and the ancestor variables $a_t$ for all $t \in 0{:}T-1$. Denote by ϕ the distribution of $U_{0:T}$, assumed to be standard multivariate normal in the numerical experiments, and let φ be a Markov kernel leaving ϕ invariant. Consider any iteration $i \geq 1$ of the algorithm; the current state of the Markov chain contains $\theta^{(i-1)} = \theta$, $U_{0:T}^{(i-1)} = U_{0:T}$ and $a_{0:T-1}^{(i-1)} = a_{0:T-1}$, and the associated likelihood estimator is $\hat{p}^N(y_{1:T}|\theta)$. Note that the particles $(\mathbf{w}_t, \mathbf{x}_t)$ are deterministic given θ, $U_{0:T}$ and $a_{0:T-1}$. The algorithm proceeds in the following way; a schematic sketch is given after the three steps.

1. A parameter value is proposed: $\tilde\theta \sim q(d\tilde\theta|\theta)$, as well as new process-generating variables: $\tilde{U}_{0:T} \sim \phi(d\tilde{U}_{0:T}|U_{0:T})$.

2. A new particle filter is run given θ̃, using $\tilde{U}_{0:T}$ and conditionally upon the current particle filter. That is, at each resampling step, a matrix $P_t$ is computed using $(\mathbf{w}_t, \mathbf{x}_t)$ and $(\tilde{\mathbf{w}}_t, \tilde{\mathbf{x}}_t)$, and the new ancestors $\tilde{\mathbf{a}}_t$ are sampled conditionally upon $\mathbf{a}_t$, as in Section 3.1. The algorithm produces ancestor variables $\tilde{a}_{0:T-1}$ and a likelihood estimator $\hat{p}^N(y_{1:T}|\tilde\theta)$.

3. With the probability given by Eq. (3), the chain moves to the parameter value θ̃, and the variables $\tilde{U}_{0:T}$, $\tilde{a}_{0:T-1}$ and $\hat{p}^N(y_{1:T}|\tilde\theta)$ are kept in memory for the next iterations. Otherwise, the current state of the chain is unchanged.
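A minimal sketch of one iteration follows, using the auto-regressive kernel φ employed in the experiments below; conditional_coupled_pf stands for the conditional run of step 2 (returning the new ancestors and the likelihood estimator) and log_prior for the log prior density, both of which are placeholders. The random walk proposal is symmetric, so the $q$ terms of Eq. (3) cancel.

```python
import numpy as np

def correlated_pmmh_step(state, y, rho, prop_sd, log_prior, conditional_coupled_pf, rng):
    """One iteration of the correlated particle marginal MH algorithm.
    `state` holds (theta, U, ancestors, log_lik) from the previous iteration."""
    theta, U, ancestors, log_lik = state
    # 1. propose a parameter and refresh the process-generating variables
    theta_prop = theta + prop_sd * rng.normal()
    U_prop = rho * U + np.sqrt(1.0 - rho ** 2) * rng.normal(size=U.shape)
    # 2. run a particle filter at theta_prop, conditionally on the current one
    ancestors_prop, log_lik_prop = conditional_coupled_pf(theta_prop, U_prop, ancestors, y)
    # 3. accept or reject with the probability of Eq. (3)
    log_ratio = log_lik_prop + log_prior(theta_prop) - log_lik - log_prior(theta)
    if np.log(rng.uniform()) < log_ratio:
        return (theta_prop, U_prop, ancestors_prop, log_lik_prop), True
    return (theta, U, ancestors, log_lik), False
```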

3.4.2 Validity

We give a sufficient condition on the coupled resampling scheme for the correlated particle marginal MH algorithm to target the same distribution as the standard particle marginal MH algorithm. A direct consequence is that the marginal distribution on θ of the extended target distribution is precisely the posterior distribution $p(\theta|y_{1:T})$.

Denote by $r(da_t|\mathbf{w}_t)$ a probability distribution on the ancestors $a_t$ at step $t$. Since the weights $\mathbf{w}_t$ are deterministic given $a_{0:t-1}$, $U_{0:t}$ and θ, we can write equivalently $r(da_t|a_{0:t-1}, U_{0:t}, \theta)$. In the proposed algorithm, $(\mathbf{w}_t, \mathbf{x}_t)$ and $(\tilde{\mathbf{w}}_t, \tilde{\mathbf{x}}_t)$ are used to compute a probability matrix $P_t$, and then $\tilde{a}_t$ are sampled conditionally on $a_t$. We denote that distribution by $c(\tilde{a}_t|\tilde{a}_{0:t-1}, \tilde{U}_{0:t}, \tilde\theta, a_{0:t}, U_{0:t}, \theta)$.

Lemma 3.1. Assume that the marginal and conditional resampling distributions, respectively $r$ and $c$, associated with the coupled resampling scheme, are such that

$$r(a_t|a_{0:t-1}, U_{0:t}, \theta)\, c(\tilde{a}_t|\tilde{a}_{0:t-1}, \tilde{U}_{0:t}, \tilde\theta, a_{0:t}, U_{0:t}, \theta) = r(\tilde{a}_t|\tilde{a}_{0:t-1}, \tilde{U}_{0:t}, \tilde\theta)\, c(a_t|a_{0:t-1}, U_{0:t}, \theta, \tilde{a}_{0:t}, \tilde{U}_{0:t}, \tilde\theta). \qquad (4)$$

Furthermore, assume that under the marginal resampling distribution $r$, $\mathbb{P}(a_t^k = j) = w_t^j$ for all $k$ and $j$ in $1{:}N$. Then the Markov kernel defined on θ, $U_{0:T}$ and $a_{0:T-1}$ by the correlated particle marginal MH algorithm has the same invariant distribution as the standard particle marginal MH.

The proof is provided in Appendix B, where we also show that the conditions are satisfied for independent, sorted and index-coupled resampling, as well as for transport resampling, under a slight modification of the scheme.

3.4.3 Numerical experiments

We consider again the hidden auto-regressive model with $d_x = 5$ and $T = 1{,}000$ observations generated with θ = 0.4. We specify a standard normal prior on θ. We compare a standard particle marginal MH algorithm and coupled versions with common random numbers and systematic, sorted, index-coupled and transport resampling (for the latter, we choose ε = 10% × median(D) and α = 95%, see Appendix A). The distribution ϕ of the process-generating variables is a multivariate normal distribution, and we choose the kernel φ to be auto-regressive: $\tilde{U} = \rho U + \sqrt{1-\rho^2}\,\mathcal{N}(0, I)$. We refer to Deligiannidis et al. (2015) for the choice of ρ, and we take ρ = 0.999. All algorithms use the same proposal distribution for θ, taken to be a normal random walk with a standard deviation of 0.01. We run each algorithm 5 times for 5,000 iterations, starting the chain from a uniform variable in [0.37, 0.41], taken to be in the bulk of the posterior distribution.

Figure 5 shows the average acceptance rates, obtained for various values of N. We see that index-coupled resampling and transport resampling lead to significant gains in acceptance rates, over the existing methods. The acceptance rate cannot be taken as the only performance measure, since the various algorithms have different proposal distributions on the extended space of θ, $U_{0:T}$ and $a_{0:T-1}$. We compare instead the approximations of the posterior density in Figure 6. The ground truth is obtained by a long run of MH using Kalman filters. We see that the correlated approach with index-coupled resampling enables consistent approximations with only N = 256 and 5,000 iterations in this example, whereas the standard algorithm with independent random numbers would require significantly more iterations.



Figure 5: Average acceptance rate (in %) of particle marginal MH algorithms, with various resampling schemes and numbers of particles. Results obtained over R = 5 experiments and 5, 000 iterations, in the hidden auto-regressive model with dx = 5 and T = 1, 000 observations.


Figure 6: Estimation of the posterior density based on particle marginal MH, with independent random numbers (left) and correlated random numbers with index-coupled resampling (right). Results obtained over R = 5 experiments and 5, 000 iterations, with N = 256 particles, in the hidden auto-regressive model with dx = 5 and T = 1, 000 observations. The filled area represents the true posterior distribution, obtained with a long run of MH using Kalman filters.

4 A new smoothing algorithm

4.1 Smoothing and debiasing techniques

In this section, the parameter θ is fixed, and we consider the problem of computing expectations with respect to the smoothing distribution $\pi(x_{0:T}) := p(x_{0:T}|y_{1:T},\theta)$. We drop θ from the notation, and denote by $h$ a generic test function on $\mathbb{X}^{T+1}$, of which we want to compute the expectation with respect to π, denoted by $\pi(h)$; we will later focus on the identity function, yielding the smoothing mean.

We build upon the debiasing technique of Glynn and Rhee (2014), which follows a series of unbiased estimation techniques (Kuti, 1982; Rychlik, 1990, 1995; McLeish, 2012; Rhee and Glynn, 2012; Rhee, 2013); see also Vihola (2015). The debiasing technique of Glynn and Rhee (2014), hereinafter referred to as the Rhee–Glynn estimator, uses the kernel of a Markov chain with invariant distribution π in order to produce unbiased estimators of π(h). A Markov kernel leaving the smoothing distribution invariant has been proposed in Andrieu et al. (2010), referred to as the conditional particle filter (CPF); it will be described below. If the transition density function of the model is analytically available, then improvements upon the conditional particle filter have been proposed, such as backward sampling (Whiteley, 2010) and ancestor sampling (Lindsten et al., 2014). The conditional particle filter kernel has been extensively studied in Chopin and Singh (2015); Andrieu et al. (2013); Lindsten et al. (2015). The use of conditional particle filters within the Rhee–Glynn estimator naturally leads to the problem of coupling two conditional particle filters. We will therefore rely on index-coupled resampling. The difference with the setting of Section 3 is that the two particle filters are conditioned on the same parameter θ, but on different reference trajectories, as will be described below.

Popular smoothing techniques include the fixed-lag smoother and the forward filtering backward smoother; see Doucet and Johansen (2011); Lindsten and Schön (2013); Kantas et al. (2015) for recent reviews. Our proposed method sets itself apart in the following way. Instead of forming a smoothing estimator based on a long run of a Markov chain (as in iterated conditional particle filters) or on a large population of particles (as in particle-based smoothers), the proposed estimator is an average of R independent and unbiased estimators of π(h), denoted by $H^{(r)}$ for $r \in 1{:}R$. These estimators have a finite variance and finite computational cost under mild conditions. There are two direct consequences of the fact that the proposed estimator is obtained as an average of independent unbiased estimators.

• Complete parallelization of the computation of the terms $H^{(r)}$ is possible. On the contrary, particle-based methods are not entirely parallelizable due to the resampling step, which might become a bottleneck when large numbers of particles are used (Murray et al., 2015; Lee and Whiteley, 2015a). Markov chain Monte Carlo methods, such as the conditional particle filter, are intrinsically iterative.

• Error estimators can be easily constructed based on the central limit theorem, allowing an empirical assessment of the performance of the estimator. In contrast, error estimators for particle smoothers have not been proposed, although see Lee and Whiteley (2015b) for a variance estimator in the context of particle filters, and Flegal et al. (2008) on computing the variance of Markov chain Monte Carlo estimators, which could potentially be applied to the conditional particle filter.

We note that unbiased estimation of smoothing functionals can potentially be done by perfect sampling, as explored in Lee et al. (2014).

4.2 Proposed smoother

We first describe the conditional particle filter and a coupled version of it, which are building blocks for the proposed method. A trajectory $x_{0:T}$ will also be denoted briefly by $X$, and the process-generating variables $U_{0:T}$ by $U$, assumed to follow some distribution ϕ, as in Section 3.4.

Executing the particle filter generates $N$ trajectories $x_{0:T}^k$ with weights $w_T^k$. At the final step $T$, we can draw an index $b_T \in 1{:}N$ with probabilities given by $w_T^{1:N}$, and retrieve the corresponding trajectory by tracing back the ancestor variables. Given $U$, we write $X \sim \mathrm{PF}(U)$ for this sampling procedure, where the randomness comes from the resampling steps.

Given a trajectory $X = x_{0:T}$, referred to as the reference trajectory, and process-generating variables $U \sim \varphi$, the conditional particle filter (Andrieu et al., 2010) defines a distribution on the space of trajectories, as follows. At the initial step, we compute $x_0^k = M(U_0^k, \theta)$ for all $k \in 1{:}N-1$, we set $x_0^N = x_0$, and $w_0^k = N^{-1}$ for all $k$. At each step $t$, we draw $a_t^{1:N-1} \sim r(da_t^{1:N-1}|w_t^{1:N})$ from a multinomial distribution, and set $a_t^N = N$; other resampling schemes can be implemented, as detailed in Chopin and Singh (2015). The propagation step computes $x_{t+1}^k = F(x_t^{a_t^k}, U_{t+1}^k, \theta)$ for $k \in 1{:}N-1$ and sets $x_{t+1}^N = x_{t+1}$. The weighting step computes $w_{t+1}^k \propto g(y_{t+1}|x_{t+1}^k, \theta)$, for all $k \in 1{:}N$. The procedure guarantees that the reference trajectory $x_{0:T}$ is among the trajectories produced by the algorithm, as the $N$-th trajectory at each step $t$. At the final step, we draw $b_T$ with probabilities $w_T$ and retrieve the corresponding trajectory, denoted $X'$. We succinctly write $X' \sim \mathrm{CPF}(X, U)$.

The coupled conditional particle filter (CCPF) acts similarly, producing a pair of trajectories $(X', \tilde{X}')$ given a pair of reference trajectories $X = x_{0:T}$ and $\tilde{X} = \tilde{x}_{0:T}$. We first draw $U \sim \varphi$. The initialization and propagation steps follow the conditional particle filter for each system, using common random numbers $U$. For the resampling step, we compute a probability matrix $P_t \in \mathcal{J}(\mathbf{w}_t, \tilde{\mathbf{w}}_t)$ as in Section 2, based on the variables generated thus far, and we sample pairs of ancestor variables $(a_t^k, \tilde{a}_t^k)_{k=1}^{N-1}$. We then set $a_t^N = N$ and $\tilde{a}_t^N = N$. At the final step, we draw a pair of indices $(b_T, \tilde{b}_T)$ from $P_T$, a probability matrix in $\mathcal{J}(\mathbf{w}_T, \tilde{\mathbf{w}}_T)$, and retrieve the corresponding pair of trajectories. We write this procedure succinctly as $(X', \tilde{X}') \sim \mathrm{CCPF}(X, \tilde{X}, U)$. The full description of the algorithm is given in Appendix E.

Using the coupled conditional particle filter, we can now generate a pair of sequences $(X^{(n)})_{n\geq 0}$ and $(\tilde{X}^{(n)})_{n\geq 0}$ that both converge to the smoothing distribution, and that have a non-zero probability of being exactly equal after a number of steps. The construction is as follows. We first draw two trajectories $X^{(0)}$ and $\tilde{X}^{(0)}$ from two independent particle filters, $\mathrm{PF}(U^{(0)})$ and $\mathrm{PF}(\tilde{U}^{(0)})$, with $U^{(0)} \sim \varphi$, $\tilde{U}^{(0)} \sim \varphi$. We apply one step of the conditional particle filter to the first trajectory: $U^{(1)} \sim \varphi$ and $X^{(1)} \sim \mathrm{CPF}(X^{(0)}, U^{(1)})$. Then, for all $n \geq 2$, we draw $(X^{(n)}, \tilde{X}^{(n-1)}) \sim \mathrm{CCPF}(X^{(n-1)}, \tilde{X}^{(n-2)}, U^{(n)})$, where $U^{(n)} \sim \varphi$. The resulting chains are such that:

• marginally, $(X^{(n)})_{n\geq 0}$ and $(\tilde{X}^{(n)})_{n\geq 0}$ have the same distributions as if they were generated independently by conditional particle filters;

• for each $n \geq 0$, $X^{(n)}$ has the same distribution as $\tilde{X}^{(n)}$, since the sequence $(U^{(n)})_{n\geq 0}$ is independent and identically distributed;

• under mild conditions (see details in the next section), at each iteration $n \geq 2$, there is a non-zero probability that $X^{(n)} = \tilde{X}^{(n-1)}$. We refer to this event as a meeting, and introduce the meeting time τ, defined as $\inf\{n \geq 2 : X^{(n)} = \tilde{X}^{(n-1)}\}$.

Using this construction, the Rhee–Glynn smoothing estimator is defined as

$$H = h(X^{(0)}) + \sum_{n=1}^{\tau} \left( h(X^{(n)}) - h(\tilde{X}^{(n-1)}) \right), \qquad (5)$$

and is an unbiased estimator of π(h), under conditions given in the next section. The full procedure is described in Algorithm 1.

Algorithm 1 Rhee–Glynn smoothing estimator.

• Draw $U^{(0)} \sim \varphi$ and $X^{(0)} \sim \mathrm{PF}(U^{(0)})$; draw $U^{(1)} \sim \varphi$, and draw $X^{(1)} \sim \mathrm{CPF}(X^{(0)}, U^{(1)})$.
• Draw $\tilde{U}^{(0)} \sim \varphi$ and $\tilde{X}^{(0)} \sim \mathrm{PF}(\tilde{U}^{(0)})$.
• Compute $\Delta^{(0)} = h(X^{(0)})$ and $\Delta^{(1)} = h(X^{(1)}) - h(\tilde{X}^{(0)})$; set $H = \Delta^{(0)} + \Delta^{(1)}$.
• For $n = 2, 3, \ldots$,
  – Draw $U^{(n)} \sim \varphi$ and $(X^{(n)}, \tilde{X}^{(n-1)}) \sim \mathrm{CCPF}(X^{(n-1)}, \tilde{X}^{(n-2)}, U^{(n)})$.
  – Compute $\Delta^{(n)} = h(X^{(n)}) - h(\tilde{X}^{(n-1)})$; set $H \leftarrow H + \Delta^{(n)}$.
  – If $X^{(n)} = \tilde{X}^{(n-1)}$, then $n$ is the meeting time τ: exit the loop.
• Return $H$.

and is an unbiased estimator of π(h), under conditions given in the next section. The full procedure is described in Algorithm 1.

4.3

Theoretical properties

We give assumptions under which the Rhee–Glynn smoothing estimator H in Eq. (5) has expectation π(h) and a finite variance. First, we put an upper-bound on the measurement density of the model. This bound limits the influence of the reference trajectory in the conditional particle filter. Assumption 1. The measurement density of the model is bounded from above: ∃¯ g > 0,

∀y ∈ Y,

∀x ∈ X, 19

g(y|x, θ) ≤ g¯.

Next, we put a technical condition on the coupled resampling scheme. Assumption 2. The resampling probability matrix P , constructed from the weight vectors w and ˜ is such that w, ∀i ∈ {1, . . . , N }, P ii ≥ wi w ˜i. ˜ then P is a diagonal matrix with entries given by w. Furthermore, if w = w, One can check that this condition holds for independent and index-coupled resampling schemes. The second part of Assumption 2 ensures that if two reference trajectories are equal, an application of the coupled conditional particle filter returns two identical trajectories. We first state a result on the probability of meeting in one step of the coupled conditional particle filter. Lemma 4.1. Under Assumptions 1 and 2, there exists ε > 0 such that ∀X ∈ XT +1 ,

˜ ∈ XT +1 , ∀X

˜ 0 |X, X) ˜ ≥ ε, P(X 0 = X

˜ 0 ) ∼ CCPF(X, X, ˜ U ) and U ∼ ϕ. Furthermore, if X = X, ˜ then X 0 = X ˜ 0 almost where (X 0 , X surely. The proof of Lemma 4.1 is given in Appendix C.1. The constant ε depends on N and T , and on the coupled resampling scheme being used. Lemma 4.1 can be used, together with the coupling inequality (Lindvall, 2002), to prove the ergodicity of the conditional particle filter kernel, which is akin to the approach of Chopin and Singh (2015). The coupling inequality states that ˜ (n−1) is less than 2P(τ > n), the total variation distance between the distributions of X (n) and X ˜ (n−1) } is the meeting time. By considering the case X ˜ (0) ∼ π, where τ = inf{n ≥ 2 : X (n) = X ˜ (n) follows π at each step n, and we obtain a bound for the total variation distance between the X distribution of X (n) and π. Using Lemma 4.1, we can bound the probability P(τ > n) from above by (1 − ε)n , as is done the proof of Theorem 4.1 below. This also tells us that the computational cost of the proposed estimator has a finite expectation. We will see in experiments that the meeting can occur in very few steps, including in real-world examples. Assumption 3. Let (X (n) )n≥0 be a Markov chain generated by the conditional particle filter. The test function h is such that h i E h(X (n) ) −−−→ π(h). n→∞

Furthermore, there exists δ > 0 and C < ∞ such that h

i

∀n ≥ n0 , E h(X (n) )2+δ ≤ C. This assumption relates to the validity of the iterated conditional particle filter to estimate π(h), addressed under general assumptions in Chopin and Singh (2015); Andrieu et al. (2013); 20

Lindsten et al. (2015). Up to the term δ > 0, which can be arbitrarily small, the assumption is a requirement if we want to estimate π(h) using iterated conditional particle filters while ensuring a finite variance. Our main result states that the proposed estimator is unbiased and has a finite variance. Similar results can be found in Theorem 1 in Rhee (2013), Theorem 2.1 in McLeish (2012), Theorem 7 in Vihola (2015) and in Glynn and Rhee (2014).

Theorem 4.1. Under Assumptions 1-2-3, the Rhee–Glynn smoothing estimator $H$, given in Eq. (5) and Algorithm 1, is an unbiased estimator of π(h) with

$$\mathbb{E}[H^2] = \sum_{n=0}^{\infty} \mathbb{E}\left[(\Delta^{(n)})^2\right] + 2 \sum_{n=0}^{\infty} \sum_{\ell=n+1}^{\infty} \mathbb{E}\left[\Delta^{(n)} \Delta^{(\ell)}\right] < \infty,$$

where $\Delta^{(0)} = h(X^{(0)})$ and, for $n \geq 1$, $\Delta^{(n)} = h(X^{(n)}) - h(\tilde{X}^{(n-1)})$.

The proof is given in Appendix C.2. The theorem uses univariate notation for $H$ and $\Delta^{(n)}$, but we note that the Rhee–Glynn smoother can be applied to estimate multivariate smoothing functionals. In the numerical experiments, we will for instance estimate jointly all the smoothing means $\mathbb{E}[x_{0:T}|y_{1:T}]$. The above theorem can thus be understood as holding component-wise.

4.4 Practical considerations

The unbiasedness of H comes with the immediate benefit that we can produce H (1) , . . . , H (R) in parallel, for R ∈ N, and compute the average R X ¯ = 1 H (r) . H R r=1

(6)

The estimator $\bar{H}$ is akin to the sum estimator in Rhee and Glynn (2013); Vihola (2015). It is unbiased, converges with the standard Monte Carlo rate in $R$, and a confidence interval can be constructed from the central limit theorem. As mentioned in Glynn and Rhee (2014), we can appeal to Glynn and Whitt (1992) to obtain a central limit theorem parameterized by the computational budget instead of the number of samples. This takes into account the fact that more estimators can be obtained, in a given wall-clock time, if each of them has a smaller cost. Thus the overall variance of the average $\bar{H}$ obtained for a fixed budget exhibits a trade-off between the variance and the cost of each $H$, summarized in the asymptotic efficiency factor $E[\tau]\, V[H]$.

Theorem 4.1 ensures the validity of the proposed smoother for a broad class of state space models. The variance and the cost of the proposed estimator are tied to the number of steps before the pair of trajectories meets. Lemma 4.1 and the coupling inequality (Lindvall, 2002) indicate that if the meeting time occurs rapidly, then the underlying conditional particle filter kernel mixes rapidly. Conversely, if the kernel mixes slowly, the coupling inequality states that the probability of meeting is small. Thus, the proposed method is expected to work in the same situations where the conditional particle filter works. Likewise, any improvement in the conditional particle filter, such as the use of ancestor sampling (Lindsten et al., 2014), directly translates into a faster occurrence of the meeting time, and thus into a smaller computational cost and variance. Therefore, the proposed method can be seen as a framework to parallelize conditional particle filters and as a way of obtaining reliable confidence intervals.

In Appendix D, we provide numerical results investigating the behaviour of the Rhee–Glynn smoothing estimator as a function of the number of particles $N$, the resampling scheme and the use of ancestor sampling, in a hidden auto-regressive model. We find a trade-off for the choice of $N$: too small a value leads to exceedingly long meeting times, but setting $N$ too large is inefficient compared to increasing the number $R$ of independent estimators, for a fixed budget. The use of ancestor sampling significantly mitigates the degeneracy issue with respect to the time horizon $T$. Finally, we find that the use of index-coupled resampling is crucial for the meeting to occur in a small number of steps.

For a fixed computational budget, the only tuning parameter is the number of particles $N$, which implicitly sets the number of independent estimators $R$ that can be obtained within the budget. The computational cost of producing an unbiased estimator $H$ is of order $NT \times E[\tau]$, and the expectation of $\tau$ is seen empirically to decrease with $N$, so that the choice of $N$ is not obvious; in practice we recommend choosing $N$ large enough for the meeting to occur within a few steps, but other considerations such as memory cost could be taken into account. The memory cost for each estimator is of order $T + N \log N$ on average (Jacob et al., 2015). This memory cost holds when using ancestor sampling (Lindsten et al., 2014), whereas backward sampling (Whiteley, 2010) results in a memory cost of $NT$.
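To make the averaging of Eq. (6) and the associated confidence interval concrete, the following minimal sketch summarizes $R$ independent estimators; it is our own illustrative code, not part of the paper's implementation.

```python
import numpy as np

def summarize(H, tau):
    """Given R independent Rhee-Glynn estimators H[r] and their meeting times
    tau[r], return the average of Eq. (6), a CLT-based confidence interval,
    and the asymptotic efficiency factor E[tau] * V[H] discussed above."""
    H, tau = np.asarray(H, dtype=float), np.asarray(tau, dtype=float)
    R = H.shape[0]
    H_bar = H.mean()
    sigma_hat = H.std(ddof=1)
    ci = (H_bar - 2 * sigma_hat / np.sqrt(R), H_bar + 2 * sigma_hat / np.sqrt(R))
    efficiency_factor = tau.mean() * H.var(ddof=1)
    return H_bar, ci, efficiency_factor

# toy usage with made-up numbers, only to show the calling convention
H_bar, ci, eff = summarize(H=[0.9, 1.1, 1.0, 0.95], tau=[3, 5, 4, 4])
```

The $R$ estimators themselves can be produced on separate cores or machines, since they are independent.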

4.5 Variance reduction

The variance of the proposed estimator can first be reduced by a simple Rao–Blackwellization argument. In the $n$-th term of the sum in Eq. (5), the random variable $h(X^{(n)})$ is obtained by applying the test function $h$ to a trajectory drawn among $N$ trajectories, say $x_{0:T}^{1:N}$, with probabilities $w_T^{1:N}$. Thus the random variable $\sum_{k=1}^{N} w_T^k\, h(x_{0:T}^k)$ is a conditional expectation of $h(X^{(n)})$ given $x_{0:T}^{1:N}$ and $w_T^{1:N}$, which has the same expectation as $h(X^{(n)})$. We can replace in $H$ any term $h(X^{(n)})$ or $h(\tilde{X}^{(n)})$ by similar conditional expectations, thereby reducing the variance of $H$ without changing its expectation. This enables the use of all the particles generated by the conditional particle filters.

A further variance reduction can be achieved in the following way. Let $M, m$ be two integers


such that $M > m \geq 0$. Define
\[ H_{m,M} = h(X^{(m)}) + \sum_{n=m+1}^{M} \big( h(X^{(n)}) - h(\tilde{X}^{(n-1)}) \big) \tag{7} \]
\[ \phantom{H_{m,M}} = h(X^{(M)}) + \sum_{n=m}^{M-1} \big( h(X^{(n)}) - h(\tilde{X}^{(n)}) \big). \tag{8} \]

We have $E[H_{m,M}] = E[h(X^{(M)})]$ by Eq. (8), using the fact that $X^{(n)}$ and $\tilde{X}^{(n)}$ have the same distribution. Furthermore, $E[h(X^{(M)})]$ goes to $\pi(h)$ as $M \to \infty$ under Assumption 3. We consider the estimator $H_{m,\infty}$, which can be computed in finite time as follows. We run Algorithm 1 until step $\max(\tau, m)$. If $\tau \leq m+1$, from Eq. (7), $H_{m,\infty} = h(X^{(m)})$ almost surely, since $X^{(n)} = \tilde{X}^{(n-1)}$ for all $n \geq m+1$. If $\tau > m+1$, then $H_{m,\infty} = h(X^{(m)}) + \sum_{n=m+1}^{\tau-1} \big( h(X^{(n)}) - h(\tilde{X}^{(n-1)}) \big)$, again using Eq. (7).

The estimator $H_{m,\infty}$ is thus made of a single term with large probability if $m$ is large enough; its computational cost is of order $\max(\tau, m)$ instead of $\tau$ for the original estimator. The intuition is that the fewer terms there are in $H_{m,\infty}$, the smaller the variance. Thus, some variance reduction can be achieved for a reasonable cost if we choose $m$ comparable to the median or the mean of $\tau$, as will be illustrated in Section 4.6.

Another question is whether we can average over various choices of $m$. We can compute $\bar{H}_m = \sum_{n=0}^{m} \alpha_n H_{n,\infty}$, where $\sum_{n=0}^{m} \alpha_n = 1$; this estimator is still unbiased. It follows (after some calculations) that
\[ \bar{H}_m = \sum_{n=0}^{m} \alpha_n\, h(X^{(n)}) + \sum_{n=1}^{\tau-1} \beta_n\, \big( h(X^{(n)}) - h(\tilde{X}^{(n-1)}) \big), \]
where $\beta_n = \sum_{j=0}^{(n-1) \wedge m} \alpha_j$; the choice of coefficients $\alpha_{0:m}$ is left for future work.

Finally, an alternative estimator is considered in Appendix C, and tested in Appendix D, which involves a Geometric truncation variable whose parameter can be tuned to potentially achieve further efficiency gains.
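As an illustration of the variance reduction just described, the following sketch computes $H_{m,\infty}$ of Eq. (7) from the values $h(X^{(n)})$ and $h(\tilde{X}^{(n)})$ recorded while iterating the coupled conditional particle filter until step $\max(\tau, m)$; the function and array names are ours, not the paper's.

```python
def h_m_inf(h_chain, h_tilde_chain, tau, m):
    """Compute H_{m, infinity} of Eq. (7).

    h_chain[n]       : h(X^(n))        for n = 0, ..., max(tau, m)
    h_tilde_chain[n] : h(X_tilde^(n))  for n = 0, ..., max(tau, m) - 1
    tau              : meeting time, i.e. X^(tau) = X_tilde^(tau - 1)
    m                : nonnegative integer parameter
    """
    if tau <= m + 1:
        # the chains have met by step m + 1: all correction terms vanish
        return h_chain[m]
    # otherwise add the telescopic corrections up to tau - 1
    correction = sum(h_chain[n] - h_tilde_chain[n - 1] for n in range(m + 1, tau))
    return h_chain[m] + correction
```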

4.6 Numerical results

4.6.1 Unlikely observation

We consider the first example of Ruiz and Kappen (2016), designed so that the filtering and the smoothing distributions are significantly different. The latent process is defined as $x_0 \sim \mathcal{N}(0, \tau_0^2)$ and $x_t = \eta\, x_{t-1} + \mathcal{N}(0, \tau^2)$; we take $\tau_0 = 0.1$, $\eta = 0.9$, $\tau = 0.1$ and consider $T = 10$ time steps. The process is not observed until the final time $T$, where $y_T = 1$ and we assume $y_T \sim \mathcal{N}(x_T, \sigma^2)$, with $\sigma = 0.1$. The observation $y_T$ is unlikely under the latent process distribution. Therefore the filtering distributions and the smoothing distributions have little overlap, particularly for times $t$

close to $T$. Smoothers such as forward filtering backward smoothers and fixed-lag smoothers, which are based on the filtering approximations, are bound to struggle for this model. On the other hand, the conditional particle filter iteratively selects paths closer to $y_T$, and converges in relatively few steps. We consider the problem of estimating the smoothing means, and run $R = 10,000$ independent Rhee–Glynn estimators, with various numbers of particles, with ancestor sampling (Lindsten et al., 2014) and without variance reduction. For comparison, we also run a bootstrap particle filter $R$ times, with various larger numbers of particles. This compensates for the fact that the Rhee–Glynn estimator requires a certain number of iterations, each involving a coupled particle filter. The average numbers of iterations for each value of $N$ are given in Table 1.

            meeting time
N = 128     10.64 (25.1)
N = 256     8.86 (16.97)
N = 512     7.27 (10.77)
N = 1024    6.12 (7.33)

Table 1: Average meeting time as a function of the number of particles $N$. Standard deviations are between brackets.

For each method, we compute a confidence interval as $[\hat{x}_t - 2\hat{\sigma}_t/\sqrt{R},\ \hat{x}_t + 2\hat{\sigma}_t/\sqrt{R}]$ at each time $t$, where $\hat{x}_t$ is the mean of the $R$ estimators and $\hat{\sigma}_t$ is their standard deviation. The results are shown in Figure 7. The exact smoothing means are obtained analytically and shown by black dots. The Rhee–Glynn estimators benefit from the good performance of the conditional particle filter kernel, even for small values of $N$, leading to reliable confidence intervals. Increasing $N$ reduces the width of the intervals and the average meeting time. On the other hand, the standard particle filter requires larger numbers of particles to accurately estimate the smoothing means. Its confidence intervals are highly unreliable, since they are based on the average of independent but biased estimators. This is related to the poor performance of bootstrap particle filters in the setting of highly informative observations (Del Moral and Murray, 2015).
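For reference, the exact smoothing means used as ground truth above can be computed in closed form, since the model is linear Gaussian with all prior means equal to zero and a single observation at time $T$; a minimal sketch of this computation (our own illustrative code, using the parameter values above) is as follows.

```python
import numpy as np

# Model of Section 4.6.1: x_0 ~ N(0, tau0^2), x_t = eta * x_{t-1} + N(0, tau^2),
# single observation y_T ~ N(x_T, sigma^2). Since all prior means are zero,
# E[x_t | y_T] = Cov(x_t, x_T) / (Var(x_T) + sigma^2) * y_T.
tau0, eta, tau, sigma, T, yT = 0.1, 0.9, 0.1, 0.1, 10, 1.0

# Prior variances Var(x_t), computed recursively.
var_x = np.zeros(T + 1)
var_x[0] = tau0 ** 2
for t in range(1, T + 1):
    var_x[t] = eta ** 2 * var_x[t - 1] + tau ** 2

# Cov(x_t, x_T) = eta^(T - t) * Var(x_t), since later noise is independent of x_t.
cov_xt_xT = eta ** (T - np.arange(T + 1)) * var_x

smoothing_means = cov_xt_xT * yT / (var_x[T] + sigma ** 2)
print(smoothing_means)  # E[x_t | y_T] for t = 0, ..., T
```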

4.6.2 Multimodality in a nonlinear growth model

We consider the nonlinear growth model used by Gordon et al. (1993). We set $x_0 \sim \mathcal{N}(0, 2)$ and, for $t \geq 1$,
\[ x_t = 0.5\, x_{t-1} + 25\, \frac{x_{t-1}}{1 + x_{t-1}^2} + 8 \cos(1.2 (t-1)) + W_t, \quad \text{and} \quad y_t = \frac{x_t^2}{20} + V_t, \]
where $W_t$ and $V_t$ are independent normal variables, with variances $1$ and $10$ respectively. We generate $T = 50$ observations using $X_0 = 0.1$, following Gordon et al. (1993). Because the measurement distribution $g(y_t \mid x_t, \theta)$ depends on $x_t$ through $x_t^2$, the sign of $x_t$ is hard to identify, and as a result the smoothing distribution has multiple modes.


Figure 7: Confidence intervals on the smoothing means, obtained with $R = 10,000$ Rhee–Glynn smoothers (left) and bootstrap particle filters (right). The true smoothing means are shown as black dots. (Note that the estimators for different $t$ are dependent since they are obtained from the same trajectories.)

We run a conditional particle filter with ancestor sampling, with $N = 1,024$ particles for $M = 50,000$ iterations, and discard the first $25,000$ iterations. We plot the histogram of the obtained sample for $p(dx_t \mid y_{1:T}, \theta)$ at time $t = 36$ in Figure 8a. We notice at least two modes, located around $-7$ and $+7$, with possibly an extra mode near zero. We run the Rhee–Glynn smoother with $N = 1,024$ and ancestor sampling, without variance reduction techniques. Each estimator took less than $10$ iterations of the coupled conditional particle filter to meet, with a median meeting time of $3$ iterations. The total number of calls to the coupled conditional particle filter to obtain $R = 1,000$ estimators adds up to $2,984$. We plot the histogram of the estimators $H_t^{(r)}$, for $r \in 1:R$, of the smoothing mean $E[x_t \mid y_{1:T}]$ at time $t = 36$ in Figure 8b. We see that the distribution of the estimator is itself multimodal. Indeed, the two initial reference trajectories might belong to the mode around $-7$, or to the mode around $+7$, or each trajectory might belong to a different mode. Each of these cases leads to a mode in the distribution of the Rhee–Glynn estimator.

The resulting estimator $\hat{x}_t$ of each smoothing mean is obtained by averaging the $R = 1,000$ independent estimators $H_t^{(r)}$. We compute the Monte Carlo standard deviation $\hat{\sigma}_t$ at each time $t$, and represent the confidence intervals $[\hat{x}_t - 2\hat{\sigma}_t/\sqrt{R},\ \hat{x}_t + 2\hat{\sigma}_t/\sqrt{R}]$ as error bars in Figure 9. The line represents the smoothing means obtained by a conditional particle filter with ancestor sampling, taken as ground truth. The agreement shows that the proposed method is robust to multimodality in the smoothing distribution.


Figure 8: Smoothing distribution approximated by conditional particle filters (left: approximation of the smoothing distribution at $t = 36$), and $R = 1,000$ independent Rhee–Glynn estimators of the smoothing mean (right: with the true mean shown as a vertical dashed red line), at time $t = 36$ for the nonlinear growth model with $T = 50$.

4.6.3 Plankton population model

We consider the Plankton–Zooplankton model of Jones et al. (2010), which is an example of an implicit model: the transition density is intractable (Bretó et al., 2009; Jacob, 2015). The hidden state $x_t = (p_t, z_t)$ represents the population sizes of phytoplankton and zooplankton, and the transition from time $t$ to $t+1$ is given by a Lotka–Volterra model,
\[ \frac{dp_t}{dt} = \alpha\, p_t - c\, p_t z_t, \quad \text{and} \quad \frac{dz_t}{dt} = e\, c\, p_t z_t - m_l z_t - m_q z_t^2, \]
where the stochastic daily growth rate $\alpha$ is drawn from $\mathcal{N}(\mu_\alpha, \sigma_\alpha^2)$ at every integer time $t$. The propagation of each particle involves solving the above equations numerically, here using a Runge–Kutta method from the odeint library (Ahnert and Mulansky, 2011). The initial distribution is given by $\log p_0 \sim \mathcal{N}(\log 2, 1)$ and $\log z_0 \sim \mathcal{N}(\log 2, 1)$. The parameters $c$ and $e$ represent the clearance rate of the prey and the growth efficiency of the predator; $m_l$ and $m_q$ parameterize the mortality rate of the predator. The observations $y_t$ are noisy measurements of the phytoplankton $p_t$, with $\log y_t \sim \mathcal{N}(\log p_t, 0.2^2)$; $z_t$ is not observed. We generate $T = 365$ observations using $\mu_\alpha = 0.7$, $\sigma_\alpha = 0.5$, $c = 0.25$, $e = 0.3$, $m_l = 0.1$, $m_q = 0.1$.

We consider the problem of estimating the mean population of zooplankton at each time $t \in 0:T$. We run a conditional particle filter with $N = 4,096$ particles for $M = 50,000$ iterations, and discard the first $25,000$ iterations. The intractability of the transition density precludes the use of ancestor or backward sampling, and of forward filtering backward sampling.
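The paper relies on the C++ odeint library for this integration; purely as an illustration, the propagation of one particle over one unit of time could be sketched in Python as follows, with scipy's generic Runge–Kutta solver standing in for odeint and the function names being ours.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Parameter values taken from the data-generating setting above.
mu_alpha, sigma_alpha = 0.7, 0.5
c, e, m_l, m_q = 0.25, 0.3, 0.1, 0.1

def lotka_volterra(t, state, alpha):
    """Right-hand side of the phytoplankton-zooplankton ODE."""
    p, z = state
    dp = alpha * p - c * p * z
    dz = e * c * p * z - m_l * z - m_q * z ** 2
    return [dp, dz]

def propagate(state, rng):
    """Propagate one particle from time t to t + 1: draw the daily growth
    rate alpha, then integrate the ODE over one unit of time."""
    alpha = rng.normal(mu_alpha, sigma_alpha)
    sol = solve_ivp(lotka_volterra, (0.0, 1.0), state, args=(alpha,), rtol=1e-6)
    return sol.y[:, -1]

rng = np.random.default_rng(1)
state = np.exp(rng.normal(np.log(2.0), 1.0, size=2))  # initial (p_0, z_0)
state = propagate(state, rng)
```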


Figure 9: Confidence intervals around the exact smoothing means. The intervals are computed as two standard deviations around the mean of $R = 1,000$ proposed smoothing estimators. The line represents the exact smoothing means, retrieved by a long run of the conditional particle filter, for the nonlinear growth model with $T = 50$ observations.

We draw $R = 1,000$ independent Rhee–Glynn smoothing estimators, using $N = 4,096$ particles. The observed meeting times have a median of $4$ and a maximum of $19$. The total number of calls to the coupled conditional particle filter kernel adds up to $4,735$. The estimator $\hat{z}_t$ of the smoothing mean of $z_t$ at each $t$ is obtained by averaging the $R = 1,000$ independent estimators $H_t^{(r)}$. We compute the Monte Carlo standard deviation $\hat{\sigma}_t$ at each time, and represent the confidence intervals $[\hat{z}_t - 2\hat{\sigma}_t/\sqrt{R},\ \hat{z}_t + 2\hat{\sigma}_t/\sqrt{R}]$. For clarity, we focus separately on the first and last thirty time points in Figure 10. The line represents the smoothing means obtained by the long run of the conditional particle filter described above. We see that the two methods are in agreement.

We now combine the Rhee–Glynn estimator (denoted by "unbiased" below) with the variance reduction techniques of Section 4.5: Rao–Blackwellization (denoted by "unbiased+RB") and the estimator $H_{m,\infty}$ with Rao–Blackwellization (denoted by "unbiased+RB+m"), run with $m = 4$, chosen to be the median of the meeting time. The latter increases the average number of steps per estimator from $4.7$ to $5.1$. We compare the proposed estimators with a fixed-lag smoother (Doucet and Johansen, 2011) with lag parameter $L = 10$, and with a standard particle filter storing the complete trajectories. We use the same number of particles $N = 4,096$ and compute $R = 1,000$ estimators for each method. The relative variance, computed as the empirical variance divided by the squared empirical mean, is shown in Figure 11. First we see that the variance reduction techniques have a significant effect, particularly for $t$ close to $T$ but also for small $t$. In particular, the estimator $H_{m,\infty}$ with Rao–Blackwellization ("unbiased+RB+m") achieves nearly the same relative variance as the particle filter. The cost of these estimators can be computed as the number of iterations $\max(m, \tau)$, times twice the cost of a particle filter for each coupled particle filter. In the present setting, where the average number of iterations is around five, we conclude that removing the bias from the standard particle filter can be done for an approximately ten-fold increase in computational cost, and no increase in memory cost.



Figure 10: Confidence intervals around the exact smoothing means of the zooplankton population size $z_t$, for (a) $t \in 0:30$ and (b) $t \in 335:365$. The intervals are computed as two standard deviations around the mean of $R = 1,000$ proposed smoothing estimators. The line represents the exact smoothing means, retrieved by a long run of the conditional particle filter, for the phytoplankton–zooplankton model with $T = 365$ observations.

As expected, the fixed-lag smoother leads to a significant decrease in variance. For this model, the incurred bias is negligible even for $L = 10$ (not shown), which, however, would be hard to tell if we did not have access to either unbiased methods or long runs of asymptotically exact methods. In this model, standard particle filters and fixed-lag approximations perform well, leading to a smaller mean squared error than the proposed estimators for a given computational cost. However, the proposed method is still competitive, the tuning of the algorithm is minimal, and the unbiasedness prevents the possibility of over-confident intervals as in Section 4.6.1. Therefore the proposed method trades an extra cost for convenience and reliability.

5 Discussion

Couplings of particle filters can be beneficial in multiple settings. For parameter inference, coupled resampling schemes can significantly reduce the variance of finite difference estimators of the score. Furthermore, they improve the performance of the correlated particle marginal MH algorithm, achieving higher acceptance rates for a fixed number of particles. While the quadratic cost of transport resampling might be prohibitive, the index-coupled resampling scheme brings similar gains at a linear complexity. The added cost compared to standard resampling schemes is negligible in most settings. In the context of conditional particle filters, a coupling construction combined with the debiasing technique of Glynn and Rhee (2014) leads to a new smoothing algorithm.



Figure 11: Comparison of the relative variance of the standard particle filter, a fixed-lag smoother with lag $L = 10$, and the proposed unbiased method, with Rao–Blackwellization (RB) and variance reduction (RB+m), for the estimation of the mean of the zooplankton population $z_t$, for the phytoplankton–zooplankton model with $T = 365$ observations. Results based on $R = 1,000$ experiments.

The Rhee–Glynn smoother does not outperform existing smoothers in all situations, but it has a number of appealing properties.

• The proposed estimator is completely parallelizable, while existing methods are eventually limited by the resampling steps (Murray et al., 2015; Lee and Whiteley, 2015a), or by their iterative nature.

• Contrary to fixed-lag smoothers, the algorithm can be used to estimate functionals of whole trajectories, and not only of small-dimensional marginals. Furthermore, the only tuning parameter is the number of particles $N$, which affects the efficiency of the method but not its validity.

• The proposed estimator allows the construction of reliable confidence intervals. Unbiasedness can also be valuable for other applications, such as gradient estimation for stochastic gradient optimization techniques. The proposed smoothing method, together with Fisher's identity (Poyiadjis et al., 2011), leads to unbiased estimators of the score, for models where the transition density is tractable.

• Contrary to standard smoothers, where the smoothing approximation is retrieved from the filtering approximation (Ruiz and Kappen, 2016), the proposed method is robust to highly informative observations.

A topic of future research might be the application of the proposed ideas outside the context of state space models, following the growing popularity of particle methods in increasingly many settings (Del Moral et al., 2006; Bouchard-Côté et al., 2012; Naesseth et al., 2014).

Appendices

Appendix A describes Cuturi's approximation to the optimal transport problem. Appendix B provides proofs for the results of Section 3, and Appendix C for the results of Section 4. Appendix D provides further experiments with the Rhee–Glynn smoother, and Appendix E provides pseudo-code for standard and coupled particle filters.

Acknowledgements

The first author thanks Mathieu Gerber and Marco Cuturi for helpful discussions.

References K. Ahnert and M. Mulansky. Odeint-solving ordinary differential equations in C++. arXiv preprint arXiv:1110.3397, 2011. C. Andrieu and G. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. Ann. Stat., 37(2):697–725, 2009. C. Andrieu and M. Vihola. Convergence properties of pseudo-marginal Markov chain Monte Carlo algorithms. The Annals of Applied Probability, 25(2):1030–1077, 2015. C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):357–385, 2010. C. Andrieu, A. Lee, and M. Vihola. Uniform ergodicity of the iterated conditional SMC and geometric ergodicity of particle Gibbs samplers. arXiv preprint arXiv:1312.6432, 2013. S. Asmussen and P. W. Glynn. Stochastic simulation: Algorithms and analysis, volume 57. Springer, 2007. J.-D. Benamou, G. Carlier, M. Cuturi, L. Nenna, and G. Peyré. Iterative bregman projections for regularized transportation problems. SIAM Journal on Scientific Computing, 37(2):A1111– A1138, 2015. J. Bérard, P. Del Moral, and A. Doucet. A lognormal central limit theorem for particle approximations of normalizing constants. Electron. J. Probab, 19(94):1–28, 2014. A. Bouchard-Côté, S. Sankararaman, and M. I. Jordan. Phylogenetic inference via sequential Monte Carlo. Systematic Biology, 61(4):579–593, 2012. C. Bretó, D. He, E. L. Ionides, and A. A. King. Time series analysis via mechanistic models. The Annals of Applied Statistics, pages 319–348, 2009. O. Cappé, E. Moulines, and T. Rydén. Inference in Hidden Markov Models. Springer-Verlag, New York, 2005. 30

F. Cérou, P. Del Moral, and A. Guyader. A nonasymptotic theorem for unnormalized Feynman–Kac particle models. Ann. Inst. Henri Poincarré, 47(3):629–649, 2011. N. Chopin and S. S. Singh. On particle Gibbs sampling. Bernoulli, 21(3):1855–1883, 2015. P.-A. Coquelin, R. Deguest, and R. Munos. Sensitivity analysis in HMMs with application to likelihood maximization. In Advances in Neural Information Processing Systems (NIPS), pages 387–395, 2009. M. Cuturi. Sinkhorn distances: Lightspeed computation of optimal transport. In Advances in Neural Information Processing Systems (NIPS), pages 2292–2300, 2013. J. Dahlin, F. Lindsten, J. Kronander, and T. B. Schön. Accelerating pseudo-marginal Metropolis– Hastings by correlating auxiliary variables. arXiv preprint arXiv:1511.05483, 2015. D. N. DeJong, H. Dharmarajan, R. Liesenfeld, and J.-F. Richard. An efficient approach to analyzing state-space representations. SSRN eLibrary, 2007. P. Del Moral. Feynman-Kac Formulae, Genealogical and Interacting Particle Systems with Applications. New York: Springer-Verlag, 2004. P. Del Moral, A. Doucet, and A. Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006. P. Del Moral and L. M. Murray. Sequential monte carlo with highly informative observations. SIAM/ASA Journal on Uncertainty Quantification, 3(1):969–997, 2015. G. Deligiannidis, A. Doucet, and M. K. Pitt. The correlated pseudo-marginal method. arXiv preprint arXiv:1511.04992, 2015. P. Diaconis and D. Freedman. Iterated random functions. SIAM review, 41(1):45–76, 1999. R. Douc and O. Cappé. Comparison of resampling schemes for particle filtering. In Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis (ISPA), pages 64–69, 2005. A. Doucet and A. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. In Handbook of Nonlinear Filtering. Oxford, UK: Oxford University Press, 2011. A. Doucet, N. de Freitas, and N. Gordon. Sequential Monte Carlo methods in practice. SpringerVerlag, New York, 2001. A. Doucet, M. Pitt, G. Deligiannidis, and R. Kohn. Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. Biometrika, 102(2):295–313, 2015.


J. M. Flegal, M. Haran, and G. L. Jones. Markov chain Monte Carlo: can we trust the third significant figure? Statistical Science, 23(2):250–260, 2008. M. Gerber and N. Chopin. Sequential quasi Monte Carlo. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(3):509–579, 2015. P. Glasserman and D. D. Yao. Some guidelines and guarantees for common random numbers. Management Science, 38(6):884–908, 1992. P. W. Glynn and C.-H. Rhee. Exact estimation for Markov chain equilibrium expectations. J. Appl. Probab., 51A:377–389, 2014. P. W. Glynn and W. Whitt. The asymptotic efficiency of simulation estimators. Operations Research, 40(3):505–520, 1992. N. Gordon, J. Salmond, and A. Smith. A novel approach to non-linear/non-Gaussian Bayesian state estimation. IEE Proceedings on Radar and Signal Processing, 140:107–113, 1993. P. Guarniero, A. M. Johansen, and A. Lee. The iterated auxiliary particle filter. arXiv preprint arXiv:1511.06286, 2015. P. E. Jacob, L. M. Murray, and S. Rubenthaler. Path storage in the particle filter. Statistics and Computing, 25(2):487–496, 2015. P. E. Jacob. Sequential Bayesian inference for implicit hidden Markov models and current limitations. ESAIM: Proceedings and Surveys, 51:24–48, 2015. A. Jasra, K. Kamatani, K. J. Law, and Y. Zhou. Multilevel particle filter. arXiv preprint arXiv:1510.04977, 2015.

E. Jones, J. Parslow, and L. Murray. A Bayesian approach to state and parameter estimation in a phytoplankton-zooplankton model. Australian Meteorological and Oceanographic Journal, 59: 7–16, 2010. S.-H. Jun, L. Wang, and A. Bouchard-Côté. Entangled Monte Carlo. In Advances in Neural Information Processing Systems (NIPS), pages 2726–2734, 2012. H. Kahn and A. W. Marshall. Methods of reducing sample size in Monte Carlo computations. Journal of the Operations Research Society of America, 1(5):263–278, 1953. N. Kantas, A. Doucet, S. S. Singh, J. Maciejowski, and N. Chopin. On particle methods for parameter estimation in state-space models. Statistical science, 30(3):328–351, 2015. J. Kuti. Stochastic method for the numerical study of lattice fermions. Physical Review Letters, 49 (3):183, 1982. 32

A. Lee. Towards smooth particle filters for likelihood estimation with multivariate latent variables. Master’s thesis, University of British Columbia, 2008. A. Lee and C. Holmes. Comment on Particle Markov chain Monte Carlo by Andrieu, Doucet and Holenstein. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4): 357–385, 2010. A. Lee, A. Doucet, and K. Łatuszyński. Perfect simulation using atomic regeneration with application to sequential Monte Carlo. ArXiv e-prints, July 2014. A. Lee and N. Whiteley. Forest resampling for distributed sequential Monte Carlo. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2015a. A. Lee and N. Whiteley. Variance estimation and allocation in the particle filter. arXiv preprint arXiv:1509.00394, 2015b. F. Lindsten and T. B. Schön. Backward simulation methods for Monte Carlo statistical inference. Foundations and Trends in Machine Learning, 6(1):1–143, 2013. F. Lindsten, M. I. Jordan, and T. B. Schön. Particle Gibbs with ancestor sampling. Journal of Machine Learning Research (JMLR), 15:2145–2184, June 2014. F. Lindsten, R. Douc, and E. Moulines. Uniform ergodicity of the particle Gibbs sampler. Scandinavian Journal of Statistics, 42:775–797, June 2015. T. Lindvall. Lectures on the coupling method. Courier Corporation, 2002. S. Malik and M. K. Pitt. Particle filters for continuous likelihood evaluation and maximisation. Journal of Econometrics, 165(2):190–209, 2011. D. McLeish. A general method for debiasing a Monte Carlo estimator. Monte Carlo methods and applications, 17(4):301–315, 2012. L. M. Murray, A. Lee, and P. E. Jacob. Parallel resampling in the particle filter. Journal of Computational and Graphical Statistics, 2015. C. A. Naesseth, F. Lindsten, and T. B. Schön. Sequential Monte Carlo for graphical models. In Advances in Neural Information Processing Systems (NIPS) 27, Montreal, Quebec, Canada, December 2014. M. K. Pitt. Smooth particle filters for likelihood evaluation and maximisation. Technical report, University of Warwick, Department of Economics, 2002.


G. Poyiadjis, A. Doucet, and S. S. Singh. Particle approximations of the score and observed information matrix in state space models with application to parameter estimation. Biometrika, 98(1):65–80, 2011. C. Rhee. Unbiased Estimation with Biased Samplers. PhD thesis, Stanford University, 2013. URL http://purl.stanford.edu/nf154yt1415. C. Rhee and P. W. Glynn. A new approach to unbiased estimation for SDE’s. In Proceedings of the Winter Simulation Conference, pages 17:1–17:7, 2012. C. Rhee and P. W. Glynn. Unbiased Estimation with Square Root Convergence for SDE Models. Technical report, Stanford University, 2013. H.-C. Ruiz and H. Kappen. Particle smoothing for hidden diffusion processes: Adaptive path integral smoother. arXiv preprint arXiv:1605.00278, 2016. T. Rychlik. Unbiased nonparametric estimation of the derivative of the mean. Statistics & probability letters, 10(4):329–333, 1990. T. Rychlik. A class of unbiased kernel estimates of a probability density function. Applications Math, 22:485–497, 1995. The CGAL Project. CGAL User and Reference Manual. CGAL Editorial Board, 4.8 edition, 2016. URL http://doc.cgal.org/4.8/Manual/packages.html. M. Vihola. Unbiased estimators and multilevel Monte Carlo. arXiv preprint arXiv:1512.01022, 2015. N. Whiteley. Comment on Particle Markov chain Monte Carlo by Andrieu, Doucet and Holenstein. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):357–385, 2010. N. Whiteley. Stability properties of some particle filters. Ann. Appl. Probab., 23(6):2500–2537, 12 2013.

A Approximate transport

An approximate solution to the transport problem was introduced in Cuturi (2013); Cuturi and Doucet (2014), and explained along with various other methods in Benamou et al. (2015). We describe it briefly here. The original transport program is regularized, leading to the modified objective function
\[ \langle P, D \rangle - \varepsilon\, h(P), \]


where $h(P) = -\sum_{ij} P^{ij} \log P^{ij}$ is the entropy of $P$, $\langle P, D \rangle$ is the sum of the terms $P^{ij} D^{ij}$, and $\varepsilon \in \mathbb{R}_+$. When $\varepsilon \to 0$, minimizing the above objective over $\mathcal{J}(w, \tilde{w})$ corresponds to the optimal transport problem. We write
\[ \langle P, D \rangle - \varepsilon\, h(P) = \varepsilon\, \mathrm{KL}\big(P \,\|\, \exp(-D/\varepsilon)\big), \]
where $\mathrm{KL}(S \| Q) = \sum_{i,j} S^{ij} \log(S^{ij}/Q^{ij})$ and $\exp(S)$ is the element-wise exponential of $S$. Minimizing the regularized transport objective is equivalent to finding the matrix $P$ with minimal KL projection on $K = \exp(-D/\varepsilon)$, leading to the optimization problem
\[ P^\varepsilon = \inf_{P \in \mathcal{J}(w, \tilde{w})} \mathrm{KL}(P \| K). \tag{9} \]
Compared to the original transport problem, this program is computationally simpler. By noting that $\mathcal{J}(w, \tilde{w}) = \mathcal{J}_w \cap \mathcal{J}_{\tilde{w}}$, where $\mathcal{J}(w) = \{P : P 1 = w\}$ and $\mathcal{J}(\tilde{w}) = \{P : P^T 1 = \tilde{w}\}$, we can find the solution of Eq. (9) by starting from $P^{(0)} = K$ and performing iterative KL projections on $\mathcal{J}(w)$ and $\mathcal{J}(\tilde{w})$. This is Algorithm 1 in Cuturi (2013), which we state for completeness in Algorithm 2. The algorithm is iterative, and requires matrix–vector multiplications which cost $O(N^2)$ at every iteration. The number of steps $n$ needed to achieve a given precision can be taken independently of the number of particles $N$, thanks to the convexity of the objective function (Cuturi, 2013). The overall cost is thus $O(N^2)$.

Algorithm 2 Cuturi's approximation to the transport problem.
Input: $w, \tilde{w}$, two $N$-vectors of normalized weights; $x, \tilde{x}$, two sets of locations in $\mathbb{X}$; a distance $d$ on $\mathbb{X}$.
Parameters: $\varepsilon > 0$ for the regularization, $n$ for the number of iterations. The element-wise division between two vectors is denoted by $/$; $1$ is a column vector of ones.
1. Compute the pairwise distances $D = (D^{ij})$ where $D^{ij} = d(x^i, \tilde{x}^j)$.
2. Compute $K = \exp(-D/\varepsilon)$ (element-wise) and set $v^{(0)} = 1$.
3. For $i \in \{1, \ldots, n\}$,
   (a) $u^{(i)} \leftarrow w / (K v^{(i-1)})$,
   (b) $v^{(i)} \leftarrow \tilde{w} / (K^T u^{(i)})$.
4. Compute $\hat{P}$ as $\mathrm{diag}(u^{(n)})\, K\, \mathrm{diag}(v^{(n)})$.

There are two tuning parameters: the regularization parameter $\varepsilon$ and the number of iterations $n$. For $\varepsilon$, we follow Cuturi (2013) and set it to a small proportion of the median of the distance matrix $D$; for instance, we can set $\varepsilon = 10\% \times \mathrm{median}(D)$. For the choice of $n$, we use the following adaptive criterion.

As described in Section 2.3, once the approximate solution $\hat{P}$ is obtained through Algorithm 2, we need to correct its marginals. We compute the approximate marginals $u = \hat{P} 1$ and $\tilde{u} = \hat{P}^T 1$, and set
\[ \alpha = \min\left( \min_{i \in 1:N} \frac{w^i}{u^i},\ \min_{i \in 1:N} \frac{\tilde{w}^i}{\tilde{u}^i} \right), \qquad r = \frac{w - \alpha u}{1 - \alpha}, \qquad \tilde{r} = \frac{\tilde{w} - \alpha \tilde{u}}{1 - \alpha}. \]
The final transport probability matrix is given by $P = \alpha \hat{P} + (1 - \alpha)\, r\, \tilde{r}^T$. When $n$ increases, $\hat{P}$ gets nearer to the solution $P^\varepsilon$, which is in $\mathcal{J}(w, \tilde{w})$. Thus, if $n$ were very large, we could take $\alpha$ close to one. This gives a heuristic to choose $n$: we stop Cuturi's iterative algorithm when the current solution $P^{(n)} = \mathrm{diag}(u^{(n)})\, K\, \mathrm{diag}(v^{(n)})$ is such that the $\alpha$ computed above is at least a certain value, for instance $90\%$. This ensures that the transport probability matrix $P$ is a close approximation to the solution of the regularized transport problem.
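As a concrete illustration of Algorithm 2 and of the marginal correction just described, here is a minimal NumPy sketch; the fixed number of iterations replaces the adaptive stopping rule, the univariate distance is only an example, and the default values are illustrative choices of ours.

```python
import numpy as np

def sinkhorn_matrix(w, w_tilde, x, x_tilde, eps_ratio=0.1, n_iter=100):
    """Algorithm 2: approximate regularized transport matrix P_hat."""
    D = np.abs(x[:, None] - x_tilde[None, :])   # pairwise distances d(x_i, x_tilde_j)
    eps = eps_ratio * np.median(D)              # epsilon as a proportion of median(D)
    K = np.exp(-D / eps)
    v = np.ones_like(w_tilde)
    for _ in range(n_iter):
        u = w / (K @ v)                         # step 3(a)
        v = w_tilde / (K.T @ u)                 # step 3(b)
    return (u[:, None] * K) * v[None, :]        # step 4: diag(u) K diag(v)

def correct_marginals(P_hat, w, w_tilde):
    """Mix P_hat with a product measure so its marginals are exactly (w, w_tilde)."""
    u, u_tilde = P_hat.sum(axis=1), P_hat.sum(axis=0)
    alpha = min(np.min(w / u), np.min(w_tilde / u_tilde))
    if alpha >= 1.0:                            # P_hat already has the correct marginals
        return P_hat
    r = (w - alpha * u) / (1.0 - alpha)
    r_tilde = (w_tilde - alpha * u_tilde) / (1.0 - alpha)
    return alpha * P_hat + (1.0 - alpha) * np.outer(r, r_tilde)

# toy usage
rng = np.random.default_rng(0)
N = 5
w = np.full(N, 1.0 / N); w_tilde = np.full(N, 1.0 / N)
x, x_tilde = rng.normal(size=N), rng.normal(size=N)
P = correct_marginals(sinkhorn_matrix(w, w_tilde, x, x_tilde), w, w_tilde)
```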

B Validity of correlated particle marginal MH

We first provide the proof of Lemma 3.1, and then we show that the condition of Eq. (4) is satisfied for sorted, index-coupled and transport resampling schemes. The extended target distribution of the particle marginal MH algorithm has density
\[ \bar{\pi}(\theta, U_{0:T}, a_{0:T-1}) = \frac{p(\theta \mid y_{1:T})\, \varphi(U_{0:T})\, \prod_{t=0}^{T-1} r(a_t \mid a_{0:t-1}, U_{0:t}, \theta)\; \hat{p}^N(y_{1:T} \mid \theta)}{p(y_{1:T} \mid \theta)}, \tag{10} \]

where $r(a_t \mid a_{0:t-1}, U_{0:t}, \theta)$ denotes the probability of sampling the ancestors $a_t = a_t^{1:N}$ at step $t$, given the particles $(w_t^k, x_t^k)_{k=1}^N$, which are deterministic functions of $(a_{0:t-1}, U_{0:t}, \theta)$. Under the assumption that $P(a_t^k = j) = w_t^j$, the expectation of $\hat{p}^N(y_{1:T} \mid \theta)$ with respect to $U_{0:T}$ and $a_{0:T-1}$, given $\theta$, is equal to $p(y_{1:T} \mid \theta)$. Therefore we retrieve $p(\theta \mid y_{1:T})$ as the marginal distribution of $\theta$ under $\bar{\pi}(\theta, U_{0:T}, a_{0:T-1})$, which justifies the method (Andrieu et al., 2010).

We give a sufficient condition on the coupled resampling scheme for the correlated particle marginal MH algorithm to target the same distribution $\bar{\pi}(\theta, U_{0:T}, a_{0:T-1})$, and thus to have the correct marginal on $\theta$. We denote by $\xi$ all the auxiliary variables generated by the particle filter, that is, $U_t$ for all $t \in 0:T$, and $a_t$ for all $t \in 0:T-1$. The extended target distribution of Eq. (10) can be rewritten
\[ \bar{\pi}(\theta, \xi) = \frac{p(\theta \mid y_{1:T})\, m_\theta(\xi)\, \hat{p}^N(y_{1:T} \mid \theta)}{p(y_{1:T} \mid \theta)}, \tag{11} \]
where $m_\theta(\xi)$ is the distribution of $\xi$ defined by a run of the particle filter.

According to the procedure described in Section 3.4, from the state $(\theta, \xi)$, we sample $\tilde{\theta} \sim q(d\tilde{\theta} \mid \theta)$ and $\tilde{\xi} \sim K_{\theta,\tilde{\theta}}(d\tilde{\xi} \mid \xi)$ from a Markov kernel on the space of $\xi$, which may depend on $\theta$ and $\tilde{\theta}$. The original algorithm of Andrieu et al. (2010) uses $K_{\theta,\tilde{\theta}}(\tilde{\xi} \mid \xi) = m_{\tilde{\theta}}(\tilde{\xi})$, that is, a particle filter is run given $\tilde{\theta}$, independently of everything else, to produce $\hat{p}^N(y_{1:T} \mid \tilde{\theta})$. Other kernels leaving $\bar{\pi}(d\theta, d\xi)$ invariant can be constructed, a sufficient condition being:
\[ m_\theta(\xi)\, K_{\theta,\tilde{\theta}}(\tilde{\xi} \mid \xi) = m_{\tilde{\theta}}(\tilde{\xi})\, K_{\tilde{\theta},\theta}(\xi \mid \tilde{\xi}), \qquad \forall\, \theta, \tilde{\theta}, \xi, \tilde{\xi}. \tag{12} \]

We can write the distribution of $\xi$ as
\[ m_\theta(\xi) = \varphi(U_{0:T}) \prod_{t=0}^{T-1} r(a_t \mid a_{0:t-1}, U_{0:t}, \theta), \]
where $\varphi$ is the distribution of the process-generating variables, and $r$ refers to the resampling distribution, as described in Section 3.4. We consider kernels $K_{\theta,\tilde{\theta}}$ of the form
\[ K_{\theta,\tilde{\theta}}(\tilde{\xi} \mid \xi) = \phi(\tilde{U}_{0:T} \mid U_{0:T}) \prod_{t=0}^{T-1} c(\tilde{a}_t \mid \tilde{a}_{0:t-1}, \tilde{U}_{0:t}, \tilde{\theta}, a_{0:t}, U_{0:t}, \theta). \]

In this expression, $\phi$ is a Markov kernel in detailed balance with respect to $\varphi$, the distribution of the process-generating variables. The conditional resampling distribution $c$ of $\tilde{a}_t$, given $a_t$ and all the other variables generated up to time $t$, is specified by the coupled resampling scheme. This form for $K_{\theta,\tilde{\theta}}$ leads to a sufficient condition on $c$ for the kernel $K_{\theta,\tilde{\theta}}$ to satisfy Eq. (12), given by Eq. (4) in Lemma 3.1. Thus, under this condition, $K_{\theta,\tilde{\theta}}$ satisfies Eq. (12), so that the chain generated by the correlated pseudo-marginal algorithm admits $\bar{\pi}(d\theta, d\xi)$ as an invariant distribution. This concludes the proof of the lemma.

For independent and sorted resampling, the conditional sampling of $\tilde{a}_t$ does not require any variable from the first particle system, so that we have
\[ c(\tilde{a}_t \mid \tilde{a}_{0:t-1}, \tilde{U}_{0:t}, \tilde{\theta}, a_{0:t}, U_{0:t}, \theta) = r(\tilde{a}_t \mid \tilde{a}_{0:t-1}, \tilde{U}_{0:t}, \tilde{\theta}), \]
and the condition is satisfied.

For index-coupled resampling, with the conditional sampling scheme described in Section 3.1, for each $k \in 1:N$, $\tilde{a}_t^k$ is distributed according to $P_t^{a_t^k \cdot}$, the $a_t^k$-th row of $P_t$ defined in Eq. (2). We can thus write
\[ c(\tilde{a}_t \mid \tilde{a}_{0:t-1}, \tilde{U}_{0:t}, \tilde{\theta}, a_{0:t}, U_{0:t}, \theta) = \prod_{k=1}^{N} P_t^{a_t^k \tilde{a}_t^k}. \]
Noting that the definition of the index-coupled probability matrix in Eq. (2) is such that we obtain its transpose if we swap $w$ and $\tilde{w}$ in its definition, we check that the condition of Eq. (4) is satisfied with respect to $r(a_t \mid a_{0:t-1}, U_{0:t}, \theta) = \prod_{k=1}^{N} w_t^{a_t^k}$. This corresponds to multinomial resampling, which is indeed such that $P(a_t^k = j) = w_t^j$.

For transport resampling, the distance matrix $D$ in Section 2.3 is such that we obtain its transpose if $x$ and $\tilde{x}$ are swapped. Thus the optimal probability matrix $P_t$ obtained by solving the transport problem is such that we obtain its transpose if $(w, x)$ and $(\tilde{w}, \tilde{x})$ are swapped. Therefore, if $\tilde{a}_t^k$ is distributed according to $P_t^{a_t^k \cdot}$ for each $k \in 1:N$, then the condition of Eq. (4) is satisfied, similarly to the case of index-coupled resampling. However, if we use an approximate solution $\hat{P}$ to the transport problem, then the condition might not directly hold. This can be circumvented by symmetrizing the resampling matrix as follows: compute $P$ from Cuturi's algorithm based on $((w, x), (\tilde{w}, \tilde{x}))$ and $\tilde{P}$ based on $((\tilde{w}, \tilde{x}), (w, x))$, i.e. swapping the pairs of arguments. Then compute $\hat{P}_t = (P + \tilde{P}^T)/2$, and sample $\tilde{a}_t^k$ according to $\hat{P}_t^{a_t^k \cdot}$ for each $k \in 1:N$. This ensures that detailed balance holds with respect to the multinomial resampling distribution.
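In code, the symmetrization and the conditional row sampling can be as simple as the following sketch (our own illustration; the matrices $P$ and $\tilde{P}$ are assumed to have been computed as in the Appendix A sketch, with the arguments swapped).

```python
import numpy as np

def symmetrize(P, P_swapped):
    """P is an approximate transport matrix computed from ((w, x), (w_tilde, x_tilde)),
    P_swapped from the swapped pair of arguments; return the symmetrized matrix."""
    return 0.5 * (P + P_swapped.T)

def conditional_ancestors(P_sym, a, rng):
    """Draw a_tilde[k] from the normalized a[k]-th row of P_sym, for each k."""
    N = P_sym.shape[0]
    a_tilde = np.empty(len(a), dtype=int)
    for k, ak in enumerate(a):
        row = P_sym[ak]
        a_tilde[k] = rng.choice(N, p=row / row.sum())
    return a_tilde
```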

C Properties of the Rhee–Glynn smoothing estimator

This section provides proofs for the results of Section 4.

C.1 Proof of Lemma 4.1

For this proof, we use $f(x_{t-1}, dx_t)$ to denote the Markov transition kernel of the latent process, $m_0(dx_0)$ the initial distribution and $g_t(x_t) = g(y_t \mid x_t)$ the potential function of the model; thus we omit the parameter $\theta$. Let $\mathcal{F}_t$ denote the filtration generated by the coupled conditional particle filter at time $t$. We denote by $x_{0:t}^k$, for $k \in 1:N$, the surviving trajectories at time $t$. Let $I_t \subseteq 1:N-1$ be the set of common particles at time $t$, defined by
\[ I_t = \{ j \in 1:N-1 : x_{0:t}^j = \tilde{x}_{0:t}^j \}. \]
The meeting probability, implicitly conditioned upon the reference trajectories $x_{0:T}$ and $\tilde{x}_{0:T}$, can be bounded by:
\[ P(x'_{0:T} = \tilde{x}'_{0:T}) = E\Big[ \mathbf{1}\big( x_{0:T}^{b_T} = \tilde{x}_{0:T}^{\tilde{b}_T} \big) \Big] \geq \sum_{k=1}^{N-1} E\big[ \mathbf{1}(k \in I_T)\, P_T^{kk} \big] = (N-1)\, E\big[ \mathbf{1}(1 \in I_T)\, P_T^{11} \big] \geq \frac{N-1}{(N\bar{g})^2}\, E\big[ \mathbf{1}(1 \in I_T)\, g_T(x_T^1)\, g_T(\tilde{x}_T^1) \big], \tag{13} \]
where we have used Assumptions 1 and 2. Now, let $\psi_t : \mathbb{X}^t \to \mathbb{R}_+$ and consider
\[ E\big[ \mathbf{1}(1 \in I_t)\, \psi_t(x_{0:t}^1)\, \psi_t(\tilde{x}_{0:t}^1) \big] = E\big[ \mathbf{1}(1 \in I_t)\, \psi_t(x_{0:t}^1)^2 \big], \tag{14} \]
since the two trajectories agree on $\{1 \in I_t\}$. We have
\[ \mathbf{1}(1 \in I_t) \geq \sum_{k=1}^{N-1} \mathbf{1}(k \in I_{t-1})\, \mathbf{1}\big( a_{t-1}^1 = \tilde{a}_{t-1}^1 = k \big), \tag{15} \]
and thus
\[ E\big[ \mathbf{1}(1 \in I_t)\, \psi_t(x_{0:t}^1)^2 \big] \geq E\Big[ \sum_{k=1}^{N-1} \mathbf{1}(k \in I_{t-1})\, E\big[ \mathbf{1}\big(a_{t-1}^1 = \tilde{a}_{t-1}^1 = k\big)\, \psi_t(x_{0:t}^1)^2 \mid \mathcal{F}_{t-1} \big] \Big] = (N-1)\, E\Big[ \mathbf{1}(1 \in I_{t-1})\, E\big[ \mathbf{1}\big(a_{t-1}^1 = \tilde{a}_{t-1}^1 = 1\big)\, \psi_t(x_{0:t}^1)^2 \mid \mathcal{F}_{t-1} \big] \Big]. \tag{16} \]
The inner conditional expectation can be computed as
\[ E\big[ \mathbf{1}\big(a_{t-1}^1 = \tilde{a}_{t-1}^1 = 1\big)\, \psi_t(x_{0:t}^1)^2 \mid \mathcal{F}_{t-1} \big] = \sum_{k,\ell=1}^{N} P_{t-1}^{k\ell}\, \mathbf{1}(k = \ell = 1) \int \psi_t\big((x_{0:t-1}^k, x_t)\big)^2 f(x_{t-1}^k, dx_t) = P_{t-1}^{11} \int \psi_t\big((x_{0:t-1}^1, x_t)\big)^2 f(x_{t-1}^1, dx_t) \geq \frac{g_{t-1}(x_{t-1}^1)\, g_{t-1}(\tilde{x}_{t-1}^1)}{(N\bar{g})^2} \left( \int \psi_t\big((x_{0:t-1}^1, x_t)\big) f(x_{t-1}^1, dx_t) \right)^{2}, \tag{17} \]
where we have again used Assumptions 1 and 2. Furthermore, on $\{1 \in I_{t-1}\}$ it holds that $x_{0:t-1}^1 = \tilde{x}_{0:t-1}^1$ and therefore, combining Eqs. (14)–(17), we get
\[ E\big[ \mathbf{1}(1 \in I_t)\, \psi_t(x_{0:t}^1)\, \psi_t(\tilde{x}_{0:t}^1) \big] \geq \frac{N-1}{(N\bar{g})^2}\, E\Big[ \mathbf{1}(1 \in I_{t-1})\; g_{t-1}(x_{t-1}^1) \int \psi_t\big((x_{0:t-1}^1, x_t)\big) f(x_{t-1}^1, dx_t) \times g_{t-1}(\tilde{x}_{t-1}^1) \int \psi_t\big((\tilde{x}_{0:t-1}^1, x_t)\big) f(\tilde{x}_{t-1}^1, dx_t) \Big]. \tag{18} \]
Thus, if we define
\[ \psi_T(x_{0:T}) = g_T(x_T), \quad \text{and, for } t = 1, \ldots, T-1, \quad \psi_t(x_{0:t}) = g_t(x_t) \int \psi_{t+1}(x_{0:t+1})\, f(x_t, dx_{t+1}), \]
it follows that
\[ P(x'_{0:T} = \tilde{x}'_{0:T}) \geq \frac{(N-1)^T}{(N\bar{g})^{2T}}\, E\big[ \mathbf{1}(1 \in I_1)\, \psi_1(x_1^1)\, \psi_1(\tilde{x}_1^1) \big] = \frac{(N-1)^T}{(N\bar{g})^{2T}}\, E\big[ \psi_1(x_1^1)^2 \big] \geq \frac{(N-1)^T}{(N\bar{g})^{2T}}\, Z^2 > 0, \]
where $Z > 0$ is the normalizing constant of the model, defined as $E\big[\prod_{t=1}^T g_t(x_t)\big]$, where the expectation is with respect to the distribution $m_0(dx_0) \prod_{t=1}^T f(x_{t-1}, dx_t)$ of the latent process $x_{0:T}$.

We note that, for any fixed $T$, the bound goes to zero when $N \to \infty$. In Appendix D, we observe this behaviour when using systematic resampling for both particle systems; however we observe the opposite behaviour when using index-coupled resampling. The proof thus fails to capture accurately the behaviour of $\varepsilon$ in Lemma 4.1 as a function of $N$ and $T$.

C.2 Proof of Theorem 4.1

The proof follows in spirit that of Vihola (2015), which itself is related to those presented in Rhee (2013). We present the proof of Theorem 4.1 for a generalization of the estimator proposed in Section 4. Introduce a truncation variable $G$, with support on the integers $\{0, 1, 2, \ldots\}$. Define the estimator as
\[ H = \sum_{n=0}^{G} \frac{\Delta^{(n)}}{P(G \geq n)}, \tag{19} \]
where $\Delta^{(0)} = h(X^{(0)})$ and $\Delta^{(n)} = h(X^{(n)}) - h(\tilde{X}^{(n-1)})$, for $n \geq 1$. We consider the following assumption on the truncation variable.

Assumption 4. The truncation variable $G$ is Geometric, with probability mass function $P(G = n) = (1-p)^n p$, with support on $\{0, 1, 2, \ldots\}$ and parameter $p \in [0, 1)$, chosen such that $p < 1 - (1-\varepsilon)^{\delta/(2+\delta)}$, where $\varepsilon$ is as in Lemma 4.1 and $\delta$ as in Assumption 3.

This assumption precludes the use of a range of values of $p$ near one, which could have been a tempting choice. Indeed a large $p$ ensures that the truncation variable $G$ takes small values, and thus that the estimator of Eq. (19) has a small computational cost. On the other hand, it does not prevent the use of values of $p$ near $0$, and we retrieve the estimator of Eq. (5) by setting $p = 0$, ensuring that Assumption 4 is satisfied even if we do not know the values of $\varepsilon$ and $\delta$. For $p = 0$, the truncation variable $G$ is almost surely infinite, which corresponds to running the coupled conditional particle filter until the meeting time $\tau$ occurs, as in Algorithm 1.

We start with a remark on the distribution of the meeting time $\tau = \inf\{n \geq 0 : X^{(n)} = \tilde{X}^{(n-1)}\}$. We can upper-bound $P(\tau > n)$, for all $n \geq 2$, using Lemma 4.1 (see e.g. Williams (1991), exercise E.10.5). We obtain, for all $n \geq 2$,
\[ P(\tau > n) \leq (1 - \varepsilon)^{n-1}. \tag{20} \]
This ensures that $E[\tau]$ is finite, and that $\tau$ is almost surely finite. We then introduce the random variables
\[ \forall m \geq 1, \quad Z_m = \sum_{n=0}^{m} \frac{\Delta^{(n)}\, \mathbf{1}(n \leq G)}{P(n \leq G)}. \tag{21} \]
Since $\tau$ is almost surely finite, and since $\Delta^{(n)} = 0$ for all $n \geq \tau$, $Z_m \to Z_\tau = H$ almost surely when $m \to \infty$. We prove that $(Z_m)_{m \geq 1}$ is a Cauchy sequence in $L^2$, i.e.
\[ \sup_{m' \geq m} E\big[(Z_{m'} - Z_m)^2\big] \xrightarrow[m \to \infty]{} 0. \]
We write
\[ (Z_{m'} - Z_m)^2 = \sum_{n=m+1}^{m'} \frac{(\Delta^{(n)})^2\, \mathbf{1}(n \leq G)}{P(n \leq G)^2} + 2 \sum_{n=m+1}^{m'} \sum_{\ell=n+1}^{m'} \frac{\Delta^{(n)} \Delta^{(\ell)}\, \mathbf{1}(\ell \leq G)}{P(n \leq G)\, P(\ell \leq G)}, \]
and thus, using the independence between $G$ and $(\Delta^{(n)})_{n \geq 0}$,
\[ E\big[(Z_{m'} - Z_m)^2\big] = \sum_{n=m+1}^{m'} \frac{E\big[(\Delta^{(n)})^2\big]}{P(n \leq G)} + 2 \sum_{n=m+1}^{m'} \sum_{\ell=n+1}^{m'} \frac{E\big[\Delta^{(n)} \Delta^{(\ell)}\big]}{P(n \leq G)}. \]
We need to control the expectations $E\big[\Delta^{(n)} \Delta^{(\ell)}\big]$ for $\ell = n, \ldots, m'$.

To control $E\big[(\Delta^{(n)})^2\big]$, we use Hölder's inequality, with $p = 1 + \delta/2$ and $q = (2+\delta)/\delta$, where $\delta$ is as in Assumption 3:
\[ E\big[(\Delta^{(n)})^2\big] = E\big[(\Delta^{(n)})^2\, \mathbf{1}(\tau > n)\big] \leq E\big[(\Delta^{(n)})^{2p}\big]^{1/p}\, \big((1-\varepsilon)^{n-1}\big)^{1/q} \leq E\big[(\Delta^{(n)})^{2+\delta}\big]^{1/(1+\delta/2)}\, \big((1-\varepsilon)^{\delta/(2+\delta)}\big)^{n-1}. \]
Furthermore, using Assumption 3, there exists $C_1 < \infty$ such that, for $n_0 \in \mathbb{N}$ large enough,
\[ \forall n \geq n_0, \quad E\big[(\Delta^{(n)})^{2+\delta}\big]^{1/(1+\delta/2)} \leq C_1. \tag{22} \]
Indeed, using Minkowski's inequality with $p = 2 + \delta$, and denoting $\|X\|_p = E[|X|^p]^{1/p}$,
\[ \big\| \Delta^{(n)} \big\|_p = \big\| h(X^{(n)}) - h(\tilde{X}^{(n-1)}) \big\|_p \leq \big\| h(X^{(n)}) \big\|_p + \big\| h(\tilde{X}^{(n-1)}) \big\|_p = \big\| h(X^{(n)}) \big\|_p + \big\| h(X^{(n-1)}) \big\|_p \leq 2\, C^{1/(2+\delta)}, \]
where the second equality comes from the marginal distribution of $\tilde{X}^{(n)}$ being the same as the distribution of $X^{(n)}$, and the last inequality from Assumption 3. This gives Eq. (22).

We write $\eta = (1-\varepsilon)^{\delta/(2+\delta)}$, and take $m$ such that $m \geq n_0$. Using Cauchy–Schwarz, we have for all $n, \ell \geq m$,
\[ E\big[\Delta^{(n)} \Delta^{(\ell)}\big] \leq \Big( E\big[(\Delta^{(n)})^2\big]\, E\big[(\Delta^{(\ell)})^2\big] \Big)^{1/2} \leq C_1\, \eta^{(n-1)/2}\, \eta^{(\ell-1)/2}. \]
We can now write
\[ E\big[(Z_{m'} - Z_m)^2\big] \leq C_1 \sum_{n=m+1}^{m'} \frac{\eta^{n-1}}{P(n \leq G)} + 2 \sum_{n=m+1}^{m'} \sum_{\ell=n+1}^{m'} \frac{C_1\, \eta^{(n-1)/2}\, \eta^{(\ell-1)/2}}{P(n \leq G)}. \]
We compute
\[ \sum_{\ell=n+1}^{m'} \eta^{(\ell-1)/2} = (\sqrt{\eta})^{n}\, \frac{1 - \sqrt{\eta}^{\,m'}}{1 - \sqrt{\eta}}. \]
This leads to
\[ E\big[(Z_{m'} - Z_m)^2\big] \leq C_1 \sum_{n=m+1}^{m'} \frac{\eta^{n-1}}{P(n \leq G)} + 2 C_1 \sqrt{\eta}\, \frac{1 - \sqrt{\eta}^{\,m'}}{1 - \sqrt{\eta}} \sum_{n=m+1}^{m'} \frac{\eta^{n-1}}{P(n \leq G)}. \]
Under Assumption 4, we have $P(n \leq G) = (1-p)^{n}$. For the above series to go to zero when $m \to \infty$ and $m' \geq m$, it is enough that $\eta/(1-p) < 1$. By definition of $\eta$, this holds if
\[ (1-\varepsilon)^{\delta/(2+\delta)} < 1 - p \iff p < 1 - (1-\varepsilon)^{\delta/(2+\delta)}, \]
which is part of Assumption 4. Thus $(Z_m)_{m \geq 1}$ is a Cauchy sequence in $L^2$. By uniqueness of the limit, since $(Z_m)_{m \geq 1}$ goes almost surely to $H$, $(Z_m)_{m \geq 1}$ goes to $H$ in $L^2$. This shows that $H$ has finite first two moments. We can retrieve the expectation of $H$ by
\[ E[Z_m] = \sum_{n=0}^{m} E[\Delta^{(n)}] = E\big[h(X^{(m)})\big] \xrightarrow[m \to \infty]{} \pi(h), \]
according to Assumption 3. We can retrieve the second moment of $H$ by
\[ E[Z_m^2] = \sum_{n=0}^{m} \frac{E\big[(\Delta^{(n)})^2\big]}{P(n \leq G)} + 2 \sum_{n=0}^{m} \sum_{\ell=n+1}^{m} \frac{E\big[\Delta^{(n)} \Delta^{(\ell)}\big]}{P(n \leq G)} \xrightarrow[m \to \infty]{} \sum_{n=0}^{\infty} \frac{E\big[(\Delta^{(n)})^2\big] + 2 \sum_{\ell=n+1}^{\infty} E\big[\Delta^{(n)} \Delta^{(\ell)}\big]}{P(n \leq G)}. \]

D Experiments with the proposed smoother

We explore the sensitivity of the proposed smoother to various inputs, section by section. The experiments are based on the hidden auto-regressive model, with $d_x = 1$, and the data are generated with $\theta = 0.95$. Each experiment is replicated $R = 1,000$ times. We do not use the variance reduction techniques of Section 4.5 in this appendix.

D.1 Effect of the resampling scheme

First we investigate the role of the resampling scheme. The naive scheme is systematic resampling performed on each system with a common uniform variable. The second scheme is index-coupled resampling. In both cases, at the final step, we sample the trajectory indices $(b_T, \tilde{b}_T)$ using systematic resampling with a common uniform variable. We consider a time series of length $T = 20$. In Table 2, we give the average meeting time as a function of $N$, for both resampling schemes, with the standard deviations in parentheses. We see that the meeting time is orders of magnitude smaller when using index-coupled resampling. Secondly, the meeting time tends to decrease with $N$ when using index-coupled resampling, whereas it increases when using systematic resampling. For longer time series, we find that index-coupled resampling is the only viable option, and we thus focus on this scheme for the Rhee–Glynn smoother.

            systematic          index-coupled
N = 50      482.87 (472.96)     7.95 (7.41)
N = 100     462.88 (448.37)     4.88 (3.45)
N = 150     531.69 (546.32)     4.19 (2.68)
N = 200     569.49 (575.09)     4.01 (2.34)

Table 2: Average meeting time as a function of the number of particles N and of the resampling scheme. Standard deviations are between brackets. Results obtained in the hidden auto-regressive model with T = 20.

D.2 Effect of the number of particles

We consider the effect of the number of particles N , on the meeting time and on the variance of the resulting estimator. We use a time series of length T = 500, generated from the model. As seen in the previous section, when using index-coupled resampling, we expect the meeting time τ to occur sooner if N is larger. On the other hand, the cost of the coupled conditional particle filter is linear in N , so that the overall cost of obtaining each estimator H has expectation of order E[τ ] × N . We give estimators of this cost as a function of N in Table 3, as well as the average meeting time. We see that the cost per estimator decreases when N increases, and then increases again. There seems to be an optimal value of N yielding the minimum cost. 43

            cost                meeting time
N = 256     220567 (241823)     861.59 (944.62)
N = 512     17074 (17406)       33.35 (34)
N = 1024    7458 (5251)         7.28 (5.13)
N = 2048    8739 (4888)         4.27 (2.39)
N = 4096    14348 (6631)        3.5 (1.62)

Table 3: Average cost and meeting time, as a function of the number of particles $N$. Standard deviations are between brackets. Results obtained in the hidden auto-regressive model with $T = 500$.


Figure 12: Variance (left) and efficiency (right) of the estimator of the smoothing mean $E[x_t \mid y_{1:T}]$ for $T = 500$, in the hidden auto-regressive model. The efficiency takes into account the computational cost of the estimator. The y-axis is on the logarithmic scale.

We now consider the estimators $H_t$ of each smoothing mean $E[x_t \mid y_{1:T}]$, for $t \in 0:T$, i.e. we take $h$ to be the identity function. We compute the empirical variance of $H_t$, for each $t$, over the $R$ experiments. To take into account both variance and computational cost, we define the efficiency as $1/(V[H_t] \times E[\tau] \times N)$ and approximate this value for each $t$, using the $R$ estimators. The results are shown in Figure 12. We see that the variance explodes exponentially as $T - t$ increases. From Figure 12a, the variance is reduced when larger values of $N$ are used. Secondly, the variance is most reduced for the estimators of the first smoothing means, i.e. $E[x_t \mid y_{1:T}]$ for small $t$. As such, the efficiency is maximized for the largest values of $N$ only when $t$ is small, as can be seen from Figure 12b. For values of $t$ closer to $T$, the efficiency is higher for $N = 1,024$ and $N = 2,048$ than it is for $N = 4,096$.
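The variance and efficiency reported in Figure 12 can be computed from the $R$ independent estimators as in the following sketch; the array names are ours.

```python
import numpy as np

def variance_and_efficiency(H, tau, N):
    """H: array of shape (R, T+1), H[r, t] is the r-th estimator of E[x_t | y_{1:T}];
    tau: array of length R with the meeting times; N: number of particles.
    Returns the per-time empirical variance and the efficiency 1 / (V[H_t] * E[tau] * N)."""
    H, tau = np.asarray(H, dtype=float), np.asarray(tau, dtype=float)
    var_t = np.var(H, axis=0, ddof=1)
    efficiency_t = 1.0 / (var_t * np.mean(tau) * N)
    return var_t, efficiency_t
```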


D.3 Effect of the truncation variable

We now consider the use of Geometric truncation variables, as introduced in Appendix C above Assumption 4. We set T = 500 and N = 512. We try a few values of the Geometric probability p, in an attempt to reduce the computation cost per estimator. The value p = 0 corresponds to the estimator proposed in Section 4. The average meeting times are shown in Table 4.

            meeting time
p = 0       31.94 (31.26)
p = 0.025   17.66 (17.35)
p = 0.05    11.82 (11.68)

Table 4: Average meeting time, as a function of the Geometric parameter $p$. Standard deviations are between brackets. Results obtained in the hidden auto-regressive model with $T = 500$.

We plot the variance of the estimator of the smoothing mean $E[x_t \mid y_{1:T}]$ against $t$ in Figure 13a, for each $p$, and the efficiency, here $E[\min(G, \tau)] \times V[H_t]$, against $t$ in Figure 13b. First, from Figure 13a we see that increasing $p$ indeed leads to a higher variance. In particular, the value $p = 0.05$ leads to a much larger variance than the other values. This is in agreement with Assumption 4, which states that $p$ has to be below a certain threshold related to the meeting probability. In Figure 13b, we see that the increase in variance is compensated by a reduction in computational cost, for the smaller values of $p$. Therefore, the three smaller values lead to the same overall efficiency. On the other hand, the largest value $p = 0.05$ leads to a significantly lower efficiency. Thus, there does not seem to be much benefit in using $p \neq 0$ in this example.
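For completeness, the estimator with Geometric truncation, Eq. (19) in Appendix C, can be assembled as in the following sketch; $G$ is drawn before running the coupled chains, only terms up to $\min(G, \tau - 1)$ contribute since $\Delta^{(n)} = 0$ for $n \geq \tau$, and all names are ours.

```python
def truncated_estimator(h_chain, h_tilde_chain, tau, G, p):
    """Estimator of Eq. (19): sum of Delta^(n) / P(G >= n) for n = 0, ..., G.

    h_chain[n] = h(X^(n)), h_tilde_chain[n] = h(X_tilde^(n)), tau is the meeting
    time, G is a Geometric(p) truncation variable on {0, 1, 2, ...}."""
    H = h_chain[0]                                # Delta^(0), with P(G >= 0) = 1
    for n in range(1, min(G, tau - 1) + 1):
        delta_n = h_chain[n] - h_tilde_chain[n - 1]
        H += delta_n / (1.0 - p) ** n             # P(G >= n) = (1 - p)^n
    return H

# G can be drawn as rng.geometric(p) - 1 for p > 0 (numpy's geometric starts at 1),
# and set to float("inf") for p = 0, which recovers the estimator of Eq. (5).
```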

D.4 Effect of ancestor sampling

We consider the use of ancestor sampling, which requires being able to evaluate the transition density point-wise. We set T = 500 as before, and consider different values of N . The average meeting times are displayed in Table 5. We see that the meeting times are significantly reduced by using ancestor sampling, especially for smaller numbers of particles.

            without ancestor sampling     with ancestor sampling
N = 256     861.59 (944.62)               8.79 (3.33)
N = 512     33.35 (34)                    5.99 (2.27)
N = 1024    7.28 (5.13)                   4.51 (1.88)
N = 2048    4.27 (2.39)                   3.76 (1.63)
N = 4096    3.5 (1.62)                    3.34 (1.51)

Table 5: Average meeting time, as a function of the number of particles N , with and without ancestor sampling. Standard deviations are between brackets. Results obtained in the hidden auto-regressive model with T = 500.



Figure 13: Variance (left) and efficiency (right) of the smoothing estimator using a Geometric truncation variable with parameter $p$, for $T = 500$ and $N = 512$, in the hidden auto-regressive model. The efficiency takes into account the computational cost of the estimator. The y-axis is on the logarithmic scale.

We consider the variance and the efficiency, here defined as $1/(V[H_t] \times E[\tau] \times N)$. The results are shown in Figure 14, to be compared with Figure 12 obtained without ancestor sampling. First we see that the variance is significantly reduced by ancestor sampling. The variance seems to increase only slowly as $T - t$ increases, for each value of $N$. From Figure 14b, we see that the smallest value of $N$ now leads to the most efficient algorithm. In other words, for a fixed computational budget, it is better to produce more estimators with $N = 256$ than to increase the number of particles.

E Pseudo-code for particle filters

We provide pseudo-code for the bootstrap particle filter (Algorithm 3), the conditional particle filter (Algorithm 4), the coupled particle filter (Algorithm 5), and the coupled conditional particle filter (Algorithm 6).
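As a complement to the pseudo-code listings, here is a minimal Python sketch corresponding to Algorithm 3 below (bootstrap particle filter); the model functions passed as arguments and the toy AR(1) example are illustrative choices of ours, not part of the paper's code.

```python
import numpy as np

def bootstrap_pf(y, N, m0_sample, f_sample, g_density, rng):
    """Bootstrap particle filter of Algorithm 3.

    m0_sample(N, rng): sample N initial particles; f_sample(x, t, rng): propagate
    particles to time t; g_density(y_t, x): observation density, element-wise.
    Returns an estimator of the likelihood p(y_{1:T})."""
    T = len(y)
    x = m0_sample(N, rng)
    w = np.full(N, 1.0 / N)
    log_likelihood = 0.0
    for t in range(1, T + 1):
        ancestors = rng.choice(N, size=N, p=w)   # multinomial resampling
        x = f_sample(x[ancestors], t, rng)
        g = g_density(y[t - 1], x)
        log_likelihood += np.log(np.mean(g))     # log of N^{-1} sum_k g(y_t | x_t^k)
        w = g / np.sum(g)
    return np.exp(log_likelihood)

# toy usage: hidden AR(1) with coefficient 0.95 and an illustrative Gaussian
# observation model (not necessarily the observation model used in Appendix D)
rng = np.random.default_rng(0)
T_len, x_t, ys = 50, 0.0, []
for _ in range(T_len):
    x_t = 0.95 * x_t + rng.normal()
    ys.append(x_t + rng.normal())
lik = bootstrap_pf(
    np.array(ys), N=512,
    m0_sample=lambda N, rng: rng.normal(0.0, 1.0, size=N),
    f_sample=lambda x, t, rng: 0.95 * x + rng.normal(size=x.shape),
    g_density=lambda yt, x: np.exp(-0.5 * (yt - x) ** 2) / np.sqrt(2.0 * np.pi),
    rng=rng,
)
```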



Figure 14: Variance (left) and efficiency (right) of the estimator of the smoothing mean E[xt |y1:T ] for T = 500, in the hidden auto-regressive model, when using ancestor sampling. The efficiency takes into account the computational cost of the estimator. The y-axis is on the logarithmic scale.

Algorithm 3 Bootstrap particle filter, given a parameter $\theta$.

At step t = 0:
1. Draw $x_0^k \sim m_0(dx_0 \mid \theta)$, for all $k \in 1:N$. This can also be written $x_0^k = M(U_0^k, \theta)$, for all $k \in 1:N$.
2. Set $w_0^k = N^{-1}$, for all $k \in 1:N$.

At step t ≥ 1:
1. Draw ancestors $a_{t-1}^{1:N} \sim r(da^{1:N} \mid w_{t-1}^{1:N})$.
2. Draw $x_t^k \sim f(dx_t \mid x_{t-1}^{a_{t-1}^k}, \theta)$, for all $k \in 1:N$. This can also be written $x_t^k = F(x_{t-1}^{a_{t-1}^k}, U_t^k, \theta)$, for all $k \in 1:N$.
3. Compute $w_t^k \propto g(y_t \mid x_t^k, \theta)$, for all $k \in 1:N$, and normalize the weights.

Return the likelihood estimator $\hat{p}^N(y_{1:T} \mid \theta) = \prod_{t=1}^{T} N^{-1} \sum_{k=1}^{N} g(y_t \mid x_t^k, \theta)$.

Algorithm 4 Conditional particle filter, given a reference trajectory $x_{0:T}$ and $\theta$.

At step t = 0:
1. Draw $x_0^k \sim m_0(dx_0 \mid \theta)$, for $k \in 1:N-1$, and set $x_0^N = x_0$. This can also be written $x_0^k = M(U_0^k, \theta)$, for all $k \in 1:N-1$, and $x_0^N = x_0$.
2. Set $w_0^k = N^{-1}$, for $k \in 1:N$.

At step t ≥ 1:
1. Draw ancestors $a_{t-1}^{1:N-1} \sim r(da^{1:N-1} \mid w_{t-1}^{1:N})$, and set $a_{t-1}^N = N$.
2. Draw $x_t^k \sim f(dx_t \mid x_{t-1}^{a_{t-1}^k}, \theta)$, for all $k \in 1:N-1$, and set $x_t^N = x_t$. This can also be written $x_t^k = F(x_{t-1}^{a_{t-1}^k}, U_t^k, \theta)$, for all $k \in 1:N-1$, and $x_t^N = x_t$.
3. Compute $w_t^k \propto g(y_t \mid x_t^k, \theta)$, for all $k \in 1:N$, and normalize the weights.

Draw a trajectory:
1. Draw $b_T$ from a discrete distribution on $\{1, \ldots, N\}$, with probabilities $w_T^{1:N}$.
2. For $t = T-1, \ldots, 0$, set $b_t = a_t^{b_{t+1}}$.

Return $x'_{0:T} = (x_0^{b_0}, \ldots, x_T^{b_T})$.

Algorithm 5 Coupled particle filter, given parameters $\theta$ and $\tilde{\theta}$.

At step t = 0:
1. Draw $U_0^k$, and compute $x_0^k = M(U_0^k, \theta)$ and $\tilde{x}_0^k = M(U_0^k, \tilde{\theta})$, for all $k \in 1:N$.
2. Set $w_0^k = N^{-1}$ and $\tilde{w}_0^k = N^{-1}$, for all $k \in 1:N$.

At step t ≥ 1:
1. Compute a probability matrix $P_{t-1}$, with marginals $w_{t-1}^{1:N}$ and $\tilde{w}_{t-1}^{1:N}$. Sample $(a_{t-1}^k, \tilde{a}_{t-1}^k)$ from $P_{t-1}$, for all $k \in 1:N$.
2. Draw $U_t^k$, and compute $x_t^k = F(x_{t-1}^{a_{t-1}^k}, U_t^k, \theta)$ and $\tilde{x}_t^k = F(\tilde{x}_{t-1}^{\tilde{a}_{t-1}^k}, U_t^k, \tilde{\theta})$, for all $k \in 1:N$.
3. Compute $w_t^k \propto g(y_t \mid x_t^k, \theta)$ and $\tilde{w}_t^k \propto g(y_t \mid \tilde{x}_t^k, \tilde{\theta})$, for all $k \in 1:N$, and normalize the weights.

Return $\hat{p}^N(y_{1:T} \mid \theta) = \prod_{t=1}^{T} N^{-1} \sum_{k=1}^{N} g(y_t \mid x_t^k, \theta)$ and $\hat{p}^N(y_{1:T} \mid \tilde{\theta}) = \prod_{t=1}^{T} N^{-1} \sum_{k=1}^{N} g(y_t \mid \tilde{x}_t^k, \tilde{\theta})$.

Algorithm 6 Coupled conditional particle filter, given reference trajectories $x_{0:T}$ and $\tilde{x}_{0:T}$.

At step t = 0:
1. Draw $U_0^k$, compute $x_0^k = M(U_0^k, \theta)$ and $\tilde{x}_0^k = M(U_0^k, \theta)$, for $k \in 1:N-1$.
2. Set $x_0^N = x_0$ and $\tilde{x}_0^N = \tilde{x}_0$.
3. Set $w_0^k = N^{-1}$ and $\tilde{w}_0^k = N^{-1}$, for $k \in 1:N$.

At step t ≥ 1:
1. Compute a probability matrix $P_{t-1}$, with marginals $w_{t-1}^{1:N}$ and $\tilde{w}_{t-1}^{1:N}$. Sample $(a_{t-1}^k, \tilde{a}_{t-1}^k)$ from $P_{t-1}$, for all $k \in 1:N-1$.
2. Set $a_{t-1}^N = N$ and $\tilde{a}_{t-1}^N = N$.
3. Draw $U_t^k$, and compute $x_t^k = F(x_{t-1}^{a_{t-1}^k}, U_t^k, \theta)$ and $\tilde{x}_t^k = F(\tilde{x}_{t-1}^{\tilde{a}_{t-1}^k}, U_t^k, \theta)$, for all $k \in 1:N-1$.
4. Set $x_t^N = x_t$ and $\tilde{x}_t^N = \tilde{x}_t$.
5. Compute $w_t^k \propto g(y_t \mid x_t^k, \theta)$ and $\tilde{w}_t^k \propto g(y_t \mid \tilde{x}_t^k, \theta)$, for all $k \in 1:N$, and normalize the weights.

Draw a pair of trajectories:
1. Compute a probability matrix $P_T$, with marginals $w_T^{1:N}$ and $\tilde{w}_T^{1:N}$. Draw $(b_T, \tilde{b}_T)$ from $P_T$.
2. For $t = T-1, \ldots, 0$, set $b_t = a_t^{b_{t+1}}$ and $\tilde{b}_t = \tilde{a}_t^{\tilde{b}_{t+1}}$.

Return $x'_{0:T} = (x_0^{b_0}, \ldots, x_T^{b_T})$ and $\tilde{x}'_{0:T} = (\tilde{x}_0^{\tilde{b}_0}, \ldots, \tilde{x}_T^{\tilde{b}_T})$.

References

M. Cuturi and A. Doucet. Fast computation of Wasserstein barycenters. In Proceedings of the 31st International Conference on Machine Learning (ICML), pages 685–693, 2014.

D. Williams. Probability with martingales. Cambridge University Press, 1991.