Bayesian inverse problems with partial observations

arXiv:1802.08993v1 [math.ST] 25 Feb 2018

S. Gugushvili, A. W. van der Vaart and D. Yan
Mathematical Institute, Leiden University, The Netherlands
e-mail: [email protected]; [email protected]; [email protected]

Abstract: We study a nonparametric Bayesian approach to linear inverse problems under a discrete observation scheme. We use the discrete Fourier transform to convert our model into a truncated Gaussian sequence model that is closely related to the classical Gaussian sequence model. Upon placing the truncated series prior on the unknown parameter, we show that as the number of observations $n \to \infty$, the corresponding posterior distribution contracts around the true parameter at a rate depending on the smoothness of the true parameter and the prior, and on the ill-posedness degree of the problem. Correct combinations of these quantities lead to optimal posterior contraction rates (up to logarithmic factors). Similarly, the frequentist coverage of Bayesian credible sets is shown to depend on a combination of the smoothness of the true parameter and the prior, and the ill-posedness of the problem. Oversmoothing priors lead to zero coverage, while undersmoothing priors produce highly conservative results. Finally, we illustrate our theoretical results by numerical examples.

AMS 2000 subject classifications: Primary 62G20; secondary 35R30.
Keywords and phrases: Credible set, Gaussian prior, Gaussian sequence model, heat equation, inverse problem, nonparametric Bayesian estimation, posterior contraction rate, singular value decomposition, Volterra operator.

Contents

1 Introduction
  1.1 Notation
2 Sequence model
  2.1 Singular value decomposition
  2.2 Equivalent formulation
  2.3 Gaussian prior
3 Main results
  3.1 Contraction rates
  3.2 Credible sets
4 Simulation examples
5 Proofs
  5.1 Proof of Lemma 2.7
  5.2 Proof of Proposition 2.9
  5.3 Proof of Theorem 3.1
  5.4 Proof of Theorem 3.3
  5.5 Proof of Theorem 3.4
  5.6 Proof of Theorem 3.5
A Auxiliary lemmas
Acknowledgements
References

1. Introduction

Linear inverse problems have long been studied in the statistical and numerical analysis literature; see, e.g., [2], [5], [6], [7], [8], [11], [17], [18], [23], and references therein. Emphasis in these works has been on the white noise model,
$$Y = Af + \varepsilon W, \tag{1}$$
where the parameter of interest $f$ lies in some infinite-dimensional function space, $A$ is a linear operator with values in a possibly different space, $W$ is white noise, and $\varepsilon$ is the noise level. Applications of linear inverse problems include, e.g., computerized tomography, see [21], partial differential equations, see [16], and scattering theory, see [9].

Arguably, in practice one does not have access to a full record of observations on the unknown function $f$ as in the idealised model (1); rather, one observes it indirectly at a finite number of points. This statistical setting can be conveniently formalised as follows: let the signal of interest $f$ be an element of a Hilbert space $H_1$ of functions defined on the compact interval $[0,1]$. The forward operator $A$ maps $f$ into another Hilbert space $H_2$. We assume that $H_1, H_2$ are subspaces of $L_2([0,1])$, typically collections of functions of certain smoothness as specified in the later sections, and that the design points are chosen deterministically,
$$x_i = \frac{i}{n}, \qquad i = 1, \dots, n. \tag{2}$$
Defining
$$Y_i = Af(x_i) + \xi_i, \qquad i = 1, \dots, n, \tag{3}$$

with $\xi_i$ i.i.d. standard Gaussian random variables, our observations are the pairs $(x_i, Y_i)_{i\le n}$, and we are interested in estimating $f$. A prototype example we think of is the case where $A$ is the solution operator of the Dirichlet problem for the heat equation, acting on the initial condition $f$; see Example 2.6 below for details. Model (3) is related to the inverse regression model studied, e.g., in [3] and [4]. Although the setting we consider is somewhat special, our contribution is arguably the first to study from a theoretical point of view a nonparametric Bayesian approach to estimation of $f$ in the inverse problem setting with partial observations (see [14] for a monographic treatment of modern Bayesian nonparametrics). In the context of the white noise model (1), a nonparametric Bayesian approach has been studied thoroughly in [19] and [20], and techniques from these works will turn out to be useful in our context as well. Our results will deal with the derivation of posterior contraction rates and the study of the asymptotic frequentist coverage of Bayesian credible sets. A posterior contraction rate can be thought of as a Bayesian analogue of the convergence rate of a frequentist estimator, cf. [13] and [14]. Specifically, we will show that as the sample size $n \to \infty$, the posterior distribution concentrates around the 'true' parameter value under which the data have been generated; hence our Bayesian approach is consistent and asymptotically recovers the unknown 'true' $f$. The rate at which this occurs depends on the smoothness of the true parameter and the prior, and on the ill-posedness degree of the problem. Correct combinations of these quantities lead to optimal posterior contraction rates (up to logarithmic factors). Furthermore, a Bayesian approach automatically provides uncertainty quantification through the spread of the posterior distribution, specifically by means of posterior credible sets. We will give an asymptotic frequentist interpretation of these sets in our context. In particular, we will see that the frequentist coverage depends on a combination of the smoothness of the true parameter and the prior, and the ill-posedness of the problem. Oversmoothing priors lead to zero coverage, while undersmoothing priors produce highly conservative results.

The article is organized as follows: in Section 2 we give a detailed description of the problem, introduce the singular value decomposition, and convert the model (3) into an equivalent truncated sequence model that is better amenable to theoretical analysis. We show how a Gaussian prior in this sequence model leads to a Gaussian posterior and give an explicit characterisation of the latter. Our main results on posterior contraction rates and Bayesian credible sets are given in Section 3, followed by simulation examples in Section 4 that demonstrate the theoretical results. Section 5 contains the proofs of the main theorems, while the technical lemmas used in the proofs are collected in Appendix A.

1.1. Notation

The notational conventions used in this work are the following: definitions are marked by the $:=$ symbol; $|\cdot|$ denotes the absolute value and $\|\cdot\|_H$ the norm of the space $H$; $\langle\cdot,\cdot\rangle_H$ is the canonical inner product of the inner product space $H$; subscripts are omitted when there is no danger of confusion; $N(\mu, \Sigma)$ denotes the Gaussian distribution with mean $\mu$ and covariance operator $\Sigma$; subscripts $N_n$ and $N_H$ may be used to emphasise that the distribution is defined on the space $\mathbb{R}^n$ or on the abstract space $H$; $\mathrm{Cov}(\cdot,\cdot)$ denotes the covariance or the covariance operator, depending on the context; for sequences $\{a_n\}, \{b_n\}$ of real numbers, the notation $a_n \lesssim b_n$ and $a_n \gtrsim b_n$ means that there exist positive constants $C_1, C_2$, independent of $n$, such that $a_n \le C_1 b_n$, respectively $a_n \ge C_2 b_n$, holds for all $n$; finally, $a_n \asymp b_n$ indicates that the ratio $a_n/b_n$ is asymptotically bounded from above and below, while $a_n \sim b_n$ means $a_n/b_n \to 1$ as $n \to \infty$.


2. Sequence model

2.1. Singular value decomposition

We impose a common assumption from the inverse problems literature on the forward operator $A$; see, e.g., [2, 5, 6].

Assumption 1. The operator $A$ is injective and compact.

It follows that $A^*A$ is also compact and, in addition, self-adjoint. Hence, by the spectral theorem for self-adjoint compact operators, see [10], we have the representation
$$A^*Af = \sum_{k\in\mathbb{N}} a_k^2 f_k \varphi_k,$$
where $\{\varphi_k\}$ is the eigenbasis of $A^*A$ on $H_1$ with corresponding eigenvalues $\{a_k^2\}$, and $f_k = \langle f, \varphi_k\rangle$ are the Fourier coefficients of $f$. This decomposition of $A^*A$ is known as the singular value decomposition (SVD), and the $a_k$ are also called singular values. It is easy to show that the conjugate basis $\psi_k := A\varphi_k/a_k$ of the orthonormal basis $\{\varphi_k\}_k$ is again an orthonormal system in $H_2$ and gives a convenient basis for $\mathrm{Range}(A)$, the range of $A$ in $H_2$. Furthermore, the following relations hold (see [2]):
$$A\varphi_k = a_k\psi_k, \qquad A^*\psi_k = a_k\varphi_k. \tag{4}$$

Recall a standard result (see, e.g., [15]): the Hilbert space $H$ is isometric to $\ell_2$, and Parseval's identity $\|f\|_{\ell_2}^2 := \sum_k |f_k|^2 = \|f\|_H^2$ holds; here the $f_k$ are the Fourier coefficients with respect to some known and fixed orthonormal basis. We will employ the eigenbasis $\{\varphi_k\}$ of $A^*A$ to define the Sobolev space in which the unknown function $f$ resides.

Definition 2.1. We say that $f$ is in the Sobolev space $S^\beta$ with smoothness parameter $\beta \ge 0$ if it can be written as $f = \sum_{k=1}^\infty f_k\varphi_k$ with $f_k = \langle f, \varphi_k\rangle$, and if its norm
$$\|f\|_\beta := \Big(\sum_{k=1}^\infty f_k^2 k^{2\beta}\Big)^{1/2}$$
is finite.

Remark 2.2. The above definition agrees with the classical definition of the Sobolev space if the eigenbasis is the trigonometric basis, see, e.g., [22]. With a fixed basis, which is always the case in this article, one can identify the function $f$ with its Fourier coefficients $\{f_k\}_k$. Thus, we use $S^\beta$ to denote both the function space and the sequence space. For example, it is easy to verify that $S^0 = \ell_2$ (correspondingly, $S^0 = L_2$), $S^\beta \subset \ell_2$ for any nonnegative $\beta$, and $S^\beta \subset \ell_1$ when $\beta > 1/2$.

Recall that $Af = \sum a_i f_i \psi_i$. Then $Af \in S^{\beta+p}$ if $a_k \asymp k^{-p}$, and $Af \in S^\infty := \cap_{k\in\mathbb{N}} S^k$ if the $a_k$ decay exponentially fast. Such a lifting property is beneficial in the forward problem, since it helps to obtain a smooth solution. In the context of inverse problems, however, it creates a difficulty in the recovery of the original signal $f$, since information on it is washed out by smoothing. Hence, in the case of inverse problems one speaks not of the lifting property, but of ill-posedness, see [6].


Definition 2.3. An inverse problem is called mildly ill-posed if $a_k \asymp k^{-p}$ as $k \to \infty$, and extremely ill-posed if $a_k \asymp e^{-pk^s}$ with $s \ge 1$ as $k \to \infty$, where $p$ is strictly positive in both cases.

In the rest of the article we confine ourselves to the following setting.

Assumption 2. The unknown true signal $f$ in (3) satisfies $f \in S^\beta \subset H_1$ for $\beta > 0$. Furthermore, the ill-posedness is of one of the two types in Definition 2.3.

Remark 2.4. As an immediate consequence of the lifting property, we have $H_2 \subset H_1$.

We conclude this section with two canonical examples of the operator $A$.

Example 2.5 (mildly ill-posed case: Volterra operator [19]). The classical Volterra operator $A : L_2[0,1] \to L_2[0,1]$ and its adjoint $A^*$ are
$$Af(x) = \int_0^x f(s)\,ds, \qquad A^*f(x) = \int_x^1 f(s)\,ds.$$
The eigenvalues and eigenfunctions of $A^*A$ and the conjugate basis are given, for $i \ge 1$, by
$$a_i^2 = \frac{1}{(i-1/2)^2\pi^2}, \qquad \varphi_i(x) = \sqrt2\cos((i-1/2)\pi x), \qquad \psi_i(x) = \sqrt2\sin((i-1/2)\pi x).$$

Example 2.6 (extremely ill-posed case: heat equation [20]). Consider the Dirichlet problem for the heat equation:
$$\frac{\partial}{\partial t}u(x,t) = \frac{\partial^2}{\partial x^2}u(x,t), \qquad u(0,t) = u(1,t) = 0, \qquad u(x,0) = f(x), \qquad t \in [0,T], \tag{5}$$
where $u(x,t)$ is defined on $[0,1]\times[0,T]$ and $f \in L_2[0,1]$ satisfies $f(0) = f(1) = 0$. The solution of (5) is given by
$$u(x,t) = \sqrt2\sum_{k=1}^\infty f_k e^{-k^2\pi^2 t}\sin(k\pi x) =: Af(x),$$
where the $f_k$ are the coordinates of $f$ in the basis $\{\sqrt2\sin(k\pi x)\}_{k\ge1}$. For the solution map $A$, the eigenvalues of $A^*A$ are $e^{-k^2\pi^2 t}$, and the eigenbasis and conjugate basis coincide: $\varphi_k(x) = \psi_k(x) = \sqrt2\sin(k\pi x)$.
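To make the two degrees of ill-posedness concrete, here is a small numerical sketch (our illustration, not part of the paper; it assumes NumPy, and all names are ours). It applies both forward operators coefficientwise and shows how the singular values damp high frequencies: polynomially for the Volterra operator, exponentially for the heat equation.

```python
import numpy as np

k = np.arange(1, 11, dtype=float)

# Volterra operator (Example 2.5): a_k^2 = 1/((k - 1/2)^2 pi^2), so a_k ~ k^{-1} (p = 1).
a_volterra = 1.0 / ((k - 0.5) * np.pi)

# Heat equation (Example 2.6): the eigenvalues of A*A are e^{-k^2 pi^2 t},
# so a_k = e^{-k^2 pi^2 t / 2}; extremely ill-posed with s = 2 (t = 0.1 is our choice).
t = 0.1
a_heat = np.exp(-(k**2) * np.pi**2 * t / 2)

f = k**-1.5 * np.sin(k)  # coefficients of a signal in S^1, cf. Section 4
print("Volterra coefficients of Af:", a_volterra * f)
print("heat     coefficients of Af:", a_heat * f)
```

Already at $k = 10$ the heat-equation coefficients are of order $e^{-49}$, which is the lifting to $S^\infty$ in action, and the source of the logarithmic rates below.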


2.2. Equivalent formulation

In this subsection we develop a sequence formulation of the model (3), which is very suitable for Bayesian analysis. First, we briefly discuss the relevant results that motivate our reformulation of the problem.

In Examples 2.5 and 2.6, the sine and cosine bases form the eigenbasis. In fact, the Fourier basis (trigonometric polynomials) frequently arises as an eigenbasis for various operators, e.g. in the case of differentiation, see [12], or circular deconvolution, see [7]. For simplicity, we will use the Fourier basis as the primary example in the rest of the article; possible generalisations to other bases are discussed in Remark 2.8. The restriction of our attention to the Fourier basis is motivated by a special property it enjoys: discrete orthogonality. The next lemma illustrates this property for the sine basis (Example 2.6).

Lemma 2.7 (discrete orthogonality). Let $\{\psi_k\}_{k\in\mathbb{N}}$ be the sine basis, i.e. $\psi_k(x) = \sqrt2\sin(k\pi x)$, $k = 1, 2, 3, \dots$. Then:

(i.) Discrete orthogonality holds:
$$\langle\psi_j,\psi_k\rangle_d := \frac1n\sum_{i=1}^n \psi_j(i/n)\psi_k(i/n) = \delta_{jk}, \qquad j, k = 1, \dots, n-1. \tag{6}$$
Here $\delta_{jk}$ is the Kronecker delta.

(ii.) Fix $l \in \mathbb{N}$. For any fixed $1 \le k \le n-1$ and all $j \in \{ln, ln+1, \dots, (l+1)n-1\}$, there exists only one $\bar k \in \{1, 2, \dots, n-1\}$, depending only on the parity of $l$, such that for $\tilde j = ln + \bar k$ the equality
$$|\langle\psi_{\tilde j},\psi_k\rangle_d| = 1 \tag{7}$$
holds, while $\langle\psi_{\tilde j},\psi_k\rangle_d = 0$ for all $\tilde j = ln + \tilde k$ such that $\tilde k \ne \bar k$, $\tilde k \in \{1, 2, \dots, n-1\}$.

Remark 2.8. For other trigonometric bases, discrete orthogonality can also be attained. For instance, the conjugate eigenbasis in Example 2.5 is discretely orthogonal for the design points $\{(i-1/2)/n\}_{i=1,\dots,n}$. We refer to [1] and references therein for details. With some changes in the arguments, our asymptotic statistical results remain valid under such modifications of the design points in (2).

Motivated by the observations above, we introduce our central assumption on the basis functions.

Assumption 3. Given the design points $\{x_i\}_{i=1,\dots,n}$ in (2), we assume that the conjugate basis $\{\psi_k\}_{k\in\mathbb{N}}$ of the operator $A$ in (3) possesses the following properties:

(i.) for $1 \le j, k \le n-1$,
$$\langle\psi_j,\psi_k\rangle_d := \frac1n\sum_{i=1}^n \psi_j(x_i)\psi_k(x_i) = \delta_{jk};$$


(ii.) for $1 \le k \le n-1$ and $j \in \{ln, \dots, (l+1)n-1\}$ with fixed $l \in \mathbb{N}$, there exists only one $\tilde j = ln + \bar k$ such that $0 < |\langle\psi_{\tilde j},\psi_k\rangle_d| < M$, where $M$ is a fixed constant and $\bar k$ depends on the parity of $l$ only; for all other $j \ne \tilde j$, $|\langle\psi_j,\psi_k\rangle_d| = 0$.

Using the shorthand notation
$$f = \sum_j f_j\varphi_j = \sum_{j=1}^{n-1} f_j\varphi_j + \sum_{j\ge n} f_j\varphi_j =: f^n + f^r,$$
we obtain for $k = 1, \dots, n-1$ that
$$U_k = \frac1n\sum_{i=1}^n Y_i\psi_k(x_i) = \langle Af^n,\psi_k\rangle_d + \langle Af^r,\psi_k\rangle_d + \frac1n\sum_{i=1}^n \xi_i\psi_k(x_i) = a_k f_k + R_k + \frac{1}{\sqrt n}\zeta_k, \tag{8}$$
where
$$R_k := \langle Af^r,\psi_k\rangle_d, \qquad \zeta_k := \frac{1}{\sqrt n}\sum_{i=1}^n \xi_i\psi_k(x_i).$$
By Lemma 2.7 we have
$$|R_k| = |\langle Af^r,\psi_k\rangle_d| \le \sum_{j\ge n} a_j|f_j|\,|\langle\psi_j,\psi_k\rangle_d| = \sum_{l=1}^\infty a_{ln+\bar k}|f_{ln+\bar k}|. \tag{9}$$
In addition, by straightforward calculus for normal random variables and Assumption 3 (i.), $E\zeta_k = 0$ and $\mathrm{Cov}(\zeta_k,\zeta_l) = \delta_{kl}$, so that the $\zeta_k$ are i.i.d. $N(0,1)$. We can write (8) concisely as
$$U^n = (A^n + R^n)f + \frac{1}{\sqrt n}\zeta^n, \tag{10}$$
where $A^n, R^n$ are understood as bounded linear diagonal operators from $H_1 = S^\beta \subset \ell_2$ to $V_n = \mathrm{span}\{\psi_j\}_{j<n}$.

2.3. Gaussian prior

On the unknown coefficients we place a truncated Gaussian series prior $\Pi_n$: the coordinates $f_1, \dots, f_{n-1}$ are independent with $f_k \sim N(0, \lambda_k)$, while $f_k = 0$ for $k \ge n$.

Proposition 2.9. Under the prior $\Pi_n$, the posterior distribution is Gaussian, $f \mid U^n \sim N(P_nU^n, \Sigma_n)$, where the linear operator $P_n$ and the covariance operator $\Sigma_n$ act coordinatewise: for $k = 1, \dots, n-1$,
$$f_k \mid U^n \sim N\Big(\frac{na_k\lambda_k}{na_k^2\lambda_k+1}\,U_k,\ \frac{\lambda_k}{na_k^2\lambda_k+1}\Big),$$
while the remaining coordinates are degenerate at zero. Moreover, the posterior and the prior are equivalent in the sense of absolute continuity of probability measures.
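The reduction (8)–(10) together with the conjugate posterior of Proposition 2.9 amounts to a few lines of linear algebra. The sketch below (our illustration under the setting of Example 2.6; it assumes NumPy, and all variable names are ours) generates data from (3), checks the discrete orthogonality (6) numerically, forms the transformed observations $U_k$, and computes the posterior means and variances for the prior $\lambda_k = \rho_n^2 k^{-1-2\alpha}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = np.arange(1, n + 1) / n                        # design points (2)
k = np.arange(1, n, dtype=float)                   # frequencies 1, ..., n - 1
psi = np.sqrt(2) * np.sin(np.pi * np.outer(k, x))  # sine basis at the design points

# Discrete orthogonality (6): (1/n) sum_i psi_j(x_i) psi_k(x_i) = delta_jk.
assert np.allclose(psi @ psi.T / n, np.eye(n - 1), atol=1e-8)

# Data from (3) in the heat equation example; a_k = e^{-k^2 pi^2 t/2}
# (square root of the A*A eigenvalue in Example 2.6), with t = 0.1 our choice.
t = 0.1
a = np.exp(-(k**2) * np.pi**2 * t / 2)
f0 = k**-1.5 * np.sin(k)                 # true coefficients; f0 lies in S^1
Y = (a * f0) @ psi + rng.standard_normal(n)

# Transform (8): U_k = (1/n) sum_i Y_i psi_k(x_i) = a_k f_k + R_k + zeta_k / sqrt(n).
U = psi @ Y / n

# Posterior of Proposition 2.9 for the prior lambda_k = rho^2 k^{-1-2*alpha}.
alpha, rho = 1.0, 1.0
lam = rho**2 * k ** (-1.0 - 2 * alpha)
post_mean = n * a * lam * U / (n * a**2 * lam + 1)
post_var = lam / (n * a**2 * lam + 1)
```

Because the prior truncates at level $n-1$ and the noise enters (10) with variance $1/n$, all posterior quantities are exact finite-dimensional Gaussian computations; no MCMC is needed.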

3. Main results

3.1. Contraction rates

Theorem 3.1 (Posterior contraction: mildly ill-posed problem). Let the problem be mildly ill-posed, $a_k \asymp k^{-p}$ with $p > 0$, let the true parameter satisfy $f_0 \in S^\beta$ with $\beta > 0$, and assume furthermore that $\beta + p > 1/2$. By letting $\lambda_k = \rho_n^2 k^{-1-2\alpha}$ with $\alpha > 0$ and any positive $\rho_n$ satisfying $\rho_n^2 n \to \infty$, we have, for any $K > 0$ and $M_n \to \infty$,
$$\sup_{\|f_0\|_\beta\le K} E_{f_0}\Pi_n\big(f : \|f-f_0\|_{H_1} \ge M_n\varepsilon_n \mid U^n\big) \to 0,$$
where
$$\varepsilon_n = \varepsilon_{n,1}\vee\varepsilon_{n,2} = (\rho_n^2 n)^{-\frac{\beta}{2\alpha+2p+1}\wedge 1} \vee \rho_n(\rho_n^2 n)^{-\frac{\alpha}{2\alpha+2p+1}}. \tag{13}$$

In particular,

(i.) if $\rho_n = 1$, then $\varepsilon_n = n^{-(\alpha\wedge\beta)/(2\alpha+2p+1)}$;
(ii.) if $\beta \le 2\alpha+2p+1$ and $\rho_n \asymp n^{(\alpha-\beta)/(2\beta+2p+1)}$, then $\varepsilon_n = n^{-\beta/(2\beta+2p+1)}$;
(iii.) if $\beta > 2\alpha+2p+1$, then for every scaling $\rho_n$, $\varepsilon_n \gg n^{-\beta/(2\beta+2p+1)}$.

Thus we recover the same posterior contraction rates as obtained in [19], at the cost of the extra constraint $\beta + p > 1/2$. The frequentist minimax convergence rate for mildly ill-posed problems in the white noise setting is $n^{-\beta/(2\beta+2p+1)}$, see [6], and we compare our result to this rate. Our theorem states that in case (i.) the posterior contraction rate reaches the frequentist optimal rate if the regularity of the prior matches that of the truth ($\alpha = \beta$) and the scaling factor $\rho_n$ is fixed. Alternatively, as in case (ii.), the optimal rate can also be attained by proper scaling, provided a sufficiently regular prior is used. In all other cases the contraction rate is slower than the minimax rate. Our results are similar to those in [19] in the white noise setting.

The extra constraint $\beta + p > 1/2$ demands an explanation. As (24) shows, the size of the negligible terms $R_k$ in (8) decreases as the smoothness $\beta + p$ of the transformed signal $Af$ increases. In order to control $R_k$, a minimal smoothness of $Af$ is required. The latter is guaranteed if $p + \beta \ge 1/2$, for it is known that in that case $Af$ is at least continuous, while it may fail to be so if $p + \beta < 1/2$, see [22].

Remark 3.2. The control on $R_k$ in (9) relies on the eigenbasis possessing the properties in Assumption 3. If instead one only assumes $|\langle\psi_j,\psi_k\rangle_d| \le 1$ for any $k \le n$ and $j > n$, the constraint on the smoothness of $Af$ has to be strengthened to $\beta + p \ge 1$ in order to obtain the same results as in Theorem 3.1, because the condition $\beta + p \ge 1$ guarantees that the control on $R_k$ in (24) remains valid.

Now we consider the extremely ill-posed problem. The following result holds.

Theorem 3.3 (Posterior contraction: extremely ill-posed problem). Let the problem be extremely ill-posed, $a_k \asymp e^{-pk^s}$ with $s \ge 1$, and let the true parameter $f_0 \in S^\beta$ with $\beta > 0$. Let $\lambda_k = \rho_n^2 k^{-1-2\alpha}$ with $\alpha > 0$ and any positive $\rho_n$ satisfying $\rho_n^2 n \to \infty$. Then
$$\sup_{\|f_0\|_\beta\le K} E_{f_0}\Pi_n\big(f : \|f-f_0\|_{H_1} \ge M_n\varepsilon_n \mid U^n\big) \to 0,$$


for any $K > 0$ and $M_n \to \infty$, where
$$\varepsilon_n = \varepsilon_{n,1}\vee\varepsilon_{n,2} = \big(\log(\rho_n^2 n)\big)^{-\beta/s} \vee \rho_n\big(\log(\rho_n^2 n)\big)^{-\alpha/s}. \tag{14}$$

In particular,

(i.) if $\rho_n = 1$, then $\varepsilon_n = (\log n)^{-(\alpha\wedge\beta)/s}$;
(ii.) if $n^{-1/2+\delta} \lesssim \rho_n \lesssim (\log n)^{(\alpha-\beta)/s}$ for some $\delta > 0$, then $\varepsilon_n = (\log n)^{-\beta/s}$.

Furthermore, if $\lambda_k = \exp(-\alpha k^s)$ with $\alpha > 0$, the contraction rate $\varepsilon_n = (\log n)^{-\beta/s}$ is obtained.

Since the frequentist minimax estimation rate for extremely ill-posed problems in the white noise setting is $(\log n)^{-\beta/s}$ (see [6]), Theorem 3.3 shows that the optimal contraction rate can be reached by a suitable choice of the regularity of the prior or by an appropriate scaling. In contrast to the mildly ill-posed case, there is no extra constraint on the smoothness of $Af$. The reason is obvious: because the signal is lifted to $S^\infty$ by the forward operator $A$, the term (26) converges to zero exponentially fast, implying that $R_k$ in (8) is always negligible.

3.2. Credible sets

In the Bayesian paradigm, the spread of the posterior distribution is a common measure of uncertainty in parameter estimates. In this section we study the frequentist coverage of Bayesian credible sets in our problem. When the posterior is Gaussian, it is customary to consider credible sets centred at the posterior mean, which is what we do as well. In addition, because in our case the covariance operator of the posterior distribution does not depend on the data, the radius of the credible ball is determined by the credibility level $1-\gamma$ and the sample size $n$ alone. A credible ball centred at the posterior mean $\hat f = P_nU^n$ is given by
$$P_nU^n + B(r_{n,\gamma}) := \{f \in H_1 : \|f - P_nU^n\|_{H_1} \le r_{n,\gamma}\}, \tag{15}$$
where the radius $r_{n,\gamma}$ is determined by the requirement that
$$\Pi_n\big(P_nU^n + B(r_{n,\gamma}) \mid U^n\big) = 1-\gamma. \tag{16}$$
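Since the posterior is Gaussian with a covariance that does not depend on the data, under the posterior $\|f - P_nU^n\|^2$ is distributed as $\sum_k \sigma_k^2 Z_k^2$ with $Z_k$ i.i.d. standard normal, so $r_{n,\gamma}$ is a deterministic quantile that can be computed once, for instance by Monte Carlo. A sketch (our illustration, assuming NumPy; `post_var` is the vector of posterior variances, e.g. from the snippet in Section 2.3):

```python
import numpy as np

def credible_radius(post_var, gamma=0.05, draws=10_000, seed=1):
    """Monte Carlo solution of (16): the (1 - gamma)-quantile of
    ||f - f_hat|| when f is drawn from the posterior."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((draws, post_var.size))
    norms = np.sqrt((post_var * z**2).sum(axis=1))
    return np.quantile(norms, 1 - gamma)
```

The coverage (17) below can then be estimated empirically by simulating many data sets under $f_0$ and recording the fraction of them for which $\|f_0 - P_nU^n\| \le r_{n,\gamma}$.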

By definition, the frequentist coverage, or confidence, of the set (15) is
$$P_{f_0}\big(f_0 \in P_nU^n + B(r_{n,\gamma})\big), \tag{17}$$
where the probability measure is the one induced by the law of $U^n$ given in (10) with $f = f_0$. We are interested in the asymptotic behaviour of the coverage (17) as $n \to \infty$, for fixed $f_0$, uniformly in Sobolev balls, and also along sequences $f_0^n$ changing with $n$. The following two theorems hold.


Theorem 3.4 (Credible sets: mildly ill-posed problem). Assume the same assumptions as in Theorem 3.1 hold, and let $\tilde\beta = \beta\wedge(2\alpha+2p+1)$. The asymptotic coverage of the credible set (15) is

(i.) 1, uniformly in $\{f_0 : \|f_0\|_{S^{\tilde\beta}} \le 1\}$, if $\rho_n \gg n^{(\alpha-\tilde\beta)/(2\tilde\beta+2p+1)}$;
(ii.) 1, for every fixed $f_0 \in S^{\tilde\beta}$, if $\beta < 2\alpha+2p+1$ and $\rho_n \asymp n^{(\alpha-\tilde\beta)/(2\tilde\beta+2p+1)}$; $c$, along some $f_0^n$ with $\sup_n\|f_0^n\|_{S^{\tilde\beta}} < \infty$, if $\rho_n \asymp n^{(\alpha-\tilde\beta)/(2\tilde\beta+2p+1)}$ (any $c \in [0,1)$);
(iii.) 0, along some $f_0^n$ with $\sup_n\|f_0^n\|_{S^{\tilde\beta}} < \infty$, if $\rho_n \ll n^{(\alpha-\tilde\beta)/(2\tilde\beta+2p+1)}$.

Theorem 3.5 (Credible sets: extremely ill-posed problem). Assume the setup of Theorem 3.3. If $\lambda_k = \rho_n^2 k^{-1-2\alpha}$ with $\alpha > 0$ and any positive $\rho_n$ satisfying $\rho_n^2 n \to \infty$, the asymptotic coverage of the credible set (15) is

(i.) 1, uniformly in $\{f_0 : \|f_0\|_{S^\beta} \le 1\}$, if $\rho_n \gg (\log n)^{(\alpha-\beta)/2}$;
(ii.) 1, uniformly in $f_0$ with $\|f_0\|_{S^\beta} \le r$ with $r$ small enough, and 1, for any fixed $f_0 \in S^\beta$, provided $\rho_n \asymp (\log n)^{(\alpha-\beta)/s}$;
(iii.) 0, along some $f_0^n$ with $\sup_n\|f_0^n\|_{S^\beta} < \infty$, if $\rho_n \lesssim (\log n)^{(\alpha-\beta)/s}$.

Moreover, if $\lambda_k = e^{-\alpha k^s}$ with $\alpha > 0$ and any positive $\rho_n$ satisfying $\rho_n^2 n \to \infty$, the asymptotic coverage of the credible set (15) is

(iv.) 0, for every $f_0$ such that $|f_{0,i}| \gtrsim e^{-ci^s/2}$ for some $c < \alpha$.

For the two theorems in this section, the most intuitive explanation is offered by the case $\rho_n \equiv 1$. The situations (i.), (ii.) and (iii.) then correspond to $\alpha < \beta$, $\alpha = \beta$ and $\alpha > \beta$, respectively. The message is that oversmoothing priors ((iii.) in Theorem 3.4 and (iii.), (iv.) in Theorem 3.5) lead to disastrous frequentist coverage of credible sets, while undersmoothing priors ((i.) in both theorems) deliver very conservative frequentist results (coverage 1). With the right regularity of the prior (case (ii.)), the outcome depends on the norm of the true parameter $f_0$. Our results are thus similar to those obtained in the white noise setting in [19, 20].

4. Simulation examples

In this section we carry out a small-scale simulation study illustrating our theoretical results. The examples we use to that end are those given in Section 2.1; they were also used in the simulations in [19, 20]. In the setting of Example 2.5, we use the true signal
$$f_0(x) = \sum_{k=1}^\infty f_{0,k}\,\varphi_k(x) \quad\text{with}\quad f_{0,k} = k^{-3/2}\sin(k). \tag{18}$$
It is easy to check that $f_0 \in S^1$. In the setup of Example 2.6, the initial condition is taken to be
$$f_0(x) = 4x(x-1)(8x-5). \tag{19}$$
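The smoothness claims for (18) and (19) are easy to check numerically. The sketch below (our illustration, assuming NumPy and SciPy) computes the first few sine coefficients of (19) by quadrature; they match the closed-form expression for $f_{0,k}$ displayed after Figure 1.

```python
import numpy as np
from scipy.integrate import quad

f0 = lambda x: 4 * x * (x - 1) * (8 * x - 5)

# Sine coefficients f_{0,k} = int_0^1 f0(x) sqrt(2) sin(k pi x) dx by quadrature.
num = [quad(lambda x: f0(x) * np.sqrt(2) * np.sin(j * np.pi * x), 0, 1)[0]
       for j in range(1, 6)]

# Closed form (see the display below); the k^{-3} decay gives f0 in S^beta, beta < 5/2.
k = np.arange(1, 6)
closed = 8 * np.sqrt(2) * (13 + 11 * (-1.0) ** k) / (np.pi**3 * k**3)
print(np.allclose(num, closed))  # True
```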


Fig 1. Realizations of the posterior mean (red) and the 950 of 1000 draws from the posterior (coloured thin lines) with smallest $L_2$ distance to the posterior mean. From left to right, the columns correspond to sample sizes $10^3$, $10^4$ and $10^5$. The true parameter (black) has smoothness $\beta = 1$ and coefficients $f_{0,k} = k^{-3/2}\sin(k)$.

One can verify that in this case
$$f_{0,k} = \frac{8\sqrt2\,(13+11(-1)^k)}{\pi^3 k^3},$$
and $f_0 \in S^\beta$ for any $\beta < 5/2$.

First, we generate noisy observations $\{Y_i\}_{i=1,\dots,n}$ from the observation scheme (3), at the design points $x_i = (i-1/2)/n$ in the case of the Volterra operator, and $x_i = i/n$ in the case of the heat equation. Next, we apply the transform described in (8) and obtain the transformed observations $\{U_i\}_{i=1,\dots,n}$. Then, by Proposition 2.9, the posterior of the coefficients with respect to the eigenbasis $\varphi_i$ is given by
$$f_k \mid U^n \sim N\Big(\frac{na_k\lambda_k U_k}{na_k^2\lambda_k+1},\ \frac{\lambda_k}{na_k^2\lambda_k+1}\Big).$$
Figures 1 and 2 contain plots of 95% $L_2$-credible bands for different sample sizes and different priors. For all priors we take $\rho_n \equiv 1$ and use different smoothness degrees $\alpha$, as shown in the titles of the subplots; the columns from left to right correspond to $10^3$, $10^4$ and $10^5$ observations. The (estimated) credible bands are obtained by generating 1000 realizations from the posterior and retaining the 95% of them that are closest in $L_2$-distance to the posterior mean; a code sketch of this construction is given below.
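A minimal sketch of the band construction (our illustration; it reuses `psi`, `post_mean` and `post_var` from the snippet in Section 2.3, and all names are ours). For the heat equation the eigenbasis and the conjugate basis coincide, so the same matrix maps coefficients to function values.

```python
import numpy as np

rng = np.random.default_rng(2)
draws = 1000
samples = post_mean + np.sqrt(post_var) * rng.standard_normal((draws, post_var.size))

# Keep the 95% of draws closest to the posterior mean in L2; by Parseval's identity
# the distance between coefficient vectors equals the L2 distance of the functions.
dist = np.linalg.norm(samples - post_mean, axis=1)
kept = samples[np.argsort(dist)[: int(0.95 * draws)]]

# Evaluate the retained draws on the grid: f(x_i) = sum_k f_k phi_k(x_i).
band = kept @ psi              # phi_k = psi_k in the heat equation example
lower, upper = band.min(axis=0), band.max(axis=0)
```

The envelope of the retained draws, `[lower, upper]`, is the estimated credible band plotted in Figures 1 and 2.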

Fig 2. Realizations of the posterior mean (red) and the 950 of 1000 draws from the posterior (coloured thin lines) with smallest $L_2$ distance to the posterior mean. From left to right, the columns correspond to sample sizes $10^3$, $10^4$ and $10^5$. The true parameter (black) lies in $S^\beta$ for every $\beta < 5/2$ and is given by (19).

The two simulations reflect several common features. First, owing to the inverse nature of the problem, recovery of the true signal is relatively slow: the posteriors for sample size $10^3$ are still rather diffuse around the true parameter value. Second, it is evident that undersmoothing priors (the top rows in the figures) deliver conservative credible bands that nevertheless capture the truth. On the other hand, oversmoothing priors lead to overconfident, narrow bands that fail to capture the truth (bottom rows in the figures). As anticipated, due to the greater degree of ill-posedness, recovery of the initial condition in the heat equation case is more difficult than recovery of the true function in the case of the Volterra operator. Finally, we remark that the qualitative behaviour of the posterior in our examples is similar to that observed in [19, 20]; the discreteness of the observation scheme does not appear to have a noticeably adverse effect compared to the fully observed case in [19, 20].

5. Proofs

5.1. Proof of Lemma 2.7

This proof is a modification of that of Lemma 1.7 in [22]. With the temporary definitions $a := e^{i\pi j/n}$ and $b := e^{i\pi k/n}$, using Euler's formula, we


have
$$\langle\psi_j,\psi_k\rangle_d = -\frac{1}{2n}\sum_{s=1}^n (a^s-a^{-s})(b^s-b^{-s}) = -\frac{1}{2n}\sum_{s=1}^n\big[(ab)^s-(a/b)^s-(a/b)^{-s}+(ab)^{-s}\big] = -\frac{1}{2n}\Big[\underbrace{\sum_{s=1}^n(ab)^s}_{A} - \underbrace{\sum_{s=1}^n(a/b)^s}_{B} - \underbrace{\sum_{s=1}^n(a/b)^{-s}}_{C} + \underbrace{\sum_{s=1}^n(ab)^{-s}}_{D}\Big]. \tag{20}$$
Furthermore,
$$ab = e^{i\pi(j+k)/n}, \qquad \frac ab = e^{i\pi(j-k)/n}.$$
Observe that when $ab \ne 1$, we have
$$A = \frac{ab(1-(ab)^n)}{1-ab}, \qquad D = \frac{1-(ab)^{-n}}{ab-1}, \qquad A+D = \frac{ab(1-(ab)^n)-(1-(ab)^{-n})}{1-ab}.$$
Similarly, if $a/b \ne 1$,
$$B+C = \frac{(a/b)(1-(a/b)^n)-(1-(a/b)^{-n})}{1-(a/b)}.$$
We fix $1 \le k \le n-1$ and discuss the different situations depending on $j$.

(I.) $1 \le j \le n-1$ and $j+k \ne n$. Since $0 < j+k < 2n$ and $j+k \ne n$, we always have $ab = e^{i\pi(j+k)/n} \ne 1$, and the terms $A$ and $D$ can be calculated as above. Similarly, since $-n < j-k < n$, $a/b = 1$ only when $j = k$. Moreover, $j+k$ and $j-k$ have the same parity, and so $j = k$ is only possible if $j+k$ is even.

(i.) $j+k$ is even. In this case $(ab)^n = 1$, which leads to $A = D = 0$. Further, if $j = k$, we have $a/b = b/a = 1$ and $B = C = n$, so that (20) equals 1. Otherwise, if $j \ne k$, we have $a/b \ne 1$ and $(a/b)^n = 1 = (b/a)^n$ (since $j-k$ is even), and so
$$B = \frac{(a/b)(1-(a/b)^n)}{1-a/b} = 0, \qquad C = 0,$$
which implies that (20) equals 0. Hence (20) equals $\delta_{jk}$ in this case.

(ii.) $j+k$ is odd. We have $(ab)^n = (a/b)^n = -1$, which results in $A+D = B+C = -2$, and so (20) equals 0.

(II.) $1 \le j < n$ and $j+k = n$. We have $ab = -1$. Arguing as above, if $n$ is odd, $A+D = -2$ and $B+C = -2$. If $n$ is even, $A = D = 0$ and $B = C = n\delta_{jk}$.


The remaining cases follow the same arguments, and hence we omit the (lengthy and elementary) calculations.

(III.) $j = ln$ with $l \in \mathbb{N}$. It can be shown that $A+D = B+C$ always holds.

(IV.) $j \in \{ln+1, \dots, (l+1)n-1\}$. When $l$ is even, one obtains $\langle\psi_j,\psi_k\rangle_d = \delta_{\tilde jk}$, where $\tilde j = j - ln$. Otherwise, for odd $l$, $\langle\psi_j,\psi_k\rangle_d = -\delta_{\tilde jk}$, where $\tilde j = (l+1)n - j$.

5.2. Proof of Proposition 2.9

The expression for the posterior follows from arguments analogous to those in the proof of Proposition 3.1 in [19]. Note that the operator $R^n$ disappears from the expression for the posterior thanks to our choice of the prior. Finally, the fact that the posterior and the prior are equivalent in the sense of absolute continuity of probability measures is a consequence of the fact that both can be identified with non-degenerate Gaussian distributions on $\mathbb{R}^{n-1}$.

5.3. Proof of Theorem 3.1

From Proposition 2.9 we have $f \mid U^n \sim N(P_nU^n, \Sigma_n) =: N(\hat f, \Sigma_n)$, or coordinatewise,
$$f_k \mid U^n \sim N\Big(\underbrace{\frac{na_k\lambda_k}{na_k^2\lambda_k+1}U_k}_{\hat f_k},\ \underbrace{\frac{\lambda_k}{na_k^2\lambda_k+1}}_{\sigma_k^2}\Big), \tag{21}$$
for $k = 1, \dots, n-1$, and otherwise $\hat f_k = \sigma_k = 0$ (resulting in a degenerate normal distribution). In this proof we use the notation $\|\cdot\| = \|\cdot\|_{H_1} = \|\cdot\|_{\ell_2}$. To show
$$\sup_{\|f_0\|_\beta\le K} E_{f_0}\Pi_n\big(f : \|f-f_0\| \ge M_n\varepsilon_n \mid U^n\big) \to 0,$$
we first apply Markov's inequality to the integrand,
$$M_n^2\varepsilon_n^2\,\Pi_n\big(f : \|f-f_0\|^2 \ge M_n^2\varepsilon_n^2 \mid U^n\big) \le \int\|f-f_0\|^2\,d\Pi_n(f\mid U^n).$$
By the bias–variance decomposition,
$$\int\|f-f_0\|^2\,d\Pi_n(f\mid U^n) = \|\hat f - f_0\|^2 + \|\sigma\|^2,$$
where $\sigma = (\sigma_k)_k$. Because $\sigma$ is deterministic,
$$E_{f_0}\big[\Pi_n\{f : \|f-f_0\| \ge M_n\varepsilon_n \mid U^n\}\big] \le \frac{1}{M_n^2\varepsilon_n^2}\Big(E_{f_0}\|\hat f-f_0\|^2 + \|\sigma\|^2\Big).$$


Since $M_n \to \infty$ is assumed, it suffices to show that the terms in brackets are bounded by a constant multiple of $\varepsilon_n^2$, uniformly in $f_0$ in the Sobolev ellipsoid. Under (10), $\hat f$ has distribution $N_{n-1}(P_nA^nf_0 + P_nR^nf_0,\ \frac1n P_nP_n^*) =: N_{n-1}(E_{f_0}\hat f, T)$. Equivalently, by (8) and (21), the coordinate entries are given by
$$\hat f_k = \underbrace{\frac{na_k^2\lambda_k}{na_k^2\lambda_k+1}f_{0,k} + \frac{na_k\lambda_k}{na_k^2\lambda_k+1}R_k}_{(E_{f_0}\hat f)_k} + \underbrace{\frac{\sqrt n\,a_k\lambda_k}{na_k^2\lambda_k+1}}_{\tau_k}\,\zeta_k,$$
for $k < n$, and $\hat f_k = 0$ for $k \ge n$. Hence
$$E_{f_0}\|\hat f-f_0\|^2 = \|E_{f_0}\hat f-f_0\|^2 + \|\tau\|^2 = \|E_{f_0}\hat f-f_0^n\|^2 + \|f_0^r\|^2 + \|\tau\|^2,$$
where $\tau = (\tau_k)_k$ and
$$f_0^n = (f_{0,1}, \dots, f_{0,n-1}, 0, \dots), \qquad f_0^r = (0, \dots, 0, f_{0,n}, f_{0,n+1}, \dots).$$
We need a uniform upper bound, over the ellipsoid $\{f_0 : \|f_0\|_{S^\beta} \le K\}$, for
$$\|E_{f_0}\hat f-f_0^n\|^2 + \|f_0^r\|^2 + \|\tau\|^2 + \|\sigma\|^2. \tag{22}$$
We have
$$\|E_{f_0}\hat f-f_0^n\|^2 = \sum_{k=1}^{n-1}\Big(\frac{na_k^2\lambda_k}{na_k^2\lambda_k+1}f_{0,k} + \frac{na_k\lambda_k}{na_k^2\lambda_k+1}R_k - f_{0,k}\Big)^2 \lesssim \underbrace{\sum_{k=1}^{n-1}\frac{1}{(na_k^2\lambda_k+1)^2}f_{0,k}^2}_{A_1} + n\sup_{k<n}R_k^2.$$

Appendix A: Auxiliary lemmas

Lemma A.1. For $t, u \ge 0$, $v > 0$, $p > 0$, $0 \le r < pv$ and $s \ge 1$,
$$\sup_{\|f\|_{S^q}\le 1}\ \sum_{i=1}^\infty \frac{f_i^2\,i^{-t}e^{-ri^s}}{(1+Ni^{-u}e^{-pi^s})^v} \asymp N^{-r/p}(\log N)^{-t/s-2q/s+ru/ps},$$
as $N \to \infty$. In addition, for any fixed $f \in S^q$,
$$N^{r/p}(\log N)^{t/s+2q/s-ru/ps}\sum_{i=1}^\infty \frac{f_i^2\,i^{-t}e^{-ri^s}}{(1+Ni^{-u}e^{-pi^s})^v} \to 0,$$

as $N \to \infty$.

Lemma A.2 (Lemma 6.2 in [20]). For $t, u \ge 0$, $v > 0$, $p > 0$, $0 < r < vp$ and $s \ge 1$, as $N \to \infty$,
$$\sum_{i=1}^\infty \frac{i^{-t}e^{-ri^s}}{(1+Ni^{-u}e^{-pi^s})^v} \asymp N^{-r/p}(\log N)^{-t/s+ru/ps}.$$
If $r = 0$ and $t > 1$, while the other assumptions remain unchanged,
$$\sum_{i=1}^\infty \frac{i^{-t}}{(1+Ni^{-u}e^{-pi^s})^v} \asymp (\log N)^{(-t+1)/s}.$$

Lemma A.3 (Lemma 6.4 in [20]). Assume $s \ge 1$. Let $I_N$ be the solution in $i$ of $Ni^{-u}e^{-pi^s} = 1$, for $u \ge 0$ and $p > 0$. Then
$$I_N \sim \Big(\frac1p\log N\Big)^{1/s}.$$

Lemma A.4 (Lemma 6.5 in [20]). Let $s \ge 1$. As $K \to \infty$, we have

(i.) for $a > 0$ and $b \in \mathbb{R}$,
$$\int_1^K e^{ax^s}x^b\,dx \sim \frac{1}{as}e^{aK^s}K^{b-s+1};$$

(ii.) for $a, b, K > 0$,
$$\int_K^\infty e^{-ax^s}x^{-b}\,dx \le \frac{1}{as}e^{-aK^s}K^{-b-s+1}.$$

Acknowledgements

The research leading to the results in this paper has received funding from the European Research Council under ERC Grant Agreement 320637.

References

[1] Akansu, A. N. and Agirman-Tosun, H. (2010). Generalized discrete Fourier transform with nonlinear phase. IEEE Transactions on Signal Processing 58 4547–4556.
[2] Alquier, P., Gautier, E. and Stoltz, G. (2011). Inverse Problems and High-Dimensional Estimation: Stats in the Château Summer School, August 31 – September 4, 2009. Lecture Notes in Statistics. Springer.
[3] Birke, M., Bissantz, N. and Holzmann, H. (2010). Confidence bands for inverse regression models. Inverse Problems 26 115020.
[4] Bissantz, N., Dette, H. and Proksch, K. (2012). Model checks in inverse regression models with convolution-type operators. Scand. J. Stat. 39 305–322.
[5] Bissantz, N., Hohage, T., Munk, A. and Ruymgaart, F. (2007). Convergence rates of general regularization methods for statistical inverse problems and applications. SIAM Journal on Numerical Analysis 45 2610–2636.
[6] Cavalier, L. (2008). Nonparametric statistical inverse problems. Inverse Problems 24 034004.
[7] Cavalier, L. and Tsybakov, A. (2002). Sharp adaptation for inverse problems with random noise. Probability Theory and Related Fields 123 323–354.
[8] Cohen, A., Hoffmann, M. and Reiss, M. (2004). Adaptive wavelet Galerkin methods for linear inverse problems. SIAM Journal on Numerical Analysis 42 1479–1501.
[9] Colton, D. and Kress, R. (2012). Inverse Acoustic and Electromagnetic Scattering Theory. Applied Mathematical Sciences. Springer, New York.
[10] Conway, J. B. (1990). A Course in Functional Analysis. Graduate Texts in Mathematics. Springer.
[11] Donoho, D. L. (1995). Nonlinear solution of linear inverse problems by wavelet–vaguelette decomposition. Applied and Computational Harmonic Analysis 2 101–126.
[12] Efromovich, S. (1998). Simultaneous sharp estimation of functions and their derivatives. Ann. Statist. 26 273–278.
[13] Ghosal, S., Ghosh, J. K. and van der Vaart, A. W. (2000). Convergence rates of posterior distributions. Ann. Statist. 28 500–531.
[14] Ghosal, S. and van der Vaart, A. (2017). Fundamentals of Nonparametric Bayesian Inference. Cambridge Series in Statistical and Probabilistic Mathematics 44. Cambridge University Press, Cambridge.
[15] Haase, M. (2014). Functional Analysis: An Elementary Introduction. Graduate Studies in Mathematics. American Mathematical Society.
[16] Isakov, V. (2013). Inverse Problems for Partial Differential Equations. Applied Mathematical Sciences. Springer, New York.
[17] Kaipio, J. and Somersalo, E. (2006). Statistical and Computational Inverse Problems. Applied Mathematical Sciences. Springer, New York.
[18] Kirsch, A. (2011). An Introduction to the Mathematical Theory of Inverse Problems. Applied Mathematical Sciences. Springer.
[19] Knapik, B. T., van der Vaart, A. W. and van Zanten, J. H. (2011). Bayesian inverse problems with Gaussian priors. Ann. Statist. 39 2626–2657.
[20] Knapik, B. T., van der Vaart, A. W. and van Zanten, J. H. (2013). Bayesian recovery of the initial condition for the heat equation. Communications in Statistics – Theory and Methods 42 1294–1313.
[21] Natterer, F. (2001). The Mathematics of Computerized Tomography. Classics in Applied Mathematics. Society for Industrial and Applied Mathematics.
[22] Tsybakov, A. B. (2008). Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer.
[23] Wahba, G. (1977). Practical approximate solutions to linear operator equations when the data are noisy. SIAM Journal on Numerical Analysis 14 651–667.