Discrete-Time Martingales and the Kalman Filter

39 downloads 3471 Views 615KB Size Report
5 May 2011 ... 2.1.3 Elementary Formula and Probability Density Functions . ...... Martingales and Markov chains both appear often in probability theory, so we ...
Discrete-Time Martingales and the Kalman Filter Mark Webster Student Number: 200313814 Module Code: MATH5003M Supervisor: Dr. J. Voss May 5, 2011

Abstract This report covers the material required to talk about martingales, an important type of random process, and the Kalman Filter, a method for determining the value of a process with noisy observations. We begin with a definition of Lebesgue integration, and build up the probability theory necessary to talk about expectation and conditional expectation. This is followed by more in-depth descriptions of martingales, and some simulations in R. The final section covers the basic theory of the Kalman filter, a discrete-time method for determining the state of a process given noisy measurements and a dynamical model of said state. The appendices contain a brisk coverage of the material required from measure theory.

Contents Notation Table

3

1 Introduction

4

2 Expectation and Conditional Expectation 2.1 Expectation . . . . . . . . . . . . . . . . . . 2.1.1 Definition and Basic Theorems . . . 2.1.2 Variance and Inner Products . . . . 2.1.3 Elementary Formula and Probability 2.2 Conditional Expectation . . . . . . . . . . . 2.2.1 The Fundamental Theorem . . . . . 2.2.2 Example . . . . . . . . . . . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

. . . . . . .

5 . 5 . 5 . 6 . 7 . 9 . 9 . 10

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

. . . . . . . .

12 12 13 15 17 18 20 21 21

4 Filtering 4.1 Bayes’ Formula For Bivariate Normal Distributions . . . . . . . . 4.1.1 Recursive Property in Probability Distribution Functions 4.1.2 Bayes’ Formula . . . . . . . . . . . . . . . . . . . . . . . . 4.2 Single Random Variable . . . . . . . . . . . . . . . . . . . . . . . 4.2.1 System Model and Filter . . . . . . . . . . . . . . . . . . 4.2.2 R source code: Filtering a Single Value . . . . . . . . . . 4.3 Series of Variables; Kalman Filter . . . . . . . . . . . . . . . . . . 4.3.1 System Model and Filter . . . . . . . . . . . . . . . . . . 4.3.2 R source code: Kalman Filter . . . . . . . . . . . . . . . . 4.3.3 Example: Moving on a Line . . . . . . . . . . . . . . . . . 4.3.4 R Source Code: Movement Problem . . . . . . . . . . . . 4.3.5 Extension for Multiple Processes and Observations . . . . 4.3.6 Comments . . . . . . . . . . . . . . . . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

. . . . . . . . . . . . .

22 22 22 23 23 23 24 27 27 29 31 34 39 42

A Appendices A.1 Measures . . . . . A.2 Events . . . . . . . A.3 Random Variables A.4 Independence . . . A.5 Integration . . . .

. . . . .

. . . . .

. . . . .

. . . . .

43 43 44 44 45 45

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Density Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

3 Martingales 3.1 Definition . . . . . . . . . . . . . . . . . . . . . . . 3.1.1 Examples: Martingales and Markov Chains 3.1.2 Stopping Times . . . . . . . . . . . . . . . . 3.1.3 R source code: Random Walk . . . . . . . . 3.2 The Convergence Theorem . . . . . . . . . . . . . 3.3 Further Results . . . . . . . . . . . . . . . . . . . . 3.3.1 Orthogonality of Increments . . . . . . . . . 3.3.2 L´evy’s Upward Theorem . . . . . . . . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

Bibliography

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . . . . .

. . . . .

. . . . . . . .

. . . . .

. . . . . . . .

. . . . .

. . . . . . . .

. . . . .

. . . . . . . .

. . . . .

. . . . . . . .

. . . . .

. . . . . . . .

. . . . .

. . . . .

47 2

Notation Table Symbol Ω ω X, Y, Z Xn , Yn , Zn Cn C •X Ln log s.t. ∃ ∀ |X| |X|p hXi ΛX µ, ν λ F, G IA  T ∧n N Z+ R+ n gn Hn2 , Kn2 Zn Vn mΣ SF P(A) N (µ, σ 2 ) φn rn , r n

Meaning Set of possible events (A.2.1) An event in Ω (A.2.1) Random variable (A.3.2) Random process Pre-visible process (3.1.15) Martingale transform (3.1.16) Lebesgue space (2.1.12) Exponential logarithm “such that” “there exists” “for all” Lp -norm (2.1.12) Inner or scalar product (2.1.16) Law of X (A.3.3) Measure (A.1.4), or mean Lebesgue measure (A.1.5, A.1.8) σ-algebras (A.1.1) Indicator function Small value min(T, n) {1, 2, 3, . . .} {0, 1, 2, . . .} [0, ∞) Timestep of process Shift value in Kalman filter (4.3.1) Variance of noise in filter (4.2.1, 4.3.1) Mean of estimate (4.2.1, 4.3.1) Variance of estimate (4.2.1, 4.3.1) Set of Σ-measurable functions (A.3.1) Set of simple functions (A.5.2) Power set of A (2.1.2) Normal distribution (2.1.23) Kalman gain factor (4.3.5) Residual, or innovation (4.3.5)

3

Equivalent in sources Ω ω X, Y, Z Mn for martingales Cn C •X Ln log, ln “such that” N/A ∀ |X| kXkp hXi ΛX µ Leb F, G IA  T ∧n N Z+ R+ n, k gn , u(k) Hn2 , Kn2 or Q(k), R(k) Zn , x ˆ(k|k) Vn , P (k|k) mΣ SF P(A) N (µ, σ 2 ) K(k) ˜− z n

Chapter 1

Introduction We look at two interesting and important ideas. Martingales are random processes that, on average, stay the same. From this simple definition we get a lot of results that allow us to give useful results, particularly in the case when the martingale stops: in this case we get useful results about the value the martingale ends at. This is very useful, since a lot of random processes are martingales, and we can also use it to easily give results about other processes; see Example 3.1.27 for an example. Say we have a process which we wish to observe the value of: we know the dynamics the process follows, but our measurements of the current value are noisy, so blindly reading the measurements could result in large errors. The second idea we look at, the Kalman filter, is an important algorithm by which we can estimate what the value actually is, and thus it has many applications, since we almost always have to measure something without perfect instruments to do so. Chapter 2 covers the theory of expectation, an essential part of anything where we wish to know the expected, or average, value of a random variable, either in general (expectation) or given certain information (conditional expectation). Some of the beginning material here comes from results in measure theory, in particular the idea of using Lebesgue integration instead of Riemann integration: additional information is available in the Appendix if desired. Chapter 3 defines martingales, and gives an outline of the theorems the definition now allows us to use. There is also a comparison with Markov chains, another common type of random process. There are several examples given in this chapter, mostly based on scenarios in gambling. In the next chapter they are used to show that the filters eventually have a fixed variance in their estimates. Martingales have a lot of uses, and only a few are presented here.The examples and R source code are the author’s own unless otherwise stated; other material is from [8] unless otherwise stated. Chapter 4 deals with the subject of filtering, where we wish to estimate the value of a process given only noisy observations and knowledge of the dynamics of the process and the observations. The notation and most of the basic theory are from [8], with additional material from [1, 6, 7]; examples, and material in Section 4.3.5 are author’s own unless otherwise stated. The Appendix contains the basic material from measure theory that is needed to talk about probability and expectation: it is a brisk summary of the first few chapters of [8]. All R programs and this report can be found at http://www.maths.leeds.ac.uk/ ~voss/projects/2010-martingales/.

4

Chapter 2

Expectation and Conditional Expectation The theory of martingales depends heavily on the use of conditional expectations. We therefore first describe expectation and conditional expectation, and the associated theorems needed. Material in this section is from [8] unless otherwise marked. Not all of the material in this chapter is required to study martingales: the important parts will be listed at the beginning of each section.

2.1

Expectation

Important parts in this section are Definitions 2.1.1, 2.1.8, 2.1.10, 2.1.12, 2.1.15, 2.1.16, 2.1.19, 2.1.21, Theorems 2.1.3, 2.1.7, 2.1.9, 2.1.11, 2.1.14, and Examples 2.1.2, 2.1.23.

2.1.1

Definition and Basic Theorems

Definition 2.1.1 The expectation of X is Z E(X) = XdP ZΩ = X(ω)P(dω) ω∈Ω

= P(X), using the Lebesgue integral as defined in Definition A.5.1, where P is a probability measure as described in Definition A.2.1, and where Ω is a set of events as defined in Definition A.2.1. Example 2.1.2 Take [Own Example] Ω = {x1 , x2 , . . . , xn } ⊆ R with xi 6= xj ∀i 6= j, X(ω) = ω, FP= P(Ω), where P(Ω) is the power set of Ω, the set of all subsets of Ω. n Let P(A) = i=1 Ixi ∈A pi , so pi = P(X = xi ). We now have Z E(X) =

XdP Ω X = xi P(X = xi ) i

=

X

xi P({xi })

i

=

X

xi pi .

i

5

Taking P(Xn → X) = 1 for sequence (Xn )n∈N and some variable X, we have the following basic theorems [8, Section 6.2], derived from their equivalents in Appendix A.5: Theorem 2.1.3 (Monotone-Convergence Theorem) For 0 ≤ Xn ↑ X, E(Xn ) ↑ E(X) ≤ ∞. Theorem 2.1.4 (Fatou’s Lemma) Xn ≥ 0 ⇒ E(X) ≤ lim inf E(Xn ). Theorem 2.1.5 (Dominated-Convergence Theorem) If ∃Y such that |Xn (ω)| ≤ Y (ω) ∀n, w and E(Y ) < ∞, then E|Xn − X| → 0 s.t. E(Xn ) → E(X). Theorem 2.1.6 (Scheff´ e’s Lemma) E|Xn | → E|X| ⇐⇒ E|Xn − X| → 0. We also have the following new theorems: Theorem 2.1.7 (Bounded Convergence Theorem) If ∃ finite k s.t. |Xn (ω)| ≤ k ∀n, ω, then E(Xn − X) → 0. We will need this theorem later to prove one of the main theorems in Chapter 3, Doob’s Optional Stopping Theorem (Theorem 3.1.21). R Definition 2.1.8 E(X, F ) = F X(ω)P(dω) = E(XIF ), where F ∈ F, where F is a filtration as described in Definition A.2.1. Theorem 2.1.9 (Markov’s Inequality) Let X ∈ mΣ, i.e. be a Σ-measurable function as defined in Definition A.3.1, and g be a non-decreasing, Borel-measurable function g : R 7→ [0, ∞], with the Borel sigma-algebra on R as defined in Example A.1.3. Then E(X) ≥ E(g(X), X ≥ c) ≥ g(c)P(X ≥ c). This theorem will be useful in the next section, where we want to show the existence of a conditional expectation.

2.1.2

Variance and Inner Products

Definition 2.1.10 A function c : G 7→ R for G ⊂ R is convex on G if its graph lies below any of its chords, i.e. if for some x, y ∈ G c(λx + (1 − λ)y) < λc(x) + (1 − λ)c(y) ∀0 < λ < 1. For example, the function c(x) = x2 on R has (λx + (1 − λ)y)2 − (λx2 + (1 − λ)y 2 )

=

λ2 x2 + 2λ(1 − λ)xy + (1 − λ)2 y 2 −λx2 − (1 − λ)y 2

=

−λ(1 − λ)(y − x)2

>

0.

So x2 is convex on R. Theorem 2.1.11 (Jenson’s Inequality) For convex c, E(X) < ∞, P(X ∈ G) = 1 for an open subset G ⊂ R, and E|c(X)| < ∞, we have E(c(X)) ≥ c(E(X)). This comes in useful in Chapter 4, when we want to show that the estimate of a single value is a martingale. Definition 2.1.12 For 1 ≤ p < ∞, we say X ∈ Lp if E|X|p < ∞. Then |X|p = 1 (E|X|p ) p , where | · |p is the Lp -norm. Lp is also a vector space. 6

Theorem 2.1.13 (Monoticity of Lp -norms) Take p ≤ r, Y ∈ Lr . Then Y ∈ Lp , and |Y |p ≤ |Y |r . Theorem 2.1.14 (Schwartz Inequality) For X, Y ∈ L2 , i) XY ∈ L1 and E(XY ) ≤ E|XY | ≤ |X|2 |Y |2 . ii) X + Y ∈ L2 , |X + Y |2 ≤ |X|2 + |Y |2 . This theorem, like Theorem 2.1.9, will be used in the section on conditional expectation. ˜ = X −µX , Y˜ = Definition 2.1.15 Take X, Y ∈ L2 , µX = E(X), µY = E(Y ). Then X 2 1 ˜ ˜ Y − µY are in L . By the Schwartz Inequality, X Y ∈ L . We then define the co˜ Y˜ ) = E(X − µX )(Y − µY ) = E(XY ) − µX µY , and the variance Cov(X, Y ) = E(X variance Var(X) = Cov(X, X) = E(X 2 ) − µ2X . This gives the usual form of variance and covariance in probability and statistics; it comes particularly useful in Chapter 4. Definition 2.1.16 The inner, or scalar product hX, Y i = E(XY ). Then, the hX,Y i . The correlation of X angle between X and Y obeys equation cos θ = |X| 2 |Y |2 ˜ ˜

hX,Y i and Y , corr(X, Y ) = cos θ = |X| ˜ 2 |Y˜ |2 . We say that X, Y are orthogonal, or perpendicular, when hX, Y i = 0; we also write this as X⊥Y . In this case, on 2 2 2 ˜ Y˜ , in probabilistic language this L2 we have |X + Y |2 = |X|2 + |Y |2 . Taking X, becomes Var(X + Y ) = Var(X) + Var(Y ) when Cov(X, Y ) = 0.

This is known as pythagoras’ theorem. Using inner products in L2 will be shown later to be an easy – and usually possible – way to show the existence of a conditional expectation. Theorem 2.1.17 (Parallelogram Law) By the bilinearity of h·, ·i, 2

2

|X + Y |2 + |X − Y |2 = hX + Y, X + Y i + hX − Y, X − Y i 2

2

= 2|X|2 + 2|Y |2 .

2.1.3

Elementary Formula and Probability Density Functions

In this section we introduce the idea of a probability density function. This allows us to integrate over the real numbers to find the expectation, variance, or other functions of a random variable, by giving us a function that describes the probability of the variable being in any interval. Theorem 2.1.18 (Completeness of Lp ) Let p ∈ [1, ∞). Then if (Xn ) is a Cauchy sequence in Lp , i.e. lim sup |Xr − Xs |p = 0, k→∞ r,s≥k

then ∃X ∈ Lp such that Xr → X ∈ Lp , i.e. lim |Xr − X|p = 0.

r→∞

Definition 2.1.19 Take subspace K ⊂ L2 and a Cauchy sequence (Vn ∈ K), Vn → V ∈ K. Then ∀X ∈ L2 ∃Y ∈ K s.t. |X − Y |2 = inf{|X − W |2 | W ∈ K}, (X − Y )⊥Z ∀Z ∈ K. Then we say Y is the orthogonal projection of X onto K, and is almost surely unique. 7

Theorem 2.1.20 (Elementary Formula for Expectation) Let h be a Borel-measurable function, and ΛX ∈ (R, B) be the law of X, i.e. ΛX (B) = P(X ∈ B) as defined in Definition A.3.3. Then h(X) ∈ L1 ⇐⇒ h ∈ L1 (R, B, ΛX ). R In this case, E(h(X)) = ΛX (h) = P h(X)ΛX (dx). Definition 2.1.21 The probability density function fX forR X, if it exists, is a Borel-measurable function fX : R 7→ [0, ∞] such that P(X ∈ B) = B fX (x)dx for B ∈ X B. We also write this as fX = dΛ dλ , where λ is the Lebesgue measure on B, described in Examples A.1.5 and A.1.8, and using the notation for density from Definition A.5.15. Note: This means that, if the R we have a probability density function for X, we can now write the expectation as XfX dλ, by Lemma A.5.16. In particular, if xfX (x) is R also riemann-integrable and X(ω) = ω, we can also write the expectation as xf (x)dx, and the two integrals are equal. Example 2.1.22 The uniform distribution U [a, b] on interval [a, b] has constant R b dx 1 . Then the mean of X, E(X) = a xb−a = b+a probability density function fX (x) = b−a 2 , i.e. the midpoint of the interval, and the variance, E(X 2 ) − E(X)2 = = = = =

b

x2 dx (b + a)2 − 4 a b−a 3 3 b −a (b + a)2 − 3(b − a) 4 2 2 b2 + 2ab + a2 b − ab + a − 3 4 b2 + 2ab + a2 12 (b + a)2 . 12

Z

Example 2.1.23 The normal distribution N (µ, σ 2 ) has probability density function fX (x) =

√ 1 e− 2πσ 2

(x−µ)2 2σ 2

Z

. Then the mean of X,



(x−µ)2 x √ e− 2σ2 dx 2 2πσ −∞  Z ∞ 2 (x−µ)2 x − µ − (x−µ) µ √ = e 2σ2 + √ e− 2σ2 dx 2πσ 2 2πσ 2 −∞ Z ∞ (x−µ)2 1 √ =µ e− 2σ2 dx 2πσ 2 −∞ = µ,

E(X) =

and the variance, 2

2

Z



x2

e−

(x−µ)2 2σ 2

dx − µ2 −∞ Z ∞ Z ∞ 2 2 x − µ − (x−µ) xµ − (x−µ) 2 2σ √ √ = x e dx + e 2σ2 dx − µ2 2πσ 2 2πσ 2 −∞ −∞  ∞ Z ∞ 2 (x−µ) (x−µ)2 σ σ √ e− 2σ2 dx + µ2 − µ2 = − x √ e− 2σ2 + 2π 2π −∞ x=−∞

E(X ) − E(X) =



2πσ 2

= σ2 . 8

This is a rather important example: a lot of random process are normal, or can be approximated as being normal over a long period of time. In Chapter 4 one of the underlying assumptions is that the noise is normally distributed, something known as “white noise”. Theorem 2.1.24 (H¨ older’s Inequality) Take f ∈ Lp (S, Σ, µ) and h ∈ Lq (S, Σ, µ). Then f h ∈ L1 (S, Σ, µ) and |µ(f h)| ≤ µ(|f h|) ≤ |f |p |h|q . Theorem 2.1.25 (Minkowski’s Inequality) Take f, g ∈ Lp (S, Σ, µ). Then |f + g|p ≤ |f |p + |g|p .

2.2

Conditional Expectation

The important parts of this section are Definition 2.2.2, Notation 2.2.3, Theorem 2.2.1, and the “Existence in L2 ” section of the proof of said theorem.

2.2.1

The Fundamental Theorem

Now we have the concept of expectation, we will give a formal definition of a conditional expectation. The usual definition of conditional expectation Y = E(X|Z) has a random variable Y whose value depends on the value of random variable Z. Using our definitions of events and random variables, we can say that Z is a real-valued function Z(ω ∈ Ω), taking a certain point in the sample space. If Z has a given value, this limits the possible values of ω; in other words, out of the possible events, the actual outcome has been limited to a subset of the family of events F, say G. The conditional expectation is then the expected value of X(ω) given that ω ∈ G. Theorem 2.2.1 (Fundamental Theorem) Take a random variable X with E|X| < ∞, and G a sub-σ-algebra of F in the usual probability triple (Ω, F, P). Then there is a random variable Y with Y ∈ mG, E|Y | < ∞ such that ∀G ∈ G E(Y, G) = E(X, G). This Y is almost surely unique, i.e. if a random variable Y˜ has the same properties then P(Y = Y˜ ) = 1. Definition 2.2.2 A random variable Y with the properties described above is called a version of conditional expectation E(X|G). We say that Y = E(X|G) almost surely. Notation 2.2.3 We write E(X|Z) as shorthand for E(X|σ(Z)), where σ(Z) is the σ-algebra generated by Z, as defined in Definition A.1.2. Proof of Almost Sure Uniqueness Proof by contradiction [8, Section 9.5]. Assuming the conditional expectation exists, we have X ∈ L1 , and two versions Y, Y˜ of E(X|G). Then, by the definition, we have Y, Y˜ ∈ L1 (Ω, G, P), and that E(Y, G) = E(Y˜ , G) ⇒ E(Y − Y˜ , G) = 0 ∀G ∈ G. We now suppose Y, Y˜ are not almost surely equal. Without loss of generality, we take Y > Y˜ , so P(Y > Y˜ ) > 0. We can introduce an error term and construct a sequence of events Xn = {Y > Y˜ + n } with Xn ↑ {Y > Y˜ } as n ↓ 0. Then ∃n such that P(Y > Y˜ + n ) = P(Y − Y˜ > n ) > 0. Since Y, Y˜ are G-measurable, we can use the Markov inequality (Theorem 2.1.9) with g(c) = c, c = n to get E(Y − Y˜ , Y − Y˜ > n ) ≥ n P(Y − Y˜ > n ) > 0. But E(Y − Y˜ , Y − Y˜ > n ) = 0. Contradiction, therefore Y is almost surely unique. 9

Proof of Existence for X ∈ L2 We give proof of existence for the case of E|X|2 < ∞, since it is a commonly used L-norm, and it’s a more simple case since we can use the idea of orthogonal projections (Definition 2.1.19). Take Y as the orthogonal projection of X onto L2 (Ω, G, P), which always exists. Then we have hX − Y, Zi = E((X − Y )Z) = 0 ∀Z ∈ L2 (Ω, G, P). By linearity, we therefore arrive at E(XZ) = E(XY ). We can now set Z = IG for G ∈ G to obtain the result. Proof of Existence for X ∈ L1 In the standard machine (Method A.5.13), we defined h = h+ − h− , with h+ , h− as described in Notation A.5.10, to prove something for all measurable functions if it was true for all positive measurable functions. Similarly, we define X = X + − X − and limit ourselves to the case X ∈ (L1 )+ . We now choose [8, Section 9.5] a sequence 0 ≤ Xn ↑ X. Now Xn ∈ L2 , so by the previous section we can form sequence (Yn = E(Xn |G))n∈N . Lemma 2.2.4 For a non-negative bound random variable X, E(X|G) ≥ 0 almost surely. Proof: By contradiction. Assume version Y of the expectation has P(Y < 0) > 0: then ∃ > 0 such that we can set G = {Y < −} with P(G) > 0, and take the Markov Inequality (Theorem 2.1.9) with g(x) = x, X = −Y , c =  to obtain E(−Y, −Y > ) > P(−Y > ). We thus have 0 ≤ E(X, G) = E(Y, G) < −P(G) < 0. Contradiction, so E(X|G) ≥ 0 almost surely.  From this lemma we have 0 ≤ Yn almost surely ∀n, so we define Y (ω) = lim sup Yn (ω). Then Y ∈ mG, and Yn ↑ Y almost surely. By the Monotone-Convergence Theorem (Theorem 2.1.3), we then take the limit of E(Yn , G) = E(Xn , G) to get E(Y, G) = E(X, G) ∀G ∈ G. 

2.2.2

Example

Take [Own Example] Ω = {1, 2, 3, 4, 5, 6}, F = P(Ω), G = {∅, {1, 2}, {3, 4, 5, 6}, Ω}, P(A ∈ F) = ]A 6 , where P(Ω) is the power set of Ω, and ]A is the number of elements of A. We then have the probability triple associated with rolling a fair six-sided die. As we expect, the probability measure gives E(X) = 27 for X(ω) = ω. Since Y must be G-measurable, we can say that ( y1 ω ∈ {1, 2}, Y (ω) = y2 ω ∈ {3, 4, 5, 6}. Since E(Y IG ) = E(XIG ) ∀G ∈ G, Y is determined by the simultaneous equations E(Y I{1,2} ) E(Y I{3,4,5,6} )

= E(XI{1,2} ), = E(XI{3,4,5,6} ).

10

From the probability measure, we then have 1 y1 + 6 1 1 1 y2 + y2 + y2 + 6 6 6

1 y1 6 1 y2 6

= =

1 (1 + 2), 6 1 (3 + 4 + 5 + 6), 6

or 13 y1 = 12 , 23 y2 = 3. We therefore obtain result E(X|G) =

3 9 I{1,2} + I{3,4,5,6} . 2 2

Note: Since E|X|2 < ∞ we can treat Y as the orthogonal projection – as from Definition 2.1.19 – of X onto L2 (Ω, G, P). Then for any Z ∈ L2 (Ω, G, P), we define Z = z1 I{1,2} + z2 I{3,4,5,6} , and obtain hX − Y, Zi = E((X − Y )Z) 1 = ((3 − 2y1 )z1 + (18 − 4y2 )z2 ) 6 = 0 ∀z1 , z2 . So we see the inner product behaves as expected.

11

Chapter 3

Martingales Martingales, and similarly supermartingales and submartingales, are an important type of random process: their expected value stays the same over time, so on average they stay at the same value. (In the case of supermartingales and submartingales, the expected value is monotonically decreasing or increasing respectively.) This allows us to derive other results for random processes, especially useful as these three types of processes are rather common: common examples of processes that can be modelled as a martingale would be the amount of money a gambler owns during several rounds of betting, or brownian motion.

3.1

Definition

Important parts in this section are Definitions 3.1.1, 3.1.5, 3.1.7, 3.1.17, 3.1.18, 3.1.15, 3.1.16, Theorem 3.1.21 and Corollary 3.1.26. Definition 3.1.1 A filtration {Fn }n≥0 is an increasing family of sub-σ-algebras of S F, with F0 ⊆ F1 ⊆ . . . ⊆ F. We define F∞ = σ( n Fn ) ⊆ F. Example 3.1.2 For a common example from gambling, we consider the rolling of dice throws, each an independent and identically distributed random variable Xi for i ∈ Z+ . Then filtration Fn = σ(X0 , X1 , . . . , Xn ) is the σ-algebra generated by the set {X0 , . . . , Xn } as defined in Definition A.1.2. In other words, it is the set of all possible sets whose elements can be of the throws up to time n. Definition 3.1.3 A filtered space (Ω, F, {Fn }, P) is a probability triple (Ω, F, P) with an associated filtration {Fn }n≥0 . Example 3.1.4 {Fn } is usually taken as the natural filtration described in Example 3.1.2, Fn = σ(X0 , X1 , . . . , Xn ). For the gambling example, the filtered space is thus a description of the probabilistic model for dice-throwing, with a record of the dice throws up to time n. Definition 3.1.5 A process X = (Xn | n ≥ 0) is adapted to filtration {Fn } if Xn is Fn -measurable ∀n. Example 3.1.6 For the gambling example, a process Yn is adapted to Fn if it’s any function of {X0 , X1 , . . . , Xn }, so that P it’s determinable at time n; a common example n would be the sum of the throws, Yn = i=0 Xi . Definition 3.1.7 A process X is a martingale relative to ({Fn }, P) if i) X is adapted, ii) E|Xn | < ∞

∀n, 12

iii) E(Xn |Fn−1 ) = Xn−1 almost surely ∀n ≥ 1. A supermartingale has the equality in iii) replaced by “≤”, and a submartingale has it replaced by “≥”; so, a supermartingale decreases on average, and a submartingale increases on average.

3.1.1

Examples: Martingales and Markov Chains

Martingales and Markov chains both appear often in probability theory, so we look at the distinctions between the two. Definition 3.1.8 A stochastic process X is a collection of random variables (Xγ | γ ∈ C) parametrized by set C, where the variables are all on the same probability triple. Definition 3.1.9 A stochastic, or transition matrix P = (pij ) is a matrix such that X pij ≥ 0, pik = 1. k

Definition 3.1.10 A time-homogeneous Markov Chain X = (Xn | n ∈ Z+ ) is a stochastic process parametrized by set Z+ with elements Xn ∈ E for some set E. If E is countable, the chain is then defined by a stochastic |E| × |E| matrix P and an initial distribution µ over E. Then P(X0 = i0 , X1 = i1 , . . . , Xn = in ) = µi0 pi0 i1 pi1 i2 . . . pin−1 in . Corollary 3.1.11 A Markov Chain X is “memoryless”, i.e. P(Xn = in | X0 = i0 , . . . , Xn−1 = in−1 ) = P(Xn = in | Xn−1 = in−1 ). Proof:

From the definition, we have [Own Proof ] P(X0 = i0 , . . . , Xn = in ) = P(X0 = i0 , . . . , Xn−1 = in−1 ) P(Xn = in | X0 = i0 , . . . , Xn−1 = in−1 ) = µi0 pi0 i1 . . . pin−2 in−1 P(Xn = in | X0 = i0 , . . . , Xn−1 = in−1 ), ∴ pin−1 in = P(Xn = in | X0 = i0 , . . . , Xn−1 = in−1 ) = P(Xn = in | Xn−1 = in−1 ). 

So we have the other common definition of a Markov Chain, where the next value only depends on the current value. Example 3.1.12 A random walk Sn = X1 + X2 + . . . + Xn , where Xn are independent, identically distributed random variables, has conditional expectation E(Sn | Fn−1 ) = E(Sn−1 + Xn | Fn−1 ) = Sn−1 + E(Xn | Fn−1 ) = Sn−1 + E(Xn ). So S is only a martingale when E(X) = 0. Additionally, we can say P(S0 = i0 = 0, . . . , Sn = in ) = P(S0 = 0, . . . , Sn−1 = in−1 ) P(Sn = in | Sn−1 = in−1 ) = P(S1 = i1 )P(S2 = i2 | S1 = i1 ) . . . P(Sn = in | Sn−1 = in−1 ) = p0i1 pi1 i2 . . . pin−1 in . So S is always a Markov chain, with µ0 = 1. 13

Example 3.1.13 Instead of the usual Markov chain, we have every value after time n = 1 be determined by a new stochastic matrix Q = Q(X1 ) whose values depend on the value of X1 ; in other words, P(X0 = i0 , X1 = i1 ) = µi0 pi0 i1 , P(X0 = i0 , . . . , Xn = in ) = µi0 pi0 i1 qi1 i2 (i1 )qi2 i3 (i1 ) . . . qin−1 in (i1 )

∀n > 1.

Except in the trivial case, process X is no longer a Markov Chain, since the transition probabilities are no longer “memoryless”, but also depend on an older value. However, if P and Q(X1 ) are defined in such a way that X E(X1 | X0 = i0 ) = jpi0 j = i0 ∀i0 ∈ E, j∈E

E(Xn | Fn−1 ) =

X

jqin−1 j (i1 ) = in−1

∀n > 1, i1 , in−1 ∈ E,

j∈E

then we have a process that isn’t a Markov Chain, but is a martingale. Note that the above process can become a Markov Chain if we store the value of X1 ; in this case, the process becomes a Markov Chain over the bivariate state space (X1 , Xn ). Example 3.1.14 The “Martingale betting system” is a gambling strategy where the gambler repeatedly doubles the size of his bet each round, until he wins a round. The theory behind this is that, given an infinite amount of time and money, the gambler can keep raising his bet indefinitely until he wins a round, at which point he is up by the size of his initial bet. Mathematically, we write this as a sum of weighted random variables. Without loss of generality, we take the starting value as zero, and assume the payout is evens, i.e. the payout for a successful round is the initial bet, plus the same again. Then we can write [8, Section 10.6] Sn = C1 X1 + C2 X2 + . . . + Cn Xn = Sn−1 + Cn−1 Xn−1 , where P(Xn = 1) = p, P(Xn = −1) = 1 − p, and Cn is a random variable that is determined by past results, i.e. Cn = Cn (X1 , X2 , . . . , Xn−1 ). Cn is called a pre-visible process, and we define this term after the example. In this case, we have ( 1 Xn−1 = 1, Cn = Cn (Xn−1 ) = 2Cn−1 Xn−1 = −1. For S to be a Markov Chain, we need the next value Sn to be completely determinable from the value of Sn−1 . This would require C to be determinable by S, but this isn’t the case, so S can’t be a Markov Chain. On the other hand, E(Sn | Fn−1 ) = E(Sn−1 + Cn Xn | Fn−1 ) = Sn−1 + Cn E(Xn ), So this is a martingale if E(Xn ) = 0. Note this strategy can take a long time, and hence a lot of money, waiting for a successful round: the chance of waiting N rounds for a successful one is P(N = n) = (1 − p)n−1 p, with an expected time of E(N ) =

X n∈N

n(1 − p)n−1 p =

X X

(1 − p)m−1 p =

n∈N m≥n

X

(1 − p)n−1 =

n∈N

1 . p

In addition, if we assume the initial bet C1 = c, the amount of money required to bet in N rounds is c + 2c + . . . + 2N −1 c = (2N − 1)c, so the expected amount of money M 14

required is E(M ) =

X

(2n − 1)c(1 − p)n−1 p = c

n∈N

( =

∞ c 2p−1

X

X

2n (1 − p)n−1 p − c

n∈N

(1 − p)n−1 p

n∈N

0 ≤ p ≤ 1/2, 1/2 < p ≤ 1,

A rather large amount to need just to gain an amount c. This betting system is where martingales derived their name from; the origin of the word before this is unclear. there are two main theories [2]: either that it comes from the name of a type of saddle – which bifurcates into two equally long strips in the middle – or that it comes from the Proven¸cal phrase “a la martegalo” – referring to the inhabitants of Martigues, who had a reputation for doing things in a ridiculous or naive way – and means “in an absurd manner”, an appropriate origin for the naming of this betting system if it is the case. Definition 3.1.15 A process C = (Cn )n∈N is pre-visible if Cn ∈ mFn−1 ∀n. Such a pre-visible process is thus determinable in advance of when it is used. Examples include a controllable parameter, which would need to be determined based on past results, and can not be based on the upcoming one, such as C in the example above. P Definition 3.1.16 The martingale transform (C • X)n = 1≤k≤n Ck (Xk − Xk−1 ), where C is pre-visible. This is the discrete equivalent of the stochastic integral.

3.1.2

Stopping Times

Definition 3.1.17 A map T : Ω 7→ {0, 1, 2, . . . ; ∞} is a stopping time if {T = n} = {ω : T (ω) = n} ∈ Fn ∀n ≤ ∞. It is possible to have T = ∞. The requirement that {T = n} ∈ Fn means that the decision to stop at a certain time can only depend on what has happened up to that time. For example, in general you can set the stopping time as the nth occurrence of any value, T = inf{n ≥ 0; Xn ∈ Y }, but you can’t set the stopping time as the nth last occurrence of any value, because this usually can not be determined unless the whole process has already been observed. Definition 3.1.18 The stopped process XT ∧n , or the process Xn stopped at T , is the process X up to stopping time T , and is equal to XT ∀n ≥ T . This can also be (T ) (T ) denoted as (C (T ) • X)n = XT ∧n − X0 , where Cn = I{n≤T } . Cn is then pre-visible, P (T ) since {Cn = 0} = {T ≤ n − 1} = 0≤k≤n {T = k} ∈ Fn−1 . Theorem 3.1.19 If X is a (super)martingale, and T is a stopping time, then X T is a (super)martingale, with E(XT ∧n ) is (less than or) equal to E(X0 )∀n. Example 3.1.20 Take X as a random walk on Z+ [8], starting at 0, with stopping time T = inf{n; Xn = 1}. Then E(XT ) = 1. However, E(XT ∧n ) = E(X0 ) = 0. We therefore do not necessarily have E(XT ) = E(X0 ). The next theorem gives sufficient conditions for E(XT ) = E(X0 ). Theorem 3.1.21 (Doob’s Optional Stopping Theorem) Let X be a (super)martingale, T be a stopping time. Then XT is integrable and E(XT ) is (less than or) equal to E(X0 ), if any of the following hold: i) T is bounded, i.e. ∃N ∈ N such that T (ω) ≤ N ∀ω; ii) X is bounded, i.e. ∃K ∈ R+ such that |Xn (ω)| ≤ K ∀n, ω, and T is almost surely finite; iii) E(T ) < ∞, and ∃K ∈ R+ such that |Xn (ω) − Xn−1 (ω)| ≤ K ∀(n, ω). 15

Proof: From Theorem 3.1.19, E(XT ∧n ) ≤ E(X0 ). Then for X being a supermartingale, i) Take n = N . ii) Take n ↑ ∞ using the Bounded Convergence Theorem (Theorem 2.1.7). PT ∧n iii) The condition gives |XT ∧n − X0 | = | k=1 (Xk − Xk−1 )| ≤ KT, E(KT ) < ∞, so we can take n ↑ ∞. For X being a martingale, we apply the above to −X to show equality. Example 3.1.22 (Gambler’s Ruin) Say we have an amount of money S0 , and we can bet on an infinite number of rounds, where the odds of victory and the corresponding payoffs are the same as in Example 3.1.14. We aim to reach an amount of money S, and so we stop when we’ve either reached this amount or run out of money: we’d like to know how likely we are to succeed. Take our money at time n as Sn = S0 + (C · X)n , with (C · X)n being the martingale transform. Our stopping time T = inf{n; Sn ∈ {0, S}}. Our situation can be split into two cases: i) p = q = 1/2. In this case Sn is a martingale regardless of our choice of Cn , and since Sn is bounded we can derive our result from Theorem 3.1.21 [3, Section 12.2]: 0.P(ST = 0) + S.P(ST = S) = E(ST ) = S0 , so P(Success) = SS0 . This is irrespective of our gambling strategy, as we’d expect from a fair game. ii) p 6= q. Sn is now either a supermartingale or a submartingale, so Theorem 3.1.21 will give us an inequality, and we can’t use it to directly calculate the answer. For the moment, let’s suppose we always take Cn = 1, and that S0 = S/2. If we now write pi for P(ST = 2i|S0 = i), then we can use the fact that Sn is time-homogenous to write pi = ppi+1 + qpi−1 , with boundary conditions p0 = 0, p2i = 1. This is a recurrence relation with solution pi =

1 − ( pq )i 1−

( pq )2i

=

1 pi . q i = i 1 + (p) p + qi

For p > q this is greater than 1+1 q = p, and for p < q this is smaller than p, p where p would be our chance of success by betting i and reaching our stop time in one round. Applying this result to each bet instead of the entire problem, we find that when p > 1/2 our best strategy is to bet on increments as small as possible 1−( q )S0

and P(Success) = 1−(pq )S , and when p < 1/2 our best strategy is to bet as high p as possible: intuitively, in the former case the odds benefit us in the long term so we can take our time, and in the latter case they do not. Note: Calculating the chance of success when p < q can be complicated, depending on exactly how we restrict the bets: limiting them to stay in [0,S] means the recurrence relation on pi changes depending on whether i ≥ S/2. (In the case S0 = S, we can obviously still say the chance of success is p.) On the other hand, if we allow ourselves to just stop when we have at least S, and bet all of our money at each timestep, the relation is simpler and has solution P(Success) = pt+1 , where −t − 1 ≤ log2 i < −t. Corollary 3.1.23 If X is a martingale with Xn − Xn−1 bounded by some constant K, and C is a pre-visible process bounded by some constant L, and T is a stopping time with E(T ) < ∞, then E(C • X)T = 0. Corollary 3.1.24 If X is a non-negative supermartingale, and T is an almost-surelyfinite stopping time, then E(XT ) ≤ E(X0 ). 16

Pn Example 3.1.25 Take a simple binomial random walk, with X0 = 0, Xn = k=1 Zk and Zk being equally likely to be −1 or 1 ∀k. Xn is then a martingale. Let stopping time T = min{n : Xn = −1}. We then have that i) T is not bounded, and ii) X is not bounded. Since −1 = E(XT ) 6= E(X0 ) = 0, iii) must also not hold. We therefore conclude that E(T ) = ∞. We usually need to determine whether E(T ) < ∞ instead of deriving it from Doob’s Optional-Stopping Theorem. Corollary 3.1.26 Let T be a stopping time such that ∃N ∈ N,  > 0 such that ∀n ∈ N

P(T ≤ n + N | Fn ) > 

almost surely. Then E(T ) < ∞. Example 3.1.27 Say we have a monkey randomly pressing keys on a typewriter, letters only, and we want to know how long it will take for it to type the word “abracadabra” [8, Exercise E10.6]. We suppose that at each timestep a gambler arrives with one betting chip, and bets it on the monkey typing the first letter, A, at the fair payoff of 25 to 1. If he wins, then at the next timestep he bets all 26 chips on the monkey typing B, and so on through the whole word, until the monkey types the whole word, or it misses a letter and he loses all his chips. Let Xn,m be the amount of chips owned at time m by the gambler who entered at time n – so, for example, Xn,n will be equal to either 26 or 0. Xn,m is then a martingale with Pmregard to m. Then the sum of money owned by all gamblers at time m, Zm = n=1 Xn,m , has E(Zm ) = m, and the process Wm = Zm − m is a martingale with E(Wm ) = 0. If we can find E(WT ) = E(ZT ) − E(T ), where T is the stopping time of the monkey typing the whole word, then we can express the expected time taken by the expected total number of chips held by the gamblers. We can use Corollary 3.1.26 to show that E(T ) < ∞, by, for example, setting N > 11 P11 i and  = 26−11 . In addition, Zm ≤ 26Zm−1 + 26, and Zm ≤ i=1 26 ∀m, so we P10 i 11 know that |Zm − Zm−1 | ≤ 25 i=1 26 + 26 = 26 + 25. We can thus use condition iii) of Doob’s Optional Stopping Theorem (3.1.21) to say that E(WT ) = 0, and so E(ZT ) = E(T ). E(ZT ) has a value predetermined by the stop condition, since only certain gamblers can have any chips remaining when the word is finished: those who entered at times T , T − 3, and T − 10 will have chips, since there are sequences of letters that is both at the beginning and the end of the word, of length 1, 4 and 11 respectively. These gamblers have 26, 264 and 2611 chips respectively, so E(T ) = E(ZT ) = 26 + 264 + 2611 .

3.1.3

R source code: Random Walk

This script creates a given number of random walks over a given number of timesteps with a given probability distribution, then plots them on a graph. The creation of the walk data is left as a separate function to allow manipulation before plotting. #Create walk data according to any defined random distribution create.walk=function(timesteps,runs,start,f) { y=array(dim=c(timesteps+1,runs)) y[1,]=start for (time in 1:timesteps) { x=f(runs) y[time+1,]=y[time,]+x } y } draw.walk=function(data) { 17

##Create a "width matrix" to record the no. of occurrences of ##each edge calculate.width=function(data,timesteps,runs) { z=array(NA,c(timesteps,runs)) for (time in 1:timesteps) { for (run in 1:runs) { if (is.na(z[time,run])) { z[time,run]=1 for (through in (run+1):runs) { if (data[time,through]==data[time,run] && data[time+1,through]==data[time+1,run]) { z[time,c(run,through)]=c(z[time,run]+1,0) } } } } if (is.na(z[time,runs])) { z[time,runs]=1 } } z } ##Use width matrix to plot walk with thickness depending on frequency draw.data=function(data,width,timesteps,runs) { plot(c(0,timesteps),range(data), type="n",xlab="Time",ylab="Walk values") for (run in 1:runs) { for (time in 1:timesteps) { if (width[time,run]>0) { lines(c(time-1,time),c(data[time,run], data[time+1,run]),lwd=width[time,run]) } } } } b=calculate.width(data,dim(data)[1]-1,dim(data)[2]) draw.data(data,b,dim(data)[1]-1,dim(data)[2]) } Some example commands, respectively for an even discrete (−1, 1) random walk and a normal distribution of mean 0 and variance 1, each with 10 runs over 25 time units: source("http://www.maths.leeds.ac.uk/~voss/projects/2010-martingales/plotwalk.R") x=create.walk(25,10,0,function(x) sample(c(-1,1),x,replace=TRUE)) draw.walk(x) source("http://www.maths.leeds.ac.uk/~voss/projects/2010-martingales/plotwalk.R") x=create.walk(25,10,0,function(x) rnorm(x,0,1)) draw.walk(x)

3.2

The Convergence Theorem

Important parts in this section are Theorem 3.2.5 and Lemmas 3.2.2 and 3.2.3. For a process X on R, let YN = (C • X)N , where the pre-visible strategy Cn (a, b) is defined as follows for a < b: Cn = 0 until X < a. Then Cn = 1 until X > b. Then Cn = 0, 18

5 0 −5

Walk values

0

5

10

15

20

25

Time

0 −5

Walk values

5

Figure 3.1: Example graphic for a binomial random walk in Program 3.1.3, with increments equally likely to be 1 or −1, taken over 25 time steps with 10 sample runs

0

5

10

15

20

25

Time

Figure 3.2: Example graphic for a random walk in Program 3.1.3 with normally-distributed increments, with mean 0 and variance 1, over 25 time steps and 10 sample runs

19

and the strategy repeats. More formally, C1 = I{X0 0. For V−∞ this becomes (1 − |α|)K 2 > 0, and for V+∞ this becomes (α2 − |α|)K 2 > 0. The stability of these points thus depends on |α|: for |α| < 1, V−∞ = 0 is stable, and V+∞ is both unstable and negative; for |α| > 1, the variance converges to non-zero value V+∞ . and V−∞ = 0 is unstable; for |α| = 1, V−∞ = V+∞ = 0 is stable. In summary, in this case we always have convergence. 3. For α = 0, the equation reduces to (C 2 H 2 + K 2 )V∞ − K 2 H 2 = 0, with solution V∞ = We have f (x) =

K2H2 K 2 +C 2 H 2 ,

so

df (x) dx

K 2H 2 . + C 2H 2

K2

= 0. We thus always have stability.

In summary, the variance always converges to V∞ . 

4.3.2

R source code: Kalman Filter

This takes a series of signals related by disturbed linear recursion, and noisy observations of them, then plots the signal series, the mean and standard deviation series for the estimations, and the differences between the observations. The differences are taken to keep the plots in the same area as the signal and observation series. #Script for creating noisy measurements of a process, and giving #an estimate of same by filtering ##Convert any possibly time-converted values into length-n arrays, ##sorted in a list convert.to.array = function(n, ...) { args=list(...) for(a in 1:length(args)) { args[[a]] = array(args[[a]],n) 29

} args } create.signal = function(n, meanX0, sdX0, A, g, H) { signal = array(0,n+1) convert=convert.to.array(n,A+1,g,H) alpha = convert[[1]] g = convert[[2]] H = convert[[3]] signal[1] = rnorm(1,meanX0,sdX0) for(step in 1:n) { signal[step+1] = rnorm(1, alpha[step]*signal[step]+g[step], H[step]) } signal } create.observations = function(n,signal,C,K) { convert=convert.to.array(n,0,C,K) observations = convert[[1]] C = convert[[2]] K = convert[[3]] observations[1] = rnorm(1,C[1]*signal[2],K[1]) for(step in 1:(n-1)) { observations[step+1] = rnorm(1,C[step+1]*signal[step+2], K[step+1]) } observations } ##Calculate the estimated mean and variance of possible values create.estimates = function(Y, meanX0, sdX0, A, g, H, C, K) { n = length(Y) convert = convert.to.array(n,A+1,g,H,C,K) alpha = convert[[1]] g = convert[[2]] H = convert[[3]] C = convert[[4]] K = convert[[5]] est = array(0, c(n+1, 2)) est[1,] = c(meanX0, sdX0^2) for(step in 1:n) { est[step+1,2] = 1/(1/((alpha[step]^2)*est[step,2] + H[step]^2) + (C[step]/K[step])^2) check = (alpha[step]*est[step,1] + g[step])/((alpha[step]^2)*est[step,2] + H[step]^2) check2 = C[step]*Y[step]/(K[step]^2) est[step+1,1] = (check+check2)*est[step+1,2] } est[,2] = sqrt(est[,2]) colnames(est) = c("mean", "sd") 30

+

1

+

0

+ +

+

+

+ +

+

−2

−1

Estimate

2

+

+ + 0

2

4

6

8

10

12

Measurements

Figure 4.5: Example graphic for measuring a random process in Program 4.3.2 with default values, create.filter(12,1)

est } draw.filter = function(signal,Y,est) { ##Calculate one standard deviation to each side of the estimates estplus = est[,1]+est[,2] estminus = est[,1]-est[,2] plot(c(0,length(Y)), range(c(est[,1],estplus,estminus,signal-0.1, signal+0.1,Y)), type="n", xlab="Measurements", ylab="Estimate") lines(0:length(Y),est[,1]) lines(0:length(Y),estplus,lty=2) lines(0:length(Y),estminus,lty=2) lines(0:length(Y),signal,lty=3) points(1:length(Y),Y,pch="+") } create.filter = function(n, K, meanX0=0, sdX0=1, A=-0.1, g=0, H=1, C=1) { signal = create.signal(n, meanX0, sdX0, A, g, H) observations = create.observations(n,signal,C,K) est = create.estimates(observations, meanX0, sdX0, A, g, H, C, K) draw.filter(signal, observations, est) }

4.3.3

Example: Moving on a Line

Consider the optimality problem [6, Section 11.4] of an object moving on the line R with controllable velocity gn at each timestep, where we can only measure the position of the object with noise, and we must choose gn , only based on Fn−1 = {Y0 , Y1 , . . . Yn −1}, to PN −1 2 minimize E( n=0 gn2 + DXN ) for a finite stopping time N and some D. Specifically, 31

2

+

0

+ + +

+

+

+

+ −2

+

+

+

+

+

+ 0

5

+ +

+

+

−4

Estimate

1

+ +

10

15

20

Measurements

Figure 4.6: Example graphic for measuring a random process in Program 4.3.2 with default values, create.filter(20,1)

1 0

+

+

+

+

+

−2

+ 0

+

+

+

−1

Estimate

2

+ +

2

+

4

6

8

10

12

Measurements

Figure 4.7: Example graphic for measuring a random process in Program 4.3.2 with default values and noise variance Kn2 = 1/n2 , create.filter(12,1/(1:12))

32

we have a system Xn+1 = Xn + gn , Yn = Xn + n , where L(n ) = N (0, 1). We thus have a Kalman Filter, with α = 1, g = gn−1 , Hn2 = 0, C = 1, Kn2 = 1. Our position estimate is then N (Zn , Vn ) with 1 Zn Zn−1 + gn−1 1 = + 1, = + Yn , Vn Vn−1 Vn Vn−1 V0 Vn−1 = , ∴ Vn = Vn−1 + 1 nV0 + 1 Yn Vn−1 + (Zn−1 + gn−1 ) Yn V0 + (Zn−1 + gn−1 )([n − 1]V0 + 1) Zn = = . Vn−1 + 1 nV0 + 1 In the absence of other information, we can use the above recursion by assuming we have no information at time n = 0, i.e. Z0 = z, 1/V0 = 0. Then 1 Yn + (n − 1)(Zn−1 + gn−1 ) , Vn = . n n

Z1 = Y1 , V1 = 1, Zn =

PN −1 2 Let F (Zk , Vk , k) = E( n=k gn2 + DXN |{Zk , Vk , Yk } = Gk ). Then F (Zk , Vk , k) = gk2 + E(F (Zk+1 , Vk+1 , k + 1)|Gk , gk ), 2 2 F (ZN , VN , N ) = E(DXN |FN ) = DE(XN |GN ) 2 2 2 2 = D(E(XN − ZN |GN ) + ZN ) = D(VN + ZN ). 2 Suppose we can write F (Zk+1 , Vk+1 , k + 1) = Ak+1 Zk+1 + Bk+1 . Then we have 2 F (Zk , Vk , k) = gk2 + E(Ak+1 Zk+1 + Bk+1 |Gk , gk ) ! 2  Yk+1 Vk + Zk + gk 2 |Gk , gk + Bk+1 = gk + Ak+1 E Vk + 1 !  2 V X + Z + V  + (V + 1)g k k k k k+1 k k = gk2 + Ak+1 E |Gk , gk + Bk+1 Vk + 1

= (Ak+1 + 1)gk2 + 2Ak+1 Zk gk + Ak+1 Zk2 +

Ak+1 Vk2 + Bk+1 . Vk + 1

k+1 Minimizing over gk gives gk = − AA Zk , so k+1 +1

F (Zk , Vk , k) =

Ak+1 Vk2 Ak=1 Zk2 + Bk+1 + = Ak Zk2 + Bk , Ak+1 + 1 Vk+1

where Ak = Bk = Bk+1 + = Bk+1 + =

Ak+1 D = , Ak+1 + 1 1 + kD

Ak+1 Vk2 Vk + 1 DV02 (1 + kV0 ) (1 + (k + 1)D)(1 + kV0 )2 (1 + (k + 1)V0 )

N −1 X V0 1 + DV02 . 1 + N V0 (1 + iV0 )(1 + (i + 1)V0 )(1 + (i + 1)D) i=k

33

Thus, the optimality problem has the solution gk = Zk becomes Zk =

−DZk 1+(N −k)D ,

so the expression for

1+(N −k)D Yk V0 + ([k − 1]V0 + 1) 1+(N −k+1)D Zk−1

kV0 + 1  (k − 1)V0 + 1 V0 Zk−1 + Yk 1 + (N − k + 1)D 1 + (N − k)D   k X 1 + (N − k)D Z0 V0 + Yi , = kV0 + 1 1 + N D i=1 1 + (N − i)D 1 + (N − k)D = kV0 + 1



and the expected final cost at time k, F (Zk , Vk , k)

=

D V0 Z2 + 1 + kD k 1 + N V0 N −1 X 1 +DV02 . (1 + iV0 )(1 + (i + 1)V0 )(1 + (i + 1)D) i=k

4.3.4

R Source Code: Movement Problem

This gives examples of the results of the problem in Example 4.3.3, plotting the current position, the mean and standard deviation of the estimated position, and the observed position. #Script for the movement problem ##Create arrays and starting values create.start = function(n, meanX0, sdX0) { signal = array(NA,n+1) signal[1] = rnorm(1,meanX0,sdX0) signal } create.observations = function(signal) { ##create array for observations for time 0 to n-1 n=length(signal)-1 observations = array(NA,n) observations[1] = rnorm(1,signal[1],1) observations } create.estimates = function(Y, meanX0, sdX0) { n = length(Y) est = array(NA, c(n, 2)) if (sdX0 == Inf) { est[1,] = c(Y[1], 1) } else { est[1,] = c((meanX0 + Y[1]*sdX0^2)/(1+sdX0^2), sdX0^2/(1+sdX0^2)) } colnames(est) = c("mean", "sd") est } ##Calculate values in next time step 34

create.move = function(k,n,signal,est,D) { if (D==Inf) {g=-est[k,1]/(n-k+1)} else { g = -D*est[k,1]/(1+(n-k+1)*D) } signal[k+1] = signal[k] + g signal } update.observations = function(k,observations,signal) { observations[k] = rnorm(1,signal[k],1) observations } update.estimates = function(k,est,signal,Y,D) { n = length(Y) if (D==Inf) {g=-est[k,1]/(n-k+1)} else { g = -D*est[k,1]/(1+(n-k+1)*D) } est[k+1,2] = est[k,2]/(est[k,2] + 1) check = (est[k,1] + g)/est[k,2] est[k+1,1] = (check+Y[k+1])*est[k+1,2] est } draw.filter = function(signal,Y,est) { ##Calculate one standard deviation to each side of the estimates estplus = est[,1]+est[,2] estminus = est[,1]-est[,2] plot(c(0,length(Y)), range(c(est[,1],estplus,estminus,signal,Y,0)), type="n", xlab="Measurements", ylab="Estimate") lines(0:(length(Y)-1),est[,1]) lines(0:(length(Y)-1),estplus,lty=2) lines(0:(length(Y)-1),estminus,lty=2) lines(0:length(Y),signal,lty=3) points(0:(length(Y)-1),Y,pch="+") } create.filter = function(n, meanX0=0, sdX0=1, D=1) { if (sdX0 == Inf) { ##using Inf normally would give "Not a Number" errors signal = create.start(n,meanX0,0) } else signal = create.start(n, meanX0, sdX0) observations = create.observations(signal) est = create.estimates(observations, meanX0, sdX0) for(k in 1:(n-1)) { signal = create.move(k,n,signal,est,D) observations = update.observations(k+1,observations,signal) est = update.estimates(k,est,signal,observations,D) } signal = create.move(n,n,signal,est,D) est[,2] = sqrt(est[,2]) draw.filter(signal, observations, est) }

35

+ +

−1

+

−2

+

+

+

+

+ +

+

−3

Estimate

0

+

+ 0

2

4

6

8

10

12

Measurements

Figure 4.8: Example graphic for the movement problem in Section 4.3.3, using Program 4.3.4 with default values and 12 timesteps, create.filter(12)

1.5

+ +

0.5

+ +

+

+

+ −0.5

Estimate

+ +

+

−1.5

+ + 0

2

4

6

8

10

12

Measurements

Figure 4.9: Example graphic for the movement problem in Section 4.3.3, using Program 4.3.4 with default values and 12 timesteps, create.filter(12)

36

0.5

+ +

−0.5

+ + +

−1.5 −2.5

Estimate

+

+

+

+ + 0

+

+ 2

4

6

8

10

12

Measurements

Figure 4.10: Example graphic for the movement problem in Section 4.3.3, using Program 4.3.4 with 12 timesteps and D=∞, create.filter(12,D=Inf)

1.5

+

+ +

0.0

0.5

+ +

+ +

+

+

+

−0.5

Estimate

1.0

+

+ 0

2

4

6

8

10

12

Measurements

Figure 4.11: Example graphic for the movement problem in Section 4.3.3, using Program 4.3.4 with 12 timesteps and D=0, create.filter(12,D=0), obviously resulting in zero movement

37

+

+

+ +

+

4

6

+ +

+

2

Estimate

8

10 12

+

+

+

0

+ 0

2

4

6

8

10

12

Measurements

Figure 4.12: Example graphic for the movement problem in Section 4.3.3, using Program 4.3.4 with 12 timesteps, starting at 10 with infinite starting variance, create.filter(12,meanX0=10,sdX0=Inf)

+ +

+ 0.0

Estimate

1.0

+

+

+

+

+

−1.0

+ + 0

+

+ 2

4

6

8

10

12

Measurements

Figure 4.13: Example graphic for the movement problem in Section 4.3.3, using Program 4.3.4 with 12 timesteps, starting at 0 with infinite starting variance, create.filter(12,sdX0=Inf); we then still have movement due to the inaccurate observations

38

4.3.5

Extension for Multiple Processes and Observations

The above form of the Kalman Filter is satisfactory, but it causes problems if we have more than one variable, or more than one observations: in either case we also need to consider covariances between the different approximations, so we need a more sophisticated model. For this reason, the associated literature for the Kalman Filter usually denotes the equations in matrix form; we therefore extend the basic Kalman filter above to the case where we have v variables we wish to approximate and m observations. The aforementioned literature usually splits the Kalman Filter into two steps – the time update, or prediction step, and the measurement update, or correction step [7]. This makes the calculations more digestible, especially when using matrices, so we first show our solution for the simple case can be written in this form. Rewriting the Simple Case We first write our solution in terms of Vn and Zn instead of 1/Vn and Zn /Vn : Cn2 1 1 + = 2 Vn αn Vn−1 + Hn2 Kn2 2 2 2 K + C (α Vn−1 + Hn2 ) = n 2 n2 n , Kn (αn Vn−1 + Hn2 ) K 2 (α2 Vn−1 + Hn2 ) Vn = 2 n 2n 2 , Kn + Cn (αn Vn−1 + Hn2 ) Zn αn Zn−1 + gn Cn (Yn − Yn−1 ) = 2 + Vn αn Vn−1 + Hn2 Kn2 2 K (αn Zn−1 + gn ) + Cn (Yn − Yn−1 )(αn2 Vn−1 + Hn2 ) = n , Kn2 (αn2 Vn−1 + Hn2 ) K 2 (αn Zn−1 + gn ) + Cn (Yn − Yn−1 )(αn2 Vn−1 + Hn2 ) Zn = n . Kn2 + Cn2 (αn2 Vn−1 + Hn2 ) These expressions, and those that follow, frequently use the terms E(Xn |Fn−1 ) = αn Zn−1 + gn and E(Vn |Fn−1 ) = αn2 Vn−1 + Hn2 : for convenience, we write these as Z˜n and V˜n respectively. We then write the equation for Zn as Kn2 Z˜n + Cn (Yn − Yn−1 )V˜n Kn2 + Cn2 V˜n Cn V˜n (Yn − Yn−1 − Cn Z˜n ) = Z˜n + Kn2 + Cn2 V˜n

Zn =

= Z˜n + φn rn , where φn =

Cn V˜n 2 +C 2 V ˜ Kn n n

is pre-visible, and often called the gain, blending, or kalman

gain factor [1,7]; it can be thought of as an indication of how much we value the new information conpared to our old estimate. Additionally, the term rn = Yn − Yn−1 − Cn Z˜n = Yn − E(Yn |Fn−1 ) is the difference between actual and expected measurement Yn , and is often called the residual or innovation [7]. K2 The gain factor has the property Cn φn + K 2 +Cn2 V˜ = 1, so we can write the system as n

n

n

Z˜n = αn Zn−1 + gn , V˜n = αn2 Vn−1 + Hn2 , φn =

Cn V˜n ; Kn2 + Cn2 V˜n

rn = Yn − Yn−1 − Cn Z˜n , Zn = Z˜n + φn rn , Vn = (1 − Cn φn )V˜n . 39

Matrix Form Say we wish to estimate v variables from m observations. The variables can now depend on each other, so for the ith variable Xi,n we have dynamics equation Xi,n − Xi,n−1 =

v X

Ai,j,n Xj,n−1 + gi,n + νi,n ,

j=1

and for ith observation Yi,n we have equation Yi,n − Yi,n−1 =

v X

Ci,j,n Xj,n + i,n .

j=1

In matrix form, these equations become xn − xn−1 = An xn−1 + g n + ν n , y n − y n−1 = Cn xn + n . Additionally, the independence conditions for xn , νi,n , and i,n are given by the equations [1, Section 7.1] E(ν n xTm ) = 0 ∀n > m, E(n xTm ) = 0 ∀n, m, E(n ν Tm ) = E(ν n ν Tm ) = E(n Tm ) = 0

∀n 6= m,

where xT denotes the usual transpose of a vector, and the distribution of the noise variables is given by the covariance matrix    T   Jn2 Hn2 νn νn . E = n n (Jn2 )T Kn2 Usually the two noise variables are independent, so Jn2 = 0. If we assume that we have estimate at time n − 1 of mean z n−1 and variance Vn−1 , then we have E(xn |Fn−1 ) = αn z n−1 + g n , E(y n |Fn−1 ) = y n−1 + Cn (αn z n−1 + g n ), where αn = An + I. Let ξ n = xn − E(xn |Fn−1 ) = αn (xn−1 − z n−1 ) + ν n , and χn = y n − E(y n |Fn−1 ) = Cn (xn − αn xn−1 − g n ) + n = Cn (αn (xn−1 − z n−1 ) + ν n ) + n = Cn ξ n + n . Let V˜n = Var(ξn ) = Hn2 + αn Vn−1 αnT ; We can then write the covariance matrix     ξn V˜n Jn2 + V˜n CnT . Cov = χn (Jn2 )T + Cn V˜n Kn2 + Cn Jn2 + (Jn2 )T CnT + Cn V˜n CnT Lemma 4.3.4 If two random variables  x, yare normallydistributed with mean 0 and x Vxx Vxy symmetric covariance matrix Cov = , where Vyy is non-singular, y Vyx Vyy then −1 −1 E(x|y) = Vxy Vyy y, Var(x|y) = Vxx − Vxy Vyy Vyx . −1 Proof: The term [6] x − Vxy Vyy y is linear in x, y, so is normally distributed. Ad−1 T −1 ditionally, E((x − Vxy Vyy y)y ) = 0, so it is independent of y and Vxy Vyy y is the −1 conditional expectation of y. So we can say that E(x|y) = Vxy Vyy y, and that −1 −1 Cov(x − Vxy Vyy y|y) = Cov(x − Vxy Vyy y) −T T −1 −1 −T T = E(xxT − xy T Vyy Vxy − Vxy Vyy yxT + Vxy Vyy yy T Vyy Vxy ) −1 −1 −1 −1 = Vxx − Vxy Vyy Vyx − Vxy Vyy Vyx + Vxy Vyy Vyy Vyy Vyx −1 = Vxx − Vxy Vyy Vyx . 

40

ξ n and χn have mean 0, so by the above lemma, and by the fact that E(ξ n |χn ) = ˜ n and Var(ξ n |χn ) = Vn , we can now say that zn − z ˜ n + (Jn2 + V˜n CnT )(Kn2 + Cn Jn2 + (Jn2 )T CnT + Cn V˜n CnT )−1 (y n − E(y n |Fn−1 )), zn = z Vn = V˜n − (Jn2 + V˜n CnT )(Kn2 + Cn Jn2 + (Jn2 )T CnT + Cn V˜n CnT )−1 ((Jn2 )T + Cn V˜n ). We have now found the gain factor and the innovation, φn = (Jn2 + V˜n CnT )(Kn2 + Cn Jn2 + (Jn2 )T CnT + Cn V˜n CnT )−1 , r n = y n − E(y n |Fn−1 ), and can more simply write the above as ˜ n + φn r n , zn = z Vn = V˜n − φn ((Jn2 )T + Cn V˜n ). We can now write the complete form of the filter, ˜ n = αn z n−1 + g n , V˜n = Hn2 + αn Vn−1 αnT z φn = (Jn2 + V˜n CnT )(Kn2 + Cn Jn2 + (Jn2 )T CnT + Cn V˜n CnT )−1 , ˜n, zn = z ˜ n + φn r n , Vn = V˜n − φn ((Jn2 )T + Cn V˜n ). r n = y n − y n−1 − Cn z V˜ C Note: Usually Jn2 = 0: this gives φn = K 2 +Cn V˜n 2 C T , Vn = (I − φn Cn )V˜n , and we n n n n then have the same form as the simple case. T

Example: Moving Under Gravity of Unknown Force. Say we are measuring the position of a particle moving under gravity with observation noise, where the particle begins at zero height, but the initial velocity $W$ and the acceleration $g$ due to gravity are not known exactly [5, Section 7.2, with a different solution method and noisy timesteps]. The current position, $X_n$, follows the recursion
$$X_n = Wn - \tfrac{1}{2} g n^2 = X_{n-1} + W + g\big(\tfrac{1}{2} - n\big).$$
The signal is then the vector of the position, initial velocity and acceleration, and follows the recursion
$$x_n = \begin{pmatrix}X_n\\ W\\ g\end{pmatrix} = \begin{pmatrix}1 & 1 & \tfrac{1}{2} - n\\ 0 & 1 & 0\\ 0 & 0 & 1\end{pmatrix} x_{n-1} = \alpha_n x_{n-1}.$$
Our only observation is of the current position, so we have measurement recursion
$$y_n - y_{n-1} = \begin{pmatrix}1 & 0 & 0\end{pmatrix} x_n + \epsilon_n = C_n x_n + \epsilon_n,$$
where $\epsilon_n \sim N(0, K_n^2)$. Say for timestep $n-1$ we have estimate mean and variance
$$z_{n-1} = \begin{pmatrix}z_{n-1}\\ W_{n-1}\\ g_{n-1}\end{pmatrix}, \qquad V_{n-1} = \begin{pmatrix}V_{n-1,1,1} & V_{n-1,1,2} & V_{n-1,1,3}\\ V_{n-1,2,1} & V_{n-1,2,2} & V_{n-1,2,3}\\ V_{n-1,3,1} & V_{n-1,3,2} & V_{n-1,3,3}\end{pmatrix};$$
then, since there is no process noise, $\tilde{V}_n = \alpha_n V_{n-1} \alpha_n^T$, and we have gain factor
$$\phi_n = \frac{1}{K_n^2 + C_n \tilde{V}_n C_n^T}\, \tilde{V}_n C_n^T = \frac{1}{K_n^2 + \tilde{V}_{n,1,1}}\, \tilde{V}_n C_n^T,$$
and innovation $r_n = y_n - y_{n-1} - \tilde{z}_n$.

The estimate mean then has form
$$z_n = \tilde{z}_n + \phi_n r_n = \tilde{z}_n + \frac{1}{K_n^2 + \tilde{V}_{n,1,1}}\, \tilde{V}_n C_n^T (y_n - y_{n-1} - \tilde{z}_n).$$
Looking at $z_n = C_n z_n$, $W_n = (0\ 1\ 0)\, z_n$ and $g_n = (0\ 0\ 1)\, z_n$ separately, we can see that
$$z_n = \frac{K_n^2 \tilde{z}_n + \tilde{V}_{n,1,1}(y_n - y_{n-1})}{K_n^2 + \tilde{V}_{n,1,1}}, \qquad W_n = W_{n-1} + \frac{\tilde{V}_{n,2,1}(y_n - y_{n-1} - \tilde{z}_n)}{K_n^2 + \tilde{V}_{n,1,1}}, \qquad g_n = g_{n-1} + \frac{\tilde{V}_{n,3,1}(y_n - y_{n-1} - \tilde{z}_n)}{K_n^2 + \tilde{V}_{n,1,1}}.$$
Additionally the estimate variance has form
$$V_n = (I - \phi_n C_n)\tilde{V}_n = \left(I - \frac{1}{K_n^2 + \tilde{V}_{n,1,1}}\begin{pmatrix}\tilde{V}_{n,1,1} & 0 & 0\\ \tilde{V}_{n,2,1} & 0 & 0\\ \tilde{V}_{n,3,1} & 0 & 0\end{pmatrix}\right)\tilde{V}_n.$$
From this we can see that $V_{n,1,2}$, $V_{n,1,3}$ can only be non-zero if they are non-zero in the prior estimate $\tilde{V}_n$, and that the sub-matrix $\begin{pmatrix}\tilde{V}_{n,2,2} & \tilde{V}_{n,2,3}\\ \tilde{V}_{n,3,2} & \tilde{V}_{n,3,3}\end{pmatrix}$ will stay the same if $\tilde{V}_{0,1,2} = \tilde{V}_{0,1,3} = 0$. However, in $\tilde{V}_n = \alpha_n V_{n-1} \alpha_n^T$ we have $\tilde{V}_{n,1,2} = V_{n-1,1,2} + V_{n-1,2,2} + (\tfrac{1}{2} - n)V_{n-1,3,2}$, and similarly for $\tilde{V}_{n,1,3}$, so unless we definitely know $W$ or $g$ already we will use all parts of the covariance matrix.
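As with the moving-on-a-line example of Section 4.3.4, this filter can be run directly in R. The sketch below is our own illustration; the true values W = 2 and g = 9.81, the noise level and the initial guesses are all arbitrary choices.

# Simulating the moving-under-gravity filter (illustrative parameters).
set.seed(2)
N <- 50; W.true <- 2; g.true <- 9.81; K2 <- 0.5
X <- W.true * (1:N) - 0.5 * g.true * (1:N)^2       # true positions
dy <- X + rnorm(N, 0, sqrt(K2))                    # observed y_n - y_{n-1} = X_n + eps_n
C <- matrix(c(1, 0, 0), 1, 3)
z <- c(0, 0, 5)                                    # vague initial guess for (X, W, g)
V <- diag(c(1, 10, 10))                            # initial uncertainty
for (n in 1:N) {
  alpha <- matrix(c(1, 0, 0,  1, 1, 0,  0.5 - n, 0, 1), 3, 3)  # filled column-wise
  z.p <- alpha %*% z                               # predicted mean
  V.p <- alpha %*% V %*% t(alpha)                  # no process noise, so H_n^2 = 0
  phi <- V.p %*% t(C) / (K2 + V.p[1, 1])           # gain factor
  r <- dy[n] - z.p[1]                              # innovation
  z <- z.p + phi * r
  V <- (diag(3) - phi %*% C) %*% V.p
}
round(c(W.est = z[2], g.est = z[3]), 3)            # should approach 2 and 9.81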

4.3.6 Comments

1. The dynamics for the progression of the signal, or the observation, can be non-linear. In this case, the estimates are calculated by taking partial derivatives of the recursion functions, as in a Taylor series: see [7] for more information. An example would be a digital meter, with measurement noise dependent on the size of $X$ or $Y$ as the meter switches between scales of measurement. A minimal sketch of such an "extended" filter step is given after this list.

2. If either noise has a non-zero mean, we can simply adjust for it by including the mean in $g_n$ or $\mathbb{E}(y_n \mid \mathcal{F}_{n-1})$.

3. In the case $C_n = 0$, the observations have no dependence on $X_n$, so the signal is not observable and the estimate is simply not updated: $z_n = \tilde{z}_n$ and $V_n = \tilde{V}_n$. In the matrix form, this corresponds to $C_n$ being singular, so that at least one linear combination of the signal is not observable: see [1, 6] for more on observability.

4. The signal at time $n$, $X_n$, can also be approximated given $\mathcal{F}_m$ for $m > n$. This is referred to as smoothing: for more information see [1, Chapter 9].
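To illustrate comment 1, here is a minimal R sketch of one step of an extended Kalman filter for a scalar signal with a non-linear observation; the functions f and h below are placeholder choices of our own, not taken from [7].

# One extended-Kalman-filter step for x_n = f(x_{n-1}) + nu_n, y_n = h(x_n) + eps_n.
f <- function(x) 0.9 * x + 1;  df <- function(x) 0.9      # signal dynamics
h <- function(x) x^2;          dh <- function(x) 2 * x    # non-linear observation
ekf_step <- function(z, V, y, H2, K2) {
  z.p <- f(z)                                  # predicted mean
  A <- df(z)                                   # linearise the dynamics at z
  V.p <- H2 + A * V * A                        # predicted variance
  Cn <- dh(z.p)                                # linearise the observation at z.p
  phi <- V.p * Cn / (K2 + Cn * V.p * Cn)       # gain, as in the linear filter
  list(z = z.p + phi * (y - h(z.p)),           # innovation uses the full h
       V = (1 - phi * Cn) * V.p)
}
ekf_step(z = 1, V = 1, y = 4, H2 = 0.1, K2 = 0.5)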


Appendix A

Appendices

The theory of probability, expectation and so on is built on the more abstract field of measure theory. We therefore give a brisk summary of the important results and definitions from the first few chapters of [8].

A.1 Measures

Definition A.1.1 For a set $S$ and a collection $\Sigma$ of subsets of $S$, we say $\Sigma$ is a $\sigma$-algebra on $S$ if $S \in \Sigma$; $F \in \Sigma \Rightarrow F^c = S \setminus F \in \Sigma$; and, for any sequence of subsets $(F_n)_{n\in\mathbb{N}}$ with $F_n \in \Sigma$,
$$\bigcup_n F_n \in \Sigma.$$
We say $F \in \Sigma$ is a $\Sigma$-measurable subset of $S$.

Definition A.1.2 We say $\sigma(\Sigma)$ is the $\sigma$-algebra generated by $\Sigma$, where $\sigma(\Sigma)$ is the intersection of all $\sigma$-algebras with $\Sigma$ as a subset.

Example A.1.3 (Borel $\sigma$-algebras) The Borel $\sigma$-algebra $\mathcal{B}(S)$ on the space $S$ is generated by the family of open subsets of $S$. $\mathcal{B}(\mathbb{R})$, often written as just $\mathcal{B}$, thus contains virtually every subset of $\mathbb{R}$ that arises in practice (though not every subset). Similarly, $\mathcal{B}((0,1])$ is often written as $\mathcal{B}(0,1]$.

Definition A.1.4 We say a function $\mu: \Sigma \to [0,\infty]$ on $\sigma$-algebra $\Sigma$ of set $S$ is a measure if $\mu(\emptyset) = 0$, and, for a sequence of disjoint subsets $(F_n \in \Sigma)_{n\in\mathbb{N}}$,
$$\mu\Big(\bigcup_{n\in\mathbb{N}} F_n\Big) = \sum_{n\in\mathbb{N}} \mu(F_n).$$
Furthermore, it is a probability measure if $\mu(S) = 1$.

Example A.1.5 The Lebesgue measure on $\mathcal{B}(0,1]$, $\lambda: \mathcal{B}(0,1] \to [0,1]$, has the form
$$\lambda((a,b]) = b - a, \qquad \lambda\Big(\bigcup_{n\in\mathbb{N}} (a_n, b_n]\Big) = \sum_{n\in\mathbb{N}} (b_n - a_n)$$
for disjoint intervals $(a_n, b_n]$. The Lebesgue measure is thus a general measure of length.

Definition A.1.6 We say $I$ is a $\pi$-system on $S$ if it is a family of subsets of $S$ that is stable under finite intersections.

Theorem A.1.7 (Carathéodory's Extension Theorem) Let $\Sigma_0$ be an algebra of subsets of $S$, and let $\Sigma$ be the $\sigma$-algebra generated by $\Sigma_0$. Then, for a countably additive map $\mu_0: \Sigma_0 \to [0,\infty]$, there exists a measure $\mu: \Sigma \to [0,\infty]$ such that $\mu = \mu_0$ on $\Sigma_0$. So, we can extend a measure to a larger $\sigma$-algebra.

Example A.1.8 The Lebesgue measure on $\mathcal{B}(0,1]$ can be extended to $\mathcal{B}[0,1]$ by setting $\lambda\{0\} = 0$. We can also extend to $\mathcal{B}$, so we can measure length on the whole of $\mathbb{R}$.

A.2 Events

Definition A.2.1 We say a probability triple is a measure space $(\Omega, \mathcal{F}, \mathbb{P})$, where $\Omega$ is the sample space, $\omega \in \Omega$ is a sample point, the $\sigma$-algebra $\mathcal{F}$ is the family of events (so that an event is an $\mathcal{F}$-measurable subset of $\Omega$), and $\mathbb{P}$ is a probability measure on $(\Omega, \mathcal{F})$.

Definition A.2.2 For a sequence of events $(E_n)_{n\in\mathbb{N}}$,
$$(E_n, \text{i.o.}) = (E_n \text{ infinitely often}) = \limsup E_n = \bigcap_m \bigcup_{n \ge m} E_n = \{\omega \mid \omega \in E_n \text{ for infinitely many } n\},$$
$$(E_n, \text{ev.}) = (E_n \text{ eventually}) = \liminf E_n = \bigcup_m \bigcap_{n \ge m} E_n = \{\omega \mid \omega \in E_n \text{ for all sufficiently large } n\}.$$

Theorem A.2.3 (Fatou's Lemma) $\mathbb{P}(\liminf E_n) \le \liminf \mathbb{P}(E_n)$.

Theorem A.2.4 (Reverse Fatou Lemma) For finite measure $\mathbb{P}$, $\mathbb{P}(\limsup E_n) \ge \limsup \mathbb{P}(E_n)$.

Theorem A.2.5 (First Borel–Cantelli Lemma) For events $(E_n)_{n\in\mathbb{N}}$,
$$\sum_n \mathbb{P}(E_n) < \infty \Rightarrow \mathbb{P}(\limsup E_n) = \mathbb{P}(E_n, \text{i.o.}) = 0.$$

A.3 Random Variables

Definition A.3.1 A function $f: S \to \mathbb{R}$ is $\Sigma$-measurable if $f^{-1}(B) \in \Sigma$ for every $B \in \mathcal{B}$. $m\Sigma$ is the set of all $\Sigma$-measurable functions on $S$, and $m\Sigma^+$ is the set of all non-negative elements of $m\Sigma$.

Definition A.3.2 For sample space $\Omega$ and $\sigma$-algebra $\mathcal{F}$, a random variable $X$ is an $\mathcal{F}$-measurable function $X: \Omega \to \mathbb{R}$, so that $X^{-1}: \mathcal{B} \to \mathcal{F}$, where $\mathcal{B}$ is as defined in Example A.1.3.

Definition A.3.3 For a random variable $X$, the law $\mathcal{L}_X$ of $X$ is $\mathcal{L}_X = \mathbb{P} \circ X^{-1}$, $\mathcal{L}_X: \mathcal{B} \to [0,1]$. Then $\mathcal{L}_X$ is a probability measure on $(\mathbb{R}, \mathcal{B})$.

Definition A.3.4 For a random variable $X$, the distribution function of $X$ is the function $F_X: \mathbb{R} \to [0,1]$, where $F_X(c) = \mathcal{L}_X(-\infty, c] = \mathbb{P}(X \le c) = \mathbb{P}\{\omega \mid X(\omega) \le c\}$.

Theorem A.3.5 (Monotone-Class Theorem) Let $\mathcal{H}$ be a class of bounded functions $S \to \mathbb{R}$ with the following properties: i) $\mathcal{H}$ is a vector space over $\mathbb{R}$; ii) the constant function $1 \in \mathcal{H}$; iii) for non-negative functions $(f_n \in \mathcal{H})_{n\in\mathbb{N}}$ with $f_n \uparrow f$ for some bounded $f$, we have $f \in \mathcal{H}$. Then, if $\mathcal{H}$ contains the indicator function of every set in some $\pi$-system $I$, it contains every bounded $\sigma(I)$-measurable function on $S$.

A.4 Independence

Definition A.4.1 Sub-$\sigma$-algebras $\mathcal{A}_1, \mathcal{A}_2, \ldots$ of $\mathcal{F}$ are independent if, whenever $a_i \in \mathcal{A}_i$ ($i \in \mathbb{N}$) and $i_1, \ldots, i_n$ are distinct,
$$\mathbb{P}(a_{i_1} \cap a_{i_2} \cap \ldots \cap a_{i_n}) = \prod_{k=1}^n \mathbb{P}(a_{i_k}).$$

Definition A.4.2 Random variables $X_1, X_2, \ldots$ are independent if the $\sigma$-algebras $\sigma(X_1), \sigma(X_2), \ldots$ are independent.

Definition A.4.3 Events $E_1, E_2, \ldots$ are independent if the $\sigma$-algebras $\mathcal{E}_1, \mathcal{E}_2, \ldots$ are independent, where $\mathcal{E}_n$ is the $\sigma$-algebra $\{\emptyset, E_n, \Omega \setminus E_n, \Omega\}$.

Theorem A.4.4 (Second Borel–Cantelli Lemma) For a sequence of independent events $(E_n)_{n\in\mathbb{N}}$,
$$\sum_n \mathbb{P}(E_n) = \infty \Rightarrow \mathbb{P}(E_n, \text{i.o.}) = \mathbb{P}(\limsup E_n) = 1.$$
Proof: We have
$$(\limsup E_n)^c = \liminf E_n^c = \bigcup_m \bigcap_{n \ge m} E_n^c.$$
We then have, by independence and the inequality $1 - p \le e^{-p}$,
$$\mathbb{P}\Big(\bigcap_{n \ge m} E_n^c\Big) = \prod_{n \ge m} (1 - \mathbb{P}(E_n)) \le \exp\Big(-\sum_{n \ge m} \mathbb{P}(E_n)\Big) = 0,$$
since $\sum_n \mathbb{P}(E_n) = \infty$. So $\liminf E_n^c$ is a countable union of null sets, hence is itself null, and $\mathbb{P}(\limsup E_n) = 1$. $\square$

A.5 Integration

Definition A.5.1 For a measure $\mu$ we write the Lebesgue integral of $f$ as $\mu(f) = \int f\,d\mu$, and
$$\mu(f, A) = \int_A f(s)\,\mu(ds) = \mu(f I_A), \qquad A \in \Sigma.$$
The map $f \mapsto \mu(f)$ is linear.

Definition A.5.2 $f \in m\Sigma^+$ is simple if it can be written as a weighted sum of indicator functions, i.e. $f = \sum_{k=1}^m a_k I_{A_k}$ for some $a_k \ge 0$ and some $A_k \in \Sigma$. We then write $f \in SF^+$.


We can assume that the $A_k$ in the above definition are disjoint, since $a_1 I_{A_1} + a_2 I_{A_2} = a_1 I_{A_1 \setminus A_2} + (a_1 + a_2) I_{A_1 \cap A_2} + a_2 I_{A_2 \setminus A_1}$.

Definition A.5.3 For a subset $A \in \Sigma$, we define $\mu_0(I_A) = \mu(A) \le \infty$, where $\mu_0$ is a naive integral defined for simple functions. For $f \in SF^+$ we define $\mu_0(f) = \sum_{k=1}^m a_k \mu(A_k) \le \infty$.

Definition A.5.4 For $f \in m\Sigma^+$ we define $\mu(f) = \sup\{\mu_0(h) \mid h \in SF^+, h \le f\} \le \infty$. So we can take the integral of a non-negative function as the upper limit of a sequence of integrals of simple functions.

Theorem A.5.5 (Monotone-Convergence Theorem) For a sequence of functions $(f_n \in m\Sigma^+)_{n\in\mathbb{N}}$,
$$f_n \uparrow f \Rightarrow \mu(f_n) \uparrow \mu(f) \le \infty.$$
So the integral of $f_n$ tends to the integral of $f$.

Definition A.5.6 The $r$th staircase function $a^{(r)}: [0,\infty] \to [0,\infty]$ is defined as
$$a^{(r)}(x) = \begin{cases} 0 & x = 0,\\ (i-1)2^{-r} & (i-1)2^{-r} < x \le i2^{-r} \le r,\ i \in \mathbb{N},\\ r & x > r. \end{cases}$$
The functions $f^{(r)} = a^{(r)} \circ f$ are simple functions, with $f^{(r)} \uparrow f$. By the Monotone-Convergence Theorem (Theorem A.5.5), we now have $\mu(f) = \uparrow\lim \mu(f^{(r)}) = \uparrow\lim \mu_0(f^{(r)})$. Since the $a^{(r)}$ are left-continuous, we also have $f_n \uparrow f \Rightarrow a^{(r)}(f_n) \uparrow a^{(r)}(f)$. (A numerical sketch of this construction is given after the Dominated-Convergence Theorem below.)

Theorem A.5.7 (Fatou's Lemma) For $(f_n \in m\Sigma^+)_{n\in\mathbb{N}}$, $\mu(\liminf f_n) \le \liminf \mu(f_n)$. So,
$$\int \liminf f_n \le \liminf \int f_n.$$

Theorem A.5.8 (Reverse Fatou Lemma) For $(f_n \in m\Sigma^+)_{n\in\mathbb{N}}$ with $f_n \le g$ for some $g \in m\Sigma^+$ with $\mu(g) < \infty$, $\mu(\limsup f_n) \ge \limsup \mu(f_n)$.

Definition A.5.9 $\mathcal{L}^1(S, \Sigma, \mu)$ is the set of $\mu$-integrable functions: $f \in m\Sigma$ such that $\mu(|f|) < \infty$.

Notation A.5.10 We let $f^+(s) = \max(f(s), 0)$ and $f^-(s) = \max(-f(s), 0)$. Then we have $f = f^+ - f^-$ and $|f| = f^+ + f^-$.

Note: This means that
$$\int f\,d\mu = \mu(f) = \mu(f^+) - \mu(f^-), \qquad \int |f|\,d\mu = \mu(|f|) = \mu(f^+) + \mu(f^-).$$
So we immediately get $\mu(f) \le \mu(|f|)$, with equality if and only if $f$ is non-negative almost everywhere.

Note: Since $f^+, f^- \in m\Sigma^+$, and the integral is linear, this extends the definition of the integral from $m\Sigma^+$ to the whole of $m\Sigma$.

Theorem A.5.11 (Dominated-Convergence Theorem) Take functions $f_n \in \mathcal{L}^1(S, \Sigma, \mu)$ with $f_n \to f$ pointwise and $|f_n(s)| \le g(s)$ for some $g \in \mathcal{L}^1(S, \Sigma, \mu)$. Then $f \in \mathcal{L}^1(S, \Sigma, \mu)$ and $\mu(|f_n - f|) \to 0$, so $\mu(f_n) \to \mu(f)$.
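Returning to Definition A.5.6 and the Monotone-Convergence Theorem (A.5.5), the staircase construction can be illustrated numerically. The R sketch below is our own; it uses a fine uniform grid on (0,1] as a stand-in for Lebesgue measure, and simplifies the boundary convention of $a^{(r)}$.

# Staircase approximation of mu(f) for f(x) = x^2 under Lebesgue measure on (0,1].
staircase <- function(x, r) pmin(floor(x * 2^r) / 2^r, r)   # the rth staircase a^(r)
f <- function(x) x^2
x <- seq(1 / 2^20, 1, by = 1 / 2^20)        # fine grid standing in for (0,1]
for (r in c(2, 4, 8, 16)) {
  mu0 <- mean(staircase(f(x), r))           # integral of the simple function f^(r)
  cat("r =", r, " mu0(f^(r)) =", mu0, "\n") # increases with r
}
mean(f(x))                                  # the limit: the true integral is 1/3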

Theorem A.5.12 (Scheffé's Lemma) Take $f_n, f \in \mathcal{L}^1(S, \Sigma, \mu)$, with $f_n \to f$ almost everywhere. Then $\mu(|f_n - f|) \to 0$ if and only if $\mu(|f_n|) \to \mu(|f|)$.

Method A.5.13 (Standard Machine) A method for proving that a linear result holds for all functions in a space:

i) Show the result is true for indicator functions.

ii) By linearity, show the result is true for functions in $SF^+$.

iii) Use the Monotone-Convergence Theorem to show the result is true for functions in $m\Sigma^+$.

iv) Write $h = h^+ - h^-$ and use linearity to show the result is true for measurable functions.

Definition A.5.14 For an $\mathcal{F}$-measurable function $f$, the measure $f\mu$ is defined by $(f\mu)(A) = \mu(f, A)$.

Definition A.5.15 A measure $\mu$ with $\mu(A) = \int I_A f\,d\nu$, also written $\mu = f\nu$ (cf. Definition A.5.14), has density $f$ relative to $\nu$. We then write
$$\frac{d\mu}{d\nu} = f.$$

Lemma A.5.16 If $\frac{d\mu}{d\nu} = f$, and $g$ is an $\mathcal{F}$-measurable function, then $fg$ is also $\mathcal{F}$-measurable, and
$$\int g\,d\mu = \int gf\,d\nu.$$
Proof: By the standard machine [Own Proof], Method A.5.13.

i) For $g$ an indicator function $I_A$, $\int I_A\,d\mu = \mu(A) = \nu(f, A) = \int I_A f\,d\nu$.

ii) For $g$ a simple function, $g = \sum_{k=1}^m g_k I_{A_k}$, so $\int g\,d\mu = \sum_k g_k \int I_{A_k}\,d\mu$ and $\int gf\,d\nu = \sum_k g_k \int I_{A_k} f\,d\nu$, which are equal by part i) and linearity.

iii) For $g$ a non-negative $\mathcal{F}$-measurable function, we can obtain $g$ as the limit of a sequence $(g_n)_{n\in\mathbb{N}}$ of simple functions by using the staircase function from Definition A.5.6. By part ii) we have $\mu(g_n, A) = \nu(fg_n, A)$, so by the Monotone-Convergence Theorem (A.5.5), $\mu(g, A) = \nu(fg, A)$, which is equivalent to $\int g I_A\,d\mu = \int g I_A f\,d\nu$.

iv) For $g$ an $\mathcal{F}$-measurable function, we apply part iii) to $g^+$ and $g^-$. We then have the result from linearity:
$$g = g^+ - g^- \Rightarrow \mu(g, A) = \mu(g^+, A) - \mu(g^-, A) = \nu(fg^+, A) - \nu(fg^-, A) = \nu(fg, A). \qquad \square$$

Corollary A.5.17 If $\int f\,d\nu = 1$, then $\mu$ is a probability measure.
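As a concrete illustration of Definitions A.5.15–A.5.16 and Corollary A.5.17 (our own example, not taken from [8]): let $\nu$ be Lebesgue measure on $[0,\infty)$ and $f(x) = e^{-x}$, so that $\mu(A) = \int_A e^{-x}\,dx$. Then
$$\int f\,d\nu = \int_0^\infty e^{-x}\,dx = 1,$$
so $\mu$ is a probability measure (the Exp(1) distribution) by Corollary A.5.17, and Lemma A.5.16 with $g(x) = x$ gives
$$\int g\,d\mu = \int_0^\infty x e^{-x}\,dx = 1,$$
the familiar mean of the exponential distribution.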


Bibliography

[1] Donald E. Catlin. Estimation, Control, and the Discrete Kalman Filter. Springer, 1989.

[2] Roger Mansuy. Histoire de martingales. Mathématiques & Sciences Humaines, (169):105–113, 2005.

[3] Michael Mitzenmacher and Eli Upfal. Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.

[4] Bernt Øksendal. Stochastic Differential Equations. Springer, 2000.

[5] Albert Tarantola. Inverse Problem Theory. Society for Industrial and Applied Mathematics, 2005.

[6] Richard Weber. Optimization and control. Lecture notes for the Optimization and Control course at Cambridge, 2010.

[7] Greg Welch and Gary Bishop. An introduction to the Kalman filter. Technical Report TR 95-041, University of North Carolina, Department of Computer Science, July 2006. Introductory article that also discusses the case of nonlinear systems.

[8] David Williams. Probability with Martingales. Cambridge University Press, 1991.
