Pattern Recognition

Prof. Christian Bauckhage

outline lecture 05

recap
discrete Markov chains
basic probability theory
summary
exercises

recap

basic terms and concepts from linear algebra
vector space, inner product space, normed space
linear combinations (convex, conic, affine, linear)
span, linear independence, bases
L^p norms / distances for R^m
standard simplex in R^m

discrete Markov chains

stochastic vector

q ∈ R^m is a stochastic vector if q ⪰ 0 and ‖q‖_1 = 1
⇔ q ∈ R^m is stochastic if q ∈ ∆^{m−1}

[figure: the standard simplex spanned by e_1, e_2, e_3 in R^3, with a stochastic vector q on it]

stochastic matrix

P ∈ R^{m×n} is column (row) stochastic if each of its columns (rows) is a stochastic vector
P ∈ R^{m×m} is bi-stochastic if it is both column and row stochastic

alternative terminology
column stochastic ⇔ left stochastic
row stochastic ⇔ right stochastic
bi-stochastic ⇔ doubly stochastic

note

the literature typically considers row stochastic vectors and row stochastic matrices
in this course, we will consider column stochastic vectors and column stochastic matrices
conceptually, this is no big deal . . . live with it
⇒ from now on: stochastic matrix ⇔ column stochastic matrix

Lemma
If P ∈ R^{m×n} is a stochastic matrix and q ∈ R^n is a stochastic vector, then r ∈ R^m where r = Pq is a stochastic vector.

Proof.
Since all entries of P and q are nonnegative, we have r ⪰ 0; moreover

‖r‖_1 = Σ_i r_i = Σ_i Σ_j P_{ij} q_j = Σ_j q_j Σ_i P_{ij} = Σ_j q_j = 1

where the second-to-last step uses that each column of P sums to 1.
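as a quick numerical sanity check of this Lemma, consider the following minimal numpy sketch; the matrix and vector entries are made-up example values:

import numpy as np

# column-stochastic matrix: each column sums to 1 (made-up example values)
P = np.array([[0.8, 0.3],
              [0.2, 0.7]])

q = np.array([0.5, 0.5])   # stochastic vector

r = P @ q
print(r, r.sum())          # entries are nonnegative and sum to 1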

Lemma
If P ∈ R^{m×k} and Q ∈ R^{k×n} are stochastic matrices, then R ∈ R^{m×n} where R = PQ is a stochastic matrix.

Proof.
Since Q = [q_1 q_2 … q_n] and R = [r_1 r_2 … r_n], we note that R = PQ ⇔ r_i = P q_i and resort to the previous Lemma.

note

stochastic matrices and vectors play a crucial role in Markov process models

Markov chains

used to model systems that have m possible states and, at any one time, are in one and only one of their m states
the set Q = {q_1, …, q_m} of states is called the state space
state transitions happen according to certain probabilities

for instance, Markov model of an SIR epidemic
[diagram: states S, I, R; transitions S → I with probability i and I → R with probability r; self-loops S → S (1 − i), I → I (1 − r), R → R (1)]

types of Markov chains

discrete-time Markov chain ⇔ a stochastic model that has the Markov property

p(X_{t+1} = q_{i_{t+1}} | X_t = q_{i_t}, …, X_1 = q_{i_1}) = p(X_{t+1} = q_{i_{t+1}} | X_t = q_{i_t})

homogeneous discrete-time Markov chain ⇔ a discrete-time Markov chain such that

p(X_{t+1} = q_i | X_t = q_j) = p(X_t = q_i | X_{t−1} = q_j) = p(q_i | q_j) = p_{ij}
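a minimal sketch of how one might sample a trajectory of a homogeneous DTMC; the 2-state transition matrix is an assumed example, not from the lecture:

import numpy as np

rng = np.random.default_rng(0)

# column-stochastic transition matrix: P[i, j] = p(i <- j) (example values)
P = np.array([[0.9, 0.5],
              [0.1, 0.5]])

state = 0
for t in range(10):
    # the next state is drawn according to column `state` of P
    state = rng.choice(2, p=P[:, state])
    print(state, end=' ')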

Markov processes

the dynamics of a homogeneous DTMC are governed by

q_t = P q_{t−1}

where
P ⇔ transition matrix
q ⇔ state vector
such that
p_{ij} ⇔ p(i ← j)
q_i ⇔ p(i)

example: Markov model of an SIR epidemic

[diagram: states S, I, R; transitions S → I with probability i and I → R with probability r; self-loops S → S (1 − i), I → I (1 − r), R → R (1)]

⎛ S_t ⎞   ⎛ 1 − i    0     0 ⎞ ⎛ S_{t−1} ⎞
⎜ I_t ⎟ = ⎜   i    1 − r   0 ⎟ ⎜ I_{t−1} ⎟
⎝ R_t ⎠   ⎝   0      r     1 ⎠ ⎝ R_{t−1} ⎠
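a minimal simulation sketch of these dynamics, with i and r as in the plots that follow; the initial state vector is an assumed example:

import numpy as np

i, r = 0.25, 0.5                 # infection and recovery probabilities

# column-stochastic transition matrix of the SIR chain
P = np.array([[1 - i, 0,     0],
              [i,     1 - r, 0],
              [0,     r,     1]])

q = np.array([0.99, 0.01, 0.0])  # assumed initial state (S, I, R)

for t in range(50):              # iterate q_t = P q_{t-1}
    q = P @ q

print(q)                         # eventually all probability mass is in R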

example: Markovian SIR dynamics

[plots: S_t, I_t, R_t over time t for i = 3/4, r = 1/2 and for i = 1/4, r = 1/2]

[plots: the same dynamics with the S, I, and R phases of the epidemic marked]

basic probability theory

probability

degree of belief in the truth of various propositions

examples of propositions
A = it will rain this afternoon
B = this is a fair coin
C = this coin is twice as likely to come up heads as tails
D = this image shows a face
Y_i = party i will win the upcoming election

3 requirements for consistent reasoning

1) transitivity
2) closure
3) conditional probability

transitivity

if we believe X more than Y and Y more than Z, then we must believe X more than Z
⇒ implies an ordering
⇒ assign real numbers to beliefs
⇔ the larger the value associated with a proposition, the more we believe it

0 = prob(false) ⇔ disbelief
1 = prob(true) ⇔ certainty

closure

if we specify how much we believe that X is true, we implicitly specify our disbelief
⇒ sum rule

prob(X) + prob(¬X) = 1

conditional probability

if we first state how much we believe that Y is true, and then state how much we believe that X is true given that Y is true, we implicitly specify how much we believe that both X and Y are true
⇒ product rule

prob(X, Y) = prob(X | Y) prob(Y)

note

sum and product rule define the algebra of probability; more results can be derived therefrom

Bayes’ theorem

prob(X | Y) = prob(Y | X) prob(X) / prob(Y)

this is because

prob(X, Y) = prob(Y, X)
⇔ prob(X | Y) prob(Y) = prob(Y | X) prob(X)
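a small numerical illustration of Bayes’ theorem; all probabilities are made-up example values:

# prior and likelihoods (made-up example values)
p_X      = 0.3                         # prob(X)
p_Y_X    = 0.8                         # prob(Y | X)
p_Y_notX = 0.2                         # prob(Y | not X)

# prob(Y) via marginalization over X and not X
p_Y = p_Y_X * p_X + p_Y_notX * (1 - p_X)

# posterior via Bayes' theorem
p_X_Y = p_Y_X * p_X / p_Y
print(p_X_Y)                           # about 0.632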

marginalization

prob(X, Y) + prob(X, ¬Y) = [prob(Y | X) + prob(¬Y | X)] prob(X) = prob(X)

marginalization

let {Y_i}_{i=1}^n be a set of mutually exclusive and exhaustive propositions, then

Σ_{i=1}^n prob(Y_i | X) = 1

and

Σ_{i=1}^n prob(X, Y_i) = prob(X)
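a minimal numpy sketch of these identities for an assumed table of joint probabilities p(X, Y_i); the numbers are made-up example values:

import numpy as np

# assumed joint probabilities p(X, Y_i) for n = 3 mutually exclusive,
# exhaustive propositions Y_1, Y_2, Y_3
p_X_Yi = np.array([0.10, 0.25, 0.15])

p_X = p_X_Yi.sum()                 # sum_i p(X, Y_i) = p(X)
p_Yi_X = p_X_Yi / p_X              # p(Y_i | X) = p(X, Y_i) / p(X)
print(p_X, p_Yi_X.sum())           # the conditionals sum to 1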

towards the continuum

if there are infinitely many mutually exclusive possibilities (e.g. Y = height of a person), then

∫_{−∞}^{∞} prob(Y | X) dY = 1

and

∫_{−∞}^{∞} prob(X, Y) dY = prob(X)

note

prob(X, Y) is technically a probability density function

prob(X, y_1 ≤ Y ≤ y_2) = ∫_{y_1}^{y_2} pdf(X, Y) dY

note

to get probabilities out of densities, we have to integrate
we will henceforth drop this distinction and simply write p(X, Y) to indicate either prob(X, Y) or pdf(X, Y)

independence

if X and Y are independent, then

p(X, Y) = p(X) p(Y)

because p(X, Y) = p(X | Y) p(Y) and p(X | Y) = p(X)

random variable

a variable X whose value is subject to chance
it can assume different values, each according to an associated probability
to express that X ∈ R is distributed according to p(x), we write X ∼ p(x) or p_X(x)

expectation

the average value of some function f(x) under a distribution p(x) is called the expectation of f(x)
we have

E[f(x)] = Σ_x f(x) p(x)

or

E[f(x)] = ∫ f(x) p(x) dx

special case

expectation of a random variable X

E[X] = ∫ x p(x) dx ≡ µ
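a minimal sketch of both the discrete and the continuous case; the die and the Gaussian are assumed examples:

import numpy as np

# discrete case: expectation of a fair die roll
x = np.arange(1, 7)
p = np.full(6, 1/6)
print(np.sum(x * p))                          # E[X] = 3.5

# continuous case: Monte Carlo estimate of E[X] for X ~ N(mu=2, sigma=1)
rng = np.random.default_rng(0)
print(rng.normal(2.0, 1.0, 100_000).mean())   # about 2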

example

[plots: the expectation E[x] indicated for a symmetric unimodal, a skewed unimodal, and two multimodal distributions p(x)]

special cases

averaging a function of several variables

E[f(x, y)] = ∫ f(x, y) p(x, y) dx dy

averaging a function of several variables over one variable

E_x[f(x, y)] = ∫ f(x, y) p(x) dx

conditional expectation

E_x[f | y] = ∫ f(x) p(x | y) dx

note

E[E[f(x)]] = E[f(x)]

because

E[E[f(x)]] = ∫ (∫ f(x) p(x) dx) p(z) dz
           = (∫ p(z) dz) (∫ f(x) p(x) dx)
           = ∫ f(x) p(x) dx
           = E[f(x)]

variance

the variability of f(x) around the mean E[f(x)] is called the variance of f(x)
we have

var[f(x)] = E[(f(x) − E[f(x)])²]

and note that

var[f] = E[f² − 2 f E[f] + E²[f]]
       = E[f²] − 2 E[f] E[f] + E²[f]
       = E[f²] − E²[f]

special case

variance of a random variable X

var[X] = E[(X − E[X])²] = E[(X − µ)²] ≡ σ²

and, written out,

var[X] = ∫ (x − µ)² p(x) dx
       = ∫ x² p(x) dx − 2µ ∫ x p(x) dx + µ² ∫ p(x) dx
       = ∫ x² p(x) dx − µ²
       ≡ σ²
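a quick numerical check of the identity var[X] = E[X²] − µ²; the exponential distribution is an assumed example:

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(2.0, 100_000)   # assumed example distribution

mu = x.mean()
print(np.mean((x - mu)**2))         # E[(X - mu)^2]
print(np.mean(x**2) - mu**2)        # E[X^2] - mu^2 ... same value
print(x.var())                      # numpy's built-in estimate agrees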

note

once again: variance = expected squared deviation from the expected value

covariance

for two random variables X and Y, we have

cov[X, Y] = E_{X,Y}[(X − E[X])(Y − E[Y])]
          = E_{X,Y}[XY] − E[X] E[Y]

covariance matrix

for two random vectors x and y, we have

cov[x, y] = E_{x,y}[(x − E[x])(y − E[y])^T]
          = E_{x,y}[x y^T] − E[x] E[y]^T
          ≡ C

covariance matrix

in particular, we have

cov[x, x] = E[x x^T] − µµ^T

where µ = E[x]
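a minimal numpy sketch estimating cov[x, x] from samples; the Gaussian parameters are made-up example values:

import numpy as np

rng = np.random.default_rng(0)

# assumed example: 10000 samples of a 2d random vector x
X = rng.multivariate_normal([0, 0], [[2.0, 0.8], [0.8, 1.0]], 10_000)

mu = X.mean(axis=0)

# cov[x, x] = E[x x^T] - mu mu^T, estimated from the samples
C = (X.T @ X) / len(X) - np.outer(mu, mu)
print(C)                      # close to [[2.0, 0.8], [0.8, 1.0]]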

summary

we now know about

stochastic vectors and matrices
discrete Markov chains
basic terminology and concepts of probability theory

exercises

show that all of the following are indeed identical
p(X, Y, Z) = p(X | Y, Z) p(Y, Z)
           = p(X | Y, Z) p(Y | Z) p(Z)
           = p(Y | X, Z) p(Z | X) p(X)
           = p(Y, Z | X) p(X)
           ⋮

exercises

show that, for a constant c and a random variable X
a) E[c + X] = c + E[X]
b) E[cX] = c E[X]
c) var[c + X] = var[X]
d) var[cX] = c² var[X]

show that, for two random variables X and Y
a) E[X + Y] = E[X] + E[Y]

show that, for two independent random variables X and Y
a) E[XY] = E[X] E[Y]