Pattern Recognition

Pattern Recognition Prof. Christian Bauckhage

learning theory

an intriguing problem . . . training data

[figure: scatter plot of a training data sample, x and y axes ranging from −2 to 10]

an intriguing problem . . . training data / training error

[figure: the training data next to a fitted model, illustrating its training error; axes from −2 to 10]

an intriguing problem . . . training data / training error / test error

[figure: the training data, the fitted model's training error, and its test error on new data; axes from −2 to 10]

outline lecture 12

preamble
setting the stage
the bias variance dilemma
the Hoeffding inequality
the Vapnik-Chervonenkis inequality
summary

preamble

note

throughout this course, we study machine learning for pattern recognition, and machine learning is all about model fitting. in today’s lecture, we will use the term hypothesis instead of model; this is because (most of) the literature on learning theory does so, too

setting the stage

general setting

we are given a data sample D = {(x_i, y_i)}_{i=1}^n

there is an unknown process or function f that has generated this data, in the sense that y = f(x) + ε

we don’t know f but want to estimate it in order to generalize the “information” in D to new / unknown situations ⇔ we want to learn to predict the “correct” y for any x

general approach

we hypothesize that h(x, θ) ∈ H can approximate f (x)

H denotes a certain set of functions, for instance

H_LCL the set of all linear classifiers applicable to x
H_MLP the set of all multilayer perceptrons applicable to x
H_SVM the set of all support vector machines applicable to x
. . .

note

each hypothesis class H has its own particular set of parameters Θ(H)

depending on the nature of Θ(H), |H| may be large; e.g. for x ∈ R², there is a whole continuum of linear classifiers h(x, w) = sign(wᵀx)

the nature of Θ(H) typically determines the flexibility or model complexity of the functions in H, and a class of hypotheses H may be more “complex” than another class H′

general approach (cont.)

given the data in D and our choice of hypotheses H, we determine θ∗ such that yi ≈ h(xi , θ∗ ) for all i

we then “hope” that our model generalizes well enough so that f (x) ≈ h(x, θ∗ ) for all x

general approach (cont.)

in other words, given D and H, we determine θ∗ and “hope” that the corresponding training error

E_train = E_{x_i}[ d(y_i, h(x_i, θ∗)) ]

is a good proxy for the test error

E_test = E_x[ d(f(x), h(x, θ∗)) ]
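as a concrete sketch of these two quantities (the target f, the noise level, and the cubic hypothesis class are assumptions of this example; d is taken to be the squared loss):

```python
import numpy as np

rng = np.random.default_rng(0)

# assumed stand-in for the unknown process f, for illustration only
def f(x):
    return np.sin(x)

# a noisy training sample D
x_train = rng.uniform(0.0, 3.0, 20)
y_train = f(x_train) + rng.normal(0.0, 0.3, 20)

# fit a cubic polynomial hypothesis h(x, theta*)
theta = np.polyfit(x_train, y_train, 3)

def h(x):
    return np.polyval(theta, x)

# E_train: average squared loss d(y_i, h(x_i, theta*)) over the sample
E_train = np.mean((y_train - h(x_train)) ** 2)

# E_test: average squared loss d(f(x), h(x, theta*)), approximated on a grid
x_test = np.linspace(0.0, 3.0, 1000)
E_test = np.mean((f(x_test) - h(x_test)) ** 2)

print(E_train, E_test)
```

note that E_test is computed against f itself, which is only possible here because we simulated f; in practice it must be estimated from held-out data.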

notational simplifications

machine learning determines parameters θ∗ from data D

from now on, we therefore simply write

h(x, θ∗) = h(x, g(D)) = h(x, D) = h_D(x)

to distinguish h_{D1}(x) from h_{D2}(x), we simply write h₁(x) and h₂(x)

questions

how to choose H?
how well does this procedure work?
how large must D be for it to work well?
how, if at all, can E_train be related to E_test?

meta question

why are these questions relevant?

the bias variance dilemma

didactic experiment

a function f(x)

[figure: plot of f(x) for x ∈ [0, 3], y ∈ [−1, 2]]

didactic experiment

a noisy sample D

[figure: f(x) overlaid with a noisy sample D, x ∈ [0, 3]]

train 3 models on 3 noisy samples

[figure: 3×3 grid of fits; hypotheses h(x, D1), h(x, D2), h(x, D3) from the classes H1, H3, and H11, each fit to a noisy sample]

E_D[E_train] = 1.34 (H1)    E_D[E_train] = 0.38 (H3)    E_D[E_train] = 0.00 (H11)

test the 3 models on 3 noisy samples

[figure: 3×3 grid; the fitted hypotheses from H1, H3, and H11 evaluated on fresh noisy samples]

E_D[E_test] = 9.24 (H1)    E_D[E_test] = 4.60 (H3)    E_D[E_test] = 3 · 10⁵ (H11)
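the qualitative behavior on these slides can be reproduced with a small simulation; the target f, the noise level, and the sample size are assumptions of this sketch, and H_k is taken to be the class of polynomials of degree k:

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # assumed stand-in for the unknown target; the lecture's f is not specified
    return np.sin(np.pi * x / 1.5)

def experiment(degree, n=12, trials=100, noise=0.3):
    """Average E_train and E_test of degree-`degree` polynomial fits h_D."""
    x_test = np.linspace(0.0, 3.0, 200)
    e_train, e_test = [], []
    for _ in range(trials):
        x = rng.uniform(0.0, 3.0, n)
        y = f(x) + rng.normal(0.0, noise, n)   # a fresh noisy sample D
        theta = np.polyfit(x, y, degree)
        e_train.append(np.mean((y - np.polyval(theta, x)) ** 2))
        e_test.append(np.mean((f(x_test) - np.polyval(theta, x_test)) ** 2))
    return np.mean(e_train), np.mean(e_test)

# H1 underfits, while H11 can interpolate the sample: its training error
# drops to ~0 while its test error explodes, as on the slides
for deg in (1, 3, 11):
    print("H%d:" % deg, experiment(deg))
```

the exact numbers depend on the assumed f and noise, but the ordering (training error falls with complexity, test error eventually explodes) is robust.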

train 3 models on 3 pure samples

[figure: 3×3 grid of fits; hypotheses from H1, H3, and H11, each fit to a noise-free sample D1, D2, D3 of f]

E_D[E_train] = 3.22 (H1)    E_D[E_train] = 0.28 (H3)    E_D[E_train] = 0.00 (H11)

test the 3 models on 3 pure samples

[figure: 3×3 grid; the fitted hypotheses from H1, H3, and H11 evaluated on fresh noise-free samples]

E_D[E_test] = 8.42 (H1)    E_D[E_test] = 8.58 (H3)    E_D[E_test] = 4 · 10⁶ (H11)

question what is going on here?

answer let’s see . . .

observe

for square loss, we have

E_D[E_test] = E_D[ E_x[ (h_D(x) − f(x))² ] ] = E_x[ E_D[ (h_D(x) − f(x))² ] ]

observe

defining the average hypothesis

h̄(x) = E_D[ h_D(x) ]

we recall (from lecture 05) that

E_D[ h̄(x) ] = E_D[ E_D[ h_D(x) ] ] = h̄(x)

this allows us to examine the term

E_D[ (h_D(x) − f(x))² ] = E_D[ (h_D − f)² ]

we find

E_D[ (h_D − f)² ] = E_D[ (h_D − h̄ + h̄ − f)² ]
                  = E_D[ (h_D − h̄)² + (h̄ − f)² + 2 (h_D − h̄)(h̄ − f) ]
                  = E_D[ (h_D − h̄)² ] + (h̄ − f)²
                  = var_D[h_D] + bias²_D[h_D]

observe that the cross term vanishes:

E_D[ (h_D − h̄)(h̄ − f) ] = (h̄ − f) E_D[ h_D − h̄ ] = (h̄ − f) (E_D[h_D] − h̄) = (h̄ − f)(h̄ − h̄) = 0
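this decomposition can be checked numerically; the target f, the noise, and the cubic hypothesis class are assumptions of this sketch:

```python
import numpy as np

rng = np.random.default_rng(2)

def f(x):
    # assumed target function for this check
    return np.sin(np.pi * x / 1.5)

x0 = 1.0   # the decomposition holds pointwise; evaluate everything at x0

# draw many training sets D and record h_D(x0) for cubic fits
preds = []
for _ in range(2000):
    x = rng.uniform(0.0, 3.0, 10)
    y = f(x) + rng.normal(0.0, 0.3, 10)
    theta = np.polyfit(x, y, 3)
    preds.append(np.polyval(theta, x0))
preds = np.array(preds)

h_bar = preds.mean()                    # average hypothesis h_bar(x0)
mse   = np.mean((preds - f(x0)) ** 2)   # E_D[(h_D - f)^2]
var   = np.mean((preds - h_bar) ** 2)   # var_D[h_D]
bias2 = (h_bar - f(x0)) ** 2            # bias^2_D[h_D]

# the two sides agree up to floating point rounding
print(mse, var + bias2)
```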

a picture says a 1000 words . . .

high bias, low variance · medium bias, medium variance · low bias, high variance

[figure: the H1 fits cluster tightly but sit far from f (high bias, low variance); the H3 fits balance both (medium bias, medium variance); the H11 fits pass through every sample point but scatter wildly across samples (low bias, high variance)]

in plain English

high bias ⇔ model does not fit training data well

low variance ⇔ training on different data samples yields similar results

low bias ⇔ model does fit training data well

high variance ⇔ training on different data samples yields very different results

note

while we derived the bias-variance decomposition in the context of regression, it also applies to classification

generally, for training data D, we always have

E_D[ (f(x) − h_D(x))² ] = ( f(x) − E_D[h_D(x)] )² + var_D[h_D(x)] + noise

one more thing . . .

E_x[ E_D[ (f(x) − h_D(x))² ] ] = E_x[ bias²_D[h_D(x)] + var_D[h_D(x)] ] = bias² + var

implications

models / algorithms with many parameters (degrees of freedom) adapt well to the training data in D but may generalize poorly ⇔ they tend to have low bias (w.r.t. f) but high variance (w.r.t. different realizations of D)

there is a tradeoff between two kinds of errors
a model may miss relevant aspects (high bias)
a model may learn irrelevant aspects (high variance)

⇔ appropriate modeling assumptions, prior information, and sufficiently many training data are important

bias variance dynamics

[figure: error vs. model complexity; bias decreases, variance increases, and the generalization error is U-shaped; low complexity means underfitting, high complexity means overfitting]

the Hoeffding inequality

Hoeffding, 1963

let X₁, . . . , Xₙ be independent r.v. such that 0 ≤ Xᵢ ≤ 1, then

p( |X̄ − E[X̄]| > δ ) ≤ 2 e^(−2nδ²)
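a quick Monte Carlo check of the inequality in the Bernoulli setting; q = 0.1763 is taken from the course's coin example, while n, δ, and the number of trials are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(3)

q, n, delta = 0.1763, 256, 0.05
trials = 20000

# estimate p(|q_hat - q| > delta) for samples of n Bernoulli(q) variables
q_hat = rng.binomial(n, q, size=trials) / n
p_emp = np.mean(np.abs(q_hat - q) > delta)

# Hoeffding's bound for the same n and delta
bound = 2.0 * np.exp(-2.0 * n * delta ** 2)

# the empirical probability sits below the (rather loose) bound
print(p_emp, bound)
```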

recall lecture 07

N = 760, q = 0.1763

sample n | q̂      | |q − q̂|
       1 | 0.0000 | 0.1763
       2 | 0.5000 | 0.3237
       4 | 0.7500 | 0.5737
       8 | 0.0000 | 0.1763
      16 | 0.1875 | 0.0112
      32 | 0.1562 | 0.0201
      64 | 0.0938 | 0.0826
     128 | 0.2031 | 0.0268
     256 | 0.1719 | 0.0044

observe

assume X ∈ {0, 1} and X ∼ f_Ber(q)

drawing a sample D = {X₁, . . . , Xₙ}, we have

X̄ = (1/n) Σᵢ₌₁ⁿ Xᵢ = n₁/n = q̂   (where n₁ is the number of ones in the sample)

as an estimate of q = E[X]

but how good will this estimate be?

observe

Hoeffding’s inequality is of the form

p( bad ) ≤ small

component by component, we have

p( |q̂ − q| > δ ) ≤ small
p( |q̂ − q| > δ ) ≤ 2 e^(−2n ...)   :-)
p( |q̂ − q| > δ ) ≤ 2 e^(−2nδ²)    :-(

it says that the statement q̂ = q is probably approximately correct (p.a.c.)

if we want to guarantee that |q̂ − q| is small, we should insist on a small δ (e.g. δ = 10⁻⁵); then, however, δ² will be minuscule (δ² = 10⁻¹⁰) and we would need a very large n in order for nδ² to take effect

question OK, but what does this have to do with how h_D(x) generalizes to f(x)?

answer let’s see . . .

the Vapnik-Chervonenkis inequality

think of our problem like this . . .

let

d( f(x), h_D(x) ) = 1 if f(x) ≠ h_D(x), and 0 otherwise

then, for a fixed hypothesis h_D(x) ∈ H,

p( |E_test(h) − E_train(h)| > δ ) ≤ 2 e^(−2nδ²)

however . . .

a sample D of training data is a random variable; training on different samples D will yield different models h_D

example

[figure: hypotheses h₁, h₂, h₃, . . . , h_m ∈ H, each with its own E_test and E_train]

therefore . . .

for training samples of size n, the optimal h ∈ H will obey

p( |E_test(h) − E_train(h)| > δ ) ≤ Σᵢ₌₁^|H| p( |E_test(hᵢ) − E_train(hᵢ)| > δ )
                                 ≤ Σᵢ₌₁^|H| 2 e^(−2nδ²)
                                 = 2 |H| e^(−2nδ²)
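for finite |H|, this union bound is still usable: the sample size needed to push 2|H| e^(−2nδ²) below a target γ grows only logarithmically in |H|; a sketch with arbitrary δ and γ values:

```python
import math

def n_required(delta, gamma, h_size):
    """Smallest n with 2 * h_size * exp(-2 * n * delta**2) <= gamma."""
    return math.ceil(math.log(2.0 * h_size / gamma) / (2.0 * delta ** 2))

# |H| grows by many orders of magnitude, yet n only grows additively
for m in (1, 100, 10**6):
    print(m, n_required(0.1, 0.05, m))
```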

observe

this result is almost meaningless, since |H| may grow beyond all bounds; however . . . some rather deep math establishes that, for binary classification and the 0/1 loss function . . .

Vapnik & Chervonenkis, 1971

p( |E_test(h) − E_train(h)| > δ ) ≤ 4 Δ(2n) e^(−(1/8) nδ²)

note

the function Δ(n) is called the growth function; it indicates how many binary functions the hypotheses h ∈ H can realize on n points

one can show that either Δ(n) = 2ⁿ or Δ(n) ≤ n^(d_VC) + 1

note

the number d_VC is called the Vapnik-Chervonenkis dimension, or VC dimension for short

it depends on the hypothesis class H, i.e. d_VC = d_VC(H)

we have d_VC(H) = the largest number of points H can shatter

shattering

a hypothesis class H shatters n points in general position in Rᵐ if all possible binary labelings of the n points can be correctly classified by some h ∈ H

for example, linear classifiers can shatter m + 1 points in Rᵐ ⇔ their VC dimension is m + 1
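this can be checked by brute force in R²; the grid search below is a crude sketch (a coarse parameter grid stands in for a proper separability test), but it suffices for these small point sets:

```python
import itertools

def separable(points, labels, grid):
    """Search a small grid of (w1, w2, b) for a strict linear separator."""
    for w1, w2, b in itertools.product(grid, repeat=3):
        if all(y * (w1 * x1 + w2 * x2 + b) > 0
               for (x1, x2), y in zip(points, labels)):
            return True
    return False

def shatters(points, grid):
    """True if every +/-1 labeling of `points` is separated by a grid classifier."""
    return all(separable(points, labels, grid)
               for labels in itertools.product((+1, -1), repeat=len(points)))

grid = [x / 2 for x in range(-4, 5)]       # -2.0 ... 2.0 in steps of 0.5
three = [(0, 0), (1, 0), (0, 1)]           # 3 points in general position in R^2
four  = [(0, 0), (1, 1), (1, 0), (0, 1)]   # 4 points in the XOR configuration

print(shatters(three, grid))   # True: linear classifiers shatter 3 points in R^2
print(shatters(four, grid))    # False: the XOR labeling defeats every line
```

the negative result is only as strong as the grid, but for the XOR configuration non-separability is a classical fact, so the search correctly fails to find a separator.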

observe

we may say d_VC(H) = number of effective parameters in H

compare, for instance

y = sign(wᵀx)   vs.   y = sign( b · (wᵀx)^c )

where b ∈ R⁺ and c ∈ {1, 3, 5, . . .}; since b is positive and c is odd, the extra parameters never change the sign, so they are not effective

note

if d_VC(H) is finite, the optimal h ∈ H will generalize to f when trained with large enough n

this is because, looking at the right hand side of

p( |E_test(h) − E_train(h)| > δ ) ≤ 4 Δ(2n) e^(−(1/8) nδ²)   (call this bound γ)

we find that, for large enough n, the negative exponential e^(−(1/8) nδ²) declines faster than the polynomial Δ(2n) grows ⇔ for finite d_VC(H) and large enough n, E_train(h) will indeed be a proxy of E_test(h)
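to see the exponential win, evaluate the right hand side with Δ(2n) replaced by its polynomial bound (2n)^(d_VC) + 1; the d_VC and δ values below are arbitrary:

```python
import math

def vc_bound(n, d_vc, delta):
    """4 * Delta(2n) * exp(-n * delta**2 / 8), with Delta(m) <= m**d_vc + 1."""
    return 4.0 * ((2 * n) ** d_vc + 1) * math.exp(-n * delta ** 2 / 8.0)

# the bound is vacuous (far above 1) for small n,
# but collapses once the exponential overtakes the polynomial
for n in (10**3, 10**4, 10**5):
    print(n, vc_bound(n, d_vc=3, delta=0.1))
```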

question fixing a certain δ (say 0.1) and γ (say 0.05), how does n depend on d_VC?

answer hard to say in general . . . for certain hypothesis classes H, we can answer rigorously, but in general we have to adhere to rules of thumb such as

n ≥ 10 · d_VC   or   n ≥ ‖θ‖ / δ

train and test 3 models on large samples

H11 training / H11 testing

[figure: hypotheses h(x, D1), h(x, D2), h(x, D3) from H11 trained and tested on large samples; the fits now follow f closely]

E[E_train] = 1.39    E[E_test] = 0.54

another point of view

consider once again

p( |E_test(h) − E_train(h)| > δ ) ≤ 4 Δ(2n) e^(−(1/8) nδ²) = γ

an alternative insight arises from fixing γ (say 0.05) and asking for the corresponding δ; we find

δ = √( (8/n) ln( 4 Δ(2n) / γ ) ) = Ω(H, n, γ)

long story short . . .

one can then show that

p( E_test(h) ≤ E_train(h) + Ω ) ≥ 1 − γ

⇔ yet another tradeoff: if the complexity of H increases, E_train(h) decreases but Ω(H, n, γ) increases
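plugging the polynomial bound on Δ(2n) into Ω shows the penalty growing with d_VC at fixed n and γ; again a sketch with assumed values:

```python
import math

def omega(n, d_vc, gamma):
    """Omega(H, n, gamma) with Delta(2n) approximated by (2n)**d_vc + 1."""
    return math.sqrt(8.0 / n * math.log(4.0 * ((2 * n) ** d_vc + 1) / gamma))

# richer hypothesis classes pay a larger complexity penalty
for d in (1, 5, 20):
    print(d, omega(10**4, d, 0.05))
```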

VC dynamics

[figure: error vs. VC dimension; E_train decreases, the model complexity term Ω increases, and E_test is U-shaped]

summary

we now know about

the bias variance dilemma
models may ignore relevant aspects in training data D
models may learn irrelevant aspects in training data D
there is a tradeoff between bias and variance

the VC inequality and the VC dimension
we may indeed hope that h will generalize to f
complex models require massive amounts of training data

[figures: the bias variance dynamics plot (error vs. model complexity) and the VC dynamics plot (error vs. VC dimension), revisited]