Pattern Recognition

Prof. Christian Bauckhage

outline lecture 17

constrained optimization
Lagrange multipliers
Lagrange duality

summary

before we begin . . .

minimizing a function is equivalent to maximizing its negative

$$\min_x f(x) = -\max_x \bigl(-f(x)\bigr)$$

because

$$x \leq y \;\Leftrightarrow\; -x \geq -y$$

and, for vectors, inequalities are understood componentwise

$$x \geq y \;\Leftrightarrow\; x_i \geq y_i \quad \forall\, i$$

constrained optimization

a problem from high-school . . .

A farmer has bought a coil of barbed wire of length L and wants to fence off a rectangular area A of land of maximum size. What side lengths 0 ≤ x1, x2 ≤ L should he choose for the rectangle?

observe

algebraic preparation

$$A = A(x_1, x_2) = x_1 \cdot x_2$$

$$L = L(x_1, x_2) = 2\,x_1 + 2\,x_2$$

so that

$$x_1 = \frac{L}{2} - x_2$$

and

$$A = \Bigl(\frac{L}{2} - x_2\Bigr) \cdot x_2 = \frac{L\,x_2}{2} - x_2^2$$

solution

$$\frac{dA}{dx_2} = \frac{L}{2} - 2\,x_2 \stackrel{!}{=} 0$$

$$\Rightarrow \quad x_2 = \frac{L}{4}, \qquad x_1 = \frac{L}{2} - x_2 = \frac{L}{4}$$
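a quick check of this algebra, sketched in Python with SymPy (the symbol and variable names below are ours, chosen for illustration):

```python
# verify the high-school solution symbolically (a SymPy sketch)
import sympy as sp

x2, L = sp.symbols('x2 L', positive=True)
A = (L/2 - x2) * x2                        # area after eliminating x1 = L/2 - x2
x2_star = sp.solve(sp.diff(A, x2), x2)[0]  # dA/dx2 = L/2 - 2 x2 = 0
print(x2_star)                             # L/4
print(sp.simplify(L/2 - x2_star))          # x1 = L/4 as well
```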

note

in general, expressing x_i as a function of x_j may break symmetry

in general, it might not even be possible to express x_i as a function of x_j

⇒ we need a “better”, that is, more general approach . . .

note

the above problem is a constrained optimization problem; in particular, it is an instance of the following general type

$$\max_{x \in \mathbb{R}^m} f(x) \quad \text{s.t.} \quad x \in S$$

where S ⊆ ℝ^m is called the feasible set of the constrained problem

x* ∈ S is a (local) solution if f(x*) ≥ f(x) for all x ∈ B_ε(x*)

observe

in our example, we have

$$x \in \mathbb{R}^2, \qquad f(x) = x_1 \cdot x_2, \qquad S = \bigl\{ x \;\big|\; 2\,(x_1 + x_2) = L \bigr\}$$

note

in practice, the feasible set is basically always characterized in terms of constraint equations

$$\begin{aligned}
\max_{x \in \mathbb{R}^m} \; & f(x) \\
\text{s.t.} \; & g_i(x) = 0, \quad i = 1, \ldots, p \qquad \text{(equality constraints)} \\
& h_j(x) \leq 0, \quad j = 1, \ldots, q \qquad \text{(inequality constraints)}
\end{aligned}$$

observe

in our example, we have

$$x \in \mathbb{R}^2, \qquad f(x) = x_1 \cdot x_2, \qquad g(x) = 2\,(x_1 + x_2) - L$$

⇔ no inequality constraints and only a single equality constraint

Lagrange multipliers

note

problems with equality constraints can be solved using the method of Lagrange multipliers

Joseph-L. Lagrange (∗1736, †1813)

observe

in our example, we would consider the following Lagrangian

$$L(x, \lambda) = f(x) + \lambda\, g(x)$$

with only one Lagrange multiplier λ

note

to solve an equality constrained problem, we consider

$$\nabla L(x, \lambda) = \nabla f(x) + \lambda\, \nabla g(x) \stackrel{!}{=} 0$$

because, if the problem has a solution x*, there exists a Lagrange multiplier λ* such that

$$\nabla f(x^*) = -\lambda^* \nabla g(x^*)$$

observe

in our example, we have

$$\frac{\partial L}{\partial x_1} = \frac{\partial f}{\partial x_1} + \lambda \frac{\partial g}{\partial x_1} = x_2 + 2\,\lambda = 0$$

$$\frac{\partial L}{\partial x_2} = \frac{\partial f}{\partial x_2} + \lambda \frac{\partial g}{\partial x_2} = x_1 + 2\,\lambda = 0$$

$$\frac{\partial L}{\partial \lambda} = g(x_1, x_2) = 2\,x_1 + 2\,x_2 - L = 0$$

observe

in matrix / vector form

$$\begin{pmatrix} 0 & 1 & 2 \\ 1 & 0 & 2 \\ 2 & 2 & 0 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \lambda \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ L \end{pmatrix}$$

so that

$$\begin{pmatrix} x_1^* \\ x_2^* \\ \lambda^* \end{pmatrix} = \begin{pmatrix} -\tfrac{1}{2} & \tfrac{1}{2} & \tfrac{1}{4} \\ \tfrac{1}{2} & -\tfrac{1}{2} & \tfrac{1}{4} \\ \tfrac{1}{4} & \tfrac{1}{4} & -\tfrac{1}{8} \end{pmatrix} \begin{pmatrix} 0 \\ 0 \\ L \end{pmatrix} = \begin{pmatrix} \tfrac{L}{4} \\ \tfrac{L}{4} \\ -\tfrac{L}{8} \end{pmatrix}$$
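the same solve, numerically; a small NumPy sketch (we fix L = 1 for concreteness):

```python
# solve the stationarity system numerically (a NumPy sketch, L = 1)
import numpy as np

L = 1.0
M = np.array([[0., 1., 2.],
              [1., 0., 2.],
              [2., 2., 0.]])
b = np.array([0., 0., L])
x1, x2, lam = np.linalg.solve(M, b)
print(x1, x2, lam)   # 0.25 0.25 -0.125, i.e. L/4, L/4, -L/8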


question why does this work?

answer let’s see . . .

observe

f and g in our example

[two plots over 0 ≤ x1, x2 ≤ L: left, f(x1, x2); right, g(x1, x2) with the level set g = 0 marked]

observe

gradients of f and g in our example

[two vector-field plots over 0 ≤ x1, x2 ≤ L: left, ∇f(x1, x2); right, ∇g(x1, x2)]

in general

in ℝ^m, a constraint g(x) = 0 defines an (m − 1)-dimensional surface S

[figure: the surface S with a tangent direction t and the gradients ∇g and ∇f at a point]

if we let x, x + dt ∈ S, then g(x) = g(x + dt) = 0, and from the Taylor expansion

$$g(x + dt) = g(x) + dt^T \nabla g(x) + \ldots$$

we find that ∇g(x) ⊥ dt ⇔ ∇g(x) ⊥ S

in general

if there is a motion of x along S that increases f(x), then ∇f(x) has a component along the surface tangent t = dt / ‖dt‖

⇔ if x ∈ S is not an extremum of f, then t^T ∇f(x) ≠ 0

in general

if x* ∈ S is an extremum of f, any motion along t would decrease f; hence ∇f(x*) does not have a component along t

⇔ if x* ∈ S is an extremum of f, then t^T ∇f(x*) = 0

⇔ ∇g(x*) ∥ ∇f(x*)

⇔ ∃ λ* ∈ ℝ such that −λ* ∇g(x*) = ∇f(x*)
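for our farmer example, this condition is easy to verify numerically; a NumPy sketch using the values found above (again with L = 1):

```python
# check grad f = -lambda* grad g at the solution x* = (L/4, L/4)
import numpy as np

L, lam = 1.0, -1.0 / 8.0           # lambda* = -L/8 from the solve above
x = np.array([L / 4, L / 4])       # x*
grad_f = np.array([x[1], x[0]])    # gradient of f(x) = x1 * x2
grad_g = np.array([2.0, 2.0])      # gradient of g(x) = 2 (x1 + x2) - L
print(np.allclose(grad_f, -lam * grad_g))   # True
```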

observe

depending on f and g, there may be several solutions

[figure: a curved constraint g(x) = 0 in the (x1, x2)-plane along which several solutions occur]

observe

in general (more than one equality constraint), we consider

$$L(x, \lambda) = f(x) + \sum_i \lambda_i\, g_i(x)$$

and

$$\nabla L(x, \lambda) = \nabla f(x) + \sum_i \lambda_i\, \nabla g_i(x)$$

note

above, we were not fully honest because, in our example, we actually have

$$\begin{aligned}
\max_{x \in \mathbb{R}^2} \; & f(x) = x_1 \cdot x_2 \\
\text{s.t.} \; & g_1(x) = 2\,(x_1 + x_2) - L = 0 \\
& h_1(x) = -e_1^T x \leq 0 \\
& h_2(x) = -e_2^T x \leq 0 \\
& h_3(x) = e_1^T x - L \leq 0 \\
& h_4(x) = e_2^T x - L \leq 0
\end{aligned}$$

4 pictures say 4000 words . . .

[four panels over the (x1, x2)-plane, one per constraint, shading the half-planes h1(x) = −e1^T x ≤ 0, h2(x) = −e2^T x ≤ 0, h3(x) = e1^T x − L ≤ 0, and h4(x) = e2^T x − L ≤ 0]

question why didn’t we need these inequality constraints?

answer

because they were inactive ⇔ the solution did not reside on a boundary of the constraint region

[figure: the square 0 ≤ x1, x2 ≤ L with the solution lying in its interior]

note

general optimization problem

$$\begin{aligned}
\min_{x \in \mathbb{R}^m} \; & f(x) \\
\text{s.t.} \; & g_i(x) = 0, \quad i = 1, \ldots, p \\
& h_j(x) \leq 0, \quad j = 1, \ldots, q
\end{aligned}$$

if x* solves this problem, then ∃ λ_i*, μ_j* such that the Karush-Kuhn-Tucker conditions are met . . .

KKT conditions

1) stationarity

$$\nabla f(x^*) + \sum_i \lambda_i^* \nabla g_i(x^*) + \sum_j \mu_j^* \nabla h_j(x^*) = 0$$

2) primal feasibility

$$g_i(x^*) = 0 \quad \forall\, i \qquad\qquad h_j(x^*) \leq 0 \quad \forall\, j$$

3) dual feasibility

$$\mu_j^* \geq 0 \quad \forall\, j$$

4) complementary slackness

$$\mu_j^* \, h_j(x^*) = 0 \quad \forall\, j$$
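the four conditions translate directly into a numerical test; a minimal checker, sketched in Python with NumPy (the function name and tolerance are our choices, not from the lecture):

```python
# verify the four KKT conditions at a candidate point (a sketch)
import numpy as np

def kkt_satisfied(grad_f, grads_g, grads_h, g_vals, h_vals, lam, mu, tol=1e-8):
    # 1) stationarity: grad f + sum_i lam_i grad g_i + sum_j mu_j grad h_j = 0
    r = np.asarray(grad_f, dtype=float).copy()
    for l_i, gg in zip(lam, grads_g):
        r += l_i * np.asarray(gg, dtype=float)
    for m_j, gh in zip(mu, grads_h):
        r += m_j * np.asarray(gh, dtype=float)
    stationary = np.allclose(r, 0.0, atol=tol)
    # 2) primal feasibility: g_i = 0 and h_j <= 0
    primal = np.allclose(g_vals, 0.0, atol=tol) and np.all(np.asarray(h_vals) <= tol)
    # 3) dual feasibility: mu_j >= 0
    dual = np.all(np.asarray(mu) >= -tol)
    # 4) complementary slackness: mu_j * h_j = 0
    slack = np.allclose(np.asarray(mu) * np.asarray(h_vals), 0.0, atol=tol)
    return stationary and primal and dual and slack
```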

example: solving an inequality constrained problem

let us consider

$$\begin{aligned}
\min_{x \in \mathbb{R}^2} \; & f(x) = \bigl(x_1 - 7\bigr)^2 + \bigl(x_2 - 3\bigr)^2 \\
\text{s.t.} \; & h_1(x) = 2\,x_1 + x_2 \leq 7 \\
& h_2(x) = x_1 + 3\,x_2 \leq 18
\end{aligned}$$

[figure: the feasible region in the (x1, x2)-plane bounded by h1(x) ≤ 7 and h2(x) ≤ 18]

example (cont.)

from KKT condition 1), we have

$$\begin{pmatrix} \frac{\partial f}{\partial x_1} \\[2pt] \frac{\partial f}{\partial x_2} \end{pmatrix} + \mu_1 \begin{pmatrix} \frac{\partial h_1}{\partial x_1} \\[2pt] \frac{\partial h_1}{\partial x_2} \end{pmatrix} + \mu_2 \begin{pmatrix} \frac{\partial h_2}{\partial x_1} \\[2pt] \frac{\partial h_2}{\partial x_2} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

that is

$$\begin{aligned}
2\,x_1 + 0\,x_2 + 2\,\mu_1 + 1\,\mu_2 &= 14 \\
0\,x_1 + 2\,x_2 + 1\,\mu_1 + 3\,\mu_2 &= 6
\end{aligned}$$

and, assuming that both constraints are active, we also consider

$$\begin{aligned}
2\,x_1 + 1\,x_2 &= 7 \\
1\,x_1 + 3\,x_2 &= 18
\end{aligned}$$

and thus obtain 4 equations for 4 unknowns

example (cont.)

solving the matrix / vector equation

$$\begin{pmatrix} 2 & 0 & 2 & 1 \\ 0 & 2 & 1 & 3 \\ 2 & 1 & 0 & 0 \\ 1 & 3 & 0 & 0 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \mu_1 \\ \mu_2 \end{pmatrix} = \begin{pmatrix} 14 \\ 6 \\ 7 \\ 18 \end{pmatrix}$$

yields

$$\begin{pmatrix} x_1 \\ x_2 \\ \mu_1 \\ \mu_2 \end{pmatrix} = \begin{pmatrix} 0.6 \\ 5.8 \\ 8.8 \\ -4.8 \end{pmatrix}$$

which violates KKT condition 3)

example (cont.)

we therefore inactivate the second constraint, i.e. we set μ2 = 0, and repeat the exercise

solving the resulting matrix / vector equation

$$\begin{pmatrix} 2 & 0 & 2 \\ 0 & 2 & 1 \\ 2 & 1 & 0 \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \mu_1 \end{pmatrix} = \begin{pmatrix} 14 \\ 6 \\ 7 \end{pmatrix}$$

yields

$$\begin{pmatrix} x_1 \\ x_2 \\ \mu_1 \end{pmatrix} = \begin{pmatrix} 3 \\ 1 \\ 4 \end{pmatrix}$$

which is a feasible solution
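both solves are reproduced by the following NumPy sketch:

```python
# first attempt: both constraints assumed active
import numpy as np

A = np.array([[2., 0., 2., 1.],
              [0., 2., 1., 3.],
              [2., 1., 0., 0.],
              [1., 3., 0., 0.]])
b = np.array([14., 6., 7., 18.])
print(np.linalg.solve(A, b))    # [ 0.6  5.8  8.8 -4.8] -> mu2 < 0 violates KKT 3)

# second attempt: constraint 2 inactivated (mu2 = 0)
A2 = np.array([[2., 0., 2.],
               [0., 2., 1.],
               [2., 1., 0.]])
b2 = np.array([14., 6., 7.])
print(np.linalg.solve(A2, b2))  # [3. 1. 4.] -> feasible
```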

solution

[figure: the feasible region with the minimizer x* = (3, 1) marked]

Lagrange duality

once more . . .

to solve the problem

$$\begin{aligned}
\min_x \; & f(x) \\
\text{s.t.} \; & g_i(x) = 0, \quad i = 1, \ldots, p \\
& h_j(x) \leq 0, \quad j = 1, \ldots, q
\end{aligned}$$

we work with the Lagrangian

$$L\bigl(x, \lambda, \mu\bigr) = f(x) + \sum_i \lambda_i\, g_i(x) + \sum_j \mu_j\, h_j(x)$$

that is, with a linear combination of different functions

note

if x̃ is feasible, then

$$L\bigl(\tilde{x}, \lambda, \mu\bigr) = f(\tilde{x}) + \sum_i \lambda_i\, g_i(\tilde{x}) + \sum_j \mu_j\, h_j(\tilde{x}) \;\leq\; f(\tilde{x})$$

because

$$g_i(\tilde{x}) = 0, \qquad h_j(\tilde{x}) \leq 0, \qquad \mu_j \geq 0$$

note

in particular, if x* is an optimal feasible solution, then

$$L\bigl(x^*, \lambda, \mu\bigr) = f(x^*) + \sum_i \lambda_i\, g_i(x^*) + \sum_j \mu_j\, h_j(x^*) \;\leq\; f(x^*) = f^*$$

note that L(x*, λ, μ) is a function of λ and μ

Lagrange dual

the function D : ℝ^p × ℝ^q → ℝ

$$D\bigl(\lambda, \mu\bigr) = \inf_x L\bigl(x, \lambda, \mu\bigr) = \inf_x \Bigl[ f(x) + \sum_i \lambda_i\, g_i(x) + \sum_j \mu_j\, h_j(x) \Bigr]$$

is called the Lagrangian dual function

it is concave and, for μ ≥ 0, we have D(λ, μ) ≤ f*

duality

$$\begin{aligned}
\min_x \; & f(x) \\
\text{s.t.} \; & g_i(x) = 0 \\
& h_j(x) \leq 0
\end{aligned}
\qquad \Longleftrightarrow \qquad
\max_{\lambda,\, \mu \geq 0} \; D\bigl(\lambda, \mu\bigr)$$
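for the inequality constrained example above, D(μ) can be written in closed form (the inner minimizer follows from ∇ₓL = 0); a plain Python sketch illustrating D(λ, μ) ≤ f*:

```python
# weak duality for min (x1-7)^2 + (x2-3)^2 s.t. h1 <= 7, h2 <= 18:
# for any mu >= 0 we get D(mu) <= f* = f(3, 1) = 20
def D(mu1, mu2):
    # unconstrained minimizer of L(x, mu), obtained from grad_x L = 0
    x1 = 7.0 - mu1 - mu2 / 2.0
    x2 = 3.0 - mu1 / 2.0 - 3.0 * mu2 / 2.0
    f = (x1 - 7.0) ** 2 + (x2 - 3.0) ** 2
    return f + mu1 * (2.0 * x1 + x2 - 7.0) + mu2 * (x1 + 3.0 * x2 - 18.0)

print(D(4.0, 0.0))   # 20.0 -- attained at the optimal multipliers (4, 0)
print(D(1.0, 1.0))   # 1.75 -- strictly below f* = 20
```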

example: least squares

note that

$$\min_w \bigl\| Xw - y \bigr\|^2$$

is the same as

$$\min_w \; w^T w \quad \text{s.t.} \quad Xw - y = 0$$

$$\Rightarrow \quad L\bigl(w, \lambda\bigr) = w^T w + \lambda^T \bigl( Xw - y \bigr)$$

example (cont.)

to obtain

$$D\bigl(\lambda\bigr) = \inf_w L\bigl(w, \lambda\bigr)$$

we consider

$$\frac{\partial L}{\partial w} = \frac{\partial}{\partial w} \Bigl[ w^T w + \lambda^T \bigl( Xw - y \bigr) \Bigr] = 2\,w + X^T \lambda \stackrel{!}{=} 0$$

which yields

$$w = -\tfrac{1}{2}\, X^T \lambda$$

example (cont.)

plugging this into L(w, λ), we obtain

$$D\bigl(\lambda\bigr) = \tfrac{1}{4}\, \lambda^T X X^T \lambda - \tfrac{1}{2}\, \lambda^T X X^T \lambda - \lambda^T y = -\tfrac{1}{4}\, \lambda^T X X^T \lambda - \lambda^T y$$

example (cont.)

then, considering

$$\frac{\partial D}{\partial \lambda} = \frac{\partial}{\partial \lambda} \Bigl[ -\tfrac{1}{4}\, \lambda^T X X^T \lambda - \lambda^T y \Bigr] = -\tfrac{1}{2}\, X X^T \lambda - y \stackrel{!}{=} 0$$

we find

$$\lambda = -2\, \bigl( X X^T \bigr)^{-1} y$$

example (cont.)

plugging this back into L(w, λ) yields

$$L(w) = w^T w - 2\, y^T \bigl( X X^T \bigr)^{-1} X w + 2\, y^T \bigl( X X^T \bigr)^{-1} y$$

and

$$\frac{\partial L}{\partial w} = 2\,w - 2\, X^T \bigl( X X^T \bigr)^{-1} y \stackrel{!}{=} 0$$

establishes

$$w = X^T \bigl( X X^T \bigr)^{-1} y$$

wait . . . what?

note

earlier, we found

$$w = \bigl( X^T X \bigr)^{-1} X^T y$$

now, we have

$$w = X^T \bigl( X X^T \bigr)^{-1} y$$

observe

$$\bigl( X^T X \bigr)^{-1} X^T \, X X^T = X^T \qquad \text{and} \qquad X^T \bigl( X X^T \bigr)^{-1} X X^T = X^T$$

⇔ (XᵀX)⁻¹ Xᵀ = Xᵀ (XXᵀ)⁻¹, i.e. both expressions for w agree whenever the respective inverses exist
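numerically, the two forms indeed coincide; a NumPy sketch (note that (XᵀX)⁻¹ and (XXᵀ)⁻¹ both exist only if X is square and of full rank, so we test such an X):

```python
# compare primal and dual least squares solutions on a square full-rank X
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 5))
y = rng.standard_normal(5)

w_primal = np.linalg.solve(X.T @ X, X.T @ y)   # (X^T X)^{-1} X^T y
w_dual   = X.T @ np.linalg.solve(X @ X.T, y)   # X^T (X X^T)^{-1} y
print(np.allclose(w_primal, w_dual))           # True
```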

note

X ∈ ℝ^{n×m} and, typically, n ≫ m (n = # examples, m = dimensionality)

therefore

$$w = \underbrace{\bigl( X^T X \bigr)^{-1}}_{\in\, \mathbb{R}^{m \times m}} X^T y \qquad \text{is cheap}$$

$$w = X^T \underbrace{\bigl( X X^T \bigr)^{-1}}_{\in\, \mathbb{R}^{n \times n}} y \qquad \text{is expensive}$$

question

if the dual solution

$$w = X^T \bigl( X X^T \bigr)^{-1} y$$

is more expensive than the primal solution

$$w = \bigl( X^T X \bigr)^{-1} X^T y$$

then why would we ever bother with the dual?

answer will be given later . . .

summary

we now know about

constrained optimization
the notion of the Lagrangian
the notion of the Lagrangian dual
the intriguing phenomenon of strong duality . . .