Pattern Recognition Prof. Christian Bauckhage
outline lecture 17
constrained optimization
Lagrange multipliers
Lagrange duality
summary
before we begin . . .
$$\min_x f(x) \;\Leftrightarrow\; \max_x -f(x)$$

$$x \leq y \;\Leftrightarrow\; -x \geq -y$$

$$x \geq y \;\Leftrightarrow\; x_i \geq y_i \;\; \forall\, i$$
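these identities are easy to check numerically; a minimal sketch (the grid and the function are arbitrary choices for illustration):

```python
import numpy as np

# sample a simple function on a grid; the minimizer of f is the maximizer of -f
x = np.linspace(-3.0, 3.0, 601)
f = (x - 1.0) ** 2

assert np.argmin(f) == np.argmax(-f)      # same optimizer
assert np.isclose(f.min(), -(-f).max())   # min f = -max(-f)
```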
constrained optimization
a problem from high-school . . .
A farmer has bought a coil of barbed wire of length L and wants to fence off a rectangular area A of land of maximum size. What side lengths $0 \leq x_1, x_2 \leq L$ should he choose for the rectangle?
observe

algebraic preparation

$$A = A(x_1, x_2) = x_1 \cdot x_2$$
$$L = L(x_1, x_2) = 2\,x_1 + 2\,x_2$$

$$\Rightarrow \quad x_1 = \frac{L}{2} - x_2$$

$$\Rightarrow \quad A = \left(\frac{L}{2} - x_2\right) \cdot x_2 = \frac{L\,x_2}{2} - x_2^2$$

solution

$$\frac{dA}{dx_2} = \frac{L}{2} - 2\,x_2 \overset{!}{=} 0$$

$$\Rightarrow \quad x_2 = \frac{L}{4} \qquad \Rightarrow \quad x_1 = \frac{L}{2} - x_2 = \frac{L}{4}$$
note

in general, expressing $x_i$ as a function of $x_j$ may break symmetry

in general, it might not even be possible to express $x_i$ as a function of $x_j$

⇒ we need a "better", that is, more general approach . . .
note

the above problem is a constrained optimization problem; in particular, it is an instance of the following general type

$$\max_{x \in \mathbb{R}^m} f(x) \quad \text{s.t.} \quad x \in S$$

where $S \subseteq \mathbb{R}^m$ is called the feasible set of the constrained problem

$x^* \in S$ is a (local) solution if $f(x^*) \geq f(x)$ for all $x \in \mathcal{B}_\epsilon(x^*) \cap S$
observe

in our example, we have

$x \in \mathbb{R}^2$
$f(x) = x_1 \cdot x_2$
$S = \{\, x \mid 2\,x_1 + 2\,x_2 = L \,\}$
note

in practice, the feasible set is (almost always) characterized in terms of constraint equations

$$\max_{x \in \mathbb{R}^m} f(x)$$
s.t. $g_i(x) = 0,\; i = 1, \ldots, p$ (equality constraints)
$h_j(x) \leq 0,\; j = 1, \ldots, q$ (inequality constraints)
observe

in our example, we have

$x \in \mathbb{R}^2$
$f(x) = x_1 \cdot x_2$
$g(x) = 2\,x_1 + 2\,x_2 - L$

⇔ no inequality constraints and only a single equality constraint
Lagrange multipliers
note
problems with equality constraints can be solved using the method of Lagrange multipliers
[portrait: Joseph-L. Lagrange (∗1736, †1813)]
observe

in our example, we would consider the following Lagrangian

$$\mathcal{L}(x, \lambda) = f(x) + \lambda\, g(x)$$

with only one Lagrange multiplier $\lambda$
note

to solve an equality constrained problem, we consider

$$\nabla \mathcal{L}(x, \lambda) = \nabla f(x) + \lambda\, \nabla g(x) \overset{!}{=} 0$$

because, if the problem has a solution $x^*$, there exists a Lagrange multiplier $\lambda^*$ such that

$$\nabla f(x^*) = -\lambda^* \nabla g(x^*)$$
observe

in our example, we have

$$\frac{\partial \mathcal{L}}{\partial x_1} = \frac{\partial f}{\partial x_1} + \lambda \frac{\partial g}{\partial x_1} = x_2 + 2\lambda \overset{!}{=} 0$$
$$\frac{\partial \mathcal{L}}{\partial x_2} = \frac{\partial f}{\partial x_2} + \lambda \frac{\partial g}{\partial x_2} = x_1 + 2\lambda \overset{!}{=} 0$$
$$\frac{\partial \mathcal{L}}{\partial \lambda} = g(x_1, x_2) = 2\,x_1 + 2\,x_2 - L \overset{!}{=} 0$$
observe

in matrix / vector form

$$
\begin{pmatrix} 0 & 1 & 2 \\ 1 & 0 & 2 \\ 2 & 2 & 0 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \lambda \end{pmatrix}
=
\begin{pmatrix} 0 \\ 0 \\ L \end{pmatrix}
$$

so that

$$
\begin{pmatrix} x_1^* \\ x_2^* \\ \lambda^* \end{pmatrix}
=
\begin{pmatrix} -\frac{1}{2} & \frac{1}{2} & \frac{1}{4} \\ \frac{1}{2} & -\frac{1}{2} & \frac{1}{4} \\ \frac{1}{4} & \frac{1}{4} & -\frac{1}{8} \end{pmatrix}
\begin{pmatrix} 0 \\ 0 \\ L \end{pmatrix}
=
\begin{pmatrix} \frac{L}{4} \\ \frac{L}{4} \\ -\frac{L}{8} \end{pmatrix}
$$
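the same system can be solved numerically (with L = 1 for concreteness):

```python
import numpy as np

L = 1.0
M = np.array([[0., 1., 2.],
              [1., 0., 2.],
              [2., 2., 0.]])
b = np.array([0., 0., L])

x1, x2, lam = np.linalg.solve(M, b)
print(x1, x2, lam)   # -> 0.25, 0.25, -0.125  i.e.  L/4, L/4, -L/8
```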
question
why does this work?

answer
let's see . . .
observe

f and g in our example

[figure: two plots over $0 \leq x_1, x_2 \leq L$, one of $f(x_1, x_2)$ and one of $g(x_1, x_2)$; the line $g = 0$ is marked]
observe

gradients of f and g in our example

[figure: two vector-field plots over $0 \leq x_1, x_2 \leq L$, one of $\nabla f(x_1, x_2)$ and one of $\nabla g(x_1, x_2)$]
in general

in $\mathbb{R}^m$, a constraint $g(x) = 0$ defines an $(m-1)$-dimensional surface $S$

if we let $x, x + dt \in S$, then $g(x) = g(x + dt) = 0$

and from the Taylor expansion $g(x + dt) = g(x) + dt^T \nabla g(x) + \ldots$ we find that

$$\nabla g(x) \perp dt \quad \Leftrightarrow \quad \nabla g(x) \perp S$$

[figure: surface $S$ with tangent direction $t$, normal $\nabla g$, and gradient $\nabla f$]
in general

if there is a motion of $x$ along $S$ that increases $f(x)$, then $\nabla f(x)$ has a component along the surface tangent $t = \frac{dt}{\|dt\|}$

⇔ if $x \in S$ is not an extremum of $f$, then $t^T \nabla f(x) \neq 0$
in general

if $x^* \in S$ is an extremum of $f$, any motion along $t$ would decrease $f$; hence $\nabla f(x^*)$ does not have a component along $t$

⇔ if $x^* \in S$ is an extremum of $f$, then $t^T \nabla f(x^*) = 0$

⇔ $\nabla g(x^*) \parallel \nabla f(x^*)$

⇔ $\exists\, \lambda^* \in \mathbb{R}$ such that $-\lambda^* \nabla g(x^*) = \nabla f(x^*)$
observe

depending on $f$ and $g$, there may be several solutions

[figure: contour plot with several points on the curve $g(x) = 0$ at which $\nabla f \parallel \nabla g$]
observe

in general (more than one equality constraint), we consider

$$\mathcal{L}(x, \lambda) = f(x) + \sum_i \lambda_i\, g_i(x)$$

and

$$\nabla \mathcal{L}(x, \lambda) = \nabla f(x) + \sum_i \lambda_i\, \nabla g_i(x)$$
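such problems can also be handed to a black-box solver; a minimal scipy sketch for our rectangle example (L = 1 and the starting point x0 are arbitrary choices, not from the lecture):

```python
import numpy as np
from scipy.optimize import minimize

L = 1.0

# maximize f(x) = x1 * x2  <=>  minimize -f(x)  (cf. "before we begin")
neg_f = lambda x: -x[0] * x[1]
g = lambda x: 2 * x[0] + 2 * x[1] - L   # equality constraint g(x) = 0

res = minimize(neg_f, x0=np.array([0.1, 0.4]),
               method="SLSQP",
               constraints=[{"type": "eq", "fun": g}])
print(res.x)   # -> approx. [0.25, 0.25] = [L/4, L/4]
```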
note

above, we were not fully honest because, in our example, we actually have

$$\max_{x \in \mathbb{R}^2} f(x) = x_1 \cdot x_2$$
s.t. $g_1(x) = 2\,x_1 + 2\,x_2 - L = 0$
$h_1(x) = -e_1^T x \leq 0$
$h_2(x) = -e_2^T x \leq 0$
$h_3(x) = e_1^T x - L \leq 0$
$h_4(x) = e_2^T x - L \leq 0$
4 pictures say 4000 words . . .

[four figures: the half-planes $h_1(x) = -e_1^T x \leq 0$, $h_2(x) = -e_2^T x \leq 0$, $h_3(x) = e_1^T x - L \leq 0$, and $h_4(x) = e_2^T x - L \leq 0$ within the square $0 \leq x_1, x_2 \leq L$]
question
why didn't we need these inequality constraints?

answer
because they were inactive ⇔ the solution did not reside on a boundary of the constraint region

[figure: the constraint region with the solution strictly inside the box $0 \leq x_1, x_2 \leq L$]
note

general optimization problem

$$\min_{x \in \mathbb{R}^m} f(x)$$
s.t. $g_i(x) = 0,\; i = 1, \ldots, p$
$h_j(x) \leq 0,\; j = 1, \ldots, q$

if $x^*$ solves this problem, then $\exists\, \lambda_i^*, \mu_j^*$ such that the Karush-Kuhn-Tucker conditions are met . . .
KKT conditions

1) stationarity

$$\nabla f(x^*) + \sum_i \lambda_i^* \nabla g_i(x^*) + \sum_j \mu_j^* \nabla h_j(x^*) = 0$$

2) primal feasibility

$$g_i(x^*) = 0 \;\; \forall\, i \qquad \qquad h_j(x^*) \leq 0 \;\; \forall\, j$$

3) dual feasibility

$$\mu_j^* \geq 0 \;\; \forall\, j$$

4) complementary slackness

$$\mu_j^* \, h_j(x^*) = 0 \;\; \forall\, j$$
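as a sketch of how one might verify these conditions numerically: the helper below (the name kkt_satisfied and the argument layout are our own, not lecture notation) checks all four conditions at a candidate point, illustrated with the solution of the example that follows:

```python
import numpy as np

def kkt_satisfied(grad_f, grads_g, grads_h, g_vals, h_vals, lam, mu, tol=1e-8):
    """Check the four KKT conditions at a candidate point x*.

    grad_f          -- gradient of f at x*, shape (m,)
    grads_g/grads_h -- gradients of the g_i / h_j at x*, shapes (p, m) / (q, m)
    g_vals/h_vals   -- constraint values at x*, shapes (p,) / (q,)
    lam, mu         -- candidate multipliers, shapes (p,) / (q,)
    """
    stationary = np.allclose(grad_f + grads_g.T @ lam + grads_h.T @ mu, 0, atol=tol)
    primal = np.allclose(g_vals, 0, atol=tol) and np.all(h_vals <= tol)
    dual = np.all(mu >= -tol)
    slack = np.allclose(mu * h_vals, 0, atol=tol)
    return stationary and primal and dual and slack

# e.g. for the inequality constrained example below: x* = (3, 1), mu* = (4, 0)
ok = kkt_satisfied(
    grad_f=np.array([-8., -4.]),             # grad f at x* for f = (x1-7)^2 + (x2-3)^2
    grads_g=np.zeros((0, 2)), g_vals=np.zeros(0), lam=np.zeros(0),
    grads_h=np.array([[2., 1.], [1., 3.]]),  # grad h1, grad h2
    h_vals=np.array([0., -12.]),             # constraints in standard form h(x) <= 0
    mu=np.array([4., 0.]))
print(ok)   # -> True
```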
example: solving an inequality constrained problem

let us consider

$$\min_{x \in \mathbb{R}^2} f(x) = \bigl(x_1 - 7\bigr)^2 + \bigl(x_2 - 3\bigr)^2$$
s.t. $h_1(x) = 2\,x_1 + x_2 \leq 7$
$h_2(x) = x_1 + 3\,x_2 \leq 18$

[figure: contours of $f$ together with the half-planes $h_1(x) \leq 7$ and $h_2(x) \leq 18$]
example (cont.)

from KKT condition 1), we have

$$
\begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \end{bmatrix}
+ \mu_1 \begin{bmatrix} \frac{\partial h_1}{\partial x_1} \\ \frac{\partial h_1}{\partial x_2} \end{bmatrix}
+ \mu_2 \begin{bmatrix} \frac{\partial h_2}{\partial x_1} \\ \frac{\partial h_2}{\partial x_2} \end{bmatrix}
= \begin{bmatrix} 0 \\ 0 \end{bmatrix}
$$

that is

$$2\,x_1 + 0\,x_2 + 2\,\mu_1 + 1\,\mu_2 = 14$$
$$0\,x_1 + 2\,x_2 + 1\,\mu_1 + 3\,\mu_2 = 6$$

and, assuming that both constraints are active, we also consider

$$2\,x_1 + 1\,x_2 = 7$$
$$1\,x_1 + 3\,x_2 = 18$$

and thus obtain 4 equations for 4 unknowns
example (cont.)

solving the matrix / vector equation

$$
\begin{pmatrix} 2 & 0 & 2 & 1 \\ 0 & 2 & 1 & 3 \\ 2 & 1 & 0 & 0 \\ 1 & 3 & 0 & 0 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \mu_1 \\ \mu_2 \end{pmatrix}
=
\begin{pmatrix} 14 \\ 6 \\ 7 \\ 18 \end{pmatrix}
$$

yields

$$
\begin{pmatrix} x_1 \\ x_2 \\ \mu_1 \\ \mu_2 \end{pmatrix}
=
\begin{pmatrix} 0.6 \\ 5.8 \\ 8.8 \\ -4.8 \end{pmatrix}
$$

which violates KKT condition 3)
example (cont.)

we therefore inactivate the second constraint, i.e. we set $\mu_2 = 0$, and repeat the exercise

solving the resulting matrix / vector equation

$$
\begin{pmatrix} 2 & 0 & 2 \\ 0 & 2 & 1 \\ 2 & 1 & 0 \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \mu_1 \end{pmatrix}
=
\begin{pmatrix} 14 \\ 6 \\ 7 \end{pmatrix}
$$

yields

$$
\begin{pmatrix} x_1 \\ x_2 \\ \mu_1 \end{pmatrix}
=
\begin{pmatrix} 3 \\ 1 \\ 4 \end{pmatrix}
$$

which is a feasible solution
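both steps can be reproduced numerically:

```python
import numpy as np

# first attempt: both constraints assumed active
A1 = np.array([[2., 0., 2., 1.],
               [0., 2., 1., 3.],
               [2., 1., 0., 0.],
               [1., 3., 0., 0.]])
print(np.linalg.solve(A1, np.array([14., 6., 7., 18.])))
# -> [ 0.6  5.8  8.8 -4.8]   (mu2 < 0 violates dual feasibility)

# second attempt: constraint 2 inactive, i.e. mu2 = 0
A2 = np.array([[2., 0., 2.],
               [0., 2., 1.],
               [2., 1., 0.]])
print(np.linalg.solve(A2, np.array([14., 6., 7.])))
# -> [3. 1. 4.]
```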
solution

[figure: contours of $f$ with the solution $x^* = (3, 1)^T$ on the boundary of the active constraint $h_1$]
Lagrange duality
once more . . .

to solve the problem

$$\min_x f(x)$$
s.t. $g_i(x) = 0,\; i = 1, \ldots, p$
$h_j(x) \leq 0,\; j = 1, \ldots, q$

we work with the Lagrangian

$$\mathcal{L}(x, \lambda, \mu) = f(x) + \sum_i \lambda_i\, g_i(x) + \sum_j \mu_j\, h_j(x)$$

that is, with a linear combination of different functions
note

if $\tilde{x}$ is feasible, then

$$\mathcal{L}(\tilde{x}, \lambda, \mu) = f(\tilde{x}) + \sum_i \lambda_i\, g_i(\tilde{x}) + \sum_j \mu_j\, h_j(\tilde{x}) \leq f(\tilde{x})$$

because $g_i(\tilde{x}) = 0$, $h_j(\tilde{x}) \leq 0$, and $\mu_j \geq 0$
note

in particular, if $x^*$ is an optimal feasible solution, then

$$\mathcal{L}(x^*, \lambda, \mu) = f(x^*) + \sum_i \lambda_i\, g_i(x^*) + \sum_j \mu_j\, h_j(x^*) \leq f(x^*) = f^*$$

note that $\mathcal{L}(x^*, \lambda, \mu)$ is a function of $\lambda$ and $\mu$
Lagrange dual

the function $D : \mathbb{R}^p \times \mathbb{R}^q \to \mathbb{R}$

$$D(\lambda, \mu) = \inf_x \mathcal{L}(x, \lambda, \mu) = \inf_x \Bigl[ f(x) + \sum_i \lambda_i\, g_i(x) + \sum_j \mu_j\, h_j(x) \Bigr]$$

is called the Lagrangian dual function

it is concave and, for $\mu \geq 0$, we have $D(\lambda, \mu) \leq f^*$
duality

$$\min_x f(x) \quad \text{s.t.} \quad g_i(x) = 0, \; h_j(x) \leq 0 \qquad \Leftrightarrow \qquad \max_{\lambda,\, \mu \geq 0} D(\lambda, \mu)$$
example: least squares

note that

$$\min_w \|Xw - y\|^2$$

is the same as

$$\min_w w^T w \quad \text{s.t.} \quad Xw - y = 0$$

$$\Rightarrow \quad \mathcal{L}(w, \lambda) = w^T w + \lambda^T \bigl(Xw - y\bigr)$$
example (cont.)

to obtain $D(\lambda) = \inf_w \mathcal{L}(w, \lambda)$, we consider

$$\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial}{\partial w} \Bigl[ w^T w + \lambda^T \bigl(Xw - y\bigr) \Bigr] = 2\,w + X^T \lambda \overset{!}{=} 0$$

which yields $w = -\tfrac{1}{2} X^T \lambda$
example (cont.)

plugging this into $\mathcal{L}(w, \lambda)$, we obtain

$$D(\lambda) = \tfrac{1}{4} \lambda^T X X^T \lambda - \tfrac{1}{2} \lambda^T X X^T \lambda - \lambda^T y = -\tfrac{1}{4} \lambda^T X X^T \lambda - \lambda^T y$$
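since any $\lambda$ yields a lower bound on $f^*$, weak duality is easy to verify numerically; a minimal numpy check (random data with $m > n$ so that $X X^T$ is invertible; dimensions and the seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 20                         # wide X: full row rank, X X^T invertible
X = rng.standard_normal((n, m))
y = rng.standard_normal(n)
K = X @ X.T

w_star = np.linalg.lstsq(X, y, rcond=None)[0]   # minimum-norm solution of Xw = y
f_star = w_star @ w_star                        # primal optimum f*

D = lambda lam: -0.25 * lam @ K @ lam - lam @ y
for _ in range(5):
    lam = rng.standard_normal(n)
    assert D(lam) <= f_star + 1e-12             # weak duality: D(lambda) <= f*
```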
example (cont.)

then, considering

$$\frac{\partial D}{\partial \lambda} = \frac{\partial}{\partial \lambda} \Bigl[ -\tfrac{1}{4} \lambda^T X X^T \lambda - \lambda^T y \Bigr] = -\tfrac{1}{2} X X^T \lambda - y \overset{!}{=} 0$$

we find $\lambda = -2 \bigl(X X^T\bigr)^{-1} y$
example (cont.)

plugging this back into $\mathcal{L}(w, \lambda)$ yields

$$\mathcal{L}(w) = w^T w - 2\, y^T \bigl(X X^T\bigr)^{-T} X w + 2\, y^T \bigl(X X^T\bigr)^{-T} y$$

and

$$\frac{\partial \mathcal{L}}{\partial w} = 2\,w - 2\, X^T \bigl(X X^T\bigr)^{-1} y \overset{!}{=} 0$$

establishes

$$w = X^T \bigl(X X^T\bigr)^{-1} y$$
wait . . . what?
note

earlier, we found $w = \bigl(X^T X\bigr)^{-1} X^T y$

now, we have $w = X^T \bigl(X X^T\bigr)^{-1} y$

observe

$$\bigl(X^T X\bigr)^{-1} X^T \cdot X X^T = X^T$$
$$X^T \bigl(X X^T\bigr)^{-1} \cdot X X^T = X^T$$

⇔ multiplied by $X X^T$ from the right, both expressions yield $X^T$, so both play the same role
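a quick numpy check that, for full row rank $X$, the dual formula reproduces the minimum-norm (pseudo-inverse) solution (random data; sizes and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 4, 10                      # m > n: X X^T is invertible here
X = rng.standard_normal((n, m))
y = rng.standard_normal(n)

w_dual = X.T @ np.linalg.solve(X @ X.T, y)
assert np.allclose(w_dual, np.linalg.pinv(X) @ y)   # = Moore-Penrose solution
assert np.allclose(X @ w_dual, y)                   # solves Xw = y exactly
```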
note

$X \in \mathbb{R}^{n \times m}$ and, typically, $n \gg m$ ($n$ = # examples, $m$ = dimensionality)

therefore

$$w = \underbrace{\bigl(X^T X\bigr)^{-1}}_{\mathbb{R}^{m \times m}\; \text{(cheap)}} X^T y \qquad \qquad w = X^T \underbrace{\bigl(X X^T\bigr)^{-1}}_{\mathbb{R}^{n \times n}\; \text{(expensive)}} y$$
question
if the dual solution $w = X^T \bigl(X X^T\bigr)^{-1} y$ is more expensive than the primal solution $w = \bigl(X^T X\bigr)^{-1} X^T y$, then why would we ever bother with the dual?

answer
will be given later . . .
summary
we now know about

constrained optimization
the notion of the Lagrangian
the notion of the Lagrangian dual
the intriguing phenomenon of strong duality . . .