Pattern Recognition

Pattern Recognition Prof. Christian Bauckhage

outline lecture 13

recap
data clustering
k-means clustering
Lloyd’s algorithm
Hartigan’s algorithm
MacQueen’s algorithm
GMMs and k-means
soft k-means
summary

remember . . .

nearest neighbors, space partitioning, Voronoi cells . . .

questions

can we combine individual Voronoi cells $V(x_i)$ into larger ones?

how to choose / determine the centers of “super” Voronoi cells automatically?

observe these are questions related to data clustering

data clustering

clustering

⇔ given a finite data set X, automatically identify latent structures or groups of similar data points within X

note

in general, clustering is an ill-posed problem

note

from now on, we shall assume that $X = \{ x_1, \ldots, x_n \} \subset \mathbb{R}^m$

note

there are many different clustering philosophies

relational clustering (e.g. spectral clustering, graph cuts, . . . )
hierarchical clustering (e.g. divisive or agglomerative linkage methods)
density-based clustering (e.g. DBSCAN, . . . )
prototype-based clustering (e.g. k-means, LBG, . . . )

note

differences between clustering algorithms are often fuzzy rather than clear-cut and basically boil down to how they answer two questions


Q1: what defines similarity?

Q2: what properties should a cluster have?

k-means clustering


k-means clustering

is the “most popular” algorithm for vector quantization

determines $k \ll n$ clusters $C_i$ and answers Q2 as follows

$$C_i \subset X, \qquad C_i \cap C_j = \emptyset \;\; \forall\, i \neq j, \qquad C_1 \cup C_2 \cup \ldots \cup C_k = X$$

considers cluster centroids $\mu_i$ to answer Q1 as follows

$$C_i = \bigl\{ x \in X \;\big|\; \| x - \mu_i \|^2 \leq \| x - \mu_l \|^2 \;\; \forall\, l \bigr\}$$

observe

each cluster therefore corresponds to a Voronoi cell in $\mathbb{R}^m$

$$C_i = V(\mu_i)$$

the problem at the heart of k-means clustering is thus to determine $k$ distinct suitable cluster centroids $\mu_1, \ldots, \mu_k$

this can be done by minimizing an objective function

$$E\bigl(C_1, C_2, \ldots, C_k\bigr) = E(k)$$

k-means objective function

various equivalent formulations possible, in particular

$$E(k) = \sum_{i=1}^{k} \sum_{x_j \in C_i} \| x_j - \mu_i \|^2 \tag{1}$$

$$E(k) = \sum_{i=1}^{k} \sum_{j=1}^{n} z_{ij}\, \| x_j - \mu_i \|^2 \tag{2}$$

where

$$z_{ij} = \begin{cases} 1, & \text{if } x_j \in C_i \\ 0, & \text{otherwise} \end{cases} \tag{3}$$
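a minimal NumPy sketch of how the objective in (2) can be evaluated; the array layout and the name kmeans_objective are illustrative choices, not part of the lecture

```python
import numpy as np

def kmeans_objective(X, centroids, labels):
    """Evaluate E(k) as in (2): the sum of squared distances of every
    point x_j to the centroid mu_i of the cluster it is assigned to.

    X         : (n, m) array of data points
    centroids : (k, m) array of cluster centroids mu_1, ..., mu_k
    labels    : (n,)  array with labels[j] = i  <=>  z_ij = 1
    """
    diffs = X - centroids[labels]        # x_j - mu_i for the assigned cluster i
    return float(np.sum(diffs ** 2))     # sum_ij z_ij ||x_j - mu_i||^2
```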

note

the problem of solving

$$\operatorname*{argmin}_{\mu_1, \ldots, \mu_k} E(k)$$

using either (1) or (2) looks innocent but is actually NP-hard

E(k) has numerous local minima and there is no algorithm known today that is guaranteed to find the optimal solution

⇔ any available algorithm for k-means clustering is a heuristic

Lloyd’s algorithm

set $t = 0$ and initialize $\mu_1^t, \mu_2^t, \ldots, \mu_k^t$

repeat until convergence

    update all clusters
    $$C_i^t = \bigl\{ x \in X \;\big|\; \| x - \mu_i^t \|^2 \leq \| x - \mu_l^t \|^2 \;\; \forall\, l \bigr\} \tag{4}$$

    update all cluster means
    $$\mu_i^{t+1} = \frac{1}{|C_i^t|} \sum_{x \in C_i^t} x \tag{5}$$

    increase iteration counter $t = t + 1$

possible convergence criteria

cluster assignments stabilize, i.e. $C_i^t \cap C_i^{t-1} = C_i^t \;\; \forall\, i$

cluster centroids stabilize, i.e. $\bigl\| \mu_i^t - \mu_i^{t-1} \bigr\| \leq \epsilon \;\; \forall\, i$

number of iterations exceeds a threshold, i.e. $t > t_{\max}$
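a compact NumPy sketch of Lloyd’s algorithm as summarized above, using the cluster update (4), the mean update (5), and centroid stabilization plus an iteration cap as stopping criteria; the initialization from k random data points and all names are illustrative choices

```python
import numpy as np

def lloyd_kmeans(X, k, t_max=100, eps=1e-6, seed=None):
    """Lloyd's algorithm: alternate cluster updates (4) and mean updates (5)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()       # initialize mu_1, ..., mu_k
    for t in range(t_max):
        # (4) assign every x_j to its nearest centroid (its Voronoi cell)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (n, k) squared distances
        labels = d2.argmin(axis=1)
        # (5) recompute each centroid as the mean of its current cluster
        new_mu = np.array([X[labels == i].mean(axis=0) if np.any(labels == i) else mu[i]
                           for i in range(k)])
        # stop once the centroids stabilize: ||mu_i^{t+1} - mu_i^t|| <= eps for all i
        if np.linalg.norm(new_mu - mu, axis=1).max() <= eps:
            mu = new_mu
            break
        mu = new_mu
    # final assignment of every point to its nearest centroid
    labels = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    return mu, labels
```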

example (figures): initialization, 1st update, · · · , final result

question is Lloyd’s algorithm guaranteed to converge?

answer

we note that
$$\mu_i = \operatorname*{argmin}_{y} \sum_{x \in C_i} \| x - y \|^2$$
so that the mean updates in (5) cannot increase E(k)

we also note that by design the cluster updates in (4) cannot increase E(k)

we therefore have $0 \leq E^{t+1}(k) \leq E^{t}(k)$ which implies that the algorithm converges to a (local) minimum

assignment

given $X = \{ x_1, \ldots, x_n \}$, prove the following important property of the sample mean
$$\mu = \operatorname*{argmin}_{x} \sum_{j} \| x_j - x \|^2 = \frac{1}{n} \sum_{j} x_j$$

note

the fact that Lloyd’s algorithm will converge says nothing about the quality of the solution

the fact that Lloyd’s algorithm will converge says nothing about the speed of convergence

in fact, it usually converges quickly but the quality of its result crucially depends on the initialization of the means $\mu_1, \mu_2, \ldots, \mu_k$

assignment

read C. Bauckhage, “Lecture Notes on Data Science: k-Means Clustering”, dx.doi.org/10.13140/RG.2.1.2829.4886

watch www.youtube.com/watch?v=5I3Ei69I40s www.youtube.com/watch?v=9nKfViAfajY

observe

there are many algorithms / heuristics for minimizing E(k)

Hartigan’s algorithm
    is much less well known than Lloyd’s algorithm
    is provably more robust than Lloyd’s algorithm
    provably converges
    converges quickly

Hartigan’s algorithm

for all $x_j \in \{ x_1, \ldots, x_n \}$, randomly assign $x_j$ to a cluster $C_i$
for all $C_i \in \{ C_1, \ldots, C_k \}$, compute $\mu_i$
repeat until converged
    converged ← True
    for all $x_j \in \{ x_1, \ldots, x_n \}$
        determine $C_i = C(x_j)$
        remove $x_j$ from $C_i$ and recompute $\mu_i$
        determine $C_w = \operatorname*{argmin}_{C_l} E\bigl(C_1, \ldots, C_l \cup \{x_j\}, \ldots, C_k\bigr)$
        if $C_w \neq C_i$, then converged ← False
        assign $x_j$ to $C_w$ and recompute $\mu_w$
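a direct (and deliberately unoptimized) NumPy sketch of Hartigan’s algorithm as given above; it re-evaluates the full objective E for every tentative move of a point, and all names are illustrative choices

```python
import numpy as np

def within_cluster_sse(X, labels, k):
    """E(C_1, ..., C_k): sum of squared distances to the respective cluster means."""
    E = 0.0
    for i in range(k):
        Ci = X[labels == i]
        if len(Ci):
            E += ((Ci - Ci.mean(axis=0)) ** 2).sum()
    return E

def hartigan_kmeans(X, k, seed=None):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(X))     # randomly assign every x_j to a cluster
    converged = False
    while not converged:
        converged = True
        for j in range(len(X)):
            current = labels[j]
            # tentatively place x_j into every cluster and pick the winner
            # C_w = argmin_{C_l} E(C_1, ..., C_l u {x_j}, ..., C_k)
            E_candidates = []
            for l in range(k):
                labels[j] = l
                E_candidates.append(within_cluster_sse(X, labels, k))
            winner = int(np.argmin(E_candidates))
            if winner != current:
                converged = False
            labels[j] = winner
    # final centroids (this sketch assumes no cluster ended up empty)
    mu = np.array([X[labels == i].mean(axis=0) for i in range(k)])
    return mu, labels
```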



example (figures): random initialization, result after one for-loop, result after two for-loops

assignment

watch www.youtube.com/watch?v=ivr91orblu8

question

what if $|X| \gg 1$ or what if the $x_j \in X$ arrive one at a time?

answer

consider the use of online k-means clustering


observe

$$\mu^{(n)} = \frac{1}{n} \sum_{j=1}^{n} x_j = \frac{1}{n} \sum_{j=1}^{n-1} x_j + \frac{1}{n}\, x_n = \frac{n-1}{n}\, \mu^{(n-1)} + \frac{1}{n}\, x_n$$

⇒ $\mu^{(n)}$ is a convex combination of $\mu^{(n-1)}$ and $x_n$

⇔ $\mu^{(n)}$ can be computed iteratively

observe

$$\mu^{(n)} = \frac{n-1}{n}\, \mu^{(n-1)} + \frac{1}{n}\, x_n = \mu^{(n-1)} - \frac{1}{n}\, \mu^{(n-1)} + \frac{1}{n}\, x_n = \mu^{(n-1)} + \frac{1}{n} \bigl[ x_n - \mu^{(n-1)} \bigr]$$
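a tiny NumPy check (illustrative, not part of the lecture) that the iterative update indeed reproduces the batch sample mean

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))          # a "stream" of 1000 points in R^2

mu = np.zeros(2)
for n, x in enumerate(X, start=1):
    mu += (x - mu) / n                  # mu^(n) = mu^(n-1) + (1/n) [x_n - mu^(n-1)]

print(np.allclose(mu, X.mean(axis=0)))  # True: the running mean equals the batch mean
```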

MacQueen’s algorithm

for all $C_i \in \{ C_1, \ldots, C_k \}$, initialize $\mu_i$ and set $n_i = 0$
for all $x_j \in \{ x_1, x_2, \ldots \}$
    determine the winner centroid
    $$\mu_w = \operatorname*{argmin}_{i} \| x_j - \mu_i \|^2$$
    update cluster size and centroid
    $n_w \leftarrow n_w + 1$
    $\mu_w \leftarrow \mu_w + \frac{1}{n_w} \bigl[ x_j - \mu_w \bigr]$
for all $C_i \in \{ C_1, \ldots, C_k \}$
    $$C_i = \bigl\{ x \in X \;\big|\; \| x - \mu_i \|^2 \leq \| x - \mu_l \|^2 \;\; \forall\, l \bigr\}$$
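a minimal NumPy sketch of MacQueen’s online k-means as summarized above; here the centroids are simply initialized from the first k points of the stream, which is one possible choice rather than the lecture’s prescription, and all names are illustrative

```python
import numpy as np

def macqueen_kmeans(stream, k):
    """Online k-means: process the points x_1, x_2, ... one at a time."""
    mu, counts = None, np.zeros(k, dtype=int)       # n_i = 0 for all clusters
    for j, x in enumerate(stream):
        x = np.asarray(x, dtype=float)
        if mu is None:
            mu = np.zeros((k, x.shape[0]))
        if j < k:                                    # illustrative initialization:
            mu[j], counts[j] = x, 1                  # the first k points become centroids
            continue
        # winner centroid  mu_w = argmin_i ||x_j - mu_i||^2
        w = int(((mu - x) ** 2).sum(axis=1).argmin())
        # update cluster size and centroid: mu_w <- mu_w + (1/n_w) [x_j - mu_w]
        counts[w] += 1
        mu[w] += (x - mu[w]) / counts[w]
    return mu
```

the final clusters $C_i$ are then obtained, as in the last step of the algorithm, by assigning every stored point to its nearest centroid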

assignment

read C. Bauckhage, “Lecture Notes on Data Science: Online k-Means Clustering”, dx.doi.org/10.13140/RG.2.1.1608.6240

watch www.youtube.com/watch?v=hzGnnx0k6es

GMMs and k-means

pathological example (figures): a sample of data points $x_j \in \mathbb{R}^2$ and the result of k-means clustering for k = 2

question

what went wrong?

answer

we applied k-means to data on which it cannot work well!




probabilistic view on clustering

imagine the given samples $x_j \in X$ were produced as follows

sample a cluster $C_i$ according to a discrete probability $p(C_i)$
sample a point $x_j$ according to a continuous conditional probability $p(x \mid C_i)$

under this generative model, the probability for observing any sample point $x_j$ amounts to

$$p(x_j) = \sum_{i=1}^{k} p(x_j \mid C_i)\, p(C_i)$$


modeling assumptions

let the elements in each cluster be distributed according to
$$p(x \mid C_i) = \mathcal{N}(x \mid \mu_i, \Sigma_i) = \gamma_i\, e^{-\frac{1}{2} (x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)}$$
i.e. a multivariate Gaussian with normalization constant $\gamma_i$

let $\Sigma_i = I$ (each Gaussian is isotropic and of unit variance)
$$p(x \mid C_i) = \mathcal{N}(x \mid \mu_i) = \gamma_i\, e^{-\frac{1}{2} \| x - \mu_i \|^2}$$


consequence

letting $w_i = p(C_i)$, we thus consider a particularly simple Gaussian mixture model (GMM)

$$p(x_j) = \sum_{i=1}^{k} w_i\, \mathcal{N}(x_j \mid \mu_i)$$

and might be interested in estimating its parameters
$$\theta = \bigl\{ w_1, \mu_1, \ldots, w_k, \mu_k \bigr\}$$
from the data . . .

likelihood and log-likelihood

$$L(\theta) = \prod_{j=1}^{n} p(x_j) = \prod_{j=1}^{n} \sum_{i=1}^{k} w_i\, \mathcal{N}(x_j \mid \mu_i)$$

$$\log L(\theta) = \sum_{j=1}^{n} \log \left[ \sum_{i=1}^{k} w_i\, \mathcal{N}(x_j \mid \mu_i) \right]$$

note

our trusted recipe of considering $\nabla L(\theta) = 0$ does not lead to a closed form solution

a great idea due to Dempster et al. (1977) is to assume a set of indicator variables
$$Z = \bigl\{ z_{11}, z_{12}, \ldots, z_{kn} \bigr\}$$
just as introduced in (3) and to consider . . .

complete likelihood

$$L(\theta, Z) = \prod_{j=1}^{n} \sum_{i=1}^{k} z_{ij}\, w_i\, \mathcal{N}(x_j \mid \mu_i)$$

observe

since $z_{ij} \in \{0, 1\}$ and $\sum_i z_{ij} = 1$, we are allowed to write
$$\sum_{i=1}^{k} z_{ij}\, w_i\, \mathcal{N}(x_j \mid \mu_i) = \prod_{i=1}^{k} \bigl[ w_i\, \mathcal{N}(x_j \mid \mu_i) \bigr]^{z_{ij}}$$

complete log-likelihood

$$\begin{aligned}
\log L(\theta, Z) &= \sum_{j=1}^{n} \sum_{i=1}^{k} z_{ij} \Bigl[ \log w_i + \log \mathcal{N}(x_j \mid \mu_i) \Bigr] \\
&= \sum_{j=1}^{n} \sum_{i=1}^{k} z_{ij} \Bigl[ \log w_i + \log \gamma_i - \tfrac{1}{2} \| x_j - \mu_i \|^2 \Bigr] \\
&= \underbrace{\sum_{j=1}^{n} \sum_{i=1}^{k} z_{ij} \log w_i}_{T_1}
 + \underbrace{\sum_{j=1}^{n} \sum_{i=1}^{k} z_{ij} \log \gamma_i}_{T_2}
 - \underbrace{\sum_{j=1}^{n} \sum_{i=1}^{k} z_{ij}\, \frac{\| x_j - \mu_i \|^2}{2}}_{T_3}
\end{aligned}$$


note

maximizing $\log L(\theta, Z) = T_1 + T_2 - T_3$ requires minimizing $T_3$

looking at
$$2 \cdot T_3 = \sum_{i=1}^{k} \sum_{j=1}^{n} z_{ij}\, \| x_j - \mu_i \|^2$$
we recognize the k-means minimization objective in (2)

⇒ k-means clustering implicitly fits a simplified GMM to X

note

one can also show that k-means clustering implicitly fits isotropic Gaussians of small variance

⇒ if the data in X does not consist of “Gaussian blobs”, k-means clustering will produce questionable results
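this effect can be checked with a small experiment; the sketch below (an illustration relying on scikit-learn’s KMeans for convenience, which the lecture does not prescribe) contrasts two isotropic Gaussian blobs with two long, anisotropic “stripes” and typically recovers the former almost perfectly while splitting the latter roughly at chance level

```python
import numpy as np
from sklearn.cluster import KMeans      # any k-means implementation would do

rng = np.random.default_rng(0)

# two isotropic Gaussian blobs: the implicit model assumption holds
blobs = np.vstack([rng.normal([0, 0], 1.0, size=(200, 2)),
                   rng.normal([6, 0], 1.0, size=(200, 2))])

# two long, parallel stripes: anisotropic, so the assumption is violated
stripes = np.vstack([rng.normal([0, 0], [8.0, 0.3], size=(200, 2)),
                     rng.normal([0, 3], [8.0, 0.3], size=(200, 2))])

true = np.repeat([0, 1], 200)
for name, X in [("blobs", blobs), ("stripes", stripes)]:
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    # agreement with the true grouping, up to label permutation
    agreement = max((labels == true).mean(), (labels != true).mean())
    print(name, round(agreement, 2))    # typically near 1.0 for blobs, near 0.5 for stripes
```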

assignment

read
C. Bauckhage, “Lecture Notes on Data Science: k-Means Clustering Is Gaussian Mixture Modeling”, dx.doi.org/10.13140/RG.2.1.3033.2646
C. Bauckhage, “Lecture Notes on Data Science: k-Means Clustering Minimizes Within Cluster Variances”, dx.doi.org/10.13140/RG.2.1.1292.4649

soft k-means

observe

so far, we have been focusing on k-means for hard clustering with indicator variables $z_{ij} \in \{0, 1\}$ where

$$z_{ij} = \begin{cases} 1, & \text{if } x_j \in C_i \\ 0, & \text{otherwise} \end{cases}
       = \begin{cases} 1, & \text{if } \| x_j - \mu_i \|^2 \leq \| x_j - \mu_l \|^2 \;\; \forall\, l \neq i \\ 0, & \text{otherwise} \end{cases}$$


observe

we may relax this to the idea of soft clustering with indicator variables $z_{ij} \in [0, 1]$ where
$$z_{ij} \geq 0, \qquad \sum_i z_{ij} = 1$$

a common approach towards this idea is to consider
$$z_{ij} = \frac{e^{-\beta \| x_j - \mu_i \|^2}}{\sum_l e^{-\beta \| x_j - \mu_l \|^2}}$$

soft k-means clustering

set $t = 0$ and initialize $\mu_1^t, \mu_2^t, \ldots, \mu_k^t$

repeat until convergence

    compute all indicator variables
    $$z_{ij} = \frac{\exp\bigl[ -\beta\, \| x_j - \mu_i^t \|^2 \bigr]}{\sum_l \exp\bigl[ -\beta\, \| x_j - \mu_l^t \|^2 \bigr]}$$

    update all centroids
    $$\mu_i^{t+1} = \frac{\sum_j z_{ij}\, x_j}{\sum_j z_{ij}}$$

    increase iteration counter $t = t + 1$
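a minimal NumPy sketch of the soft k-means iteration above; the value of β, the initialization, the stopping rule, and all names are illustrative choices

```python
import numpy as np

def soft_kmeans(X, k, beta=1.0, t_max=100, seed=None):
    """Soft k-means: alternate soft assignments z_ij and weighted centroid updates."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)].copy()      # initialize mu_1, ..., mu_k
    Z = None
    for t in range(t_max):
        # soft indicators z_ij = exp(-beta ||x_j - mu_i||^2) / sum_l exp(-beta ||x_j - mu_l||^2)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (n, k); Z[j, i] stores z_ij
        logits = -beta * d2
        logits -= logits.max(axis=1, keepdims=True)               # numerical stabilization
        Z = np.exp(logits)
        Z /= Z.sum(axis=1, keepdims=True)                         # each row sums to one
        # weighted centroid update  mu_i = sum_j z_ij x_j / sum_j z_ij
        new_mu = (Z.T @ X) / Z.sum(axis=0)[:, None]
        if np.allclose(new_mu, mu):                               # centroids have stabilized
            break
        mu = new_mu
    return mu, Z
```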

example (figures): initial cluster centroids and corresponding soft cluster assignments; centroids and soft assignments after the first update step; centroids and soft assignments upon convergence

example (figures): effect of the stiffness parameter β > 0 on the data, shown for β = 1/2, β = 1, and β = 2

assignment

read C. Bauckhage, “Lecture Notes on Data Science: Soft k-Means Clustering”, dx.doi.org/10.13140/RG.2.1.3582.6643

watch www.youtube.com/watch?v=Np9VuEg aqo

summary

we now know about

k-means clustering
the fact that it implicitly fits a GMM and is therefore tailored to locally Gaussian data
the fact that it is a difficult problem (NP-hard) whose optimal solution cannot be guaranteed
the fact that there are various algorithms (heuristics!)
    Lloyd’s algorithm
    Hartigan’s algorithm
    MacQueen’s algorithm