Probabilistic Data Mining

Lehel Csató
Faculty of Mathematics and Informatics, Babeș–Bolyai University, Cluj-Napoca

November 2010

Outline

1. Modelling Data: Motivation; Machine Learning; Latent variable models.
2. Estimation methods: Maximum Likelihood; Maximum a-posteriori; Bayesian Estimation.
3. Unsupervised Methods: General concepts; Principal Components; Independent Components; Mixture Models.

Motivation for Data Mining

Data mining is not:
SQL and relational database applications;
storage technologies;
cloud computing.

Data mining:
The extraction of knowledge or information from an ever-growing collection of data. An "advanced" search capability that enables one to extract patterns useful in providing models for:
1. characterising,
2. predicting, and
3. exploiting the data.

Data mining applications

Identifying targets for vouchers / frequent-flier bonuses or in telecommunications.

"Basket analysis" – correlation-based analysis leading to recommending new items – Amazon.com.

(Semi-)automated fraud/virus detection: use guards that protect against procedural or other types of misuse of a system.

Forecasting, e.g. energy consumption of a region for optimising coal/hydro plants or for planning.

Exploiting textual databases – the Google business: answering user queries; placing content-sensitive ads (Google AdSense).

Data mining applications Probabilistic Data Mining Lehel Csató Modelling Data Motivation Machine Learning Latent variable models

Identifying targets for vouchers/frequent flier bonuses or in telecommunications. “Basket analysis” – correlation–based analysis leading to recommending new items – Amazon.com.

Estimation Maximum Likelihood Maximum a-posteriori Bayesian Estimation

Unsupervised General concepts

(semi)automated fraud/virus detection: use guards that protect against procedural or other types of misuse of a system.

Principal Components Independent Components Mixture Models

Forecasting e.g. energy consumption of a region for optimising coal/hydro-plants or planning; Exploiting textual databases – the Google business: to answer user queries; to put content-sensitive ads: Google AdSense

Data mining applications Probabilistic Data Mining Lehel Csató Modelling Data Motivation Machine Learning Latent variable models

Identifying targets for vouchers/frequent flier bonuses or in telecommunications. “Basket analysis” – correlation–based analysis leading to recommending new items – Amazon.com.

Estimation Maximum Likelihood Maximum a-posteriori Bayesian Estimation

Unsupervised General concepts

(semi)automated fraud/virus detection: use guards that protect against procedural or other types of misuse of a system.

Principal Components Independent Components Mixture Models

Forecasting e.g. energy consumption of a region for optimising coal/hydro-plants or planning; Exploiting textual databases – the Google business: to answer user queries; to put content-sensitive ads: Google AdSense

Data mining applications Probabilistic Data Mining Lehel Csató Modelling Data Motivation Machine Learning Latent variable models

Identifying targets for vouchers/frequent flier bonuses or in telecommunications. “Basket analysis” – correlation–based analysis leading to recommending new items – Amazon.com.

Estimation Maximum Likelihood Maximum a-posteriori Bayesian Estimation

Unsupervised General concepts

(semi)automated fraud/virus detection: use guards that protect against procedural or other types of misuse of a system.

Principal Components Independent Components Mixture Models

Forecasting e.g. energy consumption of a region for optimising coal/hydro-plants or planning; Exploiting textual databases – the Google business: to answer user queries; to put content-sensitive ads: Google AdSense

Data mining applications Probabilistic Data Mining Lehel Csató Modelling Data Motivation Machine Learning Latent variable models

Identifying targets for vouchers/frequent flier bonuses or in telecommunications. “Basket analysis” – correlation–based analysis leading to recommending new items – Amazon.com.

Estimation Maximum Likelihood Maximum a-posteriori Bayesian Estimation

Unsupervised General concepts

(semi)automated fraud/virus detection: use guards that protect against procedural or other types of misuse of a system.

Principal Components Independent Components Mixture Models

Forecasting e.g. energy consumption of a region for optimising coal/hydro-plants or planning; Exploiting textual databases – the Google business: to answer user queries; to put content-sensitive ads: Google AdSense

The need for data mining

"Computers have promised us a fountain of wisdom but delivered a flood of data."
"The amount of information in the world doubles every 20 months." (Frawley, Piatetsky-Shapiro, Matheus, 1991)

A competitive market environment requires sophisticated – and useful – algorithms.

Data acquisition and storage are ubiquitous; algorithms are required to exploit the data. The algorithms that exploit this data-rich environment usually come from the machine learning domain.

Machine learning

Historical background / motivation:

Huge amounts of data that should be processed automatically;

Mathematics provides general solutions, i.e. solutions that are not tailored to a given problem;

A need for a "science" that uses the machinery of mathematics to solve practical problems.

Definitions for Machine Learning

Machine learning: a collection of methods (from statistics and probability theory) to solve problems met in practice.

Typical problems:
noise filtering for non-linear regression and/or non-Gaussian noise;
classification: binary, multiclass, partially labelled;
clustering, inversion problems, density estimation, novelty detection.

Generally, we need to model the data.

Modelling Data

[Figure: an unknown function f(x) and the observed, noise-corrupted samples (x_1, y_1), (x_2, y_2), ..., (x_N, y_N).]

Real world: there "is" a function y = f(x).
Observation process: a corrupted datum is collected for a sample x_n:

t_n = y_n + ε        (additive noise)
t_n = h(y_n, ε)      (h is a distortion function)

Problem: find the function y = f(x).

Latent variable models

[Figure: inference of a function f*(x) from the data (x_1, y_1), ..., (x_N, y_N), using a function class F and a model of the observation process.]

The data set is collected.
Assume a function class F: polynomial, Fourier expansion, wavelet.
The observation process encodes the noise.
Find the optimal function from the class.

Latent variable models II

We have the data set D = {(x_1, y_1), ..., (x_N, y_N)}.

Consider a function class, for example:

(1)  F = { w^T x + b  |  w ∈ R^d, b ∈ R }

(2)  F = { a_0 + Σ_{k=1}^K a_k sin(2πkx) + Σ_{k=1}^K b_k cos(2πkx)  |  a, b ∈ R^K, a_0 ∈ R }

Assume an observation process: y_n = f(x_n) + ε with ε ∼ N(0, σ²).

Latent variable models III

1. The data set: D = {(x_1, y_1), ..., (x_N, y_N)}.

2. Assume a function class:  F = { f(x, θ) | θ ∈ R^p },  where F is polynomial, etc.

3. Assume an observation process. Define a loss function L(y_n, f(x_n, θ)); for Gaussian noise: L(y_n, f(x_n, θ)) = (y_n − f(x_n, θ))².

Outline

2. Estimation methods: Maximum Likelihood; Maximum a-posteriori; Bayesian Estimation.

Parameter estimation

Estimating parameters means finding the optimal value of θ:

θ* = argmin_{θ ∈ Ω} L(D, θ)

where Ω is the domain of the parameters and L(D, θ) is a "loss function" for the data set. Example:

L(D, θ) = Σ_{n=1}^N L(y_n, f(x_n, θ))

Maximum Likelihood Estimation

L(D, θ) – the (negative log-)likelihood function.

Maximum likelihood estimation of the model:

θ* = argmin_{θ ∈ Ω} L(D, θ)

Example – quadratic regression:

L(D, θ) = Σ_{n=1}^N (y_n − f(x_n, θ))²     – factorisation over the data points

Drawback: it can produce a perfect fit to the data – over-fitting.
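A minimal sketch of this over-fitting drawback, using NumPy's polynomial least-squares fit; the data, noise level and degrees are illustrative assumptions, not taken from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a smooth function observed with additive Gaussian noise.
x = np.linspace(-1.0, 1.0, 12)
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

# Maximum-likelihood (least-squares) polynomial fits of increasing degree.
for degree in (1, 3, 11):
    coeffs = np.polyfit(x, y, deg=degree)          # minimises the sum of squared residuals
    residual = np.sum((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {degree:2d}: training error = {residual:.4f}")

# The degree-11 fit drives the training error to (almost) zero on the 12 noisy
# points: a "perfect" fit to the data, i.e. over-fitting.
```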

Example of an ML estimate

[Figure: height h (cm), from 140 to 190, plotted against weight w (kg), from 50 to 110.]

We want to fit a model to the data.
Use a linear model: h = θ0 + θ1 w.
Use a log-linear model: h = θ0 + θ1 log(w).
Use higher-order polynomials, e.g. h = θ0 + θ1 w + θ2 w² + θ3 w³ + ...


M.L. for linear models I

Assume a linear model for the x → y relation:

f(x_n|θ) = Σ_{ℓ=1}^d θ_ℓ x_ℓ     with x = [1, x, x², log(x), ...]^T

and a quadratic loss for D = {(x_1, y_1), ..., (x_N, y_N)}:

E_2(D|f) = Σ_{n=1}^N (y_n − f(x_n|θ))²

M.L. for linear models II

Minimisation:

Σ_{n=1}^N (y_n − f(x_n|θ))² = (y − Xθ)^T (y − Xθ) = θ^T X^T X θ − 2 θ^T X^T y + y^T y

Solution:

0 = 2 X^T X θ − 2 X^T y
θ = (X^T X)^{-1} X^T y

where y = [y_1, ..., y_N]^T and X = [x_1, ..., x_N]^T are the transformed data.
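A minimal sketch of this normal-equation solution on hypothetical data; the quadratic generating function and the noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical regression data: y = 2 + 0.5*x - 0.1*x^2 + noise.
x = rng.uniform(0.0, 10.0, size=50)
y = 2.0 + 0.5 * x - 0.1 * x ** 2 + rng.normal(scale=0.3, size=x.shape)

# Design matrix with the transformed inputs x_n = [1, x, x^2]^T, one row per datum.
X = np.column_stack([np.ones_like(x), x, x ** 2])

# Maximum-likelihood solution theta = (X^T X)^{-1} X^T y,
# computed with a linear solve rather than an explicit matrix inverse.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print("estimated parameters:", theta)
```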

M.L. for linear models III

Generalised linear models:
use a set of functions Φ = [φ_1(·), ..., φ_M(·)];
project the inputs into the space spanned by Im(Φ);
have a parameter vector of length M: θ = [θ_1, ..., θ_M]^T;
the model is Σ_m θ_m φ_m(x), with θ_m ∈ R.

The optimal parameter vector is:

θ* = (Φ^T Φ)^{-1} Φ^T y

Maximum Likelihood – Summary

There are many candidate model families:
the degree of a polynomial specifies a model family;
so does the rank of a Fourier expansion;
a mixture of {log, sin, cos, ...} is also a family.

Selecting the "best family" is a difficult modelling problem. In maximum likelihood there is no control over how well a family suits a given data set. The number of parameters should be kept smaller than √(#data).


Maximum a–posteriori I

The generalised linear model is powerful – it can be extremely complex;
with no complexity control, there is an overfitting problem.

Aim: include knowledge in the inference process. Our beliefs are reflected by the choice of the candidate functions.

Goal:
prior knowledge specification using probabilities;
using probability theory for consistent estimation;
encoding the observation noise in the model.

Maximum a–posteriori – Data/noise

Probabilistic data description: how likely is it that θ generated the data?

y = f(x)        ⇔   y − f(x) ∼ δ_0
y = f(x) + ε    ⇔   y − f(x) ∼ N

Gaussian noise: y − f(x) ∼ N(0, σ²)

P(y|f(x)) = (1/√(2π)σ) exp( −(y − f(x))² / (2σ²) )

Maximum a–posteriori – Prior

William of Ockham (1285–1349), principle:
Entities should not be multiplied beyond necessity.

Also known as (wiki...):
the "principle of simplicity" – KISS;
"When you hear hoofbeats, think horses, not zebras."

Simple models: small parameter values – L2 norm; a small number of parameters – L0 norm.

Probabilistic representation:

p_0(θ) ∝ exp( −||θ||² / (2σ_0²) )

M.A.P. – Inference

M.A.P. – probabilities are assigned to D via the log-likelihood function:

P(y_n | x_n, θ, F) ∝ exp[ −L(y_n, f(x_n, θ)) ]

θ – prior probabilities:

p_0(θ) ∝ exp( −||θ||² / (2σ_0²) )

A-posteriori probability:

p(θ | D, F) = P(D|θ) p_0(θ) / p(D|F)

p(D|F) – the probability of the data for a given family.


M.A.P. – Inference II

M.A.P. estimation finds the θ with the largest posterior probability:

θ*_MAP = argmax_{θ ∈ Ω} p(θ | D, F)

Example: with the loss L(y_n, f(x_n, θ)) and a Gaussian prior:

θ*_MAP = argmax_{θ ∈ Ω}  K − Σ_n L(y_n, f(x_n, θ)) − ||θ||² / (2σ_0²)

For σ_0² = ∞ this reduces to maximum likelihood (after a change of sign and max → min).

M.A.P. – Example I

[Figure: MAP fits with a 6th-order polynomial for noise deviations 10⁻³ and 10⁻², shown against the true function and the training data.]

M.A.P. – Linear models I

[Figure: height h (cm) versus weight w (kg) data with MAP fits of a polynomial of order p = 10 for decreasing prior widths.]

Aim: test different levels of flexibility (p = 10) by varying the prior width:
σ_0² = 10⁶, 10⁵, 10⁴, 10³, 10², 10¹, 10⁰.

M.A.P. – Linear models II

θ*_MAP = argmax_{θ ∈ Ω}  K − Σ_n E_2(y_n, f(x_n, θ)) − ||θ||² / (2σ_0²)

Transform into vector notation:

θ*_MAP = argmax_{θ ∈ Ω}  K − ½ (y − Xθ)^T (y − Xθ) − θ^T θ / (2σ_0²)

Solve for θ by differentiation:

X^T (y − Xθ) − (1/σ_0²) I_d θ = 0

θ*_MAP = ( X^T X + (1/σ_0²) I_d )^{-1} X^T y

which is again the M.L. solution for σ_0² = ∞.
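A minimal sketch of this MAP (ridge-regression-style) estimate on hypothetical data; the design matrix, prior widths and noise level are illustrative assumptions:

```python
import numpy as np

def map_linear(X, y, sigma0_sq):
    """MAP estimate for the linear model with a Gaussian prior of width sigma0_sq:
    theta = (X^T X + I/sigma0_sq)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + np.eye(d) / sigma0_sq, X.T @ y)

# Hypothetical noisy data and a degree-9 polynomial design matrix.
rng = np.random.default_rng(2)
x = np.linspace(-1.0, 1.0, 15)
y = np.sin(np.pi * x) + rng.normal(scale=0.1, size=x.shape)
X = np.vander(x, N=10, increasing=True)   # columns [1, x, ..., x^9]

for sigma0_sq in (1e6, 1e2, 1e0):
    theta = map_linear(X, y, sigma0_sq)
    print(f"sigma0^2 = {sigma0_sq:g}: ||theta|| = {np.linalg.norm(theta):.3f}")

# A large sigma0^2 recovers the (possibly over-fitted) ML solution;
# a small sigma0^2 shrinks the parameters towards zero.
```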

M.A.P. – Summary

Maximum a–posteriori models:

allow for the inclusion of prior knowledge;
may protect against overfitting;
can measure the fitness of the family to the data – a procedure called M.L. type II.

M.A.P. application – M.L. II

Idea: instead of computing the most probable value of θ, we can measure the fit of the model family F to the data D:

P(D|F) = Σ_{θ_ℓ ∈ Ω} p(D, θ_ℓ | F) = Σ_{θ_ℓ ∈ Ω} p(D | θ_ℓ, F) p_0(θ_ℓ | F)

For the Gaussian noise case and a polynomial of order K:

log P(D|F) = log ∫_{Ω_θ} dθ p(D | θ, F) p_0(θ | F) = log N(y | 0, Σ_X)
           = −½ ( N log(2π) + log|Σ_X| + y^T Σ_X^{-1} y )

where Σ_X = I_N σ_n² + X Σ_0 X^T with X = [x^0, x^1, ..., x^K] and Σ_0 = diag(σ_0², σ_1², ..., σ_K²) = σ_p² I_{K+1}.
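A minimal sketch of this evidence computation, comparing polynomial families on hypothetical data; the generating cubic, the noise level and the prior width σ_p² are illustrative assumptions:

```python
import numpy as np

def log_evidence(X, y, sigma_n_sq, sigma_p_sq):
    """Log marginal likelihood log P(D|F) of the Bayesian linear model:
    y ~ N(0, Sigma_X) with Sigma_X = sigma_n^2 I + sigma_p^2 X X^T."""
    N = len(y)
    Sigma_X = sigma_n_sq * np.eye(N) + sigma_p_sq * X @ X.T
    sign, logdet = np.linalg.slogdet(Sigma_X)
    quad = y @ np.linalg.solve(Sigma_X, y)
    return -0.5 * (N * np.log(2.0 * np.pi) + logdet + quad)

# Hypothetical data generated from a cubic; compare polynomial families k = 1..8.
rng = np.random.default_rng(3)
x = np.linspace(-1.0, 1.0, 30)
y = 1.0 - 2.0 * x + 0.5 * x ** 3 + rng.normal(scale=0.1, size=x.shape)

for k in range(1, 9):
    X = np.vander(x, N=k + 1, increasing=True)    # columns [x^0, ..., x^k]
    print(f"k = {k}: log P(D|k) = {log_evidence(X, y, 0.1**2, 1.0):.2f}")

# The evidence typically peaks near the true order instead of always
# favouring the most complex family.
```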

M.A.P. ⇒ M.L. II

[Figure: log P(D|k) as a function of log(σ_p²) for polynomial families of order k = 1, ..., 10.]

Aim: test different models – polynomial families of varying order k.

Bayesian estimation – Intro

M.L. and M.A.P. estimates provide single solutions.
Point estimates lack an assessment of their un/certainty.

Better solution: for a query x*, the system output is probabilistic:

x*  →  p(y* | x*, F)

Tool: go beyond the M.A.P. solution and use the a-posteriori distribution of the parameters.

Bayesian estimation II

We again use Bayes' rule:

p(θ | D, F) = P(D|θ) p_0(θ) / p(D|F)     with   p(D|F) = ∫_Ω dθ P(D|θ) p_0(θ)

and exploit the whole posterior distribution of the parameters.

A-posteriori parameter estimates
We operate with p_post(θ) := p(θ | D, F) and use the total probability rule

p(y* | D, F) = Σ_{θ_ℓ ∈ Ω_θ} p(y* | θ_ℓ, F) p_post(θ_ℓ)

in assessing the system output.

Bayesian estimation – Example I

Given the data D = {(x_1, y_1), ..., (x_N, y_N)}, estimate the linear fit

y = θ_0 + Σ_{i=1}^d θ_i x_i := θ^T x,     θ = [θ_0, θ_1, ..., θ_d]^T,   x = [1, x_1, ..., x_d]^T

with Gaussian distributions for the noise and the prior:

ε = y_n − θ^T x_n ∼ N(0, σ_n²),     θ ∼ N(0, Σ_0)

Bayesian estimation – Example II

Goal: compute the posterior distribution p_post(θ).

p_post(θ) ∝ p_0(θ | Σ_0) Π_{n=1}^N P(y_n | θ^T x_n)

−2 log(p_post(θ)) = K_post + (1/σ_n²)(y − Xθ)^T (y − Xθ) + θ^T Σ_0^{-1} θ
                  = θ^T ( (1/σ_n²) X^T X + Σ_0^{-1} ) θ − (2/σ_n²) θ^T X^T y + K'_post
                  = (θ − μ_post)^T Σ_post^{-1} (θ − μ_post) + K''_post

and by identification

Σ_post = ( (1/σ_n²) X^T X + Σ_0^{-1} )^{-1}     and     μ_post = Σ_post X^T y / σ_n²

Bayesian estimation – Example III

Bayesian linear model
The posterior distribution of the parameters is a Gaussian with

Σ_post = ( (1/σ_n²) X^T X + Σ_0^{-1} )^{-1}     and     μ_post = Σ_post X^T y / σ_n²

Point estimates are recovered as special cases:
M.L. if we take Σ_0 → ∞ and consider only μ_post;
M.A.P. if we approximate the distribution with a single value at its maximum, i.e. μ_post.

Bayesian estimation – Example IV

Prediction for a new value x*: use the likelihood P(y* | x*, θ, F), the posterior for θ, and Bayes' rule. The steps:

p(y* | x*, D, F) = ∫_{Ω_θ} dθ p(y* | x*, θ, F) p_post(θ | D, F)
  = ∫_{Ω_θ} dθ exp( −½ [ K* + (y* − θ^T x*)²/σ_n² + (θ − μ_post)^T Σ_post^{-1} (θ − μ_post) ] )
  = ∫_{Ω_θ} dθ exp( −½ [ K* + y*²/σ_n² − a^T C^{-1} a + Q(θ) ] )

where

a = x* y*/σ_n² + Σ_post^{-1} μ_post     and     C = x* x*^T/σ_n² + Σ_post^{-1}

Bayesian estimation – Example V

Integrating out the quadratic Q(θ):

Predictive distribution at x*

p(y* | x*, D, F) ∝ exp( −½ [ K* + (y* − x*^T μ_post)² / (σ_n² + x*^T Σ_post x*) ] )
                = N( y* | x*^T μ_post, σ_n² + x*^T Σ_post x* )

With the predictive distribution we can:
measure the variance of the prediction for each point: σ*² = σ_n² + x*^T Σ_post x*;
sample from the parameters and plot the candidate predictors.
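A minimal sketch of this posterior and predictive computation on hypothetical one-dimensional data; the prior width, the noise variance and the query point are illustrative assumptions:

```python
import numpy as np

def posterior(X, y, sigma_n_sq, Sigma0):
    """Gaussian posterior of the Bayesian linear model:
    Sigma_post = (X^T X / sigma_n^2 + Sigma0^{-1})^{-1},
    mu_post    = Sigma_post X^T y / sigma_n^2."""
    Sigma_post = np.linalg.inv(X.T @ X / sigma_n_sq + np.linalg.inv(Sigma0))
    mu_post = Sigma_post @ X.T @ y / sigma_n_sq
    return mu_post, Sigma_post

def predict(x_star, mu_post, Sigma_post, sigma_n_sq):
    """Predictive mean x*^T mu_post and variance sigma_n^2 + x*^T Sigma_post x*."""
    mean = x_star @ mu_post
    var = sigma_n_sq + x_star @ Sigma_post @ x_star
    return mean, var

# Hypothetical 1-D example with inputs mapped to x = [1, x]^T.
rng = np.random.default_rng(4)
x = rng.uniform(-3.0, 3.0, size=20)
y = 0.5 + 1.5 * x + rng.normal(scale=0.5, size=x.shape)
X = np.column_stack([np.ones_like(x), x])

mu_post, Sigma_post = posterior(X, y, sigma_n_sq=0.25, Sigma0=np.eye(2) * 10.0)
mean, var = predict(np.array([1.0, 2.0]), mu_post, Sigma_post, sigma_n_sq=0.25)
print(f"p(y*|x*=2) ~ N({mean:.3f}, {var:.3f})")
```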

Bayesian example – Error bars

[Figure: Bayesian fit with a 6th-order polynomial and noise variance σ² = 1; the errors are the symmetric thin lines around the predictive mean.]

Bayesian example – Predictive samples

[Figure: samples from the predictive distribution; third-order polynomials are used to approximate the data.]

Bayesian estimation – Problems

When computing p_post(θ | D, F) we assumed that the posterior can be represented analytically. In general this is not the case.

Approximations are needed for:
the posterior distribution;
the predictive distribution.

In Bayesian modelling, an important issue is how we approximate the posterior distribution.


Bayesian estimation – Summary

Complete specification of the model: can include prior beliefs about the model.

Accurate predictions: can compute the posterior probabilities for each test location.

Computational cost: using such models for prediction can be difficult and expensive in time and memory.

Bayesian models are flexible and accurate – if priors about the model are used.

Outline

3. Unsupervised Methods: General concepts; Principal Components; Independent Components; Mixture Models.

Unsupervised setting

The data can be unlabeled, i.e. no values y are associated with the inputs x.

We want to "extract" information from D = {x_1, ..., x_N}.

We assume that the data – although high-dimensional – span a much smaller-dimensional manifold. The task is to find the subspace corresponding to the data span.

Models in unsupervised learning

The model of the data is again important:

Principal Components;
Independent Components;
Mixture models.

[Figures: three toy two-dimensional data sets, one illustrating each model.]

The PCA model I

Simple data structure: a spherical cluster that is translated, scaled and rotated.

[Figure: a two-dimensional elongated data cloud.]

We aim to find the principal directions of the data spread.

Principal direction: the direction u along which the data preserve most of their variance.

The PCA model II

Principal direction:

u = argmax_{||u||=1} (1/2N) Σ_{n=1}^N (u^T x_n − u^T x̄)²

We pre-process so that x̄ = 0. Replacing the empirical average with the covariance Σ_x and introducing a Lagrange multiplier λ for the constraint:

u = argmax_{u, λ}  ½ u^T Σ_x u − λ (||u||² − 1)

Differentiating w.r.t. u:

Σ_x u − λ u = 0

The PCA model III

The optimal solution must obey

Σ_x u = λ u

which is the eigendecomposition of the covariance matrix: (λ*, u*) is an eigenvalue–eigenvector pair of the system.

Substituting back, the value of the objective is λ*. ⇒ The optimum is attained for λ* = λ_max.

Principal direction: the eigenvector u_max corresponding to the largest eigenvalue of the system.

The PCA model – Data mining I

How is this used in data mining? Assume that the data are:
jointly Gaussian: x ∼ N(m_x, Σ_x);
high-dimensional;
with only a few (say 2) relevant directions.

[Figure: a high-dimensional Gaussian data cloud that effectively spans two directions.]

The PCA model – Data mining II

How is this used in data mining?
Subtract the mean.
Perform the eigendecomposition.
Select the K eigenvectors corresponding to the K largest eigenvalues.
Compute the K projections: z_nℓ = x_n^T u_ℓ.

The projection using the matrix P := [u_1, ..., u_K]:

Z = X P

and z_n is used as a compact representation of x_n.

The PCA model – Data mining III

Reconstruction:

x'_n = Σ_{ℓ=1}^K z_nℓ u_ℓ     or, in matrix notation:  X' = Z P^T

PCA projection analysis:

E_PCA = (1/N) Σ_{n=1}^N ||x'_n − x_n||² = (1/N) tr[ (X' − X)^T (X' − X) ]
      = tr[ Σ_x − P Σ_z P^T ]
      = tr[ U ( diag(λ_1, ..., λ_d) − diag(λ_1, ..., λ_K, 0, ..., 0) ) U^T ]
      = tr[ U^T U diag(0, ..., 0, λ_{K+1}, ..., λ_d) ]
      = Σ_{ℓ=1}^{d−K} λ_{K+ℓ}

The PCA model – Data mining IV

PCA reconstruction error: the error made using the PCA directions is

E_PCA = Σ_{ℓ=1}^{d−K} λ_{K+ℓ}

PCA properties:
the PCA system is orthonormal: u_ℓ^T u_r = δ_{ℓ−r};
reconstruction is fast;
the spherical assumption is critical.
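A minimal PCA sketch on hypothetical three-dimensional data that lies near a two-dimensional subspace; it also checks that the mean reconstruction error equals the sum of the discarded eigenvalues:

```python
import numpy as np

def pca(X, K):
    """PCA sketch: centre the data, eigendecompose the empirical covariance,
    keep the K leading eigenvectors, return projections and the reconstruction."""
    Xc = X - X.mean(axis=0)                       # subtract the mean
    Sigma = Xc.T @ Xc / len(Xc)                   # empirical covariance
    eigval, eigvec = np.linalg.eigh(Sigma)        # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    P = eigvec[:, :K]                             # d x K matrix of principal directions
    Z = Xc @ P                                    # projections z_n
    X_rec = Z @ P.T + X.mean(axis=0)              # reconstruction x'_n
    return Z, X_rec, eigval

# Hypothetical 3-D data that really lives close to a 2-D subspace.
rng = np.random.default_rng(5)
Z_true = rng.normal(size=(500, 2)) @ np.diag([3.0, 1.0])
X = Z_true @ rng.normal(size=(2, 3)) + rng.normal(scale=0.05, size=(500, 3))

Z, X_rec, eigval = pca(X, K=2)
mse = np.mean(np.sum((X - X_rec) ** 2, axis=1))
print("mean reconstruction error:   ", mse)
print("sum of discarded eigenvalues:", eigval[2:].sum())   # the two should agree
```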

PCA application – USPS I

USPS digits – a testbed for several models.

[Figure: sample USPS digit images.]

PCA application – USPS II

USPS characteristics:
handwritten digits, centered and scaled;
≈ 10,000 items of 16 × 16 grayscale images.

We plot the cumulative normalised eigenvalue sum k_r = Σ_{ℓ=1}^r λ_ℓ (in %).

Conclusions for the USPS set:
the normalised λ_1 = 0.24 ⇒ u_1 accounts for 24% of the variance in the data;
at r ≈ 10 more than 70% of the variance is explained;
at r ≈ 50 more than 98% ⇒ 50 numbers instead of 256.

PCA application – USPS III

Visualisation application: visualisation of the digits along the first two eigendirections.

[Figure: USPS digits projected onto the first two principal directions; a detail view follows.]

The ICA model I

Start from the PCA: x = Pz is a generative model for the data.

We assumed that z are i.i.d. Gaussian random variables, z ∼ N(0, diag(λ_ℓ)),
⇒ the components of x are not independent;
⇒ z are Gaussian sources.

In most real data:
the sources are not Gaussian,
but the sources are independent.
We exploit that!

The ICA model II

The model assumption is

x = A s

where s are independent sources and A is a linear mixing matrix.

We look for a matrix B that recovers the sources:

s' := B x = B (A s) = (B A) s

i.e. (B A) is the identity up to a permutation and scaling, but independence is retained.

The ICA model III

In practice: s' := B x with s = [s_1, ..., s_K] all independent sources.

Independence test: the KL-divergence between the joint distribution and the product of the marginals,

B = argmin_{B ∈ SO_d} KL( p(s_1, s_2) || p(s_1) p(s_2) )

where SO_d is the group of matrices with |B| = 1.

In ICA we look for the matrix B that minimises

Σ_ℓ ∫_{Ω_ℓ} dp(s_ℓ) log p(s_ℓ) − ∫_Ω dp(s) log p(s_1, ..., s_d)

The KL-divergence – Detour

Kullback-Leibler divergence

KL(p||q) = Σ_x p(x) log( p(x) / q(x) )

is zero if and only if p = q;
is not a measure of distance (but close to it!);
is efficient to work with when exponential families are used.

Short proof that KL(p||q) ≥ 0:

0 = log 1 = log( Σ_x q(x) ) = log( Σ_x p(x) q(x)/p(x) ) ≥ Σ_x p(x) log( q(x)/p(x) ) = −KL(p||q)
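A minimal numeric check of these properties for discrete distributions; the example probabilities are illustrative:

```python
import numpy as np

def kl(p, q):
    """Discrete Kullback-Leibler divergence KL(p||q) = sum_x p(x) log(p(x)/q(x))."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # the convention 0 * log 0 = 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.25, 0.25, 0.5])
print(kl(p, p))               # 0.0 -- zero iff the distributions coincide
print(kl(p, q), kl(q, p))     # both positive, and generally unequal (not a distance)
```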

ICA Application – Data

Separation of source signals.

[Figure: four observed mixture signals m_1, ..., m_4.]

ICA Application – Results

Results of the separation.

[Figure: the four recovered source signals, obtained with the FastICA package.]
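A minimal blind-source-separation sketch; it assumes scikit-learn's FastICA implementation is available, and the two synthetic sources and the mixing matrix are illustrative:

```python
import numpy as np
from sklearn.decomposition import FastICA   # assumes scikit-learn is installed

rng = np.random.default_rng(6)
t = np.linspace(0.0, 8.0, 2000)

# Two hypothetical non-Gaussian, independent sources.
S = np.column_stack([np.sign(np.sin(3.0 * t)),    # square wave
                     np.sin(2.0 * t)])            # sinusoid
A = np.array([[1.0, 0.5],                         # "unknown" mixing matrix
              [0.4, 1.0]])
X = S @ A.T                                       # observed mixtures x = A s

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)      # recovered sources (up to permutation and scaling)
print("estimated mixing matrix:\n", ica.mixing_)
```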

Applications of ICA

Cocktail party problem: separating noisy and multiple sources from multiple observations.
Fetal ECG: separating the ECG signal of a fetus from its mother's ECG.
MEG recordings: separating MEG "sources".
Financial data: finding hidden factors in financial data.
Noise reduction: noise reduction in natural images.
Interference removal: interference removal from CDMA (code-division multiple access) communication systems.

The mixture model – Introduction

The data structure is more complex: there is more than a single source for the data.

[Figure: a two-dimensional data set with several distinct clusters.]

The mixture model:

P(x) = Σ_{k=1}^K π_k p_k(x | μ_k, Σ_k)

where:
π_1, ..., π_K – the mixing coefficients;
p_k(x | μ_k, Σ_k) – the density of a component.

The components are usually called clusters.

The mixture model – Data generation

The generation process reflects the assumptions about the model.

The data generation:
first we select from which component,
then we sample from that component's density function.

When modelling data we do not know:
which point belongs to which cluster;
what the parameters of each density function are.
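A minimal sketch of this two-step generation process for a hypothetical two-component Gaussian mixture; the weights, means and covariances are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical two-component Gaussian mixture (weights, means, covariances).
pi = np.array([0.35, 0.65])
mu = [np.array([2.0, 55.0]), np.array([4.3, 80.0])]
Sigma = [np.diag([0.1, 30.0]), np.diag([0.2, 30.0])]

N = 300
# Step 1: select the component for each datum ...
k = rng.choice(len(pi), size=N, p=pi)
# Step 2: ... then sample from that component's density.
X = np.array([rng.multivariate_normal(mu[ki], Sigma[ki]) for ki in k])
print(X.shape)   # (300, 2) -- the labels k stay hidden when we only observe X
```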

The mixture model – Example I

The Old Faithful geyser in Yellowstone National Park is characterised by:
intense eruptions;
differing times between them.

Rule: the duration is 1.5 to 5 minutes, and the length of an eruption helps determine the interval until the next one. If an eruption lasts less than 2 minutes, the interval will be around 55 minutes; if it lasts 4.5 minutes, the interval may be around 88 minutes.

The mixture model – Example II

[Figure: interval between eruptions (40–100 minutes) versus eruption duration (1.5–5.5 minutes) for the Old Faithful data.]

The longer the duration, the longer the interval.
The linear relation I = θ_0 + θ_1 d is not the best description.
There are only a very few eruptions lasting ≈ 3 minutes.

The mixture model – Assumptions I

We know the family of the individual density functions:
these density functions are parametrised with a few parameters.

The densities are easily identifiable:
if we knew which data belong to which cluster, the density functions would be easy to identify.

Gaussian densities are often used – they fulfil both "conditions".

The mixture model II

The Gaussian mixture model:

p(x) = π_1 N_1(x | μ_1, Σ_1) + π_2 N_2(x | μ_2, Σ_2)

For known densities (centres and ellipses):

p(k | x_n) = N_k(x_n | μ_k, Σ_k) p(k) / Σ_ℓ N_ℓ(x_n | μ_ℓ, Σ_ℓ) p(ℓ)

i.e. we know the probability that a datum comes from cluster k.

[Figure: the Old Faithful data with two Gaussian components; point colours shade from red to green according to cluster membership.]

For D these values are collected in a table:

x      p(1|x)   p(2|x)
x_1    γ_11     γ_12
...    ...      ...
x_N    γ_N1     γ_N2

γ_nℓ – the responsibility of x_n in cluster ℓ.


The mixture model III

When the γ_nℓ are known, the parameters are computed using the data weighted by their responsibilities:

(μ_k, Σ_k) = argmax_{μ, Σ} Π_{n=1}^N ( N_k(x_n | μ, Σ) )^{γ_nk}     for all k,

which means

(μ_k, Σ_k) ⇐ argmax_{μ, Σ} Σ_n γ_nk log N(x_n | μ, Σ).

When making inference we have to find both the responsibilities and the parameters of the mixture. Given the data D:

initial guess ⇒ (μ_1, Σ_1), ..., (μ_K, Σ_K);
re-estimate the responsibilities γ_n1, ..., γ_nK for every x_n;
re-estimate the parameters (μ_1, Σ_1), ..., (μ_K, Σ_K) and (π_1, ..., π_K).

The mixture model – Summary

Responsibilities γ: the additional latent variables needed to help the computation.

In the mixture model the goals are:
to fit the model to the data;
to decide which submodel gets a particular data point.

Both are achieved by maximising the log-likelihood function.

The EM algorithm I

(π, Θ) = argmax Σ_n log[ Σ_ℓ π_ℓ N_ℓ(x_n | μ_ℓ, Σ_ℓ) ]

Θ = [μ_1, Σ_1, ..., μ_K, Σ_K] is the vector of parameters;
π = [π_1, ..., π_K] are the shares of the factors.

Problem with the optimisation: the parameters are not separable, due to the sum within the logarithm.

Solution: use an approximation.

The EM algorithm II

log P(D | π, Θ) = Σ_n log[ Σ_ℓ π_ℓ N_ℓ(x_n | μ_ℓ, Σ_ℓ) ] = Σ_n log[ Σ_ℓ p_ℓ(x_n, ℓ) ]

Use Jensen's inequality:

log( Σ_ℓ p_ℓ(x_n, ℓ | θ_ℓ) ) = log( Σ_ℓ q_n(ℓ) p_ℓ(x_n, ℓ | θ_ℓ) / q_n(ℓ) )
                            ≥ Σ_ℓ q_n(ℓ) log( p_ℓ(x_n, ℓ) / q_n(ℓ) )

for any distribution [q_n(1), ..., q_n(K)].

Jensen's Inequality – Detour

[Figure: a concave function f(z) with the chord between z_1 and z_2 lying below the function.]

Jensen's Inequality
For any concave f(z), any z_1 and z_2, and any γ_1, γ_2 > 0 such that γ_1 + γ_2 = 1:

f(γ_1 z_1 + γ_2 z_2) ≥ γ_1 f(z_1) + γ_2 f(z_2)

The EM algorithm III

log( Σ_ℓ p_ℓ(x_n, ℓ | θ_ℓ) ) ≥ Σ_ℓ q_n(ℓ) log( p_ℓ(x_n, ℓ) / q_n(ℓ) )

for any distribution q_n(·). Replacing each term with the right-hand side, we have

log P(D | π, Θ) ≥ Σ_n Σ_ℓ q_n(ℓ) log( p_ℓ(x_n | θ_ℓ) / q_n(ℓ) ) = L

and therefore the optimisation with respect to the cluster parameters separates:

0 = Σ_n q_n(ℓ) ∂ log p_ℓ(x_n | θ_ℓ) / ∂θ_ℓ

For distributions from the exponential family this optimisation is easy.

The EM algorithm IV

Any set of distributions q_1(·), ..., q_N(·) provides a lower bound to the log-likelihood. We should choose the distributions that are closest to the current parameter set.

Assume the parameters have the value θ⁰ and minimise the difference:

log p(x_n | θ⁰) − L_n = Σ_ℓ q_n(ℓ) log p(x_n | θ⁰) − Σ_ℓ q_n(ℓ) log( P(x_n, ℓ | θ⁰_ℓ) / q_n(ℓ) )
                      = Σ_ℓ q_n(ℓ) log( q_n(ℓ) p(x_n | θ⁰) / P(x_n, ℓ | θ⁰_ℓ) )

and observe that by setting

q_n(ℓ) = P(x_n, ℓ | θ⁰_ℓ) / p(x_n | θ⁰)

we have log p(x_n | θ⁰) − L_n = 0.

The EM algorithm V

The EM algorithm:

Init – initialise the model parameters;
E step – compute the responsibilities γ_nℓ = q_n(ℓ);
M step – for each component ℓ solve

0 = Σ_n q_n(ℓ) ∂ log p_ℓ(x_n | θ_ℓ) / ∂θ_ℓ

repeat – go to the E step.
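A minimal EM sketch for a Gaussian mixture; the initialisation scheme, the small regularisation term added to the covariances, and the demo data are illustrative assumptions:

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=100, seed=0):
    """EM sketch for a K-component Gaussian mixture on data X (N x d)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)                                  # initial guesses
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * K)

    for _ in range(n_iter):
        # E step: responsibilities gamma_nk = p(k | x_n).
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                for k in range(K)])
        gamma = dens / dens.sum(axis=1, keepdims=True)

        # M step: re-estimate each component from the responsibility-weighted data.
        Nk = gamma.sum(axis=0)
        pi = Nk / N
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pi, mu, Sigma, gamma

# Demo on hypothetical two-cluster data (loosely Old-Faithful-like).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([2.0, 55.0], [0.3, 6.0], size=(150, 2)),
               rng.normal([4.3, 80.0], [0.4, 6.0], size=(150, 2))])
pi, mu, Sigma, gamma = em_gmm(X, K=2)
print("mixing weights:", pi)
print("means:\n", mu)
```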

EM application

[Figures: the Old Faithful data (interval between eruptions versus duration) and successive EM iterations in which the two-component Gaussian mixture converges to the two clusters.]

References

J. M. Bernardo and A. F. Smith. Bayesian Theory. John Wiley & Sons, 1994.
C. M. Bishop. Pattern Recognition and Machine Learning. Springer Verlag, New York, N.Y., 2006.
T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.
A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, series B, 39:1–38, 1977.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Verlag, 2001.