Support Vector Machines and Kernel Based Learning - ESAT KULeuven

Support Vector Machines and Kernel Based Learning

Johan Suykens
K.U. Leuven, ESAT-SCD/SISTA
Kasteelpark Arenberg 10
B-3001 Leuven (Heverlee), Belgium
Tel: 32/16/32 18 02 - Fax: 32/16/32 19 70
Email: [email protected]
http://www.esat.kuleuven.be/scd/

Tutorial ICANN 2007, Porto, Portugal, Sept. 2007

ICANN 2007 ⋄ Johan Suykens

Living in a data world

biomedical, process industry, energy, traffic, multimedia, bio-informatics

ICANN 2007 ⋄ Johan Suykens

1

Kernel based learning: interdisciplinary challenges

SVM & kernel methods draw on: neural networks, data mining, linear algebra, pattern recognition, mathematics, machine learning, optimization, statistics, signal processing, systems and control theory

• Understanding the essential concepts and different facets of problems
• Providing systematic approaches, engineering kernel machines
• Integrative design, bridging the gaps between theory and practice

ICANN 2007 ⋄ Johan Suykens

2

- Part I contents

• neural networks and support vector machines
  feature map and kernels
  primal and dual problem

• classification, regression
  convex problem, robustness, sparseness

• wider use of the kernel trick
  least squares support vector machines as core problems
  kernel principal component analysis

• large scale
  fixed-size method
  nonlinear modelling

ICANN 2007 ⋄ Johan Suykens

3

Classical MLPs

[Figure: neuron/MLP diagram with inputs x_1, ..., x_n, weights w_1, ..., w_n, bias b, hidden units h(·) and output y]

Multilayer Perceptron (MLP) properties:
• Universal approximation of continuous nonlinear functions
• Learning from input-output patterns: off-line/on-line
• Parallel network architecture, multiple inputs and outputs

+ Flexible and widely applicable: feedforward/recurrent networks, supervised/unsupervised learning
- Many local minima, trial and error for determining the number of neurons

ICANN 2007 ⋄ Johan Suykens

4

Support Vector Machines

[Figure: cost function versus weights, for an MLP and for an SVM]

• Nonlinear classification and function estimation by convex optimization with a unique solution and primal-dual interpretations.
• Number of neurons automatically follows from a convex program.
• Learning and generalization in high dimensional input spaces (coping with the curse of dimensionality).
• Use of kernels (e.g. linear, polynomial, RBF, MLP, splines, kernels from graphical models, ...), application-specific kernels (e.g. bioinformatics)

ICANN 2007 ⋄ Johan Suykens

5

Classifier with maximal margin

• Training set {(x_i, y_i)}_{i=1}^N : inputs x_i ∈ R^n; class labels y_i ∈ {−1, +1}

• Classifier: y(x) = sign[w^T ϕ(x) + b]
  with ϕ(·) : R^n → R^{n_h} the mapping to a high dimensional feature space (which can be infinite dimensional!)

• Maximize the margin for good generalization ability (margin = 2/||w||_2)
  (VC theory: linear SVM classifier dates back from the sixties)

[Figure: two classes (x and o) separated by hyperplanes with small and maximal margin]

ICANN 2007 ⋄ Johan Suykens

6

SVM classifier: primal and dual problem

• Primal problem: [Vapnik, 1995]

  min_{w,b,ξ} J(w, ξ) = (1/2) w^T w + c Σ_{i=1}^N ξ_i
  s.t.  y_i [w^T ϕ(x_i) + b] ≥ 1 − ξ_i,  ξ_i ≥ 0,  i = 1, ..., N

  Trade-off between margin maximization and tolerating misclassifications

• Conditions for optimality from the Lagrangian. Express the solution in the Lagrange multipliers.

• Dual problem: QP problem (convex problem)

  max_α Q(α) = −(1/2) Σ_{i,j=1}^N y_i y_j K(x_i, x_j) α_i α_j + Σ_{j=1}^N α_j
  s.t.  Σ_{i=1}^N α_i y_i = 0,   0 ≤ α_i ≤ c, ∀i

ICANN 2007 ⋄ Johan Suykens

7

Obtaining solution via Lagrangian

• Lagrangian:
  L(w, b, ξ; α, ν) = J(w, ξ) − Σ_{i=1}^N α_i { y_i [w^T ϕ(x_i) + b] − 1 + ξ_i } − Σ_{i=1}^N ν_i ξ_i

• Find saddle point:  max_{α,ν} min_{w,b,ξ} L(w, b, ξ; α, ν),  one obtains

  ∂L/∂w = 0   →  w = Σ_{i=1}^N α_i y_i ϕ(x_i)
  ∂L/∂b = 0   →  Σ_{i=1}^N α_i y_i = 0
  ∂L/∂ξ_i = 0  →  0 ≤ α_i ≤ c,  i = 1, ..., N

Finally, write the solution in terms of α (Lagrange multipliers).

ICANN 2007 ⋄ Johan Suykens

8

SVM classifier model representations

• Classifier: primal representation
  y(x) = sign[w^T ϕ(x) + b]
  Kernel trick (Mercer Theorem): K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j)

• Dual representation:
  y(x) = sign[ Σ_i α_i y_i K(x, x_i) + b ]

• Some possible kernels K(·, ·):
  K(x, x_i) = x_i^T x   (linear)
  K(x, x_i) = (x_i^T x + τ)^d   (polynomial)
  K(x, x_i) = exp(−||x − x_i||_2^2 / σ^2)   (RBF)
  K(x, x_i) = tanh(κ x_i^T x + θ)   (MLP)

• Sparseness property (many α_i = 0)

[Figure: nonlinear SVM decision boundary in the (x(1), x(2)) plane]

ICANN 2007 ⋄ Johan Suykens

9
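Not part of the original slides: a minimal numpy sketch of the kernels listed above and of evaluating the dual classifier y(x) = sign[Σ α_i y_i K(x, x_i) + b]. The support values alpha and bias b below are placeholders; in practice they come from the QP of the previous slide.

```python
import numpy as np

# Kernel functions from the slide; tau, d, sigma are user-chosen hyperparameters.
def linear_kernel(x, xi):
    return x @ xi

def poly_kernel(x, xi, tau=1.0, d=3):
    return (x @ xi + tau) ** d

def rbf_kernel(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / sigma ** 2)

def svm_decision(x, X_sv, y_sv, alpha, b, kernel=rbf_kernel):
    """Dual SVM classifier: sign( sum_i alpha_i * y_i * K(x, x_i) + b )."""
    s = sum(a * yi * kernel(x, xi) for a, yi, xi in zip(alpha, y_sv, X_sv))
    return np.sign(s + b)

# Toy usage with made-up support vectors and multipliers (normally obtained from the QP).
X_sv = np.array([[0.0, 1.0], [1.0, -1.0], [-1.0, 0.5]])
y_sv = np.array([1.0, -1.0, 1.0])
alpha = np.array([0.4, 0.7, 0.3])   # placeholder Lagrange multipliers
b = 0.1                             # placeholder bias
print(svm_decision(np.array([0.2, 0.3]), X_sv, y_sv, alpha, b))
```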

SVMs: living in two worlds ...

Primal space:  y(x) = sign[w^T ϕ(x) + b]
[Figure: network with feature map components ϕ_1(x), ..., ϕ_{n_h}(x) and weights w_1, ..., w_{n_h}; mapping from input space to feature space]

K(x_i, x_j) = ϕ(x_i)^T ϕ(x_j)   ("Kernel trick")

Dual space:  y(x) = sign[ Σ_{i=1}^{#sv} α_i y_i K(x, x_i) + b ]
[Figure: network with kernel units K(x, x_1), ..., K(x, x_#sv) and weights α_1, ..., α_#sv]

ICANN 2007 ⋄ Johan Suykens

10

Reproducing Kernel Hilbert Space (RKHS) view

• Variational problem: [Wahba, 1990; Poggio & Girosi, 1990; Evgeniou et al., 2000]
  find function f such that
  min_{f ∈ H}  (1/N) Σ_{i=1}^N L(y_i, f(x_i)) + λ ||f||_K^2
  with L(·, ·) the loss function. ||f||_K is the norm in the RKHS H defined by K.

• Representer theorem: for a convex loss function, the solution is of the form
  f(x) = Σ_{i=1}^N α_i K(x, x_i)
  Reproducing property: f(x) = <f, K_x>_K with K_x(·) = K(x, ·)

• Some special cases:
  L(y, f(x)) = (y − f(x))^2 : regularization network
  L(y, f(x)) = |y − f(x)|_ε : SVM regression with ε-insensitive loss function

[Figure: ε-insensitive loss, zero on the interval [−ε, ε]]

ICANN 2007 ⋄ Johan Suykens

11

Different views on kernel machines

[Diagram: SVM - LS-SVM - Kriging - RKHS - Gaussian Processes]

Some early history on RKHS:
1910-1920: Moore
1940: Aronszajn
1951: Krige
1970: Parzen
1971: Kimeldorf & Wahba

Obtaining complementary insights from different perspectives: kernels are used in different methodologies
Support vector machines (SVM): optimization approach (primal/dual)
Reproducing kernel Hilbert spaces (RKHS): variational problem, functional analysis
Gaussian processes (GP): probabilistic/Bayesian approach

ICANN 2007 ⋄ Johan Suykens

12

Wider use of the kernel trick

• Angle between vectors x and z: (e.g. correlation analysis)
  Input space:   cos θ_xz = x^T z / (||x||_2 ||z||_2)
  Feature space: cos θ_{ϕ(x),ϕ(z)} = ϕ(x)^T ϕ(z) / (||ϕ(x)||_2 ||ϕ(z)||_2) = K(x, z) / ( √K(x, x) √K(z, z) )

• Distance between vectors: (e.g. for "kernelized" clustering methods)
  Input space:   ||x − z||_2^2 = (x − z)^T (x − z) = x^T x + z^T z − 2 x^T z
  Feature space: ||ϕ(x) − ϕ(z)||_2^2 = K(x, x) + K(z, z) − 2 K(x, z)

ICANN 2007 ⋄ Johan Suykens

13
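As an illustration of the two identities above (not from the tutorial itself), the following small numpy sketch computes feature-space cosines and squared distances purely through kernel evaluations, using an RBF kernel as an example.

```python
import numpy as np

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / sigma ** 2)

def feature_space_cosine(x, z, K=rbf):
    # cos = K(x,z) / ( sqrt(K(x,x)) * sqrt(K(z,z)) )
    return K(x, z) / (np.sqrt(K(x, x)) * np.sqrt(K(z, z)))

def feature_space_sqdist(x, z, K=rbf):
    # ||phi(x) - phi(z)||^2 = K(x,x) + K(z,z) - 2 K(x,z)
    return K(x, x) + K(z, z) - 2.0 * K(x, z)

x = np.array([1.0, 2.0])
z = np.array([0.5, 1.5])
print(feature_space_cosine(x, z), feature_space_sqdist(x, z))
```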

Least Squares Support Vector Machines: "core problems"

• Regression (RR)
  min_{w,b,e}  w^T w + γ Σ_i e_i^2   s.t.  y_i = w^T ϕ(x_i) + b + e_i, ∀i

• Classification (FDA)
  min_{w,b,e}  w^T w + γ Σ_i e_i^2   s.t.  y_i (w^T ϕ(x_i) + b) = 1 − e_i, ∀i

• Principal component analysis (PCA)
  min_{w,b,e}  −w^T w + γ Σ_i e_i^2   s.t.  e_i = w^T ϕ(x_i) + b, ∀i

• Canonical correlation analysis / partial least squares (CCA/PLS)
  min_{w,v,b,d,e,r}  w^T w + v^T v + ν_1 Σ_i e_i^2 + ν_2 Σ_i r_i^2 − γ Σ_i e_i r_i
  s.t.  e_i = w^T ϕ_1(x_i) + b,   r_i = v^T ϕ_2(y_i) + d

• partially linear models, spectral clustering, subspace algorithms, ...

ICANN 2007 ⋄ Johan Suykens

14

LS-SVM classifier

• Preserve support vector machine methodology, but simplify via least squares and equality constraints [Suykens, 1999]

• Primal problem:
  min_{w,b,e} J(w, e) = (1/2) w^T w + (1/2) γ Σ_{i=1}^N e_i^2   s.t.  y_i [w^T ϕ(x_i) + b] = 1 − e_i, ∀i

• Dual problem:
  [ 0      y^T     ] [ b ]   [ 0   ]
  [ y   Ω + I/γ    ] [ α ] = [ 1_N ]
  where Ω_ij = y_i y_j ϕ(x_i)^T ϕ(x_j) = y_i y_j K(x_i, x_j) and y = [y_1; ...; y_N].

• LS-SVM classifiers perform very well on 20 UCI data sets [Van Gestel et al., ML 2004]
  Winning results in competition WCCI 2006 by [Cawley, 2006]

ICANN 2007 ⋄ Johan Suykens

15
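The dual problem above is a single (N+1)-dimensional linear system, so a complete LS-SVM classifier fits in a few lines of numpy. The sketch below is written for this text (it is not the LS-SVMlab toolbox); gamma and sigma are hyperparameters to be tuned, e.g. by cross-validation.

```python
import numpy as np

def rbf_matrix(X1, X2, sigma):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-d2 / sigma**2)

def lssvm_classifier_train(X, y, gamma=1.0, sigma=1.0):
    """Solve [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1_N]."""
    N = len(y)
    Omega = np.outer(y, y) * rbf_matrix(X, X, sigma)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], np.ones(N)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]          # b, alpha

def lssvm_classifier_predict(Xt, X, y, b, alpha, sigma=1.0):
    return np.sign(rbf_matrix(Xt, X, sigma) @ (alpha * y) + b)

# Toy usage on two Gaussian clouds
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.5, (20, 2)), rng.normal(1, 0.5, (20, 2))])
y = np.concatenate([-np.ones(20), np.ones(20)])
b, alpha = lssvm_classifier_train(X, y, gamma=10.0, sigma=1.0)
print(lssvm_classifier_predict(X, X, y, b, alpha, sigma=1.0)[:5])
```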

Obtaining solution from Lagrangian

• Lagrangian:
  L(w, b, e; α) = J(w, e) − Σ_{i=1}^N α_i { y_i [w^T ϕ(x_i) + b] − 1 + e_i }
  with Lagrange multipliers α_i (support values).

• Conditions for optimality:
  ∂L/∂w = 0   →  w = Σ_{i=1}^N α_i y_i ϕ(x_i)
  ∂L/∂b = 0   →  Σ_{i=1}^N α_i y_i = 0
  ∂L/∂e_i = 0  →  α_i = γ e_i,  i = 1, ..., N
  ∂L/∂α_i = 0  →  y_i [w^T ϕ(x_i) + b] − 1 + e_i = 0,  i = 1, ..., N

Eliminate w, e and write the solution in α, b.

ICANN 2007 ⋄ Johan Suykens

16

Microarray data analysis

FDA LS-SVM classifier (linear, RBF)
Kernel PCA + FDA (unsupervised selection of PCs) / (supervised selection of PCs)
Use regularization for linear classifiers

Systematic benchmarking study in [Pochet et al., Bioinformatics 2004]
Webservice: http://www.esat.kuleuven.ac.be/MACBETH/
Efficient computational methods for feature selection by rank-one updates [Ojeda et al., 2007]

ICANN 2007 ⋄ Johan Suykens

17

Weighted versions and robustness

[Diagram: convex cost function → (convex optimization) → SVM solution / LS-SVM solution; weighted version with modified cost function → (robust statistics) → SVM / weighted LS-SVM]

• Weighted LS-SVM:
  min_{w,b,e}  (1/2) w^T w + (1/2) γ Σ_{i=1}^N v_i e_i^2   s.t.  y_i = w^T ϕ(x_i) + b + e_i, ∀i
  with v_i determined from {e_i}_{i=1}^N of the unweighted LS-SVM [Suykens et al., 2002].
  Robustness and stability of reweighted kernel based regression [Debruyne et al., 2006].

• SVM solution by applying iteratively weighted LS [Perez-Cruz et al., 2005]

ICANN 2007 ⋄ Johan Suykens

18
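To make the reweighting idea concrete, here is a numpy sketch of LS-SVM regression followed by one weighted re-estimation step. The specific weighting rule below (unit weight for small residuals, decaying weight for large ones, based on a robust scale estimate) is an assumption of this sketch, one common choice rather than a rule quoted from the slide.

```python
import numpy as np

def rbf_matrix(X1, X2, sigma):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-d2 / sigma**2)

def lssvm_regression(X, y, gamma, sigma, v=None):
    """Dual system [[0, 1^T], [1, Omega + diag(1/(gamma*v))]] [b; alpha] = [0; y]."""
    N = len(y)
    v = np.ones(N) if v is None else v
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = rbf_matrix(X, X, sigma) + np.diag(1.0 / (gamma * v))
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    return sol[0], sol[1:]

# Unweighted pass, then weights from the residuals e_i = alpha_i / gamma (assumed rule).
rng = np.random.default_rng(1)
X = np.sort(rng.uniform(-3, 3, (50, 1)), axis=0)
y = np.sinc(X[:, 0]) + 0.05 * rng.normal(size=50)
y[5] += 2.0                                  # an outlier
gamma, sigma = 50.0, 1.0
b, alpha = lssvm_regression(X, y, gamma, sigma)
e = alpha / gamma
s = np.median(np.abs(e - np.median(e))) / 0.6745        # robust scale (MAD)
r = np.abs(e / s)
v = np.where(r <= 2.5, 1.0, np.where(r <= 3.0, (3.0 - r) / 0.5, 1e-4))
b_w, alpha_w = lssvm_regression(X, y, gamma, sigma, v)   # weighted LS-SVM solution
```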

Kernel principal component analysis (KPCA)

[Figure: toy data set; linear PCA versus kernel PCA (RBF kernel)]

Kernel PCA [Schölkopf et al., 1998]: take the eigenvalue decomposition of the kernel matrix

  [ K(x_1, x_1)  ...  K(x_1, x_N) ]
  [     ...      ...      ...     ]
  [ K(x_N, x_1)  ...  K(x_N, x_N) ]

(applications in dimensionality reduction and denoising)

Where is the regularization?

ICANN 2007 ⋄ Johan Suykens

19

Kernel PCA: primal and dual problem

• Underlying primal problem with regularization term [Suykens et al., 2003]

• Primal problem:
  min_{w,b,e}  −(1/2) w^T w + (1/2) γ Σ_{i=1}^N e_i^2   s.t.  e_i = w^T ϕ(x_i) + b,  i = 1, ..., N.
  (or alternatively  min  (1/2) w^T w − (1/(2γ)) Σ_{i=1}^N e_i^2 )

• Dual problem = kernel PCA:
  Ω_c α = λ α   with   λ = 1/γ
  with Ω_c,ij = (ϕ(x_i) − μ̂_ϕ)^T (ϕ(x_j) − μ̂_ϕ) the centered kernel matrix.

• Score variables (allowing also out-of-sample extensions):
  z(x) = w^T (ϕ(x) − μ̂_ϕ)
       = Σ_j α_j ( K(x_j, x) − (1/N) Σ_r K(x_r, x) − (1/N) Σ_r K(x_r, x_j) + (1/N^2) Σ_r Σ_s K(x_r, x_s) )

ICANN 2007 ⋄ Johan Suykens

20
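A compact numpy illustration of the dual problem above (centering the kernel matrix, eigendecomposition, and out-of-sample score variables), written for this text; the RBF kernel and its bandwidth sigma are assumed choices.

```python
import numpy as np

def rbf_matrix(X1, X2, sigma):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-d2 / sigma**2)

def kernel_pca(X, sigma=1.0, n_components=2):
    N = X.shape[0]
    K = rbf_matrix(X, X, sigma)
    C = np.eye(N) - np.ones((N, N)) / N      # centering matrix
    Omega_c = C @ K @ C                       # centered kernel matrix
    lam, A = np.linalg.eigh(Omega_c)          # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:n_components]
    return K, lam[idx], A[:, idx]

def kpca_scores(Xnew, X, K, alphas, sigma=1.0):
    """Out-of-sample score variables z(x) via centered kernel evaluations."""
    Knew = rbf_matrix(Xnew, X, sigma)
    Kc = (Knew - Knew.mean(1, keepdims=True)
               - K.mean(0, keepdims=True) + K.mean())
    return Kc @ alphas

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
K, lam, alphas = kernel_pca(X, sigma=2.0, n_components=2)
print(kpca_scores(X[:5], X, K, alphas, sigma=2.0))
```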

Primal versus dual problems

Example 1: microarray data (10,000 genes & 50 training data)
Classifier model: sign(w^T x + b) (primal),  sign(Σ_i α_i y_i x_i^T x + b) (dual)
primal: w ∈ R^10,000 (only 50 training data!)
dual: α ∈ R^50

Example 2: datamining problem (1,000,000 training data & 20 inputs)
primal: w ∈ R^20
dual: α ∈ R^1,000,000 (kernel matrix: 1,000,000 × 1,000,000 !)

ICANN 2007 ⋄ Johan Suykens

21

Fixed-size LS-SVM: primal-dual kernel machines

[Diagram: primal space / dual space; Nyström method, kernel PCA, density estimate, entropy criteria, eigenfunctions, SV selection, regression]

Link Nyström approximation (GP) - kernel PCA - density estimation [Girolami, 2002; Williams & Seeger, 2001]

Modelling in view of primal-dual representations [Suykens et al., 2002]: primal space estimation, sparse, large scale

ICANN 2007 ⋄ Johan Suykens

22
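A sketch of the fixed-size idea as described on the slide: select a small subset, compute the Nyström/kernel PCA eigenfunctions on it, map all data to an explicit approximate feature map, and estimate a model in the primal space. This is an illustration written for this text; the subset selection below is plain random sampling, not the entropy-based selection used in the tutorial, and the primal estimate is a simple ridge-style least squares fit.

```python
import numpy as np

def rbf_matrix(X1, X2, sigma):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-d2 / sigma**2)

def nystrom_feature_map(X, X_sub, sigma):
    """Approximate feature map phi_hat(x) from the eigendecomposition of K on the subset."""
    K_sub = rbf_matrix(X_sub, X_sub, sigma)
    lam, U = np.linalg.eigh(K_sub)
    keep = lam > 1e-10
    lam, U = lam[keep], U[:, keep]
    # phi_hat(x) = diag(1/sqrt(lam)) U^T [K(x, x_1), ..., K(x, x_M)]^T
    return rbf_matrix(X, X_sub, sigma) @ U / np.sqrt(lam)

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, (2000, 1))
y = np.sinc(X[:, 0]) + 0.05 * rng.normal(size=2000)
X_sub = X[rng.choice(2000, size=50, replace=False)]       # fixed-size subset (M = 50)
Phi = nystrom_feature_map(X, X_sub, sigma=1.0)
Phi1 = np.hstack([Phi, np.ones((Phi.shape[0], 1))])        # add bias term
gamma = 100.0
# Primal ridge/LS-SVM-style estimate: w = (Phi^T Phi + I/gamma)^{-1} Phi^T y
w = np.linalg.solve(Phi1.T @ Phi1 + np.eye(Phi1.shape[1]) / gamma, Phi1.T @ y)
y_hat = Phi1 @ w
print(np.mean((y - y_hat) ** 2))
```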

Fixed-size LS-SVM: toy examples

[Figures: fixed-size LS-SVM results on toy regression and classification examples]

Sparse representations with estimation in primal space

ICANN 2007 ⋄ Johan Suykens

23

Large scale problems: Fixed-size LS-SVM

Estimate in primal (approximate feature map from KPCA on subset)

Santa Fe laser data

[Figures: Santa Fe laser time series y_k and predictions ŷ_k versus discrete time k]

Training:             ŷ_{k+1} = f(y_k, y_{k−1}, ..., y_{k−p})
Iterative prediction: ŷ_{k+1} = f(ŷ_k, ŷ_{k−1}, ..., ŷ_{k−p})
(works well for p large, e.g. p = 50) [Espinoza et al., 2003]

ICANN 2007 ⋄ Johan Suykens

24

Partially linear models: nonlinear system identification

[Figures: input and output signals (training/validation/test) and residuals versus discrete time index]

Silver box benchmark study (physical system with cubic nonlinearity):
(top-right) full black-box, (bottom-right) partially linear

Related application: power load forecasting [Espinoza et al., 2005]

ICANN 2007 ⋄ Johan Suykens

25

- Part II contents

• generalizations to KPCA
  weighted kernel PCA
  spectral clustering
  kernel canonical correlation analysis

• model selection
  structure detection
  kernel design
  semi-supervised learning
  incorporation of constraints

• kernel maps with reference point
  dimensionality reduction and data visualization

ICANN 2007 ⋄ Johan Suykens

26

Core models + additional constraints

• Monotonicity constraints: [Pelckmans et al., 2005]
  min_{w,b,e}  w^T w + γ Σ_{i=1}^N e_i^2
  s.t.  y_i = w^T ϕ(x_i) + b + e_i,        (i = 1, ..., N)
        w^T ϕ(x_i) ≤ w^T ϕ(x_{i+1}),        (i = 1, ..., N − 1)

• Structure detection: [Pelckmans et al., 2005; Tibshirani, 1996]
  min_{w,e,t}  ρ Σ_{p=1}^P t_p + Σ_{p=1}^P w^(p)T w^(p) + Σ_{i=1}^N e_i^2
  s.t.  y_i = Σ_{p=1}^P w^(p)T ϕ^(p)(x_i^(p)) + e_i,      (∀i)
        −t_p ≤ w^(p)T ϕ^(p)(x_i^(p)) ≤ t_p,               (∀i, ∀p)

• Autocorrelated errors: [Espinoza et al., 2006]
  min_{w,b,r,e}  w^T w + γ Σ_{i=1}^N r_i^2
  s.t.  y_i = w^T ϕ(x_i) + b + e_i,   (i = 1, ..., N)
        e_i = ρ e_{i−1} + r_i,          (i = 2, ..., N)

• Spectral clustering: [Alzate & Suykens, 2006; Chung, 1997; Shi & Malik, 2000]
  min_{w,b,e}  −w^T w + γ e^T D^{−1} e   s.t.  e_i = w^T ϕ(x_i) + b,  (i = 1, ..., N)

ICANN 2007 ⋄ Johan Suykens

27

Generalizations to Kernel PCA: other loss functions

• Consider a general loss function L (L2 case = KPCA):
  min_{w,b,e}  −(1/2) w^T w + (1/2) γ Σ_{i=1}^N L(e_i)   s.t.  e_i = w^T ϕ(x_i) + b,  i = 1, ..., N.
  Generalizations of KPCA that lead to robustness and sparseness, e.g. Vapnik ε-insensitive loss, Huber loss function [Alzate & Suykens, 2006].

• Weighted least squares versions and incorporation of constraints:
  min_{w,b,e}  −(1/2) w^T w + (1/2) γ Σ_{i=1}^N v_i e_i^2
  s.t.  e_i = w^T ϕ(x_i) + b,  i = 1, ..., N
        Σ_{i=1}^N e_i e_i^(1) = 0
        ...
        Σ_{i=1}^N e_i e_i^(i−1) = 0
  Find the i-th PC w.r.t. i − 1 orthogonality constraints (previous PCs e^(j)).
  The solution is given by a generalized eigenvalue problem.

ICANN 2007 ⋄ Johan Suykens

28

Generalizations to Kernel PCA: robust denoising

[Figure: test set images corrupted with Gaussian noise and outliers; denoising by classical kernel PCA versus the robust method; bottom rows: application of different pre-image algorithms]

Robust method: improved results and fewer components needed

ICANN 2007 ⋄ Johan Suykens

29

Generalizations to Kernel PCA: sparseness

[Figures: top, denoising result; bottom, panels PC1, PC2, PC3 showing different support vectors (in black) per principal component vector]

Sparse kernel PCA using ε-insensitive loss [Alzate & Suykens, 2006]

ICANN 2007 ⋄ Johan Suykens

30


Spectral clustering: weighted KPCA

[Figure: small graph with nodes 1-6; a cut of size 2 versus a cut of size 1 (minimal cut)]

• Spectral graph clustering [Chung, 1997; Shi & Malik, 2000; Ng et al., 2002]

• Normalized cut problem Lq = λDq with L = D − W the Laplacian of the graph. Cluster membership indicators are given by q.

• Weighted LS-SVM (KPCA) formulation of the normalized cut:
  min_{w,b,e}  −(1/2) w^T w + (1/2) γ e^T V e   such that  e_i = w^T ϕ(x_i) + b,  ∀i = 1, ..., N
  with V = D^{−1} the inverse degree matrix [Alzate & Suykens, 2006].
  Allows for out-of-sample extensions on test data.

ICANN 2007 ⋄ Johan Suykens

31
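A small scipy/numpy sketch of the normalized cut relaxation Lq = λDq quoted above, applied to an RBF affinity matrix; the sign of the second (Fiedler) eigenvector gives a two-way cluster indicator. This follows the cited spectral relaxation, not the weighted LS-SVM dual of the slide, which in addition provides out-of-sample extensions.

```python
import numpy as np
from scipy.linalg import eigh

def rbf_affinity(X, sigma):
    d2 = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
    W = np.exp(-d2 / sigma**2)
    np.fill_diagonal(W, 0.0)
    return W

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-2, 0.4, (30, 2)), rng.normal(2, 0.4, (30, 2))])
W = rbf_affinity(X, sigma=1.0)
D = np.diag(W.sum(1))
L = D - W                                # graph Laplacian
lam, Q = eigh(L, D)                      # generalized eigenproblem L q = lambda D q
clusters = (Q[:, 1] > 0).astype(int)     # Fiedler vector sign as cluster membership indicator
print(clusters)
```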

Application to image segmentation

[Figure: given image (240 × 160) and the resulting image segmentation]

Large scale image: out-of-sample extension [Alzate & Suykens, 2006]

ICANN 2007 ⋄ Johan Suykens

32

Kernel Canonical Correlation Analysis

Correlation:  min_{w,v} Σ_i ||z_{x_i} − z_{y_i}||_2^2
with  z_x = w^T ϕ_1(x),  z_y = v^T ϕ_2(y)

[Figure: space X and space Y mapped by ϕ_1(·) and ϕ_2(·) to feature spaces on X and Y, and projected to correlated target spaces]

Applications of kernel CCA [Suykens et al., 2002; Bach & Jordan, 2002] e.g. in:
- bioinformatics (correlation gene network - gene expression profiles) [Vert et al., 2003]
- information retrieval, fMRI [Shawe-Taylor et al., 2004]
- state estimation of dynamical systems, subspace algorithms [Goethals et al., 2005]

ICANN 2007 ⋄ Johan Suykens

33

LS-SVM formulation to Kernel CCA

• Score variables: z_x = w^T (ϕ_1(x) − μ̂_{ϕ1}),  z_y = v^T (ϕ_2(y) − μ̂_{ϕ2})
  Feature maps ϕ_1, ϕ_2, kernels K_1(x_i, x_j) = ϕ_1(x_i)^T ϕ_1(x_j), K_2(y_i, y_j) = ϕ_2(y_i)^T ϕ_2(y_j)

• Primal problem: (Kernel PLS case: ν_1 = 0, ν_2 = 0 [Hoegaerts et al., 2004])
  max_{w,v,e,r}  γ Σ_{i=1}^N e_i r_i − ν_1 (1/2) Σ_{i=1}^N e_i^2 − ν_2 (1/2) Σ_{i=1}^N r_i^2 − (1/2) w^T w − (1/2) v^T v
  such that  e_i = w^T (ϕ_1(x_i) − μ̂_{ϕ1}),  r_i = v^T (ϕ_2(y_i) − μ̂_{ϕ2}),  ∀i
  with μ̂_{ϕ1} = (1/N) Σ_{i=1}^N ϕ_1(x_i),  μ̂_{ϕ2} = (1/N) Σ_{i=1}^N ϕ_2(y_i).

• Dual problem: generalized eigenvalue problem [Suykens et al., 2002]

  [ 0         Ω_{c,2} ] [ α ]       [ ν_1 Ω_{c,1} + I        0         ] [ α ]
  [ Ω_{c,1}      0    ] [ β ]  = λ  [        0        ν_2 Ω_{c,2} + I  ] [ β ] ,   λ = 1/γ

  with Ω_{c,1,ij} = (ϕ_1(x_i) − μ̂_{ϕ1})^T (ϕ_1(x_j) − μ̂_{ϕ1}),  Ω_{c,2,ij} = (ϕ_2(y_i) − μ̂_{ϕ2})^T (ϕ_2(y_j) − μ̂_{ϕ2})

ICANN 2007 ⋄ Johan Suykens

34
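The dual generalized eigenvalue problem above can be set up directly with numpy/scipy. The sketch below (written for this text) builds the centered kernel matrices, forms the two block matrices, and solves with scipy.linalg.eig; nu1, nu2 and the kernel bandwidths are assumed hyperparameters.

```python
import numpy as np
from scipy.linalg import eig

def rbf_matrix(X1, X2, sigma):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-d2 / sigma**2)

def centered(K):
    N = K.shape[0]
    C = np.eye(N) - np.ones((N, N)) / N
    return C @ K @ C

def kernel_cca(X, Y, sigma1=1.0, sigma2=1.0, nu1=1.0, nu2=1.0):
    N = X.shape[0]
    Oc1 = centered(rbf_matrix(X, X, sigma1))
    Oc2 = centered(rbf_matrix(Y, Y, sigma2))
    Z = np.zeros((N, N))
    A = np.block([[Z, Oc2], [Oc1, Z]])
    B = np.block([[nu1 * Oc1 + np.eye(N), Z], [Z, nu2 * Oc2 + np.eye(N)]])
    lam, V = eig(A, B)                 # generalized eigenvalue problem A v = lambda B v
    order = np.argsort(-lam.real)
    alpha, beta = V[:N, order[0]].real, V[N:, order[0]].real
    return lam.real[order[0]], alpha, beta

rng = np.random.default_rng(5)
t = rng.uniform(0, 2 * np.pi, 80)
X = np.column_stack([np.cos(t), np.sin(t)]) + 0.05 * rng.normal(size=(80, 2))
Y = np.column_stack([t, t**2]) + 0.05 * rng.normal(size=(80, 2))
lam_max, alpha, beta = kernel_cca(X, Y)
print(lam_max)
```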

System identification of Hammerstein systems

• Hammerstein system:
  x_{t+1} = A x_t + B f(u_t) + ν_t
  y_t     = C x_t + D f(u_t) + v_t
  with E{ [ν_p; v_p] [ν_q^T  v_q^T] } = [ Q  S ; S^T  R ] δ_pq.

[Diagram: input u → static nonlinearity f(·) → linear system [A, B, C, D] → output y]

• System identification problem: given {(u_t, y_t)}_{t=0}^{N−1}, estimate A, B, C, D, f.

• Subspace algorithms [Goethals et al., IEEE-AC 2005]: first estimate the state vector sequence (can also be done by KCCA)
  (for linear systems equivalent to Kalman filtering)

• Related problems: linear non-Gaussian models, links with ICA
  Kernels for linear systems and gait recognition [Bissacco et al., 2007]

ICANN 2007 ⋄ Johan Suykens

35

Bayesian inference

Level 1 - Parameters:        Posterior = Likelihood × Prior / Evidence   (maximize)
Level 2 - Hyperparameters:   Posterior = Likelihood × Prior / Evidence   (maximize)
Level 3 - Model comparison:  Posterior = Likelihood × Prior / Evidence   (maximize)

Automatic relevance determination (ARD) [MacKay, 1998]: infer the elements of the diagonal matrix S in
  K(x_i, x_j) = exp(−(x_i − x_j)^T S (x_i − x_j))
which indicate how relevant the input variables are (but: many local minima, computationally expensive).

ICANN 2007 ⋄ Johan Suykens

36

Classification of brain tumors using ARD

Bayesian learning (automatic relevance determination) of most relevant frequencies [Lu, 2005]

ICANN 2007 ⋄ Johan Suykens

37

Hierarchical Kernel Machines

[Diagram: conceptually, a hierarchical kernel machine with Level 3 (model selection), Level 2 (sparseness, structure detection) and Level 1 (LS-SVM substrate); computationally, one convex optimization]

Hierarchical modelling approach leading to a convex optimization problem
Computationally fusing training, hyperparameter and model selection
Optimization modelling: sparseness, input/structure selection, stability ... [Pelckmans et al., ML 2006]

ICANN 2007 ⋄ Johan Suykens

38

Additive regularization trade-off

• Traditional Tikhonov regularization scheme:
  min_{w,e}  w^T w + γ Σ_i e_i^2   s.t.  e_i = y_i − w^T ϕ(x_i),  ∀i = 1, ..., N
  Training solution for a fixed value of γ:  (K + I/γ) α = y
  → Selection of γ via validation set: non-convex problem

• Additive regularization trade-off [Pelckmans et al., 2005]:
  min_{w,e}  w^T w + Σ_i (e_i − c_i)^2   s.t.  e_i = y_i − w^T ϕ(x_i),  ∀i = 1, ..., N
  Training solution for a fixed value of c = [c_1; ...; c_N]:  (K + I) α = y − c
  → Selection of c via validation set: can be a convex problem

• Convex relaxation to Tikhonov regularization [Pelckmans et al., IEEE-TNN 2007]

ICANN 2007 ⋄ Johan Suykens

39
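The two training solutions above differ only in how the trade-off enters the linear system; a minimal numpy comparison (written for this text, RBF kernel and a placeholder choice of c assumed):

```python
import numpy as np

def rbf_matrix(X1, X2, sigma):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-d2 / sigma**2)

rng = np.random.default_rng(6)
X = np.sort(rng.uniform(-3, 3, (40, 1)), axis=0)
y = np.sinc(X[:, 0]) + 0.05 * rng.normal(size=40)
K = rbf_matrix(X, X, sigma=1.0)
N = len(y)

# Tikhonov: for a fixed gamma, solve (K + I/gamma) alpha = y
gamma = 10.0
alpha_tik = np.linalg.solve(K + np.eye(N) / gamma, y)

# Additive regularization: for a fixed c = [c_1; ...; c_N], solve (K + I) alpha = y - c
c = np.full(N, 0.01)        # placeholder choice of c (normally set via a validation set)
alpha_add = np.linalg.solve(K + np.eye(N), y - c)

print(np.linalg.norm(alpha_tik - alpha_add))
```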

Sparse models

• SVM classically: sparse solution from the QP problem at the training level

• Hierarchical kernel machine: fused problem with sparseness obtained at the validation level [Pelckmans et al., 2005]

[Figure: RBF LS-SVM classifier (γ = 5.3667, σ² = 0.90784) on a two-class data set in the (X1, X2) plane]

ICANN 2007 ⋄ Johan Suykens

40

Additive models and structure detection

• Additive models: ŷ(x) = Σ_{p=1}^P w^(p)T ϕ^(p)(x^(p)) with x^(p) the p-th input.
  Kernel K(x_i, x_j) = Σ_{p=1}^P K^(p)(x_i^(p), x_j^(p)).

• Structure detection [Pelckmans et al., 2005]:
  min_{w,e,t}  ρ Σ_{p=1}^P t_p + Σ_{p=1}^P w^(p)T w^(p) + Σ_{i=1}^N e_i^2
  s.t.  y_i = Σ_{p=1}^P w^(p)T ϕ^(p)(x_i^(p)) + e_i,  ∀i = 1, ..., N
        −t_p ≤ w^(p)T ϕ^(p)(x_i^(p)) ≤ t_p,  ∀i = 1, ..., N, ∀p = 1, ..., P

Study how the solution with maximal variation varies for different values of ρ

[Figure: maximal variation versus ρ for 4 relevant input variables and 21 irrelevant input variables]

ICANN 2007 ⋄ Johan Suykens

41

Incorporation of prior knowledge

• Example: LS-SVM regression with monotonicity constraint
  min_{w,b,e}  (1/2) w^T w + (1/2) γ Σ_{i=1}^N e_i^2
  s.t.  y_i = w^T ϕ(x_i) + b + e_i,  ∀i = 1, ..., N
        w^T ϕ(x_i) ≤ w^T ϕ(x_{i+1}),  ∀i = 1, ..., N − 1

• Application: estimation of a cdf [Pelckmans et al., 2005]

[Figure: empirical cdf and true cdf P(X) versus X; comparison of ecdf, cdf, Chebychev, mkr and mLS-SVM estimates]

ICANN 2007 ⋄ Johan Suykens

42

Equivalent kernels from constraints

Regression with autocorrelated errors:
  min_{w,b,r,e}  w^T w + γ Σ_i r_i^2
  s.t.  y_i = w^T ϕ(x_i) + b + e_i   (i = 1, ..., N)
        e_i = ρ e_{i−1} + r_i          (i = 2, ..., N)

leads to
  f̂(x) = Σ_{j=2}^N α_{j−1} K_eq(x_j, x) + b

with "equivalent kernel" K_eq(x_j, x_i) = K(x_j, x_i) − ρ K(x_{j−1}, x_i) where K(x_j, x_i) = ϕ(x_j)^T ϕ(x_i) [Espinoza et al., 2006].

[Diagram: modular definition of the model structure: LS-SVM regression, partially linear structure, imposing symmetry, autocorrelated residuals]

43

Application: electric load forecasting Short-term load forecasting (1-24 hours) Important for power generation decisions Hourly load values from substations in Belgian grid Seasonal/weekly/intra-daily patterns 1-hour ahead

1

1 Actual Load Linear 0.9

0.8

0.8

Normalized Load

Normalized Load

Actual Load FS−LSSVM 0.9

0.7

0.6

0.5

0.4

0.3

0.2

0.6

0.5

0.4

0.3

0.2

0.1

0.1 20

40

60

80

Hour

100

120

140

160

20

40

60

(a)

24-hours ahead

80

Hour

100

120

140

160

(b)

1

1 Actual Load FS−LSSVM

Actual Load Linear

0.9

0.9

0.8

0.8

Normalized Load

Normalized Load

Fixed-size LS-SVM →

0.7

0.7

0.6

0.5

0.4

0.3

0.2

0.7

0.6

0.5

0.4

← Linear ARX model

0.3

0.2

0.1

0.1 20

40

60

80

Hour

100

120

140

160

20

40

60

(c)

80

Hour

100

120

140

160

(d)

[Espinoza et al., 2007] ICANN 2007 ⋄ Johan Suykens

44

Semi-supervised learning

[Figures: toy data set in the (x1, x2) plane illustrating labeled and unlabeled points]

Semi-supervised learning: part labeled and part unlabeled data

Assumptions for semi-supervised learning to work [Chapelle et al., 2006]:
• Smoothness assumption: if two points x_1, x_2 in a high density region are close, then the corresponding outputs y_1, y_2 are also close
• Cluster assumption: points from the same cluster are likely of the same class
• Low density separation: the decision boundary should be in a low density region
• Manifold assumption: the data lie on a low-dimensional manifold

ICANN 2007 ⋄ Johan Suykens

45

Semi-supervised learning in RKHS

• Learning in RKHS [Belkin & Niyogi, 2004]:
  min_{f ∈ H}  (1/N) Σ_{i=1}^N V(y_i, f(x_i)) + λ ||f||_K^2 + η f^T L f
  with V(·, ·) a loss function, L the Laplacian matrix, ||f||_K the norm in the RKHS H,
  f = [f(x_1); ...; f(x_{N_l + N_u})]  (N_l, N_u the number of labeled and unlabeled data)

• Laplacian term: discretization of the Laplace-Beltrami operator

• Representer theorem: f(x) = Σ_{i=1}^{N_l + N_u} α_i K(x, x_i)

• Least squares solution case: the Laplacian acts on the kernel matrix

• Problem: true labels of unlabeled data assumed to be zero.

ICANN 2007 ⋄ Johan Suykens

46
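To make the least squares case concrete, here is a hedged numpy sketch. With f = Kα over all labeled and unlabeled points, setting the gradient of (1/N)||J(Kα − y)||² + λ α^T K α + η α^T K L K α to zero (J selects the labeled points, unlabeled targets set to zero) gives the linear system (J K + Nλ I + Nη L K) α = J y. This rearrangement is my own derivation from the slide's objective, assuming a nonsingular kernel matrix, and is not a formula quoted from the slide.

```python
import numpy as np

def rbf_matrix(X1, X2, sigma):
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return np.exp(-d2 / sigma**2)

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(-1.5, 0.4, (40, 2)), rng.normal(1.5, 0.4, (40, 2))])
y = np.zeros(80)                      # unlabeled targets set to zero
labeled = np.array([0, 1, 40, 41])    # only a few labeled points
y[labeled] = [-1, -1, 1, 1]
J = np.zeros((80, 80)); J[labeled, labeled] = 1.0

K = rbf_matrix(X, X, sigma=1.0)
W = rbf_matrix(X, X, sigma=1.0); np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(1)) - W             # graph Laplacian over labeled + unlabeled data

N, lam, eta = 80, 1e-3, 1e-3
# (J K + N*lam I + N*eta L K) alpha = J y
alpha = np.linalg.solve(J @ K + N * lam * np.eye(N) + N * eta * L @ K, J @ y)
f = K @ alpha                         # semi-supervised decision values on all points
print(np.sign(f[:3]), np.sign(f[-3:]))
```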

Formulation by adding constraints

• Semi-supervised LS-SVM model [Luts et al., 2007]:
  min_{w,e,b,ŷ}  (1/2) w^T w + (1/2) γ Σ_{i=1}^N e_i^2 + (1/2) η Σ_{i,j=1}^N v_ij (ŷ_i − ŷ_j)^2
  s.t.  ŷ_i = w^T ϕ(x_i) + b,  i = 1, ..., N
        ŷ_i = ν_i y_i − e_i,  ν_i ∈ {0, 1},  i = 1, ..., N
  where ν_i = 0 for unlabeled data, ν_i = 1 for labeled data.

• MRI image: healthy tissue versus tumor classification [Luts et al., 2007]

[Figures: nosologic images of the classified MRI slice]

[etumour FP6-2002-lifescihealth503094, healthagents FP6-2005-IST027213]

ICANN 2007 ⋄ Johan Suykens

47

Learning combination of kernels

• Take the combination K = Σ_{i=1}^m μ_i K_i (μ_i ≥ 0) (e.g. for data fusion).
  Learn μ_i as a convex problem [Lanckriet et al., JMLR 2004]

• The QP problem of the SVM:
  max_α  2 α^T 1 − α^T diag(y) K diag(y) α   s.t.  0 ≤ α ≤ C,  α^T y = 0
  is replaced by
  min_{μ_i} max_α  2 α^T 1 − α^T diag(y) (Σ_{i=1}^m μ_i K_i) diag(y) α
  s.t.  0 ≤ α ≤ C,  α^T y = 0,  trace(Σ_{i=1}^m μ_i K_i) = c,  Σ_{i=1}^m μ_i K_i ⪰ 0.

  Can be solved as a semidefinite program (SDP problem) [Boyd & Vandenberghe, 2004]
  (LMI constraint for positive definite kernel)

ICANN 2007 ⋄ Johan Suykens

48

Kernel design

- Probability product kernel:
  K(p_1, p_2) = ∫_X p_1(x)^ρ p_2(x)^ρ dx

- Prior knowledge incorporation
  [Diagram: Bayesian network over variables A, B, C, D, E]
  P(A,B,C,D,E) = P(A|B) P(B) P(C|B) P(D|C) P(E|B)

Kernels from graphical models, Bayesian networks, HMMs
Kernels tailored to data types (DNA sequence, text, chemoinformatics)
[Tsuda et al., Bioinformatics 2002; Jebara et al., JMLR 2004; Ralaivola et al., 2005]

ICANN 2007 ⋄ Johan Suykens

49

Dimensionality reduction and data visualization

• Traditionally: commonly used techniques are e.g. principal component analysis, multidimensional scaling, self-organizing maps

• More recently: isomap, locally linear embedding, Hessian locally linear embedding, diffusion maps, Laplacian eigenmaps ("kernel eigenmap methods and manifold learning") [Roweis & Saul, 2000; Coifman et al., 2005; Belkin et al., 2006]

• Relevant issues:
  - learning and generalization [Cucker & Smale, 2002; Poggio et al., 2004]
  - model representations and out-of-sample extensions
  - convex/non-convex problems, computational complexity [Smale, 1997]

• Kernel maps with reference point (KMref) [Suykens, 2007]: data visualization and dimensionality reduction by solving a linear system

ICANN 2007 ⋄ Johan Suykens

50

Kernel maps with reference point: problem statement

• Kernel maps with reference point [Suykens, 2007]:
  - LS-SVM core part: realize dimensionality reduction x → z
  - reference point q (e.g. first point; sacrificed in the visualization)

• Example: d = 2

  min_{z, w_1, w_2, b_1, b_2, e_{i,1}, e_{i,2}}  (ν/2) (z − P_D z)^T (z − P_D z) + (η/2) (w_1^T w_1 + w_2^T w_2) + (1/2) Σ_{i=1}^N (e_{i,1}^2 + e_{i,2}^2)

  such that  c_{1,1}^T z = q_1 + e_{1,1}
             c_{1,2}^T z = q_2 + e_{1,2}
             c_{i,1}^T z = w_1^T ϕ_1(x_i) + b_1 + e_{i,1},  ∀i = 2, ..., N
             c_{i,2}^T z = w_2^T ϕ_2(x_i) + b_2 + e_{i,2},  ∀i = 2, ..., N

Coordinates in low dimensional space: z = [z_1; z_2; ...; z_N] ∈ R^{dN}
Regularization term: (z − P_D z)^T (z − P_D z) = Σ_{i=1}^N || z_i − Σ_{j=1}^N s_ij D z_j ||_2^2
with D a diagonal matrix and s_ij = exp(−||x_i − x_j||_2^2 / σ^2).

ICANN 2007 ⋄ Johan Suykens

51

Kernel maps with reference point: solution

• The unique solution to the problem is given by the linear system

  [ U                      −V_1 M_1^{−1} 1     −V_2 M_2^{−1} 1 ] [ z  ]   [ η(q_1 c_{1,1} + q_2 c_{1,2}) ]
  [ −1^T M_1^{−1} V_1^T     1^T M_1^{−1} 1      0              ] [ b_1 ] = [ 0                             ]
  [ −1^T M_2^{−1} V_2^T     0                   1^T M_2^{−1} 1 ] [ b_2 ]   [ 0                             ]

  with matrices
  U = (I − P_D)^T (I − P_D) − γI + V_1 M_1^{−1} V_1^T + V_2 M_2^{−1} V_2^T + η c_{1,1} c_{1,1}^T + η c_{1,2} c_{1,2}^T
  M_1 = (1/ν) Ω_1 + (1/η) I,   M_2 = (1/ν) Ω_2 + (1/η) I
  V_1 = [c_{2,1} ... c_{N,1}],   V_2 = [c_{2,2} ... c_{N,2}]
  kernel matrices Ω_1, Ω_2 ∈ R^{(N−1)×(N−1)}: Ω_{1,ij} = K_1(x_i, x_j) = ϕ_1(x_i)^T ϕ_1(x_j),  Ω_{2,ij} = K_2(x_i, x_j) = ϕ_2(x_i)^T ϕ_2(x_j)
  positive definite kernel functions K_1(·, ·), K_2(·, ·).

ICANN 2007 ⋄ Johan Suykens

52

Kernel maps with reference point: model representations

• The primal and dual model representations allow making out-of-sample extensions. Evaluation at a point x_* ∈ R^p:

  ẑ_{*,1} = w_1^T ϕ_1(x_*) + b_1 = (1/ν) Σ_{i=2}^N α_{i,1} K_1(x_i, x_*) + b_1
  ẑ_{*,2} = w_2^T ϕ_2(x_*) + b_2 = (1/ν) Σ_{i=2}^N α_{i,2} K_2(x_i, x_*) + b_2

  Estimated coordinates for visualization: ẑ_* = [ẑ_{*,1}; ẑ_{*,2}].

• α_1, α_2 ∈ R^{N−1} are the unique solutions to the linear systems
  M_1 α_1 = V_1^T z − b_1 1_{N−1}   and   M_2 α_2 = V_2^T z − b_2 1_{N−1}
  with α_1 = [α_{2,1}; ...; α_{N,1}], α_2 = [α_{2,2}; ...; α_{N,2}], 1_{N−1} = [1; 1; ...; 1].

ICANN 2007 ⋄ Johan Suykens

53

KMref: spiral example

[Figures: 3D spiral data (x1, x2, x3) and its 2D projection (z1, z2); training data (blue *), validation data (magenta o), test data (red +)]

Model selection:
  min Σ_{i,j} ( ẑ_i^T ẑ_j / (||ẑ_i||_2 ||ẑ_j||_2) − x_i^T x_j / (||x_i||_2 ||x_j||_2) )^2

ICANN 2007 ⋄ Johan Suykens

54

KMref: swiss roll example

[Figures: given 3D swiss roll data and the KMref result - 2D projection]

600 training data, 100 validation data

ICANN 2007 ⋄ Johan Suykens

55

KMref: visualizing gene distributions

[Figures: KMref 3D projections (z1, z2, z3) of the gene distribution]

KMref 3D projection (Alon colon cancer microarray data set)
Dimension input space: 62
Number of genes: 1500 (training: 500, validation: 500, test: 500)
Model selection: σ^2 = 10^4, σ_1^2 = 10^3, σ_2^2 = 0.5 σ_1^2, σ_3^2 = 0.1 σ_1^2, η = 1, ν = 100, D = diag{10, 5, 1}, q = [+1; −1; −1].

ICANN 2007 ⋄ Johan Suykens

56

Nonlinear dynamical systems control

[Figure: cart-pole system (angle θ, input u, cart position x_p) and the controlled trajectory]

  min          Control objective + LS-SVM objective
  subject to   System dynamics (time k = 1, 2, ..., N)
               LS-SVM controller (time k = 1, 2, ..., N)

Merging optimal control and support vector machine optimization problems:
Approximate solutions to optimal control problems [Suykens et al., NN 2001]

ICANN 2007 ⋄ Johan Suykens

57

Conclusions and future challenges

• Integrative understanding and systematic design for supervised, semi-supervised, unsupervised learning and beyond
• Kernel methods: complementary views (LS-)SVM, RKHS, GP
• Least squares support vector machines as "core problems": provides a methodology for "optimization modelling"
• Bridging gaps between fundamental theory, algorithms and applications
• Reliable methods: numerically, computationally, statistically

Websites:
http://www.kernel-machines.org/
http://www.esat.kuleuven.be/sista/lssvmlab/

ICANN 2007 ⋄ Johan Suykens

58

Books

• Boyd S., Vandenberghe L., Convex Optimization, Cambridge University Press, 2004.
• Chapelle O., Schölkopf B., Zien A. (Eds.), Semi-Supervised Learning, MIT Press, 2006.
• Cristianini N., Shawe-Taylor J., An Introduction to Support Vector Machines, Cambridge University Press, 2000.
• Cucker F., Zhou D.-X., Learning Theory: an Approximation Theory Viewpoint, Cambridge University Press, 2007.
• Rasmussen C.E., Williams C.K.I., Gaussian Processes for Machine Learning, MIT Press, 2006.
• Schölkopf B., Smola A., Learning with Kernels, MIT Press, 2002.
• Schölkopf B., Tsuda K., Vert J.P. (Eds.), Kernel Methods in Computational Biology, MIT Press, 2004.
• Shawe-Taylor J., Cristianini N., Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
• Suykens J.A.K., Van Gestel T., De Brabanter J., De Moor B., Vandewalle J., Least Squares Support Vector Machines, World Scientific, Singapore, 2002.
• Suykens J.A.K., Horvath G., Basu S., Micchelli C., Vandewalle J. (Eds.), Advances in Learning Theory: Methods, Models and Applications, vol. 190 NATO-ASI Series III: Computer and Systems Sciences, IOS Press, 2003.
• Vapnik V., Statistical Learning Theory, John Wiley & Sons, 1998.
• Wahba G., Spline Models for Observational Data, Series Appl. Math., 59, SIAM, 1990.

ICANN 2007 ⋄ Johan Suykens

59