Support Vector Machines and Kernel Based Learning

Johan Suykens
K.U. Leuven, ESAT-SCD/SISTA, Kasteelpark Arenberg 10, B-3001 Leuven (Heverlee), Belgium
Tel: 32/16/32 18 02, Fax: 32/16/32 19 70
Email: [email protected]
http://www.esat.kuleuven.be/scd/

Tutorial ICANN 2007, Porto, Portugal, Sept. 2007
Living in a data world

Application areas: biomedical, process industry, energy, traffic, multimedia, bio-informatics
Kernel based learning: interdisciplinary challenges

SVM & kernel methods sit at the crossroads of many fields: neural networks, data mining, linear algebra, pattern recognition, mathematics, machine learning, optimization, statistics, signal processing, systems and control theory.

• Understanding the essential concepts and different facets of problems
• Providing systematic approaches, engineering kernel machines
• Integrative design, bridging the gaps between theory and practice
Part I contents

• neural networks and support vector machines: feature map and kernels; primal and dual problem
• classification, regression: convex problem, robustness, sparseness
• wider use of the kernel trick: least squares support vector machines as core problems; kernel principal component analysis
• large scale: fixed-size method, nonlinear modelling
Classical MLPs

[Figure: multilayer perceptron with inputs $x_1, \dots, x_n$, weights $w_1, \dots, w_n$, bias $b$, activations $h(\cdot)$ and output $y$]

Multilayer Perceptron (MLP) properties:
• Universal approximation of continuous nonlinear functions
• Learning from input-output patterns: off-line/on-line
• Parallel network architecture, multiple inputs and outputs
+ Flexible and widely applicable: feedforward/recurrent networks, supervised/unsupervised learning
- Many local minima, trial and error for determining the number of neurons
Support Vector Machines

[Figure: cost function versus weights; many local minima for the MLP, a single global minimum for the SVM]

• Nonlinear classification and function estimation by convex optimization, with a unique solution and primal-dual interpretations.
• The number of neurons automatically follows from a convex program.
• Learning and generalization in high dimensional input spaces (coping with the curse of dimensionality).
• Use of kernels (e.g. linear, polynomial, RBF, MLP, splines, kernels from graphical models, ...), application-specific kernels (e.g. bioinformatics)
Classifier with maximal margin

• Training set $\{(x_i, y_i)\}_{i=1}^N$: inputs $x_i \in \mathbb{R}^n$; class labels $y_i \in \{-1, +1\}$
• Classifier: $y(x) = \mathrm{sign}[w^T \varphi(x) + b]$, with $\varphi(\cdot): \mathbb{R}^n \rightarrow \mathbb{R}^{n_h}$ the mapping to a high dimensional feature space (which can be infinite dimensional!)
• Maximize the margin for good generalization ability (margin $= 2/\|w\|_2$)
(VC theory: the linear SVM classifier dates back to the sixties)

[Figure: two classes (x and o); among the separating hyperplanes, the one with maximal margin is selected]
SVM classifier: primal and dual problem

• Primal problem [Vapnik, 1995]:
$$\min_{w,b,\xi} \; \mathcal{J}(w,\xi) = \frac{1}{2} w^T w + c \sum_{i=1}^N \xi_i \quad \text{s.t.} \quad \begin{cases} y_i[w^T \varphi(x_i) + b] \geq 1 - \xi_i \\ \xi_i \geq 0, \; i = 1, \dots, N \end{cases}$$
Trade-off between margin maximization and tolerating misclassifications
• Conditions for optimality from the Lagrangian. Express the solution in the Lagrange multipliers.
• Dual problem: QP problem (convex problem)
$$\max_{\alpha} \; Q(\alpha) = -\frac{1}{2} \sum_{i,j=1}^N y_i y_j K(x_i, x_j)\, \alpha_i \alpha_j + \sum_{j=1}^N \alpha_j \quad \text{s.t.} \quad \sum_{i=1}^N \alpha_i y_i = 0, \;\; 0 \leq \alpha_i \leq c, \; \forall i$$
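The dual QP above maps directly onto a generic quadratic programming solver. A minimal Python sketch (assuming numpy and cvxopt are available; the RBF kernel, parameter values and function names are illustrative choices, not from the slides):

```python
import numpy as np
from cvxopt import matrix, solvers

def svm_dual_train(X, y, c=1.0, sigma2=1.0):
    """Train an SVM classifier by solving the dual QP above (sketch)."""
    N = X.shape[0]
    # RBF kernel matrix K(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2)
    sq = np.sum(X**2, axis=1)
    K = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T) / sigma2)
    H = np.outer(y, y) * K                      # H_ij = y_i y_j K(x_i, x_j)
    # cvxopt solves: min 1/2 a^T P a + q^T a  s.t.  G a <= h,  A a = b
    P = matrix(H)
    q = matrix(-np.ones(N))                     # maximize sum_j alpha_j - 1/2 a^T H a
    G = matrix(np.vstack([-np.eye(N), np.eye(N)]))
    h = matrix(np.hstack([np.zeros(N), c * np.ones(N)]))   # box 0 <= alpha_i <= c
    A = matrix(y.reshape(1, -1).astype(float))  # equality sum_i alpha_i y_i = 0
    b = matrix(0.0)
    solvers.options['show_progress'] = False
    alpha = np.array(solvers.qp(P, q, G, h, A, b)['x']).ravel()
    # bias from a free support vector (0 < alpha_i < c), using the KKT conditions
    sv = (alpha > 1e-6) & (alpha < c - 1e-6)
    bias = np.mean(y[sv] - K[sv, :] @ (alpha * y))
    return alpha, bias
```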
Obtaining solution via Lagrangian

• Lagrangian:
$$\mathcal{L}(w,b,\xi;\alpha,\nu) = \mathcal{J}(w,\xi) - \sum_{i=1}^N \alpha_i \{ y_i[w^T \varphi(x_i) + b] - 1 + \xi_i \} - \sum_{i=1}^N \nu_i \xi_i$$
• Find the saddle point $\max_{\alpha,\nu} \min_{w,b,\xi} \mathcal{L}(w,b,\xi;\alpha,\nu)$; one obtains
$$\begin{aligned} \frac{\partial \mathcal{L}}{\partial w} = 0 \;&\rightarrow\; w = \sum_{i=1}^N \alpha_i y_i \varphi(x_i) \\ \frac{\partial \mathcal{L}}{\partial b} = 0 \;&\rightarrow\; \sum_{i=1}^N \alpha_i y_i = 0 \\ \frac{\partial \mathcal{L}}{\partial \xi_i} = 0 \;&\rightarrow\; 0 \leq \alpha_i \leq c, \; i = 1, \dots, N \end{aligned}$$
Finally, write the solution in terms of $\alpha$ (Lagrange multipliers).
SVM classifier model representations

• Classifier: primal representation $y(x) = \mathrm{sign}[w^T \varphi(x) + b]$
Kernel trick (Mercer theorem): $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$
• Dual representation:
$$y(x) = \mathrm{sign}\Big[\sum_i \alpha_i y_i K(x, x_i) + b\Big]$$
• Some possible kernels $K(\cdot,\cdot)$:
$K(x, x_i) = x_i^T x$ (linear)
$K(x, x_i) = (x_i^T x + \tau)^d$ (polynomial)
$K(x, x_i) = \exp(-\|x - x_i\|_2^2/\sigma^2)$ (RBF)
$K(x, x_i) = \tanh(\kappa\, x_i^T x + \theta)$ (MLP)
• Sparseness property (many $\alpha_i = 0$)

[Figure: nonlinear decision boundary of an SVM classifier in the $(x^{(1)}, x^{(2)})$ plane]
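The listed kernels take a few lines each; a small numpy sketch, together with the dual classifier evaluation (the vectorized form and function names are mine):

```python
import numpy as np

def linear_kernel(X, Z):
    return X @ Z.T

def poly_kernel(X, Z, tau=1.0, d=3):
    return (X @ Z.T + tau) ** d

def rbf_kernel(X, Z, sigma2=1.0):
    sq_x = np.sum(X**2, axis=1)[:, None]
    sq_z = np.sum(Z**2, axis=1)[None, :]
    return np.exp(-(sq_x + sq_z - 2.0 * X @ Z.T) / sigma2)

def mlp_kernel(X, Z, kappa=1.0, theta=0.0):
    # note: the tanh kernel is positive definite only for some (kappa, theta)
    return np.tanh(kappa * X @ Z.T + theta)

def svm_predict(X_new, X_sv, y_sv, alpha_sv, b, kernel=rbf_kernel):
    # dual representation: y(x) = sign( sum_i alpha_i y_i K(x, x_i) + b )
    return np.sign(kernel(X_new, X_sv) @ (alpha_sv * y_sv) + b)
```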
SVMs: living in two worlds ...

Primal space: $y(x) = \mathrm{sign}[w^T \varphi(x) + b]$
[Figure: parametric network with hidden units $\varphi_1(x), \dots, \varphi_{n_h}(x)$ and weights $w_1, \dots, w_{n_h}$; the input space is mapped by $\varphi(x)$ to the feature space]

Kernel trick: $K(x_i, x_j) = \varphi(x_i)^T \varphi(x_j)$

Dual space: $y(x) = \mathrm{sign}\Big[\sum_{i=1}^{\#sv} \alpha_i y_i K(x, x_i) + b\Big]$
[Figure: kernel network with hidden units $K(x, x_1), \dots, K(x, x_{\#sv})$ and weights $\alpha_1, \dots, \alpha_{\#sv}$]
Reproducing Kernel Hilbert Space (RKHS) view

• Variational problem [Wahba, 1990; Poggio & Girosi, 1990; Evgeniou et al., 2000]: find the function $f$ such that
$$\min_{f \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^N L(y_i, f(x_i)) + \lambda \|f\|_K^2$$
with $L(\cdot,\cdot)$ the loss function and $\|f\|_K$ the norm in the RKHS $\mathcal{H}$ defined by $K$.
• Representer theorem: for a convex loss function, the solution is of the form
$$f(x) = \sum_{i=1}^N \alpha_i K(x, x_i)$$
Reproducing property: $f(x) = \langle f, K_x \rangle_K$ with $K_x(\cdot) = K(x, \cdot)$
• Some special cases:
$L(y, f(x)) = (y - f(x))^2$: regularization network
$L(y, f(x)) = |y - f(x)|_\epsilon$: SVM regression with the $\epsilon$-insensitive loss function (zero on the interval $[-\epsilon, +\epsilon]$)
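For the squared loss, the representer theorem reduces the variational problem to a linear system; a minimal numpy sketch (the form of the system follows from substituting the expansion into the objective; naming is mine):

```python
import numpy as np

def regularization_network(K, y, lam):
    """Squared-loss case of the variational problem (sketch).

    With f(x) = sum_i alpha_i K(x, x_i), minimizing
    (1/N) * sum_i (y_i - f(x_i))^2 + lam * alpha^T K alpha
    reduces to the linear system (K + lam * N * I) alpha = y.
    """
    N = K.shape[0]
    return np.linalg.solve(K + lam * N * np.eye(N), y)
```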
Different views on kernel machines

[Diagram: SVM, LS-SVM, RKHS, Gaussian processes and kriging as overlapping views]

Some early history on RKHS:
1910-1920: Moore
1940: Aronszajn
1951: Krige
1970: Parzen
1971: Kimeldorf & Wahba

Obtaining complementary insights from different perspectives: kernels are used in different methodologies
• Support vector machines (SVM): optimization approach (primal/dual)
• Reproducing kernel Hilbert spaces (RKHS): variational problem, functional analysis
• Gaussian processes (GP): probabilistic/Bayesian approach
Wider use of the kernel trick

• Angle between vectors (e.g. correlation analysis):
Input space: $\cos \theta_{xz} = \dfrac{x^T z}{\|x\|_2 \|z\|_2}$
Feature space: $\cos \theta_{\varphi(x),\varphi(z)} = \dfrac{\varphi(x)^T \varphi(z)}{\|\varphi(x)\|_2 \|\varphi(z)\|_2} = \dfrac{K(x,z)}{\sqrt{K(x,x)}\sqrt{K(z,z)}}$
• Distance between vectors (e.g. for "kernelized" clustering methods):
Input space: $\|x - z\|_2^2 = (x-z)^T(x-z) = x^T x + z^T z - 2x^T z$
Feature space: $\|\varphi(x) - \varphi(z)\|_2^2 = K(x,x) + K(z,z) - 2K(x,z)$
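Both quantities need only kernel evaluations, never the feature map itself; a two-line numpy sketch (function names mine):

```python
import numpy as np

def feature_space_cosine(K, i, j):
    """cos of the angle between phi(x_i) and phi(x_j), from kernel values only."""
    return K[i, j] / np.sqrt(K[i, i] * K[j, j])

def feature_space_sqdist(K, i, j):
    """||phi(x_i) - phi(x_j)||_2^2, from kernel values only."""
    return K[i, i] + K[j, j] - 2.0 * K[i, j]
```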
Least Squares Support Vector Machines: "core problems"

• Regression (RR):
$$\min_{w,b,e} w^T w + \gamma \sum_i e_i^2 \quad \text{s.t.} \quad y_i = w^T \varphi(x_i) + b + e_i, \; \forall i$$
• Classification (FDA):
$$\min_{w,b,e} w^T w + \gamma \sum_i e_i^2 \quad \text{s.t.} \quad y_i(w^T \varphi(x_i) + b) = 1 - e_i, \; \forall i$$
• Principal component analysis (PCA):
$$\min_{w,b,e} -w^T w + \gamma \sum_i e_i^2 \quad \text{s.t.} \quad e_i = w^T \varphi(x_i) + b, \; \forall i$$
• Canonical correlation analysis / partial least squares (CCA/PLS):
$$\min_{w,v,b,d,e,r} w^T w + v^T v + \nu_1 \sum_i e_i^2 + \nu_2 \sum_i r_i^2 - \gamma \sum_i e_i r_i \quad \text{s.t.} \quad \begin{cases} e_i = w^T \varphi_1(x_i) + b \\ r_i = v^T \varphi_2(y_i) + d \end{cases}$$
• partially linear models, spectral clustering, subspace algorithms, ...
LS-SVM classifier

• Preserve the support vector machine methodology, but simplify via least squares and equality constraints [Suykens, 1999]
• Primal problem:
$$\min_{w,b,e} \mathcal{J}(w,e) = \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^N e_i^2 \quad \text{s.t.} \quad y_i[w^T \varphi(x_i) + b] = 1 - e_i, \; \forall i$$
• Dual problem:
$$\begin{bmatrix} 0 & y^T \\ y & \Omega + I/\gamma \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ 1_N \end{bmatrix}$$
where $\Omega_{ij} = y_i y_j \varphi(x_i)^T \varphi(x_j) = y_i y_j K(x_i, x_j)$ and $y = [y_1; \dots; y_N]$.
• LS-SVM classifiers perform very well on 20 UCI data sets [Van Gestel et al., ML 2004]; winning results in the WCCI 2006 competition [Cawley, 2006]
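Since the dual is a linear system rather than a QP, training is a single solve; a minimal numpy sketch of the system above (the RBF kernel and parameter values are illustrative):

```python
import numpy as np

def rbf(X, Z, sigma2=1.0):
    sq_x = np.sum(X**2, axis=1)[:, None]
    sq_z = np.sum(Z**2, axis=1)[None, :]
    return np.exp(-(sq_x + sq_z - 2.0 * X @ Z.T) / sigma2)

def lssvm_classifier_train(X, y, gamma=1.0, sigma2=1.0):
    """Solve the LS-SVM dual linear system for (b, alpha)."""
    N = X.shape[0]
    Omega = np.outer(y, y) * rbf(X, X, sigma2)
    # [[0, y^T], [y, Omega + I/gamma]] [b; alpha] = [0; 1_N]
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    sol = np.linalg.solve(A, np.concatenate(([0.0], np.ones(N))))
    return sol[0], sol[1:]          # b, alpha

def lssvm_classifier_predict(X_new, X, y, b, alpha, sigma2=1.0):
    # y(x) = sign( sum_i alpha_i y_i K(x, x_i) + b )
    return np.sign(rbf(X_new, X, sigma2) @ (alpha * y) + b)
```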
Obtaining solution from Lagrangian

• Lagrangian:
$$\mathcal{L}(w,b,e;\alpha) = \mathcal{J}(w,e) - \sum_{i=1}^N \alpha_i \{ y_i[w^T \varphi(x_i) + b] - 1 + e_i \}$$
with Lagrange multipliers $\alpha_i$ (support values).
• Conditions for optimality:
$$\begin{cases} \partial \mathcal{L}/\partial w = 0 \;\rightarrow\; w = \sum_{i=1}^N \alpha_i y_i \varphi(x_i) \\ \partial \mathcal{L}/\partial b = 0 \;\rightarrow\; \sum_{i=1}^N \alpha_i y_i = 0 \\ \partial \mathcal{L}/\partial e_i = 0 \;\rightarrow\; \alpha_i = \gamma e_i, \; i = 1, \dots, N \\ \partial \mathcal{L}/\partial \alpha_i = 0 \;\rightarrow\; y_i[w^T \varphi(x_i) + b] - 1 + e_i = 0, \; i = 1, \dots, N \end{cases}$$
Eliminate $w, e$ and write the solution in $\alpha, b$.
Microarray data analysis

• FDA LS-SVM classifier (linear, RBF)
• Kernel PCA + FDA (unsupervised or supervised selection of PCs)
• Use regularization for linear classifiers

Systematic benchmarking study in [Pochet et al., Bioinformatics 2004]
Webservice: http://www.esat.kuleuven.ac.be/MACBETH/
Efficient computational methods for feature selection by rank-one updates [Ojeda et al., 2007]
Weighted versions and robustness

[Diagram: a convex cost function leads via convex optimization to the SVM solution; starting from the LS-SVM solution, robust statistics lead, via a weighted version with modified cost function, to the weighted LS-SVM solution]

• Weighted LS-SVM:
$$\min_{w,b,e} \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^N v_i e_i^2 \quad \text{s.t.} \quad y_i = w^T \varphi(x_i) + b + e_i, \; \forall i$$
with $v_i$ determined from $\{e_i\}_{i=1}^N$ of the unweighted LS-SVM [Suykens et al., 2002].
Robustness and stability of reweighted kernel based regression [Debruyne et al., 2006].
• SVM solution by applying iteratively weighted LS [Perez-Cruz et al., 2005]
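A sketch of the two-stage reweighting idea: fit the unweighted LS-SVM, derive weights from the residuals, and re-solve. The published scheme determines $v_i$ from a robust scale estimate of the residuals; the Huber-type weight function and constants below are an illustrative stand-in, not the exact rule from the paper:

```python
import numpy as np

def lssvm_regression(K, y, gamma, v=None):
    """LS-SVM regression dual: [[0, 1^T], [1, K + diag(1/(gamma*v))]] [b; a] = [0; y]."""
    N = K.shape[0]
    v = np.ones(N) if v is None else v
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.diag(1.0 / (gamma * v))
    sol = np.linalg.solve(A, np.concatenate(([0.0], y)))
    b, alpha = sol[0], sol[1:]
    e = alpha / (gamma * v)        # residuals, since alpha_i = gamma * v_i * e_i
    return b, alpha, e

def weighted_lssvm(K, y, gamma, c=1.345):
    """Two-stage robust fit: unweighted pass, then downweight large residuals."""
    b, alpha, e = lssvm_regression(K, y, gamma)
    s = 1.4826 * np.median(np.abs(e - np.median(e)))    # robust scale (MAD)
    v = np.minimum(1.0, c * s / (np.abs(e) + 1e-12))    # Huber-type weights (assumed)
    return lssvm_regression(K, y, gamma, v)
```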
Kernel principal component analysis (KPCA)

[Figures: toy data set; linear PCA (left) versus kernel PCA with RBF kernel (right)]

Kernel PCA [Schölkopf et al., 1998]: take the eigenvalue decomposition of the kernel matrix
$$\begin{bmatrix} K(x_1, x_1) & \dots & K(x_1, x_N) \\ \vdots & & \vdots \\ K(x_N, x_1) & \dots & K(x_N, x_N) \end{bmatrix}$$
(applications in dimensionality reduction and denoising)

Where is the regularization?
Kernel PCA: primal and dual problem

• Underlying primal problem with regularization term [Suykens et al., 2003]
• Primal problem:
$$\min_{w,b,e} -\frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^N e_i^2 \quad \text{s.t.} \quad e_i = w^T \varphi(x_i) + b, \; i = 1, \dots, N$$
(or alternatively $\min \frac{1}{2} w^T w - \frac{1}{2\gamma} \sum_{i=1}^N e_i^2$)
• Dual problem = kernel PCA:
$$\Omega_c \alpha = \lambda \alpha \quad \text{with } \lambda = 1/\gamma$$
with $\Omega_{c,ij} = (\varphi(x_i) - \hat{\mu}_\varphi)^T (\varphi(x_j) - \hat{\mu}_\varphi)$ the centered kernel matrix.
• Score variables (allowing also out-of-sample extensions):
$$z(x) = w^T(\varphi(x) - \hat{\mu}_\varphi) = \sum_j \alpha_j \Big( K(x_j, x) - \frac{1}{N}\sum_r K(x_r, x) - \frac{1}{N}\sum_r K(x_r, x_j) + \frac{1}{N^2}\sum_r \sum_s K(x_r, x_s) \Big)$$
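In matrix form the centering is $\Omega_c = C K C$ with $C = I - \frac{1}{N} 1 1^T$, and the out-of-sample score formula above becomes a row-wise centering of the test kernel matrix; a minimal numpy sketch (naming mine):

```python
import numpy as np

def kernel_pca(K, n_components=2):
    """Eigendecomposition of the centered kernel matrix Omega_c = C K C."""
    N = K.shape[0]
    C = np.eye(N) - np.ones((N, N)) / N
    lam, alpha = np.linalg.eigh(C @ K @ C)        # ascending eigenvalues
    order = np.argsort(lam)[::-1][:n_components]
    return lam[order], alpha[:, order]

def kpca_scores(K_new, K_train, alpha):
    """Out-of-sample score variables; K_new[m, j] = K(x_*m, x_j)."""
    Kc = (K_new
          - K_new.mean(axis=1, keepdims=True)     # (1/N) sum_r K(x_r, x)
          - K_train.mean(axis=0)                  # (1/N) sum_r K(x_r, x_j)
          + K_train.mean())                       # (1/N^2) sum_r sum_s K(x_r, x_s)
    return Kc @ alpha
```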
Primal versus dual problems

Example 1: microarray data (10,000 genes and 50 training data)
Classifier model: $\mathrm{sign}(w^T x + b)$ (primal) or $\mathrm{sign}(\sum_i \alpha_i y_i x_i^T x + b)$ (dual)
primal: $w \in \mathbb{R}^{10{,}000}$ (only 50 training data!)
dual: $\alpha \in \mathbb{R}^{50}$

Example 2: datamining problem (1,000,000 training data and 20 inputs)
primal: $w \in \mathbb{R}^{20}$
dual: $\alpha \in \mathbb{R}^{1{,}000{,}000}$ (kernel matrix: 1,000,000 × 1,000,000!)
Fixed-size LS-SVM: primal-dual kernel machines

[Diagram: in the dual space, the Nyström method, kernel PCA, density estimation, entropy criteria, eigenfunctions and SV selection; in the primal space, regression]

Link between the Nyström approximation (GP), kernel PCA and density estimation [Girolami, 2002; Williams & Seeger, 2001]

Modelling in view of primal-dual representations [Suykens et al., 2002]: primal space estimation, sparse, large scale
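A sketch of the core computational step, under my reading of the method: pick a subset of M prototype points, take the kernel PCA / Nyström eigendecomposition on that subset, and use it as an approximate finite-dimensional feature map for estimation in the primal. Subset selection by entropy criteria is omitted; a random subset stands in for it, and all names and defaults are assumptions:

```python
import numpy as np

def rbf(X, Z, sigma2=1.0):
    sq_x = np.sum(X**2, axis=1)[:, None]
    sq_z = np.sum(Z**2, axis=1)[None, :]
    return np.exp(-(sq_x + sq_z - 2.0 * X @ Z.T) / sigma2)

def fixed_size_lssvm(X, y, M=100, gamma=1.0, sigma2=1.0, seed=0):
    """Primal estimation with an M-dimensional Nystrom feature map (sketch)."""
    rng = np.random.default_rng(seed)
    Z = X[rng.choice(len(X), size=M, replace=False)]   # prototype subset (random here)
    lam, U = np.linalg.eigh(rbf(Z, Z, sigma2))         # kernel PCA on the subset
    keep = lam > 1e-10
    T = U[:, keep] / np.sqrt(lam[keep])                # kernel columns -> features
    Phi = np.hstack([rbf(X, Z, sigma2) @ T, np.ones((len(X), 1))])  # + bias column
    # ridge regression in the primal (bias regularized too, for brevity)
    w = np.linalg.solve(Phi.T @ Phi + np.eye(Phi.shape[1]) / gamma, Phi.T @ y)
    def predict(X_new):
        return np.hstack([rbf(X_new, Z, sigma2) @ T,
                          np.ones((len(X_new), 1))]) @ w
    return w, predict
```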
Fixed-size LS-SVM: toy examples

[Figures: toy regression and classification results obtained with fixed-size LS-SVM]

Sparse representations with estimation in primal space
Large scale problems: Fixed-size LS-SVM

Estimate in primal (approximate feature map from KPCA on a subset)

Santa Fe laser data
[Figures: training data $y_k$ and iterative prediction $\hat{y}_k$ versus discrete time $k$]

Training: $\hat{y}_{k+1} = f(y_k, y_{k-1}, \dots, y_{k-p})$
Iterative prediction: $\hat{y}_{k+1} = f(\hat{y}_k, \hat{y}_{k-1}, \dots, \hat{y}_{k-p})$
(works well for large $p$, e.g. $p = 50$) [Espinoza et al., 2003]
Partially linear models: nonlinear system identification

[Figures: input and output signals versus discrete time index (training, validation, test); residuals for a full black-box model (top-right) and a partially linear model (bottom-right)]

Silver box benchmark study (physical system with cubic nonlinearity)

Related application: power load forecasting [Espinoza et al., 2005]
Part II contents

• generalizations to KPCA: weighted kernel PCA, spectral clustering, kernel canonical correlation analysis
• model selection: structure detection, kernel design, semi-supervised learning, incorporation of constraints
• kernel maps with reference point: dimensionality reduction and data visualization
Core models + additional constraints

• Monotonicity constraints [Pelckmans et al., 2005]:
$$\min_{w,b,e} w^T w + \gamma \sum_{i=1}^N e_i^2 \quad \text{s.t.} \quad \begin{cases} y_i = w^T \varphi(x_i) + b + e_i, & i = 1, \dots, N \\ w^T \varphi(x_i) \leq w^T \varphi(x_{i+1}), & i = 1, \dots, N-1 \end{cases}$$
• Structure detection [Pelckmans et al., 2005; Tibshirani, 1996]:
$$\min_{w,e,t} \; \rho \sum_{p=1}^P t_p + \sum_{p=1}^P w^{(p)T} w^{(p)} + \gamma \sum_{i=1}^N e_i^2 \quad \text{s.t.} \quad \begin{cases} y_i = \sum_{p=1}^P w^{(p)T} \varphi^{(p)}(x_i^{(p)}) + e_i, & \forall i \\ -t_p \leq w^{(p)T} \varphi^{(p)}(x_i^{(p)}) \leq t_p, & \forall i, \forall p \end{cases}$$
• Autocorrelated errors [Espinoza et al., 2006]:
$$\min_{w,b,r,e} w^T w + \gamma \sum_{i=1}^N r_i^2 \quad \text{s.t.} \quad \begin{cases} y_i = w^T \varphi(x_i) + b + e_i, & i = 1, \dots, N \\ e_i = \rho e_{i-1} + r_i, & i = 2, \dots, N \end{cases}$$
• Spectral clustering [Alzate & Suykens, 2006; Chung, 1997; Shi & Malik, 2000]:
$$\min_{w,b,e} -w^T w + \gamma e^T D^{-1} e \quad \text{s.t.} \quad e_i = w^T \varphi(x_i) + b, \; i = 1, \dots, N$$
Generalizations to Kernel PCA: other loss functions

• Consider a general loss function $L$ (the $L_2$ case = KPCA):
$$\min_{w,b,e} -\frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^N L(e_i) \quad \text{s.t.} \quad e_i = w^T \varphi(x_i) + b, \; i = 1, \dots, N$$
Generalizations of KPCA that lead to robustness and sparseness, e.g. the Vapnik $\epsilon$-insensitive loss or the Huber loss function [Alzate & Suykens, 2006].
• Weighted least squares versions and incorporation of constraints:
$$\min_{w,b,e} -\frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^N v_i e_i^2 \quad \text{s.t.} \quad \begin{cases} e_i = w^T \varphi(x_i) + b, \; i = 1, \dots, N \\ \sum_{i=1}^N e_i e_i^{(1)} = 0 \\ \quad\vdots \\ \sum_{i=1}^N e_i e_i^{(i-1)} = 0 \end{cases}$$
Find the $i$-th PC w.r.t. $i-1$ orthogonality constraints (previous PCs $e^{(j)}$). The solution is given by a generalized eigenvalue problem.
Generalizations to Kernel PCA: robust denoising

[Figures: test set images corrupted with Gaussian noise and outliers; denoising by classical kernel PCA versus the robust method; bottom rows show the application of different pre-image algorithms]

Robust method: improved results and fewer components needed
Generalizations to Kernel PCA: sparseness

[Figures: denoising of a toy data set (top); different support vectors (in black) per principal component vector PC1, PC2, PC3 (bottom)]

Sparse kernel PCA using the $\epsilon$-insensitive loss [Alzate & Suykens, 2006]
Spectral clustering: weighted KPCA

• Spectral graph clustering [Chung, 1997; Shi & Malik, 2000; Ng et al., 2002]
• Normalized cut problem: $Lq = \lambda D q$ with $L = D - W$ the Laplacian of the graph. Cluster membership indicators are given by $q$.
[Figure: graph with 6 nodes; a cut of size 2 versus the minimal cut of size 1]
• Weighted LS-SVM (KPCA) formulation of the normalized cut:
$$\min_{w,b,e} -\frac{1}{2} w^T w + \gamma \frac{1}{2} e^T V e \quad \text{s.t.} \quad e_i = w^T \varphi(x_i) + b, \; \forall i = 1, \dots, N$$
with $V = D^{-1}$ the inverse degree matrix [Alzate & Suykens, 2006]. Allows for out-of-sample extensions on test data.
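A minimal sketch of the classical relaxation $Lq = \lambda Dq$ for a two-way partition (RBF affinities and thresholding at zero are illustrative choices):

```python
import numpy as np
from scipy.linalg import eigh

def normalized_cut_bipartition(X, sigma2=1.0):
    """Two-way spectral clustering from the relaxation L q = lambda D q."""
    sq = np.sum(X**2, axis=1)
    W = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T) / sigma2)
    np.fill_diagonal(W, 0.0)              # affinity matrix, no self-loops
    D = np.diag(W.sum(axis=1))            # degree matrix (assumed nonsingular)
    L = D - W                             # graph Laplacian
    lam, Q = eigh(L, D)                   # generalized symmetric eigenproblem
    q = Q[:, 1]                           # second eigenvector; lam[0] is ~0
    return (q > 0).astype(int)            # cluster membership indicators
```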
Application to image segmentation

[Figures: given image (240 × 160) and resulting image segmentation]

Large scale image: out-of-sample extension [Alzate & Suykens, 2006]
Kernel Canonical Correlation Analysis

Correlation in the target spaces: $\min_{w,v} \sum_i \|z_{x_i} - z_{y_i}\|_2^2$ with
$$z_x = w^T \varphi_1(x), \quad z_y = v^T \varphi_2(y)$$

[Figure: feature maps $\varphi_1(\cdot)$ on space X and $\varphi_2(\cdot)$ on space Y, projected to common target spaces]

Applications of kernel CCA [Suykens et al., 2002; Bach & Jordan, 2002], e.g. in:
- bioinformatics (correlation gene network - gene expression profiles) [Vert et al., 2003]
- information retrieval, fMRI [Shawe-Taylor et al., 2004]
- state estimation of dynamical systems, subspace algorithms [Goethals et al., 2005]
LS-SVM formulation to Kernel CCA

• Score variables: $z_x = w^T(\varphi_1(x) - \hat{\mu}_{\varphi_1})$, $z_y = v^T(\varphi_2(y) - \hat{\mu}_{\varphi_2})$
Feature maps $\varphi_1, \varphi_2$ and kernels $K_1(x_i, x_j) = \varphi_1(x_i)^T \varphi_1(x_j)$, $K_2(y_i, y_j) = \varphi_2(y_i)^T \varphi_2(y_j)$
• Primal problem (kernel PLS case: $\nu_1 = 0$, $\nu_2 = 0$ [Hoegaerts et al., 2004]):
$$\max_{w,v,e,r} \; \gamma \sum_{i=1}^N e_i r_i - \nu_1 \frac{1}{2} \sum_{i=1}^N e_i^2 - \nu_2 \frac{1}{2} \sum_{i=1}^N r_i^2 - \frac{1}{2} w^T w - \frac{1}{2} v^T v$$
such that $e_i = w^T(\varphi_1(x_i) - \hat{\mu}_{\varphi_1})$ and $r_i = v^T(\varphi_2(y_i) - \hat{\mu}_{\varphi_2})$, $\forall i$,
with $\hat{\mu}_{\varphi_1} = (1/N)\sum_{i=1}^N \varphi_1(x_i)$, $\hat{\mu}_{\varphi_2} = (1/N)\sum_{i=1}^N \varphi_2(y_i)$.
• Dual problem: generalized eigenvalue problem [Suykens et al., 2002]
$$\begin{bmatrix} 0 & \Omega_{c,2} \\ \Omega_{c,1} & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \lambda \begin{bmatrix} \nu_1 \Omega_{c,1} + I & 0 \\ 0 & \nu_2 \Omega_{c,2} + I \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix}, \quad \lambda = 1/\gamma$$
with $\Omega_{c,1,ij} = (\varphi_1(x_i) - \hat{\mu}_{\varphi_1})^T (\varphi_1(x_j) - \hat{\mu}_{\varphi_1})$ and $\Omega_{c,2,ij} = (\varphi_2(y_i) - \hat{\mu}_{\varphi_2})^T (\varphi_2(y_j) - \hat{\mu}_{\varphi_2})$
System identification of Hammerstein systems

• Hammerstein system (static nonlinearity $f(\cdot)$ followed by a linear system $[A, B, C, D]$):
$$\begin{aligned} x_{t+1} &= A x_t + B f(u_t) + \nu_t \\ y_t &= C x_t + D f(u_t) + v_t \end{aligned}$$
with $E\left\{ \begin{bmatrix} \nu_p \\ v_p \end{bmatrix} [\nu_q^T \; v_q^T] \right\} = \begin{bmatrix} Q & S \\ S^T & R \end{bmatrix} \delta_{pq}$.
• System identification problem: given $\{(u_t, y_t)\}_{t=0}^{N-1}$, estimate $A, B, C, D, f$.
• Subspace algorithms [Goethals et al., IEEE-AC 2005]: first estimate the state vector sequence (can also be done by KCCA; for linear systems equivalent to Kalman filtering)
• Related problems: linear non-Gaussian models, links with ICA; kernels for linear systems and gait recognition [Bissacco et al., 2007]
Bayesian inference

Three levels of inference, each maximizing a posterior of the form Posterior = (Likelihood × Prior) / Evidence:
• Level 1: parameters
• Level 2: hyperparameters
• Level 3: model comparison

Automatic relevance determination (ARD) [MacKay, 1998]: infer the elements of the diagonal matrix $S$ in
$$K(x_i, x_j) = \exp(-(x_i - x_j)^T S (x_i - x_j))$$
which indicate how relevant the input variables are (but: many local minima, computationally expensive).
Classification of brain tumors using ARD

[Figure: relevance of input frequencies as determined by ARD]

Bayesian learning (automatic relevance determination) of the most relevant frequencies [Lu, 2005]
Hierarchical Kernel Machines

[Diagram: conceptually, a hierarchy with Level 1 (LS-SVM substrate), Level 2 (sparseness, structure detection) and Level 3 (model selection); computationally, the hierarchical kernel machine is fused into one convex optimization problem]

Hierarchical modelling approach leading to a convex optimization problem
Computationally fusing training, hyperparameter and model selection
Optimization modelling: sparseness, input/structure selection, stability, ... [Pelckmans et al., ML 2006]
Additive regularization trade-off

• Traditional Tikhonov regularization scheme:
$$\min_{w,e} w^T w + \gamma \sum_i e_i^2 \quad \text{s.t.} \quad e_i = y_i - w^T \varphi(x_i), \; \forall i = 1, \dots, N$$
Training solution for a fixed value of $\gamma$: $(K + I/\gamma)\alpha = y$
→ Selection of $\gamma$ via validation set: non-convex problem
• Additive regularization trade-off [Pelckmans et al., 2005]:
$$\min_{w,e} w^T w + \sum_i (e_i - c_i)^2 \quad \text{s.t.} \quad e_i = y_i - w^T \varphi(x_i), \; \forall i = 1, \dots, N$$
Training solution for a fixed value of $c = [c_1; \dots; c_N]$: $(K + I)\alpha = y - c$
→ Selection of $c$ via validation set: can be a convex problem
• Convex relaxation to Tikhonov regularization [Pelckmans et al., IEEE-TNN 2007]
Sparse models

• SVM classically: sparse solution from the QP problem at the training level
• Hierarchical kernel machine: fused problem with sparseness obtained at the validation level [Pelckmans et al., 2005]

[Figure: RBF LS-SVM ($\gamma = 5.3667$, $\sigma^2 = 0.90784$) with two classes in the $(X_1, X_2)$ plane]
Additive models and structure detection

• Additive models: $\hat{y}(x) = \sum_{p=1}^P w^{(p)T} \varphi^{(p)}(x^{(p)})$ with $x^{(p)}$ the $p$-th input.
Kernel: $K(x_i, x_j) = \sum_{p=1}^P K^{(p)}(x_i^{(p)}, x_j^{(p)})$.
• Structure detection [Pelckmans et al., 2005]:
$$\min_{w,e,t} \; \rho \sum_{p=1}^P t_p + \sum_{p=1}^P w^{(p)T} w^{(p)} + \gamma \sum_{i=1}^N e_i^2 \quad \text{s.t.} \quad \begin{cases} y_i = \sum_{p=1}^P w^{(p)T} \varphi^{(p)}(x_i^{(p)}) + e_i, & \forall i = 1, \dots, N \\ -t_p \leq w^{(p)T} \varphi^{(p)}(x_i^{(p)}) \leq t_p, & \forall i = 1, \dots, N, \; \forall p = 1, \dots, P \end{cases}$$
Study how the solution with maximal variation varies for different values of $\rho$.

[Figure: maximal variation per input as a function of $\rho$; 4 relevant input variables keep a large maximal variation while 21 irrelevant input variables are driven to zero]
Incorporation of prior knowledge

• Example: LS-SVM regression with monotonicity constraint
$$\min_{w,b,e} \frac{1}{2} w^T w + \gamma \frac{1}{2} \sum_{i=1}^N e_i^2 \quad \text{s.t.} \quad \begin{cases} y_i = w^T \varphi(x_i) + b + e_i, & \forall i = 1, \dots, N \\ w^T \varphi(x_i) \leq w^T \varphi(x_{i+1}), & \forall i = 1, \dots, N-1 \end{cases}$$
• Application: estimation of a cumulative distribution function (cdf) [Pelckmans et al., 2005]

[Figures: empirical cdf versus true cdf (left); estimates of $P(X)$ compared (right); legend: ecdf, cdf, Chebychev, mkr, mLS-SVM]
Equivalent kernels from constraints

Regression with autocorrelated errors:
$$\min_{w,b,r,e} w^T w + \gamma \sum_i r_i^2 \quad \text{s.t.} \quad \begin{cases} y_i = w^T \varphi(x_i) + b + e_i, & i = 1, \dots, N \\ e_i = \rho e_{i-1} + r_i, & i = 2, \dots, N \end{cases}$$
leads to
$$\hat{f}(x) = \sum_{j=2}^N \alpha_{j-1} K_{eq}(x_j, x) + b$$
with "equivalent kernel" $K_{eq}(x_j, x_i) = K(x_j, x_i) - \rho K(x_{j-1}, x_i)$, where $K(x_j, x_i) = \varphi(x_j)^T \varphi(x_i)$ [Espinoza et al., 2006].

[Diagram: modular definition of the model structure: LS-SVM regression combined with a partially linear structure, imposing symmetry, and autocorrelated residuals]
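The equivalent kernel is one line in numpy; a sketch evaluating it for all usable pairs, assuming K is the Gram matrix of the training sequence with rows and columns in time order:

```python
import numpy as np

def equivalent_kernel(K, rho):
    """K_eq(x_j, x_i) = K(x_j, x_i) - rho * K(x_{j-1}, x_i), for j = 2..N.

    The result has N-1 rows, one per time step with a usable predecessor.
    """
    return K[1:, :] - rho * K[:-1, :]
```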
Application: electric load forecasting

• Short-term load forecasting (1-24 hours), important for power generation decisions
• Hourly load values from substations in the Belgian grid
• Seasonal/weekly/intra-daily patterns

[Figures: actual versus predicted normalized load per hour; panels (a)-(b) 1-hour ahead, (c)-(d) 24-hours ahead; fixed-size LS-SVM compared with a linear ARX model]

[Espinoza et al., 2007]
Semi-supervised learning

[Figures: toy data set with part labeled and part unlabeled data, and the resulting decision boundaries]

Semi-supervised learning: part labeled and part unlabeled data
Assumptions for semi-supervised learning to work [Chapelle et al., 2006]:
• Smoothness assumption: if two points $x_1, x_2$ in a high density region are close, then so are the corresponding outputs $y_1, y_2$
• Cluster assumption: points from the same cluster are likely of the same class
• Low density separation: the decision boundary should lie in a low density region
• Manifold assumption: the data lie on a low-dimensional manifold
Semi-supervised learning in RKHS

• Learning in RKHS [Belkin & Niyogi, 2004]:
$$\min_{f \in \mathcal{H}} \frac{1}{N} \sum_{i=1}^N V(y_i, f(x_i)) + \lambda \|f\|_K^2 + \eta f^T L f$$
with $V(\cdot,\cdot)$ a loss function, $L$ the Laplacian matrix, $\|f\|_K$ the norm in the RKHS $\mathcal{H}$, and $f = [f(x_1); \dots; f(x_{N_l+N_u})]$ ($N_l, N_u$ the numbers of labeled and unlabeled data)
• Laplacian term: discretization of the Laplace-Beltrami operator
• Representer theorem: $f(x) = \sum_{i=1}^{N_l+N_u} \alpha_i K(x, x_i)$
• Least squares solution case: the Laplacian acts on the kernel matrix
• Problem: the true labels of the unlabeled data are assumed to be zero.
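For the squared loss, plugging the representer expansion $f = K\alpha$ into the objective gives a linear system; a sketch under that reading (the derivation and the $1/N$ factor absorbed into the constants are mine; the Laplacian built from the kernel matrix is a stand-in for a proper neighborhood graph):

```python
import numpy as np

def laplacian_rls(K, y, labeled, lam=1e-3, eta=1e-2):
    """Semi-supervised least squares in RKHS (sketch of the idea above).

    Minimizes  sum_{i labeled} (y_i - f_i)^2 + lam * alpha^T K alpha + eta * f^T L f
    with f = K alpha; stationarity gives  (J K + lam I + eta L K) alpha = J y.
    """
    N = K.shape[0]
    J = np.diag(labeled.astype(float))     # 1 on labeled points, 0 on unlabeled
    W = K - np.diag(np.diag(K))            # kernel matrix as graph affinity (a stand-in)
    L = np.diag(W.sum(axis=1)) - W         # graph Laplacian
    alpha = np.linalg.solve(J @ K + lam * np.eye(N) + eta * L @ K, J @ y)
    return alpha                           # f(x) = sum_i alpha_i K(x, x_i)
```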
Formulation by adding constraints

• Semi-supervised LS-SVM model [Luts et al., 2007]:
$$\min_{w,e,b,\hat{y}} \frac{1}{2} w^T w + \frac{1}{2}\gamma \sum_{i=1}^N e_i^2 + \frac{1}{2}\eta \sum_{i,j=1}^N v_{ij} (\hat{y}_i - \hat{y}_j)^2$$
$$\text{s.t.} \quad \hat{y}_i = w^T \varphi(x_i) + b, \quad \hat{y}_i = \nu_i y_i - e_i, \; \nu_i \in \{0,1\}, \; i = 1, \dots, N$$
where $\nu_i = 0$ for unlabeled data and $\nu_i = 1$ for labeled data.
• MRI image: healthy tissue versus tumor classification [Luts et al., 2007]

[Figures: nosologic images of the classified tissue]
[etumour FP6-2002-lifescihealth503094, healthagents FP6-2005-IST027213]
Learning combination of kernels

• Take the combination $K = \sum_{i=1}^m \mu_i K_i$ ($\mu_i \geq 0$), e.g. for data fusion. Learn the $\mu_i$ as a convex problem [Lanckriet et al., JMLR 2004]
• The QP problem of the SVM:
$$\max_\alpha \; 2\alpha^T 1 - \alpha^T \mathrm{diag}(y) K \,\mathrm{diag}(y) \alpha \quad \text{s.t.} \quad 0 \leq \alpha \leq C, \; \alpha^T y = 0$$
is replaced by
$$\min_{\mu_i} \max_\alpha \; 2\alpha^T 1 - \alpha^T \mathrm{diag}(y) \Big(\sum_{i=1}^m \mu_i K_i\Big) \mathrm{diag}(y) \alpha \quad \text{s.t.} \quad 0 \leq \alpha \leq C, \; \alpha^T y = 0, \; \mathrm{trace}\Big(\sum_{i=1}^m \mu_i K_i\Big) = c, \; \sum_{i=1}^m \mu_i K_i \succeq 0.$$
This can be solved as a semidefinite program (SDP problem) [Boyd & Vandenberghe, 2004] (LMI constraint for the positive definite kernel)
Kernel design

- Probability product kernel:
$$K(p_1, p_2) = \int p_1(x)^\rho \, p_2(x)^\rho \, dx$$
- Prior knowledge incorporation:
[Diagram: Bayesian network over nodes A, B, C, D, E with factorization $P(A,B,C,D,E) = P(A|B)\,P(B)\,P(C|B)\,P(D|C)\,P(E|B)$]

Kernels from graphical models, Bayesian networks, HMMs
Kernels tailored to data types (DNA sequences, text, chemoinformatics)
[Tsuda et al., Bioinformatics 2002; Jebara et al., JMLR 2004; Ralaivola et al., 2005]
Dimensionality reduction and data visualization

• Traditionally: commonly used techniques are e.g. principal component analysis, multidimensional scaling, self-organizing maps
• More recently: isomap, locally linear embedding, Hessian locally linear embedding, diffusion maps, Laplacian eigenmaps ("kernel eigenmap methods and manifold learning") [Roweis & Saul, 2000; Coifman et al., 2005; Belkin et al., 2006]
• Relevant issues:
- learning and generalization [Cucker & Smale, 2002; Poggio et al., 2004]
- model representations and out-of-sample extensions
- convex/non-convex problems, computational complexity [Smale, 1997]
• Kernel maps with reference point (KMref) [Suykens, 2007]: data visualization and dimensionality reduction by solving a linear system
Kernel maps with reference point: problem statement

• Kernel maps with reference point [Suykens, 2007]:
- LS-SVM core part: realize the dimensionality reduction $x \mapsto z$
- reference point $q$ (e.g. first point; sacrificed in the visualization)
• Example for $d = 2$:
$$\min_{z,w_1,w_2,b_1,b_2,e_{i,1},e_{i,2}} \; \frac{1}{2}(z - P_D z)^T (z - P_D z) + \frac{\nu}{2}(w_1^T w_1 + w_2^T w_2) + \frac{\eta}{2} \sum_{i=1}^N (e_{i,1}^2 + e_{i,2}^2)$$
such that
$$\begin{aligned} c_{1,1}^T z &= q_1 + e_{1,1} \\ c_{1,2}^T z &= q_2 + e_{1,2} \\ c_{i,1}^T z &= w_1^T \varphi_1(x_i) + b_1 + e_{i,1}, \; \forall i = 2, \dots, N \\ c_{i,2}^T z &= w_2^T \varphi_2(x_i) + b_2 + e_{i,2}, \; \forall i = 2, \dots, N \end{aligned}$$
Coordinates in the low dimensional space: $z = [z_1; z_2; \dots; z_N] \in \mathbb{R}^{dN}$
Regularization term: $(z - P_D z)^T (z - P_D z) = \sum_{i=1}^N \big\| z_i - \sum_{j=1}^N s_{ij} D z_j \big\|_2^2$
with $D$ a diagonal matrix and $s_{ij} = \exp(-\|x_i - x_j\|_2^2/\sigma^2)$.
Kernel maps with reference point: solution

• The unique solution to the problem is given by the linear system
$$\begin{bmatrix} U & -V_1 M_1^{-1} 1 & -V_2 M_2^{-1} 1 \\ -1^T M_1^{-1} V_1^T & 1^T M_1^{-1} 1 & 0 \\ -1^T M_2^{-1} V_2^T & 0 & 1^T M_2^{-1} 1 \end{bmatrix} \begin{bmatrix} z \\ b_1 \\ b_2 \end{bmatrix} = \begin{bmatrix} \eta(q_1 c_{1,1} + q_2 c_{1,2}) \\ 0 \\ 0 \end{bmatrix}$$
with matrices
$$U = (I - P_D)^T (I - P_D) - \gamma I + V_1 M_1^{-1} V_1^T + V_2 M_2^{-1} V_2^T + \eta c_{1,1} c_{1,1}^T + \eta c_{1,2} c_{1,2}^T$$
$$M_1 = \frac{1}{\nu} \Omega_1 + \frac{1}{\eta} I, \quad M_2 = \frac{1}{\nu} \Omega_2 + \frac{1}{\eta} I, \quad V_1 = [c_{2,1} \dots c_{N,1}], \quad V_2 = [c_{2,2} \dots c_{N,2}]$$
and kernel matrices $\Omega_1, \Omega_2 \in \mathbb{R}^{(N-1)\times(N-1)}$ with $\Omega_{1,ij} = K_1(x_i, x_j) = \varphi_1(x_i)^T \varphi_1(x_j)$, $\Omega_{2,ij} = K_2(x_i, x_j) = \varphi_2(x_i)^T \varphi_2(x_j)$, for positive definite kernel functions $K_1(\cdot,\cdot)$, $K_2(\cdot,\cdot)$.
Kernel maps with reference point: model representations

• The primal and dual model representations allow making out-of-sample extensions. Evaluation at a point $x_* \in \mathbb{R}^p$:
$$\hat{z}_{*,1} = w_1^T \varphi_1(x_*) + b_1 = \frac{1}{\nu} \sum_{i=2}^N \alpha_{i,1} K_1(x_i, x_*) + b_1$$
$$\hat{z}_{*,2} = w_2^T \varphi_2(x_*) + b_2 = \frac{1}{\nu} \sum_{i=2}^N \alpha_{i,2} K_2(x_i, x_*) + b_2$$
Estimated coordinates for visualization: $\hat{z}_* = [\hat{z}_{*,1}; \hat{z}_{*,2}]$.
• $\alpha_1, \alpha_2 \in \mathbb{R}^{N-1}$ are the unique solutions to the linear systems
$$M_1 \alpha_1 = V_1^T z - b_1 1_{N-1} \quad \text{and} \quad M_2 \alpha_2 = V_2^T z - b_2 1_{N-1}$$
with $\alpha_1 = [\alpha_{2,1}; \dots; \alpha_{N,1}]$, $\alpha_2 = [\alpha_{2,2}; \dots; \alpha_{N,2}]$, $1_{N-1} = [1; 1; \dots; 1]$.
KMref: spiral example

[Figures: 3D spiral data and its 2D projection $(z_1, z_2)$; training data (blue *), validation data (magenta o), test data (red +)]

Model selection:
$$\min \sum_{i,j} \left( \frac{\hat{z}_i^T \hat{z}_j}{\|\hat{z}_i\|_2 \|\hat{z}_j\|_2} - \frac{x_i^T x_j}{\|x_i\|_2 \|x_j\|_2} \right)^2$$
KMref: swiss roll example

[Figures: given 3D swiss roll data (left); KMref result, 2D projection (right)]

600 training data, 100 validation data
KMref: visualizing gene distributions

[Figures: KMref 3D projections of the Alon colon cancer microarray data set]

Dimension of the input space: 62; number of genes: 1500 (training: 500, validation: 500, test: 500)
Model selection: $\sigma^2 = 10^4$, $\sigma_1^2 = 10^3$, $\sigma_2^2 = 0.5\sigma_1^2$, $\sigma_3^2 = 0.1\sigma_1^2$, $\eta = 1$, $\nu = 100$, $D = \mathrm{diag}\{10, 5, 1\}$, $q = [+1; -1; -1]$.
Nonlinear dynamical systems control

[Figure: pendulum-cart system with angle $\theta$, position $x_p$ and control input $u$; controlled trajectory]

$$\min \;\; \text{control objective} + \text{LS-SVM objective} \quad \text{subject to} \quad \begin{cases} \text{system dynamics (time } k = 1, 2, \dots, N) \\ \text{LS-SVM controller (time } k = 1, 2, \dots, N) \end{cases}$$

Merging the optimal control and support vector machine optimization problems: approximate solutions to optimal control problems [Suykens et al., NN 2001]
Conclusions and future challenges

• Integrative understanding and systematic design for supervised, semi-supervised, unsupervised learning and beyond
• Kernel methods: complementary views ((LS-)SVM, RKHS, GP)
• Least squares support vector machines as "core problems": provide a methodology for "optimization modelling"
• Bridging the gaps between fundamental theory, algorithms and applications
• Reliable methods: numerically, computationally, statistically

Websites:
http://www.kernel-machines.org/
http://www.esat.kuleuven.be/sista/lssvmlab/
Books

• Boyd S., Vandenberghe L., Convex Optimization, Cambridge University Press, 2004.
• Chapelle O., Schölkopf B., Zien A. (Eds.), Semi-Supervised Learning, MIT Press, 2006.
• Cristianini N., Shawe-Taylor J., An Introduction to Support Vector Machines, Cambridge University Press, 2000.
• Cucker F., Zhou D.-X., Learning Theory: an Approximation Theory Viewpoint, Cambridge University Press, 2007.
• Rasmussen C.E., Williams C.K.I., Gaussian Processes for Machine Learning, MIT Press, 2006.
• Schölkopf B., Smola A., Learning with Kernels, MIT Press, 2002.
• Schölkopf B., Tsuda K., Vert J.P. (Eds.), Kernel Methods in Computational Biology, MIT Press, 2004.
• Shawe-Taylor J., Cristianini N., Kernel Methods for Pattern Analysis, Cambridge University Press, 2004.
• Suykens J.A.K., Van Gestel T., De Brabanter J., De Moor B., Vandewalle J., Least Squares Support Vector Machines, World Scientific, Singapore, 2002.
• Suykens J.A.K., Horvath G., Basu S., Micchelli C., Vandewalle J. (Eds.), Advances in Learning Theory: Methods, Models and Applications, vol. 190, NATO-ASI Series III: Computer and Systems Sciences, IOS Press, 2003.
• Vapnik V., Statistical Learning Theory, John Wiley & Sons, 1998.
• Wahba G., Spline Models for Observational Data, Series in Applied Mathematics, vol. 59, SIAM, 1990.