NONLINEAR CLASSIFICATION AND REGRESSION
J. Elder
CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
Nonlinear Classification and Regression: Outline
- Multi-Layer Perceptrons
- The Back-Propagation Learning Algorithm
- Generalized Linear Models
- Radial Basis Function Networks
- Sparse Kernel Machines
- Nonlinear SVMs and the Kernel Trick
- Relevance Vector Machines
Implementing Logical Relations
The AND and OR operations are linearly separable problems.
The XOR Problem
XOR is not linearly separable.

x1  x2 | XOR | Class
 0   0 |  0  |  B
 0   1 |  1  |  A
 1   0 |  1  |  A
 1   1 |  0  |  B

How can we use linear classifiers to solve this problem?
Combining two linear classifiers
Idea: use a logical combination of two linear classifiers, for example

$$g_1(\mathbf{x}) = x_1 + x_2 - \tfrac{1}{2}, \qquad g_2(\mathbf{x}) = x_1 + x_2 - \tfrac{3}{2}$$
Combining two linear classifiers
Let $f(x)$ be the unit step activation function:

$$f(x) = 0, \ x < 0; \qquad f(x) = 1, \ x \ge 0$$

Observe that the classification problem is then solved by

$$f\!\left(y_1 - y_2 - \tfrac{1}{2}\right)$$

where $y_1 = f\big(g_1(\mathbf{x})\big)$ and $y_2 = f\big(g_2(\mathbf{x})\big)$.
Combining two linear classifiers
This calculation can be implemented sequentially:
1. Compute $y_1$ and $y_2$ from $x_1$ and $x_2$.
2. Compute the decision $f\!\left(y_1 - y_2 - \tfrac{1}{2}\right)$ from $y_1$ and $y_2$.
Each layer in the sequence consists of one or more linear classifications. This is therefore a two-layer perceptron.
The Two-Layer Perceptron
With $g_1(\mathbf{x}) = x_1 + x_2 - \tfrac{1}{2}$, $g_2(\mathbf{x}) = x_1 + x_2 - \tfrac{3}{2}$, $y_i = f\big(g_i(\mathbf{x})\big)$ (Layer 1) and output $f\!\left(y_1 - y_2 - \tfrac{1}{2}\right)$ (Layer 2):

x1  x2 | y1     y2    | Output
 0   0 | 0 (−)  0 (−) | B (0)
 0   1 | 1 (+)  0 (−) | A (1)
 1   0 | 1 (+)  0 (−) | A (1)
 1   1 | 1 (+)  1 (+) | B (0)
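As a quick check, here is a minimal NumPy sketch (not from the slides) of this two-layer perceptron computing XOR with unit step activations:

```python
import numpy as np

def step(a):
    """Unit step activation: 1 if a >= 0, else 0."""
    return (a >= 0).astype(float)

def two_layer_xor(x1, x2):
    # Hidden layer: y1 = f(x1 + x2 - 1/2), y2 = f(x1 + x2 - 3/2)
    y1 = step(x1 + x2 - 0.5)
    y2 = step(x1 + x2 - 1.5)
    # Output layer: f(y1 - y2 - 1/2)
    return step(y1 - y2 - 0.5)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(two_layer_xor(X[:, 0], X[:, 1]))  # [0. 1. 1. 0.] -> B, A, A, B
```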
The Two-Layer Perceptron
The first layer performs a nonlinear mapping that makes the data linearly separable:
$$y_1 = f\big(g_1(\mathbf{x})\big) \quad \text{and} \quad y_2 = f\big(g_2(\mathbf{x})\big)$$
The Two-Layer Perceptron Architecture
Input layer: $x_1$, $x_2$ (plus a constant bias input of $-1$).
Hidden layer: $g_1(\mathbf{x}) = x_1 + x_2 - \tfrac{1}{2}$ and $g_2(\mathbf{x}) = x_1 + x_2 - \tfrac{3}{2}$.
Output layer: $y_1 - y_2 - \tfrac{1}{2}$.
The Two-Layer Perceptron
Note that the hidden layer maps the plane onto the vertices of a unit square: $y_1 = f\big(g_1(\mathbf{x})\big)$ and $y_2 = f\big(g_2(\mathbf{x})\big)$ take values in $\{0, 1\}$.
Higher Dimensions
Each hidden unit realizes a hyperplane discriminant function. The output of each hidden unit is 0 or 1 depending upon the location of the input vector relative to the hyperplane:
$$\mathbf{x} \in \mathbb{R}^{l}, \qquad \mathbf{x} \rightarrow \mathbf{y} = [y_1, \ldots, y_p]^T, \quad y_i \in \{0, 1\}, \ i = 1, 2, \ldots, p$$
Higher Dimensions
Together, the hidden units map the input onto the vertices of a p-dimensional unit hypercube.
Two-Layer Perceptron
These p hyperplanes partition the l-dimensional input space into polyhedral regions. Each region corresponds to a different vertex of the p-dimensional hypercube represented by the outputs of the hidden layer.
Two-Layer Perceptron
In this example, the vertex (0, 0, 1) corresponds to the region of the input space where
$$g_1(\mathbf{x}) < 0, \quad g_2(\mathbf{x}) < 0, \quad g_3(\mathbf{x}) > 0.$$
Limitations of a Two-Layer Perceptron
The output neuron realizes a hyperplane in the transformed space that partitions the p vertices into two sets. Thus the two-layer perceptron can classify vectors into classes that consist of unions of polyhedral regions, but not any union: whether a given union is realizable depends on the relative position of the corresponding vertices. How can we solve this problem?
The Three-Layer Perceptron
Suppose that Class A consists of the union of K polyhedra in the input space. Use K neurons in the second hidden layer, and train each to classify one region as positive and the rest as negative. Then use an output neuron that implements the OR function.
The Three-Layer Perceptron
Thus the three-layer perceptron can separate classes resulting from any union of polyhedral regions in the input space.
The Three-Layer Perceptron
The first layer of the network forms the hyperplanes in the input space. The second layer forms the polyhedral regions of the input space. The third layer forms the appropriate unions of these regions and maps each to the appropriate class.
Learning Parameters
Training Data
The training data consist of N input-output pairs
$$\big(\mathbf{y}(i), \mathbf{x}(i)\big), \quad i = 1, \ldots, N$$
where
$$\mathbf{y}(i) = \big[y_1(i), \ldots, y_{k_L}(i)\big]^t \quad \text{and} \quad \mathbf{x}(i) = \big[x_1(i), \ldots, x_{k_0}(i)\big]^t.$$
Choosing an Activation Function
The unit step activation function means that the error rate of the network is a discontinuous function of the weights. This makes it difficult to learn optimal weights by minimizing the error. To fix this problem, we need a smooth activation function. A popular choice is the sigmoid function we used for logistic regression.
Smooth Activation Function
$$f(a) = \frac{1}{1 + \exp(-a)}$$
(Figure: logistic regression fits in 1D and 2D; green points denote examples where y = 0, pink points examples where y = 1.)
Output: Two Classes
For a binary classification problem, there is a single output node with activation function
$$f(a) = \frac{1}{1 + \exp(-a)}$$
Since the output is constrained to lie between 0 and 1, it can be interpreted as the probability of the input vector belonging to Class 1.
Output: K > 2 Classes
For a K-class problem, we use K outputs with the softmax function
$$y_k = \frac{\exp(a_k)}{\sum_j \exp(a_j)}$$
Since the outputs are constrained to lie between 0 and 1 and sum to 1, $y_k$ can be interpreted as the probability that the input vector belongs to Class k.
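As a quick illustration (not from the slides), a numerically stable softmax over the K output activations:

```python
import numpy as np

def softmax(a):
    """Map activations a_1..a_K to probabilities that sum to 1."""
    a = a - np.max(a)        # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # approx. [0.659 0.242 0.099]
```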
Non-Convex
Each layer of our multi-layer perceptron is a logistic regressor. Recall that optimizing the weights in logistic regression results in a convex optimization problem. Unfortunately, cascading logistic regressors in the multi-layer perceptron makes the problem non-convex, which makes it difficult to determine an exact solution. Instead, we typically use gradient descent to find a locally optimal solution for the weights. The specific learning algorithm is called the backpropagation algorithm.
End of Lecture Nov 21, 2011
The Backpropagation Algorithm
Paul J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.
Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature 323 (6088): 533–536.
Notation
Assume a network with L layers, with $k_0$ nodes in the input layer and $k_r$ nodes in the r-th layer.
Notation
Let $y_k^{r-1}$ be the output of the k-th neuron of layer $r-1$.
Let $w_{jk}^{r}$ be the weight of the synapse on the j-th neuron of layer r from the k-th neuron of layer $r-1$.
Input
$$y_k^0(i) = x_k(i), \quad k = 1, \ldots, k_0$$
Notation
Let $v_j^r$ be the total input to the j-th neuron of layer r:
$$v_j^r(i) = \big(\mathbf{w}_j^r\big)^t\, \mathbf{y}^{r-1}(i) = \sum_{k=0}^{k_{r-1}} w_{jk}^r\, y_k^{r-1}(i)$$
where we define $y_0^{r-1}(i) = +1$ to incorporate the bias term. Then
$$y_j^r(i) = f\big(v_j^r(i)\big) = f\!\left(\sum_{k=0}^{k_{r-1}} w_{jk}^r\, y_k^{r-1}(i)\right)$$
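For concreteness, a minimal NumPy sketch (an assumption, not course code) of this per-layer computation, with the bias folded in as a leading +1 entry of the previous layer's output:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def layer_forward(W, y_prev):
    """One layer of the forward pass.

    W      : (k_r, k_{r-1} + 1) weight matrix; column 0 holds the bias weights.
    y_prev : (k_{r-1},) outputs of the previous layer.
    Returns (v, y): total inputs v_j^r and outputs y_j^r = f(v_j^r).
    """
    y_aug = np.concatenate(([1.0], y_prev))  # y_0^{r-1} = +1 for the bias
    v = W @ y_aug
    return v, sigmoid(v)

# Example: a layer of 3 units receiving a 2-dimensional input
W = np.array([[0.1, 0.5, -0.3],
              [0.0, 1.0,  1.0],
              [-0.2, 0.4, 0.7]])
v, y = layer_forward(W, np.array([0.2, 0.8]))
```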
Cost Function
A common cost function is the squared error
$$J = \sum_{i=1}^{N} \varepsilon(i)$$
where
$$\varepsilon(i) = \frac{1}{2}\sum_{m=1}^{k_L} e_m(i)^2 = \frac{1}{2}\sum_{m=1}^{k_L} \big(\hat{y}_m(i) - y_m(i)\big)^2$$
and $\hat{y}_m(i) = y_m^L(i)$ is the output of the network.
Cost Function
To summarize, the error for input i is given by
$$\varepsilon(i) = \frac{1}{2}\sum_{m=1}^{k_L} e_m(i)^2 = \frac{1}{2}\sum_{m=1}^{k_L} \big(\hat{y}_m(i) - y_m(i)\big)^2$$
where $\hat{y}_m(i) = y_m^L(i)$ is the output of the output layer, and each layer is related to the previous layer through
$$y_j^r(i) = f\big(v_j^r(i)\big) \quad \text{and} \quad v_j^r(i) = \big(\mathbf{w}_j^r\big)^t\, \mathbf{y}^{r-1}(i).$$
Gradient Descent
$$\varepsilon(i) = \frac{1}{2}\sum_{m=1}^{k_L} e_m(i)^2 = \frac{1}{2}\sum_{m=1}^{k_L} \big(\hat{y}_m(i) - y_m(i)\big)^2$$
Gradient descent starts with an initial guess at the weights over all layers of the network. We then use these weights to compute the network output $\hat{\mathbf{y}}(i)$ for each input vector $\mathbf{x}(i)$ in the training data, which allows us to calculate the error $\varepsilon(i)$ for each of these inputs. Then, in order to minimize this error, we incrementally update the weights in the negative gradient direction:
$$\mathbf{w}_j^r(\text{new}) = \mathbf{w}_j^r(\text{old}) - \mu\, \frac{\partial J}{\partial \mathbf{w}_j^r} = \mathbf{w}_j^r(\text{old}) - \mu \sum_{i=1}^{N} \frac{\partial \varepsilon(i)}{\partial \mathbf{w}_j^r}$$
Gradient Descent
Since $v_j^r(i) = \big(\mathbf{w}_j^r\big)^t\, \mathbf{y}^{r-1}(i)$, the influence of the j-th weight vector of the r-th layer on the error can be expressed as
$$\frac{\partial \varepsilon(i)}{\partial \mathbf{w}_j^r} = \frac{\partial \varepsilon(i)}{\partial v_j^r(i)}\, \frac{\partial v_j^r(i)}{\partial \mathbf{w}_j^r} = \delta_j^r(i)\, \mathbf{y}^{r-1}(i)$$
where
$$\delta_j^r(i) \equiv \frac{\partial \varepsilon(i)}{\partial v_j^r(i)}.$$
Gradient Descent
$$\frac{\partial \varepsilon(i)}{\partial \mathbf{w}_j^r} = \delta_j^r(i)\, \mathbf{y}^{r-1}(i), \quad \text{where } \delta_j^r(i) \equiv \frac{\partial \varepsilon(i)}{\partial v_j^r(i)}$$
For an intermediate layer r, we cannot compute $\delta_j^r(i)$ directly. However, $\delta_j^r(i)$ can be computed inductively, starting from the output layer.
Backpropagation: The Output Layer
Recall that
$$\frac{\partial \varepsilon(i)}{\partial \mathbf{w}_j^r} = \delta_j^r(i)\, \mathbf{y}^{r-1}(i), \quad \delta_j^r(i) \equiv \frac{\partial \varepsilon(i)}{\partial v_j^r(i)}, \quad \varepsilon(i) = \frac{1}{2}\sum_{m=1}^{k_L} \big(\hat{y}_m(i) - y_m(i)\big)^2$$
and that $\hat{y}_j(i) = y_j^L(i) = f\big(v_j^L(i)\big)$. Thus at the output layer we have
$$\delta_j^L(i) = \frac{\partial \varepsilon(i)}{\partial v_j^L(i)} = \frac{\partial \varepsilon(i)}{\partial e_j(i)}\, \frac{\partial e_j(i)}{\partial v_j^L(i)} = e_j^L(i)\, f'\big(v_j^L(i)\big)$$
For the sigmoid,
$$f(a) = \frac{1}{1 + \exp(-a)} \;\Rightarrow\; f'(a) = f(a)\big(1 - f(a)\big)$$
so
$$\delta_j^L(i) = e_j^L(i)\, f\big(v_j^L(i)\big)\Big(1 - f\big(v_j^L(i)\big)\Big).$$
Backpropagation: Hidden Layers
Observe that the dependence of the error on the total input to a neuron in a previous layer can be expressed in terms of its dependence on the total inputs of the neurons in the following layer:
$$\delta_j^{r-1}(i) = \frac{\partial \varepsilon(i)}{\partial v_j^{r-1}(i)} = \sum_{k=1}^{k_r} \frac{\partial \varepsilon(i)}{\partial v_k^r(i)}\, \frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)} = \sum_{k=1}^{k_r} \delta_k^r(i)\, \frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)}$$
where
$$v_k^r(i) = \sum_{m=0}^{k_{r-1}} w_{km}^r\, y_m^{r-1}(i) = \sum_{m=0}^{k_{r-1}} w_{km}^r\, f\big(v_m^{r-1}(i)\big)$$
and so
$$\frac{\partial v_k^r(i)}{\partial v_j^{r-1}(i)} = w_{kj}^r\, f'\big(v_j^{r-1}(i)\big).$$
Thus we have
$$\delta_j^{r-1}(i) = f'\big(v_j^{r-1}(i)\big) \sum_{k=1}^{k_r} \delta_k^r(i)\, w_{kj}^r = f\big(v_j^{r-1}(i)\big)\Big(1 - f\big(v_j^{r-1}(i)\big)\Big) \sum_{k=1}^{k_r} \delta_k^r(i)\, w_{kj}^r$$
Thus once the $\delta_k^r(i)$ are determined, they can be propagated backward to calculate $\delta_j^{r-1}(i)$ using this inductive formula.
Backpropagation: Summary of Algorithm
Initialization
1. Initialize all weights with small random values.
Repeat until convergence:
Forward Pass
2. For each input vector, run the network in the forward direction, calculating
$$v_j^r(i) = \big(\mathbf{w}_j^r\big)^t\, \mathbf{y}^{r-1}(i); \qquad y_j^r(i) = f\big(v_j^r(i)\big)$$
and finally
$$\varepsilon(i) = \frac{1}{2}\sum_{m=1}^{k_L} e_m(i)^2 = \frac{1}{2}\sum_{m=1}^{k_L} \big(\hat{y}_m(i) - y_m(i)\big)^2.$$
Backward Pass
3. Starting with the output layer, compute the $\delta_j^r(i)$ using the inductive formula:
Output layer (base case): $\delta_j^L(i) = e_j^L(i)\, f'\big(v_j^L(i)\big)$
Hidden layers (inductive case): $\delta_j^{r-1}(i) = f'\big(v_j^{r-1}(i)\big) \sum_{k=1}^{k_r} \delta_k^r(i)\, w_{kj}^r$
Update Weights
4. $$\mathbf{w}_j^r(\text{new}) = \mathbf{w}_j^r(\text{old}) - \mu \sum_{i=1}^{N} \frac{\partial \varepsilon(i)}{\partial \mathbf{w}_j^r}, \quad \text{where } \frac{\partial \varepsilon(i)}{\partial \mathbf{w}_j^r} = \delta_j^r(i)\, \mathbf{y}^{r-1}(i)$$
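A compact NumPy sketch of one batch iteration of this algorithm for a fully connected network with sigmoid units. This is an illustrative implementation written for these notes, not code from the course; each weight matrix is assumed to carry its bias weights in the first column.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def backprop_epoch(weights, X, T, mu=0.1):
    """One batch gradient-descent update.

    weights : list of arrays, weights[r] has shape (k_r, k_{r-1} + 1) (column 0 = bias)
    X       : (N, k_0) inputs;  T : (N, k_L) targets
    """
    grads = [np.zeros_like(W) for W in weights]
    for x, t in zip(X, T):
        # Forward pass: store the output of every layer
        ys = [x]
        for W in weights:
            y_aug = np.concatenate(([1.0], ys[-1]))      # prepend bias input +1
            ys.append(sigmoid(W @ y_aug))
        # Backward pass: output-layer delta, then propagate backwards
        y_out = ys[-1]
        delta = (y_out - t) * y_out * (1.0 - y_out)       # e * f'(v) at the output
        for r in range(len(weights) - 1, -1, -1):
            y_aug = np.concatenate(([1.0], ys[r]))
            grads[r] += np.outer(delta, y_aug)            # d(eps)/dW = delta * y^{r-1}
            if r > 0:
                # inductive formula; drop the bias column when propagating delta
                delta = (weights[r][:, 1:].T @ delta) * ys[r] * (1.0 - ys[r])
    return [W - mu * G for W, G in zip(weights, grads)]
```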
Batch vs Online Learning
As described, on each iteration backprop updates the weights based upon all of the training data. This is called batch learning:
$$\mathbf{w}_j^r(\text{new}) = \mathbf{w}_j^r(\text{old}) - \mu \sum_{i=1}^{N} \frac{\partial \varepsilon(i)}{\partial \mathbf{w}_j^r}, \quad \text{where } \frac{\partial \varepsilon(i)}{\partial \mathbf{w}_j^r} = \delta_j^r(i)\, \mathbf{y}^{r-1}(i)$$
An alternative is to update the weights after each training input has been processed by the network, based only upon the error for that input. This is called online learning:
$$\mathbf{w}_j^r(\text{new}) = \mathbf{w}_j^r(\text{old}) - \mu\, \frac{\partial \varepsilon(i)}{\partial \mathbf{w}_j^r}$$
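A schematic contrast of the two update rules; `grad_eps` is a hypothetical function returning the gradient of ε(i) for training input i (e.g., via the backward pass sketched above):

```python
def batch_update(w, grad_eps, N, mu):
    # Batch learning: sum the gradient over all N training inputs, then step once.
    return w - mu * sum(grad_eps(w, i) for i in range(N))

def online_update(w, grad_eps, N, mu):
    # Online learning: step after every individual training input.
    for i in range(N):
        w = w - mu * grad_eps(w, i)
    return w
```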
Batch vs Online Learning
One advantage of batch learning is that averaging over all inputs when updating the weights should lead to smoother convergence. On the other hand, the randomness associated with online learning might help to prevent convergence toward a local minimum. Changing the order of presentation of the inputs from epoch to epoch may also improve results.
Remarks
Local Minima: The objective function is in general non-convex, and so the solution may not be globally optimal.
Stopping Criterion: Typically we stop when the change in the weights or the change in the error function falls below a threshold.
Learning Rate: The speed and reliability of convergence depend on the learning rate μ.
Generalizing Linear Classifiers
One way of tackling problems that are not linearly separable is to transform the input in a nonlinear fashion prior to applying a linear classifier. The result is that decision boundaries that are linear in the resulting feature space may be highly nonlinear in the original input space.
(Figure: a dataset shown in the input space $(x_1, x_2)$ and in the feature space $(\phi_1, \phi_2)$, where it becomes linearly separable.)
Nonlinear Basis Function Models
Generally,
$$y(\mathbf{x}, \mathbf{w}) = \sum_j w_j\, \phi_j(\mathbf{x})$$
where the $\phi_j(\mathbf{x})$ are known as basis functions. Typically $\phi_0(\mathbf{x}) = 1$, so that $w_0$ acts as a bias.
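As an illustration (an assumed example, not from the slides), fitting such a model by least squares with simple polynomial basis functions:

```python
import numpy as np

def design_matrix(x, M):
    """Polynomial basis functions phi_j(x) = x**j, j = 0..M-1 (phi_0 = 1 is the bias)."""
    return np.vander(x, M, increasing=True)

# Fit y(x, w) = sum_j w_j phi_j(x) to noisy data by least squares
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)
Phi = design_matrix(x, M=4)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
print(w)  # weights of the basis-function expansion
```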
Nonlinear basis functions for classification
In the context of classification, the discriminant function in the feature space becomes
$$g_y(\mathbf{x}) = w_0 + \sum_{i=1}^{M} w_i\, y_i(\mathbf{x}) = w_0 + \sum_{i=1}^{M} w_i\, \phi_i(\mathbf{x})$$
This formulation can be thought of as an input-space approximation of the true separating discriminant function $g(\mathbf{x})$ using a set of interpolation functions $\phi_i(\mathbf{x})$.
Dimensionality
The dimensionality M of the feature space may be less than, equal to, or greater than the dimensionality D of the original input space.
M < D: This may result in a factoring out of irrelevant dimensions, a reduction in the number of model parameters, and a resulting improvement in generalization (reduced over-learning).
M > D: Problems that are not linearly separable in the input space may become separable in the feature space, and the probability of linear separability generally increases with the dimensionality of the feature space. Thus choosing M >> D helps to make the problem linearly separable.
Cover's Theorem
"A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated."
— Cover, T. M., "Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition", 1965
Radial Basis Functions
Consider interpolation functions (kernels) of the form
$$\phi_i\big(\lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert\big)$$
In other words, the feature value depends only upon the Euclidean distance to a 'centre point' in the input space. A commonly used RBF is the isotropic Gaussian:
$$\phi_i(\mathbf{x}) = \exp\!\left(-\frac{1}{2\sigma_i^2}\, \lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^2\right)$$
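A minimal sketch (assumed, not course code) of computing these Gaussian RBF features for a batch of inputs and a set of centres:

```python
import numpy as np

def gaussian_rbf_features(X, centres, sigma):
    """phi_i(x) = exp(-||x - mu_i||^2 / (2 sigma^2)) for every input and centre.

    X       : (N, D) inputs
    centres : (M, D) RBF centres mu_i
    Returns : (N, M) feature matrix
    """
    sq_dist = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 1.0]])
centres = np.array([[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]])
print(gaussian_rbf_features(X, centres, sigma=0.5))
```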
Relation to KDE
We can use Gaussian RBFs to approximate the discriminant function $g(\mathbf{x})$:
$$g_y(\mathbf{x}) = w_0 + \sum_{i=1}^{M} w_i\, y_i(\mathbf{x}) = w_0 + \sum_{i=1}^{M} w_i\, \phi_i(\mathbf{x}), \quad \text{where } \phi_i(\mathbf{x}) = \exp\!\left(-\frac{1}{2\sigma_i^2}\, \lVert \mathbf{x} - \boldsymbol{\mu}_i \rVert^2\right)$$
This is reminiscent of kernel density estimation, where we approximated probability densities as a normalized sum of Gaussian kernels.
Relation to KDE
For KDE we planted a kernel at each data point, so there were N kernels. For RBF networks, we generally use far fewer kernels than the number of data points: M << N.

Example 3: $k(\mathbf{x}, \mathbf{z}) = \big(\mathbf{x}^t \mathbf{z}\big)^2$
Kernel Properties
Kernels obey certain properties that make it easy to construct complex kernels from simpler ones.
Kernel Properties: Combining Kernels
Given valid kernels $k_1(\mathbf{x}, \mathbf{x}')$ and $k_2(\mathbf{x}, \mathbf{x}')$, the following kernels will also be valid:

$k(\mathbf{x}, \mathbf{x}') = c\, k_1(\mathbf{x}, \mathbf{x}')$   (6.13)
$k(\mathbf{x}, \mathbf{x}') = f(\mathbf{x})\, k_1(\mathbf{x}, \mathbf{x}')\, f(\mathbf{x}')$   (6.14)
$k(\mathbf{x}, \mathbf{x}') = q\big(k_1(\mathbf{x}, \mathbf{x}')\big)$   (6.15)
$k(\mathbf{x}, \mathbf{x}') = \exp\big(k_1(\mathbf{x}, \mathbf{x}')\big)$   (6.16)
$k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') + k_2(\mathbf{x}, \mathbf{x}')$   (6.17)
$k(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}')\, k_2(\mathbf{x}, \mathbf{x}')$   (6.18)
$k(\mathbf{x}, \mathbf{x}') = k_3\big(\boldsymbol{\phi}(\mathbf{x}), \boldsymbol{\phi}(\mathbf{x}')\big)$   (6.19)
$k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T \mathbf{A}\, \mathbf{x}'$   (6.20)
$k(\mathbf{x}, \mathbf{x}') = k_a(\mathbf{x}_a, \mathbf{x}'_a) + k_b(\mathbf{x}_b, \mathbf{x}'_b)$   (6.21)
$k(\mathbf{x}, \mathbf{x}') = k_a(\mathbf{x}_a, \mathbf{x}'_a)\, k_b(\mathbf{x}_b, \mathbf{x}'_b)$   (6.22)

with corresponding conditions: $c > 0$, $f(\cdot)$ is any function, $q(\cdot)$ is a polynomial with nonnegative coefficients, $\boldsymbol{\phi}(\mathbf{x})$ is a mapping from $\mathbf{x}$ to $\mathbb{R}^M$, $k_3$ is a valid kernel on $\mathbb{R}^M$, $\mathbf{A}$ is a symmetric positive semidefinite matrix, $\mathbf{x}_a$ and $\mathbf{x}_b$ are variables such that $\mathbf{x} = (\mathbf{x}_a, \mathbf{x}_b)$, and $k_a$, $k_b$ are valid kernels over their respective spaces. (Equation numbers follow Bishop, Chapter 6: Kernel Methods.)
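An illustrative sketch (assumed, not course code) of composing new kernel functions from valid ones using rules (6.16)–(6.18):

```python
import numpy as np

def linear_kernel(x, z):
    return float(np.dot(x, z))

# Rule 6.17: a sum of valid kernels is a valid kernel
def sum_kernel(k1, k2):
    return lambda x, z: k1(x, z) + k2(x, z)

# Rule 6.18: a product of valid kernels is a valid kernel
def product_kernel(k1, k2):
    return lambda x, z: k1(x, z) * k2(x, z)

# Rule 6.16: the exponential of a valid kernel is a valid kernel
def exp_kernel(k1):
    return lambda x, z: np.exp(k1(x, z))

# Example: the quadratic kernel (x^t z)^2 built as a product of two linear kernels
quadratic = product_kernel(linear_kernel, linear_kernel)
x, z = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(quadratic(x, z))  # (x.z)^2 = (-1.5)^2 = 2.25
```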
Constructing Kernels
Examples:
$$k(\mathbf{x}, \mathbf{x}') = \big(\mathbf{x}^t \mathbf{x}' + c\big)^M, \quad c > 0 \qquad \text{(use 6.18)}$$
$$k(\mathbf{x}, \mathbf{x}') = \exp\big(-\lVert \mathbf{x} - \mathbf{x}' \rVert^2 / 2\sigma^2\big) \qquad \text{(use 6.14 and 6.16)}$$
The Gaussian kernel corresponds to an infinite-dimensional feature vector.
Nonlinear SVM Example (Gaussian Kernel)
(Figure: nonlinear SVM decision boundary with a Gaussian kernel, shown in the input space $(x_1, x_2)$.)
SVMs for Regression
In standard linear regression, we minimize
$$\frac{1}{2}\sum_{n=1}^{N} \big(y_n - t_n\big)^2 + \frac{\lambda}{2}\, \lVert \mathbf{w} \rVert^2$$
This penalizes all deviations from the model. To obtain sparse solutions, we replace the quadratic error function by an $\epsilon$-insensitive error function, e.g.
$$E_\epsilon\big(y(\mathbf{x}) - t\big) = \begin{cases} 0, & \text{if } |y(\mathbf{x}) - t| < \epsilon \\ |y(\mathbf{x}) - t| - \epsilon, & \text{otherwise} \end{cases}$$
See text for details of the solution.
(Figure: the $\epsilon$-insensitive tube $y \pm \epsilon$ around the regression function $y(\mathbf{x})$, with slack variables $\xi > 0$ and $\hat{\xi} > 0$ for points outside the tube.)
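A tiny sketch (assumed, not from the slides) of this ε-insensitive error function, vectorized over residuals:

```python
import numpy as np

def eps_insensitive_loss(y_pred, t, eps=0.1):
    """E_eps(y - t): zero inside the eps-tube, linear outside it."""
    return np.maximum(np.abs(y_pred - t) - eps, 0.0)

print(eps_insensitive_loss(np.array([0.0, 0.05, 0.3]), np.zeros(3), eps=0.1))
# [0.  0.  0.2]
```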
Example
(Figure: SVM regression example on one-dimensional data, t plotted against x.)
Relevance Vector Machines
Some drawbacks of SVMs:
- They do not provide posterior probabilities.
- They are not easily generalized to K > 2 classes.
- The parameters (C, ε) must be learned by cross-validation.
The Relevance Vector Machine is a sparse Bayesian kernel technique that avoids these drawbacks. RVMs also typically lead to sparser models.
RVMs for Regression
$$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}\big(t \mid y(\mathbf{x}), \beta^{-1}\big), \quad \text{where } y(\mathbf{x}) = \mathbf{w}^t \boldsymbol{\phi}(\mathbf{x})$$
In an RVM, the basis functions $\phi(\mathbf{x})$ are kernels $k(\mathbf{x}, \mathbf{x}_n)$:
$$y(\mathbf{x}) = \sum_{n=1}^{N} w_n\, k(\mathbf{x}, \mathbf{x}_n) + b$$
However, unlike in SVMs, the kernels need not be positive definite, and the $\mathbf{x}_n$ need not be the training data points.
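A minimal sketch (assumed; the weights shown are hypothetical) of evaluating this kernel expansion:

```python
import numpy as np

def rvm_predict(x, X_basis, w, b, kernel):
    """y(x) = sum_n w_n k(x, x_n) + b for basis points X_basis."""
    return sum(w_n * kernel(x, x_n) for w_n, x_n in zip(w, X_basis)) + b

gauss = lambda x, z, s=0.5: np.exp(-np.sum((x - z) ** 2) / (2 * s ** 2))
X_basis = np.array([[0.0], [0.5], [1.0]])
w = np.array([0.3, -1.2, 0.8])      # hypothetical trained weights
print(rvm_predict(np.array([0.25]), X_basis, w, b=0.1, kernel=gauss))
```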
RVMs for Regression
Likelihood:
$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} p\big(t_n \mid \mathbf{x}_n, \mathbf{w}, \beta\big)$$
where the n-th row of $\mathbf{X}$ is $\mathbf{x}_n^t$.
Prior:
$$p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{i=1}^{M} \mathcal{N}\big(w_i \mid 0, \alpha_i^{-1}\big)$$
Note that each weight parameter has its own precision hyperparameter.
RVMs for Regression
$$p(w_i \mid \alpha_i) = \mathcal{N}\big(w_i \mid 0, \alpha_i^{-1}\big), \qquad p(\alpha_i) = \text{Gam}(\alpha_i \mid a, b), \qquad p(w_i) = \text{St}(w_i \mid 2a)$$
The conjugate prior for the precision of a Gaussian is a gamma distribution. Integrating out the precision parameter,
$$p(w_i) = \int p(w_i \mid \alpha_i)\, p(\alpha_i)\, d\alpha_i,$$
leads to a Student's t distribution over $w_i$. Thus the distribution over $\mathbf{w}$ is a product of Student's t distributions. As a result, maximizing the evidence will yield a sparse $\mathbf{w}$. Note that to achieve sparsity it is critical that each $w_i$ has a separate precision parameter $\alpha_i$.
(Figure: contour plots of Gaussian and Student-t prior distributions over two parameters; the independent-hyperparameter Student-t prior has a sharp peak at zero and places most of its mass along axial ridges where one parameter is small, which promotes sparsity.)
RVMs for Regression
Gamma distribution:
$$p(x \mid a, b) = \frac{b^a}{\Gamma(a)}\, x^{a-1} e^{-bx}$$
Student's t distribution:
$$p(x \mid \nu) = \frac{\Gamma\!\big(\tfrac{\nu+1}{2}\big)}{\Gamma\!\big(\tfrac{\nu}{2}\big)\sqrt{\nu\pi}} \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}$$
Also recall the rule for transforming densities: if y is a monotonic function of x, then
$$p_Y(y) = p_X(x)\left|\frac{dx}{dy}\right|.$$
Thus if we let $a \to 0$, $b \to 0$, then $p(\log \alpha_i) \to$ uniform and $p(w_i) \propto |w_i|^{-1}$. Very sparse!
RVMs for Regression
Likelihood:
$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} p\big(t_n \mid \mathbf{x}_n, \mathbf{w}, \beta\big)$$
where the n-th row of $\mathbf{X}$ is $\mathbf{x}_n^t$.
Prior:
$$p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{i=1}^{M} \mathcal{N}\big(w_i \mid 0, \alpha_i^{-1}\big)$$
In practice, it is difficult to integrate $\boldsymbol{\alpha}$ out exactly. Instead, we use an approximate maximum likelihood method, finding ML values for each $\alpha_i$. When we maximize the evidence with respect to these hyperparameters, many will $\to \infty$. As a result, the corresponding weights will $\to 0$, yielding a sparse solution.
RVMs for Regression
Since both the likelihood and prior are normal, the posterior over $\mathbf{w}$ will also be normal:
$$p(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \boldsymbol{\alpha}, \beta) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}, \boldsymbol{\Sigma})$$
where
$$\mathbf{m} = \beta\, \boldsymbol{\Sigma}\, \boldsymbol{\Phi}^t \mathbf{t}, \qquad \boldsymbol{\Sigma} = \big(\mathbf{A} + \beta\, \boldsymbol{\Phi}^t \boldsymbol{\Phi}\big)^{-1}$$
with $\Phi_{ni} = \phi_i(\mathbf{x}_n)$ and $\mathbf{A} = \text{diag}(\alpha_i)$.
Note that when $\alpha_i \to \infty$, the i-th row and column of $\boldsymbol{\Sigma} \to 0$, and
$$p(w_i \mid \mathbf{t}, \mathbf{X}, \boldsymbol{\alpha}, \beta) = \mathcal{N}(w_i \mid 0, 0).$$
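A compact NumPy sketch (assumed, not course code) of computing this posterior for given hyperparameters:

```python
import numpy as np

def rvm_posterior(Phi, t, alpha, beta):
    """Posterior mean m and covariance Sigma of the RVM weights.

    Phi   : (N, M) design matrix, Phi[n, i] = phi_i(x_n)
    t     : (N,) targets
    alpha : (M,) precision hyperparameters
    beta  : noise precision
    """
    A = np.diag(alpha)
    Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)   # (A + beta Phi^t Phi)^(-1)
    m = beta * Sigma @ Phi.T @ t                    # beta Sigma Phi^t t
    return m, Sigma
```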
RVMs for Regression
The values for $\boldsymbol{\alpha}$ and $\beta$ are determined using the evidence approximation, where we maximize
$$p(\mathbf{t} \mid \mathbf{X}, \boldsymbol{\alpha}, \beta) = \int p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \boldsymbol{\alpha})\, d\mathbf{w}$$
In general, this results in many of the precision parameters $\alpha_i \to \infty$, so that $w_i \to 0$. Unfortunately, this is a non-convex problem.
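The slides do not give the re-estimation equations; the sketch below uses a standard iterative scheme from Tipping/Bishop (stated here as an assumption): $\gamma_i = 1 - \alpha_i \Sigma_{ii}$, $\alpha_i \leftarrow \gamma_i / m_i^2$, $\beta^{-1} \leftarrow \lVert \mathbf{t} - \boldsymbol{\Phi}\mathbf{m} \rVert^2 / (N - \sum_i \gamma_i)$.

```python
import numpy as np

def rvm_fit(Phi, t, n_iter=100, alpha_cap=1e9):
    """Iterative evidence maximization for the RVM (a common re-estimation scheme)."""
    N, M = Phi.shape
    alpha = np.ones(M)
    beta = 1.0
    for _ in range(n_iter):
        # Posterior for the current hyperparameters
        Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
        m = beta * Sigma @ Phi.T @ t
        # Hyperparameter updates; many alpha_i grow very large, pruning their weights
        gamma = 1.0 - alpha * np.diag(Sigma)        # well-determined parameters
        alpha = np.minimum(gamma / (m ** 2 + 1e-12), alpha_cap)
        beta = (N - gamma.sum()) / (np.sum((t - Phi @ m) ** 2) + 1e-12)
    return m, Sigma, alpha, beta
```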
Example
(Figure: RVM regression example on one-dimensional data, t plotted against x.)