NONLINEAR CLASSIFICATION AND REGRESSION J. Elder

CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition

Nonlinear Classification and Regression: Outline

- Multi-Layer Perceptrons
  - The Back-Propagation Learning Algorithm
- Generalized Linear Models
  - Radial Basis Function Networks
- Sparse Kernel Machines
  - Nonlinear SVMs and the Kernel Trick
  - Relevance Vector Machines

Implementing Logical Relations

- The AND and OR operations define linearly separable classification problems.

The XOR Problem

- XOR is not linearly separable:

  x1 | x2 | XOR | Class
  ---+----+-----+------
   0 |  0 |  0  |  B
   0 |  1 |  1  |  A
   1 |  0 |  1  |  A
   1 |  1 |  0  |  B

- How can we use linear classifiers to solve this problem?

Combining Two Linear Classifiers

- Idea: use a logical combination of two linear classifiers, e.g.

  g1(x) = x1 + x2 - 1/2
  g2(x) = x1 + x2 - 3/2

Combining Two Linear Classifiers

- Let f(x) be the unit step activation function:

  f(x) = 0 for x < 0,  f(x) = 1 for x >= 0.

- Observe that the classification problem is then solved by

  f( y1 - y2 - 1/2 ),

  where y1 = f( g1(x) ) and y2 = f( g2(x) ),
  with g1(x) = x1 + x2 - 1/2 and g2(x) = x1 + x2 - 3/2.

Combining Two Linear Classifiers

- This calculation can be implemented sequentially:
  1. Compute y1 and y2 from x1 and x2.
  2. Compute the decision f( y1 - y2 - 1/2 ) from y1 and y2.
- Each layer in the sequence consists of one or more linear classifications.
- This is therefore a two-layer perceptron, with
  y1 = f( g1(x) ), y2 = f( g2(x) ),
  g1(x) = x1 + x2 - 1/2, g2(x) = x1 + x2 - 3/2.
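The construction above can be checked directly. The following minimal sketch (Python/NumPy, not part of the original slides) implements the two hidden units g1, g2 and the output unit f(y1 - y2 - 1/2) with unit-step activations and reproduces the XOR truth table.

```python
import numpy as np

def step(a):
    """Unit step activation: 0 for a < 0, 1 for a >= 0."""
    return (a >= 0).astype(float)

def two_layer_xor(x):
    """x: array of shape (N, 2) with binary inputs; returns class label 0 (B) or 1 (A)."""
    # Hidden layer: y1 = f(x1 + x2 - 1/2), y2 = f(x1 + x2 - 3/2)
    y1 = step(x[:, 0] + x[:, 1] - 0.5)
    y2 = step(x[:, 0] + x[:, 1] - 1.5)
    # Output layer: f(y1 - y2 - 1/2)
    return step(y1 - y2 - 0.5)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
print(two_layer_xor(X))   # -> [0. 1. 1. 0.], i.e. B, A, A, B
```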

The Two-Layer Perceptron

Layer 1:  g1(x) = x1 + x2 - 1/2,  g2(x) = x1 + x2 - 3/2
Layer 2:  f( y1 - y2 - 1/2 ),  where y1 = f( g1(x) ) and y2 = f( g2(x) )

  x1 | x2 |  y1  |  y2  | Output
  ---+----+------+------+-------
   0 |  0 | 0(-) | 0(-) |  B(0)
   0 |  1 | 1(+) | 0(-) |  A(1)
   1 |  0 | 1(+) | 0(-) |  A(1)
   1 |  1 | 1(+) | 1(+) |  B(0)

The Two-Layer Perceptron

- The first layer performs a nonlinear mapping that makes the data linearly separable:
  y1 = f( g1(x) ) and y2 = f( g2(x) ),
  with g1(x) = x1 + x2 - 1/2 and g2(x) = x1 + x2 - 3/2.

The Two-Layer Perceptron Architecture

- Input layer: x1, x2 (plus a constant bias input).
- Hidden layer: g1(x) = x1 + x2 - 1/2 and g2(x) = x1 + x2 - 3/2.
- Output layer: f( y1 - y2 - 1/2 ).

The Two-Layer Perceptron

- Note that the hidden layer maps the plane onto the vertices of the unit square:
  y1 = f( g1(x) ) and y2 = f( g2(x) ),
  with g1(x) = x1 + x2 - 1/2 and g2(x) = x1 + x2 - 3/2.

Higher Dimensions

- Each hidden unit realizes a hyperplane discriminant function.
- The output of each hidden unit is 0 or 1 depending upon the location of the input vector relative to the hyperplane:

  x ∈ R^l,   x → y = [y_1, ..., y_p]^T,  y_i ∈ {0, 1},  i = 1, 2, ..., p.

Higher Dimensions

- Together, the hidden units map the input onto the vertices of a p-dimensional unit hypercube:

  x ∈ R^l,   x → y = [y_1, ..., y_p]^T,  y_i ∈ {0, 1},  i = 1, 2, ..., p.

Two-Layer Perceptron

- These p hyperplanes partition the l-dimensional input space into polyhedral regions.
- Each region corresponds to a different vertex of the p-dimensional hypercube represented by the outputs of the hidden layer.

Two-Layer Perceptron

- In this example, the vertex (0, 0, 1) corresponds to the region of the input space where:
  g1(x) < 0,  g2(x) < 0,  g3(x) > 0.

Limitations of a Two-Layer Perceptron

- The output neuron realizes a hyperplane in the transformed space that partitions the p vertices into two sets.
- Thus, the two-layer perceptron can classify vectors into classes that consist of unions of polyhedral regions.
- But NOT ANY union: it depends on the relative position of the corresponding vertices.
- How can we solve this problem?

The Three-Layer Perceptron

- Suppose that Class A consists of the union of K polyhedra in the input space.
- Use K neurons in the 2nd hidden layer.
- Train each to classify one region as positive, the rest negative.
- Now use an output neuron that implements the OR function.

The Three-Layer Perceptron

- Thus the three-layer perceptron can separate classes resulting from any union of polyhedral regions in the input space.

The Three-Layer Perceptron

- The first layer of the network forms the hyperplanes in the input space.
- The second layer of the network forms the polyhedral regions of the input space.
- The third layer forms the appropriate unions of these regions and maps each to the appropriate class.

Learning Parameters

Training Data

- The training data consist of N input-output pairs ( y(i), x(i) ), i = 1, ..., N, where

  y(i) = [ y_1(i), ..., y_{k_L}(i) ]^T  and  x(i) = [ x_1(i), ..., x_{k_0}(i) ]^T.

Choosing an Activation Function

- The unit step activation function means that the error rate of the network is a discontinuous function of the weights.
- This makes it difficult to learn optimal weights by minimizing the error.
- To fix this problem, we need to use a smooth activation function.
- A popular choice is the sigmoid function we used for logistic regression.

Smooth Activation Function

  f(a) = 1 / ( 1 + exp(-a) )

[Figure: the logistic (sigmoid) function, as used for logistic regression.]

Output: Two Classes

- For a binary classification problem, there is a single output node with activation function

  f(a) = 1 / ( 1 + exp(-a) ).

- Since the output is constrained to lie between 0 and 1, it can be interpreted as the probability of the input vector belonging to Class 1.

Output: K > 2 Classes

- For a K-class problem, we use K outputs and the softmax function

  y_k = exp(a_k) / Σ_j exp(a_j).

- Since the outputs are constrained to lie between 0 and 1 and sum to 1, y_k can be interpreted as the probability that the input vector belongs to class k.
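A minimal sketch (not from the slides) of the two output activations above: the logistic sigmoid for binary problems and the softmax for K > 2 classes.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid: maps a real activation to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """Softmax over the last axis; outputs are positive and sum to 1."""
    a = a - np.max(a, axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(a)
    return e / np.sum(e, axis=-1, keepdims=True)

print(sigmoid(0.0))                            # 0.5
print(softmax(np.array([2.0, 1.0, 0.1])))      # probabilities summing to 1
```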

Non-Convex

- Now each layer of our multi-layer perceptron is a logistic regressor.
- Recall that optimizing the weights in logistic regression results in a convex optimization problem.
- Unfortunately, the cascading of logistic regressors in the multi-layer perceptron makes the problem non-convex.
- This makes it difficult to determine an exact solution.
- Instead, we typically use gradient descent to find a locally optimal solution for the weights.
- The specific learning algorithm is called the back-propagation algorithm.

End of Lecture Nov 21, 2011


The Backpropagation Algorithm

- Paul J. Werbos. Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University, 1974.
- Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations by back-propagating errors". Nature 323 (6088): 533-536.

Notation

- Assume a network with L layers:
  - k_0 nodes in the input layer.
  - k_r nodes in the r'th layer.

Notation

- Let y^{r-1}_k be the output of the kth neuron of layer r - 1.
- Let w^r_{jk} be the weight of the synapse on the jth neuron of layer r from the kth neuron of layer r - 1.

Input

  y^0_k(i) = x_k(i),  k = 1, ..., k_0.

Notation

- Let v^r_j be the total input to the jth neuron of layer r:

  v^r_j(i) = (w^r_j)^T y^{r-1}(i) = Σ_{k=0}^{k_{r-1}} w^r_{jk} y^{r-1}_k(i),

  where we define y^{r-1}_0(i) = +1 to incorporate the bias term.

- Then

  y^r_j(i) = f( v^r_j(i) ) = f( Σ_{k=0}^{k_{r-1}} w^r_{jk} y^{r-1}_k(i) ).

Cost Function

- A common cost function is the squared error:

  J = Σ_{i=1}^{N} ε(i),

  where ε(i) = (1/2) Σ_{m=1}^{k_L} e_m(i)^2 = (1/2) Σ_{m=1}^{k_L} ( ŷ_m(i) - y_m(i) )^2,

  and ŷ_m(i) = y^L_m(i) is the output of the network.

Cost Function

- To summarize, the error for input i is given by

  ε(i) = (1/2) Σ_{m=1}^{k_L} e_m(i)^2 = (1/2) Σ_{m=1}^{k_L} ( ŷ_m(i) - y_m(i) )^2,

  where ŷ_m(i) = y^L_m(i) is the output of the output layer, and each layer is related to the previous layer through

  y^r_j(i) = f( v^r_j(i) )  and  v^r_j(i) = (w^r_j)^T y^{r-1}(i).

Gradient Descent

  ε(i) = (1/2) Σ_{m=1}^{k_L} e_m(i)^2 = (1/2) Σ_{m=1}^{k_L} ( ŷ_m(i) - y_m(i) )^2

- Gradient descent starts with an initial guess at the weights over all layers of the network.
- We then use these weights to compute the network output ŷ(i) for each input vector x(i) in the training data.
- This allows us to calculate the error ε(i) for each of these inputs.
- Then, in order to minimize this error, we incrementally update the weights in the negative gradient direction:

  w^r_j(new) = w^r_j(old) - μ ∂J/∂w^r_j = w^r_j(old) - μ Σ_{i=1}^{N} ∂ε(i)/∂w^r_j.

Gradient Descent

- Since v^r_j(i) = (w^r_j)^T y^{r-1}(i), the influence of the jth weight vector of the rth layer on the error can be expressed as:

  ∂ε(i)/∂w^r_j = ( ∂ε(i)/∂v^r_j(i) ) ( ∂v^r_j(i)/∂w^r_j ) = δ^r_j(i) y^{r-1}(i),

  where δ^r_j(i) ≡ ∂ε(i)/∂v^r_j(i).

Gradient Descent

  ∂ε(i)/∂w^r_j = δ^r_j(i) y^{r-1}(i),  where δ^r_j(i) ≡ ∂ε(i)/∂v^r_j(i).

- For an intermediate layer r, we cannot compute δ^r_j(i) directly.
- However, δ^r_j(i) can be computed inductively, starting from the output layer.

Backpropagation: The Output Layer

  ∂ε(i)/∂w^r_j = δ^r_j(i) y^{r-1}(i),  where δ^r_j(i) ≡ ∂ε(i)/∂v^r_j(i)

  and ε(i) = (1/2) Σ_{m=1}^{k_L} e_m(i)^2 = (1/2) Σ_{m=1}^{k_L} ( ŷ_m(i) - y_m(i) )^2.

- Recall that ŷ_j(i) = y^L_j(i) = f( v^L_j(i) ).
- Thus at the output layer we have

  δ^L_j(i) = ∂ε(i)/∂v^L_j(i) = ( ∂ε(i)/∂e_j(i) ) ( ∂e_j(i)/∂v^L_j(i) ) = e^L_j(i) f'( v^L_j(i) ).

- For the sigmoid, f(a) = 1/(1 + exp(-a))  ⇒  f'(a) = f(a)( 1 - f(a) ), so

  δ^L_j(i) = e^L_j(i) f( v^L_j(i) ) ( 1 - f( v^L_j(i) ) ).

Backpropagation: Hidden Layers

- Observe that the dependence of the error on the total input to a neuron in a previous layer can be expressed in terms of the dependence on the total inputs of neurons in the following layer:

  δ^{r-1}_j(i) = ∂ε(i)/∂v^{r-1}_j(i) = Σ_{k=1}^{k_r} ( ∂ε(i)/∂v^r_k(i) ) ( ∂v^r_k(i)/∂v^{r-1}_j(i) ) = Σ_{k=1}^{k_r} δ^r_k(i) ∂v^r_k(i)/∂v^{r-1}_j(i),

  where v^r_k(i) = Σ_{m=0}^{k_{r-1}} w^r_{km} y^{r-1}_m(i) = Σ_{m=0}^{k_{r-1}} w^r_{km} f( v^{r-1}_m(i) ),

  and so ∂v^r_k(i)/∂v^{r-1}_j(i) = w^r_{kj} f'( v^{r-1}_j(i) ).

- Thus we have

  δ^{r-1}_j(i) = ∂ε(i)/∂v^{r-1}_j(i) = f'( v^{r-1}_j(i) ) Σ_{k=1}^{k_r} δ^r_k(i) w^r_{kj}
               = f( v^{r-1}_j(i) ) ( 1 - f( v^{r-1}_j(i) ) ) Σ_{k=1}^{k_r} δ^r_k(i) w^r_{kj}.

- Thus, once the δ^r_k(i) are determined, they can be propagated backward to calculate δ^{r-1}_j(i) using this inductive formula.

Backpropagation: Summary of Algorithm

Initialization
1. Initialize all weights with small random values.

Repeat until convergence:

Forward Pass
2. For each input vector x(i), run the network in the forward direction, calculating

   v^r_j(i) = (w^r_j)^T y^{r-1}(i),   y^r_j(i) = f( v^r_j(i) ),

   and finally ε(i) = (1/2) Σ_{m=1}^{k_L} e_m(i)^2 = (1/2) Σ_{m=1}^{k_L} ( ŷ_m(i) - y_m(i) )^2.

Backward Pass
3. Starting with the output layer, use the inductive formula to compute the δ^r_j(i):
   - Output layer (base case):       δ^L_j(i) = e^L_j(i) f'( v^L_j(i) ).
   - Hidden layers (inductive case): δ^{r-1}_j(i) = f'( v^{r-1}_j(i) ) Σ_{k=1}^{k_r} δ^r_k(i) w^r_{kj}.

Update Weights
4. w^r_j(new) = w^r_j(old) - μ Σ_{i=1}^{N} ∂ε(i)/∂w^r_j,  where ∂ε(i)/∂w^r_j = δ^r_j(i) y^{r-1}(i).
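The following is a minimal NumPy sketch (not from the slides) of steps 1-4 for a single hidden layer with sigmoid activations and squared error. The layer size, learning rate mu, and epoch count are illustrative assumptions, and the gradients are averaged over the batch rather than summed.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_mlp(X, Y, n_hidden=5, mu=0.5, epochs=5000, seed=0):
    """Batch gradient descent with backpropagation for a 1-hidden-layer MLP.
    X: (N, d) inputs, Y: (N, k) targets in [0, 1]."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    k = Y.shape[1]
    # 1. Initialize weights; the last column holds the bias (matching y_0 = +1)
    W1 = rng.normal(scale=0.1, size=(n_hidden, d + 1))
    W2 = rng.normal(scale=0.1, size=(k, n_hidden + 1))
    Xb = np.hstack([X, np.ones((N, 1))])          # append bias input
    for _ in range(epochs):
        # 2. Forward pass
        V1 = Xb @ W1.T                            # total inputs to hidden layer
        Y1 = sigmoid(V1)
        Y1b = np.hstack([Y1, np.ones((N, 1))])
        V2 = Y1b @ W2.T                           # total inputs to output layer
        Y2 = sigmoid(V2)                          # network output
        # 3. Backward pass
        E = Y2 - Y                                # e_j = yhat_j - y_j
        D2 = E * Y2 * (1 - Y2)                    # deltas at output layer
        D1 = (D2 @ W2[:, :-1]) * Y1 * (1 - Y1)    # deltas at hidden layer
        # 4. Update weights (batch: gradients combined over all inputs)
        W2 -= mu * D2.T @ Y1b / N
        W1 -= mu * D1.T @ Xb / N
    return W1, W2

# e.g., try to learn XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
Y = np.array([[0], [1], [1], [0]], float)
W1, W2 = train_mlp(X, Y)
```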

Batch vs Online Learning

- As described, on each iteration backprop updates the weights based upon all of the training data. This is called batch learning:

  w^r_j(new) = w^r_j(old) - μ Σ_{i=1}^{N} ∂ε(i)/∂w^r_j,  where ∂ε(i)/∂w^r_j = δ^r_j(i) y^{r-1}(i).

- An alternative is to update the weights after each training input has been processed by the network, based only upon the error for that input. This is called online learning:

  w^r_j(new) = w^r_j(old) - μ ∂ε(i)/∂w^r_j,  where ∂ε(i)/∂w^r_j = δ^r_j(i) y^{r-1}(i).

Batch vs Online Learning

- One advantage of batch learning is that averaging over all inputs when updating the weights should lead to smoother convergence.
- On the other hand, the randomness associated with online learning might help to prevent convergence toward a local minimum.
- Changing the order of presentation of the inputs from epoch to epoch may also improve results.

Remarks

- Local Minima: the objective function is in general non-convex, and so the solution may not be globally optimal.
- Stopping Criterion: typically stop when the change in weights or the change in the error function falls below a threshold.
- Learning Rate: the speed and reliability of convergence depend on the learning rate μ.


Generalizing Linear Classifiers

- One way of tackling problems that are not linearly separable is to transform the input in a nonlinear fashion prior to applying a linear classifier.
- The result is that decision boundaries that are linear in the resulting feature space may be highly nonlinear in the original input space.

[Figure: a nonlinearly separable problem in the input space (x1, x2) becomes linearly separable in the feature space (φ1, φ2).]

Nonlinear Basis Function Models

- Generally,

  y(x, w) = Σ_j w_j φ_j(x) = w^T φ(x),

  where the φ_j(x) are known as basis functions.
- Typically, φ_0(x) = 1, so that w_0 acts as a bias.
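As an illustration (not from the slides), the sketch below fits a linear model in a polynomial feature space by least squares; the basis degree and the toy data are arbitrary assumptions.

```python
import numpy as np

def poly_features(x, degree=3):
    """phi_j(x) = x**j for j = 0..degree; phi_0 = 1 supplies the bias w_0."""
    return np.vander(x, degree + 1, increasing=True)

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 50)
t = np.sin(np.pi * x) + 0.1 * rng.standard_normal(x.size)   # toy targets

Phi = poly_features(x, degree=3)                  # N x M design matrix
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)       # linear least squares in feature space
y = Phi @ w                                       # y(x, w) = w^T phi(x)
```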

Nonlinear Basis Functions for Classification

- In the context of classification, the discriminant function in the feature space becomes:

  g_y(x) = w_0 + Σ_{i=1}^{M} w_i y_i(x) = w_0 + Σ_{i=1}^{M} w_i φ_i(x).

- This formulation can be thought of as an input-space approximation of the true separating discriminant function g(x) using a set of interpolation functions φ_i(x).

Dimensionality

- The dimensionality M of the feature space may be less than, equal to, or greater than the dimensionality D of the original input space.
  - M < D: This may factor out irrelevant dimensions, reduce the number of model parameters, and thereby improve generalization (less overfitting).
  - M > D: Problems that are not linearly separable in the input space may become separable in the feature space, and the probability of linear separability generally increases with the dimensionality of the feature space. Thus choosing M >> D helps to make the problem linearly separable.

Cover's Theorem

"A complex pattern-classification problem, cast in a high-dimensional space nonlinearly, is more likely to be linearly separable than in a low-dimensional space, provided that the space is not densely populated."
— Cover, T. M., Geometrical and Statistical Properties of Systems of Linear Inequalities with Applications in Pattern Recognition, 1965

[Figure: example.]


Radial Basis Functions

- Consider interpolation functions (kernels) of the form

  φ_i(x) = φ_i( ||x - μ_i|| ).

- In other words, the feature value depends only upon the Euclidean distance to a 'centre point' μ_i in the input space.
- A commonly used RBF is the isotropic Gaussian:

  φ_i(x) = exp( -||x - μ_i||^2 / (2σ_i^2) ).
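A minimal sketch (not from the slides) of a Gaussian RBF feature map with a linear readout; the centres μ_i, width σ, and number of kernels M are illustrative assumptions (chosen here by random sampling rather than, e.g., k-means).

```python
import numpy as np

def rbf_features(X, centres, sigma):
    """phi_i(x) = exp(-||x - mu_i||^2 / (2 sigma^2)) for each centre mu_i.
    X: (N, d), centres: (M, d)  ->  (N, M) design matrix."""
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))                  # toy 2-D inputs
t = (X[:, 0] ** 2 + X[:, 1] ** 2 < 0.5).astype(float)  # toy labels, not linearly separable

M, sigma = 20, 0.4                                     # M << N kernels
centres = X[rng.choice(len(X), M, replace=False)]
Phi = np.hstack([np.ones((len(X), 1)), rbf_features(X, centres, sigma)])  # bias column
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)            # readout g_y(x) = w0 + sum_i w_i phi_i(x)
```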

Relation to KDE

- We can use Gaussian RBFs to approximate the discriminant function g(x):

  g_y(x) = w_0 + Σ_{i=1}^{M} w_i y_i(x) = w_0 + Σ_{i=1}^{M} w_i φ_i(x),

  where φ_i(x) = exp( -||x - μ_i||^2 / (2σ_i^2) ).

- This is reminiscent of kernel density estimation, where we approximated probability densities as a normalized sum of Gaussian kernels.

Relation to KDE

- For KDE we planted a kernel at each data point, so there were N kernels.
- For RBF networks, we generally use far fewer kernels than the number of data points: M << N.

Example 3:  k(x, z) = ( x^T z )^2

Kernel Properties

- Kernels obey certain properties that make it easy to construct complex kernels from simpler ones.

Kernel Properties: Combining Kernels (Bishop, Chapter 6: Kernel Methods)

Given valid kernels k1(x, x') and k2(x, x'), the following kernels will also be valid:

  k(x, x') = c k1(x, x')                    (6.13)
  k(x, x') = f(x) k1(x, x') f(x')           (6.14)
  k(x, x') = q( k1(x, x') )                 (6.15)
  k(x, x') = exp( k1(x, x') )               (6.16)
  k(x, x') = k1(x, x') + k2(x, x')          (6.17)
  k(x, x') = k1(x, x') k2(x, x')            (6.18)
  k(x, x') = k3( φ(x), φ(x') )              (6.19)
  k(x, x') = x^T A x'                       (6.20)
  k(x, x') = ka(xa, xa') + kb(xb, xb')      (6.21)
  k(x, x') = ka(xa, xa') kb(xb, xb')        (6.22)

where c > 0, f(·) is any function, q(·) is a polynomial with nonnegative coefficients, φ(x) is a mapping from x to R^M, k3 is a valid kernel on R^M, A is a symmetric positive semidefinite matrix, xa and xb are variables such that x = (xa, xb), and ka, kb are valid kernels over their respective spaces.

Constructing Kernels

Examples:

  k(x, x') = ( x^T x' + c )^M,  c > 0          (use 6.18)

  k(x, x') = exp( -||x - x'||^2 / (2σ^2) )     (use 6.14 and 6.16)
  — corresponds to an infinite-dimensional feature vector.
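A minimal sketch (not from the slides) of the two kernels above, with a numerical check that the Gram matrices they produce are symmetric positive semidefinite; the toy data and the values of c, M, and σ are arbitrary assumptions.

```python
import numpy as np

def poly_kernel(X, Z, c=1.0, M=3):
    """k(x, z) = (x^T z + c)^M, c > 0."""
    return (X @ Z.T + c) ** M

def gaussian_kernel(X, Z, sigma=1.0):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 2))

for K in (poly_kernel(X, X), gaussian_kernel(X, X)):
    eigvals = np.linalg.eigvalsh(K)          # Gram matrix of a valid kernel is PSD
    print(K.shape, eigvals.min() > -1e-8)    # smallest eigenvalue is (numerically) >= 0
```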

Nonlinear SVM Example (Gaussian Kernel)

[Figure: nonlinear SVM decision boundary in the input space (x1, x2).]

SVMs for Regression

- In standard linear regression, we minimize

  (1/2) Σ_{n=1}^{N} ( y_n - t_n )^2 + (λ/2) ||w||^2.

  This penalizes all deviations from the model.

- To obtain sparse solutions, we replace the quadratic error function by an ε-insensitive error function, e.g.,

  E_ε( y(x) - t ) = 0                 if |y(x) - t| < ε,
                  = |y(x) - t| - ε    otherwise.

- See text for details of the solution.

[Figure: the ε-insensitive tube y ± ε around the regression function y(x), with slack variables ξ > 0 above and ξ̂ > 0 below the tube.]
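A minimal sketch (not from the slides) contrasting the quadratic error with the ε-insensitive error defined above; the value of ε is an arbitrary assumption.

```python
import numpy as np

def squared_error(r):
    """Standard quadratic error on the residual r = y(x) - t."""
    return 0.5 * r ** 2

def eps_insensitive_error(r, eps=0.1):
    """E_eps(r) = 0 if |r| < eps, else |r| - eps: deviations inside the eps-tube cost nothing."""
    return np.maximum(np.abs(r) - eps, 0.0)

r = np.linspace(-0.5, 0.5, 11)
print(squared_error(r))
print(eps_insensitive_error(r))   # zero inside (-0.1, 0.1), linear outside
```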

Example

[Figure: SVM regression fit, target t vs. input x.]


Relevance Vector Machines

- Some drawbacks of SVMs:
  - They do not provide posterior probabilities.
  - They are not easily generalized to K > 2 classes.
  - Parameters (C, ε) must be learned by cross-validation.
- The Relevance Vector Machine is a sparse Bayesian kernel technique that avoids these drawbacks.
- RVMs also typically lead to sparser models.

RVMs for Regression

  p( t | x, w, β ) = N( t | y(x), β^{-1} ),  where y(x) = w^T φ(x).

- In an RVM, the basis functions φ(x) are kernels k(x, x_n):

  y(x) = Σ_{n=1}^{N} w_n k(x, x_n) + b.

- However, unlike in SVMs, the kernels need not be positive definite, and the x_n need not be the training data points.

RVMs for Regression

Likelihood:

  p( t | X, w, β ) = Π_{n=1}^{N} p( t_n | x_n, w, β ),  where the nth row of X is x_n^T.

Prior:

  p( w | α ) = Π_{i=1}^{M} N( w_i | 0, α_i^{-1} ).

- Note that each weight parameter has its own precision hyperparameter.

RVMs for Regression

  p( w_i | α_i ) = N( w_i | 0, α_i^{-1} )
  p( α_i ) = Gam( α_i | a, b )
  p( w_i ) = ∫ p( w_i | α_i ) p( α_i ) dα_i = St( w_i | 2a )

- The conjugate prior for the precision of a Gaussian is a gamma distribution.
- Integrating out the precision parameter leads to a Student's t distribution over w_i. Thus the distribution over w is a product of Student's t distributions.
- As a result, maximizing the evidence will yield a sparse w.
- Note that to achieve sparsity it is critical that each w_i has a separate precision α_i: the marginal prior for a single shared hyperparameter has a sharper peak at zero than the Gaussian but is not sparse, whereas the independent-hyperparameter prior places most of its probability mass along axial ridges where the magnitude of one of the parameters is small.

[Figure: contour plots of the Gaussian prior and of the marginal Student-t priors over two parameters (w1, w2), for a single shared hyperparameter and for independent hyperparameters.]

RVMs for Regression

  Gamma distribution:        p( x | a, b ) = ( b^a / Γ(a) ) x^{a-1} e^{-bx}

  Student's t distribution:  p( x | ν ) = ( Γ((ν+1)/2) / ( √(νπ) Γ(ν/2) ) ) ( 1 + x^2/ν )^{-(ν+1)/2}

- Also recall the rule for transforming densities: if y is a monotonic function of x, then

  p_Y(y) = p_X(x) |dx/dy|.

- Thus if we let a → 0, b → 0, then p( log α_i ) → uniform and p( w_i ) ∝ 1/|w_i|. Very sparse!

RVMs for Regression

Likelihood:

  p( t | X, w, β ) = Π_{n=1}^{N} p( t_n | x_n, w, β ),  where the nth row of X is x_n^T.

Prior:

  p( w | α ) = Π_{i=1}^{M} N( w_i | 0, α_i^{-1} ).

- In practice, it is difficult to integrate α out exactly.
- Instead, we use an approximate maximum likelihood method, finding ML values for each α_i.
- When we maximize the evidence with respect to these hyperparameters, many will → ∞. As a result, the corresponding weights will → 0, yielding a sparse solution.

RVMs for Regression

- Since both the likelihood and prior are normal, the posterior over w will also be normal:

  Posterior:  p( w | t, X, α, β ) = N( w | m, Σ ),

  where m = β Σ Φ^T t,  Σ = ( A + β Φ^T Φ )^{-1},

  with Φ_{ni} = φ_i( x_n ) and A = diag( α_i ).

- Note that when α_i → ∞, the ith row and column of Σ → 0, and

  p( w_i | t, X, α, β ) = N( w_i | 0, 0 ).
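A minimal NumPy sketch (not from the slides) of the posterior mean and covariance given above; the design matrix and the values of α and β are illustrative assumptions.

```python
import numpy as np

def rvm_posterior(Phi, t, alpha, beta):
    """Posterior p(w | t) = N(w | m, Sigma) with
    Sigma = (A + beta Phi^T Phi)^{-1}, m = beta Sigma Phi^T t, A = diag(alpha)."""
    A = np.diag(alpha)
    Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)
    m = beta * Sigma @ Phi.T @ t
    return m, Sigma

# toy example: N data points, M basis functions (e.g. kernels centred on the data)
rng = np.random.default_rng(0)
N, M = 30, 10
Phi = rng.standard_normal((N, M))
t = rng.standard_normal(N)
alpha = np.ones(M)          # one precision per weight
beta = 25.0                 # noise precision
m, Sigma = rvm_posterior(Phi, t, alpha, beta)
```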

RVMs for Regression

- The values for α and β are determined using the evidence approximation, where we maximize

  p( t | X, α, β ) = ∫ p( t | X, w, β ) p( w | α ) dw.

- In general, this results in many of the precision parameters α_i → ∞, so that w_i → 0.
- Unfortunately, this is a non-convex problem.
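The slides do not give the update equations; as a hedged sketch, the standard evidence re-estimation rules (as in Bishop, Ch. 7.2) can be iterated together with the posterior above: γ_i = 1 - α_i Σ_ii, α_i ← γ_i / m_i^2, β ← (N - Σ_i γ_i) / ||t - Φm||^2. The iteration count and pruning threshold below are arbitrary assumptions.

```python
import numpy as np

def rvm_fit(Phi, t, n_iter=100, alpha_prune=1e6):
    """Iterative evidence (type-II ML) re-estimation of alpha and beta for an RVM."""
    N, M = Phi.shape
    alpha = np.ones(M)
    beta = 1.0 / np.var(t)
    for _ in range(n_iter):
        # posterior p(w | t) = N(w | m, Sigma)
        Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
        m = beta * Sigma @ Phi.T @ t
        # evidence re-estimation updates
        gamma = 1.0 - alpha * np.diag(Sigma)              # well-determinedness of each weight
        alpha = gamma / np.maximum(m ** 2, 1e-12)         # alpha_i -> infinity prunes weight i
        beta = (N - gamma.sum()) / np.sum((t - Phi @ m) ** 2)
    relevant = alpha < alpha_prune                        # surviving "relevance vectors"
    return m, alpha, beta, relevant
```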

Example

[Figure: RVM regression fit, target t vs. input x.]