Bayesian Neural Network Classification of Head Movement Direction using Various Advanced Optimisation Training Algorithms*

Son T. Nguyen, Hung T. Nguyen, Philip B. Taylor
Faculty of Engineering, University of Technology, Sydney
Broadway NSW 2007, Australia
[email protected], [email protected], [email protected]

* This work was supported by an ARC LIEF grant (LE0454081).

Abstract - Head movement is one of the most effective hands-free control modes for powered wheelchairs. It provides the necessary mobility assistance to severely disabled people and can be used to replace the joystick directly. In this paper, we describe the development of Bayesian neural networks for the classification of head movement commands in a hands-free wheelchair control system. Bayesian neural networks allow strong generalisation of head movement classifications during the training phase and do not require a validation data set. Various advanced optimisation training algorithms are explored. Experimental results show that Bayesian neural networks can be developed to classify head movement commands from both able-bodied and disabled people accurately with limited training data.

Index Terms - Bayesian neural networks; head-movement classification; powered wheelchair.

I. INTRODUCTION

For severely disabled people, with conditions such as cerebral palsy or tetraplegia, using a joystick as a form of control is a very demanding task. Head movement is one of the most effective hands-free modes for the control of powered wheelchairs, as it provides the necessary mobility assistance. It remains a major and elusive task to develop an innovative head movement interface for a human-machine system which assures the safety of the operator, remains unobtrusive, is easy to learn and can easily be adapted to a new operator with a different mobility impairment. Many people with spinal cord injury who use head controls have very poor control of the musculature supporting the upper body and neck, and consequently control of head position is marginal at the best of times. Other challenges arise from changed positions of the body in relation to the wheelchair.

A feed-forward neural network classically trained using back-propagation can be viewed as an effective classifier for the head movements of severely disabled people [1], [2], [3]. However, the main disadvantage of standard neural networks is the potential for poor generalisation when faced with limited training data. Recently, Bayesian techniques have been applied to neural networks to improve the accuracy and robustness of neural network classifiers. In our previous research [4], it was shown that Bayesian neural networks could be used to classify head movement commands consistently with limited training data. In this paper, we continue to explore the properties of Bayesian neural networks in a hands-free wheelchair control system using various advanced optimisation training algorithms. In the near future, the optimal Bayesian neural network will provide the ability for the system to be trained on-line for each individual operator, irrespective of his/her disability.

Section II describes the formulation of Bayesian neural network classification. In Section III, we briefly discuss various advanced optimisation algorithms used for the training of Bayesian neural networks. Section IV shows experimental results of head movement classification in a hands-free wheelchair control system. Section V provides a discussion of these results and directions for the future development of the system.

II. BAYESIAN NEURAL NETWORK CLASSIFICATION

Bayesian neural networks were first introduced by MacKay [5], [6], [7]. A Bayesian neural network has the following main benefits compared to a standard neural network:

- Network training adjusts the weight decay parameters automatically to the optimal values for the best generalisation. The adjustment is done during training, so a computationally intensive search for the weight decay parameters is no longer required.
- Training runs that converge to different local minima, and networks with different numbers of hidden nodes, can be compared and ranked.
- As no separate validation set is required, all available data can be used for training.

A. Multi-Layer Perceptron Neural Networks

Multi-layer perceptron (MLP) neural networks are widely used in engineering applications. These networks take in a vector of real inputs, x_i, and from them compute one or more output values, z_k(x, w). For a network with one hidden layer, as shown in Fig. 1, the value of the k-th output is computed as follows:

[Fig. 1: An MLP neural network with inputs x_1, ..., x_i, ..., x_d, an augmented input x_{d+1} = 1, hidden units y_1, ..., y_j, ..., y_M, a bias unit y_{M+1}, and outputs z_1, ..., z_k, ..., z_c; w_{ji} and w_{kj} denote the input-to-hidden and hidden-to-output weights.]

z_k(\mathbf{x}, \mathbf{w}) = f_0\left( b_k + \sum_{j=1}^{M} w_{kj} \tanh\left( b_j + \sum_{i=1}^{d} w_{ji} x_i \right) \right)    (1)

Here, w_{ji} is the weight on the connection from input unit i to hidden unit j; similarly, w_{kj} is the weight on the connection from hidden unit j to output unit k. b_j and b_k are the biases of the hidden and output units, and f_0 is the output layer activation function.

B. Regularisation

In neural networks, appropriate regularisation can be used to prevent any weights from becoming too large, because large weights may give poor generalisation on new cases. Therefore, a weight decay term is added to the data error function E_D to penalise large weights. Specifically, for classification problems, we have

S(\mathbf{w}) = E_D + \sum_{g=1}^{G} \xi_g E_{W_g}    (2)

where S(w) is the total error function, ξ_g is a non-negative parameter for the distribution of the other parameters (weights and biases), known as a hyperparameter, E_{W_g} is the weight error for the g-th group of weights and biases, and G is the number of groups of weights and biases in the neural network.
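As an illustration, the forward pass (1) and the regularised total error (2) can be written compactly. The following minimal Python sketch assumes a softmax output activation f_0 and a cross-entropy data error E_D; all names and shapes are our own rather than the paper's:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """Eq. (1): one-hidden-layer MLP with tanh hidden units and
    softmax output activation f_0. x: (d,), W1: (M, d), W2: (c, M)."""
    y = np.tanh(b1 + W1 @ x)            # hidden unit activations
    a = b2 + W2 @ y                     # output pre-activations
    e = np.exp(a - a.max())             # softmax, shifted for stability
    return e / e.sum()

def total_error(data, params, xi):
    """Eq. (2): S(w) = E_D + sum_g xi_g * E_Wg, with a cross-entropy
    E_D and E_Wg = 0.5 * (sum of squared parameters in group g)."""
    W1, b1, W2, b2 = params
    E_D = -sum(np.log(forward(x, W1, b1, W2, b2)[t]) for x, t in data)
    E_W = [0.5 * np.sum(p ** 2) for p in params]   # one group per array
    return E_D + sum(x_g * e_g for x_g, e_g in zip(xi, E_W))
```

Assigning one hyperparameter per parameter array mirrors the four weight groups used later in Section IV.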

C. Bayesian Inference

The adaptive parameters of a neural network (weights and biases) can be conveniently grouped into a single W-dimensional weight vector w. According to Bayesian inference, the posterior distribution of the weight vector w of a neural network given a data set D is

p(\mathbf{w} \mid D, \psi) = \frac{p(D \mid \mathbf{w}, \psi)\, p(\mathbf{w} \mid \psi)}{p(D \mid \psi)}    (3)

where ψ = {ξ_1, ..., ξ_G}, and equation (3) is the first level of the inference. p(w | ψ) is the weight prior. According to [7], the prior distribution of the weights is given by

p(\mathbf{w} \mid \psi) = \frac{1}{Z_W(\psi)} \exp\left( -\sum_{g=1}^{G} \xi_g E_{W_g} \right)    (4)

where Z_W(ψ) is the normalisation constant

Z_W(\psi) = \prod_{g=1}^{G} \left( \frac{2\pi}{\xi_g} \right)^{W_g / 2}    (5)

and W_g is the number of weights in the g-th group. p(D | w, ψ) is the dataset likelihood and p(D | ψ) is the evidence for ψ, or the normalisation factor. If the dataset is independently and identically distributed (i.i.d.), the dataset likelihood is

p(D \mid \mathbf{w}, \psi) = \exp(-E_D)    (6)

Assume that the total error function S(w) has a single minimum at the most probable weight vector w_MP, and that S(w) can be locally approximated by the quadratic form obtained from the second-order Taylor series expansion of S(w):

S(\mathbf{w}) \approx S(\mathbf{w}_{MP}) + \frac{1}{2} (\mathbf{w} - \mathbf{w}_{MP})^T A (\mathbf{w} - \mathbf{w}_{MP})    (7)

The matrix A is the Hessian matrix of the total error function at w_MP:

A = \nabla \nabla S(\mathbf{w}_{MP}) = H + \sum_{g=1}^{G} \xi_g I_g    (8)

where H = ∇∇E_D(w_MP) is the Hessian matrix of the data error function at w_MP, and I_g is the diagonal matrix with ones along the diagonal that picks off the weights in the g-th group. After the training phase, the posterior distribution of the weights can be derived as

p(\mathbf{w} \mid D, \psi) = \frac{1}{Z_S(\psi)} \exp(-S(\mathbf{w}))    (9)

where Z_S(ψ) is the normalisation constant for the approximating Gaussian, and is therefore given by

Z_S(\psi) \approx \exp(-S(\mathbf{w}_{MP}))\, (2\pi)^{W/2} \det(A)^{-1/2}    (10)

Again using Bayes' theorem, we can express the posterior distribution of the hyperparameters as

p(\psi \mid D) = \frac{p(D \mid \psi)\, p(\psi)}{p(D)} \propto p(D \mid \psi)\, p(\psi)    (11)

where p(ψ) is the prior distribution of the hyperparameters, which we simply assume to be uniform. It will thus be ignored subsequently: to infer the hyperparameters, we only seek the values that maximise p(D | ψ). Rearranging (3), we have the following form:

p(D \mid \psi) = \frac{p(D \mid \mathbf{w}, \psi)\, p(\mathbf{w} \mid \psi)}{p(\mathbf{w} \mid D, \psi)}    (12)

Since all the terms on the right-hand side of equation (12) are determined by (6), (4) and (9), equation (12) yields

p(D \mid \psi) = \frac{Z_S(\psi)}{Z_W(\psi)}    (13)
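Numerically, (13) is evaluated in log form using (5) and (10): ln p(D | ψ) = −S(w_MP) + (W/2) ln 2π − (1/2) ln det A − Σ_g (W_g/2) ln(2π/ξ_g). A minimal sketch, with names of our own choosing:

```python
import numpy as np

def log_evidence(S_mp, A, xi, group_sizes):
    """ln p(D|psi) = ln Z_S(psi) - ln Z_W(psi), from eqs (5), (10), (13).
    S_mp: S(w_MP); A: Hessian at w_MP, shape (W, W);
    xi: hyperparameters; group_sizes: number of weights W_g per group."""
    W = A.shape[0]
    sign, logdet = np.linalg.slogdet(A)  # sign should be +1 if A is pos. def.
    ln_ZS = -S_mp + 0.5 * W * np.log(2 * np.pi) - 0.5 * logdet
    ln_ZW = sum(0.5 * Wg * np.log(2 * np.pi / x_g)
                for x_g, Wg in zip(xi, group_sizes))
    return ln_ZS - ln_ZW
```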

Taking the derivative of ln p(D | ψ) with respect to ξ_g gives

\frac{\partial}{\partial \xi_g} \ln p(D \mid \psi) = \frac{W_g}{2 \xi_g} - E_{W_g} - \frac{1}{2} \operatorname{tr}(A^{-1} I_g)    (14)

Setting this derivative to zero, we can determine ξ_g from

2 \xi_g E_{W_g} = W_g - \xi_g \operatorname{tr}(A^{-1} I_g)    (15)

The right-hand side is equal to a value γ_g defined as

\gamma_g = W_g - \xi_g \operatorname{tr}(A^{-1} I_g)    (16)

γ_g is called the number of well-determined parameters in weight group g. Substituting (16) into (15) and rearranging, we have

\xi_g = \frac{\gamma_g}{2 E_{W_g}}    (17)

The terms ξ_g and γ_g are used in the formulas that compute the logarithm of the evidence in Bayesian model comparison. The optimal model is selected as the one with the highest logarithm of the evidence; the details of this task can be found in [8], [9].
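A sketch of the re-estimation implied by (16) and (17), where tr(A^{-1} I_g) reduces to the sum of the diagonal entries of A^{-1} belonging to group g (helper names are ours):

```python
import numpy as np

def reestimate_xi(A, xi, groups_idx, E_W):
    """Eqs (16)-(17): gamma_g = W_g - xi_g * tr(A^-1 I_g),
    then xi_g = gamma_g / (2 * E_Wg).
    groups_idx: one array of weight indices per group;
    E_W: the weight errors E_Wg evaluated at w_MP."""
    diag_Ainv = np.diag(np.linalg.inv(A))
    new_xi = []
    for x_g, idx, E_Wg in zip(xi, groups_idx, E_W):
        gamma_g = len(idx) - x_g * diag_Ainv[idx].sum()   # eq. (16)
        new_xi.append(gamma_g / (2.0 * E_Wg))             # eq. (17)
    return new_xi
```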

III. PARAMETER OPTIMISATION ALGORITHMS FOR BAYESIAN NEURAL NETWORKS

The main problem when training neural networks is that suitable values for the learning rate and momentum must usually be chosen by hand. As this procedure is clearly inefficient, we focus on fast training algorithms which can automatically determine the search direction and step size. In this section, three advanced training algorithms for Bayesian neural network classifiers are developed: the conjugate gradient, quasi-Newton and scaled conjugate gradient algorithms.

A. Conjugate Gradient Algorithm

The conjugate gradient algorithm starts by searching along the negative gradient on the first iteration. At the m-th step, a line search is performed to find the step size α_m:

\alpha_m = -\frac{g_m^T d_m}{d_m^T A d_m}    (18)

where g_m and d_m are the gradient and the search direction at step m. The new search direction is then given by

d_{m+1} = -g_{m+1} + \beta_m d_m    (19)

where β_m can be determined using the Polak-Ribière formula as follows:

\beta_m = \frac{(g_{m+1} - g_m)^T g_{m+1}}{g_m^T g_m}    (20)

B. Quasi-Newton Algorithm

Newton's method is an alternative to the conjugate gradient method for fast optimisation. The basic step of this method is based on Newton's formula:

w_{m+1} = w_m - H^{-1} g_m    (21)

However, computing the inverse Hessian matrix F = H^{-1} directly is expensive; instead, it can be approximated using a class of algorithms called quasi-Newton methods, of which the most successful is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method:

F_{m+1} = F_m + \frac{p p^T}{p^T v} - \frac{(F_m v)(v^T F_m)}{v^T F_m v} + (v^T F_m v)\, u u^T    (22)

where p, v and u are defined as follows:

p = w_{m+1} - w_m; \quad v = g_{m+1} - g_m; \quad u = \frac{p}{p^T v} - \frac{F_m v}{v^T F_m v}    (23)
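In practice the update (22)-(23) need not be hand-coded. A sketch of training the network by minimising S(w) with SciPy's BFGS implementation, assuming the objective and gradient functions of Section II are available for a flattened weight vector:

```python
import numpy as np
from scipy.optimize import minimize

# S_flat(w) returns the total error of eq. (2) for a flattened weight
# vector and grad_S_flat(w) its gradient; both are assumed defined.
w0 = 0.1 * np.random.randn(n_weights)   # n_weights = W, assumed known
result = minimize(S_flat, w0, jac=grad_S_flat, method='BFGS',
                  options={'gtol': 1e-5})
w_mp = result.x                          # most probable weights w_MP
```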

C. Scaled Conjugate Gradient Algorithm

In the conjugate gradient algorithm, the total error function S(w) is not a quadratic form, so the Hessian matrix A may not be positive definite; in that case the parameter update formula (18) may increase the function value. This can be overcome by adding a non-negative multiple λ_m of the unit matrix to the Hessian A to obtain A + λ_m I. Expression (18) then becomes

\alpha_m = -\frac{g_m^T d_m}{d_m^T A d_m + \lambda_m \|d_m\|^2}    (24)

The denominator of (24) can be written as

\delta_m = d_m^T A d_m + \lambda_m \|d_m\|^2    (25)

If δ_m < 0, we can increase the value of λ_m in order to make δ_m > 0. Let the raised value of λ_m be λ̄_m; the corresponding raised value of δ_m is then given by

\bar{\delta}_m = \delta_m + (\bar{\lambda}_m - \lambda_m) \|d_m\|^2    (26)

In order to make δ̄_m > 0, Møller [10] chooses

\bar{\lambda}_m = 2 \left( \lambda_m - \frac{\delta_m}{\|d_m\|^2} \right)    (27)

Substituting (27) into (26) gives

\bar{\delta}_m = -\delta_m + \lambda_m \|d_m\|^2 = -d_m^T A d_m    (28)

This value is positive and is used as the denominator in (24) to compute the step size α_m. In order to find λ_{m+1}, a comparison parameter is first defined as

\Delta_m = \frac{2 \left[ S(w_m) - S(w_m + \alpha_m d_m) \right]}{-\alpha_m\, g_m^T d_m}    (29)

The value of λ_{m+1} is then adjusted using the following prescriptions:
- If Δ_m > 0.75, then λ_{m+1} = λ_m / 2.
- If 0.25 ≤ Δ_m ≤ 0.75, then λ_{m+1} = λ_m.
- If Δ_m < 0.25, then λ_{m+1} = 4λ_m.
- If Δ_m < 0, then λ_{m+1} = 4λ_m and no step is taken.
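A condensed sketch of a single scaled-conjugate-gradient step following (24)-(29): the Hessian-vector product A d_m is approximated by finite differences, the direction update (19)-(20) is omitted, and all names are our own (see Møller [10] for the complete algorithm):

```python
import numpy as np

def scg_step(S, grad_S, w, d, lam, eps=1e-6):
    """One SCG step: eqs (24)-(29) with the lambda prescriptions."""
    g = grad_S(w)
    Ad = (grad_S(w + eps * d) - g) / eps          # A d by finite differences
    dd = d @ d
    delta = d @ Ad + lam * dd                     # eq. (25)
    if delta <= 0:                                # A not positive definite here
        lam_bar = 2.0 * (lam - delta / dd)        # eq. (27)
        delta = delta + (lam_bar - lam) * dd      # eq. (26), equals -d^T A d
        lam = lam_bar
    alpha = -(g @ d) / delta                      # eq. (24)
    w_new = w + alpha * d
    Delta = 2.0 * (S(w) - S(w_new)) / (-alpha * (g @ d))   # eq. (29)
    if Delta > 0.75:                              # very good step: relax lambda
        lam *= 0.5
    elif Delta < 0.25:                            # poor step: increase lambda
        lam *= 4.0
    if Delta < 0:                                 # error increased: reject step
        w_new = w
    return w_new, lam
```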

IV. HANDS-FREE CONTROL OF POWERED WHEELCHAIRS USING HEAD MOVEMENT

Head movement is one of the most effective hands-free control modes for powered wheelchairs. Joseph [2] reported the use of standard neural networks to classify the head movement of disabled users. Taylor [1] reported the performance of a head-movement interface for wheelchair control accompanied by a standard neural network classifier. Nguyen [3] developed a real-time head-movement system using an embedded neural network Linux implementation. In this paper, we use Bayesian neural networks to classify head movements accurately with limited training data.

In this system, a dual-axis accelerometer is installed in a cap worn by the user to measure head position. A Bayesian neural network classifier is designed to detect four movements: forward, backward, left and right. The success of this system depends heavily on the training of the neural network classifier; the training algorithm modifies the classifier's parameters to reduce the difference between the actual and target outputs. The computer interface module provides important feedback to the user. Real-time graphical displays of the accelerometer data allow the user to track the deviation of his/her head from the neutral position, and the Boolean outputs and numerical values of the neural network inform the user how the classifier is interpreting their head movements.

A. Data Acquisition

Head movement data were collected from eight adults aged between 19 and 56, with the approval of the UTS Human Research Ethics Committee. Four users had high-level spinal cord injuries (C4 and C5) and were unable to use a standard joystick; the remaining four did not have conditions affecting their head movement. The extracted movement samples of these users are shown in Table I. The movement of the user's head is detected by analysing data from the accelerometer, collected with a sampling period of 100 ms. The input to the Bayesian neural network comprises a window of 20 samples from each axis, as shown in Fig. 2. A pre-trained Bayesian neural network was used to classify the windowed data as one of four types of head movement: forward, backward, left or right.

[Fig. 2: Windowed sample patterns of user 1 for the forward, backward, left and right movements; solid line - x data, dotted line - y data.]

TABLE I
EXTRACTED MOVEMENT SAMPLES

User   Forward   Backward   Left   Right   Injury Level
1         20        20       20      20        C5
2         20        20       20      20        C4
3         20        20       20      20        C4
4         20        20       20      20        C5
5         20        20       20      20        -
6         20        20       20      20        -
7         20        20       20      20        -
8         20        20       20      20        -

B. Network Architecture

The Bayesian neural network used to classify head movement has the following architecture:
- 41 inputs, corresponding to 20 samples from the x axis, 20 samples from the y axis, and one augmented input with a constant value of 1;
- three hidden neurons;
- four outputs, each corresponding to one of the classes: forward, backward, left and right.

There are four hyperparameters, ξ_1, ξ_2, ξ_3 and ξ_4, corresponding to the distributions of the weights from the input nodes to the hidden nodes, from the bias input node to the hidden nodes, from the hidden nodes to the output nodes, and from the bias hidden node to the output nodes.
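As an illustration of the 41-input encoding above, a minimal sketch of assembling one network input from a window of accelerometer samples (the function and variable names are our own):

```python
import numpy as np

def make_input(x_samples, y_samples):
    """Build the 41-element network input: 20 x-axis samples,
    20 y-axis samples and one constant bias input of 1."""
    assert len(x_samples) == 20 and len(y_samples) == 20
    return np.concatenate([x_samples, y_samples, [1.0]])
```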

C. Experiment 1

All the available data were divided into two subsets: the first half for training and the second half for testing. The training procedure was then implemented as follows:

1. The weights were initialised randomly and initial values were chosen for the hyperparameters.

2. The network was trained to minimise the total error function S(w) using each of the optimisation training algorithms. For each algorithm, ten networks were trained and the average training time was computed. Fig. 3 shows that the quasi-Newton and scaled conjugate gradient algorithms have the fastest convergence.

[Fig. 3: Average network training time (seconds) of the conjugate gradient, quasi-Newton and scaled conjugate gradient algorithms in Experiment 1.]

3. When the network training reached a local minimum, the values of the hyperparameters were re-estimated as

\xi_g^{new} = \frac{\gamma_g^{old}}{2 E_{W_g}}

Table II shows the change of these hyperparameters over successive re-estimation periods.

TABLE II
THE CHANGE OF HYPERPARAMETERS ACCORDING TO THE PERIODS OF RE-ESTIMATION IN EXPERIMENT I

Period   ξ_1       ξ_2       ξ_3        ξ_4
1        0.22914   0.74056   0.033379   0.04562
2        0.85683   10.738    0.028076   0.11036
3        1.7719    48.878    0.019537   0.21202

4. Steps 2 and 3 were repeated until convergence was achieved, i.e. the total error was smaller than a predetermined value and did not change significantly in subsequent iterations.

As the quasi-Newton training algorithm is the most effective in terms of computational time while reaching a similar total error, it was used to train the Bayesian neural network. The performance of the trained Bayesian neural network was measured on the test subset. The confusion matrix in Table III shows that the network can classify head movement with an accuracy of 99.375%.

D. Experiment 2

The training data were taken from the movement samples of users 1, 2, 5 and 6, and a Bayesian neural network was obtained using the same procedure as in Experiment 1, again trained with the quasi-Newton algorithm. The performance of this Bayesian neural network was then measured on the test data from users 3, 4, 7 and 8. The confusion matrix in Table IV shows that this Bayesian neural network can classify head movement with an accuracy of 95.625%.
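Putting the pieces together, a compact sketch of this train-and-re-estimate loop (steps 1 to 4), chaining the hypothetical helpers sketched in Sections II and III; the objective S and its derivatives are assumed to take the hyperparameters as a second argument:

```python
import numpy as np
from scipy.optimize import minimize

def train_bayesian_mlp(S, grad_S, hess_S, groups_idx, E_W_fn,
                       n_weights, n_groups=4, tol=1e-4, max_periods=10):
    """Steps 1-4: minimise S(w), re-estimate xi via eq. (17), repeat."""
    w = 0.1 * np.random.randn(n_weights)       # step 1: random weights
    xi = [0.1] * n_groups                      # step 1: initial hyperparameters
    prev = np.inf
    for _ in range(max_periods):
        res = minimize(lambda v: S(v, xi), w,  # step 2: quasi-Newton training
                       jac=lambda v: grad_S(v, xi), method='BFGS')
        w = res.x
        A = hess_S(w, xi)                      # Hessian of S at the minimum
        xi = reestimate_xi(A, xi, groups_idx, E_W_fn(w))  # step 3, eq. (17)
        if abs(prev - res.fun) < tol:          # step 4: convergence test
            break
        prev = res.fun
    return w, xi
```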

TABLE III
CONFUSION MATRIX IN EXPERIMENT I

Actual \ Predicted   Forward   Backward   Left   Right
Forward                 79         0        0       1
Backward                 0        80        0       0
Left                     0         0       80       0
Right                    0         1        0      79

Overall accuracy: 99.375%

TABLE IV
CONFUSION MATRIX IN EXPERIMENT II

Actual \ Predicted   Forward   Backward   Left   Right
Forward                 75         0        0       5
Backward                 0        76        4       0
Left                     2         0       77       1
Right                    0         2        0      78

Overall accuracy: 95.625%

TABLE V
SENSITIVITY, SPECIFICITY, POSITIVE PREDICTIVE VALUE (PPV) AND NEGATIVE PREDICTIVE VALUE (NPV) OF THE BAYESIAN NEURAL NETWORKS

                Sensitivity   Specificity   PPV       NPV
Experiment 1    0.99375       0.99792       0.99379   0.99792
Experiment 2    0.95625       0.98542       0.95689   0.98547

The classification results of Experiments 1 and 2 are summarised in Table V. Very high sensitivity (true positive rate) and specificity (true negative rate) were achieved in both experiments. The effectiveness of the quasi-Newton training algorithm in terms of overall computational time, at similar accuracy, will allow us to develop an on-line adaptive Bayesian neural network for each individual disabled operator in the near future.
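For reference, a short sketch of how the per-class metrics in Table V can be recovered from a confusion matrix by macro-averaging over the four classes (the helper is ours, not the paper's); applied to Table III it reproduces the Experiment 1 row:

```python
import numpy as np

def macro_metrics(C):
    """C[i, j] counts class-i samples predicted as class j.
    Returns macro-averaged sensitivity, specificity, PPV and NPV."""
    n = C.sum()
    out = {"sens": [], "spec": [], "ppv": [], "npv": []}
    for k in range(C.shape[0]):
        tp = C[k, k]
        fn = C[k, :].sum() - tp       # class-k samples missed
        fp = C[:, k].sum() - tp       # other samples labelled as k
        tn = n - tp - fn - fp
        out["sens"].append(tp / (tp + fn))
        out["spec"].append(tn / (tn + fp))
        out["ppv"].append(tp / (tp + fp))
        out["npv"].append(tn / (tn + fn))
    return {k: float(np.mean(v)) for k, v in out.items()}

# Confusion matrix of Experiment 1 (Table III):
C1 = np.array([[79, 0, 0, 1],
               [0, 80, 0, 0],
               [0, 0, 80, 0],
               [0, 1, 0, 79]])
print(macro_metrics(C1))  # sens 0.99375, spec 0.99792, ppv 0.99379, npv 0.99792
```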

V. DISCUSSION

The results obtained show that Bayesian neural networks can be used to classify head movement accurately, and the classification results are consistent across the various training algorithms. When all eight users (four able-bodied and four disabled) were represented in the training data, the Bayesian neural network classified the test set very accurately (99.4% accuracy). When the training data were taken from only four users (two able-bodied and two disabled), the resulting Bayesian neural network classified the test set from the other four users with an excellent accuracy of 95.6%. It is clear that Bayesian neural network training allows complex models to be developed without the overfitting problems that can occur with standard neural network training. In addition, the good generalisation ability of the networks, together with the fastest of the training algorithms, holds promise for the development of an effective on-line adaptive training framework for Bayesian neural network head movement classifiers in the near future.

VI. CONCLUSION

We have developed various Bayesian neural networks for the classification of head movement direction with excellent overall accuracy. We have also shown that Bayesian neural networks allow complex models to be developed without the requirement for a validation set, overcoming the overfitting problem that can occur with standard neural network training.

REFERENCES

[1] P. B. Taylor and H. T. Nguyen, "Performance of a head-movement interface for wheelchair control," Proc. 25th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, vol. 2, pp. 1590-1593, 2003.
[2] T. Joseph and H. T. Nguyen, "Neural network control of wheelchairs using telemetric head movement," Proc. 20th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, vol. 5, pp. 2731-2733, 1998.
[3] H. T. Nguyen, L. M. King and G. Knight, "Real-time head-movement system and embedded Linux implementation for the control of power wheelchairs," Proc. 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, San Francisco, USA, 1-5 September 2004, pp. 4892-4895.
[4] S. Nguyen, H. Nguyen, and P. Taylor, "Hands-free control of power wheelchairs using Bayesian neural networks," Proc. IEEE Conference on Cybernetics and Intelligent Systems, pp. 745-749, 2004.
[5] D. J. C. MacKay, "Bayesian interpolation," Neural Computation, vol. 4, pp. 415-447, 1992.
[6] D. J. C. MacKay, "A practical Bayesian framework for backpropagation networks," Neural Computation, vol. 4, pp. 448-472, 1992.
[7] D. J. C. MacKay, "The evidence framework applied to classification networks," Neural Computation, vol. 4, pp. 720-736, 1992.
[8] H. H. Thodberg, "A review of Bayesian neural networks with an application to near infrared spectroscopy," IEEE Transactions on Neural Networks, vol. 7, pp. 56-72, 1996.
[9] C. M. Bishop, Neural Networks for Pattern Recognition. Oxford: Clarendon Press, 1995.
[10] M. F. Møller, "A scaled conjugate gradient algorithm for fast supervised learning," Neural Networks, vol. 6, pp. 525-533, 1993.