Comparison of Linear and Nonlinear Methods for EEG Signal Classification

Deon Garrett, David A. Peterson, Charles W. Anderson, Michael H. Thaut

Abstract— The reliable operation of brain-computer interfaces (BCIs) based on spontaneous electroencephalogram (EEG) signals requires accurate classification of multichannel EEG. The design of EEG representations and classifiers for BCI are open research questions whose difficulty stems from the need to extract complex spatial and temporal patterns from noisy multidimensional time series obtained from EEG measurements. It is possible that the amount of noise in EEG limits the power of nonlinear methods; linear methods may perform just as well as nonlinear methods. This article reports the results of a linear classifier (linear discriminant analysis) and two nonlinear classifiers (neural networks and support vector machines) applied to the classification of spontaneous, six-channel EEG. The nonlinear classifiers produce only slightly better classification results. An approach to feature selection based on genetic algorithms is also presented with preliminary results.

Index Terms— EEG, electroencephalogram, pattern classification, neural networks, support vector machines, feature selection, genetic algorithms

I. INTRODUCTION

Recently, much research has been performed into alternative methods of communication between humans and computers. The standard keyboard/mouse model of computer use is not only unsuitable for many people with disabilities, but also somewhat clumsy for many tasks regardless of the capabilities of the user. Electroencephalogram (EEG) signals provide one possible means of human-computer interaction which requires very little in terms of physical abilities. By training the computer to recognize and classify EEG signals, users could manipulate the machine by merely thinking about what they want it to do within a limited set of choices. Currently, most research into EEG classification uses such machine learning stalwarts as neural networks (NNs). In this article, we examine the application of support vector machines (SVMs) to the problem of EEG classification and compare the results to those obtained using neural networks and linear discriminant analysis. Section II provides an overview of SVM theory and practice, and the problem of multi-class classification is considered in Section III. Section IV discusses the acquisition of EEG signals. The results of this study are detailed in Section V. Section VII describes preliminary experiments using genetic algorithms to search for good subsets of features in an EEG classification problem. Section VIII summarizes the findings of this article and their implications.

D. Garrett is a Ph.D. candidate in the Department of Computer Science, Colorado State University, Fort Collins, CO (e-mail: [email protected]). D. Peterson is a Ph.D. candidate in the Department of Computer Science, Colorado State University, Fort Collins, CO (e-mail: [email protected]). C. Anderson is with the Department of Computer Science, Colorado State University, Fort Collins, CO (e-mail: [email protected]). M. Thaut is with the Department of Music, Theatre, and Dance and the Center for Biomedical Research, Colorado State University, Fort Collins, CO (e-mail: [email protected]). D. Peterson, C. Anderson, and M. Thaut are also with the Molecular, Cellular, and Integrative Neuroscience Program at Colorado State University.

II. SUPPORT VECTOR MACHINES FOR BINARY CLASSIFICATION

The support vector machine (SVM) is a classification method rooted in statistical learning theory. The motivation behind SVMs is to map the input into a high dimensional feature space, in which the data might be linearly separable. In this regard, SVMs are very similar to other neural network based learning machines. The principal difference between these machines and SVMs is that the latter produce the optimal decision surface in the feature space. Conventional neural networks can be difficult to build due to the need to select an appropriate number of hidden units. The network must contain enough hidden units to be able to approximate the function in question to the desired accuracy. However, if the network contains too many hidden units, it may simply memorize the training data, causing very poor generalization. The ability of the machine to learn features of the training data is often referred to as learning capacity, and is formalized in a concept called VC dimension. Support vector machines are constructed by solving a quadratic programming problem. In solving this problem, SVM training algorithms simultaneously maximize the performance of the machine while minimizing a term representing the VC dimension of the learning machine. This minimization of the capacity of the machine ensures that the system cannot overfit the training data, for a given set of parameters.

A. Linear Support Vector Machines

In this section, the training of a support vector machine is described for the case of a binary classification problem for which a linear decision surface exists that can perfectly classify the training data. In later sections, the requirement of linear separability will be relaxed. The assumption of linear separability means that there exists some hyperplane which perfectly separates the data. This hyperplane is a decision surface of the form

    w · x + b = 0,        (1)

where w is an adjustable weight vector, x is an input vector, and b is a bias term.


The assumption of separability means that there exists some set of values w and b, such that the following constraints hold for all input vectors, given that the classes are labeled +1 and −1:

    w · x_i + b ≥ +1   ∀ y_i = +1,        (2)
    w · x_i + b ≤ −1   ∀ y_i = −1,        (3)

or

    y_i (w · x_i + b) − 1 ≥ 0   ∀ i.      (4)
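To make the margin constraints concrete, the short sketch below checks condition (4) for a toy two-dimensional data set; the weight vector, bias, and points are invented purely for illustration and are not taken from the paper's data.

```python
import numpy as np

# Hypothetical separating hyperplane and toy labeled points, chosen only to
# illustrate constraints (2)-(4); they are not drawn from the EEG data.
w = np.array([2.0, -1.0])   # weight vector
b = -1.0                    # bias term

X = np.array([[2.0, 1.0],   # points expected on the +1 side
              [3.0, 0.0],
              [0.0, 1.0],   # points expected on the -1 side
              [1.0, 4.0]])
y = np.array([+1, +1, -1, -1])

# Constraint (4): y_i (w . x_i + b) - 1 >= 0 for every training example.
margins = y * (X @ w + b) - 1.0
print(margins)               # nonnegative entries satisfy (4)
print(np.all(margins >= 0))  # True if the hyperplane separates with unit margin

# Margin of the hyperplane, 2 / ||w|| (defined in the following paragraph).
print(2.0 / np.linalg.norm(w))
```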

As previously stated, the support vector machine training algorithm finds the optimal hyperplane for separation of the training data. Specifically, it finds the hyperplane which maximizes the margin of separation of the classifier. Consider the set of training examples which satisfy (2) exactly. These examples are those which lie closest to the hyperplane on the positive side. Similarly, the training examples satisfying (3) exactly lie closest to the hyperplane on the negative side. These particular training examples are called support vectors. Note that requiring the existence of points exactly satisfying the constraints is equivalent to simply rescaling w and b by an appropriate amount. The distance between these points and the hyperplane is given by 1/‖w‖. We define the margin of the hyperplane to be the distance between the positive examples nearest the hyperplane and the negative examples nearest the hyperplane, which is equal to 2/‖w‖. Therefore, we can maximize the margin of the classifier by minimizing ‖w‖, subject to the constraints of (4). Thus the problem of training the SVM can be stated as follows: find w and b such that the resulting hyperplane correctly classifies the training data and the Euclidean norm of the weight vector is minimized.

To solve the problem described above, it is typically reformulated as a Lagrangian optimization problem. In this reformulation, nonnegative Lagrange multipliers A = {α_1, α_2, ..., α_n} are introduced, yielding the Lagrangian

    L = (1/2)‖w‖² − Σ_{i=1}^{n} α_i (y_i (w · x_i + b) − 1).        (5)

We must minimize this Lagrangian with respect to w and b, and simultaneously maximize with respect to the Lagrange multipliers α_i. Differentiating with respect to w and b and applying the results to the Lagrangian yields two conditions of optimality,

    w = Σ_{i=1}^{n} α_i y_i x_i        (6)

and

    Σ_{i=1}^{n} α_i y_i = 0.           (7)

There are two important consequences of these conditions: the optimal weight vector w_o is described in terms of the training data, and only those training examples whose corresponding Lagrange multipliers are non-zero contribute to w_o. From the Karush-Kuhn-Tucker (KKT) conditions [12], [15], [10], [3], it follows that the training patterns corresponding to the nonzero multipliers are those that satisfy (4) exactly. To understand why this is true, recall that we wish to maximize the Lagrangian L with respect to A. Thus, assuming w and b are constant, the second term of L must be minimized. If (y_i (w · x_i + b) − 1) > 0, then α_i must be zero in order to maximize L. Therefore, only the training points lying closest to the optimal hyperplane, the support vectors, have any effect on its calculation. Substituting the optimality conditions, (6) and (7), into (5) yields the Wolfe dual [3] of the optimization problem: find multipliers α_i such that

    L_D = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j (x_i · x_j)        (8)

is maximized subject to the constraints:

    α_i ≥ 0   ∀ i        (9)

and

    Σ_{i=1}^{n} α_i y_i = 0,        (10)

yielding a decision function of the form,

    f(x) = sign( Σ_{i=1}^{n} α_i y_i (x · x_i) + b ).        (11)

Note that while w is directly determined by the set of support vectors, the bias term b is not. Once the weight vector is known, the bias may be computed by substitution of any support vector into (4) and solving as an equality constraint, although numerically, it is better to take an average over all support vectors.

B. Relaxing the Separability Restriction

The previous derivation assumed that the training data was linearly separable. The constraints of (4) are too rigid for use with non-linearly separable data; they force all training examples to lie outside the margin of the classifier. The key idea in extending the support vector machine to handle nonseparable data is to allow these constraints to be violated, but only if accompanied by a penalty in the objective function. We thus introduce another set of nonnegative slack variables, Ξ = {ξ_1, ξ_2, ..., ξ_n}, into the constraints [7]. The new constraints are

    w · x_i + b ≥ +1 − ξ_i   ∀ y_i = +1,        (12)
    w · x_i + b ≤ −1 + ξ_i   ∀ y_i = −1,        (13)
    ξ_i ≥ 0                  ∀ i.               (14)

An error thus occurs only when ξ_i > 1. Therefore, the sum

    Σ_{i=1}^{n} ξ_i

effectively serves as an upper bound on the number of errors committed by the SVM. We modify the original goal of the optimization problem, minimize ‖w‖, by adding a term to penalize errors. The new optimization problem thus becomes: minimize

    ‖w‖ + C Σ_{i=1}^{n} ξ_i,

where C is a user-defined parameter which controls the degree to which training errors can be tolerated.


Proceeding in a manner analogous to that above, the Wolfe dual of the new Lagrangian is

    L_D = Σ_{i=1}^{n} α_i − (1/2) Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j y_i y_j (x_i · x_j),        (15)

which is identical to (8). As in the separable case, L_D must be maximized subject to constraints on the Lagrange multipliers. However, the addition of the ξ_i produces a subtle difference in these constraints. Specifically, the constraint given in (9) becomes the following:

    0 ≤ α_i ≤ C   ∀ i.        (16)

The second constraint,

    Σ_{i=1}^{n} α_i y_i = 0,        (17)

remains the same as in the separable problem. Thus, bounding the values of the Lagrange multipliers from above allows the support vector machine to construct decision boundaries for training data which cannot be linearly separated.

C. Relaxing the Linearity Restriction

Thus far, it has been assumed that the SVM was to construct a linear boundary between two classes represented by a set of training data. Of course, most interesting problems cannot be adequately classified by a linear machine. In order to generalize the SVM to non-linear decision functions, we introduce the notion of a kernel function [1], [5]. The training data only appears in the optimization problem (15) in the form of dot products between the input vector and the support vectors. If the input vectors are mapped into some high dimensional space via some nonlinear mapping Φ(x), then the optimization problem would consist of dot products in this higher dimensional space, Φ(x_i) · Φ(x_j). Given a kernel function K(x_i, x_j) = Φ(x_i) · Φ(x_j), the optimization problem would be unchanged except that the dot product x_i · x_j would be replaced with the kernel function K(x_i, x_j). The actual mapping Φ(x) would not appear in the optimization problem and would never need to be calculated, or even known. Cover's theorem on the separability of patterns [9] essentially says that data cast nonlinearly into a high dimensional feature space is more likely to be linearly separable there than in a lower dimensional space. Even though the SVM still produces a linear decision function, the function is now linear in the feature space, rather than the input space. Because of the high dimensionality of the feature space, we can expect the linear decision function to perform well, in accordance with Cover's theorem. Viewed another way, because of the nonlinearity of the mapping to feature space, the SVM is capable of producing arbitrary decision functions in input space, depending on the kernel function. Thus the fact that the SVM constructs only hyperplane boundaries is of little consequence.

The above discussion makes use of the kernel function K(x_i, x_j), but does not specify how to choose a suitable kernel. Mercer's theorem [18], [8] provides the theoretical basis for determining whether a given kernel function K is equal to a dot product in some space, the requirement for admissibility as an SVM kernel. A discussion of Mercer's theorem is outside the scope of this paper. Instead, we simply give two examples of suitable kernel functions which will be used here:

    • Polynomial kernel:
        K(x_i, x_j) = (x_i^T x_j + 1)^p        (18)

    • Radial basis function kernel:
        K(x_i, x_j) = exp( −‖x_i − x_j‖² / (2σ²) )        (19)
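To tie equations (11), (18), and (19) together, the sketch below evaluates a kernelized SVM decision function on a toy set of support vectors; the multipliers, labels, and bias are invented solely for illustration and are not fitted values from this study.

```python
import numpy as np

def rbf_kernel(xi, xj, sigma=0.5):
    # Radial basis function kernel, equation (19).
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def poly_kernel(xi, xj, p=3):
    # Polynomial kernel, equation (18).
    return (np.dot(xi, xj) + 1.0) ** p

def svm_decision(x, support_vectors, alphas, labels, b, kernel):
    # Kernelized form of the decision function, equation (11), with the
    # dot product x . x_i replaced by K(x, x_i).
    s = sum(a * y * kernel(x, sv)
            for a, y, sv in zip(alphas, labels, support_vectors))
    return int(np.sign(s + b))

# Toy support vectors and multipliers (hypothetical, for illustration only).
support_vectors = [np.array([0.0, 1.0]), np.array([1.0, 0.0]), np.array([2.0, 2.0])]
alphas = [0.7, 0.4, 1.1]
labels = [+1, -1, +1]
b = 0.05

x_new = np.array([1.5, 1.5])
print(svm_decision(x_new, support_vectors, alphas, labels, b, rbf_kernel))
print(svm_decision(x_new, support_vectors, alphas, labels, b, poly_kernel))
```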

III. MULTI-CLASS CLASSIFICATION

The best way to generalize SVMs to the multi-class case is an ongoing research problem. One such method, proposed by Platt et al. [23], is based on the notion of Decision Directed Acyclic Graphs (DDAGs). A given DDAG is evaluated much like a binary decision tree, where each internal node implements a decision between two of the k classes of the classification problem. At each node, one class is eliminated from consideration. When the traversal of the graph reaches a terminal node, only one class is left and the decision is made. The principal difference between the DDAG and the conventional decision tree is that DDAGs are not constrained in the same manner as trees. However, a DDAG does not take on arbitrary graph structures. It is a specific form of graph which differs from a tree only in how it handles duplication of decisions. In a decision tree, if the same decision is required in multiple locations in the tree, then each decision is represented through distinct but identical nodes. A DDAG allows two nodes to share a child. Because an algorithm using the DDAG has no need to backtrack through the graph, the algorithm can treat the graph as though it is a standard decision tree. In the so-called DAGSVM algorithm, each decision node uses a 1-v-1 SVM to determine which class to eliminate from consideration. A separate classifier must be constructed to separate all pairs of classes. For the EEG classification task presented here, there are five classes, and therefore a total of ten SVMs. Because each classifier deals only with approximately 40% of the available training data, assuming that each class is represented nearly equally, each may be trained relatively quickly. In addition, only four of the classifiers are used to classify any given unknown input. Figure 1 shows a possible DDAG for the EEG classification task.
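The following sketch shows how a DDAG over the five mental tasks could be evaluated at classification time; the pairwise decisions are stubbed out with a placeholder function, since the trained 1-v-1 SVMs themselves are not reproduced here.

```python
# Minimal sketch of DAGSVM evaluation for k classes, assuming a function that
# returns the winner of each pairwise comparison. The pairwise_decide stub is
# hypothetical; in practice each comparison would call a trained 1-v-1 SVM.
def dagsvm_classify(x, classes, pairwise_decide):
    """classes: list of k class labels.
    pairwise_decide(x, a, b): returns the winning label, a or b."""
    remaining = list(classes)
    # Each node eliminates one class, so k - 1 comparisons are evaluated.
    while len(remaining) > 1:
        a, b = remaining[0], remaining[-1]
        winner = pairwise_decide(x, a, b)
        if winner == a:
            remaining.pop()      # b eliminated
        else:
            remaining.pop(0)     # a eliminated
    return remaining[0]

# Example with the five mental tasks and a dummy pairwise rule.
tasks = ["rest", "math", "letter", "rotate", "count"]
dummy = lambda x, a, b: a if len(a) <= len(b) else b   # placeholder decision
print(dagsvm_classify(None, tasks, dummy))
```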


Fig. 1. A Decision Directed Acyclic Graph (DDAG) for the EEG classification problem. Each node represents a 1-v-1 SVM trained to differentiate between the two classes compared by the node.

IV. EEG SIGNAL ACQUISITION

The data used in this study were from the work of Keirn and Aunon [13], [14] and were collected using the following procedure. Subjects were placed in a dim, sound-controlled room and electrodes were placed at positions C3, C4, P3, P4, O1, and O2 as defined by the 10-20 system of electrode placement [11] and referenced to two electrically linked mastoids at A1 and A2. The impedance of all electrodes was kept below five kilohms. Data were recorded at a sampling rate of 250 Hz with a Lab Master 12-bit A/D converter mounted in an IBM-AT computer. Before each recording session, the system was calibrated with a known voltage. The electrodes were connected through a bank of Grass 7P511 amplifiers with analog bandpass filters from 0.1–100 Hz. Eye blinks were detected by means of a separate channel of data recorded from two electrodes placed above and below the subject's left eye. An eye blink was defined as a change in magnitude greater than 100 µV within a 10 millisecond period.

With the recording instruments in place, the subjects were asked to perform five separate mental tasks. These tasks were chosen to invoke hemispheric brainwave asymmetry. The subjects were first asked to relax as much as possible. This task represents the baseline against which the other tasks are compared. The subjects were also asked to mentally compose a letter to a friend, compute a non-trivial multiplication problem, visualize a sequence of numbers being written on a blackboard, and rotate a 3-dimensional solid. For each of these tasks, the subjects were asked not to vocalize or gesture in any way. Data were recorded for 10 seconds for each task, and each task was repeated five times during each session. The data from each channel were divided into half-second segments overlapping by one quarter-second. After segments containing eye blinks were discarded, the remaining data contained at most 39 segments.

V. RESULTS

In testing the classification algorithms, five trials from one subject were selected from one day of experiments. Each trial consisted of the subject performing all five mental tasks. The first classifier tested is linear discriminant analysis (LDA). The second type of classifier is a feedforward neural network consisting of 36 input units, 20 hidden units, and five binary output units. The activation function at each unit is the tanh function. The networks were trained using backpropagation with a learning rate of 0.1 and no momentum term. Training was halted after 2,000 iterations or when generalization began to fail, as determined by a small set of validation data chosen without replacement from the training data. The third type of classifier is the support vector machine (SVM), trained using radial basis function (RBF) kernels or polynomial kernels. The RBF-based classifiers were trained using 0.5, 1.0, and 2.0 as standard deviations of the kernel functions.

Polynomial kernels of degrees two, three, five, and ten were trained to test the polynomial machines. For all kernel functions, the regularization parameter C was tested at values 1.0, 10.0, and 100.0. The support vector machines were trained and tested using the DAGSVM algorithm described earlier. Each of the 1-v-1 SVMs was trained using Platt's Sequential Minimal Optimization (SMO) algorithm [21], [22]. SMO reduces the quadratic programming stage of training to a series of pairwise optimizations among the Lagrange multipliers. By solving the optimization problem two variables at a time, the optimization can be performed analytically. Platt shows significant speedups resulting from the SMO algorithm as compared to using a traditional quadratic programming routine.

The training data were selected from the full set of five trials as follows. One trial was selected as test data. Of the four remaining trials, one was chosen to be a validation set, which was used to determine when to halt training of the neural networks and which values of the kernel parameters and regularization parameter to use for the SVM tests. Finally, the remaining three trials were compiled into one set of training data. The experiments were repeated for each of the 20 ways to partition the five trials in this manner, and the results of the 20 experiments were averaged to produce the results shown in Table I (the partitioning is illustrated in the sketch at the end of this section). This choice of training paradigm is based on earlier results [2]. The SVM results reported in Table I are those corresponding to the choice of kernel function and regularization parameter, C, which produced the best results. Specifically, the SVM used for the comparisons was constructed with a radial basis function (RBF) kernel using a standard deviation σ = 0.5 and a regularization parameter C equal to 1.

LDA provides extremely fast evaluations of unknown inputs, performed by distance calculations between a new sample and the mean of training data samples in each class, weighted by their covariance matrices. Neural networks are also efficient after the training phase is complete. SVMs are similar to neural networks, but generally require more computation due to the comparatively large numbers of support vectors. The time required to compute class membership for an SVM is directly dependent on the number of support vectors. The number of support vectors resulting from the experiments reported here ranged from 140 to 308.
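The 20-way partitioning of the five trials described above (each trial serving once as test data, with each of the remaining four serving once as validation data) can be enumerated directly; a minimal sketch, assuming the trials are simply indexed 1 through 5:

```python
from itertools import permutations

trials = [1, 2, 3, 4, 5]

# Every ordered choice of (test, validation) trial; the other three trials
# form the training set. 5 * 4 = 20 partitions, matching the 20 experiments
# averaged to produce Table I.
partitions = []
for test, val in permutations(trials, 2):
    train = [t for t in trials if t not in (test, val)]
    partitions.append((test, val, train))

print(len(partitions))        # 20
print(partitions[0])          # e.g. (1, 2, [3, 4, 5])
```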


TABLE I
Percentage of test data correctly classified, broken down by task. The support vector machine in these experiments used the set of parameters which resulted in the highest correct rate of classification among all SVMs tested.

Classifier   Rest   Math   Letter   Rotate   Count   Total   Average over 20 windows
LDA          47.3   45.1   51.1     38.8     44.5    44.8    66.0
NN           64.3   47.3   54.7     51.1     47.3    52.8    69.4
SVM          59.4   44.5   52.7     57.0     47.9    52.3    72.0

VI. FUTURE WORK

All data used in this study were collected from a single subject during the same day. A logical next step is to test the performance of the classifiers on data collected on later days and to repeat these experiments on data collected from other subjects. In addition, there have been several other attempts at generalizing kernel based learners to multi-class classification. Weston and Watkins [24] have extended the theory of SVMs directly into the multi-class domain. Import Vector Machines [28] seem to offer similar performance while using significantly fewer support vectors. Each of these methods provides a slightly different approach to the classification problem, and could offer performance improvements.

VII. FEATURE SELECTION WITH GENETIC ALGORITHMS

A. Introduction

Both invasive and non-invasive BCI systems produce a very large amount of electrophysiological data. However, only a relatively small percentage of the potentially informative features of the data are utilized. High-resolution analysis of spatial, temporal, and spectral aspects of the data, allowing for their interactions, leads to a very high dimensional feature space. Leveraging a higher percentage of the potential features in the measured data requires more powerful signal analysis and classification capabilities. We have developed an EEG analysis system that integrates advanced concepts and tools from the fields of machine learning and artificial intelligence to address this challenge. One of our initial test applications of the system is the "self-paced key typing" dataset from Blankertz et al. [4], which includes 413 pre-key-press epochs of EEG recorded from one subject.

B. Method

The overall system is composed of two main parts: feature composition and feature selection (see Figure 2). Feature composition entails data preprocessing, feature derivation, and assembling all of the features into a single large feature matrix. In this specific experiment, we used a fixed set of six electrodes: F3, F4, C3, C4, CP3, and CP4. We partitioned each trial into 500 ms windows shifted by 100 ms over the entire epoch, zero-meaned the signals, zero-padded them to length 1024, and computed their power spectra at 1 Hz frequency resolution. We then used the mean power over the standard EEG frequency bands: delta (2-4 Hz), theta (4-8 Hz), alpha (8-13 Hz), beta1 (13-20 Hz), beta2 (20-35 Hz), and gamma (35-46 Hz).
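A minimal sketch of the feature composition just described, computing mean band power per window and channel; the sampling rate is treated as an assumed parameter (it is not restated in the text), and the band edges follow the paragraph above.

```python
import numpy as np

# Standard EEG bands as listed in the text (Hz).
BANDS = {"delta": (2, 4), "theta": (4, 8), "alpha": (8, 13),
         "beta1": (13, 20), "beta2": (20, 35), "gamma": (35, 46)}

def band_power_features(trial, fs, win_ms=500, step_ms=100, nfft=1024):
    """trial: array of shape (n_channels, n_samples); fs: sampling rate (assumed).
    Returns an array of shape (n_windows, n_channels * n_bands)."""
    win = int(fs * win_ms / 1000)
    step = int(fs * step_ms / 1000)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    features = []
    for start in range(0, trial.shape[1] - win + 1, step):
        seg = trial[:, start:start + win]
        seg = seg - seg.mean(axis=1, keepdims=True)              # zero-mean
        spec = np.abs(np.fft.rfft(seg, n=nfft, axis=1)) ** 2     # power spectrum
        row = [spec[:, (freqs >= lo) & (freqs < hi)].mean(axis=1)
               for lo, hi in BANDS.values()]                      # mean band power
        features.append(np.concatenate(row))
    return np.array(features)

# Toy usage with random data standing in for one six-channel epoch.
rng = np.random.default_rng(0)
epoch = rng.standard_normal((6, 300))        # 6 channels, 3 s at an assumed 100 Hz
print(band_power_features(epoch, fs=100).shape)
```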

The feature selection part included a support vector machine (SVM) for predicting (classifying) the pressed-key laterality and a genetic algorithm (GA) for searching the space of feature subsets [25], [27]. We used the radial basis function kernel with γ = 0.2. The SVM has several advantages over alternative classifiers. Unlike most neural network classifiers, the SVM is not susceptible to local optima. SVMs involve many fewer parameters than neural networks, have built-in regularization, are theoretically well-grounded, and, particularly important for ultimate real-time use in a BCI, are extremely fast.

The GA was implemented with a population of 20, a 2-point crossover probability of 0.66, and a mutation rate of 0.008. Individuals in the population were binary strings, with 1 indicating that a feature was included and 0 indicating that it was not. We used a GA to search the space of feature subsets for two main reasons. First, exhaustive exploration of search spaces with greater than about 20 features is computationally intractable (i.e., 2^20 possible subsets). Second, unlike gradient-based search methods, the GA is inherently designed to avoid the pitfall of local optima. We searched over the eleven time windows and six frequency bands, while always including all six electrodes. Thus, the dimensionality of the searchable feature space was 66 (11 × 6). Each time an individual (feature subset) in the GA population was evaluated, we trained and tested the SVM using 10×10-fold cross-validation and used the average classification accuracy as the individual's fitness measure; a simplified sketch of this wrapper evaluation appears after the Results below.

C. Results

The GA evolves a population of feature subsets whose corresponding fitness (classification accuracy) improves over iterations of the GA (Figure 3). Although both the population-average and best-individual fitness improve over successive generations, only the best-individual fitness does so in a monotonic fashion. The best fitness obtained was a classification accuracy of 76%, and it was stable for over 50 generations of the GA. The standard deviation of the classification accuracy produced by the SVM was typically about 6%. Figure 4 shows the feature subset exhibiting the highest classification accuracy. The feature subset included features from every time window and every frequency band. This suggests that alternative methods that include only a few time windows or frequencies may be missing features that could improve classification accuracy. Furthermore, all frequency bands were included in the third time window, suggesting that early wideband activity may be a significant feature of the process for deciding finger laterality.
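A heavily simplified sketch of the wrapper-style GA loop described in the Method subsection, assuming a fitness function that trains and cross-validates an SVM on the selected feature columns (stubbed here with a placeholder so the control flow runs); the population size, crossover probability, and mutation rate follow the text, while the selection scheme is an assumption not specified in the paper.

```python
import random

N_FEATURES = 66        # 11 time windows x 6 frequency bands
POP_SIZE, P_CROSS, P_MUT = 20, 0.66, 0.008

def fitness(mask):
    # Placeholder for 10x10-fold cross-validated SVM accuracy on the feature
    # columns where mask[i] == 1 (hypothetical stand-in, not the real measure).
    return sum(mask) / N_FEATURES + random.uniform(0, 0.1)

def crossover(a, b):
    if random.random() < P_CROSS:                       # 2-point crossover
        i, j = sorted(random.sample(range(N_FEATURES), 2))
        a, b = a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]
    return a, b

def mutate(mask):
    return [bit ^ 1 if random.random() < P_MUT else bit for bit in mask]

population = [[random.randint(0, 1) for _ in range(N_FEATURES)]
              for _ in range(POP_SIZE)]

for generation in range(50):
    ranked = sorted(population, key=fitness, reverse=True)
    parents = ranked[:POP_SIZE // 2]      # truncation selection (an assumption)
    children = []
    while len(children) < POP_SIZE:
        a, b = random.sample(parents, 2)
        for child in crossover(a[:], b[:]):
            children.append(mutate(child))
    population = children[:POP_SIZE]

best = max(population, key=fitness)
print(sum(best), "of", N_FEATURES, "features selected")
```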


Fig. 2. System Architecture for mining the EEG feature space. The space of feature subsets is searched in a “wrapper” fashion, whereby the search is directed by the performance of the classifier, in this case a support vector machine.

Fig. 3. Classification accuracy (population fitness) evolves over iterations (generations) of the genetic algorithm. Thin line is the average fitness of the population. Thick line is the fitness of the best individual in the population.

D. Discussion

Although the best classification accuracy (76%) was considerably higher than chance, it was much lower than the approximately 95% classification accuracy obtained by Blankertz et al. [4]. One possible reason is that we used data from only a small subset of the electrodes recorded (6 of 27) in order to reduce computation time by restricting the dimensionality of the feature vector presented to the SVM. Optimizing classification accuracy was not, however, our primary goal. Instead, we sought insight into the nature of the features that would provide the best classification accuracy. The feature selection method showed that a diverse subset of spectrotemporal features in the EEG contributed to the best classification accuracy. However, most BCIs that use EEG frequency information in imagined or real movement look at only the alpha (mu) and beta bands over only one or a few time windows [20], [19], [26]. Furthermore, the system is amenable to on-line applications. One could use the full system, including the GA, to learn the best dissociating features for a given subject and task, then use the trained SVM with those features in real time. Thus, preliminary results from this research suggest that BCI performance could be improved by leveraging advances in machine learning and artificial intelligence for systematic exploration of the EEG feature space.

VIII. CONCLUSIONS

Support vector machines provide a powerful method for data classification. The SVM algorithm has a very solid foundation in statistical learning theory and is guaranteed to find the optimal decision function for a set of training data, given a set of parameters determining the operation of the SVM. The empirical evidence presented here shows that the algorithm performs very well on one real problem. Finally, we are currently working with alternative representations of the EEG data. Preliminary results indicate that applying a KL-transform to the raw data produces a data set which is much more susceptible to accurate classification by many types of classifiers.

Fig. 4. Features selected for the best individual. Black indicates the feature was included in the subset; white indicates it was not. Time windows correspond to the number of 100 ms shifts from epoch onset, i.e., time window 1 is early in the epoch and time window 11 ends 120 ms before the key press.

ACKNOWLEDGMENT

The support vector machines used in the classification experiments (Sections II–V) were constructed using the SVM MATLAB toolbox developed by Cawley [6]. The SVM used in the feature selection experiments (Section VII) was implemented with the OSU SVM Classifier Matlab Toolbox [17]. The GA was implemented with the commercial FlexTool GA software [16].

REFERENCES

[1] A. Aizerman, E. M. Braverman, and L. I. Rozoner. Theoretical foundations of the potential function method in pattern recognition learning. In Automation and Remote Control, 1964.
[2] C. W. Anderson, S. V. Devulapalli, and E. A. Stolz. Determining mental state from EEG signals using neural networks. Scientific Programming, 4(3):171–183, 1995.
[3] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 1995.
[4] B. Blankertz, G. Curio, and K.-R. Müller. Classifying single trial EEG: Towards brain computer interfacing. In Advances in Neural Information Processing Systems 14, 2002. To appear.
[5] B. E. Boser, I. M. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Fifth Annual Workshop on Computational Learning Theory, 1992.
[6] G. C. Cawley. MATLAB support vector machine toolbox, http://theoval.sys.uea.ac.uk/~gcc/svm/toolbox. University of East Anglia, School of Information Systems, Norwich, Norfolk, U.K. NR4 7TJ, 2000.


[7] C. Cortes and V. Vapnik. Support vector networks. Machine Learning, 20, 1995.
[8] R. Courant and D. Hilbert. Methods of Mathematical Physics, volumes I and II. Wiley Interscience, 1970.
[9] T. M. Cover. Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. In IEEE Transactions on Electronic Computers, 1965.
[10] R. Fletcher. Practical Methods of Optimization. John Wiley and Sons, Inc., 2nd edition, 1987.
[11] H. Jasper. The ten twenty electrode system of the international federation. Electroencephalography and Clinical Neurophysiology, 10:371–375, 1958.
[12] W. Karush. Minima of functions of several variables with inequalities as side constraints. Master's thesis, University of Chicago, 1939.
[13] Z. A. Keirn. Alternative modes of communication between man and machine. Master's thesis, Purdue University, West Lafayette, IN, 1988.
[14] Z. A. Keirn and J. I. Aunon. A new mode of communication between man and his surroundings. IEEE Transactions on Biomedical Engineering, 37(12):1209–1214, 1990.
[15] H. W. Kuhn and A. W. Tucker. Nonlinear programming. In Proceedings of the 2nd Berkeley Symposium on Mathematical Statistics and Probability, 1951.
[16] CynapSys LLC. FlexGA. www.cynapsys.com, 2002.
[17] J. Ma, Y. Zhao, and S. Ahalt. OSU SVM Classifier Matlab Toolbox. http://eewww.eng.ohio-state.edu/~maj/osu svm/, 2002.
[18] J. Mercer. Functions of positive and negative type, and their connection with the theory of integral equations. In Transactions of the London Philosophical Society, 1909.
[19] G. Pfurtscheller, C. Neuper, C. Guger, W. Harkam, H. Ramoser, A. Schlögl, B. Obermaier, and M. Pregenzer. Current trends in Graz brain-computer interface (BCI) research. IEEE Transactions on Rehabilitation Engineering, 8(2):456–460, 2000.
[20] J. A. Pineda, B. Z. Allison, and A. Vankov. The effects of self-movement, observation, and imagination on mu rhythms and readiness potentials: Toward a brain-computer interface. IEEE Transactions on Rehabilitation Engineering, 8(2):219–222, June 2000.
[21] J. C. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods – Support Vector Learning, pages 185–208. MIT Press, 1998.
[22] J. C. Platt. Using analytic QP and sparseness to speed training of support vector machines. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems 11. MIT Press, 1999.
[23] J. C. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In S. A. Solla, T. K. Leen, and K.-R. Müller, editors, Advances in Neural Information Processing Systems 12, pages 547–553, 2000.
[24] J. Weston and C. Watkins. Multi-class support vector machines. Technical report, Royal Holloway, University of London, 1998.
[25] D. Whitley, R. Beveridge, C. Guerra, and C. Graves. Messy genetic algorithms for subset feature selection. In T. Baeck, editor, Proc. Int. Conf. on Genetic Algorithms, Boston, MA, 1997. Morgan Kaufmann.
[26] J. R. Wolpaw, D. J. McFarland, and T. M. Vaughan. Brain-computer interface research at the Wadsworth Center. IEEE Transactions on Rehabilitation Engineering, 8(2):222–226, June 2000.
[27] J. Yang and V. Honavar. Feature subset selection using a genetic algorithm. In H. Liu and H. Motoda, editors, Feature Extraction, Construction and Selection: A Data Mining Perspective, pages 117–136. Kluwer Academic, Boston, MA, 1998.
[28] J. Zhu and T. Hastie. Kernel logistic regression and the import vector machine. In NIPS 2001, 2001.