Linear Methods for Classification

Linear Methods for Classification Seminar, 30 June 2011 Department of Computer Science Hong Kong Baptist University

Kar-Ann Toh

Biometrics Engineering Research Center School of Electrical & Electronic Engineering Yonsei University, Seoul, Korea

Outline 1. Introduction • Pattern classification • Major issues • Motivation 2. Preliminaries • Brief review of literature • Key ideas in classification • Linear parametric models – A reduced polynomial model – A single-hidden-layer feedforward network

BERC, Yonsei University

1

Outline 3. Minimization of Classification Error • A total error rate formulation • Extension to multi-category problems and generalization • Networks with a huge number of hidden nodes • A binary case study • Experiments 4. Maximization of Area Under ROC Curve • Multibiometric fusion • AUC optimization • Experiments 5. Conclusion


Introduction


INTRODUCTION - 1

Pattern classification • “Sorting incoming fish on a conveyor according to species using optical sensing.” • Species: sea bass or salmon

Figure 1: (a) Sea bass, (b) Salmon


INTRODUCTION - 2

Applications • Biometrics: face recognition, fingerprint verification, iris verification, multimodal biometrics fusion, etc. • Bio-informatics: gene sequence alignment, gene finding, gene expression and patterns, protein structure alignment, etc. • Medical diagnosis: diagnosis of cancer, cardiac catheterization, brain tissue classification, and other diagnoses based on Magnetic Resonance Imaging (MRI), Computed Tomography (CT), Positron Emission Tomography (PET) and X-ray. • Financial: trend forecasting, stock selection, credit application scoring, fraud detection, etc. • Industry: product sorting on quality control lines, fault detection and diagnosis, process monitoring and control, etc.


INTRODUCTION - 3

Major Issues • Predictivity or Generalization: The generalization performance of a learning method relates to its prediction capability on independent test data. Assessment of this performance is extremely important in practice, since it guides the choice of learning method or model, and gives us a measure of the quality of the ultimately chosen model. • Learning Efficiency: model complexity, stability, global or local solution, initialization, training time and etc. • Model Assessment and Selection: training error, test error, cross-validation, bootstrap, equal error rate (EER), receiver operating characteristics (ROC), detection error tradeoff (DET).


INTRODUCTION - 4

Motivation • Linear estimation methods have been widely adopted in many fields of research and application. • However, for pattern classification, there exist certain discrepancies between the least-squares fit error and the actual classification error of interest. • We are thus motivated to seek a minimum-classification-error based formulation for such linear estimation methods.


INTRODUCTION - 5

Motivation • The minimum classification error (MCE) problem has been studied in the early literature (see e.g. [1, 2] and references therein). • Apart from traditional nonparametric and parametric means [2, 3, 4, 5, 6, 7], the problem was recently addressed in several ways. – In [8], the authors seek to directly minimize classification error by backpropagating error only on misclassified patterns from culprit output nodes. – In [9], an RBFN which focused on exploiting the weight structure under a Gaussian assumption was proposed for discriminant analysis.


INTRODUCTION - 6

Motivation • In contrast to the previous literature, this work proposes a deterministic approach to solving a classification-error based objective. – It proposes a closed-form network solution for classification-based learning where the estimation can be determined or over-determined. – It shows that the solution can be related to a specific case of weighted least-squares, leading to a robust tuning procedure that caters for different scenarios of data distributions.


Preliminaries


PRELIMINARIES - 1

Brief review of literature • A key component of the learning process in intelligence • An inherent capability behind our everyday decisions • Thus, it may be traced back as early as the existence of human beings


PRELIMINARIES - 2

Early Notions • Eastern: Laozi (6th Century BC, see e.g. [10]), Confucius (551BC-479BC, see e.g. [10]), Bodhidharma (470-543, see e.g. [7], p.18-19)

Figure 2: (a) Laozi: dao-de-jing, (b) Confucius: personal life and family


PRELIMINARIES - 3

Early Notions • Western: Plato (427BC-347BC, see e.g. [10]), Aristotle (384BC-322BC, see e.g. [10])

They distinguished an “essential property” from an “accidental property”. Pattern recognition can be cast as the problem of finding such essential properties of a category (see e.g. [7], p.18-19).


PRELIMINARIES - 4

Some Important Developments • Early days-1950s: – Bayes (Thomas Bayes 1763; Laplace, 1812, see [7], p.63-83) – LSE (Gauss-Legendre, 1795-1805 [11]); Linear Discriminant Function (Ronald A. Fisher, 1936, see [7], p.270-281) – Early Perceptron (McCulloch & Pitts, 1943); Networks of NAND gates (Turing, 1948); Functional approximation (Kolmogorov, 1957); Learning in Perceptron Network (Rosenblatt, 1958, 1962); see [7], p.270-281 and p.333-349. – Nearest Neighbor (Fix & Hodges, 1951, 1952 [12, 13]) • 1960-1970s: – Fuzzy Logic (Zadeh, 1965 [14]) – k-Nearest Neighbor (Patrick & Fischer, 1970 [15]) – Early Evolution Algorithms (Fogel, Owens & Walsh, 1966; Rechenberg, 1973, see [7], p.381-392) – Genetic Algorithms (Holland, 1975 [16])


PRELIMINARIES - 5

Some Important Developments • 1980s: – Hopfield network (Hopfield, 1982 [17]) – Self-Organizing Maps, SOM (Kohonen, 1982 [18]) – Backpropagation Network, BP (Rumelhart, Hinton and Williams, 1986 [19]) – Tree classifier: ID3 (Quinlan, 1983 [20]) • 1990s: – Support Vector Machines, SVM (Boser, Guyon & Vapnik, 1992 [21]) – Tree classifiers: C4.5 (Quinlan, 1993 [22]); Classification And Regression Trees, CART (Breiman, Friedman, Olshen, and Stone, 1993 [23]); Multivariate Decision Trees (Brodley & Utgoff, 1995 [24]) – Bagging (Breiman, 1994, 1996 [25]) – Adaboost (Freund & Schapire, 1995 [26])


PRELIMINARIES - 6

Some Recent Developments • 2000s: – Kernel Discriminant Methods: Generalized Discriminant Analysis, GDA (Baudat & Anouar, 2000 [27]); Kernel-Direct-Discriminant-Analysis, KDDA (Lu, Konstantinos & Plataniotis, 2003 [28]); φ-Machine (Precup & Utgoff, 2004 [29]) – Regression and Approximation: RM (Toh et al., 2002, 2004 [30]), ELM (Huang et al., 2004, 2006 [31]) – Classification-Based Objective functions: CB (Rimer & Martinez, 2006 [8]); TER and ROC (Toh, 2006 [32]) – Hybrid and Volumetric methods: ART (Assoc. Rule Tree by Berzal et al., 2004 [33]); Scaling-Space (Toh, 2006 [34]) – and more...


PRELIMINARIES - 7

Key ideas in classification • Regression Error/Least-Squares Error • Discriminant Analysis • Prototype Methods • Support Vectors (Margin) • Probability/Likelihood • Cross-validation • Aggregation and Boosting • Classification Error • Receiver Operating Characteristics



PRELIMINARIES - 14

Linear parametric models Consider a linear projection model given by:

g(α, x) = Σ_{j=1}^{D} α_j p_j(x) = p(x)^T α,   (1)

where each term p_j(x) is an element of the column vector p(x) that maps the input vector x ∈ R^d into a feature space R^D, and α ∈ R^D corresponds to a vector of weighting coefficients to be estimated. Here, we note that by incorporating a nonlinear mapping p : R^d → R^D, the linear parametric model extends its capability to map nonlinear input-output spaces.


PRELIMINARIES - 15

Linear parametric models Reduced multivariate polynomial model:

p_RM(α, x) = α_0 + Σ_{k=1}^{r} Σ_{j=1}^{l} α_{kj} x_j^k + Σ_{j=1}^{r} α_{rl+j} (x_1 + x_2 + · · · + x_l)^j + Σ_{j=2}^{r} (α_j^T · x)(x_1 + x_2 + · · · + x_l)^{j−1},   l, r ≥ 2.   (2)

The number of terms in this model can be expressed as: K = 1 + r + l(2r − 1).
• K.-A. Toh, Q.-L. Tran, and D. Srinivasan, “Benchmarking a Reduced Multivariate Polynomial Pattern Classifier,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 740-755, June 2004.
• K.-A. Toh, “Training a Reciprocal-Sigmoid Classifier by Feature Scaling-Space,” Machine Learning, vol. 65, no. 1, pp. 273-308, October 2006.
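The term structure of eq. (2) and the count K = 1 + r + l(2r − 1) can be checked with a small sketch. The function name and the exact ordering of terms below are assumptions for illustration, not the authors' implementation:

```python
import numpy as np

def rm_features(x, r=2):
    """Sketch of the reduced multivariate polynomial (RM) expansion of eq. (2).

    x : 1-D input vector of length l. Returns the feature vector whose inner
    product with a parameter vector gives p_RM(alpha, x)."""
    x = np.asarray(x, dtype=float)
    s = x.sum()                       # x1 + x2 + ... + xl
    feats = [1.0]                     # bias term alpha_0
    for k in range(1, r + 1):         # alpha_kj * x_j^k terms (r*l of them)
        feats.extend(x ** k)
    for j in range(1, r + 1):         # (x1+...+xl)^j terms (r of them)
        feats.append(s ** j)
    for j in range(2, r + 1):         # x_j * (x1+...+xl)^(j-1) terms (l*(r-1))
        feats.extend(x * s ** (j - 1))
    return np.array(feats)

l, r = 4, 3
K = 1 + r + l * (2 * r - 1)           # closed-form term count: 24 here
phi = rm_features(np.ones(l), r=r)
assert phi.size == K                  # the expansion matches the count formula
```

The assertion confirms that 1 (bias) + rl + r + l(r − 1) terms indeed sum to K = 1 + r + l(2r − 1).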


PRELIMINARIES - 16

Linear parametric models Single-hidden-layer feedforward network: A standard Single-hidden-Layer Feedforward Network (SLFN) with p hidden nodes and activation function φ:

g_i = Σ_{j=1}^{p} β_j φ(w_j · x_i + b_j),   i = 1, · · · , m,   (3)

where w_j = [w_{j1}, w_{j2}, · · · , w_{jd}]^T is the weight vector connecting the jth hidden node to the input nodes, x_i = [x_{i1}, x_{i2}, · · · , x_{id}]^T ∈ R^d is a d-dimensional input vector, β_j = [β_{j1}, β_{j2}, · · · , β_{jq}]^T is the weight vector connecting the jth hidden node to the output nodes, b_j is the threshold of the jth hidden node, and g_i = [g_{i1}, g_{i2}, · · · , g_{iq}]^T is the q-dimensional network output.


PRELIMINARIES - 17

The m equations above can be written more compactly as:

HΘ = Y,   (4)

where

H = [ φ(w_1 · x_1 + b_1)   · · ·   φ(w_p · x_1 + b_p)
      ...                          ...
      φ(w_1 · x_m + b_1)   · · ·   φ(w_p · x_m + b_p) ]_{m×p},   (5)

Θ = [β_1, · · · , β_q]_{p×q} (∀β_k ∈ R^p), and Y = [y_1, · · · , y_q]_{m×q} (∀y_k ∈ R^m) packs the output target vectors.


PRELIMINARIES - 18

Under the scenario of • an over-determined system, where H is a nonsquare matrix, and • hidden-node weights that can be fixed in advance, such as being generated from a random process, a least-squares solution can be obtained for the output weighting parameters as:

ELM:   Θ = H† Y = (H^T H)^{−1} H^T Y,   (6)

where H† is the Moore-Penrose generalized inverse of the matrix H. This closed-form LSE-based learning method was called ELM in [35, 31], where the universal approximation capability of the SLFN has been investigated and shown.
• G.-B. Huang, L. Chen, and C.-K. Siew, “Universal Approximation Using Incremental Constructive Feedforward Networks with Random Hidden Nodes,” IEEE Trans. on Neural Networks, vol. 17, no. 4, pp. 879-892, 2006.
• G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme Learning Machine: Theory and Applications,” Neurocomputing, vol. 70, pp. 489-501, 2006.
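The ELM recipe of eq. (6) can be sketched in a few lines of NumPy. The toy data, network sizes, and variable names below are illustrative assumptions, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-category data: m samples, d inputs, single output
m, d, p = 100, 2, 20
X = rng.normal(size=(m, d))
y = (X[:, 0] + X[:, 1] > 0).astype(float) * 2 - 1   # targets in {-1, +1}

# Hidden weights and biases fixed at random, as in ELM
W = rng.normal(size=(d, p))
b = rng.normal(size=p)
H = np.tanh(X @ W + b)                 # m x p hidden-layer output matrix, eq. (5)

# Eq. (6): least-squares output weights via the pseudo-inverse
beta = np.linalg.pinv(H) @ y           # equals (H^T H)^{-1} H^T y when H^T H is invertible

pred = np.sign(H @ beta)               # threshold at tau = 0
accuracy = (pred == y).mean()
```

Only the output weights β are learned; the hidden layer is never trained, which is what makes the solution closed-form.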


PRELIMINARIES - 19

SLFN in classification For two-category classification problems, a single-dimensional output (i.e. using a single vector y = [y_1, ..., y_m]^T to represent m training samples) can be adopted with a thresholding process. The output weighting parameters (β ∈ R^p) to be learned can be found as

β = H† y = (H^T H)^{−1} H^T y.   (7)

A threshold is then applied to determine whether a trained output point g = hβ (h is a row vector of H) belongs to category-1 or category-2:

cls(g) = +1 if g > τ, and −1 if g < τ.   (8)


PRELIMINARIES - 20

SLFN in classification For N_C classes of output targets indicated by Y = [y_1, y_2, ..., y_{N_C}], a single SLFN with multiple outputs can also be trained as in (6) by treating the multiple outputs as multiple two-category problems. The class label can then be predicted using the one-versus-all technique:

cls(g) = arg max_i g_i,   i = 1, 2, ..., N_C.   (9)

In other words, the largest element of g determines the output pattern class.


Minimization of Classification Error


MINIMIZATION OF CLASSIFICATION ERROR - 1

Classification accuracy and error rates Consider the threshold equation (8) for binary classification problems. • We will call the first category the positive-category and the other the negative-category. • When a positive-category pattern is classified correctly, we call it a true-positive (TP). • Alternatively, when a negative-category pattern is classified correctly, we call it a true-negative (TN). • Conversely, when a positive-category pattern is classified incorrectly as negative-category, we call it a false-negative (FN). • Finally, when a negative-category pattern is classified incorrectly as positive-category, we call it a false-positive (FP).


MINIMIZATION OF CLASSIFICATION ERROR - 2

Classification accuracy and error rates With a scalar skew parameter s to indicate the relative importance of the two categories, the accuracy of a classification system is defined as

Accuracy = (TPR + s(1 − FPR)) / (1 + s).   (10)

For simplicity, the scaling factor has been set at s = 1. In addition, by ignoring the resultant normalization factor (1/2), (10) reduces to

Accuracy = TPR + (1 − FPR) = (1 − FNR) + (1 − FPR) = TPR + TNR.   (11)


MINIMIZATION OF CLASSIFICATION ERROR - 3

Classification accuracy and error rates Instead of maximizing the accuracy, a minimization problem can also be posed to minimize the total-error-rate (TER) given by

TER = FNR + FPR.   (12)

We shall minimize the empirical TER and observe its impact on the classification accuracy in this development. Note: minimization of TER is a two-step process if network optimization (locating the network parameters) and threshold optimization (locating the minimum TER at τ* from the FPR and FNR computations) are treated separately.


MINIMIZATION OF CLASSIFICATION ERROR - 4

Classification accuracy and error rates Based on the above definition of error rates, the FPR and FNR are merely the averaged counts of decision scores falling within the opposite pattern categories:

FPR = (1/m^−) Σ_{j=1}^{m^−} I(g(x_j^−) > τ),   FNR = (1/m^+) Σ_{i=1}^{m^+} I(g(x_i^+) < τ),   (13)

where I(•) is the zero-one indicator step function, which results in ‘1’ whenever ‘•’ is true and ‘0’ otherwise. • The superscripts and subscripts (+, −) indicate variables which correspond to the respective positive and negative categories. • Here, for correct classification, the network output gives a higher value with respect to the threshold τ for positive-category data (g(β, x^+) > τ), and a lower value for negative-category data (g(β, x^−) < τ).
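The rates in eq. (13) are plain indicator averages, so they can be computed directly from decision scores. The scores and the function name below are illustrative assumptions:

```python
import numpy as np

def error_rates(scores_pos, scores_neg, tau):
    """Empirical FPR and FNR of eq. (13) by averaging indicator outcomes."""
    scores_pos = np.asarray(scores_pos, dtype=float)
    scores_neg = np.asarray(scores_neg, dtype=float)
    fpr = np.mean(scores_neg > tau)     # negatives scored above the threshold
    fnr = np.mean(scores_pos < tau)     # positives scored below the threshold
    return fpr, fnr

g_pos = [0.9, 0.8, 0.4, 0.7]            # g(x_i^+): positive-category scores
g_neg = [0.1, 0.6, 0.3, 0.2, 0.35]      # g(x_j^-): negative-category scores
fpr, fnr = error_rates(g_pos, g_neg, tau=0.5)
ter = fpr + fnr                         # total-error-rate, eq. (12)
# Here one negative of five exceeds tau (FPR = 0.2) and one positive of
# four falls below tau (FNR = 0.25), so TER = 0.45.
```

Sweeping tau over the score range and keeping the minimizer gives the threshold-optimization step mentioned in the note above.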


MINIMIZATION OF CLASSIFICATION ERROR - 5

Classification accuracy and error rates Let ǫ_j = g(β, x_j^−) − τ and ε_i = τ − g(β, x_i^+). Reducing the network's misclassification can then be treated as minimization of the TER:

arg min_β TER(β, x^+, x^−) = arg min_β { (1/m^−) Σ_{j=1}^{m^−} I(ǫ_j) + (1/m^+) Σ_{i=1}^{m^+} I(ε_i) }.   (14)

Solution to (14) in general requires an approximation to the non-differentiable indicator step function I.


MINIMIZATION OF CLASSIFICATION ERROR - 6

Approximation of loss function • The explicit threshold formulation in (8) can be analogously interpreted in terms of a margin-loss function as seen in [6]. • In classifying binary data using a −1/1 response, the margin (given by y · g where y is the target response and g = g(β, x) is the predicted network response) plays a role analogous to y − g(β, x) in regression. • Data xi with positive margin (yi g(β, xi ) > 0) indicates a correct classification and that with negative margin indicates a misclassification. • The threshold at zero is thus implicit in this formulation.


MINIMIZATION OF CLASSIFICATION ERROR - 7

Approximation of loss function

[Plot: loss L(y·g) versus margin y·g over [−2, 2] for the misclassification (step), sigmoid, squared error, exponential, binomial deviance, and support vector losses.]

Figure 3: Loss functions for binary classification problems.


MINIMIZATION OF CLASSIFICATION ERROR - 8

Approximation of loss function (i) Using a sigmoid or logistic-like function A natural choice for approximating the step function or step loss is the sigmoid function or logistic-like functions (e.g. tanh and arctan), whereby the minimization problem [36, 37, 32] in (12) becomes:

arg min_β TER(β, x^+, x^−) ≈ arg min_β { (1/m^−) Σ_{j=1}^{m^−} σ(ǫ_j) + (1/m^+) Σ_{i=1}^{m^+} σ(ε_i) },   (15)

where

σ(x) = 1 / (1 + e^{−γx}),   γ > 0,   (16)

and ǫ_j = g(β, x_j^−) − τ and ε_i = τ − g(β, x_i^+). In terms of the margin-loss interpretation, the sigmoid is reversed, as depicted by the dashed-dotted curve in Fig. 3.


MINIMIZATION OF CLASSIFICATION ERROR - 9

Approximation of loss function However, there are two problems associated with this approximation. • Firstly, this formulation is nonlinear with respect to the learning parameters. Although an iterative search can be employed to obtain local solutions, different initializations may lead to different local solutions, hence incurring laborious trial-and-error efforts to select an appropriate setting. • The second problem is that the objective function can become ill-conditioned due to the many local plateaus resulting from summing the flat regions of the sigmoid. Much search effort may be spent in making little progress at locally flat regions [32].


MINIMIZATION OF CLASSIFICATION ERROR - 10

Approximation of loss function (ii) Using power and piecewise functions In [36], two piecewise power functions have been adopted to approximate the step function in the Wilcoxon-Mann-Whitney (WMW) statistic formulation:

R1(x) = (−(x − η))^r if x < η, and 0 otherwise,   (17)

R2(x) = (x − η)^r if x > η, and 0 otherwise,   (18)

where 0 < η ≤ 1 and r > 1 in both cases.

Due to the adoption of a nonlinear parametric MLP model, there exist multiple local solutions and hence an iterative search remains inevitable.


MINIMIZATION OF CLASSIFICATION ERROR - 11

Approximation of loss function • Several nonlinear but smooth functions such as binomial, exponential, and logarithmic functions have also been explored to approximate the step loss function in a monotonic manner (see e.g. [6, 38] and Fig. 3). • Also, as shown in Fig. 3, other piecewise functions such as the support vector hinge function have been adopted in place of the step loss (see e.g. [6, 39, 40, 41, 42]). • Due to the nonlinear or piecewise nature of these formulations, such monotonic losses require an iterative procedure such as quadratic programming, gradient or heuristic search to locate the solution, regardless of whether the formulation is convex or nonconvex.


MINIMIZATION OF CLASSIFICATION ERROR - 12

A total error rate formulation • As we adopt a linear network link function, a quadratic loss functional is appropriate for the desired convexity of the link-loss pair. • However, some considerations regarding the “goodness” of approximation to the step function are necessary. • Once all inputs have been normalized within [0, 1], the step function can be approximated by a quadratic function centred at the origin. • To cater for inputs which fall beyond this range (e.g. [−1, 1]), an offset η can be introduced such that only a single arm of the quadratic function is activated for the approximation.


MINIMIZATION OF CLASSIFICATION ERROR - 13

A total error rate formulation With this idea in mind, the following quadratic TER approximation is proposed [43, 44]:

arg min_β TER(β) ≈ arg min_β { (1/(2m^−)) Σ_{j=1}^{m^−} (ǫ_j + η)^2 + (1/(2m^+)) Σ_{i=1}^{m^+} (ε_i + η)^2 },   (19)

where η > 0, ǫ_j = g(β, x_j^−) − τ, and ε_i = τ − g(β, x_i^+).

[42] K.-A. Toh, “Deterministic Neural Classification,” Neural Computation, vol. 20, no. 6, pp. 1565-1595, June 2008.
[43] K.-A. Toh and H.-L. Eng, “Between Classification-Error Approximation and Weighted Least-Squares Learning,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 4, pp. 658-669, April 2008.


MINIMIZATION OF CLASSIFICATION ERROR - 14

A total error rate formulation A closed-form solution:

β = [ (1/m^−) H_−^T H_− + (1/m^+) H_+^T H_+ ]^{−1} [ ((τ − η)/m^−) H_−^T 1_− + ((τ + η)/m^+) H_+^T 1_+ ],   (20)

where

H_+ = [ h(x_1^+); h(x_2^+); ...; h(x_{m^+}^+) ] = [ φ(w_1 · x_i^+ + b_1), · · · , φ(w_p · x_i^+ + b_p) ]_{i=1,...,m^+},   (21)

H_− = [ h(x_1^−); h(x_2^−); ...; h(x_{m^−}^−) ] = [ φ(w_1 · x_j^− + b_1), · · · , φ(w_p · x_j^− + b_p) ]_{j=1,...,m^−},   (22)

and 1_+ = [1, ..., 1]^T ∈ R^{m^+}, 1_− = [1, ..., 1]^T ∈ R^{m^−}.


MINIMIZATION OF CLASSIFICATION ERROR - 15

Extension to multi-category problems and generalization Assume a common setting for the threshold (τ) and bias (η) throughout all N_C outputs. Then, by packing the solutions, equation (20) can be extended as

Θ = [β_1, · · · , β_{N_C}],   (23)

where

β_k = [ (1/m_k^−) H_−^T H_− + (1/m_k^+) H_+^T H_+ ]^{−1} [ (1/m_k^−) H_−^T y_k^− + (1/m_k^+) H_+^T y_k^+ ],   k = 1, ..., N_C.


MINIMIZATION OF CLASSIFICATION ERROR - 16

Extension to multi-category problems and generalization By defining two class-specific diagonal weighting matrices W^− = diag(1/m_k^−, · · · , 1/m_k^−, 0, · · · , 0) and W^+ = diag(0, · · · , 0, 1/m_k^+, · · · , 1/m_k^+), equation (23) can be generalized to

β_k = (H^T W H)^{−1} H^T W y_k,   (24)

where W = W^− + W^+ and the elements of H and y_k are ordered according to the two classes.

This shows that TER belongs to a specific setting (W^− and W^+ set according to 1/m_k^− and 1/m_k^+) of the generalized or weighted least-squares [11].
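This equivalence can be checked numerically: solving the class-normalized normal equations directly gives the same weights as the weighted least-squares form of eq. (24) with W = W^− + W^+. A sketch under assumed toy sizes and targets τ ± η (all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy hidden-layer outputs for m- negative and m+ positive samples
m_neg, m_pos, p = 30, 10, 5
H_neg = rng.normal(size=(m_neg, p))
H_pos = rng.normal(size=(m_pos, p))
tau, eta = 0.0, 0.5
H = np.vstack([H_neg, H_pos])
y = np.concatenate([np.full(m_neg, tau - eta),   # negative targets tau - eta
                    np.full(m_pos, tau + eta)])  # positive targets tau + eta

# Class-normalized normal equations (the TER closed form)
A = H_neg.T @ H_neg / m_neg + H_pos.T @ H_pos / m_pos
c = (tau - eta) / m_neg * H_neg.T @ np.ones(m_neg) \
  + (tau + eta) / m_pos * H_pos.T @ np.ones(m_pos)
beta_ter = np.linalg.solve(A, c)

# The same solution as weighted least squares, eq. (24), with W = W- + W+
w = np.concatenate([np.full(m_neg, 1.0 / m_neg), np.full(m_pos, 1.0 / m_pos)])
W = np.diag(w)
beta_wls = np.linalg.solve(H.T @ W @ H, H.T @ W @ y)

assert np.allclose(beta_ter, beta_wls)   # both routes agree
```

The class-size normalization 1/m^− and 1/m^+ is exactly what makes TER a *specific* weighted least-squares setting rather than the ordinary LSE of eq. (6).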


MINIMIZATION OF CLASSIFICATION ERROR - 17

Networks with a huge number of hidden nodes The solution provided in (24) requires the computation of a matrix inverse. Consequently, for physical applications, multi-collinearity may arise if linear dependency among the elements of H is present. A simple way to improve numerical stability is to include a weight decay regularization [11] (also called primal ridge regression):

β_k = (H^T W H + λI)^{−1} H^T W y_k.   (25)

MINIMIZATION OF CLASSIFICATION ERROR - 18

Networks with a huge number of hidden nodes For networks with a huge number of hidden nodes (such as those in [45]), the dimension of the network can be much larger than the number of training samples (i.e. p ≫ m). In such cases, a dual ridge regression can be adopted, where the inverse can be performed in the smaller of the two spaces (between R^p and R^m) [46]:

Θ = H^T Ω = H^T [Ω_1, ..., Ω_{N_C}],   (26)

where

Ω_k = (W H H^T + λI)^{−1} W y_k,   k = 1, ..., N_C,   (27)

with I ∈ R^{m×m} and Ω ∈ R^{m×N_C}, which correspond to the data size m and the output dimension N_C respectively.

MINIMIZATION OF CLASSIFICATION ERROR - 19

Networks with a huge number of hidden nodes To predict unseen test data (e.g. n unseen data points x_{t,i} ∈ R^d, i = 1, ..., n, which form h(x_{t,i}) ∈ R^p, i = 1, ..., n), either the trained Θ can be used as in Y_t = H_t Θ (H_t ∈ R^{n×p}), or a computation of the inner product between training data and test data can be performed, such as:

Y_t = H_t Θ = H_t H^T Ω = ZΩ,   (28)

where Z = H_t H^T ∈ R^{n×m}. Here we note that there is no need to split the test set into positive and negative categories, as these labels are unknown.
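The primal and dual ridge routes produce identical output weights; the dual route only has to invert an m × m matrix instead of a p × p one. A sketch with assumed sizes (p ≫ m), uniform weights, and illustrative names:

```python
import numpy as np

rng = np.random.default_rng(2)

# p >> m: many more hidden nodes than training samples
m, p = 8, 200
H = rng.normal(size=(m, p))
y = rng.normal(size=m)
W = np.diag(np.full(m, 1.0 / m))   # uniform weighting for this sketch
lam = 1e-2

# Primal ridge, eq. (25): invert a p x p matrix
beta_primal = np.linalg.solve(H.T @ W @ H + lam * np.eye(p), H.T @ W @ y)

# Dual ridge, eqs. (26)-(27): invert only an m x m matrix
omega = np.linalg.solve(W @ H @ H.T + lam * np.eye(m), W @ y)
beta_dual = H.T @ omega

assert np.allclose(beta_primal, beta_dual)   # same solution, cheaper inverse
```

The agreement follows from the matrix identity (H^T W H + λI)^{−1} H^T W = H^T (W H H^T + λI)^{−1} W, which is what makes the dual form attractive when p ≫ m.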


MINIMIZATION OF CLASSIFICATION ERROR - 20

A binary case study • The decision outputs of an SLFN using 3 hidden nodes for a case of overlapping data distributions are shown in Fig. 4: – the original LSE-based ELM, – the proposed TERELM, – a Linear Regression model (LR). • Here, we see that the outputs of the LSE fits (ELM and LR) show a larger error count (two misclassified samples) than that of the proposed TER (only one misclassified sample).


MINIMIZATION OF CLASSIFICATION ERROR - 21

A binary case study

[Plot: decision outputs y versus x for Class-1 and Class-2 data, with the threshold at 0 and the linear regression, ELM (3 hidden nodes) and TERELM (3 hidden nodes) outputs overlaid.]

Figure 4: Decision outputs for non-separable 1-D data: TER achieved the lowest error count as compared to LSE (ELM) and linear regression.


MINIMIZATION OF CLASSIFICATION ERROR - 22

A binary case study

[Plot: training data (−ve and +ve classes) and trained outputs g against x for offset values η = ±0.5, ±1.0, ±2.0.]

Figure 5: Training data and outputs of a single-dimensional learning example with varying offset η values.


MINIMIZATION OF CLASSIFICATION ERROR - 23

A binary case study

[Plot: loss values against the margin y·g for offsets η = ±0.5, ±1.0, ±2.0, with the training activation range and the −ve/+ve class data marked.]

Figure 6: From a margin-loss function point of view.


EXPERIMENTS - 1

Experimental Setup • Comparison with ELM under similar settings. • Comparison with a classification-based method by [8]. • Comparison with [30] (Ref-I) and [47] (Ref-II).
Ref-I: K.-A. Toh, Q.-L. Tran, and D. Srinivasan, “Benchmarking a Reduced Multivariate Polynomial Pattern Classifier,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 740-755, 2004.
Ref-II: J. Li, G. Dong, K. Ramamohanarao, and L. Wong, “DeEPs: A New Instance-Based Lazy Discovery and Classification System,” Machine Learning, vol. 54, no. 2, pp. 99-124, 2004.


EXPERIMENTS - 2

Results Table 1: Average classification accuracies (over 42 data sets) for individual trials (each trial containing 10 runs of 10-fold cross-validations)

Trial number | ELM Ave(x%) ± Std | TER Ave(y%) ± Std | Difference (y − x)%
1   | 85.00 ± 0.54 | 85.46 ± 0.53 | 0.46
2   | 84.80 ± 0.54 | 85.28 ± 0.56 | 0.48
3   | 84.99 ± 0.55 | 85.42 ± 0.53 | 0.43
4   | 84.95 ± 0.61 | 85.54 ± 0.59 | 0.59
5   | 85.11 ± 0.48 | 85.51 ± 0.53 | 0.40
6   | 84.50 ± 0.61 | 84.96 ± 0.61 | 0.46
7   | 84.78 ± 0.50 | 85.10 ± 0.52 | 0.32
8   | 84.89 ± 0.46 | 85.31 ± 0.48 | 0.42
9   | 85.09 ± 0.54 | 85.57 ± 0.48 | 0.48
10  | 84.85 ± 0.55 | 85.31 ± 0.54 | 0.46
Ave | 84.90 ± 0.54 | 85.35 ± 0.54 | 0.45

Bold: maximum and minimum values within each column.


EXPERIMENTS - 3

Results Table 2: Comparison with networks trained by a classification-based objective (CB1-CE) [8]

Name            | CB1-CE (x) ave(%)±std | ELM (y) ave(%)±std | (y − x)(%) | TER (z) ave(%)±std | (z − x)(%)
Pima-diabetes   | 76.82 ± 6.46 | 77.73 ± 0.20 | 0.9100  | 77.73 ± 0.20 | 0.9100
Breast-cancer-W | 97.36 ± 1.81 | 97.26 ± 0.21 | -0.1000 | 97.37 ± 0.16 | 0.0100
Ionosphere      | 90.88 ± 3.87 | 89.18 ± 1.40 | -1.7000 | 89.63 ± 1.82 | -1.2500
Iris            | 95.37 ± 5.25 | 96.36 ± 1.22 | 0.9900  | 96.86 ± 1.22 | 1.4900
Sonar           | 81.92 ± 8.60 | 81.78 ± 1.67 | -0.1400 | 81.96 ± 1.83 | 0.0400
Wine            | 97.19 ± 3.47 | 98.52 ± 0.65 | 1.3300  | 98.59 ± 0.55 | 1.4000
Average         | 89.92 ± 4.91 | 90.14 ± 0.89 | 0.2150  | 90.36 ± 0.96 | 0.4333

[8] M. Rimer and T. Martinez, “Classification-based Objective Functions,” Machine Learning, vol. 63, no. 2, pp. 183-205, 2006.


EXPERIMENTS - 4

[Plot: average accuracy (%) per data set index, grouped into 2-class, 3-class and multi-class problems.]

Figure 7: Accuracy plotted over data sets with reference to RM in Ref-I (shaded: TER, ⋆: RM).


EXPERIMENTS - 5

[Plot: average accuracy (%) per data set index, grouped into 2-class, 3-class and multi-class problems.]

Figure 8: Accuracy plotted over data sets with reference to three algorithms from Ref-II (shaded: TER, ⋆: DeEPs, □: kNN, ◦: C4.5).


EXPERIMENTS - 6

Results Table 3: Summary of average accuracy with respect to Ref-I and Ref-II

Ref-I (all 42 sets) | Ref-II (22 sets)
TER (0.8535)        | TER (0.893)
RM-Tuned (0.8534)   | DeEPs (0.888)
ELM (0.8490)        | ELM (0.887)
RM-Fixed (0.8364)   | C4.5 (0.868)
                    | kNN (0.863)


EXPERIMENTS - 7

Summary • Comparing TER with ELM: The overall average accuracy improvements (over 42 data sets) for these ten trials range from 0.32% to 0.59% (see Table 1). For individual data sets, the accuracy gain from TER can go as high as over 8% as compared to ELM.


EXPERIMENTS - 8

Summary • Comparing TER with a classification-based network and other state-of-the-art classifiers: TER is comparable to the classification-based network proposed in [8] and to other state-of-the-art algorithms. Instead of a complex nonlinear formulation, good classification performance can also be achieved by a simple deterministic formulation.


EXPERIMENTS - 9

Summary • The main reason for the achieved accuracy performance of TER can be attributed to its ability to cater for different data distributions by adjusting the class-specific normalization.


Maximization of Area Under ROC Curve


MAXIMIZATION OF AUC - 1

Multibiometric fusion • Multiple modalities of biometrics can be combined either before matching or after matching (see e.g.[48]). • Fusion before matching: – sensor level – feature level • Fusion after matching: – abstract level – rank level – match score level


MAXIMIZATION OF AUC - 2

Multibiometric fusion

• Existing fusion methods can generally be divided into two types: non-training-based methods and training-based methods.
• Non-training-based methods: it is often assumed that the outputs of the individual biometrics are the probabilities that the input pattern belongs to a certain class (see e.g. [48, 49]).
• Training-based methods: these do not require this assumption and can operate directly on the match scores generated by biometric verification modules (see e.g. [50, 51]).
• Since the outcome of biometric verification consists of only two labels (genuine-user or imposter), the verification process can be treated as a two-category classification problem.


MAXIMIZATION OF AUC - 3

Motivation

• For performance evaluation of multimodal biometric systems, the Receiver Operating Characteristic (ROC) curve has been extensively adopted due to its good overview interpretation (see e.g. [52, 53]).
• However, fusion design optimization and the final ROC performance evaluation are usually conducted separately.
• This has been inevitable because the ROC, from the error-counting point of view, does not have a well-posed structure linking it to the fusion classifier of interest such that it can be directly computed.


MAXIMIZATION OF AUC - 1

Linear Parametric Models

Consider, again, a linear parametric model:

$$ g(\alpha, x) = \sum_{k=0}^{K-1} \alpha_k\, p_k(x) = p(x)\,\alpha, \qquad (29) $$

where $p_k(x)$ corresponds to the $k$-th polynomial expansion term within a row vector $p(x) = [p_0(x), p_1(x), \ldots, p_{K-1}(x)]$, and $\alpha = [\alpha_0, \alpha_1, \ldots, \alpha_{K-1}]^T$ is a column parameter vector.
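As a concrete sketch (the expansion terms chosen here are illustrative, not the specific reduced-polynomial model of the slides), p(x) and g(α, x) for a two-dimensional input might look like:

```python
import numpy as np

def poly_expand(x):
    """Row vector p(x) = [p_0(x), ..., p_{K-1}(x)] for a 2-D input:
    an illustrative second-order multivariate polynomial expansion."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2, x1**2, x2**2])

def g(alpha, x):
    """Linear parametric model g(alpha, x) = p(x) alpha, eq. (29)."""
    return poly_expand(x) @ alpha

alpha = np.array([0.5, 1.0, -1.0, 0.0, 0.2, 0.2])
print(g(alpha, (1.0, 2.0)))  # 0.5 + 1 - 2 + 0 + 0.2 + 0.8 = 0.5
```

The model is linear in the parameters α even though it is nonlinear in the input x, which is what makes the deterministic solutions later in the slides possible.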


MAXIMIZATION OF AUC - 2

Learning and Prediction

• When each element of the input x ∈ R^l has a known target label y ∈ R, giving rise to m example learning data pairs (x_i, y_i), i = 1, 2, ..., m, the learning problem is supervised.
• In biometric verification problems, these target labels (y_i, i = 1, 2, ..., m) are known as genuine-user and imposter.
• For multimodal biometrics fusion, the input vectors x_i, i = 1, 2, ..., m, come from the scores of the l biometric modalities to be fused.


MAXIMIZATION OF AUC - 3

Conventional Method

• Step 1: Least Squares Error (LSE) learning:

$$ J = \frac{1}{2}\sum_{i=1}^{m}\left[y_i - p(x_i)\alpha\right]^2 + \frac{b}{2}\,\alpha^T\alpha = \frac{1}{2}\,\|y - P\alpha\|_2^2 + \frac{b}{2}\,\|\alpha\|_2^2, \qquad (30) $$

$$ \text{LSE solution:}\quad \alpha = (P^T P + bI)^{-1} P^T y, \qquad (31) $$

$$ \text{Prediction:}\quad \hat{y} = P\alpha. \qquad (32) $$

• Step 2: Based on a preset threshold τ, compute the decision outcome (if ŷ > τ then class-0; if ŷ < τ then class-1). Finally, compute the error rates (FAR, FRR) at different thresholds → plot the ROC.
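The two steps can be sketched as follows (a minimal illustration with synthetic data; the helper names are mine, not the authors'):

```python
import numpy as np

def lse_fit(P, y, b=1e-3):
    """Ridge/LSE solution alpha = (P^T P + b I)^{-1} P^T y, eq. (31)."""
    K = P.shape[1]
    return np.linalg.solve(P.T @ P + b * np.eye(K), P.T @ y)

def far_frr(scores, labels, tau):
    """Error rates at threshold tau; label 0 = class-0 (high score),
    label 1 = class-1 (low score), following the slide's decision rule."""
    far = np.mean(scores[labels == 1] > tau)   # class-1 wrongly accepted
    frr = np.mean(scores[labels == 0] <= tau)  # class-0 wrongly rejected
    return far, frr

rng = np.random.default_rng(0)
P = np.c_[np.ones(200), rng.normal(size=(200, 2))]  # design matrix, rows p(x_i)
labels = (P[:, 1] + P[:, 2] < 0).astype(int)        # synthetic ground truth
y = 1.0 - labels                                    # target 1 for class-0, 0 for class-1
alpha = lse_fit(P, y)
scores = P @ alpha                                  # prediction y_hat = P alpha, eq. (32)
for tau in (0.3, 0.5, 0.7):                         # sweep thresholds to trace the ROC
    print(tau, far_frr(scores, labels, tau))
```

Sweeping τ from low to high trades FAR against FRR, which is exactly the curve plotted as the ROC in the next slides.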


MAXIMIZATION OF AUC - 1

Receiver operating characteristic (ROC)

[Figure: (a) ROC curves on train data and (b) ROC curves on test data; authentic acceptance rate (%) versus false acceptance rate (%) for Fingerprint, Hand-geometry, Face-Visual, Face-IR, and Random.]

Figure 9: Receiver operating characteristic curves


MAXIMIZATION OF AUC - 2

Area Under ROC (AUC)

Denote the variables (x, m) that correspond to positive and negative examples by superscripts + and −, respectively. It is then not difficult to see that the AUC for the given m training examples can be expressed [36] as:

$$ A(x^+, x^-) = \frac{1}{m^+ m^-}\sum_{i=1}^{m^+}\sum_{j=1}^{m^-} \mathbf{1}_{g(x_i^+) > g(x_j^-)}, \qquad (33) $$

where the term $\mathbf{1}_{g(x_i^+) > g(x_j^-)}$ evaluates to ‘1’ whenever $g(x_i^+) > g(x_j^-)$ ($i = 1, 2, \ldots, m^+$, $j = 1, 2, \ldots, m^-$), and ‘0’ otherwise.

Here we note that (33) is also known as the Wilcoxon-Mann-Whitney statistic [36].
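The pairwise statistic (33) can be computed directly; a minimal sketch (the function name is mine):

```python
import numpy as np

def auc_wmw(scores_pos, scores_neg):
    """Wilcoxon-Mann-Whitney estimate of the AUC, eq. (33):
    the fraction of (positive, negative) pairs that are ranked correctly."""
    s_pos = np.asarray(scores_pos, dtype=float)
    s_neg = np.asarray(scores_neg, dtype=float)
    # Pairwise indicator 1_{g(x_i^+) > g(x_j^-)} via broadcasting
    wins = (s_pos[:, None] > s_neg[None, :]).sum()
    return wins / (s_pos.size * s_neg.size)

# Perfectly separated scores give AUC = 1
print(auc_wmw([0.9, 0.8, 0.7], [0.3, 0.2]))  # -> 1.0
# Here 2 of the 4 (positive, negative) pairs are ranked correctly
print(auc_wmw([0.9, 0.1], [0.3, 0.2]))       # -> 0.5
```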


MAXIMIZATION OF AUC - 3

Area Under ROC (AUC)

To solve the above maximization problem, it is usually re-stated in its dual form, i.e. to minimize the Area Above the ROC Curve (AAC) [32, 54], given by:

$$ A(\alpha, x^+, x^-) = \frac{1}{m^+ m^-}\sum_{i=1}^{m^+}\sum_{j=1}^{m^-} u(\xi_{ji}), \qquad (34) $$

where $u(\cdot)$ is the step function and $\xi_{ji} = g(\alpha, x_j^-) - g(\alpha, x_i^+)$, $j = 1, 2, \ldots, m^-$, $i = 1, 2, \ldots, m^+$.

Notice that AUC and AAC are differentiated by the sign of ξ, as indicated by $\xi_{ij}$ and $\xi_{ji}$.


MAXIMIZATION OF AUC - 4

AUC Optimization

Since the step function u(·) in (34) is non-differentiable, an approximation is often adopted. A natural choice for approximating the step function is the sigmoid [32, 55, 36, 56], with which the minimization problem becomes:

$$ \arg\min_{\alpha} A(\alpha, x^+, x^-) \approx \arg\min_{\alpha} \frac{1}{m^+ m^-}\sum_{i=1}^{m^+}\sum_{j=1}^{m^-} \sigma(\xi_{ji}), \qquad (35) $$

where

$$ \sigma(\xi_{ji}) = \frac{1}{1 + e^{-\gamma \xi_{ji}}}, \quad \gamma > 0. \qquad (36) $$
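An illustrative sketch of minimizing the smoothed objective (35)-(36) by plain gradient descent, assuming a linear score g(α, x) = xᵀα and synthetic data of my own (the slides' deterministic solution uses a quadratic approximation instead):

```python
import numpy as np

def sigmoid(z, gamma=2.0):
    """Sigmoid approximation of the step function, eq. (36)."""
    return 1.0 / (1.0 + np.exp(-gamma * z))

def aac_sigmoid(alpha, X_pos, X_neg, gamma=2.0):
    """Smoothed AAC, eq. (35): mean of sigma(xi_ji) over all (j, i) pairs,
    with xi_ji = g(alpha, x_j^-) - g(alpha, x_i^+)."""
    xi = (X_neg @ alpha)[:, None] - (X_pos @ alpha)[None, :]
    return sigmoid(xi, gamma).mean()

def grad_aac(alpha, X_pos, X_neg, gamma=2.0):
    """Gradient of the smoothed AAC with respect to alpha."""
    xi = (X_neg @ alpha)[:, None] - (X_pos @ alpha)[None, :]
    s = sigmoid(xi, gamma)
    w = gamma * s * (1.0 - s)             # d sigma / d xi, shape (m-, m+)
    # d xi_ji / d alpha = x_j^- - x_i^+
    return (w.sum(axis=1) @ X_neg - w.sum(axis=0) @ X_pos) / w.size

rng = np.random.default_rng(1)
X_pos = rng.normal(+1.0, 1.0, size=(40, 2))   # genuine-user scores
X_neg = rng.normal(-1.0, 1.0, size=(40, 2))   # imposter scores
alpha = np.zeros(2)
before = aac_sigmoid(alpha, X_pos, X_neg)     # 0.5 at alpha = 0
for _ in range(200):
    alpha -= 0.5 * grad_aac(alpha, X_pos, X_neg)
after = aac_sigmoid(alpha, X_pos, X_neg)
print(before, after)                          # smoothed AAC decreases
```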

MAXIMIZATION OF AUC - 5

Approximation of Step Function

[Figure: the step function together with its sigmoid and quadratic approximations.]

Figure 10: Step function and its approximations

MAXIMIZATION OF AUC - 6

A Deterministic Solution

With inputs normalized within [0, 1] and the inclusion of an offset η to shift the quadratic function, the following AAC objective function approximation was proposed [32]:

$$ A(\alpha, x^+, x^-) \approx \frac{b}{2}\,\|\alpha\|_2^2 + \frac{1}{2\, m^+ m^-}\sum_{i=1}^{m^+}\sum_{j=1}^{m^-} \phi(\xi_{ji}), \qquad (37) $$

where

$$ \phi(\xi_{ji}) = (\xi_{ji} + \eta)^2 = \left( g(\alpha, x_j^-) - g(\alpha, x_i^+) + \eta \right)^2 = \left( \left[ p(x_j^-) - p(x_i^+) \right]\alpha + \eta \right)^2, \qquad (38) $$

for j = 1, 2, ..., m⁻, i = 1, 2, ..., m⁺. Here we note that a weight decay has been included in the above optimization objective to provide stabilization in case of matrix inversion.

MAXIMIZATION OF AUC - 7

A Deterministic Solution

Abbreviating the row polynomial vectors by $p_j = p(x_j^-) \in \mathbb{R}^K$ and $p_i = p(x_i^+) \in \mathbb{R}^K$, the solution for α which minimizes the AAC can be written as [32]:

$$ \alpha = \left[ bI + \frac{1}{m^+ m^-}\sum_{i=1}^{m^+}\sum_{j=1}^{m^-} (p_j - p_i)^T (p_j - p_i) \right]^{-1} \left[ \frac{-\eta}{m^+ m^-}\sum_{i=1}^{m^+}\sum_{j=1}^{m^-} (p_j - p_i)^T \right], \qquad (39) $$

where I is an identity matrix of size K × K.

Reference:
• K.-A. Toh, J. Kim, and S. Lee, “Maximizing Area Under ROC Curve for Biometric Scores Fusion,” Pattern Recognition, vol. 41, no. 11, pp. 3373-3392, November 2008.
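The closed-form solution (39) is straightforward to implement; a sketch with illustrative synthetic data (names are mine; the rows of P_pos and P_neg are assumed to be already polynomial-expanded genuine and imposter score vectors):

```python
import numpy as np

def aac_quadratic_fit(P_pos, P_neg, b=1e-3, eta=1.0):
    """Deterministic AAC minimizer, eq. (39).
    P_pos: (m+, K) rows p_i = p(x_i^+); P_neg: (m-, K) rows p_j = p(x_j^-)."""
    m_pos, K = P_pos.shape
    m_neg = P_neg.shape[0]
    D = (P_neg[:, None, :] - P_pos[None, :, :]).reshape(-1, K)  # all (p_j - p_i)
    A = b * np.eye(K) + (D.T @ D) / (m_pos * m_neg)
    rhs = -eta * D.sum(axis=0) / (m_pos * m_neg)
    return np.linalg.solve(A, rhs)

rng = np.random.default_rng(2)
# Bias column plus two well-separated score dimensions
P_pos = np.c_[np.ones(30), rng.normal(0.7, 0.1, size=(30, 2))]  # genuine
P_neg = np.c_[np.ones(30), rng.normal(0.3, 0.1, size=(30, 2))]  # imposter
alpha = aac_quadratic_fit(P_pos, P_neg)
# With well-separated scores, g(x^+) > g(x^-) for most pairs:
auc = np.mean((P_pos @ alpha)[:, None] > (P_neg @ alpha)[None, :])
print(auc)
```

A single linear solve of size K × K suffices, which is what makes the method deterministic and fast compared with iterative AUC maximizers.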


EXPERIMENTS - 1

Data set I: VS and IR Face Systems

Figure 11: Top row: some representative visual face samples for an identity from the BERC database; Bottom row: infra-red face samples for the same identity taken under similar conditions.


EXPERIMENTS - 2

Data set II: Fingerprint and Hand-geometry


Figure 12: Fingerprint image samples: wet fingers in (a)-(b) and normal fingers in (c)-(d)


EXPERIMENTS - 3

Data set II: Fingerprint and Hand-geometry

Figure 13: (a) A hand image sample, (b) Extracted hand-geometry


EXPERIMENTS - 4

Intra- and Inter-Modal Fusions

• Data set III, NIST-BSSR1 [57]: using the lowest-performing face and fingerprint data sets, face-G and fing-li-V. This data set contains 517 identities, with 517 genuine match scores and 517 × 516 imposter match scores, acquired over a period of nearly four years.
• Data set IV, XM2VTS [58]: within the 32 data sets, both inter-modal (data sets {1, ..., 15, 22, 23}) and intra-modal (data sets {16, ..., 21, 24, ..., 32}) fusions have been considered.


EXPERIMENTS - 1

Experimental Setup

• Data sets I, II, and III: 10 runs of two-fold cross-validation.
• Data set IV: hold-out test following [58].
• RM polynomial orders r ∈ {1, 2, ..., 6}, b = 10⁻³, η = 1.
• Comparison with: SUM(Min-Max), SUM(Tanh), SVMPoly, and SVMRbf.


EXPERIMENTS - 2

Results

[Figure: average EER (10 runs) of Single-VS-face, Single-IR-face, SUM(Min-Max), SUM(Tanh), LSE, AACQ, SVMPoly, and SVMRbf, plotted over polynomial order r ∈ {1, ..., 6} and SVMRbf gamma ∈ {0.1, 1, 2, ..., 9, 10, 100}.]

Figure 14: Data set I on Fusion(VS+IR): average EER values plotted over different model settings.

EXPERIMENTS - 3

Results

[Figure: average EER (10 runs) of Single-fingerprint, Single-hand, SUM(Min-Max), SUM(Tanh), LSE, AACQ, SVMPoly, and SVMRbf, plotted over polynomial order r ∈ {1, ..., 6} and SVMRbf gamma ∈ {0.1, 1, 2, ..., 9, 10, 100}.]

Figure 15: Data set II on Fusion(FP+HD): average EER values plotted over different model settings.

EXPERIMENTS - 4

Results

[Figure: NIST BSSR1 (fingerprint-face) EER (%) of Single-face, Single-fingerprint, SUM(Min-Max), SUM(Tanh), LSE, AACQ, SVMPoly, and SVMRbf, plotted over RM and SVM-Poly order r ∈ {1, ..., 6} and SVM-RBF γ ∈ {0.1, 1, 2, ..., 9, 10, 100}.]

Figure 16: Data set III on NIST-BSSR1: average EER values plotted over different model settings.

EXPERIMENTS - 5

Results

[Figure: average EER from the 32 cases for Mean operator, Weighted-Sum-Fisher, Weighted-Sum-Brute-Force, LSE, AACQ, SVMPoly, and SVMRbf; data sets 2 and 9 omitted due to numerical ill-conditioning.]

Figure 17: Data set IV on 32 fusion data sets: average EER values versus model order r ∈ {1, ..., 6}, gamma ∈ {0.1, 1, 5, 10, 50, 100} values, and NSIG ∈ {1, 2, 5, 10, 50, 100}.

EXPERIMENTS - 6

Results

[Figure: EER (%) for each of the 32 fusion cases, comparing Mean operator, Weighted-Sum-Fisher, Weighted-Sum-Brute-Force, LSE, AACQ, SVMPoly, and SVMRbf.]

Figure 18: EER values for individual data from the 32 sets: LSE and AACQ at r = 5; SVMPoly at r = 1; SVMRbf at gamma = 50; LSE-SIG and AACQ-SIG at NSIG = 50.

EXPERIMENTS - 7

Summary

• A strong correlation is observed between the LAUC and EER values at large.
• Fusion of two relatively strong biometrics shows good performance for most fusion classifiers. The EER performance of AACQ improves over that of LSE in terms of stability over different model orders.
• Fusion of a relatively strong and a very weak biometric (VS+IR) needs particular care: the combined decision may deteriorate if the fusion module is not well trained. AACQ shows good stability.
• The stability of AACQ comes at a higher computational cost than the compared training-based methods. We believe this limitation will be overcome by efficient software implementation as well as by advances in computing technology.

Conclusion


CONCLUSION - 1

• Presented a deterministic and fast-learning algorithm for total error rate minimization.
• Presented a deterministic algorithm for AUC maximization.
• The key is a quadratic approximation to an error-counting step function in the objective functions.


CONCLUSION - 2

• The deterministic TER solution was subsequently found to be related to a class-specific setting of the conventional weighted least-squares estimate.
• Our empirical observations validated the generalization performance of the proposed methods.


CONCLUSION - 3

Thank You!


References

[1] R. O. Duda and P. E. Hart, Pattern Classification and Scene Analysis. New York: Wiley & Sons, 1973.

[2] D.-T. Hai and M. Installe, “Learning Algorithms for Nonparametric Solution to the Minimum Error Classification Problem,” IEEE Trans. on Computers, vol. C-27, no. 7, pp. 648–659, 1978.

[3] B.-H. Juang and S. Katagiri, “Discriminative Learning for Minimum Error Classification,” IEEE Trans. on Signal Processing, vol. 40, no. 12, pp. 3043–3054, 1992.

[4] M. Tsujitani and T. Koshimizu, “Neural Discriminant Analysis,” IEEE Trans. on Neural Networks, vol. 11, no. 6, pp. 1394–1401, 2000.

[5] J. Schürmann, Pattern Classification: A Unified View of Statistical and Neural Approaches. New York: John Wiley & Sons, Inc., 1996.

[6] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer, 2001.

[7] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed. New York: John Wiley & Sons, Inc., 2001.

[8] M. Rimer and T. Martinez, “Classification-based Objective Functions,” Machine Learning, vol. 63, no. 2, pp. 183–205, 2006.

[9] Z. R. Yang, “A Novel Radial Basis Function Neural Network for Discriminant Analysis,” IEEE Trans. on Neural Networks, vol. 17, no. 3, pp. 604–612, 2006.

[10] L. Sanger and J. Wales, “Wikipedia The Free Encyclopedia,” in [http://en.wikipedia.org/wiki], 2006 (online since 2001).

[11] N. R. Draper and H. Smith, Applied Regression Analysis. New York: John Wiley & Sons, Inc., 1998, Wiley Series in Probability and Statistics.

[12] E. Fix and J. L. Hodges, Jr., “Discriminatory analysis: Nonparametric discrimination: Consistency analysis,” USAF School of Aviation Medicine, vol. 4, pp. 261–279, 1951.

[13] ——, “Discriminatory analysis: Nonparametric discrimination: Small sample performance,” USAF School of Aviation Medicine, vol. 11, pp. 280–322, 1952.

[14] L. Zadeh, “Fuzzy Sets,” Information and Control, vol. 8, no. 3, pp. 338–353, 1965.

[15] E. A. Patrick and F. P. Fischer, III, “A Generalized k-nearest neighbor rule,” Information and Control, vol. 16, no. 2, pp. 128–152, 1970.

[16] J. H. Holland, Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence, 2nd ed. Cambridge, MA: MIT Press, 1992.


[17] J. J. Hopfield, “Neural Networks and Physical Systems with Emergent Collective Computational Abilities,” Proceedings of the National Academy of Sciences of the USA, vol. 79, no. 8, pp. 2554–2558, 1982.

[18] T. Kohonen, “Self-Organizing formation of topologically correct feature maps,” Biological Cybernetics, vol. 43, no. 1, pp. 59–69, 1982.

[19] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning Internal Representations by Backpropagating Errors,” Nature, vol. 323, no. 99, pp. 533–536, 1986.

[20] J. R. Quinlan, “Learning Efficient Classification Procedures and their Application to Chess and Games,” in Machine Learning: An Artificial Intelligence Approach, R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, Eds., 1983, pp. 463–482.

[21] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A Training Algorithm for Optimal Margin Classifiers,” in Fifth Annual Workshop on Computational Learning Theory, Pittsburgh, ACM, 1992, pp. 144–152.

[22] J. R. Quinlan, C4.5: Programs for Machine Learning. San Francisco, CA: Morgan Kaufmann, 1993.

[23] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees. New York: Chapman & Hall, 1993.

[24] C. E. Brodley and P. E. Utgoff, “Multivariate Decision Trees,” Machine Learning, vol. 19, no. 1, pp. 45–77, 1995.

[25] L. Breiman, “Bagging Predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, 1996.

[26] Y. Freund and R. E. Schapire, “A Decision Theoretic Generalization of On-line Learning and Application to Boosting,” Journal of Computer and System Sciences, vol. 55, pp. 119–139, 1995.


[27] G. Baudat and F. Anouar, “Generalized Discriminant Analysis Using a Kernel Approach,” Neural Computation, vol. 12, pp. 2385–2404, 2000.

[28] J. Lu, K. N. Plataniotis, and A. N. Venetsanopoulos, “Face Recognition Using Kernel Direct Discriminant Analysis Algorithms,” IEEE Trans. on Neural Networks, vol. 14, no. 1, pp. 117–126, 2003.

[29] D. Precup and P. E. Utgoff, “Classification Using Φ-Machines and Constructive Function Approximation,” Machine Learning, vol. 55, no. 1, pp. 31–52, 2004.

[30] K.-A. Toh, Q.-L. Tran, and D. Srinivasan, “Benchmarking A Reduced Multivariate Polynomial Pattern Classifier,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 26, no. 6, pp. 740–755, 2004.


[31] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme Learning Machine: Theory and Applications,” Neurocomputing, vol. 70, pp. 489–501, 2006.

[32] K.-A. Toh, “Learning from Target Knowledge Approximation,” in Proceedings of the First IEEE Conference on Industrial Electronics and Applications, Singapore, May 2006, pp. 815–822.

[33] F. Berzal, J.-C. Cubero, D. Sánchez, and J. M. Serrano, “ART: A Hybrid Classification Model,” Machine Learning, vol. 54, no. 1, pp. 67–92, 2004.

[34] K.-A. Toh, “Training a Reciprocal-Sigmoid Classifier by Feature Scaling-Space,” Machine Learning, vol. 65, no. 1, pp. 273–308, 2006.

[35] G.-B. Huang, L. Chen, and C.-K. Siew, “Universal Approximation Using Incremental Constructive Feedforward Networks with Random Hidden Nodes,” IEEE Trans. on Neural Networks, vol. 17, no. 4, pp. 879–892, 2006.

[36] L. Yan, R. Dodier, M. C. Mozer, and R. Wolniewicz, “Optimizing Classifier Performance via an Approximation to the Wilcoxon-Mann-Whitney Statistic,” in Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), Washington DC, USA, 2003, pp. 848–855.

[37] E. McDermott and S. Katagiri, “A Derivation of Minimum Classification Error from the Theoretical Classification Risk using Parzen Estimation,” Computer Speech and Language, vol. 18, pp. 107–122, 2004.

[38] R. Yan, J. Zhang, J. Yang, and A. G. Hauptmann, “A Discriminative Learning Framework with Pairwise Constraints for Video Object Classification,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 28, no. 4, pp. 578–593, 2006.

[39] V. N. Vapnik, Statistical Learning Theory. Wiley-Interscience, 1998.


[40] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A Training Algorithm for Optimal Margin Classifier,” in Proceedings of the 5th ACM Workshop on Computational Learning Theory, Pittsburgh, PA, 1992, pp. 144–152.

[41] L. Bobrowski and J. Sklansky, “Linear Classifiers by Window Training,” IEEE Trans. on Systems, Man, and Cybernetics, vol. 25, no. 1, pp. 1–9, 1995.

[42] I. N. Bronshtein, K. A. Semendyayev, G. Musiol, and H. Muehlig, Handbook of Mathematics, 4th ed. Berlin: Springer-Verlag, 2003.

[43] K.-A. Toh, “Deterministic Neural Classification,” Neural Computation, vol. 20, no. 6, pp. 1565–1595, June 2008.

[44] K.-A. Toh and H.-L. Eng, “Between Classification-Error Approximation and Weighted Least-Squares Learning,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 30, no. 4, pp. 658–669, April 2008.

[45] H. Jaeger and H. Haas, “Harnessing Nonlinearity: Predicting Chaotic Systems and Saving Energy in Wireless Communication,” Science, vol. 304, pp. 78–80, 2004.

[46] C. Saunders, A. Gammerman, and V. Vovk, “Ridge Regression Learning Algorithm in Dual Variables,” in International Conference on Machine Learning (ICML), Madison, WI, 1998, pp. 515–521.

[47] J. Li, G. Dong, K. Ramamohanarao, and L. Wong, “DeEPs: A New Instance-Based Lazy Discovery and Classification System,” Machine Learning, vol. 54, no. 2, pp. 99–124, 2004.

[48] A. A. Ross and A. K. Jain, “Information fusion in biometrics,” Pattern Recognition Letters, vol. 24, pp. 2115–2125, 2003.

[49] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, “On Combining Classifiers,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, no. 3, pp. 226–239, 1998.


[50] L. I. Kuncheva, J. C. Bezdek, and R. Duin, “Decision Templates for Multiple Classifier Design: An Experimental Comparison,” Pattern Recognition, vol. 34, no. 2, pp. 299–314, 2001.

[51] K.-A. Toh, “Fingerprint and Speaker Verification Decisions Fusion,” in International Conference on Image Analysis and Processing (ICIAP), Mantova, Italy, September 2003, pp. 626–631.

[52] L. Hong and A. Jain, “Integrating Faces and Fingerprints for Person Identification,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, no. 12, pp. 1295–1307, 1998.

[53] S. Ben-Yacoub, Y. Abdeljaoued, and E. Mayoraz, “Fusion of Face and Speech Data for Person Identity Verification,” IEEE Trans. on Neural Networks, vol. 10, no. 5, pp. 1065–1074, 1999.

[54] K.-A. Toh, J. Kim, and S. Lee, “Maximizing Area Under ROC Curve for Biometric Scores Fusion,” Pattern Recognition, vol. 41, no. 11, pp. 3373–3392, November 2008.

[55] A. Herschtal and B. Raskutti, “Optimising Area Under the ROC Curve Using Gradient Descent,” in Proceedings of the Twenty-first International Conference on Machine Learning (ICML 2004), Banff, Alberta, Canada, 2004 (ACM Press).

[56] T. Calders and S. Jaroszewicz, “Efficient AUC Optimization for Classification,” in Knowledge Discovery in Databases: PKDD 2007, LNCS 4702, 2007, pp. 42–53.

[57] National Institute of Standards and Technology, “NIST Biometric Scores Set - Release 1,” in [http://www.itl.nist.gov/iad/894.03/biometricscores], 2004.

[58] N. Poh and S. Bengio, “Database, Protocols and Tools for Evaluating Score-Level Fusion Algorithms in Biometric Authentication,” Pattern Recognition, vol. 39, no. 2, pp. 223–233, 2006.
