Data Fusion for Person Verification and Imbalanced Classes Datenfusion zur Personenverifikation bei unausgeglichenen Klassenverhältnissen Master-Thesis von Sergey Sukhanov Tag der Einreichung: 1. Gutachten: Jürgen Hahn, MSc. 2. Gutachten: Prof. Dr.-Ing. A.M. Zoubir

SIGNAL PROCESSING GROUP


Declaration / Erklärung To the best of my knowledge and belief this work was prepared without aid from any other sources except where indicated. Any reference to material previously published by any other person has been duly acknowledged. This work contains no material which has been submitted or accepted for the award of any other degree in any institution. Hiermit versichere ich die vorliegende Arbeit ohne Hilfe Dritter nur mit den angegebenen Quellen und Hilfsmitteln angefertigt zu haben. Alle Stellen, die aus Quellen entnommen wurden, sind als solche kenntlich gemacht. Diese Arbeit hat in gleicher oder ähnlicher Form noch keiner Prüfungsbehörde vorgelegen.

Darmstadt, den 31. Oktober 2014

(Sergey Sukhanov)


Abstract

The performance of many practical classification algorithms can be restricted by limitations that usually come from the application scenario or are inherent to the data acquisition process. The accuracy of systems built on these algorithms (e.g. biometric systems) can vary dramatically or sometimes be intolerably low, which usually results in high false alarm and miss rates. An efficient way to overcome this loss of performance is to use fusion. In this thesis, we implement a state-of-the-art data fusion algorithm, called Neyman-Pearson Support Vector Machines (NPSVMs), that operates on the match score level. We set up the NPSVM framework and extend its formulation by incorporating prior knowledge about the local temporal behavior of objects, thereby increasing the performance of this approach. We also address the class imbalance issues that are typical for many classification and data fusion tasks. Finally, we propose an advanced bagging-based technique to increase the performance of the proposed algorithm.


Abbreviations and Acronyms

ABWBC   adaptive bagging weighted balanced combiner
AIC     Akaike's information criterion
AUC     area under ROC curve
AV      average combiner
BIC     Bayesian information criterion
BWBC    bagging weighted balanced combiner
EM      expectation maximization
ERM     empirical risk minimization
FN      false negative
FP      false positive
FR      face recognition
FRS     face recognition software
G-mean  geometric mean
GMM     Gaussian mixture model
HMM     Hidden Markov Model
iid     independent and identically distributed
pdf     probability density function
KKT     Karush-Kuhn-Tucker conditions
KNN     K-nearest neighbor
KWSS    KNN weighted sampling scheme
MAP     maximum a posteriori
MC      Monte Carlo
ML      maximum likelihood
MV      majority vote
NB      naive Bayes
NIST    National Institute of Standards and Technology of the USA
NP      Neyman-Pearson
NPSVM   Neyman-Pearson Support Vector Machines
RBF     radial basis function
RO      random oversampling
ROC     receiver operating characteristic
RU      random undersampling
SA      simulated annealing
SMOTE   synthetic minority oversampling technique
SRM     structural risk minimization
SV      support vector
SVMs    support vector machines
TP      true positive
TN      true negative
WBC     weighted balanced combiner
WMV     weighted majority vote
VC      Vapnik-Chervonenkis

Symbols

|·|             cardinality of a set
(·)ᵀ            transposition
sgn             sign function
A(·)            activation function
ACC             averaged accuracy for ABWBC method
ACC(k)          individual classifier accuracy for ABWBC method
α               false alarm rate
b               bias of the hyperplane
β               detection threshold
C               SVM trade-off parameter
Δt⁰_smooth      starting point of the smoothing interval, [s]
Δt̃_{m→n}        relative position of the time value inside the smoothing interval
Δt_{m→n}        time difference between detections by instances m and n, [s]
η               posterior probability threshold
F(·)            ensemble aggregation function
f*              NP-optimal data fusion model
f(·)            function that maps SVM decision to π(·)
Φ(·)            function that maps data into feature space
G               number of components in GMM
γ               constant that controls a trade-off between errors of different types
H               feature space
h               Vapnik-Chervonenkis dimension
K               number of weak learners in classifier ensemble
K(·,·)          kernel function
K_nn            number of nearest neighbors
κ               parameter of RBF
L(·)            loss function
λ               parameter vector of GMM
M               number of biometric sensors (cameras)
µ               mean
µ_L             left-most Gaussian component
n               number of training examples
N               Gaussian normal distribution
N_MC            number of Monte Carlo runs
ν               parameter that bounds margin errors and SVs
O(·)            complexity
ω_j             class label
Ω               set of classes
P(x, y)         unknown probability distribution
p_F             probability of false alarm
p_M             probability of missing a true event
P(Δt_{m→n})     transition penalty value
π(·)            decision function of the data fusion algorithm
r_k             random number between 0 and 1 in ABWBC algorithm
S               training dataset
S_i             score matrix for person i
s_m             score vector from camera m
S_maj           majority class subset
S_min           minority class subset
S_valid         validation set
σ²              variance
T_w             time window for detection grouping, [s]
t_k             time stamp for the kth detection
T               accuracy tolerance value
u(k)            undersampling level in ABWBC algorithm
u_i             random number between 0 and 1
υ_i             weights in ABWBC algorithm
w_k             weight of individual classifier
w               normal vector to the hyperplane
x               feature vector
ξ               slack variable
y               class label
z(·)            function that merges two vectors according to max rule

Contents

1 Introduction
  1.1 Data Fusion for Biometrics
  1.2 Fusion Levels
  1.3 Classification and Data Fusion Problem Formulation
  1.4 Neyman-Pearson Framework
  1.5 Applying the Neyman-Pearson Concept to Support Vector Machines
  1.6 Verification and Identification Problems
  1.7 Outline and Contribution of the Thesis

2 Support Vector Machines
  2.1 Statistical Learning Theory
    2.1.1 VC Dimension
    2.1.2 Binary Classification Problem
  2.2 Linear SVMs
    2.2.1 In Case of Separable Classes
    2.2.2 In Case of Non-separable Classes
  2.3 Non-linear SVMs and Kernels
    2.3.1 Kernels
  2.4 Effect of SVM and Kernel Parameters
  2.5 Extending C-SVMs to ν-SVMs
  2.6 Cost-Sensitive Extensions of SVMs: 2C-SVMs and 2ν-SVMs
  2.7 SVMs Probability Output
  2.8 Chapter Summary

3 Class Imbalance Problem
  3.1 SVMs and Class Imbalance
  3.2 Handling the Class Imbalance Problem
    3.2.1 External Balancing Methods
    3.2.2 Internal Balancing Methods
    3.2.3 Ensemble Learning Methods
  3.3 Performance Measures for Learning from Imbalanced Data
  3.4 Chapter Summary

4 Transition Model for NPSVM
  4.1 General Description of the Proposed Method
  4.2 Transition Model
    4.2.1 Travel Time Distribution Estimation
    4.2.2 Transition Penalty Estimation
    4.2.3 Finding False Detection
  4.3 Chapter Summary

5 Bootstrapping Aggregation for Imbalanced Classes
  5.1 Random Undersampling for Class Imbalance
  5.2 Proposed Methods
    5.2.1 Bagging Weighted Balanced Combiner
    5.2.2 Adaptive Bagging Weighted Balanced Combiner
  5.3 Chapter Summary

6 Experiment
  6.1 Data Fusion in Face Recognition Scenario Using NPSVM Method
    6.1.1 Experiment Description
    6.1.2 Problem Formulation
    6.1.3 Solution using NPSVM Approach
    6.1.4 Performance Evaluation
    6.1.5 Performance Evaluation on NIST Dataset
  6.2 Studying Class Imbalance Problem
    6.2.1 Benchmark Data
    6.2.2 Performance Measure for Imbalanced Dataset
    6.2.3 Condition for Class Imbalance
    6.2.4 Applying SMOTE to FR Darmstadt Dataset
    6.2.5 Studying the Redundancy of SMOTE
    6.2.6 Proposed Methods Evaluation
    6.2.7 FR Darmstadt Imbalance Fixing
  6.3 Chapter Summary

7 Conclusion and Outlook

1 Introduction

Data fusion is becoming a more and more common way to improve the performance of a classification system and to increase its operational robustness. A typical example of such a system is a biometric authentication system that is used to identify or verify a person's identity based on traits such as fingerprints, face, voice, iris and others [58]. A classification system that is based on a single type of data and/or sensor can suffer from limitations that come from the nature of the data acquisition process or are imposed by the application scenario. As a result, the performance of such a system drops dramatically, which can lead to a complete failure of the classification process. To address this problem, data fusion techniques are applied, improving the performance of the single classification system in terms of accuracy, efficiency and robustness. Different data fusion techniques have been widely applied to biometric and other multisensor problems; however, it is also possible to apply data fusion to other fields, such as text or digit recognition, medical classification, image processing and others.

1.1 Data Fusion for Biometrics

The main goal of a data fusion system utilizing multiple sensors is to obtain a more certain and reliable final decision, which in turn increases the accuracy of the overall system. To perform the person authentication process, a biometric system is employed. One distinguishes two types of biometric fusion systems: multimodal and intramodal. Multimodal systems use different types of sensors that capture different biometric traits (e.g. fingerprint, handwriting and iris). The independence of the sensors and/or biometric traits makes it possible to increase the reliability and performance of biometric fusion systems [8, 34]. Such systems are also able to withstand impostor attacks, as it is difficult to forge multiple traits simultaneously [8, 5, 26]. Moreover, a multimodal biometric system can deal with non-universality issues, when a person does not possess some trait. There are many studies on multimodal biometric fusion systems [8, 34, 16, 2]. Intramodal systems use several sensors of one type (e.g. only FR cameras) or several features of one trait. There are only a few studies of intramodal biometric fusion, as it is considered not very effective due to the dependence between the data observations and the inability to exploit the heterogeneity of different modalities. However, some studies show that one can obtain tangible benefits by combining different algorithms or features for the same trait [22, 21] and achieve high operating performance in the Neyman-Pearson (NP) framework.

1.2 Fusion Levels

Biometric multimodal fusion can be performed using different fusion techniques. These techniques fall into three main groups: feature level fusion, score level fusion and decision level fusion [36]. Feature level fusion combines several feature vectors obtained from different biometric modalities. This allows exploiting much richer information compared with the other levels. Decision level fusion combines the decisions of the individual biometric systems, each made on its own feature vector. After that, different voting schemes can be applied [69] to obtain the final decision. Score level fusion combines the scores from each biometric matcher according to a specific rule. A biometric matcher is a recognition algorithm that outputs a score corresponding to the similarity between a sensed biometric feature

vector and a stored template feature vector. The main advantage of score level fusion is that such fusion algorithms can be independent of the type of the biometric systems and can ignore sensor or trait peculiarities. Moreover, score level fusion considers more information than fusion at the decision level and provides more accurate fusion results. These are the main reasons why we consider score level fusion in this thesis.

1.3 Classification and Data Fusion Problem Formulation

In this section we give a brief formulation of the learning problem in terms of classification and data fusion. Throughout the thesis we use the term "classification" when there is no reason to distinguish between classification and data fusion tasks; in the end, however, we are solving a data fusion problem. Consider a binary classification problem. Assume a training set of vectors x_1, ..., x_n ∈ R^d with their corresponding labels y_1, ..., y_n ∈ {−1, +1} that are drawn from an unknown probability distribution P(x, y) under the assumption of being independent and identically distributed. A classification algorithm tries to find a function f : R^d → {−1, +1} that predicts the label for a feature vector at its input. Usually, the prediction should be made with a high level of accuracy, which is the same as having a small error probability. However, there are two types of errors that have to be distinguished and treated differently, because each type of error carries a different cost. In a hypothesis testing framework these errors are:

• Type I error: false alarm
• Type II error: miss detection

If a classification algorithm is not able to perfectly separate the classes, there is a trade-off between type I and type II errors. The Neyman-Pearson framework provides a tool for efficiently controlling this trade-off.

1.4 Neyman-Pearson Framework

The ability to control false alarm rates in data fusion algorithms is of crucial importance. Data fusion systems that lack this ability may suffer from a high number of alarm reports that have to be checked manually by the operator of the system. This brings additional operational costs and frequent involvement of a human in the system to check the correctness of the algorithm's decisions. By introducing a predefined false alarm level, one can specify how much error of this type the system can tolerate. In general, in the NP classification or data fusion framework the goal is to learn a function that generalizes from the training dataset such that the probability of misses is minimized, while the probability of false alarm is not higher than a predefined level α. There are several advantages of utilizing the NP concept. Firstly, in many practical scenarios, such as fraud detection, disease classification or person authentication, different types of errors may have significantly different costs. For example, consider a person authentication scenario. The cost of wrongly verifying a person and letting him access a forbidden area can be much higher than the cost of the error when access is denied to a person who possesses access rights. However, it can be challenging to specify these costs. In such scenarios, when one class is more important than the other, it is more natural to specify a limit on the false alarm rate than to assign costs to the various types of errors.

Secondly, even when different types of errors have relatively equal costs, we might prefer to build a classification or data fusion system that performs equally well on both classes, including scenarios in which one class is highly overrepresented compared with the other, i.e. an imbalanced dataset. Thirdly, the NP paradigm does not rely in any way on knowledge of the prior. This can be very advantageous when the class frequencies in the training sample do not accurately reflect the population class frequencies. Fourthly, most real-life classification and data fusion applications dictate the tolerance level for false alarms; in such scenarios only the NP paradigm is suitable for practical implementation. Let us consider a two-class NP-optimal data fusion model. Given a training set and a user-specified false alarm level α, the NP-optimal data fusion model is defined as

$$f_\alpha^* = \operatorname*{argmin}_{f :\, p_F(f) \le \alpha} \; p_M(f), \tag{1.1}$$

where p M is the probability of miss and p F is the probability of false alarm. An NP-optimal model implies that when using the NP paradigm, we set a specific point on the receiver operating characteristic (ROC) curve (Figure 1.1) where we want to operate. This is done by specifying the false alarm rate α.

Figure 1.1: Example of a ROC curve with specified false alarm level α

The ROC curve is a common graphical representation of the performance of binary classification systems. It plots the true positive rate (hit rate) versus the false positive rate (false alarm rate). The ROC curve and its applications will be considered in detail in the next chapters.
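As an illustration of operating-point selection on a ROC curve, the following sketch (not part of the thesis experiments; the synthetic scores and the target level alpha are placeholders) picks the score threshold with the largest hit rate, and hence the smallest miss rate, among all thresholds whose false alarm rate does not exceed a prescribed level α:

```python
import numpy as np
from sklearn.metrics import roc_curve

# Illustrative only: 'scores' are fusion/classifier scores, 'labels' the
# ground truth (+1 = genuine/target event, -1 = impostor/background).
rng = np.random.default_rng(0)
labels = np.concatenate([np.ones(500), -np.ones(5000)])
scores = np.concatenate([rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 5000)])

alpha = 0.01  # user-specified false alarm level

fpr, tpr, thresholds = roc_curve(labels, scores, pos_label=1)
# Keep only the operating points whose false alarm rate stays below alpha,
# then pick the one with the smallest miss rate (largest hit rate).
feasible = fpr <= alpha
best = np.argmax(tpr[feasible])
threshold = thresholds[feasible][best]
print(f"threshold={threshold:.3f}, p_F={fpr[feasible][best]:.4f}, "
      f"p_M={1.0 - tpr[feasible][best]:.4f}")
```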

1.5 Applying the Neyman-Pearson Concept to Support Vector Machines

Applying the NP concept on top of Support Vector Machines is of great importance, as it provides good error control while using one of the most powerful classification tools [23]. This can be done by modifying the statement (1.1) for continuous class probabilities (for an appropriate value of γ):

$$\min_{f} \; \gamma \, p_F(f) + (1 - \gamma)\, p_M(f). \tag{1.2}$$

Thus, by introducing an additional parameter, we can modify the formulation of SVMs so that they can be used in compliance with the NP framework. However, the main challenge in this setting is to set the new additional parameter appropriately. For that we need to accurately estimate p_F and p_M, which is not an easy task in the traditional SVM formulation. Nevertheless, by employing the 2ν-SVM formulation this can be done by a simple grid search over a cost parameter that is bounded from both sides [23]. This provides an accurate estimation of p_F and p_M.
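The following sketch illustrates this tuning loop in a simplified form. It is not the thesis implementation: the bounded cost parameter of the 2ν-SVM is approximated by a class-weight asymmetry γ in scikit-learn, and the data, grid and constraint level are made up for the example.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Toy data: y = +1 (target) vs. y = -1 (non-target).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(1, 1, (300, 2)), rng.normal(-1, 1, (3000, 2))])
y = np.concatenate([np.ones(300), -np.ones(3000)])
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

alpha = 0.05          # tolerated false alarm level
best = (None, 1.0)    # (model, miss rate)

# Sweep the cost asymmetry gamma over a bounded grid, estimate p_F and p_M on
# held-out data, and keep the model with the smallest p_M among those
# satisfying p_F <= alpha.
for gamma in np.linspace(0.05, 0.95, 19):
    clf = SVC(kernel="rbf", C=1.0, class_weight={+1: gamma, -1: 1.0 - gamma})
    clf.fit(X_tr, y_tr)
    pred = clf.predict(X_val)
    p_f = np.mean(pred[y_val == -1] == 1)   # false alarms
    p_m = np.mean(pred[y_val == +1] == -1)  # misses
    if p_f <= alpha and p_m < best[1]:
        best = (clf, p_m)
print("best achievable miss rate under the constraint:", best[1])
```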

1.6 Verification and Identification Problems

Depending on the context, a biometric system solves either an identification or a verification problem. In the verification mode, the system answers the question "Is this person X?" by comparing a biometric sample with the template of a claimed identity. In the identification mode, the system answers the question "Which person is it?" by comparing a biometric sample with all templates stored in the database. The system outputs the ID of the person with the highest similarity score (closed-set ID) or a decision that the person in the biometric sample is not enrolled in the database (open-set ID). In this thesis we treat the verification problem only; however, it is possible to extend the formulation to the identification problem.

1.7 Outline and Contribution of the Thesis

The thesis consists of two parts. In the first part, we focus on the Neyman-Pearson SVM score level fusion approach in FR scenarios, explaining the theory behind this approach and conducting a real-world experiment for the task of person verification. We also extend the formulation of Neyman-Pearson SVMs to integrate additional information on human temporal behavior in a closed environment (capturing the movement of people between FR cameras) and thereby increase the performance of the fusion algorithm. Although this type of information can easily be exploited in a Bayesian framework (e.g. using transition models), to the best of our knowledge there have been no previous attempts to handle it in a frequentist framework such as Neyman-Pearson SVMs. In the second part of the thesis, we address the imbalanced dataset problem that is challenging for almost all data fusion and classification algorithms. We apply different known balancing strategies to our FR scenario and perform evaluations on public datasets. We also propose an advanced bootstrap aggregating technique for dealing with class imbalance. The structure of this thesis is as follows. In the next chapter we provide the theoretical background behind the Neyman-Pearson framework and SVMs. Chapter 3 describes the class imbalance problem and its effects on SVMs. We describe the experiment, conduct a set of tests and propose a new imbalance fixing method in Chapter 6. We close the thesis with extensive discussions and a conclusion in Chapter 7. The main contributions of this master thesis are:

• We implement and evaluate the NPSVM approach applied to FR scenarios
• We incorporate a motion model into the NPSVM framework to increase the performance of this method
• We study the effect of class imbalance on SVMs, evaluate state-of-the-art approaches and provide recommendations for using them
• We propose a novel bootstrap aggregating algorithm that retains high classification performance even on highly imbalanced data

2 Support Vector Machines

Support Vector Machines (SVMs) were introduced by Vapnik [63]. They are considered to be one of the most powerful classification tools and are widely used in many domains such as handwriting recognition [18], text classification [39], image retrieval [62] and others. Their strong theoretical foundation, high generalization capability and impressive practical performance have made them popular among machine learning theoreticians and practitioners. SVMs are based on the Structural Risk Minimization (SRM) concept, which has proved its advantage compared to the usual Empirical Risk Minimization (ERM) principle. SRM, instead of minimizing the error on the training data as ERM does, minimizes an upper bound on the expected risk, which gives SVMs their remarkable generalization performance. In the following, we will consider the mathematical background behind SVMs and their various formulations. We will consider only SVMs for classification tasks; however, SVMs can also be used to solve regression problems.

2.1 Statistical Learning Theory

Statistical learning theory describes the problem of making predictions and decisions from a set of data. It provides a framework in which a learner chooses, from the target space, the model that is closest to the underlying function [25, 6]. The proximity between the model and the underlying function is measured in terms of some error measure. Given the formulation of the learning problem in Chapter 1, we define a loss function L(y, f(x)) that measures the error between the actual value y and the predicted value f(x) for a given x. The expected error (or risk) is the true mean error and is calculated as

$$R[f] = \int_{X \times Y} L(y, f(\mathbf{x}))\, P(\mathbf{x}, y)\, d\mathbf{x}\, dy. \tag{2.1}$$

The task is to find a function that minimizes this expected risk. However, the probability distribution P(x, y) of the data is usually unknown, so this task cannot be solved directly. Approximating the expected risk by the empirical risk, we can write

$$R_{\mathrm{emp}}[f] = \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(\mathbf{x}_i)), \tag{2.2}$$

and the task now is to minimize this empirical risk, i.e. the measured mean error rate on the training set. This is the ERM principle. Most learning algorithms utilize the ERM principle and show quite high performance. However, problems arise when the complexity of the learner is high and it tends to overfit the training data. To solve this problem, Vapnik proposed to consider the capacity of the learner [63]. He showed that with probability 1 − δ the following bound holds:

$$R[f] < R_{\mathrm{emp}}[f] + \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\delta}{4}}{n}}, \tag{2.3}$$

where h is the Vapnik-Chervonenkis (VC) dimension and n is the number of training examples. Minimizing this bound is the main task of the SRM principle.

2.1.1 VC Dimension

The VC dimension is a non-negative scalar value that reflects the capacity of a set of learning machines (or functions). It is the cardinality of the largest set of points that can be shattered (separated under every possible labeling) by the learning machine. A simple machine will lead to a high empirical error and thus to a high expected error. Conversely, a complex machine will lead to a small empirical error, but its VC dimension will be high, again resulting in a high expected error. Hence, the VC dimension concept allows one to obtain a trade-off between the empirical error and the complexity of the learning machine. This is shown in Figure 2.1.

Figure 2.1: Schematic relation among the expected risk, the empirical risk and the capacity term, based on (2.3)

The SRM principle can be formulated as the best trade-off between how well the function fits the training data and the complexity of this function. Therefore it is directly related to the bias-variance trade-off. In the following sections, we will provide the formulation of the SVM based on the SRM.
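For intuition only, the following sketch evaluates the capacity (confidence) term of the bound (2.3) for a few hypothetical values of the VC dimension h; the numbers are arbitrary and serve only to show that the term grows with h and shrinks with the number of training examples n:

```python
import numpy as np

def capacity_term(h, n, delta=0.05):
    """Confidence (capacity) term of bound (2.3) for VC dimension h,
    n training examples and confidence level 1 - delta."""
    return np.sqrt((h * (np.log(2.0 * n / h) + 1.0) - np.log(delta / 4.0)) / n)

# The term grows with the VC dimension h and shrinks with the sample size n,
# which is the trade-off sketched in Figure 2.1.
for h in (10, 100, 1000):
    print(h, capacity_term(h, n=10_000))
```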

2.1.2 Binary Classification Problem

According to the initial formulation [63], the SVM concept is designed as a binary classifier and is thus capable of solving only two-class problems. At the same time, in most real-world scenarios there are more than two classes that we need to discriminate. However, any multi-class problem can be reduced to a set of binary ones and can then be treated successfully by SVMs (one vs. one or one vs. all). In the following, we restrict our consideration to two-class (binary) problems. Let us formulate a binary classification problem. Given is a training data set S that consists of n data examples x_1, ..., x_n, where x_i is a feature vector in a d-dimensional space, x_i ∈ R^d, with corresponding class labels y_1, ..., y_n ∈ {−1, +1}. The samples that belong to the positive class set S_+ have the class label y_i = +1 and the ones that belong to the negative class set S_− have y_i = −1.

2.2 Linear SVMs

In this section we consider two cases: the separable and the non-separable case. The separable case implies that the training examples of the two classes can be separated perfectly using a linear classifier. The non-separable case describes situations in which a perfect separation cannot be achieved. We start with the separable case in the next section and then continue with the non-separable one.

2.2.1 In Case of Separable Classes

Assume there is a hyperplane that separates the positive and negative instances and has the form

$$\mathbf{w} \cdot \mathbf{x} + b = 0, \tag{2.4}$$

where w is the normal vector to the hyperplane and b is the perpendicular offset from the origin to the hyperplane. Such a hyperplane is called a separating hyperplane; it defines the decision boundary between the two classes so that they are perfectly separated (without an error), i.e.

$$\mathbf{w} \cdot \mathbf{x}_i + b \ge 0 \;\; \text{for} \;\; y_i = +1, \qquad \mathbf{w} \cdot \mathbf{x}_i + b \le 0 \;\; \text{for} \;\; y_i = -1. \tag{2.5}$$

Figure 2.2a shows an example of this perfectly separable case. Two classes can be separated by various hyperplanes, however there is only one optimal hyperplane.

Figure 2.2: Perfectly separable dataset and the ideal separating hyperplane. (a) A linearly separable dataset can be separated by different hyperplanes, e.g. L1, L2 or L3; note that there is only one optimal hyperplane among them. (b) The optimal separating hyperplane and the two equidistant hyperplanes H+ and H− that create the margin; the vector w is normal to the optimal separating hyperplane.

A pair of values {w, b} in Equation (2.4) determines the position of the hyperplane. There are infinitely many combinations of {w, b} that define a hyperplane, but only one combination separates the classes and maximizes the margin between them. SVMs are designed to provide this optimal pair {w, b} that separates the positive and the negative training instances. Equation (2.4) can be multiplied by any non-zero value and still describe the same hyperplane. To obtain the canonical form of the hyperplane, the parameters w and b must be scaled so that the distance between the closest point and the hyperplane is $\frac{1}{\|\mathbf{w}\|}$:

$$\min_{i} \; |\mathbf{w} \cdot \mathbf{x}_i + b| = 1. \tag{2.6}$$

A separating hyperplane in the canonical representation satisfies the following constraints:

$$\mathbf{w} \cdot \mathbf{x}_i + b \ge +1 \;\; \text{for} \;\; y_i = +1, \qquad \mathbf{w} \cdot \mathbf{x}_i + b \le -1 \;\; \text{for} \;\; y_i = -1. \tag{2.7}$$

These can be combined to be compactly represented as

$$y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \ge 0 \quad \forall i. \tag{2.8}$$

The instances for which the equality holds are called support vectors (SVs); they are the closest examples to the separating hyperplane. They lie on one of the two following hyperplanes (Figure 2.2b):

$$H_+ : \; \mathbf{w} \cdot \mathbf{x}_i + b = +1, \qquad H_- : \; \mathbf{w} \cdot \mathbf{x}_i + b = -1. \tag{2.9}$$

For the instances that lie on $H_+$ the perpendicular distance from the origin is $\frac{|1-b|}{\|\mathbf{w}\|}$; for the instances that lie on $H_-$ it is $\frac{|-1-b|}{\|\mathbf{w}\|}$. Let us define $d_+$ and $d_-$ as the distances from the separating hyperplane to the positive and negative SVs, respectively. The margin is the sum of $d_+$ and $d_-$. As mentioned before, SVMs maximize this margin. The margin between the hyperplanes $H_+$ and $H_-$ is then

$$\rho(\mathbf{w}, b) = d_+ + d_- = \frac{|\mathbf{w}\cdot\mathbf{x}_+ + b|}{\|\mathbf{w}\|} + \frac{|\mathbf{w}\cdot\mathbf{x}_- + b|}{\|\mathbf{w}\|} = \frac{1}{\|\mathbf{w}\|}\left(|\mathbf{w}\cdot\mathbf{x}_+ + b| + |\mathbf{w}\cdot\mathbf{x}_- + b|\right) = \frac{2}{\|\mathbf{w}\|}, \tag{2.10}$$

where $\mathbf{x}_+ \in S_+$ and $\mathbf{x}_- \in S_-$ are SVs. As $H_+$ and $H_-$ share the same normal vector $\mathbf{w}$, they are parallel, and no training instances are located between them. Hence, the hyperplane that perfectly separates the data and maximizes the geometric margin is the optimal one. Maximizing the margin $\frac{2}{\|\mathbf{w}\|}$ is equivalent to minimizing the term $\frac{1}{2}\|\mathbf{w}\|^2$. This can be formulated as the following constrained optimization problem:

$$\min_{\mathbf{w},\, b} \; \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1, \quad i = 1,\dots,n. \tag{2.11}$$

Since the cost function is quadratic and convex and the constraints are linear, we can solve this optimization problem by introducing $n$ Lagrange multipliers $\alpha_i \ge 0$, $i = 1,\dots,n$, one per constraint, and forming the Lagrangian

$$L_p(\mathbf{w}, b, \boldsymbol{\alpha}) = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{i=1}^{n}\alpha_i y_i(\mathbf{w}\cdot\mathbf{x}_i + b) + \sum_{i=1}^{n}\alpha_i, \tag{2.12}$$

which has to be minimized with respect to the variables $\mathbf{w}$ and $b$ and maximized with respect to $\boldsymbol{\alpha}$. $L_p$ is the primal formulation of the problem, a convex quadratic programming problem. It can be transformed into the dual formulation, which is easier to solve:

$$\max_{\boldsymbol{\alpha}} L_d = \max_{\boldsymbol{\alpha}}\left(\min_{\mathbf{w},\, b} L_p(\mathbf{w}, b, \boldsymbol{\alpha})\right). \tag{2.13}$$

We differentiate $L_p$ with respect to $\mathbf{w}$ and $b$ to find the optimum conditions:

$$\frac{\partial L_p(\mathbf{w}, b, \boldsymbol{\alpha})}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{n}\alpha_i y_i \mathbf{x}_i, \qquad \frac{\partial L_p(\mathbf{w}, b, \boldsymbol{\alpha})}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n}\alpha_i y_i = 0. \tag{2.14}$$

Hence, we can finally formulate the dual optimization problem $L_d$ as

$$\max_{\boldsymbol{\alpha}} \; L_d(\boldsymbol{\alpha}) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j \,\mathbf{x}_i\cdot\mathbf{x}_j \quad \text{s.t.} \quad \sum_{i=1}^{n}\alpha_i y_i = 0, \quad \alpha_i \ge 0, \; i = 1,\dots,n. \tag{2.15}$$

By adjusting the $\alpha$s we maximize $L_d$, which is a quadratic problem. The solution for the optimal $\mathbf{w}$ and $b$ is

$$\mathbf{w} = \sum_{i=1}^{n}\alpha_i y_i \mathbf{x}_i, \qquad b = -\frac{1}{2}\,\mathbf{w}\cdot(\mathbf{x}_+ + \mathbf{x}_-). \tag{2.16}$$

As one can see, the solution is a linear combination of the $\mathbf{x}_i$. However, $\alpha_i$ is non-zero only for the SVs. That is the reason why the hyperplane of an SVM is defined by only a few training instances and is considered to have a sparse solution. Only the SVs are responsible for the solution and, in principle, the other instances can be removed while yielding the same decision boundary. It is also important to mention that the convexity of the quadratic programming problem implies one global minimum and, as a result, one optimal solution. We can formulate a decision function $f(\mathbf{x})$ for a testing example $\mathbf{x}$ as

$$f(\mathbf{x}) = \operatorname{sgn}(\mathbf{w}\cdot\mathbf{x} + b) = \operatorname{sgn}\left(\sum_{i=1}^{n}\alpha_i y_i (\mathbf{x}_i\cdot\mathbf{x}) + b\right), \tag{2.17}$$

which assigns $+1$ or $-1$ to a testing instance. It is worth mentioning that $\mathbf{x}$ appears in this function only in the form of an inner product. This will be taken into account further for the non-separable case.

2.2.2 In Case of Non-separable Classes

In the previous section the SVM formulation was defined under the assumption of a perfectly separable dataset. However, real datasets do not usually satisfy this constraint. In practice, the classes are often not linearly separable, and this must be taken into account when formulating an SVM framework. To allow a trade-off between overfitting and generalization, so-called slack variables ξ_i ≥ 0, i = 1, ..., n, are introduced. Slack variables allow an instance to lie inside the margin or to be misclassified (Figure 2.3). The penalty ξ_i is typically determined as a distance from the decision boundary to the instance x_i. When an example x_i lies inside the margin, the corresponding ξ_i is between 0 and 1. When an example is misclassified, the slack is ξ_i > 1.

Figure 2.3: Non-separable case and slack variables ξ_i

The optimization problem now changes: the objective is to minimize the training error while maximizing the margin. A term $C\sum_{i=1}^{n}\xi_i$ is added to the objective function to penalize margin and classification errors. The new SVM formulation, called "soft-margin SVM" or "C-SVM", is then

$$\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \quad y_i(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0, \quad i = 1,\dots,n, \tag{2.18}$$

where $C > 0$ is a parameter that determines the trade-off between maximizing the margin and minimizing the training error. The inequality constraints are also changed, allowing margin and classification errors. The Lagrangian of the new optimization problem is

$$L_p(\mathbf{w}, b, \boldsymbol{\alpha}, \boldsymbol{\xi}, \boldsymbol{\beta}) = \frac{1}{2}\|\mathbf{w}\|^2 + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\big(y_i(\mathbf{w}\cdot\mathbf{x}_i + b) - 1 + \xi_i\big) - \sum_{i=1}^{n}\beta_i\xi_i, \tag{2.19}$$

where $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are the Lagrange multipliers. As before, the Lagrangian is minimized with respect to the variables $\mathbf{w}$, $b$ and $\boldsymbol{\xi}$ and maximized with respect to $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$. The dual formulation is then

$$\max_{\boldsymbol{\alpha}, \boldsymbol{\beta}} L_d = \max_{\boldsymbol{\alpha}, \boldsymbol{\beta}}\left(\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} L_p(\mathbf{w}, b, \boldsymbol{\alpha}, \boldsymbol{\xi}, \boldsymbol{\beta})\right). \tag{2.20}$$

Taking partial derivatives with respect to $\mathbf{w}$, $b$ and $\boldsymbol{\xi}$, we obtain the minimum of $L_p$:

$$\frac{\partial L_p}{\partial \mathbf{w}} = 0 \;\Rightarrow\; \mathbf{w} = \sum_{i=1}^{n}\alpha_i y_i \mathbf{x}_i, \qquad \frac{\partial L_p}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{n}\alpha_i y_i = 0, \qquad \frac{\partial L_p}{\partial \xi_i} = 0 \;\Rightarrow\; \alpha_i + \beta_i = C. \tag{2.21}$$

From this, the final dual problem is

$$\max_{\boldsymbol{\alpha}} \; L_d(\boldsymbol{\alpha}) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j \,\mathbf{x}_i\cdot\mathbf{x}_j \quad \text{s.t.} \quad \sum_{i=1}^{n}\alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \; i = 1,\dots,n. \tag{2.22}$$

As can be seen, the dual problem is identical to that of the separable case, except for the upper bound on the Lagrange multipliers. The solution of the problem is the same as for the separable case. However, the SVs in this solution are not just the instances that lie on the hyperplanes $H_+$ and $H_-$, but also the ones that lie between these hyperplanes and the decision boundary or on the wrong side of the decision boundary. A remarkable fact about the formulation of the SVM optimization problem is that the training data appear only in inner products, which can be replaced with a non-linear kernel function. In the following section we consider kernel-based SVMs.

2.3 Non-linear SVMs and Kernels

SVMs are also capable of solving non-linear problems by applying the so-called "kernel trick" [30]. The goal is still to achieve the maximum margin, but in a higher-dimensional space. As mentioned before, the "kernel trick" is possible since the data vectors appear only in the form of dot products, i.e. $\mathbf{x}_i\cdot\mathbf{x}_j$. The idea of the "kernel trick" is to map the input data to some higher-dimensional space (the feature space), where it can be separated linearly. The mapping can be done, possibly, to an infinite-dimensional space, which guarantees linear separation of the data examples. The mapping from the data into the feature space is defined as

$$\Phi : \mathbb{R}^d \mapsto \mathcal{H}, \tag{2.23}$$

where $d$ is the dimension of the input space and $\mathcal{H}$ is the feature space. The objective of the "kernel trick" is to calculate the dot product in the feature space, so that the mapping is

$$\mathbf{x}_i\cdot\mathbf{x}_j \mapsto \Phi(\mathbf{x}_i)\cdot\Phi(\mathbf{x}_j). \tag{2.24}$$

However, instead of knowing $\Phi$ and calculating it explicitly, one can use a kernel $K$ such that

$$K(\mathbf{x}_i, \mathbf{x}_j) = \Phi(\mathbf{x}_i)\cdot\Phi(\mathbf{x}_j). \tag{2.25}$$

Note that $K$ must satisfy Mercer's conditions (we discuss them in Section 2.3.1). Now, we need to incorporate the kernel $K$ into the optimization problem. For this we have

$$\max_{\boldsymbol{\alpha}} \; L_d(\boldsymbol{\alpha}) = \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \quad \text{s.t.} \quad \sum_{i=1}^{n}\alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \; i = 1,\dots,n. \tag{2.26}$$

The final classification is also performed using the kernel. The decision function is

$$f(\mathbf{x}) = \operatorname{sgn}\left(\sum_{i=1}^{n}\alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\right), \tag{2.27}$$

where $\mathbf{x}$ is an example that we want to test. It should be mentioned that the feature space $\mathcal{H}$ usually has a much higher dimension (up to infinite) than the original input space. However, the time for training on mapped and unmapped data is approximately the same [10]. It is also worth mentioning that, in practice, convergence problems may occur. In the following, we consider the requirements for a function to be a valid kernel. We also present the most widely used kernel functions and discuss their properties.
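The identity (2.25) can be checked numerically for a simple case. The sketch below (illustrative only) uses the homogeneous polynomial kernel of degree 2 in two dimensions, whose explicit feature map is known, and verifies that the kernel value equals the dot product in the feature space:

```python
import numpy as np

def phi(x):
    """Explicit feature map of the homogeneous polynomial kernel of degree 2
    in two dimensions: (x1^2, sqrt(2)*x1*x2, x2^2)."""
    return np.array([x[0] ** 2, np.sqrt(2.0) * x[0] * x[1], x[1] ** 2])

def k_poly(x, z):
    """Kernel evaluation without ever forming the feature map."""
    return (x @ z) ** 2

x = np.array([1.0, 2.0])
z = np.array([0.5, -1.0])
print(phi(x) @ phi(z), k_poly(x, z))  # identical values, cf. (2.25)
```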

2.3.1 Kernels

There are special conditions that a function has to satisfy to be a kernel. In general, a kernel function can be seen as a similarity measure between the input instances. This view allows one to think of wide families of kernels. A function K is considered to be a valid kernel if it satisfies the following conditions:

1. K must be symmetric
2. K must be positive definite
3. K must satisfy Mercer's theorem [10]

These conditions only ensure that the function is a kernel; they do not specify whether a particular function is suitable and good enough for a given problem. In principle, dozens of kernel functions can be created. The most popular among them are:

• Polynomial kernel
$$K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i\cdot\mathbf{x}_j + \theta)^d, \tag{2.28}$$
where $\theta$ is a parameter specified by the user and $d$ is the polynomial degree. Note that if $d = 1$, the polynomial kernel becomes a linear one. Usually, for linear kernels the constant $\theta$ is omitted.

• Gaussian radial basis function (RBF)
$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right), \tag{2.29}$$
where $\sigma^2$ is a width (variance) on each SV specified by the user and $\|\mathbf{x}_i - \mathbf{x}_j\|^2$ is the squared Euclidean distance between the two objects. We will discuss the influence of $\sigma^2$ in the next section. Sometimes the RBF kernel is simply called the Gaussian kernel.

• Combinations of kernels
More sophisticated kernels can be produced by combining existing kernels as a sum
$$K(\mathbf{x}_i, \mathbf{x}_j) = \sum_{r} K_r(\mathbf{x}_i, \mathbf{x}_j) \tag{2.30}$$
or as a product
$$K(\mathbf{x}_i, \mathbf{x}_j) = \prod_{r} K_r(\mathbf{x}_i, \mathbf{x}_j). \tag{2.31}$$
New kernels can also be obtained by multiplying an existing kernel by a constant or by adding a constant to an existing kernel.
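As a hedged illustration of the kernel combinations (2.30)-(2.31), the sketch below builds a sum of an RBF and a polynomial kernel as a callable Gram-matrix function and passes it to an off-the-shelf SVM implementation; the data, kernel parameters and the choice of scikit-learn are assumptions made only for this example:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel

def sum_kernel(A, B):
    """Sum of an RBF and a polynomial kernel, cf. (2.30); both summands are
    valid kernels, hence their sum is a valid kernel as well."""
    return rbf_kernel(A, B, gamma=0.5) + polynomial_kernel(A, B, degree=2, coef0=1.0)

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)  # XOR-like, non-linear problem

clf = SVC(kernel=sum_kernel, C=1.0).fit(X, y)
print("training accuracy:", clf.score(X, y))
```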

2.4 Effect of SVM and Kernel Parameters

SVMs have a set of user-defined parameters called hyperparameters. These are usually the parameter C, which specifies the trade-off between margin maximization and error minimization, and the parameters of the kernel function. Let us consider the effect of the hyperparameters on the decision boundary of the SVM. When C is chosen to be large, the classifier heavily penalizes margin and classification errors. The model becomes more complex and the margin between the classes gets smaller. When C is chosen to be small, only a small penalty is assigned to margin and classification errors, making the model less complex; the margin becomes larger for the remaining non-erroneous data samples. In general, C can be treated as a regularization parameter that controls the complexity of the model. To discuss how the kernel parameter influences the decision boundary, let us consider the Gaussian kernel (Equation 2.29) and set $\kappa = \frac{1}{2\sigma^2}$, so that the RBF becomes

$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\kappa\|\mathbf{x}_i - \mathbf{x}_j\|^2\right). \tag{2.32}$$

The parameter κ, like the parameter C, can significantly affect the decision boundary of the SVM. Generally, κ controls the flexibility of the decision boundary and how far the effect of each sample spreads. Analyzing the RBF formulation (Equation 2.32), one can notice that as κ gets smaller, the decision boundary becomes smoother and can, in principle, approach a linear decision boundary. As κ gets larger, the Gaussian "bumps" become more localized, leading to a curvy decision boundary. If κ is chosen to be too large, overfitting can easily happen. We can therefore conclude that κ, like the parameter C, sets the trade-off between overfitting and generalization of the SVM. To conclude the discussion on the hyperparameters, it is worth mentioning that various combinations of κ and C can lead to similar decision boundaries. The typical way to explore these hyperparameters is a grid search. However, there is still an open question: which intervals of these hyperparameters should be explored to obtain the best result? We address this problem in the next sections by introducing another SVM formulation.
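A typical grid search over C and the RBF parameter (called gamma in scikit-learn, corresponding to κ here) can be sketched as follows; the grids and the toy data are placeholders rather than the values used in the experiments:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(5)
X = rng.normal(size=(300, 2))
y = np.where(np.linalg.norm(X, axis=1) < 1.0, 1, -1)  # circular boundary

# Logarithmic grids are the usual choice, since C and kappa act on the
# flexibility of the decision boundary over several orders of magnitude.
param_grid = {"C": np.logspace(-2, 3, 6), "gamma": np.logspace(-3, 2, 6)}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```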

2.5 Extending C-SVMs to ν-SVMs

An alternative to the C-SVM formulation is the ν-SVM formulation introduced by Schölkopf et al. in [59]. It reformulates the SVM optimization problem so as to substitute the parameter C with another parameter ν. In general, C has no physical meaning and it is always hard to choose in a practical application. The parameter ν, on the other hand, is very intuitive and has a clear interpretation: it provides control over the errors and the number of support vectors. Strictly speaking, ν ∈ [0, 1] is an upper bound on the fraction of margin errors and a lower bound on the fraction of support vectors. So, by setting this parameter, a user can roughly anticipate how the decision boundary will behave: by increasing ν, more margin errors are tolerated and, as a result, the margin becomes larger. Another advantage of this formulation is that ν lives in a bounded interval and can easily be explored via grid search. The primal formulation of the ν-SVM is

$$\begin{aligned} \min_{\mathbf{w},\, b,\, \boldsymbol{\xi},\, \rho} \quad & \frac{1}{2}\|\mathbf{w}\|^2 - \nu\rho + \frac{1}{n}\sum_{i=1}^{n}\xi_i \\ \text{s.t.} \quad & y_i\big(K(\mathbf{w}, \mathbf{x}_i) + b\big) \ge \rho - \xi_i, \quad i = 1,\dots,n, \\ & \xi_i \ge 0, \quad i = 1,\dots,n, \qquad \rho \ge 0, \end{aligned} \tag{2.33}$$

and the dual formulation is

$$\begin{aligned} \min_{\boldsymbol{\alpha}} \quad & L_d(\boldsymbol{\alpha}) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \\ \text{s.t.} \quad & \sum_{i=1}^{n}\alpha_i y_i = 0, \qquad \sum_{i=1}^{n}\alpha_i \ge \nu, \qquad 0 \le \alpha_i \le \frac{1}{n}, \quad i = 1,\dots,n. \end{aligned} \tag{2.34}$$

In this formulation a new parameter ρ is added to the objective function; it represents the position of the margin. Training errors now have ξ_i > ρ, and examples that lie within the margin have ξ_i ∈ (0, ρ]. The C-SVM and ν-SVM formulations share one big drawback: they penalize errors of both types equally, with a symmetric cost (of 1/n in the ν-SVM). This may not be preferable when the classes have different costs, as discussed in the introduction. To address this problem, we next present so-called cost-sensitive extensions of SVMs. First, we consider the cost-sensitive C-SVM, referred to as the 2C-SVM, and then the 2ν-SVM.
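The interpretation of ν can be observed directly with an off-the-shelf ν-SVM implementation. The following sketch (toy data; not the thesis setup) reports the fraction of support vectors, which is lower-bounded by ν, and the training error rate, which is a subset of the margin errors upper-bounded by ν:

```python
import numpy as np
from sklearn.svm import NuSVC

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(1.0, 1.0, (250, 2)), rng.normal(-1.0, 1.0, (250, 2))])
y = np.concatenate([np.ones(250), -np.ones(250)])

for nu in (0.05, 0.2, 0.5):
    clf = NuSVC(nu=nu, kernel="rbf", gamma=1.0).fit(X, y)
    sv_fraction = clf.support_vectors_.shape[0] / len(X)
    train_error = 1.0 - clf.score(X, y)  # training errors are a subset of margin errors
    print(f"nu={nu}: SV fraction={sv_fraction:.2f} (>= nu), training error={train_error:.2f}")
```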

2.6 Cost-Sensitive Extensions of SVMs: 2C-SVMs and 2ν-SVMs

Cost-sensitive extensions of the SVM penalize different kinds of errors asymmetrically, allowing the algorithm to be applied in situations when one class is more important than the other. Let us first consider the 2C-SVM formulation. Two notations are possible for assigning different costs to different kinds of errors. One possibility is, instead of having the penalty term $C\sum_{i=1}^{n}\xi_i$ for all errors, to use $C_+\sum_{i\in I_+}\xi_i$ and $C_-\sum_{i\in I_-}\xi_i$ as the penalty terms for the errors of each class, where $I_+ = \{i : y_i = +1\}$ and $I_- = \{i : y_i = -1\}$ [6, 19]. Another possibility is to decompose $C\sum_{i=1}^{n}\xi_i$ into $C\gamma\sum_{i\in I_+}\xi_i$ and $C(1-\gamma)\sum_{i\in I_-}\xi_i$, where $C > 0$ and $\gamma \in [0, 1]$. In the following we choose the second way of representing different costs for different errors. As a result, the primal of the 2C-SVM is

$$\begin{aligned} \min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \quad & \frac{1}{2}\|\mathbf{w}\|^2 + C\gamma\sum_{i\in I_+}\xi_i + C(1-\gamma)\sum_{i\in I_-}\xi_i \\ \text{s.t.} \quad & y_i\big(K(\mathbf{w}, \mathbf{x}_i) + b\big) \ge 1 - \xi_i, \quad i = 1,\dots,n, \\ & \xi_i \ge 0, \quad i = 1,\dots,n, \end{aligned} \tag{2.35}$$

and the dual is

$$\begin{aligned} \min_{\boldsymbol{\alpha}} \quad & L_d(\boldsymbol{\alpha}) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) - \sum_{i=1}^{n}\alpha_i \\ \text{s.t.} \quad & \sum_{i=1}^{n}\alpha_i y_i = 0, \\ & 0 \le \alpha_i \le C\gamma \;\; \text{for} \;\; i \in I_+, \qquad 0 \le \alpha_i \le C(1-\gamma) \;\; \text{for} \;\; i \in I_-. \end{aligned} \tag{2.36}$$

In this formulation, γ controls the trade-off between the two kinds of errors. As in the ν-SVM, it is possible to replace the parameter C with ν. The so-called 2ν-SVM is then

$$\begin{aligned} \min_{\mathbf{w},\, b,\, \boldsymbol{\xi},\, \rho} \quad & \frac{1}{2}\|\mathbf{w}\|^2 - \nu\rho + \frac{\gamma}{n}\sum_{i\in I_+}\xi_i + \frac{1-\gamma}{n}\sum_{i\in I_-}\xi_i \\ \text{s.t.} \quad & y_i\big(K(\mathbf{w}, \mathbf{x}_i) + b\big) \ge \rho - \xi_i, \quad i = 1,\dots,n, \\ & \xi_i \ge 0, \quad i = 1,\dots,n, \qquad \rho \ge 0, \end{aligned} \tag{2.37}$$

and its dual is

$$\begin{aligned} \min_{\boldsymbol{\alpha}} \quad & L_d(\boldsymbol{\alpha}) = \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j) \\ \text{s.t.} \quad & \sum_{i=1}^{n}\alpha_i y_i = 0, \qquad \sum_{i=1}^{n}\alpha_i \ge \nu, \\ & 0 \le \alpha_i \le \frac{\gamma}{n} \;\; \text{for} \;\; i \in I_+, \qquad 0 \le \alpha_i \le \frac{1-\gamma}{n} \;\; \text{for} \;\; i \in I_-. \end{aligned} \tag{2.38}$$

In principle, the 2ν-SVM expresses the same set of solutions as the 2C-SVM [13]. However, the parameters of the 2ν-SVM are more intuitive, which is why this formulation is more convenient to use. The parameter γ can be considered as a parameter that controls the trade-off between false alarms and misses. By adjusting γ, one can achieve the desired false alarm rate α. Recalling the NP framework discussed in the introduction, one can see that Equation (1.2) can be applied to SVMs. This implies that, in principle, the NP classification framework can be set on top of SVMs by using the cost-sensitive formulation with the ν and γ parametrization. This also requires tuning the parameters to achieve the required error level and to be able to operate at the desired point of the ROC curve. However, it is not trivial to tune the parameters to obtain the desired false alarm level: to determine an appropriate value of γ, one requires accurate estimates of p_F and p_M [23]. The native SVM formulations considered so far provide only a class label as an output. In many practical applications it is important to have a measure of confidence in the classification result. In the next section, we consider a method that provides a posterior probability along with the corresponding class label.

2.7 SVMs Probability Output

In the original formulation, SVMs are able to output class labels only; however, in many applications posterior probabilities are also required. There is no unambiguous way to employ the functional distance of an example from the hyperplane so as to obtain the predicted class label along with its posterior probability at the output of the SVM [57]. On the other hand, it is very important to have reliable posterior probabilities that fully reflect the certainty of the classifier about its output decision. To address this problem, several methods were proposed, of which the most widely used is the so-called Platt's method [57]. This approach maps the distance output of the SVM to a posterior probability by fitting a sigmoid function to the data. Let us formalize this method. Having a distance measure f at the output of the SVM (the decision value at x), we need to estimate p(y = 1 | f). A parametric model is used to fit the posterior probability. Figure 2.4a from [57] presents a histogram of the densities obtained using the UCI Adult dataset [3]. These densities have non-Gaussian, asymmetric distributions. That is why a sigmoid function is proposed:

$$P(y = 1 \,|\, f) = \frac{1}{1 + \exp(Af + B)}, \tag{2.39}$$

where A and B are parameters that have to be estimated. The estimation is done by minimizing the negative log-likelihood function using a separate validation set. In Figure 2.4b from [57] one can see a sigmoid fit that is very close to the true model. Platt's method provides accurate probability estimates and is chosen to be implemented in the experimental part of this thesis to obtain the posterior probabilities of the data fusion algorithm.

Figure 2.4: SVM probability output estimation (from [57]): (a) the histogram of posterior probabilities for a linear SVM; (b) a sigmoid fit to the data
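In practice, a Platt-style sigmoid of the form (2.39) is available in common SVM libraries. The sketch below (illustrative data and settings, not the thesis configuration) lets scikit-learn fit such a sigmoid on top of the decision values and contrasts the resulting posterior estimates with the raw distances:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(1.0, 1.5, (400, 2)), rng.normal(-1.0, 1.5, (400, 2))])
y = np.concatenate([np.ones(400), -np.ones(400)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# probability=True makes scikit-learn fit a Platt-style sigmoid on top of the
# decision values via internal cross-validation, cf. (2.39).
clf = SVC(kernel="rbf", probability=True, random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)         # posterior estimates per class
decision = clf.decision_function(X_te)  # raw (unbounded) distances
print(proba[:3], decision[:3])
```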

2.8 Chapter Summary

In this chapter, we described the general theory behind SVMs, presented different SVM formulations and discussed their advantages and disadvantages. We showed that by employing 2ν-SVMs, one can make SVMs satisfy the NP paradigm. We also described a method that extends the SVM output with accurate probability estimates along with the corresponding class labels.


3 Class Imbalance Problem

Class imbalance problems appear in many practical applications, where one out of several classes is significantly overrepresented compared to the other classes [37]. Thus, it is considered to be an important research problem. Imbalance usually comes from the essence of a problem and the data acquisition process itself. A lack of occurrences of a specific phenomenon in the real world or insufficient time or means to acquire the data are possible reasons for imbalance issues [40]. A class imbalance problem can appear in many scenarios such as rare disease detection, credit card fraud [11], oil spills [45], text classification [14] and others. The level of imbalance varies and depends on the problem. Usually, real data sets consist of a high percentage of "normal" examples and only a small number of "abnormal" examples. For instance, a typical mammography dataset might contain 98% "normal" pixels and only 2% "abnormal" pixels. That means that the number of instances of one class ("normal" pixels) is much larger than that of the other class ("abnormal" pixels). Moreover, the two classes usually have different misclassification costs: typically, it is more costly to misclassify the "abnormal" pixels than the "normal" ones [61]. At the same time, most original classification and data fusion algorithms do not consider the class imbalance problem and, as a result, can show quite poor classification performance. There are three main reasons for that [27]:

• many algorithms try to minimize an overall error expressed by the accuracy, to which the minority class contributes little
• the costs that the algorithms assign to different types of errors are usually the same, while in practice the cost of misclassifying a minority class sample is usually higher
• there is usually an assumption of identical distribution of all classes, which might not hold in reality

Many recent research works address the class imbalance problem; we consider them in the next sections. First, let us establish the notation that is used in the remainder of this thesis. As before, consider a binary classification problem. Let S = {(x_i, y_i)}, i = 1, ..., n, be a training dataset with a certain imbalance level, i.e. with n data points (|S| = n), where x_i is a training example in the d-dimensional feature space and y_i ∈ {−1, +1} is its corresponding class label. The subsets S_min ⊂ S and S_maj ⊂ S are the sets of minority and majority class samples, respectively, so that S_min ∩ S_maj = ∅ and S_min ∪ S_maj = S. In the following section we discuss the effect of class imbalance on SVMs.

3.1 SVMs and Class Imbalance

As discussed in Chapter 2, the SVM decision boundary depends only on the SVs. This makes SVMs quite resistant to noise and small class overlaps. Moreover, it enables them to perform well on moderately imbalanced training datasets [38]. However, SVMs suffer from reduced performance when the level of imbalance is high or extreme [1, 66]. The reason is that in SVMs and other discriminative methods the classes compete with each other directly, and the predominance of one of them significantly decreases the classifier performance. As a consequence, SVMs heavily suffer from decision boundaries biased towards the minority class, resulting in a classifier that predicts the majority class

more often. In [1, 66] the effects of class imbalance on SVMs are considered. We can summarize them as follows:

1. Soft-margin weakness
As the class imbalance ratio gets higher, the decision boundary is shifted more and more towards the minority class samples. A decision boundary bias appears because the optimal C-SVM classifier (Equation 2.18) will be the one with a low complexity and a small training error. This comes from the fact that C-SVMs try to achieve a reasonable trade-off between minimizing the classification error and maximizing the margin. This leads to a boundary that classifies most of the positive samples as negatives (Figure 3.1), which implies that the minority class instances are more likely to be misclassified than the majority class instances. Despite the fact that the prediction accuracy of such a classifier will be rather high, it is of little practical use. However, this problem does not appear when the classes are linearly separable, even for a high imbalance degree.


Figure 3.1: Decision boundary is shifted toward the minority class examples because of the soft-margin weakness

2. Imbalanced support vector ratio
The imbalanced support vector ratio results directly from the imbalanced dataset and leads to decision boundary skew. In [66] it is shown that as the training dataset gets more imbalanced, the ratio of support vectors between the two classes becomes more imbalanced as well. However, Akbani et al. [1] claim that this does not seriously influence the performance when the imbalance is moderately low. The Karush-Kuhn-Tucker conditions impose the constraint that in the dual formulation the sums of the αs associated with the positive and negative support vectors must be equal (Equation 2.22). As a result, support vectors of the minority class get larger values of α than the support vectors of the negative class, which offsets the support vector ratio imbalance.


3. Positioning of the minority examples
As pointed out in [1] and [66], the minority class instances lie further away from the ideal decision boundary than the majority ones. This is explained from the probabilistic point of view: the fewer examples the minority class has, the further away they are located from the decision boundary. This also leads to a skewed decision boundary.

All these factors limit the performance of SVMs in the presence of class imbalance and need to be considered when choosing a balancing strategy. Moreover, taking into account the nature of SVMs, there are several requirements on the balancing methods:

1. The SVM training algorithm complexity is between O(n^2) and O(n^3). This implies that all oversampling methods lead to a significant increase of the dataset size and, as a result, of the training time. In general, it is preferable to use less computationally expensive strategies.
2. The SVM decision boundary is defined only by the SVs. Keeping them is therefore of crucial importance for establishing a correct separating hyperplane.
3. The balancing method that is used should not introduce large additional complexity.
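The boundary shift described above can be reproduced with a few lines of code. The sketch below (synthetic, heavily imbalanced toy data, not the FR data of this thesis) compares the minority-class recall of an unweighted SVM with one whose error costs are re-weighted to compensate for the skew:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Heavily imbalanced, overlapping classes (minority = +1).
rng = np.random.default_rng(8)
X = np.vstack([rng.normal(1.0, 1.0, (60, 2)), rng.normal(-1.0, 1.0, (2000, 2))])
y = np.concatenate([np.ones(60), -np.ones(2000)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

for cw in (None, "balanced"):
    clf = SVC(kernel="rbf", C=1.0, class_weight=cw).fit(X_tr, y_tr)
    rec_min = recall_score(y_te, clf.predict(X_te), pos_label=1)
    print(f"class_weight={cw}: minority-class recall={rec_min:.2f}")
# Without compensation the boundary is pushed towards the minority class and
# its recall collapses; re-weighting the errors counteracts the skew.
```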

3.2 Handling the Class Imbalance Problem

Class imbalance is considered to be one of the most challenging issues among all machine learning and data mining problems [51]. Over recent years, many approaches have been proposed by the research community to deal with class imbalance problems. These approaches can be grouped into three general categories: external, internal and ensemble learning methods. External methods, also known as data level methods, include algorithms and techniques that operate directly on the training data before the training process itself. These are various resampling strategies, such as undersampling the majority class or oversampling the minority class, as well as data cleaning and preprocessing methods. Internal methods modify the formulation of the learning algorithms to take the class imbalance into account. Ensemble learning methods use the concept of classifier ensembles and, when combined with resampling strategies, can perform well on imbalanced datasets. In the following, we consider these three groups of balancing methods.

3.2.1 External Balancing Methods

The external balancing methods operate directly on the training data and aim at changing the class distributions in order to balance the data. Usually, they consist of resampling the original training set using random or more sophisticated resampling procedures. One of the most popular techniques for balancing the class distributions is oversampling. Oversampling increases the number of minority class samples while keeping the majority class instances constant. The easiest way to oversample the minority class is Random Oversampling (RO), where a random replication of samples is used to balance the class distributions (Figure 3.2). Strictly speaking, we add to S a set E that was randomly sampled from Smin so that |Smaj| = |Smin| + |E|. In other words, the total number of training examples in the set S is increased by |E|:

|S| = |Smaj| + |Smin| + |E|.    (3.1)


(a) The original SVM decision boundary without class imbalance fixing. (b) SVM decision boundary after Random Oversampling.

Figure 3.2: Random Oversampling and its effect on the SVM decision boundary

Despite the fact that RO is considered to be quite a reliable resampling method (it introduces no synthetic information) that can be used in some practical applications, it also brings significant overhead that can dramatically increase the training time of the learner. Moreover, since RO produces exact copies of minority class instances, it can make a classifier too specific and lead to overfitting [14, 46]. Chawla et al. [14] proposed an oversampling method that generates new synthetic minority samples by interpolating between several neighboring minority class instances. This approach is called the Synthetic Minority Oversampling Technique (SMOTE). SMOTE provides additional and, importantly, meaningful information on the minority class. Operating in the "feature space" rather than in the "data space", SMOTE creates synthetic training examples that can balance a training dataset of almost any imbalance level; this also makes the approach application-independent. Moreover, because new samples are interpolated rather than copied, the overfitting problem is largely avoided. The oversampling process is performed by generating synthetic instances for j out of the Knn (j = 1, . . . , Knn) nearest neighbors of every minority class instance xi ∈ Smin. New samples are introduced along the lines that connect a sample with its neighbors. The Knn nearest neighbors are determined for every xi ∈ Smin as the Knn objects from the set Smin whose Euclidean distance to xi in the d-dimensional feature space X is the smallest. Depending on the initial imbalance ratio and the amount of oversampling needed, one randomly chooses j nearest neighbors of every minority class object to produce j new synthetic examples for each xi. For instance, one takes 3 out of the Knn nearest neighbors and generates synthetic instances in the direction of each neighbor. In practice, Knn varies between 4 and 6, depending on the dataset. For example, consider an instance xi and its neighbor x′i. According to SMOTE, a new sample xnew will then be:

xnew = xi + ui · (x′i − xi),    (3.2)

where ui ∼ U(0, 1) is a random value between 0 and 1. Figure 3.3 illustrates the mechanism of generating synthetic samples. Here, new synthetic instances (small blue circles) are generated along the lines that join a particular example (big blue circle with orange border) and its two nearest neighbors. In [14] one can find pseudo code for this method and a more detailed explanation with experiments. There are many other advanced SMOTE-based approaches proposed in the literature, e.g. in [9, 52]. The most popular one is Borderline SMOTE [31], which is basically the standard SMOTE method applied only to the borderline samples; by operating only on a part of the minority class samples, it introduces less overhead.

Figure 3.3: Generation of synthetic samples according to SMOTE

Another straightforward way to balance the training dataset is to undersample the majority class, either according to some rule or randomly. Random Undersampling (RU) is the simplest way of undersampling. In RU, majority class samples are removed randomly, so that the numbers of majority and minority class instances become equal:

|S| = |Smaj| − |E| + |Smin|,    (3.3)

where |E| = |Smaj| − |Smin|. In the end, the dataset S becomes balanced. The main drawback of RU is that it can remove potentially useful information and, as a result, different RU runs can lead to completely different learned functions. The advantage of the undersampling approach, however, is that it significantly reduces the training set size and, as a consequence, the training time. There are more sophisticated undersampling approaches that try to overcome the information loss and make undersampling less random. In [47] the authors proposed the so-called one-sided selection technique, which undersamples the majority class so that only redundant and noisy examples are removed. In [55] another intelligent selection technique is proposed that chooses the majority class instances to be removed in order to eliminate redundant, borderline and noisy training samples as well as outliers. Borderline examples are the ones that lie close to the decision boundary; such samples are considered to be unreliable due to the strong impact of noise. Outliers are usually very rare and do not carry useful information that can improve the generalization ability. By eliminating these types of instances one can attempt to balance the skewed dataset. However, this approach does not really increase the performance of SVMs: redundant samples have no effect on SVMs (as SVMs use only the support vectors to determine the separating hyperplane), while removing some of the borderline samples might have a strong negative effect. One more undersampling technique is proposed in [29], where the authors eliminate instances whose labels do not match those of their three nearest neighbors. Along with undersampling, this method provides good noise-cleaning ability. Its drawback is that it is not possible to control the resulting class distribution, which means that the dataset can still be improperly balanced after such undersampling.
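To make the three basic resampling strategies concrete, the following sketch contrasts RO, SMOTE-style interpolation (Equation 3.2) and RU on a toy two-class dataset. This is an illustrative Python/NumPy sketch, not the thesis implementation (which uses MATLAB); the choices Knn = 5 and the fixed random seed are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X_min, n_needed):
    """RO: replicate randomly drawn minority samples (cf. Equation 3.1)."""
    idx = rng.integers(0, len(X_min), size=n_needed)
    return X_min[idx]

def smote_like(X_min, n_needed, k_nn=5):
    """SMOTE-style interpolation (Equation 3.2): x_new = x_i + u * (x'_i - x_i)."""
    synth = []
    for _ in range(n_needed):
        i = rng.integers(0, len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)     # distances to other minority samples
        neighbors = np.argsort(d)[1:k_nn + 1]            # skip the sample itself
        j = rng.choice(neighbors)
        u = rng.random()
        synth.append(X_min[i] + u * (X_min[j] - X_min[i]))
    return np.array(synth)

def random_undersample(X_maj, n_keep):
    """RU: keep a random subset of the majority class (cf. Equation 3.3)."""
    idx = rng.choice(len(X_maj), size=n_keep, replace=False)
    return X_maj[idx]

# toy imbalanced data: 200 majority vs. 20 minority samples
X_maj = rng.normal(0.0, 1.0, size=(200, 2))
X_min = rng.normal(2.5, 1.0, size=(20, 2))

gap = len(X_maj) - len(X_min)
X_ro = np.vstack([X_min, random_oversample(X_min, gap)])      # minority grown to |S_maj|
X_smote = np.vstack([X_min, smote_like(X_min, gap)])
X_ru = random_undersample(X_maj, len(X_min))                  # majority shrunk to |S_min|
print(len(X_maj), len(X_ro), len(X_smote), len(X_ru))
```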

As can be observed, the undersampling approach itself is not really suitable for SVMs and leads to strong information loss. This can be very critical for finding a proper decision boundary and, as a result, might cause a drop in the performance of the SVMs. Especially when the imbalance ratio is very high (about 100:1), it might be intolerable to eliminate 99% of the majority class instances. In Figure 3.4 one can see how undersampling affects the SVM decision boundary.


(a) The original SVM decision boundary without class imbalance fixing. (b) SVM decision boundary after Random Undersampling.

Figure 3.4: Random Undersampling and its effect on the SVM decision boundary

We will address the problems of undersampling and, in particular, RU further in the experimental part. We will show that the properties of RU can be exploited in a very effective way, leading to impressive results.

3.2.2 Internal Balancing Methods

Internal balancing methods modify the learning algorithm itself so that it can take imbalance issues into account. The main advantage is that they operate on the original dataset without introducing any data overhead or losing information. In this section, we give a brief overview of internal balancing methods for SVMs. The most popular and straightforward method is a cost-sensitive extension of the SVM. This method was first introduced in [64] and theoretically allows the class imbalance problem to be overcome with SVMs. It considers the 2C-SVM formulation (Formula 2.35) in order to assign different misclassification costs to the majority and minority classes. When assigning a higher misclassification cost to the positive class instances than to the negative class instances (C+ > C−), the learning algorithm does not bias the separating hyperplane towards the minority class. The reason is that the cost of making an error on the positive class contributes more to the total misclassification and the learning algorithm tries to avoid these types of errors. In [1], it is proposed to set the misclassification costs according to the ratio between the majority and minority class subsets, so that

C− / C+ = |Smin| / |Smaj|.    (3.4)

In the practical part of this thesis we address the class imbalance problem, explore the concept of modifying the costs for SVMs and draw a conclusion on its effectiveness.
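As an illustration of this cost-sensitive idea, the following sketch weights the per-class penalty C of a soft-margin SVM by the inverse class frequencies, which realizes the ratio of Equation 3.4. It is a Python/scikit-learn sketch given for illustration only; the thesis itself uses MATLAB with LIBSVM, where the same effect is obtained through LIBSVM's per-class weight options, and the toy data below is made up.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# imbalanced toy data: 500 negatives (majority), 25 positives (minority)
X = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)), rng.normal(1.5, 1.0, size=(25, 2))])
y = np.hstack([-np.ones(500), np.ones(25)])

# plain C-SVM: a single cost C for both classes
plain = SVC(kernel="rbf", C=1.0).fit(X, y)

# cost-sensitive SVM: C+/C- = |S_maj|/|S_min| (equivalently C-/C+ = |S_min|/|S_maj|)
ratio = 500 / 25
weighted = SVC(kernel="rbf", C=1.0, class_weight={1: ratio, -1: 1.0}).fit(X, y)

for name, clf in [("plain", plain), ("cost-sensitive", weighted)]:
    pred = clf.predict(X)
    recall_pos = np.mean(pred[y == 1] == 1)   # fraction of minority samples recovered
    print(f"{name:15s} minority recall on training data: {recall_pos:.2f}")
```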

Other internal balancing methods applicable to SVMs are the kernel modification methods. By modifying the kernel function, the SVM learning algorithm becomes less prone to producing a skewed decision boundary. In [67] a Boundary Movement approach is proposed. The idea behind this approach is to change the offset term b in the SVM formulation (Formula 2.3), so that the hyperplane is adjusted as

f(x) = sgn( Σ_{i=1}^{n} yi αi K(x, xi) + b + ∆b ),    (3.5)

where ∆b is an offset that accounts for the bias induced by the class imbalance. There is another method proposed by the same authors called Class-Boundary Alignment [67]. Here, the algorithm transforms the kernel function to increase the margin around the positive class in the higher-dimensional feature space, which increases the TP rate and, in the end, the overall classification performance. Another method, called Kernel Target Alignment [20], was introduced to address the class imbalance issue in [41]. This method measures the degree of agreement between the kernel and the classification problem, and the kernel is adapted based on this measure. Other kernel-modification approaches have been proposed, e.g. in [35], where a kernel-based classification algorithm built on an orthogonal weighted least squares estimator is constructed. This algorithm optimizes the model's generalization capabilities so that it generalizes well on imbalanced datasets.

3.2.3 Ensemble Learning Methods

Ensemble learning methods employ the classifier ensemble concept to perform a classification task. An ensemble of classifiers is a set of several learners whose decisions are combined in a specific manner to classify the test sample. Ensembles are capable of achieving better classification performance than the individual classifiers on which they are based [15, 32]. Usually, ensembles of classifiers are generated using the bootstrapping technique [68, 24], where the original dataset S is sampled K times so that K base learners are induced from these samples. After that, the individual decisions of these learners are combined using different combining techniques. This ensemble learning approach is called bootstrap aggregating (bagging) [7] and can be successfully applied to different classification algorithms. Besides improving classification accuracy, bagging lowers the variance of the predictor and makes the classification result more stable. Despite the fact that SVMs have a built-in procedure for variance reduction, bagged SVMs have been shown to be an effective strategy for practical applications [42, 17]. In the following, we describe the notation used in the sequel. Consider a training set S as defined in the beginning of this chapter. To obtain the K base learners that form the ensemble, one samples K times from the set S to obtain K training datasets S^(k)_bootstrap, k = 1, . . . , K. From these datasets, K individual learners with corresponding decision functions f_k and output labels s_k, k = 1, . . . , K, are induced. The output of the ensemble f(x) is obtained by combining the decision functions f_k according to a specific model F:

f(x) = F(f1(x), . . . , fK(x)).    (3.6)

For a binary classification problem we have two class labels ωj, j = 1, 2, and define ω1 as the positive class and ω2 as the negative one, so that the set of classes is Ω = {ω1, ω2}.

One of the most important steps in creating ensembles is choosing a combination technique. There are several ways to combine the classifiers' outputs. The most popular ones are combining the learners at the class label level or at the score (estimated posterior) level. Here, we consider the most widely used combination approaches that operate at these two levels. First, we summarize three well-known combiners from [48] and then explain another one.

1. The Majority Vote (MV) Combiner is the most popular and most intuitive combining approach and operates at the class label level. The key assumption of this method is that the individual accuracies of the base learners are equal, i.e. P(sk = ωi|ωi) = p and P(sk = ωi|ωj) = 1 − p for any k = 1, . . . , K and i, j = 1, 2, i ≠ j. Here, the most voted class is chosen as the final decision according to the rule

f(x) = argmax_j |I+^(j)|,    (3.7)

where |I+^(c)| is the number of votes for ωc. Analyzing the MV combiner, one can see that a correct ensemble decision requires more than 50% correct votes from the individual learners, i.e. at least ⌊K/2⌋ + 1. The probability that the ensemble makes a correct decision is

P_ens = Σ_{k=⌊K/2⌋+1}^{K} (K choose k) p^k (1 − p)^(K−k).    (3.8)

As the number of base learners K approaches infinity while p > 0.5, the probability that the ensemble makes a correct decision approaches one; if p < 0.5, it approaches zero. It is important to mention that the standard MV rule also includes an additional class constant, so that the original formulation of the posterior probability, given the array s of labels from the K base learners, is

log(P(ωj|s)) ∝ log(P(ωj)) + |I+^(j)| · log( p / (1 − p) ).    (3.9)

In practice, the class constant is not used in the MV rule [48]. Nevertheless, MV is still one of the most accurate and powerful combining techniques.

2. The Weighted Majority Vote (WMV) Combiner is the natural extension of the MV combiner obtained by dropping the assumption of equal individual classifier accuracies, i.e. P(sk = ωi|ωi) = pk and P(sk = ωi|ωj) = 1 − pk for any k = 1, . . . , K and i, j = 1, 2, i ≠ j. The posterior probability for class j is then

log(P(ωj|s)) ∝ log(P(ωj)) + Σ_{k ∈ I+^(j)} ωk,    (3.10)

where ωk are the weights of the individual classifiers, calculated as

ωk = log( pk / (1 − pk) ),  0 < pk < 1.    (3.11)

The WMV combiner is considered to be an effective and reliable combiner [43, 49] and is used in some practical applications.

3. The Naive Bayes (NB) Combiner is another combination approach obtained from the MV combiner by relaxing the assumptions about equal individual classification accuracies and class probabilities, so that P(sk = ωi|ωj) = p_kij. The probability p_kij for the base learner k is the (i, j)-th entry of its probabilistic confusion matrix. Then, the posterior probability for class j is calculated as

log(P(ωj|s)) ∝ log(P(ωj)) + Σ_{k=1}^{K} log(p_{k,s_k,j}).    (3.12)

4. The Average (AV) Combiner is another widely used approach that operates at the posterior level. It outputs the class label based on the averaged probability outputs of the individual learners, denoted as pk(j|x). The final decision is then

f(x) = (1/K) · Σ_{k=1}^{K} pk(j|x).    (3.13)

The AV combiner is known to give more accurate results than the MV combiner, but it is not always applicable. For example, standard SVMs have a non-probabilistic output and, thus, the AV combiner cannot be applied directly.

The combining techniques mentioned above are the most widely used for constructing classifier ensembles. However, none of them considers the class imbalance issue that is typical for practical applications. In Chapter 6, we will address class imbalance using a bagging approach, but propose a new technique to account for the unequal class distributions. We will also show in practice that the standard combining methods have limitations under class imbalance. Although there have been no successful attempts to date to adjust the combining rules to the class imbalance issue, bagging itself is successfully applied to address such problems. The majority of proposals apply the standard combining techniques but use different sampling approaches, sometimes integrating them with a data preprocessing step. The original bagging bootstraps K times (with replacement), so that in every iteration the sampling is done from the whole original dataset. Following this strategy, the resulting bootstrap samples will be as imbalanced as the original dataset. In the Exactly Balanced Bagging approach [50] the sampling is done only from the majority class subset Smaj, while the minority subset is kept untouched. Moreover, the sampling from Smaj is done in such a way that the number of instances in each bootstrap sample is exactly the same as in the minority class subset, i.e. |S^(k)_bootstrap| = |Smin|; this is, in fact, how RU works. In the Roughly Balanced Bagging approach [33] the sampling is done by equalizing the sampling probabilities of both classes, without explicitly fixing the size of the bootstrap sample. Here, the bootstrap sample from Smaj is determined by the negative binomial distribution and the resulting unified set can be slightly imbalanced.
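The three combiners above can be sketched compactly as follows. This is an illustrative Python/NumPy sketch (the thesis combiners are implemented in MATLAB); the labels, accuracies and posteriors below are made-up example values, and the class priors of Equations 3.9-3.10 are omitted for simplicity.

```python
import numpy as np

def majority_vote(labels):
    """MV combiner (Equation 3.7): labels is a (K,) array of +/-1 votes."""
    return 1 if np.sum(labels == 1) > len(labels) / 2 else -1

def weighted_majority_vote(labels, accuracies):
    """WMV combiner (Equations 3.10-3.11), priors omitted: weight votes by log(p_k/(1-p_k))."""
    w = np.log(accuracies / (1.0 - accuracies))
    return 1 if np.sum(w * labels) > 0 else -1

def average_combiner(posteriors):
    """AV combiner (Equation 3.13): posteriors is a (K, 2) array of class probabilities."""
    mean_post = posteriors.mean(axis=0)
    return 1 if mean_post[1] > mean_post[0] else -1

# three base learners: the third is more accurate and outvotes the other two under WMV
labels = np.array([-1, -1, 1])
accuracies = np.array([0.55, 0.55, 0.95])
print(majority_vote(labels))                       # -1 (simple majority)
print(weighted_majority_vote(labels, accuracies))  # +1 (accuracy-weighted)
print(average_combiner(np.array([[0.6, 0.4], [0.7, 0.3], [0.1, 0.9]])))
```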

3.3 Performance Measures for Learning from Imbalanced Data

Evaluation measures are very important when assessing the performance of a classifier. It is always a challenging task to choose a measure that fully reflects the performance of the learning algorithm and at the same time gives insight into the problem. Traditionally, accuracy (Equation 3.14) is the most widely used measure, since it gives a general overview of the classification performance of the algorithm. However, for classification tasks with a class imbalance issue, accuracy becomes a less suitable measure, since the underrepresented class contributes very little to the accuracy measure compared to the

overrepresented class. For instance, consider a scenario where the rare class is represented by only 1% of the samples. A classifier that always predicts the overrepresented class label for every test sample will achieve 99% classification accuracy. However, in applications where we definitely want to discover the samples of the rare class, such a classifier would be useless. We would prefer another one with a lower accuracy that is able to classify the rare class samples correctly. To choose such a classifier one needs to use other performance measures. To introduce them, let us first define a confusion matrix for a binary classification problem. The rows of this matrix represent the actual class label of an object, while the columns represent the predicted class label.

Table 3.1: Confusion matrix for a binary problem

                     Predicted Positives     Predicted Negatives
  Real Positives     TP (True Positive)      FN (False Negative)
  Real Negatives     FP (False Positive)     TN (True Negative)

Typically, the minority class is considered to be the positive class and the majority class the negative one. From Table 3.1 we formulate several measures that are used to evaluate the performance of a classifier. Accuracy is determined as

accuracy = (TP + TN) / (TP + TN + FP + FN).    (3.14)

Precision reflects how many of the objects predicted as positive are actually positive and is defined as

precision = TP / (TP + FP).    (3.15)

Recall measures the fraction of positive objects that were correctly discovered. It is given by

recall = TP / (TP + FN).    (3.16)

The F1-measure combines Precision and Recall and reflects the balance between FP and FN. It is calculated as

f1-measure = (2 × recall × precision) / (recall + precision).    (3.17)

G-mean is the geometric mean of Precision and Recall and is determined as

g-mean = sqrt( TP/(TP + FN) × TP/(TP + FP) ).    (3.18)

G-mean is widely used as a performance metric for classifiers when the class imbalance issue is present, since it considers both Precision and Recall. ROC analysis is also often used to assess the performance of a classifier. In general, it shows the relative trade-off between the TP and FP rates for different threshold values. By varying the threshold one obtains different combinations of TP, TN, FP and FN that are used to plot the ROC curve. The area under the ROC curve (AUC) is a good metric to compare the performance of several classifiers and can be visualized in an intuitive way. The AUC ranges from 0 to 1, and the better the classifier, the higher the AUC value. However, AUC has some limitations and it is not always reasonable to apply it to imbalanced datasets, since it provides only a general overview of the classifier performance.
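The following sketch computes these measures from the entries of Table 3.1. It is a small illustrative Python function (the thesis evaluation itself is done in MATLAB), and the example confusion-matrix counts are made up; G-mean is computed exactly as defined in Equation 3.18, i.e. from Precision and Recall.

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """Compute the measures of Equations 3.14-3.18 from a binary confusion matrix."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * recall * precision / (recall + precision)
          if (recall + precision) > 0 else 0.0)
    g_mean = math.sqrt(recall * precision)
    return accuracy, precision, recall, f1, g_mean

# a classifier that ignores the rare class: 99% accuracy but zero recall and G-mean
print(imbalance_metrics(tp=0, fn=10, fp=0, tn=990))
# a classifier that finds most rare samples at the price of some false alarms
print(imbalance_metrics(tp=8, fn=2, fp=30, tn=960))
```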

3.4 Chapter Summary

In this chapter, we considered the class imbalance problem. We went through the main issues of SVMs that result from class imbalance, reviewed the main approaches designed to address the class imbalance problem and discussed their strengths. Finally, we formulated the evaluation metrics that are used to assess the performance of a classifier learned on an imbalanced dataset.


4 Transition Model for NPSVM

In this chapter, we introduce and formulate an approach to increase the performance of a biometric fusion system by introducing a motion model. In general, motion models reflect the temporal behavior of objects in a multisensor environment. The original SVM formulation does not provide a framework for utilizing additional information such as a transition model. We propose to exploit this additional knowledge and incorporate it into the NPSVM framework to increase the performance of biometric fusion, in particular FR fusion.

4.1 General Description of the Proposed Method

Consider M deployed biometric sensors and Np people participating in the biometric verification scenario. Let the biometric sensors be the FR cameras. Assume that people are moving in this multisensor environment and the sensors detect the moving people (perform recognitions). Consider Figure 4.1, where a state diagram of a particular person i is depicted. This particular identity was recognized by the individual sensors at specific moments of time t_k, k = 1, 2, . . ., with the scores s_m^i. Having this group of detections, we analyze the transition time between each state and decide how likely each transition is. We penalize the score of infeasible transitions by a specific penalty value called the transition penalty.

Figure 4.1: State diagram of transitions for the person i based on the recognition output of the particular sensors

For example, if the time spent to move from sensor m = 1 to sensor m = 2 is less than the typical value for this transition, we assign a penalty to the score of the biometric matcher. However, there is no evidence which of the two detections (or whether both) is responsible for this transition. To determine that, we employ a Bayesian framework.

4.2 Transition Model

The goal of employing a transition model is to obtain a transition penalty value that can be considered as a level of trust in the particular output of the sensor. To assign the transition penalty, we first need to create a motion model (estimate the travel time distribution) and then, according to this distribution, determine the penalty value itself. In the end, we have to decide which one of the two sensors has

produced a wrong recognition and penalize the score of this particular sensor. Let us consider all these steps in detail.

4.2.1 Travel Time Distribution Estimation

This step has to be done offline after obtaining and annotating FR detections that, in turn, should be acquired during the training procedure. To estimate the travel time distribution, we use the time stamps of the detections t_k. For each of the camera pairs m, n = 1 . . . M we calculate a collection of transition time values ∆t m→n that correspond to the time difference between the detections of the same person at FR system m and FR system n:

∆t m→n = t n − t m.    (4.1)

For every camera pair m and n (m, n = 1 . . . M) we collect these transition time estimates and use them to create the corresponding distributions pm→n(∆t m→n). Because people passed the same FR cameras several times within one trial, one can observe several accumulations of transition times around particular values. We model this by means of a multimodal distribution using a Gaussian Mixture Model (GMM):

pm→n(∆t m→n | λ) = Σ_{i=1}^{G} ωi · N(∆t m→n | µi, σi²),    (4.2)

where ∆t m→n is a traveling time estimate, λ is a parameter vector that includes weights ωi ≥ 0, i = 1, . . . , G , and the parameters of the one dimensional Gaussian probability density functions (pdf) N (∆t m→n |µi , σ2i ), i = 1, . . . , G , of the form

N(∆t m→n | µi, σi²) = 1 / (σi √(2π)) · exp( −(∆t m→n − µi)² / (2σi²) ),    (4.3)

with mean µi and variance σi². To learn a GMM from the training set D, we utilize the Expectation Maximization (EM) algorithm [54], which determines λ by Maximum Likelihood (ML) estimation. The goal of ML estimation is to find the λ that maximizes the likelihood function derived from the training set D. The original EM algorithm does not provide any information about the number of components G that has to be used; usually, it is specified by the user. However, if one chooses a model with too few components, it can lead to a high bias and produce poor predictions. Such a model is hardly suited for describing the underlying hidden distribution in a reasonable way. A model with too many components is able to fit the training data almost perfectly, but in the end has a high variance and poor generalization ability. To find a trade-off between under- and overfitting, penalized-likelihood information criteria are often used. Such criteria usually consist of a log-likelihood function with a penalty term to control the complexity. The most popular criteria are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). In this thesis, to determine the optimal number of Gaussian components G in the GMM, we employ the BIC. We learn different models with different G and compute the BIC for each model according to

BIC(λ, D) = log L(D, λ) − |λ| × fs(|D|),    (4.4)

Figure 4.2: Transition model for FR cameras 3 and 4 (histogram of travel times ∆t m→n in seconds, fitted GMM and its components)

where L(D, λ) is the likelihood function, |λ| is the number of parameters and fs(|D|) is a function of the sample size, taken to be the log function in this thesis. Finally, we select the model with the G that provides the highest BIC value. An example of the resulting GMM for the transition model of the FR systems 3 and 4 is depicted in Figure 4.2. Here, one can see a histogram of the travel times (blue bars) between FR systems 4 and 3 and a fitted GMM (red curve); the GMM components are depicted as orange dashed lines. There are two prominent peaks at around 41 and 101 seconds that correspond to the travel times people usually required to get from one camera to the other. We consider the time values located to the left of the left-most peak to be infeasible, and these need to be penalized. Based on this distribution and these considerations, we can now determine the transition penalty.
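A minimal sketch of this model selection step is given below, using scikit-learn's GaussianMixture on synthetic travel times. It is illustrative only: the thesis fits the GMMs via EM in MATLAB, the synthetic data and the candidate range of G are assumptions, and note that scikit-learn's bic() follows the opposite sign convention to Equation 4.4 (lower is better), so the minimum is selected here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# synthetic travel times with two typical transition durations (in seconds)
dt = np.concatenate([rng.normal(41, 5, 300), rng.normal(101, 10, 150)]).reshape(-1, 1)

best_g, best_bic, best_gmm = None, np.inf, None
for g in range(1, 6):
    gmm = GaussianMixture(n_components=g, covariance_type="full",
                          random_state=0).fit(dt)
    bic = gmm.bic(dt)                 # lower is better in scikit-learn's convention
    if bic < best_bic:
        best_g, best_bic, best_gmm = g, bic, gmm

print("selected number of components:", best_g)
print("component means (s):", np.sort(best_gmm.means_.ravel()))
```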

4.2.2 Transition Penalty Estimation

To assign a penalty to a transition according to the motion model obtained in the first step, we make the following assumptions. We assume that a transition time much smaller than the mean value of the left-most Gaussian component is not feasible, since a person cannot move that fast. However, we do not want to penalize the score radically if the transition time is only slightly lower than the mean of the left-most Gaussian. We therefore define a smoothing interval, where only a slight penalization is performed. Figure 4.3 shows an example of the transition model for the FR systems 3 and 4 and the zones that reflect the penalization degree. For the feasible area there is no penalization, while we apply a small penalization in the smoothing area and a strong one in the infeasible area. The transition penalty value P(∆t m→n) ∈ [0, 1] is determined according to the transition model as follows:

P(∆t m→n) = A[0, ∆t smooth) · p̄(∆t m→n)
          + A[∆t smooth, µLM) · ( p̄(∆t m→n) + (1 − p̄(∆t m→n)) · 0.5 · (1 − ∆t′ m→n) )
          + A[µLM, ∞) · 1,    (4.5)

Figure 4.3: Transition model regions

where µLM is the mean of the left-most Gaussian component in the GMM (Equation 4.2) and is given by

µLM = min_i µi.

The term p̄(∆t m→n) ∈ [0, 1] is defined as

p̄(∆t m→n) = p(∆t m→n) / p(µLM).

The starting point of the smoothing interval, ∆t smooth, is defined as

∆t smooth = µLM − c(µLM − σ),

where c ∈ ℜ+ is specified by the user and controls the width of the smoothing interval. The ending point of the smoothing interval is µLM. The value ∆t′ m→n is given by

∆t′ m→n = (µLM − ∆t m→n) / (c(µLM − σ))

and reflects the relative position of the time value inside the smoothing interval. The function A(t 1 , t 2 ) is an activation function and takes the following values:

A(t1, t2) = 1,  if ∆t m→n ∈ (t1, t2),
A(t1, t2) = 0,  if ∆t m→n ∉ (t1, t2).
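To make the penalty of Equation 4.5 concrete, the sketch below implements it on top of a fitted one-dimensional GMM. It is an illustrative Python sketch under the assumption that the mixture density and its left-most component are available (here via scikit-learn, whereas the thesis works in MATLAB); the smoothing constant c = 1 and the toy data are arbitrary choices, not values from the thesis.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def transition_penalty(dt, gmm, c=1.0):
    """Transition penalty P(dt) in [0, 1] following Equation 4.5."""
    means = gmm.means_.ravel()
    stds = np.sqrt(gmm.covariances_.ravel())
    i_lm = np.argmin(means)                      # left-most Gaussian component
    mu_lm, sigma = means[i_lm], stds[i_lm]
    dt_smooth = mu_lm - c * (mu_lm - sigma)      # start of the smoothing interval

    density = lambda t: np.exp(gmm.score_samples(np.array([[t]])))[0]
    p_bar = min(density(dt) / density(mu_lm), 1.0)   # normalized density in [0, 1]

    if dt >= mu_lm:                               # feasible region: no penalization
        return 1.0
    if dt >= dt_smooth:                           # smoothing region: mild penalization
        dt_rel = (mu_lm - dt) / (c * (mu_lm - sigma))
        return p_bar + (1.0 - p_bar) * 0.5 * (1.0 - dt_rel)
    return p_bar                                  # infeasible region: strong penalization

# toy travel times with a dominant mode around 41 s
rng = np.random.default_rng(3)
dt_train = rng.normal(41, 5, 500).reshape(-1, 1)
gmm = GaussianMixture(n_components=1, random_state=0).fit(dt_train)
for t in (5.0, 38.0, 60.0):
    print(t, round(transition_penalty(t, gmm), 3))
```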

4.2.3 Finding False Detection

At this point the goal is to find which FR detection is responsible for the infeasibility of the transition. This could be, in principle, the detection from FR camera m, the detection from FR camera n, or both of them. To determine this, we employ a Bayesian decision theory framework with a 0/1 loss, i.e. the maximum a posteriori (MAP) rule. According to this, we test the scores of both FR detections and determine which class they belong to: genuine or impostor. The genuine class represents successful recognitions, while the impostor class represents failed ones. According to Bayes' rule we have

p(c|s) = p(s|c) · p(c) / p(s),    (4.6)

which gives the posterior probability for the class c ∈ {genuine, impostor} given a score value s. After that, a decision in favor of one of the two classes is made. The likelihood p(s|c) for both classes is obtained during the training process, where the class conditional distributions p(s|genuine) and p(s|impostor) are learned from the labeled training data for every FR camera. Under the assumptions that the scores are independent of the person and that the distribution is multimodal, one can model it with a GMM, employing the EM algorithm to determine the parameters of the model. In Figure 4.4 an example of the genuine and impostor likelihood models for camera 3 is shown.

(a) Impostor likelihood

(b) Genuine likelihood

Figure 4.4: Likelihoods for genuine and impostor classes

The priors p(c) for both classes are chosen according to the relative frequencies of the scores of both classes in the training dataset. As a result, we formulate a rule according to which we penalize a "suspected" score. We introduce penalization only if the "suspected" score belongs to the impostor class according to

p(genuine|s) ≷ p(impostor|s),    (4.7)

deciding for the genuine class if the left-hand side is larger and for the impostor class otherwise,

or, going further, we formulate a likelihood ratio test:

Λ = p(s|genuine) / p(s|impostor)  ≷  p(impostor) / p(genuine),    (4.8)

again deciding for the genuine class if the left-hand side is larger and for the impostor class otherwise.

It should be mentioned that this step is only necessary if the transition penalty is less than one. Otherwise, there is no need to penalize the scores, as they are not "suspected" ones. This makes our model very conservative: we apply penalization only when we are absolutely sure about the infeasibility of the transition. To penalize the scores, we update the corresponding score s_m^i by multiplying it by the penalty term P(∆t m→n) and obtain the resulting score s_m^i′ that is used further for the fusion. We summarize the algorithm of transition model inclusion for NPSVM in Table 4.1.
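This genuine/impostor decision can be sketched as a likelihood ratio test (Equation 4.8) on top of two fitted score GMMs. The following Python fragment is illustrative only; the score distributions, their single-component GMMs and the priors are synthetic assumptions rather than values from the FR experiment, which is evaluated in MATLAB.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)

# synthetic matcher scores: genuine scores tend to be high, impostor scores low
genuine_scores = rng.normal(0.8, 0.08, 400).reshape(-1, 1)
impostor_scores = rng.normal(0.45, 0.10, 1600).reshape(-1, 1)

gmm_gen = GaussianMixture(n_components=1, random_state=0).fit(genuine_scores)
gmm_imp = GaussianMixture(n_components=1, random_state=0).fit(impostor_scores)

# priors from the relative class frequencies in the training data
p_gen = len(genuine_scores) / (len(genuine_scores) + len(impostor_scores))
p_imp = 1.0 - p_gen

def is_impostor(s):
    """Likelihood ratio test of Equation 4.8: True if the score is judged an impostor."""
    lam = np.exp(gmm_gen.score_samples([[s]])[0] - gmm_imp.score_samples([[s]])[0])
    return lam < p_imp / p_gen

for s in (0.40, 0.62, 0.85):
    print(s, "impostor" if is_impostor(s) else "genuine")
```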

Table 4.1: The algorithm of transition model inclusion for NPSVM

Step 0.

Transition time values collection. Collect the time differences ∆t m→n = t n − t m between the detections of the same person at FR system m and FR system n, based on the annotated detections. Do this for each camera pair m and n, so as to obtain a collection of transition time values ∆t m→n for every camera pair.

Step 1.

Transition time distribution estimation. Based on the collected transition time values ∆t m→n, estimate the corresponding distributions pm→n(∆t m→n) using a GMM (Formula 4.2), with the BIC used to determine the optimal number of Gaussian components (Formula 4.4).

Step 2.

Transition penalty estimation. The following steps have to be done in the operation (or testing) mode. Based on the obtained transition model at Step 1, determine the transition penalty value P(∆t m→n ) according to Equation 4.5. If the transition penalty value P(∆t m→n ) is less than 1 do further steps. Otherwise terminate the procedure (no penalization required).

Step 3.

Learning class conditional distributions. For every FR system m learn class conditional distributions p(s|genuine) and p(s|impostor) for genuine and impostor classes, respectively. Use GMM to model these distributions and employ an EM algorithm for determining the parameters of the model.

Step 4.

Deciding for wrong detection. According to the Likelihood ratio test (Equation 4.8) decide if the scores under consideration belong to the impostor class.

Step 5.

Penalization. Penalize the "suspected" score s_m^i (if it belongs to the impostor class) by multiplying it by the penalty term P(∆t m→n) and obtain the resulting score s_m^i′ that is used further for fusion.

4.3 Chapter Summary

In this chapter we proposed a method that extends the NPSVM formulation to utilize additional information about human temporal behavior. We described the algorithm for obtaining the transition model and the transition penalty.


5 Bootstrapping Aggregation for Imbalanced Classes

Bootstrap aggregation methods can be successfully applied to combat class imbalance issues in different practical scenarios (see Chapter 3). In this chapter we address class imbalance problems by introducing two novel bagging-based methods that both utilize the RU concept in their formulation. First, we provide the motivation for using bagging together with RU by demonstrating its advantages. After that, we formulate and discuss the two proposed methods.

5.1 Random Undersampling for Class Imbalance

In Chapter 3 we discussed the properties of RU and concluded that it is not a very reliable strategy, as it discards potentially important information. At the same time, it allows reducing the bias of the classifier induced by class imbalance. To demonstrate this, we performed several runs of the RU algorithm and learned the corresponding SVMs. In Figure 5.1a one can observe a decision boundary that was induced from the imbalanced dataset (black solid curve). The black dashed curve corresponds to the ideal decision boundary that would be learned from a balanced dataset. The bias between the ideal and the learned decision boundaries can be decreased or completely removed using the RU approach. Figure 5.1b shows violet dotted curves that correspond to decision boundaries obtained after different RU runs. It is clearly noticeable that the bias becomes smaller. However, the variance of the RU method is high, which means that this method alone cannot be applied to eliminate class imbalance issues. At the same time, RU is an attractive method due to its training set size reduction properties. We

(a) Learned (black solid curve) and ideal (black dashed curve) decision boundaries of the SVM classifier. (b) Learned decision boundaries (violet dotted curves) of the SVM classifier after several RU runs.

Figure 5.1: RU is a high variance strategy

hypothesized that by applying some variance-reduction technique to RU one could make this method more robust and use it successfully in practical applications. To demonstrate that the cause of the problem is the high variance, the following experiment was conducted. For two synthetic datasets we performed the RU approach K = 100 times, induced K SVM classifiers from the resulting sets and then averaged the posterior probabilities over the whole decision space according to

q̄(i, j) = (1/K) Σ_{k=1}^{K} qk(i, j),    (5.1)

where qk(i, j) is the posterior probability of the k-th SVM for the point (i, j) in the 2-d decision space. Figure 5.2 illustrates these averaged spaces for the two synthetic datasets. We can clearly observe that RU is a high variance technique, since the decision boundary varies strongly over several RU runs. The thickness of the non-red region reflects the amount of variance that the decision boundary has.

(a) Averaged decision space for pattern 1

(b) Averaged decision space for pattern 2

Figure 5.2: Averaged posterior probabilities for K = 100 SVM classifiers using RU

We are able to observe, however, that the ideal decision boundary lies inside the dark blue strip, which is in turn in the middle of the non-red region. If we could decrease the variance of RU, we would probably obtain a decision boundary that is close to the ideal one. From the statistical concept of bias and variance [28], it is known that the error of an estimator x consists of two components:

Error(x) = Bias²(x) + Variance(x).    (5.2)

Usually, bias and variance are the terms that we want to minimize, since we want to obtain the model that leads to the smallest error. As considered above, the bias caused by the class imbalance can be decreased by performing RU. However, according to the bias-variance trade-off principle [28], when one of the two terms decreases, the other one usually increases, so the variance of the model after RU becomes higher. At the same time, it is known that averaging reduces the variance without influencing the bias [7], i.e.

Var(x̄) = Var(x) / N.    (5.3)

So, by having several independent models of the learning algorithm, one can theoretically achieve a decrease in variance by a factor of N. This idea motivates us to employ the variance-reduction ensemble method called bootstrap aggregating (bagging). However, we propose a variation of bagging that is able to effectively account for the class imbalance issue. We describe it in the next two sections.

5.2 Proposed Methods

The traditional bagging technique creates K base learners by sampling K times from the original dataset. This method can reduce the variance of the SVMs, but not by much, as SVMs have a built-in variance reduction mechanism. The biggest disadvantage of this method is that it does not allow reducing the bias caused by the class imbalance issue, while RU is able to do that. According to [65], these considerations can be formulated as follows. Let us assume that the minority class examples and the majority class examples are drawn from the independent distributions P and G, respectively. Since the distribution P is highly underrepresented, the learned decision boundary w_learned will be skewed towards the minority class samples. At the same time, the objective of the learning procedure is to find an optimal decision boundary w_opt that minimizes the loss with respect to P and G, defined as

w_opt = argmin_w L(w),

where L is the loss, i.e.

L(w) = λ_FN ∫_{R−^w} P(x) dx + λ_FP ∫_{R+^w} G(x) dx,

where λ_FN and λ_FP are the costs for False Negatives and False Positives, respectively, and R−^w and R+^w are the decision regions of the two distributions. To account for the effect of the class imbalance, we introduce a variable ε < 0.5 that indicates the minority class prevalence. Then, for datasets drawn from P and G, the expected empirical loss is

E[L(w)] = ε λ_FN ∫_{R−^w} P(x) dx + (1 − ε) λ_FP ∫_{R+^w} G(x) dx.    (5.4)

For the imbalanced data set S the empirical loss is then

L_S(w) = λ_FN |{x | x ∈ Smin ∧ x ∈ R−^w}| + λ_FP |{x | x ∈ Smaj ∧ x ∈ R+^w}|.    (5.5)

The decision boundary w_learned that minimizes the empirical loss is biased, which can be expressed by it having a smaller positive decision region, i.e.

R+^{w_learned} < R+^{w_opt}.    (5.6)

We can expect that the classifier with w_learned is biased when

ε λ_FN ∫_{R−^w} P(x) dx > (1 − ε) λ_FP ∫_{R+^w} G(x) dx.    (5.7)

Thus, there are three main components that can induce a biased decision boundary: the distributions, the misclassification costs and the weight ε. Let us assume that we have equal costs for both types of errors, i.e. λ_FN = λ_FP = 1. To reduce the probability that the decision boundary will be biased, one needs to get rid of ε. We can do this only if we balance the dataset S so that |Smaj| = |Smin|, which can be achieved by removing instances from Smaj, e.g. by RU. In the end, only the hidden distributions P and G can contribute to a biased separator, i.e.

∫_{R−^w} P(x) dx > ∫_{R+^w} G(x) dx.    (5.8)

5.2.1 Bagging Weighted Balanced Combiner

Coming back to the bagging concept, we can say that to account for the bias, RU must be performed at every k-th sampling iteration, so that every k-th sample is balanced. Let us consider this procedure in detail according to Figure 5.3. In the first step, the training dataset S is partitioned into the minority and majority class subsets Smin and Smaj, respectively.

Figure 5.3: Exactly balanced bagging procedure

Then, the bootstrapping procedure is performed. The original bootstrapping assumes equal sampling probabilities for every data sample, i.e.

p_i = 1 / |Smaj|,  ∀ i = 1, . . . , |Smaj|.

The sampling is performed with replacement K times from Smaj and K realizations S^k_bootstr are obtained. The replacement allows a sample to be drawn several times. The size of each bootstrap sample is equal to the size of the minority class subset Smin, i.e. |S^k_bootstr| = |Smin|. In the next step, every bootstrap sample S^k_bootstr is combined with the minority class subset Smin to form K new balanced training sets S^k_balanced (i.e. S^k_bootstr ∪ Smin = S^k_balanced, k = 1, . . . , K). After that, K individual SVM classifiers (base learners) are induced from these K sets, resulting in K decision functions f_k. These K decision functions can be evaluated on a separate, previously unseen validation dataset Svalid to determine the combiner variables; if the combination rule does not require such variables, this step can be skipped. In the last step, in the testing mode, the K outputs of the base learners are combined to obtain a final decision as

f(x) = sgn( Σ_{k=1}^{K} w_k · f_k(x) ).    (5.9)

The main drawback of the scheme described above is that it does not consider the class-specific performance of each base learner, which means that each individual classifier contributes equally to the final result. To address this issue, the WMV combiner could be used; however, it accounts only for the overall accuracy, which is not a proper measure when the class imbalance issue is present. Motivated by that, we propose to assign to every individual classifier a weight w_k based on the class-specific accuracies rather than the overall accuracy. For an L-class problem and the k-th individual learner, the weight w_k is calculated as

1 / w_k = (1/L) · ( 1/acc_k^(1) + 1/acc_k^(2) + . . . + 1/acc_k^(L) ),    (5.10)

where acc_k^(l), l ∈ {1, . . . , L}, is the class-specific accuracy for the l-th class. These weights are determined based on the validation set Svalid. For the two-class problems treated in this thesis, the weight w_k is assigned as

w_k = 2 · ( acc_k^(−) · acc_k^(+) ) / ( acc_k^(−) + acc_k^(+) ),    (5.11)

where acc_k^(−) and acc_k^(+) are the accuracies on the majority and minority classes, respectively. As a result, a base learner that does well on both classes gets a high weight and its output carries a high contribution to the final decision. We call the proposed combination technique, together with the underlying bagging approach, the Bagging Weighted Balanced Combiner (BWBC). The main drawback of BWBC lies in the fact that the accuracy of the resulting classifier can drop when the imbalance degree is extremely high. This can be explained by the heavy information loss of the RU procedure, which undersamples the majority class examples down to the level of the minority class examples. The resulting classifier can therefore have a rather high G-mean value (discovering both classes well), but at the same time a decreased accuracy, as the decision regions become not very specific.
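A compact sketch of the BWBC training loop is given below: each base SVM is trained on the full minority class plus an equally sized bootstrap of the majority class, its weight follows Equation 5.11, and the final decision follows Equation 5.9. It is an illustrative Python/scikit-learn sketch (the thesis implementation is in MATLAB with LIBSVM), and the dataset, the number of base learners K = 25 and the validation split are assumptions made for the example.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)

# toy imbalanced data: 600 majority (-1) vs. 40 minority (+1) samples, plus validation data
X_maj = rng.normal(0.0, 1.0, size=(600, 2))
X_min = rng.normal(1.8, 1.0, size=(40, 2))
X_val = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)), rng.normal(1.8, 1.0, size=(20, 2))])
y_val = np.hstack([-np.ones(200), np.ones(20)])

K = 25
learners, weights = [], []
for _ in range(K):
    idx = rng.choice(len(X_maj), size=len(X_min), replace=True)   # |S_bootstr| = |S_min|
    X_bal = np.vstack([X_maj[idx], X_min])
    y_bal = np.hstack([-np.ones(len(X_min)), np.ones(len(X_min))])
    clf = SVC(kernel="rbf", C=1.0).fit(X_bal, y_bal)

    pred = clf.predict(X_val)                     # class-specific accuracies -> Equation 5.11
    acc_neg = np.mean(pred[y_val == -1] == -1)
    acc_pos = np.mean(pred[y_val == 1] == 1)
    weights.append(2 * acc_neg * acc_pos / (acc_neg + acc_pos + 1e-12))
    learners.append(clf)

def bwbc_predict(x):
    """Final decision of Equation 5.9: sign of the weighted sum of base outputs."""
    votes = np.array([clf.predict(x.reshape(1, -1))[0] for clf in learners])
    return np.sign(np.sum(np.array(weights) * votes))

print(bwbc_predict(np.array([2.0, 2.0])), bwbc_predict(np.array([-1.0, -1.0])))
```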

5.2.2 Adaptive Bagging Weighted Balanced Combiner

To leverage the advantages of BWBC, but at the same time be able to control the accuracy, we propose another classifier ensemble method called Adaptive BWBC (ABWBC). There are two main differences between BWBC and ABWBC. Firstly, ABWBC considers a more complex sampling procedure, where the examples have different probabilities of being picked. Secondly, ABWBC performs bootstrapping iteratively, ensuring the ability to control the accuracy. The structure of ABWBC is depicted in Figure 5.4. Here, the bootstrapping procedure and the SVM training are done sequentially. The main idea of ABWBC is that the current k-th bootstrap sample determines the sampling level of the next (k+1)-th bootstrap sample, so that the undersampling degree is not constant for every bootstrapping procedure but varies depending on the accuracy of the classifier induced on the k-th bootstrap sample. To establish a reference accuracy ACC, we train a standard SVM classifier and obtain its accuracy (ACC) on the test dataset. Since the original dataset is skewed, ACC is usually high, as the classifier typically discovers all negative cases but only a few of the positive ones. Having the reference accuracy ACC, we set the constraint on the accuracy of every base learner as

ACC^(k) ≥ (1 − T/100) · ACC.    (5.12)

That means that we are willing to accept a base learner accuracy that is at most T% below the original accuracy ACC. T is a parameter that determines how much accuracy (in %) we allow each individual learner to lose. In principle, T can be considered as a trade-off between the G-mean and accuracy measures of each base classifier. The higher it is, the more loss in accuracy we are able to tolerate; if we set it to a small value, we require the resulting overall accuracy to be close to the reference one. However, the undersampling level of the next (k+1)-th bootstrap sample depends not only on the accuracy of the current k-th sample; it also depends on the current undersampling level u^(k) and a random number r^(k). The next undersampling level is therefore determined as

u^(k+1) = υ1 · a^(k) + υ2 · u^(k) + υ3 · r^(k),    (5.13)

Figure 5.4: Proposed bagging approach

where υi, i = 1, . . . , 3, are constants that set the relative importance of every component in Equation 5.13, r^(k) is a random number between 0 and 1, and a^(k) is a term that shows how close the current accuracy is to the reference one, defined as

a^(k) = min( 1, ACC^(k) / ((1 − T/100) · ACC) ).    (5.14)

The random term in Equation 5.13 is necessary since it prevents the algorithm from getting stuck in the case of small a^(k) and u^(k) values. As stated earlier, in ABWBC not every instance has the same probability of being picked. This idea is motivated by the fact that particular samples (e.g. noisy samples) do not improve the classification performance and therefore we do not need to pick many of them during bootstrapping. On the contrary, noisy samples decrease the SVM performance, since they usually create an overlap between the classes. That is why ABWBC employs a sampling technique that assigns a particular probability of being picked to every majority class instance. The algorithm that assigns these probabilities is based on the kNN principle (Figure 5.5). For every majority class instance, the number of nearest neighbors of the same class is calculated, i.e.

wi = #{xk ∈ kNN(xi) : xk ∈ Smaj} / Knn,  i = 1, . . . , |Smaj|,    (5.15)

Figure 5.5: kNN principle for determining the sampling probability

where Knn is a user-specified constant that is usually determined based on the rule of thumb

Knn = sqrt( |Smaj| / 2 ).

Finally, the probability for the sample i to be picked is

pi = wi / Σ_{j=1}^{|Smaj|} wj,  ∀ i = 1, . . . , |Smaj|.    (5.16)

According to this algorithm, noisy examples have a lower probability of being picked than non-noisy ones: if the majority class example i has many nearest neighbors of other classes, it is most probably a noisy example. We call the proposed sampling technique the kNN Weighted Sampling Scheme (KWSS). Table 5.1 summarizes the ABWBC algorithm in detail. The disadvantage of ABWBC compared to BWBC is that the sampling and learning processes cannot be run in parallel, since each sampling operation depends on the previous one. This means that the learning time of the ABWBC method will be higher than that of BWBC. However, thanks to the undersampling concept, the training procedure can still be done rather fast, as the number of instances after RU is much lower than initially. Training can even be faster than for some non-ensemble methods like SMOTE when the imbalance level is high.
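The two ABWBC ingredients that differ from plain bagging, the KWSS sampling probabilities (Equations 5.15-5.16) and the adaptive undersampling level update (Equations 5.13-5.14), can be sketched as follows. This is an illustrative Python/NumPy sketch rather than the MATLAB implementation; the constants υi = 1/3 and T = 10 are the defaults listed in Table 5.1, while the data and the tie-breaking details of the neighbor search are assumptions.

```python
import numpy as np

def kwss_probabilities(X_maj, X_min):
    """Sampling probabilities of Equations 5.15-5.16: down-weight noisy majority samples."""
    k_nn = max(1, int(np.sqrt(len(X_maj) / 2)))        # rule-of-thumb Knn
    X_all = np.vstack([X_maj, X_min])
    is_maj = np.arange(len(X_all)) < len(X_maj)
    w = np.empty(len(X_maj))
    for i in range(len(X_maj)):
        d = np.linalg.norm(X_all - X_maj[i], axis=1)
        d[i] = np.inf                                   # exclude the sample itself
        nn = np.argsort(d)[:k_nn]
        w[i] = np.mean(is_maj[nn])                      # fraction of same-class neighbors
    return w / w.sum()

def next_undersampling_level(acc_k, u_k, acc_ref, T=10, v=(1/3, 1/3, 1/3), rng=None):
    """Adaptive level update of Equations 5.13-5.14."""
    rng = rng or np.random.default_rng()
    a_k = min(1.0, acc_k / ((1 - T / 100) * acc_ref))
    return v[0] * a_k + v[1] * u_k + v[2] * rng.random()

rng = np.random.default_rng(6)
X_maj = rng.normal(0.0, 1.0, size=(300, 2))
X_min = rng.normal(2.0, 1.0, size=(30, 2))
p = kwss_probabilities(X_maj, X_min)
idx = rng.choice(len(X_maj), size=len(X_min), replace=True, p=p)   # KWSS bootstrap draw
print("next level:", next_undersampling_level(acc_k=0.88, u_k=1.0, acc_ref=0.95, rng=rng))
```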

5.3 Chapter Summary

In this chapter, we discussed the advantages of the RU technique and showed how it can be used in bagging. We proposed two bagging approaches that can be applied to datasets of different imbalance levels. We discussed the strengths and weaknesses of both proposed methods and provided recommendations on when they should be applied.


Table 5.1: The ABWBC algorithm

Step 0.

Initialization. Train an SVM N_MC times on the original dataset S, obtain the accuracy value for each of the N_MC SVMs, average over the N_MC Monte Carlo iterations and calculate the averaged accuracy ACC, which is considered to be the reference accuracy. Choose
• the weights υi, i = 1, . . . , 3, so that Σ_{i=1}^{3} υi = 1 (by default υi = 1/3, i = 1, . . . , 3),
• the accuracy tolerance value T, T ∈ [0, 100] (by default T = 10),
• the number of base learners K (by default K = 100).
Set the initial target undersampling level u^(1) = 1.

Step 1.

Bootstrapping and learning. For k = 1 : K:
• sample according to the kNN Weighted Sampling Scheme from the subset Smaj and obtain a sample S^(k)_maj so that |Smin| / |S^(k)_maj| = u^(k);

• train the SVM on the joint set Smin ∪ S^(k)_maj, explore the model parameters via grid search and pick the model with the decision function f^(k) that provides the largest G-mean value; take the corresponding accuracy value ACC^(k);
• calculate the value a^(k) = min( 1, ACC^(k) / ((1 − T/100) · ACC) ), which shows how close the current accuracy ACC^(k) is to the reference ACC;
• calculate the undersampling level for the next iteration, u^(k+1) = υ1 · a^(k) + υ2 · u^(k) + υ3 · r^(k), where r^(k) is a random number between 0 and 1.

Step 2.

Assigning weights. Calculate the class-specific accuracies acc_k^(−) and acc_k^(+) on the validation set and assign a weight w_k to every base learner as w_k = 2 · acc_k^(−) · acc_k^(+) / (acc_k^(−) + acc_k^(+)).

Step 3.

SVM combining. In the operation (or testing) mode, combine the K outputs to obtain a final decision as f(x) = sgn( Σ_{k=1}^{K} w_k · f_k(x) ).


6 Experiment

In this thesis we conduct two sets of experiments. The first part of this chapter describes the practical implementation of the NPSVM concept in an FR scenario, and the second part deals with class imbalance problems. To conduct all the experiments described in this thesis, we used a MATLAB programming environment (MATLAB and Statistics Toolbox Release 2014a, The MathWorks, Inc., Natick, Massachusetts, United States). An open source library, LIBSVM 3.18, was used to implement the SVM learning and testing algorithms. LIBSVM [12] is an optimized, integrated software tool written in C++ for SVM classification, with different extensions that allow various types of classification problems to be solved. It can also be bound to other programming languages, including MATLAB, as was done in this thesis to carry out the experiments. Throughout the experiments, we used an RBF kernel, choosing the best model via 5-fold cross-validation over different combinations of the SVM hyperparameters.
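The model selection step mentioned here (RBF kernel, 5-fold cross-validation over the SVM hyperparameters) can be sketched as follows. This is an illustrative Python/scikit-learn sketch of an equivalent grid search; the thesis runs the same procedure in MATLAB through the LIBSVM bindings, and the toy data and grid values below are assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(1.5, 1, (200, 2))])
y = np.hstack([-np.ones(200), np.ones(200)])

# 5-fold cross-validated grid search over the RBF-SVM hyperparameters C and gamma
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```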

6.1 Data Fusion in Face Recognition Scenario Using NPSVM Method

In this section, we conduct an FR data fusion experiment implementing NPSVM in a real-world application, as well as its extended formulation incorporating a motion model. We describe the experimental set-up and environment and formulate the solution for the given problem. We discuss the benefits of the proposed data fusion method and accompany them with evaluation results.

6.1.1 Experiment Description

In order to implement and test the NPSVM data fusion approach for biometric applications, a real-world FR experiment was conducted inside the AGT office building in Darmstadt, Germany. For that, four high-resolution FR cameras were installed at different positions (Figure 6.1). There were 27 people enrolled in this experiment, whose individual biometric templates (face images) were captured beforehand. These people walked a certain predefined trajectory from one FR camera to the other (dashed lines in the figure) and could move from one camera to another along different paths. There were 4 trials in total, in which different conditions were varied, i.e. back-light conditions, illumination, walking speed, appearing in front of the cameras not alone but in groups, and eating and drinking while walking. The experiment lasted one hour and resulted in 5500 FR detections. These detections were annotated and split into genuine and impostor classes. Table 6.1 summarizes the key details of the experiment.

Table 6.1: Key details of FR experiment

  Number of people enrolled          27
  Number of cameras                   4
  Number of trials                    4
  Total number of FR detections    5500

Figure 6.1: An office building where FR biometric data was acquired. The blue arrows show the locations of the cameras

6.1.2 Problem Formulation

Consider M FR recognition systems, each of them having an index m = 1, . . . , M. By an FR system we understand a system that can record and detect faces in the surrounding environment and report whether a face belongs to a known person with a particular name. Consider Figure 6.2, which shows a typical FR system. An image sensor (CCD, CMOS or other type) records images with a predefined frame rate, forming a video stream. Each frame (a biometric sample) is processed by a special third-party FR software (FRS). The FRS searches for faces in the biometric sample. If a face is detected, it extracts the relevant features. At the Face Matching step, the features of a biometric sample are compared against the templates stored in the Template Storage, producing similarity scores for every stored template.

Figure 6.2: A typical FR system

The similarity scores show the degree of coincidence between the features of the biometric sample and the stored templates: the higher the similarity score, the higher the degree of fit. Usually, the template that has the highest score is considered to match the face under investigation. However, the conditions for every deployed FR system are different. Some of the cameras have poor lighting, a too narrow or too wide field of view, etc. That is why the FR rate for each of the cameras is

different and sometimes very low. The main issue of each single FR system is therefore that it does not provide a high identification accuracy, resulting in high FP and FN rates. To overcome this, we combine the outputs of all FR cameras, i.e. we use the fusion of FR detections from multiple cameras, and thereby increase the performance of the whole monitoring system. In this thesis, we perform matching-score-level fusion of multiple FR systems.
Let us consider the operational principle of the fusion-based FR system. When an FR system detects a person in an image, it issues an output consisting of a score for every enrolled person in the database, forming a column vector of matching scores s_m = [s_m^1, ..., s_m^I]^T, where s_m^i is the score for the i-th person in the enrollment list and i ∈ {1, ..., I} is the unique person identifier (ID). Along with the score vector, the FR system outputs the time stamp t_k > 0 corresponding to the time at which the detection was captured, where k is the index of the detection. It is possible that a person passes the same camera several times, so that m_u = m_v for some u, v = 1, ..., k. The goal of the fusion-based FR system is to fuse the scores provided by multiple FR systems during a predefined time interval. Such a system is depicted in Figure 6.3.

Figure 6.3: FR fusion system

Here, an FR system detects a person and performs the FR process along with the matching score calculation. A detection buffer keeps the detections (score vectors s_m along with their corresponding time stamps t_k) at the intermediate step. The data association block joins all detections of the same person into a single group, forming a score matrix S_k^i = [s_{m_1,1}, ..., s_{m_k,k}] that contains k vectors of matching scores. The data fusion and decision making blocks, which are treated jointly in this thesis, solve the verification problem given a score matrix S_k^i. For the verification problem the question to be answered is "Is this person i?". The corresponding ID set is I = {0, 1}, where 0 means "this is not person i" (impostor) and 1 corresponds to "this is person i" (genuine). The goal of the data fusion and decision making blocks is to determine a decision function π that, given a score matrix S_k^i, outputs a decision π(S_k^i) ∈ I = {0, 1}. We are therefore interested in a binary answer, which can also be expressed as a posterior probability of i being the considered person: given a threshold η, one obtains the answer 0 or 1 depending on the posterior probability value.
The data association block operates based on particular parameters. It produces the score matrices S_k^i for every person i in the enrollment list. The score vector s_{m_k,k} is concatenated to the score matrix S_{k-1}^i only if the following conditions are satisfied (a minimal sketch of this rule is given after the list):
• the matching score value for person i is the highest one (or one of the two highest) among the others, and it is above the threshold β;
• the detection was created such that t_k − t_1 ≤ T_w, where t_1 is the time stamp of the first detection in the score matrix S_{k-1}^i and T_w is a time window.
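The helper below is a minimal, hypothetical sketch of this association rule (its name and interface are assumptions); the case in which the time window is violated and a new score matrix is opened is omitted for brevity.

```matlab
% Minimal sketch of the data association rule described above: a detection
% (score vector s over the I enrolled persons, with time stamp t) is appended
% to the open score matrix S of person i only if the score for i is among the
% two highest scores, exceeds beta, and falls into the time window Tw counted
% from the first detection of the group.
function [S, T] = associate_detection(S, T, s, t, i, beta, Tw)
    % S: I x k matrix of already associated score vectors, T: 1 x k time stamps
    sorted    = sort(s, 'descend');
    isTopTwo  = s(i) >= sorted(min(2, numel(s)));
    aboveBeta = s(i) > beta;
    inWindow  = isempty(T) || (t - T(1) <= Tw);
    if isTopTwo && aboveBeta && inWindow
        S = [S, s(:)];          % append the detection to the group
        T = [T, t];
    end
end
```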

In practice, a score matrix S_k^i is produced for every person in the enrollment list. However, if a particular person has never been recognized, the corresponding score matrix remains empty and is not considered further. At the same time, several score matrices can potentially be produced for a particular person. The following example illustrates the data association procedure for two persons with IDs 2 and 4.

S_5^2 = [s_{1,1}, s_{1,2}, s_{4,3}, s_{2,4}, s_{1,5}],    S_1^2 = [s_{2,1}],    S_3^4 = [s_{2,1}, s_{2,2}, s_{1,3}]

Note that for the person with ID 2 two score matrices are created. The first score matrix contains 5 FR detections; the 6th detection does not satisfy the time window condition, which is why it is used to start a new score matrix. The main assumption of the data association block is that each score matrix S_k^i contains only FR detections of person i. However, due to the limitations of FR systems mentioned in the introduction, it may often happen that a detection of person j is erroneously added to the score matrix of person i (an error of the FR system). The data fusion algorithm has to discover such cases and produce a reasonable output, depending on the task we want to solve. We formulate two modes:
1. All i. When a score matrix S_k^i contains only FR detections with the ID i, we assign the ground truth of this score matrix to be i. If at least one of the detections does not belong to person i, we assign the ground truth to be 0 (Figure 6.4).

Figure 6.4: Graphical explanation of the setting "all i". (a) A person with ID 7 was detected correctly in all of the detections and associated to one group; the ground truth for this group is 7. (b) A person with ID 7 was detected in all of the detections, but one of them is incorrect; the ground truth for this group is 0.

2. At least one i. We assign the ground truth to be i when at least one of the detections within S_k^i belongs to person i; otherwise we assign 0 (Figure 6.5).

Figure 6.5: Graphical explanation of the setting "at least one i". (a) A person with ID 7 was detected correctly in one out of three detections; the ground truth for this group is 7. (b) A person with ID 7 was wrongly detected in all of the detections; the ground truth for this group is 0.

6.1.3 Solution using NPSVM Approach
To solve the considered problem, we use the 2ν-SVM formulation described in Chapter 2. Mapping the SVM function f(·) to the final decision function π of the problem formulated above requires extracting from every column of S_k^i the score value for person i, forming a row vector of length k, and then merging the scores that are produced by the same camera. This implies that the score matrix S_k^i is transformed into a vector x_M^i of dimension M, so that it contains only the scores for a particular person i coming from unique sensors. Moreover, if S_k^i does not contain detections from a particular FR camera m, the corresponding position in the vector x_M^i is filled with a zero value. We define the fusion function π as

π(S_k^i) = f(z(S_k^i)),    (6.1)

where the function z(·) merges the vectors s_{m_k,k} coming from the same camera, extracts the scores for person i and fills in zeros when there is no detection from sensor m. In the end, the SVM therefore operates on vectors x_M that are always of fixed length. There are several ways to merge scores coming from the same sensor; in this thesis, we merge them according to the max rule, i.e. we keep the score that has the maximum value among the others. For the SVM, the decision function is then

f(x_M) = sgn\left( \sum_{i=1}^{n} \alpha_i y_i K(x_i, x_M) + b \right),    (6.2)

where x_M is an unseen example that we want to test.
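The following is a minimal sketch of one possible implementation of z(·) under the max rule; the helper name, the camera-index vector m and the calling convention are assumptions made for illustration. The resulting fixed-length vector is what is fed into the trained SVM decision function (6.2).

```matlab
% Minimal sketch of the mapping z(.): the score matrix S (I x k, columns are
% detections) with camera indices m (1 x k) is reduced to a fixed-length
% vector x of dimension M holding, for person i, the maximum score per camera
% (max rule) and zeros for cameras without detections.
function x = merge_scores_max(S, m, i, M)
    x = zeros(1, M);
    for cam = 1:M
        cols = (m == cam);              % detections coming from this camera
        if any(cols)
            x(cam) = max(S(i, cols));   % max rule over repeated detections
        end
    end
end
```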

To increase the performance of the FR system by introducing a motion model, we have to penalize the scores of an FR detection vector s_m if the time spent for the transition between cameras is not feasible. To this end, we propose to modify the score matrix S_k^i in the following way: if there is a "suspected" score vector s_{m_j,j} inside the matrix S_k^i, we multiply this vector by the penalty term P(Δt_{m→n}), so that we obtain an updated, penalized score vector s'_{m_j,j}. This can be written as

s'_{m_j,j} = P(Δt_{m→n}) · s_{m_j,j}.    (6.3)

The updated score vector s'_{m_j,j} is then used inside the score matrix S_k^i in place of the original one.

6.1.4 Performance Evaluation
To evaluate the implemented data fusion algorithm and its extended formulation with the proposed transition model concept, we perform several tests and compare the proposed methods with a simple max fusion method. Max fusion is treated as the baseline algorithm here. It uses the maximum score among the FR detections for a particular person i as the posterior probability for person i, i.e.

π(S_k^i) = max_k s_k^i.

Firstly, we test the performance of the "All i" mode. This means that for assigning the ground truth ID to a score matrix S_k^i we ask the question "Is person i present in all of the detections?". In other words, we check whether the score for person i in each score vector s_m is one of the two largest ones. The results of this experiment are presented in Figure 6.6.

Figure 6.6: Evaluation measures for the "All i" mode. (a) Accuracy of the algorithms vs. the detection threshold. (b) G-mean of the algorithms vs. the detection threshold.

Secondly, all three methods were tested in the "At least one i" mode. For determining the ground truth ID of a score matrix S_k^i we ask the question "Is person i present in at least one of the detections?". This is a less strict condition compared to the "All i" mode. Figure 6.7 shows the results of this experiment.

Figure 6.7: Evaluation measures for the "At least one i" mode. (a) Accuracy of the algorithms vs. the detection threshold. (b) G-mean of the algorithms vs. the detection threshold.

Analyzing the figures, we can observe that for both modes the accuracy and G-mean values increase as the detection threshold β increases. This is quite intuitive: by increasing the threshold value we filter out more and more noisy and ambiguous detections, leaving only reliable ones. Moreover, for low threshold values the number of negative samples is much higher than the number of positive ones. The class imbalance problem becomes apparent here, forcing the decision boundary of the SVM to be skewed toward the positive class samples, which leads to a drop in G-mean. We can also notice that the motion model extension yields an evident increase in the G-mean measure and also in accuracy. This can be explained by the fact that the motion model concept helps to discover more negative cases, especially when the detection threshold β is low and such cases dominate. By gaining in true negatives, we obtain an overall increase in the performance of the data fusion algorithm. It is also worth mentioning that the baseline max fusion method is always outperformed by the two proposed fusion algorithms.

6.1.5 Performance Evaluation on NIST Dataset
To get an overview of the performance of the NPSVM approach on publicly available data, we have chosen a dataset from the National Institute of Standards and Technology (NIST) called NIST-BSSR1. NIST-BSSR1 is widely used by many researchers as a benchmark for multibiometric fusion (see e.g. [56, 60]) and contains raw matching scores of two face and two fingerprint recognition systems for 517 people. The dataset thus provides a 517 × 517 score matrix for each system, i.e. 517 genuine scores and 517 × 516 impostor scores per system. From them, we built feature vectors x_i ∈ R^4, where each dimension contains the score from one particular recognition system. We evaluated the NIST-BSSR1 dataset and averaged the results over 5-fold cross-validation runs nested in N_MC = 100 Monte Carlo iterations. We performed the evaluation for both the NPSVM and the max fusion method. Max fusion is treated as the baseline method here and makes its decision based on the best score among all recognition systems. The results are presented in Table 6.2.

Table 6.2: Evaluation results on NIST-BSSR1 dataset
            NPSVM             Max Fusion
Accuracy    0.9957 ± 0.0040   0.5784
G-mean      0.9955 ± 0.0050   0.5743

According to the results, the NPSVM method completely outperforms the baseline max fusion approach and provides both high G-mean and high accuracy values. Moreover, the variance of NPSVM is extremely low. To conclude these experiments, we can say that the obtained evaluation results prove the high operational performance of the proposed fusion algorithm and confirm the effectiveness of the motion model concept. However, there is still a potential class imbalance issue that limits the performance. In the next section, we address the class imbalance problem independently of the application and propose a novel method to overcome the loss of performance due to this problem.
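As a rough reference for how such numbers are obtained, the evaluation protocol (k-fold cross validation nested in N_MC Monte Carlo iterations, reporting the mean and standard deviation of accuracy and G-mean) can be sketched as follows. The feature matrix X, the label vector y and the SVM hyperparameters are placeholders, not the tuned values behind Table 6.2.

```matlab
% Minimal sketch of the evaluation protocol: 5-fold CV nested in N_MC Monte
% Carlo iterations. X is the n x 4 matrix of matching scores, y the n x 1
% label vector (+1 genuine, -1 impostor), both assumed to be loaded.
N_MC = 100; K = 5;
acc = zeros(N_MC, K); gmean = zeros(N_MC, K);
for r = 1:N_MC
    cv = cvpartition(numel(y), 'KFold', K);        % random fold assignment
    for f = 1:K
        tr = training(cv, f); te = test(cv, f);
        model = svmtrain(y(tr), X(tr, :), '-s 0 -t 2 -c 1 -g 1 -q');  % placeholder C, gamma
        pred  = svmpredict(y(te), X(te, :), model);
        yte = y(te);
        tp = sum(pred ==  1 & yte ==  1); fn = sum(pred == -1 & yte ==  1);
        tn = sum(pred == -1 & yte == -1); fp = sum(pred ==  1 & yte == -1);
        acc(r, f)   = (tp + tn) / numel(pred);
        gmean(r, f) = sqrt((tp / (tp + fn)) * (tn / (tn + fp)));
    end
end
fprintf('Accuracy: %.4f +/- %.4f\n', mean(acc(:)),   std(acc(:)));
fprintf('G-mean:   %.4f +/- %.4f\n', mean(gmean(:)), std(gmean(:)));
```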

6.2 Studying Class Imbalance Problem
In this section, we study class imbalance issues, conducting several experiments to verify their influence on SVMs; we also implement several state-of-the-art methods and test them on real-world and synthetic datasets. In addition, we propose our own method that can effectively deal with class imbalance problems. It is important to mention that this part of the thesis is fully generic, and the results apply not only to the FR Darmstadt experiment but to a wide range of real-world scenarios.

6.2.1 Benchmark Data
For our practical experiments we use several datasets from the UCI Machine Learning Repository [3]. These datasets were selected to meet the following requirements:
• they are obtained from real-world scenarios and applications;
• they exhibit class imbalance of different degrees;
• they have different numbers of features;
• they contain enough objects to perform the classification task.
Table 6.3 gives a general overview of these datasets.

Table 6.3: Description of UCI datasets used in experiments
Dataset    Number of objects   Number of attributes   Description
Haberman   306 (81/225)        3                      Predicts whether the patient survived 5 years or longer after breast cancer surgery.
Cmc        1473 (333/1140)     9                      Predicts the contraceptive method choice of a woman based on her demographic and socio-economic characteristics.
Pima       768 (268/500)       8                      Predicts the presence or absence of diabetes for Pima Indian women over 21 years old.
Biodeg     1055 (356/699)      41                     Classifies chemicals into ready and not ready biodegradable [53].
Bupa       345 (145/200)       7                      Predicts liver disorders arising from excessive alcohol consumption, based on blood tests.

Besides the UCI datasets, we use the NIST biometric matching score dataset (NIST-BSSR1) described in the previous section. Finally, we also use several datasets that were obtained from the FR Darmstadt experiment.
For our experiments and studies we additionally created two synthetic datasets with different patterns (Figure 6.8) of dimensionality d = 2, drawn from different distributions. Pattern 1 is depicted in Figure 6.8a and pattern 2 in Figure 6.8b. Red instances represent the positive (minority) class, while blue ones represent the negative (majority) class. We are able to control the degree of imbalance of both synthetic datasets, as well as the number of training and testing examples. The range of values that the instances can take is s_i ∈ [0, 1]^2.
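A minimal sketch of how a two-dimensional dataset in [0, 1]^2 with a controllable imbalance ratio can be generated is shown below; the two Gaussian clusters are purely illustrative and do not reproduce the exact patterns 1 and 2 of Figure 6.8.

```matlab
% Minimal sketch (not the exact generator behind Figure 6.8) of creating a
% two-dimensional synthetic dataset in [0,1]^2 with a controllable imbalance.
function [X, y] = make_imbalanced_data(nMaj, imbalance)
    % imbalance = (# minority) / (# majority), e.g. 0.1 for a 1:10 ratio
    nMin = round(imbalance * nMaj);
    Xmaj = 0.05 * randn(nMaj, 2) + repmat([0.35 0.50], nMaj, 1);  % majority cluster
    Xmin = 0.05 * randn(nMin, 2) + repmat([0.65 0.50], nMin, 1);  % minority cluster
    X = [Xmaj; Xmin];
    X = min(max(X, 0), 1);                 % clip instances into [0,1]^2
    y = [-ones(nMaj, 1); ones(nMin, 1)];   % +1 = positive (minority) class
end
```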

Figure 6.8: Patterns of the synthetic datasets used in the experiments. (a) Pattern 1. (b) Pattern 2.

6.2.2 Performance Measure for Imbalanced Dataset
As mentioned in Chapter 3, accuracy is not a suitable metric in the presence of a class imbalance problem; it is, however, fully representative for balanced classification problems. To demonstrate this, we conducted experiments using our synthetic dataset (pattern 1) and the original SVM.
In the first experiment we explored the relation between accuracy, G-mean, F1-measure, the two class-specific accuracies (marked as Accuracy + and Accuracy −) and different class imbalance levels. We varied the class imbalance ratio from 7:10 to 1:10 with a step size of 1:10. The simulation results are shown in Figure 6.9.

Figure 6.9: Behaviour of the evaluation metrics for different degrees of imbalance

Analyzing Figure 6.9, one can notice that as the imbalance level increases (note that, e.g., an imbalance level of 0.1 is higher than 0.2), the class-specific accuracy of the minority class (Accuracy +) decreases quite rapidly. The same is true for the F1-measure and the G-mean. At the same time, the overall accuracy increases, and so does the class-specific accuracy of the majority class. This behavior can be explained by the fact that the minority class samples do not contribute much to the overall accuracy, whereas Accuracy − has a large contribution to the overall accuracy, since it represents the examples that dominate the dataset. This experiment confirms that the overall accuracy is not a suitable measure when a class imbalance issue is present, while the G-mean and F1-measure nicely reflect the classifier performance in this case.
In the second experiment we measured the same metrics as before, but on balanced datasets of different sizes. We varied the size of the datasets from 500 to 2000 objects with a step size of 250, keeping the imbalance ratio always at 1:1. Figure 6.10 illustrates the results.

Figure 6.10: Behaviour of the evaluation metrics for different sizes of the training set, but the same degree of imbalance

All the metrics stay approximately constant independently of the dataset size. This means that when a dataset is balanced, accuracy can be used as a measure of classifier performance. Usually, most machine learning algorithms assume an absence of the class imbalance problem and are therefore often evaluated with accuracy.
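For reference, the measures used in this section can be computed from the confusion-matrix counts as in the following sketch (a hypothetical helper, not part of the thesis code); labels are assumed to be +1 for the positive (minority) class and −1 for the negative (majority) class.

```matlab
% Minimal sketch of the evaluation measures computed from true labels y and
% predicted labels yhat (both in {-1,+1}).
function [acc, gmean, f1, accPos, accNeg] = imbalance_metrics(y, yhat)
    tp = sum(yhat ==  1 & y ==  1);
    fn = sum(yhat == -1 & y ==  1);
    tn = sum(yhat == -1 & y == -1);
    fp = sum(yhat ==  1 & y == -1);
    acc    = (tp + tn) / numel(y);           % overall accuracy
    accPos = tp / (tp + fn);                  % class-specific accuracy (+), recall
    accNeg = tn / (tn + fp);                  % class-specific accuracy (-)
    gmean  = sqrt(accPos * accNeg);           % geometric mean of the two
    precision = tp / (tp + fp);
    f1 = 2 * precision * accPos / (precision + accPos);   % F1-measure
end
```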

6.2.3 Condition for Class Imbalance
In the literature considered in Chapter 3, the class imbalance issue is treated as solely responsible for the decrease in classification performance. However, the classification results on the NIST dataset (see Section 6.1.5) suggest the opposite: even with the significant imbalance of the NIST dataset, classification is performed with extremely high accuracy and G-mean. This observation motivates the suspicion that class overlap is responsible for the performance degradation of the classifiers, and that when the classes overlap, class imbalance makes the problem even more severe. To verify this, we conducted several experiments.
In the first experiment we explored the dependence of the F1-measure on datasets with different imbalance levels for a non-overlapping synthetic pattern and an overlapping one (the original pattern 1). We measured the performance of the original SVM while changing the imbalance level from 1:1 to 1:10 with a step size of 1:10. In the second experiment we performed the same test, but operating on the NIST dataset. To create a number of non-overlapping datasets of different imbalance degrees, we sampled from the original NIST impostor (majority class) subset. To obtain robust F1-measure estimates, we averaged over 5-fold cross-validation runs nested in a loop of N_MC = 50 Monte Carlo iterations. To create overlapping datasets, we added Gaussian noise to the data instances. The results are depicted in Figure 6.11. We can observe a rapid decrease of the F1-measure when the classes overlap, while for the non-overlapping case the F1-measure stays almost constant independently of the imbalance degree.

Figure 6.11: SVM performance on the NIST overlapping and non-overlapping datasets of different imbalance levels. (a) Non-overlapping classes. (b) Overlapping classes.

In the third experiment we studied the influence of the degree of class overlap on the performance of the SVM. For that, based on the synthetic dataset (pattern 2), we created a set of datasets with different distances between the class clusters (Figure 6.12a). We varied the distance between the class clusters from 1 (no overlap, classes far away from each other) to 6 (noticeable overlap) and learned an SVM from each of these datasets. We performed these operations for four datasets with imbalance levels 1:1, 0.75:1, 0.5:1 and 0.25:1. The results of this experiment are presented in Figure 6.13. While performing relatively well when there is no class imbalance, the SVM drops in performance when the imbalance level is high and the degree of overlap increases.

Figure 6.12: Synthetic dataset used to study the influence of overlap. (a) A synthetic dataset with no overlap. (b) Resulting decision boundary.

Figure 6.13: G-mean of the SVM for different imbalance levels and degrees of overlap

To conclude this set of experiments, we can state that the effect of class imbalance is strongly coupled with the degree of class overlap: the distance between the classes is the key factor that affects the classifier performance. So, when dealing with a strongly imbalanced dataset, one needs to take care of the regions where the classes overlap. Firstly, one needs to understand the nature of the overlap, which can have different causes: noise, lack of information, or cases where the prior probabilities of the two classes are comparable in some regions. Secondly, having understood the cause of the overlap, one needs either to use advanced noise cleaning techniques while preprocessing the data or to measure more parameters in order to obtain richer information about the problem.

6.2.4 Applying SMOTE to FR Darmstadt Dataset
In this subsection, we conduct an experiment by applying the state-of-the-art balancing technique SMOTE to the problem formulated in the previous section. As stated before, to solve the FR data fusion problem within the NP framework, one has to have control over the operating point on the ROC curve, and we showed that this problem can be solved using 2ν-SVMs. In this experiment, we provide evaluation results on the FR Darmstadt dataset by analyzing the ROC curve. We compare two methods, the original SVM and the SMOTE-balanced SVM.
A general way to compare two algorithms is to use the AUC measure. For that, a ROC curve has to be evaluated for each of the algorithms, using the posterior probabilities of the classifiers. The usual way to predict a class is to set the discrimination threshold η to 0.5; however, by changing η one is able to change the class predictions. The pair of values (FP rate, TP rate) corresponding to each threshold η is plotted, forming a ROC curve. The ideal classifier is the one that has FP rate = 0 and TP rate = 1, so the closer the curve is to the upper left corner, the better the classifier. However, it is difficult to compare classifiers unless the ROC curve of one of them lies entirely above the ROC curve of the other. To do the comparison in a reasonable way, the AUC is evaluated, which is derived by integrating the TP rate over the full range of the FP rate.
Figure 6.14 shows the ROC curves of NPSVM and NPSVM+SMOTE for the FR Darmstadt scenario dataset (detection threshold β = 0.10). The AUC of each method is presented in Table 6.4. The AUC values show that SMOTE performs better than the original SVM.

Figure 6.14: ROC curves for NPSVM and NPSVM+SMOTE for the FR Darmstadt dataset (detection threshold β = 0.10)

Table 6.4: The AUC measure for the tested methods
       NPSVM   NPSVM + SMOTE
AUC    0.836   0.899

However, we are not really interested in the average performance of the algorithms. The most important observation that can be drawn from the ROC curves is that the curve of the SMOTE algorithm lies almost always above the curve of the baseline SVM. This makes it possible to achieve a higher data fusion performance of the NPSVM algorithm for different false alarm rates α.
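The construction of the ROC curve and the AUC described above can be sketched as follows (a hypothetical helper, assuming posterior probabilities p for the positive class and labels y in {−1, +1}); the AUC is obtained by trapezoidal integration of the TP rate over the FP rate.

```matlab
% Minimal sketch: ROC points obtained by sweeping the decision threshold eta
% over the posterior probabilities, and AUC via the trapezoidal rule.
function [fpr, tpr, auc] = roc_from_posteriors(y, p)
    thresholds = sort(unique([0; p(:); 1]), 'descend');    % sweep eta from 1 to 0
    nPos = sum(y == 1); nNeg = sum(y == -1);
    fpr = zeros(numel(thresholds), 1); tpr = zeros(numel(thresholds), 1);
    for t = 1:numel(thresholds)
        yhat = 2 * (p(:) >= thresholds(t)) - 1;             % decide +1 if p >= eta
        tpr(t) = sum(yhat == 1 & y(:) ==  1) / nPos;         % hit rate
        fpr(t) = sum(yhat == 1 & y(:) == -1) / nNeg;         % false alarm rate
    end
    auc = trapz(fpr, tpr);                                   % area under the curve
end
```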

6.2.5 Studying the Redundancy of SMOTE
In Chapter 3 we considered the state-of-the-art balancing method SMOTE in detail. As can be observed from the previous experiment, this technique is very effective, especially when the class imbalance level is not too large. However, SMOTE imposes two assumptions on the data:
• the space between the data samples of one class belongs to the same class;
• the neighborhood of a sample of one class also belongs to this class.
Usually these assumptions are satisfied, but when the learning problem is too complex, it may happen that SMOTE is not applicable for balancing the dataset. This is the first reason why it would be advantageous to have a universal balancing method that does not impose any assumptions on the dataset. The second reason is that the algorithm does not specify how many nearest neighbors one needs to consider; usually this number is set to 5, but there is no rule or heuristic for it. A further weakness of SMOTE is that its original formulation dictates introducing as many synthetic samples as are needed to balance the dataset completely. This brings a huge overhead and leads to strong redundancy. For example, if the imbalance level of a dataset is around 1:1000 (which is typical for rare disease classification), one needs to introduce thousands of new samples. For many learning algorithms, including SVMs, this can be very critical and can lead to a heavy increase in training time.
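For reference, the SMOTE generation step as introduced by Chawla et al. [14] can be sketched as follows; this simplified helper (name and interface are assumptions, not the implementation used in the thesis) places synthetic points on the line segments between each minority sample and its k nearest minority neighbors.

```matlab
% Minimal sketch of SMOTE oversampling of the minority class.
function Xsyn = smote_oversample(Xmin, oversamplePercent, k)
    % Xmin: nMin x d minority samples, oversamplePercent: e.g. 200 (%), k: # neighbours
    nMin = size(Xmin, 1);
    nPerSample = round(oversamplePercent / 100);
    D = squareform(pdist(Xmin));              % pairwise distances within the minority class
    D(1:nMin+1:end) = inf;                    % exclude the sample itself
    Xsyn = zeros(nMin * nPerSample, size(Xmin, 2));
    c = 0;
    for i = 1:nMin
        [~, order] = sort(D(i, :));
        neighbours = order(1:min(k, nMin-1)); % k nearest minority neighbours
        for j = 1:nPerSample
            nn  = neighbours(randi(numel(neighbours)));
            gap = rand();                     % random point on the segment
            c = c + 1;
            Xsyn(c, :) = Xmin(i, :) + gap * (Xmin(nn, :) - Xmin(i, :));
        end
    end
end
```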

To prove that SMOTE is an expensive strategy and introduces redundant information, we conducted a set of experiments. In the first one we measured the G-mean value for the synthetic datasets (pattern 1 and pattern 2) while varying the oversampling level with a step size of 100%; the number of nearest neighbors K was set to 5. Figure 6.15 illustrates the results of this experiment. We also conducted the same experiment on the real-world datasets described above. To obtain stable results, we averaged the evaluated values over 5-fold cross-validation runs and N_MC = 50 Monte Carlo iterations; the results are presented in Figure 6.16.

Figure 6.15: A study on the oversampling level of the SMOTE algorithm. (a) G-mean vs. oversampling level of SMOTE for synthetic pattern 1. (b) G-mean vs. oversampling level of SMOTE for synthetic pattern 2.

Figure 6.16: A study on the oversampling level of the SMOTE algorithm on real-world datasets

In general, we can see that for most of the synthetic and real-world datasets a stable G-mean estimate can already be obtained with 100%-200% SMOTE oversampling. However, some datasets require more synthetic samples to eliminate the class imbalance issue. Moreover, the bupa dataset shows better performance without any SMOTE oversampling than with it. In fact, there is no clear way to determine how much oversampling each dataset needs. At the same time, after applying SMOTE, the complexity of the SVM learning algorithm becomes

O((|S_min| · (1 + |E_SMOTE|) + |S_maj|)^3),

where |E_SMOTE| is the magnitude of the oversampling ratio. Moreover, SMOTE introduces synthetic examples in the complex region between the two classes, which makes classification even more difficult and requires more time for the SVM optimization problem to converge. That is why SMOTE calls for a technique that controls the oversampling level with respect to the imbalance degree; this problem could be addressed in future research. An alternative to SMOTE oversampling could, in principle, be RU, which is more attractive for SVMs.

6.2.6 Proposed Methods Evaluation

To evaluate the performance of the proposed methods, we conducted experiments on the synthetic datasets and compared BWBC and ABWBC with the original SVM and SMOTE. We measured the G-mean value and accuracy of each method for datasets with different imbalance degrees and averaged the results over N_MC = 100 Monte Carlo iterations. The results are depicted in Figure 6.17.

Figure 6.17: Evaluation results for the original SVM, SMOTE, BWBC and ABWBC on the synthetic datasets. (a) G-mean of the original SVM. (b) Accuracy of the original SVM. (c) G-mean of the SVM with SMOTE. (d) Accuracy of the SVM with SMOTE. (e) G-mean of the BWBC. (f) Accuracy of the BWBC. (g) G-mean of the ABWBC. (h) Accuracy of the ABWBC.

The results show that with the BWBC approach one can obtain quite high G-mean and accuracy measures even for high degrees of class imbalance. However, we can also observe that as the class imbalance level gets higher, the variance of the G-mean increases. Moreover, while the G-mean preserves a quite high value, the accuracy measure tends to decrease at high imbalance degrees. It would be advantageous not to sacrifice accuracy in order to provide a high G-mean, but instead to have a means of controlling the trade-off between accuracy and G-mean. A positive observation regarding BWBC is that it provides a more stable G-mean measure than the state-of-the-art SMOTE algorithm: the variance of the G-mean of BWBC is smaller than that of SMOTE, thanks to the variance-reduction nature of the bagging procedure. In contrast, SMOTE has strong randomization effects due to the essence of the method (random placement of the synthetic examples). We can also observe that the original SVM completely fails for a high imbalance degree. Despite the fact that its accuracy is always high and even increases for high imbalance levels, the G-mean rapidly decreases and eventually goes to zero, which indicates that the classifier assigns the negative class to the whole decision space. For a moderate class imbalance degree, ABWBC behaves approximately identically to BWBC. For large imbalance levels, however, ABWBC demonstrates its adaptive nature: by changing the undersampling value adaptively, ABWBC gains in accuracy while losing only slightly in G-mean. In contrast, the original SVM and SMOTE do not have this kind of trade-off control and thus show a noticeable loss in G-mean as the imbalance level gets higher.
To determine the optimal number of weak learners K for BWBC that is necessary to create an ensemble with high operational performance, we performed an experiment in which we measured the G-mean for different numbers of bootstrap samples that form the base classifiers. For this, we chose K ∈ {1, 10, 100, 1000, 10000} and used both the synthetic and the real-world datasets, averaging the result over N_MC = 100 Monte Carlo iterations. Figure 6.18 presents the results on a log scale.
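To make the role of the ensemble size K concrete, the following sketch shows the generic exactly balanced bagging skeleton that the proposed combiners build upon. It is not BWBC or ABWBC themselves (their weighted combiners and the adaptive undersampling are defined in Chapter 5); the simple vote used here, the helper name and the SVM parameters are assumptions for illustration.

```matlab
% Generic exactly balanced bagging skeleton: each base SVM is trained on the
% full minority class plus a random undersample of the majority class, and the
% base outputs are combined by a plain majority vote.
function yhat = balanced_bagging_predict(Xtr, ytr, Xte, K)
    posIdx = find(ytr ==  1);                % minority class indices
    negIdx = find(ytr == -1);                % majority class indices
    votes = zeros(size(Xte, 1), 1);
    for k = 1:K
        sel = negIdx(randperm(numel(negIdx), numel(posIdx)));    % undersample majority
        Xk = [Xtr(posIdx, :); Xtr(sel, :)];
        yk = [ytr(posIdx);    ytr(sel)];
        model = svmtrain(yk, Xk, '-s 0 -t 2 -c 1 -g 1 -q');      % base SVM (placeholder params)
        pred  = svmpredict(zeros(size(Xte, 1), 1), Xte, model);  % predicted labels
        votes = votes + pred;
    end
    yhat = sign(votes);                       % majority vote over the K base learners
    yhat(yhat == 0) = 1;                      % break ties in favour of the minority class
end
```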

Figure 6.18: A study on the optimal number of base learners for BWBC. (a) G-mean versus the number of weak learners K on synthetic datasets of different imbalance degrees. (b) G-mean versus the number of weak learners K on real-world datasets from the UCI repository.

From the synthetic experiment we can see that the optimal number of bootstrap samples has an obvious correlation with the imbalance degree of the dataset: the higher the imbalance, the more samples are required to obtain a satisfactory performance. In general, K = 100 guarantees a converged result for all of the datasets. For most of the real-world datasets the optimal K is 10. For the pima dataset the G-mean stays almost constant independently of K, which indicates the possibility of applying just RU without creating an ensemble of classifiers. Based on these results, we choose K = 100.
We also performed the same experiment to determine the optimal number of bootstrapping iterations K for the ABWBC approach and compared it with the BWBC method. Figure 6.19 presents the results. ABWBC behaves similarly to BWBC and requires approximately the same K to obtain an optimal G-mean value.

Figure 6.19: A study on the optimal number of base learners for ABWBC. (a) G-mean versus the number of weak learners K on synthetic datasets of different imbalance degrees. (b) G-mean versus the number of weak learners K on real-world datasets from the UCI repository.

However, ABWBC achieves a higher G-mean with fewer iterations than BWBC, both on the synthetic and on the real-world datasets. This can be explained by the fact that ABWBC uses a more sophisticated sampling technique that yields a more stable result in every bootstrapping iteration. According to this experiment, we choose K = 100 for the proposed ABWBC method as well.
To test the proposed BWBC and ABWBC methods on real-life scenarios and compare them with the other state-of-the-art methods described in this thesis, we conducted classification experiments on the UCI datasets. As classification methods we used the SVM without imbalance fixing (denoted as SVM), the cost-sensitive extension of the SVM (Cost-SVM), RO, SMOTE, Borderline-SMOTE (B-SMOTE), the bagging approach with averaging (BAVC), the bagging approach with the naive Bayes combiner (BNBC), BNBC using only the best 20% of the base learners (BNBC20%), the bagging approach with the majority vote combiner (BMVC), and the bagging approach with the weighted majority vote combiner (BWMVC). For each dataset and classification approach we evaluated both the accuracy and the G-mean and averaged them over 1000 Monte Carlo iterations. For the ensemble methods we used K = 100 base learners. The results of these experiments are presented in Table 6.5.

Analysing the results, we can see that ABWBC outperforms all other methods in terms of the G-mean value while providing acceptable accuracy, losing only 1%-3%. At the same time, ABWBC gains significantly in G-mean on all of the datasets, including bupa, where all non-bagging methods show a decrease in G-mean compared to the SVM without imbalance fixing. In general, the ensemble methods also have a lower variance than the non-ensemble ones. In particular, BWBC and ABWBC have the lowest variance, while SMOTE, due to its strong randomization effect, is not as stable. BWBC and ABWBC take advantage of the variance reduction capacity inherent in bagging and preserve a high classification performance even on highly imbalanced data.

Table 6.5: Accuracy and G-mean for different methods evaluated on various datasets

Method     Metric  haberman        cmc             pima            biodeg          bupa
SVM        G       0.306 ± 0.069   0.320 ± 0.043   0.667 ± 0.086   0.801 ± 0.009   0.639 ± 0.019
           Acc     0.736 ± 0.010   0.778 ± 0.005   0.750 ± 0.006   0.843 ± 0.007   0.689 ± 0.014
Cost-SVM   G       0.062 ± 0.035   0.341 ± 0.024   0.054 ± 0.030   0.803 ± 0.008   0.413 ± 0.023
           Acc     0.730 ± 0.008   0.779 ± 0.004   0.651 ± 0.002   0.844 ± 0.006   0.632 ± 0.018
RO         G       0.549 ± 0.069   0.578 ± 0.045   0.695 ± 0.037   0.807 ± 0.027   0.485 ± 0.120
           Acc     0.686 ± 0.026   0.724 ± 0.021   0.712 ± 0.016   0.825 ± 0.018   0.591 ± 0.054
SMOTE      G       0.564 ± 0.056   0.601 ± 0.021   0.688 ± 0.049   0.813 ± 0.020   0.428 ± 0.121
           Acc     0.658 ± 0.032   0.713 ± 0.012   0.704 ± 0.025   0.831 ± 0.015   0.580 ± 0.050
B-SMOTE    G       0.535 ± 0.058   0.626 ± 0.046   0.689 ± 0.058   0.810 ± 0.031   0.456 ± 0.121
           Acc     0.718 ± 0.020   0.699 ± 0.031   0.697 ± 0.031   0.817 ± 0.020   0.574 ± 0.052
BAVC       G       0.623 ± 0.015   0.623 ± 0.009   0.710 ± 0.008   0.809 ± 0.062   0.658 ± 0.016
           Acc     0.682 ± 0.018   0.676 ± 0.008   0.706 ± 0.008   0.814 ± 0.006   0.672 ± 0.015
BMAX       G       0.458 ± 0.109   0.630 ± 0.047   0.697 ± 0.038   0.755 ± 0.070   0.538 ± 0.113
           Acc     0.522 ± 0.081   0.641 ± 0.036   0.698 ± 0.024   0.772 ± 0.042   0.615 ± 0.045
BNBC       G       0.593 ± 0.024   0.668 ± 0.008   0.710 ± 0.008   0.804 ± 0.006   0.656 ± 0.019
           Acc     0.690 ± 0.019   0.685 ± 0.007   0.710 ± 0.007   0.810 ± 0.006   0.670 ± 0.018
BNBC20%    G       0.612 ± 0.016   0.652 ± 0.022   0.703 ± 0.038   0.784 ± 0.039   0.565 ± 0.099
           Acc     0.646 ± 0.046   0.673 ± 0.020   0.711 ± 0.009   0.800 ± 0.016   0.622 ± 0.045
BMVC       G       0.474 ± 0.099   0.663 ± 0.058   0.009 ± 0.075   0.799 ± 0.006   0.655 ± 0.016
           Acc     0.731 ± 0.014   0.683 ± 0.008   0.710 ± 0.070   0.808 ± 0.006   0.670 ± 0.016
BWMVC      G       0.482 ± 0.114   0.643 ± 0.008   0.701 ± 0.008   0.803 ± 0.007   0.632 ± 0.024
           Acc     0.736 ± 0.013   0.699 ± 0.007   0.710 ± 0.008   0.812 ± 0.006   0.677 ± 0.016
BWBC       G       0.643 ± 0.015   0.672 ± 0.008   0.709 ± 0.008   0.800 ± 0.005   0.659 ± 0.015
           Acc     0.704 ± 0.014   0.687 ± 0.008   0.710 ± 0.007   0.812 ± 0.005   0.674 ± 0.014
ABWBC      G       0.616 ± 0.014   0.650 ± 0.009   0.727 ± 0.008   0.810 ± 0.004   0.660 ± 0.013
           Acc     0.727 ± 0.011   0.723 ± 0.006   0.730 ± 0.005   0.822 ± 0.006   0.683 ± 0.015

6.2.7 FR Darmstadt Imbalance Fixing
To finalize our experiments, we applied the proposed balancing methods to the FR data fusion scenario described at the beginning of this chapter. We mentioned before that the performance of the NPSVM data fusion algorithm decreases as the detection threshold β gets lower. Besides the noise, we encountered the class imbalance issue, which becomes more severe as β decreases; thus, by changing β one can change the class imbalance level of the training dataset. To perform the experiment, we obtained five datasets of different imbalance levels from the original FR dataset. Table 6.6 summarizes the details of these datasets. We solved the FR verification task for each of these datasets using the following four approaches:
• no imbalance fixing (original NPSVM method)
• the proposed ABWBC
• the proposed BWBC
• SMOTE

Table 6.6: The details of the FR imbalanced datasets
Experiment number   1      2      3      4      5
Threshold β         0.05   0.10   0.15   0.20   0.25
Imbalance level     0.28   0.31   0.67   0.72   0.81

The results are depicted in Figure 6.20, where accuracy and G-mean are plotted versus the imbalance level. We can observe that ABWBC and BWBC are able to provide a higher G-mean value than SMOTE and the baseline SVM. At the same time, thanks to its adaptive nature, ABWBC does not lose accuracy compared to BWBC. Both measures vary with the imbalance level, since the training sets differ for different values of β: as β increases, more and more detections are rejected. Nevertheless, the results provided by ABWBC and BWBC show that these methods can be successfully applied in practical applications to combat class imbalance issues.

Figure 6.20: Evaluation results of the original SVM, ABWBC and SMOTE for the FR Darmstadt scenario. (a) Accuracy. (b) G-mean.

6.3 Chapter Summary
In this chapter, we formulated the data fusion problem, applied it to the FR scenario, described the conducted experiments and evaluated the proposed NPSVM data fusion method. We also presented results for the extended formulation of the NPSVM approach that leverages knowledge about the temporal behavior of humans. Besides that, we studied class imbalance issues, evaluated state-of-the-art balancing methods and the two proposed bagging approaches, called BWBC and ABWBC, which showed promising results. We successfully applied the proposed ABWBC and BWBC to the FR experiment and obtained a noticeable increase in the performance of the NPSVM data fusion algorithm.


7 Conclusion and Outlook
In this thesis, we addressed two problems: data fusion in the NP framework and the class imbalance problem. Data fusion is an effective tool to increase the performance and reliability of different types of classification systems (e.g. biometric systems that use traits such as fingerprints, face, voice, iris and others). The class imbalance problem is a challenging issue that arises in many practical data fusion and classification applications where the classes are unequally represented.
In Chapter 2 we provided an introduction to SVM classification and presented a formulation of NPSVM based on the 2ν-SVM that allows the NP framework to be used on top of the SVM concept. We showed that by doing so one is able to use SVMs as a classification or data fusion method and at the same time have control over the operating point on the ROC curve. To assess the performance of NPSVM we conducted an FR experiment in an office environment and compared it with the max fusion method that is usually used in conventional biometric fusion systems. We demonstrated that NPSVM completely outperforms max fusion, providing high operational performance. In addition, in Chapter 4 we explained the possibility of increasing the performance of the NPSVM data fusion method by utilizing information about the temporal behavior of a person. We proposed to employ a motion model and derived from it penalties for unreliable scores. We showed that by utilizing the proposed motion model concept one can discover more false alarms and, in turn, increase the accuracy of the NPSVM data fusion method. For large detection thresholds the motion model brings almost no advantage, while for low detection thresholds we achieved a gain in accuracy of around 10%. Finally, we tested the NPSVM data fusion approach on the widely used NIST dataset and confirmed the high operational performance of NPSVM.
In Chapter 3 we considered the main effects of class imbalance on SVMs and provided an extensive overview of existing balancing approaches, including the state-of-the-art SMOTE approach and ensemble learning methods. In Chapter 5 we proposed two novel bagging-based approaches that are able to deal with class imbalance of different degrees. In Chapter 6 we experimentally explained the conditions for class imbalance issues and confirmed the high operational performance of the proposed bagging-based approaches BWBC and ABWBC compared to the implemented state-of-the-art methods, both on synthetic and on real-world datasets. We explored the performance of the BWBC and ABWBC methods depending on the number of base learners and found an optimal ensemble size for these methods. Finally, we tested the ABWBC approach on the FR Darmstadt dataset and showed that this method can also be applied to solve biometric fusion tasks such as FR.
Having an approach for estimating a motion model of moving targets, future research can focus on advancing the algorithm that determines the penalty term expressing the level of mistrust in the score of the classifier. This could be an algorithm based on the Hidden Markov Model (HMM) concept [4], where, e.g., for the FR scenario the states are the FR cameras. The detections produced by these cameras for a particular person yield the scores, which are, in turn, the observed outputs, dependent on the states. The state transition probabilities are determined from the motion model proposed in this thesis. Finally, we would be able to discover the most probable path and decide in favor of, or against, the feasibility of the observed scores.
The proposed ABWBC approach could also be modified along the lines of the simulated annealing (SA) algorithm [44], which is used to find the global optimum of a function. Based on SA, we could instead formulate a maximization problem and treat the accuracy as an internal energy. A state would then be an undersampling level, initialized with 1. A stopping criterion is reaching a stable, tolerable accuracy level that is defined beforehand with respect to a reference accuracy. A further improvement of the ABWBC method could be to extend the problem formulation from the existing binary setting to the multiclass case.



Bibliography
[1] R. Akbani, S. Kwek, and N. Japkowicz. Applying support vector machines to imbalanced datasets. Proc. of the 15th European Conference on Machine Learning (ECML 2004), pages 39–50, 2004.
[2] P. Aleksic and A. Katsaggelos. Audio-visual biometrics. Proceedings of the IEEE, 94(11):2025–2044, Nov 2006.
[3] K. Bache and M. Lichman. UCI machine learning repository [http://archive.ics.uci.edu/ml]. 2013.
[4] L. E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics, 37:1554–1563, 1966.
[5] S. Ben-Yacoub, Y. Abdeljaoued, and E. Mayoraz. Fusion of face and speech data for person identity verification. IEEE Transactions on Neural Networks, 10(5):1065–1074, Sep 1999.
[6] O. Bousquet, S. Boucheron, and G. Lugosi. Introduction to statistical learning theory. Advanced Lectures on Machine Learning, 3176:169–207, 2004.
[7] L. Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996.
[8] R. Brunelli and D. Falavigna. Person identification using multiple cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(10):955–966, Oct 1995.
[9] C. Bunkhumpornpat and S. Subpaiboonkit. Safe level graph for synthetic minority over-sampling techniques. Communications and Information Technologies (ISCIT), 2013 13th International Symposium, pages 570–575, 2013.
[10] C. J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2:121–167, 1998.
[11] P. K. Chan and S. J. Stolfo. Toward scalable learning with nonuniform class and cost distributions: A case study in credit card fraud detection. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining, pages 164–168, 1998.
[12] C.-C. Chang and C.-J. Lin. LIBSVM - a library for support vector machines. [http://www.csie.ntu.edu.tw/~cjlin/libsvm/], 2014.

[13] Chih-Chung Chang and Chih-Jen Lin. Training nu-support vector classifiers: Theory and algorithms. 2001. [14] N. V. Chawla, L. O. Hall, K. W. Bowyer, and W. P. Kegelmeyer. Smote: Synthetic minority oversampling technique. Journal of Artificial Intelligence Research, 16:321–357, 2002. [15] C. Chen, A. Liaw, and L. Breiman. Using random forest to learn imbalanced datag. Statistics Technical Reports, ID: 666. The University of California, 2004. [16] C.C. Chibelushi, F. Deravi, and J.S.D. Mason. A review of speech-based bimodal recognition. IEEE Transactions on Multimedia, 4(1):2–37, Mar 2002. 66

[17] R. Collobert, S. Bengio, and Y. Bengio. A parallel mixture of svms for very large scale problems. Neural Computation, 13:1105–1114, 2002. [18] C. Cortes. Prediction of generalisation ability in learning machines. PhD thesis, Department of Computer Science, University of Rochester., 1995. [19] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernelbased Learning Methods. Cambridge University Press, 2000. [20] Nello Cristianini, Jaz Kandola, Andre Elisseeff, and John Shawe-Taylor. On kernel-target alignment. pages 367–373, 2002. [21] J. Czyz, M. Sadeghi, J. Kittler, and L. Vandendorpe. Decision fusion for face authentication. In First International Conference on Biometric Authentication, 2004. [22] J.A. Dargham, A. Chekima, E. Moung, and S. Omatu. Data fusion for face recognition. in Distributed Computing and Artificial Intelligence, Advances in Intelligent and Soft Computing, 79:681–688, 2010. [23] M.A. Davenport, R.G. Baraniuk, and C.D. Scott. Tuning support vector machines for minimax and neyman-pearson classification. in IEEE Transactions on Pattern Analysis and Machine Intelligence, 32:1888–1898, 2010. [24] B. Efron and R.J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, 1993. [25] T. Evgenuiu and M. Pontil. Statistical Learning Theory: a Primer. 1998. [26] N.A. Fox, R. Gross, J.F. Cohn, and R.B. Reilly. Robust biometric person identification using automatic classifier fusion of speech, mouth, and face experts. IEEE Transactions on Multimedia, 9(4):701–714, June 2007. [27] V. Ganganwar. An overview of classification algorithms for imbalanced datasets. International Journal of Emerging Technology and Advanced Engineering, 2, April 2012. [28] P. Geurts. Bias vs Variance Decomposition for Regression and Classification in Data Mining and Knowledge Discovery Handbook, ed. O. Maimon and L. Rokach. Springer US, 2007. [29] E. A. Gustavo, P. A. Batista, C.P. Ronaldo, , and M. C. Monard. A study of the behavior of several methods for balancing machine learning training data. Sigkdd Explorations, 6:2004, 2004. [30] I. Guyon, B. Boser, and V. Vapnik. Automatic capacity tuning of very large vc-dimension classifiers. Advances in Neural Information Processing Systems, pages 147–155, 1993. [31] H. Han, W.-Y. Wang, and B.-H. Mao. Borderline-smote: A new over-sampling method in imbalanced data sets learning. Advances in Intelligent Computing. Lecture Notes in Computer Science, 3644:878– 887, 2005. [32] L.K. Hansen and P. Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12:993–1001, 1990. [33] S. Hido, H. Kashima, and Y. Takahashi. Roughly balanced bagging for imbalanced data. Statistical Analysis and Data Mining, 2(5-6):412–426, 2009. [34] L. Hong, A.K. Jain, and S. Pankanti. Can multibiometrics improve performance? in Proceedings of IEEE Workshop on Automatic Identification Advanced Technologies (AutoID), 99:59–64, 1999. 67

[35] X. Hong, S. Chen, and C. Harris. A kernel-based two-class classifier for imbalanced data sets. IEEE Transactions on Neural Networks, 18(1):28–41, Jan 2007. [36] A.K. Jain, A. Ross, and S. Prabhakar. An introduction to biometric recognition. IEEE Transactions on Circuits and Systems for Video Technology, 14(1):4–20, Jan 2004. [37] N. Japkowicz. The class imbalance problem: Significance and strategies. In the proceedings of International Conference on Artificial Intelligence (IC-AI), 1:111–117, 2000. [38] N. Japkowicz and S. Stephen. The class imbalance problem: a systematic study. Intelligent Data Analysis 6, pages 429–450, 2002. [39] T. Joachims. Text categorization with svm: Learning with many relevant features. Proceedings of ECML-98, 10th European Conference on Machine Learning, 1998. [40] C. Jong Myong. A selective sampling method for imbalanced data learning on support vector machines. Graduate Theses and Dissertations. Paper 11529, [http://lib.dr.iastate.edu/etd/11529], 2010. [41] J. Kandola and J. Shawe-taylor. Refining kernels for regression and uneven classification problems. 2003. [42] H. Kim, S. Pang, H. Je, D. Kim, and S. Bang. Pattern classification using support vector machine ensemble. Proc. of ICPR’02, 6:20160–20163, 2002. [43] H. Kima, H. Kimb, H. Moonc, and H. Ahnb. A weight-adjusted voting algorithm for ensembles of classifiers. Journal of the Korean Statistical Society, 40:437–449, 2011. [44] S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi. Optimization by simulated annealing. Science 13 May 1983: 220 (4598), 671-680. [DOI:10.1126/science.220.4598.671]. [45] M. Kubat, R.C. Holte, and S. Matwin. Machine learning for the detection of oil spills in satellite radar images. Machine Learning - Special issue on applications of machine learning and the knowledge discovery process, 30(2-3):195–215, 1988. [46] M. Kubat and S. Matwin. Addressing the curse of imbalanced training sets: One sided selection. Proceedings of the Fourteenth International Conference on Machine Learning, Nashville, Tennesse, pages 179–186, 1997. [47] M. Kubat and S. Matwin. Addressing the curse of imbalanced training sets: One-sided selection. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 179–186, 1997. [48] L.I. Kuncheva and J.J. Rodríguez. A weighted voting framework for classifiers ensembles. Knowledge and Information Systems, 38:259–275, 2014. [49] F. Lingenfelser, J. Wagner, and E. André. A systematic discussion of fusion techniques for multimodal affect recognition tasks. In Proceedings of the 13th international conference on multimodal interfaces (ICMI’11), pages 19–26, 2011. [50] X.-Y. Liu, J. Wu, and Z.-H. Zhou. Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 39:539–550, 2008. 68

[51] R. Longadge, S. S. Dongre, and L. Malik. Class imbalance problem in data mining: Review. International Journal of Computer Science and Network, 2, 2013. [52] T. Maciejewski and J. Stefanowski. Local neighbourhood extension of smote for mining imbalanced data. Computational Intelligence and Data Mining (CIDM), 2011 IEEE Symposium, pages 104–111, 2011. [53] K. Mansouri, T. Ringsted, D. Ballabio, R. Todeschini, and V. Consonni. Quantitative structure - activity relationship models for ready biodegradability of chemicals. Journal of Chemical Information and Modeling, 53:867–878, 2013. [54] G. J. McLachlan and T. Krishnan. The EM algorithm and extensions. Wiley, 1996. [55] S. Naganjaneyulu and K. M. Rao. A novel class imbalance learning using intelligent undersampling. International Journal of Database Theory and Application, 5:25, 2012. [56] K. Nandakumar, Y. Chen, S. Dass, and A. Jain. Likelihood ratio-based biometric score fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30:754–763, 2008. [57] J. Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, pages 61–74, 1999. [58] A. A. Ross, A. K. Jain, and K. Nandakumar. Handbook of Multibiometrics. Springer-Verlag New York, Inc., 2006. [59] B. Schölkopf, A.J. Smola, R. C. Williamson, and P. L. Bartlett. New support vector algorithms. 2000. [60] N. Sedgwick and C. Limited. Preliminary report on development and evaluation of multi-biometric fusion using the nist bssr1 517. Cambridge Algorithmica Linited. [61] N. Thai-Nghe, B. Andre, and S. Lars. Academic performance prediction by dealing with class imbalance. in the proceedings of 9th IEEE International Conference on Intelligent Systems Design and Applications, IEEE Computer Society (ISDA), pages 878–883, 2009. [62] S. Tong and E. Chang. Support vector machine active learning for image retrieval. Proceedings of ACM International Conference on Multimedia, pages 107–118, 2001. [63] V. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc., 1995. [64] K. Veropoulos, C. Campbell, and N. Cristianini. Controlling the sensitivity of support vector machines. Proceedings of the International Joint Conference on AI, pages 55–60, 1999. [65] B.C. Wallace, K. Small, C.E. Brodley, and T.A. Trikalinos. Class imbalance, redux. 2011 IEEE 11th International Conference on Data Mining (ICDM), pages 754–763, 2011. [66] G. Wu and E. Y. Chan. Class-boundary alignment for imbalanced dataset learning. 2003 ICML 2003 Workshop on Learning from Imbalanced Data Sets, pages 49–56, 2004. [67] G. Wu and E. Chang. Class-boundary alignment for imbalanced dataset learning. ICML 2003 Workshop on Learning from Imbalanced Data Sets II, Washington, DC., 2003. [68] A.M. Zoubir and D.R. Iskander. Bootstrap methods and applications. IEEE Signal Processing Magazine, 24(4):10–19, 2007. 69

[69] Y.A. Zuev and S.K. Ivanov. The voting as a way to increase the decision reliability. in Proc. Foundations of Information/Decision Fusion with Applications to Engineering Problems, pages 206–210, 1996.
