Off-Line Learning with Transductive Confidence Machines: an Empirical Evaluation

Stijn Vanderlooy∗        Laurens van der Maaten∗        Ida Sprinkhuizen-Kuyper†

∗ MICC-IKAT, Universiteit Maastricht, P.O. Box 616, 6200 MD Maastricht, The Netherlands, {s.vanderlooy, l.vandermaaten}@micc.unimaas.nl
† NICI, Radboud University Nijmegen, P.O. Box 9104, 6500 HE Nijmegen, The Netherlands, [email protected]

Contents

1 Introduction
2 Learning Setting
3 Transductive Confidence Machines
  3.1 Construction of Prediction Sets
  3.2 Calibration Property
  3.3 Implementations
    3.3.1 k-Nearest Neighbour
    3.3.2 Nearest Centroid
    3.3.3 Linear Discriminant
    3.3.4 Naive Bayes
    3.3.5 Kernel Perceptron
    3.3.6 Support Vector Machine
4 Experiments
  4.1 Benchmark Datasets
  4.2 Experimental Setup
  4.3 Results
5 Discussion
6 Conclusions
A Pseudo Codes
  A.1 TCM-kNN
  A.2 TCM-NC
  A.3 TCM-LDC
  A.4 TCM-NB
  A.5 TCM-KP
  A.6 TCM-SVM
B Results as Graphs
C Results as Tables

Abstract

The recently introduced transductive confidence machines (TCMs) framework makes it possible to extend classifiers such that they satisfy the calibration property. This means that the error rate can be set by the user prior to classification. An analytical proof of the calibration property was given for TCMs applied in the on-line learning setting. However, the nature of this learning setting restricts the applicability of TCMs. In this report, we provide strong empirical evidence that the calibration property also holds in the off-line learning setting. Our results extend the range of applications in which TCMs can be applied. We may conclude that TCMs are appropriate in virtually any application domain.

1 Introduction

Machine-learning classifiers are common in many real-life applications. Many of these applications are characterized by high error costs, indicating that incorrect classifications can have serious consequences. It is therefore desirable to have classifiers that output reliable classifications. One way to achieve this is to complement each classification with a confidence value. Classifications with a low confidence value are not reliable and should be handled with caution. For some classifiers (such as the naive Bayes classifier) a measure of confidence is readily available, but for many other classifiers this is not the case.

The recently introduced transductive confidence machines (TCMs) framework provides an efficient way to obtain confidence values for the classifications of virtually any classifier [8, 18]. The essential property of TCMs is that their error rate is controlled by the user prior to classification. For example, if the user specifies an error rate of 0.05, then at most 5% of the classifications made by a TCM are incorrect. This property is called the calibration property and has been proven to hold in the on-line learning setting. However, this learning setting restricts the applicability of TCMs. In this report, we investigate to what extent the calibration property holds in the off-line learning setting. We investigate this by means of a systematic empirical evaluation of TCMs using six different classifiers on various real-world datasets.

The remainder of the report is organized as follows. Section 2 defines the learning setting that we consider. Section 3 explains TCMs and the calibration property. It also provides implementations of six classifiers in the TCM framework. Section 4 investigates to what extent the calibration property holds in the off-line learning setting. Section 5 provides a final discussion on TCMs. Section 6 concludes that TCMs satisfy the calibration property in the off-line learning setting.

2 Learning Setting

We consider the supervised machine-learning setting. The instance space is denoted by X and the corresponding label space by Y. An example is of the form z = (x, y), where x ∈ X is the instance and y ∈ Y is the label. The symbol Z will be used as a compact notation for X × Y. Training data are a sequence of examples:

S = (x_1, y_1), . . . , (x_n, y_n) = z_1, . . . , z_n ,    (1)

where each example is generated by the same unknown probability distribution P over Z. We assume that this distribution satisfies the exchangeability assumption. This assumption states that the joint probability of a sequence of random variables is invariant under any permutation of the indices of these variables. In other words, the information that the z_i's provide is independent of the order in which they are collected. Formally, we write:

P(z_1, . . . , z_n) = P(z_{π(1)}, . . . , z_{π(n)}) ,    (2)

for all permutations π on the set {1, . . . , n}.^1

We apply a classifier in the off-line learning setting (batch setting): the classifier is learned on training data and subsequently used to classify instances one-by-one. The true labels of instances are not returned. This is in contrast to the on-line learning setting, where the true label of each instance is provided immediately after prediction. The classifier is then retrained after each prediction since new information is available. Clearly, the on-line learning setting restricts the applicability of classifiers since any form of feedback can be very expensive, or feedback is simply not available.

^1 Note that exchangeable random variables are identically distributed and not necessarily independent from each other. Therefore, identically and independently distributed (iid) random variables are also exchangeable. The exchangeability assumption is thus weaker (i.e., more general) than the iid assumption.

3 Transductive Confidence Machines

Traditionally, classifiers assign a single label to an instance. In contrast, transductive confidence machines (TCMs) are allowed to assign a set of labels to each instance. Such a prediction set contains multiple labels when there is uncertainty in the true label of the instance [8, 18]. The construction of prediction sets is explained in Subsection 3.1. Subsection 3.2 discusses the calibration property and Subsection 3.3 outlines six practical implementations of TCMs. For an extensive technical description of TCMs we refer to [18]. A detailed description of their connection with algorithmic randomness is presented in [17].

3.1 Construction of Prediction Sets

To construct a prediction set for an unlabeled instance x_{n+1}, TCMs operate in a transductive manner. Each possible label y ∈ Y is tried as a label for instance x_{n+1}. In each try we form the example z_{n+1} = (x_{n+1}, y) and add it to S. Then we measure how likely it is that the resulting sequence is generated by the underlying distribution P. To this end, each example in the extended sequence:

(x_1, y_1), . . . , (x_n, y_n), (x_{n+1}, y) = z_1, . . . , z_{n+1} ,    (3)

is assigned a nonconformity score by means of a nonconformity measure. This measure defines how nonconforming an example is with respect to other available examples. We require that it is irrelevant in which order the nonconformity scores of the examples are computed (due to the exchangeability assumption).

Definition 1. A nonconformity measure is a measurable mapping:

A : Z^(∗) × Z → R ∪ {∞} ,    (4)

with output indicating how nonconforming an example is with respect to other examples. The symbol Z^(∗) denotes the set of all bags of elements of Z. A bag is denoted by ⦃ · ⦄.

Definition 2. Given a sequence of examples z_1, . . . , z_{n+1} with n ≥ 1, the nonconformity score of example z_i (i = 1, . . . , n) is defined as:

α_i = A(⦃z_1, . . . , z_{i−1}, z_{i+1}, . . . , z_{n+1}⦄, z_i) ,    (5)

and the nonconformity score of example z_{n+1} is defined as:

α_{n+1} = A(⦃z_1, . . . , z_n⦄, z_{n+1}) .    (6)

To know how nonconforming the artificially created example z_{n+1} is in the extended sequence, the nonconformity score α_{n+1} is compared to all other α_i (i = 1, . . . , n).

Definition 3. Given a sequence of nonconformity scores α_1, . . . , α_{n+1} with n ≥ 1, the p-value of label y assigned to an unlabeled instance x_{n+1} is defined as:

p_y = |{i = 1, . . . , n + 1 : α_i ≥ α_{n+1}}| / (n + 1) .    (7)

If the p-value is close to its lower bound 1/(n + 1), then example z_{n+1} is very nonconforming. The closer the p-value is to its upper bound 1, the more conforming example z_{n+1} is. Hence, the p-value indicates how likely it is that the tried label for an unlabeled instance is in fact the true label. A TCM outputs the set of labels with p-values above a predefined significance level ε.

Definition 4. A transductive confidence machine determined by some nonconformity measure is a function that maps each sequence of examples z_1, . . . , z_n with n ≥ 1, unlabeled instance x_{n+1}, and significance level ε ∈ [0, 1] to the prediction set:

Γ^ε(z_1, . . . , z_n, x_{n+1}) = {y ∈ Y | p_y > ε} .    (8)

There may be situations in which many examples have a nonconformity score equal to the score of example z_{n+1}. The p-value is then large, but caution is needed since many examples are equally nonconforming, making it impossible to discriminate between them. To alleviate this problem, a randomized version of the p-value has been proposed [18, p. 27].

Definition 5. Given a sequence of nonconformity scores α_1, . . . , α_{n+1} with n ≥ 1, the randomized p-value of label y assigned to unlabeled instance x_{n+1} is defined as:

p^τ_y = ( |{i = 1, . . . , n + 1 : α_i > α_{n+1}}| + τ |{i = 1, . . . , n + 1 : α_i = α_{n+1}}| ) / (n + 1) ,    (9)

with τ a random number uniformly sampled from [0, 1] for instance x_{n+1}.

Definition 6. A randomized transductive confidence machine determined by some nonconformity measure is a function that maps each sequence of examples z_1, . . . , z_n with n ≥ 1, unlabeled instance x_{n+1}, uniformly distributed random number τ ∈ [0, 1], and significance level ε ∈ [0, 1] to the prediction set:

Γ^{ε,τ}(z_1, . . . , z_n, x_{n+1}) = {y ∈ Y | p^τ_y > ε} .    (10)

A randomized transductive confidence machine treats the borderline cases α_i = α_{n+1} more carefully. Instead of increasing the p-value by 1/(n + 1), the p-value is increased by a random amount between 0 and 1/(n + 1). In the following, we employ randomized TCMs, although for brevity we simply call them TCMs.
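To make Definitions 5 and 6 concrete, the following Python sketch is given for illustration only; it is not the implementation used in our experiments. It assumes NumPy and a user-supplied routine that returns the n + 1 nonconformity scores obtained when a label is tried.

import numpy as np

def randomized_p_value(alphas, tau):
    # alphas holds the n+1 nonconformity scores of Definition 2; the last entry
    # is the score of the tentatively labelled example z_{n+1}.
    a_new = alphas[-1]
    greater = np.sum(alphas > a_new)   # strictly more nonconforming examples
    equal = np.sum(alphas == a_new)    # borderline cases, weighted by tau (Eq. 9)
    return (greater + tau * equal) / len(alphas)

def prediction_set(scores_for_label, labels, epsilon, rng):
    # scores_for_label(y) must return the n+1 nonconformity scores obtained after
    # tentatively labelling the unlabeled instance with y (Eqs. 5 and 6).
    tau = rng.uniform()                # one tau per unlabeled instance (Definition 5)
    return {y for y in labels
            if randomized_p_value(np.asarray(scores_for_label(y)), tau) > epsilon}

Here rng can be, for example, numpy.random.default_rng(); setting epsilon to 0.05 yields the prediction set of Definition 6 at that significance level.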

3.2 Calibration Property

In the on-line learning setting, TCMs have been proven to satisfy the calibration property [18, p. 20-22 & p. 193]. This property states that the long-run error rate of a TCM with significance level ε equals ε:

lim sup_{n→∞} Err_n / n = ε ,    (11)

with Err_n the number of prediction sets that do not contain the true label, given the first n prediction sets.^2 The idea of the proof is to show that the sequence of prediction outcomes (i.e., whether the prediction set contains the true label or not) is a sequence of independent Bernoulli random variables with parameter ε. From Eq. 11 it follows that the significance level has a frequentist interpretation as the limiting frequency of errors. It allows the user to control the number of errors prior to classification. The calibration property holds regardless of which nonconformity measure is used.

In the off-line learning setting there theoretically exists a small probability that TCMs are not well-calibrated (since the training data are kept fixed, and hence the prediction outcomes are not independent) [18, p. 111]. In Section 4 we investigate empirically whether TCMs are well-calibrated in the off-line learning setting.

^2 In case of non-randomized TCMs, the equality sign in Eq. 11 is replaced by the ≤ sign.
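A minimal sketch of how calibration can be checked empirically (for illustration only; prediction_sets is assumed to be a list of label sets computed for a labelled test fold and true_labels the corresponding true labels):

def empirical_error_rate(prediction_sets, true_labels):
    # Fraction of prediction sets that do not contain the true label; under the
    # calibration property this fraction should stay close to the significance level.
    errors = sum(1 for s, y in zip(prediction_sets, true_labels) if y not in s)
    return errors / len(true_labels)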

3.3 Implementations

This subsection shows that virtually any classifier can be plugged into the TCM framework. Nonconformity measures are formulated for the following six classifiers: (1) k-nearest neighbour, (2) nearest centroid, (3) linear discriminant, (4) naive Bayes, (5) kernel perceptron, and (6) support vector machine. Although the nonconformity measures are based on specific classifier characteristics, they can readily be applied to similar classifiers. In addition, they provide clear insight into how to define new nonconformity measures.

The implementation of TCMs based on the linear discriminant, kernel perceptron, and support vector machine considers binary classification tasks. This is due to the nature of these classifiers. We denote the binary label space as Y = {−1, +1}. Extensions to multilabel learning are well known and therefore not discussed in this report. The pseudo codes in Appendix A present incremental learning and decremental unlearning TCMs, resulting in TCMs with reasonable time complexity.

3.3.1 k-Nearest Neighbour

The k-nearest neighbour classifier (k-NN) classifies an instance by means of a majority vote among the labels of its k nearest neighbours (k ≥ 1) [5]. An example is nonconforming when it is far from nearest neighbours with identical labels and close to nearest neighbours with different labels. A nonconformity measure can model this as follows. Given example z_i = (x_i, y_i), define an ascendingly ordered sequence D_i^{y_i} with distances from instance x_i to its k nearest neighbours with label y_i. Similarly, let D_i^{−y_i} contain ordered distances from instance x_i to its k nearest neighbours with a label different from y_i. The nonconformity score is then defined as:

α_i = Σ_{j=1}^{k} D_{ij}^{y_i} / Σ_{j=1}^{k} D_{ij}^{−y_i} ,    (12)

with subscript j representing the j-th element in a sequence [13]. Clearly, the nonconformity score is monotonically increasing when distances to the k nearest neighbours with identical label increase and/or distances to the k nearest neighbours with different label decrease.
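A direct, unoptimized computation of Eq. 12 can be sketched as follows (illustration only; the Euclidean distance is assumed and the function name is ours):

import numpy as np

def knn_nonconformity(X, y, i, k):
    # Nonconformity score of example (X[i], y[i]) according to Eq. 12.
    d = np.linalg.norm(X - X[i], axis=1)
    d[i] = np.inf                        # never use the example as its own neighbour
    same = np.sort(d[y == y[i]])[:k]     # k nearest neighbours with identical label
    other = np.sort(d[y != y[i]])[:k]    # k nearest neighbours with a different label
    return same.sum() / other.sum()

The bookkeeping that avoids recomputing all scores for every tried label is given in Algorithm 1 of Appendix A.1.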

3.3.2 Nearest Centroid

The nearest centroid classifier (NC) learns a Voronoi partition on the training data. It assumes that examples cluster around a class centroid. An example is nonconforming if it is far from the class centroid of its label and close to the class centroids of other labels. Therefore, the nonconformity score of example z_i = (x_i, y_i) can be defined as the distance from x_i to the class centroid of y_i relative to the minimum distance from x_i to all other class centroids [2]. Formally, we write:

α_i = d(µ_{y_i}, x_i) / min_{y ≠ y_i} d(µ_y, x_i) ,    (13)

with µ_y the class centroid of label y. The class centroid of label y is defined as:

µ_y = (1 / |C_y|) Σ_{i ∈ C_y} x_i ,    (14)

with C_y the set of indices of instances with label y.
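In the same spirit, Eqs. 13 and 14 can be sketched as follows (again an illustration only, with the Euclidean distance):

import numpy as np

def nc_nonconformity(X, y, i):
    # Nonconformity score of example (X[i], y[i]) according to Eq. 13: the distance to
    # the own class centroid relative to the distance to the closest other centroid.
    centroids = {c: X[y == c].mean(axis=0) for c in np.unique(y)}   # Eq. 14
    d_own = np.linalg.norm(X[i] - centroids[y[i]])
    d_other = min(np.linalg.norm(X[i] - centroids[c]) for c in centroids if c != y[i])
    return d_own / d_other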

3.3.3 Linear Discriminant

The (Fisher) linear discriminant classifier (LDC) learns a separating hyperplane by maximizing the between-class scatter of instances with different labels while minimizing the within-class scatter of instances with identical labels [7]. Instances close to the hyperplane are classified with lower confidence than the remaining instances since a small change in the hyperplane can result in a different classification of nearby instances. Therefore, a natural nonconformity score of example z_i = (x_i, y_i) is the signed perpendicular distance from x_i to the hyperplane:

α_i = −y_i (⟨w, x_i⟩ + b) ,    (15)

with w and b the normal vector and intercept of the hyperplane, and ⟨·, ·⟩ the inner product. If a classification is correct, then the nonconformity score is negative. Also, a larger distance to the hyperplane represents more confidence in a correct classification, and consequently a lower nonconformity score is obtained. If a classification is incorrect, then the nonconformity score is positive and monotonically increasing with larger perpendicular distances to the hyperplane.

3.3.4 Naive Bayes

The naive Bayes classifier (NB) is a probabilistic classifier that applies Bayes' theorem with independence assumptions [6]. A valid nonconformity score is large if the label of an instance is strange under the Bayesian model [18, p. 102]. We use the following as the nonconformity score of example z_i = (x_i, y_i):

α_i = 1 − P(y_i) ,    (16)

with P(y_i) the conditional probability of label y_i that is estimated from the training data and instance x_i, i.e., P(·) is the posterior label distribution computed by the naive Bayes classifier.^3

^3 It is tempting to believe that the probabilities P(·) are confidence values. However, it has been verified that these probabilities are overestimated in case of an incorrect prior, e.g., classifying with a probability of 0.7 does not mean that the true label is predicted 70% of the time [11].
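For illustration only, Eq. 16 can be computed with any naive Bayes implementation that returns posterior probabilities. The sketch below uses scikit-learn's GaussianNB, which is an assumption on our part; our experiments rely on the discretization-based classifier described in Appendix A.4. Following Definition 2, the model is fit on all examples except the one whose score is computed.

import numpy as np
from sklearn.naive_bayes import GaussianNB

def nb_nonconformity(X, y, i):
    # Nonconformity score of example (X[i], y[i]) according to Eq. 16.
    mask = np.arange(len(y)) != i
    model = GaussianNB().fit(X[mask], y[mask])
    posterior = model.predict_proba(X[i:i + 1])[0]   # posterior label distribution
    return 1.0 - posterior[list(model.classes_).index(y[i])]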


3.3.5 Kernel Perceptron

The kernel perceptron (KP) learns a separating hyperplane by updating a weight vector in a high-dimensional space during training [9]. The weight vector represents the normal vector and intercept of the hyperplane. The expansion of the weight vector in dual form is:

w = Σ_{i=1}^{n+1} λ_i y_i Φ(x_i) ,    (17)

with λ_i the dual variable for instance x_i and Φ(·) the mapping to the high-dimensional space. It is easily verified that λ_i encodes the number of times that instance x_i is incorrectly classified during training [16, p. 241-242]. Clearly, the nonconformity score of example z_i = (x_i, y_i) can be defined as α_i = λ_i [11]. However, this nonconformity score is not valid in the sense that the KP solution depends on the ordering of the training examples. In our experiments we show that this violation of the exchangeability assumption does not have any effect in practice.
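A bare-bones kernel perceptron that exposes these dual variables can be sketched as follows (illustration only; K is a precomputed kernel matrix, y contains labels in {−1, +1}, and r is the number of passes over the training data):

import numpy as np

def kernel_perceptron_duals(K, y, r=10):
    # lam[i] counts how often example i is misclassified during training; these
    # counts serve directly as nonconformity scores alpha_i = lambda_i.
    n = len(y)
    lam = np.zeros(n)
    for _ in range(r):
        for i in range(n):
            # prediction in dual form: sign( sum_j lam[j] * y[j] * K[j, i] )
            if y[i] * np.dot(lam * y, K[:, i]) <= 0:
                lam[i] += 1
    return lam

The intercept of the hyperplane can be absorbed into the kernel by adding a constant to each of its entries.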

3.3.6 Support Vector Machine

The support vector machine (SVM) finds a separating hyperplane with maximum margin using the inner products of instances mapped to a high-dimensional space. The inner products are efficiently computed using a kernel function. The maximum margin hyperplane is found by solving a quadratic programming problem in dual form [16, Ch. 7]. In this optimization problem, the Lagrange multipliers λ1 , λ2 , . . . , λn+1 associated with examples z1 , . . . , zn+1 take values in the domain [0, C] with C the SVM error penalty. Examples with λi = 0 lie outside the margin and at the correct side of the hyperplane. Examples with 0 < λi < C also lie at the correct hyperplane side, but on the margin. Examples with λi = C can lie inside the margin and at the correct side of the hyperplane, or they can lie at the incorrect side of the hyperplane. Larger Lagrange multipliers represent more nonconformity and therefore they are valid nonconformity scores, i.e., we define αi = λi as the nonconformity score of example zi = (xi , yi ) [14, 15].
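When an off-the-shelf SVM solver is used, the Lagrange multipliers can be read off from its dual solution. As an illustration (our sketch, not the adiabatic incremental SVM used in our experiments), scikit-learn's SVC stores y_i λ_i for the support vectors in dual_coef_ and the corresponding indices in support_, so the full vector of nonconformity scores can be assembled as follows:

import numpy as np
from sklearn.svm import SVC

def svm_nonconformity_scores(X, y, C=10.0, kernel="rbf"):
    # alpha[i] = lambda_i in [0, C]; examples that are not support vectors get 0.
    model = SVC(C=C, kernel=kernel).fit(X, y)
    alpha = np.zeros(len(y))
    alpha[model.support_] = np.abs(model.dual_coef_[0])   # |y_i * lambda_i| = lambda_i
    return alpha

In a TCM-SVM the optimization is of course rerun with the tentatively labelled example (x_{n+1}, y) included, which is exactly what the incremental approach of Appendix A.6 avoids doing from scratch.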

4 Experiments

The previous Section 3 discussed technical properties and practical implementations of TCMs. This section empirically investigates whether the calibration property holds when TCMs are applied in the off-line learning setting. We performed experiments with TCMs on a number of benchmark datasets. These datasets are described briefly in Subsection 4.1. Subsection 4.2 outlines the experimental setup and Subsection 4.3 presents the results of the experiments.

4.1 Benchmark Datasets

In the following, we denote the aforementioned TCM implementations by the classifier name and the prefix TCM, e.g., TCM-kNN is the TCM based on the k-NN nonconformity measure. We tested the six TCMs on 10 well-known binary datasets from the UCI benchmark repository [12]. The datasets are: heart statlog, house votes, ionosphere, liver, monks1, monks2, monks3, pima, sonar, and spect. Some datasets, such as liver and sonar, are known to be highly non-linear. For these non-linear datasets it is especially challenging to verify whether TCM-LDC satisfies the calibration property. The monks datasets are known to be difficult for distance-based classifiers [3].

dataset          size   features   % min class
heart statlog    270    13         44.44
house votes      342    16         34.21
ionosphere       350    34         35.71
liver            341    6          41.64
monks1           432    6          50.00
monks2           432    6          32.87
monks3           432    6          48.15
pima             768    8          34.90
sonar            208    60         46.63
spect            219    22         12.79

Table 1: Characteristics of the UCI datasets used for experiments: name, number of examples, number of features, and percentage of examples in the minority class.

As a preprocessing step, all instances with missing feature values are removed as well as duplicate instances. Features are standardized to have zero mean and unit variance to remove possible effects caused by features with different orders of magnitude. The main characteristics of the resulting datasets are summarized in Table 1.
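The preprocessing just described amounts to the following steps (a sketch using pandas and scikit-learn, which is merely one possible realization; the column name "label" is a placeholder):

import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(df, label_column="label"):
    # Remove instances with missing feature values as well as duplicate instances,
    # then standardize the features to zero mean and unit variance.
    df = df.dropna().drop_duplicates()
    X = StandardScaler().fit_transform(df.drop(columns=[label_column]).to_numpy())
    y = df[label_column].to_numpy()
    return X, y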

4.2 Experimental Setup

The classifiers TCM-kNN, TCM-KP, and TCM-SVM require the selection of one or more parameters. We performed model selection by applying a ten-fold cross-validation process that was repeated five times. The chosen parameter values are those for which the number of prediction sets with multiple labels is minimized for significance levels in the domain [0, 0.2].^4 The number of nearest neighbours for TCM-kNN is restricted to k = 1, 2, . . . , 10. For TCM-SVM and TCM-KP we tested polynomial and Gaussian kernels with exponent values e = 1, 2, . . . , 10 and bandwidth values σ = 0.001, 0.01, 0.03, 0.06, 1, 1.6, respectively. The SVM error penalty C is kept fixed at 10. We used the Euclidean distance in TCM-kNN and TCM-NC.

Once the parameter values are chosen, TCMs are applied in the off-line learning setting with ten-fold cross validation. To ensure that results are independent of the order of examples in the training folds, the experiments were repeated five times with random permutations of the data. We report the average performance over all experiments and test folds.

The performance of TCMs is measured by two key statistics. First, the percentage of prediction sets that do not contain the true label is measured. This is the error rate measured as a percentage. Second, we measure efficiency to indicate how useful the prediction sets are. Efficiency is given by the percentages of three types of prediction sets. The first type are prediction sets with one label. These prediction sets are called certain predictions. Second, uncertain predictions correspond to prediction sets with two labels and indicate that both labels are likely to be correct.^5 Third, prediction sets can also be empty. Clearly, certain predictions are preferred.

^4 The conclusions based on our experiments do not depend on our model selection approach. Other parameter values simply result in more prediction sets with multiple labels.
^5 Of course, this generalizes in the case of multilabel learning to more than two labels.


4.3 Results

In this section we report the empirical results of off-line TCMs on the 10 benchmark datasets. To visualize the performance of a TCM, we follow the convention defined in [18]. Results are shown as graphs indicating four values for each significance level: (1) the percentage of incorrect predictions, (2) the percentage of uncertain predictions, (3) the percentage of empty predictions, and (4) the percentage of incorrect predictions that are allowed at the significance level. The first value represents the error rate as a percentage, while the second and third values represent efficiency.^6 The line connecting the percentage of incorrect predictions allowed at each significance level is called the error calibration line. As an example, Fig. 1 shows a TCM-kNN and a TCM-NC applied on the ionosphere dataset. Graphs of all TCMs and datasets are given in Appendix B. In the following we first focus our attention on the calibration property; then we give some remarks about efficiency.

TCMs satisfy the calibration property if the percentage of incorrect predictions at each significance level lies on the error calibration line. From Fig. 1 it follows that the corresponding TCMs are well-calibrated up to negligible statistical fluctuations (the empirical error line can hardly be distinguished from the error calibration line). For example, at ε = 0.05 approximately 5% of the prediction sets do not contain the true label. Table 2 verifies the calibration property for all TCMs and datasets by reporting the average deviation between the empirical errors and the error calibration line for ε = 0, 0.01, . . . , 0.5. We do not consider significance levels above 0.5 since these result in classifiers for which more than 50% of the prediction sets do not contain the true label. Deviations are given in percentages and are almost zero, indicating that TCMs satisfy the calibration property when they are applied in the off-line learning setting. Note that we included datasets for which some classifiers have difficulties to achieve a low error rate (Subsection 4.1). Even for these datasets and classifiers, Table 2 reports deviations that are almost zero.

To discuss efficiency, we note that the percentage of uncertain predictions is 100% when ε = 0 since the computed prediction sets contain all labels. We allow for more incorrect predictions when the significance level is set to a higher value. Therefore, the percentage of uncertain predictions monotonically decreases with higher significance levels. How fast this decline goes depends on the performance of the classifier plugged into the TCM framework. This means that k-NN performs significantly better than NC on the ionosphere dataset, as illustrated by Fig. 1. Empty predictions start to occur at approximately the significance level for which there are no more uncertain predictions. The percentage of empty predictions monotonically increases after this significance level, moving closer to the error calibration line to eventually lie on this line.

To summarize efficiency for all TCMs and datasets, we consider four significance levels that we believe to be of interest in many practical situations: ε = 0.20, 0.15, 0.10, 0.05. For these significance levels, Tables 3-12 in Appendix C report means and standard deviations of the percentages of incorrect, certain, and empty predictions. Of course, these tables again verify that the calibration property holds. The reported standard deviations may seem rather large. However, the number of test instances in a single test fold is small (between 20 and 76, depending on the dataset). The efficiency of the TCMs varies strongly for a number of TCMs and datasets. All values correspond to our discussion of efficiency.

^6 The percentage of certain predictions is trivially derived from the reported percentages of the other types of prediction sets. Note that the percentage of empty predictions is at most the percentage of incorrect predictions.
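The deviation statistic of Table 2 can be computed as follows (a sketch of our own; errors_at is assumed to map a significance level to the observed percentage of incorrect predictions, and the deviation is taken to be the average absolute difference):

import numpy as np

def calibration_deviation(errors_at, max_level=0.5, step=0.01):
    # Average absolute deviation (in percent) between the empirical error percentage
    # and the error calibration line for epsilon = 0, 0.01, ..., 0.5.
    levels = np.arange(0.0, max_level + step / 2, step)
    return float(np.mean([abs(errors_at(eps) - 100.0 * eps) for eps in levels]))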


[Figure 1 contains two panels, (a) TCM-kNN (k = 1) and (b) TCM-NC, each plotting the percentage of uncertain predictions, empty predictions, errors, and the error calibration line against the significance level on the ionosphere dataset.]

Figure 1: Results of two TCMs applied on the ionosphere dataset in the off-line learning setting: (a) TCM-kNN and (b) TCM-NC.

                 TCM-kNN   TCM-NC   TCM-LDC   TCM-NB   TCM-KP   TCM-SVM
heart statlog    0.34      0.59     0.35      0.20     0.25     0.31
house votes      0.33      0.27     0.38      0.29     0.53     0.28
ionosphere       0.21      0.81     0.31      0.28     0.33     0.38
liver            0.62      1.35     0.35      0.43     0.47     0.23
monks1           0.98      1.02     0.40      0.60     0.26     0.40
monks2           0.49      1.29     0.46      0.29     0.27     0.36
monks3           0.32      0.51     0.22      0.52     0.21     0.45
pima             0.21      0.28     0.13      0.16     0.16     0.16
sonar            0.59      1.09     0.38      0.32     0.46     0.67
spect            0.35      1.06     0.36      0.58     0.51     0.61

Table 2: The deviations between empirical errors and the error calibration line. Values are reported as percentages.

5 Discussion

This section elaborates on the difference between randomized and non-randomized TCMs, and on the meaning of empty prediction sets.

In our experiments with non-randomized TCMs, we found that the line connecting the empirical errors of a non-randomized TCM-SVM is a step function that tends to stay below the error calibration line (as illustrated in Fig. 2). The reason for this observation is as follows. There are two possible scenarios when a new example is added to the training examples. First, the new example may be a support vector. The difference between the randomized p-value and the non-randomized p-value is then small since the number of support vectors with equal nonconformity score is only a small fraction of the available examples. Also note that the non-randomized p-value obtains its maximum value 1. Second, the new example may be a non-support vector. The randomized p-value is then significantly smaller than the non-randomized p-value since all non-support vectors have equal nonconformity score. This implies that the non-randomized TCM-SVM will compute fewer empty prediction sets than the randomized TCM-SVM.

[Figure 2 contains two panels, (a) a randomized TCM-SVM (e = 2) and (b) a non-randomized TCM-SVM (e = 2), each plotting the percentage of uncertain predictions, empty predictions, errors, and the error calibration line against the significance level on the heart statlog dataset.]

Figure 2: TCM-SVM performance on the heart statlog dataset: (a) randomized TCM-SVM and (b) non-randomized TCM-SVM.

Therefore, the empirical error line becomes a step function since empty prediction sets are counted as errors. A similar reasoning holds for the difference between a non-randomized TCM-KP and a randomized TCM-KP. For the remaining TCM implementations, a non-randomized version did not lead to significantly different results than a randomized version. Indeed, when the nonconformity scores take values in a large domain, the difference between non-randomized and randomized TCMs is negligible.

Empty prediction sets indicate that the classification task has become too easy: we can afford the luxury of refusing to make a prediction. Thus, empty prediction sets are a tool to satisfy the calibration property for high significance levels. In fact, the significance level for which empty prediction sets start to arise is approximately equal to the error rate of the classifier when it is not plugged into the TCM framework. To avoid empty predictions, TCMs can be modified to include the label with the highest p-value in the prediction set, even though this p-value can be smaller than or equal to the significance level. In this situation, the percentage of empirical errors will also behave as a step function below the error calibration line since an empty prediction set was previously counted as an error. The significance level now gives an upper bound on the error rate, although we do not know how tight this bound is. The resulting TCMs are called forced TCMs and they are said to be conservatively well-calibrated [1].

6 Conclusions

In this report we focused on the applicability and validity of transductive confidence machines (TCMs) applied in the off-line learning setting. TCMs allow predictions to be made such that the error rate is controlled a priori by the user. This property is called the calibration property. An analytical proof of the calibration property exists when TCMs are applied in the on-line learning setting. However, this learning setting restricts the applicability of TCMs.

We provided an extensive empirical evaluation of TCMs applied in the off-line learning setting. Six TCM implementations with different nonconformity measures were applied on 10 well-known benchmark datasets. Moreover, pseudo codes presented incremental learning and decremental unlearning TCMs, resulting in TCMs with reasonable time complexity. From the results of our experiments, we may conclude that TCMs satisfy the calibration property in the off-line learning setting, thereby strongly extending the range of tasks in which they can be applied. TCMs have a significant benefit over conventional classifiers, for which the error rate cannot be controlled by the user prior to classification, especially in tasks where reliable instance classifications are desired.

Since TCMs have now been shown to be widely applicable and well-calibrated in virtually any application domain, our future work focuses on efficiency. We noticed that the chosen nonconformity measure affects efficiency while it does not violate the upper bound on the error rate. Our next goal is to minimize the size of the computed prediction sets, especially in the case of multilabel learning. We believe that this can be achieved with a new nonconformity measure. Our interest is a measure that is independent of the specific TCM implementation and that is designed to provide a confidence value on nonconformity scores too. In addition, we are also interested in possible information contained in the distribution of p-values.

Acknowledgments The first author is supported by the Dutch Organization for Scientific Research (NWO), ToKeN programme, grant nr: 634.000.435. The second author is supported by NWO, CATCH programme, grant nr: 640.002.401.

References

[1] Tony Bellotti. Confidence Machines for Microarray Classification and Feature Selection. PhD thesis, Royal Holloway University of London, London, UK, February 2006.
[2] Tony Bellotti, Zhiyuan Luo, Alex Gammerman, Frederick Van Delft, and Vaskar Saha. Qualified predictions for microarray and proteomics pattern diagnostics with confidence machines. International Journal of Neural Systems, 15(4):247–258, 2005.
[3] Enrico Blanzieri and Francesco Ricci. Probability based metrics for nearest neighbor classification and case-based reasoning. In Klaus-Dieter Althoff, Ralph Bergmann, and Karl Branting, editors, 3rd International Conference on Case-Based Reasoning and Development (ICCBR-1999), pages 14–28, Seeon Monastery, Germany, July 27-30 1999. Springer.
[4] Gert Cauwenberghs and Tomaso Poggio. Incremental and decremental support vector machine learning. In Todd Leen, Thomas Dietterich, and Volker Tresp, editors, Advances in Neural Information Processing (NIPS-2000), pages 409–415, Denver, CO, USA, November 27-30 2000. MIT Press.
[5] Thomas Cover and Peter Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27, 1967.
[6] Pedro Domingos and Michael Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2):103–130, 1997.
[7] Ronald Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:178–188, 1936.
[8] Alex Gammerman, Vladimir Vovk, and Vladimir Vapnik. Learning by transduction. In Gregory Cooper and Serafin Moral, editors, 14th Conference on Uncertainty in Artificial Intelligence (UAI-1998), pages 148–155, Madison, WI, USA, July 24-26 1998. Morgan Kaufmann.
[9] Roni Khardon, Dan Roth, and Rocco Servedio. Efficiency versus convergence of boolean kernels for on-line learning algorithms. Journal of Artificial Intelligence Research, 24:341–356, 2005.
[10] Jingli Lu, Ying Yang, and Geoffrey Webb. Incremental discretization for naive Bayes classifier. In Xue Li, Osmar Zaïane, and Zhanhuai Li, editors, 2nd International Conference on Advanced Data Mining and Applications (ADMA-2006), pages 223–238, Xi'an, China, August 14-16 2006. Springer.
[11] Thomas Melluish, Craig Saunders, Ilia Nouretdinov, and Vladimir Vovk. Comparing the Bayes and typicalness frameworks. In Luc De Raedt and Peter Flach, editors, 12th European Conference on Machine Learning (ECML-2001), pages 360–371, Freiburg, Germany, September 5-7 2001. Springer.
[12] David Newman, Seth Hettich, Cason Blake, and Christopher Merz. UCI repository of machine learning databases, 1998.
[13] Kostas Proedrou, Ilia Nouretdinov, Vladimir Vovk, and Alex Gammerman. Transductive confidence machines for pattern recognition. Technical Report 01-02, Royal Holloway University of London, London, UK, 2001.
[14] Craig Saunders, Alex Gammerman, and Vladimir Vovk. Transduction with confidence and credibility. In Thomas Dean, editor, 16th International Joint Conference on Artificial Intelligence (IJCAI-1999), pages 722–726, Stockholm, Sweden, July 31 - August 6 1999. Morgan Kaufmann.
[15] Craig Saunders, Alex Gammerman, and Vladimir Vovk. Computationally efficient transductive machines. In Toshio Okamoto, Roger Hartley, Kinshuk, and John Klus, editors, 11th International Conference on Algorithmic Learning Theory (ICALT-2000), pages 325–333, Madison, WI, USA, August 6-8 2000. IEEE Computer Society Press.
[16] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK, 2004.
[17] Stijn Vanderlooy and Ida Sprinkhuizen-Kuyper. An overview of algorithmic randomness and its application to reliable instance classification. Technical Report 07-02, Universiteit Maastricht, Maastricht, The Netherlands, 2007.
[18] Vladimir Vovk, Alex Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, New York, NY, USA, 2005.


A Pseudo Codes

In this appendix we provide pseudo code for the TCM implementations used in the report. Each implementation supports incremental learning and decremental unlearning of a single example. Consequently, time complexity is kept low. To compute time complexity, we will assume for simplicity that the number of labels |Y| is smaller than the number of features m. The number of training instances is denoted by n and we compute prediction sets for l unlabeled instances. For distance-based TCMs, we assume distances to be measured with the Minkowski distance:

d_p(x_i, x_j) = ||x_i − x_j||_p = ( Σ_{k=1}^{m} |x_{i,k} − x_{j,k}|^p )^{1/p} ,    (18)

where p ∈ R^+, m is the number of features, and x_{i,k} and x_{j,k} are the values of the k-th feature of instances x_i and x_j, respectively. Note that the Manhattan distance and the Euclidean distance are special cases (p = 1 and p = 2, respectively). Clearly, a single computation of the Minkowski distance requires time complexity O(m). For TCM-LDC, TCM-KP, and TCM-SVM we say that the binary label space consists of a negative and a positive label, i.e., Y = {−1, +1}. We do not consider extensions to multilabel learning.
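A direct transcription of Eq. 18 (a sketch for illustration only):

import numpy as np

def minkowski_distance(x_i, x_j, p=2.0):
    # Minkowski distance of order p between two feature vectors (Eq. 18);
    # p = 1 gives the Manhattan distance and p = 2 the Euclidean distance.
    return float(np.sum(np.abs(x_i - x_j) ** p) ** (1.0 / p))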

A.1 TCM-kNN

An efficient implementation of TCM-kNN first calculates the nonconformity scores on the training data, and updates these scores only if needed for each new example [13]. Indeed, from the nonconformity score (Eq. 12) it follows that the score of training example z_i only changes when the distance to the new unlabeled instance is: (1) smaller than the last element in sequence D_i^y, or (2) smaller than the last element in sequence D_i^{−y}. Algorithm 1 provides pseudo code. Its time complexity is broken down as follows:

lines 2-3      O(mn^2)
lines 4-5      Θ(nk)
lines 7-8      O(mn)
lines 10-11    Θ(n)
lines 13-18    O(n)
line 20        Θ(k)
line 21        Θ(n)

Lines 2-5 are only calculated once for the training examples and we may assume O(mn^2) as the dominating complexity. Lines 7-8 are specific to the unlabeled instance. From the remaining lines, lines 13-18 are the most time intensive since they have time complexity O(n).^7 These lines are nested in a loop at line 9, resulting in time complexity O(|Y|n). Thus, the overall complexity of TCM-kNN applied in the off-line learning setting to compute l prediction sets is O(mn^2) + O(mnl) + O(|Y|nl).

^7 The pseudo code may give the impression that the time complexity of these lines is O(kn). However, by keeping track of the sum of the elements in D_i^y and D_i^{−y}, we can update a single nonconformity score in constant time. Hence, a time complexity of O(n) is obtained.

A.2 TCM-NC

A fast implementation of TCM-NC is given by Algorithm 2. It first computes the distances from the training examples to the class centroids with identical and different labels. These distances are iteratively updated for each new example. The time complexity is as follows:


lines 2-3      Θ(|Y|mn)
lines 4-10     O(|Y|mn)
lines 11-12    Θ(n)
lines 14-15    Θ(n)
line 17        Θ(m)
lines 18-19    O(mn)
lines 20-24    Θ(n)
line 26        Θ(n)
line 27        Θ(n)

Lines 2-12 are only calculated once with dominating complexity O(|Y|mn). From the remaining lines, lines 18-19 have the highest complexity. These lines are nested in a loop at line 13, resulting in time complexity O(|Y|mn). Therefore, the time complexity to compute l prediction sets in the off-line learning setting is O(|Y|mnl).

A.3 TCM-LDC

LDC finds the normal vector of the separating hyperplane using the matrix inversion:

w = S_W^{−1} (µ_{−1} − µ_{+1}) ,    (19)

with µ_y the class centroid of label y ∈ {−1, +1} and S_W the within-class scatter matrix defined as follows:

S_W = Σ_{y ∈ {−1,+1}} P(y) cov_y .    (20)

Here, P(y) is the prior probability of class y and cov_y is its covariance matrix:

cov_y = Σ_{x_i ∈ C_y} (x_i − µ_y)(x_i − µ_y)^T ,    (21)

with C_y the set of instances with label y. Once the normal vector is found, the hyperplane intercept b is computed such that the following equality holds:

⟨w, µ_{−1}⟩ + b = −⟨w, µ_{+1}⟩ − b .    (22)

When a new example is added to the training data, the class centroids of both labels can be updated incrementally. If we extend the definition of class centroids and covariance matrices to include superscripts denoting the number of training examples, then the new covariance matrix cov_y^{n+1} of class y can be written as follows:

cov_y^{n+1} = Σ_{x_i ∈ C_y} (x_i − µ_y^{n+1})(x_i − µ_y^{n+1})^T    (23)
            = Σ_{x_i ∈ C_y} (x_i − µ_y^n + µ_y^n − µ_y^{n+1})(x_i − µ_y^n + µ_y^n − µ_y^{n+1})^T    (24)
            = cov_y^n + Σ_{x_i ∈ C_y} (µ_y^n − µ_y^{n+1})(µ_y^n − µ_y^{n+1})^T + 2 Σ_{x_i ∈ C_y} (x_i − µ_y^n)(µ_y^n − µ_y^{n+1})^T .    (25)

An iterative procedure is clearly available when we keep track of the class centroids and the values x_i − µ_y^n. Algorithm 3 provides pseudo code of an efficient TCM-LDC implementation. The time complexity is as follows:

lines 2-4      Θ(mn)
lines 5-9      Θ(mn)
lines 10-11    O(m^2 n)
line 13        Θ(m)
line 15        Θ(m)
line 16        O(m^2 n)
line 18        O(m^3)
lines 19-20    Θ(mn)
line 22        Θ(n)

Lines 2-11 are only calculated once with time complexity O(m^2 n). From the remaining lines, lines 16 and 18 have the dominating complexity. Line 16 computes the covariance matrix and line 18 computes the inverse of the within-class scatter matrix using Gauss-Jordan elimination or singular value decomposition. Note that line 16 uses preprocessed information (i.e., x_i − µ_y^n) and hence is considerably faster than computing the covariance matrix from scratch. These lines are in a loop that is executed twice starting at line 12. Therefore, the time complexity to compute l prediction sets is O(m^2 nl + m^3 l).
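For reference, a batch (non-incremental) fit of the LDC defined by Eqs. 19-22 can be sketched as follows (an illustration only; the incremental bookkeeping of Algorithm 3 is omitted):

import numpy as np

def fit_ldc(X, y):
    # Fisher linear discriminant of Eqs. 19-22 for labels in {-1, +1}.
    n = len(y)
    mu, cov, prior = {}, {}, {}
    for c in (-1, +1):
        Xc = X[y == c]
        mu[c] = Xc.mean(axis=0)
        D = Xc - mu[c]
        cov[c] = D.T @ D                                  # class scatter (Eq. 21)
        prior[c] = len(Xc) / n                            # prior probability P(y)
    S_w = prior[-1] * cov[-1] + prior[+1] * cov[+1]       # within-class scatter (Eq. 20)
    w = np.linalg.solve(S_w, mu[-1] - mu[+1])             # normal vector (Eq. 19)
    b = -0.5 * (w @ (mu[-1] + mu[+1]))                    # intercept from Eq. 22
    return w, b

The nonconformity score of Subsection 3.3.3 is then alpha_i = -y_i * (w @ X[i] + b).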

A.4 TCM-NB

The naive Bayes classifier can learn and unlearn an example at any moment; thus an efficient implementation of TCM-NB is straightforward. To allow for continuous features, several approaches have been proposed. We use the incremental flexible frequency discretization approach [10]. The approach groups the sorted values of a continuous feature into a sequence of intervals and treats each interval as a new discrete attribute. The sizes of the intervals are updated when instances are added to or removed from the training data in such a way that low classification variance and bias are retained. The time complexity for learning and unlearning an example is O(m). A posterior label probability distribution is computed in O(|Y|m). Algorithm 4 provides pseudo code (the details of the NB classifier are left out). Note that for each label being tried as the label for an unlabeled instance, the NB classifier is changed twice and one posterior label probability distribution is computed for each available example. Using the aforementioned complexities, it is straightforward to show that the overall time complexity of TCM-NB to compute l prediction sets is O(|Y|^2 mnl).

A.5 TCM-KP

The kernel perceptron learns a separating hyperplane by iterating several times (say r) through the training data. It keeps track of the number of times each instance has been incorrectly classified (the dual variables). When a new example is presented, the learning process is simply repeated. Algorithm 5 provides pseudo code. If k is the number of steps needed to compute one entry of the kernel matrix, then the time complexity is broken down as follows:

line 2       Θ(kn^2)
line 3       Θ(rmn)
line 5       Θ(kn)
line 7       Θ(rmn)
lines 8-9    Θ(n)
line 11      Θ(n)

Lines 2-3 are only computed once with time complexity O(kn^2 + rmn). The next lines are nested in a loop in which the new kernel matrix is first computed by updating the old one: add a new row and column for the kernel entries between the training examples and the new example (line 5). Then, the new kernel perceptron is learned (line 7). The loop is only executed twice and therefore has time complexity Θ(kn + rmn). Combining all complexities, the overall time complexity to compute l prediction sets is Θ(kn^2 l + knl + rmnl).

A.6 TCM-SVM

A TCM-SVM that repeatedly solves quadratic programming problems is infeasible for large datasets. It has been proposed to decompose the problem into manageable subproblems, using a hash function to divide the training data into specific subsets [15]. Although sharp computation times are achieved, this technique may give approximate solutions. Our experiments used an adiabatic incremental SVM [4]. This approach can incrementally learn and decrementally unlearn an example while converging to the exact SVM solution in a considerably small number of iterations. Its key idea is to retain the Karush-Kuhn-Tucker conditions on the training data when a new example is added. An explanation of the approach is beyond the scope of this report and too detailed to provide clear pseudo code.

18

input : (x1 , y1 ), . . . , (xn , yn ) ← sequence of training examples xn+1 ← new unlabeled instance Y ← label space output: {py:y∈Y } ← p-values 1 2 3 4 5

6 7 8

9 10 11 12 13 14 15 16 17 18

19 20 21

22

% Compute statistics of training data for i ← 1 to n do Compute sequences Diyi and Di−yi for i ← 1 to n do αi ← nonconformity score for (xi , yi ) % Compute distances from new unlabeled instance to training instances for i ← 1 to n do dist (i) ← d(xi , xn+1 ) for all y ∈ Y do idS ← indices of training examples with label y idD ← indices of training examples with label different from y % Recalculate statistics to incorporate (xn+1 , y) for all i ∈ idS do y if Dik > dist (i) then αi ← new nonconformity score for (xi , yi ) for all i ∈ idD do −y if Dik > dist (i) then αi ← new nonconformity score for (xi , yi ) % Compute p-value αn+1 ← nonconformity score for (xn+1 , y) py ← p-value for (xn+1 , y) return {py:y∈Y } Algorithm 1: TCM-kNN

19

input : (x1 , y1 ), . . . , (xn , yn ) ← sequence of training examples xn+1 ← new unlabeled instance Y ← label space output: {py:y∈Y } ← p-values 1 2 3 4 5 6 7 8 9 10

11 12

13 14 15 16 17 18 19 20 21 22 23 24

25 26 27

28

% Compute statistics of training data for all y ∈ Y do µy ← class centroid of label y for all y ∈ Y do for i ← 1 to n do dist ← d(xi , µy ) if yi = y then distCentrS (i) = dist else distCentrD(i) = min (distCentrD(i), dist) for i ← 1 to n do αi ← nonconformity score for (xi , yi ) for all y ∈ Y do idS ← indices of training examples with label y idD ← indices of training examples with label different from y % Recalculate statistics to incorporate (xn+1 , y) µ0y ← update class centroid of label y for i ← 1 to n do dist (i) ← d(xi , µ0y ) for all i ∈ idS do αi ← new nonconformity score for (xi , yi ) for all i ∈ idD do if distCentrD(i) > dist (i) then αi ← new nonconformity score for (xi , yi ) % Compute p-value αn+1 ← nonconformity score for (xn+1 , y) py ← p-value for (xn+1 , y) return {py:y∈Y } Algorithm 2: TCM-NC

20

input : (x1 , y1 ), . . . , (xn , yn ) ← sequence of training examples xn+1 ← new unlabeled instance Y ← {−1, +1} output: {p−1 , p+1 } ← p-values 1 2 3 4 5 6 7 8 9 10 11

12 13

% Compute statistics of training data for all y ∈ {−1, +1} do µy ← class centroid of label y P(y) ← prior probability of label y for i ← 1 to n do if yi = −1 then diffs −1 (i) ← xi − µ−1 else diffs +1 (i) ← xi − µ+1 for all y ∈ {−1, +1} do cov y ← covariance matrix of label y for all y ∈ {−1, +1} do diffs y (n + 1) ← xn+1 − µy

17

% Recalculate statistics to incorporate (xn+1 , y) µ0y ← update class centroid of label y cov 0y ← update covariance matrix of label y (Eq. 25) P0 (·) ← update all prior probabilities

18

(w, b) ← learn LDC

19

for i ← 1 to n + 1 do αi ← −yi (hw, xi i + b)

14 15 16

20 21 22

23

% Compute p-value py ← p-value for (xn+1 , y) return {p−1 , p+1 } Algorithm 3: TCM-LDC

21

input : (x1 , y1 ), . . . , (xn , yn ) ← sequence of training examples xn+1 ← new unlabeled instance Y ← label space output: {py:y∈Y } ← p-values 1 2 3 4 5 6 7 8 9 10 11 12

13

% Compute statistics of training data NB ← learn examples (x1 , y1 ), . . . , (xn , yn ) for all y ∈ Y do NB ← incrementally learn example (xn+1 , y) % Compute posterior label probability distributions for i ← 1 to n + 1 do NB ← decrementally unlearn example (xi , yi ) P ← posterior label probability distribution for xi as given by NB αi ← P(yi ) NB ← incrementally learn example (xi , yi ) % Compute p-value py ← p-value for (xn+1 , y) return {py:y∈Y } Algorithm 4: TCM-NB

input : (x1 , y1 ), . . . , (xn , yn ) ← sequence of training examples xn+1 ← new unlabeled instance Y ← {−1, +1} output: {p−1 , p+1 } ← p-values 1 2 3 4 5 6 7 8 9 10 11

12

% Compute statistics of training data K ← kernel matrix of (x1 , y1 ), . . . , (xn , yn ) (λi , . . . , λn ) ← kernel perceptron dual variables for all y ∈ {−1, +1} do K ← extend K with entries for (xn+1 , y) % Compute new kernel perceptron dual variables (λi , . . . , λn+1 ) ← incrementally learn example (xn+1 , y) for i ← 1 to n + 1 do αi ← λ i % Compute p-value py ← p-value for (xn+1 , y) return {p−1 , p+1 } Algorithm 5: TCM-KP

22

B Results as Graphs

This appendix shows the graphs (Figures 3 - 12) that represent the results of our experiments as explained in Section 4.3. TCM−NC 100

90

90

80

80

70

70

percentage of examples

percentage of examples

TCM-kNN (k = 10) 100

60 50 40 30 20

0

0

0.1

0.2

0.3

0.4 0.5 0.6 significance level

0.7

0.8

0.9

50 40 30 20

Uncertain predictions Empty predictions Errors Error calibration line

10

60

Uncertain predictions Empty predictions Errors Error calibration line

10 0

1

0

0.1

0.2

0.3

100

90

90

80

80

70

70

60 50 40 30 20

0

0

0.1

0.2

0.3

0.4 0.5 0.6 significance level

0.7

0.8

0.9

30 Uncertain predictions Empty predictions Errors Error calibration line

10 0

1

0

0.1

0.2

0.3

90

80

80

70

70

percentage of examples

percentage of examples

90

60 50 40 30

0.1

0.2

0.3

0.4 0.5 0.6 significance level

0.7

0.8

0.9

1

0.7

0.8

0.9

60 50 40 30 20

Uncertain predictions Empty predictions Errors Error calibration line 0

0.4 0.5 0.6 significance level TCM-SVM (e = 2)

100

0

1

40

TCM-KP (σ = 0.001)

10

0.9

50

100

20

0.8

60

20

Uncertain predictions Empty predictions Errors Error calibration line

10

0.7

TCM−NB

100

percentage of examples

percentage of examples

TCM−LDC

0.4 0.5 0.6 significance level

Uncertain predictions Empty predictions Errors Error calibration line

10 0

1

0

0.1

0.2

0.3

0.4 0.5 0.6 significance level

Figure 3: Results off-line TCMs for heart statlog dataset.

23

0.7

0.8

0.9

1

TCM−NC 100

90

90

80

80

70

70

percentage of examples

percentage of examples

TCM-kNN (k = 10) 100

60 50 40 30 20

0

0

0.1

0.2

0.3

0.4 0.5 0.6 significance level

0.7

0.8

0.9

50 40 30 20

Uncertain predictions Empty predictions Errors Error calibration line

10

60

Uncertain predictions Empty predictions Errors Error calibration line

10 0

1

0

0.1

0.2

0.3

100

90

90

80

80

70

70

60 50 40 30 20

0

0

0.1

0.2

0.3

0.4 0.5 0.6 significance level

0.7

0.8

0.9

30 Uncertain predictions Empty predictions Errors Error calibration line

10 0

1

0

0.1

0.2

0.3

90

80

80

70

70

percentage of examples

percentage of examples

90

60 50 40 30

0.1

0.2

0.3

0.4 0.5 0.6 significance level

0.7

0.8

0.9

1

0.7

0.8

0.9

60 50 40 30 20

Uncertain predictions Empty predictions Errors Error calibration line 0

0.4 0.5 0.6 significance level TCM-SVM (e = 1)

100

0

1

40

TCM-KP (e = 1)

10

0.9

50

100

20

0.8

60

20

Uncertain predictions Empty predictions Errors Error calibration line

10

0.7

TCM−NB

100

percentage of examples

percentage of examples

TCM−LDC

0.4 0.5 0.6 significance level

Uncertain predictions Empty predictions Errors Error calibration line

10 0

1

0

0.1

0.2

0.3

0.4 0.5 0.6 significance level

Figure 4: Results off-line TCMs for house votes dataset.

24

0.7

0.8

0.9

1

TCM−NC 100

90

90

80

80

70

70

percentage of examples

percentage of examples

TCM-kNN (k = 1) 100

60 50 40 30 20

0

0

0.1

0.2

0.3

0.4 0.5 0.6 significance level

0.7

0.8

0.9

50 40 30 20

Uncertain predictions Empty predictions Errors Error calibration line

10

60

Uncertain predictions Empty predictions Errors Error calibration line

10 0

1

0

0.1

0.2

0.3

100

90

90

80

80

70

70

60 50 40 30 20

0

0

0.1

0.2

0.3

0.4 0.5 0.6 significance level

0.7

0.8

0.9

30 Uncertain predictions Empty predictions Errors Error calibration line

10 0

1

0

0.1

0.2

0.3

90

80

80

70

70

percentage of examples

percentage of examples

90

60 50 40 30

0.1

0.2

0.3

0.4 0.5 0.6 significance level

0.7

0.8

0.9

1

0.7

0.8

0.9

60 50 40 30 20

Uncertain predictions Empty predictions Errors Error calibration line 0

0.4 0.5 0.6 significance level TCM-SVM (σ = 0.06)

100

0

1

40

TCM-KP (σ = 0.001)

10

0.9

50

100

20

0.8

60

20

Uncertain predictions Empty predictions Errors Error calibration line

10

0.7

TCM−NB

100

percentage of examples

percentage of examples

TCM−LDC

0.4 0.5 0.6 significance level

Uncertain predictions Empty predictions Errors Error calibration line

10 0

1

0

0.1

0.2

0.3

0.4 0.5 0.6 significance level

Figure 5: Results off-line TCMs for ionosphere dataset.

25

0.7

0.8

0.9

1

TCM−NC 100

90

90

80

80

70

70

percentage of examples

percentage of examples

TCM-kNN (k = 8) 100

60 50 40 30 20

0

0

0.1

0.2

0.3

0.4 0.5 0.6 significance level

0.7

0.8

0.9

50 40 30 20

Uncertain predictions Empty predictions Errors Error calibration line

10

60

Uncertain predictions Empty predictions Errors Error calibration line

10 0

1

0

0.1

0.2

0.3

100

90

90

80

80

70

70

60 50 40 30 20

0

0

0.1

0.2

0.3

0.4 0.5 0.6 significance level

0.7

0.8

0.9

30 Uncertain predictions Empty predictions Errors Error calibration line

10 0

1

0

0.1

0.2

0.3

90

80

80

70

70

percentage of examples

percentage of examples

90

60 50 40 30

0.1

0.2

0.3

0.4 0.5 0.6 significance level

0.7

0.8

0.9

1

0.7

0.8

0.9

60 50 40 30 20

Uncertain predictions Empty predictions Errors Error calibration line 0

0.4 0.5 0.6 significance level TCM-SVM (e = 1)

100

0

1

40

TCM-KP (σ = 0.001)

10

0.9

50

100

20

0.8

60

20

Uncertain predictions Empty predictions Errors Error calibration line

10

0.7

TCM−NB

100

percentage of examples

percentage of examples

TCM−LDC

0.4 0.5 0.6 significance level

Uncertain predictions Empty predictions Errors Error calibration line

10 0

1

0

0.1

0.2

0.3

0.4 0.5 0.6 significance level

Figure 6: Results off-line TCMs for liver dataset.

26

[Figure: six panels (TCM-kNN with k = 6, TCM-NC, TCM-LDC, TCM-NB, TCM-KP with e = 3, and TCM-SVM with σ = 0.01), each plotting the percentage of uncertain predictions, empty predictions, and errors against the significance level, together with the error calibration line.]

Figure 7: Results off-line TCMs for monks1 dataset.

[Figure: six panels (TCM-kNN with k = 5, TCM-NC, TCM-LDC, TCM-NB, TCM-KP with e = 3, and TCM-SVM with σ = 1), each plotting the percentage of uncertain predictions, empty predictions, and errors against the significance level, together with the error calibration line.]

Figure 8: Results off-line TCMs for monks2 dataset.

[Figure: six panels (TCM-kNN with k = 7, TCM-NC, TCM-LDC, TCM-NB, TCM-KP with σ = 0.03, and TCM-SVM with σ = 1.6), each plotting the percentage of uncertain predictions, empty predictions, and errors against the significance level, together with the error calibration line.]

Figure 9: Results off-line TCMs for monks3 dataset.

[Figure: six panels (TCM-kNN with k = 10, TCM-NC, TCM-LDC, TCM-NB, TCM-KP with σ = 0.001, and TCM-SVM with σ = 1.6), each plotting the percentage of uncertain predictions, empty predictions, and errors against the significance level, together with the error calibration line.]

Figure 10: Results off-line TCMs for pima dataset.

[Figure: six panels (TCM-kNN with k = 1, TCM-NC, TCM-LDC, TCM-NB, TCM-KP with σ = 0.03, and TCM-SVM with σ = 1.6), each plotting the percentage of uncertain predictions, empty predictions, and errors against the significance level, together with the error calibration line.]

Figure 11: Results off-line TCMs for sonar dataset.

[Figure: six panels (TCM-kNN with k = 9, TCM-NC, TCM-LDC, TCM-NB, TCM-KP with σ = 0.001, and TCM-SVM with σ = 1.6), each plotting the percentage of uncertain predictions, empty predictions, and errors against the significance level, together with the error calibration line.]

Figure 12: Results off-line TCMs for spect dataset.

C Results as Tables

This appendix shows Tables 3 - 12, which report the results of our experiments for significance levels 0.20, 0.15, 0.10, and 0.05. For each classifier and each significance level, the percentage of errors, certain predictions, and empty predictions is reported as mean ± standard deviation.
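To make the reported quantities concrete, the sketch below (not taken from the report; the function names, variable names, and toy data are illustrative) computes the three percentages for a single test fold and aggregates them over runs. It assumes that a prediction set contains every label whose p-value exceeds the significance level, that an error means the true label is not in the prediction set, and it uses the population standard deviation; the report may aggregate differently.

from statistics import mean, pstdev

def summarize_fold(prediction_sets, true_labels):
    # Percentage of errors, certain, and empty predictions for one test fold.
    n = len(true_labels)
    errors  = sum(y not in pred for pred, y in zip(prediction_sets, true_labels))
    certain = sum(len(pred) == 1 for pred in prediction_sets)
    empty   = sum(len(pred) == 0 for pred in prediction_sets)
    return 100.0 * errors / n, 100.0 * certain / n, 100.0 * empty / n

def summarize_runs(per_run_results):
    # Mean and standard deviation over runs for each of the three percentages.
    return [(mean(col), pstdev(col)) for col in zip(*per_run_results)]

# Toy example: two small test folds with prediction sets over the labels {0, 1}.
runs = [
    ([{0}, {0, 1}, set(), {1}], [0, 1, 1, 1]),
    ([{1}, {0},    {0},   {1}], [1, 0, 1, 1]),
]
stats = summarize_runs([summarize_fold(preds, labels) for preds, labels in runs])
for name, (m, s) in zip(["% error", "% certain", "% empty"], stats):
    print(f"{name}: {m:.2f} ± {s:.2f}")

Sweeping the significance level and recomputing these percentages yields curves of the kind shown in Appendix B.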

classifier  level    % error            % certain          % empty
TCM-kNN     0.20     19.48 ±  8.59      94.07 ±  5.29       5.93 ±  5.29
            0.15     14.74 ±  7.08      96.07 ±  4.76       0.00 ±  0.00
            0.10      9.70 ±  5.84      77.85 ±  8.38       0.00 ±  0.00
            0.05      4.81 ±  4.57      54.74 ±  9.99       0.00 ±  0.00
TCM-NC      0.20     20.52 ±  7.45      94.30 ±  4.31       5.70 ±  4.31
            0.15     15.70 ±  7.00      95.41 ±  4.07       0.00 ±  0.00
            0.10     10.30 ±  6.91      84.15 ±  7.96       0.00 ±  0.00
            0.05      5.48 ±  5.14      66.67 ± 10.74       0.00 ±  0.00
TCM-LDC     0.20     19.56 ±  8.98      93.11 ±  6.22       6.89 ±  6.22
            0.15     14.96 ±  7.52      97.78 ±  3.26       0.07 ±  0.52
            0.10     10.30 ±  6.36      83.19 ±  7.45       0.00 ±  0.00
            0.05      4.74 ±  4.49      58.81 ± 12.20       0.00 ±  0.00
TCM-NB      0.20      19.9 ±  7.07      92.14 ±  5.58       7.85 ±  5.58
            0.15      5.18 ±  6.86      98.00 ±  2.27       0.00 ±  0.00
            0.10      9.70 ±  5.39      78.81 ±  8.16       0.00 ±  0.00
            0.05      4.89 ±  3.99      56.37 ±  6.95       0.00 ±  0.00
TCM-KP      0.20     20.15 ±  9.47      92.07 ±  5.18       7.93 ±  5.18
            0.15     15.33 ±  7.99      97.48 ±  3.12       0.22 ±  0.89
            0.10      9.78 ±  7.59      84.30 ±  6.92       0.00 ±  0.00
            0.05      5.04 ±  4.89      59.26 ±  9.41       0.00 ±  0.00
TCM-SVM     0.20     19.93 ± 10.53      79.11 ±  9.84       1.11 ±  2.28
            0.15     15.11 ±  9.95      65.93 ± 11.27       0.37 ±  1.71
            0.10     10.37 ±  7.10      52.00 ± 10.82       0.07 ±  0.52
            0.05      5.33 ±  5.61      32.96 ± 10.72       0.00 ±  0.00

Table 3: TCMs results on the heart statlog dataset.

classifier  level    % error            % certain          % empty
TCM-kNN     0.20     19.47 ±  6.93      83.35 ±  6.69      16.65 ±  6.69
            0.15     14.47 ±  6.03      89.53 ±  5.61      10.47 ±  5.61
            0.10      9.88 ±  5.34      97.24 ±  2.68       2.76 ±  2.68
            0.05      5.00 ±  4.00      91.88 ±  4.31       0.00 ±  0.00
TCM-NC      0.20     19.82 ±  8.12      88.29 ±  6.36      11.71 ±  6.36
            0.15     15.00 ±  7.01      94.76 ±  4.04       5.24 ±  4.04
            0.10     10.00 ±  6.32      94.12 ±  4.60       0.00 ±  0.00
            0.05      5.00 ±  4.65      74.65 ±  7.54       0.00 ±  0.00
TCM-LDC     0.20     19.41 ±  8.19      81.76 ±  7.79      18.24 ±  7.79
            0.15     14.76 ±  7.57      86.65 ±  7.23      13.35 ±  7.23
            0.10      9.53 ±  5.62      93.00 ±  4.79       7.00 ±  4.79
            0.05      4.82 ±  4.36      97.47 ±  2.79       0.00 ±  0.00
TCM-NB      0.20     19.64 ±  6.15      84.88 ±  5.51      15.11 ±  5.51
            0.15     14.94 ±  5.53      90.76 ±  4.67       9.23 ±  4.67
            0.10      9.64 ±  4.48      98.82 ±  2.05       0.05 ±  0.41
            0.05      4.82 ±  3.13      85.82 ±  5.90       0.00 ±  0.00
TCM-KP      0.20     21.00 ±  7.59      80.82 ±  7.64      19.18 ±  7.64
            0.15     15.18 ±  6.96      86.82 ±  6.81      13.18 ±  6.81
            0.10     10.00 ±  5.61      92.71 ±  4.94       7.29 ±  4.94
            0.05      4.94 ±  4.38      98.24 ±  2.14       0.47 ±  1.50
TCM-SVM     0.20     20.18 ±  7.49      83.71 ±  6.81      16.29 ±  6.81
            0.15     15.71 ±  7.02      88.35 ±  6.03      11.65 ±  6.03
            0.10     10.35 ±  6.04      94.41 ±  4.57       5.59 ±  4.57
            0.05      4.76 ±  4.15      91.94 ±  4.36       1.18 ±  2.14

Table 4: TCMs results on the house votes dataset.

classifier  level    % error            % certain          % empty
TCM-kNN     0.20     19.71 ±  7.66      89.43 ±  5.75      10.57 ±  5.75
            0.15     14.86 ±  7.05      97.26 ±  3.96       2.63 ±  3.99
            0.10      9.66 ±  5.59      90.69 ±  6.34       0.00 ±  0.00
            0.05      4.46 ±  4.25      72.97 ±  8.62       0.00 ±  0.00
TCM-NC      0.20     21.94 ±  7.86      91.14 ±  4.87       0.00 ±  0.00
            0.15     15.40 ±  6.80      82.80 ±  5.87       0.00 ±  0.00
            0.10     10.23 ±  6.00      70.86 ±  6.88       0.00 ±  0.00
            0.05      4.69 ±  4.27      48.00 ±  8.79       0.00 ±  0.00
TCM-LDC     0.20     19.71 ±  6.86      87.60 ±  5.96      12.34 ±  5.96
            0.15     14.69 ±  6.38      93.71 ±  3.74       5.43 ±  3.75
            0.10     10.00 ±  5.24      95.31 ±  3.86       0.11 ±  0.57
            0.05      5.14 ±  4.32      81.71 ±  6.76       0.00 ±  0.00
TCM-NB      0.20     19.88 ±  8.20      95.42 ±  4.20       4.57 ±  4.20
            0.15     14.74 ±  7.12      93.82 ±  4.91       0.00 ±  0.00
            0.10      9.71 ±  5.88      83.82 ±  7.67       0.00 ±  0.00
            0.05      4.80 ±  4.18      71.82 ±  8.62       0.00 ±  0.00
TCM-KP      0.20     20.11 ±  6.88      88.86 ±  5.51      11.09 ±  5.59
            0.15     14.74 ±  5.90      96.11 ±  3.20       2.06 ±  2.77
            0.10      8.80 ±  5.22      89.83 ±  5.39       0.00 ±  0.00
            0.05      5.37 ±  4.57      70.40 ± 10.01       0.00 ±  0.00
TCM-SVM     0.20     20.06 ±  8.67      81.14 ±  8.26      18.86 ±  8.26
            0.15     15.31 ±  7.48      86.06 ±  6.97      13.31 ±  7.11
            0.10     10.29 ±  6.85      77.03 ±  5.91       7.20 ±  5.42
            0.05      5.31 ±  4.55      52.34 ±  9.51       2.69 ±  3.48

Table 5: TCMs results on the ionosphere dataset.

classifier  level    % error            % certain          % empty
TCM-kNN     0.20     20.53 ±  7.19      65.47 ±  8.81       0.00 ±  0.00
            0.15     15.12 ±  6.48      52.12 ±  9.71       0.00 ±  0.00
            0.10      9.35 ±  5.60      35.29 ±  9.34       0.00 ±  0.00
            0.05      4.76 ±  3.80      20.47 ±  7.40       0.00 ±  0.00
TCM-NC      0.20     21.41 ±  7.95      55.59 ±  8.20       0.00 ±  0.00
            0.15     15.59 ±  7.06      42.76 ±  8.98       0.00 ±  0.00
            0.10     10.59 ±  6.23      31.29 ±  8.59       0.00 ±  0.00
            0.05      5.53 ±  4.43      17.82 ±  7.12       0.00 ±  0.00
TCM-LDC     0.20     20.29 ±  6.60      67.35 ±  8.11       0.00 ±  0.00
            0.15     14.76 ±  4.66      54.24 ±  8.41       0.00 ±  0.00
            0.10      9.82 ±  4.47      39.00 ±  8.34       0.00 ±  0.00
            0.05      4.88 ±  3.69      23.24 ±  6.91       0.00 ±  0.00
TCM-NB      0.20     19.52 ±  8.35      53.47 ±  9.33       0.00 ±  0.00
            0.15     15.17 ±  6.83      41.94 ±  8.92       0.00 ±  0.00
            0.10      9.29 ±  5.48      26.05 ±  8.16       0.00 ±  0.00
            0.05      4.35 ±  4.04      14.41 ±  6.46       0.00 ±  0.00
TCM-KP      0.20     19.65 ±  7.98      64.24 ±  9.02       0.00 ±  0.00
            0.15     14.71 ±  7.16      51.35 ±  9.46       0.00 ±  0.00
            0.10      9.94 ±  6.11      37.53 ± 10.34       0.00 ±  0.00
            0.05      5.53 ±  4.27      22.35 ±  8.02       0.00 ±  0.00
TCM-SVM     0.20     20.34 ±  7.01      86.11 ±  5.63      11.77 ±  5.34
            0.15     14.51 ±  7.54      79.94 ±  7.17       7.49 ±  4.61
            0.10     10.40 ±  6.34      57.71 ±  9.68       4.34 ±  2.96
            0.05      5.09 ±  4.52      30.80 ±  9.24       0.91 ±  1.46

Table 6: TCMs results on the liver dataset.

classifier  level    % error            % certain          % empty
TCM-kNN     0.20     19.35 ±  6.19      80.65 ±  6.19      19.35 ±  6.19
            0.15     15.63 ±  5.36      84.37 ±  5.36      15.63 ±  5.36
            0.10      9.30 ±  5.15      90.84 ±  5.11       9.16 ±  5.11
            0.05      4.60 ±  3.20      95.86 ±  3.23       4.14 ±  3.23
TCM-NC      0.20     21.72 ±  6.70      73.81 ±  5.60       0.00 ±  0.00
            0.15     15.21 ±  5.30      63.07 ±  7.01       0.00 ±  0.00
            0.10     11.40 ±  4.44      50.33 ±  7.15       0.00 ±  0.00
            0.05      5.35 ±  4.51      26.33 ±  8.72       0.00 ±  0.00
TCM-LDC     0.20     21.02 ±  7.14      68.28 ±  7.38       0.00 ±  0.00
            0.15     14.98 ±  5.56      59.44 ±  7.24       0.00 ±  0.00
            0.10      9.86 ±  4.54      52.00 ±  7.28       0.00 ±  0.00
            0.05      4.56 ±  3.29      16.19 ±  5.87       0.00 ±  0.00
TCM-NB      0.20     19.58 ±  4.88      78.04 ±  8.34       0.00 ±  0.00
            0.15     14.04 ±  5.43      63.02 ±  9.56       0.00 ±  0.00
            0.10      8.65 ±  4.88      52.18 ±  8.76       0.00 ±  0.00
            0.05      4.04 ±  2.81      35.34 ±  6.66       0.00 ±  0.00
TCM-KP      0.20     20.47 ±  6.98      67.63 ±  8.71       0.00 ±  0.00
            0.15     15.91 ±  6.11      59.81 ±  9.06       0.00 ±  0.00
            0.10     10.47 ±  5.74      50.09 ± 10.01       0.00 ±  0.00
            0.05      5.26 ±  3.60      18.98 ±  6.70       0.00 ±  0.00
TCM-SVM     0.20     20.56 ±  6.52      37.91 ±  8.00       5.95 ±  3.55
            0.15     14.65 ±  6.02      30.51 ±  7.72       3.02 ±  2.22
            0.10      9.91 ±  5.21      21.91 ±  7.08       1.40 ±  1.76
            0.05      5.26 ±  3.69      11.91 ±  4.85       0.28 ±  0.90

Table 7: TCMs results on the monks1 dataset.

classifier  level    % error            % certain          % empty
TCM-kNN     0.20     20.09 ±  6.92      77.63 ±  7.38       0.00 ±  0.00
            0.15     14.88 ±  6.82      65.77 ±  9.04       0.00 ±  0.00
            0.10     10.42 ±  6.15      49.63 ±  7.32       0.00 ±  0.00
            0.05      6.05 ±  4.33      35.91 ±  8.99       0.00 ±  0.00
TCM-NC      0.20     21.63 ±  5.28      44.33 ±  8.14       0.00 ±  0.00
            0.15     15.67 ±  4.55      30.28 ±  6.78       0.00 ±  0.00
            0.10     11.07 ±  3.92      19.35 ±  6.06       0.00 ±  0.00
            0.05      5.77 ±  3.43       8.33 ±  4.36       0.00 ±  0.00
TCM-LDC     0.20     19.53 ±  5.98      42.19 ±  8.43       0.00 ±  0.00
            0.15     14.56 ±  4.83      31.30 ±  7.74       0.00 ±  0.00
            0.10      9.86 ±  4.73      21.30 ±  6.81       0.00 ±  0.00
            0.05      5.16 ±  3.30      10.23 ±  5.06       0.00 ±  0.00
TCM-NB      0.20     20.04 ±  4.67      58.74 ±  7.29       0.00 ±  0.00
            0.15     14.88 ±  4.38      48.55 ±  8.22       0.00 ±  0.00
            0.10     10.23 ±  4.35      38.46 ±  8.65       0.00 ±  0.00
            0.05      4.32 ±  3.32      21.95 ±  7.63       0.00 ±  0.00
TCM-KP      0.20     20.19 ±  6.35      62.98 ±  7.85       0.00 ±  0.00
            0.15     15.30 ±  6.34      45.81 ±  7.72       0.00 ±  0.00
            0.10     10.19 ±  5.44      28.98 ±  7.37       0.00 ±  0.00
            0.05      4.93 ±  3.69      17.81 ±  6.24       0.00 ±  0.00
TCM-SVM     0.20     20.05 ±  7.52      85.53 ±  6.80       9.12 ±  6.03
            0.15     14.79 ±  6.51      80.60 ±  7.30       5.63 ±  3.93
            0.10      9.67 ±  5.74      71.63 ±  8.37       2.14 ±  2.20
            0.05      5.21 ±  4.00      50.60 ± 10.44       0.19 ±  0.64

Table 8: TCMs results on the monks2 dataset.

classifier  level    % error            % certain          % empty
TCM-kNN     0.20     19.67 ±  7.18      81.02 ±  7.08      18.98 ±  7.08
            0.15     14.98 ±  7.08      85.91 ±  6.73      14.09 ±  6.73
            0.10     10.56 ±  5.97      90.65 ±  5.63       9.35 ±  5.63
            0.05      4.70 ±  3.77      96.93 ±  3.13       3.07 ±  3.13
TCM-NC      0.20     20.79 ±  6.19      98.47 ±  1.80       1.21 ±  1.71
            0.15     15.63 ±  5.13      89.30 ±  4.93       0.00 ±  0.00
            0.10     10.42 ±  4.68      79.07 ±  5.98       0.00 ±  0.00
            0.05      4.93 ±  3.63      56.56 ±  8.26       0.00 ±  0.00
TCM-LDC     0.20     19.95 ±  5.32      97.26 ±  2.47       0.14 ±  0.56
            0.15     14.88 ±  5.13      88.70 ±  4.10       0.00 ±  0.00
            0.10      9.91 ±  4.50      79.67 ±  5.16       0.00 ±  0.00
            0.05      5.07 ±  3.11      70.60 ±  5.43       0.00 ±  0.00
TCM-NB      0.20     18.93 ±  6.23      84.32 ±  5.88      15.67 ±  5.88
            0.15     14.69 ±  6.02      88.69 ±  5.01      11.30 ±  5.01
            0.10      9.58 ±  4.73      94.18 ±  3.73       5.81 ±  3.73
            0.05      4.23 ±  2.80      99.95 ±  0.32       0.04 ±  0.32
TCM-KP      0.20     20.19 ±  7.46      80.60 ±  7.47      19.40 ±  7.47
            0.15     15.07 ±  6.39      85.77 ±  6.50      14.23 ±  6.50
            0.10      9.91 ±  4.99      90.98 ±  4.90       9.02 ±  4.90
            0.05      4.93 ±  3.78      97.49 ±  2.65       2.47 ±  2.67
TCM-SVM     0.20     19.67 ±  6.31      81.07 ±  5.83      18.93 ±  5.83
            0.15     15.02 ±  5.82      85.86 ±  5.50      14.14 ±  5.50
            0.10      9.91 ±  5.01      91.49 ±  4.67       8.51 ±  4.67
            0.05      5.30 ±  3.99      96.47 ±  3.16       3.53 ±  3.16

Table 9: TCMs results on the monks3 dataset.

classifier  level    % error            % certain          % empty
TCM-kNN     0.20     20.68 ±  5.45      91.53 ±  3.29       0.00 ±  0.00
            0.15     15.47 ±  4.51      79.08 ±  5.16       0.00 ±  0.00
            0.10     10.21 ±  3.65      62.97 ±  6.15       0.00 ±  0.00
            0.05      5.05 ±  2.65      43.92 ±  6.39       0.00 ±  0.00
TCM-NC      0.20     20.53 ±  4.52      86.18 ±  3.38       0.00 ±  0.00
            0.15     14.87 ±  4.48      74.37 ±  4.62       0.00 ±  0.00
            0.10     10.39 ±  3.97      60.53 ±  4.35       0.00 ±  0.00
            0.05      5.00 ±  2.45      40.21 ±  6.32       0.00 ±  0.00
TCM-LDC     0.20     20.13 ±  4.77      90.87 ±  3.88       0.00 ±  0.00
            0.15     14.87 ±  4.16      77.63 ±  4.87       0.00 ±  0.00
            0.10      9.97 ±  3.81      64.42 ±  5.13       0.00 ±  0.00
            0.05      4.97 ±  2.41      44.66 ±  5.59       0.00 ±  0.00
TCM-NB      0.20     19.92 ±  4.19      89.92 ±  3.88       0.00 ±  0.00
            0.15     14.60 ±  3.62      76.50 ±  5.21       0.00 ±  0.00
            0.10      9.97 ±  3.48      63.28 ±  4.98       0.00 ±  0.00
            0.05      4.76 ±  2.34      43.50 ±  5.93       0.00 ±  0.00
TCM-KP      0.20     20.16 ±  5.68      91.47 ±  3.46       0.05 ±  0.26
            0.15     14.92 ±  5.03      80.05 ±  4.33       0.00 ±  0.00
            0.10     10.11 ±  3.91      65.05 ±  5.16       0.00 ±  0.00
            0.05      5.08 ±  2.19      47.82 ±  5.80       0.00 ±  0.00
TCM-SVM     0.20     19.97 ±  4.62      65.82 ±  5.61       4.39 ±  2.05
            0.15     15.05 ±  4.31      56.45 ±  6.75       2.11 ±  1.41
            0.10      9.92 ±  3.74      44.05 ±  6.38       0.97 ±  1.15
            0.05      4.97 ±  2.71      27.76 ±  5.05       0.29 ±  0.55

Table 10: TCMs results on the pima dataset.

classifier  level    % error            % certain          % empty
TCM-kNN     0.20     19.50 ±  9.65      95.80 ±  5.57       4.20 ±  5.57
            0.15     14.20 ±  8.47      93.70 ±  5.52       0.00 ±  0.00
            0.10      8.80 ±  6.43      83.10 ±  8.44       0.00 ±  0.00
            0.05      3.70 ±  4.93      74.80 ± 10.64       0.00 ±  0.00
TCM-NC      0.20     21.00 ± 10.40      78.50 ± 10.41       0.00 ±  0.00
            0.15     16.40 ±  9.26      66.30 ± 10.92       0.00 ±  0.00
            0.10     10.90 ±  8.06      52.50 ± 11.70       0.00 ±  0.00
            0.05      6.30 ±  6.53      41.10 ± 11.96       0.00 ±  0.00
TCM-LDC     0.20     19.00 ±  8.33      81.50 ±  8.65       1.30 ±  2.64
            0.15     14.20 ±  7.10      72.90 ± 10.45       0.00 ±  0.00
            0.10      9.80 ±  7.21      59.30 ± 10.83       0.00 ±  0.00
            0.05      5.10 ±  5.49      44.60 ± 11.20       0.00 ±  0.00
TCM-NB      0.20     20.10 ±  9.17      76.20 ± 11.45       0.00 ±  0.00
            0.15     15.10 ±  8.42      65.40 ± 12.24       0.00 ±  0.00
            0.10      9.80 ±  5.71      56.10 ± 11.44       0.00 ±  0.00
            0.05      5.20 ±  4.73      45.30 ± 10.89       0.00 ±  0.00
TCM-KP      0.20     20.10 ±  9.66      90.50 ±  5.82       5.20 ±  4.51
            0.15     14.60 ±  8.26      89.00 ±  6.47       0.70 ±  2.26
            0.10      8.90 ±  7.97      74.70 ± 10.76       0.00 ±  0.00
            0.05      4.60 ±  5.61      58.60 ± 10.69       0.00 ±  0.00
TCM-SVM     0.20     20.10 ± 12.31      87.00 ±  6.85      10.10 ±  7.25
            0.15     14.80 ± 10.59      83.60 ±  6.23       6.80 ±  5.69
            0.10     10.20 ±  7.21      77.00 ±  6.85       4.40 ±  3.99
            0.05      5.20 ±  5.71      69.70 ±  7.03       2.80 ±  3.22

Table 11: TCMs results on the sonar dataset.

classifier  level    % error            % certain          % empty
TCM-kNN     0.20     19.71 ±  9.13      91.33 ±  6.43       8.67 ±  6.43
            0.15     14.19 ±  8.02      98.29 ±  3.01       1.52 ±  2.96
            0.10      9.90 ±  6.22      86.00 ±  7.42       0.00 ±  0.00
            0.05      4.38 ±  4.50      60.38 ±  9.64       0.00 ±  0.00
TCM-NC      0.20     20.95 ±  7.82      71.05 ± 10.88       0.00 ±  0.00
            0.15     15.81 ±  7.48      56.86 ± 11.44       0.00 ±  0.00
            0.10     10.67 ±  6.70      37.24 ± 10.74       0.00 ±  0.00
            0.05      5.33 ±  4.48      11.62 ±  8.12       0.00 ±  0.00
TCM-LDC     0.20     20.19 ±  9.16      51.14 ± 10.79       0.00 ±  0.00
            0.15     15.33 ±  7.95      40.95 ± 10.27       0.00 ±  0.00
            0.10     10.10 ±  7.37      29.24 ±  9.57       0.00 ±  0.00
            0.05      5.62 ±  5.06      17.90 ±  7.24       0.00 ±  0.00
TCM-NB      0.20     20.76 ±  8.37      90.28 ±  5.60       0.00 ±  0.00
            0.15     14.57 ±  7.95      78.76 ± 7.276       0.00 ±  0.00
            0.10      9.52 ±  5.93      68.57 ±  8.86       0.00 ±  0.00
            0.05      4.57 ±  4.99      56.19 ± 10.40       0.00 ±  0.00
TCM-KP      0.20     19.05 ±  9.02      89.52 ±  7.20      10.48 ±  7.20
            0.15     14.57 ±  9.05      96.76 ±  4.02       3.24 ±  4.02
            0.10     10.10 ±  8.09      91.90 ±  5.29       0.00 ±  0.00
            0.05      5.05 ±  5.22      69.81 ± 12.89       0.00 ±  0.00
TCM-SVM     0.20     20.00 ±  8.44      88.67 ±  7.57      11.33 ±  7.57
            0.15     14.95 ±  7.33      95.71 ±  5.02       4.29 ±  5.02
            0.10     10.00 ±  7.22      92.48 ±  7.14       0.00 ±  0.00
            0.05      4.76 ±  4.61      45.81 ± 16.52       0.00 ±  0.00

Table 12: TCMs results on the spect dataset.