2012 IEEE INTERNATIONAL WORKSHOP ON MACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 23–26, 2012, SANTANDER, SPAIN

HANDLING MISSING FEATURES IN MAXIMUM MARGIN BAYESIAN NETWORK CLASSIFIERS

Sebastian Tschiatschek, Nikolaus Mutsam, Franz Pernkopf
Signal Processing and Speech Communication Laboratory
Graz University of Technology, Austria

ABSTRACT

The Comprehensive Nuclear-Test-Ban Treaty Organization (CTBTO) records hydroacoustic data to detect nuclear explosions.¹ This enables verification of the Comprehensive Nuclear-Test-Ban Treaty once it has entered into force. The detection can be considered as a classification problem discriminating noise-like, earthquake-caused and explosion-like data. Classification of the recorded data is challenging because it suffers from large amounts of missing features. While the classification performance of support vector machines has been evaluated, no such results for Bayesian network classifiers are available. We provide these results using classifiers with generatively and discriminatively optimized parameters and employing different imputation methods. In case of discriminatively optimized parameters, Bayesian network classifiers slightly outperform support vector machines. For optimizing the parameters discriminatively, we extend the formulation of maximum margin Bayesian network classifiers to missing features and latent variables. The advantage of these classifiers over classifiers with generatively optimized parameters is demonstrated in experiments.

Index Terms — Discriminative Parameter Learning, Missing Data, Bayesian Network Classifiers, Maximum Margin Learning, CTBTO

1. INTRODUCTION

Maximum margin Bayesian network (MMBN) classifiers [1, 2] are novel probabilistic classifiers showing good performance in many applications. Currently, these classifiers are only able to process fully observed data, while incomplete data, i.e. data with missing features or latent variables, can not be handled. However, in practical applications parts of the data are often missing [3], and probabilistic models include latent variables [4]. Incomplete data can occur at induction time, i.e. in the training data, or at prediction time, i.e. during classification.

This work was supported by the Austrian Science Fund (project number P22488-N23).
¹ The views expressed in this paper are those of the authors and do not necessarily reflect the views of the CTBTO Preparatory Commission.


Endowing MMBN classifiers with the ability to process incomplete samples in the training data can in principle be achieved in two ways: by employing methods to complete the training data prior to induction of the classifier, or by extending MMBNs to incomplete data. In this paper, we pursue both possibilities. To complete the training data we use unique value imputation (UVI) and k-nearest neighbor (k-NN) imputation. Furthermore, we extend the definition of MMBNs by replacing the sample margins in the maximum margin objective by the margins obtained from the marginal distribution.

During prediction, there are several possibilities to handle incomplete data [3]. The most prominent ones are: marginalization of missing features, refusing classification of incomplete samples, acquiring missing values, e.g. by performing additional measurements, imputing the missing values, or training special reduced-feature models, i.e. models that do not include the missing attributes. Refusal of samples may not be an option in some applications, and training reduced-feature models can result in high computational costs when many features are missing. Performing further measurements is often not viable, because of extra costs or because the required data was available only in the past, i.e. at the time the incomplete data was recorded. For this reason, we resort to marginalization and imputation techniques throughout this paper.

We empirically evaluate some of the approaches mentioned above by classifying hydroacoustic signals obtained by the CTBTO. The aim is to detect signals with explosive origin. We compare the performance of MMBNs in their extended formulation to support vector machines (SVMs). Furthermore, we evaluate the performance of MMBNs using different imputation methods to complete the data at induction and prediction time. Our main results are: (i) An extension of MMBNs to missing features and latent variables in conjunction with a scheme for efficient parameter learning. (ii) MMBNs outperform SVMs in some classification tasks on the CTBTO data. (iii) MMBNs in their extended formulation achieve better performance than Bayesian network classifiers with generatively optimized parameters.

(iv) UVI is sufficient to achieve good performance on the CTBTO data.

The remainder of this paper is structured as follows: In Section 2 we introduce the framework of probabilistic classifiers. Particularly, we consider Bayesian network classifiers. This includes approaches for learning these classifiers and methods for handling missing features. Subsequently, we extend MMBNs to incomplete data in Section 3. In Section 4 we describe the CTBTO dataset. Experimental results are presented in Section 5 and the paper is concluded in Section 6.

2. BAYESIAN NETWORK CLASSIFIERS

2.1. Definition

Probabilistic classifiers are embedded in the framework of probability theory. One assumes a random variable (RV) C denoting the class and the RVs X_1, ..., X_L representing the attributes/features of the classifier. These RVs are related by a joint probability distribution P^*(C, X), where X = {X_1, ..., X_L}. In typical settings this joint distribution is unknown and only N i.i.d. samples drawn from P^*(C, X) are available. These samples comprise the training set D = {(c^(n), x^(n)) | 1 ≤ n ≤ N}, where c^(n) denotes the instantiation of RV C and x^(n) the instantiation of X in the n-th training sample. The aim is to induce classifiers with low generalization error from D. A classifier h assigns a class label to every instantiation of the classifier's attributes. Thus, a classifier is a mapping

  h : sp(X) \to sp(C), \quad x \mapsto h(x),    (1)

where sp(X) denotes the set of all possible assignments to X and sp(C) is the set of class labels. The generalization error of this classifier is

  Err(h) = E_{P^*(C,X)}\big[ \mathbf{1}\{ C \neq h(X) \} \big],    (2)

where 1{A} denotes the indicator function which is one if the statement A is true and zero otherwise, and E_{P^*(C,X)}[·] is the expectation operator with respect to distribution P^*(C, X). Typically, the generalization error can not be evaluated as P^*(C, X) is unknown but is rather estimated using cross-validation [5] or a separate test set T (which actually resorts to a special instance of cross-validation).

One particular type of classifiers are Bayesian network classifiers (BNCs). These classifiers use Bayesian networks (BNs) [6, 7] to represent joint probability distributions. A BN B = (G, P_G) consists of a directed acyclic graph G = (V, E), where V = {X_0, ..., X_L} is the set of nodes and E the set of edges, and a set of local conditional probability distributions P_G = {P(X_0 | Pa(X_0)), ..., P(X_L | Pa(X_L))}. The terms Pa(X_0), ..., Pa(X_L) denote the sets of parents of X_0, ..., X_L in G, respectively. Each node of the graph corresponds to an RV and the edges of the graph determine dependencies between these RVs. Throughout this paper, we denote RV X_0 as C, i.e. it represents the class node, and assume that C has no parents in G, i.e. Pa(C) = ∅. The joint probability of a BN factorizes according to the graph structure and is given as

  P^B(C, X_1, \ldots, X_L) = P(C) \prod_{l=1}^{L} P(X_l | Pa(X_l)).    (3)

This joint probability induces the BNC h_{P^B(C,X)}, given as

  h_{P^B(C,X)} : sp(X) \to sp(C), \quad x \mapsto \arg\max_{c \in sp(C)} P^B(C = c | X = x),    (4)

which classifies each instantiation x of X as the maximum a-posteriori (MAP) estimate of C given x. The conditional probability P^B(C = c | X = x) is derived from the joint probability using P^B(C, X) = P^B(C | X) · P^B(X).

2.2. Learning BNCs

BNs for classification can be optimized in two ways: firstly, by selecting the graph structure G, and secondly, by varying the conditional probabilities P_G. Optimizing the graph structure is known as structure learning [8] and selecting P_G is known as parameter learning. Throughout this paper we considered classifiers with naive Bayes (NB) structure only, as we observed that richer structures such as tree augmented NB (TAN) do not increase the performance significantly.

For learning the parameters P_G of a BN two paradigms exist, namely generative parameter learning and discriminative parameter learning. In generative parameter learning, the aim is to identify parameters representing the generative process of the data in the training set. One instance of this type of learning is maximum likelihood (ML) parameter estimation, which determines the parameters such that the likelihood of the data is maximized. Formally, ML parameters P_G^{ML} are learned as

  P_G^{ML} = \arg\max_{P_G} \prod_{n=1}^{N} P^B(C = c^{(n)}, X = x^{(n)}),    (5)

where P^B(C, X) is the joint distribution in (3) induced by the BN (G, P_G).

Discriminative learning is based on the idea of solving the classification problem directly [9]. This is usually advantageous in cases where the assumed model distribution P^B(C, X) does not approximate P^*(C, X) well, for example in cases of too restricted BN graph structures. The intermediate step of determining generative parameters is avoided and the aim is to find parameters resulting in good classification performance. Several objectives for this purpose exist in the literature.
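As a concrete illustration of (5) for the discrete NB structure used throughout this paper, the following is a minimal sketch (our illustration, not the authors' code; the function name and the smoothing of zero probabilities are assumptions): for complete discrete data, the maximizer of (5) is given by relative frequencies.

```python
import numpy as np

def ml_naive_bayes(c, x, n_classes, n_values, eps=1e-5):
    """ML parameters (closed form of Eq. (5)) for a discrete naive Bayes BNC.

    c : (N,) integer class labels, x : (N, L) integer feature values.
    Returns the class prior P(C) and CPTs P(X_l = j | C) of shape (L, n_classes, n_values).
    """
    N, L = x.shape
    prior = np.bincount(c, minlength=n_classes) / N
    counts = np.zeros((L, n_classes, n_values))
    for n in range(N):
        for l in range(L):
            counts[l, c[n], x[n, l]] += 1.0
    cpts = counts / np.maximum(counts.sum(axis=2, keepdims=True), 1.0)
    # zero-probability entries are replaced by a small value, in the spirit of Section 5
    return prior, np.maximum(cpts, eps)
```

With imputed and discretized data (cf. Section 5), such ML parameters are the generative baseline against which the discriminative objectives below are compared.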

We consider the maximum conditional likelihood (MCL) [10] and the maximum margin (MM) [1, 2] objective:

(i) MCL parameters P_G^{MCL} are obtained as

  P_G^{MCL} = \arg\max_{P_G} \prod_{n=1}^{N} P^B(C = c^{(n)} | X = x^{(n)}).    (6)

Thus, MCL maximizes the conditional probability of the class instantiations given the instantiations of the attributes.

(ii) Using the formulation in [2], MM parameters P_G^{MM} are found as

  P_G^{MM} = \arg\max_{P_G} \prod_{n=1}^{N} \min\big( \gamma, d^{(n)} \big),    (7)

where d^{(n)} is the margin of the n-th sample, given as

  d^{(n)} = \frac{P^B(C = c^{(n)} | X = x^{(n)})}{\max_{c \neq c^{(n)}} P^B(C = c | X = x^{(n)})}.    (8)

The parameter γ > 1 controls the margin. Equivalently, the sample margin can be defined as

  d^{(n)} = \frac{P^B(C = c^{(n)}, X = x^{(n)})}{\max_{c \neq c^{(n)}} P^B(C = c, X = x^{(n)})}.    (9)

Informally, the margin measures the ratio of the likelihood of the n-th sample belonging to the correct class c^(n) to belonging to the strongest competing class. The n-th sample is correctly classified if d^(n) > 1 and misclassified otherwise. MMBNs, i.e. BNs with MM parameters, have shown good classification performance in many applications [2]. They have the capability to handle model mismatch and can be trained efficiently.
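To illustrate (8), the sketch below (assuming the discrete naive Bayes parametrization from the previous sketch; not part of the paper) computes the sample margin from log-joint probabilities. Definitions (8) and (9) yield the same value, since the common normalization P^B(X = x^(n)) cancels in the ratio.

```python
import numpy as np

def log_joint(prior, cpts, x_row):
    """log P^B(C = c, X = x) for all classes c under the naive Bayes model."""
    logp = np.log(prior)
    for l in range(x_row.shape[0]):
        logp = logp + np.log(cpts[l, :, x_row[l]])
    return logp

def sample_margin(prior, cpts, c_true, x_row):
    """Margin d^(n) of Eq. (8)/(9): correct class vs. strongest competing class."""
    logp = log_joint(prior, cpts, x_row)
    rivals = np.delete(logp, c_true)
    return float(np.exp(logp[c_true] - rivals.max()))  # > 1 iff the sample is classified correctly
```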

2.3. Missing Features

In real-world applications the training set [11] and/or the test set [3] typically contains missing features. For example, in medical expert systems a patient may not have undergone some examinations, or acquiring all information is expensive due to high costs of laboratory tests.

Before discussing methods to handle missing features, we extend the notation from the previous section. A sample consists of a class label c together with an instantiation x_O of the observed attributes X_O ⊆ X. The set of missing attributes is X_M = X \ X_O. Similarly, we use the terms c^(n) and x_O^(n) to denote the instantiations of the n-th sample in the training set.

In the following, we introduce several options for handling missing features in classifiers. Most of these options are applicable to missing features in both the training and the test set:

(i) Discarding incomplete instances: Whenever X_M ≠ ∅ the corresponding sample is removed from the training set. In case of a test sample, the classification is declined. However, this might be impractical, especially when features are missing frequently.

(ii) Acquiring missing values: This might be an option in scenarios where features can be acquired at later stages, e.g. in the medical domain additional examinations could be conducted.

(iii) Imputation: Missing values are estimated and imputed by various methods. A missing feature can be replaced by the mean value of that feature (mean value imputation) or set to a specific value indicating that the feature was missing, which is known as unique value imputation (UVI). In k-nearest neighbor imputation (k-NN imputation) the missing features are replaced by the mean (or median, mode, etc.) of the k closest samples containing values for the missing features. Alternatively, missing features can be imputed based on an estimated distribution over the feature values (distribution-based imputation).

(iv) Reduced-feature models: Several classifiers are derived, where each classifier employs only a subset of the features available in the complete model. Whenever a particular sample with missing features is to be classified, a classifier incorporating only the observed features of that sample is applied. Therefore, no imputation is necessary. However, this approach may require the computation of many classifiers and, hence, can be computationally expensive.

(v) Marginalization: This approach is applicable to probabilistic classifiers which are based on a joint probability distribution P(C, X). If a sample with missing features is to be classified, the missing features are marginalized out. Consequently, the sample is classified as the MAP estimate of class C under the probability distribution P(C, X_O), which is determined as

  P(C, X_O) = \sum_{x_M \in sp(X_M)} P(C, X).    (10)
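For the NB structure used in this paper, the marginalization in (10) is particularly cheap: each missing factor P(X_l | C) sums to one over its values, so missing features simply drop out of the product in (3). A hedged sketch of MAP classification under P(C, X_O), reusing the hypothetical model arrays introduced above:

```python
import numpy as np

def classify_marginalized(prior, cpts, x_row, observed_mask):
    """MAP class under P(C, X_O) of Eq. (10) for a naive Bayes model.

    sum_{x_M} P(C) prod_l P(X_l | C) = P(C) prod_{l observed} P(X_l | C),
    because every missing factor sums to one over its values.
    """
    logp = np.log(prior)
    for l in np.flatnonzero(observed_mask):
        logp = logp + np.log(cpts[l, :, x_row[l]])
    return int(np.argmax(logp))
```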

3. MMBNS WITH MISSING FEATURES

In this section we extend MMBNs to incomplete data and provide approaches for learning their parameters.

3.1. Objective

MMBNs are defined by the objective in (7). Whenever the training set contains incomplete samples the objective can not be evaluated because the class posteriors conditioned on all features can not be calculated. However, replacing (7) by

  P_G^{MM} = \arg\max_{P_G} \prod_{n=1}^{N} \min\big( \gamma, d_O^{(n)} \big),    (11)

where d_O^{(n)} is

  d_O^{(n)} = \frac{P^B(c^{(n)} | x_O^{(n)})}{\max_{c \neq c^{(n)}} P^B(c | x_O^{(n)})},    (12)

alleviates this problem. The class posteriors are conditioned only on the observed features while missing features are marginalized out.
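In the NB special case, the observed-feature margin (12) can be computed directly from P^B(C, x_O) without explicit summation, again because missing factors marginalize to one; for general graph structures proper inference is required instead (the authors use libDAI for this, cf. Section 5). A small sketch following the conventions of the earlier, illustrative helpers:

```python
import numpy as np

def margin_observed(prior, cpts, c_true, x_row, observed_mask):
    """d_O^(n) of Eq. (12) for a naive Bayes model, via the joint P^B(C, x_O).

    The ratio of posteriors in (12) equals the ratio of joints, since P^B(x_O) cancels."""
    logp = np.log(prior)
    for l in np.flatnonzero(observed_mask):
        logp = logp + np.log(cpts[l, :, x_row[l]])
    rivals = np.delete(logp, c_true)
    return float(np.exp(logp[c_true] - rivals.max()))
```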

Evaluating and optimizing this objective requires marginalization, i.e. inference, whenever there are missing features. In the case of fully observed data, the above formulation reduces to the original formulation of MMBNs in [2].

3.2. Learning MMBNs with Missing Features

Learning MMBNs from incomplete data requires solving (11). For this purpose, set θ^l_{j|h} = P(X_l = j | Pa(X_l) = h), i.e. θ^l_{j|h} denotes the probability of RV X_l taking value j given the assignment h of the parents of X_l. Hence, the joint probability of a BN can be stated as

  P^B(C = c, X = x) = \prod_{l=0}^{L} \prod_{j \in sp(X_l)} \prod_{h \in sp(Pa(X_l))} \big( \theta^l_{j|h} \big)^{u^l_{j|h}},    (13)

where we resorted to denoting C as X_0 in the product for ease of notation, and used u^l_{j|h} to represent the instantiation c of C and x of X in binary form, i.e.

  u^l_{j|h} = \mathbf{1}\{ X_l = j \text{ and } Pa(X_l) = h \}.    (14)

Introducing the logarithm in (13), the joint probability can be expressed as

  P^B(C = c, X = x) = \exp\Big( \sum_{l=0}^{L} \sum_{j \in sp(X_l)} \sum_{h \in sp(Pa(X_l))} u^l_{j|h} \cdot \log \theta^l_{j|h} \Big).    (15)

By setting w^l_{j|h} = log θ^l_{j|h}, stacking all w^l_{j|h} in vector w, and all u^l_{j|h} in the feature vector φ(c, x), we obtain P^B(C = c, X = x) = exp(φ(c, x)^T w). Consequently, the logarithm of objective (11) is given as

  \sum_{n=1}^{N} \min\Big( \tilde{\gamma},\; \log \sum_{x_M^{(n)}} \exp\big( \phi(c^{(n)}, [x_O^{(n)}, x_M^{(n)}])^T w \big) - \max_{c \neq c^{(n)}} \log \sum_{x_M^{(n)}} \exp\big( \phi(c, [x_O^{(n)}, x_M^{(n)}])^T w \big) \Big),    (16)

where \tilde{\gamma} = \log \gamma and the summations over x_M^{(n)} marginalize the missing features in the n-th sample x^{(n)} = [x_O^{(n)}, x_M^{(n)}]. Maximization of (16) with respect to w can be achieved in several ways. The following two are the most obvious approaches:

(i) Projected gradient: The simplest approach might be to employ projected gradient methods for optimization. This approach was also used to perform the experiments in Section 5. To apply projected gradient methods, the partial derivatives of (16) with respect to w^l_{j|h} have to be determined. However, the objective in (16) is not differentiable at all w because of the minimum- and the maximum-operator. This issue can be resolved by replacing the minimum- and the maximum-operator by a softmax- and a softmin-function, respectively [2]. Consequently, the gradient of (16) can be derived and the objective can be maximized by performing projected gradient ascent. The projection is required to ensure that w parametrizes valid probability distributions, i.e. that

  0 \leq \exp(w^l_{j|h}) \leq 1, \quad \forall\, l, j, h, \text{ and }    (17)

  \sum_{j \in sp(X_l)} \exp(w^l_{j|h}) = 1, \quad \forall\, l, h.    (18)

Calculating the gradient and evaluating the objective requires multiple marginalizations. Therefore, this approach is computationally expensive.

(ii) Minorization-maximization: The minorize-maximize approach [12] is a prescription for designing optimization algorithms. In a minorization step a lower bound to the objective is constructed and in the maximization step this lower bound is maximized. The expectation maximization (EM) algorithm is a special case of this type of algorithms. Creating a lower bound to (16) is in general difficult. While the log-sum-exp term with positive sign can be lower bounded using Jensen's inequality, lower bounding the log-sum-exp term with negative sign in a suitable² way is more difficult. In principle, a linear lower bound or the reverse Jensen's inequality [13] can be employed. Consequently, maximization of (16) could potentially be solved using fewer marginalizations, resulting in lower computational complexity.

² The lower bound should be as tight as possible. Additionally, a lower bound having the same form as Jensen's inequality is advantageous, as it allows standard tools for EM to be used.
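A compact sketch of approach (i) for an NB structure with discrete features follows. It is our illustration, not the authors' implementation: the smoothing constant eta of the softmax/softmin relaxation, the finite-difference gradient (an analytic gradient would be used in practice), the learning rate and the random initialization are assumptions; the paper instead initializes at EM parameters, uses early stopping, and relies on libDAI for inference (Section 5). The projection enforces (17)–(18) by renormalizing exp(w) in the log domain, one common way to realize the projection step, and γ̃ plays the role of log γ from (7).

```python
import numpy as np
from scipy.special import logsumexp

def project(w_c, w_x):
    """Renormalize so that exp(w) forms valid distributions, enforcing (17)-(18)."""
    w_c = w_c - logsumexp(w_c)
    w_x = w_x - logsumexp(w_x, axis=2, keepdims=True)
    return w_c, w_x

def log_marginal(w_c, w_x, x_row, mask):
    """log sum_{x_M} exp(phi(c, [x_O, x_M])^T w) for every class c (naive Bayes).

    Because of (18) the factors of missing features sum to one, so they drop out."""
    logp = w_c.copy()
    for l in np.flatnonzero(mask):
        logp = logp + w_x[l, :, x_row[l]]
    return logp

def soft_objective(w_c, w_x, C, X, M, log_gamma, eta=10.0):
    """Smoothed Eq. (16): min and max replaced by softmin/softmax (cf. [2])."""
    total = 0.0
    for c, x_row, mask in zip(C, X, M):
        logp = log_marginal(w_c, w_x, x_row, mask)
        rivals = np.delete(logp, c)
        soft_max = logsumexp(eta * rivals) / eta                          # smooth max over c != c^(n)
        log_margin = logp[c] - soft_max
        total += -logsumexp([-eta * log_gamma, -eta * log_margin]) / eta  # smooth min(log_gamma, .)
    return total

def learn_mmmf(C, X, M, n_classes, n_values, log_gamma=1.0, lr=0.05, iters=100, h=1e-5):
    """Projected gradient ascent on the smoothed objective (finite differences for brevity)."""
    L = X.shape[1]
    rng = np.random.default_rng(0)
    w_c, w_x = project(rng.normal(size=n_classes),
                       rng.normal(size=(L, n_classes, n_values)))
    for _ in range(iters):
        base = soft_objective(w_c, w_x, C, X, M, log_gamma)
        g_c, g_x = np.zeros_like(w_c), np.zeros_like(w_x)
        for idx in np.ndindex(*w_c.shape):
            w_c[idx] += h
            g_c[idx] = (soft_objective(w_c, w_x, C, X, M, log_gamma) - base) / h
            w_c[idx] -= h
        for idx in np.ndindex(*w_x.shape):
            w_x[idx] += h
            g_x[idx] = (soft_objective(w_c, w_x, C, X, M, log_gamma) - base) / h
            w_x[idx] -= h
        w_c, w_x = project(w_c + lr * g_c, w_x + lr * g_x)
    return w_c, w_x
```

Here C, X and M hold class labels, discrete feature values and observation masks, respectively; w_c and w_x are the log-parameters of the class node and the feature CPTs. The per-parameter finite differences make each iteration O(#parameters · N) and are only meant to keep the sketch short.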

4. CTBTO DATASET

The CTBTO dataset consists of hydroacoustic measurements in eight partially overlapping frequency bands. In each band, time-related, energy-related, statistical-moment, and cepstral features are extracted independently. In total, the data consists of 128 features, i.e. 16 features in every frequency band. For more information about the data, see [14].

One or more frequency bands are missing in many samples, i.e. all 16 features in a band are missing. Hence, the features are not missing at random. This violates the assumptions of many approaches for dealing with missing features. Only 36 of all 778 samples are complete, i.e. 4.63 %. This fact renders approaches using only complete samples for training infeasible. In total, only about 61 % of the frequency bands are available.

Two different class assignments of the samples are available: (i) Binary-class case: the classification problem is concerned with detecting signals of explosive origin. (ii) Multi-class case: noise-like, earthquake-caused, and explosive signals are to be distinguished.
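The band-structured missingness can be made explicit with a tiny sketch (the array names and the toy availability matrix are ours, not from the dataset): a sample is complete only if all eight 16-feature bands are present.

```python
import numpy as np

N_BANDS, FEATS_PER_BAND = 8, 16   # 8 frequency bands x 16 features = 128 features

def feature_mask(band_available):
    """Expand an (N, 8) band-availability matrix to an (N, 128) feature-observation mask.
    Features are missing in whole bands, i.e. not missing at random."""
    return np.repeat(np.asarray(band_available, dtype=bool), FEATS_PER_BAND, axis=1)

# Hypothetical example: three samples, bands marked present (1) or missing (0).
bands = np.array([[1] * 8, [1] * 6 + [0] * 2, [0] * 3 + [1] * 5])
mask = feature_mask(bands)
print(mask.shape)                  # (3, 128)
print(np.mean(mask.all(axis=1)))   # fraction of complete samples in this toy example
```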

5. EXPERIMENTS


We evaluated the classification performance of BNCs on the CTBTO data. The performance was determined using 5-fold cross-validation. In a preprocessing step, the data was normalized to zero mean and unit variance. Incomplete samples were either preserved without modification, or completed using UVI and k-NN mean value imputation, where k = 4. The data was discretized using k-means discretization [15] on a per-feature basis, where we arbitrarily selected k = 4.

In experiments with imputed training sets, we considered BNCs with ML, MCL and MM parameter learning. ML parameters with value 0 were replaced by small positive values, i.e. by ε = 10⁻⁵. The hyperparameter γ in the MM objective was selected using 5-fold cross-validation for each training fold separately. The early stopping heuristic proposed in [2] was employed when learning MCL and MM parameters. In cases in which no imputation of the training set was performed, we determined the parameters for the BNCs by EM and by the projected gradient approach proposed in Section 3. The latter approach is abbreviated by MMMF (maximum margin with missing features). MMMF parameters were learned by initializing them to EM parameters and performing projected gradient ascent. To limit the number of projected gradient ascent steps we used an early stopping heuristic. For inference the open-source library libDAI was used [16].

Classification rates for both the binary and the multi-class case are provided in Table 1. Furthermore, confidence intervals for a confidence level of 95 % are provided [17]. We compare our results to those in [14] obtained using linear discriminant analysis (LDA) and SVMs for classification (the data and experimental setup are the same). The authors made use of SVMs with special kernels that do not require imputation techniques to complete missing data. For details on the employed methods, refer to the original paper. In the binary-class case, the best result in [14] is obtained using SVMs with convolutional kernels. The achieved classification rate is 95.70 %. In the multi-class case, LDA resulted in the best performance of 84.40 %, and the best result reported for SVMs is 83.60 %, obtained by using radial basis function kernels. Unfortunately, the authors did not provide any information on classification confidence for their results.

Our classification results in the multi-class case are significantly better than those of SVMs in [14] when using BNCs with UVI imputation and MM parameter learning. In the binary classification task, our results are similar to those of SVMs.

BNCs with discriminatively optimized parameters are typically prone to overfitting. Therefore, we tried to regularize the complexity of the structures of the BNCs by considering only a subset of the available features. For this purpose, we implemented a greedy feature selection heuristic. This heuristic sequentially adds the feature to the classifier which results in the best 5-fold cross-validation classification performance.
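The imputation and discretization steps described above might look as follows; this is a simplified stand-in for the actual preprocessing pipeline (the unique value −1, the handling of pairwise-observed distances, and the 1-D k-means initialization are our assumptions, not details from the paper).

```python
import numpy as np

def impute_uvi(X, mask, unique_value=-1.0):
    """Unique value imputation: missing entries get a value indicating 'missing'."""
    X = X.copy()
    X[~mask] = unique_value
    return X

def impute_knn_mean(X, mask, k=4):
    """k-NN mean imputation: replace each missing entry by the mean of that feature
    over the k nearest samples (Euclidean distance on commonly observed features)."""
    X = X.copy()
    N = X.shape[0]
    for n in range(N):
        miss = np.flatnonzero(~mask[n])
        if miss.size == 0:
            continue
        dists = np.full(N, np.inf)
        for m in range(N):
            if m == n:
                continue
            common = mask[n] & mask[m]
            if common.any():
                dists[m] = np.linalg.norm(X[n, common] - X[m, common])
        neigh = np.argsort(dists)[:k]
        for f in miss:
            donors = [m for m in neigh if mask[m, f]]
            if donors:                       # if no neighbor observes the feature, it stays missing
                X[n, f] = X[donors, f].mean()
    return X

def kmeans_discretize(col, k=4, iters=25):
    """Per-feature k-means discretization (1-D Lloyd's algorithm), returning cluster indices."""
    centers = np.quantile(col, np.linspace(0, 1, k))
    for _ in range(iters):
        labels = np.argmin(np.abs(col[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = col[labels == j].mean()
    return np.argmin(np.abs(col[:, None] - centers[None, :]), axis=1)
```

Normalization to zero mean and unit variance would precede these steps; in this sketch the unique value is chosen outside the normalized data range so that it remains distinguishable from observed values.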

Fig. 1. Classification rates resulting from greedy feature selection in the multi-class case (classification rate in % versus number of features, for UVI-ML, MMMF and EM).

The multi-class classification rates of BNCs using only a subset of the features are shown in Figure 1. Using BNCs with UVI imputation and ML parameters (UVI-ML), two features are sufficient to achieve performance on par with BNCs using all features, i.e. 128 features. When no imputation is performed, BNCs with MMMF parameters and only two features achieve the same performance as BNCs using all features. Consequently, low-complexity models such as NB with only a few features can be employed. This observation can also be exploited to minimize the cost of acquiring the features.

6. CONCLUSIONS AND FUTURE WORK

In this paper we demonstrated the applicability of BNCs to the CTBTO dataset, i.e. they can be employed to discriminate noise-like, earthquake-caused and explosion-like data. Although the dataset contains a large number of missing features and the number of training samples is limited, good performance is achieved. Classification rates are comparable to those of SVMs, or even superior. Furthermore, MM parameter learning was extended to incomplete data, enabling discriminative parameter learning in the presence of missing features and latent variables. This avoids imputation, which might introduce bias into the data. In experiments, the advantage of this approach over generative parameter learning from incomplete data has been demonstrated.

Table 1. Classification rates (CR) for the CTBTO data in the binary-class and the multi-class case. For classification, BNCs with NB structure were used. In cases where data was completed prior to induction of the classifiers, maximum likelihood (ML), maximum conditional likelihood (MCL) and maximum margin (MM) parameters were employed. In the other cases, parameters were obtained using expectation maximization (EM) or the extended MM formulation (MMMF). To complete data with missing features, unique value imputation (UVI) and k-NN mean value imputation (KMI) were used. The abbreviation Marg. indicates that missing features in test instances were marginalized out.

Training set   Parameters   Test set   CR, 2 classes [%]   CR, 3 classes [%]
UVI            ML           UVI        93.80 ± 2.57        83.72 ± 5.66
UVI            MCL          UVI        93.67 ± 1.74        82.16 ± 4.78
UVI            MM           UVI        95.99 ± 2.23        86.82 ± 2.30
KMI            ML           KMI        94.06 ± 2.22        82.17 ± 4.99
KMI            MCL          KMI        94.06 ± 2.35        82.95 ± 3.23
KMI            MM           KMI        94.44 ± 1.84        85.27 ± 1.61
none           EM           Marg.      93.54 ± 2.04        84.49 ± 2.23
none           MMMF         Marg.      93.80 ± 1.46        85.54 ± 2.45

7. REFERENCES

[1] Yuhong Guo, Dana Wilkinson, and Dale Schuurmans, "Maximum margin Bayesian networks," in Proceedings of the 21st Annual Conference on Uncertainty in Artificial Intelligence, UAI Press, 2005, pp. 233–242.

[2] F. Pernkopf, M. Wohlmayr, and S. Tschiatschek, "Maximum margin Bayesian network classifiers," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 3, pp. 521–531, 2012.

[3] Maytal Saar-Tsechansky and Foster Provost, "Handling missing values when applying classification models," Journal of Machine Learning Research, vol. 8, pp. 1623–1657, Dec. 2007.

[4] Christopher M. Bishop, "Learning in graphical models," pp. 371–403, MIT Press, Cambridge, MA, USA, 1999.

[5] Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2007.

[6] Judea Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Publishers Inc., 1988.

[7] Daphne Koller and Nir Friedman, Probabilistic Graphical Models: Principles and Techniques, MIT Press, 2009.

[8] Franz Pernkopf and Jeff Bilmes, "Efficient heuristics for discriminative structure learning of Bayesian network classifiers," Journal of Machine Learning Research, vol. 11, pp. 2323–2360, 2010.

[9] Vladimir N. Vapnik, Statistical Learning Theory, Wiley, Sept. 1998.

[10] Teemu Roos, Hannes Wettig, Peter Grünwald, Petri Myllymäki, and Henry Tirri, "On discriminative Bayesian network classifiers and logistic regression," Machine Learning, vol. 59, no. 3, pp. 267–296, 2005.

[11] E. Acuna and C. Rodriguez, "The treatment of missing values and its effect in the classifier accuracy," in Classification, Clustering and Data Mining Applications, 2004, pp. 639–648.

[12] J. M. Ortega and W. C. Rheinboldt, Iterative Solution of Nonlinear Equations in Several Variables, Society for Industrial and Applied Mathematics, Philadelphia, PA, 2000.

[13] Tony Jebara and Alex Pentland, "On reversing Jensen's inequality," in Advances in Neural Information Processing Systems 13, MIT Press, 2000.

[14] M. Tuma, C. Igel, and M. Prior, "Hydroacoustic signal classification using kernel functions for variable feature sets," in 20th International Conference on Pattern Recognition (ICPR), 2010, pp. 1011–1014.

[15] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, L. M. Le Cam and J. Neyman, Eds., University of California Press, 1967, vol. 1, pp. 281–297.

[16] Joris M. Mooij, "libDAI: A free and open source C++ library for discrete approximate inference in graphical models," Journal of Machine Learning Research, vol. 11, pp. 2169–2173, Aug. 2010.

[17] Thomas M. Mitchell, Machine Learning, McGraw-Hill, Inc., New York, NY, USA, 1st edition, 1997.