Methods and Models for Interpretable Linear Classification

Berk Ustun and Cynthia Rudin


Working Paper, May 19, 2014

Abstract

We present a new comprehensive approach to create accurate and interpretable linear classification models through mixed-integer programming. Unlike existing approaches to linear classification, our approach can produce models that incorporate a wide range of interpretability-related qualities, and that achieve a precise, understandable, and pareto-optimal balance between accuracy and interpretability. We demonstrate the value of our approach by using it to train scoring systems, M-of-N rule tables, and "personalized" classification models for applications in medicine, marketing, and crime. In addition, we propose novel methods to train interpretable classification models on large-scale datasets using off-the-shelf mixed-integer programming software. Our paper includes theoretical bounds to assess the predictive accuracy of our models.

1 Introduction

"Each time one of our favorite [machine learning...] approaches has been applied in industry, the [interpretability...] of the results, though ill-defined, has been a decisive factor of choice." (Yves Kodratoff, The Comprehensibility Manifesto [48])

Interpretable predictive models provide "a qualitative understanding of the relationship between joint values of the input variables and the resulting predicted response value" [28]. Such models are essential to the practice of applied predictive modeling because they are easy to troubleshoot, easy to explain, can be validated by domain knowledge, and can produce insights about the underlying data. Together, these unique advantages ensure that interpretable models are far more likely to be trusted and used across a wide range of applications, such as crime prediction [79, 2, 86], marketing [39, 97], medical diagnosis [3, 31, 69], and scientific discovery [84, 88, 26, 38].

Building interpretable predictive models is inherently complicated because interpretability is a subjective and multifaceted notion [48, 25]. Models that are highly interpretable to one audience may be completely uninterpretable to others due to relative differences in their affinity for certain types of predictive models, their background in statistics, their exposure to the data, and/or their domain expertise. Many practitioners deal with the subjective nature of interpretability in a pragmatic way: they first train a predictive model with a transparent form (e.g., a decision tree or linear model), and then tune this model until it is sufficiently interpretable for a given task or audience. Through tuning, practitioners might wish to create models that are:

1. Sparse: According to Miller [68], humans can only handle a few cognitive entities at once (7 ± 2). In statistics, sparsity refers to the number of terms in a model and constitutes the standard way of measuring model complexity [83, 85].

2. Expository: Humans are seriously limited in estimating the association between three or more variables [43]. To help domain experts gauge the influence of one predictive factor with respect to the others, practitioners often use linear models that clearly relate each factor to the predicted response value. Many medical scoring systems [3, 70, 31] and criminology risk assessment tools [98, 99] further improve the expository nature of these linear models by using meaningful integer coefficients.

3. Intuitive: Rüping [83] warns that humans tend to find a fact understandable if they are already aware of it. He illustrates this idea using the statement "rhinoceroses can fly": a perfectly understandable assertion that no one would believe. Unfortunately, the signs of coefficients in linear classification models may be at odds with the intuition of domain experts due to dependent relationships between variables. Recent approaches to interpretability aim to produce models that are sufficiently intuitive to be used in practice by restricting the signs of coefficients to match the intuition of domain experts [74, 97, 66].

Existing approaches for creating transparent classification models were not originally designed to provide practitioners with the precision or flexibility required to deal with the subjective and multifaceted nature of interpretability. Current linear models, for instance, are produced using algorithms that only allow practitioners to adjust the sparsity of their models. Further, these approaches rely on regularization parameters that do not have a clear meaning with respect to the accuracy-sparsity tradeoff.

In this paper, we present a comprehensive and flexible approach to build accurate and interpretable linear classification models. Our approach allows practitioners to tackle the subjective and multifaceted nature of interpretability by including a wide range of interpretability-related qualities, by specifying whether each quality is strictly necessary to the model or may be sacrificed for greater accuracy, and by controlling the balance between accuracy and interpretability in a clear and predictable way (e.g., by using regularization parameters that represent how much training accuracy is sacrificed for each "unit" of interpretability). Further, our approach produces models that benefit from strong generalization guarantees, and that can address practical issues associated with incomplete datasets and imbalanced classification problems.

We train our models using a mixed-integer programming (MIP) framework that optimizes direct measures of accuracy, by means of the 0–1 loss function, and precise measures of interpretability, by means of a judiciously crafted interpretability penalty function and interpretability set. The MIP framework allows us to produce models that optimize the exact objective that we care about, and presents several unique advantages. In particular, minimizing the 0–1 loss function produces models that are highly accurate, completely robust to outliers, and achieve the best possible learning-theoretic generalization guarantee. Moreover, optimizing direct measures of interpretability produces models that achieve the best possible (pareto-optimal) trade-off between accuracy and interpretability, and allows practitioners to control this trade-off using meaningful regularization parameters that do not necessarily have to be set through cross-validation and grid search. Other advantages of the MIP framework include:

1. Bound on Optimality: Current MIP solvers use branch-and-bound algorithms that not only produce feasible solutions but pair these solutions with a bound on optimality.

2. Range of Solutions: Current MIP solvers supplement the current best solution to a MIP with a pool of good feasible solutions. This allows practitioners to choose between several different interpretable classifiers by solving a single MIP.

3. Forward-Looking: In the past few years, advances in MIP research have rapidly been incorporated into commercial off-the-shelf solvers, allowing us to solve exponentially larger MIPs [8]. Using the MIP framework, practitioners can seamlessly benefit from such computational improvements without having to update their code.

We introduce three novel off-the-shelf classification models: two that allow practitioners to build accurate and interpretable scoring systems and M-of-N rule tables, and a third, "personalized," model that allows practitioners to build linear models that use their own definition of interpretability.
To decrease the computational burden of training these models, we also introduce two novel methods that can help us train models on large-scale datasets that would otherwise be intractable.

1.1 Related Work

1.1.1 On the Interpretability of Predictive Models

Interpretability is a widely used yet ill-defined concept in the statistics and machine learning literature. In this paper, we view interpretability as an abstract, domain-dependent notion that not only affects how easy it is to understand a given predictive model, but also affects how likely that predictive model is to be accepted and used in practice. Although our view of interpretability is in line with those of many related works, this may not be immediately obvious, because these works refer to this notion using terms such as comprehensibility [48, 25], acceptability [63], and justifiability [63, 66]. Semantic issues notwithstanding, it is generally accepted that interpretability is a subjective, domain-dependent, and multifaceted notion [48, 63, 25]. To illustrate this fact, note that works that assess the interpretability of classification models with case studies [51] and user studies [87, 1, 42] often reach conflicting conclusions about the qualities that make classification models interpretable. Broadly speaking, models that are easy to understand and likely to be accepted are typically transparent (i.e., they clearly relate outcomes to variables), sparse (i.e., they contain few terms), intuitive (i.e., they are in line with domain expertise), and explanatory (i.e., they relate attributes to outcomes in a clear, informative, and meaningful way).


1.1.2 On the Need for a Comprehensive Approach to Interpretable Classification

The lack of a comprehensive and flexible approach to interpretable classification is most obvious in the medical community, where physicians are currently diagnosing serious medical conditions with classification models that were built using heuristic methods. Medical experts have traditionally preferred to use scoring systems, which are sparse linear classification models that make predictions by adding, subtracting, and multiplying a few meaningful numbers. Some popular medical scoring systems include: SAPS I, II and III, used to assess the mortality of patients in intensive care [52, 53, 69]; APACHE I, II and III, also used to assess the mortality of patients in intensive care [45, 46, 47]; CHADS2, which assesses the risk of stroke in patients with atrial fibrillation [31]; TIMI, used to assess the risk of death and ischemic events in patients with certain types of heart problems [3, 70]; SIRS, used to detect Systemic Inflammatory Response Syndrome [9]; the Wells Criteria for pulmonary embolisms [101] and deep vein thrombosis [100]; the Ranson Criteria for acute pancreatitis [78]; and Light's Criteria for distinguishing transudative from exudative pleural effusions [56].

The fact that the medical community builds medical scoring systems using heuristic approaches should not be surprising, especially given that scoring systems require an approach that can produce accurate and sparse linear models, that use meaningful and intuitive coefficients, and that can be used with imbalanced datasets. In some cases, medical scoring systems were built by training and tweaking well-known classification models. The SAPS II scoring system [53], for instance, was built by training a logistic regression model and rounding the coefficients. Specifically, Le Gall et al. [53] write that "the general rule was to multiply the β for each range by 10 and round off to the nearest integer." This is at odds with the fact that rounding coefficients is known to produce suboptimal solutions. In other cases, medical scoring systems were built using consensus opinion from a panel of physicians, and not learned from data at all. Knaus et al. [46], for instance, determined the features and coefficients of the APACHE II scoring system using the domain expertise of a group of experts: "[There] was general agreement by the group on where cutoff points should be placed." A similar approach was also used for the CHADS2 scoring system (see Figure 1), as suggested by [31]: "To create CHADS2, we assigned 2 points to a history of prior cerebral ischemia and 1 point for the presence of other risk factors because a history of prior cerebral ischemia increases the relative risk (RR) of subsequent stroke commensurate to 2 other risk factors combined. We calculated CHADS2, by adding 1 point each for each of the following - recent CHF, hypertension, age 75 years or older, and DM - and 2 points for a history of stroke or TIA." To illustrate the dangers of building predictive models with heuristic approaches, note that an attempted improvement to CHADS2, known as CHA2DS2-VASc [57], performs worse than CHADS2. This is not to say that CHADS2 cannot be improved: recent work has shown that using a sensible data-driven approach that explicitly optimizes accuracy and interpretability can produce a rule-based predictive model that is as interpretable as CHADS2, but far more accurate [54].

Risk Factor                            Points
History of Congestive Heart Failure    1
History of Hypertension                1
Age ≥ 75                               1
Diabetes Mellitus                      1
Prior Stroke or TIA                    2
Total Points

Figure 1: The CHADS2 scoring system for predicting stroke risk in patients with atrial fibrillation [31]. This scoring system makes predictions using meaningful integer coefficients that are associated with well-known risk factors for strokes.

1.1.3 On Current Approaches to Interpretable Classification

In assessing the interpretability of popular classification models, it is useful to distinguish between transparent models, which clearly link variables to outcomes, and black-box models, which do not. Popular transparent classification models include linear models (addressed in this work), decision trees [75, 94, 77], decision lists [80, 54], and decision tables [49]. Popular black-box models include artificial neural networks [93], support vector machines [96], and ensemble models such as random forests [12] and AdaBoost [27]. A comprehensive review on the interpretability of various classification models can be found in [25].


Recent research in statistics and machine learning has primarily focused on designing accurate and scalable black-box models to address complex automation problems such as spam prediction and computer vision. In turn, the goal of creating interpretable predictive models, once recognized as a holy grail in the fields of expert systems and artificial intelligence, has been neglected over the last two decades. Some reasons for this oversight include: first, the fact that interpretability is a complex notion that is inherently difficult to incorporate into statistical models [48, 25]; second, the fact that there is often a (poorly understood) trade-off between accuracy and interpretability [11, 79, 24].

Most of the popular approaches for training classification models are only flexible enough to offer practitioners control over the sparsity of their models. Linear methods such as the lasso [90], the elastic net [104], and LARS [20, 40] allow practitioners to adjust the sparsity of models through ℓ1-regularization (i.e., penalizing the sum of the absolute values of the coefficients). Decision tree methods [75, 94, 77] offer practitioners the ability to produce simpler trees by pruning [76, 13]. In both cases, sparsity can be further induced using feature selection algorithms [36, 50, 62, 61, 91, 102]. Nevertheless, sparsity remains an incomplete and imperfect measure of interpretability, as models that are too sparse can be viewed as oversimplifying more complex situations [25, 83].

These approaches are disadvantageous to practitioners because they aim to build classifiers that are accurate and interpretable by optimizing convex proxy functions. These functions are limited in the way that they can address accuracy and interpretability. In light of the trade-off between accuracy and interpretability [11, 79, 24], methods that optimize proxy functions for accuracy and interpretability are only guaranteed to produce a pareto-optimal trade-off between the proxy functions. In other words, these methods cannot, by definition, guarantee the best possible trade-off between accuracy (as measured by the misclassification rate) and interpretability (as defined by the user). More importantly, perhaps, the use of proxy functions frequently relates the accuracy and interpretability of models to meaningless parameters that have to be set through cross-validation and grid search.

To illustrate these points, consider a state-of-the-art approach for producing sparse linear models using ℓ1-regularization, such as LARS [20, 37, 40, 29]. It is well known that this approach is only guaranteed to produce the correct sparse solution (i.e., the one that minimizes the ℓ0-norm) under very restrictive conditions that are rarely satisfied in practice [103, 58]. LARS can adjust the ℓ1-regularization parameter throughout its full range to obtain a regularization path [29, 37] containing models with all possible levels of sparsity. Even so, these models will not achieve the best possible trade-off between accuracy and sparsity, as the logistic loss is not heavily influenced by outliers while the ℓ1-norm produces a substantial amount of additional regularization on the coefficients at each level of sparsity along the path.

Although current approaches are poorly suited to handle interpretability, there has been some work on designing methods for building interpretable models.
Some of these methods aim to extract simple transparent models from black-box models such as neural networks [92], random forests [67, 95], and support vector machines [30, 65, 15]. There has also been work on improving the interpretability of transparent classification models. Carrizosa, Martín-Barragán and Romero-Morales suggest elegant ways to improve the interpretability of SVM classifiers by limiting coefficients to a very small set of meaningful values [16, 17, 14]. Other novel approaches to interpretable classification include: oracle coaching [44], which uses a highly accurate black-box model to guide the training of a less accurate transparent model; and prototype selection [7], which extracts a small set of representative samples that can illustrate the workings of any classification model. There is also classic work supporting the idea that simple interpretable models can be accurate [41], and that domain expertise can be used to construct accurate models [19].

1.1.4 On Mathematical Programming Approaches for Classification

Mathematical programming has been used to tackle classification problems since the 1960s. A comprehensive overview of the mathematical programming formulations used for classification can be found in Pardalos and Romeijn [73]. Mathematical programming offers a flexible framework that can tackle both binary and multi-class classification problems, and simultaneously address practical issues such as constraining the rate of misclassification. In this paper, we use a MIP framework that optimizes the 0–1 loss. From an accuracy perspective, our models use formulations that are similar to those of Mangasarian [60], Rubin [81], and Goldberg and Eckstein [35]. These formulations have been solved using decomposition approaches [82] and heuristic approaches [4]. To our knowledge, this is the first case in which a MIP framework has been suggested to directly address the interpretability of classifiers, and the first case in which decomposition has been used to incorporate non-linear loss functions and tackle large-scale classification problems in the way that we describe in Section 3.2.


1.2 Outline of Paper

Our paper is structured as follows. In Section 2, we introduce our approach for producing accurate and interpretable linear classification models using a MIP framework, and show how it can be used to build three novel classification models. In Section 3, we propose two novel methods that reduce the computation associated with training these models, and that can be implemented using off-the-shelf MIP software. In Section 4, we present theoretical bounds that allow users to assess the predictive accuracy of our classifiers.

2 Models

We aim to produce classifiers that balance accuracy and interpretability by solving an optimization problem with the following structure:

$$\min_{f} \;\; \text{Loss}(f) + C \cdot \text{InterpretabilityPenalty}(f) \quad \text{s.t.} \quad f \in \text{InterpretabilitySet}. \tag{1}$$

This structure allows us to control accuracy by means of a loss function, and to control interpretability by means of an interpretability penalty function and an interpretability set. The interpretability penalty acts as a soft constraint that induces interpretability-related qualities that are desirable but may be sacrificed to achieve greater accuracy. In contrast, the interpretability set acts as a hard constraint that ensures all classifiers meet certain baseline interpretability-related qualities. We control the trade-off between accuracy and interpretability using a regularization constant, C.

We consider linear classification models of the form y = f(x) = sign(λ · x), where y ∈ Y = {−1, 1} denotes a class label; x ∈ X ⊆ R^P denotes a P × 1 vector of features [x_1, ..., x_P]^T; and λ ∈ R^P denotes a P × 1 vector of coefficients [λ_1, ..., λ_P]^T. For the sake of clarity, we model an intercept term by assuming that the Pth component of each feature vector is equal to 1. We learn the coefficients by training our model on a dataset composed of N i.i.d. examples, D_N = {(x_i, y_i)}_{i=1}^N. The training process requires us to solve the following version of (1):

$$\min_{\lambda} \;\; \text{Loss}(\lambda; D_N) + C \cdot \Phi(\lambda) \quad \text{s.t.} \quad \lambda \in \mathcal{L}, \tag{2}$$

where Loss(λ; D_N): R^P × (X × Y)^N → R_+ denotes the loss function, Φ: R^P → R denotes the interpretability penalty function, and L denotes the interpretability set. We generally parameterize the loss function by the data, D_N, so that

$$\text{Loss}(\lambda; D_N) = \frac{1}{N} \sum_{i=1}^{N} \text{Loss}(\lambda; (x_i, y_i)).$$

Users can choose among different loss functions, interpretability penalties, and interpretability sets. In the remainder of this section, we discuss how users can choose these components to create a range of interpretable classification models, and how they may train these models by formulating and solving a mixed-integer program (MIP). We have purposefully sought to formulate our models in a way that is easy to understand and easy to implement. For the sake of clarity, we assume that the interpretability penalty and interpretability set are defined component-wise, so that

$$\Phi(\lambda) = \sum_{j=1}^{P} \Phi(\lambda_j)$$

and

$$\mathcal{L} = \left\{ \lambda \in \mathbb{R}^P \;\middle|\; \lambda_j \in \mathcal{L}_j \subseteq \mathbb{R} \text{ for } j = 1, \ldots, P \right\}.$$

This allows us to express each MIP in the form:

$$
\begin{aligned}
\min_{\lambda} \quad & \frac{1}{N} \sum_{i=1}^{N} \psi_i + \sum_{j=1}^{P} \Phi_j & \\
\text{s.t.} \quad & \psi_i = \text{Loss}(\lambda; (x_i, y_i)) && i = 1, \ldots, N \\
& \Phi_j = C \cdot \Phi(\lambda_j) && j = 1, \ldots, P \\
& \lambda_j \in \mathcal{L}_j && j = 1, \ldots, P
\end{aligned}
$$


Name          Loss Variable                             Loss Constraints
0–1           ψi = 1[yi λᵀxi ≤ 0]                       Mi ψi ≥ γ − yi λᵀxi
Hinge         ψi = max(0, 1 − yi λᵀxi)                  ψi ≥ 1 − yi λᵀxi
Quadratic     ψi = (1 − yi λᵀxi)²                       -
Logistic      ψi = log(1 + exp(−yi λᵀxi))               -
Exponential   ψi = exp(1 − yi λᵀxi)                     -

Table 1: Variables and constraints for popular loss functions.

These formulations typically contain auxiliary variables that allow users to debug models and set branching priorities. Although these formulations could be further refined to decrease solution times, such refinements are beyond the scope of this paper. We note that the solution time associated with any single formulation depends on a wide range of instance-dependent factors, such as the data used to train the model, the regularization parameters, the interpretability sets, and the MIP solver. Thus, it is unlikely that any single formulation will be guaranteed to outperform all others for all datasets, user-based parameters, and MIP solvers.

2.1 Choosing a Loss Function and Setting Regularization Parameters

Users can train our models using a wide range of loss functions, including discrete functions, such as the 0–1 loss, linear functions, such as the hinge loss, as well as any convex (linear or non-linear) function, such as the logistic loss and the exponential loss. Training models with discrete and linear loss functions requires users to add loss constraints into our MIP formulations, while training models with convex loss functions requires the use of the decomposition method described in Section 3.2. We highlight the shapes, variables and constraints associated with popular loss functions in Figure 2 and Table 1.

[Figure 2: Shapes of popular loss functions. The plot shows the loss ψi as a function of the margin yi xi · λ for the 0–1, hinge, quadratic, logistic, and exponential losses.]

We recommend that users train models with the 0–1 loss, and use a different loss function only if the training process becomes computationally challenging, or if they wish to train a model that produces conditional probability estimates (in which case they should use the logistic loss function). Our recommendation is based on two important reasons. First, the 0–1 loss produces a classifier that is robust to outliers and provides the best learning-theoretic guarantee for a finite hypothesis space. Second, when we use the 0–1 loss function in an optimization problem such as (2), the regularization parameter C represents the "price" of interpretability; that is, the percent accuracy that users are willing to sacrifice in order to achieve one unit gain in interpretability. This allows users to set the value of C in a purposeful way, understanding how it affects the accuracy-interpretability trade-off before training a model. More specifically, using the 0–1 loss function immediately restricts the range of interesting values of C to [1/N, 1 − 1/N]:

• When C < 1/N, the objective of (2) penalizes each misclassification more than each unit of interpretability. Thus, the classifier produced by optimizing this objective will not misclassify even one example to improve interpretability, and will therefore achieve the highest possible training accuracy.

• When C > 1 − 1/N, the objective of (2) penalizes each unit of interpretability more than it penalizes a misclassification on all N examples. Thus, the classifier produced by optimizing this objective may misclassify all N points if doing so achieves a higher level of interpretability.

In what follows, we present MIP formulations that use the 0–1 loss. Users may adapt these formulations to include other loss functions by changing the set of N loss constraints shown in Table 1, or by using the decomposition method from Section 3.2.

2.2 Useful Interpretability Sets

The process of building interpretable models often requires an approach that can train models that satisfy different hard constraints [64, 66]. Interpretability sets, L, are a natural way to incorporate these constraints. In what follows, we present several examples of interpretability sets that can be used to build well-known interpretable models and/or improve the interpretability of our models.

2.2.1 Building Scoring Systems

Scoring systems are practical and interpretable prediction models that are primarily used to assess the risk of medical outcomes in hospitals [3, 70, 31], but have also been used to predict the incidence of violent crimes [2, 86], gauge marine safety for military vessels [18], and rate business schools [22]. Scoring systems allow users to make quick, hands-on predictions by simply adding, subtracting, and multiplying a few meaningful numbers. We can build these models by restricting coefficients to integers bounded by Λ:

$$\mathcal{L}_j = \{\lambda_j \in \mathbb{Z} \cap [-\Lambda, \Lambda]\}.$$
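This interpretability set maps directly onto integer decision variables in a MIP modeler. The following is a minimal sketch using the open-source PuLP package (an illustrative choice on our part; the paper itself assumes only generic off-the-shelf MIP software), with hypothetical values for Λ and P:

```python
import pulp

Lambda = 10   # hypothetical coefficient bound
P = 6         # hypothetical number of features (the last one is the intercept)

# bounded integer coefficient variables, one per feature:
lam = [pulp.LpVariable(f"lam_{j}", lowBound=-Lambda, upBound=Lambda,
                       cat=pulp.LpInteger) for j in range(P)]
```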

2.2.2 Building M-of-N Rule Tables

M-of-N rule tables are highly interpretable classification models that make predictions based on a collection of rules. Given a list of N total rules, these models predict y = +1 if at least M of the N rules are satisfied. Although M-of-N rule tables were originally proposed as models that could be extracted from neural nets [92], we can also extract them from linear models if we are given data containing only binary features, x_j ∈ {0, 1} (or if we use an approach where we first binarize the features, as in Section 2.5). We can reproduce an M-of-N rule table by restricting the first P − 1 coefficients to the sets

$$\mathcal{L}_j = \{\lambda_j \in \mathbb{Z} \cap [0, 1]\} \quad j = 1, \ldots, P - 1,$$

and restricting the intercept term (i.e., the Pth term) to the set

$$\mathcal{L}_P = \{\lambda_P \in \mathbb{Z} \cap [-P, 0]\}.$$

Training a model with these interpretability sets will produce a model of the form

$$\hat{y} = \text{sign}\left(\sum_{j=1}^{P-1} \lambda_j x_j + \lambda_P\right).$$

This is equivalent to an M-of-N rule table where the total number of rules in the model is determined by the number of non-zero coefficients, so that N = Σ_{j=1}^{P−1} λ_j, and where the number of rules required to make a positive prediction is determined by the value of the intercept, so that M = −λ_P + 1.
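As a concrete illustration, a trained coefficient vector of this form can be read back as an M-of-N table. The helper below is hypothetical (not from the paper) and assumes the coefficient layout above:

```python
def as_m_of_n_table(lam, rule_names):
    """Read an M-of-N rule table off a trained model whose first P-1
    coefficients lie in {0, 1} and whose intercept lies in {-P, ..., 0}."""
    *rule_coefs, intercept = lam
    rules = [name for c, name in zip(rule_coefs, rule_names) if c == 1]
    n_total = len(rules)               # N = sum of the non-zero rule coefficients
    m_required = 1 - int(intercept)    # M = -lambda_P + 1
    return m_required, n_total, rules

# example: lam = [1, 0, 1, 1, -2] yields "predict +1 if at least 3 of 3 rules hold"
```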

2.2.3 Incorporating Sign-Constraints

The interpretability of a linear classification model can suffer significantly when the signs of its coefficients do not match the intuition or background knowledge of domain experts, which may happen due to dependent relationships between variables [25, 74, 97, 66]. We can constrain the signs of coefficients to build models that match established relationships between the features and the outcome variable.


Suppose that we want the coefficients with indices in the set J_pos to be non-negative, the coefficients with indices in the set J_neg to be non-positive, and the remaining coefficients in the set J_free to take on either sign. We may then restrict each coefficient j to the set

$$\mathcal{L}_j = \begin{cases} \{\lambda_j \in \mathbb{Z} \cap [0, \Lambda]\} & \text{if } j \in J_{pos} \\ \{\lambda_j \in \mathbb{Z} \cap [-\Lambda, 0]\} & \text{if } j \in J_{neg} \\ \{\lambda_j \in \mathbb{Z} \cap [-\Lambda, \Lambda]\} & \text{if } j \in J_{free} \end{cases}$$

These sets can be added to a MIP formulation using simple lower or upper bound constraints on the coefficient variables λ_j. Sign-constrained formulations can have the side effect of improved computational performance, since they narrow the feasible region of the MIP. In addition, correct prior knowledge about the signs of the coefficients may result in a more accurate predictive model.
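In a MIP modeler, these sets reduce to variable bounds. A hypothetical helper, continuing the PuLP sketch from Section 2.2.1:

```python
def coefficient_bounds(j, J_pos, J_neg, Lambda=10):
    """Bounds implementing the sign-constrained sets L_j above
    (hypothetical helper; Lambda = 10 is an assumed default)."""
    if j in J_pos:
        return 0, Lambda        # non-negative coefficient
    if j in J_neg:
        return -Lambda, 0       # non-positive coefficient
    return -Lambda, Lambda      # free coefficient

# usage with the integer variables from Section 2.2.1:
# lo, hi = coefficient_bounds(j, J_pos, J_neg)
# lam_j = pulp.LpVariable(f"lam_{j}", lowBound=lo, upBound=hi, cat=pulp.LpInteger)
```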

2.2.4 Restricting Coefficients to Any Discrete Set

Given that interpretability is subjective, users may wish to restrict the coefficients of their models to a discrete set of their choosing. We allow users to restrict the coefficient λ_j to one of K_j distinct values from the set L_j = {l_{j,1}, ..., l_{j,K_j}} using a one-of-K constraint. This requires defining K_j binary variables, u_{j,k} ∈ {0, 1}, and adding the following two constraints to our MIP formulation:

$$\lambda_j = \sum_{k=1}^{K_j} l_{j,k}\, u_{j,k}, \qquad \sum_{k=1}^{K_j} u_{j,k} \le 1.$$

Generalized interpretability sets are also useful when the features of a dataset have wildly different orders of magnitude. In such cases, we might want a model similar to: predict violent crime in a neighborhood if sign(0.0001 #residents − 3 #parks + 60 #thefts_last_year) > 0. Here, having coefficients with only one significant digit draws attention to the units of the different features by, for instance, clarifying that the values for residents are much larger than those for parks. To restrict coefficient λ_j to values with one significant digit between 10^{-3} and 900, we can define L_j as:

$$\mathcal{L}_j = \left\{\lambda_j \;\middle|\; \lambda_j = d \times 10^E,\;\; d \in \{0, \pm 1, \pm 2, \ldots, \pm 9\},\;\; E \in \{-3, -2, -1, 0, 1, 2\}\right\}.$$
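A sketch of the one-of-K construction in PuLP (again an illustrative choice of modeler; `prob` is assumed to be an existing `pulp.LpProblem`):

```python
import pulp

def one_of_k_coefficient(prob, name, values):
    """Restrict a coefficient to one value from a discrete set L_j via the
    one-of-K constraints above. If every indicator is zero, the coefficient
    is zero, which is how this formulation permits sparsity."""
    u = [pulp.LpVariable(f"{name}_u{k}", cat=pulp.LpBinary)
         for k in range(len(values))]
    lam = pulp.LpVariable(name, lowBound=min(values + [0]),
                          upBound=max(values + [0]))
    prob += lam == pulp.lpSum(v * uk for v, uk in zip(values, u))
    prob += pulp.lpSum(u) <= 1
    return lam, u

# the one-significant-digit set between -900 and 900, as in the crime example:
digits = sorted({d * 10 ** e for d in range(-9, 10) for e in range(-3, 3)})
```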

2.3 Personalized Interpretable Linear Models

A Personalized Interpretable Linear Model (PILM) is designed to let users produce interpretable models that use their own, personalized, definition of interpretability. To use this model, users must first define R interpretability sets,

$$\mathcal{L}^r = \{l_{r,1}, \ldots, l_{r,K_r}\} \quad \text{for } r = 1, \ldots, R, \tag{3}$$

as well as an interpretability penalty function that penalizes coefficient j by C_r when λ_j ∈ L^r_j:

$$\Phi(\lambda_j) = \begin{cases} C_1 & \text{if } \lambda_j \in \mathcal{L}^1_j \\ \;\vdots & \\ C_R & \text{if } \lambda_j \in \mathcal{L}^R_j \end{cases} \tag{4}$$

Users can define these elements in any way, so long as (1) the interpretability sets L^1, ..., L^R are mutually exclusive, (2) L^r is more interpretable than L^{r+1}, and (3) the regularization parameters are monotonically increasing in r, so that 0 < C_1 < ... < C_R. These assumptions ensure that the interpretability penalty function favors coefficients that belong to more interpretable sets. Note that we have assumed that each coefficient can belong to the same interpretability sets, so that L^r_j = L^r for r = 1, ..., R and j = 1, ..., P. This assumption can be dropped as needed.

When the 0–1 loss function is used, the value of C_r represents the minimum loss of training accuracy required to use a coefficient from L^r_j. This implies that a coefficient λ_j in the final classifier will belong to L^{r+1}_j only if there exists at least one coefficient in L^{r+1}_j that yields a gain in accuracy of more than C_{r+1} − C_r. This convenient fact allows us to purposefully set the values of C_r beforehand, if desired. Users can further customize these models to induce sparsity by adding an additional interpretability level, L^0_j = {0}. In addition, we advise users to improve the interpretability, computational performance, and generalization of their models by restricting coefficients to a coprime set, which can be done by including an ℓ1-term in the interpretability penalty function with a suitably chosen tiny coefficient (as described in Sections 2.4 and 4.2).


2.3.1 PILM MIP Formulation

We can train PILM with the 0–1 loss function by solving the following MIP:

$$
\begin{aligned}
\min_{\lambda,\psi,\Phi,u,s} \quad & \frac{1}{N}\sum_{i=1}^{N}\psi_i + \sum_{j=1}^{P}\Phi_j & & \\
\text{s.t.} \quad & M_i \psi_i \ge \gamma - \sum_{j=1}^{P} y_i \lambda_j x_{i,j} && i = 1,\ldots,N && \text{0--1 loss} && (5)\\
& \lambda_j = \sum_{r=1}^{R}\sum_{k=1}^{K_r} l_{r,k}\, u_{j,k,r} && j = 1,\ldots,P && \text{coefficient values} && (6)\\
& \Phi_j = \sum_{r=1}^{R} C_r\, s_{j,r} && j = 1,\ldots,P && \text{int. penalty} && (7)\\
& s_{j,r} = \sum_{k=1}^{K_r} u_{j,k,r} && j = 1,\ldots,P,\; r = 1,\ldots,R && \text{int. level indicators} && (8)\\
& \sum_{r=1}^{R} s_{j,r} = 1 && j = 1,\ldots,P && \text{one int. level per coef.} && (9)\\
& \psi_i \in \{0,1\} && i = 1,\ldots,N && \text{0--1 loss indicators} \\
& s_{j,r} \in \{0,1\} && j = 1,\ldots,P,\; r = 1,\ldots,R && \text{int. level indicators} \\
& u_{j,k,r} \in \{0,1\} && j = 1,\ldots,P,\; r = 1,\ldots,R,\; k = 1,\ldots,K_r && \text{coef. value indicators}
\end{aligned}
$$

Here, the loss variables ψi = 1[yi λᵀxi ≤ 0] are binary indicator variables that indicate a misclassification. The loss constraints (5) use a Big-M technique to ensure that ψi is set to 1 if the classifier λ makes a mistake on example i. This is a standard formulation for the 0–1 loss that depends on the user-defined scalar parameters γ and Mi; we typically set the "margin" parameter to γ = 0.1, and set Mi = γ − min_{λ∈L} yi λᵀxi. Note that the loss constraints in (5) avoid the numerical issues associated with the Big-M technique, as we can calculate an exact value for each Mi.

Constraints (6) restrict λj to a value in L^r_j by means of a one-of-K formulation. The u_{j,k,r} are binary variables that are set to 1 if λj is equal to l_{r,k}. We ensure that λj can only take a single value from a single interpretability set by limiting the sum of the u_{j,k,r} to 1 using constraints (8) and (9). The variables Φj are auxiliary variables that represent the interpretability penalty for each coefficient λj. Constraints (8) and (9) ensure that Φj is equal to Cr if and only if λj belongs to the interpretability set L^r_j, by restricting s_{j,r} to 1 when λj ∈ L^r_j.

2.4 Supersparse Linear Integer Models

A Supersparse Linear Integer Model (SLIM) is a special case of (2) that is designed to produce accurate, interpretable, and practical scoring systems. This model can be expressed as:

$$\min_{\lambda} \;\; \text{Loss}(\lambda; D_N) + C_0 \|\lambda\|_0 + \epsilon \|\lambda\|_1 \quad \text{s.t.} \quad \lambda \in \mathcal{L}. \tag{10}$$

SLIM uses an interpretability penalty that combines an ℓ0-penalty with a small ℓ1-penalty. Here, the ℓ0-penalty is used to tune the sparsity of a classifier. In contrast, the ℓ1-penalty is used to discard the large number of equivalent classifiers that arise when we induce sparsity with an ℓ0-penalty. To illustrate this point, consider a classifier such as ŷ = sign(x1 + x2). If the objective in (10) only minimized the 0–1 loss with an ℓ0-penalty, then classifiers such as ŷ = sign(2x1 + 2x2) or ŷ = sign(3x1 + 3x2) would attain the same objective value as ŷ = sign(x1 + x2), because they make the same predictions and use the same number of features. Thus, we include a tiny ℓ1-penalty in the objective of (10) so that SLIM chooses the classifier with the smallest and most interpretable coefficients within this equivalence class of solutions: ŷ = sign(x1 + x2). Note that the solution to (10) is not necessarily unique. To see this, consider a case in R²: x1 = (−1, 1) with y1 = +1, and x2 = (1, −1) with y2 = −1. Here, SLIM produces two optimal classifiers when coefficients are restricted to integers: λ* = (−1, 0) and λ** = (0, 1).

Although users can train SLIM using any of the loss functions described in Section 2.1 and any of the interpretability sets described in Section 2.2, we typically train SLIM with the 0–1 loss function and an interpretability set that restricts each coefficient to a set of small integers, such as:

$$\mathcal{L} = \left\{\lambda \in \mathbb{Z}^P \;\middle|\; |\lambda_j| \le 20 \text{ for } j = 1, \ldots, P\right\}.$$

These choices produce scoring systems that are practical enough for hands-on prediction and that achieve the best possible trade-off between accuracy and sparsity among linear models with coefficients in L. The use of the 0–1 loss function also results in meaningful regularization parameters. Note that, when we use the 0–1 loss in (10), C0 represents the minimum decrease in training error required to include a single feature in the final classifier. This restricts the interesting values of C0 to the range:

$$C_0 \in \left[\frac{1}{N}, \; 1 - \frac{1}{N}\right].$$

For any given C0 and L, users can then set ε to ensure that the maximum value of the ℓ1-penalty in SLIM's objective (i.e., ε max_{λ∈L} ‖λ‖1) is smaller than the unit value of accuracy in the objective (i.e., 1/N) as well as the unit value of sparsity in the objective (i.e., C0). That is, choosing

$$\epsilon \in \left(0, \; \frac{\min(1/N,\, C_0)}{\max_{\lambda \in \mathcal{L}} \|\lambda\|_1}\right)$$

ensures that SLIM will never sacrifice training accuracy or sparsity in order to reduce the ℓ1-norm of λ.
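The bound on ε can be computed directly. A hypothetical helper, assuming the box set L = {λ ∈ Zᴾ : |λj| ≤ Λ}, so that max_{λ∈L} ‖λ‖1 = ΛP:

```python
def slim_epsilon(C0, N, Lambda, P):
    """Largest safe l1 coefficient for SLIM: eps must keep
    eps * max ||lam||_1 below both 1/N and C0. Any value strictly below
    the bound works; we halve it for safety."""
    return 0.5 * min(1.0 / N, C0) / (Lambda * P)
```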

2.4.1 SLIM MIP Formulation

We can train SLIM with the 0–1 loss function by solving the following MIP:

$$
\begin{aligned}
\min_{\lambda,\psi,\Phi,\alpha,\beta} \quad & \frac{1}{N}\sum_{i=1}^{N}\psi_i + \sum_{j=1}^{P}\Phi_j & & \\
\text{s.t.} \quad & M_i \psi_i \ge \gamma - \sum_{j=1}^{P} y_i \lambda_j x_{i,j} && i = 1,\ldots,N && \text{0--1 loss} && (11)\\
& \lambda_j \in \mathcal{L}_j && j = 1,\ldots,P && \text{coefficient values} \\
& \Phi_j = C_0 \alpha_j + \epsilon \beta_j && j = 1,\ldots,P && \text{int. penalty} && (12)\\
& \Lambda_j \alpha_j \ge \lambda_j && j = 1,\ldots,P && \ell_0\text{-norm \#1} && (13)\\
& \Lambda_j \alpha_j \ge -\lambda_j && j = 1,\ldots,P && \ell_0\text{-norm \#2} && (14)\\
& \beta_j \ge \lambda_j && j = 1,\ldots,P && \ell_1\text{-norm \#1} && (15)\\
& \beta_j \ge -\lambda_j && j = 1,\ldots,P && \ell_1\text{-norm \#2} && (16)\\
& \psi_i \in \{0,1\} && i = 1,\ldots,N && \text{0--1 loss indicators} \\
& \Phi_j \in \mathbb{R}_+ && j = 1,\ldots,P && \text{int. penalty values} \\
& \alpha_j \in \{0,1\} && j = 1,\ldots,P && \ell_0 \text{ indicators} \\
& \beta_j \in \mathbb{R}_+ && j = 1,\ldots,P && \text{abs. value variables}
\end{aligned}
$$

Here, the loss variables ψi = 1[yi λᵀxi ≤ 0] and the constraints in (11) behave in the same way as those in the PILM MIP formulation (see Section 2.3.1). The interpretability penalty variables Φj are set to C0 αj + ε βj in constraints (12). Here, αj = 1[λj ≠ 0] is a binary variable that indicates a non-zero coefficient, and βj = |λj| is a real variable that represents the absolute value of each coefficient. The values of these variables are governed by constraints (13) to (16).
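To make the formulation concrete, the following is a minimal end-to-end sketch of this MIP in PuLP with its bundled CBC solver (an illustrative modeler choice, not the software used in the paper). It assumes a single coefficient bound Λ for all j and that the last column of X is the intercept column of ones:

```python
import numpy as np
import pulp

def train_slim(X, y, C0=0.01, Lambda=10, gamma=0.1):
    """A minimal sketch of the SLIM MIP (11)-(16). X is an N x P array whose
    last column is all ones; y contains labels in {-1, +1}."""
    N, P = X.shape
    eps = 0.5 * min(1.0 / N, C0) / (Lambda * P)     # l1 coefficient, Section 2.4
    M = gamma + Lambda * np.abs(X).sum(axis=1)      # exact big-M constants

    prob = pulp.LpProblem("SLIM", pulp.LpMinimize)
    lam = [pulp.LpVariable(f"lam_{j}", -Lambda, Lambda, pulp.LpInteger)
           for j in range(P)]
    psi = [pulp.LpVariable(f"psi_{i}", cat=pulp.LpBinary) for i in range(N)]
    alpha = [pulp.LpVariable(f"alpha_{j}", cat=pulp.LpBinary) for j in range(P)]
    beta = [pulp.LpVariable(f"beta_{j}", lowBound=0) for j in range(P)]

    # objective: 0-1 loss plus the interpretability penalty C0*l0 + eps*l1
    prob += (pulp.lpSum(psi) / float(N)
             + pulp.lpSum(C0 * alpha[j] + eps * beta[j] for j in range(P)))

    for i in range(N):   # (11): big-M encoding of the 0-1 loss
        prob += (M[i] * psi[i]
                 >= gamma - pulp.lpSum(y[i] * X[i, j] * lam[j] for j in range(P)))
    for j in range(P):   # (13)-(16): alpha_j = 1[lam_j != 0], beta_j = |lam_j|
        prob += Lambda * alpha[j] >= lam[j]
        prob += Lambda * alpha[j] >= -lam[j]
        prob += beta[j] >= lam[j]
        prob += beta[j] >= -lam[j]

    prob.solve()
    return np.array([v.varValue for v in lam])
```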

2.5 Threshold-Rule Interpretable Linear Models

A Threshold-Rule Interpretable Linear Model (TILM) is an interpretable linear classification model that makes decisions using a set of derived threshold rules. Given a real-valued feature such as age, a threshold rule is a binary-valued feature of the form

$$h = \mathbb{1}[\text{age} \ge 25] = \begin{cases} 1 & \text{if age} \ge 25 \\ 0 & \text{if age} < 25 \end{cases}$$

These types of features are often used in medical scoring systems [53] and may also be used to create M-of-N rule tables [92]. Given that these models are composed of non-linear features, they are likely to exhibit improved performance when the data contains non-linear relationships. Further, given that threshold features are binary-valued, these models are also likely to avoid the issues that may arise when users choose interpretability sets for features with wildly different magnitudes (as in the example from Section 2.2.4).

To train a TILM classifier, we first generate T_j new threshold rules for each real-valued feature x_j in the data. Each threshold rule for x_j has the form h_{j,t} = 1[x_j ≥ v_{j,t}]. Here, the thresholding operation is carried out component-wise for each example i, and v_{j,t} represents a user-defined threshold value for feature j. These values can either be chosen by hand (e.g., the thresholds for a feature such as bmi can be chosen to correspond to values that are meaningful for doctors, such as 15, 18.5, 25 and 30) or in a non-parametric way (e.g., by setting v_{j,1}, ..., v_{j,4} to the 20th, 40th, 60th, and 80th percentiles of the distribution of bmi). Note that while there exist an infinite number of thresholds for a real-valued feature, users need only consider a total of N + 1 distinct values for the thresholds: one value strictly less than min_i x_{i,j}; one value strictly greater than max_i x_{i,j}; and N − 1 values between each of the remaining data points. Using any other values will lead to the same matrix of threshold rules, and will therefore produce the same TILM classifier. Once we have obtained a set of threshold rules h_{j,t} ∈ {0, 1}^N for each of the P features, we train a linear model of the form:

$$y = \text{sign}\left(\sum_{j=1}^{P} \sum_{t=1}^{T_j} \lambda_{j,t}\, h_{j,t}\right) \tag{17}$$

using the following optimization problem:

$$\min_{\lambda} \;\; \text{Loss}(\lambda; D_N) + C_f \cdot \#\text{Features} + C_t \cdot \#\text{Threshold Rules per Feature} + \epsilon \|\lambda\|_1 \quad \text{s.t.} \quad \lambda \in \mathcal{L}. \tag{18}$$

Here, we use an interpretability penalty that penalizes each feature that is included in the final model (to ensure that our final model uses a small number of distinct data sources), the number of threshold rules per feature (to ensure that our model does not overfit and exhibits an understandable relationship), and the ℓ1-norm of the coefficients (to restrict coefficients to coprime values, as discussed in Section 2.4). As before, our interpretability constraints can include restrictions on the signs and values of the coefficients. In this model, however, we can also enforce a monotonicity constraint to further improve the interpretability of the model, by ensuring that each feature is related to the outcome in a single, distinct way. The works of Carrizosa et al. [16] and Goh and Rudin [34] take different but related approaches to the problem of constructing threshold classifiers.
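The threshold rules themselves are simple to construct. A sketch in NumPy of the non-parametric option described above (the helper name is ours, not the paper's):

```python
import numpy as np

def threshold_rules(x, n_thresholds=4):
    """Generate the binary threshold rules h_{j,t} = 1[x_j >= v_{j,t}] for one
    real-valued feature, placing thresholds at evenly spaced percentiles."""
    qs = np.linspace(0, 100, n_thresholds + 2)[1:-1]     # 20th, 40th, 60th, 80th
    thresholds = np.percentile(x, qs)
    H = (x[:, None] >= thresholds[None, :]).astype(int)  # N x T_j rule matrix
    return H, thresholds

# hand-chosen thresholds, as in the bmi example, work the same way:
# H = (bmi[:, None] >= np.array([15.0, 18.5, 25.0, 30.0])[None, :]).astype(int)
```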

2.5.1 TILM MIP Formulation

We can train TILM with the 0–1 loss function by solving the following MIP:

$$
\begin{aligned}
\min_{\lambda,\psi,\Phi,\alpha,\beta,\nu,\tau} \quad & \frac{1}{N}\sum_{i=1}^{N}\psi_i + \sum_{j=1}^{P}\Phi_j & & \\
\text{s.t.} \quad & M_i \psi_i \ge \gamma - \sum_{j=1}^{P}\sum_{t=1}^{T_j} y_i\, h_{i,j,t}\, \lambda_{j,t} && i = 1,\ldots,N && \text{0--1 loss} && (19)\\
& \lambda_{j,t} \in \mathcal{L}_j && j = 1,\ldots,P,\; t = 1,\ldots,T_j && \text{coefficient values} \\
& \Phi_j = C_f \nu_j + C_t \tau_j + \epsilon \sum_{t=1}^{T_j} \beta_{j,t} && j = 1,\ldots,P && \text{int. penalty} && (20)\\
& T_j \nu_j \ge \sum_{t=1}^{T_j} \alpha_{j,t} && j = 1,\ldots,P && \text{feature use} && (21)\\
& \tau_j \ge \sum_{t=1}^{T_j} \alpha_{j,t} - 1 && j = 1,\ldots,P && \text{thresholds/feature} && (22)\\
& \Lambda_j \alpha_{j,t} \ge \lambda_{j,t} && j = 1,\ldots,P,\; t = 1,\ldots,T_j && \ell_0\text{-norm \#1} && (23)\\
& \Lambda_j \alpha_{j,t} \ge -\lambda_{j,t} && j = 1,\ldots,P,\; t = 1,\ldots,T_j && \ell_0\text{-norm \#2} && (24)\\
& \beta_{j,t} \ge \lambda_{j,t} && j = 1,\ldots,P,\; t = 1,\ldots,T_j && \ell_1\text{-norm \#1} && (25)\\
& \beta_{j,t} \ge -\lambda_{j,t} && j = 1,\ldots,P,\; t = 1,\ldots,T_j && \ell_1\text{-norm \#2} && (26)\\
& \psi_i \in \{0,1\} && i = 1,\ldots,N && \text{0--1 loss indicators} \\
& \Phi_j \in \mathbb{R}_+ && j = 1,\ldots,P && \text{int. penalty values} \\
& \alpha_{j,t} \in \{0,1\} && j = 1,\ldots,P,\; t = 1,\ldots,T_j && \ell_0 \text{ indicators} \\
& \beta_{j,t} \in \mathbb{R}_+ && j = 1,\ldots,P,\; t = 1,\ldots,T_j && \ell_1 \text{ values} \\
& \nu_j \in \{0,1\} && j = 1,\ldots,P && \text{feature use indicators} \\
& \tau_j \in \mathbb{Z}_+ && j = 1,\ldots,P && \text{thresholds/feature values}
\end{aligned}
$$

Here, the loss variables ψi = 1[yi λᵀxi ≤ 0] and the constraints in (19) behave in the same way as those in the PILM MIP formulation (see Section 2.3.1). The interpretability penalty variables Φj are set to Cf νj + Ct τj + ε Σ_{t=1}^{T_j} β_{j,t} in constraints (20). The regularization parameter Cf can be used to tune the number of features involved in the TILM model; the regularization parameter Ct can be used to tune the number of threshold rules per feature; and ε can be set to a small value in order to produce models with coprime coefficients. The variables used in the interpretability penalty include: νj, a binary variable that indicates whether the classifier contains a non-zero coefficient for a threshold rule derived from the feature xj; τj, a discrete variable that represents the number of additional threshold rules (beyond the first) derived from the feature xj; and β_{j,t} = |λ_{j,t}|, which represents the absolute value of the coefficient associated with each threshold rule h_{j,t}. The values of νj and τj are set using constraints (21) and (22); both variables are determined by the value of α_{j,t} = 1[λ_{j,t} ≠ 0], which indicates whenever a given threshold rule has a non-zero coefficient.

2.6 Practical Extensions

2.6.1 Imbalanced Datasets

Classification models are often used to detect rare events, such as heart attacks or violent crimes. In these cases, training a classifier that maximizes classification accuracy often produces a degenerate classifier (e.g., if the probability of a heart attack is 1%, a degenerate classifier that never predicts a heart attack is 99% accurate). Our framework can avoid producing such classifiers by adjusting the loss function with class-based weights, W⁺ and W⁻. This involves replacing the loss function in our formulations with:

$$\text{Loss}(\lambda; D_N) = \frac{W^+}{|I^+|} \sum_{i \in I^+} \text{Loss}(\lambda; (x_i, y_i)) + \frac{W^-}{|I^-|} \sum_{i \in I^-} \text{Loss}(\lambda; (x_i, y_i)),$$

where I⁺ = {i : yi = +1} and I⁻ = {i : yi = −1}. This change can easily be incorporated into the MIP formulations for our models by replacing each ψi with W⁺ψi if yi = +1, or W⁻ψi if yi = −1. Loss functions that require the decomposition method in Section 3.2 can either produce separate supporting hyperplanes for each class, or a single "weighted" supporting hyperplane.

When we train models by optimizing the 0–1 loss function, we can restrict the values of W⁺ and W⁻ to a bounded range. In particular, if we assume that

$$W^+ + W^- = 2,$$

then we can limit the values of the class-based weights to

$$W^+ \in \left[\frac{2}{1 + |I^+|}, \; \frac{2|I^-|}{1 + |I^-|}\right].$$

Setting W⁺ to a value strictly less than 2/(1 + |I⁺|) will produce a classifier that will not sacrifice any accuracy on negative examples to improve the accuracy on positive examples. In contrast, setting W⁻ to a value strictly greater than 2|I⁻|/(1 + |I⁻|) will produce a classifier that will not sacrifice any accuracy on positive examples to improve the accuracy on negative examples. Clearly, setting W⁺ = W⁻ = 1 reproduces our default classifier (i.e., one that minimizes the total misclassification rate). Alternatively, we can set W⁺ = 2|I⁻|/N and W⁻ = 2|I⁺|/N to produce a classifier that is likely to achieve the same degree of accuracy on the positive examples as it does on the negative examples.
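The balanced weights at the end of this section can be computed directly from the class counts. A small sketch (the helper name is hypothetical):

```python
import numpy as np

def balanced_class_weights(y):
    """Weights with W+ + W- = 2 that aim for similar accuracy on each class:
    W+ = 2|I-|/N and W- = 2|I+|/N, as described above."""
    N = len(y)
    n_pos = int(np.sum(y == 1))
    n_neg = N - n_pos
    return 2.0 * n_neg / N, 2.0 * n_pos / N  # (W+, W-)
```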

2.6.2 Feature-Based Preferences

Practitioners may sometimes prefer that their models use certain features as opposed to others. Our framework can handle this by (1) choosing an interpretability penalty function, Φ(λ), that includes a term for the number of features in the final model, ‖λ‖0, and (2) adjusting the feature-selection regularization parameter, C0,j, for each coefficient. In general, if a user wishes for a model to use feature j instead of feature k, then that user can set C0,k = C0,j + ε, where ε > 0 is a suitably chosen small constant. Using a slightly larger regularization parameter for feature k will ensure that λj is included in the model as opposed to λk, provided that both achieve the same gain in accuracy.


2.6.3 Missing Data

The previous approach can also be used to deal with missing data. Consider a feature xj ∈ R^N that contains M < N missing points. Instead of dropping these examples, users can impute the missing values. Given that features with more missing points are less informative, users can adjust the regularization parameter for xj as follows:

$$C_{0,j} = \frac{N}{N - M}\, C_0.$$

This adjustment ensures that features with many imputed values are penalized more heavily than features with fewer imputed values. The adjustment factor is chosen so that: if M = 0 then C0,j = C0; if M = N/2 then C0,j = 2C0; and if M = N then C0,j = ∞ and the coefficient is dropped entirely.
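A small sketch of this adjustment (the helper name is hypothetical):

```python
def adjusted_sparsity_penalty(C0, n_missing, N):
    """C_{0,j} = N/(N - M) * C0; a feature that is entirely missing
    receives an infinite penalty and is dropped from the model."""
    if n_missing >= N:
        return float("inf")
    return C0 * N / (N - n_missing)
```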

2.6.4 Relaxing Hard Interpretability Constraints

All of the models that we have presented constrain coefficients to a discrete interpretable set using constraints of the form λj ∈ Lj. Restricting coefficients in this way is beneficial because it allows us to bound the predictive accuracy of classifiers (see Section 4). In theory, however, constraining coefficients to a discrete interpretability set may limit the accuracy of classifiers for two reasons: first, the coefficients within the set may be poorly matched to each feature (e.g., if the features have wide-ranging scales); second, the interpretability set may impose a discrete correlation structure between variables. Given that both of these problems are difficult to identify before training a model, users may wish to train models using formulations that relax the interpretability constraints.

In this approach, we effectively transform the interpretability constraint from a hard constraint into a soft constraint. Specifically, we allow each coefficient λj to take either an interpretable value from the discrete set Lj = {l_{j,1}, ..., l_{j,K_j}}, or a real value from the bounded interval [Λ̲j, Λ̄j]. To penalize models that use real-valued coefficients, we define an indicator variable rj, which is equal to 0 if λj ∈ Lj and equal to 1 if λj ∈ [Λ̲j, Λ̄j]. We then add a penalty term of Σ_{j=1}^P rj to the interpretability penalty. Ensuring that λj belongs to either the discrete set or the real interval can then be achieved by adding the following constraints to the MIP formulation:

$$
\begin{aligned}
& \lambda_j = \rho_j + \mu_j && j = 1,\ldots,P && \text{each coef. is discrete or continuous} \\
& \mu_j = \sum_{k=1}^{K_j} l_{j,k}\, u_{j,k} && j = 1,\ldots,P && \text{value of a discrete coef.} \\
& 1 - r_j = \sum_{k=1}^{K_j} u_{j,k} && j = 1,\ldots,P && \text{each coef. is either discrete or continuous} \\
& \rho_j \le r_j\, \overline{\Lambda}_j && j = 1,\ldots,P && \text{boundedness of continuous coefs.} \\
& \rho_j \ge r_j\, \underline{\Lambda}_j && j = 1,\ldots,P && \text{boundedness of continuous coefs.} \\
& r_j \in \{0,1\}, \;\; \mu_j \in \mathbb{R}, \;\; \rho_j \in \mathbb{R} && j = 1,\ldots,P \\
& u_{j,k} \in \{0,1\} && j = 1,\ldots,P,\; k = 1,\ldots,K_j
\end{aligned}
$$

Formulations of PILM, SLIM, and TILM with relaxed interpretability constraints are included in the Appendix. Users should note that relaxing the interpretability constraints will increase the computational cost of training for two reasons: first, this approach requires that we add at least 3P new variables¹ (rj, ρj, μj for j = 1, ..., P) and 3P new constraints to our MIP formulation; second, the approach requires solving a mixed-integer linear program as opposed to a pure integer program.

¹ Only models that already use one-of-K constraints require just these 3P additions; models that did not use one-of-K constraints will require an additional Σ_{j=1}^P Kj new variables and 2P new constraints to use this approach.


3 Methods

In this section, we present two novel methods to decrease the computational cost when training an interpretable model using mixed-integer programs. The first method, which we refer to as reduction, is a one-time procedure that reduces the size of the data used to train an interpretable model by removing examples that will be classified in a predictable way. The second method, which we refer to as decomposition, is an iterative procedure that can tackle intractable problems by building a piecewise linear approximation of the loss function.

3.1 Reduction

Reduction is a one-time procedure that decreases the computational burden of training a model by reducing the size of the data. The idea is to filter the data using many convex optimization problems before the MIP computation: given the initial data, D_N = {(x_i, y_i)}_{i=1}^N, the reduction procedure solves N + 1 convex programming problems to identify examples whose class we are fairly certain of. These examples are then removed to produce the reduced data, D_M ⊆ D_N. The computational gain associated with the reduction procedure comes from training a model with D_M, which involves solving a MIP formulation with N − M fewer loss constraints.

To illustrate the reduction procedure, consider the generic optimization problem:

$$\min_{f} \;\; Z(f; D_N), \text{ where } Z(f; D_N) = \text{Loss}(f; D_N) + C \cdot \Phi(f), \quad \text{s.t.} \quad f \in \mathcal{F}. \tag{27}$$

Reduction involves solving a convex proxy to this original discrete problem. We denote the set of optimal solutions to the original problem as F_min = argmin_{f∈F} Z(f; D_N), and denote an optimal classifier in this set as f* ∈ F_min. Let Z̃(f; D_N) denote a convex upper bound to Z(f; D_N), let F̃ = argmin_{f∈F} Z̃(f; D_N) denote its set of optimal solutions, and let f̃ ∈ F̃ denote an optimal classifier in this set.

Users can run the reduction procedure using any convex proxy, so long as they can choose a value of ε > 0 that is large enough to hypothesize that the ε-level set of Z̃(f; D_N) contains the set of optimal solutions to the original optimization problem, F_min. In other words, the convex proxy and the width of the level set must be chosen so that

$$\tilde{Z}(f^*; D_N) \le \tilde{Z}(\tilde{f}; D_N) + \varepsilon \quad \forall f^* \in F_{min}. \tag{28}$$

If ε is chosen too large, the reduction procedure will not filter very many examples and will thus be less helpful for reducing computation. This assumption is equivalent to saying that the set of ε-good functions,

$$\tilde{F}_\varepsilon = \left\{ f: \mathcal{X} \to \mathcal{Y} \;\middle|\; \tilde{Z}(f; D_N) \le \tilde{Z}(\tilde{f}; D_N) + \varepsilon \right\}, \tag{29}$$

contains F_min.

In the first stage of the reduction process, we solve the convex proxy problem min_f Z̃(f; D_N) to obtain f̃. This allows us to establish the objective value of ε-good classifiers, as well as a set of baseline labels, ỹ_i = sign(f̃(x_i)). In the second stage of the reduction process, we solve N additional modified versions of the convex proxy to check, for each i, whether there exists a function in the set of ε-good functions that classifies i in a way other than the baseline ỹ_i. The modified version of the convex proxy for example i contains a constraint that forces point i to be classified as −ỹ_i:

$$
\begin{aligned}
\min_{f} \quad & \tilde{Z}(f; D_N) && (30)\\
\text{s.t.} \quad & \tilde{y}_i f(x_i) < 0. && (31)
\end{aligned}
$$

We denote the optimal objective value as Z̃(f̃_{-i}; D_N) and an argmin as f̃_{-i}. If Z̃(f̃_{-i}; D_N) > Z̃(f̃; D_N) + ε, then we know that no classifier in the set of ε-good classifiers F̃_ε can label point i as −ỹ_i. Thus, all functions within the set of ε-good classifiers F̃_ε must label this point the same way, namely as ỹ_i. This, together with our assumption that F_min ⊆ F̃_ε, implies that all f* ∈ F_min will also classify the ith example as ỹ_i. This allows us to remove example i from the reduced dataset D_M, because we already know that the MIP will predict its label as ỹ_i. Note that we remove example i only when Z̃(f̃_{-i}; D_N) > Z̃(f̃; D_N) + ε, since otherwise we are not certain of its sign in the 0–1 problem that is solved by the MIP.


[Figure 3: an illustration of the level sets of the convex proxy, showing Z̃(f̃; D_N), the ε-level set Z̃(f̃; D_N) + ε, and Z̃(f̃_{-i}; D_N), along with f̃, F_min, F̃_ε, and f̃_{-i}.]

Figure 3: F_min denotes the set of optimal solutions to our original optimization problem, and f̃ denotes a solution to the convex proxy problem. We choose ε large enough to ensure that F_min ⊆ F̃_ε. The reduction method trains an additional classifier f̃_{-i} for each example in the original data, D_N, by solving a modified version of the convex proxy problem that forces f̃_{-i} to label i in a different way than f̃. In this illustration, f̃_{-i} ∉ F̃_ε, so we know the predicted sign of example i and thus do not include it in the reduced data, D_M.

Algorithm 1 Reduce D_N to D_M

Input: D_N = {(x_i, y_i)}_{i=1}^N, original data
Input: Z̃(f; D_N), objective function of the convex proxy
Input: ε, width of the convex proxy level set

D_M ← ∅
f̃ ← argmin_f Z̃(f; D_N)
for i = 1, ..., N do
    ỹ_i ← sign(f̃(x_i))
    f̃_{-i} ← argmin_f Z̃(f; D_N) s.t. ỹ_i f(x_i) < 0
    if Z̃(f̃_{-i}; D_N) ≤ Z̃(f̃; D_N) + ε then
        D_M ← D_M ∪ {(x_i, y_i)}
    end if
end for

Output: D_M, reduced data
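Algorithm 1 can be implemented with any convex solver. The sketch below uses CVXPY with an ℓ1-regularized logistic loss as the convex proxy Z̃; both the proxy and the small constant standing in for the strict inequality in (31) are illustrative assumptions, not prescriptions from the paper:

```python
import cvxpy as cp
import numpy as np

def _solve_proxy(X, y, C, force=None):
    """Minimize the convex proxy: mean logistic loss + C*||lam||_1.
    If force = (i, s), add s * (x_i . lam) >= delta as a surrogate for
    the strict-inequality constraint y~_i f(x_i) < 0 in (30)-(31)."""
    lam = cp.Variable(X.shape[1])
    obj = cp.sum(cp.logistic(-cp.multiply(y, X @ lam))) / len(y) + C * cp.norm1(lam)
    cons = []
    if force is not None:
        i, s = force
        cons.append(s * (X[i] @ lam) >= 1e-4)  # force example i to the other side
    prob = cp.Problem(cp.Minimize(obj), cons)
    prob.solve()
    return prob.value, lam.value

def reduce_data(X, y, C=0.01, eps=0.05):
    """Sketch of Algorithm 1: keep example i only if some eps-good proxy
    classifier can label it differently from the baseline label y~_i."""
    base_val, lam = _solve_proxy(X, y, C)
    keep = np.zeros(len(y), dtype=bool)
    for i in range(len(y)):
        y_tilde = 1.0 if X[i] @ lam >= 0 else -1.0
        flipped_val, _ = _solve_proxy(X, y, C, force=(i, -y_tilde))
        keep[i] = flipped_val <= base_val + eps  # uncertain examples stay in D_M
    return X[keep], y[keep]
```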


We illustrate this situation in Figure 3, and provide a summary of the reduction method in Algorithm 1. In Theorem 1, we prove that we obtain the same set of classifiers whether we train a model with the original data, D_N, or the reduced data, D_M.

Theorem 1 (Equivalence of the Reduced Data) Consider the optimization problem used to train an interpretable classifier, f ∈ F, with data D_N. Let Z(f; D_N) denote the objective function of this optimization problem, with the set of optimal solutions F_min = argmin_{f∈F} Z(f; D_N). Similarly, let Z̃(f; D_N) denote the objective function of a convex proxy to this optimization problem, and let f̃ denote an optimal solution from the set of optimal solutions F̃ = argmin Z̃(f; D_N). Assume that we have chosen a value of ε > 0 so that

$$\tilde{Z}(f^*; D_N) \le \tilde{Z}(\tilde{f}; D_N) + \varepsilon \quad \forall f^* \in F_{min}. \tag{32}$$

Then the reduction procedure in Algorithm 1 will output a reduced dataset D_M ⊆ D_N such that

$$\operatorname*{argmin}_{f \in \mathcal{F}} Z(f; D_N) = \operatorname*{argmin}_{f \in \mathcal{F}} Z(f; D_M). \tag{33}$$

Proof. Let us denote the set of points that have been removed by the reduction procedure as S = D_N \ D_M. When S = ∅, then D_N = D_M, and (33) follows trivially. The reduction procedure ensures that for any f ∈ F̃_ε, sign(f(x_i)) = sign(f̃(x_i)) for all i in S. When S ≠ ∅, we can prove the theorem by showing that the objective value of the original optimization problem attains a constant value, C ∈ R, on the examples in S for all optimal classifiers f* ∈ F_min:

$$Z(f^*; S) = C \quad \forall f^* \in F_{min}. \tag{34}$$

The statement in (34) is sufficient to show (33), since

$$
\begin{aligned}
F_{min} = \operatorname*{argmin}_{f \in \mathcal{F}} Z(f; D_N) &= \operatorname*{argmin}_{f \in \mathcal{F}} Z(f; D_M \cup S) \\
&= \operatorname*{argmin}_{f \in \mathcal{F}} Z(f; D_M) + Z(f; S) \\
&= \operatorname*{argmin}_{f \in \mathcal{F}} Z(f; D_M) + C \qquad (35) \\
&= \operatorname*{argmin}_{f \in \mathcal{F}} Z(f; D_M).
\end{aligned}
$$

We can prove (34) by showing that all optimal classifiers in F_min classify each example i ∈ S with the same label. We do this by showing that all classifiers in the larger set F̃_ε (larger by assumption) classify each example in S with the same label. This comes directly from the reduction procedure; recall that the reduction procedure ensures that for any f ∈ F̃_ε, sign(f(x_i)) = sign(f̃(x_i)) for all i in S. Thus, for all f ∈ F̃_ε,

$$Z(f; D_N) = Z(f; D_M) + \sum_{i \in S} \mathbb{1}[y_i f(x_i) \le 0] = Z(f; D_M) + \sum_{i \in S} \mathbb{1}[y_i \tilde{f}(x_i) \le 0] = Z(f; D_M) + C.$$

The rest of the proof follows from (35).

The reduction technique is very flexible, and can be used with any convex proxy optimization problem.

3.2 Decomposition

Decomposition methods are a popular class of techniques for solving large-scale optimization problems. These methods are sometimes referred to as cutting-plane methods or localization methods in the optimization literature [10]. In what follows, we present a simplified case of Generalized Benders’ Decomposition [32] that is well-suited to solve large-scale instances of the formulations from Section 2.

3.2.1 Approach

Given a dataset with N examples and P features, we produce an interpretable classification model using an optimization problem with the following structure:

\[
\min_{\lambda} \; \text{Loss}(\lambda; D_N) + C \cdot \Phi(\lambda) \quad \text{s.t.} \quad \lambda \in \mathcal{L}. \tag{$\mathcal{P}$}
\]

This optimization problem, P, can be formulated as a MIP that uses N variables and constraints to model the loss function, and uses at least O(P) variables and constraints to model the interpretability penalty and interpretability constraints. Solving P is difficult for datasets with large N or P. Decomposition can recover the optimal solution for large instances of P by iteratively solving a proxy problem in which the loss function is modeled with a single variable, θ:

\[
\min_{\lambda,\,\theta} \; \theta + C \cdot \Phi(\lambda) \quad \text{s.t.} \quad \lambda \in \mathcal{L}, \; \theta \in \mathbb{R}. \tag{$\mathcal{P}'$}
\]

Clearly, the proxy problem, P′, is far easier to solve than the original problem, P, because it contains N − 1 fewer variables and N fewer constraints. However, the solution to P′ does not represent the same classifier as the solution to P, because P′ does not incorporate any constraints to model the loss function. To have the solution to P′ produce the same classifier as the solution to P, the decomposition procedure builds a piecewise-linear approximation of the loss function in the proxy problem using an iterative process. Each iteration of this process involves three steps: first, choosing a point in the feasible region of the proxy problem; second, computing a supporting hyperplane of the loss function at this point; and third, adding this supporting hyperplane as a constraint in P′. A supporting hyperplane to the loss function at the point $\lambda^0$ is a linear inequality of the form

\[
\theta \geq \text{Loss}(\lambda^0) + \nabla\text{Loss}(\lambda^0)^T (\lambda - \lambda^0), \tag{36}
\]

where $\text{Loss}(\lambda^0) \in \mathbb{R}$ and $\nabla\text{Loss}(\lambda^0) \in \mathbb{R}^P$ are fixed values that denote the value and gradient of the loss function at $\lambda^0$. Note that we have dropped the data parameter from the loss function for clarity of exposition. In Figures 4(a) and 4(b), we show how K supporting hyperplanes, $\mathcal{H}^K$, can be combined to produce a piecewise-linear approximation of the loss function, $\widetilde{\text{Loss}}(\lambda; \mathcal{H}^K)$.

On the Kth iteration, the decomposition procedure therefore solves a proxy problem, $\mathcal{P}^K$, that contains K additional constraints, $\mathcal{H}^K$:

\[
\begin{aligned}
\min_{\lambda,\,\theta} \quad & \theta + C \cdot \Phi(\lambda) \\
\text{s.t.} \quad & \lambda \in \mathcal{L}, \\
& \theta \geq \text{Loss}(\lambda^1) + \nabla\text{Loss}(\lambda^1)^T (\lambda - \lambda^1), \\
& \;\;\vdots \\
& \theta \geq \text{Loss}(\lambda^K) + \nabla\text{Loss}(\lambda^K)^T (\lambda - \lambda^K).
\end{aligned}
\tag{$\mathcal{P}^K$}
\]

In comparison to the original problem, P, the proxy problem, $\mathcal{P}^K$, is relatively easy to solve, as it models the loss function using 1 variable and K constraints as opposed to N variables and N constraints.
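To illustrate how the piecewise-linear approximation is built up, the sketch below runs a Kelley-style cutting-plane loop on a logistic loss using scipy's LP solver. It is a deliberate simplification of the procedure above: the interpretability penalty $C \cdot \Phi(\lambda)$ and the integrality constraints are dropped, leaving only the box constraint $\lambda \in [-\Lambda, \Lambda]^P$, the loss model θ, and the cuts; the loss oracle, box bound, and tolerance are assumed choices.

```python
import numpy as np
from scipy.optimize import linprog
from scipy.special import expit

def loss_and_grad(lam, X, y):
    """Value and gradient of the convex logistic loss."""
    margins = y * (X @ lam)
    value = np.mean(np.logaddexp(0.0, -margins))
    grad = -(X.T @ (y * expit(-margins))) / len(y)
    return value, grad

def cutting_plane(X, y, Lam=10.0, tol=1e-4, max_iter=200):
    N, P = X.shape
    A_cuts, b_cuts = [], []                      # rows [g, -1]; rhs g @ lam_k - loss_k
    lam_k = np.zeros(P)                          # initial query point
    bounds = [(-Lam, Lam)] * P + [(0.0, None)]   # lambda box; theta >= 0
    ub = np.inf
    for _ in range(max_iter):
        loss_k, g = loss_and_grad(lam_k, X, y)
        ub = min(ub, loss_k)                     # every lam_k is feasible: upper bound
        # cut theta >= loss_k + g^T (lam - lam_k), written as g^T lam - theta <= rhs
        A_cuts.append(np.append(g, -1.0))
        b_cuts.append(g @ lam_k - loss_k)
        c = np.append(np.zeros(P), 1.0)          # objective: minimize theta
        res = linprog(c, A_ub=np.array(A_cuts), b_ub=np.array(b_cuts),
                      bounds=bounds, method="highs")
        lam_k, lb = res.x[:P], res.fun           # LP value is a valid lower bound
        if ub - lb < tol:
            break
    return lam_k, lb, ub
```

The loop exposes exactly the quantities discussed in the next subsection: the LP value is a lower bound because every cut underestimates the loss, and the loss evaluated at any iterate is an upper bound.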

3.2.2 Bounds and Convergence

The decomposition procedure is guaranteed to produce the solution of the original optimization problem, P, as the piecewise-linear approximation of the loss function asymptotically converges to its actual value. The procedure can use the solutions to the proxy problem at each iteration, $\mathcal{P}^K$, to produce lower and upper bounds on the optimal value of the original problem, P. These bounds can be used to check whether the solution of the proxy problem has converged to the solution of the original problem, and/or to stop the procedure when the solutions to the proxy problem achieve an objective value that lies within a user-defined range of the optimal value.

Figure 4: (a) Loss Function; (b) Approximation of the Loss Function. The approximation $\widetilde{\text{Loss}}(\lambda; \mathcal{H}^2)$ is built from two supporting hyperplanes, $\text{Loss}(\lambda^1) + \nabla\text{Loss}(\lambda^1)(\lambda - \lambda^1)$ and $\text{Loss}(\lambda^2) + \nabla\text{Loss}(\lambda^2)(\lambda - \lambda^2)$.

Let us denote the objective function of the original optimization problem, P, as $Z(\lambda)$, and denote an optimal solution to this problem $\lambda^* \in \operatorname*{argmin} Z(\lambda)$. Similarly, let us denote the objective function of the proxy problem on the Kth iteration of the algorithm, $\mathcal{P}^K$, as $\tilde{Z}(\lambda; \mathcal{H}^K)$, and an optimal solution to this problem as $\tilde{\lambda}^K \in \operatorname*{argmin} \tilde{Z}(\lambda; \mathcal{H}^K)$. Note that we have parametrized the objective value of the proxy problem on the Kth iteration because it depends on the value of the approximate loss function, which is composed of K hyperplanes, $\mathcal{H}^K$.

To produce a lower bound on the optimization problem in P, notice that a piecewise-linear approximation of a convex loss function underestimates the true value of the loss:

\[
\widetilde{\text{Loss}}(\lambda; \mathcal{H}^K) \leq \text{Loss}(\lambda) \quad \forall \lambda \in \mathbb{R}^P. \tag{37}
\]

Thus,

\[
\min_{\lambda \in \mathcal{L}} \; \widetilde{\text{Loss}}(\lambda; \mathcal{H}^K) + C \cdot \Phi(\lambda) \;\leq\; \min_{\lambda \in \mathcal{L}} \; \text{Loss}(\lambda) + C \cdot \Phi(\lambda), \tag{38}
\]

that is,

\[
\tilde{Z}(\tilde{\lambda}^K; \mathcal{H}^K) \leq Z(\lambda^*). \tag{39}
\]

To produce an upper bound on the optimization problem in P, notice that any solution to the approximate problem is feasible in the original problem. Thus,

\[
Z(\lambda^*) \leq Z(\tilde{\lambda}^K). \tag{40}
\]

We can use these bounds to show that the value of the proxy problem converges to the value of the original problem as we add more supporting hyperplanes to approximate the loss function. This follows because the objective function of the proxy problem is monotonically non-decreasing in the number of points used to approximate the loss function,

\[
\tilde{Z}(\lambda; \mathcal{H}^K) \leq \tilde{Z}(\lambda; \mathcal{H}^{K+1}) \quad \forall \lambda \in \mathbb{R}^P, \tag{41}
\]

and is bounded above by the objective function of the original problem,

\[
\tilde{Z}(\lambda; \mathcal{H}^K) \leq Z(\lambda) \quad \forall \lambda \in \mathbb{R}^P. \tag{42}
\]

Thus,

\[
\lim_{K \to \infty} \tilde{Z}(\lambda; \mathcal{H}^K) = Z(\lambda). \tag{43}
\]

A complete proof of convergence of the decomposition procedure we have presented can be found in Chapter 6 of [23].

3.2.3 Implementation and Extensions

A simple and popular approach to implementing the decomposition procedure in the optimization community is Benders' Decomposition. On the Kth iteration, Benders' Decomposition solves $\mathcal{P}^{K-1}$ to obtain the solution $\lambda^K$, and then adds the next hyperplane around $\lambda^K$. This algorithm is easy to implement but suffers from multiple issues:

1. The first few cuts are added at poorly chosen points, which can heavily impact the progress of the algorithm.

2. The approach requires solving the original MIP multiple times, and therefore rebuilds the branch-and-bound tree multiple times. This adds a significant amount of overhead to the process.

3. Solving the MIP multiple times also requires users to decide how accurately to solve the MIP at each iteration. A simple fix is to solve the MIP to optimality; however, this adds unnecessary computation, since we do not need optimal points in the first few iterations. Ideally, users should be able to require better and better solutions with more iterations, but it is not clear how to set the desired optimality gap in these cases.

Fortunately, these issues can be addressed elegantly. To avoid generating cuts at poorly chosen points, for instance, users can modify the decomposition procedure to add cuts at different points from the feasible region of $\mathcal{P}^K$, as discussed after Algorithm 2.


Algorithm 2 Benders' Decomposition

Input: δ > 0, tolerance gap between upper and lower bound
Input: P, original optimization problem with objective function $Z(\lambda) = \text{Loss}(\lambda) + C \cdot \Phi(\lambda)$
Input: $\mathcal{P}^0$, initial proxy problem with objective function $\tilde{Z}(\lambda; \mathcal{H}^0) = \theta + C \cdot \Phi(\lambda)$
Input: $\text{Loss}(\lambda)$, convex loss function

    $K \leftarrow 0$
    $UB_K \leftarrow \infty$
    $LB_K \leftarrow 0$
    while $UB_K - LB_K > \delta$ do
        $K \leftarrow K + 1$
        Solve $\mathcal{P}^{K-1}$ to obtain $\lambda^K$, and set $LB_K \leftarrow \tilde{Z}(\lambda^K; \mathcal{H}^{K-1})$
        Compute $\text{Loss}(\lambda^K)$ and $\nabla\text{Loss}(\lambda^K)$
        Add the cut $\theta \geq \text{Loss}(\lambda^K) + \nabla\text{Loss}(\lambda^K)^T (\lambda - \lambda^K)$ to form $\mathcal{P}^K$
        $UB_K \leftarrow \min(UB_{K-1}, Z(\lambda^K))$
    end while

Output: $\lambda^K$, approximate solution

Boyd and Vandenberghe [10] provide a comprehensive overview of the different options to pick $\lambda^K$ at each iteration of the decomposition procedure. These options include:

1. The center of gravity of $\mathcal{P}^K$. This choice of $\lambda^K$ results in the center of gravity method [72, 55].

2. The center of the maximum volume ellipsoid contained in $\mathcal{P}^K$. This choice of $\lambda^K$ results in the maximum volume ellipsoid (MVE) cutting-plane method [89].

3. The center of the largest Euclidean ball contained in $\mathcal{P}^K$ (i.e., the Chebyshev center). This choice of $\lambda^K$ results in the Chebyshev center cutting-plane method [21].

4. The analytic center of the inequalities defining $\mathcal{P}^K$. This choice of $\lambda^K$ results in the analytic center cutting-plane method (ACCPM) [5, 33].

To address the issues associated with the overhead and optimality gap, we recommend practitioners implement the decomposition procedure using a one-tree approach.² In this approach, users solve the MIP only once using a branch-and-bound procedure, and use callback functions to add cuts whenever the solver finds a feasible solution. A recent paper by Naoum-Sawaya and Elhedhli [71] illustrates the benefits of implementing a one-tree approach and generating cuts at points carefully chosen using the ACCPM method. Note that popular methods to speed up Benders' Decomposition by generating "pareto-optimal cuts," as described in Magnanti and Wong [59], are unlikely to help in this particular application, given that the Loss function frequently has a unique solution.
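The sketch below makes the one-tree recommendation concrete, assuming the Gurobi solver (gurobipy): the MIP is built once, and a lazy-constraint callback adds a cut whenever branch and bound finds an integer-feasible solution whose θ underestimates the true loss. The loss_and_grad oracle is a stand-in for any convex loss, and the interpretability penalty Φ(λ) is omitted for brevity, so this illustrates the mechanism rather than our full formulation.

```python
import numpy as np
import gurobipy as gp
from gurobipy import GRB

def one_tree_decomposition(loss_and_grad, P, Lam=10):
    """One-tree Benders: cuts are added lazily inside a single B&B tree.

    loss_and_grad(lam) -> (value, gradient) is an oracle for a convex,
    nonnegative loss; Lam bounds the integer coefficients.
    """
    m = gp.Model("one-tree")
    lam = m.addVars(P, lb=-Lam, ub=Lam, vtype=GRB.INTEGER, name="lam")
    theta = m.addVar(lb=0.0, name="theta")   # lb=0 assumes a nonnegative loss
    m.setObjective(theta, GRB.MINIMIZE)
    m.Params.LazyConstraints = 1

    def add_cut(model, where):
        # whenever B&B finds an integer-feasible point, check the loss model
        if where == GRB.Callback.MIPSOL:
            lam_k = np.array([model.cbGetSolution(lam[j]) for j in range(P)])
            theta_k = model.cbGetSolution(theta)
            loss_k, grad_k = loss_and_grad(lam_k)
            if theta_k < loss_k - 1e-6:      # theta underestimates the loss: cut
                model.cbLazy(theta >= loss_k
                             + gp.quicksum(grad_k[j] * (lam[j] - lam_k[j])
                                           for j in range(P)))

    m.optimize(add_cut)
    return np.array([lam[j].X for j in range(P)]), m.ObjVal
```

Because the cuts arrive through the callback, the branch-and-bound tree is never rebuilt, which avoids the overhead described in issue 2 above.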

2 The “one-tree” approach is popular among those who are familiar with decomposition techniques. A clear explanation can be found at http://orinanobworld.blogspot.com/2011/10/benders-decomposition-then-and-now.html. The approach has been used by Bai and Rubin [6].


4 Theoretical Insights

4.1 Discretization Bounds

We begin by showing that we can always adjust the default set of bounded integer coefficients

\[
\mathcal{L} = \left\{ \lambda \in \mathbb{Z}^P : |\lambda_j| \leq \Lambda \text{ for } j = 1, \ldots, P \right\}
\]

in order to produce a classifier (still with integer coefficients) that attains the same level of accuracy as any other linear classifier. In what follows, we provide bounds that link the resolution of this default set (specified by Λ) to an upper bound on the error.

Let $\rho \in \mathbb{R}^P$ denote a vector of coefficients from any linear classifier (e.g., a support vector machine). Since the 0–1 loss function is scale invariant, $\text{Loss}_{0\text{-}1}(\rho) = \text{Loss}_{0\text{-}1}(\rho / \|\rho\|_2)$. Thus, we provide bounds that are expressed in terms of the normalized vector $\rho / \|\rho\|_2$ without loss of generality. We define λ element-wise so that $\lambda_j / \Lambda$ is the value of $\rho_j / \|\rho\|_2$ rounded to the nearest multiple of $1/\Lambda$.

Theorem 2 (Discretization Error) If we choose

\[
\Lambda > \frac{X_{\max} \sqrt{P}}{2 \min_i |\rho^T x_i| / \|\rho\|_2},
\]

then the 0-1 losses of ρ and λ are the same:

\[
\sum_i \mathbb{1}\left[ y_i \rho^T x_i \leq 0 \right] = \sum_i \mathbb{1}\left[ y_i \lambda^T x_i \leq 0 \right],
\]

where ρ is any vector in $\mathbb{R}^P$, $X_{\max} = \max_i \|x_i\|_2$, and λ is defined above.

Proof Any value of $\rho_j / \|\rho\|_2$ is at most half an interval of size $1/\Lambda$ away from its rounded value $\lambda_j / \Lambda$; that is, $|\lambda_j / \Lambda - \rho_j / \|\rho\|_2| \leq 1/(2\Lambda)$. We use this in what follows. For any i,

\[
\left| \frac{\lambda^T x_i}{\Lambda} - \frac{\rho^T x_i}{\|\rho\|_2} \right|
\leq \left\| \frac{\lambda}{\Lambda} - \frac{\rho}{\|\rho\|_2} \right\|_2 \|x_i\|_2
= \left( \sum_j \left( \frac{\lambda_j}{\Lambda} - \frac{\rho_j}{\|\rho\|_2} \right)^2 \right)^{1/2} \|x_i\|_2
\leq \left( \sum_j \frac{1}{(2\Lambda)^2} \right)^{1/2} \|x_i\|_2
= \frac{\sqrt{P}}{2\Lambda} \|x_i\|_2
\leq \frac{\sqrt{P} \, X_{\max}}{2\Lambda}
< \min_i \frac{|\rho^T x_i|}{\|\rho\|_2}.
\]

Here the first inequality is the Cauchy–Schwarz inequality, the second inequality comes from the definition of λ and the rounding, and the third inequality is from the assumption of the theorem.

We consider separately the case where the classification function ρ has a positive margin on $x_i$ and the case where it has a negative margin. If it has positive margin, $\rho^T x_i > 0$, then the following calculation is relevant, using the reasoning above:

\[
\frac{\rho^T x_i}{\|\rho\|_2} - \frac{\lambda^T x_i}{\Lambda} \leq \left| \frac{\lambda^T x_i}{\Lambda} - \frac{\rho^T x_i}{\|\rho\|_2} \right| < \min_i \frac{|\rho^T x_i|}{\|\rho\|_2}.
\]

We will use the fact that, for any i′, by definition of the minimum,

\[
0 \leq \frac{|\rho^T x_{i'}|}{\|\rho\|_2} - \min_i \frac{|\rho^T x_i|}{\|\rho\|_2},
\]

and combine this with a rearrangement of the previous expression to obtain

\[
0 \leq \frac{|\rho^T x_i|}{\|\rho\|_2} - \min_i \frac{|\rho^T x_i|}{\|\rho\|_2} = \frac{\rho^T x_i}{\|\rho\|_2} - \min_i \frac{|\rho^T x_i|}{\|\rho\|_2} < \frac{\lambda^T x_i}{\Lambda}.
\]


Thus, we have shown that $\lambda^T x_i > 0$ whenever $\rho^T x_i > 0$. For the case where ρ has a negative margin on $x_i$, we perform an analogous calculation:

\[
\frac{\lambda^T x_i}{\Lambda} - \frac{\rho^T x_i}{\|\rho\|_2} \leq \left| \frac{\lambda^T x_i}{\Lambda} - \frac{\rho^T x_i}{\|\rho\|_2} \right| < \min_i \frac{|\rho^T x_i|}{\|\rho\|_2},
\]

and then, using that $\rho^T x_i < 0$,

\[
0 \leq \frac{|\rho^T x_i|}{\|\rho\|_2} - \min_i \frac{|\rho^T x_i|}{\|\rho\|_2} = \frac{-\rho^T x_i}{\|\rho\|_2} - \min_i \frac{|\rho^T x_i|}{\|\rho\|_2} < \frac{-\lambda^T x_i}{\Lambda}.
\]

Thus $\lambda^T x_i < 0$ whenever $\rho^T x_i < 0$. Since λ and ρ assign the same sign to every example, their 0–1 losses coincide. ∎
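As a quick numerical illustration of Theorem 2 (our own check, not part of the paper's method), the sketch below rounds a real-valued coefficient vector to integers with Λ chosen strictly above the threshold in the theorem, and verifies that the 0–1 losses coincide; the data and classifier are synthetic.

```python
import numpy as np

def discretize(rho, Lam):
    """lambda_j / Lam is rho_j / ||rho||_2 rounded to the nearest 1 / Lam."""
    return np.round(Lam * rho / np.linalg.norm(rho)).astype(int)

def zero_one_loss(coef, X, y):
    return np.mean(y * (X @ coef) <= 0)

rng = np.random.default_rng(0)
N, P = 100, 5
X = rng.normal(size=(N, P))
rho = rng.normal(size=P)                 # any real-valued classifier
y = np.sign(X @ rho)

# choose Lam strictly above the threshold of Theorem 2
margins = np.abs(X @ rho) / np.linalg.norm(rho)
X_max = np.linalg.norm(X, axis=1).max()
Lam = int(np.ceil(X_max * np.sqrt(P) / (2.0 * margins.min()))) + 1

lam = discretize(rho, Lam)
assert zero_one_loss(lam, X, y) == zero_one_loss(rho, X, y)
```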

4.2 Generalization Bounds

Theorem 3 (Generalization Bound) Let $\mathcal{F}$ denote the set of linear classifiers with coefficients λ restricted to a finite set $\mathcal{L}$. For every δ > 0, with probability at least 1 − δ, every $f \in \mathcal{F}$ obeys:

\[
R^{\text{true}}(f) \leq R^{\text{emp}}(f) + \sqrt{\frac{\log(|\mathcal{L}|) - \log(\delta)}{2N}}.
\]

The proof of Theorem 3 uses Hoeffding's inequality for a single function, f, combined with the union bound over all functions $f \in \mathcal{F}$. The result that more restrictive hypothesis spaces can lead to better generalization provides some motivation for using more interpretable models without necessarily expecting a loss of test accuracy. As the amount of data N increases, the bound indicates that we can refine the set $\mathcal{L}$ to include more functions. When a large amount of data is available, we can reduce the empirical error by including, for instance, one more significant digit within each coefficient $\lambda_j$.

When we train our classifiers using techniques similar to SLIM, which optimizes an objective function that is scale-invariant over a set of bounded integer vectors (the 0-1 loss), a small enough ℓ1-penalty effectively restricts the coefficients to a set of coprime integers. In such cases, we may refine the generalization bound in Theorem 3 by counting the number of P-dimensional coprime integer vectors bounded by Λ, as follows:


Theorem 4 (Generalization Bound for Classifiers with Coprime Coefficients) Let $\mathcal{F}$ denote the set of linear classifiers with coprime integer coefficients, λ, bounded by Λ. That is,

\[
\mathcal{F} \equiv \left\{ f : \mathcal{X} \to \mathcal{Y} \;\middle|\; f(x) = \text{sign}(\lambda^T x) \text{ and } \lambda \in \mathcal{L} \right\},
\]
where
\[
\mathcal{L} \equiv \{ \lambda \in \hat{\mathbb{Z}}^P : |\lambda_j| \leq \Lambda \text{ for } j = 1, \ldots, P \},
\qquad
\hat{\mathbb{Z}}^P \equiv \left\{ z \in \mathbb{Z}^P : \gcd(z) = 1 \right\}.
\]

Here, gcd(z) denotes the greatest common divisor of the entries of z. For every δ > 0, with probability at least 1 − δ, every classifier $f \in \mathcal{F}$ obeys:

\[
R^{\text{true}}(f) \leq R^{\text{emp}}(f) + \sqrt{\frac{\log(|C_{\Lambda,P}|) - \log(\delta)}{2N}},
\]
where
\[
C_{\Lambda,P} = \left\{ \frac{\lambda}{q} \in [0,1)^P : (\lambda, q) \in \hat{\mathbb{Z}}^{P+1} \text{ and } 0 < q \leq \Lambda \right\}
\]
represents the set of Farey points of level Λ.

This result ties interpretability to concepts from number theory. In Figure 5, we plot the density of coprime integer vectors in $\mathbb{Z}^P$ that are bounded by Λ, that is,

\[
r(P, \Lambda) = \frac{\left| \hat{\mathbb{Z}}^P \cap [-\Lambda, \Lambda]^P \right|}{\left| \mathbb{Z}^P \cap [-\Lambda, \Lambda]^P \right|} = \frac{\left| \hat{\mathbb{Z}}^P \cap [-\Lambda, \Lambda]^P \right|}{(2\Lambda + 1)^P}.
\]

This shows that requiring coefficients to be coprime can significantly reduce the size of the hypothesis space, depending on the dimensionality of the data and the value of Λ.
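For small P and Λ, the density r(P, Λ) can be computed by brute-force enumeration, as in the sketch below (our own illustration; the enumeration visits $(2\Lambda + 1)^P$ vectors, so it is only feasible in low dimensions).

```python
from math import gcd
from functools import reduce
from itertools import product

def coprime_density(P, Lam):
    """Fraction of vectors in {-Lam, ..., Lam}^P whose entries have gcd 1."""
    total = coprime = 0
    for z in product(range(-Lam, Lam + 1), repeat=P):
        total += 1
        if reduce(gcd, (abs(v) for v in z)) == 1:
            coprime += 1
    return coprime / total
```

For example, coprime_density(2, 10) is roughly 0.61, close to the classical asymptotic density $6/\pi^2$ of coprime pairs.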

Figure 5: Density of coprime integer vectors $\lambda \in \mathbb{Z}^P$ bounded by Λ, plotted as a percentage for P up to 50 and Λ up to 20.

In Figure 6, we plot the relative improvement in the generalization bound due to the restriction to coprime coefficients (for δ = 0.01). This shows that the improvement in the generalization bound is minor in cases of low dimensionality and large Λ, which we expect, as this is the case where we can well-approximate continuous functions by discrete functions. On the other hand, the improvement in the bound can be large when the data are high dimensional and the coefficients are restricted to be small.

Figure 6: Improvement in the generalization bound for SLIM classifiers (N = 1000, δ = 0.01), plotted as a percentage for P up to 50 and Λ up to 50.

References

[1] Hiva Allahyari and Niklas Lavesson. User-oriented assessment of classification model understandability. In SCAI, pages 11–19, 2011.

[2] Joel T Andrade. Handbook of violence risk assessment and treatment: New approaches for mental health professionals. Springer Publishing Company, 2009.

[3] Elliott M Antman, Marc Cohen, Peter JLM Bernink, Carolyn H McCabe, Thomas Horacek, Gary Papuchis, Branco Mautner, Ramon Corbalan, David Radley, and Eugene Braunwald. The TIMI risk score for unstable angina/non–ST elevation MI. The Journal of the American Medical Association, 284(7):835–842, 2000.

[4] Ognian K. Asparouhov and Paul A. Rubin. Oscillation heuristics for the two-group classification problem. Journal of Classification, 21:255–277, 2004.

[5] David S Atkinson and Pravin M Vaidya. A cutting plane algorithm for convex programming that uses analytic centers. Mathematical Programming, 69(1-3):1–43, 1995.

[6] Lihui Bai and Paul A Rubin. Combinatorial Benders cuts for the minimum tollbooth problem. Operations Research, 57(6):1510–1522, 2009. doi: 10.1287/opre.1090.0694. URL http://or.journal.informs.org/cgi/content/abstract/opre.1090.0694v1.

[7] Jacob Bien and Robert Tibshirani. Prototype selection for interpretable classification. The Annals of Applied Statistics, 5(4):2403–2424, 2011.

[8] R. Bixby, M. Fenelon, Z. Gu, E. Rothberg, and R. Wunderling. Mixed-Integer Programming: A Progress Report, chapter 18, pages 309–325. doi: 10.1137/1.9780898718805.ch18. URL http://epubs.siam.org/doi/abs/10.1137/1.9780898718805.ch18.

[9] RC Bone, RA Balk, FB Cerra, RP Dellinger, AM Fein, WA Knaus, RM Schein, WJ Sibbald, JH Abrams, GR Bernard, et al. American College of Chest Physicians/Society of Critical Care Medicine consensus conference: Definitions for sepsis and organ failure and guidelines for the use of innovative therapies in sepsis. Critical Care Medicine, 20(6):864–874, 1992.

[10] Stephen P Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[11] Ivan Bratko. Machine learning: Between accuracy and interpretability. Courses and Lectures–International Centre for Mechanical Sciences, pages 163–178, 1997.

[12] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, October 2001. doi: 10.1023/A:1010933404324.

[13] Leonard A Breslow and David W Aha. Simplifying decision trees: A survey. The Knowledge Engineering Review, 12(01):1–40, 1997.

[14] E. Carrizosa, A. Nogales-Gómez, and D. Romero Morales. Strongly agree or strongly disagree?: Rating features in support vector machines. Technical report, Saïd Business School, University of Oxford, UK, 2013.

[15] Emilio Carrizosa and Dolores Romero Morales. Supervised classification and mathematical optimization. Computers & Operations Research, 40(1):150–165, 2013.

[16] Emilio Carrizosa, Belén Martín-Barragán, and Dolores Romero Morales. Binarized support vector machines. INFORMS Journal on Computing, 22(1):154–167, 2010.

[17] Emilio Carrizosa, Belén Martín-Barragán, and Dolores Romero Morales. Detecting relevant variables and interactions in supervised classification. European Journal of Operational Research, 213(1):260–269, 2011.

[18] ABS Consulting. Marine Safety: Tools for Risk-Based Decision Making. Rowman & Littlefield, 2002.

[19] Robyn M Dawes. The robust beauty of improper linear models in decision making. American Psychologist, 34(7):571–582, 1979.

[20] Bradley Efron, Trevor Hastie, Iain Johnstone, and Robert Tibshirani. Least angle regression. The Annals of Statistics, 32(2):407–499, 2004.

[21] Jack Elzinga and Thomas G Moore. A central cutting plane algorithm for the convex programming problem. Mathematical Programming, 8(1):134–145, 1975.

[22] Sam Flanigan and Robert Morse. Methodology: Best business schools rankings, 2013. U.S. News & World Report.

[23] Christodoulos A Floudas. Nonlinear and Mixed-Integer Optimization: Fundamentals and Applications. Marcombo, 1995.

[24] Alex A Freitas. A critical review of multi-objective optimization in data mining: a position paper. ACM SIGKDD Explorations Newsletter, 6(2):77–86, 2004.

[25] Alex A Freitas. Comprehensible classification models: a position paper. ACM SIGKDD Explorations Newsletter, 15(1):1–10, March 2014.

[26] Alex A Freitas, Daniela C Wieser, and Rolf Apweiler. On the importance of comprehensible classification models for protein function prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 7(1):172–182, 2010.

[27] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.

[28] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. The Elements of Statistical Learning, volume 1. Springer Series in Statistics, 2001.

[29] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010. URL http://www.jstatsoft.org/v33/i01/.

[30] Glenn Fung, Sathyakama Sandilya, and R Bharat Rao. Rule extraction from linear support vector machines. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, pages 32–40. ACM, 2005.

[31] Brian F Gage, Amy D Waterman, William Shannon, Michael Boechler, Michael W Rich, and Martha J Radford. Validation of clinical classification schemes for predicting stroke. The Journal of the American Medical Association, 285(22):2864–2870, 2001.

[32] A M Geoffrion. Generalized Benders decomposition. Journal of Optimization Theory and Applications, 1972.

[33] Jean-Louis Goffin and Jean-Philippe Vial. Convex nondifferentiable optimization: A survey focused on the analytic center cutting plane method. Optimization Methods and Software, 17(5):805–867, 2002.

[34] Siong Thye Goh and Cynthia Rudin. Box drawings for learning with imbalanced data. arXiv preprint arXiv:1403.3378, 2014.

[35] Noam Goldberg and Jonathan Eckstein. Sparse weighted voting classifier selection and its linear programming relaxations. Information Processing Letters, 112:481–486, 2012.

[36] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182, 2003.

[37] Trevor Hastie, Saharon Rosset, Robert Tibshirani, and Ji Zhu. The entire regularization path for the support vector machine. Journal of Machine Learning Research, 5(2):1391, 2005.

[38] Anne-Claire Haury, Pierre Gestraud, and Jean-Philippe Vert. The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures. PLoS ONE, 6(12):e28210, 2011.

[39] John R Hauser, Olivier Toubia, Theodoros Evgeniou, Rene Befurt, and Daria Dzyabura. Disjunctions of conjunctions, cognitive simplicity, and consideration sets. Journal of Marketing Research, 47(3):485–496, 2010.

[40] Tim Hesterberg, Nam Hee Choi, Lukas Meier, and Chris Fraley. Least angle and ℓ1 penalized regression: A review. Statistics Surveys, 2:61–93, 2008.

[41] Robert C. Holte. Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11(1):63–91, 1993.

[42] Johan Huysmans, Karel Dejaeger, Christophe Mues, Jan Vanthienen, and Bart Baesens. An empirical evaluation of the comprehensibility of decision table, tree and rule based predictive models. Decision Support Systems, 51(1):141–154, 2011.

[43] D Jennings, TM Amabile, and L Ross. Informal covariation assessment: Data-based vs. theory-based judgments. Judgment under Uncertainty: Heuristics and Biases, pages 211–230, 1982.

[44] Ulf Johansson and Lars Niklasson. Evolving decision trees using oracle guides. In Computational Intelligence and Data Mining, 2009. CIDM'09. IEEE Symposium on, pages 238–244. IEEE, 2009.

[45] William A Knaus, Jack E Zimmerman, Douglas P Wagner, Elizabeth A Draper, and Diane E Lawrence. APACHE—acute physiology and chronic health evaluation: a physiologically based classification system. Critical Care Medicine, 9(8):591–597, 1981.

[46] William A Knaus, Elizabeth A Draper, Douglas P Wagner, and Jack E Zimmerman. APACHE II: a severity of disease classification system. Critical Care Medicine, 13(10):818–829, 1985.

[47] William A Knaus, DP Wagner, EA Draper, JE Zimmerman, Marilyn Bergner, PG Bastos, CA Sirio, DJ Murphy, T Lotring, and A Damiano. The APACHE III prognostic system. Risk prediction of hospital mortality for critically ill hospitalized adults. Chest Journal, 100(6):1619–1636, 1991.

[48] Y Kodratoff. The comprehensibility manifesto. KDD Nugget Newsletter, 94(9), 1994.

[49] Ron Kohavi. The power of decision tables. In Machine Learning: ECML-95, pages 174–189. Springer, 1995.

[50] Ron Kohavi and George H John. Wrappers for feature subset selection. Artificial Intelligence, 97(1):273–324, 1997.

[51] Ron Kohavi and Dan Sommerfield. Targeting business users with decision table classifiers. In KDD, pages 249–253, 1998.

[52] Jean-Roger Le Gall, Philippe Loirat, Annick Alperovitch, Paul Glaser, Claude Granthil, Daniel Mathieu, Philippe Mercier, Remi Thomas, and Daniel Villers. A simplified acute physiology score for ICU patients. Critical Care Medicine, 12(11):975–977, 1984.

[53] Jean-Roger Le Gall, Stanley Lemeshow, and Fabienne Saulnier. A new simplified acute physiology score (SAPS II) based on a European/North American multicenter study. The Journal of the American Medical Association, 270(24):2957–2963, 1993.

[54] Benjamin Letham, Cynthia Rudin, Tyler H. McCormick, and David Madigan. An interpretable stroke prediction model using rules and Bayesian analysis. In Proceedings of AAAI Late Breaking Track, 2013.

[55] A Yu Levin. On an algorithm for the minimization of convex functions. In Soviet Mathematics Doklady, volume 160, pages 1244–1247, 1965.

[56] Richard W Light, M Isabelle Macgregor, Peter C Luchsinger, and Wilmot C Ball. Pleural effusions: the diagnostic separation of transudates and exudates. Annals of Internal Medicine, 77(4):507–513, 1972.

[57] GY Lip, R Nieuwlaat, R Pisters, DA Lane, and HJ Crijns. Refining clinical risk stratification for predicting stroke and thromboembolism in atrial fibrillation using a novel risk factor-based approach: the Euro Heart Survey on atrial fibrillation. Chest, 137:263–272, 2010.

[58] Han Liu and Jian Zhang. Estimation consistency of the group lasso and its applications. In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, 2009.

[59] Thomas L Magnanti and Richard T Wong. Accelerating Benders decomposition: Algorithmic enhancement and model selection criteria. Operations Research, 29(3):464–484, 1981.

[60] Olvi L Mangasarian. Misclassification minimization. Journal of Global Optimization, 5(4):309–323, 1994.

[61] KZ Mao. Fast orthogonal forward selection algorithm for feature subset selection. IEEE Transactions on Neural Networks, 13(5):1218–1224, 2002.

[62] KZ Mao. Orthogonal forward selection and backward elimination algorithms for feature subset selection. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 34(1):629–634, 2004.

[63] David Martens. Building acceptable classification models for financial engineering applications: thesis summary. ACM SIGKDD Explorations Newsletter, 10(2):30–31, 2008.

[64] David Martens and Bart Baesens. Building acceptable classification models. In Data Mining, pages 53–74. Springer, 2010.

[65] David Martens, Bart Baesens, Tony Van Gestel, and Jan Vanthienen. Comprehensible credit scoring models using rule extraction from support vector machines. European Journal of Operational Research, 183(3):1466–1476, 2007.

[66] David Martens, Jan Vanthienen, Wouter Verbeke, and Bart Baesens. Performance of classification models from a user perspective. Decision Support Systems, 51(4):782–793, 2011.

[67] Nicolai Meinshausen et al. Node harvest. The Annals of Applied Statistics, 4(4):2049–2072, 2010.

[68] Alan J Miller. Selection of subsets of regression variables. Journal of the Royal Statistical Society. Series A (General), pages 389–425, 1984.

[69] Rui P Moreno, Philipp GH Metnitz, Eduardo Almeida, Barbara Jordan, Peter Bauer, Ricardo Abizanda Campos, Gaetano Iapichino, David Edbrooke, Maurizia Capuzzo, and Jean-Roger Le Gall. SAPS 3 – from evaluation of the patient to evaluation of the intensive care unit. Part 2: Development of a prognostic model for hospital mortality at ICU admission. Intensive Care Medicine, 31(10):1345–1355, 2005.

[70] David A Morrow, Elliott M Antman, Andrew Charlesworth, Richard Cairns, Sabina A Murphy, James A de Lemos, Robert P Giugliano, Carolyn H McCabe, and Eugene Braunwald. TIMI risk score for ST-elevation myocardial infarction: a convenient, bedside, clinical score for risk assessment at presentation: an intravenous nPA for treatment of infarcting myocardium early II trial substudy. Circulation, 102(17):2031–2037, 2000.

[71] Joe Naoum-Sawaya and Samir Elhedhli. An interior-point Benders based branch-and-cut algorithm for mixed integer programs. Annals of Operations Research, 210(1):33–55, November 2010.

[72] D J Newman. Location of the maximum on unimodal surfaces. Journal of the ACM (JACM), 12(3):395–398, July 1965.

[73] Panos M Pardalos and H Edwin Romeijn. Handbook of Optimization in Medicine, volume 5. Springer, 2009.

[74] MJ Pazzani, S Mani, and WR Shankle. Acceptance of rules generated by machine learning among medical experts. Methods of Information in Medicine, 40(5):380–385, 2001.

[75] J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.

[76] J Ross Quinlan. Simplifying decision trees. International Journal of Human-Computer Studies, 51(2):497–510, 1999.

[77] John Ross Quinlan. C4.5: Programs for Machine Learning, volume 1. Morgan Kaufmann, 1993.

[78] JH Ranson, KM Rifkind, DF Roses, SD Fink, K Eng, FC Spencer, et al. Prognostic signs and the role of operative management in acute pancreatitis. Surgery, Gynecology & Obstetrics, 139(1):69, 1974.

[79] Greg Ridgeway. The pitfalls of prediction. NIJ Journal, National Institute of Justice, 271:34–40, 2013.

[80] Ronald L Rivest. Learning decision lists. Machine Learning, 2(3):229–246, 1987.

[81] Paul A. Rubin. Heuristic solution procedures for a mixed-integer programming discriminant model. Managerial and Decision Economics, 11:255–266, 1990.

[82] Paul A. Rubin. Solving mixed integer classification problems by decomposition. Annals of Operations Research, 74:51–64, 1997.

[83] Stefan Rüping. Learning interpretable models. PhD thesis, Universität Dortmund, 2006.

[84] Mark Schwabacher and Pat Langley. Discovering communicable scientific knowledge from spatio-temporal data. In ICML, pages 489–496, 2001.

[85] Edgar Sommer. Theory Restructuring: A Perspective on Design and Maintenance of Knowledge Based Systems. PhD thesis, Universität Dortmund, 1996.

[86] David Steinhart. Juvenile detention risk assessment: A practice guide to juvenile detention reform. Juvenile Detention Alternatives Initiative, a project of the Annie E. Casey Foundation, 2006. Retrieved April 28, 2011.

[87] Girish H Subramanian, John Nosek, Sankaran P Raghunathan, and Santosh S Kanitkar. A comparison of the decision table and tree. Communications of the ACM, 35(1):89–94, 1992.

[88] Hongmao Sun. An accurate and interpretable Bayesian classification model for prediction of hERG liability. ChemMedChem, 1(3):315–322, 2006.

[89] SP Tarasov, LG Khachiyan, and I Erlikh. The method of inscribed ellipsoids. In Soviet Mathematics Doklady, volume 37, pages 226–230, 1988.

[90] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

[91] Michael E Tipping. Sparse Bayesian learning and the relevance vector machine. The Journal of Machine Learning Research, 1:211–244, 2001.

[92] G G Towell and J W Shavlik. Extracting refined rules from knowledge-based neural networks. Machine Learning, 1993.

[93] Alan Turing. Intelligent machinery (1948). B. Jack Copeland, page 395, 2004.

[94] Paul E Utgoff. Incremental induction of decision trees. Machine Learning, 4(2):161–186, 1989.

[95] Anneleen Van Assche and Hendrik Blockeel. Seeing the forest through the trees: Learning a comprehensible model from an ensemble. In Machine Learning: ECML 2007, pages 418–429. Springer, 2007.

[96] Vladimir Vapnik. Statistical Learning Theory. Wiley, New York, 1998.

[97] Wouter Verbeke, David Martens, Christophe Mues, and Bart Baesens. Building comprehensible customer churn prediction models with advanced rule induction techniques. Expert Systems with Applications, 38(3):2354–2364, 2011.

[98] Christopher Webster. Risk assessment: Actuarial instruments & structured clinical guides, 2013.

[99] Christopher D Webster and Derek Eaves. The HCR-20 scheme: The assessment of dangerousness and risk. Mental Health, Law and Policy Institute, Department of Psychology, Simon Fraser University and Forensic Psychiatric Services Commission of British Columbia, 1995.

[100] Philip S Wells, David R Anderson, Janis Bormanis, Fred Guy, Michael Mitchell, Lisa Gray, Cathy Clement, K Sue Robinson, Bernard Lewandowski, et al. Value of assessment of pretest probability of deep-vein thrombosis in clinical management. Lancet, 350(9094):1795–1798, 1997.

[101] Philip S Wells, David R Anderson, Marc Rodger, Jeffrey S Ginsberg, Clive Kearon, Michael Gent, AG Turpie, Janis Bormanis, Jeffrey Weitz, Michael Chamberlain, et al. Derivation of a simple clinical model to categorize patients probability of pulmonary embolism: increasing the models utility with the SimpliRED D-dimer. Thrombosis and Haemostasis, 83(3):416–420, 2000.

[102] Lu Xu and Wen-Jun Zhang. Comparison of different methods for variable selection. Analytica Chimica Acta, 446(1):475–481, 2001.

[103] Peng Zhao and Bin Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7(2):25–41, 2007.

[104] Hui Zou and Trevor Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.