Optimal Decision Trees

Kristin P. Bennett and Jennifer A. Blue
Department of Mathematical Sciences
Rensselaer Polytechnic Institute
Troy, NY 12180

R.P.I. Math Report No. 214

Abstract

We propose an Extreme Point Tabu Search (EPTS) algorithm that constructs globally optimal decision trees for classification problems. Typically, decision tree algorithms are greedy. They optimize the misclassification error of each decision sequentially. Our non-greedy approach minimizes the misclassification error of all the decisions in the tree concurrently. Using Global Tree Optimization (GTO), we can optimize existing decision trees. This capability can be used in classification and data mining applications to avoid overfitting, transfer knowledge, incorporate domain knowledge, and maintain existing decision trees. Our method works by fixing the structure of the decision tree and then representing it as a set of disjunctive linear inequalities. An optimization problem is constructed that minimizes the errors within the disjunctive linear inequalities. To reduce the misclassification error, a nonlinear error function is minimized over a polyhedral region. We show that it is sufficient to restrict our search to the extreme points of the polyhedral region. A new EPTS algorithm is used to search the extreme points of the polyhedral region for an optimal solution. Promising computational results are given for both randomly generated and real-world problems.

Key Words: decision trees, tabu search, classification, machine learning, global optimization.

Knowledge Discovery and Data Mining Group, Department of Mathematical Sciences, Rensselaer Polytechnic Institute, Troy, NY 12180. Email [email protected], [email protected]. This material is based on research supported by National Science Foundation Grant 949427. 

1 Introduction

Decision trees have proven to be a very effective technique for classification problems. A training set, consisting of samples of points with n attributes from each class, is given. Then a decision tree is constructed to discriminate between these sets. The new decision tree is used to classify future points. The tree can be interpreted as rules for membership in the classes. Since decision trees have a readily interpretable logical structure, they provide insight into the characteristics of the classes.

We propose a non-greedy, non-parametric approach to constructing multivariate decision trees that is fundamentally different from greedy approaches. Popular decision tree algorithms are greedy. Greedy univariate algorithms such as CART [9] and C4.5 [30] construct a decision based on one attribute at a time. Greedy multivariate decision tree algorithms such as those in [2, 10, 24] construct decisions one at a time using linear combinations of all the attributes. In each case, an optimization problem is solved at each decision node to determine the locally best decision. The decision divides the attribute space into two or more regions. Decisions are constructed recursively for each of the regions. This process is repeated until the points in each region (a leaf of the tree) are all, or almost all, of the same class. Using this approach, it is possible to construct a tree to discriminate between any two disjoint sets. However, the resulting tree may be overparameterized and thus may not reflect the underlying characteristics of the data set. When overfitting occurs, the tree may not classify future points well. To avoid this problem, heuristics are applied to prune decisions from the tree [29, 22].

In this paper, optimization techniques are used to minimize the error of the entire decision tree. Our global approach is analogous to the widely used back-propagation algorithm for constructing neural networks [31]. For a neural network, one specifies an initial structure, the number of units, and their interconnections. An error function which measures the error of the neural network is then constructed. The decisions in a multivariate decision tree are the same as linear threshold units in a neural network. In global tree optimization (GTO), an initial structure, the number of decision nodes, and the class of the leaf nodes are specified. We propose several possible error functions that measure the error of the entire decision tree. One of two different optimization methods, Frank-Wolfe and Extreme Point Tabu Search, can be used to minimize the error of the tree.

GTO combines the benefits of decision trees and neural network methods. The great strength and challenge of decision trees is that they are logically interpretable but constructing them is combinatorially difficult. Typically, greedy trees are constructed one decision at a time starting at the root. Locally good but globally poor choices of the decisions at each node, however, can result in excessively large trees that do not reflect the underlying structure of the data. Pruning may not be sufficient to compensate for overfitting. This problem is readily seen in multivariate decision trees: the pruning process frequently produces a tree consisting of a single decision [2, 5, 33]. Univariate algorithms appear less susceptible to this problem. Murthy and Salzberg found that greedy heuristics worked well and lookahead algorithms offered little improvement [25, 26].
We believe that this is because univariate decision trees have only one degree of freedom at each decision. The problem with univariate trees, however, is that many decisions are required to represent simple linear relations. Multivariate decisions are much more powerful, but greedy methods are more easily led astray. In GTO, we search for the best multivariate tree with a given structure. Fixing the structure prevents the method from overfitting the data. An example of this can be seen in Figure 1. The algorithm C4.5 [30] and the GTO algorithm were both applied to this sample data set of 250 points in two dimensions. The C4.5 algorithm used six univariate decisions and still was unable to correctly classify all of the points. GTO applied to this data set with the initial tree structure of Figure 2 correctly classified all the points.

Figure 1: Performance of C4.5 and GTO on a sample problem.

Another benefit of GTO is that existing trees may be optimized. Typically, greedy methods must reconstruct a tree from scratch each time the algorithm is run. The ability to optimize an existing tree can be very useful in classification and data mining applications. Domain knowledge can be incorporated into the initial tree and the tree can then be optimized. Knowledge transfer can be achieved by starting with a tree from a related problem or by updating an existing tree when new data becomes available. GTO can be used to prune and restructure a tree initially constructed via a greedy method. The error function used within GTO can be customized to meet a client's particular needs. This flexibility of our method is very promising; we are just beginning to explore these possibilities. Since GTO is non-parametric, it is fundamentally different from the few prior non-greedy decision tree algorithms. Unlike the Bayesian or parametric approaches [32, 11, 18, 19], the strict logical rules of the tree are maintained.

GTO is based on the idea of formulating a decision tree as a set of disjunctive linear inequalities. In Section 2, we show that the underlying problem of finding a decision tree with fixed structure that completely classifies two sets is equivalent to finding a solution of a set of disjunctive linear inequalities. In Section 3, we propose objective functions that can be used to minimize the error within the disjunctive inequalities. The problem thus becomes a nonlinear nonconvex optimization problem over a polyhedral region. In Section 4, we show that there exist extreme point solutions for the proposed optimization problems which are as good as or better than the optimal solutions in terms of the number of points misclassified. In Sections 5 and 6 we propose two optimization algorithms for minimizing the error in the disjunctive inequalities by traversing the extreme point solutions.

Figure 2: A multivariate decision tree with three decisions. Decision node i applies the test w^i x − γ_i ≤ 0 (left branch) or w^i x − γ_i > 0 (right branch), i = 1, 2, 3; leaves 4 and 6 are labeled A, and leaves 5 and 7 are labeled B.

The Frank-Wolfe algorithm (FW) is a descent technique that stops at the first local minimum it encounters. Extreme point tabu search (EPTS) is a global optimization technique. While it is not guaranteed to find the global minimum, it is much more robust than the Frank-Wolfe algorithm. In addition, the Frank-Wolfe algorithm is limited to differentiable objective functions. For EPTS, the objective function need not be differentiable or even continuous. Very strong computational results for both of these algorithms are given in Section 7.

We will use the following notation. For a vector x in the c-dimensional real space R^c, (x)_i denotes the i-th component of x, and x_+ denotes the vector in R^c with components (x_+)_i := max{x_i, 0}, i = 1, ..., c. The dot product of two vectors x and w is indicated by xw; the outer product is never used.

2 Decision Trees as Disjunctive Inequalities

Our goal is to formulate an optimization problem that can be used to construct a decision tree with given structure that recognizes points from two or more classes. The key idea is that a given multivariate decision tree can be represented as a set of disjunctive linear inequalities [7, 3]. If we can find a solution to the set of disjunctive linear inequalities, we have found a tree that correctly classifies all the points. Note that the structure of the tree must be fixed, but the actual decisions to be used are not. The correspondence of the decision tree to the disjunctive inequalities is conceptually simple but notationally intricate. Consider the decision tree with three decisions shown in Figure 2. Each decision in the tree corresponds to a linear inequality. As a point traverses a path from the root of the tree to a classification leaf, each decision corresponds to an inequality that must be satisfied. Several leaves may belong to a given class. If a point satisfies any one of the sets of inequalities corresponding to a leaf of the appropriate class, then it is correctly classified. Thus we would like a set of disjunctive inequalities to be satisfied.


Figure 3: A multivariate decision tree with one decision. The decision applies the test wx − γ ≤ 0 (left branch, leaf A) or wx − γ > 0 (right branch, leaf B).

2.1 Sample Problems

In this paper, we restrict our discussion to the two-class discrimination problem using a binary multivariate decision tree. These results can be generalized to problems with three or more classes. Let there be two sets A and B containing m_A and m_B points, respectively. Each point has n real-valued attributes. If the attributes are not real-valued, the techniques used in neural networks for mapping symbolic attributes into real attributes can be used. Each decision consists of a linear combination of attributes. For the simple case of one decision as seen in Figure 3, let A_i be the ith point in A, w be the weights of the decision, and γ be the threshold. If A_i w − γ < 0 then the point follows the left branch of the tree. If A_i w − γ > 0 then the point follows the right branch of the tree. If A_i w − γ = 0 we use the convention that the point follows the left branch; we consider this situation undesirable and count it as an error when training. We will first investigate the simple cases of decision trees consisting of one decision and three decisions. We will then give the formulation for general multivariate binary decision trees.
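As a concrete illustration of the symbolic-to-real mapping mentioned above, the short sketch below uses a simple 1-of-k (one-hot) coding. This example is not from the paper; the attribute values and the choice of coding are assumptions made purely for illustration.

```python
# A minimal sketch, assuming one-hot (1-of-k) coding, of mapping a symbolic
# attribute into real-valued attributes. The attribute values are made up.
import numpy as np

def one_hot(values, categories):
    """Encode each symbolic value as a 0/1 indicator vector over `categories`."""
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)))
    for row, v in enumerate(values):
        encoded[row, index[v]] = 1.0
    return encoded

print(one_hot(["red", "blue", "red"], categories=["red", "green", "blue"]))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [1. 0. 0.]]
```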

2.1.1 Decision Tree With One Decision

Consider the simplest case, a two-class decision tree consisting of a single decision in which the left leaf is labeled A and the right leaf is labeled B. This corresponds to a linear discriminant such as the one shown in Figure 3. Let w be the weights of the decision and γ be the threshold. The decision accurately classifies all the points in A if there exist w and γ such that A_i w − γ < 0 for all points i = 1, ..., m_A. The decision accurately classifies all the points in B if B_j w − γ > 0 for all points j = 1, ..., m_B. We now remove the strict inequalities by adding a constant term. The inequalities become

A_i w − γ ≤ −1,  i = 1, ..., m_A    and    B_j w − γ ≥ 1,  j = 1, ..., m_B        (1)

We can easily determine in polynomial time whether such a w and γ exist using a single linear program [6]. Notice that since there is no scaling on w and γ, the constant (in this case 1) may be any positive number.
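As a rough sketch of how the feasibility of inequalities (1) can be checked with a single linear program, the code below sets up the constraints A_i w − γ ≤ −1 and B_j w − γ ≥ 1 and solves a feasibility LP. The paper does not prescribe a solver; the use of scipy.optimize.linprog and the small data matrices here are assumptions for illustration only.

```python
# Minimal sketch (not the paper's code): feasibility of inequalities (1)
# via a single linear program. The points in A and B are made-up examples.
import numpy as np
from scipy.optimize import linprog

A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])   # points of class A (rows)
B = np.array([[3.0, 3.0], [4.0, 2.0], [2.0, 4.0]])   # points of class B (rows)
n = A.shape[1]

# Decision variables: the n weights w followed by the threshold gamma.
# Constraints:  A_i w - gamma <= -1   and   -(B_j w - gamma) <= -1.
A_ub = np.vstack([np.hstack([A, -np.ones((len(A), 1))]),
                  np.hstack([-B, np.ones((len(B), 1))])])
b_ub = -np.ones(len(A) + len(B))

res = linprog(c=np.zeros(n + 1), A_ub=A_ub, b_ub=b_ub,
              bounds=[(None, None)] * (n + 1))
if res.success:
    w, gamma = res.x[:n], res.x[n]
    print("separating decision found:", w, gamma)
else:
    print("no single linear decision separates A and B")
```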


2.1.2 Decision Tree With Three Decisions

Now consider the problem of a tree with multiple decisions such as the three-decision tree in Figure 2. In order for the points in A to be correctly classified at leaf nodes 4 or 6, the following inequalities must be satisfied:

{A_i w^1 − γ_1 ≤ −1  and  A_i w^2 − γ_2 ≤ −1}   or   {A_i w^1 − γ_1 ≥ 1  and  A_i w^3 − γ_3 ≤ −1},   i = 1, ..., m_A        (2)

In order for the points in B to be correctly classified at leaf nodes 5 or 7, the following inequalities must be satisfied:

{B_j w^1 − γ_1 ≤ −1  and  B_j w^2 − γ_2 ≥ 1}   or   {B_j w^1 − γ_1 ≥ 1  and  B_j w^3 − γ_3 ≥ 1},   j = 1, ..., m_B        (3)

Thus if w^i and γ_i, i = 1, 2, 3, exist that satisfy disjunctive inequalities (2) and (3), the tree will correctly classify all the points.
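The following sketch shows how the disjunctive inequalities (2) and (3) can be evaluated for the fixed three-decision tree of Figure 2. It is illustrative only: the weights, thresholds, and test points are made-up values, not output of the GTO method.

```python
# Illustrative sketch: routing a point through the Figure 2 tree and checking
# disjunction (2). W, gamma, and the sample points are hypothetical values.
import numpy as np

def classify(x, W, gamma):
    """Route a point x through the Figure 2 tree.
    W[i], gamma[i] hold decision node i+1; leaves 4-7 map to A, B, A, B."""
    if np.dot(W[0], x) - gamma[0] <= 0:                              # node 1
        return "A" if np.dot(W[1], x) - gamma[1] <= 0 else "B"       # leaves 4, 5
    else:
        return "A" if np.dot(W[2], x) - gamma[2] <= 0 else "B"       # leaves 6, 7

def correctly_classified_A(a, W, gamma):
    """Disjunction (2): point a of class A satisfies the path to leaf 4 or leaf 6."""
    leaf4 = np.dot(W[0], a) - gamma[0] <= -1 and np.dot(W[1], a) - gamma[1] <= -1
    leaf6 = np.dot(W[0], a) - gamma[0] >= 1 and np.dot(W[2], a) - gamma[2] <= -1
    return leaf4 or leaf6

W = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])  # hypothetical decision weights
gamma = np.array([2.0, 1.0, -1.0])                   # hypothetical thresholds
print(classify(np.array([0.5, 0.5]), W, gamma))            # -> A
print(correctly_classified_A(np.array([0.5, -0.5]), W, gamma))  # -> True
```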

2.2 General Case

This same approach can be used for any binary multivariate decision tree with fixed structure. The trees may be of arbitrary depth and need not be symmetric. First the structure of the tree must be specified. Let the tree have D decision nodes and D + 1 leaf nodes. Assume the decision nodes are numbered 1 through D, and the leaf nodes are numbered D + 1 through 2D + 1. Let A be the set of leaf nodes classified as A and let B be the set of leaf nodes classified as B. Consider the path traversed from the root to a leaf node. For each leaf node k, define G_k as the index set of the decisions in which the right or "greater than" branch is traversed to reach leaf k. For each leaf node k, define L_k as the index set of the decisions in which the left or "less than" branch is traversed to reach leaf k. For example, the three-decision tree of Figure 2 has the following values:

A = {4, 6}   B = {5, 7}
L_4 = {1, 2}   L_5 = {1}   L_6 = {3}   L_7 = ∅
G_4 = ∅   G_5 = {2}   G_6 = {1}   G_7 = {1, 3}        (4)

A point is correctly classified in A if all the inequalities from the root to a leaf node are satisfied for some leaf node of class A. Similarly, a point is correctly classified in B if all the inequalities from the root to a leaf node of class B are satisfied. We define the correct classification of all the points as follows.
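The index sets L_k and G_k of equation (4) can be computed mechanically from the fixed tree structure. The sketch below does this for the Figure 2 tree; the child-pointer encoding of the tree is an assumption introduced for this example, not notation from the paper.

```python
# Illustrative sketch (not from the paper): deriving the index sets L_k and G_k
# of equation (4) from a fixed tree structure. Decision nodes are 1..D and
# leaf nodes are D+1..2D+1, as in Section 2.2.

# Figure 2 tree: node 1 -> (left 2, right 3), node 2 -> (leaves 4, 5),
# node 3 -> (leaves 6, 7). The child-pointer dict is an assumed encoding.
children = {1: (2, 3), 2: (4, 5), 3: (6, 7)}
D = len(children)

def index_sets(children, D):
    """Return dicts L and G mapping each leaf k to the decision indices whose
    left ("less than") or right ("greater than") branch is taken to reach k."""
    L = {k: set() for k in range(D + 1, 2 * D + 2)}
    G = {k: set() for k in range(D + 1, 2 * D + 2)}

    def walk(node, lefts, rights):
        if node > D:                      # reached a leaf
            L[node], G[node] = set(lefts), set(rights)
            return
        left_child, right_child = children[node]
        walk(left_child, lefts | {node}, rights)
        walk(right_child, lefts, rights | {node})

    walk(1, set(), set())
    return L, G

L, G = index_sets(children, D)
print(L)   # {4: {1, 2}, 5: {1}, 6: {3}, 7: set()}
print(G)   # {4: set(), 5: {2}, 6: {1}, 7: {1, 3}}
```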

Definition 2.1 (Correct Classification of Points by a Given Tree) The set of disjunctive inequalities that must be satisfied in order to correctly classify each of the points in A for a tree with fixed structure is given by:

⋁_{k ∈ A} { A_i w^d − γ_d ≤ −1 for all d ∈ L_k,  and  A_i w^d − γ_d ≥ 1 for all d ∈ G_k },   i = 1, ..., m_A