Optimization Methods and Software Vol. 18, No. 4, August 2003, pp. 453–473

MULTIPLE CRITERIA LINEAR PROGRAMMING APPROACH TO DATA MINING: MODELS, ALGORITHM DESIGNS AND SOFTWARE DEVELOPMENT

GANG KOU (a), XIANTAO LIU (b), YI PENG (a), YONG SHI (a,*), MORGAN WISE (c) and WEIXUAN XU (d)

(a) College of Information Science and Technology, University of Nebraska at Omaha, Omaha, NE 68182, USA; (b) School of Business Administration, Southwest Petroleum Institute, Chengdu, Sichuan 610500, China; (c) First National Bank of Omaha, 1620 Dodge Street Stop 3103, Omaha, NE 68197, USA; (d) Institute of Policy and Management, Chinese Academy of Sciences, Beijing 100080, China

* Corresponding author.

(Received November 2002; Revised April 2003)

It is well known that data mining has been implemented with statistical regression, induction decision trees, neural networks, rough sets, fuzzy sets, etc. This paper promotes a multiple criteria linear programming (MCLP) approach to data mining based on linear discriminant analysis. The paper first describes the fundamental connections between MCLP and data mining, including several general models of MCLP approaches. Given the general models, it then focuses on the design of an architecture of MCLP-data mining algorithms in terms of a process of real-life business intelligence. This architecture consists of finding MCLP solutions, preparing mining scores, and interpreting the knowledge patterns. Secondly, the paper elaborates on the software development of the MCLP-data mining algorithms. Based on pseudo-code, two versions of the software (SAS- and Linux-platform) are discussed. Finally, a software performance analysis over business and experimental databases is reported to show its mining and prediction power. As part of the performance analysis, a series of data testing comparisons between the MCLP and induction decision tree approaches is demonstrated. These findings suggest that the MCLP-data mining techniques have great potential in discovering knowledge patterns from a large-scale real-life database or data warehouse.

Keywords: Data mining; Multi-criteria linear programming; Classification; Algorithm; Software

1 INTRODUCTION

The advent of data mining is due to the data explosion and the imminent need for turning such data into useful information and knowledge. In the last twenty years, the volume of available data has increased exponentially because of the extensive use of electronic data gathering devices, such as point-of-sale and remote sensing devices [1]. Although there exist huge amounts of data, they are not helpful in decision support without analysis.
As the demand for more customer information increases, the need for advanced techniques to analyze high volumes of data increases correspondingly. Data mining offers a promising approach.

Data mining refers to extracting or mining knowledge from large amounts of data [2,3]. From a historical perspective, data mining is the result of the evolution of information technology. This evolution can be divided into four major steps: data collection, data access, data warehousing, and data mining [4]. In the U.S., as early as the 1960s, raw business data were collected and converted into various kinds of business information. Pharmaceutical companies have been applying data analysis tools to transform biomedical data into clinical insights and convert intellectual property from discoveries into effective drug treatments. From the late 1980s, a number of major financial and insurance companies began to develop and use relational database systems, data modeling tools, and query languages for their business strategies and intelligence. In promotion campaigns, the sales department of a retail company can build a database containing all kinds of metadata regarding customers' preferences for products, the history of credit card transactions, discount coupons, promotion prize drawings, and so on.

Data mining techniques consist of four stages: (i) selecting, (ii) transforming, (iii) mining, and (iv) interpreting [3]. A database contains various data, not all of which relate to the data-mining goal. Therefore, the related data have to be selected first. The data selection identifies the available data in the database and then extracts a subset of the available data as the data of interest for further analysis. After data selection, the data are transformed into forms appropriate for mining. According to the nature of the data, data transformation can involve various techniques, such as smoothing, aggregation, and generalization. Smoothing is a form of data cleaning; it helps to remove the ''noise'' (a random error or variance) from the data. While aggregation summarizes the data, generalization replaces low-level data by higher-level concepts. Both selecting and transforming are known as the process of data warehousing. In the mining stage, the transformed data are mined using data mining techniques. These techniques have been developed over decades in research areas such as statistics, artificial intelligence, mathematics, machine learning and so on [2,3]. Finally, the data interpretation provides the analysis of the mined data with respect to the data mining tasks and goals. This stage assimilates knowledge from different mined data. The situation is similar to doing a ''puzzle'': the mined data are just like puzzle pieces, and how to put them together for a business purpose depends on the business analysts and decision makers (such as managers or CEOs).

From the aspect of methodology, data mining can be performed through association, classification, clustering, prediction, sequential patterns, and similar time sequences [5]. For example, in classification, data mining algorithms use the existing data to learn functions that map each item of the selected data into a set of predefined classes. Given such a set, a number of attributes, and a ''learning (or training) set'', these algorithms are used to predict the class of other unclassified data of the learning set. Two key research problems related to classification are the evaluation of misclassification (i.e., the accuracy of classification) and predictive power.
Among various mathematical tools, including statistics, binary decision trees, fuzzy sets and neural networks, linear programming was introduced into classification more than twenty years ago [6]. Given a set of classes and a set of attribute variables, one can use a linear programming model to define a related boundary value (or variables) separating the classes. Each class is then represented by a group of constraints with respect to a boundary in the linear program. The objective function minimizes the overlapping rate of the classes or maximizes the distance between the classes [6]. The linear programming approach results in an optimal classification. It is also flexible enough to construct effective models for multi-class problems. However, the single objective linear programming approach cannot reflect the best tradeoff between the overlapping and the distance of the data classes, which together represent the collective rate of ''misclassification'' in the linear programming approach.


The developing approach of multiple criteria linear programming (MCLP) to data mining promises to overcome this disadvantage [7,8].

The purpose of this paper is to introduce the fundamentals of MCLP-data mining models and the interfaces between the algorithm designs and the software development. The paper proceeds as follows. Section 2 describes the fundamental connections between MCLP and data mining, including several general models of MCLP approaches. Given the general models, Section 3 focuses on the architecture of MCLP-data mining algorithms in terms of a process of real-life business intelligence. This architecture consists of finding MCLP solutions, preparing mining scores, and interpreting the knowledge patterns. Section 4 elaborates on the software development of the MCLP-data mining algorithms. Based on pseudo-code, two versions of the software (SAS- and Linux-platform) are discussed. Section 5 reports the experimental results of software performance over a real large-scale business database to show the MCLP method's mining and prediction power. As part of the performance analysis, a series of data testing comparisons between the MCLP and induction decision tree approaches is demonstrated. Finally, Section 6 summarizes the paper with some remarks on the MCLP-data mining approach.

2 MODELS OF MULTIPLE CRITERIA LINEAR PROGRAMMING CLASSIFICATION

A general problem of data classification by using multiple criteria linear programming can be described as follows [9]: Given a set of r variables or attributes in a database a = (a_1, . . . , a_r), let A_i = (A_{i1}, . . . , A_{ir}) ∈ R^r be the sample observations of data for the variables, where i = 1, . . . , n and n is the sample size. If a given problem can be predefined as s different classes, G_1, . . . , G_s, then the boundary between the j-th and (j+1)-th classes can be b_j, j = 1, . . . , s − 1. We want to determine the coefficients for an appropriate subset of the variables, denoted by X = (x_1, . . . , x_r)^T ∈ R^r, and scalars b_j such that the separation of these classes can be described as follows:

Data separation:

A_i X ≤ b_1, ∀ A_i ∈ G_1;
b_{k−1} ≤ A_i X ≤ b_k, ∀ A_i ∈ G_k, k = 2, . . . , s − 1;
A_i X ≥ b_{s−1}, ∀ A_i ∈ G_s;

where ∀ A_i ∈ G_j, j = 1, . . . , s, means that the data case A_i belongs to the class G_j. In the data separation, A_i X is called the score of data case i, which is a linear combination of the weighted values of the attribute variables X. For example, in the case of credit card portfolio analysis, A_i X may represent the aggregated value of the i-th cardholder's score for his or her attributes of age, salary, education, and residency under consideration. Even though the boundary b_j is defined as a scalar in the above data separation, generally b_j may be treated as a ''variable'' in the formulation. However, if there is no feasible solution for the ''variable'' b_j in the real data analysis, it should be predetermined as a control parameter according to the experience of the analyst (see Example 1 and Section 5).

The quality of classification is measured by minimizing the total overlapping of data and maximizing the distances of every data case to its class boundary simultaneously. Let α_i^j be the overlapping degree with respect to data case A_i within G_j and G_{j+1}, and β_i^j be the distance from A_i within G_j and G_{j+1} to its adjusted boundaries. By incorporating both α_i^j and β_i^j into the separation inequalities, a multiple criteria linear programming (MCLP) classification model can be defined as:

(M1) Minimize Σ_i Σ_j α_i^j and Maximize Σ_i Σ_j β_i^j

Subject to:

A_i X = b_1 + α_i^1 − β_i^1, ∀ A_i ∈ G_1;                                              (1)
b_{k−1} − α_i^{k−1} + β_i^{k−1} = A_i X = b_k + α_i^k − β_i^k, ∀ A_i ∈ G_k, k = 2, . . . , s − 1;   (2)
A_i X = b_{s−1} − α_i^{s−1} + β_i^{s−1}, ∀ A_i ∈ G_s;                                  (3)
b_{k−1} + α_i^{k−1} ≤ b_k − α_i^k, k = 2, . . . , s − 1, i = 1, . . . , n;              (4)

where A_i are given; X and b_j are unrestricted; and α_i^j, β_i^j ≥ 0, for j = 1, . . . , s − 1, i = 1, . . . , n. Note that the constraints b_{k−1} + α_i^{k−1} ≤ b_k − α_i^k ensure the existence of the boundaries. As a graphical representation, a version of model (M1) for three predefined classes is given in Fig. 1.

FIGURE 1 A three-class MCLP model.

If minimizing the total overlapping of data, maximizing the distances of every data case to its class boundary, or a given combination of both criteria is considered separately, model (M1) is reduced to linear programming (LP) classification (known as linear discriminant analysis), which was initiated by Freed and Glover [6]. However, the single criterion LP cannot determine the ''best tradeoff'' of the two misclassification measurements. Therefore, model (M1) is potentially better than LP classification in identifying the best tradeoff of the misclassifications for data separation.

Although model (M1) can be theoretically solved by the MC-simplex method for all possible tradeoffs of both criteria functions [10], the available software such as Hao and Shi [11] still cannot handle a real-life database or data warehouse with a terabyte of data. To facilitate the computation on real-life data, a compromise solution approach [12–14] is employed to reform model (M1) for the ''best tradeoff'' between Σ_i Σ_j α_i^j and Σ_i Σ_j β_i^j. Let us assume the ''ideal values'' for the s − 1 classes of overlapping (−Σ_i α_i^1, . . . , −Σ_i α_i^{s−1}) to be (α_*^1, . . . , α_*^{s−1}) > 0, and the ''ideal values'' of (Σ_i β_i^1, . . . , Σ_i β_i^{s−1}) to be (β_*^1, . . . , β_*^{s−1}).

The selection of the ideal values depends on the nature and data format of the problem.

When −Σ_i α_i^j > α_*^j, we define the regret measure as −d_αj^+ = α_*^j + Σ_i α_i^j; otherwise, it is 0, where j = 1, . . . , s − 1. When −Σ_i α_i^j < α_*^j, we define the regret measure as d_αj^- = α_*^j + Σ_i α_i^j; otherwise, it is 0, where j = 1, . . . , s − 1. Thus, we have:

THEOREM 1
(i) α_*^j + Σ_i α_i^j = d_αj^- − d_αj^+;
(ii) |α_*^j + Σ_i α_i^j| = d_αj^- + d_αj^+; and
(iii) d_αj^-, d_αj^+ ≥ 0, j = 1, . . . , s − 1.

Similarly, we can derive:

COROLLARY 1
(i) β_*^j − Σ_i β_i^j = d_βj^- − d_βj^+;
(ii) |β_*^j − Σ_i β_i^j| = d_βj^- + d_βj^+; and
(iii) d_βj^-, d_βj^+ ≥ 0, j = 1, . . . , s − 1.

The proofs of Theorem 1 and Corollary 1 can be shown easily by using the statement of Ref. [14, pp. 84, 85]. Applying the above results to model (M1), it is reformulated as:

(M2) Minimize Σ_{j=1}^{s−1} (d_αj^- + d_αj^+ + d_βj^- + d_βj^+)

Subject to:

α_*^j + Σ_i α_i^j = d_αj^- − d_αj^+, j = 1, . . . , s − 1;                              (5)
β_*^j − Σ_i β_i^j = d_βj^- − d_βj^+, j = 1, . . . , s − 1;                              (6)
Equations (1), (2), (3), and (4);

where A_i, α_*^j, and β_*^j are given; X and b_j are unrestricted; and α_i^j, β_i^j, d_αj^-, d_αj^+, d_βj^-, d_βj^+ ≥ 0, for j = 1, . . . , s − 1, i = 1, . . . , n.

Once the adjusted boundaries b_{k−1} + α_i^{k−1} ≤ b_k − α_i^k, k = 2, . . . , s − 1, i = 1, . . . , n, are properly chosen (see Fig. 1), model (M2) relaxes the conditions of data separation so that it can consider as many overlapping data as possible in the classification process. We call model (M2) a ''weak separation formula''. With this motivation, we can build a ''medium separation formula'' on the absolute class boundaries in (M3) and a ''strong separation formula'', which contains as few overlapping data as possible, in (M4).
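Before stating (M3) and (M4), it may help to note (this restatement is ours, not part of the original derivation) that the deviation variables appearing in (5) and (6) are simply the positive and negative parts of the gaps between the ideal values and the attained criteria values:

\[
d_{\alpha j}^{-} = \max\Bigl\{0,\ \alpha_*^{j} + \sum_i \alpha_i^{j}\Bigr\}, \qquad
d_{\alpha j}^{+} = \max\Bigl\{0,\ -\Bigl(\alpha_*^{j} + \sum_i \alpha_i^{j}\Bigr)\Bigr\},
\]
\[
d_{\beta j}^{-} = \max\Bigl\{0,\ \beta_*^{j} - \sum_i \beta_i^{j}\Bigr\}, \qquad
d_{\beta j}^{+} = \max\Bigl\{0,\ -\Bigl(\beta_*^{j} - \sum_i \beta_i^{j}\Bigr)\Bigr\},
\]

from which Theorem 1(i)–(ii) and Corollary 1(i)–(ii) follow immediately.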

(M3) Minimize Σ_{j=1}^{s−1} (d_αj^- + d_αj^+ + d_βj^- + d_βj^+)

Subject to:

Equations (5) and (6);
A_i X = b_1 − β_i^1, ∀ A_i ∈ G_1;
b_{k−1} + β_i^{k−1} = A_i X = b_k − β_i^k, ∀ A_i ∈ G_k, k = 2, . . . , s − 1;
A_i X = b_{s−1} + β_i^{s−1}, ∀ A_i ∈ G_s;
b_{k−1} + ε ≤ b_k − α_i^k, k = 2, . . . , s − 1, i = 1, . . . , n;

where A_i, ε, α_*^j, and β_*^j are given; X and b_j are unrestricted; and α_i^j, β_i^j, d_αj^-, d_αj^+, d_βj^-, d_βj^+ ≥ 0, for j = 1, . . . , s − 1, i = 1, . . . , n.

(M4) Minimize Σ_{j=1}^{s−1} (d_αj^- + d_αj^+ + d_βj^- + d_βj^+)

Subject to:

Equations (5) and (6);
A_i X = b_1 − α_i^1 − β_i^1, ∀ A_i ∈ G_1;
b_{k−1} + α_i^{k−1} + β_i^{k−1} = A_i X = b_k − α_i^k − β_i^k, ∀ A_i ∈ G_k, k = 2, . . . , s − 1;
A_i X = b_{s−1} + α_i^{s−1} + β_i^{s−1}, ∀ A_i ∈ G_s;
b_{k−1} + α_i^{k−1} ≤ b_k − α_i^k, k = 2, . . . , s − 1, i = 1, . . . , n;

where A_i, α_*^j, and β_*^j are given; X and b_j are unrestricted; and α_i^j, β_i^j, d_αj^-, d_αj^+, d_βj^-, d_βj^+ ≥ 0, for j = 1, . . . , s − 1, i = 1, . . . , n.

A loosening relationship among models (M2), (M3), and (M4) is given as:

THEOREM 2
(i) If a data case A_i is classified in a given class G_j by model (M4), then it may be in G_j by using models (M3) and (M2).
(ii) If a data case A_i is classified in a given class G_j by model (M3), then it may be in G_j by using model (M2).

Proof It follows from the facts that, for a certain value of ε > 0, the feasible solutions of model (M4) are feasible solutions of models (M2) and (M3), and the feasible solutions of model (M3) are feasible solutions of model (M2).

Remark 1 Conceptually, the usefulness of these formulas should depend on the nature of a given database. If the database contains only a few overlapping data, model (M4) may be used. Otherwise, model (M3) or (M2) should be applied. In many real data analyses, we can always find a feasible solution for model (M2) if proper values of the boundaries b_j are chosen as control parameters. Given the conditions of data separation, it is not easier to find feasible solutions for models (M3) and/or (M4) than for model (M2). However, the precise theoretical relationship between the three models deserves further careful study.

Example 1 As an illustration, we use a small training data set adapted from Ref. [2] and Ref. [15] in Table I (Columns 1–6) to show how the two-class model works. Suppose whether or not a customer buys a computer relates to the attribute set {Age, Income, Student and Credit rating}. We first code the variables Age, Income, Student and Credit rating by numeric values as follows:

For Age: ''≤30'' is assigned to be ''3''; ''31...40'' to be ''2''; and ''>40'' to be ''1''.
For Income: ''high'' is assigned to be ''3''; ''medium'' to be ''2''; and ''low'' to be ''1''.
For Student: ''yes'' is assigned to be ''2'' and ''no'' to be ''1''.
For Credit rating: ''excellent'' is assigned to be ''2'' and ''fair'' to be ''1''.


TABLE I  A two-class data set of customer status

Cases   Age       Income   Student   Credit rating   Class: buys computer   Training results
A1      31...40   High     No        Fair            Yes                    Success
A2      >40       Medium   No        Fair            Yes                    Success
A3      >40       Low      Yes       Fair            Yes                    Success
A4      31...40   Low      Yes       Excellent       Yes                    Success
A5      ≤30       Low      Yes       Fair            Yes                    Success
A6      >40       Medium   Yes       Fair            Yes                    Success
A7      ≤30       Medium   Yes       Excellent       Yes                    Success
A8      31...40   Medium   No        Excellent       Yes                    Failure
A9      31...40   High     Yes       Fair            Yes                    Success
A10     ≤30       High     No        Fair            No                     Success
A11     ≤30       High     No        Excellent       No                     Success
A12     >40       Low      Yes       Excellent       No                     Failure
A13     ≤30       Medium   No        Fair            No                     Success
A14     >40       Medium   No        Excellent       No                     Success

G_1 = {yes to buys computer} and G_2 = {no to buys computer}.

Then, letting j = 1, 2 and i = 1, . . . , 14, model (M2) for this problem, which classifies the customer's status for {buys computer}, is formulated as:

Minimize d_α^- + d_α^+ + d_β^- + d_β^+

Subject to:

α^* + Σ_i α_i = d_α^- − d_α^+,
β^* − Σ_i β_i = d_β^- − d_β^+,
2x_1 + 3x_2 + x_3 + x_4 = b + α_1 − β_1
x_1 + 2x_2 + x_3 + x_4 = b + α_2 − β_2
x_1 + x_2 + 2x_3 + x_4 = b + α_3 − β_3
2x_1 + x_2 + 2x_3 + 2x_4 = b + α_4 − β_4
3x_1 + x_2 + 2x_3 + x_4 = b + α_5 − β_5
x_1 + 2x_2 + 2x_3 + x_4 = b + α_6 − β_6
3x_1 + 2x_2 + 2x_3 + 2x_4 = b + α_7 − β_7
2x_1 + 2x_2 + x_3 + 2x_4 = b + α_8 − β_8
2x_1 + 3x_2 + 2x_3 + x_4 = b + α_9 − β_9
3x_1 + 3x_2 + x_3 + x_4 = b + α_10 − β_10
3x_1 + 3x_2 + x_3 + 2x_4 = b + α_11 − β_11
x_1 + x_2 + 2x_3 + 2x_4 = b + α_12 − β_12
3x_1 + 2x_2 + x_3 + x_4 = b + α_13 − β_13
x_1 + 2x_2 + x_3 + 2x_4 = b + α_14 − β_14

where α^* and β^* are given, x_1, x_2, x_3, x_4 and b are unrestricted, and α_i, β_i, d_α^-, d_α^+, d_β^-, d_β^+ ≥ 0, i = 1, . . . , 14.

Before solving the above problem for data classification, we have to choose the values of the control parameters α^*, β^* and b. Suppose we use α^* = 0.1, β^* = 30,000 and b = 1. Then, the optimal solution of this linear program for the classifier is obtained as Column 7 of Table I, where only cases A8 and A12 are misclassified. In other words, cases {A1, A2, A3, A4, A5, A6, A7, A9} are correctly classified in G_1, while cases {A10, A11, A13, A14} are found in G_2. Similarly, when we apply models (M3) and (M4) with ε = 0, one of the learning processes provides the same results, where cases {A1, A2, A3, A5, A8} are correctly classified in G_1, while cases {A10, A11, A12, A14} are correctly found in G_2. Then, we see that cases {A1, A2, A3, A5} classified in G_1 by model (M4) are also in G_1 by models (M3) and (M2), and cases {A10, A11, A14} classified in G_2 by model (M4) are in G_2 by models (M3) and (M2). This is consistent with Theorem 2.
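The attribute coding used in Example 1 can be reproduced mechanically. The sketch below is ours (the struct and function names are hypothetical); it maps the categorical values of Table I onto the numeric scales defined above, so that, for instance, case A1 becomes the vector (2, 3, 1, 1) appearing in the first constraint.

#include <array>
#include <string>

// Numeric coding of Example 1: Age "<=30" -> 3, "31...40" -> 2, ">40" -> 1;
// Income "high" -> 3, "medium" -> 2, "low" -> 1; Student "yes" -> 2, "no" -> 1;
// Credit rating "excellent" -> 2, "fair" -> 1.
struct Customer {
    std::string age, income, student, credit;   // categorical attributes from Table I
};

std::array<double, 4> encode(const Customer& c) {
    double x1 = (c.age == "<=30") ? 3.0 : (c.age == "31...40") ? 2.0 : 1.0;
    double x2 = (c.income == "high") ? 3.0 : (c.income == "medium") ? 2.0 : 1.0;
    double x3 = (c.student == "yes") ? 2.0 : 1.0;
    double x4 = (c.credit == "excellent") ? 2.0 : 1.0;
    return {x1, x2, x3, x4};
}

// Example: encode({"31...40", "high", "no", "fair"}) returns {2, 3, 1, 1}, i.e. case A1.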

3 ARCHITECTURE OF ALGORITHMS FOR MCLP CLASSIFICATION

Based on the discussion of Section 2, we observe that the MCLP models (M2)–(M4) can serve as mining tools to develop an MCLP-data mining technique. According to the four stages of data mining development [3], we now describe the architecture of the data mining algorithm design using the MCLP models.

Given a data mining task request, the stages of selecting task-relevant data and attributes and transforming the selected data into appropriate forms can be viewed as data preprocessing. This function produces two sets of data from the particular data mart. One is called the ''training or learning data set'', while the other is the ''verifying or testing data set''. In business practice, 10% or less of the target data can be randomly chosen as a training set, while the rest of the data is used as a verifying set (a simple code sketch of such a random split appears below). There are other ways to define a training set and a verifying set. For example, using the concept of k-fold cross-validation, the target data is divided into k mutually exclusive and equal-size subsets (or folds). If the j-th fold is used as a training set, then the rest of the data becomes the verifying set. When the training set of the j-th fold changes, the verifying set varies accordingly [16].

The mining stage in this paper has two components: identifying optimal solutions of the MCLP models and computing the score of each class in terms of the class boundaries. After this, the interpreting stage can vary with the measurements of the results and the visualization tools. This is subject to the understanding of the end-users. Computing, scoring and interpretation of the MCLP solutions are repeated on the training set until a better or satisfying classifier is found. The criteria for the better classifier are subjective. For example, a threshold of 85% correct classification can be a criterion for a better classifier. Finally, the chosen classifier is used on the verifying set to predict the classification of unknown data for the end-users. This architecture is captured in Fig. 2.

The details of the architecture can be further illustrated by another three sub-flowcharts. Figure 3 represents the data preprocessing. The relevant data mart means the part of the data from an enterprise database or data warehouse that responds to the specific data mining task. For example, if the task is about the credit cardholders' spending behavior (see Section 5), then the data of race, nationality, and sex normally will not be considered as attributes of the data mining.
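The following routine is a rough sketch (ours, with assumed container types and a fixed random seed) of the random training/verifying split described above: about 10% of the records are drawn as training indices and the remainder form the verifying set.

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Randomly draw roughly `trainFraction` (e.g., 0.10) of the record indices as the
// training set; all remaining indices become the verifying set.
void splitData(std::size_t nRecords, double trainFraction,
               std::vector<std::size_t>& trainIdx,
               std::vector<std::size_t>& verifyIdx) {
    std::vector<std::size_t> idx(nRecords);
    std::iota(idx.begin(), idx.end(), 0);        // 0, 1, ..., nRecords-1
    std::mt19937 rng(12345);                     // fixed seed so the split is repeatable
    std::shuffle(idx.begin(), idx.end(), rng);
    std::size_t nTrain = static_cast<std::size_t>(trainFraction * nRecords);
    trainIdx.assign(idx.begin(), idx.begin() + nTrain);
    verifyIdx.assign(idx.begin() + nTrain, idx.end());
}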


FIGURE 2 Architecture of MCLP-data classification.

In this process, we may use a number of known ''quick and dirty'' methods, such as the mean, mode, median, or Z-score from statistics, to numerically code the different measurements of the data, whether they are qualitative or quantitative. For example, the attributes ''age'' and ''income'' can be assigned to the scale of ''1–3'' (recall Example 1). After the predefined classes have been identified, the clean data will be divided into the training data set and the verifying data set for the next step of the MCLP models.
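As one concrete instance of these ''quick and dirty'' numerations, a Z-score transformation of a quantitative attribute can be coded as follows; this is an illustrative sketch, not the preprocessing code actually used in the project.

#include <cmath>
#include <vector>

// Z-score transformation: (value - mean) / standard deviation of the attribute.
std::vector<double> zscore(const std::vector<double>& values) {
    std::vector<double> out;
    if (values.empty()) return out;
    double mean = 0.0;
    for (double v : values) mean += v;
    mean /= values.size();
    double var = 0.0;
    for (double v : values) var += (v - mean) * (v - mean);
    double sd = std::sqrt(var / values.size());
    out.reserve(values.size());
    for (double v : values) out.push_back(sd > 0.0 ? (v - mean) / sd : 0.0);
    return out;
}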

FIGURE 3 Data mart selection and preprocessing.


Given a training data set, we can choose the ideal values of α_*^j and β_*^j as well as ε before solving the MCLP models (Fig. 4). The boundary b_j is supposed to be a variable in models (M2)–(M4). However, if the analyst is very familiar with the characteristics of the data source, then the value of b_j can be predetermined for the initial training. We employ the well-known Simplex Method to find the optimal solution of models (M2)–(M4) [17]. Successful identification of the optimal solution for a better classification depends on the proper values of α_*^j, β_*^j, and b_j, which have to be determined recursively or interactively with the analyst. If the classifier resulting from the training process satisfies the classification threshold of a better classifier, then we take the optimal factors to the verifying set and go to the scoring step. Otherwise, we go back and reset new values of α_*^j, β_*^j, and b_j.

The calculation formula of the score for both the training set and the verifying set can be written as Score_i = A_i X^*, where X^* = (x_1^*, . . . , x_r^*)^T contains the optimal factors found in the training process, and A_i is the data case from either the training data set or the verifying data set. If the scores of the training result provide the better classifier for the given classification threshold, then the optimal factors are used to compute the scores for every data case in the verifying set (Fig. 5). The classification results of this set come from matching the scores against the values of the boundaries. These results are called the ''absolute'' results since they can be used for identifying the rate of misclassification.
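The scoring and class-assignment step just described can be summarized by the short sketch below (our illustration; the optimal factors X^* and the ordered boundaries b_1 < . . . < b_{s−1} are assumed to come from the training run). A case is placed in G_1 if its score does not exceed b_1, in G_k if the score falls between b_{k−1} and b_k, and in G_s otherwise, mirroring the data separation of Section 2.

#include <cstddef>
#include <vector>

// Score_i = A_i X*, the inner product of a data row with the optimal factors.
double score(const std::vector<double>& Ai, const std::vector<double>& Xstar) {
    double s = 0.0;
    for (std::size_t r = 0; r < Ai.size(); ++r) s += Ai[r] * Xstar[r];
    return s;
}

// Assign class 1..s from ordered boundaries b[0] < b[1] < ... < b[s-2]:
// class 1 if score <= b[0], class k if b[k-2] < score <= b[k-1], class s otherwise.
int assignClass(double s, const std::vector<double>& boundaries) {
    int k = 1;
    for (double b : boundaries) {
        if (s <= b) return k;
        ++k;
    }
    return k;   // score exceeds the last boundary: the last class
}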

FIGURE 4 Solving MCLP model for optimal solution via Simplex Method.

FIGURE 5 Class prediction for verifying data set.

Another useful statistical measurement is the Kolmogorov–Smirnov (KS) value, which measures the largest separation of the cumulative distributions of any two classes [18]. Since the KS value can be used to compare the data of two classes over any given interval, it is viewed as a ''relative'' classification result. For example, given a three-class problem, if we use a KS value of 60 points as a threshold for class one vs. class two, then we intend to find a classifier satisfying this criterion between class one and class two while paying attention to the KS values of the other pairwise class comparisons. Visualization tools, such as Excel, can easily show the distribution of the predicted classes. With the support of these tools, the end-users should gain an understanding of the business meaning of the discovered knowledge patterns (see Section 5 for the real-life applications).
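A minimal sketch of the KS computation described above (ours, not the production code): given the scores of two classes, compare their empirical cumulative distributions over the observed score values and take the largest absolute gap, reported here on the 0–100 scale used in Section 5.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Fraction of scores in v that are <= x (empirical cumulative distribution).
static double ecdf(const std::vector<double>& v, double x) {
    std::size_t count = 0;
    for (double s : v) if (s <= x) ++count;
    return static_cast<double>(count) / v.size();
}

// KS value = max over score points of |CDF(class A) - CDF(class B)|, scaled to 0-100.
double ksValue(const std::vector<double>& classA, const std::vector<double>& classB) {
    std::vector<double> grid(classA);
    grid.insert(grid.end(), classB.begin(), classB.end());
    double ks = 0.0;
    for (double x : grid)
        ks = std::max(ks, std::fabs(ecdf(classA, x) - ecdf(classB, x)));
    return 100.0 * ks;
}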

4 ALGORITHM IMPLEMENTATION AND SOFTWARE DEVELOPMENT

A general algorithm to execute the MCLP classification method can be outlined as:

ALGORITHM 1
Step 1 Build a data mart for the data mining project.
Step 2 Generate a set of relevant attributes or dimensions from the data mart, transform the scales of the data mart into the same numerical measurement, and determine the predefined classes, the classification threshold, the training set and the verifying set.
Step 3 Use the MCLP model to learn and compute the best overall score (X^*) of the relevant attributes or dimensions over all observations.
Step 4 Discover the interesting patterns that can best match the original classes under the threshold by choosing the proper control parameters (α^*, β^* and b). If the patterns are found, go to Step 5. Otherwise, go back to Step 3.
Step 5 Apply the final learned score (X^{**}) to predict the unknown data cases.

We now outline some major procedures to implement Algorithm 1. If Step 1 of Algorithm 1 is completed by the management of the data mining task force, pseudo code for Step 2 can be written as:

dataPreprocessing(DataSet X)
{
  open data file, trainingSet, verifySet and log file;
  for all variables {
    produce numerical definition;
    Original variable → Derived variable;
  }
  for all data in X {
    set X.group according to the group definition;
    for position ← 1 to trainingSet.size {
      random selection in X;
      save in trainingSet;
    }
    for position ← 1 to verifySet.size {
      random selection in X;
      save in verifySet;
    }
  }
}

The implementation of the MCLP models and the Simplex Method in Step 3 consists of several functions. The Repri function reads the source data file and processes the data into the form required by the MCLP models. The Smpiv function is the coding of the Simplex Method. The Gauss function is the well-known Gaussian elimination method. The Wsolut function produces the resulting factors and outputs the objective values. While the Compute_Score function computes the scores, the Wsolut2 function writes the scores to disk. The pseudo code is:

void main()
{
  open data, result and log file;
  Repri();          // data reading
  Smpiv();          // factor setup and data computing according to the Simplex Method
  Wsolut();         // factor weight solution
  Compute_Score();  // score computation and group classification according to the score
  Wsolut2();        // result generation
}

// Data reading
Repri()
{
  Read the input data file name;
  Define the output name;
  Read the number of groups and the number of data in each group;
  Read the number of variables;
  Read the values of all boundaries;
  Read the values of α* and β* for each group;
  Set coefficients of the objective functions;
  Set coefficients for basic variables;
  Set coefficients for slack variables;
  Set all the coefficients for constraint functions;
  Read data variables into a(i,j) for every x;
  Output error message if an error occurs at the end of the data file;
}


void equalform()
{
  set artificial variables in the constraint functions;
}

Smpiv()
{ // Simplex Method
  Decide whether there exists RHS < 0; if yes, output infeasible-solution message;
  While (no optimal solution found) {
    Find next pivot column;
    If all the rows of that column < 0, output unbounded-solution message;
    Find next pivot row;
    Mark the pivot number;
    Gauss();
  }
}

Gauss()
{ // Gaussian elimination
  For every value in the pivot row,
    New row values = old row values / corresponding pivot number value;
  For every other row,
    New row values = old row values − (corresponding coefficient in pivot column × corresponding pivot row value);
}

wsolut()
{
  if a solution exists, save all the factor variables to the output file;
}

wsolut2()
{
  compute the score for every data case according to the factors found in the optimal solution;
  verify whether the data are successfully grouped by the boundaries;
  output the statistical information of the test set;
  save to file;
}

The pseudo code used for Steps 3–5 of Algorithm 1 and for computing the classification scores is:

Score(DataSet trainingSet, verifySet, Factor optimalFactor)
{
  open trainingSet, verifySet, optimalFactor;
  computing the score for every data by the optimal factor;
  verify whether successfully grouped data by boundary;
  statistic information collection;
  save to file;
}
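For concreteness, the pivot step sketched in Gauss() above corresponds to the following small routine; it is our illustration on a dense tableau (the indexing convention is an assumption of this sketch), not an excerpt from the actual software.

#include <cstddef>
#include <vector>

// One simplex pivot on a dense tableau: normalize the pivot row by the pivot value,
// then eliminate the pivot column from every other row.
void pivot(std::vector<std::vector<double>>& tableau,
           std::size_t pivotRow, std::size_t pivotCol) {
    double p = tableau[pivotRow][pivotCol];
    for (double& v : tableau[pivotRow]) v /= p;              // new row = old row / pivot value
    for (std::size_t r = 0; r < tableau.size(); ++r) {
        if (r == pivotRow) continue;
        double factor = tableau[r][pivotCol];
        for (std::size_t c = 0; c < tableau[r].size(); ++c)
            tableau[r][c] -= factor * tableau[pivotRow][c];  // eliminate the pivot column
    }
}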


Two versions of actual software have been developed for the MCLP classification method. The first version is based on the well-known commercial SAS platform [19]. In this software, we have applied SAS code to execute Algorithm 1, in which the MCLP models (M2)–(M4) utilize the SAS linear programming procedure [20]. The second version of the software is written in the C++ language and runs on the Linux platform [21]. The reason for developing a Linux version of the MCLP classification software is that the majority of database vendors, such as IBM, are aggressively moving to Linux-based system development. Our Linux version goes along with this trend of information technology. Because many large companies currently use the SAS system for data analysis, our SAS version is also useful for conducting data mining analysis under the SAS environment.

5 EXPERIMENTAL RESULTS FROM A REAL-LIFE DATABASE

One of the important data mining applications in the banking industry is credit card bankruptcy analysis. In this business practice, on one hand, credit cards promote customer spending, which stimulates the economy. On the other hand, the card issuers (or banks) can lose a huge amount of money because of the increase in individual bankruptcy filings. Given a set of attributes, such as monthly payment, balance, purchase, and cash advance, and the criteria about ''bankruptcy'', the purpose of data mining in credit card portfolio management is to find a better classifier through a training set and use the classifier to predict all other customers' spending behaviors [20]. The frequently used data-mining model in this business is still the two-class separation technique. The key to two-class separation is to separate the ''bankruptcy'' accounts from the ''current'' accounts and identify as many bankruptcy accounts as possible. This is also known as the method of ''making the black list''. Examples of popular methods are the Behavior Score, Credit Bureau Score, FDC Bankruptcy Score, and Set Enumeration Decision Tree Score [7]. These methods were developed by either statistics or decision trees. Using a terabyte real credit database of a major US bank, the SAS version of the two-class MCLP model (as in Example 1) has demonstrated better prediction power (e.g., higher KS values) than these popular business methods [7].

5.1 Testing on Three-, Four- and Five-Class MCLP Models

In this subsection, we demonstrate the experimental results of the three-class, four-class and five-class MCLP models in the Linux version on the same real credit database. Since this database contains 64 attributes with a lot of overlapping, we employed the weak separation model (M2). For all training sets, we chose the ideal values α_*^j = 0.1 and β_*^j = 30,000, with j = 1, 2 for the three-class, j = 1, 2, 3 for the four-class, and j = 1, 2, 3, 4 for the five-class models, respectively.

In the three-class MCLP model, the criterion for defining classes is the number of over-limits incurred by each credit card account during the previous two years. Class 1 (G_1) represents ''bad'', for all accounts with over-limits > 6; Class 2 (G_2) represents ''normal'', for all accounts with 1 < over-limits ≤ 6; and Class 3 (G_3) represents ''good'', for all accounts with over-limits ≤ 1. According to this definition, we first chose 300 samples with 100 accounts in each class as the training set. The control parameters are set as b_1 = 0.01 and b_2 = 7. Then we chose 5000 samples randomly from 25,000 real-life credit card accounts in the database of the bank, with 218 in G_1, 557 in G_2, and 4225 in G_3, as the verifying set. After several rounds of learning, we found that G_1 has been correctly identified 83% (83/100), G_2 72% (72/100) and G_3 91% (91/100).


FIGURE 6 Three-class training data set.

In addition to these absolute classifications, the KS scores, which are calculated by

KS value = max |Cumulative distribution of Good − Cumulative distribution of Bad|,

are 60 for G_1 vs. G_2, and 80 for G_2 vs. G_3 (see Fig. 6). Note that commercial practice normally requires KS values of 45 or above. Supposing this model is found to be the better classifier, we can use it to predict the verifying set as G_1 for 64.7% (141/218), G_2 for 64.1% (357/557) and G_3 for 59.6% (2516/4225). The predicted KS values are 29.31 for G_1 vs. G_2, and 59.82 for G_2 vs. G_3 (see Fig. 7). As we see, the predicted classification rates are stable around 60%, but the predicted KS value for G_1 vs. G_2 is not good.

In the four-class MCLP model, we define four classes as Bankrupt charge-off accounts (the number of over-limits ≥ 13), Non-bankrupt charge-off accounts (7 ≤ the number of over-limits ≤ 12), Delinquent accounts (2 ≤ the number of over-limits ≤ 6), and Current accounts (0 ≤ the number of over-limits ≤ 2). We use b_1 = 0.1, b_2 = 10 and b_3 = 20 as the control parameters. In this case, we select 160 samples with 40 accounts in each class as the training set. We use the same 5000 samples, with 53 in G_1, 165 in G_2, 557 in G_3 and 4225 in G_4, as the verifying set.

FIGURE 7 Three-class verifying data set.


FIGURE 8 Four-class training data set.

The better learning provided that G_1 has been correctly identified 85% (34/40), G_2 70% (28/40), G_3 62.5% (25/40) and G_4 70% (28/40), while the KS values are 70.5 for G_1 vs. G_2, 65 for G_2 vs. G_3 and 52.5 for G_3 vs. G_4 (Fig. 8). Using this model as the better classifier, we can predict the verifying set as G_1 for 32% (17/53), G_2 for 83.6% (138/165), G_3 for 41.4% (231/557) and G_4 for 63.5% (2682/4225). The predicted KS values are 35.8 for G_1 vs. G_2, 28.23 for G_2 vs. G_3, and 61.86 for G_3 vs. G_4 (Fig. 9). These results show that the predicted separation between G_3 and G_4 is better than the others.

In the five-class MCLP model, we define five classes as Bankrupt charge-off accounts (the number of over-limits ≥ 13), Non-bankrupt charge-off accounts (7 ≤ the number of over-limits ≤ 12), Delinquent accounts (3 ≤ the number of over-limits ≤ 6), Current accounts (1 ≤ the number of over-limits ≤ 2), and Outstanding accounts (no over-limit). Let the control parameters be b_1 = 0.1, b_2 = 10, b_3 = 20 and b_4 = 30. This time we select 200 samples with 40 accounts in each class as the training set. The same 5000 samples, with 53 in G_1, 165 in G_2, 379 in G_3, 482 in G_4 and 3921 in G_5, are used as the verifying set. We find, in the training process, that G_1 has been correctly identified 47.5% (19/40), G_2 55% (22/40), G_3 47.5% (19/40), G_4 42.5% (17/40), and the KS values are 42.5 for G_1 vs. G_2, G_2 vs. G_3 and G_3 vs. G_4, but 67.5 for G_4 vs. G_5 (Fig. 10). When we use this as the better classifier, we predict the verifying set as G_1 for 50.9% (27/53), G_2 for 49.7% (82/165), G_3 for 40% (150/379), G_4 for 31.1% (150/482) and G_5 for 54.6% (2139/3921). The predicted KS values are 36.08 for G_1 vs. G_2, 23.3 for G_2 vs. G_3, 27.82 for G_3 vs. G_4, and 42.17 for G_4 vs. G_5 (Fig. 11). This indicates that the separation between G_4 and G_5 is better than the other situations. In other words, the classifier is favorable to G_4 vs. G_5.

FIGURE 9 Four-class verifying data set.


FIGURE 10 Five-class training data set.

We note that although the general models (M2) and (M3) can theoretically be applied to classify any number of classes of data, finding a better classifier that efficiently handles real-life problems is not easy. Besides, many real-life applications do not require more than five-class separations. This claim is partially supported by psychological studies. According to Ref. [22], the human attention span is ''seven plus or minus two''. Therefore, for the practical purpose of classification in data mining, classifying five interesting classes in a terabyte database can be very meaningful.
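The per-class rates quoted above (e.g., 83% = 83/100 for G_1 in the three-class training set) are simply the diagonal entries of a confusion matrix divided by the class sizes; a minimal sketch of that bookkeeping (ours) follows.

#include <cstddef>
#include <vector>

// rates[k] = fraction of class-k cases (k = 0..numClasses-1) predicted as class k.
std::vector<double> perClassRate(const std::vector<int>& trueClass,
                                 const std::vector<int>& predClass, int numClasses) {
    std::vector<double> correct(numClasses, 0.0), total(numClasses, 0.0);
    for (std::size_t i = 0; i < trueClass.size(); ++i) {
        total[trueClass[i]] += 1.0;
        if (trueClass[i] == predClass[i]) correct[trueClass[i]] += 1.0;
    }
    std::vector<double> rates(numClasses, 0.0);
    for (int k = 0; k < numClasses; ++k)
        if (total[k] > 0.0) rates[k] = correct[k] / total[k];
    return rates;
}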

5.2 Comparison of the MCLP Method and the Decision Tree Method

Decision trees have been widely regarded as an effective tool in classification [15]. Almost all commercial products, such as IBM Intelligent Miner and SAS Enterprise Miner, have employed decision tree approaches. A commercial decision tree software, C5.0 (the newly updated version of C4.5), is used to test the classification accuracy on multiple classes (from two to five) against the MCLP method [23]. Given a terabyte credit card database of a major US bank, the numbers of samples for the training sets are 600 for the two-class problem, 300 for the three-class problem, 160 for the four-class problem, and 200 for the five-class problem. All control parameters are the same as those in Section 5.1, while we use b = −1.1, α^* = 0.1 and β^* = 30,000 in the two-class problem.

FIGURE 11 Five-class verifying data set.

TABLE II  Two-class comparison of MCLP and decision tree

                 Multi-criteria linear programming          Decision tree
Class            1*      2*      Total                      1*      2*      Total

Training set
1                255     45      300                        286     14      300
2                46      254     300                        15      285     300

Verify set
1                692     123     815                        710     105     815
2                1477    2708    4185                       1743    2442    4185

* 1, 2 shows the classified results.

All verifying sets use the same 5000 credit card records as in Section 5.1. Tables II, III, IV, and V summarize the comparisons of the two methods. Note that the row numbers of classes represent the original data, while the column numbers show the classified results. The diagonal numbers, therefore, are the correct classifications. As we see, the decision tree method generally does a better job than the MCLP method on the training sets when the sample size is small. When the classifier from the training process is applied to the larger verifying sets, the MCLP method outperforms the decision tree method. Two issues may explain this evidence. One is that the MCLP method, as a linear model, may miss some of the nonlinear nature of the data, while the decision tree is a nonlinear model. This could be the reason why the latter is better than the former in the training process. However, the robustness and stability of the MCLP are better than those of the decision tree when the classifier is applied to predict the classification of the verifying sets. This may be due to the fact that the MCLP method employs optimization to find the optimal factors for the scores from among all feasible factors, while the decision tree method just selects the better tree from a limited number of built trees, which is not necessarily the best tree. In addition, when the decision tree gets big (i.e., as the size of the verifying sets increases), the pruning procedure may further eliminate some better branches. It might be interesting to compare these two methods over other real-life databases, such as the popular bio-medical database Genbank [24,25] and the HIV database [26], for more insights. Because of quite different data structures, the working formulation and the choice of control parameters vary. These are our on-going projects.

TABLE III  Three-class comparison of MCLP and decision tree

                 Multi-criteria linear programming                  Decision tree
Class            1*      2*      3*      Total                      1*      2*      3*      Total

Training set
1                83      17      0       100                        93      7       0       100
2                20      72      8       100                        4       95      1       100
3                0       9       91      100                        1       3       96      100

Verify set
1                141     72      5       218                        139     74      5       218
2                42      357     158     557                        115     408     34      557
3                149     1560    2516    4225                       269     1655    2301    4225

* 1, 2, 3 shows the classified results.

TABLE IV  Four-class comparison of MCLP and decision tree

                 Multi-criteria linear programming                          Decision tree
Class            1*      2*      3*      4*      Total                      1*      2*      3*      4*      Total

Training set
1                34      6       0       0       40                         38      2       0       0       40
2                3       28      9       0       40                         5       31      4       0       40
3                0       1       25      14      40                         0       0       38      2       40
4                0       0       12      28      40                         0       0       2       38      40

Verify set
1                17      31      5       0       53                         41      7       5       0       53
2                2       138     25      0       165                        38      91      30      6       165
3                7       293     231     26      557                        74      200     254     29      557
4                16      195     1332    2682    4225                       158     274     1869    1924    4225

* 1, 2, 3, 4 shows the classified results.

TABLE V  Five-class comparison of MCLP and decision tree

                 Multi-criteria linear programming                                  Decision tree
Class            1*      2*      3*      4*      5*      Total                      1*      2*      3*      4*      5*      Total

Training set
1                19      21      0       0       0       40                         35      4       0       1       0       40
2                3       22      15      0       0       40                         0       37      1       2       0       40
3                0       8       19      13      0       40                         0       1       38      0       1       40
4                0       0       6       17      17      40                         0       0       5       29      6       40
5                0       0       0       10      30      40                         0       0       0       2       38      40

Verify set
1                27      18      6       2       0       53                         41      6       4       2       0       53
2                26      82      42      14      1       165                        30      82      31      16      6       165
3                20      145     150     41      13      379                        38      104     159     63      14      379
4                13      70      182     150     32      482                        26      73      178     162     44      482
5                28      110     470     1074    2139    3921                       85      95      1418    428     1895    3921

* 1, 2, 3, 4, 5 shows the classified results.

Furthermore, a parallel experimental study on the MCLP classifications through the developed SAS version can be found in Ref. [20]. For the sake of space, we do not elaborate on those results here.

6 CONCLUDING REMARKS

In this paper, we have introduced a data mining method using multiple criteria linear programming (MCLP), which differs from the traditional data mining methods. Based on the general models of MCLP approaches, we have proposed an architecture of MCLP-data mining algorithm designs for real-life business intelligence.


This architecture consists of finding MCLP solutions, preparing mining scores, and interpreting the knowledge patterns. We have then described the MCLP algorithm implementation and the software development under both the SAS and Linux platforms. A series of experimental tests and comparisons between the MCLP method and the decision tree method has suggested that MCLP can be an alternative technology for real-life data mining projects.

There are some research and experimental problems remaining to be explored. Regarding the structure of the MCLP formulation, the detailed theoretical relationship of models (M2), (M3), and (M4) needs to be further investigated in terms of classification separation accuracy and predictive power. In addition, in the proposed MCLP models, the penalties to measure the ''cost'' of misclassifications (or the coefficients of Σ_i Σ_j α_i^j and Σ_i Σ_j β_i^j) were fixed at 1. If they are allowed to change, their influence on the classification results can be studied, and a theoretical sensitivity analysis of the misclassification in the MCLP models will also be conducted. From the mathematical structure point of view, a multiple criteria non-linear classification model may be generalized if the hyper-plane X becomes non-linear, say X^p, p > 1. The possible connection of the MCLP classification with the known Support Vector Machine (SVM) method in pattern recognition can be researched [27]. In the empirical tests, we have noticed that identifying the optimal solution for model (M2), (M3), or (M4) in the training process may be time-consuming. Instead, we can apply the concept of fuzzy multiple criteria linear programming to seek a satisfying solution that may lead to a better data separation [28]. Other well-known methods, such as neural networks [29], rough sets [30], and fuzzy sets [31], should be considered as part of an extensive comparison study against the MCLP method so that the MCLP method can become known in the data mining community to both researchers and practitioners. We will report any significant results from these ongoing projects in the near future.

Acknowledgment

The original version of this paper was presented at The Second Japanese-Sino Optimization Meeting, Kyoto, Japan, September 25–27, 2002. The authors thank Professor Haifeng Guo of the University of Nebraska at Omaha for his constructive comments on this paper. This research has been partially supported by a grant (DUE-9796243) from the National Science Foundation of USA, a National Excellent Youth Fund (#70028101) from the National Natural Science Foundation of China, and a grant from the K.C. Wong Education Foundation, Chinese Academy of Sciences.

References

[1] R. Dilly (1996). Data Mining: An Introduction, Version 2. Available online: http://www.pcc.qub.ac.uk/tec/courses/datamining/stu notes/dm book 1.html (current as of Oct 11, 2002).
[2] J. Han and M. Kamber (2001). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, San Francisco, California.
[3] Y. Shi (2002). Data mining. In: M. Zeleny (Ed.), IEBM Handbook of Information Technology in Business, pp. 490–495. International Thomson Publishing, England.
[4] K. Thearling (2002). An Introduction to Data Mining: Discovering Hidden Value in Your Data Warehouse. Available online: http://www.thearling.com/text/dmwhite/dmwhite.htm (current as of Oct 11, 2002).
[5] P. Cabena, P. Hadjinian, R. Stadler, J. Verhees and A. Zanasi (1997). Discovering Data Mining from Concepts to Implementation. Prentice Hall, Upper Saddle River, New Jersey.
[6] N. Freed and F. Glover (1981). Simple but powerful goal programming models for discriminant problems. European Journal of Operational Research, 7, 44–60.
[7] Y. Shi, M. Wise, M. Luo and Y. Lin (2001). Data mining in credit card portfolio management: a multiple criteria decision making approach. In: M. Koksalan and S. Zionts (Eds.), Multiple Criteria Decision Making in the New Millennium, pp. 427–436. Springer, Berlin.


[8] Y. Shi, Y. Peng, X. Xu and X. Tang (2002). Data mining via multiple criteria linear programming: Applications in credit card portfolio management. International Journal of Information Technology and Decision Making, 1, 145–166.
[9] G. Kou, Y. Peng, Y. Shi, M. Wise and W. Xu (2002). Discovering credit cardholders' behavior by multiple criteria linear programming. Working Paper, College of Information Science and Technology, University of Nebraska at Omaha.
[10] P.L. Yu and M. Zeleny (1975). The set of all nondominated solutions in the linear cases and a multicriteria simplex method. Journal of Mathematical Analysis and Applications, 49, 430–458.
[11] X.R. Hao and Y. Shi (1996). MC2 Program, version 1.0: A C++ program running on PC or Unix. College of Information Science and Technology, University of Nebraska at Omaha.
[12] Y. Shi and P.L. Yu (1989). Goal setting and compromise solutions. In: B. Karpak and S. Zionts (Eds.), Multiple Criteria Decision Making and Risk Analysis Using Microcomputers, pp. 165–204. Springer-Verlag, Berlin.
[13] Y. Shi (2001). Multiple Criteria Multiple Constraint-levels Linear Programming: Concepts, Techniques and Applications. World Scientific Publishing, River Edge, New Jersey.
[14] P.L. Yu (1985). Multiple Criteria Decision Making: Concepts, Techniques and Extensions. Plenum, New York.
[15] J. Quinlan (1986). Induction of decision trees. Machine Learning, 1, 81–106.
[16] M. Stone (1974). Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, 36, 111–147.
[17] G.B. Dantzig (1963). Linear Programming and Extensions. Princeton University Press, Princeton, New Jersey.
[18] W.J. Conover (1999). Practical Nonparametric Statistics. Wiley, New York.
[19] http://www.sas.com/
[20] Y. Peng (2002). Data Mining in Credit Card Portfolio Management: Classifications for Card Holder Behavior. Master Thesis, College of Information Science and Technology, University of Nebraska at Omaha.
[21] B. Ball, D. Pitts and W. von Hagen (2002). Red Hat Linux 7. Sams Publishing, Indianapolis, Indiana.
[22] G.A. Miller (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. The Psychological Review, 63, 81–97.
[23] http://www.rulequest.com/see5-info.html
[24] B.F.F. Ouellette (1998). The GenBank sequence database. In: A.D. Baxevanis and B.F.F. Ouellette (Eds.), Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, pp. 16–45. Wiley-Liss.
[25] H. Ali, G. Kou, D. Quest and Y. Shi (2002). Biological characteristics of records in the Genbank database using multiple criteria linear programming. Working Paper, College of Information Science and Technology, University of Nebraska at Omaha.
[26] J. Zheng, D. Erichsen, C. Williams, H. Peng, G. Kou, C. Shi and Y. Shi (2002). Classifications of neural dendritic and synaptic damage resulting from HIV-1-associated dementia: a multiple criteria linear programming approach. Working Paper, University of Nebraska Medical Center.
[27] C. Cortes and V. Vapnik (1995). Support vector networks. Machine Learning, 20, 273–295.
[28] Y.H. Liu and Y. Shi (1994). A fuzzy programming approach for solving a multiple criteria and multiple constraint level programming problem. Fuzzy Sets and Systems, 65, 117–124.
[29] H. Guo and S.B. Gelfand (1992). Classification trees with neural network feature extraction. IEEE Transactions on Neural Networks, 3, 923–933.
[30] Z. Pawlak (1982). Rough sets. International Journal of Computation and Information Sciences, 11, 341–356.
[31] L.A. Zadeh (1965). Fuzzy sets. Information and Control, 8, 338–353.