MANAGEMENT SCIENCE
Vol. 52, No. 4, April 2006, pp. 597–612
issn 0025-1909 | eissn 1526-5501 | 06 | 5204 | 0597
doi 10.1287/mnsc.1060.0514
© 2006 INFORMS

Machine Learning for Direct Marketing Response Models: Bayesian Networks with Evolutionary Programming

Geng Cui

Department of Marketing and International Business, Lingnan University, Tuen Mun, N.T., Hong Kong, [email protected]

Man Leung Wong

Department of Computing and Decision Sciences, Lingnan University, Tuen Mun, N.T., Hong Kong, [email protected]

Hon-Kwong Lui

Department of Marketing and International Business, Lingnan University, Tuen Mun, N.T., Hong Kong, [email protected]

Machine learning methods are powerful tools for data mining with large noisy databases and give researchers the opportunity to gain new insights into consumer behavior and to improve the performance of marketing operations. To model consumer responses to direct marketing, this study proposes Bayesian networks learned by evolutionary programming. Using a large direct marketing data set, we tested the endogeneity bias in the recency, frequency, monetary value (RFM) variables using the control function approach; compared the results of Bayesian networks with those of neural networks, classification and regression tree (CART), and latent class regression; and applied a tenfold cross-validation. The results suggest that Bayesian networks have distinct advantages over the other methods in accuracy of prediction, transparency of procedures, interpretability of results, and explanatory insight. Our findings lend strong support to Bayesian networks as a robust tool for modeling consumer response and other marketing problems and for assisting management decision making.

Key words: direct marketing; Bayesian networks; evolutionary programming; machine learning; data mining
History: Accepted by Jagmohan S. Raju, marketing; received May 12, 2004. This paper was with the authors 6 months for 4 revisions.

1. Introduction

Machine learning is an innovative method that can potentially improve forecasting models and assist management decision making. Direct marketing, which relies on building accurate predictive models from databases, is one of the areas that can benefit from such applications. As more companies adopt direct marketing as a distribution strategy, spending in this channel has grown in recent years, making consumer response modeling a top priority for direct marketers to increase sales, reduce costs, and improve profitability. In addition to the conventional statistical approach to forecasting consumer purchases, researchers have recently applied machine learning methods, which have several distinctive advantages for data mining with large noisy databases. In this study, we adopt an innovative machine learning method—Bayesian networks (BNs) learned by evolutionary programming (EP)—to model responses to direct marketing. We compare the results of BNs with other benchmark methods, including neural networks, classification and regression tree (CART), and latent class regression, in a tenfold cross-validation with a large data set. The results suggest that BNs have distinctive advantages, including accurate prediction, transparent procedures, interpretable results, and greater explanatory power.

1.1. The Statistical Methods Because of budget constraints, most direct marketers only contact a preset percentage (e.g., 20%) of the names in a company’s database. Thus, the primary objective of modeling consumer responses in direct marketing is to identify customers who are most likely to respond. Researchers have developed many direct marketing response models using consumer data. One of the classic models, known as the recency, frequency, monetary value (RFM) model, determines the likelihood of consumers responding to a direct marketing promotion based on the recency of the last purchase, the frequency of purchases over the past years, and the monetary value of a customer’s purchase history (Berger and Magliozzi 1992). Other consumer demographic and psychographic variables,


credit histories, and purchase patterns may help build more sophisticated models that can improve the understanding of consumer responses and the accuracy of purchase prediction. Until recently, statistical methods such as logistic regression and discriminant analysis have dominated the modeling of consumer responses to direct marketing (Berger and Magliozzi 1992). Although statistical methods can be very powerful, they make several stringent assumptions about the types of data and their distribution, and typically can only handle a limited number of variables. Regression-based methods are usually based on a fixed-form equation, and assume a single best solution, which means that researchers can compare only a few alternative solutions manually. Further, when the models are applied to real data, the key assumptions of the research methods are often violated (Bhattacharyya 1999). Recently, researchers have developed several more sophisticated models, including beta-logistic models (Rao and Steckel 1995), tree-generating techniques such as CART and CHAID (Haughton and Oulabi 1997), and the hierarchical Bayes model (Allenby et al. 1999). A number of studies have addressed the selection and endogeneity biases in the existing models to improve predictive accuracy (Bitran and Mondschein 1996, Gönül et al. 2000). In recent years, the rapid accumulation of customer and transactional data has resulted in very large databases, and the voluminous amount of consumer data provides unique opportunities for researchers to use data-mining methods to gain insight into consumer behavior. For instance, in addition to the RFM variables, researchers have used consumer lifetime and transaction variables to improve the performance of models (Bhattacharyya 1999, Venkatesan and Kumar 2004). However, the increasing amount and variety of customer data may render impossible a manual solution to the optimization of response models (Bitran and Mondschein 1996). How to take advantage of the different types and increasing volumes of customer data to assist management decision making presents new challenges. Innovative methods such as machine learning allow researchers to perform data mining with large databases to provide decision support to managers. 1.2. Machine Learning Machine learning refers to computer-based methods that can extract patterns or knowledge from data and perform optimization tasks with minimum human intervention. Most of these methods have their roots in artificial intelligence and dynamic programming. Machine learning methods have been adopted in many fields as effective data-mining tools to discover “interesting,” nonobvious patterns or knowledge hidden in a database that can improve the bottom


line. These methods include association rules, decision trees, neural networks, and genetic algorithms. Business researchers have adopted some of these techniques to solve classification problems, such as predicting bankruptcy and loan default and modeling consumer choice (Hu et al. 1999, West et al. 1997). Such methods can also be very useful in learning new knowledge when researchers have observable data but the model structure is unknown. Artificial neural network (ANN), a procedure that mimics the processes of the human brain, is among several innovative methods that have been used to model consumer responses to direct marketing (Baesens et al. 2002, Zahavi and Levin 1997). In comparison with the statistical approach, simple forms of ANNs are free from the assumptions of normality or complete data, and are thus particularly robust for handling noisy data, including cases in which the number and type of attributes vary (Michie et al. 1994). Moreover, neural networks are not subject to the linearity assumption or the highly parametric structure associated with models in a small-data setting. They can explore complex structures to find interactions, nonlinearities, and nonlinear interactions, and are good at pattern discovery (Warner 1997). Given a sufficiently large amount of data, neural networks may offer better solutions to complex optimization problems. When Zahavi and Levin (1997) applied ANNs to direct marketing, ANNs did not perform any better than logistic regression. They attribute the problem of overfitting by back-propagation ANNs to the complexity of the procedure, which typically starts with selecting input and output variables, determining the number of hidden layers and hidden nodes, and adjusting the weights of the nodes, all by trial and error. In an attempt to solve this problem, several researchers have recently developed a Bayesian approach to learning neural networks using the Markov chain Monte Carlo (MCMC) method (Neal 1996, Warner 1997). Using a set of hyperparameters to select the appropriate priors and to represent the noise in the data, relatively complex structures are able to model the data while minimizing overfitting (Warner 1997). Baesens et al. (2002) tested the Bayesian approach to neural networks with direct marketing data and produced positive results. Despite these improvements and potential benefits, machine learning still has limited applications in marketing research, and is not short of skeptics. Such methods have yet to make the necessary improvements to allow marketing researchers to take advantage of the features that they have to offer. First, although methods such as ANNs can model complex nonlinear structures, they lack a suitable topology


of networks or method of knowledge representation to describe the relationships among the variables (Zahavi and Levin 1997). Second, ANNs can generate empirical results that are comparable to those of the statistical approach, but its “black-box” procedures learn complex relationships at an “unconscious” level and are not transparent, thus making it difficult to understand how a certain solution has been derived. Third, their results are not easy to interpret to offer managerial insight (Nakhaeizadeh and Taylor 1997, West et al. 1997). 1.3. Objectives of the Study In this study, we propose the innovative machine learning method of BNs learned with EP to model consumer responses to direct marketing. In this datamining task, we focus on learning a predictive model by integrating the lifetime and consumer transaction variables with the RFM variables to forecast consumer purchases. For structural analysis, we use BNs as a universal approximator to represent the model structure, and then adopt EP as a stochastic search algorithm to learn the optimal BN model. To evaluate the model fitness and minimize overfitting, we apply the minimum description-length (MDL) metric. In the following sections, we first elaborate the advantages of BNs in representing model structures and EP as an efficient optimization algorithm. Second, we adopt the “control variable” approach to test the endogeneity bias in the RFM variables. Third, we test BNs learning with a large direct marketing data set and compare the results with those of ANNs, CART, and latent class regression in a tenfold crossvalidation. Fourth, we discuss the results and the advantages of BNs in terms of the accuracy of classification, transparency of procedure, and interpretation of the results. Finally, we explore the managerial implications, potential applications of BNs in marketing research, and directions for further development.

2. Learning Bayesian Networks

Based on the well-developed Bayesian probability theory proposed about 250 years ago, BNs constitute a method of formal knowledge representation that has been around since the 1960s. In the last two decades, the development of computers, algorithms, and software has made it possible to execute realistic BN models (Jensen 1996, Pearl 1988). Since then, BNs have made significant strides in many fields, such as software engineering, space navigation, and medical diagnosis (Haddawy 1999). Like ANNs, the BN approach is free from the assumptions of data types and their normality and can effectively handle nonlinearity, which is often associated with data mining using large databases. Moreover, BNs require no a priori model formulation and can

take on any structure for a model. Such “freedom of expression” allows the researcher to explore complex relationships among the variables to discover new knowledge, making BNs an ideal tool for data mining (Heckerman 1997). The main task of BNs is to decompose a joint probability distribution into a set of local distributions. The network topology based on the independence semantics specifies how to combine the local distributions of the variables to obtain the joint probability through the nodes in the network (Haddawy 1999). The symmetric nature of conditional probability allows researchers to perform prediction and diagnosis and to solve classification problems. In the past decade, BNs have slowly made inroads into management research. For instance, marketing researchers have adopted BNs to model strategic planning for new products (Cooper 2000) and consumer complaint behavior (Blodgett and Anderson 2000). EP, an efficient stochastic optimization algorithm developed in the field of artificial intelligence, has been adopted by researchers to identify optimal solutions, but both BNs and EP are relatively new to management researchers. It takes a great deal of theoretical insight into these two methods to understand their combined benefits. 2.1. Introduction to Bayesian Networks By definition, a BN treats a research problem as modeled by a list of variables and encodes the joint probability distribution of these variables:

P(N_1, ..., N_n) = \prod_i P(N_i | \Pi_{N_i}).    (1)

First, a BN, B, has a qualitative part represented by a directed acyclic graph (DAG) that depicts the conditional independence among the variables in a domain U = {N_1, ..., N_n} and encodes the joint probability distribution (Pearl 1988). The network uses the variables as nodes to represent the relationships of dependency and independence among them. As is shown in Figure 1, each node in the graph corresponds to a variable in the domain. An edge N_j → N_i in the graph describes a parent and child relation in which N_i is the child and N_j is the parent.

Figure 1  A Bayesian Network Model of Customer Complaints
[Five binary nodes: regular customer (rc), unhappy incident (ui), service recovery (sr), repeat business (rb), and happy customer (hc), with the conditional probability tables:
P(rc) = 0.15; P(ui) = 0.01
P(rb | rc) = 0.6; P(rb | ~rc) = 0.05
P(sr | rc, ui) = 0.99; P(sr | ~rc, ui) = 0.90; P(sr | rc, ~ui) = 0.97; P(sr | ~rc, ~ui) = 0.03
P(hc | sr) = 0.7; P(hc | ~sr) = 0.01]


An edge specifies a dependency between N_i and N_j. All of the parents of N_i constitute the parent set of N_i, which is denoted by \Pi_{N_i}. Bayes’ rule is used to update the conditional probabilities given evidence. Overall, BNs offer a formalism that can directly represent a complex distribution concisely and efficiently. Let the domain variables be U = {N_1, ..., N_n}; the joint probability table grows exponentially with the number of variables, and U does not need to be very large before the table becomes intractably large. A BN over U is a compact representation of the joint probability distribution, and its information can be calculated using Equation (1). Figure 1 is a hypothetical BN model for handling customer complaints with five variables: regular customer, unhappy incident, service recovery, repeat business, and happy customer. Because all the variables are binary, the joint probability distribution table should have 2^5 − 1 = 31 entries. However, the BN has only 10 probability values, and thus 21 values are saved. If there are more domain variables, the amount of values saved will be much larger.

2.1.1. Conditional Dependence. Secondly, based on a model structure, the quantitative part of BNs estimates the conditional probabilities for the variables in the model. BNs operate on the assumption of conditional independence. Let U be the set of variables in the domain and P be the joint probability distribution of U. Following Pearl’s notation, a conditional independence relation is denoted by I(X, Z, Y), where X, Y, and Z are disjoint subsets of the variables in U. This notation says that X and Y are conditionally independent given the conditioning set Z. Thus, in a three-node network, one of the variables acts as a “virtual control” for the relationship between the other two. Formally, a conditional independence relation is defined as in Pearl (1988) as

P(x | y, z) = P(x | z), where P(y, z) > 0,    (2)

where x, y, and z are any value assignments to the set of variables X, Y, and Z, respectively. A conditional independence relation is characterized by its order, which is simply the number of variables in the conditioning set Z. This is also referred to as the d-separation, which is used to determine the conditional dependence and independence among the variables, and is conditional on the state of the other variables (Pearl 2000). Probability calculus is used to quantify the relationships of dependence represented by a BN. In a BN, each node has a conditional probability distribution in the form of P(N_i | \Pi_{N_i}), which specifies the probability of each possible state of the node

given each possible combination of states of its parents. If a node contains no parent, then the marginal probabilities of the node are used (Pearl 1988). The example in Figure 1 suggests that the probability of a happy customer (hc) is the joint probability of the local probabilities of other events, such as service recovery (sr), an unhappy incident (ui), a regular customer (rc), and repeat business (rb). The probabilities for all the nodes in the Bayesian network can be calculated. For example, as P(rc) = 0.15 and P(ui) = 0.01, then

P(rb) = P(rb | rc)P(rc) + P(rb | ~rc)P(~rc) = 0.6 × 0.15 + 0.05 × (1 − 0.15) = 0.1325.    (3)

The probabilities for the other nodes in the network can be derived in a similar fashion.

2.1.2. Inference and Causality. With a BN model at hand, probabilistic inference can be performed to predict the outcome of certain variables based on the observation of others. Intuitively, an edge in a BN expresses the notion of “interaction” between variables, and a missing edge represents a missing relation between two variables. Through conditional dependencies and in probabilistic terms, the DAG gives a lucid representation of the dependencies and irrelevancies among the variables embedded in a network (Chiogna 1997). For example, the service recovery (sr) event may lead to the happy customer (hc) event, and the strength of this relationship would be represented by the conditional probability P(hc | sr). The posterior probability of observing the event of service recovery when we see a happy customer can be obtained using the well-known Bayes’ rule:

P(sr | hc) = P(hc | sr)P(sr) / P(hc) = 0.9383.    (4)
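To make the arithmetic in Equations (3) and (4) concrete, the short sketch below encodes the conditional probability tables from Figure 1 and reproduces both the forward calculation of P(rb) and the posterior P(sr | hc) by brute-force enumeration of the joint distribution in Equation (1). It is an illustrative implementation only; the function names and the enumeration strategy are ours, not part of the original study.

```python
# Joint-probability inference over the five-node network of Figure 1.
# Each node is binary; the CPTs are taken directly from the figure.
from itertools import product

P_rc, P_ui = 0.15, 0.01
P_rb = {True: 0.6, False: 0.05}                       # P(rb | rc)
P_sr = {(True, True): 0.99, (False, True): 0.90,      # P(sr | rc, ui)
        (True, False): 0.97, (False, False): 0.03}
P_hc = {True: 0.7, False: 0.01}                       # P(hc | sr)

def bernoulli(p, value):
    """Probability that a binary node takes `value` when P(node = True) = p."""
    return p if value else 1.0 - p

def joint(rc, ui, sr, rb, hc):
    """Joint probability factorised as in Equation (1)."""
    return (bernoulli(P_rc, rc) * bernoulli(P_ui, ui) *
            bernoulli(P_rb[rc], rb) * bernoulli(P_sr[(rc, ui)], sr) *
            bernoulli(P_hc[sr], hc))

def marginal(**evidence):
    """Sum the joint over all assignments consistent with the evidence."""
    total = 0.0
    for rc, ui, sr, rb, hc in product([True, False], repeat=5):
        assignment = dict(rc=rc, ui=ui, sr=sr, rb=rb, hc=hc)
        if all(assignment[k] == v for k, v in evidence.items()):
            total += joint(rc, ui, sr, rb, hc)
    return total

print(round(marginal(rb=True), 4))                               # 0.1325, Equation (3)
print(round(marginal(sr=True, hc=True) / marginal(hc=True), 4))  # 0.9383, Equation (4)
```

Running the script prints 0.1325 and 0.9383, matching the values derived above.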

Although the formal definition of a BN is based on conditional independence, in practice it is often constructed using the notions of cause and effect, which makes it a powerful tool for the identification and analysis of the structural relationships among variables (Heckerman 1997). With data on intervention or similar knowledge, researchers can explicate the causal relationships among variables (Pearl 2000). An edge, N_j → N_i, may represent causality, with N_j being the cause and N_i being the effect. Their conditional dependencies can help distinguish causation from mere correlation or association, and can lead to the inference of causality on a solid mathematical basis. Consider the BN shown in Figure 1, in which the parent set of the node service recovery is {regular customer, unhappy incident}. Happy customer is independent of regular customer and unhappy incident given service recovery, whereas the effect of regular customer and


unhappy incident is mediated by service recovery. A model with such modularity helps to explain how these probabilities change as a result of external intervention. This is particularly useful for research on the effect of management decisions or business strategies.

2.2. Learning Bayesian Networks Using Evolutionary Programming A BN can be constructed based on previous research, as in Blodgett and Anderson (2000), or by eliciting knowledge from domain experts, as in Cooper (2000). However, building BNs based on previous information or expert knowledge can be difficult, because such information may not be available. To reduce the imprecision due to subjective judgments, researchers can learn a BN from collected data, instead of fitting a specified model to the data. For BNs to “learn” from the observed data a probability distribution that can best describe the relationship among the variables, researchers have devised various learning algorithms, such as the genetic algorithm (Larrañaga et al. 1996). In this study, we propose to learn BNs using EP. Figure 2 provides a graphic representation of the process in which EP learns BNs to solve prediction and classification problems. First, data reduction is undertaken to extract the relevant variables from the database. Second, researchers need a fitness measure to assess the goodness of BN models (Figure 2). Such measures may be derived from the Bayesian information criterion (BIC) or

Akaike information criterion (AIC). In this study, we employ the MDL metric (Lam and Bacchus 1994), which is discussed in §2.2.1. With the defined metric, learning BNs is formulated as a search problem, and here we use EP to tackle the problem. We describe EP in §2.2.2 and discuss our learning algorithm in §2.2.3. Section 2.3 explains how the learned BN is used to perform prediction and classification. 2.2.1. The Minimum Description-Length Metric. An appropriate fitness measure is critical for model evaluation and selection. Despite their advantages in handling nonlinearity and learning complex models, BNs based on a uniform prior may overfit a particular data set (Lam and Bacchus 1994). In the worst case, the computation of the posterior probabilities in complex network models becomes intractable. Complex BNs require more probabilities and tremendous computer space to store them, and suffer from the conceptual disadvantage that a model with complex structures makes it difficult to understand and explain the underlying relationships. As a result, simpler models are preferred if they are sufficiently accurate. Rissanen (1978) proposed the MDL principle, and Lam and Bacchus (1994) subsequently adopted it for evaluating BNs. This metric is rooted in information theory, and is equivalent to the Bayesian scoring function or the BIC (Hansen and Yu 2001). In other words, when the number of samples increases, the learned model converges to the underlying true distribution

Figure 2  Data Mining Using Bayesian Networks Learned by Evolutionary Programming
[Flowchart: data preparation (input) feeds the stage of learning the Bayesian network structure and conditional probabilities. An initial population P of DAGs is generated and evaluated using MDL; until the maximum number of generations is reached, new DAGs are generated from P and evaluated, DAGs are selected from P and the new DAGs, and the selected DAGs are stored in P. The best DAG is then selected from P and its conditional probabilities are calculated; the resulting Bayesian network (output) is used for classification and prediction.]


of the data with a probability equal to one. As the MDL imposes a penalty on model complexity, the burden of proof falls on complex models. Thus, the MDL metric strikes a balance between model accuracy and simplicity, and has been employed as the fitness function to evaluate BNs. It effectively serves as a mechanism to control overfitting (Hansen and Yu 2001). Motivated by information coding (Rissanen 1978), the MDL principle assumes that a collection C of data items is given, and that it is necessary to place this collection in computer storage. The encoded collection is referred to as the total description length, which is defined as the sum of the length of the compressed version of data C and the description length of the model. The MDL metric measures the total description length D_t(B) of a BN structure B. The metric dictates that the optimal model to explain a collection of data minimizes the total description length. Let N = {N_1, ..., N_n} denote the set of nodes in a BN and \Pi_{N_i} denote the set of parents of node N_i. The total description length of a BN, D_t(B), is the sum of the description lengths of each node:

D_t(B) = \sum_{N_i \in N} D_t(N_i, \Pi_{N_i}).    (5)

The total description length D_t(·) is based on two components: the network description length D_n(·) and the data description length D_d(·). Thus, the MDL score of a model depends on the sample size of the data and the complexity of the model:

D_t(N_i, \Pi_{N_i}) = D_n(N_i, \Pi_{N_i}) + D_d(N_i, \Pi_{N_i}).    (6)

The formula for the network description length is as follows:

D_n(N_i, \Pi_{N_i}) = k_i \log_2(n) + d(s_i − 1) \prod_{j \in \Pi_{N_i}} s_j,    (7)

where k_i is the number of parents of variable N_i, s_i is the number of values of N_i, s_j is the number of values of a particular variable in \Pi_{N_i}, and d is the number of bits required to store a numerical value. This is the description length for encoding the network structure. The first part in the addition is the length for encoding the parents, and the second part is the length for encoding the probabilities. The model description length measures the simplicity of the model. The formula for the data description length D_d(·) is

D_d(N_i, \Pi_{N_i}) = \sum_{N_i, \Pi_{N_i}} M(N_i, \Pi_{N_i}) \log_2 [M(\Pi_{N_i}) / M(N_i, \Pi_{N_i})],    (8)

where M(·) is the number of cases that match a particular instantiation in the database. This is the description length for encoding the data, which measures

the accuracy of a network. The compressed version of the data includes the values of x_1, x_2, ..., x_n and the errors e_1, e_2, ..., e_n. As the storage size that is required for x_1, x_2, ..., x_n is fixed for all the models, if one model has a shorter data description length than another model, then the storage size for the errors of the first model is smaller than that of the second model. Thus, a model is more accurate if the corresponding data description length is smaller. To encode a BN, we need to encode the network topology of the graph (the list of parents for each node) and the set of conditional probabilities associated with each node. For a BN with n nodes, it is sufficient to encode a list of the parents of each node and a set of conditional probabilities for each node. For a node with k parents, we need k \log_2(n) bits to encode the list of its parents. The encoding for the conditional probabilities depends on the number of parents and the number of values that the variables take. As BNs encode the data as a joint distribution of probabilities, which can be quite complex, the MDL is particularly suitable as a method to select BN models in data mining with large databases. The key advantage of the MDL metric is that it balances accuracy and simplicity of models. A simpler network is preferred, provided it is sufficiently accurate, but if there is no simpler network that is accurate enough, then the metric allows a more complex network to be induced. The MDL metric thus guides the search for a more accurate network by increasing its topological complexity (Lam and Bacchus 1994). On the other hand, it may select a simpler network with a few errors over a more complex model that perfectly fits the data. Viewed in this light, the MDL provides an effective mechanism to minimize the overfitting of the data. Recent experiments suggest that compared with other criteria such as AIC and BIC, MDL is a robust metric and is increasingly used for model selection (Hansen and Yu 2001, Mitchell 1997).
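The following sketch illustrates how Equations (7) and (8) translate into counts over a discrete data set. The helper function, the toy records, and the choice of d (the number of bits used to store a probability) are illustrative assumptions; this is not the authors' implementation.

```python
import math
from collections import Counter

def description_length(data, node, parents, cardinality, n_nodes, d=10):
    """MDL score of one node: network part (Eq. 7) plus data part (Eq. 8)."""
    k = len(parents)
    s_i = cardinality[node]
    parent_states = math.prod(cardinality[p] for p in parents) if parents else 1
    # Equation (7): bits to list the parents plus bits to store the probabilities.
    d_network = k * math.log2(n_nodes) + d * (s_i - 1) * parent_states
    # Equation (8): counts matching each instantiation of the node and its parents.
    joint = Counter((row[node], tuple(row[p] for p in parents)) for row in data)
    parent_counts = Counter(tuple(row[p] for p in parents) for row in data)
    d_data = sum(m * math.log2(parent_counts[pa] / m) for (_, pa), m in joint.items())
    return d_network + d_data

# Toy example: does "purchase" depend on "recency"?  (hypothetical records)
records = [{"recency": r, "purchase": p}
           for r, p in [(1, 1), (1, 1), (1, 0), (2, 0), (2, 0), (2, 0), (3, 0), (3, 1)]]
card = {"recency": 3, "purchase": 2}
with_parent = description_length(records, "purchase", ["recency"], card, n_nodes=2)
no_parent = description_length(records, "purchase", [], card, n_nodes=2)
print(with_parent, no_parent)   # the structure with the smaller total is preferred
```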


type of model structure. The model can be a binary string, a tree structure, or any other shape, and can evolve into new structures during the evolutionary process. Second, unlike GAs that use reproduction, crossover, and mutation operators, mutation is the only genetic operator for evolution in EP. Whereas crossover operation may lead to invalid models such as recursive models, mutation operator alone in EP allows the simultaneous modification of all the nodes (variables), which by itself is powerful and makes crossover redundant. Third, instead of emulating the specific genetic operators observed in nature, mutations in EP preserve the behavioral similarity of the parents and their offspring models (Fogel 1994). A “child” model is generally similar in behavior to the “parent” model, with slight variations. Thus, EP represents a model of evolution at a higher level of abstraction (Wong et al. 1999). Compared with the GA methods to learn BNs (Larrañaga et al. 1996), EP is more flexible and explores wider search spaces to compare alternative models. Because of these unique advantages, EP is an ideal search mechanism for optimization purposes, and has produced more satisfactory results than GAs in both model performance and computing efficiency when applied to learning BNs (Fogel 1994, Wong et al. 1999). A typical process of EP is outlined in Table 1. A set of models is randomly created to make up the initial population. Each model is evaluated by the fitness function, and then each model produces a child by mutation. There is a certain distribution of the different types of mutation, which range from minor to extreme, and minor modifications in the behavior of offspring occur more frequently than substantial modifications. The offspring are also evaluated by the fitness function, and then tournaments are performed to select the models for the next generation. For each model, a number of rivals are randomly selected among the parents and their offspring. The tournament score of a model is the number of its rivals with worse fitness scores than itself. The models with higher tournament scores are selected as the Table 1

Table 1  The Algorithm of Evolutionary Programming

Initialize the generation, t, to be 0.
Initialize a population of individuals, Pop(t).
Evaluate the fitness of all individuals in Pop(t).
While the termination criterion is not satisfied:
    Produce one or more offspring from each individual by mutation.
    Evaluate the fitness of each offspring.
    Perform a tournament for each individual.
    Put the individuals with high tournament scores into Pop(t + 1).
    Increase the generation t by 1.
Return the individual with the highest fitness value.

Note. Unlike statistical procedures, machine learning algorithms are mostly written in logical language rather than mathematical formulas.
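Table 1 maps directly onto a short program. The sketch below is a generic, minimal EP loop for minimizing an arbitrary fitness function (the MDL score would play that role here); the population size, number of generations, and tournament size q are placeholder values rather than the settings used in the paper.

```python
import random

def evolutionary_programming(init, mutate, fitness, pop_size=50, generations=100, q=5):
    """Generic EP loop following Table 1: mutation only, tournament selection."""
    population = [init() for _ in range(pop_size)]
    for _ in range(generations):
        offspring = [mutate(ind) for ind in population]          # one child per parent
        combined = population + offspring
        scores = [fitness(ind) for ind in combined]
        # Tournament: each individual meets q random rivals; its score is the
        # number of rivals with a worse (larger) fitness value.
        wins = []
        for i, s in enumerate(scores):
            rivals = random.sample(range(len(combined)), q)
            wins.append(sum(1 for j in rivals if scores[j] >= s))
        ranked = sorted(range(len(combined)), key=lambda i: wins[i], reverse=True)
        population = [combined[i] for i in ranked[:pop_size]]    # survivors
    return min(population, key=fitness)
```

The caller supplies `init`, `mutate`, and `fitness`; for learning BNs these would create a random DAG, mutate its edges, and return the MDL score, respectively.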

Table 2  The Algorithm for Evolutionary Programming to Learn Bayesian Networks

• Set t to 0.
• Set I to 0.
• While I is smaller than the population size PS,
    – Let N be the set of all nodes.
    – Let B be a Bayesian network without any edges.
    – For each N_i in N,
        · Randomly generate an integer k from 0 to 5.
        · Randomly select k nodes from N \ {N_i} without replacement.
        · For each selected node N_j, if cycles are not generated by inserting the edge N_i ← N_j into B, then add the edge N_i ← N_j into B.
    – Insert B into the initial population Pop(t).
    – Increase I by 1.
• Each BN in the population Pop(t) is evaluated with the fitness function defined in Equation (5).
• While t is smaller than the maximum number of generations G and there are opportunities for improvement,
    – Each BN in Pop(t) produces one offspring by performing a number of mutation operations. If the offspring has cycles, then delete the edges of the offspring that invalidate the acyclic condition.
    – The BNs in Pop(t) and all new offspring are stored in the intermediate population Pop′(t). The size of Pop′(t) is 2 × PS.
    – Conduct a number of pairwise competitions over all BNs in Pop′(t). Let B_i be the BN that is conditioned upon, and let q opponents be selected randomly from Pop′(t) with equal probability. Let B_ij, 1 ≤ j ≤ q, be the randomly selected opponent BNs. B_i gets one more score if D_t(B_i) ≤ D_t(B_ij), 1 ≤ j ≤ q.
    – Select PS BNs with the highest scores from Pop′(t) and store them in the new population Pop(t + 1).
    – Increase t by 1.
• Let the BN with the lowest fitness value found in any generation of a run be B_best.
• Calculate the parameters of B_best by using Equation (10).
• Return B_best as the result of the algorithm.
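One implementation detail in Table 2 worth making explicit is the repair step: mutation edits edges, and any cycle the edit introduces is removed so that each offspring remains a DAG. The sketch below shows one way to express this with a graph stored as a parent-set dictionary; the specific mutation mix and repair rule are simplified stand-ins for the operators described in Table 2.

```python
import random

def has_cycle(parents):
    """Depth-first check for a directed cycle in a parent-set representation."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {n: WHITE for n in parents}
    def visit(n):
        colour[n] = GREY
        for p in parents[n]:
            if colour[p] == GREY or (colour[p] == WHITE and visit(p)):
                return True
        colour[n] = BLACK
        return False
    return any(colour[n] == WHITE and visit(n) for n in parents)

def mutate(parents):
    """Add, delete, or reverse one randomly chosen edge, then repair any cycle."""
    child = {n: set(ps) for n, ps in parents.items()}
    a, b = random.sample(list(child), 2)           # candidate edge b -> a
    if b in child[a]:
        child[a].discard(b)                        # delete the edge...
        if random.random() < 0.5:
            child[b].add(a)                        # ...or reverse it
    else:
        child[a].add(b)                            # add a new edge
    while has_cycle(child):                        # repair: drop edges until acyclic
        n = random.choice([m for m in child if child[m]])
        child[n].discard(random.choice(list(child[n])))
    return child
```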

population of the next generation. The population size (the number of competing models) does not need to be held constant. The process is iterated until the termination criterion is satisfied. 2.2.3. The Learning Algorithm. Our EP algorithm for learning BNs is depicted in Table 2. Each individual represents a BN model, which is a DAG. First, a set of BNs is randomly generated to make up the initial population. Each graph is evaluated by the MDL metric described above. Then, each BN produces offspring by performing a number of mutations. The probabilities of using one, two, three, four, five, or six mutations are set to 0.2, 0.2, 0.2, 0.2, 0.1, and 0.1, respectively. The mutation operators modify the edges of a DAG. If a cyclic graph is formed after the mutation, edges in the cycles are removed to keep it acyclic. After generating the offspring models, they are also evaluated by the MDL metric. The next generation of the population is selected among the parents and offspring by tournaments. Each DAG B is compared with other randomly selected DAGs, and


a tournament score of B equals the number of rivals that B can beat, that is, the number of DAGs among those selected that have higher MDL scores than B. In our setting, q = 5. One half of the DAGs with the highest tournament scores are retained for the next generation. As is depicted in Table 2, the process is repeated until the maximum number of generations is reached, which depends on the complexity of the network structure. If one expects a simple network, the maximum number of generations can be set to a lower value. The network with the lowest MDL score emerges as the final solution. Once the learning algorithm returns the BN model with the lowest MDL score, the conditional probabilities of the nodes in the network can be calculated as follows:

P(N_i = v_{ik} | \Pi_{N_i} = w_{ij}) = (N_{ijk} + 1) / (N_{ij} + r_i),    (9)

where v_{ik} is a value of variable N_i, w_{ij} is an instantiation of the parent set \Pi_{N_i}, r_i is the number of different values of variable N_i, and N_{ijk} is the number of cases in the database in which variable N_i has the value v_{ik} and \Pi_{N_i} is instantiated as w_{ij}, with

N_{ij} = \sum_{k=1}^{r_i} N_{ijk}.    (10)
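Equation (9) is a Laplace-smoothed frequency estimate, so the parameters of the learned network can be read off from simple counts. The sketch below computes P(N_i = v | Π_{N_i} = w) for one node; the record layout and column names are hypothetical.

```python
from collections import Counter

def conditional_probability(data, node, parents, r_i):
    """P(N_i = v | parents = w) via Equation (9): (N_ijk + 1) / (N_ij + r_i)."""
    n_ijk = Counter((tuple(row[p] for p in parents), row[node]) for row in data)
    n_ij = Counter(tuple(row[p] for p in parents) for row in data)
    def prob(value, parent_values):
        w = tuple(parent_values)
        return (n_ijk[(w, value)] + 1) / (n_ij[w] + r_i)
    return prob

# Hypothetical usage: probability of purchase given discretised recency and frequency.
rows = [{"recency": 2, "frequency": 4, "purchase": 1},
        {"recency": 2, "frequency": 4, "purchase": 0},
        {"recency": 7, "frequency": 1, "purchase": 0}]
p = conditional_probability(rows, "purchase", ["recency", "frequency"], r_i=2)
print(p(1, (2, 4)))   # (1 + 1) / (2 + 2) = 0.5
```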

2.3. Bayesian Networks for Classification and Prediction As is depicted in Figure 2, learning BNs for classification takes two steps. First, for BN learning, the EP algorithm automatically finds the directed edges between the nodes to identify a network model that can best describe the relationships based on the MDL metric. Once the best network structure has been identified, the conditional probabilities are calculated based on the data to describe the relationships among the variables. Second, the learned BN is used to generate a probability score for each example case or unseen data for the purpose of cross-validation and forecasting (see §§2.1, 2.2.3, and Equation (10)). The probability score for the dependent variable (e.g., purchase), which ranges from 0 to 1, is used for predictive modeling. With these probability scores, researchers can use a cutoff point (e.g., 0.5) to evaluate the error rate of a predictive model. The predicted class or membership is then compared with the actual data to evaluate the predictive accuracy of the model and to make sales forecasts by validating the results on a testing data set. Alternatively, researchers can examine the percentage of true positives in the top deciles of the testing (validation) data to arrive at the predictive lift of the model.
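The scoring logic described above, a cutoff for classification error and decile lifts for validation, can be summarised in a few lines. The helper below assumes arrays of predicted probabilities and observed responses and is a sketch of the evaluation procedure, not the authors' code.

```python
import numpy as np

def decile_lifts(scores, outcomes, n_deciles=10):
    """Cumulative response lift by decile: response rate among the top k/10 of
    scored cases divided by the overall response rate, with 100 = random model."""
    order = np.argsort(scores)[::-1]                  # highest probability first
    hits = np.asarray(outcomes)[order]
    overall = hits.mean()
    lifts = []
    for k in range(1, n_deciles + 1):
        top = hits[: max(1, (len(hits) * k) // n_deciles)]
        lifts.append(100 * top.mean() / overall)
    return lifts

# Example: a 0.5 cutoff for the error rate, decile lifts for the gains table.
scores = np.array([0.91, 0.85, 0.70, 0.55, 0.42, 0.30, 0.22, 0.15, 0.08, 0.03])
actual = np.array([1, 1, 0, 0, 1, 0, 0, 0, 0, 0])
predicted = (scores >= 0.5).astype(int)               # cutoff-based class prediction
print((predicted != actual).mean())                   # simple error rate
print(decile_lifts(scores, actual))
```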

3. Results

3.1. The Data Sets and Experiments To test the proposed methods for modeling consumer responses to direct marketing, we perform experiments to learn BNs with a direct marketing data set. The Direct Marketing Education Foundation provided the data from a U.S.-based catalog direct marketing company that sells multiple product lines of general merchandise that range from gifts and apparel to consumer electronics. The company sends regular mailings to its list of customers, and this particular data set contains the records of 106,284 consumers. Each customer record contains 361 variables, including purchase data from recent promotions and the customer’s purchase history over a 12-year period. In a recent promotion, every customer in this data set received a catalog. This promotion achieved a 5.4% response rate, which represents 5,740 customers who made purchases from this catalog. In this study, we compare BN’s performance with that of several other methods that are known for their ability to solve classification problems. They include two other machine learning methods, ANN using Bayesian learning and MCMC (Neal 1996), and the classification tree by CART (Haughton and Oulabi 1997). Recently, latent class models, which control for unobserved heterogeneity among subjects using latent or unknown groups, have become increasingly popular in marketing research, and thus serve as another method for comparison with BNs (Jain et al. 1990). We examine the performance of these classification methods by testing the predictive ability of the models learned from the training data on unseen cases, that is, the testing (validation) data. In direct marketing applications, simple error rate may not be the most appropriate method for assessing classifier performance. First, despite the large size of the data set, the percentage of true positives is very small (5.4% in this case). Second, simple error rate assumes no variance in the cost of different misclassification errors (Baesens et al. 2002). For direct marketing models, false negatives are much more costly than false positives. The loss from false positives is the cost of mailing, but the opportunity cost from false negatives—the loss of potential sales (US$80 on average in this case) and profit—is often much greater. Furthermore, due to budget constraints, typically only the names in the top two deciles or the 80th percentile (those with the highest probability to respond) will receive a catalog (Berger and Magliozzi 1992). Thus, the cumulative response lift in the top two deciles of the file (testing data sets) is used to compare the performance of these methods. It is the ratio ∗ 100 of the number of true positives (TPs) in a decile identified by the proposed model versus the number of TPs identified by a random model, which is the number of TPs


divided by the number of deciles (10). For instance, a model with a top decile lift of 200 is said to perform twice as well as a random model. 3.2. Dealing with Endogeneity In this data-mining task, we are interested in exploring the insight that can be gained by integrating lifetime and transaction variables with RFM variables. First, nine variables are selected by logistic regression using the forward selection criterion (p = 0.05): recency (Recency), which is the number of months that have elapsed since the last purchase; the frequency of purchase in the last 36 months (Frequency); the monetary value of purchases in the last 36 months (Monetary value); the average order size (Ordsize); lifetime orders (number of orders placed, Liford); lifetime contacts (number of mailings sent, Lifcont); and whether a customer typically places orders by telephone (Tele); makes cash payment (Cash); or uses the “house” credit card (Hcrd) issued by the catalog company. Although every customer in this data set received a catalog from the company in the current mailing, the data do not have information on prior selection by the management for each of the previous mailings. We cannot explicitly model management selection with this data set, but the model includes the number of lifetime contacts (Lifcont), which indicates the previous selection by the management. Another critical issue is the possible endogeneity of the RFM variables. In a structural equation model y = βx + ε, a predictor variable must be uncorrelated with the error of the model to infer the causality of x on y. If x is correlated with the model error ε, this variable is said to be endogenous, and its parameter estimates may be biased. Such problems are due to the omitted variables embedded in the error, which simultaneously affect both the y and x variables. Direct marketers often use RFM variables to predict future purchases, yet RFM variables are based on previous responses from these households. In this sense, the RFM variables may be endogenous and their parameter estimates biased due to the correlations between RFM variables and the error of the model. This is a common problem that is often associated with recurring data and arises from the lack of empirical data to control for endogeneity. Unlike regression methods, which focus on the conditional probability, BNs do not specify a model structure, but learn a joint probability distribution among all the variables simultaneously from the observed data. Thus, BNs do not suffer from the endogeneity bias. Both ANN and CART estimate a conditional distribution and may suffer from a potential


endogeneity bias. However, no solution to this problem has been devised for these two methods. Readers are advised to proceed with caution in interpreting their results. As for latent class regression, which assumes a logit model, endogeneity correction is necessary to remove such bias and to produce consistent parameter estimates. In the existing literature, researchers have adopted several solutions to this problem, including the instrumental variable method (Gönül et al. 2000) and the control function approach (Blundell and Powell 2004). As our dependent variable is binary, we adopt the control function approach to test for endogeneity bias, which is accomplished by adding the residuals of the endogenous variables into the model as control variables. Following the procedures in Blundell and Powell (2004), we first run a parametric reduced-form regression to compute the estimates of endogenous RFM variables on the whole data set. In the second stage, the residuals of the reduced-form regressors are included as covariates in the binary response model to account for their endogeneity. In Table 3, the first three columns are the reduced-form estimates for the RFM variables with two lifetime variables as covariates. Given their adjusted R-squares, the explanatory power of the three reduced-form equations is fairly high. The fourth column refers to the model without any adjustment for endogeneity. Except for monetary value, recency and frequency have very small coefficient estimates, although they are statistically significant. Despite the large data set (N = 106,280), the overall fit of the uncorrected model is statistically insignificant (p = 0.620), indicating a potential endogeneity bias in the RFM variables. The last column includes the residuals of the RFM variables as the control variables. The coefficient estimates change drastically, and thus the effect of correcting for endogeneity is obvious given the improvement in model fitness (p = 0.021). For the endogeneity tests, we employ the asymptotic t-test developed by Smith and Blundell (1986). The significant results of the t-tests reject the null hypothesis of exogeneity for the RFM variables. To assess the effect of endogeneity correction on predictive performance, we test the corrected RFM model by including the residuals of the RFM variables in the latent class logit model, perform a tenfold cross-validation, and compare the results with those of the uncorrected model. The use of the control function approach in the model produces a top decile lift of 401, which is a significant improvement over the uncorrected model (with a top decile lift of 334) and the model corrected with the instrumental variables (with a top decile lift of 397). Thus, the control function approach is adopted for subsequent experiments with the latent class method.
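For readers who want to reproduce the two-stage logic, the sketch below implements a generic control function correction in the spirit of Blundell and Powell (2004): reduced-form regressions of the endogenous regressors on the exogenous lifetime variables, followed by a binary response model that includes the stage-one residuals. The data-generating step and variable layout are purely illustrative, not the study's data or code.

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical design matrices: Z holds the exogenous lifetime variables
# (Lifcont, Lifcont^2, Liford, Liford^2), X holds the endogenous RFM variables,
# and y is the binary purchase response.
rng = np.random.default_rng(0)
n = 1000
Z = sm.add_constant(rng.normal(size=(n, 4)))
X = Z @ rng.normal(size=(5, 3)) + rng.normal(size=(n, 3))        # recency, frequency, monetary
y = (rng.random(n) < 1 / (1 + np.exp(-(X @ [0.2, 0.5, 0.3])))).astype(int)

# Stage 1: reduced-form regressions of each RFM variable on the instruments.
residuals = np.column_stack([
    sm.OLS(X[:, j], Z).fit().resid for j in range(X.shape[1])
])

# Stage 2: binary response model with the stage-1 residuals as control variables.
W = sm.add_constant(np.column_stack([X, residuals]))
corrected = sm.Logit(y, W).fit(disp=0)
print(corrected.params)        # significant residual terms point to endogeneity
```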


Table 3

Results of the Endogeneity Tests Reduced-form regression

Logit model

Models and tests/variables

Recency

Frequency

Monetary value

Uncorrected

Constant

32650∗∗ 0206 —

0969∗∗ 0022 —

3427∗∗ 0006 —

−4068∗∗ 0097 −0026∗∗ 0002

−11225∗∗ 1177 −0219∗∗ 0019

−6428∗∗ 0028 —



0025∗∗ 0010 0368∗∗ 0022

−1297∗∗ 0056



0379∗∗ 0002 —

Lifetime contacts

−2353∗∗ 0118

−0193∗∗ 0013

00545∗∗ 0001





Lifetime contacts2

0302∗∗ 0015 0668∗∗ 0011

0048∗∗ 0002







0121∗∗ 0001

−00085∗∗ 0000





−0006∗∗ 0000

−0001∗∗ 0000







0345

0255

0464

p = 0620

9428∗∗ t  −14486∗∗ t  −15110∗∗ t  p = 0021

Recency Frequency Monetary value

Lifetime orders Lifetime orders2 Adjusted R2 Endogeneity test: Recency Endogeneity test: Frequency Endogeneity test: Monetary value

Corrected

3511∗∗ 0217

Notes. (1) The standard errors are shown in parentheses. (2) The endogeneity test is the asymptotic t-test. ∗∗ = significant at the 0.001 level.

3.3. Results of Bayesian Networks To learn BNs, discretization is performed by the recursive minimal entropy partitioning method for several continuous variables with a large number of values, including frequency and monetary value, to reduce the size of the data matrix. In the EP learning process, the number of BN models in each generation is set at 100, with 50 parent models and 50 offspring models. The maximum number of generations is set at 5,000, which is sufficient to allow even complex models to converge. In a holdout validation experiment in which 90% of the data is allocated for training and the remaining 10% for validation, the BN model has a top decile lift of 413, which is 4.13 times as good as a random model. The MDL scores of the models in the first generation are around 1,130,000. Then the MDL metric of BNs declines as EP learns better network models. The MDL metrics of subsequent models show significant improvement until the 870th generation, beyond which the rate of improvement starts to level off. The optimal BN model appears at the 1,045th generation, with a MDL score of 1,047,550. As each generation compares 100 models (50 parent models and 50 offspring models), the optimal solution emerges after the comparison of more than 50,000 models. These results suggest that the performance of BNs is easily tractable in the EP process and demonstrates a high degree of

transparency, and the MDL metric is effective in optimizing BNs. We now examine the empirical results of BNs to assess their ease of interpretation and managerial insight. The DAG for the optimal BN model delineates the qualitative relationships among the variables (Figure 3). Close examination of the DAG model reveals several structures. The three variables of recency, frequency, and monetary value form a network that has a direct effect on consumer responses to the promotion, which confirms the explanatory power of the RFM model, despite its simplicity. The two

Figure 3  A Directed Acyclic Graph Model for the Catalog Promotion
[Nodes: Recency, Frequency, Monetary value, Lifcont, Liford, Ordsize, Cash, Tele, Hcrd, and Purchase.]
Note. Due to limited space, the posterior probabilities and examples of conditional probabilities of this model are given in Table 4 and Appendix A, respectively.


lifetime variables of lifetime contacts (Lifcont) and lifetime orders (Liford), together with the RFM variables, form a larger network. In addition, the transaction variables of telephone (Tele), using the house credit card (Hcrd), and cash payment (Cash), together with recency and frequency, also help to explain consumer responses. Overall, the RFM, lifetime, and transactional variables together form a BN model to predict consumer responses. Based on the DAG model in Figure 3, the posterior probabilities are generated for all nine variables (Table 4). These results define the quantitative relationships among the variables, and help us to understand how they relate to consumer responses. First, the overall probability of response is 5.4%. The probability of purchase is the highest when the value of recency is 2, but is also relatively high when the value of recency is 4 or 6, which means that recency has a nonlinear effect on purchase. Higher values for frequency and monetary value lead to a higher purchase probability. This is also true for lifetime contacts, lifetime orders, and order size. The probabilities of using the house credit card, cash payment, and telephone order are relatively low. Based on these results, we can draw forward probabilistic inferences about the effect of one variable on the likelihood of purchase given the information of the other variables. For instance, given everything else we know about these customers, and when the value of recency is 7 (the largest number of months that have elapsed since the last purchase), the posterior probability of purchase is 0.018. Moreover, associated with the graphic model are the conditional probabilities for each of the variables (Appendix A). According to Table A1, given the values of recency (2) and frequency (4), when monetary value is the highest (6) the conditional probability of purchase equals 0.421, which is more than twice as high as when monetary value is lower (2 and 3). Table A2 suggests that if consumers tend to use the house card for payment, they will also use the telephone to place orders P = 0635. As is shown in Table 4

Posterior Probabilities of Variables in the Bayesian Network Model

Variables/values

x =0 x =1 x =2 x =3 x =4 x =5 x =6 x =7

P (Purchase) 0.054 P (Recency) 0.056 P (Frequency) 0.018 P (Monetary value) 0.019 P (Lifcont) 0.036 P (Liford) 0.038 P (Ordsize) 0.042 P (Hcrd) 0.053 0.055 P (Cash) 0.057 0.040 P (Tele) 0.050 0.070

0.168 0.045 0.046 0.050 0.075 0.051

0.032 0.063 0.062 0.048 0.087 0.057

0.066 0.092 0.085 0.058 0.153 0.066

0.023 0.054 0.018 0.137 0.115 0.183 0.085 0.071 0.086

Notes. (1) Hcrd, Cash, and Tele are binary variables, and thus only two values are recorded. (2) Due to the symmetric nature of the probabilities, the probabilities for nonpurchases are omitted.


Table A3, at a high frequency level (4) consumers tend to place telephone orders with a probability of 0.447, and at a lower frequency level (2) they are more likely to pay with the house card P = 0351. At a higher frequency level (4) they may place telephone orders and use the house card for payment at the same time P = 0304. These conditional probabilities help us to understand the general relationships among the variables. Furthermore, we can combine the posterior and conditional probabilities to perform inferences in a backward fashion to determine the combined effect of several variables on purchase probability (Blodgett and Anderson 2000). In other words, we can reverse the question—“Given a customer’s purchase from the promotion, what is the probability of a particular variable or value being the cause?” As for the effects of lifetime contacts (Table A4), at low frequency levels (2) lifetime contacts make little difference in consumer purchase, but at high frequency levels (5) lifetime contacts increase the purchase probability. These results can help identify the most attractive customer groups and build consumer profiles for segmentation purposes. Based on the conditional probabilities, researchers can also suggest specific actions to improve the response rate of direct marketing campaigns. For instance, in Table A5, at a low frequency level (1) the purchase probability is very low (0.018), and neither telephone orders nor payment with the house credit card makes a difference. At the highest frequency level (5) the probabilities of placing a telephone order and using the house credit card increase substantially. These results suggest that by integrating customer lifetime and transaction variables with the RFM variables, we can gain additional insight into consumer responses. 3.4. Results of the Other Methods We adopt the Bayesian approach to learning neural network models and test different ANNs by adjusting the number of hidden layers and hidden nodes (Warner 1997). After repeated trials, an ANN model with one hidden layer and five hidden nodes (neurons) returns the best model with the lowest error rate. Thus, the learning process is neither automatic nor transparent. The neural network model depicts the unidirectional effects from the input variables to the hidden nodes and purchase. In terms of predictive accuracy, the ANN model has a top decile lift of 376 on the holdout validation data set. In addition, the model offers a set of weights or parameters for the edges from the input variables to the five hidden nodes 5 ∗ 9 = 45, the weights from the hidden nodes to the output, and the weights from bias (intercept) to the hidden nodes and the output. The hidden nodes can be viewed as latent factors or segments,



and the activation values (weights) of the variables vary across them. We may conclude that the hidden neurons help to explain consumer response. Although ANNs achieve a reasonable level of predictive accuracy, its empirical results do not formalize the relationships among the variables in a user-friendly and comprehensible way, nor does it provide the opportunity to gain fresh insight into the problem. Even for competent users of ANNs, its lack of transparency and explanatory capability has been a major drawback (West et al. 1997). CART is a recursive partitioning method that generates a tree model by “splitting the tree” at each node. It uses the GINI index to determine how well the splitting rule separates the classes contained in the parent node. Once the best split is found, CART repeats the search process for another child node, and continues recursively until further splitting is impossible. Instead of deciding whether a given node is terminal, CART grows the tree to the maximum size and then starts “pruning” it to examine smaller trees. Finally, CART selects the best tree by testing its error rate. In the holdout validation experiment, the CART model achieves a top decile lift of 366 with the testing data set. The resulting “optimal” tree with 379 nodes (splits) is overwhelming, even for sophisticated users. In fact, it gives a set of weight scores to each of the nodes of the tree, including all the recursive nodes. CART then ranks the variables in terms of their overall explanatory power (weights) in classifying the cases, with lifetime orders at the top (100), followed by monetary value (78.3), recency (51.4), lifetime contacts (51.2), frequency (46.9), and order size (23.7). The other three variables make little difference. Despite its apparent capability for classification, it is not easy even for the trained eye to understand how the results can help explain consumer responses. Latent class regression estimates a logit model based on the assumption that the coefficients of the predictors differ across unobserved latent segments, and executes separate regressions for each of the latent classes. The procedure produces one set of parameters for the predictor variables, including the three residual variables that control for endogeneity, and indicates their overall effects and their coefficient estimates for each of the latent segments. In the holdout validation experiment, the number of latent classes that we tested ranged from two to five. It appears that a two-class model achieves the best fit, based on the L-square statistic and the BIC value. This model achieves a top decile lift of 387 on the holdout sample. As with a typical regression, the overall coefficient estimates indicate the effects of the 12 predictor variables. In Table 5, Class 1 achieves a very low R-square of 0.045 in comparison with Class 2 (R-square = 0712). In addition, Class 1 has different


Latent class regression estimates a logit model under the assumption that the coefficients of the predictors differ across unobserved latent segments, and estimates separate regressions for each of the latent classes. The procedure produces one set of parameters for the predictor variables, including the three residual variables that control for endogeneity, and reports their overall effects as well as their coefficient estimates for each of the latent segments. In the holdout validation experiment, we tested models with two to five latent classes. A two-class model achieves the best fit based on the L-square statistic and the BIC value, and attains a top decile lift of 387 on the holdout sample. As with a typical regression, the overall coefficient estimates indicate the effects of the 12 predictor variables. In Table 5, Class 1 achieves a very low R-square of 0.045 in comparison with Class 2 (R-square = 0.712).

Table 5    Coefficient Estimates by Latent Class Regression

Variables         Overall    Wald      p-value    Class 1    Class 2    Wald(=)    p-value
Recency            0.848     13.301     0.000      0.173      3.136      32.402     0.000
Frequency         -0.648     23.883     0.000      0.323     -3.943      36.681     0.000
Monetary value     0.001      0.994     0.320      0.001     -0.001      11.268     0.004
Lifcont            1.410     24.292     0.000      0.468      4.606     170.049     0.000
Liford             0.687     89.470     0.000      0.140      2.541     114.781     0.000
Ordsize            0.052      0.762     0.380      0.017      0.170       2.296     0.320
Hcrd              -0.243      0.154     0.700     -0.225     -0.303      78.662     0.000
Cash              -0.052      2.399     0.120     -0.092      0.086       9.939     0.007
Tele              -0.124      1.320     0.250     -0.040     -0.411       6.890     0.032
Res. R            -0.902     14.768     0.000     -0.187     -3.327      36.560     0.000
Res. F             6.673     26.096     0.000      0.940     26.128      44.249     0.000
Res. M             0.007      0.468     0.490     -0.053      0.208       1.765     0.410
R-square           0.225                           0.048      0.462

Note. Wald(=) tests whether the coefficient is equal across the two latent classes.

In addition, Class 1 has parameter estimates that differ from those of Class 2, suggesting a different customer profile. A researcher may suggest that the heterogeneity across these two clusters improves predictive accuracy, but the meaning of such unobserved clusters and their effect on consumer response is not apparent to the average analyst and must be rationalized by the researcher.

3.5. Comparisons of Alternative Methods
In the holdout validation experiment, BNs achieve the highest top decile lift (413), followed by latent class regression (387), neural networks (376), and CART (366). However, a single train-test validation is hardly sufficient to assess the performance of a method. To further compare the robustness of BNs against the other methods, we conduct a tenfold cross-validation experiment (Kohavi 1995, Mitchell 1997). First, we use stratified random sampling to partition the data set into 10 disjoint subsets of equal size, each with 10,629 cases and 5.4% buyers. Following the standard practice of tenfold cross-validation, we then train and test all four methods 10 times, using each of the 10 subsets in turn as the testing set (validation sample) and the remaining nine subsets combined as the training set (estimation sample).
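As a sketch of the validation procedure just described, the snippet below pairs stratified tenfold cross-validation with a cumulative-lift calculation. The classifier shown is a generic placeholder rather than any of the four models compared in this study, and the data-loading step is assumed.

```python
# Sketch of stratified tenfold cross-validation with a top-decile lift metric.
# X (predictors) and y (0/1 purchase) are assumed to be NumPy arrays, e.g.,
# ten folds of roughly 10,629 cases each with about 5.4% buyers.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.linear_model import LogisticRegression  # placeholder model

def cumulative_lift(y_true, scores, deciles=1):
    """Cumulative lift index (base rate = 100) for the top `deciles` tenths."""
    order = np.argsort(-scores)                      # rank customers by predicted response
    top_n = max(1, int(len(y_true) * deciles / 10))
    top_rate = y_true[order][:top_n].mean()          # response rate among targeted customers
    return 100.0 * top_rate / y_true.mean()

def tenfold_top_decile_lift(model, X, y):
    lifts = []
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):      # train on 9 folds, test on the 10th
        model.fit(X[train_idx], y[train_idx])
        scores = model.predict_proba(X[test_idx])[:, 1]
        lifts.append(cumulative_lift(y[test_idx], scores, deciles=1))
    return np.mean(lifts), np.std(lifts)             # mean and s.d., as reported in Table 6

# Example: mean_lift, sd_lift = tenfold_top_decile_lift(LogisticRegression(max_iter=1000), X, y)
```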


Table 6    Gains Table Based on Tenfold Cross-Validation (Cumulative Lift)

Decile    Bayesian networks    ANNs            CART            Latent class regression
1         408.0 (17.0)         396.5 (24.3)    365.7 (18.4)    401.1 (26.3)
2         282.7 (8.2)          285.2 (12.8)    258.2 (15.2)    236.1 (9.3)
3         222.7 (4.8)          225.3 (6.7)     204.8 (8.9)     185.7 (8.2)
4         186.8 (3.8)          190.6 (3.4)     175.5 (5.4)     156.5 (6.1)
5         162.2 (3.3)          166.0 (4.3)     149.0 (2.8)     135.8 (5.0)
6         144.5 (3.0)          147.5 (2.9)     132.5 (2.6)     120.6 (3.4)
7         129.8 (2.1)          131.7 (2.4)     120.1 (2.7)     107.0 (2.8)
8         118.0 (1.4)          119.7 (1.4)     110.9 (2.1)      95.5 (2.1)
9         107.4 (0.7)          108.9 (0.8)     104.2 (1.5)      93.4 (1.3)
10        100.0 (0.0)          100.0 (0.0)     100.0 (0.0)     100.0 (0.0)

Note. The reported figures are the means of the cumulative lifts over the 10 experiments; standard deviations are given in parentheses.

As is shown in Table 6, BNs provide the highest average lift in the top decile (408), followed by latent class regression (401), ANNs (397), and CART (366). In the second decile, the ANNs have the highest cumulative lift (285), followed by BNs (283) and CART (258), whereas latent class regression drops to the last position (236). Overall, no method dominates the others throughout the deciles. Although latent class regression provides the second-highest top decile lift, its cumulative lift drops sharply in the second decile and starts to trail behind the other methods, which indicates the instability of the method. In contrast, the results of BNs are rather stable, and their standard deviations of cumulative lifts are mostly lower than those of the competing methods. These results suggest that BNs not only predict consumer response with a high level of accuracy, but also demonstrate a higher degree of robustness across different data sets.

In Table 7, we compare these methods in terms of their transparency, interpretability, and insight. Overall, these methods are based on strong theoretical rationales and computational procedures, and they produce different results based on their model configurations and the objective functions they emulate. The ANN procedure produces the optimal structure by trial and error based on predictive accuracy, and its results specify the weights between the input variables and the hidden nodes and between the hidden nodes and the output. The optimal number of latent classes in the latent class regression is determined by the L-square statistic and the BIC value, which are relatively straightforward. Latent class regression produces parameter estimates for each of the predictors, which differ across the two latent clusters (classes). In this sense, both methods rely on unobserved latent variables to improve their classification performance. The CART model specifies how the variables form a splitting-tree model in a node-by-node fashion using the recursive partitioning method (i.e., repeated use of the same variables in the tree), and then ranks the variables in terms of their discriminating power (Gini index). However, the tree structure is simply too complex for meaningful comprehension.

Although these methods achieve a reasonable level of classification accuracy, their procedures are not as transparent to the average user of marketing research, and the sources of improvement, if any, remain largely "hidden." Their results are not easy to comprehend and do not help the average user to gain insight into the problem. From the viewpoint of data mining, they are less intuitive in revealing how the predictor variables affect consumer response or what actions could be taken to improve the response rate. In addition to their predictive accuracy, BNs exhibit a high level of transparency in the optimization process. Furthermore, the results are easier to comprehend and provide more explanatory insight. BNs construct a graphic model to describe the relationships among the directly observable variables, and then produce the posterior and conditional probabilities among the nodes to quantify their relationships. Whereas other methods such as latent class regression only describe the conditional distribution of the dependent variable, BNs provide a joint probability distribution of all the variables, including the conditional distribution of all the exogenous variables. Based on the table of probabilities, BNs can reveal the nonlinearities and interactions among the variables, something that other methods cannot convey in a straightforward fashion. Overall, the results of BNs are more interpretable for understanding the underlying relationships among the variables, and provide more managerial insight into the problem and hence better decision support to managers in terms of customer selection for marketing promotions. On a minor note, learning BNs with EP is easy to execute and computationally efficient, even with large databases. Whereas other methods take hours to arrive at the optimal solution, BNs learned using EP take only a fraction of the time to converge.

Table 7    Comparison of the Alternative Methods

Criteria: 1. Model structure; 2. Results; 3. Accuracy*; 4. Transparency; 5. Interpretability/insight.

Bayesian networks: (1) graphic network model with nodes; (2) posterior and conditional probabilities; (3) high; (4) high; (5) high.
ANNs: (1) network with input, output, hidden layer, and hidden nodes; (2) weights of variables, nodes, and biases; (3) high; (4) low; (5) low.
CART: (1) recursive classification and regression tree; (2) weights of variables as splitting nodes; (3) medium; (4) low; (5) low.
Latent class regression: (1) logit model with latent classes; (2) linear parameter estimates; (3) high; (4) medium; (5) medium.

*Based on the top decile lift.

4. Conclusion

The results of this study show that BNs learned by EP have performed satisfactorily against the objectives of predictive modeling in direct marketing and data mining with large noisy databases, and that they have several advantages over the other methods.


First, BNs attain a high level of predictive accuracy and provide a good representation of the underlying distribution of probabilities. Second, the optimization algorithm of EP used in this study is straightforward. The MDL metric as a fitness criterion balances model accuracy and simplicity, and is effective in minimizing overfitting (one standard form of this score is sketched below). Based on the MDL metric, the optimization process by EP is completely tractable and gives a clear indication of the relative performance of the competing BN models identified in the evolutionary process. Together, they operate like a "white box" with a great degree of transparency. Third, the tenfold cross-validation experiments indicate that BNs deal with the bias-variance trade-off effectively and exhibit a high level of robustness.

More importantly, the BN topology is intuitively appealing. A BN (a DAG) defines a qualitative model for the joint probability distribution and describes complex structures with efficiency and modularity. The posterior probabilities help to determine the effect of the predictors on consumer response and reveal the nonlinear and interactive effects. Together with the conditional probabilities, the resulting model is able to elaborate how these variables together explain consumer response and draw inferences about consumer variables that have meaningful implications for managerial actions. These advantages provide a straightforward representation of the structure of the problem, more interpretable results, and greater explanatory insight that can provide decision support (Chiogna 1997). Together, these features put an intuitively appealing interface on the machine learning process and demystify the methods of artificial intelligence.

The conventional approach to knowledge development is largely theory driven. A researcher tests hypotheses about the relationships among the variables of interest (Malhotra et al. 1999). Research without a theoretical root is often considered to be lacking in intellectual merit and analytical rigor, but the current environment demands more problem-oriented research and feasible methods to explore the vast quantities of disaggregated data. Data mining aided by sophisticated technology and versatile algorithms can remove many of the restrictions associated with traditional methods, and has become increasingly important as a new way of discovering knowledge. BNs together with EP can help researchers to gain fresh insight into research problems by exploring relationships that were not anticipated, and provide a viable alternative approach that can complement traditional methods. Given the increasing amount and variety of data and the demand for reducing the research cycle, these methods present efficient tools for marketing managers to extract and update knowledge in a timely fashion to assist decision making.
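For readers who want the scoring metric in symbols, one standard form of the MDL score for a candidate network B on a data set D of N cases is given below; the exact encoding terms used by Lam and Bacchus (1994) and in our EP implementation may differ in detail.

\[
\mathrm{MDL}(B, D) \;=\; \frac{\log_2 N}{2}\,\dim(B) \;-\; \sum_{d=1}^{N} \log_2 P_B(\mathbf{x}_d),
\qquad
\dim(B) \;=\; \sum_{i=1}^{n} (r_i - 1)\, q_i ,
\]

where r_i is the number of states of node X_i, q_i is the number of configurations of its parents, and P_B(x_d) is the likelihood of case x_d under network B. Lower scores are preferred: the first term charges for the parameters needed to describe the network, and the second rewards networks that fit the data, so dense structures are retained only if they improve the fit enough to pay for their extra parameters.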


However, researchers need to be aware of the limitations of these methods. BNs are discrete in nature. Although discretization simplifies the learning process and the resulting model, there may be a loss of potentially useful information, and the model may not fully capture all the details of the relationships. EP can explore a wider search space to optimize BNs by comparing many alternatives, but this process may include invalid models and affect the efficiency of the optimization process. Placing constraints on the learning algorithm based on existing domain knowledge may help guide the search process to avoid invalid models and to improve the overall efficiency and accuracy. The integration of existing domain knowledge using supervised learning is a fruitful avenue for future research. Finally, the current BN model cannot account for the element of firm behavior that exists in this data set, and future studies should use a data set that allows the explicit treatment of selection by the management to correct for potential selection bias.

The convergence of the statistical approach and machine learning methods represents one of the most promising areas of research in the provision of improved optimization solutions and better decision support for managers (Nakhaeizadeh and Taylor 1997). As a universal approximator, BNs can be applied to a whole array of business problems. BNs can handle multiple dependent variables and deal with multiobjective optimization problems. Given their efficiency in computing conditional and marginal probabilities, BNs can also be used as a tool for variable selection. BNs can be further enhanced by tree-augmented networks (TAN) to perform decision tree analysis and to examine the effect of business strategies. Latent class BNs can perform cluster analysis for market segmentation and assist in devising differentiated promotion strategies. Obviously, how to take advantage of these features of BNs requires more applications and collaboration between management researchers and machine learning specialists.

An online supplement to this paper is available on the Management Science website (http://mansci.pubs.informs.org/ecompanion.html).

Acknowledgments

The authors thank the associate editor and three anonymous reviewers for their insightful comments. They also thank Dr. Guichang Zhang, Lin Li, Zhen Zhao, and Yuanyuan Guo for their assistance in data processing and conducting the experiments and Lingnan University for funding this project.

Appendix. Conditional Probabilities of Selected Variables

A complex DAG model like the one in Figure 3 has many tables of conditional probabilities. We only show the following ones as examples. As some tables have many entries, only a few entries are shown here.


A1: P(Purchase | Recency, Frequency, Monetary value)

                                                                       x = 1
P(Purchase = x | Recency = 2, Frequency = 4, Monetary value = 1)       0.001
P(Purchase = x | Recency = 2, Frequency = 4, Monetary value = 2)       0.194
P(Purchase = x | Recency = 2, Frequency = 4, Monetary value = 3)       0.198
P(Purchase = x | Recency = 2, Frequency = 4, Monetary value = 4)       0.244
P(Purchase = x | Recency = 2, Frequency = 4, Monetary value = 5)       0.320
P(Purchase = x | Recency = 2, Frequency = 4, Monetary value = 6)       0.421

A2: P(Hcrd | Tele)

                            x = 0    x = 1
P(Hcrd = x | Tele = 0)      0.853    0.147
P(Hcrd = x | Tele = 1)      0.365    0.635

A3: P(Frequency | Hcrd, Tele)

                                          x = 1    x = 2    x = 3    x = 4    x = 5
P(Frequency = x | Hcrd = 0, Tele = 0)     0.211    0.510    0.159    0.105    0.015
P(Frequency = x | Hcrd = 0, Tele = 1)     0.026    0.162    0.238    0.447    0.128
P(Frequency = x | Hcrd = 1, Tele = 0)     0.217    0.351    0.200    0.194    0.038
P(Frequency = x | Hcrd = 1, Tele = 1)     0.065    0.298    0.255    0.304    0.078

A4: P(Purchase | Frequency, Lifcont)

                                                x = 1
P(Purchase = x | Frequency = 2, Lifcont = 1)    0.032
P(Purchase = x | Frequency = 2, Lifcont = 2)    0.050
P(Purchase = x | Frequency = 2, Lifcont = 3)    0.038
P(Purchase = x | Frequency = 2, Lifcont = 4)    0.039
P(Purchase = x | Frequency = 2, Lifcont = 5)    0.068
...
P(Purchase = x | Frequency = 5, Lifcont = 1)    0.129
P(Purchase = x | Frequency = 5, Lifcont = 2)    0.131
P(Purchase = x | Frequency = 5, Lifcont = 3)    0.132
P(Purchase = x | Frequency = 5, Lifcont = 4)    0.135
P(Purchase = x | Frequency = 5, Lifcont = 5)    0.141

A5: P(Purchase | Frequency, Hcrd, Tele)

                                                        x = 1
P(Purchase = x | Frequency = 1, Hcrd = 0, Tele = 0)     0.018
P(Purchase = x | Frequency = 1, Hcrd = 0, Tele = 1)     0.018
P(Purchase = x | Frequency = 1, Hcrd = 1, Tele = 0)     0.018
P(Purchase = x | Frequency = 1, Hcrd = 1, Tele = 1)     0.018
...
P(Purchase = x | Frequency = 5, Hcrd = 0, Tele = 0)     0.141
P(Purchase = x | Frequency = 5, Hcrd = 0, Tele = 1)     0.141
P(Purchase = x | Frequency = 5, Hcrd = 1, Tele = 0)     0.130
P(Purchase = x | Frequency = 5, Hcrd = 1, Tele = 1)     0.130

Note. Due to the symmetric nature of the posterior probabilities, the probability scores are omitted when Purchase = 0.

References

Allenby, G. M., R. P. Leone, L. Jen. 1999. A dynamic model of purchase timing with application to direct marketing. J. Amer. Statist. Assoc. 94(446) 365–374.
Baesens, B., S. Viaene, D. van den Poel, J. Vanthienen, G. Dedene. 2002. Bayesian neural network learning for repeat purchase modelling in direct marketing. Eur. J. Oper. Res. 138(1) 191–211.
Berger, P., T. Magliozzi. 1992. The effect of sample size and proportion of buyers in the sample on the performance of list segmentation equations generated by regression analysis. J. Direct Marketing 6(1) 13–22.
Bhattacharyya, S. 1999. Direct marketing performance modeling using genetic algorithms. INFORMS J. Comput. 11(3) 248–257.
Bitran, G., S. Mondschein. 1996. Mailing decisions in the catalog sales industry. Management Sci. 42(9) 1362–1381.
Blodgett, J. G., R. D. Anderson. 2000. A Bayesian network model of the consumer complaint process. J. Service Res. 2(4) 321–338.
Blundell, R. W., J. L. Powell. 2004. Endogeneity in semiparametric binary response models. Rev. Econom. Stud. 71 655–679.
Chiogna, M. 1997. Probabilistic symbolic classifiers: An empirical comparison from a statistical perspective. G. Nakhaeizadeh, C. C. Taylor, eds. Machine Learning and Statistics. John Wiley & Sons, New York.
Cooper, L. G. 2000. Strategic marketing planning for radically new products. J. Marketing 64(1) 1–16.
Fogel, D. B. 1994. An introduction to simulated evolutionary optimization. IEEE Trans. Neural Networks 5(1) 3–14.
Gönül, Füsun F., Byung-Do Kim, Mengzhe Shi. 2000. Mailing smarter to catalog customers. J. Interactive Marketing 14(2) 2–16.
Haddawy, P. 1999. An overview of some recent developments in Bayesian problem-solving techniques. AI Magazine 20(2) 11–19.
Hansen, M. H., B. Yu. 2001. Model selection and the principle of minimum description length. J. Amer. Statist. Assoc. 96(454) 746–773.
Haughton, D., S. Oulabi. 1997. Direct marketing modeling with CART and CHAID. J. Direct Marketing 11(4) 42–52.
Heckerman, D. 1997. Bayesian networks for data mining. Data Mining Knowledge Discovery 1 79–119.
Hu, M. Y., M. Shanker, M. S. Hung. 1999. Estimation of posterior probabilities of consumer situational choices with neural network classifiers. Internat. J. Res. Marketing 16(4) 307–317.
Jain, D., F. M. Bass, Y.-M. Chen. 1990. Estimation of latent class models with heterogeneous choice. J. Marketing Res. 27(1) 94–101.
Jensen, F. V. 1996. An Introduction to Bayesian Networks. Springer-Verlag, New York.
Kohavi, R. 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. Proc. 14th Internat. Joint Conf. on Artificial Intelligence, Montreal, Canada.
Lam, W., F. Bacchus. 1994. Learning Bayesian belief networks: An approach based on the MDL principle. Comput. Intelligence 10(3) 269–293.
Larrañaga, P., M. Poza, Y. Yurramendi, R. Murga, C. Kuijpers. 1996. Structure learning of Bayesian networks by genetic algorithms: A performance analysis of control parameters. IEEE Trans. Pattern Anal. Machine Intelligence 18(9).
Malhotra, N. K., M. Peterson, S. Bardi. 1999. Marketing research: A state-of-the-art review and directions for the twenty-first century. J. Acad. Marketing Sci. 27(2) 160–183.
Michie, D., D. J. Spiegelhalter, C. C. Taylor. 1994. Machine Learning, Neural and Statistical Classification. Ellis Horwood, New York.
Mitchell, T. 1997. Machine Learning. McGraw-Hill, New York.
Nakhaeizadeh, G., C. C. Taylor. 1997. Introduction. G. Nakhaeizadeh, C. C. Taylor, eds. Machine Learning and Statistics. John Wiley & Sons, New York.


Neal, R. M. 1996. Bayesian Learning for Neural Networks. Lecture Notes in Statistics, Springer, New York.
Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.
Pearl, J. 2000. Causality: Models, Reasoning, and Inference. Cambridge University Press, Cambridge, UK.
Rao, V. R., J. H. Steckel. 1995. Selecting, evaluating, and updating prospects in direct mail marketing. J. Direct Marketing 9(2) 20–31.
Rissanen, J. 1978. Modeling by shortest data description. Automatica 14 465–471.
Smith, Richard J., Richard W. Blundell. 1986. An exogeneity test for a simultaneous equation tobit model with an application to labor supply. Econometrica 54(May) 679–685.


Venkatesan, R., V. Kumar. 2004. A customer lifetime value framework for customer selection and resource allocation strategy. J. Marketing 68(4) 106–125.
Warner, B. A. 1997. Bayesian learning for neural networks. J. Amer. Statist. Assoc. 92 791–792.
West, P. M., P. L. Brockett, L. L. Golden. 1997. A comparative analysis of neural networks and statistical methods for predicting consumer choice. Marketing Sci. 16(4) 370–391.
Wong, M. L., W. Lam, K. S. Leung. 1999. Using evolutionary computation and minimum description length principle for data mining of Bayesian networks. IEEE Trans. Pattern Anal. Machine Intelligence 21(2) 174–178.
Zahavi, J., N. Levin. 1997. Applying neural computing to target marketing. J. Direct Marketing 11(4) 76–93.