LEARNING BAYESIAN NETWORKS FOR SOLVING REAL-WORLD PROBLEMS

MONINDER SINGH

A DISSERTATION in COMPUTER AND INFORMATION SCIENCE

Presented to the Faculties of the University of Pennsylvania in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy. 1998

Gregory M. Provan Supervisor of Dissertation

Mark Steedman Graduate Group Chairperson

© Copyright 1998 by Moninder Singh

To My Parents and Smita

Acknowledgements

First and foremost, I would like to express my sincere gratitude to my advisor, Dr. Gregory Provan, and to Professor Bonnie Webber, for their invaluable guidance and support throughout the course of this research. Dr. Provan was responsible for inspiring me to work on the problems of this dissertation, and I owe much of the content of this document to my discussions with him. I am also very grateful to Professor Webber for her patience, support and understanding in helping me balance my personal life with my research during the last three years. I have especially enjoyed the freedom and constant encouragement given by them throughout my stay at Penn. It has been a pleasure working with them both.

I would also like to express my deep appreciation to the other members of my committee, Professors John Clarke, M.D., Russell Greiner and Lyle Ungar, for their valuable time, helpful discussions and suggestions. This research has benefited a lot from their comments and suggestions. I am especially thankful to Dr. Clarke for all his help with my work in the abdominal pain domain. My research has also benefited a lot from frequent discussions with Dr. David Heckerman, for which I am very grateful.

I am also very thankful to Dr. Leora Morgenstern for her constant support and encouragement. I am also very grateful to her, and the IBM T.J. Watson Research Center, for funding my dissertation research for three years via an IBM Cooperative Fellowship. I am also very thankful to Mr. Mike Felker for all his help during the last five years, and for going out of his way to make my stay here as smooth as possible. I also wish to thank the faculty, staff and students in the Computer and Information Science department, especially the graduate group chairperson, Professor Mark Steedman, for providing a wonderful and supportive environment for learning and intellectual growth.

Special thanks are due to the members of the TraumAID group, especially Jonathan Kaye, Abigail Gertner, Lola Ogunyemi, Richard Washington, Diane Chi and Beverly Spejewski, for their comments and critiques on earlier parts of this research. I am also very grateful for the support and friendship of my colleagues, especially Sonu Chopra, Rama Bindiganavale, Libby Levison, Rebecca Mercuri, Lloyd Greenwald and Anoop Sarkar.

On the personal side, I am most indebted to my wife, Smita, without whose love, support and understanding I could not have finished this work. I am also greatly indebted to my parents, who instilled in me a love for learning, provided me the love and support I needed, and then sent me across the world so that my dreams could become a reality. The path to my Ph.D. was long and hard, and at times I felt that I would not make it. Smita, my parents and our families were pillars of support in those times, and their love and encouragement helped me move forward and finally achieve my goals.

I am also thankful to Drs. Sven Vestergaard and Frank Jensen of HUGIN Expert A/S, Denmark, for providing me the HUGIN system free of charge for my dissertation research. Thanks are also due to Dr. Wray Buntine and the RIACS/NASA Ames Research Center for providing me the IND package for this work. This research was supported in part by an IBM Cooperative Fellowship, NSF grant #IRI92-10030, and NLM grant #BLR 3 RO1 LM0 5217-02S1.


ABSTRACT

LEARNING BAYESIAN NETWORKS FOR SOLVING REAL-WORLD PROBLEMS

Moninder Singh
Supervisor: Gregory M. Provan

Bayesian networks, which provide a compact graphical way to express complex probabilistic relationships among several random variables, are rapidly becoming the tool of choice for dealing with uncertainty in knowledge-based systems. However, approaches based on Bayesian networks have often been dismissed as unfit for many real-world applications, since probabilistic inference is intractable for most problems of realistic size, and algorithms for learning Bayesian networks impose the unrealistic requirement that datasets be complete. In this thesis, I present practical solutions to these two problems, and demonstrate their effectiveness on several real-world problems.

The solution proposed to the first problem is to learn selective Bayesian networks, i.e., ones that use only a subset of the given attributes to model a domain. The aim is to learn networks that are smaller, and hence computationally simpler to evaluate, but retain the performance of networks induced using all attributes. I present two methods for inducing selective Bayesian networks from data and evaluate them on several different problems. Both methods are shown to induce selective networks that are not only significantly smaller and computationally simpler to evaluate, but also perform as well as, or better than, networks using all attributes.

To address the second problem, I propose a principled method, based on the EM algorithm, for learning both Bayesian network structure and probabilities from incomplete data, and evaluate its performance on several datasets with different amounts of missing data and different assumptions about the missing-data mechanisms. The proposed algorithm is shown to induce Bayesian networks that are very close to the actual underlying model.

Finally, I apply both methods to the task of diagnosing acute abdominal pain. Known to be a very difficult domain, this is a high-dimensional problem characterized by a large number of attributes and missing data. Several researchers have argued that the simplest Bayesian network, the naive Bayesian classifier, is optimal for this problem. My experiments on two datasets in this domain show that not only do selective Bayesian networks use only a small fraction of the attributes, but they also significantly outperform other methods, including the naive Bayesian classifier.


Contents

Acknowledgements

Abstract

1 Introduction
  1.1 Thesis Statement
  1.2 Thesis Organization

2 Bayesian Networks
  2.1 Representation
  2.2 Inference Using Bayesian Networks
  2.3 Induction of Bayesian Networks
  2.4 Advantages of Bayesian Networks Over Other Representations

3 Selective Bayesian Networks: Theory
  3.1 Reducing the Complexity of Inference Using Bayesian Networks
    3.1.1 Restricting Network Topology to Reduce Inference Complexity
    3.1.2 Using Feature Selection to Reduce Inference Complexity
  3.2 The Selective Bayesian Network
  3.3 A Wrapper Method for Learning Selective Bayesian Networks
    3.3.1 The K2-AS Algorithm
    3.3.2 Properties of the K2-AS Algorithm
    3.3.3 Complexity of the Attribute Selection Phase
  3.4 Information-Theoretic Attribute Selection Approach
    3.4.1 Attribute Selection Metrics
    3.4.2 The Info-AS Algorithm
    3.4.3 Properties of the Info-AS Algorithm
    3.4.4 Complexity of the Attribute-Selection Phase
  3.5 Summary

4 Selective Bayesian Networks: Evaluation
  4.1 Objectives
  4.2 Description of the Datasets Used
  4.3 Experimental Methodology
  4.4 Experimental Design
  4.5 Experimental Results
    4.5.1 Selective versus Non-selective Bayesian Networks
    4.5.2 Selective Bayesian Networks versus Naive Bayesian Classifiers
    4.5.3 Selective Bayesian Networks versus Decision Trees
  4.6 Discussion
    4.6.1 Selective versus Non-selective Bayesian Networks
    4.6.2 Selective Bayesian Networks versus Naive Bayesian Classifiers
  4.7 Summary

5 Learning Bayesian Networks From Incomplete Data: Theory
  5.1 Introduction
  5.2 Mechanism Leading to Missing Data
  5.3 Methods for Handling Missing Data
    5.3.1 Complete Case Analysis
    5.3.2 Treating Missing Values as Just Another Value
    5.3.3 Imputation Procedures
    5.3.4 Model-based Procedures
  5.4 Related Work
    5.4.1 Learning Bayesian Network Parameters
    5.4.2 Learning Bayesian Network Structure and Parameters
  5.5 Learning Bayesian Network Structure and Parameters from Incomplete Data
  5.6 Summary

6 Learning Bayesian Networks From Incomplete Data: Evaluation
  6.1 Objectives
  6.2 Description of the Dataset Used
  6.3 Experimental Methodology
  6.4 Results and Discussion
  6.5 Summary

7 Diagnosing Acute Abdominal Pain: a Case Study
  7.1 The Acute Abdominal Pain Domain
  7.2 Objectives
  7.3 Description of the Datasets Used
  7.4 Experimental Methodology
    7.4.1 Learning from Complete Data
    7.4.2 Learning from Incomplete Data
  7.5 Experimental Design
  7.6 Experimental Results
  7.7 Discussion
  7.8 Summary

8 Conclusion
  8.1 Summary of Contributions
    8.1.1 Reducing the Inference Complexity of Bayesian Networks
    8.1.2 Learning Bayesian Networks from Incomplete Data
    8.1.3 Diagnosing Acute Abdominal Pain
  8.2 Extensions and Directions for Future Research

Bibliography

List of Tables

4.1 Description of the Databases Used in the Study
4.2 Predictive Accuracies of the Various Algorithms
7.1 Accuracies of Different Methods on the Abdominal Datasets with No Missing Values
7.2 Accuracies of Different Methods on the Abdominal Datasets in the Presence of Missing Values
7.3 Predictive Values and Likelihood Ratios for the T&S Dataset
7.4 Predictive Values and Likelihood Ratios for the ECCA Dataset
7.5 Sensitivities and Specificities for the T&S Dataset
7.6 Sensitivities and Specificities for the ECCA Dataset
7.7 Discriminant Matrix for the Non-selective Bayesian Network Induced from the T&S Dataset
7.8 Discriminant Matrix for the Selective Bayesian Network Induced from the T&S Dataset
7.9 Discriminant Matrix for the Naive Bayesian Classifier Induced from the T&S Dataset
7.10 Discriminant Matrix for the Selective Bayesian Network Induced from the T&S Dataset with Missing Values
7.11 Discriminant Matrix for the Non-selective Bayesian Network Induced from the ECCA Dataset
7.12 Discriminant Matrix for the Selective Bayesian Network Induced from the ECCA Dataset
7.13 Discriminant Matrix for the Naive Bayesian Classifier Induced from the ECCA Dataset
7.14 Discriminant Matrix for the Selective Bayesian Network Induced from the ECCA Dataset with Missing Values
7.15 Discriminant Matrix for the Selective Bayesian Network Induced from the ECCA Dataset with Missing Values using Previously Selected Attributes

List of Figures

2.1 Induction of Bayesian networks using the K2 Algorithm
3.1 (a) A Typical Bayesian Network; (b) A Naive Representation for the Same Bayesian Network
3.2 The K2-AS Algorithm
3.3 The Info-AS Algorithm
4.1 Problem Dimensionality
4.2 Partitioning of Databases for Induction During Each Cross-Validation Run
4.3 K2-AS versus CB: Reduction in Number of Attributes
4.4 K2-AS versus CB: Reduction in Number of Network Arcs
4.5 K2-AS versus CB: Reduction in Number of Cliques
4.6 K2-AS versus CB: Reduction in Maximum Clique Size
4.7 K2-AS versus CB: Reduction in Number of Parameters
4.8 K2-AS versus CB: Reduction in Compilation Time
4.9 K2-AS versus CB: Reduction in Inference Time
4.10 Scatter Plots Comparing the Error Rates of (a) K2-AS with CB, and (b) K2-AS< with CB
4.11 Difference in Accuracies of K2-AS and CB on the Various Datasets
4.12 Difference in Accuracies of K2-AS< and CB on the Various Datasets
4.13 Info-AS versus CB: Reduction in Number of Attributes
4.14 Info-AS versus CB: Reduction in Number of Network Arcs
4.15 Info-AS versus CB: Reduction in Number of Cliques
4.16 Info-AS versus CB: Reduction in Maximum Clique Size
4.17 Info-AS versus CB: Reduction in Number of Parameters
4.18 Info-AS versus CB: Reduction in Compilation Time
4.19 Info-AS versus CB: Reduction in Inference Time
4.20 Scatter Plots Comparing the Error Rates of CDC with CB
4.21 Scatter Plots Comparing the Error Rates of (a) CGR with CB, and (b) CIG with CB
4.22 Difference in Accuracies of CDC and CB on the Various Datasets
4.23 Difference in Accuracies of CGR and CB on the Various Datasets
4.24 Difference in Accuracies of CIG and CB on the Various Datasets
4.25 Scatter Plots Comparing the Error Rates of (a) K2-AS with naiveAS, and (b) K2-AS with naiveALL
4.26 Difference in Accuracies of K2-AS and naiveAS on the Various Datasets
4.27 Difference in Accuracies of K2-AS and naiveALL on the Various Datasets
4.28 Scatter Plots Comparing the Error Rates of (a) K2-AS< with naiveAS, and (b) K2-AS< with naiveALL
4.29 Difference in Accuracies of K2-AS< and naiveAS on the Various Datasets
4.30 Difference in Accuracies of K2-AS< and naiveALL on the Various Datasets
4.31 Scatter Plots Comparing the Error Rates of (a) CDC with naiveAS, and (b) CDC with naiveALL
4.32 Difference in Accuracies of CDC and naiveAS on the Various Datasets
4.33 Difference in Accuracies of CDC and naiveALL on the Various Datasets
4.34 Scatter Plots Comparing the Error Rates of (a) CGR with naiveAS, and (b) CGR with naiveALL
4.35 Difference in Accuracies of CGR and naiveAS on the Various Datasets
4.36 Difference in Accuracies of CGR and naiveALL on the Various Datasets
4.37 Scatter Plots Comparing the Error Rates of (a) CIG with naiveAS, and (b) CIG with naiveALL
4.38 Difference in Accuracies of CIG and naiveAS on the Various Datasets
4.39 Difference in Accuracies of CIG and naiveALL on the Various Datasets
4.40 Learning Curves for the Chess Database
4.41 Learning Curves for the Mushroom Database
4.42 Learning Curves for the Soybean Database
4.43 Learning Curves for the Voting Database
4.44 Scatter Plots Comparing the Error Rates of (a) K2-AS with C4.5, and (b) CDC with C4.5
4.45 Feature Selection Times for K2-AS, K2-AS< and CDC
4.46 Scatter Plots Comparing the Error Rates of (a) K2-AS with CDC, and (b) K2-AS< with CDC
6.1 Learning From a 10,000 Case Dataset With 20% MCAR Data Using 1 Imputation: (a) Cross-entropy (b) Log-likelihood
6.2 Learning From a 10,000 Case Dataset With 40% MCAR Data Using 2 Imputations: (a) Cross-entropy (b) Log-likelihood
6.3 Learning From a 10,000 Case Dataset With 20% MAR Data Using 4 Imputations: (a) Cross-entropy (b) Log-likelihood
6.4 Learning From a 10,000 Case Dataset With 40% MAR Data Using 4 Imputations: (a) Cross-entropy (b) Log-likelihood
6.5 Learning From a 1,000 Case Dataset With 20% MAR Data: (a) Cross-entropy (b) Log-likelihood
6.6 Learning From a 1,000 Case Dataset With 40% MAR Data: (a) Cross-entropy (b) Log-likelihood
6.7 Learning From a 100 Case Dataset With 20% MCAR Data: (a) Cross-entropy (b) Log-likelihood
6.8 Learning From a 100 Case Dataset With 40% MAR Data: (a) Cross-entropy (b) Log-likelihood

Chapter 1

Introduction

Over the course of the past few years, there has been widespread interest in probabilistic and decision modeling, spurred by significant advances in representation and computation with graphical modeling formalisms. Probabilistic graphical models offer a unified qualitative and quantitative framework for representing and reasoning with probabilities and independencies. Bayesian networks (Pearl, 1988; Neapolitan, 1990), which provide a compact graphical way to express complex probabilistic relationships among several random variables, are rapidly becoming the tool of choice for dealing with uncertainty in knowledge-based systems.

Bayesian networks are an important representation because they can encode joint probability distributions, in addition to precisely defining the conditional distributions that constitute the joint distribution by means of an underlying graphical structure. It is primarily this graphical nature that has made Bayesian networks so popular, as it encodes the causal structure of the domain being modeled (Heckerman and Shachter, 1995; Pearl, 1988). Moreover, in terms of learning, one can use this structure to explicitly encode prior knowledge, a feature absent in many learning frameworks. Not only have Bayesian networks been shown to perform competitively with other representations such as decision trees and neural networks, they also offer some attractive features not offered by other induction methods. Among the many advantages offered by Bayesian networks are their comprehensibility to humans, their effectiveness as complex decision-making models, and the ability to elicit informative prior distributions, to name a few. Due to these attractive features, Bayesian networks are being increasingly used in various real-world applications, including medical diagnosis (Suermondt and Amylon, 1989; Andreassen et al., 1987), telecommunications (Ezawa et al., 1996; Ezawa and Norton, 1995), information retrieval (Fung and Favero, 1995), system troubleshooting (Heckerman et al., 1994), vision (Levitt et al., 1989) and language understanding (Charniak and Goldman, 1989).

Although substantial advances have been made both in the development of the theory and in applications of Bayesian networks, approaches based on Bayesian networks have often been dismissed as unfit for many real-world applications. Two of the most important reasons are that probabilistic inference is intractable for most problems of realistic size, and that algorithms for learning Bayesian networks are unable to effectively deal with missing data. In this thesis, I present practical solutions to these two problems, and demonstrate their effectiveness on several real-world datasets. The two problems, as well as my solutions, are described in the following section.

1.1 Thesis Statement

As pointed out earlier, one of the major factors limiting the use of Bayesian networks is the complexity of probabilistic inference. Inference using Bayesian networks is in general NP-hard (Cooper, 1990). A number of researchers have tried to get around this obstacle by looking for approximate solutions to inference in such networks (Draper, 1994; Henrion, 1991; Poole, 1993). However, as Dagum and Luby (1993) have shown, even the task of approximating probabilistic inference using Bayesian networks is NP-hard. In practice, however, network topology (especially maximum clique size) and the number of attributes are two of the most significant parameters that govern inference complexity.

One of the main objectives of my research is to address the issue of the intractability of probabilistic inference. I tackle this problem by learning selective Bayesian networks, i.e., Bayesian networks that use only a subset of the available attributes to model a domain. The aim is to learn networks that are smaller, and thus generally computationally simpler to evaluate, by discarding attributes that are irrelevant, redundant, or have too weak an influence on the attributes of interest to be of significant consequence. At the same time, all dependencies and independencies are modeled amongst the attributes that are retained in the network, to yield as accurate a representation of the domain as possible.

I present two methods for inducing selective Bayesian networks from data. The first method selects a subset of attributes so as to maximize the classification accuracy of the resultant model. The idea behind this approach is that attributes which have little or no influence on the accuracy of learned networks can be discarded without significantly affecting their performance. The second method selects attributes based upon the amount of information they provide about the variables of interest. This method is based on the hypothesis that attributes which give us little or no information about the variables of interest, given the other attributes in the network, can be eliminated without significantly affecting performance. To demonstrate the validity and usefulness of these methods, I have applied them to several problems taken from the University of California, Irvine's repository of machine learning (ML) databases.
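The information-based idea can be sketched concretely. The following is only an illustrative ranking of attributes by their (unconditional) mutual information with the class variable, not the Info-AS algorithm itself, which uses conditional information metrics defined in Chapter 3; the threshold value and the toy data are made-up assumptions:

```python
from collections import Counter
from math import log2

def mutual_information(xs, cs):
    """Estimate I(X; C) in bits from paired samples of an attribute and the class."""
    n = len(xs)
    px, pc, pxc = Counter(xs), Counter(cs), Counter(zip(xs, cs))
    return sum((k / n) * log2((k / n) / ((px[x] / n) * (pc[c] / n)))
               for (x, c), k in pxc.items())

def select_attributes(data, classes, threshold=0.01):
    """Keep attributes whose mutual information with the class exceeds the
    threshold, ranked from most to least informative."""
    scores = {a: mutual_information(col, classes) for a, col in data.items()}
    return [a for a, s in sorted(scores.items(), key=lambda t: -t[1]) if s > threshold]

# Toy dataset: 'relevant' determines the class; 'noise' carries no information.
data = {"relevant": [0, 0, 1, 1], "noise": [1, 1, 1, 1]}
classes = ["a", "a", "b", "b"]
print(select_attributes(data, classes))  # ['relevant']
```

The attributes surviving such a cut would then be handed to a structure-learning algorithm, so that the network is built over the reduced attribute set only.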
Selective Bayesian networks learned by both methods are shown to be significantly smaller and computationally simpler to evaluate relative to networks learned using all attributes. Moreover, they are also shown to retain, and sometimes improve upon, the performance of Bayesian networks learned using all the attributes. Both methods have also been shown to significantly outperform the naive Bayesian classifier, one of the most widely studied Bayesian methods within the machine learning community, as well as to display accuracy comparable to decision trees learned by C4.5. In addition, I identify properties of the kinds of problems for which selective Bayesian networks will be most useful, as opposed to other representations.

The second problem I address is the task of learning Bayesian networks from incomplete data. Handling missing data is a very difficult and largely unsolved problem that has been widely studied in statistics for over 20 years, generally with regard to the interpretation of survey data. Most methods for learning Bayesian networks assume that the data from which the network is to be learned is complete. In situations where this is not the case, as in most real-world problems, the data is often made complete by filling in values using a variety of, often ad hoc, methods. Although some work has been done on developing methods for learning network parameters (conditional probabilities) assuming that the network structure is known, the more general, and more practical, task of learning both network structure and parameters from incomplete data has not been fully explored. The few techniques developed in this regard make the highly restrictive, and impractical, assumption that values are missing randomly, independent of the state of other attributes. In practice, however, values are often missing based on the values of other attributes. For example, a CAT scan is generally not performed if the patient is in shock. Thus, the attribute representing the results of a CAT scan will have a missing value based on the value of another attribute that denotes whether the patient is in shock or not. I present a principled approach to learning both the Bayesian network structure and the conditional probabilities from incomplete data. This method correctly handles both types of missing data mentioned above.
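The two missing-data mechanisms can be illustrated by how one would simulate them on a complete dataset. The attribute names (`shock`, `cat_scan`) echo the example above, but these masking routines are only a sketch for illustration, not part of the learning algorithm developed in Chapter 5:

```python
import random

def mask_mcar(records, attr, p, rng):
    """Missing completely at random (MCAR): each value of `attr` is deleted
    with fixed probability p, independent of everything else in the record."""
    for r in records:
        if rng.random() < p:
            r[attr] = None

def mask_mar(records, attr, cond_attr, cond_value):
    """Missing at random (MAR): `attr` is deleted whenever another observed
    attribute takes a given value -- e.g. no CAT scan result is recorded
    when the patient is in shock."""
    for r in records:
        if r[cond_attr] == cond_value:
            r[attr] = None

records = [{"shock": s, "cat_scan": "normal"} for s in ("yes", "no", "yes", "no")]
mask_mar(records, "cat_scan", cond_attr="shock", cond_value="yes")
print([r["cat_scan"] for r in records])  # [None, 'normal', None, 'normal']
```

Under MCAR the pattern of missingness tells us nothing; under MAR it depends on observed attributes, which is exactly the case a principled learner must model rather than ignore.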
Experiments carried out on data generated from a large Bayesian network, with varying amounts of missing data and different assumptions about the missing-data mechanism, show that the learned distribution (represented by the induced network) is much closer to the "true" distribution than the distribution learned by commonly used ad hoc methods of handling missing data. Moreover, the resultant networks are almost as good as the best model that could possibly be learned from the available information in the incomplete dataset.

Finally, I combine the proposed algorithms to learn selective Bayesian networks from incomplete data for the task of diagnosing acute abdominal pain. Known to be a very difficult domain, this is a very high-dimensional problem characterized by a large number of attributes and missing data, yielding little more than 60% predictive accuracy for most human and machine diagnosticians. In fact, several researchers have argued that the simplest Bayesian network, the naive Bayesian classifier, is optimal for this problem. My experiments on two datasets in this domain show that not only do selective Bayesian networks, learned using the incomplete-data algorithms I have developed, use only a small fraction of the attributes, but they also significantly outperform other methods, including the naive Bayesian classifier.
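The closeness of a learned distribution to the true one is measured in Chapter 6 by cross-entropy. A minimal sketch of that kind of comparison, written here as the Kullback-Leibler divergence between explicit joint distributions (the probability values are made up purely for illustration):

```python
from math import log2

def kl_divergence(p, q):
    """D(p || q) in bits, for distributions given as dicts over the same finite
    set of joint states; assumes q[x] > 0 wherever p[x] > 0."""
    return sum(px * log2(px / q[x]) for x, px in p.items() if px > 0)

true_dist    = {"00": 0.4, "01": 0.1, "10": 0.1, "11": 0.4}
learned_good = {"00": 0.38, "01": 0.12, "10": 0.12, "11": 0.38}
learned_bad  = {"00": 0.25, "01": 0.25, "10": 0.25, "11": 0.25}

# A learned network that tracks the true dependencies scores much lower
# (i.e. is closer to the true distribution) than one that ignores them.
print(kl_divergence(true_dist, learned_good) < kl_divergence(true_dist, learned_bad))  # True
```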

1.2 Thesis Organization

In Chapter 2, I present background material on Bayesian networks, and discuss, among other things, various algorithms for learning, and performing inference with, Bayesian networks, as well as the advantages offered by Bayesian networks over traditional representations. Chapter 3 describes selective Bayesian networks, and presents two algorithms for learning such networks from data. The performance of selective Bayesian networks is evaluated in Chapter 4 on a variety of datasets, and is compared to the performance of Bayesian networks learned using all the attributes, as well as to naive Bayesian classifiers and decision trees.

In Chapter 5, I discuss the missing data problem, and describe a principled approach, based on the EM algorithm, for learning both the structure and parameters of Bayesian networks from data with missing values. I present experimental results in Chapter 6 showing that the new algorithm learns Bayesian networks from incomplete data that are very close to the best models that could possibly be learned from the incomplete data. In Chapter 7, I describe the application of the above-mentioned methods to the task of diagnosing acute abdominal pain, and present experimental results comparing the performance of the models learned using these methods with other techniques, such as naive classifiers, on two different datasets. Finally, I summarize my contributions, and discuss avenues of future research, in Chapter 8.


Chapter 2

Bayesian Networks

Bayesian networks combine a graphical structure (with nodes representing the domain variables, and edges representing probabilistic dependencies between them) with associated conditional probabilities to give a rich, explicit representation of the various conditional independence and dependence relationships between the variables. The local probability distributions associated with each variable, along with the set of conditional independence assertions represented in the network, can be directly combined to construct the joint probability distribution of the variables in the network. This represents extensive savings, both in the representation of the joint probability distribution and in the computation of posterior probabilities of the variables of interest, given some evidence.

I first formalize the representation of a Bayesian network, and mention some of its salient properties, in Section 2.1. I then discuss, in Section 2.2, probabilistic inference using Bayesian networks, followed by several methods for the induction of Bayesian networks from data (Section 2.3). Finally, in Section 2.4, I discuss some of the main advantages offered by Bayesian networks over other representations, such as decision trees and neural networks, and briefly describe some real-world applications of Bayesian networks.
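How the local distributions combine into the joint can be sketched on a toy network; the three-variable structure (A with children B and C) and all the probability values below are hypothetical, chosen only to illustrate the factorization:

```python
# Each node stores P(node | parents) keyed by the tuple of parent values.
p_a = {(): {0: 0.6, 1: 0.4}}                             # A has no parents
p_b = {(0,): {0: 0.9, 1: 0.1}, (1,): {0: 0.2, 1: 0.8}}   # P(B | A)
p_c = {(0,): {0: 0.7, 1: 0.3}, (1,): {0: 0.5, 1: 0.5}}   # P(C | A)

def joint(a, b, c):
    """P(A=a, B=b, C=c) = P(a) * P(b | a) * P(c | a): the product of each
    node's local distribution given the values of its parents."""
    return p_a[()][a] * p_b[(a,)][b] * p_c[(a,)][c]

# The factored joint is a proper distribution: it sums to 1 over all 8 states.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
print(round(total, 10))  # 1.0
```

Only 10 numbers are stored here instead of the 7 independent entries of a full 2×2×2 joint table; the savings grow dramatically as the number of variables and the sparsity of the graph increase.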

2.1 Representation

Formally, a Bayesian network consists of a qualitative network structure G and a quantitative probability distribution Θ over the network structure. The qualitative network structure G(N, A) consists of a directed acyclic graph (DAG) of nodes N and arcs A, where A ⊆ N × N. Each node i corresponds to a discrete random variable Zi with finite domain. The Bayesian network then represents the joint probability distribution P(Z) = P(Z1, Z2, ..., Zn).

Arcs in a Bayesian network represent the dependence relationships among the variables. An arc into node i from node j represents probabilistic dependence of Zi on Zj, and is precisely specified using the notion of parents of a node. The parents of Zi, pa(Zi), are the direct predecessors of Zi in G. A Bayesian network encodes the set of conditional independence assertions that render each variable Zi conditionally independent of its non-descendants (in G), given the state of its parents, pa(Zi), in the network. This notion of conditional independence is implied in the network by the absence of arcs. Thus, the absence of an arc from node i to node j indicates that variable Zj is conditionally independent of variable Zi given pa(Zj). Other conditional independencies follow from these, and can be efficiently determined from the network structure using a simple graph-theoretic criterion (Neapolitan, 1990; Pearl, 1988).

The quantitative parameter set Θ consists of the conditional probability distributions P(Zi | pa(Zi)) necessary to define the joint distribution P(Z1, Z2, ..., Zn). Without loss of generality, assume that the variables Z1, Z2, ..., Zn are ordered such that, in G, pa(Zi) ⊆ {Z1, Z2, ..., Z_{i-1}}. Using the chain rule of probability, the joint probability distribution P(Z) can be represented as follows:

    P(Z1, Z2, ..., Zn) = ∏_{i=1}^{n} P(Zi | Z1, Z2, ..., Z_{i-1}).    (2.1)

Note that the structure G unambiguously de nes the parameter set  which is necessary 8

to specify the joint distribution P (Z1 ; Z2; : : :; Zn), since

P (Zi jZ1; Z2; : : :; Zi?1) = P (Zi jpa(Zi)):

(2:2)

This follows directly from the encoding of conditional independence statements in the Bayesian network, as described above. Thus, this unique joint distribution can be written as

    P(Z1, Z2, ..., Zn) = ∏_{i=1}^{n} P(Zi | pa(Zi)).    (2.3)
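To make the factorization in equation 2.3 concrete, the sketch below evaluates the joint probability of a complete assignment as the product of local conditional probabilities. The network, variable names and CPT entries are invented for illustration; they do not come from this dissertation.

```python
# Illustrative sketch: computing a joint probability via equation 2.3.
# A tiny hypothetical network: Rain -> WetGrass <- Sprinkler.
parents = {"Rain": [], "Sprinkler": [], "WetGrass": ["Rain", "Sprinkler"]}

# CPTs: map (value, tuple of parent values) -> P(Zi = value | pa(Zi)).
cpt = {
    "Rain":      {(True, ()): 0.2, (False, ()): 0.8},
    "Sprinkler": {(True, ()): 0.1, (False, ()): 0.9},
    "WetGrass":  {(True,  (True, True)): 0.99, (False, (True, True)): 0.01,
                  (True,  (True, False)): 0.9, (False, (True, False)): 0.1,
                  (True,  (False, True)): 0.8, (False, (False, True)): 0.2,
                  (True,  (False, False)): 0.0, (False, (False, False)): 1.0},
}

def joint_probability(assignment):
    """P(Z1,...,Zn) = product over i of P(Zi | pa(Zi)), as in equation 2.3."""
    p = 1.0
    for var, value in assignment.items():
        parent_values = tuple(assignment[pa] for pa in parents[var])
        p *= cpt[var][(value, parent_values)]
    return p

print(round(joint_probability({"Rain": True, "Sprinkler": False, "WetGrass": True}), 6))
# prints 0.162, i.e. 0.2 * 0.9 * 0.9
```

Storing only the local distributions suffices to reconstruct any entry of the full joint table, which is the representational saving described above.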

2.2 Inference Using Bayesian Networks

The performance task in Bayesian networks, in general, involves arbitrary query answering, i.e. given some evidence about a subset of the domain variables, one is interested in computing the posterior probabilities of some other variables of interest. Thus, given a set of attributes Z1 ⊆ Z which has been instantiated to a tuple of values z1, the task is then to compute the posterior probability distribution P(Z2 | Z1 = z1, G, Θ)¹ of a set of variables Z2 given the evidence z1.

Of special interest is the use of Bayesian networks as classifiers, where one is interested in predictions about a special target variable (the class variable). The performance task then consists of classifying instances. The classification process involves a class variable C that can take on values c1, c2, ..., cm, and a feature vector Z of n features that can take on a tuple of values denoted by {z1, z2, ..., zn}. Given a case Z represented by an instantiation {z1, z2, ..., zn} of feature values, the classification task is to determine the class value ci that Z falls into.² The performance of the network is measured on some set of test cases in terms of the classification accuracy, i.e. the percentage of test cases for which it predicts the class correctly.

¹ Henceforth, I do not explicitly mention G and Θ in probability functions unless necessary for clarity.
² For simplicity of exposition, we will restrict our discussion to domains with only discrete variables.


A large number of algorithms have been designed for performing the task of computing probabilities of interest (probabilistic inference) in Bayesian networks. These methods directly manipulate the network probabilities to perform inference, thereby eliminating the need for explicit reconstruction of the underlying joint probability space. Exact methods proposed for inference on Bayesian networks have used various techniques such as arc reversals (Shachter, 1988), message passing (Pearl, 1986), trees of cliques (Lauritzen and Spiegelhalter, 1988; Jensen et al., 1990) and symbolic manipulations of sums and products (D'Ambrosio, 1991). However, exact probabilistic inference using arbitrary Bayesian networks is NP-hard (Cooper, 1990). Moreover, even if approximate algorithms are used, e.g. stochastic simulation methods (Shachter and Peot, 1990; Chavez and Cooper, 1990; Pearl, 1987), inference is still NP-hard (Dagum and Luby, 1993). Nevertheless, for some forms of inference queries, efficient inference techniques have been specially developed for problems where the above-mentioned techniques are computationally infeasible (Darwiche and Provan, 1995; Shachter et al., 1990). Pearl (1988) and Neapolitan (1990) review some of the commonly used techniques for Bayesian network inference.
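As a point of contrast with these specialized algorithms, the toy sketch below computes a posterior by brute-force enumeration of the joint distribution: exactly the exponential computation that the exact and approximate methods cited above are designed to avoid. The network and all numbers are invented for illustration.

```python
# Illustrative brute-force posterior computation for a toy binary network
# (Rain -> WetGrass <- Sprinkler); practical algorithms avoid this exponential sum.
from itertools import product

parents = {"Rain": [], "Sprinkler": [], "WetGrass": ["Rain", "Sprinkler"]}
# P(variable = True | parent values); all numbers are invented.
p_true = {
    "Rain":      lambda pv: 0.2,
    "Sprinkler": lambda pv: 0.1,
    "WetGrass":  lambda pv: {(True, True): 0.99, (True, False): 0.9,
                             (False, True): 0.8, (False, False): 0.0}[pv],
}
order = ["Rain", "Sprinkler", "WetGrass"]

def joint(assign):
    p = 1.0
    for v in order:
        pv = tuple(assign[pa] for pa in parents[v])
        pt = p_true[v](pv)
        p *= pt if assign[v] else 1.0 - pt
    return p

def posterior(query_var, evidence):
    """P(query_var = True | evidence), by summing the joint over all completions."""
    totals = {True: 0.0, False: 0.0}
    for values in product([True, False], repeat=len(order)):
        assign = dict(zip(order, values))
        if all(assign[k] == v for k, v in evidence.items()):
            totals[assign[query_var]] += joint(assign)
    return totals[True] / (totals[True] + totals[False])

print(round(posterior("Rain", {"WetGrass": True}), 3))
```

With these toy numbers, observing wet grass raises the probability of rain from 0.2 to roughly 0.74; the enumeration visits all 2^n assignments, which is why the specialized methods above matter.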

2.3 Induction of Bayesian Networks

In this section, I briefly describe some of the common methods for learning Bayesian classifiers from data. Most of the earlier work on learning Bayesian networks was based on extensive testing of conditional independence relations (Verma and Pearl, 1992; Spirtes and Glymour, 1991; Pearl and Verma, 1991; Spirtes et al., 1990). Not only were these approaches computationally expensive and often unreliable, but they also made strong, often unrealistic, assumptions about the underlying probability distribution.

Much of the current research deals with learning networks that best fit the data and the prior knowledge. Such approaches consist of two components: a scoring metric to measure the goodness-of-fit of a network with respect to the data, and a search algorithm to search through the space of possible structures so as to find one with a high score. Methods based on Bayesian scoring metrics, which attempt to induce the most probable network given the data, have been studied by Cooper and Herskovits (1992), Singh and Valtorta (1995), Heckerman et al. (1995), Spiegelhalter et al. (1993) and Buntine (1991), among others. Another commonly used scoring metric is the Minimum Description Length (MDL); such algorithms attempt to find the network that minimizes the encoding of the model plus the encoding of the data. MDL-based approaches have been developed by Lam and Bacchus (1994), Bouckaert (1993) and Suzuki (1996). Whereas all the above methods concentrate on the task of learning the "single" best network (model selection), researchers have also looked at handling model uncertainty by identifying a small set of networks and averaging over them (selective model averaging). Such approaches have been shown to yield better results (Madigan and Raftery, 1993; Madigan et al., 1993) than when only one model is used; however, they are much more computationally expensive.

Of the above-mentioned methods, one of the most widely used algorithms is a Bayesian method, K2, proposed by Cooper and Herskovits (1992). The selective Bayesian network induction algorithms proposed in Chapter 3 use CB (Singh and Valtorta, 1995), a variant of the K2 algorithm, for evaluating the performance of selective Bayesian networks. As such, I describe these two algorithms in detail in the next few paragraphs.³

³ We use the nomenclature used by Herskovits and Cooper (1992). Heckerman (1995) reviews various methods for the induction of Bayesian networks.

The K2 algorithm tries to find the most probable Bayesian network structure, given the data, by carrying out a greedy search through the space of all network structures. Given the space of possible network structures, the single most likely network is given by


    BS_max = argmax_{BS} [P(BS | D)],

where P(BS | D) is the posterior probability of a network BS given the data D. Cooper and Herskovits (1992) derive a formula for computing the joint probability P(BS, D), which can be used to find the most probable network given the data, since the posterior probability of a network given the data, P(BS | D), is proportional to this joint probability. However, since it is computationally infeasible to search for the most probable network by exhaustively enumerating the space of belief network structures (a space of size exponential in the number of nodes), K2 further reduces the search space by assuming that a total ordering, i = 1, ..., m, is available on the m features and that, a priori, all structures are equally likely. This approach represents each feature with a node in the Bayesian network BS. The feature ordering induces an ordering n1, n2, ..., nm on the corresponding m nodes, thereby constraining the arcs allowable in the network: arcs are allowed to go to a node ni only from the nodes n1, n2, ..., n_{i-1} that precede it in this ordering (the predecessor nodes). Induction takes place as follows: the algorithm takes each successive feature in the ordering, adds it as a node ni in the network, and adds arcs to ni from its predecessors in a greedy fashion: rather than evaluate all subsets of network nodes n1, n2, ..., n_{i-1} as parent nodes, K2 adds the arc to ni from the single node in {n1, n2, ..., n_{i-1}} that most increases the posterior probability of the network structure. New arcs are added sequentially to ni as long as doing so increases the posterior probability of the network given the data. Figure 2.1 summarizes the induction process. Figure 2.1(a) shows the set of nodes (representing attributes) from which the Bayesian network is to be constructed, along with the subnetwork constructed at any given point of the network induction process, while Figure 2.1(b) shows the addition of a node to the network in part (a) along with the set of arcs that can join this new node to nodes already in the network.

Figure 2.1: Induction of Bayesian networks using the K2 Algorithm.

K2's ordering requirement is an unnecessarily strong assumption. In a domain where very little expertise is available, or the number of vertices is fairly large, finding such an ordering may not be feasible. As such, one would like to avoid such a requirement. Singh and Valtorta (1995) proposed a variant of K2, called CB, that does not require a node ordering. The CB algorithm uses conditional independence tests to generate a "good" node ordering from the data, and then uses the K2 algorithm to generate the Bayesian network from the database using this node ordering. Starting with the complete, undirected graph on all variables, the CB algorithm first deletes edges between adjacent nodes that are unconditionally independent (conditional independence tests of order 0). CB orients the edges in the resultant graph and obtains a total ordering on the variables. It passes this ordering to the K2 algorithm to construct the corresponding network. The algorithm then repeats this process by removing edges (from the undirected graph obtained in the previous iteration) between adjacent nodes that are conditionally independent given one node (conditional independence tests of order 1).
CB keeps constructing networks for increasing orders of conditional independence tests as long as the posterior probability of the resultant network, given the data, keeps increasing. Since CB uses the K2 algorithm to generate the Bayesian network from a particular ordering, CB is correct in the same sense as K2 (Singh and Valtorta, 1995). Singh and Valtorta (1995) show the importance of computing a good node ordering, given the n! possible node orderings on n features.
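The greedy parent-selection loop shared by K2 (and reused by CB once an ordering is in hand) can be rendered schematically as follows. This is a structural sketch, not the original implementation: the `score` callback stands in for a node's contribution to the Bayesian metric P(BS, D), and any decomposable scoring function could be substituted.

```python
# Structural sketch of K2-style greedy parent selection (illustrative only).
# `score(node, parents, data)` is a stand-in for the node's contribution to the
# Bayesian scoring metric; a decomposable score is assumed.

def k2_search(node_order, max_parents, score, data):
    """Greedily pick parents for each node from its predecessors in the ordering."""
    network = {}
    for i, node in enumerate(node_order):
        parents = set()
        best = score(node, parents, data)
        candidates = set(node_order[:i])      # only predecessor nodes are allowed
        improved = True
        while improved and candidates and len(parents) < max_parents:
            improved = False
            # evaluate adding each single remaining candidate, keep the best one
            gains = {c: score(node, parents | {c}, data) for c in candidates}
            best_candidate = max(gains, key=gains.get)
            if gains[best_candidate] > best:  # add arcs only while the score rises
                best = gains[best_candidate]
                parents.add(best_candidate)
                candidates.remove(best_candidate)
                improved = True
        network[node] = sorted(parents)
    return network
```

Because each node's parents are chosen independently of later nodes, the search is linear in the number of nodes but can miss the globally optimal structure; this is the trade-off the greedy strategy accepts.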

2.4 Advantages of Bayesian Networks Over Other Representations

Over the past few years, Bayesian networks have become the de facto tool of choice for dealing with uncertainty in knowledge-based systems. One very attractive feature of Bayesian networks is their ability to compactly encode joint probability distributions, in addition to precisely defining the conditional distributions that constitute the joint distribution by means of an underlying graphical structure. The graphical structure of the Bayesian network greatly enhances the understandability of the model, as various relationships between the domain attributes can be simply read off the structure. It is primarily this graphical structure which has made Bayesian networks so popular, as it can encode the causal structure of the domain being modeled. This ease of understandability is one big advantage over representations such as neural networks, which are notoriously difficult, if not impossible, to comprehend.

Another major advantage of Bayesian networks is the ability to incorporate prior knowledge. Prior knowledge, especially in the form of causal information, greatly simplifies the construction of the Bayesian network, and also enhances the understandability of the resultant model. Yet another advantage of Bayesian networks over most traditional approaches is their ability to naturally handle missing values. As explained in Section 2.2, inference can be performed on Bayesian networks to update the posterior probability of any set of variables, given any other set of variables. As such, inference does not require every variable to be instantiated, and hence Bayesian networks can easily handle missing values.

Perhaps the biggest advantage of Bayesian networks over other approaches is the effectiveness of their use as complex decision-making models. When augmented with decision and utility nodes, they can be easily used for decision making, such as deciding the best next action.

Due to all these advantages, there has been a widespread increase in the number of real-world applications of Bayesian networks. Bayesian networks have recently been applied to several domains including medical diagnosis (Suermondt and Amylon, 1989; Andreassen et al., 1987), telecommunications (Ezawa et al., 1996; Ezawa and Norton, 1995), information retrieval (Fung and Favero, 1995), system troubleshooting (Heckerman et al., 1994), vision (Levitt et al., 1989) and language understanding (Charniak and Goldman, 1989). Greiner (1997) lists several more recent applications of Bayesian networks, along with the reasons given by the implementors of each system regarding their choice of Bayesian networks over other techniques.


Chapter 3

Selective Bayesian Networks: Theory

3.1 Reducing the Complexity of Inference Using Bayesian Networks

Although probabilistic inference using Bayesian networks is an NP-hard problem (Cooper, 1990), the number of attributes¹ in the model and the network topology (especially the maximum clique size) are two of the most significant parameters that govern inference complexity in practice. The following sections discuss these issues in more detail.

3.1.1 Restricting Network Topology to Reduce Inference Complexity

One way to reduce the inference complexity is to restrict the network topology by making strong independence assumptions about the domain attributes, thus constraining the types of arcs that may be present in the network. However, although the network can be highly simplified by making varying degrees of independence assumptions, the learned model may be far removed from the true underlying model, leading to a marked deterioration in performance.

¹ I use the terms "attributes" and "features" interchangeably. The term "feature" has been predominantly used in the machine learning literature, whereas the term "attribute" is more common within the Bayesian network literature.

Figure 3.1: (a) A Typical Bayesian Network; (b) A Naive Representation for the Same Bayesian Network.

A great deal of work has been done in learning a special type of Bayesian network that is used solely for the specific task of classification. Such Bayesian networks have the obvious disadvantage that they can be used for answering only specific queries (i.e. estimating the posterior distribution of a special variable, called the class variable, given the values of the other attributes), as opposed to Bayesian networks in general, which can answer arbitrary queries (i.e. finding the posterior distribution of any set of attributes given the values of another set of attributes). Nevertheless, since classification is a very important task in most data analysis settings, such networks have been widely studied within the Machine Learning community.

The simplest form of such a classifier is the naive Bayesian classifier. The naive Bayesian classifier (Langley et al., 1992) assumes that the attributes are conditionally independent given the class variable. Thus, the structure of the network is as shown in Figure 3.1, where each attribute has only the class variable as a parent. The joint distribution is then given by

    P(C, Z1, Z2, ..., Zn) = P(C) ∏_{i=1}^{n} P(Zi | C),    (3.1)

where C is the class variable and Z1, Z2, ..., Zn are the other domain variables. Consequently, inference is almost trivial. To perform this task, we assume that we have the prior probabilities, P(ci), for each value ci of the class variable. Further, we assume that we have the conditional probability distributions for each feature value zj given class value ci, P(zj | ci). Using Bayes' rule, a new case, Z = ∧_j zj (∧ denotes conjunction), can then be classified as:

    P(ci | Z) = P(ci) P(Z | ci) / P(Z) = P(ci) P(∧_j zj | ci) / Σ_k P(∧_j zj | ck) P(ck).    (3.2)

The assumption of the independence of features within each class can then be used to rewrite the denominator of equation 3.2 to give

    P(ci | ∧_j zj) = P(ci) ∏_j P(zj | ci) / Σ_k P(ck) ∏_j P(zj | ck),    (3.3)

which can then be easily calculated from the given probability distributions. As Langley and Sage (1994a) note, the naive Bayesian classifier is simple, inherently robust with respect to noise, and scales well to domains that involve many irrelevant features. Moreover, despite its simplicity and the strong assumption that attributes are independent within each class, the naive classifier has been shown to give remarkably high accuracies in many natural domains (Langley et al., 1992). However, this approach is typically limited to learning classes that can be separated by a single decision boundary (Langley, 1993), and it can suffer in domains in which the features are correlated given the class variable. As such, several methods have been developed to increase the accuracy of such classifiers by relaxing these strong independence assumptions.

The selective naive Bayesian classifier (Langley and Sage, 1994a) is an extension to the naive Bayesian classifier designed to perform better in domains with redundant features. The intuition is that, if highly correlated features are not selected, the classifier should perform better given its feature independence assumptions. Using forward selection of features, this approach uses a greedy search, at each point in the search, to select from the space of all feature subsets the feature which most improves accuracy on the entire training set. Features are added until the addition of any other feature results in reduced accuracy. While the selective naive Bayesian classifier should perform better than the naive classifier in cases where attributes are perfectly correlated, its performance often degrades in situations where attributes are only partially correlated, since elimination of correlated attributes leads to loss of useful information.

Pazzani (1995), on the other hand, attempts to deal with attribute dependencies by merging correlated attributes into a single attribute. As in the case of the selective naive Bayesian classifier, such classifiers also show a rapid deterioration in domains with significant correlations between the various attributes, as this approach too assumes that each group of (merged) attributes is conditionally independent of the other groups of (merged) attributes, thus accounting for only a small fraction of the correlations between all the attributes.

Friedman et al. (1996a) go one step further in relaxing some of the strong independence assumptions made by naive classifiers in order to improve their accuracy in domains with significant correlations. They propose the tree-augmented naive (TAN) Bayesian classifier, which models some of the correlations amongst the domain attributes by allowing each attribute to have one more parent in addition to the class variable. Due to the simple structure, these models retain the relatively small inference complexity of the naive classifiers; at the same time, they often display better accuracy than naive Bayesian classifiers since they attempt to capture some of the correlations that exist between the domain attributes. Compared to selective naive Bayesian classifiers, the TAN classifier is sometimes significantly better, but significantly worse in other cases.
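Equation 3.3 translates almost directly into code. The sketch below is a minimal illustration with invented class priors and feature likelihoods; it is not an implementation from this thesis.

```python
# Minimal naive Bayesian classification per equation 3.3 (invented toy numbers).

prior = {"flu": 0.3, "cold": 0.7}                     # P(ci)
likelihood = {                                        # P(zj | ci), hypothetical
    "flu":  {"fever": 0.9, "cough": 0.6},
    "cold": {"fever": 0.2, "cough": 0.8},
}

def classify(observed):
    """Return P(ci | z1 ^ ... ^ zn) for every class, as in equation 3.3."""
    scores = {}
    for c in prior:
        p = prior[c]
        for feature, present in observed.items():
            p_f = likelihood[c][feature]
            p *= p_f if present else 1.0 - p_f        # numerator of eq. 3.3
        scores[c] = p
    z = sum(scores.values())                          # denominator of eq. 3.3
    return {c: s / z for c, s in scores.items()}

post = classify({"fever": True, "cough": True})
print(max(post, key=post.get))  # prints: flu
```

The single pass over features per class is what makes inference in the naive model "almost trivial", as noted above.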
Nevertheless, none of the above-mentioned approaches can fully account for all correlations, and they generally display poor performance in domains where extensive correlations exist between the domain attributes. In such cases, the more complex Bayesian networks should perform very well, since they model all attribute dependencies. Moreover, although naive classifiers are very good at prediction tasks, they are not very effective for most of the other tasks for which Bayesian networks are so useful (see Chapter 2), especially in domains with extensive correlations.

3.1.2 Using Feature Selection to Reduce Inference Complexity

The solution I propose to the problem of inference complexity is to learn Bayesian networks using only a subset of attributes to model the domain. The aim is to learn networks that are smaller, and hence generally simpler to evaluate, by discarding attributes that are irrelevant, redundant, or have too weak an influence on the attributes of interest to be of significant consequence. At the same time, the goal is to model all the dependencies/independencies between the attributes that are, in fact, kept in the model, to yield as accurate a representation of the domain as possible. The task is then to preselect a subset of the available attributes and use that subset to learn the final model.

This problem, commonly known as "feature selection", has been widely studied in statistics and pattern recognition, with research in this area focused primarily on selecting a subset of features within linear regression. Techniques developed include sequential backward selection (Marill and Green, 1963), branch and bound (Narendra and Fukunaga, 1977), best-first (Xu et al., 1989) and beam search, as well as bidirectional search (Siedlecki and Sklansky, 1988). A recent meeting of the Society of AI and Statistics was dedicated to papers on "Selecting Models from Data" (Cheeseman and Oldford, 1994), and contains numerous papers on feature selection. This statistical approach to subset selection shares many principles with other statistical notions of information minimality, like the minimum description length principle. For example, Dawid (1992) discusses the close relation between feature selection and the minimum description length principle.

Feature selection has also received considerable attention in the last few years within the machine learning community. The strategies that have been generally used to evaluate alternative subsets of attributes fall into two main classes. The first type of strategies constitutes what are commonly described as wrapper methods (John et al., 1994), in that the induction algorithm itself is used during feature selection to evaluate alternative subsets of attributes. The second type of strategies are known as filter methods, where less relevant features are filtered out during feature selection using an algorithm different from the induction algorithm. Filter-model approaches used for the induction of decision trees include the Focus algorithm (Almuallim and Dietterich, 1991) and the Relief algorithm (Kira and Rendell, 1992b; Kira and Rendell, 1992a), which Kononenko (1994) has extended. Cardie (1993) used a filtering approach in an extended nearest-neighbor algorithm, while Kubat et al. (1993) filtered features for use with a naive Bayesian classifier. Filter methods using a mutual-information based metric have recently been studied by Koller and Sahami (1996) in the context of decision trees and naive Bayesian classifiers. Wrapper models were formalized into a general framework for feature selection by John et al. (1994). Besides studying such methods in the context of decision tree induction, they also examine notions of relevance and irrelevance in the context of machine learning. Other work within the wrapper framework has dealt with the nearest-neighbor method (Langley and Sage, 1994b) as well as naive Bayesian classifiers (Langley and Sage, 1994a), among others. Trade-offs between the accuracy and the computational cost of wrapper methods have been studied by Caruana and Freitag (1994). Langley (1994) presents a thorough review of these as well as other feature-selection approaches studied within the machine learning literature.
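As a concrete illustration of the filter strategy, in the spirit of the mutual-information metrics mentioned above (though not a reproduction of any cited algorithm), one can rank features by their empirical mutual information with the class and retain only the top k before any model is learned:

```python
# Hypothetical filter-style feature selection: rank features by empirical
# mutual information with the class variable and keep the k most informative.
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical I(X; Y) in bits, estimated from paired samples."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def filter_select(data, class_column, k):
    """data: dict mapping column name -> list of values; returns top-k features."""
    ys = data[class_column]
    scored = {col: mutual_information(vals, ys)
              for col, vals in data.items() if col != class_column}
    return sorted(scored, key=scored.get, reverse=True)[:k]

# Example: feature A mirrors the class exactly; B is constant; C is unrelated.
data = {"A": [0, 1, 0, 1], "B": [1, 1, 1, 1], "C": [0, 0, 1, 1], "class": [0, 1, 0, 1]}
print(filter_select(data, "class", 1))  # prints: ['A']
```

Because the ranking never invokes the induction algorithm, such a filter is cheap, but, as noted above, its bias may differ from that of the learner that ultimately uses the selected features.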

3.2 The Selective Bayesian Network

As explained in the previous section, one way to address the problem of computational intractability of inference using Bayesian networks is to learn smaller networks that use only a few of the attributes, thereby reducing the size of the network and, hence, inference complexity. However, the network models all correlations among the variables that are selected; thus, it retains the various advantages that Bayesian networks offer over other representation methods, especially in domains with extensive correlations. The selective Bayesian network is a variant of the Bayesian network that uses only a subset of the given attributes to model a domain. The induction of a selective Bayesian network thus consists of two steps: attribute or feature selection, followed by network construction. The feature selection phase involves choosing a "good" subset of the available attributes under some objective function like predictive accuracy or information content. The objective function may be a function of a specific variable or a set of variables. In the network construction phase, the subset of attributes selected in the previous step is used to learn the final network.

Selective Bayesian networks offer several advantages over Bayesian networks that use all attributes (non-selective Bayesian networks). First, they provide a way of applying Bayesian networks to problems where it was not possible to do so previously, due to computational intractability. Second, since smaller networks must generally learn a much smaller set of parameters than the corresponding network using all attributes, they should learn at a much faster rate and asymptote with fewer training cases than the full model. Moreover, since induction requirements increase rapidly with the number of attributes, it may be difficult to get a good estimate of the model parameters (conditional probabilities) for large networks. This is especially true if the number of cases is small, where the larger network may pick up spurious dependencies. Third, in real-world applications, features may have an associated cost (e.g. a feature representing an expensive test). Smaller networks can be learned with a bias towards the removal of such high-cost tests. Fourth, reducing the size of Bayesian networks may be of great benefit in decision-theoretic planning, where the networks may have to be replicated over time. In modeling time-varying systems (Provan, 1993) where networks have to be constructed a number of times, limiting the size of the network is often the only way to ensure computational feasibility. Fifth, reducing the size of Bayesian networks may also be advantageous from a purely human-factors perspective. In a medical domain, for example, there are normally a large number of symptoms, as well as diagnostic tests that can be performed to detect those symptoms, to test for the presence or absence of a disease. Many of the symptoms may be prevalent in a number of different diseases; similarly, many test results may indicate the presence/absence of a number of diseases. As such, a Bayesian network representation becomes very dense and difficult to understand, cluttered with a large number of facts, many of which do not give much information about the disease variable. By learning smaller networks, it is possible to eliminate most of such concepts (symptoms, tests) which do not offer much evidence about the presence/absence of a disease. This not only simplifies the model but, more importantly, effectively "prunes" away a large sub-space of possible tests that one needs to explore in order to decide what evidence to look for in order to support a particular conclusion. Sixth, applications in some domains, e.g. telecommunications, are constrained not just by the inference complexity but also by the size of the model itself. For example, Ezawa and Norton (1995) point out that Bayesian networks learned from telecommunications data result in huge networks, since the number of attributes is very large, with most attributes having literally thousands of possible values. As such, the storage and use of such models poses a serious problem.

Similarly, when selective Bayesian networks are used for classification tasks, they offer a distinct advantage over simpler approaches such as naive Bayesian classifiers, especially in domains where there are significant correlations between the domain variables. At the same time, they do not suffer from the high inference complexity associated with unrestricted Bayesian networks. Thus, they represent an excellent trade-off between the performance of unrestricted Bayesian networks and the efficiency of naive classifiers.

In the following paragraphs, I discuss some issues relevant to the design of efficient algorithms for inducing selective Bayesian networks from data. As pointed out above, induction of selective Bayesian networks includes a feature selection phase in which an appropriate subset of attributes is selected. Feature selection can be regarded as a search through the space of possible attribute subsets. A number of issues must be addressed to determine the nature of this search process (Langley, 1994). One decision relates to the strategy used to perform feature selection, i.e. whether it is designed as a wrapper method or as a filter method. Although wrapper methods are believed to be better than filter-based methods, since they incorporate the induction method itself into the feature selection process and hence should yield better performance than filters (which may have an altogether different bias), they suffer from the disadvantage of being relatively more computationally expensive, since the induction algorithm has to be called at every stage of the feature selection process. Another issue relates to the criteria used to evaluate the quality of a feature subset and compare it to alternative subsets. Common objective functions used for this purpose are classification accuracy and information content, as well as structure size. This issue is closely related to the design of the feature selection process as a wrapper or filter method. For example, in a wrapper model, if the induction algorithm learns networks with the aim of maximizing accuracy, then the objective function used to evaluate attribute subsets will also be classification accuracy, since the same induction algorithm is used for both purposes.
Also, the objective function may be based on the value of a single variable, or may be a function of several variables. For example, the objective function for a typical classi er will be based solely on the class variable. On the other hand, there often are applications where one is interested in the values of several variables (or a function of 24

several variables) such as a medical system where one may be interested in nding out whether to begin treatment ( and for which disease), or to continue diagnosis (and using which test). Moreover, one must decide on the organization of the search process itself. Since an exhaustive search through the space of feature subsets is clearly infeasible, one must decide on an appropriate search mechanism. A greedy search strategy is generally preferred, although other methods like best- rst or even fast stochastic search may be used. A related problem concerns the starting point of the search. One way, commonly known as forward selection, starts from the empty set of features and incrementally adds features to it, while a second method, called backward elimination, starts with the entire set of features and successively removes them. Moreover, a decision has to be made whether to allow attributes to be both added or removed at any point of the feature selection process to revisit previously visited states and consider alternate paths. Yet another decision concerning the search process is to decide on the stopping criteria. Since, generally, non-optimal search methods, like greedy search, are used, it is easy for the process to end up in a local minima (or maxima depending upon the objective function). So the question is to decide whether to stop when such a point is reached, or whether to allow the process to continue on a \plateau" in the hope that a better terminating point may be found. Keeping these issues in mind, two algorithms have been designed for inducing selective Bayesian networks from data. A wrapper method for learning Selective Bayesian Networks is discussed in Section 3.3 while a lter approach is described in Section 3.4. In this thesis, I restrict myself to learning selective Bayesian networks for the speci c task of predicting the class of a single target variable. 
While the methods discussed here are specific to the single class variable problem, they can be easily generalized to problems where the target is a function of several variables. Moreover, the objective function may be based on any criterion, such as classification accuracy, information content, or the posterior distribution of a single variable or a set of attributes. Also, note that although our primary purpose is accuracy, the selective Bayesian network can still be used for arbitrary querying, since all dependencies among the attributes retained in the network are fully modeled.

3.3 A Wrapper Method for Learning Selective Bayesian Networks

The first method proposed for the induction of selective Bayesian networks is a wrapper method which selects a subset of attributes that maximizes the objective function prior to the network learning phase. The idea behind this approach is that attributes which have little or no influence on the classification accuracy of learned networks can be safely discarded without significantly affecting their performance. As discussed previously, such wrapper methods, with classification accuracy as the objective function, have been widely studied within the machine learning community and have been used for learning, among other things, decision trees (John et al., 1994), nearest neighbor classifiers (Langley and Sage, 1994b) and naive Bayesian classifiers (Langley and Sage, 1994a). The two stages of the algorithm are discussed in detail in Section 3.3.1. Section 3.3.2 describes some important properties of this algorithm, while its complexity is evaluated in Section 3.3.3.
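To make the wrapper idea concrete, here is a minimal, hypothetical sketch of such a loop in Python. It is not the dissertation's implementation: the CB/K2 network-induction step is replaced by a simple conditional-frequency-table learner, and the data layout (cases as attribute dictionaries paired with a class label) is invented for illustration.

```python
from collections import Counter

def learn_classifier(rows, attrs):
    """Stand-in learner: map each projection of a case onto `attrs`
    to the majority class seen with that projection."""
    table = {}
    for row, cls in rows:
        key = tuple(row[a] for a in attrs)
        table.setdefault(key, Counter())[cls] += 1
    default = Counter(cls for _, cls in rows).most_common(1)[0][0]
    def classify(row):
        counts = table.get(tuple(row[a] for a in attrs))
        return counts.most_common(1)[0][0] if counts else default
    return classify

def accuracy(clf, rows):
    return sum(clf(row) == cls for row, cls in rows) / len(rows)

def wrapper_select(train, evaluate, attrs):
    """Greedy forward selection: add the attribute that most improves
    held-out accuracy; stop when no addition improves it."""
    selected = []
    best = accuracy(learn_classifier(train, selected), evaluate)
    while True:
        scored = [(accuracy(learn_classifier(train, selected + [a]), evaluate), a)
                  for a in attrs if a not in selected]
        if not scored:
            break
        top, a = max(scored)
        if top <= best:
            break
        selected.append(a)
        best = top
    return selected, best
```

Any induction algorithm can be substituted for `learn_classifier`; the wrapper structure, which evaluates candidate subsets only through the learned model's accuracy, is unchanged.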

3.3.1 The K2-AS Algorithm

Since this is a wrapper method, the same algorithm is used in both the attribute selection and the network construction phases. During the feature selection phase, the algorithm selects the set of attributes that maximizes the objective function; in our case, accuracy in predicting the value of the class variable. However, finding the optimal such set may entail, in the worst case, an exponential search, which may be computationally prohibitive when the number of attributes is large. Thus, a less than optimal strategy is used. I chose a forward selection approach, in which the algorithm starts with an empty set of features and successively adds features using a greedy approach. While other types of search, such as backward elimination, best-first search or stochastic methods, may be better at times, I chose this approach simply because previous researchers (cf. (John et al., 1994; Langley and Sage, 1994a)) have shown that it works reasonably well in practice. Using other search strategies in place of greedy search is a straightforward task. Each attribute is evaluated by learning a Bayesian network from the set of attributes already selected along with the attribute under consideration, and then determining its accuracy on a set of evaluation cases. For the purpose of learning the Bayesian network, the CB algorithm (Singh and Valtorta, 1993) was used. Since this algorithm is based on the K2 metric (Cooper and Herskovits, 1992), I refer to the wrapper model approach as K2-AS. K2-AS adds attributes as long as the classification accuracy of the resultant Bayesian networks increases. Since a greedy search is being used, I also implemented a variant of this strategy, K2-AS'.

[...] both attributes would be selected. In forward selection, it may be possible to eliminate redundant attributes that are incorrectly selected, as shown in the first example above, by allowing selected nodes to be removed as well. For example, in addition to considering selecting the node that provides the maximum information about the class node given the set of selected attributes, the algorithm should also consider removing any selected attribute which no longer provides any information about the class variable, given the remaining selected attributes.
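The add-and-remove modification just described can be sketched as a stepwise search. This is an illustrative Python fragment, not the thesis code; `score` is a hypothetical set function (e.g. estimated accuracy or information content) supplied by the caller.

```python
def stepwise_select(features, score, eps=1e-9):
    """Greedy forward selection with a removal pass: after each addition,
    drop any selected attribute whose marginal contribution to `score`
    has fallen to (approximately) zero."""
    selected = []
    while True:
        remaining = [f for f in features if f not in selected]
        gains = [(score(selected + [f]) - score(selected), f) for f in remaining]
        if not gains:
            break
        gain, f = max(gains)
        if gain <= eps:                    # stop at a local maximum
            break
        selected.append(f)
        for g in list(selected):           # removal pass over a snapshot
            rest = [h for h in selected if h != g]
            if score(selected) - score(rest) <= eps:
                selected = rest            # g is now redundant; drop it
    return selected
```

On a (made-up) score table where attribute a is subsumed by the pair {b, c}, the search first adds a, later adds the pair, and then removes a in the removal pass.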
However, this modification will still not avoid situations such as those described in the second example. One solution to this problem would be to use backward elimination, as opposed to the forward selection approach currently used. In such a case, the algorithm would start with the complete set of attributes, and would successively remove those attributes which provide the least (or no) information about the class variable, given the remaining selected attributes. Clearly, in situations such as those discussed in the second example above, the relevant attributes will always be retained, since removal of either one of them would result in a very big loss in the information content of the remaining attributes. However, backward elimination may not be reliable, due to what is commonly known as the "curse of dimensionality". Since, in real world problems, the data is limited, as the size of the set of attributes considered increases, the cell counts in the corresponding contingency tables become very small and often zero, thus yielding inaccurate values of probabilities, and consequently, of the information metric.

Another option would be to start with random subsets, with multiple restarts, and use a greedy search strategy that allows addition as well as removal of attributes at every step of the attribute selection process. Yet another method, suggested by Koller and Sahami (1996), would be to start with a set of attributes that are highly correlated with the class variable, and then perform backward elimination of those attributes whose information about the class is subsumed by the other variables. They suggest eliminating, at every stage, any attribute whose Markov blanket wholly consists of attributes still remaining in the subset of attributes. This is a much stronger condition than just testing for information content, and will lead to the optimal subset of attributes, provided one starts with a good initial subset and is able to determine the Markov blanket of a node easily. Unfortunately, as pointed out by the authors, the test for the Markov blanket is very expensive, and approximations have to be used. The approximation used by Koller and Sahami (1996) is successful even if the attribute is only independent of the class variable given the set of selected attributes (i.e. IM({x}, Φ) = 0). Thus, it degenerates to performing backward elimination using information-based criteria, as mentioned above.
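A hedged sketch of the information-based backward elimination just described follows. Here `cmi` is a plug-in estimate of the conditional mutual information I(x; C | rest), computed from raw counts; the attribute names and data layout are illustrative only, and the estimator suffers from exactly the small-count problem noted above.

```python
import math
from collections import Counter

def cmi(rows, x, c, zs):
    """Plug-in estimate of I(x; c | zs) from a list of attribute dicts."""
    n = len(rows)
    nxcz = Counter((r[x], r[c], tuple(r[z] for z in zs)) for r in rows)
    nxz = Counter((r[x], tuple(r[z] for z in zs)) for r in rows)
    ncz = Counter((r[c], tuple(r[z] for z in zs)) for r in rows)
    nz = Counter(tuple(r[z] for z in zs) for r in rows)
    total = 0.0
    for (xv, cv, zv), k in nxcz.items():
        # p(x,c,z) * log( p(x,c,z) p(z) / (p(x,z) p(c,z)) ), in nats
        total += (k / n) * math.log((k * nz[zv]) / (nxz[(xv, zv)] * ncz[(cv, zv)]))
    return total

def backward_eliminate(rows, attrs, c, eps=1e-9):
    """Repeatedly drop any attribute that carries (approximately) no
    information about c given the attributes that remain."""
    kept = list(attrs)
    changed = True
    while changed:
        changed = False
        for a in list(kept):
            rest = [b for b in kept if b != a]
            if cmi(rows, a, c, rest) <= eps:
                kept.remove(a)
                changed = True
    return kept
```

In a toy dataset where y duplicates x and the class equals x, either copy is redundant given the other, so exactly one of the two is eliminated.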

3.4.4 Complexity of the Attribute-Selection Phase

I now derive the worst-case time complexity of selecting the subset of nodes in the node-selection phase using the metrics presented.

Theorem 3.4.3 The worst case time complexity of the node-selection phase is given by

    \sum_{k=0}^{n-2} (mkr + mr^2)(n - k - 1) = O(mrn^2(r + n)),        (3.5)

where r is the maximum number of possible values for any attribute (including C), m is the number of cases in the training data, k = |Φ|, and n = |Z|.

Proof: Let IM({x}, Φ) be the value of the information metric on adding x to the set Φ. Having selected a set of attributes, say Φ, we can easily calculate the value of IM({x}, Φ) (where x is the node under consideration) by building a 3-d contingency table whose rows correspond to the various classes, the columns correspond to the different values of the attribute x, and the layers correspond to the unique instantiations of the attributes in Φ. Two stages contribute to the complexity of this process: forming the contingency table and then using it to calculate IM({x}, Φ).

(Footnote: The Markov blanket M of an attribute Z is a set of attributes that makes every other attribute (≠ Z, ∉ M) conditionally independent of Z given M.)

For now, assume that the contingency table has already been created. Let r be the maximum number of possible values for any attribute. The total number of unique instantiations of the attributes in Φ, and hence the number of layers in the contingency table, can be at most m, where m is the number of cases in the training data. In order to calculate IM({x}, Φ), each cell in the contingency table has to be accessed. This can be done in O(mr^2) time.

Now, consider the complexity of creating the contingency table itself. The contingency table can be created by building an index tree. (Cooper and Herskovits (1992) used a similar technique to calculate P(B_S, D) efficiently.) The index tree is a tree of depth |Φ| + 1 where the leaves are 2-d contingency tables. Each interior node represents an attribute in Φ (with the nodes at a given level all representing the same attribute) and has an outgoing edge for each possible value that the attribute can take. Thus, a path from the root of the tree to a leaf (a 2-d contingency table) represents a unique instantiation of the attributes in Φ. Note that paths in the tree may be partial (and hence have no corresponding 2-d table), because that particular instantiation of the attributes in Φ may not occur in the training set. Since there are m cases in the training set, there will be at most m 2-d tables (as opposed to the theoretical maximum of r^k, where |Φ| = k). To enter a case into the contingency table, we must branch on or construct a path with k nodes, each of size r. Thus, a case can be entered in O(kr) time, and the entire contingency table can be constructed in O(mkr) time.

Now, consider the attribute selection phase. The process starts with Φ = ∅, i.e., k = 0. At each stage, it evaluates the metric for n - k - 1 attributes, where n is the total number of variables (including C). In the worst case, it selects all the attributes, and hence stops when |Φ| = n - 2. Thus, the worst case time complexity of the node-selection phase is given by

    \sum_{k=0}^{n-2} (mkr + mr^2)(n - k - 1) = O(mrn^2(r + n)).  □
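The index-tree construction in the proof can be sketched with nested lookups standing in for tree paths; this is an illustration of the counting argument, not the dissertation's implementation. Each distinct instantiation of Φ that actually occurs in the data gets one leaf, a 2-d table indexed by (class value, value of x), so there are at most m leaves.

```python
from collections import defaultdict

def build_index_tree(cases, phi, x, c):
    """Return {instantiation of phi: 2-d table of (class, x-value) counts}."""
    leaves = {}
    for case in cases:
        key = tuple(case[a] for a in phi)      # one root-to-leaf path: O(kr)
        cell = leaves.setdefault(key, defaultdict(int))
        cell[(case[c], case[x])] += 1          # update the leaf's 2-d table
    return leaves
```

Entering each of the m cases walks (or extends) one path of length k, matching the O(mkr) construction bound in the proof.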

3.5 Summary

In this chapter, I introduced the notion of selective Bayesian networks, and described how they may be beneficial in real-life domains where it may be difficult to apply non-selective Bayesian networks due to the computational complexity of inference. I also described two methods for learning selective Bayesian networks from data for the specific task of classification. The first method is a wrapper-model approach that attempts to maximize classification accuracy prior to the network learning phase, thus incorporating a bias for small networks that retain high accuracy. The second is a filter-model approach that uses information-theoretic metrics to select a subset of attributes by discarding those that cannot provide any information about the variable of interest, given the other selected attributes.

The wrapper method can be computationally intensive, since it makes repeated calls to the network induction algorithm during the attribute selection phase. Moreover, it suffers from the disadvantage that it is difficult to theoretically justify or guarantee the performance of the algorithm. However, it has the advantage of incorporating the bias of the induction algorithm into the attribute selection phase itself, and thus should lead to networks that are better suited for classification. On the other hand, the filter method has a much smaller computational complexity and, as shown in the previous section, has a polynomial complexity for the feature selection phase. Moreover, I have proven several properties of the feature selection metrics which give a clear understanding of the process itself, as well as show the way to better, possibly optimal, feature selection. On the negative side, since different algorithms are used for the feature selection and network induction phases, the feature selection phase incorporates a different bias, which may sometimes lead to relatively inferior classifiers.

In the following chapter, I evaluate both algorithms on several different datasets and compare the selective Bayesian networks learned by these algorithms to unrestricted Bayesian networks (learned using all attributes) in terms of inference complexity as well as classification accuracy. I also compare the performance of the learned networks to that of naive Bayesian classifiers as well as decision trees.

Chapter 4

Selective Bayesian Networks: Evaluation

To demonstrate the validity and usefulness of the algorithms developed for the induction of selective Bayesian networks, I carried out detailed experimental studies on a number of databases acquired from the University of California, Irvine, repository of machine learning (ML) databases. The main objectives of these experiments are described in Section 4.1. Section 4.2 describes the various datasets used in the experiments, while the experimental methodology and design are discussed in Sections 4.3 and 4.4, respectively. Results of experiments comparing selective Bayesian networks with other representations, such as non-selective Bayesian networks and naive Bayesian classifiers, are described in Section 4.5, and the implications of these results are discussed in Section 4.6. Finally, I summarize the results in Section 4.7.

4.1 Objectives

The experiments were designed to accomplish several objectives, as described in the following paragraphs.

First, by learning (smaller) selective Bayesian networks, it should be possible to significantly improve the inference efficiency of the resultant networks. As pointed out in Section 3.1, the number of attributes and the maximum clique size are two of the most important parameters that govern the inference complexity of a Bayesian network. As such, one of the main objectives of the experiments was to measure the reduction obtained in the size of the networks and the maximum clique size, and also the time taken to perform inference with the resulting networks. Also, since it was important to ensure that the reduction in network size was not obtained at the expense of a significant deterioration in performance, I also compared the accuracy of the selective Bayesian networks with that of the Bayesian networks learned from all attributes. I chose classification accuracy as a measure of network performance, as opposed to some other objective function such as the joint probability distribution, since both selective Bayesian network learning algorithms used an objective function based on the class variable for feature selection.

Moreover, as discussed in Section 3.1, another way to reduce the inference complexity of Bayesian networks is to restrict the network topology by making strong independence assumptions about the domain attributes. Previous experimental studies (Langley and Sage, 1994a) have shown that the simplest such network, the naive Bayesian classifier, performs fairly well in practice. Moreover, the inference complexity of such networks is polynomial in the number of nodes. As such, another objective of these experiments was to compare the performance of selective Bayesian networks with that of naive Bayesian classifiers.

Also, since a selective Bayesian network classifier must estimate a much smaller set of parameters than the corresponding non-selective network, selective networks are expected to learn at a much faster rate and asymptote faster than non-selective networks. Moreover, since induction requirements increase rapidly with the number of attributes, it may be difficult to get a good estimate of conditional probabilities for large networks, especially when little data is available. As such, I was also interested in comparing

Dataset              Attributes  Classes   Cases
Letter Recognition           16       26   20000
Voting                       16        2     435
Voting1                      15        2     435
Segment                      18        6    2310
Promoter                     57        2     106
Crx                          15        2     690
Satimage                     36        6    6435
Vehicle                      18        4     846
Hepatitis                    19        2     155
Car-eval                      6        4    1728
Nursery                       8        5   12960
Page-blocks                  10        5    5473
Musk                        166        2    6598
Audiology                    69       24     226
Chess                        36        2    3196
Soybean                      35       15     630
Mushroom                     22        2    8124
Gene-splicing                60        3    3175
Thyroid                      29       32    9172

Table 4.1: Description of the Databases Used in the Study.

the performance of selective Bayesian networks against that of unrestricted Bayesian networks as a function of the number of cases used for training.

Finally, selective Bayesian networks can be thought of as lying somewhere along a continuous spectrum, with the ends representing modeling all inter-attribute dependencies (as done by a Bayesian network) and modeling no inter-attribute dependencies (as done by a naive Bayesian classifier). Given that each representation has its own advantages and disadvantages, I also wanted to categorize the types of datasets where each method, especially selective Bayesian networks, would be the most beneficial.

4.2 Description of the Datasets Used

Several datasets were acquired from the University of California, Irvine, repository of Machine Learning databases (Murphy and Aha, 1992). The choice of datasets was motivated by the fact that I wanted not only to compare the performance of the algorithms for learning selective Bayesian networks with other induction approaches, but also to try to categorize the types of problems where the use of such networks would be more beneficial than the other approaches. As such, a diverse set of datasets was collected, with varying numbers of attributes, classes and cases. Table 4.1 describes the various datasets used for the experiments, while Figure 4.1 shows their distribution with respect to dataset size, number of attributes and number of classes. Thus, there were problems characterized by high dimensionality and small datasets (e.g. promoter and audiology), high dimensionality and large datasets (e.g. musk and satimage), low dimensionality and large datasets (e.g. nursery and car-eval), low dimensionality and small datasets (e.g. voting and vehicle), and several other problems in-between.

4.3 Experimental Methodology

The unrestricted Bayesian networks were learned using the CB algorithm (Singh and Valtorta, 1995). Similarly, both K2-AS and Info-AS used the CB algorithm for learning the selective Bayesian networks from the subset of selected attributes. The selective naive Bayesian classifiers were learned using the method suggested by Langley and Sage (1994a), as described in Section 3.1.1. The terms naiveALL and naiveAS are used to denote the naive Bayesian classifier and its selective extension, respectively. Decision trees were learned using the C4.5 implementation of the IND (version 2.1) software. (Further information on these datasets can be obtained from the UCI repository by anonymous ftp to ics.uci.edu. The voting1 database was derived from the voting database by deleting the most significant attribute, physician-fee-freeze (Buntine and Niblett, 1992).)

[Figure 4.1 here: a bar chart showing, for each dataset, the number of classes, the number of cases (x 1000), and the number of attributes (x 10).]

Figure 4.1: Problem Dimensionality.

The HUGIN (Anderson et al., 1989) clique-tree inference algorithm was used for performing inference on the learned Bayesian networks. Missing values, if any, were treated like just another attribute value. Continuous attributes were discretized by putting the attribute values into 10-12 bins, such that the number of data points in each bin was roughly the same. In addition, the datasets were randomized prior to their usage, to try to get a uniform distribution of the class variable.
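The discretization step just described can be sketched as equal-frequency binning. This is a hypothetical reconstruction, not the code actually used in the experiments; the bin count (10-12 in the text) is left as a parameter.

```python
def equal_frequency_bins(values, bins):
    """Assign each value a bin label so bins hold roughly equal counts."""
    ordered = sorted(values)
    n = len(ordered)
    # Cut points at the 1/bins, 2/bins, ... quantiles of the sorted data.
    cuts = [ordered[(i * n) // bins] for i in range(1, bins)]
    labels = [sum(v >= cut for cut in cuts) for v in values]
    return labels, cuts
```

With repeated values the bins can be unequal, but for distinct values this places roughly n/bins data points in each bin.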

4.4 Experimental Design

In order to evaluate learning algorithms, researchers often use a strategy of randomly splitting the dataset into a training and a test set, and determining the accuracy, on the test set, of the models learned from the training set. This process is repeated for a fixed number of trials, and the paired t-test is then used to determine the significance of any difference between the algorithms. However, despite its widespread use within the machine learning community, the use of the t-test in this way is incorrect (Feelders and Verkooijen, 1995; Salzberg, 1997). The reason is that the t-test assumes that the samples are independent; yet, in the above strategy, there is overlap in both the training and the test data over subsequent trials, since the same dataset is repeatedly split, albeit randomly. This in turn leads to an unacceptably high type-I error, i.e. finding significance when there is none.

This problem is much more acute than most researchers seem to realize. In preliminary experiments with several datasets, I compared the performance of selective Bayesian networks to that of non-selective Bayesian networks as well as naive Bayesian classifiers. In almost every case, the difference between the algorithms was found to be significant (Singh and Provan, 1995). However, upon further study, two problems became very apparent. First, the number of trials used for conducting each experiment plays a very important role in determining significance. Most researchers use about 25-30 trials for each dataset. However, I found that differences that were significant over 25 trials were often not significant when fewer trials, say 10, were used. Since the number of trials used by researchers is fairly arbitrary, it is clear that results based on the t-test cannot be trusted. Second, I found that significance results were not consistent over repeat experiments. Thus, if two methods were found to be significantly different, for a given dataset, in one experiment consisting of, say, 25 trials, it was not always true that they would be found significantly different when the experiment was repeated.

Instead, in order to measure the classification accuracy of the various methods, I used 10-fold cross-validation for each dataset, as described below:

[Figure 4.2 here: panel (a) shows the training/testing partition of the data without attribute selection; panel (b) shows the subset-learning/subset-evaluation/testing partition with attribute selection.]

(b) For each algorithm, learn a model using the training set and then determine its classi cation accuracy on the test data. Figure 4.2 shows the partitioning of the databases for each cross-validation run. The classi cation accuracy of an algorithm is the percentage of test cases for which it predicts the class correctly. For K2-AS and naiveAS, the subset-learning set is used for learning the network (from the current subset of attributes and the attribute under consideration) while the subset-evaluation set is used to test this network for predictive accuracy (to decide whether to add the new attribute to the set of selected attributes). Once the subset of attributes has been selected, the training set is used to learn the nal network. Thus, the feature subsets are evaluated solely on the basis of the training data, without using cases from the test 51

data. Only after the best subset of features is chosen by the feature selection algorithm is the test data used to obtain the final classification accuracies.

The use of cross-validation ensured that there was no overlap between the test sets over the 10 runs. However, the training sets still overlapped, thus violating the requirements of the t-test. Although the t-test used with a cross-validation strategy gives more accurate results than the strategy discussed above (Dietterich, 1996), it still leads to a relatively high type-I error. Another problem stems from the fact that, most of the time, more than two algorithms are compared on a given dataset. The increased number of comparisons also increases the probability of finding significance when, in fact, there is none. This problem, known as the multiplicity effect, should also be taken into account by the statistical test employed.

The McNemar test (McNemar, 1947) is an appropriate test for comparing different methods when the samples are not independent. However, it can be used only when two classification methods are being compared. The multivariable extension of the McNemar test, Cochran's Q test (Cochran, 1950), is best suited for comparing multiple methods. However, Cochran's Q test can only be used to determine whether the methods being compared are statistically different or not; it is not possible to determine, on the basis of the test alone, the magnitude or the direction of the differences (Marascuilo and McSweeney, 1977). As such, I used post-hoc procedures for the Cochran test, described by Marascuilo and McSweeney (1977), to construct simultaneous confidence intervals for all pairwise differences, which can be used to determine the direction and magnitude of the difference between each pair of methods being compared.
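The cross-validation partitioning described above can be sketched as follows; this is an illustrative reconstruction, not the original experimental code, with the 70/30 subset-learning/subset-evaluation split applied to each fold's training portion.

```python
def cv_partitions(cases, folds=10, learn_frac=0.7):
    """10-fold split; each fold's training part is further split 70/30
    into subset-learning and subset-evaluation sets."""
    n = len(cases)
    parts = []
    for i in range(folds):
        lo, hi = i * n // folds, (i + 1) * n // folds
        test = cases[lo:hi]                 # fold i is the test data
        train = cases[:lo] + cases[hi:]     # the remaining 9 folds
        cut = round(learn_frac * len(train))
        parts.append({"test": test,
                      "subset_learning": train[:cut],
                      "subset_evaluation": train[cut:]})
    return parts
```

By construction the 10 test sets are disjoint and cover the dataset, which is the property that motivated using cross-validation in the first place.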
Feelders and Verkooijen (1995) demonstrate the application of this method to the task of comparing multiple machine learning methods on a given dataset. The method is briefly described below:

Let R = (R_{ij})_{n×k} be a response indicator matrix such that

    R_{ij} = \begin{cases} 1, & \text{case } i \text{ is correctly classified by method } j, \\ 0, & \text{otherwise,} \end{cases}        (4.1)

where n is the number of cases and k is the number of methods being compared. Let \bar{R}_{.j} = (1/n) \sum_{i=1}^{n} R_{ij} denote the accuracy of method j. Also, let R_{i.} = \sum_{j=1}^{k} R_{ij}. Then, the pooled variance of any pairwise difference, \bar{R}_{.j} - \bar{R}_{.j'}, can be determined as (Marascuilo and McSweeney, 1977; Feelders and Verkooijen, 1995)

    \sigma^2_{diff} = \frac{2(k \sum_{i=1}^{n} R_{i.} - \sum_{i=1}^{n} R_{i.}^2)}{n^2 k(k-1)}        (4.2)

100(1-α)% confidence intervals are then calculated for each pair of methods, say j and j', as

    \pi_j - \pi_{j'} \in \left[ \bar{R}_{.j} - \bar{R}_{.j'} \pm Z^{1-\alpha/2}_{k^*:1} \sigma_{diff} \right]        (4.3)

where \pi_j denotes the population proportion of correct classifications of method j, k^* is the number of comparisons involved, and Z^{1-\alpha/2}_{k^*:1} denotes the value of the Dunn-Bonferroni distribution (Dunn, 1961; Marascuilo and McSweeney, 1977) with degrees of freedom equal to 1. As expected, the confidence intervals get wider as the number of comparisons increases. If the confidence interval for two methods contains 0, then the difference between the two methods is not statistically significant at the 100(1-α)% confidence level.

In addition, to measure the rate of improvement of the various classifiers, accuracy was also measured for different numbers of training cases. More specifically, experiments were carried out using 20%, 40%, 60%, 80% and 100% of the cases in the training data for learning the classifiers, followed by testing on the test data. Once again, the (reduced) training data was split into two parts for the two selective algorithms.
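Equations 4.2 and 4.3 translate directly into code. In this sketch the response matrix R is a list of 0/1 rows (cases × methods), and the Dunn-Bonferroni critical value z is assumed to be looked up from tables and passed in, since it is not computed here.

```python
import math
from itertools import combinations

def pairwise_intervals(R, z):
    """Simultaneous confidence intervals for all pairwise accuracy
    differences, following Equations 4.2-4.3."""
    n, k = len(R), len(R[0])
    acc = [sum(row[j] for row in R) / n for j in range(k)]   # accuracy of each method
    row_sums = [sum(row) for row in R]                       # R_i. for each case
    var_diff = (2.0 * (k * sum(row_sums) - sum(s * s for s in row_sums))
                / (n * n * k * (k - 1)))                     # Equation 4.2
    half = z * math.sqrt(var_diff)
    return {(j, j2): (acc[j] - acc[j2] - half, acc[j] - acc[j2] + half)
            for j, j2 in combinations(range(k), 2)}
```

An interval that contains 0 indicates no statistically significant difference between the corresponding pair of methods.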

4.5 Experimental Results

4.5.1 Selective versus Non-selective Bayesian Networks

As pointed out in Section 4.1, there were several objectives in experimentally comparing selective Bayesian networks to non-selective ones. First, I wanted to gauge the degree to which feature selection affected the inference complexity of the resultant networks, by measuring the reduction in various parameters such as the number of attributes selected, the maximum clique size of the resultant networks, and the time to compile, and perform inference on, the resultant networks. Second, I wanted to compare the classification accuracy of the selective and non-selective Bayesian networks. This was important for two reasons: (a) to ensure that feature selection did not result in a high loss of performance, and (b) to see whether feature selection resulted in better estimation of conditional probabilities, especially on high-dimensional problems with small datasets. Finally, I wanted to measure the effect of feature selection on the induction rates of the (smaller) networks learned.

I first describe the results of comparing non-selective Bayesian networks to selective Bayesian networks learned using the wrapper method described in Section 3.3 (K2-AS). Following that, I describe the corresponding experiments involving selective Bayesian networks learned using the filter-method approach (Info-AS).

Comparison of Selective Bayesian Networks Learned by K2-AS with Non-Selective Bayesian Networks

Figures 4.3-4.9 show the average reduction in various parameters, such as the number of attributes selected, maximum clique size and compilation/inference times, for the selective Bayesian networks learned by K2-AS, relative to the non-selective Bayesian networks learned using all attributes.

For most of the domains, K2-AS selected only a small fraction of the available attributes. For example, K2-AS selected, on average, 3.4 attributes out of 69 for the audiology dataset, a 95% reduction. Similarly, only 6.4% of the attributes were selected for the promoter dataset (3.6 of 57), and about 14% for the chess dataset (5 of 36). The reduction in the number of attributes selected was much less in the case of K2-AS [...]

    r_{li} = \begin{cases} 1, & d_{li} \text{ observed,} \\ 0, & d_{li} \text{ missing.} \end{cases}        (5.1)

If ψ is the set of parameters governing the distribution for the missing data mechanism, P(R | D_obs, D_mis, ψ), the joint distribution of R and D can be written as

    P(D_obs, D_mis, R | θ, ψ) = P(D_obs, D_mis | θ) P(R | D_obs, D_mis, ψ)        (5.2)

Missing Completely at Random: If the probability that the data is missing is independent of both the observed data, D_obs, and the missing data, D_mis, then the data is said to be Missing Completely at Random, or MCAR. In other words, the data is MCAR iff

    P(R | D_obs, D_mis, ψ) = P(R | ψ)        (5.3)

Missing at Random: If the probability that the data is missing for a particular variable may depend on the values of the observed variables in that case, but is independent of the values of the missing component itself (given the values of the observed component), the missing data is said to be Missing at Random, or MAR. Thus, the missing data is MAR iff

    P(R | D_obs, D_mis, ψ) = P(R | D_obs, ψ)        (5.4)

Not Missing at Random: If the probability that the data is missing depends on the value of the missing component itself, and possibly on the value of the observed component as well, then the missing data is said to be Not Missing at Random, or NMAR.
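The three mechanisms can be illustrated by generating missingness for a pair (x, y) where only y can go missing; the probabilities below are arbitrary, chosen only to make the contrast visible, and are not from the dissertation.

```python
import random

def mask(pairs, mechanism, rng):
    """Return (x, y) pairs with y possibly replaced by None (missing)."""
    out = []
    for x, y in pairs:
        if mechanism == "MCAR":
            p = 0.3                     # independent of the data
        elif mechanism == "MAR":
            p = 0.6 if x == 1 else 0.1  # depends only on the observed x
        else:                           # NMAR
            p = 0.6 if y == 1 else 0.1  # depends on the missing y itself
        out.append((x, None if rng.random() < p else y))
    return out
```

Under MAR, the missingness rate for y differs across values of the observed x, but not across values of y once x is fixed; under NMAR, it differs across the (unobserved) values of y itself.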

Ignorable versus Non-ignorable Missing Data

The type of the missing data mechanism plays a critical role in determining the types of problems that can be addressed by various learning algorithms. For maximum likelihood methods, the estimates of the parameters θ and ψ are related to the observed component of the data, D_obs, and R via P(D_obs, R | θ, ψ). The likelihood of θ and ψ is any function of θ and ψ proportional to P(D_obs, R | θ, ψ):

    L(θ, ψ | D_obs, R) ∝ P(D_obs, R | θ, ψ).

From Equation 5.2, we get

    P(D_obs, R | θ, ψ) = ∫ P(D_obs, D_mis | θ) P(R | D_obs, D_mis, ψ) dD_mis.        (5.5)

Now, if the data is missing at random (MAR), then using Equation 5.4, we get

    P(D_obs, R | θ, ψ) = P(D_obs | θ) P(R | D_obs, ψ).        (5.6)

Thus, to find the maximum likelihood estimate of θ (by maximizing L(θ, ψ | D_obs, R) w.r.t. θ), it is sufficient to maximize L(θ | D_obs) (∝ P(D_obs | θ)), since this is the only term containing θ on the r.h.s. of Equation 5.6. (Within the machine learning community, researchers often refer to the MCAR case as "missing at random". I adopt the more widely used terminology of differentiating between MCAR and MAR because, as discussed in the following sections, these two assumptions lead to very different requirements of learning algorithms.)

Thus, if the data is MAR or MCAR, then the maximum likelihood estimate of θ can be determined by ignoring the parameters of the missing data mechanism ("ignorable"). This means that one does not have to explicitly model the missing data mechanism while determining the maximum likelihood estimate of a parameter of interest from incomplete data, since doing so will not improve the estimate in any way. It is precisely these kinds of problems (where the missing data is either MAR or MCAR) that I am interested in, because in most real-world domains the MAR (and thus MCAR) assumption is generally sufficient. For example, in most medical datasets, values are often missing completely at random (e.g. an error in transcribing nurse notes to a computer) or, more generally, missing at random (e.g. although a CAT scan is performed for certain abdominal injuries, it is not performed if the patient is in shock; thus, the record of such a patient will have a missing value for the attribute corresponding to the CAT scan result).

The case where the missing data is NMAR cannot be handled without first formulating a model for the missing data mechanism ("non-ignorable"). This generally leads to various problems, such as high computational requirements, and is not dealt with in this research. In the following sections, I discuss a number of methods for handling missing data, and point out the various advantages and disadvantages of these approaches. All the methods discussed assume that the missing data is ignorable. For the non-ignorable case, the reader is referred to Little and Rubin (1987) and Rubin (1987).

5.3 Methods for Handling Missing Data

5.3.1 Complete Case Analysis

The simplest way of handling missing data is to simply drop those cases for which the values of some variables were not observed; only complete data cases are used for the purpose at hand. While this method is fairly straightforward, and may be satisfactory with small amounts of missing data, it may lead to serious biases in the presence of large amounts of missing data. In general, the approach is valid if the missing data is MCAR. Under the MCAR assumption, the complete cases are effectively a random subsample of the original cases; thus, the estimates are not biased by discarding the incomplete cases. However, this assumption is rarely valid in real-life domains. If the missing data mechanism is MAR, then the approach may lead to considerable biases. This disadvantage stems from the potential loss of information as a result of discarding cases with missing values. The loss in sample size can be considerable, particularly if the number of variables is large (Little and Rubin, 1987). For example, consider the case where a disease variable $Y$ is always instantiated to disease $y_i$ (for some fixed $i$) whenever a symptom variable $X$ has a missing value. If all incomplete cases are discarded, then this relation between the absence of the value of $X$ and the disease $y_i$ will not be discovered by a learning algorithm. Moreover, the smaller amount of data also results in wider confidence intervals, and thus lower reliability, for the estimated parameters.
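As a concrete sketch of complete-case analysis (in Python; the attribute names and data below are purely illustrative), incomplete cases are simply discarded before estimation:

```python
# Complete-case analysis: drop every case containing a missing value
# (represented here as None) before estimating parameters.
# The attribute names and data are illustrative only.
cases = [
    {"fever": 1, "rash": 0},
    {"fever": 0, "rash": None},   # incomplete: discarded
    {"fever": 1, "rash": 1},
    {"fever": None, "rash": 1},   # incomplete: discarded
]

complete = [c for c in cases if all(v is not None for v in c.values())]

# Estimate P(fever=1) from the surviving complete cases only
p_fever = sum(c["fever"] for c in complete) / len(complete)
print(len(complete), p_fever)   # 2 cases survive; P(fever=1) = 1.0
```

Note that half the sample is lost here, and the estimate is unbiased only if the discarded cases are a random subsample (MCAR).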

5.3.2 Treating Missing Values as Just Another Value

If the attributes with missing values are discrete, then it is fairly common to assign to each missing value a `new value' corresponding to an `unknown category' in the value set (Cheeseman et al., 1988). Real-valued variables can be accommodated by adding a `known/unknown' category whose value can then be set appropriately. Although this method is straightforward, it suffers from the drawback that it adds an additional parameter to be estimated. Moreover, the extra value does not reflect the fact that the missing value actually came from the original multinomial value set; thus, for a dataset where the value of a particular attribute is often missing, an algorithm may form a class based on that attribute value being unknown, which may not be desirable in a classifier (Ghahramani and Jordan, 1994). Also, sometimes an extra value just does not make sense (e.g. the sex of a patient, where the unknown value has to be either male or female). However, in certain situations, this approach may actually be preferable to other techniques of handling missing data due to its inherent simplicity. One such scenario, generally found in medical domains, is where the very absence of a value may be highly informative. For example, the value of an attribute corresponding to a particular clinical test may be absent because the test was deemed irrelevant or unnecessary by the physician, given the other information. As such, the particular missing value is not important. However, the absence of a value should increase the likelihood of diseases for which the test is not relevant (and likewise decrease the likelihood of diseases for which it is). By treating missing values as an extra value, this relationship can be easily encoded (Greiner et al., 1997a). This approach is also useful for modeling some situations where the missing data mechanism is NMAR, such as censored data. For example, certain values recorded from a sensor may be absent due to the inability of the sensor to record information above a maximum value. Treating missing values as an extra value offers a simple and straightforward way of encoding this information (i.e. the extra value corresponds to a value greater than the maximum).
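A minimal sketch of this encoding (in Python; the attribute and its values are hypothetical):

```python
# Treat a missing value as an extra 'unknown' category: the value set of
# the discrete attribute is simply extended. The attribute values below
# are illustrative.
raw_values = ["high", "low", None, "high", None]

encoded = [v if v is not None else "unknown" for v in raw_values]

# The attribute now has three categories instead of two, so one more
# parameter must be estimated for its multinomial distribution.
categories = sorted(set(encoded))
counts = {c: encoded.count(c) for c in categories}
print(categories, counts)
```

The extra category is what lets a learner exploit informative absence, at the cost of the additional parameter discussed above.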

5.3.3 Imputation Procedures

Other commonly used methods for handling missing data involve "filling in" the missing values with some appropriate value or values. These methods, collectively known as imputation procedures, are briefly discussed below.

Imputing Single Values

Most of the commonly used imputation techniques involve replacing missing values by a single value computed by some method. Some of the common techniques are mean imputation, regression imputation and hot-deck imputation (Little and Rubin, 1987). Single imputation procedures have the advantages of being flexible and simple, and the resultant complete data can be analyzed with standard methods. However, such approaches have serious disadvantages. One problem is that, although the missing values are not known, the use of complete-data methods in analyzing the imputed data sets treats the missing values as if they were in fact the actual values. Due to this, even if the reasons for the missing values are known, inferences based on the imputed set will be too sharp, since treating the imputed values as the true underlying values fails to account for the extra variability due to the missing values (Rubin, 1987). Moreover, quantities depending upon variability (e.g. correlation) may be seriously biased. Also, in situations where the reasons for the missing data are not known, treating imputed values as if they were in fact the correct values ignores the uncertainty arising from the fact that the correct missing data mechanism is not known. Moreover, the confidence limits for the estimated parameters are changed, thus reducing the reliability of the estimated values. Excellent examples illustrating these problems may be found in (Rubin, 1987) and (Ghahramani and Jordan, 1994).

In mean imputation, the missing values for a variable are replaced by the average of the values observed for the variable in those cases where its value was observed. However, since the missing values are replaced by values at the center of the distribution, this method generally underestimates the sample variance and the covariance matrix calculated from the filled-in data. Regression imputation, on the other hand, replaces missing values by values calculated by carrying out a regression of the missing variable on the other variables observed for that case. However, this method still yields biased estimates of the covariance matrix because the filled-in points all fall along the regression line (Ghahramani and Jordan, 1994). A third technique is known as hot-deck imputation. In this form of imputation, the distribution for each missing value is estimated, and the imputed value is then drawn from that distribution. This differs from mean imputation, where the imputed value is the mean of this distribution. Thus, if we take the distribution to consist of values from the cases where the values of the variable under consideration were observed, hot-deck imputation involves substituting values drawn from similar cases.
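The variance-shrinking behaviour of mean imputation, and the contrast with hot-deck draws, can be sketched as follows (in Python; the data values are made up for illustration):

```python
import random
random.seed(0)

# Single imputation on one variable with missing entries (illustrative data).
values = [2.0, 4.0, None, 6.0, None, 4.0]
observed = [v for v in values if v is not None]

# Mean imputation: every missing entry gets the observed mean,
# which shrinks the sample variance toward zero.
mean = sum(observed) / len(observed)
mean_filled = [v if v is not None else mean for v in values]

# Hot-deck imputation: each missing entry is drawn from the observed
# (empirical) distribution, which preserves the spread of the variable.
hotdeck_filled = [v if v is not None else random.choice(observed)
                  for v in values]

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(var(mean_filled) <= var(observed))   # mean imputation understates variance
```

Either way, a single filled-in dataset gives no account of the uncertainty about the missing values, which motivates the multiple imputation methods discussed next.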

Multiple Imputation

In all the imputation methods discussed above, each missing value is replaced by a single imputed value. However, a single value can in no way convey the uncertainty about the true value; rather, it depicts (incorrectly) that the imputed value is in fact the `actual' missing value. A much better technique is to use multiple imputation methods (Rubin, 1987), which impute more than one value for each missing data point. Each missing value is replaced by a vector of $M \geq 2$ imputed values, thus giving $M$ completed data sets, each of which can be analyzed using standard complete-data methods. Replacing each missing value by the first component in its imputation vector gives the first completed data set, replacing each missing value by the second component gives the second such set, and so on. As discussed by Rubin (1987), multiple imputation has several advantages over single imputation. First, the uncertainty about the missing data is easily represented. Second, if the mechanism is unknown, then the imputed values represent the uncertainty about the different reasons (models) for the missing values. Third, by using complete-data methods to analyze each of the $M$ complete data sets, the results can be combined to yield a single, more representative inference.
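A minimal sketch of the multiple-imputation workflow (in Python; hot-deck draws stand in for a proper imputation model, and a simple average stands in for Rubin's combining rules):

```python
import random
random.seed(0)

# Multiple imputation with M = 5: each missing entry is filled M times,
# yielding M completed datasets that are analyzed separately and then
# combined. The data are illustrative; hot-deck draws are used as a
# stand-in for draws from a posterior predictive model.
values = [2.0, 4.0, None, 6.0, None, 4.0]
observed = [v for v in values if v is not None]
M = 5

estimates = []
for s in range(M):
    completed = [v if v is not None else random.choice(observed)
                 for v in values]
    estimates.append(sum(completed) / len(completed))  # complete-data analysis

# Combine the M analyses into a single inference (here, their average);
# the spread across the M estimates reflects the missing-data uncertainty.
pooled = sum(estimates) / M
print(round(pooled, 2))
```

The variability among the `estimates` is exactly the information that single imputation discards.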

5.3.4 Model-based Procedures

A number of methods have also been developed that define a model for the partially missing data and base inferences on the likelihood under that model. In such cases, parameters are often estimated by procedures such as maximum likelihood (ML) or maximum a posteriori probability (MAP) estimation. Such techniques have an obvious advantage over the methods discussed previously: the assumptions underlying them can be displayed and evaluated. Two of the techniques commonly used are the EM algorithm (Dempster et al., 1977) and Gibbs sampling (Geman and Geman, 1984). Another technique often used, data augmentation (Tanner and Wong, 1987; Li, 1985), may be regarded as a combination of the EM algorithm and multiple imputation. These approaches are briefly discussed below.

The EM Algorithm

The Expectation-Maximization (EM) algorithm (Dempster et al., 1977) is a general iterative algorithm for maximum likelihood estimation in problems with missing data. Rather than estimating individual missing values, the algorithm works with the expected complete-data log-likelihood. The M step consists of simply performing a maximum likelihood estimation of $\theta$, assuming that the data is complete. The E step finds the conditional expectation of the complete-data log-likelihood, given the observed component of the data and the current values of the parameters. Thus, using the notation of Little and Rubin (1987), the steps can be written down as:

E step:
$$Q(\theta \mid \theta^{(t)}) = \int l(\theta \mid D)\, P(D^{mis} \mid D^{obs}, \theta^{(t)})\, dD^{mis}. \qquad (5.7)$$

M step:
$$\theta^{(t+1)} = \arg\max_{\theta}\, Q(\theta \mid \theta^{(t)}) \qquad (5.8)$$

Here $\theta^{(t)}$ is the current estimate of the parameter $\theta$, while $l(\theta \mid D)$ is the complete-data log-likelihood. Although the method described above requires that $\theta^{(t+1)}$ be chosen so as to satisfy Equation 5.8, it is sufficient to ensure that $Q(\theta^{(t+1)} \mid \theta^{(t)}) > Q(\theta^{(t)} \mid \theta^{(t)})$, i.e. the value of $Q(\theta)$ is increased but not necessarily maximized. Such algorithms are often referred to as generalized EM or GEM algorithms. This method has the advantages of being conceptually easy to design and, more importantly, of converging reliably. However, it suffers from a very slow convergence rate if the proportion of missing data is very high. Another disadvantage is that the method often finds just a local maximum, and not a global one.
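As a concrete illustration of the E and M steps, consider the classic two-coin mixture problem, in which the identity of the coin used in each session plays the role of the missing data (all numbers in this Python sketch are illustrative):

```python
import random
from math import comb

# Two coins with unknown biases; each 10-flip session uses one coin,
# but which coin was used is the "missing data". Illustrative setup.
random.seed(0)
true_p = (0.8, 0.3)
sessions = []
for _ in range(200):
    p = random.choice(true_p)
    sessions.append(sum(random.random() < p for _ in range(10)))

pA, pB = 0.6, 0.4              # initial parameter guesses
for _ in range(50):            # EM iterations
    # E step: posterior responsibility of coin A for each session
    numA = numB = denA = denB = 0.0
    for h in sessions:
        la = comb(10, h) * pA**h * (1 - pA)**(10 - h)
        lb = comb(10, h) * pB**h * (1 - pB)**(10 - h)
        ra = la / (la + lb)
        numA += ra * h;        denA += ra * 10
        numB += (1 - ra) * h;  denB += (1 - ra) * 10
    # M step: maximum likelihood re-estimation from expected counts
    pA, pB = numA / denA, numB / denB

# The estimates should approach the true biases, roughly 0.8 and 0.3
print(round(max(pA, pB), 2), round(min(pA, pB), 2))
```

Each pass increases the observed-data likelihood, exactly as the (G)EM property above requires.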

Gibbs Sampling

A more general method for handling missing values is Gibbs sampling (Geman and Geman, 1984). This is also an iterative algorithm, but more general than the EM algorithm. The basic idea behind the method is explained below. Suppose the distribution $P(\theta \mid D^{obs})$ needs to be estimated for some parameter of interest $\theta$. The missing component of the data, $D^{mis}$, is first instantiated to some initial value (e.g. at random). This yields a complete data set, say $D = D^{(0)}$. Then, for each value missing in the original data set, $d_{li} \in D^{mis}$, the algorithm replaces the current value by sampling a new value from the distribution $P(d_{li} \mid D \setminus d_{li})$, where $D \setminus d_{li}$ denotes the dataset $D$ with the value of $d_{li}$ removed. This operation produces a new complete data set, $D^{(t+1)}$. The posterior distribution $P(\theta \mid D)$ is then calculated, and these two steps are then iterated a number of times, say $s$. Assuming a small set of conditions is met during this process, the distribution $P(\theta \mid D^{obs})$ can then be estimated as the average of the distributions $P(\theta \mid D)$ computed during the iterations. Gibbs sampling can be applied to almost any distribution; the only restrictions are that the joint distribution is strictly positive (i.e. any instantiation of the domain variables is possible) and that each state (instantiation) is visited infinitely often. Although Gibbs sampling is more accurate than the EM algorithm (it converges to the optimal solution), it is typically much more time consuming, and the latter is generally preferred.
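The sampling scheme can be sketched on a deliberately tiny problem (in Python; a two-variable joint distribution stands in for the full data-plus-parameter state, and the probability table is illustrative):

```python
import random
random.seed(1)

# A minimal Gibbs sampler over two binary variables X, Y with a fixed,
# strictly positive joint P(x, y); the numbers are illustrative only.
P = {(0, 0): 0.3, (0, 1): 0.2, (1, 0): 0.1, (1, 1): 0.4}

def p_x_given_y(y):
    return P[(1, y)] / (P[(0, y)] + P[(1, y)])

def p_y_given_x(x):
    return P[(x, 1)] / (P[(x, 0)] + P[(x, 1)])

x, y = 0, 0
count_x1 = 0
n = 100000
for _ in range(n):
    x = 1 if random.random() < p_x_given_y(y) else 0   # resample X | Y
    y = 1 if random.random() < p_y_given_x(x) else 0   # resample Y | X
    count_x1 += x

# The long-run frequency of X=1 approximates the true marginal
# P(X=1) = P[(1,0)] + P[(1,1)] = 0.5.
print(round(count_x1 / n, 2))
```

Resampling each missing value from its full conditional, given everything else, is exactly the step the text describes; averaging over the visited states recovers the quantity of interest.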

Data Augmentation

Data augmentation (Tanner and Wong, 1987) is another technique for estimating the posterior distribution $P(\theta \mid D^{obs})$ of some parameter of interest. This method also has two steps, akin to the EM algorithm. During the imputation or I step, multiple imputations are drawn from the current estimate of the posterior distribution of the missing data, given the observed data and the current parameter estimate, i.e. multiple imputed values are drawn from the distribution $P(D^{mis} \mid D^{obs}, \theta)$. During the posterior or P step, the current posterior distribution $P(\theta \mid D^{obs})$ is represented as an equally-weighted mixture (i.e. each mixture component has the same weight) of the complete-data posterior distributions $P(\theta \mid D^{obs}, D^{mis})$, which are easily calculated using the imputed data from the I step. Tanner and Wong (1987) show that the algorithm converges to the true posterior distribution, provided some mild regularity conditions are satisfied.
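The I and P steps can be sketched for a single Bernoulli parameter with a Beta prior (in Python; the data, the prior, the single imputation per I step, and the burn-in length are all assumptions made for this sketch):

```python
import random
random.seed(2)

# Sketch of data augmentation for one Bernoulli parameter theta with a
# Beta(1,1) prior; some observations are missing completely at random.
# For simplicity, a single imputation is drawn per I step, which makes
# this the two-block Gibbs sampler over (theta, missing data).
data = [1] * 70 + [0] * 30 + [None] * 20    # 20 MCAR missing entries

theta = 0.5
draws = []
for t in range(5000):
    # I step: impute each missing value from Bernoulli(theta)
    completed = [d if d is not None else (1 if random.random() < theta else 0)
                 for d in data]
    # P step: draw theta from the complete-data Beta posterior
    h, n = sum(completed), len(completed)
    theta = random.betavariate(1 + h, 1 + n - h)
    if t >= 1000:                            # discard burn-in
        draws.append(theta)

# The average of the retained draws approximates the posterior mean,
# which here is close to the observed fraction of ones (about 0.7).
print(round(sum(draws) / len(draws), 2))
```

Because the missing entries are MCAR, the imputations add no information and the posterior concentrates where the observed data alone would put it.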

5.4 Related Work

Although several methods have been developed for inducing Bayesian networks from data, e.g. (Cooper and Herskovits, 1992; Singh and Valtorta, 1995; Heckerman et al., 1995; Lam and Bacchus, 1993), most of these techniques assume that the data from which the network is to be learned is complete. In situations where this is not the case, as in most real-world problems, the data is often made complete by "filling in" values using a variety of, often ad hoc, methods. A substantial amount of work has been done in adapting methods such as EM and Gibbs sampling to the task of learning Bayesian network parameters (conditional probability tables) from incomplete data under the assumption that the structure is known. However, the general, and more practical, task of learning both network structure and parameters from incomplete data has not been fully explored. Most of the techniques developed in this regard work satisfactorily only if the missing data is MCAR. However, as pointed out in Section 5.2, this restriction is too strong for most real-world datasets, where data that is missing at random (MAR) is often encountered and must be handled properly to yield good models. Very little work has been done to address this issue. Some stochastic methods have been developed that could be used to handle MAR data; however, they are far too computationally inefficient to be of any use. I first discuss some of the commonly used methods for learning the conditional probabilities of a Bayesian network from incomplete data, assuming that the structure is known. Then, I describe the main methods developed for learning both structure and parameters in the presence of missing data.

5.4.1 Learning Bayesian Network Parameters

In this section, I discuss some of the salient approaches to learning Bayesian network probabilities from incomplete data, assuming that the structure is known. Lauritzen (1995) discusses the application of the EM algorithm to the task of learning conditional probabilities from data with missing values. Assuming that the data $D$ consists of $m$ independent cases, the complete-data likelihood function is given by

$$L_{B_S}(\theta \mid D) \propto P_{B_S}(D \mid \theta) = \prod_{l=1}^{m} P_{B_S}(d_l \mid \theta) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}}$$

where $B_S$ is the Bayesian network structure, $n$ is the number of attributes, $q_i$ is the number of possible instantiations of the parents, $\pi_i$, of attribute $i$, $r_i$ is the number of values of attribute $i$, $N_{ijk}$ is the number of cases in which attribute $i$ is instantiated to its $k$th value while $\pi_i$ is instantiated to its $j$th value, and $\theta = (\theta_{ijk})$ are the conditional probabilities.

Now, taking the conditional expectation of the complete-data log-likelihood, given the current value of the parameter $\theta$ and the observed component of the data, $D^{obs}$, we get

$$Q(\theta \mid \theta^{(t)}) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} E(N_{ijk} \mid D^{obs}, \theta) \log \theta_{ijk} \qquad (5.9)$$

The E step then consists of determining the value of

$$E(N_{ijk} \mid D^{obs}, \theta) = \sum_{l=1}^{m} E(\delta_{lijk} \mid d_l^{obs}) \qquad (5.10)$$

where $d_l^{obs}$ is the observed component of the $l$th case, and $E(\delta_{lijk} \mid d_l^{obs})$ is defined as

$$E(\delta_{lijk} \mid d_l^{obs}) = \begin{cases} 1, & \text{if } X_i \text{ and } \pi_i \text{ are observed, and } X_i = k\text{th inst.}, \pi_i = j\text{th inst.} \\ 0, & \text{if } X_i \text{ and } \pi_i \text{ are observed, and } X_i \neq k\text{th inst. or } \pi_i \neq j\text{th inst.} \\ P_{B_S}(X_i = k\text{th inst.}, \pi_i = j\text{th inst.} \mid d_l^{obs}, \theta), & \text{otherwise.} \end{cases}$$

To determine $P_{B_S}(X_i = k\text{th inst.}, \pi_i = j\text{th inst.} \mid d_l^{obs}, \theta)$, the Bayesian network $\langle B_S, \theta \rangle$ is instantiated with the observed evidence in the $l$th case, i.e. $d_l^{obs}$, and inference is carried out. The M step then consists of using these estimated values to compute $\theta_{ijk}$. Thus,

$$\theta_{ijk} = \frac{E(N_{ijk} \mid D^{obs}, \theta)}{E(N_{ij} \mid D^{obs}, \theta)}$$

where $N_{ij} = \sum_k N_{ijk}$. Although the EM algorithm is known to converge reliably, it does not guarantee optimality and often ends up in a local maximum.

As discussed in Section 5.3.4, Gibbs sampling is a much more general technique; rather than estimating the parameter of interest directly, it estimates the posterior distribution of the parameter and, under mild regularity conditions, guarantees convergence to the true posterior. On the down side, it is much more time consuming than EM. The application of Gibbs sampling to the task of learning BN parameters from incomplete data, given the structure, has been discussed by Heckerman (1995). The algorithm starts by somehow instantiating all missing values (e.g. randomly) to yield a complete data set, $D^{(0)}$. Then, for each missing value in the original data, $d_{li} \in D^{mis}$ (the value of the $i$th attribute in the $l$th case), the current value is discarded and replaced by a new value drawn from the distribution $P(d_{li} \mid D \setminus d_{li}, B_S)$. This operation yields a new complete data set, $D^{(t+1)}$. The posterior distribution, $P(\theta \mid D, B_S)$, is then determined, and the two steps are iterated. The average of $P(\theta \mid D, B_S)$ over the various iterations is used as the approximation for $P(\theta \mid D^{obs})$.

Several other techniques have also been proposed over the last few years. Russell et al. (1995) describe a gradient-descent approach to the task of determining the maximum-likelihood estimates of the BN parameters, given the structure, as an alternative to the EM algorithm. Heckerman (1995) discusses a method for computing the posterior distribution of the BN parameters based on the Gaussian approximation. Ramoni and Sebastiani (1997) describe a method of estimating BN parameters that first uses the available information to establish bounds over the set of estimates ("bound" step) and then collapses the resulting set to a point estimate via a convex combination of the extreme points ("collapse" step).
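One EM pass over a two-node network $X \rightarrow Y$ with some values of $Y$ missing can be sketched as follows (in Python; the data are illustrative, and the expected counts here play the role of $E(N_{ijk} \mid D^{obs}, \theta)$ above):

```python
# EM for the CPT of Y in a two-node network X -> Y, where some values
# of Y are missing. The data are hypothetical; cases are (x, y) pairs
# with None marking a missing y.
data = [(0, 1), (0, 1), (0, 0), (1, 1), (1, None), (0, None)]

theta = {0: 0.5, 1: 0.5}   # current estimate of P(Y=1 | X=x)

for _ in range(30):
    # E step: expected counts E[N(x, y)] under the current parameters
    N = {(x, y): 0.0 for x in (0, 1) for y in (0, 1)}
    for x, y in data:
        if y is not None:
            N[(x, y)] += 1.0
        else:  # distribute the incomplete case over both values of Y
            N[(x, 1)] += theta[x]
            N[(x, 0)] += 1 - theta[x]
    # M step: re-normalize the expected counts
    theta = {x: N[(x, 1)] / (N[(x, 0)] + N[(x, 1)]) for x in (0, 1)}

print(round(theta[0], 2), round(theta[1], 2))   # → 0.67 1.0
```

For $X = 0$ the fixed point satisfies $\theta_0 = (2 + \theta_0)/4$, i.e. $\theta_0 = 2/3$, which the iteration reaches quickly; in a real network the fractional counts would come from probabilistic inference rather than a table lookup.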

5.4.2 Learning Bayesian Network Structure and Parameters

Unlike the case of learning parameters given structure, not many useful algorithms have been designed for the more general and practical task of learning both BN structure and parameters from incomplete data. Most of the methods developed handle only those incomplete datasets where the data is missing completely at random (MCAR). However, this assumption is too strong for most real-world problems, where the missing data is often MAR. Other methods, which may be able to successfully handle MAR data, are too computationally expensive to be of any use. I briefly discuss some of the main work in this regard, along with the pros and cons of the various approaches.

The earliest approach to this problem can be attributed to Cooper and Herskovits (1992), who present a theoretical method of handling missing data while learning Bayesian networks from incomplete data. The basic idea is to sum over all possible missing data instantiations while calculating the likelihood function, which is used to evaluate various Bayesian networks against each other. However, given the exponential complexity of this approach, it is not feasible to use it except for very small datasets. Although Cooper (1995) extends this approach to yield a relatively efficient algorithm, it may still be computationally expensive for most real-world problems. Moreover, both approaches work only if the missing data is MCAR.

Heckerman (1995) discusses various Monte Carlo approaches to compute the observed-data likelihood functions, which can then be used to learn the most likely Bayesian network. However, although the methods are accurate, they are computationally very inefficient. Heckerman also discusses a more efficient method for computing the likelihood based on the Gaussian approximation; however, no results are available regarding its performance.

Ramoni and Sebastiani (1997) extend their "bound and collapse" approach, discussed in the previous section, to the task of learning both parameters and structure from incomplete data. However, although the method is computationally efficient, it is valid only if the missing data is MCAR. Note that although the authors state that their method works when the data is missing at random, they define missing-at-random to imply that the observed data $D^{obs}$ is a "representative sample of the complete database". This assumption is valid only if the data is missing completely at random (MCAR) and is not true for the MAR case. Similarly, their experimental results are also based on incomplete datasets where the missing data is MCAR (values were randomly removed independently of the values of other variables).

In parallel with this research, Friedman (1997) has independently developed a method, similar to the one I present in the next section, for learning both parameters and structure from incomplete data using the EM algorithm. The proposed algorithm uses a metric based on the minimum description length (MDL) principle (Lam and Bacchus, 1993) to evaluate candidate Bayesian network structures. The algorithm performs the search for the best network structure simultaneously with the search for conditional probabilities within the EM algorithm. As such, it can be computationally very demanding. Moreover, the author has shown its performance only when the missing data is MCAR. It is unclear how its performance will be affected if the data is missing at random (MAR).

5.5 Learning Bayesian Network Structure and Parameters from Incomplete Data

In this section, I present a principled method for learning not only the conditional probabilities but also the Bayesian network structure from data with missing values. Unlike previous methods, the proposed algorithm correctly handles data missing under both MCAR and MAR assumptions. The algorithm can also be extended to handle the NMAR case, but due to complexity considerations, I just suggest possible ways of doing so, leaving the complete work as a topic of future research.

As discussed in Section 5.2, it is possible to model the missing data mechanism by augmenting the dataset $D$ (consisting of $n$ attributes, $X_1, X_2, \ldots, X_n$) by $n$ more attributes, $R_1, R_2, \ldots, R_n$, such that $R_i$ takes value 1 in case $l$ if $X_i$ is observed in case $l$, and 0 otherwise. Bayesian networks are well suited to analyzing such data. One could learn a Bayesian network from the augmented database, thereby automatically encoding the missing data mechanisms (e.g. an arc from an original attribute, say $X_j$, to a new attribute, say $R_k$, would model the fact that the value of $X_k$ may be missing based on the value of $X_j$). This approach can be used regardless of the mechanism leading to the missing data. However, this method has two main drawbacks. First, the number of parameters to be estimated increases, which not only increases the time for learning the Bayesian network but may also lead to unreliable estimates, especially if the amount of data available is small. Second, doubling the number of attributes increases the size of the network, which makes inference all the more complex.

However, as discussed in Section 5.2, if the missing data is MAR or MCAR, then the maximum likelihood estimate of a parameter of interest can be determined by ignoring the missing data mechanism. In other words, the maximum likelihood estimate of the Bayesian network given the original (incomplete) data, without taking the missing data mechanism into account, will be the same as the estimate obtained from the augmented dataset (which takes the mechanism into account). Thus, if we assume that the missing data is MCAR or MAR, and use a likelihood-based approach such as EM to estimate the most likely Bayesian network given the incomplete data, then we do not need to model the missing data mechanism by adding additional nodes and learning from the augmented database (as described above), since that will not result in any improvement in the quality of the BN learned. Note that this advantage is lost if methods other than likelihood-based approaches are used.²

If, however, the missing data is NMAR, then the missing data mechanism must always be explicitly modeled, and can be handled as described in the previous paragraph. In any case, as described in Section 5.3.2, the method of treating missing values as an extra value may be appropriate for some attributes in certain domains. As such, it would be much simpler to use this strategy where appropriate. Here, I restrict myself to problems where either this technique is not suitable, or it is suitable for only some of the domain attributes and those have already been handled appropriately (using this technique).

I now develop an EM algorithm to learn both Bayesian network structure and parameters from incomplete data, under the assumption that the missing data mechanism is either MCAR or MAR.
The proposed algorithm formalizes the following scheme: it uses the current estimate of the structure and the incomplete data to refine the conditional probabilities, then imputes new values for missing data points by sampling from the new estimate of the conditional probability distributions, and then refines the network structure from the new estimate of the data using standard algorithms for learning Bayesian networks from complete data. This process is repeated until convergence.

As explained in the previous section, the EM algorithm can be easily used to estimate the conditional probabilities from the missing data, assuming that the structure is known. However, since the EM algorithm does not "fill in" values but, rather, estimates the conditional expectation of the complete-data log-likelihood, it cannot be used directly in the method described above. Nevertheless, as described below, the EM algorithm can be combined with (multiple) imputation techniques to yield a Monte Carlo method to learn both the network structure and the conditional probabilities.

²For Bayesian approaches, the missing data mechanism can be similarly ignored for MAR and MCAR data by making additional assumptions (cf. (Ghahramani and Jordan, 1994)).

As discussed in Section 5.3.4, the E step of the EM algorithm finds the conditional expectation of the complete-data log-likelihood, given the observed component of the data and the current estimate of the parameter. The E step can be described as follows:

$$Q(\theta \mid \theta^{(t)}) = \int l(\theta \mid D)\, P(D^{mis} \mid D^{obs}, \theta^{(t)})\, dD^{mis} = \int l(\theta \mid \langle D^{obs}, D^{mis} \rangle)\, P(D^{mis} \mid D^{obs}, \theta^{(t)})\, dD^{mis} \qquad (5.11)$$

Wei and Tanner (1990) point out that it is possible to approximate Equation 5.11 by using the Monte Carlo method as follows:

$$Q(\theta \mid \theta^{(t)}) = \frac{1}{M} \sum_{s=1}^{M} l(\theta \mid \langle D^{obs}, D^{mis}_{\langle s \rangle} \rangle). \qquad (5.12)$$

Thus, the current approximation to $\theta$, $\theta^{(t)}$, is used to impute $M$ samples, $D^{mis}_{\langle 1 \rangle}, \ldots, D^{mis}_{\langle M \rangle}$, from the current approximation to $P(D^{mis} \mid D^{obs}, \theta^{(t)})$, which are then used to estimate $Q(\theta \mid \theta^{(t)})$ as a mixture of the complete-data log-likelihood functions (mixed over the $M$ imputed samples). The M step then simply involves maximizing the right-hand side of Equation 5.12.

Now, let $B_S$ represent the Bayesian network structure to be estimated and $\theta$ represent the corresponding conditional probabilities. Then, in order to find the maximum likelihood estimate of $B_S$, Equation 5.11 can be rewritten as

$$Q(B_S \mid B_S^{(t)}) = \int l(B_S \mid \langle D^{obs}, D^{mis} \rangle)\, P(D^{mis} \mid D^{obs}, B_S^{(t)})\, dD^{mis}. \qquad (5.13)$$

Then, using the Monte Carlo method, Equation 5.13 can be written as

$$Q(B_S \mid B_S^{(t)}) = \frac{1}{M} \sum_{s=1}^{M} l(B_S \mid \langle D^{obs}, D^{mis}_{\langle s \rangle} \rangle) = \frac{1}{M} \sum_{s=1}^{M} \log P(\langle D^{obs}, D^{mis}_{\langle s \rangle} \rangle \mid B_S) \qquad (5.14)$$

where $M$ is the number of imputations.

In order to compute $Q(B_S \mid B_S^{(t)})$ via Equation 5.14, we need to compute $P(\langle D^{obs}, D^{mis} \rangle \mid B_S)$ as well as $P(D^{mis} \mid D^{obs}, B_S)$. Now, $P(D^{mis} \mid D^{obs}, B_S^{(t)})$ is given by the equation

$$P(D^{mis} \mid D^{obs}, B_S^{(t)}) = \int P(D^{mis} \mid D^{obs}, B_S^{(t)}, \theta)\, P(\theta \mid D^{obs}, B_S^{(t)})\, d\theta \qquad (5.15)$$

Applying the Monte Carlo method again, Equation 5.15 can be approximated as

$$P(D^{mis} \mid D^{obs}, B_S^{(t)}) = \frac{1}{T} \sum_{r=1}^{T} P(D^{mis} \mid D^{obs}, B_S^{(t)}, \theta_{\langle r \rangle}) \qquad (5.16)$$

where $\theta_{\langle 1 \rangle}, \ldots, \theta_{\langle T \rangle}$ are $T$ samples of $\theta$, imputed from the current approximation of the distribution $P(\theta \mid D^{obs}, B_S^{(t)})$, used to estimate $P(D^{mis} \mid D^{obs}, B_S^{(t)})$ as described above. Given $D^{obs}$, $B_S^{(t)}$ and $\theta_{\langle r \rangle}$, computation of $P(D^{mis} \mid D^{obs}, B_S^{(t)}, \theta_{\langle r \rangle})$ is straightforward. For each missing data point, this posterior distribution can be computed by instantiating the Bayesian network $\langle B_S^{(t)}, \theta_{\langle r \rangle} \rangle$ to the observed data in the corresponding case and propagating this evidence through the network using one of several well-known inference algorithms.

Now, if the distribution $P(\theta \mid D^{obs}, B_S^{(t)})$ is known, then the left-hand side of Equation 5.16 can be computed by generating the samples $\theta_{\langle 1 \rangle}, \ldots, \theta_{\langle T \rangle}$ from this distribution, computing the distributions $P(D^{mis} \mid D^{obs}, B_S, \theta_{\langle r \rangle})$, $1 \leq r \leq T$, and then updating the current approximation to $P(D^{mis} \mid D^{obs}, B_S^{(t)})$ to be the mixture of these distributions. In practice, this may not always be possible. However, we can obtain a reasonable approximation to Equation 5.16 by using a "good" estimate of $\theta$, say $\hat{\theta}$, given the observed data $D^{obs}$ and the current estimate of the network structure $B_S$. Then, Equation 5.16 may be rewritten as

$$P(D^{mis} \mid D^{obs}, B_S) = P(D^{mis} \mid D^{obs}, B_S, \hat{\theta}) \qquad (5.17)$$

This leads us to the following algorithm for learning both the BN structure as well as the conditional probabilities from incomplete data.

1. Initialization: Create $M$ complete datasets, $D^{(0)}_{\langle s \rangle}$, $1 \leq s \leq M$, by imputing $M$ values for each missing data point in $D^{mis}$. This can be done using several methods, including generating the imputations randomly, or by sampling from the prior distribution of each attribute, etc.

2. Find the BN structure that maximizes the mixture of the complete-data log-likelihood functions over the $M$ latent-data patterns, i.e.

$$B_S^{(t+1)} = \arg\max_{B_S} \frac{1}{M} \sum_{s=1}^{M} \log P(D^{(t)}_{\langle s \rangle} \mid B_S) = \arg\max_{B_S} \frac{1}{M} \sum_{s=1}^{M} \log P(\langle D^{obs}, D^{mis,(t)}_{\langle s \rangle} \rangle \mid B_S)$$

3. Use the EM algorithm to estimate the conditional probabilities using the observed data, $D^{obs}$, and the current BN structure estimate, $B_S^{(t+1)}$, i.e.

$$\theta^{(t+1)} = \arg\max_{\theta} \log P(D^{obs} \mid B_S^{(t+1)}, \theta)$$

4. If the convergence criterion is satisfied, stop. Else, go to step 5.

5. Use the current BN estimate, $\langle B_S^{(t+1)}, \theta^{(t+1)} \rangle$, along with the observed data, $D^{obs}$, to impute $M$ new complete datasets, $D^{(t+1)}_{\langle s \rangle} = \langle D^{obs}, D^{mis,(t+1)}_{\langle s \rangle} \rangle$, $1 \leq s \leq M$. In order to do this, do the following for each case $d_l$: instantiate the BN to the observed data $d_l^{obs}$ in case $d_l$, perform probabilistic inference on the BN to compute the posterior distribution $P(d \mid d_l^{obs}, \langle B_S^{(t+1)}, \theta^{(t+1)} \rangle)$ for each missing data point $d$ in that case, and then impute the required number of values for the concerned missing data point from this distribution.

6. $t \leftarrow t + 1$. Go to Step 2.

I refer to this algorithm as the MIEM-BN algorithm since it combines Multiple Imputation with the EM algorithm to learn Bayesian Networks.
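The overall loop can be sketched on a deliberately tiny two-attribute problem (in Python; the data are illustrative, a penalized complete-data log-likelihood stands in for the Bayesian scoring metrics discussed below, and the parameter-estimation step is collapsed to a closed-form pass rather than a full EM run):

```python
import math, random
random.seed(3)

# Toy skeleton of the MIEM-BN loop for binary attributes X and Y, with
# candidate structures "X,Y independent" and "X -> Y". All data and the
# BIC-style score are illustrative stand-ins.
data = [(0, 0)] * 30 + [(0, 1)] * 10 + [(1, 0)] * 10 + [(1, 1)] * 30
data += [(0, None)] * 10 + [(1, None)] * 10      # Y missing in some cases
M = 5                                            # number of imputations

def loglik(completed, structure):
    n = len(completed)
    px = sum(x for x, _ in completed) / n
    ll = sum(math.log(px if x else 1 - px) for x, _ in completed)
    if structure == "independent":
        py = sum(y for _, y in completed) / n
        ll += sum(math.log(py if y else 1 - py) for _, y in completed)
        k = 2
    else:  # "X -> Y": one conditional distribution of Y per value of X
        for xv in (0, 1):
            ys = [y for x, y in completed if x == xv]
            py = sum(ys) / len(ys)
            ll += sum(math.log(py if y else 1 - py) for y in ys)
        k = 3
    return ll - 0.5 * k * math.log(n)            # penalize extra parameters

def impute(theta):
    # Step 5: draw one completed dataset, filling Y from P(Y=1|X)=theta[x]
    return [(x, y if y is not None else int(random.random() < theta[x]))
            for x, y in data]

theta = {0: 0.5, 1: 0.5}
structure = "independent"
for t in range(10):
    imputations = [impute(theta) for _ in range(M)]
    # Step 2: pick the structure maximizing the average log-likelihood
    structure = max(("independent", "X -> Y"),
                    key=lambda s: sum(loglik(d, s) for d in imputations) / M)
    # Step 3, collapsed: refit P(Y|X) from the observed portion of the data
    for xv in (0, 1):
        ys = [y for x, y in data if x == xv and y is not None]
        theta[xv] = sum(ys) / len(ys)

print(structure)   # the dependence between X and Y should be detected
```

The strong dependence in the data makes "X -> Y" win the structure step on every set of imputations, mirroring the intended behaviour of Steps 2, 3 and 5 of the algorithm.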

I now discuss Step 2 of the algorithm in more detail. Step 2 requires the determination of the BN structure that maximizes the mixture of the complete-data log-likelihood functions over the M latent data patterns. Two steps are involved in this exercise: (a) computing the complete data log-likelihood function given a BN structure, and (b) a search over the space of all possible BN structures to determine the maximizing structure. Several methods are available to compute the complete-data log-likelihood function for a given BN structure (Cooper and Herskovits, 1992; Heckerman et al., 1995) under di erent sets of assumptions such as the types of priors assumed for the conditional probabilities, etc. However, the second task (step (b) above) is computationally expensive since the search space is too large; thus, a less than optimal search strategy has to be used.

Here, I discuss a greedy algorithm to determine B_S^{(t+1)}. As pointed out by Heckerman et al. (1995), most known Bayesian and non-Bayesian metrics for evaluating a BN structure are "decomposable", i.e., the metric can be written as a product of measures, each of which is a function of only one attribute and its parents. For example, for the metrics proposed by Cooper and Herskovits (1992) (assuming uniform priors on the conditional probabilities) and Heckerman et al. (1995) (assuming Dirichlet priors on the conditional probabilities), the complete-data likelihood function for a BN structure can be written as

   P(D \mid B_S) = \prod_{i=1}^{n} g(i, \pi_i)    (5.18)

where n is the number of variables, \pi_i is the set of parents of node i, and g(i, \pi_i) is a function that depends only on the instantiations of x_i and \pi_i in the database D. To maximize Equation 5.18, we need only find the parent set of each variable that maximizes g(i, \pi_i). Thus, we have

   \max_{B_S} P(D \mid B_S) = \prod_{i=1}^{n} \max_{\pi_i} g(i, \pi_i)    (5.19)

Incorporating this in Step 2 of the algorithm, we get

   B_S^{(t+1)} = \arg\max_{B_S} \frac{1}{M} \sum_{s=1}^{M} \log P(D_{\langle s \rangle}^{(t)} \mid B_S)
              = \arg\max_{B_S} \frac{1}{M} \sum_{s=1}^{M} \sum_{i=1}^{n} \log g_s(i, \pi_i)
              = \arg\max_{B_S} \sum_{i=1}^{n} \sum_{s=1}^{M} \log g_s(i, \pi_i)
              = \arg\max_{B_S} \sum_{i=1}^{n} G(i, \pi_i)    (5.20)

where g_s(i, \pi_i) is the function g(i, \pi_i) calculated on D_{\langle s \rangle}^{(t)}, G(i, \pi_i) = \sum_{s=1}^{M} \log g_s(i, \pi_i), and the constant factor 1/M has been dropped since it does not affect the maximizing structure. The right-hand side of Equation 5.20 can be maximized by maximizing each term of the outer summation independently. In other words, the Bayesian network structure that maximizes the mixture of the complete-data log-likelihood functions over the M latent-data patterns in Step 2 of the algorithm can be determined by independently finding, for each attribute, the parent set that maximizes

G(i, \pi_i). However, it is still computationally expensive to perform the maximization over all possible parent sets for each node. To reduce the complexity further, we assume that a total ordering is available on the attributes such that, for any node, only nodes that precede it in the ordering are allowed to be its parents. Then, for each attribute i, the algorithm starts with an empty parent set and incrementally adds the node (from the set of nodes preceding i in the ordering) that increases G(i, \pi_i) by the maximum amount. Attributes are added as long as the value increases. Since a greedy algorithm is used, it is not guaranteed to find the BN that maximizes Equation 5.20. Note, however, that we do not need to find the optimal ("maximizer") structure. It suffices to find a structure that increases the value of Q(B_S \mid B_S^{(t)}), i.e., to choose B_S^{(t+1)} such that Q(B_S^{(t+1)} \mid B_S^{(t)}) > Q(B_S^{(t)} \mid B_S^{(t)}) (Section 5.3.4). As discussed above, the algorithm assumes that a total ordering on the variables is
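The greedy parent search over a decomposable score can be sketched as follows. Assuming discrete, complete datasets and uniform parameter priors, the per-family term log g(i, \pi_i) below is the Cooper-Herskovits metric; the helper names are my own, not the dissertation's code.

```python
import math

def log_g(i, parents, rows, r):
    """log g(i, pi_i): Cooper-Herskovits score for one complete dataset.

    `rows` is a list of dicts (one per case); `r[v]` is the number of
    states of attribute v.  Parent configurations that never occur
    contribute log 1 = 0 and are simply omitted from the sum.
    """
    counts = {}  # parent instantiation -> {child value: count}
    for row in rows:
        key = tuple(row[p] for p in parents)
        child = counts.setdefault(key, {})
        child[row[i]] = child.get(row[i], 0) + 1
    total = 0.0
    for child in counts.values():
        n_ij = sum(child.values())
        # log[(r_i - 1)! / (N_ij + r_i - 1)!] + sum_k log(N_ijk!)
        total += (math.lgamma(r[i]) - math.lgamma(n_ij + r[i])
                  + sum(math.lgamma(c + 1) for c in child.values()))
    return total

def greedy_parents(i, predecessors, datasets, r):
    """Grow pi_i greedily to maximize G(i, pi_i) = sum_s log g_s(i, pi_i).

    `predecessors` are the nodes preceding i in the total ordering, and
    `datasets` are the M imputed complete datasets.
    """
    parents = []
    best = sum(log_g(i, parents, rows, r) for rows in datasets)
    while True:
        scored = [(sum(log_g(i, parents + [p], rows, r)
                       for rows in datasets), p)
                  for p in predecessors if p not in parents]
        if not scored:
            break
        score, p = max(scored)
        if score <= best:        # stop as soon as no addition helps
            break
        best, parents = score, parents + [p]
    return parents
```

Because the score decomposes, `greedy_parents` can be run independently for each attribute, exactly as the text argues for Equation 5.20.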

available as an input. This assumption can be relaxed by also learning the ordering from the dataset prior to network induction (cf. (Singh and Valtorta, 1995)). However, since there are M complete datasets, potentially M different orderings will be learned, and they may not all be consistent with each other. Keeping this in mind, Steps 2 and 3 of the algorithm can be replaced by the following in order to relax the ordering assumption:

2. For s := 1 to M do

   2a. From the complete dataset D_{\langle s \rangle}^{(t)}, induce the Bayesian network structure, B_{S\langle s \rangle}^{(t+1)}, that maximizes P(D_{\langle s \rangle}^{(t)} \mid B_S).

   2b. Use the EM algorithm to learn the conditional probabilities \Theta_{\langle s \rangle}^{(t+1)} using the original, incomplete data D and the network structure B_{S\langle s \rangle}^{(t+1)}.

3. Fuse the networks to create a single Bayesian network \langle B_S^{(t+1)}, \Theta^{(t+1)} \rangle as follows. Construct the network structure B_S^{(t+1)} by taking the arc-union of the individual network structures, i.e., B_S^{(t+1)} = \bigcup_{s=1...M} B_{S\langle s \rangle}^{(t+1)}. If the orderings imposed on the attributes by the various network structures are not consistent, then it is possible to construct B_S^{(t+1)} by choosing one of the orderings, making all the other network structures consistent with this ordering by performing the necessary arc reversals, and then taking the graph union of all the resultant structures; Matzkevich and Abramson (1993) describe an efficient algorithm to do so. Then, create \Theta^{(t+1)} by taking a weighted average of the individual distributions.

The remaining steps remain unchanged.

5.6 Summary

In this chapter, I described a new method for learning both the Bayesian network structure as well as the conditional probability tables from incomplete data. Previous methods for learning Bayesian networks from data often assume that the data is complete, or restrict themselves to the task of learning conditional probabilities assuming that the structure is known. The method combines the well-known Expectation-Maximization (EM) algorithm with another widely used statistical technique, multiple imputation, to yield an efficient method for learning Bayesian networks in the presence of missing data. In the following chapter, I carry out a detailed evaluation of the MIEM-BN algorithm by varying both the amount of missing data as well as the assumptions about the missing-data mechanisms.


Chapter 6

Learning Bayesian Networks From Incomplete Data: Evaluation

In order to evaluate the performance of the MIEM-BN algorithm, I carried out a series of experiments by varying parameters such as the amount of missing data, the missing-data mechanism and the dataset size. The main objectives of these experiments are described in Section 6.1. Section 6.2 describes the dataset used in these experiments, while the experimental methodology is discussed in Section 6.3. In Section 6.4, I describe and discuss the results of these experiments. Finally, I summarize the results in Section 6.5.

6.1 Objectives

As described in Chapter 5, the MIEM-BN algorithm should correctly handle missing data under both the MCAR and the MAR assumption. Thus, one of the main objectives of the experiments was to evaluate the quality of the Bayesian networks learned by the MIEM-BN algorithm from incomplete datasets where the data was missing under either

the MCAR or the MAR assumption. Moreover, as discussed in Section 5.5, the MIEM-BN algorithm allows a trade-off between computational complexity and model quality via the number of imputations used (Step 2). As such, another objective of the experiments was to evaluate the effect of the number of imputations used on the quality of the induced network. Finally, both the size of the dataset and the amount of missing data should have an effect on the quality of the network learned by the MIEM-BN algorithm. Therefore, I was also interested in evaluating the performance of the MIEM-BN algorithm as a function of dataset size as well as the amount of missing data.

6.2 Description of the Dataset Used

For the experiments, I used a dataset of 10,000 cases created from the ALARM network (Beinlich et al., 1989) using the case-generation facility of HUGIN (Anderson et al., 1989). The ALARM network was constructed as a prototype to model potential anesthesia problems that could arise in the operating room, and is a relatively large network consisting of 37 attributes and 46 arcs representing 8 diagnostic problems, 16 findings and 13 intermediate variables that relate diagnostic problems to findings. Since the quality of the Bayesian network induced by the MIEM-BN algorithm could be easily evaluated by comparing it with the actual model (the gold standard), it was possible to carry out a systematic evaluation of the algorithm by controlling both the amount of missing data and the missing-data mechanism. The methodology used for the experiments is described in the following section.
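Case generation from a known network, as provided by HUGIN, amounts to ancestral (forward) sampling: visit the nodes in topological order and sample each from its conditional distribution given the already-sampled parents. A minimal sketch of that idea, with my own data layout (not HUGIN's API):

```python
import random

def generate_cases(order, cpts, n, seed=0):
    """Generate complete cases from a discrete BN by ancestral sampling.

    `order` is a topological ordering of (node, parents) pairs, and
    `cpts[node]` maps a tuple of parent values to a {value: prob} dict.
    """
    rng = random.Random(seed)
    cases = []
    for _ in range(n):
        case = {}
        for node, parents in order:
            # Parents precede node in `order`, so their values exist.
            dist = cpts[node][tuple(case[p] for p in parents)]
            values, probs = zip(*dist.items())
            case[node] = rng.choices(values, weights=probs)[0]
        cases.append(case)
    return cases
```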

6.3 Experimental Methodology

In order to evaluate the performance of the MIEM-BN algorithm, I used the following methodology. First, I systematically removed data from the dataset under different assumptions (MCAR or MAR). Then, I used the MIEM-BN algorithm to learn a Bayesian network from the incomplete data. Finally, I measured the quality of the network learned by comparing it with the gold-standard network. The incomplete datasets were created as described below:

1. MCAR: To remove data under the MCAR assumption, the following method was used:

   1. Let \tau be the required percentage of missing data.
   2. For each case l in the dataset, and each attribute x_i, remove the observed value of x_i in case l with probability \tau.

   This ensured that, at any stage, the probability that a value was missing did not depend on either the observed or the missing component of the dataset.

2. MAR: To generate missing data under the MAR assumption, the following method was used:

   1. Let \tau be the required percentage of missing data.
   2. If the current percentage of missing data equals \tau, then stop. Else, go to Step 3.
   3. Generate a MAR rule as follows:
      a. Randomly select an attribute x_i.
      b. Randomly select a set of attributes Y, as well as an instantiation y of Y.
      c. Randomly select a probability p.
   4. For each case k, apply the MAR rule as follows: if, in case k, the set of attributes Y is observed and is instantiated to y, then remove the observed

value of attribute x_i with probability p. This ensures that the probability that the value of x_i is missing depends only on the observed component of the data in that case, i.e., the data is MAR.

   5. Go to Step 2.

I removed various amounts of data under the two assumptions, and used the MIEM-BN algorithm to learn both the Bayesian network structure and the conditional probability tables from the incomplete data. I used the log-likelihood of the data to assess the progress of the algorithm, stopping when the change in the log-likelihood, relative to the previous iteration, was less than 10^{-4}. Note that the log-likelihood function can be evaluated without any extra work by simply summing, over all cases, the log of the normalization constant obtained after propagating the evidence in each case. To evaluate the log-likelihood value of a Bayesian network structure, given a complete database, I used the Bayesian metric proposed by Cooper and Herskovits (1992) in Step 2 of the algorithm, as described in Section 5.5. This metric assumes a uniform prior on the conditional probabilities. Note that, as described in Section 5.5, we can easily incorporate other metrics, such as the one proposed by Heckerman et al. (1995) assuming Dirichlet priors. To measure the quality of the learned network, I computed the cross-entropy (Kullback and Leibler, 1951) from the actual distribution (represented by the gold standard) to the distribution represented by the learned network. The cross-entropy measure is defined as follows. Let P denote the joint distribution represented by the gold-standard network and Q the joint distribution represented by the learned network. Then, the cross-entropy H(P, Q) is given by

   H(P, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}    (6.1)
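For a network small enough to enumerate, Equation 6.1 can be evaluated by brute force over all joint configurations. The sketch below is my own helper, not the efficient decomposition of Heckerman et al. (1995) that the text actually uses; for a network the size of ALARM the exponential sum would be infeasible.

```python
import math
from itertools import product

def cross_entropy(p_joint, q_joint, domains):
    """Brute-force Kullback-Leibler divergence of Equation 6.1.

    `p_joint` and `q_joint` map a full configuration x (a tuple of values,
    one per variable) to its probability; `domains` lists each variable's
    possible values.  Terms with P(x) = 0 contribute nothing.
    """
    kl = 0.0
    for x in product(*domains):
        p = p_joint(x)
        if p > 0.0:
            kl += p * math.log(p / q_joint(x))
    return kl
```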

Low values of cross-entropy correspond to a learned distribution that is close to the gold standard. I used the method suggested by Heckerman et al. (1995) to evaluate Equation 6.1 efficiently, without summing over an exponential number of terms. Another metric I used to evaluate the quality of the learned Bayesian network is its observed-data log-likelihood. Since the true Bayesian network is known, the "optimal" value, P(D^{obs} \mid B_S, \Theta), can be easily computed, where \langle B_S, \Theta \rangle is the gold standard. This metric gives an estimate of the "best" model that can possibly be learned from the incomplete data, given the missing information. Ideally, the estimate of the Bayesian network at every stage of the algorithm should approach this optimal value, converging to it asymptotically.
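The MCAR and MAR removal procedures of Section 6.3 can be sketched as follows. This is an illustration under my own assumptions: the input cases are complete, `None` marks a deleted value, the MAR rule's conditioning set Y is fixed at (up to) two attributes, and its instantiation y is borrowed from a randomly chosen case so that at least some cases match the rule.

```python
import random

def remove_mcar(data, tau, seed=0):
    """Delete each observed value independently with probability tau (MCAR)."""
    rng = random.Random(seed)
    return [{a: None if rng.random() < tau else v for a, v in row.items()}
            for row in data]

def remove_mar(data, tau, seed=0, max_rules=10_000):
    """Apply randomly generated MAR rules until a fraction tau is missing.

    Each rule deletes attribute x_i with probability p in every case where
    a chosen attribute set Y is still observed with instantiation y, so
    missingness depends only on the observed component of the data.
    """
    rng = random.Random(seed)
    out = [dict(row) for row in data]
    attrs = list(data[0])
    total = len(out) * len(attrs)

    def frac_missing():
        return sum(v is None for row in out for v in row.values()) / total

    for _ in range(max_rules):
        if frac_missing() >= tau:
            break
        xi = rng.choice(attrs)                       # attribute to delete
        y_attrs = rng.sample([a for a in attrs if a != xi],
                             min(2, len(attrs) - 1))
        donor = rng.choice(data)                     # supplies the values y
        p = rng.random()
        for row in out:
            if all(row[a] == donor[a] and row[a] is not None
                   for a in y_attrs) and rng.random() < p:
                row[xi] = None
    return out
```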

6.4 Results and Discussion

I carried out a series of experiments by varying four parameters: (a) the mechanism leading to the missing values, (b) the amount of missing data, (c) the number of imputations used and (d) the number of cases used for learning. I present and discuss the results of some of these experiments below. Figures 6.1-6.4 show the results of evaluating the algorithm on a dataset of 10,000 cases with varying amounts of missing data under either the MCAR or the MAR assumption. In all cases, the number of imputations used did not greatly affect the final result, probably because the large dataset ensured that, even with missing values, there was sufficient "structure" in the observed data to recover the underlying model easily. Thus, these figures show the algorithm's performance for only a specific number of imputations in each case. Figure 6.1 shows the performance of the algorithm on data containing about 20% missing values, removed completely at random (MCAR), using single imputation. The figures plot the improvement in the cross-entropy and the log-likelihood function against the progress of the algorithm. Similarly, Figure 6.2 depicts the results for data containing about 40% missing values, removed completely at random, using 2 imputations. An important point to note in these figures is that

[Figure 6.1: Learning From a 10,000 Case Dataset With 20% MCAR Data Using 1-Imputation: (a) Cross-entropy (b) Log-likelihood]

the first plotted value in each graph (shown as the Y-value for iteration 1) represents the Bayesian network learned from the data that had been made complete by filling in missing values using the prior distributions (unconditional imputation). Clearly, the learned Bayesian network in this case is far off from the true model, which strengthens our claim that ad hoc methods used to handle missing values often result in very poor models. Similar results would be expected from another often-used approach of replacing each missing value by its (sample) mean, since that is a special case of the above method. For the case where only 20% of the data is missing under the MCAR assumption, the algorithm induces a Bayesian network with a joint distribution that is very close to the true representation (as represented by the gold-standard network), with a cross-entropy of only 0.05. As a reference point, the cross-entropy value for the Bayesian network learned by the K2 algorithm from the complete dataset was 0.04. As the amount of missing data increases to 40%, the deviation of the learned distribution from the

[Figure 6.2: Learning From a 10,000 Case Dataset With 40% MCAR Data Using 2-Imputations: (a) Cross-entropy (b) Log-likelihood]

true distribution also increases (the cross-entropy is 0.14). However, this is to be expected, since the quality of the ML estimate is directly related to the amount of missing information. Nevertheless, the learned Bayesian network is still much better than the one learned by replacing missing values by sampling from unconditional distributions. Also, in each case, the log-likelihood of the learned network converged to a value that was very close to the "optimal" value (the observed-data log-likelihood of the true model, shown by the thick, horizontal line). This indicates that the model learned in each case is very close to the best model that could possibly be learned from the incomplete data. Another important issue is the rate of convergence of the algorithm under various assumptions. Whereas the labels along the X-axis show the number of iterations of the main (outer) loop of the algorithm, the marks on this axis show the number of iterations of the EM algorithm while learning the conditional probabilities from the missing data given the current estimate of the network structure. Clearly, the number of iterations needed for

[Figure 6.3: Learning From a 10,000 Case Dataset With 20% MAR Data Using 4-Imputations: (a) Cross-entropy (b) Log-likelihood]

convergence increases with the amount of missing data, both in terms of the number of times the network structure is revised (main loop) and the number of times the conditional probabilities have to be updated (inner, EM loop). Again, this is to be expected, since the rate of convergence should be proportional to the fraction of information in the observed data, D^{obs}. Figures 6.3-6.4 show the results of testing the algorithm on 10,000-case datasets where approximately 20% or 40% of the values were missing at random (MAR), respectively. As in the MCAR case, the number of imputations did not significantly affect the quality of the model learned. However, convergence was faster, especially when the amount of missing data was 40%, when more imputations were used. Fewer imputations resulted in a larger variance of the log-likelihood function, which took longer to stabilize. The system variability gets smaller as the number of imputations is increased, leading to faster convergence of the algorithm. This improvement, both in terms of the quality of the network learned as well as in

[Figure 6.4: Learning From a 10,000 Case Dataset With 40% MAR Data Using 4-Imputations: (a) Cross-entropy (b) Log-likelihood]

the convergence rate, is expected to become more pronounced as the amount of missing data increases and/or the size of the available dataset decreases. In either case, there will be much more uncertainty about the missing values than in the current situation. As such, I also carried out experiments by reducing the size of the dataset to 1000 as well as 100 cases. As far as datasets of 1000 cases were concerned, once again, there was not a significant difference in the quality of the final model learned. However, the time taken by the algorithm to converge dropped significantly as the number of imputations was increased, especially for the MAR case. Figures 6.5 and 6.6 show the results of applying the proposed algorithm to datasets containing 1000 cases with 20% or 40% of the data missing at random. As is evident from Figure 6.5, using two imputations instead of one resulted in much faster convergence and a slightly better model. Figure 6.6 shows one of the potential pitfalls of using single

[Figure 6.5: Learning From a 1,000 Case Dataset With 20% MAR Data: (a) Cross-entropy (b) Log-likelihood]

imputation. As shown in the figure, the algorithm using single imputation converged fairly rapidly to a very bad model. On the other hand, when the algorithm used multiple imputations (four in the case shown), it converged to a significantly better model, much closer to the true model. This is probably due to the fact that single imputation does not, in any way, represent the uncertainty behind the missing values, and treats the imputed values as if they were in fact the actual, missing values. As such, for certain patterns of missing data, especially if the amount of missing data is large, the algorithm may quickly converge to a bad local maximum (as for the single-imputation case in Figure 6.6). Since multiple imputation models the uncertainty behind each missing value, it is better suited to handle such situations and thus performs relatively well. In both cases, the deviation of the learned model from the true Bayesian network is larger than for the models learned using 10,000 cases. However, this is expected, since the

[Figure 6.6: Learning From a 1,000 Case Dataset With 40% MAR Data: (a) Cross-entropy (b) Log-likelihood]

smaller dataset has much less information than the corresponding bigger one. However, note that the models learned are still fairly close to the best model that could potentially be learned from the 1000-case datasets. As the size of the datasets was further reduced to 100 cases, the advantages of multiple imputation over single imputation became much more pronounced. Figures 6.7 and 6.8 show the results of applying the algorithm to 100-case datasets with 20% of the values missing completely at random and 40% of the values missing at random, respectively. When only one or two imputations were used, the system exhibited very high variability and took significantly longer to converge. However, as the number of imputations was increased, the algorithm converged much faster, and the quality of the network learned was also better. These differences were much greater for the 40% MAR case, where 10 imputations were needed to ensure quick convergence. Once again, in either case, the Bayesian network learned was fairly close to the best model that could be learned from

[Figure 6.7: Learning from a 100 case dataset with 20% MCAR data: (a) Cross-entropy (b) Log-likelihood]

the given dataset, though further away from the actual model than the models learned when the size of the dataset was larger. These experiments show that the number of imputations does not have a significant impact on the performance of the algorithm if the size of the dataset is large, especially if few values are missing. However, if less data is available and/or the amount of missing data is high, then it may be better to use multiple imputations to ensure quicker convergence and better models. Since multiple imputation does not increase the complexity of the algorithm by a significant amount, it might be better to always use multiple imputations. Alternatively, one could simultaneously run the algorithm with different numbers of imputations, monitor the improvement in the likelihood as the algorithm progresses, and choose the one that converges most rapidly.

[Figure 6.8: Learning From a 100 Case Dataset With 40% MAR Data: (a) Cross-entropy (b) Log-likelihood]

6.5 Summary

A systematic, principled approach to learning Bayesian networks, both structure and conditional probabilities, from incomplete data has been presented. The proposed algorithm is an iterative method that successively refines the learned Bayesian network through a combination of Expectation-Maximization (EM) and imputation techniques in order to find the maximum-likelihood estimate of the Bayesian network given the observed data. Previous methods generally assume that the Bayesian network structure is known, and restrict themselves to the task of estimating the conditional probabilities. The few methods that do attempt to learn both structure and probabilities can generally only handle the case where the missing data is MCAR. Stochastic methods that can handle the MAR case as well, on the other hand, are computationally very expensive,

making their application to most real-world problems infeasible. The proposed method correctly handles both the MCAR and the MAR case, and converges relatively rapidly. Moreover, the experimental results show that not only is the quality of the Bayesian networks learned by the new algorithm from incomplete data much better than that of networks learned using some common, ad hoc methods, but the learned distribution (as encoded by the induced Bayesian network) is also fairly close to the true distribution, especially when the amount of missing data is not very large. Also, the resultant networks are almost as good as the best model that could possibly be learned from the available information in the incomplete dataset. However, despite these encouraging results, there are several issues that must be explored further. First, further experimentation should be done using other evaluation functions, such as classification accuracy or, ideally, utility functions. This is especially important since, for real-world problems, the true model is unknown and hence the Kullback-Leibler distance cannot be measured. Moreover, further experiments are needed on additional datasets, especially from real-world domains. Another issue that should be explored further involves the method used, in Step 2 of the algorithm, to update the Bayesian network structure from the updated data. Currently, at the beginning of each iteration, the previous estimate of the network structure is discarded, and a new structure is learned from scratch. However, it is possible to modify Step 2 so that incremental changes are made to the previous network structure to get a better estimate. Instead of starting with an empty set of parents (as is done currently), the algorithm can be modified to start with the same set of parents as the current estimate, and incrementally add or remove attributes from the set of parents as long as the value of the metric increases.
This method should lead to faster convergence of the algorithm. Finally, I have not explicitly handled cases where some of the attributes are never observed, i.e., hidden or latent variables. In such a case, since the values are always missing, the absence of the data is independent of the values of the various attributes, i.e., the data is MCAR. In this respect, this problem is somewhat simpler than the problem of partially missing data. The proposed algorithm can easily be modified to handle latent variables as well. In this case, one can skip Step 1 of the algorithm (initialization), and instead start with an initial approximation of the network structure in Step 2. The problem then is to decide the set of nodes to which the hidden variables are connected in the initial structure. Heckerman (1995) and Friedman (1997) discuss various places in the network structure where hidden variables may be present. Given an initial network structure, the algorithm then proceeds as before, successively improving the network until convergence is achieved.


Chapter 7

Diagnosing Acute Abdominal Pain: a Case Study

This chapter discusses the application of the techniques developed in this dissertation to a challenging medical application: the diagnosis of acute abdominal pain. This domain is well known to be difficult, yielding little more than 60% predictive accuracy (Todd and Stamper, 1994; Ohmann et al., 1996) for most human and machine diagnosticians. Section 7.1 discusses the acute abdominal pain domain as well as previous approaches to building diagnostic tools for this domain. Moreover, since many researchers argue that the naive Bayesian classifier is optimal for this domain (e.g., (Todd and Stamper, 1994; de Dombal, 1991)), I describe the motivations for, and the objectives of, applying selective Bayesian networks to this domain in Section 7.2. To study this optimality claim, and to evaluate the performance of selective Bayesian networks in this domain, I carried out several experiments comparing the performance of the naive Bayesian classifier to both selective and non-selective Bayesian network classifiers. The datasets used for the experiments are described in Section 7.3, whereas the experimental methodology and design are discussed in Sections 7.4 and 7.5, respectively. I describe the results of the experiments in Section 7.6 and conclude by discussing some implications of these results

in Section 7.7.

7.1 The Acute Abdominal Pain Domain

The diagnosis of acute abdominal pain is considered to be a classic Bayesian problem, as findings are probabilistically (rather than deterministically) related to underlying diseases, and prior information can make a significant difference to a successful diagnosis. The most serious common cause of acute abdominal pain is appendicitis, and in many cases a clear diagnosis of appendicitis is difficult, since other diseases such as Non-Specific Abdominal Pain (NSAP) can present similar signs and symptoms (findings). Appendicitis progresses over a course of hours to days, and one might be tempted to wait until the complex of signs and symptoms is highly characteristic of appendicitis before removing the appendix. However, the inflamed appendix may perforate during the observation period, causing a more generalized infection and raising the risk of death from about 1 in 200 cases to about 1 in 42. Thus, the tradeoff is between the possibility of an unnecessary operation and that of a perforation. A full model for this domain typically has three variable types: observable, intermediate (latent) and disease. Observable variables correspond to findings that can be observed directly, such as nausea, vomiting and fever. Disease variables correspond to diseases that are the underlying causes of a case of acute abdominal pain, such as appendicitis or NSAP. Latent variables correspond to physiological states that are neither directly observable nor underlying diseases, but are clinically relevant (as determined by the domain expert) to determining a diagnosis. Examples include peritonitis and inflammation. Such models typically do not make strong assumptions about the conditional independence of latent or observable variables given the disease variable. Models with such a structure are described in (Provan, 1994; Todd and Stamper, 1994). A naive model typically ignores the class of latent variables, or if it includes any

latent variables, it assumes that they are independent of any observable variables given the disease variable. This latter assumption is technically incorrect, based on known physiological principles; in addition, the inclusion of latent variables should improve diagnostic performance, since more information is being used. Hence, it appears that a full model should outperform a naive model. However, this hypothesis has not been fully supported by empirical evidence, which provides, at best, inconclusive evidence about the effect on diagnostic accuracy of capturing dependencies in Bayesian models.

Following de Dombal et al.'s publication of a successful naive Bayesian model for the diagnosis of acute abdominal pain (de Dombal et al., 1972), many researchers have studied empirically the effect of independence assumptions on diagnostic accuracy. Some studies have demonstrated the influence on diagnostic accuracy of capturing dependencies. For example, Seroussi (1986), in the domain of acute abdominal pain, reported a 4% increase in diagnostic accuracy (from 63.7% to 67.7%) by accounting for pairwise interactions using a Lancaster model; other researchers (Fryback, 1978; Norusis and Jacquez, 1975) have also shown that capturing conditional dependencies may improve diagnostic accuracy. In contrast, other studies have shown no statistically significant difference between the two approaches (Todd and Stamper, 1993), and some have even found the naive Bayesian classifier to be optimal (Todd and Stamper, 1994; Edwards and Davies, 1984; de Dombal, 1991). Ohmann et al. (1996) compared the performance of the naive Bayesian classifier to several more complex representations such as decision trees and rule-based systems. Their results showed no major differences in overall accuracy between the various approaches, though there were considerable differences with respect to specific diagnoses.

Similarly, Todd and Stamper (1994) compared the accuracy of the naive classifier to that of nearest neighbor classifiers, rule-based systems, as well as unrestricted Bayesian networks (built from expert knowledge rather than learned from data). They too concluded that the naive Bayesian classifier is the best model for this domain. Fryback (1978) studied the sensitivity of diagnostic accuracy to conditional independence assumptions in a Bayesian model for medical diagnosis. He showed empirically that large models with many inappropriate independence assumptions can be less accurate than smaller models that do not have to make such inappropriate assumptions. Fryback suggested that model size should be increased incrementally in cases where conditional independence assumptions are not all known, rather than starting from a large model.

With these studies and claims in mind, I outline, in Section 7.2, the hypotheses that I address in this study by applying selective Bayesian networks to the acute abdominal pain domain.

7.2 Objectives

As pointed out in the previous section, several researchers have argued that the naive Bayesian classifier is the optimal model for this domain. This is counterintuitive, since there are strong dependencies between several domain variables, and the naive classifier assumes these away. Models that capture dependencies, such as Bayesian networks, should provide more accurate diagnoses than naive Bayesian classifiers, since they can represent dependencies that may be important for computing the correct diagnosis. However, as discussed in Section 7.1, this expectation has not been supported by empirical studies.

As explained in Section 4.6, one reason why more sophisticated techniques such as Bayesian networks may not perform as well as the naive Bayesian classifier in this domain is the high problem dimensionality. Typically, the number of domain variables as well as the number of diseases (classes) is high, whereas little data is available to learn the model. As the experimental results in Chapter 4 showed, the performance of Bayesian networks in such domains is generally very poor, possibly due to overfitting, resulting in spurious dependencies and unreliable probability estimates. On the other hand, despite its strong independence assumptions, the naive classifier is fairly robust, and performs well even for small sample sizes and high-dimensional problems. However, as discussed in Chapter 3, selective Bayesian networks somewhat alleviate this problem since they use only a subset of the domain features, leading to better parameter estimates when relatively little data is available; thus, they should perform better than the more complex, non-selective Bayesian networks. At the same time, they represent all the dependencies between the features that are modeled, and thus should perform better than the naive classifier. This was confirmed by the experiments in Chapter 4, where selective Bayesian networks outperformed naive Bayesian classifiers on almost all the domains tested. Thus, selective Bayesian networks may provide an ideal middle ground between the two extreme approaches. As such, one of the main objectives of this study was to learn selective Bayesian networks for this domain, and to see if they performed better than both non-selective Bayesian networks and naive Bayesian classifiers.

Another issue that I wanted to explore is the presence of missing values in the data. Since missing data leads to a degradation in performance, principled approaches to learning models from incomplete data, such as the MIEM-BN algorithm discussed in Chapters 5 and 6, may perform better than the ad hoc methods that are commonly used. Thus, the second main objective of this study was to see if properly handling missing data results in any improvement in the performance of selective Bayesian networks in this domain.
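To make the contrast concrete, a naive Bayesian classifier scores each disease d as P(d) multiplied by the product of P(finding | d) over the observed findings, treating every finding as conditionally independent given the disease. The following is a minimal sketch, not the implementation used in this study; the finding names and the Laplace smoothing scheme are illustrative assumptions:

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(cases):
    """cases: list of (findings_dict, disease) pairs.
    Returns class priors and per-(disease, attribute) value counts."""
    priors = Counter(d for _, d in cases)
    cond = defaultdict(Counter)  # cond[(disease, attr)][value] -> count
    for findings, d in cases:
        for attr, val in findings.items():
            cond[(d, attr)][val] += 1
    return priors, cond

def nb_predict(model, findings, smoothing=1.0):
    """Return the disease maximizing log P(d) + sum_i log P(f_i | d),
    with Laplace smoothing for finding values unseen in training."""
    priors, cond = model
    total = sum(priors.values())
    best, best_score = None, float("-inf")
    for d, nd in priors.items():
        score = math.log(nd / total)
        for attr, val in findings.items():
            counts = cond[(d, attr)]
            num_vals = len(counts) + 1  # one extra slot for unseen values
            score += math.log((counts[val] + smoothing) /
                              (sum(counts.values()) + smoothing * num_vals))
        if score > best_score:
            best, best_score = d, score
    return best
```

The selective and non-selective networks studied below relax exactly the per-finding independence assumption hard-coded in the inner loop above.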

7.3 Description of the Datasets Used

Two datasets were used in this study. Both datasets, described below, have been used previously by other researchers for building various models such as decision trees, neural networks, nearest neighbor classifiers, Bayesian networks and naive Bayesian classifiers. In all studies, the naive Bayesian classifier was found to be better than, or at least as good as, more complex representations; in cases where some other technique was better, no statistically significant difference was found between them.

The first dataset used for this study consists of 1270 cases with 169 attributes each (this dataset was provided by Dr. B.S. Todd). However, out of the 1270 cases, the diagnosis of only 895 cases was definitely known; the remaining 375 cases were assigned the best possible diagnosis, as a presumed diagnosis. For the experiments, I used only the 895-case subset of definite diagnoses. The class variable, final diagnosis, has 19 possible values. This data was collected and pre-screened by Todd and Stamper, as described in (Todd and Stamper, 1993; Todd and Stamper, 1994). The resulting database addresses acute abdominal pain of gynaecological origin, based on case-notes for patients of reproductive age admitted to hospital, with no recent history of abdominal or back pain. In creating the database, the first 202 cases were used in the design of the database itself; thus, they cannot be used for the purpose of testing any model. Finally, 97 patients occur more than once in the database due to repeated visits. An additional 53 variables representing pathophysiological states and refinements of the final diagnosis were recorded. However, these variables cannot be used for evaluating a model since their values are ordinarily no more observable than the final diagnosis. Although Bayesian networks could easily incorporate these extra attributes (and take advantage of them during the learning phase), I did not use them for the current study, to keep the problem dimensionality manageable. I refer to this dataset as the T&S dataset.

The second dataset, used previously by Ohmann et al. (1996), is a 1254-case dataset consisting of patients with acute abdominal pain of less than a week's duration. Forty-six attributes are recorded for each patient, and the final diagnosis is broken up into 15 different categories. The data was collected from six surgical departments in Germany
under the `European Community Concerted Action on Objective Medical Decision Making in Patients with Acute Abdominal Pain' project (de Dombal et al., 1993); the data was provided to the author by Dr. C. Ohmann. I refer to this dataset as the ECCA dataset.

Two sets of experiments were performed. The first set was carried out after making the data complete by treating missing values as just another attribute value. The second set of experiments was conducted by properly handling the missing data using the MIEM-BN algorithm developed in Chapter 5. These experiments and their results are described in the following sections.

7.4 Experimental Methodology

7.4.1 Learning from Complete Data

The datasets were made complete by treating each missing value as just another value of the attribute in question. As a baseline case, naive classifiers were evaluated on each dataset. Bayesian networks were then constructed from the two datasets as described below. Unrestricted Bayesian networks were induced from each dataset using the CB algorithm (Singh and Valtorta, 1993). To learn the selective Bayesian networks, I used the CDC metric for feature selection, since it had the best performance on the datasets tested in Chapter 4. Once the features were selected, the CB algorithm was used to learn the final network.

In addition, due to the very high dimensionality of the T&S dataset, an additional constraint was imposed on the structure of the models learned. Since the CDC metric selects those attributes that provide information about the class variable, given the other attributes, one would expect them to be connected to the class variable, either directly or via a common child. If sufficient data were available and/or the problem dimensionality were small, these arcs would likely be added automatically; however, due to the high problem dimensionality and the small amount of data, this requirement was artificially imposed on the learned network. Thus, once the final network was learned from the set of selected attributes, the class variable was explicitly added to the parent set of every attribute, unless it was already there. For the ECCA dataset, this problem was felt to be less acute since the problem dimensionality was lower (fewer attributes, fewer classes and relatively more data); as such, no such constraint was imposed on the learned model. For comparison purposes, I also learned decision trees using C4.5 for each dataset.
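The two preprocessing conventions described above, completing the data by treating a missing entry as an extra attribute value, and forcing the class variable into every selected attribute's parent set, can be sketched as follows. This is an illustrative sketch only; the missing-value token and all names are assumptions, not taken from the dissertation:

```python
MISSING = "?"           # token marking a missing entry in the raw data (assumed)
EXTRA_VALUE = "absent"  # new category standing for "value not recorded"

def complete_dataset(rows):
    """Replace every missing entry with an explicit extra category, so that
    'no value recorded' becomes an ordinary observable state of the attribute."""
    return [
        {attr: (EXTRA_VALUE if val == MISSING else val) for attr, val in row.items()}
        for row in rows
    ]

def force_class_parent(parents, class_var, attributes):
    """Ensure the class variable is a parent of every selected attribute,
    as was imposed on the networks learned from the T&S dataset.
    parents: dict mapping attribute -> set of parent attributes."""
    for a in attributes:
        parents.setdefault(a, set()).add(class_var)
    return parents
```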

7.4.2 Learning from Incomplete Data

Learning selective Bayesian networks from incomplete data is a much more complicated task than learning non-selective networks. Learning selective Bayesian networks involves search along three dimensions since, in addition to searching for the best structure and the corresponding parameters, the process also entails searching for a good subset of features. In order to incorporate feature selection into the network induction process, a feature selection step was wrapped around the MIEM-BN algorithm described in Chapter 5. Thus, the algorithm first selects a subset of attributes from the set of imputed datasets, and then learns both the Bayesian network structure and the conditional probability tables using those attributes. This process is repeated until convergence. To select the set of attributes from the imputed datasets, I used a weighted sum of the CDC metric, as described below.

Let M be the number of imputations. Moreover, as described in Section 5.5, let D⟨s⟩^(t), 1 ≤ s ≤ M, be the set of imputed datasets at the t-th iteration. Let CDC({x}, Π, D⟨s⟩^(t)) denote the value of the CDC metric measured on the dataset D⟨s⟩^(t); in other words, this is the amount of information given by attribute x about the target variable, given the set of already-selected attributes Π and the data D⟨s⟩^(t). Then, at every stage of the feature selection process,

that attribute x is selected which maximizes

    CDC({x}, Π) = (1/M) Σ_{s=1}^{M} CDC({x}, Π, D⟨s⟩^(t))

However, the use of feature selection along with the MIEM-BN algorithm leads to a further complication. By using only a subset of attributes, we are, in effect, approximating the joint probability distribution of all the attributes by the joint probability distribution of the set of selected attributes. Since the progress of the algorithm, as well as its termination point, is based on the log-likelihood of the data, this leads to an immediate problem: because the number of attributes may increase or decrease each time feature selection is performed, the log-likelihood may decrease or increase accordingly, since only the selected attributes are included in the model and contribute towards the computation of the likelihood function. This leads to a great deal of fluctuation in the log-likelihood function, which in turn implies that convergence cannot be guaranteed (at least in a reasonable amount of time), although, intuitively, one would expect the algorithm to stabilize once it finds a good set of attributes, after some initial fluctuation during which it searches through the feature space. This issue is further complicated by the fact that, in order to learn the Bayesian networks, the induction algorithm requires a total ordering on the attributes. If an ordering on the attributes is not provided, the algorithm determines a good total ordering from the dataset. Since the ordering has a big effect on the type and quality of the network generated, this makes it more unlikely that the algorithm can converge quickly.

As such, I imposed three constraints on the algorithm described above to ensure that it converged quickly, and with a reasonably good solution. One, the algorithm was (arbitrarily) given a total ordering on the attributes; this ordering corresponded to the order in which the various attributes occurred in the datasets. Thus, the algorithm was constrained to look for the parents of an attribute only amongst the set of attributes that preceded it in the ordering. Two, for the feature selection process at the start of each iteration, the attributes selected during the previous iteration were used as the starting point (rather than the empty set), and attributes were incrementally added as long as any remaining attribute could provide information about the target, given the other selected attributes. Attributes were only allowed to be added, not removed; thus, the set of selected attributes could only grow in each iteration. Three, each attribute was forced to have the class variable as one of its parents.
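Under these constraints, the feature-selection step at each iteration reduces to a grow-only greedy search over attributes, scored by the CDC value averaged over the M imputed datasets. A sketch of that search, assuming a caller-supplied cdc(x, selected, dataset) function standing in for the CDC metric itself (which is not reimplemented here):

```python
def select_features(candidates, selected, imputed_datasets, cdc, threshold=0.0):
    """Grow-only greedy selection: starting from the previously selected set,
    repeatedly add the attribute whose CDC score, averaged over the M imputed
    datasets, is highest, until no remaining attribute scores above threshold."""
    selected = list(selected)
    remaining = [a for a in candidates if a not in selected]
    M = len(imputed_datasets)

    def avg_score(x):
        # Average of CDC({x}, selected, D_s) over the M imputed datasets
        return sum(cdc(x, selected, d) for d in imputed_datasets) / M

    while remaining:
        best = max(remaining, key=avg_score)
        if avg_score(best) <= threshold:
            break  # no remaining attribute adds information about the target
        selected.append(best)
        remaining.remove(best)
    return selected
```

At each MIEM-BN iteration this routine would be called with the previous iteration's selected set, so the attribute set can only grow, as constraint two requires.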

7.5 Experimental Design

For the T&S dataset, I used the same methodology as Todd & Stamper (1994) due to the structure of the data and the way it was collected (multiple records for patients, some records used in designing the data collection strategy, etc.). An 11-fold cross-validation strategy was used to evaluate the different methods. The dataset was restricted to only those cases for which a definite diagnosis was known. Cases that had been used during the construction of the database itself were not used for testing purposes, although they were used for training purposes. Moreover, for each run, all repeat presentations of any patient in the subset about to be tested were removed from the training set. This left approximately 895 cases in the dataset, of which 751 could be used for testing the models.

For the ECCA dataset, I departed from the strategy employed by Ohmann et al. (1996). They randomly split the dataset into two parts, and used one for learning the models and the other to evaluate them. I, instead, chose to do a 10-fold cross-validation study. Besides the advantage of considering multiple splits as opposed to a single split, this approach also increases the amount of data available for training. Inference on the Bayesian networks was carried out using the HUGIN (Andersen et al., 1989) system.

The main performance measure used was the classification accuracy of a model on

the test data, where classification accuracy is the percentage of test cases that were diagnosed correctly. However, since each problem involves a large number of classes, classification accuracy alone does not give an adequate basis for comparing the various methods. Thus, I computed several other statistics that are generally used for comparing alternative tests in medical diagnosis; they can also be used to compare alternative models in terms of their ability to discriminate between the various diseases. Taking the positive (negative) value for a disease to represent its presence (absence), the different statistics I computed can be described as follows (Clarke and Hayward, 1990):

1. Sensitivity: This is the ability of a model to correctly predict the presence of a disease in a patient with that disease. Also known as the True Positive Rate, it is defined as

    Sensitivity = TP / (TP + FN)

where TP is the number of true positives and FN is the number of false negatives.

2. Specificity: This is the ability of a model to correctly identify patients that do not have a given disease. Thus, it is the proportion of people who do not have a given disease and are correctly predicted so by the model. As such,

    Specificity = TN / (TN + FP)

where TN represents the number of true negatives and FP the number of false positives.

3. Predictive Value: This measures the accuracy of a model on a given disease, and is the probability that a patient actually has a certain disease, given that the model has so predicted. It is defined as

    Predictive Value = TP / (TP + FP)

                  T&S Dataset               ECCA Dataset
    Method      Correct  Error Rate      Correct  Error Rate
    CDC           536      0.286           641      0.489
    CB            476      0.366           555      0.557
    naiveALL      511      0.320           616      0.509
    C4.5          501      0.333           530      0.577

Table 7.1: Accuracies of Different Methods on the Abdominal Datasets with No Missing Values.

4. Likelihood Ratio: This measures the ability of a model to discriminate between alternative diseases. The higher the value, the greater the discriminating ability of the model. It is defined as follows:

    Likelihood Ratio = [TP / (TP + FN)] / [FP / (FP + TN)] = Sensitivity / (1 − Specificity)

In addition, for each technique, I also created discriminant matrices describing the performance of each model with respect to individual diseases. This allows us to compare different models with respect to their ability to correctly identify the various classes.
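These four statistics follow directly from the per-disease 2x2 counts. The sketch below is illustrative, not code from the dissertation; it returns None where a denominator is zero, which the result tables display as (-):

```python
def diagnostic_stats(tp, fp, tn, fn):
    """Sensitivity, specificity, predictive value and likelihood ratio from
    the true/false positive/negative counts for one disease. Returns None
    for a statistic whose denominator is zero."""
    def ratio(num, den):
        return num / den if den else None

    sensitivity = ratio(tp, tp + fn)        # true positive rate
    specificity = ratio(tn, tn + fp)        # true negative rate
    predictive_value = ratio(tp, tp + fp)   # precision of a positive call

    # Likelihood ratio = sensitivity / (1 - specificity)
    lr = None
    if sensitivity is not None and specificity is not None and specificity != 1:
        lr = sensitivity / (1 - specificity)
    return sensitivity, specificity, predictive_value, lr
```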

7.6 Experimental Results

The first set of experiments compared the performance of selective Bayesian networks learned by CDC to non-selective Bayesian networks, as well as to decision trees and naive Bayesian classifiers, on the complete datasets. Table 7.1 summarizes the results of these experiments. In addition to measuring the statistical significance of the difference in accuracies between the various methods via the McNemar test, I also calculated the 95% confidence levels for the difference in accuracies of each pair of methods using Cochran's Q test, as explained in Section 4.4. As can be seen from Table 7.1, the performance of selective Bayesian networks was far superior to that of non-selective Bayesian networks on both datasets. The selective

Bayesian network (CDC) induced from the T&S dataset correctly diagnosed 60 more cases than the non-selective network (CB), while the selective Bayesian network learned from the ECCA dataset correctly diagnosed 86 more cases. Moreover, the selective Bayesian networks were much smaller than the non-selective networks, using, on average, only about 11% of the attributes for the T&S dataset and about 33% of the attributes for the ECCA dataset. Compared to the naive Bayesian classifier (naiveALL), selective Bayesian networks were again much better, correctly diagnosing 25 more cases on each dataset. However, whereas the difference between the two approaches was statistically significant at the 95% confidence level for the T&S dataset (p = 0.042 using the McNemar test), it was not significant for the ECCA dataset (p = 0.15). The 95% confidence level for the difference in accuracies of each pair of methods on the T&S dataset was 0.0314, while it was 0.0274 on the ECCA dataset. The performance of C4.5 was relatively poor compared to both selective Bayesian networks and naive classifiers, especially on the ECCA dataset.

                           T&S Dataset               ECCA Dataset
    Method               Correct  Error Rate      Correct  Error Rate
    CDC                    536      0.286           641      0.489
    naiveALL               511      0.320           616      0.509
    MIEM-BN                545      0.274           642      0.488
    MIEM-BN + prior set     -         -             653      0.479

Table 7.2: Accuracies of Different Methods on the Abdominal Datasets in the Presence of Missing Values.

Table 7.2 summarizes the results of applying the MIEM-BN algorithm to the task of learning selective Bayesian networks from incomplete data. While the selective Bayesian network induced by the MIEM-BN algorithm from the T&S dataset was slightly better than the selective Bayesian network induced from the data after "completing" it by treating each missing value as an extra value (diagnosing 9 more cases correctly), there

was virtually no difference between the two on the ECCA dataset (an increase in the number of correct diagnoses by just 1). In both cases, however, the difference was not statistically significant. As in the complete-data case, the difference between the induced selective Bayesian network and the naive Bayesian classifier was statistically significant for the T&S dataset (p = 0.005) but not for the ECCA dataset (p = 0.11).

Given that the MIEM-BN algorithm improved the performance only slightly, especially for the ECCA dataset, I carried out another experiment on the ECCA dataset by giving, as input to the algorithm, the set of attributes selected, as well as the attribute ordering chosen, in the complete-data case by the CDC algorithm. The performance of the resultant selective Bayesian network (MIEM-BN + prior set) was much better, correctly diagnosing an additional 11 cases. However, the difference between the new selective Bayesian network and the selective Bayesian network learned from the complete data was still not statistically significant. On the other hand, the difference between the learned selective Bayesian network and the naive Bayesian classifier was statistically significant at the 95% level (p = 0.028). This seems to suggest that both attribute selection and the chosen total ordering on the attributes play an important role in determining the quality of the induced selective Bayesian networks.

Although classification accuracy provides an overall measure of the performance of each induction method, I also evaluated the performance of the different algorithms on each individual disease, since that may be much more significant from a clinical point of view due to the different misclassification costs of the various diseases. As such, I also computed the various statistics described in Section 7.5. Tables 7.3 and 7.4 show the predictive values as well as the likelihood ratios, while Tables 7.5 and 7.6 show the sensitivities and specificities of the various models on the two datasets. Values of (-) for a particular measure indicate that the denominator of the term was 0. The discriminant matrices for the various models on the two datasets are shown in Tables 7.8-7.15.
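The McNemar p-values quoted above depend only on the discordant cases: those one classifier diagnosed correctly and the other did not. A sketch using the chi-square approximation with continuity correction follows; the dissertation does not state which variant of the test was used, so this particular form is an assumption:

```python
import math

def mcnemar_p(b, c):
    """Two-sided McNemar test p-value from the two discordant counts:
    b = cases method A classified correctly and method B did not,
    c = the reverse. Uses the chi-square approximation (1 d.o.f.)
    with continuity correction."""
    if b + c == 0:
        return 1.0  # no discordant pairs: no evidence of any difference
    chi2 = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of chi-square with 1 d.o.f., via the error function:
    # P(X > x) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(chi2 / 2))
```

For small discordant counts an exact binomial version would be preferable; the approximation above is adequate at the sample sizes in these experiments.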

The statistics give a clearer picture of the relative behavior of the various methods. For the T&S dataset, both selective and non-selective Bayesian networks are generally unable to diagnose low-frequency diseases well; naive Bayesian classifiers, in contrast, are relatively much better at doing so. On the other hand, Bayesian networks are very good at correctly identifying high-frequency diseases, and show a very high discriminating ability with respect to these diseases. Moreover, selective Bayesian networks are much better than the non-selective Bayesian networks on these diseases. From Table 7.3, it is clear that selective Bayesian networks have higher predictive values and likelihood ratios than naive Bayesian classifiers for most of the frequent diseases; on the low-frequency diseases, the naive classifiers are better. Similarly, selective Bayesian networks are more sensitive to the more common diseases (Table 7.5), but virtually insensitive to the less frequent ones. This is also apparent from the discriminant matrices shown in Tables 7.7-7.9. Clearly, for the more frequent diseases such as non-specific abdominal pain (NSAP), abortion, ectopic pregnancy and threatened abortion, the number of true positives (patients correctly identified) is higher for the selective Bayesian networks than for the naive Bayesian classifiers; the reverse is true for the low-frequency diseases such as ovarian cyst and cystic accident. Moreover, whenever high-frequency diseases were misclassified by the selective Bayesian network, they were generally misclassified as other high-frequency diseases. A relatively large number of these misclassifications by the selective Bayesian network consisted of cases that were diagnosed as NSAP, which resulted in a lower predictive value for NSAP even though the number of true positives was higher.

In the case of the selective Bayesian network learned from incomplete data using the MIEM-BN algorithm, the improvement in accuracy was generally due to an even better performance on the high-frequency diseases (Table 7.10), generally resulting in higher predictive values and likelihood ratios on these diseases.

For the ECCA dataset, the results were not so clear cut. Although the overall accuracy of the selective Bayesian networks was higher, most of the gain was in the case of NSAP, which was also the most frequent disease, accounting for about 40% of the cases. However, most of the other diseases that were incorrectly diagnosed were also classified as NSAP. The selective Bayesian network had higher predictive values and likelihood ratios for only a few of the diseases, such as appendicitis, urinary tract infection and renal colic. The selective Bayesian networks learned from incomplete data using the MIEM-BN algorithm, especially the one learned using the previously selected attributes, were relatively better, with higher accuracies on most of the diseases while retaining the performance with respect to NSAP. For example, the selective Bayesian networks learned by the MIEM-BN algorithm were better than the naive Bayesian classifier at discriminating several more diseases, including cancer, dyspepsia and perforated ulcer. This improvement in performance can be attributed to better modeling of the inter-attribute dependencies by the MIEM-BN algorithm, resulting in a slight improvement in the prediction of less frequent diseases, while retaining the accuracy of the models learned from complete data for the more frequent diseases.

7.7 Discussion

One of the key findings of this work is the important role of feature selection in this domain. The results show that networks using only a small fraction of the attributes (selected using the CDC metric) display much higher accuracy than the networks that model all attributes. Thus, selective Bayesian networks are not only computationally much more efficient, but also greatly improve upon the performance of the non-selective networks. Moreover, the selective Bayesian networks were also found to outperform both naive Bayesian classifiers and decision trees on both datasets.

A detailed analysis of the results on the two datasets showed that while selective Bayesian networks were much better than naive Bayesian classifiers at correctly identifying high-frequency diseases,
[Table 7.3: Predictive Values and Likelihood Ratios for the T&S Dataset. The table gives, for each of the 19 final diagnoses, the number and percentage of cases, together with predictive values and likelihood ratios for the naiveALL, CDC, CB and CDC + MIEM-BN models; the tabular data is not reproduced here.]

the converse was true for the low-frequency diseases. This can possibly be attributed to the lack of sufficient data in a high-dimensionality domain. For diseases that occur more frequently, Bayesian networks are able to easily detect the appropriate relationships and obtain accurate estimates of the parameters, thus yielding higher accuracies for these diseases. For low-frequency diseases, Bayesian networks may pick up spurious dependencies and/or inaccurate parameter estimates, which in turn leads to poor performance on these diseases. However, selective Bayesian networks somewhat alleviate this problem since they use only a subset of the domain features, thus leading to better parameter estimates when relatively little data is available, and thus perform better than the more complex, non-selective Bayesian networks. At the same time, they represent all the dependencies between the features that are modeled, and thus perform better than the naive classifier,

[Table 7.4: Predictive Values and Likelihood Ratios for the ECCA Dataset. The table gives, for each of the 15 final diagnoses, the number and percentage of cases, together with predictive values and likelihood ratios for the naiveALL, CDC, CB, CDC + MIEM-BN and CDC + MIEM-BN + prior set models; the tabular data is not reproduced here.]

especially on the high-frequency classes. This suggests that the performance of complex representations, such as selective Bayesian networks, will probably improve in such domains with the availability of increasing amounts of data. Nevertheless, even for the small amounts of data in the two datasets, the experiments show that selective Bayesian networks are viable alternatives to the extremely simple naive classifiers as well as to the more complex non-selective Bayesian networks.

The use of the MIEM-BN algorithm, however, did not result in a significant increase in the accuracy of the selective Bayesian networks over and above those learned from the data after making it complete by treating each missing value as an extra value. Two main reasons can be put forward for this apparent lack of improvement. First, the MIEM-BN algorithm tries to learn the Bayesian network that best models the data (i.e., a good approximation to the underlying joint distribution) rather than attempting to maximize accuracy. However, a model that best fits the data is not necessarily the most predictive model. Thus, even though the Bayesian networks learned by the MIEM-BN algorithm

[Table 7.5: sensitivities and specificities per final diagnosis (with # cases and % cases) for naiveALL, CDC, CB, and CDC + MIEM-BN on the T&S dataset; the tabular data is not recoverable.]

Table 7.5: Sensitivities and Specificities for the T&S Dataset.

were more likely, they were only slightly more accurate. Second, it is probably true that the "treat missing value as extra value" approach is appropriate for this domain. As discussed in Section 5.3.2, in certain problems it is the very absence of a value that is informative, not the actual missing value. This is probably true for the abdominal datasets, which have a large number of classes and a much larger number of attributes. Several attributes are relevant only for certain diseases, and thus their values may not be recorded in cases where the physician deemed them irrelevant or redundant given the other information. As such, the particular missing value is not important. However, the absence of a value for a particular attribute should increase the likelihood of diseases for which that attribute is not relevant (and likewise decrease the likelihood of diseases for which it is). Treating missing values as an extra value encodes this relationship easily. Thus, as in the case of the MIEM-BN

[Table 7.6: sensitivities and specificities per final diagnosis (with # cases and % cases) for naiveALL, CDC, CB, CDC + MIEM-BN, and CDC + MIEM-BN + prior set on the ECCA dataset; the tabular data is not recoverable.]

Table 7.6: Sensitivities and Specificities for the ECCA Dataset.

algorithm, this approach is probably suitable for this domain, and thus yields similar results. Nevertheless, since the Bayesian networks learned by MIEM-BN were slightly more accurate, the extra-value approach is probably not suitable for all attributes that have missing values. Given the simplicity of the extra-value approach, a better way to handle missing data in this domain may be to use expert/domain knowledge to determine the attributes for which it is appropriate, and to use the MIEM-BN algorithm, as before, for the remaining attributes.
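Such a hybrid policy is straightforward to implement. The sketch below is illustrative only (the helper and attribute names are invented, not part of the thesis pipeline): attributes a domain expert has flagged as informatively missing get an explicit extra value, while the rest are left unfilled for a subsequent imputation step such as MIEM-BN.

```python
MISSING = None  # marker for an unrecorded value in the raw dataset

def apply_missing_policy(records, extra_value_attrs):
    """Encode missingness as an extra category for selected attributes.

    records           -- list of {attribute: value} dicts; value may be MISSING
    extra_value_attrs -- attributes whose very absence is informative
    Returns new records; other attributes keep MISSING so an EM/imputation
    procedure can fill them in later.
    """
    encoded = []
    for rec in records:
        new_rec = {}
        for attr, val in rec.items():
            if val is MISSING and attr in extra_value_attrs:
                new_rec[attr] = "not_recorded"   # the explicit extra value
            else:
                new_rec[attr] = val              # real value, or still MISSING
        encoded.append(new_rec)
    return encoded

# hypothetical attributes: a test not ordered is itself evidence
cases = [{"guarding": "yes", "pregnancy_test": MISSING},
         {"guarding": MISSING, "pregnancy_test": "positive"}]
out = apply_missing_policy(cases, extra_value_attrs={"pregnancy_test"})
```

Here `pregnancy_test` becomes the category "not_recorded" when absent, whereas the missing `guarding` value is preserved for imputation.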

7.8 Summary

The experimental results showed that selective Bayesian networks significantly outperform other representations, such as non-selective Bayesian networks, naive Bayesian classifiers and decision trees, in the acute abdominal pain domain. However, despite these encouraging results, the study raised several issues.

[Table 7.7: discriminant matrix (actual vs. model diagnosis) over the 19 T&S diagnostic categories, 751 cases in total; the cell counts are not recoverable.]

Table 7.7: Discriminant Matrix for the Non-selective Bayesian Network Induced from the T&S Dataset.

First, further research is required into techniques for extending the MIEM-BN algorithm to learn Bayesian networks primarily for classification. Second, the integration of feature selection with the MIEM-BN algorithm is an important problem that also requires further study. As the experimental results show, feature selection not only reduces the computational complexity of the models, but can also increase their performance, especially when little data is available. Although I presented a simple way of incorporating feature selection into the network-induction process of MIEM-BN, further research is needed to ensure that the method converges while still learning a good model. Third, as discussed in the previous section, treating missing values as an extra value is probably appropriate for several attributes in this domain. Given the simplicity of that approach, its integration with MIEM-BN for learning Bayesian network classifiers in this domain should be studied further. Fourth, the analysis of the results showed the inappropriateness of using

[Table 7.8: discriminant matrix (actual vs. model diagnosis) over the 19 T&S diagnostic categories, 751 cases in total; the cell counts are not recoverable.]

Table 7.8: Discriminant Matrix for the Selective Bayesian Network Induced from the T&S Dataset.

overall classification accuracy as a means of evaluating models in this domain. Since the performance of various models on some diseases may be more clinically relevant than on others, especially given their widely disparate misclassification costs, it is important to study the use of objective functions based on utilities, rather than classification accuracy, both for feature selection and for learning the final Bayesian networks.
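To make the point concrete, the following toy sketch (the counts and costs are invented for illustration, not numbers from this study) shows how a utility-based objective can reverse the ranking that raw accuracy gives:

```python
def accuracy(confusion):
    """Fraction of correctly classified cases from a confusion matrix."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total

def expected_cost(confusion, cost):
    """Average misclassification cost; cost[i][j] is the cost of calling
    true class i as class j (zero on the diagonal)."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][j] * cost[i][j]
               for i in range(len(confusion))
               for j in range(len(confusion))) / total

# hypothetical two-class problem: 0 = appendicitis (missing it is costly),
# 1 = non-specific abdominal pain
cost = [[0, 50],
        [1, 0]]
model_a = [[60, 40],     # misses 40 appendicitis cases
           [10, 890]]
model_b = [[90, 10],     # misses only 10, at the price of more false alarms
           [60, 840]]
# model_a: accuracy 0.95, expected cost 2.01
# model_b: accuracy 0.93, expected cost 0.56
```

Model A wins on accuracy, yet Model B is clearly preferable once the disparate misclassification costs are taken into account; the same reversal can occur when selecting features or network structures.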


[Table 7.9: discriminant matrix (actual vs. model diagnosis) over the 19 T&S diagnostic categories, 751 cases in total; the cell counts are not recoverable.]

Table 7.9: Discriminant Matrix for the Naive Bayesian Classifier Induced from the T&S Dataset.

[Table 7.10: discriminant matrix (actual vs. model diagnosis) over the 19 T&S diagnostic categories, 751 cases in total; the cell counts are not recoverable.]

Table 7.10: Discriminant Matrix for the Selective Bayesian Network Induced from the T&S Dataset with Missing Values.

[Table 7.11: discriminant matrix (actual vs. model diagnosis) over the 15 ECCA diagnostic categories, 1254 cases in total; the cell counts are not recoverable.]

Table 7.11: Discriminant Matrix for the Non-selective Bayesian Network Induced from the ECCA Dataset

[Table 7.12: discriminant matrix (actual vs. model diagnosis) over the 15 ECCA diagnostic categories, 1254 cases in total; the cell counts are not recoverable.]

Table 7.12: Discriminant Matrix for the Selective Bayesian Network Induced from the ECCA Dataset.

[Table 7.13: discriminant matrix (actual vs. model diagnosis) over the 15 ECCA diagnostic categories, 1254 cases in total; the cell counts are not recoverable.]

Table 7.13: Discriminant Matrix for the Naive Bayesian Classifier Induced from the ECCA Dataset.

[Table 7.14: discriminant matrix (actual vs. model diagnosis) over the 15 ECCA diagnostic categories, 1254 cases in total; the cell counts are not recoverable.]

Table 7.14: Discriminant Matrix for the Selective Bayesian Network Induced from the ECCA Dataset with Missing Values.

[Table 7.15: discriminant matrix (actual vs. model diagnosis) over the 15 ECCA diagnostic categories, 1254 cases in total; the cell counts are not recoverable.]

Table 7.15: Discriminant Matrix for the Selective Bayesian Network Induced from the ECCA Dataset with Missing Values using Previously Selected Attributes


Chapter 8

Conclusion

In this chapter, I summarize the main contributions of this thesis and discuss some extensions and directions for future research. As described in Chapter 1, this research had two main goals. First, I wanted to address some of the practical issues faced in learning Bayesian networks for solving practical problems. Second, I wanted to demonstrate the feasibility and usefulness of the methods developed by applying them to difficult, real-world problems. Section 8.1 summarizes the main contributions of this thesis towards each of these objectives, while Section 8.2 describes some avenues of future research.

8.1 Summary of Contributions

The two practical issues I addressed in this thesis are the computational intractability of inference using Bayesian networks, and the difficulty of learning Bayesian networks from incomplete data. Section 8.1.1 summarizes the methods developed to address the first problem, while Section 8.1.2 discusses my solution to the second. Section 8.1.3 briefly discusses the application of these methods to the acute abdominal pain domain.

8.1.1 Reducing the Inference Complexity of Bayesian Networks

In order to reduce the inference complexity of Bayesian networks, I proposed a new representation, the selective Bayesian network: a Bayesian network that uses only a subset of the available attributes to model a domain. The aim is to learn networks that are smaller, and hence computationally simpler to evaluate, by discarding attributes that are irrelevant, redundant, or have too weak an influence on the attributes of interest to be of significant consequence. At the same time, the goal is to model all the dependencies and independencies among the attributes that are, in fact, kept in the model, to yield as accurate a representation of the domain as possible. I have developed two methods for learning selective Bayesian networks from data, and have carried out extensive experiments showing that they are generally better than other induction methods, and that they learn Bayesian networks that are much smaller and hence more computationally efficient for inference, making the use of these models in real-life applications a practical reality. The first method, K2-AS, is a wrapper approach that uses the same induction algorithm for feature selection as for network induction. K2-AS selects a subset of attributes that maximizes predictive accuracy prior to the network-learning phase, thereby learning Bayesian networks with a bias for small networks that retain high predictive accuracy. The idea behind this approach is that attributes that have little or no influence on the classification accuracy of learned networks can be safely discarded without significantly affecting their performance. The second method, Info-AS, is a filter method: it uses an algorithm different from the network induction algorithm for feature selection. Info-AS uses information-theoretic metrics to efficiently select a subset of attributes from which to learn the selective Bayesian network.
The aim is to discard those attributes that can provide little or no information about the variable(s) of interest, given the other attributes in the network. Relative to networks learned using all attributes, networks

learned by both K2-AS and Info-AS were shown to be significantly smaller and computationally simpler to evaluate, while displaying comparable performance. Moreover, they displayed faster learning rates, requiring smaller datasets to achieve their asymptotic accuracy. Both methods were also shown to significantly outperform the naive Bayesian classifier (selective as well as non-selective), one of the most widely studied Bayesian methods within the machine learning community. With respect to a decision tree algorithm, C4.5, the selective Bayesian networks displayed comparable accuracy. I have also proved some interesting properties of the two algorithms that illuminate their strengths as well as their drawbacks, and that can help in the design of better methods. These results have several important ramifications. First, they give us a way of applying Bayesian networks to problems where computational intractability previously made this impossible. Second, they show that decreasing the size of the networks does not significantly reduce classification accuracy, which may be very important in some applications (e.g., medicine). Third, in real-world applications, features may have an associated cost (e.g., a feature representing an expensive test); the proposed learning algorithms can be modified to prefer removal of such high-cost features. Fourth, they give a way of identifying properties of a dataset that help determine whether selective Bayesian networks are suitable for the domain, and whether their use can yield substantial benefit over other representations. Specifically, selective Bayesian networks will generally be more beneficial than non-selective networks for all types of datasets, with the benefits more pronounced for datasets with many attributes.
Compared to naive Bayesian classifiers, although selective Bayesian networks generally perform better, their performance may be relatively poor on problems characterized by high dimensionality and small datasets. In such cases it may be worthwhile to use naive Bayesian classifiers instead, due to their lower induction and inference costs.

On the other hand, if a large amount of data is available, or if the problem has a low dimensionality, even with less data, then selective Bayesian networks should be preferred, especially if there are extensive correlations between the attributes.
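The wrapper idea behind K2-AS can be conveyed by a generic greedy forward-selection loop. The sketch below is schematic: the `score` callback stands in for learning a network on the candidate attribute set and measuring its hold-out accuracy, and the toy attribute names and scoring function are invented for illustration.

```python
def greedy_wrapper_select(attributes, score):
    """Forward selection: grow the attribute set while accuracy improves."""
    selected, best = set(), score(frozenset())
    improved = True
    while improved:
        improved = False
        for a in sorted(attributes - selected):     # try each candidate
            s = score(frozenset(selected | {a}))
            if s > best:
                best, best_attr, improved = s, a, True
        if improved:
            selected.add(best_attr)                 # keep the best addition
    return selected, best

# toy score: accuracy rises with each truly relevant attribute and is
# slightly penalized for irrelevant ones
relevant = {"age", "rebound"}
toy_score = lambda attrs: (0.5 + 0.2 * len(attrs & relevant)
                           - 0.01 * len(attrs - relevant))
chosen, acc = greedy_wrapper_select({"age", "rebound", "sex", "noise"},
                                    toy_score)
# chosen == {"age", "rebound"}: the irrelevant attributes are discarded
```

The loop stops as soon as no single addition improves the score, which is also the source of the local-maximum behavior discussed in Section 8.2.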

8.1.2 Learning Bayesian Networks from Incomplete Data

I have also developed a method for learning both Bayesian network structure and the corresponding conditional probability tables from data with missing values. Previous methods for learning Bayesian networks generally assume that the data from which the network is to be learned is complete. In situations where this is not the case, as in most real-world problems, the data is often made complete by filling in values using a variety of often ad-hoc methods. Although some work has been done on learning network parameters (conditional probabilities) assuming that the network structure is known, the more general and practical task of learning both network structure and parameters from incomplete data has not been fully explored. The few techniques developed in this regard make the highly restrictive, and impractical, assumption that values are missing randomly, independent of the state of other attributes. In practice, however, values are often missing based on the values of other attributes. I have shown how the well-known Expectation-Maximization (EM) algorithm can be used to learn both the Bayesian network structure and the conditional probabilities from incomplete data. Since determining the exact solution is computationally hard, I combine the EM algorithm with another well-known statistical technique, multiple imputation, to yield an approximate method, called the MIEM-BN algorithm, for efficiently finding a good solution. This method correctly handles both types of missing data mentioned above. Another advantage of this approach is that it allows a trade-off between search complexity and model quality by varying the number of imputations used at every step. Experiments

carried out on data generated from a large Bayesian network, with varying amounts of missing data and different assumptions about the missing-data mechanism, show that the learned distribution (represented by the induced network) is much closer to the "true" distribution than the distribution learned by commonly used ad-hoc methods of handling missing data. Moreover, the resulting networks are almost as good as the best model that could possibly be learned from the information available in the incomplete dataset. Given that most real-life datasets are replete with missing values, especially under the MAR assumption, the MIEM-BN algorithm should be very useful for learning Bayesian networks that accurately model such domains. The absence of methods that correctly handle missing data while learning both Bayesian network structure and conditional probabilities was a major limitation that, along with other problems such as computational intractability, prevented the application of Bayesian networks to real-life domains. The MIEM-BN algorithm should go a long way towards changing that situation.
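The flavor of combining imputation with iterative re-estimation can be conveyed on a deliberately tiny problem. The sketch below is illustrative only, with names and simplifications of my own: it estimates a single conditional probability, whereas MIEM-BN also searches over network structures and handles non-randomly missing values. Each iteration draws several completed datasets from the current parameter estimate, fits on each, and pools the results.

```python
import random

def mi_estimate(pairs, m=20, iters=30, seed=0):
    """Estimate p = P(y=1 | x=1) from (x, y) pairs in which some y's are
    None: impute each missing y from the current estimate, fit on every
    completed dataset, and pool the m per-dataset estimates."""
    rng = random.Random(seed)
    p = 0.5                                    # initial parameter guess
    for _ in range(iters):
        fits = []
        for _ in range(m):                     # m imputed completions
            ones = total = 0
            for x, y in pairs:
                if x != 1:
                    continue
                filled = y if y is not None else int(rng.random() < p)
                ones += filled
                total += 1
            fits.append(ones / total)
        p = sum(fits) / m                      # pooled estimate
    return p

# 8 observed ones, 2 observed zeros, 10 missing values: the estimate
# settles near the observed-data value 8/10 = 0.8
data = [(1, 1)] * 8 + [(1, 0)] * 2 + [(1, None)] * 10
p_hat = mi_estimate(data)
```

Increasing `m` reduces the sampling noise of each iteration at proportionally higher cost, which mirrors the search-complexity/model-quality trade-off noted above.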

8.1.3 Diagnosing Acute Abdominal Pain

I have also combined both methods, learning selective Bayesian networks from incomplete data, for the task of diagnosing acute abdominal pain. Known to be a very difficult domain, this is a high-dimensional problem characterized by little data, many attributes, many classes and missing data. Several researchers have argued that the naive Bayesian classifier is optimal for this domain. I carried out detailed experiments on two datasets in this domain, comparing the performance of selective Bayesian networks to that of non-selective Bayesian networks, naive Bayesian classifiers and decision trees. The experiments showed that the selective Bayesian networks were not only much smaller than the non-selective networks, but also significantly outperformed them on both datasets. The selective Bayesian networks were also found to outperform both naive Bayesian classifiers and decision trees on both datasets. Although classification accuracy gives an overall measure of the

performance of each induction method, it is important to evaluate the performance of the various algorithms on each individual disease, since that may be much more significant from a clinical point of view. This is crucial because different diseases have different misclassification costs, and a model that has high overall accuracy but performs very poorly on diseases with high misclassification costs may be less preferable than one that is less accurate in general but better at identifying the diseases with high misclassification costs. A detailed analysis of the results on the two datasets showed that while Bayesian networks were generally much better than naive Bayesian classifiers at correctly identifying high-frequency diseases, the converse was true for the low-frequency diseases.

These results suggest that modeling dependencies between the domain attributes does make a difference in practice. Previous results in this domain, which showed that the naive Bayesian classifier performs very well compared to more complex representations, thus leading to claims of optimality, are probably due to the lack of sufficient data in a domain that is, essentially, of very high dimensionality. For diseases that occur more frequently, Bayesian networks are able to detect the appropriate relationships and obtain accurate estimates of the parameters, yielding higher accuracies for these diseases. For low-frequency diseases, Bayesian networks pick up spurious dependencies and/or inaccurate parameter estimates, and hence are very poor at diagnosing them. Selective Bayesian networks somewhat alleviate this problem: since they use only a subset of the domain features, they obtain better parameter estimates when relatively little data is available, and thus perform better than the more complex, non-selective Bayesian networks. At the same time, they represent all the dependencies between the features that are modeled, and thus perform better than the naive classifier, especially on the high-frequency classes.

8.2 Extensions and Directions for Future Research

In this section, I discuss some of the drawbacks of the methods proposed in this thesis, and suggest ways of extending and improving them to yield even better and more useful results. Despite the excellent performance of selective Bayesian networks compared to other representations, as discussed in Section 4.7, there are a number of ways in which their performance can be improved further. One drawback of the current methods is that only one attribute is evaluated at a time. As pointed out in Sections 3.3.2 and 3.4.3, this may lead to poor models in certain situations. A simple modification that should be examined in the future is to start with random subsets, with multiple restarts, and use a greedy search strategy that allows both addition and removal of attributes at every step of the attribute-selection process. A related issue is the greedy search strategy employed, which often ends in a local maximum. Other search strategies, such as simulated annealing (Kirkpatrick et al., 1983) or even best-first search, may yield better results.1 Another drawback of the proposed selective Bayesian network induction algorithms is the choice of the algorithm (CB) used to learn the Bayesian network from the set of selected attributes. However, this choice was arbitrary, and any other method for learning Bayesian networks (e.g., (Heckerman et al., 1995; Lam and Bacchus, 1993)) could have been chosen in its place. Nevertheless, most current algorithms attempt to learn the network that "best" fits the data and do not take into account the classification accuracy of the resulting models. However, even though a given model may be "correct" in the sense of generating the data, it need not be the best model when it comes to making predictions (Cowell et al., 1993).
In order to learn better selective Bayesian networks for the purpose of classification, it is important to develop methods

1 One alternative hypothesis is that overfitting from such approaches may prevent them from yielding better results in certain situations (Quinlan and Cameron-Jones, 1995).


for learning Bayesian networks with a view towards maximizing their performance on that measure. The global metrics of Cowell et al. (Cowell et al., 1993; Spiegelhalter et al., 1993) offer one way of addressing this issue. Greiner et al. (1997b) also discuss this issue extensively, and describe techniques for learning Bayesian networks that have the best performance over the specific queries they will have to cover. Apart from these various ways of improving the performance of selective Bayesian networks, there are several other issues of general interest that should be explored further. First, as discussed in Section 3.1.1, the tree-augmented naive Bayesian classifier (TAN) (Friedman and Goldszmidt, 1996a) eliminates some of the problems of the naive Bayesian classifier by modeling some of the inter-attribute dependencies. As such, it performs better than the naive classifier in many domains where the attributes are correlated. At the same time, it retains much of the simplicity of the naive classifier, since each attribute can have at most one parent in addition to the class variable, and is thus simple to learn and use. It will therefore be interesting to compare the performance of selective Bayesian networks with TANs. More importantly, it will be beneficial to identify the characteristics of problems where the selective Bayesian network will be more beneficial than the TAN, despite its increased complexity, and vice versa. Second, as discussed by Friedman and Goldszmidt (1996b), it is possible to improve the quality of Bayesian networks induced from data by explicitly representing and learning the local structure in the conditional probability tables that quantify these networks. It will be interesting to see whether incorporating their techniques into the methods for learning selective Bayesian networks described in this dissertation results in better models.
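The TAN construction ranks candidate attribute-attribute edges by the conditional mutual information I(X; Y | C) between attribute pairs given the class. A sketch of that quantity from raw counts (plain empirical estimates in natural-log units; smoothing is omitted for brevity):

```python
from collections import Counter
from math import log

def cond_mutual_info(samples):
    """I(X;Y|C), in nats, from a list of (x, y, c) triples."""
    n = len(samples)
    nxyc = Counter(samples)
    nxc = Counter((x, c) for x, _, c in samples)
    nyc = Counter((y, c) for _, y, c in samples)
    nc = Counter(c for _, _, c in samples)
    total = 0.0
    for (x, y, c), cnt in nxyc.items():
        # p(x,y,c) * log[ p(x,y|c) / (p(x|c) p(y|c)) ]
        total += (cnt / n) * log(cnt * nc[c] / (nxc[x, c] * nyc[y, c]))
    return total

# X a copy of Y within the class: one full bit (log 2 nats) of
# class-conditional dependence; an independent pair scores zero
dependent = [(0, 0, "a"), (1, 1, "a")] * 5
independent = [(0, 0, "a"), (0, 1, "a"), (1, 0, "a"), (1, 1, "a")] * 3
```

Selective Bayesian networks are free to represent higher-order dependencies that this pairwise score cannot capture, which is one axis along which the comparison suggested above could be framed.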
Similarly, as discussed in Section 6.5, there are several issues concerning the induction of Bayesian networks from incomplete data that deserve further attention. One interesting problem that I have not addressed in this thesis concerns the presence of hidden variables, i.e., variables whose values are always missing in the dataset. This is a potentially important problem, since the addition of hidden variables has often been

found to improve the quality of the learned distribution (Friedman, 1997). Another issue that should be explored further is the use of other metrics, such as classification accuracy or, ideally, utility functions, to evaluate the quality of the Bayesian networks learned from incomplete data. In the experiments in Chapter 6, the Kullback-Leibler distance was used to evaluate the quality of the networks induced by the MIEM-BN algorithm from incomplete data. However, this is clearly infeasible in practice since the true model is not known. For most real-life domains, Bayesian networks will be used for specific applications such as classification, and as such should be learned with the intended objective in mind. The MIEM-BN algorithm attempts to learn the model that best represents the underlying distribution, which is generally not the best model for classification, as described above. This was very apparent in the acute abdominal pain domain, where, although the learned models were more likely than those learned assuming complete data, the difference in classification accuracy was not significant. This leads to another important issue that was, again, very much evident in the application of the MIEM-BN algorithm to the task of diagnosing acute abdominal pain: the integration of feature selection with the MIEM-BN algorithm. As the experiments with the ECCA dataset showed, feature selection plays a very important role in determining the quality of the final model. As such, further research should be carried out on developing better techniques for performing feature selection as part of the network induction process while learning from incomplete data. Finally, as discussed in Section 6.5, there are several ways of making the MIEM-BN algorithm even more efficient. It would be worthwhile to explore these ideas further as well.
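For reference, the Kullback-Leibler distance used in Chapter 6 measures how well a learned joint distribution $\hat{P}$ approximates the true distribution $P$; in its standard form (notation mine), summing over all joint configurations $\mathbf{x}$ of the variables:

```latex
D_{KL}\bigl(P \,\Vert\, \hat{P}\bigr) \;=\; \sum_{\mathbf{x}} P(\mathbf{x}) \,\log \frac{P(\mathbf{x})}{\hat{P}(\mathbf{x})}
```

The explicit dependence on the true $P$ is exactly what makes this metric usable only in controlled experiments with a known generating model, and motivates the task-oriented metrics discussed above.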


Bibliography

Almuallim, H. and Dietterich, T. (1991). Learning with many irrelevant features. In Proc. Conf. of the AAAI, pages 547–552, Menlo Park, CA. AAAI Press.

Anderson, S., Olesen, K., Jensen, F., and Jensen, F. (1989). HUGIN - A Shell for building Bayesian Belief Universes for Expert Systems. In Proceedings of the 11th International Joint Conference on Artificial Intelligence, pages 1080–1085.

Andreassen, S., Woldbye, M., Falck, B., and Andersen, S. (1987). A causal probabilistic network for interpretation of electromyographic findings. In Proc. Tenth International Joint Conference on Artificial Intelligence, pages 366–372, San Mateo, CA. Morgan Kaufmann.

Beinlich, I., Suermondt, H., Chavez, R., and Cooper, G. (1989). The ALARM monitoring system: A case study with two probabilistic inference techniques for belief networks. In Proceedings of the Second European Conference on Artificial Intelligence in Medicine, pages 247–256, London, England.

Bouckaert, R. R. (1993). Belief network construction using the minimum description length principle. In Proceedings ECSQARU, pages 41–48.

Buntine, W. (1991). Theory refinement on Bayesian networks. In Uncertainty in Artificial Intelligence: Proceedings of the Seventh Conference, pages 52–60, San Mateo, CA. Morgan Kaufmann.

Buntine, W. and Niblett, T. (1992). A further comparison of splitting rules for decision-tree induction. Machine Learning, 7.

Cardie, C. (1993). Using decision trees to improve case-based learning. In Proc. Machine Learning, pages 25–32. Morgan Kaufmann.

Caruana, R. and Freitag, D. (1994). Greedy attribute selection. In Cohen, W. and Hirsch, H., editors, Proc. Machine Learning, pages 28–36. Morgan Kaufmann.

Charniak, E. and Goldman, R. (1989). Plan recognition in stories and in life. In Uncertainty in Artificial Intelligence: Proceedings of the Fifth Workshop, pages 54–60, Mountain View, California.

Chavez, R. and Cooper, G. (1990). A randomized approximation algorithm for probabilistic inference on Bayesian belief networks. Networks, 20:661–685.

Cheeseman, P., Kelly, J., Self, M., Stutz, J., Taylor, W., and Freeman, D. (1988). AutoClass: A Bayesian classification system. In Fifth International Conference on Machine Learning, Ann Arbor, Michigan.

Cheeseman, P. and Oldford, W., editors (1994). Selecting Models from Data: AI and Statistics IV. Springer-Verlag.

Clarke, J. R. and Hayward, C. Z. (1990). Workshop on surgical decision making: a scientific approach to surgical reasoning. Theoretical Surgery, 5:129–132.

Cochran, W. (1950). The comparison of percentages in matched samples. Biometrika, 37:256–266.

Cooper, G. (1990). The computational complexity of probabilistic inference using belief networks. Artificial Intelligence, 42:393–405.

Cooper, G. (1995). A Bayesian method for learning belief networks that contain hidden variables. Journal of Intelligent Systems, 4:71–88.

Cooper, G. and Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347.

Cowell, R., Dawid, P., and Spiegelhalter, D. (1993). Sequential model criticism in probabilistic expert systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(3):209–219.

Dagum, P. and Luby, M. (1993). Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence, 60:141–153.

D'Ambrosio, B. (1991). Local expression languages for probabilistic dependence. In Proceedings Seventh Conference on Uncertainty in Artificial Intelligence, pages 95–102. Morgan Kaufmann.

Darwiche, A. and Provan, G. (1995). Query DAGs: a practical paradigm for implementing on-line causal-network inference. Technical report, Rockwell Science Center, Thousand Oaks, CA.

Dawid, A. (1992). Prequential analysis, stochastic complexity and Bayesian inference. In Bernardo, J., Berger, J., Dawid, A., and Smith, A., editors, Bayesian Statistics 4, pages 109–125. Oxford Science Publications.

de Dombal, F. (1991). The diagnosis of acute abdominal pain with computer assistance. Annals Chir., 45:273–277.

de Dombal, F., de Baere, H., van Elk, P., Fingerhut, A., Henriques, J., Lavelle, S., Malizia, G., Ohmann, C., Pera, C., Sitter, H., and Tsiftsis, D. (1993). Objective medical decision making in acute abdominal pain, pages 65–87. IOS Press.

de Dombal, F., Leaper, D., Staniland, J., McCann, A., and Horrocks, J. (1972). Computer-aided diagnosis of acute abdominal pain. British Medical Journal, 2:9–13.

Dempster, A., Laird, N., and Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society (Series B), 39:1–38.

Dietterich, T. (1996). Statistical tests for comparing supervised learning algorithms. Technical report, Oregon State University, Corvallis, OR.

Draper, D. (1994). Relevance measures for localized partial evaluation of belief networks. In Working notes of the AAAI-94 Fall symposium series on Relevance, pages 56–59.

Dunn, O. (1961). Multiple comparisons among means. Journal of the American Statistical Association, 56:52–64.

Edwards, F. and Davies, R. (1984). Use of a Bayesian algorithm in the computer-assisted diagnosis of appendicitis. Surg. Gynecol. Obstet., 158:219–222.

Ezawa, K. and Norton, S. (1995). Knowledge discovery in telecommunication services data using Bayesian network models. In Proc. 1st Int. Conf. on Knowledge Discovery and Data Mining.

Ezawa, K., Singh, M., and Norton, S. (1996). Learning goal-oriented Bayesian networks for telecommunications risk management. In Proc. 13th Intl. Conference on Machine Learning. To appear.

Feelders, A. and Verkooijen, W. (1995). Which method learns most from the data? In Preliminary Papers of the Fifth International Workshop on Artificial Intelligence and Statistics, pages 219–225, Ft. Lauderdale, FL.

Friedman, N. (1997). Learning belief networks in the presence of missing values and hidden variables. In Procs. 14th International Conference on Machine Learning.

Friedman, N. and Goldszmidt, M. (1996a). Building classifiers using Bayesian networks. In Proc. 13th National Conference on Artificial Intelligence (AAAI).

Friedman, N. and Goldszmidt, M. (1996b). Learning Bayesian networks with local structure. In Proc. 12th Conference on Uncertainty in Artificial Intelligence.

Fryback, D. G. (1978). Bayes' theorem and conditional nonindependence of data in medical diagnosis. Computers and Biomedical Research, 11:429–435.

Fung, R. and Favero, B. D. (1995). Applying Bayesian networks to information retrieval. Communications of the ACM, 38(3).

Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6:721–741.

Ghahramani, Z. and Jordan, M. (1994). Learning from incomplete data. A.I. Memo 1509, Massachusetts Institute of Technology, Artificial Intelligence Laboratory.

Greiner, R. (1997). Deployed Bayesian network systems in routine use. http://www.auai.org/BN-Routine.html.

Greiner, R., Grove, A., and Kogan, A. (1997a). Knowing what doesn't matter: Exploiting the omission of irrelevant data. Artificial Intelligence.

Greiner, R., Grove, A., and Schuurmans, D. (1997b). Learning Bayesian nets that perform well. In Procs. Conf. on Uncertainty in AI.

Heckerman, D. (1995). A tutorial on learning Bayesian networks. Technical Report MSR-TR-95-06, Microsoft Research, Redmond, WA.

Heckerman, D., Breese, J. S., and Rommelse, K. (1994). Troubleshooting under uncertainty. Technical Report MSR-TR-94-07, Microsoft Research, Redmond, WA.

Heckerman, D., Geiger, D., and Chickering, M. (1995). Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning, 20(3):197–243.

Heckerman, D. and Shachter, R. (1995). Decision-theoretic foundations for causal reasoning. Journal of Artificial Intelligence Research, 3:405–430.

Henrion, M. (1991). Search-based methods to bound diagnostic probabilities in very large belief nets. In D'Ambrosio, B., Smets, P., and Bonissone, P., editors, Uncertainty in Artificial Intelligence.

Herskovits, E. H. (1991). Computer-based Probabilistic Network Construction. PhD thesis, Medical Information Sciences, Stanford University, Stanford, CA.

Jensen, F. V., Lauritzen, S. L., and Olesen, K. (1990). Bayesian updating in recursive graphical models by local computations. Computational Statistics Quarterly, 4:269–282.

John, G., Kohavi, R., and Pfleger, K. (1994). Irrelevant features and the subset selection problem. In Cohen, W. and Hirsch, H., editors, Proc. Machine Learning, pages 121–129. Morgan Kaufmann.

Kira, K. and Rendell, L. (1992a). The feature selection problem: Traditional methods and a new algorithm. In Proc. AAAI, pages 129–134. AAAI Press.

Kira, K. and Rendell, L. (1992b). A practical approach to feature selection. In Proc. Machine Learning, pages 249–256. Morgan Kaufmann.

Kirkpatrick, S., Gelatt, C., and Vecchi, M. (1983). Optimization by simulated annealing. Science, 220:671–680.

Koller, D. and Sahami, M. (1996). Toward optimal feature selection. In Proc. 13th Intl. Conference on Machine Learning. To appear.

Kononenko, I. (1994). Estimating attributes: Analysis and extension of RELIEF. In Proc. European Conf. on Machine Learning, pages 171–182. Springer Verlag.


Kubat, M., Flotzinger, D., and Pfurtscheller, G. (1993). Discovering patterns in EEG signals: Comparative study of a few methods. In Proc. European Conf. on Machine Learning, pages 367–371. Springer Verlag.

Kullback, S. and Leibler, R. (1951). Information and sufficiency. Ann. Math. Statistics, 22.

Lam, W. and Bacchus, F. (1993). Using causal information and local measures to learn Bayesian networks. In Heckerman, D. and Mamdani, E., editors, Uncertainty in Artificial Intelligence: Proceedings of the Ninth Conference, pages 243–250, San Mateo, CA. Morgan Kaufmann.

Lam, W. and Bacchus, F. (1994). Learning Bayesian belief networks, an approach based on the MDL principle. Computational Intelligence, 10(4).

Langley, P. (1993). Induction of recursive Bayesian classifiers. In Proc. European Conf. on Machine Learning, pages 153–164. Springer Verlag.

Langley, P. (1994). Selection of relevant features in machine learning. In Greiner, R., editor, Proc. AAAI Fall Symposium on Relevance. AAAI Press.

Langley, P., Iba, W., and Thompson, K. (1992). An analysis of Bayesian classifiers. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 223–228. AAAI Press.

Langley, P. and Sage, S. (1994a). Induction of selective Bayesian classifiers. In Proc. Conf. on Uncertainty in AI. Morgan Kaufmann.

Langley, P. and Sage, S. (1994b). Oblivious decision trees and abstract cases. In Working notes of the AAAI'94 Workshop on Case-Based Reasoning, pages 113–117. AAAI Press.

Lauritzen, S. (1995). The EM algorithm for graphical association models with missing data. Computational Statistics and Data Analysis, 19:191–201.

Lauritzen, S. L. and Spiegelhalter, D. J. (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society (Series B), 50:157–224.

Levitt, T., Mullin, J., and Binford, T. (1989). Model-based influence diagrams for machine vision. In Proc. Fifth Workshop on Uncertainty in AI, pages 233–244.

Li, K. H. (1985). Hypothesis testing in multiple imputation, with emphasis on mixed-up frequencies in contingency tables. PhD thesis, Department of Statistics, University of Chicago.

Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data. John Wiley & Sons.

Lopez de Mantaras, R. (1991). A distance-based attribute selection measure for decision tree induction. Machine Learning, 6:81–92.

Madigan, D., Raftery, A., York, J., Bradshaw, J., and Almond, R. (1993). Strategies for graphical model selection. In Proc. International Workshop on AI and Statistics, pages 331–336.

Madigan, D. and Raftery, A. E. (1993). Model selection and accounting for model uncertainty in graphical models using Occam's window. Technical Report 213 (revised), Dept. of Statistics, Univ. of Washington.

Marascuilo, L. and McSweeney, M. (1977). Nonparametric and Distribution-Free Methods for the Social Sciences. Brooks/Cole Publishing Company, CA.

Marill, T. and Green, D. (1963). On the effectiveness of receptors in recognition systems. IEEE Trans. on Information Theory, 9:11–17.


Matzkevich, I. and Abramson, B. (1993). Deriving a minimal i-map of a belief network relative to a target ordering of its nodes. In Heckerman, D. and Mamdani, E., editors, Uncertainty in Artificial Intelligence: Proceedings of the Ninth Conference, pages 159–165, San Mateo, CA. Morgan Kaufmann.

McNemar, Q. (1947). Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika, 12:153–157.

Murphy, P. and Aha, D. (1992). UCI repository of machine learning databases. Machine-readable data repository, Department of Information and Computer Science, University of California, Irvine.

Narendra, M. and Fukunaga, K. (1977). A branch and bound algorithm for feature subset selection. IEEE Trans. on Computers, 26(9):917–922.

Neapolitan, R. (1990). Probabilistic Reasoning in Expert Systems. John Wiley & Sons, New York.

Norusis, M. and Jacquez, J. (1975). Diagnosis I: Symptom nonindependence in mathematical models for diagnosis. Comput. Biomed. Res., 8:156–172.

Ohmann, C., Moustakis, V., Yang, Q., and Lang, K. (1996). Evaluation of automatic knowledge acquisition techniques in the diagnosis of acute abdominal pain. Artificial Intelligence in Medicine, 8:23–36.

Pazzani, M. (1995). Searching for attribute dependencies in Bayesian classifiers. In Proc. Fifth International Workshop on Artificial Intelligence and Statistics, pages 424–429.

Pearl, J. (1986). Fusion, propagation and structuring in belief networks. Artificial Intelligence, 29:241–288.


Pearl, J. (1987). Evidential reasoning using stochastic simulation of causal models. Artificial Intelligence, 32:245–257.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, San Mateo, CA.

Pearl, J. and Verma, T. (1991). A theory of inferred causation. In Allen, J., Fikes, R., and Sandewall, E., editors, Principles of Knowledge Representation and Reasoning: Proceedings of the Second International Conference, pages 441–452, San Mateo, CA. Morgan Kaufmann.

Poole, D. (1993). Average-case analysis of a search algorithm for estimating prior and posterior probabilities in Bayesian networks with extreme probabilities. In Proc. IJCAI, pages 606–612.

Provan, G. (1994). Tradeoffs in knowledge-based construction of probabilistic models. IEEE Trans. on SMC.

Provan, G. M. (1993). Tradeoffs in constructing and evaluating temporal influence diagrams. In Proc. Ninth Conf. Uncertainty in Artificial Intelligence, pages 40–47. Morgan Kaufmann.

Quinlan, J. (1986). Induction of decision trees. Machine Learning, 1:81–106.

Quinlan, J. R. and Cameron-Jones, R. M. (1995). Oversearching and layered search in empirical learning. In Procs. 14th Intl. Joint Conf. on Artificial Intelligence, pages 1019–1024.

Ramoni, M. (1997). Learning Bayesian networks from incomplete databases. In Procs. 13th Conference on Uncertainty in AI, San Mateo, CA. Morgan Kaufmann.

Rubin, D. (1987). Multiple Imputation for Nonresponse in Surveys. Wiley, New York.

Russell, S., Binder, J., Koller, D., and Kanazawa, K. (1995). Local learning in probabilistic networks with hidden variables. In Proc. Conference on Uncertainty in AI.

Salzberg, S. (1997). On comparing classifiers: pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1(3):317–328.

Schuurmans, D. and Greiner, R. (1994). Learning default concepts. In Procs. CSCSI, pages 519–523.

Seroussi, B. (1986). Computer-aided diagnosis of acute abdominal pain when taking into account interactions. Method. Inform. Med., 25:194–198.

Shachter, R. (1988). Probabilistic inference and influence diagrams. Operations Research, 36(4):589–604.

Shachter, R., Andersen, S., and Poh, K. (1990). Directed reduction algorithms and decomposable graphs. In Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence, pages 237–244.

Shachter, R. and Peot, M. (1990). Simulation approaches to general probabilistic inference on belief networks. In Henrion, M., Shachter, R., Kanal, L., and Lemmer, J., editors, Uncertainty in Artificial Intelligence 5, pages 221–231. North-Holland.

Siedlecki, W. and Sklansky, J. (1988). On automatic feature selection. Intl. J. of Pattern Recognition and Artificial Intelligence, 2(2):197–220.

Singh, M. and Provan, G. M. (1995). A comparison of induction algorithms for selective and non-selective Bayesian classifiers. In Proc. 12th Intl. Conference on Machine Learning, pages 497–505.

Singh, M. and Valtorta, M. (1993). An algorithm for the construction of Bayesian network structures from data. In Heckerman, D. and Mamdani, E., editors, Uncertainty in Artificial Intelligence: Proceedings of the Ninth Conference, pages 259–265, San Mateo, CA. Morgan Kaufmann.

Singh, M. and Valtorta, M. (1995). Construction of Bayesian network structures from data: a brief survey and an efficient algorithm. International Journal of Approximate Reasoning, 12:111–131.

Spiegelhalter, D., Dawid, P., Lauritzen, S., and Cowell, R. (1993). Bayesian analysis in expert systems. Statistical Science, 8(3):219–283.

Spirtes, P. and Glymour, C. (1991). An algorithm for fast recovery of sparse causal graphs. Social Science Computing Review, 9(1):62–72.

Spirtes, P., Glymour, C., and Scheines, R. (1990). Causality from probability. In Tiles, J., McKee, G., and Dean, G., editors, Evolving Knowledge in the Natural and Behavioral Sciences, pages 181–199. Pitman, London.

Suermondt, H. J. and Amylon, M. D. (1989). Probabilistic prediction of the outcome of bone-marrow transplantation. In Proceedings of the Symposium on Computer Applications in Medical Care, pages 208–212, Washington, D.C.

Suzuki, J. (1996). Learning Bayesian belief networks based on the minimum description length principle: an efficient algorithm using the B&B technique. In Proc. 13th Intl. Conference on Machine Learning. To appear.

Tanner, M. and Wong, W. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398):528–550.

Todd, B. S. and Stamper, R. (1993). The formal design and evaluation of a variety of medical diagnostic programs. Technical Monograph PRG-109, Oxford University Computing Laboratory.

Todd, B. S. and Stamper, R. (1994). The relative accuracy of a variety of medical diagnostic programs. Methods Inform. Med., 33:402–416.

Verma, T. and Pearl, J. (1992). An algorithm for deciding if a set of observed independencies has a causal explanation. In Dubois, D., Wellman, M. P., D'Ambrosio, B., and Smets, P., editors, Uncertainty in Artificial Intelligence: Proceedings of the Eighth Conference, pages 323–330, San Mateo, CA. Morgan Kaufmann.

Wei, G. and Tanner, M. (1990). A Monte Carlo implementation of the EM algorithm and the poor man's data augmentation algorithms. Journal of the American Statistical Association, 85(411):699–704.

White, A. and Liu, W. (1994). Bias in information-based measures in decision tree induction. Machine Learning, pages 321–329.

Xu, L., Yan, P., and Chang, T. (1989). Best-first strategy for feature selection. In Proc. Ninth International Conf. on Pattern Recognition, pages 706–708. IEEE Computer Society Press.
