Combining Bayesian Networks, k Nearest Neighbours algorithm and Attribute Selection for Gene Expression Data Analysis*

B. Sierra, E. Lazkano, J.M. Martínez-Otzeta and A. Astigarraga

Dept. of Computer Science and Artificial Intelligence, University of the Basque Country, P.O. Box 649, E-20080 San Sebastián, Spain, [email protected], http://www.sc.ehu.es/ccwrobot

Abstract. In recent years there has been a large growth in gene expression profiling technologies, which are expected to provide insight into cancer-related cellular processes. Machine Learning algorithms, although extensively applied in many areas of the real world, are still not popular in the Bioinformatics community. We report on the successful application of the combination of two supervised Machine Learning methods, Bayesian Networks and the k Nearest Neighbours algorithm, to cancer class prediction problems in three DNA microarray datasets of huge dimensionality (Colon, Leukemia and NCI-60). The gene selection process, essential in microarray domains, is performed by a sequential search engine, and the selected genes are afterwards used to learn the Bayesian Network model. Once the genes are selected, we combine the Bayesian Network paradigm with the well-known k-NN algorithm in order to improve the classification accuracy.

1 Introduction

The development of high-throughput data acquisition technologies in the biological sciences, and specifically in genome sequencing, together with advances in digital storage and computing, has begun to transform biology from a data-poor into a data-rich science. In order to manage and deal with all this new biological data, the Bioinformatics discipline has emerged. These advances in genome sequencing over the last decade have led to the spectacular development of a new technology, named DNA microarray, which can be included in the Bioinformatics discipline. DNA microarrays allow the simultaneous monitoring and measurement of the expression levels of thousands of genes in an organism. A systematic and computational analysis of these microarray datasets is an interesting way to study and understand many aspects of the underlying biological processes.

* This work was supported by the University of the Basque Country under grant UPV 140.226-EA186/96 and by the Gipuzkoako Foru Aldundi Txit Gorena under grant OF761/2003.

There has been significant recent interest in the development of new methods for the functional interpretation of these microarray gene expression datasets. The analysis frequently involves class prediction (supervised classification), regression, feature selection (in this case, gene selection), outlier detection, principal component analysis, discovery of gene relationships and cluster analysis (unsupervised classification). In this way, DNA microarray datasets are an appropriate starting point to carry out systematic and automatic cancer classification (Golub et al. [10]). Cancer classification is divided into two major issues: the discovery of previously unknown types of cancer (class discovery) and the assignment of tumor samples to already known cancer types (class prediction). While class discovery is related to cluster analysis (unsupervised classification), class prediction is related to the application of supervised classification techniques. Our work focuses on class prediction for cancer classification.

In the last decade there has been substantial growth in the accumulation of information in databases from Economics, Marketing, Medicine, Finance and other fields. The larger size of these databases and the improvement of computer-related technologies inspired the development of a set of techniques, grouped under the Machine Learning (ML) term (Mitchell [22]), that discover and extract knowledge in an automated way. Although ML techniques have successfully solved classification problems in many different areas of the real world, their application is only now emerging as a powerful tool for DNA microarray problems (Li and Yang [20]; Xing et al. [31]; Ben-Dor et al. [2]; Aris and Recce [1]; Inza et al. [12]; Blanco et al. [3]).

As microarray datasets have a very large number of predictive genes (usually more than 1000) and a small number of samples (usually fewer than 100), a reduction in the number of genes used to build a classifier is an essential part of any microarray study. Moreover, for diagnostic purposes it is important to find small subsets of genes that are informative enough to distinguish between cells of different types [2]. All the studies also show that most of the genes measured in a DNA microarray are not relevant for an accurate distinction between cancer classes (Golub et al. [10]). To this end we suggest a simple combinatorial, sequential, classic search mechanism, named Sequential Forward Selection (Kittler [14]), which performs the major part of its search near the empty subset of genes. Each gene subset found is evaluated by a wrapper scheme (Kohavi and John [16]), which is very popular in ML applications and is starting to be used in DNA microarray tasks (Inza et al. [12]; Blanco et al. [3]; Li et al. [19]; Xing et al. [31]).

In this paper we present a new classifier combination technique that consists of combining the use of two well-known paradigms: Bayesian Networks (BN) and k Nearest Neighbours (k-NN). We show the results obtained when using this new classifier on three microarray datasets. Before running the classification process, Feature Subset Selection is applied using a wrapper technique in order to keep only the relevant features. The accuracy estimation is given using the Leave-One-Out Cross-Validation (LOOCV) technique because, due to its unbiased nature (Kohavi [15]), it is the most suitable estimation procedure for microarray datasets.

The rest of the paper is organized as follows: section 2 focuses on the Feature Selection process; sections 3 and 4 present the Bayesian Network and k-NN paradigms, respectively; section 5 is devoted to the new classifier proposed in this paper; the experimental results obtained are shown in section 6; and in the final section 7 the conclusions are presented and future lines of work are pointed out.

2 Gene selection process: Feature Subset Selection

The basic problem of ML is concerned with the induction of a model that classifies a given object into one of several known classes. In order to induce the classification model, each object is described by a pattern of d features. Here, the ML community has formulated the following question: are all of these d descriptive features useful for learning the 'classification rule'? In trying to respond to this question, we come up with the Feature Subset Selection (FSS) [21] approach, which can be stated as follows: given a set of candidate features, select the 'best' subset in a classification problem. In our case, the 'best' subset will be the one with the best predictive accuracy.

Most supervised learning algorithms perform rather poorly when faced with many irrelevant or redundant (depending on the specific characteristics of the classifier) features. In this way, FSS proposes additional methods to reduce the number of features so as to improve the performance of the supervised classification algorithm.

FSS can be viewed as a search problem, with each state in the search space specifying a subset of the possible features of the task. Exhaustive evaluation of the possible feature subsets is usually infeasible in practice because of the large amount of computational effort required. In this way, any feature selection method must determine four basic issues that define the nature of the search process:

1. The starting point in the space. It determines the direction of the search. One might start with no features and successively add them, or one might start with all features and successively remove them.

2. The organization of the search. It determines the strategy of the search. Roughly speaking, search strategies can be complete or heuristic (see [21] for a review of FSS algorithms). When we have more than 10-15 features the search space becomes huge and a complete search strategy is infeasible. As FSS is a classic NP-hard optimization problem, the use of search heuristics is justified. Classic deterministic heuristic FSS algorithms are sequential forward and backward selection (SFS and SBS [14]), floating selection methods (SFFS and SFBS [24]) and best-first search [16]. Classic implementations of non-deterministic search engines are Genetic Algorithms [29], Estimation of Distribution Algorithms [11] and Simulated Annealing [7].

3. Evaluation strategy of feature subsets. The evaluation function identifies the promising areas of the search space; the objective of the FSS algorithm is its maximization. The search algorithm uses the value returned by the evaluation function to help guide the search. Some evaluation functions carry out this objective by looking only at the characteristics of the data, capturing the relevance of each feature or set of features to define the target concept: these types of evaluation functions are grouped under the filter strategy. However, Kohavi and John [16] report that when the goal of FSS is the maximization of accuracy, the selected features should depend not only on the features and the target concept to be learned, but also on the learning algorithm. Thus, they proposed the wrapper concept: the FSS algorithm conducts a search for a good subset using the induction algorithm itself as part of the evaluation function, the same algorithm that will be used to induce the final classification model. In this way, representational biases of the induction algorithm used to construct the final classifier are included in the FSS process. It is claimed by many authors [16, 21] that the wrapper approach obtains better predictive accuracy estimates than the filter approach; however, its computational cost must be taken into account.

4. Criterion for halting the search. An intuitive approach for stopping the search is the non-improvement of the evaluation function value of alternative subsets. Another classic criterion is to fix a number of possible solutions to be visited along the search.

In our microarray problems we propose to use Sequential Forward Selection (SFS) [14], a classic and well-known hill-climbing, deterministic search algorithm which starts from an empty subset of genes. It sequentially selects genes until no improvement is achieved in the evaluation function value. As previous microarray studies consistently note that very few genes are needed to discriminate between different cell classes, we consider that SFS could be an appropriate search engine because it performs the major part of its search near the empty gene subset. To assess the goodness of each proposed gene subset for a specific classifier, a wrapper approach is applied. In the same way as for supervised classifiers when no gene selection is applied, this wrapper approach estimates, by the LOOCV procedure, the goodness of the classifier using only the gene subset found by the search algorithm.
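As an illustration of how such a wrapper SFS engine operates, the following sketch (our own Python rendering, not the implementation actually used in the experiments) scores each candidate gene subset by the LOOCV accuracy of a user-supplied classifier and halts when no additional gene improves it; the names loocv_accuracy, make_classifier, X and y are illustrative assumptions, with X and y taken to be NumPy arrays.

import numpy as np

def loocv_accuracy(X, y, make_classifier):
    """Leave-one-out cross-validation accuracy of a classifier on (X, y)."""
    n = len(y)
    hits = 0
    for i in range(n):
        train = np.delete(np.arange(n), i)          # every sample except the i-th
        clf = make_classifier()
        clf.fit(X[train], y[train])
        hits += int(clf.predict(X[i:i + 1])[0] == y[i])
    return hits / n

def sequential_forward_selection(X, y, make_classifier):
    """Greedy SFS: start from the empty gene subset and repeatedly add the gene
    that most improves the LOOCV accuracy; stop when no gene improves it."""
    selected, best_acc = [], 0.0
    while True:
        best_gene, best_new_acc = None, best_acc
        for g in range(X.shape[1]):
            if g in selected:
                continue
            acc = loocv_accuracy(X[:, selected + [g]], y, make_classifier)
            if acc > best_new_acc:
                best_gene, best_new_acc = g, acc
        if best_gene is None:                        # halting criterion: no improvement
            return selected, best_acc
        selected.append(best_gene)
        best_acc = best_new_acc

With a scikit-learn-style estimator, make_classifier could for instance be lambda: KNeighborsClassifier(n_neighbors=1).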

3 Bayesian Networks

Bayesian networks are probabilistic graphical models represented by directed acyclic graphs in which nodes are variables and arcs show the conditional (in)dependencies among the variables [13].

There are different ways of establishing the Bayesian network structure [9, 4]. It can be the human expert who designs the network, taking advantage of his/her knowledge about the relations among the variables. It is also possible to learn the structure by means of an automatic learning algorithm. A combination of both systems is a third alternative, mixing the expert knowledge and the learning mechanism. Within the context of microarray data, the meaning of and the inter-relations among the different genes (variables) are beyond the knowledge of the authors; this is the reason for not defining an expert-made net structure for each of the databases.

Within the supervised classification area, learning is performed using a training datafile, but there is always a special variable, namely the class, i.e. the one we want to predict. Some structural learning approaches take into account the existence of that special variable [8, 29, 30, 18], but most of them consider all the variables in the same manner and use an evaluation metric to measure the suitability of a net given the data. Hence, a structural learning method needs two components: the learning algorithm and the evaluation measure (score+search).

The search algorithm used in the experimentation described here is the K2 algorithm [5] (figure 1 shows its pseudo-code). This algorithm assumes that an ordering has been established for the variables, so that the search space is reduced. The fact that X1, X2, ..., Xn is an ordering of the variables implies that only the predecessors of Xk in the list can be its parent nodes in the learned network. The algorithm also assumes that all the networks are a priori equally probable, but because it is a greedy algorithm it cannot guarantee that the net resulting from the learning process is the most probable one given the data.

Input: an ordering of the n variables and the database
Output: net structure
Begin
  for i = 1 to n
    πi = ∅
    p_old = g(i, πi)
    ok_to_proceed = true
    while ok_to_proceed and |πi| < max_parents
      z = argmax over k ∈ pred(i) \ πi of g(i, πi ∪ {k})
      p_new = g(i, πi ∪ {z})
      if p_new > p_old
        p_old = p_new
        πi = πi ∪ {z}
      else
        ok_to_proceed = false
End

Fig. 1. Pseudo-code of the K2 structural learning algorithm
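Expressed in Python, the greedy loop of figure 1 looks roughly as follows; this is our own sketch, in which score plays the role of the metric g(i, πi) and order is the fixed ancestral ordering assumed by K2.

def k2_search(order, score, max_parents):
    """Greedy K2 structure search following the pseudo-code of Fig. 1.
    order       -- list of variable indices fixing the ancestral ordering
    score       -- score(i, parents) returning the metric g(i, pi_i)
    max_parents -- upper bound on the number of parents per node
    Returns a dict mapping every variable to its learned parent set."""
    parents = {i: set() for i in order}
    for pos, i in enumerate(order):
        predecessors = order[:pos]          # only earlier variables may become parents
        p_old = score(i, parents[i])
        ok_to_proceed = True
        while ok_to_proceed and len(parents[i]) < max_parents:
            candidates = [z for z in predecessors if z not in parents[i]]
            if not candidates:
                break
            # candidate parent that maximizes the metric for node i
            z = max(candidates, key=lambda k: score(i, parents[i] | {k}))
            p_new = score(i, parents[i] | {z})
            if p_new > p_old:
                p_old = p_new
                parents[i].add(z)
            else:
                ok_to_proceed = False
    return parents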

The original algorithm used the K2 Bayesian metric to evaluate the net while it is being constructed:

P(D|S) = \log_{10} \left( \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}! \right)    (1)

where P(D|S) is a measure of the goodness of the Bayesian net S defined over the dataset D; n is the number of variables; r_i represents the number of values or states that the i-th variable can take; q_i is the number of possible configurations of the parents of variable i; N_{ijk} is the frequency with which variable i takes the value k while its parent configuration is j; N_{ij} = \sum_{k=1}^{r_i} N_{ijk}; N_{ik} = \sum_{j=1}^{q_i} N_{ijk}; and N is the number of entries in the database.

In addition to this metric, we have tried two more measures in combination with the algorithm. The Bayesian Information Criterion (BIC) [27] includes a term that penalizes complex structures:

P(D|S) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \frac{N_{ijk}}{N_{ik}} - f(N) \sum_{i=1}^{n} q_i (r_i - 1)    (2)

where f(N) = \frac{1}{2} \log N is the penalization term. The well-known entropy metric [28] measures the disorder of the given data:

P(D|S) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \frac{N_{ij}}{N} \sum_{k=1}^{r_i} \frac{N_{ijk} + 1}{N_{ij} + r_i} \ln \frac{N_{ijk} + 1}{N_{ij} + r_i}    (3)
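To make these metrics concrete, the following sketch (our own Python illustration; the counts layout and all function names are assumptions, not part of the original paper) computes the log10 K2 score of Eq. (1) and the entropy score of Eq. (3) directly from the sufficient statistics N_ijk.

import numpy as np
from math import lgamma, log

def k2_log10_score(counts):
    """Eq. (1): log10 of prod_i prod_j [(r_i-1)!/(N_ij+r_i-1)!] prod_k N_ijk!.
    counts is a list with one (q_i x r_i) integer array per variable,
    where counts[i][j, k] = N_ijk."""
    total = 0.0
    for N in counts:
        q_i, r_i = N.shape
        for j in range(q_i):
            n_ij = N[j].sum()
            total += lgamma(r_i) - lgamma(n_ij + r_i)               # log (r_i-1)! - log (N_ij+r_i-1)!
            total += sum(lgamma(N[j, k] + 1) for k in range(r_i))   # log prod_k N_ijk!
    return total / log(10)                                          # natural logs -> log10

def entropy_score(counts, n_cases):
    """Eq. (3): sum_i sum_j (N_ij/N) sum_k p_ijk ln p_ijk, with the
    Laplace-corrected estimate p_ijk = (N_ijk + 1) / (N_ij + r_i)."""
    total = 0.0
    for N in counts:
        q_i, r_i = N.shape
        for j in range(q_i):
            n_ij = N[j].sum()
            p = (N[j] + 1.0) / (n_ij + r_i)
            total += (n_ij / n_cases) * float(np.sum(p * np.log(p)))
    return total

The BIC score of Eq. (2) follows the same pattern, adding the penalization term f(N) sum_i q_i (r_i - 1) to the log-likelihood part.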

Evidence propagation or probabilistic inference consists of, given an instantiation of some of the variables, obtaining the a posteriori probability of one or more of the non-instantiated variables. It is known that this computation is an NP-hard problem, even for the case of a single variable. There are different alternatives for performing the propagation. Exact methods calculate the exact a posteriori probabilities of the variables, and the resulting error is only due to the limitations of the computer where the calculation is performed; the computational cost can be reduced by exploiting the independencies among the nodes in the net. Approximate propagation methods are based on simulation techniques that obtain approximate values for the required probabilities. Pearl [23] proposes a stochastic simulation method known as the Markov Sampling Method. When the whole Markov blanket of the variable of interest (the set of nodes formed by its parents, its children and the parents of those children) is instantiated, there is no need for the simulation process to obtain the values of the non-evidential variables: P(x|w_x) can be calculated using only the probability tables involving the parents and children of the node, i.e. using the parameters saved in the model specification. In that particular case, the method becomes an exact propagation method.
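For reference, and as a standard result not spelled out in the text above, the a posteriori distribution of a variable X whose Markov blanket w_x is fully instantiated factorizes over exactly those local tables:

P(x | w_x) \propto P\bigl(x \mid pa(X)\bigr) \prod_{Y \in ch(X)} P\bigl(y \mid pa(Y)\bigr)

where pa(·) and ch(·) denote parents and children, every factor on the right-hand side is read directly from a conditional probability table of the network, and the proportionality constant is obtained by normalizing over the states of X.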

4 The k-NN Classification Method

A set of pairs (x_1, θ_1), (x_2, θ_2), ..., (x_n, θ_n) is given, where the x_i's take values in a metric space X upon which a metric d is defined, and the θ_i's take values in the set {1, 2, ..., M} of possible classes. Each θ_i is considered to be the index of the category to which the i-th individual belongs, and each x_i is the outcome of the set of measurements made upon that individual. We say that "x_i belongs to θ_i" when we mean precisely that the i-th individual, upon which the measurements x_i have been observed, belongs to category θ_i.

A new pair (x, θ) is given, where only the measurement x is observable, and it is desired to estimate θ by using the information contained in the set of correctly classified points. We call x'_n ∈ {x_1, x_2, ..., x_n} the nearest neighbour of x if

\min_{i=1,\ldots,n} d(x_i, x) = d(x'_n, x)

The NN classification decision method assigns to x the category θ'_n of its nearest neighbour x'_n. In case of a tie for the nearest neighbour, the decision rule has to be modified in order to break it. A mistake is made if θ'_n ≠ θ. An immediate extension of this decision rule is the so-called k-NN approach [6], which assigns to the candidate x the class most frequently represented among its k nearest neighbours.
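The decision rule above translates directly into a few lines of code. The following sketch is our own illustration, assuming numeric feature vectors and a Euclidean metric d; with k = 1 it reduces to the NN rule, and ties are simply broken by index order.

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=1):
    """Assign to x the most frequent class among its k nearest neighbours."""
    dists = np.linalg.norm(X_train - x, axis=1)   # d(x_i, x) for every stored case
    nearest = np.argsort(dists)[:k]               # indices of the k closest cases
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]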

5 Proposed Approach

We present a new classifier combination technique that can be seen either as a multi-classifier (classifier combination technique) or as a hybrid classifier that uses the ideas of the two classifiers involved in the composition. The new approach works as follows:

1. Select the genes by a wrapper Sequential Forward Selection with the Bayesian Network paradigm. The Naive Bayes algorithm implemented in the MLC++ tool (Kohavi et al. [17]) has been used for this purpose.

2. Given a database containing the selected genes (and the variable corresponding to the class), learn the classification model (BN structure) to be used. The K2 algorithm has been used as the learning paradigm, together with the three different metrics already explained. This algorithm treats all variables equally and does not take into account the classification task the net will be used for. In order to reduce the impact of the random variable ordering on the learned net structures, the experiments have been repeated 3000 times and the nets with the optimal metric values have been selected.

3. When a new case comes to the classifier, assign a class to it according to the following process:
   – a) Look for its nearest neighbour case in the training database according to the 1-NN algorithm. Using the nearest neighbour of the new case instead of the case itself allows us to avoid the negative effect that the Laplace correction could have when propagating instances in which almost all the probabilities to be calculated correspond to conditions not present in the database. Let K_i be this nearest case.
   – b) Propagate the K_i case in the learned BN as if it were the new case.
   – c) Assign to the new case the class with the highest a posteriori probability after the propagation is done.
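The three steps can be sketched as follows; this is our own Python illustration rather than the authors' implementation, and bn_posterior stands for whatever inference routine propagates evidence through the learned network and returns the a posteriori class distribution.

import numpy as np

def bn_knn_classify(X_train, x_new, bn_posterior, class_values):
    """Hybrid BN + 1-NN classification, steps 3(a)-(c) of the proposed approach."""
    # (a) nearest neighbour of the new case in the training database (1-NN)
    dists = np.linalg.norm(X_train - x_new, axis=1)
    k_i = X_train[np.argmin(dists)]
    # (b) propagate K_i through the learned Bayesian network instead of the
    #     new case itself, so the evidence configuration is one already seen
    #     in the training data and the Laplace correction does little harm
    posterior = bn_posterior(k_i)     # dict {class value: a posteriori probability}
    # (c) return the class with the highest a posteriori probability
    return max(class_values, key=lambda c: posterior[c])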

6 Experimental results

To test the performance of the presented approach, the following three well-known microarray class prediction datasets are used:

– Colon dataset, presented by Ben-Dor et al. (2000) [2]. This dataset is composed of 62 samples of colon epithelial cells, each characterized by 2,000 genes.
– Leukemia dataset, presented by Golub et al. (1999) [10]. It contains 72 instances of leukemia patients involving 7,129 genes.
– NCI-60 dataset, presented by Ross et al. (2000) [26]. It assesses the gene expression profiles in 60 human cancer cell lines that were characterized pharmacologically by treatment with more than 70,000 different drug agents, one at a time and independently. Nine different cancer types are considered.

We test the classification accuracy of the newly proposed approach and compare the results with those obtained by k-NN and Naive Bayes with all the genes, and by the BN models learned with the three metrics. When learning Bayesian networks two subgoals can be identified: on the one hand, the structure of the network must be fixed and, on the other hand, the values of the probability tables for each node must be established. In this experimental work the probability tables are always estimated from the training database.

6.1 Structural learning using different quality metrics

In this approximation, instead of using a fixed net structure, the K2 algorithm has been used as the learning paradigm together with the three different metrics already explained. Figure 2 shows some of the BN structures obtained by the K2 algorithm with the different metrics in the three databases. As can be seen, the structures obtained with the K2 metric are more complex, while those obtained with the BIC learning approach tend to be disconnected due to the effect of the penalization term. That is, in our opinion, the reason why the structures obtained using the Entropy measure are the best suited for the classification task.

6.2 Classification results

We present in this section the classification results obtained by each of the paradigms used. Table 1 shows the LOOCV accuracy estimates for the k-NN algorithm and the Naive-Bayes approach when all the genes of the databases are used as predictor variables, as well as the results obtained after Feature Subset Selection has been performed, for the BNs learned using the different metrics and for the k-NN paradigm. Naive-Bayes results are outperformed by the BN models when FSS is applied; notice that when considering the whole set of variables we use the Naive-Bayes algorithm instead of the three learning approaches of the Bayesian Network paradigm, due to the complexity of the BN learning process when the number of variables is large.

Fig. 2. The Bayesian Network structures obtained with the different metrics in the three databases used in the experiments: (a) Entropy metric for the Leukemia database; (b) BIC metric for the Colon database; (c) K2 metric for the NCI-60 database.

After the FSS is done with the wrapper approach, 7 genes are selected for the Colon database, 6 for the Leukemia and 15 for the NCI-60 when Naive Bayes is used as the classification method. The number of selected genes, and the genes themselves, are different when the k-NN paradigm is used in the FSS task (6 genes for the Colon database, 3 for the Leukemia and 14 for the NCI-60).

Table 1. LOOCV accuracy percentages obtained for the three databases by using k-NN and NB with all the genes, and by the BN and k-NN algorithms after the gene selection process.

Dataset     KNN-all  NB-all  K2-FSS  BIC-FSS  Entropy-FSS  KNN-FSS
Colon        32.26    53.23   82.29   80.65      83.87      82.26
Leukemia     65.28    84.72   88.88   93.05      91.66      91.66
NCI-60       73.33    48.33   55.00   25.00      66.66      61.66

The results obtained by the newly proposed approach are shown in Table 2. An accuracy increment is obtained in two of the databases (Colon and Leukemia) with respect to all the previous approaches. Nevertheless, the third dataset does not show the same behaviour, probably because this database is divided into 9 different classes, which makes it difficult for the Bayesian Network paradigm to discriminate among them with so few cases. For this third database, neither the new approach nor the FSS improves the results obtained by k-NN when all the variables are considered.

Table 2. LOOCV accuracy percentages obtained by the proposed classifier for each dataset after SFS is applied.

Dataset     K2      BIC     Entropy
Colon       83.87   83.87    85.88
Leukemia    91.66   93.05    94.44
NCI-60      58.33   26.66    68.33

7 Conclusions and Future Work

A new approach in the Machine Learning field is presented in this paper and applied to microarray gene expression data analysis. We use a distance-based classifier in combination with a BN, with the main idea of avoiding the negative effect that the application of the Laplace correction has on the a posteriori probability calculations, owing to the small number of cases in the databases used.

The results obtained indicate that the new approach can be used to outperform, in number of well-classified cases, a Bayesian Network used on its own as a classifier model, and also to obtain better results than the k-NN paradigm. It must also be pointed out that the BN structure obtained gives us a view of the conditional (in)dependence relations among the selected genes and the variable representing the class. This fact could be used by physicians to better understand the meaning of the existing relations.

As future work, a more sophisticated search approach should be used for the BN structure learning phase; evolutionary computation based techniques are promising candidates for that job [25]. Filter measures should also be tried for the gene selection process instead of the wrapper approach applied here.

Bibliography

[1] V. Aris and M. Recce. A Ranking Method to Improve Detection of Disease Using Selectively Expressed Genes in Microarray Data. In Proceedings of the First Conference on Critical Assessment of Microarray Data Analysis, CAMDA2000, 2000.
[2] A. Ben-Dor, L. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z. Yakhini. Tissue Classification with Gene Expression Profiles. Journal of Computational Biology, 7(3-4):559–584, 2000.
[3] R. Blanco, P. Larrañaga, I. Inza, and B. Sierra. Gene selection for cancer classification using wrapper approaches. International Journal of Pattern Recognition and Artificial Intelligence, 2004.
[4] D.M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507–554, 2002.
[5] G.F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309–347, 1992.
[6] T.M. Cover and P.E. Hart. Nearest neighbor pattern classification. IEEE Transactions on Information Theory, IT-13(1):21–27, 1967.
[7] J. Doak. An evaluation of feature selection methods and their application to computer security. Technical Report CSE-92-18, University of California at Davis, 1992.
[8] N. Friedman and M. Goldszmidt. Building classifiers using Bayesian networks. In AAAI/IAAI, Vol. 2, pages 1277–1284, 1996.
[9] N. Friedman and D. Koller. Being Bayesian about network structure. A Bayesian approach to structure discovery in Bayesian networks. Machine Learning, 50:95–125, 2003.
[10] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 286:531–537, 1999.
[11] I. Inza, P. Larrañaga, R. Etxeberria, and B. Sierra. Feature Subset Selection by Bayesian network-based optimization. Artificial Intelligence, 123(1-2):157–184, 2000.
[12] I. Inza, B. Sierra, R. Blanco, and P. Larrañaga. Gene selection by sequential search wrapper approaches in microarray cancer class prediction. Journal of Intelligent and Fuzzy Systems, 2002 (accepted).
[13] F.V. Jensen. Bayesian Networks and Decision Graphs (Statistics for Engineering and Information Science). Springer, 2001.
[14] J. Kittler. Feature set search algorithms. In C.H. Chen, editor, Pattern Recognition and Signal Processing, pages 41–60. Sithoff and Noordhoff, 1978.
[15] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In N. Lavrac and S. Wrobel, editors, Proceedings of the International Joint Conference on Artificial Intelligence, 1995.
[16] R. Kohavi and G. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273–324, 1997.
[17] R. Kohavi, D. Sommerfield, and J. Dougherty. Data mining using MLC++, a machine learning library in C++. International Journal of Artificial Intelligence Tools, 6:537–566, 1997.
[18] E. Lazkano and B. Sierra. Bayes-Nearest: a new hybrid classifier combining Bayesian network and distance based algorithms. Lecture Notes in Artificial Intelligence, 2902:171–183, 2003.
[19] L. Li, L.G. Pedersen, T.A. Darden, and C. Weinberg. Computational Analysis of Leukemia Microarray Expression Data Using the GA/KNN Method. In Proceedings of the First Conference on Critical Assessment of Microarray Data Analysis, CAMDA2000, 2000.
[20] W. Li and Y. Yang. How many genes are needed for a discriminant microarray data analysis? In Proceedings of the First Conference on Critical Assessment of Microarray Data Analysis, CAMDA2000, 2000.
[21] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers, 1998.
[22] T.M. Mitchell. Machine Learning. McGraw Hill, 1997.
[23] J. Pearl. Evidential reasoning using stochastic simulation of causal models. Artificial Intelligence, 32(2):247–257, 1987.
[24] P. Pudil, J. Novovicova, and J. Kittler. Floating search methods in feature selection. Pattern Recognition Letters, 15(1):1119–1125, 1994.
[25] D. Romero, P. Larrañaga, and B. Sierra. Learning Bayesian networks on the space of orderings with estimation of distribution algorithms. International Journal of Pattern Recognition and Artificial Intelligence, 18(4):45–60, 2004.
[26] D.T. Ross, U. Scherf, M.B. Eisen, C.M. Perou, C. Rees, P. Spellman, V. Iyer, S.S. Jeffrey, M. Van de Rijn, M. Waltham, A. Pergamenschikov, J.C.F. Lee, D. Lashkari, D. Shalon, T.G. Myers, J.N. Weinstein, D. Botstein, and P.O. Brown. Systematic variation in gene expression patterns in human cancer cell lines. Nature Genetics, 24(3):227–234, 2000.
[27] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6(2):461–464, 1978.
[28] C.E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27:379–423, 1948.
[29] B. Sierra and P. Larrañaga. Predicting survival in malignant skin melanoma using Bayesian networks automatically induced by genetic algorithms. An empirical comparison between different approaches. Artificial Intelligence in Medicine, 14:215–230, 1998.
[30] B. Sierra, N. Serrano, P. Larrañaga, E.J. Plasencia, I. Inza, J.J. Jiménez, P. Revuelta, and M.L. Mora. Using Bayesian networks in the construction of a bi-level multi-classifier. Artificial Intelligence in Medicine, 22:233–248, 2001.
[31] E.P. Xing, M.I. Jordan, and R.M. Karp. Feature Selection for High-Dimensional Genomic Microarray Data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML2001, pages 601–608, 2001.