Combinatorial Chemistry & High Throughput Screening, 2007, 10, 37-50


Computational Methods in the Development of a Knowledge-Based System for the Prediction of Solid Catalyst Performance

Joanna Procelewska (a), Javier Llamas Galilea (a), Frederic Clerc (b), David Farrusseng (b) and Ferdi Schüth* (a)

(a) Max-Planck-Institut für Kohlenforschung, Kaiser-Wilhelm-Platz 1, D-45470 Mülheim/Ruhr, Germany
(b) Institut de Recherches sur la Catalyse-CNRS-2, Av. A. Einstein, F-69626 Villeurbanne, France

Abstract: The objective of this work is the construction of a correlation between characteristics of heterogeneous catalysts, encoded in a descriptor vector, and their experimentally measured performance in the propene oxidation reaction. This paper explores the key issue in the modeling process, namely the selection of adequate input variables. Several data-driven feature selection strategies were applied in order to estimate the differences in variance and information content of the various attributes, and furthermore to compare their relative importance. Quantitative property-activity relationship techniques using probabilistic neural networks were used to create various semi-empirical models. Finally, a robust classification model was obtained that assigns selected attributes of solid compounds, used as input, to an appropriate performance class in the model reaction. It became evident that mathematical support for the primary attribute set proposed by chemists can be highly desirable.

1. INTRODUCTION

Reaction planning and computational library design have substantially changed the process of drug discovery, since they enable the selection of an adapted and improved subset of experiments among an almost infinite number of candidate molecules by applying virtual screening techniques [1].

The diversity profiling of drugs/molecules is the basic concept in the design of libraries of molecules. It relies on the "similar property principle", which assumes that structurally similar molecules should have similar biological activities. The similarity/diversity of two molecules can be assessed by measuring a "distance" between them; for a molecule library this becomes a distance matrix. The relationship which links the molecules to their activities is captured by means of a statistical QSAR model. The quantification of distances between molecules is the key issue in the QSAR approach. For this purpose, molecules should be described, starting from their composition and structure, in a way that localizes all molecules in a common linear search space. From the 3D structure of a molecule, a number of physico-chemical features can be computed. These quantitative values, together with values obtained from the sum formula (molecular weight etc.) or the two-dimensional structure (atom connectivities, presence of functional groups), constitute the so-called "QSAR descriptors". The set of all descriptors for a given molecule represents its fingerprint. Even if the level of information contained in the fingerprint is somewhat reduced with respect to the full 3D description of molecules, it may capture relevant information with respect to the activity profile one wants to assess. Using the fingerprints as new variables, all molecules can be represented in a new linear search space in which distances can be calculated using, for instance, Euclidean metrics.

Fig. (1). Scheme representing the different parameter spaces. Composition space (left), physico-chemical descriptor space (middle) and performance space (right).

Unfortunately, the QSAR approach cannot be transferred directly to the discovery of materials, especially of heterogeneous catalysts, because (i) complex solids (partially amorphous, micro-crystalline domains, metal-oxide interfaces) can hardly be characterized on the atomic scale, which is a serious obstacle for the fingerprint encoding mentioned above for molecules, and (ii) key descriptors are usually not known or can hardly be measured in the case of a diverse library of materials [2]. Moreover, the similar property principle is generally probably not valid.

*Address correspondence to this author at the Max-Planck-Institut für Kohlenforschung, Kaiser-Wilhelm-Platz 1, D-45470 Mülheim/Ruhr, Germany; E-mail: [email protected]
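As a minimal numerical illustration of the distance-matrix idea (a sketch, not code from the paper; the fingerprints below are random stand-ins for real descriptor vectors), pairwise Euclidean distances over a small library can be computed as:

```python
import numpy as np

# Sketch: turn a library of fingerprint vectors into a distance matrix,
# as used to quantify similarity/diversity. 5 x 32 random placeholders.
fingerprints = np.random.rand(5, 32)
dist = np.linalg.norm(
    fingerprints[:, None, :] - fingerprints[None, :, :], axis=-1
)
print(dist.shape)  # (5, 5); zeros on the diagonal
```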


Fig. (2). QSAR approach in heterogeneous catalysis. Composition space (left), physico-chemical descriptor space (middle) and performance space (right).

We have over the last years established a property-activity relationship approach for the discovery of solid state materials [3, 4]. The concept of descriptors (attributes) applied here encodes native physicochemical properties or calculated features of a solid catalyst as numerical values (input variables). The set of all descriptors for a given catalyst represents its descriptor vector. Coupling literature data with the elemental composition of the catalysts and information about synthesis parameters allowed the computation of highly informative attributes, from which the catalytic performance could be predicted once a statistical model linking the descriptors to the catalytic response had been established. A conceptual workflow illustrating the methodology is presented in Fig. 2. In this study, a rational methodology for the selection of highly informative descriptors is developed. In contrast to the "chemical intuition" used previously [4], the choice of attributes by a method based on information theory and data mining techniques is described. Investigations of the discriminant descriptors which are decisive in the prediction of catalytic properties were carried out.

2. PROPENE OXIDATION DATASET

We collected a library of 467 heterogeneous catalysts and checked their performance in the gas-phase oxidation of propene with oxygen (the same library as in previous studies) [4].

This reaction is a good measure of catalytic activity since it provides a wide spectrum of possible products. From the catalytic results, the catalysts were classified into five distinct classes using standard clustering algorithms [5]. The classes can roughly be characterized as (i) low activity total oxidation, (ii) medium activity total oxidation, (iii) high activity total oxidation, (iv) partial oxidation, and (v) oligomerization. On the other hand, 3296 attributes were systematically computed for each of the 467 catalysts, starting from both their elemental composition and the physico-chemical properties of the elements, oxides and ions [4] (Fig. 3). These are, for instance, enthalpies of formation of oxides, possible coordination numbers of the atoms, ionization energies, electronegativities, averages of such values for multicomponent catalysts, variances of such values for multicomponent catalysts, etc. This dataset is hereafter named X2. The advantage of using X2 attributes is two-fold: first, they contain a higher degree of information than the nominal composition, since they integrate thermodynamic, physical and electronic features; second, the attribute space of 3296 dimensions is linear, so that distance metrics can be calculated. Detailed information on the attributes is summarized in Tables 4-6. In addition, 19 discrete variables were added that describe the synthesis conditions (called X3). These include, for

Fig. (3). An example of descriptor calculation. Code in square brackets: [element] symbolizes the vector of element names that constitute the catalyst; in the same fashion, [%composition] symbolizes the vector of the corresponding element fractions. Code in curly brackets: {Molar mass} symbolizes the vector of molar masses of the corresponding elements of the catalyst. The functions oxides() and ions() return the corresponding lists of oxides and ions for the vector of elements.
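To make the construction in Fig. 3 concrete, the following sketch computes a few descriptors in the paper's naming scheme (wmecms and nvec are named in Section 4.1; sdecms combines the sdec operation of Table 6 with the ms feature code). The catalyst, its composition and the derived values are illustrative assumptions, not data from the library:

```python
import numpy as np

# Hypothetical three-component catalyst; composition given as molar fractions.
elements   = ["Mo", "V", "Te"]                   # [element] in Fig. 3
fraction   = np.array([0.6, 0.3, 0.1])           # [%composition]
molar_mass = np.array([95.95, 50.94, 127.60])    # {Molar mass} per element

nvec   = len(elements)                                    # number of elements
wmecms = (fraction * molar_mass).sum() / fraction.sum()   # wmec of ms
sdecms = molar_mass.std()                                 # sdec of ms
print(nvec, round(wmecms, 2), round(sdecms, 2))
```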


instance, the synthesis type, the support, the calcination type, the precursors and the additives (if any). Consequently, the data mining search space is composed of 3296+19 dimensions. We applied different machine learning techniques (mainly various types of neural networks) to establish multivariable relationships between the descriptor-based chemical codes of the catalysts and their experimentally determined catalytic behavior (Section 4.4; the models based on PNN outperformed the others). This allowed the dependence of the catalytic response on the materials properties to be mapped.

3. TECHNICAL ISSUES

The choice of the most appropriate descriptors from the 3296 computed attributes, capturing the physico-chemical properties of the solids collected in the library, is a highly complex task. In addition, the development and assessment of an appropriate QSAR model are not obvious. Several issues are faced.

3.1. The Idea of Attribute Pruning in the Indeterminate Case (Number of Attributes >> Number of Catalysts)

Given a large set of potential descriptors, such as a set of numerous attributes describing solids, it is necessary to find a small informative subset to distinguish categories of different catalytic performance. It has been confirmed by many empirical studies that the accuracy of classification techniques is not monotonic with respect to the number of features employed by the model and does not necessarily increase with the number of features. Some important cases are known where accuracy is lost by using too many features (Section 4.4). The problem is statistically inherent in classification: typically the true error of a designed classifier first falls with the use of more features and then, beyond some optimal number of features for a given sample size, begins to rise; this is called the peaking phenomenon [6]. Depending on the nature of the classification technique, the presence of irrelevant or redundant features can cause the system to focus attention on the idiosyncrasies of the individual samples and lose sight of the broad picture that is essential for generalization beyond the training set [7]. This problem is aggravated in indeterminate cases, when the number of variables available for inclusion in the statistical model is higher than the number of observations. This concerns the case studied here, where an informative classification should be carried out for a relatively small sample consisting of 467 data points in a matrix with around 3300 variables. The remarkable fact that predictions are possible even in this setting stems from the (sometimes hidden) property that, although the data are presented in a high-dimensional space, they actually have a much lower intrinsic dimensionality: a small informative set of features contains enough information to distinguish the catalytically relevant categories; in other words, there is some optimal number of features for a given sample size. This can be handled by implementing a feature selection routine that determines a small subset of relevant attributes which have a significant influence on the catalytic activity, and by using only these to construct the actual model.

The task of finding an optimal feature set of a given size is inherently combinatorial: to arrive at an optimal solution, all feature sets of that size must be checked, unless there is distributional knowledge that mitigates the search requirement. From an algorithmic perspective, feature selection can best be viewed as a heuristic search, where each state in the search space represents a particular subset of the available features. In all but the simplest cases, an exhaustive search of the state space is impractical, since it involves $2^N$ possible combinations, where N is the number of available features in the data space D, which is also called the descriptor space. Choosing a descriptor subset $D_S$ of size $N_S = |D_S|$ out of a descriptor set of size N requires the evaluation of

$\binom{N}{N_S} = \frac{N!}{N_S!\,(N - N_S)!}$        (1)

subsets [8].
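For a sense of scale, eq. (1) can be evaluated directly; the subset sizes below are illustrative assumptions, chosen near the sizes of the final descriptor sets discussed later:

```python
from math import comb

# Eq. (1) for the library's 3296 attributes and two example subset sizes.
print(comb(3296, 14))   # on the order of 10^38 subsets of size 14
print(comb(3296, 58))   # on the order of 10^125 - exhaustive search is hopeless
```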

Owing to the combinatorial intractability of checking all feature sets, many algorithms have been proposed to find good suboptimal feature sets; nevertheless, feature selection remains problematic.

3.2. Assessing the Importance of Variables (Definition of the Information Content of a Variable)

A common method for finding an approximately optimal descriptor set relies, in the first step, on the identification of features with high variability and high information content. In this study, finding high-variability attributes was complicated by the fact that the distribution of features describing properties of solid catalysts cannot be directly translated into variability, because their units and value ranges differ. For example, the first ionization energy (1ie) or attributes accounting for the ionic radius present a continuum of values, whereas attributes providing information on the number of element oxides adopt a narrow range of discrete values. Such different attributes describing solids cannot be subjected to conventional statistics, i.e., they cannot be systematically compared using measures such as standard deviations: a standard deviation, or any statistical measure derived from the variance, depends on the identification of a central mean or mode. Thus, a more general analysis and comparison of feature variability requires a uniform data representation, independent of the type of attribute and the value range it can adopt. An entropic formulation termed Shannon entropy [9], applied in this study, allows the comparison of the information content of various attributes.

3.3. Redundancy vs Collinearity

To produce statistical models with good predictive power based on a small number of relevant attributes forming the descriptor vector, additional pruning must be performed to decrease the redundancy and multicollinearity contained in the data. Redundancy (an exact linear dependence between a subset of the columns in the data matrix, so that at least one column in this subset contributes no unique information) should always be avoided. However, if there exists a multiple correlation in a subset of features that deviates somewhat from a perfect linear combination (multicollinearity), this subset may reflect some class-specific properties characteristic of a series of related catalysts and should not simply be discarded. In this paper,


algorithms will be described which eliminate redundancy but allow collinearity to be retained to an extent which can be chosen.

3.4. Selection of the Best Attribute Subsets for Classification

Since it is not possible to know a priori which solid properties are most relevant for the catalytic activity, different strategies were employed in this work, starting from a comprehensive set of attributes, to obtain small subsets of features capable of producing optimal classification accuracy.


4. METHODS

4.1. Overall Methodology for the Rational Design of Descriptors

We built a preprocessing workflow to select the optimal set of predictive features in two steps: (i) identification of the most variable descriptors and (ii) an orthogonalization process.

In previous studies [3, 4], chemical knowledge and intuition were used to formulate the initial attribute set. For example, the weighted average of the molar mass of all elements that compose the catalyst was considered an important descriptor (wmecms), as was the number of elements in the catalyst (nvec), etc. In the following discussion, this attribute collection is denoted the expert set. Due to the high dimension of the data set and the convoluted interrelations between variables, the process of feature selection was assisted with the Relief algorithm [10]. This method was applied to assign a relevance weight indicating the discriminative power of each attribute with respect to the different classes of catalytic behavior. The main drawback of the randomized Relief algorithm, which finds all weakly relevant features, is its inability to eliminate redundant features. Thus, although the construction of the descriptor vector based on chemical intuition was encouraging, many of the components of the vector were in fact redundant. Since the redundancy is invisible to the naked eye of chemists and intrinsically cannot be eliminated with the Relief method, in this work the application of a collective assessment to identify variables with high information content is presented.

3.5. Handling of Descriptor Vectors with Missing Values

In this study, the issue of eliminating descriptors with missing values has been addressed. The presence of such descriptors is a direct consequence of the fact that not all information about the properties of elements, element oxides and element ions can be found in the literature. This concerns about 25% of all descriptors, which have entries for some feature accessible only for a few elements, e.g., the dielectric constant. The general procedure for replacing missing values applied in the previous work [4] was to substitute them by another value that does not significantly affect the whole calculation, i.e., zero. However, some influence on the results is nevertheless to be expected, and it was therefore decided here to improve the method by using NULL values as entries for the unknown values. In contrast to the previous approach, this reduces the uncertainty of the attributes picked for the final feature set. This quality filtration produces more robust prediction models and leads to more relevant descriptors. The above restriction significantly reduced the size of the data set, to 638 attributes. This requirement may eliminate some decisive attributes. For example, all variables concerning the ionic covalent parameter or the normalized free formation enthalpy of the most stable metal oxide had to be dropped, although one may expect these values to be important for the prediction of oxidation performance. However, in order to use the algorithms discussed in the following, this cannot be avoided.
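A minimal sketch of this quality filter, assuming the attributes live in a pandas DataFrame with unknown entries stored as NaN rather than 0 (the DataFrame, its column names and values are illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the descriptor matrix: rows = catalysts, columns =
# descriptors, NULLs stored as NaN (Section 3.5), not as zero.
attributes = pd.DataFrame({
    "wmecms": [95.1, 78.4, 60.2, 88.0],
    "dc":     [np.nan, np.nan, np.nan, 4.5],  # dielectric constant: mostly unknown
    "ar":     [1.45, np.nan, 1.32, 1.40],
})

complete   = attributes.loc[:, attributes.notna().all()]          # no NULLs at all
half_known = attributes.loc[:, attributes.notna().mean() >= 0.5]  # >= 50% known
print(list(complete), list(half_known))  # ['wmecms'] ['wmecms', 'ar']
```

The 50% variant corresponds to the softer threshold used for the initial pruning described in the next paragraph.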

Fig. (4). Scheme for the data workflow.

Identification of the Most Variable Descriptors

The whole X2 data set consisted of 3296 attributes for each of the 467 catalysts. The initial pruning, done by identification and removal of variables with homogeneous information content, containing either only NULLs or identical values (mostly zeroes), eliminated 382 attributes. Attributes containing values that are currently unknown for some catalysts confuse the data structure. It was assumed that an attribute would not be particularly meaningful if its number of missing values were higher than the number of existing ones. This assumption eliminated a further 448 descriptors. Requiring at least 50% known values is certainly an arbitrary threshold, but it seemed to be a reasonable choice.

4.2. Shannon Entropy for Variable Weight Assessment

Descriptor variability independent of value ranges and distributions was calculated by application of an entropic formulation termed Shannon entropy, SE [9]. SE is defined as:

$SE = -\sum_i p_i \log_2 p_i$        (2)

In this formulation, $p_i$ denotes the probability that a data point, computed from a count c, adopts a value within a specific data interval i. Thus, if $c_i$ is the number of all occurrences of i in the whole data set, then $p_i$ is calculated as:

$p_i = c_i / \sum_i c_i$        (3)

The logarithm to the base two in eq 2 represents a scale factor which permits Shannon entropy to be considered a metric of information content. Probabilities and, in turn, SE values can be calculated for any set of data that is divided into evenly spaced intervals (bins). Although attributes have intrinsically different numbers of possible values (e.g. the ionic radius and the number of oxidation states in element oxides), SE values can be compared directly, provided the binning scheme is uniform. Captured by the SE metric, attribute variability may vary from zero for a single-valued attribute to a maximum of the logarithm to the base two of the number of bins utilized. Thus, the major benefit of these manipulations is that descriptors with very different distributions and ranges are rendered comparable if their numbers of bins (different elements in the value space) are equal. Since the number of bins can be different for different attributes (for a two-value set SEmax = 1 and for an eight-value set SEmax = 3), it is useful to establish a bin-independent SE value, a scaled SE (sSE), that can be compared generally, regardless of the number of bins used for the histogram representation of the data sets. The scaled SE is calculated by dividing the computed SE value by the maximum possible SE value for the number of bins used:

$sSE = SE / \log_2(\mathrm{bins})$        (4)

The value of sSE ranges from 0 (all values identical) to 1 (e.g. a vector consisting of all different values) and is fully comparable between vectors having different numbers of bins.
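A compact sketch of the sSE computation of eqs. (2)-(4), assuming NULLs are represented as NaN and a fixed, uniform binning scheme (the bin count of 10 is an assumption; the paper's own implementation was a C#/.NET application):

```python
import numpy as np

def scaled_shannon_entropy(values, bins=10):
    """sSE following eqs. (2)-(4): histogram the attribute, convert counts
    to probabilities, compute SE and scale by its maximum, log2(bins).
    NaN entries (NULLs) are ignored; bins=10 is an assumed binning scheme."""
    v = np.asarray(values, dtype=float)
    v = v[~np.isnan(v)]
    counts, _ = np.histogram(v, bins=bins)
    p = counts[counts > 0] / counts.sum()   # eq. (3), empty bins dropped
    se = -(p * np.log2(p)).sum()            # eq. (2)
    return se / np.log2(bins)               # eq. (4)

print(scaled_shannon_entropy([1.0] * 100))           # 0.0: single-valued attribute
print(scaled_shannon_entropy(np.random.rand(1000)))  # close to 1: spread-out values
```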

Thus, the variability of the 2466 attributes left after the initial pruning steps was determined. Technically, performing the entropy analysis requires computing histograms that cover the entire value range of each attribute in the database. The statistical analysis of the value distributions and the sSE calculations were performed with a C# / .NET Framework 2.0 application written by the authors.

4.3. Unsupervised Forward Selection Algorithm for Orthogonalization of the Data Set

Although the data set preprocessed with the Shannon entropy approach preserves as much variance as possible, it still cannot be used directly for model building. To perform this task on the preprocessed data set, we designed an Unsupervised Forward Selection (UFS) [11] algorithm. The UFS algorithm starts with the two variables with the smallest pair-wise correlation (measured by the squared correlation coefficient, R2) and selects additional variables on the basis of their multiple correlation with those already chosen, which should be minimal, thus building a subset of variables that are as close to orthogonality as possible. The selection process halts when the R2 value of each remaining variable with those already selected exceeds some preassigned limit R2max.
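The following sketch shows the UFS selection loop as described above, reconstructed from this description and from Whitley et al. [11] rather than taken from the authors' code; it assumes constant columns have already been removed so that standardization is well defined:

```python
import numpy as np

def ufs(X, r2_max=0.9):
    """Unsupervised Forward Selection sketch. X: (samples, variables).
    Start from the least-correlated pair, then repeatedly add the variable
    with the smallest squared multiple correlation (R^2) with the selected
    set; stop once every remaining variable exceeds r2_max."""
    n, m = X.shape
    Xs = (X - X.mean(0)) / X.std(0)               # standardize columns
    C2 = np.corrcoef(Xs, rowvar=False) ** 2
    np.fill_diagonal(C2, np.inf)
    i, j = np.unravel_index(np.argmin(C2), C2.shape)   # least-correlated pair
    selected, remaining = [i, j], set(range(m)) - {i, j}
    while remaining:
        S = Xs[:, selected]
        best, best_r2 = None, np.inf
        for k in remaining:
            beta, *_ = np.linalg.lstsq(S, Xs[:, k], rcond=None)
            resid = Xs[:, k] - S @ beta
            r2 = 1.0 - resid @ resid / (Xs[:, k] @ Xs[:, k])
            if r2 < best_r2:
                best, best_r2 = k, r2
        if best_r2 > r2_max:                      # all remaining too correlated
            break
        selected.append(best)
        remaining.discard(best)
    return selected
```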

Although it seems obvious to consider a set of features optimal if it minimizes the R2max factor, the process would then end with too small a number of selected attributes. Thus, the UFS algorithm was applied with three different threshold values for the correlation factor R2max: 0.8, 0.9 and 0.99. This finally leads to sets of 36, 58 and 149 descriptors with high information content, respectively. The selected attributes cover the whole significant sSE spectrum, from 0.513 to 0.983, which confirms the choice of the border entropy value.

4.4. IPS for Final Feature Selection

Subsequently, the different feature selection strategies were evaluated by applying artificial neural network (ANN) classification modeling. Since interactions between chemical systems are often nonlinear by nature, the ANN methodology has been successfully applied in QSAR studies, yielding in most cases better results than multilinear regression analysis [12, 13]. For developing the ANN architecture as well as for training and data validation we applied the commercially available software package Statistica Neural Networks 6.0 as an evaluating tool. In the search for an appropriate ANN model, both Multi Layer Perceptron (MLP) and Probabilistic Neural Network (PNN) architectures were applied, using the Intelligent Problem Solver (IPS) of Statistica. The different models were built using the training set, and the best performing network architecture for each method was then chosen to predict the outcome on the test set. The IPS-based experiments showed that a PNN architecture with one hidden layer was the most appropriate for solving the problems at hand. Hundreds of clones of that architecture were retrained with different numbers of input parameters. After training of the PNN, its classification ability was checked by calculating the percentage of compounds correctly classified with respect to their catalytic performance. The best models obtained in these experiments were additionally retrained and modified manually. Selecting too many features leads to the well-known overfitting. This can be identified by an independent validation set or by cross-validation, as applied here. For a proper cross-validation (repeatedly splitting the initial data set into subsets for training and testing), feature selection was performed separately during each iteration, using only information from the current training set. The performance of all algorithms was evaluated using leave-one-out cross-validation, with the feature selection step performed inside the cross-validation loop. We averaged the predicted activity values from 10 random leave-one-out experiments to develop a quantitative structure-activity relationship, because different runs of the training procedure yield different networks (due to the randomized initialization of weights and order of presentation of objects).
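Since the internals of Statistica's IPS cannot be reproduced here, the sketch below illustrates the two key ingredients in generic form: a minimal PNN (a Parzen-window classifier in the sense of Specht [15], with a smoothing factor sigma) and a leave-one-out loop that re-runs feature selection inside each fold. `select_features` is a hypothetical callback standing in for, e.g., UFS on the training fold:

```python
import numpy as np

def pnn_predict(X_train, y_train, X_test, sigma=0.1):
    """Minimal probabilistic neural network: every training vector adds a
    Gaussian kernel of width sigma; the class with the highest average
    kernel density wins."""
    classes = np.unique(y_train)
    preds = []
    for x in X_test:
        d2 = ((X_train - x) ** 2).sum(axis=1)     # squared distances
        k = np.exp(-d2 / (2.0 * sigma ** 2))      # Gaussian kernels
        scores = [k[y_train == c].mean() for c in classes]
        preds.append(classes[int(np.argmax(scores))])
    return np.array(preds)

def loo_accuracy(X, y, select_features):
    """Leave-one-out CV with feature selection re-run in every fold, so no
    information from the held-out catalyst leaks into the selection."""
    n, hits = len(y), 0
    for i in range(n):
        mask = np.arange(n) != i
        cols = select_features(X[mask], y[mask])  # training fold only
        yhat = pnn_predict(X[mask][:, cols], y[mask], X[i:i + 1][:, cols])
        hits += int(yhat[0] == y[i])
    return hits / n
```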

5. RESULTS

5.1. Attribute Selection Using Shannon Entropy

To leave only the data with a high level of heterogeneity, all attributes with a scaled Shannon entropy value smaller than 0.5 were removed (195 attributes). This cutoff value was selected based on the overall distribution of sSE values shown in Fig. 5. The figure shows that in the range around sSE = 0.5 the number of attributes increases significantly. Although the cutoff point at sSE = 0.5 is somewhat arbitrary and higher values of 0.6 or 0.65 could also have been selected, it seemed reasonable to use the lower cutoff in order not to lose too many attributes which have significant variability and thus may carry useful information for the correlation.

Fig. (5). Scaled Shannon Entropy calculated for the different attributes.

A number of attributes, including the oxidation state and some properties of the oxides (melting and boiling point, heat capacity), have little variability. These features are thus expected not to be useful for comparing the catalysts in the data set. High-entropy features, on the other hand, are expected to perform well in the design of a diverse library. Next, we noted that many attributes have identical distributions and identical entropy values. Exact analysis showed that in most (but not all) cases they also contain identical values. Since the information represented by these features is redundant, they can safely be removed (310 further attributes).
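A one-line check for such exact duplicates, assuming a pandas DataFrame of attribute columns (the frame and its values are a toy assumption):

```python
import pandas as pd

# Toy example: columns "a" and "b" are identical, so "b" is redundant.
X2 = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [0, 1, 0]})
X2_unique = X2.loc[:, ~X2.T.duplicated()]  # duplicate columns = duplicate rows of X2.T
print(list(X2_unique))  # ['a', 'c']
```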

In Table 1 the most and least variable attributes in the database are reported. We expected that both the intrinsic variability of attributes and their extrinsic variability with respect to the catalysts under investigation determine attribute entropy.

Table 1. Descriptors with Highest and Lowest Entropy

Descriptor         Elements  Zeroes  Nulls  SE     sSE    StDev
mincountvlcnie_os         2       0      0  0.022  0.022     21.679
mindifvhosoe_d            2     463      1  0.056  0.056      1.238
minmeanvhosoe_bp          6       1      3  0.163  0.063   5148.324
minminvlosoe_hc           6       0      2  0.174  0.067    640.699
minmeanvhosoe_d           8       0      1  0.206  0.069     41.187
wmecdifvhcnie_l           7     282    179  0.200  0.071      0.211
minmeanvlosoe_mp          5       0      1  0.186  0.080   4404.365
minmoe_bp                 5       1      1  0.186  0.080   3157.765
wmecpe                  401       0      0  8.480  0.981     50.796
wmecminoe_ffe           403       0      0  8.489  0.981  12532.682
wmecmaxvhosie_ir        403       0      0  8.489  0.981     20.400
maxminvhosie_cn           2       0      0  0.983  0.983    113.367
wmecminvlcnie_l         366       0     47  8.377  0.984      7.147
wmecmaxvlosie_l         359       0     59  8.357  0.985      9.088
wmmcmaxvlcnie_l         304       0    125  8.128  0.985     10.728
mincountvhosoe_d          2     215      0  0.995  0.995     15.875
mincountvlosoe_d          2     251      0  0.996  0.996     14.697


After this preprocessing strategy, involving the discarding of variables with very low information content, a set of 1961 attributes with sufficiently high information content was left. This set was used for further analysis. Although, for practical purposes, the analysis was limited to attributes from the X2 set, the variables that describe the synthesis procedure (X3) could also be studied using this approach. However, as mentioned earlier, in the framework of this study those parameters which refer directly to the experimental procedure and are considered important for model building remain included in the correlation.

5.2. Representative Data Sets

As part of a structured approach to model building, various approaches for data preprocessing are presented in this study in order to produce statistical models with good predictive power, based on a small number of relevant properties. To assess the optimal data set, various settings were explored. Data sets of a given size were created for the different feature selection strategies and designated A-J. First, in order to compare the different pruning algorithms among themselves, we constructed a dummy data set A by selecting 56 random features from the initial pool of 697 attributes. Data set B, called the expert set and widely characterized in the previous study [4], was created mainly on the basis of chemical knowledge and intuition. Unfortunately, many variables in the expert set possess unknown values, and thus this data set contains many attributes which had to be discarded in order to apply the UFS algorithm. Therefore, to generate a more comparable data set, all variables with entries containing unknown values were removed from the original set B to create data set C. To increase the predictive ability of the sets based on chemical intuition, these sets were further modified by different methods to create four new data collections, D to G. First, the UFS algorithm was applied to data set C to assure the orthogonality of the input set. Data sets D and F were formed by reducing set C to an orthonormal basis, selecting those expert features whose squared correlation coefficient with the other variables was smaller than the threshold value 0.8 (D) or 0.9 (F). These data sets were then used as the basis of two new, expanded

sets, E and G. For their creation, additional features were selected from the pool of 662 remaining attributes not included in the original expert set C [14], using as selection criteria R2max values of 0.8 and 0.9, respectively. Finally, data sets H-J were created by direct application of the forward selection algorithm to the initial pool of data, without any prior basis.

5.3. Evaluation of the Data Sets

As expected, the misclassification rate did not change dramatically with small changes in the smoothing factor [15]. The QPAR models were assessed by comparing the predictions made by the model with the observed classification of the catalysts according to their catalytic activity. The choice of the most useful network was based on its quality of prediction of the catalyst performance class. As can be seen from Tables 7-16 (Supporting Information), the differences between the classification performance for the prediction and test sets were small for all clusters, which is a good sign for the predictive power of the trained PNN. The quality of the different feature selection strategies described above, in terms of their prediction accuracies and average errors in the ANN modeling, is given in Table 2. The detailed prediction rates are reported in the confusion matrices, Tables 7-16 (Supporting Information). The prediction performance for descriptor sets D and G is slightly improved compared to data sets B and C, which indicates the usefulness of orthogonalization. However, none of these expert sets could truly provide a significant improvement in the quality of the classification compared to the models H-J. The plausible explanation for this phenomenon is that, although chemical knowledge and intuition have considerable potential for the design of chemical libraries, these features are poorly structured and not well orthogonalized. The redundancy is invisible to the naked eye of chemists. Mathematical support of feature selection by an expert is thus highly desirable. The classification capabilities of the expert set are almost identical to those obtained with random feature selection. Nevertheless, the PNN models built for set A are much less stable and show more symptoms of overlearning.

Table 2. Results of PNN Modeling for the Different Feature Selection Strategies

Set  Set Tag                        #X2 Features  Network Architecture  Perform.  Error
A    Random Choice                  56            PNN 62:96-236-5:1     0.47      0.354
B    Expert Set (CI)                56            PNN 60:89-188-5:1     0.47      0.355
C    Expert Set Corr (CI, ¬NULL)    36            PNN 54:122-235-5:1    0.47      0.357
D    UFS(CI, 0.8)                   14            PNN 27:89-235-5:1     0.50      0.350
E    UFS(CI, 0.8) + UFS(All, 0.8)   33            PNN 48:71-235-5:1     0.48      0.354
F    UFS(CI, 0.9)                   20            PNN 37:95-235-5:1     0.49      0.349
G    UFS(CI, 0.9) + UFS(All, 0.9)   56            PNN 58:63-235-5:1     0.51      0.342
H    UFS 0.8                        37            PNN 37:91-236-5:1     0.57      0.339
I    UFS 0.9                        58            PNN 62:81-236-5:1     0.57      0.346
J    UFS 0.99                       149           PNN 35:57-236-5:1     0.48      0.364


The unsatisfying results for the expert system are a direct consequence of the poorly structured and poorly orthogonalized variation of the attributes in data sets B-G. To confirm this, an additional experiment was performed. The UFS algorithm as presented gives the user no control over the variables to be selected, and thus some "supervising", forcing the entry of favorable variables into the data set, was introduced. The UFS algorithm was modified to promote the selection of attributes from the expert set, allowing the user to override the automatic choice. The UFS routine starts with the two variables with the smallest pair-wise correlation coefficient and selects additional variables based on their multiple correlations, R2, with those already chosen. At this point, the expert descriptors can be favored by selecting, instead of just one, an arbitrary number of descriptors (3, 5, 10, ...) with the best projection onto the already created subspace. If one of them belongs to the set of descriptors selected by chemical intuition, this one is chosen even if it is more correlated than the others.

Table 3. Number of Expert Set Descriptors in the Sets H-J Depending on the Favorization Factor

R2max  1 Favored  3 Favored  5 Favored  10 Favored
0.8    3          1          2          2
0.9    4          2          2          2
0.99   8          7          8          10

In Table 3 the results of this combined supervised and unsupervised selection approach are summarized. Increasing the number of favored variables does not translate directly into the number of finally selected expert descriptors, indicating again the high redundancy in the expert set, which explains its relatively poor performance. The predictive ability of the neural networks increases significantly for sets H and I. However, the performance for set I does not appear to improve significantly further when 58 X2 descriptors are used instead of the 37 in set H. This indicates that sets of parameters determined by pruning methods may still contain some collinear variables, the presence of which does not influence the models compared to the kernel set. However, care must be taken in such cases, because introducing many irrelevant descriptors may also affect the performance to some extent. It is likely that the use of 149 descriptors in model J cannot improve the performance because of the increase in chance correlation and the noise generated by irrelevant descriptors. By applying the UFS algorithm with a cutoff of 0.99, only redundant variables and those with a very high degree of multicollinearity were removed, and thus the prediction rate fell back to the range of the random set A. Summarizing, the descriptor vectors presented in Table 2 were selected in order to check the sensitivity of a predictive model with respect to two main issues: redundancy and multicollinearity. Two main conclusions can be drawn if the aims are minimization of chance correlation and improvement of the quality of the set: removing redundancy from the data set is crucial, while some level of multicollinearity, which needs to be determined empirically for each data set, can be tolerated or may, in fact, even be beneficial.


Although the PNN network architecture PNN 35:57-236-5:1, with 35 nodes in the input layer, is an example of a data set where significant pruning of the number of attributes occurs during model building, satisfactory predictions could not be achieved. This confirms that an initial dimension reduction of the input data prior to the application of an ANN is more beneficial in practice than a dimension reduction performed by the ANN itself.

5.4. Interpretation of the Input Descriptors

To further understand the relevance of the input variables, the contribution of each feature to the classification by the ANN models was determined by measuring the sensitivity of the classification to a change in the variables. In this way, the attributes of set X2 were ranked according to their significance in the classification. Based on this ranking, it was possible to identify the more and less significant attributes included in the best models that can correctly classify the samples into their performance categories and reduce the misclassification. Considering the attributes related to properties of the elements, variables based on the atomic radius, the number of element oxides and the electron affinity seemed to be of major importance, whereas the element-element bond strength and the number of oxidation states showed little importance. Among the attributes related to the element oxides, the melting point and the free formation enthalpy seemed to play an important role, while the boiling point and heat capacity were mostly ignored. Among the attributes related to the element ions, no especially significant or insignificant properties were identified, although all descriptors based on the ionic covalent parameter had been dropped because they included NULL values. The highest ranks in the attributes' significance were found for three attributes from the X3 set ("synthesis method", "support used" and "applied calcination temperature"). This is reasonable from a chemical point of view, since it is known that the performance of many catalysts is strongly influenced by the applied synthesis protocol. On the other hand, it is a result that chemical intuition would also have given. This is a strong argument for the reasonability of the applied methodology in the field of in silico prediction of catalytic activity.

6. CONCLUSIONS

Feature selection is a critical step in building models to classify catalysts and predict the outcomes of catalytic experiments based on high-dimensional data. In this work, different methods to define a minimal set of attributes that can correctly classify the data set into its performance categories have been explored. The preprocessing procedure advocated here is consistent with hybrid feature selection approaches: identification of variables with a significant projection onto the response, elimination of irrelevance, and addressing of redundancy and multicollinearity. The use of attributes selected by pruning methods can improve the prediction abilities of neural networks compared to those calculated using the unpruned sets of variables. For the analysis of descriptor distributions and for comparing differences in their information content, a nonparametric variability estimator of the data spread was applied, which is


sensitive to the possible values that a descriptor can adopt. Shannon entropy calculations were carried out to compare the variability of very different attributes, irrespective of their units and value ranges; differences in descriptor entropy provide a possible route to detect class-specific features of compounds, proportional to the number of choices that are available to a system. This strategy appears even more promising when linked, prior to the modeling, with an orthogonalizing tool which eliminates highly intercorrelated attributes. The orthogonalization process is necessary to eliminate intercorrelated attributes and to explore the space based on a limited number of the most relevant features with unique information and minimal collinearity. The analysis showed clearly that some error terms are larger for poorly orthogonalized systems, providing another empirical verification of the theoretical considerations.

Chemical intuition is a method of choice for performing the initial dimension reduction. The analysis discussed above, however, suggests that more advanced feature selection algorithms can substantially improve the prediction of catalyst performance. In practice this means that mathematical support for the primary attribute set proposed by experts can be highly desirable. However, for such a complex system with many interfering performance classes, optimal prediction errors could still not be achieved. Nonetheless, we believe that with larger catalyst libraries as training sets, it will be possible to improve the sensitivity of these models and their specificity for the classification of catalytic activity.

SUPPORTING INFORMATION

Descriptors' Construction

Chemical knowledge and intuition were used to formulate three different types of attribute sets [4]: X1 accounts for the composition of all catalysts (60 attributes in total). X2 consists of physicochemical parameters calculated on the basis of X1 from tabulated data (3296 in total). Classic information about each element, e.g. molar mass, Pauling electronegativity, first ionization energy, etc., and more advanced data about the oxides and ions of each element were collected from the literature, mainly the Handbook of Chemistry and Physics. A summary of these features is presented in Table 4. Finally, the X3 set contains the main parameters of the last synthesis step of the solid catalysts. It consists of 19 categorical (mostly binary) variables, which provide information on the last synthesis step and were estimated to have a strong influence on the catalytic performance. They were therefore not subjected to the feature selection process and were all included in the final attribute set. Since attribute set X1, consisting of the catalyst compositions, showed very poor variation, and since this information is implicitly included in set X2, it was decided not to include this set in the formation of the final descriptor vector. The feature selection task thus consists of selecting discriminative attributes from set X2. Each feature is encoded with its own unique code. The elements' features have been considered as the primary attributes, the ions' and oxides' features as auxiliary ones. The respective properties of the elements, ions and oxides were computed in a combinatorial manner using the aggregate functions presented in Tables 5 and 6. Additionally, two other features, nffefmsmo (normalized free formation enthalpy for the most stable metal oxide) and sedmsmoom (smallest energy difference between the most stable metal oxide and another metal oxide), were calculated. All these sets were computed using Microsoft SQL Server 2005. Information about missing values has been preserved: they have not been replaced by any other value (e.g. 0).

Table 4. Primary Features Gathered for All Elements, Ions and Oxides

Feature Name                            Feature Code
Element
  molar mass                            ms
  electronegativity                     pe
  electron affinity                     ea
  atomic radius                         ar
  first ionization energy               1ie
  bond strength (element-element)       bsee
  bond strength (element-oxide)         bseo
  number of element's oxides            no
  number of element's oxidation states  osc
Element Oxide
  formation's free enthalpy             ffe
  melting point                         mp
  boiling point                         bp
  density                               d
  dielectric constant                   dc
  oxidation state                       oseo
  heat capacity                         hc
Element Ion
  oxidation state                       os
  coordination number                   cn
  ionic radius                          ir
  ionic covalent parameter              icp
  optical basicity                      l

All statistical analyses and data exploration were carried out using the Statistica 6.0 software package. Tables 7-16 reveal the number of correctly classified catalysts, how many were misclassified, and for which classes the misclassification occurred. The quality of the prediction is derived from the confusion matrix and is estimated by the prediction rate. The prediction rate accounts for the correctly classified cases in the respective prediction class and can be considered a benchmark for the quality of the prediction. Many of the results obtained here illustrate a tradeoff between overlearning and generalization.
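A small sketch of how these quantities follow from a confusion matrix, under the assumption that rows index the true classes and columns the predicted ones (the example counts are toy values, not data from Tables 7-16):

```python
import numpy as np

def confusion_rates(cm):
    """Per-class prediction rate (correct cases within a predicted class),
    per-class sensitivity (correct cases within a true class), and the
    overall prediction rate, from cm[true_class, predicted_class] counts."""
    cm = np.asarray(cm, dtype=float)
    prediction_rate = np.diag(cm) / cm.sum(axis=0)  # column-wise
    sensitivity = np.diag(cm) / cm.sum(axis=1)      # row-wise
    overall = np.trace(cm) / cm.sum()
    return prediction_rate, sensitivity, overall

cm = [[38, 0, 2],
      [1, 49, 0],
      [3, 5, 37]]
print(confusion_rates(cm))
```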


Table 5. Definition of Attributes Generated for the Ions and Oxides

Based on  Main describer  Part describer                                 Operations
Oxide     oe              (all oxides)                                   mean, max, min, dif
Oxide     oe              vlosoe (oxides with minimal oxidation state)   mean, max, min, dif, count
Oxide     oe              vhosoe (oxides with maximal oxidation state)   mean, max, min, dif, count
Ion       ie              (all ions)                                     mean, max, min, dif, nv, sum
Ion       ie              vlcnie (ions with minimal coordination number) mean, max, min, dif, count
Ion       ie              vhcnie (ions with maximal coordination number) mean, max, min, dif, count
Ion       ie              vlosie (ions with minimal oxidation state)     mean, max, min, dif, count
Ion       ie              vhosie (ions with maximal oxidation state)     mean, max, min, dif, count

For each part describer, the operations denote: mean (m) = average of the feature values for the respective subset; max = maximal value; min = minimal value; dif = difference between the maximal and minimal value; count = number of oxides/ions in the respective subset; nv = number of different values of the feature for all ions; sum = sum of the feature values for all ions.

Table 6. Operations Used for the Creation of the Catalyst Descriptors

Describer  Description
meanec     Average of the feature values
meanmc     Average of the feature values for the metals only
meanhc     Average of the feature values for the semi-metals only
meanbc     Average of the feature values for all non-metals
wmec       Weighted mean of the feature values
wmmc       Weighted mean of the feature values for the metals only
wmhc       Weighted mean of the feature values for the semi-metals only
wmbc       Weighted mean of the feature values for all non-metals
sdec       Standard deviation of the feature values
sdmc       Standard deviation of the feature values for the metals only
sdhc       Standard deviation of the feature values for the semi-metals only
sdbc       Standard deviation of the feature values for all non-metals
max        Maximal value of the feature
min        Minimal value of the feature
dif        Difference between the maximal and the minimal value of the feature
sum        Sum of the feature values

Table 7. Analysis Reports of PNN Performance for the Expert Set with 56 Descriptors (PNN 60:89-188-5:1). Confusion matrices for the training, selection and test sets; overall prediction rates: 92.55% (training), 46.74% (selection), 48.91% (test).

Table 8. Analysis Reports of PNN Performance for the Expert Set with 36 Non-Null Descriptors (PNN 54:122-235-5:1). Confusion matrices for the training, selection and test sets; overall prediction rates: 93.19% (training), 46.55% (selection), 49.14% (test).

Table 9. Analysis Reports of PNN Performance for Set B-UFS(CI, 0.8) (PNN 27:89-235-5:1). Confusion matrices for the training, selection and test sets; overall prediction rates: 88.51% (training), 50.00% (selection), 46.55% (test).

Table 10. Analysis Reports of PNN Performance for Set C-UFS(CI, 0.8)+UFS(All, 0.8) (PNN 48:71-235-5:1). Confusion matrices for the training, selection and test sets; overall prediction rates: 88.94% (training), 48.28% (selection), 52.59% (test).

Table 11. Analysis Reports of PNN Performance for Set D-UFS(CI, 0.9) (PNN 37:95-235-5:1). Confusion matrices for the training, selection and test sets; overall prediction rates: 86.81% (training), 49.14% (selection), 48.28% (test).

Table 12. Analysis Reports of PNN Performance for Set E-UFS(CI, 0.9)+UFS(All, 0.9) (PNN 58:63-235-5:1). Confusion matrices for the training, selection and test sets; overall prediction rates: 92.77% (training), 50.86% (selection), 46.55% (test).

Table 13. Analysis Reports of PNN Performance for Set F-UFS 0.8 (PNN 37:91-236-5:1). Confusion matrices for the training, selection and test sets; overall prediction rates: 91.53% (training), 56.52% (selection), 51.30% (test).

Table 14. Analysis Reports of PNN Performance for Set G-UFS 0.9 (PNN 62:81-236-5:1). Confusion matrices for the training, selection and test sets; overall prediction rates: 93.22% (training), 56.52% (selection), 53.91% (test).

Table 15. Analysis Reports of PNN Performance for Set H-UFS 0.99 (PNN 35:57-236-5:1). Confusion matrices for the training, selection and test sets; overall prediction rates: 88.56% (training), 47.83% (selection), 56.52% (test).

Table 16. Results for the Randomly Chosen Descriptors, Analysis Reports (PNN 62:96-236-5:1). Confusion matrices for the training, selection and test sets; overall prediction rates: 90.68% (training), 46.96% (selection), 53.04% (test).


REFERENCES

[1] Gordon, E.M.; Kerwin, J.F., Eds. Combinatorial Chemistry and Molecular Diversity in Drug Discovery; Wiley: New York, 1998.
[2] Klanner, C.; Farrusseng, D.; Baumes, L.; Mirodatos, C.; Schueth, F. QSAR Comb. Sci., 2003, 22, 729-736.
[3] Klanner, C.; Farrusseng, D.; Baumes, L.; Lengliz, M.; Mirodatos, C.; Schueth, F. Angew. Chem. Int. Ed. Engl., 2004, 43, 5347-5349.
[4] Farrusseng, D.; Klanner, C.; Baumes, L.; Lengliz, M.; Mirodatos, C.; Schueth, F. QSAR Comb. Sci., 2005, 24(1), 78-93.
[5] Vandegniste, B.G.M.; Rutan, S.C. Handbook of Chemometrics and Qualimetrics: Part B; Elsevier Science B.V.: Amsterdam, 1998; Vol. 20B.
[6] Hua, J.; Xiong, Z.; Lowey, J.; Suh, E.; Dougherty, E.R. Bioinformatics, 2005, 21(8), 1509-1515.
[7] Manallack, D.T.; Ellis, D.D.; Livingstone, D.J. J. Med. Chem., 1994, 37, 3758-3767.
[8] Gillet, V.J.; Willett, P.; Bradshaw, J.; Green, D.V.S. J. Chem. Inf. Comput. Sci., 1999, 39, 169-177.
[9] Shannon, C.E.; Weaver, W. The Mathematical Theory of Communication; University of Illinois Press: Urbana, IL, 1963.
[10] Willett, P. Subset-Selection Methods for Chemical Databases. In Molecular Diversity in Drug Design; Dean, M.; Lewis, R.A., Eds.; Kluwer Academic Publishers: Dordrecht, 1999.
[11] Whitley, D.C.; Ford, M.G.; Livingstone, D.J. J. Chem. Inf. Comput. Sci., 2000, 40, 1160-1168.
[12] Fernandez, M.; Caballero, J.; Helguera, A.M.; Castro, E.A.; Gonzalez, M.P. Bioorg. Med. Chem., 2005, 13(9), 3269-3277.
[13] So, S.-S.; Karplus, M. J. Med. Chem., 1996, 39(7), 1521-1530.
[14] The attribute nvec (number of elements) does not belong to the generated descriptor set, so there are only 35 "real" descriptors.
[15] Specht, D.F. Neural Netw., 1990, 3, 109-118.

Received: September 25, 2006    Revised: December 4, 2006    Accepted: December 5, 2006