GSAR 31008—16/5/2003—KRISHNAMURTHI—71795 SAR and QSAR in Environmental Research, 2003 Vol. not known (not known), pp. 1–14

PREDICTION OF AROMATIC AMINES MUTAGENICITY FROM THEORETICAL MOLECULAR DESCRIPTORS

P. GRAMATICA (a,*), V. CONSONNI (b) and M. PAVAN (b)

(a) Department of Structural and Functional Biology, QSAR and Environmental Chemistry Research Unit, University of Insubria, via Dunant 3, Varese 21100, Italy; (b) Department of Environmental Sciences, Milan Chemometrics and QSAR Research Group, University of Milano-Bicocca, P.za della Scienza 1, Milan 20126, Italy

(Received 1 December 2002; In final form 12 April 2003)

In the present research the mutagenicity data (Ames tests TA98 and TA100) for various aromatic and heteroaromatic amines, a data set extensively studied by other authors in quantitative structure–activity relationship (QSAR) work, have been modeled with a wide set of theoretical molecular descriptors using multiple linear regression (MLR) and genetic algorithm–variable subset selection (GA–VSS). The models have been calculated on a subset of compounds selected by a D-optimal experimental design. Moreover, they have been validated by both internal and external validation procedures, showing satisfactory predictive performance. The models proposed here can be useful in predicting data and setting a testing priority for compounds for which experimental data are not available or which have not yet been synthesized.

Keywords: Aromatic amines; Mutagenicity; Molecular descriptors; QSAR; Training set selection; Regression models

INTRODUCTION

Understanding and predicting the chronic toxic effects of chemicals, especially mutagenicity and carcinogenicity, has become one of the major problems faced by chemists involved in the development of industrial chemicals, as well as by scientists studying the toxicology of natural and xenobiotic products. The development of efficient and inexpensive technologies for testing and predicting the physical, chemical and biological properties of new compounds, which would also enable the estimation of the potential dangers of old compounds and allow effective risk assessment, is thus of major significance. Quantitative structure–activity relationships (QSARs) have been used over the years to develop models that estimate and predict toxicity by relating it to chemical structure. QSAR models are particularly useful for screening chemical databases and virtual libraries before the synthesis of chemicals, for setting testing priorities, for reducing reliance on animal testing and, ultimately, for the timely assessment of the health and environmental risks of chemicals. Mutagenicity, the subject of this paper, is a complex biological activity resulting from cell

*Corresponding author. E-mail: [email protected] ISSN ????-???? print/ISSN ????-???? online © 2003 Taylor & Francis Ltd DOI: 10.1080/1062936032000101484


penetration, bioactivation, interaction and DNA modification, together with various error-free and error-prone DNA repair processes. This is reflected in the great diversity of chemical structures associated with mutagenicity; as a consequence, chemical comparisons must be based on a much wider description of chemical structure, and the search for the molecular determinants of activity is a very complex task. The development of a model for predicting mutagenicity needs a test system able to provide reproducible and quantitative estimates of toxic activity: the most widely used is the bacterial test based on Salmonella typhimurium strains, introduced by Ames. In this paper, we studied the relationship between the chemical structure of aromatic and heteroaromatic amines and their mutagenicity in strains TA98 and TA100. We chose these compounds because, being widespread chemicals of considerable industrial and environmental relevance and important for human health, their mutagenicity is well documented. Moreover, several accurate QSAR models have been proposed recently for these chemicals [1–9]. However, most of the published models for this data set lack what is absolutely essential for a QSAR model intended to produce predicted data for new chemicals: statistical validation and a definition of the chemical range of applicability [10–12]. The robustness of these mutagenicity models [1–8] is characterized in a limited way, using only parameters of fitting accuracy such as r or r², but it is well known [10–12] that these statistical parameters cannot be considered indicators of the predictive power of a model; they just measure how well the model reproduces the training set response. The aim of this study was to develop multilinear regression (MLR) models that reliably predict the mutagenicity of amines, and to verify the predictive ability of our models. Such models should allow us to accurately predict the mutagenicity of new compounds that were not used in the model training set but that still belong to the same chemical domain as the training set. To assess model predictive capability, the models were statistically validated both internally and externally. The verification of their chemical domain of applicability (i.e. the range within which they "tolerate" a new molecule) was also proposed. Finally, a proposal was made to differentiate the chemical structure characteristics that appear more related to the mechanism of mutagenicity in TA98 and in TA100.

EXPERIMENTAL SECTION

Biological Responses

Experimental mutagenicity data of 146 aromatic amines towards S. typhimurium TA98 + S9 and TA100 + S9 microsomal preparation (expressed as the logarithm of the number of revertants per nanomole) were taken from the literature [1]. This data set has been widely used in QSAR modeling, and our goal was to compare the different QSAR models and verify their applicability to the prediction of new data. The names of the chemicals and their biological activities (experimental and predicted) are listed in Table I.

Molecular Descriptors

Molecular descriptors were computed for the 146 amines using the DRAGON package [13], freely available software for molecular descriptor calculation that provides more than 1400 descriptors, including several constitutional, topological and geometrical descriptors. These include, together with the traditional topological and information indices, various geometrical descriptors such as the 3D-Wiener index [14,15], folding degree index [16,17], radius of gyration [18,19], span [19], spherosity index [20] and asphericity [21].


TABLE I Names of chemicals and biological activities (experimental and estimated)

 ID  Compound                           TA98exp  TA98est  TA100exp TA100est
  1  2,3-Dimethylaniline                –        -1.9     -1.36*   -0.98
  2  2,5-Dimethylaniline                -2.4     -1.75    -1.43    -1.18
  3  4-Chloro-1,2-phenylenediamine      -0.49    -1.68    -1.44*   -1.57
  4  4-Aminophenylsulfide               0.31*    -0.77    0.48*    0.84
  5  4-Aminopyrene                      3.16*    3.74     2.69*    2.22
  6  2-Amino-4-methylphenol             -2.1     -1.62    -1.68*   -1.72
  7  1-Aminofluoranthene                3.35*    2.78     2.34*    2.51
  8  2-Aminofluorene                    1.93     0.7      0.78*    0.54
  9  Benzidine                          -0.39*   -0.61    -0.66*   -0.45
 10  4-Methyl-2-bromoaniline            –        -1.39    -0.64*   -0.07
 11  8-Aminoquinoline                   -1.14    -0.53    -0.34*   -0.31
 12  3,4-Dimethylaniline                –        -1.83    -1.08    -1.14
 13  3-Aminofluorene                    0.89*    0.88     0.1      0.72
 14  4-Methyl-2-chloroaniline           –        -2.12    -0.4*    -0.62
 15  4-Aminofluorene                    1.13     1.08     0.64     0.87
 16  4-Chloroaniline                    -2.52*   -2.66    -1.51*   -1.22
 17  8-Aminofluoranthene                3.8*     2.3      1.98*    2.12
 18  2-Ethyl-4-chloroaniline            –        -2.38    0.08*    0.07
 19  2-Aminopyrene                      3.5      3.76     2.58     2.32
 20  2-Aminonaphthalene                 -0.67*   -0.79    0.39*    -0.13
 21  4-Cyclohexylaniline                -1.24*   -2.32    -0.14*   -0.59
 22  6-Aminoquinoline                   -2.67*   -1.47    -1.22    -0.63
 23  2-Amino-1-methylnaphthalene        –        -0.5     0.84     0.06
 24  4-Amino-3-methylbiphenyl           –        -1.13    1.12*    0.35
 25  4,4'-Ethylenebis(aniline)          -2.15*   -1.7     -1.51*   -1.00
 26  2-Methoxy-5-methylaniline          -2.05*   -2.76    -1.85*   -1.12
 27  2,4,5-Trimethylaniline             -1.32*   -1.78    -0.26*   -0.77
 28  2,4-Diamino-n-butylbenzene         -2.7*    -2.38    -0.84*   -0.95
 29  7-Aminofluoranthene                2.88     2.33     2.76     2.25
 30  1-Aminocarbazole                   -1.04    0.27     -0.25    0.16
 31  1-Aminophenanthrene                2.38     0.79     1.79     1.15
 32  3-Amino-4-methylbiphenyl           –        -1.15    0.09     0.13
 33  3-Aminocarbazole                   -0.48    0.22     -0.11    0.21
 34  4-Methoxy-2-methylaniline          -3*      -2.41    -2.1     -1.36
 35  2-Aminobiphenyl                    -1.49*   -1.19    -0.51*   0.78
 36  3-Aminofluoranthene                3.31     2.71     2.25     1.97
 37  4-Aminobiphenyl                    -0.14    -1.2     0.85     0.40
 38  3,3'-Dichlorobenzidine             0.81*    -0.02    0.66*    0.66
 39  2,6-Dichloro-1,4-phenylenediamine  -0.69*   -1.01    -1.12*   -0.35
 40  3,3'-Dimethoxybenzidine            0.15*    0.47     -0.85*   -0.79
 41  4-Aminophenyldisulfide             -1.03*   -0.5     0.54*    1.23
 42  2-Aminocarbazole                   0.6      0.2      -0.56    -0.01
 43  4-Aminocarbazole                   -1.42    0.58     -0.47    0.29
 44  1-Aminofluorene                    0.43     0.59     -0.04    0.53
 45  2-Aminoanthracene                  2.62*    1.12     2.76     0.88
 46  2-Amino-3-methylnaphthalene        –        -0.49    1.09     -0.07
 47  2-Aminofluoranthene                3.23     2.2      2.87     2.57
 48  3-Aminoquinoline                   -3.14    -1.39    0.07     -0.43
 49  3-Methoxy-4-methylaniline          -1.96*   -1.73    -0.81*   -0.46
 50  2-Chloroaniline                    -3*      -2.64    -2.05*   -1.08
 51  4-Phenoxyaniline                   0.38*    -1.17    0.63*    0.39
 52  2-Amino-4-chlorophenol             -3*      -2.11    -2*      -1.42
 53  1-Amino-2-methylnaphthalene        –        -0.73    -0.37*   -0.07
 54  6-Aminochrysene                    1.83*    2.61     2.41*    1.86
 55  2-Methyl-4-bromoaniline            –        -1.47    0.46*    -0.55
 56  4-Aminophenanthrene                –        0.39     -0.11    1.43
 57  4-Aminophenylether                 -1.14    -0.17    -0.27*   -0.61
 58  4-Ethoxyaniline                    -2.3     -2.73    -0.61    -1.11
 59  1-Aminonaphthalene                 -0.6*    -0.33    -1       0.06
 60  2,4-Dimethylaniline                -2.22*   -2.15    -0.23*   -0.96
 61  2,4-Difluoroaniline                -2.7*    -2.25    -2.52*   -1.70
 62  4,4'-Methylenedianiline            -1.6*    -0.6     -0.15*   -0.42
 63  9-Aminophenanthrene                2.98     1.06     2.79     1.56
 64  3,4'-Diaminobiphenyl               0.2      -0.69    0.65     -0.12
 65  3-Aminophenanthrene                3.77     0.86     2.66     1.38
 66  2-Aminophenanthrene                2.46     0.93     2.74*    1.02
 67  1-Aminoanthracene                  1.18     0.99     0.36     0.97
 68  1-Aminopyrene                      1.43     3.4      1.05     1.92
 69  9-Aminoanthracene                  0.87*    1.32     -0.24*   1.08
 70  2,4-Diaminotoluene                 -1.29    -1.43    -1.66*   -1.98
 71  3,3'-Diaminobenzidine              -0.04*   0.71     -1.11*   -1.90
 72  1,3-Phenylenediamine               -0.46*   -2.37    -1.4*    -1.85
 73  3,4-Diaminotoluene                 -1.42    -1.36    -2.1     -1.76
 74  1,2-Phenylenediamine               -0.75*   -2.02    -1.89*   -2.15
 75  3-Amino-6-methylphenol             -1.4*    -1.54    -1.82    -2.01
 76  2,4-Diaminoethylbenzene            -0.87    -1.93    -1.21*   -1.55
 77  3-Aminobiphenyl                    –        -1.25    –        0.78
 78  2,3-Diaminobiphenyl                –        -0.85    –        -0.07
 79  2-Methyl-4-chloroaniline           –        -2.38    0.38     -0.85
 80  2-Chloro-4-methylaniline           –        -2.1     –        -0.61
 81  4-Methoxyaniline                   –        -3.16    –        -1.74
 82  3-Methoxyaniline                   –        -2.21    –        -1.49
 83  Aniline                            –        -2.92    –        -1.75
 84  3-Chloroaniline                    –        -2.73    –        -0.54
 85  3-Ethoxyaniline                    –        -2.85    –        -0.79
 86  2-Ethoxyaniline                    –        -2.05    –        -0.57
 87  4-Aminophenol                      –        -2.07    –        -2.33
 88  3-Aminophenol                      –        -2.72    –        -1.98
 89  2,4,6-Trimethylaniline             –        -1.24    –        0.16
 90  2,4,6-Tribromoaniline              –        -0.74    –        1.10
 91  2,4,6-Trichloroaniline             –        -1.98    –        0.09
 92  2,6-Diethylaniline                 –        -2.17    –        -0.30
 93  3,5-Dimethylaniline                –        -2.82    –        0.24
 94  2,6-Dimethylaniline                –        -2.52    –        -0.96
 95  2,4-Dibromoaniline                 –        -1.11    –        0.05
 96  2,4-Dichloroaniline                –        -2.36    –        -0.59
 97  4-Iodoaniline                      –        –        –        -0.11
 98  2-Iodoaniline                      –        -3.57    –        -0.40
 99  2-Fluoroaniline                    –        -2.45    –        -1.72
100  2-Bromoaniline                     –        -1.7     –        -0.91
101  4-Ethylaniline                     –        -2.72    –        -1.21
102  2-Ethylaniline                     –        -1.93    –        -0.94
103  4-Methylaniline                    –        -2.25    –        -1.57
104  3-Methylaniline                    –        -2.77    –        -0.99
105  2-Methylaniline                    –        -2.55    –        -1.51
106  2,2'-Diaminobiphenyl               -1.52*   -0.35    –        0.13
107  3,3'-Dimethylbenzidine             0.01*    -0.15    –        -0.22
108  9-Aminofluorene                    –        1.13     –        0.71
109  2,4-Diaminoisopropylbenzene        -3*      -2.51    –        -1.19
110  2,4'-Diaminobiphenyl               -0.92*   -0.43    –        -0.11
111  2-Aminophenol                      –        -2.34    –        -2.15
112  3,3'-Diaminobiphenyl               -1.3*    -0.48    –        0.42
113  2-Methoxyaniline                   –        -1.89    –        -1.15
114  3-Trifluoromethylaniline           -0.8*    -1.47    –        -1.18
115  4-Bromoaniline                     -2.7*    -1.86    –        -0.95
116  2-Bromo-7-aminofluorene            2.62*    1.89     –        1.27
117  1,7-Diaminophenazine               0.75     0.63     –        -1.13
118  3-Amino-3'-nitrobiphenyl           -0.55    -0.19    –        -0.79
119  2,7-Diaminofluorene                0.48     1.28     –        -0.10
120  2-Amino-4'-nitrobiphenyl           -0.62    0.17     –        -1.34
121  2-Amino-5-nitrophenol              -2.52*   -2.42    –        –
122  2-Hydroxy-7-aminofluorene          0.41*    0.77     –        -0.10
123  4-Amino-2'-nitrobiphenyl           -0.92*   -0.36    –        -1.16
124  2-Aminophenazine                   0.55     -0.27    –        -0.55
125  2,4-Dinitroaniline                 -2*      -2.46    –        –
126  2-Amino-3'-nitrobiphenyl           -0.89*   -0.03    –        -0.94
127  4-Fluoroaniline                    -3.32*   -2.74    –        -1.80
128  2-Amino-7-acetamidofluorene        1.18*    1.47     –        -0.53
129  2,8-Diaminophenazine               1.12*    0.43     –        -1.22
130  3-Amino-2'-nitrobiphenyl           -1.3     -0.41    –        -0.85
131  1,6-Diaminophenazine               0.2*     0.56     –        –
132  2-Bromo-4,6-dinitroaniline         -0.54*   -0.47    –        –
133  1,9-Diaminophenazine               0.04     0.41     –        –
134  2-Amino-1-nitronaphthalene         -1.17    0.13     –        –
135  3-Amino-4'-nitrobiphenyl           0.69     -0.21    –        -1.28
136  2-Amino-7-nitrofluorene            3*       1.93     –        -1.25
137  1-Amino-7-nitronaphthalene         -1.77*   -0.15    –        –
138  4-Amino-3'-nitrobiphenyl           1.02     -0.24    –        -1.17
139  4-Amino-4'-nitrobiphenyl           1.04*    -0.31    –        -1.62
140  1-Aminophenazine                   -0.01    -0.09    –        -0.41
141  4-Chloro-2-nitroaniline            -2.22*   -2.06    –        –
142  2,7-Diaminophenazine               3.97     0.47     –        -1.26
143  4-Chloro-1,3-phenylenediamine      -0.77    -1.89    –        -1.43
144  2-Nitro-1,4-phenylenediamine       -0.05*   -0.69    –        –
145  4-Nitro-1,3-phenylenediamine       -2.4*    -2.54    –        –
146  4-Nitro-1,2-phenylenediamine       0.35     -2.05    –        –

TA98, log revertants/nmol in TA98 strain; TA100, log revertants/nmol in TA100 strain. * Experimental data of the compounds used for the training set.

Moreover, 3D-MoRSE [22,23], Randic molecular profiles [24,25], Moreau–Broto autocorrelations [26–28], WHIM [29], Galvez topological charge indices [30,31], BCUT descriptors [32,33] and GETAWAY descriptors (GEometry, Topology and Atom-Weights AssemblY) [34] are also calculated by DRAGON. Definitions and further information regarding all these molecular descriptors can be found in Todeschini and Consonni [35]. The input files for descriptor calculation, containing information on atom and bond types, connectivity and atomic spatial coordinates relative to the minimum energy conformation of the molecule, were obtained by the molecular mechanics method of Allinger (MM+) using the HYPERCHEM package [36].

Chemometric Methods

To have compounds for external validation, the available set of amines was split into a training set and an external validation set. The training set selection was performed by D-optimal experimental design with the DOLPHIN package [37], using the Marengo–Todeschini algorithm [38]. This is an algorithm for optimal, distance-based experimental design that does not require any preliminary hypothesis about a regression model. The best set of compounds is defined through a fast exchange algorithm where, in each cycle, a substitution is selected to provide the maximum increase of the minimum distance between the currently selected compounds. Such an algorithm provides a final uniform distribution of the compounds selected from the set of allowed candidates. The regression models were developed on the selected training set and, once the models were established, predictions were made for the remaining molecules under study. Multiple linear regression analysis was performed with the MobyDigs package [39], using the ordinary least squares (OLS) regression method.
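The distance-based exchange idea described above can be illustrated with a small sketch. This is not the DOLPHIN implementation: the function names, the random starting subset, the Euclidean metric and the stopping rule are all assumptions made for illustration. In each cycle it applies the single substitution that most increases the minimum distance among the selected compounds, and stops when no substitution helps.

```python
import numpy as np

def min_pairwise_distance(dist, idx):
    """Smallest distance between any two of the selected rows."""
    sub = dist[np.ix_(idx, idx)]
    return sub[np.triu_indices(len(idx), k=1)].min()

def maximin_exchange(X, n_select, max_cycles=100, seed=0):
    """Hypothetical helper: pick n_select (>= 2) rows of the descriptor
    matrix X by repeated best-improvement exchanges."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Euclidean distance matrix between all candidates (assumed metric).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    selected = [int(i) for i in rng.choice(n, size=n_select, replace=False)]
    best = min_pairwise_distance(dist, selected)
    for _ in range(max_cycles):
        best_swap = None
        for pos in range(n_select):
            for cand in range(n):
                if cand in selected:
                    continue
                trial = selected.copy()
                trial[pos] = cand
                d = min_pairwise_distance(dist, trial)
                if d > best:
                    best, best_swap = d, (pos, cand)
        if best_swap is None:
            break  # no single substitution increases the minimum distance
        selected[best_swap[0]] = best_swap[1]
    return sorted(selected), float(best)
```

In practice the descriptor columns would normally be autoscaled before computing distances, so that no single descriptor dominates the selection; that preprocessing step is left out here.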
As it is impossible to perform multilinear regression when the descriptor variables are too many and the correlation among them is too high, a variable selection procedure must be adopted. Consequently, genetic algorithms (GA) [40,41] were used to select, from among all the calculated descriptors, the most relevant ones for obtaining models with the highest predictive power for the studied mutagenicity. GA variable selection is based on the evolution of a population of models. In GA terminology, the binary vector I is called a chromosome: a p-dimensional vector where each position (a gene) corresponds to a variable (1 if included in the model, 0 otherwise). Each chromosome represents a model with a subset of variables. The statistical parameter to be optimized must be defined, along with the model population size and the maximum number L of variables allowed in a model. Moreover, a crossover probability pC (usually high, pC > 0.8, if no repeated samples are allowed) and a mutation probability pM (usually small, pM < 0.1) should be defined by the user. Once the leading parameters are defined, the GA evolution starts, based on three main steps: the random initialization of the population, the cross-over step and the mutation step. In the first step, the model population is initially built from random models with a number of variables between 1 and L, and the models are ordered with respect to a selected statistical parameter, the quality of the model (here the predictive power verified by Q²). In the cross-over step, pairs of models are selected (randomly or with a probability proportional to their quality) and for each pair of models the common characteristics are preserved (i.e. variables excluded in both models remain excluded, variables included in both models remain included). The Q² for the new model is calculated. If its value is better than the worst value in the population, the model is included in the population, in the place corresponding to its rank. Otherwise, it is no longer considered. In the mutation step, for each model present in the population, each of the p genes is randomly changed according to the mutation probability.
For the mutated model the Q² is also calculated and, if its value is better than the worst value in the population, the model is included in the population. The second and third steps are repeated until some stop condition is encountered, or the process is ended arbitrarily. GAs simultaneously create many different results of comparable quality in large model populations (100 in the MobyDigs software). Within a given population the selected models can differ in the number and the kind of variables. Only models producing the highest predictive power are finally retained and further analyzed. All the models were internally validated by the leave-one-out procedure, i.e. leaving out from the training set one molecule at a time. Moreover, to avoid overestimation of the predictive power of the models, the leave-many-out procedure (i.e. 20% of objects left out at each step) was also performed. To avoid chance correlation, regressions were retained only for variable subsets with an acceptable multivariate correlation with the response, applying the QUIK rule [42] (only models with a global correlation of the [X + y] block (KXY) greater than the global correlation of the X block (KXX) are accepted, X being the molecular descriptors and y the response variable). Moreover, all the models were also checked for their reliability by the Y-randomization procedure [43]. Model performance was described by means of parameters related to model predictive capability (Q²LOO, Q²LMO) and fitting power (r²). The standard deviation error in prediction (SDEP), standard deviation error in calculation (SDEC), standard error of estimate (s), the Fisher F value and the correlation of the selected descriptors (KXX) were also reported.
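The GA variable selection described above can be sketched in a few lines. This is a simplified illustration, not the MobyDigs algorithm: the function names, default settings and replacement policy (a new model replaces the current worst one) are assumptions. It also replaces the crossover probability pC by a plain 50/50 draw for genes that differ between the two parents, and performs one crossover and one mutation per iteration. Fitness is the leave-one-out Q² of an OLS model, computed through the hat matrix.

```python
import numpy as np

def q2_loo(X, y):
    """Leave-one-out Q2 of an OLS model with intercept (hat-matrix shortcut)."""
    A = np.column_stack([np.ones(len(y)), X])
    H = A @ np.linalg.pinv(A)
    resid = y - H @ y
    press = np.sum((resid / (1.0 - np.diag(H))) ** 2)
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

def ga_select(X, y, max_vars=4, pop_size=30, n_iter=300, p_mut=0.05, seed=0):
    """Hypothetical helper: evolve binary chromosomes (True = variable included)."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]

    def fitness(mask):
        k = int(mask.sum())
        if k == 0 or k > max_vars:
            return -np.inf  # model size outside the allowed range 1..L
        return q2_loo(X[:, mask], y)

    def random_mask():
        mask = np.zeros(p, dtype=bool)
        k = int(rng.integers(1, max_vars + 1))
        mask[rng.choice(p, size=k, replace=False)] = True
        return mask

    pop = [random_mask() for _ in range(pop_size)]
    scores = [fitness(m) for m in pop]

    def try_insert(child):
        if any(np.array_equal(child, m) for m in pop):
            return
        s = fitness(child)
        worst = int(np.argmin(scores))
        if s > scores[worst]:  # keep the child only if it beats the worst model
            pop[worst], scores[worst] = child, s

    for _ in range(n_iter):
        # Cross-over: genes shared by both parents are preserved,
        # genes that differ are drawn at random.
        i, j = rng.choice(pop_size, size=2, replace=False)
        a, b = pop[i], pop[j]
        try_insert(np.where(a == b, a, rng.random(p) < 0.5))
        # Mutation: each gene flips with probability p_mut.
        parent = pop[int(rng.integers(pop_size))]
        try_insert(np.logical_xor(parent, rng.random(p) < p_mut))

    best = int(np.argmax(scores))
    return np.flatnonzero(pop[best]), scores[best]
```

Because the whole population is kept, the final `pop` and `scores` lists hold many models of comparable quality, which mirrors the point made above about retaining and comparing several candidate models rather than a single winner.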
Any developed model, based on a designed training set, was externally validated by comparing the predicted and the actual data, evaluating the prediction errors, computing the "external" standard deviation (SDEPext) and Q²ext, and examining how well these "external" parameters compare with the corresponding internal SDEP and Q². The models proposed here were chosen by maximizing the explained variance in prediction by external validation (Q²ext). Good predictive properties are an additional indication that chance correlation has been avoided.
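As a minimal sketch of the validation parameters named above, the following code computes Q²LOO and SDEP by explicit refitting, and Q²ext and SDEPext on a held-out set. The exact formulas used by the authors are not spelled out in the text, so the definitions here are assumptions: SDEP is taken as the square root of PRESS/n, and Q²ext compares the external squared errors with the deviations of the external responses from the training-set mean. Function names are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """OLS coefficients, intercept first."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, X):
    return coef[0] + X @ coef[1:]

def loo_stats(X, y):
    """Return (Q2_LOO, SDEP), refitting the model n times."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i          # leave molecule i out
        coef = fit_ols(X[keep], y[keep])
        press += (y[i] - predict(coef, X[i:i + 1])[0]) ** 2
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - press / tss, np.sqrt(press / n)

def external_stats(X_tr, y_tr, X_ext, y_ext):
    """Return (Q2_ext, SDEP_ext) for a model fitted on the training set only."""
    coef = fit_ols(X_tr, y_tr)
    press = np.sum((y_ext - predict(coef, X_ext)) ** 2)
    # Assumed reference: deviations from the TRAINING-set mean.
    tss = np.sum((y_ext - y_tr.mean()) ** 2)
    return 1.0 - press / tss, np.sqrt(press / len(y_ext))
```

With these definitions, a model that merely fits the training set well can still show a markedly lower Q²ext than Q²LOO, which is the comparison the text recommends examining.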


In addition, reliability was checked by regression diagnostics and the application domain was quantitatively defined, i.e. chemicals to which the models should not be applied for predicting mutagenicity can be identified by their leverage values [44]. The leverages of all the studied compounds were calculated to check their distance from the model experimental space; the greater the distance, the more unreliable the predicted response. The prediction for a compound that has high leverage (h > h*, the critical value being h* = 3p′/n, where p′ is the number of model parameters and n the number of compounds) must be considered unreliable. For compounds with a leverage value lower than the critical one, by contrast, the degree of agreement between the predicted and the actual values is as high as that for the training set chemicals [45].
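The leverage check described above can be written compactly. The sketch below is illustrative only: the function names are hypothetical, and it assumes that p′ counts the intercept together with the descriptors and that n is the number of training compounds.

```python
import numpy as np

def leverages(X_train, X_query):
    """Leverage h = x (A'A)^-1 x' of each query compound, where A is the
    training descriptor matrix with an intercept column added."""
    A = np.column_stack([np.ones(X_train.shape[0]), X_train])
    Q = np.column_stack([np.ones(X_query.shape[0]), X_query])
    core = np.linalg.pinv(A.T @ A)
    return np.einsum("ij,jk,ik->i", Q, core, Q)

def critical_leverage(n, p_prime):
    """Warning value h* = 3p'/n."""
    return 3.0 * p_prime / n

def outside_domain(X_train, X_query):
    """Boolean flag per query compound: True if h > h*."""
    p_prime = X_train.shape[1] + 1   # assumption: descriptors + intercept
    h_star = critical_leverage(X_train.shape[0], p_prime)
    return leverages(X_train, X_query) > h_star
```

Under the same assumption, a four-descriptor model trained on 60 compounds would have h* = 3 × 5 / 60 = 0.25.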

RESULTS AND DISCUSSION

One of the most relevant objectives of the development of QSAR models for the prediction of mutagenicity is to obtain knowledge of the mutagenicity of substances that have not yet been tested, or for which reliable experimental data are not available. In addition, the safety of new chemicals can often be assessed via QSARs as early as the pre-production phase. However, the potential benefits of QSARs can be realized only if the QSAR results are acceptable. Acceptance of QSAR results relies on assessing the reliability and uncertainty of predictions, as well as the applicability domain of a QSAR. The Ames test of mutagenicity, based on S. typhimurium strains, has often been used, and many different statistical models have been derived for the estimation of mutagenicity [1–9]. In order to have reliable predictions of mutagenicity, QSAR regression models have to be statistically validated, both internally and externally. In greater detail: the effective predictive capability of a model must be evaluated by a validation procedure comparing predictions for molecules that have been excluded from the model generation step with their experimental activities. To have compounds for use in this kind of "external" validation, the available set of amines was split into a training set and a test set by a D-optimal experimental design [37,38]. For each biological response a population of OLS models of dimension ranging from 1 to 5 was obtained by applying the genetic algorithm–variable subset selection (GA–VSS) to the set of DRAGON theoretical molecular descriptors. The GA, first proposed as a strategy for variable subset selection in multivariate analysis by Leardi et al. [40], is now widely and successfully applied in QSAR approaches where a large number of molecular descriptors are available as X-variables. The model with the highest predictive ability, obtained for the prediction of amine mutagenic activities in S. typhimurium TA98 + S9, using a training set of 60 compounds (highlighted with a star in Table I) and an external evaluation set of 39 compounds, is the following four-dimensional model with the reported statistical parameters:

$$\log \mathrm{TA98} = -3.98 + 2.40\,\mathrm{MWC07} + 0.56\,\mathrm{MATS7m} + 2.44\,\mathrm{Mor27u} + 1.12\,\mathrm{Mor15m} \tag{1}$$

n = 60; r² = 80.3; Q²LOO = 76.6; Q²LMO = 75.9; Q²ext = 68.9; KXX = 27.9; s = 0.827; F(4,55) = 55.87; SDEC = 0.791; SDEP = 0.861; SDEPext = 0.991.

Figure 1 shows the corresponding regression line. It can be verified that 1,3-phenylenediamine (72) and 1-amino-7-nitronaphthalene (137) are outliers in the training set, while 4-aminocarbazole (43), 3-aminoquinoline (48), 9-aminophenanthrene (63) and 1-aminopyrene (68) are evaluation set chemicals whose predicted data are more than two standard deviations from the experimental value. These results raise some doubts regarding


FIGURE 1 Regression line of TA98 model.

the mutagenic activity in TA98 + S9 of the outlier chemicals, and suggest the need for more accurate experiments and/or an interpretation of the anomaly in the mechanism. 4-Iodoaniline (97) is the only compound of the test set that is out of the chemical domain of the training set (as can be verified by its high leverage value). Some of the structural aspects peculiar to this chemical are not included in any molecule of the training set, thus its predicted mutagenicity could be unreliable. The model points out the importance of structural descriptors related to molecular size and branching, together with intramolecular long-distance interactions, in predicting the mutagenicity of aromatic amines in strain TA98, as already evidenced by different descriptors in previous papers [1–8]. In particular, MWC07 is a molecular descriptor that represents the number of walks [46] of length seven in the molecular graph; it is related to molecular branching and size and, in general, to the complexity of the structural graph. MATS7m is a 2D autocorrelation descriptor calculated by the spatial autocorrelation formula of Moran [47] on the molecular graph weighted by atomic masses, using a topological distance of seven as lag; Mor27u and Mor15m are 3D-MoRSE [22,23] molecular descriptors of signal 27 and 15, unweighted and weighted by atomic masses, respectively. All these descriptors take into account different features (2D and 3D) of the molecular dimension and branching, highlighting the relevance of steric interactions. The most important molecular descriptor in predicting the mutagenicity of aromatic amines in S. typhimurium TA98 + S9 turned out to be MWC07, which is also the best overall single descriptor, with r² = 72, Q²LOO = 70 and Q²ext = 50 for the single-parameter correlation equation. The amine mutagenic activity in S. typhimurium TA100 + S9, a response modeled with less satisfactory results in the reported literature [1,5], was also modeled by our approach.
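To make the walk-count idea behind MWC07 concrete, the sketch below counts all walks of a given length in a hydrogen-depleted molecular graph by summing the entries of the corresponding power of the adjacency matrix. It illustrates the graph-theoretical quantity only; the exact convention and scaling used by DRAGON for MWC07 (for example halving or a logarithmic transform) were not checked, so the raw count here should not be compared directly with the descriptor values used in Eq. (1). The function name and the aniline example are illustrative choices.

```python
import numpy as np

def walk_count(adjacency, length):
    """Total number of walks of the given length in a graph.

    Entry (i, j) of A**length is the number of walks of that length
    from atom i to atom j, so the sum over all entries counts every walk.
    """
    A = np.asarray(adjacency, dtype=np.int64)
    return int(np.linalg.matrix_power(A, length).sum())

# Hydrogen-depleted graph of aniline: ring carbons 0..5, nitrogen 6 on carbon 0.
bonds = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0), (0, 6)]
aniline = np.zeros((7, 7), dtype=np.int64)
for i, j in bonds:
    aniline[i, j] = aniline[j, i] = 1

walks_of_length_7 = walk_count(aniline, 7)
```

Larger and more branched graphs give more walks of a fixed length, which is why this kind of count tracks molecular size and branching.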


The best model obtained for the prediction of TA100, using a training set of 46 compounds (highlighted with a star in Table I) and an external evaluation set of 30 compounds, is the following three-dimensional model:

$$\log \mathrm{TA100} = -3.99 - 0.61\,\mathrm{nHA} + 9.55\,\mathrm{ATS5p} + 0.65\,\mathrm{L2v} \tag{2}$$

n = 46; r² = 81.2; Q²LOO = 78.0; Q²LMO = 77.4; Q²ext = 67.1; KXX = 17.1; s = 0.579; F(3,42) = 60.40; SDEC = 0.553; SDEP = 0.598; SDEPext = 0.731.

Figure 2 shows the corresponding regression line. It can be noted that 2-aminobiphenyl (35), 2-aminophenanthrene (66) and 9-aminoanthracene (69) are outliers in the training set, while 3,3'-dimethoxybenzidine (40) is an influential chemical. This model produces predictions more than two standard deviations from the experimental value for 2-aminoanthracene (45) and 4-aminophenanthrene (56), while 11 chemicals of the evaluation set (mainly nitro-substituted), with a high leverage value, are out of the chemical domain of the training set; for this reason their predicted data, considered unreliable, are not reported in Table I. The behavior of these chemicals, both the outliers and the influential ones, points out the need for a deeper study of their differences in terms of mechanism. The nitro-substituted amines, having a completely different mechanism of activity, are clearly problematic in this data set, as already evidenced by Debnath [1]. The model reveals the importance of different structural aspects in predicting the aromatic amine mutagenicity in strain TA100 compared with mutagenicity in strain TA98. In fact, instead of the steric aspects that were more important in the TA98 models, polarizability and electronic factors appear, in this case, to be the most relevant parameters, as already observed [1–6]. ATS5p, the most important descriptor, is a Moreau–Broto 2D-autocorrelation descriptor [26–28] of a topological structure with lag five, weighted by atomic

FIGURE 2 Regression line of TA100 model.

polarizability. The nHA parameter, the number of electronegative atoms acting as acceptors in hydrogen bonds, encodes here the well-known significance of the amino group in mutagenic reactions and the importance of hydrogen bonding [6], while some less relevant dimensional aspects are represented by L2v, the directional WHIM descriptor [29] weighted by atomic van der Waals volume (molecular size along the second principal component). The predicted mutagenicity values of the TA98 and TA100 responses are listed in Table I only for those compounds whose leverage values ensure that they belong to the applicability domain of the model. It must be pointed out that the obtained models are derived from heterogeneous data sets including amines with different mechanisms of mutagenesis, and that the attempt to propose "general" models, useful for screening purposes, is particularly challenging. Moreover, the mutagenicity mechanisms are complex and consist of multiple activity-discriminating steps, sometimes complicated by the existence of various parallel metabolic pathways. In addition, it should be kept in mind that the mutagenicity data of this data set came from many separate experiments carried out in several different laboratories. The data set is therefore certainly contaminated by systematic errors that magnify the uncertainty of the experimental data. Better results would be obtained using mutagenicity data obtained from a single laboratory. For comparison purposes, no effort has been made in this paper to select different data. Table II lists QSAR models of aromatic and heteroaromatic amine mutagenicity found in the literature, together with our newly proposed models and their relative performance. The results obtained by testing the quality of our models on the external evaluation set reveal that the effective predictive power of the models (Q²ext) is less than that obtained by internal validation (Q²LOO). This strongly highlights the importance of an adequate external validation of QSAR models [10–12] and raises some doubts about the real predictive power of the already published models, for which the only reported parameter is r² (and in some cases only r). In this context, it is of crucial importance to realize the difference between a model's fit and its predictive ability. If interest is focused mainly on the mechanistic interpretation, a model with good fit to the underlying data can be very useful, but the problem with this kind of model is that it may not be representative for additional new compounds. Only predictive validation can reliably assess model adequacy for new compounds. Models without statistical validation can be quite unsuitable for prediction purposes, and there is the possibility that the mechanistic claims might not hold true if these models are applied to new chemicals.
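As an illustration of the Moreau–Broto autocorrelation that underlies ATS5p in Eq. (2), the sketch below sums the products of an atomic property over all atom pairs separated by a given topological distance (the lag). It is a bare-bones version written for this note: each unordered pair is counted once and no scaling is applied, whereas the DRAGON descriptor may use a different pair convention, scaled weights or a logarithmic transform. The function names and the toy weights are hypothetical.

```python
from collections import deque

def topological_distances(neighbors, start):
    """Breadth-first search: number of bonds from `start` to every atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for nxt in neighbors[atom]:
            if nxt not in dist:
                dist[nxt] = dist[atom] + 1
                queue.append(nxt)
    return dist

def moreau_broto(neighbors, weights, lag):
    """Sum of w_i * w_j over atom pairs (i < j) at topological distance `lag`."""
    total = 0.0
    for i in range(len(neighbors)):
        dist = topological_distances(neighbors, i)
        for j, d in dist.items():
            if j > i and d == lag:
                total += weights[i] * weights[j]
    return total

# Toy example: a four-atom chain 0-1-2-3 with made-up atomic weights.
chain = [[1], [0, 2], [1, 3], [2]]
toy_weights = [1.0, 2.0, 3.0, 4.0]
lag_two = moreau_broto(chain, toy_weights, 2)   # pairs (0,2) and (1,3)
```

For ATS5p the weights would be atomic polarizabilities and the lag would be five, so only atom pairs five bonds apart contribute; small molecules with no such pairs give zero.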

CONCLUSION In this paper, we have analyzed the relationship between the chemical structure of aromatic amines and their mutagenicity in the TA98 and TA100 S. typhimurium strains. Our aim was to develop QSAR models with verified predictive power, in order to obtain reliable predicted data, that could be useful mainly for setting testing priorities and for the screening of chemicals also before their synthesis. Several different theoretical molecular descriptors, calculated only on the basis of a knowledge of the three-dimensional structure of chemicals, an efficient variable subset selection procedure, like GA, and a trainingevaluation set splitting methodology, the experimental design, led to models with quite satisfactory predictive performance, verified by internal and external validations. The predicted data for the evaluation set are considered reliable (as verified by the leverage

TABLE II List of the QSAR models of aromatic and heteroaromatic amines mutagenicity

| Test strain | N. obj | N. desc | Molecular descriptors | r2 | Q2LOO | Q2LMO | Q2ext | Reference |
|---|---|---|---|---|---|---|---|---|
| TA98 | 88 | 4 | log P, HOMO, LUMO, IL | 80.6 | – | – | – | [1] |
| TA98 | 88 | 3 | log P, HOMO, IL | 77.8 | – | – | – | [1] |
| TA100 | 67 | 3 | log P, HOMO, LUMO | 76.9 | – | – | – | [1] |
| TA98 | 95 | 8 | IC, O, IC3, SIC1, SIC4, $^{4}\chi$, $^{6}\chi^{b}_{PC}$, $^{3}\chi^{v}_{C}$ | 76.0 | – | – | – | [3] |
| TA98 | 95 | 9 | IC, O, SIC1, $^{4}\chi$, $^{6}\chi^{b}_{PC}$, $^{3}\chi^{v}_{C}$, P0, $^{3D}W_{H}$, $^{3D}W$ | 79.7 | – | – | – | [3] |
| TA98 | 43 | 1 | HOMOcmax | 62.8 | – | – | – | [5] |
| TA100 | 60 | 1 | HOMO | 70.0 | – | – | – | [5] |
| TA98 | 95 | 4 | O, $^{4}\chi_{PC}$, P0, J | 72.1 | – | – | – | [4] |
| TA98 | 95 | 6 | IC4, SIC2, SIC4, $^{4}\chi^{v}$, $^{5}\chi^{b}_{C}$, $^{4}\chi^{b}_{PC}$ | 73.7 | – | – | – | [4] |
| TA98 | 95 | 6 | $^{4}\chi_{PC}$, P0, J, SIC2, SIC4, $^{5}\chi^{b}_{C}$ | 75.2 | – | – | – | [4] |
| TA98 | 95 | 9 | $^{4}\chi^{b}_{PC}$, P0, J, SIC2, SIC4, $^{5}\chi^{b}_{C}$, EHOMO1, $\Delta H_f$, $\mu$ | 79.1 | – | – | – | [4] |
| TA98 | 95 | 4 | nRings, γ-polarizability, H acceptor surface area, H donors surface area | 78.1 | 75.5 | – | – | [6] |
| TA98 | 95 | 6 | nRings, γ-polarizability, H acceptor surface area, H donors surface area, E tot (C-C), E tot (C-N) | 83.4 | 80.5 | – | – | [6] |
| TA98 | 95 | 4 | $I^{W}_{D}$, $^{4}\chi_{PC}$, $^{5}\chi_{Ch}$, $^{6}\chi_{Ch}$ | 70.9 | – | – | – | [7] |
| TA98 | 95 | 8 | $^{4}\chi_{PC}$, IC6, SIC2, $J^{B}$, ASZ3, ASZ4, $^{3}\chi^{b}_{C}$, $^{5}\chi^{b}_{C}$ | 79.4 | – | – | – | [7] |
| TA98 | 95 | 9 | $^{4}\chi_{PC}$, IC6, SIC2, $J^{B}$, ASZ3, ASZ4, $^{3}\chi^{b}_{C}$, $^{5}\chi^{b}_{C}$, $V_{W}$ | 79.9 | – | – | – | [7] |
| TA98 | 95 | 9 | $^{4}\chi_{PC}$, IC6, SIC2, $J^{B}$, ASZ3, ASZ4, $^{5}\chi^{b}_{C}$, ELUMO, $\Delta H_f$ | 80.3 | – | – | – | [7] |
| TA98 | 95 | 6 | Cmethyl, Csubst-ar, Onitro, Oether, Nsec-amine, F | 75.0 | – | – | – | [8] |
| TA98 | 95 | 9 | Cmethyl, Cmethylene, Csubst-ar, Onitro, Oether, Nsec-amine, Nar, F, Cl | 76.7 | – | – | – | [8] |
| TA98 | 95 | 1 | Graphs of atomic orbitals (GAOs) | 75.6 | – | – | – | [9] |
| TA98 | 47 | 1 | Graphs of atomic orbitals (GAOs) | 76.4 | – | – | 75.7 | [9] |
| TA98 | 60 | 4 | MWC0, MATS7m, Mor27u, Mor15m | 80.3 | 76.6 | 75.9 | 68.9 | current study |
| TA100 | 46 | 3 | nHA, ATS5p, L2v | 81.2 | 78.0 | 77.4 | 67.1 | current study |

approach) only for chemicals belonging to the chemical domain of the model. The proposed validated QSAR models are based on molecular descriptors with a reasonable chemical interpretation. In fact, even if it is evident that the fine details of the corresponding reaction mechanism cannot be fully established with this kind of QSAR approach, molecular descriptors of completely different structural meaning are selected by GA in the modeling of the two responses. The selection of molecular descriptors encoding specific structural aspects in statistically robust models (and thus without chance correlation) highlights that the basis of mutagenicity in TA98 and TA100 is quite different mechanistically and related to specific structural aspects. While steric factors appear more important in the TA98 models, polarizability, electronic and hydrogen-bonding features seem to be more related to mutagenicity in TA100. However, the descriptors selected by GA as the best combinations correlated with the two mutagenicity responses are not easily or individually interpretable in terms of the complex underlying mechanisms. Such descriptors represent the overall effect of several activation steps. Their practical value relies mainly on their predictive ability in the models, which, of course, must be carefully tested by cross-validation and other diagnostic techniques (scrambling of the response, leverage of prediction, etc.). This type of QSAR model follows a path that starts with strict statistical validation and definition of the chemical range of applicability, and only then suggests a possible interpretation of biological and mechanistic meaning [11]. Therefore, its application domain is related mainly to the production of predicted data (verified for their reliability), and it is useful mainly for screening and priority testing.
These models can be applied to chemicals different from the studied amines (even those not yet synthesized) as they are based on theoretical molecular descriptors that are easily and rapidly calculated by web-available software [13]. However, as a QSAR model cannot be expected to reliably predict the modeled property for the entire universe of chemicals, it must be underlined that the predicted data must be considered reliable only for those chemicals that fall within the chemical domain on which the model was obtained.
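The idea of restricting predictions to the chemical domain of a model can be illustrated with the leverage of a query compound, $h = x^T (X^T X)^{-1} x$, where $X$ is the training descriptor matrix with an intercept column. The Python sketch below is a minimal illustration under stated assumptions: the function names and toy descriptor values are hypothetical, and the warning threshold $h^* = 3p'/n$ (with $p'$ the number of model parameters and $n$ the number of training compounds) is a commonly used rule of thumb that this section does not specify.

```python
def solve(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented copy, inputs untouched
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def leverages(X_train, X_query):
    """Leverage h = x^T (X^T X)^-1 x of each query compound w.r.t. the training set."""
    Z = [[1.0] + list(row) for row in X_train]  # add intercept column
    p = len(Z[0])
    xtx = [[sum(z[i] * z[j] for z in Z) for j in range(p)] for i in range(p)]
    result = []
    for row in X_query:
        x = [1.0] + list(row)
        w = solve(xtx, x)  # w = (X^T X)^-1 x without forming the inverse
        result.append(sum(xi * wi for xi, wi in zip(x, w)))
    return result


def warning_leverage(n_train, n_descriptors):
    """Assumed rule-of-thumb threshold h* = 3 p' / n, with p' = descriptors + 1."""
    return 3.0 * (n_descriptors + 1) / n_train


# Hypothetical one-descriptor training set and two query compounds.
X_train = [[1.0], [2.0], [3.0], [4.0], [5.0]]
h_star = warning_leverage(len(X_train), 1)
for query, h in zip([3.0, 10.0], leverages(X_train, [[3.0], [10.0]])):
    status = "inside" if h <= h_star else "outside"
    print(f"descriptor={query}: leverage={h:.2f}, {status} the domain (h*={h_star:.2f})")
```

A query compound whose leverage exceeds the threshold lies far from the descriptor space spanned by the training set, so its prediction is an extrapolation and should be treated as unreliable.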

References
[1] Debnath, A.K., Debnath, G., Shusterman, A.J. and Hansch, C. (1992) “A QSAR investigation of the role of hydrophobicity in regulating mutagenicity in the Ames test: 1. Mutagenicity of aromatic and heteroaromatic amines in Salmonella typhimurium TA98 and TA100”, Environ. Mol. Mutagen. 19, 37–52.
[2] Benigni, R., Andreoli, C. and Giuliani, A. (1994) “QSAR models for both mutagenic potency and activity: application to nitroarenes and aromatic amines”, Environ. Mol. Mutagen. 24, 208–219.
[3] Basak, S.C., Grunwald, G.D. and Niemi, G.J. (1997) “Use of graphic-theoretic and geometrical molecular descriptors in structure–activity relationships”, In: Balaban, A.T., ed, From Chemical Topology to Three-Dimensional Geometry (Plenum Press, New York), pp. 73–116.
[4] Basak, S.C., Gute, B.D. and Grunwald, G.D. (1998) “Relative effectiveness of topological, geometrical and quantum chemical parameters in estimating mutagenicity of chemicals”, Quantitative Structure–Activity Relationships in Environmental Sciences VII (SETAC Press, Pensacola, FL), pp. 245–261.
[5] Benigni, R., Passerini, L., Gallo, G., Giorgi, F. and Cotta-Ramusino, M. (1998) “QSAR models for discriminating between mutagenic and nonmutagenic aromatic and heteroaromatic amines”, Environ. Mol. Mutagen. 32, 75–83.
[6] Maran, U., Karelson, M. and Katritzky, A.R. (1999) “A comprehensive QSAR treatment of the genotoxicity of heteroaromatic and aromatic amines”, Quant. Struct.-Act. Relat. 18, 3–10.
[7] Basak, S.C., Mills, D.R., Balaban, A.T. and Gute, B.D. (2001) “Prediction of mutagenicity of aromatic and heteroaromatic amines from structure: a hierarchical QSAR approach”, J. Chem. Inf. Comput. Sci. 41, 671–678.
[8] Cash, G. (2001) “Prediction of the genotoxicity of aromatic and heteroaromatic amines using electrotopological state indices”, Mutat. Res. 491, 31–37.
[9] Toporov, A.A. and Toporova, A.P. (2001) “Prediction of heteroaromatic amine mutagenicity by means of correlation weighting of atomic orbital graphs of local invariants”, J. Mol. Struct. (Theochem.) 538, 287–293.
[10] Eriksson, L., Johansson, E. and Wold, S. (1997) “QSAR model validation”, In: Chen, F. and Schüürmann, G., eds, Quantitative Structure–Activity Relationships in Environmental Sciences VII. Proceedings of the 7th International Workshop on QSAR in Environmental Sciences (SETAC Press, Pensacola, FL), pp. 381–397.
[11] Tropsha, A., Gramatica, P. and Gombar, V.J. (2003) “The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models”, Quant. Struct.-Act. Relat., In press.
[12] Eriksson, L., Jaworska, J., Worth, A., Cronin, M., McDowell, R.M. and Gramatica, P. (2003) “Methods for reliability, uncertainty assessment and applicability evaluations of classification and regression based QSARs”, Environ. Health Perspect., In press.
[13] Todeschini, R., Consonni, V., Mauri, A. and Pavan, M. DRAGON, rel. 2.1 for Windows, Milano, Italy, 2002. Program for the calculation of molecular descriptors from HyperChem, Sybyl, and SD file formats. Free download at: http://www.disat.unimib.it/chm/
[14] Mekenyan, O., Peitchev, D., Bonchev, D., Trinajstic, N. and Bangov, I.P. (1986) “Modelling the interaction of small organic molecules with biomacromolecules. I. Interaction of substituted pyridines with anti-3-azopyridine antibody”, Arzneim-Forsch. 36, 176–183.
[15] Bogdanov, B., Nikolic, S. and Trinajstic, N. (1989) “On the three-dimensional Wiener number”, J. Math. Chem. 3, 299–309.
[16] Randic, M., Kleiner, A.F. and DeAlba, L.M. (1994) “Distance/distance matrices”, J. Chem. Inf. Comput. Sci. 34, 277–286.
[17] Randic, M. and Krilov, G. (1999) “On a characterization of the folding of proteins”, Int. J. Quant. Chem. 75, 1017–1026.
[18] Tanford, C. (1961) Physical Chemistry of Macromolecules (Wiley, New York, NY).
[19] Volkenstein, M.V. (1963) Configurational Statistics of Polymeric Chains (Wiley-Interscience, New York, NY).
[20] Robinson, D.D., Barlow, T.W. and Richards, W.G. (1997) “Reduced dimensional representations of molecular structure”, J. Chem. Inf. Comput. Sci. 37, 939–942.
[21] Arteca, G.A. (1991) “Molecular shape descriptors”, In: Lipkowitz, K.B. and Boyd, D., eds, Reviews in Computational Chemistry (VCH Publishers, New York, NY) 9.
[22] Schuur, J. and Gasteiger, J. (1996) “3D-MoRSE code: a new method for coding the 3D structure of molecules”, In: Gasteiger, J., ed, Software Development in Chemistry, Vol. 10 (Frankfurt am Main, Germany).
[23] Schuur, J. and Gasteiger, J. (1997) “Infrared spectra simulation of substituted benzene derivatives on the basis of a 3D structure representation”, Anal. Chem. 69, 2398–2405.
[24] Randic, M. (1995) “Molecular shape profiles”, J. Chem. Inf. Comput. Sci. 35, 373–382.
[25] Randic, M. (1996) “Quantitative structure–property relationship: boiling points of planar benzenoids”, N J Chem. 20, 1001–1009.
[26] Moreau, G. and Broto, P. (1980) “The autocorrelation of a topological structure: a new molecular descriptor”, Nouv. J. Chim. 4, 359–360.
[27] Moreau, G. and Broto, P. (1980) “Autocorrelation of molecular structures, application to SAR studies”, Nouv. J. Chim. 4, 757–764.
[28] Broto, P., Moreau, G. and Vandycke, C. (1984) “Molecular structures: perception, autocorrelation descriptor and SAR studies. Autocorrelation descriptor”, Eur. J. Med. Chem. 19, 66–70.
[29] Todeschini, R. and Gramatica, P. (1997) “3D-modelling and prediction by WHIM descriptors. Part 5. Theory development and chemical meaning of WHIM descriptors”, Quant. Struct.-Act. Relat. 16, 113–119.
[30] Gálvez, J., García, R., Salabert, M.T. and Soler, R. (1994) “Charge indexes. New topological descriptors”, J. Chem. Inf. Comput. Sci. 34, 520–525.
[31] Gálvez, J., García-Domenech, R., De Julián-Ortiz, V. and Soler, R. (1995) “Topological approach to drug design”, J. Chem. Inf. Comput. Sci. 35, 272–284.
[32] Pearlman, R.S. and Smith, K.M. (1998) “Novel software tools for chemical diversity”, In: Kubinyi, H., Folkers, G. and Martin, Y.C., eds, 3D QSAR in Drug Design (Kluwer/ESCOM, Dordrecht, The Netherlands) 2.
[33] Pearlman, R.S. (1999) Novel software tools for addressing chemical diversity. Internet Communication http://www.netsci.org/Science/Combichem/feature08.html.
[34] Consonni, V., Todeschini, R. and Pavan, M. (2002) “Structure/response correlation and similarity/diversity analysis by GETAWAY descriptors. Part 1. Theory of the novel 3D molecular descriptors”, J. Chem. Inf. Comput. Sci. 42, 693–705.
[35] Todeschini, R. and Consonni, V. (2000) Handbook of Molecular Descriptors (Wiley-VCH, Weinheim, Germany), p 667.
[36] HYPERCHEM, rel. 4 for Windows, 1995, Autodesk, Inc., Sausalito, CA, USA.
[37] Todeschini, R. and Mauri, A. DOLPHIN, Software for experimental Design, rel. 2.1 for Windows, 2000, Milano Chemometrics and QSAR Research Group.
[38] Marengo, E. and Todeschini, R. (1992) “A new algorithm for optimal, distance-based experimental design”, Chemom. Intell. Lab. Syst. 16, 37–44.
[39] MobyDigs, Software for multilinear regression analysis and variable subset selection by Genetic Algorithm, rel. 2.1 for Windows, 1999, Milano Chemometrics and QSAR Research Group.
[40] Leardi, R., Boggia, R. and Terrile, M. (1992) “Genetic algorithms as a strategy for feature selection”, J. Chemom. 6, 267–281.
[41] Goldberg, D.E. (1989) Genetic Algorithms in Search, Optimization and Machine Learning (Addison-Wesley, Reading, MA).
[42] Todeschini, R., Consonni, V. and Maiocchi, A. (1999) “The K correlation index: theory development and its application in chemometrics”, Chemom. Intell. Lab. Syst. 46, 13–29.
[43] Lindgren, F., Hansen, B., Karcher, W., Sjöström, M. and Eriksson, L. (1996) “Model validation by permutation tests: applications to variable selection”, J. Chemom. 10, 521–532.
[44] Atkinson, A.C. (1985) Plots, Transformations and Regression (Clarendon Press, Oxford).
[45] Gombar, V.K. and Enslein, K. (1996) “Assessment of n-octanol/water partition coefficient: when is the assessment reliable?”, J. Chem. Inf. Comput. Sci. 36, 1127–1134.
[46] Rücker, G. and Rücker, C. (1993) “Counts of all walks as atomic and molecular descriptors”, J. Chem. Inf. Comput. Sci. 33, 683–695.
[47] Moran, P.A.P. (1950) “Notes on continuous stochastic phenomena”, Biometrika 37, 17–23.
