Convolutional neural networks for ATC classification

Alessandra Lumini a and Loris Nanni b

a DISI, Università di Bologna, Via Sacchi 3, 47521 Cesena, Italy.
b DEI, University of Padova, Via Gradenigo 6, 35131 Padova, Italy.

Abstract

Background: Anatomical Therapeutic Chemical (ATC) classification of unknown compounds is highly significant for both drug development and basic research. The ATC system is a multi-label classification system proposed by the World Health Organization (WHO), which categorizes drugs into classes according to their therapeutic effects and characteristics. The system comprises five levels, each containing several classes; the first level includes 14 main, overlapping classes. Since the ATC classification system simultaneously considers anatomical distribution, therapeutic effects, and chemical characteristics, predicting the ATC classes of an unknown compound is an essential problem: such a prediction can be used to deduce not only a compound's possible active ingredients but also its therapeutic, pharmacological, and chemical properties. Nevertheless, automatic prediction is very challenging due to the high variability of the samples and the overlap among classes, which results in multiple labels per sample and makes machine learning extremely difficult.

Methods: In this paper, we propose a multi-label classifier system based on deep learned features to infer the ATC classification. The system is based on a 2D representation of the samples: first a 1D feature vector is obtained by extracting information about a compound's chemical-chemical interactions and its structural and fingerprint similarities to other compounds belonging to the different ATC classes; then the 1D feature vector is reshaped into a 2D matrix representation of the compound. Finally, a convolutional neural network (CNN) is trained and used as a feature extractor. Two general-purpose classifiers designed for multi-label classification are trained on the deep learned features, and the resulting scores are fused by the average rule.

Results: Experimental evaluation based on rigorous cross-validation demonstrates the superior prediction quality of this method compared to other state-of-the-art approaches developed for this problem.

Conclusion: Extensive experiments demonstrate that the new predictor, based on CNNs, outperforms other existing predictors in the literature in almost all of the five metrics used to examine the performance of multi-label systems, particularly in the "absolute true" rate and the "absolute false" rate, the two most significant indexes. MATLAB code will be available at https://github.com/LorisNanni

1. Introduction

Drug discovery is a time-consuming and expensive process: bringing a single drug to market can take more than 10 years and cost millions of USD. The main reasons for failure in clinical trials are lack of efficacy and adverse side-effects [1]. To counteract these difficulties, and with the aim of finding new uses for already approved drugs, it is highly desirable to develop tools that accurately predict drug therapeutic indications and side-effects. A feasible way to perform this task is the automatic prediction of the Anatomical Therapeutic Chemical (ATC) classes of a given compound, which can be used to deduce its active ingredients and its therapeutic, pharmacological, and chemical properties, thereby substantially expediting drug development.

The ATC classification system is a comprehensive drug classification scheme developed by the World Health Organization (WHO) in the early 1970s, which simultaneously considers anatomical distribution, therapeutic effects, and chemical characteristics [2]. The ATC system¹ comprises five classification levels and includes several overlapping classes; the first level includes 14 main groups: (1) alimentary tract and metabolism; (2) blood and blood forming organs; (3) cardiovascular system; (4) dermatologicals; (5) genito-urinary system and sex hormones; (6) systemic hormonal preparations, excluding sex hormones and insulins; (7) anti-infectives for systemic use; (8) antineoplastic and immunomodulating agents; (9) musculo-skeletal system; (10) nervous system; (11) antiparasitic products, insecticides and repellents; (12) respiratory system; (13) sensory organs; and (14) various. The great advantage of the ATC code (the code that assigns a drug to a class at each ATC level) is that it gives guidance on clinical drug use and helps pharmacological research to identify therapeutic indications and side-effects. Unfortunately, only a small fraction of drugs have been labelled with their corresponding ATC codes, and even some well-known databases (e.g., KEGG [3], DrugBank) include many drugs without ATC codes. Identifying the ATC classes of new drugs or compounds with traditional experimental methods is quite difficult; therefore, in the last decade, many machine-learning systems and webservers for ATC classification have been proposed [4][5]. Most methods, including the one proposed in this work, focus on the 14 classes of the first ATC level. Older approaches [4][5] performed exclusive classification: Dunkel et al. [4] proposed a method based on the structural fingerprint information of the compounds; Wu et al. [5] proposed a system that attempts to discover relationships among ATC classes and use them to make predictions. Chen et al. [2] proposed a multi-label classification method based on chemical-chemical interactions to predict one or more classes for each compound/drug; moreover, they collected a benchmark dataset of 4,376 drugs obtained by selecting ATC-coded drugs from the publicly available database KEGG [3]. More recently, Cheng et al. [6][7] proposed approaches based on the fusion of very different descriptors (i.e., chemical-chemical interaction data, structural similarity, and fingerprint similarity) in order to deal effectively with the multi-label problem. The predictor iATC-mISF proposed in [6] evolved into the hybrid predictor iATC-mHyb in [7] through combination with iATC-mDO, a predictor based on drug ontologies. The same descriptors used in [6] (i.e., a compound's chemical-chemical interactions and its structural and fingerprint similarities to other compounds) were reshaped into 2D matrices and used in the machine-learning approach proposed in [8]. In this work, we take advantage of the idea of reshaping data into 2D matrices in order to use convolutional neural networks (CNNs) for the classification task. Deep learning [9] is a classification paradigm widely used in the computer vision literature for solving image classification problems; CNNs are complex networks trained to perform both feature extraction and classification.
CNNs analyze input images by evaluating a set of features learned directly from the training images and processed through a pyramidal approach; the output layers then perform the classification task. Several works [10] show that the first layers of a CNN have a high generalization capability in representing the semantics of the data and provide great robustness to intra-class variability [11]. Therefore, trained CNNs can be effectively reused in several different problems, both as feature extractors and, after ad hoc re-training, as classifiers. The approach proposed in this work shows the advantage of using deep learned features and of fusing them with handcrafted ones: the input data are first reshaped into a 2D representation, then a set of features is learned from pre-trained networks. Due to the multi-label nature of the ATC classification problem, the classification step is performed using an ensemble composed of Multi-Label Learning with Label-Specific Features (LIFT) classifiers [12] and an ensemble of ridge regression (RR) classifiers from the MLC Toolbox [13].

¹ http://www.whocc.no/atc/structure_and_principles/

The proposed approach is evaluated on a well-known benchmark dataset [2], and its superiority over other state-of-the-art approaches [6][7][8] is demonstrated using a rigorous testing protocol. The main strength of this approach lies in the 2D representation of patterns, which offers several advantages: on one hand, it makes it possible to extract powerful texture descriptors [14]; on the other, it allows already trained CNNs to be used for transfer learning. The combination of such different descriptors grants a substantial performance improvement with respect to other state-of-the-art approaches based on feature selection and classification.

2. Materials and Methods

2.1 Benchmark Dataset

The drug dataset used in our experiments was obtained from [2] (Supporting Information S1). It contains a total of 3883 drugs divided into c=14 non-exclusive ATC classes (3295 samples belong to only one class, 370 to 2 classes, 110 to 3 classes, 37 to 4 classes, 27 to 5 classes, and 44 to 6 classes; no sample belongs to more than 6 classes). The sum of samples over all classes, denoted N(vir) in [2] (i.e., the number of virtual drug samples), equals 4912; the average number of ATC classes per sample is therefore 4912/3883 = 1.27. Table 1 reports an analysis of the benchmark dataset according to the 14 main ATC classes.

Table 1. Summary of the benchmark dataset according to the 14 main ATC classes.

1st-level ATC class                                                        Number of drugs
(1)  Alimentary tract and metabolism                                        540
(2)  Blood and blood forming organs                                         133
(3)  Cardiovascular system                                                  591
(4)  Dermatologicals                                                        421
(5)  Genito-urinary system and sex hormones                                 248
(6)  Systemic hormonal preparations, excluding sex hormones and insulins    126
(7)  Anti-infectives for systemic use                                       521
(8)  Antineoplastic and immunomodulating agents                             232
(9)  Musculo-skeletal system                                                208
(10) Nervous system                                                         737
(11) Antiparasitic products, insecticides and repellents                    127
(12) Respiratory system                                                     427
(13) Sensory organs                                                         390
(14) Various                                                                211
Number of total virtual drugs N(vir)                                       4912
Number of total drugs                                                      3883

The descriptors used to represent the drugs are based on drug-drug interactions and on the correlation with the target classes to be predicted. Given the set of 14 first-level ATC classes, each sample is represented by three mathematical expressions reflecting its intrinsic correlation with each class, leading to a final descriptor of 14×3 = 42 features. The three properties considered are: the maximum interaction score with the drugs in each of the 14 classes, the maximum structural similarity score with the drugs in each class, and the molecular fingerprint similarity score with the 14 subsets. These descriptors can be downloaded from the supplementary material of [8].

2.2 Data reshaping to 2D representation

The typical approach to a pattern recognition problem is to select some relevant features from the input data and pass them to a classifier. In [15] the authors demonstrate that in many classification problems reshaping 1D features into a 2D representation can better exploit feature correlation through well-known texture descriptors. This approach has already been tested on ATC classification [8] with excellent results: the adoption of 2D representations of patterns and the use of powerful Histogram of Gradients (HoG) texture descriptors improved the overall performance of the system. In this work, we use the 2D representation to exploit the classification capability of pre-trained CNNs, i.e., complex network architectures already trained on large datasets and reused for transfer learning [10]. In order to obtain a 2D matrix to be used as input to a pre-trained CNN, we perform random reshaping. Given the original feature vector 𝑓 ∈ ℜⁿ (n = 42 in this problem), we obtain the output matrix 𝐌 ∈ ℜ^(d×d) (d depends on the chosen architecture; d = 227 and 224 in this work) by randomly rearranging the original vector into a square matrix 𝐔 ∈ ℜ^(u×u), where u = n^0.5, and resizing U to d×d using bicubic interpolation. The random permutation is selected once for all samples, in order to maintain the same disposition of features. Since performance varies with the feature disposition, a simple way to improve it is to design an ensemble based on feature perturbation by performing K reshaping operations (K = 5 in our experiments).
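A minimal MATLAB sketch of this reshaping step follows; zero-padding the 42-dimensional vector to the next perfect square (49 = 7×7) is our assumption, since 42 is not itself a perfect square, and imresize requires the Image Processing Toolbox.

```matlab
function M = reshape2D(f, d, perm)
% f    : 1 x n descriptor (n = 42 in this problem)
% d    : CNN input side (227 for AlexNet, 224 for ResNet50)
% perm : random permutation of 1..u^2, drawn once and reused for all samples
n = numel(f);
u = ceil(sqrt(n));                    % side of the intermediate square matrix
g = zeros(1, u * u);                  % zero-pad up to the next perfect square
g(1:n) = f;
U = reshape(g(perm), u, u);           % fixed random rearrangement into u x u
M = imresize(U, [d d], 'bicubic');    % upscale to the CNN input size
end
```

The k-th member of the ensemble would then call, e.g., M = reshape2D(f, 227, perms{k}), with K = 5 independently drawn permutations perms{1}, ..., perms{K}.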
2.3 Deep learned feature extraction

CNNs are among the most studied deep learning architectures [16]: a CNN is a multi-layered image classification network that incorporates spatial context and weight sharing between pixels and learns the optimal image features for a given classification task. CNNs can learn accurate and generalizable models and achieve state-of-the-art performance in several pattern recognition tasks. A CNN is made of several types of layers (convolutional, pooling, and fully-connected), whose weights are trained with the backpropagation algorithm on a large labelled dataset. When the training set is not large enough to perform training from scratch, transfer learning [17] can be used: since CNNs have shown large generalization power [10], a pre-trained model can be used as a feature extractor, or as a classifier for a different problem after fine-tuning the network weights to the new classification task. A variety of CNN architectures have been introduced in the literature (e.g., LeNet [18], AlexNet [19], VGGNet [20], GoogLeNet [21] and ResNet [22]), the first being simpler and lighter, the last being deeper and more complex. For this classification task, we performed experiments using many different available networks, obtaining the best performance with AlexNet and ResNet50.

AlexNet [19], proposed in 2012, is the winner of that year's ImageNet ILSVRC challenge; it is composed of stacked and connected layers: five convolutional layers followed by three fully-connected layers, with some max-pooling layers in between and a rectified linear unit nonlinearity after each convolutional and fully-connected layer. AlexNet has been fine-tuned on the ATC benchmark dataset according to the testing protocol explained in section 3.1, and the second fully-connected layer, containing 4096 nodes, has been used for feature extraction; the resulting descriptor therefore has length 4096. Fine-tuning on the ATC training set has been performed by changing the last fully-connected layer to match the number of classes of the target problem (c = 14), setting the maximum number of epochs to 40 and the mini-batch size to 10, and using a fixed learning rate of 0.001.
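For concreteness, the following MATLAB sketch reproduces this step under stated assumptions: the Deep Learning Toolbox alexnet model, an image datastore imdsTrain of the reshaped training matrices (our name), and the replication of the single-channel matrix M to the three RGB channels that AlexNet expects (our assumption, as the paper does not detail the channel handling).

```matlab
net = alexnet;                                 % pre-trained model (support package)
layersTransfer = net.Layers(1:end-3);          % drop fc8, softmax, classification output
layers = [layersTransfer
          fullyConnectedLayer(14)              % c = 14 first-level ATC classes
          softmaxLayer
          classificationLayer];
opts = trainingOptions('sgdm', ...
    'MaxEpochs', 40, 'MiniBatchSize', 10, 'InitialLearnRate', 0.001);
tuned = trainNetwork(imdsTrain, layers, opts); % imdsTrain: 227x227x3 reshaped inputs
feat = activations(tuned, repmat(M, [1 1 3]), 'fc7');  % 4096-dim descriptor from fc7
```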

ResNet [22], the winner of ILSVRC 2015, is a network about 20 times deeper than AlexNet which introduces residual layers, a kind of "network-in-network" architecture that can be seen as a set of "building blocks" used to construct the network. The MATLAB network ResNet50 has been fine-tuned by changing the last layer to match the number of classes of the target problem and using the same training options used for AlexNet. The last fully-connected layer has been selected for feature extraction, resulting in a descriptor of 14 elements (i.e., the number of classes).
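A corresponding sketch for ResNet50, which ships as a DAG network in MATLAB and is therefore edited via layerGraph/replaceLayer; the layer names ('fc1000', 'ClassificationLayer_fc1000') are those of the shipped resnet50 model and should be treated as assumptions.

```matlab
lg = layerGraph(resnet50);                     % pre-trained model (support package)
lg = replaceLayer(lg, 'fc1000', fullyConnectedLayer(14, 'Name', 'fc14'));
lg = replaceLayer(lg, 'ClassificationLayer_fc1000', classificationLayer('Name', 'out'));
tuned = trainNetwork(imdsTrain224, lg, opts);  % same options as AlexNet; 224x224 inputs
feat = activations(tuned, repmat(M, [1 1 3]), 'fc14');  % 14-dim descriptor
```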

2.4 Data classification and fusion

Since ATC classification is a multi-label problem, i.e., a problem in which a sample may belong to more than one class, we use two multi-label classifiers for the classification step and perform fusion at the score level to boost performance. The first classifier is LIFT (multi-label learning with Label specIfic FeaTures) [12], a two-step multi-label learning method. The approach is based on the observation that in multi-label learning different classes may carry specific characteristics. Therefore, the first step selects features specific to each class by means of cluster analysis; in the second step, a set of Support Vector Machines (SVMs) is trained using the features selected for each class. In this work we use linear-kernel SVMs. The final response for an unknown sample is obtained by comparing the response of each classifier to a fixed threshold τ (τ = 0.5 if not otherwise specified). The second classifier is taken from MLC Toolbox [13], a MATLAB/OCTAVE library for multi-label classification which offers an environment for evaluating many combinations of feature-space dimension reduction, sample clustering, label-space dimension reduction, and classifier ensembles. In our experiments, we use a set of ridge regression (RR) classifiers. Ridge regression is an extension of linear regression whose loss function includes a regularization term that penalizes large coefficients according to a parameter λ (λ = 1 in this work). The multi-label problem is handled with the binary relevance (BR) approach, a decomposition method in which each label is predicted by a binary classifier trained independently of the other labels (thus assuming their independence).
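To make the two classification steps concrete, here are minimal MATLAB sketches; both are our simplified illustrations (variable names ours), not the code of [12] or of the MLC Toolbox, and they assume a training matrix X (N×p features), a binary label matrix Y (N×c), and a test matrix Xtest.

```matlab
% --- LIFT, single label j (simplified from [12]) ---
r = 0.1;                                          % clustering ratio
pos = X(Y(:, j) == 1, :);                         % positive instances of label j
neg = X(Y(:, j) == 0, :);                         % negative instances
m = ceil(r * min(size(pos, 1), size(neg, 1)));    % clusters per side
[~, Cp] = kmeans(pos, m);                         % label-specific cluster centers
[~, Cn] = kmeans(neg, m);
Phi  = [pdist2(X, Cp),     pdist2(X, Cn)];        % distance-based representation
PhiT = [pdist2(Xtest, Cp), pdist2(Xtest, Cn)];
mdl = fitPosterior(fitcsvm(Phi, Y(:, j), 'KernelFunction', 'linear'));
[~, prob] = predict(mdl, PhiT);
predLIFT_j = prob(:, 2) > 0.5;                    % threshold tau = 0.5

% --- Binary-relevance ridge regression (no bias term) ---
lambda = 1;                                       % regularization weight used here
W = (X' * X + lambda * eye(size(X, 2))) \ (X' * Y);  % one ridge solution per label
scoresRR = Xtest * W;                             % N_test x c score matrix
predRR = scoresRR > 0.5;
```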

3. Results and Discussion

3.1 Testing protocol

Experiments have been conducted on the benchmark dataset described in section 2.1 according to the jackknife test, which is considered the least arbitrary [23] of the cross-validation methods most used in statistical prediction (i.e., independent dataset test, K-fold cross-validation, and jackknife test). The jackknife test is a resampling technique which performs parameter estimation considering all the samples available in the training set except one; the procedure is repeated so as to omit each sample from training in turn, and the results are averaged over all runs. During the fine-tuning of the CNNs, each training sample is labelled with its first class only: CNNs are single-label classifiers, so the other labels are discarded for training purposes. Note that, as stated above, CNNs are used as feature extractors in this approach.

3.2 Performance indicators

Following recent works in this field [6][8], we use 5 performance indicators to evaluate and compare performance on this multi-label classification problem. Let 𝕃ₖ be the subset of "actual" labels observed for the kth sample and 𝕃*ₖ the subset of "predicted" labels for the kth sample. The "Aiming" defined in equation (1), also known as "Precision", checks the rate of correctly predicted labels over the predicted labels; the "Coverage" in equation (2), also known as "Recall", measures the rate of correctly predicted labels over the actual labels; the "Accuracy" in equation (3) checks the average ratio of correctly predicted labels over the union of "actual" and "predicted" labels; the "Absolute true" in equation (4) measures the ratio of perfectly correct prediction events over the total prediction events; and the "Absolute false" in equation (5) checks the ratio of completely wrong predictions over the total prediction events. All five indicators range in [0,1]; the first four should be maximized, while the fifth, representing an error, should be minimized. The five metrics are defined according to the formulation in [23]:

$$\mathrm{Aiming} = \frac{1}{N}\sum_{k=1}^{N}\frac{\lVert \mathbb{L}_k \cap \mathbb{L}_k^{*} \rVert}{\lVert \mathbb{L}_k^{*} \rVert} \tag{1}$$

$$\mathrm{Coverage} = \frac{1}{N}\sum_{k=1}^{N}\frac{\lVert \mathbb{L}_k \cap \mathbb{L}_k^{*} \rVert}{\lVert \mathbb{L}_k \rVert} \tag{2}$$

$$\mathrm{Accuracy} = \frac{1}{N}\sum_{k=1}^{N}\frac{\lVert \mathbb{L}_k \cap \mathbb{L}_k^{*} \rVert}{\lVert \mathbb{L}_k \cup \mathbb{L}_k^{*} \rVert} \tag{3}$$

$$\mathrm{Absolute\ True} = \frac{1}{N}\sum_{k=1}^{N}\Delta(\mathbb{L}_k, \mathbb{L}_k^{*}) \tag{4}$$

$$\mathrm{Absolute\ False} = \frac{1}{N}\sum_{k=1}^{N}\frac{\lVert \mathbb{L}_k \cup \mathbb{L}_k^{*} \rVert - \lVert \mathbb{L}_k \cap \mathbb{L}_k^{*} \rVert}{M} \tag{5}$$

where N is the number of samples, M is the number of classes, ‖·‖ denotes the number of elements in a set, and Δ(·,·) is an operator returning 1 if the two sets have exactly the same elements and 0 otherwise.
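As a reference implementation of Eqs. (1)-(5), the following MATLAB sketch computes the five indicators from binary label matrices; the function name, the logical-matrix encoding, and the guard against empty prediction sets are our choices, not part of [23].

```matlab
function [aim, cov, acc, absT, absF] = mlMetrics(Y, P)
% Y, P : N x M logical matrices of actual and predicted labels
N = size(Y, 1);  M = size(Y, 2);
inter = sum(Y & P, 2);                   % ||L_k intersect L*_k||
uni   = sum(Y | P, 2);                   % ||L_k union L*_k||
aim  = mean(inter ./ max(sum(P, 2), 1)); % Eq. (1); guard against empty predictions
cov  = mean(inter ./ sum(Y, 2));         % Eq. (2); every sample has >= 1 actual label
acc  = mean(inter ./ uni);               % Eq. (3)
absT = mean(all(Y == P, 2));             % Eq. (4): exact-match indicator
absF = mean((uni - inter) / M);          % Eq. (5)
end
```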

3.3 Comparison of features

This experiment is designed to evaluate the features proposed in this work: the two new feature extractors introduced in section 2.3, obtained by feature reshaping and transfer learning from two different CNNs, AlexNet and ResNet. The baselines for comparison are the original feature vector 𝑓 ∈ ℜ⁴² (named Base) and the texture descriptor HoG ∈ ℜ²⁷⁰ proposed in [8]. All the methods are coupled with the two classifiers described in section 2.4 (LIFT and RR). Note that in [8] only the LIFT classifier was evaluated; the results for RR have been computed for this work.

Table 2. The jackknife success rates achieved by deep learned and baseline descriptors on the benchmark dataset.

Descriptor   Classifier   Aiming   Coverage   Accuracy   Absolute true   Absolute false
AlexNet      LIFT         0.8550   0.6599     0.6423     0.6101          0.0315
ResNet       LIFT         0.8071   0.5701     0.5468     0.5230          0.0356
Base [8]     LIFT         0.8393   0.6778     0.6818     0.6111          0.0313
HoG [8]      LIFT         0.8048   0.6739     0.6773     0.6173          0.0288
AlexNet      RR           0.8444   0.6850     0.6891     0.6333          0.0288
ResNet       RR           0.8881   0.4071     0.3811     0.3538          0.0560
Base         RR           0.8472   0.5745     0.5755     0.5127          0.0412
HoG          RR           0.8814   0.6124     0.6024     0.5601          0.0373

The results in Table 2 show that the success rates obtained by the AlexNet descriptor are better than those of ResNet and of the other baseline descriptors for almost all the indicators and for both classification configurations. In our opinion, this is a consequence of the greater length of the AlexNet descriptor (4096 vs. 14). AlexNet obtains very valuable rates for "absolute true" and "absolute false", which are considered [23] the most important indicators for multi-label systems and are in contrast with each other (i.e., it is extremely difficult to increase the absolute true rate while reducing the absolute false rate).

3.4 Comparison of systems

This experiment is designed to evaluate systems obtained by the fusion of different descriptors/classifiers. All the fusion approaches are based on the average rule: the final score is the average of the scores of the different approaches (without normalization). Table 3 compares the success rates achieved by several ensembles proposed here and by some of the best approaches in the literature:

• EnsANet_cl: ensemble of K=5 classifiers (cl=[LIFT,RR]) trained using AlexNet descriptors obtained by different reshaping operations, as detailed in section 2.2
• EnsANet_LR: fusion by the average rule of EnsANet_LIFT + EnsANet_RR
• EnsBH_RR: fusion by the average rule of Base + HoG (with the RR classifier)
• EnsBHA_RR: fusion by the average rule of Base + HoG + EnsANet_RR (with the RR classifier)
• LSTM: a long short-term memory (LSTM) network trained in a similar way to AlexNet, whose last layer is used for training LIFT/RR. We use the same parameters proposed in https://it.mathworks.com/help/nnet/examples/classify-sequence-data-using-lstm-networks.html, i.e., numHiddenUnits = 100 and miniBatchSize = 27
• EnsLIFT [8]: ensemble of 50 LIFT classifiers trained using HoG
• iATC-mISF [6]: a predictor based on the fusion of different descriptors
• iATC-mHyb [7]: a hybrid approach combining iATC-mISF with the predictor iATC-mDO, which is based on drug ontologies (DO)
• EnsANet_LR + DO: the combination of our system with the DO features used in [7]; if the DO features are present we use LIFT trained with DO, otherwise the score is given by EnsANet_LR (a sketch of this fusion is given below)
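The DO fallback described in the last item can be sketched as follows; all variable names are ours, and the score matrices are assumed to lie in [0,1].

```matlab
% Average-rule fusion with DO fallback (an illustrative sketch).
% S_lift, S_rr : N x 14 score matrices from EnsANet_LIFT and EnsANet_RR
% S_do         : N x 14 scores of LIFT trained on DO features, with NaN rows
%                for compounds that have no DO descriptor
S_ens = (S_lift + S_rr) / 2;           % EnsANet_LR: average rule, no normalization
hasDO = ~any(isnan(S_do), 2);          % compounds with a DO representation
S_final = S_ens;
S_final(hasDO, :) = S_do(hasDO, :);    % DO-based scores replace the ensemble score
pred = S_final > 0.25;                 % label decision, tau = 0.25 as in Table 3
```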

Table 3. The jackknife success rates achieved by some of the ensembles proposed in this work compared with state-of-the-art predictors reported in the literature.

Ensemble Name              Classifier   Aiming   Coverage   Accuracy   Absolute true   Absolute false
EnsANet_LIFT               LIFT         0.8769   0.6553     0.6728     0.6263          0.0310
EnsANet_RR                 RR           0.8728   0.7011     0.7033     0.6637          0.0272
EnsANet_LR                 LIFT+RR      0.7536   0.8249     0.7512     0.6668          0.0262
EnsBH_RR                   RR           0.8705   0.5979     0.6043     0.5483          0.0384
EnsBHA_RR                  RR           0.8868   0.6570     0.6615     0.6155          0.0323
LSTM_LIFT                  LIFT         0.8068   0.7070     0.7015     0.6555          0.0372
LSTM_RR                    RR           0.8528   0.6177     0.7041     0.6530          0.0323
Chen et al. [2]                         0.5076   0.7579     0.4938     0.1383          0.0883
EnsLIFT [8]                             0.7818   0.7577     0.7121     0.6330          0.0285
iATC-mISF [6]                           0.6783   0.6710     0.6641     0.6098          0.0585
iATC-mHyb [7]                           0.7191   0.7146     0.7132     0.6675          0.0243
EnsANet_LR + DO (τ=0.25)                0.7957   0.8335     0.7778     0.7090          0.0240
EnsANet_LR + DO (τ=0.5)                 0.9011   0.7162     0.7232     0.6871          0.0267

The results in Table 3 make clear that combining several AlexNet descriptors in the ensemble EnsANet_cl brings a consistent performance improvement over the single descriptor of Table 2, with both LIFT and RR as base classifiers. The base classifier RR outperforms LIFT both in the stand-alone methods and in the ensembles; nevertheless, the design of an ensemble based on the perturbation of classifiers [24] (EnsANet_LR) allows a further improvement over the single systems. EnsANet_LR, the ensemble based on AlexNet descriptors and the LIFT and RR classifiers, is the most stable and best performing method presented in this paper, and its superiority is also proved in the comparison with other recent state-of-the-art systems (EnsLIFT [8], iATC-mISF [6], iATC-mHyb [7]). EnsBH_RR and EnsBHA_RR are two ensembles based on the perturbation of features: their performance is lower than that of EnsANet_LR but higher than that of each stand-alone system. This result shows that the three feature sets (Base, HoG, and AlexNet), even though obtained from the same Base descriptor, are quite uncorrelated with each other, and their independence can be exploited to design an ensemble. We also ran tests using other CNNs, such as GoogLeNet, VGG16, and VGG19, but they did not converge and the resulting performance of LIFT/RR was low; for this reason they are not reported in Table 3. It has already been shown in [7] that mapping the compounds into the drug ontology (DO) database space and fusing this information with other descriptors can significantly enhance the quality of ATC classification. Our final system, which combines the score of EnsANet_LR with DO when available (only 1,144 drug compounds in the benchmark dataset have a DO descriptor; the other 2,689 samples are classified considering only the first representation), achieves the best performance among all the methods published in the literature. The main drawback of using CNNs in this application is that many of them need a large training set: given the size of the available dataset, many CNNs cannot be used for feature extraction. Moreover, the training time of CNNs is very long, and the tests performed here could not have been run without a powerful GPU (we used an NVIDIA Titan X). Note, however, that after the training step the test step is very fast even on a standard CPU, as is well known in the CNN literature.

4. Conclusions

The focus of this paper was to find a new valuable descriptor for predicting a compound's ATC class or classes. We propose a new method for extracting a numerical descriptor from a compound based on reshaping the input feature vector into a 2D matrix and performing transfer learning from a fine-tuned CNN, AlexNet. Extensive experiments demonstrate that the new predictor outperforms existing predictors in the literature in almost all of the five metrics used to examine the performance of multi-label systems, particularly in the "absolute true" rate and the "absolute false" rate, the two most significant indexes. Moreover, the new descriptor proves to be quite uncorrelated with existing ones, thus allowing the design of a powerful ensemble based on the fusion of different descriptors, including "baseline" correlation measures among classes (Base), textural descriptors (HoG), and features based on drug ontologies (DO). LSTM coupled with LIFT/RR also obtains good performance; as future work we plan to run more tests to optimize it and to develop an ensemble based on both LSTM and CNN.

The MATLAB code for the new method is available for further research at https://github.com/LorisNanni

5. Acknowledgements

We would like to acknowledge the support that NVIDIA provided us through the GPU Grant Program: we used a donated Titan X GPU to train the CNNs used in this work.

6. References

[1] R.C. Pitts, Reconsidering the concept of behavioral mechanisms of drug action, J. Exp. Anal. Behav. 101 (2014) 422-441. doi:10.1002/jeab.80.
[2] L. Chen, W.M. Zeng, Y.D. Cai, K.Y. Feng, K.C. Chou, Predicting anatomical therapeutic chemical (ATC) classification of drugs by integrating chemical-chemical interactions and similarities, PLoS One 7 (2012). doi:10.1371/journal.pone.0035254.
[3] H. Ogata, S. Goto, K. Sato, W. Fujibuchi, H. Bono, M. Kanehisa, KEGG: Kyoto Encyclopedia of Genes and Genomes, Nucleic Acids Res. 27 (1999) 29-34. doi:10.1093/nar/27.1.29.
[4] M. Dunkel, S. Günther, J. Ahmed, B. Wittig, R. Preissner, SuperPred: drug classification and target prediction, Nucleic Acids Res. 36 (2008). doi:10.1093/nar/gkn307.
[5] L. Wu, N. Ai, Y. Liu, Y. Wang, X. Fan, Relating anatomical therapeutic indications by the ensemble similarity of drug sets, J. Chem. Inf. Model. 53 (2013) 2154-2160. doi:10.1021/ci400155x.
[6] X. Cheng, S.-G. Zhao, X. Xiao, K.-C. Chou, iATC-mISF: a multi-label classifier for predicting the classes of anatomical therapeutic chemicals, Bioinformatics (2016) btw644. doi:10.1093/bioinformatics/btw644.
[7] X. Cheng, S.-G. Zhao, X. Xiao, K.-C. Chou, iATC-mHyb: a hybrid multi-label classifier for predicting the classification of anatomical therapeutic chemicals, Oncotarget 8 (2017) 58494-58503. doi:10.18632/oncotarget.17028.
[8] L. Nanni, S. Brahnam, Multi-label classifier based on histogram of gradients for predicting the anatomical therapeutic chemical class/classes of a given compound, Bioinformatics 33 (2017) 2837-2841. doi:10.1093/bioinformatics/btx278.
[9] J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks 61 (2014) 85-117. doi:10.1016/j.neunet.2014.09.003.
[10] L. Nanni, S. Ghidoni, S. Brahnam, Handcrafted vs. non-handcrafted features for computer vision classification, Pattern Recognit. 71 (2017) 158-172. doi:10.1016/j.patcog.2017.05.025.
[11] T.H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, Y. Ma, PCANet: A simple deep learning baseline for image classification?, IEEE Trans. Image Process. 24 (2015) 5017-5032. doi:10.1109/TIP.2015.2475625.
[12] M.L. Zhang, L. Wu, LIFT: Multi-label learning with label-specific features, IEEE Trans. Pattern Anal. Mach. Intell. 37 (2015) 107-120. doi:10.1109/TPAMI.2014.2339815.
[13] K. Kimura, L. Sun, M. Kudo, MLC Toolbox: A MATLAB/OCTAVE library for multi-label classification, CoRR abs/1704.0 (2017).
[14] L. Nanni, S. Brahnam, A. Lumini, Texture descriptors for generic pattern classification problems, Expert Syst. Appl. 38 (2011). doi:10.1016/j.eswa.2011.01.123.
[15] L. Nanni, S. Brahnam, A. Lumini, Matrix representation in pattern classification, Expert Syst. Appl. 39 (2012). doi:10.1016/j.eswa.2011.08.165.
[16] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, et al., Recent advances in convolutional neural networks, Pattern Recognit. (2017). doi:10.1016/j.patcog.2017.10.013.
[17] J. Yosinski, J. Clune, Y. Bengio, H. Lipson, How transferable are features in deep neural networks?, in: Proc. 27th Int. Conf. Neural Inf. Process. Syst., Vol. 2, MIT Press, Cambridge, MA, USA, 2014, pp. 3320-3328. http://dl.acm.org/citation.cfm?id=2969033.2969197.
[18] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning applied to document recognition, Proc. IEEE 86 (1998) 2278-2323. doi:10.1109/5.726791.
[19] A. Krizhevsky, I. Sutskever, G.E. Hinton, ImageNet classification with deep convolutional neural networks, Adv. Neural Inf. Process. Syst. (2012) 1-9.
[20] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, Int. Conf. Learn. Represent. (2015) 1-14.
[21] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, et al., Going deeper with convolutions, in: Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1-9. doi:10.1109/CVPR.2015.7298594.
[22] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770-778. doi:10.1109/CVPR.2016.90.
[23] K.-C. Chou, Some remarks on predicting multi-label attributes in molecular biosystems, Mol. Biosyst. 9 (2013) 1092-1100. doi:10.1039/c3mb25555g.
[24] L. Nanni, A. Lumini, Using ensemble of classifiers in Bioinformatics, in: H. Peters, M. Vogel (Eds.), Mach. Learn. Res. Prog., Nova Publishers, 2010.