Functional Link Artificial Neural Network-based Disease Gene Prediction

Proceedings of International Joint Conference on Neural Networks, Atlanta, Georgia, USA, June 14-19, 2009

Jiabao Sun, Jagdish C. Patra and Yongjin Li

Abstract - Genes that contribute to complex traits pose special challenges that make the discovery of candidate disease-associated genes more difficult. In this work, we investigated topological features derived from the protein-protein interaction (PPI) network to identify the causative genes of four complex diseases: Cancer, Type 1 Diabetes, Type 2 Diabetes, and Ageing. We used 10-fold cross-validation to evaluate the predictive capacity of all possible combinations of these features and found the features with the best predictive ability. We assessed the performance of Multi-layer Perceptron (MLP), Functional Link Artificial Neural Network (FLANN), and Support Vector Machines (SVM). We found that SVM provides higher accuracy than MLP and FLANN. However, FLANN requires significantly less computation time, while its accuracy is comparable to that of SVM and MLP.

I. INTRODUCTION

Bioinformatics research plays an important role in analyzing biological data for candidate genes and in selecting the subset of most likely disease gene candidates for empirical validation. The completion of the Human Genome Project has brought new research opportunities and challenges. Before the project started in 1990, fewer than 100 gene-disease associations had been established. Currently, more than 1400 gene-disease associations have been identified. The Online Mendelian Inheritance in Man (OMIM) database is one of the best-known databases storing gene-disease associations.

One challenge that scientists face is that determining disease-related genes requires laborious experiments. Genetic linkage analysis works very well for identifying disease-associated genomic regions. However, these regions can contain hundreds of genes. Picking out the real disease-related genes from this large number of candidates by biological experiment requires considerable effort and time. Predicting good candidate genes before experimental analysis is therefore necessary, since it saves both time and effort.

A variety of computational approaches for selecting candidate disease genes have been developed. Several candidate gene identification systems that rely on grouping Gene Ontology (GO) terms have been described [1, 2], notably POCUS [3], which finds genes across multiple susceptibility loci that share Interpro domains [4] and GO terms. Other approaches predicted disease genes through sequence-based features, because human genes involved in hereditary diseases were found to have distinct sequence properties that render them more susceptible to mutations causing genetic disorders. Physically interacting proteins tend to be involved in the same cellular process, and disease genes preferentially interact with other disease genes [5, 6], so mutations in their genes may lead to similar disease phenotypes. Oti et al. predicted interacting partners of disease genes in the disease loci to be disease genes [7]. Five topological features of hereditary disease genes and cancer genes in the PPI network were investigated separately to predict disease genes [8, 9].

Existing methods have been successful in identifying single disease genes of high relative risk; however, they have typically failed to identify genes underlying complex diseases or traits, which often present with a wide range of phenotypes and generally involve multiple aetiological mechanisms and contributing genes [10, 11]. In particular, the contribution of each of several genes to the complex disease state is likely to be small, and only the joint effect of several susceptibility genes (often in concert with predisposing environmental factors) leads to disease, making functional validation of complex disease-causing genes difficult [12].

In this work, in addition to the five topological features used in [8, 9], we investigated three more topological features derived from the protein-protein interaction network and used them to classify and characterize the genes of four complex diseases: Cancer, Type 1 Diabetes, Type 2 Diabetes, and Ageing. We used 10-fold cross-validation to evaluate the predictive capacity of all possible combinations of these features and found the best ones. The results show that the predictive ability of our newly proposed features is satisfactory.
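As a rough illustration of the evaluation protocol described above, the following Python sketch enumerates every non-empty combination of eight features and partitions sample indices into 10 folds. The feature labels, the sample count, and the unshuffled striding scheme are assumptions made for illustration only; the paper does not specify how its folds were drawn.

```python
from itertools import combinations

# Hypothetical labels for the eight topological features.
FEATURES = [f"F{n}" for n in range(1, 9)]


def feature_subsets(features):
    """Yield every non-empty combination of the given features."""
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            yield subset


def k_fold_indices(n_samples, k=10):
    """Partition indices 0..n_samples-1 into k folds by striding.

    Assumption: no shuffling; a real experiment would normally
    shuffle (and possibly stratify) the samples first.
    """
    return [list(range(start, n_samples, k)) for start in range(k)]


subsets = list(feature_subsets(FEATURES))
folds = k_fold_indices(100, k=10)  # 100 is an arbitrary sample count
```

With eight features there are 2^8 - 1 = 255 non-empty combinations, so an exhaustive search with 10-fold cross-validation trains each classifier 2550 times per data set.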
We assessed the performance of three classifiers: Multi-layer Perceptron (MLP), Functional Link Artificial Neural Network (FLANN), and Support Vector Machines (SVM). We found that SVM obtained higher accuracy than MLP and FLANN. However, the FLANN has a much lower computational requirement, while its performance is comparable with that of SVM and MLP. The FLANN is a single-layer neural network (NN) in which the need for a hidden layer is eliminated. Instead, the original input pattern is enhanced by functional expansion using orthogonal trigonometric functions. Therefore its computational complexity is much lower than that of the MLP or SVM.
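The idea of trigonometric functional expansion can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the expansion order, the use of sin(n*pi*x) and cos(n*pi*x) terms, the tanh output nonlinearity, and the function names are all assumptions chosen to match a common FLANN formulation.

```python
import math


def trig_expand(x, order=2):
    """Expand each input x_i into
    [x_i, sin(pi x_i), cos(pi x_i), ..., sin(order pi x_i), cos(order pi x_i)].
    """
    expanded = []
    for value in x:
        expanded.append(value)
        for n in range(1, order + 1):
            expanded.append(math.sin(n * math.pi * value))
            expanded.append(math.cos(n * math.pi * value))
    return expanded


def flann_output(x, weights, bias=0.0, order=2):
    """Single-layer output: tanh of the weighted sum of the expanded inputs.

    Assumption: tanh is used as the output nonlinearity.
    """
    phi = trig_expand(x, order)
    if len(weights) != len(phi):
        raise ValueError("weights must match the expanded input length")
    return math.tanh(sum(w * p for w, p in zip(weights, phi)) + bias)
```

Each input contributes 2 * order + 1 expanded terms, so an eight-feature pattern with order 2 becomes a 40-dimensional vector feeding a single layer of weights. No hidden layer is involved.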

School of Computer Engineering, Nanyang Technological University, Singapore. E-mail: {sunj0006, aspatra, s070035}@ntu.edu.sg.

978-1-4244-3553-1/09/$25.00 ©2009 IEEE


II. METHODS

A. Data Description

In this section, we briefly introduce the experimental data, which were collected from several databases.

1) PPI Network
The PPI data were derived from HPRD (Human Protein Reference Database) [13]. HPRD is a protein database accessible through the internet. It contains manually curated scientific information pertaining to the biology of most human proteins. All the interactions in HPRD are extracted manually from the literature by expert biologists who read, interpret and analyze the published data. HPRD contains not only the protein interactions themselves, but also the experimental information and the literature from which the interactions come.

2) Cancer
Cancer is a class of diseases in which a group of cells display uncontrolled growth (division beyond the normal limits), invasion (intrusion on and destruction of adjacent tissues), and sometimes metastasis (spread to other locations in the body via lymph or blood) [14]. Cancer can affect people at all ages, even fetuses, but the risk for most varieties increases with age. According to the American Cancer Society, 7.6 million people worldwide died from cancer in 2007, accounting for 13% of all deaths. The cancer-related genes we used in this experiment were obtained from the Cancer Gene Census project [15], which is an ongoing effort to catalogue those genes for which mutations have been causally implicated in cancer. To date, the Cancer Gene Census has catalogued 377 cancer-associated genes. These genes were selected from hundreds of research papers by the Cancer Genome Project team at the Wellcome Trust Sanger Institute. All 377 genes are reported to show mutations in primary patient material in at least two independent reports.

3) Type 1 Diabetes
Type 1 Diabetes (T1D) is a form of diabetes mellitus. It is an autoimmune disease that results in the permanent destruction of the insulin-producing beta cells of the pancreas. The T1D candidate genes we used in this experiment were obtained from T1DBase [16], a public website and database that supports researchers working on the molecular genetics and biology of type 1 diabetes susceptibility and pathogenesis. The system collates and organizes data relevant to T1D research from public and private sources. Currently, there are 425 candidate genes in T1DBase.

4) Type 2 Diabetes
Type 2 Diabetes (T2D) is a metabolic disorder that is primarily characterized by insulin resistance, relative insulin deficiency, and hyperglycemia [17]. Its prevalence is increasing rapidly in the developed world.

TABLE I
THE NUMBER OF DISEASE GENES IN THE PPI NETWORK

Disease   Total No. of Disease Genes   No. of Disease Genes in the PPI Network
Cancer    377                          322
T1D       425                          309
T2D       159                          136
Ageing    261                          242

The T2D candidate genes used in this experiment were obtained from T2D-Db [18], a database of all molecular factors reported to be involved in the pathogenesis of T2D in human, mouse and rat. Currently, 159 candidate genes are reported to be associated with T2D in this database.

5) Ageing
Ageing is the accumulation of changes in an organism or object over time. Ageing in humans refers to a multidimensional process of physical, psychological and social change [19]. Human ageing is a major but poorly understood biological problem. Even though ageing is universal amongst human beings, there is little information on the genetics of human ageing and there are few online resources. The ageing-associated genes we used in this work were obtained from the GenAge database of HAGR (the Human Ageing Genomic Resources) [20]. HAGR is a collection of databases and tools that aims to help researchers understand the genetics of human ageing through a combination of functional genomics and evolutionary biology. The GenAge database is a core component of HAGR. It is a curated database of genes associated with human ageing. Currently, there are 261 candidate genes in GenAge, all of which are reported to be related to human ageing.

B. Building Training Samples

For each of the four diseases, we obtained the disease-related genes and then mapped these genes to the PPI network that we extracted from HPRD. The genes found in the network made up the disease gene set, or positive set. The number of such genes for each disease is listed in Table I. For example, in row 1, out of the 377 genes obtained from the Cancer database [15], 322 genes were found in the PPI network database [13]. These 322 genes form the positive set for Cancer.

Compiling a list of genes from HPRD that are known not to be involved in the corresponding disease is quite difficult. We used the method proposed in [9] to select non-disease genes. Essential genes have been found to differ significantly from both disease-causing genes and other genes [21]. This set of essential genes should be classified as an independent and unique group that is distinctly different from the disease-causing genes and non-disease genes. Because a group of well-defined human essential genes is not available, the authors of that study compiled a group of ubiquitously expressed human genes (UEHGs) to approximate essential genes.


Fig. 1. Illustration of the control set and negative genes (figure labels: control set, UEHGs).

TABLE II
BUILDING POSITIVE SET AND NEGATIVE SET

                          No. of Genes
Disease   Positive Set    Negative Set 1   Set 2   Set 3   Set 4
Cancer    322             322              400     500     800
T1D       309             309              400     500     800
T2D       136             136              400     500     800
Ageing    242             242              400     500     800
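The sets in Table II can be assembled with plain set operations: the control set is the PPI network genes minus the UEHGs and the disease genes, and each negative set is a random sample from that control set. The Python sketch below is illustrative only; the function names, the toy gene identifiers, and the fixed random seed are assumptions, not details from the paper.

```python
import random


def build_control_set(ppi_genes, uehg_genes, disease_genes):
    """Control set = PPI network genes minus UEHGs minus disease genes."""
    return set(ppi_genes) - set(uehg_genes) - set(disease_genes)


def sample_negative_set(control_set, size, seed=0):
    """Randomly draw `size` non-disease genes from the control set.

    Sorting first makes the draw reproducible for a given seed.
    """
    rng = random.Random(seed)
    return set(rng.sample(sorted(control_set), size))


# Toy identifiers, purely illustrative.
ppi = {f"g{n}" for n in range(20)}
uehg = {"g0", "g1"}
disease = {"g2", "g3", "g4"}

control = build_control_set(ppi, uehg, disease)
negative = sample_negative_set(control, size=len(disease))
```

Calling `sample_negative_set` with sizes 400, 500 and 800 on a real control set would correspond to Sets 2 to 4 in Table II.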

To obtain a set of negative genes, that is, genes not causing any disease (non-disease genes), we generated a control set of genes for each of the four diseases. The control set for a disease is obtained by excluding both the UEHGs and the disease genes from the genes of the PPI network. From the control set we then randomly selected genes as the non-disease gene set, or negative set, first with a size equal to that of the positive set, and then with sizes of 400, 500 and 800 (Table II).

C. Defining Topological Features

We investigated eight measures, five of which were used in [8, 9], for assessing the topological properties of genes in the protein-protein interaction network. The assumption is that the topological features of disease-associated genes differ from those of non-disease-associated genes. Several works have provided evidence for this assumption: Tu et al. found that disease genes have a larger degree [21], and Gandhi et al. observed that human disease-associated genes preferentially interact with other disease-causing genes [6]. Proteins seldom perform their function individually, but rather in a modular fashion. This suggests that disease-causing genes might tend to be clustered into modules in order to perform their "function". As shown in Fig. 2, protein-protein interaction data can be viewed as a network, where proteins are nodes (dark shaded nodes represent proteins encoded by disease genes) and interactions between proteins are links. For example, the protein interaction network illustrated in Fig. 2 contains 18 proteins and 27 interactions in total. The link between node 1 and node 2 means that protein 1 and protein 2 interact with each other. Analysis of the global architecture of this large-scale interaction network can give us insights into general cellular mechanisms.

Fig. 2. An illustration of a protein-protein interaction network.

TABLE III
FUNCTIONS USED TO DESCRIBE TOPOLOGICAL FEATURES

Symbol              Description
$k_i$               the number of direct links to node i
$d_i$               the number of direct links between node i and disease genes
$N_i$               the set of direct neighbors of node i
$l_{ij}$            the length of the shortest path between node i and node j
$c_{ij}$            the number of nodes to which both node i and node j are directly linked
$D$                 the set of all disease genes that we investigated in the PPI network
$\sigma_{ij}$       the number of shortest paths from node i to node j
$\sigma_{ij}(v)$    the number of shortest paths from node i to node j that pass through a node v

Before we introduce the topological features, we first define several symbols in Table III.

1) Feature 1 (Degree)
The degree of a node i is the number of direct links to node i. The computation of degree is shown below:

$\mathrm{Degree}(i) = k_i$    (1)

Degree is the simplest network measure. It measures the extent of influence that a node has on the network.

2) Feature 2 (1N-index) [9]
The 1N-index of a node i is the proportion of disease genes among the immediate neighbors of node i. With $d_i$ the number of direct links between node i and disease genes, the computation of the 1N-index is shown below:

$\mathrm{1N}(i) = \frac{d_i}{k_i}$    (2)
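The first two features reduce to simple counts over an adjacency structure. The following Python sketch assumes the network is given as an undirected edge list; the function names and the toy network are illustrative, and returning 0.0 for an isolated node is an assumption the paper does not address.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map {node: set(neighbors)} from an edge list."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def degree(adjacency, node):
    """Feature 1: number of direct links to the node (k_i)."""
    return len(adjacency[node])


def one_n_index(adjacency, node, disease):
    """Feature 2: proportion of disease genes among direct neighbors (d_i / k_i)."""
    k = degree(adjacency, node)
    d = len(adjacency[node] & disease)
    return d / k if k else 0.0  # assumption: isolated nodes score 0


# Toy network: node 1 links to 2, 3 and 4; nodes 2 and 3 are also linked.
adj = build_adjacency([(1, 2), (1, 3), (1, 4), (2, 3)])
disease_genes = {2, 5}
```

In this toy network node 1 has three neighbors, one of which (node 2) is a disease gene, so its 1N-index is 1/3.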

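3) Feature 3 (2N-index) [9]
The 2N-index of a node i measures its connectivity with disease genes in the 2-neighborhood of node i (the 2-neighborhood of a node i is the set of all nodes connected to node i by a minimum of two links). With $N_i$ the set of direct neighbors of node i, and $k_j$ and $d_j$ as defined in Table III, the computation of the 2N-index is shown below:

$$
\mathrm{2N}(i) =
\begin{cases}
\dfrac{\sum_{j \in N_i} (d_j - 1)}{\sum_{j \in N_i} (k_j - 1)} & \text{if node } i \in D \\[2ex]
\dfrac{\sum_{j \in N_i} d_j}{\sum_{j \in N_i} (k_j - 1)} & \text{otherwise}
\end{cases}
\qquad (3)
$$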
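Equation (3) can be computed with one pass over the direct neighbors of node i. The Python sketch below assumes an adjacency map of the form {node: set(neighbors)}; the function name, the toy network, and the 0.0 fallback for a zero denominator are assumptions made for illustration.

```python
def two_n_index(adjacency, node, disease):
    """Feature 3: connectivity with disease genes in the 2-neighborhood.

    For every direct neighbor j, the link back to `node` is removed from
    k_j, and also from d_j when `node` is itself a disease gene.
    """
    numerator = 0
    denominator = 0
    for j in adjacency[node]:
        k_j = len(adjacency[j])
        d_j = len(adjacency[j] & disease)
        numerator += (d_j - 1) if node in disease else d_j
        denominator += k_j - 1
    return numerator / denominator if denominator else 0.0  # assumed fallback


# Toy network with disease genes 1 and 4.
toy = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2}, 4: {2}}
toy_disease = {1, 4}
```

For node 3, which is not a disease gene, the neighbors 1 and 2 contribute d = 0 and d = 2 against k - 1 = 1 and k - 1 = 2, giving 2/3. For the disease gene 1 the same neighborhood structure gives (1 + 0) / (2 + 1) = 1/3.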


4) Feature 4 (Sum of Topological Coefficient) [22]
This feature is the sum of the topological coefficients between node i and the disease genes. It measures the extent to which node i shares interaction partners with other disease-related genes. Below is the computation of Feature 4: (4)

5) Feature 5 (Average Distance to Disease Genes) [9]
This feature is the average distance between node i and all disease genes in the protein interaction network. The computation of Feature 5 is shown below:

$\sum_{j \in D} l_{ij}$