Identification of Key Nodes of Type 2 Diabetes Mellitus Protein ...

1 downloads 0 Views 787KB Size Report
Identification of Key Nodes of Type 2 Diabetes Mellitus Protein. Interactome and Study of their Interactions with Phloridzin. Vinay Randhawa,* Purnima Sharma,* ...
OMICS A Journal of Integrative Biology Volume 17, Number 6, 2013 ª Mary Ann Liebert, Inc. DOI: 10.1089/omi.2012.0115

Identification of Key Nodes of Type 2 Diabetes Mellitus Protein Interactome and Study of their Interactions with Phloridzin Vinay Randhawa,* Purnima Sharma,* Shashi Bhushan, and Ganesh Bagler**

Abstract

Network biology-inspired approaches could be used effectively in probing regulatory processes by which small molecules intervene with disease mechanisms. The present study aims at identification of key targets of type 2 diabetes mellitus (T2DM) by network analysis of the underlying protein interactome, and probing for mechanisms by which phloridzin could be critical at altering the disease phenotype. Towards this goal, we constructed a protein–protein interaction network associated with T2DM, starting from candidate genes and systems-level interactions data available. The relevance of the network constructed was verified with the help of gene ontology, node deletion, and biological essentiality studies. Using a network analysis method, MAPK1, EP300, and SMAD2 were identified as the most central proteins of potential therapeutic value. Phloridzin, a known antidiabetic agent, potentially interacts with proteins central to T2DM mechanisms. The structural understanding of interaction of phloridzin with these proteins of relevance to T2DM could provide better insight into its regulatory mechanisms and help in developing better therapeutic agents. The molecular docking results suggest that phloridzin is potentially involved in making critical interactions with MAPK1. These results could further be validated by experimental studies and could be used to design therapeutic agents for T2DM intervention.

Introduction

T

ype 2 diabetes mellitus (T2DM), a chronic metabolic disease, is a complex condition characterized by intricate molecular mechanisms and interlinked factors. T2DM is the most prevalent form of diabetes, accounting for around 90% of cases worldwide. According to recent World Health Organization estimates, approximately 346 million people have type 2 diabetes mellitus. The number of diabetic patients is expected to rise to 438 million by the year 2030 (Reboldi et al., 2011; World Health Organization-Diabetes mellitus, 2012). Patients with diabetes have an increased risk of developing a wide range of disease-related complications. Hypertension is extremely common in patients with diabetes, affecting approximately 20%–60% of patients (Arauz-Pacheco et al., 2004). Diabetes is also associated with an increased frequency of renal disease (diabetic nephropathy), a major cause leading to cardiovascular (American Diabetes Association, 2012; Lim and Lip, 2003; Winer and Sowers, 2004) and end-stage renal disease (Molitch et al., 2004). In addition, diabetic nephropathy is reported to be linked with a substantially increased risk of cardiovascular morbidity and mortality (Gerstein et al., 2001; Gross et al., 2005; Odoni and Ritz, 1999). Therefore, for

developing more targeted and effective approaches against T2DM, studying the disease from systems perspective and understanding control mechanisms of T2DM is essential. Systems biological studies aim to explore the underlying molecular architecture of biological systems and to investigate their regulatory mechanisms. Though various experimental studies have reported T2DM associated candidate genes hitherto (Evadnie et al., 2007; Hayes et al., 2007; Rasche et al., 2008; Rich et al., 2008), not much systems-level work has been done to assimilate the data and to look for drug targets. High throughput proteomic/genomic studies have led to large experimental datasets that could be effectively used in biological and medical research. The combination of graph (network) theory and ‘omics’ data facilitates systems-level studies of underlying molecular mechanisms of complex diseases. This approach allows construction of molecular (such as gene regulatory and protein) interactomes and their analysis to elucidate molecular elements and regulatory pathways that are critical for a disease (Baraba´si et al., 2011). Phloridzin (phloretin 2’-O-glucoside, phlorizin, phlorrhizin or phlorizoside), a natural phenolic compound present in apple pomace (Boyer and Liu, 2004; Escarpa and Gonza´lez, 1998) is a major bioactive compound found in polyphenols of

Biotechnology Division, Institute of Himalayan Bioresource Technology, Council of Scientific and Industrial Research (CSIR-IHBT), Palampur, India. *These authors contributed equally. **Presently at Indian Institute of Technology, Jodhpur, India.

302

DIABETES MELLITUS PROTEIN INTERACTOME apple (Shao et al., 2008). It has played a vital role in uncovering the mechanism of renal glucose reabsorption and the role of hyperglycemia in diabetes (Chao, 2011). It is known to be an antidiabetic agent (Crespy et al., 2001; Dimitrakoudis et al., 1992; Hongu et al., 1998; Minami et al., 1993; Rastogi et al., 1998; Rossetti et al., 1987) and is reported to inhibit the activity of sodium–glucose co-transporter (SGLT) in the kidney, leading to glycosuria (Nunoi et al., 2002). Due to several known disadvantages of phloridzin as a drug (such as poor intestinal absorption, rapid in vivo degradation by b-glucosidase, inhibitory effect on SGLT1, and inhibitory effect on GLUT1 via phloretin) (Ehrenkranz et al., 2005), efforts are being made to search for its derivatives that would minimize these shortcomings. Several phloridzin analogs, including dapagliflozin (Wilding et al., 2009) and canagliflozin (Chao, 2011), appear to have potential pharmacological viability (Abdul-Ghani and DeFronzo, 2008; Bakris et al., 2009). A number of studies (Hongu et al., 1998; Hongu et al., 1998) have fueled the search for phloridzin derivatives. These studies highlight the pharmacological role of phloridzin and its mechanism of action in the management of type 2 diabetes. Apart from understanding the effect of phloridzin on SGLT, its inhibitory effects on the activity of other proteins have also been studied. In addition to regulating many other processes, it is found to activate a cascade of enzymes, including tyrosinase, by inhibiting the activity of protein kinase C (PKC); as a result it is involved in inhibiting the tumor cells growth (Shoji et al., 1997). Based on the reports of involvement of phloridzin in T2DM mechanisms, we start with a premise that phloridzin may potentially be interacting with proteins involved in processes central to T2DM. Therefore, understanding the molecular mechanisms by which phloridzin may be interacting with central targets could throw light on its role in controlling the disease and potentially enable search for better drugs. In this study, we aim to construct a representative protein interactome (network) underlying T2DM to identify proteins central to its topology and to probe molecular interactions of phloridzin with these targets. Starting from a set of candidate genes, compiled from T2D-Db (Agrawal et al., 2008) and literature databases, a protein interactome characterizing T2DM was constructed. Using graph theoretical analysis, key proteins that were topologically central to the network and associated to the disease were identified. These target proteins were indeed observed to be critical for the structural integrity of the interactome and to be biologically essential. Further, we performed molecular docking studies of phloridzin with these targets to get an insight into potential molecular mechanisms of phloridzin binding. We found that phloridzin was involved in making critical interactions with mitogen-activated protein kinase 1 (MAPK1). Starting from a systems-level exploration of protein interaction network, this article presents molecular docking studies of phloridzin with the key nodes of the network. The methodology presented herein can complement experimental studies and improve the established drug discovery strategies for T2DM. Materials and Methods The research methodology used in this study involved four main steps: (1) compiling candidate genes associated with type 2 diabetes mellitus (T2DM) and construction of the Type

303 2 Diabetes Protein Interaction Network (T2DPIN); (2) establishing ontological relevance of the protein interactome to the disorder; (3) identification of key targets of T2DPIN using graph theoretical analysis; and (4) molecular docking studies of phloridzin with key targets. Compilation of genes associated with type 2 diabetes mellitus The known disease-associated genes for T2DM were retrieved from three different sources: (1) Type 2 Diabetes database (T2D-Db) (Agrawal et al., 2008), (2) Online Mendelian Inheritance in Man (OMIM) (McKusick-Nathans Institute of Genetic Medicine, 2011) and (3) PubMed (http:// www.ncbi.nlm.nih.gov/pubmed). T2D-Db, a comprehensive resource of molecular factors reported to be involved in T2DM, was considered to be the main source of candidate genes. T2D-Db is a repository of diabetes-associated molecular factors in human, mouse, and rat. Candidate genes information pertaining to human was extracted. The T2D-Db was last updated in January 2009, and therefore does not include data for about 4 years. Hence, in order to retrieve recent research articles reporting T2DM candidate genes, OMIM and PubMed databases were text mined. OMIM database, a compendium of human genes and genetic phenotypes, was mined by keyword ‘type 2 diabetes mellitus’. After careful and extensive reading of each OMIM entry, candidate genes were curated. In PubMed, the same keyword (type 2 diabetes mellitus) along with the term ‘‘gene’’ was used (‘‘type 2 diabetes mellitus’’ AND ‘‘gene’’) to retrieve research articles published from January 2009 to May 2012. A total of 537, 582, 590, and 528 articles were accessed for the years 2009, 2010, 2011, and 2012, respectively. Articles reporting T2DM gene association were further shortlisted by scanning their abstracts. Finally, a manually curated candidate gene list was compiled by integrating genes from these three resources. HGNC (HUGO Gene Nomenclature Committee) (Seal et al., 2011) nomenclatures were obtained for the curated candidate genes by resolving aliases. Construction of protein interaction networks Proteins are the functional macromolecules that regulate biological processes. We integrated empirical data of candidate genes and those of protein interactions, to construct an interactome relevant to the disease phenotype. We used the Human Protein Reference Database (HPRD) (Keshava Prasad et al., 2009) as the source of protein interactions data. It includes annotations of proteins based on experimental evidence such as yeast two-hybrid analysis, and in vitro (e.g., GST pull-down assays) as well as in vivo (e.g., co-immunoprecipitation) methods. HPRD is one of the best resources of human protein–protein interactions (PPIs), containing the largest number of binary nonredundant human PPIs, largest number of genes annotated with at least one interactor, and largest citations of PPIs curated (Mathivanan et al., 2006). Although HPRD contains high quality data, it was manually curated for formatting errors and missing gene symbols for some of the interacting protein partners. We built the human protein interactome using the curated HPRD data (latest release April 9, 2010). Further, we constructed PPI networks representing processes involved in T2DM. We created two types of PPI

304 networks, a core network and an extended network. T2DM candidate genes, compiled from various sources, were mapped onto the HPRD interactome to obtain PPIs forming the core network. Since proteins that directly interact with each other are known to play roles in similar biochemical process or disease (Hartwell et al., 1999), we created the ‘Type 2 Diabetes Protein Interaction Network (T2DPIN)’ by extending the core network of candidate T2DM proteins to include proteins that directly interact with them. The T2DPIN represents the complex phenotype underlying the disease condition. This methodology of extended neighborhood of interactions has been shown to be useful by earlier studies on complex traits and diseases (Hwang et al., 2008; Lim et al., 2011; Randhawa and Bagler, 2012). Gene Ontology enrichment analysis To validate the reliability of the disease protein interactome constructed, and to understand the key biological processes and functions in which they were involved, Gene Ontology (GO) enrichment analyses were performed. The core network (representing the T2DM candidate genes) and the extended network (T2DPIN; the T2DM protein interactome) were subjected to GO enrichment analysis. The ontological relevance of the candidate genes, and that of the T2DPIN, was ascertained by using overrepresented GO terms (biological process and molecular functions) obtained from the enrichment analysis. The analyses were performed using Gene Ontology enRIchment anaLysis and visuaLizAtion (GOrilla) web server (Eden et al., 2009) using ‘two unranked lists of genes’ mode. In this mode, the program identifies enriched GO terms in a target set (proteins from the networks) versus a source set (HPRD proteins) using a hypergeometric model. We identified significantly enriched GO terms with p value cutoff of 0.001, and those having at most 100 proteins associated with a specific GO term in the source set (B £ 100). The latter criterion weeds out the terms that are too generic. Analysis of protein interaction networks We intended to identify proteins that were central to the interactome using graph theoretical analysis. These proteins, due to their critical role in structural stability and information dynamics over the interactome, could be of potential therapeutic value. Network-based identification of key nodes has been reported for various diseases and traits such as bovine marbling (Lim et al., 2011), H. pylori infection response (Kim and Kim, 2009), and asthma (Hwang et al., 2008; Randhawa and Bagler, 2012). In this study, we used three network topology metrics for identification of key nodes of T2DPIN: Degree (k) (Diestel, 2005), Betweenness Centrality (BC) (Brandes, 2001; Freeman, 1977), and Closeness Centrality (CC) (Newman, 2005). These measures of centrality have been used previously to identify key nodes in various disease conditions (Flo´rez et al., 2010; Hwang et al., 2008; Kim and Kim, 2009; Lim et al., 2011; Randhawa and Bagler, 2012). These topological metrics were calculated over the largest connected component of the network using Network Analyzer v.2.8.0 plugin (Assenov et al., 2008) of Cytoscape 2.8.0 software (Shannon et al., 2003). The giant component, constituting 99.28% of the T2DPIN, comprised most of the proteins and interactions amongst them. The isolated nodes, which were not included in the analysis, generate an infinite

RANDHAWA ET AL. number of shortest paths (geodesics) leading to meaningless values of betweenness and closeness centralities. Degree (k) specifies the number of links (interactions) of a node (protein) and quantifies the importance of a node in structural integrity of the network, whereas BC enumerates the role of a node in information dynamics over the network by encoding the shortest paths (geodesic) statistics. The CC enumerates the proximity and centrality of a node to the rest of the network. More technical details of these network metrics (formulae and biological importance) are provided as Supplementary Materials (Supplementary Material is available online at www.liebertonline.com/omi). We intended to identify and investigate nodes of local as well as of global importance in T2DPIN that are characterized with high centrality values of degree, betweenness, and closeness. A total of 31 central nodes, characterized with Top10 values when ranked according to k, BC, and CC, respectively, were identified as the best nodes among the hubs, bottlenecks, and swift communicators. From these central nodes, 14 nodes, which were identified by ‘either’ of the three metrics, were shortlisted using a method implemented earlier (see Supplementary Table S4) (Randhawa and Bagler, 2012). This list was further refined to obtain 7 proteins (CREBBP, EP300, ESR1, MAPK1, SMAD2, SRC, and TP53) that were identified by ‘each’ of the three metrics (kXBCXCC). Finally, three key proteins (EP300, MAPK1, and SMAD2) were selected for further analysis after omitting the ones that were also central to HPRD interaction network and hence did not signify relevance to T2DM (see Supplementary Table S4). Inferring topological relevance of chosen network metrics using node deletion studies The relevance of selected network metrics in the topology of T2DPIN was assessed by performing node deletion studies. The robustness (attack tolerance) of the network topology was computed by observing the changes in network topology when a small fraction of nodes were removed. We removed the most connected nodes (hubs) of T2DPIN in decreasing order of their connectivity. Top-ranked nodes were progressively removed and the effect of node deletion was enumerated in terms of ‘number of clusters’ that the network gets fragmented into, and the ‘size of giant cluster’ as indicators of topological contribution of each of the parameter with increase in number of deleted nodes. Herein, the nodes characterized by Top5 best values of degree were removed first, followed by the next best five values and this procedure was continued until around 4% of top-valued nodes were deleted. This procedure was also carried out, independently, for nodes with high BC and high CC parameter values. Inferring biological relevance of chosen network metrics using mouse phenotype data The biological significance of the network metrics was inferred by comparing their distribution in essential and nonessential proteins of T2DPIN. We considered the classes of embryonic, perinatal, neonatal, or postnatal lethality in mouse models (Goh et al., 2007) as lethal phenotypes, and the rest of the phenotypes as nonlethal. The human orthologs of murine genes were considered as essential, when the murine gene was annotated with one of the following phenotypes (Smith et al., 2005): neonatal lethality (MP:0002058), embryonic

DIABETES MELLITUS PROTEIN INTERACTOME lethality (MP:0002080), perinatal lethality (MP:0002081), postnatal lethality (MP:0002082), lethality-postnatal (MP: 0005373), lethality-embryonic/perinatal (MP:0005374), embryonic lethality before implantation (MP:0006204), embryonic lethality before somite formation (MP:0006205), or embryonic lethality before turning of embryo (MP:0006206). The human–mouse orthology and mouse phenotype data were obtained from Mouse Genome Informatics (Blake et al., 2011) (accessed in August 2012). Out of a total 2616 genes of T2DPIN, 1206 (46.10%) were essential and the rest 1410 (53.90%) were nonessential. For each network parameter chosen, from the top-ranked genes identified using each of the three network metrics, we computed the percentage of topranked genes that were essential. We also computed the percentage of essential genes in the corresponding bottomranked genes. We varied the definition of top-ranked hubs, bottlenecks, and swift communicators between Top10 and Top260 proteins, the latter constituting around 10% of T2DPIN, and computed the percentage of essential genes in them. Molecular docking studies To obtain insights into structural interactions of phloridzin with the topologically most central proteins of relevance to T2DM, computational molecular docking studies were performed. The docking protocol involved four main steps: (1) protein preparation, (2) ligand (phloridzin) preparation, (3) docking of phloridzin, and (4) evaluation of docking results. Protein preparation for docking Crystal structures of the human MAPK1 (ERK2) in complex with the FR180204 inhibitor (PDB ID: 1TVO) (Ohori et al., 2005), Histone acetyltransferase (HAT) p300 domain in complex with a bi-substrate inhibitor Lys-CoA (PDB ID: 3BIY) (Liu et al., 2008) and SMAD2-MH2 domain (PDB ID: 1DEV) (Wu et al., 2000) were retrieved from the Protein Data Bank (PDB) (Bernstein et al., 1977). These crystal structures correspond to the key targets of T2DPIN that were identified from interactome analysis. The coordinates from these crystal structures were used for molecular docking studies. The crystal structures of MAPK1, HAT p300 domain, and SMAD2-MH2 domain were determined by X-ray crystallo˚ , regraphy method at a resolution of 2.50, 1.70, and 2.20 A spectively. During computational prediction of ligand-binding poses, the accuracy of a molecular docking program is usually measured by the Root Mean Square Deviation (RMSD) between the experimentally observed heavy atom positions of the ligand and those predicted by the docking program (Cole et al., 2005). The crystal structures of the human MAPK1 in complex with the FR180204 inhibitor (PDB ID: 1TVO) was considered for validating accuracy of the docking protocol. The bound FR180204 inhibitor coordinates were extracted and re-docked into the ATP binding site using AutoDock Vina 1.1.2 package (Trott and Olson, 2010). The lowest-energy structure generated during the docking runs was considered as the best conformation. For molecular docking studies, the protein structures were prepared by removing any bound inhibitor(s), ligand(s), and water molecules, as they were not found to be playing any important role in protein–ligand interactions. The inhibitors

305 FR180204 and Lys-CoA were removed from MAPK1 and HAT p300 proteins, respectively. Using AutoDock Tools 1.5.4 (Morris et al., 2008) software, polar hydrogen atoms and Kollman charges were added to the protein structures, followed by merging of nonpolar hydrogen atoms. The protein receptors were converted from PDB to PDBQT format, which is required as an input to AutoDock Vina. Ligand (phloridzin) preparation The phloridzin molecule (CID: 6072) was retrieved from NCBI-PubChem compound database (http://pubchem .ncbi.nlm.nih.gov) in 3D-SDF format and was checked for consistency in all bond/atom types. The SDF-formatted file, containing 3D coordinates, was converted to PDB format using OpenBabel 2.2.3 software (O’Boyle et al., 2011), and subsequently into PDBQT format using AutoDock Tools. In this process, gasteiger charges were assigned and nonpolar hydrogens were merged. The rigid root of ligand was defined automatically and torsion flexibility was allowed only for the phloridzin side groups. Molecular docking AutoDock Vina is known for its accuracy in binding mode predictions and speed, which is approximately two orders of magnitude better than its predecessor, AutoDock4 (Morris et al., 2009). For MAPK1 and HAT p300 domains, the binding sites were defined around the region of bound inhibitors. The binding site of MAPK1 comprises of an ATP binding pocket consisting of two lobes, the activation loop and glycine-rich loop, including all amino acid residues in the pocket. Similarly, for the HAT p300 domain, the region around three hydrophobic residues, Tyr1397, Trp1436, and Tyr1446, together with Cys1438 forming a hydrophobic tunnel (Ohori et al., 2005), was considered as binding pocket. A grid box with dimensions of 25 · 25 · 25 points along the x, y, and z axes was defined for these two protein structures to cover the entire protein binding site, and to accommodate free motion of phloridzin. Phloridzin was docked into the binding sites of free receptors using the docking protocol validated for MAPK1 structure. During the docking process, a maximum of 25 independent binding conformations, ranked according to their binding affinities, were obtained. For each docking simulation, we used a semi-flexible docking protocol. Assuming that there is no induced conformational change in protein receptors upon ligand binding, the protein targets were kept rigid. Torsion flexibility was allowed only for phloridzin side groups in order to explore an arbitrary number of torsional degrees of freedom in addition to six spatial degrees of freedom. All other options were kept default during docking experiments. The identification of ligand binding sites is an important part of the drug discovery process. We used the strategy of ‘blind docking’ (Hete´nyi and van der Spoel, 2002) to identify the binding sites of SMAD2-MH2 domain. This docking procedure has been successfully implemented to identify potential binding sites on protein surfaces (Cosconati et al., 2011; Hete´nyi and van der Spoel, 2006). Phloridzin was docked with the complete domain structure to find out its interactions with the domain. The docking procedure was implemented in two steps. In the first step, the docking was performed with the whole structure, without defining the

306 binding site (blind docking). The grid field was adjusted at 100 · 100 · 100 points along the x, y, and z axes at the center, covering the entire protein. In the second step, also called as ‘refined docking’, we docked phloridzin in each of the four binding sites estimated earlier. The grid was adjusted around these four pockets, having the best scored conformations. Using the lowest energy conformation as a reference, poses ˚ of the reference pose were clustered. In with RMSD within 2 A most of the docking runs, the lowest docking-energy conformations were included in the largest cluster. The process was repeated until all the 25 conformations belonged to a single cluster. Finally, the pocket adjusting all 25 modes and with the least energy conformations was accepted for docking of phloridzin. This procedure was employed to avoid picking the lower energy-ranked poses, due to ‘chance hits’. The results of blind docking study were further confirmed using LIGSITEcsc (Huang and Schroeder, 2006) server, which is reported to be a reliable tool for identification of putative binding sites (Mallipeddi et al., 2012). At the end of docking run, AutoDock Vina ranks all the docked conformations generated, based on the binding energy. Molecular interactions between proteins and phloridzin were observed using LigPlot software (Wallace et al., 1995). Molecular rendering was performed using PyMOL (The PyMOL Molecular Graphics System, Version 1.5.0.1, Schrodinger, LLC.). All computations were performed on an HPZ600 workstation and HP ProLiant DL980 server running Ubuntu 11.10 and Red Hat 4.1.2 operating systems, respectively, with Intel Xeon processors. Results In this section, we first present the results of protein interaction network analysis and identification of key proteins. Then, we present results of molecular docking of phloridzin with the key targets. Protein interactome of type 2 diabetes mellitus We compiled an extensive list of type 2 diabetes mellitus (T2DM) candidate genes using Type 2 Diabetes database (T2D-Db) (Agrawal et al., 2008), Online Mendelian Inheritance in Man (OMIM) (McKusick-Nathans Institute of Genetic Medicine, 2011), and PubMed (http://www.ncbi.nlm .nih.gov/pubmed). A total of 302 genes were obtained from T2D-Db; whereas OMIM and PubMed searchs, which complement T2D-Db data, yielded 34 and 70 genes, respectively. Using these three resources, we created a composite list of 330 unique core candidate genes (Supplementary Table S1). The Venn diagram illustrates our strategy to obtain an extensive list of candidate genes (Supplementary Fig. S1). Out of these genes, 293 were mapped on the HPRD protein interactome to construct the core network. The core network was comprised of a giant cluster (187 proteins and 596 interactions), 8 smaller clusters (20 proteins and 12 interactions) (Supplementary Fig. S2) and the rest fragmented into 86 isolated nodes. The core proteins are critical in deciding the nature of the extended network and in obtaining a functionally relevant superset (Hormozdiari et al., 2007). The core network comprised of empirically validated proteins, and interactions among them, for their direct role in T2DM. We further intended to obtain a superset of proteins functionally relevant to T2DM phenotype. It is expected that an extended network

RANDHAWA ET AL. constructed from such a superset is useful in obtaining a functionally relevant interactome, and would also include related molecular mechanisms. The utility of study of such extended disease interactomes, implementing functional relatedness (Hartwell et al., 1999), has been demonstrated in earlier studies (Hwang et al., 2008; Lim et al., 2011; Randhawa and Bagler, 2012). The Type 2 Diabetes Protein Interaction Network (T2DPIN), the interactome underlying the complex disease phenotype, was constructed by extending the core network to include directly interacting proteins. After excluding multiple edges and self-loops, T2DPIN comprised 15548 interactions among 2635 proteins inclusive of a giant cluster (2616 proteins and 15536 interactions) and 8 smaller clusters. The giant cluster, constituting 99.28% of the T2DPIN, was used for network analysis. Figure 1 illustrates the T2DPIN. We further assessed the inclusive nature of disease-associated protein interactome thus constructed. The T2DPIN was observed to comprise two major signaling pathways empirically known to be associated with T2DM: insulin signaling pathway ( Jolivalt et al., 2008) and adipokine signaling pathway (Shoelson et al., 2006). Defects in the insulin signaling pathway have been implicated in the pathogenesis of insulin resistance (Vollenweider et al., 2002) and studies have shown that insulin resistance is a major risk factor for the development of T2DM (Lillioja et al., 1993). Adipokines, the endproducts of the adipokine signaling pathway, have been linked with obesity-induced inflammation and signaling pathways that contribute to T2DM (Shoelson et al., 2006). The insulin and adipokine signaling pathways have 138 and 69 genes, respectively, as obtained from T2DM associated pathway entry (map04930) in Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database (Kanehisa et al., 2006). We computed the percentage of genes from these signaling pathways that are part of the interactome. It was observed that whereas the core network comprises of around 35% (73 of 207) of the genes from these pathways, T2DPIN, the superset inclusive of functionally related genes, comprises of about 75% (155 of 207) of those. This further strengthens the logic of obtaining an extended network starting from a sparse set of core candidate genes. Network topology of T2DPIN Protein interaction networks are often characterized by a small number of highly connected hubs and bottlenecks (Baraba´si and Oltvai, 2004). We observed that the degree distributions of both the core network as well as T2DPIN followed a scale-free distribution, indicating presence of exceptionally rich nodes. For core network, the exponent of the power-law fit (c) for degree distribution was - 2.6. The T2DPIN was characterized by hub nodes with exceptional number of interactions (c = - 2.7) and major bottlenecks (c = - 2.2) mediating a large number of interactions across the network (Supplementary Fig. S3). The exponent is reported to be in the range of 2 £ c £ 3 for real-world networks, such as the internet, human collaboration networks, and metabolic networks (Re´ka and Albert-La´szlo, 2002). Thus, both the networks have scale-free topology as reported for other biological networks. The exponent values suggest that the connectivity distribution of the core network is more skewed than that of the T2DPIN. Identification of targets, key

DIABETES MELLITUS PROTEIN INTERACTOME

307

FIG. 1. Type 2 diabetes mellitus protein interaction network (T2DPIN), a representative molecular interactome associated with T2DM. The circles represent proteins, and the lines indicate an interaction between proteins. The highlighted proteins represent seven topological central proteins: ESR1, CREBBP, SMAD2, MAPK1, EP300, and SRC. Among them the nongeneric, key proteins are depicted in red. The graph was rendered using Cytoscape software.

molecular agents, those are central to the regulation of disease mechanisms is an important problem (Baraba´si et al., 2011). For identification of key targets for T2DM, we identified the central nodes in the giant component of T2DPIN using three network metrics: Degree (k), Betweenness Centrality (BC), and Closeness Centrality (CC). Gene Ontology enrichment analysis of T2DPIN To describe the functional annotation of gene products, the Gene Ontology (GO) provides three biological domains: biological processes (BP), the molecular events to which a gene product contributes; molecular functions (MF), the biochemical activities of the gene product at the molecular level; and cellular components (CC), the subcellular locations where a gene product is active (Dimmer et al., 2008). For T2DPIN to be of relevance to the disease, we anticipated it to carry ontological signatures characteristic to T2DM. We first retrieved a total of 333, 59, and 11 significantly enriched GO terms associated with core candidate genes for BP, MF, and CC, re-

spectively. Since it is not within the scope of gene ontology project (http://www.geneontology.org/) to include diseasespecific GO terms, therefore, after extensive literature survey, it was adjudged that GO terms associated with insulin, glucose, carbohydrate, fatty acid, and lipid metabolism were most relevant. Using this criterion, 36, 12, and 7 significantly enriched GO terms were obtained for BP, MF, and CC, respectively (Supplementary Table S2). The extended network is a superset of the T2DM core candidate genes and includes genes of potential secondary relevance. GO enrichment analysis of T2DPIN yielded 546, 105, and 21 significantly enriched GO terms for BP, MF, and CC, respectively. Using the criterion described above, 27, 21, and 7 significantly enriched GO terms, relevant to T2DM phenotype, were obtained for BP, MF, and CC, respectively (Supplementary Table S3). The GO enrichment analyses of the core candidate genes and that of T2DPIN revealed that biological processes and molecular functions were highly populated with significantly enriched GO terms. The cellular component terms were less informative, being either too

308 general or annotated for very few proteins. The results obtained from over-representation analysis revealed the ontological signatures of T2DM were embedded in the disease protein interactome. Figure 2 depicts the T2DM associated significantly enriched GO terms (BP and MF) obtained from the core candidate genes and T2DPIN. Node deletion studies The role of network metrics in specifying structural integrity of the network could be assessed by computing network

RANDHAWA ET AL. vulnerability under node deletions. Scale-free networks are prone to fragmentation under targeted attacks on hubs (Albert et al., 2000). It has been suggested that deletion of nodes ranked according to k and BC metrics have deleterious effects on the network topology (Holme et al., 2002; Hwang et al., 2008). This concept has been extended to consider other centrality measures as well (Chen et al., 2009; Holme et al., 2002; Santiago and Benito, 2009; Zhao et al., 2004). We performed targeted node deletion studies in T2DPIN to assess the role of chosen parameters (k, BC, and CC). In a stepwise manner, up to around 4% of top-ranked nodes with best parameter values

FIG. 2. Illustration of T2DM associated significantly enriched molecular function (A), and biological process (B) GO terms common in core and extended network. The number associated with each term indicates the number of genes associated with the specific GO term.

DIABETES MELLITUS PROTEIN INTERACTOME were deleted. The vulnerability of the network topology was measured by computing total number of clusters that the network is divided into, and size of the giant cluster. When around 4% of the best degree-ranked nodes were eliminated from the giant cluster, the number of clusters increased rapidly, leading to a fragmented network comprising 157 clusters. We observed a simultaneous decrease in the size of giant cluster from 2616 to 2345 nodes. Similar node deletion strategy was applied for high BC and high CC nodes. With the removal of around 4% of high BC and high CC nodes, the largest cluster broke into 187 and 123 clusters, respectively. The size of giant cluster was also reduced to 2327 and 2385 nodes for high BC and CC nodes, respectively. While hubs are the most promiscuous interactors, nodes with high BC (bottlenecks) are typically the bridges between clusters. Hence, as anticipated, BC-based node removal had the greatest impact on giant cluster fragmentation, followed by k and CC (Fig. 3A). Interestingly, CC- and BC-based node removal had the greatest impact on the size of giant cluster, followed by k (Fig. 3B). The connectedness of the network and accessibility of nodes are deemed to be representative features characterizing the disease-associated molecular interactome. In a scale-free network, it is observed that the central nodes are critical for the integrity of the network (Albert et al., 2000) and thus are of potential functional relevance. In T2DPIN, the node deletion studies indicated that the central nodes identified using the chosen parameters are indeed responsible for the integrity and dynamics across the disease interactome. Based on the structural arguments, these studies support the choice of network metrics and their functional utility in identification of central nodes of T2DPIN. Biological essentiality of key nodes We further checked the biological relevance of chosen parameters. Hubs and bottlenecks in protein interactomes have been reported to have a much higher tendency to be essential (Hahn and Kern, 2005; Jeong et al., 2001; Joy et al., 2005; Yu et al., 2004; Yu et al., 2007). We divided T2DPIN into two sets, essential and nonessential, using Mammalian Phenotype Ontologies (Smith et al., 2005) to classify genes as essential when they caused embryonic, perinatal, postnatal, or neonatal lethality in mouse models (Goh et al., 2007) using phenotypic data from Mouse Genome Informatics (MGI) (Blake

309 et al., 2011). The percentage of essential genes was computed for the best-ranked nodes and the remaining nodes, for varying definitions. We varied the definition of hubs, bottlenecks, and swift communicators between Top10 and Top260 (around 10% of T2DPIN) proteins and computed percentage of essential genes in them, to those in the rest of the network. Figure 4 shows the statistics for percentage of best-ranked nodes annotated as essential. We observed that the network parameters chosen for target identification were highly correlated with essentiality. For these parameters, we found that the hubs, bottlenecks, and swift communicators were characterized by significantly high percentage (k: between 66 and 80%, BC: between 88 and 100%, and CC: between 80 and 100%) of essential proteins (Fig. 4A). Regardless of the metric used for ranking, the remainder of the nodes in the network had low percentage (k: between 43 and 46%, BC: between 41 and 46%, and CC: between 42 and 46%) of essential proteins (Fig. 4B) as expected from the random samplings statistics. Thus, the statistics of essential proteins supports our selection of network parameters for identification of key drug targets. Central proteins of T2DPIN as putative targets The central proteins of a protein interactome, characterized by rich connections, are strategically placed in the core of the network and mediate large number of interactions across the network. While richly connected hubs are central due to their promiscuous interactions, the nodes with large BC function as bottlenecks in the network (Yu et al., 2007) even when a node’s degree is low. Nodes with high CC complement the hubs and bottlenecks as they tend to be centrally placed in the network (Freeman, 1977; Sabidussi, 1966). The central proteins of the T2DPIN, characterized by hubs, bottlenecks, and swift communicators, were identified by combining the nodes with Top10 parameter values of k, BC and CC. There were a total of 14 proteins that were found to be important using ‘either’ of the three centrality measures (Supplementary Table S4). Among these, a refined list of seven central proteins of T2DPIN, that were characterized by all three features of centrality, were identified (Fig. 5 and Table 1): CREBBP (cAMP-response-element-binding binding protein), EP300 (E1A binding protein p300), ESR1 (estrogen receptor 1), MAPK1 (mitogen-activated protein kinase 1), SMAD2 (SMAD family member 2), SRC (v-src sarcoma viral oncogene homolog), and TP53 (tumor protein p53). We

FIG. 3. Node deletion studies. Change in the number of clusters (A) and size of the giant cluster (B) under deletion of high degree (squares), high BC (circles), and high CC nodes (diamonds).

310

RANDHAWA ET AL. observed that these seven central proteins were indeed critical to the structural integrity of T2DPIN, as deletion of these proteins led to the fragmentation of the giant cluster into 28 clusters. While these proteins occupy central positions in the T2DPIN, the following four were observed to be central by virtue of their topological features in the HPRD protein interactome: CREBBP, ESR1, SRC, and TP53. Hence, we rejected these four generic proteins to obtain the following key proteins with potential therapeutic value for T2DM: MAPK1, EP300, and SMAD2. Among these, MAPK1 and SMAD2 were part of the core network, whereas EP300 was obtained from the extended neighborhood. Further, we confirmed that these proteins are relevant to T2DM and report to play, direct or indirect, role in the progression of the disease. The ERK2(MAPK1)/STAT3 pathway has been proposed as a novel diabetes target. MAPK1 is known to suppress STAT3 activation enzymatically, and is proposed as a valuable candidate for diabetes therapy (Kinoshita et al., 2011). MAPK1 phosphorylates and negatively modulates STAT3, which is known to be essential for normal glucose homeostasis (Chung et al., 1997; Inoue et al., 2004). EP300, a transcriptional coactivator, has been identified to cause diabetes via regulating fibronectin expression via poly(ADP-ribose) polymerase and NF-kappaB (NF-kB) activation (Kaur et al., 2006). Phosphorylation of SMAD2 mediates interaction with SMAD4 and is required for transforming growth factor-b (TGF-b) signaling (Souchelnytskyi et al., 1997). The TGF-b in turn plays a central role in the pathogenesis of glomerulosclerosis and tubulointerstitial fibrosis in the nephropathy of type 2 diabetes (Hong et al., 2001). SMAD2 activity is a crucial step in TGF-b1 signal transduction and has been observed to be having high expression in diabetic condition (Zhou et al., 2012). Molecular docking studies with phloridzin

FIG. 4. The statistics of percentage of essential genes in the hubs (blue), bottlenecks (red), and swift communicators (green) (A) to the corresponding rest of the nodes (B) in the T2DPIN.

The computationally predicted binding energy (Kcal/mol), calculated on the basis of scoring function, indicates the strength of ligand binding to the protein receptor. The more negative is the energy, the stronger is the binding. The best modes were selected based on the binding energy of each

FIG. 5. ‘Hubs,’ ‘bottlenecks,’ and ‘swift communicators’ identified in Type 2 Diabetes Protein Interaction Network (T2DPIN) topology using degree (k), betweenness centrality (BC), and closeness centrality (CC) metrics.

DIABETES MELLITUS PROTEIN INTERACTOME Table 1. An Alphabetical List of Seven Proteins Central to Type 2 Diabetes Protein Interaction Network (T2DPIN) Topology Characterized by All Three Features of Centrality Central Proteins

Degree (k)

Betweenness Centrality (BC)

Closeness Centrality (CC)

CREBBP EP300 ESR1 MAPK1 SMAD2 SRC TP53

198 167 188 160 166 161 269

0.049 0.030 0.059 0.038 0.048 0.029 0.095

0.410 0.409 0.434 0.420 0.412 0.41 0.436

BC, betweenness centrality; CC, closeness centrality; k, degree of centrality values. The proteins highlighted in bold font represent the nongeneric putative drug targets for T2DM.

conformation and pattern of protein–ligand interactions. The quality of the docking program was assessed on the basis of ˚ ) returned by the poses, when RMSD value (usually