Protein Disorder Targeting Driver Genes in Cancer A Thesis Presented to the School of Interdisciplinary Informatics and the Faculty of the Graduate College University of Nebraska In Partial Fulfillment of the Requirements for the Degree Master of Science in Biomedical Informatics University of Nebraska at Omaha by Ryan A. Hagenson July 2017
Supervisory committee: Dr. Dario Ghersi Dr. Kate Cooper Dr. Parvathi Chundi
Protein Disorder Targeting Driver Genes in Cancer Ryan A. Hagenson, MS University of Nebraska, 2017 Advisor: Dr. Dario Ghersi
Cancer is driven by DNA mutations that propagate to the protein level, perturbing biochemistry by modifying interactions within the cell. Of great importance to the proper function of the cell are the protein-protein interactions which define how the body responds to stimuli, both positive and negative. Such interactions often involve two structurally distinct types of protein regions: ordered binding sites and disordered binding targets. Historically, only the ordered half of this complementary pairing has been extensively investigated with respect to how observed DNA mutations in these regions possibly drive cancer. This work represents an initial in silico investigation leveraging data from The Cancer Genome Atlas (TCGA) which shifts the focus to disordered regions. Two measures of protein disorder are used, one scoring individual positions and the other scoring local regions, across 62 mutation profiles: two profiles for each of the 31 cancer types under investigation. Data from each cancer is analyzed via two mutation profiles considering: 1. all observed mutations, and 2. missense mutations only. To ensure novelty, results with prior strong implication in cancer are removed from the final sets, focusing results on potential disorder-targeted genes not yet known. By combining a search for positive selection for a biological property with high-dimensional analysis and conservative statistical cutoffs, novel genes not previously implicated in cancer can be given likely context and internally cross-validated, providing evidence for their potential role in driving cancer. As a result of positional analysis, 77 disorder-targeted genes were characterized. Meanwhile, by regional analysis, 480 disorder-targeted genes were found.
Acknowledgements

I would like to thank all who helped me get to where I am today. To those I inevitably forget to mention by name I extend a special thank you and an apology for my lapse of thought at the moment of writing. I wish to thank my folks, the parents who raised me with a love for learning and whose couch I became acquainted with when balancing work, school, and my future became just a little too much. Thank you for listening to my constant yammering about the latest factoids and now the latest science. To Dr. Garry Duncan, thank you for introducing me to Bioinformatics, a way to combine my dual interest in Biology and Computer Science. To Dr. Bill McClung, thank you for teaching me so much and indulging me in discussions about all areas of my work. I wish I had started learning from you sooner. To Dr. Jessica Petersen, thank you for guiding me on my first Bioinformatics investigation. To Bell Labs and all its employees, you are a constant inspiration and embodiment of what I find most fascinating about computer science and bioinformatics: no challenge can compete with dedicated individuals. Lastly, to Dr. Dario Ghersi, even though your name is elsewhere on this thesis I believe a special thank you is in order. I would not be graduating confidently without the wealth of knowledge I gained from you during each weekly meeting.
Contents

Acknowledgements

1 Introduction
  1.1 Summary
  1.2 Cancer
    1.2.1 Causes of Cancer
      Mutations from External Mutagens
      Mutations from Internal Mutagens
    1.2.2 Cancer Driver Genes
      Tumor Suppressor Genes
      Oncogenes
      Discovering Drivers
    1.2.3 The Cancer Genome Atlas
      Methods
      Cancers in the Atlas
  1.3 Computational Problem
    1.3.1 Past Driver Gene Discovery Methods
  1.4 Hypothesis

2 Proteins
  2.1 Introduction
  2.2 Protein Structure
    2.2.1 Amino Acid Structure
    2.2.2 Primary Structure (1°)
    2.2.3 Secondary Structure (2°)
    2.2.4 Tertiary Structure (3°)
    2.2.5 Quaternary Structure (4°)
  2.3 Protein Folding
  2.4 Protein Mutation
  2.5 Protein Disorder
    2.5.1 Pairwise Amino Acid Interactions
    2.5.2 Hydrophobicity and Net Charge

3 Methodology
  3.1 Introduction
  3.2 Signal Versus Noise
  3.3 Data Preparation
    3.3.1 Data Acquisition
    3.3.2 Dataset Size
  3.4 Disorder Scoring
  3.5 Monte Carlo Simulations
    3.5.1 Steps as a List
  3.6 Binomial Testing
    3.6.1 Steps as a List
  3.7 Enrichment Analysis and Validation

4 Positional Analysis Results
  4.1 Introduction
  4.2 COSMIC Hypergeometric Testing
  4.3 Mutational Prevalence
  4.4 Visualizations of Select Genes
    4.4.1 COADREAD – TBP
      PDB: 1NVP
    4.4.2 BRCA – TBP
      PDB: 1NVP
    4.4.3 STES – CASC3
      PDB: 2J0S
  4.5 Enrichment Analysis
  4.6 Partner Set Enrichment Analysis

5 Regional Analysis Results
  5.1 Introduction
  5.2 COSMIC Hypergeometric Testing
  5.3 Mutational Prevalence
    5.3.1 Both Profiles
    5.3.2 Mutation Prevalence Distributions
  5.4 Visualizations of Select Genes
    5.4.1 TBP.001 in BRCA
      Smoothed Disorder Plot with Mutations
    5.4.2 PLEC.005 in ACC
      Smoothed Disorder Plot with Mutations
    5.4.3 NEFH.001 in ACC
      Smoothed Disorder Plot with Mutations
  5.5 Enrichment Analysis
  5.6 Partner Set Enrichment Analysis

6 Discussion
  6.1 Introduction
  6.2 Intersection of Both Methods of Analysis
    6.2.1 EP400
    6.2.2 TBP
    6.2.3 SRRM2
    6.2.4 NCOA3
    6.2.5 GPRIN2
    6.2.6 ZNF707
  6.3 Enrichment Analyses
    6.3.1 Positional
    6.3.2 Regional
    6.3.3 Regional and Positional Cross-comparison
      Significant Novel Finds Sets
      Binding Partner Sets
  6.4 Disorder Binding Incitation of Cancer
  6.5 COSMIC – Limited Complement
  6.6 On Limit to In Silico Analysis
  6.7 On the High Number of Regional Results
  6.8 Limitations
    6.8.1 Impact of mutations
    6.8.2 Monte Carlo simulations side effect
    6.8.3 Intersection of significance sets
  6.9 Conclusions

Bibliography

A TCGA Cancers
B Positional Supplemental Information
C Regional Supplemental Information
List of Figures

3.1 Positional Steps Flowchart
3.2 Regional Steps Flowchart
4.1 Positional Novel Finds
4.2 Positional Heatmap
4.3 Positional All Mutations Heatmap
4.4 Positional Missense Mutations Heatmap
4.5 Chimera COADREAD – TBP against 1NVP
4.6 Chimera BRCA – TBP against 1NVP
4.7 Chimera STES – CASC3 against 2J0S
5.1 Regional Novel Finds (1 of 6)
5.2 Regional Novel Finds (2 of 6)
5.3 Regional Novel Finds (3 of 6)
5.4 Regional Novel Finds (4 of 6)
5.5 Regional Novel Finds (5 of 6)
5.6 Regional Novel Finds (6 of 6)
5.7 Regional Heatmap (1 of 7)
5.8 Regional Heatmap (2 of 7)
5.9 Regional Heatmap (3 of 7)
5.10 Regional Heatmap (4 of 7)
5.11 Regional Heatmap (5 of 7)
5.12 Regional Heatmap (6 of 7)
5.13 Regional Heatmap (7 of 7)
5.14 Regional Heatmap Distribution By Cancer
5.15 Regional Heatmap Distribution By Gene
5.16 Smoothed TBP.001 Disorder
5.17 Smoothed PLEC.005 Disorder
5.18 Smoothed NEFH.001 Disorder
List of Tables

2.1 The Twenty Common Amino Acids
3.1 Sample Processed Input
3.2 TCGA Cancers in this Study
4.1 Positional COSMIC Difference
4.2 COADREAD TBP Mutations
4.3 BRCA TBP Mutations
4.4 STES CASC3 Mutations
4.5 Positional Analysis Uncorrected Results
4.6 Positional Results Interaction Partner Set Enrichment Analysis
5.1 Regional COSMIC Difference
5.2 BRCA TBP Mutations
5.3 PLEC.005 ACC Mutations
5.4 NEFH ACC Mutations
5.5 Regional Analysis Uncorrected Results
5.6 Regional Results Interaction Partner Set Enrichment Analysis
6.1 Intersect of Positional and Regional Novel Finds
A.1 TCGA Selected Cancers
B.1 Top 50 Interaction Partner Set Enrichment Terms – Positional
C.1 Top 50 Interaction Partner Set Enrichment Terms – Regional
C.2 By Cancer Mean Distribution – Both Profiles
C.3 By Gene Mean Distribution – Both Profiles
Chapter 1
Introduction

1.1 Summary
This work represents a focal shift in the process of in silico discovery of cancer driver genes. Historically, there has been a general trend toward observing how DNA mutations propagate into disrupting ordered protein regions – causing either an acceleration or deceleration of biochemical function, which can be linked to driving cancer. This focus is heavily influenced by the classic structure-function paradigm of proteins which states that a well-defined structure is necessary for well-defined function, or more simply that "structure dictates function" – a paradigm supported by the seminal models of Fischer (1894) and Pauling, Corey, and Branson (1951), found to be true in early experimentally-determined structures (Blake et al., 1965; Kendrew, 1961), and lastly by the denaturation experiments of Anfinsen (1973). This paradigm, although known now to be only mostly true, gave rise to the long since disproven one-gene, one-protein belief that followed the characterization of DNA as the genetic material of the cell. Now we know that a single gene can result in numerous different versions of a protein product; these differing
versions are known as protein isoforms and are primarily the result of alternative splicing (Modrek & Lee, 2002; Kornblihtt et al., 2013; Black, 2003; Ast, 2004). When considering mutations across isoforms, it is likely that a single mutation will affect isoforms differently. This is partially due to many isoforms being of different lengths; one isoform may be shorter than its sister isoforms because a (mutated) region has been removed. Isoforms may also be slight rearrangements of one another, thus splicing a mutation into a new structural context. For the purposes herein, this differing nature of isoform mutations is not explored. Rather, the most mutated, shortest isoform of each gene is taken to represent the worst-case scenario: most mutated to increase the number of individual observations per gene, and shortest to increase the degree to which these mutations may perturb the underlying chemistry. These criteria are sufficient for an initial pass; however, the results would be expected to change with other criteria. Selecting one isoform per gene is necessary since it allows for multiple hypothesis correction over discrete, statistically independent tests while keeping computational and logical complexity lower for this initial focal shift within in silico driver gene discovery.

By searching for a specific biological property within a large search space using data from the disease state of interest, one can find novel genes that are implicated in that disease state via the specific biological property. Using conservative statistical cutoffs, the confidence in these results is increased, provided the biological property has a potential role in driving the disease state. Here, potential cancer driver genes are discovered by searching for protein disorder-targeting mutations across the largest public cancer mutation data search space. Protein disorder, a
ubiquitous property within protein-protein interaction networks, has a strong potential role in driving cancer specifically by disrupting essential protein-protein interactions.
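The isoform selection described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in this work: the text does not say which criterion takes precedence, so the sketch assumes mutation count is ranked first and length breaks ties, and the `Isoform` record, its field names, and the gene identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Isoform:
    """Hypothetical record for one protein isoform of a gene."""
    gene: str
    isoform_id: str
    length: int          # protein length in residues
    mutation_count: int  # observed mutations mapped to this isoform


def _rank(iso):
    # Assumed ordering: most mutated first, then shortest, then ID for determinism.
    return (-iso.mutation_count, iso.length, iso.isoform_id)


def select_representatives(isoforms):
    """Return a dict mapping each gene to its single representative isoform."""
    best = {}
    for iso in isoforms:
        current = best.get(iso.gene)
        if current is None or _rank(iso) < _rank(current):
            best[iso.gene] = iso
    return best


# Hypothetical input: three isoforms of one gene, one isoform of another.
isoforms = [
    Isoform("GENE_A", "GENE_A.001", 339, 12),
    Isoform("GENE_A", "GENE_A.002", 319, 12),
    Isoform("GENE_A", "GENE_A.003", 200, 4),
    Isoform("GENE_B", "GENE_B.001", 500, 7),
]
representatives = select_representatives(isoforms)
```

Keeping exactly one isoform per gene is what makes each gene a single test when multiple hypothesis correction is applied later; a different precedence between the two criteria would select different isoforms in the tie-free cases.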
1.2 Cancer
Cancer is a disease marked by the breakdown of cellular machinery due to somatic DNA mutations (Hanahan & Weinberg, 2011). A naïve view of potential cancer driver genes would suggest they are as varied as the mutations that can occur, across the variety of people they can occur within, while still supporting life: an impossibly large number of possibilities. Thankfully, this impossibly large number is not the true number of cancer drivers. Instead, there exists a much smaller number of drivers which can be roughly characterized as frequently-mutated or infrequently-mutated driver genes (Vogelstein et al., 2013). The largest attempt to identify and record observed mutations in patients with cancer is The Cancer Genome Atlas (TCGA; https://cancergenome.nih.gov/). The TCGA datasets represent our most comprehensive observations to date despite being hardly a fraction of the total possibilities (Tomczak, Czerwińska, & Wiznerowicz, 2015). Our observations are not expected to ever fully capture the true variety of cancer mutations, due to the rarity of mutations and the size of the human genome. The discrepancy between observations and potential drivers is where Bioinformatics and in silico analysis can aid us. By comparing the small number of observations we do have against the large number of observations that are theoretically possible, we can identify significant differences. This process is not without its own shortcomings, as discussed in Section 1.3 below.

As a disease, cancer is especially complex due to the great variety of mutations that can occur, where they occur, which tissue(s) they affect, the individual’s personal genetics, diet, and activity level, and still more factors affecting disease progression (Campisi, 2013; Lawrence et al., 2014; Vogelstein et al., 2013). We know far more about cancer today than we ever have before, due in no small part to our increasing ability to capture data on these factors and thereby understand how they collectively contribute to cancer (Y. Chen et al., 2014; Cheng et al., 2014; Surget, Khoury, & Bourdon, 2013). Although many people know of certain risk factors for cancer, such as overexposure to UV radiation via sunlight, there are more subtle risk factors as well.
1.2.1 Causes of Cancer
Cancer being driven by mutations means that almost any aspect of life that causes mutations, prevents proper repair of genetic mistakes, or otherwise increases the mutation rate in a cell line can be linked to increased risk for developing cancer. This, quite thankfully, does not mean one will develop cancer, only that the risk is greater, as carcinogenesis is a process, not an event (Kasper et al., 2015). Understanding cancer risk is made even more difficult by seemingly contradictory scientific findings. Nowhere is this more obvious than in lay news interpretations of the latest cancer research findings, stating that some commonplace routine such as drinking coffee increases cancer risk one day (http://www.cnn.com/2016/06/15/health/coffee-tea-hot-drinks-cancer-risk/) but previously decreased cancer risk (http://www.cbsnews.com/news/new-findings-on-coffee-and-cancer-risk/). These news stories are often not overtly wrong, but are reductionist to the point of blurring the truth from the latest original scientific report (Loomis et al., 2016). The underlying truth is that the complex nature of oncogenesis cannot so easily be linked to such commonplace activities because it is a disease of mutations and thus is different on a person-by-person, case-by-case basis, with more critical risk factors to consider than whether a person drinks coffee or not. However, there are routines with undeniably strong evidence for causing cancer: tobacco use was epidemiologically linked to increased risk for developing cancer and this causal link is no longer debated (Boffetta, Hecht, Gray, Gupta, & Straif, 2008; Denissenko & Pao, 1996; Vineis et al., 2004). Ultimately, with cancer being driven by mutations, it is important to understand the two major classes of mutagens: external to the body and internal to the body.

Mutations from External Mutagens

External mutagens are those which originate outside the body and affect the mutation rate inside the body. Even people without a biomedical research background understand these mutagens quite clearly and (most) take active steps to avoid them. Often these are observed as chemicals that a person is exposed to, or activities they willingly do, that are linked with an increased risk for cancer. Examples include using tobacco (DeMarini, 2004; Hecht, 1999) and increased exposure to UV radiation via sunlight (D’Orazio, Jarrett, Amaro-Ortiz, & Scott, 2013; de Gruijl, 1999). Such activities cause chemical changes within the cell which lead to DNA mutations in the cell. Other commonly known external mutagens are radiation and chemical spills; fear of these, this researcher believes, has led to increased environmental regulations to protect citizens from risk factors that are less easily self-prevented. The less easily identified risks are no less important than the risks that are easily avoided.
Mutations from Internal Mutagens

Internal mutagens are those which originate inside the body and affect the mutation rate inside the body. These are far less understood by people without a biomedical research background and are often understated even by people with such a background. One internal mutagen is the failure of DNA repair mechanisms to correct a mistake made during DNA replication. As DNA is replicated prior to cellular division, a complete copy must be made that will split off into the daughter cell following division. This replication process involves unzipping the DNA double helix via DNA helicase and building two new complementary strands via DNA polymerase. The average mistake rate for DNA polymerase is one in one hundred thousand (1/100,000) nucleotides. When we consider that there are roughly six billion (6,000,000,000) base pairs in a human diploid cell, and therefore twelve billion nucleotides to copy, this equals an average of one hundred twenty thousand (120,000) errors at each division (Pray, 2008). The cell is able to repair most but not all of these mistakes and, if it cannot repair them, should mark the cell for termination as a major deviation from the healthy cell line. Any mistakes that are not corrected, or deviated cell lines that are not terminated, are, by definition, mutations, and these mutations have the risk of being oncogenic.
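The error estimate above can be checked with a short calculation. The sketch below assumes the figure of six billion refers to base pairs, so that twelve billion nucleotides are synthesized when both strands are copied; this reading is what reconciles the stated rate with the stated 120,000 errors, and the variable names are illustrative only.

```python
from fractions import Fraction

# Polymerase mistakes per nucleotide copied (one in one hundred thousand).
ERROR_RATE = Fraction(1, 100_000)

# Assumption: six billion base pairs in a diploid human cell, so copying
# both strands means synthesizing twice that many nucleotides.
BASE_PAIRS_DIPLOID = 6_000_000_000
NUCLEOTIDES_COPIED = 2 * BASE_PAIRS_DIPLOID

# Expected number of uncorrected-at-first polymerase errors per division.
expected_errors = ERROR_RATE * NUCLEOTIDES_COPIED
print(int(expected_errors))  # 120000
```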
Another internal mutagen is the progressive shortening of telomeres with each cell division, which is partially why cancer is more common later in life. Telomeres are repeated, non-coding segments of DNA at the ends of chromosomes which protect the internal coding portions from mutation and degradation by being mutated and degraded themselves. As telomeres shorten with age, the coding portions are exposed to mutation and degradation (Blasco, 2005).
1.2.2 Cancer Driver Genes
There are two major classes of cancer driver genes: 1. tumor suppressor genes, and 2. oncogenes (R. A. Weinberg, 1994; Lehman et al., 1991; E. Y. H. P. Lee & Muller, 2010).
Tumor Suppressor Genes

Tumor suppressor genes are the "brakes" on tumorigenesis, intended to stop the rapid cellular proliferation and growth characteristic of a tumor. These genes are a single point of failure which follow the Knudson two-hit hypothesis (Knudson, 1971; Nordling, 1953; Hutchinson, 2001), and thus mutations within them tend to present fairly uniform results. This uniformity is a result of mutations having the same loss-of-function effect: preventing the gene from effectively stopping the growth of tumors, which presents the same way regardless of the causal mutation. A notable example of a tumor suppressor gene is p53 (or TP53), which is ubiquitous and provides a check for deviated cell lines at the G1/S regulation point of the cell division cycle, before the cell commits to replicating its DNA and dividing. Many mutations can result in p53 malfunction, and that is why this driver gene is implicated in > 50% of cancers (Surget et al., 2013): it is a single point of failure where malfunction means cell lines are not subjected to the proper health check prior to dividing.
Oncogenes

Oncogenes are the "gas" on oncogenesis, with a variety of intended functions which are accelerated via mutation, causing a variety of the notable cancer hallmarks. The diversity of these genes means there is greater diversity in their biochemical presentation. Newly discovered driver genes tend to fall into this class because their variety means different approaches analyze new and different contexts for how a gene might be driving oncogenesis. Oncogenes are set in motion by specific "driver" mutations, while most mutations within them are random "passenger" mutations which are not, themselves, oncogenic (R. A. Weinberg, 1984; Chial, 2008; Todd & Wong, 1999; Stehelin, 1995). A notable example of an oncogene is telomerase, which is oncogenic by causing cancer cells to lengthen their telomeres, aiding in cancer cell immortality.
Discovering Drivers

Discovering cancer driver genes requires many levels of analysis. Newer studies tend to look for positive selection for a biological property with potential in driving cancer. This is an effective combination of biological hypothesis and high-dimensional data analysis. In order to not bias results in these types of analyses, capturing the mutational landscape of cancer must be done in as systematic a way as possible. Currently, the most systematic and comprehensive approach to discovering the mutations noted in cancer is The Cancer Genome Atlas (TCGA).
1.2.3 The Cancer Genome Atlas
The Cancer Genome Atlas (TCGA) is the leading effort to catalog genetic mutations in cancer via high-throughput genomics – bettering our understanding of the genetic basis of cancer with a primary goal of improving diagnosis, treatment, and prevention of cancer. Over its lifespan from 2005 to 2017 (time of this study), it collected 2.5 petabytes of data, from more than 11,000 patients, describing the mutational observations of 33 cancer types. The TCGA data used in this study is from July 18th, 2016.
Methods

The TCGA Research Network consists of many parts; each part is integral to achieving TCGA’s central goal, beginning with the Biospecimen Core Resource (BCR), which reviews and processes the initial blood and tissue samples, and ending with the Analysis Working Groups (AWGs), which are made up of scientific and clinical experts analyzing a single type of cancer across all TCGA methods and who publish a comprehensive analysis of findings.

Cancers in the Atlas

Under TCGA investigation there are 33 tumor types (see Table A.1), of which 31 cancer types are included in this work (see Table 3.2). The two cancers present in TCGA not analyzed here are Mesothelioma (MESO) and Acute Myeloid Leukemia (LAML), which were excluded due to using an older version of the human reference genome at the time of SNP characterization.
1.3 Computational Problem
The major computational problem within cancer genomics is distinguishing signal from noise – driver mutation from passenger mutation – which allows us to further understand the disease process.
1.3.1 Past Driver Gene Discovery Methods
Detailing all past methods would be impossible; therefore, a select few methods will be discussed. Past computational methods have focused within or integrated analysis in the areas of: 1. somatic copy-number alterations (SCNAs), as is the case with GISTIC (Mermel et al., 2011); 2. protein-coding region length, variations in mutation types, and multiple mutations in one gene, as is the case with DrGaP (Hua et al., 2013); and 3. signals of positive selection, as is the case with MuSiC (Dees et al., 2012), OncodriveFM (Gonzalez-Perez & Lopez-Bigas, 2012), OncodriveCLUST (Tamborero, Gonzalez-Perez, & Lopez-Bigas, 2013), and E-Driver (Porta-Pardo & Godzik, 2014). The methods leveraging positive selection all share the use of a base-level mutation profile/rate in order to differentiate between random (passenger) mutations and driver mutations. Notably, none of these methods focus on investigating regions of disorder.
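The shared idea of testing observed mutations against a background rate can be illustrated with a one-sided binomial test. The sketch below is not the algorithm of MuSiC, OncodriveFM, OncodriveCLUST, or E-Driver; it assumes a single uniform per-position background rate, and the gene length, rate, mutation count, and significance cutoff are hypothetical values chosen for illustration.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical gene: 500 positions, each mutated by chance with probability
# 0.004 under the assumed passenger-only (background) model.
gene_length = 500
background_rate = 0.004
observed_mutations = 9

# Probability of seeing at least this many mutations if all were passengers.
p_value = binomial_upper_tail(observed_mutations, gene_length, background_rate)

# Hypothetical conservative cutoff; a small p-value flags the gene as a
# candidate for positive selection rather than proving it is a driver.
is_candidate = p_value < 0.001
```

With an expected count of two background mutations, nine observed mutations fall far into the upper tail, so the hypothetical gene is flagged. In practice the background rate varies by gene, sequence context, and cancer type, which is where the published methods differ from this simplification.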
1.4 Hypothesis
I propose that by studying the effects of cancer mutations within inherently disordered regions, we can further understand how cancer manipulates cellular chemistry, disrupting healthy processes. Because this is a major shift in focus from historical cancer driver gene discovery approaches, novel drivers are expected to be found.
Chapter 2
Proteins

2.1 Introduction
For this work, and others like it, analysis at the protein level is necessary; here a brief overview of proteins is presented to provide context to analysis. Without an understanding of proteins, the positive selection for a protein biological property and how such selection might driver cancer cannot be understood. Proteins are biopolymers made up of a string of amino acids and are the actors of biochemical activity. They are important for driving cellular chemistry by catalyzing reactions, acting as signals for processes, providing structural support to cells, helping other proteins fold, and much more. As the final step in The Central Dogma of Molecular Biology, or that: a gene coded in DNA is transcribed into RNA, which is then translated into protein, these functional biomolecules are responsible for nearly all biochemical activity within the cell. Due to this, proteins serve as the chemical carriers for DNA mutations – often being the biological component enacting damage due to the mutation. A single gene in DNA can result in multiple related protein products – these related products are called protein isoforms
produced by alternative splicing (Modrek & Lee, 2002; Kornblihtt et al., 2013; Black, 2003; Ast, 2004). Therefore, a mutation at the DNA level is likely to affect more than one protein isoform.
2.2 Protein Structure
Protein structure is broken up into four levels: primary structure (1°), secondary structure (2°), tertiary structure (3°), and quaternary structure (4°). Each structural level builds on the levels before it. These levels are discussed in detail below.
2.2.1 Amino Acid Structure
Before discussing the levels of protein structure, it is important to understand the basic structure of amino acids, the repeating subunits of the protein biopolymer. All amino acids are composed of four components bonded to a central carbon atom. These four components are: 1. a single hydrogen atom (−H), 2. an amine functional group (−NH2), 3. a carboxyl functional group (−COOH), and 4. most importantly, a side chain specific to each amino acid (−R). The side chain identifies the amino acid and determines its chemistry (i.e., whether it is polar or non-polar, aromatic or aliphatic, charged or non-charged). See Table 2.1 for how the chemistry differs between amino acids.
2.2.2 Primary Structure (1°)
Proteins are made up of a string of individual amino acids. Within human biology, there are 20 common amino acids (listed in Table 2.1) which make up all proteins. The linear, string sequence of amino acids is the primary (1°) protein structure. (This is the only one-dimensional protein structure and thus is the one most often used in bioinformatics.)
2.2.3 Secondary Structure (2°)
As the protein begins to fold, it interacts with other residues and the environment to take on localized, 3D conformations that reduce localized energy levels. These local conformations are considered the secondary (2°) protein structure and include: alpha helices, beta sheets, and turns/loops. Of these secondary elements, only turns/loops are fairly disordered.
2.2.4 Tertiary Structure (3°)
As the protein forms its secondary structure and continues to fold, it will continually assume the lowest overall energy state possible until the entire protein has been folded (this is without considering the role of chaperone proteins, which help proteins fold in ways that would otherwise be chemically unstable in the process). This final folded structure of one original primary sequence chain is considered the tertiary (3°) structure. It is important to draw attention to a tertiary structure being one continuous amino acid chain that has taken on a 3D folded structure. The structure of some proteins ends at this level, since it is often stable and functional.
TABLE 2.1: A brief summary of the twenty common amino acids. Full name, shortened name, single letter code, and a broad chemical classification are included for each. Reorganization of table at: http://wbiomed.curtin.edu.au/biochem/tutorials/AAs/AA.html

Full name        Shortened name   Single Letter Code
aliphatic (non-polar)
Glycine          Gly              G
Alanine          Ala              A
Valine           Val              V
Leucine          Leu              L
Isoleucine       Ile              I
Proline          Pro              P
aromatic (non-polar)
Phenylalanine    Phe              F
Tyrosine         Tyr              Y
Tryptophan       Trp              W
polar, non-charged
Serine           Ser              S
Threonine        Thr              T
Cysteine         Cys              C
Methionine       Met              M
Asparagine       Asn              N
Glutamine        Gln              Q
positively charged
Lysine           Lys              K
Arginine         Arg              R
Histidine        His              H
negatively charged
Aspartate        Asp              D
Glutamate        Glu              E
2.2.5 Quaternary Structure (4°)
Not all proteins have a quaternary structure. The quaternary (4°) structure is formed from multiple independent amino acid chains interacting with one another to form a complex. Every quaternary structure is made up of multiple protein chains, each capable of independent folding into a tertiary structure, and interacting with one another to form a final, functional protein complex.
2.3 Protein Folding
According to the framework model of protein folding, proteins begin to fold as they are being synthesized, first forming localized secondary elements at one end prior to the synthesis of the other terminal end. There are two primary chemical driving forces behind protein folding, in order of strength: 1. the burial of hydrophobic side chains away from the aqueous environment, termed the entropic penalty, and 2. the reduction in total solvent-accessible surface area (Ken A. Dill, Ozkan, Shell, & Weikl, 2008). Due to these chemical drivers, most proteins end up with a hydrophobic core and a hydrophilic surface. However, burying hydrophobic amino acids is sometimes not possible, especially in the early stages of folding. If these hydrophobic amino acid side chains were left exposed, the result would be protein aggregation via the same entropic penalty driving their burial: hydrophobic amino acids on the surface of the synthesizing protein would be driven toward hydrophobic surface amino acids on other proteins rather than driven inward (Kessel & Ben-Tal, 2011). Such aggregation would present a major and highly prevalent problem if the folding process were entirely stochastic; however, there
exist chaperone proteins which support and protect a protein as it folds (Garrett & Grisham, 2013). Chaperone proteins lower the overall energy barrier allowing folding into lower energy states that would first require adopting a higher, unfavorable energy state (Q. Liu & Craig, 2016; Hendrick & Hartl, 1993) – as would be the case in temporarily exposing hydrophobic amino acids to bury them further than before.
2.4 Protein Mutation
Proteins are very rarely mutated directly, and when they are, they rarely remain in the cell for long because protein turnover replaces the mutated protein with a healthy one. Rather, most protein mutations can be linked back to an original DNA mutation which propagated to the protein level. Structurally, a mutation can occur within ordered regions such as binding or catalytic sites, or within disordered regions such as protein-protein interaction junctions (there are also transition regions between these two). Since every amino acid has unique chemistry, protein mutations rarely preserve the same level of functionality; they accelerate or stunt protein activity depending on the chemistry of the healthy and mutated amino acids. There are many classifications of protein mutations, each with their own semantic weight; however, herein only two mutually exclusive classifications are used: 1. synonymous mutation, no amino acid change despite a DNA mutation, and 2. missense mutation, an amino
acid change due to a DNA mutation. It has been shown that synonymous mutations can result in effects at the protein level (Goymer, 2007; Hunt, Simhadri, Iandoli, Sauna, & Kimchi-Sarfaty, 2014; Sauna & Kimchi-Sarfaty, 2011) and even frequently drive cancer (Supek, Miñana, Valcárcel, Gabaldón, & Lehner, 2014). However, a stronger case can be made for how a missense mutation may be driving cancer due to perturbed chemistry. Therefore, two mutation profiles are explored in this study: all mutations (synonymous and missense) and missense-only (no synonymous mutations). The natural third profile, synonymous-only, would be nearly uninterpretable in itself.
2.5 Protein Disorder
Protein order/disorder is the measure of how well-defined the 3D conformational location of a given residue within the final folded protein is. An ordered region is one that adopts a well-defined 3D conformation, while a disordered region may adopt no apparent structure or many similar structures depending on cellular conditions. Protein regions are made up of discrete residues, each with their own order/disorder. Each residue can have as many potential inter-residue interactions as there are other residues in the protein. The combination of amino acid interactions is what leads to the native, or biologically functional, 3D structure of a protein, balancing attractive and repulsive forces to form the final conformation. One way of measuring the disorder of a protein is to consider each potential pairwise interaction across the length of the protein. In a protein only 100 amino acids in length, this would be $\binom{100}{2}$ or 4,950 possible pairwise interactions, a number that grows quickly, with a length of 200 giving $\binom{200}{2}$ or 19,900 pairwise interactions.
Realistically, most residues do not interact with most other residues, so not all combinations must be considered. In fact, the naïve method of considering all possible combinations leads to inaccurate measures of order/disorder by neglecting proximity entirely. De novo measures of disorder therefore commonly use sliding windows, which consider interactions only within a certain sequence proximity range. Because protein folding is driven by the burial of hydrophobic side chains and the reduction of surface area (see Section 2.3 for more detail), the final folded tertiary structure can be estimated in silico from known properties of each individual amino acid in the primary sequence. These estimations approximate protein disorder by assigning a value to how predictable each residue's position is in the final structure. The two chemical measures used herein to estimate protein disorder from the primary sequence are: 1. pairwise amino acid interactions, and 2. hydrophobicity and net charge. Both are based on measuring the favorability of amino acid interactions to predict how the primary sequence will form secondary structures and the final tertiary structure.
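The growth in the number of residue pairs, and the saving offered by a sliding window, can be illustrated with a short Python sketch. The function names and the window definition used here (a pair is counted only when the two residues are at most `window` positions apart in sequence) are assumptions made for illustration; they are not taken from IUPred or FoldIndex©.

```python
from math import comb


def all_pairs(n_residues):
    """Number of unordered residue pairs in a chain of n_residues."""
    return comb(n_residues, 2)


def windowed_pairs(n_residues, window):
    """Number of residue pairs whose sequence separation is at most `window`.

    Assumption for illustration: a pair (i, j) is counted only when
    0 < j - i <= window.
    """
    max_separation = min(window, n_residues - 1)
    # There are (n_residues - d) pairs at each sequence separation d.
    return sum(n_residues - d for d in range(1, max_separation + 1))


if __name__ == "__main__":
    for n in (100, 200):
        print(n, all_pairs(n), windowed_pairs(n, 25))
```

With a fixed window the pair count grows roughly linearly with protein length rather than quadratically, which is the practical reason a window is attractive.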
2.5.1 Pairwise Amino Acid Interactions
The chemical natures of different amino acids generate either attractive (favorable) or repulsive (unfavorable) pairwise interactions. Two polar amino acids of opposite charge or two non-polar amino acids will have favorable interactions, while two polar amino acids with the same charge or a polar and non-polar amino acid pair will have unfavorable interactions. The IUPred method (Dosztányi, Csizmók,
Tompa, & Simon, 2005) used herein to measure positional disorder scores is based on the ENERGI method of determining pairwise amino acid interaction energy-like quantities created by Thomas and Dill (1996). Using pairwise interaction energies in this way allows each position within a protein sequence to be given a score that corresponds to how well the final 3D conformational location of that position can be predicted. The IUPred method uses a scale from 0 to 1, reported to four decimal places, where 0 is complete order and 1 is complete disorder (Dosztányi, Csizmók, Tompa, & Simon, 2005). This method of positional score determination was chosen for its ability to distinguish partially disordered proteins from fully disordered proteins. It is currently one of the best methods for measuring positional disorder, outperforming DISOPRED2 (Ward, McGuffin, Bryson, Buxton, & Jones, 2004) and VL3-H (Obradovic et al., 2003), both of which use a trained artificial intelligence model for disorder determination.
2.5.2 Hydrophobicity and Net Charge
With the strongest driving force behind protein folding being the entropic penalty, which forces the burial of hydrophobic amino acids, measures of hydrophobicity and net charge (an effective estimator of hydrophilicity) correlate strongly with the ordered/disordered nature of the final folded structure. A region of highly hydrophobic amino acids indicates the region will likely be membrane-bound and thus more likely to be ordered, while a mixed region (alternating hydrophobic residues and hydrophilic residues) is unlikely to be bound and thus more likely to be disordered. FoldIndex©, a method by Prilusky et al. (2005), uses
an algorithm by Uversky, Gillespie, and Fink (2000) to define a boundary line between regions of folded order and unfolded disorder. Values from this method are bounded between -1 and 1, with positive values indicating likely folded (ordered) regions and negative values indicating likely unfolded (disordered) regions.
Chapter 3
Methodology
3.1 Introduction
Discovery of driver genes by focusing specifically on regions with a particular biological property is a fairly standard approach. In fact, computational approaches to driver gene discovery all but require a measurable property and a biological basis for how that property can drive cancer. Past methods have considered: 1. somatic copy-number alterations (SCNAs), as is the case with GISTIC (Mermel et al., 2011), 2. protein-coding region length, variations in mutation types, and multiple mutations in one gene, as is the case with DrGaP (Hua et al., 2013), and 3. signals of positive selection, as is the case with MuSiC (Dees et al., 2012), OncodriveFM (Gonzalez-Perez & Lopez-Bigas, 2012), OncodriveCLUST (Tamborero, Gonzalez-Perez, & Lopez-Bigas, 2013), and E-Driver (Porta-Pardo & Godzik, 2014). Critically, the positive-selection methods (among which this work is counted) face the same computational challenge of differentiating signal from noise in order to draw their conclusions.
3.2 Signal Versus Noise
Differentiating signal from noise is a problem not only in bioinformatics but in any field where random observations are able to mask important observations (T. T. Liu, 2016; Edwards, Russell, & Stott, 1998). There are many complex methods, such as the Fourier transform (Fourier, 1822), that allow making relative sense of seemingly random input; however, within driver gene discovery the background-anomaly approach is typically used in conjunction with a biological property (Kamburov et al., 2015; Tamborero, Gonzalez-Perez, & Lopez-Bigas, 2013; Tamborero, Lopez-Bigas, & Gonzalez-Perez, 2013; Gonzalez-Perez & Lopez-Bigas, 2012). Establishing a background rate or level for a biological property allows one to begin differentiating signal from noise via deviations from this background. The work herein shifts focus from past driver gene discovery methods to the under-investigated property of protein disorder. Because this property characterizes proteins differently than before, it is expected to yield results not found by other methods. To do this, two approaches were taken: positional analysis via Monte Carlo simulations and regional analysis via binomial testing, both leveraging data from The Cancer Genome Atlas (TCGA).
3.3 Data Preparation
Raw TCGA data were processed following the same procedure as in Ghersi and Singh (2014). In short, the chromosomal coordinates provided by TCGA were
TABLE 3.1: The heading 10 rows of ACC_mut.txt. This format represents the effective input to analysis herein following the mapping of raw TCGA chromosomal coordinates to protein sequence positions.

Isoform      TCGA Barcode       DNA Position  DNA Start  DNA End  Protein Position  Protein Start  Protein End
A1BG.001     TCGA-OR-A5KB-01A   281           G          A        94                R              H
A1CF.001     TCGA-OR-A5KB-01A   1167          C          A        389               G              G
A1CF.002     TCGA-OR-A5KB-01A   1191          C          A        397               G              G
A1CF.003     TCGA-OR-A5KB-01A   1191          C          A        397               G              G
A1CF.004     TCGA-OR-A5KB-01A   1167          C          A        389               G              G
A1CF.005     TCGA-OR-A5KB-01A   1215          C          A        405               G              G
A1CF.006     TCGA-OR-A5KB-01A   1191          C          A        397               G              G
A4GALT.001   TCGA-OR-A5JY-01A   903           C          G        301               P              P
AACS.001     TCGA-OR-A5LD-01A   306           A          C        102               A              A
AACS.001     TCGA-PK-A5HB-01A   103           G          C        35                A              P
mapped to their protein sequence positions by using the Human Genome Reference (GRCh37.p10). The head of a sample input after this procedure can be seen in Table 3.1 – this is the file ACC_mut.txt, representing the cancer background of Adrenocortical carcinoma considering all mutations (both missense and synonymous mutations).
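As an illustration of how rows in this format relate to the two mutation profiles, the following Python sketch reads records and separates missense mutations from synonymous ones by comparing the protein start and end residues. It assumes a tab-delimited file with no header row and the column order of Table 3.1; the column keys and function names are hypothetical and chosen only for this example.

```python
import csv
import io

# Assumed column order, following Table 3.1 (keys are illustrative names).
COLUMNS = ["isoform", "barcode", "dna_pos", "dna_start", "dna_end",
           "prot_pos", "prot_start", "prot_end"]

# Three rows copied from Table 3.1, written here as tab-delimited text.
SAMPLE = """\
A1BG.001\tTCGA-OR-A5KB-01A\t281\tG\tA\t94\tR\tH
A1CF.001\tTCGA-OR-A5KB-01A\t1167\tC\tA\t389\tG\tG
AACS.001\tTCGA-PK-A5HB-01A\t103\tG\tC\t35\tA\tP
"""


def read_mutations(handle):
    """Parse tab-delimited mutation rows into dictionaries."""
    reader = csv.reader(handle, delimiter="\t")
    return [dict(zip(COLUMNS, row)) for row in reader if row]


def is_missense(record):
    """A mutation is missense when the amino acid changes."""
    return record["prot_start"] != record["prot_end"]


def split_profiles(records):
    """Return (all mutations, missense-only mutations)."""
    all_mutations = list(records)
    missense_only = [r for r in all_mutations if is_missense(r)]
    return all_mutations, missense_only


if __name__ == "__main__":
    every, missense = split_profiles(read_mutations(io.StringIO(SAMPLE)))
    print(len(every), len(missense))
```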
3.3.1 Data Acquisition
Cancer mutation data were obtained from the latest available TCGA (https://cancergenome.nih.gov/) run on July 18th, 2016 from the Broad Institute Firehose system (http://firebrowse.org), in the form of Data Level 2 (Processed Data), which is the level of consensus results from processing the raw genome sequencing reads (Data Level 1, Raw Data). These data include 33 cancer backgrounds (Table A.1), while the work here analyzes 31 cancer backgrounds (Table 3.2). The remaining two cancer backgrounds from TCGA, Mesothelioma (MESO) and Acute Myeloid Leukemia (LAML), were excluded due to using an
older version of the human reference genome at the time of data preparation. It should also be noted that in this analysis Colon adenocarcinoma [COAD] and Rectum adenocarcinoma [READ] are combined into a single Colon and Rectum adenocarcinoma [COADREAD] background. Likewise, Esophageal carcinoma [ESCA] and Stomach adenocarcinoma [STAD] are combined into a single Stomach and Esophageal carcinoma [STES] background. These combinations are due to the component cancer backgrounds being indistinguishable from one another (Muzny et al., 2012; Bass et al., 2014).
3.3.2 Dataset Size
The TCGA dataset used contained information on 95,836 isoforms from 31 cancer types, each combination of which was processed positionally and regionally for a total of three disorder score profiles: 1. IUPred 'short' (positional), 2. IUPred 'long' (positional), and 3. FoldIndex© (regional).
3.4 Disorder Scoring
Positional analysis was done on measurements by IUPred long, which uses a 100-residue interaction window, and IUPred short, which uses a 25-residue interaction window (Dosztányi, Csizmók, et al., 2005); regional analysis was done on measurements by FoldIndex© (Prilusky et al., 2005), which uses a default window size of 51 residues. Calculations for both positional measurements are based on pairwise chemical interaction energies across their respective window sizes and smoothed over a window size of 21 residues; this is in accordance with the IUPred method (Dosztányi,
TABLE 3.2: The 31 cancer types involved in this study. COAD and READ were combined because their backgrounds are indistinguishable (Muzny et al., 2012), while STES was not part of the original pilot project, but was investigated by Bass et al. (2014) and subsequently added to TCGA. STES is a combination of two cancers, stomach and esophageal carcinomas into one unified cancer background. Number of subjects is based off of unique TCGA barcodes within each cancer dataset.

Identifier  Cancer Type                                                        Number of Subjects
ACC         Adrenocortical Carcinoma                                           90
BLCA        Bladder Urothelial Carcinoma                                       130
BRCA        Breast Invasive Carcinoma                                          987
CESC        Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma   194
CHOL        Cholangiocarcinoma                                                 35
COADREAD    Colon Adenocarcinoma [COAD] & Rectum Adenocarcinoma [READ]         295
DLBC        Lymphoid Neoplasm Diffuse Large B-cell Lymphoma                    48
ESCA        Esophageal Carcinoma                                               185
GBM         Glioblastoma Multiforme                                            290
HNSC        Head and Neck Squamous Cell Carcinoma                              279
KICH        Kidney Chromophobe                                                 66
KIRC        Kidney Renal Clear Cell Carcinoma                                  411
KIRP        Kidney Renal Papillary Cell Carcinoma                              161
LGG         Brain Lower Grade Glioma                                           286
LIHC        Liver Hepatocellular carcinoma                                     198
LUAD        Lung Adenocarcinoma                                                230
LUSC        Lung Squamous Cell Carcinoma                                       177
OV          Ovarian Serous Cystadenocarcinoma                                  142
PAAD        Pancreatic Adenocarcinoma                                          150
PCPG        Pheochromocytoma and Paraganglioma                                 184
PRAD        Prostate Adenocarcinoma                                            332
SARC        Sarcoma                                                            247
SKCM        Skin Cutaneous Melanoma                                            345
STES        Stomach and Esophageal Carcinoma                                   473
TGCT        Testicular Germ Cell Tumors                                        155
THCA        Thyroid Carcinoma                                                  405
THYM        Thymoma                                                            118
UCEC        Uterine Corpus Endometrial Carcinoma                               247
UCS         Uterine Carcinosarcoma                                             57
UVM         Uveal Melanoma                                                     80
Csizmók, et al., 2005). The IUPred positional long and short measurements are processed concurrently, but separately, at each step. Calculations for regional measures are based on the Kyte/Doolittle scale (Kyte & Doolittle, 1982) of hydrophobicity and on net charge, considering the mean of both values across the window; this is in accordance with the FoldIndex© method (Prilusky et al., 2005).
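A minimal Python sketch of a windowed hydrophobicity and net-charge score is given below. It is not the FoldIndex© implementation. The linear form and its constants (2.785 and 1.151) are taken from the published descriptions of the Uversky boundary and FoldIndex© rather than from this text, and should be checked against those sources. Rescaling the Kyte/Doolittle values to the range 0 to 1, treating histidine as neutral, and the function names are further assumptions. The sketch does not rescale or clip its output.

```python
# Kyte/Doolittle hydropathy values (Kyte & Doolittle, 1982).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Assumed side-chain charges at neutral pH; histidine is treated as neutral.
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def fold_index(segment):
    """Score one segment: positive suggests folded, negative unfolded.

    Assumed form: 2.785 * <H> - |<R>| - 1.151, where <H> is the mean
    hydropathy rescaled to [0, 1] and <R> is the mean net charge.
    """
    mean_h = sum((KD[aa] + 4.5) / 9.0 for aa in segment) / len(segment)
    mean_r = abs(sum(CHARGE.get(aa, 0.0) for aa in segment)) / len(segment)
    return 2.785 * mean_h - mean_r - 1.151


def windowed_fold_index(sequence, window=51):
    """Score every full window of `window` residues along the sequence."""
    if len(sequence) < window:
        return []
    return [fold_index(sequence[i:i + window])
            for i in range(len(sequence) - window + 1)]


if __name__ == "__main__":
    print(fold_index("Q" * 51) < 0)   # a glutamine repeat scores as unfolded
    print(len(windowed_fold_index("A" * 60)))
```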
3.5 Monte Carlo Simulations
Beginning with calculating IUPred long and IUPred short disorder score profiles for each protein isoform, Monte Carlo simulations were carried out by comparing the observed disorder load (see Equation 3.1) against the average disorder load across one million random simulations of the same number of mutations and calculating an empirical p-value (see Equation 3.2) between these values. The empirical p-value is the number of simulated cases at or below the observed disorder load divided by the number of simulations performed; this calculation compares the observed value against the values simulated under the null hypothesis of random mutations.

$$M_{obs} = \sum_{i=1}^{N} m_i \times s_i \tag{3.1}$$

Where $M_{obs}$ is the observed disorder load, $m_i$ is the number of observed mutations at position $i$, $s_i$ is the calculated IUPred disorder score at position $i$, and $N$ is the total number of residues in the protein.
$$p_{positive} = \frac{\sum \left( M_{obs} \geq M_{random} \right)}{L_{random}} \tag{3.2}$$

Where $p_{positive}$ is the empirical p-value for positive selection for disorder, $M_{obs}$ is the observed disorder load, $M_{random}$ is the vector of simulated disorder loads, and $L_{random}$ is the length of the simulated disorder loads vector. $L_{random}$ is equal to one million for each isoform.
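Equations 3.1 and 3.2 can be sketched in Python as follows. This is an illustrative sketch, not the code used for the analysis: the function names, the toy disorder scores, and the small number of simulations are assumptions, and positions are drawn uniformly with replacement. The last function computes the fraction exactly as Equation 3.2 is written, that is, the share of simulated loads that do not exceed the observed load.

```python
import random


def disorder_load(mutation_counts, scores):
    """Equation 3.1: sum over positions of m_i * s_i."""
    return sum(m * s for m, s in zip(mutation_counts, scores))


def simulate_loads(scores, n_mutations, n_simulations, rng):
    """Disorder loads of random profiles with the same number of mutations.

    Assumption: mutated positions are drawn uniformly with replacement.
    """
    positions = range(len(scores))
    loads = []
    for _ in range(n_simulations):
        drawn = rng.choices(positions, k=n_mutations)
        loads.append(sum(scores[i] for i in drawn))
    return loads


def fraction_not_above_observed(observed_load, simulated_loads):
    """Equation 3.2 as written: share of simulations with load <= observed."""
    hits = sum(observed_load >= load for load in simulated_loads)
    return hits / len(simulated_loads)


if __name__ == "__main__":
    toy_scores = [0.125, 0.25, 0.875, 1.0, 0.5]   # hypothetical scores
    toy_counts = [0, 0, 1, 1, 0]                  # hypothetical mutations
    rng = random.Random(0)
    simulated = simulate_loads(toy_scores, sum(toy_counts), 2000, rng)
    observed = disorder_load(toy_counts, toy_scores)
    print(observed, fraction_not_above_observed(observed, simulated))
```

The thesis analysis uses one million simulations per isoform; the much smaller count here only keeps the example fast.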
Following empirical p-value calculation, one isoform per gene was selected according to the highest number of mutations and the shortest protein length, with any ties resolved alphanumerically. The most mutated isoform was chosen to increase the number of individual observations per gene, and the shortest isoform to increase the degree to which these mutations may perturb the underlying chemistry. These criteria are sufficient for an initial pass; however, the results would be expected to change with other criteria. This selection ensured statistical independence prior to multiple hypothesis correction, which was performed at a false discovery rate (FDR) level of 0.05 using the Benjamini-Hochberg correction procedure (Benjamini & Hochberg, 1995). The selection was performed after p-value calculation, rather than before, in order to allow testing of other potential avenues of investigation, such as single-gene isoform cross comparisons, which are not part of the work presented here.
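The isoform selection rule can be expressed as a sort key, as in the Python sketch below. The record layout, the function name, and the example lengths and counts are hypothetical. Two further assumptions are made: the gene symbol is the part of the isoform identifier before the final period (as in A1CF.001), and "alphanumerically" means plain string comparison of the isoform identifiers.

```python
def select_isoform_per_gene(isoforms):
    """Pick one isoform per gene.

    `isoforms` is an iterable of dicts with keys 'isoform',
    'n_mutations', and 'length' (an illustrative layout).
    Preference order: most mutations, then shortest length,
    then lowest isoform identifier.
    """
    best = {}
    for iso in isoforms:
        gene = iso["isoform"].rsplit(".", 1)[0]
        key = (-iso["n_mutations"], iso["length"], iso["isoform"])
        if gene not in best or key < best[gene][0]:
            best[gene] = (key, iso)
    return {gene: iso for gene, (_, iso) in best.items()}


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    candidates = [
        {"isoform": "A1CF.002", "n_mutations": 3, "length": 594},
        {"isoform": "A1CF.001", "n_mutations": 3, "length": 594},
        {"isoform": "A1CF.003", "n_mutations": 3, "length": 602},
        {"isoform": "AACS.001", "n_mutations": 2, "length": 672},
    ]
    chosen = select_isoform_per_gene(candidates)
    print(sorted(iso["isoform"] for iso in chosen.values()))
```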
3.5.1 Steps as a List
See also Figure 3.1 for these steps as a flowchart.
1. Calculate positional disorder scores via IUPred (long and short)
2. Simulate one million random mutation observations using sampling with replacement (same number of mutations as observed)
   • 'Observed' defined as individual mutated positions, not individual mutation observations so as to not inflate highly-mutated positions in analysis
3. Calculate empirical p-value between observed and average random mutation load
4. Select one isoform per gene
   Criteria:
   – Highest number of mutations
   – Shortest isoform length
   – Ties resolved alphanumerically
5. Correct at FDR of 0.05 according to Benjamini-Hochberg correction procedure
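The final step, Benjamini-Hochberg correction, can be sketched in a few lines of Python. This is a generic implementation of the step-up procedure, not the code used for the analysis, and the function name is an assumption.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Step-up rule: find the largest rank k (1-based, p-values sorted
    ascending) with p_(k) <= (k / m) * fdr, then reject ranks 1..k.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    print(benjamini_hochberg([0.039, 0.001, 0.216, 0.008]))
```

Because the rule is step-up, a p-value that misses its own threshold is still rejected when a larger p-value further down the sorted list meets its threshold.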
FIGURE 3.1: The general flowchart of the steps taken for Monte Carlo simulations during positional analysis. [Flowchart: calculate positional disorder (IUPred 'long' and IUPred 'short'); simulate random profiles (one million iterations); calculate empirical p-value between observed and expected random disorder loads; select one isoform per gene (criteria: highest number of mutations, shortest isoform length, ties resolved alphanumerically); correct FDR via Benjamini-Hochberg correction procedure (at 0.05 level).]

3.6 Binomial Testing
First, disorder region calls for each protein isoform were made using the FoldIndex© webserver (http://bioportal.weizmann.ac.il/fldbin/findex). Following this, disordered regions within mutated isoforms for each cancer background were extracted. For each of these regions, five values were calculated to find regions with heightened mutational concentration: 1. the total isoform length (length of the region as found via FoldIndex©), 2. the total number of mutations observed in the isoform, 3. the number of mutations observed in the disordered region, 4. the expected value (see Equation 3.4), and 5. the p-value via binomial test of the observed number of mutations or fewer. See Figure 3.2 for a flowchart version of how these values are used. Following binomial testing (see Equation 3.3), the regions in each cancer were filtered for only the most significant isoform of each gene. This filtering step ensures statistical independence prior to FDR correction at the 0.05 level via the Benjamini-Hochberg correction procedure (Benjamini & Hochberg, 1995).

$$\Pr(X = x) = \binom{n}{x} p^{x} (1 - p)^{n - x} \tag{3.3}$$
Where P r(X = x) is the probability of observing x successes, n is the number of trials (the length of the isoform), x is the number of successes (the number of mutations in the region), and p is the probability of success (the length of the region divided by the length of the isoform). For the work herein the binomial distribution density was used to calculate the probability of observing exactly x successes.
$$E_{val} = M \times \frac{len_{reg}}{len_{iso}} \tag{3.4}$$
Where Eval is the expected value, M is the total number of observed mutations across the isoform, lenreg is the length of the region, and leniso is the length of the isoform. This equals the number of mutations expected to randomly fall within the region.
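Equations 3.3 and 3.4 translate directly into Python. The sketch below is generic: the function names and the numbers in the usage lines are hypothetical, and which quantities are passed as `n`, `x`, and `p` follows the definitions given under Equation 3.3.

```python
from math import comb


def binomial_pmf(x, n, p):
    """Equation 3.3: probability of exactly x successes in n trials."""
    return comb(n, x) * p ** x * (1.0 - p) ** (n - x)


def expected_in_region(total_mutations, len_region, len_isoform):
    """Equation 3.4: mutations expected to fall in the region by chance."""
    return total_mutations * len_region / len_isoform


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    print(binomial_pmf(2, 4, 0.5))
    print(expected_in_region(10, 50, 200))
```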
3.6.1 Steps as a List
See also Figure 3.2 for these steps as a flowchart.
1. Calculate regional disorder scores via FoldIndex©
2. Run binomial tests to find regions with heightened mutational concentration
   • Subset by < −0.1 average score in region
   • Subset by greater than expected mutations given length of region, length of isoform, and number of observed mutations
   • Subset by at least 5 mutations in the region
3. Select one isoform per gene
   Criteria:
   – Lowest p-value
4. Correct at FDR of 0.05 according to Benjamini-Hochberg correction procedure
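The subsetting and selection steps above can be sketched as a filter followed by a per-gene minimum. The record layout, function names, gene names, and all numbers below are hypothetical. The gene symbol is assumed to be the part of the isoform identifier before the final period, and p-values are assumed to have been computed already.

```python
def keep_region(region):
    """Apply the three subsetting rules from the list above."""
    expected = (region["iso_mutations"] * region["region_length"]
                / region["iso_length"])
    return (region["mean_score"] < -0.1
            and region["region_mutations"] > expected
            and region["region_mutations"] >= 5)


def best_region_per_gene(regions):
    """Keep, for each gene, the passing region with the lowest p-value."""
    best = {}
    for region in filter(keep_region, regions):
        gene = region["isoform"].rsplit(".", 1)[0]
        if gene not in best or region["p_value"] < best[gene]["p_value"]:
            best[gene] = region
    return best


if __name__ == "__main__":
    # Hypothetical regions, for illustration only.
    example = [
        {"isoform": "GENEA.001", "mean_score": -0.3, "iso_length": 300,
         "region_length": 60, "iso_mutations": 20, "region_mutations": 15,
         "p_value": 1e-6},
        {"isoform": "GENEA.002", "mean_score": -0.3, "iso_length": 280,
         "region_length": 60, "iso_mutations": 20, "region_mutations": 15,
         "p_value": 1e-7},
    ]
    print(best_region_per_gene(example)["GENEA"]["isoform"])
```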
FIGURE 3.2: The general flowchart of the steps taken during binomial testing in regional analysis. [Flowchart: calculate regional disorder (FoldIndex©); binomial tests; narrow results (< −0.1 average score, greater than 5 mutations in region, greater than expected mutations); select one isoform per gene (criteria: take the isoform with the lowest p-value); correct FDR via Benjamini-Hochberg correction procedure (at 0.05 level).]

3.7 Enrichment Analysis and Validation
The sets of significant genes from each method of analysis were run through enrichment analysis using hypergeometric testing across Gene Ontology Biological Process (GO-BP) terms with FDR correction. Biological process enrichment might suggest possible disorder-implicated mechanisms for driving cancer in yet uncharacterized proteins. The utilities used here for enrichment analysis were written in Python and R by my advisor prior to my work here. The Python script processes the raw annotation file to extract the GO branch of interest, in this case the Biological Process branch; in addition to this, it also allows blacklisting evidence codes that would otherwise invite circular reasoning (using GO terms inferred by protein interaction would invite bias in the protein interaction partner sets created later for validation). Then, in R, making heavy use of the igraph package, several objects are generated: a GO graph, a GO dictionary, an annotation list, and a term-centric annotation list. Enrichment analysis was run both with and without FDR correction, in addition to filtering the results for only the most specific terms (all parent terms in the GO-BP tree are removed, keeping only the most specific terms from within the tree). Under-represented terms were thrown out in all cases.

To give mutation prevalence context to significant genes across cancer types, heatmaps were generated of cancer type versus significantly disorder-targeted genes, with coloring by the ratio of the number of unique patients with a mutation in the gene to the total number of unique patients in that cancer type. This allows cross-comparison of similar cancer types and similar genes as well as immediate validation that certain well-known outcomes are holding true (e.g., p53 should be significant across most cancers).

Due to the functional dependency between binding partners, a binding partners set is made and analyzed via enrichment analysis to determine if the partner set can provide additional insight into possible cancer drivers. Significant enrichment in the binding partners set would suggest mechanisms that are disrupted by disorder-targeting mutations.

Additional in silico validation was done by ensuring a limited intersection between disorder-targeted sets and COSMIC, the Catalogue Of Somatic Mutations In Cancer (Futreal et al., 2004). In addition to this limited intersection, a p-value for each significant set compared to the COSMIC census was computed via the hypergeometric distribution (see Equation 3.5) to find P[X > x]. These steps are standard procedure for finding new cancer driver genes.

$$\Pr(X = k) = \frac{\binom{K}{k} \binom{N - K}{n - k}}{\binom{N}{n}} \tag{3.5}$$
Where N is the total population size (the number of genes in the TCGA set), K is the number of successes in the population (the number of genes in the COSMIC set), n is the number of draws (the number of genes in each significant set), and k is the number of successful draws (the number of genes in the intersect of COSMIC and each significant set).
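Equation 3.5 and its upper tail can be computed exactly in Python with integer binomial coefficients. This is a sketch under the definitions above, not the code used for the analysis; the function names are assumptions and the usage numbers are a small hypothetical case.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(k, N, K, n):
    """Equation 3.5, computed exactly as a fraction."""
    return Fraction(comb(K, k) * comb(N - K, n - k), comb(N, n))


def hypergeom_at_least(k, N, K, n):
    """P[X >= k], i.e. P[X > k - 1], summing Equation 3.5 over the tail."""
    upper = min(K, n)
    return sum((hypergeom_pmf(j, N, K, n) for j in range(k, upper + 1)),
               Fraction(0))


if __name__ == "__main__":
    # Small hypothetical case: 5 genes, 2 of them "known", 2 drawn.
    print(hypergeom_pmf(1, N=5, K=2, n=2))
    print(float(hypergeom_at_least(1, N=5, K=2, n=2)))
```

Exact fractions avoid floating-point underflow when the binomial coefficients become very large; the result can be converted with `float()` at the end.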
Chapter 4
Positional Analysis Results
4.1 Introduction
Positional Monte Carlo simulation results across the 31 cancer types (listed in Table 3.2) were limited to IUPred short findings. IUPred short was better able to capture regions of positional disorder by considering a localized proximity window size of 25 residues. When considering only genes with 5 or more observed mutations, which makes the estimation of significance more conservative, 102 significant genes were found. Well-characterized driver genes were removed by taking the set difference between this significant set and the COSMIC gene set, leaving 77 gene symbols across both the missense-only and all-mutations profiles. See Table 4.1 for a listing of these findings and Figure 4.1 for a binary mapping of these findings to the cancer backgrounds they were significant within.
TABLE 4.1: The significant gene symbols according to positional Monte Carlo simulations and considering only those symbols not already in the COSMIC census gene set.

ADAT3     ANKLE1    ARL10       ASPDH       C16orf3   C19orf10  CASC3
CD8B      CEBPB     CRCT1       CSGALNACT2  CXorf38   DNAH9     EME2
EP400     FAM100B   FAM48B1     FAM72A      FAM86C1   FCHSD1    FSIP1
GLTSCR1   GPRIN2    GTF2I       HS3ST4      IARS2     IGFBP4    IGFN1
KANK3     KCNK17    KRTAP10-10  KRTAP10-2   LRIG1     MAP1S     MSANTD1
MUC16     NBPF10    NCOA3       NMU         OSGIN2    PCDHGA2   PCDHGA9
PGLS      PGM5      PKDREJ      POU3F3      PPP1R3G   PROB1     PRR18
PRRT4     RARRES2   RASIP1      RGS9BP      RREB1     SCYL2     SENP6
SMOC2     SOX17     SPRR3       SRRM2       SYN1      TBP       TCEB3C
TCHH      TCTEX1D4  TENM4       TES         TNIP2     TOR3A     TRIM61
TRIP6     TYSND1    UPF3A       ZNF148      ZNF707    ZSCAN1    ZZZ3

4.2 COSMIC Hypergeometric Testing
Using the COSMIC gene set only to determine which findings are novel provides no measure of the overall significance of the findings with respect to capturing known cancer drivers. Therefore, the hypergeometric test was performed using the following values: 1. the number of genes in the intersect between COSMIC and my significant set, 29; 2. the number of genes in COSMIC, 616; 3. the number of genes with mutations in the TCGA set, 18201; and 4. the number of genes in my significant set, 384. This resulted in a p-value of 5.0875 × 10⁻⁵ (see Equation 4.1 for the calculation and Equation 3.5 for the general equation). This p-value indicates that the significant set determined via positional analysis is enriched for known cancer drivers, suggesting a high proportion of true positives.
$$p(k = 29 - 1) = \frac{\binom{616}{28} \binom{18201 - 616}{384 - 28}}{\binom{18201}{384}} = 5.0875 \times 10^{-5} \tag{4.1}$$
Note that in Equation 4.1, k − 1 successes are considered to find the cumulative probability of k or more successes.
4.3 Mutational Prevalence
In order to measure the prevalence of mutations across the missense-only and all-mutations profiles, heatmaps were created wherein cells are colored by the ratio of the number of patients with a mutation in the given gene to the total number of patients in that cancer type. The mutation profile naming scheme in these images is such that mut profiles include all observed mutations in the cancer profile (i.e., synonymous and missense mutations), while missense profiles include only missense mutations in the cancer profile (i.e., synonymous mutations have been removed). These heatmaps show that the mutation prevalence across individuals is low for most genes. As expected, there is a high mutation prevalence in genes such as TP53, a ubiquitous tumor-suppressor gene. More important than the prevalence of mutations across genes in a single cancer is the comparison of significant genes between cancers. Note that cancer backgrounds with a high mutation prevalence for a particular gene are not necessarily targeting disordered positions within that gene; they simply have a high number of patients with mutations in the gene.
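The prevalence ratio behind each heatmap cell can be sketched in Python as below. The function name and the toy records are hypothetical. The total number of patients in the cancer type is passed in explicitly (for example, the subject count from Table 3.2), because patients without a mutation in any listed gene would not appear in the mutation records.

```python
from collections import defaultdict


def mutation_prevalence(records, n_patients):
    """Fraction of patients with at least one mutation in each gene.

    `records` is an iterable of (gene, patient_barcode) pairs for one
    cancer type; repeated mutations in one patient count once.
    """
    patients_by_gene = defaultdict(set)
    for gene, barcode in records:
        patients_by_gene[gene].add(barcode)
    return {gene: len(patients) / n_patients
            for gene, patients in patients_by_gene.items()}


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    toy = [("GENE1", "P1"), ("GENE1", "P1"), ("GENE1", "P2"), ("GENE2", "P3")]
    print(mutation_prevalence(toy, n_patients=4))
```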
4.4
Visualizations of Select Genes
For those novel finds with Protein Data Bank (PDB) entries at a resolution of < 2.5 Å, observed mutations were visualized using UCSF Chimera, production version 1.11.2 (build 41380), along with tables listing the observed mutations. Note that due to the inherent difficulty in generating a PDB structure for a disordered protein (especially at so fine a resolution), these images and results are biased toward the more ordered genes in the significant set. Images here were selected for illustrative purposes.
4.4.1
COADREAD – TBP
This combination of COADREAD (colon and rectum adenocarcinoma) cancer and TBP (TATA-box-binding protein) was significant by Monte Carlo analysis and had a mutation prevalence of 0.08474576, or ≈ 8.47% of patients, according to the heatmaps.
PDB: 1NVP
The large difference between the number of mutations listed in Table 4.2 and the number visible in Figure 4.5 arises because positions ≈ 60 to 85 in isoform one (TBP.001) and ≈ 40 to 65 in isoform two (TBP.002) form a single-amino-acid repeat of glutamine that is not present in the PDB structure. This region’s absence suggests that it is likely disordered and therefore did not crystallize well. The positions that remain, {224, 284} in TBP.001 and {204, 264} in TBP.002, are the same two residues once the 20-position offset between the isoforms is taken into account. Both of these positions are part of turns/loops. The vast majority of mutations occur in the glutamine-repeat region, which was likely too disordered to crystallize.
4.4.2
BRCA – TBP
This combination of BRCA (breast invasive carcinoma) cancer and TBP (TATA-box-binding protein) was significant by Monte Carlo analysis and had a mutation prevalence of 0.03688525, or ≈ 3.69% of patients, according to the heatmaps.

TABLE 4.2: The mutations noted in COADREAD_mut for TBP in the TCGA dataset.

Isoform   Amino Acid Position   Frequency
TBP.001   60                    2
TBP.001   63                    2
TBP.001   72                    2
TBP.001   73                    4
TBP.001   74                    1
TBP.001   75                    2
TBP.001   76                    2
TBP.001   77                    4
TBP.001   78                    1
TBP.001   79                    3
TBP.001   80                    1
TBP.001   81                    1
TBP.001   82                    1
TBP.001   83                    1
TBP.001   84                    1
TBP.001   224                   1
TBP.001   284                   1
TBP.002   40                    2
TBP.002   43                    2
TBP.002   52                    2
TBP.002   53                    4
TBP.002   54                    1
TBP.002   55                    2
TBP.002   56                    2
TBP.002   57                    4
TBP.002   58                    1
TBP.002   59                    3
TBP.002   60                    1
TBP.002   61                    1
TBP.002   62                    1
TBP.002   63                    1
TBP.002   64                    1
TBP.002   204                   1
TBP.002   264                   1

TABLE 4.3: The mutations noted in BRCA_mut for TBP in the TCGA dataset.

Isoform   Amino Acid Position   Frequency
TBP.001   60                    3
TBP.001   76                    28
TBP.001   77                    2
TBP.001   78                    1
TBP.001   80                    1
TBP.001   89                    1
TBP.001   131                   1
TBP.001   238                   1
TBP.002   40                    3
TBP.002   56                    28
TBP.002   57                    2
TBP.002   58                    1
TBP.002   60                    1
TBP.002   69                    1
TBP.002   111                   1
TBP.002   218                   1
PDB: 1NVP
In Figure 4.6, it can be seen that only one mutated position from Table 4.3 is highlighted. The remaining mutated positions were not part of the PDB structure or, much like what is noted above in Section 4.4.1, fall into a single-amino-acid repeat of glutamine that is not present in the PDB structure. This region’s absence suggests that it is likely disordered and therefore did not crystallize well. The position that remains, {238} in TBP.001 and {218} in TBP.002, is the same residue once the 20-position offset between the isoforms is taken into account. This position falls well within an alpha helix. The vast majority of mutations occur in the glutamine-repeat region, which was likely too disordered to crystallize.
4.4.3
STES – CASC3
This combination of STES (stomach and esophageal carcinoma) cancer and CASC3 (cancer susceptibility candidate gene 3 protein) was significant by Monte Carlo analysis and had a mutation prevalence of 0.02114165, or ≈ 2.11% of patients, according to the heatmaps.

TABLE 4.4: The mutations noted in STES_mut for CASC3 in the TCGA dataset.

Isoform     Amino Acid Position   Frequency
CASC3.001   105                   1
CASC3.001   198                   2
CASC3.001   232                   3
CASC3.001   250                   2
CASC3.001   337                   1
CASC3.001   338                   1
CASC3.001   438                   1
CASC3.001   523                   1
CASC3.001   524                   1
CASC3.001   535                   1
CASC3.001   540                   2
CASC3.001   550                   2
CASC3.001   560                   1
CASC3.001   603                   3
CASC3.001   619                   1
CASC3.001   627                   3
CASC3.001   645                   3
CASC3.001   658                   3
CASC3.001   690                   3

PDB: 2J0S
In Figure 4.7, it can be seen that only three mutated positions are part of the PDB structure. The three visible positions are {198, 232, 250} in CASC3.001. Among these three positions, 198 and 232 fall at the edge of α-helices, while 250 falls at the edge of a β-sheet. The remaining mutations cannot be mapped; they occur in regions that were either too disordered to crystallize or were cut out prior to crystallization as part of an attempt to crystallize this protein’s ordered site(s) in hopes of understanding the cause of its cancer susceptibility.
4.5
Enrichment Analysis
No terms were significant following FDR correction; however, the top 10 terms prior to correction are listed in Table 4.5.
4.6
Partner Set Enrichment Analysis
Using Homo sapiens data downloaded from the latest BioGRID release on June 14th, 2017, all direct interactors of the significant set were extracted into their own binding partner set (duplicate entries were removed). This resulted in 1545 gene symbols, which, when run through the same enrichment analysis process, yielded hundreds of enriched terms. After removing parent terms in the graph so that only the most specific terms remained, a total of 168 terms were enriched; the top 10 are listed in Table 4.6 (the top 50 terms can be seen in Table B.1).
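The extraction of a binding partner set can be sketched as below, assuming the interaction data has already been reduced to pairs of gene symbols. The function name `interaction_partners` and the pair representation are hypothetical; the parsing of the actual BioGRID download is not shown. The sketch keeps a significant gene in the partner set when it interacts with another significant gene, which is an assumption, since the text does not say how that case was handled.

```python
def interaction_partners(interactions, significant):
    """Return the genes that directly interact with any gene in `significant`.

    interactions: iterable of (gene_a, gene_b) symbol pairs
    significant:  set of significant gene symbols
    """
    partners = set()  # a set removes duplicate entries automatically
    for gene_a, gene_b in interactions:
        if gene_a in significant:
            partners.add(gene_b)
        if gene_b in significant:
            partners.add(gene_a)
    return partners


# Toy pairs for illustration only, not BioGRID data.
toy_pairs = [("G1", "G2"), ("G2", "G1"), ("G3", "G4"), ("G5", "G6")]
toy_partners = interaction_partners(toy_pairs, {"G1", "G3"})
```

The resulting set would then be passed to the same enrichment analysis as the significant set itself.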
TABLE 4.5: Note here that these are uncorrected p-values; therefore, they do not represent term enrichment. They are presented to show the top Gene Ontology terms associated with the significant gene set. The adjusted p-values following FDR correction are provided to reinforce their non-significance.

GO ID        p-value    FDR   Process
GO:0006366   0.000433   1     transcription from RNA polymerase II promoter
GO:0060850   0.000487   1     regulation of transcription involved in cell fate commitment
GO:0006351   0.00153    1     transcription, DNA-templated
GO:0097659   0.00155    1     nucleic acid-templated transcription
GO:0050652   0.00423    1     dermatan sulfate proteoglycan biosynthetic process, polysaccharide chain biosynthetic process
GO:1903691   0.00423    1     positive regulation of wound healing, spreading of epidermal cells
GO:0032289   0.00423    1     central nervous system myelin formation
GO:0003142   0.00423    1     cardiogenic plate morphogenesis
GO:0060807   0.00423    1     regulation of transcription from RNA polymerase II promoter involved in definitive endodermal cell fate specification
GO:0060796   0.00423    1     regulation of transcription involved in primary germ layer cell fate commitment
TABLE 4.6: The top 10 most specific terms associated with interaction partners to the significant genes determined by Monte Carlo simulations. In total there were 168 terms in the full table (the top 50 of which are in Table B.1).

GO ID        p-value    Process
GO:0006368   9.55e-12   transcription elongation from RNA polymerase II promoter
GO:0038095   5.47e-10   Fc-epsilon receptor signaling pathway
GO:0006369   8.88e-10   termination of RNA polymerase II transcription
GO:0043968   7.76e-09   histone H2A acetylation
GO:0042795   1.54e-08   snRNA transcription from RNA polymerase II promoter
GO:0016925   1.05e-07   protein sumoylation
GO:0002223   1.38e-07   stimulatory C-type lectin receptor signaling pathway
GO:1900034   4.67e-07   regulation of cellular response to heat
GO:0050821   8.02e-07   protein stabilization
GO:1900740   1.4e-06    positive regulation of protein insertion into mitochondrial membrane involved in apoptotic signaling pathway
FIGURE 4.1: A simple binary mapping of cancer backgrounds to significant gene finds. A black square indicates that the gene (row) was significant in that cancer background (column). Cancer backgrounds with no significant results have been removed. (Rows: genes ADAT3 through ZZZ3. Columns: UCEC_mut, UCEC_missense, THYM_mut, THYM_missense, TGCT_mut, TGCT_missense, STES_mut, STES_missense, SKCM_mut, SKCM_missense, SARC_mut, SARC_missense, LUAD_mut, KICH_mut, COADREAD_mut, COADREAD_missense, BRCA_mut, ACC_mut, ACC_missense.)
FIGURE 4.2: A heatmap showing the significant genes compared across all cancer types with both background mutation profiles. Cells are colored by the ratio between the number of patients with a mutation in the given gene over the total number of patients in that cancer type. (Rows: genes ADAT3 through ZZZ3. Columns: mutation profiles UVM_mut through ACC_missense. Color scale: 0 to 0.8.)
FIGURE 4.3: A heatmap showing the significant genes compared across all cancer types with only mut background mutation profiles, or those considering all mutations, both synonymous and missense. Cells are colored by the ratio between the number of patients with a mutation in the given gene over the total number of patients in that cancer type. (Rows: genes ADAT3 through ZZZ3. Columns: mutation profiles UVM_mut through ACC_mut. Color scale: 0 to 0.8.)
FIGURE 4.4: A heatmap showing the significant genes compared across all cancer types with only missense background mutation profiles, or those considering only missense mutations with synonymous mutations removed. Cells are colored by the ratio between the number of patients with a mutation in the given gene over the total number of patients in that cancer type. (Rows: genes ADAT3 through ZZZ3. Columns: mutation profiles UVM_missense through ACC_missense. Color scale: 0 to 0.8.)
FIGURE 4.5: Image of mutations within TBP for the COADREAD cancer profile mapped against 1NVP from PDB. All mutations are considered; however, not all mutated positions are found in the PDB structure. The vast majority of mutations occurred in regions that either did not crystallize in the final structure or were cut out of the structure in order to crystallize the ordered regions seen here.
FIGURE 4.6: Image of mutations within TBP for the BRCA cancer profile mapped against 1NVP from PDB. All mutations are considered; however, not all mutated positions are found in the PDB structure. The vast majority of mutations occurred in regions that either did not crystallize in the final structure or were cut out of the structure in order to crystallize the ordered regions seen here.
FIGURE 4.7: Image of mutations within CASC3 for the STES cancer profile mapped against 2J0S from PDB. All mutations are considered; however, not all mutated positions are found in the PDB structure. The vast majority of mutations occurred in regions that either did not crystallize in the final structure or were cut out of the structure in order to crystallize the ordered regions seen here.
Chapter 5
Regional Analysis Results
5.1
Introduction
Regional binomial testing results across 31 cancer types (listed in Table 3.2) were limited only by the parameters stated in Section 3.6. These parameters are already highly conservative, yet they still result in 525 significant genes. Well-characterized driver genes were removed by taking the set difference between this initial significant set and the COSMIC gene set. This left 480 gene symbols across both missense-only and all-mutations profiles. See Figure 5.1 through Figure 5.6 for binary mappings of these 480 gene symbols to the cancer backgrounds in which they were significant, and Table 5.1 for the significant gene symbols.
TABLE 5.1: The significant gene symbols according to regional binomial testing, considering only those genes not already in the COSMIC census gene set.
ACD
ACSM2B
ADAM19
ADAM33
ADAMTS1
ADAMTS18
ADCY2
AFAP1L1
AGAP2
AGAP7
AKAP12
AKAP13
AKAP2
AKR1C3
ALOX5
AMOT
ANK3
ANKRD12
ANKRD24
ANKRD30A
ANKRD30B
ANKRD36C
ANO4
APBA2
APBB1IP
APOBR
ARC
ARFIP1
ARHGAP23
ARHGAP5
ARHGEF40
ARID3A
ARNTL2
ARPP21
ASAP1
ASAP3
ATAD2
ATCAY
ATN1
ATP2B2
ATP8B4
ATXN1
ATXN2
AZI1
B3GALT1
B3GAT2
B4GALNT3
BACH1
BBX
BCAS1
BEGAIN
BMP2K
BZRAP1
C10orf90
C15orf40
C19orf6
C1orf173
C1orf198
C1orf65
C4orf27
C5orf42
C6orf10
C8orf34
C9orf66
CA1
CACNA1A
CACNA1H
CACNA2D2
CACTIN
CADPS
CALR3
CBX7
CCDC102A
CCDC105
CCDC110
CCDC40
CDH24
CDKL5
CEP170
CEP41
CEP63
CERKL
CHD3
CHRM2
CHST13
CILP2
CLIC6
CNTN5
COL11A1
COL15A1
COL21A1
COL23A1
COL28A1
COL4A4
COL5A1
CPSF6
CROCC
CRYBG3
CTAGE6P
DBX2
DCAF8L1
DDX11
DDX46
DENND4B
DGKB
DGKI
DHX34
DIDO1
DLEU7
DMKN
DSC3
DUPD1
DYSF
DZIP1
E2F5
EIF1AX
ELF3
EN1
ENAM
EP400
ERI1
EYA1
EYA4
FAM120B
FAM123A
FAM123C
FAM157A
FAM171B
FAM184B
FAM194A
FAM196B
FAM21A
FAM47A
FAM71E2
FBN3
FCGBP
FCRL5
FER1L6
FETUB
FGF12
FGF13
FHDC1
FILIP1
FOXP2
FOXS1
FSCB
FSD1
FSIP2
GAB1
GABRG2
GDF15
GDF5
GIMAP6
GJA8
GLDN
GOLGB1
GON4L
GPATCH8
GPR158
GPR179
GPRIN1
GPRIN2
GSG2
HAP1
HECW2
HGF
HHIPL2
HIVEP3
HLA-C
HMGB3
HOMEZ
HSCB
ILDR1
INPP5J
IRF2BPL
IRS4
IRX4
ISL1
ISX
ITSN2
JPH1
KAT8
KCNA6
KCND2
KCNJ4
KCNJ8
KCNN3
KCTD8
KDM4A
KIAA0040
KIAA0284
KIAA0319
KIAA0355
KIAA0907
KIAA1211
KIAA1257
KIAA1522
KIAA1549L
KIAA2018
KIF1A
KIF1C
KIR3DL2
KNDC1
L1TD1
LAMA3
LAMC3
LAS1L
LDLR
LIG1
LILRB5
LIMK2
LIPE
LMTK3
LONRF2
LPA
LRP11
LRRC43
MAD1L1
MAP1A
MAP6
MAPK13
MAST1
MBD1
MBD6
MCM10
MED17
MEFV
MESP2
METTL10
MGA
MICAL3
MKI67
MPHOSPH10
MPHOSPH9
MSGN1
MUC15
MYBPC2
MYH13
MYH2
MYH4
MYH6
MYH8
MYLK
MYO15A
MYO18B
MYOM1
MYRIP
MYT1L
NALCN
NASP
NBPF3
NCOA3
NEFH
NEFM
NFASC
NFATC1
NFKBIB
NFYA
NGFR
NLRP11
NOL8
NOM1
NOS1
NPAP1
NPAS3
NRAP
NRD1
NRG3
NSUN2
NTN5
NUMBL
OCEL1
OPRM1
OSBPL3
OSBPL6
OTOF
P2RX2
PALMD
PAPD7
PAPPA2
PARD3B
PAX4
PCDH15
PCF11
PCLO
PCMTD1
PCSK1
PCSK5
PDGFRL
PDZD4
PEG3
PENK
PEX5L
PHLDA1
PHLDB2
PHRF1
PIEZO1
PIK3AP1
PIK3R5
PKP4
PLEC
PLEKHG3
PMEPA1
PMFBP1
POTEF
POTEG
POU3F2
PPFIA2
PPM1E
PPP1R16B
PPP2R3A
PPP2R3B
PRDM13
PRICKLE1
PRKCSH
PRKG2
PRLR
PRRC2C
PRRG3
PTPRO
PTRF
RALY
RASSF6
RBM12B
RBM14
RBMXL3
RC3H1
RECQL5
REM1
RERE
RGPD4
RIMS2
RIMS3
RINL
RLIM
RNF146
RNFT2
ROBO2
RP1L1
RSBN1
RSPH4A
RTN3
RUNX2
RYR2
RYR3
SCAND3
SCARF2
SCN2A
SCRN2
SDCCAG3
SDPR
SEMA3E
SGSM1
SH2D2A
SHANK1
SHANK2
SHOX
SIM1
SIPA1L3
SLC16A2
SLC17A6
SLC24A3
SLC8A3
SLCO1C1
SLCO6A1
SMC2
SNAP25
SNED1
SOGA3
SORBS2
SORBS3
SORCS1
SOWAHB
SOX10
SOX9
SPATA31A3
SPATA31D1
SPATS2L
SPDYE5
SPEF2
SPERT
SPHKAP
SPOCK3
SPTA1
SPTAN1
SRL
SRRM2
SRRT
STK19
STON1-GTF2A1L
SWI5
SYNJ2
TAF1
TAF4
TARSL2
TBC1D1
TBC1D10C
TBC1D3B
TBP
TCHHL1
TDRD3
TENM1
TENM2
TEX33
THSD1
TIAM1
TIMELESS
TLN2
TLR6
TMC2
TMC5
TMEM200C
TNRC6A
TNXB
TONSL
TOP2A
TRAK1
TRANK1
TRAPPC12
TRIM3
TRIOBP
TRMT44
TSKS
TTBK1
TTLL11
TTLL2
TTN
TUB
TULP4
TUSC3
TXLNB
UNCX
USP31
USP6NL
UTP18
VRTN
WDR33
WDR64
WDR70
WDR87
WDR96
WNT16
XIRP1
XIRP2
ZAR1L
ZBBX
ZBTB38
ZC3H12D
ZC4H2
ZFHX4
ZFP106
ZFP36L2
ZFR2
ZFX
ZFYVE20
ZIC4
ZIM2
ZNF189
ZNF208
ZNF254
ZNF285
ZNF329
ZNF347
ZNF385B
ZNF398
ZNF462
ZNF534
ZNF599
ZNF638
ZNF676
ZNF696
ZNF707
ZNF711
ZNF717
ZNF746
ZNF768
ZNF770
ZNF804A
ZNF845
ZNF91
5.2
COSMIC Hypergeometric Testing
Using the COSMIC gene set only to determine which finds are novel says nothing about how well the significant set captures known cancer drivers. Therefore, a hypergeometric test was performed using the following values: 1. the number of genes in the intersection between COSMIC and my significant set, 45; 2. the number of genes in COSMIC, 616; 3. the number of genes with mutations in the TCGA set, 18201; and 4. the number of genes in my significant set, 525. This resulted in a p-value of 1.1274 × 10⁻⁸; see Equation 5.1 for the calculation and Equation 3.5 for the general equation. This p-value indicates that the significant set determined via regional analysis overlaps known cancer drivers far more than expected by chance.
\[
p(k = 45 - 1) = \frac{\binom{616}{44}\binom{18201 - 616}{525 - 44}}{\binom{18201}{525}} = 1.1274 \times 10^{-8} \tag{5.1}
\]
Note that in Equation 5.1, k − 1 successes are considered to find the cumulative probability of k or more successes.
5.3
Mutational Prevalence
In order to measure the prevalence of mutations across missense-only and all-mutations profiles, heatmaps were created in which each cell is colored by the ratio of the number of patients with a mutation in the given gene to the number of patients in that cancer type. In the naming scheme, mut profiles include all observed mutations in the cancer background (i.e., synonymous and missense mutations), while missense profiles include only missense mutations in the cancer background (i.e., synonymous mutations have been removed). Only novel finds (genes not found in the COSMIC gene set) are considered in these heatmaps, and the same scale is used for each heatmap in the set.
These heatmaps show that mutation prevalence across individuals is low for most genes. As expected, prevalence is high in genes such as TP53, a ubiquitous tumor-suppressor gene. More important than the prevalence of mutations across genes in a single cancer is the comparison of significant genes between cancers. Note that a cancer background with a high mutation prevalence for a particular gene is not necessarily targeting disordered positions within that gene; it simply has a high number of patients with mutations in the gene.
5.3.1
Both Profiles
The heatmaps are arranged by significant gene in alphabetical order, with a single gene overlap between each, therefore:
1. ACD through CAMTA1 can be seen in Figure 5.7.
2. CARD11 through FCRL5 can be seen in Figure 5.8.
3. FER1L6 through LAS1L can be seen in Figure 5.9.
4. LDLR through NTN5 can be seen in Figure 5.10.
5. NUMBL through RTN3 can be seen in Figure 5.11.
6. RUNX2 through TONSL can be seen in Figure 5.12.
7. TOP2A through ZNF91 can be seen in Figure 5.13.
5.3.2
Mutation Prevalence Distributions
Because the many rows require splitting these results across several figures, cancer-wise (Figure 5.14) and gene-wise (Figure 5.15) mean summaries are provided to aid understanding. The tables of these values are available in Appendix C.
5.4
Visualizations of Select Genes
The three genes selected here are for illustrative purposes. They were chosen because they are within the top five significant results of their cancer background by adjusted p-value and among the 20 greatest absolute observed disorder loads across all results, balancing the disorder and the number of mutations observed in the gene. There were no available Protein Data Bank (PDB) structures for these genes, which might suggest they are entirely or partially too disordered to crystallize properly, as is necessary to generate PDB structures.
5.4.1
TBP.001 in BRCA
Smoothed Disorder Plot with Mutations
In Figure 5.16, the mutations from BRCA_mut (Breast invasive carcinoma, all mutations in background) are mapped onto a smoothed disorder plot. All mutations occur at the most disordered region of the protein, with the observed position-wise disorder score among these mutated positions being ≈ −0.60, well below the −0.1 threshold used to annotate high-confidence disordered regions.
TABLE 5.2: The mutations noted in BRCA_mut for TBP in the TCGA dataset.

Isoform   Amino Acid Position   Frequency
TBP.001   131                   1
TBP.001   238                   1
TBP.001   60                    3
TBP.001   76                    28
TBP.001   77                    2
TBP.001   78                    1
TBP.001   80                    1
TBP.001   89                    1
TBP.002   111                   1
TBP.002   218                   1
TBP.002   40                    3
TBP.002   56                    28
TBP.002   57                    2
TBP.002   58                    1
TBP.002   60                    1
TBP.002   69                    1
Note that this isoform, TBP.001, was also significant within:
ACC_mut: Adrenocortical carcinoma, missense-only mutations in background
COADREAD_mut: Colon adenocarcinoma and Rectum adenocarcinoma, all mutations in background
ESCA_mut: Esophageal carcinoma, all mutations in background
KICH_mut: Kidney chromophobe, all mutations in background
KIRC_mut: Kidney renal clear cell carcinoma, all mutations in background
SKCM_mut: Skin cutaneous melanoma, all mutations in background
STES_mut: Stomach and esophageal carcinoma, all mutations in background
TGCT_mut: Testicular germ cell tumors, all mutations in background
5.4.2
PLEC.005 in ACC
Smoothed Disorder Plot with Mutations
In Figure 5.17, the mutations from ACC_mut (Adrenocortical carcinoma, all mutations in background) are mapped onto a smoothed disorder plot. We can see that roughly half of the mutations occur below the high-confidence threshold (−0.1) for determining disordered regions. Most of the remaining mutations occur at the C-terminus, where the sequence becomes more disordered.

TABLE 5.3: The mutations noted in ACC_mut for PLEC.005 in the TCGA dataset. Note there were 760 mutations across 176 positions; therefore, this table only shows the mutations for PLEC.005.

Isoform    Amino Acid Position   Frequency
PLEC.005   1321                  18
PLEC.005   1386                  12
PLEC.005   1697                  11
PLEC.005   1854                  1
PLEC.005   1880                  1
PLEC.005   1905                  1
PLEC.005   1998                  4
PLEC.005   2047                  1
PLEC.005   2106                  17
PLEC.005   2113                  12
PLEC.005   2242                  1
PLEC.005   2495                  1
PLEC.005   2507                  1
PLEC.005   2713                  1
PLEC.005   3145                  1
PLEC.005   4004                  1
PLEC.005   4005                  1
PLEC.005   4382                  1
PLEC.005   4445                  1
PLEC.005   4539                  1
PLEC.005   4624                  6
PLEC.005   4668                  1

Note that this isoform, PLEC.005, was also significant within:
ACC_missense: Adrenocortical carcinoma, missense-only mutations in background
CESC_missense: Cervical squamous cell carcinoma and Endocervical adenocarcinoma, missense-only mutations in background
COADREAD_mut: Colon adenocarcinoma and Rectum adenocarcinoma, all mutations in background
COADREAD_missense: Colon adenocarcinoma and Rectum adenocarcinoma, missense-only mutations in background
HNSC_mut: Head and neck squamous cell carcinoma, all mutations in background
HNSC_missense: Head and neck squamous cell carcinoma, missense-only mutations in background
SKCM_mut: Skin cutaneous melanoma, all mutations in background
STES_mut: Stomach and esophageal carcinoma, all mutations in background
STES_missense: Stomach and esophageal carcinoma, missense-only mutations in background
UCEC_mut: Uterine corpus endometrial carcinoma, all mutations in background
5.4.3
NEFH.001 in ACC
Smoothed Disorder Plot with Mutations
In Figure 5.18, the mutations from ACC_mut (Adrenocortical carcinoma, all mutations in background) are mapped onto a smoothed disorder plot. All mutations occur below the high-confidence threshold (−0.1) for determining a disordered region. The mutations are concentrated around an effective plateau of disorder, suggesting this region is internally consistent. Rather than simply being a transition region between the relative order before it and the relative disorder after it, the plateau suggests the region maintains a given level of disorder, which might confer a function on this region beyond simple transition between other key regions of the folded protein.

TABLE 5.4: The mutations noted in ACC_mut for NEFH in the TCGA dataset.

Isoform    Amino Acid Position   Frequency
NEFH.001   645                   2
NEFH.001   646                   1
NEFH.001   655                   11
NEFH.001   698                   2
NEFH.001   701                   2
NEFH.001   702                   2
NEFH.001   744                   1
NEFH.001   805                   1

Note that this isoform, NEFH.001, was also significant within:
BRCA_mut: Breast invasive carcinoma, all mutations in background
KIRP_mut: Kidney renal papillary cell carcinoma, all mutations in background
KIRP_missense: Kidney renal papillary cell carcinoma, missense-only mutations in background
5.5
Enrichment Analysis
No terms were significant following FDR correction; however, the top 10 terms prior to correction are listed in Table 5.5.
5.6
Partner Set Enrichment Analysis
Using Homo sapiens data downloaded from the latest BioGRID release on June 14th, 2017, all direct interactors of the significant set were extracted into their own interaction partner set (duplicate entries were removed). This resulted in 480 gene symbols, which, when run through the same enrichment analysis process as before, yielded hundreds of enriched terms. After removing parent terms in the graph so that only the most specific terms remained, a total of 149 terms were enriched; the top 10 terms are listed in Table 5.6 (the top 50 terms can be seen in Table C.1).
TABLE 5.5: Note here that these are uncorrected p-values; therefore, they do not represent term enrichment. They are presented to show the top Gene Ontology terms associated with the significant gene set. The adjusted p-values following FDR correction are provided to reinforce their non-significance.

GO ID        p-value    FDR   Process
GO:0006936   7.78e-05   1     muscle contraction
GO:0003012   8.44e-05   1     muscle system process
GO:0070252   0.000112   1     actin-mediated cell contraction
GO:0030048   0.000128   1     actin filament-based movement
GO:0001508   0.000278   1     action potential
GO:0030049   0.000324   1     muscle filament sliding
GO:0033275   0.000324   1     actin-myosin filament sliding
GO:0033693   0.000694   1     neurofilament bundle assembly
GO:0072719   0.000694   1     cellular response to cisplatin
GO:0072718   0.000694   1     response to cisplatin
TABLE 5.6: The top 10 most specific terms associated with interaction partners to the significant genes determined by regional binomial tests. In total there were 149 terms in the full table (the top 50 can be seen in Table C.1).

GO ID        p-value     Process
GO:0044260   1.53e-121   cellular macromolecule metabolic process
GO:0090304   4.94e-112   nucleic acid metabolic process
GO:0043170   6.18e-105   macromolecule metabolic process
GO:0006139   1.25e-90    nucleobase-containing compound metabolic process
GO:0016070   3.46e-88    RNA metabolic process
GO:0046483   3.31e-82    heterocycle metabolic process
GO:0006725   3.87e-81    cellular aromatic compound metabolic process
GO:0010467   3.51e-80    gene expression
GO:0044238   4.6e-80     primary metabolic process
GO:1901360   2.53e-77    organic cyclic compound metabolic process
FIGURE 5.1: A simple binary mapping of cancer backgrounds to significant gene finds. A black square indicates that the gene (row) was significant in that cancer background (column). Cancer backgrounds with no significant results have been removed. (1 of 6) (Rows: genes ACD through CEP63. Columns: cancer backgrounds UVM_missense through ACC_mut.)
FIGURE 5.2: A simple binary mapping of cancer backgrounds to significant gene finds. A black square indicates that the gene (row) was significant in that cancer background (column). Cancer backgrounds with no significant results have been removed. (2 of 6) (Rows: genes CEP63 through GPRIN1. Columns: cancer backgrounds UVM_missense through ACC_mut.)
FIGURE 5.3: A simple binary mapping of cancer backgrounds to significant gene finds. A black square indicates that the gene (row) was significant in that cancer background (column). Cancer backgrounds with no significant results have been removed. (3 of 6)
[Figure omitted. Rows: genes GPRIN1 through MYH6, alphabetically. Columns: mutation profiles ACC_mut through UVM_missense.]
FIGURE 5.4: A simple binary mapping of cancer backgrounds to significant gene finds. A black square indicates that the gene (row) was significant in that cancer background (column). Cancer backgrounds with no significant results have been removed. (4 of 6)
[Figure omitted. Rows: genes MYH6 through PTRF, alphabetically. Columns: mutation profiles ACC_mut through UVM_missense.]
FIGURE 5.5: A simple binary mapping of cancer backgrounds to significant gene finds. A black square indicates that the gene (row) was significant in that cancer background (column). Cancer backgrounds with no significant results have been removed. (5 of 6)
[Figure omitted. Rows: genes PTRF through TDRD3, alphabetically. Columns: mutation profiles ACC_mut through UVM_missense.]
FIGURE 5.6: A simple binary mapping of cancer backgrounds to significant gene finds. A black square indicates that the gene (row) was significant in that cancer background (column). Cancer backgrounds with no significant results have been removed. (6 of 6)
[Figure omitted. Rows: genes TDRD3 through ZNF91, alphabetically. Columns: mutation profiles ACC_mut through UVM_missense.]
FIGURE 5.7: A heatmap showing the significant genes compared across all cancer backgrounds. Cells are colored by the ratio between the number of patients with a mutation in the given gene over the total number of patients in that cancer type. (1 of 7)
[Figure omitted. Rows: genes ACD through CAMTA1, alphabetically. Columns: the _mut and _missense profiles of each cancer type, ACC_mut through UVM_missense. Color scale: 0 to 0.6.]
FIGURE 5.8: A heatmap showing the significant genes compared across all cancer backgrounds. Cells are colored by the ratio between the number of patients with a mutation in the given gene over the total number of patients in that cancer type. (2 of 7)
[Figure omitted. Rows: genes CARD11 through FCRL5, alphabetically. Columns: the _mut and _missense profiles of each cancer type, ACC_mut through UVM_missense. Color scale: 0 to 0.6.]
FIGURE 5.9: A heatmap showing the significant genes compared across all cancer backgrounds. Cells are colored by the ratio between the number of patients with a mutation in the given gene over the total number of patients in that cancer type. (3 of 7)
[Figure omitted. Rows: genes FER1L6 through LAS1L, alphabetically. Columns: the _mut and _missense profiles of each cancer type, ACC_mut through UVM_missense. Color scale: 0 to 0.6.]
FIGURE 5.10: A heatmap showing the significant genes compared across all cancer backgrounds. Cells are colored by the ratio between the number of patients with a mutation in the given gene over the total number of patients in that cancer type. (4 of 7)
[Figure omitted. Rows: genes LDLR through NTN5, alphabetically. Columns: the _mut and _missense profiles of each cancer type, ACC_mut through UVM_missense. Color scale: 0 to 0.6.]
FIGURE 5.11: A heatmap showing the significant genes compared across all cancer backgrounds. Cells are colored by the ratio between the number of patients with a mutation in the given gene over the total number of patients in that cancer type. (5 of 7)
[Figure omitted. Rows: genes NUMBL through RTN3, alphabetically. Columns: the _mut and _missense profiles of each cancer type, ACC_mut through UVM_missense. Color scale: 0 to 0.6.]
FIGURE 5.12: A heatmap showing the significant genes compared across all cancer backgrounds. Cells are colored by the ratio between the number of patients with a mutation in the given gene over the total number of patients in that cancer type. (6 of 7)
[Figure omitted. Rows: genes RUNX2 through TONSL, alphabetically. Columns: the _mut and _missense profiles of each cancer type, ACC_mut through UVM_missense. Color scale: 0 to 0.6.]
FIGURE 5.13: A heatmap showing the significant genes compared across all cancer backgrounds. Cells are colored by the ratio between the number of patients with a mutation in the given gene over the total number of patients in that cancer type. (7 of 7)
[Figure omitted. Rows: genes TOP2A through ZNF91, alphabetically. Columns: the _mut and _missense profiles of each cancer type, ACC_mut through UVM_missense. Color scale: 0 to 0.6.]
FIGURE 5.14: The mean mutation prevalence from regional heatmaps by cancer. Note that the number of missense-only profiles (Missense) plus the number of all-mutations profiles (Muts) is equal to the number of both profiles (Both). This is due to Missense and Muts profiles being mutually exclusive.
[Plot omitted. Title: Heatmaps Column Means Summary. x-axis: Sorted Index (0 to 60). y-axis: Mutation Prevalence Mean (0.00 to 0.06). Legend (ID): Both, Missense, Muts.]
FIGURE 5.15: The mean mutation prevalence from regional heatmaps by gene. Note that here the number of missense-only profile genes (Missense) plus the number of all mutation profiles genes (Muts) is not equal to the number of both profiles (Both). This is due to only considering genes significant in each of the background profiles, which are not mutually exclusive. Therefore, Both is not simply the combination of Missense and Muts, but rather their union with Missense and Muts profiles sharing 236 entries. Lengths: Both (525), Muts (447), Missense (314).
[Plot omitted. Title: Heatmaps Row Means Summary. x-axis: Sorted Index (0 to 500). y-axis: Mutation Prevalence Mean (0.0 to 0.3). Legend (ID): Both, Missense, Muts.]
FIGURE 5.16: The smoothed disorder scores of TBP.001 in BRCA_mut. The red dots are mutated positions and exact disorder score at that position.
[Plot omitted. Title: Smooth Disorder Plot for TBP.001. x-axis: Amino Acid Position (ticks 100 to 300). y-axis: Disorder Score (−0.6 to 0.3).]
FIGURE 5.17: The smoothed disorder scores of PLEC.005 in ACC_mut. The red dots are mutated positions and exact disorder score at that position.
[Plot omitted. Title: Smooth Disorder Plot for PLEC.005. x-axis: Amino Acid Position (0 to 4000). y-axis: Disorder Score (−0.4 to 0.0).]
FIGURE 5.18: The smoothed disorder scores of NEFH.001 in ACC_mut. The red dots are mutated positions and exact disorder score at that position.
[Plot omitted. Title: Smooth Disorder Plot for NEFH.001. x-axis: Amino Acid Position (0 to 1000). y-axis: Disorder Score (−0.4 to 0.0).]
Chapter 6
Discussion
6.1 Introduction
Anyone attempting to interpret as many results as were found in this analysis (551)¹ without a systematic approach would need to rely on some form of heuristic, yet there are no known heuristics for judging the significance or novelty of protein disorder findings within the context of cancer. Therefore, rather than selecting results at random or taking the top N results from each method, it was decided to take the intersection of the two methods to determine the most notable genes of potential disorder-driven cancer implication. Genes captured by both methods of analysis should have a heightened potential for driving cancer under the hypothesis herein that protein disorder may be implicated in as yet uncharacterized driver genes.
¹ 77 from positional analysis and 480 from regional analysis, with a 6-result overlap; 77 + 480 − 6 = 551.
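The total quoted in the footnote is an inclusion-exclusion count: the number of genes found by either method is the sum of the two set sizes minus their overlap. A minimal Python sketch of that arithmetic follows; the function name `union_size` is illustrative only and is not part of the original analysis. The same identity applies to the counts given in the Figure 5.15 caption (447 Muts, 314 Missense, 236 shared, 525 Both).

```python
def union_size(n_a: int, n_b: int, n_shared: int) -> int:
    """Size of the union of two sets, given both sizes and the size of their overlap."""
    if n_shared < 0 or n_shared > min(n_a, n_b):
        raise ValueError("overlap cannot be negative or exceed either set size")
    return n_a + n_b - n_shared


# Counts reported in the text: 77 positional, 480 regional, 6 shared.
total_results = union_size(77, 480, 6)
print(total_results)  # 551

# Counts reported in the Figure 5.15 caption: 447 Muts, 314 Missense, 236 shared.
print(union_size(447, 314, 236))  # 525
```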
Chapter 6. Discussion
TABLE 6.1: The significant gene symbols according to both positional Monte Carlo simulations and regional binomial tests. Only those genes not already in the COSMIC census gene set are considered.
EP400   TBP   SRRM2   NCOA3   GPRIN2   ZNF707
6.2 Intersection of Both Methods of Analysis
The positional and regional analysis methods each have their own bias: positional analysis could falsely call insignificant random mutations at disordered positions significant, while regional analysis could call intrinsically disordered proteins significant because they consist of one large disordered region. Taking the intersection of the two novel find sets should therefore give a high-confidence disorder-implicated gene set. The significant genes shared between the two methods are listed in Table 6.1. Only novel finds are considered here rather than all finds; therefore, none of these results are currently in the COSMIC census.
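The selection just described can be sketched with plain set operations. This is a minimal illustration rather than the pipeline actually used: the helper name `shared_novel_genes` is hypothetical, and every gene symbol other than the six listed in Table 6.1 is a made-up placeholder standing in for the real positional, regional, and COSMIC census sets.

```python
def shared_novel_genes(positional, regional, census):
    """Drop census genes from each result set, then keep genes found by both methods."""
    positional_novel = set(positional) - set(census)
    regional_novel = set(regional) - set(census)
    return sorted(positional_novel & regional_novel)


# The six genes reported in Table 6.1.
TABLE_6_1 = ["EP400", "GPRIN2", "NCOA3", "SRRM2", "TBP", "ZNF707"]

# Placeholder symbols (hypothetical, for illustration only).
positional_finds = set(TABLE_6_1) | {"POS_ONLY_1", "POS_ONLY_2", "KNOWN_DRIVER_1"}
regional_finds = set(TABLE_6_1) | {"REG_ONLY_1", "KNOWN_DRIVER_1"}
census_genes = {"KNOWN_DRIVER_1"}

shared = shared_novel_genes(positional_finds, regional_finds, census_genes)
print(shared)
```

In this toy input the placeholder `KNOWN_DRIVER_1` is found by both methods but is dropped because it is in the census set, which mirrors how known drivers are excluded before the intersection is taken.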
6.2.1 EP400
This gene, E1A-binding protein p400, is involved in the transcriptional activation of select genes via H4 and H2A acetylation (Doyon et al., 2004).² Notably, Endo et al. (2013) found this gene implicated in an ossifying fibromyxoid tumor. This was detected in only a single case, but showed potential reproducibility. The rarity and uncertainty associated with the finding suggest it might be disorder-associated: the rarity because disordered regions are less susceptible to mutational disruption, and the reproducibility because it suggests more than a random chance finding. Meanwhile, Mouradov et al. (2014) conducted a systematic investigation of primary colorectal tumors and compared them against TCGA data to conclude that these tumors are representative of the main subtypes of primary tumors at the genomic level, finding EP400 mutation enrichment among other commonly found tumor genes. In addition to these findings, Smith et al. (2010) and Wu et al. (2015) found this gene implicated in human papillomavirus (HPV)-associated cancers and bladder cancer recurrence, respectively.
² http://www.uniprot.org/uniprot/Q96L91
6.2.2 TBP
This gene, TATA-box-binding protein, is part of the TFIID complex, and its binding to the complex is part of the initial transcriptional step of the pre-initiation complex (PIC).³ TBP has not yet been implicated in cancer by itself, but it has been noted to interact with p53, a ubiquitous cancer driver gene (Truant, Xiao, Ingles, & Greenblatt, 1993). This gene has been implicated almost exclusively in neurodegeneration, particularly via spinocerebellar ataxia (Zühlke, Dalski, Schwinger, & Finckh, 2005). If this gene is driving cancer via disorder-focused mutation, the mutation is likely either affecting its ability to bind to the TFIID complex, leading to a slowing of transcriptional activity, or affecting the rate of signal transduction by p53 (GO:1901796).
6.2.3 SRRM2
This gene, Serine/arginine repetitive matrix protein 2, has been previously implicated in papillary thyroid carcinoma predisposition (Tomsic et al., 2015), colorectal cancer (Hinoue et al., 2012), and breast cancer (Semaan, Wang, Stewart, Marshall, & Sang, 2011). Its exact function is still unknown, but it is involved in pre-mRNA splicing, where it may stabilize the catalytic center or the position of the RNA substrate (Blencowe et al., 2000).⁴
³ http://www.uniprot.org/uniprot/P20226
6.2.4 NCOA3
This gene, Nuclear receptor coactivator 3, is overexpressed in ≈ 60% of primary breast tumors (Burwinkel et al., 2005). This overexpression has been shown to significantly reduce the disease-free and overall survival rates when compared to patients with other tumor types (Zhao et al., 2003), to the point that its secondary alias symbol is AIB1 (amplified in breast cancer 1). Breast cancers can be divided into two distinct classes, estrogen receptor α-positive (ERα-positive) and ERα-negative disease; AIB1 amplification characterizes a subgroup of ERα-positive breast cancer with worse outcome (Burandt et al., 2013).⁵
6.2.5 GPRIN2
This gene, G protein-regulated inducer of neurite outgrowth 2, was first shown to be involved in the G protein action of the brain (L. T. Chen, Gilman, & Kozasa, 1999). Since then it has been shown to be highly mutated in invasive lobular breast cancer (Ciriello et al., 2015) and involved in cancer risk in conjunction with environmental risks such as ceramic fibers (Gérazime, Stücker, & Luce, 2016) and asbestos (Jiménez, Aguilar, Velázquez, Tachiquin, & Juárez, 2016). Beyond these publications, GPRIN2 is mostly absent from any directed study.⁶
⁴ http://www.uniprot.org/uniprot/Q9UQ35
⁵ http://www.uniprot.org/uniprot/Q9Y6Q9
6.2.6 ZNF707
This gene, zinc finger protein 707, has never been directly studied;⁷ instead, all publications caught ZNF707 in other analyses, with only one study mentioning it as a result. That study, by Nesslinger et al. (2007), found that in prostate cancer ZNF707 + PTMA was recognized by treatment-associated autoantibodies. Beyond that, ZNF707 has been annotated in four interactome studies (Rual et al., 2005; Rolland et al., 2014; Hein et al., 2015; Xin et al., 2009), sequenced as part of two analyses of chromosome 8 (Nusbaum et al., 2006; Ota et al., 2004), and included in an NIH project to expand the Mammalian Gene Collection (MGC) (Gerhard et al., 2004).
6.3 Enrichment Analyses
The lack of significant terms following enrichment analysis is not without meaning. A lack of enriched terms in this case might suggest that disorder-targeted proteins do not share a similar driving mechanism and are instead as varied as their lack of well-defined structure suggests. If these disorder-targeted proteins are implicated in cancer, this varied set of mechanisms would likely be attributable to binding partner disruption. This is supported by many of the uncorrected terms (Table 4.5 and Table 5.5) being associated with complex protein network interactions.
⁶ http://www.uniprot.org/uniprot/O60269/publications
⁷ http://www.uniprot.org/uniprot/Q96C28/publications
6.3.1 Positional
The terms listed in Table 4.6 (positional analysis interaction partner set) are all associated with either metabolic processes or gene expression. These associations are unsurprising given that the mutations were observed in patients with cancer. More surprisingly, one of the top terms here is "protein stabilization," which might suggest that these disorder-targeted genes destabilize more than just their own binding relationships by having a secondary effect on protein stabilization at large. Another significant term, "protein sumoylation," refers to a post-translational modification associated with apoptosis, protein stability, and progression through the cell cycle (Hay, 2005), and thus with the long-term fate of a protein.
6.3.2 Regional
The terms listed in Table 5.6 (regional analysis interaction partner set) suggest that disorder-implicated driver genes may drive cancer via their binding partners rather than directly. Since every term in the table concerns regulation, particularly of gene expression and biosynthesis, the effect of mutations is likely to disrupt metabolic networks rather than metabolic processes directly. In the expanded interaction partner set enrichment table for regional analysis (Table C.1), terms further down the list, such as "positive regulation of ATP biosynthetic process," point to the energetics aspect of cancer induction. Some surprising enriched terms, such as "behavioral response to ethanol," are interesting but offer no aid in characterizing these genes as cancer driver genes; rather, they highlight the limitations of this approach (further discussed in Section 6.8).
6.3.3 Regional and Positional Cross-comparison
Significant Novel Finds Sets. Looking at both the uncorrected positional terms (Table 4.5) and the uncorrected regional terms (Table 5.5), neither set of terms has an obvious role in driving cancer; rather, there is a great variety of terms that do not seem cancer-related. This might suggest that these disorder-targeted genes, if driving cancer, do so via their interaction network rather than directly.
Binding Partner Sets. Terms such as "protein sumoylation" and "protein stabilization" occur in both the positional (Table 4.6) and regional (Table 5.6) partner set enrichment results. This helps cross-validate the results from each method of analysis; however, it might also be due to the scale-free property of protein-protein interaction networks, where gathering the interaction partner set of any initial set is likely to result in a more central set overall, in this case a more biologically critical gene set. This point is discussed further in Section 6.8 below.
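The centrality caveat can be illustrated on a toy network. The sketch below is an assumption-laden illustration, not the interaction data used in this work: the node names (`HUB`, `MID`, `LEAF_1` and so on), the edge list, and the helper functions are all hypothetical, and the partner set is taken here to exclude the starting genes themselves.

```python
def build_network(edges):
    """Build an undirected adjacency map from a list of (a, b) interaction pairs."""
    network = {}
    for a, b in edges:
        network.setdefault(a, set()).add(b)
        network.setdefault(b, set()).add(a)
    return network


def partner_set(network, genes):
    """All direct interaction partners of the given genes, excluding the genes themselves."""
    partners = set()
    for gene in genes:
        partners |= network.get(gene, set())
    return partners - set(genes)


def mean_degree(network, genes):
    """Average number of interaction partners per gene."""
    return sum(len(network[gene]) for gene in genes) / len(genes)


# Hypothetical hub-dominated network: one hub with six partners, one minor node.
edges = [("HUB", f"LEAF_{i}") for i in range(1, 7)]
edges += [("MID", "LEAF_1"), ("MID", "LEAF_2")]
network = build_network(edges)

start = {"LEAF_3", "LEAF_4", "LEAF_5"}
partners = partner_set(network, start)

print(mean_degree(network, start))     # degree of the starting genes
print(mean_degree(network, partners))  # degree of their partner set
```

Starting from three peripheral nodes, the partner set collapses onto the hub, so its mean degree is much higher than that of the starting set. In a hub-dominated network this happens for most starting sets, which is why shared enrichment terms between partner sets are weaker evidence than they first appear.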
6.4 Disorder Binding Incitation of Cancer
Following the analysis herein, I now suspect that if disorder-implicated driver genes exist, they are likely driving cancer via their binding relationships. Disordered proteins add robustness to protein-protein interaction networks by complementing the rigidity of ordered regions (e.g., binding sites and conserved domains). An ordered site being made more disordered, and thereby disrupting binding, makes general sense; however, the analysis herein did not help answer the more general question of how disorder may incite cancer. It is possible that disorder-targeted genes incite cancer by affecting binding relationships rather than directly, but further research is needed on how mutations in disordered regions of even known driver genes present themselves.
6.5 COSMIC – Limited Complement
The consensus driver genes in COSMIC have largely been determined by methods geared toward finding order-targeted mutation effects and therefore offer a strong complement to the disorder-focused discovery of driver genes herein. Since COSMIC is the standard for genes causally implicated in cancer, some overlap with it offers slight support for the remaining significant results being potential drivers. However, since the biological basis of prior methods and that of the work herein differ so greatly, using COSMIC to remove known driver genes, although standard, is likely to remove few genes driven by disorder. In other words, filtering against the COSMIC set is a good way to isolate novel results, but the set likely does not include many genes that would be found by the disorder-focused method of analysis used herein.
6.6 On the Limits of In Silico Analysis
Despite all the in silico validation methods used herein, future validation via wet lab experimentation, possibly through pull-down assays, will be necessary. Pull-down assays are particularly suited to disordered regions because they directly test binding disruption, a likely hypothesis for how disorder-targeting mutations might drive cancer.
6.7 On the High Number of Regional Results
Having 525 genes called significant in regional analysis, with 480 remaining after removal of well-characterized driver genes, suggests a potential problem with the null model used in this approach. This is simply too many results from which to draw generalizations shared between findings. If we assume these results are problematic, or at least that the FDR correction is proper and 1/20th of the results are false discoveries, then the number of results is most likely inflated by one of two possibilities: 1. these region-gene combinations accumulate non-fatal, non-significant mutations after oncogenesis (passenger mutation accumulation), or 2. these mutations are important and their accumulation in so many genes indicates a more important conclusion to be made with further analysis (unknown mutation accumulation). The latter is, at best, blindly hopeful about the significance of my findings and lacks an effective next step toward that conclusion. The former is far more likely and has multiple next steps that can be taken. One potential next step, informed by the work of London, Movshovitz-Attias, and Schueler-Furman (2010), is to consider the mutation of "hot spot" residues in order to find the mutated regions that would show the most binding disruption. I suspect that, because this subsetting rests on biological rather than purely statistical significance, reanalyzing regions of heightened mutational concentration with added weight on the residue being mutated would drastically reduce the number of false discoveries.
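To make the "1/20th" figure concrete: at a false discovery rate of q = 0.05, roughly 0.05 × 525 ≈ 26 of the 525 called genes would be expected to be false discoveries. The sketch below shows that arithmetic alongside a Benjamini-Hochberg step-up procedure. Note the assumption: this section does not name the FDR procedure that was used, so Benjamini-Hochberg is shown only as one common choice, and the function names and the toy p-values are hypothetical.

```python
def expected_false_discoveries(n_called: int, q: float = 0.05) -> float:
    """Expected number of false discoveries among n_called results at FDR level q."""
    return n_called * q


def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure (ASSUMED here, not confirmed by the text).

    Returns one boolean per input p-value: True where the hypothesis is rejected.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k whose sorted p-value satisfies p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank / m * q:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, index in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[index] = True
    return rejected


print(expected_false_discoveries(525))  # about 26 of the 525 regional results
print(benjamini_hochberg([0.03, 0.03, 0.03, 0.9]))  # toy p-values, not thesis data
```

The step-up behaviour matters in the toy input: the smallest p-value alone would fail its own threshold, but all three small p-values are rejected because the third one passes at its rank.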
6.8 Limitations
As with any analysis, the approach taken herein has its flaws. Here I discuss the most important limitations and problems with the analysis; however, these are certainly not the only limitations given the scale and dimensionality of this analysis. With so many discrete tests across Monte Carlo simulations and binomial tests, and a variety of places where corrections could have been performed but were not due to a seemingly safe assumption that they were unnecessary,⁸ there are undoubtedly more limitations than the ones presented here.
⁸ An example would be, during regional analysis, correcting for the number of regions in an isoform/gene prior to selecting the most representative isoform. This correction is motivated by a result being more significant when a protein has many disordered regions and all the mutations concentrate in one of them than when a highly mutated protein has one very large disordered region.
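The intuition in the footnote can be sketched numerically. The code below is one possible reading, not the regional test as implemented in this work. It makes three assumptions: the null probability of a mutation landing in a region is the region's share of the protein length, the test is a one-sided exact binomial tail, and the correction for the number of regions is a Bonferroni-style multiplier. The function names and the two example proteins are hypothetical.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def region_p_value(mutations_in_region, total_mutations,
                   region_length, protein_length, n_regions=1):
    """One-sided p-value for mutation concentration in a single disordered region.

    ASSUMPTIONS: mutations fall uniformly along the protein under the null, and
    testing n_regions regions is corrected with a Bonferroni-style multiplier.
    """
    p_null = region_length / protein_length
    raw = binomial_upper_tail(mutations_in_region, total_mutations, p_null)
    return min(1.0, raw * n_regions)


# Hypothetical protein A: five 50-residue disordered regions in 1000 residues,
# with 8 of 10 mutations falling in one of them.
concentrated = region_p_value(8, 10, 50, 1000, n_regions=5)

# Hypothetical protein B: one 800-residue disordered region in 1000 residues,
# with 8 of 10 mutations falling in it.
one_large_region = region_p_value(8, 10, 800, 1000, n_regions=1)

print(concentrated, one_large_region)
```

Under these assumptions the concentrated case stays highly significant even after the multiplier, while the single large region is unremarkable, since 8 of 10 mutations landing in 80% of the sequence is close to what chance alone predicts.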
6.8.1 Impact of mutations
The impact or context of mutations is not considered in this analysis. There are many reasons why seemingly insignificant mutations may be far more important than the simple math used herein would measure them to be. One such case is that transitions (i.e., purine-to-purine and pyrimidine-to-pyrimidine DNA mutations) are not distinguished from transversions (i.e., purine-to-pyrimidine and pyrimidine-to-purine DNA mutations), even though transitions occur at a much higher rate on average than transversions (≈ 3 : 1 ratio) despite there being twice as many possible transversions as transitions. Thus certain random protein mutations are more likely than others because they are caused by a transition as opposed to a transversion. In considering mutations, the method naïvely assumes either that all mutations matter (hopeful, but likely not true) or that only missense mutations matter (also hopeful, but likely not true, since synonymous mutations do have an impact on translation rates). This limitation can be addressed through the use of MutSig (Beroukhim et al., 2007), SIFT (Mooney, 2005), or both to give mutations more context. Converting DNA mutations to the protein level, analyzing the data at that level, and then trying to draw general conclusions about the original gene from the protein-level signal required some compromise in considerations such as these. The translation from DNA to protein is not as simple or as straightforward as a translation table may suggest.
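The transition/transversion distinction described above can be made concrete with a short sketch. The function name `classify_substitution` is hypothetical and is not part of the analysis pipeline used in this work; the code only illustrates how a single-base change would be classified and why there are twice as many possible transversions as transitions.

```python
from itertools import permutations

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Return 'transition' or 'transversion' for a single-base DNA change.

    Hypothetical helper for illustration only.
    """
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    # Same chemical class (purine<->purine or pyrimidine<->pyrimidine)
    # is a transition; a change of class is a transversion.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


# Enumerate all 12 possible ordered single-base substitutions.
counts = {"transition": 0, "transversion": 0}
for ref, alt in permutations("ACGT", 2):
    counts[classify_substitution(ref, alt)] += 1

print(counts)  # {'transition': 4, 'transversion': 8}
```

Under the ≈ 3 : 1 ratio quoted above, and assuming for simplicity that rates are uniform within each class, each individual transition would be roughly six times as likely as each individual transversion, which is the kind of weighting a mutation-context-aware reanalysis could apply.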
6.8.2 Monte Carlo simulations side effect
A side effect of the Monte Carlo analysis is that if the observed mutated positions are all just slightly more disordered than the rest of the isoform, then that isoform will be called significant without a true disorder-driven rationale. This is partially addressed by the complementary regional analysis, which mitigates such false positives. Therefore, the intersection of positional finds and regional finds is a far more confident set.
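The side effect can be illustrated with a minimal sketch of a positional Monte Carlo test. This is not the implementation used in this work: the function `monte_carlo_p_value`, its parameters, and the disorder scores are all assumptions made for illustration, and the sketch simply compares the mean disorder at the mutated positions against the mean at equally many randomly drawn positions.

```python
import random


def monte_carlo_p_value(scores, mutated_positions, n_iter=10_000, seed=0):
    """Empirical p-value that mutated positions are more disordered than chance.

    Hypothetical helper: `scores` holds one disorder score per residue and
    `mutated_positions` holds distinct 0-based indices of mutated residues.
    """
    rng = random.Random(seed)
    k = len(mutated_positions)
    observed = sum(scores[i] for i in mutated_positions) / k
    at_least_as_extreme = 0
    for _ in range(n_iter):
        sample = rng.sample(range(len(scores)), k)
        if sum(scores[i] for i in sample) / k >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_iter + 1)


# Toy isoform with made-up scores: 90 residues at 0.50 and 10 at 0.55.
scores = [0.50] * 90 + [0.55] * 10
mutated = list(range(90, 100))  # every mutation lands on a 0.55 residue

p = monte_carlo_p_value(scores, mutated)
print(p < 0.05)
```

In this toy case the mutated residues are only 0.05 more disordered than the rest, yet the test reports a small p-value because the difference is consistent. That is the behaviour described above, and it is why an effect-size threshold or the regional analysis is useful as a second filter.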
6.8.3 Intersection of significance sets
Taking the intersection of the all-mutation and missense-only profiles within both the positional and regional analyses (that is, the intersection of four sets: positional-all, positional-missense, regional-all, and regional-missense) should result in a far more confident set; however, drawing conclusions from this set would be difficult. Is a significant isoform disordered overall with great peaks of order? Do all the mutations matter? Questions such as these will need to be addressed in further research.
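The four-way intersection is straightforward to compute once each analysis yields a set of significant isoforms. The sketch below uses placeholder identifiers (`iso_A`, `iso_B`, and so on) rather than real results, and the function name `high_confidence_set` is hypothetical.

```python
def high_confidence_set(positional_all, positional_missense,
                        regional_all, regional_missense):
    """Return isoforms called significant in all four analyses."""
    return (set(positional_all) & set(positional_missense)
            & set(regional_all) & set(regional_missense))


# Placeholder identifiers, not results from this work.
positional_all = {"iso_A", "iso_B", "iso_C"}
positional_missense = {"iso_B", "iso_C"}
regional_all = {"iso_B", "iso_D"}
regional_missense = {"iso_B", "iso_C", "iso_D"}

confident = high_confidence_set(positional_all, positional_missense,
                                regional_all, regional_missense)
print(sorted(confident))  # ['iso_B']
```

Because intersection can only shrink a set, the result is at most as large as the smallest of the four inputs, which is what makes it both more confident and harder to generalize from.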
6.9 Conclusions
With this work being the only intersection between cancer driver gene discovery and protein disorder, it is not yet possible to make any general, objective conclusions about disorder-implicated driver genes. Further stringency is necessary to draw meaningful conclusions about this potential cancer-driving biological property. Addressing some of the limitations as stated above in Section 6.8 should
be the next step of investigation into this intersection. If possible, initial wet-lab validation of the high-confidence set could inform a more advanced reanalysis of the data used herein by finding some general property or binding partner common to mutated versions of the significant isoforms. Such validation would likely take the form of site-directed mutagenesis and pull-down assays to determine whether the observed mutations are what causes binding between partners to be disrupted. Although the conclusions from this work are limited, as a first-step proof of concept they are important for informing continued work in this area. By repeating this analysis with increased biological consideration, such as focusing on "hot spot" residues known to be more critical in binding interactions, new drivers may be discovered beyond the six suggested here. Investigating any shared nature among the six genes listed above may further inform continued directed study in this area. By studying the relationship between protein disorder and cancer while making the fewest assumptions possible, this work lays a foundation for continued, more informed investigations into how protein disorder may drive cancer.
Bibliography
Anfinsen, C. B. (1973). Principles that govern the folding of protein chains. Science, 181(4096), 223–230. doi:10.1126/science.181.4096.223
Ast, G. (2004). How did alternative splicing evolve? Nat. Rev. Genet. 5(10), 773–782. doi:10.1038/nrg1451
Bass, A. J., Thorsson, V., Shmulevich, I., Reynolds, S. M., Miller, M., Bernard, B., . . . Liu, J. (2014). Comprehensive molecular characterization of gastric adenocarcinoma. Nature, 513(7517), 202–9. doi:10.1038/nature13480
Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. B, 57(1), 289–300. doi:10.2307/2346101
Beroukhim, R., Getz, G., Nghiemphu, L., Barretina, J., Hsueh, T., Linhart, D., . . . Sellers, W. R. (2007). Assessing the significance of chromosomal aberrations in cancer: Methodology and application to glioma. Proc. Natl. Acad. Sci. 104(50), 20007–20012. doi:10.1073/pnas.0710052104
Black, D. L. (2003). Mechanisms of alternative pre-messenger RNA splicing. Annu. Rev. Biochem. 72(1), 291–336. doi:10.1146/annurev.biochem.72.121801.161720
Blake, C. C. F., Koenig, D. F., Mair, G. A., North, A. C. T., Phillips, D. C., & Sarma, V. R. (1965). Structure of hen egg-white lysozyme, a three dimensional Fourier synthesis at 2-Ångstroms resolution. Nature, 206(4986), 757–761. doi:10.1038/206757a0
Blasco, M. A. (2005). Telomeres and human disease: Ageing, cancer and beyond. Nat. Rev. Genet. 6(8), 611–622. doi:10.1038/nrg1656
Blencowe, B. J., Baurén, G., Eldridge, A. G., Issner, R., Nickerson, J. A., Rosonina, E., & Sharp, P. A. (2000). The SRm160/300 splicing coactivator subunits. RNA, 6(1), 111–20. doi:10.1017/S1355838200991982
Boffetta, P., Hecht, S., Gray, N., Gupta, P., & Straif, K. (2008). Smokeless tobacco and cancer. doi:10.1016/S1470-2045(08)70173-6
Burandt, E., Jens, G., Holst, F., Jänicke, F., Müller, V., Quaas, A., . . . Lebeau, A. (2013). Prognostic relevance of AIB1 (NCoA3) amplification and overexpression in breast cancer. Breast Cancer Res. Treat. 137(3), 745–753. doi:10.1007/s10549-013-2406-4
Burwinkel, B., Wirtenberger, M., Klaes, R., Schmutzler, R. K., Grzybowska, E., Försti, A., . . . Hemminki, K. (2005). Association of NCOA3 polymorphisms with breast cancer risk. Clin. Cancer Res. 11(6), 2169–2174. doi:10.1158/1078-0432.CCR-04-1621
Campisi, J. (2013). Aging, cellular senescence, and cancer. Annu. Rev. Physiol. 75(1), 685–705. doi:10.1146/annurev-physiol-030212-183653
Chen, L. T., Gilman, A. G., & Kozasa, T. (1999). A candidate target for G protein action in brain. J. Biol. Chem. 274(38), 26931–26938. doi:10.1074/jbc.274.38.26931
Chen, Y., McGee, J., Chen, X., Doman, T. N., Gong, X., Zhang, Y., . . . Kouros-Mehr, H. (2014). Identification of druggable cancer driver genes amplified across TCGA datasets. PLoS One, 9(5), e98293. doi:10.1371/journal.pone.0098293
Cheng, W. C., Chung, I. F., Chen, C. Y., Sun, H. J., Fen, J. J., Tang, W. C., . . . Wang, H. W. (2014). DriverDB: An exome sequencing database for cancer driver gene identification. Nucleic Acids Res. 42(D1). doi:10.1093/nar/gkt1025
Chial, H. (2008). Proto-oncogenes to oncogenes to cancer. Nature Education, 1(1), 33.
Ciriello, G., Gatza, M. L., Beck, A. H., Wilkerson, M. D., Rhie, S. K., Pastore, A., . . . Perou, C. M. (2015). Comprehensive molecular portraits of invasive lobular breast cancer. Cell, 163(2), 506–519. doi:10.1016/j.cell.2015.09.033
de Gruijl, F. R. (1999). Skin cancer and solar UV radiation. Eur. J. Cancer, 35(14), 2003–9. doi:10.1016/S0959-8049(99)00283-X
Dees, N. D., Zhang, Q., Kandoth, C., Wendl, M. C., Schierding, W., Koboldt, D. C., . . . Ding, L. (2012). MuSiC: Identifying mutational significance in cancer genomes. Genome Res. 22(8), 1589–1598. doi:10.1101/gr.134635.111
DeMarini, D. M. (2004). Genotoxicity of tobacco smoke and tobacco smoke condensate: A review. doi:10.1016/j.mrrev.2004.02.001
Denissenko, M. F. & Pao, A. (1996). Preferential formation of benzo[a]pyrene adducts at lung cancer mutational hotspots in P53. Science, 274(5286), 430–432. doi:10.1126/science.274.5286.430
Dill, K. A. [Ken A.], Ozkan, S. B., Shell, M. S., & Weikl, T. R. (2008). The protein folding problem. Annu. Rev. Biophys. 37(1), 289–316. doi:10.1146/annurev.biophys.37.092707.153558
D’Orazio, J., Jarrett, S., Amaro-Ortiz, A., & Scott, T. (2013). UV radiation and the skin. doi:10.3390/ijms140612222
Dosztányi, Z., Csizmók, V., Tompa, P., & Simon, I. (2005, August). IUPred: Web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics, 21(16), 3433–4. doi:10.1093/bioinformatics/bti541
Dosztányi, Z., Csizmók, V., Tompa, P., & Simon, I. (2005, April). The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. J. Mol. Biol. 347(4), 827–39. doi:10.1016/j.jmb.2005.01.071
Doyon, Y., Selleck, W., Lane, W. S., Tan, S., & Côté, J. (2004). Structural and functional conservation of the NuA4 histone acetyltransferase complex from yeast to humans. Mol. Cell. Biol. 24(5), 1884–96. doi:10.1128/MCB.24.5.1884
Edwards, A. G. K., Russell, I. T., & Stott, N. C. H. (1998). Signal versus noise in the evidence base for medicine: An alternative to hierarchies of evidence? Fam. Pract. 15(4), 319–322. doi:10.1093/fampra/15.4.319
Endo, M., Kohashi, K., Yamamoto, H., Ishii, T., Yoshida, T., Matsunobu, T., . . . Oda, Y. (2013). Ossifying fibromyxoid tumor presenting EP400-PHF1 fusion gene. Hum. Pathol. 44(11), 2603–2608. doi:10.1016/j.humpath.2013.04.003
Fischer, E. (1894). Einfluss der configuration auf die wirkung der enzyme. Berichte der Dtsch. Chem. Gesellschaft, 27(3), 2985–2993. doi:10.1002/cber.18940270364
Fourier, J.-B.-J. (1822). Théorie analytique de la chaleur. Paris: F. Didot.
Futreal, P. A., Coin, L., Marshall, M., Down, T., Hubbard, T., Wooster, R., . . . Stratton, M. R. M. (2004, March). A census of human cancer genes. Nat. Rev. Cancer, 4(3), 177–183. doi:10.1038/nrc1299
Garrett, R. H. & Grisham, C. M. (2013). Biochemistry (5th ed.). Belmont, CA: Brooks/Cole Cengage Learning.
Gérazime, A., Stücker, I., & Luce, D. (2016, September). P006 Occupational exposure to refractory ceramic fibres and respiratory cancer risk. Occup. Environ. Med. 73(Suppl 1), A121. Retrieved from http://dx.doi.org/10.1136/oemed-2016-103951.331
Gerhard, D. S., Wagner, L., Feingold, E. A., Shenmen, C. M., Grouse, L. H., Schuler, G., . . . Malek, J. (2004). The status, quality, and expansion of the NIH full-length cDNA project: The Mammalian Gene Collection (MGC). Genome Res. 14(10 B), 2121–2127. doi:10.1101/gr.2596504
Ghersi, D. & Singh, M. (2014). Interaction-based discovery of functionally important genes in cancers. Nucleic Acids Res. 42(3), 1–11. doi:10.1093/nar/gkt1305
Gonzalez-Perez, A. & Lopez-Bigas, N. (2012). Functional impact bias reveals cancer drivers. Nucleic Acids Res. 40(21). doi:10.1093/nar/gks743
Goymer, P. (2007). Synonymous mutations break their silence. Nat. Rev. Genet. 8(2), 92–92. doi:10.1038/nrg2056
Hanahan, D. & Weinberg, R. A. [Robert A.]. (2011). Hallmarks of cancer: The next generation. Cell, 144(5), 646–74. doi:10.1016/j.cell.2011.02.013
Hay, R. T. (2005). SUMO: A history of modification. Mol. Cell, 18(1), 1–12. doi:10.1016/j.molcel.2005.03.012
Hecht, S. (1999). Tobacco smoke carcinogen and lung cancer. J. Natl. Cancer Inst. 91(14), 1194–1210. doi:10.1093/jnci/91.14.1194
Hein, M. Y., Hubner, N. C., Poser, I., Cox, J., Nagaraj, N., Toyoda, Y., . . . Mann, M. (2015). A human interactome in three quantitative dimensions organized by stoichiometries and abundances. Cell, 163(3), 712–723. doi:10.1016/j.cell.2015.09.053
Hendrick, J. P. & Hartl, F.-U. (1993). Molecular chaperone functions of heat-shock proteins. Annu. Rev. Biochem. 62(1), 349–384. doi:10.1146/annurev.bi.62.070193.002025
Hinoue, T., Weisenberger, D. J., Lange, C. P. E., Shen, H., Byun, H. M., Van Den Berg, D., . . . Laird, P. W. (2012). Genome-scale analysis of aberrant DNA methylation in colorectal cancer. Genome Res. 22(2), 271–282. doi:10.1101/gr.117523.110
Hua, X., Xu, H., Yang, Y., Zhu, J., Liu, P., & Lu, Y. (2013, September). DrGaP: A powerful tool for identifying driver genes and pathways in cancer sequencing studies. Am. J. Hum. Genet. 93(3), 439–51. doi:10.1016/j.ajhg.2013.07.003
Hunt, R. C., Simhadri, V. L., Iandoli, M., Sauna, Z. E., & Kimchi-Sarfaty, C. (2014). Exposing synonymous mutations. doi:10.1016/j.tig.2014.04.006
Hutchinson, E. (2001). Alfred Knudson and his two-hit hypothesis. Lancet Oncol. 2(10), 642–645. doi:10.1016/S1470-2045(01)00524-1
Jiménez, C., Aguilar, G., Velázquez, A. C., Tachiquin, M. R., & Juárez, C. (2016, September). P005 Molecular karyotype in two mesothelioma cases and four controls with exposure to asbestos. Occup. Environ. Med. 73(Suppl 1), A121. Retrieved from http://dx.doi.org/10.1136/oemed-2016-103951.330
Kamburov, A., Lawrence, M. S., Polak, P., Leshchiner, I., Lage, K., Golub, T. R., . . . Getz, G. (2015, October). Comprehensive assessment of cancer missense mutation clustering in protein structures. Proc. Natl. Acad. Sci. U. S. A. 112(40), E5486–95. doi:10.1073/pnas.1516373112
Kasper, D. L., Fauci, A. S., Hauser, S. L., Longo, D. L., Jameson, J. L., & Loscalzo, J. (2015). Harrison’s principles of internal medicine. McGraw-Hill Medical. Retrieved from http://www.worldcat.org/title/harrisons-principles-of-internal-medicine/oclc/890181375
Kendrew, J. C. (1961). The three-dimensional structure of a protein molecule. Sci. Am. 205, 96–110. doi:10.1038/scientificamerican1261-96
Kessel, A. & Ben-Tal, N. (2011). Introduction to proteins: Structure, function, and motion. CRC Press.
Knudson, A. G. (1971). Mutation and cancer: Statistical study of retinoblastoma. Proc. Natl. Acad. Sci. 68(4), 820–823. doi:10.1073/pnas.68.4.820
Kornblihtt, A. R., Schor, I. E., Alló, M., Dujardin, G., Petrillo, E., & Muñoz, M. J. (2013). Alternative splicing: A pivotal step between eukaryotic transcription and translation. Nat. Rev. Mol. Cell Biol. 14(3), 153–165. doi:10.1038/nrm3525
Kyte, J. & Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157(1), 105–132. doi:10.1016/0022-2836(82)90515-0
Lawrence, M. S., Stojanov, P., Mermel, C. H., Robinson, J. T., Garraway, L. A., Golub, T. R., . . . Getz, G. (2014, January). Discovery and saturation analysis of cancer genes across 21 tumour types. Nature, 505(7484), 495–501. doi:10.1038/nature12912
Lee, E. Y. H. P. & Muller, W. J. (2010, October). Oncogenes and tumor suppressor genes. Cold Spring Harb. Perspect. Biol. 2(10), a003236–a003236. doi:10.1101/cshperspect.a003236
Lehman, T. A., Reddel, R., Pfeifer, A. M. A., Spillare, E., Kaighn, M. E., Weston, A., . . . Harris, C. C. (1991). Oncogenes and tumor-suppressor genes. In Environ. Health Perspect. (Vol. 93, pp. 133–144). doi:10.1289/ehp.9193133
Liu, Q. & Craig, E. A. (2016). Molecular biology: Mature proteins braced by a chaperone. Nature, 539(7629), 361–362. doi:10.1038/nature20470
Liu, T. T. (2016). Noise contributions to the fMRI signal: An overview. Neuroimage, 143, 141–151. doi:10.1016/j.neuroimage.2016.09.008
London, N., Movshovitz-Attias, D., & Schueler-Furman, O. (2010). The structural basis of peptide-protein binding strategies. Structure, 18(2), 188–199. doi:10.1016/j.str.2009.11.012
Loomis, D., Guyton, K. Z., Grosse, Y., Lauby-Secretan, B., El Ghissassi, F., Bouvard, V., . . . Straif, K. (2016). Carcinogenicity of drinking coffee, mate, and very hot beverages. Lancet Oncol. 17(7), 877.
Mermel, C. H., Schumacher, S. E., Hill, B., Meyerson, M. L., Beroukhim, R., & Getz, G. (2011). GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol. 12(4), R41. doi:10.1186/gb-2011-12-4-r41
Modrek, B. & Lee, C. (2002). A genomic view of alternative splicing. Nat. Genet. 30(1), 13–19. doi:10.1038/ng0102-13
Mooney, S. (2005). Bioinformatics approaches and resources for single nucleotide polymorphism functional analysis. doi:10.1093/bib/6.1.44
Mouradov, D., Sloggett, C., Jorissen, R. N., Love, C. G., Li, S., Burgess, A. W., . . . Sieber, O. M. (2014). Colorectal cancer cell lines are representative models of the main molecular subtypes of primary cancer. Cancer Res. 74(12), 3238–3247. doi:10.1158/0008-5472.CAN-14-0013
Muzny, D. M., Bainbridge, M. N., Chang, K., Dinh, H. H., Drummond, J. A., Fowler, G., . . . Thomson, E. (2012, July). Comprehensive molecular characterization of human colon and rectal cancer. Nature, 487(7407), 330–337. doi:10.1038/nature11252
Nesslinger, N. J., Sahota, R. A., Stone, B., Johnson, K., Chima, N., King, C., . . . Nelson, B. H. (2007). Standard treatments induce antigen-specific immune responses in prostate cancer. Clin. Cancer Res. 13(5), 1493–1502. doi:10.1158/1078-0432.CCR-06-1772
Nordling, C. O. (1953). A new theory on the cancer-inducing mechanism. Br. J. Cancer, 7(1), 68–72. doi:10.1038/bjc.1953.8
Nusbaum, C., Mikkelsen, T. S., Zody, M. C., Asakawa, S., Taudien, S., Garber, M., . . . Lander, E. S. (2006). DNA sequence and analysis of human chromosome 8. Nature, 439(7074), 331–335. doi:10.1038/nature04406
Obradovic, Z., Peng, K., Vucetic, S., Radivojac, P., Brown, C. J., & Dunker, A. K. (2003). Predicting intrinsic disorder from amino acid sequence. Proteins, 53 Suppl 6(February), 566–72. doi:10.1002/prot.10532
Ota, T., Suzuki, Y., Nishikawa, T., Otsuki, T., Sugiyama, T., Irie, R., . . . Sugano, S. (2004). Complete sequencing and characterization of 21,243 full-length human cDNAs. Nat. Genet. 36(1), 40–45. doi:10.1038/ng1285
Pauling, L., Corey, R. B., & Branson, H. R. (1951). The structure of proteins: Two hydrogen-bonded helical configurations of the polypeptide chain. Proc. Natl. Acad. Sci. 37(4), 205–211. doi:10.1073/pnas.37.4.205
Porta-Pardo, E. & Godzik, A. (2014, November). E-Driver: A novel method to identify protein regions driving cancer. Bioinformatics, 30(21), 3109–3114. doi:10.1093/bioinformatics/btu499
Pray, L. (2008). DNA replication and causes of mutation. Nat. Educ. 1(1), 214.
Prilusky, J., Felder, C. E., Zeev-Ben-Mordehai, T., Rydberg, E. H., Man, O., Beckmann, J. S., . . . Sussman, J. L. (2005). FoldIndex©: A simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics, 21(16), 3435–3438. doi:10.1093/bioinformatics/bti537
Rolland, T., Taşan, M., Charloteaux, B., Pevzner, S. J., Zhong, Q., Sahni, N., . . . Vidal, M. (2014). A proteome-scale map of the human interactome network. Cell, 159(5), 1212–1226. doi:10.1016/j.cell.2014.10.050
Rual, J. F., Venkatesan, K., Hao, T., Hirozane-Kishikawa, T., Dricot, A., Li, N., . . . Vidal, M. (2005). Towards a proteome-scale map of the human protein-protein interaction network. Nature, 437(7062), 1173–1178. doi:10.1038/nature04209
Sauna, Z. E. & Kimchi-Sarfaty, C. (2011, August). Understanding the contribution of synonymous mutations to human disease. Nat. Rev. Genet. 12(10), 683–691. doi:10.1038/nrg3051
Semaan, S. M., Wang, X., Stewart, P. A., Marshall, A. G., & Sang, Q. X. A. (2011). Differential phosphopeptide expression in a benign breast tissue, and triple-negative primary and metastatic breast cancer tissues from the same African-American woman by LC-LTQ/FT-ICR mass spectrometry. Biochem. Biophys. Res. Commun. 412(1), 127–131. doi:10.1016/j.bbrc.2011.07.057
Smith, J. A., White, E. A., Sowa, M. E., Powell, M. L. C., Ottinger, M., Harper, J. W., & Howley, P. M. (2010). Genome-wide siRNA screen identifies SMCX, EP400, and Brd4 as E2-dependent regulators of human papillomavirus oncogene expression. Proc. Natl. Acad. Sci. 107(8), 3752–3757. doi:10.1073/pnas.0914818107
Stehelin, D. (1995). Oncogenes and cancer. Science, 267(5203), 1408–1409. doi:10.1126/science.7878455
Supek, F., Miñana, B., Valcárcel, J., Gabaldón, T., & Lehner, B. (2014). Synonymous mutations frequently act as driver mutations in human cancers. Cell, 156(6), 1324–1335. doi:10.1016/j.cell.2014.01.051
Surget, S., Khoury, M. P., & Bourdon, J. C. (2013). Uncovering the role of p53 splice variants in human malignancy: A clinical perspective. doi:10.2147/OTT.S53876
Tamborero, D., Gonzalez-Perez, A., & Lopez-Bigas, N. (2013, September). OncodriveCLUST: Exploiting the positional clustering of somatic mutations to identify cancer genes. Bioinformatics, 29(18), 2238–44. doi:10.1093/bioinformatics/btt395
Tamborero, D., Lopez-Bigas, N., & Gonzalez-Perez, A. (2013). Oncodrive-CIS: A method to reveal likely driver genes based on the impact of their copy number changes on expression. PLoS One, 8(2). doi:10.1371/journal.pone.0055489
Thomas, P. D. & Dill, K. A. [K A]. (1996). An iterative method for extracting energy-like quantities from protein structures. Proc. Natl. Acad. Sci. U. S. A. 93(21), 11628–11633. doi:10.1073/pnas.93.21.11628
Todd, R. & Wong, D. T. (1999). Oncogenes. Anticancer Res. 19(6A), 4729–4746.
Tomczak, K., Czerwińska, P., & Wiznerowicz, M. (2015). The Cancer Genome Atlas (TCGA): An immeasurable source of knowledge. Contemp. Oncol. (Poznań, Poland), 19(1A), A68–77. doi:10.5114/wo.2014.47136
Tomsic, J., He, H., Akagi, K., Liyanarachchi, S., Pan, Q., Bertani, B., . . . de la Chapelle, A. (2015). A germline mutation in SRRM2, a splicing factor gene, is implicated in papillary thyroid carcinoma predisposition. Sci. Rep. 5(1), 10566. doi:10.1038/srep10566
Truant, R., Xiao, H., Ingles, C. J., & Greenblatt, J. (1993). Direct interaction between the transcriptional activation domain of human p53 and the TATA box-binding protein. J. Biol. Chem. 268(4), 2284–2287.
Uversky, V. N., Gillespie, J. R., & Fink, A. L. (2000, November). Why are "natively unfolded" proteins unstructured under physiologic conditions? Proteins, 41(3), 415–27. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/11025552
Vineis, P., Alavanja, M., Buffler, P., Fontham, E., Franceschi, S., Gao, Y. T., . . . Doll, R. (2004). Tobacco and cancer: Recent epidemiological evidence. JNCI J. Natl. Cancer Inst. 96(2), 99–106. doi:10.1093/jnci/djh014
Vogelstein, B., Papadopoulos, N., Velculescu, V. E., Zhou, S., Diaz, L. A., & Kinzler, K. W. (2013, March). Cancer genome landscapes. Science, 339(6127), 1546–58. doi:10.1126/science.1235122
Ward, J. J., McGuffin, L. J., Bryson, K., Buxton, B. F., & Jones, D. T. (2004). The DISOPRED server for the prediction of protein disorder. Bioinformatics, 20(13), 2138–2139. doi:10.1093/bioinformatics/bth195
Weinberg, R. A. [R. A.]. (1984). Cellular oncogenes. Trends Biochem. Sci. 9(4), 131–133. doi:10.1016/0968-0004(84)90117-8
Weinberg, R. A. [R A]. (1994). Oncogenes and tumor suppressor genes. CA. Cancer J. Clin. 44(3), 160–170. doi:10.3322/canjclin.44.3.160
Wu, S., Yang, Z., Ye, R., An, D., Li, C., Wang, Y., . . . Cai, Z. (2015). Novel variants in MLL confer to bladder cancer recurrence identified by whole-exome sequencing. Oncotarget. doi:10.18632/oncotarget.6380
Xin, X., Rual, J. F., Hirozane-Kishikawa, T., Hill, D. E., Vidal, M., Boone, C., & Thierry-Mieg, N. (2009). Shifted transversal design smart-pooling for high coverage interactome mapping. Genome Res. 19(7), 1262–1269. doi:10.1101/gr.090019.108
Zhao, C., Yasui, K., Lee, C. J., Kurioka, H., Hosokawa, Y., Oka, T., & Inazawa, J. (2003). Elevated expression levels of NCOA3, TOP1, and TFAP2C in breast tumors as predictors of poor prognosis. Cancer, 98(1), 18–23. doi:10.1002/cncr.11482
Zühlke, C., Dalski, A., Schwinger, E., & Finckh, U. (2005, July). Spinocerebellar ataxia type 17: Report of a family with reduced penetrance of an unstable Gln49 TBP allele, haplotype analysis supporting a founder effect for unstable alleles and comparative analysis of SCA17 genotypes. BMC Med. Genet. 6(1), 27. doi:10.1186/1471-2350-6-27
Appendix A
TCGA Cancers
TABLE A.1: Reproduction of the information from https://cancergenome.nih.gov/cancersselected listing all 33 cancer types in the TCGA dataset. Note here that Paraganglioma and Pheochromocytoma are grouped together due to being interrelated.

Tissue Type of Samples | Cancer Type
Breast | Breast Ductal Carcinoma
Breast | Breast Lobular Carcinoma
Central Nervous System | Glioblastoma Multiforme
Central Nervous System | Lower Grade Glioma
Endocrine | Adrenocortical Carcinoma
Endocrine | Papillary Thyroid Carcinoma
Endocrine | Paraganglioma & Pheochromocytoma
Gastrointestinal | Cholangiocarcinoma
Gastrointestinal | Colorectal Adenocarcinoma
Gastrointestinal | Esophageal Cancer
Gastrointestinal | Liver Hepatocellular Carcinoma
Gastrointestinal | Pancreatic Ductal Adenocarcinoma
Gastrointestinal | Stomach Cancer
Gynecologic | Cervical Cancer
Gynecologic | Ovarian Serous Cystadenocarcinoma
Gynecologic | Uterine Carcinosarcoma
Gynecologic | Uterine Corpus Endometrial Carcinoma
Head and Neck | Head and Neck Squamous Cell Carcinoma
Head and Neck | Uveal Melanoma
Hematologic | Acute Myeloid Leukemia
Hematologic | Thymoma
Skin | Cutaneous Melanoma
Soft Tissue | Sarcoma
Thoracic | Lung Adenocarcinoma
Thoracic | Lung Squamous Cell Carcinoma
Thoracic | Mesothelioma
Urologic | Chromophobe Renal Cell Carcinoma
Urologic | Clear Cell Kidney Carcinoma
Urologic | Papillary Kidney Carcinoma
Urologic | Prostate Adenocarcinoma
Urologic | Testicular Germ Cell Cancer
Urologic | Urothelial Bladder Carcinoma
Appendix B
Positional Supplemental Information

TABLE B.1: The enrichment table of the top 50 most specific terms with FDR correction for interaction partner set by positional analysis. It is ordered by p-value.

GO term | Process | p-value
GO:0006368 | transcription elongation from RNA polymerase II promoter | 9.48e-12
GO:0038095 | Fc-epsilon receptor signaling pathway | 5.49e-10
GO:0006369 | termination of RNA polymerase II transcription | 8.96e-10
GO:0043968 | histone H2A acetylation | 7.73e-09
GO:0042795 | snRNA transcription from RNA polymerase II promoter | 1.52e-08
GO:0016925 | protein sumoylation | 1.05e-07
GO:0002223 | stimulatory C-type lectin receptor signaling pathway | 1.38e-07
GO:1900034 | regulation of cellular response to heat | 4.67e-07
GO:0050821 | protein stabilization | 6.57e-07
GO:1900740 | positive regulation of protein insertion into mitochondrial membrane involved in apoptotic signaling pathway | 1.39e-06
GO:0038128 | ERBB2 signaling pathway | 1.95e-06
GO:0032922 | circadian regulation of gene expression | 2.5e-06
GO:0000184 | nuclear-transcribed mRNA catabolic process, nonsense-mediated decay | 3.68e-06
GO:0070125 | mitochondrial translational elongation | 6.43e-06
GO:1904837 | beta-catenin-TCF complex assembly | 7.03e-06
GO:0050852 | T cell receptor signaling pathway | 1.22e-05
GO:0051123 | RNA polymerase II transcriptional preinitiation complex assembly | 2.65e-05
GO:0070126 | mitochondrial translational termination | 3.5e-05
GO:0043923 | positive regulation by host of viral transcription | 4.82e-05
GO:1902895 | positive regulation of pri-miRNA transcription from RNA polymerase II promoter | 4.82e-05
GO:0030521 | androgen receptor signaling pathway | 4.82e-05
GO:0000086 | G2/M transition of mitotic cell cycle | 0.000109
GO:0070932 | histone H3 deacetylation | 0.00015
GO:0090090 | negative regulation of canonical Wnt signaling pathway | 0.000206
GO:0045899 | positive regulation of RNA polymerase II transcriptional preinitiation complex assembly | 0.00022
GO:0042791 | 5S class rRNA transcription from RNA polymerase III type 1 promoter | 0.00022
GO:0042797 | tRNA transcription from RNA polymerase III promoter | 0.00022
GO:0006283 | transcription-coupled nucleotide-excision repair | 0.000299
GO:0031648 | protein destabilization | 0.000306
GO:0007179 | transforming growth factor beta receptor signaling pathway | 0.000506
GO:1903146 | regulation of mitophagy | 0.000655
GO:0000381 | regulation of alternative mRNA splicing, via spliceosome | 0.000794
GO:1904874 | positive regulation of telomerase RNA localization to Cajal body | 0.000802
GO:0007173 | epidermal growth factor receptor signaling pathway | 0.000809
GO:0007050 | cell cycle arrest | 0.00084
GO:0060766 | negative regulation of androgen receptor signaling pathway | 0.000882
GO:1990440 | positive regulation of transcription from RNA polymerase II promoter in response to endoplasmic reticulum stress | 0.000882
GO:0006978 | DNA damage response, signal transduction by p53 class mediator resulting in transcription of p21 class mediator | 0.000882
GO:0070911 | global genome nucleotide-excision repair | 0.00105
GO:0070527 | platelet aggregation | 0.00137
GO:0070933 | histone H4 deacetylation | 0.00145
GO:0051571 | positive regulation of histone H3-K4 methylation | 0.00155
GO:0070934 | CRD-mediated mRNA stabilization | 0.00178
GO:0071681 | cellular response to indole-3-methanol | 0.00178
GO:1902857 | positive regulation of non-motile cilium assembly | 0.00178
GO:1900026 | positive regulation of substrate adhesion-dependent cell spreading | 0.00186
GO:0043124 | negative regulation of I-kappaB kinase/NF-kappaB signaling | 0.00217
GO:0042769 | DNA damage response, detection of DNA damage | 0.00224
GO:0048025 | negative regulation of mRNA splicing, via spliceosome | 0.00248
GO:0051092 | positive regulation of NF-kappaB transcription factor activity | 0.00263
Appendix C
Regional Supplemental Information TABLE C.1: The enrichment table of the top 50 most specific terms with FDR correction for interaction partner set for regional analysis. It is ordered by p-value. Process
P-value
GO:0000086
G2/M transition of mitotic cell cycle
4.72e-17
GO:0016925
protein sumoylation
1.86e-13
GO:0006369
termination of RNA polymerase II transcription
9.33e-13
GO:0050821
protein stabilization
3.71e-12
GO:1900034
regulation of cellular response to heat
1.6e-11
GO:0038095
Fc-epsilon receptor signaling pathway
1.82e-11
GO:0032922
circadian regulation of gene expression
6.77e-11
protein ubiquitination involved in ubiquitin-dependent protein
4.43e-09
GO:0042787
catabolic process GO:0006977
DNA damage response, signal transduction by p53 class
7.54e-09
mediator resulting in cell cycle arrest GO:0038096
Fc-gamma receptor signaling pathway involved in
1.08e-08
phagocytosis GO:0002223
stimulatory C-type lectin receptor signaling pathway
2.18e-08
GO:0042769
DNA damage response, detection of DNA damage
3.12e-08
GO:0070979
protein K11-linked ubiquitination
3.84e-08
GO:0031145
anaphase-promoting complex-dependent catabolic process
4.29e-08
GO:0051092
positive regulation of NF-kappaB transcription factor activity
2.08e-07
GO:0030521
androgen receptor signaling pathway
6.69e-07
Appendix C. Regional Supplemental Information
116
Process
P-value
GO:0035329
hippo signaling
8.63e-07
GO:0050852
T cell receptor signaling pathway
1.3e-06
GO:1900740
positive regulation of protein insertion into mitochondrial
1.31e-06
membrane involved in apoptotic signaling pathway GO:0048013
ephrin receptor signaling pathway
1.31e-06
GO:0051437
positive regulation of ubiquitin-protein ligase activity involved
1.32e-06
in regulation of mitotic cell cycle transition GO:0051436
negative regulation of ubiquitin-protein ligase activity involved
3.72e-06
in mitotic cell cycle GO:0070987
error-free translesion synthesis
1.35e-05
GO:0070936
protein K48-linked ubiquitination
2.67e-05
GO:0051865
protein autoubiquitination
3.03e-05
GO:0042771
intrinsic apoptotic signaling pathway in response to DNA damage by p53 class mediator
3.7e-05
GO:0000183
chromatin silencing at rDNA
5.61e-05
GO:0006283
transcription-coupled nucleotide-excision repair
6.69e-05
GO:0007173
epidermal growth factor receptor signaling pathway
7.48e-05
GO:0006296
nucleotide-excision repair, DNA incision, 5’-to lesion
9.32e-05
GO:0042795
snRNA transcription from RNA polymerase II promoter
9.69e-05
GO:0043153
entrainment of circadian clock by photoperiod
9.97e-05
GO:0070911
global genome nucleotide-excision repair
0.000113
GO:0090263
positive regulation of canonical Wnt signaling pathway
0.000148
GO:1902895
positive regulation of pri-miRNA transcription from RNA polymerase II promoter
0.000209
GO:0043968
histone H2A acetylation
0.000209
GO:0010501
RNA secondary structure unwinding
0.00028
GO:0042149
cellular response to glucose starvation
0.000312
GO:0006978
DNA damage response, signal transduction by p53 class mediator resulting in transcription of p21 class mediator
0.000388
GO:0019886
antigen processing and presentation of exogenous peptide antigen via MHC class II
0.000445
GO:0000722
telomere maintenance via recombination
0.000549
GO:0070933
histone H4 deacetylation
0.000559
GO:0085020
protein K6-linked ubiquitination
0.000559
GO:0048208
COPII vesicle coating
0.000571
GO:0035666
TRIF-dependent toll-like receptor signaling pathway
0.000606
GO:0000289
nuclear-transcribed mRNA poly(A) tail shortening
0.00149
GO:0051571
positive regulation of histone H3-K4 methylation
0.00154
GO:0070932
histone H3 deacetylation
0.00154
GO:0071539
protein localization to centrosome
0.00154
GO:0006297
nucleotide-excision repair, DNA gap filling
0.00163
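The FDR correction named in the caption of Table C.1 can be illustrated with a minimal sketch. This appendix does not state which FDR procedure was applied; the Benjamini-Hochberg step-up procedure shown here is an assumption, and the function name `benjamini_hochberg` and the input values are hypothetical rather than taken from the analysis.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original input order.

    Assumption: BH is used only as a common example of an FDR procedure;
    the source does not confirm which correction produced Table C.1.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, not taken from the thesis.
raw = [0.01, 0.04, 0.03, 0.20]
adjusted = benjamini_hochberg(raw)
```

Terms would then be kept when their adjusted value falls below the chosen FDR threshold and listed in ascending order of p-value, as in the table.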
TABLE C.2: A tabular representation of the mean mutation prevalence for each cancer in both mutation profiles.
Profile
Sorted Index
Mean Mutation Prevalence
SKCM_mut
1
0.05592
UCEC_mut
2
0.04359
SKCM_missense
3
0.04068
LUSC_mut
4
0.03718
UCEC_missense
5
0.03534
COADREAD_mut
6
0.03419
STES_mut
7
0.0319
LUAD_mut
8
0.03026
LUSC_missense
9
0.02921
DLBC_mut
10
0.02652
COADREAD_missense
11
0.02478
BLCA_mut
12
0.02418
STES_missense
13
0.02405
ESCA_mut
14
0.02396
LUAD_missense
15
0.02362
ACC_mut
16
0.02213
BLCA_missense
17
0.01863
HNSC_mut
18
0.01835
ESCA_missense
19
0.01826
DLBC_missense
20
0.01766
CESC_mut
21
0.01649
CHOL_mut
22
0.01399
HNSC_missense
23
0.01397
UCS_mut
24
0.01318
ACC_missense
25
0.01279
CESC_missense
26
0.01275
LIHC_mut
27
0.01241
PAAD_mut
28
0.01069
KICH_mut
29
0.01014
UCS_missense
30
0.009828
CHOL_missense
31
0.009493
LIHC_missense
32
0.009161
PAAD_missense
33
0.00839
KIRP_mut
34
0.007188
GBM_mut
35
0.006863
BRCA_mut
36
0.006751
KICH_missense
37
0.006712
TGCT_mut
38
0.006216
SARC_mut
39
0.005862
BRCA_missense
40
0.005157
GBM_missense
41
0.005026
KIRP_missense
42
0.004965
KIRC_mut
43
0.004771
SARC_missense
44
0.004477
OV_mut
45
0.004171
TGCT_missense
46
0.004115
KIRC_missense
47
0.003534
OV_missense
48
0.00321
PRAD_mut
49
0.002833
LGG_mut
50
0.002559
PRAD_missense
51
0.002182
PCPG_mut
52
0.002098
UVM_mut
53
0.001906
LGG_missense
54
0.001894
THYM_mut
55
0.001551
UVM_missense
56
0.001364
THCA_mut
57
0.001338
PCPG_missense
58
0.001327
THYM_missense
59
0.001186
THCA_missense
60
0.001063
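The relation between the "Sorted Index" and the mean values in Table C.2 can be sketched as follows. The sketch assumes the index is the 1-based rank after sorting profiles by descending mean mutation prevalence, which matches the ordering of the table. The helper name `rank_by_mean`, the profile labels, and the per-gene prevalence values are hypothetical.

```python
from statistics import mean


def rank_by_mean(prevalence_by_profile):
    """Return (sorted index, profile, mean prevalence) tuples, highest mean first.

    Assumption: each profile maps to a list of per-gene mutation prevalence
    values; the source does not specify the exact input layout.
    """
    means = {name: mean(values) for name, values in prevalence_by_profile.items()}
    ordered = sorted(means.items(), key=lambda item: item[1], reverse=True)
    return [(index, name, value) for index, (name, value) in enumerate(ordered, start=1)]


# Hypothetical per-gene prevalence values for three illustrative profiles.
example = {
    "A_mut": [0.06, 0.04],
    "A_missense": [0.03, 0.03],
    "B_mut": [0.05, 0.03],
}
ranked = rank_by_mean(example)
```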
TABLE C.3: A tabular representation of the mean mutation prevalence by gene across both profile types.
Gene
Sorted Index
Mean Mutation Prevalence
TTN
1
0.2814
RYR2
2
0.09741
FLG
3
0.09525
PCLO
4
0.0864
ZFHX4
5
0.07581
XIRP2
6
0.07238
SPTA1
7
0.06872
PCDH15
8
0.06001
PLEC
9
0.05896
FMN2
10
0.05177
HRNR
11
0.04933
COL11A1
12
0.04867
PAPPA2
13
0.04812
NAV3
14
0.048
HYDIN
15
0.04723
FAM135B
16
0.04588
TENM1
17
0.04458
RP1L1
18
0.0445
PEG3
19
0.04319
ZNF208
20
0.04261
MYH2
21
0.04182
C1orf173
22
0.04179
ADAMTS12
23
0.03952
EP400
24
0.03951
NPAP1
25
0.03942
RIMS2
26
0.03898
ANKRD30A
27
0.03851
PRDM9
28
0.03781
ZNF804B
29
0.03685
TRIOBP
30
0.03617
MYH7
31
0.03597
UNC79
32
0.0357
MKI67
33
0.03567
COL5A1
34
0.03556
TAF1L
35
0.03507
TNR
36
0.03437
PCDH11X
37
0.03414
MYH4
38
0.03411
CACNA1A
39
0.03348
SRRM2
40
0.03339
HCN1
41
0.03321
SCN2A
42
0.03298
ZNF804A
43
0.03268
MYH8
44
0.03246
SCN10A
45
0.03207
CDH9
46
0.0316
MYH13
47
0.0314
MYO3A
48
0.03064
MYT1L
49
0.03057
ZNF676
50
0.03044
PCDH10
51
0.03044
PTPRZ1
52
0.03037
NALCN
53
0.03005
GPR158
54
0.02883
TNRC18
55
0.0288
SORCS1
56
0.02846
ZFPM2
57
0.0283
ADCY2
58
0.02825
MYH15
59
0.02824
PRG4
60
0.02801
BOD1L1
61
0.02779
KNDC1
62
0.02769
MAP2
63
0.02766
GOLGB1
64
0.02704
KIF1A
65
0.02657
FAM47A
66
0.02654
CDH18
67
0.02652
AFF2
68
0.02649
PPFIA2
69
0.02648
TRPS1
70
0.02645
ANKRD11
71
0.02634
ZNF99
72
0.02585
BCLAF1
73
0.02567
MAP1A
74
0.02525
SPTAN1
75
0.02524
COL19A1
76
0.02458
KIAA2018
77
0.02438
GPR179
78
0.02416
ZNF462
79
0.02389
HDAC9
80
0.02383
CENPE
81
0.02326
ZBBX
82
0.02279
XIRP1
83
0.02245
ZNF469
84
0.02237
MYH10
85
0.02211
USP29
86
0.02203
LRP4
87
0.022
CTNNA3
88
0.02198
TNRC6A
89
0.02157
NEFH
90
0.02137
SPATA31D1
91
0.02135
FAM83B
92
0.0213
ZNF91
93
0.02106
NES
94
0.02092
PDE1C
95
0.02073
COL6A2
96
0.02071
KIF21B
97
0.02061
ZNF407
98
0.02049
WDR33
99
0.02049
RERE
100
0.02035
ZNF638
101
0.02027
MEFV
102
0.02026
MYPN
103
0.02013
ATXN1
104
0.01986
FSCB
105
0.01976
KCND2
106
0.01972
ST6GAL2
107
0.01969
GON4L
108
0.01947
GPRIN2
109
0.01923
POTEG
110
0.01909
CHRM2
111
0.01904
WDR96
112
0.01902
TTBK1
113
0.01895
PDZRN3
114
0.01887
TJP1
115
0.01886
MAGI1
116
0.01886
RPTN
117
0.01879
ZFC3H1
118
0.01867
LRRTM4
119
0.01851
IRS4
120
0.01834
TNIK
121
0.01824
TCEB3B
122
0.01812
TULP4
123
0.0181
PAK7
124
0.01787
SULF1
125
0.01776
ZEB1
126
0.01764
ZNF479
127
0.01748
PRRC2C
128
0.01745
ATAD2
129
0.01736
YLPM1
130
0.01731
LRRC66
131
0.01726
LRRIQ3
132
0.01725
DDX11
133
0.01723
GPATCH8
134
0.01714
PDZRN4
135
0.0171
TMC5
136
0.01705
RGS7
137
0.01686
TRPC7
138
0.01683
ATN1
139
0.01683
ANKRD12
140
0.01667
IVL
141
0.01651
USP31
142
0.01632
ZNF845
143
0.01631
WDR87
144
0.01619
ZIC1
145
0.01606
SHROOM2
146
0.01606
SORBS2
147
0.01606
ZNF257
148
0.01569
FAM184A
149
0.01563
TICRR
150
0.01559
NOL4
151
0.01543
SNED1
152
0.01542
ZNF33A
153
0.01539
KCNA4
154
0.01537
TNRC6B
155
0.01535
ITSN2
156
0.01531
SRRM4
157
0.01526
WWC3
158
0.01521
HGF
159
0.01514
STON1-GTF2A1L
160
0.01512
RBMXL3
161
0.01507
ZNF285
162
0.01505
CCDC102A
163
0.01498
ZNF217
164
0.01493
ZNF835
165
0.01493
KCNN3
166
0.01491
TCHHL1
167
0.01485
AKAP12
168
0.01448
PCF11
169
0.01446
PPFIA1
170
0.01443
FAM123C
171
0.01442
NRD1
172
0.01442
SOGA3
173
0.01436
HRC
174
0.0142
WDR66
175
0.0142
ZNF608
176
0.01408
RPH3A
177
0.01408
PPP1R9A
178
0.01406
ZBTB20
179
0.01405
NKTR
180
0.01399
APOBR
181
0.01398
AMOT
182
0.01397
ZFP64
183
0.01396
ZNF585B
184
0.01395
ZNF43
185
0.01385
ZNF334
186
0.01371
PKP4
187
0.01366
ZBTB38
188
0.01365
EIF3A
189
0.01364
FAM13C
190
0.01351
ZMYND8
191
0.01346
ZNF667
192
0.01336
SGOL2
193
0.01327
RUNX2
194
0.01325
FYB
195
0.01325
ZNF135
196
0.01324
PCMTD1
197
0.01323
ZIC4
198
0.0132
NOM1
199
0.01318
ZNF532
200
0.01317
NPAS3
201
0.01311
ZCCHC5
202
0.01305
ZNF445
203
0.01302
PHACTR3
204
0.01301
TONSL
205
0.01301
BMP2K
206
0.013
ZNF347
207
0.01296
FOXP2
208
0.01288
TOP2A
209
0.01287
HIST1H1E
210
0.01284
ZNF534
211
0.01282
TRIM51
212
0.01279
ZNF254
213
0.01275
MAP4K4
214
0.01271
TSKS
215
0.01247
ZKSCAN2
216
0.01245
NSUN2
217
0.01241
CRNN
218
0.01239
PPP1R16B
219
0.01235
PLEKHG3
220
0.01234
ZNF616
221
0.01221
WWP1
222
0.0122
C8orf34
223
0.01206
ZNF85
224
0.01197
ZNF711
225
0.01196
TRIM55
226
0.01193
USH1C
227
0.01192
MNDA
228
0.01191
TBP
229
0.01189
KCTD8
230
0.01183
ZNF615
231
0.01182
FAM184B
232
0.01177
WWC1
233
0.01176
SYCP1
234
0.01175
CCDC105
235
0.01172
SMG6
236
0.01169
USP54
237
0.01169
ZC3H18
238
0.01168
PYHIN1
239
0.01167
ZNF268
240
0.01166
AZI1
241
0.01165
ZNF234
242
0.01165
RLIM
243
0.01163
TRIML2
244
0.0116
TRAPPC12
245
0.01159
SEMG2
246
0.01156
WDR64
247
0.01144
ZNF107
248
0.01143
ZNF471
249
0.01132
ZNF780A
250
0.0113
ZNF607
251
0.01129
ZNF454
252
0.01126
ZNF100
253
0.01118
HIST1H1C
254
0.01116
TTLL2
255
0.01114
WWP2
256
0.01113
SRRT
257
0.01112
PEX5L
258
0.01111
RBMX
259
0.01109
YTHDC1
260
0.01107
ZFP28
261
0.01104
ZNF71
262
0.01101
CYLC2
263
0.01096
ZNF528
264
0.01091
UBE2O
265
0.01089
PRAM1
266
0.01087
ZNF189
267
0.01086
GPR101
268
0.01085
ZFR2
269
0.01076
SV2A
270
0.01076
NOL8
271
0.01076
ZNF594
272
0.0107
FAM13A
273
0.0107
BBX
274
0.0107
TRAK1
275
0.01065
RSBN1
276
0.01061
ZNF300
277
0.01061
SDPR
278
0.0106
ZNF473
279
0.01059
TRDN
280
0.01058
ZMIZ1
281
0.01058
RINL
282
0.01058
ATXN2
283
0.01057
ZNF696
284
0.01057
MPHOSPH8
285
0.01055
ZNF709
286
0.01052
DPCR1
287
0.01047
ZNF180
288
0.01043
ZNF28
289
0.01042
RBMXL1
290
0.01037
CALD1
291
0.01037
CGN
292
0.01035
ZSCAN5B
293
0.01033
EIF5B
294
0.01032
ZNF527
295
0.01028
ZNF496
296
0.01025
ZNF799
297
0.01024
ARID3A
298
0.01022
SCARF2
299
0.0102
TUB
300
0.01016
HNRNPUL1
301
0.01014
ZNF415
302
0.01013
ZIC3
303
0.01012
CXXC1
304
0.01008
BCAS1
305
0.01005
ZNF568
306
0.009858
ZNF777
307
0.009848
RBM25
308
0.009825
EHBP1
309
0.009812
RBMXL2
310
0.009806
PRICKLE1
311
0.009803
C3orf30
312
0.009766
USP6NL
313
0.009757
ZNF610
314
0.009675
MAP9
315
0.009668
ZNF546
316
0.009622
WASF3
317
0.009568
ZSCAN18
318
0.009553
HTATSF1
319
0.009516
ZFP106
320
0.0095
REST
321
0.009485
TXLNB
322
0.009394
TTBK2
323
0.009381
PPM1E
324
0.009378
CT47B1
325
0.009321
ZNF658
326
0.009312
UBTF
327
0.00931
MUC15
328
0.009282
LIMA1
329
0.009279
ZNF157
330
0.00927
ZNF844
331
0.009242
PDZD4
332
0.009233
JPH1
333
0.00922
ZFP2
334
0.009197
TARSL2
335
0.009183
ZNF442
336
0.009183
PPIG
337
0.009165
ZSCAN10
338
0.009129
CLIC6
339
0.00912
NOP14
340
0.009024
ZNF582
341
0.008988
PJA1
342
0.008985
FAM13B
343
0.008957
ZBTB41
344
0.008939
LUZP2
345
0.008937
ZNF613
346
0.008931
TTC14
347
0.008921
NASP
348
0.008907
GPRIN1
349
0.008891
ZNF813
350
0.008883
SOWAHB
351
0.008851
ZNF230
352
0.008836
ZNF329
353
0.008811
PRKCSH
354
0.008792
ZNF618
355
0.008768
CACTIN
356
0.008755
ZNF790
357
0.00875
TRIM6-TRIM34
358
0.008747
PENK
359
0.008733
ZCWPW1
360
0.008733
TGIF2LX
361
0.008725
KDM4A
362
0.008705
ZNF574
363
0.008687
ZNF583
364
0.008677
ZNF599
365
0.008668
ZNF160
366
0.008662
ZNF16
367
0.00866
SH3PXD2B
368
0.008642
ZNF461
369
0.008634
ZBTB46
370
0.008607
ZNF14
371
0.008589
GRIPAP1
372
0.008586
ZNF235
373
0.008585
ZNF251
374
0.008546
UTP14A
375
0.008544
NEXN
376
0.008514
SAMD15
377
0.008512
ZNF507
378
0.00851
ABRA
379
0.008502
URI1
380
0.008501
FRMD6
381
0.008482
ZKSCAN5
382
0.00848
ZNF519
383
0.008469
ZNF737
384
0.008452
PRPF4B
385
0.008451
ZFP112
386
0.008445
POU3F2
387
0.008441
RBM12B
388
0.0084
ZNF484
389
0.008394
ZNF385B
390
0.008377
CPSF6
391
0.008376
ZNF467
392
0.00833
TUSC3
393
0.008327
ZNF485
394
0.008302
ZNF483
395
0.008297
ZFYVE20
396
0.008292
FRG2B
397
0.008245
FTSJ3
398
0.008224
HS6ST1
399
0.008181
ZNF214
400
0.008157
ZNF416
401
0.008132
ZFP90
402
0.008112
ZNF41
403
0.008111
ZNF816
404
0.008109
ZNF683
405
0.008109
ZNF551
406
0.008036
PHACTR1
407
0.008006
PDYN
408
0.007985
SLC16A2
409
0.007976
ZNF420
410
0.007973
BACH1
411
0.007962
ZNF195
412
0.007928
ZNF167
413
0.007917
ZFX
414
0.007909
ZNF540
415
0.007904
CEP112
416
0.007892
ZIM2
417
0.007884
ZNF304
418
0.00788
NBPF3
419
0.00787
ZNF732
420
0.007869
ARHGAP23
421
0.007862
ZNF652
422
0.007859
ZNF624
423
0.007858
SPANXN2
424
0.007855
ZNF20
425
0.007805
ZNF93
426
0.007729
TMEM200C
427
0.007717
MAGEB1
428
0.007713
ZNF430
429
0.007697
ZFP91
430
0.007678
ZSCAN4
431
0.007653
ZNF500
432
0.00765
DMP1
433
0.007603
TTLL11
434
0.007594
ZNF141
435
0.007593
ZNF567
436
0.007582
ZNF211
437
0.007579
SPERT
438
0.007573
ZFP30
439
0.007554
OS9
440
0.007547
VIM
441
0.007506
ZNF358
442
0.007487
ZNF286A
443
0.00747
ZNF770
444
0.007452
ZNF678
445
0.00745
ZNF227
446
0.00745
USP51
447
0.007443
GAB1
448
0.007426
ZNF317
449
0.007418
ZNF671
450
0.007415
ZNF544
451
0.007407
DMKN
452
0.007404
ZNF486
453
0.007343
ZNF226
454
0.007314
ZIK1
455
0.00731
ZNF555
456
0.00731
ZNF324
457
0.007293
ZNF502
458
0.007277
ZNF77
459
0.007249
ZNF354B
460
0.007238
ZC3H12D
461
0.007233
ZNF729
462
0.007123
ZNF83
463
0.007118
SPARCL1
464
0.007084
NR1H4
465
0.007063
TNIP3
466
0.007052
ZNF101
467
0.007004
ZNF697
468
0.006987
ZNF529
469
0.006984
ZNF746
470
0.006983
ZNF80
471
0.006972
ZNF530
472
0.006906
ZNF563
473
0.006872
ZNF382
474
0.006853
HIST1H1D
475
0.00685
PTRF
476
0.006842
ZNF132
477
0.006838
ZNF768
478
0.006834
TRAT1
479
0.006829
VSTM4
480
0.006826
ZCWPW2
481
0.006824
ZNF829
482
0.006817
ZNF419
483
0.006796
ZNF175
484
0.006795
FAM71E2
485
0.006789
NUMBL
486
0.006754
ZNF17
487
0.006746
LRP11
488
0.006731
ZNF212
489
0.006702
ZNF782
490
0.006659
ZSCAN5A
491
0.006609
ZNF823
492
0.006576
ZNF248
493
0.006546
ZNF557
494
0.006546
ZNF793
495
0.006521
ZC4H2
496
0.006504
ZFP36L2
497
0.006475
ZNF10
498
0.006473
GOLGA6L6
499
0.006466
HLA-DRB5
500
0.00645
ZNF354A
501
0.00641
RSPH4A
502
0.006397
ZNF169
503
0.00639
EIF1AX
504
0.006344
ZNF682
505
0.00634
ZNF655
506
0.006283
ZNF311
507
0.006207
ZNF699
508
0.006197
PHACTR2
509
0.006193
ZNF253
510
0.006134
ZNF394
511
0.006078
ZNF260
512
0.005972
ZNF713
513
0.005955
ZNF571
514
0.005953
ZNF266
515
0.005905
ZNF490
516
0.005886
ANKRD36C
517
0.005875
ZNF662
518
0.005856
ZNF205
519
0.005846
ZSCAN2
520
0.005818
HKR1
521
0.005795
ZNF48
522
0.00578
SDAD1
523
0.005771
ZNF174
524
0.005753
NBPF7
525
0.005728
ZNF117
526
0.005683
GAP43
527
0.005679
C7orf60
528
0.005666
ZNF177
529
0.005659
TNNT1
530
0.005633
ZNF398
531
0.005584
C9orf66
532
0.005582
ZNF701
533
0.005528
RNFT2
534
0.005524
ZNF449
535
0.005503
ZNF498
536
0.005495
ZNF883
537
0.005477
HMGB3
538
0.005466
ZNF286B
539
0.005449
OCEL1
540
0.005428
ZNF735
541
0.005427
ZNF25
542
0.005427
ZNF785
543
0.005411
ZNF343
544
0.005383
ZNF707
545
0.005315
SHOX
546
0.005287
ZNF8
547
0.005256
ZNF225
548
0.005233
NTN5
549
0.005218
C1orf198
550
0.005201
SPANXN3
551
0.005168
ZNF510
552
0.005162
ZNF79
553
0.005157
ZNF562
554
0.005152
TUSC1
555
0.00513
ZNF323
556
0.00513
ZNF432
557
0.005127
ZWINT
558
0.005077
ZNF689
559
0.005073
SDCCAG3
560
0.005054
ZNF239
561
0.005019
SOX9
562
0.004973
ZNF573
563
0.004969
EN1
564
0.004942
SRFBP1
565
0.004761
HMGB2
566
0.004671
ZNF705A
567
0.004669
E2F5
568
0.004645
CWC27
569
0.004638
RALY
570
0.004605
ZNF436
571
0.00458
ZNF70
572
0.004535
ZNF92
573
0.004523
ZNF812
574
0.00444
ZNF200
575
0.004388
YBX2
576
0.004381
C10orf95
577
0.004339
HEXIM1
578
0.004298
ZNF672
579
0.00427
RNF113A
580
0.004257
SPATA31A3
581
0.004195
XRCC4
582
0.004158
SURF6
583
0.004156
ZNF193
584
0.004122
ZNF75A
585
0.004115
ZNF497
586
0.004085
UTP18
587
0.004023
ZNF18
588
0.004016
ZNF670
589
0.003939
ZNF565
590
0.003937
ZNF620
591
0.003936
ZNF736
592
0.003932
VCX
593
0.003881
ZNF34
594
0.003874
GRB2
595
0.003816
ZNF627
596
0.003759
ZAR1L
597
0.003724
ZNF154
598
0.003711
VCX3B
599
0.003642
ZNF275
600
0.003629
ZNF805
601
0.003597
TNNI2
602
0.003529
ZNF501
603
0.003506
FAM157A
604
0.0035
ZNF728
605
0.003473
ZNF674
606
0.003403
PROCA1
607
0.003322
ZNF717
608
0.00329
RPS6
609
0.00329
ZSCAN16
610
0.003272
UBXN1
611
0.003271
AKAP2
612
0.0031
FAM21A
613
0.003099
ZNF706
614
0.002971
ZNF32
615
0.002964
ZNF367
616
0.00286
DLEU7
617
0.002552
HEXIM2
618
0.00252
ZNF524
619
0.002509
PQBP1
620
0.002315
HMGN5
621
0.002314
ZNF705G
622
0.001371
VCX2
623
0.0009513