Protein Disorder in Cancer Subtypes

Protein Disorder Targeting Driver Genes in Cancer

A Thesis Presented to the School of Interdisciplinary Informatics and the Faculty of the Graduate College, University of Nebraska

In Partial Fulfillment of the Requirements for the Degree Master of Science in Biomedical Informatics

University of Nebraska at Omaha

by Ryan A. Hagenson

July 2017

Supervisory committee: Dr. Dario Ghersi, Dr. Kate Cooper, Dr. Parvathi Chundi

Protein Disorder Targeting Driver Genes in Cancer

Ryan A. Hagenson, MS

University of Nebraska, 2017

Advisor: Dr. Dario Ghersi

Cancer is driven by DNA mutations that propagate to the protein level – perturbing biochemistry by modifying interactions within the cell. Of great importance to the proper function of the cell are the protein-protein interactions which define how the body responds to stimuli, both positive and negative. Such interactions often involve two structurally distinct types of protein regions: ordered binding sites and disordered binding targets. Historically, only the ordered half of this complementary pairing has been extensively investigated with respect to how observed DNA mutations in these regions possibly drive cancer. This work represents an initial in silico investigation leveraging data from The Cancer Genome Atlas (TCGA) which shifts the focus to disordered regions. Two measures of protein disorder are used, one scoring individual positions and the other scoring local regions, across 62 mutation profiles: two profiles for each of the 31 cancer types under investigation. Data from each cancer is analyzed via two mutation profiles considering: 1. all observed mutations, and 2. missense mutations only. To ensure novelty, results with prior strong implication in cancer are removed from the final sets – focusing results on potential disorder-targeted genes not yet known. By using the combination of a search for positive selection for a biological property and high-dimensional analysis with conservative statistical cutoffs, novel genes not previously implicated in cancer can be given likely context and internally cross-validated – providing evidence for their potential role in driving cancer. As a result of positional analysis, 77 disorder-targeted genes were characterized. Meanwhile, by regional analysis, 480 disorder-targeted genes were found.

Acknowledgements

I would like to thank all who helped me get to where I am today. To those I inevitably forget to mention by name I extend a special thank you and an apology for my lapse of thought at the moment of writing. I wish to thank my folks, the parents who raised me with a love for learning and whose couch I became acquainted with when balancing work, school, and my future became just a little too much. Thank you for listening to my constant yammering about the latest factoids and now the latest science. To Dr. Garry Duncan, thank you for introducing me to Bioinformatics, a way to combine my dual interest in Biology and Computer Science. To Dr. Bill McClung, thank you for teaching me so much and indulging me in discussions about all areas of my work. I wish I had started learning from you sooner. To Dr. Jessica Petersen, thank you for guiding me on my first Bioinformatics investigation. To Bell Labs and all its employees, you are a constant inspiration and embodiment of what I find most fascinating about computer science and bioinformatics: no challenge can compete with dedicated individuals. Lastly, to Dr. Dario Ghersi, even though your name is elsewhere on this thesis I believe a special thank you is in order. I would not be graduating confidently without the wealth of knowledge I gained from you during each weekly meeting.


Contents

Acknowledgements

1 Introduction
    1.1 Summary
    1.2 Cancer
        1.2.1 Causes of Cancer
            Mutations from External Mutagens
            Mutations from Internal Mutagens
        1.2.2 Cancer Driver Genes
            Tumor Suppressor Genes
            Oncogenes
            Discovering Drivers
        1.2.3 The Cancer Genome Atlas
            Methods
            Cancers in the Atlas
    1.3 Computational Problem
        1.3.1 Past Driver Gene Discovery Methods
    1.4 Hypothesis

2 Proteins
    2.1 Introduction
    2.2 Protein Structure
        2.2.1 Amino Acid Structure
        2.2.2 Primary Structure (1°)
        2.2.3 Secondary Structure (2°)
        2.2.4 Tertiary Structure (3°)
        2.2.5 Quaternary Structure (4°)
    2.3 Protein Folding
    2.4 Protein Mutation
    2.5 Protein Disorder
        2.5.1 Pairwise Amino Acid Interactions
        2.5.2 Hydrophobicity and Net Charge

3 Methodology
    3.1 Introduction
    3.2 Signal Versus Noise
    3.3 Data Preparation
        3.3.1 Data Acquisition
        3.3.2 Dataset Size
    3.4 Disorder Scoring
    3.5 Monte Carlo Simulations
        3.5.1 Steps as a List
    3.6 Binomial Testing
        3.6.1 Steps as a List
    3.7 Enrichment Analysis and Validation

4 Positional Analysis Results
    4.1 Introduction
    4.2 COSMIC Hypergeometric Testing
    4.3 Mutational Prevalence
    4.4 Visualizations of Select Genes
        4.4.1 COADREAD – TBP
            PDB: 1NVP
        4.4.2 BRCA – TBP
            PDB: 1NVP
        4.4.3 STES – CASC3
            PDB: 2J0S
    4.5 Enrichment Analysis
    4.6 Partner Set Enrichment Analysis

5 Regional Analysis Results
    5.1 Introduction
    5.2 COSMIC Hypergeometric Testing
    5.3 Mutational Prevalence
        5.3.1 Both Profiles
        5.3.2 Mutation Prevalence Distributions
    5.4 Visualizations of Select Genes
        5.4.1 TBP.001 in BRCA
            Smoothed Disorder Plot with Mutations
        5.4.2 PLEC.005 in ACC
            Smoothed Disorder Plot with Mutations
        5.4.3 NEFH.001 in ACC
            Smoothed Disorder Plot with Mutations
    5.5 Enrichment Analysis
    5.6 Partner Set Enrichment Analysis

6 Discussion
    6.1 Introduction
    6.2 Intersection of Both Methods of Analysis
        6.2.1 EP400
        6.2.2 TBP
        6.2.3 SRRM2
        6.2.4 NCOA3
        6.2.5 GPRIN2
        6.2.6 ZNF707
    6.3 Enrichment Analyses
        6.3.1 Positional
        6.3.2 Regional
        6.3.3 Regional and Positional Cross-comparison
            Significant Novel Finds Sets
            Binding Partner Sets
    6.4 Disorder Binding Incitation of Cancer
    6.5 COSMIC – Limited Complement
    6.6 On Limit to In Silico Analysis
    6.7 On the High Number of Regional Results
    6.8 Limitations
        6.8.1 Impact of mutations
        6.8.2 Monte Carlo simulations side effect
        6.8.3 Intersection of significance sets
    6.9 Conclusions

Bibliography

A TCGA Cancers

B Positional Supplemental Information

C Regional Supplemental Information

List of Figures

3.1 Positional Steps Flowchart
3.2 Regional Steps Flowchart
4.1 Positional Novel Finds
4.2 Positional Heatmap
4.3 Positional All Mutations Heatmap
4.4 Positional Missense Mutations Heatmap
4.5 Chimera COADREAD – TBP against 1NVP
4.6 Chimera BRCA – TBP against 1NVP
4.7 Chimera STES – CASC3 against 2J0S
5.1 Regional Novel Finds (1 of 6)
5.2 Regional Novel Finds (2 of 6)
5.3 Regional Novel Finds (3 of 6)
5.4 Regional Novel Finds (4 of 6)
5.5 Regional Novel Finds (5 of 6)
5.6 Regional Novel Finds (6 of 6)
5.7 Regional Heatmap (1 of 7)
5.8 Regional Heatmap (2 of 7)
5.9 Regional Heatmap (3 of 7)
5.10 Regional Heatmap (4 of 7)
5.11 Regional Heatmap (5 of 7)
5.12 Regional Heatmap (6 of 7)
5.13 Regional Heatmap (7 of 7)
5.14 Regional Heatmap Distribution By Cancer
5.15 Regional Heatmap Distribution By Gene
5.16 Smoothed TBP.001 Disorder
5.17 Smoothed PLEC.005 Disorder
5.18 Smoothed NEFH.001 Disorder

List of Tables

2.1 The Twenty Common Amino Acids
3.1 Sample Processed Input
3.2 TCGA Cancers in this Study
4.1 Positional COSMIC Difference
4.2 COADREAD TBP Mutations
4.3 BRCA TBP Mutations
4.4 STES CASC3 Mutations
4.5 Positional Analysis Uncorrected Results
4.6 Positional Results Interaction Partner Set Enrichment Analysis
5.1 Regional COSMIC Difference
5.2 BRCA TBP Mutations
5.3 PLEC.005 ACC Mutations
5.4 NEFH ACC Mutations
5.5 Regional Analysis Uncorrected Results
5.6 Regional Results Interaction Partner Set Enrichment Analysis
6.1 Intersect of Positional and Regional Novel Finds
A.1 TCGA Selected Cancers
B.1 Top 50 Interaction Partner Set Enrichment Terms – Positional
C.1 Top 50 Interaction Partner Set Enrichment Terms – Regional
C.2 By Cancer Mean Distribution – Both Profiles
C.3 By Gene Mean Distribution – Both Profiles

Chapter 1

Introduction

1.1 Summary

This work represents a focal shift in the process of in silico discovery of cancer driver genes. Historically, there has been a general trend toward observing how DNA mutations propagate into disrupting ordered protein regions – causing either an acceleration or deceleration of biochemical function, which can be linked to driving cancer. This focus is heavily influenced by the classic structure-function paradigm of proteins which states that a well-defined structure is necessary for well-defined function, or more simply that "structure dictates function" – a paradigm supported by the seminal models of Fischer (1894) and Pauling, Corey, and Branson (1951), found to be true in early experimentally-determined structures (Blake et al., 1965; Kendrew, 1961), and lastly by the denaturation experiments of Anfinsen (1973). This paradigm, although known now to be only mostly true, gave rise to the long since disproven one-gene, one-protein belief that followed the characterization of DNA as the genetic material of the cell. Now we know that a single gene can result in numerous different versions of a protein product; these differing versions are known as protein isoforms and are primarily the result of alternative splicing (Modrek & Lee, 2002; Kornblihtt et al., 2013; Black, 2003; Ast, 2004).

When considering mutations across isoforms, it is likely that a single mutation will affect isoforms differently. This is partially due to many isoforms being of different lengths – one isoform may be shorter than its sister isoforms because a (mutated) region has been removed. Isoforms may also be slight rearrangements of one another – thus splicing a mutation into a new structural context. For the purposes herein, this differing nature of isoform mutations is not explored. Rather, the most mutated, shortest isoform of each gene is taken to represent the worst-case scenario – most mutated to increase the number of individual observations per gene, and shortest to increase the degree to which these mutations may perturb the underlying chemistry. These criteria are sufficient for an initial pass; however, the results would be expected to change with other criteria. Selecting one isoform per gene is necessary since it allows for multiple hypothesis correction over discrete, statistically-independent tests while keeping computational and logical complexity lower for this initial focal shift within in silico driver gene discovery.

By searching for a specific biological property within a large search space using data from the disease state of interest, one can find novel genes that are implicated in that disease state via the specific biological property. Using conservative statistical cutoffs, the confidence in these results is increased – provided the biological property has a potential role in driving the disease state. Here, potential cancer driver genes are discovered by searching for protein disorder-targeting mutations across the largest public cancer mutation data search space. Protein disorder, a ubiquitous property within protein-protein interaction networks, has a strong potential role in driving cancer specifically by disrupting essential protein-protein interactions.
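The rule of taking the most mutated, shortest isoform per gene can be sketched in code. The following is a minimal illustration rather than the implementation used in this work: the `Isoform` record, its field names, the sample values, and the tie-breaking order (most mutated first, then shortest, then identifier) are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Isoform:
    """Hypothetical record describing one protein isoform of a gene."""
    gene: str
    isoform_id: str
    length: int          # protein length in residues
    mutation_count: int  # observed mutations mapped to this isoform


def rank_key(iso):
    # Assumed priority: more mutations first, then shorter length,
    # then identifier so that the choice is deterministic.
    return (-iso.mutation_count, iso.length, iso.isoform_id)


def select_representatives(isoforms):
    """Return one representative isoform per gene as {gene: Isoform}."""
    best = {}
    for iso in isoforms:
        current = best.get(iso.gene)
        if current is None or rank_key(iso) < rank_key(current):
            best[iso.gene] = iso
    return best


# Illustrative values only; these are not TCGA counts.
candidates = [
    Isoform("GENE_A", "GENE_A.001", length=339, mutation_count=12),
    Isoform("GENE_A", "GENE_A.002", length=319, mutation_count=12),
    Isoform("GENE_B", "GENE_B.001", length=4684, mutation_count=40),
    Isoform("GENE_B", "GENE_B.002", length=4525, mutation_count=31),
]
representatives = select_representatives(candidates)
```

With one representative per gene, each gene contributes a single test, which is what makes a per-gene multiple hypothesis correction straightforward. A different priority order (for example, shortest first) would be expected to select different isoforms.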

1.2 Cancer

Cancer is a disease marked by the breakdown of cellular machinery due to somatic DNA mutations (Hanahan & Weinberg, 2011). The naïve thought on potential driver genes of cancer would suggest they are as varied as the potential mutations which can occur, across the variety of people in whom they can occur, while still supporting life – an impossibly large number of possibilities. Thankfully, this impossibly large number is not the true number of cancer drivers. Instead, there exists a much smaller number of drivers which can be roughly characterized as frequently-mutated or infrequently-mutated driver genes (Vogelstein et al., 2013). The largest attempt to identify and record observed mutations in patients with cancer is The Cancer Genome Atlas (TCGA).1 The TCGA datasets represent our most comprehensive observations to date despite being hardly a fraction of the total possibilities (Tomczak, Czerwińska, & Wiznerowicz, 2015). We should not expect to ever fully capture the true variety of cancer mutations with our observations due to the rarity of mutations and the size of the human genome. The discrepancy between observations and potential drivers is where Bioinformatics and in silico analysis can aid us. By comparing the small number of observations we do have against the large number of observations that are theoretically possible we can identify significant differences. This process is not without its own shortcomings as discussed in Section 1.3 below.

As a disease, cancer is especially complex due to the great variety of mutations that can occur, where they occur, which tissue(s) they are affecting, the individual’s personal genetics, diet, and activity level, and still more factors affecting disease progression (Campisi, 2013; Lawrence et al., 2014; Vogelstein et al., 2013). We know far more about cancer today than we ever have before due in no small part to our increasing ability to capture data on these factors and thereby understand how they collectively contribute to cancer (Y. Chen et al., 2014; Cheng et al., 2014; Surget, Khoury, & Bourdon, 2013). Although many people know of certain risk factors for cancer such as overexposure to UV radiation via sunlight, there are also more subtle risk factors.

1 https://cancergenome.nih.gov/

1.2.1 Causes of Cancer

Cancer being driven by mutations means that almost any aspect of life that causes mutations, prevents proper repair of genetic mistakes, or otherwise increases the mutation rate in a cell line can be linked to increased risk for developing cancer. This, quite thankfully, does not mean one will develop cancer, only that the risk is greater, as carcinogenesis is a process, not an event (Kasper et al., 2015). Understanding cancer risk is made even more difficult by seemingly contradictory scientific findings. Nowhere is this more obvious than in layman news interpretations of the latest cancer research findings stating that some commonplace routine such as drinking coffee increases cancer risk one day2 but previously decreased cancer risk.3 These news stories are often not overtly wrong, but are reductionist to the point of blurring the truth from the latest original scientific report (Loomis et al., 2016). The underlying truth is that the complex nature of oncogenesis cannot so easily be linked to such commonplace activities because cancer is a disease of mutations and thus is different on a person-by-person, case-by-case basis with more critical risk factors to consider than whether a person drinks coffee or not. However, there are routines with undeniably strong evidence for causing cancer – tobacco use has been epidemiologically linked to increased risk for developing cancer and this causal link is no longer debated (Boffetta, Hecht, Gray, Gupta, & Straif, 2008; Denissenko & Pao, 1996; Vineis et al., 2004). Ultimately, with cancer being driven by mutations it is important to understand the two major classes of mutagens: external to the body and internal to the body.

2 http://www.cnn.com/2016/06/15/health/coffee-tea-hot-drinks-cancer-risk/

3 http://www.cbsnews.com/news/new-findings-on-coffee-and-cancer-risk/

Mutations from External Mutagens

External mutagens are those which occur outside the body and affect the mutation rate inside the body. Even people without a biomedical research background understand these mutagens quite clearly and (most) take active steps to avoid them. Often these are observed as chemicals that a person is exposed to or activities they willingly do that are linked with an increased risk for cancer. Examples include using tobacco (DeMarini, 2004; Hecht, 1999) and increased exposure to UV radiation via sunlight (D’Orazio, Jarrett, Amaro-Ortiz, & Scott, 2013; de Gruijl, 1999). Such activities cause chemical changes within the cell which lead to DNA mutations in the cell. Other commonly known external mutagens are radiation and chemical spills, the fear of which, this researcher believes, has led to increased environmental regulations to protect citizens from less easily self-prevented risk factors. The less easily identified risks are no less important than the risks that are easily avoided.

Mutations from Internal Mutagens

Internal mutagens are those which occur inside the body to affect the mutation rate inside the body. Often these are far less understood by people without a biomedical research background and are often understated even by people with such a research background. One internal mutagen is the failure of DNA repair mechanisms to correct a mistake made during DNA replication. As DNA is replicated prior to cellular division, a complete copy must be made that will split off into the daughter cell following division. This replication process involves unzipping the DNA double helix via DNA helicase and building two new complementary strands via DNA polymerase. The average mistake rate for DNA polymerase is one in one hundred thousand (1/100,000) nucleotides. When we consider that there are roughly six billion (6,000,000,000) base pairs in a human diploid cell, and therefore twelve billion nucleotides to synthesize, this equals an average of one hundred twenty thousand (120,000) errors at each division (Pray, 2008). The cell is able to repair most but not all of these mistakes, and if it cannot repair the mistakes it should mark the cell for termination as a major deviation from the healthy cell line. Any mistakes that are not corrected, or deviated cell lines that are not terminated, are by definition mutations, and these mutations have the risk of being oncogenic.
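The error estimate above can be checked with a short calculation. This sketch assumes the six billion figure counts base pairs, so that two nucleotides are synthesized per base pair during replication; that reading is what reconciles the stated error rate with the 120,000 figure. The constant names are illustrative.

```python
from fractions import Fraction

# Figures quoted in the text (Pray, 2008).
POLYMERASE_ERROR_RATE = Fraction(1, 100_000)  # errors per nucleotide copied
DIPLOID_BASE_PAIRS = 6_000_000_000

# Assumption: each base pair contributes two nucleotides to be synthesized,
# one on each newly built strand.
NUCLEOTIDES_PER_BASE_PAIR = 2

nucleotides_synthesized = DIPLOID_BASE_PAIRS * NUCLEOTIDES_PER_BASE_PAIR
expected_errors = POLYMERASE_ERROR_RATE * nucleotides_synthesized

print(int(expected_errors))  # 120000
```

Counting six billion single positions instead would give 60,000 expected errors, so the per-division estimate depends on which unit is meant.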



Another internal mutagen is the progressive shortening of telomeres with each cell division, which is partially why cancer is more common later in life. Telomeres are repeated, non-coding segments of DNA at the ends of chromosomes which protect the internal coding portions from mutation and degradation by being mutated and degraded themselves. As telomeres shorten with age, the coding portions are exposed to mutation and degradation (Blasco, 2005).

1.2.2 Cancer Driver Genes

There are two major classes of cancer driver genes: 1. tumor suppressor genes, and 2. oncogenes (R. A. Weinberg, 1994; Lehman et al., 1991; E. Y. H. P. Lee & Muller, 2010).

Tumor Suppressor Genes

Tumor suppressor genes are the "brakes" on tumorigenesis intended to stop the rapid cellular proliferation and growth characteristic of a tumor. These genes are single points of failure that follow the Knudson two-hit hypothesis (Knudson, 1971; Nordling, 1953; Hutchinson, 2001) and thus mutations within them tend to present fairly uniform results. This uniformity is a result of mutations having the same loss-of-function effect: preventing the gene from stopping the growth of tumors effectively, which presents the same no matter the causing mutation. A notable example of a tumor suppressor gene is p53 (or TP53), which is ubiquitous and provides a check for deviated cell lines during the G1/S regulation point of the cell division cycle just prior to dividing. Many mutations can result in p53 malfunction and that is why this driver gene is implicated in >50% of cancers (Surget et al., 2013) – it is a single point of failure where malfunction means cell lines are not subjected to the proper health check prior to dividing.

Oncogenes

Oncogenes are the "gas" on oncogenesis with a variety of intended functions which are accelerated via mutation, causing a variety of the notable cancer hallmarks. The diversity of these genes means there is greater diversity in their biochemical presentation. Newly discovered driver genes tend to fall into this class because their variety means different approaches analyze new and different contexts for how a gene might be driving oncogenesis. Oncogenes are set in motion by specific "driver" mutations while most mutations within them are random "passenger" mutations which are not, themselves, oncogenic (R. A. Weinberg, 1984; Chial, 2008; Todd & Wong, 1999; Stehelin, 1995). A notable example of an oncogene is telomerase, which is oncogenic by causing cancer cells to lengthen their telomeres – aiding in cancer cell immortality.

Discovering Drivers

Discovering cancer driver genes requires many levels of analysis. Newer studies tend to look for positive selection for a biological property with potential in driving cancer. This is an effective combination of biological hypothesis and high-dimensional data analysis. In order to not bias results in these types of analyses, capturing the mutational landscape of cancer must be done in as systematic a way as possible. Currently, the most systematic and comprehensive approach to discovering the mutations noted in cancer is The Cancer Genome Atlas (TCGA).


1.2.3 The Cancer Genome Atlas

The Cancer Genome Atlas (TCGA) is the leading effort to catalog genetic mutations in cancer via high-throughput genomics – bettering our understanding of the genetic basis of cancer with a primary goal of improving diagnosis, treatment, and prevention of cancer. Over its lifespan from 2005 to 2017 (time of this study), it collected 2.5 petabytes of data, from more than 11,000 patients, describing the mutational observations of 33 cancer types. The TCGA data used in this study is from July 18th, 2016.

Methods

The TCGA Research Network consists of many parts; each part is integral to achieving TCGA’s central goal – beginning with the Biospecimen Core Resource (BCR), which reviews and processes the initial blood and tissue samples, and ending with the Analysis Working Groups (AWGs), which are made up of scientific and clinical experts analyzing a single type of cancer across all TCGA methods and who publish a comprehensive analysis of findings.

Cancers in the Atlas

Under TCGA investigation there are 33 tumor types (see Table A.1), of which 31 cancer types are included in this work (see Table 3.2). The two cancers present in TCGA not analyzed here are Mesothelioma (MESO) and Acute Myeloid Leukemia (LAML), which were excluded due to using an older version of the human reference genome at the time of SNP characterization.


1.3 Computational Problem

The major computational problem within cancer genomics is distinguishing signal from noise – driver mutation from passenger mutation – which allows us to further understand the disease process.

1.3.1 Past Driver Gene Discovery Methods

Detailing all past methods would be impossible; therefore, a select few methods will be discussed. Past computational methods have focused on, or integrated analysis across, the areas of: 1. somatic copy-number alterations (SCNAs), as is the case with GISTIC (Mermel et al., 2011); 2. protein-coding region length, variations in mutation types, and multiple mutations in one gene, as is the case with DrGaP (Hua et al., 2013); and 3. signals of positive selection, as is the case with MuSiC (Dees et al., 2012), OncodriveFM (Gonzalez-Perez & Lopez-Bigas, 2012), OncodriveCLUST (Tamborero, Gonzalez-Perez, & Lopez-Bigas, 2013), and E-Driver (Porta-Pardo & Godzik, 2014). The methods leveraging positive selection all share the use of a baseline mutation profile or rate in order to differentiate between random (passenger) mutations and driver mutations. Notably, none of these methods focuses on investigating regions of disorder.
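The shared idea of comparing observations against a baseline mutation rate can be illustrated with a one-sided binomial test. This is a generic sketch and not the procedure of any tool named above: it assumes each of `n_positions` positions is mutated independently with probability `background_rate`, and the function names and example numbers are hypothetical.

```python
from math import exp, lgamma, log, log1p


def binomial_log_pmf(k, n, p):
    """Log of P(X = k) for X ~ Binomial(n, p), computed in log space."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log1p(-p)


def upper_tail_p_value(observed, n_positions, background_rate):
    """P(X >= observed) under the assumed background (passenger) model."""
    if not 0.0 < background_rate < 1.0:
        raise ValueError("background_rate must lie strictly between 0 and 1")
    if not 0 <= observed <= n_positions:
        raise ValueError("observed must lie between 0 and n_positions")
    total = sum(
        exp(binomial_log_pmf(k, n_positions, background_rate))
        for k in range(observed, n_positions + 1)
    )
    return min(1.0, total)  # guard against floating-point overshoot


# Hypothetical gene: 500 positions, background rate of 0.002 per position,
# 6 mutated positions observed.
p_value = upper_tail_p_value(observed=6, n_positions=500, background_rate=0.002)
```

A small p-value indicates more mutations than the background model would explain, which is the kind of signal that separates candidate driver genes from passenger noise. Real methods use richer background models than a single per-position rate.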


1.4 Hypothesis

I propose that by studying the effects of cancer mutations within inherently disordered regions, we can further understand how cancer manipulates cellular chemistry, disrupting healthy processes. Because this is a major shift in focus from historical cancer driver gene discovery approaches, novel drivers are expected to be found.

Chapter 2

Proteins

2.1 Introduction

For this work, and others like it, analysis at the protein level is necessary; here a brief overview of proteins is presented to provide context for the analysis. Without an understanding of proteins, the positive selection for a protein biological property and how such selection might drive cancer cannot be understood. Proteins are biopolymers made up of a string of amino acids and are the actors of biochemical activity. They are important for driving cellular chemistry by catalyzing reactions, acting as signals for processes, providing structural support to cells, helping other proteins fold, and much more. As the final product of the Central Dogma of Molecular Biology (a gene coded in DNA is transcribed into RNA, which is then translated into protein), these functional biomolecules are responsible for nearly all biochemical activity within the cell. Due to this, proteins serve as the chemical carriers for DNA mutations – often being the biological component enacting damage due to the mutation. A single gene in DNA can result in multiple related protein products – these related products are called protein isoforms and are produced by alternative splicing (Modrek & Lee, 2002; Kornblihtt et al., 2013; Black, 2003; Ast, 2004). Therefore, a mutation at the DNA level is likely to affect more than one protein isoform.
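How a DNA mutation carries through to the protein level can be shown with a toy transcription and translation routine. The sketch below uses the standard genetic code; the sequence, the function names, and the simplification of reading a single forward frame from the first base are assumptions for illustration, and splicing is ignored entirely.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Coding-strand DNA to mRNA: thymine (T) is replaced by uracil (U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    residues = []
    for start in range(0, len(mrna) - 2, 3):
        codon = mrna[start:start + 3].replace("U", "T")
        residue = CODON_TABLE[codon]
        if residue == "*":
            break
        residues.append(residue)
    return "".join(residues)


# Toy gene: Met-Ala-Lys followed by a stop codon.
dna = "ATGGCCAAATAA"
protein = translate(transcribe(dna))            # "MAK"

# A single-base substitution (C to A at index 4) changes codon GCC to GAC.
mutated_dna = dna[:4] + "A" + dna[5:]
mutated_protein = translate(transcribe(mutated_dna))  # "MDK"
```

Here a one-base change in DNA replaces alanine with aspartate in the protein, which is the sense in which proteins act as the chemical carriers of DNA mutations.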

2.2 Protein Structure

Protein structure is broken up into four categories: primary structure (1°), secondary structure (2°), tertiary structure (3°), and quaternary structure (4°). Each structural level builds on the levels before it. These levels are discussed in detail below.

2.2.1

Amino Acid Structure

Before discussing the levels of protein structure, it is important to understand the basic structure of amino acids, the repeating subunits of the protein biopolymer. All amino acids are composed of four components bonded to a central carbon atom. These four components are: 1. a single hydrogen atom (−H), 2. an amine functional group (−NH2), 3. a carboxyl functional group (−COOH), and 4. most importantly, a side chain specific to each amino acid (−R). The side chain identifies the amino acid and determines its chemistry (i.e., whether it is polar or non-polar, aromatic or aliphatic, charged or non-charged). See Table 2.1 for how the chemistry differs between amino acids.

2.2.2

Primary Structure (1°)

Proteins are made up of a string of individual amino acids. Within human biology, there are 20 common amino acids (listed in Table 2.1) which make up all proteins. The linear, string sequence of amino acids is the primary (1°) protein structure. (This is the only one-dimensional protein structure and thus is the one most often used in bioinformatics.)

2.2.3

Secondary Structure (2°)

As the protein begins to fold, it interacts with other residues and the environment to take on localized, 3D conformations that reduce localized energy levels. These local conformations are considered the secondary (2°) protein structure and include: alpha helices, beta sheets, and turns/loops. Of these secondary elements, only turns/loops are fairly disordered.

2.2.4

Tertiary Structure (3°)

As the protein forms its secondary structure and continues to fold, it will continually assume the lowest overall energy state possible¹ until the entire protein has been folded. This final folded structure of one original primary sequence chain is considered the tertiary (3°) structure. It is important to note that a tertiary structure is one continuous amino acid chain that has taken on a 3D folded structure. The structure of some proteins ends at this level, since it is often stable and functional.

¹ This is without considering the role of chaperone proteins, which help proteins fold in ways that would otherwise be chemically unstable in the process.

TABLE 2.1: A brief summary of the twenty common amino acids. Full name, shortened name, single letter code, and a broad chemical classification are included for each. Reorganization of table at: http://wbiomed.curtin.edu.au/biochem/tutorials/AAs/AA.html

Full name        Shortened name   Single Letter Code

aliphatic (non-polar)
Glycine          Gly              G
Alanine          Ala              A
Valine           Val              V
Leucine          Leu              L
Isoleucine       Ile              I
Proline          Pro              P

aromatic (non-polar)
Phenylalanine    Phe              F
Tyrosine         Tyr              Y
Tryptophan       Trp              W

polar, non-charged
Serine           Ser              S
Threonine        Thr              T
Cysteine         Cys              C
Methionine       Met              M
Asparagine       Asn              N
Glutamine        Gln              Q

positively charged
Lysine           Lys              K
Arginine         Arg              R
Histidine        His              H

negatively charged
Aspartate        Asp              D
Glutamate        Glu              E

2.2.5

Quaternary Structure (4°)

Not all proteins have a quaternary structure. The quaternary (4°) structure is formed from multiple independent amino acid chains interacting with one another to form a complex. Each of these chains is capable of folding independently into a tertiary structure, and together they form a final, functional protein complex.

2.3

Protein Folding

According to the framework model of protein folding, proteins begin to fold as they are being synthesized, first forming localized secondary elements at one end prior to the synthesis of the other terminal end. There are two primary chemical driving forces behind protein folding; in order of strength, these are: 1. the burial of hydrophobic side chains away from the aqueous environment, termed the entropic penalty, and 2. the reduction in total solvent-accessible surface area (Dill, Ozkan, Shell, & Weikl, 2008). Due to these chemical drivers, most proteins end up with a hydrophobic core and a hydrophilic surface. However, burying hydrophobic amino acids is sometimes not possible, especially in the early stages of folding. If these hydrophobic side chains were left exposed, the result would be protein aggregation via the same entropic penalty that drives their burial: hydrophobic amino acids on the surface of the protein being synthesized would be driven toward hydrophobic surface amino acids on other proteins rather than inward (Kessel & Ben-Tal, 2011). Such aggregation would present a major and highly prevalent problem if the folding process were entirely stochastic; however, there exist chaperone proteins which support and protect a protein as it folds (Garrett & Grisham, 2013). Chaperone proteins lower the overall energy barrier, allowing folding into lower energy states that would first require adopting a higher, unfavorable energy state (Q. Liu & Craig, 2016; Hendrick & Hartl, 1993), as would be the case when hydrophobic amino acids are temporarily exposed in order to bury them further than before.

2.4

Protein Mutation

Proteins are very rarely mutated directly, and when they are, they rarely remain in the cell for long because protein turnover replaces the mutated protein with a healthy one. Rather, most protein mutations can be linked back to an original DNA mutation which propagated to the protein level. Structurally, a mutation can occur within ordered regions such as binding or catalytic sites, or within disordered regions such as protein-protein interaction junctions (there are also transition regions between these two). Since every amino acid has unique chemistry, protein mutations rarely preserve the same level of functionality; they accelerate or stunt protein activity depending on the chemistry of the healthy and mutated amino acids. There are many classifications of protein mutations, each with its own semantic weight; however, only two mutually exclusive classifications are used herein: 1. synonymous mutation, no amino acid change despite a DNA mutation, and 2. missense mutation, an amino acid change due to a DNA mutation. It has been shown that synonymous mutations can result in effects at the protein level (Goymer, 2007; Hunt, Simhadri, Iandoli, Sauna, & Kimchi-Sarfaty, 2014; Sauna & Kimchi-Sarfaty, 2011) and even frequently drive cancer (Supek, Miñana, Valcárcel, Gabaldón, & Lehner, 2014). However, a stronger case can be made for how a missense mutation may drive cancer through perturbed chemistry. Therefore, two mutation profiles are explored in this study: all mutations (synonymous and missense) and missense-only (no synonymous mutations). The natural third profile, synonymous-only, would be nearly uninterpretable in itself.

2.5

Protein Disorder

Protein order/disorder is the measure of how well-defined the 3D conformational location of a given residue within the final folded protein is. An ordered region is one that adopts a well-defined 3D conformation, while a disordered region may adopt no apparent structure or many similar structures depending on cellular conditions. Protein regions are made up of discrete residues, each with its own order/disorder. Each residue can have as many potential inter-residue interactions as there are other residues in the protein. The combination of amino acid interactions is what leads to the native, or biologically functional, 3D structure of a protein, balancing attractive and repulsive forces to form the final conformation. One way of measuring the disorder of a protein is to consider each potential pairwise interaction across the length of the protein. In a protein only 100 amino acids in length, this would be $\binom{100}{2}$, or 4950, possible pairwise interactions, a number that grows quickly, with a length of 200 giving $\binom{200}{2}$, or 19,900, pairwise interactions.

Realistically, most residues do not interact with most other residues, so not all combinations must be considered. In fact, the naïve method of considering all possible combinations leads to inaccurate measures of order/disorder because it neglects proximity entirely. De novo measures of disorder therefore commonly use sliding windows which consider interactions only within a certain sequence proximity range. Because protein folding is known to be driven by the burial of hydrophobic side chains and the reduction of surface area (see Section 2.3 for more detail), the final folded tertiary structure can be estimated in silico from known properties of each individual amino acid in the primary sequence. These estimations approximate protein disorder by assigning a value to how predictable each residue's position is in the final structure. The two chemical measures used herein to estimate protein disorder from the primary sequence are: 1. pairwise amino acid interactions, and 2. hydrophobicity and net charge. Both measure the favorability of amino acid interactions in order to predict how the primary sequence will form secondary structures and the final tertiary structure.
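As a small illustration of the counts discussed above, the sketch below compares the number of all residue pairs with the number of pairs that fall inside a sequence proximity window. The function names and the exact window definition (pairs at most `window` positions apart) are assumptions made for illustration and are not taken from IUPred or FoldIndex©.

```python
from math import comb


def all_pairs(length: int) -> int:
    """Number of unordered residue pairs in a protein of the given length."""
    return comb(length, 2)


def windowed_pairs(length: int, window: int) -> int:
    """Number of residue pairs (i, j) with 0 < j - i <= window.

    For residue i (0-based) there are min(window, length - 1 - i)
    partners further along the sequence that are still inside the window.
    """
    return sum(min(window, length - 1 - i) for i in range(length))


print(all_pairs(100))  # 4950
print(all_pairs(200))  # 19900
```

Restricting attention to a window makes the number of considered pairs grow roughly linearly with protein length rather than quadratically.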

2.5.1

Pairwise Amino Acid Interactions

The chemical natures of different amino acids generate either attractive (favorable) or repulsive (unfavorable) pairwise interactions. Two polar amino acids of opposite charge or two non-polar amino acids will have favorable interactions, while two polar amino acids with the same charge or a polar and non-polar amino acid pair will have unfavorable interactions. The IUPred method (Dosztányi, Csizmók, Tompa, & Simon, 2005) used herein to measure positional disorder scores is based on the ENERGI method, created by Thomas and Dill (1996), of determining energy-like quantities for pairwise amino acid interactions. Using pairwise interaction energies in this way allows each position within a protein sequence to be given a score that corresponds to how well we can predict the final 3D conformational location of that position. The IUPred method uses a scale from 0 to 1, reported to four decimal places, where 0 is complete order and 1 is complete disorder (Dosztányi, Csizmók, Tompa, & Simon, 2005). This method of positional score determination was chosen for its ability to distinguish partially disordered proteins from fully disordered proteins and is currently one of the best methods for measuring positional disorder, outperforming DISOPRED2 (Ward, McGuffin, Bryson, Buxton, & Jones, 2004) and VL3-H (Obradovic et al., 2003), both of which use a trained artificial intelligence model for disorder determination.

2.5.2

Hydrophobicity and Net Charge

With the strongest driving force behind protein folding being the entropic penalty, which forces the burial of hydrophobic amino acids, measures of hydrophobicity and net charge (an effective estimator of hydrophilicity) correlate strongly with the ordered/disordered nature of the final folded structure. A region of highly hydrophobic amino acids indicates the region will likely be membrane-bound and thus more likely to be ordered, while a mixed region (alternating hydrophobic residues and hydrophilic residues) is unlikely to be bound and thus more likely to be disordered. FoldIndex©, a method by Prilusky et al. (2005), uses an algorithm by Uversky, Gillespie, and Fink (2000) to define a boundary line between regions of folded order and unfolded disorder. Values from this method are bound between -1 and 1, with positive values being likely folded (ordered) regions and negative values being likely unfolded (disordered) regions.


Chapter 3

Methodology

3.1

Introduction

Discovery of driver genes by focusing specifically on regions with a particular biological property is a fairly standard approach. In fact, computational approaches to driver gene discovery all but require a measurable property and a biological basis for how that property can drive cancer. Past methods have considered: 1. somatic copy-number alterations (SCNAs), as is the case with GISTIC (Mermel et al., 2011), 2. protein-coding region length, variations in mutation types, and multiple mutations in one gene, as is the case with DrGaP (Hua et al., 2013), and 3. signals of positive selection, as is the case with MuSiC (Dees et al., 2012), OncodriveFM (Gonzalez-Perez & Lopez-Bigas, 2012), OncodriveCLUST (Tamborero, Gonzalez-Perez, & Lopez-Bigas, 2013), and E-Driver (Porta-Pardo & Godzik, 2014). Critically, the positive-selection methods (among which this work is counted) all face the same computational challenge of differentiating signal from noise in order to draw their conclusions.

3.2

Signal Versus Noise

Differentiating signal from noise is a problem in more than just bioinformatics; it is important in any field where random observations are able to mask important observations (T. T. Liu, 2016; Edwards, Russell, & Stott, 1998). There are many complex methods, such as the Fourier transform (Fourier, 1822), that allow relative sense to be made of seemingly random input; however, within driver gene discovery the background-anomaly approach is typically used in conjunction with a biological property (Kamburov et al., 2015; Tamborero, Gonzalez-Perez, & Lopez-Bigas, 2013; Tamborero, Lopez-Bigas, & Gonzalez-Perez, 2013; Gonzalez-Perez & Lopez-Bigas, 2012). Establishing a background rate or level for a biological property allows one to begin differentiating signal from noise via deviations from this background. The work herein is a focal shift from past driver gene discovery methods in that it focuses on the under-investigated property of protein disorder. Because this property characterizes proteins differently than before, it is expected to yield results not found by other methods. To do this, two approaches were taken: positional analysis via Monte Carlo simulations and regional analysis via binomial testing, both leveraging data from The Cancer Genome Atlas (TCGA).

3.3

Data Preparation

Raw TCGA data were processed following the same procedure as in Ghersi and Singh (2014). In short, the chromosomal coordinates provided by TCGA were mapped to their protein sequence positions by using the Human Genome Reference (GRCh37.p10). The head of a sample input after this procedure can be seen in Table 3.1; this is the file ACC_mut.txt, representing the cancer background of Adrenocortical carcinoma considering all mutations (both missense and synonymous mutations).

TABLE 3.1: The first 10 rows of ACC_mut.txt. This format represents the effective input to analysis herein following the mapping of raw TCGA chromosomal coordinates to protein sequence positions.

Isoform      TCGA Barcode       DNA Position   DNA Start   DNA End   Protein Position   Protein Start   Protein End
A1BG.001     TCGA-OR-A5KB-01A   281            G           A         94                 R               H
A1CF.001     TCGA-OR-A5KB-01A   1167           C           A         389                G               G
A1CF.002     TCGA-OR-A5KB-01A   1191           C           A         397                G               G
A1CF.003     TCGA-OR-A5KB-01A   1191           C           A         397                G               G
A1CF.004     TCGA-OR-A5KB-01A   1167           C           A         389                G               G
A1CF.005     TCGA-OR-A5KB-01A   1215           C           A         405                G               G
A1CF.006     TCGA-OR-A5KB-01A   1191           C           A         397                G               G
A4GALT.001   TCGA-OR-A5JY-01A   903            C           G         301                P               P
AACS.001     TCGA-OR-A5LD-01A   306            A           C         102                A               A
AACS.001     TCGA-PK-A5HB-01A   103            G           C         35                 A               P
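The two mutation profiles used in this study can be derived from records in the format of Table 3.1. The following is a minimal sketch, assuming a record is synonymous exactly when its Protein Start and Protein End amino acids match (per the definitions in Section 2.4); the `Mutation` type and its field names are hypothetical and chosen only to mirror the table columns.

```python
from typing import List, NamedTuple


class Mutation(NamedTuple):
    """One mapped mutation record (hypothetical field names)."""
    isoform: str
    barcode: str
    protein_position: int
    protein_start: str  # amino acid before the mutation
    protein_end: str    # amino acid after the mutation


def is_missense(m: Mutation) -> bool:
    """A missense mutation changes the amino acid; a synonymous one does not."""
    return m.protein_start != m.protein_end


def missense_only(mutations: List[Mutation]) -> List[Mutation]:
    """Build the missense-only profile from the all-mutations profile."""
    return [m for m in mutations if is_missense(m)]


# Three rows taken from Table 3.1.
rows = [
    Mutation("A1BG.001", "TCGA-OR-A5KB-01A", 94, "R", "H"),
    Mutation("A1CF.001", "TCGA-OR-A5KB-01A", 389, "G", "G"),
    Mutation("AACS.001", "TCGA-PK-A5HB-01A", 35, "A", "P"),
]
print([m.isoform for m in missense_only(rows)])  # ['A1BG.001', 'AACS.001']
```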

3.3.1

Data Acquisition

Cancer mutation data were obtained from the latest available TCGA¹ run on July 18th, 2016 from the Broad Institute Firehose system,² in the form of Data Level 2 (Processed Data), which is the level of consensus results from processing the raw genome sequencing reads (Data Level 1 Raw Data). This data includes 33 cancer backgrounds (Table A.1), while the work here analyzes 31 cancer backgrounds (Table 3.2). The remaining two cancer backgrounds from TCGA, Mesothelioma (MESO) and Acute Myeloid Leukemia (LAML), were excluded due to using an older version of the human reference genome at the time of data preparation. It should also be noted that in this analysis Colon adenocarcinoma [COAD] and Rectum adenocarcinoma [READ] are combined into a single Colon and Rectum adenocarcinoma [COADREAD] background. The same is true for Esophageal carcinoma [ESCA] and Stomach adenocarcinoma [STAD], which are combined into a single Stomach and Esophageal carcinoma [STES] background. These combinations are due to the component cancer backgrounds being indistinguishable from one another (Muzny et al., 2012; Bass et al., 2014).

¹ https://cancergenome.nih.gov/
² http://firebrowse.org

3.3.2

Dataset Size

The TCGA dataset used contained information on 95,836 isoforms from 31 cancer types, each combination of which was processed positionally and regionally for a total of three disorder score profiles: 1. IUPred 'short' (positional), 2. IUPred 'long' (positional), and 3. FoldIndex© (regional).

3.4

Disorder Scoring

Positional analysis was done on measurements by IUPred long, or a 100 residue interaction window, and IUPred short, or a 25 residue interaction window (Dosztányi, Csizmók, et al., 2005); regional analysis was done on measurements by FoldIndex© (Prilusky et al., 2005), which uses a default window size of 51 residues. Calculations for both positional measurements are based on pairwise chemical interaction energies across their respective window sizes and smoothed over a window size of 21 residues; this is in accordance with the IUPred method (Dosztányi, Csizmók, et al., 2005). The IUPred positional long and short measurements are processed concurrently, but separately at each step. Calculations for regional measures are based on the Kyte/Doolittle scale (Kyte & Doolittle, 1982) of hydrophobicity and net charge, considering the mean of both values across the window; this is in accordance with the FoldIndex© method (Prilusky et al., 2005).

TABLE 3.2: The 31 cancer types involved in this study. COAD and READ were combined because their backgrounds are indistinguishable (Muzny et al., 2012), while STES was not part of the original pilot project, but was investigated by Bass et al. (2014) and subsequently added to TCGA. STES is a combination of two cancers, stomach and esophageal carcinomas, into one unified cancer background. Number of subjects is based on unique TCGA barcodes within each cancer dataset.

Identifier   Cancer Type                                                         Number of Subjects
ACC          Adrenocortical Carcinoma                                            90
BLCA         Bladder Urothelial Carcinoma                                        130
BRCA         Breast Invasive Carcinoma                                           987
CESC         Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma    194
CHOL         Cholangiocarcinoma                                                  35
COADREAD     Colon Adenocarcinoma [COAD] & Rectum Adenocarcinoma [READ]          295
DLBC         Lymphoid Neoplasm Diffuse Large B-cell Lymphoma                     48
ESCA         Esophageal Carcinoma                                                185
GBM          Glioblastoma Multiforme                                             290
HNSC         Head and Neck Squamous Cell Carcinoma                               279
KICH         Kidney Chromophobe                                                  66
KIRC         Kidney Renal Clear Cell Carcinoma                                   411
KIRP         Kidney Renal Papillary Cell Carcinoma                               161
LGG          Brain Lower Grade Glioma                                            286
LIHC         Liver Hepatocellular carcinoma                                      198
LUAD         Lung Adenocarcinoma                                                 230
LUSC         Lung Squamous Cell Carcinoma                                        177
OV           Ovarian Serous Cystadenocarcinoma                                   142
PAAD         Pancreatic Adenocarcinoma                                           150
PCPG         Pheochromocytoma and Paraganglioma                                  184
PRAD         Prostate Adenocarcinoma                                             332
SARC         Sarcoma                                                             247
SKCM         Skin Cutaneous Melanoma                                             345
STES         Stomach and Esophageal Carcinoma                                    473
TGCT         Testicular Germ Cell Tumors                                         155
THCA         Thyroid Carcinoma                                                   405
THYM         Thymoma                                                             118
UCEC         Uterine Corpus Endometrial Carcinoma                                247
UCS          Uterine Carcinosarcoma                                              57
UVM          Uveal Melanoma                                                      80

3.5

Monte Carlo Simulations

Beginning with the calculation of IUPred long and IUPred short disorder score profiles for each protein isoform, Monte Carlo simulations were carried out by comparing the observed mutation load (see Equation 3.1) against the average mutation load across one million random simulations of the same number of mutations and calculating an empirical p-value (see Equation 3.2) between these values. The empirical p-value is the number of simulated cases below the observed disorder load divided by the number of simulations performed; this calculation compares the observed value against the values simulated under the null hypothesis (random mutations).

$$M_{obs} = \sum_{i=1}^{N} m_i \times s_i \qquad (3.1)$$

Where $M_{obs}$ is the observed disorder load, $m_i$ is the number of observed mutations at position $i$, $s_i$ is the calculated IUPred disorder score at position $i$, and $N$ is the total number of residues in the protein.
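Equation 3.1 amounts to a weighted sum over positions. The sketch below is a direct transcription under the assumption that mutation counts and disorder scores are supplied as two equal-length sequences indexed by residue position; the function name and the example values are illustrative only.

```python
from typing import Sequence


def disorder_load(mutation_counts: Sequence[int],
                  disorder_scores: Sequence[float]) -> float:
    """Equation 3.1: sum over positions of (mutations at i) x (disorder score at i)."""
    if len(mutation_counts) != len(disorder_scores):
        raise ValueError("one count and one score are required per residue")
    return sum(m * s for m, s in zip(mutation_counts, disorder_scores))


# Toy protein of four residues with mutations at positions 2 and 3.
counts = [0, 2, 1, 0]
scores = [0.1, 0.8, 0.5, 0.9]
load = disorder_load(counts, scores)  # 2 * 0.8 + 1 * 0.5
```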


$$p_{positive} = \frac{\sum \left( M_{obs} \geq M_{random} \right)}{L_{random}} \qquad (3.2)$$

Where $p_{positive}$ is the empirical p-value for positive selection for disorder, $M_{obs}$ is the observed disorder load, $M_{random}$ is the vector of simulated disorder loads, and $L_{random}$ is the length of the simulated disorder loads vector. $L_{random}$ is equal to one million for each isoform.
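The simulation and Equation 3.2 can be sketched as follows. This is not the code used for the thesis: it assumes that random mutations are placed uniformly over the positions of the isoform (sampling with replacement), it transcribes Equation 3.2 literally as written, and it uses far fewer than one million simulations so that it runs quickly. All function and parameter names are hypothetical.

```python
import random
from typing import List, Sequence


def simulate_loads(scores: Sequence[float], n_mutations: int,
                   n_sim: int, rng: random.Random) -> List[float]:
    """Disorder loads of n_sim random profiles.

    Each profile draws n_mutations positions with replacement and sums
    the disorder scores at the drawn positions.
    """
    return [sum(rng.choices(scores, k=n_mutations)) for _ in range(n_sim)]


def empirical_p(m_obs: float, m_random: Sequence[float]) -> float:
    """Equation 3.2 as written: count of (M_obs >= M_random) over L_random."""
    return sum(m_obs >= m for m in m_random) / len(m_random)


rng = random.Random(0)                    # fixed seed for reproducibility
scores = [0.1, 0.2, 0.9, 0.8, 0.3, 0.7]   # toy disorder profile
simulated = simulate_loads(scores, n_mutations=3, n_sim=1000, rng=rng)
p = empirical_p(2.4, simulated)
```

A seeded `random.Random` instance is passed in explicitly so that a run can be repeated exactly, which is useful when comparing results across isoforms.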

Following empirical p-value calculation, one isoform per gene was selected according to the highest number of mutations and shortest protein length, with any ties resolved alphanumerically. The most mutated isoform was chosen to increase the number of individual observations per gene, and the shortest isoform was chosen to increase the degree to which these mutations may perturb the underlying chemistry. These criteria are sufficient for an initial pass; however, the results would be expected to change with other criteria. This selection was made to ensure statistical independence prior to multiple hypothesis correction, which was performed at a false discovery rate (FDR) level of 0.05 using the Benjamini-Hochberg correction procedure (Benjamini & Hochberg, 1995). The selection was performed after p-value calculation rather than before in order to test other potential avenues of investigation, such as single-gene isoform cross comparisons, which are not part of the work presented here.
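For readers unfamiliar with the Benjamini-Hochberg step-up procedure, a minimal sketch follows. It is a plain-Python illustration of the standard procedure and is not the implementation used in this work; the function name is an assumption.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], fdr: float = 0.05) -> List[bool]:
    """Return True for each hypothesis rejected at the given FDR level.

    Sort the m p-values, find the largest rank k (1-based) such that
    p_(k) <= (k / m) * fdr, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            max_rank = rank
    rejected = [False] * m
    for idx in order[:max_rank]:
        rejected[idx] = True
    return rejected


print(benjamini_hochberg([0.001, 0.04, 0.03, 0.5]))  # [True, False, False, False]
```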

3.5.1

Steps as a List

See also Figure 3.1 for these steps as a flowchart.

1. Calculate positional disorder scores via IUPred (long and short)
2. Simulate one million random mutation observations using sampling with replacement (same number of mutations as observed)
   • 'Observed' defined as individual mutated positions, not individual mutation observations, so as to not inflate highly-mutated positions in analysis
3. Calculate empirical p-value between observed and average random mutation load
4. Select one isoform per gene. Criteria:
   – Highest number of mutations
   – Shortest isoform length
   – Ties resolved alphanumerically
5. Correct at FDR of 0.05 according to Benjamini-Hochberg correction procedure
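The isoform selection in step 4 can be expressed as a single sort key. The sketch below assumes that each isoform is described by a (name, number of mutations, length) tuple, that the gene symbol is the isoform name up to the final period (as in Table 3.1), and that "alphanumerically" means ordinary string ordering of isoform names; these representations are illustrative assumptions.

```python
from typing import Dict, Iterable, Tuple

Isoform = Tuple[str, int, int]  # (name, number of mutations, protein length)


def select_per_gene(isoforms: Iterable[Isoform]) -> Dict[str, str]:
    """Pick one isoform per gene: most mutations, then shortest, then by name."""
    best: Dict[str, Isoform] = {}
    for iso in isoforms:
        gene = iso[0].rsplit(".", 1)[0]
        key = (-iso[1], iso[2], iso[0])
        if gene not in best or key < (-best[gene][1], best[gene][2], best[gene][0]):
            best[gene] = iso
    return {gene: iso[0] for gene, iso in best.items()}


# Hypothetical mutation counts and lengths, for illustration only.
candidates = [
    ("A1CF.001", 3, 500),
    ("A1CF.002", 3, 480),
    ("A1CF.003", 2, 100),
    ("AACS.002", 1, 300),
    ("AACS.001", 1, 300),
]
print(select_per_gene(candidates))  # {'A1CF': 'A1CF.002', 'AACS': 'AACS.001'}
```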

3.6

Binomial Testing

FIGURE 3.1: The general flowchart of the steps taken for Monte Carlo simulations during positional analysis. [Flowchart: Calculate positional disorder (IUPred 'long', IUPred 'short') → Simulate random profiles (one million iterations) → Calculate empirical p-value between observed and expected random disorder loads → Select one isoform per gene (criteria: highest number of mutations, shortest isoform length, ties resolved alphanumerically) → Correct FDR via Benjamini-Hochberg correction procedure (at 0.05 level)]

First, disorder region calls for each protein isoform were made using the FoldIndex© webserver.³ Following this, disordered regions within mutated isoforms for each cancer background were extracted. For each of these regions, five values were calculated to find regions with heightened mutational concentration: 1. the total isoform length (length of the region as found via FoldIndex©), 2. the total number of mutations observed in the isoform, 3. the number of mutations observed in the disordered region, 4. expected value (see Equation 3.4), and 5. p-value via binomial test of observed number of mutations or fewer. See Figure 3.2 for a flowchart version of how these values are used. Following binomial testing (see Equation 3.3), the regions in each cancer were filtered for only the most significant isoform of each gene. This filtering step ensures statistical independence prior to FDR correction at the 0.05 level via the Benjamini-Hochberg correction procedure (Benjamini & Hochberg, 1995).

³ http://bioportal.weizmann.ac.il/fldbin/findex

$$Pr(X = x) = \binom{n}{x} p^{x} (1 - p)^{n - x} \qquad (3.3)$$

Where $Pr(X = x)$ is the probability of observing $x$ successes, $n$ is the number of trials (the length of the isoform), $x$ is the number of successes (the number of mutations in the region), and $p$ is the probability of success (the length of the region divided by the length of the isoform). For the work herein the binomial distribution density was used to calculate the probability of observing exactly $x$ successes.

$$E_{val} = M \times \frac{len_{reg}}{len_{iso}} \qquad (3.4)$$

Where $E_{val}$ is the expected value, $M$ is the total number of observed mutations across the isoform, $len_{reg}$ is the length of the region, and $len_{iso}$ is the length of the isoform. This equals the number of mutations expected to randomly fall within the region.
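Equations 3.3 and 3.4 are simple enough to write out directly. The sketch below is a generic transcription of both formulas; the function names are assumptions, the example numbers are invented for illustration, and which quantities are supplied as n, x, and p should follow the definitions given with Equation 3.3.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """Equation 3.3: probability of exactly x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def expected_mutations(total_mutations: int, len_reg: int, len_iso: int) -> float:
    """Equation 3.4: mutations expected to fall in the region by chance."""
    return total_mutations * len_reg / len_iso


# Invented example: 20 mutations in a 200-residue isoform whose
# disordered region spans 50 residues.
print(expected_mutations(20, 50, 200))  # 5.0
print(binomial_pmf(2, 4, 0.5))          # 0.375
```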

3.6.1

Steps as a List

See also Figure 3.2 for these steps as a flowchart.


1. Calculate regional disorder scores via FoldIndex©
2. Run binomial tests to find regions with heightened mutational concentration
   • Subset by < −0.1 average score in region
   • Subset by greater than expected mutations given length of region, length of isoform, and number of observed mutations
   • Subset by at least 5 mutations in the region
3. Select one isoform per gene. Criteria:
   – Lowest p-value
4. Correct at FDR of 0.05 according to Benjamini-Hochberg correction procedure

3.7

Enrichment Analysis and Validation

The sets of significant genes from each method of analysis were run through enrichment analysis using hypergeometric testing across Gene Ontology Biological Process (GO-BP) terms with FDR correction. Biological process enrichment might suggest possible disorder-implicated mechanisms for driving cancer in yet uncharacterized proteins. The utilities used here for enrichment analysis were written in Python and R by my advisor prior to my work here. The Python script processes the raw annotation file to extract the GO branch of interest, in this case the Biological Process branch; in addition to this, it also allows blacklisting evidence codes that would otherwise invite circular reasoning.⁴ Then, in R, making heavy use of the igraph package, several objects are generated: a GO graph, a GO dictionary, an annotation list, and a term-centric annotation list. Enrichment analysis was run both with and without FDR correction, in addition to filtering the results for only the most specific terms.⁵ Under-represented terms were discarded in all cases.

To give mutation prevalence context to significant genes across cancer types, heatmaps were generated of cancer type versus significantly disorder-targeted genes, with coloring by the ratio of the number of unique patients with a mutation in the gene to the total number of unique patients in that cancer type. This allows cross-comparison of similar cancer types and similar genes as well as immediate validation that certain well-known outcomes hold true (e.g., p53 should be significant across most cancers).

Due to the functional dependency between binding partners, a binding partners set is made and analyzed via enrichment analysis to determine if the partner set can provide additional insight into possible cancer drivers. Significant enrichment in the binding partners set would suggest mechanisms that are disrupted by disorder-targeting mutations.

Additional in silico validation was done by ensuring a limited intersection between disorder-targeted sets and COSMIC, the Catalogue Of Somatic Mutations In Cancer (Futreal et al., 2004). In addition to this limited intersection, a p-value for each significant set compared to the COSMIC census was computed via the hypergeometric distribution (see Equation 3.5) to find P[X > x]. These steps are standard procedure for finding new cancer driver genes.

⁴ Using GO terms inferred by protein interaction would invite bias in the protein interaction partner sets created later for validation.
⁵ All parent terms in the GO-BP tree are removed, keeping only the most specific terms from within the tree.

FIGURE 3.2: The general flowchart of the steps taken during binomial testing in regional analysis. [Flowchart: Calculate regional disorder (FoldIndex©) → Binomial tests → Narrow results (< −0.1 average score, greater than 5 mutations in region, greater than expected mutations) → Select one isoform per gene (criteria: take the isoform with the lowest p-value) → Correct FDR via Benjamini-Hochberg correction procedure (at 0.05 level)]

$$Pr(X = k) = \frac{\binom{K}{k} \binom{N - K}{n - k}}{\binom{N}{n}} \qquad (3.5)$$

Where N is the total population size (the number of genes in the TCGA set), K is the number of successes in the population (the number of genes in the COSMIC set), n is the number of draws (the number of genes in each significant set), and k is the number of successful draws (the number of genes in the intersect of COSMIC and each significant set).
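The upper-tail probability P[X > x] can be obtained by summing Equation 3.5 over all larger overlaps. The following is a sketch using exact integer arithmetic from the Python standard library rather than the utilities used in this work; the function names and the toy numbers are assumptions for illustration.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(k: int, N: int, K: int, n: int) -> Fraction:
    """Equation 3.5: probability of exactly k successes in n draws."""
    return Fraction(comb(K, k) * comb(N - K, n - k), comb(N, n))


def hypergeom_tail(x: int, N: int, K: int, n: int) -> float:
    """P[X > x]: probability of more than x successes in n draws."""
    upper = min(K, n)  # the overlap cannot exceed either set size
    return float(sum(hypergeom_pmf(k, N, K, n) for k in range(x + 1, upper + 1)))


# Toy example: population of 10 with 5 successes, 5 draws.
# Only k = 5 exceeds x = 4, so the tail equals 1 / C(10, 5) = 1 / 252.
print(hypergeom_tail(4, N=10, K=5, n=5))
```

Exact fractions avoid the overflow and rounding problems that arise when the very large binomial coefficients for gene-scale populations are handled as floating-point numbers.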


Chapter 4

Positional Analysis Results

4.1

Introduction

Positional Monte Carlo simulation results across the 31 cancer types (listed in Table 3.2) were limited to IUPred short findings. IUPred short was better able to capture regions of positional disorder because it considers a localized proximity window of 25 residues. When only genes with 5 or more observed mutations were considered, which makes the estimate of significance more conservative, 102 significant genes were found. Well-characterized driver genes were removed by taking the set difference between this significant set and the COSMIC gene set, leaving 77 gene symbols across both the missense-only and all-mutations profiles. See Table 4.1 for a listing of these finds and Figure 4.1 for a binary mapping of these finds to the cancer backgrounds they were significant within.

TABLE 4.1: The significant gene symbols according to positional Monte Carlo simulations and considering only those symbols not already in the COSMIC census gene set.

ADAT3     ANKLE1     ARL10        ASPDH        C16orf3    C19orf10   CASC3
CD8B      CEBPB      CRCT1        CSGALNACT2   CXorf38    DNAH9      EME2
EP400     FAM100B    FAM48B1      FAM72A       FAM86C1    FCHSD1     FSIP1
GLTSCR1   GPRIN2     GTF2I        HS3ST4       IARS2      IGFBP4     IGFN1
KANK3     KCNK17     KRTAP10-10   KRTAP10-2    LRIG1      MAP1S      MSANTD1
MUC16     NBPF10     NCOA3        NMU          OSGIN2     PCDHGA2    PCDHGA9
PGLS      PGM5       PKDREJ       POU3F3       PPP1R3G    PROB1      PRR18
PRRT4     RARRES2    RASIP1       RGS9BP       RREB1      SCYL2      SENP6
SMOC2     SOX17      SPRR3        SRRM2        SYN1       TBP        TCEB3C
TCHH      TCTEX1D4   TENM4        TES          TNIP2      TOR3A      TRIM61
TRIP6     TYSND1     UPF3A        ZNF148       ZNF707     ZSCAN1     ZZZ3

4.2

COSMIC Hypergeometric Testing

Using the COSMIC gene set only to determine which finds are novel says nothing about the overall significance of the finds with respect to capturing known cancer drivers. Therefore, a hypergeometric test was performed using the following values: 1. the number of genes in the intersect between COSMIC and my significant set, 29; 2. the number of genes in COSMIC, 616; 3. the number of genes with mutations in the TCGA set, 18201; and 4. the number of genes in my significant set, 384. This resulted in a p-value of 5.0875 × 10^−5; see Equation 4.1 for the calculation and Equation 3.5 for the general equation. This p-value indicates that the significant set determined via positional analysis captures known cancer drivers more often than would be expected by chance.

$$p(k = 29 - 1) = \frac{\binom{616}{28} \binom{18201 - 616}{384 - 28}}{\binom{18201}{384}} = 5.0875 \times 10^{-5} \qquad (4.1)$$

Note that in Equation 4.1, k − 1 successes are considered to find the cumulative probability of k or more successes.


4.3

Mutational Prevalence

In order to measure the prevalence of mutations across the missense-only and all-mutations profiles, heatmaps were created wherein cells are colored by the ratio of the number of patients with a mutation in the given gene to the total number of patients in that cancer type. The mutation profile naming scheme in these images is such that mut profiles include all observed mutations in the cancer profile (i.e., synonymous and missense mutations), while missense profiles include only missense mutations in the cancer profile (i.e., synonymous mutations have been removed). These heatmaps show that the mutation prevalence across individuals is low for most genes. As expected, there is a high mutation prevalence in genes such as TP53, a ubiquitous tumor-suppressor gene. More important than the prevalence of mutations across genes in a single cancer is the comparison of significant genes between cancers. Note that cancer backgrounds with a high mutation prevalence for a particular gene are not necessarily targeting disordered positions within that gene; they simply have a high number of patients with mutations in the gene.
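The value behind each heatmap cell is a ratio of unique patients. A minimal sketch of that calculation follows, assuming mutation records are available as (gene, patient barcode) pairs and that the total number of patients for the cancer type is known (as in Table 3.2); the function name and the toy records are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple


def mutation_prevalence(records: Iterable[Tuple[str, str]],
                        n_patients: int) -> Dict[str, float]:
    """Fraction of patients with at least one mutation in each gene.

    A patient with several mutations in the same gene is counted once.
    """
    patients_by_gene: Dict[str, Set[str]] = {}
    for gene, barcode in records:
        patients_by_gene.setdefault(gene, set()).add(barcode)
    return {gene: len(p) / n_patients for gene, p in patients_by_gene.items()}


# Toy records for a cancer type with 4 patients.
toy = [
    ("GENE1", "patient-1"),
    ("GENE1", "patient-1"),  # second mutation in the same patient
    ("GENE1", "patient-2"),
    ("GENE2", "patient-3"),
]
print(mutation_prevalence(toy, n_patients=4))  # {'GENE1': 0.5, 'GENE2': 0.25}
```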

4.4

Visualizations of Select Genes

For those novel finds with Protein Data Bank (PDB) entries at a resolution of < 2.5 Å, observed mutations were visualized using UCSF Chimera, production version 1.11.2 (build 41380), along with tables listing the observed mutations. Note that due to the inherent difficulty in generating a PDB structure for a disordered protein, especially at so fine a resolution, these images and results are biased toward the more ordered genes in the significant set. Images here were selected for illustrative purposes.

4.4.1

COADREAD – TBP

This combination of COADREAD (colon and rectum adenocarcinoma) cancer and TBP (TATA-box-binding protein) was significant by Monte Carlo analysis and had a mutation prevalence of 0.08474576, or ≈ 8.47% of patients, according to the heatmaps.

PDB: 1NVP The major difference between the number of mutations listed in Table 4.2 and the number visible in Figure 4.5 is due to positions ≈ 60 to 85 in isoform one (TBP.001) and ≈ 40 to 65 in isoform two (TBP.002) being a single amino acid repeat of glutamine, which is not present in the PDB structure. This region's absence suggests that it is likely disordered and therefore did not crystallize well. The positions that remain, {224, 284} in TBP.001 and {204, 264} in TBP.002, are the same two residues, numbered differently because of the offset between the isoforms. Both of these positions are part of turns/loops. The vast majority of mutations therefore occur in the glutamine-repeat region.
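The offset between the two isoforms can be checked directly from the positions listed in Table 4.2. The sketch below assumes that the positions of the two isoforms pair up in the order they are listed; the variable and function names are introduced here for illustration only.

```python
# Mutated positions for COADREAD_mut as listed in Table 4.2.
tbp_001 = [60, 63, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 224, 284]
tbp_002 = [40, 43, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 204, 264]

# Assumption: entries correspond pairwise in listed order.
offsets = {a - b for a, b in zip(tbp_001, tbp_002)}
print(offsets)


def to_isoform_two(position: int, offset: int = 20) -> int:
    """Map a TBP.001 position onto TBP.002 numbering using a constant offset."""
    return position - offset


print([to_isoform_two(p) for p in (224, 284)])
```

A single value in `offsets` is what one would expect if the two isoforms differ only by a stretch of residues upstream of every mutated position.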

TABLE 4.2: The mutations noted in COADREAD_mut for TBP in the TCGA dataset.

Isoform  Amino Acid Position  Frequency
TBP.001  60                   2
TBP.001  63                   2
TBP.001  72                   2
TBP.001  73                   4
TBP.001  74                   1
TBP.001  75                   2
TBP.001  76                   2
TBP.001  77                   4
TBP.001  78                   1
TBP.001  79                   3
TBP.001  80                   1
TBP.001  81                   1
TBP.001  82                   1
TBP.001  83                   1
TBP.001  84                   1
TBP.001  224                  1
TBP.001  284                  1
TBP.002  40                   2
TBP.002  43                   2
TBP.002  52                   2
TBP.002  53                   4
TBP.002  54                   1
TBP.002  55                   2
TBP.002  56                   2
TBP.002  57                   4
TBP.002  58                   1
TBP.002  59                   3
TBP.002  60                   1
TBP.002  61                   1
TBP.002  62                   1
TBP.002  63                   1
TBP.002  64                   1
TBP.002  204                  1
TBP.002  264                  1

4.4.2

BRCA – TBP

This combination of BRCA (breast invasive carcinoma) cancer and TBP (TATA-box-binding protein) was significant by Monte Carlo analysis and had a mutation prevalence of 0.03688525, or ≈ 3.69% of patients, according to the heatmaps.

TABLE 4.3: The mutations noted in BRCA_mut for TBP in the TCGA dataset.

Isoform  Amino Acid Position  Frequency
TBP.001  60                   3
TBP.001  76                   28
TBP.001  77                   2
TBP.001  78                   1
TBP.001  80                   1
TBP.001  89                   1
TBP.001  131                  1
TBP.001  238                  1
TBP.002  40                   3
TBP.002  56                   28
TBP.002  57                   2
TBP.002  58                   1
TBP.002  60                   1
TBP.002  69                   1
TBP.002  111                  1
TBP.002  218                  1

PDB: 1NVP In Figure 4.6, only one mutated position from Table 4.3 is highlighted. The remaining mutated positions were either not part of the PDB structure or, as noted in Section 4.4.1, fall into a single amino acid repeat of glutamine which is not present in the PDB structure. This region's absence suggests that it is likely disordered and therefore did not crystallize well. The position that remains, {238} in TBP.001 and {218} in TBP.002, is the same residue, numbered differently because of the offset between the isoforms. This position falls well within an alpha helix. The vast majority of mutations therefore occur in the glutamine-repeat region.


4.4.3

STES – CASC3

This combination of STES (stomach and esophageal carcinoma) cancer and CASC3 (cancer susceptibility candidate gene 3 protein) was significant by Monte Carlo analysis and had a mutation prevalence of 0.02114165, or ≈ 2.11% of patients, according to the heatmaps.

PDB: 2J0S In Figure 4.7, only three mutated positions are part of the PDB structure: {198, 232, 250} in CASC3.001. Among these three positions, 198 and 232 fall at the edge of α-helices, while 250 falls at the edge of a β-sheet. The remaining, un-mappable mutations occur in regions that are either too disordered to crystallize or were cut out prior to crystallization as part of an attempt to crystallize this protein's ordered site(s) in hopes of understanding the basis of its cancer susceptibility.

TABLE 4.4: The mutations noted in STES_mut for CASC3 in the TCGA dataset.

Isoform    Amino Acid Position  Frequency
CASC3.001  105                  1
CASC3.001  198                  2
CASC3.001  232                  3
CASC3.001  250                  2
CASC3.001  337                  1
CASC3.001  338                  1
CASC3.001  438                  1
CASC3.001  523                  1
CASC3.001  524                  1
CASC3.001  535                  1
CASC3.001  540                  2
CASC3.001  550                  2
CASC3.001  560                  1
CASC3.001  603                  3
CASC3.001  619                  1
CASC3.001  627                  3
CASC3.001  645                  3
CASC3.001  658                  3
CASC3.001  690                  3

4.5

Enrichment Analysis

There were no significant terms following FDR correction; however, the top 10 terms prior to correction are listed in Table 4.5.

4.6

Partner Set Enrichment Analysis

Utilizing Homo sapiens data from BioGRID, downloaded from their latest release on June 14th, 2017, all direct interactors of the significant set were extracted into their own binding partner set (duplicate entries were removed). This resulted in 1545 gene symbols, which, when run through the same enrichment analysis process, resulted in hundreds of enriched terms. Considering only the most specific terms by removing parents in the graph, a total of 168 terms were enriched, with the top 10 listed in Table 4.6 (the top 50 terms can be seen in Table B.1).
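The extraction of a binding partner set can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in this work: interactions are assumed to be available as pairs of gene symbols, the function name is hypothetical, the partner names in the example are placeholders rather than BioGRID records, and the sketch keeps a partner even if it is itself in the significant set, a detail the text above does not specify.

```python
def interaction_partners(interactions, significant):
    """Collect every gene that directly interacts with a gene in `significant`.

    interactions: iterable of (gene_a, gene_b) pairs, treated as undirected
    significant: set of gene symbols
    Returns a set, so duplicate entries are removed automatically.
    """
    partners = set()
    for gene_a, gene_b in interactions:
        if gene_a in significant:
            partners.add(gene_b)
        if gene_b in significant:
            partners.add(gene_a)
    return partners


# Placeholder pairs for illustration only.
pairs = [
    ("TBP", "PARTNER_1"),
    ("PARTNER_1", "TBP"),      # same interaction reported in the other direction
    ("CASC3", "PARTNER_2"),
    ("PARTNER_3", "PARTNER_4"),  # neither gene is in the significant set
]
partner_set = interaction_partners(pairs, {"TBP", "CASC3"})
print(sorted(partner_set))
```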


TABLE 4.5: Note here that these are uncorrected p-values therefore they do not represent term enrichment. They are presented to show the top Gene Ontology terms associated with the significant gene set. The adjusted p-values following FDR correction are provided to reinforce their non-significance.

GO ID       Process                                                       p-value   FDR
GO:0006366  transcription from RNA polymerase II promoter                 0.000433  1
GO:0060850  regulation of transcription involved in cell fate commitment  0.000487  1
GO:0006351  transcription, DNA-templated                                  0.00153   1
GO:0097659  nucleic acid-templated transcription                          0.00155   1
GO:0050652  dermatan sulfate proteoglycan biosynthetic process,           0.00423   1
            polysaccharide chain biosynthetic process
GO:1903691  positive regulation of wound healing, spreading of epidermal  0.00423   1
            cells
GO:0032289  central nervous system myelin formation                       0.00423   1
GO:0003142  cardiogenic plate morphogenesis                               0.00423   1
GO:0060807  regulation of transcription from RNA polymerase II promoter   0.00423   1
            involved in definitive endodermal cell fate specification
GO:0060796  regulation of transcription involved in primary germ layer    0.00423   1
            cell fate commitment


TABLE 4.6: The top 10 most specific terms associated with interaction partners to the significant genes determined by Monte Carlo simulations. In total there were 168 terms in the full table (the top 50 of which are in Table B.1).

GO ID       Process                                                       p-value
GO:0006368  transcription elongation from RNA polymerase II promoter      9.55e-12
GO:0038095  Fc-epsilon receptor signaling pathway                         5.47e-10
GO:0006369  termination of RNA polymerase II transcription                8.88e-10
GO:0043968  histone H2A acetylation                                       7.76e-09
GO:0042795  snRNA transcription from RNA polymerase II promoter           1.54e-08
GO:0016925  protein sumoylation                                           1.05e-07
GO:0002223  stimulatory C-type lectin receptor signaling pathway          1.38e-07
GO:1900034  regulation of cellular response to heat                       4.67e-07
GO:0050821  protein stabilization                                         8.02e-07
GO:1900740  positive regulation of protein insertion into mitochondrial   1.4e-06
            membrane involved in apoptotic signaling pathway


FIGURE 4.1: A simple binary mapping of cancer backgrounds to significant gene finds. A black square indicates that the gene (row) was significant in that cancer background (column). Cancer backgrounds with no significant results have been removed. Rows: significant genes in alphabetical order, ADAT3 through ZZZ3. Columns: UCEC_mut, UCEC_missense, THYM_mut, THYM_missense, TGCT_mut, TGCT_missense, STES_mut, STES_missense, SKCM_mut, SKCM_missense, SARC_mut, SARC_missense, LUAD_mut, KICH_mut, COADREAD_mut, COADREAD_missense, BRCA_mut, ACC_mut, ACC_missense.

FIGURE 4.2: A heatmap showing the significant genes compared across all cancer types with both background mutation profiles. Cells are colored by the ratio between the number of patients with a mutation in the given gene over the total number of patients in that cancer type. Rows: genes in alphabetical order, ADAT3 through ZZZ3. Columns: the mut and missense profiles of each cancer type, UVM through ACC. Color scale: 0 to 0.8.


FIGURE 4.3: A heatmap showing the significant genes compared across all cancer types with only mut background mutation profiles, or those considering all mutations, both synonymous and missense. Cells are colored by the ratio between the number of patients with a mutation in the given gene over the total number of patients in that cancer type. Rows: genes in alphabetical order, ADAT3 through ZZZ3. Columns: the mut profile of each cancer type, UVM_mut through ACC_mut. Color scale: 0 to 0.8.


FIGURE 4.4: A heatmap showing the significant genes compared across all cancer types with only missense background mutation profiles, or those considering only missense mutations with synonymous mutations removed. Cells are colored by the ratio between the number of patients with a mutation in the given gene over the total number of patients in that cancer type. Rows: genes in alphabetical order, ADAT3 through ZZZ3. Columns: the missense profile of each cancer type, UVM_missense through ACC_missense. Color scale: 0 to 0.8.


FIGURE 4.5: Image of mutations within TBP for the COADREAD cancer profile mapped against 1NVP from PDB. All mutations are considered; however, not all mutated positions are found in the PDB structure. The vast majority of mutations occurred in regions that either did not crystallize in the final structure or were cut out of the structure in order to crystallize the ordered regions seen here.


FIGURE 4.6: Image of mutations within TBP for the BRCA cancer profile mapped against 1NVP from PDB. All mutations are considered; however, not all mutated positions are found in the PDB structure. The vast majority of mutations occurred in regions that either did not crystallize in the final structure or were cut out of the structure in order to crystallize the ordered regions seen here.


FIGURE 4.7: Image of mutations within CASC3 for the STES cancer profile mapped against 2J0S from PDB. All mutations are considered; however, not all mutated positions are found in the PDB structure. The vast majority of mutations occurred in regions that either did not crystallize in the final structure or were cut out of the structure in order to crystallize the ordered regions seen here.

Chapter 5

Regional Analysis Results

5.1

Introduction

Regional binomial testing results across 31 cancer types (listed in Table 3.2) were limited only by the parameters stated in Section 3.6. These parameters are already highly conservative, yet still resulted in 525 significant genes. Well-characterized driver genes were removed by taking the set difference between this initial significant set and the COSMIC gene set. This left 480 gene symbols across both missense-only and all-mutations profiles. See Figure 5.1 through Figure 5.6 for binary mappings of these 480 gene symbols to the cancer backgrounds they were significant within and Table 5.1 for a table of the significant gene symbols.

TABLE 5.1: The significant gene symbols according to regional binomial testing and considering only those genes not already in the COSMIC census gene set.

ACD

ACSM2B

ADAM19

ADAM33

ADAMTS1

ADAMTS18

ADCY2

AFAP1L1

AGAP2

AGAP7

AKAP12

AKAP13

AKAP2

AKR1C3

ALOX5

AMOT

ANK3

ANKRD12

ANKRD24

ANKRD30A

ANKRD30B

ANKRD36C

ANO4

APBA2

APBB1IP

APOBR

ARC

ARFIP1

ARHGAP23

ARHGAP5

ARHGEF40

ARID3A

ARNTL2

ARPP21

ASAP1


ASAP3

ATAD2

ATCAY

ATN1

ATP2B2

ATP8B4

ATXN1

ATXN2

AZI1

B3GALT1

B3GAT2

B4GALNT3

BACH1

BBX

BCAS1

BEGAIN

BMP2K

BZRAP1

C10orf90

C15orf40

C19orf6

C1orf173

C1orf198

C1orf65

C4orf27

C5orf42

C6orf10

C8orf34

C9orf66

CA1

CACNA1A

CACNA1H

CACNA2D2

CACTIN

CADPS

CALR3

CBX7

CCDC102A

CCDC105

CCDC110

CCDC40

CDH24

CDKL5

CEP170

CEP41

CEP63

CERKL

CHD3

CHRM2

CHST13

CILP2

CLIC6

CNTN5

COL11A1

COL15A1

COL21A1

COL23A1

COL28A1

COL4A4

COL5A1

CPSF6

CROCC

CRYBG3

CTAGE6P

DBX2

DCAF8L1

DDX11

DDX46

DENND4B

DGKB

DGKI

DHX34

DIDO1

DLEU7

DMKN

DSC3

DUPD1

DYSF

DZIP1

E2F5

EIF1AX

ELF3

EN1

ENAM

EP400

ERI1

EYA1

EYA4

FAM120B

FAM123A

FAM123C

FAM157A

FAM171B

FAM184B

FAM194A

FAM196B

FAM21A

FAM47A

FAM71E2

FBN3

FCGBP

FCRL5

FER1L6

FETUB

FGF12

FGF13

FHDC1

FILIP1

FOXP2

FOXS1

FSCB

FSD1

FSIP2

GAB1

GABRG2

GDF15

GDF5

GIMAP6

GJA8

GLDN

GOLGB1

GON4L

GPATCH8

GPR158

GPR179

GPRIN1

GPRIN2

GSG2

HAP1

HECW2

HGF

HHIPL2

HIVEP3

HLA-C

HMGB3

HOMEZ

HSCB

ILDR1

INPP5J

IRF2BPL

IRS4

IRX4

ISL1

ISX

ITSN2

JPH1

KAT8

KCNA6

KCND2

KCNJ4

KCNJ8

KCNN3

KCTD8

KDM4A

KIAA0040

KIAA0284

KIAA0319

KIAA0355

KIAA0907

KIAA1211

KIAA1257

KIAA1522

KIAA1549L

KIAA2018

KIF1A

KIF1C

KIR3DL2

KNDC1

L1TD1

LAMA3

LAMC3

LAS1L

LDLR

LIG1

LILRB5

LIMK2

LIPE

LMTK3

LONRF2

LPA

LRP11

LRRC43

MAD1L1

MAP1A

MAP6

MAPK13

MAST1

MBD1

MBD6

MCM10

MED17

MEFV

MESP2

METTL10

MGA

MICAL3

MKI67

MPHOSPH10

MPHOSPH9

MSGN1

MUC15

MYBPC2

MYH13

MYH2

MYH4

MYH6

MYH8

MYLK

MYO15A

MYO18B

MYOM1

MYRIP

MYT1L

NALCN

NASP

NBPF3

NCOA3

NEFH

NEFM

NFASC

NFATC1

NFKBIB

NFYA

NGFR

NLRP11

NOL8

NOM1

NOS1

NPAP1

NPAS3

NRAP

NRD1

NRG3

NSUN2

NTN5

NUMBL

OCEL1

OPRM1

OSBPL3

OSBPL6

OTOF

P2RX2

PALMD

PAPD7

PAPPA2

PARD3B

PAX4

PCDH15

PCF11

PCLO

PCMTD1

PCSK1



PCSK5

PDGFRL

PDZD4

PEG3

PENK

PEX5L

PHLDA1

PHLDB2

PHRF1

PIEZO1

PIK3AP1

PIK3R5

PKP4

PLEC

PLEKHG3

PMEPA1

PMFBP1

POTEF

POTEG

POU3F2

PPFIA2

PPM1E

PPP1R16B

PPP2R3A

PPP2R3B

PRDM13

PRICKLE1

PRKCSH

PRKG2

PRLR

PRRC2C

PRRG3

PTPRO

PTRF

RALY

RASSF6

RBM12B

RBM14

RBMXL3

RC3H1

RECQL5

REM1

RERE

RGPD4

RIMS2

RIMS3

RINL

RLIM

RNF146

RNFT2

ROBO2

RP1L1

RSBN1

RSPH4A

RTN3

RUNX2

RYR2

RYR3

SCAND3

SCARF2

SCN2A

SCRN2

SDCCAG3

SDPR

SEMA3E

SGSM1

SH2D2A

SHANK1

SHANK2

SHOX

SIM1

SIPA1L3

SLC16A2

SLC17A6

SLC24A3

SLC8A3

SLCO1C1

SLCO6A1

SMC2

SNAP25

SNED1

SOGA3

SORBS2

SORBS3

SORCS1

SOWAHB

SOX10

SOX9

SPATA31A3

SPATA31D1

SPATS2L

SPDYE5

SPEF2

SPERT

SPHKAP

SPOCK3

SPTA1

SPTAN1

SRL

SRRM2

SRRT

STK19

STON1-GTF2A1L

SWI5

SYNJ2

TAF1

TAF4

TARSL2

TBC1D1

TBC1D10C

TBC1D3B

TBP

TCHHL1

TDRD3

TENM1

TENM2

TEX33

THSD1

TIAM1

TIMELESS

TLN2

TLR6

TMC2

TMC5

TMEM200C

TNRC6A

TNXB

TONSL

TOP2A

TRAK1

TRANK1

TRAPPC12

TRIM3

TRIOBP

TRMT44

TSKS

TTBK1

TTLL11

TTLL2

TTN

TUB

TULP4

TUSC3

TXLNB

UNCX

USP31

USP6NL

UTP18

VRTN

WDR33

WDR64

WDR70

WDR87

WDR96

WNT16

XIRP1

XIRP2

ZAR1L

ZBBX

ZBTB38

ZC3H12D

ZC4H2

ZFHX4

ZFP106

ZFP36L2

ZFR2

ZFX

ZFYVE20

ZIC4

ZIM2

ZNF189

ZNF208

ZNF254

ZNF285

ZNF329

ZNF347

ZNF385B

ZNF398

ZNF462

ZNF534

ZNF599

ZNF638

ZNF676

ZNF696

ZNF707

ZNF711

ZNF717

ZNF746

ZNF768

ZNF770

ZNF804A

ZNF845

ZNF91

5.2

COSMIC Hypergeometric Testing

Using the COSMIC gene set only to decide which finds are novel says nothing about how well the significant set as a whole captures known cancer drivers. Therefore, a hypergeometric test was performed using the following values: 1. the number of genes in the intersection between COSMIC and my significant set, 45; 2. the number of genes in COSMIC, 616; 3. the number of genes with mutations in the TCGA set, 18201; and 4. the number of genes in my significant set, 525. This resulted in a p-value of 1.1274 × 10⁻⁸; see Equation 5.1 for the calculation and Equation 3.5 for the general equation. This p-value indicates that the significant set determined via regional analysis overlaps known cancer drivers far more than expected by chance.

\[
p(k = 45 - 1) = \frac{\binom{616}{44}\binom{18201 - 616}{525 - 44}}{\binom{18201}{525}} = 1.1274 \times 10^{-8} \tag{5.1}
\]

Note that in Equation 5.1, k − 1 successes are considered to find the cumulative probability of k or more successes.

5.3

Mutational Prevalence

In order to measure the prevalence of mutations across missense-only and all-mutations profiles, heatmaps were created where cells are colored by the ratio of the number of patients with a mutation in the given gene to the number of patients in that cancer type. The mutation profile naming scheme is such that mut profiles include all observed mutations in the cancer background (i.e., synonymous and missense mutations), while missense profiles include only missense mutations in the cancer background (i.e., synonymous mutations have been removed). Only novel finds (genes not found in the COSMIC gene set) are considered in these heatmaps and the same scale is used for each heatmap in the set.

These heatmaps show that mutation prevalence across individuals is low for most genes. As expected, there is a high mutation prevalence in genes such as TP53, a ubiquitous tumor-suppressor gene. More important than the prevalence of mutations across genes in a single cancer is the comparison of significant genes between cancers. Note that cancer backgrounds with a high mutation prevalence for a particular gene are not necessarily targeting disordered positions within that gene; they simply have a high number of patients with mutations in the gene.

5.3.1

Both Profiles

The heatmaps are arranged by significant gene in alphabetical order, with a single gene overlap between each, therefore:
1. ACD through CAMTA1 can be seen in Figure 5.7.
2. CARD11 through FCRL5 can be seen in Figure 5.8.
3. FER1L6 through LAS1L can be seen in Figure 5.9.
4. LDLR through NTN5 can be seen in Figure 5.10.
5. NUMBL through RTN3 can be seen in Figure 5.11.
6. RUNX2 through TONSL can be seen in Figure 5.12.
7. TOP2A through ZNF91 can be seen in Figure 5.13.


5.3.2


Mutation Prevalence Distributions

Given the many rows and the necessary splitting of these results across many figures, cancer-wise (Figure 5.14) and gene-wise (Figure 5.15) mean summaries are provided to make the results easier to interpret. The tables of these values are available in Appendix C.

5.4

Visualizations of Select Genes

The three genes selected here are for illustrative purposes. They were chosen because they are within the top five significant results of their cancer background by p-adjusted value and among the 20 greatest absolute observed disorder loads across all results, balancing the disorder and the number of mutations observed in the gene. There were no available Protein Data Bank (PDB) structures for these genes, which might suggest they are entirely or partially too disordered to crystallize properly, as is necessary to generate PDB structures.
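The two selection criteria above can be combined as in the following sketch. The record layout, field names, and function name are assumptions made for illustration, the example values are invented, and the handling of ties is left to the sort order since the text does not specify it.

```python
from collections import defaultdict
from typing import NamedTuple


class Result(NamedTuple):
    gene: str
    background: str       # e.g. "BRCA_mut"
    p_adjusted: float     # adjusted p-value from the regional test
    disorder_load: float  # observed disorder load


def select_candidates(results, top_per_background=5, top_load=20):
    """Keep results that rank highly both within their background and overall."""
    by_background = defaultdict(list)
    for r in results:
        by_background[r.background].append(r)

    # Criterion 1: smallest adjusted p-values within each cancer background.
    top_by_p = set()
    for rows in by_background.values():
        top_by_p.update(sorted(rows, key=lambda r: r.p_adjusted)[:top_per_background])

    # Criterion 2: greatest absolute disorder load across all results.
    ranked = sorted(results, key=lambda r: abs(r.disorder_load), reverse=True)
    top_by_load = set(ranked[:top_load])

    return [r for r in results if r in top_by_p and r in top_by_load]


# Invented records; the cutoffs are lowered so the toy example stays small.
toy = [
    Result("G1", "X_mut", 0.001, -5.0),
    Result("G2", "X_mut", 0.010, -9.0),
    Result("G3", "Y_mut", 0.002, -7.0),
    Result("G4", "Y_mut", 0.050, -1.0),
]
chosen = select_candidates(toy, top_per_background=1, top_load=2)
print([r.gene for r in chosen])
```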

5.4.1

TBP.001 in BRCA

Smoothed Disorder Plot with Mutations In Figure 5.16, the mutations from BRCA_mut (Breast invasive carcinoma, all mutations in background) are mapped onto a smoothed disorder plot. All mutations occur at the most disordered region of the protein with the observed position-wise disorder score among these mutated positions being ≈ −0.60, well below the −0.1 threshold used to annotate high-confidence disordered regions.
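The comparison against the −0.1 threshold can be made concrete with a small sketch. It assumes, as the paragraph above implies, that lower scores indicate greater disorder; the function name and the example scores are hypothetical and are not values from the TCGA analysis.

```python
HIGH_CONFIDENCE_THRESHOLD = -0.1  # threshold quoted in the text


def fraction_disordered(scores, positions, threshold=HIGH_CONFIDENCE_THRESHOLD):
    """Return the fraction of mutated positions whose score falls below the threshold.

    scores: mapping from residue position to (smoothed) disorder score
    positions: mutated residue positions to check
    """
    positions = list(positions)
    if not positions:
        raise ValueError("at least one mutated position is required")
    below = [p for p in positions if scores[p] < threshold]
    return len(below) / len(positions)


# Hypothetical smoothed scores for three residues.
example_scores = {10: -0.60, 11: -0.55, 12: 0.20}
print(fraction_disordered(example_scores, [10, 11]))
print(fraction_disordered(example_scores, [10, 12]))
```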


Note that this isoform, TBP.001, was also significant within:

ACC_mut       Adrenocortical carcinoma, missense-only mutations in background
COADREAD_mut  Colon adenocarcinoma and Rectum adenocarcinoma, all mutations in background
ESCA_mut      Esophageal carcinoma, all mutations in background
KICH_mut      Kidney chromophobe, all mutations in background
KIRC_mut      Kidney renal clear cell carcinoma, all mutations in background
SKCM_mut      Skin cutaneous melanoma, all mutations in background
STES_mut      Stomach and esophageal carcinoma, all mutations in background
TGCT_mut      Testicular germ cell tumors, all mutations in background

TABLE 5.2: The mutations noted in BRCA_mut for TBP in the TCGA dataset.

Isoform  Amino Acid Position  Frequency
TBP.001  131                  1
TBP.001  238                  1
TBP.001  60                   3
TBP.001  76                   28
TBP.001  77                   2
TBP.001  78                   1
TBP.001  80                   1
TBP.001  89                   1
TBP.002  111                  1
TBP.002  218                  1
TBP.002  40                   3
TBP.002  56                   28
TBP.002  57                   2
TBP.002  58                   1
TBP.002  60                   1
TBP.002  69                   1

5.4.2

PLEC.005 in ACC

Smoothed Disorder Plot with Mutations In Figure 5.17, the mutations from ACC_mut (Adrenocortical carcinoma, all mutations in background) are mapped onto a smoothed disorder plot. We can see that roughly half of the mutations occur below the high confidence threshold (−0.1) for determining disordered regions. Most of the remaining mutations occur at the C-terminus, where the sequence becomes more disordered.

Note that this isoform, PLEC.005, was also significant within:

ACC_missense       Adrenocortical carcinoma, missense-only mutations in background
CESC_missense      Cervical squamous cell carcinoma and Endocervical adenocarcinoma, missense-only mutations in background
COADREAD_mut       Colon adenocarcinoma and Rectum adenocarcinoma, all mutations in background
COADREAD_missense  Colon adenocarcinoma and Rectum adenocarcinoma, missense-only mutations in background
HNSC_mut           Head and neck squamous cell carcinoma, all mutations in background
HNSC_missense      Head and neck squamous cell carcinoma, missense-only mutations in background
SKCM_mut           Skin cutaneous melanoma, all mutations in background
STES_mut           Stomach and esophageal carcinoma, all mutations in background
STES_missense      Stomach and esophageal carcinoma, missense-only mutations in background
UCEC_mut           Uterine corpus endometrial carcinoma, all mutations in background

TABLE 5.3: The mutations noted in ACC_mut for PLEC.005 in the TCGA dataset. Note there were 760 mutations across 176 positions, therefore this table only shows the mutations for PLEC.005.

Isoform   Amino Acid Position  Frequency
PLEC.005  1321                 18
PLEC.005  1386                 12
PLEC.005  1697                 11
PLEC.005  1854                 1
PLEC.005  1880                 1
PLEC.005  1905                 1
PLEC.005  1998                 4
PLEC.005  2047                 1
PLEC.005  2106                 17
PLEC.005  2113                 12
PLEC.005  2242                 1
PLEC.005  2495                 1
PLEC.005  2507                 1
PLEC.005  2713                 1
PLEC.005  3145                 1
PLEC.005  4004                 1
PLEC.005  4005                 1
PLEC.005  4382                 1
PLEC.005  4445                 1
PLEC.005  4539                 1
PLEC.005  4624                 6
PLEC.005  4668                 1

5.4.3

NEFH.001 in ACC

Smoothed Disorder Plot with Mutations In Figure 5.18, the mutations from ACC_mut (Adrenocortical carcinoma, all mutations in background) are mapped onto a smoothed disorder plot. All mutations occur below the high confidence threshold (−0.1) for determining a disordered region. The mutations are concentrated around an effective plateau of disorder, suggesting this region is consistent in itself. Rather than simply being a transition region between the relative order before this region and relative disorder after this region, the plateau suggests the region maintains a given level of disorder, which might confer a given function to this region beyond simple transition between other key regions of the folded protein.

Note that this isoform, NEFH.001, was also significant within:

BRCA_mut       Breast invasive carcinoma, all mutations in background
KIRP_mut       Kidney renal papillary cell carcinoma, all mutations in background
KIRP_missense  Kidney renal papillary cell carcinoma, missense-only mutations in background

TABLE 5.4: The mutations noted in ACC_mut for NEFH in the TCGA dataset.

Isoform   Amino Acid Position  Frequency
NEFH.001  645                  2
NEFH.001  646                  1
NEFH.001  655                  11
NEFH.001  698                  2
NEFH.001  701                  2
NEFH.001  702                  2
NEFH.001  744                  1
NEFH.001  805                  1

5.5

Enrichment Analysis

There were no significant terms following FDR correction; however, the top 10 terms prior to correction are listed in Table 5.5.

5.6

Partner Set Enrichment Analysis

Utilizing Homo sapiens data from BioGRID, downloaded from their latest release on June 14th, 2017, all direct interactors of the significant set were extracted into their own interaction partner set (duplicate entries were removed). This resulted in 480 gene symbols, which, when run through the same enrichment analysis process as before, resulted in hundreds of enriched terms. Considering only the most specific terms by removing parents in the graph, a total of 149 terms were enriched, with the top 10 listed in Table 5.6 (the top 50 terms can be seen in Table C.1).


TABLE 5.5: Note here that these are uncorrected p-values therefore they do not represent term enrichment. They are presented to show the top Gene Ontology terms associated with the significant gene set. The adjusted p-values following FDR correction are provided to reinforce their non-significance.

GO ID       Process                          p-value   FDR
GO:0006936  muscle contraction               7.78e-05  1
GO:0003012  muscle system process            8.44e-05  1
GO:0070252  actin-mediated cell contraction  0.000112  1
GO:0030048  actin filament-based movement    0.000128  1
GO:0001508  action potential                 0.000278  1
GO:0030049  muscle filament sliding          0.000324  1
GO:0033275  actin-myosin filament sliding    0.000324  1
GO:0033693  neurofilament bundle assembly    0.000694  1
GO:0072719  cellular response to cisplatin   0.000694  1
GO:0072718  response to cisplatin            0.000694  1

TABLE 5.6: The top 10 most specific terms associated with interaction partners to the significant genes determined by regional binomial tests. In total there were 149 terms in the full table (the top 50 can be seen in Table C.1).

GO ID       Process                                           p-value
GO:0044260  cellular macromolecule metabolic process          1.53e-121
GO:0090304  nucleic acid metabolic process                    4.94e-112
GO:0043170  macromolecule metabolic process                   6.18e-105
GO:0006139  nucleobase-containing compound metabolic process  1.25e-90
GO:0016070  RNA metabolic process                             3.46e-88
GO:0046483  heterocycle metabolic process                     3.31e-82
GO:0006725  cellular aromatic compound metabolic process      3.87e-81
GO:0010467  gene expression                                   3.51e-80
GO:0044238  primary metabolic process                         4.6e-80
GO:1901360  organic cyclic compound metabolic process         2.53e-77


FIGURE 5.1: A simple binary mapping of cancer backgrounds to significant gene finds. A black square indicates that the gene (row) was significant in that cancer background (column). Cancer backgrounds with no significant results have been removed. (1 of 6) Rows: significant genes in alphabetical order, ACD through CEP63. Columns: UVM_missense, UVM_mut, UCS_mut, UCEC_missense, UCEC_mut, THYM_missense, THYM_mut, TGCT_missense, TGCT_mut, STES_missense, STES_mut, SKCM_missense, SKCM_mut, SARC_mut, PCPG_missense, PCPG_mut, PAAD_missense, PAAD_mut, LUSC_missense, LUSC_mut, LUAD_missense, LUAD_mut, KIRP_missense, KIRP_mut, KIRC_missense, KIRC_mut, KICH_missense, KICH_mut, HNSC_missense, HNSC_mut, GBM_missense, GBM_mut, ESCA_missense, ESCA_mut, DLBC_mut, COADREAD_missense, COADREAD_mut, CHOL_missense, CHOL_mut, CESC_missense, CESC_mut, BRCA_missense, BRCA_mut, BLCA_missense, BLCA_mut, ACC_missense, ACC_mut.


FIGURE 5.2: A simple binary mapping of cancer backgrounds to significant gene finds. A black square indicates that the gene (row) was significant in that cancer background (column). Cancer backgrounds with no significant results have been removed. (2 of 6) Rows: significant genes in alphabetical order, CEP63 through GPRIN1. Columns: cancer backgrounds with significant results, UVM_missense through ACC_mut.


FIGURE 5.3: A simple binary mapping of cancer backgrounds to significant gene finds. A black square indicates that the gene (row) was significant in that cancer background (column). Cancer backgrounds with no significant results have been removed. (3 of 6) Rows: significant genes in alphabetical order, GPRIN1 through MYH6. Columns: cancer backgrounds with significant results, UVM_missense through ACC_mut.


68

Figure 5.4: A simple binary mapping of cancer backgrounds to significant gene finds. A black square indicates that the gene (row) was significant in that cancer background (column). Cancer backgrounds with no significant results have been removed. (4 of 6)
[Binary map. Rows: genes MYH6 to PTRF. Columns: cancer background profiles, ACC_mut to UVM_missense.]

Figure 5.5: A simple binary mapping of cancer backgrounds to significant gene finds. A black square indicates that the gene (row) was significant in that cancer background (column). Cancer backgrounds with no significant results have been removed. (5 of 6)
[Binary map. Rows: genes PTRF to TDRD3. Columns: cancer background profiles, ACC_mut to UVM_missense.]

Figure 5.6: A simple binary mapping of cancer backgrounds to significant gene finds. A black square indicates that the gene (row) was significant in that cancer background (column). Cancer backgrounds with no significant results have been removed. (6 of 6)
[Binary map. Rows: genes TDRD3 to ZNF91. Columns: cancer background profiles, ACC_mut to UVM_missense.]

Figure 5.7: A heatmap showing the significant genes compared across all cancer backgrounds. Cells are colored by the ratio between the number of patients with a mutation in the given gene over the total number of patients in that cancer type. (1 of 7)
[Heatmap. Rows: genes ACD to CAMTA1. Columns: the all-mutation (_mut) and missense-only (_missense) profiles of each cancer type, ACC to UVM. Color scale: 0 to 0.6.]

Figure 5.8: A heatmap showing the significant genes compared across all cancer backgrounds. Cells are colored by the ratio between the number of patients with a mutation in the given gene over the total number of patients in that cancer type. (2 of 7)
[Heatmap. Rows: genes CARD11 to FCRL5. Color scale: 0 to 0.6.]

Figure 5.9: A heatmap showing the significant genes compared across all cancer backgrounds. Cells are colored by the ratio between the number of patients with a mutation in the given gene over the total number of patients in that cancer type. (3 of 7)
[Heatmap. Rows: genes FER1L6 to LAS1L. Color scale: 0 to 0.6.]

Figure 5.10: A heatmap showing the significant genes compared across all cancer backgrounds. Cells are colored by the ratio between the number of patients with a mutation in the given gene over the total number of patients in that cancer type. (4 of 7)
[Heatmap. Rows: genes LDLR to NTN5. Color scale: 0 to 0.6.]

Figure 5.11: A heatmap showing the significant genes compared across all cancer backgrounds. Cells are colored by the ratio between the number of patients with a mutation in the given gene over the total number of patients in that cancer type. (5 of 7)
[Heatmap. Rows: genes NUMBL to RTN3. Color scale: 0 to 0.6.]

Figure 5.12: A heatmap showing the significant genes compared across all cancer backgrounds. Cells are colored by the ratio between the number of patients with a mutation in the given gene over the total number of patients in that cancer type. (6 of 7)
[Heatmap. Rows: genes RUNX2 to TONSL. Color scale: 0 to 0.6.]

Figure 5.13: A heatmap showing the significant genes compared across all cancer backgrounds. Cells are colored by the ratio between the number of patients with a mutation in the given gene over the total number of patients in that cancer type. (7 of 7)
[Heatmap. Rows: genes TOP2A to ZNF91. Color scale: 0 to 0.6.]

Figure 5.14: The mean mutation prevalence from regional heatmaps by cancer. Note that the number of missense-only profiles (Missense) plus the number of all-mutations profiles (Muts) is equal to the number of both profiles (Both). This is due to Missense and Muts profiles being mutually exclusive.
[Plot: "Heatmaps Column Means Summary". x-axis: Sorted Index (0 to 60); y-axis: Mutation Prevalence Mean (0.00 to 0.06); legend (ID): Both, Missense, Muts.]

Figure 5.15: The mean mutation prevalence from regional heatmaps by gene. Note that here the number of missense-only profile genes (Missense) plus the number of all-mutation profile genes (Muts) is not equal to the number of genes in both profiles (Both). This is because only genes significant in each of the background profiles are considered, and these gene sets are not mutually exclusive. Therefore, Both is not simply the sum of Missense and Muts but their union, with the Missense and Muts profiles sharing 236 entries. Lengths: Both (525), Muts (447), Missense (314).
[Plot: "Heatmaps Row Means Summary". x-axis: Sorted Index (0 to 500); y-axis: Mutation Prevalence Mean (0.0 to 0.3); legend (ID): Both, Missense, Muts.]

Figure 5.16: The smoothed disorder scores of TBP.001 in BRCA_mut. The red dots mark mutated positions and the exact disorder score at each position.
[Plot: "Smooth Disorder Plot for TBP.001". x-axis: Amino Acid Position; y-axis: Disorder Score.]

Figure 5.17: The smoothed disorder scores of PLEC.005 in ACC_mut. The red dots mark mutated positions and the exact disorder score at each position.
[Plot: "Smooth Disorder Plot for PLEC.005". x-axis: Amino Acid Position; y-axis: Disorder Score.]

Figure 5.18: The smoothed disorder scores of NEFH.001 in ACC_mut. The red dots mark mutated positions and the exact disorder score at each position.
[Plot: "Smooth Disorder Plot for NEFH.001". x-axis: Amino Acid Position; y-axis: Disorder Score.]

Chapter 6

Discussion

6.1 Introduction

Interpreting as many results as this analysis produced (551¹) without a systematic approach would require some form of heuristic, and there are no known heuristics for judging the significance or novelty of protein disorder findings in the context of cancer. Therefore, rather than selecting results at random or taking the top N results from each method, the intersection of the two methods was taken to determine the most notable genes of potential disorder-driven cancer implication. Under the hypothesis herein that protein disorder may be implicated in yet uncharacterized driver genes, genes captured by both methods of analysis should have a heightened potential for driving cancer.

¹ 77 from positional analysis and 480 from regional analysis, with a 6-result overlap; 77 + 480 − 6 = 551.


Table 6.1: The significant gene symbols according to both positional Monte Carlo simulations and regional binomial tests. Only genes not already in the COSMIC census gene set are considered.

EP400    TBP    SRRM2    NCOA3    GPRIN2    ZNF707

6.2 Intersection of Both Methods of Analysis

The positional and regional analysis methods each have their own bias: positional analysis could falsely call insignificant random mutations at disordered positions significant, while regional analysis could call intrinsically disordered proteins significant because they consist of one large disordered region. Taking the intersection of the two novel find sets should therefore give a high-confidence disorder-implicated gene set. The significant genes shared between the two methods are listed in Table 6.1. Only novel finds are considered here rather than all finds, so none of these results are currently in the COSMIC census.

6.2.1 EP400

This gene, E1A-binding protein p400, is involved in the transcriptional activation of select genes via H4 and H2A acetylation (Doyon et al., 2004).² Notably, Endo et al. (2013) reported this gene in an ossifying fibromyxoid tumor. This was detected in only a single case, but showed potential reproducibility. The rarity and uncertainty associated with the finding suggest it might be disorder-associated: the rarity because disordered regions are less susceptible to mutational disruption, and the reproducibility because it suggests more than a chance finding. Meanwhile, Mouradov et al. (2014) did a systematic investigation of primary colorectal tumors and compared them against TCGA data to conclude that these tumors are representative of the main subtypes of primary tumors at the genomic level, finding EP400 mutation enrichment among other commonly found tumor genes. In addition to these findings, Smith et al. (2010) and Wu et al. (2015) found this gene implicated in human papillomavirus (HPV)-associated cancers and bladder cancer recurrence, respectively.

² http://www.uniprot.org/uniprot/Q96L91

6.2.2 TBP

This gene, TATA-box-binding protein, is part of the TFIID complex, and its binding to the complex is part of the initial transcriptional step of the pre-initiation complex (PIC).³ TBP has not yet been implicated in cancer by itself, but has been noted in interaction with p53, a ubiquitous cancer driver gene (Truant, Xiao, Ingles, & Greenblatt, 1993). This gene has primarily and almost exclusively been implicated in neurodegeneration, particularly via spinocerebellar ataxia (Zühlke, Dalski, Schwinger, & Finckh, 2005). If this gene is driving cancer via disorder-focused mutation, the mutation is likely either affecting its ability to bind to the TFIID complex, leading to a slowing of transcriptional activity, or affecting the rate of signal transduction by p53 (GO:1901796).

³ http://www.uniprot.org/uniprot/P20226

6.2.3 SRRM2

This gene, Serine/arginine repetitive matrix protein 2, has been previously implicated in papillary thyroid carcinoma predisposition (Tomsic et al., 2015), colorectal cancer (Hinoue et al., 2012), and breast cancer (Semaan, Wang, Stewart, Marshall, & Sang, 2011). Its exact function is still unknown, but it is involved in pre-mRNA splicing, where it may stabilize the catalytic center or the position of the RNA substrate (Blencowe et al., 2000).⁴

6.2.4 NCOA3

This gene, Nuclear receptor coactivator 3, is overexpressed in ≈ 60% of primary breast tumors (Burwinkel et al., 2005). This overexpression has been shown to significantly reduce the disease-free and overall survival rates when compared to patients with other tumor types (Zhao et al., 2003), to the point that its secondary alias symbol is AIB1 (amplified in breast cancer 1). Breast cancers can be divided into two distinct classes, estrogen receptor α-positive (ERα-positive) and ERα-negative disease, where AIB1 amplification characterizes a subgroup of ERα-positive breast cancer with worse outcome (Burandt et al., 2013).⁵

⁴ http://www.uniprot.org/uniprot/Q9UQ35
⁵ http://www.uniprot.org/uniprot/Q9Y6Q9

6.2.5 GPRIN2

This gene, G protein-regulated inducer of neurite outgrowth 2, was first shown to be involved in G protein action in the brain (L. T. Chen, Gilman, & Kozasa, 1999). Since then it has been shown to be highly mutated in invasive lobular breast cancer (Ciriello et al., 2015) and involved in cancer risk in conjunction with environmental risks such as ceramic fibers (Gérazime, Stücker, & Luce, 2016) and asbestos (Jiménez, Aguilar, Velázquez, Tachiquin, & Juárez, 2016). Beyond these publications, GPRIN2 is mostly absent from any directed study.⁶

6.2.6 ZNF707

This gene, zinc finger protein 707, has never been directly studied;⁷ instead, all publications caught ZNF707 in other analyses, with only one study mentioning it as a result. That study, by Nesslinger et al. (2007), found that in prostate cancer ZNF707 + PTMA was recognized by treatment-associated autoantibodies. Beyond that, ZNF707 has been annotated in four interactome studies (Rual et al., 2005; Rolland et al., 2014; Hein et al., 2015; Xin et al., 2009), sequenced as part of two analyses of chromosome 8 (Nusbaum et al., 2006; Ota et al., 2004), and included in an NIH project to expand the Mammalian Gene Collection (MGC) (Gerhard et al., 2004).

⁶ http://www.uniprot.org/uniprot/O60269/publications
⁷ http://www.uniprot.org/uniprot/Q96C28/publications

6.3 Enrichment Analyses

The lack of significant terms following enrichment analysis is not without meaning. A lack of enriched terms in this case might suggest that disorder-targeted proteins do not share a similar driving mechanism and are instead as varied as their lack of well-defined structure suggests. If these disorder-targeted proteins are implicated in cancer, this varied set of mechanisms would likely be attributable to binding partner disruption. This is supported by many of the uncorrected terms (Table 4.5 and Table 5.5) being associated with complex protein network interactions.

6.3.1 Positional

The terms listed in Table 4.6 (positional analysis interaction partner set) are all associated with either metabolic processes or gene expression. These associations are unsurprising given that the mutations were observed in patients with cancer. More surprisingly, one of the top terms is "protein stabilization," which might suggest that these disorder-targeted genes destabilize more than just their own binding relationships by having a secondary effect on protein stabilization at large. Another significant term, "protein sumoylation," refers to a post-translational modification that is associated with apoptosis, protein stability, and progression through the cell cycle (Hay, 2005), and so with the long-term fate of a protein.

6.3.2 Regional

The terms listed in Table 5.6 (regional analysis interaction partner set) suggest that disorder-implicated driver genes may drive cancer via their binding partners as opposed to directly. Since every term in the table concerns regulation, particularly of gene expression and biosynthesis, the effect of these mutations is likely to be disruption of metabolic networks rather than of metabolic processes directly. In the expanded interaction partner set enrichment table for regional analysis (Table C.1), terms further down the list, such as "positive regulation of ATP biosynthetic process," point to the energetics aspect of cancer induction. There are also some surprising enriched terms, such as "behavioral response to ethanol," which, despite being interesting, offer no aid in characterizing these genes as cancer driver genes; rather, they highlight the limitations of this approach (further discussed in Section 6.8).

6.3.3 Regional and Positional Cross-comparison

Significant Novel Finds Sets

Looking at both the uncorrected positional terms (Table 4.5) and the uncorrected regional terms (Table 5.5), neither set of terms makes much sense as a driver of cancer; rather, there is a great variety of terms that do not seem cancer-related. This might suggest that these disorder-targeted genes, if they drive cancer, do so via their interaction network rather than directly.

Binding Partner Sets

Terms such as "protein sumoylation" and "protein stabilization" occur in both the positional (Table 4.6) and regional (Table 5.6) partner set enrichment results. This helps cross-validate the results from each method of analysis; however, it might also be due to the scale-free property of protein-protein interaction networks, where gathering the interaction partner set of any initial set is likely to result in a more central set overall, in this case a more biologically critical gene set. This point is discussed further in Section 6.8 below.


6.4 Disorder Binding Incitation of Cancer

Following the analysis herein, I now suspect that if disorder-implicated driver genes exist, they are likely driving cancer via their binding relationships. Disordered proteins add robustness to protein-protein interaction networks by complementing the rigidity of ordered regions (e.g., binding sites and conserved domains). It makes general sense that an ordered site made more disordered would disrupt binding; however, the analysis herein did not help answer the more general question of how disorder may incite cancer. It is possible that disorder-targeted genes incite cancer by affecting binding relationships rather than directly, but further research is needed on how mutations in disordered regions of even known driver genes present themselves.

6.5 COSMIC – Limited Complement

The consensus driver genes in COSMIC have largely been determined by methods geared toward finding order-targeted mutation effects, and the census therefore offers a strong complement to the disorder-focused discovery of driver genes herein. Since COSMIC is the standard for genes causally implicated in cancer, some overlap between it and the significant results offers slight support for the remaining significant results being potential drivers. However, since the biological property underlying prior methods differs so greatly from the one used herein, removing known driver genes with COSMIC, although a standard and sound way to focus on novel results, is likely to remove few genes driven by disorder: the COSMIC set likely does not include many genes that the disorder-focused method of analysis used herein would find.

6.6 On the Limits of In Silico Analysis

Despite all the in silico validation methods used herein, future validation via wet lab experimentation, possibly through the use of pull-down assays, will be necessary. Pull-down assays are particularly well suited to disordered regions because they directly test binding disruption, a likely hypothesis for how disorder-targeting mutations might drive cancer.

6.7 On the High Number of Regional Results

Having 525 genes called significant in regional analysis, with 480 remaining after removal of well-characterized driver genes, suggests a potential problem with the null model used in this approach. This is simply too many results from which to draw generalizations shared between findings. If we assume these results are problematic, or at least that the FDR correction is proper and 1/20th of the results are false discoveries, then the number of results is most likely inflated by one of two possibilities: 1. these region-gene combinations accumulate non-fatal, non-significant mutations after oncogenesis (passenger mutation accumulation), or 2. these mutations are important and their accumulation in so many genes indicates a more important conclusion to be made with further analysis (unknown mutation accumulation). The latter is, at best, blindly hopeful about the significance of my findings and lacks an effective next step toward this important conclusion. The former is far more likely and has multiple next steps that can be taken. One potential next step, informed by the work of London, Movshovitz-Attias, and Schueler-Furman (2010), is to consider the mutation of "hot spot" residues in order to find the mutated regions that would show the most binding disruption. I suspect that, because this subsetting rests on biological significance rather than on statistics alone, reanalyzing regional mutational concentration with added weight on the residue being mutated would drastically reduce the number of false discoveries.
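The "1/20th" figure in this section corresponds to a false discovery rate of 0.05. As a rough illustration only, the sketch below assumes the FDR correction is the Benjamini-Hochberg step-up procedure (the bibliography cites Benjamini & Hochberg, 1995, but this section does not state which procedure or threshold was applied). The function names and the toy p-values are hypothetical and are not taken from the thesis data.

```python
def benjamini_hochberg(pvalues, q=0.05):
    """Return booleans: True where a test is called significant under the
    Benjamini-Hochberg step-up procedure at FDR level q (assumed procedure)."""
    m = len(pvalues)
    # Indices of the p-values in ascending order of p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * q:
            cutoff_rank = rank
    significant = [False] * m
    # Every test ranked at or below the cutoff is called significant.
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


def expected_false_discoveries(n_significant, q=0.05):
    """Rough expected count of false discoveries among n_significant calls
    when the FDR is controlled at level q."""
    return n_significant * q


# Toy p-values (hypothetical).
toy_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
calls = benjamini_hochberg(toy_p, q=0.05)
```

Under these assumptions, `expected_false_discoveries(480)` gives roughly 24 of the 480 novel regional results as false discoveries, which is small relative to the total and is consistent with the concern that the inflation lies in the null model rather than in the correction.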

6.8 Limitations

As with any analysis, the approach taken herein has its flaws. Here I discuss the most important limitations and problems with the analysis; however, these are certainly not the only limitations given its scale and dimensionality. With so many discrete tests across Monte Carlo simulations and binomial tests, and a variety of places where corrections could have been performed but were not because they were assumed, seemingly safely, to be unnecessary,⁸ there are no doubt more limitations than just the ones presented here.

⁸ An example would be, during regional analysis, correcting for the number of regions in an isoform/gene prior to selecting the most representative isoform. This correction is motivated by it being more significant if a protein has many disordered regions and all the mutations concentrate in one disordered region than if a highly mutated protein has one very large disordered region.


6.8.1 Impact of mutations

The impact or context of mutations is not considered in this analysis. There are many known reasons why seemingly insignificant mutations are far more important than the simple math used herein would measure them to be. One such case is that transitions (i.e., purine to purine and pyrimidine to pyrimidine DNA mutations) and transversions (i.e., purine to pyrimidine and pyrimidine to purine DNA mutations) are not distinguished, even though transitions occur at a much higher rate on average than transversions (≈ 3 : 1 ratio) despite there being twice as many possible transversions as transitions. Thus certain random protein mutations are more likely than others because they are caused by a transition as opposed to a transversion. In considering mutations, the method naïvely assumes either that all mutations matter (hopeful, but likely not true) or that only missense mutations matter (also hopeful, but likely not true since synonymous mutations do have an impact on translation rates). This limitation can be addressed through the use of MutSig (Beroukhim et al., 2007), SIFT (Mooney, 2005), or both to give mutations more context. Converting DNA mutations to the protein level, analyzing the data at that level, and then trying to draw general conclusions about the original gene from the signal at the protein level required some compromise in considerations such as these. The translation from DNA to protein is not as simple or straightforward as a translation table may suggest.
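The transition and transversion distinction described above can be made concrete with a short sketch. It is illustrative only and was not part of the analysis; the function name `classify_substitution` is hypothetical. It also enumerates all twelve single-base substitutions to confirm that there are twice as many possible transversions as transitions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Classify a single-base DNA substitution as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    # A transition stays within one chemical class; a transversion crosses classes.
    same_class = (ref in PURINES) == (alt in PURINES)
    return "transition" if same_class else "transversion"


# Enumerate all 12 possible substitutions to confirm the 4-versus-8 split.
counts = {"transition": 0, "transversion": 0}
for ref in "ACGT":
    for alt in "ACGT":
        if ref != alt:
            counts[classify_substitution(ref, alt)] += 1
```

Each base has one possible transition and two possible transversions. If the ≈ 3 : 1 ratio quoted above is taken at face value, a given base would undergo its single transition with probability 0.75 and each of its two transversions with probability 0.125, which is one way such context could be weighted into a null model.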


6.8.2 Monte Carlo simulations side effect

A side effect of the Monte Carlo analysis is that if the observed mutated positions are all just slightly more disordered than the rest of the isoform, then that isoform will be called significant without a true disorder-driven reason. This is partially addressed by the complementary regional analysis, which mitigates such false positives. Therefore, the intersection of positional finds and regional finds is a far more confident set.
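This side effect can be reproduced with a toy simulation. The sketch below is not the implementation used in this work: it assumes a test statistic equal to the mean disorder score at the mutated positions, a null model that draws the same number of positions uniformly at random from the isoform, and an add-one empirical p-value. The function name, the scores, and the isoform are all hypothetical.

```python
import random


def positional_p_value(scores, mutated, n_sim=2000, seed=0):
    """Empirical one-sided p-value that the mutated positions are more
    disordered than equally many positions drawn uniformly at random."""
    rng = random.Random(seed)
    k = len(mutated)
    observed = sum(scores[i] for i in mutated) / k
    hits = 0
    for _ in range(n_sim):
        # Null model (assumed): k distinct positions, uniform over the isoform.
        sample = rng.sample(range(len(scores)), k)
        if sum(scores[i] for i in sample) / k >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_sim + 1)


# Hypothetical isoform: 190 positions score 0.00 and 10 positions score 0.01,
# a difference far too small to be meaningful on its own.
scores = [0.0] * 190 + [0.01] * 10
mutated = list(range(190, 200))  # every mutation hits a slightly higher position
p = positional_p_value(scores, mutated)
effect = sum(scores[i] for i in mutated) / len(mutated) - sum(scores) / len(scores)
```

In this toy case the p-value is very small even though the mutated positions exceed the isoform mean by less than 0.01 score units, which shows how a consistent but negligible shift can be called significant when only rank against the null is considered.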

6.8.3 Intersection of significance sets

Taking the intersection of the all-mutation and missense-only profiles within both positional and regional analysis (that is, the intersection of four sets: positional-all, positional-missense, regional-all, and regional-missense) should result in a far more confident set; however, drawing conclusions from this set would be difficult. Is a significant isoform disordered overall with great peaks of order? Do all the mutations matter? Questions such as these will need to be addressed in further research.
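The four-way intersection described above is straightforward to express with sets. The sketch below uses placeholder gene names, since the per-profile significant lists are not reproduced in this chapter; the dictionary keys mirror the four set names used in the text.

```python
# Hypothetical gene sets standing in for the four significance sets.
significant = {
    "positional-all": {"GENE_A", "GENE_B", "GENE_C", "GENE_D"},
    "positional-missense": {"GENE_A", "GENE_B", "GENE_D"},
    "regional-all": {"GENE_A", "GENE_B", "GENE_C", "GENE_E"},
    "regional-missense": {"GENE_A", "GENE_B", "GENE_E"},
}

# Genes called significant in every one of the four sets.
high_confidence = set.intersection(*significant.values())

# Genes supported by both methods within a single mutation profile.
both_methods_all = significant["positional-all"] & significant["regional-all"]
```

Here `high_confidence` contains only the genes present in all four sets, so it can never be larger than the intersection taken within a single mutation profile such as `both_methods_all`.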

6.9 Conclusions

With this work being the only intersection between cancer driver gene discovery and protein disorder, it is not yet possible to make any general, objective conclusions about disorder-implicated driver genes. Further stringency is necessary to draw meaningful conclusions about this potential cancer-driving biological property. Addressing some of the limitations stated in Section 6.8 should be the next step of investigation into this intersection. If possible, initial wet-lab validation of the high-confidence set could inform a more advanced reanalysis of the data used herein by finding some general property or binding partner common to mutated versions of the significant isoforms. Such validation would likely take the form of site-directed mutagenesis and pull-down assays to determine whether the observed mutations are the causal link between binding success and binding disruption. Although the conclusions from this work are limited, as a first-step proof of concept they are important for informing continued work in this area of investigation. By repeating this analysis with increased biological consideration, such as focusing on "hot spot" residues known to be more critical in binding interactions, new drivers may be discovered beyond the six suggested here. Investigation into any shared nature among the six genes listed above may further inform directed study in this area. By studying the relationship between protein disorder and cancer while making as few assumptions as possible, this work lays a launching pad for continued, more informed investigations into how protein disorder may drive cancer.


Bibliography

Anfinsen, C. B. (1973). Principles that govern the folding of protein chains. Science, 181(4096), 223–230. doi:10.1126/science.181.4096.223
Ast, G. (2004). How did alternative splicing evolve? Nat. Rev. Genet. 5(10), 773–782. doi:10.1038/nrg1451
Bass, A. J., Thorsson, V., Shmulevich, I., Reynolds, S. M., Miller, M., Bernard, B., ... Liu, J. (2014). Comprehensive molecular characterization of gastric adenocarcinoma. Nature, 513(7517), 202–9. doi:10.1038/nature13480
Benjamini, Y. & Hochberg, Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. B, 57(1), 289–300. doi:10.2307/2346101
Beroukhim, R., Getz, G., Nghiemphu, L., Barretina, J., Hsueh, T., Linhart, D., ... Sellers, W. R. (2007). Assessing the significance of chromosomal aberrations in cancer: Methodology and application to glioma. Proc. Natl. Acad. Sci. 104(50), 20007–20012. doi:10.1073/pnas.0710052104
Black, D. L. (2003). Mechanisms of alternative pre-messenger RNA splicing. Annu. Rev. Biochem. 72(1), 291–336. doi:10.1146/annurev.biochem.72.121801.161720
Blake, C. C. F., Koenig, D. F., Mair, G. A., North, A. C. T., Phillips, D. C., & Sarma, V. R. (1965). Structure of hen egg-white lysozyme, a three dimensional Fourier synthesis at 2-Ångstroms resolution. Nature, 206(4986), 757–761. doi:10.1038/206757a0
Blasco, M. A. (2005). Telomeres and human disease: Ageing, cancer and beyond. Nat Rev Genet, 6(8), 611–622. doi:10.1038/nrg1656
Blencowe, B. J., Baurén, G., Eldridge, A. G., Issner, R., Nickerson, J. A., Rosonina, E., & Sharp, P. A. (2000). The SRm160/300 splicing coactivator subunits. RNA, 6(1), 111–20. doi:10.1017/S1355838200991982
Boffetta, P., Hecht, S., Gray, N., Gupta, P., & Straif, K. (2008). Smokeless tobacco and cancer. doi:10.1016/S1470-2045(08)70173-6
Burandt, E., Jens, G., Holst, F., Jänicke, F., Müller, V., Quaas, A., ... Lebeau, A. (2013). Prognostic relevance of AIB1 (NCoA3) amplification and overexpression in breast cancer. Breast Cancer Res. Treat. 137(3), 745–753. doi:10.1007/s10549-013-2406-4
Burwinkel, B., Wirtenberger, M., Klaes, R., Schmutzler, R. K., Grzybowska, E., Försti, A., ... Hemminki, K. (2005). Association of NCOA3 polymorphisms with breast cancer risk. Clin. Cancer Res. 11(6), 2169–2174. doi:10.1158/1078-0432.CCR-04-1621
Campisi, J. (2013). Aging, cellular senescence, and cancer. Annu. Rev. Physiol. 75(1), 685–705. doi:10.1146/annurev-physiol-030212-183653
Chen, L. T., Gilman, A. G., & Kozasa, T. (1999). A candidate target for G protein action in brain. J. Biol. Chem. 274(38), 26931–26938. doi:10.1074/jbc.274.38.26931


Chen, Y., McGee, J., Chen, X., Doman, T. N., Gong, X., Zhang, Y., . . . Kouros-Mehr, H. (2014). Identification of druggable cancer driver genes amplified across TCGA datasets. PLoS One, 9(5), e98293. doi:10.1371/journal.pone.0098293

Cheng, W. C., Chung, I. F., Chen, C. Y., Sun, H. J., Fen, J. J., Tang, W. C., . . . Wang, H. W. (2014). DriverDB: An exome sequencing database for cancer driver gene identification. Nucleic Acids Res. 42(D1). doi:10.1093/nar/gkt1025

Chial, H. (2008). Proto-oncogenes to oncogenes to cancer. Nature Education, 1(1), 33.

Ciriello, G., Gatza, M. L., Beck, A. H., Wilkerson, M. D., Rhie, S. K., Pastore, A., . . . Perou, C. M. (2015). Comprehensive molecular portraits of invasive lobular breast cancer. Cell, 163(2), 506–519. doi:10.1016/j.cell.2015.09.033

de Gruijl, F. R. (1999). Skin cancer and solar UV radiation. Eur. J. Cancer, 35(14), 2003–9. doi:10.1016/S0959-8049(99)00283-X

Dees, N. D., Zhang, Q., Kandoth, C., Wendl, M. C., Schierding, W., Koboldt, D. C., . . . Ding, L. (2012). MuSiC: Identifying mutational significance in cancer genomes. Genome Res. 22(8), 1589–1598. doi:10.1101/gr.134635.111

DeMarini, D. M. (2004). Genotoxicity of tobacco smoke and tobacco smoke condensate: A review. doi:10.1016/j.mrrev.2004.02.001

Denissenko, M. F. & Pao, A. (1996). Preferential formation of benzo[a]pyrene adducts at lung cancer mutational hotspots in P53. Science, 274(5286), 430–432. doi:10.1126/science.274.5286.430

Dill, K. A. [Ken A.], Ozkan, S. B., Shell, M. S., & Weikl, T. R. (2008). The protein folding problem. Annu. Rev. Biophys. 37(1), 289–316. doi:10.1146/annurev.biophys.37.092707.153558


D’Orazio, J., Jarrett, S., Amaro-Ortiz, A., & Scott, T. (2013). UV radiation and the skin. doi:10.3390/ijms140612222

Dosztányi, Z., Csizmok, V., Tompa, P., & Simon, I. (2005, August). IUPred: Web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics, 21(16), 3433–4. doi:10.1093/bioinformatics/bti541

Dosztányi, Z., Csizmók, V., Tompa, P., & Simon, I. (2005, April). The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. J. Mol. Biol. 347(4), 827–39. doi:10.1016/j.jmb.2005.01.071

Doyon, Y., Selleck, W., Lane, W. S., Tan, S., & Côté, J. (2004). Structural and functional conservation of the NuA4 histone acetyltransferase complex from yeast to humans. Mol. Cell. Biol. 24(5), 1884–96. doi:10.1128/MCB.24.5.1884

Edwards, A. G. K., Russell, I. T., & Stott, N. C. H. (1998). Signal versus noise in the evidence base for medicine: An alternative to hierarchies of evidence? Fam. Pract. 15(4), 319–322. doi:10.1093/fampra/15.4.319

Endo, M., Kohashi, K., Yamamoto, H., Ishii, T., Yoshida, T., Matsunobu, T., . . . Oda, Y. (2013). Ossifying fibromyxoid tumor presenting EP400-PHF1 fusion gene. Hum. Pathol. 44(11), 2603–2608. doi:10.1016/j.humpath.2013.04.003

Fischer, E. (1894). Einfluss der configuration auf die wirkung der enzyme. Berichte der Dtsch. Chem. Gesellschaft, 27(3), 2985–2993. doi:10.1002/cber.18940270364

Fourier, J.-B.-J. (1822). Théorie analytique de la chaleur. Paris: F. Didot.


Futreal, P. A., Coin, L., Marshall, M., Down, T., Hubbard, T., Wooster, R., . . . Stratton, M. R. M. (2004, March). A census of human cancer genes. Nat. Rev. Cancer, 4(3), 177–183. doi:10.1038/nrc1299

Garrett, R. H. & Grisham, C. M. (2013). Biochemistry. 5th, Brooks/Cole Cengage Learning. Belmont, CA.

Gérazime, A., Stücker, I., & Luce, D. (2016, September). P006 Occupational exposure to refractory ceramic fibres and respiratory cancer risk. Occup. Environ. Med. 73(Suppl 1), A121. Retrieved from http://dx.doi.org/10.1136/oemed-2016-103951.331

Gerhard, D. S., Wagner, L., Feingold, E. A., Shenmen, C. M., Grouse, L. H., Schuler, G., . . . Malek, J. (2004). The status, quality, and expansion of the NIH full-length cDNA project: The Mammalian Gene Collection (MGC). Genome Res. 14(10 B), 2121–2127. doi:10.1101/gr.2596504

Ghersi, D. & Singh, M. (2014). Interaction-based discovery of functionally important genes in cancers. Nucleic Acids Res. 42(3), 1–11. doi:10.1093/nar/gkt1305

Gonzalez-Perez, A. & Lopez-Bigas, N. (2012). Functional impact bias reveals cancer drivers. Nucleic Acids Res. 40(21). doi:10.1093/nar/gks743

Goymer, P. (2007). Synonymous mutations break their silence. Nat. Rev. Genet. 8(2), 92–92. doi:10.1038/nrg2056

Hanahan, D. & Weinberg, R. A. [Robert A.]. (2011). Hallmarks of cancer: The next generation. Cell, 144(5), 646–74. doi:10.1016/j.cell.2011.02.013

Hay, R. T. (2005). SUMO: A history of modification. Mol. Cell, 18(1), 1–12. doi:10.1016/j.molcel.2005.03.012


Hecht, S. (1999). Tobacco smoke carcinogen and lung cancer. J. Natl. Cancer Inst. 91(14), 1194–1210. doi:10.1093/jnci/91.14.1194

Hein, M. Y., Hubner, N. C., Poser, I., Cox, J., Nagaraj, N., Toyoda, Y., . . . Mann, M. (2015). A human interactome in three quantitative dimensions organized by stoichiometries and abundances. Cell, 163(3), 712–723. doi:10.1016/j.cell.2015.09.053

Hendrick, J. P. & Hartl, F.-U. (1993). Molecular chaperone functions of heat-shock proteins. Annu. Rev. Biochem. 62(1), 349–384. doi:10.1146/annurev.bi.62.070193.002025

Hinoue, T., Weisenberger, D. J., Lange, C. P. E., Shen, H., Byun, H. M., Van Den Berg, D., . . . Laird, P. W. (2012). Genome-scale analysis of aberrant DNA methylation in colorectal cancer. Genome Res. 22(2), 271–282. doi:10.1101/gr.117523.110

Hua, X., Xu, H., Yang, Y., Zhu, J., Liu, P., & Lu, Y. (2013, September). DrGaP: A powerful tool for identifying driver genes and pathways in cancer sequencing studies. Am. J. Hum. Genet. 93(3), 439–51. doi:10.1016/j.ajhg.2013.07.003

Hunt, R. C., Simhadri, V. L., Iandoli, M., Sauna, Z. E., & Kimchi-Sarfaty, C. (2014). Exposing synonymous mutations. doi:10.1016/j.tig.2014.04.006

Hutchinson, E. (2001). Alfred Knudson and his two-hit hypothesis. Lancet Oncol. 2(10), 642–645. doi:10.1016/S1470-2045(01)00524-1

Jiménez, C., Aguilar, G., Velázquez, A. C., Tachiquin, M. R., & Juárez, C. (2016, September). P005 Molecular karyotype in two mesothelioma cases and four controls with exposure to asbestos. Occup. Environ. Med. 73(Suppl 1), A121. Retrieved from http://dx.doi.org/10.1136/oemed-2016-103951.330


Kamburov, A., Lawrence, M. S., Polak, P., Leshchiner, I., Lage, K., Golub, T. R., . . . Getz, G. (2015, October). Comprehensive assessment of cancer missense mutation clustering in protein structures. Proc. Natl. Acad. Sci. U. S. A. 112(40), E5486–95. doi:10.1073/pnas.1516373112

Kasper, D. L., Fauci, A. S., Hauser, S. L., Longo, D. L., Jameson, J. L., & Loscalzo, J. (2015). Harrison’s principles of internal medicine. McGraw-Hill Medical. Retrieved from http://www.worldcat.org/title/harrisons-principles-of-internal-medicine/oclc/890181375

Kendrew, J. C. (1961). The three-dimensional structure of a protein molecule. Sci. Am. 205, 96–110. doi:10.1038/scientificamerican1261-96

Kessel, A. & Ben-Tal, N. (2011). Introduction to proteins: Structure, function, and motion. CRC Press.

Knudson, A. G. (1971). Mutation and cancer: Statistical study of retinoblastoma. Proc. Natl. Acad. Sci. 68(4), 820–823. doi:10.1073/pnas.68.4.820

Kornblihtt, A. R., Schor, I. E., Alló, M., Dujardin, G., Petrillo, E., & Muñoz, M. J. (2013). Alternative splicing: A pivotal step between eukaryotic transcription and translation. Nat. Rev. Mol. Cell Biol. 14(3), 153–165. doi:10.1038/nrm3525

Kyte, J. & Doolittle, R. F. (1982). A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157(1), 105–132. doi:10.1016/0022-2836(82)90515-0

Lawrence, M. S., Stojanov, P., Mermel, C. H., Robinson, J. T., Garraway, L. A., Golub, T. R., . . . Getz, G. (2014, January). Discovery and saturation analysis of cancer genes across 21 tumour types. Nature, 505(7484), 495–501. doi:10.1038/nature12912


Lee, E. Y. H. P. & Muller, W. J. (2010, October). Oncogenes and tumor suppressor genes. Cold Spring Harb. Perspect. Biol. 2(10), a003236–a003236. doi:10.1101/cshperspect.a003236

Lehman, T. A., Reddel, R., Pfeifer, A. M. A., Spillare, E., Kaighn, M. E., Weston, A., . . . Harris, C. C. (1991). Oncogenes and tumor-suppressor genes. In Environ. health perspect. (Vol. 93, pp. 133–144). doi:10.1289/ehp.9193133

Liu, Q. & Craig, E. A. (2016). Molecular biology: Mature proteins braced by a chaperone. Nature, 539(7629), 361–362. doi:10.1038/nature20470

Liu, T. T. (2016). Noise contributions to the fMRI signal: An overview. Neuroimage, 143, 141–151. doi:10.1016/j.neuroimage.2016.09.008

London, N., Movshovitz-Attias, D., & Schueler-Furman, O. (2010). The structural basis of peptide-protein binding strategies. Structure, 18(2), 188–199. doi:10.1016/j.str.2009.11.012

Loomis, D., Guyton, K. Z., Grosse, Y., Lauby-Secretan, B., El Ghissassi, F., Bouvard, V., . . . Straif, K. (2016). Carcinogenicity of drinking coffee, mate, and very hot beverages. Lancet Oncol. 17(7), 877.

Mermel, C. H., Schumacher, S. E., Hill, B., Meyerson, M. L., Beroukhim, R., & Getz, G. (2011). GISTIC2.0 facilitates sensitive and confident localization of the targets of focal somatic copy-number alteration in human cancers. Genome Biol. 12(4), R41. doi:10.1186/gb-2011-12-4-r41

Modrek, B. & Lee, C. (2002). A genomic view of alternative splicing. Nat. Genet. 30(1), 13–19. doi:10.1038/ng0102-13

Mooney, S. (2005). Bioinformatics approaches and resources for single nucleotide polymorphism functional analysis. doi:10.1093/bib/6.1.44


Mouradov, D., Sloggett, C., Jorissen, R. N., Love, C. G., Li, S., Burgess, A. W., . . . Sieber, O. M. (2014). Colorectal cancer cell lines are representative models of the main molecular subtypes of primary cancer. Cancer Res. 74(12), 3238–3247. doi:10.1158/0008-5472.CAN-14-0013

Muzny, D. M., Bainbridge, M. N., Chang, K., Dinh, H. H., Drummond, J. A., Fowler, G., . . . Thomson, E. (2012, July). Comprehensive molecular characterization of human colon and rectal cancer. Nature, 487(7407), 330–337. doi:10.1038/nature11252

Nesslinger, N. J., Sahota, R. A., Stone, B., Johnson, K., Chima, N., King, C., . . . Nelson, B. H. (2007). Standard treatments induce antigen-specific immune responses in prostate cancer. Clin. Cancer Res. 13(5), 1493–1502. doi:10.1158/1078-0432.CCR-06-1772

Nordling, C. O. (1953). A new theory on the cancer-inducing mechanism. Br. J. Cancer, 7(1), 68–72. doi:10.1038/bjc.1953.8

Nusbaum, C., Mikkelsen, T. S., Zody, M. C., Asakawa, S., Taudien, S., Garber, M., . . . Lander, E. S. (2006). DNA sequence and analysis of human chromosome 8. Nature, 439(7074), 331–335. doi:10.1038/nature04406

Obradovic, Z., Peng, K., Vucetic, S., Radivojac, P., Brown, C. J., & Dunker, A. K. (2003). Predicting intrinsic disorder from amino acid sequence. Proteins, 53 Suppl 6(February), 566–72. doi:10.1002/prot.10532

Ota, T., Suzuki, Y., Nishikawa, T., Otsuki, T., Sugiyama, T., Irie, R., . . . Sugano, S. (2004). Complete sequencing and characterization of 21,243 full-length human cDNAs. Nat. Genet. 36(1), 40–45. doi:10.1038/ng1285


Pauling, L., Corey, R. B., & Branson, H. R. (1951). The structure of proteins: Two hydrogen-bonded helical configurations of the polypeptide chain. Proc. Natl. Acad. Sci. 37(4), 205–211. doi:10.1073/pnas.37.4.205

Porta-Pardo, E. & Godzik, A. (2014, November). E-Driver: A novel method to identify protein regions driving cancer. Bioinformatics, 30(21), 3109–3114. doi:10.1093/bioinformatics/btu499

Pray, L. (2008). DNA replication and causes of mutation. Nat. Educ. 1(1), 214.

Prilusky, J., Felder, C. E., Zeev-Ben-Mordehai, T., Rydberg, E. H., Man, O., Beckmann, J. S., . . . Sussman, J. L. (2005). FoldIndex©: A simple tool to predict whether a given protein sequence is intrinsically unfolded. Bioinformatics, 21(16), 3435–3438. doi:10.1093/bioinformatics/bti537

Rolland, T., Taşan, M., Charloteaux, B., Pevzner, S. J., Zhong, Q., Sahni, N., . . . Vidal, M. (2014). A proteome-scale map of the human interactome network. Cell, 159(5), 1212–1226. doi:10.1016/j.cell.2014.10.050

Rual, J. F., Venkatesan, K., Hao, T., Hirozane-Kishikawa, T., Dricot, A., Li, N., . . . Vidal, M. (2005). Towards a proteome-scale map of the human protein-protein interaction network. Nature, 437(7062), 1173–1178. doi:10.1038/nature04209

Sauna, Z. E. & Kimchi-Sarfaty, C. (2011, August). Understanding the contribution of synonymous mutations to human disease. Nat. Rev. Genet. 12(10), 683–691. doi:10.1038/nrg3051


Semaan, S. M., Wang, X., Stewart, P. A., Marshall, A. G., & Sang, Q. X. A. (2011). Differential phosphopeptide expression in a benign breast tissue, and triple-negative primary and metastatic breast cancer tissues from the same African-American woman by LC-LTQ/FT-ICR mass spectrometry. Biochem. Biophys. Res. Commun. 412(1), 127–131. doi:10.1016/j.bbrc.2011.07.057

Smith, J. A., White, E. A., Sowa, M. E., Powell, M. L. C., Ottinger, M., Harper, J. W., & Howley, P. M. (2010). Genome-wide siRNA screen identifies SMCX, EP400, and Brd4 as E2-dependent regulators of human papillomavirus oncogene expression. Proc. Natl. Acad. Sci. 107(8), 3752–3757. doi:10.1073/pnas.0914818107

Stehelin, D. (1995). Oncogenes and cancer. Science, 267(5203), 1408–1409. doi:10.1126/science.7878455

Supek, F., Miñana, B., Valcárcel, J., Gabaldón, T., & Lehner, B. (2014). Synonymous mutations frequently act as driver mutations in human cancers. Cell, 156(6), 1324–1335. doi:10.1016/j.cell.2014.01.051

Surget, S., Khoury, M. P., & Bourdon, J. C. (2013). Uncovering the role of p53 splice variants in human malignancy: A clinical perspective. doi:10.2147/OTT.S53876

Tamborero, D., Gonzalez-Perez, A., & Lopez-Bigas, N. (2013, September). OncodriveCLUST: Exploiting the positional clustering of somatic mutations to identify cancer genes. Bioinformatics, 29(18), 2238–44. doi:10.1093/bioinformatics/btt395


Tamborero, D., Lopez-Bigas, N., & Gonzalez-Perez, A. (2013). Oncodrive-CIS: A method to reveal likely driver genes based on the impact of their copy number changes on expression. PLoS One, 8(2). doi:10.1371/journal.pone.0055489

Thomas, P. D. & Dill, K. A. [K A]. (1996). An iterative method for extracting energy-like quantities from protein structures. Proc. Natl. Acad. Sci. U. S. A. 93(21), 11628–11633. doi:10.1073/pnas.93.21.11628

Todd, R. & Wong, D. T. (1999). Oncogenes. Anticancer Res. 19(6A), 4729–4746.

Tomczak, K., Czerwińska, P., & Wiznerowicz, M. (2015). The Cancer Genome Atlas (TCGA): An immeasurable source of knowledge. Contemp. Oncol. (Poznań, Poland), 19(1A), A68–77. doi:10.5114/wo.2014.47136

Tomsic, J., He, H., Akagi, K., Liyanarachchi, S., Pan, Q., Bertani, B., . . . de la Chapelle, A. (2015). A germline mutation in SRRM2, a splicing factor gene, is implicated in papillary thyroid carcinoma predisposition. Sci. Rep. 5(1), 10566. doi:10.1038/srep10566

Truant, R., Xiao, H., Ingles, C. J., & Greenblatt, J. (1993). Direct interaction between the transcriptional activation domain of human p53 and the TATA box-binding protein. J Biol Chem, 268(4), 2284–2287.

Uversky, V. N., Gillespie, J. R., & Fink, A. L. (2000, November). Why are "natively unfolded" proteins unstructured under physiologic conditions? Proteins, 41(3), 415–27. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/11025552

Vineis, P., Alavanja, M., Buffler, P., Fontham, E., Franceschi, S., Gao, Y. T., . . . Doll, R. (2004). Tobacco and cancer: Recent epidemiological evidence. JNCI J. Natl. Cancer Inst. 96(2), 99–106. doi:10.1093/jnci/djh014


Vogelstein, B., Papadopoulos, N., Velculescu, V. E., Zhou, S., Diaz, L. A., & Kinzler, K. W. (2013, March). Cancer genome landscapes. Science, 339(6127), 1546–58. doi:10.1126/science.1235122

Ward, J. J., McGuffin, L. J., Bryson, K., Buxton, B. F., & Jones, D. T. (2004). The DISOPRED server for the prediction of protein disorder. Bioinformatics, 20(13), 2138–2139. doi:10.1093/bioinformatics/bth195

Weinberg, R. A. [R. A.]. (1984). Cellular oncogenes. Trends Biochem. Sci. 9(4), 131–133. doi:10.1016/0968-0004(84)90117-8

Weinberg, R. A. [R A]. (1994). Oncogenes and tumor suppressor genes. CA. Cancer J. Clin. 44(3), 160–170. doi:10.3322/canjclin.44.3.160

Wu, S., Yang, Z., Ye, R., An, D., Li, C., Wang, Y., . . . Cai, Z. (2015). Novel variants in MLL confer to bladder cancer recurrence identified by whole-exome sequencing. Oncotarget. doi:10.18632/oncotarget.6380

Xin, X., Rual, J. F., Hirozane-Kishikawa, T., Hill, D. E., Vidal, M., Boone, C., & Thierry-Mieg, N. (2009). Shifted transversal design smart-pooling for high coverage interactome mapping. Genome Res. 19(7), 1262–1269. doi:10.1101/gr.090019.108

Zhao, C., Yasui, K., Lee, C. J., Kurioka, H., Hosokawa, Y., Oka, T., & Inazawa, J. (2003). Elevated expression levels of NCOA3, TOP1, and TFAP2C in breast tumors as predictors of poor prognosis. Cancer, 98(1), 18–23. doi:10.1002/cncr.11482

Zühlke, C., Dalski, A., Schwinger, E., & Finckh, U. (2005, July). Spinocerebellar ataxia type 17: Report of a family with reduced penetrance of an unstable Gln49TBP allele, haplotype analysis supporting a founder effect for unstable alleles and comparative analysis of SCA17 genotypes. BMC Med. Genet. 6(1), 27. doi:10.1186/1471-2350-6-27


Appendix A

TCGA Cancers


TABLE A.1: Reproduction of the information from https://cancergenome.nih.gov/cancersselected listing all 33 cancer types in the TCGA dataset. Note here that Paraganglioma and Pheochromocytoma are grouped together due to being interrelated.

| Tissue Type of Samples | Cancer Type |
| --- | --- |
| Breast | Breast Ductal Carcinoma; Breast Lobular Carcinoma |
| Central Nervous System | Glioblastoma Multiforme; Lower Grade Glioma |
| Endocrine | Adrenocortical Carcinoma; Papillary Thyroid Carcinoma; Paraganglioma & Pheochromocytoma |
| Gastrointestinal | Cholangiocarcinoma; Colorectal Adenocarcinoma; Esophageal Cancer; Liver Hepatocellular Carcinoma; Pancreatic Ductal Adenocarcinoma; Stomach Cancer |
| Gynecologic | Cervical Cancer; Ovarian Serous Cystadenocarcinoma; Uterine Carcinosarcoma; Uterine Corpus Endometrial Carcinoma |
| Head and Neck | Head and Neck Squamous Cell Carcinoma; Uveal Melanoma |
| Hematologic | Acute Myeloid Leukemia; Thymoma |
| Skin | Cutaneous Melanoma |
| Soft Tissue | Sarcoma |
| Thoracic | Lung Adenocarcinoma; Lung Squamous Cell Carcinoma; Mesothelioma |
| Urologic | Chromophobe Renal Cell Carcinoma; Clear Cell Kidney Carcinoma; Papillary Kidney Carcinoma; Prostate Adenocarcinoma; Testicular Germ Cell Cancer; Urothelial Bladder Carcinoma |


Appendix B

Positional Supplemental Information

TABLE B.1: The enrichment table of the top 50 most specific terms with FDR correction for interaction partner set by positional analysis. It is ordered by p-value.

| GO term | Process | p-value |
| --- | --- | --- |
| GO:0006368 | transcription elongation from RNA polymerase II promoter | 9.48e-12 |
| GO:0038095 | Fc-epsilon receptor signaling pathway | 5.49e-10 |
| GO:0006369 | termination of RNA polymerase II transcription | 8.96e-10 |
| GO:0043968 | histone H2A acetylation | 7.73e-09 |
| GO:0042795 | snRNA transcription from RNA polymerase II promoter | 1.52e-08 |
| GO:0016925 | protein sumoylation | 1.05e-07 |
| GO:0002223 | stimulatory C-type lectin receptor signaling pathway | 1.38e-07 |
| GO:1900034 | regulation of cellular response to heat | 4.67e-07 |
| GO:0050821 | protein stabilization | 6.57e-07 |
| GO:1900740 | positive regulation of protein insertion into mitochondrial membrane involved in apoptotic signaling pathway | 1.39e-06 |
| GO:0038128 | ERBB2 signaling pathway | 1.95e-06 |
| GO:0032922 | circadian regulation of gene expression | 2.5e-06 |
| GO:0000184 | nuclear-transcribed mRNA catabolic process, nonsense-mediated decay | 3.68e-06 |
| GO:0070125 | mitochondrial translational elongation | 6.43e-06 |
| GO:1904837 | beta-catenin-TCF complex assembly | 7.03e-06 |
| GO:0050852 | T cell receptor signaling pathway | 1.22e-05 |
| GO:0051123 | RNA polymerase II transcriptional preinitiation complex assembly | 2.65e-05 |
| GO:0070126 | mitochondrial translational termination | 3.5e-05 |
| GO:0043923 | positive regulation by host of viral transcription | 4.82e-05 |
| GO:1902895 | positive regulation of pri-miRNA transcription from RNA polymerase II promoter | 4.82e-05 |
| GO:0030521 | androgen receptor signaling pathway | 4.82e-05 |
| GO:0000086 | G2/M transition of mitotic cell cycle | 0.000109 |
| GO:0070932 | histone H3 deacetylation | 0.00015 |
| GO:0090090 | negative regulation of canonical Wnt signaling pathway | 0.000206 |
| GO:0045899 | positive regulation of RNA polymerase II transcriptional preinitiation complex assembly | 0.00022 |
| GO:0042791 | 5S class rRNA transcription from RNA polymerase III type 1 promoter | 0.00022 |
| GO:0042797 | tRNA transcription from RNA polymerase III promoter | 0.00022 |
| GO:0006283 | transcription-coupled nucleotide-excision repair | 0.000299 |
| GO:0031648 | protein destabilization | 0.000306 |
| GO:0007179 | transforming growth factor beta receptor signaling pathway | 0.000506 |
| GO:1903146 | regulation of mitophagy | 0.000655 |
| GO:0000381 | regulation of alternative mRNA splicing, via spliceosome | 0.000794 |
| GO:1904874 | positive regulation of telomerase RNA localization to Cajal body | 0.000802 |
| GO:0007173 | epidermal growth factor receptor signaling pathway | 0.000809 |
| GO:0007050 | cell cycle arrest | 0.00084 |
| GO:0060766 | negative regulation of androgen receptor signaling pathway | 0.000882 |
| GO:1990440 | positive regulation of transcription from RNA polymerase II promoter in response to endoplasmic reticulum stress | 0.000882 |
| GO:0006978 | DNA damage response, signal transduction by p53 class mediator resulting in transcription of p21 class mediator | 0.000882 |
| GO:0070911 | global genome nucleotide-excision repair | 0.00105 |
| GO:0070527 | platelet aggregation | 0.00137 |
| GO:0070933 | histone H4 deacetylation | 0.00145 |
| GO:0051571 | positive regulation of histone H3-K4 methylation | 0.00155 |
| GO:0070934 | CRD-mediated mRNA stabilization | 0.00178 |
| GO:0071681 | cellular response to indole-3-methanol | 0.00178 |
| GO:1902857 | positive regulation of non-motile cilium assembly | 0.00178 |
| GO:1900026 | positive regulation of substrate adhesion-dependent cell spreading | 0.00186 |
| GO:0043124 | negative regulation of I-kappaB kinase/NF-kappaB signaling | 0.00217 |
| GO:0042769 | DNA damage response, detection of DNA damage | 0.00224 |
| GO:0048025 | negative regulation of mRNA splicing, via spliceosome | 0.00248 |
| GO:0051092 | positive regulation of NF-kappaB transcription factor activity | 0.00263 |
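The caption of Table B.1 mentions FDR correction, and the bibliography cites Benjamini and Hochberg (1995). As an illustration only, the following is a minimal Python sketch of the Benjamini-Hochberg step-up adjustment applied to a set of term p-values. The function names, the `alpha` cutoff of 0.05, and the toy term labels are assumptions made for this sketch; the enrichment tool and exact settings used to produce the table are not stated in this appendix, and the selection of the "most specific" terms is not modeled here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is
    # no larger than the adjusted value of any larger raw p-value.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


def significant_terms(term_p_values, alpha=0.05, top_n=50):
    """Keep terms with adjusted p <= alpha, ordered by raw p-value.

    alpha and top_n are illustrative defaults (assumptions).
    """
    terms = list(term_p_values)
    q_values = benjamini_hochberg([term_p_values[t] for t in terms])
    kept = [(t, term_p_values[t], q)
            for t, q in zip(terms, q_values) if q <= alpha]
    kept.sort(key=lambda row: row[1])
    return kept[:top_n]


if __name__ == "__main__":
    # Hypothetical terms and p-values, not taken from the thesis.
    toy = {"term_a": 0.001, "term_b": 0.2, "term_c": 0.01}
    for term, p, q in significant_terms(toy):
        print(term, p, q)
```

Sorting by the raw p-value after filtering on the adjusted value keeps the ordering convention the caption describes while still controlling the false discovery rate at the chosen level.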


Appendix C

Regional Supplemental Information

TABLE C.1: The enrichment table of the top 50 most specific terms with FDR correction for interaction partner set for regional analysis. It is ordered by p-value.

| GO term | Process | P-value |
| --- | --- | --- |
| GO:0000086 | G2/M transition of mitotic cell cycle | 4.72e-17 |
| GO:0016925 | protein sumoylation | 1.86e-13 |
| GO:0006369 | termination of RNA polymerase II transcription | 9.33e-13 |
| GO:0050821 | protein stabilization | 3.71e-12 |
| GO:1900034 | regulation of cellular response to heat | 1.6e-11 |
| GO:0038095 | Fc-epsilon receptor signaling pathway | 1.82e-11 |
| GO:0032922 | circadian regulation of gene expression | 6.77e-11 |
| GO:0042787 | protein ubiquitination involved in ubiquitin-dependent protein catabolic process | 4.43e-09 |
| GO:0006977 | DNA damage response, signal transduction by p53 class mediator resulting in cell cycle arrest | 7.54e-09 |
| GO:0038096 | Fc-gamma receptor signaling pathway involved in phagocytosis | 1.08e-08 |
| GO:0002223 | stimulatory C-type lectin receptor signaling pathway | 2.18e-08 |
| GO:0042769 | DNA damage response, detection of DNA damage | 3.12e-08 |
| GO:0070979 | protein K11-linked ubiquitination | 3.84e-08 |
| GO:0031145 | anaphase-promoting complex-dependent catabolic process | 4.29e-08 |
| GO:0051092 | positive regulation of NF-kappaB transcription factor activity | 2.08e-07 |
| GO:0030521 | androgen receptor signaling pathway | 6.69e-07 |
| GO:0035329 | hippo signaling | 8.63e-07 |
| GO:0050852 | T cell receptor signaling pathway | 1.3e-06 |
| GO:1900740 | positive regulation of protein insertion into mitochondrial membrane involved in apoptotic signaling pathway | 1.31e-06 |
| GO:0048013 | ephrin receptor signaling pathway | 1.31e-06 |
| GO:0051437 | positive regulation of ubiquitin-protein ligase activity involved in regulation of mitotic cell cycle transition | 1.32e-06 |
| GO:0051436 | negative regulation of ubiquitin-protein ligase activity involved in mitotic cell cycle | 3.72e-06 |
| GO:0070987 | error-free translesion synthesis | 1.35e-05 |
| GO:0070936 | protein K48-linked ubiquitination | 2.67e-05 |
| GO:0051865 | protein autoubiquitination | 3.03e-05 |
| GO:0042771 | intrinsic apoptotic signaling pathway in response to DNA damage by p53 class mediator | 3.7e-05 |
| GO:0000183 | chromatin silencing at rDNA | 5.61e-05 |
| GO:0006283 | transcription-coupled nucleotide-excision repair | 6.69e-05 |
| GO:0007173 | epidermal growth factor receptor signaling pathway | 7.48e-05 |
| GO:0006296 | nucleotide-excision repair, DNA incision, 5’-to lesion | 9.32e-05 |
| GO:0042795 | snRNA transcription from RNA polymerase II promoter | 9.69e-05 |
| GO:0043153 | entrainment of circadian clock by photoperiod | 9.97e-05 |
| GO:0070911 | global genome nucleotide-excision repair | 0.000113 |
| GO:0090263 | positive regulation of canonical Wnt signaling pathway | 0.000148 |
| GO:1902895 | positive regulation of pri-miRNA transcription from RNA polymerase II promoter | 0.000209 |
| GO:0043968 | histone H2A acetylation | 0.000209 |
| GO:0010501 | RNA secondary structure unwinding | 0.00028 |
| GO:0042149 | cellular response to glucose starvation | 0.000312 |
| GO:0006978 | DNA damage response, signal transduction by p53 class mediator resulting in transcription of p21 class mediator | 0.000388 |
| GO:0019886 | antigen processing and presentation of exogenous peptide antigen via MHC class II | 0.000445 |
| GO:0000722 | telomere maintenance via recombination | 0.000549 |
| GO:0070933 | histone H4 deacetylation | 0.000559 |
| GO:0085020 | protein K6-linked ubiquitination | 0.000559 |
| GO:0048208 | COPII vesicle coating | 0.000571 |
| GO:0035666 | TRIF-dependent toll-like receptor signaling pathway | 0.000606 |
| GO:0000289 | nuclear-transcribed mRNA poly(A) tail shortening | 0.00149 |
| GO:0051571 | positive regulation of histone H3-K4 methylation | 0.00154 |
| GO:0070932 | histone H3 deacetylation | 0.00154 |
| GO:0071539 | protein localization to centrosome | 0.00154 |
| GO:0006297 | nucleotide-excision repair, DNA gap filling | 0.00163 |

TABLE C.2: A tabular representation of the mean distribution for each cancer in both mutation profiles.

| Mutation Profile | Sorted Index | Mean Mutation Prevalence |
| --- | --- | --- |
| SKCM_mut | 1 | 0.05592 |
| UCEC_mut | 2 | 0.04359 |
| SKCM_missense | 3 | 0.04068 |
| LUSC_mut | 4 | 0.03718 |
| UCEC_missense | 5 | 0.03534 |
| COADREAD_mut | 6 | 0.03419 |
| STES_mut | 7 | 0.0319 |
| LUAD_mut | 8 | 0.03026 |
| LUSC_missense | 9 | 0.02921 |
| DLBC_mut | 10 | 0.02652 |
| COADREAD_missense | 11 | 0.02478 |
| BLCA_mut | 12 | 0.02418 |
| STES_missense | 13 | 0.02405 |
| ESCA_mut | 14 | 0.02396 |
| LUAD_missense | 15 | 0.02362 |
| ACC_mut | 16 | 0.02213 |
| BLCA_missense | 17 | 0.01863 |
| HNSC_mut | 18 | 0.01835 |
| ESCA_missense | 19 | 0.01826 |
| DLBC_missense | 20 | 0.01766 |
| CESC_mut | 21 | 0.01649 |
| CHOL_mut | 22 | 0.01399 |
| HNSC_missense | 23 | 0.01397 |
| UCS_mut | 24 | 0.01318 |
| ACC_missense | 25 | 0.01279 |
| CESC_missense | 26 | 0.01275 |
| LIHC_mut | 27 | 0.01241 |
| PAAD_mut | 28 | 0.01069 |
| KICH_mut | 29 | 0.01014 |
| UCS_missense | 30 | 0.009828 |
| CHOL_missense | 31 | 0.009493 |
| LIHC_missense | 32 | 0.009161 |
| PAAD_missense | 33 | 0.00839 |
| KIRP_mut | 34 | 0.007188 |
| GBM_mut | 35 | 0.006863 |
| BRCA_mut | 36 | 0.006751 |
| KICH_missense | 37 | 0.006712 |
| TGCT_mut | 38 | 0.006216 |
| SARC_mut | 39 | 0.005862 |
| BRCA_missense | 40 | 0.005157 |
| GBM_missense | 41 | 0.005026 |
| KIRP_missense | 42 | 0.004965 |
| KIRC_mut | 43 | 0.004771 |
| SARC_missense | 44 | 0.004477 |
| OV_mut | 45 | 0.004171 |
| TGCT_missense | 46 | 0.004115 |
| KIRC_missense | 47 | 0.003534 |
| OV_missense | 48 | 0.00321 |
| PRAD_mut | 49 | 0.002833 |
| LGG_mut | 50 | 0.002559 |
| PRAD_missense | 51 | 0.002182 |
| PCPG_mut | 52 | 0.002098 |
| UVM_mut | 53 | 0.001906 |
| LGG_missense | 54 | 0.001894 |
| THYM_mut | 55 | 0.001551 |
| UVM_missense | 56 | 0.001364 |
| THCA_mut | 57 | 0.001338 |
| PCPG_missense | 58 | 0.001327 |
| THYM_missense | 59 | 0.001186 |
| THCA_missense | 60 | 0.001063 |
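A ranking like the one in Table C.2 (per mutation profile) or Table C.3 (per gene) can be sketched as a mean over one axis of a gene-by-profile prevalence matrix, followed by a descending sort. The Python sketch below is illustrative only. It assumes that prevalence is stored as the fraction of samples in a profile carrying a mutation in a gene, and that the "Sorted Index" is the 1-based rank by descending mean; neither the data layout nor the function name `rank_by_mean` comes from the thesis.

```python
from statistics import mean


def rank_by_mean(prevalence, by="profile"):
    """Rank profiles or genes by mean mutation prevalence.

    prevalence maps gene -> {profile: fraction of samples mutated}
    (an assumed layout). Returns (sorted_index, key, mean) tuples,
    highest mean first, with a 1-based index.
    """
    if by not in ("profile", "gene"):
        raise ValueError("by must be 'profile' or 'gene'")
    groups = {}
    for gene, per_profile in prevalence.items():
        for profile, value in per_profile.items():
            key = profile if by == "profile" else gene
            groups.setdefault(key, []).append(value)
    means = {key: mean(values) for key, values in groups.items()}
    ranked = sorted(means.items(), key=lambda kv: kv[1], reverse=True)
    return [(index, key, value)
            for index, (key, value) in enumerate(ranked, start=1)]


if __name__ == "__main__":
    # Hypothetical genes, profiles, and values, not taken from the thesis.
    toy = {
        "GENE_A": {"X_mut": 0.30, "X_missense": 0.20},
        "GENE_B": {"X_mut": 0.10, "X_missense": 0.04},
    }
    for row in rank_by_mean(toy, by="profile"):
        print(row)
    for row in rank_by_mean(toy, by="gene"):
        print(row)
```

Using the same function for both axes keeps the two rankings consistent: only the grouping key changes, not the averaging or the sort.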

TABLE C.3: A tabular representation of the mean mutation prevalence by gene across both profile types. Mean Mutation Sorted Index

Prevalence

TTN

1

0.2814

RYR2

2

0.09741

FLG

3

0.09525

PCLO

4

0.0864

ZFHX4

5

0.07581

XIRP2

6

0.07238

SPTA1

7

0.06872

PCDH15

8

0.06001

PLEC

9

0.05896

FMN2

10

0.05177

HRNR

11

0.04933

COL11A1

12

0.04867

PAPPA2

13

0.04812

NAV3

14

0.048

HYDIN

15

0.04723

FAM135B

16

0.04588

TENM1

17

0.04458

RP1L1

18

0.0445

PEG3

19

0.04319

ZNF208

20

0.04261

MYH2

21

0.04182

C1orf173

22

0.04179

ADAMTS12

23

0.03952


120


Prevalence

EP400

24

0.03951

NPAP1

25

0.03942

RIMS2

26

0.03898

ANKRD30A

27

0.03851

PRDM9

28

0.03781

ZNF804B

29

0.03685

TRIOBP

30

0.03617

MYH7

31

0.03597

UNC79

32

0.0357

MKI67

33

0.03567

COL5A1

34

0.03556

TAF1L

35

0.03507

TNR

36

0.03437

PCDH11X

37

0.03414

MYH4

38

0.03411

CACNA1A

39

0.03348

SRRM2

40

0.03339

HCN1

41

0.03321

SCN2A

42

0.03298

ZNF804A

43

0.03268

MYH8

44

0.03246

SCN10A

45

0.03207

CDH9

46

0.0316

MYH13

47

0.0314

MYO3A

48

0.03064

MYT1L

49

0.03057

ZNF676

50

0.03044

PCDH10

51

0.03044

PTPRZ1

52

0.03037

NALCN

53

0.03005

GPR158

54

0.02883

TNRC18

55

0.0288

SORCS1

56

0.02846

ZFPM2

57

0.0283


121


Prevalence

ADCY2

58

0.02825

MYH15

59

0.02824

PRG4

60

0.02801

BOD1L1

61

0.02779

KNDC1

62

0.02769

MAP2

63

0.02766

GOLGB1

64

0.02704

KIF1A

65

0.02657

FAM47A

66

0.02654

CDH18

67

0.02652

AFF2

68

0.02649

PPFIA2

69

0.02648

TRPS1

70

0.02645

ANKRD11

71

0.02634

ZNF99

72

0.02585

BCLAF1

73

0.02567

MAP1A

74

0.02525

SPTAN1

75

0.02524

COL19A1

76

0.02458

KIAA2018

77

0.02438

GPR179

78

0.02416

ZNF462

79

0.02389

HDAC9

80

0.02383

CENPE

81

0.02326

ZBBX

82

0.02279

XIRP1

83

0.02245

ZNF469

84

0.02237

MYH10

85

0.02211

USP29

86

0.02203

LRP4

87

0.022

CTNNA3

88

0.02198

TNRC6A

89

0.02157

NEFH

90

0.02137

SPATA31D1

91

0.02135


122


Prevalence

FAM83B 92 0.0213
ZNF91 93 0.02106
NES 94 0.02092
PDE1C 95 0.02073
COL6A2 96 0.02071
KIF21B 97 0.02061
ZNF407 98 0.02049
WDR33 99 0.02049
RERE 100 0.02035
ZNF638 101 0.02027
MEFV 102 0.02026
MYPN 103 0.02013
ATXN1 104 0.01986
FSCB 105 0.01976
KCND2 106 0.01972
ST6GAL2 107 0.01969
GON4L 108 0.01947
GPRIN2 109 0.01923
POTEG 110 0.01909
CHRM2 111 0.01904
WDR96 112 0.01902
TTBK1 113 0.01895
PDZRN3 114 0.01887
TJP1 115 0.01886
MAGI1 116 0.01886
RPTN 117 0.01879
ZFC3H1 118 0.01867
LRRTM4 119 0.01851
IRS4 120 0.01834
TNIK 121 0.01824
TCEB3B 122 0.01812
TULP4 123 0.0181
PAK7 124 0.01787
SULF1 125 0.01776
ZEB1 126 0.01764
ZNF479 127 0.01748
PRRC2C 128 0.01745
ATAD2 129 0.01736
YLPM1 130 0.01731
LRRC66 131 0.01726
LRRIQ3 132 0.01725
DDX11 133 0.01723
GPATCH8 134 0.01714
PDZRN4 135 0.0171
TMC5 136 0.01705
RGS7 137 0.01686
TRPC7 138 0.01683
ATN1 139 0.01683
ANKRD12 140 0.01667
IVL 141 0.01651
USP31 142 0.01632
ZNF845 143 0.01631
WDR87 144 0.01619
ZIC1 145 0.01606
SHROOM2 146 0.01606
SORBS2 147 0.01606
ZNF257 148 0.01569
FAM184A 149 0.01563
TICRR 150 0.01559
NOL4 151 0.01543
SNED1 152 0.01542
ZNF33A 153 0.01539
KCNA4 154 0.01537
TNRC6B 155 0.01535
ITSN2 156 0.01531
SRRM4 157 0.01526
WWC3 158 0.01521
HGF 159 0.01514
STON1-GTF2A1L 160 0.01512
RBMXL3 161 0.01507
ZNF285 162 0.01505
CCDC102A 163 0.01498
ZNF217 164 0.01493
ZNF835 165 0.01493
KCNN3 166 0.01491
TCHHL1 167 0.01485
AKAP12 168 0.01448
PCF11 169 0.01446
PPFIA1 170 0.01443
FAM123C 171 0.01442
NRD1 172 0.01442
SOGA3 173 0.01436
HRC 174 0.0142
WDR66 175 0.0142
ZNF608 176 0.01408
RPH3A 177 0.01408
PPP1R9A 178 0.01406
ZBTB20 179 0.01405
NKTR 180 0.01399
APOBR 181 0.01398
AMOT 182 0.01397
ZFP64 183 0.01396
ZNF585B 184 0.01395
ZNF43 185 0.01385
ZNF334 186 0.01371
PKP4 187 0.01366
ZBTB38 188 0.01365
EIF3A 189 0.01364
FAM13C 190 0.01351
ZMYND8 191 0.01346
ZNF667 192 0.01336
SGOL2 193 0.01327
RUNX2 194 0.01325
FYB 195 0.01325
ZNF135 196 0.01324
PCMTD1 197 0.01323
ZIC4 198 0.0132
NOM1 199 0.01318
ZNF532 200 0.01317
NPAS3 201 0.01311
ZCCHC5 202 0.01305
ZNF445 203 0.01302
PHACTR3 204 0.01301
TONSL 205 0.01301
BMP2K 206 0.013
ZNF347 207 0.01296
FOXP2 208 0.01288
TOP2A 209 0.01287
HIST1H1E 210 0.01284
ZNF534 211 0.01282
TRIM51 212 0.01279
ZNF254 213 0.01275
MAP4K4 214 0.01271
TSKS 215 0.01247
ZKSCAN2 216 0.01245
NSUN2 217 0.01241
CRNN 218 0.01239
PPP1R16B 219 0.01235
PLEKHG3 220 0.01234
ZNF616 221 0.01221
WWP1 222 0.0122
C8orf34 223 0.01206
ZNF85 224 0.01197
ZNF711 225 0.01196
TRIM55 226 0.01193
USH1C 227 0.01192
MNDA 228 0.01191
TBP 229 0.01189
KCTD8 230 0.01183
ZNF615 231 0.01182
FAM184B 232 0.01177
WWC1 233 0.01176
SYCP1 234 0.01175
CCDC105 235 0.01172
SMG6 236 0.01169
USP54 237 0.01169
ZC3H18 238 0.01168
PYHIN1 239 0.01167
ZNF268 240 0.01166
AZI1 241 0.01165
ZNF234 242 0.01165
RLIM 243 0.01163
TRIML2 244 0.0116
TRAPPC12 245 0.01159
SEMG2 246 0.01156
WDR64 247 0.01144
ZNF107 248 0.01143
ZNF471 249 0.01132
ZNF780A 250 0.0113
ZNF607 251 0.01129
ZNF454 252 0.01126
ZNF100 253 0.01118
HIST1H1C 254 0.01116
TTLL2 255 0.01114
WWP2 256 0.01113
SRRT 257 0.01112
PEX5L 258 0.01111
RBMX 259 0.01109
YTHDC1 260 0.01107
ZFP28 261 0.01104
ZNF71 262 0.01101
CYLC2 263 0.01096
ZNF528 264 0.01091
UBE2O 265 0.01089
PRAM1 266 0.01087
ZNF189 267 0.01086
GPR101 268 0.01085
ZFR2 269 0.01076
SV2A 270 0.01076
NOL8 271 0.01076
ZNF594 272 0.0107
FAM13A 273 0.0107
BBX 274 0.0107
TRAK1 275 0.01065
RSBN1 276 0.01061
ZNF300 277 0.01061
SDPR 278 0.0106
ZNF473 279 0.01059
TRDN 280 0.01058
ZMIZ1 281 0.01058
RINL 282 0.01058
ATXN2 283 0.01057
ZNF696 284 0.01057
MPHOSPH8 285 0.01055
ZNF709 286 0.01052
DPCR1 287 0.01047
ZNF180 288 0.01043
ZNF28 289 0.01042
RBMXL1 290 0.01037
CALD1 291 0.01037
CGN 292 0.01035
ZSCAN5B 293 0.01033
EIF5B 294 0.01032
ZNF527 295 0.01028
ZNF496 296 0.01025
ZNF799 297 0.01024
ARID3A 298 0.01022
SCARF2 299 0.0102
TUB 300 0.01016
HNRNPUL1 301 0.01014
ZNF415 302 0.01013
ZIC3 303 0.01012
CXXC1 304 0.01008
BCAS1 305 0.01005
ZNF568 306 0.009858
ZNF777 307 0.009848
RBM25 308 0.009825
EHBP1 309 0.009812
RBMXL2 310 0.009806
PRICKLE1 311 0.009803
C3orf30 312 0.009766
USP6NL 313 0.009757
ZNF610 314 0.009675
MAP9 315 0.009668
ZNF546 316 0.009622
WASF3 317 0.009568
ZSCAN18 318 0.009553
HTATSF1 319 0.009516
ZFP106 320 0.0095
REST 321 0.009485
TXLNB 322 0.009394
TTBK2 323 0.009381
PPM1E 324 0.009378
CT47B1 325 0.009321
ZNF658 326 0.009312
UBTF 327 0.00931
MUC15 328 0.009282
LIMA1 329 0.009279
ZNF157 330 0.00927
ZNF844 331 0.009242
PDZD4 332 0.009233
JPH1 333 0.00922
ZFP2 334 0.009197
TARSL2 335 0.009183
ZNF442 336 0.009183
PPIG 337 0.009165
ZSCAN10 338 0.009129
CLIC6 339 0.00912
NOP14 340 0.009024
ZNF582 341 0.008988
PJA1 342 0.008985
FAM13B 343 0.008957
ZBTB41 344 0.008939
LUZP2 345 0.008937
ZNF613 346 0.008931
TTC14 347 0.008921
NASP 348 0.008907
GPRIN1 349 0.008891
ZNF813 350 0.008883
SOWAHB 351 0.008851
ZNF230 352 0.008836
ZNF329 353 0.008811
PRKCSH 354 0.008792
ZNF618 355 0.008768
CACTIN 356 0.008755
ZNF790 357 0.00875
TRIM6-TRIM34 358 0.008747
PENK 359 0.008733
ZCWPW1 360 0.008733
TGIF2LX 361 0.008725
KDM4A 362 0.008705
ZNF574 363 0.008687
ZNF583 364 0.008677
ZNF599 365 0.008668
ZNF160 366 0.008662
ZNF16 367 0.00866
SH3PXD2B 368 0.008642
ZNF461 369 0.008634
ZBTB46 370 0.008607
ZNF14 371 0.008589
GRIPAP1 372 0.008586
ZNF235 373 0.008585
ZNF251 374 0.008546
UTP14A 375 0.008544
NEXN 376 0.008514
SAMD15 377 0.008512
ZNF507 378 0.00851
ABRA 379 0.008502
URI1 380 0.008501
FRMD6 381 0.008482
ZKSCAN5 382 0.00848
ZNF519 383 0.008469
ZNF737 384 0.008452
PRPF4B 385 0.008451
ZFP112 386 0.008445
POU3F2 387 0.008441
RBM12B 388 0.0084
ZNF484 389 0.008394
ZNF385B 390 0.008377
CPSF6 391 0.008376
ZNF467 392 0.00833
TUSC3 393 0.008327
ZNF485 394 0.008302
ZNF483 395 0.008297
ZFYVE20 396 0.008292
FRG2B 397 0.008245
FTSJ3 398 0.008224
HS6ST1 399 0.008181
ZNF214 400 0.008157
ZNF416 401 0.008132
ZFP90 402 0.008112
ZNF41 403 0.008111
ZNF816 404 0.008109
ZNF683 405 0.008109
ZNF551 406 0.008036
PHACTR1 407 0.008006
PDYN 408 0.007985
SLC16A2 409 0.007976
ZNF420 410 0.007973
BACH1 411 0.007962
ZNF195 412 0.007928
ZNF167 413 0.007917
ZFX 414 0.007909
ZNF540 415 0.007904
CEP112 416 0.007892
ZIM2 417 0.007884
ZNF304 418 0.00788
NBPF3 419 0.00787
ZNF732 420 0.007869
ARHGAP23 421 0.007862
ZNF652 422 0.007859
ZNF624 423 0.007858
SPANXN2 424 0.007855
ZNF20 425 0.007805
ZNF93 426 0.007729
TMEM200C 427 0.007717
MAGEB1 428 0.007713
ZNF430 429 0.007697
ZFP91 430 0.007678
ZSCAN4 431 0.007653
ZNF500 432 0.00765
DMP1 433 0.007603
TTLL11 434 0.007594
ZNF141 435 0.007593
ZNF567 436 0.007582
ZNF211 437 0.007579
SPERT 438 0.007573
ZFP30 439 0.007554
OS9 440 0.007547
VIM 441 0.007506
ZNF358 442 0.007487
ZNF286A 443 0.00747
ZNF770 444 0.007452
ZNF678 445 0.00745
ZNF227 446 0.00745
USP51 447 0.007443
GAB1 448 0.007426
ZNF317 449 0.007418
ZNF671 450 0.007415
ZNF544 451 0.007407
DMKN 452 0.007404
ZNF486 453 0.007343
ZNF226 454 0.007314
ZIK1 455 0.00731
ZNF555 456 0.00731
ZNF324 457 0.007293
ZNF502 458 0.007277
ZNF77 459 0.007249
ZNF354B 460 0.007238
ZC3H12D 461 0.007233
ZNF729 462 0.007123
ZNF83 463 0.007118
SPARCL1 464 0.007084
NR1H4 465 0.007063
TNIP3 466 0.007052
ZNF101 467 0.007004
ZNF697 468 0.006987
ZNF529 469 0.006984
ZNF746 470 0.006983
ZNF80 471 0.006972
ZNF530 472 0.006906
ZNF563 473 0.006872
ZNF382 474 0.006853
HIST1H1D 475 0.00685
PTRF 476 0.006842
ZNF132 477 0.006838
ZNF768 478 0.006834
TRAT1 479 0.006829
VSTM4 480 0.006826
ZCWPW2 481 0.006824
ZNF829 482 0.006817
ZNF419 483 0.006796
ZNF175 484 0.006795
FAM71E2 485 0.006789
NUMBL 486 0.006754
ZNF17 487 0.006746
LRP11 488 0.006731
ZNF212 489 0.006702
ZNF782 490 0.006659
ZSCAN5A 491 0.006609
ZNF823 492 0.006576
ZNF248 493 0.006546
ZNF557 494 0.006546
ZNF793 495 0.006521
ZC4H2 496 0.006504
ZFP36L2 497 0.006475
ZNF10 498 0.006473
GOLGA6L6 499 0.006466
HLA-DRB5 500 0.00645
ZNF354A 501 0.00641
RSPH4A 502 0.006397
ZNF169 503 0.00639
EIF1AX 504 0.006344
ZNF682 505 0.00634
ZNF655 506 0.006283
ZNF311 507 0.006207
ZNF699 508 0.006197
PHACTR2 509 0.006193
ZNF253 510 0.006134
ZNF394 511 0.006078
ZNF260 512 0.005972
ZNF713 513 0.005955
ZNF571 514 0.005953
ZNF266 515 0.005905
ZNF490 516 0.005886
ANKRD36C 517 0.005875
ZNF662 518 0.005856
ZNF205 519 0.005846
ZSCAN2 520 0.005818
HKR1 521 0.005795
ZNF48 522 0.00578
SDAD1 523 0.005771
ZNF174 524 0.005753
NBPF7 525 0.005728
ZNF117 526 0.005683
GAP43 527 0.005679
C7orf60 528 0.005666
ZNF177 529 0.005659
TNNT1 530 0.005633
ZNF398 531 0.005584
C9orf66 532 0.005582
ZNF701 533 0.005528
RNFT2 534 0.005524
ZNF449 535 0.005503
ZNF498 536 0.005495
ZNF883 537 0.005477
HMGB3 538 0.005466
ZNF286B 539 0.005449
OCEL1 540 0.005428
ZNF735 541 0.005427
ZNF25 542 0.005427
ZNF785 543 0.005411
ZNF343 544 0.005383
ZNF707 545 0.005315
SHOX 546 0.005287
ZNF8 547 0.005256
ZNF225 548 0.005233
NTN5 549 0.005218
C1orf198 550 0.005201
SPANXN3 551 0.005168
ZNF510 552 0.005162
ZNF79 553 0.005157
ZNF562 554 0.005152
TUSC1 555 0.00513
ZNF323 556 0.00513
ZNF432 557 0.005127
ZWINT 558 0.005077
ZNF689 559 0.005073
SDCCAG3 560 0.005054
ZNF239 561 0.005019
SOX9 562 0.004973
ZNF573 563 0.004969
EN1 564 0.004942
SRFBP1 565 0.004761
HMGB2 566 0.004671
ZNF705A 567 0.004669
E2F5 568 0.004645
CWC27 569 0.004638
RALY 570 0.004605
ZNF436 571 0.00458
ZNF70 572 0.004535
ZNF92 573 0.004523
ZNF812 574 0.00444
ZNF200 575 0.004388
YBX2 576 0.004381
C10orf95 577 0.004339
HEXIM1 578 0.004298
ZNF672 579 0.00427
RNF113A 580 0.004257
SPATA31A3 581 0.004195
XRCC4 582 0.004158
SURF6 583 0.004156
ZNF193 584 0.004122
ZNF75A 585 0.004115
ZNF497 586 0.004085
UTP18 587 0.004023
ZNF18 588 0.004016
ZNF670 589 0.003939
ZNF565 590 0.003937
ZNF620 591 0.003936
ZNF736 592 0.003932
VCX 593 0.003881
ZNF34 594 0.003874
GRB2 595 0.003816
ZNF627 596 0.003759
ZAR1L 597 0.003724
ZNF154 598 0.003711
VCX3B 599 0.003642
ZNF275 600 0.003629
ZNF805 601 0.003597
TNNI2 602 0.003529
ZNF501 603 0.003506
FAM157A 604 0.0035
ZNF728 605 0.003473
ZNF674 606 0.003403
PROCA1 607 0.003322
ZNF717 608 0.00329
RPS6 609 0.00329
ZSCAN16 610 0.003272
UBXN1 611 0.003271
AKAP2 612 0.0031
FAM21A 613 0.003099
ZNF706 614 0.002971
ZNF32 615 0.002964
ZNF367 616 0.00286
DLEU7 617 0.002552
HEXIM2 618 0.00252
ZNF524 619 0.002509
PQBP1 620 0.002315
HMGN5 621 0.002314
ZNF705G 622 0.001371
VCX2 623 0.0009513