predicting gene function using dna microarray data

0 downloads 0 Views 1MB Size Report
to thank my parents, Dr. Robert Stuart and Deena Stuart for their ... I want to especially thank my wife, Andrea Preble, who took on the even greater project of ...
PREDICTING GENE FUNCTION USING DNA MICROARRAY DATA FROM MULTIPLE ORGANISMS

a dissertation submitted to the program in biomedical informatics and the committee on graduate studies of stanford university in partial fulfillment of the requirements for the degree of doctor of philosophy

Joshua Michael Stuart December 2003

c Copyright by Joshua Michael Stuart 2004

All Rights Reserved

ii

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Stuart K. Kim (Deptartments of Developmental Biology and Genetics) (Principal Adviser) I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Art B. Owen (Department of Statistics) I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Russ B. Altman (Department of Genetics) Approved for the University Committee on Graduate Studies.

iii

Abstract The genome projects have recently established a nearly complete list of genes in several genomes, revealing that most of the genes in multicellular organisms have yet to be characterized genetically. To understand genetic mechanisms fully, one of the next challenges facing biology is to elucidate the role of every gene in the genome, especially for the large fraction of genes whose functions are currently unknown. Microarrays have created an unprecedented opportunity for investigating gene expression under a variety of different conditions on a genome-wide scale. Coexpression of genes under different conditions can provide information about gene function. However, coexpression detected in a single organism may not always imply functional relatedness. Evolutionary conservation is a powerful criterion to identify genes that are functionally important from a set of coregulated genes. Coregulation of a pair of genes over large evolutionary distances implies that the coregulation confers a selective advantage, most likely because the genes are functionally related. This thesis describes a method for identifying evidence of such conserved gene coexpression from DNA

iv

microarray data. Applying the method to the compendiums of DNA microarray data from human, fly, worm, and yeast identified a network containing 3416 genes linked by 22,163 conserved interactions. Many of these genes in the resulting network are uncharacterized and therefore the results provide an opportunity for predicting their functions for the first time. Several areas of highly interconnected genes were found in the network, revealing an overall structure present among the coexpression links. Computational evaluations and experimental verification demonstrate that the multiple species approach outperforms those based on expression data from only a single organism. Combining microarray data across multiple organisms can therefore improve gene function prediction and provides clues about the way conserved genetic modules evolve.

v

Acknowledgements Working on this thesis was truly a joy. I can hardly believe it’s come to an end. Right now, I feel the same happy sadness I’ve felt after reading to the end of a book series I liked and still wanting the characters to live on. My incredibly positive experience here was made possible by educational and personal support. Stanford University was a truly great place to study. I enjoyed the interaction and collaboration with many incredible people on a daily basis for five years. The intellectual energy and curiousity of the people I worked with really made every day of this project worthwhile. In fact, I only have one gripe to level at Stanford – the amount of time allocated for recreational swimming at the aquatic center was a joke! Oh well, it still was an amazing facility to use, even if you had to swim during lunch or dinner to experience it. I want to especially thank my adviser, Dr. Stuart Kim. His love of discovery and pursuit of scientific truths were inspirational. He made my graduate career fun and memorable. Working in his laboratory was like having the chance to accompany an

vi

explorer on his next great expedition. I want to thank Stuart for inviting me along for the adventure. I also want to thank my other mentors. I want to thank Dr. Art Owen for his keen insights and for our many conversations about furthering methods for gene prediction. I also want to thank Art for his help on LaTeX. I want to thank Dr. Daphne Koller for her incredible classes that inspired me in my research. I also want to thank Dr. Russ Altman for his enthusiasm and support throughout my graduate career. I want to thank the Kim lab for letting a me join the group. The Kim lab that I knew consisted of: Kathy Mach, Peter Roy, Jim Lund, Min Jiang, Rebecca Sonu, Graham Rodwell, Irene Liu, Flo Pauli, and John Wang. I looked forward to working with them everyday because they were such a fun bunch of people. I want to thank them for their help in preparing talks. I especially want to thank Dr. Peter Roy for invigorating scientific discussions. I want to thank my fellow classmates in Biomedical Informatics. They gave my experience at Stanford a sense of community. My classmates, the faculty, and the support staff at Stanford Medical Informatics, gave me a family away from home. I especially want to thank Darlene Vian for her wisdom and help (and parties!) these past five years. Darlene: you’re a gem! I want to thank my friend and colleague, Eran Segal, for many late nights of computing back-to-back and only eating $0.25 ramen packages.

vii

I want to thank my classmate Dr. Soumya Raychaudhuri. In many ways the work in this thesis was most influenced by Soumya. I want to thank him for sharing his scientific imagination with me, which often helped me navigate my way through this project. I especially want to thank Soumya for his friendship. I still miss our late night “tea walks” around campus that helped me get through the first year. I want to thank my family for their love and support thoughout the years. I want to thank my parents, Dr. Robert Stuart and Deena Stuart for their encouragement and faith. Under their guidance, pursuing what I love came naturally. I want to thank my little brother, Jacob Stuart, and older sister, Erin Tankersley, for being my friend longer than anyone. I want to thank Jake for loving me even though I’ve given him every stitch he has in his body. I take that back since he did tell on me a number of times when I cussed. I want to thank my older sister, Erin Stuart, for her carefree spirit and for demonstrating what a good parent can be. I want to thank the Prebles – Warren, Cecile, and Brian – for welcoming me into their family like a new son. They are the greatest in-laws a guy could have. I want to especially thank my wife, Andrea Preble, who took on the even greater project of parenthood while I worked on this thesis. I want to thank her for the many sacrifices she’s made while raising our children during this period. I want to thank her for keeping her humor somehow after days without sleep. Her strength and compassion are amazing.

viii

I also want to thank Andrea for helping me write this dissertation. Andrea helped me see the perspective of a molecular biologist that helped me understand how to make the work more relevant. I want to thank her for editing multiple drafts of the manuscript and for all the support and encouragement. Andrea: I love you!

ix

Contents

Abstract

iv

Acknowledgements

vi

1 Introduction

1

1.1 Thesis motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4

1.2 Functional modules from gene expression . . . . . . . . . . . . . . . .

5

1.2.1

Global studies of gene expression . . . . . . . . . . . . . . . .

5

1.2.2

Evaluating module prediction . . . . . . . . . . . . . . . . . .

8

1.3 Identification of gene modules de novo . . . . . . . . . . . . . . . . .

10

1.3.1

Compendiums . . . . . . . . . . . . . . . . . . . . . . . . . . .

11

1.3.2

Global cluster analysis . . . . . . . . . . . . . . . . . . . . . .

14

1.3.3

Context-dependent clustering . . . . . . . . . . . . . . . . . .

15

1.4 Bootstrapping modules from known pathways . . . . . . . . . . . . .

17

1.5 Finding conserved modules . . . . . . . . . . . . . . . . . . . . . . . .

18

x

2 Expression terrain map

23

2.1 Identifying coexpression neighbors . . . . . . . . . . . . . . . . . . . . 2.1.1

27

Measuring similarity in gene function . . . . . . . . . . . . . .

28

2.2 Constructing a visual resource of coexpression . . . . . . . . . . . . .

31

2.3 Testing the significance of the terrain map . . . . . . . . . . . . . . .

34

2.3.1

Concordance with known biology . . . . . . . . . . . . . . . .

34

2.3.2

Significance of concordance . . . . . . . . . . . . . . . . . . . .

36

2.3.3

Using concordance to compare clusterings . . . . . . . . . . .

40

2.3.4

New concordance relationships for known functional groups . .

43

2.4 Predicting gene function with the terrain map . . . . . . . . . . . . .

44

2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

47

2.5.1

Similarity metric . . . . . . . . . . . . . . . . . . . . . . . . .

49

2.5.2

Global versus local trade-off . . . . . . . . . . . . . . . . . . .

50

3 Gene recommender

52

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

52

3.2 Normalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

62

3.3 Finding a subset of informative experiments . . . . . . . . . . . . . .

63

3.4 Finding new candidate genes . . . . . . . . . . . . . . . . . . . . . . .

64

3.4.1

Scoring genes . . . . . . . . . . . . . . . . . . . . . . . . . . .

64

3.4.2

Estimating the likelihood of membership . . . . . . . . . . . .

67

xi

3.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

68

3.5.1

Approximate precision and recall . . . . . . . . . . . . . . . .

70

3.5.2

Cross-validation . . . . . . . . . . . . . . . . . . . . . . . . . .

75

3.5.3

RNAi knock-down . . . . . . . . . . . . . . . . . . . . . . . .

77

3.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

79

3.6.1

Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . . .

80

3.6.2

Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

82

4 Multiple species network

85

4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

85

4.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

89

4.2.1

Relating genes across organisms: meta-genes . . . . . . . . . .

90

4.2.2

Identifying conserved coexpression partners

. . . . . . . . . .

90

4.2.3

Clustering and visualizing a conserved coexpression network .

98

4.2.4

Measuring the conservation of a cluster . . . . . . . . . . . . .

99

4.3 Results and Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 100 4.3.1

Randomizing orthology . . . . . . . . . . . . . . . . . . . . . . 103

4.3.2

Robustness to noise . . . . . . . . . . . . . . . . . . . . . . . . 106

4.3.3

Generalization to experimental choice . . . . . . . . . . . . . . 107

4.3.4

Improvement over single species . . . . . . . . . . . . . . . . . 109

4.3.5

Functional modules of conserved coexpression . . . . . . . . . 114 xii

4.3.6

Biological confirmation . . . . . . . . . . . . . . . . . . . . . . 117

4.4 Systems-level analysis of conserved coexpression . . . . . . . . . . . . 119 4.4.1

Newly-evolved and ancient modules . . . . . . . . . . . . . . . 119

4.4.2

Significant abundance of large modules . . . . . . . . . . . . . 125

4.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128 5 Contributions and future directions

130

5.1 Increasing the breadth of discovered modules . . . . . . . . . . . . . . 131 5.1.1

Integrating multiple different microarray platforms . . . . . . 131

5.1.2

Identifying mammalian-specific conserved modules . . . . . . . 132

5.2 Algorithmic developments for multiple species analysis . . . . . . . . 133 5.2.1

Recommending genes across organisms . . . . . . . . . . . . . 133

5.2.2

Alternative estimates of functional coexpression . . . . . . . . 134

5.3 More complete modules with integrative approaches . . . . . . . . . . 135 A Principal components analysis

138

A.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 A.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141 A.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144 B Chromosomal clustering

147

B.1 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150

xiii

B.1.1 Intergenic distances . . . . . . . . . . . . . . . . . . . . . . . . 151 B.1.2 Detecting chromosomal clustering . . . . . . . . . . . . . . . . 152 B.1.3 Assigning genes to operons . . . . . . . . . . . . . . . . . . . . 152 B.1.4 Accounting for tandem duplications . . . . . . . . . . . . . . . 152 B.1.5 Sampling to assess significance of clustering . . . . . . . . . . 153 B.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154 B.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160 C Predicting orthology

162

C.1 Meta-genes: identifying orthologous sets of genes

. . . . . . . . . . . 162

C.2 Best-reciprocal similarity . . . . . . . . . . . . . . . . . . . . . . . . . 163 C.3 Transitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166 C.4 Discussion and evaluation of the meta-gene orthology . . . . . . . . . 167 Bibliography

170

xiv

List of Tables 1.1 Unknown genes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1 Examples of query genes.

. . . . . . . . . . . . . . . . . . . . . . . .

2 57

3.2 Top scoring genes for the Rb query as ranked by the score given by the gene recommender.

. . . . . . . . . . . . . . . . . . . . . . . . . . .

69

3.3 The gene recommender yields a more precise list of candidates than the topomap. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.4 lin-8 suppresses wrm-1 lethality.

. . . . . . . . . . . . . . . . . . . .

74 79

4.1 Precision of multi-species network on cell proliferation compared to single species.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110

4.2 Network components and their biological functions. . . . . . . . . . . 116 A.1 Results of PCA on the sporulation time series data.

. . . . . . . . . 141

B.1 Genes expressed in the same tissue tend to cluster along the chromosomes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161 xv

List of Figures 1.1 Coordinate expression of ribosomal subunits.

. . . . . . . . . . . . .

3

1.2 A compendium of expression data built from individual microarray experiments.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

12

2.1 A topological landscape of gene coexpression for C. elegans. . . . . .

32

2.2 A topological map derived from randomly permuted data. . . . . . .

33

2.3 An overlap matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . .

36

2.4 The topomap is enriched for concordance with known biology. . . . .

39

2.5 Comparing clustering methods based on their concordance with known biology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

42

2.6 Correspondence with clusters can be used to relate gene functional categories.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

44

3.1 Global clustering can miss significant, but under-represented correlations.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

xvi

54

3.2 The retinoblastoma pathway.

. . . . . . . . . . . . . . . . . . . . . .

55

3.3 Defining a prediction set from the C. elegans topomap. . . . . . . . .

58

3.4 The Rb query genes show tight coregulation on a subset of experiments in the C. elegans microarray database. . . . . . . . . . . . . . . . . .

59

3.5 Some DNA microarray experiments show coregulation of the genes in the Rb query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

64

3.6 Restriction to experimental subspaces can change the apparent similarity of genes.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

65

3.7 The gene recommender needs fewer genes to enclose the original query. 73 3.8 The leave-one-out procedure.

. . . . . . . . . . . . . . . . . . . . . .

75

. . . . . . . . . . . . . . . . . . . . .

76

3.10 JC8.6(RNAi) results in a Muv phenotype similar to lin-35 Rb(RNAi).

78

3.9 Leave-one-out cross-validation.

4.1 The method for computing P-values to detect evidence of conserved coexpression. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

95

4.2 A two-dimensional surface of meta-genes based on their conserved coexpression P-values.

. . . . . . . . . . . . . . . . . . . . . . . . . . .

4.3 Visualizing the network of meta-genes in VxInsight [26].

99

. . . . . . . 102

4.4 The number of meta-gene interactions is significant compared to randomization controls. . . . . . . . . . . . . . . . . . . . . . . . . . . . 105

xvii

4.5 Similar meta-gene interactions can be obtained even after introducing various levels of Gaussian noise.

. . . . . . . . . . . . . . . . . . . . 106

4.6 Splitting the datasets into random halves yields approximately the same set of meta-gene interactions. . . . . . . . . . . . . . . . . . . . 108 4.7 The multiple species network is more accurate for predicting gene function than any of the single-species networks. . . . . . . . . . . . . . . 112 4.8 The superior performance of the multiple species network for predicting gene function is not due simply to larger database size. . . . . . . . . 113 4.9 Multiple species coexpression landscape. . . . . . . . . . . . . . . . . 114 4.10 Genes predicted to be involved in cell proliferation by the multiple species network are differentially regulated in pancreatic cancer. . . . 118 4.11 RNAi-induced phenotype of ZK652.1.

. . . . . . . . . . . . . . . . . 120

4.12 The multiple species network contains large components with differing levels of conservation.

. . . . . . . . . . . . . . . . . . . . . . . . . . 123

4.13 Distribution of number of links for each meta-gene compared to randomly generated networks.

. . . . . . . . . . . . . . . . . . . . . . . 126

A.1 Plot of eigenvalues of the principal components. . . . . . . . . . . . . 142 A.2 Plots of the coefficients of the first three principal components.

. . . 143

A.3 PCA enriches for genes with known sporulation-related promoter motifs.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

xviii

B.1 The definition of a “close” inter-genic distance.

. . . . . . . . . . . . 149

B.2 The sampled distribution of the number of intergenic distances can often be well-approximated by a normal distribution. . . . . . . . . . 154 B.3 The chromosomal positions of the genes specifically expressed in C. elegans muscle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155 B.4 Histogram comparing random with observed muscle gene clusters.

. 158

B.5 Muscle genes are clustered in small groups of 2-5 genes along the chromosomes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159 C.1 Meta-genes on H. sapiens D. melanogaster C. elegans and S. cerevisiae. 164 C.2 Example of the meta-gene MEG273. . . . . . . . . . . . . . . . . . . 165

xix

Chapter 1 Introduction One of the aims of biology is to understand biological systems at a mechanistic level. The genome projects have recently established a nearly complete list of genes in the genome, revealing that most of the genes have yet to be characterized genetically. Therefore, to understand genetic mechanisms more fully one of the next challenges is to characterize the role of every gene in the genome, especially for the large fraction of genes whose functions are currently unknown. For example, in the nematode C. elegans, only 8% of its genes have been studied to any great extent, leaving the remaining 92% without direct functional characterization [25] (Table 1.1). It is clear that a systematic, genome-wide approach to predicting gene function is needed in higher organisms. DNA microarrays are an ideal tool to use for characterizing gene function. DNA

1

2

CHAPTER 1. INTRODUCTION

Organism S. cerevisiae C. elegans

Total 6145 18,545

Characterized by genetics 3778 (61%) 1541 (8%)

Inferred by similarity Unknown 701 (11%) 1665 (27%) 10,227 (55%) 7118 (38%)

Table 1.1: Breakdown of known and unknown genes in S. cerevisiae and C. elegans(based on data from Costanzo et al. (2001) [24]). microarrays report on the expression of every gene in a genome in parallel and therefore provide the scale-up needed to characterize the large fraction of unknown genes. The expression of genes under different conditions can provide information about gene function. Biological processes often involve multiple genes acting in concert. For example, multiple genes can act together by encoding the proteins that make up a single complex. Multiple genes also can act together by encoding proteins that participate in a common signaling pathway. Therefore, the measurements collected from a set of DNA microarray experiments across a range of conditions for a single gene can be used to relate a gene to other genes in the genome. If the expression pattern of gene A is significantly similar to gene B, the function of A may be inferred from the function of B. The ribosome is one of the largest protein complexes known. It is comprised of over one hundred proteins and RNAs involved in the translation process. When a eukaryotic cell needs to synthesize or destroy ribosomes, it must express these genes in an orchestrated fashion. Tight control over the production of the ribosomal subunits allows the cell to produce intact and functional ribosomes. Failure to coordinately

3

CHAPTER 1. INTRODUCTION

3 2

expression level

1 0

−1 −2 −3

Ribosomal subunits −4 −5 0

Random genes 5

10

15

20

25

fly embryogenesis time (minutes)

Figure 1.1: Coordinate expression of ribosomal subunits. Ribosomal subunits are coordinately regulated across a D. melanogaster embryogenesis time course (red lines) compared to random genes (black lines). regulate the subunits might be inefficient as it would lead to stoichiometric imbalance and wasted mRNA production. Figure 1.1 shows the expression of 30 genes that encode subunits of the ribosome. The expression profiles of the ribosomal subunits, taken together, show a pattern. While the absolute levels of the subunits differ slightly, the overall pattern of expression is highly similar. If the rising and falling across the D. melanogaster embryogenesis time course can be used as a signature of ribosomal activity, then identifying other genes with similar patterns of expression

CHAPTER 1. INTRODUCTION

4

may reveal new components of the translational or other protein expression machinery. In the same way the genes encoding ribosomal subunits exhibit correlated expression, we may be able to identify new molecular machines, or other genetic modules, from expression data. The availability of genome-wide functional datasets, that measure properties about genes for every gene in a genome, promise to contribute to a fuller understanding of biological pathways. The large scale datasets now available create an unprecedented opportunity for discovering relationships among genes. This thesis develops methods for predicting gene function based on DNA microarray expression data.

1.1

Thesis motivation

The work was motivated by the need to characterize the function of genes in an automated way. At the same time, high throughput technologies have provided us with the ability to measure functional aspects about genes on a genome-wide scale [77, 54, 38, 51, 58, 66, 21, 39, 97, 62]. This thesis was also motivated by the need to develop computational tools that give biologists access to the large amount of data now available for the genes they study. Although many bioinformatics methods exist for analyzing functional genomics datasets [31, 46, 23, 79, 1, 9, 71, 68, 92, 15], there is still a demand for resources that will allow biologists to navigate the results of highthroughput technologies. Making such resources available to the research community

CHAPTER 1. INTRODUCTION

5

will accelerate our efforts to understand gene function.

1.2

Identifying functional modules with gene expression data

Genes that are part of the same protein complex, like the ribosomal subunits, or that participate in the same pathway(s) are often coexpressed. This may be due to the protein products of the genes being required to be present together to participate in a shared task. DNA microarrays give us a simultaneous observation on every gene in the genome for a single state of a cell or of an organism. Genes with similar profiles over a diverse range of conditions may have common functional roles. For a review on methods for analyzing gene expression data see Altman and Raychaudhuri (2001) [3].

1.2.1

Global studies of gene expression

Microarrays allow the expression of every gene in the genome to be queried simultaneously. Ideally, a microarray contains one or more nucleic-acid based probes for each unique transcript in the genome. When exposed to material derived from cellular extracts, the amount of RNA that hybridizes to a particular probe quantifies the degree to which its gene is up- or down-regulated. Although different types of microarrays exist, this thesis deals solely with cDNA microarrays developed in Pat

CHAPTER 1. INTRODUCTION

6

Brown’s lab at Stanford in 1995 [77]. The cDNA microarray technology has been described extensively elsewhere and will be summarized here only briefly [19, 29, 16]. For technical reasons, a competitive hybridization is performed to measure gene expression under an experimental condition relative to a control condition. RNA derived from an experimental population of cells is labeled with one flourescent dye (usually red) while RNA derived from a control population of cells is labeled with a different flourescent dye (usually green). When the flourescently labeled RNAs from the two populations are allowed to competitively hybridize to a microarray, the redness or the greenness of a particular gene reflects the proportion of its mRNA that is present in the experimental condition relative to the control. Microarrays have created an unprecedented opportunity for investigating gene response under a variety of different conditions on a genome-wide scale. This allows an investigator to explore which genes in the genome play a significant role in the process under study. The technology has advanced rapidly so that now we have full genome chips on many organisms. The research community for several organisms have amassed a large number of hybridizations which encompass a wide variety of experiments probing different cellular processes. For example, synchronized yeast cells have been examined revealing over 800 genes whose expression oscillates regularly across the cell cycle in S. cerevisiae [87]. In another example, mRNA from developing

CHAPTER 1. INTRODUCTION

7

C. elegans populations were studied identifying which genes are expressed during the major stages of the nematode’s life [57]. As a variety of different types of microarray studies are completed, it becomes possible to study combinations of experiments from a single organism. For the purpose of gene function prediction, analyzing combinations of experiments provides a more complete picture of how the expression of a set of genes may be similar. Different experimental conditions may activate different molecular processes, which may in turn lead to different repertoires of co-transcribed genes. Increasing the diversity of experiments included in an analysis can theoretically increase the breadth of gene functional relationships it is possible to detect. Ideally, we would have a large set of non-redundant experiments. However, finding such a set of experiments is challenging and is an area of active research (see appendix A for an example of a method that searches for uncorrelated linear combinations of experiments in an attempt to find such a set). Recently, much work has focused on predicting gene function based on the data collected from DNA microarray studies. Many new statistical and computational approaches have been developed as a consequence of these efforts. This chapter provides an overview of some of these methods. Since an exhaustive review of the research is outside the scope of this thesis, I focus on the recent developments that directly motivated the work in this thesis.

CHAPTER 1. INTRODUCTION

8

For example, I focus on analysis that combines microarray experiments for the purpose of comparing genes to one another. Other types of microarray studies are often performed when analyzing a single experiment. For example one might wish to predict differentially up- or down-regulated genes from a Type-1 study in which mRNA is compared between normal and cancerous tissues. I do not discuss these approaches although a growing literature exists on this topic. See the review by Storey and Tibshirani (2003) for a recent review of this area [90].

1.2.2

Evaluating module prediction

How do we know a grouping of genes derived from a computational procedure reflects anything biologically meaningful? We would like to have some estimate for the accuracy of a clustering result before we follow-up with biological experimentation. A rich literature exists describing how to evaluate the results of machine-learning methods. Here I discuss methods that have been used widely to evaluate the results of computational analysis programs of microarray data. Any gene function prediction method must predict known categories of genes accurately before it can be used to make new discoveries. To predict gene function on a large scale, it is necessary to have a classification of the genes into functional categories. A gene function prediction method can only be evaluated if the function for some of the genes are known. If a large enough fraction of genes are known then

CHAPTER 1. INTRODUCTION

9

it may be possible to gauge the success of a gene function prediction effort by scoring how well it re-derives known functional groupings of genes. Efforts to categorize genes into curated categories based on the literature are therefore crucial for the development of computational gene function prediction methods. Gene categories can be used in an automated way to measure the performance of a gene prediction method. The discrete groups of genes provided by categorization efforts like Gene Ontology [5] and KEGG [67] can be used as surrogate functions to benchmark the performance of prediction methods. One approach is to use a gold-standard grouping of genes that one believes reflects the truth to some degree. Another approach is to use held-out data in a sampling strategy to measure how well the prediction method generalizes to new data instances. Both approaches have been used in DNA microarray analysis. Each is discussed in turn. One can assess the ability of the clustering to place genes belonging to a single set Gi into the same cluster. The evaluation is then repeated for a set of such groups {i = 1 . . . g}. This approach was first used by Tamayo et al. (1999) [92] to evaluate the results of a self-organizing map to yeast expression data. The gene functional categories they used were MIPS [65]. For a clustering method, this may be evident in the way the resulting clusters collect genes known to be functionally related. In a straightforward manner, the genes

CHAPTER 1. INTRODUCTION

10

in a particular cluster can be overlapped with the genes in a particular functional category. Overlapping every cluster with every functional category gives a matrix of cluster-category overlaps that can be used to evaluate the clustering result in terms of the chosen categorization. The gold-standard approach has the advantage in that it is not correlated with the data in any obvious way. There is an inherent objectivity in overlapping goldstandard gene groups with experimentally derived groups. However, the approach is limited because it relies on prior knowledge about gene function. If a dataset contains many novel and biologically significant gene groupings this evaluation approach gives no credit to a method when it successfully identifies such relationships. The attractiveness of using the held-out evaluation scheme is that it is not limited in this way.

1.3

de novo identification of gene modules from expression data

The methods described here and in the remainder of this thesis all take as input a large set of microarray experiments.

CHAPTER 1. INTRODUCTION

1.3.1

11

Compendiums

We want to detect whether genes are similarly expressed, not just in anecdotal experiments, but in general across a wide variety of conditions. If two genes have highly similar expression levels across a large number of experiments this is evidence the two genes may be functionally related because their transcripts are needed (and not needed) in a large set of different cellular states. One approach then for identifying functionally related genes is to compile a large database of expression and search for significantly tight clusters of genes. Using this approach, one may find biologically relevant modules represented in clustered sets of genes. One can pool the different measurements collected from multiple laboratories into a single compendium for analysis. For example, Eisen et al. (1998) [31], Golub et al. (1999) [40], Hughes et al. (2000) [53], and Kim et al. (2001) [59] are examples of studies where the authors combined expression data from different experiments for discovering similarities between genes. Since microarray platforms differ and laboratories may use different RNA extraction protocols, one may need to pre-process the data. Normalizing microarray data is also an active area of research (see for example [102]). To relate the expression of a gene across the microarrays requires identifying what each probe on each microarray slide represents. Joining together results from different microarray studies involves the steps outlined in Figure 1.2. Figure 1.2 illustrates how

CHAPTER 1. INTRODUCTION

12

Figure 1.2: A compendium of expression data built from individual microarray experiments. Spots on different microarrays (“chips”) contain different probe sequences. The gene recognized by a set of probes can be identified based on their similarity to a reference sequence (sequence in white). A gene may have multiple different probes on the same slide as well (in which case the expression levels are often aggregated before subsequent computations are performed).

CHAPTER 1. INTRODUCTION

13

a collection of microarray experiments can be combined. Each spot on each array is mapped to the gene that it was designed to measure. It is not always possible to identify which gene a probe matches. Gene names can be ambiguous, especially for human gene names where names are reused and several synonyms for a single gene exist. It is therefore best to rely on the specific nucleic acid sequence that was spotted on the chip to relate gene names. However, for proprietary reasons, some microarray chip manufacturers have not always published which sequences were actually spotted on an array. One of the most ubiquitous and most reliable ways of describing the sequences on an array is to to report the Genbank accession identifier for each spot. The sequence of nucleotides corresponding to a Genbank accession number can then be identified by searching the National Center for Biotechnology and Information’s database [52]. Efforts such as SOURCE are helping to address the problem of mapping gene names across microarray platforms [27]. SOURCE maintains relationships that map gene names across various functional genomic resources whenever new versions of the genome sequences are published. In addition, new microarray standards such as MIAME [14] and MAGE-ML [85] are being developed that specify the minimum requirements necessary to report microarray results (see also Ball et al. (2002) [7]).

CHAPTER 1. INTRODUCTION

1.3.2

14

Global cluster analysis

Once a sufficiently diverse set of experiments has been collected, we want to be able to detect whether two genes are related based on the data in the compendium. The compendium of data allows us to systematically collect every significant coexpression partner of every gene. In the case where the coexpression neighbors of an unknown gene are sufficiently well characterized, we may be able to infer the function of unknown genes. One way to find functional modules, is to cluster the genes based on their expression patterns in the compendium. No prior conceptions about the way genes should be grouped is used in unsupervised clustering. Rather, in a data-directed manner, the genes are organized by grouping related expression profiles. Given a set of clusters, we might then be able to predict the function of an unknown gene based on the cluster to which it is assigned. The work of Eisen et al. (1998) was one of the first studies to combine multiple diverse experiments and perform a combined analysis for the purpose of gene function prediction [31]. This work demonstrated that organizing genes based on expression similarity could recapitulate a large amount of known biology. Since Eisen’s study, a surge of clustering work has been conducted on gene expression data (see Altman and Raychaudhuri for a review of the topic [3]). Chapter 2 describes a method for constructing a global overview of co-expression data collected on C. elegans. Often the bottleneck in clustering approaches is in

CHAPTER 1. INTRODUCTION

15

understanding the resulting clusters. The chapter emphasizes how a global clustering result can be assessed both statistically and biologically.

1.3.3

Context-dependent clustering

Identifying a subset of conditions for a particular set of genes may be critical for identifying their common transcriptional signature. We do not know beforehand which genes are functionally related as well as which combination of experiments show tight clustering of the related genes. We would like to be able to solve both of these simultaneously to find coherent gene groupings in their appropriate experimental contexts. Plaid and Biclustering have been developed to make strides in this direction [45, 18, 93, 60, 61] on all of the data in a compendium simultaneously. The idea is that related genes probably are coordinately regulated in a minority of the experiments. Calculating similarity metrics based on all of the experiments might therefore be inaccurate as the measure would then be based primarily on random relative responses. The methods seek subsets of conditions and genes such that some notion of cluster tightness is maximized. Some methods treat the identification of genes and conditions as an optimization problem. A maximum-scoring partitioning of the original data matrix into sub-matrices that define a subset of genes and experiments. Methods differ in whether they seek all such sub-matrices simultaneously, or one at a time. They also

CHAPTER 1. INTRODUCTION

16

differ in whether they allow sub-matrices to overlap. Overlapping sub-matrices allow genes to belong to more than one resulting cluster. Genes that have multiple roles in the cell may have different co-expression partners depending on the experimental context. Genes may be pleiotropic and required for multiple diverse functions in an organism. Genes may interact with different sets of genes in depending on the process or cellular state within which it participates. Clustering should therefore make distinct groups of genes in a process-dependent manner. We may be able to detect the multi-functional nature of a gene from expression data if the gene is coexpressed significantly with the genes from the distinct processes in different subsets of the expression database. For these reasons, it may be ideal for a clustering to allow genes to belong to more than one cluster. Methods have been developed recently that show promise for meeting this challenge. Plaid in particular allows genes to belong to multiple overlapping clusters. Plaid searches for subsets that explain the most significant amount of remaining residual computed from an ANOVA-like model [61]. Biclustering treats the problem as an optimization problem and searches for high-scoring sub-matrices. For example, Sheng et al. (2003) define a Gibbs-Sampling approach for discovering sub-matrices that maximizes a probabilistic model [80].

CHAPTER 1. INTRODUCTION

1.4

17

Bootstrapping modules from known pathways using gene expression data

Knowing some genes involved in a process, biologists can sometimes identify additional genes using standard genetic techniques. For example, a second gene may be found that suppresses a gene (i.e. the second gene’s deletion reverses the affect of deleting the other gene). In this way, biologists may be able to identify a set of genes by “bootstrapping” from known members. However, these genetic approaches tend to be slow in terms of identifying functions for every gene in the genome. In addition, these studies rely on prior knowledge about characterized genes, or the identification of selectable phenotypes. Biologists are often interested in a small set of genes that they have found to play a role in some particular process. To fully understand how the process works, one would like to identify all of the genes that participate in the process. For this reason, biologists are often interested in finding out if any new genes are related to a set of genes known to be functionally related. Using DNA microarray data, we hope to identify new genes as those that are significantly co-expressed with the original set of genes. New genes identified in this way are good candidates for involvement in the process since their regulation is tightly coupled to genes known to be involved. Ihmels et al demonstrated an approach that could take an input set of genes and

CHAPTER 1. INTRODUCTION

18

build a specific cluster around the input set [56]. The goal is to find new candidates of the original set. Bergmann et al. (2003) show how this approach extends to a particular generalization of singular-value decomposition [10]. Chapter 3 describes a method that uses DNA microarray data to predict which new candidate genes may be related to a set of genes already known to be functionally related. A global clustering solution may fail to find new candidates. This may be due, for example, to the fact that the conditions under which the genes of interest are actively co-regulated are underrepresented in the expression database. The chapter describes a method called the gene recommender that first identifies experiments where the expression levels of the given set of genes agree. Once a set of informative experiments has been identified, the gene recommender builds a cluster around the starting set of genes in an attempt to find new candidates. The gene recommender is a type of method that shows promise for finding a more accurate set of genes related to a given set of interest than prior clustering approaches.

1.5

Finding conserved modules from gene expression data

Coregulation detected in a single organism may not always imply two genes are functionally related. It would be difficult or even impossible to distinguish accidentally

CHAPTER 1. INTRODUCTION

19

regulated genes from those that are biologically important if we use data from a single organism. We would like to know which coexpression measurements reflect the existence of a physiologically necessary link between two genes. Identifying these links promises to improve gene function prediction. Preliminary work has demonstrated that combining data from multiple organisms can be used to find reliable predictions about gene function [2, 96]. Alter et al. (2003) [2] combined cell cycle data from human and yeast and then performed a singular-value decomposition on the combined data set. In this study, the human and yeast experiments were matched, pairing the approximate time-point from human with the corresponding time-point from yeast. For example, the time-point corresponding to the entry into human S-phase was associated with the time-point for the entry into the yeast S-phase. This has the advantage of not requiring an orthology prediction. Instead, the SVD is used to define sets of genes with related expression as determined by a set of linear combinations across the human and yeast time courses. Any orthology prediction can then be used to assess or evaluate the results. For example, the human protein ORC1L is highly similar to the yeast protein ORC1 which encodes the origin of replication recognition complex. The ORC protein is a six subunit complex that is known to be required for initiation of DNA replication in all eukaryotes. The SVD results re-derive the known fact that the expression of the large subunit varies throughout the cell cycle. One can simply see that both ORC’s

CHAPTER 1. INTRODUCTION

20

in yeast and human get similar scores along the same eigenvectors. Since they are also highly similar in protein sequence, one could then infer that the H.s. ORC1L and S.c. ORC1 are probably orthologous. The SVD can help identify analogous expression responses between organisms without assuming an orthology prediction. However, because it does not incorporate an orthology prediction, it cannot detect when significant expression patterns exist in multiple organisms for the same gene. It therefore cannot detect synonymy between expression profiles assayed in multiple organisms. The method described in Chapter 4 does take protein sequence similarity into account so that evidence of synonymous expression can be used to derive a measure of conservation. Snel et al. (2002) analyzed a large set of experiments from both yeast and worm. They computed all pairwise Pearson correlations between genes in both organisms’ data sets and selected those pairs from each that achieved correlations of 0.60 and higher. Using this criteria, they found a very small fraction of overlapping pairs. Even though the common set of interactions was small, they were almost entirely composed of known protein subunits paired with other proteins known to also belong to the same complex [83]. Van Noort et al. (2003) combine microarray expression data from both C. elegans and S. cerevisiae to predict gene function. They also use a simple thresholding approach for the task and find their ability to predict known categories is greatly

CHAPTER 1. INTRODUCTION

21

improved [96]. Prediction of gene function may be improved in this kind of study because any errors in the microarray data will be more uncorrelated across the organisms than within an organism. Approaches like Van Noort et al. that combine the data across species may therefore reduce the overall effective error and produce more accurate predictions. The central hypothesis of this work is that using DNA microarray expression data from multiple organisms can improve gene function prediction over using data from only a single organism. Chapter 4 provides evidence that supports this hypothesis. The work demonstrates one of the first efforts to develop a method that simultaneously analyzes gene function across multiple species on a genome-wide scale. It presents a computational approach for measuring the degree of conservation of co-expression across organisms. I show that data from multiple organisms significantly improves gene function prediction for many classes of conserved genes. In Chapter 4, I present a method for constructing a network of conserved coexpression. The method links together all genes that exhibit significant co-expression in multiple organisms. The method is applied to four distantly related organisms – humans, flies, worms, and yeast – resulting in 22,163 links between 3182 genes, revealing the specific co-expression links preserved through millions of years of evolutionary time. The resulting network contained structure, yielding insights into the types of conserved modules the individual links represent. For example, 12 large groups of

CHAPTER 1. INTRODUCTION

22

inter-connected genes were discovered, most of which encapsulate a specific biological theme. Many of these groups correspond to either animal-specific or ancient genetic modules. The network provides strong evidence for the involvement of new genes in core biological functions such as the cell cycle, secretion, and protein expression. Experimental evidence is provided confirming the predictions implied by some of the links in the network. In support of the hypothesis put forth in this thesis, I provide quantitative evidence that combining expression data across organisms provides a clearer picture of core biological processes than using expression from a single organism alone. I provide a method for combining the results of functional genomic data across distantly related organisms to understand the function of human genes. I provide a resource for the biological community for investigating the information in such data. I also provide evidence that the resource is useful for biological inquiry. I conclude that such an integrative approach may be crucial for gene function prediction and may lead to a more accurate understanding of genetic processes.

Chapter 2 Evaluating a terrain map of coexpression for C. elegans Most of the genes in the multicellular organisms have not been directly analyzed. One of the next goals of computational genomics is to predict the function of these unknown genes. The expression of genes under different conditions can provide information about the genes’ function. Biological processes often involve multiple genes acting in concert. For example, multiple genes can act together by encoding the proteins that make up a single complex. Multiple genes also can act together by encoding proteins that participate in a common signaling pathway. It its often plausible then to detect whether two genes are related based on whether they are coexpressed

23

CHAPTER 2. EXPRESSION TERRAIN MAP

24

in a sufficiently diverse set of experimental conditions. With DNA microarray technology, we can systematically collect every significant coexpression partner of every gene, including unknown genes. In the case where the coexpression neighbors of an unknown gene are sufficiently well characterized, we may be able to infer the function of unknown genes. In this chapter I describe an effort to analyze a large set of microarray data in the hope of discovering functions for the vast majority of unknown genes in C. elegans. The study described here represents a collaboration with Dr. Stuart Kim’s laboratory. My contribution was computational – I designed sampling strategies to quantify the statistical significance of the clustering result represented by the gene expression landscape. The work presented in this chapter was published under the title “A gene expression map for Caenorhabditis elegans” in Science Copyright 2001 American Association for the Advancement of Science [59]. Microarray data collected from C. elegans is ideal for coexpression analysis. The laboratory of Dr. Stuart Kim developed one of the first microarrays for the multicellular model organism, C. elegans. They collected microarray expression measurements of nearly all of the genes in the C. elegans genome across a diverse range of experiments performed by over 30 groups from around the country. The experiments included 553 DNA microarray hybridizations, covering a diverse set of experiments that profile expression changes during development, aging, following stress, in various

CHAPTER 2. EXPRESSION TERRAIN MAP

25

mutants and under various growth conditions [59]. 178 of these experiments used partial DNA microarrays (containing 11,917 genes; 63% of the 18,967 genes from C. elegans genome) and the remaining experiments used DNA microarray containing nearly all of the genes in the genome (17,871 genes; 94%)[57, 74]. We wanted to construct a resource allowing researchers to investigate which genes are coexpressed with their genes of interest. We therefore constructed a visual representation of the gene expression relationships present in the expression database. The method we chose is a type of multi-dimensional scaling in which a set of pairwise similarities between genes was used to find a low-dimensional representation of the expression data. Section 2.1 describes how we chose a similarity metric to detect when two genes are similarly expressed. We then plotted the genes out on a twodimensional grid so that genes with high expression similarity were placed near each other and genes with low expression similarity were placed further away from each other. We plotted the grid of genes on a 3-dimensional landscape, where the surface coordinates of the genes corresponded to their grid positions and a third dimension represented the density of genes in a square region of the surface. We refer to this representation as a topomap. Section 2.2 describes how the topomap was constructed from gene expression similarity measures. The topomap contained a high degree of structure in the way the genes were distributed throughout the landscape. Rather than a homogeneous distribution, the

CHAPTER 2. EXPRESSION TERRAIN MAP

26

topomap contained regions of high gene density, which we termed mountains. The presence of mountains indicated that an underlying structure existed in the gene expression data. The presence of mountains indicated that significant sizes of genes were co-expressed with one another across the diverse set of conditions. We hoped that biological themes could be ascribed to such groups so that functions could be assigned to unknown genes based on their position in the landscape. Because it was possible that such structures might arise by chance, we rigorously tested whether the presence of mountains was a random artifact. Section 2.3 describes several strategies for gauging the significance of the presence of this kind of observed structure in the topomap. If the topomap was to be useful as a discovery tool for predicting the function of unknown genes, we had to demonstrate that it could bring together genes already known to be functionally related. My main contribution to this study was to demonstrate formally that known genes with related function co-clustered on the topomap. We constructed sets of closely related genes, henceforth called biogroups, to measure the degree to which the topomap could concentrate genes from a single biogroup. Section 2.3 also discusses how we formally tested the ability of the topomap to cluster genes from the same biogroup. The topomap proved to be a successful representation for gene coexpression data. Its success is reflected in how usable it is and in the number of new insights it has

CHAPTER 2. EXPRESSION TERRAIN MAP

27

provided and will continue to provide. However, it does have its limitations, some of which are discussed in Section 2.5. Many of these limitations can be overcome by refining or reworking some of the aspects of the topomap. Other shortcomings require new methods to be developed. The discussion section also describes these modifications and future developments that use the topomap as a stepping stone to more accurate methods.

2.1

Identifying coexpression neighbors

We wanted to be able to predict gene function based on which genes are coordinately regulated with a gene of interest. To collect all of the coordinately regulated genes for every gene in the genome, we needed to compute a measure of similarity between every pair of genes. Genes that are coregulated may be induced or repressed at different strengths. For example, two genes might be turned on or off by the same transcription factors but the amount of mRNA produced from each gene might be very different since the genes’ cis elements may have different binding affinities for the factors and different transcription initiation and elongation promoting strengths. Because this may result in coregulated genes having different absolute levels of expression, we needed to use a similarity measure that reflected the degree to which the genes’ expression responded in the same direction. We therefore required the similarity measure to detect when two genes’ expression profiles had the same overall shape,

CHAPTER 2. EXPRESSION TERRAIN MAP

28

rather than the same absolute expression levels. This section describes our choice of similarity metric and how it was computed from the expression database.

2.1.1

Measuring similarity in gene function

The expression measurements across the 553 experiments for gene i can be represented by the 553 dimensional gene profile vector xi = (xi,1 , xi,2 , ..., xi,553 ). Note that because many of the genes were not present on some of the microarrays and because some spots yielded unreliable measurements, some of the values in xi here may be missing. We then sought some measure of similarity between two gene profiles. We wanted to identify which genes are similarly expressed with a given gene i. Any gene that is tightly coregulated with gene i may provide clues as to the function of gene i. To find out which other genes were coexpressed with gene i, we chose a metric that normalized for the presence of missing data. Although many choices were possible, we used the Pearson correlation between two genes i and j as a measure of similarity. The Pearson correlation was used because it has the property that it assigns a high score to two genes when the two expression profiles have an overall similar shape. We chose to use a measure based on the Pearson correlation between the two genes since this measure provides an intuitive and familiar indication of the genes similarity. If two genes have wildly different absolute expression values across the database but their measurements rise

CHAPTER 2. EXPRESSION TERRAIN MAP

29

and fall coordinately, the Pearson correlation can capture this type of similarity. The expression levels across the 553 experiments for a gene i, written Xi , can be compared to the expression levels of gene j, where each component Xij corresponds to the measured expression level of the gene from microarray j. Because the dataset contains many missing values, Pearson correlations reflect the similarity between pairs of genes over a different number of experiments. Pearson correlations computed over a smaller subset of experiments can obtain extremely high and low values by chance. Also, the Pearson correlations of gene gi with gj and gi with gk could not be directly compared if each pair had a different number of missing data points. For example, a Pearson correlation of 0.9 might be surprisingly high if computed from all 553 experiments, but would be much less surprising if based only on 8 experiments. We therefore needed a measure of how strong an observed Pearson correlation was given the number of microarrays it was computed over. Ideally, one would like to know how likely an observed Pearson correlation is by chance. In other words, if we knew the distribution for random Pearson correlations, we could compute a P-value associated with the observed Pearson correlation. If this distribution is F then the P-value associated with ρij is the probability of observing ρij or greater by chance, i.e. P (X ≥ ρij ) where X ∼ F . We chose a measure based on a t-statistic derived from the Pearson correlations. Since the t-distribution assumes an underlying gaussian, we

CHAPTER 2. EXPRESSION TERRAIN MAP

30

first transformed the raw Pearson correlations using Fisher’s z-transformation [33]. Fisher’s z is usually used for computing confidence intervals on Pearson correlations because, under certain assumptions, it produces a normally distributed variable zij from an observed Pearson correlation using:

zij =

  1 ln 1 + ρij − ln 1 − ρij 2

(2.1)

The z-transformation maps Pearson values whose range is (-1,+1) to values whose range is (−∞, +∞). If the experiments are independent and normally distributed, then zij will also be normally distributed with mean 0 and variance

p nij − 3, where

nij is the number of values where both gene i and gene j have non-missing values. We then derived a t-statistic from each zij using:

zij tij = p nij − 3

(2.2)

which were then used to as the similarity measure between the two genes i and j. The next section discusses how we combined all of the pairwise measures of similarity to construct a single co-expression resource.

CHAPTER 2. EXPRESSION TERRAIN MAP

2.2

31

Constructing a visual resource of coexpression

We wanted to make the information in the gene expression database accessible to the research community. The database contains many clues about gene function. However, in its current form, the gene expression database was limited in its usefulness. We wanted to provide a resource that gave researchers an ability to navigate and learn about the relationships present in gene expression data. We wanted a visual metaphor that would be intuitive to use and that allowed easy assimilation of a large amount of data. The visualization system VxInsight, developed by Sandia National Laboratories, met these requirements. For each gene present on at least one of the microarrays in the dataset, the 25 most coexpressed genes were collected. We then placed the genes in the plane such that nearby genes were likely to be coexpressed across the expression database. Laying the genes out on a two-dimensional grid and plotting a surface above the genes that reflect gene density creates a topological map of gene expression that we refer to as a topomap.

Figure 2.1 shows the result of a multidimensional scaling type of approach to laying out the genes on the X-Y plane. A taller “mountain” indicates more genes within a particular on the map. Perhaps the feature about the topomap to first catch the

CHAPTER 2. EXPRESSION TERRAIN MAP

32

Figure 2.1: A topological landscape of gene coexpression for C. elegans. Each gene is positioned so that it lands near genes it is coexpressed with. The Z-direction is used to represent gene density.

CHAPTER 2. EXPRESSION TERRAIN MAP

33

Figure 2.2: A topological map derived from randomly permuted data. The terrain map was re-derived after shuffling the genes within each microarray hybridization. eye is the presence of several areas each containing a dense population of genes. The dense regions of the topological map (“topomap”) 2.1, referred to as “mountains”, indicate an underlying structure may exist in the gene coexpression groups. To test the significance of this kind of structure, the data were permuted and the terrain map was re-derived using the permuted data.

CHAPTER 2. EXPRESSION TERRAIN MAP

34

No appreciable structure could be detected in the resulting clustering derived from random data. In total, we identified approximately 44 “mountains” of coexpressed genes. We tested whether the observed structure was significant by deriving a solution from randomized data. We permuted each microarray experiment in the dataset and then re-derived the solution using the permuted data. This resulted in a homogeneous distribution of genes on the X-Y plane (Figure 2.1). The observed clusters are therefore significant in the sense that randomly permuted data would not give rise to the same kinds of clustering structure.

2.3

Testing the significance of the terrain map

Do the mountains correspond to a biologically meaningful grouping of the genes? If it could be demonstrated that the terrain map brought together genes of related function then we may be able to use it to predict functions for unknown genes.

2.3.1

Concordance with known biology

One way to assess whether the terrain map collected genes of related function is to look within each mountain and ask how many related genes have been collected by the mountain. All of the genes in each of the 44 mountains were collected into 44 mutually exclusive sets of genes M1 , M2 , ..., M44 . We constructed 56 sets of genes

35

CHAPTER 2. EXPRESSION TERRAIN MAP

with similar biological function (biogroup), such as genes involved in meiosis, mitosis, translation, DNA synthesis, etc. We refer to these sets of genes as biogroups. The following table shows examples of some of these biogroups and describes how they were constructed. There were 53 biogroups overall. Each mountain Mi could then be tested for whether it contained a significant number of genes from biogroup Bj . Overlapping every mountain with every biogroup produces a matrix of overlaps O where Omb is the number of genes shared between mountain m and biogroup b. We can formally evaluate how well a mountain brings together genes of related function using the hypergeometric distribution. The hypergeometric distribution gives a measure reflecting the degree to which an observed overlap could be expected by chance if a mountain simply drew genes from the genome randomly and with equal probability. Using the hypergeometric distribution, we compute a P-value, P (X ≥ Oij ):

P (X ≥ Oij ) = 1 −

Oij −1 |Mi | X k k=0

N −|Mi | |Gj |−k  N |Gj |





(2.3)

where N =17,661. An overlap matrix computed from the 44 mountains and 53 biogroups is shown in Figure 2.3. Figure 2.3 shows the resulting overlap matrix. Inspecting the columns of the matrix reveals that 38 out of the 44 mountains contain a significant proportion of genes belonging to a single biogroup. For example, mount 16 was enriched for muscle genes, mount 1 was enriched for both muscle and neuronal genes, and mount 15 is enriched for aging-regulated genes. Even though none

CHAPTER 2. EXPRESSION TERRAIN MAP

36

Figure 2.3: An overlap matrix. In this matrix, rows correspond to a set of genes, called a biogroup, while columns correspond to one of the 44 mountains. Each (j, i) entry in the matrix plots a color representing the significance of the overlap between mountain i and biogroup j. Red indicates a P-value of less than 1E-10. of the microarray experiments were specifically designed to detect muscle or neuronal expression, groups of muscle-enriched and neuronal-enriched still resulted from this analysis. Likewise, inspecting the rows reveals that many of the biogroups are represented in at least some mountain. The overlap matrix contains a number of entries with high significance. However, without a random model, we have no way of telling how significant this observation is. The next section describes an approach to formally answer this question.

2.3.2

Significance of concordance

The result of overlapping every mountain with every biogroup results in an overlap matrix O where Omb contains a P-value reflecting the significance of the overlap

CHAPTER 2. EXPRESSION TERRAIN MAP

37

between mountain m and biogroup b. We want to know whether obtaining a particular matrix O is significant. If we randomly constructed clusters, or if we constructed clusters from randomized data, would we observe the same degree of overlap with the biogroups? One way to answer this question is to generate a series of random clusterings and then compare their overlap with the terrain map’s overlap. So that the random clusterings were comparable to the terrain map, I constructed clusters that were the same size as the mountains. Starting with the set of every gene spotted on the microarrays, the procedure for constructing a random clustering consecutively draws |Mm | genes without replacement for m = 1 . . . 43 from the remaining set of genes. Let Ck be the kth random clustering containing the set of clusters Ck,1 , Ck,2 , ..., Ck,44 . For every random clustering k, I compared each of its random clusters with each of the biogroups, and calculated the intersection between the two sets. I will write the intersection as ω k,m,b = Ckm ∩ Bb . This procedure was repeated a thousand times, creating one thousand random topomaps from which a distribution of overlaps could be obtained. Figure 2.4 shows the distribution obtained from random sampling (blue bars) compared to the overlaps obtained in the real topomap (red bars). One can see that the topomap clusters do indeed capture many different groups of genes, while random mountains contain very few significant overlaps with known biological groups.

CHAPTER 2. EXPRESSION TERRAIN MAP

38

One reasonable method for evaluating the significance of a clustering result is to compare the overlap matrix obtained with the matrices that could be obtained by chance. To produce random overlap matrices, one can produce random clusterings by partitioning the genes into the same number and size clusters as the observed result. The random clusters can then be overlapped with each gene category, obtaining a randomly derived cluster-category matrix. Generating different random clusterings produces a series of matrices giving a sampling estimate for the distribution of overlap matrices. To test the significance of the clustering result then, one can compare different summary statistics computed from both the observed and the randomly derived cluster-category matrices. I measured the significance of the topomaps ability to concentrate genes of related function by comparing its overlaps with biogroups to what could be expected by chance. In order to measure the significance of the topomap’s ability to bring genes of related function into proximity, I used a sampling strategy to generate a distribution of overlap P-values. To generate this distribution, I sampled random topomaps by generating clusters that were the same in size and number as the obtained topomap. Each randomly generated topomap was then compared to the biogroups by the overlap analysis. Each random topomap then gave a series of overlap P-values. I collected all of the P-values after generating thousands of random topomaps. The results of this analysis are shown in Figure 2.4. Figure 2.4 shows the results of performing an overlap

CHAPTER 2. EXPRESSION TERRAIN MAP

39

Figure 2.4: The topomap is enriched for concordance with known biology. The topomap contains a large number of significant overlaps with biogroups compared to random topomaps. analysis between the biogroups and random clusterings. The clusterings were sampled to match the sizes and number of mountains obtained in the topomap. Blue bars in the figure show the distribution of overlap significance obtained from 1000 randomly generated topomaps. The x-axis corresponds to the − log 10 (P-value) computed from overlapping one biogroup with one cluster and computing the probability such an overlap would be obtained by chance using the hypergeometric distribution. The

CHAPTER 2. EXPRESSION TERRAIN MAP

40

y-axis corresponds to the fraction of mountains.

2.3.3

Using concordance to compare clusterings

In addition to gauging the significance of a clustering method against a null model, overlap analysis can be used as an external criterion for comparing the performance of two or more clustering methods. One approach to measure how many clusters significantly overlap with some previously known set of related genes. Let M be the set of all clusters and B be the set of all biogroups. Let the set of all clusters that significantly overlap with at least one biogroup be M 0 ⊆ C. Let the set of all biogroups that significantly overlap with at least one cluster be B 0 ⊆ B. If clustering method 1 contains a higher fraction of significant mountains and a higher fraction of overlapping biogroups than clustering method 2 it provides evidence that method 1 is preferable to method 2. More formally, let FM (p) be the fraction of mountains containing a P-value overlap of p or less when overlapped with the set of biogroups:    |M |  |B|   X X 1 FM (p) = 1 Pmb ≤ p ≥ 1    |M | m=1

b=1

(2.4)

CHAPTER 2. EXPRESSION TERRAIN MAP

41

and let FB (p) be the fraction of mountains containing a P-value overlap of p or less when overlapped with the set of biogroups:    |M |     X 1 X FB (p) = 1 Pmb ≤ p ≥ 1    |B| |B|

b=1

(2.5)

m=1

Equation 2.4 gives a measure of how much the resulting clusters resemble known biological groups (high fractions) or correspond to perhaps novel sets of genes. Equation 2.5 measures the extent to which the clustering represents a diverse set of biogroups. It is possible to have a high FM score but capture very few, or only one biogroup, and have a low FB score. On the other hand, a clustering might bring together many different types of genes of related function and have a high FB score but could also produce many novel clusters and therefore have a low FM . It is therefore important to measure both quantities when comparing clusterings. Figure 2.5 shows the results of comparing three different clustering methods. Figure 2.5A illustrates how FM and FB are calculated from a P-value matrix. Figure 2.5B shows the results of comparing three clustering methods ability to produce a high frequency of clusters that overlap with biogroups. The three clusterings are compared based on their FM scores (Y -axis) for different P-value cutoffs. In this case the clustering method indicated by the red curve performs better. Summing up the number of red squares in the last row would give FM for some P-value threshold, while summing up the number of red squares in the last column would give FB . Part

CHAPTER 2. EXPRESSION TERRAIN MAP

42

Figure 2.5: Method and results for comparing different clustering methods based on biogroup overlaps. (Mountain,biogroup) combinations that give a significant overlap for some chosen P-value threshold are indicated by color (red corresponds to highly significant while blue is insignificant). The fraction of red squares in the last row of the matrix equals FM while the fraction of red squares in the last column equals FB . The three clusterings compared shown are the topomap (red), Plaid (blue), and hierarchical clustering using Pearson and average-linkage (green). C contains the same plot as in B only the Y -axis plots each clustering’s FB scores.

CHAPTER 2. EXPRESSION TERRAIN MAP

2.3.4

43

Using concordance to find new relationships between known functional groups

Since the P-value matrix obtained from an overlap analysis is transposable (i.e. it is possible to switch the cluster and biogroup labels), it can also be used to infer possibly new relationships between the biogroups. The clusters can be considered to be observations about the biogroups. If two functional categories are consistently captured by the same clusters and excluded by the same clusters, then this is evidence the two categories are related in some way. More formally, let bj be the vector of log-transformed P-values computed after overlapping biogroup j with every cluster in a clustering result. bj is the cluster signature of biogroup j. In the ideal case, each cluster represents a distinct expression pattern across the database, defining a unique regulation program for some set of genes. If two biogroups contain genes belonging to the same regulation programs, it may indicate the biogroups are related. To compare two biogroups j and k, we could then compute a measure of similarity between bj and bk . Figure 2.6 shows the results of clustering biogroup overlap vectors obtained from the C. elegans topomap by their Pearson correlations. This produced many intriguing biogroup associations. The mitosis biogroup, for example, was sorted next to the cell cycle biogroup. We can easily see that the retinoblastoma group’s cluster signature is similar to both mitosis and chromatin modifying genes, which is consistent with retinoblastoma’s known function. The phosphoryllation biogroup

CHAPTER 2. EXPRESSION TERRAIN MAP

44

Figure 2.6: The rows of the P-value matrix (each row represents a single biogroup) can be clustered based on their overlaps with clusters from a clustering result. was associated with the kinase biogroup (not shown). This is a reasonable association since the phosphoryllation and kinase enzymatic reactions are reverse of one another and signaling cascades often involve both types of activity. In the phosphoryllationkinase association example, both biogroups contained genes in a wide number of clusters. These results suggest biogroups can be related even when no significant enrichment for any cluster exists.

2.4

Predicting gene function with the terrain map

One of the goals of constructing the co-expression terrain map was to provide researchers with a resource to find genes related to their own genes of interest. The VxInsight software system allows a researcher to find genes that are coexpressed with

CHAPTER 2. EXPRESSION TERRAIN MAP

45

their genes of interest by looking in a neighborhood around their gene(s). We wanted users of the resource to be able to find genes of related function by looking near their genes of interest. Using the resulting X − Y positions in the topomap leads to a well-known classification problem. Given a gene i, we want to predict its biogroup from the biogroups to which its neighboring genes belong. In the terrain map’s case, a natural choice for defining a neighborhood around gene i is to draw a circle of a defined radius centered at the gene and then collect every gene contained in the circle. One might be tempted to define a neighborhood around gene i that seems to respect the topology of the terrain map’s landscape. For example, one might be tempted to draw a contour around gene i by finding isolines around i. However, this approach biases for including genes that are more sparsely distributed around the central gene. One expects areas of dense gene packing to be more reliable for at least two reasons. 1. The similarity among the genes is strong enough to have kept them together despite the large repulsion term from the high density component. 2. One expects the consensus of a large set of genes that are tightly coexpressed to be more accurate than a single gene or a small collection of genes. Drawing a circle around a gene is reasonable since it will naturally include more genes from more dense regions and fewer genes from sparser regions. To draw such a circle, let the genes within radius r of gene i be the set Nr (i) defined as:

46

CHAPTER 2. EXPRESSION TERRAIN MAP

 Nr (i) = j : (xi − xj )2 + (yi − yj )2 ≥ r

(2.6)

The classification rule is then identical to the rule for Parzen windows [28]. We predict i to be in biogroup b if a critical fraction of its neighbors are also in biogroup b: I(i, b) =

1,

P

I(j, b)

j∈Nr (i)

!

≥ c

(2.7)

0, otherwise

We tested whether a subset of N (i) contained enough information to guess the function of gi for several unknown genes. One reasonable approach would be to sort all of the other genes relative to gene gi from closest to furthest away on the terrain map. We can write the sorted genes as ggi (1) , ggi (2) , ..., ggi (k) where ggi (j) is the jth closest gene to gene gi . The parameter k returns a substantially smaller subset of the genes than the entire genome, i.e. k  N . Further experimentation supports the finding that the clustering produced by the topomap were biological meaningful. 7 genes of unknown function placed close to several known heat-shock genes, were found to be heat-shock responsive in subsequent experiments [59].

CHAPTER 2. EXPRESSION TERRAIN MAP

2.5

47

Discussion

The topomap proved to be a successful representation for gene coexpression data. Its success is reflected in how usable it is and in the number of new insights it has provided and will continue to provide. However, it does have its limitations, some of which are discussed in Section 3.4. At the time of this writing, many C. elegans research groups have used the results of this work in multiple different ways in their everyday research. One group used the clustering result to further refine a set of genes expressed specifically in the pharynx (S. Mango, pers. comm.). The separate pharangeal clusters obtained provided a breakdown of the original set into logical pharangeal sub-functions. This research demonstrated that different subsets contained their own distinct transcription factor binding sites that participate in their own distinct regulatory roles in pharangeal development. The topomap has also been used to aid in the classification of newly sequenced nematodes such as Meloidogyne incognita [64]. In another recent example, the topomap was used by Bianchi et al. (2003) to identify a new member of a neurosensory pathway [12]. In this example, the authors focused their search for new candidates by identifying which genes were coexpressed with a gene encoding a potassium channel subunit. Despite the tremendous success of the topomap as a tool for gene functional discovery, it has several limitations. Section 3.4 discusses some of these and points the way toward future developments that can overcome these shortcomings. For example,

CHAPTER 2. EXPRESSION TERRAIN MAP

48

one of the limitations of the topomap is that it places a gene in a single position on the landscape. However, it might be better to let a gene occupy multiple positions on the map. Some genes are known to be “multi-functional” – they participate in multiple processes that involve the gene’s interaction with different sets of genes. For these multi-functional genes, it would be ideal to cluster them with each of their respective processes. Another limitation of the topomap is that it constructs a single clustering for every gene in the genome using all of the available microarray data. If we want to find genes related to a set of genes of interest, the topomap may place our genes of interest far apart. Instead of using a global clustering like the topomap, we can find new candidates using an approach that can take a set of genes and find new candidates that might be related. Chapter 4 discusses a method that was specifically designed for this problem and overcomes the limitations of using a global clustering like the topomap for this purpose. In this chapter I described an approach for evaluating whether a clustering solution corresponds significantly with known gene functional categories. The approach uses a simulation-based strategy for detecting whether the degree to which clusters overlap with known functional categories is significant. The clustering result described in this chapter represents an effort to make a large amount of expression data available and accessible to the research community. Its success is reflected in how usable it is and in the number of new insights it has

CHAPTER 2. EXPRESSION TERRAIN MAP

49

provided and will provide.

2.5.1

Similarity metric

Our choice of similarity measure was an intuitive and reasonable choice. Spearman might be chosen in the case one is concerned about outliers, extremely high or low expression measurements, dominating the Pearson calculation. We tested the dataset for outliers and felt there was little reason for concern. However, there is at least one example of strong signals in the expression database disproportionately influencing the result. 5 experiments were performed to detect which genes were in operons. These experiments yielded very reproducible and strong expression levels. Their influence on the topomap was clear – many of the operon residing genes fell into a single mountain. This was undesirable since operons represent several different processes should not be grouped together. We chose a measure that reflects when the direction of expression change between two genes is the same. However, genes that are related and that participate in the same response may have opposite expression changes. For example, one gene might be the repressor of another gene. We expect these genes to have expression changes that are still coordinated, but coordinated in an opposing fashion. Using the Pearson correlation as it was used in this work would miss this type of relationship. One possible way to detect such anti-correlated genes is to include genes not only when

CHAPTER 2. EXPRESSION TERRAIN MAP

50

they have a highly positive Pearson correlation, but also when they have a highly negative Pearson correlation. It would be possible to re-derive the topomap using the absolute value of the t-statistic and search for new coexpression neighborhoods not present in the original topomap.

2.5.2

Global versus local trade-off

Even though the terrain map has demonstrated its usefulness, it does have limitations that any single solution would have. The attempt to place a gene in a single position is necessarily limited because we expect some genes to be related to multiple sets of genes. Some genes were “orphaned” in the terrain map – they were not placed in any particular cluster but rather wound up between clusters. Some of these genes were similar to genes that landed in different clusters. An ideal solution, but one that would be difficult to visualize, would allow genes to reside in multiple clusters. Some examples of clustering methods that allow genes to reside in multiple clusters are Plaid [61] and Biclustering [18]. These methods show promise for allowing genes to be related to different sets of other genes. Another limitation of the method is that it used every microarray experiment in the database to compute similarities between the genes. While this gives an average picture of how two genes are related across the expression database, it can be easily

CHAPTER 2. EXPRESSION TERRAIN MAP

51

affected by biases in the database. Certain types of experiments might be overrepresented in the database and have a disproportionately large influence on which genes are deemed similar. Some experiments might be under-represented or not present at all and not contribute significantly to similarity calculations.

Chapter 3 A gene recommender for C. elegans

3.1

Introduction

A full understanding of a biological process requires the identification of all of the genes that function in that process. Computational methods can be useful in this gene identification effort. If a subset of genes have already been identified for a process, computational methods can find new genes that are functionally similar to the genes in the current set. Because genes that participate in the same process are often coregulated, a promising place to look for genes of related function is in the growing collection of DNA microarray data. Genes whose expression is highly correlated with the current set are good candidates for further analysis.

52

CHAPTER 3. GENE RECOMMENDER

53

Given a set of genes known to be functionally related, how can we use DNA microarray data to find new candidates? A straightforward approach would be to collect a diverse set of experiments, cluster the genes based on their expression profiles, and then look up where the genes of interest cluster. New candidate genes could then be identified as those genes that co-cluster with the starting set. However, this simple approach may lead to complications. What if the genes in the starting set do not all cluster together? In this case, one might be able to then look for clusters enriched for genes in the starting set and look in those clusters. What if the starting set of genes are scattered broadly among the clusters so that no cluster is significantly enriched for genes in the starting set? In this case, one might give up using expression data to find new candidates since the starting set do not seem to be well correlated. These alternatives are unsatisfactory. In the first case where some, but not all of the genes in the starting set co-cluster, we are forced to ignore the clustering behavior of the genes that fall into different clusters. In the second case all of the genes in the starting set seemed to be uncorrelated and we are forced to look to other data sources for help. The alternatives are not satisfactory because, in both cases, there may still be evidence in the expression data to implicate new genes related to the starting set. Thus, the simple approach of looking up related genes in a global clustering may miss biologically important relationships.

CHAPTER 3. GENE RECOMMENDER

54

Figure 3.1: Global clustering can miss significant, but under-represented correlations. Columns correspond to hypothetical microarrays while rows correspond to hypothetical genes. Color in each box indicates an increase (yellow) or decrease (blue) in expression. Figure 3.1 illustrates why using a global clustering may be a coarse tool for finding additional genes of related function. Related genes A and B may be coregulated with each other, but only over a small proportion of the experiments in the microarray database. If the genes have multiple functions, they may be coregulated with different sets of genes over different sets of experimental conditions. For example, the pair of genes (A,B) are coregulated over 5 experiments in the database, while (A,C) and (B,D) are coregulated over 10 experiments. Global clustering, more influenced by the majority of experiments in the database, might place genes A and C in the same cluster as well as place B and D together but in a different cluster. Even though A and B are coregulated over a significant fraction of the experiments, they might not be co-clustered. The global clustering strategy could also fail to detect that gene E is related to genes A and B, even though it has a significant, albeit under-represented,

CHAPTER 3. GENE RECOMMENDER

55

Figure 3.2: The retinoblastoma pathway in humans involves several known genes that suppress the entry into S-phase (DNA synthesis) during the cell cycle. pattern in common. However, if we knew that genes A and B were related a priori, we might be able to identify that A and B are significantly co-regulated in the first five experiments. Having identified a set of experiments that may activate the process in which A and B participate together, we might then be able to detect that gene E is also active in these experiments. This finding indicates that E may also be related to A and B since it is also expressed similarly under the same set of conditions used to identify the experimental context representing the active state of the common process. In the remainder of this chapter I discuss a set of genes known to play a role in the retinoblastoma pathway. The retinoblastoma pathway contains genes involved in the cellular decision to synthesize a new copy of the genome during cell division (S-phase) 3.2. The retinoblastoma gene (Rb) interacts with chromatin remodeling proteins to suppress the expression of S-phase promoting genes. The retinoblastoma (Rb) gene encodes a protein product that normally suppresses the transcription of

CHAPTER 3. GENE RECOMMENDER

56

S-phase genes through a chromatin remodeling mechanism after complexing with histone deacetylase (HDAC). Loss of the retinoblastoma gene therefore leads to the uncontrolled entry into S-phase and subsequent cell proliferation. Without a normal copy of the gene, the complex with HDAC does not form, resulting in expression of S-phase genes, promoting DNA-synthesis and entry into S-phase [30]. The loss of Rb is thought to lead to uncontrolled cell proliferation through this loss of regulation on the entry into S-phase. Since Rb has been implicated in many types of cancers, it is of interest to fully characterize the pathway in order to understand the process of tumorogenesis in greater detail. Table 3.1 shows the retinoblastoma queries along with other queries used in the analysis. Genes belonging to a single query are believed to act in the same pathway. The major sperm protein genes (MSP) for example all are thought to belong to a large protein complex that plays a critical role in sperm motility. The synaptonemal complex query genes are thought to encode the proteins that compose the structure that pairs homologous chromosomes together during meiosis. The nematode C. elegans is a good model organism for the study of the Rb pathway. Several orthologs belonging to the Rb pathway have been identified in C. elegans. In C. elegans, the Rb genes participate in a pathway that specifies vulval cell fate. Loss of the C. elegans Rb gene leads to multiple vulval structures being formed on the ventral surface of hermaphrodite worms. The Rb genes in worm suppress a

57

CHAPTER 3. GENE RECOMMENDER

Retinoblastoma lin-35 lin-53 hda-1 lin-9 lin-36

MSP msp-36 msp-10 Y50E8A.B msp-55 msp-1 Y59E9AR.1 msp-40 msp-38 Y59E9AR.7 msp-142 M199.2 Y59H11AM.D msp-49 msp-31 ZK1248.17 C36H8.1 msp-32 ZK1248.4 msp-74 msp-33 msp-113 msp-3 msp-57 ZK1251.6 msp-77 msp-53 msp-65 msp-19 T13F2.10 msp-59 msp-45 msp-78 msp-50 F58A6.9 Y116A8A.2 msp-51 msp-142 Y116A8A.7 ZK546.3

Synaptonemal Complex syp-2 syp-1 syp-3 F41H10.10 F57C9.5 him-3

Recomb./ repair spo-11 him-14 msh-5 rad-50 mre-11 rad-51

Table 3.1: Examples of query genes that can be given to the gene recommender. Each column lists the WormBase gene identifiers corresponding to each gene included in the query list. MSP, major sperm protein. RAS signal that would otherwise promote cell proliferation. Genetic analysis revealed that the pathway specifying vulval cell fate is redundant, so that the RAS signal can be suppressed by two independent sets of genes, called the SynMuv A and B sets respectively. Loss of genes from both the SynMuv A and B sets leads to multiple vulval structures in the presence of normal RAS signaling. Because C. elegans has a large and diverse expression database, it makes sense to search for new genes that may also be involved in the Rb pathway. Figure 3.3 shows an example of using the C. elegans topomap, discussed in the previous chapter, to search for genes related to the worm Rb genes. We can search for new candidates by drawing a region around the Rb genes so that it captures a certain fraction of

CHAPTER 3. GENE RECOMMENDER

58

Figure 3.3: Defining a prediction set from the C. elegans topomap. A circle of radius r is drawn around the centroid of five retinoblastoma-related genes (white dots) so as to enclose half of the retinoblastoma genes. The circle describes a region in the topomap that inscribes 138 candidate genes predicted to be related to retinoblastoma (blue dots). the Rb genes. Any non-Rb genes enclosed in the region would then be counted as candidates. However, because the Rb genes are fairly far apart on the terrain map, the number of non-Rb genes is large. Drawing a circle to enclose 3 out of the 5 Rb genes would capture 138 non-Rb genes. Also, the average distance of the Rb genes from each other is an indication that we may have to sift through many false positives before we find a gene truly involved in the pathway. It is possible that the Rb genes are more significantly correlated on a subset of the experiments rather than the entire database. For example, Figure 3.4 shows a set of experiments in the database where the retinoblastoma genes are very well correlated. The experiments in the figure are those where we expect the Rb genes to be coregulated since they represent experiments where active DNA replication is probably occurring. The goal is to find such experimental subsets automatically using

CHAPTER 3. GENE RECOMMENDER

59

Figure 3.4: The Rb query genes show tight coregulation on a subset of experiments in the C. elegans microarray database. The germ line experiments compared expression in wild-type animals to glp-4 mutants lacking a germ line, and in mutants making only sperm to mutants making only oocytes [74]. The development experiments compared expression in whole wild-type worms throughout development and in hermaphrodites versus males [57]. The dauer experiments compared expression as animals exit the dauer stage following feeding in a time course experiment [98]. The original five retinoblastoma query genes are written in red to the right of the figure. Candidates found by the gene recommender are written in black to the right of the figure.

CHAPTER 3. GENE RECOMMENDER

60

only the starting set of retinoblastoma genes. The gene recommender described in this chapter then searches for new genes coordinately regulated with the query on a subset of the experiments (Figure 3.4). Rather than using the clusters derived from a global clustering solution, the approach dynamically builds a cluster around the starting set. The approach was motivated by methods that recommend books and movies to online customers. The approach attempts to improve upon previous clustering methods for finding highly precise lists of candidate genes based on microarray expression data. My contribution to this project was (1) formulating the initial strategy with Dr. Art Owen, (2) developing the cross-validation strategy discussed in section 3.5, and (3) formulating and implementing the log-likelihood computation discussed in Section 3.4. The work presented in this chapter was published under the title “A gene recommender for C. elegans” in Genome Research Copyright 2003 Elsavier [68]. For both speed and statistical reasons, a microarray expression database requires some preprocessing before it can be used by the gene recommender. Section 3.2 describes the normalization procedure used to prepare a microarray expression database for this type of analysis. Because the conditions under which the genes are actively co-regulated may be underrepresented in the expression database, the gene recommender first identifies experiments where the expression levels of the given set of genes agree. For example,

CHAPTER 3. GENE RECOMMENDER

61

consider again the illustration in Figure 3.1. The gene recommender would first identify experiments 1 through 5 as being informative for building a cluster around genes A and B. Section 3.3 describes the method the gene recommender uses to identify experiments that are informative for clustering around a given set of genes. Once a set of informative experiments has been identified, the gene recommender builds a cluster around the starting set of genes in an attempt to find new candidates. Using the example from Figure 3.1 again, the method would use only experiments 1 through 5 and then be able to detect the similarity between gene E and the starting set. Section 3.4 describes the method the gene recommender uses to rank every gene in the genome based on their similarity to the starting set and then obtain a precise list of candidates. The performance of the gene recommender is evaluated and discussed in Section 3.5. The method is compared to the topomap approach described in the previous chapter where I show formally it obtains more precise lists of candidates. I also present a cross-validation approach for measuring the performance of the method. RNA interference experiments confirmed that the gene recommender found two new candidates involved in the retinoblastoma pathway. Finally, section 3.6 discusses possible future extensions and generalizations.

CHAPTER 3. GENE RECOMMENDER

3.2

62

Normalization

Two genes whose expression levels increase and decrease together are considered to be co-regulated, even if their absolute expression levels differ markedly. Genes that are coordinately regulated may be induced or repressed at different strengths. For example, two genes might be turned on or off by the same transcription factors but the amount of mRNA produced from each gene might be very different since the genes’ cis regulatory elements may have different binding affinities for the factors and different transcription initiation and elongation promoting strengths. Because this may result in coregulated genes having different absolute levels of expression, we needed to use a similarity measure that reflected the degree to which the genes’ expression responded in the same direction. To capture this notion of co-regulation via a simple correlation measure, we first applied a rank-based normalization to the raw data. We calculated the rank order of the genes in each microarray experiment, from most induced (scored as +1) to most repressed. The data were then normalized by taking their ranks within rows. Let pi be the number of non-missing values among Yij for j = 1, . . . , p. Let Rij be the rank of Yij among the non-missing values in Yi1 , . . . , Yip . That is Rij = 1 for the smallest non-missing value, 2 for the second smallest, and so on. The transformed values are:

0

Yij =

Rij − (pi + 1) /2 pi /2

(3.1)

63

CHAPTER 3. GENE RECOMMENDER

3.3

Finding a subset of informative experiments

The gene recommender first assigns a numerical score to each experiment measuring the extent to which the query genes tend to cluster within that experiment. The highscoring experiments are taken to be the ones that are most relevant to the query. The low-scoring ones may be irrelevant and add noise to the search for new genes. For the j’th experiment, let Y Q,j be the average of the non-missing values among Yij for i ∈ Q and let VˆQ,j be the sample variance of those values. Then the score for experiment j is: Y Q,j p SE (j) = kj q VˆQ,j + 3p12

(3.2)

where kj is the number of non-missing values among Yij for fixed j and all i ∈ Q. The experiment score is considered to be missing if kj is too small. Note that the absolute value allows us to detect when the query set are either coordinately up- or down-regulated. Figure 3.5 shows the experiment scores obtained from the retinoblastoma query. Once we score all of the experiments, we search for those with significantly high SE scores. We considered an experiment score to be high, and we select the experiment, if it exceeded a threshold level, Z ∗ . Let the set of experiments selected be E, i.e.:

E = {j : SE (j) > Z ∗ }

(3.3)

CHAPTER 3. GENE RECOMMENDER

64

Figure 3.5: Some DNA microarray experiments show coregulation of the genes in the Rb query. Histogram of experiment scores obtained using the retinoblastoma query (black bars) or using 100 random queries of the same size (white bars). A substantial fraction of the experiment Z scores obtained using the retinoblastoma query were higher than the maximum Z score obtained from random queries Each experiment in E should reflect a condition under which the query genes have similarly extreme values.

3.4 3.4.1

Finding new candidate genes Scoring genes

To find new candidate genes, the gene recommender first scores all of the genes using only a subset of the experiments E. Once a set of experiments is selected, the entire set of expression data is a projection onto the subspace whose axes equal the experiments in E. Figure 3.6 illustrates the idea for a hypothetical example in which two experiments are selected from three. Restriction to experimental subspaces can

CHAPTER 3. GENE RECOMMENDER

65

Figure 3.6: A. Two hypothetical microarray experiments where each point represents the expression levels of a single gene in the two experiments. Five genes from the same query set are shown (red points) along with one non-query gene (blue point). The open red circle corresponds to the centroid of the query genes. B. same as part A only the genes are projected onto the the single experiment represented by the X-axis. change the apparent similarity of genes. Experiment X is chosen since the query set of genes have more similar expression values on these experiments than on experiment Y. The distance, d, between a gene and the query set can change depending on the subspace used to compute a measure of similarity (dotted line in the figure). The distance d between a candidate gene and the query centroid changes when only the experiment correspond to the x-axis is used relative to the original two-dimensional space. A measure of similarity between a gene and the query is computed by computing a dot product between the gene’s expression vector and the query’s mean expression vector only within the X-Y plane. We set the score for gene i equal to the dot product between gene i and the query set on the subspace given by E. We

CHAPTER 3. GENE RECOMMENDER

66

set the score for some arbitrary gene i equal to the score given by 3.4. Equation 3.4 shows how we compute a measure of similarity between a new gene i and the query set. p

1X SG (i) = wj Yi,j Y Q,j ki j=1

(3.4)

where the weights wj select the particular subset of experiments chosen in the previous step: wj =



0j ∈ / E, 1j∈E

(3.5)

Different choices for choosing the weights exist. Bergmann et al. (2003) show that using soft weights and iterating the procedure leads to a particular generalization of singular-value decomposition [10]. The score for gene i is SG (i), the mean of Y Q,j xYi,j over experiments j ∈ Q for which both Y Q,j and Yi,j are not missing. Because we choose wj ∈ 0, 1, SG (i) is related to the cosine of the angle formed between gene i and the mean query vector on the subspace defined by E. Because the values derived from the microarrays can have missing data, it is possible that too few j exist for gene i. In this case, we ignore the gene and consider SG (i) to be missing. All of the Yi,j and Y Q,j are between −1 and 1 and so therefore is SG (i). An important decision in the search for genes related to Q is the number of experiments to include in the relevant set E. The gene score is a sum of contributions from experiments of which some may be irrelevant to the functioning of the genes in

CHAPTER 3. GENE RECOMMENDER

67

the query Q. While it is intuitively reasonable that including irrelevant and noisy experiments can degrade the performance of gene scoring, there is not a good statistical argument to select a priori a Z-score above which an experiment will improve the search performance, and below which an experiment will degrade the search. The combined effect of several not-quite significant experiments may be informative, because lack of significance is a lack of proved relevance, and not necessarily a lack of relevance. Instead, one can explore a small grid of threshold values. More concretely, select the threshold Z ∗ to minimize the number of genes i ∈ / Q that score higher than the median score of genes i ∈ Q. Our rationale is that a good set of experiments should bring the known members of Q to the top of the list. If a small number of true unknown Q members exist, then it is reasonable to prefer a small number of non-Q genes near the top of the list. Though the threshold was chosen to bring the original query to the top of the list, we see from Figure 3.8 that the query genes get high ranks even in leave-one-out experiments.

3.4.2

Estimating the likelihood of membership

In addition to ranking the genes by decreasing score, we also compute a measure indicating the likelihood that the assigned score came from the query distribution versus the background distribution. This provides an intuitive interpretation of the

CHAPTER 3. GENE RECOMMENDER

68

score assigned by the method. For each gene in the genome i we compute the loglikelihood ratio:

L(i) = ln

σ0 1 1 + (SG (i) − µ0 )2 − (SG (i) − µQ )2 σQ 2 2

(3.6)

where µ0 and σ 0 are the mean and standard deviation computed from the SG (i) scores for i ∈ / Q while µQ and σ Q are the mean and standard deviation computed from the SG (i) scores for i ∈ Q. We include all non-query genes into the background distribution. Because some genes may have been placed erroneously into the query set we only include query genes into the query distribution if their SG (i) scores are in the 90th percentile among all genes. These ratios should be used with caution since the normal approximation makes the likelihood ratios non-monotonic with respect to SG (i). Extremely high-scoring genes can have smaller likelihood ratios than genes with correspondingly smaller scores. However this causes little problem since the likelihood ratios are used to find an intuitive cutoff between the background and query distributions where the ratios are usually monotonically increasing in SG (i).

3.5

Results

The result of a call to the gene recommender produces an ordered list of genes with associated scores. Table 3.5 shows the top 30 genes returned to a query containing

69

CHAPTER 3. GENE RECOMMENDER

Clone ◦ T23G7.1 • K07A1.12 ◦ K12D12.1 • C32F10.2 ◦ R06C7.8 • C53A5.3 ◦ B0464.6 ◦ R06F6.1 ◦ T16G12.5 ◦ F55A3.7 ◦ K06H7.1 • ZK637.7 • F44B9.6 ◦ F35G12.8 ◦ C39E9.12 ◦ T12E12.4 ◦ C33H5.4 ◦ F32D1.10 ◦ T20F5.7 ◦ DY3.7 ... ◦(36) B0336.1 ... ◦(42) JC8.6

Gene dpl-1 rba-2

Protein E2F Retinoblastoma associated (human) Topoisomerase II lin-35 Retinoblastoma (human) Bup1p (yeast) hda-1 Histone deacetylase Unknown Histone hairpin protein (human) Unknown Spt16 (yeast) plk-1 Polo kinase lin-9 synMuv protein lin-36 synMuv protein SMC family Unknown drp-1 Mito. outer membrane scission klp-10 Kinesin 5B (human) Member MCM initiator complex Unknown sup-17 Disintegrin / metalloprotease

SG 0.247 0.244 0.240 0.238 0.237 0.237 0.233 0.233 0.231 0.230 0.229 0.227 0.227 0.227 0.227 0.226 0.224 0.224 0.223 0.223

wrm-1

0.219 316

TES-1

Tesmin-1 (Arabidopsis)

N 162 320 320 320 317 320 315 319 315 318 318 320 320 317 318 317 311 317 318 160

0.215 318

Table 3.2: Top scoring genes for the Rb query as ranked by the score given by the gene recommender.

5 worm genes that are orthologs of 5 human retinoblastoma genes. A solid circle in Table 3.5 denotes a gene from the query. Genes were ordered by SG . N were the number of experiments used to compute the score SG which is given by the number of non-missing values for the gene and the particular experiments in E selected by the gene recommender. B0336.1 was the 36th and JC8.6 was the 42nd highest-scoring gene

CHAPTER 3. GENE RECOMMENDER

70

respectively. The gene recommender used 320 of the 553 experiments, including every experiment in which none of the 5 query genes were missing. The gene recommender was able to cluster the five Rb query genes in a small group of 13 genes (shown in red in Table 3.5). As a control, the gene recommender did not succeed in clustering query genes into comparably small groups when it used five random samples of five genes. This finding is consistent with the random queries having smaller experiment scores (Figure 3.8), and indicates that the new candidate genes for the Rb query identified by the gene recommender are unlikely to be due to chance. The top 20 genes are strongly indicative of Rb function. The top 20 genes from the Rb hit list show high levels of co-regulation to each other, and are most abundantly expressed under conditions with high levels of cell growth and division.

3.5.1

Approximate precision and recall

The result of targeted clustering using the gene recommender produces a ranked list of genes. The ranked list can then be truncated to produce a hit list. Recall is the fraction of genes from the cassette included in the list. Precision is the fraction of genes in the hit list that also belong to the cassette. Since we don’t know which genes actually belong to the cassette, we fix the fraction of query genes captured in the hit list. If the size of the cassette is comparable to the size of the query, then the fraction of query genes will be a good approximation to a measure of recall.

CHAPTER 3. GENE RECOMMENDER

71

We can measure the precision of a hit list at fixed levels of recall by counting the number of known true-positives in the hit list when a fixed percent are retrieved. The trick is that we don’t know the full list of true-positives and our list of positives (the query) may have errors in it. However, if we assume the list of unknown truepositives is small than we can use the query as an approximate estimate of the full set of true-positives. We can then compute estimates of the recall and precision of the methods. As the hit list varies in size from 1 to the number of genes in the cassette, a precision-recall curve can be produced. The recall necessarily increases as the hit list grows while the precision typically decreases. To derive a hit list from the topomap, one can rank all of the genes according to their distance from the centroid (average) of the query genes. To derive a hit list from a clustering, one could include all of the genes that co-cluster with (1) at least 1 query gene, (2) with all of the maximum number of query genes, or (3) some fraction of query genes. In hierarchical clustering, one could form a hit list by taking all genes under the deepest common ancestor of all the query genes or some fraction thereof. The set of five Rb query genes are localized in a broad area of mount 11 in the gene expression topomap (Figure 3.7). This broad area includes not only the 5 Rb query genes, but also 337 other genes (at 100% recall). By comparison, the top 13 genes from the gene recommender contained all 5 of the Rb query genes. To capture

CHAPTER 3. GENE RECOMMENDER

72

at least 2 Rb query genes requires the top 6 genes of the gene recommender list, but requires the top 138 genes from the topomap list (50% recall; Table 3.3). In addition to the Rb query list, the gene recommender provided a shorter list of candidate genes for each of the four sets of query genes than the list generated by the gene expression topomap. I compared the gene recommender to the topomap (discussed in the previous chapter) for their ability to find precise lists of candidate genes. For the topomap, I drew a circle of radius r large enough to incorporate a desired number of query genes as shown in Figure 3.3. Drawing such a circle encompasses a set of genes, the size of which estimates the precision obtained by the topomap. Figure 3.7 illustrates how the gene recommender and the topomap differ in their ability to concentrate genes from the retinoblastoma query and a query based on 12 genes involved in prophase. Figure 3.7 shows two examples where the gene recommender is able to find candidates that reside in separate topomap mountains. White dots indicate the genes belonging to the original query. The gene recommender can find new candidates that are scattered on the C. elegans topomap. In the prophase example, a set of known prophase-involved genes were used as the query set. Because the topomap splits the set of prophase query genes into two separate mountains, it is difficult to use the topomap clustering result to find new prophase-involved genes. This has to do with the fact that the number of genes included in the two mountains is fairly large (over

CHAPTER 3. GENE RECOMMENDER

73

Figure 3.7: The gene recommender can find new candidates that are scattered on the C. elegans topomap. The new candidates predicted by the topomap and the gene recommender are shown plotted on the C. elegans topomap. A. The topomap needs to capture 138 genes (blue dots) to enclose half of the retinoblastoma genes (white dots) while the gene recommender only needs to enclose an additional 3 (yellow points). B. Same as A but for the prophase-related query.

CHAPTER 3. GENE RECOMMENDER

Query Retinoblastoma Recombination/repair Synaptonemal complex MSP

Query Size 5 6 6 43

74

Size Prec GR Topo Rand GR Topo Rand 6 138 160 50% 2% 2% 57 1271 85 5% 0% 4% 4 246 85 75% 1% 4% 32 225 4017 69% 10% 1%

Table 3.3: The gene recommender yields a more precise list of candidates than the topomap. 1000) and far exceeds the number of genes expected to be closely involved in prophase. In this case, the gene recommender is able to find high-scoring genes that reside in both mountains. This indicates that the gene recommender can successfully find an appropriate subset of experiments that measure prophase likeness. Table 3.3 compares the gene recommender to the C. elegans topomap. In addition, a random control was obtained by running the gene recommender on a set of genes randomly selected from the genome. First column in Table 3.3 corresponds to the number of genes in the original query. The next three columns in the table show the number of hits returned by the gene recommender, the topomap, and the random control at 50% recall. The random control result was obtained by running the gene recommender on a set of random query genes. The last three columns show the same result except the percent of genes in the hit list is reported. The recombination/repair group is known to contain genes from two distinct groups and therefore yielded a lower precision when treated as a single group.

CHAPTER 3. GENE RECOMMENDER

75

Figure 3.8: The leave-one-out procedure. One gene from the original query is left out of the list (in this case the Rb gene). The remaining genes are given to the gene recommender which outputs a score for each gene in the genome, including the held-out Rb gene. The procedure then rotates through all of the original query genes, leaving out each one in turn, until |Q| ranks are obtained.

3.5.2

Cross-validation

A second method to show that there is strong co-regulation of the query genes in the microarray data is to determine that each of the genes in the query have a high score in leave-one-out experiments. Figure 3.8 illustrates the leave-one-out procedure. I performed leave-one-out on the retinoblastoma query. One of the Rb genes is left out of the query list. The gene recommender is then run using the smaller list as a query. The rank given by the gene recommender to the held-out gene is then recorded.

CHAPTER 3. GENE RECOMMENDER

76

Figure 3.9: Leave-one-out cross-validation. Histograms of percentile ranks obtained after removing a gene from a query list of genes, building a cassette around the remaining query genes, and then scoring the held-out gene. Open bars show the histogram obtained from random queries ranging in size from 4 to 50; black bars show the histogram obtained from all the queries used in this study. The inset is an expanded view of the highest-scoring genes. Figure 3.9 shows the distribution of ranks received by genes from real query lists (black bars) compared to the ranks received by genes from random query lists (white bars). For each query, one of the genes in the query was left out, the algorithm was rerun on the remaining genes, and the rank obtained by the gene that was left out was then scored. The ranks are scored as percentiles with 100 corresponding to the gene most similar to the query and 0 corresponding to the least similar. As we would expect,

CHAPTER 3. GENE RECOMMENDER

77

the histogram for random queries is nearly uniform over the range from 0% to 100% (Figure 3.8). By contrast, the histogram for the real queries has a very large spike between the 95th and 100th percentile and a very small number of genes with much lower scores. This result shows that the genes in the query lists are co-regulated, and that the gene recommender algorithm can accurately identify genes based on their level of co-regulation. The ranks from the held-out Rb gene can then be used as an assay to determine whether a significant expression signature exists for the query set. For example, if leave-one-out experiments on random sets of genes yield uniform distributions of ranks as in the case of Figure 3.9, then we can test the strength of the association using a statistical test.

3.5.3

RNAi knock-down

The predictions of the gene recommender were followed up with wet biology experiments. The top 50 genes returned from a search for genes related to the Rb query list were tested with RNA interference. Of the three genes known to encode Rb-binding proteins, one (dpl-1) had a class B synMuv phenotype in RNAi experiments consistent with previous results whereas the other two (K12D12.1 and mcm-7) showed no mutant phenotype using any of the strains [17]. Among the remaining 43 genes that we tested, two showed a phenotype

CHAPTER 3. GENE RECOMMENDER

78

Figure 3.10: JC8.6(RNAi) results in a Muv phenotype similar to lin-35 Rb(RNAi). A. JC8.6(RNAi), (B) JC8.6(RNAi) lin-8(n111), (C) JC8.6(RNAi) lin-9(n112), (D) lin-35 Rb(RNAi) lin-8(n111). Arrows point to the vulva in A and C, and to pseudovulvae in B and D. Adults were fed bacteria expressing JC8.6 dsRNA, and the phenotypes of their progeny were scored. Scale bar is 20 M. indicating that they interact with the lin-35 Rb pathway. RNAi analysis of the 42nd gene in the Rb list (JC8.6) elicited a synMuv phenotype very similar to that of lin35Rb (Figure 3.10). Figure 3.10 shows the results of knocking out JC8.6. JC8.6 encodes a protein similar to mammalian tesmin and Arabidopsis TSO-1. Although little is known about the function of tesmin, TSO1 plays a role in plant meristem cell division (Song et al. 2000, Hauser et al. 2000). RNAi analysis of the 35th gene in the Rb list (wrm-1) indicates that it might act to antagonize the Rb pathway. Specifically, wrm-1 (RNAi) resulted in embryonic lethality that was suppressed by loss-of-function lin-35 or lin-9 mutations. wrm-1 encodes a beta-catenin that functions in an embryonic Wnt signaling pathway [75]. Table 3.4 shows the results from testing for suppression of the wrm-1 lethality. The first column shows the number of hours

79

CHAPTER 3. GENE RECOMMENDER

RNAi feeding wild-type lin-8 8-16 h 17 (n=36) 76 (n=28) 16-24 h 0.2 (n=413) 25 (n=342)

% viability of wrm-1 (RNAi) lin-9 14 (n=21) 0 (ngt300)

Table 3.4: lin-8 suppresses wrm-1 lethality. adult animals were fed bacteria expressing wrm-1 dsRNA. In addition to lin-8, we tested two other class A genes (lin-38 and lin-15A (n751)) but not lin-15 (n767) could suppress wrm-1 (RNAi) lethality.

3.6

Discussion

The gene recommender strategy was initially conceived during a conversation with Professor Art Owen from the department of Statistics at Stanford University. Dr. Owen finalized the details of the method and coded up the first implementation in Matlab. He analyzed the first set of queries, including the Rb query, discovering that feature-selection on experiments, as employed by the gene recommender, was a viable approach for finding genes of related function. To enable this discovery to have an even greater impact on functional genomics, I developed a web-based interface that gives the biology community the ability to search for new candidates of their own gene sets. Although I am sure providing a web address in one’s thesis is incredibly shortsighted, I feel compelled to do so anyway if only for the historical record of it. At the time of writing this thesis, the web site that hosted the search tool could be found

CHAPTER 3. GENE RECOMMENDER

80

at http://cmgm.stanford.edu/ kimlab/generecommender. To gauge the significance of the results of the gene recommender and to measure the degree to which a query set represented a single regulon, I developed the leave-one-out strategy discussed in Section 3.4. As microarray databases continue to grow and biologists continue to want to query the expression database to find genes of related function, it will become increasingly more important to identify informative subsets of the databases. With a growing variety of experiments represented in the database the fraction relevant to any particular query diminishes. For any set of genes, it becomes more and more the case that the majority of experiments have nothing to do with the process that invokes the group to function together.

3.6.1

Advantages

For identifying new genes co-expressed with a known set of genes, this algorithm has a number of advantages over global approaches used previously. First, the gene recommender selects for microarray experiments that are informative (i.e. showing tight co-regulation of the query genes) and ignores uninformative experiments that would otherwise add noise to the calculations. As a result, it generated hit lists that were shorter and more concentrated with query genes than the list generated by the gene expression topomap. Second, the gene recommender can find interactions for genes

CHAPTER 3. GENE RECOMMENDER

81

that are in multiple clusters. Genes that are multifunctional, that are expressed at different times during development, or that are expressed in different tissues may interact with multiple different pathways. A global clustering approach such as hierarchical clustering or multidimensional scaling would place such multifunctional genes into the strongest cluster and would thus lose interactions with other clusters. For example, the Rb hit list generated by the gene recommender included four genes (K12D12.1, T16G12.5, F55A3.7 and drp-1) that were not clustered with Rb by the gene expression topomap, but were instead clustered together along with other genes in mount 5. The gene recommender was able to find several new genes that appear to interact with Rb in regulating the cell cycle. This is a significant finding considering how well studied retinoblastoma is. Among the top fifteen candidate genes, three (dpl-1, K12D12.1 and mcm-7) encode proteins similar to mammalian proteins that bind to Rb [13, 89, 17]. RNA interference also confirmed that two genes (wrm-1 and JC8.6) interact with the genes in the Rb pathway (Figure 3.10); neither gene was previously known to do so. JC8.6 acts along with Rb while wrm-1 acts to antagonize the Rb pathway. JC8.6 encodes a tesmin-related protein, which is involved in cell division in Arabidopsis [84, 47]. Neither WRM-1 activity nor Rb activity are thought to be directly controlled by transcriptional regulation: WRM-1 is a homolog of beta-catenin, which is released into

CHAPTER 3. GENE RECOMMENDER

82

the nucleus as a result of Wnt signaling [100], and Rb is regulated by phosphorylation during the cell cycle [30]. Nevertheless, the gene recommender found them to be co-expressed based on gene expression data. The co-expression of WRM-1 and Rb indicates that the Wnt and Rb pathways are both present at similar developmental times and tissues in order to enable these pathways to regulate each other.

3.6.2

Extensions

The gene recommender can be extended in various ways. Below I discuss several directions for further development. The first extension involves eliminating the experimental cutoff and instead finds an appropriate weighting of the experiments. As presented above, the gene recommender relies on a variance estimate computed from the query set of genes. The second extension attempts to remove this dependency so that candidates can be searched for very small query sizes, even for a single gene.

Eliminating the experiment cutoff The gene recommender uses an experiment threshold Z ∗ and adds experiments to E only if their experiment score exceeds Z ∗ . The threshold is either predefined or is searched for using the strategy described. Rather than using a cutoff, one can soften the restriction and define a weighting over the experiments. Larger weights can be assigned to experiments under which the query have similar values, while smaller

CHAPTER 3. GENE RECOMMENDER

83

weights can be assigned to those experiments under which the query genes are more scattered. Bergmann et al. (2003) have recently shown that this softening approach is a particular generalization of singular value decomposition [10].

Searching with small queries The gene recommender relies on the calculation of a sample variance for the query within each experiment. For small query sizes, many experiments or all of the experiments will not contain a large enough number of non-missing values to provide a reliable estimate. Hypothesis-guided approaches, like the gene recommender promise to aid gene function prediction in the case where knowledge about gene function is known. It is able to scan the expression database for a context under which the genes of interest show coordinate regulation. The performance of the gene recommender suggests this type of approach will greatly aid in predicting gene function.

Searching for orthologous coexpression The gene recommender currently operates on the data from a single organism. Biologists are limited to finding genes of related function using the gene recommender to those organisms for which a large and diverse set of microarray experiments exist. To date, only a handful of organisms have such a database of microarrays available. These problems can be somewhat circumvented by incorporating expression data across multiple organisms.

CHAPTER 3. GENE RECOMMENDER

84

The next chapter describes a global multiple species approach. It investigates how to map genes across organisms and use large compendiums of microarray data to derive a single clustering result that can be made available to the research community. The shortcomings of the multiple species global solution are analogous to the shortcomings of global clustering and so we can expect the gene recommender to perform well in this area. Chapter 5 discusses a multiple species variant of the gene recommender for finding new candidates to known gene sets that can use expression data across multiple organisms for its analysis.

Chapter 4 A multiple species coexpression network of conserved genetic modules

4.1

Introduction

Expression databases contain many different regulation profiles showing how gene expression is perturbed by developmental stage, different growth conditions, stress, disease, and specific mutations [91]. Similarity of expression profiles across diverse experimental conditions can suggest new relationships between genes. The methods described in previous chapters can be used to ascribe functions for unknown genes if

85

CHAPTER 4. MULTIPLE SPECIES NETWORK

86

the unknown genes are sufficiently well correlated with genes of known function. However, co-regulation detected in a single organism may not always imply two genes are functionally related. For example, cis-regulatory DNA motifs are predicted to occur by chance in the genome, and might lead to serendipitous transcriptional regulation of nearby genes. Using experiments limited to a single organism, it would be difficult or even impossible to distinguish accidentally regulated genes from those that are biologically important. Recent evidence also suggests that seemingly unrelated genes can be significantly coexpressed if they are adjacent in the genome [76, 86]. Since genes are highly rearranged over evolutionary time-scales, such position-dependent, accidental transcriptional coupling may be mitigated by comparing coexpression across distantly related species. It is possible that the cell uses inexact mechanisms to regulate mRNA levels. The presence of spurious coexpression in microarray data could confound gene function prediction. Spurious coexpression could lead to both false-positive as well as falsenegative errors. We would like to know which coexpression measurements reflect the existence of a necessary link between two genes. Identifying these links promises to improve gene function prediction. Evolutionary conservation is a powerful criterion to identify genes that are functionally important from a set of co-regulated genes. The recent availability of large sets of DNA microarray data for human, flies, worms and yeast makes it possible to

CHAPTER 4. MULTIPLE SPECIES NETWORK

87

measure evolutionarily conserved coexpression on a genome-wide scale. Co-regulation of a pair of genes over large evolutionary distances implies that the co-regulation confers a selective advantage, most likely because the genes are functionally related. Because small and subtle changes in fitness can confer selective advantage during evolution, the test for related gene function using evolutionary conservation in the wild is more sensitive than scoring the phenotype resulting from strong loss-of-function mutants in the laboratory. This chapter describes a method for identifying conserved gene coexpression from microarray expression data collected on multiple organisms. Section 4.2 describes a statistical measure for detecting evidence of conserved coexpression between pairs of genes from multiple organisms. The measure allows any number of organisms to be analyzed simultaneously. The resulting network of gene-gene interactions can be large, and many links may suggest new findings. In order for the network to be useful for biological discovery, it has to be accessible for human analysis. Because the result can often be unwieldy, Section 4.2 describes a method for visualizing the entire network. The visualization allows researchers to easily navigate the set of conserved coexpression links. It enables the identification of areas of highly interconnected genes as well as the investigation of specific gene-gene interactions. Section 4.3 presents the results and evaluation of the methods applied to four

CHAPTER 4. MULTIPLE SPECIES NETWORK

88

distantly related organisms - H. sapiens, D. melanogaster, C. elegans, and S. cerevisiae. The section tests the validity and usefulness of the gene coexpression network derived using these methods. The network constructed including data across multiple organisms outperforms networks built using data from any single organism. These results support the main hypothesis of this thesis, which is that data from multiple organisms can be used to improve functional prediction of conserved genes. The results obtained in section 4.3 provide a novel clustering result compared to previous studies. The results create an opportunity to discover patterns among the conserved nature of gene coexpression rather than among the coexpression measurements themselves. It also combines genes of different ages into the same clustering result. For example, genes present among all of the eukaryotes included in this study are situated with genes that contain orthologs only among the metazoans. This makes it possible to investigate relationships between ancient and newly evolved modules of gene activity. Section 4.4 presents analysis that takes advantage of the uniqueness of these results to discover new properties about gene regulation from expression data. The work was done in collaboration with Dr. Daphne Koller in the Department of Computer Science at Stanford University. Eran Segal, a graduate student in Computer Science and a student of Dr. Koller’s, contributed equally to both the thoughts and effort behind the work discussed here. The ideas presented in this chapter have been published under the title “A Gene-Coexpression Network for Global Discovery

CHAPTER 4. MULTIPLE SPECIES NETWORK

89

of Conserved Genetic Modules” in Science Copyright 2003 American Association for the Advancement of Science [91].

4.2

Methods

Given a large collection of microarray data on several organisms, we would like to know if there is evidence suggesting two genes are highly coregulated across multiple organisms. The goal is to identify pairs of genes that are co-expressed not only in one experiment and in one organism, but that show correlation in diverse experiments in multiple organisms. I describe a method that provides a novel statistical procedure for measuring the inter-species agreement of gene-gene coexpression. This method provides a probabilistic measure of how likely a pair of genes exhibit coexpression arisen by chance. It assumes the observations obtained from each organism are independent, which I argue is an appropriate null model to measure departure against. It has similarities to approaches called “Meta-Analysis” in which results from several different studies, each attempting to test the same hypothesis, are being combined [48, 49]. In the case of gene coexpression, the null hypothesis is that all gene-gene correlations observed in separate species are random and independent. This section describes a method for detecting departure from this independence.

CHAPTER 4. MULTIPLE SPECIES NETWORK

4.2.1

90

Relating genes across organisms: meta-genes

To combine expression data across species requires we know which gene in another organism is the orthologous counterpart of a gene of interest in a reference organism. Because the critical amino-acids in a protein sequence are usually highly conserved (e.g. the histidines in the catalytic domain of myoglobins and hemeglobins that participate in oxygen binding), we can often detect when two genes derive from a common ancestor by comparing their protein sequences. Highly similar protein sequences can indicate two genes are orthologous. For the remainder of this chapter, I refer to a single collection of orthologous genes as a meta-gene. An unambiguous meta-gene would contain no more than one gene from each organism. It could lack a gene in a particular organism (e.g. because it was lost in one of the species or because a lateral transfer event introduced a new gene in an ancestor of some subset of the organisms). Appendix C discusses the particular method I used for defining the meta-gene sets.

4.2.2

Identifying conserved coexpression partners

Let the set of all related genes be Γ. Let the set of microarray data be X = {xii0 k } where xii0 k is the expression level of gene i under condition i0 in species k. Two genes are said to be related if (i, i0 ) ∈ Γ. The goal is to predict whether or not the pair of genes i and i0 belongs to Γ given the collection of microarray data.

CHAPTER 4. MULTIPLE SPECIES NETWORK

91

Definition 1 Let γ ii0 be the indicator variable that reflects membership in Γ, i.e. γ ii0 = 1 if and only if (i, i0 ) ∈ Γ and γ ii0 = 0 otherwise. Then, let pii0 be an estimate for our belief that (i, i0 ) ∈ Γ given the microarray data X.

pii0 = P (γ ii0 = 1|X)

(4.1)

Values of pii0 close to 1 would indicate (i, i0 ) ∈ Γ while values near 0 would indicate otherwise. One can approximate the information in X by using some measure of pairwise similarity between the two genes across the K species. Let the similarities between two genes across K organisms be {sii0 1 , sii0 2 , . . . , sii0 K }. We could then substitute these similarities for X when estimating pii0 :

pii0 = P (γ ii0 = 1|{sii0 1 , sii0 2 , . . . , sii0 K })

(4.2)

Note that equation 4.2 assumes γ ii0 is independent of X given their pairwise similarities. This assumption is convenient since it the information in the entire dataset is summarized by a few similarities. However, if the similarity measures themselves are sufficiently descriptive, this assumption is plausible.

CHAPTER 4. MULTIPLE SPECIES NETWORK

92

Thresholding approach One simple approach for predicting whether two genes are related is to use a rule that bases the prediction on whether the similarities exceed some threshold in each organism. In this approach, we choose some threshold level τ and count the number of organisms where the similarities for the two genes exceed τ , i.e. #{k|sii0 k > τ }. This is the approach taken by van Noort et al. 2003 in which the authors use Pearson correlation as the similarity measure and τ = 0.6 to predict whether genes are related based on their expression in C. elegans and S. cerevisiae [96]. They were able to show that the thresholding approach is able to reliably predict related genes by measuring their ability to predict members of KEGG categories. The simple thresholding approach has several drawbacks. It assumes high correlations observed in one organism are equivalent to high correlations observed in another. In general, this assumption is not valid since the sizes and complexities of the databases between the organisms may be different. Slightly better would be to choose a threshold based on the total number of experiments in each organisms’ database. This at least compensates for the different sizes of the databases used. A better method for estimating γ ii0 may be to account for the different correlation structures present in the different databases. For example, one could perform a singular value decomposition (SVD) on each organism’s database and use a projection of the genes onto eigenvector space for computing gene-gene similarities. In this way,

CHAPTER 4. MULTIPLE SPECIES NETWORK

93

an objective criterion to select a subset of eigenvectors could be used (e.g. high variance directions as in principal components analysis). However, the SVD procedure involves a matrix inversion, which is O(N 2 ) and so can be prohibitively slow and memory intensive for the size of datasets used.

Probabilistic method based on order statistics Perhaps the most intuitive approach for measuring the degree to which the evidence suggests i and i0 are coregulated is to estimate the probability that the observed similarities could occur by chance. However, the computation of such a quantity requires a model that describes the distribution from which the similarities are drawn. Suppose we knew the distribution from which the similarities obtained from each organism were sampled. In that case, the random variables from which the above sii0 k ’s were drawn would be given by S1 ∼ F1 , S2 ∼ F2 , . . . , SK ∼ FK , where the Fk ’s are organism-specific probability distributions. If the similarities collected from each organism k were independent, then the joint distribution of the sii0 k ’s would be given by Equation 4.3. K

P ({sii0 1 , . . . , sii0 K ) = Π P (Sii0 k < sii0 k ) k=1

(4.3)

In the next section I describe a transformation of pairwise similarities into rank ratios that allow us to use equation 4.3 to test for departure from independence and thus detect evidence of inter-species dependency.

CHAPTER 4. MULTIPLE SPECIES NETWORK

94

Transformation to rank ratios Figure 4.1 illustrates how a P-value is computed between the genes A and L based on similarities from three organisms. We can transform every pairwise similarity sii0 k between two genes i and i0 in organism k into the rank ratio rii0 k = Rii0 k /(Nk − 1) where Rii0 k is the rank obtained by sii0 k among the sorted collection of similarities to gene i, namely {sitk }, and Nk is the number of genes in species k. First, all of the genes are ranked with respect to A according to their similarity (i.e. the set of similarities {sAik } are sorted including the similarities between A and L {sALk }. This gives a different rank between the ordered pair (A, L) in each organism. For example, before computing a measure of similarity between MEG10 and MEG1514 the Pearson correlations in each organism would be converted to rank ratios. In yeast we would rank all of the yeast genes relative to YPL097W. After doing this, YPL040C would appear in the sorted list somewhere. Suppose, with its Pearson of 0.27, it actually is the 103rd most correlated yeast gene with YPL097W among all of the yeast genes included in meta-genes. Since there were 2306 other yeast genes, we would record the gene-gene correlation between YPL097W and YPL040C with a rank ratio of 103/2306 = .0447. We would repeat this for all of the organisms for MEG1514, getting a rank ratio for each organism. These rank ratios would then be used in the P-value computation discussed next.

CHAPTER 4. MULTIPLE SPECIES NETWORK

95

Figure 4.1: The method for computing P-values to detect evidence of conserved coexpression between genes A and L. In all three organisms, all of the genes are ranked relative to A based on their Pearson correlations. This yields the ranks 3, 2, and 1 for gene L. These ranks are divided by the number of genes in each organism to obtain the three rank ratios for the (A,L) pair from which the P-value can be computed.

96

CHAPTER 4. MULTIPLE SPECIES NETWORK

P-value computation We can now compute the P-value for every directed pair of meta-genes (i, i0 ) based on their gene correlations in K species. Let gms be a gene belonging to meta-gene m in species s. We rank all of the other genes relative to gms based on their Pearson correlation and then divide the rank by the total number of genes with meta-genes (and with data) in organism s, yielding K rank ratios for the (i, i0 ) pair, r1 , r2 , . . . , rK . To find out how significant the gene correlations of each pair is, we compute the probability of obtaining the observed rank ratios by chance where the order of the species does not matter. As more organisms are added, the combination of high rank ratios increases dramatically (grows as K!). Thus, the probability of obtaining high rank combinations in some of the organisms quickly approaches 1 as K grows large. We need a model that accounts for this dependence on the number of organisms included in the analysis. If we assume the rs ’s are drawn independently and uniformly, then we can compute the P-value from the joint cumulative distribution of an K-dimensional order statistic:

P (r1 , r2 , ..., rK ) = n!

Zr1 Zr2 0 t1

···

rK Z

dt1 dt2 · · · dtK

(4.4)

tK−1

I have found that the following recursive definition can be used to compute the integral

CHAPTER 4. MULTIPLE SPECIES NETWORK

97

efficiently:

P (r1 , r2 , . . . , rK ) =

K X

(rK−k+1 − rK−k ) P (r1 , r2 , . . . , rK−k , rK−k+2 , . . . , rK )

(4.5)

k=1

where r0 = 0 and the recursive call to P supplies all of the original arguments except the (n − i + 1)th. We can then connect any two meta-genes containing significantly low interaction P-values. Definition 2 Let two genes i and i0 be considered linked if Pii0 is less than some threshold α. To correct for the multiple tests performed, we need to adjust the P-value cutoff. Specifically, for a significance level of α, we include any meta-gene interactions with P-values less than a/N where N was the total number of meta-genes containing data in at least two organisms. In the results presented in section , N = 4725, giving a P-value cutoff of 1.05 · 10−5 for α = 0.05. Using this cutoff, we expect 236 false positives.

CHAPTER 4. MULTIPLE SPECIES NETWORK

4.2.3

98

Clustering and visualizing a conserved coexpression network

The previous section described methods for obtaining intuitive similarity measures between meta-genes. However, even once a suitable measure is found, it would still be unwieldy to make sense of all the pairwise comparisons between meta-genes. Choosing a pairwise measure of similarity implicitly chooses a space within which the metagenes reside. We would like to know how the meta-genes are arranged in such a space – i.e. is there structure of some sort? Is there a diffuse distribution of metagenes through this space indicating that each meta-gene is more or less similar to the same number of arbitrary other meta-genes, or is the space more heterogeneous in some way? Regions where densely packed meta-genes exist might reveal regulons of meta-genes involved in similar processes. Such regions might correspond to protein complexes or signaling pathways. We want to find patterns in this space in order to gain insights into the evolutionarily conserved architectures of coregulation common across organisms. Once a suitable metric is found, it is possible to use standard procedures to further analyze the pairwise relationships between the meta-genes. For example, methods like hierarchical clustering, that require only a similarity matrix, can be used. I used VxOrd to position genes on a two-dimensional grid and then visualized the result with with VxInsight [26]. In this visualization, meta-genes are placed near to

CHAPTER 4. MULTIPLE SPECIES NETWORK

99

Figure 4.2: The procedure for laying out meta-genes on a two-dimensional surface based on their conserved coexpression P-values. each other in the X-Y plane according to the negative logarithm of their P-value, and the density of genes in a region is shown by the altitude in the Z direction (Figure 4.2). Using VxInsight, one can find highly-interconnected areas of the network as peaks in the map as well as specific gene-gene interactions.

4.2.4

Measuring the conservation of a cluster

To measure conservation in a sensitive way, we can define a co-expression interaction in a single organism using a loose Pearson correlation cutoff of 0.25. For every meta-gene m in every species s, we can then collect the set of meta-genes where the correlation between the genes in the organism exceeds this cutoff. We then look to see which neighbors are also neighbors in the multiple species network (i.e. have significant interaction P-values). Let Igs , be the set of meta-genes that gene g interacts with in species s. Let Im be the set of meta-genes meta-gene m is connected to in the

CHAPTER 4. MULTIPLE SPECIES NETWORK

100

multiple species network. Let the set-theoretic conservation measure for meta-gene m be: ECIm =

K X

1 X Igs ∩ Im |Gms | g∈G Igs ∪ Im s=1

(4.6)

ms

Where Gms is the set of genes belonging to meta-gene m in organism s. Note that since the meta-gene mapping is nearly 1-to-1, Gms usually contains a single gene. The ECI measures the percent of a meta-gene’s neighborhood that is present in the multiple species network. High ECIs correspond to those meta-genes that have genes with co-expression neighborhoods enriched for interactions present in the multiple species network. For a component one can compute the average ECI of its meta-genes as a measure of the component’s degree of conservation.

4.3

Results and Evaluation

I applied the methods described in section 4.2 to a set of DNA microarray experiments obtained from four different organisms: 1202 DNA microarrays from human, 979 from worm, 155 from fly, and 643 from yeast. I first computed the Pearson correlation of the expression profiles between every pair of genes in the microarray data sets for each organism, and then ranked all other genes according to their Pearson correlations. I then used a probabilistic method based on order statistics to evaluate the probability (P-value) of observing a

CHAPTER 4. MULTIPLE SPECIES NETWORK

101

particular configuration of ranks across the different organisms by chance [81, 42](see section 4.2). I used a P-value cutoff of 1.06 ∗ 10−5 to indicate that two meta-genes are coexpressed. This P-value corresponds to a significance level of 0.05 and corrects for comparing a single meta-gene to all of the other meta-genes (this number was 4725 since 4725 meta-genes contained genes that contained data in at least two organisms). Therefore, under the assumptions that the correlations observed in different organisms are independent, we expect 236 interactions to have a P-value less than 1.06 ∗ 10 − 5. I then collected all of the links between pairs of co-expressed meta-genes with Pvalues less than the chosen cutoff. The resulting network contained 3416 meta-genes connected by 22,163 expression interactions. Each link suggests a potential interaction between two genes that has been conserved across evolution, and is therefore likely to be functionally related. Visualization of the network revealed 12 components of highly interconnected genes (Figure 4.3). See sub-section 4.3.5 for biological analysis of the components. I measured the significance of the interactions in the network by means of a variety of statistical tests. I shuffled the orthology mapping to test if the presence of consistent correlation was specifically due to relating genes with their orthologs. Sub-section 4.3.1 discusses the sampling strategy I used to randomize the orthology correspondence and test the size of the observed result.

CHAPTER 4. MULTIPLE SPECIES NETWORK

102

Figure 4.3: Visualizing the network of meta-genes in VxInsight [26]. The Z-direction in the figure indicates the density of meta-genes within a square area beneath it. Higher peaks correspond to regions of highly interconnected sets of meta-genes.

CHAPTER 4. MULTIPLE SPECIES NETWORK

103

To show that the methods are robust to errors in microarray expression measurements, sub-section 4.3.2 describes how I added increasing levels of noise and recomputed the results. Even if the set of predictions are perfect given the dataset, one wonders whether they are generally true or whether they are sensitive to the specific experiments included in X. Sub-section 4.3.3 describes how I tested the dependence of the results on the particular experiments included in the microarray datasets. If the set of relations predicted in are biologically meaningful, then we should be able to find known relationships to a higher degree than we could before. Subsection 4.3.4 provides evidence supporting the main hypothesis of this thesis, that data from multiple organisms can be combined to perform more accurate gene function prediction.

4.3.1

Randomizing orthology

It is theoretically possible that only a small set of expression patterns exist among the set of genes included in meta-genes. In other words, when genes are selected based on whether they have protein sequences highly similar to gene products in other organisms, the set of possible expression patterns might consequently be severely limited. Imagine the extreme case where, simply by selecting for conserved genes across the organisms, only a single expression pattern is extracted. In this extreme

CHAPTER 4. MULTIPLE SPECIES NETWORK

104

case, comparing any two genes would yield significantly high coexpression. However, the coexpression is simply due to the meta-gene construction step and not due to specific interactions among meta-gene members. In this scenario, even random pairs of meta-genes would appear to have significant coexpression interactions. To rule out this possibility, I generated a set of permuted meta-genes consisting of a random collection of genes from each organism, and constructed a network from these permuted meta-genes. I compared the number of interactions in the random network to the real network for a wide range of P-values (Figure 4.4). Figure 4.4 shows the number of meta-gene interactions (y-axis) exceeding a P-value cutoff (x-axis) in networks constructed from: real meta-genes (blue line), a random distribution (red line), and randomly permuted meta-genes (green line). P-values are shown in log10 scale. The red arrow in the figure denotes P < .05 used in the gene co-expression network. I repeated this procedure five times with different random permutations of orthologs. For each permutation and for every P-value, I found significantly more links in the real network than the random network; for example, at P < 0.05 the real networks contained 3.5 +/- 0.03 times more interactions than the random networks.

CHAPTER 4. MULTIPLE SPECIES NETWORK

105

Figure 4.4: The number of meta-gene interactions is significant compared to randomization controls. The number of meta-gene interactions (y-axis) exceeding a P-value cutoff (x-axis) in networks constructed from real meta-genes (blue curve), a random distribution (red curve), and randomly permuted meta-genes (green curve). P-values are shown in log10 scale. Red arrow marks a P-value of .05, the cutoff used in the gene-coexpression network.

CHAPTER 4. MULTIPLE SPECIES NETWORK

106

Figure 4.5: Similar meta-gene interactions can be obtained even after introducing various levels of Gaussian noise.

4.3.2

Robustness to noise

Because gene expression measurements contain some inherent variability, I also tested whether the network was stable with respect to added noise. For example, C. elegansDNA microarray experiments typically vary by SD = 0.3 to 0.5 (log2 expression ratios) [59, 74]. I added increasing levels of Gaussian noise to the entire data set for each of the organisms, and constructed new networks. Networks constructed following the addition of realistic levels of noise were highly similar to the network from the original data (Figure 4.5.

CHAPTER 4. MULTIPLE SPECIES NETWORK

107

I added increasing levels of Gaussian noise to each organisms’ dataset. Figure 4.5 shows the resulting P-values after adding 0.01 (blue circles), 0.1 (pink squares), 0.5 (yellow triangles), and 1.0 standard deviations (light blue crosses) of white Guassian noise. Shown is the negative log P-value of an interaction in the original network (x-axis) plotted against the interactions’ P-value in the network constructed from the noise-added data (y-axis).

4.3.3

Generalization to experimental choice

I wanted to evaluate whether a large fraction of potential gene inductions were represented by the available microarray experiments. The current microarray experiments might reveal only a small fraction of possible gene inductions, and so the particular set of gene interactions found in each organism would be heavily dependent on the specific set of microarray experiments that were done. Alternatively, the microarray experiments might be broad and diverse, revealing a large fraction of possible gene expression interactions. In this case, the coexpression relationships in the network would be robust to the choice of microarray experiments; for example, a significant fraction of the gene expression links should be present in networks built using only a random half of the data. I randomly split the DNA microarray data in each organism’s dataset into two halves and then built two coexpression networks, each using only half of the data. I then counted the fraction of interactions that were significant

CHAPTER 4. MULTIPLE SPECIES NETWORK

108

Figure 4.6: Splitting the datasets into random halves yields approximately the same set of meta-gene interactions. in one network (P < 0.05), given that they were significant in the other network at P < p for various values of p. I repeated this entire procedure five times. At P = 0.05, I found that 41% of the significant expression interactions in one network were also significant in the other network. These results indicate that the microarray experiments are reasonably broad and diverse, revealing a general set of gene interactions (Figure 4.6). I randomly divided the databases of each species into two equally sized sets, and then generated new networks derived from each half of the data for a series of P-values. Figure 4.6 shows the percent of meta-gene pairs with P < p in the first

CHAPTER 4. MULTIPLE SPECIES NETWORK

109

half that have P < 0.05 in the second half, for each P-value p. The P-values in the figure are shown in log-scale. Three additional randomizations gave identical results. The red arrow in the figure denotes P < .05 used in the gene co-expression network.

4.3.4

Improvement over single species

A critical test for the multiple species network is to see whether gene function can be predicted more accurately than networks constructed using data from a single organism. For each gene, I constructed an organism-specific neighborhood, which consisted of the genes that are most co-expressed with it in that organism. On average, the neighborhoods of these five genes were over 4 times more enriched for cell proliferation/cell cycle genes in the multiple species network than they were in the best single species neighborhood (Table 4.1). This observation supports the hypothesis that the multiple species network tends to retain coexpression links between functionally related meta-genes while discarding spurious gene expression links. I evaluated this hypothesis on a more global scale by comparing the ability of both the multiple species and single-species networks to link together genes previously known to be involved in a single function, excluding genes not known to participate in that function. For most functional categories (as defined by the KEGG database) [67], the multiple species network performed significantly better than the

110

CHAPTER 4. MULTIPLE SPECIES NETWORK

No. Cell Cyclea 2 14 1 6 6

Meta-geneb MEG1503 MEG342 MEG4513 MEG1192 MEG1146

Multic 10 34 1 29 17

Flyd 102 2 41 178

Humane 7 44 2 68 185 106f

Worm Yeast 67 7 237 33 50 67

50 28

Table 4.1: Precision of multi-species network on cell proliferation compared to single species. a The number of cell proliferation / cell cycle links the meta-gene has in the multi-species network. b One of the five meta-genes predicted to be involved in cell proliferation. c The number of total links the meta-gene has in the multi-species expression network. d The number of links required to get as many cell proliferation / cell cycle links as in the multiple species network. This number was obtained by ranking all of the genes based on their Pearson correlations to the central gene and then including this many genes to capture the same number of cell cycle genes as in the multi-species network. An empty cell indicates either the meta-gene does not have an ortholog or does not have data in the particular organism. e LocusLink identifier. f Note that two human genes belong to meta-gene MEG1146. single species networks. In the case of the cell-cycle function, for example, the multiple species network performed significantly better than any of the networks formed using a single species (Figure 4.7A). Specifically, for a given P-value cutoff, the multiple species network had a higher percentage of cell cycle genes linked to one another (y-axis; accuracy) for any given percentage of total cell cycle genes that are linked (x-axis; coverage). Genes involved in proteasome function and genes involved in oxidative phosphorylation were also clustered more tightly and more exclusively by the multiple species network versus any of the single species networks (Figure 4.7A). For some categories (e.g., genes involved in ribosomal function), the best network constructed with data from a single organism (e.g., using only yeast or worm microarray

CHAPTER 4. MULTIPLE SPECIES NETWORK

111

data) was about equal in quality to the network generated using multiple species (Figure 4.7A), but the networks in other organisms performed much worse. I constructed a co-expression network from each species by selecting a Pearson correlation cutoff and linking every pair of genes with a correlation higher than the cutoff. I collected all of the links involving meta-genes from a single functional category from KEGG [67]. I then calculated the percent of links connecting two members of the category. Figure 4.7 shows the accuracy of the prediction (y-axis) plotted against the percent of meta-genes that are connected to at least one other meta-gene in the category (coverage; x-axis). Accuracy was defined as the fraction of links connecting two genes of the same category over the total number of links. I varied the Pearson cutoff for constructing single species networks and varied the P-value cutoff for constructing multiple species networks to obtain different coverages and accuracies (Figure 4.7). One possible explanation for the superior performance of the multiple species network is the trivial fact that the multiple species network was built from more DNA microarray data. To rule out this possibility, I repeated the functional prediction analysis with a multiple species network formed from only 979 DNA microarrays (the same number as in the worm dataset). I found that the network built from fewer microarrays performed as well as the network built from all of the microarrays in terms of predicting gene functional categories (Figure 4.8). This suggests the predictions made by the multiple species network are not due simply to a larger database size.

CHAPTER 4. MULTIPLE SPECIES NETWORK

112

Figure 4.7: The multiple species network is more accurate for predicting gene function than any of the single-species networks.

CHAPTER 4. MULTIPLE SPECIES NETWORK

113

Figure 4.8: The superior performance of the multiple species network for predicting gene function is not due simply to larger database size. The multiple species network was re-derived from a dataset made up of random subsets of experiments from each organism so that the total number of experiments equaled the number of C. elegans experiments (979 hybridizations) and the proportions of experiments matched the dataset used to construct the full network.

CHAPTER 4. MULTIPLE SPECIES NETWORK

114

Figure 4.9: Multiple species coexpression landscape. Peaks in the landscape represent areas of high gene density.

4.3.5

Functional modules of conserved coexpression

I took all of the links found in the previous section and laid them out using VxOrd [26]. Figure 4.9 shows the results of applying VxOrd to the set of P-values obtained previously. Like the result for C. elegans discussed in Chapter 2, I found several areas of high gene density evident by the peaks in the landscape. The peaks in this case correspond to sets of meta-genes that show correlation across organisms with one another. One could again show that the structure itself of the landscape is significant as was done previously in Chapter 2. I visually inspected the landscape produced by VXOrd and identified 12 regions

CHAPTER 4. MULTIPLE SPECIES NETWORK

115

of highly interconnected meta-genes that I refer to as “components”. I then used a K-means clustering procedure to identify which genes were included in each of the 12 components. Using the X-Y coordinates of the meta-genes as determined by VXOrd, I seeded the K-means procedure with 12 centers approximately equal to the locations of the 12 components. The procedure converged in 3 iterations. Meta-genes that were farther away than d units from their closest center were excluded from membership in any of the 12 components. I chose d to be 10% of the diameter of the entire landscape. The K-means clustering procedure on the X-Y coordinates defined 12 regions of the terrain map containing a large number of highly-interconnected meta-genes that I refer to as components. Most of the components were enriched for meta-genes involved in similar biological processes, such as protein degradation, ribosomal function, cell cycle, metabolic pathways and neuronal processes (Figure 4.9, and Table 4.2). Component 5 is an example of a group that is strongly enriched for meta-genes involved in a common biological process. This component contains a total of 241 meta-genes, of which 110 were previously known to be involved in the cell cycle (out of a total of 202 cell cycle meta-genes in the network; 7.7 fold more than expected using the hypergeometric distribution, P < 10 − 85)(Table 4.2). Of the cell cycle meta-genes, 30 are involved in regulating the cell cycle, such as MEG2742 (encodes cyclin E) and MEG5621 (encodes Wee1), along with 80 that perform terminal cell cycle functions, such as MEG1092 (encodes DNA polymerase alpha). The remaining

116

CHAPTER 4. MULTIPLE SPECIES NETWORK

Component Size 1 353

2 3 4 5 6 7 8

9 10 11 12

349 320 320 241 201 167 156

139 92 65 57

Biological Function cellular cortex signaling animal-specific ribosome biogenesis energy generation proteasome cell cycle general transcription animal-specific translation init. elong., termin. aminoacyl tRNA biosynthesis ribosomal protein subunits secretion neuronal lipid metabolism peroxisome

Enrichment; Genes P-value 16/57 2.7; 10−6.1 44/321 1.3; 10−5.8 195/1441 1.3; 10−7.2 102/125 8.0; 10−83 77/147 5.6; 10−42 31/32 12; 10−32 110/202 7.7; 10−85 47/142 124/1441

5.6; 10−24 1.8; 10−17

20/110

4.0; 10−7.3

14/31

9.9; 10−11

74/78 37/85 17/42 6/16 14/32

23; 10−107 16; 10−38 21; 10−19 22; 10−7 26; 10−17

Table 4.2: Network components and their biological functions.

CHAPTER 4. MULTIPLE SPECIES NETWORK

117

131 genes were not previously known to be involved in the cell cycle and so linking these genes to known cell cycle meta-genes in the coexpression network suggests new cell cycle functions for these genes.

4.3.6

Biological confirmation

I identified several genes highly connected to cell proliferation genes. I chose those genes that were not previously known to be involved in cell proliferation. I then looked to see if I had data in either humans or worm that might reinforce the prediction that these genes were involved in cell proliferation related processes. I found 5 for which I had microarray expression data in human. MEG1503, MEG342, MEG4513, MEG1192, MEG1146 are over expressed in pancreatic cancers. I plotted the metagenes with the GeneXPress program using data from [55]. The first five columns correspond to expression data obtained from normal pancreas specimens (pSF2779N, pSF442N, pSF4N, pSF5NT, and pSF768NT) while the remaining eight correspond to expression data obtained from pancreatic cancer specimens (a pancreatic cancer cell line (HS766T), 5 Hopkins/Goggins pancreatic cancer cell cultures (PL2, PL22, PL21, PL1, and PL8), a poorly differentiated pancreas carcinoma (pSF439T), and a pancreas foamy cell adenocarcinoma specimen (pSF1T)). Each row corresponds to the expression profile of a single meta-gene across the 13 pancreatic samples. Bold indicates meta-genes newly-implicated in cell proliferation by the network. Neighbors

CHAPTER 4. MULTIPLE SPECIES NETWORK

118

Figure 4.10: Genes predicted to be involved in cell proliferation by the multiple species network are differentially regulated in pancreatic cancer. MEG1503, MEG342, MEG4513, MEG1192, and MEG1146 are over expressed in pancreatic cancers. The expression profiles taken from [55]. The first five columns correspond to expression data obtained from normal pancreas specimens, and the remaining eight columns correspond to expression data obtained from pancreatic cancer specimens. Color indicates a gene is under- (blue) or over-expressed (yellow) relative to the average expression of the gene in normal pancreas. Bold rows correspond to meta-genes with unknown functions that are implicated in cell proliferation by the network. Other rows correspond to neighbors of each implicated meta-gene that were previously known to be involved in cell proliferation. Scale shows log2 expression ratio.

CHAPTER 4. MULTIPLE SPECIES NETWORK

119

of each newly implicated meta-gene that were previously known to be involved in cell proliferation or cell cycle are also shown. The scale shows log 2 expression ratio. One of these genes had an ortholog in worm that yielded an embryonic lethal phenotype. That meant it might be possible to identify a phenotype by watching how the embryos die. Looking under a microscope suggested the embryos die because they contain an excess of nuclei in the germline (Figure 4.11). Figure 4.11 shows the wild-type gonads and gonads from worms that were fed bacteria producing ZK652.1 dsRNA for two days [34]. Gonads were stained with DAPI to show DNA in nuclei [103]. ZK652.1(RNAi) gonads have more nuclei than wild-type and lack oocytes (ooc.). rrf3(pk1426) worms were used as they are more sensitive to RNAi [82]. This supports the idea that the gene is involved in suppressing nuclear proliferation in C. elegans, thus supporting the prediction made by the network.

4.4

Systems-level analysis of conserved coexpression

4.4.1

Newly-evolved and ancient modules

In addition to learning about the function of individual genes, one can use the network to analyze entire sets of genes in order to understand the system as a whole. Consider three types of genetic modules: (1) ancient, dedicated modules, (2) evolving

CHAPTER 4. MULTIPLE SPECIES NETWORK

120

Figure 4.11: RNAi-induced phenotype of ZK652.1. Shown are wild-type gonads and gonads from worms that were fed bacteria producing ZK652.1 double-stranded RNA for 2 days [34]. Gonads were stained with 46diamidino-2-phenylindole to show DNA in the nuclei [103]. ZK652.1 (RNAi) gonads have more nuclei than the wild type and lack oocytes (ooc.). rrf-3(pk1426) worms were used because they are more sensitive to RNAi [82].

CHAPTER 4. MULTIPLE SPECIES NETWORK

121

modules and (3) modules with interchangeable parts. Ancient modules, such as the group of meta-genes involved in ribosomal function, have a main core cellular function that has been conserved from yeast to human. Meta-genes in these modules would be expected to have highly-conserved coding regions and to contain gene expression links that are conserved. Evolving modules, such as those modules involved in neuronal function, show rapid change among the four species. Meta-genes in this type of module are expected to lack a yeast ortholog and to show relatively large changes in expression links between invertebrates and humans. Modules with interchangeable parts are composed of meta-genes that have different links in different species. For example, sir-2 encodes a protein that is highly conserved from yeast to humans and is involved in regulating chromatin structure and gene expression, but it has different downstream targets in each species [43]. Other types of meta-genes with adaptable, interchangeable functions would include those encoding transcription factors, signaling molecules, and adaptor proteins. These meta-genes have coding sequences that are conserved but are likely to have different sets of gene expression links in each species. I tested the extent to which different modules are present in our gene coexpression network. I split the set of meta-genes into a set of 2969 that contain a yeast ortholog and a set of 3338 that were animal-specific as they included genes from worms, flies or humans but not yeast. Next, I determined the degree to which the gene expression

CHAPTER 4. MULTIPLE SPECIES NETWORK

122

links have been conserved for each meta-gene by defining a set-theoretic quantity called the expression conservation index (ECI) where larger values indicate stronger conservation (see Equation 4.6). I examined the degree of conservation of the different biological functions represented by the 12 main components in the gene coexpression network (Figure 4.12). Figure 4.12B shows component 11 which is composed primarily of animal-specific meta-genes (yellow boxes) and component 9 that is composed primarily of metagenes containing a yeast ortholog (blue boxes). Components 1, 7 and 11 were the most enriched for animal-specific meta-genes and also showed the lowest degree of evolutionary conservation of their gene expression links (ECIs of 0.67, 0.56, and 0.61 respectively)(Table 4.2, Figure 4.12). Component 1 was enriched for meta-genes involved in signaling pathways, consistent with the idea that signaling pathways are animal-specific and regulate diverse sets of downstream genes in different organisms. Component 11 was enriched for meta-genes involved in neuronal function, consistent with the idea that neuronal functions show high amounts of evolutionary change. Component 7 has yet to be correlated with any biological function. Nevertheless, the low conservation of both the coding region and gene interactions for this component suggests that it may be involved in processes that are evolving. In contrast to components 1, 7 and 11, component 9 is the least enriched for animal-specific meta-genes and shows the highest degree of evolutionary conservation (2.4 average conservation

CHAPTER 4. MULTIPLE SPECIES NETWORK

123

Figure 4.12: The multiple species network contains large components with differing levels of conservation. A. shows the conservation of components in the multiple species network. Each box in the heat-map represents the percent of links connected to a meta-gene in one organism that are also present in the multiple species co-expression network. Gray indicates where a meta-gene lacked an ortholog in a particular organism. The columns correspond to flies, humans, worms, and yeast respectively. The final column shows the expression conservation index for a single meta-gene based on Equation 4.6. The average conservation degree for all meta-genes in the component is shown beneath the components heat-map. Scale shows the conservation degree. B. Plots showing the distribution of animal-specific (yellow dots) and eukaryotic (blue dots) genes on components 11 and 9 in the multiple species network.

CHAPTER 4. MULTIPLE SPECIES NETWORK

124

degree for all meta-genes), consistent with the known biological function associated with this component (ribosomal function). I used the network to investigate the interconnections between multicellular and core cell-biological processes, and found several examples of processes that were intertwined. Component 3 is enriched for meta-genes involved in metabolic pathways, particularly those used to generate energy such as the glycolytic pathway and the TCA cycle (Table 4.2). Within component 3, there is a small cluster of 76 metagenes that is enriched for animal-specific meta-genes (1.5 fold enriched, P < 10 − 4) and meta-genes with a low conservation degree of gene expression links. Four of the 48 animal-specific meta-genes are involved in muscle function (10-fold enriched; P < 10 − 4.6), consistent with very high energy demands of muscle. Component 4 is enriched for meta-genes involved in protein degradation, such as meta-genes that encodes proteasomal subunits (e.g. MEG1013) or ubiquitin ligases (e.g. MEG1233) (Table 4.2). Within this component, there is a cluster of 92 meta-genes that are animal-specific. Within this animal-specific portion, 3 meta-genes are involved in apoptosis (2.1-fold enriched; P < 10 − 1.3), indicating a functional link between a core cell biological process (protein degradation) and an animal-specific process (programmed cell death).

CHAPTER 4. MULTIPLE SPECIES NETWORK

4.4.2

125

Significant abundance of large modules

Some biological functions require the coordinated effort of a large number of genes. For example, many protein subunits are required for the ribosome to synthesize new polypeptide links, and many enzymes are required to generate energy in the glycosylation pathway. Other biological functions may be comprised of genes that either act alone or that have multiple sets of interaction partners. For example, transcription factor genes act with different regulators to regulate different downstream targets depending on cell context, and thus appear as isolated meta-genes in the gene coexpression network because they do not have a consistent group of interacting partners. The entire network of genetic pathways that comprise an organism will be composed of some pathways that are designed to be large and others that are engineered to be small. To characterize the connectivity properties of the gene coexpression network, I counted the number of neighbors for each meta-gene in the network, and compared this to neighborhood sizes arising from networks constructed from permuted data (Figure 4.13). I found that the distribution of gene expression links in the gene coexpression network was highly non-random, containing significantly more metagenes with a larger number of gene expression links than the control networks. For example, there were 1290 meta-genes with 10 or more links compared to only 168.3 +/- 11.1 such links in the control networks. Figure 4.13 shows the number of links

CHAPTER 4. MULTIPLE SPECIES NETWORK

126

Figure 4.13: Distribution of number of links for each meta-gene compared to randomly generated networks. Shown is the number of links (x axis, log10 scale) compared with the number of meta-genes that have that number of links (y axis, log 10 scale) in the network (blue triangles) and in the networks constructed from permuted data (green squares). The black line (slope of −1.51) depicts the least-squares fit of the data to a linear line in the log-log plot.

CHAPTER 4. MULTIPLE SPECIES NETWORK

127

(X-axis, log10 scale) versus the number of meta-genes having that number of links (Y-axis, log10 scale) in the network (blue triangles) and in the networks constructed from permuted data (green squares). The data was randomized by permuting the experiments from each organisms’ compendium of microarray data. The black line (slope of 1.51) depicts the least-squares fit of the data to a linear line in the log-log plot. The connectivity of the network follows a power-law distribution (Figure 3; linear regression to log-log plot explains 92% of the variation) [35]. This power-law function has been observed in other analogous natural and social phenomena (such as the distribution of U.S. firm sizes or the neighborhood sizes in the World Wide Web) and also in biological networks (such as protein-protein interaction) [41, 6, 36, 37]. This result suggests the existence of a selective force in the overall design of genetic pathways to maintain a highly connected class of genes. The multiple species network contains 570 meta-genes that encode proteins of unknown function (orthologs that are conserved across evolution but whose function is poorly understood in any organism). These meta-genes have a total of 3943 connections to other meta-genes in the network (many of which have known functions), potentially allowing these novel meta-genes to be characterized.

CHAPTER 4. MULTIPLE SPECIES NETWORK

4.5

128

Discussion

I have presented a network in which genes that are linked together exhibit correlated patterns of expression that have been conserved across evolution. Each connection in the gene coexpression network has been conserved through evolution, indicating that coexpression of these specific meta-genes confers a selective advantage strongly suggesting that the meta-genes are functionally related. Hence, the network provides new predictions about how these genes function, and which genes act together in similar genetic modules. Such insight requires the analysis of large data sets, and cannot be derived from isolated experiments on a single set of genes. In particular, the network includes data from 1202 human DNA microarrays, and is one of the first and largest compendia of human gene expression profiles. The coexpression network constructed using this process provides a robust mechanism for detecting genes of related functions. It is clear that small and subtle changes in fitness can confer selective advantage during evolution. The test for related gene function using evolutionary conservation in the wild is more sensitive than scoring the phenotype resulting from strong loss-of-function mutants in the laboratory. For instance, evolution selects for genes with redundant functions when these same genes may show weak or no mutant phenotypes when scored in genetic experiments. In addition to learning about the function of individual genes, one can use the network to analyze entire sets of genes in order to understand the system as a whole. I

CHAPTER 4. MULTIPLE SPECIES NETWORK

129

determined the overall degree of conservation of protein sequences and gene expression links in each of the 12 components in the network, and thereby obtained insights about when the pathways in the components evolved, whether the set of meta-genes in the components are intrinsically linked to each other (e.g., each meta-gene only functions with other meta-genes in this group, such as ribosomal subunits) or whether the metagenes are pleiotropic and adaptable (e.g., they can form functional links with several sets of meta-genes, such as transcription factors and their downstream targets). I found a simple principle underlying the organization of the complex gene coexpression network, which is that the connectivity of the meta-genes in the network follows a power-law function. This observation suggests that the size of gene modules is a design consideration for gene interconnection architectures. The biological forces that give rise to the power-law distribution are currently unclear, but potential forces include evolutionary processes consisting of gene duplication and divergence, and design considerations regarding the efficacy of large and small genetic pathways. Other types of networks (such as the sizes of U.S. firms, financial market fluctuations, word usage, and webpage hyperlinks) also follow a Zipfian power-law distribution [41, 37]. This similarity suggests that it might be possible to adapt existing methodologies used to study how social networks are formed to provide insights about organizing principles for complex biological networks.

Chapter 5 Contributions and future directions This thesis represents a first effort to identify the function of genes using large compendia of gene expression profiles from multiple organisms. The analysis could be extended in a number of ways. First, gene expression links will be refined as more DNA microarray data become available. Second, alternative algorithms for analyzing gene expression can be used, such as algorithms for finding gene links or regulatory modules that occur only under specific cellular contexts. Finally, it should soon be possible to expand this analysis to include DNA microarray data from other organisms, especially data from mouse experiments. This will extend the range of discoveries available from this type of approach, yielding novel insights about mammalian-specific processes.

130

CHAPTER 5. CONTRIBUTIONS AND FUTURE DIRECTIONS

5.1 5.1.1

131

Increasing the breadth of discovered modules Integrating multiple different microarray platforms

The work focused primarily on the analysis of spotted cDNA microarrays developed originally by Pat Brown’s laboratory. Much more data is now available on commercial platforms. The most popular is the oligonucleotide microarray first developed by Affymetrix. These arrays contain short oligonucleotide sequences that match a unique location in the genome. Absolute expression levels, rather than relative levels as in the Pat Brown cDNA arrays, can be measured using this platform. Combining expression measurements across the two platforms therefore requires a normalization step. One approach in wide use is to compute a “virtual reference” for each gene and normalize the oligonucleotide data based on the virtual reference as if it came from a control sample. One choice for a virual reference, that is often used, is the mean or median level of expression of a single gene across all chips included in the analysis. We need computational methods that integrate multiple types of data for several reasons. Multiple data sources often provide a certain amount of independent information that, when combined with gene expression data, can improve the quality of gene function prediction. Bringing other data sources into the analysis can also increase the breadth of processes that can be analyzed. Many processes are not regulated at the transcriptional level. We may be able to study these processes by

CHAPTER 5. CONTRIBUTIONS AND FUTURE DIRECTIONS

132

analyzing protein interaction data or other post-transcriptional data.

5.1.2

Identifying mammalian-specific conserved modules

Incorporating data from more organisms, especially those that are more closely related to human, is critical for discovering conserved functional aspects about genes that relate to human health. The fabrication of microarray technologies for organisms like mouse and chimp will provide molecular information that can be used to find conserved gene expression modules relating to skeletal, muscular, and neuronal processes. As more sequencing projects are completed and functional genomics technologies are adapted to the study of additional organisms, the ability to incorporate a large number of organisms may become increasingly important. Organisms that are closely related to human will provide molecular information that is more relevant for human health.

CHAPTER 5. CONTRIBUTIONS AND FUTURE DIRECTIONS

5.2

133

Algorithmic developments for multiple species analysis

5.2.1

Recommending genes across organisms

The gene recommender could be extended to operate over datasets collected across multiple organisms. This is relatively straightforward to do. One could start with a query containing genes from a particular organism Qi . In each species j, one could identify genes in organism j that are highly similar to some gene in Qi . Let such a set of genes be Qj . We could then run a separate gene recommender in organism j using Qj as the query. This procedure would then yield a ranking for each gene i based on how co-expressed the gene was with the query set. Let the ranking for gene i assigned by the gene recommender organism j be Ri,j . This procedure could be repeated for each organism j. One could collect the results of these gene recommender runs across K organisms. The additional advantage of this approach is that, if a particular candidate gets high scores from different gene recommender runs on different organisms, this increases our confidence that such a gene is related to the query set.

CHAPTER 5. CONTRIBUTIONS AND FUTURE DIRECTIONS

5.2.2

134

Alternative estimates of functional coexpression

The methods outlined in this chapter described a procedure for estimating the degree to which any two genes are expressed in a statistically similar way across organisms for which microarray expression data is available. The emphasis was on defining an intuitive, and readily computable, null distribution from which significant departure could be detected. However, other choices were possible. For example, one could choose a similarity measure based on a likelihood ratio.

Lij =

P (Sij1 = sij1 , Sij2 = sij2 , ..., SijK = sijK |(i, j) ∈ R) P (Sij1 = sij1 , Sij2 = sij2 , ..., SijK = sijK |(i, j) ∈ / R)

(5.1)

The likelihood ratio shown in equation 5.1 compares the probability the observed similarities come from a distribution in which gene i and j are related to a distribution in which they are not related. Note that using the likelihood ratio as a measure ignores the prior probabilities that two arbitrary genes are related (i.e. P ((i, j) ∈ R)). One could imagine approximating such a prior by attempting to compute the magnitude of R. This is undesirable since it requires making assumptions about the size of processes or the number of genes that a typical gene interacts with.

CHAPTER 5. CONTRIBUTIONS AND FUTURE DIRECTIONS

5.3

135

Towards a more complete set of functional modules with integrative approaches

Coexpression data produces sets of genes postulated to be coregulated under certain conditions in the cell. The dominant model for coordinate transcription is that, together with the basal transcriptional machinery, specific transcription factors promote the production of mRNA upstream of a set of target genes. The model predicts that any coexpression between genes can be explained by identifying the combination of cis and trans elements that participate in the transcriptional process at the set of coregulated genes. Therefore with any set of putative coregulated genes we seek to identify the cis and trans elements that participate in the genes’ coregulation. The hope is that, by combining gene expression data with regulatory element prediction, we can improve both methods. For example, if we find a particular motif overrepresented in some subset Identifying trans elements responsible for the coregulation of a set of genes is an active area of research. Recent technological developments have provided experimental data that can now be used for this task [62]. For example, MAGIC is a system for combining diverse types of data for determining if gene-gene interactions are functionally relevant. MAGIC is based on Bayesian Networks and therefore provides a probabilistic formalism for the integration task (Troyanskaya et al. (2003) [95]). Other examples of integrative approaches include natural language

CHAPTER 5. CONTRIBUTIONS AND FUTURE DIRECTIONS

136

processing methods that incorporate information from the biomedical literature in addition to primary functional data for predicting gene function. For example, Raychaudhuri et al. (2003) present a novel method for finding projections in microarray data that yield surprising word distributions [71]. Here I only discuss the developments in cis element identification combined with gene expression analysis. There is a rich literature on identifying putative regulatory motifs in a set of unaligned DNA sequences [50, 101]. Although these methods rely on two types of data – expression and sequence – they are not integrative because the results cannot be used to improve both the prediction of coexpression and the prediction of cis regulatory motifs. The traditional methods for identifying cis regulatory motifs take a set of sequences and then search for overrepresented subsequences that discriminate the input set from a random set. The methods have a serial nature about them in which the output of one method (prediction of coregulation) is fed as the input to the next method (prediction of cisregulatory motifs). For example, Hertz et al. (1990) developed a method called CONSENSUS that takes a set of genes derived from those bound by a common transcription factor [50] and learns a position specific scoring matrix that represents a first-order probablistic model of a motif. The PSSM can then be used to scan the genome for additional sites. More recently, approaches have been developed that allow the coregulation and

CHAPTER 5. CONTRIBUTIONS AND FUTURE DIRECTIONS

137

cis regulatory element predictions to influence one another. For example, Conlon et al. (2003) use expression data to improve cis regulatory site prediction [23]. The authors regress the scores produced by their motif-finding program against a single gene expression experiment to identify which predicted motifs better correlate with expression. E. Segal et al. (2001) develop a Bayesian probablistic model for identifying sets of genes and representative cis motifs simultaneously [78]. New integrative approaches promise to shed light on the relationship between common transcription factor binding sites and sets of coregulated genes [70].

Appendix A Dimensional reduction of DNA microarray data using principal components analysis Before accurate gene function prediction can be performed, it may be necessary to pre-process the data to compensate for redundancy or other biases. Dimensional reduction techniques attempt to replace the original experiments with a set of nonredundant experiments. At the same time, dimensional reduction techniques can give an informative perspective on a large dimensional dataset. In the context of microarray expression data, dimensional reduction is extremely useful in the cases where the experiments are expected to be highly correlated (such as in a time-series).

138

APPENDIX A. PRINCIPAL COMPONENTS ANALYSIS

139

Principal components analysis (PCA) is a statistical technique for determining the key variables in a multidimensional data set that explain the differences in the observations, and can be used to simplify the analysis and visualization of microarray data sets. When the experimental conditions are considered to be the variables, PCA can be used to summarize the ways in which gene responses vary under different conditions. In addition, examining the components can provide insight into the underlying factors that are measured in the experiments. Both experimental noise and hidden dependencies among a set of experimental conditions may confound our ability to predict gene function. It is non-trivial to eliminate either of these complicating factors. One particular problem is that different experiments that seem different because of their biological context (heat shock, starvation, or oxygen deprivation, for example) may actually be identical or very similar in terms of the gene expression state that results. In such cases, a na¨ıve analysis might associate some genes too tightly because multiple redundant measurements exist. Thus, it may be beneficial to pre-process the data before analysis in order to identify the independent information content of different experimental conditions. I have worked on applying PCA to microarray expression data in an effort to interpret the results of a single experimental series. The work has been published in the Proceedings of the Pacific Symposium on Biocomputing [72].

APPENDIX A. PRINCIPAL COMPONENTS ANALYSIS

A.1

140

Methods

PCA is an exploratory multivariate statistical technique for simplifying complex data sets [32, 8, 69]. Given n observations on p variables, the goal of PCA is to reduce the dimensionality of the data matrix by finding r new variables, where r is less than n. Termed principal components, these r new variables together account for as much of the variance in the original n variables as possible while remaining mutually uncorrelated and orthogonal. Each principal component is a linear combination of the original variables, and so it is often possible to ascribe meaning to what the components represent. To find the principal components, one decomposes an expression matrix A using singular value decomposition as in equation A.1.

A = U ΣV 0

(A.1)

where U is an n-by-p projection of A onto the n-by-p eigenvectors V . Σ is a diagonal matrix of singular values, the squaring each entry then gives the eigenvalues. The goal of PCA is to find a smaller dimensional space that explains as much of the original data as possible. This corresponds to choosing the r components from V with highest associated eigenvalue (or corresponding entry in Σ). Each eigenvalue corresponds to the amount of variance captured by the associated eigenvector.

APPENDIX A. PRINCIPAL COMPONENTS ANALYSIS

Projection On condition T=0 T = .5 T=2 T=5 T=7 T=9 T = 11 Eigenvalue % variance

Principal Components 1 2 3 4 -0.0072 -0.0116 -0.0631 -0.2166 0.2076 -0.7524 -0.5373 0.2606 0.2358 -0.4925 0.3296 -0.5935 0.3975 -0.1156 0.5612 -0.002 0.554 0.0862 0.1869 0.4959 0.4671 0.2517 -0.153 0.1169 0.4671 0.3273 -0.4748 -0.5229 2.2928 0.401 0.1322 0.0594 76.9 % 13.5 % 4.4 % 2.0 %

5 0.0764 0.1545 -0.453 0.5919 -0.1112 -0.5413 0.3307 0.0406 1.4 %

141

6 7 -0.7433 0.625 -0.0683 -0.0756 0.1713 0.0803 -0.2532 -0.3151 0.2889 0.5559 -0.4488 -0.4324 0.254 0.044 0.0288 0.025 1.0 % 0.8 %

Table A.1: Results of PCA on the sporulation time series data. The values in the columns are coefficients of the principal components that are related to each of the experimental time points. The eigenvalues express the variance of a principal component over all genes. Principal component 1 and 2 contain over 90% of the total variance in the data. Therefore, choosing eigenvectors with the largest associated eigenvalues amounts to choosing the dimensions in the data along which it most varies. By choosing the highest variance directions in the data, one equivalently finds lines in along which the data deviation is minimized.

A.2

Results

Application to a yeast sporulation data set of Chu et al. (1998) [20] identifies two combinations of the original seven time points that explain most of the variability in the data. Table A.1 shows the principal components computed from the sporulation time course. Figure A.1 shows the seven principal components (PCs) along the x-axis

APPENDIX A. PRINCIPAL COMPONENTS ANALYSIS

142

Figure A.1: Plot of eigenvalues of the principal components. Most of the variance in the sporulation data set is contained in the first two principal components. plotted in decreasing order of the amount of variance they explain (y-axis). By adding up the variance captured by PC1 and PC2, one finds that the first two components capture over 90% of the variance in the data. The direction the PCs point also conveys information about how the expression of the genes can be interpreted. Figure A.2 shows the weights of PC1 and PC2. PC1’s first weight is near zero indicating the first time point is not informative for characterizing the genes. The remaining weights are all positive and comparable. Thus, the first PC represents an average across the last six time points. A high score on PC1 therefore means a gene has a high average level of expression across the last six time points. For PC2, the first weight is also

APPENDIX A. PRINCIPAL COMPONENTS ANALYSIS

143

Figure A.2: Plots of the coefficients of the first three principal components. Each coefficient indicates the weight of a particular experiment in the principal component. The first principal component has all positive coefficients, indicating a weighted average. The second principal component has negative values for the early time points and positive values for the latter time points, indicating a measure of change in expression. The third coefficient captures information about the concavity in the expression pattern over time.

APPENDIX A. PRINCIPAL COMPONENTS ANALYSIS

144

near zero and thus the first time point is not informative. For the remaining six, the first three are negative and the final three are positive and linearly increase. Thus, PC2 measures a change in expression over the sporulation time course. A high score indicates a gene’s expression increases while a low score indicates it decreases. Thus, even though there were seven experiments in the time series, there were only two or three independent features for each gene. I looked up which genes contained a known sporulation specific transcription factor – the mid-sporulation element (MRE) based on the position-specific scoring matrix in the SCPD database [104]. I then plotted all of the genes along the first two principal components identified by Raychaudhuri et al (2000) [72]. Figure A.3 shows how the MSE is distributed in principal component space. One can see that the first component defines a good separation of the MRE; i.e. high values on PC1 indicate strongly that the gene has a MRE element. Based on this analysis, many of the genes with high PC1 scores that do not contain a known MRE are good candidates for having the element.

A.3

Discussion

Techniques like this may become useful for combining multiple microarray experiments across species in which a smaller set of virtual microarrays – microarrays computed from a combination of original microarrays – may be used instead of the

APPENDIX A. PRINCIPAL COMPONENTS ANALYSIS

145

Figure A.3: Each gene is plotted based on its scores along principal component 1 (x-axis) and principal component 2 (y-axis). Each open circle denotes a gene with an upstream region containing a strong match to the known MSE promoter element.

APPENDIX A. PRINCIPAL COMPONENTS ANALYSIS

146

original data. For example, PCA has also been applied to the developing rat nervous system with success [99].

Appendix B Genome sampling to discover chromosomal clustering of co-expressed genes Much of biology is focused on finding how genes are regulated. One of the dominant theories for how genes are coregulated is that they share a common set of cis elements that corresponding trans elements recognize and bind and subsequently recruit the cell’s transcriptional machinery. However, other types of regulation are known. In bacteria, for example, some genes lack promoters but are coexpressed with functionally related genes by being placed in the same operons. In this way, read-through

147

APPENDIX B. CHROMOSOMAL CLUSTERING

148

transcription includes the gene’s open-reading frame into an upstream gene’s transcript. Genome-wide studies of transcription allow us to investigate how genes might be coregulated on a global scale. Many studies have found enrichment of transcription factor binding motifs in clusters of coexpressed genes (see Berman et al. (2002) and Conlon et al. (2003) for representative examples [11, 23]). Several studies have discovered coexpressed genes are non-randomly situated along the chromosomes[22, 86, 76]. I have used a statistical sampling based strategies to assess the tendency of coexpressed genes to form small groups of 2 to 5 along the chromosomes. In this analysis one is given an experimentally determined gene list and one wishes to know how significant the relative positions of the genes are. In our example, muscle-specific genes were identified with a novel tissue-specific microarray approach developed in the lab. This tendency for genes to aggregate was assessed by generating many clusters of the same size in the list and computing a measure of chromosomal clustering. The measure used was the number of inter-genic distances within some cutoff (for example 10kb). Figure B.1 illustrates which inter-genic distances were considered to be close. A p-value for the number of inter-genic distances within some cutoff could be obtained by counting the number of times the simulation obtained or exceeded the number in the muscle-specific set. The sampling strategy had to be adapted to handle the

APPENDIX B. CHROMOSOMAL CLUSTERING

149

Figure B.1: The definition of a “close” inter-genic distance. The translation start site of each gene is indicated with a vertical tick mark. Genes whose translation start sites (vertical tick) are considered “close” if they are within a pre-specified cutoff. Inter-genic distances between genes within the cutoff are drawn in red. presence of operons and paralogs in the worm genome. Because genes in operons are coordinately expressed from the same promoter, the sampling strategy was modified to sample operons to account for this trivial form of co-expression. Also neighboring genes that are highly similar may show similar hybridization on microarrays due to either cross hybridization or because the promoter regions were also copied during the duplication event. The sampling procedure was adopted to ignore any inter-genic distances between such neighboring paralogs that fell within the cutoff for both the experimental list and those randomly generated. This work demonstrates that sampling strategies can be used to ask new biological questions. Since a tendency for chromosomal clustering exists between co-expressed genes, it also indicates that including this type of information into cluster analysis may be helpful. Here I use the results of a clustering and tests for the existence of a new property

APPENDIX B. CHROMOSOMAL CLUSTERING

150

common to the genes in an expression cluster. The example that follows attempts to answer the specific question: Do the genes in a co-expressed set cluster on the chromosomes? Several known mechanisms that play a role in transcriptional regulation include: promoters that direct specific transcription factors to the transcriptional start site of a gene, transcript structure that can regulate its own production (e.g. through a leader sequence), transcript organization where multiple transcripts under the same promoter can be co-expressed, and chromatin structure that can influence the degree to which coding regions are accessible to transcriptional machinery. With a global clustering defined over the genes, we can probe the degree of influence of these affects. In the remainder of the section, I present results that give evidence that chromatin affects gene expression on a global level. The work highlights the issues involved in defining an accurate statistical procedure for detecting this phenomenon.

B.1

Methods

In this analysis one is given an experimentally determined gene list and wishes to know how significant the relative positions of the genes are. Let the set of genes provided be G.

APPENDIX B. CHROMOSOMAL CLUSTERING

B.1.1

151

Intergenic distances

If we had a prediction of the transcription start site for each gene (or full-length expressed sequences exist for each of the genes), we could measure the distances between the transcription start sites of the genes in G. Using the transcription start sites would provide the best indicator of a gene’s expression context along the chromosomes. However, since transcription start sites are difficult to obtain (they are often defined experimentally), we will instead use translation start sites since they are easy to come by. It should be noted however that some translation start sites are far downstream from the gene’s transcription start site and therefore adds some uncertainty to the analysis. Let the set of translation start positions of the genes be S. Since some genes in C. elegans exist in operons, we replace the translation start of any gene belonging to an operon with the translation start of the most upstream gene. To avoid redundant counting, only consecutive distances are used to score chromosomal clustering. Let O be the ordered set of translation start positions for all of the operons found in G. Definition 3 An intergenic distance di is the distance between operon i’s translation start site and operon i + 1’s translation start site. Note there are |O| − N intergenic distances in any set where N is the number of chromosomes.

APPENDIX B. CHROMOSOMAL CLUSTERING

B.1.2

152

Detecting chromosomal clustering

Let two start sites adjacent in O be considered close if the distance in nucleotides is less than τ . A cluster is then a series of adjacent start sites that are considered close.

B.1.3

Assigning genes to operons

Genes in some organisms are arranged into transcriptional units called operons. Every gene in an operon faces the same direction along the chromosomes and is cotranscribed with the rest of genes in the operon to produce one polycistronic transcript. Thus, one already appreciated chromosomal clustering of coexpressed genes is the operon organization. To avoid rediscovering this phenomenon when performing the randomization, one needs to pre-process the input set so that it represents operon units rather than genes. For each gene in G, the corresponding operon is identified. Let the set of unique operons identified in this way be O. The set of operons O is what is used in the remainder of the analysis.

B.1.4

Accounting for tandem duplications

New genes often arise from duplication events. Because of the mechanism of retrotransposition, gene copies often integrate near their origins. Because the promoter

APPENDIX B. CHROMOSOMAL CLUSTERING

153

region may also be copied during the duplicationi event, this would lead to the creation of a nearby, coexpressed gene. To avoid rediscovering tandem duplications, one can avoid calling two genes close on the chromosomes if they have high enough sequence similarity.

B.1.5

Sampling to assess significance of clustering

The method I used to assess whether an observed set of genes exhibits clustering along the chromosomes is a Monte-Carlo strategy. One replication of the sampling involves: (1) drawing the same number of translation start sites from the genome as in the set O, and (2) counting the number of IGDs that are less than τ . This procedure is then repeated thousands of times, obtaining a distribution over the number of close start sites. This empirical distribution is then used to assess the significance of the observed number of close IGDs. Figure B.2 shows an example of the distribution obtained after sampling start sites from the genome. It is apparent from superimposing the results of sampling onto a Gaussian distribution (Figure B.2A) and made even more clear after performing a Q-Q plot on the results (Figure B.2B) that the distribution is well approximated by a normal distribution at least for the size of O and the magnitude of τ in this case. If the distribution is well-approximated by a normal like this, one only needs to sample enough to get reliable estimates for the first two moments of the normal.

APPENDIX B. CHROMOSOMAL CLUSTERING

154

Figure B.2: The sampled distribution of the number of intergenic distances can often be well-approximated by a normal distribution. A. Histogram of the number of times a simulation (y-axis) obtained a specific number of close start sites (x-axis) (red bars). The histogram is superimposed on top of a normal distribution (black bars). B. Q-Q plot where the quantiles of a normal (x-axis) is plotted against the quantiles obtained from the simulation (y-axis). These parameters can then be used to compute a P-value for the experimental results using the normal density function.

B.2

Results

In our example, muscle-specific genes were identified with a novel tissue-specific microarray approach developed in the lab called mRNA-tagging (Roy et al. 2002) [76]. The method identifies a set of 1364 genes that are specifically expressed in C. elegans muscle (henceforth called “muscle genes”). Figure B.3 shows chromosomal distribution of the muscle genes in the genome. In this example, we are testing whether the genes tend to clump in small consecutive

APPENDIX B. CHROMOSOMAL CLUSTERING

155

Figure B.3: The chromosomal positions of the genes specifically expressed in C. elegans muscle. Shown are all of the muscle genes that had valid translation start and stop information in WormBase [88, 44] in the C. elegans genome.

APPENDIX B. CHROMOSOMAL CLUSTERING

156

groups. By generating many random clusters of 1364 genes, the tendency for the muscle specific genes to occur in groups in the genome was assessed. A gene from the muscle gene list was counted as clustered if its start position was within 10 kb of the start position of another muscle gene. We also varied the distance criteria between 1 kb and 1 MB and observed significant clustering (P¡0.001) from 1 to 25 kb. The calculations included only those genes that were present on the microarray and for which we could determine a chromosomal position. The calculations used only the first gene in an operon, and only one gene of tandem repeats. The number of clusters expected due to random chance was calculated separately for each chromosome, and then summed to give the total number. This was done so that the number expected due to random chance reflected any bias in the experimental list. For example, sperm genes are nearly missing from the X chromosome, and so I generated lists of randomly selected genes from each of the autosomes and the X chromosome in proportion to the observed amounts. Finally, since the germ line data were obtained from experiments using microarrays containing 11,917 genes, lists of genes were randomly selected from only these genes to avoid bias. A P-value could be calculated by counting the number of times the simulation matched or exceeded the number in the muscle-specific set. The sampling strategy had to be adapted to handle the presence of operons and paralogs in the worm genome (Figure 2B and 2C). Because genes in operons are coordinately expressed

APPENDIX B. CHROMOSOMAL CLUSTERING

157

from the same promoter, the sampling strategy was modified to draw operons, rather than individual genes. A duplicated gene might also copy its parent’s promoter and therefore be co-expressed as well as arranged in tandem with its parent. Fortunately, this case can be detected by identifying potential duplicates using sequence similarity. If two genes are near each other and have a high sequence similarity, then they are not counted as clustered by the simulation. This procedure identified 384 genes in the muscle-specific set of genes as being clustered B.4. This was significantly more than what was expected by chance. The random simulation gave an average of 310 clustered genes with a fairly tight distribution. The P-value estimated from the sampling procedure was 7.2e-5, suggesting that the number of clustered genes in the muscle set was indeed significant (Figure B.4). I also looked up the sizes of the clusters. A cluster was defined to be a string of consecutive start sites within 10Kb of the closest gene in the muscle set. Inspecting the sizes of the clusters revealed that the muscle-specific set contained clusters of genes that were sizes 2-5 (Figure B.5). To determine if the genes in clusters were functionally different than the genes not in clusters, I split the muscle set into 384 clustered and 980 non-clustered genes. I then overlapped the 384 clustered genes with every biogroup and computed their overlap P-value. I did the same for the 980 non-clustered genes. I found no evidence that the clustered and non-clustered genes were enriched for different functional groups.

APPENDIX B. CHROMOSOMAL CLUSTERING

158

Figure B.4: Histogram comparing 10,000 random samples with the observed muscle gene clusters (red arrow).

APPENDIX B. CHROMOSOMAL CLUSTERING

159

Figure B.5: Muscle genes are clustered in small groups of 2-5 genes along the chromosomes. Distribution of the number of muscle-expressed genes in each cluster.

APPENDIX B. CHROMOSOMAL CLUSTERING

160

I applied the strategy to many other co-expression sets in C. elegans to see if they also exhibited chromosomal clustering (Table B.1). I found that many co-expressed sets of genes exhibit chromosomal clustering. This is particularly true of the tissuespecific genes. To further investigate the chromosomal clustering phenomenon, I computed P-values for the different cluster sizes observed in the co-expression sets (Table B.1). I found that clusters of size 6 or greater are very significant, almost never arising randomly. The sperm-enriched set had 13 clusters of size six or greater which was very significant. These genes may represent genes that reside in hitherto unpredicted operons. The next most significant size tended to be doublets. This may suggest that the chromatin domains, if they are indeed influencing clustering, have their greatest effects over short distances in the worm, on the order of a couple to a handful of genes.

B.3

Discussion

These methods can help us bootstrap computational representations of gene function that can aid subsequent analysis. I have also demonstrated how an interpretative procedure, based on a co-expressed set of genes, can be used to uncover new mechanisms underlying gene expression.

161

APPENDIX B. CHROMOSOMAL CLUSTERING

Gene group L1 muscle-enriched genes Sperm-enriched genes Oocyte-enriched genes Germline-intrinsic genes Mountain 00 Mountain 01 Mountain 02 Mountain 03 Mountain 04 Mountain 05 Mountain 06 Mountain 07 Mountain 08 Mountain 09 Mountain 10 Mountain 11 Mountain 12 Mountain 13 Mountain 14 Oxidoreductases Transcription factors Lipid, fatty acid and sterol metabolism genes

Total genes 1,364 650 258 508 2,703 1,818 1,465 1,363 1,195 1,012 978 909 810 786 635 587 462 396 353 323 255

Genes considered* 1,304 616 242 475 2,377 1,662 1,100 1,166 1,113 944 915 840 761 744 555 546 435 368 331 317 214

Observed clustered 386 198 26 136 962 435 360 316 466 248 214 133 225 172 97 133 86 49 46 21 8

Expected clustered 310.1 112 14.4 67.9 868.9 465.9 270.6 271.8 268.2 160.7 159.8 130.7 119.8 102.8 64.8 71.5 35.9 27.3 22.1 16.7 8.6

P-value 7.2 ∗ 10−5 10−15 0.017 10−15 5.8 ∗ 10−4 0.89 10−15 0.0017 10−15 10−15 1.6 ∗ 10−4 0.32 10−15 10−15 0.0013 10−15 10−15 1.3 ∗ 10−5 4.4 ∗ 10−5 0.23 0.56

281

280

12

12.4

0.53

Table B.1: Genes expressed in the same tissue tend to cluster along the chromosomes. a The number of genes considered, after removing operons, putative tandem duplicates, and genes with uncertain genomic positions. b The number of genes in the gene group that are within 10 kb of another gene from the same group. c The number of clusters expected for a random distribution. Shown are the average number of clustered genes from 10,000 repetitions of randomly selecting the same number of genes from the genome as the experimental gene group and counting how many genes are within 10 kb of another gene from the same randomly selected list. d The probability that the observed number of genes that cluster could be matched or exceeded by chance

Appendix C Predicting orthology In this chapter I discuss my effort to construct an orthology map from evolutionarily diverse organisms: H. sapiens, D. melanogaster, C. elegans and S. cerevisiae. Each orthologous group is referred to as a meta-gene. The goal is to find orthologous counterparts across the organisms for as many genes as possible. I present several evaluations of the meta-gene orthologous groups.

C.1

Meta-genes: identifying orthologous sets of genes

I selected evolutionarily diverse organisms for which extensive microarray data was available: H. sapiens, D. melanogaster, C. elegans and S. cerevisiae. In order to

162

APPENDIX C. PREDICTING ORTHOLOGY

163

identify genes that are co-expressed across multiple organisms, I first associated genes from one organism with their orthologous counterparts in other organisms. I used an approach similar to previous approaches for identifying orthologous sets of genes [63, 94]. Orthologs were identified by performing an all-against-all BLAST between every pair of protein sequences from each of the organisms. I then defined a meta-gene to be a set of genes across multiple organisms whose protein sequences are each other’s best reciprocal BLAST hit. This assigned each gene to at most a single meta-gene. For example, meta-gene MEG273 refers to the human gene Psmd4, the C. elegans gene rpn-10, the D. melanogaster gene Pros54, and the S. cerevisiae gene Rpn10, all of which encode a non-ATPase subunit of the 19S proteasome cap (Figure C.2) [67]. In total, this construction resulted in 6307 meta-genes, consisting of 6591 human genes, 5180 worm genes, 5802 fly genes, and 2434 yeast genes.

C.2

Best-reciprocal similarity

Orthologs were identified by performing an all-against-all BLAST between every pair of protein sequences from each of the organisms. I then defined a meta-gene to be a set of genes across multiple organisms whose protein sequences are each other’s best reciprocal BLAST hit (Figure C.1A). This assigned each gene to at most a single meta-gene. There are 5289 meta-genes that included a single human gene, and 590 meta-genes that each included two human genes. For example, meta-gene MEG273

APPENDIX C. PREDICTING ORTHOLOGY

164

Figure C.1: Meta-genes on H. sapiens D. melanogaster C. elegans and S. cerevisiae. A. Scheme for construction of meta-genes. I performed an all-against-all BLAST (using the blastp method) of every protein sequence in one organism against every protein sequence in all of the other organisms [4]. I then record the best BLAST hit for each gene with an E-value less than 10-5 (left; directed arrows indicate best BLAST hits). I ignore all non-reciprocal BLAST hits (middle), and identify meta-genes as connected components in the resulting graph. B. Meta-gene multiplicity. Shown is the number of meta-genes (y-axis) that contain a particular number of genes from a single organism (x-axis). Two meta-genes contained four human genes and a single meta-gene contained four fly genes. No meta-genes contained more than four genes from a single organism.

APPENDIX C. PREDICTING ORTHOLOGY

165

Figure C.2: Example of the meta-gene MEG273. An arrow points from gene X to gene Y if the protein sequence of Y had the best BLAST score to X’s protein sequence among all protein sequences in Y’s database. refers to the human gene Psmd4, the C. elegans gene rpn-10, the D. melanogaster gene Pros54, and the S. cerevisiae gene Rpn10, all of which encode a non-ATPase subunit of the 19S proteasome cap (Figure C.2). In total, this construction resulted in 6307 meta-genes, consisting of 6591 human genes, 5180 worm genes, 5802 fly genes, and 2434 yeast genes.

APPENDIX C. PREDICTING ORTHOLOGY

C.3

166

Transitivity

Several orthology prediction sets have been published by various groups. One popular approach is NCBI’s COGs. COGs was originally built based on microbial organisms and often groups many eukaryotic genes to the same group (in some cases over 100 genes from C. elegans are grouped together). Most groups enforce some notion of transitivity among the genes related by sequence similarity. For every A-B and B-C best reciprocal found by BLAST, the transitivity criterion will group genes A, B, and C into one orthologous group only if A-C also represents a reciprocal best BLAST result. I did not enforce transitivity among the genes included within our definition of a putative orthologous group as COGs and TIGR’s EGOs do for example [63, 94]. Enforcing transitivity tends to be too restrictive a definition in practice, especially when using a limited number of organisms. Although it gives rise to fewer false positives it leads to more false negatives. This is evident in TIGR’s EGOs where they split an orthologous group into two because not all of the transitive relationships can be met (TIGR even allows a single gene to be in multiple orthologous groups to meet the requirement). See for example Tentative Orthologs 336024, 350993, and 402694 which are all nuclear transport factor 2’s. The S. cerevisiae gene is present in two of the groups and yet is still not correctly associated with the human nuclear transport factor 2 that has been shown to functionally rescue the yeast knockout of the gene.

APPENDIX C. PREDICTING ORTHOLOGY

C.4

167

Discussion and evaluation of the meta-gene orthology

I investigated the quality of our meta-genes in three ways. First, I counted the number of times multiple genes from the same organism were included in a metagene and found that most meta-genes contained a single gene from each organism (Figure C.1B). Second, I investigated the degree to which the meta-genes satisfy the transitivity requirement. I found that over 78% of the possible transitive relationships were present among the meta-genes. I investigated 20 random meta-genes that contain nontransitive relationships. I found that most were cases where very functionally similar genes from the same organism were included. I calculated the Pearson correlation between these genes in such cases and found them to be significantly high. Finally, I measured the degree to which the meta-genes correspond to EGOs. I chose 100 random meta-genes to test. For every pair of genes included in these meta-genes I asked whether the pair also were grouped into the same EGO. Out of the 100 random meta-genes, 47 contained genes from at least two organisms that I could identify in TIGR’s database and that I could subsequently identify the EGO groupt that TIGR placed the genes into. These 47 meta-genes contained 201 pairs of genes. I found that 111 (55%) mapped to the same EGO. The remaining 90 pairs were cases where the

APPENDIX C. PREDICTING ORTHOLOGY

168

genes in a single meta-gene mapped to different EGOs. Because the EGOs tend to split orthologous groups, I inspected each EGO and determined if one of the EGOs actually contained the correct pair. I found that 73 out of the 90 were such cases. The agreement between the meta-genes and the EGOs is therefore very high in that most pairs of genes placed in the same meta-gene are also placed in the same EGO. I estimate the agreement to be 91.5% (184/201). I considered four methods to generate sets of orthologous genes: 1. clusters of orthologous groups (COGS) defined by NCBI, 2. eukaryotic gene orthologs (EGO) defined by TIGR, 3. reciprocal best blast hits, and 4. transitive reciprocal best blast hits. Transitivity indicates that if human gene A is a reciprocal best blast hit of worm gene B and fly gene C, then worm gene B and fly gene C also need to be reciprocal best blast hits in order for genes A, B and C to be grouped as orthologs. It could be that some of the links do not identify true orthologs but rather close homologs; this might be the case for the 22% of meta-genes that do not exhibit transitivity. In order to determine whether using a close homolog rather than an ortholog would significantly affect the network relationships, I calculated the Pearson correlation between close homologs and found them to be significantly high. Hence, the method used in this thesis and the method used by the EGO database generate the same orthology relationship in most cases. This result indicates that the meta-gene network would yield similar results using close homologs rather than true

APPENDIX C. PREDICTING ORTHOLOGY

orthologs.

169

Bibliography [1] O. Alter, P. O. Brown, and D. Botstein. Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci U S A, 97:10101–6, Aug 29 2000. [2] O. Alter, P. O. Brown, and D. Botstein. Generalized singular value decomposition for comparative analysis of genome-scale expression data sets of two different organisms. Proc Natl Acad Sci U S A, 100:3351–6, Mar 18 2003. [3] R. B. Altman and S. Raychaudhuri. Whole-genome expression analysis: challenges beyond clustering. Curr Opin Struct Biol, 11:340–7, Jun 2001. [4] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. J Mol Biol, 215:403–10, Oct 5 1990. [5] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P. Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill,

170

BIBLIOGRAPHY

171

L. Issel-Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M. Rubin, and G. Sherlock. Gene ontology: tool for the unification of biology. the gene ontology consortium. Nat Genet, 25:25–9, May 2000. [6] R. L. Axtell. Zipf distribution of u.s. firm sizes. Science, 293:1818–20, 2001. [7] C. A. Ball, G. Sherlock, H. Parkinson, P. Rocca-Sera, C. Brooksbank, H. C. Causton, D. Cavalieri, T. Gaasterland, P. Hingamp, F. Holstege, M. Ringwald, P. Spellman, Jr. Stoeckert C. J., J. E. Stewart, R. Taylor, A. Brazma, and J. Quackenbush. Standards for microarray data. Science, 298:539, Oct 18 2002. [8] A. Basilevsky. Statistical Factor Analysis and Related Methods, Theory and Applications. John Wiley and Sons, 1994. [9] A. Ben-Dor, R. Shamir, and Z. Yakhini. Clustering gene expression patterns. J Comput Biol, 6:281–97, 1999. [10] S. Bergmann, J. Ihmels, and N. Barkai. Iterative signature algorithm for the analysis of large-scale gene expression data. Phys Rev E Stat Nonlin Soft Matter Phys, 67:031902, Mar 2003. [11] B. P. Berman, Y. Nibu, B. D. Pfeiffer, P. Tomancak, S. E. Celniker, M. Levine, G. M. Rubin, and M. B. Eisen. Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the drosophila genome. Proc Natl Acad Sci U S A, 99:757–62, Jan 22 2002.

BIBLIOGRAPHY

172

[12] L. Bianchi, S. M. Kwok, M. Driscoll, and F. Sesti. A potassium channel-mirp complex controls neurosensory function in caenorhabditis elegans. J Biol Chem, 278:12415–24, Apr 4 2003. [13] Y. Blat and N. Kleckner. Cohesins bind to preferential sites along yeast chromosome iii, with differential regulation along arms versus the centric region. Cell, 98:249–59, Jul 23 1999. [14] A. Brazma, P. Hingamp, J. Quackenbush, G. Sherlock, P. Spellman, C. Stoeckert, J. Aach, W. Ansorge, C. A. Ball, H. C. Causton, T. Gaasterland, P. Glenisson, F. C. Holstege, I. F. Kim, V. Markowitz, J. C. Matese, H. Parkinson, A. Robinson, U. Sarkans, S. Schulze-Kremer, J. Stewart, R. Taylor, J. Vilo, and M. Vingron. Minimum information about a microarray experiment (miame)toward standards for microarray data. Nat Genet, 29:365–71, Dec 2001. [15] M. P. Brown, W. N. Grundy, D. Lin, N. Cristianini, C. W. Sugnet, T. S. Furey, Jr. Ares M., and D. Haussler. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci U S A, 97:262–7, Jan 4 2000. [16] P. O. Brown and D. Botstein. Exploring the new world of the genome with dna microarrays. Nat Genet, 21:33–7, Jan 1999.

BIBLIOGRAPHY

173

[17] C. J. Ceol and H. R. Horvitz. dpl-1 dp and efl-1 e2f act with lin-35 rb to antagonize ras signaling in c. elegans vulval development. Mol Cell, 7:461–73, Mar 2001. [18] Y. Cheng and G. M. Church. Biclustering of expression data. Proc Int Conf Intell Syst Mol Biol, 8:93–103, 2000. [19] V. G. Cheung, M. Morley, F. Aguilar, A. Massimi, R. Kucherlapati, and G. Childs. Making and reading microarrays. Nat Genet, 21:15–9, Jan 1999. [20] S. Chu, J. DeRisi, M. Eisen, J. Mulholland, D. Botstein, P. O. Brown, and I. Herskowitz. The transcriptional program of sporulation in budding yeast. Science, 282:699–705, Oct 23 1998. [21] T. A. Clark, C. W. Sugnet, and Jr. Ares M. Genomewide analysis of mrna processing in yeast using splicing-specific microarrays. Science, 296:907–10, May 3 2002. [22] B. A. Cohen, R. D. Mitra, J. D. Hughes, and G. M. Church. A computational analysis of whole-genome expression data reveals chromosomal domains of gene expression. Nat Genet, 26:183–6, Oct 2000. [23] E. M. Conlon, X. S. Liu, J. D. Lieb, and J. S. Liu. Integrating regulatory motif discovery and genome-wide expression analysis. Proc Natl Acad Sci U S A, 100:3339–44, Mar 18 2003.

BIBLIOGRAPHY

174

[24] M. C. Costanzo, M. E. Crawford, J. E. Hirschman, J. E. Kranz, P. Olsen, L. S. Robertson, M. S. Skrzypek, B. R. Braun, K. L. Hopkins, P. Kondu, C. Lengieza, J. E. Lew-Smith, M. Tillberg, and J. I. Garrels. Ypd, pombepd and wormpd: model organism volumes of the bioknowledge library, an integrated resource for protein information. Nucleic Acids Res, 29:75–9, Jan 1 2001. [25] M. C. Costanzo, J. D. Hogan, M. E. Cusick, B. P. Davis, A. M. Fancher, P. E. Hodges, P. Kondu, C. Lengieza, J. E. Lew-Smith, C. Lingner, K. J. RobergPerez, M. Tillberg, J. E. Brooks, and J. I. Garrels. The yeast proteome database (ypd) and caenorhabditis elegans proteome database (wormpd): comprehensive resources for the organization and comparison of model organism protein information. Nucleic Acids Res, 28:73–6, Jan 1 2000. [26] E. H. Davidson, J. P. Rast, P. Oliveri, A. Ransick, C. Calestani, C. H. Yuh, T. Minokawa, G. Amore, V. Hinman, C. Arenas-Mena, O. Otim, C. T. Brown, C. B. Livi, P. Y. Lee, R. Revilla, A. G. Rust, Z. Pan, M. J. Schilstra, P. J. Clarke, M. I. Arnone, L. Rowen, R. A. Cameron, D. R. McClay, L. Hood, and H. Bolouri. A genomic regulatory network for development. Science, 295:1669– 78, Mar 1 2002. [27] M. Diehn, G. Sherlock, G. Binkley, H. Jin, J. C. Matese, T. HernandezBoussard, C. A. Rees, J. M. Cherry, D. Botstein, P. O. Brown, and A. A.

BIBLIOGRAPHY

175

Alizadeh. Source: a unified genomic resource of functional annotations, ontologies, and gene expression data. Nucleic Acids Res, 31:219–23, Jan 1 2003. [28] R. O. Duda and P. E. Hart. Pattern Classification and Scene Analysis. Wiley, 1973. [29] D. J. Duggan, M. Bittner, Y. Chen, P. Meltzer, and J. M. Trent. Expression profiling using cdna microarrays. Nat Genet, 21:10–4, Jan 1999. [30] N. Dyson. The regulation of e2f by prb-family proteins. Genes Dev, 12:2245–62, Aug 1 1998. [31] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci U S A, 95:14863–8, Dec 8 1998. [32] B. S. Everitt and G. Dunn. Applied Multivariate Data Analysis. Oxford University Press, 1992. [33] R. A. Fisher. Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika, 10:507–521, 1915. [34] A. G. Fraser, R. S. Kamath, P. Zipperlen, M. Martinez-Campos, M. Sohrmann, and J. Ahringer. Functional genomic analysis of c. elegans chromosome i by systematic rna interference. Nature, 408:325–30, Nov 16 2000.

BIBLIOGRAPHY

176

[35] K. Zipf G. Human Behavior and the Principle of Least Effort. Addison Wesley, 1949. [36] X. Gabaix, P. Gopikrishnan, V. Plerou, and H. E. Stanley. A theory of powerlaw distributions in financial market fluctuations. Nature, 423:267–70, May 15 2003. [37] D. Garlaschelli, G. Caldarelli, and L. Pietronero. Universal scaling relations in food webs. Nature, 423:165–8, May 8 2003. [38] S. Ghaemmaghami, W. K. Huh, K. Bower, R. W. Howson, A. Belle, N. Dephoure, E. K. O’Shea, and J. S. Weissman. Global analysis of protein expression in yeast. Nature, 425:737–41, Oct 16 2003. [39] G. Giaever, A. M. Chu, L. Ni, C. Connelly, L. Riles, S. Veronneau, S. Dow, A. Lucau-Danila, K. Anderson, B. Andre, A. P. Arkin, A. Astromoff, M. ElBakkoury, R. Bangham, R. Benito, S. Brachat, S. Campanaro, M. Curtiss, K. Davis, A. Deutschbauer, K. D. Entian, P. Flaherty, F. Foury, D. J. Garfinkel, M. Gerstein, D. Gotte, U. Guldener, J. H. Hegemann, S. Hempel, Z. Herman, D. F. Jaramillo, D. E. Kelly, S. L. Kelly, P. Kotter, D. LaBonte, D. C. Lamb, N. Lan, H. Liang, H. Liao, L. Liu, C. Luo, M. Lussier, R. Mao, P. Menard, S. L. Ooi, J. L. Revuelta, C. J. Roberts, M. Rose, P. Ross-Macdonald, B. Scherens, G. Schimmack, B. Shafer, D. D. Shoemaker, S. Sookhai-Mahadeo, R. K. Storms,

BIBLIOGRAPHY

177

J. N. Strathern, G. Valle, M. Voet, G. Volckaert, C. Y. Wang, T. R. Ward, J. Wilhelmy, E. A. Winzeler, Y. Yang, G. Yen, E. Youngman, K. Yu, H. Bussey, J. D. Boeke, M. Snyder, P. Philippsen, R. W. Davis, and M. Johnston. Functional profiling of the saccharomyces cerevisiae genome. Nature, 418:387–91, Jul 25 2002. [40] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531–7, Oct 15 1999. [41] S. M. Gomez, S. H. Lo, and A. Rzhetsky. Probabilistic prediction of unknown metabolic and signal-transduction networks. Genetics, 159:1291–8, Nov 2001. [42] C.M. Grinstead and J.L. Snell. Introduction to Probability. American Mathematical Society, 1997. [43] L. Guarente. Sir2 links chromatin silencing, metabolism, and aging. Genes Dev, 14:1021–6, May 1 2000. [44] T. W. Harris, R. Lee, E. Schwarz, K. Bradnam, D. Lawson, W. Chen, D. Blasier, E. Kenny, F. Cunningham, R. Kishore, J. Chan, H. M. Muller, A. Petcherski, G. Thorisson, A. Day, T. Bieri, A. Rogers, C. K. Chen, J. Spieth, P. Sternberg,

BIBLIOGRAPHY

178

R. Durbin, and L. D. Stein. Wormbase: a cross-species database for comparative genomics. Nucleic Acids Res, 31:133–7, Jan 1 2003. [45] J. A. Hartigan. Direct clustering of a data matrix. J. Amer. Statist. Assoc., 67:123–129, 1972. [46] T. Hastie, R. Tibshirani, M. B. Eisen, A. Alizadeh, R. Levy, L. Staudt, W. C. Chan, D. Botstein, and P. Brown. ’gene shaving’ as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol, 1:RESEARCH0003, 2000. [47] B. A. Hauser, J. Q. He, S. O. Park, and C. S. Gasser. Tso1 is a novel protein that modulates cytokinesis and cell expansion in arabidopsis. Development, 127:2219–26, May 2000. [48] L.G. Hedges and I. Olkin. Vote counting methods in research synthesis. Psychological Bulletin, 88:359–369, 1980. [49] L.G. Hedges and I. Olkin. Statistical Methods for Meta-analysis. Academic Press, 1985. [50] G. Z. Hertz, 3rd Hartzell G. W., and G. D. Stormo. Identification of consensus patterns in unaligned dna sequences known to be functionally related. Comput Appl Biosci, 6:81–92, Apr 1990.

BIBLIOGRAPHY

179

[51] I. A. Hope, D. G. Albertson, S. D. Martinelli, A. S. Lynch, E. Sonnhammer, and R. Durbin. The c. elegans expression pattern database: a beginning. Trends Genet, 12:370–1, Sep 1996. [52] T. Hubbard, D. Barker, E. Birney, G. Cameron, Y. Chen, L. Clark, T. Cox, J. Cuff, V. Curwen, T. Down, R. Durbin, E. Eyras, J. Gilbert, M. Hammond, L. Huminiecki, A. Kasprzyk, H. Lehvaslaiho, P. Lijnzaad, C. Melsopp, E. Mongin, R. Pettett, M. Pocock, S. Potter, A. Rust, E. Schmidt, S. Searle, G. Slater, J. Smith, W. Spooner, A. Stabenau, J. Stalker, E. Stupka, A. Ureta-Vidal, I. Vastrik, and M. Clamp. The ensembl genome database project. Nucleic Acids Res, 30:38–41, Jan 1 2002. [53] T. R. Hughes, M. J. Marton, A. R. Jones, C. J. Roberts, R. Stoughton, C. D. Armour, H. A. Bennett, E. Coffey, H. Dai, Y. D. He, M. J. Kidd, A. M. King, M. R. Meyer, D. Slade, P. Y. Lum, S. B. Stepaniants, D. D. Shoemaker, D. Gachotte, K. Chakraburtty, J. Simon, M. Bard, and S. H. Friend. Functional discovery via a compendium of expression profiles. Cell, 102:109–26, Jul 7 2000. [54] W. K. Huh, J. V. Falvo, L. C. Gerke, A. S. Carroll, R. W. Howson, J. S. Weissman, and E. K. O’Shea. Global analysis of protein localization in budding yeast. Nature, 425:686–91, Oct 16 2003.

BIBLIOGRAPHY

180

[55] C. A. Iacobuzio-Donahue, A. Maitra, M. Olsen, A. W. Lowe, N. T. van Heek, C. Rosty, K. Walter, N. Sato, A. Parker, R. Ashfaq, E. Jaffee, B. Ryu, J. Jones, J. R. Eshleman, C. J. Yeo, J. L. Cameron, S. E. Kern, R. H. Hruban, P. O. Brown, and M. Goggins. Exploration of global gene expression patterns in pancreatic adenocarcinoma using cdna microarrays. Am J Pathol, 162:1151–62, Apr 2003. [56] J. Ihmels, G. Friedlander, S. Bergmann, O. Sarig, Y. Ziv, and N. Barkai. Revealing modular organization in the yeast transcriptional network. Nat Genet, 31:370–7, Aug 2002. [57] M. Jiang, J. Ryu, M. Kiraly, K. Duke, V. Reinke, and S. K. Kim. Genomewide analysis of developmental and sex-regulated gene expression profiles in caenorhabditis elegans. Proc Natl Acad Sci U S A, 98:218–23, Jan 2 2001. [58] R. S. Kamath and J. Ahringer. Genome-wide rnai screening in caenorhabditis elegans. Methods, 30:313–21, Aug 2003. [59] S. K. Kim, J. Lund, M. Kiraly, K. Duke, M. Jiang, J. M. Stuart, A. Eizinger, B. N. Wylie, and G. S. Davidson. A gene expression map for caenorhabditis elegans. Science, 293:2087–92, 2001. [60] Y. Kluger, R. Basri, J. T. Chang, and M. Gerstein. Spectral biclustering of microarray data: coclustering genes and conditions. Genome Res, 13:703–16,

BIBLIOGRAPHY

181

Apr 2003. [61] L. C. Lazzeroni and A. B. Owen. Plaid models for gene expression data. Statistica Sinica., 12:61–86, 2002. [62] T. I. Lee, N. J. Rinaldi, F. Robert, D. T. Odom, Z. Bar-Joseph, G. K. Gerber, N. M. Hannett, C. T. Harbison, C. M. Thompson, I. Simon, J. Zeitlinger, E. G. Jennings, H. L. Murray, D. B. Gordon, B. Ren, J. J. Wyrick, J. B. Tagne, T. L. Volkert, E. Fraenkel, D. K. Gifford, and R. A. Young. Transcriptional regulatory networks in saccharomyces cerevisiae. Science, 298:799–804, Oct 25 2002. [63] Y. Lee, R. Sultana, G. Pertea, J. Cho, S. Karamycheva, J. Tsai, B. Parvizi, F. Cheung, V. Antonescu, J. White, I. Holt, F. Liang, and J. Quackenbush. Cross-referencing eukaryotic genomes: Tigr orthologous gene alignments (toga). Genome Res, 12:493–502, Mar 2002. [64] J. P. McCarter, M. D. Mitreva, J. Martin, M. Dante, T. Wylie, U. Rao, D. Pape, Y. Bowers, B. Theising, C. V. Murphy, A. P. Kloek, B. J. Chiapelli, S. W. Clifton, D. M. Bird, and R. H. Waterston. Analysis and functional classification of transcripts from the nematode meloidogyne incognita. Genome Biol, 4:R26, 2003.

BIBLIOGRAPHY

182

[65] H. W. Mewes, J. Hani, F. Pfeiffer, and D. Frishman. Mips: a database for protein sequences and complete genomes. Nucleic Acids Res, 26:33–7, Jan 1 1998. [66] B. Modrek, A. Resch, C. Grasso, and C. Lee. Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res, 29:2850–9, Jul 1 2001. [67] H. Ogata, S. Goto, K. Sato, W. Fujibuchi, H. Bono, and M. Kanehisa. Kegg: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res, 27:29–34, Jan 1 1999. [68] A. B. Owen, J. Stuart, K. Mach, A. M. Villeneuve, and S. Kim. A gene recommender algorithm to identify coexpressed genes in c. elegans. Genome Res, 13:1828–37, Aug 2003. [69] K. Pearson. On lines and planes of closest fit to systems of points in space. Phil. Mag., 2:559–572, 1901. [70] Z. S. Qin, L. A. McCue, W. Thompson, L. Mayerhofer, C. E. Lawrence, and J. S. Liu. Identification of co-regulated genes through bayesian clustering of predicted regulatory binding sites. Nat Biotechnol, 21:435–9, Apr 2003.

BIBLIOGRAPHY

183

[71] S. Raychaudhuri and R. B. Altman. A literature-based method for assessing the functional coherence of a gene group. Bioinformatics, 19:396–401, Feb 12 2003. [72] S. Raychaudhuri, J. M. Stuart, and R. B. Altman. Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac Symp Biocomput, pages 455–66, 2000. [73] S. Raychaudhuri, J. M. Stuart, X. Liu, P. M. Small, and R. B. Altman. Pattern recognition of genomic features with microarrays: site typing of mycobacterium tuberculosis strains. Proc Int Conf Intell Syst Mol Biol, 8:286–95, 2000. [74] V. Reinke, H. E. Smith, J. Nance, J. Wang, C. Van Doren, R. Begley, S. J. Jones, E. B. Davis, S. Scherer, S. Ward, and S. K. Kim. A global profile of germline gene expression in c. elegans. Mol Cell, 6:605–16, Sep 2000. [75] C. E. Rocheleau, J. Yasuda, T. H. Shin, R. Lin, H. Sawa, H. Okano, J. R. Priess, R. J. Davis, and C. C. Mello. Wrm-1 activates the lit-1 protein kinase to transduce anterior/posterior polarity signals in c. elegans. Cell, 97:717–26, Jun 11 1999. [76] P. J. Roy, J. M. Stuart, J. Lund, and S. K. Kim. Chromosomal clustering of muscle-expressed genes in caenorhabditis elegans. Nature, 418:975–9, Aug 29 2002.

BIBLIOGRAPHY

184

[77] M. Schena, D. Shalon, R. W. Davis, and P. O. Brown. Quantitative monitoring of gene expression patterns with a complementary dna microarray. Science, 270:467–70, Oct 20 1995. [78] E. Segal, M. Shapira, A. Regev, D. Pe’er, D. Botstein, D. Koller, and N. Friedman. Module networks: identifying regulatory modules and their conditionspecific regulators from gene expression data. Nat Genet, 34:166–76, Jun 2003. [79] E. Segal, B. Taskar, A. Gasch, N. Friedman, and D. Koller. Rich probabilistic models for gene expression. Bioinformatics, 17 Suppl 1:S243–52., 2001. [80] Q. Sheng, Y. Moreau, and B. De Moor. Biclustering microarray data by gibbs sampling. Bioinformatics, 19:196–205, 2003. [81] K. Siegrist. http://www.math.uah.edu/statold/sample/sample7.html. WWW, 1997. [82] F. Simmer, M. Tijsterman, S. Parrish, S. P. Koushika, M. L. Nonet, A. Fire, J. Ahringer, and R. H. Plasterk. Loss of the putative rna-directed rna polymerase rrf-3 makes c. elegans hypersensitive to rnai. Curr Biol, 12:1317–9, Aug 6 2002. [83] B. Snel, P. Bork, and M. A. Huynen. The identification of functional modules from the genomic association of genes. Proc Natl Acad Sci U S A, 99:5890–5, Apr 30 2002.

BIBLIOGRAPHY

185

[84] J. Y. Song, T. Leung, L. K. Ehler, C. Wang, and Z. Liu. Regulation of meristem organization and cell division by tso1, an arabidopsis gene with cysteine-rich repeats. Development, 127:2207–17, May 2000. [85] P. T. Spellman, M. Miller, J. Stewart, C. Troup, U. Sarkans, S. Chervitz, D. Bernhart, G. Sherlock, C. Ball, M. Lepage, M. Swiatek, W. L. Marks, J. Goncalves, S. Markel, D. Iordan, M. Shojatalab, A. Pizarro, J. White, R. Hubley, E. Deutsch, M. Senger, B. J. Aronow, A. Robinson, D. Bassett, Jr. Stoeckert C. J., and A. Brazma. Design and implementation of microarray gene expression markup language (mage-ml). Genome Biol, 3:RESEARCH0046, Aug 23 2002. [86] P. T. Spellman and G. M. Rubin. Evidence for large domains of similarly expressed genes in the drosophila genome. J Biol, 1:5, 2002. [87] P. T. Spellman, G. Sherlock, M. Q. Zhang, V. R. Iyer, K. Anders, M. B. Eisen, P. O. Brown, D. Botstein, and B. Futcher. Comprehensive identification of cell cycle-regulated genes of the yeast saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell, 9:3273–97, Dec 1998. [88] L. Stein, P. Sternberg, R. Durbin, J. Thierry-Mieg, and J. Spieth. Wormbase: network access to the genome and biology of caenorhabditis elegans. Nucleic Acids Res, 29:82–6, Jan 1 2001.

BIBLIOGRAPHY

186

[89] J. M. Sterner, S. Dew-Knight, C. Musahl, S. Kornbluth, and J. M. Horowitz. Negative regulation of dna replication by the retinoblastoma protein is mediated by its association with mcm7. Mol Cell Biol, 18:2748–57, May 1998. [90] J. D. Storey and R. Tibshirani. Statistical methods for identifying differentially expressed genes in dna microarrays. Methods Mol Biol, 224:149–57, 2003. [91] J. M. Stuart, E. Segal, D. Koller, and S. K. Kim. A gene-coexpression network for global discovery of conserved genetic modules. Science, 302:249–55, Oct 10 2003. [92] P. Tamayo, D. Slonim, J. Mesirov, Q. Zhu, S. Kitareewan, E. Dmitrovsky, E. S. Lander, and T. R. Golub. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci U S A, 96:2907–12, Mar 16 1999. [93] A. Tanay, R. Sharan, and R. Shamir. Discovering statistically significant biclusters in gene expression data. Bioinformatics, 18 Suppl 1:S136–44, Jul 2002. [94] R. L. Tatusov, M. Y. Galperin, D. A. Natale, and E. V. Koonin. The cog database: a tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res, 28:33–6, Jan 1 2000. [95] O. G. Troyanskaya, K. Dolinski, A. B. Owen, R. B. Altman, and D. Botstein. A bayesian framework for combining heterogeneous data sources for gene function

BIBLIOGRAPHY

187

prediction (in saccharomyces cerevisiae). Proc Natl Acad Sci U S A, 100:8348– 53, Jul 8 2003. [96] V. van Noort, B. Snel, and M. A. Huynen. Predicting gene function by conserved co-expression. Trends Genet, 19:238–42, May 2003. [97] A. J. Walhout and M. Vidal. Protein interaction maps for model organisms. Nat Rev Mol Cell Biol, 2:55–62, Jan 2001. [98] J. Wang and S. K. Kim. Global analysis of dauer gene expression in caenorhabditis elegans. Development, 130:1621–34, Apr 2003. [99] X. Wen, S. Fuhrman, G.S. Michaels, D.B. Carr, S. Smith, J. L. Barker, and R. Somogyi. Large-scale temporal gene expression mapping of central nervous system development. Proc Natl Acad Sci U S A, 95:334–339, 1998. [100] K. Willert and R. Nusse. Beta-catenin: a key mediator of wnt signaling. Curr Opin Genet Dev, 8:95–102, Feb 1998. [101] T. D. Wu, C. G. Nevill-Manning, and D. L. Brutlag. Minimal-risk scoring matrices for sequence analysis. J Comput Biol, 6:219–35, Summer 1999. [102] Y. H. Yang, S. Dudoit, P. Luu, D. M. Lin, V. Peng, J. Ngai, and T. P. Speed. Normalization for cdna microarray data: a robust composite method addressing

BIBLIOGRAPHY

188

single and multiple slide systematic variation. Nucleic Acids Res, 30:e15, Feb 15 2002. [103] J. Zalevsky, A. J. MacQueen, J. B. Duffy, K. J. Kemphues, and A. M. Villeneuve. Crossing over during caenorhabditis elegans meiosis requires a conserved muts-based pathway that is partially dispensable in budding yeast. Genetics, 153:1271–83, Nov 1999. [104] J. Zhu and M. Q. Zhang. Scpd: a promoter database of the yeast saccharomyces cerevisiae. Bioinformatics, 15:607–11, Jul-Aug 1999.