Mathematical Modelling in Cell Biology
From High-Throughput Data to Systems Analysis

Dissertation for the attainment of the doctoral degree of the Faculty of Mathematics and Physics of the Albert-Ludwigs-Universität Freiburg im Breisgau

Submitted by Kilian Bartholomé, 24 January 2008

Dean: Prof. Dr. J. Flum
Thesis supervisor: Prof. Dr. J. Timmer
Referee: Prof. Dr. J. Timmer
Co-referee: Prof. Dr. H. Römer
Date of oral examination: 09.04.2008

Publications

1. Bartholomé, K., Kreutz, C., and Timmer, J. (2007). Estimating the Number of Differentially Expressed Genes in a Gene Set. BMC Bioinformatics, submitted.
2. Bartholomé, K., and Timmer, J. (2007). Regulatorische Prinzipien und Systemanalyse der bakteriellen Chemotaxis. at-Automatisierungstechnik, submitted.
3. Bartholomé, K., Rius, M., Letschert, K., Keller, D., Timmer, J., and Keppler, D. (2007). Data-Based Mathematical Modeling of Vectorial Transport across Double-Transfected Polarized Cells. Drug Metabolism and Disposition, 35(9):1476–1481.
4. Bartholomé, K., Timmer, J., and Kollmann, M. (2006). Design Principles of Signal Transduction Pathways to Compensate Intracellular Perturbations. Proceedings of 2006 IEEE/CCA, 1730–1733.
5. Kollmann, M., Lovdok, L., Bartholomé, K., Timmer, J., and Sourjik, V. (2005). Design Principles of a Bacterial Signalling Network. Nature, 438(7067):504–507.

Posters and Talks

1. Parameter Estimation in Ordinary Differential Equations via Multiple Shooting, Summer School 2004, Hvar, Croatia.
2. Evolutionary Design Principles of Bacterial Chemotaxis, DIMACS 2005, Piscataway, USA.
3. Dynamic Modeling of the Transcriptional Activity of NFκB in Hepatocytes, ICSB 2005, Heidelberg, Germany.
4. Double-transfected MDCKII Cells Expressing Human OATP1B3 (OATP8) and ABCC2 (MRP2): Vectorial Transport and its Mathematical Modeling, ICSB 2005, Heidelberg, Germany.
5. Multi-Experiment Parameter Estimation and Application to Liver Transport Systems, SBMC 2006, Heidelberg, Germany.
6. Estimating the Number of Differentially Expressed Genes in a Gene Set, Hepatosys Status Seminar 2007, Heidelberg, Germany.

Contents

1 Preface

2 Statistical Analysis of Microarray Experiments
  2.1 Introduction
  2.2 Introduction to Cell Biology
  2.3 Biomarker Detection with Microarrays
    2.3.1 Statistical Hypothesis Testing
    2.3.2 Multiple Testing
  2.4 Gene Set Analyses
    2.4.1 Existing Methods
    2.4.2 Gene Set Regulation Index
  2.5 Conclusion

3 Multi-Experiment Parameter Optimisation
  3.1 Introduction
  3.2 Multi-Experiment Parameter Fitting
    3.2.1 Maximum Likelihood Estimation
    3.2.2 Penalised Likelihood
  3.3 Simulation Study
  3.4 Application to Liver Transport System
    3.4.1 Reduction of Parameter Space
    3.4.2 Parameter Estimation
  3.5 Conclusion

4 Systems Analysis of the Chemotaxis Pathway
  4.1 Introduction
  4.2 Bacterial Chemotaxis
    4.2.1 Biological Background
    4.2.2 Perfect Adaptation
  4.3 Design Principles of Bacterial Chemotaxis
    4.3.1 Gene Expression Noise
    4.3.2 Network Performance under Protein Variations
    4.3.3 Robustness against Correlated Protein Variations
    4.3.4 Robustness against Uncorrelated Protein Variations
  4.4 Conclusion

5 Summary

Appendix

A Gene Set Regulation Index
  A.1 Comparison to Existing Methods

B Chemotaxis Signaling Pathway
  B.1 Differential Equations

Bibliography

1 Preface

“Mathematics is biology’s next microscope, only better; biology is mathematics’ next physics, only better”

(Joel E. Cohen)

Since Watson and Crick laid the foundation of modern cell biology in 1953 by deciphering the structure of deoxyribonucleic acid (DNA) [Watson and Crick (1953)], knowledge about the functioning of the cell has been greatly enhanced. The molecular basis of cellular processes such as cell division, gene transcription and translation, signal transduction and transport processes has been identified and explained, at least to some degree. However, while in physics most major discoveries are based on the formulation of fundamental laws in the language of mathematics, the discovery and explanation of most processes in cellular biology did not require detailed mathematical descriptions. They were mainly based on phenomenological descriptions of observations such as "protein A shows significant interactions with protein B". The reasons for this are that biological systems are highly complex and that the understanding of the main processes of living cells was too limited to be translated into mathematical models. However, with increasing knowledge about cellular components and their interactions, as well as the development of more quantitative experimental techniques, mathematical modelling is becoming more and more applicable for obtaining a quantitative understanding of dynamic processes in cellular biology.

Murray (2002) formulated the possible benefits of mathematical modelling in cell biology as follows: "Why use mathematics to study something as intrinsically complicated and ill understood as development, angiogenesis, wound healing, interacting population dynamics, regulatory networks, marital interaction and so on? We suggest that mathematics, rather theoretical modelling, must be used if we ever hope to genuinely and realistically convert an understanding of the underlying mechanisms into a predictive science." Kitano (2004) argues in a similar way when he points out the necessity of a mathematical approach for the understanding of certain system properties: "The discovery of fundamental, systems-level principles that underlie complex biological systems is a prime scientific goal in systems biology. Robustness is a property that allows a system to maintain its functions despite external and internal perturbations. It is one of the fundamental and ubiquitously observed systems-level phenomena that cannot be understood by looking at the individual components."

In other words, mathematical modelling becomes inevitable in order to make qualitative as well as quantitative predictions and to analyse inherent system properties, such as the robustness of highly complex processes in cell biology. In this context, systems biology has emerged in recent years as the interdisciplinary union of data-based mathematical modelling and cell biology. However, while the formulation of fundamental laws for the description of all interactions and processes in a system is the major goal in physics, one cannot expect to find all-encompassing mathematical theories in cell biology. Nevertheless, it has been shown that there are fundamental principles in cell biology, such as the genetic code, which is identical for all biological organisms.
In this thesis, it is shown how the establishment of mathematical models for cellular processes can unravel underlying mechanisms and help to reveal certain fundamental principles in cell biology. The thesis deals with the analysis of cell-biological systems using mathematical methods and is structured in three parts. These parts correspond to three steps: from the identification of key processes in a complicated system, via the establishment of a mathematical model of a cellular process, to the systems analysis of a model in order to reveal fundamental principles in cell biology.

In the first part, a method is introduced that integrates biological prior knowledge into the statistical analysis of microarray experiments. The method determines the differential gene expression of functionally related sets of genes, such as genes belonging to the same pathway or ontology. By narrowing down the number of processes that are the driving force behind the evolution of certain biological phenotypes, the method sets the foundation for new experimental starting points in highly complex biological systems.

In the second part, a data-based mathematical model is established for a liver transport system by multi-experiment analysis. Transport processes in liver cells largely affect the detoxification of drugs and toxins in the organism and are therefore of great importance to the pharmaceutical industry for the development of new drugs. In order to deal with inherent variations in biological systems arising from different external as well as internal sources of noise, a penalised likelihood approach is introduced.

In the third part, the chemotaxis signalling pathway of the bacterium Escherichia coli is investigated. The essential features of its experimentally well-established network structure are derived, and its robustness with respect to variations in protein concentrations is analysed. It is shown why the structure of the pathway is more complex than expected.

2 Statistical Analysis of Microarray Experiments

"Even a disappointment, provided it is thorough and final, represents a step forward."

(Max Planck)

2.1. Introduction

In 1953, James Watson and Francis Crick were able to explain how information can be stored and propagated on a molecular level by unravelling the structure of deoxyribonucleic acid (DNA). This discovery set the basis for the decipherment of the genetic code, i.e. for understanding how the DNA encodes the proteins, which are the basic building blocks of any living cell. It had already been postulated earlier that three basic units of the DNA, the nucleotides, encode one basic unit of the proteins, the amino acids, when in 1961 Marshall Nirenberg and Heinrich Matthaei found the first letter of the genetic alphabet. They discovered that three successive uracil nucleotides encode the amino acid phenylalanine. Only five years later, in 1966, the entire genetic code with all 64 base triplets was deciphered. Based on these discoveries, the nucleotide sequence of any DNA strand from any biological cell can be translated into
the amino acid sequence of the corresponding protein. Consequently, the next goal was the decipherment of the genomes of entire biological organisms, starting with the decryption of the complete DNA sequence of the Epstein-Barr virus, comprising 170 thousand base pairs (kb), in 1984. In the following years, various species such as the yeast Saccharomyces cerevisiae with 12.1 million base pairs (Mb), Escherichia coli (E. coli) with 5 Mb and Caenorhabditis elegans (C. elegans) with about 97 Mb were sequenced, until finally, in 2001, the entire genome of Homo sapiens with 3 billion base pairs was simultaneously published by the Human Genome Project (HGP) and Craig Venter's company Celera. In many cases, the results of these studies set the foundation for the identification of genes causing a particular disease, possibly enabling the development of new treatments [Meltzer (2001)]. However, for the functional behaviour of a cell, the regulation of the genes enciphered in the cell's genetic code is just as important as the genetic code itself. All cells in a single organism usually have the exact same genetic code; nevertheless, they can fulfil extremely different functions and serve very different purposes, such as nerve cells compared to liver cells or blood cells. The reason for this different functional behaviour of cells with an identical genetic code is the different regulation of the genes in each of these cells. For example, in a nerve cell, very different genes are active than in a liver cell. Therefore, a central step towards a better understanding of the functionality of a cell is the investigation of its gene regulation in addition to the knowledge of its genetic code. For this purpose, microarray analysis has been established within the last ten years as a high-throughput method designed to simultaneously determine the activity of many thousands of genes in cells by measuring the concentration of ribonucleic acids (RNA).
By comparing the gene expression patterns of two different biological entities, microarray technology is capable of detecting the key processes in an extremely complicated system, thereby setting the foundation for new experiments for further analyses of these processes. In this chapter, the basics of the statistical analysis of microarray experiments are described, and a new method for the analysis of the regulation of gene sets is introduced. To this end, in Section 2.2 a short introduction into the biological background of gene regulation is given, followed by an overview of microarray analyses in Section 2.3. In this context, statistical hypothesis testing is presented and the issue of multiple testing is addressed. Finally, in Section 2.4, the most prominent approaches for the analysis of the regulation
of gene sets are presented, followed by the introduction of a new approach, which has the essential advantage that it not only measures the significance of the regulation of a gene set, but can also be employed to evaluate its relevance. The main results of this chapter are summarised in Bartholomé et al. (2007a).

2.2. Introduction to Cell Biology

The biological information of a living cell is stored in the deoxyribonucleic acid (DNA), which is formed of four types of monomers: adenine (A), cytosine (C), thymine (T) and guanine (G). These monomers are strung together in linear chains that pair with each other, forming a double helix. The DNA encodes the amino acid sequences of all proteins in a cell. Proteins are the building blocks of a living cell, for example enzymes for metabolism, receptors for the detection of extracellular signals, or the cytoskeleton for the structural shape of the cell, see [Alberts et al. (2002)]. To build a protein, the DNA is transcribed into messenger ribonucleic acid (mRNA), which is very similar to DNA but single-stranded. It is also formed of four monomers: adenine (A), cytosine (C), uracil (U) and guanine (G). At the ribosomes, the mRNA is then translated into proteins via the genetic triplet code, see Figure 2.1. Three nucleotides, called a triplet or codon, encode one amino acid, the building block of proteins. A word composed of three letters from a four-letter alphabet allows for 4^3 = 64 different words. Since there exist 20 different amino acids, a word consisting of three letters from a four-letter alphabet is the smallest possibility to ensure that each amino acid corresponds to at least one codon.

Gene regulation refers to the control of the amount and timing of the appearance of the functional product of a gene, which is essentially the amount of mRNA and its associated proteins. This process of regulation is highly controlled on several levels, for example the DNA-to-mRNA transcription, mRNA processing, or post-translational modifications of the protein. Each cell in an organism contains exactly the same DNA and thus the same genetic information as any other cell in this organism. Nevertheless, different cells can be very different in their phenotype and biological function. These differences are the consequence of a distinct gene regulation. Malfunctions in this gene regulation are the cause of many severe


Figure 2.1: Illustration of the conversion process from genetic information encoded in the DNA to the protein. At the DNA, the RNA polymerase transcribes the genetic information into messenger RNA. At the ribosomes, this messenger RNA is then translated into the protein. Therefore, the determination of the concentration of the messenger RNA gives an estimate for the activity of the respective genes.

diseases. Thus, the detection of genes, or sets of genes, that are differentially regulated between two groups, for example between patients and healthy controls, can be used for diagnosis and as a tool to identify the key regulatory processes in an extremely complex system.
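The combinatorial argument for the triplet code from Section 2.2 can be checked in a few lines of Python; the two-entry codon table below is merely an illustrative excerpt, not the complete genetic code:

```python
from itertools import product

# All possible codons: words of length 3 over the four-letter RNA alphabet.
alphabet = "ACGU"
codons = ["".join(c) for c in product(alphabet, repeat=3)]
assert len(codons) == 4 ** 3 == 64

# Words of length 2 would not suffice to cover the 20 amino acids:
assert 4 ** 2 < 20 <= 4 ** 3

# Hypothetical excerpt of the codon table, for illustration only.
# UUU -> phenylalanine is the codon found by Nirenberg and Matthaei.
codon_table = {"UUU": "Phe", "AUG": "Met"}
print(codon_table["UUU"])  # Phe
```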

2.3. Biomarker Detection with Microarrays

DNA microarrays are used to detect genes, or functionally related groups of genes, that are differentially regulated in different biological entities. They are designed to detect the amount of mRNA for several thousands of genes simultaneously. For a microarray experiment, the mRNA from the biological probe that is to be investigated is isolated. This mRNA is then reverse transcribed either into complementary DNA (cDNA) or complementary RNA (cRNA), depending on the applied experimental technique. It is then fluorescently labelled and hybridised with short complementary nucleotide sequences (oligonucleotides) that are spotted on the array, each uniquely coding for a different gene, see Figure 2.2. Finally, the intensity of the fluorescence for each gene is determined with a specialised scanner. These intensities are then utilised to determine the mRNA concentration for each gene and thereby its expression level, i.e. its activity. Since these intensities are strongly affected by non-specific binding and noise, the data have to be preprocessed to obtain the expression levels for each gene. This preprocessing includes background adjustment, normalisation, condensing, and quality assessment. Background adjustment is essential because part of the measured probe intensities is due to non-specific hybridisation and noise in the optical detection system. Normalisation is necessary to be able to compare data from different microarrays. For certain microarray technologies, condensation is needed because one gene can be represented by multiple probes. Finally, the quality assessment gives an estimate of the quality of single data points as well as of the whole chip. A number of preprocessing algorithms have been proposed, e.g. the Robust Multiarray Average (RMA) [Irizarry et al. (2003a,b)], the Microarray Suite 5.0 (MAS5) [Affymetrix (2002)], the Positional Dependent Nearest Neighbor model (PDNN) [Zhang et al. (2003)], and GeneChip RMA (GC-RMA) [Wu et al. (2004)]. Here, we do not go into detail about the data preprocessing algorithms; rather, we want to stress the question of how to identify significantly regulated genes, as well as sets of genes, from the expression values obtained from microarray experiments.
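To illustrate the normalisation step, the following sketch implements a minimal quantile normalisation, a strategy used e.g. within RMA; the intensity matrix is made up for demonstration and the function name is our own:

```python
import numpy as np

def quantile_normalise(X):
    """Quantile-normalise a (genes x arrays) intensity matrix.

    Each array's values are replaced by the mean of the sorted values
    across arrays, so that all arrays share the same distribution.
    """
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)  # rank of each value per array
    mean_sorted = np.sort(X, axis=0).mean(axis=1)      # common reference distribution
    return mean_sorted[ranks]

# Two hypothetical arrays with different overall intensity scales:
X = np.array([[5.0, 4.0],
              [2.0, 1.0],
              [3.0, 4.5],
              [4.0, 2.0]])
Xn = quantile_normalise(X)
# After normalisation, both columns contain exactly the same set of values.
assert np.allclose(sorted(Xn[:, 0]), sorted(Xn[:, 1]))
```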

2.3.1. Statistical Hypothesis Testing

The standard microarray experiment is designed to compare the gene expression patterns of two different biological entities. These can, for example, be a group that has been treated with a certain drug versus an untreated control group. For such an experimental setup, one question of interest is which genes are differentially regulated between these groups. In order to answer this question, a number of statistical tests are available [Cox and Hinkley (1974)]. The purpose of a statistical test is to decide, on the basis of quantitative experimental data, whether a certain hypothesis has to be rejected or not. In our case, the hypothesis addresses the question of whether a gene is differentially

Figure 2.2: Illustration of a microarray experiment. The messenger RNA is extracted from the cells and reverse transcriptase is applied to obtain complementary DNA (cDNA). In oligonucleotide arrays, biotin-labelled cRNA is obtained from this cDNA by in vitro transcription. These cRNA molecules are then exposed to the microarray in order to hybridise to the oligonucleotides (short sequences of RNA or DNA) on the chip. After hybridisation, the chip is stained with a fluorescent molecule that binds to biotin. The amount of bound cRNA is then quantified by scanning the array in a microarray scanner and visualising the fluorescent molecules by excitation with a laser beam.


regulated between two groups, that is, whether the mean expression value of a gene in one group is significantly different from the mean expression value in the other group. Classical statistical hypothesis testing basically consists of two steps. First, a null hypothesis H0 has to be stated, which is usually a hypothesis of no difference between population parameters such as mean values or variances. Second, an appropriate test statistic t has to be chosen to check the validity of H0. This test statistic is a function of the observed data and therefore a random variable. Thus, it can be described by a probability distribution. From the knowledge of this distribution, the cumulative probability p of measuring a value greater than a certain realisation of t under the null hypothesis can be derived:

$$p = \int_t^{\infty} P_{H_0}(t')\, dt'. \qquad (2.1)$$

Here, $P_{H_0}(t)$ is the probability distribution of the test statistic t under the null hypothesis H0. This definition of the p-value is only valid for a one-sided hypothesis test, which for simplicity we will focus on in the following. Since for most statistical tests the probability distribution of the test statistic t under the null hypothesis H0 has infinite support, any value of t has a non-zero p-value. This means that for no value of t can the null hypothesis H0 be rejected with certainty. Therefore, a critical threshold value $t_c$ has to be selected that separates the region where the null hypothesis is rejected from the region where it is not rejected. This threshold value fixes the probability of wrongly rejecting a true null hypothesis and has to be chosen such that this probability, also called the significance level α or false positive rate, is tolerable. The significance level α is defined as

$$\alpha = \int_{t_c}^{\infty} P_{H_0}(t)\, dt. \qquad (2.2)$$

Values often chosen for α are either 0.05 or 0.01, corresponding to an expected probability of 5% or 1%, respectively, for a false positive result. An erroneous rejection of a true null hypothesis is also referred to as an error of the first kind or a Type I error. The probability β of not rejecting a false null hypothesis, also termed the false negative rate, is given by

$$\beta = \int_{-\infty}^{t_c} P_{H_1}(t)\, dt, \qquad (2.3)$$


Figure 2.3: a) Illustration of a one-sided statistical hypothesis test: The black curve shows the probability distribution of t for a valid null hypothesis H0, and the red curve shows the probability distribution of t when H0 is violated and an alternative hypothesis H1 holds. The critical value $t_c$ separates the region for which the null hypothesis is rejected from the region where it is not rejected. Hereby, the significance level α (grey area) and the false negative rate β (red area) are fixed. b) For a given violation of the null hypothesis H0, the probability β of committing a Type II error is much smaller for a test with high power than for a test with low power. A test with high power gives a more distinct separation of the distributions of H0 and H1 than a test with low power.

where $P_{H_1}(t)$ is the probability distribution of the test statistic t under the alternative hypothesis H1. Falsely not rejecting a null hypothesis is also called an error of the second kind or a Type II error. In most cases, committing an error of the first kind is worse than committing an error of the second kind, because the first case corresponds to an effect being detected although no effect actually exists. An error of the second kind, on the other hand, corresponds to not detecting an effect although it is there, which might be due to insufficient quality of the experimental data. The probability of rejecting an invalid null hypothesis, i.e. of making the right decision, is called the power of the test and is equal to 1 − β. The higher the power of a test, the better the separation between the distribution of the null hypothesis and the distribution of the alternative hypothesis, see Figure 2.3. The result of a statistical test is called significant if the p-value of the realisation of the test statistic t is smaller than the significance level α. In that


case, the null hypothesis has to be rejected. In fact, the only two possible conclusions of a statistical test are either "reject H0" or "do not reject H0". It is important to note that the conclusion "do not reject H0" does not necessarily mean that the null hypothesis is true; it only suggests that there is not sufficient evidence against H0. The philosophical idea behind this way of hypothesis testing goes back to Karl Popper, who states that hypotheses can only be rejected, i.e. falsified, but never accepted [Popper (1968)]. Rejection of the null hypothesis supports the acceptance of the research hypothesis as the only alternative.

Student's t-Test

The two-sample Student's t-test [Gosset (1908)] checks for equality of the mean values of two samples under the assumption that these samples are realisations of Gaussian distributed random variables with equal variances. If these assumptions are met, the Student's t-test is the statistical test with the highest power for detecting a difference in the mean values of two entities.

Welch's t-Test

Welch's t-test is an adaptation of Student's t-test intended for use with two samples having unequal variances. This method is more powerful than the Student's t-test when the assumption of homogeneity of variance is not met.

Mann-Whitney U Test

Violations of the assumption of normally distributed data can either reduce the power of the t-test or, in the worst case, even lead to a false positive rate larger than the chosen significance level α [Delaney and Vargha (2000)]. The non-parametric counterpart of the Student's t-test, which relinquishes any assumptions about the shape of the underlying distribution of the data, is the Mann-Whitney U test, also called the Mann-Whitney-Wilcoxon or Wilcoxon rank-sum test, which was introduced by Wilcoxon (1945) and extended by Mann and Whitney (1947). It tests the null hypothesis that the medians of two samples are the same, whereas it assumes the underlying distributions to be similar


            | H0 rejected | H0 not rejected | total
H0 is false |     TP      |       FN        |   R
H0 is true  |     FP      |       TN        |   N0
total       |     S       |      N − S      |   N

Table 2.1: TN is the number of true negatives, TP the number of true positives, FP the number of false positives (Type I errors) and FN the number of false negatives (Type II errors). N = N0 + R is the total number of hypotheses, N0 the number of true null hypotheses and R the number of false null hypotheses. S is the number of rejected tests (significant results). Following these definitions, the false positive rate is FP/N0, i.e. the number of falsely rejected hypotheses over all true null hypotheses, and the sensitivity is given by TP/R, i.e. the number of correctly rejected hypotheses over all false null hypotheses. In Figure 2.4, all these terms are associated with the corresponding areas under the probability distribution.

[Quinn and Keough (2002)]. However, the frequent application of the Mann-Whitney U test in biological research to problems where the assumption of homogeneity of variance is not met is inappropriate, since the demand for similar underlying distributions implies the requirement of equal variances.
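Assuming scipy is available, the three tests can be run side by side on the same data; the two samples below are hypothetical log-expression values of a single gene in a treated and a control group:

```python
from scipy.stats import ttest_ind, mannwhitneyu

# Hypothetical log-expression values of one gene in two groups:
treated = [8.1, 7.9, 8.4, 8.6, 8.2, 8.5]
control = [7.2, 7.5, 7.1, 7.6, 7.3, 7.4]

# Student's t-test: assumes Gaussian samples with equal variances.
t_student, p_student = ttest_ind(treated, control, equal_var=True)

# Welch's t-test: drops the equal-variance assumption.
t_welch, p_welch = ttest_ind(treated, control, equal_var=False)

# Mann-Whitney U test: non-parametric, rank-based.
u, p_mw = mannwhitneyu(treated, control, alternative="two-sided")

print(p_student, p_welch, p_mw)
```

On clearly separated samples such as these, all three tests reject H0 at α = 0.05; their p-values diverge mainly for small, noisy, or non-Gaussian samples.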

2.3.2. Multiple Testing

When analysing microarray experiments for differential gene expression, several thousand tests are performed, since each gene on the chip is tested independently. Testing the same hypothesis many times leads to the problem of multiple testing [Shaffer (1995)]. The large number of repeated hypothesis tests increases the probability of committing a Type I error. Given that a statistical test is performed N times on independent data for which the null hypothesis is valid, the expectation value of the number of false positives $\langle F_p \rangle$, see Table 2.1, is equal to αN. Thus, for microarrays, where many tens of thousands of tests are carried out, this can lead to the detection of several hundreds of false positives. Accordingly, the significance level α has to be adjusted to account for the large number of independent statistical tests and to reduce the risk of detecting false positives.
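The relation ⟨F_p⟩ = αN can be illustrated by simulation; the setup below (10,000 genes, two groups of five samples each, no true regulation anywhere) is hypothetical:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
N, alpha = 10_000, 0.05

# N genes, two groups of 5 samples each, with no true difference anywhere:
group_a = rng.normal(size=(N, 5))
group_b = rng.normal(size=(N, 5))
_, p = ttest_ind(group_a, group_b, axis=1)

false_positives = int((p < alpha).sum())
print(false_positives, "false positives, expected about", alpha * N)
```

Despite every null hypothesis being true, roughly αN = 500 genes come out "significant", which is exactly the problem the corrections below address.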


Familywise Error Rate

The familywise error rate (FWER) is defined as the probability of at least one falsely rejected null hypothesis, i.e.

$$\text{FWER} = \Pr(F_p \ge 1) = 1 - (1 - \alpha)^N,$$

where $F_p$ is the number of false positives in a sample of N independent tests. Bonferroni proposed a method for controlling the familywise error rate [Shaffer (1995)] in which the significance level is adjusted by dividing it by the total number of tests, leading to $\tilde\alpha = \alpha/N$. Only genes with a p-value smaller than the adjusted significance level $\tilde\alpha$ are then considered to be significantly regulated, leading to FWER < α. However, this is a very stringent approach, since it requires a strong reduction of the significance level α, which corresponds to a great enlargement of the false negative rate β and thus to a great loss of power of the test, see Figure 2.3.

A slightly less stringent modification of this procedure is the Bonferroni-Holm method, in which the hypotheses are ordered according to their p-values and then sequentially tested for significance. Hereby, the significance level is adjusted individually for each hypothesis according to its rank i, resulting in $\tilde\alpha = \alpha/(N - i + 1)$. $H_i$ is rejected if $p_i \le \tilde\alpha$. If $H_i$ is accepted, all hypotheses with a larger p-value are accepted without further testing. This correction method leads to FWER ≤ α, see [Shaffer (1995)].

False Discovery Rate

For microarray experiments, a small number of false positives is in general acceptable. Thus, adjusting the FWER is often too stringent and leads to a great reduction of the power and a large number of false negatives. A more appropriate error rate for microarray experiments is the so-called false discovery rate (FDR). It is defined as the fraction of false positives $F_p$ among the total number of genes S that have been declared significant, i.e.

$$\text{FDR} = F_p / S.$$
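A minimal sketch of the two FWER corrections, applied to a hypothetical list of p-values:

```python
def bonferroni(pvals, alpha=0.05):
    """Reject H_i iff p_i <= alpha / N."""
    N = len(pvals)
    return [p <= alpha / N for p in pvals]

def holm(pvals, alpha=0.05):
    """Step-down Holm procedure: test sorted p-values against alpha/(N - i + 1)."""
    N = len(pvals)
    order = sorted(range(N), key=lambda i: pvals[i])
    reject = [False] * N
    for rank, i in enumerate(order, start=1):   # rank corresponds to i in the text
        if pvals[i] <= alpha / (N - rank + 1):
            reject[i] = True
        else:
            break  # all hypotheses with larger p-values are accepted
    return reject

pvals = [0.001, 0.015, 0.03, 0.2]   # hypothetical p-values
print(bonferroni(pvals))  # [True, False, False, False]
print(holm(pvals))        # [True, True, False, False]
```

Bonferroni compares every p-value against α/N = 0.0125 and rejects only the smallest; Holm's sequentially relaxed thresholds (0.0125, 0.0167, 0.025, 0.05) additionally reject the second hypothesis, illustrating why Holm is uniformly at least as powerful.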


Chapter 2: Statistical Analysis of Microarray Experiments


Figure 2.4: Probability density of p-values for a multiply tested null hypothesis H0. This probability density is a combination of a uniform distribution for true null hypotheses and an unknown distribution that accumulates around zero for false null hypotheses. The value of the combined probability density at p = 1 corresponds to the height of the uniform distribution. Thus, by estimating this value and fixing the significance level α, the number of false positives F_P, false negatives F_N, true positives T_P, and true negatives T_N can be determined.

Benjamini and Hochberg (1995) proposed a method to control the FDR. Hereby, the hypotheses H_1, H_2, ..., H_N are ordered according to their p-values, i.e. p_1 ≤ p_2 ≤ ... ≤ p_N. They showed that by defining the q-value of a p-value p_i by

q(p_i) := (N/i) p_i    (2.4)

and rejecting all hypotheses with q(p_i) ≤ q for some fixed value q, the FDR is controlled such that FDR ≤ q.

Storey and Tibshirani (2003) and Pounds and Morris (2003) pursue a different approach to control the FDR. The central idea in their approach is that the p-values of true null hypotheses follow a uniform distribution, whereas the p-values of false null hypotheses accumulate around zero, see Figure 2.4. In fact, the more power a hypothesis test has, the closer the p-values of genes for which the null hypothesis is violated accumulate around zero. Storey and Tibshirani (2003) define the quantity

π₀(λ) = #{p_i > λ; i = 1, ..., N} / (N(1 − λ)),


and show that for λ → 1 this quantity π₀(λ) approximates N₀/N, where N₀ is the number of true null hypotheses and N is the total number of hypothesis tests. Thus, by fitting a cubic spline to π₀(λ) and evaluating it at λ = 1, they obtain an estimate N̂₀ for the number of true null hypotheses as well as an estimate for the number of false positives F̂_P = N̂₀ α for a fixed significance level α. This leads to an estimate for the FDR, given by

FDR̂(α) = N̂₀ α / S(α).

Hereby, S(α) = T_p + F_p is the number of hypotheses declared significant for a certain value of α, see Table 2.1 and Figure 2.4. Defining the q-value of a feature as

q̂(p_i) = min_{α ≥ p_i} FDR̂(α),    (2.5)

and thresholding these q-values at a level α_q, a set of significant features is obtained with an expected proportion of α_q false positives. The q-value can be interpreted as the expected proportion of false positives among all features as or more extreme than the observed one.

In a similar approach, Pounds and Morris (2003) approximate the empirical probability distribution by a parametric model which they name the beta-uniform mixture (BUM) distribution, given by

f(x | a, λ) = λ + (1 − λ) a x^(a−1).

From this estimate, the areas of the regions T_p, F_p, T_N, and F_N can be deduced and the FDR calculated by

FDR̂(t) = F̂_p / (T̂_p + F̂_p).

The q-value can then be determined using Equation (2.5). A comparative study of the performance of these methods as well as other approaches is given in Broberg (2005) and in Appendix A.1.
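The q-value of Equation (2.4) and Storey's π₀(λ) can be sketched as follows; this is an illustrative sketch with made-up p-values, the function names are hypothetical, and — unlike the cubic-spline fit over λ used in the original method — `pi0` evaluates a single fixed λ for brevity:

```python
def bh_qvalues(pvals):
    """q(p_i) = min over alpha >= p_i of (N / rank) * p_i: Equation (2.4)
    with the step-up monotonicity correction, computed from the largest
    p-value down to the smallest."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    q = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        q[i] = running_min
    return q

def pi0(pvals, lam=0.5):
    """Storey's pi0(lambda): fraction of p-values above lambda, rescaled
    by 1/(1 - lambda); for lambda -> 1 this approximates N0/N."""
    n = len(pvals)
    return sum(p > lam for p in pvals) / (n * (1.0 - lam))

pvals = [0.001, 0.008, 0.039, 0.041, 0.26, 0.5, 0.7, 0.9]
print([round(v, 3) for v in bh_qvalues(pvals)])
print(pi0(pvals))  # 0.5
```

The backward pass in `bh_qvalues` realises the minimum over α ≥ p_i in Equation (2.5): each q-value is clipped so that it never exceeds the q-value of a larger p-value.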

2.4. Gene Set Analyses

The results of the statistical tests introduced above are lists of genes that have significantly different expression values between two groups. The interpretation of these lists to understand the underlying biological phenomena is


tedious and often depends on the experimentalists' experience, sometimes leading to subjective interpretations. Also, the resulting gene lists can exhibit large variations when studies from different laboratories are compared, showing the limited reproducibility of the expression patterns of single genes, see Shi et al. (2006) and Järvinen et al. (2004). A better approach is to investigate the regulation of biologically related sets of genes. These sets can be collections of genes belonging to the same pathway or ontology, or having the same chromosomal location. Besides the integration of previously acquired biological knowledge, a further advantage of the analysis of sets of genes is the better reproducibility of the results. In the following section, several methods are introduced to determine whether the expression of the genes in a predefined gene set differs between two groups. In order to annotate the genes and define the gene sets, prior biological knowledge is integrated from online databases such as the Gene Ontology [Ashburner et al. (2000)], which categorises genes according to their molecular function, cellular component, and biological process, and the KEGG database [Ogata et al. (1999)], which is a collection of metabolic and signalling pathways.

2.4.1. Existing Methods

Combinatorial Test

A generic approach for the analysis of the regulation of a gene set starts from a list of differentially expressed genes obtained from one of the statistical tests introduced above and tests by combinatorial considerations whether a gene set is overrepresented in this list. Given a list of R differentially expressed genes among a total of N genes, and assuming that gene set affiliation has no effect on gene expression, the probability of finding r or more differentially expressed genes in a gene set of size G follows from the hypergeometric distribution:

p = Σ_{k=r}^{min(G,R)} C(G, k) C(N − G, R − k) / C(N, R),

where C(n, k) denotes the binomial coefficient. For a fixed significance level α, a gene set is considered to be significantly



Figure 2.5: Fisher's Exact Test: A set of genes is considered to be significant when the probability p of an accumulation of regulated genes in a predefined gene set occurring by chance is smaller than a significance level α.

regulated for p < α. Hereby, the following null hypothesis is tested:

H0: The fraction of differentially expressed genes in the gene set of size G is equal to the fraction of differentially expressed genes among all N genes.
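The tail probability p defined above can be evaluated directly as a hypergeometric sum; this is an illustrative sketch with made-up numbers, and `gene_set_pvalue` is a hypothetical name:

```python
from math import comb

def gene_set_pvalue(N, R, G, r):
    """P(r or more of the R regulated genes fall into a gene set of size G)
    under the hypergeometric null of no association (one-sided test).
    math.comb returns 0 when k exceeds n, so impossible terms vanish."""
    tail = sum(comb(G, k) * comb(N - G, R - k)
               for k in range(r, min(G, R) + 1))
    return tail / comb(N, R)

# Made-up example: 100 genes, 10 regulated, a gene set of 20 containing 5 of them.
print(gene_set_pvalue(100, 10, 20, 5))  # about 0.025
```

Summing from r = 0 recovers the whole distribution, so `gene_set_pvalue(N, R, G, 0)` is exactly 1, which is a convenient sanity check.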

This test is referred to as Fisher's Exact Test, see Figure 2.5. This approach, and slightly modified versions of it, have been described by many authors, including Al-Shahrour et al. (2004), Beissbarth and Speed (2004), Breitling et al. (2004), Hosack et al. (2003), Lee et al. (2005), Pehkonen et al. (2005), Yi et al. (2006), Zhang et al. (2004), Zeeberg et al. (2003), and others. The drawback of determining the significance of a gene set with Fisher's Exact Test is that for each gene a decision has to be made whether it is differentially expressed or not. This decision depends on the choice of the significance level α, which in turn strongly affects the outcome of the test.

Gene Set Enrichment Analysis

A method for determining the significance of a gene set S that does not rely on an arbitrary threshold for declaring the significance of each single gene is the Gene Set Enrichment Analysis (GSEA) introduced by Mootha et al. (2003). Hereby, the genes g_i in a list of genes


L of size N, which usually contains all genes on the microarray, are ordered according to some measure t_i of differential expression, for example the fold change or the signal-to-noise ratio. As test statistic, the so-called enrichment score (ES) is calculated by walking down the list L, increasing a running sum when a gene in gene set S ⊆ L is encountered and decreasing it otherwise, see Figure 2.6. The maximum positive deviation of this running sum from zero is defined as the enrichment score:

ES := max_{1 ≤ j ≤ N} Σ_{i=1}^{j} X_i,   with   X_i = −√(G/(N − G)) if g_i ∉ S and X_i = √((N − G)/G) if g_i ∈ S.    (2.6)

Hereby, G is the number of genes in gene set S. The enrichment score is closely related to the Kolmogorov-Smirnov statistic, which is the maximal difference between cumulative probability distributions [Hollander and Wolfe (1999)]. Thus, in analogy to the Kolmogorov-Smirnov test, the null hypothesis can be formulated as follows:

H0: The ranks of the genes from the predefined gene set S are randomly distributed among all other genes in the gene list L.

The probability of obtaining a certain enrichment score by chance, and thereby the significance of the enrichment score, is estimated by permuting the phenotype labels and recomputing the enrichment score of the gene set on the permuted data. Hereby, a null distribution for the enrichment score is generated, from which a p-value for the observed enrichment score is calculated. By permuting the phenotype labels, the gene-gene correlations are preserved, resulting in a biologically more reasonable assessment. In a modified version of the GSEA (Subramanian et al. 2005), the increment of the running enrichment score is proportional to the expression measure t_i weighted by an exponent p. The authors claim that this modification avoids high scores for sets clustered near the middle of the ranked list and increases the sensitivity for gene sets in which only a subset of genes is significantly regulated. However, by weighting the expression measure, the modified approach no longer tests for a uniform distribution of the rank ordering, i.e. it does not test the null hypothesis stated above. In fact, it is unclear which null hypothesis is actually tested by the modified version, and it therefore remains dubious what significance of a gene set in the sense of the modified GSEA really means.
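The unweighted running-sum statistic of Equation (2.6) — the original score, not the weighted variant — can be sketched as follows; the ranked list and gene names are hypothetical:

```python
from math import sqrt

def enrichment_score(ranked_genes, gene_set):
    """Running-sum statistic of Equation (2.6): step up by sqrt((N-G)/G)
    for members of S, down by sqrt(G/(N-G)) otherwise; the ES is the
    maximum positive deviation of the running sum from zero."""
    N, G = len(ranked_genes), len(gene_set)
    up, down = sqrt((N - G) / G), sqrt(G / (N - G))
    running, best = 0.0, 0.0
    for g in ranked_genes:
        running += up if g in gene_set else -down
        best = max(best, running)
    return best

# Hypothetical ranked list: the set members sit at the very top.
ranked = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8"]
print(enrichment_score(ranked, {"g1", "g2"}))  # 2*sqrt(3), about 3.46
print(enrichment_score(ranked, {"g7", "g8"}))  # 0.0 (members at the bottom)
```

The asymmetric step sizes make the increments sum to zero over the whole list, so a set scattered uniformly through L produces a running sum that drifts around zero, in line with the stated null hypothesis.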


Figure 2.6: Gene Set Enrichment Analysis (GSEA): The genes g_i are ordered according to some measure of gene expression such as the fold change or the signal-to-noise ratio. The enrichment score (ES) for the gene set S is determined by walking down the list of all genes L, increasing a running sum for every gene that is contained in gene set S (red bars) and decreasing it otherwise. The null distribution of the enrichment score, and thereby the significance of gene set S, is determined by permutation across the phenotypes.

Another disadvantage of this method is that it is competitive: the regulation of genes in a certain gene set S ⊂ L is compared to the regulation of all other genes in the reference set L \ S. This reference set typically consists of all genes available on the microarray. Thus, the significance of a gene set cannot be given independently, but only in comparison to the reference set, and an alteration of this reference set alters the outcome of the analysis. This issue has also been addressed in Allison et al. (2006), Goeman and Bühlmann (2007), and Damian and Gorfine (2004). Dinu et al. (2007) noted that a gene set identified as statistically significant by the GSEA can lack genes that are consistently associated with the phenotype. Nevertheless, the GSEA has been widely applied for the analysis of microarrays, see for example Krivtsov et al. (2006) and Schuringa et al. (2004), and is the most popular method for gene set analysis. Several other methods have been proposed for the analysis of gene sets, e.g. the Gene Set Analysis (GSA) by Efron and Tibshirani (2007), the Significance Analysis of Microarrays for Gene Sets (SAM-GS) by Dinu et al. (2007), and others [Goeman et al. (2004), Mansmann and Meister (2005)]. However,


all of these methods bear one or more of the following disadvantages:

• the method requires an arbitrary threshold value to determine whether a single gene is differentially expressed or not; the result of the method thus strongly depends on the choice of this value.

• the method is competitive and requires a reference set.

• the method uses permutations across gene labels to determine significance. As pointed out by Allison et al. (2006) and Goeman and Bühlmann (2007), this can lead to inaccurate p-values, as it assumes transcripts to be expressed independently, which is often not the case.

• the tested null hypothesis is not clearly defined, and a significant p-value is not interpretable [Allison et al. (2006)].

• the tested null hypothesis is a weak null hypothesis, such as "no genes in the gene set are differentially expressed" or "the genes in the gene set are at most as often differentially expressed as the genes in the reference set" [Goeman and Bühlmann (2007)]. These hypotheses are weak in the sense that their rejection does not necessarily mean that a meaningful number of genes in the gene set is differentially expressed. Thus, significance of a gene set can mean that just one gene in the set is differentially expressed, implying that significance of a gene set is not necessarily suitable for evaluating the relevance of a gene set.

In the following, we present a new approach that determines the regulation of a given gene set from microarray data by estimating the number of differentially expressed genes in the set. For this estimation, no threshold value has to be chosen for the detection of differentially expressed genes. Based on these estimates, a regulation index is defined as a measure of the regulation of a gene set. This measure can be employed to compare and rank different gene sets according to their regulation.
In contrast to the p-value of hypothesis tests of many other gene set analysis approaches, this index is suitable for the evaluation of the relevance of the regulation of a gene set. We demonstrate the applicability of our method for simulated data as well as for a microarray study comparing acute lymphoid leukemia (ALL) and acute myeloid leukemia (AML), Armstrong et al. (2002).


2.4.2. Gene Set Regulation Index

A standard approach in most microarray studies is to statistically test whether the expression value of a gene is significantly different between two groups, using statistical tests like the Student's t-test or its non-parametric counterpart, the Mann-Whitney U-test. Available methods differ in the underlying statistical assumptions; therefore, the choice of a test depends on the assumptions about the measurements. In the following, it is assumed that the test for detecting differentially expressed genes was chosen appropriately, meaning that all assumptions of the hypothesis test are met and the p-values follow a uniform distribution when the null hypothesis H0 is valid.

Given a gene set of size N with a total of R differentially expressed genes and N − R genes that are not differentially expressed, the probability density of the resulting p-values ρ(p) can be decomposed into a uniform distribution ρ_unif(p) and an unknown, alternative distribution ρ_alt(p) that is concentrated near zero:

ρ(p) = (R/N) ρ_alt(p) + (1 − R/N) ρ_unif(p).

As shown in Figure 2.7 a, the height of the plateau resulting from the uniform distribution is proportional to the percentage of unregulated genes (1 − R/N). However, an estimation of the height of this plateau based on the empirical probability density, as utilised for example in Pounds and Morris (2003), is tedious especially for small gene sets, since the empirical density depends on the binning and is not continuous. For the empirical cumulative distribution function, on the other hand, no binning parameter has to be chosen, wherefore it is better suited for parameter estimation. The drawback of parameter estimation based on an empirical cumulative distribution is that here the residuals are correlated. This leads to overestimated, thus conservative, confidence intervals, as we will see later.
The cumulative distribution function of p is given by

cdf(p) = ∫₀^p ρ(p′) dp′ = (R/N) cdf_alt(p) + (1 − R/N) cdf_unif(p),

where cdf_alt(p) and cdf_unif(p) = p denote the cumulative distributions of the alternative and the uniform probability density, respectively.


Figure 2.7: a) Probability distribution of a sample of p-values resulting, for example, from Student's t-tests performed on a set of N genes with R differentially expressed genes. b) Cumulative distribution function of the probability distribution shown in a). The intercept of the linear fit with the y-axis corresponds to the percentage of differentially expressed genes in the set.

Since the mass of the alternative probability distribution ρ_alt(p) is accumulated at small values of p, the alternative cumulative distribution function cdf_alt(p) is a concave function and approximately equal to one for large p-values. Accordingly, for large p-values, where cdf_alt(p) ≈ 1, the cumulative distribution function cdf(p) can be approximated by

f(p) = R/N + (1 − R/N) p.    (2.7)

Thus, the percentage of regulated genes corresponds to the y-axis intercept of f(p), see Figure 2.7 b. In order to determine the number of differentially expressed genes in a gene set, Equation (2.7) is iteratively fitted to the realisation of the cumulative distribution cdf(p) obtained from testing for differential gene expression. In the first iteration step, all p-values are used for the linear fit, and a first estimate R_est of the number of differentially expressed genes is obtained. In the second iteration step, the R_est smallest p-values are omitted from the linear fit, and a new estimate R_est^new is obtained. This procedure is repeated until R_est converges, i.e. until R_est^new = R_est, see Figure 2.8. This iteration process ensures


Figure 2.8: a) Flowchart of the iteration algorithm: R_est denotes the estimated number of differentially expressed genes, p_min the minimum p-value entering the fitting algorithm, and p_{R_est} the R_est smallest p-value. Starting with an initial value R_est = 0, f(p), Equation (2.7), is fitted over the entire interval [0, 1] to the empirical cumulative distribution, and a first estimate of the number of differentially expressed genes R_est is obtained. The inclusion of the p-values of the differentially expressed genes in the fitting procedure leads to an underestimation of the number of differentially expressed genes R. Thus, in the next iteration step, the R_est smallest p-values are excluded from the fitting procedure, and only the remaining N − R_est largest p-values are considered. The fitting interval is thereby reduced to [p_{R_est}, 1], and a new estimate of R_est is calculated. This procedure is repeated until R_est converges. b) Illustration of the algorithm. After each iteration step, the linear function converges closer to the linear part of the cumulative distribution function.


Figure 2.9: Let R_est^i/N be the offset of the linear fit in the i-th iteration of the algorithm. For the next iteration i + 1, the R_est^i smallest p-values are omitted. These p-values (displayed in red) are by construction all below the dashed line. Therefore, the residuals of these p-values with respect to the linear fit of the i-th iteration are all negative. Omitting these values in iteration i + 1 can only increase the offset of the linear fit.

that Equation (2.7) is only fitted to the uniform part of the cumulative distribution. Note that it is not necessary to set a threshold value that distinguishes between genes that are differentially expressed and genes that are not.

This iteration procedure always converges. Let R_est^i be the estimated number of regulated genes in iteration step i. As illustrated in Figure 2.9, the p-values that are omitted in iteration step i + 1 all contributed negative residuals to the linear fit in iteration step i. Thus, the linear fit in iteration step i + 1 will yield an estimate R_est^(i+1) that cannot be smaller than R_est^i. The series of estimates R_est^i is therefore monotonically increasing and bounded above by N, and hence has to converge.

A number of different methods have been proposed to determine the false discovery rate (FDR), which is closely related to the estimation of the number of differentially expressed genes. However, most of these methods were introduced to determine the FDR for microarray analyses dealing with several thousand genes and were not intended to cope with gene sets of limited size. A comparison of the performance of these methods for small gene sets with the performance of our new approach is given in Appendix A.1.
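The iterative fit of Equation (2.7) to the empirical cumulative distribution can be sketched as follows. This is an illustrative stdlib-only sketch on synthetic p-values, and `estimate_regulated` is a hypothetical name; since the slope of f(p) = a + (1 − a)p is tied to the intercept a = R/N, the least-squares step estimates only a:

```python
import random

def estimate_regulated(pvals):
    """Iteratively fit f(p) = a + (1 - a) p, a = R/N, to the empirical cdf,
    excluding the R_est smallest p-values in each round (cf. Figure 2.8)."""
    p = sorted(pvals)
    n = len(p)
    ecdf = [(i + 1) / n for i in range(n)]
    r_est = 0
    while True:
        xs, ys = p[r_est:], ecdf[r_est:]
        # one-parameter least squares: residual is y - x - a * (1 - x)
        num = sum((y - x) * (1.0 - x) for x, y in zip(xs, ys))
        den = sum((1.0 - x) ** 2 for x in xs)
        a = max(0.0, num / den)
        r_new = min(int(a * n), n - 1)   # round down -> conservative
        if r_new <= r_est:               # estimates are non-decreasing
            return r_est / n
        r_est = r_new

random.seed(0)
# 70 p-values from true null hypotheses, 30 accumulated near zero.
pvals = [random.random() for _ in range(70)] + \
        [random.random() * 0.01 for _ in range(30)]
print(round(estimate_regulated(pvals), 2))  # close to the true fraction 0.3
```

Rounding `a * n` down in each round mirrors the conservative rounding discussed for the simulation study, and the `r_new <= r_est` guard terminates the loop once the estimate stops growing.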


The confidence interval of the estimated number of regulated genes R is calculated by bootstrapping the group samples. Hereby, the microarrays are re-sampled with replacement while maintaining the sample group assignment. Then, the number of differentially expressed genes R is estimated for each realisation. As pointed out by Allison et al. (2006) and Goeman and Bühlmann (2007), bootstrapping the arrays and not the genes ensures the conservation of correlations between genes. Correlated genes have similar expression patterns and thus similar p-values. Hence, the p-values of correlated genes accumulate in the probability distribution, leading to a steeper slope of the cumulative distribution function at this p-value. Each bootstrap sample yields a new realisation of the expression pattern, and thereby a new realisation of p-values. Accordingly, for each bootstrap realisation, the point of accumulation in the cumulative distribution function is shifted, and concomitantly a different estimate of R is obtained. Thus, correlated genes lead to an increase of the estimated confidence interval.

In order to compare the regulation of different gene sets, we define the "Gene Set Regulation Index" (GSRI) η as the 5% quantile of the distribution of the estimated percentage of differentially expressed genes R_est/N determined by bootstrapping. The GSRI η is thus a measure of the regulation of a gene set, meaning that with a probability of 95%, a fraction of η or more of the genes in this gene set are differentially expressed. In contrast to many other gene set analysis approaches, the GSRI is not a statistical test that determines the significance for the rejection of some null hypothesis. The GSRI is rather a meaningful measure to quantify the amount of regulation of a gene set and thereby determine its relevance.
Nevertheless, the GSRI can also be interpreted as a significance measure for the null hypothesis that no gene in the gene set is differentially expressed. By definition, a value of the GSRI larger than zero means that with a probability of 95% at least one gene is differentially expressed.
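The array-level bootstrap behind the GSRI can be sketched as follows. The names are hypothetical, and `toy_estimator` is a deliberately simple stand-in for the iterative-fit estimate R_est/N described above; the point of the sketch is the resampling scheme, which draws whole arrays rather than genes, and the 5% quantile:

```python
import random

def bootstrap_gsri(control, treatment, estimator, n_boot=200, seed=1):
    """Resample whole arrays with replacement within each group, re-run
    the estimator, and return the 5% quantile of the estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        c = [rng.choice(control) for _ in control]      # arrays resampled,
        t = [rng.choice(treatment) for _ in treatment]  # gene correlations kept
        estimates.append(estimator(c, t))
    estimates.sort()
    return estimates[int(0.05 * n_boot)]

def toy_estimator(control, treatment):
    """Stand-in for R_est/N: fraction of genes whose group means differ
    by more than 0.5 (illustration only)."""
    n_genes = len(control[0])
    mean = lambda v: sum(v) / len(v)
    diffs = [mean([a[g] for a in treatment]) - mean([a[g] for a in control])
             for g in range(n_genes)]
    return sum(d > 0.5 for d in diffs) / n_genes

# Toy data: 6 arrays per group, 10 genes, the first 5 genes shifted by 1.
control = [[0.0] * 10 for _ in range(6)]
treatment = [[1.0] * 5 + [0.0] * 5 for _ in range(6)]
print(bootstrap_gsri(control, treatment, toy_estimator, n_boot=100))  # 0.5
```

Because each array is kept intact during resampling, the joint expression pattern — and hence the gene-gene correlation structure — of every bootstrap replicate matches that of the original data.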

Simulation Study

To demonstrate the applicability of our method, a simulation study was performed for a gene set of size N = 100 with 40 microarray samples. The first 20 samples constitute the control group, the second 20 samples the treatment group. A series of 100 simulations was performed, increasing the number of differentially expressed genes from 0 to 100. The expression values for genes


Figure 2.10: Simulation study. a) The estimated percentage of differentially expressed genes with confidence intervals, plotted versus the true percentage of differentially expressed genes. For a better representation of the results, only every fourth simulated data point is shown. In red: the GSRI η in %. b) Histogram of the difference between the true percentage of differentially expressed genes R/N and the GSRI. The 5% quantile of the distribution, 0.013, shows that the conservative estimates of the percentage of differentially expressed genes R_est/N and of the standard deviation std(R_est/N) lead to an underestimation, and thus a conservative estimate, of the GSRI.
that are not differentially expressed were generated from a standard normal distribution N(0, 1) for both sample groups, control and treatment. The expression values for the differentially expressed genes, on the other hand, were drawn from N(0, 1) for the control group and from N(1, 1) for the treatment group, i.e. the mean value for differentially expressed genes in the treatment group was one standard deviation larger than in the control group. For these data, the cumulative distribution function cdf(p) was calculated from the p-values obtained from Student's t-tests. Fitting Equation (2.7) to cdf(p) using the iteration process illustrated in Figure 2.8 yielded an estimate of the number of differentially expressed genes R. The confidence intervals were estimated by bootstrapping a thousand times from the sample groups.

The results of the simulation study are depicted in Figure 2.10. The left panel shows the GSRI η, depicted in red, as well as the estimated percentage of differentially expressed genes in the set plotted versus the true percentage of differentially expressed genes, indicating good agreement between estimate


and true value. The right panel shows the histogram of the difference between the true percentage of differentially expressed genes R/N and the GSRI η. The 5% quantile of this distribution being larger than zero shows that the estimation of the number of differentially expressed genes R_est and its standard deviation std(R_est) is conservative. Basically, there are two reasons for this. First, the estimate of the number of differentially expressed genes R_est is in general a rational number. In each iteration step of the fitting procedure, the R_est smallest p-values are disregarded in the optimisation process, so R_est has to be rounded to a natural number. Rounding R_est down in each iteration step leads to an underestimation and therefore a conservative estimate of the mean value of R. Second, the data points in the cumulative distribution are correlated. As stated above, the bootstrapping algorithm overestimates the confidence interval of R_est, which is equivalent to a conservative estimate of the variance of R_est. Together, this leads to an underestimation of the 5% quantile of the distribution of R_est and thus of the GSRI η, which is the reason for the 5% quantile being larger than zero in Figure 2.10 b.
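The simulated data of this study can be sketched as follows. To keep the sketch dependency-free it uses a two-sided z-test (legitimate here, since the variance is known to be one in the simulation) instead of the Student's t-test used in the study; all names are illustrative:

```python
import random
from math import erf, sqrt

random.seed(42)
N, R, M = 100, 40, 20   # genes, regulated genes, samples per group

# Control: all genes ~ N(0,1). Treatment: the first R genes ~ N(1,1).
control = [[random.gauss(0.0, 1.0) for _ in range(N)] for _ in range(M)]
treatment = [[random.gauss(1.0 if g < R else 0.0, 1.0) for g in range(N)]
             for _ in range(M)]

def pvalue(g):
    """Two-sided z-test for a difference of group means for gene g."""
    mc = sum(a[g] for a in control) / M
    mt = sum(a[g] for a in treatment) / M
    z = (mt - mc) / sqrt(2.0 / M)                 # Var(mean diff) = 1/M + 1/M
    phi = 0.5 * (1.0 + erf(abs(z) / sqrt(2.0)))   # standard normal cdf
    return 2.0 * (1.0 - phi)

pvals = [pvalue(g) for g in range(N)]
# These p-values form the empirical cdf to which Equation (2.7) is fitted.
print(sum(p < 0.05 for p in pvals))
```

With an effect size of one standard deviation and 20 samples per group, most of the R regulated genes yield small p-values, while the unregulated genes contribute the uniform plateau that the linear fit exploits.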

Application to Microarray Datasets

To demonstrate the applicability of our approach, we applied our method to a study comparing the gene expression profiles of 24 acute lymphoid leukemia (ALL) patients and 24 acute myeloid leukemia (AML) patients, published by Armstrong et al. (2002). For the analysis we used gene sets with a minimum size of fifteen genes from the curated gene sets (c2) of the MSigDB database, see Subramanian et al. (2005). This is a collection of gene sets containing, amongst others, pathways, ontologies, and lists of genes adopted from other publications. We compared our analysis to the results of the Gene Set Enrichment Analysis (GSEA) introduced by Mootha et al. (2003). We used GSEA for comparison since it is the most frequently used gene set analysis approach, see e.g. Houstis et al. (2006) and Krivtsov et al. (2006). For the leukemia dataset, the GSEA is significant with a p-value smaller than 0.05 for 124 out of 1219 sets. When these p-values are corrected for multiple testing using the false discovery rate (FDR), see Section 2.3.2, only two gene sets are still significant, although about 60% of all genes in the experiment are


Gene Set                 N    R_est/N   GSRI     GSEA p-value   GSEA FDR   Reference
Schuringa STAT5A down    17   91.6 %    84.7 %   0.109          0.493      Schuringa et al. (2004)
mRNA Processing          40   94.8 %    82.6 %   0.108          0.325      MSigDBRNA (2005)
Golub ALL vs AML up      18   87.5 %    81.7 %