Published by Oxford University Press on behalf of the International Epidemiological Association © The Author 2010; all rights reserved. Advance Access publication 2 June 2010

International Journal of Epidemiology 2010;39:1597–1604 doi:10.1093/ije/dyq093

Detecting differentially expressed genes in heterogeneous diseases using half Student’s t-test

Chun-Lun Hsu and Wen-Chung Lee*

Research Centre for Genes, Environment and Human Health, and Graduate Institute of Epidemiology, College of Public Health, National Taiwan University, Taipei, Taiwan ROC

*Corresponding author. Room 536, No. 17, Xuzhou Road, Taipei 100, Taiwan ROC. E-mail: [email protected]

Accepted 20 April 2010

Methods

The authors propose a new test, the ‘half Student’s t-test’, specifically for detecting differentially expressed genes in heterogeneous diseases. Monte–Carlo simulation shows that the test maintains the nominal α level quite well for both normal and non-normal distributions. The power of the half Student’s t is higher than that of the conventional ‘pooled’ Student’s t when there is heterogeneity in the disease under study. The power gain from using the half Student’s t can reach 10% when the standard deviation of the case group is 50% larger than that of the control group.

Results

Application to a colon cancer data set reveals that when the false discovery rate (FDR) is controlled at 0.05, the half Student’s t can detect 344 differentially expressed genes, whereas the pooled Student’s t can detect only 65 genes. Alternatively, if only 50 genes are to be selected, the FDR for the pooled Student’s t has to be set at 0.0320 (a false positive rate of about 3%), but for the half Student’s t, it can be as low as 0.0001 (a false positive rate of about one per ten thousand).

Conclusions

The half Student’s t-test is to be recommended for the detection of differentially expressed genes in heterogeneous diseases.

Keywords

Student’s t-test, gene expression, heterogeneous disease, epidemiological methods

Introduction

Microarray technology provides information about the expression of hundreds to thousands of genes in a single experiment.1 To screen for genes that are related to the disease under study, one can compare gene expression levels between the diseased subjects (or cancer tissue samples) and the non-diseased subjects (or normal tissue samples) and pick out those genes, i.e. the differentially expressed genes, whose mean expression levels are statistically different between the two groups of ‘cases’ and ‘controls’. The Student’s t-test2,3 is commonly used for this task. However, when studying ‘heterogeneous diseases’, the conventional Student’s t-test may fall short of detecting differentially expressed genes. A heterogeneous disease is a disease with more than one


Downloaded from http://ije.oxfordjournals.org/ by guest on January 10, 2016

Background

Microarray technology provides information about the expression of hundreds to thousands of genes in a single experiment. To search for disease-related genes, researchers test for those genes that are differentially expressed between the case subjects and the control subjects.


INTERNATIONAL JOURNAL OF EPIDEMIOLOGY

Methods

In the case group, let the sample size be denoted as $n_1$, the sample mean of the gene expressions as $\bar{X}_1$ and the sample standard deviation of the gene expressions as $s_1$. In the control group, the corresponding notations are $n_0$, $\bar{X}_0$ and $s_0$, respectively. The conventional Student’s t combines the standard deviations of the case group and the control group to calculate a ‘pooled’ standard deviation:

$$s_p = \sqrt{\frac{s_1^2 (n_1 - 1) + s_0^2 (n_0 - 1)}{n_1 + n_0 - 2}}.$$

The test statistic $t_p$ of this ‘pooled’ Student’s t-test is

$$t_p = \frac{\bar{X}_1 - \bar{X}_0}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_0}}}.$$

Assuming normality, $t_p$ follows a Student’s t distribution with $n_1 + n_0 - 2$ degrees of freedom (df) under the null hypothesis. The test statistic $t_h$ of the proposed half Student’s t-test is

$$t_h = \frac{\bar{X}_1 - \bar{X}_0}{s_0 \sqrt{\frac{1}{n_1} + \frac{1}{n_0}}}.$$

It can be seen that the numerator of $t_h$ is exactly the same as that of $t_p$. However, there is a minute difference in the denominators of the two statistics: $t_h$ uses the sample standard deviation $s_0$ of the control group only and dispenses with the sample standard deviation $s_1$ of the case group entirely (hence we coined the term ‘half’ Student’s t). Assuming normality, $t_h$ follows a Student’s t distribution with df $= n_0 - 1$ under the null hypothesis. Note that here we use the sample standard deviation of the control group to represent both the standard deviation of the control population ($\sigma_0$) and the standard deviation of the case population ($\sigma_1$). This is because under the null hypothesis, $\sigma_1 = \sigma_0$.

The half Student’s t should not be mistaken for a ‘one-sample’ test simply because it does not use the sample standard deviation of the case group. The one-sample t-test is

$$t_{\text{one sample}} = \frac{\bar{X}_1 - \mu_0}{s_1 / \sqrt{n_1}},$$

where $\mu_0$ is the population mean of the controls, and one immediately notices the difference between $t_h$ and $t_{\text{one sample}}$. Both the pooled Student’s t and the proposed half Student’s t are in fact so-called ‘two-sample’ tests, comparing two sample means ($\bar{X}_1$ and $\bar{X}_0$). Therefore, the sample sizes of both the case group and the control group are needed in the denominators of $t_p$ and $t_h$ (the $\sqrt{1/n_1 + 1/n_0}$ term) to properly convert the ‘standard deviations’ ($s_p$ and $s_0$, respectively) to the ‘standard errors’ of $\bar{X}_1 - \bar{X}_0$ under the null hypothesis.
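The two statistics defined above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' own code: the function names `pooled_t` and `half_t` are hypothetical, NumPy and SciPy are assumed to be available, and the P-values are computed as two-sided, which is an assumption because the article does not state the sidedness of the test.

```python
import numpy as np
from scipy import stats


def _two_sided_p(t_value, df):
    # Two-sided P-value from a Student's t distribution (assumed sidedness).
    return 2.0 * stats.t.sf(abs(t_value), df)


def pooled_t(cases, controls):
    """Conventional pooled Student's t: returns (t_p, df, p)."""
    x1 = np.asarray(cases, dtype=float)
    x0 = np.asarray(controls, dtype=float)
    n1, n0 = x1.size, x0.size
    s1, s0 = x1.std(ddof=1), x0.std(ddof=1)
    # Pooled standard deviation s_p.
    sp = np.sqrt((s1**2 * (n1 - 1) + s0**2 * (n0 - 1)) / (n1 + n0 - 2))
    t_p = (x1.mean() - x0.mean()) / (sp * np.sqrt(1 / n1 + 1 / n0))
    df = n1 + n0 - 2
    return float(t_p), df, float(_two_sided_p(t_p, df))


def half_t(cases, controls):
    """Half Student's t: only the control-group s_0 enters the denominator."""
    x1 = np.asarray(cases, dtype=float)
    x0 = np.asarray(controls, dtype=float)
    n1, n0 = x1.size, x0.size
    s0 = x0.std(ddof=1)
    t_h = (x1.mean() - x0.mean()) / (s0 * np.sqrt(1 / n1 + 1 / n0))
    df = n0 - 1  # degrees of freedom come from the control group only
    return float(t_h), df, float(_two_sided_p(t_h, df))


if __name__ == "__main__":
    cases = [15, 25, 5, 35, 20]      # made-up, more variable 'case' values
    controls = [10, 12, 14, 16, 18]  # made-up 'control' values
    print("pooled:", pooled_t(cases, controls))
    print("half:  ", half_t(cases, controls))
```

In the made-up example the case values are far more spread out than the control values, so the pooled denominator is inflated by $s_1$ while the half Student's t denominator is not; the price paid is the smaller df of $n_0 - 1$.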

Monte–Carlo simulation

Two different total sample sizes are simulated: 40 (n0 = n1 = 20) and 120 (n0 = n1 = 60). The difference (d) in the means between the case and the control groups is examined for d = 0, 15 and 25, in turn. The ratio (r) of the standard deviation of the case group to that of the control group is examined for r = 1, 1.5, 2 and 2.5, in turn. The standard deviation in the control group is set at 30. We assume the gene expression levels to be normally distributed (perhaps after suitable transformation). The normality assumption is often reasonable, as evidenced by empirical gene-expression data.8 But for the sake of completeness, we also consider three non-normality scenarios: (i) non-normal but symmetric distribution; (ii) skew-to-right distribution; and (iii) skew-to-left distribution. The uniform distribution is taken as the non-normal but symmetric distribution. The Gamma distribution is taken as the skew-to-right distribution. For the skew-to-left distribution, we multiply the Gamma variate by −1 and then add twice the expected value of the original Gamma distribution, so that the mean is unchanged. A total of 100 000 simulations are performed for each scenario. In each scenario, we perform the pooled Student’s t-test and the half Student’s t-test. It is important to note that the null hypothesis corresponds to the situation of r = 1 and d = 0, whereas the alternative hypothesis can be any other situation.
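The normal-distribution arm of this simulation design can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' program: the function name `rejection_rates` is hypothetical, the control mean is set to 0 (the article does not give it, and the location does not matter here), the tests are taken to be two-sided, and the default number of replicates is reduced from the article's 100 000 to keep the run short.

```python
import numpy as np
from scipy import stats


def rejection_rates(n1, n0, d, r, alpha=0.05, sd0=30.0, n_sim=20_000, seed=0):
    """Estimate rejection rates of the pooled and half Student's t-tests.

    d is the case-minus-control difference in means and r is the ratio of
    the case standard deviation to the control standard deviation (sd0).
    With d = 0 and r = 1 the returned rates estimate the type-I error;
    otherwise they estimate power.
    """
    rng = np.random.default_rng(seed)
    # One row per simulated data set; control mean of 0 is an assumption.
    controls = rng.normal(0.0, sd0, size=(n_sim, n0))
    cases = rng.normal(d, r * sd0, size=(n_sim, n1))

    diff = cases.mean(axis=1) - controls.mean(axis=1)
    s1 = cases.std(axis=1, ddof=1)
    s0 = controls.std(axis=1, ddof=1)
    scale = np.sqrt(1 / n1 + 1 / n0)

    sp = np.sqrt((s1**2 * (n1 - 1) + s0**2 * (n0 - 1)) / (n1 + n0 - 2))
    t_p = diff / (sp * scale)  # pooled Student's t
    t_h = diff / (s0 * scale)  # half Student's t

    # Two-sided P-values (assumed sidedness).
    p_pooled = 2 * stats.t.sf(np.abs(t_p), n1 + n0 - 2)
    p_half = 2 * stats.t.sf(np.abs(t_h), n0 - 1)
    return float((p_pooled < alpha).mean()), float((p_half < alpha).mean())


if __name__ == "__main__":
    for d in (0, 15, 25):
        for r in (1.0, 1.5, 2.0, 2.5):
            pooled, half = rejection_rates(20, 20, d, r)
            print(f"d={d:>2} r={r:.1f} pooled={pooled:.3f} half={half:.3f}")
```

The non-normal scenarios would replace the two `rng.normal` draws with uniform or Gamma variates rescaled to the same means and standard deviations; the Gamma parameters are not given in the article, so they are left out of this sketch.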


entity; each may have a different aetiology, clinical picture and prognosis. Examples of heterogeneous diseases are rheumatoid arthritis,4 large B-cell lymphoma5 and acute lymphoblastic leukaemia.6 For a heterogeneous disease, a gene may be over-expressed in some of the case subjects, but normally expressed or even under-expressed in others. That particular gene truly is being ‘differentially expressed’ because of the disease. However, despite the differential expression, the mean expression level in the case group as a whole may not differ much from that in the control group, and thus the gene could easily evade detection by the conventional Student’s t-test. (Note that here we assume the presence of more than one disease entity, but we do not know exactly how many entities there are or how to define and characterize each of them. Otherwise, we could perform a stratified analysis according to the known disease entities.)

To detect differentially expressed genes in heterogeneous diseases, we propose the ‘half Student’s t-test’. The null hypothesis in the half Student’s t-test is that the distributions of the gene expression level are the same in the case group and the control group; the alternative hypothesis is that the distributions of the two groups are different. The ‘difference’ here can be in the means, in the variances or in both. In this article, Monte–Carlo simulation will be performed to examine the statistical properties of the half Student’s t-test. A colon cancer gene-expression data set7 will be analysed for demonstration.

HALF STUDENT’S T-TEST


Table 1 Type I error rates of the pooled Student’s t-test and the half Student’s t-test

                                          n0 = n1 = 20            n0 = n1 = 60
Significance level                      Pooled      Half        Pooled      Half

Normal distribution
  0.05                                  0.0507      0.0501      0.0511      0.0511
  0.01                                  0.0100      0.0100      0.0095      0.0094
  0.005                                 0.0048      0.0050      0.0049      0.0050
  0.001                                 0.0009      0.0008      0.0010      0.0010

Non-normal but symmetric distribution
  0.05                                  0.0507      0.0472      0.0505      0.0493
  0.01                                  0.0107      0.0094      0.0106      0.0099
  0.005                                 0.0058      0.0048      0.0051      0.0046
  0.001                                 0.0013      0.0010      0.0010      0.0009

Skew-to-right distribution
  0.05                                  0.0501      0.0528      0.0505      0.0516
  0.01                                  0.0100      0.0125      0.0104      0.0119
  0.005                                 0.0047      0.0067      0.0050      0.0057
  0.001                                 0.0009      0.0015      0.0009      0.0014

Skew-to-left distribution
  0.05                                  0.0489      0.0513      0.0492      0.0502
  0.01                                  0.0096      0.0120      0.0096      0.0108
  0.005                                 0.0050      0.0071      0.0051      0.0058
  0.001                                 0.0010      0.0017      0.0009      0.0013

Pooled: pooled Student’s t-test; Half: half Student’s t-test.

Table 1 shows the type-I error rates when the significance level (α-level) is 0.05, 0.01, 0.005 and 0.001, respectively. When the sample size is larger (n1 = n0 = 60), both the half Student’s t and the pooled Student’s t attain quite accurate type-I error rates in all four distributions, regardless of the significance level. When the sample size is as small as n1 = n0 = 20, both tests still maintain very accurate type-I error rates under the normal distribution and the non-normal but symmetric distribution. However, under skewed distributions, the type-I error rate of the half Student’s t-test shows a slight inflation at the more stringent significance levels of 0.005 and 0.001.

Figure 1 shows the power performances of the half Student’s t (solid line) and the pooled Student’s t (dashed line) under the normal distribution. We note that when r > 1, the statistical power of the half Student’s t is higher than that of the pooled Student’s t in all cases, and the difference in power can be as large as 35%. For a more moderate r = 1.5 (standard deviation of the case group being 50% larger than that of the control group), the power gain from using the half Student’s t can reach 10%. When r = 1, the statistical power of the half Student’s t is comparable with that of the pooled Student’s t. We also note that when d = 0, the pooled Student’s t has no ability at all to detect the difference in variances (or standard deviations) between the case group and the control group. The half Student’s t, by contrast, has some power to do so (though not very high), and this power increases as r increases. When d > 0, the power of the pooled Student’s t decreases dramatically as r increases. By contrast, the power of the half Student’s t decreases only a little as r increases, and in one situation (d = 15, n0 = n1 = 20), the power even increases as r increases.

Figure 2 shows the power performances under the non-normal but symmetric distribution, Figure 3 under the skew-to-right distribution and Figure 4 under the skew-to-left distribution. The results for these three non-normal distributions are essentially the same as the results for the normal distribution in Figure 1.

Figure 1 Statistical power in normal distribution. Solid line: half Student’s t; dotted line: pooled Student’s t. [Panels: d = 0, 15 and 25, each for n1 = n0 = 20 and n1 = n0 = 60; horizontal axis: r (1 to 2.5); vertical axis: power (%).]

We also examined other situations and other tests. The results are summarized below (details not shown).

(i) We examined the situations of unequal sample sizes. We found that the half Student’s t-test can still maintain quite accurate type-I error rates. It also has better power performances as compared with the pooled Student’s t.

(ii) We examined the situation in which the standard deviation of the control group is, on the contrary, larger than that of the case group (the case subjects as a group are more homogeneous than the control subjects). We found, as expected, that the half Student’s t now has lower power than the pooled Student’s t.

(iii) We examined the power performances of Welch’s t-test. Welch’s t-test assumes unequal variances between the case group and the control group. However, the Welch’s t-test statistic still factors in both the case variance and the control variance in the denominator. We found that its power performances are almost the same as those of the pooled Student’s t-test.

(iv) We examined the power performances of simultaneously testing means (using the pooled Student’s t) and variances (using the F-test), followed by a Bonferroni correction (because two tests are performed). We found that the half Student’s t outperforms this combined test when there is a difference in means between the case group and the control group and the standard deviation of the case group is equal to or no more than 30% larger than that of the control group.
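Finding (iii) has a simple algebraic explanation when the two groups have equal sample sizes, as in the main simulation settings. The following worked check is not part of the original article; it uses the standard form of Welch's statistic, written here as $t_w$ (a symbol introduced only for this illustration), and sets $n_1 = n_0 = n$:

$$
\begin{aligned}
t_w &= \frac{\bar{X}_1 - \bar{X}_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_0^2}{n_0}}}
     = \frac{\bar{X}_1 - \bar{X}_0}{\sqrt{\dfrac{s_1^2 + s_0^2}{n}}}, \\
s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_0}}
    &= \sqrt{\frac{s_1^2 (n - 1) + s_0^2 (n - 1)}{2n - 2}} \, \sqrt{\frac{2}{n}}
     = \sqrt{\frac{s_1^2 + s_0^2}{2}} \, \sqrt{\frac{2}{n}}
     = \sqrt{\frac{s_1^2 + s_0^2}{n}}.
\end{aligned}
$$

With equal sample sizes the two denominators coincide, so $t_w = t_p$ exactly and the two tests differ only in their reference degrees of freedom. Both statistics are therefore equally inflated by a large $s_1$, whereas $t_h$ is not.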

Figure 2 Statistical power in non-normal but symmetric distribution. Solid line: half Student’s t; dotted line: pooled Student’s t. [Panels: d = 0, 15 and 25, each for n1 = n0 = 20 and n1 = n0 = 60; horizontal axis: r (1 to 2.5); vertical axis: power (%).]

An example of colon cancer

We analyse the colon cancer data of Alon et al.7 for demonstration. These data comprise gene expression measurements of 2000 genes for 62 samples (40 colon cancer tissue samples and 22 normal tissue samples). Table 2 shows the numbers (percentages) of differentially expressed genes detected by the pooled Student’s t and the half Student’s t, respectively. Four significance levels are considered: 0.05, 0.01, 0.005 and 0.001. At every significance level, the half Student’s t detects more differentially expressed genes than the pooled Student’s t. As these data have a total of 2000 genes, the multiple-testing problem needs to be accounted for. We thus control the false discovery rate (FDR)9,10 at 0.05 and 0.005, respectively. When the FDR is controlled at 0.05, the half Student’s t can detect 344 differentially expressed genes (among them, 344 − 344 × 0.05 = 326.8 genes are expected to be true positives), whereas the pooled Student’s t can detect only 65 genes (61.8 true positives). When the FDR is controlled at 0.005, the half Student’s t can detect 185 genes (184.08 true positives), whereas the pooled Student’s t can detect only 8 (7.96 true positives).
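The arithmetic behind the expected numbers of true positives, together with one common way of controlling the FDR, can be sketched as follows. This is an illustration under assumptions: the article cites both Benjamini and Hochberg9 and Storey and Tibshirani10 without stating which procedure produced the reported counts, so the Benjamini–Hochberg step-up rule shown here is an assumed choice, and the function names are hypothetical.

```python
import numpy as np


def benjamini_hochberg(p_values, q=0.05):
    """Boolean mask of hypotheses rejected by the BH step-up rule at FDR q."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Compare the i-th smallest P-value with q * i / m.
    below = p[order] <= q * np.arange(1, m + 1) / m
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest rank meeting its threshold
        rejected[order[: k + 1]] = True  # reject it and every smaller P-value
    return rejected


def expected_true_positives(n_detected, fdr):
    """Expected true positives among n_detected genes: n - n * FDR."""
    return n_detected * (1 - fdr)


if __name__ == "__main__":
    # Counts reported in the text for the colon cancer example.
    for n_detected, fdr in [(344, 0.05), (65, 0.05), (185, 0.005), (8, 0.005)]:
        print(n_detected, fdr, round(expected_true_positives(n_detected, fdr), 3))
```

In practice the P-values passed to `benjamini_hochberg` would be the 2000 gene-wise P-values from either the pooled or the half Student's t-test.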

Figure 3 Statistical power in skew-to-right distribution. Solid line: half Student’s t; dotted line: pooled Student’s t. [Panels: d = 0, 15 and 25, each for n1 = n0 = 20 and n1 = n0 = 60; horizontal axis: r (1 to 2.5); vertical axis: power (%).]

Discussion

In a heterogeneous disease, due to the presence of a certain number of distinct (but unknown) disease entities, the gene expression level often shows greater variability from one case subject to another (larger standard deviation in the case group), as compared with the variability seen from one control subject to another (smaller standard deviation in the control group). This is precisely the condition under which the half Student’s t outperforms the pooled Student’s t. We caution, however, that if the half Student’s t-test is used for a disease without heterogeneity, there will not be any power gain; instead, there may even be a power loss if the case subjects as a group are more homogeneous than the control subjects.

We also emphasize that although theoretically the half Student’s t can test for the difference in the means and the difference in the variances simultaneously, it is mainly a test of equality of two means in heterogeneous diseases. It has very low power for testing the difference in two variances. If the case group and the control group differ mainly in their variances, one can forgo the t-tests altogether (the pooled t and the half t alike) and use the F test instead to achieve better power.

Recently, a number of researchers11–16 have been developing tests that can detect genes with a special type of heterogeneous response: only a small number of case subjects display over-expression, whereas the remaining cases have expression levels similar to those of the controls. In statistical terms, those over-expressions in a small number of subjects are essentially ‘outliers’, and the tests developed by those researchers are well positioned to be sensitive to them. In developing the half Student’s t-test, however, we did not treat the case subjects as if they came from one of two types, the outlying and the ordinary. Rather, we posit that the disease under study is heterogeneous, having arbitrarily many entities, each with a different gene-expression mean (and hence the variance in the case group as a whole is higher than that in the control group). Note that a detailed characterization of those disease entities is by no means necessary. To detect differentially expressed genes under such heterogeneity, all one has to do is to place the sample standard deviation of only one group (and the correct one, the control group) in the denominator and apply our half Student’s t-test.

The half Student’s t-test is to be referred to the Student’s t distribution with df = n0 − 1. When the distribution is non-normal and the sample size is very small (say, n0 1), the

Acknowledgement

The authors wish to thank Wan-Yu Lin for technical support.

Conflict of interest: None declared.


KEY MESSAGES

- The half Student’s t-test maintains the nominal α level quite well for both normal and non-normal distributions.
- The power of the half Student’s t is higher than that of the conventional pooled Student’s t when there is heterogeneity in the disease under study.
- The power gain by using the half Student’s t can reach 10% when the standard deviation of the case group is 50% larger than that of the control group.
- The half Student’s t-test is to be recommended for the detection of differentially expressed genes in heterogeneous diseases.

References

1 Allison DB, Cui X, Page GP et al. Microarray data analysis: from disarray to consolidation and consensus. Nat Rev Genet 2006;7:55–65.
2 Dudoit S, Yang YH, Callow MJ, Speed TP. Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Stat Sinica 2002;12:111–39.
3 Pan W. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 2002;18:546–54.
4 van der Pouw Kraan TC, van Gaalen FA, Kasperkovitz PV et al. Rheumatoid arthritis is a heterogeneous disease: evidence for differences in the activation of the STAT-1 pathway between rheumatoid tissues. Arthritis Rheum 2003;48:2132–45.
5 Alizadeh AA, Eisen MB, Davis RE et al. Distinct types of diffuse large B-cell lymphoma identified by gene-expression profiling. Nature 2000;403:503–11.
6 Yeoh EJ, Ross ME, Shurtleff SA et al. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene-expression profiling. Cancer Cell 2002;1:133–43.
7 Alon U, Barkai N, Notterman DA et al. Broad patterns of gene-expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA 1999;96:6745–50.
8 Giles PJ, Kipling D. Normality of oligonucleotide microarray data and implications for parametric statistical analyses. Bioinformatics 2003;19:2254–62.
9 Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc (B) 1995;57:289–300.
10 Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci USA 2003;100:9440–45.
11 Tomlins SA, Rhodes DR, Perner S et al. Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 2005;310:644–48.
12 MacDonald JW, Ghosh D. COPA—Cancer outlier profile analysis. Bioinformatics 2006;22:2950–51.
13 Tibshirani R, Hastie T. Outlier sums for differential gene expression analysis. Biostatistics 2007;8:2–8.
14 Wu B. Cancer outlier differential gene expression detection. Biostatistics 2007;8:566–75.
15 Hu J. Cancer outlier detection based on likelihood ratio test. Bioinformatics 2008;24:2193–99.
16 Ghosh D, Chinnaiyan AM. Genomic outlier profile analysis: mixture models, null hypotheses, and nonparametric estimation. Biostatistics 2009;10:60–69.
17 Efron B, Tibshirani RJ. An Introduction to the Bootstrap. New York: Chapman & Hall, 1993.