BMC Genomics - ScienceOpen

BMC Genomics

BioMed Central

Open Access

Research article

A detailed transcript-level probe annotation reveals alternative splicing based microarray platform differences

Joseph C Lee1, David Stiles1, Jun Lu1 and Margaret C Cam*1,2

Address: 1Genomics Core Laboratory, National Institute of Diabetes & Digestive & Kidney Diseases, National Institutes of Health, Bethesda, MD 20892, USA and 2Office of the Director, National Center for Biotechnology Information, National Institutes of Health, Bethesda, MD 20894, USA

Email: Joseph C Lee - [email protected]; David Stiles - [email protected]; Jun Lu - [email protected]; Margaret C Cam* - [email protected]

* Corresponding author

Published: 20 August 2007 BMC Genomics 2007, 8:284

doi:10.1186/1471-2164-8-284

Received: 8 September 2006 Accepted: 20 August 2007

This article is available from: http://www.biomedcentral.com/1471-2164/8/284

© 2007 Lee et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Background: Microarrays are a popular tool used in experiments to measure gene expression levels. Improving the reproducibility of microarray results produced by different chips from various manufacturers is important to create comparable and combinable experimental results. Alternative splicing has been cited as a possible cause of differences in expression measurements across platforms, though no study to this point has been conducted to show its influence in cross-platform differences.

Results: Using probe sequence data, a new microarray probe/transcript annotation was created based on the AceView Aug05 release that allowed for the categorization of genes based on their expression measurements' susceptibility to alternative splicing differences across microarray platforms. Examining gene expression data from multiple platforms in light of the new categorization, genes unsusceptible to alternative splicing differences showed higher signal agreement than those genes most susceptible to alternative splicing differences. The analysis gave rise to a different probe-level visualization method that can highlight probe differences according to transcript specificity.

Conclusion: The results highlight the need for detailed probe annotation at the transcriptome level. The presence of alternative splicing within a given sample can affect gene expression measurements and is a contributing factor to overall technical differences across platforms.

Background

Microarrays have become a widely used tool to measure gene expression levels on a genome-wide basis and are available from a number of manufacturers. Each platform incorporates proprietary technology, and differences across platforms in probe design, probe bioinformatics, probe creation and deposition, reagents and protocols introduce variability into expression analysis. A large body of work has studied the reproducibility of microarray data and therefore the interchangeability of commercial platforms. A dialogue over data sets, analysis methods, and concordance measures has evolved, but no clear consensus on the level of agreement or disagreement in expression results has been reached. It is generally understood that differences between platforms exist. The source of contention is the interpretation of the magnitude of these differences. Some conclude from the data that microarray results are sufficiently comparable across platforms [1,2]. Others caution that the technological differences have not yet been sufficiently resolved to combine experimental results from different platforms [3,4]. Regardless of the overall interpretation, both sides hypothesize that some of the differences may be attributable to the presence of splice variants [5-9].

Although alternative splicing is a logical source of cross-platform differences, there has been no direct evidence to show that this is the case. Studies have indicated 40% or more of all human genes are alternatively spliced [10,11] and expression measurement differences may arise when probes on different platforms target differentially expressed splice variants of the same gene. It has been previously demonstrated that a sequence matching method between probes increases cross-platform consistency and reproducibility [12]. Others have matched probes to genes and shown that annotation discrepancies affect analysis [13]. Here, we combine the two ideas to provide evidence for alternative splicing based cross-platform disagreement.

We created an in-depth probe/genome/transcript annotation using the AceView transcript database. AceView is a comprehensive annotation of transcripts and genes that incorporates data from GenBank, dbEST and RefSeq [14]. It has been shown to offer a richer view of the transcriptome, with 3 to 5 times more high-quality transcript forms than UCSC known genes, RefSeq or Ensembl [14]. Capturing transcript diversity is important because probes may be derived from the same loci, but match different transcript sequences due to alternative splicing. Using this new annotation, we categorized genes on each platform according to their susceptibility to splice variant differences and measured their cross-platform agreement in a biological data set using a traditional correlation measure and a Euclidean distance measure. The novel usage of the distance measure lends itself to a visualization that can show alternative splicing differences or other poorly performing probes.

Results

Matching platform-specific probes to AceView and RefSeq transcripts

We created a transcript-level annotation of microarray probes to study the effects of alternative splicing on cross-platform microarray discordance. Microarray probe sequences from Affymetrix (U95Av2 GeneChip, 25-mer oligonucleotide probes), Agilent (Human 1, cDNA probes) and Codelink (Uniset Human I Bioarrays, 30-mer oligonucleotide probes) were aligned to the genome and annotated as matching transcripts through shared genomic coordinates from the AceView and RefSeq [15] transcript databases (see Methods). Table 1 shows the results of both the genome and transcript mappings.
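The idea of matching probes to transcripts through shared genomic coordinates can be illustrated with a minimal sketch. The matching rule used here (every aligned block of a probe must lie inside an exon of the transcript) is an assumption made for illustration; the actual alignment conditions are those described in Additional file 1. The transcript names and coordinates are hypothetical.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end), inclusive genomic coordinates


def block_in_exon(block: Interval, exons: List[Interval]) -> bool:
    """Return True if the aligned block lies entirely inside one exon."""
    return any(start <= block[0] and block[1] <= end for start, end in exons)


def probe_matches_transcript(probe_blocks: List[Interval], exons: List[Interval]) -> bool:
    """Assumed rule: a probe matches when every aligned block falls within an exon."""
    return bool(probe_blocks) and all(block_in_exon(b, exons) for b in probe_blocks)


# Hypothetical splice variants of one gene, given as exon coordinate lists.
transcripts: Dict[str, List[Interval]] = {
    "geneX.a": [(100, 200), (300, 400), (500, 600)],
    "geneX.b": [(100, 200), (500, 600)],  # skips the middle exon
}

# A hypothetical 25-mer probe aligned inside the middle exon.
probe = [(320, 344)]

matched = sorted(
    name for name, exons in transcripts.items()
    if probe_matches_transcript(probe, exons)
)
print(matched)
```

In this toy case the probe is specific to the variant that retains the middle exon, which is the kind of transcript-level distinction a gene-level annotation would miss.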

More than 95% of all probes on the three platforms had genome alignments. Overall, 73%, 94% and 90% of Agilent, Codelink and Affymetrix probes, respectively, had AceView transcript alignments. The comparatively fewer alignments for Agilent stem from the strict coordinate restrictions we placed on the multiple-exon cDNA genome alignments. A detailed account of probe alignment conditions is available [see Additional file 1]. As anticipated, more probes matched AceView transcripts than RefSeq transcripts: an additional 21%, 11% and 13% of all probes for Agilent, Codelink and Affymetrix, respectively. The rest of our analysis was therefore conducted using AceView data.

Agilent and Codelink utilize single probes to target a gene and generate expression measurements. The Affymetrix U95Av2 chip is different: it targets a gene with a probe set consisting of up to 16 probes and summarizes the probe set to generate expression measurements. To create a transcript-level annotation of probe sets, a probe set was said to target a transcript if 5 or more of its probes matched it. An internal study showed five probes are necessary for reliable summarization measurements [see Additional file 2]. Of the 12453 Affymetrix probe sets, 11564 matched at least one transcript with 5 or more probes.

Table 1: Number of probes on each platform with AceView and RefSeq genome and transcript alignments

Platform     Unique Probes   Genome Alignments   AceView Alignments   RefSeq Alignments
Agilent      13335           12676 (95%)         9698 (73%)           6901 (52%)
Codelink     9969            9855 (99%)          9330 (94%)           8243 (83%)
Affymetrix   199015          193006 (97%)        179740 (90%)         153611 (77%)

"Unique probes" is the number of probes on each platform that had sequences to be aligned. The other alignment categories indicate the number of probes that had a specific alignment.

Genes categorized by susceptibility to alternative splicing based differences

To detect the effects of splice variation on gene expression measurements, we categorized each gene by probe specificity against known splice variants as annotated in AceView. Genes most susceptible to expression measurement differences from splice variants are those in which the probes from each of two different platforms interrogate mutually exclusive, or disjoint, sets of transcripts. Genes that are not susceptible to these measurement differences are those in which probes on both platforms target the same, or equal, sets of transcripts. In between the two extremes are genes that are susceptible to splice variants, but the effect of which cannot be measured because the platforms target common transcripts as well as transcripts specific to each platform. Using our transcript annotation results, we categorized genes commonly targeted by each pairwise platform combination based on their susceptibility to alternative splicing. To match the Affymetrix gene expression data used below, the Affymetrix probe set annotation was used, as described above. The number of genes in each category for Affymetrix/Agilent, Affymetrix/Codelink and Agilent/Codelink is shown in Table 2.

Correlation measure of alternative splicing discordance

We looked for overall splice variant differences utilizing gene expression data from a previous biological experiment. RNA was obtained from PANC-1 cells of a pancreatic ductal cell phenotype and an early stage of their differentiation to a pancreatic islet phenotype (see Methods). Five technical and biological replicate microarray experiments were run on each platform and their results were averaged to produce a single fold change value for each gene. We analyzed the expression data by creating scatter plots of the log2 fold changes for genes in the equal and disjoint transcript sets for each pairwise platform combination, shown in Figure 1. Only genes that were statistically significant at p-value < 0.05 on at least one platform were included in our analysis.
Table 3 shows the computed Pearson and Spearman correlation coefficients for each of the disjoint and equal gene groups.
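
The categorization and correlation steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the gene records, transcript names and fold change values are hypothetical, and only the Pearson coefficient is computed (the Spearman coefficient is the same calculation applied to ranks).

```python
import math
from typing import Dict, List, Set, Tuple


def categorize(a: Set[str], b: Set[str]) -> str:
    """Classify one gene from the transcript sets targeted on platforms A and B."""
    if a == b:
        return "equal"
    if not (a & b):
        return "disjoint"
    return "partial"  # shared transcripts plus platform-specific ones


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical records: (transcripts on A, transcripts on B, log2 FC on A, log2 FC on B)
genes: Dict[str, Tuple[Set[str], Set[str], float, float]] = {
    "g1": ({"a", "b"}, {"a", "b"}, 1.2, 1.0),
    "g2": ({"a"}, {"a"}, -0.8, -0.9),
    "g3": ({"a", "c"}, {"a", "c"}, 0.3, 0.5),
    "g4": ({"a"}, {"b"}, 1.0, -0.2),
    "g5": ({"a", "b"}, {"c"}, -0.5, 0.4),
    "g6": ({"b"}, {"c"}, 0.2, 0.1),
    "g7": ({"a", "b"}, {"b", "c"}, 0.6, 0.7),
}

# Group the paired fold changes by category, then correlate within each group.
groups: Dict[str, Tuple[List[float], List[float]]] = {}
for set_a, set_b, fc_a, fc_b in genes.values():
    xs, ys = groups.setdefault(categorize(set_a, set_b), ([], []))
    xs.append(fc_a)
    ys.append(fc_b)

r_equal = pearson(*groups["equal"])
r_disjoint = pearson(*groups["disjoint"])
print(f"equal r = {r_equal:.3f}, disjoint r = {r_disjoint:.3f}")
```

Genes in the "partial" group correspond to the intermediate case in which each platform targets shared as well as platform-specific transcripts; they are classified here but left out of the correlation, as in the comparison of the two extremes.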

There is a drastic drop in the correlation coefficients from the equal to the disjoint transcript sets for all three platform pairs; the difference being 0.263, 0.484 and 0.38 for Pearson and 0.253, 0.465, and 0.41 for Spearman from the Affymetrix/Agilent, Affymetrix/Codelink and Agilent/Codelink comparisons, respectively. As genes with disjoint transcript sets are most susceptible to alternative splicing based differences and genes with equal transcript sets are unsusceptible to alternative splicing based differences, the drop in the correlation coefficients between these two groups suggests alternative splicing is a contributing factor to platform discordance.

Distance measure of alternative splicing discordance

A distance measure provides an alternative view of the data to confirm the correlation coefficient results. We calculated the log2 fold change of experimental versus control groups for each of the five replicates individually, creating a vector of the five fold change values for each gene on each platform. Using log2 fold change places each platform into a common measurement space. We then calculated the Euclidean distance between expression vectors from different platforms for the genes in the equal and disjoint transcript sets. Unlike the previous scatterplots, no restriction was made on statistical significance of the gene. Next, we plotted a cumulative distribution function (CDF) of all calculated distances for each grouping to highlight the differences attributable to alternative splicing, shown in Figure 2(a,c,e).
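
The distance calculation can be sketched in a few lines. The intensities below are hypothetical, and the helper names are illustrative rather than taken from the study; the sketch only shows how five per-replicate log2 fold changes form a vector, how two such vectors are compared, and how an empirical CDF is read off a list of distances.

```python
import math
from typing import List, Sequence


def log2_fold_changes(experimental: Sequence[float], control: Sequence[float]) -> List[float]:
    """Per-replicate log2(experimental / control) fold changes."""
    return [math.log2(e / c) for e, c in zip(experimental, control)]


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length expression vectors."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


def empirical_cdf(values: Sequence[float], x: float) -> float:
    """Fraction of values less than or equal to x."""
    return sum(1 for v in values if v <= x) / len(values)


# Hypothetical intensities for one gene: five replicates per condition and platform.
vec_a = log2_fold_changes([210, 190, 220, 205, 198], [100, 95, 110, 102, 99])
vec_b = log2_fold_changes([830, 790, 860, 815, 800], [420, 400, 415, 410, 405])

distance = euclidean(vec_a, vec_b)
print(f"distance between platforms for this gene: {distance:.3f}")

# Hypothetical distances for a group of genes; the CDF gives the fraction at or below 0.5.
distances = [0.2, 0.4, 0.5, 0.9, 1.6]
print(empirical_cdf(distances, 0.5))
```

Because each vector holds ratios rather than raw intensities, the two platforms are compared in a common space even when their absolute signal scales differ.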

Table 2: Pairwise platform gene classification based on susceptibility to alternative splicing

Platform A   Platform B   Common Genes   Equal A = B   Disjoint A ∩ B = Ø   A\B ≠ Ø   B\A ≠ Ø
Affymetrix   Agilent      5804           1964          158                  1599      3461
Affymetrix   Codelink     6429           2808          85                   2526      2623
Agilent      Codelink     5183           1648          196                  1450      3173

Let A be the set of transcripts a probe(set) on platform A targets.
Let B be the set of transcripts a probe(set) on platform B targets.
Equal A = B: Number of genes in which platform A and B target equal transcript sets.
Disjoint A ∩ B = Ø: Number of genes in which platform A and B target disjoint transcript sets.
A\B ≠ Ø: Platform A and B target the same gene, but platform A targets extra transcripts that B does not.
B\A ≠ Ø: Platform A and B target the same gene, but platform B targets extra transcripts that A does not.
Note: The total number of classified genes can exceed the number of common genes because a gene may be a member of both A\B ≠ Ø and B\A ≠ Ø.

[Figure 1: three scatter plot panels, one per platform pair, with axes labeled Affy, Agilent and Codelink Log Fold Change, each running from −3 to 3; legend: Equal, Disjoint]

Figure 1: Log2 fold change for equal and disjoint genes in each pairwise platform combination. Each mark corresponds to a gene shared by two platforms. The diamonds are those genes that have equal transcript sets targeted by both platforms. The triangles are those genes that have disjoint transcript sets targeted by both platforms.

A curve that rises steeply and is shifted to the left represents a distance distribution that includes smaller distances than a curve shifted farther to the right. Smaller distances between expression vectors across platforms indicate higher agreement. The CDF curve for probes with equal transcript sets (impervious to alternative splicing based differences) is shifted to the left and rises faster than the CDF curve for disjoint transcript sets (most susceptible to alternative splicing based differences) for all pairwise platform combinations. Thus, equal transcript sets tend to have smaller distances and higher agreement than disjoint transcript sets, indicating a distinct alternative splicing effect in platform discordance.

To establish a baseline for comparison, we randomly paired expression vectors from genes on different platforms and calculated the distances. A CDF of distances from unrelated expression vectors provides an unbiased worst-case distribution and the baseline CDF is plotted in Figure 2. By comparison, if two platforms agreed completely, the CDF would be a unit step function. Platform distance distributions based on equal and disjoint gene sets fall in between the two extremes, as the baseline CDF is shifted to the right, establishing the distance for unrelated measurements.

Probe-level distance measure

Agilent and Codelink both use single probes to target a gene, but the Affymetrix U95Av2 GeneChip utilizes probe sets consisting of 16 probes to target a gene. Probe set gene expression measurements are statistical summarizations of member probes, which can be influenced by dead probes, cross-hybridization and other effects [16,17]. In order to generate a clearer view of expression measurements and agreement levels, we analyzed probe-level expression vectors instead of probe set summarizations for Affymetrix platform combinations.
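
The random-pairing baseline used for the distance CDFs can be sketched as below. This is an illustrative sketch with hypothetical gene names and fold change vectors; the shuffling scheme and the fixed seed are assumptions, since the text does not specify how the random pairs were drawn.

```python
import math
import random
from typing import Dict, List, Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length expression vectors."""
    return math.sqrt(sum((ui - vi) ** 2 for ui, vi in zip(u, v)))


def random_baseline(
    platform_a: Dict[str, List[float]],
    platform_b: Dict[str, List[float]],
    seed: int = 0,
) -> List[float]:
    """Distances between expression vectors of randomly paired genes.

    Assumption: genes on platform B are shuffled and paired in order with genes
    on platform A. A gene can occasionally be paired with itself by chance; a
    stricter version could reshuffle until no such pair remains.
    """
    rng = random.Random(seed)
    genes_b = list(platform_b)
    rng.shuffle(genes_b)
    return [euclidean(platform_a[ga], platform_b[gb]) for ga, gb in zip(platform_a, genes_b)]


# Hypothetical five-replicate log2 fold change vectors on two platforms.
platform_a = {
    "g1": [1.0, 1.1, 0.9, 1.2, 1.0],
    "g2": [-0.5, -0.4, -0.6, -0.5, -0.3],
    "g3": [0.1, 0.0, -0.1, 0.2, 0.1],
    "g4": [2.0, 1.8, 2.1, 1.9, 2.2],
}
platform_b = {
    "g1": [0.9, 1.0, 1.0, 1.1, 0.8],
    "g2": [-0.6, -0.5, -0.4, -0.6, -0.4],
    "g3": [0.0, 0.1, 0.0, 0.1, 0.2],
    "g4": [1.9, 2.0, 2.0, 1.7, 2.1],
}

baseline = random_baseline(platform_a, platform_b, seed=1)
matched = [euclidean(platform_a[g], platform_b[g]) for g in platform_a]
print(sorted(baseline))
print(sorted(matched))
```

Sorting each list gives the points of its empirical CDF, so the two printed lists can be compared directly: a distribution of matched-gene distances that sits to the left of the baseline indicates agreement beyond what unrelated measurements would show.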

We conducted a similar analysis as before, creating a vector of fold change values from the five replicates for each individual probe. To better isolate the effects of alternative splicing, we reduced the effect of cross-hybridization by removing from the analysis the 6.5%, 3.1% and 3.5% of individual Affymetrix, Agilent and Codelink probes, respectively, that matched multiple AceView gene symbols. Probe cross-hybridization results, as implied by sequence, are shown in Table 4.

For all genes shared between platforms, individual probe-probe comparisons using the probe annotations were made for equal or disjoint transcript set targeting. By examining all possible combinations of individual probes, it was hoped that details masked by the probe set annotation would become apparent. We created CDFs of all of the probe distances in each category and established a baseline by randomly pairing probe expression vectors, shown in Figure 2(b,d). The alternative splicing effect is again apparent, with a left shift of the probe distance CDF for the equal transcript set versus the disjoint transcript set.

Kolmogorov-Smirnov Test

The Kolmogorov-Smirnov test is used to determine if two samples are drawn from the same underlying distribution. We used it to test whether the Equal-Disjoint, Equal-Random and Disjoint-Random CDF combinations for each of the platform pairs differ from each other. Table 5 shows the test results, laid out to match the ordering of the platform pairings in Figure 2. At p < 0.05, we reject the null hypothesis of drawing from the same distribution for all CDF combinations except for Disjoint-Random on Affymetrix probe set/Codelink (c). We conclude that the Equal, Disjoint and Random distance distributions differ from each other, except in the case of Disjoint-Random for Affymetrix/Codelink (c), where we cannot reject the null hypothesis that they are the same.
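
For readers unfamiliar with the two-sample Kolmogorov-Smirnov test, a minimal standard-library sketch follows. It computes the statistic D as the largest gap between two empirical CDFs and an approximate p-value from the asymptotic Kolmogorov distribution. The approximation is rough for small samples, the cutoff below which the p-value is reported as 1 is a choice made here for numerical convenience, and the text does not say which implementation the authors used; the sample values are hypothetical.

```python
import bisect
import math
from typing import Sequence, Tuple


def ks_two_sample(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (D, approximate p-value) for the two-sample Kolmogorov-Smirnov test."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    # D is the largest gap between the two empirical CDFs, checked at every observed value.
    d = max(
        abs(bisect.bisect_right(xs, v) / n - bisect.bisect_right(ys, v) / m)
        for v in xs + ys
    )
    # Asymptotic Kolmogorov distribution: Q(lam) = 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 lam^2).
    lam = d * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        # The series converges slowly here and Q(lam) is essentially 1.
        return d, 1.0
    p = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam) for k in range(1, 101))
    return d, min(1.0, max(0.0, p))


# Hypothetical distance samples: one concentrated near zero, one shifted to the right.
equal_distances = [0.1 + 0.01 * i for i in range(50)]
random_distances = [1.0 + 0.02 * i for i in range(50)]

d_stat, p_value = ks_two_sample(equal_distances, random_distances)
print(f"D = {d_stat:.3f}, p = {p_value:.3g}")
```

A small p-value leads to rejecting the hypothesis that both samples of distances come from the same distribution, which is the decision rule applied to each CDF pair in the text.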

The distance distributions are different, indicating the shifts towards smaller distances and better agreement are meaningful in Figure 2. However, the disjoint transcript set in any pairwise combination involving Affymetrix


Table 3: Pearson correlation coefficients for equal and disjoint gene groups

Columns: Platform A, Platform B, N, Pearson (r, 95% CI, P-value), Spearman (r, 95% CI, P-value)

Equal

Platform A   Platform B   N      Pearson r   Pearson 95% CI
Affymetrix   Agilent      966    0.769       .742 < p < .794
Affymetrix   Codelink     1263   0.73        .703 < p < .755
Agilent      Codelink     908    0.662       .624 < p < .697