STEM CELL GENETICS AND GENOMICS

Reverse Serial Analysis of Gene Expression (SAGE) Characterization of Orphan SAGE Tags from Human Embryonic Stem Cells Identifies the Presence of Novel Transcripts and Antisense Transcription of Key Pluripotency Genes

MARK RICHARDS,a SIEW-PENG TAN,a WOON-KHIONG CHAN,b ARIFF BONGSOa

aDepartment of Obstetrics and Gynaecology, National University of Singapore, National University Hospital, Singapore; bDepartment of Biological Sciences, National University of Singapore, Singapore

Key Words. Reverse serial analysis of gene expression • Human embryonic stem cells • Transcriptome • Antisense transcription • POU5F1 • SOX2 • NANOG

ABSTRACT

Serial analysis of gene expression (SAGE) is a powerful technique for the analysis of gene expression. A significant portion of SAGE tags, designated as orphan tags, however, cannot be reliably assigned to known transcripts. We used an improved reverse SAGE (rSAGE) strategy to convert human embryonic stem cell (hESC)-specific orphan SAGE tags into longer 3′ cDNAs. We show that the systematic analysis of these 3′ cDNAs permitted the discovery of hESC-specific novel transcripts and cis-natural antisense transcripts (cis-NATs) and improved the assignment of SAGE tags that resulted from splice variants, insertion/deletion, and single-nucleotide polymorphisms. More importantly, this is the first description of cis-NATs for several key pluripotency markers in hESCs and mouse embryonic stem cells, suggesting that the formation of short interfering RNA could be an important regulatory mechanism. A systematic large-scale analysis of the remaining orphan SAGE tags in the hESC SAGE libraries by rSAGE or other 3′ cDNA extension strategies should unravel additional novel transcripts and cis-NATs that are specifically expressed in hESCs. Besides contributing to the complete catalog of human transcripts, many of them should prove to be a valuable resource for the elucidation of the molecular pathways involved in the self-renewal and lineage commitment of hESCs. STEM CELLS 2006;24:1162–1173

INTRODUCTION

Pluripotent human embryonic stem cell (hESC) lines are derived on fibroblast feeder layers via the isolation and extended serial propagation of the inner cell mass from supernumerary 5-day-old blastocysts [1–3]. They have offered much hope by promising to revolutionize the future of regenerative medicine through the provision of novel cell replacement therapies to treat a variety of debilitating diseases, such as myocardial infarcts, diabetes, and Parkinson’s disease [4, 5]. The molecular mechanisms controlling pluripotency and self-renewal in hESCs are presently not well understood [6]. Transcriptome profiling studies using DNA microarrays [7–11], serial analysis of gene expression (SAGE) [12], expressed sequence tag (EST) enumeration [13], and massively parallel signature sequencing (MPSS) [14, 15] have elucidated gene networks and putative signaling pathways that are believed to be essential in the maintenance of the hESC phenotype. Recent studies have implicated the WNT and transforming growth factor-β/activin/nodal pathways in the maintenance of pluripotency in hESCs [16, 17]. Transcriptome studies have shown that key components of these two pathways are active or highly expressed in hESCs. In addition, SAGE and other gene expression profiling studies have suggested that stem cells, in particular hESCs, express numerous uncharacterized or novel

Correspondence: Woon-Khiong Chan, Ph.D., Department of Biological Sciences, National University of Singapore, 14 Science Drive 4, Singapore 117543. Telephone: 65-6516-8096; Fax: 65-6779-2486; e-mail: [email protected]. Ariff Bongso, Ph.D., D.Sc., Department of Obstetrics and Gynaecology, National University of Singapore, National University Hospital, Singapore 119074. Telephone: 65-6772-4129; Fax: 65-6779-4753; e-mail: [email protected] Received July 6, 2005; accepted for publication January 22, 2006; first published online in STEM CELLS EXPRESS February 2, 2006. ©AlphaMed Press 1066-5099/2006/$20.00/0 doi: 10.1634/stemcells.2005-0304


transcripts, many of which are likely to represent novel genes [12–15, 18, 19].

SAGE is a sequence-based transcriptome profiling approach that provides qualitative and quantitative assessment of gene expression [20]. The underlying principle assumes that a short nucleotide sequence, or SAGE tag, located at the last anchoring enzyme (C-most) site contains sufficient information to represent a specific transcript. Often the NlaIII restriction enzyme is used, and the length of the SAGE tag can range from 14 (SAGE) to 21 (LongSAGE) or 26 base pairs (bp) (SuperSAGE), depending on the tagging enzymes used [20–22]. The digital nature of SAGE tags means that cumulative SAGE data can easily be merged, allowing large-scale comparisons between independent libraries. The sequencing of concatemerized SAGE tags also permits a high-throughput determination of the transcriptome compared with EST sequencing. Besides being a robust method that accurately reflects the actual relative levels of mRNA transcripts, SAGE also allows transcripts that are expressed at low levels to be efficiently detected [23, 24].

However, the reliance on short sequence tags imposes limitations on the precision and accuracy of gene identification. For instance, a SAGE tag may match multiple mRNA transcripts, making gene assignment difficult, although with the advent of LongSAGE and SuperSAGE, this problem has been largely solved. A more daunting problem is that many SAGE tags do not appear to match known mRNA transcripts or genes. In poorly characterized transcriptomes, such as those from hESCs [12] and hematopoietic stem cells [18, 19], such orphan SAGE tags could account for as much as 40% of tags. A recent study has shown that approximately 70% of orphan SAGE tags are derived from bona fide transcripts [24], reinforcing the view that SAGE is indeed a powerful method for novel gene discovery.
This suggests that a large number of the orphan SAGE tags that we have uncovered in the hESC transcriptome are true representatives of novel genes, transcripts, or splice variants [12], although the total number of genes present in the human genome is estimated at a conservative 30,000–40,000 [25, 26].

Another major source of uncertainty in SAGE tag-to-transcript assignment lies in the widespread presence of single-nucleotide polymorphisms (SNPs) within the human genome; SNPs occur as frequently as once every 100–300 bases [27, 28]. Occurrence of SNPs within the SAGE tag sequence or within the tagging restriction enzyme site will result in the assignment of an alternative SAGE tag. In a recent large-scale study of the SAGE database, at least one SNP-associated alternative SAGE tag was observed for 8.6% of all known human genes when the influence of SNPs and small insertion/deletion polymorphisms on SAGE tags was taken into consideration [29]. Indeed, the presence of this class of alternative SAGE tags has led to an underestimation of the expression of certain genes (e.g., GAL) and erroneously identified others (e.g., BTF3) as being specific to hESCs [12].

Naturally occurring antisense transcripts (NATs) have been recently reported in a variety of metazoan species [30, 31], and it is likely that a significant portion of the hESC orphan SAGE tags are derived from NATs. There are two main classes of NATs. The cis-encoded NAT (cis-NAT) is transcribed from the opposite strand of the same genomic locus and has the potential to form a long complementary duplex with the sense RNA transcript. In contrast, the trans-encoded NAT (trans-NAT) is transcribed from another distinct genomic locus, possibly a pseudogene [31], and is generally short and forms an imperfect duplex with its sense transcript. The human genome has been shown to express NATs widely [32–34], with as many as 20% of human genes forming sense-antisense (SA) transcript pairs [35]. For instance, hESCs have been reported to express a unique set of microRNAs, which belong to a class of trans-NATs [36].

A recent large-scale EST project has provided an important resource of full-length cDNAs for hESCs [13]. But like the >5 million ESTs that are available [37], they are difficult to use to verify the expression of NATs because many ESTs have not been directionally cloned [31, 32]. In contrast, SAGE tags are directionally reliable, as they are generated from well-defined restriction sites at the 3′ end of each RNA transcript. Thus, large SAGE datasets contain latent information on both sense and antisense transcription [38]. Interestingly, tags matching mRNAs or ESTs in antisense orientation were first observed in SAGE libraries constructed from Plasmodium falciparum [39, 40].

Without additional sequence information, it is difficult to characterize orphan SAGE tags from hESCs and identify the transcripts they represent. Several polymerase chain reaction (PCR)-based strategies have been developed, including reverse SAGE (rSAGE) [41, 42], generation of longer cDNA fragments from SAGE tags for gene identification (GLGI) [43, 44], and rapid analysis of unknown SAGE-tag-PCR [45]. In this report, we have modified the original rSAGE protocol [41, 42], which is also similar to GLGI [43, 44], and used it to obtain additional 3′ cDNA sequence information for a select group of orphan SAGE tags that are expressed specifically in hESCs. Our results identified novel transcripts unique in their expression to hESCs, transcripts that displayed alternative polyadenylation, and novel splice variants of known genes. More importantly, we found NATs for several pluripotency genes, including POU5F1 and NANOG. Collectively, the unique 3′ ESTs derived from orphan hESC SAGE tags (HESTs) will be an important resource in downstream functional analyses and the concerted dissection of molecular pathways critical to the pluripotent phenotype of hESCs.
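The tag definition and the sense/antisense reasoning above can be illustrated with a short sketch. It assumes the conventional short-SAGE layout (the NlaIII site CATG followed by 10 bases); the function names and the toy transcript are hypothetical and are not taken from the study.

```python
# Illustrative sketch only: names and the toy sequence are assumptions.
ANCHOR = "CATG"  # NlaIII recognition site (anchoring enzyme named in the text)
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def three_prime_most_tag(transcript, tag_len=10):
    """Return the tag_len bases after the last CATG, or None if unavailable."""
    seq = transcript.upper()
    pos = seq.rfind(ANCHOR)
    if pos == -1:
        return None
    start = pos + len(ANCHOR)
    tag = seq[start:start + tag_len]
    return tag if len(tag) == tag_len else None


def tag_orientation(tag, transcript):
    """Report whether CATG+tag occurs on the given strand or its complement."""
    site = ANCHOR + tag.upper()
    seq = transcript.upper()
    if site in seq:
        return "sense"
    if site in reverse_complement(seq):
        return "antisense"
    return None


# Toy transcript (hypothetical): two NlaIII sites, a polyadenylation signal,
# and a short poly(A) tail.
TOY_TRANSCRIPT = (
    "GGTCAAGGTTC" "CATG" "CCAGTACCTGAA" "CATG"
    "ACGGTTCAAG" "TTAATAAAGG" "AAAAAAAAAAAA"
)

sense_tag = three_prime_most_tag(TOY_TRANSCRIPT)
# A transcript from the opposite strand of the same locus yields its own tag.
antisense_tag = three_prime_most_tag(reverse_complement(TOY_TRANSCRIPT))
print(sense_tag, tag_orientation(sense_tag, TOY_TRANSCRIPT))
print(antisense_tag, tag_orientation(antisense_tag, TOY_TRANSCRIPT))
```

Because CATG is its own reverse complement, the same restriction sites are present on both strands, which is why a tag from a cis-NAT can map to a known locus while matching the annotated transcript only in the opposite orientation.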

MATERIALS AND METHODS

Culture of hESCs

hESCs (HES3 line, passages 19–25; ES Cell International, Singapore, http://www.escellinternational.com) were cultured on a feeder layer of mitomycin-C inactivated mouse embryonic fibroblasts (MEFs) as described previously [2]. HES3 cell colonies were passaged by mechanically cutting small clumps of undifferentiated HES3 (UD-HES3) cells and transferring these fragments to fresh MEF feeders at 7–8-day intervals [2, 46]. Differentiated HES3 (D-HES3) cells were obtained by prolonged (20-day) high-density culture on MEFs [12].

Total RNA Isolation

Total RNA was extracted from hESCs using TRIZOL (Invitrogen, Carlsbad, CA, http://www.invitrogen.com), whereas total RNA from the various somatic and fetal tissues was obtained commercially (Clontech, Palo Alto, CA, http://www.clontech.com). Prior to rSAGE library construction or reverse transcription (RT)-PCR, total RNA was treated with DNase I (Ambion, Austin, TX, http://www.ambion.com) to remove any residual genomic DNA contamination, and PCR using β-actin primers (forward, 5′-GATGCAGAAGGAGATCACTGC-3′; reverse, 5′-CACCTTCACCGTTCCAGTTT-3′), designed to span the last intron-exon boundary of the gene, was carried out to confirm the absence of genomic DNA.

Figure 1. Schematic diagram of the modified rSAGE protocol. Briefly, mRNA was isolated, and cDNA synthesis was performed with an anchored biotin-labeled RT primer. cDNAs were digested with NlaIII to reduce complexity of the library. An rSAGE linker was next ligated to cleaved 3′ cDNAs bound to streptavidin beads, following which AscI digestion was performed to release the cDNAs. rSAGE library scale-up amplification was performed with the rSAGEF1 and rSAGER1 primers. An aliquot of the amplified rSAGE library was used in rSAGE amplifications with a serial analysis of gene expression tag-specific primer and the common Rev1 reverse primer. Abbreviations: HEST, human embryonic stem cell serial analysis of gene expression tag; PCR, polymerase chain reaction; rSAGE, reverse serial analysis of gene expression; RT, reverse transcription; TSP, tag-specific primer.

cDNA Synthesis, NlaIII Digestion, and Linker Ligation

A schematic for the rSAGE library construction with all primer and linker sequences is depicted in Figure 1. cDNA synthesis was carried out using the Superscript II double-stranded cDNA synthesis kit (Invitrogen) with 10 μg of total RNA from HES3 cells and a biotinylated primer (5′-biotin-ATTGGCGCGCCGCGAGCACTGAGTCAATACGAT30VN-3′; Integrated DNA Technologies, Coralville, IA, http://www.idtdna.com). Double-stranded cDNA was digested with NlaIII (New England Biolabs, Ipswich, MA, http://www.neb.com) to generate 3′ overhangs. The biotinylated cDNAs were immobilized on streptavidin-magnetic beads (Invitrogen). Annealed linkers, A1 (5′-AAGCAGTGGTATCAACGCAGAGTCATG-3′) and A2 (5′-phosphate-ACTCTGCGTTGATACCACGCTT-aminoC7-3′), were ligated to the 5′ end of NlaIII-digested cDNA before AscI (New England Biolabs) digestion was performed to release the 3′ cDNA fragments from the streptavidin-magnetic beads.

PCR Scale-Up of rSAGE Library

Amplification of the primary rSAGE library was performed with 1 μl of the NlaIII-digested cDNAs, 5 U of Platinum Taq Polymerase (Invitrogen), rSAGEF1 (5′-AAGCAGTGGTATCAACGCAGAGT-3′) and rSAGER1 (5′-GCGAGCACTGAGTCAATACGC-3′) primers (350 ng each). After an initial denaturation at 94°C for 2 minutes, PCR was carried out for 25 cycles at 94°C for 45 seconds, 57°C for 1 minute, and 72°C for 1 minute, with a final extension at 72°C for 5 minutes.

Selection of Orphan SAGE Tags and Design of Tag-Specific rSAGE Primers

The 200 orphan SAGE tags selected for rSAGE were identified through a pairwise comparison of HES3 SAGE data against pooled data from 21 human SAGE libraries [12]. The SAGE tag-to-gene database used for gene identification was based on UniGene Build 160 (http://www.ncbi.nih.gov/SAGE/). The majority of the orphan SAGE tags selected were upregulated in HES3 compared with the pooled human SAGE libraries (p < .001; fold difference >4). A table describing the SAGE tags, sequences of the SAGE tag-specific rSAGE primers (TSRPs), and their respective frequencies in tags per million (tpm) in the pooled human, HES3, and HES4 SAGE libraries is provided as supplemental online Table 1. For those HES SAGE tags where LongSAGE tags were available, which were obtained through comparison with a HES3 LongSAGE library, the TSRPs were designed using the Primer3 software (http://frodo.wi.mit.edu) [46]. Typically, they included either the entire 21 bases of the LongSAGE tag or an additional four to eight bases of the common linker (CGCAGAGT) and up to 19 bp of the LongSAGE tag. If no appropriate LongSAGE tag was available (Tag IDs 1–77), the TSRPs were designed with seven bases of the common linker sequence (GCAGAGT) and the entire 14 bases of the SAGE tag, with the exception of Tag IDs 30 and 72.
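As a rough illustration of the short-tag primer layout described above, the sketch below assembles seven linker bases (GCAGAGT) with a 14-base SAGE tag, assuming the 14 bases are the NlaIII site CATG plus the 10-base tag. The function names are hypothetical, the Wallace rule is used only as a crude indicator of AT-richness rather than as the Tm method used in the study, and the actual primer sequences are those in supplemental online Table 1.

```python
# Illustrative sketch only; not the primer-design procedure used in the study.
LINKER_TAIL = "GCAGAGT"  # seven bases of the common linker, from the text
ANCHOR = "CATG"          # NlaIII site, assumed to begin the 14-base SAGE tag


def design_short_tag_tsrp(tag10):
    """Build linker(7) + CATG + 10-base tag, giving a 21-base primer."""
    tag10 = tag10.upper()
    if len(tag10) != 10 or set(tag10) - set("ACGT"):
        raise ValueError("expected a 10-base tag of A, C, G, T")
    return LINKER_TAIL + ANCHOR + tag10


def gc_fraction(primer):
    """Fraction of G and C bases in the primer."""
    primer = primer.upper()
    return (primer.count("G") + primer.count("C")) / len(primer)


def wallace_tm(primer):
    """Crude Wallace-rule estimate: 2 degrees per A/T, 4 degrees per G/C."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


# Example with an orphan tag sequence quoted later in the Results.
primer = design_short_tag_tsrp("TGTTTTGGAG")
print(primer, len(primer), round(gc_fraction(primer), 2), wallace_tm(primer))
```

Because the linker and anchor bases are fixed, only the 10 tag-specific bases vary between such primers, so an AT-rich tag directly lowers the estimated melting temperature.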

rSAGE Amplification Reaction and Characterization of 3′ rSAGE Fragments

Touchdown PCRs were performed using an initial denaturation cycle at 94°C for 2 minutes, followed by four cycles at 94°C for 45 seconds, 63°C for 1 minute, and 72°C for 1 minute; four cycles at 94°C for 45 seconds, 60°C for 1 minute, and 72°C for 1 minute; 25 cycles at 94°C for 45 seconds, 58°C for 1 minute, and 72°C for 1 minute; and a final extension step at 72°C for 5 minutes. The reaction setup for rSAGE PCR was as follows: 1 μl of amplified rSAGE library, 1 U of Platinum Taq Polymerase, 350 ng of TSRP and rSAGER1 primer. The PCR products were run on 1.2% TAE agarose gel, and the bands were excised and purified using QIAquick Gel Extraction Kit (Qiagen, Valencia, CA, http://www.qiagen.com). Purified PCR products (2–4 μl) were ligated into the pGEM-T Easy Vector (0.5 μl) (Promega, Madison, WI, http://www.promega.com) using T4 DNA ligase. The ligation reaction was incubated overnight at 16°C and resuspended in 8 μl of sterile water. Electroporation was performed using 1 μl of the ligated products and 25 ml of pTOP10 cells (Invitrogen). The transformants were plated on selective media, and two to four clones were picked for each rSAGE PCR product. Plasmid DNA was extracted using QIAprep Spin Miniprep Kit (Qiagen). Sequencing reactions were carried out with Big Dye v3.1 (Applied BioSystems, Foster City, CA, http://www.appliedbiosystems.com) and M13 Forward primer. The sequenced products were analyzed on an ABI 3100 DNA Sequencer (Applied BioSystems).
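The touchdown program described in this section can be written down as data, which makes the total cycle count and programmed hold time easy to check. This is a bookkeeping sketch under the stated cycling conditions; the variable and function names are hypothetical, and ramp times between temperatures are ignored.

```python
# Illustrative sketch only: encodes the touchdown schedule from the text.
# Each stage is (number of cycles, annealing temperature in degrees C).
TOUCHDOWN_STAGES = [(4, 63), (4, 60), (25, 58)]

DENATURE_S = 45    # 94 degrees C for 45 seconds
ANNEAL_S = 60      # annealing for 1 minute
EXTEND_S = 60      # 72 degrees C for 1 minute
INITIAL_S = 120    # initial denaturation, 2 minutes
FINAL_S = 300      # final extension, 5 minutes


def total_cycles(stages):
    """Sum the cycle counts across all touchdown stages."""
    return sum(cycles for cycles, _ in stages)


def hold_time_seconds(stages):
    """Total programmed hold time, excluding temperature ramping."""
    per_cycle = DENATURE_S + ANNEAL_S + EXTEND_S
    return INITIAL_S + total_cycles(stages) * per_cycle + FINAL_S


print(total_cycles(TOUCHDOWN_STAGES))
print(hold_time_seconds(TOUCHDOWN_STAGES) / 60)
```

The schedule steps the annealing temperature down from 63°C to 58°C over the first eight cycles, so early cycles favor the most specific primer-template matches before the bulk of amplification takes place.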

Sequence Analysis and Identification of Genuine rSAGE PCR Products

A bona fide 3′ rSAGE product was defined as possessing the entire SAGE tag sequence, the rSAGER1 primer sequence, and a poly(A) tract of >10 adenine residues. Sequences that lacked any one of the three were considered nonspecific amplification artifacts and omitted from further analysis. The rSAGE 3′ EST sequences were searched against the GenBank Database (NR, dbEST, and human genome) using BLASTN (http://www.ncbi.nlm.nih.gov/BLAST/), the University of California Santa Cruz human genome browser database (May 2004 build) using the BLAT program (http://genome.ucsc.edu/cgi-bin/hgBlat), and the EMBL database using a web interface-based batch BLAST program (http://biomedicum.csc.fi:8010/cgi-bin/batchblast.cgi) [20].

An rSAGE sequence was classified as novel if no matches to a transcript sequence (known gene, mRNA, or EST) were found. A sequence was considered to represent a known gene if it matched a full-length transcript sequence with >95% similarity in the same orientation. A sequence was classified as known EST if it matched an EST or open reading frame (ORF) with >95% similarity in the same orientation. A sequence was classified as an SNP alternative tag if it contained a single-bp mismatch within the SAGE tag sequence or NlaIII site. A sequence was classified as an insertion/deletion if it contained an insertion or deletion of fewer than three nucleotides within the SAGE tag sequence. A sequence was classified as an antisense transcript if it matched with high similarity to known transcripts in the opposite orientation. A sequence was classified as poly(A) if the SAGE tag lay immediately adjacent to the poly(A) tract, so that extension yielded only poly(A) sequence. Finally, a sequence was considered an alternative isoform if it matched the middle of known full-length transcripts in the same orientation and contained a poly(A) tract immediately downstream of the matched region.

Genomic coordinates of the 3′ SAGE ESTs were annotated based on the University of California Santa Cruz genome browser annotation database (http://genome.ucsc.edu/).
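Two of the criteria above lend themselves to a compact illustration: the three-part test for a bona fide product and the single-mismatch rule for SNP alternative tags. The sketch below assumes the read is oriented as the sense strand, so the reverse primer appears as its reverse complement; the function names and the toy read are hypothetical, and the real analysis relied on BLAST and BLAT searches rather than string matching.

```python
# Illustrative sketch only: function names and the toy read are assumptions.
import re

ANCHOR = "CATG"
RSAGER1 = "GCGAGCACTGAGTCAATACGC"  # reverse primer as printed in the Methods
_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def is_bona_fide(read, tag, primer=RSAGER1, min_poly_a=11):
    """Require the tag, the reverse primer site, and a poly(A) run of >10 A."""
    read = read.upper()
    has_tag = tag.upper() in read
    has_primer = reverse_complement(primer) in read
    has_poly_a = re.search("A{%d,}" % min_poly_a, read) is not None
    return has_tag and has_primer and has_poly_a


def hamming(a, b):
    """Count mismatched positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a.upper(), b.upper()))


def is_snp_alternative(observed_tag, canonical_tag):
    """True when the observed tag differs from the canonical tag by one base."""
    return hamming(observed_tag, canonical_tag) == 1


def make_toy_read(tag, poly_a_len):
    """Hypothetical read: tag, filler, poly(A) signal, tail, primer site."""
    return (ANCHOR + tag + "GATTACAGATTACA" + "AATAAA" + "GTCTC"
            + "A" * poly_a_len + reverse_complement(RSAGER1))


print(is_bona_fide(make_toy_read("TGTTTTGGAG", 15), "TGTTTTGGAG"))
print(is_bona_fide(make_toy_read("TGTTTTGGAG", 8), "TGTTTTGGAG"))
# Tag pair reported for GJA1 in the Results: canonical versus orphan tag.
print(is_snp_alternative("TGTTTTGGAG", "TGTTCTGGAG"))
```

The poly(A) threshold is written as 11 because the definition requires more than 10 adenine residues; a read with exactly 10 would be rejected.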

RT-PCR Confirmation of Novel 3′ cDNAs

First-strand synthesis was performed using the SuperScript first-strand synthesis system (Invitrogen). One μl of first-strand reaction was used for each PCR together with 50 pmol of forward and reverse primers. Initial denaturation was carried out at 94°C for 2 minutes, followed by 30 cycles of PCR (94°C for 30 seconds, 55°C for 30 seconds, 72°C for 1 minute), and a final extension cycle at 72°C for 5 minutes. PCRs were loaded on a 1.5% agarose gel and size fractionated. In instances where the 3′ cDNA sequence obtained was short and no suitable primer pairs could be found, additional 5′ genomic sequences were used to anchor the forward primers. In all cases, the reverse primer primed from the rSAGE 3′ cDNA sequence. Primers used were as follows. ACTB: product 400 bp, 5′-TGGCACCACACCTTTCTACAATGAGC-3′, 5′-GCACAGCTTCTCCTTAATGTCACGC-3′; POU5F1: product 247 bp, 5′-CGRGAAGCTGGAGAAGGAGAAGCTG-3′, 5′-CAAGGGCCGCAGCTTACACATGTTC-3′; HEST97: product 160 bp, 5′-CCTTTGTCATGAGCCCTTGT-3′, 5′-GGAATGAAAGAATGGTTGCTC-3′; HEST101: product 119 bp, 5′-AAGAGCCTGCTACGGAACTG-3′, 5′-TCACTAGAGGTTTCCAACACACTT-3′; HEST120: product 159 bp, 5′-AAATTTGGTGCTGTGACTCG-3′, 5′-GCGGGCTGAGTCGGATTT-3′; HEST123: product 200 bp, 5′-GGGTTATGTGTAGAAACCAAGTGA-3′, 5′-TCTTAGAACTTATGATACACCCAGTTG-3′; HEST127: product 218 bp, 5′-GGGAAAAGATGGCAAGGTTA-3′, 5′-AATATATTCGAGTCACATCATGACA-3′; HEST146: product 171 bp, 5′-GATGCCATCACTCAAACTAGACC-3′, 5′-GACGTCCTATGCAGGCATTT-3′; HEST147: product 205 bp, 5′-GGGGATTCGAGGTTCCTGTA-3′, 5′-CATTTCAAGGCACAATTTTAATAGC-3′; HEST149: product 196 bp, 5′-CCCAGGCTGAAGTGTAGTGA-3′, 5′-CATTTACAATGGTACAAGGAGCA-3′. The universal reference RNA sample was obtained from Stratagene (La Jolla, CA, http://www.stratagene.com), and somatic tissue RNA samples were obtained from Clontech.

Figure 2. Results of reverse serial analysis of gene expression (rSAGE) amplification for 200 orphan serial analysis of gene expression (SAGE) tags. (A): Pie chart shows the distribution of rSAGE products. (B): rSAGE reactions were carried out using the tag-specific rSAGE and rSAGER1 primers, the products were analyzed on an agarose gel, and the bands were visualized with ethidium bromide. Most lanes show a single distinct amplified rSAGE band. A 100-bp ladder (M) was used as a molecular weight marker. The numbers at the top of the gel represent the SAGE Tag ID. Abbreviations: EST, expressed sequence tag; M, molecular weight marker; PCR, polymerase chain reaction.

Orientation-Specific RT-PCR

To detect the NATs for POU5F1, NANOG, LIN28, TALE, TERF1, and TERA, orientation-specific first-strand cDNA synthesis was carried out with the appropriate sense primers. Thereafter, Superscript II RT was heat-inactivated at 95°C for 15 minutes. PCR was performed with 3 μl of the 20-μl first strand mix as described. Control experiments without reverse transcription (−RT controls) for each of the three antisense primers were performed to detect genomic DNA contamination. The primers used were as follows. POU5F1 NAT: product 184 bp, 5′-AGTTTGTGCCAGGGTTTTTG-3′, 5′-TGTGTCCCAGGCTTCTTTATTT-3′; NANOG NAT: product 278 bp, 5′-TCGGTATTGTTTGGGATTGG-3′, 5′-TCATCGAAACACTCGGTGAA-3′; LIN28 NAT: product 178 bp, 5′-GGAGGCCAAGAAAGGGAATA-3′, 5′-CCGCCCCATAAATTCAAGAT-3′; TALE NAT: product 80 bp, 5′-TTTTCAGACTGTGCAATACTTAGAGAA-3′, 5′-TTAGACAGTATGTGGGCATCC-3′; TERF1 NAT: product 169 bp, 5′-TGCGGAGTAGATGAGATGGA-3′, 5′-AAGGCAATGGAAAACAGGTAAA-3′; TERA NAT: product 131 bp, 5′-TTTTGGCTGCAGTATTGGTG-3′, 5′-CATCCTACAGGCAAAGAGAGG-3′.

RESULTS

rSAGE Amplification, Specificity, Efficiency, and Size Distribution of 3′ cDNAs

The original rSAGE (Kinzler/Vogelstein laboratories) [41, 42], the GLGI-SAGE protocol [43, 44], and our modified rSAGE strategy share several key features (Fig. 1). However, we have made several modifications to increase the efficiency of 3′ cDNA conversion. For instance, changes in the design of the universal primers allowed the rSAGE library scale-up and the subsequent TSRP PCR amplification to be carried out at an increased melting temperature (Tm). The introduction of a longer poly(T) tract (T30) and the inclusion of VN dinucleotides in the first-strand RT-PCR primer allowed better trapping and synthesis of full-length mRNAs at their 3′ ends, compared with the shorter poly(T) tract (T10) used in the GLGI strategy, which might result in the primer binding to internal poly(A) residues within mRNA transcripts. Finally, increasing the Mg2+ concentration when no distinct rSAGE band was observed in the first round of PCR could occasionally enhance the specificity of the rSAGE amplification reaction.

Of the 200 HES3 orphan SAGE tags that were selected for rSAGE conversion (supplemental online Table 1), 168 (84.0%) yielded PCR amplification products (Fig. 2A). The conversion rate of orphan LongSAGE tags into longer 3′ cDNA fragments was much higher (93.4%) than that of the SAGE tags (69.2%). We attributed these improvements to the availability of additional sequences from the LongSAGE tags for the design of TSRPs, as well as better-designed universal primers (rSAGEF1 and rSAGER1) in our strategy (Fig. 1). In particular, we found the universal M13 primer used as the antisense primer in the original rSAGE strategy [41, 42] was unsatisfactory for rSAGE because of its low Tm.

A representative agarose gel showing the rSAGE products is shown in Figure 2B. It is noteworthy that the majority (~90%) of the TSRPs yielded only a single distinct rSAGE band. Our results also support the notion that there is no strict correlation between the efficiency of target template amplification and the abundance of the SAGE tag [29], unlike earlier reports on GLGI-SAGE [44, 45]. Other variables, such as SAGE tag length and primer sequence, may be equally important parameters influencing the efficiency of target amplification. As shown in Figure 2B, the rSAGE amplification generally generated intense bands that were easily gel-purified, although amplification of SAGE tags with a lower copy number (<20 tpm) yielded less PCR product and in some cases (Tag IDs 156 and 169) contained one or multiple faint bands that were difficult to gel purify; these bands were not analyzed. When two or more distinct rSAGE bands were obtained (Tag IDs 126, 141, and 148), they usually turned out to be discrete 3′ cDNA fragments.

In most GLGI reports, conversion to 3′ cDNAs is usually attempted for SAGE tags with a high copy number [18, 19]. In contrast, a large proportion (68%) of the orphan SAGE tags we attempted to convert to 3′ cDNAs were present at lower frequencies (≤50 tpm). We also managed to obtain genuine rSAGE products for SAGE tags with frequencies of as low as 5 tpm, which is equivalent to the detection of a singleton in the HES3 SAGE library (HESTs 79, 147, and 174; supplemental online Table 1). In conclusion, it appears that our modified rSAGE protocol has some improvements over the original rSAGE protocol [41, 42] and was as efficient as GLGI-SAGE [43, 44] and GLGI-MPSS [47].

From the 168 SAGE tags that yielded PCR amplification products, a total of 196 rSAGE products were cloned and sequenced. Of these, 148 (75.5%) were confirmed as specific rSAGE products following DNA sequencing, BLAST and BLAT confirmation (supplemental online Table 2).
These 148 rSAGE 3′ cDNA fragments have been submitted to GenBank (accession numbers DN604327–DN604453), and we will refer to these cDNA sequences hereafter as HESTs. When TSRPs were designed using the LongSAGE tags, the overall amplification specificity reached 80.5% compared with GLGI-SAGE specificities that varied between 60% for low-copy SAGE tags and 80% for high-copy SAGE tags [43, 44]. Many of the nonspecific rSAGE fragments lacked a poly(A) tract and the rSAGER1 primer and were generated mainly because of mispriming at the 3′ ends (supplemental online Table 3). Finally, although the hESC lines used in our earlier SAGE study [12] and for the present rSAGE library construction were grown on MEF feeders, we did not find contaminating murine RNA transcripts a significant problem in our 3′ rSAGE conversion attempts.

Overall, 16.0% of rSAGE reactions failed to give distinct amplification products. Taken together with the nonspecific rSAGE results, our main conclusion is that a SAGE tag does not always provide an ideal sequence for the design of thermodynamically favorable TSRPs for the efficient amplification of 3′ cDNA by rSAGE. Thus, orphan SAGE tags that were AT-rich or contained sequences that were self-complementary often failed to generate specific rSAGE 3′ cDNA fragments. Although it is possible that when the expression level of targeted templates is very low, partial annealing of the TSRPs with other highly expressed templates may result in nonspecific amplification [44], the availability of additional sequences through the generation of LongSAGE or even SuperSAGE tags [22] would allow most of the remaining orphan SAGE tags to be converted into longer 3′ cDNA fragments for gene identification.
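The headline rates in this section follow directly from the reported counts, and the tags-per-million unit can be checked the same way. The sketch below recomputes them; the library size of 200,000 tags is an assumption inferred from the statement that 5 tpm corresponds to a singleton, not a figure given in the text, and the function names are hypothetical.

```python
# Illustrative sketch only: recomputes reported percentages from the counts.
def percent(part, whole):
    """Percentage of part in whole, rounded to one decimal place."""
    return round(100 * part / whole, 1)


def tags_per_million(count, library_size):
    """Normalize a raw tag count to tags per million (tpm)."""
    return count * 1_000_000 / library_size


TAGS_ATTEMPTED = 200       # orphan SAGE tags selected for rSAGE
TAGS_AMPLIFIED = 168       # tags yielding PCR amplification products
PRODUCTS_SEQUENCED = 196   # rSAGE products cloned and sequenced
PRODUCTS_SPECIFIC = 148    # products confirmed as specific

# Assumption: a library of 200,000 tags, inferred from "5 tpm = singleton".
ASSUMED_LIBRARY_SIZE = 200_000

print(percent(TAGS_AMPLIFIED, TAGS_ATTEMPTED))
print(percent(TAGS_ATTEMPTED - TAGS_AMPLIFIED, TAGS_ATTEMPTED))
print(percent(PRODUCTS_SPECIFIC, PRODUCTS_SEQUENCED))
print(tags_per_million(1, ASSUMED_LIBRARY_SIZE))
```

Note that the two rates use different denominators: the amplification rate is per tag attempted, whereas the specificity rate is per product sequenced, since some tags gave more than one band.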

Analysis of 3′ HESTs Generated from HES3 Orphan SAGE Tags

The size distribution of the 148 HESTs ranged from 36 to 538 bp, with 56.7% of them longer than 100 bp, which matched well to the reported data from GLGI-SAGE studies [18, 19, 43, 44]. A small number of the TSRPs [14] gave two or more distinct rSAGE bands. The majority of them were mapped to distinct transcripts (HEST31, 52, 53, 65, 98, 99, 148, and 170; supplemental online Table 2), whereas those for HEST126 and 141 were the result of alternative polyadenylation sites.

Previous GLGI-SAGE reports have relied on BLAST searches to determine the identity of the 3′ cDNA fragments [18, 19, 43, 44]. We used both BLAT and BLAST searches to establish the identity of rSAGE cDNA sequences (Fig. 3A). Indeed, the BLAT transcript viewer made it easier to visualize and quickly identify NATs, novel introns, and new splice variants of known transcripts and to confirm SNPs within the SAGE tags. For several SAGE tags, rSAGE extension resulted only in poly(A) sequences, as a result of the NlaIII site occurring just adjacent to the poly(A) tract, and would require the use of a different tagging enzyme to reveal their true identity.

More importantly, our rSAGE results have clearly identified 59 of these rSAGE 3′ cDNA fragments as novel rSAGE 3′ ESTs and 30 NATs, all of which are identified for the first time (Fig. 3A). The majority of the novel rSAGE 3′ ESTs that mapped to specific chromosomal locations also contained the canonical polyadenylation signal, AATAAA or its functional variant [48], and are likely to represent bona fide transcripts from previously undescribed human genes. As shown in Table 1, the majority of these 18 novel rSAGE 3′ ESTs are underrepresented in the nonhuman embryonic stem (ES) SAGE libraries and are found mainly in SAGE libraries constructed from cancer cell lines or carcinomas. They are likely to represent transcripts that are expressed specifically in hESCs.
For instance, HEST94 and 147 are represented only in hESCs and could turn out to be an excellent marker for the “stemness” phenotype of hESCs. To confirm the validity of these rSAGE 3⬘ cDNAs and whether they were indeed restricted only to undifferentiated hESCs, RT-PCR was performed for several selected HESTs (97, 146, 147, and 149) across a selected tissue panel (testis, brain, heart, skeletal muscle, fetal brain, and stomach), undifferentiated hESCs (HES3 and HES4), and differentiated HES3 cells (Fig. 3B). Like the well-established hESC marker POU5F1, the RTPCR products for these four novel rSAGE 3⬘ ESTs were detected only in the hESC lines and were absent in the other somatic tissues examined. The expression of HEST149 was also completely abrogated in differentiated HES3 cells (Fig. 3B) and could, like Oco90 [12], prove to be a reliable marker for monitoring the early differentiation of hESCs. Indeed, HEST149 expression was undetectable in the universal reference RNA sample, which is an RNA pool from several cancer tissues (Fig. 3B) and absent in several embryonal carcinoma lines such as GCT-27C4, GCT-27X1, and GCT-44 (unpublished results). In addition, a number of the novel 3⬘ rSAGE cDNA fragments (e.g., HESTs 73, 92, 102, and 126) could not be matched reliably to the human genome and were also not the products of contaminating MEF cDNAs. Perhaps these HESTs represent

1168

rSAGE Characterization of Human Embryonic Stem Cells

Figure 3. Identity of the 148 rSAGE 3⬘ cDNA fragments. (A): The distribution of the various categories of rSAGE products is summarized as a pie chart. (B): Human embryonic stem cell (hESC)-specific expression of eight HESTs were verified with semiquantitative reverse transcriptionpolymerase chain reaction (PCR) using total RNAs prepared from several peripheral adult tissues and fetal brain, universal reference RNA (Stratagene), undifferentiated HES3 and HES4 hESC lines, and D-HES3 cells. (C): Quantitative real-time PCR results for GJA1 SNP analysis. Abbreviations: bp, base pairs; CT, threshold cycle; EST, expressed sequence tag; FAM, 6-carboxyfluorescein; INDEL, insertion/deletion; rSAGE, reverse serial analysis of gene expression; SNP, single-nucleotide polymorphism.

transcripts from novel hybrid RNAs with a regulatory function or as yet undiscovered genes. The presence of consensus polyadenylation sites on several of these HESTs (e.g., 92, 102, and 126) is a good indication that these are authentic transcripts. Interestingly, four HESTs (112, 120, 128, and 170) showed high sequence similarity to the WiCell hESC ESTs [13]. HEST2 and HEST146, classified as novel sequences, did not overlap with known hESC ESTs but mapped to genomic regions proximal to chromosomal sites from which several WiCell hESC ESTs appear to be transcribed. Obtaining 3′ cDNA sequences that matched WiCell ESTs [13] indicated that our modified rSAGE protocol was working well. In addition, our RT-PCR data confirmed that the expression of HEST120, 127, and 146 was confined to hESCs, although HEST120 (and to a lesser extent HEST127) was also detected in the fetal brain (Fig. 3B). Unfortunately, although these ESTs are highly restricted in their expression to hESCs, as demonstrated either by RT-PCR or by their representation in human ESC SAGE libraries [12], their exact functional role is unknown. The impact of SNPs on the correct assignment of SAGE tags to specific transcripts [29] is also illustrated by our rSAGE results. For instance, HEST49 matched CHD8 with almost 100% sequence similarity and is the result of an SNP that created a new NlaIII restriction site upstream of the AATAAA polyadenylation site. The full-length cDNA sequence of CHD8 is 8,160 bp long, and this SNP would generate the C-most

SAGE tag. The original C-most SAGE tag for CHD8 is GGCCCCATTG (nts 7311–7320), which is also represented in the HES3 SAGE library (5 tpm). We also detected an SNP within the C-most SAGE tag of GJA1, which encodes the gap junction protein connexin 43. The putative C-most SAGE tag is TGTTCTGGAG (nts 2916–2925). The rSAGE conversion of the orphan SAGE tag, TGTTTTGGAG, resulted in HEST113, which displayed 97% sequence similarity to the 3′ terminal region of the GJA1 coding region. Careful examination of the corresponding EST and genomic DNA sequences indicated that this orphan tag most likely represented an SNP in the canonical GJA1 SAGE tag and not the hypothetical protein FLJ10407, as suggested by the predicted tag-to-gene mapping of SAGEGenie. The GJA1 SNP was verified using 6-carboxyfluorescein (FAM)- and VIC-labeled Taqman probes that were specific to the polymorphism (Fig. 3C). The generation of longer 3′ cDNA sequences by rSAGE has also helped to resolve some of the ambiguities in tag-to-gene assignments, at least in HES3 cells. For example, HEST119 (AGTGAGGATA) matched the hypothetical protein FLJ35155 (C3orf21), which is restricted in expression to hESC lines and tissues of cancerous origin. In addition, the SAGE tag for HEST114 (CATCCAAAAA) was incorrectly assigned to NPY and CEP2 by SAGEGenie and SAGEMap, respectively. Instead, rSAGE conversion confirmed that HEST114 matched

Richards, Tan, Chan et al.

Table 1. Chromosomal location and SAGE library representation of 18 novel 3′ reverse SAGE expressed sequence tags with authentic polyadenylation signal

| HEST | SAGE tag sequence | HES (tpm) | D-HES3 (tpm) | Chromosomal location | Poly(A) signal | Human SAGE libraries | Human ES SAGE libraries |
|---|---|---|---|---|---|---|---|
| 2 | AAATTTGGTA | 73 | 0 | Chr 16(+) | AATAAA | Astrocytoma, medulloblastoma, cartilage chondrosarcoma, melanoma, retinal pigment epithelium | HES3, HES4, H1, H7, H9, H13, HSF6, BG01 |
| 11 | ATGTACTCTA | 21 | 0 | Chr 13(−) | AATAAA | Astrocytoma, colon carcinoma, medulloblastoma, placenta | HES3, HES4, H9, H14, HSF6, BG01 |
| 30 | GGCCATTGTT | 21 | 0 | Chr 2(+) | ATTAAA | Many carcinoma SAGE libraries, HRPE | HES3, HES4, H1, H13, HSF6, BG01 |
| 90 | TGAATTGCTT | 156 | 12 | Chr 10(−) | AATAAA | Many SAGE libraries | HES3, HES4, H1, H9, H13, H14, HSF6, BG01 |
| 94 | TGGGTTGTCT | 192 | 24 | Chr 5(+) | AATAAA | Many SAGE libraries | HES3, HES4, H1, H7, H9, H13, H14, HSF6, BG01 |
| 95 | CATTTTCTGG | 99 | 12 | Chr 8(+) | GATAAA | Ovarian carcinoma, astrocytoma, ependymoma, glioblastoma, colon adenocarcinoma, retina, CD34+ cells | HES3, HES4, H1, H7, H9, H13, H14, HSF6, BG01 |
| 97 | GAGCAACCAT | 78 | 0 | Chr 1(+) | AATAAA | None | HES3, H1, H9, H13, H14, HSF6 |
| 98 | ATGGTGCACA | 145 | 12 | Chr 8(−) | AATAAA | Many SAGE libraries | HES3, HES4, H1, H7, H9, H13, H14, HSF6, BG01 |
| 103 | TTGTCAAAAT | 93 | 24 | Chr 2(−) | ATTAAA | Many SAGE libraries | HES3, HES4, H1, H7, H9, H13, H14, HSF6, BG01 |
| 132 | TCAATTCTAT | 36 | 24 | Chr 5(+) | AATAAA | Breast carcinoma, astrocytoma, ependymoma, medulloblastoma, endothelial | HES3, HES4, H1, H7, H9, H13, H14, HSF6, BG01 |
| 139 | GGCACGTTCT | 21 | 0 | Chr X23(+) | AATGAA | Lymph node | HES3, H1, H9, H13 |
| 146 | AGACAGAGAG | 21 | 0 | Chr 2(+) | AATAAA | Astrocytoma, ependymoma, glioblastoma, CD4 T cells, CD34+ cells, monocytes | HES3, HES4, H1, H7, H9, H13, H14, BG01 |
| 147 | CTGACCGACA | 14 | 0 | Chr 11(−) | ATTAAA | None | HES3, HES4, H1, H7, H9, H13, H14, HSF6, BG01 |
| 149 | TTGACAAAGT | 26 | 0 | Chr 2(−) | AATAAA | Lung adenocarcinoma, breast carcinoma, medulloblastoma, muscle, placenta | HES3, HES4, H1, H7, H9, H13, H14, HSF6 |
| 152 | TAGAACTGTA | 26 | 0 | Chr 4(−) | AATAAA | Cervix, astrocytoma, medulloblastoma, breast carcinoma, breast normal, lymph node, CD4 cells, liver cholangiocarcinoma | HES3, HES4, H1, H7, H9, H13, H14, HSF6, BG01 |
| 162 | CAGTTGTGAA | 10 | 36 | Chr 10(−) | AATAAA | Medulloblastoma, brain, vascular endothelium, thyroid | HES3, HES4, H1, H7, H9, H13, H14 |
| 188 | CAGCCCCCAG | 22 | 0 | Chr 2(+) | AATAAA | Many SAGE libraries | HES3, HES4, H1, H9, H13, HSF6, BG01 |
| 198 | ACAATCAAGA | 16 | 12 | Chr 10(−) | AATAAA | Many SAGE libraries | HES3, HES4, H1, H13, HSF6 |

Abbreviations: Chr, chromosome; ES, embryonic stem; HES, human embryonic stem; HEST, human embryonic stem cell serial analysis of gene expression tag; SAGE, serial analysis of gene expression; tpm, tags per million.

to the hypothetical protein FLJ10884, a hypothetical protein restricted in its expression to the testis, placenta, and hESC lines, instead of NPY.
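The effect of an SNP on tag assignment described above can be illustrated with a minimal Python sketch. It assumes the standard SAGE convention that a tag is the 10 nt immediately following the 3′-most NlaIII site (CATG). The function names and the toy transcript are hypothetical and are not taken from the study; only the two GJA1 tags are quoted from the text.

```python
from typing import List, Optional, Tuple

NLAIII_SITE = "CATG"  # NlaIII recognition sequence
TAG_LENGTH = 10       # length of the tags quoted in the text


def three_prime_most_tag(cdna: str) -> Optional[str]:
    """Return the tag that follows the 3'-most NlaIII site, or None."""
    seq = cdna.upper()
    site = seq.rfind(NLAIII_SITE)
    if site == -1:
        return None
    start = site + len(NLAIII_SITE)
    tag = seq[start:start + TAG_LENGTH]
    # Simplification: a site too close to the 3' end yields no tag here.
    return tag if len(tag) == TAG_LENGTH else None


def mismatches(tag_a: str, tag_b: str) -> List[Tuple[int, str, str]]:
    """List the 1-based positions at which two equal-length tags differ."""
    if len(tag_a) != len(tag_b):
        raise ValueError("tags must have the same length")
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(tag_a, tag_b), start=1)
            if a != b]


# Hypothetical toy transcript: one NlaIII site followed by the canonical
# GJA1 tag, with an AATAAA signal near the 3' end.
reference = "GGACATGTGTTCTGGAGTTCAAGCCTTAGGACTTAATAAAGC"
# A single A->T substitution turns CAAG into a new, more 3' CATG site.
variant = reference.replace("CAAG", "CATG", 1)

print(three_prime_most_tag(reference))
print(three_prime_most_tag(variant))
print(mismatches("TGTTCTGGAG", "TGTTTTGGAG"))
```

The first two calls mimic the CHD8 situation, in which a new NlaIII site upstream of the polyadenylation signal shifts the tag. The last call mimics the GJA1 situation, in which the SNP lies inside the tag itself.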

Antisense Transcription in hESCs

BLAT and BLAST searches revealed that many of the HESTs were the products of antisense transcription. Interestingly, cis-NATs for several important ES-specific genes, such as NANOG (HEST16), POU5F1 (HEST88), and LIN28 (HEST168), were

identified by our rSAGE results (supplemental online Table 2). Analyzing the chromosomal location of these cis-NATs and the corresponding sense tags from the HES3 library revealed the presence of sense-antisense (SA) gene pairs [34, 35, 38]. Table 2 is a list of 18 SA SAGE tag pairs and the corresponding antisense HESTs that were experimentally obtained with rSAGE. Although several SA SAGE tag pairs can be mapped in trans to remote genomic loci, other pairs mapped in cis on contiguous oppositely oriented DNA strands (Fig. 4A). Besides

Table 2. Sense-antisense SAGE tag pairs of antisense HESTs

| HEST | Gene | cis/trans | Antisense SAGE tag | Antisense HES3 (tpm) | Antisense D-HES3 (tpm) | Sense SAGE tag | Sense HES3 (tpm) | Sense D-HES3 (tpm) |
|---|---|---|---|---|---|---|---|---|
| 167 | COP1 | trans | GAGTTACATT | 31 | 48 | GTGTTGCACA | 783 | 313 |
| 151 | ERH | cis | GCTAAACTGC | 119 | 120 | TCCTCAAGAT | 524 | 445 |
| 129 | CN332624 | trans | CGAACAAAAG | 93 | 132 | GGAACAAACA | 1,359 | 228 |
| 106 | FSCN1 | cis | GGCGTTTAGA | 125 | 96 | ATAGTAGCTT | 166 | 685 |
| 115 | ILF2 | trans | TAAAGCCCAG | 57 | 48 | GTGACAGACA | 296 | 48 |
| 178 | KHDRBS1 | trans | GTGGTGCCTA | 31 | 48 | TGTAAGTCTG | 213 | 48 |
| 87 | KIF2C | trans | GTCCTGGTGG | 161 | 216 | GGACACTCCT | 21 | 24 |
| 168 | LIN28 | cis | GAGTTACATT | 31 | 48 | TTTACTGCTA | 249 | 84 |
| 16 | NANOG | cis | TCATTACGAT | 36 | 0 | AGTACTACTT | 57 | 0 |
| 107 | NAP1L1 | trans | AGGTAGTTAG | 114 | 60 | TTCATTCATT | 52 | 0 |
| 88 | POU5F1 | cis | ATGTGGGATT | 187 | 24 | TATCACTTTT | 970 | 48 |
| 83 | RPLP1 | cis | TTATAAAAGA | 602 | 277 | TTCAATAAAA | 4,291 | 2,429 |
| 52.2 | SF3A3 | trans | AGATTACATA | 16 | 12 | CTGGCAGATT | 119 | 180 |
| 157 | SNRPD2 | cis | TTGCAGTGCC | 47 | 12 | GTGCTGGAGA | 316 | 180 |
| 155 | TERA | cis | ACTACATACA | 36 | 24 | CACTTTGTAT | 228 | 0 |
| 109 | TGIF | cis | GGAATGAGAA | 135 | 0 | TGGAACAGGA | 171 | 48 |

Abbreviations: HEST, human embryonic stem cell serial analysis of gene expression tag; SAGE, serial analysis of gene expression; tpm, tags per million.

POU5F1, NANOG, and LIN28, a number of other highly expressed hESC-specific genes, like TGIF/TALE (HEST109), ERH (HEST151), TERA (HEST155), and TERF1 (HEST193.2), also expressed cis-NATs. Furthermore, the representation of many of these co-expressed SA SAGE tag pairs decreased upon differentiation of the hESCs (Table 2). The SAGE tags for the NANOG (TCATTACGAT) and POU5F1 (ATGTGGGATT) cis-NATs were found only in hESC SAGE libraries, indicating that the expression pattern of the cis-NATs for NANOG and POU5F1 is even more restricted than that of their sense transcript counterparts. To validate that the cis-NATs for POU5F1, NANOG, LIN28, TALE, TERF1, and TERA were specifically expressed in hESCs, orientation-specific RT-PCR [33, 49] was carried out using total RNA isolated from HES3, a universal reference RNA sample (Stratagene), testis, and stomach (Fig. 4B). First-strand cDNAs were prepared using primers specific to POU5F1, NANOG, LIN28, TALE, TERF1, and TERA, respectively. Specific RT-PCR products for all three cis-NATs were detected only when RT was included, thus confirming that these cis-NATs were specifically expressed in hESCs and not due to spurious PCR amplification. HEST115 and HEST168 appeared to represent spliced SA transcripts from ILF2 and LIN28, respectively. Nucleotides (nts) 1–42 of HEST115 matched the ILF2 coding region in the antisense orientation (Chr1[+]: 150447872–150447913), whereas nts 24–222 matched the sense orientation (Chr1[−]: 150447587–150447785). Likewise, nts 1–133 of HEST168 matched the LIN28 coding region in the antisense orientation (Chr1[−]: 26439918–26440050), whereas nts 131–171 matched the sense orientation (Chr1[+]: 26440310–26440350). This novel sense-antisense RNA hybrid structure was originally reported for the cardiac troponin I gene in rat hearts [50]. The structure of the cardiac troponin I “hybrid RNA,” which the authors themselves have tentatively concluded to be formed from the transcription of the troponin mRNA in the cytoplasm, is very similar to what we have described for ILF2 and LIN28. The functional significance of these hybrid RNAs is currently unknown.
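The strand pattern reported for HEST115 and HEST168 can be checked with a small Python sketch. The coordinates are those quoted in the text. The `Block` class, the function names, and the gene strands (ILF2 on the minus strand, LIN28 on the plus strand) are illustrative assumptions inferred from the orientations stated above, not part of the original analysis.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Block:
    """One aligned segment of an rSAGE cDNA (1-based, inclusive coordinates)."""
    query_start: int
    query_end: int
    chrom: str
    strand: str
    target_start: int
    target_end: int


def lengths_agree(block: Block) -> bool:
    """Check that the query span and the genomic span have the same length."""
    return (block.query_end - block.query_start
            == block.target_end - block.target_start)


def orientations(blocks: List[Block], gene_strand: str) -> List[str]:
    """Label each block relative to the strand of the annotated gene."""
    return ["sense" if b.strand == gene_strand else "antisense" for b in blocks]


def is_sense_antisense_hybrid(blocks: List[Block], gene_strand: str) -> bool:
    """True when a single cDNA carries both sense and antisense blocks."""
    return set(orientations(blocks, gene_strand)) == {"sense", "antisense"}


# Coordinates as quoted in the text; gene strands are inferred assumptions.
HEST115 = [Block(1, 42, "chr1", "+", 150447872, 150447913),
           Block(24, 222, "chr1", "-", 150447587, 150447785)]
HEST168 = [Block(1, 133, "chr1", "-", 26439918, 26440050),
           Block(131, 171, "chr1", "+", 26440310, 26440350)]

for name, blocks, gene_strand in (("HEST115", HEST115, "-"),
                                  ("HEST168", HEST168, "+")):
    print(name, orientations(blocks, gene_strand),
          is_sense_antisense_hybrid(blocks, gene_strand))
```

A cDNA whose blocks all map to one strand would return `False`, so the same check could serve as a coarse filter before manual inspection of the alignments.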

Figure 4. Confirmation of natural antisense transcription in HES3 cells. (A): Illustration of the concept of cis and trans serial analysis of gene expression AS tag pairs. (B): Expression of POU5F1, NANOG, LIN28, TALE, TERA, and TERF1 cis-natural antisense transcripts (NATs). For amplification of cis-NATs, sense-specific primers were used for reverse transcription (RT) instead of an oligo(dT) primer. During the subsequent polymerase chain reaction amplification, sense and antisense primers were used. Total RNA that had not been reverse-transcribed was used as a template control for genomic DNA contamination (−RT). Abbreviations: AS, antisense; bp, base pairs.
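The statement that many SA tag pairs decrease upon differentiation can be tabulated from Table 2 with a short Python sketch. The tpm values below are a subset copied from Table 2. The function names and the library size in the usage line are hypothetical, and the simple "lower in D-HES3 than in HES3" criterion is an illustrative assumption rather than the statistical test used in the study.

```python
from typing import Dict, List, Tuple

TpmPair = Tuple[int, int]  # (HES3 tpm, D-HES3 tpm)


def tags_per_million(count: int, library_size: int) -> float:
    """Normalize a raw tag count to tags per million (tpm)."""
    return count * 1_000_000 / library_size


# Subset of Table 2: gene -> (antisense tag tpm pair, sense tag tpm pair).
TABLE2_SUBSET: Dict[str, Tuple[TpmPair, TpmPair]] = {
    "ERH": ((119, 120), (524, 445)),
    "LIN28": ((31, 48), (249, 84)),
    "NANOG": ((36, 0), (57, 0)),
    "TERA": ((36, 24), (228, 0)),
    "TGIF": ((135, 0), (171, 48)),
}


def pairs_decreasing(table: Dict[str, Tuple[TpmPair, TpmPair]]) -> List[str]:
    """Genes whose antisense and sense tags are both lower in D-HES3."""
    return [gene
            for gene, ((as_hes3, as_dhes3), (s_hes3, s_dhes3)) in table.items()
            if as_dhes3 < as_hes3 and s_dhes3 < s_hes3]


print(pairs_decreasing(TABLE2_SUBSET))
# Hypothetical library of 60,000 tags in which a tag was observed 3 times.
print(tags_per_million(3, 60_000))
```

Because low tpm values correspond to only a handful of observed tags, a real analysis would normally apply a count-based significance test before calling a change.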

DISCUSSION

Unlike DNA microarrays, SAGE does not require prior knowledge of the sequences to be analyzed. Hence, SAGE libraries provide discrete and unbiased directional gene expression data that are ideally suited for gene discovery and SA expression analysis [35, 38]. Although MPSS [51] is capable of deeper coverage of the gene expression profile, it requires specialized reagents and equipment, and this has restricted the availability of MPSS libraries for various human tissues and cell types, including those for hESCs. On the other hand, SAGE comprises several standard molecular biology techniques and can be adapted for microanalysis [52, 53]. This has resulted in the

construction of SAGE libraries from a large variety of human cell types and tissues, and they are an important resource for the discovery of novel genes and NATs [38, 54, 55]. Although the human transcriptome is necessarily less complex than the human genome, it is quite apparent that transcriptome complexity has been underestimated [34, 35, 38, 44]. Noncoding RNA, regulatory RNA, NATs, and novel splice variants add to the multifaceted nature of the transcriptome. In the present study, we have used a modified rSAGE strategy to convert selected orphan SAGE tags from hESCs into longer 3′ cDNAs. It has facilitated the identification of isoforms due to splicing, alternative polyadenylation, and SNPs. A large number of novel hESC-specific genes have also been identified, indicating that the hESC transcriptome is indeed poorly characterized [12]. This is also the first description of cis-NATs from several key pluripotency genes that are involved in the maintenance of hESC self-renewal, suggesting that SA transcript pairing might be a key regulatory mechanism [31]. A recent study reported that 41.5% of SA transcript overlaps occurred in the last exon or untranslated region (UTR) of the coding sequence [34]. We have found that overlaps between the cis-NATs of LIN28, NANOG, and POU5F1 and their corresponding sense transcripts occurred in the 3′ UTR of the coding sequence as well. Although the exact significance of this positional overlap is unknown, UTRs are believed to contribute toward the localization, stability, and translational control of mRNA transcripts. Indeed, the finding that >30% of vertebrate mRNAs show orthologue-specific conservation of 3′ UTRs suggests a possible functional or regulatory role for UTR sequences [56]. The recent finding that many of the human SA gene pairs are also detected in mouse, rat, and fugu and are

probably conserved throughout the course of vertebrate evolution [57] lends some support to the notion that cis-NATs are not due to a “leakage” of the transcriptional apparatus but rather that their abundance is the result of active transcription. For POU5F1 and NANOG, we have ruled out the possibility that their cis-NATs are due to the insertion of an L1 retrotransposon [58]. However, because there are several pseudogenes for POU5F1 and NANOG, the possibility of trans-NATs from these genomic loci remains to be determined. Several reports have hinted that the contribution of NATs in the human genome has been underestimated [34, 35] and that up to 25% of human transcripts might form natural SA pairs. Although initial studies indicated that there was no correlation between NATs and their function or localization [34], a more recent survey of SA pairs confirmed that they are predominant for genes involved in translation regulator activity, DNA damage response, and cell growth, whereas non-SA transcripts were found to have a significantly different functional distribution [35]. Several of the human ES NATs and SA gene pairs we have identified are representative of genes that code for transcription factors and RNA-binding proteins, whereas SA gene pairs for ubiquitously expressed genes, such as glyceraldehyde-3-phosphate dehydrogenase and ACTB, were not present in the HES3 SAGE library. The fact that SA transcripts have a significantly higher probability of involvement in translation regulator activity and are more frequently located in both the nucleus and cytoplasm [35] is compatible with a role in antisense-mediated gene regulation occurring in both the nucleus and cytoplasm and at the transcription and translation levels [31]. Although certain human miRNAs (miR-1 and miR-124) have been recently demonstrated to influence and define tissue-specific gene expression profiles in HeLa cells [59], the functional roles of cis-NATs in a similar context have not been previously reported. Since cis-NATs are also capable of regulating gene expression through RNA masking, transcriptional interference, or RNA interference [31, 32], the identification of cis-NATs for POU5F1 and NANOG prompted us to determine whether cis-NATs might be commonly expressed for other key regulators that are involved in the maintenance of pluripotency in hESCs. Both the mouse and the human SAGE libraries were searched for the presence of SAGE tags representing the cis-NATs for ES-specific genes [12, 60]. We failed to find SAGE tags representing UTF1, REX1, LEFTB, and GDF3 cis-NATs in human and mouse SAGE libraries. However, we detected cis-NATs for a number of key ES-specific genes (e.g., FGFR1, FGFR2, TDGF1, SOX2) in HES3 and in SAGE libraries constructed from other hESC lines (Table 3). In addition, SAGE tags representing the cis-NATs for pou5f1, nanog, tera, and lin28 were also detected in mouse embryonic stem cells (mESCs). In summary, cis-NATs for a number of ES-specific genes, such as POU5F1 and NANOG, were shown to be expressed in both hESCs and mESCs, and it is possible that some of these cis-NATs might have a role in maintaining the “stemness” phenotype of ES cells.

Table 3. Occurrence of SAGE tags of cis-natural antisense transcripts of selected embryonic stem-specific genes in human and mouse embryonic stem cell SAGE libraries

| Gene | Human cis-AS tag | HES3 (tpm) | CGAP hES cell SAGE libraries | Mouse cis-AS tag | R1 (tpm) | Other mES/mEC/mEG cell SAGE libraries |
|---|---|---|---|---|---|---|
| POU5F1 | ATGTGGGATT | 161 | H1, H9, H7, H13, H14, HES3, HES4, HSF6 | GGAGAGCCCA | 7 | P19 (15 tpm) |
| NANOG | TCATTACGAT | 36 | H1, H9, H13, H14, HES3, HES4, HSF6 | GGTGGGTGGG | 0 | D3 (53 tpm), P19 (18 tpm), EG-1 (21 tpm) |
| SOX2 | GATTCTCGGC | 21 | H1, H9, H7, H13, HES3, HES4, HSF6 | TGCGACAGGG | 0 | |
| TERF1 | GATGGAGGAC | 0 | H9 | TCTGTCTTTT | 0 | |
| CHK2 | TCTTCATCCT | 0 | H7, HES4 | TGTAGTAGGT | 7 | |
| DNMT3B | CCTACAACTG | 0 | H1, H9, HES3, HES4 | AAAATAGAAG | 0 | |
| TDGF1 | TATCAAGAAA | 0 | H1, H9, H7, H13, H14, HES3, HES4, HSF6 | AACAATTATT | 0 | |
| CRABP1 | GGGAAAACGG | 21 | H1, H7, H9, HES4, HSF6 | CAAATGCCAA | 7 | |
| FZD7 | TGATAAAGTG | 31 | H1, H9, H7, H14, HES3, HES4, HSF6 | TGATAAAGTG | 7 | |
| FGFR1 | TAAATATATA | 16 | H1, H9, H13, HES4, HES4, HSF6 | CCAGCAGTCC | 0 | |
| FGFR2 | ATCTCGGCTC | 21 | H1, H9, H7, H13, H14, HES3, HES4, HSF6 | TTCCAACTGT | 0 | |
| LIN28 | GAGTTACATT | 31 | H1, H9, H7, H13, H14, HES3, HES4, HSF6 | TGTCACAGGA | 7 | |
| TERA | ACTACATACA | 36 | H1, H9, H7, H13, H14, HES3, HES4, HSF6 | ACTACATACA | 14 | |

Entries of the last column whose row assignment is not recoverable from the source layout, in order of appearance: P19 EC (30 tpm); EG-1 (42 tpm), D3 (26 tpm); D3 (26 tpm), P19 (61 tpm), EG-1 (63 tpm); D3 (53 tpm), P19 (76 tpm), EG-1 (105 tpm).

Abbreviations: AS, antisense; CGAP, The Cancer Genome Anatomy Project; HES, human embryonic stem; mEC, mouse embryonic carcinoma; mEG, mouse embryonic germ; mES, mouse embryonic stem; SAGE, serial analysis of gene expression; tpm, tags per million.
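The library search for cis-NAT tags described above can be sketched in Python as follows. The sketch relies on the fact that the NlaIII site CATG is its own reverse complement, so each CATG in a sense transcript also anchors a potential tag on the opposite strand. Which site is the 3′-most one for a real antisense transcript depends on where that transcript ends, so the function returns all candidates. The function names, the toy sequence, and the toy libraries are hypothetical; only the NANOG cis-NAT tag TCATTACGAT is taken from the text.

```python
from typing import Dict, List

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper- or lower-case DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def candidate_antisense_tags(sense_seq: str, tag_length: int = 10) -> List[str]:
    """Candidate tags of transcripts from the strand opposite to sense_seq.

    On the opposite strand the tag follows CATG, which corresponds to the
    reverse complement of the bases immediately 5' of CATG on the sense strand.
    """
    seq = sense_seq.upper()
    tags = []
    pos = seq.find("CATG")
    while pos != -1:
        if pos >= tag_length:  # skip sites too close to the 5' end
            tags.append(reverse_complement(seq[pos - tag_length:pos]))
        pos = seq.find("CATG", pos + 1)
    return tags


def libraries_with_tag(tag: str, libraries: Dict[str, Dict[str, int]]) -> List[str]:
    """Names of the libraries in which the tag was observed at least once."""
    return sorted(name for name, counts in libraries.items()
                  if counts.get(tag, 0) > 0)


# Hypothetical toy sequence and tag counts, for illustration only.
toy_sense = "TTACGATCGGACATGAAAC"
toy_libraries = {"lib_A": {"TCATTACGAT": 3}, "lib_B": {"GGGGGGGGGG": 1}}

print(candidate_antisense_tags(toy_sense))
print(libraries_with_tag("TCATTACGAT", toy_libraries))
```

A match found this way only suggests antisense transcription. As in the study, it would still need experimental confirmation, for example by orientation-specific RT-PCR.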

Our study further underscores the importance of obtaining longer 3′ cDNAs from orphan SAGE tags and the versatility of rSAGE as a powerful complementary tool to SAGE expression libraries for gene discovery. Lastly, the hESC-specific transcripts that we have described are clear targets for further study, and the conversion of the remaining orphan SAGE tags from HES3 and other hESCs would likely provide additional valuable resources, mainly in terms of novel transcripts, and uncover additional cis-NATs for the in-depth functional dissection of the molecular pathways involved in the self-renewal of pluripotent hESCs and their subsequent lineage commitment to their differentiated progenies.

ACKNOWLEDGMENTS

This study was supported by Embryonic Stem Cell International Pte. Ltd. grant R-174-000-081-592 and National University of Singapore Academic Research Fund grant R-154-000-179-112.

DISCLOSURES

The authors indicate no potential conflicts of interest.

REFERENCES

1 Thomson JA, Itskovitz-Eldor J, Shapiro SS et al. Embryonic stem cell lines from human blastocysts. Science 1998;282:1145–1147.
2 Reubinoff BE, Pera MF, Fong CY et al. Embryonic stem cell lines from human blastocysts: Somatic differentiation in vitro. Nat Biotechnol 2000;18:399–404.
3 Richards M, Fong CY, Chan WK et al. Human feeders support prolonged undifferentiated growth of human inner cell masses and embryonic stem cells. Nat Biotechnol 2002;20:933–936.
4 Mayhall EA, Lugassy N, Zon LI. The clinical potential of stem cells. Curr Opin Cell Biol 2004;16:713–720.
5 Edwards RG. Stem cells today: A. Origin and potential of embryo stem cells. Reprod Biomed Online 2004;8:275–306.
6 Rao M. Conserved and divergent paths that regulate self-renewal in mouse and human embryonic stem cells. Dev Biol 2004;275:269–286.
7 Loring JF, Porter JG, Seilhammer J et al. A gene expression profile of embryonic stem cells and embryonic stem cell-derived neurons. Restor Neurol Neurosci 2001;18:81–88.
8 Sato N, Sanjuan IM, Heke M et al. Molecular signature of human embryonic stem cells and its comparison with the mouse. Dev Biol 2003;260:404–413.
9 Sperger JM, Chen X, Draper JS et al. Gene expression patterns in human embryonic stem cells and human pluripotent germ cell tumors. Proc Natl Acad Sci U S A 2003;100:13350–13355.
10 Abeyta MJ, Clark AT, Rodriguez RT et al. Unique gene expression signatures of independently-derived human embryonic stem cell lines. Hum Mol Genet 2004;13:601–608.
11 Bhattacharya B, Miura T, Brandenberger R et al. Gene expression in human embryonic stem cell lines: Unique molecular signature. Blood 2004;103:2956–2964.
12 Richards M, Tan SP, Tan JH et al. The transcriptome profile of human embryonic stem cells as defined by SAGE. STEM CELLS 2004;22:51–64.
13 Brandenberger R, Wei H, Zhang S et al. Transcriptome characterization elucidates signaling networks that control human ES cell growth and differentiation. Nat Biotechnol 2004;22:707–716.
14 Brandenberger R, Khrebtukova I, Thies RS et al. MPSS profiling of human embryonic stem cells. BMC Dev Biol 2004;4:10.
15 Wei CL, Miura T, Robson P et al. Transcriptome profiling of human and murine ESCs identifies divergent paths required to maintain the stem cell state. STEM CELLS 2005;23:166–185.
16 Sato N, Meijer L, Skaltsounis L et al. Maintenance of pluripotency in human and mouse embryonic stem cells through activation of Wnt signaling by a pharmacological GSK-3-specific inhibitor. Nat Med 2004;10:55–63.
17 James D, Levine AJ, Besser D et al. TGFβ/activin/nodal signaling is necessary for the maintenance of pluripotency in human embryonic stem cells. Development 2005;132:1273–1282.
18 Lee S, Zhou G, Clark T et al. The pattern of gene expression in human CD15+ myeloid progenitor cells. Proc Natl Acad Sci U S A 2001;98:3340–3345.
19 Zhou G, Chen J, Lee S et al. The pattern of gene expression in human CD34(+) stem/progenitor cells. Proc Natl Acad Sci U S A 2001;98:13966–13971.
20 Velculescu VE, Zhang L, Vogelstein B et al. Serial analysis of gene expression. Science 1995;270:484–487.
21 Saha S, Sparks AB, Rago C et al. Using the transcriptome to annotate the genome. Nat Biotechnol 2002;20:508–512.
22 Matsumura H, Reich S, Ito A et al. Gene expression analysis of plant host-pathogen interactions by SuperSAGE. Proc Natl Acad Sci U S A 2003;100:15718–15723.
23 Boon K, Osorio EC, Greenhut SF et al. An anatomy of normal and malignant gene expression. Proc Natl Acad Sci U S A 2002;99:11287–11292.
24 Chen J, Sun M, Lee S et al. Identifying novel transcripts and novel genes in the human genome by using novel SAGE tags. Proc Natl Acad Sci U S A 2002;99:12257–12262.
25 Venter JC, Adams MD, Myers EW et al. The sequence of the human genome. Science 2001;291:1304–1351.
26 Lander ES, Linton LM, Birren B et al. Initial sequencing and analysis of the human genome. Nature 2001;409:860–921.
27 Wang DG, Fan JB, Siao CJ et al. Large scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science 1998;280:1077–1082.
28 Sachidanandam R, Weissman D, Schmidt SC et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 2001;409:928–933.
29 Silva AP, de Souza JE, Galante PA et al. The impact of SNPs on the interpretation of SAGE and MPSS experimental data. Nucleic Acids Res 2004;32:6104–6110.
30 Kumar M, Carmichael GG. Antisense RNA: Function and fate of duplex RNA in cells of higher eukaryotes. Microbiol Mol Biol Rev 1998;62:1415–1434.
31 Lavorgna G, Dahary D, Lehner B et al. In search of antisense. Trends Biochem Sci 2004;29:88–94.
32 Lehner B, Williams G, Campbell RD et al. Antisense transcripts in the human genome. Trends Genet 2002;18:63–65.
33 Shendure J, Church GM. Computational discovery of sense-antisense transcription in the human and mouse genomes. Genome Biol 2002;research:0044.1–0044.14.
34 Yelin R, Dahary D, Sorek R et al. Widespread occurrence of antisense transcription in the human genome. Nat Biotechnol 2003;21:379–386.
35 Chen J, Sun M, Kent WJ et al. Over 20% of human transcripts might form sense-antisense pairs. Nucleic Acids Res 2004;32:4812–4820.
36 Suh MR, Lee Y, Kim JY et al. Human embryonic stem cells express a unique set of microRNAs. Dev Biol 2004;270:488–498.
37 Schuler GD, Boguski MS, Stewart EA et al. A gene map of the human genome. Science 1996;274:540–546.
38 Quere R, Manchon L, Lejeune M et al. Mining SAGE data allows large-scale, sensitive screening of antisense transcript expression. Nucleic Acids Res 2004;32:e163.
39 Patankar S, Munasinghe A, Shoaibi A et al. Serial analysis of gene expression in Plasmodium falciparum reveals the global expression profile of erythrocytic stages and the presence of anti-sense transcripts in the malarial parasite. Mol Biol Cell 2001;12:3114–3125.
40 Gunasekera AM, Patankar S, Schug J et al. Widespread distribution of antisense transcripts in the Plasmodium falciparum genome. Mol Biochem Parasitol 2004;136:35–42.
41 Polyak K, Xia Y, Zweier JL et al. A model for p53-induced apoptosis. Nature 1997;389:300–305.
42 Yu J, Zhang L, Hwang PM et al. Identification and classification of p53-regulated genes. Proc Natl Acad Sci U S A 1999;96:14517–14522.
43 Chen JJ, Rowley JD, Wang SM. Generation of longer cDNA fragments from serial analysis of gene expression tags for gene identification. Proc Natl Acad Sci U S A 2002;97:349–353.
44 Chen J, Lee S, Zhou G et al. High-throughput GLGI procedure for converting a large number of serial analysis of gene expression tag sequences into 3′ complementary DNAs. Genes Chromosomes Cancer 2002;33:252–261.
45 van den Berg A, van der Leij J, Poppema S. Serial analysis of gene expression: Rapid RT-PCR analysis of unknown SAGE tags. Nucleic Acids Res 1999;27:e17.
46 Richards M, Tan S, Fong CY et al. Comparative evaluation of various human feeders for prolonged undifferentiated growth of human embryonic stem cells. STEM CELLS 2003;21:546–556.
47 Silva AP, Chen J, Carraro DM et al. Generation of longer 3′ cDNA fragments from massively parallel signature sequencing tags. Nucleic Acids Res 2004;32:e94.
48 Tian B, Hu J, Zhang H et al. A large-scale analysis of mRNA polyadenylation of human and mouse genes. Nucleic Acids Res 2005;33:201–212.
49 Rosok O, Sioud M. Systematic identification of sense-antisense transcripts in mammalian cells. Nat Biotechnol 2004;22:104–108.
50 Bartsch H, Voigtsberger S, Baumann G et al. Detection of a novel sense-antisense RNA-hybrid structure by RACE experiments on endogenous troponin I antisense RNA. RNA 2004;10:1215–1224.
51 Brenner S, Johnson M, Bridgham J et al. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat Biotechnol 2000;18:630–634.
52 Datson NA, van der Perk-de Jong J, van den Berg MP et al. MicroSAGE: A modified procedure for serial analysis of gene expression in limited amounts of tissue. Nucleic Acids Res 1999;27:1300–1307.
53 Vilain C, Libert F, Venet D et al. Small amplified RNA-SAGE: An alternative approach to study transcriptome from limiting amount of mRNA. Nucleic Acids Res 2003;31:e24.
54 Boheler KR, Stern MD. The new role of SAGE in gene discovery. Trends Biotechnol 2003;21:55–57.
55 Dinel S, Bolduc C, Belleau P et al. Reproducibility, bioinformatic analysis and power of the SAGE method to evaluate changes in transcriptome. Nucleic Acids Res 2005;33:e26.
56 Lipman DJ. Making (anti)sense of non-coding sequence conservation. Nucleic Acids Res 1997;25:3580–3583.
57 Dahary D, Elory-Stein O, Sorek R. Naturally occurring antisense: Transcriptional leakage or real overlap. Genome Res 2005;15:364–368.
58 Han JS, Szak ST, Boeke JD. Transcriptional disruption by the L1 retrotransposon and implications for mammalian transcriptomes. Nature 2004;429:268–274.
59 Lim LP, Lau NC, Garrett-Engele P et al. Microarray analysis shows that some microRNAs downregulate large numbers of target mRNAs. Nature 2005;433:769–773.
60 Pera MF, Trounson AO. Human embryonic stem cells: Prospects for development. Development 2004;131:5515–5525.

See www.StemCells.com for supplemental material available online.