
Expansion Mechanisms and Functional Annotations of Hypothetical Genes in the Rice Genome[W]

Shu-Ye Jiang, Alan Christoffels, Rengasamy Ramamoorthy, and Srinivasan Ramachandran*

Rice Functional Genomics Group, Temasek Life Sciences Laboratory, National University of Singapore, Singapore 117604 (S.-Y.J., R.R., S.R.); and South African National Bioinformatics Institute, University of the Western Cape, Bellville 7535, South Africa (A.C.)

In each completely sequenced genome, 30% to 50% of genes are annotated as uncharacterized hypothetical genes. In the rice (Oryza sativa) genome, 10,918 hypothetical genes were annotated in the latest version (release 6) of the Michigan State University rice genome annotation. We have implemented an integrative approach to analyze their duplication/expansion and function. The analyses show that tandem/segmental duplication and transposition/retrotransposition have significantly contributed to the expansion of hypothetical genes, although at different rates. A total of 3,769 hypothetical genes have been detected from retrogene, tandem, segmental, Pack-MULE, or long terminal repeat-related duplication/expansion. Analyses of the nonsynonymous and synonymous substitutions per site showed that 21.65% of them were still functional, accounting for 7.47% of total hypothetical genes. Global expression analyses have identified 1,672 expressed hypothetical genes. Among them, 415 genes might function in a developmental stage-specific manner. Antisense strand expression and small RNA analyses suggest that a high percentage of these hypothetical genes might play important roles in negatively regulating gene expression. Homology searches against the Arabidopsis (Arabidopsis thaliana), maize (Zea mays), sorghum (Sorghum bicolor), and indica rice genomes suggest that most of the hypothetical genes could be annotated from recently evolved genomic sequences. These data advance the understanding of rice hypothetical genes, indicating that they are involved in lineage-specific expansion and that they function at specific developmental stages. Our analyses also provide a valuable means to facilitate the characterization and functional annotation of hypothetical genes in other organisms.

* Corresponding author; e-mail [email protected].
The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Srinivasan Ramachandran ([email protected]).
[W] The online version of this article contains Web-only data.
www.plantphysiol.org/cgi/doi/10.1104/pp.109.139402

Rapid progress has been achieved in genome sequencing since the first genome of a cellular organism, Haemophilus influenzae, was completely sequenced in 1995 (Fleischmann et al., 1995). It is not difficult to sequence a genome today. To date, genomes from 104 eukaryotes, 801 bacteria, and 56 archaea have been completely sequenced and published, and the sequencing of 1,029 eukaryotic, 2,422 bacterial, and 100 archaeal genomes is in progress, based on the Genome Online Database version 2.0 (Liolios et al., 2008; http://www.genomesonline.org/gold.cgi, March 26, 2009). Along with the availability of more complete genome sequences, various genome annotation tools have been developed and corresponding databases established. For example, several rice (Oryza sativa) genome annotation databases are now publicly available for research, such as the Michigan State University (MSU) rice genome annotation database (previously The Institute for Genomic Research rice genome annotation database, now moved to MSU; http://rice.plantbiology.msu.edu/; Yuan et al., 2005; Ouyang et al., 2007), the Rice Annotation Project Database (http://rapdb.dna.affrc.go.jp/; Ohyanagi et al., 2006; Rice Annotation Project, 2008), and RiceGAAS (http://ricegaas.dna.affrc.go.jp/; Sakata et al., 2002).

However, in each completely sequenced genome, 30% to 50% of genes are annotated as either “hypothetical genes” or “conserved hypothetical genes” (Kolker et al., 2004, 2005; Roberts, 2004; Sivashankari and Shanmughavel, 2006). Hypothetical genes, which are predicted solely by computer algorithms, are experimentally uncharacterized genes, and their functions cannot be deduced from simple sequence comparisons, as they lack sequence similarity with known proteins or domains (Kolker et al., 2004, 2005; Ouyang et al., 2007). If their proteins are homologous to proteins of unknown function in other organisms, they are typically referred to as conserved hypothetical proteins (http://rice.plantbiology.msu.edu/new.shtml; Sivashankari and Shanmughavel, 2006). Some hypothetical genes may be due to misannotation. However, evidence has shown that up to 33% of the predicted hypothetical genes were expressed in Shewanella oneidensis, and they may function in various cellular processes, including energy conversion, ion transport, secondary metabolism, and signal transduction (Kolker et al., 2005). A necessary step to confirm their existence is to obtain evidence of gene or protein expression. As a result, some previously annotated hypothetical genes were regarded as “expressed or known genes” after their expression or functional data were available

Plant Physiology®, August 2009, Vol. 150, pp. 1997–2008, www.plantphysiol.org © 2009 American Society of Plant Biologists


Jiang et al.

(Saha et al., 2002; Kolker et al., 2004; Redman et al., 2004; Shin et al., 2004; Prasad et al., 2005; Xiao et al., 2005). Therefore, hypothetical genes should be regarded as a mixture of functional genes, misannotated genes, and pseudogenes. Meanwhile, the number of hypothetical genes is growing rapidly as more genomes are completely sequenced and annotated. The increasing number of these genes poses a major challenge to efforts to understand complete genomes, since many hypothetical genes have been shown to play important roles in various biological processes (Kolker et al., 2004, 2005). To meet this challenge, new approaches are needed to efficiently exclude misannotated hypothetical genes and to provide experimental evidence for functional hypothetical genes at the genome-wide level. However, to our knowledge, no data have been reported on the systematic expression analysis of rice hypothetical genes, despite the availability of various genome-wide expression data such as microarray (http://www.ncbi.nlm.nih.gov/geo/) and massively parallel signature sequencing (MPSS; Nobuta et al., 2007; http://mpss.udel.edu/rice/) data.

In this report, we first identified and characterized all annotated hypothetical genes from the japonica rice genome. We then analyzed their genome distribution and organization as well as the domain/motif structures of their deduced protein sequences by comparing them with known and transposon element-related genes/proteins. We also evaluated their expansion history and evolutionary mechanisms to exclude misannotated genes. Subsequently, we examined their expression by microarray and MPSS data sets among various tissues and under abiotic and biotic stresses, followed by quantitative real-time reverse transcription-PCR confirmation. Finally, we investigated expressed small RNA (smRNA) loci located on hypothetical genes to further annotate their biological functions. Our analyses advance the understanding of rice hypothetical genes, indicating that they are involved in lineage-specific expansion and that they function at specific developmental stages. We also provide a valuable means to facilitate the characterization and functional annotation of hypothetical genes in other organisms.

RESULTS

Compilation of Annotated Hypothetical Genes and Their Characterization

In the latest version (release 6) of the MSU rice genome annotation, 56,797 genes were predicted, including 16,185 genes encoding transposon/retrotransposon elements (TEs). Thus, 40,612 genes were predicted to encode non-TE proteins. Among these genes, 10,918 were annotated as hypothetical genes, accounting for 26.88% of non-TE coding genes (Supplemental Table S1). We designated the remaining 29,694 non-TE coding genes as “known genes.” In Arabidopsis (Arabidopsis thaliana), only 1,549 genes (5.73%) were predicted as hypothetical genes based on the TAIR8 database (The Arabidopsis Information Resource; http://www.arabidopsis.org/) released on April 28, 2008 (Supplemental Table S2). That percentage is significantly lower than that in rice. This could be due to the presence of fewer TEs in Arabidopsis, since TEs may be a major contributor to the expansion of hypothetical genes (see below).

The average genomic size of hypothetical genes is around 1,154 bp, significantly smaller than that of known genes (approximately 3,445 bp) and TE-related genes (approximately 3,256 bp). In general, hypothetical proteins contain fewer Pfam (Finn et al., 2006) domains than proteins encoded by known genes (Supplemental Fig. S1A), and the majority of the domains found in hypothetical proteins are annotated as Domains of Unknown Function (DUFs). The majority of hypothetical proteins are present in one (42%) or two (20%) copies in the genome, and these percentages are higher than those for known genes (Supplemental Fig. S1B). Accordingly, a lower percentage of hypothetical proteins belong to families with more than three members (Supplemental Fig. S1B). The largest hypothetical protein family consists of 70 DUF834 domain-containing proteins. To date, these hypothetical proteins have been annotated only in the rice genome. To further characterize domains in hypothetical proteins, all of them were submitted to the InterPro database (Mulder and Apweiler, 2007) to search the taxonomic coverage of these proteins. Around 30% of these domains were detected only in rice (rice specific), and about 40% of them were present only in green plants; the remaining domains were detected either in bacteria and green plants (approximately 4%) or in all tested organisms (25%; Supplemental Fig. S1C).

Tandem and Segmental Duplications and Expansion of Hypothetical Genes in the Rice Genome

To decipher duplication mechanisms of hypothetical genes in the rice genome, we analyzed the contribution of tandem and segmental duplications to the expansion of hypothetical genes. We began by identifying all tandemly duplicated genes in the rice genome (see “Materials and Methods”). Among the non-TE-related coding genes, a total of 3,251 pairs of tandemly duplicated genes were identified when a tandem cluster was defined as a region containing members of the same family with no more than 10 genes between neighboring members, as described in previous reports (Shiu and Bleecker, 2003; Shiu et al., 2004). These pairs consisted of 5,209 tandemly duplicated genes (Supplemental Table S3). Among them, 840 tandem duplicates were annotated as hypothetical genes, accounting for 7.69% of total hypothetical genes, while up to 14.71% of known genes (4,369) were tandemly duplicated (Table I; Fig. 1A), suggesting a significantly lower contribution of tandem duplication to the expansion of hypothetical genes when compared with known genes (Z test, P < 0.001). On the other hand, 2,537 of 3,251 pairs consist of only known genes, of which 1,202 pairs (47.38%) have evolved with intron gain and loss, since the two members in each pair showed differences in exon organization (Fig. 1A). The remaining 714 pairs consisted of at least one hypothetical gene in each pair. These hypothetical gene-containing pairs were observed with significantly higher rates of intron gain and loss, as more than half of the pairs (56.72%; Fig. 1A) exhibited differences in their numbers of exons (Z test, P < 0.001).

Table I. A summary of the contribution of duplication/transposition/retrotransposition to the expansion/annotation of hypothetical genes

Expansion Mechanism          Nos. (a)   Hypothetical Genes   TE-Related Genes   Known Genes
Tandem duplication              5,209   840 (7.69%)          –                  4,369 (14.71%)
Segmental duplication           6,270   173 (1.58%)          –                  6,097 (20.53%)
Retrotransposon related
  LTR retrotransposon (b)      26,378   1,051 (9.63%)        6,775 (41.86%)     844 (2.84%)
  Retrogene                     2,916   437 (4.00%)          –                  2,480 (8.35%)
DNA transposon related
  MULE                         10,688   1,268 (11.61%)       1,372 (8.48%)      1,123 (3.78%)
  CACTA                           620   51 (0.47%)           598 (3.69%)        42 (0.14%)
  hAT                             103   0 (0.00%)            57 (0.35%)         3 (0.01%)
  Helitron                        552   9 (0.08%)            8 (0.05%)          67 (0.23%)
Total                               –   3,829 (35.07%)       8,810 (54.43%)     15,025 (50.60%)

(a) The numbers indicate the total tandemly/segmentally duplicated genes or transposon/retrotransposon elements in the rice genome.
(b) These elements include intact and solo LTR retrotransposons.

We have also identified 6,270 segmentally duplicated genes based on the methods described by Lin et al. (2006; http://rice.plantbiology.msu.edu/segmental_dup/index.shtml). Among them, 6,097 were annotated as known genes, accounting for 20.53% of total known genes, while only 1.58% of hypothetical genes (173) were retained within duplicated blocks, a significantly lower proportion than that of known genes (Z test, P < 0.001; Table I; Fig. 1B). These 6,270 genes form 3,916 pairs of segmental duplicates, rather than 3,135, since one gene may be segmentally duplicated twice or more. Among them, 96.4% of pairs (3,774) consist of only known genes, of which only 1,456 pairs (38.58%) underwent intron gain and loss. The remaining 142 pairs of segmental duplicates contained at least one hypothetical gene in each pair, of which up to 51.41% have evolved with intron gain and loss, a significantly higher rate than that of the known-gene pairs (Z test, P < 0.001; Fig. 1B).
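The tandem-cluster criterion used in this section (neighboring members of the same family separated by no more than 10 genes) can be illustrated with a minimal Python sketch. This is not the pipeline used in the study: the function name `tandem_pairs`, the input layout (a position-ordered list of locus and family labels for one chromosome), and the example locus names are assumptions made for illustration only.

```python
def tandem_pairs(genes, max_intervening=10):
    """Return neighboring same-family pairs that qualify as tandem duplicates.

    genes: list of (locus_id, family_id) tuples for ONE chromosome, already
           sorted by chromosomal position; family_id is None when the gene
           belongs to no family (hypothetical input layout).
    max_intervening: largest number of genes allowed between two
           neighboring family members (10 in the text).
    """
    last_seen = {}   # family_id -> (index, locus_id) of the previous member
    pairs = []
    for index, (locus, family) in enumerate(genes):
        if family is None:
            continue
        if family in last_seen:
            prev_index, prev_locus = last_seen[family]
            # Genes strictly between the two neighboring members.
            if index - prev_index - 1 <= max_intervening:
                pairs.append((prev_locus, locus))
        last_seen[family] = (index, locus)
    return pairs


if __name__ == "__main__":
    # Illustrative loci only; these are not real MSU locus identifiers.
    chromosome = [("geneA", "fam1"), ("geneB", None), ("geneC", "fam1"),
                  ("geneD", "fam2")]
    print(tandem_pairs(chromosome))
```

Pairing each member only with its nearest upstream family member mirrors the "neighboring members" wording; a cluster of three adjacent members therefore yields two pairs rather than three.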

Retrotransposons/Transposons and Expansions/Annotations of Hypothetical Genes

There are two groups of retrotransposons. One group is characterized by flanking long terminal repeats (LTRs) and includes copia- and gypsy-like retrotransposons. The other group, the non-LTR retrotransposons, is subdivided into two major families: long interspersed nuclear elements and short interspersed nuclear elements. By executing the RetrOryza (Chaparro et al., 2007) and LTR_FINDER (Xu and Wang, 2007) programs, we have identified a total of 26,378 LTR retrotransposons (Table I). Among them, TE-related genes have been annotated from 6,775 LTR retrotransposons in the MSU rice genome annotation database (Table I; Supplemental Table S4). A total of 844 LTR retrotransposons have been detected to contain annotated known gene sequences, accounting for only 2.84% of total annotated known genes (Table I; Supplemental Table S4). In addition, a total of 1,051 LTR retrotransposons have been identified to contain annotated hypothetical genes, making up 9.63% of total hypothetical genes (Table I; Supplemental Table S4). This ratio is significantly higher than that of known gene sequences (Z test, P < 0.001). After retrotransposition, one copy might have lost its ability to retrotranspose and evolved into a known gene or a hypothetical gene. Several examples are shown in Supplemental Text S1.

Figure 1. Expansion of hypothetical genes by both tandem and segmental duplications. A, Contributions of tandem duplication to the expansion of known and hypothetical genes. Tandemly duplicated genes were identified based on the criteria that they are located within 10 predicted ORFs, as described in “Materials and Methods.” Locus numbers of tandemly duplicated genes are listed in Supplemental Table S3. B, Contributions of segmental duplication to the expansion of known and hypothetical genes. Segmentally duplicated genes were identified by Lin et al. (2006). Asterisks indicate significant differences at P < 0.001.

Besides LTR retrotransposons, non-LTR retrotransposons were also reported to drive genome expansion and evolution by retrotransposition, and genes generated by these class I elements were named retrogenes (Esnault et al., 2000; Kazazian, 2004; Dewannieux and Heidmann, 2005). Such retrogenes are usually devoid of introns and carry target site duplications and/or a poly(A) tract (Betran et al., 2002; Wang et al., 2006). We have identified all retrogenes genome wide using one-exon-containing genes as query sequences for BLAST searches (see “Materials and Methods”). These analyses showed that a total of 437 hypothetical genes were regarded as retrogenes based on the criteria (see “Materials and Methods”), accounting for 4.00% of the total hypothetical genes (Table I; Supplemental Table S5). This ratio is significantly lower than the ratio from known genes (Table I; Z test, P < 0.001). Several examples are also shown in Supplemental Text S1.
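The proportion comparisons reported here as Z tests can be approximated with a pooled two-proportion Z test. The text does not state which form of the test was used, so the pooled variant below is an assumption, as are the function name and the two-sided P value. The counts are taken from Table I, and the known-gene total is derived as 40,612 non-TE genes minus 10,918 hypothetical genes.

```python
import math


def two_proportion_z_test(x1, n1, x2, n2):
    """Pooled two-proportion Z test (assumed form; not specified in the text).

    x1/n1 and x2/n2 are the two observed proportions.
    Returns (z, two_sided_p).
    """
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


HYPOTHETICAL_TOTAL = 10_918
KNOWN_TOTAL = 40_612 - 10_918  # non-TE genes minus hypothetical genes

if __name__ == "__main__":
    # LTR retrotransposon-related genes: hypothetical vs. known (Table I).
    z_ltr, p_ltr = two_proportion_z_test(1_051, HYPOTHETICAL_TOTAL, 844, KNOWN_TOTAL)
    # Retrogenes: hypothetical vs. known (Table I).
    z_retro, p_retro = two_proportion_z_test(437, HYPOTHETICAL_TOTAL, 2_480, KNOWN_TOTAL)
    print("LTR:       z = %.2f, P < 0.001: %s" % (z_ltr, p_ltr < 0.001))
    print("Retrogene: z = %.2f, P < 0.001: %s" % (z_retro, p_retro < 0.001))
```

Under these assumptions the sign of z gives the direction of the difference: positive for LTR-related genes (enriched among hypothetical genes) and negative for retrogenes (depleted among hypothetical genes), in line with the directions stated in the text.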
DNA transposons are class II mobile elements and they contain two subclasses, among which the Mutator-like transposable element (MULE), CACTA, and hobo/Ac/Tam3 (hAT) families are major subclass I transposons and the Helitron family contains major subclass II transposons in higher plants. MULEs carrying fragments of host genes were named Pack-MULEs (Jiang et al., 2004). Based on our analyses, 1,268 hypothetical genes were identified to be related to MULEs, accounting for 11.61% of total hypothetical genes (Table I; Supplemental Table S6). However, only 8.48% (1,372) of TE-related and 3.78% (1,123) of known genes were identified to be involved in MULE-related expansions (Table I; Supplemental Table S6), and the ratios were significantly lower than those from hypothetical genes (Z test, P < 0.001). Besides MULEs, we have also identified 51 hypothetical genes (0.47%) whose expansions/annotations were related to CACTA elements (Table I; Supplemental Table S7). There were no hypothetical genes annotated within 103 hAT family members in the rice genome (Table I; Supplemental Table S8). We found that only nine hypothetical genes have been involved in Helitron-related expansions (Table I; Supplemental Table S9). Thus, Helitron elements have a very limited contribution to the expansion of hypothetical genes. A similar result has been reported by Sweredoski et al. (2008), who found evidence for only 11 unique gene-capture events in rice. Supplemental Text S1 also lists several examples of DNA transposon-related expansion/annotation of hypothetical genes.

In summary, our data show that hypothetical genes are overrepresented in transposon-related sequences relative to known genes but not in tandem or segmental duplications (Table I). A higher percentage of these transposon-related hypothetical genes were under weaker selective constraint and were not expressed, as shown below. Thus, a higher proportion of them might not be functional or might result from misannotation.

Selective Constraints on Hypothetical Genes during Evolution

Tandem duplication and transposition/retrotransposition significantly contributed to the expansion/annotation of hypothetical genes. Thus, it is of interest to determine whether these duplicated/expanded descendants are still functional. Therefore, the ratios of nonsynonymous substitutions per site to synonymous substitutions per site (Ka/Ks) of these pairs were estimated and tested statistically. Only pairs consisting of both a hypothetical and a known gene were selected for Ka/Ks calculations. Pairs from the five types of expansion with the largest contributions were selected for such analyses (Fig. 2). In general, both Pack-MULE and LTR-related pairs include several examples with Ka/Ks greater than 0.5, while the ratios for the other types are mostly less than 0.5 (Fig. 2, A–E). To determine the percentages of genes that were under functional constraints, a likelihood ratio test was carried out according to the method of Betran et al. (2002) to test whether the Ka/Ks ratio between pairs was significantly less than 0.5. This test showed that Ka/Ks ratios for 145 retrogene pairs were statistically less than 0.5, accounting for 33.18% of total hypothetical retrogenes (Fig. 2F). Similarly, 26.53%, 27.75%, 24.53%, and 4.47% of tandem, segmental, Pack-MULE, and LTR retrotransposon-related hypothetical genes were under functional constraints, respectively (Fig. 2F). In total, we have detected 816 hypothetical-known gene pairs with Ka/Ks ratios less than 0.5 by statistical analyses (Fig. 2F). Thus, among a total of 3,769 hypothetical genes from retrogene, tandem, segmental, Pack-MULE, or LTR-related duplication/expansion/annotation (Table I), 21.65% were under functional constraints (Fig. 2F). Because only hypothetical-known gene pairs were analyzed, hypothetical-TE, hypothetical-hypothetical, and frameshift gene pairs were not included in these estimates.

Expression Analyses of Hypothetical Genes

Microarray expression data from 15 samples were used to assess the expression of hypothetical genes. These data were obtained from Gene Expression Omnibus (GEO) data sets (Barrett et al., 2007; http://www.ncbi.nlm.nih.gov/geo/) with accession number GSE6893 (Jain et al., 2007). We determined whether a hypothetical gene was expressed or not using a statistical test as described in “Materials and Methods.” Based on these analyses, a total of 912 genes were regarded as “expressed genes” among 7,393 hypothetical genes probed in the Affymetrix microarray chips (Fig. 3A). The remaining 6,481 genes did not exhibit evidence for expression in these tissues. Approximately half (45.5%) of the expressed genes were detected in only one tissue. The remaining expressed genes were detected in two or more tissues. Most of the tissue-specific genes were expressed during reproductive stages, including young/mature inflorescences and seeds. Some of the tissue-specific hypothetical genes were randomly selected for cDNA real-time PCR analyses to confirm their expression patterns. We selected eight tissue-specific hypothetical genes, including young/mature leaf-, panicle-, root-, and seed-specific genes, for such analyses, and our results confirmed that the selected genes exhibited transcriptional profiles similar to those shown in microarray analyses (Fig. 3B).

To further investigate expression profiles of hypothetical genes, additional microarray analyses were carried out using RNA samples from 14-d-old seedlings treated or untreated with 250 mM NaCl or 30% polyethylene glycol. GEO data sets with accession numbers GSM159268 to GSM159270 were used for cold stress expression analysis. Our analyses revealed that a total of 463 hypothetical genes showed detectable transcription signals under either normal or stressed growth conditions (Fig. 3A).

Furthermore, expression analyses were also carried out using the japonica rice MPSS database (see “Materials and Methods”). In the database, up to 70 libraries were constructed for genome-wide expression profiling. This analysis showed that a total of 770 hypothetical genes were expressed in at least one library (Fig. 3A).

Figure 2. Frequency distributions and C-value tests of Ka/Ks ratios between expanded gene pairs as well as their functional constraints. Frequency distributions of Ka/Ks ratios were analyzed using expanded pairs from tandem duplication (A), segmental duplication (B), Pack-MULEs (C), LTR-related expansion (D), and retrogene pairs (E). C-value tests and functional constraints of various expanded hypothetical genes are shown in F. Only hypothetical-known gene pairs were selected for such analyses.


Figure 3. Expression profiles of 7,393 hypothetical genes by microarray and MPSS. A, Summary of hypothetical gene expression analyses. The data were based on both microarray and MPSS analyses in various tissues and under various abiotic and biotic stresses. B, cDNA real-time PCR expression verification of a set of hypothetical genes in various tissues: YL, young leaf; ML, mature leaf; YP, young panicle; MP, mature panicle; YR, young root; MR, mature root; YS, young seed; MS, mature seed. C, The effect of expansion mechanisms on the expression of hypothetical genes. The numbers 1, 2, 3, 4, and 5 indicate different expansion mechanisms: LTR retrotransposons, retrogenes, Pack-MULEs, tandem duplicates, and segmental duplicates, respectively. NS, No significant difference by Z test. Asterisks indicate significant differences at * P < 0.05 and *** P < 0.001, respectively.

In summary, 1,672 of 7,393 analyzed hypothetical genes have been detected with expression signals (Fig. 3A; Supplemental Table S10). We analyzed the effects of duplication or transposition on the expression of these genes. Our data showed that around 35% of the 173 segmentally duplicated hypothetical genes were expressed (Fig. 3C). This ratio was significantly higher than that from the remaining expansion events (Z test, P < 0.001). The percentage of expressed hypothetical genes from tandem duplication was estimated at around 21%. These data revealed that both segmental and tandem duplicates achieved higher percentages of expressed genes when compared with those from transposition/retrotransposition events (Z test, P < 0.001).

To evaluate the expression of the remaining 3,525 hypothetical genes that were not probed in the microarray chips, 192 of them were randomly selected for quantitative real-time reverse transcription-PCR analyses. cDNA was synthesized from total RNAs extracted from rice tissues at various stages, including young and mature leaves, panicles, and roots. These cDNAs were subjected to real-time PCR to detect the expression of the 192 hypothetical genes using their corresponding gene-specific primers. These analyses revealed 32 (16.7%) hypothetical genes with detectable expression signals (Fig. 4, A and B). Among these expressed genes, 12 (37.5%) exhibited tissue-specific expression and the remaining 20 genes were expressed in multiple tissues (Fig. 4, C and D).

Detection of Antisense Strand Transcription in Hypothetical Genes

Besides the sense strand transcription of hypothetical genes, we also analyzed their antisense strand expression to assess whether these genes might act as negative regulators, since antisense expression could affect the expression of targeted genes through RNA-mediated gene silencing (Meister and Tuschl, 2004). We have detected two classes of antisense transcripts, class 3 (antisense to annotated open reading frame [ORF]) and class 6 (antisense strand within annotated intron), by MPSS database searches (Nobuta et al., 2007). By searching a 17-bp signature data set in 70 libraries from the same or different tissues, we detected a total of 283 hypothetical genes expressing class 3 antisense transcripts with at least five transcripts per million (TPM) in at least one library (Fig. 5A). A total of 219 hypothetical genes were detected with class 3 antisense expression when subjected to searches of a 20-bp signature database. On the other hand, antisense transcripts of class 6 were detected in 173, 132, or 108 hypothetical genes when 17-bp, 20-bp, or both signature data sets were searched, respectively. By filtering out antisense-expressing genes common to both class 3 and class 6, we have detected a total of 278 hypothetical genes with antisense expression either in the exon or the intron region, as shown by both 17- and 20-bp signatures, accounting for 2.55% of total hypothetical genes (Fig. 5A; Supplemental Table S11).

Figure 4. Expression profiles of 192 hypothetical genes by real-time PCR. A, Summary of hypothetical gene expression analyses. B, Examples of cDNA real-time PCR products visualized by agarose gel. M, A 100-bp DNA ladder from Roche. The gene UBQ5 was used as a positive control (Jain et al., 2006). For each gene, the left lane shows the band amplified by cDNA real-time PCR and the right lane shows a negative control without cDNA template. C, Summary of tissue-specific or non-tissue-specific genes among 32 hypothetical genes. D, Expression patterns of tissue-specific genes. YL, Young leaf; ML, mature leaf; YP, young panicle; MP, mature panicle; YR, young root; MR, mature root.

Figure 5. Antisense and unique smRNA loci in hypothetical genes. A, Antisense strand expression analyses. The data were based on the MPSS database. Class 3 and class 6 indicate the expression signals detected from the antisense strand matched to annotated ORFs and intron regions, respectively. B, A summary of sense and antisense strand expression and smRNA analyses.

Unique smRNA Loci in Hypothetical Genes

Plant smRNAs include microRNAs, short interfering RNAs, and trans-acting short interfering RNAs and act as important negative regulators of gene expression (Axtell and Bowman, 2008; Ramachandran and Chen, 2008). To investigate the smRNA transcriptome in rice, six smRNA libraries were constructed from rice inflorescences, seedlings, stem tissues, and seedlings treated with abscisic acid or with the rice blast pathogen Magnaporthe grisea. The abundance of each sequence in these libraries was normalized based on the relative cloning frequency in each library, labeled as transcripts per quarter million (TPQ; Lu et al., 2008). All of the 17-bp signatures were mapped onto the rice genome, and the unique signatures with only one hit in a hypothetical gene were selected for further analyses. Such analyses led to the identification of a total of 3,025 hypothetical genes with unique and detectable smRNA loci (TPQ ≥ 1), accounting for 27.71% of total annotated hypothetical genes (column 1 in Fig. 5B; Supplemental Table S12). However, only 929 (8.51%) hypothetical genes were detected to contain smRNA loci with an abundance of at least five TPQ (column 2 in Fig. 5B; Supplemental Table S12).

To further investigate smRNA loci in hypothetical genes, we also analyzed the Cereal Small RNA Database (CSRDB; Johnson et al., 2007). In this database, 12,819 expressed unique smRNAs have been sequenced. By analyzing their physical positions, we found that 1,350 hypothetical genes were detected with expressed unique smRNAs, accounting for 12.36% of total annotated hypothetical genes (column 3 in Fig. 5B; Supplemental Table S12). By combining both MPSS and CSRDB smRNA databases, we have identified 3,775 (34.58%; TPQ ≥ 1) or 2,071 (18.97%; TPQ ≥ 5) hypothetical genes with smRNA loci (columns 4 and 5 in Fig. 5B). In contrast, only 22.85% of known genes and 23.18% of TE-related genes (TPQ ≥ 1) were detected with expressed smRNA loci. Among 3,775 smRNA loci detected in hypothetical genes, 54% matched coding regions, and the remaining loci were in intron regions. Similar results were observed in TE-related genes (56%). However, for known genes, only 36% of loci were detected to match coding regions.
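The thresholding and combination steps described for the smRNA data can be sketched in Python as follows. Several points are assumptions rather than details given in the text: that TPQ is computed as raw count × 250,000 / library total (inferred from the name "transcripts per quarter million"), that a gene counts as detected when any one of its unique signatures reaches the threshold, and that the two databases were combined as a simple set union. All function names and gene labels are hypothetical.

```python
QUARTER_MILLION = 250_000


def to_tpq(raw_count, library_total):
    """Normalize a raw signature count to transcripts per quarter million.

    Assumed formula: relative cloning frequency scaled to 250,000.
    """
    return raw_count * QUARTER_MILLION / library_total


def genes_with_smrna(tpq_by_gene, min_tpq):
    """Genes with at least one unique smRNA signature at or above min_tpq.

    tpq_by_gene: dict mapping gene id -> list of TPQ values, one per unique
                 signature mapped to that gene (hypothetical input layout).
    """
    return {gene for gene, values in tpq_by_gene.items()
            if any(value >= min_tpq for value in values)}


if __name__ == "__main__":
    # Toy data; gene labels are placeholders, not real locus identifiers.
    mpss = {"hypA": [0.4, 1.2], "hypB": [6.0], "hypC": [0.2]}
    csrdb = {"hypC", "hypD"}  # genes carrying a sequenced CSRDB smRNA

    combined_tpq1 = genes_with_smrna(mpss, 1) | csrdb
    combined_tpq5 = genes_with_smrna(mpss, 5) | csrdb
    print(sorted(combined_tpq1))
    print(sorted(combined_tpq5))
```

A union is at least numerically compatible with the reported totals, since the combined counts (3,775 and 2,071) are smaller than the sums of the separate MPSS and CSRDB counts, which implies that some genes are detected in both databases.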


DISCUSSION

The Evolutionary Fate of Hypothetical Genes Derived from Duplication/Expansion

One may wonder why so many hypothetical genes have been annotated and what mechanisms drive their evolution or expansion. We have investigated the effects of tandem and segmental duplications on the expansion of hypothetical genes. We also evaluated the contributions of transposons/retrotransposons to such expansion/annotation. The analyses show that MULEs and LTR retrotransposons are major contributors to the expansions. As many hypothetical genes are transposable elements, it is conceivable that many of the remaining hypothetical genes may be derived from ancient TEs that lost recognizable transposon features and therefore cannot be detected. As a result, the contribution of TEs to the expansion of hypothetical genes could be much greater than what we have estimated in this study. Based on our current results, the association of transposons with hypothetical genes falls into two distinct categories. Some elements, such as Pack-MULEs, duplicate and amplify other hypothetical genes. Other elements, like LTR sequences, represent true transposon sequences that are likely misannotated as hypothetical genes and should be excluded in future annotations.

Although some of the hypothetical genes represent misannotations, a considerable part of them are functional genes. A total of 33.18% of retrogene-related hypothetical genes were regarded as real genes under certain selective constraints. A previous report also showed that 73% of rice retrogenes were functional, since they have intact ORFs, low Ka/Ks ratios, and evidence of expression (Wang et al., 2006). The following mechanisms have been proposed to explain the active transcription of retrogenes: direct fusion to host genes; hitchhiking on the regulatory elements of other genes; direct inheritance of promoters from their parental genes; and the use of regulatory elements evolved from the retrotransposons (Vinckenbosch et al., 2006).
Apart from retrogenes, hypothetical genes under functional constraints might also have evolved from tandem (26.53%) and segmental (27.75%) duplications, Pack-MULEs (24.53%), and LTR retrotransposons (4.47%). Taken together, among the 3,769 hypothetical genes related to such duplication/expansion events, a total of 774 (20.54%) should be regarded as real genes. However, some of the remaining genes might also be functional as a result of positive selection, with Ka/Ks > 1. On the other hand, since most hypothetical genes could be annotated from recently evolved genomic sequences (see below), low Ka/Ks ratios might also be detected for genes that are not under selective constraints. Thus, we might have overestimated the percentage of hypothetical genes with functional constraints.

Plant Physiol. Vol. 150, 2009

Most Hypothetical Genes Could Be Annotated from Recently Evolved Genomic Sequences and Exhibited Lineage-Specific Expansion

The majority of hypothetical genes are not conserved in other species. Only 151 hypothetical genes (1.4% of total hypothetical genes) have homologs in the Arabidopsis genome. The proportions of hypothetical genes with detectable homologs in maize (Zea mays) and sorghum (Sorghum bicolor) are 23.5% and 20.0%, respectively. Similar results have been reported by Zhu and Buell (2007), who detected 6,744 annotated rice hypothetical genes with no sequence similarity in the 2,512-Mb nonrice genomic and transcriptomic sequences from 184 species. Furthermore, a high percentage of hypothetical genes (46.5%) has been detected among the 861 identified genes that are highly conserved within, as well as specific to, the Poaceae (Campbell et al., 2007). However, most of the hypothetical genes (98.3%) could be detected in the indica genome. These data indicate that the sequences of rice hypothetical genes originated recently and suggest that most of these genes evolved after the divergence of rice from the other Gramineae plants. Therefore, their expansion is considered to be lineage specific, which may also imply recently active duplication and transposition events in rice genomes.

Hypothetical Genes May Function with a Tissue-Specific Mode and Also as Negative Regulators

At present, only EST/cDNA or protein expression data are used for rice gene annotation. On the other hand, high-throughput expression analyses, including microarray (Schena et al., 1995), Serial Analysis of Gene Expression (Velculescu et al., 1995), MPSS (Brenner et al., 2000), and RNA-Seq (Mortazavi et al., 2008), have been carried out in various organisms. Genome-wide transcriptome analyses by microarray have been used to improve the functional annotations of hypothetical genes in S. oneidensis (Kolker et al., 2005). Although various high-throughput methods are widely used in genome-wide transcript analyses in rice (Ma et al., 2005; Gowda et al., 2006; Li et al., 2006; Nobuta et al., 2007), the data have not been used to improve the annotation of rice hypothetical genes. We found evidence for expression from 22.6% of hypothetical genes, and nearly half of these exhibit expression only in a single tissue. In addition to the expression detected on the sense strands of annotated hypothetical genes, we also investigated expression profiles based on antisense strand signals. The analyses revealed that some of the hypothetical genes may also function as negative regulators, since 2.55% of these genes were found to encode antisense transcripts. Our analysis of smRNA loci revealed that 18.97% to 34.58% of hypothetical genes might be involved in smRNA-based gene-silencing pathways. This may be partially due to the high proportion of hypothetical genes that were expanded by transpositions/retrotranspositions, while smRNAs typically evolve from transposons/retrotransposons (Hamilton et al., 2002; Mallory and Vaucheret, 2004; Ghildiyal et al., 2008). These findings suggest that a high percentage of hypothetical genes may play important roles in negatively regulating gene expression through RNA-mediated gene silencing.

MATERIALS AND METHODS

Plant Materials

Rice (Oryza sativa japonica ‘Nipponbare’) was used for all experiments in this study. Seeds were imbibed in water. After germination, they were planted in the greenhouse and grown under natural light and temperature conditions.

Microarray Hybridization and Data Analysis

Two-week-old seedlings grown under either normal or stressed conditions (drought and salinity; treated with 30% polyethylene glycol for 1 h or 250 mM NaCl for 2 h, respectively) were used as starting materials. Total RNA samples were prepared using the RNeasy Plant Mini Kit (Qiagen). Only those RNA samples with an A260/A280 ratio of 1.9 to 2.1 were used for microarray analysis. We used Affymetrix GeneChip Rice Genome Arrays (catalog no. 900599) for the analysis. One-cycle target labeling, hybridization to arrays, washing, staining, and scanning were carried out according to the manufacturer’s instructions (Affymetrix). Hybridization data were analyzed using Affymetrix GeneChip Operating Software (GCOS 1.4). We identified expressed hypothetical genes using the “detection P value” reported by the GCOS 1.4 program for each biological replicate. Based on this value, an absolute (ABS) call was assigned, indicating whether the transcript was present (marked with P), absent (A), marginal (M), or no call (NC). Only those hypothetical genes whose expression was detected (P) in two biological replicates were regarded as “expressed genes.”
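The detection-call rule described above can be illustrated with a minimal Python sketch. It assumes the ABS calls have already been exported from GCOS into a simple mapping from probe set to one call per biological replicate; the function name, the probe set identifiers, and the `min_present` parameter are hypothetical and are not part of the original analysis.

```python
from typing import Dict, List


def expressed_probe_sets(calls: Dict[str, List[str]], min_present: int = 2) -> List[str]:
    """Return probe set IDs called present ('P') in at least `min_present` replicates.

    `calls` maps a probe set ID to its ABS calls ('P', 'A', 'M', or 'NC'),
    one per biological replicate. The default of 2 mirrors the rule that a
    gene counts as expressed only when detected (P) by two replicates.
    """
    return sorted(
        probe_id
        for probe_id, replicate_calls in calls.items()
        if sum(call == "P" for call in replicate_calls) >= min_present
    )


# Hypothetical example data: two biological replicates per probe set.
example_calls = {
    "probe_a": ["P", "P"],   # present in both replicates -> expressed
    "probe_b": ["P", "A"],   # present in only one replicate
    "probe_c": ["M", "P"],   # a marginal call does not count as present
    "probe_d": ["A", "NC"],
}
expressed = expressed_probe_sets(example_calls)
print(expressed)
```

Treating marginal (M) calls as not present is a conservative reading of the rule; the text does not state how M calls were handled beyond requiring P.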

Primer Selection and cDNA Real-Time PCR Analysis

A set of hypothetical genes found to be expressed by microarray analysis was randomly selected for cDNA real-time PCR analyses to confirm their expression patterns. In addition, 192 genes were randomly selected from the 3,525 hypothetical genes that were not probed on the Affymetrix microarray chips for cDNA real-time PCR analyses. All gene-specific primer sequences were designed with Applied Biosystems Primer Express software and are listed in Supplemental Table S13. Tissues were collected at a total of eight developmental stages: young leaf and root (14 d old), mature leaf and root (70 d old), young (unexserted) and mature (flowering) panicle, and young (milky) and mature seeds. The cDNA real-time PCR analyses were carried out as described by Jiang et al. (2007).

MPSS Analyses

MPSS is a rapid method to produce 17- or 20-bp sequence tags that represent the population of mRNAs in a given tissue. In the MPSS database, expression data from 70 libraries were available for analyzing gene expression in different plant tissues (Nobuta et al., 2007; http://mpss.udel.edu/rice/). The expression level of a gene was estimated by TPM (transcripts per million, a normalized abundance). The TPM was calculated as the sum of abundances for signatures of class 1 (inside an annotated ORF), class 2 (within 500 bp 3′ of an ORF), class 5 (within an intron of an annotated gene, sense strand), and class 7 (spans an intron splice site), according to the default settings of the database. A gene was regarded as expressed when its TPM value was at least 5 in at least one library in both the 17-bp and 20-bp signature databases, since a cutoff of approximately 3 TPM was used as the background in the MPSS database.
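The MPSS expression call can be sketched as follows. This is an illustration only: the data layout (a mapping from library name to a list of signature class and TPM pairs) and all function names are hypothetical, and the sketch assumes that the qualifying library in the 17-bp database need not be the same library as in the 20-bp database, which the text does not specify.

```python
from typing import Dict, List, Tuple

# Signature classes summed into a gene's TPM (classes 1, 2, 5, and 7).
COUNTED_CLASSES = {1, 2, 5, 7}
MIN_TPM = 5  # expression cutoff; background in the database is about 3 TPM

Signatures = List[Tuple[int, float]]  # (signature class, TPM) pairs


def gene_tpm(signatures: Signatures) -> float:
    """Sum TPM over the signatures whose class is counted."""
    return sum(tpm for sig_class, tpm in signatures if sig_class in COUNTED_CLASSES)


def detected_in_any_library(libraries: Dict[str, Signatures]) -> bool:
    """True if the summed TPM reaches the cutoff in at least one library."""
    return any(gene_tpm(sigs) >= MIN_TPM for sigs in libraries.values())


def is_expressed(libs_17bp: Dict[str, Signatures], libs_20bp: Dict[str, Signatures]) -> bool:
    """Require detection in both the 17-bp and the 20-bp signature data."""
    return detected_in_any_library(libs_17bp) and detected_in_any_library(libs_20bp)


# Hypothetical signatures for one gene.
gene_17 = {"leaf": [(1, 4.0), (2, 3.0), (4, 50.0)], "root": [(1, 1.0)]}
gene_20 = {"leaf": [(7, 6.0)]}
gene_20_low = {"leaf": [(4, 100.0), (1, 2.0)]}

print(is_expressed(gene_17, gene_20))      # class 4 is ignored; 4 + 3 = 7 TPM in leaf
print(is_expressed(gene_17, gene_20_low))  # only 2 TPM from counted classes
```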

Databases and Annotations

The latest version (release 6) of the MSU rice genome annotation was used for identification of all rice hypothetical genes. For Arabidopsis (Arabidopsis thaliana), the latest version of the Arabidopsis genome annotation (TAIR8; http://www.arabidopsis.org) was used for retrieving all annotated hypothetical genes (Swarbreck et al., 2008). Draft indica rice (Yu et al., 2002), sorghum (Sorghum bicolor), and maize (Zea mays) genome sequences were downloaded from the following Web sites: http://rice.big.ac.cn/rice/index2.jsp, http://www.phytozome.net/sorghum, and http://www.maizesequence.org/index.html, respectively.

Detection of Duplication-Related Hypothetical Genes

Chromosomal distributions of hypothetical rice genes were determined by searching the physical positions of their corresponding locus numbers in the MSU rice genome annotation database. The colinearity of duplicated regions that contained hypothetical genes was determined using the MSU segmental genome duplication database (http://rice.plantbiology.msu.edu/segmental_dup/index.shtml; Lin et al., 2006). Tandemly duplicated rice genes were determined using predicted rice proteins downloaded from the MSU rice genome annotation database (release 6) as described by Rizzon et al. (2006). Transposon-like transcripts (16,185), as defined in the annotation, were removed from the data set. The remaining 40,612 proteins were screened in an all-versus-all BLAST search using the BLOSUM62 matrix with an E value of less than 0.01. Pairs of matching peptides were processed according to the following criteria: (1) sequence identity of 70% or greater; (2) multiple hits for the same pair of sequences were collapsed if the overlap was 10 or more amino acids; (3) the fraction of the query sequence covered by the aligned regions (i.e. coverage) was at least 30% of the query length; and (4) pairs of matching proteins were clustered into groups (families) using a transitive closure algorithm (if A matches B and B matches C, then A, B, and C are placed in the same family). A total of 27,040 proteins were assigned to 5,982 families. The remaining proteins were singletons. Each chromosome was then scanned for proteins that share a family, and two genes were scored as tandem duplicates when they belonged to the same family, were located on the same chromosome, and were separated by a maximum of 10 unrelated genes.
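The family clustering and tandem scan described above can be sketched in Python. This is not the authors' pipeline: it assumes the BLAST pairs have already passed the identity and coverage filters, it implements transitive closure with a union-find structure, and it simplifies the spacer rule by counting every intervening gene as unrelated. All names and the toy gene order are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def cluster_families(pairs: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Group proteins into families by transitive closure (union-find)."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    families: Dict[str, Set[str]] = defaultdict(set)
    for protein in list(parent):
        families[find(protein)].add(protein)
    return list(families.values())


def tandem_pairs(gene_order: Dict[str, List[str]], family_of: Dict[str, int],
                 max_spacers: int = 10) -> List[Tuple[str, str]]:
    """Report same-family gene pairs with at most `max_spacers` genes between them.

    `gene_order` maps a chromosome to its genes in physical order.
    """
    found = []
    for genes in gene_order.values():
        for i, gene in enumerate(genes):
            family = family_of.get(gene)
            if family is None:
                continue  # singleton: not assigned to any family
            # Gene j has j - i - 1 genes between it and gene i.
            for j in range(i + 1, min(i + max_spacers + 2, len(genes))):
                if family_of.get(genes[j]) == family:
                    found.append((gene, genes[j]))
    return found


# Hypothetical filtered BLAST pairs and gene order.
families = cluster_families([("A", "B"), ("B", "C"), ("D", "E")])
family_of = {gene: idx for idx, members in enumerate(families) for gene in members}
order = {"chr1": ["A", "x1", "x2", "B", "x3", "x4", "x5", "C"]}
print(tandem_pairs(order, family_of, max_spacers=2))
```

With `max_spacers=2`, only A and B qualify in this toy example, because B and C are separated by three genes; the default of 10 corresponds to the threshold given in the text.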

Detection of Transposon/Retrotransposon-Related Expansion of Hypothetical Genes

For genome-wide identification of LTR retrotransposon elements, whole genome sequences were downloaded from release 6 of the MSU rice genome annotation database and were then used for detection of full-length LTR retrotransposons with the LTR_Finder program (Xu and Wang, 2007). In addition, the RetrOryza database was employed to collect both full-length and solo LTR retrotransposons (Chaparro et al., 2007). For detection of putative retrogenes in release 6 of the MSU rice annotation database, all annotated non-TE protein sequences were downloaded from the database, and protein sequences from single-exon genes were subjected to BLASTP searches against all non-TE-related protein sequences deduced from coding sequences containing two or more exons. Homologs were collected for further analysis when a minimum of 70% of the queried protein-coding region was aligned with an E value threshold of 10⁻⁸. We then selected candidate retrogenes based on the criteria of Wang et al. (2006). Members of MULE families have been identified previously (Jiang et al., 2004; Juretic et al., 2005; Hanada et al., 2009). However, these studies used an older version of the rice genome sequences, and the numbers of MULEs were underestimated. Therefore, we reidentified these members based on release 6 of the MSU rice genome sequences using the methods described (Jiang et al., 2004; Juretic et al., 2005). To identify the members of hAT and CACTA DNA transposon families, we used two separate approaches. First, both terminal inverted repeats (TIRs) and subterminal repeats (TRs) were used for BLASTN searches (Altschul et al., 1997). For the hAT family, query sequences were from multiple species, including Ac/Ds elements from maize and Tam3 from snapdragon (Antirrhinum majus; Hehl et al., 1991; Huttley et al., 1995; Kempken and Windhofer, 2001; Moon et al., 2006).
For the CACTA family, En/Spm from maize and other CACTA elements from multiple species including rice were used as query sequences (Pereira et al., 1986; Wang et al., 2003; Wicker et al., 2003, and refs. therein). These searches generated two sets of BLAST hits, one set for 3′ terminal regions and another set for 5′ terminal regions. Next, matches for putative hAT/CACTA elements were found by looking for pairs of BLAST hits (E ≤ 1e-5) that were less than 4,000 bp apart for hAT elements and less than 30,000 bp apart for CACTA elements. Second, we built a Hidden Markov Model (HMM) profile using HMMER 2.3.2 (http://hmmer.janelia.org/) with default values. Seed TIRs and TRs were selected from the candidates identified by the first method and from the known elements mentioned above. Using the profile HMMs, we scanned the MSU release 6 genome sequences for hits in the proper orientation that were separated by at least 200 bp and by no more than 4,000 bp for hAT elements and 30,000 bp for CACTA elements. We then manually inspected these putative hAT/CACTA elements to remove any remaining artifacts by comparing their target site duplication/TIR sequences. Finally, since members of the Helitron family had been identified using release 5 of the MSU genome sequences (Sweredoski et al., 2008), we reidentified these elements in release 6 of the MSU genome sequences using the method described by Sweredoski et al. (2008).
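The pairing of terminal hits into candidate elements can be illustrated with a short Python sketch. It is a simplification under stated assumptions: hits are taken to be already filtered by score or E value, only one strand is considered (orientation checks are omitted), and the separation is measured from the end of the 5′ hit to the start of the 3′ hit, which the text does not define. The `Hit` type and function name are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Hit(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int


# Maximum separation between terminal hits, per family, as given in the text.
MAX_SEPARATION = {"hAT": 4_000, "CACTA": 30_000}


def pair_terminal_hits(five_prime: List[Hit], three_prime: List[Hit],
                       family: str, min_separation: int = 200) -> List[Tuple[str, int, int]]:
    """Pair 5' and 3' terminal hits into candidate elements.

    A candidate is reported as (chromosome, element start, element end) when
    the 3' hit lies downstream of the 5' hit on the same chromosome and the
    separation falls within [min_separation, MAX_SEPARATION[family]].
    """
    max_separation = MAX_SEPARATION[family]
    candidates = []
    for h5 in five_prime:
        for h3 in three_prime:
            if h5.chrom != h3.chrom or h3.start <= h5.end:
                continue
            separation = h3.start - h5.end
            if min_separation <= separation <= max_separation:
                candidates.append((h5.chrom, h5.start, h3.end))
    return candidates


# Hypothetical hits on one chromosome.
hits_5p = [Hit("chr1", 100, 120)]
hits_3p = [Hit("chr1", 1000, 1020), Hit("chr1", 200, 220), Hit("chr1", 6000, 6020)]

print(pair_terminal_hits(hits_5p, hits_3p, "hAT"))
print(pair_terminal_hits(hits_5p, hits_3p, "CACTA"))
```

Candidates produced this way would still need the manual inspection of target site duplications and TIR sequences described above; the sketch covers only the distance filter.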

Estimation of Ka/Ks Ratios and Analysis of Functional Divergence

For Ka and Ks estimation, amino acid sequences of MSU annotated hypothetical genes were aligned with those derived from the ORFs of their parental genes (known genes) in the region of sequence similarity in a pair (Betran et al., 2002), and the alignments were subsequently transferred to the original cDNA sequences. Pairs were selected when a minimum of 70% of the queried protein-coding region was aligned with an E value threshold of 10⁻⁸. In cases of incomplete overlap in a pair, only alignments longer than 150 bp with at least 70% identity were selected. Pairs with frameshifts were eliminated, because they would probably generate a high Ka/Ks ratio if the portion affected by the frameshift was of significant length. Both Ka and Ks values were then estimated using the yn00 program of the PAML4b package (Yang and Nielsen, 2000). The Ka/Ks ratios were then used for evaluating functional divergence by testing the C value as reported earlier (Thornton and Long, 2002). A likelihood ratio test of Ka/Ks between pairs was also carried out according to the method of Betran et al. (2002) to test whether hypothetical genes were pseudogenes or functional genes.
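The pair selection that precedes Ka/Ks estimation can be summarized in a small Python sketch. It covers only the filtering step, not the yn00 estimation or the likelihood ratio test. The record layout and function name are hypothetical, and the E value cutoff of 1e-8 is an assumed reading of the threshold in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlignedPair:
    hypothetical_id: str
    parent_id: str
    evalue: float
    query_coverage: float  # fraction of the query coding region aligned (0 to 1)
    aligned_bp: int        # alignment length on the cDNA
    identity: float        # fraction of identical positions (0 to 1)
    full_overlap: bool     # True if the pair overlaps completely
    has_frameshift: bool


def keep_for_ka_ks(pair: AlignedPair, max_evalue: float = 1e-8) -> bool:
    """Apply the selection rules for pairs passed on to Ka/Ks estimation."""
    if pair.has_frameshift:
        return False  # frameshifts would inflate Ka/Ks
    if pair.evalue > max_evalue or pair.query_coverage < 0.70:
        return False
    if not pair.full_overlap:
        # Incomplete overlap: require more than 150 bp at 70% identity or better.
        return pair.aligned_bp > 150 and pair.identity >= 0.70
    return True


# Hypothetical candidate pairs.
pairs: List[AlignedPair] = [
    AlignedPair("h1", "k1", 1e-30, 0.95, 900, 0.88, True, False),
    AlignedPair("h2", "k2", 1e-12, 0.75, 120, 0.90, False, False),  # too short
    AlignedPair("h3", "k3", 1e-20, 0.80, 600, 0.85, False, True),   # frameshift
    AlignedPair("h4", "k4", 1e-5, 0.90, 700, 0.92, True, False),    # weak E value
    AlignedPair("h5", "k5", 1e-15, 0.72, 300, 0.71, False, False),
]
selected = [p.hypothetical_id for p in pairs if keep_for_ka_ks(p)]
print(selected)
```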

Supplemental Data

The following materials are available in the online version of this article.

Supplemental Figure S1. Domain organization of known and hypothetical proteins.
Supplemental Table S1. Genes encoding hypothetical proteins in the rice genome.
Supplemental Table S2. Genes encoding hypothetical proteins in the Arabidopsis genome (TAIR8).
Supplemental Table S3. Tandemly duplicated hypothetical and known genes in the rice genome.
Supplemental Table S4. LTR retrotransposon-related genes.
Supplemental Table S5. Retrogenes identified using one-exon-containing annotated genes as query sequences.
Supplemental Table S6. MULE-related rice loci.
Supplemental Table S7. CACTA element-related rice loci.
Supplemental Table S8. All hAT-related rice loci.
Supplemental Table S9. All Helitron-related rice loci.
Supplemental Table S10. The 1,672 expressed hypothetical genes.
Supplemental Table S11. The 278 antisense expressed hypothetical genes.
Supplemental Table S12. Hypothetical genes with expressed unique smRNA loci.
Supplemental Table S13. Primer sequences for cDNA real-time PCR analyses.
Supplemental Text S1. Detailed analyses of retrotransposon/transposon-related expansion/annotation of hypothetical genes.

ACKNOWLEDGMENTS

We thank Nadimuthu Kumar, Chong Kian Long Kelvin, and Colin Soh Wei Quan for their help in total RNA isolation. We also thank Prasanna Nori Venkatesh and Ma Zhigang for carrying out the microarrays. We are grateful for the assistance from Professional Editing Service (http://www.prof-editing.com) in editing and polishing this document.

Received April 2, 2009; accepted June 15, 2009; published June 17, 2009.

LITERATURE CITED

Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402
Axtell MJ, Bowman JL (2008) Evolution of plant microRNAs and their targets. Trends Plant Sci 13: 343–349
Barrett T, Troup DB, Wilhite SE, Ledoux P, Rudnev D, Evangelista C, Kim IF, Soboleva A, Tomashevsky M, Edgar R (2007) NCBI GEO: mining tens of millions of expression profiles. Database and tools update. Nucleic Acids Res 35: D760–D765
Betran E, Thornton K, Long M (2002) Retroposed new genes out of the X in Drosophila. Genome Res 12: 1854–1859
Brenner S, Johnson M, Bridgham J, Golda G, Lloyd DH, Johnson D, Luo S, McCurdy S, Foy M, Ewan M, et al (2000) Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat Biotechnol 18: 630–634
Campbell MA, Zhu W, Jiang N, Lin H, Ouyang S, Childs KL, Haas BJ, Hamilton JP, Buell CR (2007) Identification and characterization of lineage-specific genes within the Poaceae. Plant Physiol 145: 1311–1322
Chaparro C, Guyot R, Zuccolo A, Piégu B, Panaud O (2007) RetrOryza: a database of the rice LTR-retrotransposons. Nucleic Acids Res 35: D66–D70
Dewannieux M, Heidmann T (2005) LINEs, SINEs and processed pseudogenes: parasitic strategies for genome modeling. Cytogenet Genome Res 110: 35–48
Esnault C, Maestre J, Heidmann T (2000) Human LINE retrotransposons generate processed pseudogenes. Nat Genet 24: 363–367
Finn RD, Mistry J, Schuster-Böckler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, et al (2006) Pfam: clans, Web tools and services. Nucleic Acids Res 34: D247–D251
Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM, et al (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496–512
Ghildiyal M, Seitz H, Horwich MD, Li C, Du T, Lee S, Xu J, Kittler EL, Zapp ML, Weng Z, et al (2008) Endogenous siRNAs derived from transposons and mRNAs in Drosophila somatic cells. Science 320: 1077–1081
Gowda M, Li H, Alessi J, Chen F, Pratt R, Wang GL (2006) Robust analysis of 5′-transcript ends (5′-RATE): a novel technique for transcriptome analysis and genome annotation. Nucleic Acids Res 34: e126
Hamilton A, Voinnet O, Chappell L, Baulcombe D (2002) Two classes of short interfering RNA in RNA silencing. EMBO J 21: 4671–4679
Hanada K, Vallejo V, Nobuta K, Slotkin RK, Lisch D, Meyers BC, Shiu SH, Jiang N (2009) The functional role of Pack-MULEs in rice inferred from purifying selection and expression profile. Plant Cell 21: 25–38
Hehl R, Nacken WK, Krause A, Saedler H, Sommer H (1991) Structural analysis of Tam3, a transposable element from Antirrhinum majus, reveals homologies to the Ac element from maize. Plant Mol Biol 16: 369–371
Huttley GA, MacRae AF, Clegg MT (1995) Molecular evolution of the Ac/Ds transposable-element family in pearl millet and other grasses. Genetics 139: 1411–1419
Jain M, Nijhawan A, Arora R, Agarwal P, Ray S, Sharma P, Kapoor S, Tyagi AK, Khurana JP (2007) F-box proteins in rice: genome-wide analysis, classification, temporal and spatial gene expression during panicle and seed development, and regulation by light and abiotic stress. Plant Physiol 143: 1467–1483
Jain M, Nijhawan A, Tyagi AK, Khurana JP (2006) Validation of housekeeping genes as internal control for studying gene expression in rice by quantitative real-time PCR. Biochem Biophys Res Commun 345: 646–651
Jiang N, Bao Z, Zhang X, Eddy SR, Wessler SR (2004) Pack-MULE transposable elements mediate gene evolution in plants. Nature 431: 569–573
Jiang SY, Bachmann D, La H, Ma Z, Venkatesh PN, Ramamoorthy R, Ramachandran S (2007) Ds insertion mutagenesis as an efficient tool to produce diverse variations for rice breeding. Plant Mol Biol 65: 385–402
Johnson C, Bowman L, Adai AT, Vance V, Sundaresan V (2007) CSRDB: a small RNA integrated database and browser resource for cereals. Nucleic Acids Res 35: D829–D833
Juretic N, Hoen DR, Huynh ML, Harrison PM, Bureau TE (2005) The evolutionary fate of MULE-mediated duplications of host gene fragments in rice. Genome Res 15: 1292–1297
Kazazian HH Jr (2004) Mobile elements: drivers of genome evolution. Science 303: 1626–1632
Kempken F, Windhofer F (2001) The hAT family: a versatile transposon group common to plants, fungi, animals, and man. Chromosoma 110: 1–9
Kolker E, Makarova KS, Shabalina S, Picone AF, Purvine S, Holzman T, Cherny T, Armbruster D, Munson RS Jr, Kolesov G, et al (2004) Identification and functional analysis of ‘hypothetical’ genes expressed in Haemophilus influenzae. Nucleic Acids Res 32: 2353–2361
Kolker E, Picone AF, Galperin MY, Romine MF, Higdon R, Makarova KS, Kolker N, Anderson GA, Qiu X, Auberry KJ, et al (2005) Global profiling of Shewanella oneidensis MR-1: expression of hypothetical genes and improved functional annotations. Proc Natl Acad Sci USA 102: 2099–2104
Li L, Wang X, Stolc V, Li X, Zhang D, Su N, Tongprasit W, Li S, Cheng Z, Wang J, et al (2006) Genome-wide transcription analyses in rice using tiling microarrays. Nat Genet 38: 124–129
Lin H, Zhu W, Silva JC, Gu X, Buell CR (2006) Intron gain and loss in segmentally duplicated genes in rice. Genome Biol 7: R41
Liolios K, Mavrommatis K, Tavernarakis N, Kyrpides NC (2008) The genomes on line database (GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res 36: D475–D479
Lu C, Jeong DH, Kulkarni K, Pillay M, Nobuta K, German R, Thatcher SR, Maher C, Zhang L, Ware D, et al (2008) Genome-wide analysis for discovery of rice microRNAs reveals natural antisense microRNAs (nat-miRNAs). Proc Natl Acad Sci USA 105: 4951–4956
Ma L, Chen C, Liu X, Jiao Y, Su N, Li L, Wang X, Cao M, Sun N, Zhang X, et al (2005) A microarray analysis of the rice transcriptome and its comparison to Arabidopsis. Genome Res 15: 1274–1283
Mallory AC, Vaucheret H (2004) MicroRNAs: something important between the genes. Curr Opin Plant Biol 7: 120–125
Meister G, Tuschl T (2004) Mechanisms of gene silencing by double-stranded RNA. Nature 431: 343–349
Moon S, Jung KH, Lee DE, Jiang WZ, Koh HJ, Heu MH, Lee DS, Suh HS, An G (2006) Identification of active transposon dTok, a member of the hAT family, in rice. Plant Cell Physiol 47: 1473–1483
Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B (2008) Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods 5: 621–628
Mulder N, Apweiler R (2007) InterPro and InterProScan: tools for protein sequence classification and comparison. Methods Mol Biol 396: 59–70
Nobuta K, Venu RC, Lu C, Beló A, Vemaraju K, Kulkarni K, Wang W, Pillay M, Green PJ, Wang GL, et al (2007) An expression atlas of rice mRNAs and small RNAs. Nat Biotechnol 25: 473–477
Ohyanagi H, Tanaka T, Sakai H, Shigemoto Y, Yamaguchi K, Habara T, Fujii Y, Antonio BA, Nagamura Y, Imanishi T, et al (2006) The Rice Annotation Project Database (RAP-DB): hub for Oryza sativa ssp. japonica genome information. Nucleic Acids Res 34: D741–D744
Ouyang S, Zhu W, Hamilton J, Lin H, Campbell M, Childs K, Thibaud-Nissen F, Malek RL, Lee Y, Zheng L, et al (2007) The TIGR rice genome annotation resource: improvements and new features. Nucleic Acids Res 35: D883–D887
Pereira A, Cuypers H, Gierl A, Schwarz-Sommer Z, Saedler H (1986) Molecular analysis of the En/Spm transposable element system of Zea mays. EMBO J 5: 835–841
Prasad AM, Sivanandan C, Resminath R, Thakare DR, Bhat SR, Srinivasan (2005) Cloning and characterization of a pentatricopeptide protein encoding gene (LOJ) that is specifically expressed in lateral organ junctions in Arabidopsis thaliana. Gene 353: 67–79
Ramachandran V, Chen X (2008) Small RNA metabolism in Arabidopsis. Trends Plant Sci 13: 368–374
Redman JC, Haas BJ, Tanimoto G, Town CD (2004) Development and evaluation of an Arabidopsis whole genome Affymetrix probe array. Plant J 38: 545–561
Rice Annotation Project (2008) The Rice Annotation Project Database (RAP-DB): 2008 update. Nucleic Acids Res 36: D1028–D1033
Rizzon C, Ponger L, Gaut BS (2006) Striking similarities in the genomic distribution of tandemly arrayed genes in Arabidopsis and rice. PLoS Comput Biol 2: e115
Roberts RJ (2004) Identifying protein function: a call for community action. PLoS Biol 2: E42
Saha S, Sparks AB, Rago C, Akmaev V, Wang CJ, Vogelstein B, Kinzler KW, Velculescu VE (2002) Using the transcriptome to annotate the genome. Nat Biotechnol 20: 508–512
Sakata K, Nagamura Y, Numa H, Antonio BA, Nagasaki H, Idonuma A, Watanabe W, Shimizu Y, Horiuchi I, Matsumoto T, et al (2002) RiceGAAS: an automated annotation system and database for rice genome sequence. Nucleic Acids Res 30: 98–102
Schena M, Shalon D, Davis RW, Brown PO (1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270: 467–470
Shin JH, Yang JW, Le Pecheur M, London J, Hoeger H, Lubec G (2004) Altered expression of hypothetical proteins in hippocampus of transgenic mice overexpressing human Cu/Zn-superoxide dismutase 1. Proteome Sci 2: 1–10
Shiu SH, Bleecker AB (2003) Expansion of the receptor-like kinase/Pelle gene family and receptor-like proteins in Arabidopsis. Plant Physiol 132: 530–543
Shiu SH, Karlowski WM, Pan R, Tzeng YH, Mayer KF, Li WH (2004) Comparative analysis of the receptor-like kinase family in Arabidopsis and rice. Plant Cell 16: 1220–1234
Sivashankari S, Shanmughavel P (2006) Functional annotation of hypothetical proteins: a review. Bioinformation 1: 335–338
Swarbreck D, Wilks C, Lamesch P, Berardini TZ, Garcia-Hernandez M, Foerster H, Li D, Meyer T, Muller R, Ploetz L, et al (2008) The Arabidopsis Information Resource (TAIR): gene structure and function annotation. Nucleic Acids Res 36: D1009–D1014
Sweredoski M, DeRose-Wilson L, Gaut BS (2008) A comparative computational analysis of nonautonomous helitron elements between maize and rice. BMC Genomics 9: 467
Thornton K, Long M (2002) Rapid divergence of gene duplicates on the Drosophila melanogaster X chromosome. Mol Biol Evol 19: 918–925
Velculescu VE, Zhang L, Vogelstein B, Kinzler KW (1995) Serial analysis of gene expression. Science 270: 484–487
Vinckenbosch N, Dupanloup I, Kaessmann H (2006) Evolutionary fate of retroposed gene copies in the human genome. Proc Natl Acad Sci USA 103: 3220–3225
Wang GD, Tian PF, Cheng ZK, Wu G, Jiang JM, Li DB, Li Q, He ZH (2003) Genomic characterization of Rim2/Hipa elements reveals a CACTA-like transposon superfamily with unique features in the rice genome. Mol Genet Genomics 270: 234–242
Wang W, Zheng H, Fan C, Li J, Shi J, Cai Z, Zhang G, Liu D, Zhang J, Vang S, et al (2006) High rate of chimeric gene origination by retroposition in plant genomes. Plant Cell 18: 1791–1802
Wicker T, Guyot R, Yahiaoui N, Keller B (2003) CACTA transposons in Triticeae: a diverse family of high-copy repetitive elements. Plant Physiol 132: 52–63
Xiao YL, Smith SR, Ishmael N, Redman JC, Kumar N, Monaghan EL, Ayele M, Haas BJ, Wu HC, Town CD (2005) Analysis of the cDNAs of hypothetical genes on Arabidopsis chromosome 2 reveals numerous transcript variants. Plant Physiol 139: 1323–1337
Xu Z, Wang H (2007) LTR_FINDER: an efficient tool for the prediction of full-length LTR retrotransposons. Nucleic Acids Res 35: W265–W268
Yang Z, Nielsen R (2000) Estimating synonymous and nonsynonymous substitution rates under evolutionary models. Mol Biol Evol 17: 32–43
Yu J, Hu S, Wang J, Wong GK, Li S, Liu B, Deng Y, Dai L, Zhou Y, Zhang X, et al (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296: 79–92
Yuan Q, Ouyang S, Wang A, Zhu W, Maiti R, Lin H, Hamilton J, Haas B, Sultana R, Cheung F, et al (2005) The Institute for Genomic Research Osa1 rice genome annotation database. Plant Physiol 138: 18–26
Zhu W, Buell CR (2007) Improvement of whole-genome annotation of cereals through comparative analyses. Genome Res 17: 299–310
