Molecular BioSystems - PubAg - USDA

2 downloads 0 Views 2MB Size Report
repetitive elements, Trends Genet., 2000, 16(9), 418–420. 19 J. M. McAbee, T. A. Hill, D. J. Skinner, A. Izhaki, B. A. Hauser,. R. J. Meister, G. Venugopala Reddy, ...
Molecular BioSystems This article was published as part of the

Computational and Systems Biology themed issue Please take a look at the full table of contents to access the other papers in this issue.

PAPER

www.rsc.org/molecularbiosystems | Molecular BioSystems

Coding DNA repeated throughout intergenic regions of the Arabidopsis thaliana genome: evolutionary footprints of RNA silencingwz Jian Feng,a Daniel Q. Naimana and Bret Cooper*b Received 12th February 2009, Accepted 26th March 2009 First published as an Advance Article on the web 30th April 2009 DOI: 10.1039/b903031j Pyknons are non-random sequence patterns significantly repeated throughout non-coding genomic DNA that also appear at least once among coding genes. They are interesting because they portend an unforeseen connection between coding and non-coding DNA. Pyknons have only been discovered in the human genome, so it is unknown whether pyknons have wider biological relevance or are simply a phenomenon of the human genome. To address this, DNA sequence patterns from the Arabidopsis thaliana genome were detected using a probability-based method. 24 654 statistically significant sequence patterns, 16 to 24 nucleotides long, repeating 10 or more times in non-coding DNA also appeared in 46% of A. thaliana protein-coding genes. A. thaliana pyknons exhibit features similar to human pyknons, including being distinct sequence patterns, having multiple instances in genes and having remarkable similarity to small RNA sequences with roles in gene silencing. Chromosomal position mapping revealed that genomic pyknon density has concordance with siRNA and transposable element positioning density. Because the A. thaliana and human genomes have approximately the same number of genes but drastically different amounts of non-coding DNA, these data reveal that pyknons represent a biologically important link between coding and non-coding DNA. Because of the association of pyknons with siRNAs and localization to silenced regions of heterochromatin, we postulate that RNA-mediated gene silencing leads to the accumulation of gene sequences in non-coding DNA regions.

Introduction Patterns inherent to the sequences of genomes signify higher orders of biological importance. For example, palindromic DNA sequence patterns reveal the sites for restriction endonuclease activity.1 Likewise, motifs, or degenerate sets of patterns, can reveal genes and promoters sharing common genomic properties across species.2–4 Meanwhile, repeated DNA sequence patterns, such as those tied to transposable elements (TE), are an integral part of a genetic variation.5,6 TE propagation in intergenic DNA has been postulated as one explanation for the wide variability between the number of genes in genomes and genome sizes, commonly known as the C-value paradox.7,8 Recently, a new type of DNA sequence pattern, the pyknon, was discovered in the human genome.9 Human pyknons are defined as 16–24 nucleotide (nt) sequence strings first found to appear in the intergenic non-coding regions of the human genome 40 or more times and then found to appear in the a

Department of Applied Mathematics and Statistics, The Johns Hopkins University, Baltimore, MD 21218 b Soybean Genomics and Improvement Laboratory, USDA-ARS, 10300 Baltimore Ave, Bldg. 006, Rm. 213, Beltsville, MD 20705. E-mail: [email protected] w This article is part of a Molecular BioSystems themed issue on Computational and Systems Biology. z Electronic supplementary information (ESI) available: Nucleotide (nt) and record number statistics for each file; a list of A. thaliana pyknons and probabilities; lists of pyknons and chromosome positions; PTTRNFNDR program (executable in 32-bit Linux). See DOI: 10.1039/b903031j

This journal is

 c

The Royal Society of Chemistry 2009

coding regions of genes (CR = exons minus introns) or in their cognate 5 0 and 3 0 untranslated regions (UTR). The frequency of appearance of these patterns exceeds their probability of random occurrence. Thus, the patterns appear to have biological significance. Generally, human pyknons have several intriguing features that distinguish them from other previously defined repeat elements. First, they exist in more than 90% of all known human CR or UTRs. Second, the set of human pyknons subsume sequences related to known micro RNAs (miRNA) and 3 0 UTR motifs. Third, DNA sequences between pyknons include regions that may be favorable for post-transcriptional gene-silencing activity.10 These pyknon features inextricably link coding DNA sequences with non-coding DNA regions once thought of as ‘‘junk’’ 11,12 and portend an unforeseen genome-wide coordinated activity. Among all genomic patterns, pyknons are peculiar because they are found via intergenomic sequence statistical analysis rather than through cross-genomic sequence comparison. In other words, by their very definition, pyknons can only be identified per genome. Since they have only been identified in the human genome, an investigation of other genomes is warranted to determine whether pyknon patterns have wider genomic relevance. Therefore, we have chosen to analyze the genome of Arabidopsis thaliana, the model eudicot. The quality of its available DNA sequence resources is ideal for this purpose,13 and the vast phylogenetic distance between humans and plants is likely to bring more meaning to the very nature of any real interrelationship between coding and non-coding DNA. Mol. BioSyst., 2009, 5, 1679–1687 | 1679

In prior work, we developed an unsupervised pattern recognition algorithm called PTTRNFNDR whose input is Latin-letter sequence strings.14 This algorithm uses no preconceived template for pattern recognition within strings. Rather, it uses a probability-based model to find patterns that exist above a frequency of random occurrence. The probabilistic nature of PTTRNFNDR guarantees the statistical significance and uniqueness (non-redundancy) of the patterns it discovers. Patterns and their estimated probabilities of appearance are updated through multiple iterations and examinations of different pattern length. Shorter patterns that are parts of longer ones are not necessarily excluded from consideration, and a long string will not be labeled as a pattern if it does not appear significantly often compared to its substrings. Without using a reference template, PTTRNFNDR has been shown to be highly effective at identifying real words in the concatenated text of the novel Moby Dick and has been used to find promoter sequences repeated throughout DNA regions upstream of translation start sites that had otherwise only been recognized through cross-species genomic comparisons.4,14 Using this algorithm to statistically analyze patterns, we have discovered a set of pyknons in the A. thaliana genome. In this paper, we describe this set, compare the features of these pyknons to those in the human genome, delineate a correlation to RNA-mediated gene silencing and propose a model for the propagation of short coding sequences in heterochromatin. This independent analysis shows that pyknons transcend genomes and have a broader biological context.

Results and discussion Pyknon discovery in the A. thaliana genome For their initial discovery of human pyknons, Rigoutsos et al. (2006) first found patterns in intergenic DNA and then performed a separate statistical analysis to show that patterns 16–24 nt long and appearing 40 or more times were present above random occurrence with respect to the size of the intergenic space in the human genome.9 Since our algorithm performs its statistical analysis as part of the pattern discovery process, a separate statistical analysis was not required. Nevertheless, we deferred to the human pyknon definition and then modified it with respect to the size of the A. thaliana genome which encodes approximately the same number of genes but has 40 times less combined intergenic and intron DNA. Because of this size differential and the nature of our model, the minimal number of appearances for a sequence to be defined as a pattern was set at 10. The impact of this definition change will be shown. Patterns were deemed to be significant if their appearance frequency was 4 standard deviations more often than their probability of existence suggested. Standard deviation was not a measure of significance for the set of human pyknons. We first searched for patterns in the DNA sequences of intergenic and intron regions of version 8.0 of the A. thaliana genome annotation (Table S1 in the ESIz). Within the original intron data file were sequences for splice variants, but we removed them for the purposes of this project since their presence artificially increased the number of repeated 1680 | Mol. BioSyst., 2009, 5, 1679–1687

sequences. Thus, we only evaluated introns belonging to reference genes (e.g. AT1G70100.1 but not the variant AT1G70100.2). This reduced the number of records in the file by B24%. The intergenic and adjusted intron files and their reverse complementary sequences were combined into one large text file and processed by PTTRNFNDR. When searching for patterns, the maximal length of the patterns was computationally limited to 24 nt. 236 572 significant patterns, 16 to 24 nt long, appearing 10 or more times were found in the combined intergenic and intron regions. Next, the forward CR (a.k.a. the CDS file) and 5 0 and 3 0 UTR sequence files depleted of splice variants, mitochondrion and chloroplast records, were scanned for sequence strings identical to those in the pattern set. In total, there were 24 654 patterns, 16–24 nt long, and found to repeat at least 10 times in non-coding regions at a statistically significant level that also had at least one instance in the CR or UTR of any protein coding reference gene. These were defined as A. thaliana pyknons (Table 1; Table S2 and S3 in the ESIz). To verify the significance of this pattern set and demonstrate the stringency of discovery, four sets of artificial (decoy) sequence strings having the same lengths as the sequences in the adjusted intergenic, intron, CR and UTR files but randomly populated with nts at the same frequency as those in the test set were constructed using a Monte Carlo method.14 We ran PTTRNFNDR with the same parameters on the four decoy intergenic + decoy intron databases. On average, 85 patterns 16–24 nt long were found, thus reinforcing the significance of the 236 572 patterns found among true intron and intergenic sequences. Hence, the existence of patterns in true intron and intergenic sequences is not coincidental. On average, only 5.8 of the patterns from decoy intergenic and intron sequences could be found in decoy CR + decoy UTR sequences. This sets the boundary for random pattern matching. With this in mind, we looked at how many real intergenic + intron patterns appeared in decoy CR + decoy UTRs and determined how many times those real patterns appeared in the decoy sequences compared to the true coding sequences. On average, 2518 out of 236 572 (B1.0%) real patterns matched the decoy genic sequences (Table 1). As to be expected, few or none of the randomly associated real patterns were 18–24 nt long, and there were far fewer real 16mers matching decoy CR and decoy UTR sequences (2184 av.) than real CR and UTR sequences (11 370 pyknons). In other words, the real 16 nt patterns appeared at a rate that exceeded the rate of random occurrence by 5.2 fold. Longer patterns exceeded the rate of random occurrence even more strikingly. Next, we defined the 16–24 nt real patterns that matched the decoy genomic sequences as ‘‘false pyknons.’’ There were only 2030 instances of false pyknons in the decoy CR and decoy UTRs compared to 21 245 non-overlapping instances of true pyknons in real CR and UTRs (Table 1). Again, the number of instances of longer real pyknons was drastically greater than the number of instances of longer false pyknons. These data support our claim that there are 236 572 patterns 16–24 nt long with statistical overrepresentation in non-coding A. thaliana genomic DNA and that the appearance of 24 654 of these patterns in real CR and UTR sequences is significant. This journal is

 c

The Royal Society of Chemistry 2009

Table 1

Numbers, copies and coverage of patterns, pyknons, and false pyknons

Nucleotide (nt) length

Significant patterns in intergenic + intron records (real patterns)

Real patterns among CR + UTRs (pyknons)

16 17 18 19 20 21 22 23 24 Total

49 799 24 601 14 397 10 022 8309 8506 9654 16 414 94 870 236 572

11 370 5136 2740 1482 937 639 501 463 1386 24 654

a

Non-overlapping copies of pyknons in CR + UTRs

nt Coverage of non-overlapping copies of pyknons in CR + UTRs

Real patterns among decoy CR + decoy UTRs (false pyknons)a

Non-overlapping copies of false pyknons in decoy CR + decoy UTRsa

12 832 4424 1758 764 433 287 135 132 480 21 245

205 312 75 208 31 644 14 516 8660 6027 2970 3036 11 520 358 893b

2184  79 293  44 35  21 54 11 0 0 0 0 2518  141

1790  24 206  28 28  5 51 11 0 0 0 0 2030  50

Average and standard deviation of trials against 4 random databases.

A. thaliana DNA sequence patterns are distinct BLASTCLUST was used to compare each pyknon against the set of pyknons and their reverse complement sequences. When the default score threshold was used (S = 1.75), minimum length coverage set at 0.9 and word size set at 14, there were 23 687 clusters produced. Other parameters were tested but they did not appreciably change the number of clusters. Thus, the high number of clusters generated from 24 654 pyknons revealed that the small patterns were not just parts of longer patterns. In other words, a very high proportion of the pyknon sequences are unique. Hence, there is little primary relationship between the sequences other than that they were first found to exist 10 or more times in the intergenic DNA sequences and then shown to have instances in the CR or UTRs of different genes.

b

0.73% of all CR + UTR nts.

basis than that used in the human genome study. Finally, we utilized standard deviation to establish significance and used decoy-databases to independently verify the significance. There was no available similar false-match statistics associated with the human analysis that could enable us to set our pattern finding definition empirically. Thus, there is no way to determine if our methods are directly comparable. Memory constraints for the 32-bit Linux operating system currently prohibit us from analyzing the much larger human genome with 32-bit compatible-PTTRNFNDR and directly reconciling the two studies. Thus, we can only evaluate the statistics. In our case the strength of our statistics is bolstered by the independent analysis of decoys. Consequently, it can be argued that while the differences between the genomes and the methods are drastic, the similar statistics between the two studies favor a legitimate relationship between genic and intergenic DNA.

A. thaliana pyknons are associated with genes

More on pyknons, TEs and repeat sequences

45.8% of known and predicted genes in version 8.0 of the A. thaliana genome (CR and UTRs) contain at least one pyknon instance (12 355 out of 26 990 gene records). Deeper analysis shows that not all genes with pyknons have them in all three regions. This is partly explained by the fact that there are an incomplete number of 5 0 UTR and 3 0 UTR records to complement all CR records. Despite the unequal representation, 21 245 non-overlapping copies of pyknons cover 4.66, 2.58 and 0.370% of the total nt in the 5 0 UTR, 3 0 UTR and CR regions, respectively. If we consider only the subset of CR records with cognate 5 0 and 3 0 UTRs, pyknon regional representation is nearly the same (4.70%, 2.59% and 0.335%). This demonstrates a location bias for pyknons in transcribed, but non-translated sequences. By comparison, the human genome is biased for pyknons in non-translated 3 0 UTR regions.9 Our findings link 45.8% of the genes to the non-coding DNA sequences that comprise 60.5% of the A. thaliana genome. This gene coverage is less than the reported 90% coverage in the human genome.9 However, there are some differences between the analyses that may have contributed to different statistics. First, PTTRNFNDR is very different than the algorithm used in the human study. Next, our definition that a pattern must appear 10 or more times in the non-coding DNA is more conservative on a pattern-per-amount of DNA

It could be argued by searching intergenic DNA for repeat sequences and then comparing those sequences to coding DNA that we have merely rediscovered known TEs or poorly annotated TEs. To rebut this notion, we point out that many of the repeated sequences of A. thaliana heterochromatic and centromeric regions were examined in detail as early as the year 2000.15,16 This knowledge and other annotations have been incorporated into multiple revisions of the A. thaliana genome annotation. We used the most recent version 8.0 which has been touted as having improved TE annotation. In fact, the set of 26 990 A. thaliana genes examined specifically did not include 4759 TEs, TE-associated genes and pseudogenes, meaning that we have not simply matched sequences which are most commonly associated with repeats. It is unclear if such differentiation was performed in the human analysis, meaning that the human coverage may include a small but real number of human TE-associated or retroelements.17 Notwithstanding this, these differences and distinctions do not diminish the findings here or in the other study. Rather, they reinforce the primary notion that transcribed and protein-coding gene sequences central to the function of a cell, as opposed to TEs or pseudogenes, are repeated at a statistically significant level throughout non-coding intergenic and intron DNA in the genomes of Homo sapiens and A. thaliana.

This journal is

 c

The Royal Society of Chemistry 2009

Mol. BioSyst., 2009, 5, 1679–1687 | 1681

To further support the idea that we did not simply identify TE-associated patterns in the CR gene set, we used RepeatMasker to mask the DNA sequences in intergenic, intron, CR and UTR regions with homology to the A. thaliana repeat and low-complexity sequences in Repbase.18 There is one parametric change (-s) that alters RepeatMasker match stringency but we found no appreciable difference in results by altering this parameter. When applying the stronger stringency thereby leading to more accurate repeat sequence matching, we found that 2864 pyknons appeared exclusively in repeat regions. However, 1571 pyknons appeared exclusively in repeat-free regions. In comparison to humans where only 79 out of 127 998 human pyknons existed solely in repeat-free regions, A. thaliana has a much greater proportion of its pyknons exclusively in repeat-free regions. In both repeat and repeat-free regions, there were 20 219 pyknons. Strictly speaking, it could be said that there is an unexpected large degree of sequence relatedness between repeat and repeat-free DNA regions. Broadly speaking, 88.38% of the non-redundant pyknons discovered do exist in repeat-free regions. Genes with multiple pyknon instances 5149 genes contain multiple numbers of distinct pyknons. Specifically, there are 76 genes with 7 or more instances (Fig. 1). KAN1 (AT5G16560), which encodes a myb family transcription factor that regulates organ polarity,19 has 7 pyknons dispersed throughout its CR and UTRs. If introns are included, there are 16 pyknons in the prospective, transcribed, unspliced message (Fig. 2A). Other genes, such as HIGH-LEVEL EXPRESSION OF SUGAR-INDUCIBLE GENE 2 (AT2G30470) and TEOSINTE BRANCHED1, CYCLOIDEA AND PCF (TCP) 14 (AT3G47620; Fig. 2B), and genes encoding DNA-binding proteins (AT5G28300; Fig. 2C), protein kinases (AT1G29750), and proteins with a zinc-finger (AT2G04500, AT3G13810), heavy-metal-associated

(AT3G05220) and RNA binding (AT5G55550) motifs are also in this set. These genes are known to be expressed given cDNA or EST data at http://www.arabidopsis.org. Among the set with 7 or more instances, there are 29 genes strictly annotated as hypothetical or having unknown function. Although analysis by Rigoutsos et al.,9 showed that some gene classes have over- and under-representations of pyknon instances, we did not perform such an analysis because it does not consider the relationship between repeat patterns in the protein coding sequences, amino acid order, protein function and pyknon instance genic/intergenic relationships. Genomic positioning of A. thaliana pyknons The distances between the starting points of 2 or more consecutive pyknon instances in CR and UTR sequences were examined and not found to be biased (Fig. 3). This trait finding is a departure from that of human pyknons which have relative positioning favoring 22 and 24 nt. Perhaps this discrepancy exists because A. thaliana pyknon density is less than that for humans. Additionally, we evaluated pyknon density in a way not examined in humans. The distributions of all pyknon instances across the five A. thaliana chromosomal sequences were plotted as empirical probability density functions in an attempt to find ‘‘hot-spots’’ of pyknon appearance in relation to other genomic features such as centromeres and the location of DNA encoding small RNAs, TEs and genes (Fig. 4). Although pyknons have instances in nearly half of all genes, the highest pyknon densities in the genome are near centromeric regions where gene density is sparse. This is not surprising since this can be viewed as a function of part of the definition of pyknons which says that 10 or more copies of these patterns have to exist in non-coding regions. This might lead one to assume that only genes near centromeres or TEs have great numbers of pyknons instances. This assumption would be

Fig. 1 A. thaliana chromosome map of genes containing 7 or more pyknons.

1682 | Mol. BioSyst., 2009, 5, 1679–1687

This journal is

 c

The Royal Society of Chemistry 2009

Fig. 2 Nucleotide sequence of genes and associated pyknons. Pyknons are in lower case letters and are alternatively highlighted in blue and green.

incorrect. Only 4 out of 76 genes with 7 or more pyknon instances are localized close to centromeres where gene density is low and where small RNA and transposon density is high. The rest reside in chromosomal arms where gene density is high and the small RNA, transposon and regional pyknon position density is low (Fig. 1 and 4). This proves that while overall pyknon distribution throughout the genome appears to be heavily weighted to areas near centromeres where gene complexity is low and TE and small RNA occurrence is high, This journal is

 c

The Royal Society of Chemistry 2009

there is no overt link between pyknon gene density and gene position. In short, a gene does not have to be in a pyknon dense area to have pyknons. Pyknons match siRNAs Since the areas of greatest pyknon density also overlapped the areas of greatest small RNA position density, we sought to determine the relatedness between pyknons and small RNAs, Mol. BioSyst., 2009, 5, 1679–1687 | 1683

Fig. 3 The empirical probability density distribution for the distances in nucleotides between consecutive pyknon instances within genes. (A) The numbers of nucleotides between the ending position of pyknons and the starting positions of the following pyknons (i.e. the space between pyknons). (B) The numbers of nucleotides between the starting positions of the consecutive pyknon instances (i.e. the length of a pyknon plus the space to the next pyknon). The black line delineates the distances for pyknons found in the combined datasets for 3 0 UTR, 5 0 UTR and CR sequences, while the blue, red and green ones show the distances in these sets of sequences separately. Only parts of the distributions are shown due to the long tails of the distributions. The predominant peak at 24 nt is an artifact of the pattern discovery process constrained to a maximum length of 24 nt which thereby favors this limit instead of longer patterns (e.g. 26 nt) not considered in the search.

Fig. 4 The empirical probability density distributions of pyknon, transposon, siRNA, and gene instances across A. thaliana chromosomes. Centromere relative positioning is shown as a sphere in each line above the graphs.

especially since a preponderance of pyknons are not solely associated with TE-related repeat sequences. Small RNAs are 20–25 nt molecules shown to play important roles in gene expression regulation via transcriptional, post-transcriptional and related gene silencing mechanisms.20 Most small RNAs found in A. thaliana are either single-strand miRNAs21 or double-strand short-interfering RNAs (siRNA).22 In order to determine a relationship, we compared 206 077 known A. thaliana small RNA sequences to the A. thaliana pyknons using BLAST.23 When we required all nts of pyknon queries to match the subject (forward or reverse) and considered no gaps, we found no significant matching between pyknons and 120 annotated A. thaliana miRNAs. When 30% mismatch was allowed, as was allowed in the human pyknon analysis, 54 A. thaliana pyknons were aligned. Therefore, by relaxing the alignment conditions, a better association between pyknons 1684 | Mol. BioSyst., 2009, 5, 1679–1687

and miRNAs was made. Consequently, we find it difficult to determine if there is a bona fide sequence relationship between pyknons and miRNAs or if the relationship is merely a product of relaxed stringency. Furthermore, we now question the veracity of the human pyknon-miRNA relationship, which was predicated on the clustering of human pyknons with 1087 out of 2634 miRNAs from the 2005 Rfam database.9 Our doubts arise from their data showing that a mere 689 out of 127 998 pyknons (o1%) were involved in the clusters and from the fact that the 2634 miRNAs searched actually comprised sequences from humans, pigs, mice, viruses, rice and A. thaliana. It was not reported exactly how many of the human pyknons clustered with human miRNAs, but their data reveal that nearly as many human pyknons (50) clustered with A. thaliana miRNAs in their study as did A. thaliana pyknons (54) with A. thaliana miRNAs in our study. In summary, we This journal is

 c

The Royal Society of Chemistry 2009

believe that the relationship between pyknons and miRNAs is not as strong as once suggested. Although we found little relationship between A. thaliana pyknons and miRNAs, we detected a stronger relationship to siRNAs. When considering no mismatches for alignment (perfect or near-perfect complementarity is thought to be critical for siRNA function24), we found that 143 were exactly the same as A. thaliana siRNA sequences and 1,409 were wholly contained within siRNA sequences. When one mismatch was considered, an additional 1228 pyknons were aligned. When we compared the four sets of false pyknons (comprising a total of 10 105 sequences) to the siRNA database, there were no perfect or single-nt mismatch alignments. Thus these data show that there is a strong relationship between nearly 11% of A. thaliana pyknons and known A. thaliana siRNA sequences. It is worth noting that the small RNA database that we searched was composed only of sequences 20 nt or greater22 and that pyknons range from 16–24 nt. It is possible that plants produce smaller siRNAs with identity to pyknons, but there is no way to know if these match pyknons since smaller siRNAs were excluded from the database.22 Notwithstanding this, we argue that A. thaliana pyknons have sequence identity to some siRNAs, map to regions where siRNA density is high, and can be found in nearly half of A. thaliana genes. Interestingly, 15% of the A. thaliana protein coding genes have 1 to 5 siRNA loci and that the loci are weighted to 5 0 and 3 0 UTRs of transcribed genes.22 Thus, these striking similarities between pyknons and siRNAs suggest that siRNA biogenesis from transcribed genes could be the origin of some repeats in intergenic and intron regions. Since endogenous human siRNAs are generally unknown at this date25 and since the list of human pyknon sequences was not published, it is not possible to conclude if the relationship between pyknons and siRNAs is unique to A. thaliana or a more general property associated with pyknons. Intergenic pyknons differ from intronic pyknons Non-overlapping pyknons and their reverse complements exist 126 911 and 31 679 times in forward intergenic and intron sequences, respectively. This represents about 4.34% and 2.09% of all the nt space of these non-coding regions, or 3.58% of all non-coding DNA. By comparison, 16% of human non-coding regions comprise pyknons.9 Recall that there is 40 times less non-coding DNA in the A. thaliana genome than in the human genome. So, the percent coverage of A. thaliana pyknons across non-coding DNA is remarkable. The data imply that a significant portion of non-coding DNA may have its origin in transcribed genes. Our pyknon discovery focused on patterns from combined intergenic and intron sequences. This analysis relies on an assumption that general properties are shared among the patterns in intergenic and intron sequences. However, this assumption may be naı¨ ve, because all non-coding introns are parts of transcribed genes whereas the same cannot be said for intergenic sequences. So, we separately analyzed intergenic and intron sequences to see if patterns from these regions exhibited different features. Interestingly, intergenic searching produced 460 497 patterns 16–24 nt long and yielded 30 640 This journal is

 c

The Royal Society of Chemistry 2009

pyknons (9.20  e3 pattern per nt, 6.12  e4 intergenic pyknon per nt). A separate intron analysis yielded 13 885 patterns 16–24 nt long and 4759 pyknons (5.46  e4 pattern per nt, 1.87  e4 intronic pyknon per nt). The difference between pyknon/pattern densities (rate per nt) was not caused by the size difference of the sequence datasets. When half of the records of CR and UTR regions plus their reverse complements were culled to make a mock file with a similar number of nts as the intron sequence file, 12 755 patterns were found (3.19  e4 pattern per nt). This suggests that intron patterns have features more similar to genes than to intergenic regions. It is possible that these differences between intronic pyknons and intergenic pyknons point to specialized biological roles of these patterns in the two different types of non-coding regions. Alternatively, intergenic regions may just be more tolerant of pyknon accumulation.

Conclusion We report a previously unknown relationship between intergenic regions and genic regions of A. thaliana through a statistical analysis of sequence patterns that exist between these regions. Truly, the unequal and repeated distribution of coding DNA sequences among non-coding DNA is a perplexing observation. Because non-random, biological patterns indicate the underpinnings of order and important relevant processes, it is possible that the set of A. thaliana pyknons reveal a novel link between two seemingly disparate genomic regions. Pyknon patterns have been discovered in only one other genome, the human genome, but there are remarkable overlaps between the features of human and A. thaliana pyknons, especially in light of the fact that the organisms have the same number of genes but little sequence similarity and have a great difference in the amount of non-coding DNA between them. Pyknons in both organisms (1) are a distinct set of intergenic DNA sequences that are also in many genes; (2) are found in coding and UTR regions of genes, with a greater number of instances in UTRs; (3) comprise a substantial amount of total nts in the intergenic regions; (4) are not necessarily recapitulations of TE-associated repeat elements. In addition to these similarities, both human pyknons and A. thaliana pyknons are linked to small RNAs involved in gene silencing. However, one important distinction is that unlike human pyknons which purportedly have similarity to miRNAs, A. thaliana pyknons are more similar to siRNAs in terms of sequence identity and local positioning in chromosomes and UTRs of genes.22 Given these sequence relationships and the mapping associations between pyknons, siRNAs and TEs, pyknons may portend silencing of both genes and TEs via siRNAs generated from long double-stranded RNAs, rather than fold-back sequences of single-strand miRNA transcripts, as predicted in the human analysis. This prospect prompts some interesting questions. Since plants use siRNAs to quell the expression of transposons and other repeat sequences,26 and since siRNAs are important for RNA-dependent DNA methylation and heterochromatin formation,27,28 could the heterochromatic siRNAs also silence the expression of genes with pyknon instances? Or might there be a specific mechanism Mol. BioSyst., 2009, 5, 1679–1687 | 1685

that prevents heterochromatic siRNAs from silencing genes with related sequences? Our data reveal that pyknons are not just a phenomenon of the human genome, which has more than 98% non-coding DNA. In fact, our observance of the existence of pyknons for the first time in a plant genome suggests that there is a real biological mechanism that drives the appearance of sequences of coding DNA into intergenic regions across kingdoms. The biogenesis of siRNAs may play a role in this. There are sets of genes that generate complementary RNA transcripts and there is global antisense transcription throughout the A. thaliana genome.29–31 Such sense and antisense expressions could lead to the accumulation of long dsRNA templates suitable for siRNA biogenesis pathways.10 Given the fact that siRNAs from protein-coding genes can be made from the same pathway that includes the formation of heterochromatic siRNAs,22 it is highly plausible that many genes have the potential of generating siRNAs. In the events where siRNAs did temper gene expression, secondary siRNAs could become amplified to further target mRNAs.32,33 Certainly in a way to which Rigoutsos et al.9 alluded, retrotransposases or retroviruses have the potential to convert abundant dsRNAs and integrate their products into a genome over time. Alternatively, TEs may randomly pull DNA sequences with them causing protein-coding gene sequences to move into intergenic regions adjacent to other TEs.34 Either way, repeated integration into non-coding regions would drive the accumulation of these progenitor coding sequences. This action would be manifested as genome expansion if there was no selection pressure against it and mutation would lead to shorter stretches of sequence identity over time. Transcription of these repeated regions in both coding and non-coding DNA would produce RNAs exhibiting the dual, rather than opposed,35 properties of being part of an organized layer of genome regulation and being part of transcriptional noise. Just as small RNA complementarity to genes has been used to identify genes that are silenced, our observations suggest that pyknon instances may help reveal which genes are influenced by RNA silencing or have potential sites of binding for small RNAs. With respect to the most recent experimentation, siRNA distribution in protein coding genes does not appear to reflect an overt effect on gene expression.22 However, as it stands, 15% of protein coding genes have sequences that match siRNAs22 which indicates siRNAs are made from genes. And since they are, they could be propagated. Therefore, we conclude that nearly half of all genes may have siRNAgenerating loci because these genes have pyknon instances and because 11% of the pyknons are related to siRNA sequences that have been discovered so far.

Experimental

letters were converted to ‘N’ (Table S1 in the ESIz). All mitochondrion and chloroplast genome encoded gene records prefixed by ATC or ATM were deleted from the files. Sequences owing to splice variation were removed from intron, CR, 5 0 UTR and 3 0 UTR files. Pattern finding A fasta format file consisting of the adjusted A. thaliana intergenic and intron sequences and their reverse complements was processed as input by PTTRNFNDR (program in the ESIz).14 Once patterns were detected, they were inserted into a dictionary along with their estimated probabilities of appearance. The dictionary patterns and their probabilities were updated through multiple iterations and examinations of different pattern length. Statistical significance of pattern appearance was measured by the difference between the observed numbers of the pattern instances and their expected values, divided by the standard deviations. For these data files, a string qualified as a pattern if it was 16–24 nt long, appeared at least 10 times and if its frequency was 4 standard deviations more often than its probability of existence suggested. PTTRNFNDR was also used to process separate files of intergenic (forward and reverse complement) and intron (forward and reverse complement) sequences. In these cases, since the input datasets were smaller, a string only needed to appear 5 times to be qualified as a pattern, as long as it appeared 4 standard deviations more often than what the probability suggested. Pyknon analysis A pyknon sequence could not contain two or more copies of the letter ‘N’. The chromosomal loci of small RNA were downloaded from the Arabidopsis Small RNA Project Database in August, 2007,23 the TE loci were obtained from RepBase in September, 2007,18 and the centromere loci obtained from Round et al., 1997.36 The gene loci used were the ones associated with the header of each sequence. All empirical probability density functions were created using version 2.5.1 of R (http://www.R-project.org). BLASTCLUST (ftp://ftp.ncbi.nih.gov/blast/executables/) was used to cluster the pyknon sequences to each other while BLAST was used to compare pyknons to sequences in the small RNA database. Version open-3.1.8 of RepeatMasker was obtained from http://repeatmasker.org.

Abbreviations TE NT CR UTR miRNA siRNA

Transposable elements Nucleotides Intronless protein coding region sequences Untranslated region Micro RNA Small-interfering RNA

DNA sequences Version 8.0 of the A. thaliana genomic DNA sequences was downloaded from http://www.arabidopsis.org/. Sequences for the five chromosomes and subset sequences consisting of intergenic, intron, CDS (CR), 5 0 UTR, and 3 0 UTR regions were used for experimentation. All non-‘A’, ‘C’, ‘G’, or ‘T’ 1686 | Mol. BioSyst., 2009, 5, 1679–1687

Acknowledgements This work was funded by a specific cooperative agreement between the Johns Hopkins University and USDA-ARS, Beltsville, MD and by a National Research Initiative grant award to BC from USDA-CSREES (no. 2005-35605-15392). This journal is

 c

The Royal Society of Chemistry 2009

References 1 T. J. Kelly, Jr and H. O. Smith, A restriction enzyme from Hemophilus influenzae. II, J. Mol. Biol., 1970, 51(2), 393–409. 2 Chimpanzee Sequencing and Analysis Consortium, Initial sequence of the chimpanzee genome and comparison with the human genome, Nature, 2005, 437(7055), 69–87. 3 T. Itoh, T. Tanaka, R. A. Barrero, C. Yamasaki, Y. Fujii, P. B. Hilton, B. A. Antonio, H. Aono, R. Apweiler, R. Bruskiewich, T. Bureau, F. Burr, A. Costa de Oliveira, G. Fuks, T. Habara, G. Haberer, B. Han, E. Harada, A. T. Hiraki, H. Hirochika, D. Hoen, H. Hokari, S. Hosokawa, Y. I. Hsing, H. Ikawa, K. Ikeo, T. Imanishi, Y. Ito, P. Jaiswal, M. Kanno, Y. Kawahara, T. Kawamura, H. Kawashima, J. P. Khurana, S. Kikuchi, S. Komatsu, K. O. Koyanagi, H. Kubooka, D. Lieberherr, Y. C. Lin, D. Lonsdale, T. Matsumoto, A. Matsuya, W. R. McCombie, J. Messing, A. Miyao, N. Mulder, Y. Nagamura, J. Nam, N. Namiki, H. Numa, S. Nurimoto, C. O’Donovan, H. Ohyanagi, T. Okido, S. Oota, N. Osato, L. E. Palmer, F. Quetier, S. Raghuvanshi, N. Saichi, H. Sakai, Y. Sakai, K. Sakata, T. Sakurai, F. Sato, Y. Sato, H. Schoof, M. Seki, M. Shibata, Y. Shimizu, K. Shinozaki, Y. Shinso, N. K. Singh, B. Smith-White, J. Takeda, M. Tanino, T. Tatusova, S. Thongjuea, F. Todokoro, M. Tsugane, A. K. Tyagi, A. Vanavichit, A. Wang, R. A. Wing, K. Yamaguchi, M. Yamamoto, N. Yamamoto, Y. Yu, H. Zhang, Q. Zhao, K. Higo, B. Burr, T. Gojobori and T. Sasaki, Curated genome annotation of Oryza sativa ssp. japonica and comparative genome analysis with Arabidopsis thaliana, Genome Res., 2007, 17(2), 175–183. 4 M. Kellis, N. Patterson, M. Endrizzi, B. Birren and E. S. Lander, Sequencing and comparison of yeast species to identify genes and regulatory elements, Nature, 2003, 423(6937), 241–254. 5 U. Bonas, H. Sommer and H. Saedler, The 17-kb Tam1 element of Antirrhinum majus induces a 3-bp duplication upon integration into the chalcone synthase gene, EMBO J., 1984, 3(5), 1015–1019. 6 N. Fedoroff, S. Wessler and M. Shure, Isolation of the transposable maize controlling elements Ac and Ds, Cell, 1983, 35(1), 235–242. 7 D. L. Hartl, Molecular melodies in high and low C, Nat. Rev. Genet., 2000, 1(2), 145–149. 8 B. S. Gaut and J. Ross-Ibarra, Selection on major components of angiosperm genomes, Science, 2008, 320(5875), 484–486. 9 I. Rigoutsos, T. Huynh, K. Miranda, A. Tsirigos, A. McHardy and D. Platt, Short blocks from the noncoding parts of the human genome have instances within nearly all known genes and relate to biological processes, Proc. Natl. Acad. Sci. U. S. A., 2006, 103(17), 6605–6610. 10 H. Vaucheret, Post-transcriptional small RNA pathways in plants: mechanisms and regulations, Genes Dev., 2006, 20(7), 759–771. 11 A. T. Willingham and T. R. Gingeras, TUF love for ‘‘junk’’ DNA, Cell, 2006, 125(7), 1215–1220. 12 M. Morgante, Plant genome organisation and diversity: the year of the junk!, Curr. Opin. Biotechnol., 2006, 17(2), 168–173. 13 T. A. G. Initiative, Analysis of the genome sequence of the flowering plant Arabidopsis thaliana, Nature, 2000, 408(6814), 796–815. 14 J. Feng, D. Q. Naiman and B. Cooper, Probability-based pattern recognition and statistical framework for randomization: modeling tandem mass spectrum/peptide sequence false match frequencies, Bioinformatics, 2007, 23(17), 2210–2217. 15 The Cold Spring Harbor Laboratory, Washington University Genome Sequencing Center, and PE Biosystems Arabidopsis Sequencing Consortium, The complete sequence of a heterochromatic island from a higher eukaryote, Cell, 2000, 100(3), 377–386. 16 P. F. Fransz, S. Armstrong, J. H. de Jong, L. D. Parnell, C. van Drunen, C. Dean, P. Zabel, T. Bisseling and G. H. Jones, Integrated cytogenetic map of chromosome arm 4S of A. thaliana: structural organization of heterochromatic knob and centromere region, Cell, 2000, 100(3), 367–376. 17 A. Meynert and E. Birney, Picking pyknons out of the human genome, Cell, 2006, 125(5), 836–838.

This journal is

 c

The Royal Society of Chemistry 2009

18 J. Jurka, Repbase update: a database and an electronic journal of repetitive elements, Trends Genet., 2000, 16(9), 418–420. 19 J. M. McAbee, T. A. Hill, D. J. Skinner, A. Izhaki, B. A. Hauser, R. J. Meister, G. Venugopala Reddy, E. M. Meyerowitz, J. L. Bowman and C. S. Gasser, ABERRANT TESTA SHAPE encodes a KANADI family member, linking polarity determination to separation and growth of Arabidopsis ovule integuments, Plant J., 2006, 46(3), 522–531. 20 D. Baulcombe, RNA silencing in plants, Nature, 2004, 431(7006), 356–363. 21 D. P. Bartel, MicroRNAs: genomics, biogenesis, mechanism, and function, Cell, 2004, 116(2), 281–297. 22 K. D. Kasschau, N. Fahlgren, E. J. Chapman, C. M. Sullivan, J. S. Cumbie, S. A. Givan and J. C. Carrington, Genome-wide profiling and analysis of Arabidopsis siRNAs, PLoS Biol., 2007, 5(3), e57. 23 A. M. Gustafson, E. Allen, S. Givan, D. Smith, J. C. Carrington and K. D. Kasschau, ASRP: the Arabidopsis Small RNA Project Database, Nucleic Acids Res., 2005, 33(Database Issue), D637–640. 24 S. W. Ding and O. Voinnet, Antiviral immunity directed by small RNAs, Cell, 2007, 130(3), 413–426. 25 T. Watanabe, Y. Totoki, A. Toyoda, M. Kaneda, S. KuramochiMiyagawa, Y. Obata, H. Chiba, Y. Kohara, T. Kono, T. Nakano, M. A. Surani, Y. Sakaki and H. Sasaki, Endogenous siRNAs from naturally formed dsRNAs regulate transcripts in mouse oocytes, Nature, 2008, 453(7194), 539–543. 26 D. Zilberman and S. Henikoff, Silencing of transposons in plant genomes; kick them when they’re down, Genome Biol., 2004, 5(12), 249. 27 E. Bernstein and C. D. Allis, RNA meets chromatin, Genes Dev., 2005, 19(14), 1635–1655. 28 S. W. Chan, I. R. Henderson and S. E. Jacobsen, Gardening the genome: DNA methylation in Arabidopsis thaliana, Nat. Rev. Genet., 2005, 6(5), 351–360. 29 S. Boi, G. Solda and M. L. Tenchini, Shedding light on the dark side of the genome; overlapping genes in higher eukaryotes, Curr. Genomics, 2004, 5509–524. 30 C. H. Jen, I. Michalopoulos, D. R. Westhead and P. Meyer, Natural antisense transcripts with coding capacity in Arabidopsis may have a regulatory role that is not linked to double-stranded RNA degradation, Genome Biol., 2005, 6(6), R51. 31 K. Yamada, J. Lim, J. M. Dale, H. Chen, P. Shinn, C. J. Palm, A. M. Southwick, H. C. Wu, C. Kim, M. Nguyen, P. Pham, R. Cheuk, G. Karlin-Newmann, S. X. Liu, B. Lam, H. Sakano, T. Wu, G. Yu, M. Miranda, H. L. Quach, M. Tripp, C. H. Chang, J. M. Lee, M. Toriumi, M. M. Chan, C. C. Tang, C. S. Onodera, J. M. Deng, K. Akiyama, Y. Ansari, T. Arakawa, J. Banh, F. Banno, L. Bowser, S. Brooks, P. Carninci, Q. Chao, N. Choy, A. Enju, A. D. Goldsmith, M. Gurjal, N. F. Hansen, Y. Hayashizaki, C. Johnson-Hopson, V. W. Hsuan, K. Iida, M. Karnes, S. Khan, E. Koesema, J. Ishida, P. X. Jiang, T. Jones, J. Kawai, A. Kamiya, C. Meyers, M. Nakajima, M. Narusaka, M. Seki, T. Sakurai, M. Satou, R. Tamse, M. Vaysberg, E. K. Wallender, C. Wong, Y. Yamamura, S. Yuan, K. Shinozaki, R. W. Davis, A. Theologis and J. R. Ecker, Empirical analysis of transcriptional activity in the Arabidopsis genome, Science, 2003, 302(5646), 842–846. 32 M. J. Axtell, C. Jan, R. Rajagopalan and D. P. Bartel, A two-hit trigger for siRNA biogenesis in plants, Cell, 2006, 127(3), 565–577. 33 J. Pak and A. Fire, Distinct populations of primary and secondary effectors during RNAi in C. elegans, Science, 2007, 315(5809), 241–244. 34 N. Jiang, Z. Bao, X. Zhang, S. R. Eddy and S. R. Wessler, PackMULE transposable elements mediate gene evolution in plants, Nature, 2004, 431(7008), 569–573. 35 J. S. Mattick and I. V. Makunin, Non-coding RNA, Hum. Mol. Genet., 2006, 15(Review Issue 1), R17–29. 36 E. K. Round, S. K. Flowers and E. J. Richards, Arabidopsis thaliana centromere regions: genetic map positions and repetitive DNA structure, Genome Res., 1997, 7(11), 1045–1053.

Mol. BioSyst., 2009, 5, 1679–1687 | 1687