Dispersed repeats in plant genomes

Chromosoma (1991) 100 : 355-359

CHROMOSOMA © Springer-Verlag 1991

Chromosoma Focus

Dispersed repeats in plant genomes*

David R. Smyth

Department of Genetics and Developmental Biology, Monash University, Clayton, Victoria 3168, Australia

* This article is in some aspects based on contributions made at the 4th International Conference on Arabidopsis Research held in Vienna in 1990

Plant genomes

The amount of DNA in an unreplicated haploid cell (the C value) is relatively constant within a species. However, in higher plants it is particularly variable between species, ranging over nearly three orders of magnitude (Bennett and Smith 1976). The lowest amount recorded to date is in the ephemeral crucifer, Arabidopsis thaliana. Results from reannealing experiments originally suggested that this species has around 70000 kb per genome (Leutwiler et al. 1984), although recent data from genomic libraries have put it closer to 100000 kb. At the top of the scale lie various monocot species. For example, the true lilies (Lilium species) have around 30-40 million kb per genome, while the record belongs to Fritillaria species, also bulbous monocots but with about twice the level of their Lilium relatives. Thus plant genomes can range in size from 100000 to nearly 100000000 kb! And yet the structural and developmental complexity of plant species with the lowest amounts of DNA per cell is not fundamentally different from those with the highest.

The number of different genes required to achieve this complexity is likely to be similar whatever the genome size. The number of genes translated overall in the mature tobacco plant is estimated to be around 60000 (Kamalay and Goldberg 1980). If, as examples cloned to date suggest, such genes usually occur once in the genome, and assuming that their processed transcripts are about 1.25 kb long on average, then about 75000 kb of DNA sequence is required for coding capacity of translated genes. This is close to all the possible coding capacity of the Arabidopsis genome (remembering that it does not include introns or the 6% of the genome likely to be involved in ribosomal RNA production (Pruitt and Meyerowitz 1986)). But if plants with 1000 times more DNA per cell have the same coding requirements, what is the role of the other 99.9% of their DNA sequences?
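The coding-capacity estimate can be reproduced with a few lines of arithmetic. The following is a minimal sketch using only the figures quoted in the text; the function and variable names are illustrative, and the calculation ignores introns and ribosomal RNA genes, as the text does.

```python
# Figures quoted in the text (Kamalay and Goldberg 1980, and the genome sizes cited above).
GENES_EXPRESSED = 60_000         # genes translated in the mature tobacco plant
MEAN_TRANSCRIPT_KB = 1.25        # assumed average processed transcript length
ARABIDOPSIS_GENOME_KB = 100_000  # recent estimate for the smallest plant genome


def coding_requirement_kb(n_genes: int, mean_transcript_kb: float) -> float:
    """DNA needed to encode n single-copy genes, ignoring introns."""
    return n_genes * mean_transcript_kb


def coding_fraction(genome_kb: float, coding_kb: float) -> float:
    """Share of a genome taken up by the coding requirement."""
    return coding_kb / genome_kb


coding_kb = coding_requirement_kb(GENES_EXPRESSED, MEAN_TRANSCRIPT_KB)
small = coding_fraction(ARABIDOPSIS_GENOME_KB, coding_kb)
large = coding_fraction(1000 * ARABIDOPSIS_GENOME_KB, coding_kb)

print(f"coding requirement: {coding_kb:.0f} kb")
print(f"share of a 100000 kb genome: {small:.1%}")
print(f"share of a genome 1000 times larger: {large:.3%}")
```

Under these assumptions the same coding requirement fills three quarters of the smallest genome but less than 0.1% of one a thousand times larger, which is the disparity the question above points to.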

Repeated sequences

Since their discovery in the 1960s, the contribution of repeated sequences to genomes has been a focus of attention. Reannealing experiments showed early on that most genomes carry a range of different repeats occurring in different numbers of copies. Thus it was thought that variation in genome size might be limited to variation in the fraction attributable to these repeats, superimposed on a constant baseline of single copy sequences. However, clearly this is not so. The amount of single copy sequences also increases (Hutchinson et al. 1980). Nevertheless, species with small genomes do, as a rule, have a lower relative proportion of repeats (Flavell et al. 1974). Only about 15% of the A. thaliana nuclear genome reanneals significantly faster than single copy sequences (Leutwiler et al. 1984), while in cereals with very large genomes, this fraction lies between 75% and 90%.

Reductionists turned to the characterization of individual repeats in the hope that their significance in genomes would be revealed. Firstly, the properties of tandemly repeated satellite DNAs were uncovered. These are usually transcriptionally inactive, and large blocks are preferentially located in constitutive heterochromatin (C-bands). The fraction of a genome attributable to satellite DNA varies greatly, but species with very large genomes have nowhere near enough to account for the excess DNA. For example, in the tiny genome of A. thaliana about 1.5% is made up of a specific satellite sequence (Martinez-Zapater et al. 1986), and additional tandem repeats may be present in the 12.5% of its chromatin that is C-banded (Schweizer et al. 1987). By contrast, C-bands contribute less than 10% to the huge chromosomes of 20 Lilium species sampled (Smyth et al. 1989b).

Much of the remaining repeated DNA apparently occurs in the form of dispersed repeats. Such sequences were initially more difficult to characterize individually because of their association with diverse neighbouring sequences. However, four approaches have been successful. Firstly, particularly abundant sequences have been accessible through consensus internal restriction sites. Short and long interspersed repeats (SINEs and LINEs) of mammals were initially identified in this way. Secondly, some sequences with abundant transcripts turned out to be dispersed repeats, such as the copia and 412 elements of Drosophila melanogaster. Thirdly, isolation of sequences associated with genetic instability or mobility has uncovered a range of dispersed repeats. For example, the Ac transposon of Zea mays was originally cloned through its association with various unstable mutations. Also the P element transposon and the I element retrotransposon of D. melanogaster were initially detected through being the cause of hybrid dysgenesis. Finally, various dispersed repeats have been cloned by chance. They were stumbled across in the course of characterizing a piece of DNA of interest for other reasons.

In hindsight, it seems intuitive that sequences that are dispersed at many scattered locations are likely to be mobile elements, and this is consistent with the picture emerging from the study of dispersed repeats in plants.

Categories of dispersed repeat

Now that the sample size of known dispersed repeats from animals, fungi and plants is building up, it is clear that they fall into a limited number of classes. The major dichotomy depends on their mode of amplification and movement, either via a DNA or an RNA intermediate. Here the former will be called transposons and the latter retroelements.

Transposons carry a transposase gene as well as DNA sequences recognized by the transposase and necessary for transposition. Non-autonomous versions can also occur. These carry only the recognition sequences and rely on a trans-coded transposase.

On the other hand, retroelements include a much wider spectrum of sequences. Recent evolutionary studies have suggested that the ancestral retroelement might have been one that encoded its own reverse transcriptase, RNA binding activity and means for integrating into genomes (Xiong and Eickbush 1988, 1990; Doolittle et al. 1989). It seems convenient to call all elements with this structure retrotransposons (Xiong and Eickbush 1988), although the term was initially coined for the Ty1/copia class of retroelements alone (Boeke et al. 1985). If this is done, retrotransposons fall into two major categories depending on whether long terminal repeats (LTRs) are produced during amplification. The non-LTR class is the less specialized and likely to be more closely related to a common ancestral sequence. It includes the abundant LINE elements of mammals. Members of the other class have been called LTR retrotransposons, or viral retrotransposons because of their close relationship to retroviruses. The latter contain an additional gene for a coat protein, which allows them actively to escape from their cellular milieu. There is increasing evidence that retroviruses arose in vertebrates relatively recently from one or more LTR retrotransposons (Xiong and Eickbush 1988, 1990; Doolittle et al. 1989).

Other retroelements include some that encode their own reverse transcriptase but little else. More important in the assemblage of known dispersed repeats are retroelements that have been passively copied into DNA by reverse transcriptase activity originating elsewhere. The best known of these are the SINEs (including the Alu sequences of vertebrates) and processed pseudogenes.
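The classification just described amounts to a short decision tree. The sketch below is one way to write it down; the `Element` record, its field names and the class labels are hypothetical conveniences for illustration, not an established scheme, and the example feature assignments follow the descriptions given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Hypothetical feature record for a dispersed repeat (field names are illustrative)."""
    name: str
    rna_intermediate: bool                       # moves via RNA (True) or DNA (False)
    encodes_transposase: bool = False
    encodes_reverse_transcriptase: bool = False
    encodes_integration: bool = False            # RNA binding and integration functions
    has_ltrs: bool = False
    encodes_coat_protein: bool = False


def classify(e: Element) -> str:
    """Assign an element to one of the classes described in the text."""
    if not e.rna_intermediate:
        if e.encodes_transposase:
            return "transposon (autonomous)"
        return "transposon (non-autonomous)"
    if not e.encodes_reverse_transcriptase:
        return "passively copied retroelement"
    if not e.encodes_integration:
        return "other retroelement"
    if not e.has_ltrs:
        return "non-LTR retrotransposon"
    if e.encodes_coat_protein:
        return "retrovirus"
    return "LTR retrotransposon"


examples = [
    Element("Ac", rna_intermediate=False, encodes_transposase=True),
    Element("copia", rna_intermediate=True, encodes_reverse_transcriptase=True,
            encodes_integration=True, has_ltrs=True),
    Element("LINE", rna_intermediate=True, encodes_reverse_transcriptase=True,
            encodes_integration=True),
    Element("Alu", rna_intermediate=True),
]

for element in examples:
    print(f"{element.name}: {classify(element)}")
```

The order of the tests mirrors the text: the DNA or RNA intermediate is the major split, and the presence of LTRs and of a coat protein gene only matters for elements that encode their own amplification machinery.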

Plant dispersed repeats

Which of these known classes of dispersed repeat have been detected in plant genomes? Firstly, a wide range of transposons have been recovered (see Gierl and Saedler 1989 for a recent review). Most seem to fall into two major families. The first includes Ac/Ds of maize and Tam3 of Antirrhinum, while examples of the second family are En/Spm of maize, Tam1 of Antirrhinum, Tgm1 of soybean and Pis1 of pea. Even though these elements may be important in generating genetic diversity, they do not seem to amplify beyond several hundred copies per genome. Given their relatively small size (the largest is Tam1 at 15 kb), it seems unlikely that they will contribute significantly as a group to the larger plant genomes.

More recently, a number of plant retrotransposons have been discovered (Table 1), and some of these account for several percent of even the large genomes. Firstly, seven LTR retrotransposons have been recovered and fully sequenced. This class can in turn be subdivided into two families, the copia group and the gypsy group (Doolittle et al. 1989). These are named after reference elements found initially in D. melanogaster. They differ in the order of functional regions in their polyprotein. It seems likely that there was an ancient rearrangement in an ancestor of the copia group because these all have the integrase region directly upstream of the reverse transcriptase. In all other elements and the retroviruses the integrase is near the C-terminus. Comparative sequence analysis of the reverse transcriptase region supports this interpretation (Xiong and Eickbush 1990).

Table 1. Retrotransposons in plant genomes (fully sequenced elements only)

Element   Size (kb)   No. copies   Host species           1C genome size (kb × 10^6)   Reference

LTR (or viral) retrotransposons
 Copia group
Ta1       5.2         1-3          Arabidopsis thaliana   0.10                         Voytas and Ausubel (1988)
Tnt1      5.3         >100         Nicotiana tabacum      4.5                          Grandbastien et al. (1989)
Tst1      5.1         1            Solanum tuberosum      1.0                          Camirand et al. (1990)
 Gypsy group
del1      9.3         >13000       Lilium henryi          32                           Smyth et al. (1989a)
IFG       5.9         10000        Pinus lambertiana      42                           D. Kossack et al. (personal communication)
 Not established
Cin1 a    0.69        1000         Zea mays               2.3                          Shepherd et al. (1984)
Bs1 b     3.2         1-5          Z. mays                2.3                          Jin and Bennetzen (1989); Johns et al. (1989)

Non-LTR (or non-viral) retrotransposons
Cin4      >6.8        25-50        Z. mays                2.3                          Schwarz-Sommer et al. (1987)
del2      1.9         240000       Lilium speciosum       29                           P. Leeton and D. Smyth (in preparation)

a Single LTRs only
b Apparently rearranged

In plants, the three known examples of the copia group are not very abundant in their host species. Tnt1 from tobacco occurs in more than a hundred copies per genome (Grandbastien et al. 1989), but the others, including Ta1 from A. thaliana (Voytas and Ausubel 1988), are found in only one or a few copies. By contrast, the two gypsy group elements are particularly abundant (although it must be emphasized that they were chosen for study because of their abundance). The del1 element of Lilium henryi occurs in more than 13000 copies (Smyth et al. 1989a), and it is even more abundant in the related species Lilium longiflorum (Joseph et al. 1990). The IFG element of pines has also amplified to about 10000 copies (D.S. Kossack, R.S. Sederoff and C.S. Kinlaw, personal communication). Because these species have such large genomes, even these copy numbers can only account for a few percent of their hosts' genomes at most.

The two non-LTR retrotransposons characterized from plants so far differ markedly in copy number (Table 1). The Cin4 element of maize was detected as the

cause of polymorphism in a translated gene, and is not very abundant (Schwarz-Sommer et al. 1987). The recently discovered del2 element in the large genome of Lilium speciosum (P.R.J. Leeton and D.R.S., in preparation), however, occurs in nearly one quarter of a million copies. Together these copies account for about 4% of this species' genome. This sequence alone is therefore equivalent to ten copies of the A. thaliana genome being present in each haploid cell of L. speciosum! The LINE family of vertebrates, which are also non-LTR retrotransposons, may likewise contribute several percent to their host's genome. However, because vertebrate genomes are mostly around ten times smaller than those of lilies, the LINE copy number is about ten times less. Truly, in terms of commitment of cellular resources, del2 is probably the most successful retroelement discovered so far!

It is too early to say if any particular class of retrotransposon contributes preferentially to the large plant genomes. The small sample examined to date (Table 1) does show that both gypsy LTR and non-LTR elements can be very abundant even within one genus (Lilium). It will be interesting to see if copia LTR elements are ever amplified to very high numbers.
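The comparison between del2 and the Arabidopsis genome can be checked with a short calculation. The sketch below assumes a 1C genome size of 29 × 10^6 kb for L. speciosum (the value read from Table 1), the "about 4%" share quoted above, and the 100000 kb Arabidopsis genome quoted earlier; the names are illustrative.

```python
LILIUM_SPECIOSUM_GENOME_KB = 29e6  # assumed reading of Table 1 (29 in units of kb x 10^6)
ARABIDOPSIS_GENOME_KB = 100_000    # estimate quoted earlier in the text
DEL2_GENOME_FRACTION = 0.04        # "about 4%" of the L. speciosum genome


def repeat_dna_kb(genome_kb: float, fraction: float) -> float:
    """Amount of DNA (kb per haploid genome) attributable to one repeat family."""
    return genome_kb * fraction


del2_kb = repeat_dna_kb(LILIUM_SPECIOSUM_GENOME_KB, DEL2_GENOME_FRACTION)
arabidopsis_equivalents = del2_kb / ARABIDOPSIS_GENOME_KB

print(f"del2 DNA per haploid genome: {del2_kb:,.0f} kb")
print(f"equivalent Arabidopsis genomes: {arabidopsis_equivalents:.1f}")
```

With these inputs the del2 family alone comes to a little over one million kb, or roughly ten Arabidopsis genomes, in line with the statement in the text. Both inputs are rounded, so the result is an order-of-magnitude check rather than a precise figure.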

Functions of dispersed repeats?

The significance of the wide variation in genome size in plants is still an open question. Comparative studies have shown that genome size correlates closely with a range of cellular properties (recently reviewed by Bennett 1987). Species with larger genomes have larger chromosomes, nuclei and cells, a longer S-phase and slower cell division cycles (both mitotic and meiotic). These nucleotypic effects could influence aspects of the life form of plants such as growth rate and generation time. Turning the argument around, such whole organism properties could therefore be modulated indirectly through changes in the DNA content per cell. That is, genome size itself would be highly adaptive in plants because of its major effect on growth habit and life form (Bennett 1987).

If this is accepted, then repeated sequences can be viewed as the bricks and mortar by which plants can rapidly increase their genome size. Judging by the types and amounts of dispersed repeat uncovered so far, it looks as if retroelements, and in particular retrotransposons, may often be involved (Table 1). The reason for this might be their very efficient mechanisms of amplification. Relative to transposons, for example, huge numbers of potential templates (transcripts) may be present in each cell. Also retrotransposons carry their own reverse transcriptase and integrating capacity, so they do not have to rely on chance to provide the necessary machinery from other sources.

Finally, there is indirect evidence that retrotransposons can move readily between species. Horizontal transmission even between kingdoms is likely (Smyth et al. 1989a). There are now two examples of closely related retrotransposons being found in insects, fungi and plants. Members of the gypsy group have been found in Drosophila, yeast (Ty3) and plants (del1 and IFG). Likewise, copia group members, discovered in Drosophila, also occur in yeast (Ty1) and plants (Ta1, Tnt1 and Tst1). Just how retrotransposons might move around is not clear, but passive transduction in a viral capsid is one possibility. If retrotransposons usually have reverse transcriptase associated with their transcripts (as occurs in retroviruses), then the passively transferred element would be able to infect its new host's genome immediately.

Several questions arise. Firstly, random insertion of foreign DNA into a genome will often disrupt necessary genes. Presumably such inserts are weeded out. It is relevant here that the larger a genome becomes, the easier it is to get even larger because of the falling risk that new inserts will disrupt single copy genes.

The second question asks how genomes might decrease in size should this become adaptive. It is difficult to see how allelic retrotransposons could be lost from a species' genome once fixed in all individuals. Within the LTR family, unequal crossing over can reduce elements in size down to solo LTRs. This has recently been demonstrated for one of the Ta1 inserts of Arabidopsis (Voytas et al. 1990) and is likely to have occurred in all of the Cin1 inserts in maize (Shepherd et al. 1984) and some del1 elements in L. henryi (Sentry and Smyth 1989). However, loss of individual non-LTR retrotransposons has not been described. All we can propose is that genome sizes get smaller as a result of many short deletions. Dispersed repeats could be involved here if unequal crossing over occurred between close but non-allelic elements, leading to deletion of the DNA sequences lying between them.
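The point that a larger genome is a safer target for new inserts can be illustrated numerically. The sketch below assumes, purely for illustration, that insertions land uniformly at random and that every genome carries the same roughly 75000 kb of coding sequence estimated earlier; real elements may have insertion preferences, and genes include introns and regulatory regions, so the numbers show only the trend. Genome sizes are those quoted in the text and Table 1.

```python
CODING_KB = 75_000  # coding capacity estimated earlier in the text


def disruption_risk(genome_kb: float, coding_kb: float = CODING_KB) -> float:
    """Chance that one uniformly random insertion lands in coding sequence."""
    return min(1.0, coding_kb / genome_kb)


GENOMES_KB = {
    "Arabidopsis thaliana": 100_000,
    "Zea mays": 2_300_000,
    "Lilium speciosum": 29_000_000,
}

for species, genome_kb in GENOMES_KB.items():
    print(f"{species}: {disruption_risk(genome_kb):.2%} of random inserts hit coding DNA")
```

Under these assumptions the risk per insert falls from about three in four in Arabidopsis to well under one in a hundred in a lily, so each round of amplification makes the next one less costly.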

Finally, one can ask about the origin of single copy, non-coding sequences, which often make up a significant fraction of plant genomes. A simple suggestion is that these are mostly the decaying relicts of former repeats. Once fixed in a plant's genome, many elements would be trapped. In the case of retrotransposons, reverse transcriptase is an error-prone enzyme so that inserted elements may already be divergent in sequence. And once an element is no longer transcribed it becomes effectively dead and its sequence free to mutate. Many of the retrotransposons described in Table 1 are already dead and are accumulating mutations that preclude their autonomous activity. Over time their sequence will presumably change so much that their ancestry will not be recognizable, and they will join the single copy DNA class.

Future directions

Have all categories of dispersed repeat been discovered? Do they have a role beyond simply providing genomic bulk? Are they really dispersed at random throughout plant chromosomes? Do the particularly abundant elements in Lilium and Pinus (Table 1) make a contribution to the many bright chromosomal Q-bands seen in the former genus (Kongsuwan and Smyth 1977) and the G-bands reported in the latter (Drewry 1982)?

One strategy for better understanding their role might be to locate and characterize all the dispersed repeats present in one species. Clearly this is impractical in most plants, which have many members of many diverse families. This is where A. thaliana comes in. With the smallest genome size known in plants and relatively few dispersed repeats (Pruitt and Meyerowitz 1986), it should be possible to identify them all. One family, the Ta1 LTR retrotransposon family, has already been characterized in detail (Voytas and Ausubel 1988; Voytas et al. 1990). Ta1 occurs in only a few copies, all of which have been sequenced and mapped. Because of this, definite conclusions about the origin of Ta1, the rearrangements it has undergone and the pathway of its spread through different Arabidopsis races can be made (Voytas et al. 1990).

It seems likely that additional dispersed repeat families will be present in Arabidopsis (Pruitt and Meyerowitz 1986). These should come to light in the international programme now underway to characterize fully this species' genome (Meyerowitz 1989; Somerville 1989; National Science Foundation 1990). Its 100000 kb genome is packaged into five chromosomes to which many restriction fragment length polymorphism (RFLP) markers have been mapped. The genome has also been cloned in yeast artificial chromosome vectors, so that it is available in about 2000 fragments averaging 150 kb. These are being probed with the RFLP clones and each other to generate a complete set of overlapping fragments covering the whole genome in known order. Dispersed repeats will be revealed during this process as cross-hybridizing sequences mapping to several different locations. If the Arabidopsis genome is fully sequenced then all of its dispersed repeats will be located and defined. Thus, ironically, Arabidopsis, which contains very few dispersed repeats, may well provide the key that helps to unlock their role and significance not only in its own genome but also in the genomes of many other plant species in which they are much more abundantly present.
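As a rough check on the library figures quoted above (about 2000 clones averaging 150 kb for a 100000 kb genome), the sketch below computes the fold coverage and the chance that any given locus is represented at least once. The second figure rests on the common simplifying assumption that clones are drawn uniformly at random from the genome; that model is an assumption added here, not something stated in the text, and the names are illustrative.

```python
N_CLONES = 2000
MEAN_INSERT_KB = 150
GENOME_KB = 100_000


def fold_coverage(n_clones: int, mean_insert_kb: float, genome_kb: float) -> float:
    """Total cloned DNA expressed as multiples of the genome size."""
    return n_clones * mean_insert_kb / genome_kb


def prob_locus_covered(n_clones: int, mean_insert_kb: float, genome_kb: float) -> float:
    """P(at least one clone contains a locus) = 1 - (1 - f)^N, with f = insert / genome."""
    f = mean_insert_kb / genome_kb
    return 1.0 - (1.0 - f) ** n_clones


coverage = fold_coverage(N_CLONES, MEAN_INSERT_KB, GENOME_KB)
p_covered = prob_locus_covered(N_CLONES, MEAN_INSERT_KB, GENOME_KB)

print(f"fold coverage: {coverage:.1f}")
print(f"probability a given locus is cloned: {p_covered:.3f}")
```

On these assumptions the library holds about three genome equivalents, so most loci, including most copies of any dispersed repeat, should be present in at least one clone, though some gaps would be expected.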

References

Bennett MD (1987) Variation in genomic form in plants and its ecological implications. New Phytol 106 [Suppl]:177-200
Bennett MD, Smith JB (1976) Nuclear DNA amounts in angiosperms. Philos Trans R Soc Lond [Biol] 274:227-274
Boeke JD, Garfinkel DJ, Styles CA, Fink GR (1985) Ty elements transpose through an RNA intermediate. Cell 40:491-500
Camirand A, St-Pierre B, Marineau C, Brisson N (1990) Occurrence of a copia-like transposable element in one of the introns of the potato starch phosphorylase gene. Mol Gen Genet 224:33-39
Doolittle RF, Feng D-F, Johnson MS, McClure MA (1989) Origins and evolutionary relationships of retroviruses. Quart Rev Biol 64:1-30
Drewry A (1982) G-banded chromosomes in Pinus resinosa. J Hered 73:305-306
Flavell RB, Bennett MD, Smith JB, Smith DB (1974) Genome size and the proportion of repeated nucleotide sequence DNA in plants. Biochem Genet 12:257-269
Gierl A, Saedler H (1989) Transposition in plants. In: Eckstein F, Lilley DMJ (eds) Nucleic acids and molecular biology, vol 3. Springer, Berlin Heidelberg New York, pp 251-259
Grandbastien M-A, Spielmann A, Caboche M (1989) Tnt1, a mobile retroviral-like transposable element of tobacco isolated by plant cell genetics. Nature 337:376-380
Hutchinson J, Narayan RKJ, Rees H (1980) Constraints upon the composition of supplementary DNA. Chromosoma 78:137-145
Jin Y-K, Bennetzen JL (1989) Structure and coding properties of Bs1, a maize retrovirus-like transposon. Proc Natl Acad Sci USA 86:6235-6239
Johns MA, Babcock MS, Fuerstenberg SM, Freeling M, Simpson RB (1989) An unusually compact retrotransposon in maize. Plant Mol Biol 12:633-642
Joseph JL, Sentry JW, Smyth DR (1990) Interspecies distribution of abundant DNA sequences in Lilium. J Mol Evol 30:146-154
Kamalay JC, Goldberg RB (1980) Regulation of structural gene expression in tobacco. Cell 19:935-946
Kongsuwan K, Smyth DR (1977) Q-bands in Lilium and their relationship to C-banded heterochromatin. Chromosoma 60:169-178
Leutwiler LS, Hough-Evans BR, Meyerowitz EM (1984) The DNA of Arabidopsis thaliana. Mol Gen Genet 194:15-23
Martinez-Zapater JM, Estelle MA, Somerville CR (1986) A highly repeated DNA sequence in Arabidopsis thaliana. Mol Gen Genet 204:417-423
Meyerowitz EM (1989) Arabidopsis, a useful weed. Cell 56:263-269
National Science Foundation (1990) A long-range plan for the multinational coordinated Arabidopsis thaliana genome research project. Washington, DC
Pruitt RE, Meyerowitz EM (1986) Characterization of the genome of Arabidopsis thaliana. J Mol Biol 187:169-183
Schwarz-Sommer Z, Leclercq L, Göbel E, Saedler H (1987) Cin4, an insert altering the structure of the A1 gene in Zea mays, exhibits properties of nonviral retrotransposons. EMBO J 6:3873-3880
Schweizer D, Ambros P, Gründler P, Varga F (1987) Attempts to relate cytological and molecular chromosome data of Arabidopsis thaliana to its genetic linkage map. Arabidopsis Inf Serv 25:27-34
Sentry JW, Smyth DR (1989) An element with long terminal repeats and its variant arrangements in the genome of Lilium henryi. Mol Gen Genet 215:349-354
Shepherd NS, Schwarz-Sommer Z, Blumberg vel Spalve J, Gupta M, Wienand U, Saedler H (1984) Similarity of the Cin1 repetitive family of Zea mays to eukaryotic transposable elements. Nature 307:185-187
Smyth DR, Kalitsis P, Joseph JL, Sentry JW (1989a) Plant retrotransposon from Lilium henryi is related to Ty3 of yeast and the gypsy group of Drosophila. Proc Natl Acad Sci USA 86:5015-5019
Smyth DR, Kongsuwan K, Wisudharomn S (1989b) A survey of C-band patterns in chromosomes of Lilium (Liliaceae). Plant Syst Evol 163:53-69
Somerville C (1989) Arabidopsis blooms. Plant Cell 1:1131-1135
Voytas DF, Ausubel FM (1988) A copia-like transposable element family in Arabidopsis thaliana. Nature 336:242-244
Voytas DF, Konieczny A, Cummings MP, Ausubel FM (1990) The structure, distribution and evolution of the Ta1 retrotransposable element family of Arabidopsis thaliana. Genetics 126:713-721
Xiong Y, Eickbush TH (1988) Similarity of reverse transcriptase-like sequences of viruses, transposable elements, and mitochondrial introns. Mol Biol Evol 5:675-690
Xiong Y, Eickbush TH (1990) Origin and evolution of retroelements based on their reverse transcriptase sequences. EMBO J 9:3353-3362