
J Mol Evol (2003) 56:28–37 DOI: 10.1007/s00239-002-2377-2

Patterns of Gene Duplication in Saccharomyces cerevisiae and Caenorhabditis elegans

Andre R.O. Cavalcanti,1,2 Ricardo Ferreira,2 Zhenglong Gu,1 Wen-Hsiung Li1

1 Department of Ecology and Evolution, University of Chicago, 1101 East 57th Street, Chicago, IL 60637, USA
2 Departamento de Quimica Fundamental, Universidade Federal de Pernambuco, Pernambuco, Brazil

Received: 25 March 2002 / Accepted: 17 July 2002

Abstract. In this paper we present a new method for detecting block duplications in a genome. It is more stringent than previous ones in that it requires a more rigorous definition of paralogous genes and that it requires the paralogous proteins on the two blocks to be contiguous. In addition, it provides three criterion choices: (1) the same composition (i.e., having the same paralogues in the two windows), (2) the same composition and gene order, and (3) the same composition, gene order, and gene orientation. The method is completely automated, requiring no visual inspection as in previous methods. We applied it to analyze the complete genomes of S. cerevisiae and C. elegans. In yeast we detected fewer duplicated blocks than previously reported. In C. elegans, however, we detected more block duplications than previously reported, indicating that although our method has a more stringent definition of block duplication than previous ones, it may be more sensitive in detection because it considers every possible window rather than only fixed nonoverlapping windows. Our results show that block duplication is a common phenomenon in both organisms. The patterns of block duplication in the two species are, however, markedly different. The yeast shows much more extensive block duplication than the nematode, with some chromosomes having more than 40% of the duplications derived from block duplications. Moreover, in the yeast the majority of block duplications occurred between chromosomes, while in the nematode most block duplications occurred within chromosomes.

Key words: Gene duplication, Block duplication, Protein families, Database cleaning, Whole-genome duplication

Correspondence to: Wen-Hsiung Li; email: [email protected]

Introduction

Gene duplication is considered one of the most important steps for the emergence of genetic novelties (Haldane 1932; Ohno 1970). A substantial part of eukaryotic genomes is composed of multigene families, which probably evolved from duplication events. Duplications can involve (1) part of a gene, (2) a single gene, (3) part of a chromosome (a block duplication), (4) an entire chromosome, or (5) the whole genome (see Li 1997). According to Ohno (1970), whole-genome duplications have been more important in evolution than regional duplications, because in regional duplications only parts of the regulatory system of structural genes may be duplicated, causing an imbalance that can disrupt the normal function of the duplicated regions.

With the availability of complete genome sequences, it is now possible to study the frequency of each of the above five types of duplication in a genome. Rubin et al. (2000) studied the extent of gene duplication in the Saccharomyces cerevisiae, Drosophila melanogaster, and Caenorhabditis elegans genomes and Lynch and Conery (2000) used genomic sequence data from these species and others to study the evolutionary fate of duplicated genes.

Wolfe and Shields (1997) and Seoighe and Wolfe (1999) studied the pattern of gene duplication in yeast and detected a large number of blocks of duplicated genes. They proposed that these block duplications occurred at the same time and shared the same order and orientation and concluded that the yeast genome was the result of a whole-genome duplication event, followed by massive gene loss and chromosomal reciprocal translocations. Semple and Wolfe (1999) performed a similar analysis on 45% of the Caenorhabditis elegans protein coding genes, obtaining only three duplicated blocks, each consisting of three contiguous duplicated genes.

Friedman and Hughes (2001) investigated the patterns of duplication in yeast, C. elegans, and D. melanogaster. They detected block duplications in C. elegans and yeast. Their results showed that in yeast there are several ancient duplicated blocks, in accord with Wolfe and Shields (1997). However, as the ps values (proportions of different synonymous sites) were above the saturation level, they could not be shown to be approximately equal, as would be expected in a whole-genome duplication. Furthermore, they showed that some of the yeast block duplications happened after the proposed whole-genome duplication. Hence they concluded, "If the duplication of genomic blocks by transposable elements has been an ongoing feature of yeast genome evolution, this factor alone might explain the existence of anciently duplicated genomic blocks in this species, without the need to invoke an ancient polyploidization event."

The above studies used the BLASTP E value as the sole criterion for identifying homologous proteins. However, the detection of homologous proteins requires a more rigorous analysis.
Domain shuffling or sharing (Doolittle 1995) can mislead the identification of duplicate genes, since a domain shared by two nonhomologous proteins can cause these proteins to hit each other with a low E value, leading to false hits. The identification of remote homology is another difficulty in detecting gene duplications (Doolittle 1986; Rost 1999); improvements in methodology often lead to the discovery of new homologous relationships and new gene family members (Krogh et al. 1994; Sonnhammer et al. 1997). Previously, we (Gu et al. 2002) conducted an analysis of the number of duplicated genes in the genomes of yeast, C. elegans, and D. melanogaster and found that the criteria used to detect homologues can strongly influence the classification of genes into families.

The above studies also relied on visual inspection to detect block duplications. In this paper, starting with a more reliable definition of gene families, we describe a new and completely automated method for detecting block duplications and perform a detailed analysis of the structure of gene duplications in yeast and C. elegans; we did not include D. melanogaster in this analysis because the genomic sequences were not yet completely assembled in chromosomes. Our analyses revealed that the pattern of gene duplication in yeast is quite different from that in C. elegans.

Data and Methods

Data

The protein data sets were obtained from the following websites.

C. elegans: http://www.sanger.ac.uk/Projects/C_elegans/wormpep/. Wormpep release 40 was used. There were 19,730 protein sequences in the database, of which 48 did not have genomic position information and 22 did not have corresponding coding sequences (cds). We filtered these data to exclude the same gene with different names, isoforms, and genes made up of repetitive elements; for the procedure, see Gu et al. (2002). After filtering we used the remaining 19,201 protein sequences in our analysis.

Yeast: ftp://ncbi.nlm.nih.gov/genbank/genomes/S_cerevisiae/. We used the NCBI October 2000 version, which was part of the Reference Sequence (RefSeq) project. The annotation for this version was based on the Saccharomyces Genome Database in the Stanford genomic resources (SGD; http://genome-www.stanford.edu/Saccharomyces/). We used the same filtering procedure as for the worm data and also excluded the sequences of the mitochondrial chromosome; a total of 6172 protein sequences was used in our analysis.

Protein Family Definition

The protein families we used were defined by Gu et al. (2002). The method consisted of running a FASTA analysis of each protein in an organism against the whole proteome of the same species. The hits thus obtained (E value < 10) were filtered by the following two homology criteria: (1) the alignable length (L) between the two proteins is larger than 80% of the longer sequence, and (2) the sequence identity in the aligned region is at least 30% if $L > 150$, or at least $0.06 + 4.8\,L^{-0.32\,(1+\exp(-L/1000))}$ if $L \le 150$ (Rost 1999; Gu et al. 2002). We used a single linkage algorithm to group the homologues into clusters, which were reexamined as follows. The database was cleaned by filtering out isoforms (due to alternative splicing) and repetitive (transposable) elements, which might lead to false hits. Using the "cleaned" database, the above procedures for the FASTA analysis, for defining paralogues, and for clustering proteins into families were repeated. The numbers of protein families thus defined in yeast and C. elegans are listed in Table 1 (Gu et al. 2002).

The set of all of the proteins that have one or more paralogues in a genome, that is, the proteins that are not singletons, is known as the paranome. Using these results we numbered all the proteins in the paranome sequentially, according to their positions on the chromosomes.
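The two homology criteria and the single linkage step can be sketched as follows. This is a minimal illustration, not the code used by Gu et al. (2002): the function names are hypothetical, the FASTA output is assumed to have been parsed already into alignable length, fractional identity, and sequence lengths, and the exponent is taken as 0.32 (following the form of the Rost 1999 curve), which makes the short-alignment threshold approach 30% at L = 150.

```python
import math

def identity_threshold(L):
    """Minimum fractional identity for an alignable length of L residues.

    Assumption: exponent 0.32, so the curve meets ~0.30 at L = 150.
    """
    if L > 150:
        return 0.30
    return 0.06 + 4.8 * L ** (-0.32 * (1.0 + math.exp(-L / 1000.0)))

def is_homologous(aligned_len, identity, len_a, len_b):
    """Criterion 1: alignable length > 80% of the longer sequence.
    Criterion 2: identity at or above the length-dependent threshold."""
    if aligned_len <= 0.8 * max(len_a, len_b):
        return False
    return identity >= identity_threshold(aligned_len)

def single_linkage(proteins, homologous_pairs):
    """Group proteins into families: two proteins share a family if they
    are connected by any chain of homologous pairs (union-find)."""
    parent = {p: p for p in proteins}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in homologous_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    families = {}
    for p in proteins:
        families.setdefault(find(p), []).append(p)
    return list(families.values())

# Hypothetical toy input: a-b and b-c are homologous, d is a singleton.
families = single_linkage(["a", "b", "c", "d"], [("a", "b"), ("b", "c")])
paranome = [p for fam in families if len(fam) > 1 for p in fam]
```

Under single linkage, "a" and "c" end up in the same family even though they were never paired directly, which is why the choice of homology criteria can strongly affect family sizes.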

Table 1. Distribution of multiple gene families in yeast and C. elegans

                                   Number of families
Family size                        Yeast      C. elegans
Singletons                         4,799      13,097
2                                    415         666
3                                     56         188
4                                     23          94
5                                      9          71
6–10                                  19         104
11–20                                  8          57
21–50                                  0          33
51–80                                  0           5
>80                                    0           3
No. of genes                       6,172      19,201
No. of gene families                 530       1,221
No. of genes in the paranome       1,373       6,104

Detection of Block Duplications

A block duplication of genes is a duplication that involves more than one gene. To detect blocks of duplicate genes we defined windows in the paranome. Each window of size n was defined as a chromosome segment containing n contiguous paranome members. For each position in the paranome we defined a window containing the gene at that position and the following n − 1 paranome members. This window was then compared with all other possible windows in the paranome, excluding just those that overlap with it. This calculation was then repeated for each position in the paranome. We first performed these calculations for windows of size 2, and when two windows shared two paralogous proteins, we progressively increased the window size, adding one paranome member at each step, until the newly added member in one window was not paralogous to the newly added one in the other window.

We performed three kinds of analysis to determine if two windows hit each other, progressively increasing the stringency of the criterion. The three kinds of hits defined were

(a) composition: two windows were counted as a hit if they had the same family composition, regardless of the order;
(b) order: two windows were counted as a hit if they had the same family composition and the genes appeared in the same order regardless of the orientation of the genes; and
(c) orientation: two windows were counted as a hit if, besides fulfilling the first two requirements, the genes also had the same orientation.

This procedure is similar to that of Friedman and Hughes (2001) in the sense that we used only the paranome members in the analysis but differs from it in the following respects.

i. Coverage: Friedman and Hughes divide the paranome into nonoverlapping windows, so if a block duplication falls in between two windows, it can be overlooked in the manual extension phase; in our method we use all possible windows so that we have a total coverage of the paranome.
ii. Hit definition: If two windows in the paranome share a number of homologous genes, they are considered a hit by Friedman and Hughes, regardless of whether or not these homologous genes are contiguous; in our method the two windows constitute a hit only if these homologous genes are contiguous.
iii. Hit size: Friedman and Hughes define the hit size to be the number of families shared by two windows, while we count the number of homologous genes shared by two windows. For example, if two windows each contain two members of a protein family, it is counted as a hit of size 1 in their method but a hit of size 2 in our method.
iv. Gene order and orientation: Friedman and Hughes do not take the gene order or gene orientation into consideration, whereas we define three types of hits, two of which consider gene order or both gene order and orientation.

The drawback of our method is that, for now, it can detect only windows in which all the members are contiguous in both paranome regions. For example, if another duplicated gene is inserted into a block or one copy of a gene in the original duplication is lost (and if it was not a member of a size 2 family, in which case the other member would be a singleton again and would not have been included in the paranome), we will detect not the whole block, but two smaller blocks. We took a closer look at all of the detected duplicated blocks to see how many times these blocks could be easily extended but did not find any case in which a single insertion or deletion could be responsible for breaking down a block. Therefore, this problem seems to be unimportant, at least in the analyzed genomes. We are working on a way to automate this search, but as it is, our results should be taken as conservative estimates of the size of block duplications.

After detecting all windows, we filtered the results to exclude all hits that overlap. The main criterion for filtering was to keep the larger block. In the ordered and oriented hits, if two blocks with the same size overlapped, we kept the one with the smaller standard deviation in the proportion of synonymous changes (ps) between the pairs of homologous genes. This procedure was not used for the composition hits because of the difficulty of defining homologous pairs. In this case overlapping blocks of the same size were filtered based only on their position, and the first one to be detected was kept. The remaining blocks were further refined by removing all the hits composed of a repetition of genes belonging to the same family, as these could be the result of a series of local duplications in the two windows instead of a block duplication.

To evaluate the statistical significance of the detected block duplications we repeated the same procedure for 500 randomized paranomes for the yeast genome and for the worm genome; each randomized genome was obtained by shuffling the real members in the paranome.
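The window comparison described in this section can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' program: the paranome is a single list of hypothetical (family, strand) tuples, chromosome boundaries are ignored, "same order" is read as the same forward order, equal-size overlaps are resolved by detection order only (no ps comparison), and the removal of single-family repeats is omitted.

```python
from collections import Counter

def windows_hit(win_a, win_b, criterion="composition"):
    """Decide whether two windows of (family, strand) tuples hit each other."""
    fams_a = [g[0] for g in win_a]
    fams_b = [g[0] for g in win_b]
    if criterion == "composition":
        return Counter(fams_a) == Counter(fams_b)   # same families, any order
    if criterion not in ("order", "orientation"):
        raise ValueError(criterion)
    if fams_a != fams_b:                            # same families, same order
        return False
    if criterion == "order":
        return True
    return [g[1] for g in win_a] == [g[1] for g in win_b]  # same strands too

def find_blocks(paranome, criterion="order", min_size=2):
    """Compare every window with every later non-overlapping window,
    extending a hit one paranome member at a time while it still hits."""
    n = len(paranome)
    hits = []
    for i in range(n - min_size + 1):
        for j in range(i + min_size, n - min_size + 1):
            size = min_size
            if not windows_hit(paranome[i:i + size], paranome[j:j + size], criterion):
                continue
            while (i + size + 1 <= j and j + size + 1 <= n
                   and windows_hit(paranome[i:i + size + 1],
                                   paranome[j:j + size + 1], criterion)):
                size += 1
            hits.append((i, j, size))
    return hits

def _overlap(start_1, size_1, start_2, size_2):
    return start_1 < start_2 + size_2 and start_2 < start_1 + size_1

def filter_overlaps(hits):
    """Keep the larger hit when both windows of a hit overlap the
    corresponding windows of an already kept hit (ties: first detected)."""
    kept = []
    for i, j, size in sorted(hits, key=lambda h: -h[2]):
        if any(_overlap(i, size, ki, ks) and _overlap(j, size, kj, ks)
               for ki, kj, ks in kept):
            continue
        kept.append((i, j, size))
    return kept

# Hypothetical toy paranome: families A, B, C recur, with C on opposite strands.
toy = [("A", "+"), ("B", "+"), ("C", "-"), ("X", "+"),
       ("A", "+"), ("B", "+"), ("C", "+"), ("Y", "-")]
order_blocks = filter_overlaps(find_blocks(toy, "order"))
oriented_blocks = filter_overlaps(find_blocks(toy, "orientation"))
```

In the toy example the "order" criterion recovers one block of size 3, while the "orientation" criterion recovers only a block of size 2 because the third gene lies on opposite strands in the two regions. Randomized paranomes for the significance test could be produced by applying `random.shuffle` to a copy of the list and rerunning the same functions.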

Calculation of Genetic Distances and Codon Bias

To determine if the genes in a detected block all duplicated at the same time, we calculated, for each pair of homologous genes in the window, the proportion of synonymous differences per synonymous site (ps). If the values of ps are homogeneous inside the block, it indicates that all the genes duplicated at the same time. We used Nei and Gojobori's (1986) method to estimate ps; we did not correct for multiple substitutions at a site because, for a large fraction of the homologous pairs, the ps values were saturated. The codon bias measure used was the ENC (effective number of codons) (Wright 1990) and was calculated using the program CodonW.

Results

Gene Families in the Yeast and Worm

In yeast we excluded the mitochondrial chromosome (28 genes; 1 family of size 2 and 26 singletons). The resulting family information for the yeast and worm is described in detail by Gu et al. (2002) and is listed in Table 1. For the yeast the size of the paranome, which is defined as the set of all duplicate genes in a genome, is 1373 (about 22% of the proteome), and for the worm it is 6104 (about 31% of the proteome). C. elegans has more and larger families than yeast; the largest worm family has 242 members and consists of olfactory receptors (Robertson 1998, 2000), while the largest yeast family contains only 20 members and consists of seripauperins; see Gu et al. (2002) for a discussion of the results for gene families.

Table 2. Number of block duplications in worm and yeast

        Worm                                    Yeast
Size    Composition   Order   Orientation      Composition   Order   Orientation
2       590           693     475              83            85      79
3       179           168     61               15            15      15
4       80            62      42               12            12      13
5       11            8       5                7             6       5
6       8             6       3                6             6       6
7       3             2       1                1             1       1
8       3             3       2                1             1       1
9       0             0       0                0             0       0
10      2             0       0                0             0       0
11      1             0       0                0             0       0

Table 3. Mean number of blocks and standard deviation calculated from a set of 500 random genomes for worm and yeast

                 Composition              Order                    Orientation
Size of block    No. of blocks   SD       No. of blocks   SD       No. of blocks   SD
Worm
2                791.06          75.29    798.78          76.52    402.83          41.38
3                11.00           4.17     4.25            2.21     1.10            1.09
4                0.18            0.44     0.02            0.17     0.00            0.04
Yeast
2                7.81            2.81     7.85            2.82     3.94            1.97
3                0.05            0.21     0.01            0.10     0.00            0.05

Comparing our results to those obtained by Friedman and Hughes (2001), we found that in our analysis the sizes of the two paranomes are smaller (6104 vs 7077 for the worm and 1373 vs 1440 for the yeast), indicating that some of the genes that were considered homologous in Friedman and Hughes were filtered out as isoforms or repetitive elements in our analysis or did not satisfy our criteria of homology.

Number of Blocks

The results of our block duplication analysis are given in Table 2, which lists the numbers of pairs of blocks that have the same family composition, gene order, and gene orientation, respectively. The larger number of windows of size 2 with the same order than of windows of size 2 with the same composition occurs because a block with the same composition may break into smaller blocks with the same order.

Table 3 gives the results of the analysis of the shuffled paranomes. As can be seen, for the yeast all the observed numbers of block duplications are substantially higher than would be expected by chance. For the worm the numbers of windows of size 2 are not significantly (at the 5% level) larger than expected by chance. For larger block sizes there are significantly more hits than expected by chance.
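One simple way to read Tables 2 and 3 together is to express each observed count as a standardized deviation from the mean of the 500 shuffled paranomes. The text does not state which test was applied, so the z-score below is an assumption used only for illustration; the numbers are the composition columns of Tables 2 and 3.

```python
# Observed composition hits (Table 2) and shuffled-paranome mean and SD (Table 3).
observed = {
    "worm": {2: 590, 3: 179, 4: 80},
    "yeast": {2: 83, 3: 15},
}
shuffled = {
    "worm": {2: (791.06, 75.29), 3: (11.00, 4.17), 4: (0.18, 0.44)},
    "yeast": {2: (7.81, 2.81), 3: (0.05, 0.21)},
}

def z_score(count, mean, sd):
    """Number of shuffled-genome standard deviations between count and mean."""
    return (count - mean) / sd

z = {
    species: {size: z_score(observed[species][size], *shuffled[species][size])
              for size in observed[species]}
    for species in observed
}

for species, by_size in z.items():
    for size, value in by_size.items():
        print(f"{species} size {size}: z = {value:.1f}")
```

The worm size 2 count falls below the shuffled mean, whereas every other entry lies far above it, in line with the statement that only the worm windows of size 2 are not larger than expected by chance.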

We then compared these numbers with the results of previous studies. We consider the composition hits because Friedman and Hughes (2001) did not take into account the order or orientation of the homologues and Wolfe and Shields (1997) allowed for inversions inside the hits. For yeast Wolfe and Shields (1997) detected 55 duplicated blocks, each containing at least 3 duplicate genes and a total of 376 pairs of homologous genes involved in block duplications; Friedman and Hughes (2001) detected 39 duplicated blocks of size larger than or equal to 4, involving a total of 240 pairs of duplicated genes. We found 42 blocks with at least 3 duplicate genes (totaling 179 pairs of duplicated genes) and 27 with at least 4 (134 pairs of homologues). So for the yeast our method detected fewer blocks than both of the two previous methods. The reason is twofold. First, our family definition is more stringent than theirs, which was based solely on the E value. Second, we require the homologous genes to be contiguous in the paranome, whereas in both of the previous methods the blocks could contain intervening paranome members.

On the other hand, for the worm we detected many more block duplications than the five detected by Friedman and Hughes (2001). This could be due in part to differences in the hit size definition. According to Friedman and Hughes (2001), if multiple members of a family are present in two regions, these regions constitute a hit of size 1, whereas in our method the number of homologous genes in the two regions determines the size of a hit. However, if we count the number of families in each of the blocks detected by our method and score the blocks using these numbers, we still detect more than twice as many hits as Friedman and Hughes (2001), showing that the power of our method to detect more blocks is not an artifact of the hit size scoring scheme.

Comparing the number of blocks with the same gene orientation with those with the same composition only or with the same gene order (Table 2), we find that there is little difference among these numbers for the yeast, but there are large differences for the worm. In the rest of this paper we consider only the block duplications with a size >2 and with the same order and orientation.

The ps Values in Blocks

To examine if the genes within a detected block were duplicated at the same time, we used a procedure similar to that used by Friedman and Hughes (2001). For each block we calculated the ps value for each pair of homologous genes inside the block. If the genes on a block were in fact derived from a block duplication, the ps values of the homologous genes between the two blocks should be similar.

Figure 1 shows the mean ps value and standard deviation for each block duplication of a size larger than 2 for the worm ordered by ascending ps values. To quantify how homogeneous the ps values inside each pair of blocks are, we conducted ANOVAs on the results. The variation between the groups accounts for 91.5% of the total variation in the ps values (p < 0.0005). From Fig. 1, it can be seen that most blocks in worm are old duplications because their ps values are above the saturation level. There are also a number of blocks with ps values around 0.35; closer inspection showed that these detected blocks are the histone blocks that duplicated more than one time (see the next section) and then led to several nonindependent hits.

If we plot the same graph for the yeast (graph not shown), the standard deviations of ps inside the blocks are much higher than in the worm case. An ANOVA on the yeast data showed that the grouping structure explains only 63.3% of the variability (p < 0.0005). However, Friedman and Hughes (2001) and Gu et al. (2002) showed that in yeast for genes with a high codon usage bias, ps is not a good measure of the time of divergence, as the synonymous sites in these genes are subjected to selective constraints [for a similar conclusion for Drosophila see Sharp and Li (1989) and Moriyama and Hartl (1993)]. To account for the codon usage bias, Fig. 2 shows the ps values and standard deviations for the yeast blocks after we discarded the pairs in which the mean effective number of codons (ENC) is smaller than 30 (high codon usage bias). An ANOVA showed that after exclusion of the high codon usage bias pairs, the grouping structure explains 83.2% of the variance in ps (p < 0.0005). If we use a more stringent criterion in the definition of the ENC cutoff value and exclude all the pairs with an ENC smaller than 40, 90.4% of the variance can be explained by the grouping structure (p < 0.0005).

Sequential Block Duplications

In the studies by Wolfe and Shields (1997) and Friedman and Hughes (2001) all the blocks detected were present in just two positions of the genome, implying that these blocks were duplicated only once. In our analysis this is true for the yeast. For the worm, however, some of the blocks were duplicated several times, being found three or more times in different positions of the genome. Sometimes the whole block was found in various locations of the genome, and sometimes parts of the blocks were also found scattered in the genome. Table 4 illustrates this finding. The number of pairs detected should be equal to half the number of blocks in the genome if the blocks have been duplicated only once. This is true for the yeast, but for the worm the number of pairs detected is larger than half the number of nonredundant blocks, showing that some blocks have been duplicated more than once.

When we take a closer look at the blocks that were duplicated more than once in the worm genome, we find that they are composed mostly of histones. There are two major kinds of histone blocks in the worm. The first type has the pattern histone H4, histone H3, histone H2A, and histone H2B and is present five times in the genome. The other pattern is histone H2A, histone H2B, histone H4, and histone H3 and is present six times in the genome. There are some other histone genes scattered throughout the genome, in pairs or alone, but the majority belong to one of the two patterns defined above.

Another block that is duplicated more than once is a block of chitinase followed by five tyrosine-protein kinases, repeated four times in the genome. All the other blocks repeated more than once contain members of the two largest C. elegans protein families (two olfactory receptor families; sizes 242 and 181). If we exclude all the pairs that involve blocks duplicated more than once in the worm, we have the following results: 30 blocks of size 3, 7 blocks of size 4, 1 block of size 5, 1 block of size 6, 1 block of size 7, and 2 blocks of size 8.
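The pairs-versus-blocks argument of this section reduces to a small check, sketched here. The function names are hypothetical; the counts are the pair and nonredundant block numbers reported in Table 4, and the toy example uses made-up block labels.

```python
def count_pairs_and_blocks(pairs):
    """Given detected pairs of block identifiers, return the number of
    pairs and the number of distinct (nonredundant) blocks."""
    blocks = {block for pair in pairs for block in pair}
    return len(pairs), len(blocks)

def duplicated_more_than_once(n_pairs, n_blocks):
    """If every block were duplicated exactly once, pairs == blocks / 2.
    More pairs than that means some block occurs three or more times."""
    return n_pairs > n_blocks / 2

# (No. of pairs, No. of blocks) by block size, from Table 4.
table4 = {
    "worm": {3: (61, 77), 4: (42, 32), 5: (5, 7), 6: (3, 6), 7: (1, 2), 8: (2, 4)},
    "yeast": {3: (15, 30), 4: (13, 26), 5: (5, 10), 6: (6, 12), 7: (1, 2), 8: (1, 2)},
}

flags = {
    species: {size: duplicated_more_than_once(*counts)
              for size, counts in by_size.items()}
    for species, by_size in table4.items()
}

# Toy example: one block present at three locations yields three pairs.
toy_pairs = [("loc1", "loc2"), ("loc1", "loc3"), ("loc2", "loc3")]
toy_counts = count_pairs_and_blocks(toy_pairs)
```

With these counts the check flags worm blocks of sizes 3, 4, and 5 and none of the yeast sizes.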


Fig. 1. Mean ps values and standard deviations for the block duplications (size >2; ordered by ascending ps values) in the nematode genome.

Intra- and Interchromosomal Duplications

In the yeast most of the block duplications are interchromosomal, while in the worm they are intrachromosomal. Table 5 lists the number of pairs of blocks within and between chromosomes for the yeast and worm. As can be seen, about 90% of the yeast block duplications are between chromosomes. For the worm about 80% of them are within chromosomes. If we exclude all multiple block duplications (see the preceding section), of the 42 detected duplications, only 2 occur between chromosomes. In the worm eight block duplications are in tandem, while only one block duplication is in tandem in the yeast.

We compared this result with the patterns of single-gene duplication in the yeast and worm. To determine the frequency of within- and between-chromosome single-gene duplication, we used gene families of size 2, excluding all families involved in block duplications. The results are listed in Table 5. As can be seen the pattern of block duplication in the yeast follows the pattern of single-gene duplications, about 90% of the duplications being between chromosomes. For the worm the patterns for single-gene and block duplications are completely different; most of the block duplications are within chromosomes, while the single-gene duplications do not have a preference for being within or between chromosomes. These findings suggest that block and single-gene duplications in the yeast have occurred in a similar manner, in agreement with the whole-genome duplication hypothesis, while in the worm block duplications and single-gene duplications apparently have occurred in different manners, indicating that there was probably no whole-genome duplication, at least in the relatively recent past.

Numbers of Genes and Families, Genome Coverage, and Location of the Blocks

Tables 6 and 7 list, for each chromosome, the number of genes, the number of duplicated genes (paranome members), the proportion of the genes in the chromosome that are duplicated, the number of paranome members involved in block duplication, and the proportion of the paranome members that are involved in block duplications in relation to the number of paranome members, for the worm and yeast, respectively.
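The percentages quoted for within- and between-chromosome duplications follow directly from the counts in Table 5. The short calculation below only restates that arithmetic; the dictionary layout and function name are illustrative choices.

```python
# Counts of duplications on the same or different chromosomes (Table 5).
table5 = {
    "worm": {
        "blocks": {"same": 89, "different": 25},
        "single-gene": {"same": 326, "different": 320},
    },
    "yeast": {
        "blocks": {"same": 4, "different": 37},
        "single-gene": {"same": 26, "different": 278},
    },
}

def fraction_between(counts):
    """Fraction of duplications whose two copies lie on different chromosomes."""
    return counts["different"] / (counts["same"] + counts["different"])

between = {
    (species, kind): fraction_between(counts)
    for species, kinds in table5.items()
    for kind, counts in kinds.items()
}

for (species, kind), value in between.items():
    print(f"{species:5s} {kind:11s} between chromosomes: {value:.1%}")
```

For the yeast both kinds of duplication are about 90% interchromosomal, whereas for the worm only about 22% of block duplications but roughly half of single-gene duplications fall between chromosomes.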


Fig. 2. Mean ps values and standard deviations for the yeast block duplications (size >2; ordered by ascending ps values; only homologous pairs with a mean ENC value of at least 30 were used).
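The "variance explained by the grouping structure" reported for the ps values corresponds to the ratio of the between-block sum of squares to the total sum of squares in a one-way ANOVA. A minimal sketch is given below, assuming each block is represented by a list of (ps, mean ENC) tuples; the function names and the toy values are hypothetical and are not data from the paper.

```python
def filter_by_enc(block_pairs, cutoff=30.0):
    """Drop homologous pairs whose mean ENC is below the cutoff
    (low ENC means high codon usage bias); drop blocks left empty."""
    filtered = []
    for pairs in block_pairs:
        kept = [(ps, enc) for ps, enc in pairs if enc >= cutoff]
        if kept:
            filtered.append(kept)
    return filtered

def variance_explained(groups):
    """Between-group sum of squares divided by total sum of squares."""
    values = [v for group in groups for v in group]
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in groups
    )
    return ss_between / ss_total

# Hypothetical toy blocks: lists of (ps, mean ENC) for each homologous pair.
toy_blocks = [
    [(0.30, 45.0), (0.32, 50.0), (0.10, 25.0)],   # last pair is highly biased
    [(0.70, 48.0), (0.72, 52.0)],
]
raw = variance_explained([[ps for ps, _ in block] for block in toy_blocks])
cleaned = variance_explained(
    [[ps for ps, _ in block] for block in filter_by_enc(toy_blocks)]
)
```

In this toy case, removing the highly biased pair raises the explained fraction, which mirrors the direction of the change reported for the yeast blocks when low-ENC pairs are excluded.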

As can be seen from the tables there is more variation in these quantities for the worm than for the yeast. For example, values in column 4, which lists the proportion of duplicated genes in each chromosome, vary in the worm from 20.5% (chromosome X) to 48.9% (chromosome V), with a mean of 29.5% and a standard deviation of 11%. In the yeast this proportion varies from 19.8% (chromosome XVI) to 27.6% (chromosome I), with a mean of 22.8% and a standard deviation of 2.4%. In the last row in the tables we see that 31.8% of the worm genes are duplicated genes, while this value is 22.2% for the yeast, but although the worm has more duplicated genes than the yeast, its duplications seem to be concentrated in some chromosomes such as chromosome V; for examples of large gene families concentrated in chromosome V, see Sluder et al. (1999) and Robertson (2000). In contrast, in the yeast the duplications are distributed among chromosomes.

If we look at the proportion of the duplicated genes that are involved in block duplications (column 6 in Tables 6 and 7), we see a similar result. For the yeast (Table 7) this proportion is more uniformly distributed between the chromosomes. The values range from 12.9% (chromosome XIV) to 41.2% (chromosome XVI), with a mean of 24.5% and a standard deviation of 7.8%. For the worm the blocks are not as uniformly distributed (Table 6), with chromosome V having 12.4% of its duplicated genes involved in block duplications, while in chromosome X none of the 539 duplicated genes is involved in block duplications; the mean for the worm is 4.5%, with a standard deviation of 4.9%. Although the range of the values is larger for yeast, the variation relative to the mean is much higher for the worm. This is, again, an indication that the duplication events in the worm are concentrated in some chromosomes (again, chromosome V), while in the yeast they are distributed among all chromosomes. The observation that the block duplications are distributed among all yeast chromosomes, while in the worm they tend to be mostly in some chromosomes, supports the hypothesis of a whole-genome duplication event in the yeast (Wolfe and Shields 1997).

For the worm, the paranome is composed of 6104 genes, of which 436 are included in our block duplications (1092 if we consider also blocks of size 2). Of the 1221 gene families, 74 are involved in block duplications (215 if we consider also blocks of size 2). For the yeast the results are even more impressive: of

35 Table 4. Number of nonredundant duplicated blocks and number of pairs of duplicated blocks for worm and yeast Worm

Table 5. Number of block duplications on the same or di€erent chromosomes for worm and yeast

Yeast

Worm

Yeast

Block size

No. of pairs

No. of blocks

No. of pairs

No. of blocks

Location in chromosomes

Blocksa

Single-gene Blocks

Single-gene

3 4 5 6 7 8

61 42 5 3 1 2

77 32 7 6 2 4

15 13 5 6 1 1

30 26 10 12 2 2

Same Di€erent Total

89 (38) 25 (2) 114 (40)

326 320 646

26 278 304

the 1373 paranome members, 345 are in block duplications (602 if we include blocks of size 2), and of the 530 families, 156 are inside duplicated blocks (261 considering size 2).

4 37 41

a Numbers in parentheses are the results excluding multiple duplicated blocks.

sible for singletons present in one of the blocks without a homologous gene in the other, which could have been lost. Discussion

Number of Intervening Singletons in the Blocks

The Method

Our method detects block duplications using just the paranome members: if two regions of the paranome have the same family arrangement, we consider these regions the result of a block duplication. However, genes that are contiguous in the paranome are not necessarily contiguous in the genome, since they can have any number of intervening singleton genes between them and still be contiguous in the paranome. For example, the largest block duplication that we detected for yeast is of size 8, meaning that there are eight paranome members that are contiguous with the same order and orientation in two regions of the paranome. However, when we count all the genes, including singletons, in these regions, each of the regions contains 23 genes. The regions do not need to have the same number of genes; the block of size 7 in the yeast is composed of two regions, each with seven paranome members that are contiguous, but when we count all the genes one region contains 14 genes and the other 10. We calculated the number of genes that are inside duplicated block for each chromosome in the worm and yeast, and the results are listed in the last column in Tables 6 and 7. Comparing columns 5 and 7 in each of these tables, it can be seen that the duplicated blocks in the yeast include more singleton genes than the duplicated blocks in the worm. The last line in the tables shows that, for the yeast, all the blocks contain 345 paranome members and 508 singletons (853±345), while in the worm the blocks contain 436 paranome members and only 116 singletons (552±436). This ®nding is in accord with Wolfe and Shields (1997), who proposed that the yeast genome is the result of a whole-genome duplication, followed by aneuploidization and massive gene loss. This massive gene loss could be respon-

Our method detected fewer block duplications than previously reported for yeast (Wolfe and Shields 1997; Friedman and Hughes 2001). There are two possible reasons. First, our definition of protein families is more stringent than that in both of the two previous studies and yielded smaller numbers of families for the yeast and worm. Second, we required that the paranome members in a block be contiguous; this is a conservative requirement to avoid spurious hits and, also, to make the definition of a block more precise. However, a visual inspection of the results suggests that the second factor may not be very important in the yeast and worm, because we did not find any block that could be extended by the incorporation of nearby noncontiguous paranome members.

The large number of duplicated blocks we found in the worm is surprising in view of the fact that Friedman and Hughes (2001) detected only five of these blocks. This might be an indication that Friedman and Hughes's method of dividing the paranome into a fixed number of nonoverlapping windows overlooks some of the block duplications. Another factor in the difference between the two analyses is that there are several instances where more than one member of the same family is present in each block; according to Friedman and Hughes (2001) these members would be counted just once, thus reducing the size of the hits. However, a closer examination indicates that this factor has only a minor effect.

Our method has the advantage of being completely automated, requiring no visual analysis step as in the block detection step of Wolfe and Shields (1997) and the extension step of Friedman and Hughes (2001). The price we pay for this automation is that we have to consider only contiguous paranome members in the blocks, which does not seem to be a great problem.

Table 6. Number of genes in the paranome and involved in block duplications for worm

| Chromosome | No. of genes | No. of paranome members | % of genes in the paranome | No. of paranome members in blocks | % of paranome members in blocks | No. of genes inside duplicated blocks |
|---|---|---|---|---|---|---|
| I | 2,810 | 632 | 22.5 | 3 | 0.5 | 4 |
| II | 3,319 | 1,050 | 31.6 | 83 | 7.9 | 101 |
| III | 2,490 | 506 | 20.3 | 6 | 1.2 | 6 |
| IV | 3,106 | 1,025 | 33.0 | 53 | 5.2 | 55 |
| V | 4,792 | 2,342 | 48.9 | 291 | 12.4 | 386 |
| X | 2,684 | 549 | 20.5 | 0 | 0.0 | 0 |
| All | 19,201 | 6,104 | 31.8 | 436 | 7.1 | 552 |

Table 7. Number of genes in the paranome and involved in block duplications for yeast

| Chromosome | No. of genes | No. of paranome members | % of genes in the paranome | No. of paranome members in blocks | % of paranome members in blocks | No. of genes inside duplicated blocks |
|---|---|---|---|---|---|---|
| I | 105 | 29 | 27.6 | 6 | 20.7 | 11 |
| II | 420 | 105 | 25.0 | 26 | 24.8 | 64 |
| III | 167 | 39 | 23.4 | 13 | 33.3 | 25 |
| IV | 800 | 165 | 20.6 | 46 | 27.9 | 97 |
| V | 282 | 72 | 25.5 | 19 | 26.4 | 53 |
| VI | 131 | 30 | 22.9 | 5 | 16.7 | 5 |
| VII | 562 | 116 | 20.6 | 33 | 28.4 | 110 |
| VIII | 281 | 75 | 26.7 | 15 | 20.0 | 35 |
| IX | 218 | 49 | 22.5 | 8 | 16.3 | 14 |
| X | 382 | 76 | 19.9 | 24 | 31.6 | 36 |
| XI | 335 | 69 | 20.6 | 13 | 18.8 | 33 |
| XII | 539 | 127 | 23.6 | 24 | 18.9 | 63 |
| XIII | 480 | 104 | 21.7 | 36 | 34.6 | 99 |
| XIV | 417 | 93 | 22.3 | 12 | 12.9 | 31 |
| XV | 563 | 127 | 22.6 | 25 | 19.7 | 62 |
| XVI | 490 | 97 | 19.8 | 40 | 41.2 | 115 |
| All | 6,172 | 1,373 | 22.2 | 345 | 25.1 | 853 |

Patterns of Block Duplications in the Yeast and Worms

Both organisms studied show evidence of extensive block duplications. Although this fact was known for the yeast, it was thought that block duplication was a rare event in the worm. Indeed, Semple and Wolfe (1999) detected only three blocks and Friedman and Hughes (2001) detected only five blocks in the worm, whereas we found that 7.1% of the duplicated genes in the worm resulted from block duplications. The yeast shows much more extensive block duplication than the worm, with some chromosomes, such as chromosome XVI, having more than 40% of the duplications resulting from block duplications. However, the worm also shows evidence of extensive block duplications, especially for chromosome V, which has 12.4% of its duplicate genes resulting from block duplications. In the yeast 25.1% of the duplicated genes came from block duplications; in the worm this value is only 7.1%.
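The genome-wide percentages quoted here, and the numbers of intervening singletons, follow directly from the "All" rows of Tables 6 and 7. The short Python sketch below recomputes them as a consistency check; the dictionary layout and the function name `block_summary` are illustrative choices, not part of the original analysis.

```python
# Totals copied from the "All" rows of Tables 6 (worm) and 7 (yeast).
totals = {
    "worm": {"paranome": 6104, "in_blocks": 436, "genes_in_blocks": 552},
    "yeast": {"paranome": 1373, "in_blocks": 345, "genes_in_blocks": 853},
}


def block_summary(counts):
    """Return (% of paranome members in blocks, intervening singletons)."""
    pct_in_blocks = 100 * counts["in_blocks"] / counts["paranome"]
    # Singletons are the genes inside the blocks that are not paranome members.
    singletons = counts["genes_in_blocks"] - counts["in_blocks"]
    return round(pct_in_blocks, 1), singletons


for species, counts in totals.items():
    pct, singletons = block_summary(counts)
    print(f"{species}: {pct}% of paranome members in blocks, "
          f"{singletons} intervening singletons")
```

The same function can be applied to a single chromosome by passing that row's counts instead of the totals.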

The patterns of duplication are markedly different between the two genomes. First, in the yeast the duplication events are approximately randomly distributed among all the chromosomes, while in the worm some chromosomes have many more duplication events than others; for block duplications this difference between the two genomes is even more marked. The homogeneous distribution of duplications among yeast chromosomes is consistent with the hypothesis of a whole-genome duplication in yeast (Wolfe and Shields 1997). Second, the duplicated blocks in the yeast have many more intervening singletons than the blocks in the worm. If the whole-genome duplication hypothesis is true, the singletons inside blocks would imply massive translocation or gene loss in the yeast genome. Third, in the worm the block duplications are generally intrachromosomal, while in the yeast they are generally interchromosomal. For the yeast this pattern of block duplication is the same as that for single-gene duplication, which is also generally interchromosomal. However, for the worm single-gene duplications show no tendency to be inter- or intrachromosomal.
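To make the contiguity requirement, the three matching criteria, and the intra- versus interchromosomal distinction concrete, the following Python sketch scans every window of a given size over a toy paranome. It is a minimal illustration under stated assumptions, not the program used in this study: the `Gene` type, the function names, the family numbers, and the chromosome labels are hypothetical, singletons are assumed to have been removed already, and only windows read in the same direction are compared.

```python
from collections import defaultdict
from typing import NamedTuple


class Gene(NamedTuple):
    family: int  # hypothetical identifier of the paralogous family
    strand: str  # "+" or "-"


def window_key(window, criterion):
    """Signature of a window under one of the three matching criteria."""
    if criterion == "composition":
        return tuple(sorted(g.family for g in window))
    if criterion == "order":
        return tuple(g.family for g in window)
    if criterion == "orientation":
        return tuple((g.family, g.strand) for g in window)
    raise ValueError(f"unknown criterion: {criterion}")


def find_blocks(paranome, size, criterion="orientation"):
    """Pair up non-overlapping windows of `size` contiguous paranome members.

    `paranome` maps a chromosome label to its paranome members in
    chromosomal order (singletons already removed).
    """
    hits_by_key = defaultdict(list)
    for chrom, genes in paranome.items():
        # Every possible window is considered, not only fixed ones.
        for start in range(len(genes) - size + 1):
            key = window_key(genes[start:start + size], criterion)
            hits_by_key[key].append((chrom, start))

    pairs = []
    for hits in hits_by_key.values():
        for i, (c1, s1) in enumerate(hits):
            for c2, s2 in hits[i + 1:]:
                if c1 == c2 and abs(s1 - s2) < size:
                    continue  # skip overlapping windows on one chromosome
                kind = "intra" if c1 == c2 else "inter"
                pairs.append(((c1, s1), (c2, s2), kind))
    return pairs


# Toy paranome with made-up families and chromosome labels.
toy = {
    "A": [Gene(1, "+"), Gene(2, "+"), Gene(3, "-"), Gene(4, "+")],
    "B": [Gene(5, "+"), Gene(1, "+"), Gene(2, "+"), Gene(3, "-")],
    "C": [Gene(3, "+"), Gene(1, "+"), Gene(2, "-")],
    "D": [Gene(6, "+"), Gene(7, "-"), Gene(8, "+"),
          Gene(6, "+"), Gene(7, "-"), Gene(8, "+")],
}

for criterion in ("composition", "order", "orientation"):
    print(criterion, find_blocks(toy, 3, criterion))
```

In this toy example the composition criterion pairs more windows than the other two, because chromosome C carries the same three families as the windows on A and B but in a different order and orientation. The repeated window on chromosome D is reported as an intrachromosomal pair.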

Acknowledgments. This work was supported by NIH Grants GM30998, GM55759, and HD38287. A.R.O.C. was supported by CAPES-Brasilia. We thank A. Nekrutenko, H. Wang, K. Thornton, and E. Stahl for discussion. Computer support from R. Blocker is greatly appreciated.

References

Doolittle RF (1986) Of URFs and ORFs: A primer on how to analyze derived amino acid sequences. University Science Books, Mill Valley, CA

Doolittle RF (1995) The multiplicity of domains in protein. Annu Rev Biochem 64:287-314

Friedman R, Hughes AL (2001) Gene duplication and the structure of eukaryotic genome. Genome Res 11:373-381

Gu Z, Cavalcanti ARO, Chen FC, Bouman P, Li WH (2002) Extent of gene duplication in the genomes of Drosophila, nematode, and yeast. Mol Biol Evol 19:256-262

Haldane JBS (1932) The causes of evolution. Longmans and Green, London

Krogh A, Brown M, Mian IS, Sjolander K, Haussler D (1994) Hidden Markov models in computational biology: Applications to protein modeling. J Mol Biol 235:1501-1531

Li W-H (1997) Molecular evolution. Sinauer, Sunderland, MA

Lynch M, Conery JS (2000) The evolutionary fate and consequences of duplicate genes. Science 290:1151-1155

Moriyama EN, Hartl DL (1993) Codon usage bias and base composition of nuclear genes in Drosophila. Genetics 134:847-858

Nei M, Gojobori T (1986) Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol 3:418-426

Ohno S (1970) Evolution by gene duplication. Springer-Verlag, Berlin

Robertson HM (1998) Two large families of chemoreceptor genes in the nematodes Caenorhabditis elegans and Caenorhabditis briggsae reveal extensive gene duplication, diversification, movement, and intron loss. Genome Res 8:449-463

Robertson HM (2000) The large srh family of chemoreceptor genes in Caenorhabditis nematodes reveals processes of genome evolution involving large duplications and deletions and intron gains and losses. Genome Res 10:192-203

Rost B (1999) Twilight zone for protein sequences alignments. Protein Eng 12:85-94

Rubin GM, et al. (2000) Comparative genomics of the eukaryotes. Science 287:2204-2215

Semple C, Wolfe KH (1999) Gene duplication and gene conversion in the Caenorhabditis elegans genome. J Mol Evol 48:555-564

Seoighe C, Wolfe KH (1999) Updated map of duplicated regions in the yeast genome. Gene 238:253-261

Sharp PM, Li WH (1989) On the rate of DNA sequence evolution in Drosophila. J Mol Evol 28:398-402

Sluder AE, Mathews SW, Hough D, Yin VP, Maina CV (1999) The nuclear receptor superfamily has undergone extensive proliferation and diversification in nematodes. Genome Res 9:103-120

Sonnhammer ELL, Eddy SR, Durbin R (1997) Pfam: A comprehensive database of protein domain families based on seed alignments. Proteins 28:405-420

Wolfe KH, Shields DC (1997) Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387:708-713

Wright F (1990) The effective number of codons used in a gene. Gene 87:23-29