Selection, genomewide fitness effects and ... - Wiley Online Library

3 downloads 4003 Views 532KB Size Report
Selection, genome-wide fitness effects and evolutionary ... *Institute of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich 8057, ...
Molecular Ecology (2013) 22, 3525–3538

doi: 10.1111/mec.12329

Selection, genome-wide fitness effects and evolutionary rates in the model legume Medicago truncatula TIMOTHY PAAPE,* THOMAS BATAILLON,† PENG ZHOU,‡ TOM J. Y. KONO,§ ROMAN B R I S K I N E , ‡ ¶ N E V I N D . Y O U N G ‡ * * and P E T E R T I F F I N * * *Institute of Evolutionary Biology and Environmental Studies, University of Zurich, Zurich 8057, Switzerland †Bioinformatics Research Center, Aarhus University, DK-8000 Aarhus C, Denmark, ‡Department of Plant Pathology, University of Minnesota, St. Paul, MN 55108, USA, §University of Minnesota, Department of Agronomy and Plant Genetics, St. Paul, MN 55108, USA, ¶Department of Computer Science, University of Minnesota, Minneapolis, MN 55455, USA, **Department of Plant Biology, University of Minnesota, St. Paul, MN 55108, USA

Abstract Sequence data for >20 000 annotated genes from 56 accessions of Medicago truncatula were used to identify potential targets of positive selection, the determinants of evolutionary rate variation and the relative importance of positive and purifying selection in shaping nucleotide diversity. Based upon patterns of intraspecific diversity and interspecific divergence, c. 50–75% of nonsynonymous polymorphisms are subject to strong purifying selection and 1% of the sampled genes harbour a signature of positive selection. Combining polymorphism with expression data, we estimated the distribution of fitness effects and found that the proportion of deleterious mutations is significantly greater for expressed genes than for genes with undetected transcripts (nonexpressed) in a previous RNA-seq experiment and greater for broadly expressed genes than those expressed in only a single tissue. Expression level is the strongest correlate of evolutionary rates at nonsynonymous sites, and despite multiple genomic features being significantly correlated with evolutionary rates, they explain less than 20% of the variation in nonsynonymous rates (dN) and 50% in enteric bacteria (Charlesworth & Eyre-Walker 2006), >40% in Drosophila simulans and Drosophila melanogaster (Smith & EyreWalker 2002; Welch 2006), Mus muscsulus castaneus (Halligan et al. 2010), Capsella grandiflora (Slotte et al. 2010), Helianthus petioloaris (Strasburg et al. 2009) and >30% in Populus tremula (Ingvarsson 2009). By contrast, analyses of genome-wide data from humans (Boyko et al. 2008; EyreWalker & Keightley 2009), Saccharomyces (Liti et al. 2009) and many plant species (Foxe et al. 2008; Gossmann et al. 2010; Slotte et al. 2011) show little evidence for extensive positive selection driving divergence, possibly due to population structure and/or small effective population sizes (Gossmann et al. 2010; Gossmann et al. 2012). Although positive selection is certainly underlying some proportion of between-species amino acid differences, for most species, the inferred distribution of fitness effects (DFE) reveals that the largest proportion of mutations are strongly deleterious and subject to purifying selection (Eyre-Walker & Keightley 2009; Gossmann et al. 2010; Slotte et al. 2010; Tellier et al. 2011; Hvilsom et al. 2012). In addition to providing an opportunity to identify both the specific targets of selection and the relative importance of positive vs. purifying selection in shaping nucleotide diversity, genome-wide data also provide an opportunity to identify genomic correlates of evolutionary rate variation. Rates of nucleotide evolution vary with several aspects of gene function and structure including gene length, intron number, expression level and tissue specificity (reviewed in Gaut et al. 2011). Gene expression is a strong correlate of evolutionary rate variation at nonsynonymous sites in a wide variety of organisms including mammals (Duret & Mouchiroud 2000; Liao & Zhang 2006), Drosophila (Lemos et al. 2005), bacteria (Rocha & Danchin 2003), Chlamydomonas (Rocha 2006) and yeast (Drummond et al. 2005). The negative correlation between expression and nonsynonymous rate varia-

tion likely reflects stronger purifying selection reflecting the selective importance of abundant proteins (Drummond et al. 2005; Drummond & Wilke 2008). In contrast to evolution at nonsynonymous sites, at least in Arabidopsis, the rate of evolution at synonymous sites is more strongly correlated with gene length and GC content (Slotte et al. 2011; Yang & Gaut 2011), factors that may reflect underlying variation in mutation rates. The species Medicago truncatula has become a genomic model for studying symbiosis with nitrogen-fixing bacteria (Young et al. 2005) as well as legume crop development (Young & Udvardi 2009). Branca et al. (2011) used genome-scale sequence data to describe how nucleotide diversity in M. truncatula varies along chromosomes and searched for genes that were candidates of selection based solely on within-species diversity. Here, we present analyses of nucleotide diversity, interspecific divergence and gene expression data for 20 968 M. truncatula genes. Our specific objectives are fourfold: (i) estimate the genome-wide strength of purifying and positive selection, (ii) examine the relationship between the strength of purifying selection and gene expression by estimating the DFE (the proportion of beneficial or deleterious new mutations), (iii) use data on gene location, length and expression to identify genomic features correlated with the rates of synonymous and nonsynonymous sites evolution and (iv) identify genes that have signatures of having evolved in response to strong positive selection. Because diversity-based tests can depend on the composition of the population sampled (e.g. Stadler et al. 2009), we apply these tests to three partitions of the data, the entire sample and two genetically defined subpopulations.

Materials and methods Nucleotide data from 20 968 putative coding sequences were obtained by sequencing the full genomes of 56 Medicago truncatula accessions drawn from a core collection sampled from across the species range (Ronfort et al. 2006). Twenty-six accessions were sequenced to aligned coverage of ~159 unique coverage (Branca et al. 2011) and 30 accessions were sequenced to ~59 coverage (Table S1, Supporting information, NCBI Short Read Archive SRP001874). Paired-end Illumina sequencing libraries (~200–450 nt insert sizes) were prepared for sequencing according to standard methods (Bentley et al. 2008) using total DNA extracted from a pool of 30-day-old dark-grown seedlings. Libraries were sequenced using GAII or GAIIx Illumina instruments to yield paired 90-mer or 151-mer reads (trimmed to 90mers for analysis). Illumina image analysis pipeline with default parameters was used for base-calling, quality filtering, and to remove adapter and PhiX contamination. Reads © 2013 John Wiley & Sons Ltd

P O S I T I V E A N D P U R I F Y I N G S E L E C T I O N I N M E D I C A G O T R U N C A T U L A 3527 that passed initial quality control filtering were aligned to the M. truncatula reference genome v.3.5 (Young et al. 2011) (www.medicagohapmap.org) using GSNAP (Wu & Nacu 2010). The M. truncatula reference genome is a BAC-based assembly of accession A17 (hereafter referred to as HM101) that covers c. 70% of the euchromatin. After excluding reads 70% of reads. The >70% of reads calling a site means that there are no heterozygous sites within individuals. This should have minor effects given high selfing rates in natural populations (>98%) (Ronfort et al. 2006; Siol et al. 2007) and greater than or equal to three generations of selfing prior to DNA extraction. Positions with >1000 for 26 deep covered accessions or >500 unique reads for shallow accessions were excluded to prevent calling SNPs in repetitive regions that appear only once in the reference genome. Called SNPs for the 56 accessions are available at www.medicagohapmap.org. We evaluated interspecific divergence between M. truncatula and Medicago sativa (alfalfa) using genome sequence data from a single accession of M. sativa (~149 coverage; Yoder et al. 2013). Comparison of the Illumina data to Sanger sequence data for 47 genomic regions from each of 16 accessions [~47 kb from each accession (De Mita et al. 2011)] revealed 2843 SNPs called identically in both the Sanger and Illumina data for these 16 accessions and 102 putative SNPs identified in the Illumina data that were not present in the Sanger data (Branca et al. 2011). We focused our analyses on putative protein-coding regions annotated by the International Medicago Genome Annotation Group (IMGAG V3.5, medicago.org/ genome/IMGAG; He et al. 2009) with an additional ~300 manually annotated CCP/NCR and NBS-LRR genes (K. Silverstein & P. Zhou, unpublished data). We analysed data for 20 968 genes for which  70% of the annotated length was covered in both  40 of 56 M. truncatula accessions and M. sativa (data available at medicagohapmap.org). The 20 968 coding sequence alignments represent approximately half of the 42 833 predicted gene models (Young et al. 2011).

Selection We used the programs polydNdS and MK test from the libsequence package (Thornton 2003) to estimate © 2013 John Wiley & Sons Ltd

intraspecific diversity [hW (Watterson 1975) and hp (Tajima 1989)], the total number of polymorphic nonsynonymous (Pn) and synonymous (Ps) sites as well as the number of fixed nonsynonymous (dN) and synonymous (dS) differences per nonsynonymous and synonymous site, respectively, between M. truncatula and M. sativa. We used three statistics to identify genes most likely to have evolved in response to positive selection tests, dN: dS (the ratio of nonsynonymous difference per nonsynonymous site to synonymous differences per synonymous site), the McDonald–Kreitman (1991) test (MK test hereafter, a contingency test comparing the numbers of between-species difference and within-species polymorphisms at nonsynonymous and synonymous sites) and the joint DTH statistic (Zeng et al. 2006). The DTH statistic is based on the joint distribution of Tajima’s D (DT, estimated using the computer program in libsequence), a statistic based on the relative frequencies of variants segregating at polymorphic sites and the normalized version of Fay and Wu’s H (estimated using calcH in libsequence), a statistic used to infer past selection on the basis of derived variants [i.e. the frequency of variants that have entered a population since it last shared a common ancestor with an outgroup (Fay & Wu 2000; Zeng et al. 2006)]. For estimates of DT and H, sites with > 16 N’s (missing data) at a single position in an alignment were excluded from the analyses. We used the polymorphism and divergence data to calculate the direction of selection (DoS = Dn/ (Dn+Ds) Pn/(Pn+Ps)) statistic of Stoletzki & EyreWalker (2010), where Dn and Ds are the numbers of fixed nonsynonymous and synonymous substitutions. DoS < 0 is consistent with purifying selection and DoS > 0 is consistent with positive selection. For each gene, we also calculated the proportion of amino acid substitutions driven by positive selection (a) as 1 DsPn/ DnPs (Eyre-Walker 2006). Similar values of a were obtained using the DOFE 3.0 software of (Eyre-Walker & Keightley 2009).

Gene expression data Gene expression data were obtained from Young et al. (2011). In brief, RNA was extracted from six tissue types (nodule, leaf, flower, root, pod and bud) sampled from a single accession (HM101), and RNA was sequenced using Illumina GAII technology. For 4931 of the gene models for which we obtained sequenced data, there were no detected transcripts in the RNA-seq data from Young et al. (2011). We refer to the genes with undetected transcripts as nonexpressed genes, although it should be recognized that genes with undetected expression may be expressed at low levels or in environmental conditions other than those in which the

3528 T . P A A P E E T A L . plants were grown. Alternatively, there may be inaccuracy in gene model prediction. For statistical analyses, total expression was calculated as the sum of expression across the six sampled tissues, expression specificity was calculated using the tissue specificity index,τ (Yanai et al. 2005), using log2 values of expression, gene length was based on IMGAG annotation, and chromosomal position was the relative position between the centromere and end of each chromosome arm (calculated for each arm separately). The τ statistic ranges from 0 for genes with equal levels of expression in all tissues to 1 for genes expressed in only a single tissue. Total expression data were log2-transformed prior to analyses.

Correlates of evolutionary rates We used Pearson’s correlations and partial correlations (pcor.test in R www.yilab.gatech.edu/pcor.html, Kim & Yi 2006) to examine the linear relationship between evolutionary rate (dN, dS and dN:dS) and total expression, and tissue specificity (τ), gene length and chromosomal position. We used linear regression (lm function in R) with the stepAIC function for model selection to estimate the proportion of evolutionary rate variation predicted from these potential explanatory variables.

Results Genome-wide purifying selection and adaptation

DFE and SFS The distributions of the frequency of variants segregating at polymorphic sites, also known as the site frequency spectra (SFS), were created using the SoFoS software (www.rilab.org/code/files/SoFoS.html) on the set of nonsynonymous and synonymous polymorphisms identified using polydNdS (Thornton 2003). Both folded and unfolded SFS were calculated; the folded spectrum does not differentiate between ancestral polymorphisms and polymorphisms that are the result of mutations that have entered a population since it split from a common ancestor, while the unfolded spectra is based on derived allele frequencies. If a SNP site was not covered in the full sample, the frequency at that site was scaled to the full size by estimating how additional samples would appear at that nucleotide. This was done by putting a prior, f (p), on the underlying allele frequency, p, and then integrating over the prior as shown in the supplementary methods of Hufford et al. (2012). To estimate the DFE (i.e. the distribution of the strength of selection acting against new mutations), we used the likelihood method of Eyre-Walker & Keightley (2009) implemented in the software DOFE 3.0. The program was run for 1 9 106 steps and sampled every 1000 steps after a burn in of 100 000 steps. Strongly deleterious mutations have Nes > 10 (where Ne is the effective population size and s is the selection coefficient), mildly deleterious mutations have 1 < Nes < 10, and effectively neutral mutations have Nes < 1. To estimate DFE, we used folded allele frequency spectra and the estimated number of nonsynonymous (Dn) and synonymous (Ds) differences between M. truncatula and M. sativa. Allele frequency data were summed across all genes within each of four partitions of the data (expressed genes, nonexpressed genes, genes expressed in all tissues as τ < 0.2 and genes expressed in only a single tissue). The DFE method implicitly assumes that SNPs are unlinked and that synonymous SNPs are selectively neutral.

Within 20 968 predicted protein-coding genes, we identified 348 051 nonsynonymous polymorphisms and 226 616 synonymous polymorphisms (median hWsyn = 0.0096 per site, hpsyn = 0.0054 per site) with an excess of low-frequency variants relative to expectations from a neutral equilibrium model (median DT = 1.57, Table 1, Fig. 1). Pooling unequally sampled subpopulations can skew DT values through unequal representation of SNPs segregating within the different subpopulations (e.g. Ptak & Przeworski 2002; Pool & Aquadro 2006; Arunyawat et al. 2007; Moeller et al. 2007; Stadler et al. 2009). This seems to be at least partially responsible for the negatively centred DT in our sample; STRUCTURE analyses (Fig. S1, Supporting information) revealed that the 56 accession sample harboured three subpopulations, and median DT for the two of these subpopulations for which we had >10 accessions (hereafter referred to as K1 and K2) was greater than that for the entire sample (median DT K1 = 1.26, DT K2 = 0.72). The correlation between DT in the two subpopulations was highly significant (Pd.f. = 19 966 < 0.001) but weak (Pearson R = 0.16). The Medicago truncatula sample differed from the single Medicago sativa accession at 172 622 fixed nonsynonymous and 221 252 fixed synonymous sites (median dN = 0.0096, median dS = 0.047). The evolutionary rates at nonsynonymous sites (dN), synonymous sites (dS) and the ratio dN:dS differed significantly between expressed and nonexpressed genes (all P 1, and 226 (1.2%) had dN:dS > 2 or had  7 fixed nonsynonymous sites but no fixed synonymous sites. Among these were 80 genes with annotated functions and 146 annotated as © 2013 John Wiley & Sons Ltd

Fig. 2 Direction of selection (DoS) among 20 439 coding genes (genes with no synonymous substitutions were excluded) showing ~75% of the sampled genes have values of DoS < 0 (red vertical line), indicative of purifying selection.

hypothetical proteins (Table S2, Supporting information). The MK test identified 206 genes with a signature of adaptation (P < 0.05) and DoS > 0 (Table S3, Supporting information). The mean dN:dS of these candidates = 1.07, and the proportion of amino acid substitutions driven by positive selection (a) ranges from 0.34 to 1 (Table S3, Supporting information). Because of the large number of tests conducted, these genes should be considered as candidates of selection rather than as having robust statistical significance for non-neutral evolution. We also note that the pervasive genome-wide purifying selection obscures the expected null model (neutral) distribution of P-values. The average gene length of the 206 genes identified as potential targets of positive selection by the MK test is 1404 bp, slightly longer than the genome-wide average of 1146 bp, and the length difference appears to reflect greater statistical

3530 T . P A A P E E T A L . power associated with greater numbers of informative sites in longer genes (Charlesworth & Eyre-Walker 2008; Obbard et al. 2009; Keightley & Eyre-Walker 2010). In fact, for all genes, there was a slightly negative correlation between length and DoS (R = 0.08) and a much stronger negative correlation between length and DoS (R = 0.37) for genes with DoS > 0. To identify genes that bear a signature of having evolved in response to recent positive selection and to explore sampling effects on diversity-based tests of selection, we applied the DTH test (Zeng et al. 2006) separately to three partitions of the data: the entire data, subpopulation K1 and subpopulation K2. Across the three partitions, a total of 270 genes had values of both DT and H in the lowest 5% of the distribution of each statistic (Fig. 3); we considered these as candidates for having evolved in response to past selection. The genes identified by DTH depended on whether the test was applied to the entire data set or from the K1 or K2 subpopulations separately, and no genes were identified as targets of selection in all three data sets (Fig. 3). The lack of overlap among subpopulations is due in part to the 5% significance threshold—an additional 55 of the 270 genes would have been identified in more than 1 subgroup if we had relaxed the 5% significance threshold to 10% or had used only DT or H to identify selected genes. However, it is also clear that different samples identify different candidates; of the 184 genes that were below the 5% significance threshold in only the K1 or K2 subpopulation, 91 had DT values in the entire data set that were greater than the median DT value.

Purifying selection and gene expression The SFS show greater proportions of low-frequency variants segregating at nonsynonymous than synonymous polymorphic sites for expressed genes than non-

full N = 56

57 11 K1 N = 34

83

0 4

4

111

K2 N = 11

Fig. 3 Venn diagram of genes identified as targets of selection by the DTH statistic in the full data and two STRUCTURE identified subpopulations.

expressed genes, consistent with stronger purifying selection acting on expressed genes (Fig. 4A). Similarly, nonsynonymous and synonymous substitutions in genes showing homogeneous expression across all six tissue types (i.e. τ < 0.2) show signatures of stronger purifying selection than those expressed in only a single tissue (Fig. 4B). The DFE is also indicative of stronger purifying selection acting on expressed than nonexpressed genes; expressed genes show a significantly greater proportion of strong deleterious (Nes > 10) mutations than genes not expressed in any tissue (Fig. 5). The nonexpressed genes also show nearly two times greater proportion of mutations in the neutral (Nes = 0–1) category. The strength of purifying selection is also related to the tissue specificity of expression. Mutations in genes with more homogeneous expression across the six tissues (τ < 0.2) show the greatest proportions of strongly deleterious mutations (68% with Nes > 10) and the lowest proportion of nearly neutral mutations (18% with 0 < Nes < 1). The relationships between expression and purifying selection do not appear to be strongly influenced by potential errors in SNP identification; we found very similar results when singletons were removed from the analyses (Table S4, Supporting information) and when we analysed data using only the 26 accessions with >129 coverage (Table S5, Supporting information).

Correlates of evolutionary rates Partial correlations between evolutionary rates and genomic features were qualitatively similar for expressed (Table 2) and nonexpressed genes (Table S6, Supporting information). For the 16 007 genes expressed in at least one of the six tissues examined, dN was most strongly correlated with total expression (R = 0.23), dS (R = 0.15) and gene length (R = 0.12). dS also was most strongly correlated with total expression, but unlike dN, the correlation was positive (R = 0.16). Although many of the correlations were statistically significant, they are relatively weak; with the exception of the trivial correlations between dN or dS and dN:dS, the strongest correlation was between dN and total expression (R = 0.23). Even linear combinations of genomic features including chromosome arm, location, length, tissue specificity of expression and GC content were only weakly related to evolutionary rates (Table 3) with the best-fitting models explaining only 18% of the variance in dN, 7% of the variance in dS and 13% of the variance in dN:dS. Full correlations calculated without accounting for covariance between variables (Table S6, Supporting information) were similar to the partial correlations with the exception of correlations involving tissue specificity, for which the © 2013 John Wiley & Sons Ltd

P O S I T I V E A N D P U R I F Y I N G S E L E C T I O N I N M E D I C A G O T R U N C A T U L A 3531 (A) Expressed non-synonymous Expressed synonymous Not expressed non-synonymous Not expressed synonymous

Fig. 4 Derived allele frequencies of nonsynonymous and synonymous substitutions for (A) expressed genes (N = 16 007) and nonexpressed genes (N = 4931) and (B) nonsynonymous and synonymous substitutions for genes expressed homogeneously (τ < 0.2) in six tissues (nodules, roots, leaf, stem, bud and seed pod) and expressed in only one tissue.

(B) Expressed all tissues non-synonymous Expressed all tissues synonymous Expressed single tissue non-synonymous Expressed single tissue synonymous

Derived allele frequency

full correlations were ~ 4 - 6 times stronger. The differences between the full and partial correlations reflect a strong negative correlation (R = 0.61) between tissue specificity and expression.

Candidate genes identified to be under positive selection

10 to 100 Fig. 5 Proportion of segregating mutations under varying strengths of selection. Expressed genes show a significantly greater proportion of strongly deleterious mutations (Nes > 10) than nonexpressed genes, which show a significantly greater proportion of neutral mutations (0 < Nes < 1) than all other categories. Similarly, genes expressed (N = 4070) in all six tissues show reduced neutral and higher strongly deleterious mutations than genes expressed in only one tissue (N = 1692 genes). © 2013 John Wiley & Sons Ltd

Among the 80 annotated genes with dN:dS > 2, 15 are likely involved in defence against pathogens and/or herbivores: three defensins, a xyloglucan-specific endoglucanase inhibitor, homologs of which confer resistance against Aspergillus (Qin et al. 2003), a pierisin which can induce apoptosis in butterflies (Watanabe et al. 1999), 5 NBS-LRR disease resistance genes, 2 E3-ubiquitin-protein ligase RING1 genes, which may be involved in cell death and salicylic acid-mediated defence response (Lee et al. 2011) and three proteinase inhibitors, which may impede digestive enzymes in herbivore guts or secreted by pathogens. Also among the 80 putatively selected genes are three genes with

3532 T . P A A P E E T A L . Table 2 Partial correlations of genomic features with evolutionary rates at nonsynonymous (dN), synonymous (dS) and dN:dS for 16 007 expressed genes (nonexpressed gene data presented in Table S6, Supporting information) dN Chromosome position Length dN dS GC content Tissue specificity τ Total expression

0.01 0.12 – 0.15 0.08 0.08 0.23

dS

dN:dS

0.09 0.06 0.15 – 0.01 0.03 0.16

0.04 0.08 0.71 0.61 0.03 0.04 0.23

All correlations were statistically significant at P < 0.001 except those between dN and chromosome position and dS and GC content.

Table 3 Adjusted R2 from linear regression models for dN, dS and dN:dS. Multiple models were run to compare the relative importance of different explanatory variables Adjusted R2 Model

arm + dis + length arm + dis + length arm + dis + length arm + dis + length dis + length + exp

+ + + +

dX + τ + exp + GC τ + exp + GC dX + exp + GC exp + GC

dN

dS

dN:dS

0.18* 0.16 0.08* 0.07* 0.15*

0.07† 0.05† 0.04 0.03 0.06

– 0.13 – 0.04 0.15

arm, chromosome arm; dis, distance from the centromere at which gene is located; τ, tissue specificity of expression; exp, expression level (log2-transformed prior to analyses); GC, proportion GC of coding region. *dis did not significantly improve the fit of the model and was dropped. † GC did not significantly improve the fit of the model and was dropped.

roles in DNA methylation including one of three DEMETER-like genes and two of four histone-lysine Nmethyltransferase SUVR4 genes. Seven of 18 nodulespecific glycine-rich protein-coding genes, implicated to be involved in nodule organogenesis (Kevei et al. 2002), had dN:dS > 1 although none had dN:dS > 2. Of the 206 genes with DoS > 0 and significant MK test-based departure from neutrality (Table S3, Supporting information), we found 17 F-Box proteins (average dN: dS = 1.75), 10 NBS-LRRs (average dN:dS = 1.18) and four nodule-specific or nodulation-related genes (average dN:dS = 0.96). Among the remaining 270 genes identified by DTH as targets of selection in one or more subpopulations were 193 functionally annotated genes, many with potentially interesting biological functions (Table S2, Supporting

information). Similar to the genes identified as targets of selection by dN:dS, genes likely to be involved in defending plants against pathogens and/or herbivores comprise the largest single class of selection candidates. In addition to the defence-related genes, there were four histone-lysine N-methyltransferase genes (SUVH6, ASHR1, E(z) and a SUVR4), which may be involved in regulating gene expression or transposable element movement (Springer et al. 2003). Several genes with possible roles in mediating symbiosis with rhizobia or mycorrhizae were also identified, including an early nodulin, two ankyrin repeat genes, several genes that encode enzymes in the flavonoid pathway, a glucan endo-1,3-beta-glucosidase a homolog of which is upregulated in super-nodulating soybeans (Salavati et al. 2012), a syntaxin, which may be important in the symbiosome membrane in nodules (Catalano et al. 2007), and SUNN a clavata-like receptor kinase with a supernodulating phenotype (Schnabel et al. 2005). A final gene of interest is an acetolactate synthase (ALS) gene— mutations in which can confer tolerance to sulfonylurea herbicides (Oldach et al. 2008). Of the 270 candidates identified by the DTH statistic, 48 pairs were adjacent, separated by one gene, or separated by up to four genes, which themselves had DT values in the lowest 5%, suggesting that they do not represent independent selective events. The largest of these regions (located at 17.8–18.1 Mbp on chromosome 6) spanned ~270 kb and contained five analysed genes, three of which (one disease resistance-like protein, one chitinase, and one PRP) were identified by DTH with the PRP and disease resistance gene also having dN:dS > 1. This region also contains a cluster of 10 disease resistance-like proteins, although these were not included in our analyses due to lack of coverage, possibly caused by inability to uniquely align reads in these highly repetitive regions.

Discussion Genome-wide purifying selection and adaptation Analyses of whole-genome sequence data revealed purifying selection to be an important evolutionary force in Medicago truncatula, with >50% of nonsynonymous mutations being strongly deleterious (Nes > 10) and the empirical distribution of the DoS statistic strongly negatively centred. These results are similar to those from recent genome-wide analyses of Arabidopsis thaliana, for which ~75% of polymorphic sites are subject to purifying selection (Slotte et al. 2011), a species with a similar mating system and life history. Consistent with pervasive effects of purifying selection, the genome-wide estimate of the proportion of amino © 2013 John Wiley & Sons Ltd

P O S I T I V E A N D P U R I F Y I N G S E L E C T I O N I N M E D I C A G O T R U N C A T U L A 3533 acid differences between M. truncatula and Medicago sativa driven by positive selection (a) is negative, similar to estimates from several other plant species (Gossmann et al. 2010; Slotte et al. 2010, 2011). Weak effects of positive selection are consistent with the expectation that high rates of self-fertilization in M. truncatula (Ronfort et al. 2006; Siol et al. 2007) impede the efficacy of selection (Charlesworth & Wright 2001; Glemin 2007). Small effective population sizes and population structure may also inhibit the efficacy of selection, independent of selfing (Gossmann et al. 2010). In fact, the plant species showing strongest evidence for positive selection being an important force driving divergence on a genomewide scale have been outcrossing species such as Helianthus ssp (Strasburg et al. 2009; Renaut et al. 2012), Capsella grandiflora (Slotte et al. 2010) and possibly Populus tremula (Ingvarsson 2009; Gossmann et al. 2010), which all have relatively large effective population sizes and little population structure. Interestingly, the mean DoS statistic of expressed genes was approximately three times more negative than DoS for genes with undetected transcripts (referred to as nonexpressed although these are presumably either expressed at low levels, in unsampled tissues, or environments other than those in which expression was assayed or possibly are misannotated). The more negative DoS values are consistent with stronger purifying selection. Stronger purifying selection on expressed genes was also reflected in the estimated DFE, which shows stronger purifying selection on more highly expressed genes and on genes expressed in all vs. only one of the six tissues for which expression data were available. These findings are consistent with patterns in mammals (Liao & Zhang 2006) and likely reflect the functional importance of genes that are either highly or widely expressed (reviewed in Drummond & Wilke 2008; Gaut et al. 2011). Interestingly, for both tissue-specific and nonexpressed genes, we find considerably higher proportions of neutral mutations (Nes < 1) than has been found in most other plants (Gossmann et al. 2010; Slotte et al. 2010; Tellier et al. 2011) with the exception of those with small estimated effective population sizes (Boechera stricta, Populus balsamifera and Schiedea globosa) by Gossmann et al. (2010). However, Gossmann et al. (2010) point out that direct statistical comparisons between species using the method of Eyre-Walker & Keightley (2009) require equivalent numbers of chromosomes sampled across taxa, so we interpret our comparisons with caution. We also determined whether sequencing error or low coverage was strongly affecting our results, when we removed singletons (substitutions found only a single accession) and estimated DFE on accessions with >129 we found small changes. The small differences we see using only © 2013 John Wiley & Sons Ltd

deeply covered accessions may be attributed to the issue previously noted about equally sampled chromosomes across data sets when reducing from 56 to 26 accessions as our sample composition inherently changes when we sample reduced numbers of plant accessions.

Correlates of evolutionary rates Similar to what has been found in multiple species (reviewed in Wright 2004; Rocha 2006; Gaut et al. 2011; Ometto et al. 2012), expression level was the strongest correlate of evolutionary rates at nonsynonymous (dN), synonymous (dS) and the ratio of nonsynonymous to synonymous (dN:dS) sites (excluding the correlations between dN, dS and dN:dS). These strong correlations likely reflect functional importance of highly expressed genes. Overall, the correlations between evolutionary rates and genomic features found in the M. truncatula data are very similar to those reported for A. thaliana (Slotte et al. 2011; Yang & Gaut 2011) with dN positively correlated with tissue specificity and dS negatively correlated with gene length and expression level. Similar patterns were recently shown for sunflowers (Renaut et al. 2012) although the correlations with dN: dS appear to be generally weaker. This may, however, be attributable to partial correlations being affected by the different covariates used in our study. Unlike Yang & Gaut (2011), however, we found a positive correlation between dS and distance from the centromere. The positive correlation between dS and distance from the centromere seems unexpected given that there is a negative correlation between nucleotide diversity and distance from the centromere (Branca et al. 2011) and that diversity and divergence are expected to be positively correlated. However, we found synonymous diversity (hwsyn) was negatively correlated with synonymous divergence (dS) with M. sativa when dS was calculated using only fixed differences between species (R = 0.24). This negative correlation appears due to segregation of ancestral polymorphisms —when we estimated dS using only a single M. truncatula accession rather than using sites at which all 56 M. truncatula accessions differ from M. sativa, the correlation between hwsyn and dS is slightly positive (R = 0.054) and the correlation between dS and distance from the centromere near zero (R = 0.019). Although many of the correlations between evolutionary rates and genomic features were statistically highly significant, they were generally weak. In other words, similar to results from Yang & Gaut’s (2011) analyses of A. thaliana data, we found that genomic features have relatively low predictive power of evolutionary rates in M. truncatula. Even the most parameter-rich regression models had R2 values of 0.21 for dN and

3534 T . P A A P E E T A L . 25% of the among-gene variation in dN and in yeast in which expression predicts >50% of among-gene variation in dN. The reasons for the low explanatory power in Arabidopsis and Medicago are unclear. However, as suggested by Yang & Gaut (2011), stochastic variation in divergence related to ancestral polymorphism in closely related species used in these studies seems likely to be important contributor. Mating system may also be an important contributor—both A. thaliana and M. truncatula are highly selfing species in which the efficacy of selection may be much weaker and the importance of drift much greater than it is in outcrossing species (Charlesworth & Wright 2001; Glemin 2007).

Candidate genes under selection To identify genes that are candidates for having evolved in response to positive selection, we used dN: dS, MK and DTH statistics. While our analyses identified individual candidates of selection, they also provide an empirical example of the extent to which genomic scans can be sensitive to sample composition; of the 270 genes identified as candidates of selection by DTH, no genes were identified in all three partitions of the data and only 18 were identified in two partitions of the data. Finding evidence for selection in only one partition of the data may reflect stochastic sampling, local adaptation or ongoing sweeps (Slatkin & Wiehe 1998; Santiago & Caballero 2005; Kim & Maruki 2011). However, it does not appear that our subpopulationspecific signatures of positive selection are due to the conservative criteria we used to identify potential targets of selection (Zeng et al. 2006); approximately one-third of the genes identified as targets of selection in the K1 or K2 data sets had DT values greater than the median in the entire data. In our analyses, defence-related genes comprise the single largest functional class of genes bearing a signature of recent positive selection. Evidence for selection having driven the evolution of defence genes is not surprising given that molecular population genetic analyses have repeatedly identified host immunity genes as having evolved in response to selection and pathogens and herbivores to be potent forces of selection in contemporary populations (reviewed in DeMeaux & Mitchell-Olds 2003; Tiffin & Moeller 2006; Barreiro & Quintana-Murci 2010). We also identified several genes potentially or putatively involved in rhizobial or mycorrhizae signalling (an early nodulin, two ankyrin repeat genes and several

flavonoid pathway genes) as well as nodule development [including 7 of 18 nodule-specific glycine-rich proteins with dN:dS > 1 and NIN, a transcription factor involving signal transduction upon rhizobial infection (Stacey et al. 2006)], which shows significance in MK tests along with high a and was also identified by De Mita et al. (2011) as bearing a signature of selection. Although the theoretical expectations for the evolution of genes mediating interactions with mutualists are not well developed, selection favouring plant genotypes that avoid associations with less beneficial rhizobia strains may result in bouts of positive selection acting on genes involved in rhizobia–plant signalling, similar to those generated by host–pathogen interactions (De Mita et al. 2007). Absent from the list of selected nodulation-related genes is DMI1, a regulator of rhizobia–plant signalling that was previously identified as bearing a signature of adaptive evolution in M. truncatula (De Mita et al. 2007). The difference in the identification of DMI1 may reflect the dependency of selection scans on the composition of the intraspecific sample (De Mita et al. 2011). Finally, signatures of selection (either DTH or dN:dS, Table S2, Supporting information) were identified at eight genes that, based on homology with functionally characterized Arabidopsis genes, may alter gene expression or transposable element activity through DNA methylation. These eight genes include 7 of 21 histonelysine methyltransferases and one of three DEMETERlike genes, which are involved in demethylation and allele-specific expression. Among the genes whose expression is affected by DEMETER is MEDEA (Gehring et al. 2006), a gene that shows strong maternal-copy expression in the endosperm and strong evidence for positive selection in Arabidopsis lyrata (Spillane et al. 2007). These results suggest that adaptive changes in gene expression may occur not only through evolution of transcription factors or cis-regulatory sequence but also by evolution of methylation enzymes responsible for epigenetic changes. This potential is also consistent with functional differences in chromatin regulation that have been identified between Arabidopsis species (De Meaux & Pecinka 2012).

Conclusion Using frequency and divergence estimates, we estimate over half of the nonsynonymous polymorphisms segregating in Medicago truncatula are under very strong purifying selection, and only a very small proportion of genes (~1%) harbour a signature of positive selection. We also found that the strength of purifying selection was dependent on gene expression; genes with detectable expression and those expressed in all tissues experience © 2013 John Wiley & Sons Ltd

P O S I T I V E A N D P U R I F Y I N G S E L E C T I O N I N M E D I C A G O T R U N C A T U L A 3535 stronger purifying selection than genes with undetected transcripts or expressed in only a single tissue. Although gene expression was more highly correlated with the rate of nonsynonymous evolution than were other genomic features, linear models that included expression as well as several other gnomic characteristics explained 1 and DtH in 5% tails; excel datasheet.

3538 T . P A A P E E T A L . Table S3 MK test results for 206 genes with P < 0.05 and DoS > 0; excel datasheet.

ple of 56 accessions with mixed coverage (Table S1) vs. 26 accessions with >129 coverage.

Table S4 Distribution of fitness effects with singletons removed.

Table S6 Correlations of genomic features with evolutionary rates at non-synonymous (dN), synonymous (dS), dN:dS for 4931 genes with no evidence of expression.

Table S5 Distribution of fitness effects for whole genome sam-

© 2013 John Wiley & Sons Ltd