Computer Note - Oxford Journals - Oxford University Press

6 downloads 0 Views 71KB Size Report
Ecology Division, Harpenden, Hertfordshire AL5 2JQ, U.K.. (Dawson); and the ... more than others, while population history, demography, spatial structure, or the ...
 2003 The American Genetic Association

Journal of Heredity 2003:94(5):429–431 DOI: 10.1093/jhered/esg083

Computer Note DetSel 1.0: A Computer Program to Detect Markers Responding to Selection R. VITALIS, K. DAWSON, P. BOURSOT, K. BELKHIR

AND

From the School of Biological Sciences, Royal Holloway, University of London, Egham, Surrey TW20 0EX, U.K. (Vitalis); Rothamsted Research, Plant and Invertebrate Ecology Division, Harpenden, Hertfordshire AL5 2JQ, U.K. (Dawson); and the Laboratoire Ge´nome Populations & Interactions, Universite´ Montpellier 2, C. C. 063, Place Euge`ne Bataillon, 34095 Montpellier Cedex 5, France (Boursot and Belkhir). Address correspondence to Renaud Vitalis at the address above, or e-mail: [email protected].

Estimating population parameters from polymorphism frequency data requires neutral genetic markers. Any departure from neutrality may invalidate the inferences drawn from such analyses. We recently discussed the possibility of identifying markers that show deviation from neutral expectations in pairwise comparisons of diverging populations. We are now releasing a user-friendly software package that implements all the necessary steps to identify the signature of selection among molecular markers in a set of polymorphism data. This software can be downloaded free of charge at http://www.univ-montp2.fr/;genetix/ detsel/detsel.html.

Most of the inference methods in population genetics make the assumption that molecular markers [e.g., allozymes, microsatellites, random amplified polymorphic DNAs (RAPDs), restriction fragment length polymorphisms (RFLPs)] are selectively neutral (Avise 1994). That is, the marker loci have no effect on fitness. However, it has long been recognized that selection may shape the distribution of genetic variation at the so-called ‘‘neutral’’ markers (Choudhary et al. 1992; Singh et al. 1987; Wilding et al. 2001). Since the genetic variation measured is assumed to be identically distributed across molecular markers, locus-specific effects may bias the estimation of some parameters of interest (e.g., migration rate, effective population size) from gene frequency data. This is precisely what has been shown in the species complex Mytilus edulis, where the observed

heterozygote deficits could not be attributed to spatial Wahlund effects alone. Here, natural selection acting directly on the markers or on closely linked loci also has to be invoked (Raymond et al. 1997). Moreover, in some studies, comparisons of different classes of molecular markers have shown that significant differences in the distribution of genetic polymorphisms may arise across classes of markers, as well as across markers, within a class (see, e.g., Estoup et al. 1998; Fre´ville et al. 2001; Lemaire et al. 2000; Ross et al. 1999). So the first question that needs to be answered when analyzing population genetics data sets is this: Can we reject the hypothesis that the genetic markers we use behave as neutral loci? In other words, can we identify some loci that have been responding, directly or indirectly, to selective processes? Such loci should be excluded from the data set in order make more reliable inferences about the history or the structure of populations. Cavalli-Sforza (1966) was the first to point out that any selective process would affect certain regions of the genome more than others, while population history, demography, spatial structure, or the genetic system (Darlington 1939) would act on the whole genome alike. Lewontin and Krakauer (1973) proposed tests of neutrality based on this idea, but these tests were soon shown to be invalid, since the underlying neutral model was far too restrictive (Nei and Chakravarti 1977; Robertson 1975a,b). Bowcock et al. (1991) and then Beaumont and Nichols (1996) brought the necessary refinements to Lewontin and Krakauer’s (1973) approach. In these latter methods the joint distribution of estimates of the parameter FST, together with other summary statistics, is generated from simulations of plausible neutral models of the sampled populations (Bowcock et al. 1991). Certain conditional distributions of FST estimates are found to be robust to the details of the neutral models (Beaumont and Nichols 1996). Any locus that shows up as an outlier with respect to these robust conditional distributions is then under suspicion of being influenced, to some extent, by selective processes. Yet because of the large number of populations considered in their simulations, Beaumont and Nichols (1996) did not account for any case where high mutation rates may depress the parameter FST. This has been shown to undermine the approach for the analysis of stepwise, highly mutating markers (Flint et al. 1999). In the absence of independent and accurate data, the number of parameters needed to describe the history and the structure of populations is large and prevents the estimation of the parameters, as well as the simultaneous test of homogeneity of response of markers (Felsenstein 1982). As advocated by several authors (Robertson 1975b; Tsakas and Krimbas 1976), the solution to this ‘‘infinitely many parameters’’ problem might be to focus on a simple,

429

Journal of Heredity 2003:94(5)

although general, model of population divergence. Following this line of reasoning, we recently proposed a method of detecting selection that relies on a model of population divergence by pure random drift (Vitalis et al. 2001). In this model, a single population (that need not be at equilibrium between mutation and drift) splits into two populations, hereafter referred to as populations 1 and 2. We defined the population-specific parameters F1 and F2, which are simple functions of the parameters of interest of the model, through the ratio of divergence time t over the size of the population Ni, that is, Fi » exp(t/Ni). We further established that the joint distribution of the estimates of those parameters, conditioned on the observed total number of allelic states in the pooled sample (over populations 1 and 2), is relatively insensitive to the nuisance parameters in the model (the ancestral effective population size, mutation rate, and rate of drift before the population split). So if we generate the expected distribution of our single-locus estimates of F1 and F2 using the multilocus estimates of these parameters and some range of values for the nuisance parameters, and then condition that distribution on the total number of allelic states in the sample, we can now identify any marker loci that lie outside the 95% highly probable region derived here from a neutral model (Vitalis et al. 2001). We developed the user-friendly software package DetSel (version 1.0), which implements this method. From a set of polymorphism data, this program provides estimates of pairwise population divergence and generates the expected distribution of these estimates in a neutral scenario. The coalescent-based simulations performed to generate the null distribution ease the computational burden by orders of magnitude, as compared to classical individual-based simulations. Using the intuitive graphical user interface (GUI), it is easy to spot any marker loci that are outliers from the neutral distribution. In the following, we set out a stepby-step guide to the complete analysis.

Methods Input Data Files The software package DetSel accepts input files in Genepop format (Raymond and Rousset 1995). The user chooses a pair of populations and then the loci to be analyzed. If some loci have been shown to deviate from the neutral expectation in a previous run, they may be discarded from the calculation of multilocus estimates of specific parameters F1 and F2. The locus-specific values of the parameter estimates F^1 and F^2 are calculated as K P

½p^iu ð2ni p^iu  1Þ  P^i uu =½2ðni  1Þ 

F^i ¼ u¼1

K P

ðp^1u p^2u Þ

u¼1

1

K P

;

ðp^1u p^2u Þ

u¼1

ð1Þ where p^iu is an unbiased estimate of the frequency of allele u in population i, P^i uu is an unbiased estimate of the frequency of homozygote individuals for allele u in population i, and ni 430

is the sample size of population i. The multilocus estimates are calculated as the ratio of the average of one-locus estimates of the numerator in equation (1) over the average of one-locus estimates of the denominator in equation (1). The multilocus estimates of the parameters F1 and F2 are used as input values for the simulations. Nuisance Parameters The joint distribution of the statistics, conditional on the total number of allelic states in the pooled sample, has been shown to be fairly insensitive to the nuisance parameters of the model. This suggests that different values of the nuisance parameters should not influence the shape of the joint distribution for a given pair of (F1, F2) values. However, the nuisance parameters influence the number of allelic states represented in the sample. Large ancestral population size and/or short evolution time before divergence generate samples with a large number of allelic states in the final sample. It is therefore important that the choice of nuisance parameter values be made according to the observed number of allelic states. To help achieve this, the distribution of allele numbers over all the simulated samples is shown in the second tab of the main dialog box of DetSel 1.0. But it is also possible (and even recommended) to generate the full distribution by taking different values of the nuisance parameters, thus mixing different drift scenarios before divergence. Since the values of these parameters are unknown, it is desirable to take account of this uncertainty in the construction of the joint distribution of the interest parameters. The coalescent process defined by the parameter values is run a large number of times, to be specified by the user.

Results Conditional Distributions For each set of parameter values, a sequence of artificial (haploid) data sets is generated using standard coalescent simulations (see, e.g., Hudson 1990). These simulations are performed as described in Vitalis et al. (2001). The joint distribution is given, conditioned on the total number of allelic states in the pooled sample. Graphic Outputs The distribution is plotted as an envelope that contains a fraction q of the simulated data (e.g., q ¼ 95%, or any other value specified by the user). This envelope is drawn as follows: a two-dimensional histogram is constructed. The whole square region is covered by a two-dimensional mesh, with a number of square cells specified by the user. All the observations (F^1, F^2) are then binned in the appropriate cells. The total cell count is divided by the total number of observations to obtain a discrete probability distribution over the two-dimensional array. This discrete distribution is a close approximation to the expected joint distribution of the estimators (F^1, F^2). The q-level ‘‘high probability region’’ is constructed as follows. The cells are sorted in order of

Computer Note

decreasing probability. Then, starting from the cells with the highest associated probabilities, cells are sequentially added to the confidence region until the cumulative probability of the whole set of cells obtained is equal to (or just exceeds) the chosen q value. From this procedure we obtain for each simulation a region within which a proportion q of the simulated data lies. Note that this confidence region may be discontinuous. Evidence of Selection For a given pair of populations, the observed values of F^1 and F^2 calculated from equation (1) are plotted as points on the graph of the corresponding high probability region generated by simulation. Any locus, for a given pair of populations, for which the point (F^1, F^2) lies outside this high probability region is a candidate for having been influenced by selection acting directly on this locus, or on nearby regions of the genome. However, one should be cautious in interpreting the results of a single pairwise analysis. Any form of selection acting in a local population, which has influenced a particular locus, should show up in all the population pairs that include the population where selection is acting. Therefore only when a particular locus is found to be an outlier in all or most of the pairwise comparisons that include a particular population should local selection be invoked to explain the pattern observed. The simulated data can also be saved as a text file that includes the (F^1, F^2) values as well as the total number of allelic states in the pooled sample, for each run. This file can then be analyzed using a Mathematica notebook (Wolfram 1999), available from the authors upon request. Download Location The software package can be downloaded free of charge at http://www.univ-montp2.fr/;genetix/detsel/detsel.html. This package was programmed using Borland Cþþ Builder Version 5.0 (Inprise Corporation) on a PC running Microsoft Windows 2000 (128 kb RAM available).

Acknowledgments This work was partially funded by contract no. BIO4-CT96-1189 of the Commission of the European Communities (DG XII) to G. Hewitt (coordinator), and R. Vitalis benefited from a European Community MarieCurie Fellowship (contract no. HPMF-CT-2001-01355).

Choudhary M, Coulthart MB, and Singh RS, 1992. A comprehensive study of genic variation in natural populations of Drosophila melanogaster. VI. Patterns and processes of genic divergence between D. melanogaster and its sibling species, Drosophila simulans. Genetics 130:843–853. Darlington CD, 1939. The evolution of genetic systems. Cambridge: Cambridge University Press. Estoup A, Rousset F, Michalakis Y, Cornuet J-M, Adriamanga M, and Guyomard R, 1998. Comparative analysis of microsatellite and allozyme markers: a case study investigating microgeographic differentiation in brown trout (Salmo trutta). Mol Ecol 7:339–353. Felsenstein J, 1982. How can we infer geography and history from gene frequencies? J Theor Biol 96:9–20. Flint J, Bond J, Rees DC, Boyce AJ, Roberts-Thomson JM, Excoffier L, Clegg JB, Beaumont MA, Nichols RA, and Harding RM, 1999. Minisatellite mutational processes reduce FST estimates. Hum Genet 105:567–576. Fre´ville H, Justy F, and Olivieri I, 2001. Comparative allozyme and microsatellite population structure in a narrow endemic plant species, Centaurea corymbosa Pourret (Asteraceae). Mol Ecol 10:879–889. Hudson RR, 1990. Gene genealogies and the coalescent process. In: Oxford survey in evolutionary biology (Futuyma RJ and Antonovics J, eds). Oxford: Oxford University Press; 1–44. Lemaire C, Allegrucci G, Naciri M, Bahri-Sfar L, Kara H, and Bonhomme F, 2000. Do discrepancies between microsatellite and allozyme variation reveal differential selection between sea and lagoon in the sea bass (Dicentrarchus labrax)? Mol Ecol 9:457–467. Lewontin RC and Krakauer J, 1973. Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphism. Genetics 74: 175–195. Nei M and Chakravarti A, 1977. Drift variance of FST and GST statistics obtained from a finite number of isolated populations. Theor Popul Biol 11:307–325. Raymond M and Rousset F, 1995. Genepop (version 1.2): population genetics software for exact tests and ecumenicism. J Hered 86:248–249. Raymond M, Va¨a¨nto¨ RL, Thomas F, Rousset R, de Meeu¨s T, and Renaud F, 1997. Heterozygote deficiency in the mussel Mytilus edulis species complex revisited. Mar Ecol Prog Ser 156:225–237. Robertson A, 1975a. Gene frequency distribution as a test of selective neutrality. Genetics 81:775–785. Robertson A, 1975b. Remarks on the Lewontin-Krakauer test. Genetics 80:396. Ross KG, Shoemaker DD, Krieger MJB, DeHeer J, and Keller L, 1999. Assessing genetic structure with multiple classes of molecular markers: a case study involving the introduced fire ant Solenopsis invicta. Mol Biol Evol 16:525–543. Singh RS, Choudhary M, and David JR, 1987. Contrasting patterns of geographic variation in the cosmopolitan sibling species Drosophila melanogaster and Drosophila simulans. Biochem Genet 25:27–40. Tsakas S and Krimbas CB, 1976. Testing the heterogeneity of F values: a suggestion and a correction. Genetics 84:399–401.

References Avise JC, 1994. Molecular markers, natural history and evolution. New York: Chapman & Hall. Beaumont MA and Nichols RA, 1996. Evaluating loci for use in the genetic analysis of population structure. Proc R Soc Lond B 263:1619–1626.

Vitalis R, Dawson K, and Boursot P, 2001. Interpretation of variation across marker loci as evidence of selection. Genetics 158:1811–1823. Wilding CS, Butlin RK, and Grahame J, 2001. Differential gene exchange between parapatric morphs of Littorina saxatilis detected using AFLP markers. J Evol Biol 14:611–619. Wolfram S, 1999. The Mathematica book, 4th ed. Cambridge: Cambridge University Press.

Bowcock AM, Kidd JR, Mountain JL, Hebert JM, Carotenuto L, Kidd JR, and Cavalli-Sforza LL, 1991. Drift, admixture, and selection in human evolution: a study with DNA polymorphisms. Proc Natl Acad Sci USA 88:839–843.

Received November 25, 2002 Accepted May 31, 2003

Cavalli-Sforza LL, 1966. Population structure and human evolution. Proc R Soc Lond B 164:362–379.

Corresponding Editor: Sudhir Kumar

431