Mutations as Levy flights

Dario Leon (1) and Augusto Gonzalez (2)
(1) Facultad de Fisica, Universidad de La Habana, Cuba
(2) Instituto de Cibernetica, Matematica y Fisica, Calle E 309, Vedado, La Habana 10400, Cuba

arXiv:1605.09697v1 [q-bio.PE] 31 May 2016

Data on single-nucleotide polymorphisms and large chromosomal rearrangements from a long-term evolution experiment with Escherichia coli are analyzed in order to argue that mutations along a cell lineage can be modeled as Levy flights in an abstract mutation space. These Levy flights have two components: random single-base substitutions and large DNA rearrangements. The data provide estimates of the rates of both kinds of events and of the size distribution function of large rearrangements.

PACS numbers: 87.23.-n, 87.23.Kg, 05.40.Fb

Introduction. A single strand of DNA is a one-dimensional lattice whose sites carry a variable, $u_i$, with only four possible values: G, A, T and C. These can also be given as a set of numerical values, for example 3/8, 1/8, -1/8, and -3/8. More than 20 years ago, the distribution of bases in the DNA of eukaryotic cells was extensively studied [1]. The result was that long-range correlations are practically absent in coding regions of the DNA, whereas in non-coding DNA the presence of correlations was unambiguously shown. A walk along the DNA, defined in terms of $y_l = \sum_{i=1}^{l} u_i$, draws a kind of “fractal” landscape as $l$ increases, better described by a Levy walk or flight [2]. This Levy-walk model was conceived as a description of the DNA at a given time, and there was no suggestion that mutations in a cell lineage resemble Levy flights in the mutation space. In the present paper, we use data from a long-term evolution experiment (LTEE) with E. coli populations [3] in order to argue that mutations can indeed be modeled as Levy processes. Data on single-nucleotide polymorphisms (SNPs) in mixed-population samples, taken from generation 2000 to 40000, come from sequencing these samples and aligning the reads to the genome sequence of the ancestral strain [4]. Large chromosomal rearrangements in clones harvested from these samples, on the other hand, are identified by means of a combination of optical techniques, genome sequencing and PCR analysis [5].

Mutations and Levy processes. Mutations are inherited changes in the DNA. Unlike Ref. [2], where the DNA shape was studied by means of the $y_l$ variable, we shall now follow the time evolution, $y_l(t)$. Let us consider, for the sake of simplicity, a single variable $y_L(t)$, where $L$ is the DNA strand size. Time, on the other hand, is best represented by the number of cell generations along a cell lineage, measured from the common ancestor.
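To make the variable $y_l$ concrete, the following minimal sketch computes the walk for a toy sequence and the change in $y_L$ produced by a single base substitution. It uses the numerical encoding quoted in the Introduction; the function names and the toy sequence are illustrative assumptions, not part of the analysis in this paper.

```python
from itertools import accumulate

# Numerical encoding quoted in the Introduction: G, A, T, C -> 3/8, 1/8, -1/8, -3/8.
BASE_VALUE = {"G": 3 / 8, "A": 1 / 8, "T": -1 / 8, "C": -3 / 8}


def dna_walk(sequence):
    """Return [y_1, ..., y_L], where y_l is the sum of u_i for i = 1..l."""
    return list(accumulate(BASE_VALUE[base] for base in sequence.upper()))


def substitute(sequence, site, new_base):
    """Return a copy of the sequence with the base at a 0-based site replaced."""
    return sequence[:site] + new_base + sequence[site + 1:]


seq = "GATTACA"                    # toy sequence, chosen only for illustration
walk = dna_walk(seq)               # y_l for l = 1..7
mutated = substitute(seq, 5, "G")  # single-point mutation: C -> G at index 5

# Change in y_L caused by the substitution: 3/8 - (-3/8) = 3/4.
delta_y_L = dna_walk(mutated)[-1] - walk[-1]
print(walk[-1], delta_y_L)
```

In this picture, a substitution changes $y_L$ by a bounded amount (at most 3/4 with this encoding), which is why single-point mutations act as the small-step component of the process.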
The simplest mutations one can imagine are single-point mutations (SPMs), where a base at a given site of the DNA lattice is substituted by a different one. The probability of such events in the LTEE was estimated as $p_{SPM} = 10^{-4} - 10^{-3}$ bases per generation for the whole genome [6], that is, one SPM every $10^{4} - 10^{3}$ generations.
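As a quick arithmetic check on these figures, the genome-wide rate can be converted into a mean waiting time between SPMs and into a per-base rate. The sketch below is illustrative only: the genome size of about five million bases is the approximate value used in this paper for E. coli, and the variable names are our own.

```python
# Approximate number of bases in the E. coli chromosome (value used in this paper).
GENOME_SIZE = 5e6

# Genome-wide SPM rate bounds quoted for the LTEE, in events per generation.
genome_rates = (1e-4, 1e-3)

# Mean number of generations between two SPMs along a lineage.
waiting_times = [1.0 / rate for rate in genome_rates]

# Probability of an SPM per base per generation.
per_base_rates = [rate / GENOME_SIZE for rate in genome_rates]

for rate, wait, per_base in zip(genome_rates, waiting_times, per_base_rates):
    print(f"rate {rate:.0e}/gen: one SPM every {wait:.0f} generations, "
          f"{per_base:.0e} per base per generation")
```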

Taking into account that the circular DNA of E. coli contains nearly five million bases [7], the probability of SPMs per base per generation is around $2 \times 10^{-11} - 2 \times 10^{-10}$, a number consistent with general estimates [8]. Notice that SPMs are represented as small sporadic jumps in the variable $y_L(t)$. For example, $\Delta y_L = 3/4$ means substitution of a cytosine by a guanine. In a population of bacteria, the pattern drawn by the set of variables $\{y_L(t)\}$, one variable for each bacterium, is that of a Brownian motion, and the maximum deviation in the ensemble, $|y_L(t) - y_L(0)|_{max}$, follows a $t^{1/2}$ law [9]. This is one important component of Levy flights. The second component of Levy flights is large jumps [10], which have as biological counterparts the deletions of DNA fragments, insertions of base sequences at a given site, translocations, and inversions of fragments. With regard to the latter, we stress that the DNA is a double lattice of oriented strands [8]. Thus inversion means precisely that the dual fragment is placed in the original strand, whereas the original fragment now appears in the dual strand. Notice that the description of large jumps requires additional variables, for example the strand size, $L$, which changes after deletion or insertion of a fragment. For the probability of large jumps, we shall use the ansatz $p_{LJ}\,\pi(l)$, where $p_{LJ}$ is a time rate, that is, one event in a certain number of generations, and $\pi(l)$ is a normalized probability for an event involving a segment of size $l \ge 1$. The results of [2] suggest that $\pi(l)$ is a scale-free distribution, that is, $\pi(l) = (\nu - 1)/l^{\nu}$, where $1 < \nu < 3$.

Data on SNPs. In an evolution experiment, random fluctuations are filtered by natural selection. The evolution dynamics in the LTEE is schematically represented in Fig. 1.
Lineages with neutral or deleterious mutations are usually truncated, whereas beneficial mutations confer an evolutionary advantage on clones and, thus, a higher probability to continue. Once they appear, beneficial mutations are fixed in more than 50 % of the population after a fixing time. Loosely speaking, if $P_b$ is the rate of beneficial mutations in the population, and $\tau_f$ the time necessary to fix one of these mutations, the number of fixed beneficial mutations at a given time $t$ is roughly $t/(\tau_b + \tau_f)$, where $\tau_b = 1/P_b$.

FIG. 1. Phylogenetic representation of one day of evolution in the LTEE (horizontal axis: time). After a few clonal divisions (2-3 in the figure, 6-7 in the experiment) individuals are randomly selected to pass to the next day. Most lineages are truncated, whereas those with higher fitness have better chances to continue to the next day.

FIG. 2. (Color online) Number of SPMs as a function of time (number of generations) in a population named Ara-1 of the LTEE [4]. Data from generation 0 (ancestral strain, taken as reference) to 20000 are included in the figure. Plotted series: fixed (> 96 %), total, and model.

We draw in Fig. 2 the data on SNPs, taken from Ref. [4]. A population, called Ara-1 in the experiment, is sampled at generations 2000, 5000, 10000, 15000, 20000, 30000, and 40000. The last two points are not included in the figure because of a mutator phenotype, which appeared at generation 27000 and led to a 100-fold increase of the mutation rate. Alignment of 36-base reads in mixed-population samples yielded 40- to 60-fold coverage, allowing the determination of frequencies of SNPs above 4 % in the population. The authors report “fixed” SPMs, meaning that their frequency, $f$, is above 96 %, as well as SNPs, where 4 % $< f <$ 96 %. The data labelled “fixed” in the figure, corresponding to mutations with $f \ge 96$ %, shows a linear increase at short times with a slope of $1.0 \times 10^{-3}$ mutations/generation, which may be taken as an estimate of $P_b$. The data labelled “total”, on the other hand, is our estimation

for the total number of mutations one may detect in a clone, no matter which mutations one finds. The slope of the initial linear increase is a little higher, around $1.8 \times 10^{-3}$ mutations/generation. We stress, once more, that these are numbers for the whole population. The rate of beneficial mutations in a single cell lineage is $p_b = P_b/N_{cell} \sim 1.8 \times 10^{-3}/(5 \times 10^{6}) \sim 3 \times 10^{-10}$ mutations/generation. Notice that, in the experiment, the number of continuing cell lineages and the genome size are roughly the same, which may sometimes lead to confusion. Notice also that the total number of mutations shows a sublinear behaviour at large times. This is a consequence of the fact that $\tau_f$ usually increases when a new beneficial mutation is added on the background of existing beneficial mutations, a phenomenon known as epistasis. In order to fit the data in the figure, we use the dependence $2s(\sqrt{1 + a N_{gen}} - 1)/a$, coming from a model in Ref. [11], where $s$ is the slope at $N_{gen} = 0$, and $a$ is a parameter.

Data on large chromosomal rearrangements. Data on large chromosomal rearrangements are provided in Ref. [5]. Due to experimental limitations, the authors cannot reliably detect rearrangements smaller than 5 kilobase pairs (Kbp). On the other hand, they can only perform measurements on clones, that is, representatives of a population, which may exhibit strong deviations from mean values. The first set of results involves a time sequence of clones of the population Ara-1, as in the previous section, that is, samples at generations 2000, 5000, 10000, 15000, 20000, 30000, 40000, and 50000. As mentioned above, we shall use the following ansatz for the time and size probability distribution of such events: $p_{LJ}\,\pi(l)$. Here $p_{LJ}$ is the rate (uniform distribution in time), and $\pi(l) = (\nu - 1)/l^{\nu}$ is the normalized probability for a rearrangement of size $l \ge 1$. We do not distinguish between the different kinds of rearrangements: deletions, insertions, translocations, and inversions.

FIG. 3. (Color online) Top: Number of large (greater than 5 Kbp) chromosomal rearrangements in clones of the Ara-1 population as a function of time (number of generations). Bottom: Log-log plot of the size distribution of events, that is, the number of changes with size > l versus size l (see detailed explanation in the text).

Fig. 3, top panel, shows the detected number of events as a function of time (number of generations). Most of these rearrangements seem to be fixed, in the sense that they are also detected at later times. Thus, in order to fit the data we use the same function as for SNPs, that is, $2s(\sqrt{1 + a N_{gen}} - 1)/a$. From the slope, we get a rough estimate for the rate of beneficial large changes in the population, $P_{bLJ} \sim 5 \times 10^{-4}$ large changes/generation. For a single cell lineage, $p_{bLJ} \sim 10^{-10}$ large changes/generation. Fig. 3, bottom panel, on the other hand, reflects the size statistics. We use a log-log plot. The x-axis is the size, $l$, and the y-axis is the number of rearrangements with size greater than or equal to $l$. According to our ansatz, this number equals

$$C(\nu - 1) \int_{l}^{\infty} \frac{dx}{x^{\nu}} = \frac{C}{l^{\nu - 1}},$$

where $C$ is a normalization constant. Notice that, when a change appears at a given time and is fixed, we do not count it as a different event at a later time. The data in Fig. 3, bottom panel, is very well fitted by the function $C/l^{\nu - 1}$, with $\nu \approx 1.42$. Below, we consider a larger data set with better statistics.

The second set of data comes from clones harvested from the 12 independently evolving populations in the LTEE, sampled at generation 40000. There are 110 detected large rearrangements in these clones. The results, shown in Fig. 4, are perfectly fitted by the dependence $C/l^{\nu - 1}$, with $\nu = 1.49$, suggesting a limit $\nu = 3/2$ for the exponent. The slope changes for $l < 5$ Kbp, because the experiment cannot detect all of the rearrangements for these $l$ values, as mentioned above.

FIG. 4. (Color online) Log-log plot of the size distribution of large rearrangements (number of rearrangements with size > l versus size l) in clones obtained from the 12 independently evolving populations in the LTEE, sampled at generation 40000.

Discussion. The data on SPMs and large rearrangements in bacterial DNA in the course of 50000 generations of evolution seem to support the Levy flight picture for mutations along cell lineages. The experiments detect mostly beneficial mutations fixed in the population. Deleterious and most neutral mutations are not registered. Along a cell lineage, we expect $p_{SPM} \sim 5 \times 10^{-4}$ mutations/generation, but only $p_b \sim 3 \times 10^{-10}$ beneficial mutations/generation. Assuming that the observed ratio $p_b/p_{bLJ} \approx 3$ holds also for $p_{SPM}/p_{LJ}$, we get $p_{LJ} \sim 1.6 \times 10^{-4}$ mutations/generation. This value for $p_{LJ}$ does not properly account for events with $l < 5$ Kbp. Thus, the actual rate should be even larger, roughly equal to $p_{SPM}$. Large chromosomal rearrangements are energetically more demanding than SPMs. It does not seem natural to have similar rates for both kinds of events. Exposure of E. coli to external factors they are not used to, like the ultraviolet light of lamps or the background neutron radiation [12], could be

the reason for such a high rate of large changes observed in the LTEE. The ratio $p_{SPM}/p_{LJ}$ may vary abruptly after a mutator phenotype emerges in a population and becomes dominant. In Ara-1, for example, $p_{SPM}$ shows a 100-fold increase, but $p_{LJ}$ keeps roughly its previous value. For the exponent $\nu$ in the size distribution of large rearrangements, we get $\nu \approx 3/2$. The biological mechanism by which such a distribution is generated should be further studied.

There could be a general argument in favor of the Levy flight model of mutations. In the described experiment, where the population size is controlled, biological evolution can be viewed as an optimization problem. The mean fitness in the population is the cost function. Mutations provide the mechanism for searching the parameter space, and natural selection picks the best representatives in the population. A local search, like the SPMs, could trap mutation trajectories around a local maximum in the fitness landscape. An optimal search algorithm should include large jumps of any size, that is, a scale-free size distribution. The idea is already implemented in computational optimization techniques [13].

Finally, one may ask how mutations a la Levy would introduce correlations into the DNA. Assume, for example, that we start from a DNA with no correlations. Inversions, translocations, and deletions do not increase correlations. On the other hand, coding segments, related to metabolism and other vital functions, would hardly experience significant mutations. Only biased SPMs and insertions in non-coding DNA would lead to correlations on a long-term basis.
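The two-component picture discussed above can be sketched numerically. The following minimal example is an illustration under stated assumptions, not the procedure used to analyze the LTEE data: it draws large-jump sizes from $\pi(l) = (\nu - 1)/l^{\nu}$ by inverse-transform sampling (treating $l$ as continuous), recovers the exponent with a standard maximum-likelihood estimator for a power law with minimum size 1, and generates mutation events along a single lineage. The rates and $\nu = 3/2$ are the rough values quoted in the Discussion; the function names and the random seed are hypothetical.

```python
import math
import random


def sample_jump_size(nu, rng):
    """Draw l >= 1 from pi(l) = (nu - 1) / l**nu by inverse-transform sampling.

    The fraction of sizes larger than l is l**-(nu - 1), so l = u**(-1 / (nu - 1))
    for u uniform in (0, 1].
    """
    u = 1.0 - rng.random()  # rng.random() lies in [0, 1), so u lies in (0, 1]
    return u ** (-1.0 / (nu - 1.0))


def estimate_nu(sizes):
    """Maximum-likelihood estimate of nu for a continuous power law with l_min = 1."""
    return 1.0 + len(sizes) / sum(math.log(size) for size in sizes)


def simulate_lineage(n_gen, p_spm, p_lj, nu, rng):
    """Return a list of (generation, kind, size) events along one cell lineage.

    Each generation may carry a single-point mutation (size 1) with probability
    p_spm and, independently, a large jump with probability p_lj whose size is
    drawn from the scale-free distribution.
    """
    events = []
    for gen in range(n_gen):
        if rng.random() < p_spm:
            events.append((gen, "SPM", 1.0))
        if rng.random() < p_lj:
            events.append((gen, "LJ", sample_jump_size(nu, rng)))
    return events


rng = random.Random(0)  # fixed seed so that the run is reproducible

# Check that the exponent is recovered from a large synthetic sample.
sizes = [sample_jump_size(1.5, rng) for _ in range(20000)]
nu_hat = estimate_nu(sizes)

# Events along one lineage over 50000 generations, with the rough rates
# p_SPM ~ 5e-4 and p_LJ ~ 1.6e-4 per generation quoted in the Discussion.
events = simulate_lineage(50000, 5e-4, 1.6e-4, 1.5, rng)
print(f"estimated nu = {nu_hat:.3f}, number of events = {len(events)}")
```

With $\nu = 3/2$ the sampled sizes span many orders of magnitude, so the occasional very large jump dominates the displacement along the lineage, while the frequent unit-size events provide the local, diffusive component.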

[1] S.V. Buldyrev, A.L. Goldberger, S. Havlin, et al., Phys. Rev. E 51 (1995) 5084.
[2] S.V. Buldyrev, A.L. Goldberger, S. Havlin, et al., Phys. Rev. E 47 (1993) 4514.
[3] R.E. Lenski, Summary data from the long-term evolution experiment, 2016. http://myxo.css.msu.edu/ecoli/summdata.html
[4] J.E. Barrick and R.E. Lenski, Cold Spring Harbor Symposia on Quantitative Biology, Vol. 54 (2009) 1.
[5] C. Raeside, J. Gaffe, D.E. Deatherage, et al., mBio 5 (2014) e01377-14.
[6] R.E. Lenski, C.L. Winkworth, and M.A. Riley, J. Mol. Evol. 56 (2003) 498.
[7] F.R. Blattner, G. Plunkett, C.A. Bloch, et al., Science 277 (1997) 1453-62.
[8] B. Alberts, A. Johnson, J. Lewis, M. Raff, K. Roberts, and P. Walter, Molecular Biology of the Cell, Garland Science, New York, 2002.
[9] A. Einstein, Investigations on the theory of the Brownian movement, Dover, 1956.
[10] M.F. Shlesinger, G. Zaslavsky, and U. Frisch (Eds.), Levy flights and related phenomena in Physics, Lecture Notes in Physics, Vol. 450, Springer, Berlin, 1995.
[11] M.J. Wiser, N. Ribeck, and R.E. Lenski, Science 342 (2013) 1364.
[12] A. Gonzalez, Revista Cubana de Fisica 31 (2014) 71. http://rcf.fisica.uh.cu/index.php/en/2014-07-16-0646-22/vol31-no-2-2014
[13] Chang-Yong Lee and Xin Yao, IEEE Trans. Evol. Comp., Vol. 8, No. 1, pp 1, 2004.

Acknowledgments. The authors acknowledge support from the National Program of Basic Sciences in Cuba, and from the Office of External Activities of the International Center for Theoretical Physics (ICTP).