Constructing phylogenetic trees in individual based models

2 downloads 0 Views 722KB Size Report
Sep 13, 2017 - Carolina L. N. Costaa,∗, David M. Schneiderb, Marlon F. Ramosb, Marcus ...... [70] S. Gourbiere, J. Mallet, Are species real? the shape of the ...
Constructing phylogenetic trees in individual based models

arXiv:1709.04416v1 [q-bio.PE] 13 Sep 2017

Carolina L. N. Costaa,∗, David M. Schneiderb , Marlon F. Ramosb , Marcus A.M. de Aguiarb a Instituto

de Biologia, Universidade Estadual de Campinas, Unicamp, 13083-859, Campinas, SP, Brazil b Instituto de F´ ısica ‘Gleb Wataghin’, Universidade Estadual de Campinas, Unicamp, 13083-859, Campinas, SP, Brazil

Abstract Phylogenetic trees are systematic tools to describe relatedness among species. The inference of biological trees aims to find the best phylogenetic tree that reconstructs the evolution of a group of species. Computer models that simulate the speciation process can track population dynamics and record information about genealogic relationships. In this paper we describe a procedure for the construction of phylogenetic trees in individual based models of evolution and speciation. Keeping track of the parental relationships of all members of the population we set up a matrix containing the times to the most recent common ancestor (MRCAT) between all pairs of individuals. This information is then used in an algorithm that produces the tree. MRCAT matrices display interesting mathematical properties, which we analyze in detail. We illustrated the method with simulations of a spatial model of speciation based on assortative mating. In the model individuals are separated into males and females, so that both maternal and paternal phylogenetic trees can be generated. The resulting trees were compared with trees inferred from a clustering method based on genetic distances of the same simulated individuals. Phylogenies obtained from simulations can help understand how different speciation processes and model hypothesis affect tree properties. Simulated trees can also be used to test inference methods based on genetic or character distances obtained directly from the final population data, mimicking the process performed for real populations, where the complete past history is inaccessible. Keywords: Phylogenetic tree construction, individual based models, speciation process, the most recent common ancestor, spatial model, assortative mating

∗ Corresponding

author. Phone: +55-19-35215466 Email address: [email protected] (Carolina L. N. Costa)

Preprint submitted to Elsevier

September 14, 2017

1. Introduction Phylogenetics is a fundamental tool in evolutionary biology, particularly in the study of speciation [1, 2]. It was Charles Darwin himself who depicted for the first time a phylogenetic tree in ‘On the Origin of Species’, which was curiously the only figure he included in his 502 pages long, seminal work [3]. Nevertheless, rigorous attempts to infer trees from biological data, specially under coalescence theory [4, 5], through mathematical algorithms started only when computers were available, namely around 50 years ago [6, 7], a time in which the field experienced an explosive growth, with the development of softwares only devoted to phylogenetic inference (for some examples, see [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]). Methods for inferring trees from data are divided in two major categories: distance-based methods and character-based methods [19]. By grouping species according to their genetic similarities, distance-based methods comprise UPGMA (Unweighted Paired Group Method with Arithmetic mean) [20], NJ (Neighbor Joining) [21], ME (Minimum Evolution method) [22] and FM (Fitch-Margoliash method) [23]. Character-based methods, on the other hand, produce trees with some optimizing criterion applied to each character present in the target sample. Their main representatives are the MP (Maximum Parsimony method) [24] and the ML (Maximum Likelihood method) [25]. Although there are books fully devoted to describe the details of different methods [26, 6, 7], and the use of the current available software which implements them [8, 9, 13, 11, 14, 17, 18], the problem of selecting a method to build a phylogentic tree is far from a closed issue [27, 28, 29]. With the availability of fast computers it also became possible to test several theoretical models of speciation. Hypothesis such as sympatric versus allopatric processes, sexual selection, assortative mating and the effect of number of genes started to be simulated on computers (for some examples, see [30, 31, 32, 33, 34, 35]). Among the different approaches designed to quantitatively study speciation (see references [36] and [37] for a comprehensive overview of the subject), some models incorporate population or range sizes, making predictions about phylogenetic trees that arise from simulations ([38, 39, 40]). Phylogenetic trees have also been studied with Monte Carlo methods, making use of networks composed by ecological interactions [41]. Individual based models (IBM), in particular, widely used in biology (for a review see [42]), have the advantage to provide a record of the genealogical relationships among individuals and/or the phylogenies of simulated species [38, 43, 44, 45, 46]. These relationships can be stored in a matrix, with the Most Recent Common Ancestor Time (MRCAT) between individuals, which is analogous to matrices employed by distance methods, except that entries represent now the number of generations one needs to go backwards to find a common ancestor for a given pair of extant individuals. This information is generally not available from discretely coded characters, molecular sequences, frequencies, and quantitative traits used by the methods enumerated above [6]. In this article we describe a simple procedure to construct phylogenetic trees 2

from data obtained from simulations of IBM models. The trees thus generated can be compared with trees obtained from biological data in order to validate the hypothesis behind the IBM. More importantly, trees obtained from simulations might offer a practical way to test the reliability of the method chosen to produce the tree from data. In order to illustrate the construction of trees we perform simulations of a model of speciation which was successfully employed in many different scenarios [47, 48, 49, 50, 51]. The framework presented along this paper relies on trees generated with information obtained by simulating the evolutionary history of a population. At the end of the evolutionary process MRCAT matrix contains the exact time to the most recent common ancestor between every pair of the extant population. Moreover, individuals themselves are characterized by all relevant information employed by the numerical model (genetic/phenotypic data, geographical location, etc). Therefore, a particular distance or character-based method can be utilized to infer the phylogeny from population. Comparing trees generated using distance or character-based methods with the tree provided by the MRCAT matrix can be a useful tool to test the suitability of the chosen method for that particular speciation process used in the model. We provide a simple example of this procedure. Although the methodology presented here has some similarities with the coalescence theory [4, 5], they are quite different. While the coalescence method is an inference made from a contemporary sample of a population, and therefore contains uncertainties in the inferred genealogical relationships, the MRCAT method generates the exact genealogic relationships between all individuals of a population over all the evolutionary time, eliminating uncertainties inherent to a statistical approach. Obviously, this is only possible because of the modeling approach presented here. The paper is organized as follows: in section II we describe in detail how information on common ancestry can be stored in MRCAT matrices and discuss their mathematical properties. We also describe an algorithm for constructing phylogenetic trees from the matrix, and used a simple speciation model to illustrate the construction of phylogentic and genealogic trees. Finally, we construct trees using only the genetic information of extant individuals and compare it with trees generated by MRCAT matrices. In section III we shown the resultings trees build upon the speciation model described using MRCAT matrices and using matrices of genetic distance based on a clustering method, in order to compare their differences and similarities. The discussion and conclusions are exposed in section IV. 2. Material and methods 2.1. MRCAT matrices in asexual models Consider a population of Nt asexual individuals at generation t. Individuals can have phenotypic or genetic characteristics that determine their behavior, fitness, motion, etc. These model-dependent features define which individuals will have offspring and how their features will be passed on to them. At the end

3

Individuals at generation t + 1 1 2 3 4 ... Nt+1

Parent at generation t P (1) = 4 P (2) = 8 P (3) = 1 P (4) = 4 ... P (Nt+1 ) = 15

Table 1: List of individuals (i) at generation t + 1 and their respective parents (P (i)) at generation t in an asexual model. This information is necessary to construct the MRCAT matrix. Parents of each individual must be recorded to track the most recent common ancestor between individuals at the end of a simulation. Note that individuals at generation t are not the same individuals at generation t + 1 (discrete generations).

of generation t a list of the new individuals comprising generation t+1, together with a list of their respective parents is produced. An example is shown in Table 1. The parent of individual i in generation t+1 is denoted P (i). In the example in Table 1, P (1) = 4, P (2) = 8, P (3) = 1, etc. The MRCAT between individuals i and j is, therefore, Tt+1 (i, j) = Tt (P (i), P (j)) + 1.

(1)

which is simply the time to the most recent common ancestor between the parents plus one, since a generation has passed [31]. As examples Tt+1 (1, 2) = Tt (4, 8) + 1 and Tt+1 (1, 4) = Tt (4, 4) + 1 = 1. since in this last case they have the same parent. Starting from T = 0 and noting that Tt (i, i) = 0 at all times the rule (1) allows one to compute the MRCAT matrix for any number of generations. The matrix T is stored only for two times, the present and the next generation, so that the memory cost does not depend on time, only on the (square) size of population. 2.2. MRCAT matrices in sexual models The construction of MRCAT matrices in sexual models is slightly different, since each individual i has two parents, a mother P1 (i) and a father P2 (i). Consider as an example a population which has 4 females and 3 males in generation t and gives rise to 5 females and 3 males in generation t + 1 (Table 2). Notice that not only the total number of individuals but also the number of males and females may vary over generations. As the model is sexual, both maternal and paternal lineages can be followed in the simulations, allowing the construction of two different MRCAT matrices and trees. A third option is to ignore the separation of sex and compute the most recent common ancestral taking into account 4

Individuals at generation t + 1 Females 1 2 3 4 5 Males 6 7 8

Mother at generation t

Father at generation t

P1 (1) = 4 P1 (2) = 3 P1 (3) = 1 P1 (4) = 4 P1 (5) = 2

P2 (1) = 6 P2 (2) = 7 P2 (3) = 7 P2 (4) = 5 P2 (5) = 6

P1 (6) = 1 P1 (7) = 3 P1 (8) = 3

P2 (6) = 5 P2 (7) = 5 P2 (8) = 7

Table 2: List of individuals (i) at generation t+1 and their respective parents (P1 (i) = mother and P2 (i) = f ather) at generation t in a sexual model. In this case each individual has two parents, P1 and P2 . Notice that the couple 3 and 7 at generation t had two offspring, the individuals 2 and 8 at generation t+1, while other couples had only one offspring. Additionally, notice that there were 4 females and 3 males at generation t, while there are 5 females and 3 males at generation t + 1.

both parents, which is the only option if the model considers hermaphroditic individuals. 2.2.1. Maternal and paternal lineages The maternal lineage of individuals is obtained by computing the time to the most recent common ancestor of their corresponding mothers: M Tt+1 (i, j) = TtM (P1 (i), P1 (j)) + 1

(2)

with T0M (i, j) = 0 and TtM (i, i) = 0. Similarly, the paternal lineage is computed with F Tt+1 (i, j) = TtF (P2 (i), P2 (j)) + 1 (3) with T0F (i, j) = 0 and TtF (i, i) = 0. Both T M and T F are computed for all individuals, females and males. 2.2.2. Hermaphroditic simulations Many simulations consider, for simplicity, hermaphroditic individuals. In this case, the separation into maternal and paternal lineages does not make sense and the definition of the MRCAT matrix is Tt+1 (i, j) = min{Tt (P1 (i), P1 (j)), Tt (P1 (i), P2 (j)), Tt (P2 (i), P1 (j)), Tt (P2 (i), P2 (j))}+1 (4) with T0 (i, j) = 0 and Tt (i, i) = 0. This considers, literally, the most recent common ancestor of i and j, taking all possibilities into account. The same definition is applied to sexual models with sex separation when the constructed genealogy does not separate the maternal and paternal lineages (a “general” genealogy). In this case the MRCAT matrix does not determine the tree uniquely. 5

A detailed example of this situation is described in Supporting Information, section I. 2.2.3. Genealogies and Phylogenies So far we have only considered the construction of genealogical trees for individuals of a population. If the process of evolution leads to the formation of species, the same method can be used to draw the corresponding phylogenetic trees connecting the species. If NS species exist at time t and ind(i, j) is the j-th individual of the i-th species, one needs to construct sub-matrices of size NS × NS of the full MRCAT matrix considering only one individual per species. A simple choice is to take ind(i, 1) for i = 1, 2, . . . NS and calculate phy = Tind(1,i),ind(1,j) . The shape of the phylogenetic tree is independent of Ti,j the particular choice of individuals, although the height of the branches might. It should be noted that the time to the most recent common ancestor between individuals of different species is only an approximation to the speciation time, since speciation can happen several generations later. 2.3. Constructing the tree from the MRCAT matrix At the end of the evolutionary process the MRCAT matrix contains the time to the most recent common ancestor between every pair of the extant population and this information can be used to generate genealogical or phylogenetic trees. In this section we describe how this can be done. A tree is a graph in which external nodes (or tips) represent extant species (phylogeny) or individuals (genealogy) and are called leaves. Internal nodes are the most common ancestors between a pair of species or individuals and branches reflect the time between an ancestor and its descendents (figure 1). Building a tree requires the joining of individuals into groups, from the leaves all the way to the root. Three different types of grouping can occur in this process, as illustrated in figure 2: (1) a pair of individuals can be joined together (a structure called cherry); (2) an individual can join by right (2a) or by left (2b) an existing group or; (3) two groups can be joined together (3). At each step of a tree construction one of these processes occur, always starting with the pair of individuals having the most recent common ancestor, which is necessarily of type (1).   0 3 6 6 5 7  3 0 6 6 5 7     6 6 0 4 6 7    T = (5)   6 6 4 0 6 7   5 5 6 6 0 7  7 7 7 7 7 0 Consider the tree illustrated in figure 1 and the corresponding MRCAT matrix given by Eq. (5). In this example, individuals A and B share a common ancestor only 3 generations ago and they form the first group, g1 = {A, B}. This information can be extracted from the MRCAT matrix, in which TAB = 3 6

8

ro o t

7

tim e to c o m m o n a n c e s to r

6 5 n o d e s

4 3 2 le a v e s

1 0 C

D

A

B

E

F

Figure 1: Example of a genealogical tree with six individuals, represented by filled circles as the leaves of the tree. Open circles denote the nodes, or the most recent common ancestors between each pair of individuals, and the topmost node is the tree root, or the most recent common ancestor among all six individuals. Letters in red below the leaves are individual labels.

2 a

2 b

3

tim e to c o m m o n a n c e s to r

1

Figure 2: The three possible types of grouping that occurs in the process of tree construction. (1) A pair of individuals joined together from the base of the tree, also called a cherry. (2) An individual joined by right (2a) or by left (2b) to an previous existing group from their most recent common ancestors. (3) Two previous groups joined together from their most recent common ancestors.

7

is the smallest entry. The second group is formed by individuals C and D (g2 = {C, D}), corresponding to TCD = 4, the second smallest entry in the MRCAT matrix. The next smallest time, TBE = 5, involves an individual already grouped in a previous step (individual B, belonging to g1 ), so the grouping in this case is of type (2a) or (2b), because individual (E) will join a pre-existing group (g1 ). Accordingly, the new group will be formed by the juxtaposition of this previous group and the new individual (g3 = {g1 , E} = {{A, B}, E}). Continuing this way, we find next TDE = 6, a situation in which both individuals already belong to a previously formed group. In this case the new group is formed by joining directly these two groups (g4 = {{g2 }, {g3 }} = {C, D, A, B, E}). Finally, the highest entry in the matrix corresponds to TEF = 7, in which the grouping occurs according to description (2a) or (2b), namely by the direct juxtaposition of group g4 and individual F . This is the last group formed, containing all the six individuals: g5 = {{g4 }, F } = {C, D, A, B, E, F }. This group gives precisely the order of individuals in the base, as depicted in figure 1. To each group there are three lines associated in the tree, that can be sketched in three different ways, depending on whether both branches start at the bottom of the tree (1), one of them start at a higher location (2a and 2b), or both start from higher positions (3), as shown in figure 2. To define the positions of nodes representing the most recent common ancestor of each group we have to calculate their average positions in the x-axis. The x-axis can represent different unities of measure like genetic or phenotypic distance between individuals. In figure 1, the individuals are separated by one unit in x-axis for simplicity. In the simplest case, with unitary distances, to define the position of the most recent common ancestor in a cherry like g1 , we calculate the average between the x-position of individual C (position 1) and D (position 2), resulting in a x-position of 1.5. In g3 , a less simple case, the average among the x-position of individuals A (position 3), B (position 4), E (position 5) is 4.0. The same scheme is applied for the other groups, until we find the position of the most recent common ancestor of all individuals. To define the x-positions of all individuals based on genetic or phenotypic distances, for example, we calculate the genetic/phenotypic distance between each subsequent pair of individuals given the order determined in the last group (g5 ). The details of the algorithm and the calculation of individual x-positions can be seen in Supporting Information, section II. 2.4. Properties of MRCAT matrices Two basic properties of any MRCAT matrix are (i) the diagonal is zero and (ii) the matrix is symmetric. 2.4.1. The N-1 independent times MRCAT matrices corresponding to asexual reproduction or maternal and paternal matrices in sexual populations have at most N −1 independent positive entries plus the entries 0 at the main diagonal. In Eq.(5) where N = 6, for example, only the 5 numbers 3, 4, 5, 6 and 7 appear. If the time to the most

8

recent common ancestral between individuals C and D were also 3 the matrix would have only 4 different entries, 3, 5, 6 and 7. However, in any of these matrices there is never more than N − 1 different positive numbers. To prove this statement we shall consider only the case of asexual reproduction, since the proof for maternal or paternal lineages is very similar. Suppose that at generation t − 1 Tt−1 has kt−1 ≤ Nt−1 − 1 different positive entries a1 , a2 , ..., akt−1 plus the entries 0 at the main diagonal. Then, according to the update rule given by Eq. (1) the values of Tt (i, j) = Tt−1 (P (i), P (j)) + 1 will assume the values ai + 1 (i = 1, 2, . . . kt−1 ) only if every individual of generation t − 1 had exactly one offspring. In this case Nt = Nt−1 and Tt would have the same number of independent entries as Tt−1 , i.e., kt = kt−1 ≤ Nt − 1. If, however, D individuals did not reproduce and M individuals had more than one offspring, then  kt−1 − D + 1 ≤ Nt−1 − D if M 6= 0 kt = (6) kt−1 − D ≤ Nt−1 − D − 1 if M =0 since multiple offspring only contribute with Tt (i, j) = 1 whenever i and j shared the same parent. The number of individuals, on the other hand, is Nt = Nt−1 − D + (Mo − M ), where Mo is the total number of offspring of all the M individuals that had more then one offspring. If M 6= 0 then Mo − M ≥ 1 and Nt ≥ Nt−1 − D + 1. If M = M0 = 0 then Nt = Nt−1 − D. Therefore  Nt−1 − D ≤ Nt − 1 if M 6= 0 kt ≤ (7) Nt−1 − D − 1 ≤ Nt − 1 if M =0 which implies kt < Nt − 1 in both cases. Since k0 = 0 the relation holds for any t. For hermaphroditic populations or general genealogies the above proof is not valid, since the update rule in this case is more complicated, given by Eq.(4), and there might indeed be more than N − 1 independent times, as discussed in Supporting Information, section I. Since there are exactly N − 1 internal nodes in the tree and since each node indicates a time to the most recent common ancestor between a pair of individuals, the independent times correspond exactly to the times indicated by the nodes. Notice that these times are the times to the most recent common ancestor between adjacent neighbors in the base of the tree. In Eq.(5) and Fig. 1, for example, TC,D = 4, TD,A = 6, TA,B = 3, TB,E = 5 and TE,F = 7. 2.4.2. Canonical form of MRCAT For asexual reproduction or maternal and paternal matrices in sexual populations all the information of a MRCAT matrix is contained in the N − 1 independent times and all remaining entries of the matrix can be obtained from them. In order to do that it is important to rearrange the matrix by re-labeling individuals by the order they appear in the base of the tree. In the example

9

described in figure 1 this corresponds to rename C→1 D→2 A→3 B→4 E→5 F → 6. In this new order the N − 1 independent times are given by Tk,k+1 for k = 1, . . . , N − 1, corresponding to the first diagonal of the matrix (right above the main diagonal). In this order the matrix is in the canonical form and all other elements can be obtained immediately. The time to common ancestor between second neighbors, individuals k and k + 2 is the maximum between Tk,k+1 and Tk+1,k+2 : Tk,k+2 = max{Tk,k+1 , Tk+1,k+2 } (8) for k = 1, . . . , N − 2. Analogously, Tk,k+3 = max{Tk,k+2 , Tk+1,k+3 }

(9)

for k = 1, . . . , N − 3 and so on. In general Tk,k+m = max{Tk,k+m−1 , Tk+1,k+m } for m = 2, . . . , N − 1 and k = 1, . . . , N For the matrix (5) we obtain  0 4  4 0   6 6 T =  6 6   6 6 7 7

(10)

−m . 6 6 0 3 5 7

6 6 3 0 5 7

6 6 5 5 0 7

7 7 7 7 7 0

       

(11)

with the independent times displayed in bold face. 2.5. Building trees in a IBM of speciation As an example of the process of constructing genealogical and phylogenetic trees we consider an adaptation of the speciation model introduced in [47] for hermaphroditic individuals and extended in [49] to include sex separation into males and females. The model describes a population of N individuals randomly placed in a L × L spatial lattice. Each individual has a genome represented by a sequence of B binary variables where each locus plays the role of an independent biallelic gene. Individuals also carry one separate label that specify its sex (0 for male and 1 for female). Genetic distances between individuals are calculated as the Hamming distance [52] between their genetic sequences, i.e., the number of positions (loci) at which the corresponding alleles are different. Individuals 10

whose genetic distance is larger than G (a parameter of the model) are considered incompatible and do not reproduce (assortative mating). If the distance between a male and a female is less then G the pair is considered compatible and can generate offspring. Individuals also have a mating neighborhood, a circular spatial area of radius S centered on themselves, from where they can choose a mate. Evolution proceeds in discrete generations such that the entire population is replaced by offspring. Each individual of the current generation has a probability Q of dying without reproducing. In this case a neighbor individual, randomly selected from its mating neighborhood, is chosen to reproduce in its place. When the individual does reproduce, which happens with probability (1 − Q), it chooses a compatible mate of opposite sex within its mating neighborhood and an offspring is produced by combining the parental genomes with a single crossover at a random point. At this moment, each locus in the offspring genome has a probability µ of mutate. Finally, the offspring replaces the reproducing individual. The initial population has identical individuals with the B genes set to 0 and the sex label set to 0 or 1 with equal probability. Species are defined as groups of individuals connected by gene flow, so that any pair of individuals belonging to different species are incompatible. However, two individuals belonging to the same species can also be incompatible, as long as they can exchange genes indirectly through other conspecifics. We did simulations based on this speciation model and constructed genealogic and phylogenetic trees with the algorithm described in subsections 2.2 and 2.3. Genealogies contained only 20 individuals, randomly sampled from the population (the pool of individuals considering all species) to facilitate the visualization of tree details. The phylogeny was built with one individual representing each species. 2.6. Comparing true phylogenies with genetic distance trees We have shown in this paper that keeping track of all parental relationships in IBM simulations allows the reconstruction of true phylogenies and genealogies. The word ‘true’ here means that we have recorded the full history of the population and, given a criterion to distinguish between different species, we have the exact branching points in the phylogenetic tree. Here we illustrate the use of true trees as a means to test specific inference methods. Since our model uses genetic distances as a measure for assortative mating and for reproductive isolation, we shall use genetic distances as a proxy for ancestry, such that the larger the genetic distance between two individuals the farther back is their common ancestor. Therefore, selecting the same individuals per species of the phylogeny and the same 20 individuals sampled for the genealogies of subsection 2.5, we constructed matrices of genetic distances and trees based on the clustering method introduced by [53], also known as the UPGMA or average-linked method [6]. The resulting genealogic and phylogenetic trees were compared with trees obtained by actual MRCAT matrices, generated by the algorithm explained in subsection 2.2. The same clustering method can be used to build trees with the MRCAT matrix in the case of hermaphrodite or general genealogies (2.3 and Supporting Information, section II). 11

3. Results Figure 3 shows the result of one simulation based on the speciation model described in 2.5 and figures 4 and 5 show the corresponding trees obtained with the algorithm described in this paper. The population was evolved for only 500 generations and 7 species have branched off the original population (shown in black). The parameters were N = 2600, L = 128, B = 125, G = 20, S = 6, Q = 0.05, µ = 0.001. Figure 3 shows the spatial distribution of the population where each square is an individual and colors represent different species. Phylogeny (figure 4A) and maternal/paternal lineages (matrilineal and patrilineal genealogies) (figure 5A,B) were constructed according to steps described in sections 2.2 and 2.3. To build the genealogies we considered only 20 individuals, randomly sampled from the population (the pool of individuals considering all species) to facilitate the visualization of details in each tree. Because the number of individuals is small only three species are represented in the genealogies. Genealogies with a larger sample of population, in which all species are represented, can be seen in Supporting Information, section III. Matrilineal and patrilineal genealogies are different because they were built from different matrices. In the first case, the MRCAT matrix contains the time to the most recent common female ancestor between each pair of individuals, while in the second case the MRCAT matrix has the time to the most recent common male ancestor between the same individuals, which leads to different times and genealogical histories. In addition, if we considered the general genealogy (figure 5C), taking the most recent common ancestral among females and males, i.e., disregarding sex, the resulting MRCAT matrix does not uniquely specifies the genealogy. Two examples of different genealogical trees obtained with the same matrix are shown in Supporting Information, section III, and are compared with the tree generated by the clustering method. Trees generated by matrices of genetic distances based on the clustering method are displayed in figures 4 (right, phylogeny) and 5D (genealogy), and are compared with trees constructed by MRCAT matrix. As can be seen, trees are indeed very similar but not identical to true trees. One of the reasons for the mismatch is the fact the mutations can be reversed, partially destroying the correlation between genetic distance and evolutionary time [6]. 4. Discussion Understanding all the mechanisms that promote speciation is still an open problem in evolutionary biology [34, 37]. Even more challenging is to identify which of these mechanisms were important in a particular case. A large number of mathematical and computational models were developed in the past years to test hypothesis on speciation processes, including a variety of geographic modes (sympatric [54, 36, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64], allopatric [36, 55, 65, 66, 67, 68, 69, 70, 71], parapatric [33, 72, 73, 74], continuous populations or island models) and forms of selection (sexual [75, 61, 68, 63], disruptive [64], ecological [76, 77, 64], neutral[78, 33, 79, 47, 68, 80, 81, 82, 48, 49, 50]). 12

1 5

5

1 9

1 6

3 1 4

1 1 3

6 1 1

1 7

7 8 4

9 1 0

2

2 0 1 8 1 2

Figure 3: Spatial distribution of individuals from one simulation based on the model described in section 2.5. Individuals are represented by squares, and each color represents a different species. Numbered circles represent individuals sampled to construct genealogies showed in figure 5. Parameter values: Number of generations t = 500, N = 2600, L = 128, B = 125, G = 20, S = 6, Q = 0.05, µ = 0.001.

Figure 4: Phylogenies built from the same simulation illustrated in figure 3. Left: Phylogeny built from the MRCAT matrix at the end of simulation. Right: Phylogeny built from a genetic distance matrix calculated from the genetic sequences of one representative per species. Differences in their histories are visible, i.e., the patterns of descent are distinct. The xaxis represents the genetic distance between subsequent pairs of contemporary individuals (individuals at the end of the simulation), and were calculated as described in Supporting Information, section II (algorithm Build Tree). The y-axis represents the time (left) and the genetic distance (right) between the most recent common ancestors of these species.

13

Figure 5: Genealogies obtained by the MRCAT matrix (A, B, C) and by measuring genetic distances (D). A and B represent matrilineal and patrilineal genealogies for 20 individuals randomly sampled from population. C) One of the possible general genealogies built from MRCAT matrix of the same 20 individuals. D) Genealogy built from the genetic distance matrix of the same individuals. The x-axis represents the genetic distance between subsequent pairs of contemporary individuals and the y-axis represents the time (A, B, C) and the genetic distance (D) between the most recent common ancestors of these individuals.

14

The results of models, however, can only seldom be compared with real data [83, 84, 85, 86, 87, 88, 89]. In these cases comparisons often include qualitative abundance and spatial distributions and in some cases genetic or phenotypic distances. Nevertheless, very little attention has been given to the resulting phylogenies or genealogical trees. Evolutionary trees are essential tools to understand the organization of life. They reveal how species are related to each other and the times between speciation events. The topological structure of trees also contains clues about processes originating a particular group of species, however that information is often hard to extract [1]. In this paper we have described a simple procedure to construct genealogies and phylogenies in a general class of IBM models. We have discussed the properties of MRCAT matrices and presented an algorithm to construct paternal, maternal and general (also for hermaphroditic models) trees from the matrix. We have also presented an example of a spatial IBM where individuals are separated into males and females and sexual reproduction is restricted by genetic similarity and spatial proximity. We have constructed the maternal, paternal and general trees, highlighting their differences. The qualitative properties of the trees thus obtained could in principle be compared to specific trees to test the validity of the model hypothesis. Different model hypothesis and parameter regimes can lead to quite different tree topologies, which can be used as signatures of the processes being modeled. Similar ideas have been explored in the context of DNA sequences [90] and linguistics [91]. These features, not explored in this article, will be the subject of a future publication. Finally, we stress that the trees generated by the procedure described here are ‘exact’, and not probabilistic, trees, since the evolution of the population is followed step by step, keeping track of all parental relationships among individuals. These trees can be compared with those obtained directly from the final population data, mimicking the process performed for real populations, where the complete past history is inaccessible and coalescent methods are generally employed. This can help to test and compare the different methods used by empiricists. As a simple example we have reconstructed the genealogical tree of a sample of individuals computing only their genetic distances. The resulting tree is very similar to the exact tree, but it is not identical. The ability to save the entire history in IBM models is, therefore, a valuable tool to advance in the pursuit of the relationship between speciation processes and tree topologies. Acknowledgments This research was supported by Sao Paulo Research Foundation (FAPESP), National Council for Scientific and Technological Development (CNPq), and Coordination for the Improvement of Higher Education Personnel (CAPES). We thank P. L. Costa for his constructive comments about the manuscript, and A. B. Martins and L. D. Fernandes, who provided expertise that greatly assisted the research, all contributing with their insights and comments about the research results. 15

Data Acessibility - Python script to construct trees: uploaded as online Supporting Information. Conflicts of interest: none. References [1] T. G. Barraclough, S. Nee, Phylogenetics and speciation, Trends in Ecology & Evolution 16 (7) (2001) 391–399. [2] Z. Yang, B. Rannala, Molecular phylogenetics: principles and practice, Nature Reviews Genetics 13 (5) (2012) 303–314. [3] C. Darwin, On the origin of species by means of natural selection, London: John Murray. [4] J. F. C. Kingman, The coalescent, Stochastic processes and their applications 13 (3) (1982) 235–248. [5] J. F. Kingman, On the genealogy of large populations, Journal of Applied Probability 19 (A) (1982) 27–43. [6] J. Felsenstein, Inferring phylogenies, Vol. 2, Sinauer associates Sunderland, 2004. [7] O. Gascuel, Mathematics of evolution and phylogeny, OUP Oxford, 2005. [8] J. Felsenstein, Phylip - phylogeny inference package (version 3.2), Cladistics 5 (1989) 164–166. [9] M. Clement, D. Posada, K. A. Crandall, Tcs: a computer program to estimate gene genealogies, Molecular ecology 9 (10) (2000) 1657–1659. [10] J. P. Huelsenbeck, F. Ronquist, R. Nielsen, J. P. Bollback, Bayesian inference of phylogeny and its impact on evolutionary biology, science 294 (5550) (2001) 2310–2314. [11] E. Paradis, J. Claude, K. Strimmer, Ape: analyses of phylogenetics and evolution in r language, Bioinformatics 20 (2) (2004) 289–290. [12] B. G. Hall, Phylogenetic trees made easy: a how-to manual, Vol. 547, Sinauer Associates Sunderland, MA, 2004. [13] S. Kumar, M. Nei, J. Dudley, K. Tamura, Mega: a biologist-centric software for evolutionary analysis of dna and protein sequences, Briefings in bioinformatics 9 (4) (2008) 299–306. [14] K. P. Schliep, phangorn: phylogenetic analysis in r, Bioinformatics 27 (4) (2011) 592–593.

16

[15] H.-J. Yu, D.-S. Huang, Novel graphical representation of genome sequence and its applications in similarity analysis, Physica A: Statistical Mechanics and its Applications 391 (23) (2012) 6128 – 6136. [16] C. Li, Y. Yang, M. Jia, Y. Zhang, X. Yu, C. Wang, Phylogenetic analysis of {DNA} sequences based on -word and rough set theory, Physica A: Statistical Mechanics and its Applications 398 (2014) 162 – 171. [17] E. Paradis, coalescentMCMC: MCMC Algorithms for the Coalescent, r package version 0.4-1 (2015). URL https://CRAN.R-project.org/package=coalescentMCMC [18] C. Zambrano-Vega, A. J. Nebro, J. F. Aldana-Montes, Mo-phylogenetics: a phylogenetic inference software tool with multi-objective evolutionary metaheuristics, Methods in Ecology and Evolution 7 (2016) 800–805. [19] S. S. Roy, R. Dasgupta, A. Bagchi, et al., A review on phylogenetic analysis: A journey through modern era, Computational Molecular Bioscience 4 (03) (2014) 39. [20] F. Murtagh, Complexities of hierarchic clustering algorithms: State of the art, Computational Statistics Quarterly 1 (2) (1984) 101–113. [21] N. Saitou, M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees., Molecular biology and evolution 4 (4) (1987) 406–425. [22] A. Rzhetsky, M. Nei, Theoretical foundation of the minimum-evolution method of phylogenetic inference., Molecular biology and evolution 10 (5) (1993) 1073–1095. [23] W. M. Fitch, E. Margoliash, Construction of phylogenetic trees, Science 155 (3760) (1967) 279–284. [24] E. Sober, Parsimony in systematics: philosophical issues, Annual Review of Ecology and Systematics 14 (1) (1983) 335–357. [25] J. Felsenstein, Evolutionary trees from dna sequences: a maximum likelihood approach, Journal of molecular evolution 17 (6) (1981) 368–376. [26] M. Salemi, A.-M. Vandamme, The phylogenetic handbook: a practical approach to DNA and protein phylogeny, Cambridge University Press, 2003. [27] L. Brocchieri, Phylogenetic inferences from molecular sequences: review and critique, Theoretical population biology 59 (1) (2001) 27–40. [28] S. Holmes, Statistics for phylogenetic trees, Theoretical population biology 63 (1) (2003) 17–32. [29] S. Holmes, Statistical approach to tests involving phylogenies, Mathematics of Evolution and Phylogeny (2005) 91–120. 17

[30] P. G. Higgs, B. Derrida, Stochastic models for species formation in evolving populations, Journal of Physics A: Mathematical and General 24 (17) (1991) L985. [31] P. G. Higgs, B. Derrida, Genetic distance and species formation in evolving populations, Journal of molecular evolution 35 (5) (1992) 454–465. [32] F. Manzo, L. Peliti, Geographic speciation in the derrida-higgs model of species formation, Journal of Physics A: Mathematical and General 27 (21) (1994) 7079. [33] S. Gavrilets, H. Li, M. D. Vose, Patterns of parapatric speciation, Evolution 54 (4) (2000) 1126–1134. [34] M. Kirkpatrick, V. Ravign´e, Speciation by natural and sexual selection: models and experiments, The American Naturalist 159 (S3) (2002) S22– S35. [35] D. M. King, A. D. Scott, S. Bahar, Multiple phase transitions in an agentbased evolutionary model with neutral fitness, Royal Society Open Science 4 (4). doi:10.1098/rsos.170005. [36] S. Gavrilets, Perspective: models of speciation: what have we learned in 40 years?, Evolution 57 (10) (2003) 2197–2215. [37] S. Gavrilets, Models of speciation: Where are we now?, Journal of heredity 105 (S1) (2014) 743–755. [38] M. A. McPeek, The ecological dynamics of clade diversification and community assembly, The American Naturalist 172 (6) (2008) E270–E284. [39] S. P. Hubbell, The Unified Neutral Theory of Biodiversity and Biogeography, Princeton University Press, Princeton, NJ, 2001. [40] A. L. Pigot, A. B. Phillimore, I. P. Owens, C. D. L. Orme, The shape and temporal dynamics of phylogenetic trees arising from geographic speciation, Systematic biology (2010) syq058. [41] C. P. Cruz, C. R. Fonseca, G. Corso, Ecological interaction and phylogeny, studying functionality on composed networks, Physica A: Statistical Mechanics and its Applications 391 (3) (2012) 673 – 679. doi:https: //doi.org/10.1016/j.physa.2011.08.028. [42] D. L. DeAngelis, V. Grimm, Individual-based models in ecology after four decades, F1000Prime Rep 6 (39) (2014) 6. ´ [43] T. J. Davies, A. P. Allen, L. Borda-de Agua, J. Regetz, C. J. Meli´an, Neutral biodiversity theory can explain the imbalance of phylogenetic trees but not the tempo of their diversification, Evolution 65 (7) (2011) 1841– 1850.

18

[44] M. Manceau, A. Lambert, H. Morlon, Phylogenies support out-ofequilibrium models of biodiversity, Ecology letters 18 (4) (2015) 347–356. [45] F. Gascuel, R. Ferri`ere, R. Aguil´ee, A. Lambert, How ecology and landscape dynamics shape phylogenetic trees, Systematic biology 64 (4) (2015) 590– 607. [46] O. Missa, C. Dytham, H. Morlon, Understanding how biodiversity unfolds through time under neutral theory, Phil. Trans. R. Soc. B 371 (1691) (2016) 20150226. [47] M. A. M. De Aguiar, M. Baranger, E. Baptestini, L. Kaufman, Y. BarYam, Global patterns of speciation and diversity, Nature 460 (7253) (2009) 384–387. [48] E. M. Baptestini, M. A. de Aguiar, Y. Bar-Yam, Conditions for neutral speciation via isolation by distance, Journal of theoretical biology 335 (2013) 51–56. [49] E. M. Baptestini, M. A. de Aguiar, Y. Bar-Yam, The role of sex separation in neutral speciation, Theoretical ecology 6 (2) (2013) 213–223. [50] A. B. Martins, M. A. de Aguiar, Y. Bar-Yam, Evolution and stability of ring species, Proceedings of the National Academy of Sciences 110 (13) (2013) 5080–5084. [51] A. Martins, M. A. M. de Aguiar, Barriers to gene flow and ring species formation, Evolution 71 (2) (2017) 442–448. [52] R. W. Hamming, Error detecting and error correcting codes, Bell Labs Technical Journal 29 (2) (1950) 147–160. [53] R. R. Sokal, A statistical method for evaluating systematic relationships, University of Kansas Science Bulletin 38 (1958) 1409–1438. [54] G. S. van Doorn, F. J. Weissing, Ecological versus sexual selection models of sympatric speciation: a synthesis, Selection 2 (1-2) (2002) 17–40. [55] S. Gavrilets, Fitness Landscapes and the Origin of Species, Princeton University Press, Princeton, NJ, 2004. [56] U. Dieckmann, M. Doebeli, J. Metz, D. Tautz, Adaptive speciation, Cambridge University Press, 2004. [57] S. Gavrilets, adaptive speciationit is not that easy: a reply to doebeli et al, Evolution 59 (3) (2005) 696–699. [58] R. B¨ urger, K. A. Schneider, M. Willensdorfer, S. Otto, The conditions for speciation through intraspecific competition, Evolution 60 (11) (2006) 2185–2206.

19

[59] R. B¨ urger, K. A. Schneider, Intraspecific competitive divergence and convergence under assortative mating, The American Naturalist 167 (2) (2006) 190–205. [60] P. S. Pennings, M. Kopp, G. Mesz´ena, U. Dieckmann, J. Hermisson, An analytically tractable model for competitive speciation, The American Naturalist 171 (1) (2007) E44–E71. [61] G. S. van Doorn, P. Edelaar, F. J. Weissing, On the origin of species by natural and sexual selection, Science 326 (5960) (2009) 1704–1707. [62] N. Barton, What role does natural selection play in speciation?, Philosophical Transactions of the Royal Society B: Biological Sciences 365 (1547) (2010) 1825–1840. [63] L. K. MGonigle, R. Mazzucco, S. P. Otto, U. Dieckmann, Sexual selection enables long-term coexistence despite ecological equivalence, Nature 484 (7395) (2012) 506–509. [64] A. Rettelbach, M. Kopp, U. Dieckmann, J. Hermisson, Three modes of adaptive speciation in spatially structured populations, The American Naturalist 182 (6) (2013) E215–E234. [65] M. Shpak, T. Day, The role of deleterious mutations in allopatric speciation, Evolution 59 (7) (2005) 1389–1399. [66] N. A. Johnson, A. H. Porter, Evolution of branched regulatory genetic pathways: directional selection on pleiotropic loci accelerates developmental system drift, Genetica 129 (1) (2007) 57–70. [67] N. H. Barton, M. A. R. De Cara, The evolution of strong reproductive isolation, Evolution 63 (5) (2009) 1171–1190. [68] J. C. Uyeda, S. J. Arnold, P. A. Hohenlohe, L. S. Mead, Drift promotes speciation by sexual selection, Evolution 63 (3) (2009) 583–594. [69] J. L. Fierst, T. F. Hansen, Genetic architecture and postzygotic reproductive isolation: evolution of bateson–dobzhansky–muller incompatibilities in a polygenic model, Evolution 64 (3) (2010) 675–693. [70] S. Gourbiere, J. Mallet, Are species real? the shape of the species boundary with exponential failure, reinforcement, and the missing snowball, Evolution 64 (1) (2010) 1–24. [71] C. Fra¨ısse, J. Elderfield, J. Welch, The genetics of speciation: are complex incompatibilities easier to evolve?, Journal of evolutionary biology 27 (4) (2014) 688–699. [72] S. Gavrilets, Waiting time to parapatric speciation, Proceedings of the Royal Society of London B: Biological Sciences 267 (1461) (2000) 2483– 2492. 20

[73] C. Bank, R. B¨ urger, J. Hermisson, The limits to parapatric speciation: Dobzhansky–muller incompatibilities in a continent–island model, Genetics 191 (3) (2012) 845–863. [74] R. Yamaguchi, Y. Iwasa, First passage time to allopatric speciation, Interface focus 3 (6) (2013) 20130026. [75] G. F. Turner, M. T. Burrows, A model of sympatric speciation by sexual selection, Proceedings of the Royal Society of London B: Biological Sciences 260 (1359) (1995) 287–292. [76] D. Schluter, The ecology of adaptive radiation, OUP Oxford, 2000. [77] P. Nosil, Ecological speciation, Oxford University Press, 2012. [78] S. Gavrilets, L. Hai, M. D. Vose, Rapid parapatric speciation on holey adaptive landscapes, Proceedings of the Royal Society of London B: Biological Sciences 265 (1405) (1998) 1483–1489. [79] G. A. Hoelzer, R. Drewes, J. Meier, R. Doursat, Isolation-by-distance and outbreeding depression are sufficient to drive parapatric speciation in the absence of environmental influences, PLoS Comput Biol 4 (7) (2008) e1000126. [80] P. Desjardins-Proulx, D. Gravel, How likely is speciation in neutral ecology?, The American Naturalist 179 (1) (2011) 137–144. [81] C. J. Meli´ an, D. Alonso, S. Allesina, R. S. Condit, R. S. Etienne, Does sex speed up evolutionary rate and increase biodiversity?, PLoS Comput Biol 8 (3) (2012) e1002414. [82] P. Desjardins-Proulx, D. Gravel, A complex speciation–richness relationship in a simple neutral model, Ecology and evolution 2 (8) (2012) 1781– 1790. [83] R. Lande, O. Seehausen, J. J. Van Alphen, Mechanisms of rapid sympatric speciation by sex reversal and sexual selection in cichlid fish, Genetica 112 (1) (2001) 435–443. [84] S. Gavrilets, A. Vose, M. Barluenga, W. Salzburger, A. Meyer, Case studies and mathematical models of ecological speciation. 1. cichlids in a crater lake, Molecular Ecology 16 (14) (2007) 2893–2909. [85] M. Kawata, A. Shoji, S. Kawamura, O. Seehausen, A genetically explicit model of speciation by sensory drive within a continuous population in aquatic environments, BMC evolutionary biology 7 (1) (2007) 99. [86] E. A. Duenez-Guzman, J. Mav´arez, M. D. Vose, S. Gavrilets, Case studies and mathematical models of ecological speciation. 4. hybrid speciation in butterflies in a jungle, Evolution 63 (10) (2009) 2611–2626.

21

[87] S. Sadedin, J. Hollander, M. Panova, K. Johannesson, S. Gavrilets, Case studies and mathematical models of ecological speciation. 3: Ecotype formation in a swedish snail, Molecular Ecology 18 (19) (2009) 4006–4023. [88] R. Aguil´ee, A. Lambert, D. Claessen, Ecological speciation in dynamic landscapes, Journal of evolutionary biology 24 (12) (2011) 2663–2677. [89] M. Yamamichi, A. Sasaki, Single-gene speciation with pleiotropy: effects of allele dominance, population size, and delayed inheritance, Evolution 67 (7) (2013) 2011–2023. [90] S. Moller, H. Hameister, M.-T. Htt, A genome signature derived from the interplay of word frequencies and symbol correlations, Physica A: Statistical Mechanics and its Applications 414 (2014) 216 – 226. [91] S. Wichmann, E. W. Holman, D. Bakker, C. H. Brown, Evaluating linguistic distance measures, Physica A: Statistical Mechanics and its Applications 389 (17) (2010) 3632 – 3639.

22