Comparative genomics sheds light on niche differentiation and the ...

0 downloads 0 Views 9MB Size Report
Comparative genomics sheds light on niche differentiation and. 4 the evolutionary history of comammox Nitrospira. 5. 6. Alejandro Palomo1,2, Anders G ...
1 2 3

Supplementary Information

Comparative genomics sheds light on niche differentiation and the evolutionary history of comammox Nitrospira

4 5 6 7 8 9 10 11 12 13

Alejandro Palomo1,2, Anders G Pedersen2, S Jane Fowler1, Arnaud Dechesne1, Thomas Sicheritz-Pontén2, Barth F Smets1 1 2

Department of Environmental Engineering, Technical University of Denmark, Kgs Lyngby, Denmark Department of Bio and Health Informatics, Technical University of Denmark, Kgs Lyngby, Denmark

SI Material and Methods

14

Recovery of individual draft genomes.

15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40

In our recent study, a composite population genome (CG24) composed of five genomes was recovered from six metagenomes from three locations in a rapid sand filter (1). Due to the abundance and heterogeneity of Nitrospira spp., the nucleotide signature based binning approach initially used (VizBin (2)) did not enable the separation of CG24 into individual genomes. Therefore, a strategy based on subsample assembly (3) followed by differential coverage binning (4) was conducted. First, individual de novo assemblies of the metagenome where CG24 was most abundant (ISL_Top2) were executed using a subset of the total reads (using 5%, 10%, 20%, 30%, 40%, 50%, 60% and 70% of the total reads). These assemblies were independently subjected to differential coverage analysis using mmgenome (5). For differential coverage analysis, the six samples (ISL_Top1, ISL_Top2 and ISL_Top4, ISL_Bulk1, ISL_Bulk2 and ISL_Bulk4) as well as GC content and tetranucleotide frequencies information were used. Once this approach was applied to all the subassemblies, the completion and quality of all the recovered individual bins was evaluated using mmgenome, a set of 107 HMMs essential single-copy genes and CheckM (6). When the same individual draft genome was recovered, the bin best assembled (based on N50), most complete, and with lowest contamination was retained. Thus, CG24_A was recovered from the subassembly using the 10% of the total reads of ISL_Top2. CG24_B, CG24_C and CG24_D were recovered using 20%, 50% and 70% of the total reads of ISL_Top2, respectively. To try to recover a putative low abundance Nitrospira sp., a co-assembly of the three samples (ISL_Top1, ISL_Top2 and ISL_Top4) where CG24 was most abundant was performed. Then, the subsample assembly strategy followed by differential coverage binning was carried out. After the quality evaluation, the highest quality bin, CG24_E, was extracted from the co-assembly made using 50% of the total reads. The binning steps for the five individual bins are shown in the Fig. S1.

41 42 43

For each of the investigated proteins we could therefore construct a multiple alignment of up to 19 homologous sequences. Based on each alignment, we performed all pairwise sequencecomparisons, and for each comparison computed the sequence dissimilarity (i.e., the fraction

Pairwise sequence proteins dissimilarity comparison.

1

44 45 46 47 48 49 50 51 52 53 54 55 56 57 58

of amino-acid sites that differ in the two proteins). We did the same for a concatenated alignment of 14 ribosomal proteins from the same set of genomes, and then plotted the pairwise sequence dissimilarities of the investigated proteins (ammonium oxidation-related or housekeeping) against the corresponding pairwise sequence dissimilarities for the ribosomal proteins (Fig. 4). If the ribosomal proteins have been inherited vertically and are evolving with clock-like dynamics (i.e., at an approximately constant rate), then ribosomal sequence diversity should be roughly proportional to the time since the pair of species diverged. If the investigated protein is also evolving at a constant rate and has also been inherited vertically (along with the ribosomal protein) then we expect such a plot to be approximately linear, with the slope being equal to the relative rate of evolution. However, if there has been HGT of an investigated protein, then all pairwise distances will not fall on a single line - some pairs of sequences may be more similar than expected from the species phylogeny while others may be less so

59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83

In order to stringently analyze the possible occurrence of gene duplication, gene loss, and horizontal gene transfers, we performed a so-called reconciliation analysis. The main idea is to compare a gene-tree (or guest-tree) to the corresponding species-tree (or host-tree), where the species-tree has been constructed from genes believed to be inherited vertically. If there are discrepancies between gene- and species-trees the goal is to infer a reconciliation between them, i.e., to infer a likely history of duplications, losses, and transfers that explain the differences. This entails finding a mapping from nodes in the guest-tree to nodes or branches in the species-tree. In the present analysis we used a species-tree constructed using a clock model, meaning that the internal nodes were dated. This provides important constraints for reconciliation analysis, since transfers are only possible between contemporaneous branches or nodes in the species tree. We employed Bayesian methods for all steps in the reconciliation analysis. The essence of Bayesian inference is to use probabilities for quantifying uncertainty about states of reality. Briefly, a system of interest is described in the form of a mathematical model, uncertainty about model parameters is expressed in the form of a prior probability distribution over the possible parameter values, data is collected, and Bayes' theorem is then used to combine the prior with the data to obtain the posterior distribution. The approach allows one to include prior knowledge in a stringent manner and to properly account for uncertainty about "nuisance parameters" in inference about parameters of interest. For instance, in the present case, we can take into account uncertainty about guest tree topology, amino acid substitution rates, and timing of events, when trying to estimate the probability that a horizontal transfer has taken place. This is especially important here since the phylogenies investigated are quite deep (going back hundreds of millions of years), and there is thus considerable uncertainty about underlying events.

84 85 86

For the clock-based species-tree construction, the following priors were used. Tree prior: calibrated Yule model. Yule birth rate: gamma distribution, a = 0.001, b = 1000. Mean clock rate: lognormal distribution with M = -10.4 and S = 1.73 (units in millions of years, and prior

Reconciliation analysis.

2

87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112

set such that 95% highest density interval (95% HDI) = [1E-6, 1E-3] substitutions/site/million years). ucldStdev.c: gamma distribution, with a=0.5396, b=0.3819. As mention in the main text, the timetree.org resource and the substitution rate of 16S rRNA (16S rRNA divergence: 1% / 50MY) were also used. Specifically, we used the following clade calibrations (node numberings as in SI Material and Methods, Fig. 1): Node 1: 150-175 MYA (from 16S rRNA gene rate, based on Ochman and Wilson (7)). Node 2: 350-400 MYA (from 16S rRNA gene rate, based on Ochman and Wilson (7)). Node 3: 450-500 MYA from 16S rRNA gene rate, based on Ochman and Wilson (7); and Marin et al. (8)). Node 4: 8701210 MYA (Lücker et al. (9) which calculated from 16S rRNA gene rate, based on Ochman and Wilson (7)). Node: 1623-2621 MYA (Battistuzzi, Feijao, and Hedges (10); Battistuzzi and Hedges (11); Marin et al. (8)). In all cases the prior was set to a normal distribution with mean and standard deviation chosen such that the 95% HDI corresponded to the above ranges. The Markov chain Monte Carlo (MCMC) analysis was allowed to run for 50 million iterations, with a thinning of 50,000 (thus resulting in a total of 1000 samples per chain). Five independent chains were started, and after the runs finished convergence and mixing was checked using the software Tracer (12). The ribosomal protein-based species-tree was found to display considerable topological uncertainty in clade B (Fig. S2). This is most likely because the species in this clade are relatively closely related, and there is therefore limited phylogenetic signal distinguishing different possible branching patterns. Since reconciliation analysis is based on finding discrepancies between a gene tree and a species-tree (which is assumed to be correct), uncertainty in the species-tree is problematic and may lead to spurious inference of transfers, duplications and losses. To avoid this, we removed the clade B bacteria CG24A and CG24E from the species tree. The resulting tree without these species was fully resolved and with good support for all branches.

113 114 115 116 117 118 119 120

As mention in the main text, guest-trees were midpoint-rooted. Using a set of midpointrooted guest trees as input to the reconciliation analysis allowed us to account for uncertainty about guest tree topology (since the sample file will contain a number of different trees, with some topologies being more frequent than others), while putting emphasis on our prior belief that evolution will, on the whole, be clock-like (since midpoint rooting will result in trees where the most distant leaves have the same distance to the root). This is the same assumption made when interpreting deviations from linearity in the pairwise distance plots as evidence for transfer.

3

121 122 123 124

SI Material and Methods Figure 1. Phylogenetic tree based on 14 concatenated ribosomal proteins. Arrows denote nodes with time estimation information.

125 126

SI Results and Discussion

127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144

Gene synteny analysis Besides the already mention distinct features in the main text, here are highlighted other discrepancies between AOB and comammox genomes in relation with gene arrangement. The gene arrangement of the AMO-related genes differs between comammox and bAOB. The order of these genes both in clade A and clade B is: amoBAC,copDC,amoD(D1 and D2),amoE, while in bAOB genomes is amoCAB,amoE,amoD,copCD. Additionally, unlike AOB, comammox have two copies of amoD and an agmatinase (AGMAT) gene in close proximity to a gene coding a urea carboxylase-related transporter (uctT) (Fig. 5 and Fig. S9). This last feature also distinguishes comammox clade A and clade B. While AGMAT is located next to the uctT gene in clade A genomes, one histidine kinase and several hypothetical proteins separate AGMAT and uctT in clade B genomes (Fig. 5 and Fig. S9). Another feature differentiating the clades from each other is the presence of two hypothetical proteins-coding genes present as single copies in clade B and bAOB (one close to the HAO cluster and the other, to the AMO cluster) which are duplicated in the clade A genomes (Fig. 5 and Fig. S9). Furthermore, genes associated with iron storage (brf) and a dicarboxylate transport (dtcA) are typically placed in close proximity to the AMO, HAO and cytochrome c biosynthesis genes cluster in clade A genomes while is not the case in clade B and bAOB. On the other hand, clade B bacteria exclusively contain a transposase between 4

145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175

amoBAC and copCD, and two genes encoding for a copper containing nitrite reductase (CuNiR gene) next to the HAO cluster (Fig. 5 and Fig. S9). Another distinction between clade A and clade B genomes is that despite the presence of AMO, HAO and cytochrome c biosynthesis in the same genomic region, the order in clade A genomes is AMO, HAO and cytochrome c biosynthesis, while in the clade B genomes is AMO, cytochrome c biosynthesis and HAO (Fig. 5 and Fig. S9). Integrative conjugative element in CG24_B and AAU_MBR1 Especially striking were the cases of comammox CG24_B and AAU_MBR1 bins where a contig with a high density of heavy metal resistance genes was detected (Table S2). Along with the mentioned Nitrospira-type resistance genes, other phylogenetically distinct chromate, arsenic and mercury resistance proteins were found in the intervening region. Moreover, the flanking regions of this contig contain two loci with the tra genes of the relaxosome and the trb genes coding for conjugal transfer proteins, respectively. In addition, previously predicted and novel hypothetical proteins, several integrases, two toxin-antitoxin proteins and a phage infection protein are present in this part of their genomes. The presence of multiple Nitrospira-type genes, together with the absence of an origin of replication, suggests that these conjugation-associated genes are integrated in the chromosome instead of on a plasmid. Overall, based on the genomic characteristics of this identified contig, this putative integrative conjugative element could confer resistance to different heavy metals as well as virulence and defence capacity to CG24_B and AAU_MBR1.

Arsenite oxidation in Ca. Nitrospira defluvii In relation to the arsenite oxidation genes found in Ca. N. defluvii, while arsenite oxidation coupled to energy conservation has not been demonstrated in Nitrospira, a Nitrospira sp. (99% 16S rRNA gene similarity to Ca. N. defluvii) was recently found to be dominant in an arsenic-rich sand filter, notably at depths with the highest arsenic concentrations (13). Moreover, Ca. N. defluvii was enriched in a fluidized-bed reactor treating simulated arsenicrich mining water (14), suggesting that some Nitrospira may harbour the ability to conserve energy from arsenite oxidation.

5

CG24_A binning workflow:

CG24_B binning workflow:

6

CG24_C binning workflow:

CG24_D binning workflow:

7

CG24_E binning workflow:

8

Fig. S1. Differential coverage plots of metagenomes obtained from Islevbro waterworks at different sampling depth for CG24_A, CG24_B, CG24_C, CG24_D and CG24_E. Scaffolds are displayed as circles, scaled by length and colored according to phylum-level taxonomic affiliation. Only scaffolds >1 kbp are shown.

9

A2 1

CG24B

1

nitrificans 1

AAUMBR1 1

1

nitrosa

1

inopinata

0.96

moscoviensis CG24D CG24A

1 0.57 0.57

GWW3 CG24E

1

CG24C

1 1

GWW4 AAUMBR2 1

1

defluvii

1

OLB3 lepto AL212 1

Is79A3 1 1

briensis 1

multiformis

1 1

lacus communis 1

europaea

1 1

eutropha oceani 1

watsonii

200.0

Fig. S2. Species-tree of AOB and comammox Nitrospira based on 14 concatenated ribosomal proteins. The tree was constructed using the software BEAST, employing a relaxed log-normal clock model with selected internal nodes calibrated as shown in Supplementary Methods Fig. 1. The posterior clade probabilities are shown on all internal nodes. Since reconciliation analysis is based on finding discrepancies between the species-tree and gene trees, the uncertainty about the branching pattern around CG24A and CG24E is problematic; these two species were removed from the species tree for the rest of the analysis (see Supplementary Fig. 10).

10

1

AL212_haoA

1 1

Is79A3_haoA

1

CG24A_AmoA

1

0.99

communis_haoA

GWW3_AmoA 1

0.96

1

europaea_haoA

CG24C_AmoA 1

1

0.55

1

GWW4_AmoA

1

eutropha_haoA

CG24E_AmoA 1

AL212_AmoA

briensis_haoA

1

1 1

Is79A3_AmoA 0.97

briensis_AmoA

multiformis_haoA

1

1

0.56

1

lacus_AmoA

1

inopinata_haoA

1 1

1

multiformis_AmoA

1

nitrificans_haoA

1

communis_AmoA

1

nitrosa_haoA

europaea_AmoA 1

1

0.64

eutropha_AmoA

CG24A_haoA

1

0.96 1

AAUMBR1_AmoA

0.85

GWW3_haoA

1

nitrosa_AmoA

1

1 1

GWW4_haoA

inopinata_AmoA 1

1CG24B_AmoA

A2_haoA

1 1

A2_AmoA

CG24B_haoA

0.73

nitrificans_AmoA

1

watsonii_AmoA

1

oceani_AmoA

0.2

watsonii_haoA

1

1

oceani_haoA

0.2

Fig. S3. Phylogenetic trees based on left) AmoA; right) HaoA protein sequences. The trees shown here are majority rule consensus trees from runs using the software MrBayes. The posterior clade probabilities are shown at each internal node. Note that some branching patterns are associated with considerable uncertainty. The 95% credible intervals for trees contained 54 trees for AmoA and 4 trees for HaoA.

11

Fig. S4. Relative abundance of the five extracted genomes in the top (n=3; red) and bulk samples (n=3; blue).

Fig. S5. Pairwise average amino acid identity (AAI) comparison of the 16 Nitrospira genomes.

12

Fig. S6. Principal component analysis of 16 Nitrospira genomes based on coding sequences annotated by SEED subsystems.

13

14

15

16

Fig. S7. Relationship between genomes phylogenetic distance and protein sequence divergence for housekeeping proteins for comammox Nitrospira and b-AOB genomes. Each dot represents a pair of genomes and is coloured according to the groups to which the compared genomes belong. The y-axis shows the pairwise protein dissimilarity (fraction of amino-acid sites that differ) while the x-axis shows the corresponding pairwise dissimilarity for a set of 14 ribosomal proteins. Coloured lines show the linear regression for each groupwise comparison with shadowed regions indicating 95% confidence intervals for the slopes. A: comammox clade A; B: comammox clade B; Ntsm: Nitrosomonas genomes; Ntrssp: Nitrosospira genomes. AOB: Both Nitrosomonas and Nitrosospira.

17

18

19

20

21

Fig. S8. Relationship between genomes phylogenetic distance and protein sequence divergence for ammonium oxidation-related proteins for comammox Nitrospira and b-AOB genomes. Each dot represents a pair of genomes and is coloured according to the groups to which the compared genomes belong. The y-axis shows the pairwise protein dissimilarity (fraction of amino-acid sites that differ) while the x-axis shows the corresponding pairwise dissimilarity for a set of 14 ribosomal proteins. Coloured lines show the linear regression for each groupwise comparison with shadowed regions indicating 95% confidence intervals for the slopes. A: comammox clade A; B: comammox clade B; Ntsm: Nitrosomonas genomes; Ntrssp: Nitrosospira genomes. AOB: Both Nitrosomonas and Nitrosospira.

22

Fig. S9. Genomic regions containing genes of ammonium oxidation pathway and other AOB-related genomic features in comammox Nitrospira clade A (Top Panel) and comammox Nitrospira clade B (Bottom Panel). Homologous genes are connected by lines. Functions of the encoded proteins are represented by colour. Parallel double lines designate a break in locus organization. Single lines designate breaks probably due to contig fragmentation.

23

Fig. S10. Time-calibrated phylogenetic tree AOB and comammox Nitrospira based on 14 ribosomal proteins. Node bars correspond to the 95% highest posterior density (HPD) interval for the age of each node. Scale bar denotes 200 MYA.

24

Table S1. 27 suspected transferred genes, 14 syntenic ribosomal proteins and 18 housekeeping genes.

Suspected Ribosomal Housekeeping transferred Proteins genes genes polA AmoA L2 gyrB AmoB L3 ftsZ AmoC L4 ruvA AmoD1 L5 urvC AmoD2 L6 gapA AmoE L14 smpB CopC L16 rbfA CopD L18 prfA HaoA L22 lepA HaoB L24 mfd CycA S1 rim CycB S3 recA CcmA S8 secA CcmB S19 Ffh CcmC mreB CcmE pyrH CcmF dnaG CcmG CcmH CcmI CopA CopB Ureacarbox YggS hyp_83 PC3432 amideporin

Table S2 See enclosed excel spreadsheet

25

Table S3. Additional genomes used in this study.

b-proteobacteria AOB Nitrosomonas communis Nitrosomonas europaea Nitrosomonas eutropha Nitrosomonas sp. AL212 Nitrosomonas sp. Is79A3 Nitrosospira briensis Nitrosospira lacus Nitrosospira multiformis

g-proteobacteria AOB

Nitrosococcus oceani

Nitrospirae Leptospirillum ferrooxidans

Nitrosococcus watsonii



Table S4. Posterior probability of gene transfer (Ptran_s), duplication (Pdup_s) or either transfer or duplication (Ptrdu_s) for the suspected transferred genes.

Gene Ptran_s Pdup_s Ptrdu_s AmoA 0.99 0.99 1 AmoB 0.87 0.11 0.95 AmoC 0.81 1 1 AmoE 1 1 1 CcmA 0.46 0.99 1 CcmB 0.94 1 1 CcmC 0.99 1 1 CcmE 0.99 0.92 1 CcmF 0.81 0.81 1 CcmG 0.98 0.98 1 CcmH 0.98 0.87 1 CcmI 0.89 0.99 1 CopA 1 1 1 CopB 0.82 0.96 1 CopC 1 0.91 1 PC3432 0.98 0.96 1 Ureacarboxylase 0.98 0.58 1 YggS 0.97 0.8 1 amideporin 0.72 0.97 1 cycA 0.84 0.98 1 cycB 0.82 0.96 1 haoA 0.89 0.95 1 haoB 0.95 0.92 1 hypoth_83 0.92 0.81 1

26

References 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.

Palomo A, et al. (2016) Metagenomic analysis of rapid gravity sand filter microbial communities suggests novel physiology of Nitrospira spp. ISME J 10(11):2569–2581. Laczny CC, et al. (2015) VizBin - an application for reference-independent visualization and human-augmented binning of metagenomic data. Microbiome 3(1):1. Hug LA, et al. (2016) Critical biogeochemical functions in the subsurface are associated with bacteria from new phyla and little studied lineages. Environ Microbiol 18(1):159–173. Albertsen M, et al. (2013) Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nat Biotechnol 31(6):533–8. Karst SM, Kirkegaard RH, Albertsen M (2016) Mmgenome: a Toolbox for Reproducible Genome Extraction From Metagenomes. bioRxiv:59121. Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW (2015) CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res 25(7):1043–55. Ochman H, Wilson AC (1987) Evolution in bacteria: Evidence for a universal substitution rate in cellular genomes. J Mol Evol 26(1–2):74–86. Marin J, Battistuzzi FU, Brown AC, Hedges SB (2016) The timetree of prokaryotes: new insights into their evolution and speciation. Mol Biol Evol 34(2):msw245. Lucker S, et al. (2010) A Nitrospira metagenome illuminates the physiology and evolution of globally important nitrite-oxidizing bacteria. Proc Natl Acad Sci 107(30):13479–13484. Battistuzzi FU, Feijao A, Hedges SB (2004) A genomic timescale of prokaryote evolution: insights into the origin of methanogenesis, phototrophy, and the colonization of land. BMC Evol Biol 4(1):44. Battistuzzi FU, Hedges SB (2009) A major clade of prokaryotes with ancient adaptations to life on land. Mol Biol Evol 26(2):335–343. Rambaut A, Suchard M, Xie D, Drummond A (2014) Tracer v1.6, Available from http://beast.bio.ed.ac.uk/Tracer. Nitzsche KS, Weigold P, Lösekann-Behrens T, Kappler A, Behrens S (2015) Microbial community composition of a household sand filter used for arsenic, iron, and manganese removal from groundwater in Vietnam. Chemosphere 138:47–59. Papirio S, et al. (2014) Effect of arsenic on nitrification of simulated mining water. Bioresour Technol 164:149–154.

27