Global extent of horizontal gene transfer - PNAS

Global extent of horizontal gene transfer
In-Geol Choi* and Sung-Hou Kim†
Physical Biosciences Division, Lawrence Berkeley National Laboratory and Department of Chemistry, University of California, Berkeley, CA 94720
Contributed by Sung-Hou Kim, December 28, 2006 (sent for review July 1, 2006)

protein domain family | protein sequence family | lateral gene transfer

One of the important new concepts to emerge from the large number of genomic sequences determined in the last decade is that of horizontal gene transfer (HGT): gene transfer among organisms of different species. HGT has been found to have occurred in all three domains: Archaea, Bacteria, and Eukarya. The concept of HGT has been invoked to interpret various evolutionary processes, ranging from speciation and the adaptation of organisms to uncertainties in phylogenetic inference of the tree of life (1–9). Although HGT has been regarded as a driving force in the innovation and evolution of genomes, especially in prokaryotes, its extent and its impact on the evolutionary process and on the phylogeny of organisms or species remain controversial (8–10). Several methods have been developed to detect HGT, based on (i) differences between gene trees derived from a limited number of gene families and reference trees such as the small-subunit ribosomal RNA (SSU rRNA) tree (11–13) or a whole-genome tree (14); (ii) unexpectedly high sequence similarity of a gene between two distant genomes compared with that among homologous genes in closely related genomes (15); and (iii) unusual nucleotide composition or codon usage of a gene compared with the rest of the genes within a genome (16, 17). Many factors affect the detection of HGT, such as lineage-specific gene loss (18, 19), unequal rates of base substitution (1), loss of signal due to amelioration processes (16), and others (1, 15). It has been suggested that HGT may have been “rampant” in primitive genomes (6, 20), but that, for modern organisms, it may not be a dominant factor in speciation, because HGT has less effect on overall genome phylogeny (10, 21). There is much convincing evidence for HGT of specific genes or gene families, but there has been no estimate of the global extent of HGT in terms of protein domains.
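As a rough illustration of detection approach (iii), composition-based screening, the sketch below flags genes whose GC content deviates strongly from the rest of a genome. It is not the statistical method developed in this paper, which works on protein domain families and taxonomy. The function names, the use of plain GC content rather than codon usage, and the z-score cutoff of 2.0 are all assumptions chosen for the example.

```python
from statistics import mean, pstdev
from typing import Dict, List


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def flag_atypical_genes(genes: Dict[str, str], z_cutoff: float = 2.0) -> List[str]:
    """Return names of genes whose GC content is an outlier within the genome.

    `genes` maps a (hypothetical) gene name to its nucleotide sequence.
    `z_cutoff` is an illustrative threshold, not a value from the paper.
    """
    gc = {name: gc_content(seq) for name, seq in genes.items()}
    mu = mean(gc.values())
    sigma = pstdev(gc.values())
    if sigma == 0:
        # Every gene has the same composition, so nothing stands out.
        return []
    return [name for name, value in gc.items() if abs(value - mu) / sigma > z_cutoff]


if __name__ == "__main__":
    # Toy genome: nine genes with 50% GC and one GC-rich gene.
    toy = {f"gene_{i}": "ATGC" * 25 for i in range(9)}
    toy["foreign"] = "GGCC" * 25
    print(flag_atypical_genes(toy))
```

A screen of this kind is subject to the caveats listed above: amelioration gradually erases the compositional signal of a transferred gene, so older transfers tend to go undetected.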
Here, we present a statistical method to identify the member(s) of a protein family that may have joined the family by HGT events and examine the global extent of HGT events for all protein domain families of known curated sequences at various ranges of taxonomic levels. A protein (sequence) domain is a functionally independent unit in a protein sequence. The gene coding for it often behaves like a modular genetic element that transfers within or between genomes, sometimes forming a new gene coding for a multiple-domain protein (22–24). Because the fixation of a new gene during evolution depends mostly on its advantage for survival, we focus on HGT of the genetic modules coding for sequence domains, rather than of entire genes. At present, there are ≈1.2 million curated protein domain sequences from the three domains of life (Archaea, Bacteria, and Eukarya) in Pfam (release 16.0) (25).

Results

The Phylogenetic Tree of All Organisms Represented by the Pfam Protein Domains. The first step in our method requires constructing a phylogenetic tree of the organisms represented by all protein domains in Pfam. Many approaches have been developed recently to construct the phylogeny of organisms by using a set of selected gene families or whole genomes, and the resulting phylogeny was found to be practically the same as that constructed from SSU rRNA sequences, suggesting that HGT does not significantly alter the SSU rRNA-based tree (19, 26). Reconstructing the phylogenetic tree of organisms with all species covering the ≈1.2 million protein domain sequences in Pfam is not practical, and thus we simplified the tree structure by using representative taxa. To obtain representative taxa, we examined the taxonomic origins of all organisms from which the protein sequences in Pfam (release 16.0) are derived and extracted their taxonomic identifications at three ranges of taxonomic levels (the second to fourth hierarchical levels listed in Pfam), as described in Materials and Methods (see Fig. 1). In most cases, the second, third, and fourth levels correspond to phylum, the range between phylum and order, and the range between order and genus, respectively. Although the number of protein members per family in Pfam varied considerably, from 2 to 49,343, the number of unique representative taxa per family ranged only from 1 to 191 at the three taxonomic ranges. There is good correlation among the numbers of unique representative taxa, nonredundant species, and family size (Fig. 2), suggesting that the selected taxonomic ranges are representative and that sampling bias resulting from them would not affect the estimation of the global extent of HGT. Thus, we used these representative taxa

Author contributions: I.-G.C. and S.-H.K. designed research; I.-G.C. performed research; I.-G.C. and S.-H.K. analyzed data; and I.-G.C. and S.-H.K. wrote the paper. The authors declare no conflict of interest. Freely available online through the PNAS open access option.

www.pnas.org/cgi/doi/10.1073/pnas.0611557104
Abbreviations: CAG, common ancestral gene; GO, gene ontology; HGT, horizontal gene transfer; ML, maximum likelihood; MRCA, most recent common ancestor; NJ, neighbor joining; PD, phylogenetic distance; SSU rRNA, small-subunit ribosomal RNA.

*Present address: Division of Biotechnology, College of Life Sciences and Biotechnology, Korea University, Seoul, Korea 136-713.

†To whom correspondence should be addressed. E-mail: [email protected].

This article contains supporting information online at www.pnas.org/cgi/content/full/0611557104/DC1.

© 2007 by The National Academy of Sciences of the USA

PNAS | March 13, 2007 | vol. 104 | no. 11 | 4489–4494
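The representative-taxa step described in the Results can be sketched as follows: each family member carries a taxonomic lineage, the lineage is cut at the second, third, and fourth hierarchical levels, and the unique taxa per family are counted at each level. This is a minimal sketch under stated assumptions, not the authors' implementation. The semicolon-separated lineage format, the family identifier `PF_A`, and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Set, Tuple


def taxon_at_level(lineage: str, level: int) -> Optional[Tuple[str, ...]]:
    """Return the lineage prefix down to `level` (1 = top rank), or None.

    Assumes a semicolon-separated lineage string such as
    "Bacteria; Proteobacteria; Gammaproteobacteria". Keeping the whole
    prefix avoids merging identically named taxa under different parents.
    """
    ranks = [r.strip() for r in lineage.rstrip(".").split(";") if r.strip()]
    if len(ranks) < level:
        return None  # lineage is too shallow to define a taxon at this level
    return tuple(ranks[:level])


def unique_taxa_per_family(
    members: Iterable[Tuple[str, str]],
    levels: Tuple[int, ...] = (2, 3, 4),
) -> Dict[str, Dict[int, int]]:
    """Count unique representative taxa per family at each taxonomic level.

    `members` yields (family_id, lineage) pairs, one per domain sequence.
    """
    taxa: Dict[str, Dict[int, Set[Tuple[str, ...]]]] = defaultdict(
        lambda: {level: set() for level in levels}
    )
    for family, lineage in members:
        for level in levels:
            taxon = taxon_at_level(lineage, level)
            if taxon is not None:
                taxa[family][level].add(taxon)
    return {
        family: {level: len(found) for level, found in by_level.items()}
        for family, by_level in taxa.items()
    }


if __name__ == "__main__":
    toy_members = [
        ("PF_A", "Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales"),
        ("PF_A", "Bacteria; Proteobacteria; Gammaproteobacteria; Pseudomonadales"),
        ("PF_A", "Bacteria; Firmicutes; Bacilli; Bacillales"),
        ("PF_A", "Archaea; Euryarchaeota"),
    ]
    print(unique_taxa_per_family(toy_members))
```

Counting taxa rather than sequences is what keeps the per-family numbers small (1 to 191 in the text) even when a family has tens of thousands of members.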

EVOLUTION

Horizontal gene transfer (HGT) is thought to play an important role in the evolution of species and the innovation of genomes. There is much convincing evidence for HGT of specific genes or gene families, but there has been no estimate of the global extent of HGT. Here, we present a method of identifying HGT events within a given protein family and estimate the global extent of HGT in all curated protein domain families (≈8,000) listed in the Pfam database. The results suggest four conclusions: (i) for all protein domain families in Pfam, the fixation of horizontally transferred genes is not a rampant phenomenon between organisms with substantial phylogenetic separations (1.1–9.7% of the Pfam families surveyed at the three taxonomic ranges studied show indication of HGT); (ii) however, at the level of domains, >50% of Archaea have one or more protein domains acquired by HGT, and nearly 30–50% of Bacteria did the same when examined at the three taxonomic ranges. But, the equivalent value for Eukarya is