Integrated Genomic Island Prediction Tool (IGIPT) by Ruchi Jain, Sandeep Ramineni, Nita Parekh

in icit, pp.131-132, International Conference on Information Technology, 2008. DOI: http://doi.ieeecomputersociety.org/10.1109/ICIT.2008.42

Report No: IIIT/TR/2008/175

Centre for Computational Natural Sciences and Bioinformatics International Institute of Information Technology Hyderabad - 500 032, INDIA December 2008


Integrated Genomic Island Prediction Tool (IGIPT)
Ruchi Jain, Sandeep Ramineni and Nita Parekh
Centre for Computational Natural Sciences and Bioinformatics
[email protected], [email protected]

Windows deviating significantly from the genomic average (~1.5σ) are identified as probable GIs. Similarly, GC content at the gene level is obtained by computing the G+C frequency at the three codon positions, GC1, GC2, and GC3 [4]. In this case, the analysis of the genes lying within window W is compared with that of the complete gene set of the organism.
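The codon-position G+C measure described above can be sketched as follows. This is a minimal illustration, not the IGIPT implementation: the function name `codon_position_gc` is hypothetical, genes are assumed to be in-frame coding sequences given as strings, and pooling all codons of a gene set (rather than averaging per gene) is an assumption.

```python
def codon_position_gc(genes):
    """Return (GC1, GC2, GC3): the G+C fraction at each codon position.

    `genes` is an iterable of in-frame coding sequences (assumed input format).
    Counts are pooled over all genes; trailing partial codons are ignored.
    """
    gc = [0, 0, 0]      # G or C observed at codon positions 1, 2, 3
    total = [0, 0, 0]   # bases observed at codon positions 1, 2, 3
    for cds in genes:
        cds = cds.upper()
        usable = len(cds) - len(cds) % 3  # drop an incomplete final codon
        for i in range(usable):
            pos = i % 3
            total[pos] += 1
            if cds[i] in "GC":
                gc[pos] += 1
    return tuple(g / t if t else 0.0 for g, t in zip(gc, total))
```

Under these assumptions, the same function would be applied once to the genes lying within a window W and once to the complete gene set, and the two triples compared.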

Abstract
We have developed a web-based integrated platform for the identification of genomic islands that implements several measures capturing bias in nucleotide composition, viz., GC content (both at the whole-genome level and at the three codon positions in genes), genomic signature, k-mer distribution (k = 2–6), codon usage bias and amino acid usage bias. Each measure is computed in sliding windows (default size 10 Kb) and compared with the genomic average. The output is displayed in tabular format for each window and can be filtered to show only windows in which a measure differs from the genomic average by 1.5σ (standard deviations). Availability: http://ccnsb.iiit.ac.in/IGIPT/

2.2 Genomic signature: The dinucleotide bias of window W with respect to genome G is given by [2]

$$\delta^*(W, G) = \frac{1}{16} \sum_{xy} \left| \rho^*_{xy}(W) - \rho^*_{xy}(G) \right|$$

where the sum extends over all 16 dinucleotides XY and the genome signature profile is

$$\rho^*_{xy} = \frac{f^*_{xy}}{f^*_x f^*_y},$$

with $f^*_{xy}$ and $f^*_x$ being the frequencies of dinucleotide XY and mononucleotide X, respectively, computed for the sequence concatenated with its inverted complement.
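A minimal sketch of the genomic signature difference defined above. The function names are hypothetical, and one implementation choice is an assumption: instead of literally concatenating the sequence with its inverted complement, the code counts on both strands separately, which avoids counting the artificial dinucleotide at the junction. Ambiguous bases such as N are skipped.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def signature(seq):
    """Return rho*_xy = f*_xy / (f*_x f*_y) for all 16 dinucleotides."""
    seq = seq.upper()
    strands = (seq, seq.translate(_COMPLEMENT)[::-1])  # forward + reverse complement
    mono = {b: 0 for b in "ACGT"}
    di = {a + b: 0 for a, b in product("ACGT", repeat=2)}
    for s in strands:
        for b in s:
            if b in mono:
                mono[b] += 1
        for i in range(len(s) - 1):
            d = s[i:i + 2]
            if d in di:
                di[d] += 1
    n_mono, n_di = sum(mono.values()), sum(di.values())
    rho = {}
    for d, count in di.items():
        if n_mono == 0 or n_di == 0:
            rho[d] = 0.0
            continue
        fx, fy = mono[d[0]] / n_mono, mono[d[1]] / n_mono
        # Dinucleotides involving an absent base get 0 (a convention chosen here).
        rho[d] = (count / n_di) / (fx * fy) if fx and fy else 0.0
    return rho


def delta_star(window, genome):
    """Average absolute difference between the two signature profiles."""
    rw, rg = signature(window), signature(genome)
    return sum(abs(rw[d] - rg[d]) for d in rw) / 16
```

Because both strands are counted, the profile is symmetric under reverse complementation, e.g. the value for AC equals the value for GT.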

1. Introduction
Genomic islands (GIs) are horizontally transferred regions, typically ~10–200 Kb in size. Any biological advantage that the transferred DNA provides to the recipient organism creates selective pressure for its retention in the host genome. Increasing evidence indicates that genomic islands are important in the evolution of bacteria, influencing traits such as antibiotic resistance, symbiosis, fitness and adaptation in general. The most widely used approaches for identifying recent horizontal transfers are based on the “genome hypothesis,” according to which codon usage and GC content are distinct signatures of each genome [1]. Here, we have implemented six such measures that capture anomalies in nucleotide composition on a single platform [2-4].

2. Materials and Methods
The measures implemented in the IGIPT tool are briefly discussed below.

2.3 Word Distributions: Horizontally acquired regions have distinct word (k-mer) compositions, which can therefore be used for identifying probable GIs [3]. The frequencies of all possible words of size k (= 2–6) are computed in sliding windows, W, and also for the whole genome. The difference in the average k-mer frequency is computed as

$$\delta^*_k(W, G) = \frac{1}{4^k} \sum_{x_1 \cdots x_k} \left| \rho^*_{x_1 \cdots x_k}(W) - \rho^*_{x_1 \cdots x_k}(G) \right|$$

where the sum extends over all $4^k$ words of size k. Windows exhibiting significant deviation from the genomic average are identified as probable GIs.
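The word-distribution measure can be sketched along the same lines. The text does not spell out how the per-word quantity is defined for k > 2, so this sketch assumes it is simply the relative frequency of each word counted on both strands; that interpretation, and the function names, are assumptions rather than the tool's documented behaviour.

```python
from collections import Counter
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_BASES = set("ACGT")


def kmer_freqs(seq, k):
    """Relative frequencies of k-mers counted on both strands."""
    seq = seq.upper()
    counts = Counter()
    for s in (seq, seq.translate(_COMPLEMENT)[::-1]):
        for i in range(len(s) - k + 1):
            word = s[i:i + k]
            if set(word) <= _BASES:  # skip words containing ambiguous bases
                counts[word] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


def delta_k(window, genome, k):
    """Mean absolute frequency difference over all 4**k words of size k."""
    fw, fg = kmer_freqs(window, k), kmer_freqs(genome, k)
    words = ("".join(p) for p in product("ACGT", repeat=k))
    return sum(abs(fw.get(w, 0.0) - fg.get(w, 0.0)) for w in words) / 4 ** k
```

Calling `delta_k(window, genome, k)` for k = 2 to 6 would give one value per word size for each window, matching the five k-mer columns reported by the tool.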

2.4 Codon Bias: The unequal usage of synonymous codons is referred to as codon bias. Let F be a family of genes with average codon frequencies f(x,y,z) for the codon triplets (x,y,z), normalized so that

$$\sum_{(x,y,z) = a} f(x, y, z) = 1,$$

where the sum extends over all codons translated to amino acid a; the codon frequencies g(x,y,z) of the genome G are normalized in the same way.

2.1 GC content anomalies: The frequency of G+C in a sliding window, W, (~10 Kb) is compared to the average genomic G+C frequency [2].
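The sliding-window comparison can be illustrated with a short sketch. It assumes non-overlapping windows and takes σ to be the standard deviation of the per-window G+C values, neither of which is specified in the text; the function names and the `n_sigma` parameter are hypothetical.

```python
from statistics import pstdev


def gc_fraction(seq):
    """Fraction of G and C bases in a sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def gc_windows(genome, window=10_000, step=10_000):
    """Yield (start, G+C fraction) for every full window."""
    for start in range(0, len(genome) - window + 1, step):
        yield start, gc_fraction(genome[start:start + window])


def flag_gc_outliers(genome, window=10_000, step=10_000, n_sigma=1.5):
    """Return windows whose G+C differs from the genomic G+C by > n_sigma * sigma."""
    wins = list(gc_windows(genome, window, step))
    if not wins:
        return []
    genome_gc = gc_fraction(genome)
    sigma = pstdev(v for _, v in wins)  # spread of window values (assumed meaning of sigma)
    return [(s, v) for s, v in wins if abs(v - genome_gc) > n_sigma * sigma]
```

Setting `step` smaller than `window` would give overlapping windows, at the cost of correlated window values.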


Table 1 summarizes the output from IGIPT on the A. aurescens genome. The results are shown only for windows differing by 1.5σ from the genomic average; P denotes a region identified by a particular measure and N one that was not identified.

Table 1: Analysis of A. aurescens genome using IGIPT

Experimentally       Size    Measures at Genome Level            Measures at Gene Level
Verified GIs (Kb)    (Kb)    GC   GS   k=2  k=3  k=4  k=5  k=6   CB   AAB  GC1  GC2  GC3
14.5 - 16.9          24      P    P    P    P    P    P    P     N    N    N    N    N
95.9 - 97.4          15      N    P    N    N    N    N    N     N    N    N    N    N
123.2 - 127.1        39      P    P    P    P    P    P    P     P    P    P    N    P
146.5 - 150.0        35      P    P    P    P    P    P    P     P    N    N    N    P
179.1 - 180.7        16      N    N    N    N    N    N    N     N    P    P    N    N
338.6 - 339.8        12      N    N    N    N    N    N    N     N    N    N    N    N
341.2 - 342.3        11      P    P    N    N    N    N    P     N    P    P    P    P
345.9 - 347.9        20      P    P    P    P    P    P    P     P    P    P    P    P
385.0 - 386.2        12      P    P    N    N    N    N    N     N    N    N    P    N
410.3 - 412.7        24      P    P    P    P    P    P    P     N    N    N    P    N
452.0 - 453.2        12      N    N    N    N    N    N    N     N    N    N    N    N

GS: genomic signature; CB: codon bias; AAB: amino acid bias.

Regions acquired from donors with a compositional bias similar to that of the host genome will not be identified by these measures. Since no single measure can reliably identify a horizontally acquired region, we suggest that users confirm a prediction with two or more measures. Implementing six different measures on a single platform greatly increases the predictive power of IGIPT.
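The suggestion to confirm a prediction with two or more measures can be made concrete with a small sketch. The P/N calls below are transcribed from Table 1 in column order (GC, GS, k=2 to k=6, CB, AAB, GC1, GC2, GC3); the data structure, the function name `supported` and the `min_measures` parameter are illustrative assumptions, not part of IGIPT.

```python
MEASURES = ["GC", "GS", "k=2", "k=3", "k=4", "k=5", "k=6",
            "CB", "AAB", "GC1", "GC2", "GC3"]

# Region coordinates as printed in Table 1 -> one P/N call per measure.
TABLE1 = {
    (14.5, 16.9):   "PPPPPPPNNNNN",
    (95.9, 97.4):   "NPNNNNNNNNNN",
    (123.2, 127.1): "PPPPPPPPPPNP",
    (146.5, 150.0): "PPPPPPPPNNNP",
    (179.1, 180.7): "NNNNNNNNPPNN",
    (338.6, 339.8): "NNNNNNNNNNNN",
    (341.2, 342.3): "PPNNNNPNPPPP",
    (345.9, 347.9): "PPPPPPPPPPPP",
    (385.0, 386.2): "PPNNNNNNNNPN",
    (410.3, 412.7): "PPPPPPPNNNPN",
    (452.0, 453.2): "NNNNNNNNNNNN",
}


def supported(table, min_measures=2):
    """Return regions flagged (P) by at least `min_measures` measures."""
    return [region for region, calls in table.items()
            if calls.count("P") >= min_measures]


def measures_for(table, region):
    """Names of the measures that flagged a given region."""
    return [m for m, call in zip(MEASURES, table[region]) if call == "P"]
```

With the calls as transcribed, `supported(TABLE1)` returns 8 of the 11 regions, and lowering `min_measures` to 1 adds only the 95.9 - 97.4 region, which is flagged by the genomic signature alone.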

The codon usage difference of the gene family F relative to the genome G is given by

$$B(F|G) = \sum_a p_a(F) \left[ \sum_{(x,y,z) = a} \left| f(x,y,z) - g(x,y,z) \right| \right]$$

where $p_a(F)$ are the normalized amino acid frequencies of the gene family F [2].
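A minimal sketch of B(F|G), assuming the standard genetic code, in-frame coding sequences given as strings, and the exclusion of stop codons from the sums (the text does not say how stop codons are treated). Here the genomic frequencies g are taken over the full gene set of the genome, as in the formula above; all function names are hypothetical.

```python
from collections import Counter

# Standard genetic code, codons enumerated in TCAG order; '*' marks stop codons.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = dict(zip((a + b + c for a in _BASES for b in _BASES for c in _BASES), _AMINO))


def codon_counts(genes):
    """Count sense codons over a set of in-frame coding sequences."""
    counts = Counter()
    for cds in genes:
        cds = cds.upper()
        for i in range(0, len(cds) - 2, 3):
            codon = cds[i:i + 3]
            if CODE.get(codon, "*") != "*":  # skip stops and ambiguous codons
                counts[codon] += 1
    return counts


def per_amino_acid_freqs(counts):
    """Codon frequencies normalized to sum to 1 within each amino acid."""
    aa_totals = Counter()
    for codon, n in counts.items():
        aa_totals[CODE[codon]] += n
    return {codon: n / aa_totals[CODE[codon]] for codon, n in counts.items()}


def codon_bias(family, genome_genes):
    """B(F|G): amino-acid-weighted codon usage difference of F relative to G."""
    cf, cg = codon_counts(family), codon_counts(genome_genes)
    f, g = per_amino_acid_freqs(cf), per_amino_acid_freqs(cg)
    total = sum(cf.values())
    if total == 0:
        return 0.0
    p = Counter()  # amino acid frequencies p_a(F)
    for codon, n in cf.items():
        p[CODE[codon]] += n / total
    return sum(p[aa] * abs(f.get(codon, 0.0) - g.get(codon, 0.0))
               for codon, aa in CODE.items() if aa != "*")
```

For a family that uses only GCT for alanine against a genome that uses GCT and GCC equally, the bias is |1 - 0.5| + |0 - 0.5| = 1.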

2.5 Amino Acid Bias: This refers to the deviation in the frequency of usage of individual amino acids from the average usage of all 20 amino acids [2]. The amino acid bias between gene family F and the genome G is given by

$$A(F|G) = \frac{1}{20} \sum_{i=1}^{20} \left| a_i(F) - a_i(G) \right|$$

where $a_i(F)$ is the average frequency of amino acid $a_i$ in F.

4. References

[1] E.V. Koonin, K.S. Makarova, and L. Aravind, Horizontal gene transfer in prokaryotes: quantification and classification, Ann. Rev. Microbiol., 55, 709-742, 2001.
[2] S. Karlin, Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes, Trends in Microbiology, 9 (7), 2001.
[3] I. Rajan, S. Aravamuthan and S.S. Mande, Identification of compositionally distinct regions in genomes using the centroid method, Bioinformatics, 23, 2672-2677, 2007.
[4] S.G. Vallve, A. Romeu and J. Palau, Horizontal gene transfer in bacterial and archaeal complete genomes, Genome Res., 10, 1719-1725, 2000.

3. Results and Discussion
A set of prokaryotic genomes and their gene sequences was downloaded from NCBI (ftp://ftp.ncbi.nih.gov/genomes/Bacteria) and analyzed using IGIPT. Here we present our results for the Arthrobacter aurescens TC1 genome, in which 11 genes have been reported with atypical composition. The islands include transposons and related genes, transcriptional regulators, resistance genes, and genes involved in the metabolism and transport of a wide range of substrates.
