May 23-26, 2010

6 downloads 8096 Views 6MB Size Report
May 26, 2010 - with easily available and SDK and developer tools. 5. Since the main ... available in market like iPhone, iPodTouch, Android phone, Windows ...
SHORT ABSTRACTS ISBRA 2010 6TH INTERNATIONAL SYMPOSIUM ON BIOINFORMATICS RESEARCH AND APPLICATIONS

May 23-26, 2010 University of Connecticut Storrs, CT

http://www.cs.gsu.edu/isbra10

Symposium Organizers Steering Committee Dan Gusfield, University of California, Davis Yi Pan, Georgia State University Marie-France Sagot, INRIA General Chairs Ion Mandoiu, University of Connecticut Alex Zelikovsky, Georgia State University Program Chairs Mark Borodovsky, Georgia Institute of Technology J. Peter Gogarten, University of Connecticut Teresa Przytycka, National Institutes of Health Sanguthevar Rajasekaran, University of Connecticut Publicity Chair Dumitru Brinza, Life Technologies Finance Chairs Anu Bourgeois, Georgia State University Raj Sunderraman, Georgia State University Poster Chairs Yufeng Wu, University of Connecticut Craig Nelson, University of Connecticut Local Organization Chairs Reda A. Ammar, University of Connecticut Les Loew, University of Connecticut Health Center Linda Strausbaugh, University of Connecticut

Xxxxx Sponsors 

UCONN School of Engineering Booth Engineering Center for Advanced Technology UCONN Center for Applied Genetics and Technology Life Technologies

i

Committee Members Richa Agarwala, NCBI Srinivas Aluru, Iowa State University Danny Barash, Ben-Gurion University Robert Beiko, Dalhousie University Anne Bergeron, Universite du Quebec a Montreal Daniel Berrar, University of Ulster Olivier Bodenreider, NLM Paola Bonizzoni, Universita' Degli Studi di MilanoBicocca Mark Borodovsky, Georgia Tech (Co-Chair) Daniel Brown, University of Waterloo Tien-Hao Chang, National Cheng Kung University Chien-Yu Chen, National Taiwan University Luonan Chen, Osaka Sangyo University Bhaskar DasGupta, niversity of Illinois at Chicago Amitava Datta, University of Western Australia Guillaume Fertin, University of Nantes Vladimir Filkov, University of California Davis Jean Gao, University of Texas at Arlington J.Peter Gogarten, University of Connecticut (Co-Chair) Katia Guimaraes, Federal University of Pernambuco Jieyue He, Southeast University Manuela Helmer-Citterich, Univ. "Tor vergata" Roma

Vasant Honavar, Iowa State University Jinling Huang, Eastern Carolina University S. Iyengar, Louisiana State University Lars Kaderali, University of Heidelberg Ming-Yang Kao, Northwestern University Yury Khudyakov, CDC Danny Krizanc, Wesleyan University Jing Li, Case Western Reserve University Guohui Lin, University of Alberta Fenglou Mao, University of Georgia Osamu Maruyama, Kyushu University Satoru Miyano, University of Tokyo Ion Moraru, University of Connecticut Health Center Bernard Moret, EPFL Giri Narasimhan, Florida International University Andrei Paun, Louisiana Tech University Itsik Pe'er, Columbia University Mihai Pop, University of Maryland Maria Poptsova, University of Connecticut Teresa Przytycka, NCBI (CoChair) Sven Rahmann, Technical University Dortmund Sanguthevar Rajasekaran, University of Connecticut (CoChair)

ii

Shoba Ranganathan, Macquarie University S. Cenk Sahinalp, Simon Fraser University Sartaj Sahni, University of Florida David Sankoff, University of Ottawa Russell Schwartz, Carnegie Mellon University Joao Setubal, Virginia Bioinformatics Institute Mona Singh, Princeton University Steven Skiena, Stony Brook University Jens Stoye, Bielefeld University Raj Sunderraman, Georgia State University Wing-Kin Sung, Nuational University of Singapore Sing-Hoi Sze, Texas A&M University Haixu Tang, Indiana University Gabriel Valiente, Technical University of Catalonia Jean-Philippe Vert, Ecole des Mines de Paris Stéphane Vialette, Université Paris-Est Marne-la-Vallée Li-San Wang, University of Pennsylvania Lusheng Wang, City University of Hong Kong Carsten Wiuf, Aarhus University Richard Wong, Kanazawa University Yufeng Wu, niversity of Connecticut Yanqing Zhang, Georgia State University Leming Zhou, University of Pittsburgh

ISBRA 2010 Short Abstracts Doyo Gragn Enki and Nickolay T. Trendafilov. The semi-divisive approach to geneclustering

1

Alexander Lobkovsky, Yuri Wolf and Eugene Koonin. Universal distribution of evolution rates of protein coding genes from a model of folding robustness

4

Olga Zhaxybayeva, David Williams, Kristen Swithers and J. Peter Gogarten. Quartet decomposition and its applications to study evolutionary histories of genes in genomes

8

Fang-Xiang Wu. Identification of Periodically Expressed Genes from Their Time-Course Expression Profiles

12

Heather Adams, Bruce Southey, Harris Lewin, Robin Everts, Sadie Marjani, Cindy Tian and 16 Sandra Rodriguez-Zas. Meta-analysis of partially overlapping microarray experiments uncovers gene expression relationships across experiment-dependent conditions Heather Adams, Bruce Southey and Sandra Rodriguez-Zas. A novel meta-classification strategy to identify candidate gene biomarkers for accurate sample classification

17

Aminul Islam and Hasan Jamil. Curray: A Query Language for Microarray Data Management and Analysis

18

Trafina Jadhav, Marie Wooten and Michael Wooten. Computational search for preferred TRAF6/p62 ubiquitination sites.

22

Chih Lee, Ion Mandoiu and Craig Nelson. Inferring Ethnicity from Mitochondrial DNA Sequence

26

Kristin Delfino, Sandra Rodriguez Zas, Nicola Serao and Bruce Southey. A computational strategy to identify general and individualized genomic biomarkers

31

Todd J Treangen and Eduardo PC Rocha. Repeatoire: de novo detection of intragenomic repeat repertoires

34

Jen-hwa Chu, Scott T. Weiss, Ross Lazarus, Vincent J. Carey and Benjamin A. Raby. Quantifying differential gene connectivity between disease states for objective identification of disease-relevant hub genes

35

Ren-Cheng Wang, Jie-Zhi Cheng, Kenneth Hung, Chia-Lin Chang, Yun-Ju Sun and ChungMing Chen. Intein Predictor: Prediction of Inteins with Amino Acid Sequence and Support Vector Machine

36

Sandra Rodriguez-Zas, Bruce Southey, Mira Cohen, Guy Bloch and Gene Robinson. Comparison Of Approaches To Identify And Characterize Periodic Gene Expression Profiles

40

N. V. L. Serão, K. R. Delfino, Bruce Southey and Sandra Rodriguez-Zas. Development of a transcriptomic-based index to prognosticate cancer

42

iii

Fenglou Mao, Maria Poptsova, David Williams, Olga Zhaxybayeva, Peter Gogarten and Ying Xu. Quartet Decomposition Server: A Platform for Analyzing Phylogenetic Trees

46

Md. Shafiul Alam, Satish Panigrahi, Puspal Bhabak and Asish Mukhopadhyay. Multi-gene linear separability of gene expression data in linear time

51

Fabio Fassetti, Ofelia Leone, Luigi Palopoli, Simona E. Rombo and Adolfo Saiardi. IP6K gene identification by tag search

55

Irina Astrovskaya, Kelly Westbrooks and Alexander Zelikovsky. Reconstruction of HCV quasispecies haplotypes from 454 Life Science reads

59

Dumitru Brinza, Fiona Hyland and Asim Siddiqui. Genome Completion by SOLiD™ System Next-Generation Sequencing

60

Kejie Li and Michael Gribskov. RNA Structural-BLAST: A New RNA Structure Database Searching Service Using RNA Fingerprint

64

Ken Nguyen and Yi Pan. KB-MSA: Knowledge-based Multiple Sequence Alignment

68

Thomas Heider, James Lindsay, Chenwei Wang, Rachel O'Neill and Andrew Pask. Enhancing de novo genome projects with the integration of physical, linkage and syteny maps

72

Marius Nicolae, Serghei Mangul, Ion Mandoiu and Alex Zelikovsky. Estimation of alternative splicing isoform frequencies from RNA-Seq data

73

Yanyan Huang, Mazhar Khan and Ion Mandoiu. Development of Real Time RT-PCR Assays for Neuraminidase Subtyping of Avian Influenza Virus

78

James Lindsay, Dawn M Carone, Elizabeth Murchison, Andrew Pask, Gregory J. Hannon, Ion Mandoiu and Rachel J. O'Neill. Unique small RNA signatures uncovered in the tammar wallaby genome

83

84 James Lindsay, Jin Zhang, Ion Mandoiu, Yufeng Wu, Marilyn Renfree, Andrew Pask, Thomas Heider, Craig Obergfell and Rachel O'Neill. Novel multi-platform next generation assembly methods for mammalian genomes Ovidiu Daescu, Steven R. Goodman, Anastasia Kurdia and Marko Zivanic. Voronoi distance measures for protein-protein interaction networks

85

Aditi Gupta and Michael Gribskov. Predicting protein binding regions in RNA

89

Serghei Mangul and Alex Zelikovsky. Haplotype spectrum reconstruction from sequencing reads

93

Abhishek Chhetri, Forrest Wen, Yizhong Wang and Kang Zhang. Shape Discrimination Test on Handheld Devices for Patient Self-Test

94

Jiayin Wang and Jin Zhang. Identifying SNP Interactions Using Relational Clustering Logic Regression

98

André Almeida, Maria Souza and Zanoni Dias. Progressive Multiple Protein Sequence Alignment

102

iv

Arun Seetharam, Vincent Keller and Gary Stuart. Study on conservation and distribution of C2H2 Zinc-finger genes in Eukaryotes

106

Syed Toufeeq Ahmed, Sheela Kanwar, Jorg Hakenberg and Hasan Davulcu. BioEve: A Discovery Engine for Life Sciences Literature

110

Qiong Cheng, Piotr Berman and Alex Zelikovsky. Alignments of Metabolic Networks with Bounded Treewidth

114

Ming Fang and Raj Sunderraman. Maintaining Integrity Constraints in Large Distributed Bio-Ontologies

115

Promita Bose and Robert Harrison. Delaunay tessellated matrices for discriminating protein 3D models

119

Sandhya Balasubramanian, Natalia Maltsev, Dina Sulakhe, Eduardo Berrocal and Conrad 120 Gilliam. GEDI – an integrated bioinformatics platform for analysis of complex disease phenotypes Chu-Ting Chang, Chin-Huang Lai, Haw-Yueh Tang, Sung-Jan Lin, Yau-Li Huang, ChiMing Chu, Ta-Tung Li and Wei-Ming Wang. Blood Vanadium, soybean milk intake, genotype rs1160312 and family history are associated with Androgenetic Alopecia

122

Baikang Pei and Dong-Guk Shin. Using Automatically Generated Molecular Ordering Relations to Improve Biological Network Reconstruction

124

Shan Yang, Darryl Leon, Paul Stothard and Stephen Moore. Using Cloud Computing Workflow for Analyzing Next-generation Sequencing Data from Bovine Genomes

128

Jianqiang Du, Xiaomin Wu and Huqin Zhang. Mass spectrometry based proteomic analysis of Kashin-Beck disease

129

Asav Dharia, Matthew Gajdosik, Jin Jun and Craig Nelson. Introducing a New Model for the Retention of Gene Duplicates: DPI, Duplication, Purification, and Inactivation

142

Alexandre Lomsadze and Mark Borodovsky. How does genomic GC content affect gene finding accuracy?

143

Noah Zaitlen, Bogdan Pasaniuc and Eran Halperin. Estimating homologous gene expression levels in RNA-seq experiments

144

v

ISBRA 2010 SHORT ABSTRACTS

The semi-divisive approach to gene-clustering Doyo Gragn and Nickolay T. Trendafilov Department of Mathematics and Statistics, The Open University Walton Hall, Milton Keynes MK7 6AA, UK

1

Introduction

The typical microarray gene-expression data set is characterized by a large number of genes compared to the number of samples, with the gene-to-sample ratio of approximately 100-fold. The structure and the information contained in the data matrix is often overshadowed by its size. One possible method of extracting important information is by clustering the genes and/or the samples. A number of approaches, both hierarchical and nonhierarchical, have already been proposed to organize the genes into groups or clusters in such a way that similar (coexpressed) genes are placed close together. There is no unique best clustering criterion, as there is no precise and workable definition for cluster. The objective of the proposed method, like other similar methods, is to identify groups of genes that have coherent patterns of expression with large variance across samples. A strong correlation of expression patterns between genes indicates coregulation. In addition, coexpressed genes in the same cluster are likely to be involved in the same cellular processes.

2

The semi-divisive approach

Consider an n × p gene expression data matrix X with n samples (rows) and p genes (columns). The data is log-transformed, mean-centered and scaled. The proposed method forms k(≪ p) clusters of genes sequentially, with each cluster passing two steps: first ordering the genes based on certain criterion, and then partitioning the ordered genes based on another criterion. The algorithm starts with all genes. At each clustering stage, the genes are ’sorted’ or ordered based on a given criterion and a partitioning hierarchical clustering method is used to split the vector of ordered genes into two groups. The partitioning criterion is designed in such a way that the vector of ordered genes is split at the position of the ”weakest-link” or the maximum ’gap’ between two groups. Then, the group with the largest value of the criterion forms a cluster and the algorithm repeats on the second group. The process continues until either a required number of clusters is obtained or no further ordering and partitioning is possible.

1

ISBRA 2010 SHORT ABSTRACTS

2

2.1

D. Gragn & N.T. Trendafilov

The gene-ordering procedure

The genes can be sorted based on the sum of their squared correlation coeffi2 cients. Suppose, the highest squared correlation coefficient is rij corresponding (2) to indexes of genes i and j, then form the starting vector s with elements (2) (2) 2 2 s1 = i and s2 = j. Next, identify a gene with index k, that maximizes rik +rjk (3)

(3)

(i ̸= j ̸= k) and set s3 = k, i.e. s(3) = [s(2) |s3 ]. Then, select the fourth gene, (4) 2 2 2 say s4 = l, which maximizes ril + rjl + rkl , and etc. The procedure continues in a similar way until all the genes are sorted. In general, the (q + 1)th ordered gene, say m, will be the one that maximizes q ∑ i=1

2.2

rs2(q) ,m , over all m ∈ / s(q) .

(1)

i

The gene-partitioning procedure

Once the vector of indexes of the ordered genes s ≡ s(p) is found, the following criterion for partitioning is proposed. For the sake of simplicity, the formulation of the criterion is based on the correlation matrix of the ordered variables, though the computations latter involve the data matrix only. Denote by R the correlation matrix of the ordered genes and let s be partitioned into two vectors as s ≡ [s1 | s2 ] having k1 and p − k1 genes. Then R can be rewritten as the following block-matrix: [ ] R11 R12 R= , R21 R22 where Rii (i=1,2) is the correlation matrix of si , and each element of the matrix R12 = R⊤ 21 refers to the correlation coefficient between a gene from s1 and a gene from s2 . The objective is to choose the partition in such a way that the following function (determinant) is maximized: ([ ⊤ ⊤ ] [ ][ ]) v1 0 R11 R12 v1 0 δ(s1 , s2 ) = det R21 R22 0 v2 0⊤ v ⊤ 2 ) ( ⊤ ⊤ v1 R11 v1 v1 R12 v2 = det ⊤ v⊤ 2 R21 v1 v2 R22 v2 ( ⊤ ) ( ⊤ ) ( )2 = v1 R11 v1 × v2 R22 v2 − v⊤ . (2) 1 R12 v2 In other words, vectors v1 and v2 should be found that maximize (2). Obviously, this function is maximized if v1 and v2 are the eigenvectors corresponding to the largest eigenvalues of R11 and R22 , respectively, i.e.: ( )2 max δ(s1 , s2 ) = d21 (R11 ) × d21 (R22 ) − v⊤ , (3) 1 R12 v2 with R12 = 0k1 ×(p−k1 ) , if the genes in the two groups are uncorrelated. As p is large, the eigenvalue-eigenvector pairs can be efficiently obtained from the SVD of the partitioned data matrices X1 ∈ Rn×k1 and X2 ∈ Rn×(p−k1 ) .

2

ISBRA 2010 SHORT ABSTRACTS

Order-and-partition approach

3

At the optimal solution of the first stage of the algorithm, s1 gives the first cluster of k1 genes. To form the next cluster, the ordering and partitioning steps are repeated on the data matrix of the remaining p−k1 vector of genes contained ∑k in s2 . In general, the data matrix of p − i=0 ki ordered genes is used at the ith stage of the partitioning procedure with k0 = 0.

3

Application to data

We applied the order-and-partition method to a real gene expression data set publicly available from http://microarray.princeton.edu/oncology/ and the results are compared with that of the existing gene-shaving method. The proposed method gives a promising result and could be considered as a potential method for clustering genes.

Fig. 1. Heat maps for 1909 genes before clustering (Left) and the first five clusters of genes (Right)

3

ISBRA 2010 SHORT ABSTRACTS

Universal distribution of evolution rates of protein coding genes from a model of folding robustness Alexander E. Lobkovsky∗, Yuri I. Wolf, and Eugene V. Koonin March 25, 2010 The discovery of a broad distribution of evolutionary rates of protein coding genes prompted the introduction of an intuitively appealing functional hypothesis of gene evolution according to which the relative importance of a gene for the survival of the organism determine the its evolution rate. However, recent genome-scale studies prompted a rethinking of the functional hypothesis. Evolution rate was found to be most consistently and strongly correlated with expression level. In addition, the shape of the distribution of evolution rates across complete genomes was found to be roughly log-normal for all life forms. These findings argue for a purely mechanistic explanation for the variation of evolution rates. Robustness to misfolding was recently suggested as a fundamental mechanism for the observed negative correlation between protein abundance and evolution rate. We implement the misfolding robustness hypothesis within a tractable folding model and compute the distribution of evolution rates of model sequences. The shape of the distribution is insensitive to the details of the model and resembles empirical distributions better than the log-normal distribution. Whereas each protein-coding gene evolves at a characteristic rate that remains relatively constant over long periods of time [Zuckerkandl and Pauling, 1965], the characteristic rates span three to four orders of magnitude. The reasons for the broad rate variability and the nature of the factors which determine the evolution rate (ER) of a particular gene remain largely unknown although a number of hypotheses have been advanced. A rapid increase in the amount and quality of genome-scale data has enabled systematic studies of ER’s variability and correlates. Several pieces of evidence point to a fundamental and universal nature of the processes that determine the ER of a particular gene. Although a complex structure of correlations between the gene’s ER and other quantities such as expression level (EL), connectivity of the product in protein interaction networks, knockout effect, and propensity for loss, have been discovered, the strongest and most significant correlation is between ER and EL [Wolf et al., 2006]. Comparison of the ER of homologous domains embedded in differently expressed proteins supports the idea that the EL is an important determinant of the ER [Wolf et al., 2008]. Recent studies revealed a remarkable constancy of the shape of the ER distributions across sets of orthologous genes in diverse life forms, from bacteria to mammals [Grishin et al., 2000, Wolf et al., 2009]. Finally, sophisticated analysis (Wolf et al in press) of high precision EL data for a large set of orthologs in two distantly related organisms [Schrimpf et al., 2009] found that roughly half of the total ER variability can be attributed to structural-functional constraints and the EL. The importance of the structural constraints together with the universality of the ER distribution suggest that a fundamental microscopic mechanism governs sequence evolution in all life-forms. More specifically, the fitness of a sequence, which determines the mutant fixation probabilities and is therefore at the core of any model that predicts ER, must derive from a physical (the same in all life forms) and not biological characteristic of the protein molecule. Since active sites typically involve only a small number of residues, differences in their physical properties cannot account for the broad variability of ER. Therefore, the physical property responsible for the protein’s fitness must involve the whole molecule. Several such global properties have been proposed to determine fitness: protein’s thermo-stability [Cui ∗ Corresponding

author, [email protected], NCBI/NLM/NIH, Bethesda, MD

1 4

ISBRA 2010 SHORT ABSTRACTS

et al., 2002, M´elin et al., 1999, Zeldovich et al., 2007], foldability [Govindarajan and Goldstein, 1997], degree of folding frustration [Nelson and Onuchic, 1998], and folding rate [Gutin et al., 1995]. The fitness mechanism must reproduce the observed negative correlation between EL and ER. Paradoxically, because highly expressed proteins must be robust to substitutions due to translational errors, mutations in the original sequence should be readily accepted apparently leading to a high ER. The resolution of the paradox is in the nature of the fitness landscape constructed using a concept of fitness which incorporates mistranslation events. Higher selection pressure, which results from higher EL, confines sequence evolution to a progressively smaller regions sequence space of high fitness. Therefore, despite frequently accepted mutations, sequence divergence remains low. The mistranslation induced misfolding (MIM) hypothesis, which postulates that protein’s fitness derives from its probability of mistranslation induced misfolding, is an example of a specific model in which the general explanation for the negative EL-ER correlation was illustrated quantitatively [Drummond and Wilke, 2008, Drummond et al., 2005, 2006]. Our effort builds on the MIM idea and implements a specific framework in which the folding probability of a model sequence can be computed and associated with its fitness. We identify networks of sequences that share a common native structure and compute the distribution of network averaged evolution rates across distinct networks. We find that the shape of the ER distribution is insensitive to the details of the model and resembles empirical distributions more closely than a log normal distribution. Before describing the details of our model for computing the ER distribution we would like to point out that macroscopic quantities such as the ER of a gene often result from an aggregation of a large number of random microscopic variables. It can be shown rigorously that when the aggregation process retains only the information of the variance of the microscopic variables, the distribution of the aggregate tends toward normal [Frank, 2009]. Therefore, it is possible to imagine a range of microscopic generative models all of which are plausible and yield the empirically observed macroscopic distribution. Let us clearly state the assumptions that comprise our model for computing ER variability of model proteins: • Model sequences have the same length N ; • A physical folding model allows for the computation of the native structure as well as the correct folding probability (CFP) for any model sequence; • Every distinct structure spawns a common native structure network of sequences connected by point substitutions; • Evolution proceeds via sequential fixation attempts of point substitutions; • The fitness of each model sequence is a monotonically increasing function of its CFP • Fixation probability of a point substitution is zero if the mutant is not in the common structure network and is a function of the fitness difference between the mutant and the original sequence otherwise; • Mutant fixation probability is computed by assuming that a single mutant exists in a population at any given time; • Evolution is assumed to have reached a stationary state whereby all sequences in the common structure network are explored and represented in the stationary ensemble; • ER is associated with each distinct common structure network and is defined as the network average mutation probability weighted by the representative fraction of the sequence in the stationary ensemble. We constructed an off-lattice folding model to compute native structures and CFP’s of the model sequences. We varied the number of monomer types between two (hydrophobic and polar types) and four (hydrophobic, polar and charged). We also varied the details of the pair interaction potentials 2 5

ISBRA 2010 SHORT ABSTRACTS

4

Model, A = 0.5 × 10−4 Homsa Bursp

Normalized log(ER)

2

0

-2

-4 -4

-2

0

2

4

Normal order statistic medians

Figure 1: Normal probability plot comparing the model derived ER distribution to empirically observed ER distribution obtained for Homo Sapience (Homsa) and Burkholderia cenocepacia (Bursp). such as the strength of the long range repulsion between the hydrophobic and other monomers as well as the screening length of the Coulomb interactions. We also investigated three different monotonic relationships between the CFP and the fitness. We find a broad distribution of common structure network sizes and densities in our model in accord with previous findings [Bornberg-Bauer, 1997, Bornberg-Bauer and Chan, 1999, Chan and BornbergBauer, 2002]. The network averaged ER distribution exhibits a peak at a value which depends on the population size in the same way as a neutral mutation rate. The distribution of the logarithms of the ER has a heavy tail on the low ER side consistent with empirically measured ER distributions Wolf et al. [2009]. Figure 1 summarizes the three way comparison between a log-normal distribution, the model derived distribution and the empirical distribution in a normal probability plot. Both, the model derived and the empirical ER distributions deviate from the log-normal distribution the same way (concave upward). In summary, we explored the implications of the hypothesis that protein fitness is determined solely by the folding robustness of the protein molecule for variability of the sequence evolution rate. The shape of ER distribution derived from the model is insensitive to the model details and closely resembles the empirical universal distribution. In addition, The model naturally reproduces the observed negative correlation between ER and protein abundance. Our model is simple enough to be tractable but appears to be sufficiently rich to mimic the major forces that govern protein folding. Therefore the results seem to suggest that the course of protein evolution is determined primarily by the fundamental principles of protein folding which are the same across all life forms. Under this view, protein evolution is constrained by purifying selection against misfolding-inducing mutations but is less dependent on functional selection.

References E. Bornberg-Bauer. How are model protein structures distributed in sequence space? Biophys. J., 73 (5):2394–2403, November 1997. doi: 10.1016/S0006-3495(97)78268-7.

3 6

ISBRA 2010 SHORT ABSTRACTS

E. Bornberg-Bauer and H. S. Chan. Modeling evolutionary landscapes: mutational stability, topology, and superfunnels in sequence space. Proc. Natl. Acad. Sci. USA, 96(19):10689–94, 1999. H. S. Chan and E. Bornberg-Bauer. Perspectives on protein evolution from simple exact models. Appl. Bioinformatics, 1(3):121–144, 2002. Y. Cui, W. H. Wong, E. Bornberg-Bauer, and H. S. Chan. Recombinatoric exploration of novel folded structures: A heteropolymer-based model of protein evolutionary landscapes. Proc. Natl. Acad. Sci. USA, 99(2):809–814, January 2002. D. A. Drummond and C. O. Wilke. Mistranslation-induced protein misfolding as a dominant constraint on coding-sequence evolution. Cell, 134(2):341–352, 2008. D. A. Drummond, J. D. Bloom, C. Adami, C. O. Wilke, and F. H. Arnold. Why highly expressed proteins evolve slowly. Proc. Natl. Acad. Sci. USA, 104(40):14338–14343, 2005. D. A. Drummond, A. Raval, and C. O. Wilke. A single determinant dominates the rate of yeast protein evolution. Mol. Biol. Evol., 23(2):327–337, 2006. S. A. Frank. The common patterns of nature. J. Evol. Biol., 22:1563–1585, 2009. S. Govindarajan and R. A. Goldstein. Evolution of model proteins on a foldability landscape. Proteins, 29(4):461–466, December 1997. N. V. Grishin, Y. I. Wolf, and E. V. Koonin. From complete genomes to measures of substitution rate variability within and between proteins. Genome Research, 10(7):991–1000, 2000. A. M. Gutin, V. I. Abkevich, and E. I. Shakhnovich. Evolution-like selection of fast-folding model proteins. Proc. Natl. Acad. Sci. USA, 92:1282–1286, February 1995. R. M´elin, H. Li, N. S. Wingreen, and C. Tang. Designability, thermodynamic stability, and dynamics in protein folding: A lattice model study. J. Chem. Phys., 110(2):1252–1262, January 1999. doi: 10.1063/1.478168. E. D. Nelson and J. N. Onuchic. Proposed mechanism for stability of proteins to evolutionary mutations. Proc. Natl. Acad. Sci. USA, 95:10682–10686, September 1998. S. P. Schrimpf, M. Weiss, L. Reiter, C. H. Ahrens, M. Jovanovic, J. Malmstrom, E. Brunner, S. Mohanty, M. J. Lercher, P. E. Hunziker, R. Aebersold, C. von Mering, and M. O. Hengartner. Comparative functional analysis of the Caenorhabditis elegans and Drosophila melanogaster proteomes. PLoS Biology, 7(3):e1000048, 2009. doi: 10.1371/journal.pbio.1000048. M. Y. Wolf, Y. I. Wolf, and E. V. Koonin. Comparable contributions of structural-functional constraints and expression level to the rate of protein sequence evolution. Biol. Direct, 3(1):40, 2008. Y. I. Wolf, L. Carmel, and E. V. Koonin. Unifying measures of gene function and evolution. Proc. Biol. Sci., 273:1507–1515, 2006. Y. I. Wolf, P. S. Novichkov, G. P. Karev, E. V. Koonin, and D. J. Lipman. The universal distribution of evolutionary rates of genes and distinct characteristics of eukaryotic genes of different apparent ages. Proc. Natl. Acad. Sci. USA, 106(18):7273–7280, 2009. K. B. Zeldovich, P. Chen, and E. I. Shakhnovich. Protein stability imposes limits on organism complexity and speed of molecular evolution. Proc. Natl. Acad. Sci. USA, 104(41):16152–16157, 2007. E. Zuckerkandl and L. Pauling. Evolutionary divergence and convergence of proteins. In V. Bryson and H. J. Vogel, editors, Evolving Gene and Proteins, pages 97–166. Academic Press, New York, 1965.

4 7

ISBRA 2010 SHORT ABSTRACTS

Quartet decomposition and its applications to study evolutionary histories of genes in genomes Olga Zhaxybayeva1 , David Williams2, Kristen Swithers2, J. Peter Gogarten2 1

Department of Chemistry and Biochemistry, Mount Allison University, Sackville, NB, Canada 2 Department of Molecular and Cell Biology, University of Connecticut, Storrs, CT, USA, {[email protected]}

Abstract. Horizontal gene transfer is an important evolutionary process shaping genomes of microorganisms. Its presence complicates an inference of how different microbial lineages related to each other based on the examination of their gene content. We introduce a method, which we call quartet decomposition that examines histories of individual genes in genomes and assesses the impact of horizontal gene transfer onto those histories. This method does not require genes to be concatenated, and can deal with gene families that are not present in all analyzed genomes. We illustrate application of this method to two bacterial groups: marine cyanobacteria and Thermotogales. Keywords: horizontal gene transfer, embedded quartet, spectrogram, cyanobacteria, Thermotogales

1 Introduction In an era of affordable genomic sequencing, we now have access to over a thousand prokaryotic genomes [1], with many more in progress, and with at least some sequencing efforts targeted for a balanced representation of the diversity of life [2]. Only a handful of genes (core genes) are shared among all prokaryotic genomes and gene content varies dramatically even if different strains of the same species are compared [3, 4]. Reconstruction of phylogenetic trees from concatenated core genes leads to evolutionary scenarios representing only a small fraction of genes in genomes (dubbed to be “the tree of one percent” [5]) and ignores the evolutionary histories recorded in the remainder of each genome. Horizontal gene transfer is now recognized as an important evolutionary force shaping the evolution of prokaryotic genomes [6] and is shown to affect a large fraction of genes in genomes (e.g., [7-9]). Here we describe a method that examines and compares individual histories of protein-coding genes in a set of genomes. This method does not

8

ISBRA 2010 SHORT ABSTRACTS

make the assumption that all genes in genomes share the same evolutionary history and can deal with gene families not present in all analyzed genomes.

2 Quartet Decomposition Method As the name of the method implies, each gene family in a group of analyzed genomes is “decomposed” into all possible four-taxon trees that are consistent with (embedded within) a corresponding gene tree (Fig. 1).

Fig. 1. Illustration of an embedded quartet in a gene tree. In each gene tree (shown in gray thin lines), we look at the relationship of any four external nodes at a time (shown in thick black lines), which we call an embedded quartet. All possible combinations of four external nodes (taxa) are examined. In one gene tree, an embedded quartet can support only one of three possible phylogenetic relationships (tree topologies) among the four taxa. However, in other gene trees, an alternative phylogenetic relationship for this quartet might be observed.

Since we only consider four taxa at a time, gene families present in as few as four genomes can be analyzed and compared to larger gene families. After all embedded quartets are evaluated for the topology supported in each gene family, the results are pooled together for all gene families and can be summarized in various ways: (i) support for all topologies of each embedded quartet can be summarized in a histogram (which we call a spectrogram) that allows to see how much conflicting evolutionary signal is present in analyzed genomes; (ii) embedded quartet scatterplots can be used to screen gene families according to a specific scenario (e.g., by GC content or ecological/functional division); (iii) quartet topologies supported by plurality or majority of gene families can be combined into a supertree (plurality topology) or a network, such as NeighborNet [10]; (iv) a list of gene families in disagreement with the plurality or a reference topology can

9

ISBRA 2010 SHORT ABSTRACTS

be extracted. For more details about quartet decomposition methodology, see refs. [8, 11].

3 Impact of horizontal gene transfer on evolution of marine cyanobacteria Prochlorococcus marinus and marine Synechococcus clade A are closely related genera of marine cyanobacteria that due to their abundance in the oceans play an important role in global carbon cycling. Despite having >96% 16S rRNA identity, a remarkable genome divergence was observed between (and within) Prochlorococcus marinus and marine Synechococcus. Using quartet decomposition, we analyzed gene families present in 18 available genomes of P. marinus and marine Synechococcus, in an attempt to assess how horizontal gene transfer and vertical inheritance shaped the evolution of these genomes and their adaptation to environmental constraints. The analyses revealed a complex evolutionary scenario relating these genomes, with the most plausible explanation being that the plurality of genes does not reflect the organismal evolution of these genomes but rather reflects “highway(s) of gene sharing” [7]. It appears that some Prochlorococcus genomes acquired many genes from marine Synechoccoccus, resulting in genomes that have become more “Synechococcus-like” but still maintain genes for their ecological niche. The frequent gene exchange appears to have lead to a phylogenetic signal reflecting gene sharing and not organismal histories [12], a process that in analogy to despeciation [13] in this case could be labeled as “degeneration.”

4 Use of embedded quartets in analyses of Thermotogales genomes Members of the Thermotogales are thermophilic anaerobes found in marine and fresh water geothermal environments. There are five genera within this order for which genomic sequences are available. The five genomes of the Thermotoga genus analyzed here (Thermotoga sp. RQ2, T. maritima, T. naphthophila, T. neapolitana, and T. petrophila) share strong synteny and other than T. neapolitana may be considered different strains of the same species based on Average Nucleotide Identity and 16S rRNA data [14]. This synteny allows for a comprehensive analysis of the core genes of the Thermotoga providing insight into recombination and perhaps speciation events. Gene transfer among these genomes may represent within or between population exchanges. Gene trees with five members, in this case representing the core genome of five Thermotoga, can be represented by five quartets. Following quartet decomposition analysis we plotted the five sets of quartets for each gene tree along the genome of T. neapolitana. Each configuration was plotted with a different color depending on whether

10

ISBRA 2010 SHORT ABSTRACTS

it was consistent with the plurality topology or one of the other two topologies and the intensity of color according to the bootstrap support. This approach allowed us to infer individual homologous replacement events in which multiple adjacent genes where transferred. For example, a region of approximately 23 kilobase pairs containing 22 mostly metabolic genes was likely transferred among T. maritima, T. sp. RQ2 and T. petrophila. Acknowledgments. This work was in part supported by Canadian Institute of Health Research Postdoctoral Fellowship to O.Z. Work in the laboratory of J.P.G. was supported by grants from the NASA Exobiology (NNX08AQ10G) and NSF Assembling the Tree of Life (DEB 0830024) programs.

References 1. 2. 3. 4.

5. 6. 7. 8.

9. 10. 11.

12. 13. 14.

Genomes Online Database, http://www.genomesonline.org/ Wu, D., et al.: A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature 462, 1056-1060 (2009) Rasko, D.A., et al.: The pangenome structure of Escherichia coli: comparative genomic analysis of E. coli commensal and pathogenic isolates. J Bacteriol 190, 6881-6893 (2008) Welch, R.A., et al.: Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli. Proc Natl Acad Sci U S A 99, 17020-17024 (2002) Dagan, T. and W. Martin: The tree of one percent. Genome Biol 7, 118 (2006) Gogarten, M.B., J.P. Gogarten, and L. Olendzenski, Horizontal Gene Transfer: Genomes in Flux. 1st ed. 2009: Humana Press-Springer. Beiko, R.G., T.J. Harlow, and M.A. Ragan: Highways of gene sharing in prokaryotes. Proc Natl Acad Sci U S A 102, 14332-14337 (2005) Zhaxybayeva, O., J.P. Gogarten, R.L. Charlebois, W.F. Doolittle, and R.T. Papke: Phylogenetic analyses of cyanobacterial genomes: quantification of horizontal gene transfer events. Genome Res 16, 1099-1108 (2006) Zhaxybayeva, O., et al.: On the chimeric nature, thermophilic origin, and phylogenetic placement of the Thermotogales. Proc Natl Acad Sci U S A 106, 5865-5870 (2009) Huson, D.H. and D. Bryant: Application of Phylogenetic Networks in Evolutionary Studies. Mol Biol Evol 23, 254-267 (2006) Zhaxybayeva, O., W.F. Doolittle, R.T. Papke, and J.P. Gogarten: Intertwined Evolutionary Histories of Marine Synechococcus and Prochlorococcus marinus. Genome Biology and Evolution 1, 325-339 (2009) Gogarten, J.P., W.F. Doolittle, and J.G. Lawrence: Prokaryotic evolution in light of gene transfer. Mol Biol Evol 19, 2226-2238 (2002) Sheppard, S.K., N.D. McCarthy, D. Falush, and M.C. Maiden: Convergence of Campylobacter species: implications for bacterial evolution. Science 320, 237-239 (2008) Konstantinidis, K.T. and J.M. Tiedje: Genomic insights that advance the species definition for prokaryotes. Proc Natl Acad Sci U S A 102, 2567-2572 (2005)

11

ISBRA 2010 SHORT ABSTRACTS

Identification of Periodically Expressed Genes from Their Time-Course Expression Profiles Fang-Xiang Wu Department of Mechanical Engineering and Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada e-mail: faw341@mail. usask.ca. Abstract: Time-course gene expression profiles associated with those periodic biological processes appears periodic. Identifying periodically expressed gene from their time-course expression data could help understand the molecular mechanism of those biological processes. This paper proposes a dynamic model based method for identifying periodically expressed genes from their timecourse expression profiles. In the proposed method, a periodical gene expression profile is modeled by a linear combination of trigonometric sine and cosine functions in time plus a Gaussian noise term. A two-step parameter estimation method is employed for estimating parameters in the model. On the other hand, non-periodical gene expression profiles are model by a constant plus a Gaussian noise term. The statistical F-testing is used to make a decision if a gene is periodically expressed or not. One synthetic dataset and two biological datasets were employed to evaluate the performance of the proposed method. The results show that the proposed method can effectively identify periodically expressed genes. Keywords: time-course gene expression profiles, periodically expressed gene, parameter estimation, F-testing

I. INTRODUCTION Many biological processes such as cell-cycle division exhibit periodic behaviors. Gene expression profiles associated with these periodic biological processes exhibits periodic. Identification periodically expressed gene from their time-course expression data could help understand the molecular mechanism of those biological processes. In past decade, a number of methods have been proposed to identify periodically expressed genes [1-6]. Most of these methods use Fish g-test to determine if a gene expression profile is periodic. However a recent research [7] concludes that the power of Fisher g-test is poor if the time-course data is short (less than 40 data points) and/or that data length is not an integer number of periods. Generally, it is believed that the sinusoidal function can best describe periodically expressed gene profiles. In this study, a periodical gene expression profile is modeled by a linear combination of trigonometric sine and cosine functions plus a Gaussian noise term. From previous literature [1-6], the challenge for this model is to estimate the parameters in the model as the frequency is nonlinear in the model. This study proposes a two-step linear least squares method to estimate parameters in the model. On the other hand, nonperiodical gene expression profiles are model by a constant plus a Gaussian noise

12

ISBRA 2010 SHORT ABSTRACTS

term. Instead of using Fisher g-test, the statistical F-test is used to make a decision if a gene is periodically expressed or not. Both one synthetic dataset and one biological datasets were employed to evaluate the performance of the proposed method.

II. METHODS 2.1 Model for periodically expressed gene profiles Let x(t ) (t=1,2,…, m) be a time-course gene expression profile generated from a biological process, where m is the number of time points at which gene expression is measured. To model x(t) with periodicity, we adopt the linear combination of sine and cosine functions plus a Gaussian noise term as follows: x(t ) = μ 0 + a cos(ωt ) + b sin(ωt ) + ε (t )

(1)

where a and b are the coefficients of sine and cosine function, respectively; ω is the frequency of periodic expression data; µ0 is the mean of gene expression profiles; and ε (t ) represent random errors. This study assumes that the errors have a normal distribution independent of time with the mean of 0 and the variance of σ 2 . In this study, we always shift the mean of gene expression profiles to 0. Thus model (1) is reduced to x(t ) = a cos(ωt ) + b sin(ωt ) + ε (t )

(2)

This model is equivalent to sinusoidal function model [2-6] x(t ) = A sin(ωt + Φ ) + ε (t )

(3)

which are widely used to generate the synthetic periodic gene expression profiles [7]. Given a time-course gene expression profile x(t ) (t=1, 2,…, m), estimating parameters a, b and ω in model (2) is a nonlinear estimation problem as ω is nonlinear in the model. In general, all nonlinear optimization programs can be used to estimate parameters in model (2), for example, Gauss-Newton iteration method and its variants such as Box-Kanemasu interpolation method, Levenberg damped least squares methods, and Marquardt’s method [8]. However, these iteration methods are sensitive to initial values. Another main shortcoming is that these methods may converge to the local minimum of the least squares cost function, and thus cannot find the real values of the parameters. Our observation is that noise free model (2) x(t ) = a cos(ωt ) + b sin(ωt )

can be viewed as the general solution of a following differential equation

&x&(t ) + ω 2 x(t ) = 0 2

(4) second order ordinary

(5)

and that ω is linear in equation (5) which is independent of a and b. Therefore, we propose the following two-step parameter estimation methods to estimate parameters a, b and ω in model (2):

13

ISBRA 2010 SHORT ABSTRACTS

Step1: numerically calculate the second derivative of x(t). Then based on equation (5), use linear least squares method to estimate parameter ω2. Step2: Substitute the estimated value of ω into equation (2). Apply the least squares method to model (2) to estimate parameters a and b.

2.2 Hypothesis testing The main concern of this study is if a gene is periodically expressed. This is a genuine question which is not easily to answer through statistical inference. However, a practice way of testing periodicity in the literature [2-6] is to perform a statistical hypothesis test of either the gene expression profile is pure normal white noise or it is periodic as specified by equation (2). That is to test the null hypothesis of H0:

x(t ) = ε (t )

(6)

versus the alternative hypothesis of x(t ) = a cos(ωt ) + b sin(ωt ) + ε (t )

Ha:

(2)

Here it is assumed that the mean of gene expression profiles has been shifted to zero. To perform such hypothesis test, we define the following F-statistic F=

2 ⎞ m − 4 ⎛ S0 ⎜⎜ 2 − 1⎟⎟ 3 ⎝ S1 ⎠

(7)

where S 02 is the residual of model (6) and S 12 is the residual of model (2) with estimated parameters. According to model selection principle [8], F-statistic (7) follows the F-distribution with the degrees of freedom (3, m-4). When the value of F -statistic is large enough (greater than a threshold), model (6) is rejected, i.e., the gene expression profile exhibits periodic behaviour, and otherwise the gene expression profile appears white noises.

III. EXPERIMENT RESULTS AND CONCLUSION Bothe one synthetic dataset and one biological dataset are employed to investigate the performance of the proposed method in different aspects. Synthetic dataset (SYN): The synthetic dataset is generated by the sine function modeling periodical genes expression profiles and by white noise modeling nonperiodic gene expression profiles. By these models, two datasets, each having 300 profiles, are created: one is periodically expressed gene expression dataset (PEGED) and another one is non-periodically expressed gene expression (noisy) dataset (nPEGED). The proposed method is applied to both PEGED and nPEGED to investigate the accuracy of the proposed method in identifying periodically expressed genes. The significance level is chosen as 0.05 and thus threshold value is calculated as 3.7083 by icdf (‘f’, 1-0.05, 3, 10) which is a MatLab standard function. Using this threshold, the proposed method determines all synthetic expression profiles in PEGED to be periodic. This indicates that the accuracy of proposed method is 100% in PEGED. On the other hand, the proposed method determines 7 (2.33%) synthetic

14

ISBRA 2010 SHORT ABSTRACTS

expression profiles in nPEGED to be periodic. Therefore, we can conclude that the proposed method can accurately identify periodic expression profiles. Eluration-synchronized gene expression data of the yeast (ELU): Excluding missing data, the resultant dataset contains the expression profiles of 5766 genes. This dataset is used to illustrate the performance of the proposed method on the reallife biological dataset. Again the significance level is chosen as 0.05. The proposed method determines that 362 genes are periodically expressed while other genes are not periodically expressed. We have manually checked 8 profiles randomly chosen from 362 genes identified to be periodically expressed, which really look like periodical signals; and 8 profiles chosen from other genes identified not to be periodically expressed, which does not look like periodical signals at all. In conclusion, this paper has presented a trigonometric function model-based method to identify periodically expressed genes from their time-course expression profiles. In the presented method, a two step linear least squares method is proposed to estimate all model parameters including the frequency. In addition, unlike those existing method using Fisher g-test, the presented method naturally uses F-test to determine if a gene expression profile appears periodic or not. Computational experiments on one synthetic dataset and one biological dataset show that the proposed method can effectively identify periodically expressed genes from their time-course expression profiles. Acknowledgment. This research is supported by Natural Sciences and Engineering Research Council of Canada (NSERC). References

1.

2.

3. 4. 5.

6. 7.

8.

Spellman, P.T., et al.: Comprehensive identification of cell cycle-regulated genes of the Yeast Saccharomyces cerevisiae by microarray hybridization. Molecular Biology of the Cell 9 (1998) 3273-3297 Harmer S., Hogenesch J. B., Straume M., Chang H. S., Han B., Zhu T., Wang X., Kreps J. A., and Kay S. A.: Orchesterated transcription of key pathways in Arabidopsis by the circadian clock. Science 290 (2000) 2110-2113 Wichert S., Fokianos K. and Strimmer K.: Identifying periodically expressed transcripts in microarray time series data. Bioinformatics 20 (2004) 5-20 Chen J.: Identification of significant period genes in microarray gene expression data. BMC bioinformatics 6 (2005) 286 Glynn E. F., Chen J., and Mushegian A.R.: Detecting periodic patterns in unevenly spaced gene expression time series using Lomb-Scargle Periodogrms. Bioinformatics 22 (2006) 310-316 Chen J. and Chang K. C.: Discovering Statistically significant period Gene expression. International Statistical Review 76 (2008) 228-246 Liew A. W. C., Law N. F. Cao X. Q., and Yan H.: Statistical power of fisher test for the detection of short periodic gene expression profiles. Pattern Recognition 42 (2009) 549-556 Beck J.V. and Arnold K. J.: Parameter Estimation in Engineering and Science. New York: John Wiley & Sons, 1977

15

ISBRA 2010 SHORT ABSTRACTS

Meta-analysis of cattle embryo gene expression studies profiling various reproductive technologies, developmental stages and tissue sources Heather Adams1, Bruce Southey2, Robin Everts1,3, Sadie Marjani4, Cindy Tian5, Harris Lewin1,6, Sandra Rodriguez-Zas1,6,7 1

Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, Illinois Department of Chemistry, University of Illinois at Urbana-Champaign, Urbana, Illinois 3 Current address: SEQUENOM, Inc., San Diego, California 4 Department of Genetics, Yale University School of Medicine, New Haven, Connecticut 5 Center for Regenerative Biology/Department of Animal Science, University of Connecticut, Storrs, Connecticut 6 Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 7 Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, Illinois 2

Studying the differences in efficiency between reproductive technologies has become a growing area of research, as somatic cell nuclear transfer (SCNT) emerges as a more widely used and accepted method of reproduction. Efficiency measures are based on developmental attributes of embryos created via different technologies. The development of an embryo is affected by numerous genetic, non-genetic and maternal factors, which proves difficult in identifying the underlying mechanisms playing a role in these efficiency differences. Previous work has identified genes at specific developmental stages (pre- or postimplantation), or considered a single tissue source or specific reproductive technology. The goal of the current study was to integrate information from four microarray gene expression experiments that encompassed various developmental stages (7-day, 25-day and 280-day), tissue sources (embryonic or placental) and reproductive technologies (artificial insemination, AI; somatic cell nuclear transfer, SCNT) to gain a more complete understanding of the mechanisms influencing efficiency in cattle using metaanalytical techniques. The availability of the raw data allowed for the comparison of three approaches to integrate information across experiments including (1) an individual-study analysis, (2) a study-level meta-analysis, and (3) a sample-level meta-analysis to identify expression differences between AI and SCNT technologies regardless of developmental stage or tissue source (stage one), as well as an additional approach to identify differences across all combinations of developmental stage, tissue source and reproductive technology (stage two). Results from stage one study-level and sample-level metaanalyses uncovered 168 and 380 genes with differential expression between AI and NT, respectively. The largest overlap of significantly differentially-expressed genes occurred between an individual study and the sample-level meta-analysis. Further, study- and sample-level approaches identified an additional 434 genes not detected in any individual-study analysis, illustrating the advantages of meta-analysis to find consistency among studies, as well as having the ability to identify genes associated with differences between AI and NT not previously identified by individual studies. Stage two identified genes associated with a specific technology, stage or source, or regardless of technology, stage or source. The largest number of significantly differentially expressed genes was identified when looking between developmental stages (1149 genes), suggesting developmental stage has more influence on gene expression differences than reproductive technology or tissue source. Enrichment analysis of the differentially expressed genes from both stages identified numerous functional categories associated with differences between conditions, including regulation of anatomical structure morphogenesis, regulation of cell shape and mitotic cell cycle. Results from the stages identified numerous genes associated with significant differential expression between conditions, including POLG2, PRDX1, DKK3, OXT, ASXL1, THRAP6, and AKT1. Further research into these genes may provide additional insights into efficiency differences and embryo development. Key words: meta-analysis, gene expression, cattle, embryo development

16

ISBRA 2010 SHORT ABSTRACTS

A novel meta-classification strategy to identify candidate gene biomarkers for accurate sample classification Heather Adams1, Bruce Southey2, Sandra Rodriguez-Zas1,3,4 1

Department of Animal Sciences, University of Illinois at Urbana-Champaign, Urbana, Illinois Department of Chemistry, University of Illinois at Urbana-Champaign, Urbana, Illinois 3 Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, Illinois 4 Department of Statistics, University of Illinois at Urbana-Champaign, Champaign, Illinois 2

Research is currently being conducted to classify samples into groups (i.e. treatments, stages, conditions) based on gene expression profiles, however most of the approaches use gene expression as a dependent variable to detect potential biomarkers and not the actual classification group. Other issues arise as numerous studies consist of a large number of potential classifiers and few samples, such as the case with microarray gene expression experiments. Candidate biomarkers identified in these studies also tend to exhibit decreased performance when applied to other experiments. To address these concerns, a multi-stage meta-classification approach to identify classifier biomarker genes that provide accurate classification of samples across experiments was analyzed. In stage one, non-linear mixed-effects logistic models were used to identify potential classification genes across studies. Three approaches that integrate information across experiments on a by-gene basis were compared: (1) identification of genes with overlapping significant expression profiles association with the classes, (2) collective analysis of estimates of the association between gene expression and classes obtained from each experiment (meta-analysis), and (3) collective analysis of the association between gene expression and classes across all experiments (meta-analysis). In stage two, complementary multi-gene classification approaches based on potential classifier genes identified in stage one were evaluated. Genes were combined into classification functions using various variable selection criteria (stepwise, forward and backward methods) that were applied to the data, and the corresponding functions were evaluated based on the calculation of correct classification rates. In stage three, additional classification genes were identified using a score selection method applied to all genes in the original dataset, allowing for the best combinations of classifiers to be outputted and analyzed. A leave-one-study-out cross-validation analysis assessed the performance of the identified potential classifier genes on samples separate from those used in the identification of the biomarkers. The multi-stage approach was applied to eight microarray gene expression experiments comparing the expression patterns of honey bees of different species and sub-species from two behavioral classes (nurse, forager). Stage one results confirmed the ability of meta-analytical techniques (approaches 1 and 2) to collectively analyze estimates and detect biomarkers that were overlooked when each experiment was analyzed independently (approach 1). Further identification of classification functions and cross-validation in stage two discovered gene sets with exceptional correct classification rates. Additional gene sets identified in stage three complemented stage two results, and presented comparable classification performance. These additional sets can be used to provide supplementary information into the processes, functions and pathways of the genes that differ between classes. To demonstrate this, results from the approaches were examined using functional enrichment analysis to uncover biological processes, molecular functions and pathways that were over-represented among the genes. Results from the approaches identified numerous genes as potential biomarkers for behavior including Glyp, Dip-C, Lys-2, MED31, PNUTS, Tsf1 and GB15988. Results of the functional enrichment analysis found over-representation of genes with biological processes chemical homeostasis (GO:048878), transition metal ion transport (GO:0000041), and di-, tri-valent inorganic cation homeostasis (GO:0030005) among others. By combining statistical significance associations between gene expression and odds of behavior, and classification error rates, numerous biomarker genes and corresponding functional information regarding behavior classification was identified. The proposed multistage meta-classification strategy is a practical and informative method that offers consistent and reliable classification across experiments. Keywords: meta-analysis, classification, gene expression, honey bee maturation

17

ISBRA 2010 SHORT ABSTRACTS

Curray: A Query Language for Microarray Data Management and Analysis? Aminul Islam and Hasan Jamil Department of Computer Science Wayne State University, USA [email protected], [email protected] Abstract. In principle, gene expression data can be viewed as providing just the three-valued expression profiles of target biological elements relative to an experiment at hand. Although complicated, gathering the expression profiles does not pose much of a challenge from a query language standpoint. What is interesting is how these expression profiles are used to tease out information from the vast array of information repositories that associate meaning to the expression profiles. Since such annotations are inherently experiment specific functions, much the same way queries in databases are, developing a querying system for gene expression data appears to be pointless. Instead, developing tools and techniques to support individual assignment has been considered prudent. In this paper, we explore, possibly for the first time, the possibility of supporting a gene expression data management and querying system that is able to support both needs and reduce impedance mismatch between systems. To this end, we propose a new platform independent and general purpose query language called Curray, for Custom Microarray query language, to support online expression data analysis using distributed resources. We show that Curray’s declarative and extensible features nimbly allow flexible modeling and room for customization.

1

Recipe for Curray

The lack of declarative query languages for microarray data manipulation is remarkable, given that the community is keenly aware of its potential. The emergence of MIAME [3] actually acknowledges the fact that we need a high level view for expression data that hides the lower level physical view of the microarrays. ArrayExpress tool suit [7] provides library support for the migration of Affymterix, Agilent and Illumina data to its FGEM (Final Gene Expression Matrix) format, from which conversion to MAGE-ML [4] or MAGE-TAB is routine. While systems such as Bioconductor/R [5] allows analysis support for Affymterix in the form of a rich collection of libraries, it also supports data in emerging formats such as MIAME, MAGE-ML, MAGE-TAB and ArrayExpress. The question now is whether it is possible to develop a data management system and a query language for expression data that does not depend upon specific tools such as Bioconductor and allows flexibility in terms of data management. As a ?

Research supported in part by National Science Foundation grants CNS 0521454 and IIS 0612203.

18

ISBRA 2010 SHORT ABSTRACTS

byproduct, it will also offer seamless integration with other declarative systems such as SQL, BioFlow [6, 2] and Prolog, for example, to allow access to conventional data and reasoning engines. We believe the answer is in the positive. 1.1

Curray as a Declarative Language for Expression Data

To be able to view all expression data uniformly, we make a clear separation between the conceptual components of such data and their platform (technology, class and type) related components so that at the application level, queries can be asked in uniform ways without referring to the low level details. To offer control of data to the user, we also allow drill down and roll up type of concepts on expression data. For example, using R/Bioconductor’s affy library [1] the probe set intensities can be summarized to form one expression value for each gene. Probe level data are often used for quality control, RNA degradation assessments, different probe level normalization and background correction procedures, and flexible functions that permit the user to convert probe level data to expression measures. However, this function is not available for MAGE-ML data. So, queries need to be performed over Affymetrix level data if that information is of interest. Thus, providing support for one or the other can be limiting and may motivate users to stay close to lower level data that offers them the control they may need. We view Curray as a sub-language of our recently proposed language BioFlow [6] in our LifeDB database management system for Life Sciences applications [2]. Curray provides all necessary constructs needed to handle expression data while BioFlow support basic data manipulation, data integration and representation mismatch (XML, relational, text, etc.). We divide representation and manipulation of expression data into three categories in Curray - data definition, concept definition and expression data manipulation language. We take advantage of the fact that almost all microarray data are read only, and hence no updates are expected. Consequently, views created from base data are also read only. We now present the proposed language Curray using several example fragments. Data Definition Language Assume that we have two gene expression data sets, epilepsy data and tumor data, generated from brain tissue samples - one in Agilent format and another in MAGE-ML. We can then define the data sets as shown in statements 1(a) through 1(d) for storage in Curray. In statement 1(a), we say that epilepsy data table should be created in text format using the agilent file format definition, and gene id should be used as the primary key. That also means, that agilent format definition must be declared so that create expressiontable is well defined. Such format definitions can be declared using the construct shown in statement 1(b). The tables annotation and probe set in statement 1(b) can be defined using the syntax in statement 1(a) already discussed, giving it an unrestricted hierarchical structure, where the terminal definition will not have the required clause in the define format statement. Notice that these definitions induce a dependency graph and join path among the set of tables that constitute the epilepsy data. Also, since the format for agilent (or affymetrix) data is standard, it needs to be defined only once in the database.

19

ISBRA 2010 SHORT ABSTRACTS

Once defined, we can statically create a MAGE-ML version of the agilent table epilepsy data as shown in statement 1(c). create expressiontable epilepsy data format agilent text primary key gene id

define format agilent { gene id string, probeName string, . . . } required subtables annotation foreign key gene id, probe set foreign key probe id;

(a)

(b)

create expressionview epilepsy data ml format mage-ml from epilepsy data using FunctionName;

(c)

create view diff express genelist as select gene id from epilepsy data where experiment = brain study group by gene id having fold change > 2.0 using library bioconductor

(e)

select regulation, . . . from roll-up epilepsy data to mage-ml using FunctionName where θ

(d) define function DAVID INPUT URL from http://david.abcc.ncifcrf.gov/summary.jsp submit (gene list string, type of id string); define function DAVID extract pathway string, related genes string, fdr float using wrapper david wrapper mapper david mapper from URL david url submit (species string | david url string); call DAVID (’homo sapiens’, call DAVID INPUT with diff express genelist);

(f) Fig. 1. Data definition and data manipulation statements in Curray.

Note that there are numerous functions for converting affymetrix and agilent format to MAGE-ML format and we could choose one based on system preference. For the purpose of quality assurance, the output of FunctionName must match format definition in the format clause. Since we are not aware of any function that can create a lower level view such as MAGE-ML to affymetrix, we postulate that it is not that useful to drill down dynamically. But it is certainly possible to imagine a dynamic conversion to a higher level view such as MAGE-ML from agilent at query time as shown in statement 1(d) in which we are using a qualifier in front of a table name (in text, relation or XML format) to convert it to MAGE-ML format. The only difference is that this conversion will not be materialized, and regulation will be an attribute in the converted table. Since dynamic drill down is not feasible in the absence of suitable map functions1 , the user must query the lower level representation that are always part of the database such as the epilepsy data table when needed, e.g., application of the affy library discussed above. The definition suit above actually makes it possible to view the collection of tables that define epilepsy data as a single unit connected internally as a graph. These connections using join path can also be established by BioFlow’s link and combine type object aggregation and record linkage constructs. As a consequence, defining data manipulation syntax becomes interesting. 1

But if and when reverse map functions become available, an equivalent drill-down option can be specified as drill-down epilepsy data ml to agilent using FunctionName and we can achieve platform transparency by being able to move from one format to the other.

20

ISBRA 2010 SHORT ABSTRACTS

Data Manipulation Language Data manipulation in Curray is accomplished using its native statements for basic expression data querying and using BioFlow for complex data processing involving local or remote data repositories and tools. The basic Curray expression functions follow the suggestions in [8] that advocates four basic functionalities – class discovery, class comparison, class prediction and mechanistic studies. Figure 1 depicts a few representative statements. The statement 1(e) states that a fold change analysis needs to be performed and all genes that have a fold change greater than 2.0 should be returned in a brain study experiment using the fold change function in bioconductor library. The second statement in 1(f) actually uses the online database DAVID for a pathway analysis where false discovery rate (FDR) is used as cutoff value for the gene list in diff express genelist computed using 1(e) where we show the BioFlow define function statement that implements the DAVID pathway function. It may be noted here that all these statements follow SQL’s compositional completeness property and hence can be composed to arbitrary level of nesting. A similar statement can be developed for GO term analysis using DAVID, without much effort. A discussion on the complete set of possible query constructs is left as a future exercise. Before we conclude this discussion, we must mention that any complex analysis that is not covered by the basic Curray statements can be easily performed using BioFlow. The most enabling feature in Curray is in the way expression data is interpreted in the create expressiontable and create expressionview statements, and how they are organized behind the scene so that users need not worry about the details. Furthermore, the way an SQL-like feeling is offered is novel in which analysis functions are blended in through the using library option in the having clause.

References 1. Affy package. http://www.bioconductor.org/packages/2.0/bioc/html/affy. html. 2. A. Bhattacharjee, A. Islam, M. S. Amin, S. Hossain, S. Hosain, H. Jamil, and L. Lipovich. On-the-fly integration and ad hoc querying of life sciences databases using LifeDB. In DEXA, 2009. 3. A. Brazma et al. Minimum information about a microarray experiment (miame)—[mdash]—toward standards for microarray data. Nat Genet, Dec 2001. 4. S. Durinck et al. Importing mage-ml format microarray data into bioconductor. Bioinformatics, 2004. 5. R. C. Gentleman et al. Bioconductor: open software development for computational biology and bioinformatics. Genome biology, 2004. 6. H. Jamil and A. Islam. The power of declarative languages: A comparative exposition of scientific workflow design using BioFlow and Taverna. In IEEE SWF, July 2009. 7. A. Kauffmann et al. Importing arrayexpress datasets into r/bioconductor. Bioinformatics, Aug 2009. 8. J. Quackenbush. Computational approaches to analysis of dna microarray data. Yearbook of Medical Informatics, 1:91–103, 2006.

21

ISBRA 2010 SHORT ABSTRACTS

Computational search for preferred TRAF6/p62 ubiquitination sites.

Trafina Jadhav, Marie W. Wooten, and Michael C Wooten *

Program in Cellular and Molecular Biosciences, Department of Biological Sciences, Auburn University, Auburn, AL, 36849, USA

*Corresponding author: Michael C Wooten, Program in Cellular and Molecular Biosciences, Department of Biological Sciences, 331 Funchess Hall, Auburn University, Auburn, AL, 36849, USA, Tel: (334) 844- 4826; Fax: (334) 844-9234; [email protected].

22

ISBRA 2010 SHORT ABSTRACTS

Abstract. Ubiquitination is a dynamic process with only a small fraction of the proteome, often 300 nt [2]. In total, we found nearly 150,000 repeats classified into over 50,000 repeat families. The majority of the repeats were part of multi-copy repeat families (60%), with the multiplicity of the repeat families ranging from 2 to 230 copies. Additionally, IS and phage represent the most abundant source of repeats in these genomes. Finally, we also have added Repeatoire to the genome comparison environment M-GCAT [3] to interactively visualize the repeated regions. This allows one to quickly view repeat families in the context of intergenomic comparisons (global multiple genome alignment), which can prove useful when investigating genome dynamics across a collection of closely-related genomes and also when pinpointing repeat family expansions inside of genome alignments. Repeatoire is freely available and can be downloaded from: http://www.pasteur.fr/ip/easysite/go/03b-000042-075/units-andgroups/evolutionary-microbial-genomics/software/repeatoire References 1. Treangen, T. J., et al. (2009), 'A novel heuristic for local multiple alignment of interspersed DNA repeats', IEEE/ACM Trans Comput Biol Bioinform, 6 (2), 180-9. 2. Treangen, T. J., et al. (2009), 'Genesis, effects and fates of repeats in prokaryotic genomes', FEMS Microbiol Rev, 33 (3), 539-71. 3. Treangen, T.J., and Messeguer, X. (2006), 'M-GCAT: Interactively and efficiently constructing large-scale multiple genome comparison frameworks in closely related species', BMC Bioinformatics, 7, 433.

34

ISBRA 2010 SHORT ABSTRACTS

Quantifying differential gene connectivity between disease states for objective identification of disease-relevant hub genes Jen-hwa Chu, Scott T. Weiss, Ross Lazarus, Vincent J. Carey, and Benjamin A. Raby Channing Laboratory, Brigham and Women’s Hospital, Harvard Medical School, 181 Longwood Ave, Boston, MA 02115 USA

Abstract. Network modeling of whole transcriptome expression data enables characterization of complex epistatic (gene-gene) interactions that underlie cellular functions. Though numerous methods have been proposed and successfully implemented to develop these networks, there are no formal methods for comparing differences in network connectivity patterns as a function of phenotypic trait. Here we describe a novel approach for quantifying the differences in gene-gene connectivity patterns across disease states based on Graphical Gaussian Models (GGMs). We compare the posterior probabilities of connectivity for each gene pair across two disease states, expressed as a posterior odds-ratio (postOR) for each pair, which can be used to identify network components most relevant to disease status. We demonstrate that the GGM method reliably detects differences in network connectivity patterns in datasets of varying sample size. Applying this method to two independent breast cancer expression data sets, we identified numerous differences in network connectivity across histological grades of breast cancer. Most notably, our model identified two gene hubs (MMP12 and CXCL13) that each exhibited differential connectivity to more than 30 transcripts in both datasets. Both genes have been previously implicated in breast cancer pathobiology, but themselves are not differentially expressed by histologic grade in either dataset, and would thus have not been identified using traditional differential gene expression testing approaches. Therefore, GGM can be used to formally evaluate differences in global interactome connectivity across disease states, and can serve as a powerful tool for exploring the molecular events that contribute to disease at a systems level. Supported by R01 HL086601, RC2 HL101543 and R01 HG003646.

35

ISBRA 2010 SHORT ABSTRACTS

Intein Predictor: Prediction of Inteins with Amino Acid Sequence and Support Vector Machine Ren-Cheng Wang1, Jie-Zhi Cheng1, Kenneth Hung1, Chia-Lin Chang1, Yun-Ju Sun1 and Chung-Ming Chen1 1

Institute of Biomedical Engineering, National Taiwan University,Taipei,Taiwan [email protected]

Abstract. Inteins are special genetic elements which can parasitize in proteins. So far, more than 300 distinctive inteins have been identified and archived in the public intein database, Inbase [1]. Inteins can be widely found in various living organisms across the three life domains, i.e. archaea, eubacteria and eucarya. To assist the understanding of this kind of parasitic protein and discover unknown inteins, we propose a method for the prediction of intein sequence. First we analyze intein sequence and find that inteins have a low degree of sequence similarity and high conservative amino acid of some sites. To solve the problem, we use support vector machine as the classification tool. The experiment result suggests that over 90% prediction accuracy can be achieved by the proposed method. Keywords: Intein Prediction, SVM, Feature Selection

1

Introduction

An intein is a parasitic genetic element similar to self-splicing introns [2]. Intein are able to splice itself from its host protein and rejoin other protein segments, namely exteins, without influencing the function of the host protein. The self-splicing process of an intein is a spontaneous reaction (see Figure 1). Up to date, more than 300 distinctive inteins have been discovered and are archived in the public intein database (Inbase) [1]. Inteins can be found in various living organisms across three life domains, i.e. archaea, eubacteria and eucarya. Currently, approximate seventy percentage of discovered intein parasitize in the host genes of DNA/RNA polymerase, while other 25% or more can be retrieved in the genes of metabolism [3]. Inteins are very versatile in the biotechnological applications, e.g., protein synthesis [4], drug development, gene therapy [5], and so on. It may be ascribed to that inteins are very efficient in protein splicing. Important as it is, the evolutionary process of inteins remains controversy [2-3] and the classification of inteins has been rarely investigated. To assist the understanding and discovery of intein sequences, we propose a tool to predict intein sequences based on known inteins. The tool can distinguish inteins from other proteins and hence may help the identification of inteins in a host protein. With this tool, we may be able to

36

ISBRA 2010 SHORT ABSTRACTS

find some sequence features of inteins to boost the understanding of their evolutionary process. Intein can be functionally divided into splicing domain and DOD Endonuclease domain, as shown in Figure 2. DOD Endonuclease domain can recognize sites of 14– 40 DNA residues and usually does not require a complete match with the target sequence for a homing process to spread intein [6]. Accordingly DOD domains of inteins may vary with different target genes. Nevertheless, DOD domain is not a necessary domain for an intein. An intein without DOD domain is regarded as a miniintein. On the other hand, splicing domain exists in all inteins and plays a very crucial role for the function of protein splicing. As a result, we adopt the A, B, F, and G motifs in splicing domain as features to characterize an intein for the purpose of classification. In this study, we adopt support vector machine (SVM) technique [7] to classify inteins from other proteins. In our experiment, more than ninety percent of inteins can be successfully identified from other proteins.

Fig. 1. Self-splicing process of an intein. The self-splicing process is a spontaneous reaction.

Fig. 2. The functional composition of an intein. An intein is composed of splicing domain and DOD Endonuclease domain. Splicing domain is in charge of splicing function with four motifs, i.e., A, B, F and G. Homing process can be carried out by DOD domain.

2

Methods

We collected 378 inteins as our samples from Inbase [5] by excluding incomplete and replicated inteins. The four motifs A, B, F and G are adopted to describe an intein in our study and hence result in a sequence with 51 amino acids. Meanwhile 378 random sequences are prepared as false data by using Swiss-Prot. We note that the derived amino acid sequences vary a lot from intein to intein. To show the substantial sequence dissimilarity among inteins, we compute the pairwise difference of all amino acid sequences and plot the quantitative sequence identities in Figure 3(A). From Figure 3(A), we may find that more than 61% compared pairs hold sequence identity less than 0.3, while 0.3 is commonly regarded as the cut-off sequence identity threshold to recognize a protein family. To illustrate the occurrence probability of amino acid at each site, WebLogo [4] is performed for the evaluation. The resulting occurrence probability of amino acid is shown in Figure 3(B). Although the sequence

37

ISBRA 2010 SHORT ABSTRACTS

identity tend to below 0.3 over all C3782 compared pairs, high conservation of amino acids can be observed at 51 sites (see Figure 3(B)).

Fig. 3. (A) Amino acid sequence identities over all compared pairs among 378 inteins. Horizontal axis is the identity percentage of two compared sequence. The vertical axis stands for the numbers of compared sequence. (B) The occurrence probability of amino acid at the 51 sequence site. ‘*’ indicates that the occurrence probability is over 0.5.

To predict intein sequence, LIBSVM [7] is used for the task of classification. SVM is a powerful classification technique and can achieve satisfactory performance even when the dimensionality of data is formidably high and sample size is small. Because every amino acid has a unique hydrophobicity, it can be used to quantitatively described an amino acid in our study. In the literature there are several ways to measure hydrophobicity. Five measures of hydrophobicity from the studies of [8-12], respectively, are adopted to characterize an amino acid. To take into account of the sequential relation between the consecutive amino acids, sliding window technique with four different window sizes, namely 1, 3, 5 and 7, is applied (see Figure 4(B)). The hydrophobicity measures of amino acids within a window are averaged for the integration of influence from neighbors. So far, we may have many combinations of features due to difference sliding window sizes and hydrophobicity measurement. For instance, an amino acid sequence of an intein with sliding window size 3 will have 49*5 possible feature sets. To further scale down the dimensionality of features, we develop two phases of feature selection strategy. The first phase exhaustively seeks the combination of window size and hydrophobicity measurement of the best SVM classification performance. With the obtained specific window size and hydrophobicity measurement, a forward selection strategy is carried to select significant features from the sequential amino acid features. Finally, BLAST is used to collect 378 real proteins which may be similar to inteins as false data. These 378 false data selected by BLAST are again mixed with the 378 inteins to validate our method.

3

Result and Conclusion

Figure 4 demonstrates the first phase of feature selection strategy to pick the best combination of window size and hydrophobicity measurement. It can be found that the combination of window size 3 and the hydrophobicity measurement of [11] can achieve the best differentiation performance with SVM under 5-fold cross-validation and produce a set of 49 features for each intein. Following that, a forward selection

38

ISBRA 2010 SHORT ABSTRACTS

strategy is performed to select the most significant features from the 49 features. Figure 4 (B) reports the result of the forward selection strategy with SVM under 5fold cross-validation. The performance of data set mixed with the false data generated by Swiss-Prot and BLAST are plotted as dotted and solid curves in Figure 5, respectively. We may found that high differentiation rate can be attained when 5 or more selected features are involved.

Fig. 4. (A) The performance of the first phase feature selection to pick the best combination of window size and hydrophobicity measurement. (B) The performance of forward selection for the data set prepared by Swiss-Prot and BLAST, respectively. We note that data set generated by Swiss-Prot is denoted as random false data while data set of BLAST is realfase data.

Inteins are mystical protein whose function and properties remain unknown. This study proposes a tool to predict more intein sequences for further intein analysis. The proposed method will be set up as a website tool in the future for public use to assist the analysis the difference of inteins from species.

References 1. Perler, F.B.: InBase, the Intein Database. Nucleic Acids Res. 30, 383-384. (2002) 2. Pietrokovski, S.: Intein spread and extinction in evolution. Trends Genet. 17, 465–472. (2001) 3. Liu X.Q.: PROTEIN-SPLICING INTEIN: Genetic Mobility, Origin, And Evolution. Annual Review of Genetics. 34: 61-76 (2000) 4. Schwarzer, D., and Cole, P.A.: Protein semisynthesis and expressed protein ligation: chasing a protein's tail. Curr Opin Chem Biol 9(6): 561–9 (2005) 5. de Grey, A.D.N.J.: Mitochondrial gene therapy: an arena for the biomedical use of inteins. Trends Biotechnol. 18(9): 394-399 (2000) 6. Gogarten, J.P., Senejani, A.G., Zhaxybayeva, O., et al.: Inteins: structure, function, and evolution.Annu. Rev. Microbiol. 56, 263–287 (2002) 7. Chang, C.-C. and C.-J. Lin.: LIBSVM: a library for support vector machines.Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm. (2001) 8. Kyte, J. &Doolittle, R. F.: A simple method for displaying the hydropathic character of a protein. J. Mol. Biol.IS7, 109-132. (1982) 9. Sweet R.M., Eisenberg D.: Amino acid scale: Optimized matching hydrophobicity (OMH). J. Mol. Biol. 171:479-488 (1983). 10. Eisenberg D., Schwarz E., Komarony M., Wall R.: Amino acid scale: Normalized consensus hydrophobicity scale. J. Mol. Biol. 179:125-142 (1984). 11. Hoop TP and Woods KR.: Prediction of protein antigenic determinants from amino acid sequences. Proc Natl Acad Sci USA 78:3824 (1981) 12. Guy H.R.: Amino acid scale: Hydrophobicity scale based on free energy of transfer (kcal/mole). Biophys J. 47:61-70 (1985).

39

ISBRA 2010 SHORT ABSTRACTS

A computational strategy to identify general and individualized genomic biomarkers Sandra Rodriguez Zas1 , Bruce Southey2 , Mira Cohen3 , Guy Bloch3 , and Gene Robinson4 1 Department of Animal Sciences, University of Illinois, Urbana, IL 61801 Department of Chemistry, University of Illinois, Urbana, IL 61801 University of Illinois at Urbana-Champaign 3 Department of Evolution, Systematics, and Ecology, The Alexander Silberman Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, 91904, Israel 4 Department of Entomology. University of Illinois, Urbana, Illinois 61801

2

Many behavioral and physiological processes exhibit an internal circadian rhythm with physiological or molecular indicators showing a sinusoidal periodic pattern. The gene expression profiles underlying these processes can also exhibit a similar periodic pattern. This sinusoidal pattern includes a maximum or peak value and a minimum or trough value separated by 12 hours over a 24 hour period. A cosinus model provides a parsimonious description of the sinusoidal gene expression trajectories using only two parameters; amplitude (half the difference between the peak and trough of the curve) and phase shift (location of the peak and trough of the curve on the time scale), in addition to the parameter that describes the expected or average gene expression level. In microarray experiments, the modeling of gene expression profile that have periodic patterns is complicated by presence of technical (e.g. dye, microarray, biological (e.g. subject) and design (e.g. multifactorial experiments) sources of variation. A common approach to analyze microarray studies with periodic patterns is to use a multi-step approach to initially address the multiple sources of variation and subsequently identify genes that present circadian patterns. This approach includes three main steps: 1) analysis of normalized gene expression values using standard ANOVA or another basic model that only considers time effects to detect genes exhibiting significant variation across time; 2) prediction of a cosinus pattern using the estimates from step 1); and 3) comparison of predicted (step 2) and estimates (step 1) using R-squared or other model fit indicators to identify genes that have a circadian pattern. Although simple to implement, this multistep approach does not account for the uncertainty of the estimates at each step and does not consider simultaneously the multiple sources of variation. Furthermore, the ANOVA approach ignores the relationship between consecutive time points by modeling the time points as independent factor levels. To overcome these limitations, a one-step approach that simultaneously accommodates time trends described using parametric (nonlinear or linear) or semi-parametric models, experimental, technical and biological sources of variation was developed. The one- and multi-step approaches were tested on a microarray experiment that measured the expression of 11,000 genes in the brain of honey bees repre-

40

ISBRA 2010 SHORT ABSTRACTS

senting two behavior groups (10 bees per group) across 6 time points in a 24-hour period using a loop design with 150 arrays. Comparison of the capability of the multi-step and one-step approach to detect gene exhibiting a periodic profile was based on statistical significance, precision and magnitude of the amplitude and shift estimates. Of the tests for amplitude and shift, 85% and 54%, respectively had lower statistical significance on the multi-step compared to the one-step approach. In addition, the precision (inverse of the scale parameter) was lower in 77% and 85% of the amplitude and shift estimates from the multi-step compared to the one-step approach, respectively. Interpretation of the differences between approaches is illustrated with two genes known to be associated with the circadian rhythm pathway. The one-step approach indicated that shaggy, a gene of the circadian rhythm pathway in honey bees and fruit flies, exhibited a strong circadian pattern (amplitude P-value < 0.003, fold change = 1.29). The multi-step approach also identified a significant (P-value < 0.003) difference in expression levels between time points and the R-squared indicator of the cosinus model fit was 51%. In addition, the one-step approach detected that gilgamesh, a gene known to be involved in the circadian rhythm pathway of fruit flies, exhibited a circadian trend in honey bees (amplitude P-value < 0.05, fold change between peak and trough = 1.13). The multi-step approach identified a peak expression (P-value < 0.002). However, the R-squared value of the cosinus model fit with the multi-step approach was approximately 10%. The poor sinusoidal model fit may be due to the inadequacy of the ANOVA estimates obtained in the previous step due to large sampling variation. The discrepancy between the multi-step and one-step stems from the capability of the second approach to borrow information from all the time points to more accurately estimate parameters and identify sinusoidal oscillations through the amplitude parameter. The one-step approach demonstrated in this study successfully identified genes with circadian sinusoidal patterns and the cosine model allowed the complete characterization of the 24-hour expression profile with only two parameters. This one-step approach can accommodate any type of non-linear and semiparametric model and thus can be applied in the detection of other cyclical or acyclical gene expression profiles.

41

ISBRA 2010 SHORT ABSTRACTS

Development of a transcriptomic-based index to prognosticate cancer Nicola Ser˜ ao, Kristin Delfino, Bruce Southey, and Sandra Rodriguez Zas University of Illinois, Urbana, IL

Several studies have undertaken the task of identifying biomarkers of cancer using gene expression profiles. The value of genomic-based tools to better predict cancer outcomes and improve treatment is particularly critical to improve survival to glioblastoma multiforme. This cancer is the most common and malignant type of brain tumor with an expected survival of one year. The identification of transcript biomarkers that can effectively predict survival has been hampered by the approaches typically used to mine the gene expression information available. In most cases, the association between glioblastoma survival and expression is studied on an individual-gene basis and no clinical cohort information is considered in the analysis. In addition, the expression and survival measurements are sometimes discretized into high and low levels and this could result in loss of information. Lastly, a stringent cutoff is used to discard gene profiles and only consider for biomarker identification a number of transcripts that is lower than the number of samples or patients available. An integrated approach to identify a set of transcriptome biomarkers of glioblastoma survival that can be used in a survival or time-to-glioblastomaevent index is proposed. Clinical and gene expression indicators of glioblastoma survival together with potential interactions, in numbers larger than the number of patients available were evaluated using the complementary stepwise and forward variable selection approaches within a multivariate Cox survival framework. Survival, clinical and gene expression information from 287 patients corresponding to the September 2009 data freeze was obtained from the Cancer Genome Atlas (http://cancergenome.nih.gov/). The expression of 22,277 genes was measured using an Affymetrix U133A platform, each sample was represented by one array and all arrays were scanned across 10 batches. Gene expression profiles associated with the genesis or escalation of glioblastoma were identified by means of considering two time-to-glioblastoma-event variables, survival (age at death in years) and months from diagnosis to progression or recurrence. Clinical variables considered were: two race levels (white Caucasian or not), gender, five therapy levels (R = radiation alone, CRnoT = chemo, radiation and not targeted therapy plus other therapy if present, CRT = chemo plus radiation and targeted therapy only, OTHER = any other combination of radiation, chemo, targeted, immune and hormonal therapy, NONE = no therapy), detection of glioblastoma progression or recurrence after the original diagnostic, and the technical variable of array batch. The gene expression data werelog2 transformed and normalized using quantile transformation (at a probe level) and GC-RMA. Preliminary analysis identified 963 and 1048 genes with significant (P < 0.01) individual associations with survival and months from diagnosis to pro-

42

ISBRA 2010 SHORT ABSTRACTS

gression or recurrence. These genes were further considered in the development of the index based on the strength of the signal of the association with glioblastoma. A total of 61 gene profiles (corresponding to 56 genes due to four genes represented by two probes and one by three probes) were associated with overall glioblastoma survival (P < 0.05 with four borderline P < 0.1). Of these, 51 profiles had broad associations with survival (described by hazards ratio) were not conditional on the clinical group that the patient pertained: IL13RA1 (0.468), SAR1A (0.439), TP53 (0.756), AKT2 (0.677), PIK3R1 (1.67), GIGYF2 (0.388), KIAA0090 (2.89), CAMK2G BF111268 (1.54), COL14A1 (3.3), TLK2 (0.636), SLC43A3 (0.559), TSPYL5 (1.34), HOXA10 (3.3), RPL37A (0.51), EP300 (0.344), RAC2 (1.74), WDR67 (1.3), PPBP (1.16), CAMK2G AA284757, (0.534), CCNB1 (3.16), USF2 (1.42), CHI3L1 AJ251847 (0.87), PLCG1 (0.55), SPG21 (1.91), TIMM23 (0.501), WDYHV1 (1.95), LIN7C (2. 4), SOD2 (0.468), RAB15 (0.484), RPL10 (0.798), JAG2 (2.78), CCNB2 (0.598), PDCD4 (4.68), UROS (2.37), E2F3 (0.256), TOPORS (0.415), CDK2 (2.74), MTAP (0.67), CSF1 (0.58), AKT1 (1.7), SIX6 (1.82), SCN5A (3.21), ROS1 (2.58), KCNJ4 (0.501), CDC42 BC002711 (0.572), CDC42 BC003682 (3.94), FSTL1 (0.306), AKR1C3 (0.708), SYNE1 (0.174), IGF1 (0.762) and CDKN2A (1.18).

The four genes presenting a borderline association (P < 0.1) with overall glioblastoma survival were: EGFR (0.682), CDC42 R37664 (0.818), CHST4 (0.44) and ARHGEF4 (0.803). The remainder six gene profiles (out of 61) exhibited associations (P < 0.05) with survival that were dependent on the clinical characteristics of the patient. For example, although the hazards ratio of death decreased with increased levels of U48722 EGFR, the reduction was higher in males (0.6) than in females (0.877). On the other hand, the hazard ratio increased with increased levels of C2 and PRKCB M13975 and the increase was higher in males (1.935 and 5.216, respectively) than in females (1.304 for both). Also, the hazards ratio of glioblastoma increased with increased levels of SOX10 in nonwhite patients (1.082) compared to white-Caucasian patients (0.546). Increments on the level of CHI3L1 M80927 were associated with significant increments in the hazards ratio of glioblastoma. However, patients receiving NONE therapy (2.423) had the highest hazards ratio, followed by CRnoT (1.531), OTHER (1.311), R (1.278) and CRT (1.277). The association between PRKCB NM 002738 and hazards ratio was dependent on the gender and therapy received. On average, the hazards ratio increased with increased levels of PRKCB in females (1.275) and decreased in males (0.363). Patients receiving NONE therapy had the lowest hazards ratio (0.384), followed by CRnoT (0.515), R (0.639), CRT (0.71) and OTHER (0.751). The majority of the probes (41 out of 61) associated with overall survival in this study are correspond to 35 genes that have been reported in the literature or are part of the glioma gene pathway: IL13RA1, TP53, AKT2, EGFR, PIK3R1, CAMK2G, COL14A1, TLK2, EP300, RAC2, WDR67, CDC42, CCNB1, USF2, CHI3L1, PLCG1, SOD2, RAB15, RPL10, JAG2, CCNB2, PDCD4, UROS, E2F3, TOPORS, CDK2, MTAP, CSF1, AKT1, FSTL1, AKR1C3, SYNE1, IGF1, CDKN2A, and PRKCB.

43

ISBRA 2010 SHORT ABSTRACTS

A total of 60 gene profiles were associated with time from diagnosis to progression (P < 0.05, including four genes at P < 0.1). Of these, 51 profiles had broad-spectrum general associations with progression that were not conditional on the clinical variables (P < 0.05). The genes (and hazards ratio) were: TP53 (0.428), FGFR2 (0.386), CAMK2B AF081924 and U23460 (0.539, 3.02), IL13RA1 (1.72), SMARCB1 (2.06), PIK3R1 (1.6), HNRNPD (2.94), CAMK2G (1.86), PVR (0.32), APP X06989 (0.473), PRKCA (1.9), CALM3 (0.432), WEE1 (1.37), PDGFB (0.181), MMP14 (1.66), MDM2 (0.657), UNG (2.79), NRAS (3.93), PPP1R15A (1.54), HSPA1A (0.821), IFNGR1 (0.316), E2F3 (0.417), PTEN (0.601), MTAP (1.66), POLD2 (0.433), TIMP3 AW338933 and NM 000362 (0.257, 3.1), CALM2 (0.22), RAF1 (0.459), CCND1 (0.802), FSTL1 (0.467),IGL@ (2.22), CLIP3 (1.75), PAICS (5.28), AGPAT1 (0.273), LOC283079 (2.73), SNX10 (1.34), PKNOX2 (0.457), MNS1 (1.51), KDM6B (2.03), CAV2 (1.25), ATF5 (2.31), PLA2G7 (0.11), AANAT (1.78),ZFY (2.06), SHOX (2.66), CD24 (0.847), NDUFV1 (0.333), UTP20 (2.08) and KCNJ13 (1.74). The four genes that had a borderline association (P < 0.1) with time from diagnosis to progression were: APP BC004369 (1.56), HRAS (1.49), FADD (1.57) and CLEC2B (1.19). The remainder five gene profiles (out of 60) exhibited significant (P < 0.05, including one gene at P < 0.1) associations with diagnosis-to-progression time that were dependent on the clinical characteristics of the patient. Increments on the expression of RAB31 were associated (P < 0.1) with higher hazards ratio in non-white (7.723) relative to white-Caucasian patients (1.466). On the other hand, increments on the expression of VAV3 were associated with decreased hazards ratio in white-Caucasians (0.685) and non-white patients (0.415). Also, increments on the expression of RPS20were associated with higher hazards ratio in Caucasian (1.828) and lower hazards ratio in non-Caucasian patients (0.754). Increments in the expression of APOOL were associated with increased hazard ratios. Patients receiving CRT therapy showed the highest value (4.819), followed by OTHER (3.858), CRnoT (2.226), NONE (1.935) and R (1.641). The hazards ratio of progression decreased with increased levels of GDF10 more in males (0.371) than in females (0.803). Moreover, most of the gene profiles (35 out of 60), associated with time from diagnosis to progression or recurrence in this study, correspond to 33 genes part of the glioma gene pathway or have been reported in the literature. These are: TP53, FGFR2, CAMK2B, IL13RA1, SMARCB1, PIK3R1, HRAS, APOOL, HNRNPD, CAMK2G, PVR, APP, PRKCA, CALM3, WEE1, PDGFB, MMP14, MDM2, UNG, HSPA1A, NRAS, PPP1R15A, IFNGR1, E2F3, PTEN, MTAP, POLD2, TIMP3, CALM2, RAF1, CCND1, FSTL1 and CAMK2B. Noteworthy are 7 genes (CAMK2G, E2F3, FSTL1, IL13RA1, MTAP, PIK3R1 and TP53) that were significantly associated with overall glioblastoma survival and time from diagnostic to progression. These genes may have a specific role on the progression of glioblastomas. The adequacy of the gene indices to predict overall and post-diagnosis glioblastoma survival was confirmed using a leave-one-out cross-validation. This study demonstrated the value of integrating variable selection and survival models to identify not only genes that have broad-range associations with survival but also

44

ISBRA 2010 SHORT ABSTRACTS

genes that are associated with survival on a more specific manner that depends on the clinical characteristics of the patient. Known biomarker gene profiles were confirmed, and new general and clinical-dependent gene profiles were uncovered. These findings can be the basis for improved prognostic tools and individualized treatments that improve the survival and quality of life of patients suffering glioblastoma multiforme.

45

ISBRA 2010 SHORT ABSTRACTS

Quartet Decomposition Server: A Platform for Analyzing Phylogenetic Trees 1

2

2

3

2

Fenglou Mao , Maria Poptsova , David Williams , Olga Zhaxybayeva , Peter Gogarten , Ying Xu 1. 2. 3.

1

Department of Biochemistry and Molecular Biology, University of Georgia, 120 Green St, Athens, GA 30622, USA Department of Molecular and Cell Biology, University of Connecticut, 91 North Eagleville Road, Storrs, CT 06269, USA Department of Chemistry and Biochemistry, Mount Allison University, Sackville, NB, Canada

Abstract We present a web server for performing quartet-based phylogenetic analyses, where a quartet is a binary tree for a quadruple of taxa. It represents a computationally challenging task to perform quartet-based analysis on a large number of gene trees as the number of involved quartets could be fairly large. Our web server provides a platform for a user to carry out quartetbased phylogenetic analyses while the user is required to provide only the gene trees. The analysis server has the following key features: 1) the web-based user interface supports an easy upload of up to thousands gene trees stored in one compressed file; 2) the server automatically generates a quartet decomposition of the input trees and a quartet spectrum to give the user an overview about how the generated quartets from multiple gene trees agree or disagree with each other; and this quartet spectrum is clickable to provide the user further details about agreeing and conflicting quartets; 3) the web server provides three quartet filters allowing a user to remove error prone quartets from further analyses; 4) the server supports downloads of generated quartets for further offline customized analyses; and 5) the server calculates an agreement score, a real value between 0 and 1, for each gene tree with all other trees. The web server can be accessed from http://csbl1.bmb.uga.edu/QD/. 1. Introduction It is well known that genetic material in prokaryotes can be transferred between divergent organisms easily. Many microorganisms can take up DNA directly from the environment; in addition, viruses such as phages can invade prokaryotic cells and bring new DNA fragments to host cells; and the conjugation machinery for DNA transfer between cells often is not species specific. The frequent exchanges of genetic materials among prokaryotes make it a highly challenging problem to study the phylogenetic relationships among different species as they make it difficult to reliably reconstruct a species tree. For example, horizontal gene transfers (HGT) among a group of species will make the construction of a species tree of these organisms challenging since different orthologous gene groups may give rise to different trees due to the HGTs. It always represents a challenging problem as to which gene tree is the best in terms of representing the species relationship among a group of organisms. The most popular strategy currently is to construct a species tree based on the so-called conserved genes or genetic units such as 16S ribosomal RNA, which is acceptable but it does not resolve the problem that these conserved units could also be the results of HGTs. In addition, this issue makes the reliable identification of HGTs very challenging.

46

ISBRA 2010 SHORT ABSTRACTS

Zhaxybayeva et al. (1) have recently developed a new method for detecting HGT, by decomposing gene trees into a collection of small units, each called a quartet, and by analyzing the resulting quartets for identification of HGT events involved, which proved to be highly reliable in accurate inference of the HGTs and hence in inference of species trees. The detailed description of the method is presented elsewhere (1). Here we briefly outline the basic idea of the method, and refer the reader to (1) for details. A quartet is an un-rooted binary tree consisting of four taxa. Given one such gene tree, the four taxa could have one of the three topologies below:

Figure 1: Three possible quartet topologies Here we use text string (AB)(CD), (AC)(DB) and (AD)(BC) to represent the left, middle and right topology, respectively. For a given gene tree, our algorithm enumerates all possible combinations of any four taxa, and calculates distances between any two of them. Thus we can obtain six distances for a specific quartet, which will be used to resolve the branch lengths in the quartet (four external branches and one internal branch) as well as the topology of the quartet. 2. Key features of the Web server Our quartet decomposition server provides a platform for performing various quartet– based analyses, making it easier for the user to carry out technically difficult tasks such as tree decomposition, quartet spectrum generation and quartet-based analyses. The server has the following key features. 2.1. Quartet spectrum Quartet decomposition of a gene tree is a process for finding all possible quartet topologies for the given tree. Currently our server requires 100 bootstrap trees for each gene family, thus each quartet topology will have a bootstrap support value ranging from 0% to 100% for each gene family. For a given threshold on the bootstrap support values, we can determine if a specific quartet topology is supported by a given gene family. If we have N gene families, each quartet topology may be supported by 0 to N gene families. Thus we can draw a histogram summarizing the overall consistency among the generated quartets, which is called a quartet spectrum. Figure 2 shows an example of a quartet spectrum.

47

ISBRA 2010 SHORT ABSTRACTS

In our web server, we make each quartet spectrum clickable so when a user clicks on the bar representing a specific quartet, a new page will pop-up with the detailed information for that quartet, including its bootstrap support value in each gene family.

Figure 2: An example of a quartet spectrum. The x-axis represents the quartets arranged in the descending order of the number of gene families supporting the plurality topology of each quartet, and the y-axis represents the number of the supported gene families. The negative part at y axis gives the sum of the number gene families supporting the two other conflicting topologies. 2.2 Agreement scores For each gene family we calculate an agreement score (2), measuring the consistency of the 100 bootstrap trees for this gene family, using the following formula.

, where N is the number of bootstrap trees for this gene family (N=100 as the default); M is the

x  4 

number of possible quartets; M    with x being the number of species being considered; and ni1, ni2, ni3 is the number of the first, second and third topologies for the ith quartet, respectively. The score S will be 1 if all the bootstrap trees are same, and it will be less than 1 if



48

ISBRA 2010 SHORT ABSTRACTS

there is an inconsistency between some bootstrap trees. The more conflicts there are, the closer the score is to “0”. Normally the gene similarity in the gene family decides the value of this score. If genes have very similar sequences, the bootstrap trees will have a higher chance to be different and hence will result in a lower agreement score. 2.3 Filters Long external branch: We can see from Figure 1 that there are four external branches and one internal branch. We only trust those topologies with relatively short external branches. Hence we want to remove those quartet topologies with very long external branches. Currently a filter is implemented to remove such quartets using the following criterion: if the ratio between the longest external branch and the internal branch is larger than a threshold (default value = 10), it will be removed. Short internal branch: If a quartet has a very short internal branch, it is also considered as abnormal and will be removed. The criterion we are using now is that if a quartet has an internal branch shorter than the threshold (default value = 0.02), we remove it. Remove less supported quartets: Since the number of quartet equals the number of the combinations of any four taxa in a given gene tree, this number could be very large for a relatively large gene tree. Generally the less supported quartets are less informative and have a higher chance to provide false positives (i.e, conflicts with the plurality). We implemented the following filter, which requires two thresholds: the first threshold is the bootstrap support value threshold (T1), and the second (T2) is the percentage of gene families resolving the quartet with a bootstrap support value at least T1. For a specific quartet, if the percentage of gene families supporting it with a bootstrap support value at least T1 is less than T2, this quart will be removed from further analyses, e.g., from display of a quartet spectrum. This last filter can be applied multiple times to generate different quartet spectrums even after the decomposition process is done, while the other two filters have to be specified before the decomposition process starts. The decomposition process, which is highly CPU intensive, has to be run again if a user wants to change the values for the first two filters. 2.4 Quartet download Although we have provided a number of quartet analysis tools through the server, a user may want to perform his/her own analyses on computed quartets. Since the quartet decomposition is not trivial and a user may have to deal with many trees, which is not easy for many biologists, we provide two download options for a user to download the decomposed quartets. The first download option is to download a specific part of the quartets, which uses two thresholds: the first threshold (T1) is a bootstrap support value and the second threshold is an integer number (T2). A user can download a subset of the decomposed quartets that are supported with a bootstrap support value at least T1 in at least T2 gene families. The second download option is based on the quartet spectrum. The quartet topologies with positive the y-values are considered as good quartet topology (plurality quartet topology), and

49

ISBRA 2010 SHORT ABSTRACTS

as bad quartet topology (conflicts) if they have negative y-values. The user can download the good or bad quartet topologies based on his/her choice. A bootstrap support value threshold need to be provided because a quartet spectrum is generated based on the threshold. And the spectrum is also affected by the third filter we mentioned above, which removes some unwanted quartets. This download option only downloads the quartets that pass the filter. The following is a link to the result from one of our finished jobs. The spectrum is generated from 14 genomes with 112 gene families, each gene family include 100 bootstrapped gene trees. The result can be accessed from http://csbl1.bmb.uga.edu/QD/jobstatus.php?jobid=QDGyylaa. Except for the first two filters, which need to be specified before the job is started, all other features described above can be accessed from this page. 3. Summary We developed a web server for performing quartet decomposition and various quartetbased analyses such as quartet spectrum generation, agreement score calculation along with a set of filters. To support customized analyses, two download options are provided for an advanced user to do offline analysis without doing the tedious quartet decomposition process. We believe this server could be a good starting point for a quartet based analysis, which is a different type of analysis from traditional comparative genomics analysis.

Acknowledgement This work is supported in part by National Science Foundation (DEB-0830024, DBI-0354771, ITR-IIS-0407204, DBI-0542119).

Reference 1.

Zhaxybayeva, O., Gogarten, J.P., Charlebois, R.L., Doolittle, W.F. and Papke, R.T. (2006) Phylogenetic analyses of cyanobacterial genomes: quantification of horizontal gene transfer events. Genome Res, 16, 1099-1108.

2.

Zhaxybayeva, O., Doolittle, W.F., Papke, R.T. and Gogarten, J.P. (2009) Intertwined Evolutionary Histories of Marine Synechococcus and Prochlorococcus marinus. Genome Biol Evol, 2009, 325-339.

50

ISBRA 2010 SHORT ABSTRACTS

Multi-gene linear separability of gene expression data in linear time Md. Shafiul Alam, Satish Panigrahi, Puspal Bhabak, and Asish Mukhopadhyay⋆ School of Computer Science, University of Windsor, 401 Sunset Avenue, Windsor, ON, N9B 3P4, Canada {alam9, panigra, bhabak, asishm}@uwindsor.ca

Abstract. In [9] Unger and Chor showed how to test for linear separability of gene expression data with respect to pairs of genes. Their method however is not amenable to an efficient test when more than 2 genes are involved. The main contribution of this note is to show how to use linear programming to check for linear separability of gene expression data with respect to any number of genes in O(n) time where n is the sample size. The hidden constant in the O(n) term depends exponentially on the number of genes (the dimensionality of the problem). So, this makes for an efficient test when the number of genes is a small constant. To test the effectiveness of our algorithm, as an initial step, we have implemented this algorithm for gene pairs and are working on extending this implementation to larger groups of genes. Key words: gene expression analysis, linear separability, tissue classification, cancer diagnosis

1

Introduction

According to Ben-Dor et al. [2], the correct diagnosis of a cancer type is often crucial to a successful treatment. As normal cells can evolve into cancerous cells through mutations in genes, it is believed that the gene expression data can be exploited for more effective diagnosis and treatment of cancer. For this, it is necessary to identify groups of genes that play important roles in various types of cancers. Once the genes are identified, it is possible to diagnose the presence of or the type of a cancer and determine the course of treatment [6]. In [9] Unger and Chor showed how to test for linear separability of gene expression data with respect to pairs of genes. Interestingly enough, they were able to show that 7 out of the 10 datasets of two types of cancerous cells that they studied are linearly separable with respect to a pair of genes. From this they concluded that this linear separability of gene expression datasets is strongly correlated to some underlying molecular mechanism of these gene pairs. Their ⋆

Research Supported by an NSERC Individual Discovery Grant to this author

51

ISBRA 2010 SHORT ABSTRACTS

2

M. S. Alam, S. Panigrahi, P. Bhabak and A.Mukhopadhyay

method of linear separability, applicable to a pair of genes only, checks for separability incrementally. When the dataset is linearly separable its running time is in O(n2 ), where n is the sample size. However, checking just pairs of genes for classification may not be enough. Indeed, van’t Veer et al. [10] found that 231 genes are significantly related to breast cancer. Among these, they identified 70 genes as optimal marker genes for classification of breast cancer for prognosis purposes. Khan et al. [5] found 96 genes to classify small, round, blue-cell cancers. Ben-Dor et al. [2] used 1734,375 genes to classify various cancers. Golub et al. [3] selected 50 genes to classify leukemias. Some researchers used far more genes to classify cancers. For example, Alon et al. [1] used 2,000 genes to classify colon cancers. A major bottleneck with any classification scheme based on gene expression data is that while the number of samples are limited, numbering in hundreds, the feature space is much bigger, running into tens of thousands of genes. Using too many genes as classifiers will result in over fitting, while on the other hand using too few may result in under fitting. The common consensus is that genes numbering between 10 and 50 genes may be sufficient for good classification [3, 6]. Note that even using the minimum number of 10 genes for linear classification of 100 samples that involve 20,000 genes is also an enormous task. This calls for a very efficient algorithm with an incremental feature that allows for early termination if the samples are not separable for some combination of 10 genes. The main contribution of this note is to show how to use the linear programming algorithm of [7, 8] to check for linear separability of gene expression data with respect to any number of genes in O(n) time where n is the sample size. The hidden constant in O(n) depends exponentially on the number of genes. Thus, the test is efficient when the number of genes is reasonably small. To test the effectiveness of our algorithm, as an initial step, we have implemented this algorithm for gene pairs and are working on extending this implementation to larger groups of genes.

2

Multi-gene Linear Separability

We have a set of n (= m1 + m2 ) samples, m1 from a cancer type C1 and m2 from a cancer type C2 ( for example, m1 from ALL and m2 from AML) in a d-dimensional space, where d is the size of the gene set that is being tested as a linear classifier. If there exists a hyperplane that separates the m1 samples of C1 from the m2 samples of C2 then the considered group of d-genes is a linear separator. We reformulate the original separability problem in primal space as a linear program in dual space. This involves mapping each sample point to a hyperplane in dual space. Suppose there is a separating hyperplane in primal space and, say, the sample points of C1 are above this plane, while the sample points of C2 are below. The dual mapping preserves this above-below relationship in the sense that if a sample point p lies above (below) a hyperplane H in the primal space then the dual of p, viz. the hyperplane p∗ , lies below (above) the dual of H,

52

ISBRA 2010 SHORT ABSTRACTS

Multi-gene linear separability of gene expression data in linear time

3

viz. the point H ∗ , in the dual space. One such above-below preserving dual map from the primal plane (x, y) to the dual plane (u, v) is: p = (px , py ) → p∗ : v = px u − py l : y = lu x − lv → l∗ = (lu , lv ) Thus, there is a separating hyperplane in primal space if the resulting linear program in dual space has a feasible solution (see Fig. 1 for the 2d case). Note that we will have to solve 2 linear programs since it is not known if the m1 samples of C1 lie above or below the separating hyperplane H. Formally one of these linear programs in the dual space (u1 , u2 , . . . , un ) is shown below: pi1 u1 + ... + pin−1 un−1 − un − pin < 0, i = 1, ..., m1 ′





i p1i u1 + ... + pn−1 un−1 − un − pni > 0, i = 1, ..., m2

(1) (2)

where (pi1 , pi2 , . . . , pin ) is the i-th sample point from C1 , and the first set of inequalities express the conditions that these sample points are above the separating plane, while the second set of inequalities express the conditions that the ′ ′ ′ sample points (p1i , p2i , . . . , pni ) from C2 are below this plane.

y

0 1 1 0 0 1

1 0 1 0 11 00 00 11 00 11

1 0 1 0

H

11 00 00 11

0 1 1 0 0 1

v

H∗

u

x P rimalP lane

DualP lane

Fig. 1. A separating line in primal space is a feasible solution in dual space

3

Experimental Results

To test out our ideas, as a preliminary step, we have implemented our algorithm in 2-dimensions. This means testing pairs of genes as classifier. We ran our program on a Dell Inspiron laptop with an Intel Core2 Duo processor on the publicly available data set of Golub [3] with ALL and AML data testing for separability with respect to all pairs of 12,582 genes for a total of 79,147,071 separability tests. Out of these 249,567 pairs proved to be separable, representing just 0.32% of the total. The total running time was 5.07 hrs, which makes the case for having a very efficient separability test.

53

ISBRA 2010 SHORT ABSTRACTS

4

4

M. S. Alam, S. Panigrahi, P. Bhabak and A.Mukhopadhyay

Conclusions

We believe that gene groups of size greater than 2 might be better classifiers. We are working on extending our algorithm to cover these cases. We are also going to test our algorithm on all data sets that have been tried by Unger and Chor [9] to see if our findings are consistent with their conclusions. The main contribution of our paper is to point out that separability tests can be carried out efficiently for groups of genes larger than 2. Acknowledgements. Thanks to Dr. Alioune Ngom of the University of Windsor for bringing this problem to our attention.

References 1. Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., and Levine, A.J. 1999. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotidearrays. Proc. Natl. Acad. Sci., USA, 96, 6745-6750, 1999. 2. Ben-Dor, A., Bruhn, L., Friedman, N., Nachman, I., Schummer, M. and Yakhini, Z. Tissue classification with gene expression profiles. In Journal of Computational Biology, 7(3-4),559-583, 2000. 3. Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A., Bloomfield, C. D. and Lander, E. S. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531-537, October 1999. 4. Gordon, G. J., Jensen, R. V., Hsiao, L-L., Gullans, S. R., Blumenstock, J. E., Ramaswamy, S., Richards, W. G., Sugarbaker, D. J. and Bueno, R. Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research, 62(17):4963-4967, Sep 2002. 5. Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C. R., Peterson, C. and Meltzer, P. S. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7(6):673-679, June 2001. 6. Kim, S., Dougherty, E. R., Barrera, J., Chen, Y., Bittner, M. L. and Trent, J. M. Strong feature sets from small samples. Journal of Computational Biology, 9(1), 127-146, 2002. 7. Megiddo, N. Linear-time algorithms for linear programming in R3 and related problems. In SIAM Journal of Computing, 12(4), 759-776, 1983. 8. Megiddo, N. Linear Programming in Linear Time When the Dimension is Fixed. In Journal of the ACM, 31(1), 114-127, 1984. 9. Unger, G. and Chor B. (2007). Linear Separability of Gene Expression Datasets. To appear in IEEE Transactions on Computational Biology and Bioinformatics. 10. van’t Veer, L. J., Dal, H., van de Vijver, M. J., He, Y. D., Hart, A. A. M., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R. and Friend, S. H. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415(6871), 530-536, Jan 2002.

54

ISBRA 2010 SHORT ABSTRACTS

IP6K gene identification by tag search Fabio Fassetti1 , Ofelia Leone1 , Luigi Palopoli1 , Simona E. Rombo1 , and Adolfo Saiardi2 2

1 DEIS, Universit`a della Calabria LMCB, MRC Cell Biology Unit & Department of Developmental Biology, UCL

Abstract. IP6 Kinases (IP6Ks) are important mammalian enzymes involved in inositol phosphates metabolism. Although IP6K has not yet been identified in plant chromosomes, there are many clues suggesting its presence in vegetal cells. In this paper, we propose a new approach to search for the gene IP6K and applied it on mitocondrial DNA (mtDNA) of plants, where such a gene could have been nested, possibly encrypted and hidden by virtue of the editing and/or trans-splicing processes. In order to identify the gene IP6K, we focused on the nucleotide sequence corresponding to a specific tag of the IP6K family. To this aim, we exploited DLSME, a software for motif extraction recently proposed, which allows for the motif structure to be only partially specified by the user. The analysis we conducted provided the negative, though we argue relevant, result that IP6K does not actually occur in vegetable mtDNA.

1

Introduction

Inositol polyphosphates are an important class of regulatory molecules involved in a variety of intracellular signaling pathways. Inositol hexakisphosphate (IP6 , also known as phytic acid) is the most abundant inositol polyphosphate in eukaryotic cells. It is the precursor of a class of more anionic inositol polyphosphate, the inositol pyrophosphates, in which the fully phosphorylated IP6 ring is further phosphorylated to create high-energy pyrophosphate group. The best charateryzed inositol pyrophosphates are the diphosphoinositol pentakisphosphate (IP7 or PP-IP5 ) and the bis-diphosphoinositol tetrakisphosphate (IP8 or [PP]2 -IP4 ), with one and two pyrophosphate group, respectively [2]. Inositol pyrophosphates are important cellular messengers that control a wide range of cellular function, including endocytosis [19], apoptosis [12], telomere length [18], and has been proposed to be able to drive a new kind of protein posttrasductional modification protein pyro-phosphorylation [3]. Since their discovery in the early 1990s, inositol pyrophosphates have been found in all eucariotic cells analyzed, from yeast to mammalian neuron, along with the widespread conservation of the enzymes responsible for their synthesis. The mammalian enzymes responsible for IP7 synthesis are called IP6 Kinases (IP6Ks); they are able to convert IP6 plus ATP to IP7 [17]. It is now known that IP6Ks belong to a superfamily of Inositol Polyphosphates Kinase (PFAM accession number PF03770), that evolved from a common ancestor, comprising IP6Ks, Inositol Polyphosphate Multikinase (IPMK) and IP3 -3Ks that specifically convert I(1,4,5)P3 to I(1,3,4,5)P4 . Interestingly, the presence of pyrophosphate IP7 has been demonstrated also in vegetal organisms, both in monocotyledonous and in dycotiledonous plants [7, 4]. Furthermore, the conversion of IP6 to IP7 has been detected in Arabidopsis cells and leaf tissue in the presence of ATP, demonstrating IP6-kinase activity in plant extracts 1 . These findings, together with the observed high conservation through the evolution of IP6K, strongly suggest the presence of this enzyme in vegetal cells. Therefore, IP6K enzyme was searched in plant genomes by homology based methods, but all studies have failed to reveal its presence. Only two IPMK proteins (called AtIPK2a and AtIPK2b in Arabidopsis thaliana) have been identified so far [20, 22]. These two enzymes contribute to inositol 1,3,4,5,6-pentakisphosphate (IP5 ) production in Arabidopsis, but do not show any inositol pyrophosphate enzymatic activity [20, 22]. However, there are many clues connecting IP6K to cell mitochondria. It was shown that human IP6K2 moves from nuclei to mitochondria and provides physiologic regulation of apoptotic process by generating IP7 [13]. Furthermore, yeasts deficient in KCS1 (yeast IP6-Kinase), kcsl∆mutants, do not survive if they are grown in conditions in which survival is dependent from mitochondrial function, thus demonstrating the importance of IP6K for this organelles 2 . Summarizing, to date IP6K has not been identified in plant chromosomes, but 1 2

Adolfo Saiardi and Cristina Azavedo unpublished manuscript Adolfo Saiardi unpublished manuscript

55

ISBRA 2010 SHORT ABSTRACTS

there are many clues suggesting its presence in vegetal cells. Some further observations could suggest that the corresponding gene might be find in plant mtDNA, probably encripted and hidden by virtue of editing and/or trans splicing processes. It is known that most of mtDNA information concerns genic products acting inside the mitochondrion itself. Plant mitochondrial genome have several peculiar characteristics such as the larger size (from 200Kb to 2400Kb), the presence of introns and genetic material of chloroplast or nuclear origin [14]. Also, mitochondrial genome is characterized by occurrence of phenomena (like editing) enlarging protein variability [5]. In the present study we have searched IP6K in mtDNA of plants. Because of the considerable sequence heterogeneity among the several IPKs known, common homology search program are not useful to this aim. Thus, we decided to use a new approach, looking for a specific IPK family tag. In particular, four key amino acids are very conserved among IPKs [17]. The consensus sequence P-XXX-D-X-K-X-G, identifies the catalytic site of the enzyme [16] and it can be considered a specific tag of IPK gene family. We exploited DLSME [6], a software for motif extraction which allows for the motif structure to be only partially specified by the user, to search for the IPK family tag in mtDNA sequences, as explained in detail in the following section. In Section 3 we discuss our reults and draw conclusion.

2

Method Overview

Common softwares for sequence search are based on sequence homology, but they are not very useful when the expected homology between the gene searched for and the known sequences is low. Furthermore, these softwares can not detect possible changes in nucleotide sequences due to RNA editing mechanisms. The intuition behind our work is that all IPK gene are characterized by the presence of specific tags, short sequences of few amino acids, corresponding to enzyme functional regions. The most important is the P-XXX-D-X-K-X-G domain, corresponding to the catalytic site of the enzyme. For the identification of IP6K gene in plant mtDNA, we focused on the nucleotide sequence corresponding to this specific IPK tag. In particular, we analyzed all the published mtDNA sequences (available at http://www.ncbi.nlm.nih.gov/sites/entrez) using DLSME [6]. For each identified tag, we extracted a sequence of about 1200 nucleotides surrounding the consensus sequence and examined it as a candidate IP6K gene. Nucleotidic sequences were translated in aminoacid sequences by using the Transeq [15] software. Then, in order to detect possible homologies, we performed sequence alignments using ClustalW [8] and BLAST [1]. Finally, using the TBLASTX and TBALSTN algorithms, we screened expressed sequence tag (EST) databases for proteins containing the mitochondrial sequences identified by our tag search. In the following, we first provide a brief description of DLSME and of the setting we exploited for our purposes, and then describe the main results we have obtained by our analysis. 2.1

DLSME Settings

DLSME [6] is a system designed to mine general kinds of motifs where several “exceptions” may be tolerated; that is, it is able to handle different complex kinds of pattern variabilities. In particular, DLSME is able to search for patterns composed of any number of short subsequences (boxes, in the following), where both the conserved regions and the regions between two boxes can be specified by the user as intervals ranging from a minimum to a maximum value. Moreover, mismatches can be taken into account by DLSME, as well as “skips” (deletions) and box “swaps” (box invertions) that possibly affect box occurrences. Furthermore, in DLSME it is possible to specify boxes where some symbols are “anchored” to get a fixed value. Despite the complexity of the addressed pattern variabilities, the system is able to exhibit very good performance in terms of both spatial and temporal costs. In particular, for the purposes of this research, we looked for the pattern: CC{T,C,A,G}---------GA{T,C}---AA{A,G}---GG{T,C,A,G} using the following sets of DLSME configuration parameters:

56

ISBRA 2010 SHORT ABSTRACTS

Distance: Number of skips: Number of swaps: Number of boxes:

2.2

Hamming First box length: 0 Distance from second box: 0 First box anchors: 4

3 Second box length: 9 Distance from third box: CCT Second box anchors: CCC CCA CCG Third box length: 3 Fourth box length: Distance from fourth box: 3 Fourth box anchors: Third box anchors: AAA AAG

3 3 GAT GAC

3 GGT GGC GGA GGG

Results

The full mitochondrial genome sequence is known for 36 different vegetal organisms, belonging to various Phyla, even very distant ones from an evolutionary point of view. The specific IP6Ks tag (P-XXX-D-X-K-X-G) search was performed over all sequenced mitochondrial genomes available to date and both DNA strands were analyzed. Twenty genomes out of 36 gave at least one positive match. Interestingly, we noted that some tag sequences (all 9 amino acids) were identical among different organisms. For each identified tag we extracted a sequence of about 1200 nucleotides surrounding it. To find out possible relevant homologies, we performed alignments among the sequences found in different vegetal organism. All the sequences sharing the same tag showed high homology in the region surrounding the consensus sequence, while alignment with IP6K known genes (Saccharomyces cerevisiae KCS1 or human IP6K1) showed only a weak similarity. Furthermore, in order to confirm the identity of our putative hit, we looked for other IP6Ks conserved domains, such as the SSLL or IDF regions, in the identified putative aminoacid sequences. These preliminary analysis led us to focus on the sequence PVGT DRKGG, that was found in mtDNA of Tripsacum dactyloides, Sorghum bicolour, and three different species of Zea genus (Zea mays, Zea perennis and Zea luxurians). Alignment between the 410 aminoacid around the PVGT DRKGG sequence of Tripsacum dactyloides and the human IP6K gene showed an interesting correspondence of the consensus region. To verify if the Tripsacum dactyloides sequence was an actively transcribed gene, we analyzed the Expressed Sequence Tags (ESTs) databases. These databases include short fragment of DNA derived from a longer cDNA sequence and representing part of the expressed genome. In order to confirm the expression of our mtDNA sequence, we screened EST databases using the region surrounding the PVGT DRKGG tag of Tripsacum dactyloides. This search failed to find any EST matching indicating that our putative hit is unlikely to be transcribed in mRNA. Finally, we used a region of 50 amino acids of Tripsacum dactyloides mtDNA surrounding the consensus sequence to perform a multiple alignment with corresponding regions of inositol phosphate kinase (IPMK, IP6K, IP3-3K) from different organism using ClustalW2. Our sequence resulted to be an outsider, thus that it does not belong to any subgroup of kinase composing the IPK gene family.

3

Discussion and Conclusions

Inositol hexaphosphate kinase (IP6K) catalyzes the conversion of IP6 to IP7 using ATP as phosphate donor. It belongs to an inositol polyphosphate kinase superfamily, the IPKs (Pfam PF03770), that evolved from a common ancestor. The inositol pyrophosphate IP7 is present in all eukaryotic cells analyzed thus far, from amoeba to man; it is not surprising that the enzyme responsible for its synthesis is highly conserved through evolution. In fact, after the first IP6K purification from rat brain [21], the enzyme was cloned in other mammalians, and its high evolutionary conservation was regularly observed, which facilitated the identification and cloning of IP6K enzymes from distant organisms, including yeast and the amoeba Dictyostelium [11]. It is notable that Dictyiostelium diverted from the evolutionary mainstream after the diversion of yeast but before the splitting between animals and plants [10]. Furthermore, the only IPK gene present in the ancient eukaryote diplomate Giardia lambia has been demonstrate to be a IP6K [9]. Thus, on the basis of evolutionary considerations, IP6K is expected to be found also in vegetal organisms. Moreover, pyrophosphate IP7 is present in vegetal organisms, and IP6-kinase activity has been demonstrated in plants. However, bioinformatics analysis failed to identify any IP6 kinase in the complete Arabidopsis thaliana nuclear genome. We hypothesized that IP6K gene actually occurs nested in vegetal mtDNA, where

57

ISBRA 2010 SHORT ABSTRACTS

more frequently phenomena enlarging protein variability do occur. Tags identification in mtDNA could indicate the presence of IP6K gene, even if not in a canonic form. Indeed, trans splicing mechanisms could compact a gene consisting of more segments dislocated in different mtDNA regions, and editing phenomena could account for the failure of homology searches. In fact, editing mechanism could generate RNA molecules much different from DNA producing them, so that DNA sequence can be not immediately referable to IP6K gene in its transcript. We searched for a specific IP6K tag within all available vegetal mtDNA sequences using DLSME, a very flexible system for motif discovery, allowing for dealing with genetic code degeneration and possible occurrences of editing events. Our search revealed several tags in mtDNA of examined plant, but an accurate analysis of sequences surrounding the consensus motifs led us to conclude that our hit does not belong to the IPK gene family. The P-XXX-D-X-K-X-G consensus sequence is a characterizing motif of IP kinases, and it was found in all members of the family. Our search failed to find any sequence containing the tag ascribable to IP6K gene therefore we can conclude that IP6K gene is not present in plant mtDNA. As the future work, we plan to extend the search of the IP6K tag on the nuclear genome of plants.

References 1. S.F. Altschul and et al. Gapped blast and psi-blast: a new generation of protein database search programs. NAR, 25(17):3389–3402, 1997. 2. M. Bennett, S.M. Onnebo, C. Azevedo, and A. Saiardi. Inositol pyrophosphates: metabolism and signaling. Cell Mol Life Sci., 63:552–564, 2006. 3. R. Bhandari, A. Saiardi, and Y. Ahmadibeni et al. Protein pyrophosphorylation by inositol pyrophosphates is a posttranslational event. Proc Natl Acad Sci U S A, 104(39):15305–15310, 2007. 4. C.A. Brearley and D.E. Hanke. Inositol phosphates in barley (hordeum vul. l.) aleurone tissue are stereochemically similar to the products of breakdown of insp6 in vitro by wheat-bran phytase. Bioch. J., 318(1):279–286, 1996. 5. M. Takenaka et al. The process of rna editing in plant mitochondria. Mitochondrion, 8:35–46, 2008. 6. F. Fassetti, G. Greco, and G. Terracina. Mining loosely structured motifs from biological data. IEEE Trans. Knowl. Data Eng., 20(11):1472–1489, 2008. 7. S. Flores and C.C. Smart. Abscisic acid-induced changes in inositol metabolism in spirodela polyrrhiza. Planta, 211:823–832, 2000. 8. M.A. Larkin, G. Blackshields, and N.P. Brown. Clustalw and clustalx version 2. Bioinf., 23(21):2947–2948, 2007. 9. A.J. Letcher, M.J. Schell, and R.F. Irvine. Do mammals make all their own inositol hexakisphosphate? Biochem J., 416(2):263–270, 2008. 10. W.F. Loomis and D.W. Smith. Consensus phylogeny of dictyostelium. Experientia, 51(12):1110–1115, 1995. 11. H.R. Luo, Y.E. Huang, and J.C. Chen et al. Inositol pyrophosphates mediate chemotaxis in dictyostelium via pleckstrin homology domain-ptdins(3,4,5)p3 interactions. Cell, 114(5):559–572, 2003. 12. B.H. Morrison, J.A. Bauer, and J. Hu et al. Inositol hexakisphosphate kinase 2 sensitizes ovarian carcinoma cells to multiple cancer therapeutics. Oncogene, 21(12):1882–1889, 2002. 13. E. Nagata, H.R. Luo, and A. Saiardi et al. Inositol hexakisphosphate kinase-2, a physiologic mediator of cell death. J Biol Chem, 280(2):1634–1640, 2005. 14. J.D. Palmer, K.L. Adams, and Y. Cho et al. Dynamic evolution of plant mitochondrial genomes: mobile genes and introns and highly variable mutation rates. Proc Natl Acad Sci U S A, 97(13):6960–6966, 2000. 15. P. Rice, I. Longden, and A. Bleasby. Emboss: the european molecular biology open software suite. Trends in Genetics, 16(6):276–277, 2000. 16. A. Saiardi, J.J. Caffrey, S.H. Snyder, and S.B. Shears. The inositol hexakisphosphate kinase family. catalytic flexibility and function in yeast vacuole biogenesis. J Biol Chem, 275(32):24686–24692, 2000. 17. A. Saiardi, H. Erdjument-Bromage, and A.M. Snowman et al. Synthesis of diphosphoinositol pentakisphosphate by a newly identified family of higher inositol polyphosphate kinases. Curr Biol, 9(22):1323–1326, 1999. 18. A. Saiardi, A.C. Resnick, and A.M. Snowman. Inositol pyrophosphates regulate cell death and telomere length through phosphoinositide 3-kinase-related protein kinases. PNAS, 102:1911–1914, 2005. 19. A. Saiardi, C. Sciambi, and J.M. McCaffery. Inositol pyrophosphates regulate endocytic trafficking. PNAS, 99:14206– 14211, 2002. 20. J. Stevenson-Paulik, A.R. Odom, and J.D. York. Molecular and biochemical characterization of two plant inositol polyphosphate 6-/3-/5-kinases. J Biol Chem, 277:42711–42718, 2002. 21. S.M. Voglmaier, M.E. Bembenek, and A.I. Kaplin et al. Purified inositol hexakisphosphate kinase is an atp synthase: Diphosphainositol pentakisphosphate as a high-energy phosphate donor. PNAS, 15:4305–10, 1996. 22. H.J. Xia, C. Brearley, and S. Elge et al. Arabidopsis inositol polyphosphate 6-/3-kinase is a nuclear protein that complements a yeast mutant lacking a functional argr-mcm1 transcription complex. Plant Cell, 15:449–463, 2003.

58

ISBRA 2010 SHORT ABSTRACTS

Reconstruction of HCV quasispecies haplotypes from 454 Life Science reads Irina Astrovskaya ? , Kelly Westbrooks and Alexander Zelikovsky Department of Computer Science, Georgia State University, Atlanta, GA 30303, and Life Technologies [email protected], [email protected], [email protected]

Abstract. Understanding how the genomes of viruses mutate and evolve within infected individuals requires reconstruction of viral variants (quasispecies). Previously, we have successfully reassembled 48 Hepatitis C virus quasispecies (1700 bp long E1E2 region) from simulated 454 Life Science reads. Our method builds a read graph including all possible haplotypes, selects the most probable haplotypes and finds maximum likelihood estimates of their frequencies using EM. This paper attempts to process 100K 454 Life Sciences reads from the 5200bp long E1E2 region of the HCV with the average lengths of 300bp and 600bp. Significant deviation of reads from the reference strain are cause by both mutation and genotyping errors. We estimate that about 48% of all deletions and 28% of all insertions in the reads are attributed to the genotyping errors. Additionally, 5%-7% of alleles are undetermined. We propose an imputation method that uses alleles of the neighbored SNPs being highly correlated with a SNP containing undetermined allele. The preliminary results show that our imputation method can not only correctly impute the value for allele but also find potential substitution genotyping error. If we allow only a perfect match in read overlapping, we would not have enough power to explain all reads. So we suggest to cluster quasispecies by allowing mismatches in overlapping during construction of the graph. Once the graph is constructed, we select the candidate set and apply EM to them. Additionally, we use EM estimates to find the probability of how many original quasispecies were not captured by our candidate selector methods. Finally, we report the results of reconstructing quasispecies from the 5.2Kb long E1E2 region of Hepatitis C Virus.

?

Partially supported by GSU Molecular Basis of Disease Fellowship.

59

ISBRA 2010 SHORT ABSTRACTS

Genome Completion by SOLiD(TM) System Next-Generation Sequencing Dumitru Brinza, Fiona Hyland, and Asim Sidiqqui Life Technologies, 850 Lincoln Center, Foster City, CA, 94404 [email protected]

Abstract. Ultra-high throughput next-generation sequencing technologies, such as the Applied Biosystems SOLiD(TM)1 platform, provide the ability to sequence genomes rapidly and cheaply. These technologies are widely used in re-sequencing projects for detection of genetic variations between sequenced genome and an existing reference genomes. Knowing sequence of the variants tremendously contribute to detection and study of genetic markers causing disease or involved in certain phenotypic outcomes. While there exist methods for SNP calling, short indel detection/ reconstruction, and long indel detection, the reconstruction of novel (SNP prone, large indel) regions remains challenging. Here, we present a new Assembly assistant for SOLiD(TM) platform (ASiD) method for accurate and efficient assembly of large part of novel sequence from short paired reads generated by SOLiD(TM) platform. ASiD can be also applied to reconstruct sequence between adjacent contigs in scaffolds generated by de novo assembly, that significantly increases overall contigs length. For the HuRef genome, sequenced at 25x coverage using 1.3 Kb matepaired libraries and 50 bp long SOLiD(TM) system reads, the method reconstructs 55% of the expected novel sequence which is not present in NCBI Human Reference (HG18) (