Inferring gene annotations in Gene Ontology from gene expression data

3 downloads 0 Views 94KB Size Report
Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics, 21:3896–3904, 2005. [16]Olga Troyanskaya, Michael Cantor, ...
Vol. 00 no. 00 2006 Pages 1–6

BIOINFORMATICS

Inferring gene annotations in Gene Ontology from gene expression data Xuerui Wang, David Kulp, Andrew McCallum Department of Computer Science, 140 Governors Drive, University of Massachusetts, Amherst, MA 01003, USA

ABSTRACT Motivation: The Gene Ontology (GO) project develops a standard way to describe gene products in terms of their associated biological processes, cellular components and molecular functions. However, it is far from complete. Due to lacking biological knowledge or other technical difficulties, many gene products do not have GO annotations, and the annotations for many other gene products are not specific enough to be useful. Previous studies show that, to some extent, both sequence similarity and expression similarity indicate functional relationship. The ability to infer unknown annotations or to refine known (but not specific) annotations is fundamentally important in discovery of various new biological knowledge. Results: In this paper, we focus on gene expression data and biological process annotations. The present work describes a new probabilistic framework based on the GO hierarchy where specific (biological process) annotations of genes are inferred from gene expression data. We apply the method to the gene expression data of [5] and the GO annotations from Saccharomyces Genome Database (SGD). We show the results in predicting (or refining) already known GO annotations under leave-1-out cross-validation. By comparing the results with previous non-generative models (such as kNN), we also conclude that genes having close annotations to each other in GO are not necessarily more similar in expression data than genes are further away. Additionally, we propose a new rank-based distance metric which is provably more robust than well-known metrics for gene similarity. Availability: Available upon request. Contact: [email protected]

1 INTRODUCTION Rapid advancements in high-throughput methods for measuring levels of gene expression for tens of thousand of of genes in parallel have led to revolutionary changes in bioinformatics research. The abundance of available microarray data enables us to conduct large-scale inference using computational approaches regarding the roles of genes and relations between genes in cellular context. Biological knowledge of gene and protein role in cells is being accumulated in an explosive fashion. To address the

© Oxford University Press 2006.

need for consistent descriptions of gene products, the Gene Ontology (GO) project [4], among other efforts using controlled vocabularies, organizes biological terms into three ontologies: cellular component, biological process, and molecular function. The GO annotations are very powerful in searching all available aggregated biological knowledge regarding specific topics, for example, finding all the gene products that are involved in bacterial protein synthesis when searching for new targets for antibiotics [4]. However, the Gene Ontology is far from complete. Many genes discussed in the literature do not have GO annotations due to lagged updates and many genes do not have specific enough annotations because of lack of related biological knowledge. There is a great need to automatically identify unknown annotations or refine known (but not specific) annotations, and thus to accumulate and discover biological knowledge in a more systematic way. The huge amount of gene express data make it possible to conduct gene annotation inference computationally. Previous studies showed that genes with similar expression profiles (under some conditions) have similar annotations in GO [9, 7, 2, 10, 8, 17]. However, it is not yet clear whether, in the DAG structure of the Gene Ontology, genes that have close annotations to each other are more similar in expression than genes are further away. To computationally prove/disprove the validity of the above statement is also an objective of this paper. The GO annotations are organized in a species-independent manner. The yeast genome has been well studied and thus its GO annotations are relatively accurate and complete. To make it easier to assess the methods, in this paper, we will focus on the Saccharomyces Genome Database (SGD) annotations. Among the three GO ontologies (biological processes, cellular components and molecular functions), for our analysis of gene expression data, we only focus on Biological Process in this paper. Molecular Function and Cellular Location are conventionally assumed to be more amenable to sequencebased comparisons. For example, conserved motifs indicate how the protein interacts with other molecules, but not what a biological process suggests. Similarly, common molecular

1

Wang, Kulp and McCallum

function and cellular location are assumed to not necessarily be co-expressed. On the other hand, biological processes may have reasonably high-correlated gene expression level. Several previous studies [7, 2] also share this observation.

2 MATERIALS In all experiments conducted in this paper, we use the gene expression data collection described in [5], available on-line at http://genome-www.stanford.edu/yeast_stress/data.shtml. The data represent the normalized, background-corrected log2 values of the Red/Green ratios measured on the DNA microarrays. All missing values in the dataset are estimated by us using KNNimpute [16] with 17 nearest neighbors under Euclidean distance metric. In total, we utilize the gene expression data of 6,152 genes under 173 experimental conditions. We take the annotations in the SGD database distributed on February 15, 2006, and for consistent consideration, the DAG structure of the Gene Ontology is directly constructed from the ontology definitions on February 15, 2006. For simplicity, we do not distinguish “is_a” and “part_of” relationships. We assume that the annotations also follow the “true path rule”, that is, if a gene is annotated at a node, it is implicitly annotated at all nodes in the pathway from this node all the way up to its top-level parent(s). The term GO:0000004 (biological process unknown) was removed since it is not a real annotation. Many GO nodes have no genes or only a few genes associated with them. It is often the case that these nodes are not well studied. For simplicity, only the biological process GO terms associated with more than 10 genes in the above dataset are considered. In this way, we end up with 725 GO terms. In comparison with previous studies (e.g., [17]), we study the problem in a much larger scale (725 vs. 48 GO terms).

3 METHODS We propose a Bayesian framework to conduct annotation inference in GO from gene expression data. Unlike the previous methods, the framework explicitly incorporates the knowledge about the structures of the gene taxonomy into consideration by assuming that genes that are closer in GO have similar expression data. Thus, we not only offer a new way to infer gene annotations in GO, but validate the assumption indirectly by comparing it with other models which do not take such an assumption. The DAG structures of GO and the fact that a gene may have multiple most specific annotations make it possible that there are generally multiple paths from the root to a gene’s most specific annotation(s). Traditional data-driven clustering techniques do not respect either the GO structure or multiple annotations, and we need a model that take both into account. In our framework, we associate a probabilistic distribution over gene expression profiles with each node in GO. On one hand, we want to learn a good model at each node to

2

fit the associated data. One the other hand, to observe the GO hierarchical structures, the distribution distance between neighboring nodes should be small, more exactly, the distributions with neighboring nodes should be close. Instead of the ad-hoc distance metrics used in hierarchical clustering, a natural choice of distance measure between distributions is the Kullback-Leibler (KL) divergence, as used in the probabilistic abstraction hierarchies (PAH) model [14]. Unlike in our framework in which the “true path rule” of annotations is followed, the PAH model generates a data points only from one leaf node of a hierarchy. There is a trade-off between fitting individual distributions and making neighboring node close in distribution. We introduce a penalty factor λ for distance between distributions of neighboring nodes, and treat the distance penalty as a prior of the hierarchy. Notation used in the model is shown in Table 1. Our model also share some insight of Bayesian Hierarchical Clustering (BHC) [6], which constructs a binarytree-structured hierarchy using Bayesian hypothesis testing, without specifying a distance metric. The BHC model also generates data points at every node in the path from the most specific cluster to the root. However, any data point, in the BHC model, is only allowed to be associated with only one leaf cluster. The PAH model mainly focuses on learn a better treestructured hierarchy instead of incorporating an existing DAG-structured hierarchy like GO. For the purpose of inferring gene annotations, our framework is more flexible than the original PAH model in that we can ask multiple annotations at arbitrary levels instead of one annotation at the leaf node level. More importantly, our framework defines a probabilistic model of the data which allows us to ask questions like how well the GO structures match the gene expression data. It can also be used to calculate the probability of a gene belonging to any node in the DAG structures. When we have more evidence on other genes that share the similar expression profile of the test gene, we are able to infer more specific annotation(s) for it; otherwise, only less specific annotation(s) are possible. A relatively general GO term might be the most specific annotation only for a few genes that may have very different gene expression profiles, as reported in [11]. By allowing a gene’s expression value to be generated at all nodes in the pathway from its most specific annotation to the root, our model naturally avoids the danger of overfitting and focuses on all genes having corresponding annotations. We learn the individual distributions at each node by maximum likelihood estimation and simultaneously make distributions at close node close (smaller KL divergence in this paper). So we want to maximize that P (GO|Data) ∝ P (Data|GO)P (GO). Or equivalently,

Inferring gene annotations in Gene Ontology from gene expression data

Table 1. Notation used in this paper

SYMBOL G N D E Mn Gn Sn |Sn | eg ag gn λ ρ GO Data

DESCRIPTION number of genes that have annotations in GO number of nodes in the GO hierarchy number of experimental conditions all edges in the GO hierarchy model (distribution) associated with the nth node in the GO hierarchy number of genes having annotations in node n set of nodes that are neighbors of node n number of nodes that are neighbors of node n expression of gene g under all experiments node(s) with the most specific annotation(s) of gene g number of genes at node n penalty factor for distribution distance a distance measure between the distribution of two nodes (using symmetric KL distance) M1 , M2 , ..., MN e1 , e2 , ..., eG

J = logP (Data|GO) + logP (GO) =

G X

X

logP (eg |Mag ) −

g=1

λρ(Mi , Mj )

(i,j)∈E

Next we define the concrete distribution at each node Mi and the distance measure ρ. The dataset is a G × D matrix which encodes the expression level measurements of G different genes under D different experimental conditions. For our purpose, we need a continuous distribution over