A model system for studying the integration of molecular biology ...


BIOINFORMATICS

A model system for studying the integration of molecular biology databases

Abstract

Motivation: Integration of molecular biology databases remains limited in practice despite its practical importance and considerable research effort. The complexity of the problem is such that an experimental approach is mandatory, yet this very complexity makes it hard to design definitive experiments. This dilemma is common in science, and one tried-and-true strategy is to work with model systems. We propose a model system for this problem, namely a database of genes integrating diverse data across organisms, and describe an experiment using this model.

Results: We attempted to construct a database of human and mouse genes integrating data from GenBank and the human and mouse genome-databases. We discovered numerous errors in these well-respected databases: ∼15% of genes are apparently missing from the genome-databases; links between the sequence and genome-databases are missing for another 5–10% of the cases; about a third of likely homology links are missing between the genome-databases; 10–20% of entries classified as ‘genes’ are apparently misclassified. By using a model system, we were able to study the problems caused by anomalous data without having to face all the hard problems of database integration.

Contact: [email protected]

Introduction

Integration of molecular biology databases is critically important because of the interconnectedness of biological research. This view was stated forcefully by a body of experts convened by the Department of Energy (DOE) about 5 years ago, in April 1993, whose major conclusion was that successful data management for the Human Genome Project requires integrated databases (Robbins et al., 1993). ‘Achieving this’, they said, ‘must be a top priority.’ Fueled in part by this recommendation, database integration has grown into a large area of research.
Major projects and systems include Entrez (Schuler et al., 1996; NCBI, 1998a), GeneCards (Rebhan et al., 1997a,b), IGD (Ritter, 1994; Link, 1998), Kleisli (Buneman et al., 1995; Crabtree, 1998), Morphase (Davidson and Kosky, 1997; Harker, 1998), OPM (Chen and Markowitz, 1995; Chen et al., 1996) and SRS (Etzold et al., 1996, 1998). An excellent overview of the field appears in Markowitz and Ritter (1995).

Molecular biology data are presently stored in about two dozen major databases and hundreds of others (Keen et al., 1992; Burks and Redgrave, 1993; Ashburner and Goodman, 1997; Martin, 1998). Some databases are organized by data type, e.g. the International Nucleic Acid Sequence Data Library (a.k.a. GenBank) (EBI, 1996; Benson et al., 1997; DDBJ, 1998; NCBI, 1998b) is responsible for nucleic acid sequences, SWISS-PROT (SWISS-PROT, 1998) is the major repository for protein sequences, NAD (NAD, 1998) stores structures of nucleic acid sequences, PDB (PDB, 1998) handles protein structures, and so forth. Other databases are organism specific; these include GDB (GDB, 1998) and OMIM (OMIM, 1998) for human, MGD (MGI, 1996) for mouse, PigBASE (PiGBASE, 1997) for pig, ZFIN (ZFIN, 1998) for zebrafish, FlyBase (FlyBase, 1998) and Encyclopaedia of Drosophila (EofD, 1996) for Drosophila, ACeDB (ACeDB, 1998) for Caenorhabditis elegans, AtDB (AtDB, 1998) for Arabidopsis, MaizeDB (MaizeDB, 1998) for corn, SGD (SGD, 1998) for yeast (Saccharomyces cerevisiae), ECDC (ECDC, 1997) for Escherichia coli, and many, many others. Still other databases focus on particular research topics. This is an eclectic group including databases of mutations in human disease genes [e.g. p53 (IARC, 1998) and the Human Gene Mutation Database (HGMD, 1997; Krawczak and Cooper, 1997)], databases devoted to individual genes [e.g. M6P/IGF2R (Jirtle, 1998)], databases about specific protein families [e.g. the kinesins (Greene and Henikoff, 1998) and the G protein-coupled receptors (GPCRDB, 1998; Horn et al., 1998)], databases of transcription factors and their binding sites [e.g. TRANSFAC (Transfac-Team, 1998)], and many more. Most of the large data production centers also operate their own databases [e.g. TIGR (TIGR, 1998a)]; these data are generally duplicated in the official, community databases, but the center databases are often more up to date. Eric Lander, in a recent commentary, proposed eight new goals to succeed the Human Genome Project, all of which envision the production of massive new datasets (Lander, 1996); if even a few of Lander’s proposals come to fruition, the community will soon be faced with many more databases.


J.Macauley, H.Wang and N.Goodman

Despite its importance, database integration lags in practice. We suspect that most working biologists experience database integration through the Entrez and SRS systems operated by NCBI and EBI, respectively. Entrez provides two forms of integration. One is to link, for a given gene in a given organism, the database entries for the gene’s nucleic acid sequence, protein sequence, structure (when known), genomic position and literature citations. Entrez also links each data element to pre-computed ‘neighbors’ of the same type, e.g. it links protein sequences that show a high degree of sequence similarity. SRS links a much wider set of databases (∼80 at last count), but does little to integrate the data so linked; e.g. one can follow a link from a protein sequence to the database entry for the corresponding gene in the relevant organism-specific database, but one cannot easily collect organism-specific data across all organisms in which the gene has been studied. SRS also lacks Entrez’s neighboring capability. The newly emerging sequence-cluster databases (sometimes called gene-index databases), viz. UniGene (NCBI, 1998c), the TIGR Gene Index (TIGR, 1998b) and STACK (Miller et al., 1997), provide a different kind of integration by coalescing all nucleotide sequences for a given gene in a given organism into a single database entry; although only available for human, at present, this approach seems quite useful and will likely spread. Homology databases, such as XREFdb (Bassett et al., 1997, 1998) and HOVERGEN (Duret et al., 1994; Duret, 1998), integrate along another dimension by linking sequences that are putatively homologous. These systems play an invaluable role in the day-to-day practice of biology, and they tell a powerful story about the importance of database integration, but they fall far short of a total solution. Database integration is a multi-faceted problem involving technological, scientific and social issues. 
One key aspect, and the subject of much current research, is to develop software technology to effect the desired integration. Another essential issue is to devise a means of identifying related data elements, and a method of linking or combining these elements (Fasman, 1994). The handling of related data elements must be scientifically well founded in order for the integration to be scientifically meaningful (Karp, 1995; Schulze-Kremer, 1997). Real databases contain errors and omissions, and the integrated system must be able to cope gracefully with these anomalies, from both technological and scientific perspectives. For integration to be feasible in practice, the people who develop and operate the constituent databases must be eager participants; their assistance is needed on a technical level, e.g. to support the interfaces needed by the integrated system, and on an institutional level, e.g. to grant licenses for use of their data in the integrated system. One way to accelerate progress is to define a model system, i.e. a specific integrated database, and encourage researchers


to demonstrate their methods on this model. A good model system should be rich enough to exhibit all facets of the problem, but at the same time, should admit useful partial solutions that address a subset of the issues. We are inspired by the success of this approach in the structure prediction field, where the CASP contests have become an annual showcase (CASP, 1998). In this paper, we propose a model system for database integration, and we present results from a simple experiment using the model. The experimental data were obtained about a year ago, in the spring of 1997. If the experiment were repeated today, the precise quantitative results would be different. The methods and general conclusions remain valid.

The model system

We envision an integrated database of genes. For a given gene in a given organism, the database would ‘horizontally’ link sequence, structure, map position and phenotype, and would ‘vertically’ link related elements of the same type that pertain to other genes (in the same or other organisms). In addition, for each data element, the database would contain links to the literature.

A major issue is to define what we mean by ‘related element’. For present purposes, we propose that sequences and structures be linked on the basis of putative homology, map positions on the basis of synteny conservation, and literature citations by lexical neighboring (as in Entrez) or semantic similarity. We leave open the means of determining these links, as this is a key aspect to be addressed by each investigator who uses the model. For nucleic acid sequences, we imagine that some investigators would pursue UniGene-style clustering of similar sequences. We imagine that in the fullness of time, other notions of relatedness could be added, e.g. one might link genes involved in the same pathway, or ones whose expression patterns are correlated.

Another hard problem that we leave open is the handling of allelic variation, alternative splicing and similar phenomena. It may be better to start from the classical genetics definition of a gene (‘a complementation group of alleles’), rather than the molecular biology definition (‘a transcribed segment of the genome resulting in a functional molecule’) implicit in our discussion so far; we hope that some investigators might pursue this alternative.

This model system exhibits, but also separates, the hard aspects of database integration. The horizontal links exercise the software technology for integration without introducing difficult scientific questions. Depending on which databases are integrated, the horizontal links can also be used to stress the social issues of cooperation.
The vertical links introduce hard scientific questions, but for many data types introduce no new hurdles regarding the mechanics of integration. The model is readily extensible: it is easy to imagine adding other

Model system for database integration

types of data that are naturally associated with genes, including those of protein mass spectrometry, gene expression, regulation and pathways, although it might be better to move to a less gene-centric model before going too far down this path. It is reasonable to regard the existing Entrez and SRS systems as incarnations of the model, which demonstrates the utility of partial solutions.

The experiment: a database of human and mouse genes

Overview

We set out to develop a database for a subset of the model covering human and mouse genes. We wanted the database to contain sequences, map positions, phenotypic data and literature citations (but we were not so ambitious as to include structures). We expected to obtain a definitive list of mouse and human genes from MGD and GDB, respectively. We expected to get map and phenotype data from these databases, as well, and to get nucleotide and protein sequences from GenBank. For nucleic acid sequences, we felt it important to employ sequence clustering, because of the abundance of redundant sequences for these organisms. Although several sequence-cluster databases exist for human, there were none available for the mouse at the time we did this work, so we decided to build our own. We used the same method to build a database for human, as well, to ensure consistency of methods and to provide a means of testing our methods. We also wanted the database to contain vertical links connecting putatively homologous genes across these organisms. For sequences, we expected to determine homology by sequence similarity in the obvious way. For genome-database entries, we expected to obtain a definitive list of known mouse/human homologies and the mouse/human synteny conservation map from MGD, since MGD is the pre-eminent resource for this information. [Although GDB also contains such information, it uses MGD as its major source (Fasman et al., 1997).]
We wanted the database to support the following sorts of queries: given a sequence identifier (accession number), find the cluster to which it belongs and the associated gene; given a gene, find the associated cluster and all its sequences; given a gene, find its homolog in the other species; given a sequence identifier or gene, retrieve its entry in the appropriate constituent database; given a gene, retrieve its map position; given a map position in one genome, find the predicted conserved position in the other genome; given a map region, retrieve all genes in that region; and simple combinations of the above. Our goal was to study scientific aspects of database integration, not technological ones, and we adopted a very simple implementation strategy. We planned to download static snapshots of the relevant data, extract what we needed,

and store the information in a database designed for this specific purpose. We expected to use LabBase (Goodman et al., 1998) as our data management system, which is well suited for such purposes. Our plan for data access was to implement ‘canned queries’ in Perl (Stein, 1996; Wall et al., 1996) that would run against the LabBase database. As we began to compile the database, we became aware of apparent anomalies in the data. Some of our ‘favorite genes and homologies’ were missing from the databases. Summary information on the MGD World Wide Web site indicated that only a quarter of mouse genes had human homologs, a finding that surprised us, since conventional wisdom suggests that genes cloned in one organism are soon re-cloned in the other. These observations led us to a detailed study of gene and homology data in MGD and GDB, and a comparison of these databases with GenBank. The results of our study showed the following: (i) many entries classified as ‘genes’ in the genome-databases are not associated with transcribed sequences of mammalian origin; (ii) many likely mouse/ human homologs are absent from these databases; (iii) many mouse and human genes that are present in GenBank are absent from the genome-databases; and (iv) although many links between the sequence and genome-databases are absent, the number is not so large as to pose a major impediment to integration. In the end, we chose to pursue the study and have not yet completed the database.
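The ‘canned queries’ listed in the Overview can be illustrated with a small in-memory sketch. This is not the LabBase/Perl implementation that was planned; the class name, field names, accession numbers, gene symbols and map positions below are all hypothetical and serve only to show how horizontal links (sequence to cluster to gene to map position) and vertical links (gene to homolog) might be navigated.

```python
from dataclasses import dataclass, field


@dataclass
class IntegratedGeneIndex:
    """Toy stand-in for the planned database (hypothetical schema)."""
    cluster_of: dict = field(default_factory=dict)    # accession -> cluster id
    gene_of: dict = field(default_factory=dict)       # cluster id -> gene symbol
    homolog_of: dict = field(default_factory=dict)    # gene symbol -> homolog symbol
    map_position: dict = field(default_factory=dict)  # gene symbol -> (chromosome, position)

    def gene_for_accession(self, accession):
        """Sequence identifier -> (cluster, gene); None marks a missing link."""
        cluster = self.cluster_of.get(accession)
        return cluster, self.gene_of.get(cluster)

    def sequences_for_gene(self, gene):
        """Gene -> all accessions in the clusters linked to that gene."""
        clusters = {c for c, g in self.gene_of.items() if g == gene}
        return sorted(a for a, c in self.cluster_of.items() if c in clusters)

    def homolog(self, gene):
        """Gene -> its recorded homolog in the other species, if any."""
        return self.homolog_of.get(gene)

    def genes_in_region(self, chromosome, start, end):
        """Map region -> all genes positioned inside it."""
        return sorted(g for g, (chrom, pos) in self.map_position.items()
                      if chrom == chromosome and start <= pos <= end)


# Invented example data.
index = IntegratedGeneIndex(
    cluster_of={"ACC0001": "mouse_c1", "ACC0002": "mouse_c1", "ACC0003": "human_c7"},
    gene_of={"mouse_c1": "GeneA_mouse", "human_c7": "GENEA_HUMAN"},
    homolog_of={"GeneA_mouse": "GENEA_HUMAN", "GENEA_HUMAN": "GeneA_mouse"},
    map_position={"GeneA_mouse": ("11", 40.0), "GENEA_HUMAN": ("17", 12.5)},
)
```

Combinations of queries then reduce to chaining these lookups, for example `index.homolog(index.gene_for_accession("ACC0002")[1])` to go from a mouse sequence to the human gene.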

Methods

Determining numbers of genes in MGD and GDB. For MGD, we obtained entries from the genes/markers/phenotypes page on the Mouse Genome Informatics (MGI) homepage by retrieving a comprehensive list of everything under the classification of ‘genes’, and removing ‘withdrawn’ symbols. For GDB, we obtained a list of gene entries in a similar manner. We randomly selected 100 entries from each database and classified entries into six categories by manually examining the entry and associated literature. These categories included: functional, transcribed genes of mammalian origin; pseudogenes (inactive but stable components of the genome derived by mutation of an ancestral active gene); phenotypes for which the associated gene was still uncloned; endogenous viruses (genetic elements arising from the insertion of viral sequences into the genome); cryptic entries which were impossible to define based on the information (or lack thereof) provided in the databases; and entries that represent complexes of genes rather than single genes. The sample size (100) was chosen based on casual observations indicating that approximately half the entries would fall into the first category. A simple binomial statistic was used to define the 95% confidence interval (CI) for all groups analyzed as described below.
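The text states only that ‘a simple binomial statistic’ was used for the confidence intervals. One plausible reading, assumed here rather than confirmed by the paper, is the normal-approximation (Wald) interval, whose half-width for a sample proportion p from n entries is 1.96 × sqrt(p(1 − p)/n). The function name is hypothetical.

```python
import math


def ci_half_width(count, sample_size, z=1.96):
    """Half-width of a normal-approximation 95% CI, in entries per sample.

    Assumption: Wald interval; the paper does not name the statistic it used.
    """
    p = count / sample_size
    return z * math.sqrt(p * (1 - p) / sample_size) * sample_size


# 60 of 100 and 75 of 100 sampled entries classed as transcribed mammalian genes.
mgd_half_width = round(ci_half_width(60, 100))
gdb_half_width = round(ci_half_width(75, 100))
```

Under this assumption, 60 of 100 gives a half-width of about 10 entries and 75 of 100 gives about 8. An exact (Clopper-Pearson) interval would be a reasonable alternative for the small counts.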


Determining the number of gene entries in GenBank. We also estimated the number of mouse and human gene sequences that are present in GenBank. We extracted all mouse and human protein translations from GenBank into separate datasets for each species, and clustered the entries in each dataset using methods similar to those published elsewhere (Bleasby and Wootton, 1990; Houlgatte et al., 1995; Schuler et al., 1995): specifically, sequences sharing at least 95% identity with a minimum overlap of 30 amino acid residues were placed in the same cluster. For each cluster, we programmatically compared the accession numbers of the constituent GenBank entries to those found in MGD and GDB, and considered a cluster to be present in the corresponding genome-database if the accession number of any constituent was found there, and absent otherwise. We did not further analyze clusters deemed to be present in their respective genome-databases. For clusters deemed to be absent from their respective genome-databases, we randomly selected 100 clusters and performed sequence, database and literature searches in an attempt to confirm that the gene represented by the cluster was truly absent from the genome-database. Those genes that were found in the genome-database at this stage were classified as (i) present in the genome-database but without a link to the sequence database, or (ii) present in the genome-database with a link to the sequence database missed by our automated analysis. Those genes that we could not find by this method were further classified as (iii) absent from the genome-database and unpublished, or (iv) absent from the genome-database and published.
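The clustering rule and the accession-based presence test described above can be sketched as follows. The sketch assumes that pairwise comparisons have already been computed and are supplied as (accession, accession, identity, overlap) tuples, and it assumes single-linkage merging via union-find; the paper does not specify the linkage, and all function names and accession numbers are hypothetical.

```python
from collections import defaultdict

MIN_IDENTITY = 0.95  # at least 95% identity (from the text)
MIN_OVERLAP = 30     # minimum overlap in amino acid residues (from the text)


def cluster_sequences(accessions, hits):
    """Single-linkage clustering (an assumption) over precomputed pairwise hits."""
    parent = {a: a for a in accessions}

    def find(a):
        # Walk to the root, compressing the path along the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for a, b, identity, overlap in hits:
        if identity >= MIN_IDENTITY and overlap >= MIN_OVERLAP:
            parent[find(a)] = find(b)

    clusters = defaultdict(set)
    for a in accessions:
        clusters[find(a)].add(a)
    return list(clusters.values())


def split_by_presence(clusters, genome_db_accessions):
    """A cluster is 'present' if any member accession appears in the genome-database."""
    present = [c for c in clusters if c & genome_db_accessions]
    absent = [c for c in clusters if not (c & genome_db_accessions)]
    return present, absent


# Invented example: P2/P3 overlap too little, P3/P4 are too dissimilar.
example_hits = [("P1", "P2", 0.98, 120), ("P2", "P3", 0.97, 25), ("P3", "P4", 0.90, 200)]
example_clusters = cluster_sequences(["P1", "P2", "P3", "P4"], example_hits)
example_present, example_absent = split_by_presence(example_clusters, {"P2"})
```

Single linkage can chain loosely related sequences into one cluster, so a stricter linkage is a reasonable alternative design choice.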

Classification of homologs. We examined all genome-database entries that fell into our first category (transcribed genes of mammalian origin) with respect to the identification of mouse/human homologs. We classified an entry as homolog positive if a mouse/human homolog was identified for that entry in its database, and homolog negative otherwise. We did not further analyze homolog-positive entries. For homolog-negative entries, we performed sequence, database and literature searches in an attempt to find likely homologs. We classified an entry as a false homolog negative if we could find an almost identical sequence in the other species [>95% amino acid sequence identity using BLASTX or BLASTP (Altschul et al., 1990)] with supporting tissue expression data or map location. We classified an entry as a true homolog negative if we could not find a likely homolog using this method.

Results

Number of genes in MGD and GDB. Initially, we set out to retrieve all gene entries in MGD and GDB. Based on our criteria, the numbers of gene entries in both databases were somewhat lower than we expected. We found 11 662 entries in MGD under the classification of ‘genes’; removing ‘withdrawn’ symbols reduced this to 7943 entries. Using a similar process, we found 6781 gene entries in GDB. We randomly selected 100 entries from each database for further analysis. From our random sample of 100 entries, it was immediately apparent that only a subset of the entries were supported in the literature as bona fide mammalian genes. As shown in Table 1, 60 out of 100 ‘gene’ entries in MGD fell into this category. Of these 60 entries, 39 had cDNA sequence links, six had only genomic sequence links, eight had references to sequences that were not available electronically and nine had no sequence reference at all. Of 100 entries from GDB, 75 fell into the defined category. Of these 75 entries, 65 had cDNA or protein sequence links, nine had references to sequences that were not available electronically and one had no sequence information at all. These results suggest that MGD contains 4766 (0.6 × 7943) transcribed genes of mammalian origin, and GDB contains 5085 (0.75 × 6781) such entries.

Gene entries in GenBank versus MGD and GDB. We next estimated the numbers of mouse and human genes represented in GenBank, and compared these figures with those obtained from MGD and GDB. We obtained 5208 mouse clusters and 8514 human clusters by the process outlined in Methods. Upon comparing the accession numbers in each cluster to a list of all sequence accession numbers in MGD and GDB, we found 2783 mouse clusters (53%) in MGD and 5643 human clusters (66%) in GDB. Table 2 shows the results of analyzing samples of 100 entries per species that were not in their respective genome-databases. The results indicate that about half the clusters that our program could not find in MGD or GDB are, indeed, missing from those databases. For MGD, we found that 37 of the 100 entries corresponded to functional, transcribed sequences of mammalian origin that one would reasonably expect to find in a genome-database, while 16 of the 100 entries represented the products of rearranged immune genes (illegal V-region translations, immunoglobulin and T-cell receptor sequences, etc.); for GDB, the numbers were 43 and 12, respectively. The results also indicate that almost half the clusters our program could not find in MGD or GDB are, in fact, present in those databases. Of these, 18 in MGD and 16


Table 1. Many gene entries in MGD and GDB lack evidence to suggest that they are functional, transcribed sequences of mammalian origin

Classification                 No. found in MGD   No. found in GDB
Transcribed mammalian genes    60 (± 10)          75 (± 8)
Pseudogenes                    10 (± 3)           7 (± 5)
Phenotypes                     13 (± 7)           7 (± 5)
Endogenous viruses             9 (± 6)            1 (± 1)
Cryptic                        3 (± 3)            10 (± 6)
Complexes                      5 (± 4)




in GDB do not have a link to the sequence data. The remainder (29 in each database) had a sequence link that our program missed. We analyzed these cases and found that in many instances the sequence link pointed to an alternate cluster, suggesting that our clustering criteria were too permissive; in other cases, the link pointed to untranslated genomic sequences or to ‘related segments’ (GDB) that were not identified in the original search. Table 3 extrapolates these results to all clusters, including those which our program found to be present in MGD and GDB. The results show that 15–17% of non-immune system genes present in GenBank are not present in the genome-databases, and that another 5–8% of such genes are present in the genome-databases, but without a link to their sequence. In total, 25% of non-immune system mouse genes in GenBank are either absent from MGD or have no sequence link, and 20% of such human genes are absent from GDB or have no sequence link. Mouse/human homologs in MGD and GDB. Having pared the list of ‘gene’ entries to those that were clearly transcribed

and of mammalian origin (and therefore more likely to have homologs), we could examine more closely which entries in our sample had mouse/human homologs. As shown in Table 4, 24 of the 60 such entries from MGD (40%) had homologs identified as such in MGD, 20 of the 60 entries (33%) had no homologs identified in MGD nor could we find a likely homolog by searching the databases and literature, and 16 of the 60 entries (27%) had no homologs identified in MGD, although we were able to find a likely homolog by our searches. This implies that 44% (16/36) of the MGD entries lacking homologs are likely to be false negatives. For GDB, the results were: 30 of 75 entries (40%) had homologs identified in GDB; 32 of 75 (43%) had no homologs identified in GDB nor could we find one; and 13 of 75 (17%) had no homologs identified in GDB although we were able to find a likely homolog. This implies that 29% (13/45) of the GDB entries lacking homologs are likely to be false negatives. Information regarding how homologs were identified was sparse; in most cases, this information was provided as a list of references, which, when examined electronically, provided little useful information.

Table 2. Many GenBank entries could not be found in MGD and GDB

Classification                                             No. in MGD group (95% CI)   No. in GDB group (95% CI)
Present in database, no sequence link                      18 (± 8)                    16 (± 7)
Present in database, sequence link missed by our program   29 (± 9)                    29 (± 9)
Absent from database, unpublished                          6 (± 5)                     8 (± 5)
Absent from database, published                            31 (± 9)                    35 (± 9)
Other (a)                                                  16 (± 7)                    12 (± 6)

(a) Entries are either illegal V-region features, specific antibody transcripts or transcripts from other immune genes that are no longer in germ line configuration.

Table 3. Many genes in GenBank are not in MGD and GDB

Classification                                                 No. in MGD group   No. in GDB group
All clusters                                                   5208 (100%)        8514 (100%)
Present in database, sequence link found by our program        2783 (53%)         5643 (66%)
Present in database, no sequence link (a)                      436 (8%)           459 (5%)
Present in database, sequence link missed by our program (a)   703 (14%)          833 (10%)
Absent from database, unpublished (a)                          146 (3%)           230 (3%)
Absent from database, published (a)                            752 (14%)          1005 (12%)
Other (b)                                                      388 (7%)           345 (4%)

(a) Extrapolated.
(b) Entries are either illegal V-region features, specific antibody transcripts or transcripts from other immune genes that are no longer in germ line configuration.
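The extrapolated rows of Table 3 appear to follow from applying the Table 2 sample fractions to the clusters that the program did not find in each genome-database. The sketch below is a reconstruction under that assumption, not the authors' code; the function name and dictionary keys are hypothetical, and the rounding is Python's built-in `round`.

```python
def extrapolate(total_clusters, found_by_program, sample_counts, sample_size=100):
    """Scale per-category sample counts up to all clusters the program did not find."""
    not_found = total_clusters - found_by_program
    return {label: round(not_found * count / sample_size)
            for label, count in sample_counts.items()}


# Sample counts from Table 2 (100 clusters sampled per species).
mgd_sample = {"no sequence link": 18, "link missed": 29,
              "absent, unpublished": 6, "absent, published": 31, "other": 16}
gdb_sample = {"no sequence link": 16, "link missed": 29,
              "absent, unpublished": 8, "absent, published": 35, "other": 12}

mgd_estimate = extrapolate(5208, 2783, mgd_sample)  # 2425 mouse clusters not found
gdb_estimate = extrapolate(8514, 5643, gdb_sample)  # 2871 human clusters not found
```

With these inputs the estimates match the counts listed in Table 3, which supports the reading that the table was derived this way.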


Table 4. Many potential human/mouse homologs are not in MGD and GDB

Classification                                                                            No. in MGD group (%) (95% CI)   No. in GDB group (%) (95% CI)
Homolog positive (homolog present in database)                                            24 (± 12)                       30 (± 11)
True homolog negative (homolog absent from database and no homolog found by our search)   20 (± 12)                       32 (± 11)
False homolog negative (homolog absent from database, but homolog found by our search)    16 (± 11)                       13 (± 8)
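The homolog classification rule from Methods, and the false-negative fractions quoted in Results, can be expressed as a short sketch. The function and parameter names are hypothetical; the identity threshold (>95%) and the requirement for supporting expression or map evidence come from the text, and the counts are those of Table 4.

```python
def classify_homolog(has_db_homolog, best_identity=None, has_supporting_evidence=False):
    """Apply the three-way classification described in Methods.

    best_identity: best amino acid identity found in the other species (0..1),
    or None if the searches found nothing.
    """
    if has_db_homolog:
        return "homolog positive"
    if best_identity is not None and best_identity > 0.95 and has_supporting_evidence:
        return "false homolog negative"
    return "true homolog negative"


def false_negative_fraction(true_negatives, false_negatives):
    """Share of homolog-negative entries for which a likely homolog was found."""
    return false_negatives / (true_negatives + false_negatives)


mgd_false_negative_pct = round(100 * false_negative_fraction(20, 16))
gdb_false_negative_pct = round(100 * false_negative_fraction(32, 13))
```

These percentages evaluate to 44 for MGD (16/36) and 29 for GDB (13/45), in agreement with the figures given in the text.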

Discussion

Real data and databases are messy, which contributes in no small measure to the difficulty of database integration. Even well-respected, authoritative databases, such as MGD and GDB, contain numerous anomalies. We found that many (∼15%) mouse and human genes whose sequences are in GenBank were not represented in MGD or GDB. One possible explanation for this finding is that MGD and GDB originated as databases of mapped genes, and may be deficient in genes that are cloned but unmapped (although both databases contain some unmapped genes); further investigation is needed to confirm this conjecture. A modest number (5–10%) of links are missing between the genome-databases and GenBank, and a larger number (roughly 30–40%) of links are missing between likely mouse/human homologs. The entries classified as ‘genes’ in the databases encompassed a variety of biological phenomena: most conform to the standard molecular biology definition of a mammalian gene (a transcribed sequence of mammalian origin resulting in a functional molecule), but many are closer to the genetics definition of a gene (a mappable trait); for mouse, a large number of entries were found to be endogenous viruses (legitimate genes, but recent additions to the lineage). As many as 10–20% of the entries are pseudogenes (which are not genes by any reasonable definition) or otherwise apparently misclassified. Any effort to integrate real databases must cope with such artifacts.

The flip side of the coin is that database integration offers a means of discovering, and possibly correcting, database errors. It was by cross-referencing three databases (MGD, GDB and GenBank) that we were able to identify the problems that we have discussed. We were generally able to identify potential errors quickly using software, although definitive diagnosis required human analysis.
Even so, the amount of work per error was not overwhelming; we could typically resolve a problem such as a missing link in less than an hour. To place this in context, our results show that there are only 2000–3000 missing links between GenBank and MGD or GDB; at one hour per link, these could be fixed with less than 2 person-years of effort. Database integration thus


provides both the incentive and the means to improve the quality of the constituent databases. The model system we have proposed is well suited to the task of discovering data errors. The model contains ‘horizontal links’ connecting data elements from a variety of sources for each gene, and ‘vertical links’ connecting data elements for related genes. The ensemble is a highly interconnected network of data elements which can be navigated to find omissions and inconsistencies. Simply put, an omission exists whenever a path that should exist from one point to another is missing; an inconsistency exists whenever two paths that should go from point A to point B actually go to different places. We exploited this method in our study: the lack of a path from a sequence-cluster to a genome-database entry told us that the genome-database was missing a gene; also, when we computed a putative homology link from a sequence-cluster in one organism to a cluster in the other, but did not see a link between the corresponding genome-database entries, we inferred that the link was missing. Had we computed a homology link between two sequences, say A and B, but found that A’s genome-database entry was linked to some other gene, that would have signified an inconsistency. This type of analysis can be performed programmatically and may be a practical means of identifying a large fraction of the errors in existing databases. The model is also well suited for discovering new relationships among genes. The same network that can be traversed to find missing or errant paths can also be used to find novel, meaningful ones. The model system allowed us to confront the hard problems caused by messy data without forcing us to solve all the other hard problems that are inherent in database integration. The next step is to push forward and complete the database we set out to build, namely a database that integrates data from the sequence and genome-databases for human and mouse. 
This next step will, no doubt, reveal more problems, but such is the nature of research. Database integration demands an experimental approach. It is a complex problem entailing many hard and diverse issues. These include the need to invent new software technology to accomplish the integration, as well as hard scientific


and social considerations. These same factors, though, make it hard to conduct definitive experiments. The use of a model system, such as the one we have proposed, may offer a more rapid strategy for solving this critical problem.
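The path-based checks described in the Discussion (an omission is a missing path, an inconsistency is two paths that should agree but end in different places) might be automated along the following lines. This is a minimal sketch under stated assumptions: the three input mappings, the function name and all identifiers are hypothetical, and real data would need many-to-many links rather than the one-to-one mappings used here.

```python
def check_homology_links(cluster_gene, cluster_homology, gene_homology):
    """Compare computed cluster-level homology with links recorded in genome-databases.

    cluster_gene:     sequence cluster -> genome-database gene (absent key = no path)
    cluster_homology: list of (cluster_a, cluster_b) putative homologs from sequence
    gene_homology:    gene -> homolog gene as recorded in the genome-database
    """
    findings = []
    for a, b in cluster_homology:
        gene_a, gene_b = cluster_gene.get(a), cluster_gene.get(b)
        if gene_a is None or gene_b is None:
            # No path from a cluster to any genome-database entry.
            findings.append(("omission: gene missing from genome-database", a, b))
        elif gene_a not in gene_homology:
            # Sequence path exists, but no corresponding gene-level link.
            findings.append(("omission: homology link missing", gene_a, gene_b))
        elif gene_homology[gene_a] != gene_b:
            # Two paths from the same start end at different genes.
            findings.append(("inconsistency: linked to a different gene", gene_a, gene_b))
    return findings


# Invented example data.
example_findings = check_homology_links(
    cluster_gene={"m1": "GeneA", "h1": "GENEA", "m2": "GeneB", "h2": "GENEB", "m3": "GeneC"},
    cluster_homology=[("m1", "h1"), ("m2", "h2"), ("m3", "h3")],
    gene_homology={"GeneA": "GENEA", "GeneB": "GENEX"},
)
```

In the example, the first pair is consistent and produces no finding, the second is flagged as an inconsistency, and the third as an omission because cluster `h3` has no genome-database entry. As in the study, such flags are only candidates; definitive diagnosis would still require human analysis.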

References

ACeDB (1998) THE C. elegans GENOME PROJECT. http://www.sanger.ac.uk/Projects/C_elegans/. Sanger Centre.
Altschul,S.F., Gish,W., Miller,W., Myers,E.W. and Lipman,D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403–410.
Ashburner,M. and Goodman,N. (1997) Informatics–genome and genetic databases. Curr. Opin. Genet. Dev., 7, 750–756.
AtDB (1998) The Arabidopsis thaliana Database (AtDB). http://genome-www.stanford.edu/Arabidopsis/. Department of Genetics, Stanford University.
Bassett,D.E., Boguski,M.S., Spencer,F., Reeves,R., Kim,S.-H., Weaver,T. and Hieter,P. (1997) Genome cross-referencing and XREFdb: implications for the identification and analysis of genes mutated in human disease. Nature Genet., 15, 339–344.
Bassett,D.E., Boguski,M.S., Spencer,F., Reeves,R., Weaver,T., Webb,K., Thomas,H., Tolstoshev,C. and Hieter,P. (1998) Cross-referencing the genetics of model organisms with mammalian phenotypes (XREFdb). http://www.ncbi.nlm.nih.gov/XREFdb/. National Center for Biotechnology Information, National Library of Medicine.
Benson,D.A., Boguski,M.S., Lipman,D.J. and Ostell,J. (1997) GenBank. Nucleic Acids Res., 25, 1–6.
Bleasby,A.J. and Wootton,J.C. (1990) Construction of validated, non-redundant composite protein sequence databases. Protein Eng., 3, 153–159.
Buneman,P., Davidson,S.B., Hart,K., Overton,G.C. and Wong,L. (1995) A data transformation system for biological data sources. In Proceedings of the 21st International Conference on Very Large Data Bases (VLDB), Zurich, Switzerland, The Very Large Data Bases (VLDB) Endowment Inc.
Burks,C. and Redgrave,G. (1993) LiMB Release 3.0. ftp://ncbi.nlm.nih.gov/repository/LiMB/. Los Alamos National Laboratory (May 1993).
CASP (1998) Protein structure prediction center. http://PredictionCenter.llnl.gov/. Biology and Biotechnology Research Program, Lawrence Livermore National Laboratory.
Chen,I.-M.A. and Markowitz,V.M. (1995) An overview of the Object-Protocol Model (OPM) and OPM data management tools. Inf. Syst., 20, 393–418.
Chen,I.-M.A., Kosky,A., Markowitz,V.M. and Szeto,E. (1996) OPM*QS: The Object-Protocol Model Multidatabase Query System. http://gizmo.lbl.gov/DM_TOOLS/OPM/OPM_QS/OPM_QS.html. Lawrence Berkeley National Laboratory.
Crabtree,J. (1998) CPL query page at CBIL. http://agave.humgen.upenn.edu/cpl/cplhome.html. Computational Biology and Informatics Laboratory, University of Pennsylvania (February 1998).
Davidson,S.B. and Kosky,A.S. (1997) WOL: a language for database transformations and constraints. In Proceedings of the 13th International Conference on Data Engineering, Birmingham, UK.

DDBJ (1998) DNA Data Bank of Japan WWW title page. http://www.ddbj.nig.ac.jp/. Center for Information Biology, National Institute of Genetics.
Duret,L. (1998) Homologous vertebrate genes data base (HOVERGEN). http://biom1.univ-lyon1.fr:8080/doclogi/hoverangl/ahovergen.html. Laboratoire de Biométrie, Université Claude Bernard.
Duret,L., Mouchiroud,D. and Gouy,M. (1994) HOVERGEN: a database of homologous vertebrate genes. Nucleic Acids Res., 22, 2360–2365.
EBI (1996) The European Bioinformatics Institute. http://www.ebi.ac.uk/. European Bioinformatics Institute, EMBL.
ECDC (1997) E.coli database collection—ECDC. http://susi.bio.uni-giessen.de/. Justus-Liebig-University.
EofD (1996) Encyclopaedia of Drosophila. http://shoofly.bdgp.berkeley.edu/. Berkeley Drosophila Genome Project and FlyBase.
Etzold,T., Ulyanov,A. and Argos,P. (1996) SRS: information retrieval system for molecular biology data banks. Methods Enzymol., 266, 114–128.
Etzold,T., Verde,G., Kreil,D. and Carter,P. (1998) Sequence Retrieval System. http://srs.ebi.ac.uk:5000/. European Bioinformatics Institute (February 1998).
Fasman,K. (1994) Restructuring the genome data base: a model for a federation of biological databases. J. Comput. Biol., 1, 165–171.
Fasman,K.H., Letovsky,S.I., Li,P., Cottingham,R.W. and Kingsbury,D.T. (1997) The GDB human genome database. Nucleic Acids Res., 25, 72–80.
FlyBase (1998) A database of the Drosophila genome (FlyBase). http://flybase.bio.indiana.edu/.
GDB (1998) The Genome Database (GDB). www.gdb.org. The Johns Hopkins University.
Goodman,N., Rozen,S., Smith,A.G. and Stein,L.D. (1998) The LabBase system for data management in large scale biology research laboratories. Bioinformatics, 14, 562–574.
GPCRDB (1998) GPCRDB: information system for G protein-coupled receptors (GPCRs). http://www.gpcr.org/7tm/. EMBL.
Greene,L. and Henikoff,S. (1998) The kinesin home page. http://www.blocks.fhcrc.org/~kinesin/index.html. Fred Hutchinson Cancer Research Center.
Harker,S. (1998) Morphase homepage. http://www.cis.upenn.edu/~db/morphase/. Department of Computer and Information Science, University of Pennsylvania (February 1998).
HGMD (1997) The Human Gene Mutation Database (HGMD). http://www.uwcm.ac.uk/uwcm/mg/hgmd0.html? Institute of Medical Genetics.
Horn,F., Weare,J., Beukers,M.W., Hörsch,S., Bairoch,A., Chen,W., Edvardsen,Ø., Campagne,F. and Vriend,G. (1998) GPCRDB: information system for G protein-coupled receptors. Nucleic Acids Res., 26, 277–281.
Houlgatte,R., Mariage-Samson,R., Duprat,S., Tessier,A., Bentolila,S., Lamy,B. and Auffray,C. (1995) The Genexpress Index: a resource for gene discovery and the genic map of the human genome. Genome Res., 5, 272–304.
IARC (1998) Database of somatic P53 mutations in human tumors and cell lines. http://www.iarc.fr/p53/homepage.htm. International Agency for Research on Cancer.


J.Macauley, H.Wang and N.Goodman

Jirtle,R.L. (1998) M6P/IGF2R information core. http://radonc.duke.edu/~jirtle/homepage.html. Department of Radiation Oncology, Duke University.
Karp,P. (1995) A strategy for database interoperation. J. Comput. Biol., 2, 573–586.
Keen,G., Redgrave,G., Lawton,J., Cinkosky,M., Mishra,S., Fickett,J. and Burks,C. (1992) Access to molecular biology databases. Math. Comput. Model., 16, 93–101.
Krawczak,M. and Cooper,D.N. (1997) The human gene mutation database. Trends Genet., 13, 121–122.
Lander,E.S. (1996) The new genomics: global views of biology. Science, 274, 536–539.
Link,J. (1998) IGD—genome information system. http://mbpsun9.embnet.dkfz-heidelberg.de/igd-gis/. DKFZ – German Cancer Research Center Heidelberg (February 1998).
Maize,D.B. (1998) A maize genome database (MaizeDB). http://www.agron.missouri.edu/. USDA-ARS Plant Genetics Unit, University of Missouri-Columbia.
Markowitz,V.M. and Ritter,O. (1995) Characterizing heterogeneous molecular biology database systems. J. Comput. Biol., 2, 547–556.
Martin,S. (1998) Virtual library: genetics. http://www.ornl.gov/TechResources/Human_Genome/genetics.html. Oak Ridge National Laboratory (February 1998).
MGI (1996) Mouse Genome Informatics (MGI). http://www.informatics.jax.org/. The Jackson Laboratory.
Miller,R., Burke,J., Christoffels,A. and Hide,W. (1997) Sequence Tag Alignment and Consensus Knowledgebase (STACK). http://techno.sanbi.ac.za/stack/. South African National Bioinformatics Institute, The University of the Western Cape.
NAD (1998) The Nucleic Acid Database (NDB). http://ndbserver.rutgers.edu. The Nucleic Acid Database Project, Rutgers, The State University of New Jersey.
NCBI (1998a) Entrez. http://www.ncbi.nlm.nih.gov/Entrez/. National Center for Biotechnology Information, National Library of Medicine (February 1998).
NCBI (1998b) GenBank overview. http://www.ncbi.nlm.nih.gov/Web/Genbank/. National Center for Biotechnology Information, National Library of Medicine.
NCBI (1998c) UniGene: unique human gene sequence collection. http://www.ncbi.nlm.nih.gov/UniGene/. National Center for Biotechnology Information, National Library of Medicine.
OMIM (1998) Online Mendelian inheritance in man (OMIM). http://www.ncbi.nlm.nih.gov/Omim. National Center for Biotechnology Information, National Library of Medicine.
PDB (1998) The Protein Data Bank (PDB). http://www.pdb.bnl.gov/. Brookhaven National Laboratory.


PiGBASE (1997) The genome database of the pig (PiGBASE). http://www.ri.bbsrc.ac.uk/pigmap/pigbase/pigbase.html. Roslin Institute.
Rebhan,M., Chalifa-Caspi,V., Prilusky,J. and Lancet,D. (1997a) GeneCards: encyclopedia for genes, proteins and diseases. http://bioinfo.weizmann.ac.il/cards. Bioinformatics Unit and Genome Center, Weizmann Institute of Science.
Rebhan,M., Chalifa-Caspi,V., Prilusky,J. and Lancet,D. (1997b) GeneCards: integrating information about genes, proteins and diseases. Trends Genet., 13, 163.
Ritter,O. (1994) The integrated genomic database. In Suhai,S. (ed.), Computational Methods in Genome Research. Plenum, pp. 57–73.
Robbins,R.J. et al. (1993) Report of the Invitational DOE Workshop on Genome Informatics, 26–27 April 1993. http://www.bis.med.jhmi.edu/Dan/DOE/whitepaper/inf_rep2.html. Department of Energy (April 1993).
Schuler,G. et al. (1995) A gene map of the human genome. Science, 274, 540–546.
Schuler,G.D., Epstein,J.A., Ohkawa,H. and Kans,J.A. (1996) Entrez: molecular biology database and retrieval system. Methods Enzymol., 266, 141–162.
Schulze-Kremer,S. (1997) Adding semantics to genome databases: towards an ontology for molecular biology. In Proceedings of the 5th International Conference on Intelligent Systems for Molecular Biology (ISMB), Halkidiki, Greece.
SGD (1998) The Saccharomyces Genome Database (SGD). http://genome-www.stanford.edu/Saccharomyces/. Department of Genetics, Stanford University.
Stein,L.D. (1996) How Perl saved the Human Genome Project. Perl J., 1, 5–9.
SWISS-PROT (1998) SWISS-PROT: annotated protein sequence database. http://expasy.hcuge.ch/sprot/sprot-top.html. ExPASy, Geneva University Hospital and University of Geneva.
TIGR (1998a) TIGR Database (TDB). http://www.tigr.org/tdb/tdb.html. The Institute for Genomic Research.
TIGR (1998b) TIGR human gene index. http://www.tigr.org/tdb/hgi/hgi.html. The Institute for Genomic Research.
Transfac-Team (1998) TRANSFAC—the transcription factor database. http://transfac.gbf.de/TRANSFAC/. Ges. f. Biotechn. Forschung mbH (GBF).
Wall,L., Christiansen,T. and Schwartz,R.L. (1996) Programming Perl. O’Reilly & Associates.
ZFIN (1998) The Zebrafish database project. http://zfish.uoregon.edu/ZFIN/. University of Oregon.