CHASE, an Effective Combination of Homology-Search Methods

FSPM Preprint No. 153

Intikhab Alam Syed*, Andreas Dress** and Georg Fuellen***

Abstract

Many methods have been developed to search for homologous members of a protein family in data-bases, and the reliability of results and conclusions may be compromised if only one method is used, neglecting the others. Therefore, we introduce a general scheme for combining such methods. Based on this scheme, we implemented a tool called CHASE (Comparative Homology Agreement Search) that integrates different search strategies and evaluation procedures to obtain a combined “E-value”. Our results show that a consensus method integrating distinct homology-search algorithms easily outperforms any of its component algorithms. In particular, an evaluation based on the SCOP data-base reveals that, on average, a coverage of 44% can be obtained in searches for distantly related homologues (i.e. members of the same superfamily, the most difficult task), accepting only 10 false positives. The best individual method obtains a coverage of 35%, accepting the same number of false positives.

*International NRW Graduate School in Bioinformatics and Genome Research Center of Biotechnology (CeBiTec) University of Bielefeld 33615 Bielefeld, Germany [email protected]

**FSP Mathematisierung - Strukturbildungsprozesse University of Bielefeld 33615 Bielefeld, Germany [email protected]

*** Integrated Functional Genomics, IZKF University Hospital Muenster (Hautklinik) Von-Esmarch-Str. 56 48149 Muenster, Germany [email protected] Corresponding Author.

Friday June 27, 2003


0. Introduction

Sequence-homology search algorithms are important computational tools in molecular biology. There exist at least three general classes of techniques employed in searches for additional homologues, namely pairwise sequence comparisons such as Blast (Basic Local Alignment Search Tool), profile-based searches such as HMMsearch, and motif- or pattern-based analyses such as PHI-Blast [1-7].

In a pairwise search, a query sequence is compared to each data-base sequence, yielding a confidence estimate that is supposed to indicate the probability of finding a comparably similar random sequence. The comparison is done for every sequence in the data-base, and the sequences with highest confidence (“the hits”) are reported. The most popular pairwise-search tool is Blast [2].

Simple profile searches make use of position-specific scoring statistics and are usually more sensitive than pairwise comparisons. The introduction of hidden Markov models (HMMs) appears to provide a firmer statistical basis for profile analysis. The majority of currently available profile tools use HMMs, for example the HMMER package [8, 9].

Kinship between protein sequences can also lead to - and, thus, be recognized by - the occurrence of particular amino-acid patterns (also known as motifs, signatures, or fingerprints) that were conserved throughout the evolution of the protein family in question and are believed to correlate with specific structural features and function. Motif analysis can, therefore, also be used for identifying new members of a protein family [10-12]. Motifs can be identified in sequences automatically (for example using PRATT [13]). They are the backbone of motif-based homology-search methods such as PHI-Blast [6].

Protein-family specific motif libraries have been assembled (such as PROSITE [24], BLOCKS [14], and ProDom [15]), and so have libraries of more sophisticated mathematical models of protein family relationship based on HMMs [16- 18].


The choice between homology-search methods depends in particular on the type of input data. Pairwise sequence comparisons are employed if only a single query sequence is known. In contrast, profile-based searches (such as HMMsearch) are usually trained with sets of 20-100 sequences [7]. Furthermore, profile searches are best for investigators who have the time to carefully curate their query alignment. Motif- or pattern-based methods fall between these two extremes.

In this study, we will show that the overall performance of homology searches can be improved if these methods are combined appropriately. The combination of methods is an advanced form of a meta-study. Important medical questions are typically studied more than once, and a meta-study compiles and analyses the results of all relevant studies. Interpro [19] and Metafam [20] present such compilations in protein-family research. Combining methods directly to generate a consensus result is common practice in some areas of bioinformatics. Two algorithms that combine different methods are Pcons [21] for fold recognition, and Jpred [22] for secondary structure prediction. They improve the accuracy of the results considerably. To date, to the best of our knowledge, there is no method available that produces a consensus over sequence-based homology-search methods.

Here, we have restricted ourselves to combining the following five methods: HMMsearch [4], Treesearch [23], PSI-Blast [24], Mast [25], and PHI-Blast [6]. All of them use a collection of sequences as search input. The first three methods perform profile-based searches either using profiles directly (as read off from a multiple alignment of the input sequences) or transforming them into a Hidden Markov Model. Mast uses profiles derived from motif analysis. PHI-Blast is a motif-based method that uses “regular expressions” [26] designed to represent family-specific patterns.

We combine these five methods as follows: First, given a collection of query sequences, method-specific input queries structured according to the specific requirements of the individual search algorithms will be computed for each of the five component algorithms. Then, after these have been applied using their respective input queries, we compute and report a “consensus hit list”.


In addition to detailing the resulting consensus tool dubbed CHASE (Comparative Homology Agreement SEarch), we present a comparative evaluation of its performance. The evaluation is performed by testing CHASE on a data-base that is disjoint from the data-base used to calibrate this tool.

1. Materials and Methods

1.1 Homology-search methods

We use the following homology-search methods that provide confidence estimates (E-values, as described below) for their results: HMMsearch, Treesearch, PSI-Blast, Mast, and PHI-Blast. Table 1 collects the basic features of these methods. To perform their task, they require a query and a target data-base such as Swissprot [27] or SCOP [28]. The exact query format requirements, however, vary from method to method. For example, HMMsearch requires a Hidden Markov Model (HMM). PSI-Blast requires a set of aligned sequences as a “jumpstart alignment” containing highlighted regions from which one or more “position-specific scoring matrices” are derived. Treesearch is another profile-based homology-search method that requires a sequence alignment, a phylogenetic tree, and an HMM. Mast requires profiles derived from motifs in the form of MEME output [25], and PHI-Blast requires a pattern (for example a Prosite pattern) and a sequence as input to conduct homology searches.

We developed scripts called input processors (IPs) that take a collection of sequences and process these as follows to obtain the specific type of input for each of these homology-search methods.

HMMsearch IP: We use ClustalW [29] to generate a multiple alignment that in turn is used by HMMbuild, available with the HMMER package, to build a Hidden Markov Model. We calibrate the required HMM using hmmcalibrate.

Treesearch IP: We use build_compound, available with Treesearch, to generate, as required, a sequence alignment (using ClustalW), a phylogenetic tree (using fitch [30]), and an HMM (using HMMbuild).

PSI-Blast IP: We use ClustalW to align the input sequences, followed by a formatting step so that PSI-Blast can be executed.

Mast IP: We use MEME to generate motifs and convert them into the required profiles.

PHI-Blast IP: We use PRATT to generate a Prosite-like pattern, and we generate a consensus sequence to start the PHI-Blast search.

1.2 Automatic evaluation of data-base search methods, and calculation of performance weights

Phase4 [31; version 1.6] is a system for the automatic evaluation of data-base search methods. In Phase4, the performance of a method is evaluated by its ability to find a test set of sequences in a target data-base, using a training set of sequences for learning; depending on the search method, learning is, for example, the calculation of an HMM. To construct the test and training sets, Phase4 relies on target data-bases like SCOP [28; version 1.53] that classify proteins, in a strictly Linnaean (or hierarchical) fashion, according to membership in families (of closely related sequences) and in superfamilies (of not so closely related sequences). The separation of the target data-base into training sets and test sets is called an (evaluation) scenario. For example, the scenario “Distant Family One Model” is used to evaluate a homology-search method for its ability to report distant relationships in protein families by splitting off one family from a given superfamily to provide test sequences, and keeping the rest of the superfamily as training sequences. Such a test is executed for each family in turn, for every superfamily (see Table 2 for commonly used scenarios, and [31] for more details).

To evaluate the performance of any method numerically, Phase4 offers evaluators. These evaluators make use of the list of sequences found, ranked according to a confidence estimate called an E-value. In the simplest pairwise case of standard Blast searches, given a normalized pairwise-comparison score of size σ, the E-value estimates the expected number of distinct local matches with normalized score ≥ σ in a random data-base (see [24]). This concept can be generalized to other search methods, with different degrees of mathematical rigor. E-values are reported by each of the search methods we want to combine, and our combination scheme will report a combined E-value.
For a given test, the “coverage versus false positive counts” evaluator presents the percentage of family members of the test set that are listed together with a certain number of non-family members in the ranked list of the sequences found. More precisely, it calculates the average percentage y of true positives with an E-value better than or equal to the threshold value v(x) for which x false positives are found, thus rendering the percentage coverage y as a function y=y(x) of the absolute number x of misclassifications considered acceptable. Finally, results are averaged over all tests executed and plotted.
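The evaluator described above can be illustrated for a single test with a minimal Python sketch. It is not the Phase4 implementation. The function name, the input layout, and the tie-breaking rule (a family member counts only if its E-value is strictly smaller than that of the first false positive that is no longer tolerated) are assumptions made for illustration.

```python
def coverage_at_false_positives(hits, family, x):
    """Percent of `family` ranked ahead of the (x+1)-th false positive.

    hits   -- list of (sequence_id, e_value) pairs reported by one method
    family -- set of sequence ids that are true members of the test set
    x      -- number of false positives considered acceptable
    """
    ranked = sorted(hits, key=lambda h: h[1])          # best E-value first
    fp_evalues = [e for sid, e in ranked if sid not in family]
    # E-value of the first false positive that is no longer tolerated;
    # if fewer than x+1 false positives exist, every listed member counts.
    limit = fp_evalues[x] if x < len(fp_evalues) else float("inf")
    found = sum(1 for sid, e in ranked if sid in family and e < limit)
    return 100.0 * found / len(family)


# Hypothetical hit list: a, b, c, d are family members; x and y are not.
# Family member "e" is never reported and therefore lowers the coverage.
hits = [("a", 1e-50), ("b", 1e-20), ("x", 1e-5),
        ("c", 1e-3), ("y", 0.5), ("d", 2.0)]
family = {"a", "b", "c", "d", "e"}

for x in (0, 1, 2):
    print(x, coverage_at_false_positives(hits, family, x))
```

Averaging this quantity over all tests of a scenario gives one point y(x) of the plotted curve.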

In particular, we use the Phase4 system to evaluate the individual homology-search methods embedded in CHASE. Among several available scenarios offered by Phase4 that define training and test sequences using the SCOP data-base as described before, we use “Distant Family One Model”, “Family Halves One Model” and “Family Half One Model” (see Table 2 for details). An E-value EC = 1,000 was set as a cut-off for all individual homology-search methods; sequences with a larger E-value are not listed. For each method i, consider the average percent coverage Pi of family members s of the test set with an E-value Ei(s) smaller than the smallest threshold Ei[k] for which, for some integer k>0, the first k false positives are found. In this case, the average is taken over the coverage for all three scenarios mentioned above, and the coverage for a scenario is in turn the average taken over all tests. Using some fixed number k (in our case, we used k=50), this gives rise to the weighting scheme W=W1, …, Wn listed in Table 3, where n is the number of methods, and the weight of method i is set to the average coverage Pi divided by the total sum of the average coverages of all n methods so that $\sum_{i=1}^{n} W_i = 1$ holds.
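The normalization just described amounts to a few lines of Python. The sketch below is only an illustration: the coverage figures are hypothetical placeholders (they are not the values behind Table 3), and the function and variable names are ours.

```python
from statistics import mean


def performance_weights(coverage_by_scenario):
    """Map each method to W_i = P_i / sum_j P_j.

    coverage_by_scenario -- {method: [coverage in scenario 1, 2, 3]},
    each coverage being the percent coverage at k false positives,
    already averaged over all tests of that scenario.
    """
    avg = {m: mean(v) for m, v in coverage_by_scenario.items()}  # the P_i
    total = sum(avg.values())
    return {m: p / total for m, p in avg.items()}


# Hypothetical coverages, for illustration only.
example = {
    "HMMsearch":  [40.0, 70.0, 85.0],
    "Treesearch": [30.0, 60.0, 75.0],
    "PSI-Blast":  [35.0, 65.0, 80.0],
    "Mast":       [35.0, 60.0, 70.0],
    "PHI-Blast":  [25.0, 50.0, 60.0],
}
weights = performance_weights(example)
print(weights)
```

By construction the weights are non-negative and sum to one, so the weighted combination of e-values used later is a convex combination.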

1.3 A scheme for combining homology-search methods

As shown in Figure 1, our scheme for combining different homology-search methods features the idea of running them in parallel. Once the searches are complete, the results of each method are parsed to extract specific information such as the unique sequence identifiers of the hits and the corresponding E-values. Tallying data for all methods, we obtain a preliminary list of hits, each row containing one sequence identifier and the corresponding E-values reported by the different methods. This list is similar to the one presented in Figure 5, except for the rescaling and reordering to be described below.


A major problem in combining confidence estimates is the variability in the size of the E-values estimated by different homology-search methods. We rescale E-values to homogenize the confidence estimates in order to combine them. More precisely, to construct a consensus hit list from these data, we first rescale the E-values Ei(s) obtained by the individual methods i=1,…,n, for each sequence s, to produce E-values Ei*(s) of comparable size. We then use the weights as described in Section 1.2 to obtain a weighted average E-value. These two steps are now described in detail.

1.3.1 Placing methods on a common scale

To rescale E-values, we proceed as follows: First, data-base searches are conducted using the training set as input for all homology-search methods, and the best hits are reported. More specifically, for each method i and each sequence s in the data-base, we report the sequence provided its E-value Ei(s) is below a cut-off value EC of 1,000. Then, one method is chosen to be used as a reference method, on the basis of which the E-values of the other methods are rescaled [32]. In CHASE, we use HMMsearch as our reference method since its output consists of calibrated E-values, which we deem more reliable.

Next, before doing any E-value manipulation, we take the logarithm to base 10 to transform the E-values for all methods. This transformation is necessary since E-values may be very close to zero for good data-base hits, and we must avoid rounding problems. This way, we obtain, for each sequence s taken into consideration and each method i=1,…,n, a number ei(s) := log10 Ei(s) that we call the “e-value” of the sequence (with a small e) for conciseness.

Next, we use a regression procedure yielding the slopes and the intercepts for HMMsearch versus Treesearch, PSI-Blast, Mast, and PHI-Blast to rescale the e-values. For example, ordinary least-squares regression [33] applied to HMMsearch e-values eHMM(s) and corresponding PSI-Blast e-values ePSI(s) provides a slope a and an intercept b for which the sum

$$\sum_{s} \left( e_{\mathrm{HMM}}(s) - a \cdot e_{\mathrm{PSI}}(s) - b \right)^{2}$$

is minimized. Here, the sum is taken over all sequences s with both e-values eHMM(s) and ePSI(s) below a certain threshold e0. This procedure is repeated each time CHASE is applied. Slope and intercept depend on the specific data; there is no universal data-independent regression line for the various methods. We then put

$$e^{*}_{\mathrm{PSI}}(s) := \begin{cases} \min\{\, a \cdot e_{\mathrm{PSI}}(s) + b,\ e_0 \,\} & \text{for every sequence } s \text{ with } e_{\mathrm{PSI}}(s) < e_0, \\ e_{\mathrm{PSI}}(s) & \text{for every sequence } s \text{ with } e_{\mathrm{PSI}}(s) \ge e_0. \end{cases} \qquad (1)$$


For a small scaling threshold e0, the formula rescales small e-values according to the regression line, and keeps large e-values as they are. Keeping large e-values as they are may be useful because they might otherwise be “downscaled”, suggesting a significance that is not there. In the rare case that rescaled e-values exceed the threshold, they are set to precisely this threshold in order to keep the ranking as is. The larger e0 becomes, the fewer e-values are kept as they are. In Section 2 below, we set e0 = log10(EC) = 3. Since no hits are considered for which the E-value exceeds the E-value cut-off EC = 1,000, all values are rescaled in this case. Nevertheless, results improve slightly for smaller e0, as discussed later on. The same scaling procedure is applied to the e-values given by the other three methods. For notational convenience, we set eHMM*(s):=eHMM(s) for our reference method HMMsearch.
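The rescaling step can be sketched in plain Python as follows. This is an illustrative reading of the procedure, not the CHASE source code: the function names and the dictionary layout are assumptions, the inputs are taken to be on the log10 scale already, and the sketch assumes that at least two sequences with distinct e-values are shared between the two methods below the threshold.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares: (a, b) minimizing sum (y - a*x - b)^2."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def rescale(e_ref, e_other, e0):
    """Rescale the e-values of one method against the reference method.

    e_ref, e_other -- {sequence_id: log10 E-value}
    e0             -- scaling threshold on the log10 scale
    """
    # Regression uses only sequences with both e-values below e0.
    shared = [s for s in e_other
              if s in e_ref and e_ref[s] < e0 and e_other[s] < e0]
    a, b = fit_line([e_other[s] for s in shared],
                    [e_ref[s] for s in shared])
    # Below e0: follow the regression line, capped at e0.
    # At or above e0: keep the e-value unchanged.
    return {s: (min(a * e + b, e0) if e < e0 else e)
            for s, e in e_other.items()}


# Hypothetical e-values: the second method reports half the exponent
# of the reference method, and one weak hit that lies above e0 = -1.
e_hmm = {"s1": -100.0, "s2": -50.0, "s3": -10.0}
e_psi = {"s1": -50.0, "s2": -25.0, "s3": -5.0, "s4": 2.0}
rescaled_psi = rescale(e_hmm, e_psi, e0=-1.0)
print(rescaled_psi)
```

In this toy case the fitted slope is close to 2 and the intercept close to 0, so the three strong hits are moved onto the reference scale while the weak hit s4 is left untouched.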

1.3.2 Calculating the C-value

Once we have the rescaled e-values e1*,…,en* for all n methods, we calculate the c-value for each sequence s as the W-weighted sum:

$$c\text{-value}(s) := \sum_{i=1}^{n} e_i^{*}(s) \cdot W_i .$$

The final C-value (on the original E-value scale) is then obtained as

$$C\text{-value}(s) := 10^{\,c\text{-value}(s)} .$$

As will be shown, this C-value makes it possible to represent a consensus over individual homology-search methods and it yields a superior overall ranking of hits. “Missing E-values” arise if a homology search method finds a sequence not found by another, given the E-value cut-off EC. These are set to the cut-off E-value EC.
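A minimal sketch of the C-value calculation, including the treatment of missing E-values, could look as follows. The function name, the nested-dictionary layout and the toy numbers are assumptions for illustration; only the formulas themselves are taken from the text.

```python
import math


def c_values(rescaled, weights, e_cutoff=1000.0):
    """Combine rescaled e-values into C-values.

    rescaled -- {method: {sequence_id: rescaled log10 E-value}}
    weights  -- {method: W_i}, assumed to sum to 1
    e_cutoff -- E-value cut-off EC; a method that did not list a
                sequence contributes log10(EC) for that sequence
    """
    missing = math.log10(e_cutoff)
    sequences = set().union(*(hits.keys() for hits in rescaled.values()))
    result = {}
    for s in sequences:
        # c-value: weighted sum of the rescaled e-values (log10 scale)
        c = sum(w * rescaled.get(m, {}).get(s, missing)
                for m, w in weights.items())
        result[s] = 10.0 ** c          # back to the E-value scale
    return result


# Two hypothetical methods with equal weights; method "B" misses s2.
toy_weights = {"A": 0.5, "B": 0.5}
toy_rescaled = {"A": {"s1": -10.0, "s2": -4.0}, "B": {"s1": -6.0}}
toy_c = c_values(toy_rescaled, toy_weights)
consensus = sorted(toy_c, key=toy_c.get)   # best C-value first
print(consensus, toy_c)
```

For very strong hits the power of ten can underflow to zero in floating point, so an implementation may prefer to sort on the c-value and convert only for display.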

1.4 Evaluation of CHASE

Our tool CHASE implements the above scheme using the five homology-search methods described. We compute the weights W1, …, Wn of the component search algorithms once. We then compute the resulting C-values of the sequences in each data-base search. Treating the C-values as E-values, we can use Phase4 again to evaluate the performance of CHASE and to compare its performance with the performance of its component algorithms. Clearly, the weights that we compute, and thus the performance of our consensus method, depend on the data-base that we use. In particular, if a component algorithm does very well on that data-base, it will get a high weight implying that it will strongly influence the outcome of the consensus method, making it look good on that particular data-base, too.

To avoid this kind of circularity, we use one data-base to compute the weights of our component algorithms, and a distinct one to evaluate the resulting consensus method. To this end, we have split the data-base SCOP (version 1.53) into two separate data-bases: the odd data-base, containing every second SCOP superfamily, starting with the first one, and the even data-base, containing the rest. We use the odd data-base to compute the weights W1, …, Wn, and the even data-base to evaluate the performance of the resulting consensus method and to compare this performance with that of its component algorithms, using again the training and testing sequences from three scenarios offered in Phase4, as described in Table 2. As before, we used “coverage versus false positive count” in Phase4 as a performance evaluator, and sorting of CHASE hits was based on the C-value. Sequences with a C-value exceeding EC = 1,000 are not listed. We set the E-value cut-off EC to 1,000, and the e-value threshold used for rescaling e0 to 3 (= log10 1000) so that all values are rescaled.
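The odd/even split can be written down in two slices. The sketch assumes that the superfamilies are available as a list in SCOP order; the identifiers below are hypothetical placeholders.

```python
def split_odd_even(superfamilies):
    """Split an ordered list of superfamilies into two disjoint parts.

    Odd part:  1st, 3rd, 5th, ... superfamily (used to compute weights).
    Even part: 2nd, 4th, 6th, ... superfamily (used for evaluation).
    """
    odd = superfamilies[0::2]
    even = superfamilies[1::2]
    return odd, even


# Hypothetical superfamily identifiers in data-base order.
odd_part, even_part = split_odd_even(["sf1", "sf2", "sf3", "sf4", "sf5"])
print(odd_part, even_part)
```

Because the two parts share no superfamily, weights learned on one part cannot be tuned to the test sequences of the other.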

2. Results and Discussion

We conducted a comparative evaluation of five homology-search methods and our consensus method Comparative Homology Agreement Search (CHASE). We used three different scenarios offered by Phase4, as listed in Table 2, to define distant, close, and very close relationships between SCOP data-base entries. If one considers the average coverage of true positives at the cost of zero false positives, as shown in Figure 3a, and ranks the methods according to their ability to find distant homologous proteins, CHASE obtains a coverage of 32%, and HMMsearch comes next with a coverage of 26%. Then come Mast, PSI-Blast, Treesearch and PHI-Blast, with coverages between 25 and 19%. It is important to note that we do not claim to conduct a valid comparison of individual methods. Such a comparison would need to do more justice to the different input requirements of these. The objective comparative analysis of the individual methods, starting with the same training data of sequences for each, suffers from the application of the Input Processors (described above) by which some of the input information may be lost. For example, PHI-Blast conducts searches using a single motif calculated automatically from the training set of sequences, embedded in a single consensus sequence, and the loss of information outside the motif may be responsible for its comparatively poor performance.

If we plot coverages of true positives at the cost of 10 false positives, performance of CHASE goes up, covering 44% on average in case of distant relationships, compared to 35% coverage by HMMsearch. Permitting 50 false positives, as presented in Figure 3b, these numbers go up to 53% and 43%, respectively.

The advantage of CHASE is smaller in case of close and very close relationships, but it still outperforms the second-best method by a good margin. The “Coverage versus false positive count” plots in Fig. 4 for the various Phase4 scenarios give a more detailed picture of the coverage of true positives, for up to 200 false positives. If the e-value threshold used for rescaling is set to –1 instead of 3, not all values are rescaled anymore by the rescaling formula (1). Remarkably, CHASE appears to perform even slightly better in this case. For example, CHASE obtains 35% coverage of distant relatives at a cost of zero false positives, and 55% coverage permitting 50 false positives.

The results of running CHASE for the SCOP superfamily featuring the FAD/NAD(P)-binding domain are shown in Figure 5. C-values along with rescaled E-values from different methods are printed. The names of the members of the given family are printed in black (in the “description” column), the others (the names of the false positives) in red. We consider a family member to be classified correctly by method i if its rescaled E-value is smaller than the rescaled E-value of the first false positive. For the false positives listed by method i, the minimum rescaled E-value is printed in red. Rescaled E-values of family members that would not be classified correctly using method i alone are marked in orange. They are larger than the minimum rescaled E-value of the false positives for method i, so that the false positive with the smallest rescaled E-value would precede these in the ranking based on method i. In the twilight zone of rows 15 to 24, CHASE performs well, triggered by the rescaled E-values marked in green that indicate success for at least one method. (Formula (1) was designed such that the rescaling does not affect the relative order of E-values in a single column.)
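The per-method classification rule used for the markings in Figure 5 can be sketched as follows. The label strings and the function name are our own choice and do not correspond to the colours or code of the CHASE report generator.

```python
def mark_column(column, family):
    """Label every entry of one method's column of rescaled E-values.

    column -- {sequence_id: rescaled E-value} for a single method
    family -- set of sequence ids that are true family members
    """
    fp_values = [e for s, e in column.items() if s not in family]
    first_fp = min(fp_values) if fp_values else float("inf")
    marks = {}
    for s, e in column.items():
        if s in family:
            # Correct only if ranked ahead of the best false positive.
            marks[s] = "correct" if e < first_fp else "missed"
        else:
            marks[s] = ("first_false_positive" if e == first_fp
                        else "false_positive")
    return marks


# Hypothetical column: m1 and m2 are family members, x1 and x2 are not.
toy_column = {"m1": 1e-30, "m2": 1e-4, "x1": 1e-6, "x2": 0.1}
toy_marks = mark_column(toy_column, family={"m1", "m2"})
print(toy_marks)
```

A family member labelled "missed" by one method may still be ranked ahead of all false positives in the consensus list if other methods assign it a sufficiently small rescaled E-value.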

The evaluation that we report is one scenario for CHASE. In another scenario, CHASE can be used to search for a maximum number of members of a protein family by providing expert rather than automatic input information to the component methods. Such information could, for example, be Hidden Markov Models for various protein families taken from the Pfam data-base, or patterns described in the Prosite data-base [24], etc. The latter may be found by PS_Scan [24]. Then, if we conduct homology searches in the Swissprot data-base, CHASE results can be further compared with the available expert knowledge from Prosite in the form of true positives, false negatives, and false positives.

Therefore, we provide two kinds of user interface. The simple user interface takes only a set of protein sequences, automatically providing the input required by the various component methods. In the advanced user interface, one can submit protein sequence(s), a sequence alignment, or a profile and pattern(s) as input for the underlying search methods.

In the future, we would like to include more search methods. In some evaluation scenarios, methods like Family Pairwise Search [5] turned out to be superior to the methods we combine, but they lack E-values. Therefore, we are working on ways of defining E-values for them, since these are the current requirement for inclusion of a method into CHASE. Of course, one could also modify CHASE so that it works with the resulting, or any other, ranked lists rather than with E-values, a procedure that we will also investigate in the future.

3. Conclusion

Our results show that combining homology-search methods provides improved performance over an entire set of scenarios, ranging from the detection of distant to very close relationships between protein sequences. This corroborates, in the context of protein family research, the frequent claim that appropriately designed consensus methods can be more reliable than any of their component algorithms.


Acknowledgements

We thank Marc Rehmsmeier for advice regarding the Phase4 system. We are also grateful to Mohammed Shahid for developing the web interface for CHASE. This work was supported by the DFG and the International NRW Graduate School in Bioinformatics and Genome Research.


References

1. Pearson, W. R. (1995). Comparison of methods for searching protein sequence databases. Prot. Sci. 4:1145-1160.

2. Altschul, S., Gish, W., Miller, W., Myers, E. W., and Lipman, D. (1990). A Basic Local Alignment Search Tool. JMB, 215, 403-410.

3. Bork, P. and Gibson, T. J. (1996). Applying Motif and Profile Searches. Methods in Enzymology 266, 162-183.

4. Eddy, S.R. (2001). HMMER: Profile hidden Markov models for biological sequence analysis (http://hmmer.wustl.edu/).

5. Grundy, W. N. (1998). Homology Detection via Family Pairwise Search. Journal of Computational Biology 5(3): 479-492.

6. Zhang, Z., Schaffer, A. A., Miller, W., Madden, T. L., Lipman, D. J., Koonin, E. V. and Altschul, S. F. (1998). Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res, 26(17), 3986-90.

7. Holmes, I. (2000). Review of sequence homology search techniques on the World-Wide Web. HIV Sequence Compendium.

8. Eddy, S.R. (1996). Hidden Markov models. Current Opinion in Structural Biology, 6: 361-365.

9. Eddy, S.R. (1998). Profile hidden Markov models. Bioinformatics 14:755-763.

10. Bairoch, A., Bucher, P., and Hofmann, K. (1997). The PROSITE database, its status in 1997. Nucleic Acids Res. 25:217-221.

11. Hudak, J. and McClure, M. A. (1999). A Comparative Analysis of Computational Motif-Detection Methods. Pacific Symposium on Biocomputing 4:138-149.

12. Jonassen, I., Collins, J. F., Higgins, D. G. (1995). Finding flexible patterns in unaligned protein sequences. Protein Science 4, 1587-1595.

13. Jonassen, I. (1997). Efficient discovery of conserved patterns using a pattern graph. CABIOS 13, 509-522.

14. Henikoff, J. G., Greene, E. A., Pietrokovski, S., Henikoff, S. (2000). Increased coverage of protein families with the blocks database servers. Nucleic Acids Res. 28(1): 228-30.

15. Corpet, F., Servant, F., Gouzy, J., Kahn, D. (2000). ProDom and ProDom-CG: tools for protein domain analysis and whole genome comparisons. Nucleic Acids Res. 28:267-269.

16. Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S.R., Griffiths-Jones, S., Howe, K.L., Marshall, M. & Sonnhammer, E.L. (2002). The Pfam protein families database. Nucleic Acids Res. 30(1): 276-80.

17. Krogh, A., Brown, M., Mian, I. S., Sjolander, K. and Haussler, D. (1994). Hidden Markov models in computational biology, applications to protein modeling. J. Molec. Biol., 235, 1501-1531.

18. Baldi, P., Chauvin, Y., Hunkapiller, T., and McClure, M. A. (1994). Hidden Markov models of biological primary sequence information. Proc. Natl. Acad. Sci. USA, 91, 1059-1063.

19. Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M.D.R., Durbin, R., Falquet, L., Fleischmann, W., Gouzy, J., Hermjakob, H., Hulo, N., Jonassen, I., Kahn, D., Kanapin, A., Karavidopoulou, Y., Lopez, R., Marx, B., Mulder, N.J., Oinn, T.M., Pagni, M., Servant, F., Sigrist, C.J.A., Zdobnov, E.M. (The InterPro Consortium) (2001). The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Research, 29(1):37-40.

20. Silverstein, K.A., Shoop, E., Johnson, J.E., Kilian, A., Freeman, J.L., Kunau, T.M., Awad, I.A., Mayer, M., Retzel, E.F. (2001). The MetaFam Server: a comprehensive protein family resource. Nucleic Acids Res. 29: 49-51.

21. Lundström, J., Rychlewski, L., Bujnicki, J., Elofsson, A. (2001). Pcons: A neural-network-based consensus predictor that improves fold recognition. Protein Sci. 10(11):2354-62.

22. Cuff, J. A., Clamp, M. E. and Barton, G. J. (1998). JPred: A consensus secondary structure prediction server. Bioinformatics, 14, 892-893.

23. Rehmsmeier, M., and Vingron, M. (2001). Phylogenetic Information Improves Homology Detection. Proteins: Structure, Function, and Genetics, 45(4): 360-371.

24. Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25(17), 3389-402.

25. Bailey, T. L., and Gribskov, M. (1998). Combining evidence using p-values: application to sequence homology searches. Bioinformatics, 14, 48-54.

26. Falquet, L., Pagni, M., Bucher, P., Hulo, N., Sigrist, C.J., Hofmann, K., Bairoch, A. (2002). The PROSITE database, its status in 2002. Nucleic Acids Res. 30:235-238.

27. Bairoch, A., Apweiler, R. (2000). The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28:45-48.

28. Murzin, A., Brenner, S.E., Hubbard, T., Chothia, C. (1995). SCOP: a structural classification of proteins database for the investigation of sequences and structures. Journal of Molecular Biology 247:536-540.

29. Thompson, J.D., Higgins, D.G., Gibson, T.J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680.

30. Felsenstein, J. (1989). PHYLIP -- Phylogeny Inference Package (Version 3.2). Cladistics 5: 164-166.

31. Rehmsmeier, M. (2002). Phase4: Automatic evaluation of database search methods. Briefings in Bioinformatics, 3(4): 342-352.

32. Yona, G., Linial, N. and Linial, M. (2000). ProtoMap: Automatic classification of protein sequences and hierarchy of protein families. Nucleic Acids Research 28, 49-55.

33. Rzhetsky, A., Nei, M. (1992). Statistical properties of the ordinary least squares, generalized least squares, and minimum-evolution methods of phylogenetic inference. J Mol Evol. 35(4): 367-75.


Figures

Figure 1. A general scheme for combining homology-search methods.

Homology searches are conducted by various methods simultaneously and results are combined by the CHASE scheme (see Fig. 2) to produce a consensus result.


Figure 2. An outline of CHASE.

CHASE uses input processors that transform a set of sequences into inputs for various homology-search methods, namely HMMsearch, Treesearch, PSI-Blast, Mast and PHI-Blast. CHASE executes the underlying homology-search methods, the results of which are combined by the CHASE scheme to get a consensus. The CHASE report is available in various formats such as HTML, XML and a flat file.


[Figure 3 graphic not reproduced. Panel a: SCOP (even part) coverage permitting no false positives. Panel b: SCOP (even part) coverage permitting 50 false positives. Both panels are bar charts of percent coverage (0 to 100) for CHASE, HMMsearch, Treesearch, PSI-Blast, PHI-Blast and Mast, grouped by family construct: Distant Relationships (DFOM), Close Relationships (FHvOM) and Very Close Relationships (FHfOM).]

Figure 3. Average coverage of CHASE and its component homology-search methods.

The average coverage of true positives permitting (a) zero and (b) fifty false positives is shown, using SCOP (even half) as the target data-base and various scenarios provided by Phase4 (as described in Table 2).


[Figure 4 panels: Distant Relationships (DFOM); Close Relationships (FHvOM); Very Close Relationships (FHfOM).]

Figure 4. Coverage versus false positive counts.


This figure shows the Phase4 evaluation in the form of “coverage versus false positive counts” for CHASE as well as for HMMsearch, Treesearch, PSI-Blast, Mast and PHI-Blast, using the three scenarios offered in Phase4 (as described in Table 2). In all scenarios, CHASE achieves better average coverage than the individual homology-search methods. Averaging is done over all SCOP families included in the even half of the data-base. (The odd half was used to learn the weights used by the CHASE combination scheme.)


Figure 5. Sample CHASE result.

| No | C-value | Description | HMMsearch | Treesearch | PSI-Blast | PHI-Blast | Mast |
|---|---|---|---|---|---|---|---|
| 1 | 6e-115 | 3.3.1.2.2 (9-318,451-506) Cholesterol oxidase { | 1e-113 | 1e-84 | 1e-99 | 3e-154 | 1e-125 |
| 2 | 3e-113 | 3.3.1.2.1 (4-318,451-506) Cholesterol oxidase | 5e-106 | 5e-85 | 4e-100 | 3e-146 | 6e-133 |
| 3 | 2e-101 | 3.3.1.2.7 (3-324,521-583) Glucose oxidase {Asp | 2e-137 | 5e-85 | 3e-126 | 9e-20 | 1e-138 |
| 4 | 6e-100 | 3.3.1.2.8 (1-328,525-587) Glucose oxidase {Pen | 2e-136 | 6e-85 | 4e-129 | 5e-12 | 1e-136 |
| 5 | 1e-85 | 3.3.1.4.2 (1-225,358-442) Fumarate reductase f | 5e-98 | 5e-86 | 1e-86 | 7e-67 | 8e-90 |
| 6 | 3e-81 | 3.3.1.4.1 (2-237,354-422) L-aspartate oxidase | 2e-93 | 1e-84 | 2e-87 | 3e-61 | 2e-77 |
| 7 | 6e-76 | 3.3.1.4.3 (1-250,372-457) Fumarate reductase f | 1e-97 | 1e-85 | 2e-94 | 3e-16 | 2e-83 |
| 8 | 4e-74 | 3.3.1.4.4 (103-359,506-568) Flavocytochrome c3 | 5e-95 | 4e-85 | 6e-86 | 2e-09 | 4e-94 |
| 9 | 8e-72 | 3.3.1.4.5 (103-359,506-570) Flavocytochrome c3 | 2e-94 | 3e-85 | 1e-82 | 8e-05 | 2e-90 |
| 10 | 4e-64 | 3.3.1.2.9 (5-293,406-463) Polyamine oxidase {M | 2e-100 | 4e-75 | 2e-124 | 2 | 1e-07 |
| 11 | 2e-58 | 3.3.1.3.1 (1-291,389-430) Guanine nucleotide d | 7e-88 | 4e-80 | 1e-109 | 3 | 0.003 |
| 12 | 1e-47 | 3.3.1.2.5 (1-217,322-385) Sarcosine oxidase {B | 1e-66 | 4e-81 | 3e-71 | 0.006 | 2e-07 |
| 13 | 4e-47 | 3.3.1.2.3 (1-173,276-391) p-Hydroxybenzoate hy | 1e-69 | 3e-78 | 6e-71 | 0.1 | 2e-06 |
| 14 | 1e-38 | 3.3.1.2.6 (1-240) Phenol hydroxylase {Soil-liv | 5e-45 | 3e-73 | 3e-52 | 0.0004 | 7e-13 |
| 15 | 3e-26 | 3.3.1.1.3 (107-331) Adrenodoxin reductase of m | 8e-23 | 6e-76 | 2e-29 | 7e+02 | 0.395 |
| 16 | 1e-15 | 3.3.1.1.2 (490-645) Trimethylamine dehydrogena | 0.003 | 4e-64 | 3e-11 | 9 | 1e+03 |
| 17 | 2e-13 | 3.3.1.2.6 (342-461) Phenol hydroxylase {Soil-l | 110 | 2e-63 | 4e-06 | 5e+02 | 47.1 |
| 18 | 0.01 | 3.3.1.5.8 (1-158,278-348) Dihydrolipoamide deh | 19 | 0.001 | 0.01 | 0.6 | 5e-07 |
| 19 | 0.03 | 3.3.1.5.8 (1-158,278-348) Dihydrolipoamide deh | 5.9 | 0.005 | 0.007 | 0.6 | 8e-05 |
| 20 | 0.232 | 3.3.1.5.9 (7-154,272-346) Dihydrolipoamide deh | 130 | 0.005 | 0.05 | 2 | 0.003 |
| 21 | 0.295 | 3.3.1.5.1 (18-165,291-363) Glutathione reducta | 110 | 0.01 | 1 | 7e+02 | 3e-07 |
| 22 | 1.43 | 3.3.1.5.4 (3-169,287-357) Trypanothione reduct | 710 | 2.27 | 7 | 3 | 3e-05 |
| 23 | 3.94 | 3.3.1.5.10 (117-275,401-470) Dihydrolipoamide | 51 | 0.03 | 1e+03 | 7e+02 | 0.0001 |
| 24 | 4.38 | 3.3.1.5.8 (1-150,266-335) Dihydrolipoamide deh | 740 | 0.07 | 1e+03 | 8e+01 | 3e-05 |
| 25 | 6.31 | 4.18.1.1.22 (2-181) MHC class I, alpha-1 an | 1e+03 | 1e-06 | 3e+01 | 4e+02 | 1e+03 |
| 26 | 8.79 | 3.4.1.2.1 (1-194,288-340) D-amino acid oxidase | 1e+03 | 52.4 | 4 | 5 | 0.02 |
| 27 | 8.97 | 3.3.1.5.3 (1-169,287-357) Trypanothione reduct | 250 | 0.847 | 8 | 5e+02 | 0.03 |
| 28 | 9.66 | 3.68.1.1.2 Adenosine kinase Human (Homo sapie | 1e+03 | 7e-06 | 2e+03 | 2e+02 | 38.9 |
| 29 | 10.2 | 3.2.1.2.1 Uridine diphosphogalactose-4-epimera | 1e+03 | 3e-05 | 2e+02 | 2e+01 | 1e+03 |
| 30 | 11.3 | 4.139.1.5.1 Glycosylasparaginase (aspartylgluc | 1e+03 | 4e-07 | 2e+03 | 3e+02 | 1e+03 |
| 31 | 12.2 | 5.8.1.3.1 T7 RNA polymerase {Bacteriophage T | 77 | 0.006 | 2e+01 | 7e+02 | 42.9 |
| 32 | 12.7 | 3.2.1.5.15 (7-149) Lactate dehydrogenase {Bifi | 1e+03 | 1e+03 | 0.4 | 2e+01 | 0.002 |
| 33 | 13.1 | 3.3.1.5.5 (1-118,245-316) Thioredoxin reductas | 920 | 1.06 | 1 | 2e+02 | 1.12 |
| 34 | 16 | 3.84.1.1.1 Asparaginase type II {Escherichia c | 610 | 1e+03 | 4e-06 | 1e+02 | 1e+03 |
| 35 | 17.8 | 3.4.1.1.2 (341-489,646-729) Trimethylamine deh | 1e+03 | 231 | 0.05 | 7e+02 | 0.08 |
| 36 | 17.8 | 3.32.1.13.8 HslU {Bacteria (Escherichia coli | 4 | 0.003 | 2e+03 | 7e+02 | 221 |
| 37 | 22.2 | 3.3.1.5.7 (120-242) NADH peroxidase {Streptoco | 1e+03 | 1e+03 | 0.4 | 4e+01 | 0.03 |
| 38 | 23.1 | 5.11.1.1.1 DNA topoisomerase II, C-terminal fr | 1e+03 | 9e-05 | 1e+02 | 7e+02 | 1e+03 |
| 39 | 23.8 | 3.90.1.1.14 Putrescine receptor (PotF) {Escher | 1e+03 | 1e-05 | 2e+03 | 5e+02 | 1e+03 |
| 40 | 25.1 | 3.68.1.1.1 Ribokinase {Escherichia col | 330 | 0.752 | 2e+03 | 6 | 1.8 |

This is a CHASE result for SCOP version 1.53 superfamily 3.3.1, featuring the FAD/NAD(P)-binding domain. The hits are sorted by C-value. Rescaled E-values from the different methods (calculated by the scaling formula (1) in the text, but displayed on the original E-value scale rather than as logarithms) are presented on the right. The (truncated) descriptions of the false positives are marked in red, as is their minimum E-value per column. E-values of hits that would not be classified correctly using a single method are marked in orange. CHASE performs well here, driven by the E-values marked in green, which indicate success for at least one method.


Tables

Table 1. Homology-search methods used.

| Method | Technique | Input(s) | Confidence Estimate |
|---|---|---|---|
| HMMsearch | Profile Search | Hidden Markov Model | E-value |
| Treesearch | Profile Search | Hidden Markov Model + Sequence Alignment + Phylogenetic Tree | E-value |
| PSI-Blast | Profile Search | Sequence(s) / Sequence Alignment | E-value |
| Mast | Motif/Profile Search | Meme Motifs | E-value |
| PHI-Blast | Motif/Pattern Search | Sequence + Pattern(s) | E-value |

Table 2. Scenarios defined by Phase4, given a data-base that is organized into families and superfamilies.

| Scenario | Description |
|---|---|
| “Distant relationship” (Distant Family One Model, DFOM) | From a superfamily, each family in turn is chosen to be a test family. From the remainder of the superfamily, a model (e.g. an HMM) is constructed that is later used to search the data-base. |
| “Close relationship” (Family Halves One Model, FHvOM) | From each family of a superfamily, half its sequences are chosen as training, the remaining sequences as test sequences. From the training sequences (drawn from all families of the superfamily), a model (e.g. an HMM) is constructed that is later used to search the data-base. |
| “Very close relationship” (Family Half One Model, FHfOM) | For each family, do the following: From the family, half of its sequences are chosen as test sequences, the remaining family sequences as training sequences. The sequences of the surrounding superfamily will be ignored in the evaluation. |

Note. The table describes the division into test and training sequences. Such a division is performed for each superfamily in turn. For the FHfOM scenario, average performance is calculated over an inner loop that considers each family in turn.
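The three divisions of Table 2 can be sketched as follows. This is an illustration only, not the Phase4 implementation: a superfamily is represented as a hypothetical dictionary mapping family names to lists of sequence identifiers, and "half" is taken positionally (the first half for training), since the text does not say how the halves are chosen.

```python
def split_half(sequences):
    """Split a family into (training, test) halves; positional split is an assumption."""
    half = len(sequences) // 2
    return list(sequences[:half]), list(sequences[half:])


def dfom_splits(superfamily):
    """Distant relationship: each family in turn is the test family,
    and the remainder of the superfamily is used for training."""
    for test_family, test_seqs in superfamily.items():
        training = [seq
                    for family, seqs in superfamily.items()
                    if family != test_family
                    for seq in seqs]
        yield test_family, training, list(test_seqs)


def fhvom_split(superfamily):
    """Close relationship: half of every family goes into one pooled training set."""
    training, test = [], []
    for seqs in superfamily.values():
        train_half, test_half = split_half(seqs)
        training.extend(train_half)
        test.extend(test_half)
    return training, test


def fhfom_splits(superfamily):
    """Very close relationship: each family is split on its own;
    the other families of the superfamily are not used."""
    for family, seqs in superfamily.items():
        train_half, test_half = split_half(seqs)
        yield family, train_half, test_half


# Hypothetical toy superfamily with two families.
toy = {"fam_a": ["a1", "a2", "a3", "a4"], "fam_b": ["b1", "b2"]}
dfom = list(dfom_splits(toy))
fhvom = fhvom_split(toy)
fhfom = list(fhfom_splits(toy))
```

DFOM and FHfOM yield one split per family, whereas FHvOM yields a single pooled split per superfamily, which mirrors the "one model" wording of the scenario names.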


Table 3. Estimated weights for different homology-search methods, based on the performance of the methods using the odd part of the SCOP data-base.

| No | Method | Weight |
|---|---|---|
| 1 | HMMsearch | 0.2178 |
| 2 | Treesearch | 0.2051 |
| 3 | PSI-Blast | 0.2037 |
| 4 | Mast | 0.1766 |
| 5 | PHI-Blast | 0.1968 |
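To show how weights of this kind might enter a consensus score, the sketch below computes a weighted mean of log10 E-values and converts it back to the E-value scale. This is an assumed, generic combination rule chosen for illustration; the actual CHASE scheme and its rescaling formula are defined in the main text and are not reproduced here. The function name is hypothetical. The example input uses the rescaled E-values listed for hit 16 in Figure 5.

```python
import math

# Weights from Table 3; they sum to 1.
WEIGHTS = {
    "HMMsearch": 0.2178,
    "Treesearch": 0.2051,
    "PSI-Blast": 0.2037,
    "Mast": 0.1766,
    "PHI-Blast": 0.1968,
}


def weighted_log_evalue(evalues, weights=WEIGHTS):
    """Weighted mean of log10 E-values, returned on the E-value scale.

    Assumption: this is an illustrative rule, not necessarily the CHASE formula.
    evalues: dict mapping method name to a positive (rescaled) E-value.
    """
    total_weight = sum(weights[method] for method in evalues)
    mean_log = sum(weights[method] * math.log10(value)
                   for method, value in evalues.items()) / total_weight
    return 10.0 ** mean_log


# Rescaled E-values of hit 16 in Figure 5 (Trimethylamine dehydrogenase fragment).
hit_16 = {
    "HMMsearch": 0.003,
    "Treesearch": 4e-64,
    "PSI-Blast": 3e-11,
    "PHI-Blast": 9,
    "Mast": 1e+03,
}
combined = weighted_log_evalue(hit_16)
```

For this hit the sketch yields a value on the order of 1e-15, even though two of the five methods report E-values far above any usual significance threshold. A single confident method can therefore carry a hit in a log-scale average, which is the behaviour the Figure 5 caption attributes to the consensus.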
