
Combinatorial Extension (CE) using a Composite Property Description. A New Approach to 3-D Structure Alignment and its Application to the Protein Kinase Family.

Ilya N. Shindyalov and Philip E. Bourne 1,2,*

San Diego Supercomputer Center, PO Box 85608, San Diego, CA 92186, USA
1 Department of Pharmacology, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093, USA
2 The Burnham Institute, 10901 North Torrey Pines Road, La Jolla, CA 92037, USA
Voice: (619) 534-8301, fax: (619) 534-5113, email: {shindyal,bourne}@sdsc.edu
* To whom correspondence should be addressed.

Keywords: Combinatorial Extension; Composite Properties; 3D Structure Alignment; Protein Kinases

Abstract

The disagreement between 3-D structure alignments obtained by different algorithms and those determined by domain experts shows that simply finding the best rmsd is not enough to match biologically meaningful features and hence to provide the most meaningful structure alignment. Yet an accurate comparative analysis reveals a great deal about the biological function of related proteins. A new algorithm is presented that utilizes a combinatorial extension (CE) of the optimal path, a path that is further refined by the use of protein properties relevant to structural and functional features. The resulting all-by-all alignments are reported for the 38 known structures of the protein kinase catalytic subunit. Results indicate that the alignments are significantly better than those of other methods and come close to those reported by protein kinase experts. Alignments and a Java applet for analyzing similarities and differences in proteins comprising an alignment are available from http://cl.sdsc.edu/align_db.html.

Introduction

In this work we demonstrate that the alignment and subsequent comparative analysis of proteins based on 3-D structure similarity frequently give wrong results, that is, results that contradict what is known biologically. This is not surprising, since alignment based on 3-D structure has been shown to be NP-complete (Lathrop, 1994) and hence various heuristics are applied to simplify the problem. The choice of heuristics has been shown to significantly affect the alignment. Godzik (1996) showed that (i) different methods produce quite different alignments, to the point where they differ at all positions, and (ii) using the same methodology, many alignments are close to each other in score but differ in the positions aligned. These two findings clearly indicate that better methods for structure alignment are needed. The approach described here uses a set of heuristics that have been tested empirically; the resulting initial alignment is then optimized through the use of properties known to influence the alignment.

A new algorithm is reported which builds an alignment between two protein structures. The algorithm involves a combinatorial extension (CE) of an alignment path defined by aligned fragment pairs (AFPs), rather than the more conventional techniques which use dynamic programming and Monte Carlo optimization. AFPs, as the name suggests, are pairs of fragments, one from each protein, which confer structure similarity. Combinations of AFPs that represent possible continuous alignment paths are selectively extended or discarded, thereby leading to a single optimal alignment. The algorithm is fast and accurate in finding an optimal structure alignment and hence suitable for database scanning and detailed analyses of large protein families like the protein kinases reported here. The method has been tested and is compared to results from Dali (Holm and Sander, 1993) and VAST (Madej et al., 1995).
Methods

Definition of the alignment path

The alignment between two protein structures A and B of length $n_A$ and $n_B$, respectively, is considered to be the longest continuous path P of AFPs of size m in a similarity matrix S of size $(n_A - m) \cdot (n_B - m)$ representing all possible AFPs that conform to the criteria for structure similarity. One of the following three conditions should be satisfied for every two consecutive AFPs i and i+1 in the alignment path:

$$p^A_{i+1} = p^A_i + m \quad \text{and} \quad p^B_{i+1} = p^B_i + m \qquad (1)$$

or

$$p^A_{i+1} > p^A_i + m \quad \text{and} \quad p^B_{i+1} = p^B_i + m \qquad (2)$$

or

$$p^A_{i+1} = p^A_i + m \quad \text{and} \quad p^B_{i+1} > p^B_i + m \qquad (3)$$

where $p^A_i$ is the AFP's starting residue position in protein A at the ith position in the alignment path; similarly for $p^B_i$. Condition (1) describes two consecutive AFPs aligned without gaps, and conditions (2) and (3) represent two consecutive AFPs aligned with gaps inserted in proteins A and B, respectively.

Combinatorial extension of the alignment path

The alignment path is constructed from AFPs of fixed size m (8 has been shown empirically to be a reasonable choice). That is, one fragment of length m from the first protein and another fragment from the second protein form a pair if they satisfy a similarity criterion described below. The first AFP starting the path can be selected at any position within the similarity matrix S; consecutive AFPs are added such that conditions (1-3) are satisfied. To limit the gap size, conditions (2) and (3) are enhanced by the addition of the following two conditions, respectively:

$$p^A_{i+1} \le p^A_i + m + G \qquad (4)$$

and

$$p^B_{i+1} \le p^B_i + m + G \qquad (5)$$

where G is the maximum allowable size of the gap (30 is a reasonable choice as determined empirically).

Heuristics for similarity evaluation and path extension

There are several alternative alignment strategies that differ in computation time and accuracy. In the course of this study we limit the evaluation of similarity to the following three distance measures:

(i) distance $D_{ij}$ calculated using an "independent" set of inter-residue distances, where each residue participates once and only once in the selected distance set:

$$D_{ij} = \frac{1}{m}\left( \left| d^A_{p^A_i,\,p^A_j} - d^B_{p^B_i,\,p^B_j} \right| + \left| d^A_{p^A_i+m-1,\,p^A_j+m-1} - d^B_{p^B_i+m-1,\,p^B_j+m-1} \right| + \sum_{k=1}^{m-2} \left| d^A_{p^A_i+k,\,p^A_j+m-1-k} - d^B_{p^B_i+k,\,p^B_j+m-1-k} \right| \right) \qquad (6)$$

(ii) distance $D_{ij}$ calculated using a full set of inter-residue distances, where all possible distances except those for neighboring residues are evaluated:

$$D_{ij} = \frac{1}{m^2} \sum_{k=0}^{m-1} \sum_{l=0}^{m-1} \left| d^A_{p^A_i+k,\,p^A_j+l} - d^B_{p^B_i+k,\,p^B_j+l} \right| \qquad (7)$$

(iii) rmsd obtained from structures optimally superimposed as rigid bodies using least-squares minimization (Hendrickson, 1979).

Where:

$D_{ij}$ denotes the distance between two combinations of two fragments from proteins A and B defined by two AFPs at positions i and j in the alignment path, where $i \ne j$. In the case of a single AFP, that is where $i = j$, the distance is given as $D_{ii}$.

$p^A_i$ denotes the AFP's starting residue position in protein A at the ith position in the alignment path; similarly for $p^B_i$.

$d^A_{ij}$ denotes the distance between residues i and j in protein A based on the coordinates of Cα atoms; similarly for $d^B_{ij}$.

m denotes the size of the fragment.

Distance measure (i) is used to evaluate a combination of two AFPs, one already in the alignment path and one to be added, and distance measure (ii) is used to evaluate a single AFP, i.e., how well the two protein fragments forming an AFP match each other. Distance measure (iii) is used as the last step in selecting the few best alignments and for optimizing gaps in the final alignment (see the next section).

The following three major path extension strategies can be used when adding the next AFP to the alignment path:
(i) Consider all possible AFPs which extend the path and satisfy the similarity criteria.
(ii) Consider only the best AFP (described subsequently) which extends the path and satisfies the similarity criteria.
(iii) Use some intermediate strategy.

The first strategy defines an exhaustive combinatorial search for the optimal path, while the second strategy defines a limited search among the best paths. It is shown that the second strategy is sufficient to reveal structure similarities when combined with some path evaluation heuristics. The second strategy is far superior in performance.

Another important aspect is the selection of the starting point for the alignment path. We normally consider all possible starting points in the similarity matrix S which satisfy the similarity criteria. In searching for the alignment of maximum length, all starting points not leading to an alignment longer than the longest alignment found thus far are discarded. This saves computational time but limits matches to one per polypeptide chain, which is desirable for searches within a protein family. Extension of the alignment path is based solely on the distance criteria.
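As an illustration of distance measures (i) and (ii), the following minimal Python sketch computes both from precomputed Cα distance matrices. The function names and the helical test coordinates are hypothetical and not part of the CE implementation; the terms are taken as absolute differences of distances, which is assumed here because a distance measure should not let individual terms cancel.

```python
import math

def ca_distances(coords):
    """Matrix of pairwise distances for a list of (x, y, z) C-alpha coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def dist_independent(dA, dB, pA_i, pB_i, pA_j, pB_j, m=8):
    """Measure (i): each residue of each fragment enters exactly one distance."""
    total = abs(dA[pA_i][pA_j] - dB[pB_i][pB_j])
    total += abs(dA[pA_i + m - 1][pA_j + m - 1] - dB[pB_i + m - 1][pB_j + m - 1])
    for k in range(1, m - 1):
        total += abs(dA[pA_i + k][pA_j + m - 1 - k]
                     - dB[pB_i + k][pB_j + m - 1 - k])
    return total / m

def dist_full(dA, dB, pA_i, pB_i, pA_j, pB_j, m=8):
    """Measure (ii): all m * m inter-residue distances between the two fragments."""
    total = 0.0
    for k in range(m):
        for l in range(m):
            total += abs(dA[pA_i + k][pA_j + l] - dB[pB_i + k][pB_j + l])
    return total / (m * m)

# Hypothetical test data: a helix-like trace and a translated copy of it.
# Internal distances are unchanged by translation, so both measures are ~0.
helix = [(math.cos(0.5 * t), math.sin(0.5 * t), 0.3 * t) for t in range(24)]
shifted = [(x + 1.0, y - 2.0, z + 3.0) for x, y, z in helix]
dA, dB = ca_distances(helix), ca_distances(shifted)
pair_distance = dist_independent(dA, dB, 0, 0, 10, 10)   # two different AFPs
single_distance = dist_full(dA, dB, 3, 3, 3, 3)          # one AFP against itself
```

Working on distance matrices rather than raw coordinates means no superposition is needed at this stage, which is why the rmsd-based measure (iii) can be deferred to the final selection step.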
Neither the gap size nor the statistical significance of the alignment path is considered during path extension. The longest alignment path is then evaluated for statistical significance (represented as a Z-score). This is done by evaluating the probability of finding an alignment path of the same length with the same or smaller number of gaps and distance from a random comparison of structures using a non-redundant set (Hobohm et al., 1992).

The following heuristics have been utilized in deciding whether a path should be extended. Decisions are made at three levels: (i) single AFP; (ii) AFP against the path; (iii) whole path. This results in the following three conditions, respectively:

$$D_{nn} < D_0 \qquad (8)$$

$$\frac{1}{n-1} \sum_{i=0}^{n-1} D_{in} < D_1 \qquad (9)$$

$$\frac{1}{n^2} \sum_{i=0}^{n} \sum_{j=0}^{n} D_{ij} < D_1 \qquad (10)$$

where $D_{ij}$ is the distance between aligned fragments defined by the AFPs i and j in the alignment path, and n is the next AFP to be considered for addition to the alignment path of n-1 AFPs in length. $D_0$ and $D_1$ are similarity thresholds with typical values of $D_0$ = 3 Å and $D_1$ = 4 Å. It was shown empirically that the most accurate alignment occurs when the selection of the best AFP and the extension of the path are done in three steps: (i) all candidate AFPs are selected based on condition 8; (ii) the best AFP is chosen based on condition 9; (iii) the decision to extend or terminate the path is made based on condition 10.

Optimization of the final path

A final optimization has been added which contributes up to 2 Å improvement in the rmsd between two protein structures. It is only applied to alignments with Z-scores above a certain threshold (normally 3.5) and is implemented in three steps: (i) the 20 best paths at the end of the search are evaluated based upon rmsd and the best one is selected; (ii) each gap in this single alignment is evaluated for possible relocation in both directions by up to m/2 positions, where m is the AFP size, and if the rmsd of the superimposed structures (Hendrickson, 1979) indicates improvement, the modified gap boundaries are adopted; (iii) iterative optimization using dynamic programming (Needleman and Wunsch, 1970) is performed on the distance matrix calculated using residues from the two superimposed structures. Terminal gaps have not been penalized. Iteration attempts to increase the alignment length found previously while keeping the rmsd at about the same level. This optimization did not have a significant impact on the computation time for database searches, since it is performed only in the limited number of cases where the Z-score is sufficiently high.

Comparison with other Web accessible 3-D structure comparison methods: Dali and VAST

To test the method we used 10 "difficult" similarities from a representative sample of known structurally related proteins (Fischer et al., 1996; Table 1). Out of the 10 similarities found with CE, VAST (Madej et al., 1995) does not find two, and Dali (Holm and Sander, 1993) does not find two. The number of matched positions for CE differed on average by 14% from Dali and VAST. Over the 16 comparisons available (eight with each method), the number of residues aligned by CE was larger in 9 cases and smaller in 7 cases than that reported by Dali or VAST. Similarly, the rmsd was smaller or the same in 7 cases and larger in 9 cases: the larger the number of matched positions, the larger the rmsd.
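The path-extension rules from the Methods, conditions (1-5) for where the next AFP may start and conditions (8-10) for whether it is accepted, can be sketched as follows. This is a minimal illustration rather than the CE implementation: AFPs are represented as hypothetical (start in A, start in B) tuples, `toy_distance` is an invented stand-in for $D_{ij}$, and the averages in conditions (9) and (10) are taken as plain means over the AFPs involved, which is an assumption about the intended normalisation.

```python
def is_valid_successor(cur, nxt, m=8, G=30):
    """Conditions (1)-(5): no gap, or a gap of at most G in exactly one protein."""
    gap_a = nxt[0] - (cur[0] + m)
    gap_b = nxt[1] - (cur[1] + m)
    if gap_a == 0 and gap_b == 0:        # condition (1)
        return True
    if gap_b == 0 and 0 < gap_a <= G:    # conditions (2) and (4)
        return True
    if gap_a == 0 and 0 < gap_b <= G:    # conditions (3) and (5)
        return True
    return False

def choose_extension(path, candidates, D, D0=3.0, D1=4.0):
    """One extension step; returns the accepted AFP, or None to terminate the path.

    path is a non-empty list of accepted AFPs and D(x, y) is the distance
    between AFPs x and y, with D(x, x) the single-AFP distance.
    """
    # Step (i): keep candidates whose own fragments match, condition (8).
    good = [c for c in candidates if D(c, c) < D0]
    if not good:
        return None

    # Step (ii): pick the candidate closest on average to the path, condition (9).
    def mean_to_path(c):
        return sum(D(p, c) for p in path) / len(path)

    best = min(good, key=mean_to_path)
    if mean_to_path(best) >= D1:
        return None

    # Step (iii): accept only if the whole extended path stays similar, condition (10).
    extended = path + [best]
    whole = sum(D(x, y) for x in extended for y in extended) / len(extended) ** 2
    return best if whole < D1 else None

def toy_distance(x, y):
    """Hypothetical stand-in for D_ij: grows with the change in A/B offset."""
    return 1.0 + abs((x[0] - x[1]) - (y[0] - y[1]))

path = [(0, 0)]
proposals = [(8, 8), (10, 8), (8, 50)]
candidates = [c for c in proposals if is_valid_successor(path[-1], c)]
chosen = choose_extension(path, candidates, toy_distance)
```

In this toy run the proposal (8, 50) is rejected by the gap limit G, and of the two remaining candidates the gap-free one is chosen because it lies closest to the existing path.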

Table 1. Comparison of structure alignments for 10 "difficult" structures from (Fischer et al., 1996) obtained by three methods: Dali (Holm and Sander, 1993), VAST (Madej et al., 1995), and CE. Values are given as NA/rmsd (Å).

#    Chain 1   Chain 2   CE
1    1FXI:A    1UBQ:_    64/3.8
2    1TEN:_    3HHR:B    87/1.9
3    3HLA:B    2RHE:_    114/3.5
4    2AZA:A    1PAZ:_    85/2.9
5    1CEW:I    1MOL:A    69/1.9
6    1CID:_    2RHE:_    94/2.7
7    1CRL:_    1EDE:_    187/3.2
8    2SIM:_    1NSB:A    264/3.0
9    1BGE:B    2GMF:A    94/4.1
10   1TIE:_    4FGF:_    116/2.9

VAST and Dali each report values for eight of the ten pairs, listed here in row order; the two pairs missed by each method are not identified.

VAST: 48/2.1, 78/1.6, 74/2.2, 71/1.9, 85/2.2, 284/3.8, 74/2.5, 82/1.7
Dali: 86/1.9, 63/2.5, 81/2.3, 95/3.3, 211/3.4, 286/3.8, 98/3.5, 108/2.0

Adding Composite Properties for Comparison of Structures

Given the initial success of CE in detecting weak structure homology from structural information alone, we added a composite property description and empirically determined the relative significance of each property for a single protein family. The following properties have been assessed: structure (defined by coordinates of Cα atoms), sequence, secondary structure, solvent exposure, and conservation index (Table 2). In the current implementation, after the initial superposition has been obtained from structure alone using the CE algorithm as described above, similarity based on the composite property description is calculated. This is done on a residue-by-residue basis, and dynamic programming is then used to find the optimum alignment for the whole polypeptide chain. Thus, we represent properties as scores $P_{ij}$ that measure the match between residues i and j from the two proteins (A and B), respectively. How $P_{ij}$ is determined for each property is discussed below.

Table 2. Overview of properties used in the structure comparison.

Code   Name                  Description
STR    Structure             Based on the rmsd calculated for the superposition of Cα atoms after optimal alignment found using the CE algorithm.
SEQ    Sequence              Based on the PET91 amino-acid similarity measure defined by Jones and Thornton (1992).
SS     Secondary structure   Based on the 3-, 4-, 5-helices, beta-structure, bridges, bends, and turns as defined by Kabsch and Sander (1983).
EXP    Solvent exposure      As defined by Lee and Richards (1971).
CONS   Conservation index    Based on the sequences compiled for proteins with known structure and the property table as defined by Taylor (1986).
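The dynamic programming step over residue-by-residue scores can be illustrated with a short sketch. It is not the authors' code: it computes only the best local alignment score (no traceback) using the standard affine-gap recurrences, and it assumes the gap penalties of 10 and 1 that are given later for the composite alignment, with a gap of length L costing 10 + (L - 1). The score matrix is an invented toy example.

```python
def local_affine_score(P, gap_open=10.0, gap_extend=1.0):
    """Best local alignment score for a residue-pair score matrix P[i][j]."""
    n, m = len(P), len(P[0])
    neg = float("-inf")
    H = [[0.0] * (m + 1) for _ in range(n + 1)]   # best score ending at (i, j)
    E = [[neg] * (m + 1) for _ in range(n + 1)]   # ending with a gap along the row
    F = [[neg] * (m + 1) for _ in range(n + 1)]   # ending with a gap along the column
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            E[i][j] = max(H[i][j - 1] - gap_open, E[i][j - 1] - gap_extend)
            F[i][j] = max(H[i - 1][j] - gap_open, F[i - 1][j] - gap_extend)
            H[i][j] = max(0.0, H[i - 1][j - 1] + P[i - 1][j - 1], E[i][j], F[i][j])
            best = max(best, H[i][j])
    return best

# Toy scores: four well-matched residue pairs (+10) separated by one skipped
# column, everything else mismatched (-5). The best path pays one gap opening.
matches = {(0, 0), (1, 1), (2, 3), (3, 4)}
toy_scores = [[10.0 if (i, j) in matches else -5.0 for j in range(5)]
              for i in range(4)]
toy_best = local_affine_score(toy_scores)
```

Because the local formulation resets negative scores to zero, poorly matching termini are simply left out of the alignment rather than penalized.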

Structure (as coordinates of Cα atoms)

Here $P_{ij}$ is defined as:

$$P_{ij} = \begin{cases} C_1 - d_{ij}, & \text{if } C_1 - d_{ij} > C_2 \\ C_2, & \text{otherwise} \end{cases}$$

where $d_{ij}$ denotes the distance between residues i and j from proteins A and B, respectively, calculated from the Cα atomic coordinates after an optimal superposition has been obtained from the CE algorithm. $C_1$ and $C_2$ are constants for converting $d_{ij}$ into a composite score. We used $C_1 = 7$ and $C_2 = -2$ with $d_{ij}$ given in Angstroms.

Sequence

$P_{ij}$ is the value of the PET91 matrix (Jones and Thornton, 1992) for the amino acids occurring at positions i and j.

Secondary structure

$P_{ij}$ is defined as:

$$P_{ij} = \begin{cases} 1, & \text{if } S_i = S_j \\ 0, & \text{otherwise} \end{cases}$$

where $S_i$ is the secondary structure code for an amino acid defined by Kabsch and Sander (1983) as one of {'H', 'B', 'E', 'G', 'I', 'T', 'S'}.

Solvent exposure

$P_{ij}$ is defined as:

$$P_{ij} = E_0 - \left| E_i - E_j \right|$$

where $E_i$ is the solvent exposure defined by Lee and Richards (1971) and $E_0$ is a constant. A value of $E_0 = 5$ is used in these calculations.

Conservation index

The conservation index is calculated as follows. First, similar sequences have been assembled for every structure under consideration by searching the SWISS-PROT (Bairoch and Apweiler, 1997) database using BLAST (Altschul et al., 1990) at the NCBI server (http://www.ncbi.nlm.nih.gov). At most the 50 most significant hits were retained and further analyzed. Local dynamic programming was applied between the structure sequence and each sequence from the BLAST search. A sequence was retained if at least 70% of the residues for that structure corresponded to the sequence and at least 40% of the aligned positions had identical amino acids. It was also a requirement that the sum of amino acid differences between the assembled sequences and the sequence of the representative structure was at least 5-fold higher than the length of the structure sequence, to guarantee a sufficient level of evolutionary information. For each position, the number of amino acid types (from 1 to 20) in the narrowest property class, as defined by Taylor (1986), that encompasses the amino acids found in the aligned sequences at that position has been chosen as the conservation index $I_i$ for that position. Hence, $P_{ij}$ is defined as:

$$P_{ij} = 20 - \left| I_i - I_j \right|$$
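The per-property scores defined above can be written as small functions, sketched here in Python with hypothetical function names. The constants are those stated in the text. The exposure and conservation scores are read as penalising the absolute difference between the two residues, which is an assumption about the intended formula. The sequence score is omitted because it is a lookup in the PET91 matrix, whose values are not reproduced here.

```python
def score_structure(d_ij, C1=7.0, C2=-2.0):
    """STR: C1 - d_ij (d_ij in Angstroms), with C2 as the lower bound."""
    s = C1 - d_ij
    return s if s > C2 else C2

def score_secondary(S_i, S_j):
    """SS: 1 if the two secondary structure codes agree, otherwise 0."""
    return 1 if S_i == S_j else 0

def score_exposure(E_i, E_j, E0=5.0):
    """EXP: E0 minus the absolute exposure difference (assumed form)."""
    return E0 - abs(E_i - E_j)

def score_conservation(I_i, I_j):
    """CONS: 20 minus the absolute difference of conservation indices (assumed form)."""
    return 20 - abs(I_i - I_j)

# Two superimposed residues 2 Angstroms apart, both helical, with similar
# exposure and identical conservation index (illustrative values only).
example = {
    "STR": score_structure(2.0),
    "SS": score_secondary("H", "H"),
    "EXP": score_exposure(3.0, 1.0),
    "CONS": score_conservation(5, 5),
}
```

The lower bound $C_2$ keeps a single badly superimposed residue pair from dominating the structure score, since any pair more than 9 Å apart contributes the same value of -2.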

Definition of a composite property

The composite property measuring structural similarity at the residue level is defined as:

$$\tilde{P}_{ij} = \sum_k w_k \cdot P^k_{ij}$$

where $P^k_{ij}$ is the structural similarity for residues at positions i and j from proteins A and B calculated based on the kth property, and $w_k$ is a weight chosen empirically. To find the optimal alignment based on this composite property description, local dynamic programming was used with a gap initialization penalty of 10 and a gap extension penalty of 1.

Comparison of alignments

The comparison between two alignments is defined as:

$$a^D = \sum_i a^D_i \qquad (11)$$

where

$$a^D_i = \begin{cases} 1, & \text{if } a^1_i \ne -1 \text{ and } a^1_i \ne a^2_i \\ 0, & \text{otherwise} \end{cases}$$

and $a^D$ is the number of differences between the 1st and 2nd alignments; $a^1_i$ is the residue position in the second sequence which, in the 1st alignment, matches the residue located at the ith position in the first sequence, and likewise for $a^2_i$ in the 2nd alignment. $a^1_i$ and $a^2_i$ are set to -1 if position i is not aligned to any other position.
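The composite score and the alignment difference of equation (11) are simple enough to sketch directly. In the Python fragment below the function names, the example weights and the two short alignments are hypothetical (the empirical weights are not given in the text); alignments are represented as lists indexed by position in the first sequence, holding the matched position in the second sequence or -1.

```python
def composite_score(scores, weights):
    """Weighted sum over properties k of w_k * P_ij^k for one residue pair."""
    return sum(weights[k] * scores[k] for k in weights)

def alignment_difference(a1, a2):
    """Equation (11): positions aligned in a1 whose partner differs in a2."""
    return sum(1 for x, y in zip(a1, a2) if x != -1 and x != y)

def relative_difference(n_diff, alignment_length):
    """Difference expressed as a percentage of the alignment length."""
    return 100.0 * n_diff / alignment_length

# Hypothetical weights for an STR+SEQ+CONS combination.
pair_score = composite_score({"STR": 5.0, "SEQ": 2.0, "CONS": 18},
                             {"STR": 1.0, "SEQ": 0.5, "CONS": 0.1})

reference = [0, 1, 2, -1, 5]    # position 3 is unaligned in the reference
computed = [0, 1, 3, 4, -1]
n_diff = alignment_difference(reference, computed)
```

Note that the measure is asymmetric: positions left unaligned in the first alignment are never counted, so the choice of which alignment is taken as the reference matters.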

Results

To measure the success of the structure alignments we compared them against the alignment of three protein kinase catalytic domains made manually by experts (Taylor and Radzio-Andzelm, 1994). The structures were cAMP-dependent protein kinase (PKA; PDB chain code 1CDK:A), mitogen-activated protein kinase (MAPK; PDB chain code 1GOL:_), and cyclin-dependent kinase (CDK2; PDB chain code 1FIN:A). Sequence similarity between these kinases is low: 1CDK:A vs 1GOL:_, 27%; 1CDK:A vs 1FIN:A, 27%; and 1GOL:_ vs 1FIN:A, 39%. A structure alignment based on sequence alone is poor. As a starting point for comparison we considered a pure structure-based alignment produced by Dali (Holm and Sander, 1993) and taken from FSSP (Holm and Sander, 1994). The results of a comparison between the alignment produced by experts and that of the Dali algorithm for PKA vs MAPK and PKA vs CDK2 are given in Table 3. A detailed comparison of the PKA vs MAPK alignment made by experts with that made by Dali is shown in Fig 2. Single properties and various composite properties based on the CE algorithm have been analyzed (Table 3); the detailed sequence alignment based on composite properties is given in Fig 3 and an example of the comparative structure alignment in Fig 4.

(a)

1CDK:A  39  LD QFERIKTLGT GSFGRVMLVK HKETGNHFAM KILDKQKVVK LKQIEHTLNE KRILQAVN
1GOL:_      GP RYTNLSYIGE GAYGMVCSAY DNLNKVRVAI RKISPFEHQ- -TYCQRTLRE IKILLRFR

1CDK:A  99  FP FLVKLEYSFK DN-----SNLYMVME YVPGGEMFSH LRRIGR-FSEP HARFYAAQIV LT
1GOL:_      HE NIIGINDIIR APTIEQMKDVYIVQD LME-TDLYKL LKTQ--HLSND HICYFLYQIL RG

1CDK:A 153  FEYLHSLD LIYRDLKPEN LLIDQQGYIQ VTDFGFAKRV KGRT------WTLCGT PEYLAPE
1GOL:_      LKYIHSAN VLHRDLKPSN LLLNTTCDLK ICDFGLARVA DPDHDHTGFLTEYVAT RWYRAPE

1CDK:A 208  IIL -SKGYNKAVDW WALGVLIYEM AAGYPPFFAD QPIQIYEKIV SGKVR------------
1GOL:_      IML NSKGYTKSIDI WSVGCILAEM LSNRPIFPGK HYLDQLNHIL GILGSPSQEDLNCIINL

1CDK:A 256  -------------------FPSHF SSDLKDLLRN LLQVDLTKRF GNLKDGVNDI KNHKWF
1GOL:_      KARNYLLSLPHKNKVPWNRLFPNA DSKALDLLDK MLTFNPHKRI E-----VEQA LAHPYL

(b)

1CDK:A  39  LD QFERIKTLGT GSFGRVMLVK HKETGNHFAM KILD-kQKVVk lkqIEHTLNe KRIL-QA
1GOL:_      GP RYTNLSYIGE GAYGMVCSAY DNLNKVRVAI RKISpFEHQT- --yCQRTLR- EIKIlLR

1CDK:A  97  VNFP FLVKLEYSFK D-----NSNLYMVME YVPgGEMFSH LRRIGRFSEP HARFYAAQIV L
1GOL:_      FRHE NIIGINDIIR AptieqMKDVYIVQD LME-TDLYKL LKTQ-HLSND HICYFLYQIL R

1CDK:A 152  TFEYLHSLDL IYRDLKPENL LIDQQGYIQV TDFGFA------krvk grtwtlcgTPEYLAPE
1GOL:_      GLKYIHSANV LHRDLKPSNL LLNTTCDLKI CDFGLArvadpdhdht gflteyvATRWYRAPE

1CDK:A 197  IILS K-GYNKAVDWW ALGVLIYEMA AGYPPFFADQ PIQIYEKIVS GK--------------
1GOL:_      IMLN SKGYTKSIDIW SVGCILAEML SNRPIFPGKH YLDQLNHILG ILGSPSQEDLNCIINL

1CDK:A 253  ------------------vRFPShFS SDLKDLLRNL LQVDLTKRFG nlkdgVNDIK
1GOL:_      KARnyLLSLPHKNKVPWNRLFPN-AD SKALDLLDKM LTFNPHKRIE -----VEQAL

1CDK:A 291  NHKWFATTdw iaiyqrkVEA PFIPKFkgpg dtsnfddyee eeirvsinek cgkefsef
1GOL:_      AHPYLEQYyd psdepiaeap fkfdmelddl pkeklkelif eetarfqpgy rs------

Fig 2. Comparison of: (a) the manual structural alignment (Taylor and Radzio-Andzelm, 1994) and (b) the alignment calculated by the Dali method (Holm and Sander, 1994). Amino acids shown in lower case are not considered aligned. Differences are shown in shaded boxes.

(a)

1CDK:A  39  LD QFERIKTLGT GSFGRVMLVK HKETGNHFAM KILDKQKVVK LKQIEHTLNE KRILQAVN
1GOL:_      GP RYTNLSYIGE GAYGMVCSAY DNLNKVRVAI RKISPFEHQ- -TYCQRTLRE IKILLRFR

1CDK:A  99  FP FLVKLEYSFK DN-----SNLYMVME YVPGGEMFSH LRRIGR-FSEP HARFYAAQIV LT
1GOL:_      HE NIIGINDIIR APTIEQMKDVYIVQD LME-TDLYKL LKTQ--HLSND HICYFLYQIL RG

1CDK:A 153  FEYLHSLD LIYRDLKPEN LLIDQQGYIQ VTDFGFAKRV KGRT------WTLCGT PEYLAPE
1GOL:_      LKYIHSAN VLHRDLKPSN LLLNTTCDLK ICDFGLARVA DPDHDHTGFLTEYVAT RWYRAPE

1CDK:A 208  IIL -SKGYNKAVDW WALGVLIYEM AAGYPPFFAD QPIQIYEKIV SGKVR------------
1GOL:_      IML NSKGYTKSIDI WSVGCILAEM LSNRPIFPGK HYLDQLNHIL GILGSPSQEDLNCIINL

1CDK:A 256  -------------------FPSHF SSDLKDLLRN LLQVDLTKRF GNLKDGVNDI KNHKWF
1GOL:_      KARNYLLSLPHKNKVPWNRLFPNA DSKALDLLDK MLTFNPHKRI E-----VEQA LAHPYL

(b)

1CDK:A  38  HLD QFERIKTLGT GSFGRVMLVK HKETGNHFAM KILDKQKVVK LKQIEHTLNE KRILQAV
1GOL:_      VGP RYTNLSYIGE GAYGMVCSAY DNLNKVRVAI RKISPFEHQ- -TYCQRTLRE IKILLRF

1CDK:A  98  NFP FLVKLEYSFK D-----NSNLYMVME YVPGGEMFSH LRRIGRFSEP HARFYAAQIV LT
1GOL:_      RHE NIIGINDIIR APTIEQMKDVYIVQD LME-TDLYKL LKTQ-HLSND HICYFLYQIL RG

1CDK:A 153  FEYLHSLD LIYRDLKPEN LLIDQQGYIQ VTDFGFAKRV KGRTW------TLCGT PEYLAPE
1GOL:_      LKYIHSAN VLHRDLKPSN LLLNTTCDLK ICDFGLARVA DPDHDHTGFLTEYVAT RWYRAPE

1CDK:A 208  IIL -SKGYNKAVDW WALGVLIYEM AAGYPPFFAD QPIQIYEKIV SGKV-------------
1GOL:_      IML NSKGYTKSIDI WSVGCILAEM LSNRPIFPGK HYLDQLNHIL GILGSPSQEDLNCIINL

1CDK:A 255  ------------------RFPSHF SSDLKDLLRN LLQVDLTKRF GNLKDGVNDI KNHKWFA
1GOL:_      KARNYLLSLPHKNKVPWNRLFPNA DSKALDLLDK MLTFNPHKRI E-----VEQA LAHPYLE

1CDK:A 298  TTD WIAIYQRKVE APFIPKFKGP GDTSNFDDYE EEEIRVSINE KCGKEFSEF
1GOL:_      QYY DPS-DEPIAE APFKFDMELD DLP------- ---------- -KEKLKELI

Fig 3. Comparison of: (a) the manual structural alignment (Taylor and Radzio-Andzelm, 1994) and (b) the alignment calculated by the method proposed in this work using properties STR+SEQ+CONS. Differences are shown in shaded boxes.

Table 3. Comparison of the manual structural alignment (Taylor and Radzio-Andzelm, 1994) with alignments calculated by the Dali method (Holm and Sander, 1994) and by the method proposed in this work using various combinations of properties. The absolute difference between alignments is given according to equation (11), followed in parentheses by the difference relative to the length of the manual alignment (in %).

Method          PKA (1CDK:A) vs MAPK (1GOL:_)   PKA (1CDK:A) vs CDK2 (1FIN:A)
                length of alignment = 248       length of alignment = 251
Dali            34 (13.7%)                      30 (12.0%)
STR             8 (3.2%)                        8 (3.2%)
STR+SEQ+CONS    3 (1.2%)                        5 (2.0%)
SEQ             98 (39.5%)                      76 (30.3%)
SS              76 (30.6%)                      77 (30.3%)
CONS            84 (33.9%)                      107 (42.6%)
EXP             45 (18.1%)                      62 (24.7%)
STR+SEQ         4 (1.6%)                        6 (2.4%)

CE using structure alone (STR) reduces the difference from the manual alignment from 13.7% and 12.0% (Dali) to 3.2% in both comparisons, an improvement of roughly 9 to 10 percentage points. This is achieved by: (i) a suitable choice of heuristics; (ii) a suitable choice of gap penalties during dynamic programming. There is significant further improvement when incorporating three properties in the alignment (STR+SEQ+CONS). Adding SEQ information is clearly the most significant. The question that immediately arises from this work is what impact the introduction of these composite properties has on the structure alignment. Consider one example (Fig 4).

Fig 4. Comparison of alignments in 3D using the Rasmol program (Sayle and Milner-White, 1995): (a) manual structural alignment (Taylor and Radzio-Andzelm, 1994); (b) alignment calculated by the Dali method (Holm and Sander, 1994); (c) alignment calculated by the method proposed in this work using the STR+SEQ+CONS combination of properties. In all panels 1CDK:A has the darkest color, the area in 1GOL:_ corresponding to the gap in the alignment has the lightest color, and the area of 1GOL:_ aligned to 1CDK:A is shown in an intermediate shade.

A comparison of the three structural alignments (manual, Dali, properties STR+SEQ+CONS) in the area 247-259 (on 1CDK:A) shows distinct differences (Fig 4). The best rmsd when superimposing just the local area shown in Fig 4 occurs for Dali (b), 2.9 Å, followed by STR+SEQ+CONS (c), 3.2 Å, and manual (a), 4.9 Å. The manual alignment (a) shows a better match in the loop area (257-259), while (b) and (c) are better in the helical area (247-254). The better rmsd for Dali comes at the price of an additional gap following the aligned fragments (Fig 2), which was not found justifiable in the case of the manual alignment. The manual alignment favors particular local features at the cost of a better overall match, whereas Dali tries to match both areas simultaneously and provides a better rmsd for the whole fragment but is less clear in defining each individual match.

Discussion

Results suggest the importance of including composite features that de-emphasize the overall rmsd and emphasize the importance of biological features inherent in those properties. That is, the application of CE using a composite property description produces a high degree of convergence (differences of only 1-2%) to the manual structure alignment produced by experts. Even then, the differences between alignments lie on the boundaries of gaps (Fig 3). Given this success, the next logical step was to calculate an all-by-all alignment for the 38 available protein kinase catalytic domain structures and to develop a tool to analyze those alignments. That tool is a Java applet, Compare3D, with the capabilities summarized in Table 4. The database of aligned structures (server) and Compare3D (client) are available via the Web at http://cl.sdsc.edu/align_db.html. We are currently applying CE to other protein families, with the intent of eventually producing an all-by-all comparison for the complete PDB.

Table 4. Basic functionality of the Compare3D applet. Functional classes: Alignment, 3D view, Feature table.

Function                                                           Description
Edit color                                                         Change residue color.
Edit gaps                                                          Insert/delete gaps in the alignment.
Edit ends                                                          Include/exclude parts of the sequence at alignment ends.
Browse residue environment                                         Show details of interaction for a given residue.
Select/deselect residues                                           Define set of residues for display/editing.
Superimpose selected residues                                      Calculate new superposition based on residue subset and display sequence/structure similarity.
Rotate/translate/zoom; Switch stereo/mono; Switch CA/all atoms     Basic structure rendering.
Show features                                                      Select/color residues according to some structural or functional protein property, e.g., secondary structure, structure difference, sequence evolutionary conservation, amino acid physical property.

References

• Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403-410.
• Bairoch, A., Apweiler, R. (1997) The SWISS-PROT protein sequence data bank and its supplement TrEMBL. Nucleic Acids Res., 25, 31-36.
• Fischer, D., Elofsson, A., Rice, D.W., Eisenberg, D. (1996) Assessing the performance of fold recognition methods by means of a comprehensive benchmark. Proc. 1st Pacific Symposium on Biocomputing, 300-318.
• Gibrat, J.F., Madej, T., Bryant, S.H. (1996) Surprising similarities in structure comparison. Current Opin. in Struct. Biol., 6, 377-385.
• Godzik, A. (1996) The structural alignment between two proteins: Is there a unique answer? Protein Science, 5, 1325-1338.
• Hendrickson, W.A. (1979) Transformations to optimize the superposition of similar structures. Acta Cryst., A35, 158-163.
• Hobohm, U., Scharf, M., Schneider, R., Sander, C. (1992) Selection of representative protein data sets. Protein Science, 1, 409-417.
• Holm, L., Sander, C. (1993) Protein structure comparison by alignment of distance matrices. J. Mol. Biol., 233, 123-138.
• Holm, L., Sander, C. (1994) The FSSP database of structurally aligned protein fold families. Nucl. Acids Res., 22, 3600-3609.
• Jones, D.T., Thornton, J.M. (1992) The rapid generation of mutation data matrices from protein sequences. CABIOS, 8, 275-282.
• Kabsch, W., Sander, C. (1983) Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577-2637.
• Lathrop, R.H. (1994) The protein threading problem with sequence amino acid interaction preferences is NP-complete. Prot. Engng, 7, 1059-1068.
• Lee, B., Richards, F.M. (1971) The interpretation of protein structures: estimation of static accessibility. J. Mol. Biol., 55, 379-400.
• Madej, T., Gibrat, J.F., Bryant, S.H. (1995) Threading a database of protein cores. Proteins, 23, 356-369.
• Needleman, S.B., Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443-453.
• Taylor, S.S., Radzio-Andzelm, E. (1994) Three protein kinase structures define a common motif. Structure, 2, 345-355.
• Taylor, W.R. (1986) The classification of amino acid conservation. J. Theor. Biol., 119, 205-218.